To ensure smooth communication on your network, Address Resolution Protocol (ARP) cache settings must be adjusted to handle a larger number of nodes. This guide will help you calculate and set the appropriate values for your system.
The following information is required and will need to be gathered before performing the steps outlined in this guide.
gc_thresh1
: Minimum number of entries the cache attempts to maintain.
gc_thresh2
: Threshold where older entries start getting cleared before the cache overfills.
gc_thresh3
: Maximum number of cache entries allowed. If exceeded, new entries may be dropped.
gc_stale_time
: Time (in seconds) before an ARP entry is marked stale.
base_reachable_time_ms
: Time (in milliseconds) an entry remains valid before being re-verified.
Storage nodes do not have HSN connections. To keep ARP settings management simple, the values calculated for the worker nodes should also be applied to the storage nodes.
To calculate the base gc_thresh1 value:
gc_thresh1:
$gc\_thresh1 = (number\ of\ nodes) \times (NICs\ per\ node)^2$
Where:
number of nodes = Total number of communicating nodes (master, worker, storage, compute, and application nodes).
NICs per node = Number of NICs in each node (HSN NICs + NMN NICs). Nodes will have 2, 3, or 5 total NICs.
gc_thresh2:
$gc\_thresh2 = 1.5 \times gc\_thresh1$
gc_thresh3:
$gc\_thresh3 = 2 \times gc\_thresh2$
Scenario:
1000 communicating nodes.
4 HSN NICs per node.
1 NIC on the NMN.
$NICs\ per\ node = 4 + 1 = 5$
gc_thresh1:
$gc\_thresh1 = 1000 \times 5^2 = 25000$
gc_thresh2:
$gc\_thresh2 = 1.5 \times 25000 = 37500$
gc_thresh3:
$gc\_thresh3 = 2 \times 37500 = 75000$
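The arithmetic above can also be scripted. The following is a minimal sketch and not part of the documented procedure: the function name `arp_thresholds` and its two positional arguments (node count, NICs per node) are hypothetical. Because shell arithmetic is integer-only, the 1.5 multiplier is written as `* 3 / 2`.

```shell
# Hypothetical helper: print the three thresholds for a given node count
# and NIC count per node, using the formulas from this section.
arp_thresholds() {
    nodes=$1
    nics=$2
    # gc_thresh1 = number of nodes x (NICs per node)^2
    thresh1=$(( nodes * nics * nics ))
    # gc_thresh2 = 1.5 x gc_thresh1 (integer arithmetic rounds down)
    thresh2=$(( thresh1 * 3 / 2 ))
    # gc_thresh3 = 2 x gc_thresh2
    thresh3=$(( thresh2 * 2 ))
    echo "net.ipv4.neigh.default.gc_thresh1 = ${thresh1}"
    echo "net.ipv4.neigh.default.gc_thresh2 = ${thresh2}"
    echo "net.ipv4.neigh.default.gc_thresh3 = ${thresh3}"
}

# Scenario from this section: 1000 nodes, 4 HSN NICs + 1 NMN NIC per node.
arp_thresholds 1000 5
```

For the scenario values this prints 25000, 37500, and 75000, matching the worked calculation.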
Example settings:
net.ipv4.neigh.default.gc_thresh1 = 25000
net.ipv4.neigh.default.gc_thresh2 = 37500
net.ipv4.neigh.default.gc_thresh3 = 75000
net.ipv4.neigh.default.gc_stale_time = 240
net.ipv4.neigh.default.base_reachable_time_ms = 1500000
The following steps describe how to use the Configuration Framework Service (CFS) to configure ARP cache settings for NCNs.
Use the Cloning a VCS repository procedure to clone the csm-config-management repository.
List the available CSM versions (CSM_RELEASE).
kubectl -n services get cm cray-product-catalog -o jsonpath='{.data.csm}'
Determine the import branch to use.
NOTE
Update CSM_RELEASE for the version being used.
export CSM_RELEASE=1.6.0
export IMPORT_BRANCH=$(kubectl -n services get cm cray-product-catalog -o jsonpath='{.data.csm}' | yq4 ".[\"${CSM_RELEASE}\"].configuration.import_branch") && echo "${IMPORT_BRANCH}"
Create an integration branch from the import branch for the required configuration.
cd csm-config-management
git checkout -b integration-${IMPORT_BRANCH##*/} origin/${IMPORT_BRANCH}
Example output:
branch 'integration-1.26.0' set up to track 'origin/cray/csm/1.26.0'.
Switched to a new branch 'integration-1.26.0'
NOTE
In this example, integration-1.26.0 is the name of your new branch.
See VCS Branching Strategy for more information about Git branches.
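The branch name in the checkout command is built with the shell parameter expansion `${IMPORT_BRANCH##*/}`, which removes everything up to and including the last `/`. A minimal illustration follows; the `IMPORT_BRANCH` value is sample data inferred from the example output, and `INTEGRATION_BRANCH` is a hypothetical variable name used only here.

```shell
# Sample value inferred from the example output; on a real system this is
# set by the kubectl/yq4 command in the earlier step.
IMPORT_BRANCH="cray/csm/1.26.0"

# '##*/' strips the longest prefix matching '*/', leaving the last path component.
INTEGRATION_BRANCH="integration-${IMPORT_BRANCH##*/}"
echo "${INTEGRATION_BRANCH}"   # prints: integration-1.26.0
```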
csm.ncn.sysctl Ansible role
Update the appropriate variables in roles/csm.ncn.sysctl/vars/main.yml using the values calculated in Formula for Tuning ARP cache settings.
NOTE
The following values are only examples and MUST be updated using the calculated values.
sysctl_config:
  - name: net.ipv4.neigh.default.gc_thresh1
    value: 25000
  - name: net.ipv4.neigh.default.gc_thresh2
    value: 37500
  - name: net.ipv4.neigh.default.gc_thresh3
    value: 75000
  - name: net.ipv4.neigh.default.gc_stale_time
    value: 240
  - name: net.ipv4.neigh.default.base_reachable_time_ms
    value: 1500000
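Before committing, the edited values can be checked against the formulas. The sketch below is a convenience under stated assumptions, not part of the csm.ncn.sysctl role: the function name `check_sysctl_thresholds` is hypothetical, and the awk parsing understands only the simple name/value layout used in this file, not general YAML.

```shell
# Hypothetical helper: read sysctl_config entries on stdin and verify that
# gc_thresh2 = 1.5 x gc_thresh1 (rounded down) and gc_thresh3 = 2 x gc_thresh2.
check_sysctl_thresholds() {
    awk '
        # "- name: <key>" remembers the key; "value: <n>" stores its value.
        $1 == "-" && $2 == "name:" { key = $3 }
        $1 == "value:"             { val[key] = $2 + 0 }
        END {
            t1 = val["net.ipv4.neigh.default.gc_thresh1"]
            t2 = val["net.ipv4.neigh.default.gc_thresh2"]
            t3 = val["net.ipv4.neigh.default.gc_thresh3"]
            if (t1 > 0 && t2 == int(1.5 * t1) && t3 == 2 * t2) exit 0
            print "thresholds do not match the formulas: " t1, t2, t3
            exit 1
        }'
}

# Sample data using the example values; on a real system the input would be
# roles/csm.ncn.sysctl/vars/main.yml.
check_sysctl_thresholds << 'EOF' && echo "thresholds are consistent"
sysctl_config:
  - name: net.ipv4.neigh.default.gc_thresh1
    value: 25000
  - name: net.ipv4.neigh.default.gc_thresh2
    value: 37500
  - name: net.ipv4.neigh.default.gc_thresh3
    value: 75000
EOF
```

The check only covers the three thresholds; gc_stale_time and base_reachable_time_ms are not derived from a formula in this guide and are ignored.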
Commit the change and push it back up to the VCS.
git add roles/csm.ncn.sysctl/vars/main.yml
git commit -m 'Set ARP cache values for all NCNs'
git push --set-upstream origin integration-1.26.0
csm.ncn.sysctl role
Create a CFS configuration using the committed changes.
Obtain the commit hash and create a configuration template file.
COMMIT=$(git rev-parse --verify HEAD)
cat << EOF > arp-settings.json
{
  "layers": [
    {
      "cloneUrl": "https://api-gw-service-nmn.local/vcs/cray/csm-config-management.git",
      "commit": "${COMMIT}",
      "name": "arp-cache-settings",
      "playbook": "ncn_sysctl.yml"
    }
  ]
}
EOF
Create a CFS configuration from the template file.
cray cfs configurations update arp-cache-settings --file ./arp-settings.json
Example output:
lastUpdated = "2025-01-06T22:39:46Z"
name = "arp-cache-settings"
[[layers]]
cloneUrl = "https://api-gw-service-nmn.local/vcs/cray/csm-config-management.git"
commit = "36811473a8b98e88ef8afee1df021d55eac50114"
name = "arp-cache-settings"
playbook = "ncn_sysctl.yml"
Create a CFS session to apply the configuration to the nodes.
SESSION=arp-cache-settings-$(date +%Y%m%d%H%M%S)
cray cfs sessions create --name "${SESSION}" --configuration-name arp-cache-settings
Example output:
name = "arp-cache-settings-20250106224045"
[ansible]
config = "cfs-default-ansible-cfg"
verbosity = 0
[configuration]
limit = ""
name = "arp-cache-settings"
[status]
artifacts = []
[tags]
[target]
definition = "dynamic"
groups = []
image_map = []
[status.session]
startTime = "2025-01-06T22:41:04"
status = "pending"
succeeded = "none"
Check that the CFS session completed successfully.
cray cfs sessions describe "${SESSION}"
Example output:
name = "arp-cache-settings-20250106224045"
[ansible]
config = "cfs-default-ansible-cfg"
verbosity = 0
[configuration]
limit = ""
name = "arp-cache-settings"
[status]
artifacts = []
[tags]
[target]
definition = "dynamic"
groups = []
image_map = []
[status.session]
completionTime = "2025-01-06T22:41:14"
job = "cfs-e7962be9-1cfa-4604-bb07-d5ae99be5456"
startTime = "2025-01-06T22:41:04"
status = "complete"
succeeded = "true"
The session status should be “complete” and succeeded should be “true”. See the troubleshooting section if that is not the case.
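The two fields can also be checked in a script. The following sketch assumes the describe output has the layout shown in the example output (one `key = "value"` pair per line); the function name `cfs_session_ok` is hypothetical, and the sample input is an excerpt of that example rather than live output.

```shell
# Hypothetical helper: read session describe output on stdin and succeed
# only if status is "complete" and succeeded is "true".
cfs_session_ok() {
    awk -F ' = ' '
        { sub(/^[ \t]+/, "") }            # tolerate indented lines
        $1 == "status"    { status = $2 }
        $1 == "succeeded" { succeeded = $2 }
        END {
            if (status == "\"complete\"" && succeeded == "\"true\"") exit 0
            exit 1
        }'
}

# Sample input excerpted from the example output above.
if cfs_session_ok << 'EOF'
[status.session]
startTime = "2025-01-06T22:41:04"
status = "complete"
succeeded = "true"
EOF
then
    echo "session succeeded"
else
    echo "session not complete or failed"
fi
```

On a live system, the function could be fed by piping the output of the `cray cfs sessions describe` command from this step into it.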
(ncn-m001) Verify ARP settings.
NCNS=$(grep -oP 'ncn-\w\d+' /etc/hosts | sort -u | tr '\r\n\t' ',')
pdsh -w ${NCNS} "sysctl -a | grep -E 'net.ipv4.neigh.default.gc_thresh[1-3]|net.ipv4.neigh.default.gc_stale_time|net.ipv4.neigh.default.base_reachable_time_ms'"
Example output:
ncn-s001: net.ipv4.neigh.default.base_reachable_time_ms = 30000
ncn-s001: net.ipv4.neigh.default.gc_stale_time = 240
ncn-s001: net.ipv4.neigh.default.gc_thresh1 = 2048
ncn-s001: net.ipv4.neigh.default.gc_thresh2 = 4096
ncn-s001: net.ipv4.neigh.default.gc_thresh3 = 8192
ncn-w001: net.ipv4.neigh.default.base_reachable_time_ms = 30000
ncn-w001: net.ipv4.neigh.default.gc_stale_time = 240
ncn-w001: net.ipv4.neigh.default.gc_thresh1 = 2048
ncn-w001: net.ipv4.neigh.default.gc_thresh2 = 4096
ncn-w001: net.ipv4.neigh.default.gc_thresh3 = 8192
ncn-m001: net.ipv4.neigh.default.base_reachable_time_ms = 30000
ncn-m001: net.ipv4.neigh.default.gc_stale_time = 240
ncn-m001: net.ipv4.neigh.default.gc_thresh1 = 2048
ncn-m001: net.ipv4.neigh.default.gc_thresh2 = 4096
ncn-m001: net.ipv4.neigh.default.gc_thresh3 = 8192
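Comparing many lines of output by eye is error-prone, so the comparison against the calculated values can be scripted. This is a sketch under assumptions: the function name `report_arp_mismatches` is hypothetical, it expects the `host: key = value` line format shown above, and it checks only the three thresholds. The sample input is taken from the example output, and the expected values are the example values calculated earlier in this guide.

```shell
# Hypothetical helper: read "host: key = value" lines on stdin and print
# each threshold whose value differs from the expected one.
# Arguments: expected gc_thresh1, gc_thresh2, gc_thresh3.
report_arp_mismatches() {
    awk -v t1="$1" -v t2="$2" -v t3="$3" '
        BEGIN {
            want["net.ipv4.neigh.default.gc_thresh1"] = t1
            want["net.ipv4.neigh.default.gc_thresh2"] = t2
            want["net.ipv4.neigh.default.gc_thresh3"] = t3
        }
        # Fields: $1 = "host:", $2 = key, $3 = "=", $4 = value
        ($2 in want) && $4 != want[$2] {
            print $1, $2, "is", $4 ", expected", want[$2]
            bad = 1
        }
        END { exit bad + 0 }
    '
}

# Sample lines from the example output, compared with the example values.
report_arp_mismatches 25000 37500 75000 << 'EOF' || echo "mismatches found"
ncn-s001: net.ipv4.neigh.default.gc_stale_time = 240
ncn-s001: net.ipv4.neigh.default.gc_thresh1 = 2048
ncn-s001: net.ipv4.neigh.default.gc_thresh2 = 4096
ncn-s001: net.ipv4.neigh.default.gc_thresh3 = 8192
EOF
```

With this sample input, all three thresholds are reported as mismatches, because the example output lists 2048, 4096, and 8192 rather than the calculated example values.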
Refer to View Configuration Session Logs to troubleshoot why the CFS session failed to complete successfully.
This procedure performs a one-time configuration of the target nodes. The ARP cache settings will persist through a reboot of the node, but a rebuild of the node will wipe them.
In order to persist this configuration through a rebuild of the node, the CFS layer should be added to the CFS configuration used for the NCNs.
It may also be desirable to add this layer to the site_vars.yaml file, as well as to any bootprep file used for sat bootprep, to ensure that the update-cfs-configuration stage of IUF does not remove this layer.
See CFS Configurations and the IUF overview and configuration documentation for more information.