This procedure can be used to create a bonded HSN interface on an NCN worker node. The csm.ncn.hsn_bonding Ansible role automates the manual steps outlined in the “How to create a bonded IP host interface with HPE Slingshot” document (see References).
The “How to create a bonded IP host interface with HPE Slingshot” document is available from the HPE Support Portal. The other documentation is bundled with the HPE Slingshot software download.
The following steps should have occurred before configuring a bonded interface on an NCN worker node.
Configuring a LAG using the HPE Slingshot Fabric Manager is beyond the scope of this document; see the “Link Aggregation” section of the HPE Slingshot Installation Guide for CSM for more information. The following example configuration is provided for illustration only.
{
  "lagPropertyMap": {
    "2": {
      "portLinks": [
        "/fabric/ports/x3000c0r15j4p0",
        "/fabric/ports/x3000c0r15j4p1"
      ],
      "dmacs": [
        "b2:00:00:00:00:01"
      ],
      "lacpMode": "ACTIVE",
      "lagFeatureMode": "DYNAMIC",
      "lacpTimeout": "SHORT"
    }
  }
}
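Because the DMAC in the LAG configuration is reused later as the bond MAC address, it can help to pull it out of the fabric configuration rather than retype it. The following is a minimal sketch, not part of the HPE tooling: the function names `lag_dmacs` and `dmac_in_lag` are hypothetical, and the match is a plain text search for MAC-shaped strings rather than a real JSON parse.

```bash
# Print every MAC address found in a LAG configuration read from stdin,
# one per line, lower-cased. This is a text match, not a JSON parse.
lag_dmacs() {
    grep -oE '([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}' | tr 'A-F' 'a-f'
}

# Succeed if the MAC address given as $1 appears in the LAG configuration
# read from stdin. The comparison ignores letter case.
dmac_in_lag() {
    lag_dmacs | grep -qxF "$(printf '%s' "$1" | tr 'A-F' 'a-f')"
}
```

Assuming the configuration above has been saved to a file (for example a hypothetical `lag.json`), `dmac_in_lag b2:00:00:00:00:01 < lag.json` returns success when the address is present.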
The following steps describe how to use CFS to configure a bond on an NCN worker node.
Use the Cloning a VCS repository procedure to clone the csm-config-management repository.
Determine the import branch to use.
NOTE: Update CSM_RELEASE for the version being used.
CSM_RELEASE=1.6.0
kubectl -n services get cm cray-product-catalog -o jsonpath='{.data.csm}' | yq4 ".[\"${CSM_RELEASE}\"].configuration.import_branch"
Example output:
cray/csm/1.26.0
Create an integration branch from the import branch for the required configuration.
cd csm-config-management
git checkout -b integration-1.26.0 origin/cray/csm/1.26.0
Example output:
branch 'integration-1.26.0' set up to track 'origin/cray/csm/1.26.0'.
Switched to a new branch 'integration-1.26.0'
Refer to VCS Branching Strategy for more information about git branches.
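The integration branch name in the example is simply the version component of the import branch with an `integration-` prefix. A small sketch of that derivation follows; the function name `integration_branch` is hypothetical, and the naming pattern is only the convention used in this example, not a requirement.

```bash
# Derive an integration branch name from an import branch such as
# "cray/csm/1.26.0". The "integration-" prefix mirrors the example above.
integration_branch() {
    import_branch=$1
    # Strip everything up to and including the last "/" to keep the version.
    printf 'integration-%s\n' "${import_branch##*/}"
}

# Possible use, assuming IMPORT_BRANCH holds the value printed by the
# product catalog query:
#   git checkout -b "$(integration_branch "${IMPORT_BRANCH}")" "origin/${IMPORT_BRANCH}"
```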
csm.ncn.hsn_bonding Ansible role
Determine eligible NCN worker nodes.
sat status --hsm-fields --filter SubRole=Worker
Example output:
+----------------+------+--------+-------+------+---------+------+-------+------------+---------+----------+
| xname | Type | NID | State | Flag | Enabled | Arch | Class | Role | SubRole | Net Type |
+----------------+------+--------+-------+------+---------+------+-------+------------+---------+----------+
| x3000c0s4b0n0 | Node | 100008 | Ready | OK | True | X86 | River | Management | Worker | Sling |
| x3000c0s5b0n0 | Node | 100007 | Ready | OK | True | X86 | River | Management | Worker | Sling |
| x3000c0s6b0n0 | Node | 100006 | Ready | OK | True | X86 | River | Management | Worker | Sling |
| x3000c0s30b0n0 | Node | 100005 | Ready | OK | True | X86 | River | Management | Worker | Sling |
| x3000c0s31b0n0 | Node | 100004 | Ready | OK | True | X86 | River | Management | Worker | Sling |
+----------------+------+--------+-------+------+---------+------+-------+------------+---------+----------+
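When many workers are listed, the xnames can be extracted from the table instead of being copied by hand. The sketch below assumes the column order shown in the example output (xname first, State fourth); the function name `ready_worker_xnames` is hypothetical.

```bash
# Read a "sat status" table on stdin and print the xname of every row whose
# State column is "Ready". Column positions follow the example output above.
ready_worker_xnames() {
    awk -F'|' '
        # Data rows start with "|"; border rows start with "+" and are skipped.
        /^\|/ {
            xname = $2; state = $5
            gsub(/ /, "", xname); gsub(/ /, "", state)
            # The header row has the literal text "xname" in the first column.
            if (xname != "xname" && state == "Ready") print xname
        }
    '
}
```

For example, `sat status --hsm-fields --filter SubRole=Worker | ready_worker_xnames` would print one xname per line for the table above.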
Define the node-specific Ansible variables.
This example uses the node x3000c0s31b0n0 and the following parameters.
Parameter | Value |
---|---|
hsn_bond_enable | true |
hsn_bond_mac (DMAC) | b2:00:00:00:00:01 |
hsn_bond_ip | 10.253.254.1 |
hsn_bond_netmask | 255.255.0.0 |
The DMAC used should match the one defined in the fabric LAG configuration. All four parameters in this table must be provided. The values for hsn_bond_mac, hsn_bond_ip, and hsn_bond_netmask cannot be derived automatically and must be set explicitly; interface configuration will fail if they are not provided.
NOTE: The hsn_bond_options parameter defaults to "mode=802.3ad xmit_hash_policy=layer2+3 miimon=100 ad_select=bandwidth lacp_rate=fast" and may need changing if static mode LAGs are to be used instead of LACP. See roles/csm.nmn_hsn_bonding in the csm-config-management repository for a full list of Ansible variables that can be changed.
Create the node-specific variables file.
Create the file host_vars/x3000c0s31b0n0.yml containing the following values. It may be necessary to create the host_vars directory if it does not already exist.
hsn_bond_enable: true
hsn_bond_mac: "b2:00:00:00:00:01"
hsn_bond_ip: "10.253.254.1"
hsn_bond_netmask: '255.255.0.0'
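Since interface configuration fails when any of the four values is missing, a quick check of the file before committing can save a failed CFS session. This is a minimal sketch under stated assumptions: the function name `check_hsn_bond_vars` is hypothetical, and it only looks for top-level `key: value` lines and a MAC-shaped value, so it is not a YAML parser.

```bash
# Check that a host_vars file defines the four required variables and that
# hsn_bond_mac looks like a MAC address. Prints each problem found and
# returns non-zero if there is at least one.
check_hsn_bond_vars() {
    file=$1
    rc=0
    for key in hsn_bond_enable hsn_bond_mac hsn_bond_ip hsn_bond_netmask; do
        # Require the key at the start of a line followed by a non-empty value.
        if ! grep -qE "^${key}:[[:space:]]*[^[:space:]]" "$file"; then
            echo "missing or empty: ${key}"
            rc=1
        fi
    done
    # Accept the MAC value with or without surrounding quotes.
    if ! grep -qE "^hsn_bond_mac:[[:space:]]*[\"']?([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}[\"']?[[:space:]]*\$" "$file"; then
        echo "hsn_bond_mac is not a valid MAC address"
        rc=1
    fi
    return "$rc"
}
```

Running `check_hsn_bond_vars host_vars/x3000c0s31b0n0.yml` prints nothing and returns success for the file shown above.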
Commit the change and push it back up to the VCS.
git add host_vars/x3000c0s31b0n0.yml
git commit -m 'Configure HSN bonding on ncn-w005'
git push --set-upstream origin integration-1.26.0
csm.ncn.hsn_bonding role
Create a CFS configuration using the committed changes.
Obtain the commit hash and create a configuration template file.
COMMIT=$(git rev-parse --verify HEAD)
cat << EOF > hsn-nic-bonding.json
{
  "layers": [
    {
      "cloneUrl": "https://api-gw-service-nmn.local/vcs/cray/csm-config-management.git",
      "commit": "${COMMIT}",
      "name": "hsn-nic-bonding",
      "playbook": "ncn_hsn_bonding.yml"
    }
  ]
}
EOF
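If `COMMIT` is empty when the here-document is expanded, the template is still written but with a blank commit field. The following sketch checks for that before the template is used; the function names are hypothetical, the check assumes a SHA-1 repository (40 hexadecimal characters), and it reads the field with a text match rather than a JSON parser.

```bash
# Succeed if $1 is a full 40-character hexadecimal Git commit hash.
is_full_commit_hash() {
    printf '%s\n' "$1" | grep -qE '^[0-9a-f]{40}$'
}

# Succeed if the first "commit" field in the template file $1 holds a
# full commit hash.
template_has_commit() {
    hash=$(sed -n 's/.*"commit"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$1" | head -n 1)
    is_full_commit_hash "$hash"
}
```

For example, `template_has_commit hsn-nic-bonding.json || echo "commit hash missing"` flags a template generated with an unset `COMMIT`.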
Create a CFS configuration from the template file.
cray cfs configurations update hsn-nic-bonding --file ./hsn-nic-bonding.json
Example output:
lastUpdated = "2024-10-11T11:13:39Z"
name = "hsn-nic-bonding"
[[layers]]
cloneUrl = "https://api-gw-service-nmn.local/vcs/cray/csm-config-management.git"
commit = "e8de8f5be0ba6576b4102137db821a0da3b28375"
name = "hsn-nic-bonding"
playbook = "ncn_hsn_bonding.yml"
Create a CFS session to apply the configuration to the node(s).
SESSION=hsn-nic-bonding-$(date +%Y%m%d%H%M%S)
cray cfs sessions create --name "${SESSION}" --configuration-name hsn-nic-bonding
Example output:
debug_on_failure = false
logs = "ara.cmn.surtur.hpc.amslabs.hpecorp.net/?label=hsn-nic-bonding-20241011111435"
name = "hsn-nic-bonding-20241011111435"
[ansible]
config = "cfs-default-ansible-cfg"
limit = ""
passthrough = ""
verbosity = 0
[configuration]
limit = ""
name = "hsn-nic-bonding"
[status]
artifacts = []
[tags]
[target]
definition = "dynamic"
groups = []
image_map = []
[status.session]
start_time = "2024-10-11T11:14:48"
status = "pending"
succeeded = "none"
Check that the CFS session completed successfully.
cray cfs sessions describe "${SESSION}"
Example output:
name = "hsn-nic-bonding-20241011111435"
[ansible]
config = "cfs-default-ansible-cfg"
limit = ""
passthrough = ""
verbosity = 0
[configuration]
limit = ""
name = "hsn-nic-bonding"
[status]
artifacts = []
[tags]
[target]
definition = "dynamic"
groups = []
image_map = []
[status.session]
completionTime = "2024-10-11T12:05:21"
job = "cfs-f2e9a676-7faa-4554-8e20-1d28024c2859"
startTime = "2024-10-11T12:02:58"
status = "complete"
succeeded = "true"
The session status should be “complete” and succeeded should be “true”. See the troubleshooting section if that is not the case.
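The two fields can also be checked in a script rather than by eye. This is a sketch only: `session_succeeded` is a hypothetical helper that reads the describe output in the format shown above and does not query CFS itself.

```bash
# Read "cray cfs sessions describe" output on stdin and succeed only when
# status is "complete" and succeeded is "true".
session_succeeded() {
    awk '
        # Section headers such as [status] start with "[" and do not match.
        /^[[:space:]]*status[[:space:]]*=/    { status = $NF }
        /^[[:space:]]*succeeded[[:space:]]*=/ { succeeded = $NF }
        END { exit !(status == "\"complete\"" && succeeded == "\"true\"") }
    '
}
```

One possible use is `cray cfs sessions describe "${SESSION}" | session_succeeded && echo "session OK"`. A session that is still pending also returns a non-zero status, so the command may need to be repeated until the session finishes.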
Verify the bonded interface is configured.
SSH to the target node.
Check the interface is configured with the correct IP address and MAC address.
ip ad show bond1
Example output:
5: bond1: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether b2:00:00:00:00:01 brd ff:ff:ff:ff:ff:ff
inet 10.253.254.1/16 brd 10.253.255.255 scope global bond1
valid_lft forever preferred_lft forever
inet6 fe80::b000:ff:fe00:1/64 scope link proto kernel_ll
valid_lft forever preferred_lft forever
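The `ip` output reports the address with a prefix length (`/16`), while hsn_bond_netmask is set in dotted-decimal form (255.255.0.0). To compare the two, the netmask can be converted to a prefix length. The helper below is a sketch with a hypothetical name; it sums the bits of each octet and does not verify that the mask bits are contiguous.

```bash
# Convert a dotted-decimal netmask to its CIDR prefix length.
# Note: octet order is not validated, so a non-contiguous mask such as
# 255.0.255.0 is not rejected.
netmask_to_prefix() {
    prefix=0
    old_ifs=$IFS; IFS=.
    set -- $1          # split the netmask into its octets
    IFS=$old_ifs
    for octet in "$@"; do
        case $octet in
            255) prefix=$((prefix + 8)) ;;
            254) prefix=$((prefix + 7)) ;;
            252) prefix=$((prefix + 6)) ;;
            248) prefix=$((prefix + 5)) ;;
            240) prefix=$((prefix + 4)) ;;
            224) prefix=$((prefix + 3)) ;;
            192) prefix=$((prefix + 2)) ;;
            128) prefix=$((prefix + 1)) ;;
            0) ;;
            *) echo "invalid netmask octet: $octet" >&2; return 1 ;;
        esac
    done
    echo "$prefix"
}
```

With the value used in this procedure, `netmask_to_prefix 255.255.0.0` prints `16`, which matches the `/16` in the example output.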
Verify both interfaces forming the bonded interface are up.
cat /proc/net/bonding/bond1
Example output:
Ethernet Channel Bonding Driver: v6.4.0-150600.23.17-default
Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer2+3 (2)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0
Peer Notification Delay (ms): 0
802.3ad info
LACP active: on
LACP rate: fast
Min links: 0
Aggregator selection policy (ad_select): bandwidth
System priority: 65535
System MAC address: b2:00:00:00:00:01
Active Aggregator Info:
Aggregator ID: 1
Number of ports: 2
Actor Key: 31
Partner Key: 2
Partner Mac Address: 02:00:00:00:00:01
Slave Interface: macvlan1
MII Status: up
Speed: 200000 Mbps
Duplex: full
Link Failure Count: 2
Permanent HW addr: b2:00:00:00:00:01
Slave queue ID: 0
Aggregator ID: 1
Actor Churn State: none
Partner Churn State: none
Actor Churned Count: 1
Partner Churned Count: 2
details actor lacp pdu:
system priority: 65535
system mac address: b2:00:00:00:00:01
port key: 31
port priority: 255
port number: 1
port state: 63
details partner lacp pdu:
system priority: 32768
system mac address: 02:00:00:00:00:01
oper key: 2
port priority: 255
port number: 4
port state: 63
Slave Interface: macvlan0
MII Status: up
Speed: 200000 Mbps
Duplex: full
Link Failure Count: 2
Permanent HW addr: b2:00:00:00:00:01
Slave queue ID: 0
Aggregator ID: 1
Actor Churn State: none
Partner Churn State: none
Actor Churned Count: 0
Partner Churned Count: 2
details actor lacp pdu:
system priority: 65535
system mac address: b2:00:00:00:00:01
port key: 31
port priority: 255
port number: 2
port state: 63
details partner lacp pdu:
system priority: 32768
system mac address: 02:00:00:00:00:01
oper key: 2
port priority: 255
port number: 3
port state: 63
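The output is long, and the lines that matter for this check are the `Slave Interface` entries and the `MII Status` line that follows each one. The sketch below summarizes them; `bond_slaves_up` is a hypothetical helper, it assumes the line layout shown in the example output, and it does not inspect LACP state or aggregator IDs.

```bash
# Read /proc/net/bonding/<bond> content on stdin. Succeed when at least
# $1 slave interfaces are listed (default 2) and none of them reports an
# MII Status other than "up".
bond_slaves_up() {
    awk -v want="${1:-2}" '
        /^[[:space:]]*Slave Interface:/ { in_slave = 1; slaves++; next }
        # Only the first MII Status line after each Slave Interface counts;
        # the bond-level MII Status near the top is ignored.
        in_slave && /^[[:space:]]*MII Status:/ {
            if ($3 != "up") down++
            in_slave = 0
        }
        END { exit !(slaves >= want && down == 0) }
    '
}
```

For example, `bond_slaves_up < /proc/net/bonding/bond1 && echo "both links up"` on the target node.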
Refer to View Configuration Session Logs to troubleshoot why the CFS session failed to complete successfully.
If the underlying HSN interfaces are not up or not present, refer to the Slingshot documentation listed in References to verify that the fabric is healthy. Troubleshooting HPE Slingshot is beyond the scope of this document; see the HPE Slingshot Troubleshooting Guide for more information.
This procedure performs a one-time configuration of the target nodes. The bonded HSN configuration persists through a reboot of the node, but a rebuild of the node will wipe it.
To persist this configuration through a rebuild of the node, add the CFS layer to the CFS configuration used for the NCN worker nodes.
It may also be desirable to add this layer to the site_vars.yaml file, as well as to any bootprep file used for sat bootprep, to ensure that the update-cfs-configuration stage of IUF does not remove this layer.
See CFS Configurations and the IUF overview and configuration documentation for more information.