Add NCN data to the System Layout Service (SLS), Boot Script Service (BSS), and Hardware State Manager (HSM) as needed, in order to add an NCN to the system.
Scenarios where this procedure is applicable:
Retrieve an API token:
ncn-mw# export TOKEN=$(curl -s -S -d grant_type=client_credentials \
-d client_id=admin-client -d client_secret=`kubectl get secrets admin-client-auth \
-o jsonpath='{.data.client-secret}' | base64 -d` \
https://api-gw-service-nmn.local/keycloak/realms/shasta/protocol/openid-connect/token \
| jq -r '.access_token')
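If the token request fails, `jq -r` prints the literal string `null` (or nothing at all), and later API calls then fail with authorization errors that are harder to trace back. The following is a minimal sanity check, assuming only that a usable token is a non-empty string other than `null`. The helper name `token_looks_valid` is hypothetical and not part of CSM.

```shell
# Hypothetical helper: succeed only if the argument looks like a real token.
# jq -r prints the literal string "null" when .access_token is missing.
token_looks_valid() {
    case "$1" in
        ""|null) return 1 ;;
        *)       return 0 ;;
    esac
}

# Usage sketch: report the problem early instead of continuing with a bad token.
if token_looks_valid "${TOKEN:-}"; then
    echo "API token retrieved"
else
    echo "TOKEN is empty or null; re-run the token request" >&2
fi
```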
Collect information from the NCN.
Determine the component name (xname) of the NCN, if it has not been determined yet.
Determine the xname by referring to the HMN tab of the system’s SHCD file.
Sample row from the HMN tab of an SHCD file:
Source (J20) | Source Rack (K20) | Source Location (L20) | (M20) | Parent (N20) | (O20) | Source Port (P20) | Destination (Q20) | Destination Rack (R20) | Destination Location (S20) | (T20) | Destination Port (U20) |
---|---|---|---|---|---|---|---|---|---|---|---|
wn01 | x3000 | u04 | - | | | j3 | sw-smn01 | x3000 | u14 | - | j48 |
The Source name for a worker NCN would be in the format wn01; master NCNs have format mn01 and storage NCNs have format sn01.
Node xname format: xXcCsSbBnN
Field | Name | SHCD Column to Reference | Description |
---|---|---|---|
X | Cabinet number | Source Rack (K20) | The cabinet or rack number containing the management NCN. |
C | Chassis number | | For air-cooled nodes within a standard rack, the chassis is 0. |
S | Slot/Rack U | Source Location (L20) | The slot of the node is determined by the bottommost rack U that the node occupies. |
B | BMC number | | For management NCNs, the BMC number is 0. |
N | Node number | | For management NCNs, the node number is 0. |
ncn-mw# XNAME=x3000c0s4b0n0
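The mapping from the SHCD row to the node xname can be sketched as a small shell function. This is an illustration only: the helper name `shcd_to_node_xname` is hypothetical, and it assumes an air-cooled management NCN, so the chassis, BMC, and node numbers are all 0 as described in the table above.

```shell
# Hypothetical helper: build a node xname (xXcCsSbBnN) from SHCD values.
#   $1 = Source Rack     (for example x3000)
#   $2 = Source Location (for example u04, the lowest rack U the node occupies)
# Chassis, BMC, and node numbers are fixed at 0 for air-cooled management NCNs.
shcd_to_node_xname() {
    rack=$1
    # Drop the leading "u" and any zero padding: u04 -> 4
    slot=$(printf '%s\n' "$2" | sed 's/^[uU]//; s/^0*//')
    printf '%sc0s%sb0n0\n' "$rack" "${slot:-0}"
}

XNAME=$(shcd_to_node_xname x3000 u04)
echo "${XNAME}"    # x3000c0s4b0n0
```

The result should still be checked against the SHCD by hand, because a node that spans several rack units must use the lowest one.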
Skip if adding ncn-m001: Determine the NCN BMC xname by removing the trailing n0 from the NCN xname:
ncn-mw# BMC_XNAME=x3000c0s4b0
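Rather than retyping the value, the BMC xname can be derived from the node xname with shell parameter expansion. The sketch below adds a loose format check so that a value that is not a node xname is rejected instead of being silently truncated. The function name `node_to_bmc_xname` is hypothetical.

```shell
# Hypothetical helper: strip the trailing node component (nN) from a node xname.
# The case pattern is only a loose check that the value ends in bB followed by nN.
node_to_bmc_xname() {
    case "$1" in
        x*c*s*b*n[0-9]*) printf '%s\n' "${1%n*}" ;;
        *) echo "not a node xname: $1" >&2; return 1 ;;
    esac
}

BMC_XNAME=$(node_to_bmc_xname x3000c0s4b0n0)
echo "${BMC_XNAME}"    # x3000c0s4b0
```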
Skip if adding ncn-m001: Determine the xname of the MgmtSwitchConnector (the switch port of the management switch that the BMC is connected to). This is not required for ncn-m001, because its BMC is typically connected to the site network.
Sample row from the HMN tab of an SHCD:
Source (J20) | Source Rack (K20) | Source Location (L20) | (M20) | Parent (N20) | (O20) | Source Port (P20) | Destination (Q20) | Destination Rack (R20) | Destination Location (S20) | (T20) | Destination Port (U20) |
---|---|---|---|---|---|---|---|---|---|---|---|
wn01 | x3000 | u04 | - | | | j3 | sw-smn01 | x3000 | u14 | - | j48 |
MgmtSwitchConnector xname format: xXcCwWjJ
Field | Name | SHCD Column to Reference | Description |
---|---|---|---|
X | Cabinet number | Destination Rack (R20) | The cabinet or rack number containing the management switch. |
C | Chassis number | | For air-cooled management switches within standard racks, the chassis is 0. |
W | Slot/Rack U | Destination Location (S20) | The slot/rack U that the management switch occupies. |
J | Switch port number | Destination Port (U20) | The switch port on the switch that the NCN BMC is cabled to. |
ncn-mw# MGMT_SWITCH_CONNECTOR=x3000c0w14j48
Skip if adding ncn-m001: Determine the xname of the management switch by removing the trailing jJ from the MgmtSwitchConnector xname.
ncn-mw# MGMT_SWITCH=x3000c0w14
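Both switch-related xnames follow mechanically from the Destination columns of the SHCD row. A minimal sketch, assuming an air-cooled management switch (chassis 0); the helper name `shcd_to_connector_xname` is hypothetical.

```shell
# Hypothetical helper: build a MgmtSwitchConnector xname (xXcCwWjJ).
#   $1 = Destination Rack     (for example x3000)
#   $2 = Destination Location (for example u14)
#   $3 = Destination Port     (for example j48)
shcd_to_connector_xname() {
    slot=$(printf '%s\n' "$2" | sed 's/^[uU]//; s/^0*//')
    port=$(printf '%s\n' "$3" | sed 's/^[jJ]//; s/^0*//')
    printf '%sc0w%sj%s\n' "$1" "${slot:-0}" "${port:-0}"
}

MGMT_SWITCH_CONNECTOR=$(shcd_to_connector_xname x3000 u14 j48)
# Remove the trailing jJ component to get the switch xname.
MGMT_SWITCH=${MGMT_SWITCH_CONNECTOR%j*}
echo "${MGMT_SWITCH_CONNECTOR} ${MGMT_SWITCH}"    # x3000c0w14j48 x3000c0w14
```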
Skip if adding ncn-m001: Collect the BMC MAC address.
If the NCN was previously in the system, recall the BMC MAC address recorded from the Remove NCN Data procedure.
Alternatively, view the MAC address table on the management switch that the BMC is cabled to.
Determine the alias of the management switch that is connected to the BMC.
ncn-mw# cray sls hardware describe "${MGMT_SWITCH}" --format json | jq -r '.ExtraProperties.Aliases[]'
Example output:
sw-leaf-001
SSH into the management switch that is connected to the BMC.
ncn-mw# ssh admin@sw-leaf-001.hmn
Locate the switch port that the BMC is connected to and record its MAC address.
In the commands below, change the value of 1/1/48 to match the BMC switch port number (the BMC switch port number is the J value in the MgmtSwitchConnector xname xXcCwWjJ).
For example, with a MgmtSwitchConnector xname of x3000c0w14j39, the switch port number would be 39. In that case, 1/1/39 would be used instead of 1/1/48 in the following commands.
Dell Management Switch
sw-leaf# show mac address-table | grep 1/1/48
Example output:
4 a4:bf:01:65:68:54 dynamic 1/1/48
Aruba Management Switch
sw-leaf# show mac-address-table | include 1/1/48
Example output:
a4:bf:01:65:68:54 4 dynamic 1/1/48
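The Dell and Aruba outputs place the MAC address in different columns, so picking it out by pattern is less error-prone than picking it by position. The sketch below derives the port string from the connector xname and extracts the MAC address from either sample line. The `1/1/` prefix is taken from the example commands above and may differ on other switch models; `first_mac` is a hypothetical helper name.

```shell
# The port number is everything after the last "j" in the connector xname.
MGMT_SWITCH_CONNECTOR=x3000c0w14j48
SWITCH_PORT="1/1/${MGMT_SWITCH_CONNECTOR##*j}"
echo "${SWITCH_PORT}"    # 1/1/48

# Hypothetical helper: print the first MAC address found on stdin.
first_mac() {
    grep -oiE '([0-9a-f]{2}:){5}[0-9a-f]{2}' | head -n 1
}

# The same filter handles both sample output layouts.
printf '4 a4:bf:01:65:68:54 dynamic 1/1/48\n' | first_mac    # Dell layout
printf 'a4:bf:01:65:68:54 4 dynamic 1/1/48\n' | first_mac    # Aruba layout
```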
Skip if adding ncn-m001: Set the BMC_MAC environment variable to the BMC MAC address.
ncn-mw# BMC_MAC=a4:bf:01:65:68:54
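The Kea query in the next step compares the MAC address as a plain string, so a value in a different case or with a different separator would simply not match. The following sketch normalizes and validates the value first. It assumes that Kea reports `hw-address` in lowercase, colon-separated form, as in the example outputs in this procedure; `normalize_mac` is a hypothetical helper name.

```shell
# Hypothetical helper: lowercase a MAC address, convert "-" separators to ":",
# and reject anything that is not six colon-separated hex octets.
normalize_mac() {
    mac=$(printf '%s\n' "$1" | tr 'A-F' 'a-f' | sed 's/-/:/g')
    if printf '%s\n' "$mac" | grep -qE '^([0-9a-f]{2}:){5}[0-9a-f]{2}$'; then
        printf '%s\n' "$mac"
    else
        echo "invalid MAC address: $1" >&2
        return 1
    fi
}

BMC_MAC=$(normalize_mac A4:BF:01:65:68:54)
echo "${BMC_MAC}"    # a4:bf:01:65:68:54
```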
Skip if adding ncn-m001: Determine the current IP address of the NCN BMC.
Query Kea for the BMC MAC address to determine its current IP address.
ncn-mw# BMC_IP=$(curl -sk -H "Authorization: Bearer ${TOKEN}" -X POST -H "Content-Type: application/json" \
-d '{ "command": "lease4-get-all", "service": [ "dhcp4" ] }' \
https://api-gw-service-nmn.local/apis/dhcp-kea |
jq --arg BMC_MAC "${BMC_MAC}" \
'.[].arguments.leases[] | select(."hw-address" == $BMC_MAC)."ip-address"' -r)
ncn-mw# echo "${BMC_IP}"
Example output:
10.254.1.26
Troubleshooting: If the MAC address of the BMC is not present in Kea, then check for the following items:
Ping the BMC to see if it is reachable.
ncn-mw# ping "${BMC_IP}"
Perform this step if adding ncn-m001, otherwise skip: Set the BMC_IP environment variable to the current IP address or hostname of the BMC. This is not the allocated HMN address for the BMC of ncn-m001.
ncn-mw# BMC_IP=10.0.0.10
Collect NCN MAC addresses for the following interfaces if they are present. Depending on the hardware in the NCN, not all of these interfaces will be present.
The collected MAC addresses will be used later in this procedure with the add_management_ncn.py script.
NCN with a single PCIe card (1 card with 2 ports):
Interface | CLI Flag | Required MAC Address | Description |
---|---|---|---|
mgmt0 | --mac-mgmt0 | Required | First MAC address of Bond 0. |
mgmt1 | --mac-mgmt1 | Required | Second MAC address of Bond 0. |
hsn0 | --mac-hsn0 | Required for Worker NCNs | MAC address of the first High Speed Network NIC. Master and Storage NCNs do not have HSN NICs. |
hsn1 | --mac-hsn1 | Optional for Worker NCNs | MAC address of the second High Speed Network NIC. Master and Storage NCNs do not have HSN NICs. |
lan0 | --mac-lan0 | Optional | MAC address of the first interface that is neither bonded nor HSN-related. |
lan1 | --mac-lan1 | Optional | MAC address of the second interface that is neither bonded nor HSN-related. |
lan2 | --mac-lan2 | Optional | MAC address of the third interface that is neither bonded nor HSN-related. |
lan3 | --mac-lan3 | Optional | MAC address of the fourth interface that is neither bonded nor HSN-related. |
NCN with dual PCIe cards (2 cards with 2 ports each, for 4 ports total):
Interface | CLI Flag | Required MAC Address | Description |
---|---|---|---|
mgmt0 | --mac-mgmt0 | Required | First MAC address of Bond 0. |
mgmt1 | --mac-mgmt1 | Required | First MAC address of Bond 1. |
mgmt2 | --mac-mgmt2 | Required | Second MAC address of Bond 0. |
mgmt3 | --mac-mgmt3 | Required | Second MAC address of Bond 1. |
hsn0 | --mac-hsn0 | Required for Worker NCNs | MAC address of the first High Speed Network NIC. Master and Storage NCNs do not have HSN NICs. |
hsn1 | --mac-hsn1 | Optional for Worker NCNs | MAC address of the second High Speed Network NIC. Master and Storage NCNs do not have HSN NICs. |
lan0 | --mac-lan0 | Optional | MAC address of the first interface that is neither bonded nor HSN-related. |
lan1 | --mac-lan1 | Optional | MAC address of the second interface that is neither bonded nor HSN-related. |
lan2 | --mac-lan2 | Optional | MAC address of the third interface that is neither bonded nor HSN-related. |
lan3 | --mac-lan3 | Optional | MAC address of the fourth interface that is neither bonded nor HSN-related. |
If the NCN being added is being moved to a new location in the system, then these MAC addresses can be retrieved from backup files generated by the Remove NCN Data procedure.
Recall the previous node xname of the NCN being added:
ncn-mw# PREVIOUS_XNAME=REPLACE_WITH_OLD_XNAME
Retrieve the MAC address for the NCN from the backup files:
ncn-mw# jq -r '.[].params' "/tmp/remove_management_ncn/${PREVIOUS_XNAME}/bss-bootparameters-${PREVIOUS_XNAME}.json" |
tr " " "\n" | grep ifname
Example output for a worker node with a single management PCIe NIC card:
ifname=hsn0:50:6b:4b:23:9f:7c
ifname=lan1:b8:59:9f:d9:9d:e9
ifname=lan0:b8:59:9f:d9:9d:e8
ifname=mgmt0:a4:bf:01:65:6a:aa
ifname=mgmt1:a4:bf:01:65:6a:ab
Using the example output from above, derive the following CLI flags for a worker NCN:
Interface | MAC Address | CLI Flag |
---|---|---|
mgmt0 | a4:bf:01:65:6a:aa | --mac-mgmt0=a4:bf:01:65:6a:aa |
mgmt1 | a4:bf:01:65:6a:ab | --mac-mgmt1=a4:bf:01:65:6a:ab |
lan0 | b8:59:9f:d9:9d:e8 | --mac-lan0=b8:59:9f:d9:9d:e8 |
lan1 | b8:59:9f:d9:9d:e9 | --mac-lan1=b8:59:9f:d9:9d:e9 |
hsn0 | 50:6b:4b:23:9f:7c | --mac-hsn0=50:6b:4b:23:9f:7c |
Otherwise, the NCN MAC addresses need to be collected using the Collect NCN MAC Addresses procedure.
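When the backup files are available, turning the `ifname=` entries into CLI flags is a mechanical rewrite, so it can be scripted to avoid transcription mistakes. A minimal sketch with `sed`; the function name `ifnames_to_flags` is hypothetical, and it assumes every entry has the `ifname=<interface>:<MAC>` form shown in the example output.

```shell
# Hypothetical helper: convert "ifname=<interface>:<MAC>" lines on stdin
# into "--mac-<interface>=<MAC>" flags. Lines in any other form are dropped.
ifnames_to_flags() {
    sed -n 's/^ifname=\([a-z0-9]*\):\(.*\)$/--mac-\1=\2/p'
}

printf '%s\n' \
    'ifname=hsn0:50:6b:4b:23:9f:7c' \
    'ifname=mgmt0:a4:bf:01:65:6a:aa' | ifnames_to_flags
# --mac-hsn0=50:6b:4b:23:9f:7c
# --mac-mgmt0=a4:bf:01:65:6a:aa
```

On a live system the function could be appended to the `grep ifname` pipeline above in place of the sample input.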
Perform a dry run of the add_management_ncn.py script in order to determine if any validation failures occur:
Update the following command with the MAC addresses and interfaces that were collected from the NCN.
If adding a node other than ncn-m001:
ncn-mw# cd /usr/share/doc/csm/scripts/operations/node_management/Add_Remove_Replace_NCNs/
ncn-mw# ./add_management_ncn.py ncn-data \
--xname "${XNAME}" \
--alias "${NODE}" \
--bmc-mgmt-switch-connector "${MGMT_SWITCH_CONNECTOR}" \
--mac-bmc "${BMC_MAC}" \
--mac-mgmt0 a4:bf:01:65:6a:aa \
--mac-mgmt1 a4:bf:01:65:6a:ab \
--mac-hsn0 50:6b:4b:23:9f:7c \
--mac-lan0 b8:59:9f:d9:9d:e8 \
--mac-lan1 b8:59:9f:d9:9d:e9
If adding ncn-m001, omit the --bmc-mgmt-switch-connector and --mac-bmc arguments, because its BMC is connected to the site network:
ncn-mw# cd /usr/share/doc/csm/scripts/operations/node_management/Add_Remove_Replace_NCNs/
ncn-mw# ./add_management_ncn.py ncn-data \
--xname "${XNAME}" \
--alias "${NODE}" \
--mac-mgmt0 a4:bf:01:65:6a:aa \
--mac-mgmt1 a4:bf:01:65:6a:ab \
--mac-lan0 b8:59:9f:d9:9d:e8 \
--mac-lan1 b8:59:9f:d9:9d:e9
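The two command variants differ only in the BMC-related arguments, so the common part can be assembled once and the BMC arguments added conditionally. The sketch below only prints the resulting command for review and does not run the script. The function name `build_ncn_args` and the alias `ncn-w004` are hypothetical; the flags are the ones used in the commands above.

```shell
# Hypothetical helper: print the arguments shared by the dry run and the real run.
#   $1 = NCN alias, $2 = node xname,
#   $3 = MgmtSwitchConnector xname, $4 = BMC MAC (both unused for ncn-m001)
build_ncn_args() {
    alias_name=$1 xname=$2 connector=${3:-} bmc_mac=${4:-}
    set -- ncn-data --xname "$xname" --alias "$alias_name"
    # The BMC of ncn-m001 is on the site network, so these flags are omitted for it.
    if [ "$alias_name" != "ncn-m001" ]; then
        set -- "$@" --bmc-mgmt-switch-connector "$connector" --mac-bmc "$bmc_mac"
    fi
    echo "$*"
}

args=$(build_ncn_args ncn-w004 x3000c0s4b0n0 x3000c0w14j48 a4:bf:01:65:68:54)
# Print the command for review; the interface MAC flags are appended by hand.
echo "./add_management_ncn.py ${args} --mac-mgmt0 a4:bf:01:65:6a:aa --mac-mgmt1 a4:bf:01:65:6a:ab"
```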
Add the NCN to SLS, HSM, and BSS.
Run the add_management_ncn.py script again, adding the --perform-changes argument to the command run in the previous step.
For example:
ncn-mw# ./add_management_ncn.py ncn-data \
--xname "${XNAME}" \
--alias "${NODE}" \
--bmc-mgmt-switch-connector "${MGMT_SWITCH_CONNECTOR}" \
--mac-bmc "${BMC_MAC}" \
--mac-mgmt0 a4:bf:01:65:6a:aa \
--mac-mgmt1 a4:bf:01:65:6a:ab \
--mac-hsn0 50:6b:4b:23:9f:7c \
--mac-lan0 b8:59:9f:d9:9d:e8 \
--mac-lan1 b8:59:9f:d9:9d:e9 \
--perform-changes
Example output:
...
x3000c0s3b0n0 (ncn-m002) has been added to SLS/HSM/BSS
WARNING The NCN BMC currently has the IP address: 10.254.1.20, and needs to have IP address 10.254.1.13
=================================
Management NCN IP Allocation
=================================
Network | IP Address
--------|-----------
HMN | 10.254.1.14
MTL | 10.1.1.7
NMN | 10.252.1.9
CAN | 10.102.4.10
=================================
Management NCN BMC IP Allocation
=================================
Network | IP Address
--------|-----------
HMN | 10.254.1.13
If the following text is present at the end of the add_management_ncn.py script output, then the NCN BMC was given an IP address by DHCP, and it is not at the expected IP address.
Sample output when the BMC has an unexpected IP address:
x3000c0s3b0n0 (ncn-m002) has been added to SLS/HSM/BSS
WARNING The NCN BMC currently has the IP address: <$BMC_IP>, and needs to have IP address X.Y.Z.W
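If the script output has been saved to a file, the current and expected addresses can be pulled out of the warning line for use in later checks. This is a sketch that assumes the warning has exactly the wording shown in the example output; the message format may differ between releases, and `parse_bmc_ip_warning` is a hypothetical helper name.

```shell
# Hypothetical helper: read script output on stdin and print
# "<current IP> <expected IP>" if the BMC IP warning is present.
parse_bmc_ip_warning() {
    sed -n 's/.*currently has the IP address: \([0-9.]*\), and needs to have IP address \([0-9.]*\).*/\1 \2/p'
}

line='WARNING The NCN BMC currently has the IP address: 10.254.1.20, and needs to have IP address 10.254.1.13'
printf '%s\n' "$line" | parse_bmc_ip_warning    # 10.254.1.20 10.254.1.13
```

No output means the warning was absent, in which case the BMC restart is not needed.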
Restart the BMC to pick up the expected IP address:
read -s is used to read the password in order to prevent it from being echoed to the screen or recorded in the shell history.
ncn-mw# read -r -s -p "BMC root password: " IPMI_PASSWORD
ncn-mw# export IPMI_PASSWORD
ncn-mw# ipmitool -U root -I lanplus -E -H "${BMC_IP}" mc reset cold
ncn-mw# sleep 60
Skip if adding ncn-m001: Verify that the BMC is reachable at the expected IP address.
ncn-mw# ping "${NODE}-mgmt"
Wait five minutes for Kea and the HSM to sync. If ping continues to fail, then re-run the previous step to restart the BMC.
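Instead of re-running `ping` by hand during the wait, a bounded retry loop can poll until the BMC answers or the time budget runs out. The helper below is a generic sketch; the name `retry` and the attempt and delay values in the usage line are assumptions, chosen so that the total wait is about five minutes.

```shell
# Hypothetical helper: run a command until it succeeds.
#   $1 = maximum attempts, $2 = seconds between attempts, remainder = command
retry() {
    attempts=$1 delay=$2
    shift 2
    n=1
    while ! "$@"; do
        if [ "$n" -ge "$attempts" ]; then
            echo "gave up after ${n} attempts: $*" >&2
            return 1
        fi
        n=$((n + 1))
        sleep "$delay"
    done
}

# Usage sketch (not run here): one ping every 10 seconds, for up to five minutes.
# retry 30 10 ping -c 1 -W 2 "${NODE}-mgmt"
```

A non-zero return from `retry` corresponds to the case described above, where the BMC should be restarted again.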
Restart the REDS deployment.
ncn-mw# kubectl -n services rollout restart deployment cray-reds
Expected output:
deployment.apps/cray-reds restarted
Wait for REDS to restart.
ncn-mw# kubectl -n services rollout status deployment cray-reds
Expected output:
Waiting for deployment "cray-reds" rollout to finish: 1 old replicas are pending termination...
Waiting for deployment "cray-reds" rollout to finish: 1 old replicas are pending termination...
deployment "cray-reds" successfully rolled out
Skip if adding ncn-m001: Wait for the NCN BMC to get discovered by HSM.
If the BMC of ncn-m001 is connected to the site network, then the BMC will not be discovered, because it is not connected via the HMN.
ncn-mw# watch -n 0.2 "cray hsm inventory redfishEndpoints describe '${BMC_XNAME}' --format json"
Wait until the LastDiscoveryStatus field is DiscoverOK:
{
"ID": "x3000c0s38b0",
"Type": "NodeBMC",
"Hostname": "",
"Domain": "",
"FQDN": "x3000c0s38b0",
"Enabled": true,
"UUID": "cc48551e-ec22-4bef-b8a3-bb3261749a0d",
"User": "root",
"Password": "",
"RediscoverOnUpdate": true,
"DiscoveryInfo": {
"LastDiscoveryAttempt": "2022-02-28T22:54:08.496898Z",
"LastDiscoveryStatus": "DiscoverOK",
"RedfishVersion": "1.7.0"
}
}
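The JSON above carries the discovery result in `LastDiscoveryStatus`, so the wait can also be scripted rather than watched. The sketch below extracts that field with `sed` to stay dependency-free; on a system with `jq`, `jq -r '.DiscoveryInfo.LastDiscoveryStatus'` would be the more robust choice. Both function names are hypothetical, and the five-second polling interval is an assumption.

```shell
# Hypothetical helper: print the LastDiscoveryStatus value from
# redfishEndpoint JSON on stdin.
discovery_status() {
    sed -n 's/.*"LastDiscoveryStatus"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Defined for illustration only: it needs the cray CLI and is not called here.
#   $1 = BMC xname
wait_for_discovery() {
    until [ "$(cray hsm inventory redfishEndpoints describe "$1" --format json | discovery_status)" = "DiscoverOK" ]; do
        sleep 5
    done
}

printf '%s\n' '{ "DiscoveryInfo": { "LastDiscoveryStatus": "DiscoverOK" } }' | discovery_status
# DiscoverOK
```

As written, `wait_for_discovery` loops indefinitely if discovery never succeeds, so it is worth combining with a timeout before relying on it.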
Discovery troubleshooting
The redfishEndpoint may cycle between DiscoveryStarted and HTTPsGetFailed before the endpoint becomes DiscoverOK.
If the BMC is in HTTPsGetFailed for a long period of time, then the following steps may help to determine the cause:
Verify that the xname of the BMC resolves in DNS.
ncn-mw# nslookup x3000c0s38b0
Expected output:
Server: 10.92.100.225
Address: 10.92.100.225#53
Name: x3000c0s38b0.hmn
Address: 10.254.1.13
Verify that the BMC is reachable at the expected IP address.
ncn-mw# ping "${NODE}-mgmt"
Verify that the BMC Redfish v1/Managers endpoint is reachable.
ncn-mw# curl -k -u root:changeme https://x3000c0s38b0/redfish/v1/Managers
Verify that the NCN exists under HSM State Components.
ncn-mw# cray hsm state components describe "${XNAME}" --format toml
Example output:
ID = "x3000c0s11b0n0"
Type = "Node"
State = "Off"
Flag = "OK"
Enabled = true
Role = "Management"
SubRole = "Worker"
NID = 100006
NetType = "Sling"
Arch = "X86"
Class = "River"
Proceed to Update Firmware or return to the main Add, Remove, Replace, or Move NCNs page.