Remove NCN data to System Layout Service (SLS), Boot Script Service (BSS), and Hardware State Manager (HSM) as needed to remove an NCN.
IMPORTANT: The following procedures assume that you have set the variables from the prerequisites section.
(ncn-mw#
) Prepare for the procedure.
Obtain an API token.
cd /usr/share/doc/csm/scripts/operations/node_management/Add_Remove_Replace_NCNs
export TOKEN=$(curl -s -S -d grant_type=client_credentials -d client_id=admin-client \
-d client_secret=`kubectl get secrets admin-client-auth -o jsonpath='{.data.client-secret}' | base64 -d` \
https://api-gw-service-nmn.local/keycloak/realms/shasta/protocol/openid-connect/token \
| jq -r '.access_token')
Set BMC credentials for the NCN (unless doing this for ncn-m001
).
The default username is root
.
Set IPMI_USERNAME
if this needs to be different for the given BMC.
read -s
is used in order to prevent the password from being echoed to the screen or saved in the shell history.
read -r -s -p "BMC ${IPMI_USERNAME-root} password: " IPMI_PASSWORD
export IPMI_PASSWORD
(ncn-mw#
) Fetch the status of the nodes.
./ncn_status.py --all
Example output:
first_master_hostname: ncn-m002
ncns:
ncn-m001 x3000c0s1b0n0 master
ncn-m002 x3000c0s3b0n0 master
ncn-m003 x3000c0s5b0n0 master
ncn-w001 x3000c0s7b0n0 worker
ncn-w002 x3000c0s9b0n0 worker
ncn-w003 x3000c0s11b0n0 worker
ncn-w004 x3000c0s34b0n0 worker
ncn-s001 x3000c0s13b0n0 storage
ncn-s002 x3000c0s15b0n0 storage
ncn-s003 x3000c0s17b0n0 storage
ncn-s004 x3000c0s26b0n0 storage
(ncn-mw#
) Fetch the status of the node to be removed.
./ncn_status.py --xname "${XNAME}"
Example output:
...
x3000c0s26b0n0:
xname: x3000c0s26b0n0
name: ncn-s004
parent: x3000c0s26b0
type: Node, Management, Storage
sources: bss, hsm, sls
ip_reservations: 10.1.1.19, 10.101.5.150, 10.101.5.214, 10.101.5.36, 10.252.1.21, 10.254.1.38
ip_reservations_name: ncn-s004-mtl, ncn-s004-can, x3000c0s26b0n0, ncn-s004-cmn, ncn-s004-nmn, ncn-s004-hmn
ip_reservations_mac: a4:bf:01:38:f4:50, a4:bf:01:38:f4:50, , , a4:bf:01:38:f4:50, a4:bf:01:38:f4:50
ifnames: mgmt0:a4:bf:01:38:f4:50, mgmt1:a4:bf:01:38:f4:51, lan0:b8:59:9f:de:b4:8c, lan1:b8:59:9f:de:b4:8d
ncn_macs:
ifnames: mgmt0:a4:bf:01:38:f4:50, mgmt1:a4:bf:01:38:f4:51, lan0:b8:59:9f:de:b4:8c, lan1:b8:59:9f:de:b4:8d
bmc_mac: a4:bf:01:38:f4:54
Important: Save the ifnames
and bmc_mac
information if planning to add this NCN back at some time in the future.
(ncn-mw#
) Shutdown cray-reds
.
kubectl -n services scale deployment cray-reds --replicas=0
(ncn-mw#
) Remove the node from SLS, HSM, and BSS.
./remove_management_ncn.py --xname "${XNAME}"
Example output:
...
Permanently remove x3000c0s26b0n0 - ncn-s004 (y/n)? y
...
Summary:
Logs: /tmp/remove_management_ncn/x3000c0s26b0n0
xname: x3000c0s26b0n0
ncn_name: ncn-s004
ncn_macs:
ifnames: mgmt0:a4:bf:01:38:f4:50, mgmt1:a4:bf:01:38:f4:51, lan0:b8:59:9f:de:b4:8c, lan1:b8:59:9f:de:b4:8d
bmc_mac: a4:bf:01:38:f4:54
Successfully removed x3000c0s26b0n0 - ncn-s004
NOTE
If workers have been removed and the worker count is currently at two, then a timeout restarting cray-bss
can be ignored.
For example, the following failure output from remove_management_ncn.py
can be ignored.
Waiting for cray-bss to start.
Do not kill this script. The wait will timeout in 10 minutes if bss does not fully start up.
Ran: kubectl -n services rollout status deployment cray-bss --timeout=600s
Error: Failed: 1
stdout:
Waiting for deployment "cray-bss" rollout to finish: 0 out of 3 new replicas have been updated...
Waiting for deployment "cray-bss" rollout to finish: 1 out of 3 new replicas have been updated...
Waiting for deployment "cray-bss" rollout to finish: 2 out of 3 new replicas have been updated...
stderr:
error: timed out waiting for the condition
(ncn-mw#
) Start cray-reds
.
kubectl -n services scale deployment cray-reds --replicas=1
(ncn-mw#
) Verify the results by fetching the status of the management nodes.
./ncn_status.py --all
Example output:
first_master_hostname: ncn-m002
ncns:
ncn-m001 x3000c0s1b0n0 master
ncn-m002 x3000c0s3b0n0 master
ncn-m003 x3000c0s5b0n0 master
ncn-w001 x3000c0s7b0n0 worker
ncn-w002 x3000c0s9b0n0 worker
ncn-w003 x3000c0s11b0n0 worker
ncn-w004 x3000c0s34b0n0 worker
ncn-s001 x3000c0s13b0n0 storage
ncn-s002 x3000c0s15b0n0 storage
ncn-s003 x3000c0s17b0n0 storage
(ncn-mw#
) Fetch the status of the node that was removed.
./ncn_status.py --xname "${XNAME}"
Example output:
Not found: x3000c0s26b0n0
Proceed to Remove Switch Configuration or return to the main Add, Remove, Replace, or Move NCNs page.