This section applies to all node types. The commands in this section assume the variables from the prerequisites section have been set.
Open and watch the console for the node being rebuilt.
Log in to a second session and use it to watch the console, following the instructions at the link below:
Open this link in a new tab or window: Log in to a Node Using ConMan
The first session will be needed to run the commands in the following Rebuild Node steps.
Set the PXE boot option and power cycle the node.
IMPORTANT: Run these commands from a node NOT being rebuilt.
IMPORTANT: The commands in this section assume the variables from the prerequisites section have been set.
Set the BMC variable to the hostname of the BMC of the node being rebuilt.
linux# BMC="${NODE}-mgmt"
Set and export the root password of the BMC.
NOTE: read -s is used to prevent the password from echoing to the screen or being saved in the shell history.
linux# read -r -s -p "${BMC} root password: " IPMI_PASSWORD
linux# export IPMI_PASSWORD
Set the PXE/efiboot option.
linux# ipmitool -I lanplus -U root -E -H "${BMC}" chassis bootdev pxe options=efiboot
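If desired, the boot device setting can be read back to confirm it took effect. This is an optional check; the exact wording of the output varies by BMC vendor, but it should indicate that PXE was selected as the forced boot device:
linux# ipmitool -I lanplus -U root -E -H "${BMC}" chassis bootparam get 5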
Power off the node.
linux# ipmitool -I lanplus -U root -E -H "${BMC}" chassis power off
Verify that the node is off.
linux# ipmitool -I lanplus -U root -E -H "${BMC}" chassis power status
Ensure the power is reporting as off. It may take 5-10 seconds for this to update. Wait about 30 seconds after receiving the correct power status before issuing the next command.
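As an alternative to checking manually, the power status can be polled until it reports off. This is a minimal sketch using the same ipmitool command as above; it assumes the status line contains "is off" when the node is powered down, and the sleep interval can be adjusted as needed:
linux# until ipmitool -I lanplus -U root -E -H "${BMC}" chassis power status | grep -q "is off"; do sleep 5; done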
Power on the node.
linux# ipmitool -I lanplus -U root -E -H "${BMC}" chassis power on
Verify that the node is on.
linux# ipmitool -I lanplus -U root -E -H "${BMC}" chassis power status
Ensure the power is reporting as on. It may take 5-10 seconds for this to update.
Observe the boot.
After a short wait, the node should begin to boot. This can be viewed from the ConMan console window. Eventually, a NBP file... message will appear in the console output, indicating that the PXE boot has begun the TFTP download of the ipxe program. Messages will appear as the Linux kernel loads, and later when the scripts in the initrd begin to run, including cloud-init.
Wait until cloud-init displays messages similar to the following on the console, indicating that cloud-init has finished with the module called modules:final.
[ 295.466827] cloud-init[9333]: Cloud-init v. 20.2-8.45.1 running 'modules:final' at Thu, 26 Aug 2021 15:23:20 +0000. Up 125.72 seconds.
[ 295.467037] cloud-init[9333]: Cloud-init v. 20.2-8.45.1 finished at Thu, 26 Aug 2021 15:26:12 +0000. Datasource DataSourceNoCloudNet [seed=cmdline,http://10.92.100.81:8888/][dsmode=net]. Up 295.46 seconds
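If the completion messages scroll past on the console, the state can also be checked after logging in to the rebuilt node. cloud-init reports status: done once it has finished all stages, including modules:final:
ncn# cloud-init status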
Troubleshooting: If the NBP file... output never appears, or something else goes wrong, then go back to the steps for modifying the XNAME.json file (see the step to inspect and modify the JSON file) and make sure those instructions were completed correctly.
Master nodes only: If cloud-init did not complete, then the newly rebuilt node will need to have its etcd service definition manually updated. Reconfigure the etcd service and restart cloud-init on the newly rebuilt master:
ncn-m# systemctl stop etcd.service; sed -i 's/new/existing/' \
/etc/systemd/system/etcd.service /srv/cray/resources/common/etcd/etcd.service; \
systemctl daemon-reload ; rm -rf /var/lib/etcd/member; \
systemctl start etcd.service; /srv/cray/scripts/common/kubernetes-cloudinit.sh
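To spot-check that the reconfigured etcd service came back up before continuing, the service state can be queried. This is only a minimal sketch of a check on the local service, not a full cluster health verification:
ncn-m# systemctl is-active etcd.service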
Rebuilt node with modified SSH keys: The cloud-init process can fail when accessing other nodes if SSH keys have been modified in the cluster. If this occurs, the following steps can be used to repair the desired SSH keys on the newly rebuilt node:
Allow cloud-init to fail because of the non-matching keys.
Copy the correct SSH keys to the newly rebuilt node (see the example sketch after this list).
Re-run cloud-init on the newly rebuilt node.
ncn-m# cloud-init clean; cloud-init init --local; cloud-init init
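For the key copy step above, the exact method depends on how SSH keys are managed at the site. As one hedged example, assuming the root keys live in /root/.ssh, that the default file names (id_rsa, id_rsa.pub, authorized_keys) are in use, and that the rebuilt node is reachable from a healthy node, the keys could be copied with scp:
ncn-m# scp -p /root/.ssh/id_rsa /root/.ssh/id_rsa.pub /root/.ssh/authorized_keys "${NODE}":/root/.ssh/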
Press enter on the console to ensure that the login prompt is displayed, including the correct hostname of this node.
Exit the ConMan console.
Type & and then . to exit the console.
Confirm that vlan004 is up with the correct IP address on the rebuilt node.
Run these commands on the rebuilt node.
Find the desired IP address.
These commands assume that the variables from the prerequisites section have been set.
ncn# dig +short ${NODE}.hmn
Example output:
10.254.1.16
Show the IP address configured for vlan004 on the rebuilt node.
ncn# ip addr show vlan004
Example output:
14: vlan004@bond0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether b8:59:9f:2b:2f:9e brd ff:ff:ff:ff:ff:ff
inet 10.254.1.16/17 brd 10.254.127.255 scope global vlan004
valid_lft forever preferred_lft forever
inet6 fe80::ba59:9fff:fe2b:2f9e/64 scope link
valid_lft forever preferred_lft forever
Confirm that the IP addresses from the two commands match.
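Optionally, the comparison can be scripted. This sketch assumes that vlan004 has a single IPv4 address and that the ${NODE} variable is set as described in the prerequisites; it prints MATCH when the DNS entry and the configured address agree:
ncn# [ "$(dig +short ${NODE}.hmn)" = "$(ip -4 -o addr show vlan004 | awk '{print $4}' | cut -d/ -f1)" ] && echo "MATCH" || echo "MISMATCH"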
If they do not match, then perform the following remediation steps:
Change the IP address for vlan004.
Edit the /etc/sysconfig/network/ifcfg-vlan004 file on the rebuilt node.
Set the IPADDR line to the correct IP address with a /17 mask.
For example, if the correct IP address is 10.254.1.16, then the line should be:
IPADDR='10.254.1.16/17'
Restart the vlan004 network interface.
ncn# wicked ifreload vlan004
Repeat the earlier steps and confirm that the IP addresses now match.
Confirm that vlan007 is up with the correct IP address on the rebuilt node.
Run these commands on the rebuilt node.
Find the desired IP address.
These commands assume that the variables from the prerequisites section have been set.
ncn# dig +short ${NODE}.can
Example output:
10.103.8.11
Show the IP address configured for vlan007 on the rebuilt node.
ncn# ip addr show vlan007
Example output:
15: vlan007@bond0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether b8:59:9f:2b:2f:9e brd ff:ff:ff:ff:ff:ff
inet 10.103.8.11/24 brd 10.103.8.255 scope global vlan007
valid_lft forever preferred_lft forever
inet6 fe80::ba59:9fff:fe2b:2f9e/64 scope link
valid_lft forever preferred_lft forever
Confirm that the IP addresses from the two commands match.
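The same optional scripted comparison shown for vlan004 can be adapted here, again assuming a single IPv4 address on the interface and that ${NODE} is set:
ncn# [ "$(dig +short ${NODE}.can)" = "$(ip -4 -o addr show vlan007 | awk '{print $4}' | cut -d/ -f1)" ] && echo "MATCH" || echo "MISMATCH"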
If they do not match, then perform the following remediation steps:
Change the IP address for vlan007.
Edit the /etc/sysconfig/network/ifcfg-vlan007 file on the rebuilt node.
Set the IPADDR line to the correct IP address with a /24 mask.
For example, if the correct IP address is 10.103.8.11, then the line should be:
IPADDR='10.103.8.11/24'
Restart the vlan007 network interface.
ncn# wicked ifreload vlan007
Repeat the earlier steps and confirm that the IP addresses now match.
Set the wipe flag back so that the disk will not be wiped when the node is rebooted.
Run the following commands from a node that has the cray CLI initialized.
ncn# cray bss bootparameters list --name "${XNAME}" --format=json | jq .[] > "${XNAME}.json"
Edit the XNAME.json file and set the metal.no-wipe=1 value.
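Any editor can be used for this change. If the file currently contains metal.no-wipe=0 in its kernel parameter string, a one-line substitution such as the following sketch could also work; review the file afterward to confirm the change took effect:
ncn# sed -i 's/metal.no-wipe=0/metal.no-wipe=1/' "${XNAME}.json"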
Get a token to interact with BSS using the REST API.
ncn# TOKEN=$(curl -s -S -d grant_type=client_credentials -d client_id=admin-client \
-d client_secret=`kubectl get secrets admin-client-auth -o jsonpath='{.data.client-secret}' | base64 -d` \
https://api-gw-service-nmn.local/keycloak/realms/shasta/protocol/openid-connect/token \
| jq -r '.access_token')
Do a PUT action for the edited JSON file.
This command can be run from any node.
ncn# curl -i -s -k -H "Content-Type: application/json" \
-H "Authorization: Bearer ${TOKEN}" \
"https://api-gw-service-nmn.local/apis/bss/boot/v1/bootparameters" \
-X PUT -d @"./${XNAME}.json"
Verify that the bss bootparameters list command returns the expected information.
Export the list from BSS to a file with a different name.
ncn# cray bss bootparameters list --name "${XNAME}" --format=json | jq .[] > "${XNAME}.check.json"
Compare the new JSON file with what was put into BSS.
This command should give no output, because the files should be identical.
ncn# diff "${XNAME}.json" "${XNAME}.check.json"
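As an additional spot check, the wipe flag value can be pulled out of the file returned by BSS. This assumes the parameter appears in the params string; the expected output is metal.no-wipe=1:
ncn# grep -o "metal.no-wipe=[01]" "${XNAME}.check.json"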
Proceed to the next step to Validate Boot Raid.