This section applies to all node types. The commands in this section assume the variables from the prerequisites section have been set.
Open and watch the console for the node being rebuilt.
Log in to a second session and use it to watch the console, following the instructions at the link below:
Open this link in a new tab or window: Log in to a Node Using ConMan
The first session will be needed to run the commands in the following Rebuild Node steps.
(ncn#) Set the PXE boot option and power cycle the node.
IMPORTANT: Run these commands from a node NOT being rebuilt.
IMPORTANT: The commands in this section assume the variables from the prerequisites section have been set.
Set the BMC variable to the hostname of the BMC of the node being rebuilt.
BMC="${NODE}-mgmt"
Set and export the root password of the BMC.
NOTE: read -s is used to prevent the password from echoing to the screen or being saved in the shell history.
read -r -s -p "${BMC} root password: " IPMI_PASSWORD
export IPMI_PASSWORD
Set the PXE/efiboot option.
ipmitool -I lanplus -U root -E -H "${BMC}" chassis bootdev pxe options=efiboot
Power off the node.
ipmitool -I lanplus -U root -E -H "${BMC}" chassis power off
Verify that the node is off.
ipmitool -I lanplus -U root -E -H "${BMC}" chassis power status
Ensure the power is reported as off. It may take 5-10 seconds for the status to update. Wait about 30 seconds after receiving the correct power status before issuing the next command.
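The wait can also be scripted. The helper below is a minimal sketch, not part of the official procedure; `wait_for_status` is an illustrative name, and the commented usage line assumes the BMC and IPMI_PASSWORD variables set earlier in this section.

```shell
# wait_for_status: poll a command every few seconds until its output
# matches a pattern (illustrative helper, not part of the documented steps).
wait_for_status() {
    local cmd="$1" pattern="$2"
    until eval "$cmd" | grep -q "$pattern"; do
        sleep 5
    done
}

# Example usage against the BMC (assumes BMC and IPMI_PASSWORD are set):
# wait_for_status "ipmitool -I lanplus -U root -E -H \"${BMC}\" chassis power status" 'off$'
# sleep 30   # settle time before powering the node back on
```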
Power on the node.
ipmitool -I lanplus -U root -E -H "${BMC}" chassis power on
Verify that the node is on.
Ensure the power is reported as on. It may take 5-10 seconds for the status to update.
ipmitool -I lanplus -U root -E -H "${BMC}" chassis power status
Observe the boot.
After a bit, the node should begin to boot. This can be viewed from the ConMan console window. Eventually, there will be a NBP file... message in the console output, which indicates that the PXE boot has begun the TFTP download of the ipxe program. Messages will appear as the Linux kernel loads, and later when the scripts in the initrd begin to run, including cloud-init.
Wait until cloud-init displays messages similar to these on the console, indicating that cloud-init has finished with the module called modules:final.
[ 295.466827] cloud-init[9333]: Cloud-init v. 20.2-8.45.1 running 'modules:final' at Thu, 26 Aug 2021 15:23:20 +0000. Up 125.72 seconds.
[ 295.467037] cloud-init[9333]: Cloud-init v. 20.2-8.45.1 finished at Thu, 26 Aug 2021 15:26:12 +0000. Datasource DataSourceNoCloudNet [seed=cmdline,http://10.92.100.81:8888/][dsmode=net]. Up 295.46 seconds
Troubleshooting:
If the NBP file... output never appears, or something else goes wrong, then go back to the steps for modifying the XNAME.json file (see the step to inspect and modify the JSON file) and make sure those instructions were completed correctly.
Master nodes only: If cloud-init did not complete, then the newly rebuilt node will need to have its etcd service definition manually updated. Reconfigure the etcd service and restart cloud-init on the newly rebuilt master:
systemctl stop etcd.service
sed -i 's/new/existing/' /etc/systemd/system/etcd.service /srv/cray/resources/common/etcd/etcd.service
systemctl daemon-reload
rm -rvf /var/lib/etcd/member
systemctl start etcd.service
/srv/cray/scripts/common/kubernetes-cloudinit.sh
Rebuilt node with modified SSH keys: The cloud-init process can fail when accessing other nodes if SSH keys have been modified in the cluster. If this occurs, the following steps can be used to repair the desired SSH keys on the newly rebuilt node:
Allow cloud-init to fail because of the non-matching keys.
Copy the correct SSH keys to the newly rebuilt node.
Re-run cloud-init on the newly rebuilt node.
cloud-init clean
cloud-init init --local
cloud-init init
Verify that cloud-init completed.
If the console is not showing the expected output for cloud-init completing, but the power-cycled node is reachable via SSH, then run the following steps to verify whether cloud-init completed successfully.
One of the last things cloud-init changes is the file /etc/cloud/cloud-init.disabled. Check that the timestamp on this file corresponds to the most recent power cycle and the time that cloud-init would have completed.
ssh $NODE ls -l /etc/cloud/cloud-init.disabled
Check the last line of the cloud-init log file /var/log/cloud-init-output.log.
ssh $NODE 'tail -1 /var/log/cloud-init-output.log'
Expected output:
The system is finally up, after 214.43 seconds cloud-init has come to completion.
Press enter on the console to ensure that the login prompt is displayed, including the correct hostname of this node.
Exit the ConMan console.
Type & and then . to exit.
(ncn#) Set the wipe flag back so it will not wipe the disk when the node is rebooted.
Run the following commands from a node that has the cray CLI initialized.
cray bss bootparameters list --name "${XNAME}" --format=json | jq .[] > "${XNAME}.json"
Edit the XNAME.json file and set the metal.no-wipe=1 value.
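If the only change needed is flipping the wipe flag, the edit can be done non-interactively instead of in an editor. This is a sketch that assumes the params string in the exported file currently contains metal.no-wipe=0; inspect the file first and adjust if the flag is absent or set differently.

```shell
# Flip metal.no-wipe=0 to metal.no-wipe=1 in the exported boot parameters
# (sketch; assumes the flag is currently present with value 0).
sed -i 's/metal\.no-wipe=0/metal.no-wipe=1/' "${XNAME}.json"

# Confirm the result; expect metal.no-wipe=1.
grep -o 'metal.no-wipe=[01]' "${XNAME}.json"
```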
Get a token to interact with BSS using the REST API.
TOKEN=$(curl -s -S -d grant_type=client_credentials -d client_id=admin-client \
-d client_secret=`kubectl get secrets admin-client-auth -o jsonpath='{.data.client-secret}' | base64 -d` \
https://api-gw-service-nmn.local/keycloak/realms/shasta/protocol/openid-connect/token \
| jq -r '.access_token')
Do a PUT action for the edited JSON file.
This command can be run from any node.
curl -i -s -k -H "Content-Type: application/json" \
-H "Authorization: Bearer ${TOKEN}" \
"https://api-gw-service-nmn.local/apis/bss/boot/v1/bootparameters" \
-X PUT -d @"./${XNAME}.json"
Verify that the bss bootparameters list command returns the expected information.
Export the list from BSS to a file with a different name.
cray bss bootparameters list --name "${XNAME}" --format=json | jq .[] > "${XNAME}.check.json"
Compare the new JSON file with what was put into BSS.
This command should give no output, because the files should be identical.
diff "${XNAME}.json" "${XNAME}.check.json"
Update the SSH keys for the rebuilt node.
This command will update the SSH keys of the rebuilt node in the known_hosts file.
node_ip=$(host $NODE | awk '{ print $NF }')
ssh-keygen -R $NODE -f ~/.ssh/known_hosts > /dev/null 2>&1
ssh-keygen -R $node_ip -f ~/.ssh/known_hosts > /dev/null 2>&1
ssh-keyscan -H "$NODE,$node_ip" >> ~/.ssh/known_hosts
If executing this procedure as part of an NCN rebuild, return to the main Rebuild NCNs page and proceed with the next step.