Power Cycle and Rebuild Nodes

This section applies to all node types. The commands in this section assume the variables from the prerequisites section have been set.

Procedure

  1. Open and watch the console for the node being rebuilt.

  2. Log in to a second session and use it to watch the console, following the instructions at the link below:

    Open this link in a new tab or page: Log in to a Node Using ConMan

    The first session will be needed to run the commands in the following Rebuild Node steps.

  3. (ncn#) Set the PXE boot option and power cycle the node.

    IMPORTANT: Run these commands from a node NOT being rebuilt.

    IMPORTANT: The commands in this section assume the variables from the prerequisites section have been set.

    1. Set the BMC variable to the hostname of the BMC of the node being rebuilt.

      BMC="${NODE}-mgmt"; echo $BMC
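
      For example, if NODE is set to ncn-w002, this prints ncn-w002-mgmt. As a quick sanity check (a sketch; it assumes NODE was exported per the prerequisites section), guard against an unset variable before running any ipmitool commands:

      [[ -n "${NODE:-}" ]] || echo "ERROR: NODE is not set" >&2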
      
    2. Set and export the root password of the BMC.

      NOTE: read -s is used to prevent the password from echoing to the screen or being saved in the shell history.

      read -r -s -p "${BMC} root password: " IPMI_PASSWORD
      
      export IPMI_PASSWORD
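
      To confirm the BMC is reachable and the credentials work before changing any settings, run a read-only query (a sketch using the same connection options as the commands below):

      ipmitool -I lanplus -U root -E -H "${BMC}" mc info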
      
    3. Set the PXE/efiboot option.

      ipmitool -I lanplus -U root -E -H "${BMC}" chassis bootdev pxe options=efiboot
      
    4. Power off the node.

      ipmitool -I lanplus -U root -E -H "${BMC}" chassis power off
      
    5. Verify that the node is off.

      ipmitool -I lanplus -U root -E -H "${BMC}" chassis power status
      

      Ensure the power is reporting as off. It may take 5-10 seconds for the status to update. Wait about 30 seconds after the power status reports off before issuing the next command.
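
      Rather than re-running the status command by hand, a small polling loop (a sketch built on the same ipmitool call) can wait until the chassis reports off:

      until ipmitool -I lanplus -U root -E -H "${BMC}" chassis power status | grep -q "is off"; do sleep 5; done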

    6. Power on the node.

      ipmitool -I lanplus -U root -E -H "${BMC}" chassis power on
      
    7. Verify that the node is on.

      ipmitool -I lanplus -U root -E -H "${BMC}" chassis power status
      

      Ensure the power is reporting as on. It may take 5-10 seconds for the status to update.
      
  4. Observe the boot.

    After a bit, the node should begin to boot. This can be viewed in the ConMan console window. Eventually, an NBP file... message in the console output indicates that the PXE boot has begun the TFTP download of the iPXE program. Messages will appear as the Linux kernel loads, and later when the scripts in the initrd begin to run, including cloud-init.

  5. Wait until cloud-init displays messages similar to the following on the console, indicating that cloud-init has finished the modules:final stage.

    [  295.466827] cloud-init[9333]: Cloud-init v. 20.2-8.45.1 running 'modules:final' at Thu, 26 Aug 2021 15:23:20 +0000. Up 125.72 seconds.
    [  295.467037] cloud-init[9333]: Cloud-init v. 20.2-8.45.1 finished at Thu, 26 Aug 2021 15:26:12 +0000. Datasource DataSourceNoCloudNet [seed=cmdline,http://10.92.100.81:8888/][dsmode=net].  Up 295.46 seconds
    

    Troubleshooting:

    • If the NBP file... output never appears, or something else goes wrong, then go back to the steps for modifying the XNAME.json file (see the step to inspect and modify the JSON file) and make sure those instructions were completed correctly.

    • Master nodes only: If cloud-init did not complete, then the newly rebuilt node will need to have its etcd service definition manually updated. Reconfigure the etcd service and restart cloud-init on the newly rebuilt master:

      # Switch etcd from bootstrapping a new cluster to joining the existing one
      systemctl stop etcd.service
      sed -i 's/new/existing/' /etc/systemd/system/etcd.service /srv/cray/resources/common/etcd/etcd.service
      systemctl daemon-reload
      # Remove stale member data so the node rejoins cleanly, then restart
      rm -rvf /var/lib/etcd/member
      systemctl start etcd.service
      /srv/cray/scripts/common/kubernetes-cloudinit.sh
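
      Afterward, confirm that etcd is active again (a quick check, assuming systemd manages etcd as above); the node should then finish rejoining the cluster:

      systemctl is-active etcd.service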
      
    • Rebuilt node with modified SSH keys: The cloud-init process can fail when accessing other nodes if SSH keys have been modified in the cluster. If this occurs, the following steps can be used to repair the desired SSH keys on the newly rebuilt node:

      1. Allow cloud-init to fail because of the non-matching keys.

      2. Copy the correct SSH keys to the newly rebuilt node.

      3. Re-run cloud-init on the newly rebuilt node.

        cloud-init clean
        cloud-init init --local
        cloud-init init
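
        After the re-run, cloud-init's status subcommand reports whether all boot stages completed:

        cloud-init status --long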
        
    • Verify cloud-init completed: If the console is not showing the expected output for cloud-init completing, but the power-cycled node is reachable via SSH, then run the following steps to verify whether cloud-init successfully completed.

      1. The last thing cloud-init changes is the file /etc/cloud/cloud-init.disabled. Check that the timestamp on this file corresponds to the most recent power cycle and the time that cloud-init would have completed.

        ssh $NODE ls -l /etc/cloud/cloud-init.disabled
        
      2. Check the last line in /var/log/cloud-init-output.log.

        ssh $NODE 'tail -1 /var/log/cloud-init-output.log'
      

      Expected output:

      The system is finally up, after 214.43 seconds cloud-init has come to completion.
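
      To compare the boot time against the file's timestamp in one step, something like the following can help (a sketch; it assumes uptime -s and GNU stat are available on the node):

      ssh $NODE 'uptime -s; stat -c %y /etc/cloud/cloud-init.disabled'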
      
  6. Press enter on the console to ensure that the login prompt is displayed, including the correct hostname of this node.

  7. Exit the ConMan console.

    Type & and then . (an ampersand followed by a period).

  8. (ncn#) Set the wipe flag back so that the disk will not be wiped when the node is rebooted.

    1. Run the following commands from a node that has the cray CLI initialized.

      See Configure the Cray CLI.

      cray bss bootparameters list --name "${XNAME}" --format=json | jq .[] > "${XNAME}.json"
      
    2. Edit the XNAME.json file and set the metal.no-wipe=1 value.
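
      This can be done in a text editor, or non-interactively with jq (a sketch; it assumes the flag currently appears as metal.no-wipe=0 inside the params string of the JSON file):

      jq '.params |= sub("metal.no-wipe=0"; "metal.no-wipe=1")' "${XNAME}.json" > "${XNAME}.json.tmp" && mv "${XNAME}.json.tmp" "${XNAME}.json"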

    3. Get a token to interact with BSS using the REST API.

      TOKEN=$(curl -s -S -d grant_type=client_credentials -d client_id=admin-client \
               -d client_secret=`kubectl get secrets admin-client-auth -o jsonpath='{.data.client-secret}' | base64 -d` \
               https://api-gw-service-nmn.local/keycloak/realms/shasta/protocol/openid-connect/token \
               | jq -r '.access_token')
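
      If authentication fails, TOKEN will be empty or the literal string null; a quick check before the PUT avoids a confusing 401 error later:

      [[ -n "${TOKEN}" && "${TOKEN}" != "null" ]] && echo "Token acquired." || echo "ERROR: failed to get a token" >&2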
      
    4. Do a PUT action for the edited JSON file.

      This command can be run from any node.

      curl -i -s -k -H "Content-Type: application/json" \
          -H "Authorization: Bearer ${TOKEN}" \
          "https://api-gw-service-nmn.local/apis/bss/boot/v1/bootparameters" \
          -X PUT -d @"./${XNAME}.json"
      
    5. Verify that the bss bootparameters list command returns the expected information.

      1. Export the list from BSS to a file with a different name.

        cray bss bootparameters list --name "${XNAME}" --format=json | jq .[] > "${XNAME}.check.json"
        
      2. Compare the new JSON file with what was put into BSS.

        This command should give no output, because the files should be identical.

        diff "${XNAME}.json" "${XNAME}.check.json"
        
  9. Update the SSH keys for the rebuilt node.

    These commands remove the old host keys of the rebuilt node from the known_hosts file and add the new ones.

    node_ip=$(host "$NODE" | awk '{ print $NF }')
    ssh-keygen -R "$NODE" -f ~/.ssh/known_hosts > /dev/null 2>&1
    ssh-keygen -R "$node_ip" -f ~/.ssh/known_hosts > /dev/null 2>&1
    ssh-keyscan -H "$NODE,$node_ip" >> ~/.ssh/known_hosts
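
    To confirm that SSH to the rebuilt node now works without a host-key prompt, a simple remote command can be run:

    ssh "$NODE" hostname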
    

Next Step

If executing this procedure as part of an NCN rebuild, return to the main Rebuild NCNs page and proceed with the next step.