WARNING:
Gigabyte NCNs running firmware version C20 can become unusable when Shasta 1.4 is installed. This is a result of a bug in the Gigabyte firmware that ships with Shasta 1.4. This bug has not been observed in firmware version C17. A key symptom of this bug is that the NCN will not PXE boot and will instead fall through to the boot menu, despite being configured to PXE boot. This behavior will persist until the failing node's CMOS is cleared.
A procedure is available in 254-NCN-FIRMWARE-GB.md.
This document specifies the procedures for deploying the non-compute nodes (NCNs).
INTERNAL USE: This section is only relevant for Cray/HPE internal systems.
SKIP IF AIRGAP/OFFLINE: Do NOT reconfigure the bootstrap registry to proxy an upstream registry if performing an airgap/offline install.
By default, the bootstrap registry is a type: hosted Nexus repository to support airgap/offline installs, which requires container images to be imported prior to platform installation. However, it may be reconfigured to proxy container images from an upstream registry in order to support online installs, as follows:
Stop Nexus:
pit# systemctl stop nexus
Remove the nexus container:
pit# podman container exists nexus && podman container rm nexus
Remove the nexus-data volume:
pit# podman volume rm nexus-data
Add the corresponding URL to the ExecStartPost script in /usr/lib/systemd/system/nexus.service.
INTERNAL USE
Cray internal systems may want to proxy to https://dtr.dev.cray.com as follows:
pit# URL=https://dtr.dev.cray.com
pit# sed -e "s,^\(ExecStartPost=/usr/sbin/nexus-setup.sh\).*$,\1 $URL," -i /usr/lib/systemd/system/nexus.service
Restart Nexus:
pit# systemctl daemon-reload
pit# systemctl start nexus
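To confirm Nexus came back up cleanly, check the service and container state (a quick sanity check; exact output will vary):
pit# systemctl is-active nexus
pit# podman ps --filter name=nexus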
The following tokens will assist an administrator as they follow this page. Copy these into the shell environment. Notice that one of them is the IPMI_PASSWORD. These exist to avoid hard-coded values, so they may be used in various system contexts.
pit# \
export mtoken='ncn-m(?!001)\w+-mgmt'
export stoken='ncn-s\w+-mgmt'
export wtoken='ncn-w\w+-mgmt'
export username=root
# Replace "changeme" with the real root password.
export IPMI_PASSWORD=changeme
Throughout the guide, simple one-liners can be used to query the status of expected nodes. If the shell or environment is terminated, these environment variables should be re-exported.
Examples:
# Power status of all expected NCNs:
pit# grep -oP "($mtoken|$stoken|$wtoken)" /etc/dnsmasq.d/statics.conf | xargs -t -i ipmitool -I lanplus -U $username -E -H {} power status
# Power off all expected NCNs:
pit# grep -oP "($mtoken|$stoken|$wtoken)" /etc/dnsmasq.d/statics.conf | xargs -t -i ipmitool -I lanplus -U $username -E -H {} power off
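Optionally, these variables can be persisted so they survive a new login shell. A convenience sketch, assuming a bash login shell on the PIT node; IPMI_PASSWORD is deliberately left out so the real password is not stored on disk and must still be re-exported by hand:
pit# cat >> ~/.bashrc <<'EOF'
export mtoken='ncn-m(?!001)\w+-mgmt'
export stoken='ncn-s\w+-mgmt'
export wtoken='ncn-w\w+-mgmt'
export username=root
EOF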
The timing of each set of boots varies based on hardware; some manufacturers POST faster than others, and timing can also vary with BIOS settings. After powering on a set of nodes, an administrator can expect a healthy boot session to take about 60 minutes, depending on the number of storage and worker nodes.
This section will walk an administrator through NCN deployment. If loading this page from a bookmark, grab the tokens above to facilitate the commands. There will be post-boot workarounds as well.
Check for workarounds in the /opt/cray/csm/workarounds/before-ncn-boot directory within the CSM tar. If there are any workarounds in that directory, run those now. Each workaround has its own instructions in its respective README.md file.
# Example
pit# ls /opt/cray/csm/workarounds/before-ncn-boot
If there is a workaround here, the output looks similar to the following:
CASMINST-980
NOTE: If you wish to use a timezone other than UTC, instead of step 1 below, follow this procedure for setting a local timezone, then proceed to step 2.
Ensure that the PIT node has the current and correct time.
This step should not be skipped.
Check the current time to see if it is accurate:
pit# date "+%Y-%m-%d %H:%M:%S.%6N%z"
The time can be inaccurate if the system has been off for a long time, or, for example, the CMOS was cleared. If needed, set the time manually as close as possible.
pit# timedatectl set-time "2019-11-15 00:00:00"
Finally, run the NTP script:
pit# /root/bin/configure-ntp.sh
This ensures that the PIT is configured with an accurate date/time, which will be properly propagated to the NCNs during boot.
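To verify the result, timedatectl reports both the clock and the NTP synchronization state (field names vary slightly between systemd versions):
pit# timedatectl status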
Ensure the current time is set in BIOS for all management NCNs.
If each NCN is booted to the BIOS menu, you can check and set the current UTC time.
pit# export username=root
pit# export IPMI_PASSWORD=changeme
Repeat the following process for each NCN.
Start an IPMI console session to the NCN.
pit# bmc=ncn-w001-mgmt # Change this to be each node in turn.
pit# conman -j $bmc
In another terminal, boot the node to BIOS.
pit# bmc=ncn-w001-mgmt # Change this to be each node in turn.
pit# ipmitool -I lanplus -U $username -E -H $bmc chassis bootdev bios
pit# ipmitool -I lanplus -U $username -E -H $bmc chassis power off
pit# sleep 10
pit# ipmitool -I lanplus -U $username -E -H $bmc chassis power on
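Optionally, confirm the boot-to-BIOS flag was accepted; IPMI boot parameter 5 holds the boot flags, though the decoded output varies by BMC vendor:
pit# ipmitool -I lanplus -U $username -E -H $bmc chassis bootparam get 5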
For HPE NCNs the above process will boot the nodes to their BIOS, but the menu is unavailable through conman, because the node boots into a graphical BIOS menu.
To access the serial version of the BIOS setup, perform the ipmitool steps above to boot the node. Then, in conman, press the ESC+9 key combination when you see the following messages in the console. This opens a menu that can be used to enter the BIOS via conman:
For access via BIOS Serial Console:
Press 'ESC+9' for System Utilities
Press 'ESC+0' for Intelligent Provisioning
Press 'ESC+!' for One-Time Boot Menu
Press 'ESC+@' for Network Boot
For HPE NCNs the date configuration menu can be found at the following path:
System Configuration -> BIOS/Platform Configuration (RBSU) -> Date and Time
Alternatively, for HPE NCNs you can log in to the BMC's web interface and access the HTML5 console for the node to interact with the graphical BIOS. From the administrator's own machine, create an SSH tunnel (-L creates the tunnel, and -N prevents a shell and stubs the connection):
# Change this to be each node in turn.
linux# bmc=ncn-w001-mgmt
linux# ssh -L 9443:$bmc:443 -N root@eniac-ncn-m001
Opening a web browser to https://localhost:9443 will give access to the BMC's web interface.
When the node boots, you will be able to use the conman session to see the BIOS menu to check and set the time to current UTC time. The process varies depending on the vendor of the NCN.
Repeat this process for each NCN.
Deployment of the nodes starts with booting the storage nodes first, then the master nodes and worker nodes together. After the operating system boots on each node there are some configuration actions which take place. Watching the console or the console log for certain nodes can help to understand what happens and when. When the process is complete for all nodes, the Ceph storage will have been initialized and the Kubernetes cluster will be created ready for a workload.
The configuration workflow described here is intended to help understand the expected path for booting and configuring. See the actual steps below for the commands to deploy these management NCNs.
The remaining master nodes fetch the join command from /etc/cray/kubernetes/join-command-control-plane so they can join Kubernetes; each worker node likewise fetches /etc/cray/kubernetes/join-command-control-plane so it can join Kubernetes.
Change the default root password and SSH keys
If you want to avoid using the default install root password and SSH keys for the NCNs, follow the NCN image customization steps in 110 NCN Image Customization.
This step is strongly encouraged for external/site deployments.
Create boot directories for any NCN in DNS:
This will create folders for each host in /var/www, allowing each host to have its own unique set of artifacts: kernel, initrd, SquashFS image, and script.ipxe boot script.
pit# /root/bin/set-sqfs-links.sh
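To spot-check the result, list the per-host directories the script created (names depend on the NCNs in DNS):
pit# ls -d /var/www/ncn-*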
Customize boot scripts for any out-of-baseline NCNs
Set each node to always UEFI Network Boot, and ensure they are powered off:
pit# grep -oP "($mtoken|$stoken|$wtoken)" /etc/dnsmasq.d/statics.conf | xargs -t -i ipmitool -I lanplus -U $username -E -H {} chassis bootdev pxe options=efiboot,persistent
pit# grep -oP "($mtoken|$stoken|$wtoken)" /etc/dnsmasq.d/statics.conf | xargs -t -i ipmitool -I lanplus -U $username -E -H {} power off
Note: some BMCs will "flake" and ignore the boot order setting by ipmitool. As a fallback, cloud-init will correct the boot order after NCNs complete their first boot. The first boot may need manual effort to set the boot order over the conman console. The NCN boot order is further explained in 101 NCN Booting.
Validate that the LiveCD is ready for installing NCNs. Observe the output of the checks and note any failures, then remediate them.
pit# csi pit validate --livecd-preflight
Note: If your shell terminal is not echoing your input after running this, type “reset” and press enter to recover.
Note: If you are not on an internal Cray/HPE system, or if you are on an offline/airgapped system, then you can ignore any errors about not being able to resolve arti.dev.cray.com.
Print the consoles available to you:
pit# conman -q
Expected output looks similar to the following:
ncn-m001-mgmt
ncn-m002-mgmt
ncn-m003-mgmt
ncn-s001-mgmt
ncn-s002-mgmt
ncn-s003-mgmt
ncn-w001-mgmt
ncn-w002-mgmt
ncn-w003-mgmt
IMPORTANT: This is the administrator's last chance to run NCN pre-boot workarounds.
NOTE: All consoles are located at /var/log/conman/console*.
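For example, to follow a console log without joining the interactive session (a sketch assuming conman's default console.<node> log naming):
pit# tail -f /var/log/conman/console.ncn-s001-mgmt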
Boot the storage nodes:
pit# grep -oP $stoken /etc/dnsmasq.d/statics.conf | xargs -t -i ipmitool -I lanplus -U $username -E -H {} power on
Wait. Observe the installation through ncn-s001-mgmt’s console:
# Print the console name
pit# conman -q | grep s001
Expected output looks similar to the following:
ncn-s001-mgmt
Then join the console:
# Join the console
pit# conman -j ncn-s001-mgmt
From there an administrator can witness console-output for the cloud-init scripts.
NOTE: Watch the storage node consoles carefully for error messages. If any are seen, consult 066-CEPH-CSI.
NOTE: If the nodes have PXE boot issues (e.g. getting PXE errors, not pulling the ipxe.efi binary), see PXE boot troubleshooting.
NOTE: If other issues arise, such as cloud-init issues (e.g. NCNs come up to Linux with no hostname), see the CSM workarounds for fixes around mutual symptoms.
# Example
pit# ls /opt/cray/csm/workarounds/after-ncn-boot
If there is a workaround here, the output looks similar to the following:
CASMINST-1093
Once all storage nodes are up and ncn-s001 is running ceph-ansible, boot the Kubernetes managers and workers:
pit# grep -oP "($mtoken|$wtoken)" /etc/dnsmasq.d/statics.conf | xargs -t -i ipmitool -I lanplus -U $username -E -H {} power on
Wait. Observe the installation through ncn-m002-mgmt’s console:
# Print the console name
pit# conman -q | grep m002
Expected output looks similar to the following:
ncn-m002-mgmt
Then join the console:
# Join the console
pit# conman -j ncn-m002-mgmt
NOTE: If the nodes have PXE boot issues (e.g. getting PXE errors, not pulling the ipxe.efi binary), see PXE boot troubleshooting.
NOTE: If one of the manager nodes seems hung waiting for the storage nodes to create a secret, check the storage node consoles for error messages. If any are found, consult 066-CEPH-CSI.
NOTE: If other issues arise, such as cloud-init issues (e.g. NCNs come up to Linux with no hostname), see the CSM workarounds for fixes around mutual symptoms.
# Example
pit# ls /opt/cray/csm/workarounds/after-ncn-boot
If there is a workaround here, the output looks similar to the following:
CASMINST-1093
Refer to the timing of deployments. It should not take more than 60 minutes for the kubectl get nodes command to return output indicating that all the managers and workers, aside from the PIT node booted from the LiveCD, are Ready:
pit# ssh ncn-m002
ncn-m002# kubectl get nodes -o wide
Expected output looks similar to the following:
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
ncn-m002 Ready master 14m v1.18.6 10.252.1.5 <none> SUSE Linux Enterprise High Performance Computing 15 SP2 5.3.18-24.43-default containerd://1.3.4
ncn-m003 Ready master 13m v1.18.6 10.252.1.6 <none> SUSE Linux Enterprise High Performance Computing 15 SP2 5.3.18-24.43-default containerd://1.3.4
ncn-w001 Ready <none> 6m30s v1.18.6 10.252.1.7 <none> SUSE Linux Enterprise High Performance Computing 15 SP2 5.3.18-24.43-default containerd://1.3.4
ncn-w002 Ready <none> 6m16s v1.18.6 10.252.1.8 <none> SUSE Linux Enterprise High Performance Computing 15 SP2 5.3.18-24.43-default containerd://1.3.4
ncn-w003 Ready <none> 5m58s v1.18.6 10.252.1.12 <none> SUSE Linux Enterprise High Performance Computing 15 SP2 5.3.18-24.43-default containerd://1.3.4
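Rather than re-running the command by hand, a hedged one-liner from the PIT node prints only the nodes that are not yet Ready (empty output means all nodes are Ready):
pit# ssh ncn-m002 kubectl get nodes --no-headers | awk '$2 != "Ready" {print $1, $2}'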
IMPORTANT:
Do the following if NCNs use Gigabyte hardware.
Log in to each ncn-s node and check for unused drives:
ncn-s# ceph-volume inventory
The "available" field will be true if Ceph sees the drive as empty and able to be used, e.g.:
Device Path Size rotates available Model name
/dev/sda 447.13 GB False False SAMSUNG MZ7LH480
/dev/sdb 447.13 GB False False SAMSUNG MZ7LH480
/dev/sdc 3.49 TB False False SAMSUNG MZ7LH3T8
/dev/sdd 3.49 TB False False SAMSUNG MZ7LH3T8
/dev/sde 3.49 TB False False SAMSUNG MZ7LH3T8
/dev/sdf 3.49 TB False False SAMSUNG MZ7LH3T8
/dev/sdg 3.49 TB False False SAMSUNG MZ7LH3T8
/dev/sdh 3.49 TB False False SAMSUNG MZ7LH3T8
Alternatively, just dump the paths of available drives:
ncn-s# ceph-volume inventory --format json-pretty | jq -r '.[]|select(.available==true)|.path'
Add unused drives:
ncn-s# ceph-volume lvm create --data /dev/sd<drive to add> --bluestore
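If several drives are unused, the inventory and create commands above can be combined into a loop. A convenience sketch; review the inventory output first, since this consumes every drive Ceph reports as available:
ncn-s# for dev in $(ceph-volume inventory --format json-pretty | jq -r '.[]|select(.available==true)|.path'); do ceph-volume lvm create --data "$dev" --bluestore; done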
Check for workarounds in the /opt/cray/csm/workarounds/after-ncn-boot directory. If there are any workarounds in that directory, run those now. Instructions are in the README files.
# Example
pit# ls /opt/cray/csm/workarounds/after-ncn-boot
If there is a workaround here, the output looks similar to the following:
CASMINST-12345
The LiveCD needs to authenticate with the cluster to facilitate the rest of the CSM installation.
Copy the Kubernetes config to the LiveCD to be able to use kubectl as cluster administrator.
This will always be whatever node is the first-master-hostname in your /var/www/ephemeral/configs/data.json file. If you are provisioning your CRAY from ncn-m001, then you can expect to fetch these from ncn-m002.
pit# mkdir ~/.kube
pit# scp ncn-m002.nmn:/etc/kubernetes/admin.conf ~/.kube/config
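Verify that kubectl now works from the LiveCD as cluster administrator:
pit# kubectl get nodes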
After the NCNs are booted, the BGP peers will need to be checked and updated if the neighbor IPs are incorrect on the switches. See the doc to Check and Update BGP Neighbors.
Note: If migrating from Shasta v1.3.x, the worker nodes have different IP addresses, so the scripts below must be run to correct the spine switch configuration to the Shasta v1.4 IP addresses for the worker nodes.
Make sure you clear the BGP sessions here.
Mellanox: clear bgp *
Aruba: enable followed by clear ip bgp all
NOTE: At this point, all but possibly one of the peering sessions with the BGP neighbors should be in IDLE or CONNECT state, and not in ESTABLISHED state. If the switch is an Aruba, you will have one peering session established with the other switch. You should check that all of the neighbor IPs are correct.
If needed, the following helper scripts are available for the various switch types:
pit# ls -1 /usr/bin/*peer*py
Expected output looks similar to the following:
/usr/bin/aruba_set_bgp_peers.py
/usr/bin/mellanox_set_bgp_peers.py
The following commands will run a series of remote tests on the other nodes to validate they are healthy and configured correctly.
Observe the output of the checks and note any failures, then remediate them.
Check Ceph
pit# csi pit validate --ceph
Note: Throughout the output there are multiple lines of test totals; be sure to check all of them, not just the final one.
Note: Please refer to the Utility Storage section of the Admin guide to help resolve any failed tests.
Check Kubernetes
pit# csi pit validate --k8s
WARNING: If test failures for /dev/sdc are observed, the Manual LVM Check Procedure below must be carried out to determine if they are true failures.
Ensure that weave has not split-brained
Run the following command on each member of the Kubernetes cluster (master nodes and worker nodes) to ensure that weave is operating as a single cluster:
ncn# weave --local status connections | grep failed
If you see messages like 'IP allocation was seeded by different peers', then weave looks to have split-brained. At this point it is necessary to wipe the NCNs and start the PXE boot again.
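To run the same weave check across every Kubernetes NCN from the PIT node, a hedged loop sketch (assuming each NCN is reachable over SSH by the hostname obtained by stripping the -mgmt suffix):
pit# for ncn in $(grep -oP "($mtoken|$wtoken)" /etc/dnsmasq.d/statics.conf | sed 's/-mgmt$//' | sort -u); do echo "--- $ncn"; ssh "$ncn" 'weave --local status connections | grep failed'; done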
If an automated test reports a failure relating to /dev/sdc on a master or worker NCN, this manual procedure must be followed to determine whether or not there is a real error.
To manually validate the ephemeral disks on a master node, run the following command:
ncn-m# blkid -L ETCDLVM
To manually validate the ephemeral disks on a worker node, run the following commands:
ncn-w# blkid -L CONLIB
ncn-w# blkid -L CONRUN
ncn-w# blkid -L K8SLET
The validation is considered successful if each of the commands returns the name of any device (e.g. /dev/sdd, /dev/sdb1, etc.). The name of the device does not matter; each command just needs to output the name of some device.
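A hedged convenience check for a worker node that names any missing label explicitly (blkid exits nonzero when a label is not found):
ncn-w# for label in CONLIB CONRUN K8SLET; do blkid -L "$label" || echo "MISSING: $label"; done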
If any nodes fail the validation, then the problem must be resolved before continuing with the install.
If any master node has the problem, then you must wipe and redeploy all of the NCNs (except ncn-m001, because it is the PIT node) using the 'Basic Wipe' section of DISK CLEANSLATE before continuing the installation.
If only worker nodes have the problem, then you must wipe and redeploy the affected worker nodes using the 'Basic Wipe' section of DISK CLEANSLATE before continuing the installation. When powering the nodes back on, the ipmitool command will give errors trying to power on the unaffected nodes, since they are already powered on; this is expected and not a problem.
These tests are for sanity checking. They exist as software reaches maturity, and as new tests are worked out and added to the installation repertoire.
All validation should be taken care of by the CSI validate commands. The following checks can be done for additional sanity checking.
Important common issues should be checked by tests; new pain points in these areas should prompt requests for new tests.
Do the following two steps outlined in Fixing Boot-Order for all NCNs except the PIT node.
Now move to the CSM Platform Install page to continue the CSM install.