The following procedure deploys Linux and Kubernetes software to the management NCNs. Deployment of the nodes starts with booting the storage nodes first, followed by the master nodes and worker nodes together.
After the operating system boots on each node, several configuration actions take place. Watching the console or the console log for certain nodes can help in understanding what happens and when. When the process completes for all nodes, the Ceph storage is initialized and the Kubernetes cluster is created and ready for a workload. The PIT node will join Kubernetes after it is rebooted later in Deploy Final NCN.
The timing of each set of boots varies based on hardware. Nodes from some manufacturers POST faster than others, and timing also varies with BIOS settings. After powering on a set of nodes, an administrator can expect a healthy boot session to take about 60 minutes, depending on the number of storage and worker nodes.
Preparation of the environment must be done before attempting to deploy the management nodes.
Define shell environment variables that will simplify later commands to deploy management nodes.
Set IPMI_PASSWORD
to the root password for the NCN BMCs.
read -s
is used to prevent the password from being written to the screen or the shell history.
pit# read -s IPMI_PASSWORD
pit# export IPMI_PASSWORD
Set the remaining helper variables.
These values do not need to be altered from what is shown.
pit# mtoken='ncn-m(?!001)\w+-mgmt' ; stoken='ncn-s\w+-mgmt' ; wtoken='ncn-w\w+-mgmt' ; export USERNAME=root
Throughout the guide, simple one-liners can be used to query status of expected nodes. If the shell or environment is terminated, these environment variables should be re-exported.
Examples:
Check power status of all NCNs.
pit# grep -oP "($mtoken|$stoken|$wtoken)" /etc/dnsmasq.d/statics.conf | sort -u |
xargs -t -i ipmitool -I lanplus -U $USERNAME -E -H {} power status
Power off all NCNs.
pit# grep -oP "($mtoken|$stoken|$wtoken)" /etc/dnsmasq.d/statics.conf | sort -u |
xargs -t -i ipmitool -I lanplus -U $USERNAME -E -H {} power off
NOTE: Optionally, in order to use a timezone other than UTC, instead of step 1 below, follow this procedure for setting a local timezone. Then proceed to step 2.
Ensure that the PIT node has the correct current time.
The time can be inaccurate if the system has been powered off for a long time, or, for example, the CMOS was cleared on a Gigabyte node. See Clear Gigabyte CMOS.
This step should not be skipped.
Check the time on the PIT node to see whether it matches the current time:
pit# date "+%Y-%m-%d %H:%M:%S.%6N%z"
If the time is inaccurate, set the time manually.
pit# timedatectl set-time "2019-11-15 00:00:00"
Run the NTP script:
pit# /root/bin/configure-ntp.sh
This ensures that the PIT is configured with an accurate date/time, which will be propagated to the NCNs during boot.
If the error Failed to set time: NTP unit is active
is observed, then stop chrony
first.
pit# systemctl stop chronyd
Then run the commands above to complete the process.
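Optionally, confirm that the PIT node is now synchronizing with an upstream time source. This is an informal spot check (not part of the documented procedure), assuming chronyd is running after the script completes:
pit# chronyc tracking
The Leap status field should report Normal, and the reported reference time should match the expected current time.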
Ensure that the current time is set in BIOS for all management NCNs.
Each NCN is booted to the BIOS menu, the date and time are checked, and set to the current UTC time if needed.
NOTE: Some steps in this procedure depend on
USERNAME
and IPMI_PASSWORD
being set. This is done in Tokens and IPMI Password.
Repeat the following process for each NCN.
Set the bmc
variable to the name of the BMC of the NCN being checked.
Important: Be sure to change the below example to the appropriate NCN.
pit# bmc=ncn-w001-mgmt
Start an IPMI console session to the NCN.
pit# conman -j $bmc
Using another terminal to watch the console, boot the node to BIOS.
pit# ipmitool -I lanplus -U $USERNAME -E -H $bmc chassis bootdev bios &&
ipmitool -I lanplus -U $USERNAME -E -H $bmc chassis power off && sleep 10 &&
ipmitool -I lanplus -U $USERNAME -E -H $bmc chassis power on
For HPE NCNs, the above process will boot the nodes to their BIOS; however, the BIOS menu is unavailable through conman because the node is booted into a graphical BIOS menu.
In order to access the serial version of the BIOS menu, perform the
ipmitool
steps above to boot the node. Then, in conman, press the ESC+9
key combination when the following messages are shown on the console. That key combination will open a menu that can be used to enter the BIOS using conman.
For access via BIOS Serial Console:
Press 'ESC+9' for System Utilities
Press 'ESC+0' for Intelligent Provisioning
Press 'ESC+!' for One-Time Boot Menu
Press 'ESC+@' for Network Boot
For HPE NCNs, the date configuration menu is at the following path:
System Configuration -> BIOS/Platform Configuration (RBSU) -> Date and Time
.
Alternatively, for HPE NCNs, log in to the BMC’s web interface and access the HTML5 console for the node, in order to interact with the graphical BIOS. From the administrator’s own machine, create an SSH tunnel (
-L
creates the tunnel; -N
prevents a shell and stubs the connection):
linux# bmc=ncn-w001-mgmt   # Change this to be the appropriate node
linux# ssh -L 9443:$bmc:443 -N root@eniac-ncn-m001
Opening a web browser to
https://localhost:9443
will give access to the BMC’s web interface.
When the node boots, the conman session can be used to see the BIOS menu, in order to check and set the time to current UTC time. The process varies depending on the vendor of the NCN.
After the correct time has been verified, power off the NCN.
pit# ipmitool -I lanplus -U $USERNAME -E -H $bmc chassis power off
Repeat the above process for each NCN.
All firmware can be found in the HFP package provided with the Shasta release.
The management nodes are expected to have certain minimum firmware installed for BMC, node BIOS, and PCIe cards. Where possible, the firmware should be updated prior to install. It is good to meet the minimum NCN firmware requirement before starting.
Note: When the PIT node is booted from the LiveCD, it is not possible to use the Firmware Action Service (FAS) to update the firmware, because that service has not yet been installed. However, at this point, it would be possible to use the HPE Cray EX HPC Firmware Pack (HFP) product on the PIT node to learn about the firmware versions available in HFP.
If the firmware is not updated at this point in the installation workflow, then it can be done with FAS after CSM and HFP have both been installed and configured. However, at that point a rolling reboot procedure for the management nodes will be needed, after the firmware has been updated.
See the HPE Cray EX System Software Getting Started Guide (S-8000) 22.07
on the HPE Customer Support Center for information about the HPE Cray EX HPC Firmware Pack (HFP) product.
In the HFP documentation there is information about the recommended firmware packages to be installed.
See “Product Details” in the HPE Cray EX HPC Firmware Pack Installation Guide.
Some of the component types have manual procedures to check firmware versions and update firmware.
See Upgrading Firmware Without FAS
in the HPE Cray EX HPC Firmware Pack Installation Guide
.
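As one quick, informal spot check (not a substitute for the HFP procedures), the current BMC firmware revision on a node can usually be read over IPMI. This sketch assumes the bmc variable is set to the BMC hostname of the node being checked, as in the BIOS time procedure above:
pit# ipmitool -I lanplus -U $USERNAME -E -H $bmc mc info | grep -i 'firmware revision'
Compare the reported revision against the minimum versions listed in the HFP documentation.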
It will be possible to extract the files from the product tarball, but the install.sh
script from that product
will be unable to load the firmware versions into the Firmware Action Service (FAS), because the management nodes
are not yet booted and running Kubernetes, and FAS cannot be used until Kubernetes is running.
When booted into the PIT node, the firmware can be found in the HFP package provided with the Shasta release.
(Optional) Check these BIOS settings on the management nodes. See NCN BIOS.
This is optional; the BIOS settings (or lack thereof) do not prevent deployment. The NCN installation will work with the default CMOS BIOS settings. There may be settings that facilitate the speed of deployment, but they may be tuned at a later time.
NOTE: The BIOS tuning will be automated, further reducing this step.
The firmware on the management nodes should be checked for compliance with the minimum required version and updated, if necessary, at this point.
WARNING: Gigabyte NCNs running BIOS version C20 can become unusable when Shasta 1.5 is installed. This is a result of a bug in the Gigabyte firmware. This bug has not been observed in BIOS version C17.
A key symptom of this bug is that the NCN will not PXE boot and will instead fall through to the boot menu, despite being configured to PXE boot. This behavior will persist until the failing node’s CMOS is cleared.
Deployment of the nodes starts with booting the storage nodes first. Then, the master nodes and worker nodes should be booted together. After the operating system boots on each node, there are some configuration actions which take place. Watching the console or the console log for certain nodes can help to understand what happens and when. When the process is complete for all nodes, the Ceph storage will have been initialized and the Kubernetes cluster will be created and ready for a workload.
The configuration workflow described here is intended to help understand the expected path for booting and configuring. The actual steps to be performed are in the Deploy section.
Booting of ncn-s001 and at least one other storage node:
- The first storage node (ncn-s001) will boot; it then starts a loop as the ceph-ansible configuration waits for all other storage nodes to boot.
- ceph-ansible runs to completion on ncn-s001:
  - Once ncn-s001 notices that all other storage nodes have booted, ceph-ansible will begin Ceph configuration. This takes several minutes.
  - Once ceph-ansible has finished on ncn-s001, then ncn-s001 waits for ncn-m002 to create /etc/kubernetes/admin.conf.
Booting of ncn-m002, ncn-m003, and at least one worker node:
- Boot the other master nodes (ncn-m002 and ncn-m003) and all worker nodes at the same time.
- The worker nodes boot and wait for ncn-m002 to create the /etc/cray/kubernetes/join-command-control-plane file so that they can join Kubernetes.
- The third master node (ncn-m003) boots and waits for ncn-m002 to create the /etc/cray/kubernetes/join-command-control-plane file so that it can join Kubernetes.
- The second master node (ncn-m002) boots and runs kubernetes-cloudinit.sh, which will create /etc/kubernetes/admin.conf and /etc/cray/kubernetes/join-command-control-plane. It then waits for the storage node to create etcd-backup-s3-credentials.
- Once ncn-s001 notices that ncn-m002 has created /etc/kubernetes/admin.conf, then ncn-s001 waits for any worker node to become available.
- As the worker nodes and ncn-m003 notice that ncn-m002 has created /etc/cray/kubernetes/join-command-control-plane, they will join the Kubernetes cluster.
- Once ncn-s001 notices that a worker node has done this, it moves forward with the creation of ConfigMaps and running the post-Ceph playbooks (S3, OSD pools, quotas, and so on).
- ncn-s001 creates etcd-backup-s3-credentials during the ceph-rgw-users role (one of the last roles after Ceph has been set up), then ncn-m002 notices this and proceeds.
NOTE: If several hours have elapsed between storage and master nodes booting, or if there were issues PXE booting master nodes, the cloud-init script on ncn-s001 may not complete successfully. This can cause /var/log/cloud-init-output.log on the master node(s) to continue to output the following message:
[ 1328.351558] cloud-init[8472]: Waiting for storage node to create etcd-backup-s3-credentials secret...
In this case, the following script is safe to be executed again on ncn-s001:
ncn-s001# /srv/cray/scripts/common/storage-ceph-cloudinit.sh
After this script finishes, the secrets will be created and the cloud-init script on the master node(s) should complete.
NOTE: Some scripts in this section depend on
IPMI_PASSWORD
being set. This is done in Tokens and IPMI Password.
Set the default root password and SSH keys, and optionally change the timezone.
The management node images do not contain a default password or default SSH keys.
It is required to set the default root password and SSH keys in the images used to boot the management nodes. Follow the NCN image customization steps in Change NCN Image Root Password and SSH Keys on PIT Node.
Create boot directories for any NCN in DNS.
This will create folders for each host in /var/www
, allowing each host to have its own unique set of artifacts:
kernel, initrd
, SquashFS, and script.ipxe
bootscript.
Patch the set-sqfs-links.sh
script to include the blacklisting of an undesired kernel module.
pit# sed -i -E 's:rd.luks=0 /:rd.luks=0 module_blacklist=rpcrdma \/:g' /root/bin/set-sqfs-links.sh
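Optionally, confirm that the patch was applied by checking for the blacklist argument in the script (a quick check; the count should be non-zero):
pit# grep -c module_blacklist=rpcrdma /root/bin/set-sqfs-links.sh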
Invoke the script.
pit# /root/bin/set-sqfs-links.sh
Every NCN except for
ncn-m001
should be included in the output from this script. If that is not the case, then verify that all NCN BMCs are set to use DHCP. See Set node BMCs to DHCP. After that is done, re-run the set-sqfs-links.sh
script.
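Optionally, list the per-host directories and the artifacts inside one of them to confirm the links were created. This is an informal check; the exact artifact file names can vary by release:
pit# ls -d /var/www/ncn-*
pit# ls /var/www/ncn-w001/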
Customize boot scripts for any out-of-baseline NCNs (for example, NCNs that need to bypass etcd creation).
Run the BIOS baseline script to apply configurations to BMCs.
The script will apply helper configurations to facilitate more deterministic network booting on any NCN port. This runs against any server vendor, but some settings are not applied for certain vendors.
NOTE: This script will enable DCMI/IPMI on Hewlett-Packard Enterprise servers equipped with ILO. If
ipmitool
is not working at this time, it will work after running this script.
pit# /root/bin/bios-baseline.sh
Set each node to always UEFI Network Boot, and ensure that they are powered off.
pit# grep -oP "($mtoken|$stoken|$wtoken)" /etc/dnsmasq.d/statics.conf | sort -u | xargs -t -i ipmitool -I lanplus -U $USERNAME -E -H {} chassis bootdev pxe options=persistent
pit# grep -oP "($mtoken|$stoken|$wtoken)" /etc/dnsmasq.d/statics.conf | sort -u | xargs -t -i ipmitool -I lanplus -U $USERNAME -E -H {} chassis bootdev pxe options=efiboot
pit# grep -oP "($mtoken|$stoken|$wtoken)" /etc/dnsmasq.d/statics.conf | sort -u | xargs -t -i ipmitool -I lanplus -U $USERNAME -E -H {} power off
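To spot-check that the persistent UEFI PXE setting took effect on a particular BMC, the boot flags can be read back over IPMI (an informal check; substitute the appropriate BMC hostname):
pit# ipmitool -I lanplus -U $USERNAME -E -H ncn-w001-mgmt chassis bootparam get 5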
NOTE: The NCN boot order is further explained in NCN Boot Workflow.
Validate that the LiveCD is ready for installing NCNs.
Observe the output of the checks and note any failures, then remediate them.
Specify the admin
user password for the management switches in the system.
read -s
is used to prevent the password from being written to the screen or the shell history.
pit# read -s SW_ADMIN_PASSWORD
pit# export SW_ADMIN_PASSWORD
Run the LiveCD preflight checks.
pit# csi pit validate --livecd-preflight
Note: Ignore any errors about not being able to resolve
arti.dev.cray.com
.
Print the available consoles.
pit# conman -q
Expected output looks similar to the following:
ncn-m001-mgmt
ncn-m002-mgmt
ncn-m003-mgmt
ncn-s001-mgmt
ncn-s002-mgmt
ncn-s003-mgmt
ncn-w001-mgmt
ncn-w002-mgmt
ncn-w003-mgmt
NOTE: All console logs are located at
/var/log/conman/console*
Boot all the storage nodes. ncn-s001
will start 1 minute after the other storage nodes.
pit# grep -oP $stoken /etc/dnsmasq.d/statics.conf | grep -v "ncn-s001-" | sort -u |
xargs -t -i ipmitool -I lanplus -U $USERNAME -E -H {} power on; \
sleep 60; ipmitool -I lanplus -U $USERNAME -E -H ncn-s001-mgmt power on
Observe the installation through the console of ncn-s001-mgmt
.
pit# conman -j ncn-s001-mgmt
From there, an administrator can observe the console output of the cloud-init
scripts.
NOTE: Watch the storage node consoles carefully for error messages. If any are seen, consult Ceph-CSI Troubleshooting.
NOTE: If the nodes have PXE boot issues (for example, getting PXE errors, or not pulling the ipxe.efi
binary), see PXE boot troubleshooting.
Wait for storage nodes before booting Kubernetes master nodes and worker nodes.
NOTE: Once all storage nodes are up and the message ...sleeping 5 seconds until /etc/kubernetes/admin.conf
appears on ncn-s001
’s console, it is safe to proceed with booting the Kubernetes master nodes and worker nodes.
pit# grep -oP "($mtoken|$wtoken)" /etc/dnsmasq.d/statics.conf | sort -u | xargs -t -i ipmitool -I lanplus -U $USERNAME -E -H {} power on
Stop watching the console from ncn-s001
.
Type the ampersand character and then the period character to exit from the conman session on ncn-s001
.
&.
pit#
Wait. Observe the installation through ncn-m002-mgmt
’s console:
Print the console name:
pit# conman -q | grep m002
Expected output looks similar to the following:
ncn-m002-mgmt
Then join the console:
pit# conman -j ncn-m002-mgmt
NOTE: If the nodes have PXE boot issues (for example, getting PXE errors or not pulling the ipxe.efi binary), see PXE boot troubleshooting.
NOTE: If one of the master nodes seems hung waiting for the storage nodes to create a secret, check the storage node consoles for error messages. If any are found, consult Ceph-CSI Troubleshooting.
Wait for the deployment to finish.
Refer to timing of deployments. It should not take more than 60 minutes for the kubectl get nodes
command to return output indicating
that all the master nodes and worker nodes (excluding the PIT node) booted from the LiveCD and are Ready
.
When the following command prompts for a password, enter the root password for
ncn-m002
.
pit# ssh ncn-m002 kubectl get nodes -o wide
Expected output looks similar to the following:
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
ncn-m002 Ready control-plane,master 2h v1.20.13 10.252.1.5 <none> SUSE Linux Enterprise High Performance Computing 15 SP3 5.3.18-59.19-default containerd://1.5.7
ncn-m003 Ready control-plane,master 2h v1.20.13 10.252.1.6 <none> SUSE Linux Enterprise High Performance Computing 15 SP3 5.3.18-59.19-default containerd://1.5.7
ncn-w001 Ready <none> 2h v1.20.13 10.252.1.7 <none> SUSE Linux Enterprise High Performance Computing 15 SP3 5.3.18-59.19-default containerd://1.5.7
ncn-w002 Ready <none> 2h v1.20.13 10.252.1.8 <none> SUSE Linux Enterprise High Performance Computing 15 SP3 5.3.18-59.19-default containerd://1.5.7
ncn-w003 Ready <none> 2h v1.20.13 10.252.1.9 <none> SUSE Linux Enterprise High Performance Computing 15 SP3 5.3.18-59.19-default containerd://1.5.7
Stop watching the console of ncn-m002
.
Type the ampersand character and then the period character to exit from the conman session on ncn-m002
.
&.
pit#
Enable passwordless SSH for the PIT node.
Copy SSH files from ncn-m002
to the PIT node.
When the following command prompts for a password, enter the root password for
ncn-m002
.
pit# rsync -av ncn-m002:.ssh/ /root/.ssh/
Expected output looks similar to the following:
Password:
receiving incremental file list
./
authorized_keys
id_rsa
id_rsa.pub
known_hosts
sent 145 bytes received 13,107 bytes 3,786.29 bytes/sec
total size is 12,806 speedup is 0.97
Make a list of all of the NCNs (including ncn-m001
).
pit# NCNS=$(grep -oP "ncn-[msw][0-9]{3}" /etc/dnsmasq.d/statics.conf | sort -u | tr '\n' ',') ; echo "${NCNS}"
Expected output looks similar to the following:
ncn-m001,ncn-m002,ncn-m003,ncn-s001,ncn-s002,ncn-s003,ncn-w001,ncn-w002,ncn-w003,
Verify that passwordless SSH is now working from the PIT node to the other NCNs.
The following command should not prompt for a password.
pit# PDSH_SSH_ARGS_APPEND='-o StrictHostKeyChecking=no' pdsh -Sw "${NCNS}" date && echo SUCCESS || echo ERROR
Expected output looks similar to the following:
ncn-w001: Warning: Permanently added 'ncn-w001,10.252.1.7' (ECDSA) to the list of known hosts.
ncn-w003: Warning: Permanently added 'ncn-w003,10.252.1.9' (ECDSA) to the list of known hosts.
ncn-m003: Warning: Permanently added 'ncn-m003,10.252.1.6' (ECDSA) to the list of known hosts.
ncn-s002: Warning: Permanently added 'ncn-s002,10.252.1.11' (ECDSA) to the list of known hosts.
ncn-m001: Warning: Permanently added 'ncn-m001,10.252.1.4' (ECDSA) to the list of known hosts.
ncn-w002: Warning: Permanently added 'ncn-w002,10.252.1.8' (ECDSA) to the list of known hosts.
ncn-m002: Warning: Permanently added 'ncn-m002,10.252.1.5' (ECDSA) to the list of known hosts.
ncn-s003: Warning: Permanently added 'ncn-s003,10.252.1.12' (ECDSA) to the list of known hosts.
ncn-s001: Warning: Permanently added 'ncn-s001,10.252.1.10' (ECDSA) to the list of known hosts.
ncn-s003: Thu 28 Apr 2022 02:43:21 PM UTC
ncn-s001: Thu 28 Apr 2022 02:43:21 PM UTC
ncn-s002: Thu 28 Apr 2022 02:43:21 PM UTC
ncn-m001: Thu 28 Apr 2022 02:43:21 PM UTC
ncn-m003: Thu 28 Apr 2022 02:43:21 PM UTC
ncn-m002: Thu 28 Apr 2022 02:43:21 PM UTC
ncn-w001: Thu 28 Apr 2022 02:43:22 PM UTC
ncn-w002: Thu 28 Apr 2022 02:43:22 PM UTC
ncn-w003: Thu 28 Apr 2022 02:43:22 PM UTC
SUCCESS
Validate that the expected LVM labels are present on disks on the master and worker nodes.
pit# /usr/share/doc/csm/install/scripts/check_lvm.sh
Expected output looks similar to the following:
When prompted, please enter the NCN password for ncn-m002
Warning: Permanently added 'ncn-m002,10.252.1.11' (ECDSA) to the list of known hosts.
Password:
Checking ncn-m002...
ncn-m002: OK
Checking ncn-m003...
Warning: Permanently added 'ncn-m003,10.252.1.10' (ECDSA) to the list of known hosts.
Warning: Permanently added 'ncn-m003,10.252.1.10' (ECDSA) to the list of known hosts.
ncn-m003: OK
Checking ncn-w001...
Warning: Permanently added 'ncn-w001,10.252.1.9' (ECDSA) to the list of known hosts.
Warning: Permanently added 'ncn-w001,10.252.1.9' (ECDSA) to the list of known hosts.
ncn-w001: OK
Checking ncn-w002...
Warning: Permanently added 'ncn-w002,10.252.1.8' (ECDSA) to the list of known hosts.
Warning: Permanently added 'ncn-w002,10.252.1.8' (ECDSA) to the list of known hosts.
ncn-w002: OK
Checking ncn-w003...
Warning: Permanently added 'ncn-w003,10.252.1.7' (ECDSA) to the list of known hosts.
Warning: Permanently added 'ncn-w003,10.252.1.7' (ECDSA) to the list of known hosts.
ncn-w003: OK
SUCCESS: LVM checks passed on all master and worker NCNs
If the check fails for any nodes, then the problem must be resolved before continuing. See LVM Check Failure Recovery.
Apply the boot order workaround.
pit# /usr/share/doc/csm/scripts/workarounds/boot-order/run.sh
Apply the kdump
workaround.
kdump
assists in taking a dump of the NCN if it encounters a kernel panic.
kdump
does not work properly in CSM 1.2. Until this workaround is applied, kdump
may not produce a proper dump.
Running this script applies the workaround on all of the NCNs that were just deployed.
pit# /usr/share/doc/csm/scripts/workarounds/kdump/run.sh
Example output:
Uploading hotfix files to ncn-m001:/srv/cray/scripts/common/ ... Done
Uploading hotfix files to ncn-m002:/srv/cray/scripts/common/ ... Done
Uploading hotfix files to ncn-m003:/srv/cray/scripts/common/ ... Done
Uploading hotfix files to ncn-s001:/srv/cray/scripts/common/ ... Done
Uploading hotfix files to ncn-s002:/srv/cray/scripts/common/ ... Done
Uploading hotfix files to ncn-s003:/srv/cray/scripts/common/ ... Done
Uploading hotfix files to ncn-s004:/srv/cray/scripts/common/ ... Done
Uploading hotfix files to ncn-w001:/srv/cray/scripts/common/ ... Done
Uploading hotfix files to ncn-w002:/srv/cray/scripts/common/ ... Done
Uploading hotfix files to ncn-w003:/srv/cray/scripts/common/ ... Done
Uploading hotfix files to ncn-w004:/srv/cray/scripts/common/ ... Done
Running updated create-kdump-artifacts.sh script on [11] NCNs ... Done
The following NCNs contain the kdump patch:
ncn-m001
ncn-m002
ncn-m003
ncn-s001
ncn-s002
ncn-s003
ncn-s004
ncn-w001
ncn-w002
ncn-w003
ncn-w004
This workaround has completed.
IMPORTANT: Do the following if NCNs are Gigabyte hardware. It is suggested (but optional) for HPE NCNs.
IMPORTANT: Estimate the expected number of OSDs using the following table and this equation:
total_osds = (number of utility storage/Ceph nodes) * (OSD count from table below for the appropriate hardware)

| Hardware Manufacturer | OSD Drive Count (not including OS drives) |
|---|---|
| Gigabyte | 12 |
| HPE | 8 |
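For example, a hypothetical system with three HPE utility storage nodes would be expected to have 3 * 8 = 24 OSDs. A minimal shell calculation for that example:
ncn-s# STORAGE_NODES=3    # number of utility storage (Ceph) nodes in this example
ncn-s# OSDS_PER_NODE=8    # from the table above (HPE)
ncn-s# echo $(( STORAGE_NODES * OSDS_PER_NODE ))
24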
If there are OSDs on each node (ceph osd tree
can show this), then all the nodes are in Ceph. That means the orchestrator can be used to look for the devices.
Get the number of OSDs in the cluster.
ncn-s# ceph -f json-pretty osd stat |jq .num_osds
24
IMPORTANT: If the returned number of OSDs is equal to total_osds
calculated, then skip the following steps. If not, then proceed with the below additional checks and remediation steps.
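A minimal sketch of that comparison, assuming total_osds was calculated as 24 for this system:
ncn-s# EXPECTED_OSDS=24   # total_osds calculated above for this system
ncn-s# [ "$(ceph -f json-pretty osd stat | jq .num_osds)" -eq "$EXPECTED_OSDS" ] && echo "OSD count matches" || echo "OSD count mismatch"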
Compare the number of OSDs to the output (which should resemble the example below). The number of drives will depend on the server hardware.
NOTE: If the Ceph cluster is large and has a lot of nodes, a node may be specified after the below command to limit the results.
ncn-s# ceph orch device ls
Hostname Path Type Serial Size Health Ident Fault Available
ncn-s001 /dev/sda ssd PHYF015500M71P9DGN 1920G Unknown N/A N/A No
ncn-s001 /dev/sdb ssd PHYF016500TZ1P9DGN 1920G Unknown N/A N/A No
ncn-s001 /dev/sdc ssd PHYF016402EB1P9DGN 1920G Unknown N/A N/A No
ncn-s001 /dev/sdd ssd PHYF016504831P9DGN 1920G Unknown N/A N/A No
ncn-s001 /dev/sde ssd PHYF016500TV1P9DGN 1920G Unknown N/A N/A No
ncn-s001 /dev/sdf ssd PHYF016501131P9DGN 1920G Unknown N/A N/A No
ncn-s001 /dev/sdi ssd PHYF016500YB1P9DGN 1920G Unknown N/A N/A No
ncn-s001 /dev/sdj ssd PHYF016500WN1P9DGN 1920G Unknown N/A N/A No
ncn-s002 /dev/sda ssd PHYF0155006W1P9DGN 1920G Unknown N/A N/A No
ncn-s002 /dev/sdb ssd PHYF0155006Z1P9DGN 1920G Unknown N/A N/A No
ncn-s002 /dev/sdc ssd PHYF015500L61P9DGN 1920G Unknown N/A N/A No
ncn-s002 /dev/sdd ssd PHYF015502631P9DGN 1920G Unknown N/A N/A No
ncn-s002 /dev/sde ssd PHYF0153000G1P9DGN 1920G Unknown N/A N/A No
ncn-s002 /dev/sdf ssd PHYF016401T41P9DGN 1920G Unknown N/A N/A No
ncn-s002 /dev/sdi ssd PHYF016504C21P9DGN 1920G Unknown N/A N/A No
ncn-s002 /dev/sdj ssd PHYF015500GQ1P9DGN 1920G Unknown N/A N/A No
ncn-s003 /dev/sda ssd PHYF016402FP1P9DGN 1920G Unknown N/A N/A No
ncn-s003 /dev/sdb ssd PHYF016401TE1P9DGN 1920G Unknown N/A N/A No
ncn-s003 /dev/sdc ssd PHYF015500N51P9DGN 1920G Unknown N/A N/A No
ncn-s003 /dev/sdd ssd PHYF0165010Z1P9DGN 1920G Unknown N/A N/A No
ncn-s003 /dev/sde ssd PHYF016500YR1P9DGN 1920G Unknown N/A N/A No
ncn-s003 /dev/sdf ssd PHYF016500X01P9DGN 1920G Unknown N/A N/A No
ncn-s003 /dev/sdi ssd PHYF0165011H1P9DGN 1920G Unknown N/A N/A No
ncn-s003 /dev/sdj ssd PHYF016500TQ1P9DGN 1920G Unknown N/A N/A No
If there are devices that show Available
as Yes
and they are not being automatically added, then those devices may need to be zapped.
IMPORTANT: Prior to zapping any device, ensure that it is not being used.
Check to see if the number of devices is less than the number of listed drives in the output from step 1.
ncn-s# ceph orch device ls|grep dev|wc -l
Example output:
24
If the numbers are equal, but less than the total_osds
calculated, then the ceph-mgr
daemon may need to be failed in order to get a fresh inventory.
ncn-s# ceph mgr fail $(ceph mgr dump | jq -r .active_name)
Wait 5 minutes and then re-check ceph orch device ls
. See if the drives are still showing as Available
. If so, then proceed to the next step.
ssh
to the host, look at the lsblk
output, and check it against the device reported by the ceph orch device ls output above.
ncn-s# lsblk
Example output:
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
loop0                 7:0    0   4.2G  1 loop  /run/rootfsbase
loop1 7:1 0 30G 0 loop
└─live-overlay-pool 254:8 0 300G 0 dm
loop2 7:2 0 300G 0 loop
└─live-overlay-pool 254:8 0 300G 0 dm
sda 8:0 0 1.8T 0 disk
└─ceph--0a476f53--8b38--450d--8779--4e587402f8a8-osd--data--b620b7ef--184a--46d7--9a99--771239e7a323 254:7 0 1.8T 0 lvm
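To inspect a single device reported by ceph orch device ls, lsblk also accepts a device path (hypothetical device shown):
ncn-s# lsblk /dev/sdc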
Log into each ncn-s
node and check for unused drives.
ncn-s# cephadm shell -- ceph-volume inventory
IMPORTANT: The cephadm
command may output the warning WARNING: The same type, major and minor should not be used for multiple devices.
Ignore this warning.
The field available
would be True
if Ceph sees the drive as empty and usable. For example:
Device Path Size rotates available Model name
/dev/sda 447.13 GB False False SAMSUNG MZ7LH480
/dev/sdb 447.13 GB False False SAMSUNG MZ7LH480
/dev/sdc 3.49 TB False False SAMSUNG MZ7LH3T8
/dev/sdd 3.49 TB False False SAMSUNG MZ7LH3T8
/dev/sde 3.49 TB False False SAMSUNG MZ7LH3T8
/dev/sdf 3.49 TB False False SAMSUNG MZ7LH3T8
/dev/sdg 3.49 TB False False SAMSUNG MZ7LH3T8
/dev/sdh 3.49 TB False False SAMSUNG MZ7LH3T8
Alternatively, just dump the paths of available drives:
ncn-s# cephadm shell -- ceph-volume inventory --format json-pretty | jq -r '.[]|select(.available==true)|.path'
Wipe the drive ONLY after confirming, using one or both of the checks above, that the drive is not being used by the current Ceph cluster.
The following example wipes drive
/dev/sdc
onncn-s002
. Replace these values with the appropriate ones for the situation.
ncn-s# ceph orch device zap ncn-s002 /dev/sdc --force
Add unused drives.
ncn-s# cephadm shell -- ceph-volume lvm create --data /dev/sd<drive to add> --bluestore
More information can be found at the cephadm
reference page.
After the management nodes have been deployed, configuration can be applied to the booted nodes.
The LiveCD needs to authenticate with the cluster to facilitate the rest of the CSM installation.
Determine which master node is the first master node.
Most often the first master node will be ncn-m002
.
Run the following commands on the PIT node to extract the value of the first-master-hostname
field from the /var/www/ephemeral/configs/data.json
file:
pit# FM=$(cat /var/www/ephemeral/configs/data.json | jq -r '."Global"."meta-data"."first-master-hostname"')
pit# echo $FM
Copy the Kubernetes configuration file from that node to the LiveCD to be able to use kubectl
as cluster administrator.
Run the following commands on the PIT node:
pit# mkdir -v ~/.kube
pit# scp ${FM}.nmn:/etc/kubernetes/admin.conf ~/.kube/config
Validate that kubectl
commands run successfully from the PIT node.
pit# kubectl get nodes -o wide
Expected output looks similar to the following:
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
ncn-m002 Ready control-plane,master 2h v1.20.13 10.252.1.5 <none> SUSE Linux Enterprise High Performance Computing 15 SP3 5.3.18-59.19-default containerd://1.5.7
ncn-m003 Ready control-plane,master 2h v1.20.13 10.252.1.6 <none> SUSE Linux Enterprise High Performance Computing 15 SP3 5.3.18-59.19-default containerd://1.5.7
ncn-w001 Ready <none> 2h v1.20.13 10.252.1.7 <none> SUSE Linux Enterprise High Performance Computing 15 SP3 5.3.18-59.19-default containerd://1.5.7
ncn-w002 Ready <none> 2h v1.20.13 10.252.1.8 <none> SUSE Linux Enterprise High Performance Computing 15 SP3 5.3.18-59.19-default containerd://1.5.7
ncn-w003 Ready <none> 2h v1.20.13 10.252.1.9 <none> SUSE Linux Enterprise High Performance Computing 15 SP3 5.3.18-59.19-default containerd://1.5.7
Run the following commands on the PIT node.
pit# pushd /var/www/ephemeral && ${CSM_RELEASE}/lib/install-goss-tests.sh && popd
Apply the chrony
configurations.
Run the following command without editing the value of the TOKEN
variable.
pit# for i in $(grep -oP 'ncn-\w\d+' /etc/dnsmasq.d/statics.conf | sort -u | grep -v ncn-m001); do
ssh $i "TOKEN=token /srv/cray/scripts/common/chrony/csm_ntp.py"; done
Successful output can appear as:
If BSS is unreachable, local cache is checked and the configuration is still deployed:
...
BSS query failed. Checking local cache...
Chrony configuration created
Problematic config found: /etc/chrony.d/cray.conf.dist
Problematic config found: /etc/chrony.d/pool.conf
Restarted chronyd
...
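Optionally, spot-check time synchronization on the NCNs afterward with chronyc, reusing the same loop over nodes (an informal check, not part of the documented procedure):
pit# for i in $(grep -oP 'ncn-\w\d+' /etc/dnsmasq.d/statics.conf | sort -u | grep -v ncn-m001); do
         echo "------ ${i}"; ssh $i 'chronyc tracking | grep -E "Reference ID|Leap status"'; done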
The following csi pit validate
commands will run a series of remote tests on the other nodes to validate they are healthy and configured correctly.
Observe the output of the checks. If there are any failures, remediate them.
Check the storage nodes.
pit# csi pit validate --ceph | tee csi-pit-validate-ceph.log
Once that command has finished, the following will extract the test totals reported for each node:
pit# grep "Total Test" csi-pit-validate-ceph.log
Example output for a system with three storage nodes:
Total Tests: 8, Total Passed: 8, Total Failed: 0, Total Execution Time: 74.3782 seconds
Total Tests: 3, Total Passed: 3, Total Failed: 0, Total Execution Time: 0.6091 seconds
Total Tests: 3, Total Passed: 3, Total Failed: 0, Total Execution Time: 0.6260 seconds
If these total lines report any failed tests, then look through the full output of the test in csi-pit-validate-ceph.log
to see which node had the failed test and what the details are for that test.
Note: See Utility Storage and Ceph CSI Troubleshooting in order to help resolve any failed tests.
Check the master and worker nodes.
Note: Throughout the output of the csi pit validate
command are test totals for each node where the tests run. Be sure to check
all of them and not just the final one. A grep
command is provided to help with this.
pit# csi pit validate --k8s | tee csi-pit-validate-k8s.log
Once that command has finished, the following will extract the test totals reported for each node:
pit# grep "Total Test" csi-pit-validate-k8s.log
Example output for a system with five master and worker nodes (excluding the PIT node):
Total Tests: 16, Total Passed: 16, Total Failed: 0, Total Execution Time: 0.3072 seconds
Total Tests: 16, Total Passed: 16, Total Failed: 0, Total Execution Time: 0.2727 seconds
Total Tests: 12, Total Passed: 12, Total Failed: 0, Total Execution Time: 0.2841 seconds
Total Tests: 12, Total Passed: 12, Total Failed: 0, Total Execution Time: 0.3622 seconds
Total Tests: 12, Total Passed: 12, Total Failed: 0, Total Execution Time: 0.2353 seconds
If these total lines report any failed tests, then look through the full output of the test in csi-pit-validate-k8s.log
to see which node had the failed test and what the details are for that test.
WARNING: Notes on specific failures:
- If any of the FS Label tests fail (they have names like Master Node ETCDLVM FS Label or Worker Node CONLIB FS Label), then run manual tests on the node which reported the failure. See Manual LVM Check Procedure. If the manual tests fail, then the problem must be resolved before continuing to the next step. See LVM Check Failure Recovery.
- If the Weave Health test fails, run weave --local status connections on the node where the test failed. If messages similar to IP allocation was seeded by different peers are seen, then weave appears to be split-brained. At this point, it is necessary to wipe the NCNs and start the PXE boot again:
- Wipe the NCNs using the ‘Basic Wipe’ section of Wipe NCN Disks for Reinstallation.
- Return to the ‘Boot the Storage Nodes’ step of Deploy Management Nodes section above.
Verify that all the pods in the kube-system
namespace are Running
or Completed
.
Run the following command on any Kubernetes master or worker node, or the PIT node:
ncn-mw/pit# kubectl get pods -o wide -n kube-system | grep -Ev '(Running|Completed)'
If any pods are listed by this command, it means they are not in the Running
or Completed
state. Do not proceed before investigating this.
Before proceeding, be aware that this is the last point where the other NCNs can be rebuilt without also having to rebuild the PIT node. Therefore, take time to double check both the cluster and the validation test results.
After completing the deployment of the management nodes, the next step is to install the CSM services.
See Install CSM Services.
This section gives information on troubleshooting and remediating issues with the LVM check performed during the Deploy Management Nodes procedure. If that check passed, this section can be ignored.
If needed, the LVM checks can be performed manually on the master and worker nodes.
Manual check on master nodes:
ncn-m# blkid -L ETCDLVM
Example output:
/dev/sdc
Manual check on worker nodes:
ncn-w# blkid -L CONLIB
/dev/sdb2
ncn-w# blkid -L CONRUN
/dev/sdb1
ncn-w# blkid -L K8SLET
/dev/sdb3
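The three worker node checks can also be run as a single loop (a convenience sketch, equivalent to the individual commands above):
ncn-w# for label in CONLIB CONRUN K8SLET; do printf '%s: ' "$label"; blkid -L "$label" || echo MISSING; done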
The manual checks are considered successful if all of the blkid
commands report a disk device (such as /dev/sdc
– the particular device is unimportant).
If any of the blkid
commands return no output, then the check is a failure. Any failures must be resolved before continuing. See the following section
for details on how to do so.
If there are LVM check failures, then the problem must be resolved before continuing with the install.
If any master node has the problem, then wipe and redeploy all of the NCNs before continuing the installation:
- Wipe the NCNs (excluding ncn-m001, because it is the PIT node) using the ‘Basic Wipe’ section of Wipe NCN Disks for Reinstallation.
- Return to the ‘Boot the Storage Nodes’ step of the Deploy Management Nodes section above.
If only worker nodes have the problem, then wipe and redeploy the affected worker nodes before continuing the installation:
- Wipe only the affected worker nodes using the ‘Basic Wipe’ section of Wipe NCN Disks for Reinstallation.
- Return to the step above that powers on the master nodes and worker nodes, and re-run the power on command. The ipmitool
command will give errors trying to power on the unaffected nodes, because they are already powered on – this is expected and not a problem.