The following procedure deploys Linux and Kubernetes software to the management NCNs. Deployment of the nodes starts with booting the storage nodes followed by the master nodes and worker nodes together.
After the operating system boots on each node, there are some configuration actions which take place. Watching the console or the console log for certain nodes can help to understand what happens and when. When the process completes for all nodes, the Ceph storage is initialized and the Kubernetes cluster is created and ready for a workload. The PIT node will join Kubernetes after it is rebooted later in Redeploy PIT Node.
The timing of each set of boots varies based on hardware. Nodes from some manufacturers POST faster than others, and timing also varies with BIOS settings. After powering on a set of nodes, an administrator can expect a healthy boot session to take about 60 minutes, depending on the number of storage and worker nodes.
Preparation of the environment must be done before attempting to deploy the management nodes.
Define shell environment variables that will simplify later commands to deploy management nodes.
Set IPMI_PASSWORD to the root password for the NCN BMCs. read -s is used to prevent the password from being written to the screen or the shell history.
pit# read -s IPMI_PASSWORD
pit# export IPMI_PASSWORD
Set the remaining helper variables.
These values do not need to be altered from what is shown.
pit# mtoken='ncn-m(?!001)\w+-mgmt' ; stoken='ncn-s\w+-mgmt' ; wtoken='ncn-w\w+-mgmt' ; export USERNAME=root
Throughout the guide, simple one-liners can be used to query status of expected nodes. If the shell or environment is terminated, these environment variables should be re-exported.
Examples:
Check power status of all NCNs.
pit# grep -oP "($mtoken|$stoken|$wtoken)" /etc/dnsmasq.d/statics.conf | sort -u |
xargs -t -i ipmitool -I lanplus -U $USERNAME -E -H {} power status
Power off all NCNs.
pit# grep -oP "($mtoken|$stoken|$wtoken)" /etc/dnsmasq.d/statics.conf | sort -u |
xargs -t -i ipmitool -I lanplus -U $USERNAME -E -H {} power off
There will be post-boot workarounds as well.
Follow the workaround instructions for the before-ncn-boot breakpoint.
NOTE: Optionally, in order to use a timezone other than UTC, instead of step 1 below, follow this procedure for setting a local timezone. Then proceed to step 2.
Ensure that the PIT node has the correct current time.
The time can be inaccurate if the system has been powered off for a long time, or, for example, the CMOS was cleared on a Gigabyte node. See Clear Gigabyte CMOS.
This step should not be skipped.
Check the time on the PIT node to see whether it matches the current time:
pit# date "+%Y-%m-%d %H:%M:%S.%6N%z"
If the time is inaccurate, set the time manually.
pit# timedatectl set-time "2019-11-15 00:00:00"
Run the NTP script:
pit# /root/bin/configure-ntp.sh
This ensures that the PIT is configured with an accurate date/time, which will be propagated to the NCNs during boot.
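Optionally, confirm that the PIT node's clock is now synchronized. This is a quick, hedged check that assumes the script configured chrony on the PIT node; the chronyc utility reports the current time source and offset:
pit# chronyc tracking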
Ensure that the current time is set in BIOS for all management NCNs.
Each NCN is booted to the BIOS menu, the date and time are checked, and set to the current UTC time if needed.
NOTE: Some steps in this procedure depend on USERNAME and IPMI_PASSWORD being set. This is done in Tokens and IPMI Password.
Repeat the following process for each NCN.
Set the bmc variable to the name of the BMC of the NCN being checked.
Important: Be sure to change the below example to the appropriate NCN.
pit# bmc=ncn-w001-mgmt
Start an IPMI console session to the NCN.
pit# conman -j $bmc
Using another terminal to watch the console, boot the node to BIOS.
pit# ipmitool -I lanplus -U $USERNAME -E -H $bmc chassis bootdev bios &&
ipmitool -I lanplus -U $USERNAME -E -H $bmc chassis power off && sleep 10 &&
ipmitool -I lanplus -U $USERNAME -E -H $bmc chassis power on
For HPE NCNs, the above process will boot the nodes to their BIOS; however, the BIOS menu is unavailable through conman because the node boots into a graphical BIOS menu.
To access the serial version of the BIOS menu, perform the ipmitool steps above to boot the node. Then, in conman, press the ESC+9 key combination when the following messages appear on the console. That key combination opens a menu that can be used to enter the BIOS through conman.
For access via BIOS Serial Console:
    Press 'ESC+9'  for System Utilities
    Press 'ESC+0'  for Intelligent Provisioning
    Press 'ESC+!'  for One-Time Boot Menu
    Press 'ESC+@'  for Network Boot
For HPE NCNs, the date configuration menu is at the following path: System Configuration -> BIOS/Platform Configuration (RBSU) -> Date and Time.
Alternatively, for HPE NCNs, log in to the BMC's web interface and access the HTML5 console for the node in order to interact with the graphical BIOS. From the administrator's own machine, create an SSH tunnel (-L creates the tunnel; -N prevents a shell and stubs the connection):
linux# bmc=ncn-w001-mgmt   # Change this to be the appropriate node
linux# ssh -L 9443:$bmc:443 -N root@eniac-ncn-m001
Opening a web browser to https://localhost:9443 will give access to the BMC's web interface.
When the node boots, the conman session can be used to see the BIOS menu, in order to check and set the time to current UTC time. The process varies depending on the vendor of the NCN.
After the correct time has been verified, power off the NCN.
pit# ipmitool -I lanplus -U $USERNAME -E -H $bmc chassis power off
Repeat the above process for each NCN.
All firmware can be found in the HFP package provided with the Shasta release.
The management nodes are expected to have certain minimum firmware installed for BMC, node BIOS, and PCIe cards. Where possible, the firmware should be updated prior to install. It is good to meet the minimum NCN firmware requirement before starting.
Note: When the PIT node is booted from the LiveCD, it is not possible to use the Firmware Action Service (FAS) to update the firmware, because that service has not yet been installed. However, at this point it is possible to use the HPE Cray EX HPC Firmware Pack (HFP) product on the PIT node to learn about the firmware versions available in HFP.
If the firmware is not updated at this point in the installation workflow, then it can be done with FAS after CSM and HFP have both been installed and configured. However, at that point a rolling reboot procedure for the management nodes will be needed, after the firmware has been updated.
See the Shasta 1.5 HPE Cray EX System Software Getting Started Guide S-8000 on the HPE Customer Support Center for information about the HPE Cray EX HPC Firmware Pack (HFP) product.
In the HFP documentation there is information about the recommended firmware packages to be installed. See “Product Details” in the HPE Cray EX HPC Firmware Pack Installation Guide.
Some of the component types have manual procedures to check firmware versions and update firmware.
See Upgrading Firmware Without FAS in the HPE Cray EX HPC Firmware Pack Installation Guide.
It is possible to extract the files from the HFP product tarball, but the install.sh script from that product will be unable to load the firmware versions into the Firmware Action Service (FAS), because the management nodes are not yet booted and running Kubernetes, and FAS cannot be used until Kubernetes is running. While booted into the PIT node, the firmware itself can still be found in the HFP package provided with the Shasta release.
(Optional) Check the BIOS settings on the management nodes. See NCN BIOS.
This is optional; the BIOS settings (or lack thereof) do not prevent deployment. The NCN installation will work with the default CMOS BIOS settings. There may be settings that speed up deployment, but they can be tuned at a later time.
NOTE: The BIOS tuning will be automated in the future, further reducing this step.
Check for minimum NCN firmware versions and update them as needed. The firmware on the management nodes should be checked for compliance with the minimum required versions and updated, if necessary, at this point.
WARNING: Gigabyte NCNs running BIOS version C20 can become unusable when Shasta 1.5 is installed. This is a result of a bug in the Gigabyte firmware. This bug has not been observed in BIOS version C17.
A key symptom of this bug is that the NCN will not PXE boot and will instead fall through to the boot menu, despite being configured to PXE boot. This behavior will persist until the failing node's CMOS is cleared.
Deployment of the nodes starts with booting the storage nodes first, then the master nodes and worker nodes together. After the operating system boots on each node, some configuration actions take place. Watching the console or the console log for certain nodes can help in understanding what happens and when. When the process is complete for all nodes, the Ceph storage will have been initialized and the Kubernetes cluster will be created and ready for a workload.
The configuration workflow described here is intended to help understand the expected path for booting and configuring. The actual steps to be performed are in the Deploy section.
1. Start watching the consoles for ncn-s001 and at least one other storage node.
2. The first storage node (ncn-s001) will boot; it then starts a loop as ceph-ansible configuration waits for all other storage nodes to boot. The other storage nodes boot and are fully configured when ceph-ansible runs to completion on ncn-s001.
3. Once ncn-s001 notices that all other storage nodes have booted, ceph-ansible will begin Ceph configuration. This takes several minutes.
4. After ceph-ansible has finished on ncn-s001, then ncn-s001 waits for ncn-m002 to create /etc/kubernetes/admin.conf.
5. Start watching the consoles for ncn-m002, ncn-m003, and at least one worker node.
6. Boot the master nodes (ncn-m002 and ncn-m003) and all worker nodes at the same time.
   - The worker nodes boot and wait for ncn-m002 to create the /etc/cray/kubernetes/join-command-control-plane file so that they can join Kubernetes.
   - The other master node (ncn-m003) boots and waits for ncn-m002 to create the /etc/cray/kubernetes/join-command-control-plane file so that it can join Kubernetes.
   - The first master node (ncn-m002) boots and runs kubernetes-cloudinit.sh, which will create /etc/kubernetes/admin.conf and /etc/cray/kubernetes/join-command-control-plane. It then waits for the storage node to create etcd-backup-s3-credentials.
7. Once ncn-s001 notices that ncn-m002 has created /etc/kubernetes/admin.conf, then ncn-s001 waits for any worker node to become available.
8. Once ncn-m003 and the worker nodes notice that ncn-m002 has created /etc/cray/kubernetes/join-command-control-plane, they will join the Kubernetes cluster.
9. Once ncn-s001 notices that a worker node has done this, it moves forward with the creation of ConfigMaps and running the post-Ceph playbooks (S3, OSD pools, quotas, and so on).
10. Once ncn-s001 creates etcd-backup-s3-credentials during the ceph-rgw-users role (one of the last roles after Ceph has been set up), then ncn-m002 notices this and proceeds.

NOTE: If several hours have elapsed between the storage and master nodes booting, or if there were issues PXE booting the master nodes, the cloud-init script on ncn-s001 may not complete successfully. This can cause /var/log/cloud-init-output.log on the master node(s) to continue to output the following message:

[ 1328.351558] cloud-init[8472]: Waiting for storage node to create etcd-backup-s3-credentials secret...

In this case, the following script is safe to be executed again on ncn-s001:

ncn-s001# /srv/cray/scripts/common/storage-ceph-cloudinit.sh

After this script finishes, the secrets will be created and the cloud-init script on the master node(s) should complete.
Change the default root password and SSH keys
The management nodes deploy with a default password in the image, so it is a recommended best practice for system security to change the root password in the image so that it is not the documented default password.
It is strongly encouraged to change the default root password and SSH keys in the images used to boot the management nodes. Follow the NCN image customization steps in Change NCN Image Root Password and SSH Keys on PIT Node.
Create boot directories for any NCN in DNS.
This will create folders for each host in /var/www, allowing each host to have its own unique set of artifacts: kernel, initrd, SquashFS image, and script.ipxe boot script.
pit# /root/bin/set-sqfs-links.sh
Every NCN except for ncn-m001 should be included in the output from this script. If that is not the case, then verify that all NCN BMCs are set to use DHCP. See Set node BMCs to DHCP. After that is done, re-run the set-sqfs-links.sh script.
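As an optional sanity check, list the per-host boot directories that the script creates. This is a sketch; the exact artifact names inside each directory may vary by release:
pit# ls -d /var/www/ncn-*
pit# ls /var/www/ncn-w001/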
Customize boot scripts for any out-of-baseline NCNs, such as nodes that require non-default settings for etcd creation.
Set each node to always UEFI Network Boot, and ensure they are powered off.
pit# grep -oP "($mtoken|$stoken|$wtoken)" /etc/dnsmasq.d/statics.conf | sort -u | xargs -t -i ipmitool -I lanplus -U $USERNAME -E -H {} chassis bootdev pxe options=persistent
pit# grep -oP "($mtoken|$stoken|$wtoken)" /etc/dnsmasq.d/statics.conf | sort -u | xargs -t -i ipmitool -I lanplus -U $USERNAME -E -H {} chassis bootdev pxe options=efiboot
pit# grep -oP "($mtoken|$stoken|$wtoken)" /etc/dnsmasq.d/statics.conf | sort -u | xargs -t -i ipmitool -I lanplus -U $USERNAME -E -H {} power off
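Optionally, confirm the configured boot device on every BMC before powering the nodes on. This is a sketch using a standard ipmitool query; the exact wording of the Boot Device Selector field in the output varies by vendor:
pit# grep -oP "($mtoken|$stoken|$wtoken)" /etc/dnsmasq.d/statics.conf | sort -u |
xargs -t -i ipmitool -I lanplus -U $USERNAME -E -H {} chassis bootparam get 5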
NOTE: The NCN boot order is further explained in NCN Boot Workflow.
Validate that the LiveCD is ready for installing NCNs.
Observe the output of the checks and note any failures, then remediate them.
pit# csi pit validate --livecd-preflight
Notes:
- This check sometimes leaves the terminal in a state where input is not echoed to the screen. If this happens, running the reset command will correct it.
- Ignore any errors about not being able to resolve arti.dev.cray.com.
Print the available consoles.
pit# conman -q
Expected output looks similar to the following:
ncn-m001-mgmt
ncn-m002-mgmt
ncn-m003-mgmt
ncn-s001-mgmt
ncn-s002-mgmt
ncn-s003-mgmt
ncn-w001-mgmt
ncn-w002-mgmt
ncn-w003-mgmt
IMPORTANT: This is the administrator's last chance to run NCN pre-boot workarounds (the before-ncn-boot breakpoint).
NOTE: All console logs are located at /var/log/conman/console*
Boot all storage nodes except ncn-s001:
pit# grep -oP $stoken /etc/dnsmasq.d/statics.conf | grep -v "ncn-s001-" | sort -u | xargs -t -i ipmitool -I lanplus -U $USERNAME -E -H {} power on
Wait approximately 1 minute.
Boot ncn-s001:
pit# ipmitool -I lanplus -U $USERNAME -E -H ncn-s001-mgmt power on
Observe the installation through the console of ncn-s001-mgmt.
pit# conman -j ncn-s001-mgmt
From there, an administrator can watch the console output of the cloud-init scripts.
NOTE: Watch the storage node consoles carefully for error messages. If any are seen, consult Ceph-CSI Troubleshooting.
NOTE: If the nodes have PXE boot issues (for example, getting PXE errors, or not pulling the ipxe.efi binary), see PXE boot troubleshooting.
Boot the master and worker nodes.
Wait for storage nodes before booting Kubernetes master nodes and worker nodes.
NOTE: Once all storage nodes are up and the message ...sleeping 5 seconds until /etc/kubernetes/admin.conf appears on the ncn-s001 console, it is safe to proceed with booting the Kubernetes master nodes and worker nodes.
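The console log can also be checked for this message without attaching to the console. This is a sketch; the exact log file name under /var/log/conman/ may differ on a given system:
pit# grep "sleeping 5 seconds" /var/log/conman/console.ncn-s001-mgmt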
pit# grep -oP "($mtoken|$wtoken)" /etc/dnsmasq.d/statics.conf | sort -u | xargs -t -i ipmitool -I lanplus -U $USERNAME -E -H {} power on
Stop watching the console from ncn-s001.
Type the ampersand character and then the period character to exit from the conman session on ncn-s001.
&.
pit#
Wait, and observe the installation through the console of ncn-m002-mgmt:
Print the console name:
pit# conman -q | grep m002
Expected output looks similar to the following:
ncn-m002-mgmt
Then join the console:
pit# conman -j ncn-m002-mgmt
NOTE: If the nodes have PXE boot issues (for example, getting PXE errors, or not pulling the ipxe.efi binary), see PXE boot troubleshooting.
NOTE: If one of the master nodes seems hung waiting for the storage nodes to create a secret, check the storage node consoles for error messages. If any are found, consult Ceph CSI Troubleshooting.
Wait for the deployment to finish.
Refer to the timing of deployments described above. It should not take more than 60 minutes for the kubectl get nodes command to return output indicating that all of the master nodes and worker nodes (excluding the PIT node) booted from the LiveCD and are Ready.
pit# ssh ncn-m002
ncn-m002# kubectl get nodes -o wide
Expected output looks similar to the following:
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
ncn-m002 Ready master 14m v1.18.6 10.252.1.5 <none> SUSE Linux Enterprise High Performance Computing 15 SP2 5.3.18-24.43-default containerd://1.3.4
ncn-m003 Ready master 13m v1.18.6 10.252.1.6 <none> SUSE Linux Enterprise High Performance Computing 15 SP2 5.3.18-24.43-default containerd://1.3.4
ncn-w001 Ready <none> 6m30s v1.18.6 10.252.1.7 <none> SUSE Linux Enterprise High Performance Computing 15 SP2 5.3.18-24.43-default containerd://1.3.4
ncn-w002 Ready <none> 6m16s v1.18.6 10.252.1.8 <none> SUSE Linux Enterprise High Performance Computing 15 SP2 5.3.18-24.43-default containerd://1.3.4
ncn-w003 Ready <none> 5m58s v1.18.6 10.252.1.12 <none> SUSE Linux Enterprise High Performance Computing 15 SP2 5.3.18-24.43-default containerd://1.3.4
Stop watching the console of ncn-m002.
Type the ampersand character and then the period character to exit from the conman session on ncn-m002.
&.
pit#
Enable passwordless SSH for the PIT node.
Copy SSH files from ncn-m002 to the PIT node.
When the following command prompts for a password, enter the root password for ncn-m002.
pit# rsync -av ncn-m002:.ssh/ /root/.ssh/
Expected output looks similar to the following:
Password:
receiving incremental file list
./
authorized_keys
id_rsa
id_rsa.pub
known_hosts
sent 145 bytes received 13,107 bytes 3,786.29 bytes/sec
total size is 12,806 speedup is 0.97
Make a list of all of the NCNs (including ncn-m001).
pit# NCNS=$(grep -oP "ncn-[msw][0-9]{3}" /etc/dnsmasq.d/statics.conf | sort -u | tr '\n' ',') ; echo "${NCNS}"
Expected output looks similar to the following:
ncn-m001,ncn-m002,ncn-m003,ncn-s001,ncn-s002,ncn-s003,ncn-w001,ncn-w002,ncn-w003,
Verify that passwordless SSH is now working from the PIT node to the other NCNs.
The following command should not prompt for a password.
pit# PDSH_SSH_ARGS_APPEND='-o StrictHostKeyChecking=no' pdsh -Sw "${NCNS}" date && echo SUCCESS || echo ERROR
Expected output looks similar to the following:
ncn-w001: Warning: Permanently added 'ncn-w001,10.252.1.7' (ECDSA) to the list of known hosts.
ncn-w003: Warning: Permanently added 'ncn-w003,10.252.1.9' (ECDSA) to the list of known hosts.
ncn-m003: Warning: Permanently added 'ncn-m003,10.252.1.6' (ECDSA) to the list of known hosts.
ncn-s002: Warning: Permanently added 'ncn-s002,10.252.1.11' (ECDSA) to the list of known hosts.
ncn-m001: Warning: Permanently added 'ncn-m001,10.252.1.4' (ECDSA) to the list of known hosts.
ncn-w002: Warning: Permanently added 'ncn-w002,10.252.1.8' (ECDSA) to the list of known hosts.
ncn-m002: Warning: Permanently added 'ncn-m002,10.252.1.5' (ECDSA) to the list of known hosts.
ncn-s003: Warning: Permanently added 'ncn-s003,10.252.1.12' (ECDSA) to the list of known hosts.
ncn-s001: Warning: Permanently added 'ncn-s001,10.252.1.10' (ECDSA) to the list of known hosts.
ncn-s003: Thu 28 Apr 2022 02:43:21 PM UTC
ncn-s001: Thu 28 Apr 2022 02:43:21 PM UTC
ncn-s002: Thu 28 Apr 2022 02:43:21 PM UTC
ncn-m001: Thu 28 Apr 2022 02:43:21 PM UTC
ncn-m003: Thu 28 Apr 2022 02:43:21 PM UTC
ncn-m002: Thu 28 Apr 2022 02:43:21 PM UTC
ncn-w001: Thu 28 Apr 2022 02:43:22 PM UTC
ncn-w002: Thu 28 Apr 2022 02:43:22 PM UTC
ncn-w003: Thu 28 Apr 2022 02:43:22 PM UTC
SUCCESS
Validate that the expected LVM labels are present on disks on the master and worker nodes. Run the following command on the PIT node; when it prompts for a password, enter the root password for ncn-m002.
pit# /usr/share/doc/csm/install/scripts/check_lvm.sh
Expected output looks similar to the following:
When prompted, please enter the NCN password for ncn-m002
Warning: Permanently added 'ncn-m002,10.252.1.11' (ECDSA) to the list of known hosts.
Password:
Checking ncn-m002...
ncn-m002: OK
Checking ncn-m003...
Warning: Permanently added 'ncn-m003,10.252.1.10' (ECDSA) to the list of known hosts.
Warning: Permanently added 'ncn-m003,10.252.1.10' (ECDSA) to the list of known hosts.
ncn-m003: OK
Checking ncn-w001...
Warning: Permanently added 'ncn-w001,10.252.1.9' (ECDSA) to the list of known hosts.
Warning: Permanently added 'ncn-w001,10.252.1.9' (ECDSA) to the list of known hosts.
ncn-w001: OK
Checking ncn-w002...
Warning: Permanently added 'ncn-w002,10.252.1.8' (ECDSA) to the list of known hosts.
Warning: Permanently added 'ncn-w002,10.252.1.8' (ECDSA) to the list of known hosts.
ncn-w002: OK
Checking ncn-w003...
Warning: Permanently added 'ncn-w003,10.252.1.7' (ECDSA) to the list of known hosts.
Warning: Permanently added 'ncn-w003,10.252.1.7' (ECDSA) to the list of known hosts.
ncn-w003: OK
SUCCESS: LVM checks passed on all master and worker NCNs
If the check succeeds, skip the manual check procedure and recovery steps.
If the check fails for any nodes, the problem must be resolved before continuing. See LVM Check Failure Recovery.
If needed, the LVM checks can be performed manually on the master and worker nodes.
Manual check on master nodes:
ncn-m# blkid -L ETCDLVM
/dev/sdc
Manual check on worker nodes:
ncn-w# blkid -L CONLIB
/dev/sdb2
ncn-w# blkid -L CONRUN
/dev/sdb1
ncn-w# blkid -L K8SLET
/dev/sdb3
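If passwordless SSH from the PIT node is already in place, the same manual checks can be run in bulk with pdsh. This is a sketch; adjust the host lists to match the system:
pit# pdsh -w ncn-m002,ncn-m003 'blkid -L ETCDLVM'
pit# pdsh -w ncn-w001,ncn-w002,ncn-w003 'blkid -L CONLIB; blkid -L CONRUN; blkid -L K8SLET'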
The manual checks are considered successful if all of the blkid commands report a disk device (such as /dev/sdc; the particular device is unimportant).
If any of the blkid commands return no output, then the check is a failure. Any failures must be resolved before continuing. See LVM Check Failure Recovery below for details on how to do so.
If there are LVM check failures, then the problem must be resolved before continuing with the install.
If any master node has the problem, then wipe and redeploy all of the NCNs (except ncn-m001, because it is the PIT node) before continuing the installation, using the 'Basic Wipe' section of Wipe NCN Disks for Reinstallation.
If only worker nodes have the problem, then wipe and redeploy only the affected worker nodes before continuing the installation. When repeating the earlier boot steps, the ipmitool command will give errors trying to power on the unaffected nodes, because they are already powered on; this is expected and not a problem.
IMPORTANT: Do the following if the NCNs are Gigabyte hardware. It is suggested (but optional) for HPE NCNs.
IMPORTANT: Estimate the expected number of OSDs using the following table and this equation:

total_osds = (number of utility storage/Ceph nodes) * (OSD count from the table below for the appropriate hardware)

Hardware Manufacturer | OSD Drive Count (not including OS drives)
----------------------|------------------------------------------
GigaByte              | 12
HPE                   | 8
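For example, a system with three HPE utility storage nodes would be expected to have total_osds = 3 * 8 = 24, which matches the example ceph osd stat output shown below.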
If there are OSDs on each node (ceph osd tree can show this), then all of the nodes are in Ceph. That means the orchestrator can be used to look for the devices.
Get the number of OSDs in the cluster.
ncn-s# ceph -f json-pretty osd stat |jq .num_osds
24
IMPORTANT: If the returned number of OSDs is equal to the total_osds calculated, then skip the following steps. If not, then proceed with the additional checks and remediation steps below.
Compare the number of OSDs to the output (which should resemble the example below). The number of drives will depend on the server hardware.
NOTE: If the Ceph cluster is large and has a lot of nodes, a node may be specified after the below command to limit the results.
ncn-s# ceph orch device ls
Hostname Path Type Serial Size Health Ident Fault Available
ncn-s001 /dev/sda ssd PHYF015500M71P9DGN 1920G Unknown N/A N/A No
ncn-s001 /dev/sdb ssd PHYF016500TZ1P9DGN 1920G Unknown N/A N/A No
ncn-s001 /dev/sdc ssd PHYF016402EB1P9DGN 1920G Unknown N/A N/A No
ncn-s001 /dev/sdd ssd PHYF016504831P9DGN 1920G Unknown N/A N/A No
ncn-s001 /dev/sde ssd PHYF016500TV1P9DGN 1920G Unknown N/A N/A No
ncn-s001 /dev/sdf ssd PHYF016501131P9DGN 1920G Unknown N/A N/A No
ncn-s001 /dev/sdi ssd PHYF016500YB1P9DGN 1920G Unknown N/A N/A No
ncn-s001 /dev/sdj ssd PHYF016500WN1P9DGN 1920G Unknown N/A N/A No
ncn-s002 /dev/sda ssd PHYF0155006W1P9DGN 1920G Unknown N/A N/A No
ncn-s002 /dev/sdb ssd PHYF0155006Z1P9DGN 1920G Unknown N/A N/A No
ncn-s002 /dev/sdc ssd PHYF015500L61P9DGN 1920G Unknown N/A N/A No
ncn-s002 /dev/sdd ssd PHYF015502631P9DGN 1920G Unknown N/A N/A No
ncn-s002 /dev/sde ssd PHYF0153000G1P9DGN 1920G Unknown N/A N/A No
ncn-s002 /dev/sdf ssd PHYF016401T41P9DGN 1920G Unknown N/A N/A No
ncn-s002 /dev/sdi ssd PHYF016504C21P9DGN 1920G Unknown N/A N/A No
ncn-s002 /dev/sdj ssd PHYF015500GQ1P9DGN 1920G Unknown N/A N/A No
ncn-s003 /dev/sda ssd PHYF016402FP1P9DGN 1920G Unknown N/A N/A No
ncn-s003 /dev/sdb ssd PHYF016401TE1P9DGN 1920G Unknown N/A N/A No
ncn-s003 /dev/sdc ssd PHYF015500N51P9DGN 1920G Unknown N/A N/A No
ncn-s003 /dev/sdd ssd PHYF0165010Z1P9DGN 1920G Unknown N/A N/A No
ncn-s003 /dev/sde ssd PHYF016500YR1P9DGN 1920G Unknown N/A N/A No
ncn-s003 /dev/sdf ssd PHYF016500X01P9DGN 1920G Unknown N/A N/A No
ncn-s003 /dev/sdi ssd PHYF0165011H1P9DGN 1920G Unknown N/A N/A No
ncn-s003 /dev/sdj ssd PHYF016500TQ1P9DGN 1920G Unknown N/A N/A No
If there are devices that show Available as Yes and they are not being automatically added, those devices may need to be zapped.
IMPORTANT: Prior to zapping any device, ensure that it is not being used.
Check to see if the number of devices is less than the number of drives listed in the ceph orch device ls output above.
ncn-s# ceph orch device ls|grep dev|wc -l
24
If the numbers are equal, but less than the total_osds calculated, then the ceph-mgr daemon may need to be failed in order to get a fresh inventory.
ncn-s# ceph mgr fail $(ceph mgr dump | jq -r .active_name)
Wait 5 minutes and then re-check ceph orch device ls. See if the drives are still showing as Available. If so, then proceed to the next step.
SSH to the host, look at the lsblk output, and check it against the devices from the ceph orch device ls output above.
ncn-s# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
loop0 7:0 0 4.2G 1 loop /run/rootfsbase
loop1 7:1 0 30G 0 loop
└─live-overlay-pool 254:8 0 300G 0 dm
loop2 7:2 0 300G 0 loop
└─live-overlay-pool 254:8 0 300G 0 dm
sda 8:0 0 1.8T 0 disk
└─ceph--0a476f53--8b38--450d--8779--4e587402f8a8-osd--data--b620b7ef--184a--46d7--9a99--771239e7a323 254:7 0 1.8T 0 lvm
Log in to each ncn-s node and check for unused drives.
ncn-s# cephadm shell -- ceph-volume inventory
IMPORTANT: The cephadm command may output the warning 'WARNING: The same type, major and minor should not be used for multiple devices.' Ignore this warning.
The available field will be True if Ceph sees the drive as empty and usable. For example:
Device Path Size rotates available Model name
/dev/sda 447.13 GB False False SAMSUNG MZ7LH480
/dev/sdb 447.13 GB False False SAMSUNG MZ7LH480
/dev/sdc 3.49 TB False False SAMSUNG MZ7LH3T8
/dev/sdd 3.49 TB False False SAMSUNG MZ7LH3T8
/dev/sde 3.49 TB False False SAMSUNG MZ7LH3T8
/dev/sdf 3.49 TB False False SAMSUNG MZ7LH3T8
/dev/sdg 3.49 TB False False SAMSUNG MZ7LH3T8
/dev/sdh 3.49 TB False False SAMSUNG MZ7LH3T8
Alternatively, just dump the paths of available drives:
ncn-s# cephadm shell -- ceph-volume inventory --format json-pretty | jq -r '.[]|select(.available==true)|.path'
Wipe the drive ONLY after confirming that it is not being used by the current Ceph cluster, using one or both of the checks above.
The following example wipes drive /dev/sdc on ncn-s002. Replace these values with the appropriate ones for the situation.
ncn-s# ceph orch device zap ncn-s002 /dev/sdc --force
Add unused drives.
ncn-s# cephadm shell -- ceph-volume lvm create --data /dev/sd<drive to add> --bluestore
More information can be found on the cephadm reference page.
Follow the workaround instructions for the after-ncn-boot breakpoint.
After the management nodes have been deployed, configuration can be applied to the booted nodes.
The LiveCD needs to authenticate with the cluster to facilitate the rest of the CSM installation.
Determine which master node is the first master node.
Most often, the first master node will be ncn-m002.
Run the following commands on the PIT node to extract the value of the first-master-hostname field from the /var/www/ephemeral/configs/data.json file:
pit# FM=$(cat /var/www/ephemeral/configs/data.json | jq -r '."Global"."meta-data"."first-master-hostname"')
pit# echo $FM
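Expected output is the hostname of the first master node, which is typically:
ncn-m002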
Copy the Kubernetes configuration file from that node to the LiveCD in order to be able to use kubectl as the cluster administrator.
Run the following commands on the PIT node:
pit# mkdir -v ~/.kube
pit# scp ${FM}.nmn:/etc/kubernetes/admin.conf ~/.kube/config
Validate that kubectl commands run successfully from the PIT node.
pit# kubectl get nodes -o wide
Expected output looks similar to the following:
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
ncn-m002 Ready master 14m v1.18.6 10.252.1.5 <none> SUSE Linux Enterprise High Performance Computing 15 SP2 5.3.18-24.43-default containerd://1.3.4
ncn-m003 Ready master 13m v1.18.6 10.252.1.6 <none> SUSE Linux Enterprise High Performance Computing 15 SP2 5.3.18-24.43-default containerd://1.3.4
ncn-w001 Ready <none> 6m30s v1.18.6 10.252.1.7 <none> SUSE Linux Enterprise High Performance Computing 15 SP2 5.3.18-24.43-default containerd://1.3.4
ncn-w002 Ready <none> 6m16s v1.18.6 10.252.1.8 <none> SUSE Linux Enterprise High Performance Computing 15 SP2 5.3.18-24.43-default containerd://1.3.4
ncn-w003 Ready <none> 5m58s v1.18.6 10.252.1.12 <none> SUSE Linux Enterprise High Performance Computing 15 SP2 5.3.18-24.43-default containerd://1.3.4
After the NCNs are booted, the BGP peers will need to be checked and updated if the neighbor IP addresses are incorrect on the switches. Follow the steps below and see Update BGP Neighbors for more details on the BGP configuration.
IMPORTANT: If the management switches are using the CANU-generated configuration for CSM 1.0 (the CSM 1.2 Preconfig), then this procedure should be skipped.
To check if the management switches are using the CANU-generated configuration for CSM 1.0 (the CSM 1.2 Preconfig), log in to both spine switches and see if a login banner exists. It should look similar to the examples below. The CSM version must be 1.0, and CANU should be present showing a version. An accurate login banner for Mellanox and Aruba will look similar to the following examples:
Mellanox Example:
ncn-m001# ssh admin@sw-spine-001
NVIDIA Onyx Switch Management
Password:
Last login: Sat Feb 26 00:10:26 UTC 2022 from 10.252.1.5 on pts/0
Number of total successful connections since last 1 days: 89
###############################################################################
# CSM version: 1.0
# CANU version: 1.1.11
###############################################################################
Aruba Example:
ncn-m001# ssh admin@sw-spine-001
###############################################################################
# CSM version: 1.0
# CANU version: 1.1.11
###############################################################################
Make sure the SYSTEM_NAME variable is set to the name of your system.
pit# export SYSTEM_NAME=eniac
Determine the IP addresses of the worker NCNs.
pit# grep -B1 "name: ncn-w" /var/www/ephemeral/prep/${SYSTEM_NAME}/networks/NMN.yaml
Determine the IP addresses of the switches that are peering.
pit# grep peer-address /var/www/ephemeral/prep/${SYSTEM_NAME}/metallb.yaml
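Expected output looks similar to the following (the actual peer addresses vary by system):
- peer-address: 10.252.0.2
- peer-address: 10.252.0.3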
Run the script appropriate for your switch hardware vendor.
If you have Mellanox switches, run the BGP helper script.
The BGP helper script requires three parameters: the IP address of switch 1, the IP address of switch 2, and the path to the CSI-generated network files (CAN.yaml, HMN.yaml, HMNLB.yaml, NMNLB.yaml, and NMN.yaml). The path must include the SYSTEM_NAME.
The IP addresses in this example should be replaced by the IP addresses of the switches.
pit# /usr/local/bin/mellanox_set_bgp_peers.py 10.252.0.2 10.252.0.3 /var/www/ephemeral/prep/${SYSTEM_NAME}/networks/
*WARNING*
The mellanox_set_bgp_peers.py script assumes that the prefix length of the CAN is /24. If that value is incorrect for the system being installed, then update the script with the correct prefix length by editing the following line:
cmd_prefix_list_can = "ip prefix-list pl-can seq 30 permit {} /24 ge 24".format()
If you have Aruba switches, run CANU.
CANU requires three parameters: the IP address of switch 1, the IP address of switch 2, and the path to the directory containing the file sls_input_file.json.
The IP addresses in this example should be replaced by the IP addresses of the switches.
pit# canu -s 1.5 config bgp --ips 10.252.0.2,10.252.0.3 --csi-folder /var/www/ephemeral/prep/${SYSTEM_NAME}/
Do the following steps for each of the switch IP addresses that you found previously.
Log in to the switch as the admin user:
pit# ssh admin@<switch_ip_address>
Check the status of the BGP peering sessions.
Aruba: show bgp ipv4 unicast summary
Mellanox: show ip bgp summary
You should see a neighbor for each of the worker NCN IP addresses found above. If it is an Aruba switch, you will also see a neighbor for the other switch of the peering pair.
At this point, the peering sessions with the worker IP addresses should be in the IDLE, CONNECT, or ACTIVE state (not ESTABLISHED). This is because the MetalLB speaker pods are not deployed yet.
You should see that the MsgRcvd and MsgSent columns for the worker IP addresses are 0.
Check the BGP configuration to verify that the NCN neighbors are configured as passive.
Aruba: show run bgp
The passive neighbor configuration is required, for example: neighbor 10.252.1.7 passive
EXAMPLE ONLY
sw-spine-001# show run bgp
router bgp 65533
bgp router-id 10.252.0.2
maximum-paths 8
distance bgp 20 70
neighbor 10.252.0.3 remote-as 65533
neighbor 10.252.1.7 remote-as 65533
neighbor 10.252.1.7 passive
neighbor 10.252.1.8 remote-as 65533
neighbor 10.252.1.8 passive
neighbor 10.252.1.9 remote-as 65533
neighbor 10.252.1.9 passive
Mellanox: show run protocol bgp
The passive neighbor configuration is required, for example: router bgp 65533 vrf default neighbor 10.252.1.7 transport connection-mode passive
EXAMPLE ONLY
protocol bgp
router bgp 65533 vrf default
router bgp 65533 vrf default router-id 10.252.0.2 force
router bgp 65533 vrf default maximum-paths ibgp 32
router bgp 65533 vrf default neighbor 10.252.1.7 remote-as 65533
router bgp 65533 vrf default neighbor 10.252.1.7 route-map ncn-w003
router bgp 65533 vrf default neighbor 10.252.1.8 remote-as 65533
router bgp 65533 vrf default neighbor 10.252.1.8 route-map ncn-w002
router bgp 65533 vrf default neighbor 10.252.1.9 remote-as 65533
router bgp 65533 vrf default neighbor 10.252.1.9 route-map ncn-w001
router bgp 65533 vrf default neighbor 10.252.1.7 transport connection-mode passive
router bgp 65533 vrf default neighbor 10.252.1.8 transport connection-mode passive
router bgp 65533 vrf default neighbor 10.252.1.9 transport connection-mode passive
Repeat the previous steps for the remaining switch IP addresses.
Run the following commands on the PIT node.
pit# export CSM_RELEASE=csm-x.y.z
pit# pushd /var/www/ephemeral && ${CSM_RELEASE}/lib/install-goss-tests.sh && popd
Run the following command on the PIT node to remove the default pool, which can cause contention issues with NTP. When it prompts for a password, enter the root password for ncn-m002.
pit# ssh ncn-m002 "\
PDSH_SSH_ARGS_APPEND='-o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null' \
pdsh -b -S -w $(grep -oP 'ncn-\w\d+' /etc/dnsmasq.d/statics.conf |
grep -v m001 | sort -u | tr -t '\n' ,) \
sed -i \'s/^! pool pool[.]ntp[.]org.*//\' /etc/chrony.conf"
Expected output looks similar to the following:
Password:
ncn-m002: Warning: Permanently added 'ncn-m002,10.252.1.11' (ECDSA) to the list of known hosts.
ncn-s001: Warning: Permanently added 'ncn-s001,10.252.1.6' (ECDSA) to the list of known hosts.
ncn-s002: Warning: Permanently added 'ncn-s002,10.252.1.5' (ECDSA) to the list of known hosts.
ncn-s003: Warning: Permanently added 'ncn-s003,10.252.1.4' (ECDSA) to the list of known hosts.
ncn-m003: Warning: Permanently added 'ncn-m003,10.252.1.10' (ECDSA) to the list of known hosts.
ncn-w002: Warning: Permanently added 'ncn-w002,10.252.1.8' (ECDSA) to the list of known hosts.
ncn-w001: Warning: Permanently added 'ncn-w001,10.252.1.9' (ECDSA) to the list of known hosts.
ncn-w003: Warning: Permanently added 'ncn-w003,10.252.1.7' (ECDSA) to the list of known hosts.
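Optionally, spot-check one node to confirm that the default pool entry was removed. This is a sketch; it simply greps for the line that the sed command above blanks out:
pit# ssh ncn-w001 'grep "^! pool" /etc/chrony.conf || echo "default pool entry removed"'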
Perform all of the validation steps. The optional validation steps are manual steps that may be skipped.
The following csi pit validate commands will run a series of remote tests on the other nodes to validate they are healthy and configured correctly.
Observe the output of the checks and note any failures, then remediate them.
Check the storage nodes.
pit# csi pit validate --ceph | tee csi-pit-validate-ceph.log
Once that command has finished, the following will extract the test totals reported for each node:
pit# grep "Total Test" csi-pit-validate-ceph.log
Example output for a system with three storage nodes:
Total Tests: 8, Total Passed: 8, Total Failed: 0, Total Execution Time: 74.3782 seconds
Total Tests: 3, Total Passed: 3, Total Failed: 0, Total Execution Time: 0.6091 seconds
Total Tests: 3, Total Passed: 3, Total Failed: 0, Total Execution Time: 0.6260 seconds
If these total lines report any failed tests, then look through the full output of the test in csi-pit-validate-ceph.log to see which node had the failed test and what the details are for that test.
Note: See Utility Storage in order to help resolve any failed tests.
Check the master and worker nodes.
Note: Throughout the output of the csi pit validate command are test totals for each node where the tests run. Be sure to check all of them and not just the final one. A grep command is provided to help with this.
pit# csi pit validate --k8s | tee csi-pit-validate-k8s.log
Once that command has finished, the following will extract the test totals reported for each node:
pit# grep "Total Test" csi-pit-validate-k8s.log
Example output for a system with five master and worker nodes (excluding the PIT node):
Total Tests: 16, Total Passed: 16, Total Failed: 0, Total Execution Time: 0.3072 seconds
Total Tests: 16, Total Passed: 16, Total Failed: 0, Total Execution Time: 0.2727 seconds
Total Tests: 12, Total Passed: 12, Total Failed: 0, Total Execution Time: 0.2841 seconds
Total Tests: 12, Total Passed: 12, Total Failed: 0, Total Execution Time: 0.3622 seconds
Total Tests: 12, Total Passed: 12, Total Failed: 0, Total Execution Time: 0.2353 seconds
If these total lines report any failed tests, then look through the full output of the test in csi-pit-validate-k8s.log to see which node had the failed test and what the details are for that test.
WARNING: If there are failures for tests with names like Worker Node CONLIB FS Label, then manual tests should be run on the node which reported the failure. See Manual LVM Check Procedure. If the manual tests fail, then the problem must be resolved before continuing to the next step. See LVM Check Failure Recovery.
Ensure that weave has not split-brained
To ensure that weave is operating as a single cluster, run the following command on each member of the Kubernetes cluster (master nodes and worker nodes but not the PIT node):
ncn# weave --local status connections | grep failed
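If passwordless SSH from the PIT node is already in place, the same check can be run across all Kubernetes NCNs at once with pdsh. This is a sketch; adjust the host list to match the system:
pit# pdsh -w ncn-m00[2-3],ncn-w00[1-3] 'weave --local status connections | grep failed'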
If the check is successful, there will be no output. If messages like IP allocation was seeded by different peers are seen, then weave appears to be split-brained.
If weave has split-brained, then it is necessary to wipe the NCNs (see Wipe NCN Disks for Reinstallation) and start the PXE boot again.
Verify that all the pods in the kube-system namespace are Running or Completed.
Run the following command on any Kubernetes master or worker node, or the PIT node:
ncn-mw/pit# kubectl get pods -o wide -n kube-system | grep -Ev '(Running|Completed)'
If any pods are listed by this command, it means they are not in the Running or Completed state. That needs to be investigated before proceeding.
Verify that the ceph-csi requirements are in place.
See Ceph CSI Troubleshooting for details.
Before proceeding, be aware that this is the last point at which the other NCNs can be rebuilt without also having to rebuild the PIT node. Therefore, take time to double-check both the cluster and the validation test results.
After completing the deployment of the management nodes, the next step is to install the CSM services.
See Install CSM Services.