The following procedure deploys Linux and Kubernetes software to the management NCNs. Deployment of the nodes starts with booting the storage nodes, followed by the master nodes and worker nodes together.
After the operating system boots on each node, there are some configuration actions which take place. Watching the console or the console log for certain nodes can help to understand what happens and when. When the process completes for all nodes, the Ceph storage is initialized and the Kubernetes cluster is created and ready for a workload. The PIT node will join Kubernetes after it is rebooted later in Deploy Final NCN.
The timing of each set of boots varies with the hardware: nodes from some manufacturers POST faster than others, and POST time also depends on BIOS settings. After powering on a set of nodes, an administrator can expect a healthy boot session to take about 60 minutes, depending on the number of storage and worker nodes.
Preparation of the environment must be done before attempting to deploy the management nodes.
(pit#) Define shell environment variables that will simplify later commands to deploy management nodes.
Set USERNAME and IPMI_PASSWORD to the credentials for the NCN BMCs.
read -s is used to prevent the password from being written to the screen or the shell history.
USERNAME=root
read -r -s -p "NCN BMC ${USERNAME} password: " IPMI_PASSWORD
Set the remaining helper variables.
These values do not need to be altered from what is shown.
export IPMI_PASSWORD ; mtoken='ncn-m(?!001)\w+-mgmt' ; stoken='ncn-s\w+-mgmt' ; wtoken='ncn-w\w+-mgmt'
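To see which hostnames the three patterns select, they can be tried against a small sample file. The following is only an illustrative sketch: the sample entries are hypothetical stand-ins for the contents of /etc/dnsmasq.d/statics.conf, whose real format may differ. It shows that the negative lookahead in mtoken skips ncn-m001 and that sort -u collapses repeated names.

```shell
# Hypothetical sample standing in for /etc/dnsmasq.d/statics.conf.
sample="$(mktemp)"
cat > "${sample}" <<'EOF'
host-record=ncn-m001-mgmt,10.254.1.4
host-record=ncn-m002-mgmt,10.254.1.6
host-record=ncn-s001-mgmt,10.254.1.8
host-record=ncn-w001-mgmt,10.254.1.10
host-record=ncn-w001-mgmt,ncn-w001-mgmt.local
EOF

# Same patterns as in the procedure.
mtoken='ncn-m(?!001)\w+-mgmt' ; stoken='ncn-s\w+-mgmt' ; wtoken='ncn-w\w+-mgmt'

# -o prints only the matching text, -P enables the (?!...) lookahead.
matched="$(grep -oP "(${mtoken}|${stoken}|${wtoken})" "${sample}" | sort -u)"
echo "${matched}"

rm -f "${sample}"
```

With this sample, ncn-m001-mgmt is not listed and ncn-w001-mgmt appears only once.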
(pit#) If the NCNs are HPE hardware, then ensure that DCMI/IPMI is enabled.
This will enable ipmitool usage with the BMCs.
/root/bin/bios-baseline.sh
(pit#) Check power status of all NCNs.
grep -oP "(${mtoken}|${stoken}|${wtoken})" /etc/dnsmasq.d/statics.conf | sort -u |
xargs -t -i ipmitool -I lanplus -U "${USERNAME}" -E -H {} power status
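The same grep, sort, and xargs pipeline is repeated in most of the following steps, with only the hostname pattern and the ipmitool subcommand changing. One possible way to cut down on typing is a small wrapper such as the sketch below. The function name ncn_ipmi and the STATICS and IPMITOOL override variables are hypothetical and are not part of the installed tooling; they exist only so that the helper can be tried as a dry run.

```shell
# ncn_ipmi PATTERN ARGS... : run "ipmitool ARGS" against every BMC whose
# hostname matches PATTERN. Hypothetical helper, not shipped with the product.
# STATICS   overrides the file that is searched for hostnames.
# IPMITOOL  overrides the command that is run (IPMITOOL=echo gives a dry run).
ncn_ipmi() {
    local pattern="$1"
    shift
    grep -oP "${pattern}" "${STATICS:-/etc/dnsmasq.d/statics.conf}" | sort -u |
        xargs -t -i ${IPMITOOL:-ipmitool} -I lanplus -U "${USERNAME}" -E -H {} "$@"
}

# Usage, with the variables defined earlier in this procedure:
#   ncn_ipmi "(${mtoken}|${stoken}|${wtoken})" power status
#   IPMITOOL=echo ncn_ipmi "${stoken}" power off    # print instead of run
```

The commands in this procedure are written out in full so that each step can be copied on its own; the wrapper is only a convenience.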
(pit#) Power off all NCNs.
grep -oP "(${mtoken}|${stoken}|${wtoken})" /etc/dnsmasq.d/statics.conf | sort -u |
xargs -t -i ipmitool -I lanplus -U "${USERNAME}" -E -H {} power off
(pit#) Clear CMOS; ensure default settings are applied to all NCNs.
NOTE: Gigabyte Servers and Intel Servers should SKIP THIS STEP.
Resetting the CMOS will:
grep -oP "(${mtoken}|${stoken}|${wtoken})" /etc/dnsmasq.d/statics.conf | sort -u |
xargs -t -i ipmitool -I lanplus -U "${USERNAME}" -E -H {} chassis bootdev none options=clear-cmos
(pit#) Boot NCNs to BIOS to allow the CMOS to reinitialize.
grep -oP "(${mtoken}|${stoken}|${wtoken})" /etc/dnsmasq.d/statics.conf | sort -u |
xargs -t -i ipmitool -I lanplus -U "${USERNAME}" -E -H {} chassis bootdev bios options=efiboot
grep -oP "(${mtoken}|${stoken}|${wtoken})" /etc/dnsmasq.d/statics.conf | sort -u |
xargs -t -i ipmitool -I lanplus -U "${USERNAME}" -E -H {} power on
(pit#) Run bios-baseline.sh.
NOTE: For HPE servers, this should still be done, even though it was already run earlier in the procedure.
/root/bin/bios-baseline.sh
(pit#) Power off the nodes.
grep -oP "(${mtoken}|${stoken}|${wtoken})" /etc/dnsmasq.d/statics.conf | sort -u |
xargs -t -i ipmitool -I lanplus -U "${USERNAME}" -E -H {} power off
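A power off request is not instantaneous, so it can be useful to confirm that a BMC reports the chassis as off before moving on. The sketch below polls power status for one host. It assumes that ipmitool prints a line containing "Chassis Power is off" for a powered-off node; the function name wait_for_power_off and the IPMITOOL and POLL_SECONDS variables are hypothetical.

```shell
# wait_for_power_off HOST [TRIES] : poll the BMC until it reports power off.
# Returns 0 once the node is off, 1 if it is still on after TRIES polls.
# IPMITOOL and POLL_SECONDS are hypothetical overrides used for testing.
wait_for_power_off() {
    local host="$1" tries="${2:-30}" status
    while [ "${tries}" -gt 0 ]; do
        status="$(${IPMITOOL:-ipmitool} -I lanplus -U "${USERNAME}" -E -H "${host}" power status 2>/dev/null || true)"
        case "${status}" in
            *"is off"*) return 0 ;;
        esac
        tries=$((tries - 1))
        sleep "${POLL_SECONDS:-5}"
    done
    return 1
}

# Usage:
#   wait_for_power_off ncn-w001-mgmt || echo "ncn-w001-mgmt is still powered on" >&2
```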
Deployment of the nodes starts with booting the storage nodes. The master nodes and worker nodes are then booted together. After the operating system boots on each node, some configuration actions take place. Watching the console or the console log of certain nodes can help in understanding what happens and when. When the process is complete for all nodes, the Ceph storage will have been initialized and the Kubernetes cluster will have been created and be ready for a workload.
(pit#) Customize boot scripts for any out-of-baseline NCNs if needed (see below).
The boot scripts are /var/www/ncn-*/script.ipxe; if any are changed, back them up (e.g. tar -czvf $SYSTEM_NAME-boot-scripts.tar.gz /var/www/ncn-*/script.ipxe).
(pit#) Set each node to always UEFI network boot, and ensure that they are powered off.
grep -oP "(${mtoken}|${stoken}|${wtoken})" /etc/dnsmasq.d/statics.conf | sort -u | xargs -t -i ipmitool -I lanplus -U "${USERNAME}" -E -H {} chassis bootdev pxe options=persistent
grep -oP "(${mtoken}|${stoken}|${wtoken})" /etc/dnsmasq.d/statics.conf | sort -u | xargs -t -i ipmitool -I lanplus -U "${USERNAME}" -E -H {} chassis bootdev pxe options=efiboot
grep -oP "(${mtoken}|${stoken}|${wtoken})" /etc/dnsmasq.d/statics.conf | sort -u | xargs -t -i ipmitool -I lanplus -U "${USERNAME}" -E -H {} power off
NOTE: The NCN boot order is further explained in NCN Boot Workflow.
(pit#) Boot the storage NCNs.
grep -oP "${stoken}" /etc/dnsmasq.d/statics.conf | sort -u | xargs -t -i ipmitool -I lanplus -U "${USERNAME}" -E -H {} power on
(pit#) Observe the installation through the console of ncn-s001-mgmt.
conman -j ncn-s001-mgmt
From there, an administrator can witness console output for the cloud-init scripts.
NOTES:
- Watch the storage node consoles carefully for error messages. If any are seen, consult Ceph-CSI Troubleshooting.
- If the nodes have PXE boot issues (for example, getting PXE errors, or not pulling the ipxe.efi binary), then see PXE boot troubleshooting.
(pit#) Wait for storage nodes to output the following before booting Kubernetes master nodes and worker nodes.
...sleeping 5 seconds until /etc/kubernetes/admin.conf
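Instead of watching the console by eye, the wait can be scripted against a console log file. The following is a minimal sketch that polls a file for a fixed string. The function name wait_for_line and the POLL_SECONDS variable are hypothetical, and the log path in the usage comment is an assumption about where conman writes its console logs on the PIT node.

```shell
# wait_for_line FILE TEXT [TRIES] : poll FILE until it contains the fixed
# string TEXT. Returns 0 when found, 1 if TRIES polls pass without a match.
wait_for_line() {
    local file="$1" text="$2" tries="${3:-120}"
    while [ "${tries}" -gt 0 ]; do
        # -F treats TEXT literally, so the dots and slashes need no escaping.
        if grep -qF -- "${text}" "${file}" 2>/dev/null; then
            return 0
        fi
        tries=$((tries - 1))
        sleep "${POLL_SECONDS:-5}"
    done
    return 1
}

# Usage (the log path is an assumption, adjust it to the local conman setup):
#   wait_for_line /var/log/conman/console.ncn-s001-mgmt \
#       'sleeping 5 seconds until /etc/kubernetes/admin.conf'
```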
(pit#) Boot the Kubernetes NCNs.
grep -oP "(${mtoken}|${wtoken})" /etc/dnsmasq.d/statics.conf | sort -u | xargs -t -i ipmitool -I lanplus -U "${USERNAME}" -E -H {} power on
(pit#) Start watching the first Kubernetes master’s console.
Either stop watching ncn-s001-mgmt before doing this, or do it in a different window.
NOTE: To exit a conman console, press & followed by a . (e.g. keystroke &.)
Determine the first Kubernetes master.
FM=$(jq -r '."Global"."meta-data"."first-master-hostname"' "${PITDATA}"/configs/data.json)
echo ${FM}
Open its console.
conman -j "${FM}-mgmt"
NOTES:
- If the nodes have PXE boot issues (e.g. getting PXE errors, not pulling the ipxe.efi binary), then see Troubleshooting PXE Boot.
- If one of the master nodes seems hung waiting for the storage nodes to create a secret, then check the storage node consoles for error messages. If any are found, then consult CEPH CSI Troubleshooting.
(pit#) Wait for the deployment to finish.
Wait for the first Kubernetes master to complete cloud-init.
The following text should appear in the console of the first Kubernetes master:
The system is finally up, after 995.71 seconds cloud-init has come to completion.
NOTES:
- The duration reported will vary.
- All NCNs should report the above text when they have completed their Ceph or Kubernetes installation.
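If console output has been saved to a file, the completion message can be picked out and its duration extracted with sed. This is a small sketch that relies only on the message text shown above; the function name cloud_init_seconds is hypothetical.

```shell
# cloud_init_seconds : read console output on stdin and print the number of
# seconds reported by each cloud-init completion message found in it.
cloud_init_seconds() {
    sed -n 's/.*The system is finally up, after \([0-9.]*\) seconds cloud-init has come to completion.*/\1/p'
}

# Usage:
#   cloud_init_seconds < saved-console-output.txt
```

No output means that the completion message has not been logged yet.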
Validate that all master and worker NCNs (except for ncn-m001) show up in the cluster.
Enter the root password for the first Kubernetes master node, if prompted.
ssh "${FM}" kubectl get nodes -o wide
Expected output looks similar to the following:
NAME       STATUS   ROLES                  AGE   VERSION    INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                  KERNEL-VERSION         CONTAINER-RUNTIME
ncn-m002   Ready    control-plane,master   2h    v1.20.13   10.252.1.5    <none>        SUSE Linux Enterprise High Performance Computing 15 SP3   5.3.18-59.19-default   containerd://1.5.7
ncn-m003   Ready    control-plane,master   2h    v1.20.13   10.252.1.6    <none>        SUSE Linux Enterprise High Performance Computing 15 SP3   5.3.18-59.19-default   containerd://1.5.7
ncn-w001   Ready    <none>                 2h    v1.20.13   10.252.1.7    <none>        SUSE Linux Enterprise High Performance Computing 15 SP3   5.3.18-59.19-default   containerd://1.5.7
ncn-w002   Ready    <none>                 2h    v1.20.13   10.252.1.8    <none>        SUSE Linux Enterprise High Performance Computing 15 SP3   5.3.18-59.19-default   containerd://1.5.7
ncn-w003   Ready    <none>                 2h    v1.20.13   10.252.1.9    <none>        SUSE Linux Enterprise High Performance Computing 15 SP3   5.3.18-59.19-default   containerd://1.5.7
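On larger systems it may be easier to list only the nodes that are not yet Ready than to scan the whole table. The sketch below filters kubectl get nodes output on its STATUS column. The function name not_ready_nodes is hypothetical; --no-headers is a standard kubectl option that suppresses the header row.

```shell
# not_ready_nodes : read "kubectl get nodes --no-headers" output on stdin and
# print the name of every node whose STATUS column is not exactly "Ready".
# A status such as "Ready,SchedulingDisabled" is therefore also reported.
not_ready_nodes() {
    awk '$2 != "Ready" { print $1 }'
}

# Usage:
#   ssh "${FM}" kubectl get nodes --no-headers | not_ready_nodes
```

No output means that every listed node reports Ready.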
(pit#) Stop watching the consoles.
Exit the first master’s console; also exit the console for ncn-s001, if it was left open.
NOTE: To exit a conman console, press & followed by a . (e.g. keystroke &.)
kubectl on the PIT
(pit#) This was done in a previous step, but if the user is resuming or starting here, then the first master needs to be redefined.
NOTE: This requires that the set reusable environment variables step was completed; PITDATA must be defined in the user's environment before continuing.
FM=$(jq -r '."Global"."meta-data"."first-master-hostname"' "${PITDATA}"/configs/data.json)
echo ${FM}
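When resuming here, it is easy to miss that FM was not set correctly: if PITDATA is undefined the jq command fails and FM is empty, and if the key is missing from data.json then jq -r prints null. A quick sanity check can catch both cases. The following sketch assumes that master hostnames have the form ncn-m followed by three digits, as in the examples in this procedure; the function name valid_first_master is hypothetical.

```shell
# valid_first_master NAME : succeed only if NAME looks like a master NCN
# hostname (ncn-m plus three digits). Empty strings and "null" are rejected.
valid_first_master() {
    case "$1" in
        ncn-m[0-9][0-9][0-9]) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage:
#   valid_first_master "${FM}" || echo "FM is not set correctly: '${FM}'" >&2
```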
(pit#) Copy the Kubernetes configuration file from the first master node to the LiveCD.
This will allow kubectl to work from the PIT node.
mkdir -pv ~/.kube
scp "${FM}.nmn:/etc/kubernetes/admin.conf" ~/.kube/config
(pit#) Ensure that the working directory is the prep directory.
cd "${PITDATA}/prep"
(pit#) Check cabling.
Ceph can begin to exhibit latency over time unless OSDs are restarted and some OSD memory settings are changed. It is recommended to run the /usr/share/doc/csm/scripts/repair-ceph-latency.sh script, which is described in Known Issue: Ceph OSD latency.
Run the following command on the PIT node to validate that the expected LVM labels are present on disks on the master and worker nodes.
/usr/share/doc/csm/install/scripts/check_lvm.sh
Expected output looks similar to the following:
When prompted, please enter the NCN password for ncn-m002
Warning: Permanently added 'ncn-m002,10.252.1.11' (ECDSA) to the list of known hosts.
Password:
Checking ncn-m002...
ncn-m002: OK
Checking ncn-m003...
Warning: Permanently added 'ncn-m003,10.252.1.10' (ECDSA) to the list of known hosts.
Warning: Permanently added 'ncn-m003,10.252.1.10' (ECDSA) to the list of known hosts.
ncn-m003: OK
Checking ncn-w001...
Warning: Permanently added 'ncn-w001,10.252.1.9' (ECDSA) to the list of known hosts.
Warning: Permanently added 'ncn-w001,10.252.1.9' (ECDSA) to the list of known hosts.
ncn-w001: OK
Checking ncn-w002...
Warning: Permanently added 'ncn-w002,10.252.1.8' (ECDSA) to the list of known hosts.
Warning: Permanently added 'ncn-w002,10.252.1.8' (ECDSA) to the list of known hosts.
ncn-w002: OK
Checking ncn-w003...
Warning: Permanently added 'ncn-w003,10.252.1.7' (ECDSA) to the list of known hosts.
Warning: Permanently added 'ncn-w003,10.252.1.7' (ECDSA) to the list of known hosts.
ncn-w003: OK
SUCCESS: LVM checks passed on all master and worker NCNs
If the check fails, stop and:
(pit#) Power cycle the node.
ipmitool -I lanplus -U "${USERNAME}" -E -H <node-in-question> power reset
If the check fails after doing the rebuild, contact support.
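When the LVM check is rerun after a power cycle or rebuild, saving its output makes the result easy to test from a script. The sketch below looks only for the SUCCESS line shown in the expected output above. The function name lvm_checks_passed and the log file path in the usage comment are hypothetical.

```shell
# lvm_checks_passed FILE : succeed if FILE (saved check_lvm.sh output)
# contains the final SUCCESS line shown in the expected output.
lvm_checks_passed() {
    grep -q '^SUCCESS: LVM checks passed' "$1"
}

# Usage (the log file name is arbitrary):
#   /usr/share/doc/csm/install/scripts/check_lvm.sh 2>&1 | tee /tmp/check_lvm.log
#   lvm_checks_passed /tmp/check_lvm.log || echo "LVM check did not pass" >&2
```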
(pit#) Install tests and test server on NCNs.
/usr/share/doc/csm/install/scripts/install-goss-tests.sh
If the output ends with PASSED, then it was successful, despite any warning messages that may have been displayed.
(pit#) Remove the default NTP pool.
This removes the default pool, which can cause contention issues with NTP.
pdsh -b -S -w "$(grep -oP 'ncn-\w\d+' /etc/dnsmasq.d/statics.conf | grep -v m001 | sort -u | tr -t '\n' ',')" \
'sed -i "s/^! pool pool\.ntp\.org.*//" /etc/chrony.conf' && echo SUCCESS
Successful output is:
SUCCESS
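To see what the sed expression in this step does before running it on every NCN, it can be tried on a scratch copy. The following sketch uses a hypothetical two-line file in place of /etc/chrony.conf; the real file will contain more entries. Note that the expression blanks the matching line rather than deleting it, so the line count does not change.

```shell
# Hypothetical stand-in for /etc/chrony.conf.
conf="$(mktemp)"
cat > "${conf}" <<'EOF'
! pool pool.ntp.org iburst
server ncn-m001 iburst
EOF

# Same expression as in the pdsh command: empty any line that starts with
# "! pool pool.ntp.org".
sed -i "s/^! pool pool\.ntp\.org.*//" "${conf}"

# grep -c exits non-zero when the count is 0, hence the "|| true".
remaining="$(grep -c 'pool\.ntp\.org' "${conf}" || true)"
line_count="$(wc -l < "${conf}")"
echo "lines still mentioning pool.ntp.org: ${remaining}"
echo "total lines in file: ${line_count}"

rm -f "${conf}"
```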
(pit#) Check the storage nodes.
csi pit validate --ceph
For assistance resolving failed tests, see the following pages:
(pit#) Check the master and worker nodes.
csi pit validate --k8s
After completing the deployment of the management nodes, the next step is to install the CSM services.
See Install CSM Services.