This document and the procedures contained within it are for the purposes of communicating the kind of testing done by the internal Cray System Management (CSM) team to ensure a basic level of system resiliency in the event of the loss of a single non-compute node (NCN).
It is assumed that some procedures are already known by administrators, so this document does not go into great detail or attempt to encompass every command necessary for execution. It is intended to be higher-level guidance (with some command examples) to inform internal users and customers about the process.
Confirm the component name (xname) mapping for each node on the system by running the /opt/cray/platform-utils/ncnGetXnames.sh script on each node.
Verify that metal.no-wipe=1 is set for each of the NCNs using output from running the ncnGetXnames.sh script.
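As a quick spot-check on any individual NCN (in addition to reviewing the ncnGetXnames.sh output), the setting can be read directly from the kernel command line; this is a minimal sketch and assumes the parameter is present in /proc/cmdline:
grep -o 'metal.no-wipe=[01]' /proc/cmdline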
(ncn-mw#) Ensure the user account in use is an authorized user on the Cray CLI. Log in as a user account where the credentials are known:
export CRAY_CONFIG_DIR=$(mktemp -d); echo $CRAY_CONFIG_DIR; cray init --configuration default --hostname https://api-gw-service-nmn.local
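If the CLI still needs to be authenticated after initialization, a typical follow-up is shown below; the username is a placeholder and the command prompts for the account password:
cray auth login --username <username>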
(ncn-mw#) Verify that kubectl get nodes reports that all master and worker nodes are Ready.
kubectl get nodes -o wide
(ncn-mw#) Get a current list of pods that have a status of anything other than Running or Completed. Investigate any pods of concern.
Save the list of pods for comparison once resiliency testing is completed and the system has been restored.
kubectl get pods -o wide -A | grep -Ev 'Running|Completed'
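For example, the pre-test list can be captured to a file (the path shown is only illustrative) so that it can be compared against the post-test list later:
kubectl get pods -o wide -A | grep -Ev 'Running|Completed' > /tmp/pods-before-resiliency-test.txt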
(ncn-mw#) Note which pods are running on an NCN that will be taken down (as well as the total number of pods running). The following is an example that shows the listing of pods running on ncn-w001:
kubectl get pods -o wide -A | grep ncn-w001 | awk '{print $2}'
Note that the above only applies to Kubernetes nodes, such as master and worker nodes.
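To capture the total number of pods running on the target NCN as well (again using ncn-w001 only as an example):
kubectl get pods -o wide -A | grep ncn-w001 | wc -l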
(linux#) Verify that ipmitool can report power status for the NCN to be shut down.
ipmitool -I lanplus -U root -P <password> -H <ncn-node-name> chassis power status
If ncn-m001 is the node to be brought down, note that it has the external network connection. Therefore, it is important to establish that ipmitool commands can be run from a node external to the system in order to get the power status of ncn-m001.
If ncn-m001 is the node to be brought down, establish Customer Access Network (CAN) links that bypass ncn-m001 (because it will be down) in order to enable an external connection to one of the other master NCNs before, during, and after ncn-m001 is brought down.
Verify Boot Orchestration Service (BOS) templates and create a new one if needed (to be set up for booting a specific set of compute nodes after the targeted NCN has been shut down).
Before shutting down the NCN and beginning resiliency testing, verify that compute nodes identified for reboot validation can be successfully rebooted and configured.
(ncn-mw#) To see a list of BOS templates that exist on the system:
cray bos v2 sessiontemplates list
For more information regarding management of BOS session templates, refer to Manage a Session Template.
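To examine the details of a specific session template before using it (the template name is a placeholder):
cray bos v2 sessiontemplates describe <template_name>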
If a UAN is present on the system, log in to it and verify that the workload manager (WLM) is configured by running a command.
(uan#) The following is an example for Slurm:
srun -N 4 hostname | sort
In order to ensure that the system is healthy before taking an NCN node down, run the Platform Health Checks section of Validate CSM Health.
If health issues are noted, it is best to address those before proceeding with the resiliency testing procedure. If it is believed (in the case of an internal Cray-HPE testing environment) that the issue is known and understood and will not impact the testing to be performed, then those health issues simply need to be noted, so that it does not appear that they were caused by inducing the fault (in this case, powering off the NCN).
There is an optional section of the platform health validation that deals with using the System Management monitoring tools to survey system health. If that optional validation is included, note that the Prometheus alert manager may show various alerts that would not prevent or block moving forward with this testing. For more information about Prometheus alerts (and some that can be safely ignored), see Troubleshooting Prometheus Alerts.
Part of the data returned by the Platform Health Checks includes patronictl information for each Postgres cluster. Each Postgres cluster has a leader pod, and in the case of a resiliency test that involves bringing an NCN worker node down, it may be useful to take note of the Postgres clusters that have their leader pods running on the NCN worker targeted for shutdown. The postgres-operator should handle re-establishment of a leader on another pod running in the cluster, but it is worth taking note of where leader re-elections are expected to occur so that special attention can be given to those Postgres clusters.
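One simple way to list the Postgres pods currently scheduled on the targeted worker (using ncn-w001 only as an example); the leader for each cluster can then be identified from the patronictl output gathered by the health checks:
kubectl get pods -A -o wide | grep postgres | grep ncn-w001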
(ncn-mw#) The Postgres health check is included in Validate CSM Health, but the script for dumping Postgres data can be run at any time:
/opt/cray/platform-utils/ncnPostgresHealthChecks.sh
In order to keep watch on various items during and after the fault has been introduced (in this case, the shutdown of a single NCN), the steps listed below can help give insight into changing health conditions.
(ncn-mw#) Set up a watch command to repeatedly run a Cray CLI command (which will exercise the service API) to ensure that critical services can ride through a fault. Verify that there is not more than a 5-10 minute window during which a service intermittently fails to respond.
In the examples below, the CLI commands are checking the BOS and CPS APIs. Additional Cray CLI commands can be chosen to run in this manner. The ultimate proof of system resiliency lies in the ability to perform system-level use cases and, further, to prove that they can be performed at scale. If these commands consistently return errors (without recovery), then it is likely that business-critical use cases that utilize the same APIs will also fail.
It may be useful to reference instructions for Configuring the Cray CLI.
watch -n 5 "date; cray cps contents"
watch -n 5 "date; cray bos v2 sessions list"
Monitor Ceph health, in a window, during and after a single NCN is taken down.
watch -n 5 "date; ceph -s"
(ncn-mw#) Identify when pods on a downed master or worker NCN are no longer responding.
This takes around 5-6 minutes, at which point Kubernetes will begin terminating pods so that new pods can start up on another NCN to replace them. Pods that had been running on the downed NCN will remain in Terminating state until the NCN is back up. Pods that need to start up on other nodes will be Pending until they start up. Some pods that have anti-affinity configurations or that run as daemonsets will not be able to start up on another NCN; those pods will remain in Pending state until the NCN is back up.
Finally, it is helpful to have a window tracking the list of pods that are not in Completed or Running state, in order to see how that list changes once the NCN is downed and pods begin shifting around. This step offers a view of what is going on at the time the NCN is brought down and once Kubernetes detects an issue and begins remediation. It is not so important to capture everything that is happening during this step, but it may be helpful for debugging. The output of these windows/commands becomes more interesting once the NCN has been down for a period of time and is then brought back up. At that point, the expectation is that everything can recover.
Run the following commands in separate windows:
watch -n 5 "date; kubectl get pods -o wide -A | grep Termin"
watch -n 10 "date; kubectl get pods -o wide -A | grep Pending"
watch -n 5 "date; kubectl get pods -o wide -A | grep -v Completed | grep -v Running"
(ncn-mw#) Detect the change in state of the various Postgres instances running.
Run the following in a separate window:
watch -n 30 "date; kubectl get postgresql -A"
If Postgres reports a status that deviates from Running, that would require further investigation and possibly remediation via Troubleshooting the Postgres Database.
The purpose of this procedure is to launch a non-interactive, long-running batch job across compute nodes via the UAN, if present, in order to verify that the batch job continues to run, uninterrupted.
(uan#) Log in to the UAN and verify that a WLM has been properly configured. In this example, Slurm will be used.
srun -N 4 hostname | sort
Example output:
nid000001
nid000002
nid000003
nid000004
Copy an MPI application source and WLM batch job files to the UAN.
Compile the MPI application within the UAN. Launch the application as a non-interactive batch job on compute node(s) that have not already been designated for reboots (once an NCN is shut down).
Verify that the job launched on the UAN is running and that application output is streaming to a file. Streaming output will be used to verify that the batch job is still running during resiliency testing. A batch job, when submitted, designates a log file location; this log file can be accessed to verify that the batch job is continuing to run after an NCN is brought down and once it is back online. Additionally, the squeue command can be used to verify that the job continues to run (for Slurm).
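The following is a minimal sketch of such a batch job, assuming Slurm is the WLM and that a compiled MPI binary named mpi_app exists in the working directory (the script name, binary name, and node count are illustrative):
cat > resiliency-job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=resiliency-test
#SBATCH --nodes=4
#SBATCH --output=resiliency-test-%j.log
srun ./mpi_app
EOF
sbatch resiliency-job.sh
squeue
The --output directive above is the log file that can be tailed to confirm output is still streaming, and squeue confirms the job remains in a running state.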
Establish a console session to the NCN targeted for shutdown by executing the steps in Establish a Serial Connection to NCNs.
Log on to the target node and execute /sbin/shutdown -h 0.
Take note of the timestamp of the power off in the target node’s console output.
(linux#) Once the target node is reported as being powered off, verify with ipmitool that the node's power status is reported as off.
ipmitool -I lanplus -U root -P <password> -H <ncn-node-name> chassis power status
NOTE: In previous releases, an ipmitool command was used to simply yank the power to an NCN. There have been times where this resulted in a longer recovery procedure under Shasta 1.5 (mostly due to issues with getting nodes physically booted up again), so the preference has been to simply use the shutdown command.
If the NCN that was shut down is a master or worker node, then within 5-6 minutes of the node being shut down, Kubernetes will begin reporting Terminating pods on the target node and start rescheduling pods to other NCNs. New Pending pods will be created for pods that cannot be relocated off of the downed NCN. Pods reported as Terminating will remain in that state until the NCN has been powered back up.
Take note of changes in the data being reported out of the many monitoring windows that were set up in a previous step.
After the target NCN was shut down, assuming the command line windows that were set up for ensuring API responsiveness are not encountering persistent failures, the next step will be to use a BOS template to boot a pre-designated set of compute nodes. The timing of this test is recommended to be around 10 minutes after the NCN has gone down. That should give ample time for Kubernetes to have terminated pods on the downed node (in the case of a master or worker NCN) and for them to have been rescheduled and in a healthy state on another NCN. Going too much earlier than 10 minutes runs the risk that there are still some critical pods that are settling out to reach a healthy state.
(ncn-mw#) Reboot a pre-designated set of compute nodes.
Use BOS to reboot the designated compute nodes.
cray bos v2 sessions create --template-name boot-nids-1-4 --operation reboot
Wait until the BOS reboot session has completed.
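The session name returned by the create command can be used to check on progress; the session identifier below is a placeholder:
cray bos v2 sessions describe <session_id>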
If the WLM batch job was launched on a UAN, log back in to it and verify that the batch job is still running and streaming output, via the log file created with the batch job and/or the squeue command (if Slurm is used as the WLM).
Verify that new WLM jobs can be started on a compute node, via the UAN, while the NCN is down.
(ncn-mw#) Look for any pods that are in a state other than Running, Completed, Pending, or Terminating:
kubectl get pods -o wide -A | grep -Ev "Running|Completed|Pending|Termin"
Compare what comes up in this list to the pod list that was collected before. If there are new pods in ImagePullBackOff or CrashLoopBackOff status, run kubectl describe and kubectl logs against them to collect additional data about what happened. Obviously, if there were pods in a bad state before the procedure started, then it should not be expected that bringing one of the NCNs down is going to fix that.
Ignore anything that was already in a bad state before (and that was deemed to be okay). It is also worth taking note of any pods in a bad state at this stage, as this should be checked again after bringing the NCN back up to see whether those pods remain in a bad state or have cleared. Noting behaviors, collecting logs, and opening tickets throughout this process is recommended whenever unexpected behavior occurs. When an issue that has not been encountered before is seen, it may not be immediately clear whether code changes/regressions are at fault or whether it is simply an intermittent/timing issue that has not previously surfaced. The recommendation at that point, given time and resources, is to repeat the test to gain a sense of the repeatability of the behavior (in the case that the issue is not directly tied to a code change).
Additionally, it is just as important to understand (and document) any workaround procedures needed to fix issues encountered. In addition to filing a bug for a permanent fix, workaround documentation can be very useful when written up, for both internal and external customers to access.
(linux#) Use the ipmitool command to power up the NCN.
It will take several minutes for the NCN to reboot. Progress can be monitored over the connected serial console session. Wait to begin execution of the next steps until after it can be determined that the NCN has booted up and is back at the login prompt (when viewing the serial console log).
ipmitool -I lanplus -U root -P <password> -H <hostname> chassis power on #example hostname is ncn-w003-mgmt
Check the following, depending on the type of NCN powered on:
For a master or worker NCN, verify that Terminating pods on that NCN clear up. It may take several minutes. Watch the command prompt, previously set up, that is displaying the Terminating pod list.
For a storage NCN, verify that Ceph returns to a HEALTH_OK status. It may take several minutes for Ceph to resolve clock skew. This can be noted in the previously set-up window used to watch Ceph status.
Check that pod statuses have returned to the state that they were in at the beginning of this procedure, paying particular attention to any pods that were previously noted to be in a bad state while the NCN was down. There is no concern if pods that were in a bad state at the beginning of the procedure are still in a bad state. What is important to note is anything that is different from either the beginning of the test or from the time that the NCN was down.
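If Ceph is slow to return to HEALTH_OK, one way to see which health checks (for example, clock skew) are still outstanding:
ceph health detail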
Re-run the Platform Health Checks section of Validate CSM Health, noting any output that is not as expected.
Ensure tickets have been opened for any unexpected behavior along with associated logs and notes on workarounds, if any were executed.