The procedures in this section detail the high-level tasks required to power on an HPE Cray EX system.
Important: If an emergency power off (EPO) event occurred, then see Recover from a Liquid-Cooled Cabinet EPO Event for recovery procedures.
If user IDs or passwords are needed, then see step 1 of the Prepare the System for Power Off procedure.
sma-timescaledb-single in CrashLoopBackOff state
Some pods like sma-timescaledb-single-1 or sma-timescaledb-single-2 are in CrashLoopBackOff status when the system is powered up. Although sma-timescaledb-single pod 0 started, pods 1 and 2 show CrashLoopBackOff status.
NAMESPACE NAME READY STATUS
sma sma-timescaledb-single-1 0/1 CrashLoopBackOff
sma sma-timescaledb-single-2 0/1 CrashLoopBackOff
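To list just the affected pods, a filter like the following can be used (the grep pattern is illustrative, not part of this procedure):
kubectl get pods -A | grep sma-timescaledb-single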
Run the following command to resolve this issue:
kubectl delete pod -n sma sma-timescaledb-single-1
This command fixes the sma-timescaledb-single-1 pod, and pod 2 then recovers automatically.
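To confirm that both pods recover, a wait like the following can be used (a sketch; the 300-second timeout is an arbitrary choice):
kubectl -n sma wait --for=condition=Ready pod/sma-timescaledb-single-1 pod/sma-timescaledb-single-2 --timeout=300s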
SMA Alerta and Monasca fail to start when the system is powered up.
The sma-pgdb-init-job1 job is missing, which prevents the sma-alerta-* pod from starting. The sma-monasca-* pod shows CrashLoopBackOff status while waiting for Alerta to initialize.
(ncn-m001#) Check for pods that are not in a Running or Completed state.
kubectl get pods -A -o wide | grep -Ev " (Completed|Running|(cray-dns-unbound-manager|hms-discovery)-.* (Pending|Init:0/[1-9]|PodInitializing|NotReady|Terminating)) "
Example output:
NAMESPACE NAME READY STATUS
sma sma-aiops-enable-disable-models-28891714-v8pcf 0/1 ContainerCreating
sma sma-alerta-54b657ccb9-ptx4s 0/1 Init:0/1
sma sma-monasca-notification-0 0/1 CrashLoopBackOff
--- WARNING --- not all pods are in a 'Running' or 'Completed' state.
The wait-for-sma-pgdb-init-job init container appears to be waiting for this job, which does not exist.
(ncn-m001#) Check the logs of the wait-for-sma-pgdb-init-job container.
kubectl logs -n sma sma-alerta-54b657ccb9-ptx4s -c wait-for-sma-pgdb-init-job --timestamps | head -20
Example output:
2024-12-05T22:15:34.783970586Z Error from server (NotFound): jobs.batch "sma-pgdb-init-job1" not found
2024-12-05T22:15:34.847387533Z Waiting for sma-pgdb-init job to complete
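To confirm that the job is indeed absent, a listing like the following can be used (a simple sketch):
kubectl -n sma get jobs | grep sma-pgdb-init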
To resolve this issue, use the following workaround:
(ncn#) Check how many Helm revisions exist.
helm history -n sma sma-pgdb-init
Example output:
REVISION UPDATED STATUS CHART APP VERSION DESCRIPTION
1 Sep 23 2024 deployed sma-pgdb-init-1.7.1 1.7.1 Install complete
(ncn#) Roll back the sma-pgdb-init release to re-create the job. This command works when all existing revisions have single-digit revision numbers. It can pick the wrong revision once two-digit revisions exist (for example, revision 10), because the revision number is not extracted numerically. If there are any two-digit revisions, run the helm rollback command with a specific older revision instead.
helm rollback -n sma sma-pgdb-init $(helm history -n sma sma-pgdb-init | awk '{print $1}' | tail -c 2)
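When two-digit revisions exist, one way to select the most recent revision numerically is a sketch like the following (adjust the selection if a specific older revision is wanted instead):
REV=$(helm history -n sma sma-pgdb-init | awk 'NR>1 {print $1}' | sort -n | tail -1)
helm rollback -n sma sma-pgdb-init "$REV"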
Once the sma-pgdb-init job is complete, confirm that the sma-alerta and sma-monasca-notification-0 pods have started normally.
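A check like the following can be used (the grep pattern is illustrative):
kubectl get pods -n sma | grep -E 'sma-alerta|sma-monasca-notification'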
(ncn#) Check the ttlSecondsAfterFinished value set in the new job.
kubectl -n sma get job sma-pgdb-init-job1 -o yaml > sma-pgdb-init-job1.yaml
grep ttl sma-pgdb-init-job1.yaml
Example output:
cluster-job-ttl.cluster-job-ttl.kyverno.io: added /spec/ttlSecondsAfterFinished
ttlSecondsAfterFinished: 259200
ttlSecondsAfterFinished is the number of seconds that Kubernetes keeps the completed job before deleting it, which limits how long the system can remain powered off. 259200 seconds is only 72 hours, so if the system is powered off for more than 72 hours, the job is purged and these SMA pods cannot start correctly. Increase the value to prevent this.
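As a quick arithmetic check of these values (a sketch using shell arithmetic):
echo "$((259200 / 3600)) hours"            # 72 hours for the default TTL
echo "$((2147483647 / 86400 / 365)) years" # roughly 68 years for the maxint value used below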
(ncn#) Use the already collected YAML file for sma-pgdb-init-job1 to delete the current job.
kubectl -n sma delete -f sma-pgdb-init-job1.yaml
(ncn#) Modify the YAML file and change the ttlSecondsAfterFinished value to maxint:
ttlSecondsAfterFinished: 2147483647
vi sma-pgdb-init-job1.yaml
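If a non-interactive edit is preferred, a substitution like the following can be used instead of vi (a sketch that assumes the current value is the 259200 shown above):
sed -i 's/ttlSecondsAfterFinished: 259200/ttlSecondsAfterFinished: 2147483647/' sma-pgdb-init-job1.yaml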
(ncn#) Apply the new settings.
kubectl -n sma apply -f sma-pgdb-init-job1.yaml
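To verify that the new value was applied, a check like the following can be used (a sketch):
kubectl -n sma get job sma-pgdb-init-job1 -o jsonpath='{.spec.ttlSecondsAfterFinished}'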
Always use the cabinet power-on sequence for the site.
The management cabinet is the first part of the system that must be powered on and booted. Management network and Slingshot fabric switches power on and boot when cabinet power is applied. After cabinets are powered on, wait at least 10 minutes for systems to initialize.
After all the system cabinets are powered on, verify that all management network and Slingshot network switches are powered on and that there are no error LEDs or hardware failures.
To power on an external Lustre file system (ClusterStor), refer to Power On the External Lustre File System.
To power on the external Spectrum Scale (GPFS) file system, refer to site procedures.
Note: If the external file systems are not mounted on worker nodes, then they can be powered on in parallel with the power on and boot of the Kubernetes management cluster and the power on of the compute cabinets. This must be completed before beginning to power on and boot the compute nodes and User Access Nodes (UANs).
To power on the management cabinet and bring up the management Kubernetes cluster, refer to Power On and Start the Management Kubernetes Cluster.
To power on all liquid-cooled cabinet CDUs and cabinet PDUs, refer to Power On Compute Cabinets.
Note: Ensure that the external Lustre and Spectrum Scale (GPFS) filesystems are available before starting to boot the compute nodes and UANs.
To power on and boot compute nodes and UANs, refer to Power On and Boot Compute and User Access Nodes.
After power on, refer to Validate CSM Health to check system health and status.
Make nodes available to users once system health and any other post-system maintenance checks have completed.