After a system has been running for a while, if a worker node is rebooted for maintenance or because of a failure, any RTS pod that was running on that worker node may fail to start properly.
(ncn-mw#) Check the current status of RTS.

kubectl get pods -n services | egrep rts | egrep -v Running
services cray-hms-rts-744577c9b5-fs287 0/3 Init:0/2 0 45m
services cray-hms-rts-snmp-776695546-thxlb 0/3 Init:0/2 0 45m

The RTS pods are waiting for the cray-hms-rts-init job, which indicates that their prerequisites are ready, but that job no longer exists because it has been cleaned up.
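
Before recreating the job, it can help to confirm that it really is missing. The following is a minimal sketch, assuming `kubectl` is on the path and configured for the cluster; the function name `rts_init_job_exists` is hypothetical and not part of any Cray tooling.

```shell
# Sketch: report whether the cray-hms-rts-init job is still present.
# rts_init_job_exists is a hypothetical helper name used for illustration.
rts_init_job_exists() {
    # kubectl get exits non-zero when the named job cannot be found.
    kubectl get job -n services cray-hms-rts-init > /dev/null 2>&1
}

# Example usage (commented out so that pasting the block has no side effects):
# if rts_init_job_exists; then
#     echo "cray-hms-rts-init is present"
# else
#     echo "cray-hms-rts-init is missing and needs to be recreated"
# fi
```

If the function reports that the job is missing, the remaining steps recreate it from the Helm manifest.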

(ncn-mw#) Generate an RTS init YAML file from the manifest.

helm get manifest -n services cray-hms-rts | yq4 ea 'select(.kind == "Job")' > cray-hms-rts-init.yaml

(ncn-mw#) Apply the RTS init YAML file to create the job.

kubectl apply -n services -f cray-hms-rts-init.yaml

(ncn-mw#) Wait for the RTS init job to complete.

while [ -z "$DONT_WAIT" -a "`kubectl get jobs -n services cray-hms-rts-init \
-o jsonpath='{.status.conditions[0].type}'`" != "Complete" ]; do \
echo "Waiting for cray-hms-rts-init to complete"; sleep 3; done; echo "Completed";
Waiting for cray-hms-rts-init to complete
Waiting for cray-hms-rts-init to complete
Waiting for cray-hms-rts-init to complete
Waiting for cray-hms-rts-init to complete
Waiting for cray-hms-rts-init to complete
Waiting for cray-hms-rts-init to complete
Waiting for cray-hms-rts-init to complete
Waiting for cray-hms-rts-init to complete
Waiting for cray-hms-rts-init to complete
Waiting for cray-hms-rts-init to complete
Waiting for cray-hms-rts-init to complete
Waiting for cray-hms-rts-init to complete
Waiting for cray-hms-rts-init to complete
Waiting for cray-hms-rts-init to complete
Completed
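
The polling loop above keeps waiting until the job's first condition is `Complete`, so it would not return on its own if the job failed instead. As an alternative sketch, `kubectl wait` can block on the same condition with a timeout. The function name `wait_for_rts_init` and the 300 second default are assumptions chosen for illustration.

```shell
# Sketch: wait for the init job with a timeout instead of polling.
# wait_for_rts_init and the 300s default are hypothetical choices.
wait_for_rts_init() {
    rts_timeout="${1:-300s}"
    # kubectl wait exits non-zero if the condition is not met before the timeout.
    kubectl wait --for=condition=complete --timeout="$rts_timeout" \
        -n services job/cray-hms-rts-init
}

# Example usage (commented out so that pasting the block has no side effects):
# wait_for_rts_init 120s && echo "Completed"
```

A non-zero exit status means the job did not reach the `Complete` condition in time, which is a cue to inspect the job with `kubectl describe job -n services cray-hms-rts-init`.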

(ncn-mw#) Verify that RTS has started properly.

kubectl get pods -n services | egrep rts
cray-hms-rts-dbbc87df4-rzrgl 3/3 Running 0 4m20s
cray-hms-rts-init-vctgv 0/2 Completed 0 104s
cray-hms-rts-snmp-776695546-zh4k4 3/3 Running 0 24h
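
The final check can also be scripted so that it prints only the RTS pods that still need attention. This is a minimal sketch, assuming the pod names contain `rts` as in the output above; the function name `rts_pods_not_running` is hypothetical.

```shell
# Sketch: list RTS pods that are neither Running nor Completed.
# rts_pods_not_running is a hypothetical helper name used for illustration.
rts_pods_not_running() {
    # --no-headers drops the column header so only pod lines are filtered.
    kubectl get pods -n services --no-headers \
        | grep rts \
        | grep -v -e Running -e Completed
}

# Example usage (commented out so that pasting the block has no side effects):
# if rts_pods_not_running; then
#     echo "Some RTS pods are still not ready"
# else
#     echo "All RTS pods are Running or Completed"
# fi
```

The function prints nothing and returns a non-zero status when every RTS pod is in the `Running` or `Completed` state.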