### `pods_not_running`
If the output of `pods_not_running` indicates that there are pods in the `Evicted` state, it may be because the root file system has filled up on the Kubernetes node in question. Kubernetes will begin evicting pods once the root file system is 85% full, and will continue to evict them until usage is back under 80%. This commonly happens on `ncn-m001`, because it is a location where install and documentation files may have been downloaded. It may be necessary to clean up space in the `/` directory if this is the root cause of the pod evictions.
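To confirm which pods are `Evicted` and which node they were scheduled on, a `kubectl` query along these lines may help. This is only a sketch; it simply matches the `Evicted` status text in the wide output.

```bash
# List evicted pods across all namespaces, including the node each one ran on
kubectl get pods -A -o wide | grep Evicted
```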
1. (`ncn-mw#`) View the free space in the root file system.

    ```bash
    df -h /
    ```
1. (`ncn-mw#`) See how much space is being used in `/root/`.

    ```bash
    du -h -s /root/
    ```
1. (`ncn-mw#`) List the top 10 files in `/root/` that are 1024M or larger.

    ```bash
    du -ah -B 1024M /root | sort -n -r | head -n 10
    ```
- The `cray-crus-` pod is expected to be in the `Init` state until Slurm and MUNGE are installed. In particular, this will be the case when running this validation after completing the Install CSM Services procedure. If in doubt, validate the CRUS service using the CMS Validation Tool; if the CRUS check passes with that tool, then do not worry about the `cray-crus-` pod state.
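  To check the current state of the pod, a command like the following may be used (a sketch; it assumes CRUS is deployed in the `services` namespace):

  ```bash
  # Show the cray-crus pod and its current state
  kubectl get pods -n services | grep cray-crus
  ```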
- If some `*postgresql-db-backup` cronjob pods are in the `Error` state, they can be ignored as long as the most recent pod for that job has `Completed`. The `Error` pods are cleaned up over time, but are left in place to help troubleshoot in the case that all retries for the `postgresql-db-backup` job fail.
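  To verify that the most recent backup pod completed, the pods can be sorted by creation time; for example (a sketch, filtering on the pod name pattern):

  ```bash
  # Oldest first; the last matching line is the most recently created backup pod
  kubectl get pods -A --sort-by=.metadata.creationTimestamp | grep postgresql-db-backup
  ```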
- During a CSM install, while `ncn-m001` is still the PIT node, `cray-sysmgmt-health-grok-exporter` pods are expected to be in the `Pending` state.
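  This can be confirmed with a command like the following (a sketch; it assumes these pods run in the `sysmgmt-health` namespace):

  ```bash
  # Expect Pending while ncn-m001 is still the PIT node
  kubectl get pods -n sysmgmt-health | grep grok-exporter
  ```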
- The `hmn-discovery` and `cray-dns-unbound-manager` cronjob pods may be in various transitional states, such as `Pending`, `Init`, `PodInitializing`, `NotReady`, or `Terminating`. This is expected because these pods are started periodically and can often be caught in intermediate states.
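  If there is any doubt, these pods can be re-checked after a short interval to confirm that the states are only transitional; for example (a sketch using `watch`):

  ```bash
  # Refresh every 30 seconds; transitional states should come and go
  watch -n 30 "kubectl get pods -A | grep -E 'hmn-discovery|cray-dns-unbound-manager'"
  ```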