Post-install customizations may be needed as systems scale. These customizations must also persist across future installs or upgrades. Not all resources can be customized post-install; common scenarios are documented in the following sections.
The following is a guide for determining where issues may exist, how to adjust the resources, and how to ensure the changes persist. Different values may be needed as systems scale.
**NOTE:** The `SYSTEM_DOMAIN_NAME` value found in some of the URLs on this page is expected to be the system's fully qualified domain name (FQDN).
(`ncn-mw#`) The FQDN can be found by running the following command on any Kubernetes NCN:

```bash
kubectl get secret site-init -n loftsman -o jsonpath='{.data.customizations\.yaml}' | base64 -d | yq r - spec.network.dns.external
```

Example output:

```text
system.hpc.amslabs.hpecorp.net
```

Be sure to modify the example URLs on this page by replacing `SYSTEM_DOMAIN_NAME` with the actual value found using the above command.
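Once the FQDN is known, the example URLs on this page can be assembled in a shell as a convenience. This is a minimal sketch; the `SYSTEM_DOMAIN_NAME` value below is the example output from above and must be replaced with the system's actual FQDN.

```shell
# Substitute the system FQDN into the example URLs used on this page.
# The value here is the example output from above; replace it with the real FQDN.
SYSTEM_DOMAIN_NAME="system.hpc.amslabs.hpecorp.net"

GRAFANA_URL="https://grafana.cmn.${SYSTEM_DOMAIN_NAME}/"
PROMETHEUS_URL="https://prometheus.cmn.${SYSTEM_DOMAIN_NAME}/"

echo "${GRAFANA_URL}"
echo "${PROMETHEUS_URL}"
```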
### `kubectl` events `OOMKilled`

Check to see if there are any recent out of memory events.

1. (`ncn-mw#`) Check `kubectl` events to see if there are any recent out of memory events.

   ```bash
   kubectl get event -A | grep OOM
   ```

2. Log in to Grafana at the following URL: `https://grafana.cmn.SYSTEM_DOMAIN_NAME/`

3. Search for the "Kubernetes / Compute Resources / Pod" dashboard to view the memory utilization graphs over time for any pod that has been `OOMKilled`.
### `CPUThrottlingHigh` alerts

Check Prometheus for recent `CPUThrottlingHigh` alerts.

1. Log in to Prometheus at the following URL: `https://prometheus.cmn.SYSTEM_DOMAIN_NAME/`

2. Select the **Alerts** tab.

3. Scroll down to the alert for `CPUThrottlingHigh`.
### Use Grafana to investigate and analyze CPU throttling and memory usage
1. Log in to Grafana at the following URL: `https://grafana.cmn.SYSTEM_DOMAIN_NAME/`

2. Search for the "Kubernetes / Compute Resources / Pod" dashboard.

3. Select the `datasource`, `namespace`, and `pod` based on the pod being examined. For example:

   - `datasource`: `default`
   - `namespace`: `sysmgmt-health`
   - `pod`: `prometheus-cray-sysmgmt-health-kube-p-prometheus-0`

4. Select the **CPU Throttling** drop-down to see the CPU throttling graph for the pod during the selected time (from the top right).

5. Select the container (from the legends under the x-axis).

6. Review the graph and adjust the `resources.limits.cpu` value as needed.
   The presence of CPU throttling does not always indicate a problem. However, if a service is slow or experiencing latency issues, adjusting `resources.limits.cpu` may be beneficial.

   **NOTE:** The `resources.requests.cpu` values are used by the Kubernetes scheduler to decide which node to place the pod on; they do not impact CPU throttling. The value of `resources.limits.cpu` can never be lower than the value of `resources.requests.cpu`.
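For context, the `CPUThrottlingHigh` alert and the Grafana CPU Throttling graph are both derived from the kernel CFS scheduler counters exported by cAdvisor (`container_cpu_cfs_throttled_periods_total` and `container_cpu_cfs_periods_total`). The following sketch shows the ratio they compute, using made-up sample counter deltas:

```shell
# Illustrative only: fraction of CFS scheduling periods in which the
# container was throttled, using made-up sample counter deltas.
throttled_periods=120   # increase in container_cpu_cfs_throttled_periods_total
total_periods=400       # increase in container_cpu_cfs_periods_total

throttle_pct=$(( 100 * throttled_periods / total_periods ))
echo "container was throttled in ${throttle_pct}% of CFS periods"
```

A consistently high percentage here is what raises the alert; a brief spike during startup is usually harmless.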
7. Select the **Memory Usage** drop-down to see the memory usage graph for the pod during the selected time (from the top right).

8. Select the container (from the legends under the x-axis).

9. Determine the steady-state memory usage by looking at the memory usage graph for the container. This is the minimum value at which `resources.requests.memory` should be set.

10. More importantly, determine the spike usage for the container, and set the `resources.limits.memory` value based on the spike values with some additional headroom.
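The sizing guidance above reduces to simple arithmetic. In this sketch, the steady-state and spike readings are hypothetical values taken from the Grafana memory usage graph, and the 25% headroom factor is an assumption to tune per system, not a fixed rule:

```shell
# Hypothetical readings from the Grafana memory usage graph, in MiB.
steady_state_mib=1200   # observed steady-state memory usage
spike_mib=2000          # observed peak (spike) memory usage

# resources.requests.memory: at least the steady-state usage.
requests_mib=$steady_state_mib

# resources.limits.memory: spike usage plus ~25% headroom (assumed factor).
limits_mib=$(( spike_mib + spike_mib / 4 ))

echo "resources.requests.memory=${requests_mib}Mi"
echo "resources.limits.memory=${limits_mib}Mi"
```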
Most of these procedures instruct the administrator to perform the Redeploying a Chart procedure for a specific chart. In these cases, the corresponding section on this page provides the information necessary to carry out that procedure. It is recommended to keep both pages open in separate browser windows for easy reference.
### Prometheus is `OOMKilled` or CPU throttled

Update resources associated with Prometheus in the `sysmgmt-health` namespace.

This example is based on what was needed for a system with 4000 compute nodes. Trial and error may be needed to determine what is best for a given system at scale.
Follow the Redeploying a Chart procedure with the following specifications:

- Chart name: `cray-sysmgmt-health`
- Base manifest name: `platform`

(`ncn-mw#`) When reaching the step to update the customizations, perform the following steps:

**Only follow these steps as part of the previously linked chart redeploy procedure.**
1. Edit the customizations by adding or updating `spec.kubernetes.services.cray-sysmgmt-health.kube-prometheus-stack.prometheus.prometheusSpec.resources`.

   If the number of NCNs is less than 20, then:

   ```bash
   yq write -i customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.kube-prometheus-stack.prometheus.prometheusSpec.resources.requests.cpu' --style=double '2'
   yq write -i customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.kube-prometheus-stack.prometheus.prometheusSpec.resources.requests.memory' '15Gi'
   yq write -i customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.kube-prometheus-stack.prometheus.prometheusSpec.resources.limits.cpu' --style=double '6'
   yq write -i customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.kube-prometheus-stack.prometheus.prometheusSpec.resources.limits.memory' '30Gi'
   ```

   If the number of NCNs is 20 or more, then:

   ```bash
   yq write -i customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.kube-prometheus-stack.prometheus.prometheusSpec.resources.requests.cpu' --style=double '6'
   yq write -i customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.kube-prometheus-stack.prometheus.prometheusSpec.resources.requests.memory' '50Gi'
   yq write -i customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.kube-prometheus-stack.prometheus.prometheusSpec.resources.limits.cpu' --style=double '12'
   yq write -i customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.kube-prometheus-stack.prometheus.prometheusSpec.resources.limits.memory' '60Gi'
   ```
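The two branches above can be wrapped in a small helper that picks the values from the NCN count. This is a sketch only; the threshold and values come from this page, and the `ncn_count` value here is illustrative:

```shell
ncn_count=24   # illustrative; substitute the system's actual NCN count

# Threshold and values mirror the two yq branches above.
if [ "$ncn_count" -lt 20 ]; then
  cpu_request=2;  mem_request=15Gi; cpu_limit=6;  mem_limit=30Gi
else
  cpu_request=6;  mem_request=50Gi; cpu_limit=12; mem_limit=60Gi
fi

echo "requests: cpu=${cpu_request} memory=${mem_request}"
echo "limits:   cpu=${cpu_limit} memory=${mem_limit}"
```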
2. Check that the customization file has been updated.

   ```bash
   yq read customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.kube-prometheus-stack.prometheus.prometheusSpec.resources'
   ```

   Example output (for a system with fewer than 20 NCNs):

   ```yaml
   requests:
     cpu: "2"
     memory: 15Gi
   limits:
     cpu: "6"
     memory: 30Gi
   ```
(`ncn-mw#`) When reaching the step to validate the redeployed chart, perform the following steps:

**Only follow these steps as part of the previously linked chart redeploy procedure.**

Verify that the pod restarts and that the desired resources have been applied.

1. Watch the `prometheus-cray-sysmgmt-health-kube-p-prometheus-0` pod restart.

   ```bash
   watch "kubectl get pods -n sysmgmt-health -l prometheus=cray-sysmgmt-health-kube-p-prometheus"
   ```

   It may take about 10 minutes for the `prometheus-cray-sysmgmt-health-kube-p-prometheus-0` pod to terminate. It can be force deleted if it remains in the terminating state:

   ```bash
   kubectl delete pod prometheus-cray-sysmgmt-health-kube-p-prometheus-0 --force --grace-period=0 -n sysmgmt-health
   ```

2. Verify that the resource changes are in place.

   ```bash
   kubectl get pod prometheus-cray-sysmgmt-health-kube-p-prometheus-0 -n sysmgmt-health -o json | jq -r '.spec.containers[] | select(.name == "prometheus").resources'
   ```

Make sure to perform the entire linked procedure, including the step to save the updated customizations.
### Postgres pods are `OOMKilled` or CPU throttled

Update resources associated with `spire-postgres` in the `spire` namespace.

This example is based on what was needed for a system with 4000 compute nodes. Trial and error may be needed to determine what is best for a given system at scale.
A similar flow can be used to update the resources for `cray-sls-postgres`, `cray-smd-postgres`, or `gitea-vcs-postgres`. The following table provides the values the administrator will need, based on which pods are experiencing problems.
| Chart name | Base manifest name | Resource path name | Kubernetes namespace |
|---|---|---|---|
| `cray-sls-postgres` | `core-services` | `cray-hms-sls` | `services` |
| `cray-smd-postgres` | `core-services` | `cray-hms-smd` | `services` |
| `gitea-vcs-postgres` | `sysmgmt` | `gitea` | `services` |
| `spire-postgres` | `sysmgmt` | `spire` | `spire` |
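When scripting the redeploy, the table can also be expressed as a shell lookup. This is a sketch only; the values simply mirror the table above, and the `chart` value chosen here is illustrative:

```shell
# Look up the table row for a given chart name (mirrors the table above).
chart=spire-postgres   # illustrative; pick the chart experiencing problems

case "$chart" in
  cray-sls-postgres)  manifest=core-services; rpname=cray-hms-sls; ns=services ;;
  cray-smd-postgres)  manifest=core-services; rpname=cray-hms-smd; ns=services ;;
  gitea-vcs-postgres) manifest=sysmgmt;       rpname=gitea;        ns=services ;;
  spire-postgres)     manifest=sysmgmt;       rpname=spire;        ns=spire    ;;
esac

echo "manifest=${manifest} rpname=${rpname} namespace=${ns}"
```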
Using the values from the above table, follow the Redeploying a Chart procedure with the following specifications:
(`ncn-mw#`) When reaching the step to update the customizations, perform the following steps:

**Only follow these steps as part of the previously linked chart redeploy procedure.**
1. Set the `rpname` variable to the appropriate resource path name from the table above.

   ```bash
   rpname=<put resource path name from table here>
   ```

2. Edit the customizations by adding or updating `spec.kubernetes.services.${rpname}.cray-postgresql.sqlCluster.resources`.

   ```bash
   yq write -i customizations.yaml "spec.kubernetes.services.${rpname}.cray-postgresql.sqlCluster.resources.requests.cpu" --style=double '4'
   yq write -i customizations.yaml "spec.kubernetes.services.${rpname}.cray-postgresql.sqlCluster.resources.requests.memory" '4Gi'
   yq write -i customizations.yaml "spec.kubernetes.services.${rpname}.cray-postgresql.sqlCluster.resources.limits.cpu" --style=double '8'
   yq write -i customizations.yaml "spec.kubernetes.services.${rpname}.cray-postgresql.sqlCluster.resources.limits.memory" '8Gi'
   ```

3. Check that the customization file has been updated.

   ```bash
   yq read customizations.yaml "spec.kubernetes.services.${rpname}.cray-postgresql.sqlCluster.resources"
   ```
   Example output:

   ```yaml
   requests:
     cpu: "4"
     memory: 4Gi
   limits:
     cpu: "8"
     memory: 8Gi
   ```
(`ncn-mw#`) When reaching the step to validate the redeployed chart, perform the following steps:

**Only follow these steps as part of the previously linked chart redeploy procedure.**

Verify that the pods restart and that the desired resources have been applied. Commands in this section use the `CHART_NAME` variable, which should have been set as part of the Redeploying a Chart procedure.

1. Set the `ns` variable to the name of the appropriate Kubernetes namespace from the earlier table.

   ```bash
   ns=<put kubernetes namespace here>
   ```

2. Watch the pod restart.

   ```bash
   watch "kubectl get pods -n ${ns} -l application=spilo,cluster-name=${CHART_NAME}"
   ```

3. Verify that the desired resources have been applied.

   ```bash
   kubectl get pod ${CHART_NAME}-0 -n "${ns}" -o json | jq -r '.spec.containers[] | select(.name == "postgres").resources'
   ```
   Example output:

   ```json
   {
     "limits": {
       "cpu": "8",
       "memory": "8Gi"
     },
     "requests": {
       "cpu": "4",
       "memory": "4Gi"
     }
   }
   ```
Make sure to perform the entire linked procedure, including the step to save the updated customizations.
### Scale the `cray-bss` service

Scale the replica count associated with the `cray-bss` service in the `services` namespace.

This example is based on what was needed for a system with 4000 compute nodes. Trial and error may be needed to determine what is best for a given system at scale.
Follow the Redeploying a Chart procedure with the following specifications:

- Chart name: `cray-hms-bss`
- Base manifest name: `sysmgmt`

(`ncn-mw#`) When reaching the step to update the customizations, perform the following steps:

**Only follow these steps as part of the previously linked chart redeploy procedure.**
1. Edit the customizations by adding or updating `spec.kubernetes.services.cray-hms-bss.cray-service.replicaCount`.

   ```bash
   yq write -i customizations.yaml 'spec.kubernetes.services.cray-hms-bss.cray-service.replicaCount' '5'
   ```

2. Check that the customization file has been updated.

   ```bash
   yq read customizations.yaml 'spec.kubernetes.services.cray-hms-bss.cray-service.replicaCount'
   ```

   Example output:

   ```text
   5
   ```
(`ncn-mw#`) When reaching the step to validate the redeployed chart, perform the following steps:

**Only follow these steps as part of the previously linked chart redeploy procedure.**
Verify that the `cray-bss` pods scale.

1. Watch the `cray-bss` pods scale to the desired number (in this example, 5), with each pod reaching a `2/2` ready state.

   ```bash
   watch "kubectl get pods -l app.kubernetes.io/instance=cray-hms-bss -n services"
   ```

   Example output:

   ```text
   NAME                       READY   STATUS    RESTARTS   AGE
   cray-bss-fccbc9f7d-7jw2q   2/2     Running   0          82m
   cray-bss-fccbc9f7d-l524g   2/2     Running   0          93s
   cray-bss-fccbc9f7d-qwzst   2/2     Running   0          93s
   cray-bss-fccbc9f7d-sw48b   2/2     Running   0          82m
   cray-bss-fccbc9f7d-xr26l   2/2     Running   0          82m
   ```
2. Verify that the replicas change is present in the Kubernetes `cray-bss` deployment.

   ```bash
   kubectl get deployment cray-bss -n services -o json | jq -r '.spec.replicas'
   ```

   In this example, `5` will be the returned value.
Make sure to perform the entire linked procedure, including the step to save the updated customizations.
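The verification step reduces to a simple comparison between the desired replica count and the number of ready pods. The sketch below stands in for the `kubectl`/`jq` output above with hypothetical values:

```shell
# Hypothetical values standing in for the kubectl output above.
desired_replicas=5
ready_pods=5   # e.g. number of pods showing a 2/2 Running state

if [ "$ready_pods" -eq "$desired_replicas" ]; then
  scale_status="scaled"
else
  scale_status="waiting (${ready_pods}/${desired_replicas} ready)"
fi
echo "$scale_status"
```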
### Scale the `cray-dns-unbound` service

Scale the replica count associated with the `cray-dns-unbound` service in the `services` namespace.

Trial and error may be needed to determine what is best for a given system at scale.
Follow the Redeploying a Chart procedure with the following specifications:

- Chart name: `cray-dns-unbound`
- Base manifest name: `core-services`

(`ncn-mw#`) When reaching the step to update the customizations, perform the following steps:

**Only follow these steps as part of the previously linked chart redeploy procedure.**
1. Edit the customizations by adding or updating `spec.kubernetes.services.cray-dns-unbound.cray-service.replicaCount`.

   ```bash
   yq write -i customizations.yaml 'spec.kubernetes.services.cray-dns-unbound.cray-service.replicaCount' '5'
   ```

2. Check that the customization file has been updated.

   ```bash
   yq read customizations.yaml 'spec.kubernetes.services.cray-dns-unbound.cray-service.replicaCount'
   ```

   Example output:

   ```text
   5
   ```
(`ncn-mw#`) When reaching the step to validate the redeployed chart, perform the following steps:

**Only follow these steps as part of the previously linked chart redeploy procedure.**
Verify that the `cray-dns-unbound` pods scale.

1. Watch the `cray-dns-unbound` pods scale to the desired number (in this example, 5), with each pod reaching a `3/3` ready state.

   ```bash
   watch "kubectl get pods -l app.kubernetes.io/instance=cray-dns-unbound -n services"
   ```

   Example output:

   ```text
   NAME                                READY   STATUS    RESTARTS   AGE
   cray-dns-unbound-58b5cfdb4d-6vwrx   3/3     Running   0          88s
   cray-dns-unbound-58b5cfdb4d-6wrpr   3/3     Running   0          87s
   cray-dns-unbound-58b5cfdb4d-7ndhg   3/3     Running   0          70m
   cray-dns-unbound-58b5cfdb4d-n498k   3/3     Running   0          70m
   cray-dns-unbound-58b5cfdb4d-w2tq9   3/3     Running   0          70m
   ```
2. Verify that the replicas change is present in the Kubernetes `cray-dns-unbound` deployment.

   ```bash
   kubectl get deployment cray-dns-unbound -n services -o json | jq -r '.spec.replicas'
   ```

   In this example, `5` will be the returned value.
Make sure to perform the entire linked procedure, including the step to save the updated customizations.
### Increase Postgres PVC volume size

Increase the PVC volume size associated with the `cray-smd-postgres` cluster in the `services` namespace.

This example is based on what was needed for a system with 4000 compute nodes. Trial and error may be needed to determine what is best for a given system at scale. The PVC size can only ever be increased.
A similar flow can be used to update the resources for `cray-sls-postgres`, `gitea-vcs-postgres`, or `spire-postgres`. The following table provides the values the administrator will need, based on which pods are experiencing problems.
| Chart name | Base manifest name | Resource path name | Kubernetes namespace |
|---|---|---|---|
| `cray-sls-postgres` | `core-services` | `cray-hms-sls` | `services` |
| `cray-smd-postgres` | `core-services` | `cray-hms-smd` | `services` |
| `gitea-vcs-postgres` | `sysmgmt` | `gitea` | `services` |
| `spire-postgres` | `sysmgmt` | `spire` | `spire` |
Using the values from the above table, follow the Redeploying a Chart procedure with the following specifications:
(`ncn-mw#`) When reaching the step to update the customizations, perform the following steps:

**Only follow these steps as part of the previously linked chart redeploy procedure.**
1. Set the `rpname` variable to the appropriate resource path name from the table above.

   ```bash
   rpname=<put resource path name from table here>
   ```

2. Edit the customizations by adding or updating `spec.kubernetes.services.${rpname}.cray-postgresql.sqlCluster.volumeSize`.

   ```bash
   yq write -i customizations.yaml "spec.kubernetes.services.${rpname}.cray-postgresql.sqlCluster.volumeSize" '100Gi'
   ```

3. Check that the customization file has been updated.

   ```bash
   yq read customizations.yaml "spec.kubernetes.services.${rpname}.cray-postgresql.sqlCluster.volumeSize"
   ```

   Example output:

   ```text
   100Gi
   ```
(`ncn-mw#`) When reaching the step to validate the redeployed chart, perform the following steps:

**Only follow these steps as part of the previously linked chart redeploy procedure.**

Verify that the pods restart and that the desired resources have been applied. Commands in this section use the `CHART_NAME` variable, which should have been set as part of the Redeploying a Chart procedure.

1. Set the `ns` variable to the name of the appropriate Kubernetes namespace from the earlier table.

   ```bash
   ns=<put kubernetes namespace here>
   ```

2. Verify that the increased volume size has been applied.

   ```bash
   watch "kubectl get postgresql ${CHART_NAME} -n $ns"
   ```

   Example output:

   ```text
   NAME                TEAM       VERSION   PODS   VOLUME   CPU-REQUEST   MEMORY-REQUEST   AGE   STATUS
   cray-smd-postgres   cray-smd   11        3      100Gi    500m          8Gi              45m   Running
   ```
   If the status in the above output is `SyncFailed` instead of `Running`, refer to Case 1 in the `SyncFailed` section of Troubleshoot Postgres Database.

   At this point the Postgres cluster is healthy, but additional steps are required to complete the resize of the Postgres PVCs.

Make sure to perform the entire linked procedure, including the step to save the updated customizations.
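Because Kubernetes only supports expanding a PVC, never shrinking it, a quick guard before editing `volumeSize` can avoid a failed sync. This is a sketch with hypothetical sizes, both assumed to be whole numbers of Gi:

```shell
# Hypothetical sizes, in Gi; the current size can be read from `kubectl get pvc`.
current_gi=100
requested_gi=120

# PVC sizes can only ever be increased.
if [ "$requested_gi" -lt "$current_gi" ]; then
  decision="refuse: PVC size can only ever be increased"
else
  decision="ok: resize from ${current_gi}Gi to ${requested_gi}Gi"
fi
echo "$decision"
```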
### Increase Prometheus PVC volume size

Increase the PVC volume size associated with the `prometheus-cray-sysmgmt-health-kube-p-prometheus` cluster in the `sysmgmt-health` namespace.

This example is based on what was needed for a system with more than 20 non-compute nodes (NCNs). The PVC size can only ever be increased.
Follow the Redeploying a Chart procedure with the following specifications:

- Chart name: `cray-sysmgmt-health`
- Base manifest name: `platform`

(`ncn-mw#`) When reaching the step to update the customizations, perform the following steps:

**Only follow these steps as part of the previously linked chart redeploy procedure.**
1. Edit the customizations by adding or updating `spec.kubernetes.services.cray-sysmgmt-health.kube-prometheus-stack.prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage`.

   ```bash
   yq write -i customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.kube-prometheus-stack.prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage' '300Gi'
   ```

2. Check that the customization file has been updated.

   ```bash
   yq read customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.kube-prometheus-stack.prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage'
   ```

   Example output:

   ```text
   300Gi
   ```
(`ncn-mw#`) When reaching the step to validate the redeployed chart, perform the following step:

**Only follow this step as part of the previously linked chart redeploy procedure.**

1. Verify that the increased volume size has been applied.

   ```bash
   watch "kubectl get pvc -n sysmgmt-health prometheus-cray-sysmgmt-health-kube-p-prometheus-db-prometheus-cray-sysmgmt-health-kube-p-prometheus-0"
   ```

   Example output:

   ```text
   NAME                                                                                                     STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS           AGE
   prometheus-cray-sysmgmt-health-kube-p-prometheus-db-prometheus-cray-sysmgmt-health-kube-p-prometheus-0   Bound    pvc-bcb8f4f1-fb84-4b48-95c7-63508ef18962   200Gi      RWO            k8s-block-replicated   3d2h
   ```

   At this point the Prometheus cluster is healthy, but additional steps are required to complete the resize of the Prometheus PVCs.
Make sure to perform the entire linked procedure, including the step to save the updated customizations.
### `cray-hms-hmcollector` pods are `OOMKilled`

Update resources associated with `cray-hms-hmcollector` in the `services` namespace.

Trial and error may be needed to determine what is best for a given system at scale.

See Adjust HM Collector Ingress Replicas and Resource Limits.
### `cray-cfs-api` pods are `OOMKilled`

Increase the memory requests and limits associated with the `cray-cfs-api` deployment in the `services` namespace.

Follow the Redeploying a Chart procedure with the following specifications:

- Chart name: `cray-cfs-api`
- Base manifest name: `sysmgmt`
(`ncn-mw#`) When reaching the step to update the customizations, perform the following steps:

**Only follow these steps as part of the previously linked chart redeploy procedure.**
1. Edit the customizations by adding or updating `spec.kubernetes.services.cray-cfs-api.cray-service.containers.cray-cfs-api.resources`.

   ```bash
   yq4 -i '.spec.kubernetes.services.cray-cfs-api.cray-service.containers.cray-cfs-api.resources.requests.memory="200Mi"' customizations.yaml
   yq4 -i '.spec.kubernetes.services.cray-cfs-api.cray-service.containers.cray-cfs-api.resources.limits.memory="500Mi"' customizations.yaml
   ```

2. Check that the customization file has been updated.

   1. Check the memory request value.

      ```bash
      yq4 '.spec.kubernetes.services.cray-cfs-api.cray-service.containers.cray-cfs-api.resources.requests.memory' customizations.yaml
      ```

      Expected output:

      ```text
      200Mi
      ```

   2. Check the memory limit value.

      ```bash
      yq4 '.spec.kubernetes.services.cray-cfs-api.cray-service.containers.cray-cfs-api.resources.limits.memory' customizations.yaml
      ```

      Expected output:

      ```text
      500Mi
      ```
(`ncn-mw#`) When reaching the step to validate the redeployed chart, perform the following steps:

**Only follow these steps as part of the previously linked chart redeploy procedure.**
1. Verify that the increased memory request and limit have been applied.

   ```bash
   kubectl get deployment -n services cray-cfs-api -o json | jq .spec.template.spec.containers[0].resources
   ```

   Example output:

   ```json
   {
     "limits": {
       "cpu": "500m",
       "memory": "500Mi"
     },
     "requests": {
       "cpu": "150m",
       "memory": "200Mi"
     }
   }
   ```
2. Run a CFS health check.

   ```bash
   /usr/local/bin/cmsdev test -q cfs
   ```

   For more details on this test, including known issues and other command line options, see Software Management Services health checks.
Make sure to perform the entire linked procedure, including the step to save the updated customizations.
To make changes that will not persist across installs or upgrades, see the following references. These procedures can also help to verify and eliminate issues in the short term. If other resource customizations are needed, contact support to request the feature.