Post-install customizations may be needed as systems scale. These customizations must also persist across future installs or upgrades. Not all resources can be customized post-install; common scenarios are documented in the following sections.
The following is a guide for determining where issues may exist, how to adjust the resources, and how to ensure the changes persist. Different values may be needed as systems scale.
**NOTE:** The `SYSTEM_DOMAIN_NAME` value found in some of the URLs on this page is expected to be the system's fully qualified domain name (FQDN).
(`ncn-mw#`) The FQDN can be found by running the following command on any Kubernetes NCN:

```bash
kubectl get secret site-init -n loftsman -o jsonpath='{.data.customizations\.yaml}' | base64 -d | yq r - spec.network.dns.external
```

Example output:

```text
system.hpc.amslabs.hpecorp.net
```

Be sure to modify the example URLs on this page by replacing `SYSTEM_DOMAIN_NAME` with the actual value found using the above command.
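Once the FQDN is known, the example URLs on this page can be assembled in a shell as a convenience. This is a minimal sketch; the `SYSTEM_DOMAIN_NAME` value below is the example output from above and must be replaced with the system's actual FQDN.

```shell
# Substitute the system FQDN into the example URLs used on this page.
# The value here is the example output from above; replace it with the real FQDN.
SYSTEM_DOMAIN_NAME="system.hpc.amslabs.hpecorp.net"

GRAFANA_URL="https://grafana.cmn.${SYSTEM_DOMAIN_NAME}/"
PROMETHEUS_URL="https://prometheus.cmn.${SYSTEM_DOMAIN_NAME}/"

echo "${GRAFANA_URL}"
echo "${PROMETHEUS_URL}"
```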
### `kubectl` events `OOMKilled`

Check to see if there are any recent out of memory events.

1. (`ncn-mw#`) Check `kubectl` events to see if there are any recent out of memory events.

   ```bash
   kubectl get event -A | grep OOM
   ```

2. Log in to Grafana at the following URL: `https://grafana.cmn.SYSTEM_DOMAIN_NAME/`

3. Search for the "Kubernetes / Compute Resources / Pod" dashboard to view the memory utilization graphs over time for any pod that has been `OOMKilled`.
### `CPUThrottlingHigh` alerts

Check Prometheus for recent `CPUThrottlingHigh` alerts.

1. Log in to Prometheus at the following URL: `https://prometheus.cmn.SYSTEM_DOMAIN_NAME/`

2. Select the **Alerts** tab.

3. Scroll down to the alert for `CPUThrottlingHigh`.
### Use Grafana to investigate and analyze CPU throttling and memory usage
1. Log in to Grafana at the following URL: `https://grafana.cmn.SYSTEM_DOMAIN_NAME/`

2. Search for the "Kubernetes / Compute Resources / Pod" dashboard.

3. Select the `datasource`, `namespace`, and `pod` based on the pod being examined. For example:

   - `datasource`: `default`
   - `namespace`: `sysmgmt-health`
   - `pod`: `prometheus-cray-sysmgmt-health-kube-p-prometheus-0`

4. Select the **CPU Throttling** drop-down to see the CPU throttling graph for the pod during the selected time (from the top right).

5. Select the container (from the legends under the x-axis).

6. Review the graph and adjust the `resources.limits.cpu` value as needed.
   The presence of CPU throttling does not always indicate a problem. However, if a service is slow or experiencing latency issues, adjusting `resources.limits.cpu` may be beneficial.

   **NOTE:** The `resources.requests.cpu` values are used by the Kubernetes scheduler to decide which node to place the pod on; they do not impact CPU throttling. The value of `resources.limits.cpu` can never be lower than the value of `resources.requests.cpu`.
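For context, the `CPUThrottlingHigh` alert and the Grafana CPU Throttling graph are both derived from the kernel CFS scheduler counters exported by cAdvisor (`container_cpu_cfs_throttled_periods_total` and `container_cpu_cfs_periods_total`). The following sketch shows the ratio they compute, using made-up sample counter deltas:

```shell
# Illustrative only: fraction of CFS scheduling periods in which the
# container was throttled, using made-up sample counter deltas.
throttled_periods=120   # increase in container_cpu_cfs_throttled_periods_total
total_periods=400       # increase in container_cpu_cfs_periods_total

throttle_pct=$(( 100 * throttled_periods / total_periods ))
echo "container was throttled in ${throttle_pct}% of CFS periods"
```

A consistently high percentage here is what raises the alert; a brief spike during startup is usually harmless.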
7. Select the **Memory Usage** drop-down to see the memory usage graph for the pod during the selected time (from the top right).

8. Select the container (from the legends under the x-axis).

9. Determine the steady-state memory usage by looking at the memory usage graph for the container. This is the minimum value at which `resources.requests.memory` should be set.

10. More importantly, determine the spike usage for the container, and set the `resources.limits.memory` value based on the spike values with some additional headroom.
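The sizing guidance above reduces to simple arithmetic. In this sketch, the steady-state and spike readings are hypothetical values taken from the Grafana memory usage graph, and the 25% headroom factor is an assumption to tune per system, not a fixed rule:

```shell
# Hypothetical readings from the Grafana memory usage graph, in MiB.
steady_state_mib=1200   # observed steady-state memory usage
spike_mib=2000          # observed peak (spike) memory usage

# resources.requests.memory: at least the steady-state usage.
requests_mib=$steady_state_mib

# resources.limits.memory: spike usage plus ~25% headroom (assumed factor).
limits_mib=$(( spike_mib + spike_mib / 4 ))

echo "resources.requests.memory=${requests_mib}Mi"
echo "resources.limits.memory=${limits_mib}Mi"
```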
Most of these procedures instruct the administrator to perform the Redeploying a Chart procedure for a specific chart. In these cases, the corresponding section on this page provides the information necessary to carry out that procedure. It is recommended to keep both pages open in separate browser windows for easy reference.
### Prometheus is `OOMKilled` or CPU throttled

Update resources associated with Prometheus in the `sysmgmt-health` namespace.

This example is based on what was needed for a system with 4000 compute nodes. Trial and error may be needed to determine what is best for a given system at scale.
Follow the Redeploying a Chart procedure with the following specifications:

- Chart name: `cray-sysmgmt-health`
- Base manifest name: `platform`

(`ncn-mw#`) When reaching the step to update the customizations, perform the following steps:

**Only follow these steps as part of the previously linked chart redeploy procedure.**
1. Edit the customizations by adding or updating `spec.kubernetes.services.cray-sysmgmt-health.kube-prometheus-stack.prometheus.prometheusSpec.resources`.

   If the number of NCNs is less than 20, then:

   ```bash
   yq write -i customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.kube-prometheus-stack.prometheus.prometheusSpec.resources.requests.cpu' --style=double '2'
   yq write -i customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.kube-prometheus-stack.prometheus.prometheusSpec.resources.requests.memory' '15Gi'
   yq write -i customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.kube-prometheus-stack.prometheus.prometheusSpec.resources.limits.cpu' --style=double '6'
   yq write -i customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.kube-prometheus-stack.prometheus.prometheusSpec.resources.limits.memory' '30Gi'
   ```

   If the number of NCNs is 20 or more, then:

   ```bash
   yq write -i customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.kube-prometheus-stack.prometheus.prometheusSpec.resources.requests.cpu' --style=double '6'
   yq write -i customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.kube-prometheus-stack.prometheus.prometheusSpec.resources.requests.memory' '50Gi'
   yq write -i customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.kube-prometheus-stack.prometheus.prometheusSpec.resources.limits.cpu' --style=double '12'
   yq write -i customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.kube-prometheus-stack.prometheus.prometheusSpec.resources.limits.memory' '60Gi'
   ```
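The two branches above can be wrapped in a small helper that picks the values from the NCN count. This is a sketch only; the threshold and values come from this page, and the `ncn_count` value here is illustrative:

```shell
ncn_count=24   # illustrative; substitute the system's actual NCN count

# Threshold and values mirror the two yq branches above.
if [ "$ncn_count" -lt 20 ]; then
  cpu_request=2;  mem_request=15Gi; cpu_limit=6;  mem_limit=30Gi
else
  cpu_request=6;  mem_request=50Gi; cpu_limit=12; mem_limit=60Gi
fi

echo "requests: cpu=${cpu_request} memory=${mem_request}"
echo "limits:   cpu=${cpu_limit} memory=${mem_limit}"
```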
2. Check that the customization file has been updated.

   ```bash
   yq read customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.kube-prometheus-stack.prometheus.prometheusSpec.resources'
   ```

   Example output (for a system with fewer than 20 NCNs):

   ```yaml
   requests:
     cpu: "2"
     memory: 15Gi
   limits:
     cpu: "6"
     memory: 30Gi
   ```
(`ncn-mw#`) When reaching the step to validate the redeployed chart, perform the following steps:

**Only follow these steps as part of the previously linked chart redeploy procedure.**

Verify that the pod restarts and that the desired resources have been applied.

1. Watch the `prometheus-cray-sysmgmt-health-kube-p-prometheus-0` pod restart.

   ```bash
   watch "kubectl get pods -n sysmgmt-health -l prometheus=cray-sysmgmt-health-kube-p-prometheus"
   ```

   It may take about 10 minutes for the `prometheus-cray-sysmgmt-health-kube-p-prometheus-0` pod to terminate. It can be force deleted if it remains in the terminating state:

   ```bash
   kubectl delete pod prometheus-cray-sysmgmt-health-kube-p-prometheus-0 --force --grace-period=0 -n sysmgmt-health
   ```

2. Verify that the resource changes are in place.

   ```bash
   kubectl get pod prometheus-cray-sysmgmt-health-kube-p-prometheus-0 -n sysmgmt-health -o json | jq -r '.spec.containers[] | select(.name == "prometheus").resources'
   ```

Make sure to perform the entire linked procedure, including the step to save the updated customizations.
### Postgres pods are `OOMKilled` or CPU throttled

Update resources associated with `spire-postgres` in the `spire` namespace.

This example is based on what was needed for a system with 4000 compute nodes. Trial and error may be needed to determine what is best for a given system at scale.
A similar flow can be used to update the resources for `cray-sls-postgres`, `cray-smd-postgres`, or `gitea-vcs-postgres`. The following table provides the values the administrator will need, based on which pods are experiencing problems.
| Chart name | Base manifest name | Resource path name | Kubernetes namespace |
|---|---|---|---|
| `cray-sls-postgres` | `core-services` | `cray-hms-sls` | `services` |
| `cray-smd-postgres` | `core-services` | `cray-hms-smd` | `services` |
| `gitea-vcs-postgres` | `sysmgmt` | `gitea` | `services` |
| `spire-postgres` | `sysmgmt` | `spire` | `spire` |
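When scripting the redeploy, the table can also be expressed as a shell lookup. This is a sketch only; the values simply mirror the table above, and the `chart` value chosen here is illustrative:

```shell
# Look up the table row for a given chart name (mirrors the table above).
chart=spire-postgres   # illustrative; pick the chart experiencing problems

case "$chart" in
  cray-sls-postgres)  manifest=core-services; rpname=cray-hms-sls; ns=services ;;
  cray-smd-postgres)  manifest=core-services; rpname=cray-hms-smd; ns=services ;;
  gitea-vcs-postgres) manifest=sysmgmt;       rpname=gitea;        ns=services ;;
  spire-postgres)     manifest=sysmgmt;       rpname=spire;        ns=spire    ;;
esac

echo "manifest=${manifest} rpname=${rpname} namespace=${ns}"
```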
Using the values from the above table, follow the Redeploying a Chart procedure with the following specifications:
(`ncn-mw#`) When reaching the step to update the customizations, perform the following steps:

**Only follow these steps as part of the previously linked chart redeploy procedure.**
1. Set the `rpname` variable to the appropriate resource path name from the table above.

   ```bash
   rpname=<put resource path name from table here>
   ```

2. Edit the customizations by adding or updating `spec.kubernetes.services.${rpname}.cray-postgresql.sqlCluster.resources`.

   ```bash
   yq write -i customizations.yaml "spec.kubernetes.services.${rpname}.cray-postgresql.sqlCluster.resources.requests.cpu" --style=double '4'
   yq write -i customizations.yaml "spec.kubernetes.services.${rpname}.cray-postgresql.sqlCluster.resources.requests.memory" '4Gi'
   yq write -i customizations.yaml "spec.kubernetes.services.${rpname}.cray-postgresql.sqlCluster.resources.limits.cpu" --style=double '8'
   yq write -i customizations.yaml "spec.kubernetes.services.${rpname}.cray-postgresql.sqlCluster.resources.limits.memory" '8Gi'
   ```

3. Check that the customization file has been updated.

   ```bash
   yq read customizations.yaml "spec.kubernetes.services.${rpname}.cray-postgresql.sqlCluster.resources"
   ```
   Example output:

   ```yaml
   requests:
     cpu: "4"
     memory: 4Gi
   limits:
     cpu: "8"
     memory: 8Gi
   ```
(`ncn-mw#`) When reaching the step to validate the redeployed chart, perform the following steps:

**Only follow these steps as part of the previously linked chart redeploy procedure.**

Verify that the pods restart and that the desired resources have been applied. Commands in this section use the `CHART_NAME` variable, which should have been set as part of the Redeploying a Chart procedure.

1. Set the `ns` variable to the name of the appropriate Kubernetes namespace from the earlier table.

   ```bash
   ns=<put kubernetes namespace here>
   ```

2. Watch the pod restart.

   ```bash
   watch "kubectl get pods -n ${ns} -l application=spilo,cluster-name=${CHART_NAME}"
   ```

3. Verify that the desired resources have been applied.

   ```bash
   kubectl get pod ${CHART_NAME}-0 -n "${ns}" -o json | jq -r '.spec.containers[] | select(.name == "postgres").resources'
   ```
   Example output:

   ```json
   {
     "limits": {
       "cpu": "8",
       "memory": "8Gi"
     },
     "requests": {
       "cpu": "4",
       "memory": "4Gi"
     }
   }
   ```
Make sure to perform the entire linked procedure, including the step to save the updated customizations.
### Scale the `cray-bss` service

Scale the replica count associated with the `cray-bss` service in the `services` namespace.

This example is based on what was needed for a system with 4000 compute nodes. Trial and error may be needed to determine what is best for a given system at scale.
Follow the Redeploying a Chart procedure with the following specifications:

- Chart name: `cray-hms-bss`
- Base manifest name: `sysmgmt`

(`ncn-mw#`) When reaching the step to update the customizations, perform the following steps:

**Only follow these steps as part of the previously linked chart redeploy procedure.**
1. Edit the customizations by adding or updating `spec.kubernetes.services.cray-hms-bss.cray-service.replicaCount`.

   ```bash
   yq write -i customizations.yaml 'spec.kubernetes.services.cray-hms-bss.cray-service.replicaCount' '5'
   ```

2. Check that the customization file has been updated.

   ```bash
   yq read customizations.yaml 'spec.kubernetes.services.cray-hms-bss.cray-service.replicaCount'
   ```

   Example output:

   ```text
   5
   ```
(`ncn-mw#`) When reaching the step to validate the redeployed chart, perform the following steps:

**Only follow these steps as part of the previously linked chart redeploy procedure.**
Verify that the `cray-bss` pods scale.

1. Watch the `cray-bss` pods scale to the desired number (in this example, 5), with each pod reaching a `2/2` ready state.

   ```bash
   watch "kubectl get pods -l app.kubernetes.io/instance=cray-hms-bss -n services"
   ```

   Example output:

   ```text
   NAME                       READY   STATUS    RESTARTS   AGE
   cray-bss-fccbc9f7d-7jw2q   2/2     Running   0          82m
   cray-bss-fccbc9f7d-l524g   2/2     Running   0          93s
   cray-bss-fccbc9f7d-qwzst   2/2     Running   0          93s
   cray-bss-fccbc9f7d-sw48b   2/2     Running   0          82m
   cray-bss-fccbc9f7d-xr26l   2/2     Running   0          82m
   ```
2. Verify that the replicas change is present in the Kubernetes `cray-bss` deployment.

   ```bash
   kubectl get deployment cray-bss -n services -o json | jq -r '.spec.replicas'
   ```

   In this example, `5` will be the returned value.
Make sure to perform the entire linked procedure, including the step to save the updated customizations.
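The verification step reduces to a simple comparison between the desired replica count and the number of ready pods. The sketch below stands in for the `kubectl`/`jq` output above with hypothetical values:

```shell
# Hypothetical values standing in for the kubectl output above.
desired_replicas=5
ready_pods=5   # e.g. number of pods showing a 2/2 Running state

if [ "$ready_pods" -eq "$desired_replicas" ]; then
  scale_status="scaled"
else
  scale_status="waiting (${ready_pods}/${desired_replicas} ready)"
fi
echo "$scale_status"
```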
### Scale the `cray-dns-unbound` service

Scale the replica count associated with the `cray-dns-unbound` service in the `services` namespace.

Trial and error may be needed to determine what is best for a given system at scale.
Follow the Redeploying a Chart procedure with the following specifications:

- Chart name: `cray-dns-unbound`
- Base manifest name: `core-services`

(`ncn-mw#`) When reaching the step to update the customizations, perform the following steps:

**Only follow these steps as part of the previously linked chart redeploy procedure.**
1. Edit the customizations by adding or updating `spec.kubernetes.services.cray-dns-unbound.cray-service.replicaCount`.

   ```bash
   yq write -i customizations.yaml 'spec.kubernetes.services.cray-dns-unbound.cray-service.replicaCount' '5'
   ```

2. Check that the customization file has been updated.

   ```bash
   yq read customizations.yaml 'spec.kubernetes.services.cray-dns-unbound.cray-service.replicaCount'
   ```

   Example output:

   ```text
   5
   ```
(`ncn-mw#`) When reaching the step to validate the redeployed chart, perform the following steps:

**Only follow these steps as part of the previously linked chart redeploy procedure.**
Verify that the `cray-dns-unbound` pods scale.

1. Watch the `cray-dns-unbound` pods scale to the desired number (in this example, 5), with each pod reaching a `3/3` ready state.

   ```bash
   watch "kubectl get pods -l app.kubernetes.io/instance=cray-dns-unbound -n services"
   ```

   Example output:

   ```text
   NAME                                READY   STATUS    RESTARTS   AGE
   cray-dns-unbound-58b5cfdb4d-6vwrx   3/3     Running   0          88s
   cray-dns-unbound-58b5cfdb4d-6wrpr   3/3     Running   0          87s
   cray-dns-unbound-58b5cfdb4d-7ndhg   3/3     Running   0          70m
   cray-dns-unbound-58b5cfdb4d-n498k   3/3     Running   0          70m
   cray-dns-unbound-58b5cfdb4d-w2tq9   3/3     Running   0          70m
   ```
2. Verify that the replicas change is present in the Kubernetes `cray-dns-unbound` deployment.

   ```bash
   kubectl get deployment cray-dns-unbound -n services -o json | jq -r '.spec.replicas'
   ```

   In this example, `5` will be the returned value.
Make sure to perform the entire linked procedure, including the step to save the updated customizations.
### Increase Postgres PVC volume size

Increase the PVC volume size associated with the `cray-smd-postgres` cluster in the `services` namespace.

This example is based on what was needed for a system with 4000 compute nodes. Trial and error may be needed to determine what is best for a given system at scale. The PVC size can only ever be increased.
A similar flow can be used to update the resources for `cray-sls-postgres`, `gitea-vcs-postgres`, or `spire-postgres`. The following table provides the values the administrator will need, based on which pods are experiencing problems.
| Chart name | Base manifest name | Resource path name | Kubernetes namespace |
|---|---|---|---|
| `cray-sls-postgres` | `core-services` | `cray-hms-sls` | `services` |
| `cray-smd-postgres` | `core-services` | `cray-hms-smd` | `services` |
| `gitea-vcs-postgres` | `sysmgmt` | `gitea` | `services` |
| `spire-postgres` | `sysmgmt` | `spire` | `spire` |
Using the values from the above table, follow the Redeploying a Chart procedure with the following specifications:
(`ncn-mw#`) When reaching the step to update the customizations, perform the following steps:

**Only follow these steps as part of the previously linked chart redeploy procedure.**
1. Set the `rpname` variable to the appropriate resource path name from the table above.

   ```bash
   rpname=<put resource path name from table here>
   ```

2. Edit the customizations by adding or updating `spec.kubernetes.services.${rpname}.cray-postgresql.sqlCluster.volumeSize`.

   ```bash
   yq write -i customizations.yaml "spec.kubernetes.services.${rpname}.cray-postgresql.sqlCluster.volumeSize" '100Gi'
   ```

3. Check that the customization file has been updated.

   ```bash
   yq read customizations.yaml "spec.kubernetes.services.${rpname}.cray-postgresql.sqlCluster.volumeSize"
   ```

   Example output:

   ```text
   100Gi
   ```
(`ncn-mw#`) When reaching the step to validate the redeployed chart, perform the following steps:

**Only follow these steps as part of the previously linked chart redeploy procedure.**

Verify that the pods restart and that the desired resources have been applied. Commands in this section use the `CHART_NAME` variable, which should have been set as part of the Redeploying a Chart procedure.

1. Set the `ns` variable to the name of the appropriate Kubernetes namespace from the earlier table.

   ```bash
   ns=<put kubernetes namespace here>
   ```

2. Verify that the increased volume size has been applied.

   ```bash
   watch "kubectl get postgresql ${CHART_NAME} -n $ns"
   ```

   Example output:

   ```text
   NAME                TEAM       VERSION   PODS   VOLUME   CPU-REQUEST   MEMORY-REQUEST   AGE   STATUS
   cray-smd-postgres   cray-smd   11        3      100Gi    500m          8Gi              45m   Running
   ```
   If the status in the above output is `SyncFailed` instead of `Running`, refer to Case 1 in the `SyncFailed` section of Troubleshoot Postgres Database.

   At this point the Postgres cluster is healthy, but additional steps are required to complete the resize of the Postgres PVCs.

Make sure to perform the entire linked procedure, including the step to save the updated customizations.
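Because Kubernetes only supports expanding a PVC, never shrinking it, a quick guard before editing `volumeSize` can avoid a failed sync. This is a sketch with hypothetical sizes, both assumed to be whole numbers of Gi:

```shell
# Hypothetical sizes, in Gi; the current size can be read from `kubectl get pvc`.
current_gi=100
requested_gi=120

# PVC sizes can only ever be increased.
if [ "$requested_gi" -lt "$current_gi" ]; then
  decision="refuse: PVC size can only ever be increased"
else
  decision="ok: resize from ${current_gi}Gi to ${requested_gi}Gi"
fi
echo "$decision"
```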
### Increase Prometheus PVC volume size

Increase the PVC volume size associated with the `prometheus-cray-sysmgmt-health-kube-p-prometheus` cluster in the `sysmgmt-health` namespace.

This example is based on what was needed for a system with more than 20 non-compute nodes (NCNs). The PVC size can only ever be increased.
Follow the Redeploying a Chart procedure with the following specifications:

- Chart name: `cray-sysmgmt-health`
- Base manifest name: `platform`

(`ncn-mw#`) When reaching the step to update the customizations, perform the following steps:

**Only follow these steps as part of the previously linked chart redeploy procedure.**
1. Edit the customizations by adding or updating `spec.kubernetes.services.cray-sysmgmt-health.kube-prometheus-stack.prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage`.

   ```bash
   yq write -i customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.kube-prometheus-stack.prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage' '300Gi'
   ```

2. Check that the customization file has been updated.

   ```bash
   yq read customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.kube-prometheus-stack.prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage'
   ```

   Example output:

   ```text
   300Gi
   ```
(`ncn-mw#`) When reaching the step to validate the redeployed chart, perform the following step:

**Only follow this step as part of the previously linked chart redeploy procedure.**

1. Verify that the increased volume size has been applied.

   ```bash
   watch "kubectl get pvc -n sysmgmt-health prometheus-cray-sysmgmt-health-kube-p-prometheus-db-prometheus-cray-sysmgmt-health-kube-p-prometheus-0"
   ```

   Example output:

   ```text
   NAME                                                                                                     STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS           AGE
   prometheus-cray-sysmgmt-health-kube-p-prometheus-db-prometheus-cray-sysmgmt-health-kube-p-prometheus-0   Bound    pvc-bcb8f4f1-fb84-4b48-95c7-63508ef18962   200Gi      RWO            k8s-block-replicated   3d2h
   ```

   At this point the Prometheus cluster is healthy, but additional steps are required to complete the resize of the Prometheus PVCs.
Make sure to perform the entire linked procedure, including the step to save the updated customizations.
### `cray-hms-hmcollector` pods are `OOMKilled`

Update resources associated with `cray-hms-hmcollector` in the `services` namespace.

Trial and error may be needed to determine what is best for a given system at scale.

See Adjust HM Collector Ingress Replicas and Resource Limits.
### `cray-cfs-api` pods are `OOMKilled`

Increase the memory requests and limits associated with the `cray-cfs-api` deployment in the `services` namespace.

Follow the Redeploying a Chart procedure with the following specifications:

- Chart name: `cray-cfs-api`
- Base manifest name: `sysmgmt`
(`ncn-mw#`) When reaching the step to update the customizations, perform the following steps:

**Only follow these steps as part of the previously linked chart redeploy procedure.**
1. Edit the customizations by adding or updating `spec.kubernetes.services.cray-cfs-api.cray-service.containers.cray-cfs-api.resources`.

   ```bash
   yq4 -i '.spec.kubernetes.services.cray-cfs-api.cray-service.containers.cray-cfs-api.resources.requests.memory="200Mi"' customizations.yaml
   yq4 -i '.spec.kubernetes.services.cray-cfs-api.cray-service.containers.cray-cfs-api.resources.limits.memory="500Mi"' customizations.yaml
   ```

2. Check that the customization file has been updated.

   1. Check the memory request value.

      ```bash
      yq4 '.spec.kubernetes.services.cray-cfs-api.cray-service.containers.cray-cfs-api.resources.requests.memory' customizations.yaml
      ```

      Expected output:

      ```text
      200Mi
      ```

   2. Check the memory limit value.

      ```bash
      yq4 '.spec.kubernetes.services.cray-cfs-api.cray-service.containers.cray-cfs-api.resources.limits.memory' customizations.yaml
      ```

      Expected output:

      ```text
      500Mi
      ```
(`ncn-mw#`) When reaching the step to validate the redeployed chart, perform the following steps:

**Only follow these steps as part of the previously linked chart redeploy procedure.**
1. Verify that the increased memory request and limit have been applied.

   ```bash
   kubectl get deployment -n services cray-cfs-api -o json | jq .spec.template.spec.containers[0].resources
   ```

   Example output:

   ```json
   {
     "limits": {
       "cpu": "500m",
       "memory": "500Mi"
     },
     "requests": {
       "cpu": "150m",
       "memory": "200Mi"
     }
   }
   ```
2. Run a CFS health check.

   ```bash
   /usr/local/bin/cmsdev test -q cfs
   ```

   For more details on this test, including known issues and other command line options, see Software Management Services health checks.
Make sure to perform the entire linked procedure, including the step to save the updated customizations.
To make changes that will not persist across installs or upgrades, see the following references. These procedures can also help to verify and eliminate issues in the short term. If other resource customizations are needed, contact support to request the feature.