Post-Install Customizations

Post-install customizations may be needed as systems scale. These customizations also need to persist across future installs or upgrades. Not all resources can be customized post-install; common scenarios are documented in the following sections.

The following is a guide for determining where issues may exist, how to adjust the resources, and how to ensure the changes will persist. Different values may be needed for systems as they scale.

System domain name

The SYSTEM_DOMAIN_NAME value found in some of the URLs on this page is expected to be the system’s fully qualified domain name (FQDN).

(ncn-mw#) The FQDN can be found by running the following command on any Kubernetes NCN.

kubectl get secret site-init -n loftsman -o jsonpath='{.data.customizations\.yaml}' | base64 -d | yq r - spec.network.dns.external

Example output:

system.hpc.amslabs.hpecorp.net

Be sure to modify the example URLs on this page by replacing SYSTEM_DOMAIN_NAME with the actual value found using the above command.
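As a minimal sketch of that substitution, the shell function below fills the placeholder into a URL template. The function name fill_domain is hypothetical and is not part of the product; the FQDN is passed in as an argument so that the output of the lookup command above can be supplied.

```shell
# Hypothetical helper (an illustration, not a product command): replace the
# SYSTEM_DOMAIN_NAME placeholder in a URL template with the system's FQDN.
fill_domain() {
    local template="$1" fqdn="$2"
    # '|' is used as the sed delimiter because URLs contain '/'.
    printf '%s\n' "${template}" | sed "s|SYSTEM_DOMAIN_NAME|${fqdn}|g"
}
```

For example, calling fill_domain 'https://grafana.cmn.SYSTEM_DOMAIN_NAME/' 'system.hpc.amslabs.hpecorp.net' prints https://grafana.cmn.system.hpc.amslabs.hpecorp.net/.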

kubectl events OOMKilled

Check to see if there are any recent out of memory events.

  1. (ncn-mw#) Check kubectl events to see if there are any recent out of memory events.

    kubectl get event -A | grep OOM
    
  2. Log in to Grafana at the following URL: https://grafana.cmn.SYSTEM_DOMAIN_NAME/

  3. Search for the “Kubernetes / Compute Resources / Pod” dashboard to view the memory utilization graphs over time for any pod that has been OOMKilled.
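On a busy system the event list can be long. The sketch below counts OOM-related events per namespace so the noisiest namespaces stand out. The function name summarize_oom_events is hypothetical, and the sketch assumes the namespace is the first column of the output, as it is when the -A option is used.

```shell
# Hypothetical helper: count OOM-related events per namespace.
# Reads the output of "kubectl get event -A" on standard input and
# assumes the first column is the namespace.
summarize_oom_events() {
    grep OOM |
        awk '{ count[$1]++ } END { for (ns in count) print ns, count[ns] }' |
        sort
}
```

It could be used as: kubectl get event -A | summarize_oom_events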

Prometheus CPUThrottlingHigh alerts

Check Prometheus for recent CPUThrottlingHigh alerts.

  1. Log in to Prometheus at the following URL: https://prometheus.cmn.SYSTEM_DOMAIN_NAME/

    1. Select the Alert tab.

    2. Scroll down to the alert for CPUThrottlingHigh.

  2. Log in to Grafana at the following URL: https://grafana.cmn.SYSTEM_DOMAIN_NAME/

    1. Search for the “Kubernetes / Compute Resources / Pod” dashboard to view the throttling graphs over time for any pod that is alerting.

Grafana “Kubernetes / Compute Resources / Pod” dashboard

Use Grafana to investigate and analyze CPU throttling and memory usage.

  1. Log in to Grafana at the following URL: https://grafana.cmn.SYSTEM_DOMAIN_NAME/

  2. Search for the “Kubernetes / Compute Resources / Pod” dashboard.

  3. Select the datasource, namespace, and pod based on the pod being examined.

    For example:

    datasource: default
    namespace: sysmgmt-health
    pod: prometheus-cray-sysmgmt-health-kube-p-prometheus-0
    

CPU throttling

  1. Select the CPU Throttling drop-down to see the CPU Throttling graph for the pod during the selected time (from the top right).

  2. Select the container (from the legends under the x axis).

  3. Review the graph and adjust the resources.limits.cpu value as needed.

    The presence of CPU throttling does not always indicate a problem, but if a service is slow or experiencing latency issues, adjusting resources.limits.cpu may be beneficial.

    For example:

    • If the pod is being throttled at or near 100% for any period of time, then adjustments are likely needed.
    • If the service’s response time is critical, then adjusting the pod’s resources to greatly reduce or eliminate any CPU throttling may be required.

    NOTE: The resources.requests.cpu values are used by the Kubernetes scheduler to decide which node to place the pod on and do not impact CPU throttling. The value of resources.limits.cpu can never be lower than the value of resources.requests.cpu.
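Because CPU values may be written either in whole or fractional cores (for example, 2 or 1.5) or in millicores (for example, 500m), comparing a request against a limit by eye is error prone. The following sketch normalizes both to millicores and checks that the limit is not lower than the request. The function names are hypothetical, and only those two notations are handled.

```shell
# Hypothetical helper: convert a Kubernetes CPU quantity to millicores.
# Handles plain cores ("2", "1.5") and millicores ("500m") only.
cpu_to_millicores() {
    case "$1" in
        *m) printf '%s\n' "${1%m}" ;;
        *)  awk -v cores="$1" 'BEGIN { printf "%d\n", cores * 1000 }' ;;
    esac
}

# Succeeds when the limit ($2) is greater than or equal to the request ($1).
cpu_limit_ok() {
    local request limit
    request=$(cpu_to_millicores "$1")
    limit=$(cpu_to_millicores "$2")
    [ "${limit}" -ge "${request}" ]
}
```

For example, cpu_limit_ok 2 6 succeeds, while cpu_limit_ok 4 500m fails because 500 millicores is less than 4000.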

Memory usage

  1. Select the Memory Usage drop-down to see the memory usage graph for the pod during the selected time (from the top right).

  2. Select the container (from the legends under the x axis).

  3. Determine the steady state memory usage by looking at the memory usage graph for the container.

    This is the minimum value to use for resources.requests.memory. More importantly, determine the peak (spike) usage for the container and set resources.limits.memory to that peak plus some additional headroom.
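The headroom calculation can be sketched as simple integer arithmetic. The function name suggest_memory_limit_mi and the default headroom of 25 percent are assumptions chosen for illustration only; the right amount of headroom depends on the service.

```shell
# Hypothetical helper: suggest a memory limit from an observed spike.
# $1 = observed peak usage in Mi (integer)
# $2 = headroom in percent (optional; the default of 25 is an assumption)
suggest_memory_limit_mi() {
    local spike_mi="$1" headroom_pct="${2:-25}"
    # Integer ceiling of spike * (100 + headroom) / 100.
    echo "$(( (spike_mi * (100 + headroom_pct) + 99) / 100 ))Mi"
}
```

For example, suggest_memory_limit_mi 400 prints 500Mi, and suggest_memory_limit_mi 1000 10 prints 1100Mi.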

Common customization scenarios

Prerequisites

Most of these procedures instruct the administrator to perform the Redeploying a Chart procedure for a specific chart. In these cases, the section on this page provides the information needed to carry out that procedure. It is recommended to keep both pages open in different browser windows for easy reference.

Prometheus pod is OOMKilled or CPU throttled

Update resources associated with Prometheus in the sysmgmt-health namespace. This example is based on what was needed for a system with 4000 compute nodes. Trial and error may be needed to determine what is best for a given system at scale.

Follow the Redeploying a Chart procedure with the following specifications:

  • Chart name: cray-sysmgmt-health

  • Base manifest name: platform

  • (ncn-mw#) When reaching the step to update the customizations, perform the following steps:

    Only follow these steps as part of the previously linked chart redeploy procedure.

    1. Edit the customizations by adding or updating spec.kubernetes.services.cray-sysmgmt-health.kube-prometheus-stack.prometheus.prometheusSpec.resources.

      • If the number of NCNs is less than 20, then:

        yq write -i customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.kube-prometheus-stack.prometheus.prometheusSpec.resources.requests.cpu' --style=double '2'
        yq write -i customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.kube-prometheus-stack.prometheus.prometheusSpec.resources.requests.memory' '15Gi'
        yq write -i customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.kube-prometheus-stack.prometheus.prometheusSpec.resources.limits.cpu' --style=double '6'
        yq write -i customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.kube-prometheus-stack.prometheus.prometheusSpec.resources.limits.memory' '30Gi'
        
      • If the number of NCNs is 20 or more, then:

        yq write -i customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.kube-prometheus-stack.prometheus.prometheusSpec.resources.requests.cpu' --style=double '6'
        yq write -i customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.kube-prometheus-stack.prometheus.prometheusSpec.resources.requests.memory' '50Gi'
        yq write -i customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.kube-prometheus-stack.prometheus.prometheusSpec.resources.limits.cpu' --style=double '12'
        yq write -i customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.kube-prometheus-stack.prometheus.prometheusSpec.resources.limits.memory' '60Gi'
        
    2. Check that the customization file has been updated.

      yq read customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.kube-prometheus-stack.prometheus.prometheusSpec.resources'
      

      Example output for a system with fewer than 20 NCNs:

      requests:
        cpu: "2"
        memory: 15Gi
      limits:
        cpu: "6"
        memory: 30Gi
      
  • (ncn-mw#) When reaching the step to validate the redeployed chart, perform the following steps:

    Only follow these steps as part of the previously linked chart redeploy procedure.

    1. Verify that the pod restarts and that the desired resources have been applied.

      Watch the prometheus-cray-sysmgmt-health-kube-p-prometheus-0 pod restart.

      watch "kubectl get pods -n sysmgmt-health -l prometheus=cray-sysmgmt-health-kube-p-prometheus"
      

      It may take about 10 minutes for the prometheus-cray-sysmgmt-health-kube-p-prometheus-0 pod to terminate. If it remains in the terminating state, it can be force deleted:

      kubectl delete pod prometheus-cray-sysmgmt-health-kube-p-prometheus-0 --force --grace-period=0 -n sysmgmt-health
      
    2. Verify that the resource changes are in place.

      kubectl get pod prometheus-cray-sysmgmt-health-kube-p-prometheus-0 -n sysmgmt-health -o json | jq -r '.spec.containers[] | select(.name == "prometheus").resources'
      
  • Make sure to perform the entire linked procedure, including the step to save the updated customizations.
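The choice between the two sets of Prometheus values in this section can be sketched as a small function. The function name prometheus_resource_values is hypothetical; the values it prints are the ones listed in the steps above, and the NCN count is assumed to be supplied by the administrator.

```shell
# Hypothetical helper: print the Prometheus resource settings from this
# section for a given number of NCNs, one "path-suffix value" pair per line.
prometheus_resource_values() {
    local ncn_count="$1"
    if [ "${ncn_count}" -lt 20 ]; then
        printf '%s\n' 'requests.cpu 2' 'requests.memory 15Gi' \
                      'limits.cpu 6' 'limits.memory 30Gi'
    else
        printf '%s\n' 'requests.cpu 6' 'requests.memory 50Gi' \
                      'limits.cpu 12' 'limits.memory 60Gi'
    fi
}
```

Each printed pair corresponds to one of the yq write commands shown in the customization step, with the first field being the last part of the path under prometheusSpec.resources.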

Postgres pods are OOMKilled or CPU throttled

Update resources associated with spire-postgres in the spire namespace. This example is based on what was needed for a system with 4000 compute nodes. Trial and error may be needed to determine what is best for a given system at scale.

A similar flow can be used to update the resources for cray-sls-postgres, cray-smd-postgres, or gitea-vcs-postgres.

The following table provides values the administrator will need based on which pods are experiencing problems.

Chart name           Base manifest name   Resource path name   Kubernetes namespace
cray-sls-postgres    core-services        cray-hms-sls         services
cray-smd-postgres    core-services        cray-hms-smd         services
gitea-vcs-postgres   sysmgmt              gitea                services
spire-postgres       sysmgmt              spire                spire

Using the values from the above table, follow the Redeploying a Chart procedure with the following specifications:

  • (ncn-mw#) When reaching the step to update the customizations, perform the following steps:

    Only follow these steps as part of the previously linked chart redeploy procedure.

    1. Set the rpname variable to the appropriate resource path name from the table above.

      rpname=<put resource path name from table here>
      
    2. Edit the customizations by adding or updating spec.kubernetes.services.${rpname}.cray-postgresql.sqlCluster.resources.

      yq write -i customizations.yaml "spec.kubernetes.services.${rpname}.cray-postgresql.sqlCluster.resources.requests.cpu" --style=double '4'
      yq write -i customizations.yaml "spec.kubernetes.services.${rpname}.cray-postgresql.sqlCluster.resources.requests.memory" '4Gi'
      yq write -i customizations.yaml "spec.kubernetes.services.${rpname}.cray-postgresql.sqlCluster.resources.limits.cpu" --style=double '8'
      yq write -i customizations.yaml "spec.kubernetes.services.${rpname}.cray-postgresql.sqlCluster.resources.limits.memory" '8Gi'
      
    3. Check that the customization file has been updated.

      yq read customizations.yaml "spec.kubernetes.services.${rpname}.cray-postgresql.sqlCluster.resources"
      

      Example output:

      requests:
        cpu: "4"
        memory: 4Gi
      limits:
        cpu: "8"
        memory: 8Gi
      
  • (ncn-mw#) When reaching the step to validate the redeployed chart, perform the following steps:

    Only follow these steps as part of the previously linked chart redeploy procedure.

    Verify that the pods restart and that the desired resources have been applied. Commands in this section use the $CHART_NAME variable which should have been set as part of the Redeploying a Chart procedure.

    1. Set the ns variable to the name of the appropriate Kubernetes namespace from the earlier table.

      ns=<put kubernetes namespace here>
      
    2. Watch the pod restart.

      watch "kubectl get pods -n ${ns} -l application=spilo,cluster-name=${CHART_NAME}"
      
    3. Verify that the desired resources have been applied.

      kubectl get pod ${CHART_NAME}-0 -n "${ns}" -o json | jq -r '.spec.containers[] | select(.name == "postgres").resources'
      

      Example output:

      {
        "limits": {
          "cpu": "8",
          "memory": "8Gi"
        },
        "requests": {
          "cpu": "4",
          "memory": "4Gi"
        }
      }
      
  • Make sure to perform the entire linked procedure, including the step to save the updated customizations.
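To reduce copy and paste mistakes when setting rpname and ns, the table earlier in this section can be encoded as a lookup. This is only a sketch: the function name postgres_chart_info is hypothetical, and the values are taken directly from that table.

```shell
# Hypothetical helper: look up the table values for a Postgres chart.
# Prints: <base manifest name> <resource path name> <Kubernetes namespace>
postgres_chart_info() {
    case "$1" in
        cray-sls-postgres)  echo "core-services cray-hms-sls services" ;;
        cray-smd-postgres)  echo "core-services cray-hms-smd services" ;;
        gitea-vcs-postgres) echo "sysmgmt gitea services" ;;
        spire-postgres)     echo "sysmgmt spire spire" ;;
        *) echo "unknown chart: $1" >&2; return 1 ;;
    esac
}
```

For example, postgres_chart_info spire-postgres prints sysmgmt spire spire; the second and third fields are the values to assign to rpname and ns.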

Scale cray-bss service

Scale the replica count associated with the cray-bss service in the services namespace. This example is based on what was needed for a system with 4000 compute nodes. Trial and error may be needed to determine what is best for a given system at scale.

Follow the Redeploying a Chart procedure with the following specifications:

  • Chart name: cray-hms-bss

  • Base manifest name: sysmgmt

  • (ncn-mw#) When reaching the step to update the customizations, perform the following steps:

    Only follow these steps as part of the previously linked chart redeploy procedure.

    1. Edit the customizations by adding or updating spec.kubernetes.services.cray-hms-bss.cray-service.replicaCount.

      yq write -i customizations.yaml 'spec.kubernetes.services.cray-hms-bss.cray-service.replicaCount' '5'
      
    2. Check that the customization file has been updated.

      yq read customizations.yaml 'spec.kubernetes.services.cray-hms-bss.cray-service.replicaCount'
      

      Example output:

      5
      
  • (ncn-mw#) When reaching the step to validate the redeployed chart, perform the following steps:

    Only follow these steps as part of the previously linked chart redeploy procedure.

    Verify the cray-bss pods scale.

    1. Watch the cray-bss pods scale to the desired number (in this example, 5), with each pod reaching a 2/2 ready state.

      watch "kubectl get pods -l app.kubernetes.io/instance=cray-hms-bss -n services"
      

      Example output:

      NAME                       READY   STATUS    RESTARTS   AGE
      cray-bss-fccbc9f7d-7jw2q   2/2     Running   0          82m
      cray-bss-fccbc9f7d-l524g   2/2     Running   0          93s
      cray-bss-fccbc9f7d-qwzst   2/2     Running   0          93s
      cray-bss-fccbc9f7d-sw48b   2/2     Running   0          82m
      cray-bss-fccbc9f7d-xr26l   2/2     Running   0          82m
      
    2. Verify that the replicas change is present in the Kubernetes cray-bss deployment.

      kubectl get deployment cray-bss -n services -o json | jq -r '.spec.replicas'
      

      In this example, 5 will be the returned value.

  • Make sure to perform the entire linked procedure, including the step to save the updated customizations.
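Instead of watching the pod list by eye, the readiness check can be sketched as a filter over the kubectl get pods output. The function name all_pods_ready is hypothetical. It assumes the default column layout shown in the example output above (with a header line), and it treats a pod as ready when both numbers in the READY column match and the status is Running.

```shell
# Hypothetical helper: succeed only when exactly $1 pods are listed and all
# of them are fully ready. Reads "kubectl get pods" output (with header)
# on standard input.
all_pods_ready() {
    awk -v want="$1" '
        NR == 1 { next }    # skip the header line
        {
            total++
            split($2, ready_parts, "/")
            if (ready_parts[1] == ready_parts[2] && $3 == "Running") ready++
        }
        END { exit !(total == want && ready == want) }
    '
}
```

It could be used as: kubectl get pods -l app.kubernetes.io/instance=cray-hms-bss -n services | all_pods_ready 5 && echo "scaled"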

Scale cray-dns-unbound service

Scale the replica count associated with the cray-dns-unbound service in the services namespace. Trial and error may be needed to determine what is best for a given system at scale.

Follow the Redeploying a Chart procedure with the following specifications:

  • Chart name: cray-dns-unbound

  • Base manifest name: core-services

  • (ncn-mw#) When reaching the step to update the customizations, perform the following steps:

    Only follow these steps as part of the previously linked chart redeploy procedure.

    1. Edit the customizations by adding or updating spec.kubernetes.services.cray-dns-unbound.cray-service.replicaCount.

      yq write -i customizations.yaml 'spec.kubernetes.services.cray-dns-unbound.cray-service.replicaCount' '5'
      
    2. Check that the customization file has been updated.

      yq read customizations.yaml 'spec.kubernetes.services.cray-dns-unbound.cray-service.replicaCount'
      

      Example output:

      5
      
  • (ncn-mw#) When reaching the step to validate the redeployed chart, perform the following steps:

    Only follow these steps as part of the previously linked chart redeploy procedure.

    Verify the cray-dns-unbound pods scale.

    1. Watch the cray-dns-unbound pods scale to the desired number (in this example, 5), with each pod reaching a 3/3 ready state.

      watch "kubectl get pods -l app.kubernetes.io/instance=cray-dns-unbound -n services"
      

      Example output:

      NAME                                READY   STATUS    RESTARTS   AGE
      cray-dns-unbound-58b5cfdb4d-6vwrx   3/3     Running   0          88s
      cray-dns-unbound-58b5cfdb4d-6wrpr   3/3     Running   0          87s
      cray-dns-unbound-58b5cfdb4d-7ndhg   3/3     Running   0          70m
      cray-dns-unbound-58b5cfdb4d-n498k   3/3     Running   0          70m
      cray-dns-unbound-58b5cfdb4d-w2tq9   3/3     Running   0          70m
      
    2. Verify that the replicas change is present in the Kubernetes cray-dns-unbound deployment.

      kubectl get deployment cray-dns-unbound -n services -o json | jq -r '.spec.replicas'
      

      In this example, 5 will be the returned value.

  • Make sure to perform the entire linked procedure, including the step to save the updated customizations.

Postgres PVC resize

Increase the PVC volume size associated with the cray-smd-postgres cluster in the services namespace. This example is based on what was needed for a system with 4000 compute nodes. Trial and error may be needed to determine what is best for a given system at scale. The PVC size can only ever be increased.

A similar flow can be used to update the resources for cray-sls-postgres, gitea-vcs-postgres, or spire-postgres.

The following table provides values the administrator will need based on which pods are experiencing problems.

Chart name           Base manifest name   Resource path name   Kubernetes namespace
cray-sls-postgres    core-services        cray-hms-sls         services
cray-smd-postgres    core-services        cray-hms-smd         services
gitea-vcs-postgres   sysmgmt              gitea                services
spire-postgres       sysmgmt              spire                spire

Using the values from the above table, follow the Redeploying a Chart procedure with the following specifications:

  • (ncn-mw#) When reaching the step to update the customizations, perform the following steps:

    Only follow these steps as part of the previously linked chart redeploy procedure.

    1. Set the rpname variable to the appropriate resource path name from the table above.

      rpname=<put resource path name from table here>
      
    2. Edit the customizations by adding or updating spec.kubernetes.services.${rpname}.cray-postgresql.sqlCluster.volumeSize.

      yq write -i customizations.yaml "spec.kubernetes.services.${rpname}.cray-postgresql.sqlCluster.volumeSize" '100Gi'
      
    3. Check that the customization file has been updated.

      yq read customizations.yaml "spec.kubernetes.services.${rpname}.cray-postgresql.sqlCluster.volumeSize"
      

      Example output:

      100Gi
      
  • (ncn-mw#) When reaching the step to validate the redeployed chart, perform the following steps:

    Only follow these steps as part of the previously linked chart redeploy procedure.

    Verify that the pods restart and that the desired resources have been applied. Commands in this section use the $CHART_NAME variable which should have been set as part of the Redeploying a Chart procedure.

    1. Set the ns variable to the name of the appropriate Kubernetes namespace from the earlier table.

      ns=<put kubernetes namespace here>
      
    2. Verify that the increased volume size has been applied.

      watch "kubectl get postgresql ${CHART_NAME} -n $ns"
      

      Example output:

      NAME                TEAM       VERSION   PODS   VOLUME   CPU-REQUEST   MEMORY-REQUEST   AGE   STATUS
      cray-smd-postgres   cray-smd   11        3      100Gi    500m          8Gi              45m   Running
      
    3. If the status on the above command is SyncFailed instead of Running, refer to Case 1 in the SyncFailed section of Troubleshoot Postgres Database.

      At this point the Postgres cluster is healthy, but additional steps are required to complete the resize of the Postgres PVCs.

  • Make sure to perform the entire linked procedure, including the step to save the updated customizations.
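The check of the postgresql resource described above can be sketched as a filter that tests both the VOLUME and STATUS columns. The function name postgres_resize_applied is hypothetical, and the sketch assumes the column layout shown in the example output, with VOLUME as the fifth column and STATUS as the last.

```shell
# Hypothetical helper: succeed when the postgresql resource reports the
# expected volume size and a Running status.
# $1 = expected volume size (for example, 100Gi)
# stdin = output of "kubectl get postgresql <name> -n <namespace>"
postgres_resize_applied() {
    awk -v want="$1" '
        NR == 2 { found = ($5 == want && $NF == "Running") }
        END { exit !found }
    '
}
```

It could be used as: kubectl get postgresql ${CHART_NAME} -n $ns | postgres_resize_applied 100Gi. A failure means either that the size has not been applied yet or that the status is something other than Running, such as SyncFailed.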

Prometheus PVC resize

Increase the PVC volume size associated with the prometheus-cray-sysmgmt-health-kube-p-prometheus cluster in the sysmgmt-health namespace. This example is based on what was needed for a system with more than 20 non-compute nodes (NCNs). The PVC size can only ever be increased.

Follow the Redeploying a Chart procedure with the following specifications:

  • Chart name: cray-sysmgmt-health

  • Base manifest name: platform

  • (ncn-mw#) When reaching the step to update the customizations, perform the following steps:

    Only follow these steps as part of the previously linked chart redeploy procedure.

    1. Edit the customizations by adding or updating spec.kubernetes.services.cray-sysmgmt-health.kube-prometheus-stack.prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage.

      yq write -i customizations.yaml  'spec.kubernetes.services.cray-sysmgmt-health.kube-prometheus-stack.prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage' '300Gi'
      
    2. Check that the customization file has been updated.

      yq read customizations.yaml  'spec.kubernetes.services.cray-sysmgmt-health.kube-prometheus-stack.prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage'
      

      Example output:

      300Gi
      
  • (ncn-mw#) When reaching the step to validate the redeployed chart, perform the following step:

    Only follow this step as part of the previously linked chart redeploy procedure.

    Verify that the increased volume size has been applied.

    watch "kubectl get pvc -n sysmgmt-health prometheus-cray-sysmgmt-health-kube-p-prometheus-db-prometheus-cray-sysmgmt-health-kube-p-prometheus-0"
    

    Example output:

    NAME                                                                                                     STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS           AGE
    prometheus-cray-sysmgmt-health-kube-p-prometheus-db-prometheus-cray-sysmgmt-health-kube-p-prometheus-0   Bound    pvc-bcb8f4f1-fb84-4b48-95c7-63508ef18962   200Gi      RWO            k8s-block-replicated   3d2h
    

    At this point the Prometheus cluster is healthy, but additional steps are required to complete the resize of the Prometheus PVCs.

  • Make sure to perform the entire linked procedure, including the step to save the updated customizations.
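Because a PVC can only grow, it may be worth guarding against a requested size that is not larger than the current one before editing the customizations. The following is a minimal sketch; the function name pvc_increase_ok is hypothetical, and it assumes both sizes are whole numbers with a Gi suffix.

```shell
# Hypothetical helper: succeed only when the requested PVC size ($2) is
# larger than the current size ($1). Both must be integers with a Gi suffix.
pvc_increase_ok() {
    local current="${1%Gi}" requested="${2%Gi}"
    [ "${requested}" -gt "${current}" ]
}
```

For example, pvc_increase_ok 200Gi 300Gi succeeds, while pvc_increase_ok 300Gi 200Gi fails.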

cray-hms-hmcollector pods are OOMKilled

Update resources associated with cray-hms-hmcollector in the services namespace. Trial and error may be needed to determine what is best for a given system at scale. See Adjust HM Collector Ingress Replicas and Resource Limits.

cray-cfs-api pods are OOMKilled

Increase the memory requests and limits associated with the cray-cfs-api deployment in the services namespace.

Follow the Redeploying a Chart procedure with the following specifications:

  • Chart name: cray-cfs-api

  • Base manifest name: sysmgmt

  • (ncn-mw#) When reaching the step to update the customizations, perform the following steps:

    Only follow these steps as part of the previously linked chart redeploy procedure.

    1. Edit the customizations by adding or updating spec.kubernetes.services.cray-cfs-api.cray-service.containers.cray-cfs-api.resources.

      yq4 -i '.spec.kubernetes.services.cray-cfs-api.cray-service.containers.cray-cfs-api.resources.requests.memory="200Mi"' customizations.yaml
      yq4 -i '.spec.kubernetes.services.cray-cfs-api.cray-service.containers.cray-cfs-api.resources.limits.memory="500Mi"' customizations.yaml
      
    2. Check that the customization file has been updated.

      • Check the memory request value.

        yq4 '.spec.kubernetes.services.cray-cfs-api.cray-service.containers.cray-cfs-api.resources.requests.memory' customizations.yaml
        

        Expected output:

        200Mi
        
      • Check the memory limit value.

        yq4 '.spec.kubernetes.services.cray-cfs-api.cray-service.containers.cray-cfs-api.resources.limits.memory' customizations.yaml
        

        Expected output:

        500Mi
        
  • (ncn-mw#) When reaching the step to validate the redeployed chart, perform the following steps:

    Only follow these steps as part of the previously linked chart redeploy procedure.

    1. Verify that the increased memory request and limit have been applied.

      kubectl get deployment -n services cray-cfs-api -o json | jq '.spec.template.spec.containers[0].resources'
      

      Example output:

      {
        "limits": {
          "cpu": "500m",
          "memory": "500Mi"
        },
        "requests": {
          "cpu": "150m",
          "memory": "200Mi"
        }
      }
      
    2. Run a CFS health check.

      /usr/local/bin/cmsdev test -q cfs
      

      For more details on this test, including known issues and other command line options, see Software Management Services health checks.

  • Make sure to perform the entire linked procedure, including the step to save the updated customizations.

References

To make changes that will not persist across installs or upgrades, see the following references. These procedures will also help to verify and eliminate any issues in the short term. As other resource customizations are needed, contact support to request the feature.