Restore Hardware State Manager (HSM) Postgres without an Existing Backup

This procedure repopulates HSM when no Postgres backup exists.

Prerequisites

  • Healthy System Layout Service (SLS). Recover SLS first if it was also affected.

  • Healthy HSM service.

    Verify that all 3 HSM Postgres replicas are up and running:

    ncn# kubectl -n services get pods -l cluster-name=cray-smd-postgres
    

    Example output:

    NAME                  READY   STATUS    RESTARTS   AGE
    cray-smd-postgres-0   3/3     Running   0          18d
    cray-smd-postgres-1   3/3     Running   0          18d
    cray-smd-postgres-2   3/3     Running   0          18d
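
The replica check above can also be scripted. The following is a minimal sketch, assuming the `kubectl get pods` output uses the column layout shown in the example output. The function name `check_smd_postgres` is illustrative and is not part of any shipped tooling.

```shell
#!/bin/sh
# check_smd_postgres (hypothetical helper): read `kubectl get pods` output on
# stdin and succeed only if exactly $1 pods (default 3) are listed and each one
# is Running with all of its containers ready.
check_smd_postgres() {
    awk -v want="${1:-3}" '
        $1 == "NAME" { next }                 # skip the header row
        NF == 0      { next }                 # skip blank lines
        {
            total++
            split($2, ready, "/")             # READY column, for example "3/3"
            if ($3 == "Running" && ready[1] == ready[2]) healthy++
        }
        END { exit !(total + 0 == want + 0 && healthy + 0 == want + 0) }
    '
}
```

With the function defined in the current shell, it could be used as follows:

    ncn# kubectl -n services get pods -l cluster-name=cray-smd-postgres | check_smd_postgres 3 && echo "replicas healthy"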
    

Procedure

  1. Re-run the HSM loader job.

    ncn# kubectl -n services get job cray-smd-init -o json | jq 'del(.spec.selector)' | jq 'del(.spec.template.metadata.labels."controller-uid")' | kubectl replace --force -f -
    

    Wait for the job to complete:

    ncn# kubectl wait -n services job cray-smd-init --for=condition=complete --timeout=5m
    
  2. Verify that the service is functional.

    ncn# cray hsm service ready
    

    Example output:

    code = 0
    message = "HSM is healthy"
    
  3. Get the number of node objects stored in HSM.

    ncn# cray hsm state components list --type node --format json | jq '.[].ID' | wc -l
    
  4. Restart MEDS and REDS.

    To repopulate HSM with components, restart MEDS and REDS so that they add their known RedfishEndpoints back into HSM. This also triggers HSM rediscovery, which repopulates components and hardware inventory.

    ncn# kubectl scale deployment cray-meds -n services --replicas=0
    ncn# kubectl scale deployment cray-meds -n services --replicas=1
    ncn# kubectl scale deployment cray-reds -n services --replicas=0
    ncn# kubectl scale deployment cray-reds -n services --replicas=1
    

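    Wait for the RedfishEndpoints table to be repopulated and for discovery to complete. In the example below, 100 endpoints have been added and none remain in the DiscoveryStarted state.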

    ncn# cray hsm inventory redfishEndpoints list --format json | jq '.[].ID' | wc -l
    100
    ncn# cray hsm inventory redfishEndpoints list --format json | grep -c "DiscoveryStarted"
    0
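
Rather than re-running the second command by hand, the wait can be expressed as a polling loop. This is a minimal sketch covering only the "no endpoints in DiscoveryStarted" condition. The names `wait_until_zero` and `discovery_in_progress`, the attempt count, and the delay are assumptions chosen for illustration.

```shell
#!/bin/sh
# wait_until_zero (hypothetical helper): run a command repeatedly until it
# prints 0 on stdout.
#   $1 = maximum number of attempts
#   $2 = seconds to sleep between attempts
#   remaining arguments = command that prints a count
# Returns 0 once the count is 0, or 1 if the attempts run out.
wait_until_zero() {
    attempts=$1
    delay=$2
    shift 2
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # grep -c exits non-zero when it counts 0 matches, so ignore the status
        # and look only at the printed count.
        count=$("$@") || true
        [ "$count" = "0" ] && return 0
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Hypothetical wrapper around the count command used in this step.
discovery_in_progress() {
    cray hsm inventory redfishEndpoints list --format json | grep -c "DiscoveryStarted"
}
```

For example, to poll every 10 seconds for up to 10 minutes:

    ncn# wait_until_zero 60 10 discovery_in_progress && echo "discovery complete"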
    
  5. Check for discovery errors.

    ncn# cray hsm inventory redfishEndpoints list --format json | grep LastDiscoveryStatus | grep -v -c "DiscoverOK"
    

    If any of the RedfishEndpoint entries have a LastDiscoveryStatus other than DiscoverOK after discovery has completed, refer to the Troubleshoot Issues with Redfish Endpoint Discovery procedure for guidance.
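
When the count above is non-zero, a per-status breakdown can help decide where to start troubleshooting. The sketch below makes the same assumption as the grep in this step, namely that each status appears in the JSON text as a `"LastDiscoveryStatus": "<value>"` pair. The function name `summarize_discovery_status` is illustrative.

```shell
#!/bin/sh
# summarize_discovery_status (hypothetical helper): read redfishEndpoints JSON
# on stdin and print one "<count> <status>" line per distinct
# LastDiscoveryStatus value, sorted by status name.
summarize_discovery_status() {
    grep -o '"LastDiscoveryStatus": *"[^"]*"' |
        sed 's/.*: *"\([^"]*\)"/\1/' |        # keep only the status value
        sort |
        uniq -c |
        awk '{ print $1, $2 }'                # normalize the uniq -c padding
}
```

With the function defined in the current shell, it could be used as follows:

    ncn# cray hsm inventory redfishEndpoints list --format json | summarize_discovery_status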

  6. Re-apply any component group or partition customizations.

    Any component groups or partitions that were created before the HSM Postgres data was lost must be re-entered manually.