This procedure repopulates the Hardware State Manager (HSM) when no Postgres backup exists.
Healthy System Layout Service (SLS). Recover SLS first if it was also affected.
Healthy HSM service.
Verify that all three HSM Postgres replicas are up and running:
ncn# kubectl -n services get pods -l cluster-name=cray-smd-postgres
NAME                  READY   STATUS    RESTARTS   AGE
cray-smd-postgres-0   3/3     Running   0          18d
cray-smd-postgres-1   3/3     Running   0          18d
cray-smd-postgres-2   3/3     Running   0          18d
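The readiness check above can also be scripted. The following is a minimal sketch, not part of the official procedure: `pods_all_ready` is a hypothetical helper that reads `kubectl get pods --no-headers` output on standard input and succeeds only if exactly the expected number of pods are listed, each in Running status with all containers ready.

```shell
# Hypothetical helper (not part of CSM): succeeds only when stdin lists exactly
# $1 pods (default 3), each in Running status with all containers ready.
# Expects the column layout of `kubectl get pods --no-headers`:
#   NAME READY STATUS RESTARTS AGE
pods_all_ready() {
    local expected="${1:-3}"
    awk -v expected="$expected" '
        NF == 0 { next }                 # skip blank lines
        {
            total++
            split($2, r, "/")            # READY column, e.g. "3/3"
            if (r[1] == r[2] && $3 == "Running") ok++
        }
        END { exit !(total == expected && ok == expected) }
    '
}
```

It could be used as `kubectl -n services get pods -l cluster-name=cray-smd-postgres --no-headers | pods_all_ready 3 && echo "Postgres replicas are ready"`.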
Re-run the HSM loader job.
ncn# kubectl -n services get job cray-smd-init -o json | jq 'del(.spec.selector)' | jq 'del(.spec.template.metadata.labels."controller-uid")' | kubectl replace --force -f -
Wait for the job to complete:
ncn# kubectl wait -n services job cray-smd-init --for=condition=complete --timeout=5m
Verify that the service is functional.
ncn# cray hsm service ready
code = 0
message = "HSM is healthy"
Get the number of node objects stored in HSM.
ncn# cray hsm state components list --type node --format json | jq '.Components[].ID' | wc -l
0
Restart MEDS and REDS.
To repopulate HSM with components, restart MEDS and REDS so that they add their known RedfishEndpoints back into HSM. This also kicks off HSM rediscovery, which repopulates components and hardware inventory.
ncn# kubectl scale deployment cray-meds -n services --replicas=0
ncn# kubectl scale deployment cray-meds -n services --replicas=1
ncn# kubectl scale deployment cray-reds -n services --replicas=0
ncn# kubectl scale deployment cray-reds -n services --replicas=1
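The four scale commands above follow the same pattern for each deployment, so they can be wrapped in a loop. This is a minimal sketch: `restart_deployments` is a hypothetical helper, and `KUBECTL` is an assumed override variable (defaulting to `kubectl`) that allows a dry run.

```shell
# Hypothetical helper (not part of CSM): scale each named deployment in the
# services namespace to 0 replicas and then back to 1, as in the steps above.
# KUBECTL is an assumed override; set KUBECTL="echo kubectl" to print the
# commands instead of running them.
restart_deployments() {
    local kubectl_cmd="${KUBECTL:-kubectl}"
    local dep
    for dep in "$@"; do
        # $kubectl_cmd is intentionally unquoted so "echo kubectl" splits into words
        $kubectl_cmd scale deployment "$dep" -n services --replicas=0 || return 1
        $kubectl_cmd scale deployment "$dep" -n services --replicas=1 || return 1
    done
}
```

For example, `KUBECTL="echo kubectl" restart_deployments cray-meds cray-reds` prints the four commands without running them, and `restart_deployments cray-meds cray-reds` runs them.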
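Wait for the RedfishEndpoints table to be repopulated and for discovery to complete. The first command reports how many endpoints HSM currently knows about; the second reports how many are still in the DiscoveryStarted state and should eventually return 0.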
ncn# cray hsm inventory redfishEndpoints list --format json | jq '.RedfishEndpoints[].ID' | wc -l
100
ncn# cray hsm inventory redfishEndpoints list --format json | grep -c "DiscoveryStarted"
0
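Because discovery can take a while, the DiscoveryStarted check above is typically repeated until it reports 0. The following is a minimal polling sketch: `wait_until_zero` is a hypothetical helper, and the default interval and attempt count are arbitrary assumptions.

```shell
# Hypothetical helper (not part of CSM): re-run a command that prints a count
# until it prints 0, sleeping between attempts.
#   $1 = command string to evaluate
#   $2 = seconds between attempts (assumed default: 30)
#   $3 = maximum attempts (assumed default: 40)
# Returns 1 if the count never reaches 0.
wait_until_zero() {
    local cmd="$1" interval="${2:-30}" attempts="${3:-40}"
    local i=1 count
    while [ "$i" -le "$attempts" ]; do
        # grep -c exits non-zero when it counts 0 matches, so ignore the status
        count=$(eval "$cmd") || true
        if [ "$count" = "0" ]; then
            return 0
        fi
        echo "attempt $i: count is ${count:-unknown}, waiting ${interval}s" >&2
        sleep "$interval"
        i=$((i + 1))
    done
    return 1
}
```

It could be called as `wait_until_zero 'cray hsm inventory redfishEndpoints list --format json | grep -c "DiscoveryStarted"'`. Note that an empty or failed response also yields a count of 0, so confirm the endpoint count first.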
Check for Discovery Errors.
ncn# cray hsm inventory redfishEndpoints list --format json | grep LastDiscoveryStatus | grep -v -c "DiscoverOK"
If any of the RedfishEndpoint entries have a LastDiscoveryStatus other than DiscoverOK after discovery has completed, refer to the Troubleshoot Issues with Redfish Endpoint Discovery procedure for guidance.
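To see which statuses are present, rather than only how many endpoints are not DiscoverOK, the same output can be tallied per status value. This is a minimal sketch, assuming the pretty-printed, one-key-per-line JSON that the grep commands above also rely on; `summarize_discovery_status` is a hypothetical helper.

```shell
# Hypothetical helper (not part of CSM): read redfishEndpoints JSON on stdin and
# print one "STATUS COUNT" line per distinct LastDiscoveryStatus value.
# Assumes each key/value pair is on its own line, e.g.
#   "LastDiscoveryStatus": "DiscoverOK",
summarize_discovery_status() {
    awk -F'"' '
        $2 == "LastDiscoveryStatus" { counts[$4]++ }    # $4 is the status value
        END { for (s in counts) print s, counts[s] }
    ' | sort
}
```

It could be used as `cray hsm inventory redfishEndpoints list --format json | summarize_discovery_status`.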
Re-apply any component group or partition customizations.
Any component groups or partitions created before HSM’s Postgres information was lost will need to be manually re-entered.