This procedure can be used to restore the HSM Postgres database from a previously taken backup. This can be a manual backup created by the Create a Backup of the HSM Postgres Database procedure, or an automatic backup created by the cray-smd-postgresql-db-backup Kubernetes cronjob.
Healthy System Layout Service (SLS). If SLS was also affected, recover it first.
Healthy HSM Postgres cluster.
Run patronictl list on the HSM Postgres cluster to determine its current state. A healthy cluster will look similar to the following:
ncn# kubectl exec cray-smd-postgres-0 -n services -c postgres -it -- patronictl list
Example output:
+ Cluster: cray-smd-postgres (6975238790569058381) ---+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+---------------------+------------+--------+---------+----+-----------+
| cray-smd-postgres-0 | 10.44.0.40 | Leader | running | 1 | |
| cray-smd-postgres-1 | 10.36.0.37 | | running | 1 | 0 |
| cray-smd-postgres-2 | 10.42.0.42 | | running | 1 | 0 |
+---------------------+------------+--------+---------+----+-----------+
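The health criteria shown above (exactly one Leader, every member in the running state) can also be checked mechanically. The following is a minimal sketch, not part of the official procedure: the function name patroni_cluster_healthy is hypothetical, and it assumes the pipe-delimited table layout shown in the example output.

```shell
# Hypothetical helper (not part of CSM): reads `patronictl list` table output
# on stdin and succeeds only if exactly one member is the Leader and every
# member reports the "running" state.
patroni_cluster_healthy() {
    awk -F'|' '
        function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
        /^\|/ {
            if (!role_col) {
                # Header row: locate the Role and State columns by name.
                for (i = 1; i <= NF; i++) {
                    if (trim($i) == "Role")  role_col = i
                    if (trim($i) == "State") state_col = i
                }
                next
            }
            members++
            if (trim($role_col) == "Leader")   leaders++
            if (trim($state_col) != "running") not_running++
        }
        END { exit !(members > 0 && leaders == 1 && not_running == 0) }
    '
}

# Usage on a live system (assumption: run from an NCN with kubectl access).
# The -it flags are omitted because the output is piped, not shown on a TTY.
#   kubectl exec cray-smd-postgres-0 -n services -c postgres -- patronictl list \
#       | patroni_cluster_healthy && echo "cluster healthy" || echo "cluster NOT healthy"
```

The helper only inspects the Role and State columns; replication lag is not evaluated and should still be reviewed by eye.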
A previously taken backup of the HSM Postgres cluster, either manual or automatic.
Check for any available automatic HSM Postgres backups:
ncn# cray artifacts list postgres-backup --format json | jq -r '.artifacts[].Key | select(contains("smd"))'
Example output:
cray-smd-postgres-2021-07-11T23:10:08.manifest
cray-smd-postgres-2021-07-11T23:10:08.psql
Retrieve a previously taken HSM Postgres backup. This can be either a manual HSM backup or an automatic Postgres backup stored in the postgres-backup S3 bucket.
From a previous manual backup:
Copy over the folder or tarball containing the Postgres backup to be restored. If it is a tarball, extract it.
Set the environment variable POSTGRES_SQL_FILE to point to the .psql file in the backup folder:
ncn# export POSTGRES_SQL_FILE=/root/cray-smd-postgres-backup_2021-07-07_16-39-44/cray-smd-postgres-backup_2021-07-07_16-39-44.psql
Set the environment variable POSTGRES_SECRET_MANIFEST to point to the .manifest file in the backup folder:
ncn# export POSTGRES_SECRET_MANIFEST=/root/cray-smd-postgres-backup_2021-07-07_16-39-44/cray-smd-postgres-backup_2021-07-07_16-39-44.manifest
From a previous automatic Postgres backup:
Check for available backups.
ncn# cray artifacts list postgres-backup --format json | jq -r '.artifacts[].Key | select(contains("smd"))'
Example output:
cray-smd-postgres-2021-07-11T23:10:08.manifest
cray-smd-postgres-2021-07-11T23:10:08.psql
Set the following environment variables for the name of the files in the backup:
ncn# export POSTGRES_SECRET_MANIFEST_NAME=cray-smd-postgres-2021-07-11T23:10:08.manifest
ncn# export POSTGRES_SQL_FILE_NAME=cray-smd-postgres-2021-07-11T23:10:08.psql
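When several automatic backups exist, the most recent one can be selected instead of copying the names by hand. This is a minimal sketch, assuming the key names embed an ISO 8601 timestamp as in the example output above, so that lexical order matches chronological order. The function name latest_backup_key is hypothetical.

```shell
# Hypothetical helper: reads backup object keys on stdin (one per line) and
# prints the newest HSM (smd) key ending in the given extension.
# Assumption: keys look like cray-smd-postgres-<ISO 8601 timestamp>.<ext>,
# so a plain lexical sort is also a chronological sort.
latest_backup_key() {
    local ext=$1
    grep -F "smd" | grep -E "\.${ext}\$" | LC_ALL=C sort | tail -n 1
}

# Usage on a live system (same listing command as in the step above):
#   keys=$(cray artifacts list postgres-backup --format json | jq -r '.artifacts[].Key')
#   export POSTGRES_SQL_FILE_NAME=$(printf '%s\n' "$keys" | latest_backup_key psql)
#   export POSTGRES_SECRET_MANIFEST_NAME=$(printf '%s\n' "$keys" | latest_backup_key manifest)
```

Confirm that the two selected names carry the same timestamp before downloading, because the newest .psql and the newest .manifest are chosen independently.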
Download the .psql file for the Postgres backup.
ncn# cray artifacts get postgres-backup "$POSTGRES_SQL_FILE_NAME" "$POSTGRES_SQL_FILE_NAME"
Download the .manifest file for the HSM backup.
ncn# cray artifacts get postgres-backup "$POSTGRES_SECRET_MANIFEST_NAME" "$POSTGRES_SECRET_MANIFEST_NAME"
Set up environment variables pointing to the full paths of the .psql and .manifest files.
ncn# export POSTGRES_SQL_FILE=$(realpath "$POSTGRES_SQL_FILE_NAME")
ncn# export POSTGRES_SECRET_MANIFEST=$(realpath "$POSTGRES_SECRET_MANIFEST_NAME")
Verify that the POSTGRES_SQL_FILE and POSTGRES_SECRET_MANIFEST environment variables are set correctly.
ncn# echo "$POSTGRES_SQL_FILE"
/root/cray-smd-postgres-backup_2021-07-07_16-39-44/cray-smd-postgres-backup_2021-07-07_16-39-44.psql
ncn# echo "$POSTGRES_SECRET_MANIFEST"
/root/cray-smd-postgres-backup_2021-07-07_16-39-44/cray-smd-postgres-backup_2021-07-07_16-39-44.manifest
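Beyond echoing the variables, it can help to confirm that both files actually exist and belong together before the database is torn down. The following is a sketch only: verify_backup_files is a hypothetical helper, and the same-base-name check assumes the naming pattern seen in the examples above.

```shell
# Hypothetical helper: succeeds only if both backup files exist, are
# non-empty, and share the same base name (that is, the same backup).
verify_backup_files() {
    local sql_file=$1 manifest_file=$2 f
    for f in "$sql_file" "$manifest_file"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty file: $f" >&2
            return 1
        fi
    done
    # Assumption: the two files differ only in their extension.
    if [ "${sql_file%.psql}" != "${manifest_file%.manifest}" ]; then
        echo "backup files do not appear to belong to the same backup" >&2
        return 1
    fi
}

# Usage:
#   verify_backup_files "$POSTGRES_SQL_FILE" "$POSTGRES_SECRET_MANIFEST" && echo "backup files OK"
```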
Scale HSM to 0.
ncn# CLIENT=cray-smd
ncn# POSTGRESQL=cray-smd-postgres
ncn# NAMESPACE=services
ncn# kubectl scale deployment ${CLIENT} -n ${NAMESPACE} --replicas=0
deployment.apps/cray-smd scaled
ncn# while [ $(kubectl get pods -n ${NAMESPACE} -l app.kubernetes.io/name="${CLIENT}" | grep -v NAME | wc -l) != 0 ] ; do echo " waiting for pods to terminate"; sleep 2; done
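The polling loop above waits indefinitely if a pod never terminates. A minimal sketch of the same polling pattern with a timeout follows; wait_until and pods_gone are hypothetical helper names and are not part of CSM.

```shell
# Hypothetical helper: re-runs a command every INTERVAL seconds until it
# succeeds, giving up after roughly TIMEOUT seconds.
# Usage: wait_until TIMEOUT INTERVAL COMMAND [ARGS...]
wait_until() {
    local timeout=$1 interval=$2 waited=0
    shift 2
    until "$@"; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "timed out after ${timeout}s waiting for: $*" >&2
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
}

# Usage on a live system (assumption: CLIENT and NAMESPACE are set as above):
#   pods_gone() {
#       [ -z "$(kubectl get pods -n "${NAMESPACE}" -l app.kubernetes.io/name="${CLIENT}" --no-headers 2>/dev/null)" ]
#   }
#   wait_until 300 2 pods_gone || echo "pods did not terminate in time"
```

The same helper can replace the other wait loops in this procedure by swapping in a different condition function.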
Re-run the HSM loader job.
ncn# kubectl -n services get job cray-smd-init -o json | jq 'del(.spec.selector)' | jq 'del(.spec.template.metadata.labels."controller-uid")' | kubectl replace --force -f -
Wait for the job to complete:
ncn# kubectl wait -n services job cray-smd-init --for=condition=complete --timeout=5m
Determine which Postgres member is the leader.
ncn# kubectl exec "${POSTGRESQL}-0" -n ${NAMESPACE} -c postgres -it -- patronictl list
Example output:
+-------------------+---------------------+------------+--------+---------+----+-----------+
| Cluster | Member | Host | Role | State | TL | Lag in MB |
+-------------------+---------------------+------------+--------+---------+----+-----------+
| cray-smd-postgres | cray-smd-postgres-0 | 10.42.0.25 | Leader | running | 1 | |
| cray-smd-postgres | cray-smd-postgres-1 | 10.44.0.34 | | running | | 0 |
| cray-smd-postgres | cray-smd-postgres-2 | 10.36.0.44 | | running | | 0 |
+-------------------+---------------------+------------+--------+---------+----+-----------+
Create a variable for the identified leader:
ncn# POSTGRES_LEADER=cray-smd-postgres-0
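Rather than reading the leader from the table by eye, the member name can be extracted from the same output. This is a sketch under the assumption that the output is the pipe-delimited table shown above; patroni_leader is a hypothetical function name.

```shell
# Hypothetical helper: reads `patronictl list` table output on stdin and
# prints the name of the member whose Role column is "Leader".
patroni_leader() {
    awk -F'|' '
        function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
        /^\|/ {
            if (!member_col) {
                # Header row: locate the Member and Role columns by name, so
                # table layouts with or without a Cluster column both work.
                for (i = 1; i <= NF; i++) {
                    if (trim($i) == "Member") member_col = i
                    if (trim($i) == "Role")   role_col = i
                }
                next
            }
            if (trim($role_col) == "Leader") { print trim($member_col); exit }
        }
    '
}

# Usage on a live system (the -it flags are omitted because the output is piped):
#   POSTGRES_LEADER=$(kubectl exec "${POSTGRESQL}-0" -n "${NAMESPACE}" -c postgres -- patronictl list | patroni_leader)
#   echo "$POSTGRES_LEADER"
```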
Determine the database schema version of the currently running HSM database, and then verify that it matches the database schema version from the Postgres backup:
Database schema of the currently running HSM Postgres instance.
ncn# kubectl exec $POSTGRES_LEADER -n services -c postgres -it -- bash -c "psql -U hmsdsuser -d hmsds -c 'SELECT * FROM system'"
Example output:
id | schema_version | system_info
----+----------------+-------------
0 | 17 | {}
(1 row)
The output above shows the database schema is at version 17.
Database schema version from the Postgres backup:
ncn# grep -A 2 "COPY public.system" "$POSTGRES_SQL_FILE"
COPY public.system (id, schema_version, dirty) FROM stdin;
0 17 f
\.
The output above shows the database schema is at version 17.
If the database schema versions match, proceed to the next step. Otherwise, the Postgres backup is not applicable to the currently running instance of HSM.
WARNING: If the database schema versions do not match, the deployed version of HSM must be upgraded or downgraded to a version with a compatible database schema version. Ideally, this is the same version of HSM that was used to create the Postgres backup.
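The comparison of the two schema versions can be scripted so that a mismatch is not overlooked. This is a minimal sketch: backup_schema_version is a hypothetical helper, and it assumes the dump contains a COPY public.system block laid out as in the example above, with the schema version in the second column.

```shell
# Hypothetical helper: prints the schema_version stored in a Postgres dump.
# Assumption: the row following "COPY public.system (id, schema_version, ...)"
# holds the version in its second whitespace-separated column.
backup_schema_version() {
    awk '/^COPY public\.system / { getline; print $2; exit }' "$1"
}

# Usage on a live system (assumption: POSTGRES_LEADER and POSTGRES_SQL_FILE are set;
# psql -t -A prints only the bare value):
#   live=$(kubectl exec "$POSTGRES_LEADER" -n services -c postgres -- \
#       psql -U hmsdsuser -d hmsds -t -A -c 'SELECT schema_version FROM system')
#   backup=$(backup_schema_version "$POSTGRES_SQL_FILE")
#   [ "$live" = "$backup" ] && echo "schema versions match ($live)" \
#       || echo "MISMATCH: live=$live backup=$backup"
```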
Delete and re-create the postgresql resource (which includes the PVCs).
ncn# CLIENT=cray-smd
ncn# POSTGRESQL=cray-smd-postgres
ncn# NAMESPACE=services
ncn# kubectl get postgresql ${POSTGRESQL} -n ${NAMESPACE} -o json | jq 'del(.spec.selector)' | jq 'del(.spec.template.metadata.labels."controller-uid")' | jq 'del(.status)' > postgres-cr.yaml
ncn# kubectl delete -f postgres-cr.yaml
postgresql.acid.zalan.do "cray-smd-postgres" deleted
ncn# while [ $(kubectl get pods -l "application=spilo,cluster-name=${POSTGRESQL}" -n ${NAMESPACE} | grep -v NAME | wc -l) != 0 ] ; do echo " waiting for pods to terminate"; sleep 2; done
ncn# kubectl create -f postgres-cr.yaml
postgresql.acid.zalan.do/cray-smd-postgres created
ncn# while [ $(kubectl get pods -l "application=spilo,cluster-name=${POSTGRESQL}" -n ${NAMESPACE} | grep -v NAME | wc -l) != 3 ] ; do echo " waiting for pods to start running"; sleep 2; done
Determine which Postgres member is the new leader.
ncn# kubectl exec "${POSTGRESQL}-0" -n ${NAMESPACE} -c postgres -it -- patronictl list
Example output:
+-------------------+---------------------+------------+--------+---------+----+-----------+
| Cluster | Member | Host | Role | State | TL | Lag in MB |
+-------------------+---------------------+------------+--------+---------+----+-----------+
| cray-smd-postgres | cray-smd-postgres-0 | 10.42.0.25 | Leader | running | 1 | |
| cray-smd-postgres | cray-smd-postgres-1 | 10.44.0.34 | | running | | 0 |
| cray-smd-postgres | cray-smd-postgres-2 | 10.36.0.44 | | running | | 0 |
+-------------------+---------------------+------------+--------+---------+----+-----------+
Set a variable for the new leader:
ncn# POSTGRES_LEADER=cray-smd-postgres-0
Copy the dump retrieved above to the Postgres leader pod and restore the data.
If the dump exists in a different location, adjust this example as needed.
ncn# cat ${POSTGRES_SQL_FILE} | kubectl exec ${POSTGRES_LEADER} -c postgres -n ${NAMESPACE} -it -- psql -U postgres
Clear out-of-sync data from tables in Postgres.
The backup will have restored tables that may contain out-of-date information. To refresh this data, it must first be deleted.
Delete the entries in the Ethernet Interfaces table. These will automatically get repopulated during rediscovery.
ncn# kubectl exec $POSTGRES_LEADER -n services -c postgres -it -- bash -c "psql -U hmsdsuser -d hmsds -c 'DELETE FROM comp_eth_interfaces'"
Restore the secrets.
Once the dump has been restored onto the newly built Postgres cluster, the Kubernetes secrets must match the credentials stored in the Postgres cluster. Otherwise, the service will be unable to authenticate to the database and will experience readiness and liveness probe failures.
With secrets manifest from an existing backup:
If the Postgres secrets were auto-backed up, then re-create the secrets in Kubernetes.
Delete and re-create the four cray-smd-postgres secrets using the manifest that POSTGRES_SECRET_MANIFEST was set to in step 1 above.
ncn# kubectl delete secret postgres.cray-smd-postgres.credentials service-account.cray-smd-postgres.credentials hmsdsuser.cray-smd-postgres.credentials standby.cray-smd-postgres.credentials -n ${NAMESPACE}
ncn# kubectl apply -f ${POSTGRES_SECRET_MANIFEST}
Without the previous secrets from a backup:
If the Postgres secrets were not backed up, then update the secrets in Postgres.
Determine which Postgres member is the leader.
ncn# kubectl exec "${POSTGRESQL}-0" -n ${NAMESPACE} -c postgres -it -- patronictl list
Example output:
+-------------------+---------------------+------------+--------+---------+----+-----------+
| Cluster | Member | Host | Role | State | TL | Lag in MB |
+-------------------+---------------------+------------+--------+---------+----+-----------+
| cray-smd-postgres | cray-smd-postgres-0 | 10.42.0.25 | Leader | running | 1 | |
| cray-smd-postgres | cray-smd-postgres-1 | 10.44.0.34 | | running | | 0 |
| cray-smd-postgres | cray-smd-postgres-2 | 10.36.0.44 | | running | | 0 |
+-------------------+---------------------+------------+--------+---------+----+-----------+
Set a variable for the leader:
ncn# POSTGRES_LEADER=cray-smd-postgres-0
Determine what secrets are associated with the Postgres credentials.
ncn# kubectl get secrets -n ${NAMESPACE} | grep "${POSTGRESQL}.credentials"
Example output:
services hmsdsuser.cray-smd-postgres.credentials Opaque 2 31m
services postgres.cray-smd-postgres.credentials Opaque 2 31m
services service-account.cray-smd-postgres.credentials Opaque 2 31m
services standby.cray-smd-postgres.credentials Opaque 2 31m
For each secret above, get the username and password from Kubernetes and update the Postgres database with this information.
For example (hmsdsuser.cray-smd-postgres.credentials):
ncn# kubectl get secret hmsdsuser.cray-smd-postgres.credentials -n ${NAMESPACE} -ojsonpath='{.data.username}' | base64 -d
ncn# kubectl get secret hmsdsuser.cray-smd-postgres.credentials -n ${NAMESPACE} -ojsonpath='{.data.password}' | base64 -d
Exec into the leader pod to reset the user’s password:
ncn# kubectl exec ${POSTGRES_LEADER} -n ${NAMESPACE} -c postgres -it -- bash
root@cray-smd-postgres-0:/home/postgres# /usr/bin/psql postgres postgres
postgres=# ALTER USER hmsdsuser WITH PASSWORD 'ABCXYZ';
ALTER ROLE
postgres=#
Continue the above process until all ${POSTGRESQL}.credentials secrets have been updated in the database.
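Typing each password into psql by hand is error-prone, especially when a password contains a quote character. The following sketch generates the ALTER USER statement for a given user and password; alter_user_sql is a hypothetical helper name, and the commented loop around it is an untested illustration of how it might be driven from the secrets listed above.

```shell
# Hypothetical helper: prints an ALTER USER statement for the given user and
# password. The user name is double-quoted as an SQL identifier and any single
# quote in the password is doubled, so special characters pass through intact.
alter_user_sql() {
    local user password
    user=$(printf '%s' "$1" | sed 's/"/""/g')
    password=$(printf '%s' "$2" | sed "s/'/''/g")
    printf 'ALTER USER "%s" WITH PASSWORD '\''%s'\'';\n' "$user" "$password"
}

# Possible usage on a live system (assumption: NAMESPACE, POSTGRESQL, and
# POSTGRES_LEADER are set as in the steps above):
#   for secret in $(kubectl get secrets -n "${NAMESPACE}" -o name | grep "${POSTGRESQL}.credentials"); do
#       u=$(kubectl get "$secret" -n "${NAMESPACE}" -o jsonpath='{.data.username}' | base64 -d)
#       p=$(kubectl get "$secret" -n "${NAMESPACE}" -o jsonpath='{.data.password}' | base64 -d)
#       alter_user_sql "$u" "$p"
#   done | kubectl exec -i "${POSTGRES_LEADER}" -n "${NAMESPACE}" -c postgres -- psql -U postgres
```

Review the generated statements before piping them into psql, because they contain the passwords in clear text.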
Restart the Postgres cluster.
ncn# kubectl delete pod "${POSTGRESQL}-0" "${POSTGRESQL}-1" "${POSTGRESQL}-2" -n ${NAMESPACE}
ncn# while [ $(kubectl get postgresql ${POSTGRESQL} -n ${NAMESPACE} -o json | jq -r '.status.PostgresClusterStatus') != "Running" ]; do echo "waiting for ${POSTGRESQL} to start running"; sleep 2; done
Scale the client service back to 3.
ncn# kubectl scale deployment ${CLIENT} -n ${NAMESPACE} --replicas=3
ncn# kubectl -n ${NAMESPACE} rollout status deployment ${CLIENT}
Verify that the service is functional.
ncn# cray hsm service ready list
Example output:
code = 0
message = "HSM is healthy"
Get the number of node objects stored in HSM:
ncn# cray hsm state components list --type node --format json | jq .Components[].ID | wc -l
Resync the component state and inventory.
After restoring HSM’s Postgres from a backup, some of the transient data, such as component state and hardware inventory, may be out of sync with reality. Resyncing this data involves kicking off an HSM rediscovery.
ncn# endpoints=$(cray hsm inventory redfishEndpoints list --format json | jq -r '.[]|.[]|.ID')
ncn# for e in $endpoints; do cray hsm inventory discover create --xnames ${e}; done
Wait for discovery to complete. Discovery is complete when no redfishEndpoints remain in the DiscoveryStarted state, at which point the following command returns a value of 0.
ncn# cray hsm inventory redfishEndpoints list --format json | grep -c "DiscoveryStarted"
Check for discovery errors.
ncn# cray hsm inventory redfishEndpoints list --format json | grep LastDiscoveryStatus | grep -v -c "DiscoverOK"
If any of the RedfishEndpoint entries have a LastDiscoveryStatus other than DiscoverOK after discovery has completed, refer to the Troubleshoot Issues with Redfish Endpoint Discovery procedure for guidance.
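The two grep checks above each report a single count. A per-status summary makes it easier to see both in-progress and failed discoveries at once. This is a sketch only: summarize_discovery_status is a hypothetical helper, and it assumes the JSON output contains "LastDiscoveryStatus": "<value>" pairs, as implied by the grep commands above.

```shell
# Hypothetical helper: reads the redfishEndpoints JSON on stdin and prints
# one "<status>=<count>" line per distinct LastDiscoveryStatus value.
summarize_discovery_status() {
    grep -o '"LastDiscoveryStatus": *"[^"]*"' \
        | sed 's/.*: *"\([^"]*\)"/\1/' \
        | sort | uniq -c | awk '{ print $2 "=" $1 }'
}

# Usage on a live system:
#   cray hsm inventory redfishEndpoints list --format json | summarize_discovery_status
```

Discovery is finished when no DiscoveryStarted line appears, and it is clean when DiscoverOK is the only status listed.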
Perform this step only if the system has Intel management NCNs; skip it for HPE or Gigabyte management NCNs. Due to known firmware issues, Intel BMCs do not report the MAC addresses of the management NICs via Redfish. As a result, when the BMC is discovered after restoring from a Postgres backup, the management NIC MACs in HSM will have an empty component ID. The following script corrects any Ethernet Interfaces for an Intel management NCN without a component ID.
ncn# \
UNKNOWN_NCN_MAC_ADDRESSES=$(cray hsm inventory ethernetInterfaces list --component-id "" --format json | jq '.[] | select(.Description == "- kea") | .MACAddress' -r)
for UNKNOWN_MAC_ADDRESS in $UNKNOWN_NCN_MAC_ADDRESSES; do
    XNAME=$(cray bss bootparameters list --format json | jq --arg MAC "${UNKNOWN_MAC_ADDRESS}" '.[] | select(.params != null) | select(.params | test($MAC)) | .hosts[]' -r)
    # Count the non-empty lines in XNAME; exactly one xname must match this MAC address.
    MATCH_COUNT=$(grep -c . <<< "${XNAME}")
    if [[ ${MATCH_COUNT} -ne 1 ]]; then
        echo "MAC address ${UNKNOWN_MAC_ADDRESS}: unexpected number of matches found. Expected 1 match, but found: ${MATCH_COUNT}"
        continue
    fi
    echo "MAC: ${UNKNOWN_MAC_ADDRESS} is ${XNAME}"
    EI_ID=$(echo "$UNKNOWN_MAC_ADDRESS" | sed 's/://g')
    echo "Updating ${EI_ID} in HSM EthernetInterfaces with component ID ${XNAME}"
    cray hsm inventory ethernetInterfaces update "${EI_ID}" --component-id "${XNAME}"
done