Spire Service Recovery

The following covers redeploying the Spire service and restoring the data.

Prerequisites

The system is fully installed and has transitioned off of the LiveCD.
All activities required for site maintenance are complete.
A backup or export of the data already exists.
The latest CSM documentation has been installed on the master nodes. See Check for Latest Documentation.
The Cray CLI has been configured on the node where the procedure is being performed. See Configure the Cray CLI.

Service recovery for Spire

(ncn-mw#) Verify that a backup of the Spire Postgres data exists.

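Verify that a completed backup exists. A completed backup has both a .manifest file and a .psql file with the same timestamp, as in the example output.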

cray artifacts list postgres-backup --format json | jq -r '.artifacts[].Key | select(contains("spire"))'

Example output:

spire-postgres-2022-09-14T03:10:04.manifest
spire-postgres-2022-09-14T03:10:04.psql
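The check above can also be scripted. The following is a minimal sketch that assumes each backup is stored as a <prefix>.manifest and <prefix>.psql pair, as in the example output, and that the embedded ISO-8601 timestamps sort lexically in chronological order. The function name latest_complete_spire_backup is hypothetical and is not part of CSM.

```shell
# Hypothetical helper (not part of CSM): read backup object keys, one per line,
# on stdin and print the newest prefix that has BOTH a .manifest and a .psql key.
# Assumes the prefixes embed an ISO-8601 timestamp, so lexical sort == chronological.
latest_complete_spire_backup() {
    keys="$(cat)"
    latest=""
    # Candidate prefixes are the .psql keys with the suffix stripped, oldest first.
    while read -r prefix; do
        [ -n "$prefix" ] || continue
        # Keep the candidate only if the matching .manifest key is also present.
        if printf '%s\n' "$keys" | grep -qxF "${prefix}.manifest"; then
            latest="$prefix"
        fi
    done <<EOF
$(printf '%s\n' "$keys" | sed -n 's/\.psql$//p' | sort)
EOF
    # Exit non-zero when no complete backup was found.
    [ -n "$latest" ] || return 1
    printf '%s\n' "$latest"
}
```

Example usage, piping in the same listing command used above:

cray artifacts list postgres-backup --format json | jq -r '.artifacts[].Key | select(contains("spire"))' | latest_complete_spire_backup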

(ncn-mw#) Uninstall the chart and wait for the resources to terminate.

Note the version of the chart that is currently deployed.

helm history -n spire spire

Example output:

REVISION    UPDATED                     STATUS      CHART       APP VERSION DESCRIPTION
1           Tue Aug  2 22:14:31 2022    deployed    spire-2.6.0 0.12.2      Install complete
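To record the deployed chart version instead of reading it off the screen, the table can be parsed. This is a sketch under the assumption that the default helm history table layout is used; because the UPDATED column contains spaces, the CHART value is located as the field immediately after the STATUS value "deployed" rather than by a fixed column number. The function name deployed_chart is hypothetical.

```shell
# Hypothetical helper (not part of CSM): read "helm history" table output on stdin
# and print the CHART value of the revision whose STATUS is "deployed".
# The UPDATED column contains spaces, so CHART is found relative to STATUS.
deployed_chart() {
    awk 'NR > 1 { for (i = 1; i < NF; i++) if ($i == "deployed") { print $(i + 1); break } }'
}
```

Example usage:

SPIRE_CHART="$(helm history -n spire spire | deployed_chart)"
echo "Deployed chart: ${SPIRE_CHART}"

If structured parsing is preferred, helm history also accepts -o json, which avoids depending on the table layout.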

Uninstall the chart.

helm uninstall -n spire spire

Example output:

release "spire" uninstalled

Wait for the resources to terminate, delete the PVCs, and clean up spire-agent before reinstalling the chart.

Verify that no Spire pods are running.

watch "kubectl get pods -n spire"

Example output:

No resources found in spire namespace.
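The watch command above requires visually confirming that the pods are gone. A non-interactive alternative is to poll until a command prints nothing. This is a minimal sketch assuming that kubectl writes nothing to stdout when no pods remain (the "No resources found" message goes to stderr, which the helper discards). The function name wait_for_empty and its timeout and interval arguments are hypothetical.

```shell
# Hypothetical helper (not part of CSM): re-run a command every INTERVAL seconds
# until it prints nothing on stdout, giving up after TIMEOUT seconds.
# Usage: wait_for_empty TIMEOUT INTERVAL command [args...]
wait_for_empty() {
    timeout_s="$1"
    interval_s="$2"
    shift 2
    elapsed=0
    # stderr is discarded so informational messages do not count as output.
    while [ -n "$("$@" 2>/dev/null)" ]; do
        if [ "$elapsed" -ge "$timeout_s" ]; then
            echo "Timed out after ${timeout_s}s waiting for: $*" >&2
            return 1
        fi
        sleep "$interval_s"
        elapsed=$((elapsed + interval_s))
    done
    return 0
}
```

Example usage (the 600-second timeout and 10-second interval are arbitrary example values):

wait_for_empty 600 10 kubectl get pods -n spire --no-headers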

Delete the Spire PVCs.

kubectl get pvc -n spire | grep spire-data-spire-server | awk '{print $1}' | xargs kubectl delete -n spire pvc

Example output:

persistentvolumeclaim "spire-data-spire-server-0" deleted
persistentvolumeclaim "spire-data-spire-server-1" deleted
persistentvolumeclaim "spire-data-spire-server-2" deleted
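The grep in the command above matches the pattern anywhere on the line. A slightly stricter sketch matches only the NAME column and uses the GNU xargs -r flag so that kubectl delete is not run with an empty argument list when no PVCs match (for example, if they were already deleted). The function name spire_server_pvcs is hypothetical.

```shell
# Hypothetical helper (not part of CSM): read "kubectl get pvc --no-headers"
# output on stdin and print only the names (first column) of Spire server PVCs.
spire_server_pvcs() {
    awk '$1 ~ /^spire-data-spire-server-/ { print $1 }'
}
```

Example usage:

kubectl get pvc -n spire --no-headers | spire_server_pvcs | xargs -r kubectl delete -n spire pvc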

Clean up spire-agent.

for ncn in $(kubectl get nodes -o name | cut -d'/' -f2); do
    echo "Cleaning up NCN ${ncn}"
    ssh "${ncn}" systemctl stop spire-agent
    ssh "${ncn}" rm -v /var/lib/spire/data/svid.key /var/lib/spire/agent_svid.der /var/lib/spire/bundle.der
done
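The per-node cleanup can be wrapped in a function so that a failure on one node is reported without hiding the others, and so the commands can be previewed first. This is a sketch, not the documented procedure: the function name cleanup_spire_agent and the RUNNER variable (which defaults to ssh, and can be set to echo for a dry run) are hypothetical. Unlike the loop above, it uses rm -f, so files that are already absent are not reported as errors.

```shell
# Hypothetical helper (not part of CSM): stop spire-agent on one node and remove
# its cached SVID key, agent SVID, and bundle files (same paths as the loop above).
# RUNNER defaults to ssh; set RUNNER=echo to print the commands instead (dry run).
SPIRE_AGENT_FILES="/var/lib/spire/data/svid.key /var/lib/spire/agent_svid.der /var/lib/spire/bundle.der"

cleanup_spire_agent() {
    node="$1"
    runner="${RUNNER:-ssh}"
    $runner "$node" systemctl stop spire-agent || return 1
    # -f: do not fail if a file is already gone (the original loop uses rm -v only).
    # $SPIRE_AGENT_FILES is intentionally unquoted so it splits into three paths.
    $runner "$node" rm -fv $SPIRE_AGENT_FILES
}
```

Example usage, first as a dry run and then for real:

RUNNER=echo cleanup_spire_agent ncn-w001
for ncn in $(kubectl get nodes -o name | cut -d'/' -f2); do
    echo "Cleaning up NCN ${ncn}"
    cleanup_spire_agent "${ncn}" || echo "WARNING: cleanup failed on ${ncn}" >&2
done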

(ncn-mw#) Redeploy the chart and wait for the resources to start.

Follow the Redeploying a Chart procedure with the following specifications:

Name of chart to be redeployed: spire
Base name of manifest: sysmgmt
When reaching the step to update customizations, no edits need to be made to the customizations file.

When reaching the step to validate that the redeploy was successful, perform the following step.

Only perform this step as part of the previously linked chart redeploy procedure.

Wait for the resources to start.

watch "kubectl get pods -n spire"

Example output:

NAME                                     READY   STATUS      RESTARTS   AGE
request-ncn-join-token-89hp7             2/2     Running     0          31m
request-ncn-join-token-fvqdj             2/2     Running     0          31m
request-ncn-join-token-h7qc2             2/2     Running     0          31m
request-ncn-join-token-wv56n             2/2     Running     0          31m
request-ncn-join-token-dnfhk             2/2     Running     0          31m
request-ncn-join-token-hbvwc             2/2     Running     0          31m
spire-agent-cmn9q                        1/1     Running     0          31m
spire-agent-gzn2d                        1/1     Running     0          31m
spire-agent-pl595                        1/1     Running     0          31m
spire-create-pooler-schema-1-g6gr6       0/3     Completed   0          31m
spire-jwks-6c97b5694f-d94rg              3/3     Running     0          31m
spire-jwks-6c97b5694f-h89lb              3/3     Running     0          31m
spire-jwks-6c97b5694f-kz9k4              3/3     Running     0          31m
spire-postgres-0                         3/3     Running     0          31m
spire-postgres-1                         3/3     Running     0          31m
spire-postgres-2                         3/3     Running     0          30m
spire-postgres-pooler-695d4cd48f-57p5s   2/2     Running     0          30m
spire-postgres-pooler-695d4cd48f-bzm6n   2/2     Running     0          30m
spire-postgres-pooler-695d4cd48f-mv57z   2/2     Running     0          30m
spire-server-0                           2/2     Running     4          31m
spire-server-1                           2/2     Running     0          28m
spire-server-2                           2/2     Running     0          28m
spire-update-bss-1-cfbxc                 0/2     Completed   0          31m
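Instead of watching the output until every pod looks healthy, the pods that are not yet ready can be listed and polled. This sketch assumes the default kubectl get pods column order (NAME, READY, STATUS, ...) and treats Completed job pods, such as spire-update-bss in the example output, as done. The function name not_ready_pods is hypothetical.

```shell
# Hypothetical helper (not part of CSM): read "kubectl get pods --no-headers"
# output on stdin and print the names of pods that are not fully ready.
# Completed (job) pods are skipped; anything else must be Running with READY x/x.
not_ready_pods() {
    awk '$3 != "Completed" { split($2, r, "/"); if (r[1] != r[2] || $3 != "Running") print $1 }'
}
```

Example usage, polling every 10 seconds:

until [ -z "$(kubectl get pods -n spire --no-headers 2>/dev/null | not_ready_pods)" ]; do sleep 10; done

Note that an empty pod list also satisfies this check, so start polling only after the Spire pods have been created.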

Rejoin the storage nodes to Spire, then start spire-agent on all NCNs (it was stopped on the Kubernetes NCNs during the cleanup step).

/opt/cray/platform-utils/spire/fix-spire-on-storage.sh
for ncn in $(kubectl get nodes -o name | cut -d'/' -f2) $(ceph node ls | jq -r '.[] | keys[]' | sort -u); do
    ssh "${ncn}" systemctl start spire-agent
done
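To confirm that spire-agent actually came up everywhere, each node can be checked with systemctl is-active. This is a sketch: the function name report_inactive_agents and the RUNNER variable (defaulting to ssh) are hypothetical. The </dev/null redirection matters because ssh otherwise reads from stdin and would consume the rest of the node list.

```shell
# Hypothetical helper (not part of CSM): read node names, one per line, on stdin
# and print each node where spire-agent is not active. Returns 1 if any were found.
# RUNNER defaults to ssh and can be replaced for testing.
report_inactive_agents() {
    runner="${RUNNER:-ssh}"
    rc=0
    while read -r node; do
        [ -n "$node" ] || continue
        # </dev/null keeps ssh from swallowing the remaining node names on stdin.
        if ! $runner "$node" systemctl is-active --quiet spire-agent </dev/null; then
            echo "$node"
            rc=1
        fi
    done
    return $rc
}
```

Example usage, with the same node list as the start loop above:

{ kubectl get nodes -o name | cut -d'/' -f2; ceph node ls | jq -r '.[] | keys[]' | sort -u; } | report_inactive_agents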