Spire Service Recovery

This procedure covers redeploying the Spire service after its data has been lost or corrupted, and bringing the Spire agents on the NCNs back into service.

Prerequisites

  • The system is fully installed and has transitioned off the LiveCD.
  • All activities required for site maintenance are complete.
  • A backup or export of the data already exists.
  • The latest CSM documentation has been installed on the master nodes. See Check for Latest Documentation.
  • The Cray CLI has been configured on the node where the procedure is being performed. See Configure the Cray CLI.

Service recovery for Spire

  1. (ncn-mw#) Verify that a backup of the Spire Postgres data exists.

    1. List the Spire objects in the postgres-backup bucket and confirm that at least one completed backup is present.

      cray artifacts list postgres-backup --format json | jq -r '.artifacts[].Key | select(contains("spire"))'
      

      Example output:

      spire-postgres-2022-09-14T03:10:04.manifest
      spire-postgres-2022-09-14T03:10:04.psql
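
The check above can also be scripted. The following is a minimal sketch, assuming that a completed backup is one that has both a .manifest and a .psql object with the same prefix, as in the example output. The function name latest_spire_backup_complete is hypothetical and is not part of any CSM tooling.

```shell
#!/usr/bin/env bash
# Sketch: confirm that the newest Spire backup has both a .manifest and a .psql object.
# Reads artifact keys (one per line) on stdin. The function name is hypothetical.
latest_spire_backup_complete() {
    local keys latest
    keys=$(grep '^spire-postgres-' | sort)
    # The timestamps in the key names sort lexicographically,
    # so the last .psql key is the newest one.
    latest=$(printf '%s\n' "${keys}" | grep '\.psql$' | tail -n 1)
    if [ -z "${latest}" ]; then
        echo "No Spire .psql backup found" >&2
        return 1
    fi
    # The matching manifest is assumed to share the same prefix.
    if printf '%s\n' "${keys}" | grep -qxF "${latest%.psql}.manifest"; then
        echo "${latest%.psql}"
    else
        echo "Missing manifest for ${latest}" >&2
        return 1
    fi
}
```

One possible way to use it is to pipe the key list from the command above into the function: cray artifacts list postgres-backup --format json | jq -r '.artifacts[].Key' | latest_spire_backup_complete. On success it prints the common prefix of the newest backup; otherwise it returns a non-zero status.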
      
  2. (ncn-mw#) Uninstall the chart and wait for the resources to terminate.

    1. Note the version of the chart that is currently deployed.

      helm history -n spire spire
      

      Example output:

      REVISION    UPDATED                     STATUS      CHART       APP VERSION DESCRIPTION
      1           Tue Aug  2 22:14:31 2022    deployed    spire-2.6.0 0.12.2      Install complete
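
If the chart version needs to be captured in a variable rather than read by eye, it could be extracted from the table output. This is only a sketch: the function name deployed_chart_version is hypothetical, and it assumes the CHART column has the form <chart>-<version> as in the example output above.

```shell
#!/usr/bin/env bash
# Sketch: print the chart version of the revision whose STATUS is "deployed".
# Reads `helm history` table output on stdin; the chart name defaults to "spire".
deployed_chart_version() {
    awk -v chart="${1:-spire}" '
    {
        deployed = 0
        for (i = 1; i <= NF; i++) {
            # STATUS comes before CHART, so remember that this row is deployed.
            if ($i == "deployed") deployed = 1
            # The CHART field starts with "<chart>-"; strip that prefix.
            if (deployed && index($i, chart "-") == 1) {
                print substr($i, length(chart) + 2)
                break
            }
        }
    }'
}
```

For example, helm history -n spire spire | deployed_chart_version spire would print only the version string of the deployed revision, which can then be recorded before the chart is uninstalled.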
      
    2. Uninstall the chart.

      helm uninstall -n spire spire
      

      Example output:

      release "spire" uninstalled
      
    3. Wait for the resources to terminate, delete the PVCs, and clean up spire-agent before reinstalling the chart.

      1. Verify that no Spire pods are running.

        watch "kubectl get pods -n spire"
        

        Example output:

        No resources found in spire namespace.
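
When this step is run from a script instead of an interactive terminal, watch can be replaced with a polling loop. The sketch below is one possible approach; the function name wait_until_empty and the timeout values are assumptions chosen for illustration.

```shell
#!/usr/bin/env bash
# Sketch: rerun a command until it succeeds and prints nothing on stdout.
# Usage: wait_until_empty TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# TIMEOUT_SECONDS and INTERVAL_SECONDS must be whole numbers.
wait_until_empty() {
    local timeout=$1 interval=$2 waited=0 out
    shift 2
    while true; do
        # A failing command (for example, an unreachable API server) is
        # treated as "not finished yet" rather than as an empty result.
        if out=$("$@" 2>/dev/null) && [ -z "${out}" ]; then
            return 0
        fi
        if [ "${waited}" -ge "${timeout}" ]; then
            echo "Timed out after ${waited}s waiting for: $*" >&2
            return 1
        fi
        sleep "${interval}"
        waited=$((waited + interval))
    done
}
```

A call such as wait_until_empty 300 5 kubectl get pods -n spire --no-headers would poll every 5 seconds for up to 5 minutes and return once no pods are listed in the spire namespace.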
        
      2. Delete the Spire PVCs.

        kubectl get pvc -n spire | grep spire-data-spire-server | awk '{print $1}' | xargs kubectl delete -n spire pvc
        

        Example output:

        persistentvolumeclaim "spire-data-spire-server-0" deleted
        persistentvolumeclaim "spire-data-spire-server-1" deleted
        persistentvolumeclaim "spire-data-spire-server-2" deleted
        
      3. Clean up spire-agent.

        for ncn in $(kubectl get nodes -o name | cut -d'/' -f2); do
            echo "Cleaning up NCN ${ncn}"
            ssh "${ncn}" systemctl stop spire-agent
            ssh "${ncn}" rm -v /var/lib/spire/data/svid.key /var/lib/spire/agent_svid.der /var/lib/spire/bundle.der
        done
        
  3. (ncn-mw#) Redeploy the chart and wait for the resources to start.

    Follow the Redeploying a Chart procedure with the following specifications:

    • Name of chart to be redeployed: spire

    • Base name of manifest: sysmgmt

    • When reaching the step to update customizations, no edits need to be made to the customizations file.

    • When reaching the step to validate that the redeploy was successful, perform the following step:

      Only follow this step as part of the previously linked chart redeploy procedure.

      1. Wait for the resources to start.

        watch "kubectl get pods -n spire"
        

        Example output:

        NAME                                     READY   STATUS      RESTARTS   AGE
        request-ncn-join-token-89hp7             2/2     Running     0          31m
        request-ncn-join-token-fvqdj             2/2     Running     0          31m
        request-ncn-join-token-h7qc2             2/2     Running     0          31m
        request-ncn-join-token-wv56n             2/2     Running     0          31m
        request-ncn-join-token-dnfhk             2/2     Running     0          31m
        request-ncn-join-token-hbvwc             2/2     Running     0          31m
        spire-agent-cmn9q                        1/1     Running     0          31m
        spire-agent-gzn2d                        1/1     Running     0          31m
        spire-agent-pl595                        1/1     Running     0          31m
        spire-create-pooler-schema-1-g6gr6       0/3     Completed   0          31m
        spire-jwks-6c97b5694f-d94rg              3/3     Running     0          31m
        spire-jwks-6c97b5694f-h89lb              3/3     Running     0          31m
        spire-jwks-6c97b5694f-kz9k4              3/3     Running     0          31m
        spire-postgres-0                         3/3     Running     0          31m
        spire-postgres-1                         3/3     Running     0          31m
        spire-postgres-2                         3/3     Running     0          30m
        spire-postgres-pooler-695d4cd48f-57p5s   2/2     Running     0          30m
        spire-postgres-pooler-695d4cd48f-bzm6n   2/2     Running     0          30m
        spire-postgres-pooler-695d4cd48f-mv57z   2/2     Running     0          30m
        spire-server-0                           2/2     Running     4          31m
        spire-server-1                           2/2     Running     0          28m
        spire-server-2                           2/2     Running     0          28m
        spire-update-bss-1-cfbxc                 0/2     Completed   0          31m
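
The pod listing above can be long, so a small filter may make it easier to see whether anything is still starting. The following sketch treats a pod as healthy when its status is Completed, or when its status is Running and every container is ready. The function name unhealthy_pods is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: list pods that are neither Completed nor fully ready and Running.
# Reads `kubectl get pods` table output on stdin.
# Returns 0 when every pod is healthy and 1 otherwise.
unhealthy_pods() {
    awk '
    $1 == "NAME" { next }                     # skip the header row
    {
        split($2, ready, "/")                 # READY column, for example 2/2
        if ($3 == "Completed") next
        if ($3 == "Running" && ready[1] == ready[2]) next
        print $1, $2, $3
        bad = 1
    }
    END { exit bad + 0 }'
}
```

For example, kubectl get pods -n spire | unhealthy_pods prints nothing and returns 0 for the example output above, and prints the name, READY value, and status of each pod that still needs attention otherwise.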
        
  4. (ncn-mw#) Rejoin the storage nodes to Spire and restart the spire-agent on all NCNs.

    /opt/cray/platform-utils/spire/fix-spire-on-storage.sh
    for ncn in $(kubectl get nodes -o name | cut -d'/' -f2) $(ceph node ls | jq -r '.[] | keys[]' | sort -u); do
        ssh "${ncn}" systemctl start spire-agent
    done