Disaster Recovery for Postgres

In the event that the Postgres cluster has failed to the point that it must be recovered, and no dump is available from which to restore the data, a full service-specific disaster recovery is needed.

Below are the service-specific steps required to clean up any existing resources, redeploy the resources, and repopulate the data.

Disaster recovery procedures by service:

  - Restore Keycloak Postgres without a backup
  - Restore console Postgres

Restore Keycloak Postgres without a backup

The following procedures are required to rebuild the automatically populated contents of Keycloak’s PostgreSQL database if the database has been lost and recreated.

  1. Re-run the keycloak-setup job.

    1. Fetch the current job definition.

      ncn-mw# kubectl get job -n services -l app.kubernetes.io/name=cray-keycloak -oyaml |\
                yq r - 'items[0]' | yq d - 'spec.selector' | \
                yq d - 'spec.template.metadata.labels' > keycloak-setup.yaml
      

      There should be no output.
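
      The `yq r`/`yq d` syntax above is yq version 3 syntax. If only yq version 4 is installed on the node (check with `yq --version`), the following sketch, assuming yq v4's `eval`/`del` syntax, produces the same file:

      ncn-mw# kubectl get job -n services -l app.kubernetes.io/name=cray-keycloak -oyaml |\
                yq eval '.items[0] | del(.spec.selector) | del(.spec.template.metadata.labels)' - > keycloak-setup.yaml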

    2. Restart the keycloak-setup job.

      ncn-mw# kubectl replace --force -f keycloak-setup.yaml
      

      The output should be similar to the following:

      job.batch "keycloak-setup-1" deleted
      job.batch/keycloak-setup-1 replaced
      
    3. Wait for the job to finish.

      ncn-mw# kubectl wait --for=condition=complete -n services job -l app.kubernetes.io/name=cray-keycloak --timeout=-1s
      

      The output should be similar to the following:

      job.batch/keycloak-setup-1 condition met
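
      If the job fails rather than completing, this command will wait indefinitely. In that case, investigate from a second shell; a sketch assuming the job name keycloak-setup-1 from the example output above (it may differ on a given system):

      ncn-mw# kubectl get job -n services -l app.kubernetes.io/name=cray-keycloak
      ncn-mw# kubectl logs -n services job/keycloak-setup-1 --all-containers

      The same approach applies to the keycloak-users-localize job in the next step.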
      
  2. Re-run the keycloak-users-localize job.

    1. Fetch the current job definition.

      ncn-mw# kubectl get job -n services -l app.kubernetes.io/name=cray-keycloak-users-localize -oyaml |\
                yq r - 'items[0]' | yq d - 'spec.selector' | \
                yq d - 'spec.template.metadata.labels' > keycloak-users-localize.yaml
      

      There should be no output.

    2. Restart the keycloak-users-localize job.

      ncn-mw# kubectl replace --force -f keycloak-users-localize.yaml
      

      The output should be similar to the following:

      job.batch "keycloak-users-localize-1" deleted
      job.batch/keycloak-users-localize-1 replaced
      
    3. Wait for the job to finish.

      ncn-mw# kubectl wait --for=condition=complete -n services job -l app.kubernetes.io/name=cray-keycloak-users-localize --timeout=-1s
      

      The output should be similar to the following:

      job.batch/keycloak-users-localize-1 condition met
      
  3. Restart keycloak-gatekeeper to pick up the newly generated client ID.

    1. Restart the keycloak-gatekeeper pods.

      ncn-mw# kubectl rollout restart deployment -n services cray-keycloak-gatekeeper-ingress
      

      Expected output:

      deployment.apps/cray-keycloak-gatekeeper-ingress restarted
      
    2. Wait for the restart to complete.

      ncn-mw# kubectl rollout status deployment -n services cray-keycloak-gatekeeper-ingress
      

      Expected output:

      deployment "cray-keycloak-gatekeeper-ingress" successfully rolled out
      

Any other changes that were made to Keycloak, such as locally created users, must be re-applied manually.
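
As a final sanity check, confirm that all Keycloak-related pods are healthy; a minimal sketch that lists them (each should be Running, or Completed in the case of job pods):

  ncn-mw# kubectl get pods -n services | grep keycloak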

Restore console Postgres

In many cases, the PostgreSQL database used for the console services can be restored to health using standard Postgres troubleshooting and recovery procedures.

If the database cannot be restored to health, follow the directions below to recover. Nothing in the console services PostgreSQL database needs to be backed up and restored; once the database is healthy, the console services will rebuild and repopulate it from the current state of the system. Recovery consists of uninstalling and reinstalling the Helm chart for the cray-console-data service.

  1. Determine the version of cray-console-data that is deployed.

    ncn-mw# helm history -n services cray-console-data
    

    Output similar to the following will be returned:

    REVISION UPDATED                   STATUS     CHART                    APP VERSION  DESCRIPTION
    1        Thu Sep  2 19:56:24 2021  deployed   cray-console-data-1.0.8  1.0.8        Install complete
    

    Note the version of the Helm chart that is deployed (1.0.8 in this example).

  2. Get the correct Helm chart package to reinstall.

    Copy the chart from the local Nexus repository into the current directory:

    Replace the version in the following example with the version noted in the previous step.

    ncn-mw# wget https://packages.local/repository/charts/cray-console-data-1.0.8.tgz
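
    Alternatively, the chart version can be captured into a shell variable rather than typed by hand. This is a sketch assuming jq is installed on the node and that the release name is exactly cray-console-data:

    ncn-mw# CHART_VERSION=$(helm list -n services --filter '^cray-console-data$' -o json | jq -r '.[0].chart' | sed 's/^cray-console-data-//')
    ncn-mw# wget "https://packages.local/repository/charts/cray-console-data-${CHART_VERSION}.tgz"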
    
  3. Uninstall the current cray-console-data service.

    ncn-mw# helm uninstall -n services cray-console-data
    

    Example output:

    release "cray-console-data" uninstalled
    
  4. Wait for all resources to be removed.

    1. Watch the deployed pods terminate.

      Watch the services from the cray-console-data Helm chart as they are terminated and removed:

      ncn-mw# watch -n .2 'kubectl -n services get pods | grep cray-console-data'
      

      Output similar to the following will be returned:

      cray-console-data-764f9d46b5-vbs7w     2/2     Running      0          4d20h
      cray-console-data-postgres-0           3/3     Running      0          20d
      cray-console-data-postgres-1           3/3     Running      0          20d
      cray-console-data-postgres-2           3/3     Terminating  0          4d20h
      

      This may take several minutes to complete. When all of the services have terminated and the command no longer displays any pods, press Ctrl-C to exit the watch command.

    2. Check that the data PVC instances have been removed.

      ncn-mw# kubectl -n services get pvc | grep console-data-postgres
      

      There should be no PVC instances returned by this command. If there are, delete them manually with the following command:

      Replace the name of the PVC in the following example with the PVC to be deleted.

      ncn-mw# kubectl -n services delete pvc pgdata-cray-console-data-postgres-0
      

      Repeat until all of the `pgdata-cray-console-data-postgres-*` PVC instances are removed.
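
      Alternatively, any remaining matching PVC instances can be deleted in one pass; a sketch that feeds every matching PVC name to kubectl delete:

      ncn-mw# kubectl -n services get pvc -o name | grep console-data-postgres |\
                xargs -r kubectl -n services delete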

  5. Install the Helm chart.

    Install using the file downloaded previously:

    ncn-mw# helm install -n services cray-console-data ./cray-console-data-1.0.8.tgz
    

    Example output:

    NAME: cray-console-data
    LAST DEPLOYED: Mon Oct 25 22:44:49 2021
    NAMESPACE: services
    STATUS: deployed
    REVISION: 1
    TEST SUITE: None
    
  6. Verify that all services restart correctly.

    1. Watch the services come back up again.

      ncn-mw# watch -n .2 'kubectl -n services get pods | grep cray-console-data'
      

      After a short time, the output should look similar to the following:

      cray-console-data-764f9d46b5-vbs7w     2/2     Running    0          5m
      cray-console-data-postgres-0           3/3     Running    0          4m
      cray-console-data-postgres-1           3/3     Running    0          3m
      cray-console-data-postgres-2           3/3     Running    0          2m
      

      It will take a few minutes after these services are back up and running for the console services to settle and rebuild the database.
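
      The health of the rebuilt Postgres cluster itself can also be checked directly; a sketch assuming the postgres container name used by the Postgres operator:

      ncn-mw# kubectl -n services exec cray-console-data-postgres-0 -c postgres -it -- patronictl list

      All three cluster members should be listed, with one of them acting as leader.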

    2. Query cray-console-operator for a node location.

      After a few minutes, query cray-console-operator to find the cray-console-node pod that a particular node is connected to.

      In the following example, replace the cray-console-operator pod name with the actual name of the running pod, and replace the component name (xname) with an actual node xname on the system.

      ncn-mw# kubectl -n services exec -it cray-console-operator-7fdc797f9f-xz8rt -- sh -c '/app/get-node x9000c3s3b0n1'
      

      Example output:

      {"podname":"cray-console-node-0"}
      

      This confirms that the cray-console-data service is up and operational.
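
      To avoid looking up the operator pod name by hand, the query can also be scripted; a minimal sketch using the placeholder xname from the example above:

      ncn-mw# OPERATOR_POD=$(kubectl -n services get pods -o name | grep cray-console-operator | head -n 1)
      ncn-mw# kubectl -n services exec -it "${OPERATOR_POD}" -- sh -c '/app/get-node x9000c3s3b0n1'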