
Disaster Recovery for Postgres

If the Postgres cluster has failed to the point that it must be recovered and no dump is available to restore the data, a full service-specific disaster recovery is needed.

Below are the service-specific steps required to clean up any existing resources, redeploy the resources, and repopulate the data.

Disaster recovery procedures by service:

Restore Keycloak Postgres without a backup

The following procedures are required to rebuild the automatically populated contents of Keycloak’s PostgreSQL database if the database has been lost and recreated.

Re-run the keycloak-setup job.

Fetch the current job definition.

ncn-mw# kubectl get job -n services -l app.kubernetes.io/name=cray-keycloak -oyaml |\
          yq r - 'items[0]' | yq d - 'spec.selector' | \
          yq d - 'spec.template.metadata.labels' > keycloak-setup.yaml

There should be no output.

Restart the keycloak-setup job.

ncn-mw# kubectl replace --force -f keycloak-setup.yaml

The output should be similar to the following:

job.batch "keycloak-setup-1" deleted
job.batch/keycloak-setup-1 replaced

Wait for the job to finish.

ncn-mw# kubectl wait --for=condition=complete -n services job -l app.kubernetes.io/name=cray-keycloak --timeout=-1s

The output should be similar to the following:

job.batch/keycloak-setup-1 condition met

Re-run the keycloak-users-localize job.

Fetch the current job definition.

ncn-mw# kubectl get job -n services -l app.kubernetes.io/name=cray-keycloak-users-localize -oyaml |\
          yq r - 'items[0]' | yq d - 'spec.selector' | \
          yq d - 'spec.template.metadata.labels' > keycloak-users-localize.yaml

There should be no output.

Restart the keycloak-users-localize job.

ncn-mw# kubectl replace --force -f keycloak-users-localize.yaml

The output should be similar to the following:

job.batch "keycloak-users-localize-1" deleted
job.batch/keycloak-users-localize-1 replaced

Wait for the job to finish.

ncn-mw# kubectl wait --for=condition=complete -n services job -l app.kubernetes.io/name=cray-keycloak-users-localize --timeout=-1s

The output should be similar to the following:

job.batch/keycloak-users-localize-1 condition met
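The keycloak-setup and keycloak-users-localize re-runs above follow the same fetch, replace, and wait pattern, differing only in the label value. The following is a minimal sketch of that pattern as a shell function. The name `rerun_job` and its optional output-file argument are hypothetical and not part of CSM; the `yq` calls assume the same `yq` version used in the commands above.

```shell
# Sketch: re-run a job in the services namespace, selected by its
# app.kubernetes.io/name label. rerun_job is a hypothetical helper.
rerun_job() {
    local label_value="$1"                          # e.g. cray-keycloak
    local out_file="${2:-${label_value}-job.yaml}"  # where the definition is saved
    local selector="app.kubernetes.io/name=${label_value}"

    # Fetch the current job definition, removing the selector and the
    # pod template labels exactly as in the manual steps.
    kubectl get job -n services -l "${selector}" -oyaml |
        yq r - 'items[0]' | yq d - 'spec.selector' |
        yq d - 'spec.template.metadata.labels' > "${out_file}"

    # Delete and recreate the job; stop here if that fails.
    kubectl replace --force -f "${out_file}" || return 1

    # Block until the job reports the complete condition.
    kubectl wait --for=condition=complete -n services job \
        -l "${selector}" --timeout=-1s
}
```

With this defined, `rerun_job cray-keycloak` followed by `rerun_job cray-keycloak-users-localize` would perform the two re-runs in order.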

Restart keycloak-gatekeeper to pick up the newly generated client ID.

Restart the keycloak-gatekeeper pods.

ncn-mw# kubectl rollout restart deployment -n services cray-keycloak-gatekeeper-ingress

Expected output:

deployment.apps/cray-keycloak-gatekeeper-ingress restarted

Wait for the restart to complete.

ncn-mw# kubectl rollout status deployment -n services cray-keycloak-gatekeeper-ingress

Expected output:

deployment "cray-keycloak-gatekeeper-ingress" successfully rolled out

Any other changes made to Keycloak, such as local users that have been created, will have to be manually re-applied.

Restore console Postgres

In many cases, the PostgreSQL database used for the console services can be restored to health using the techniques described in the following documents:

If the database cannot be restored to health, follow the directions below to recover it. Nothing in the console services PostgreSQL database needs to be backed up and restored. Once the database is healthy, the console services rebuild and populate it from the current system. Recovery consists of uninstalling and reinstalling the Helm chart for the cray-console-data service.

Determine the version of cray-console-data that is deployed.

ncn-mw# helm history -n services cray-console-data

Output similar to the following will be returned:

REVISION UPDATED                   STATUS     CHART                    APP VERSION  DESCRIPTION
1        Thu Sep  2 19:56:24 2021  deployed   cray-console-data-1.0.8  1.0.8        Install complete

Note the version of the Helm chart that is deployed (1.0.8 in the example above).
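Rather than reading the version by eye, the chart name and version can be pulled out of the history table. This is a minimal sketch that assumes the table layout shown above; `deployed_chart` is a hypothetical helper name, not an existing tool.

```shell
# Sketch: read `helm history` table output on stdin and print the CHART
# value (name-version) of the row marked deployed.
# deployed_chart is a hypothetical helper name.
deployed_chart() {
    awk -v name="$1" '
        {
            is_deployed = 0; found = ""
            for (i = 1; i <= NF; i++) {
                # A field equal to "deployed" marks the active revision.
                if ($i == "deployed") is_deployed = 1
                # The CHART field starts with the chart name, a dash, and a digit.
                if ($i ~ ("^" name "-[0-9]")) found = $i
            }
            if (is_deployed && found != "") chart = found
        }
        # Exit non-zero when no deployed row was seen.
        END { if (chart == "") exit 1; print chart }
    '
}
```

For example, `helm history -n services cray-console-data | deployed_chart cray-console-data` would print the chart string to use when building the package file name in the next step.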

Get the correct Helm chart package to reinstall.

Copy the chart from the local Nexus repository into the current directory:

Replace the version in the following example with the version noted in the previous step.
```
ncn-mw# wget https://packages.local/repository/charts/cray-console-data-1.0.8.tgz
```

Uninstall the current cray-console-data service.

ncn-mw# helm uninstall -n services cray-console-data

Example output:

release "cray-console-data" uninstalled

Wait for all resources to be removed.
1. Watch the deployed pods terminate.
  
  Watch the services from the cray-console-data Helm chart as they are terminated and removed:
```
ncn-mw# watch -n .2 'kubectl -n services get pods | grep cray-console-data'
```
  Output similar to the following will be returned:
```
cray-console-data-764f9d46b5-vbs7w     2/2     Running      0          4d20h
cray-console-data-postgres-0           3/3     Running      0          20d
cray-console-data-postgres-1           3/3     Running      0          20d
cray-console-data-postgres-2           3/3     Terminating  0          4d20h
```
  This may take several minutes to complete. When all of the services have terminated and nothing is displayed, press Ctrl-C to exit the watch command.
2. Check that the data PVC instances have been removed.
```
ncn-mw# kubectl -n services get pvc | grep console-data-postgres
```
  There should be no PVC instances returned by this command. If there are, delete them manually with the following command:
  
  Replace the name of the PVC in the following example with the PVC to be deleted.
```
ncn-mw# kubectl -n services delete pvc pgdata-cray-console-data-postgres-0
```
  Repeat until all of the `pgdata-cray-console-data-postgres-` instances are removed.
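If several PVC instances remain, the repeated delete can be written as a loop. This is a minimal sketch under the assumption that every leftover PVC name contains `pgdata-cray-console-data-postgres-`; `delete_console_data_pvcs` is a hypothetical helper name. Because deleting a PVC is destructive, run the `get pvc` command on its own first to review what would be removed.

```shell
# Sketch: delete every leftover cray-console-data Postgres PVC in the
# services namespace. delete_console_data_pvcs is a hypothetical helper.
delete_console_data_pvcs() {
    # -o name prints one resource per line as persistentvolumeclaim/<name>.
    kubectl -n services get pvc -o name |
        grep 'pgdata-cray-console-data-postgres-' |
        while read -r pvc; do
            # Delete one matching PVC per iteration.
            kubectl -n services delete "${pvc}"
        done
}
```

After the function returns, repeating the `kubectl -n services get pvc | grep console-data-postgres` check should return nothing.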

Install the Helm chart.

Install using the file downloaded previously:

ncn-mw# helm install -n services cray-console-data ./cray-console-data-1.0.8.tgz

Example output:

NAME: cray-console-data
LAST DEPLOYED: Mon Oct 25 22:44:49 2021
NAMESPACE: services
STATUS: deployed
REVISION: 1
TEST SUITE: None

Verify that all services restart correctly.
1. Watch the services come back up again.
```
ncn-mw# watch -n .2 'kubectl -n services get pods | grep cray-console-data'
```
  After a short time, the output should look similar to the following:
```
cray-console-data-764f9d46b5-vbs7w     2/2     Running    0          5m
cray-console-data-postgres-0           3/3     Running    0          4m
cray-console-data-postgres-1           3/3     Running    0          3m
cray-console-data-postgres-2           3/3     Running    0          2m
```
  It will take a few minutes after these services are back up and running for the console services to settle and rebuild the database.
2. Query cray-console-operator for a node location.
  
  After a few minutes, query cray-console-operator to find the pod a particular node is connected to.
  
  In the following example, replace the cray-console-operator pod name with the actual name of the running pod, and replace the component name (xname) with an actual node xname on the system.
```
ncn-mw# kubectl -n services exec -it cray-console-operator-7fdc797f9f-xz8rt -- sh -c '/app/get-node x9000c3s3b0n1'
```
  Example output:
```
{"podname":"cray-console-node-0"}
```
  This confirms that the cray-console-data service is up and operational.
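The final check above requires substituting the operator pod name by hand. The following minimal sketch looks the pod up first and then extracts the `podname` value from the reply. It assumes the operator pod name contains `cray-console-operator` and that the reply has the JSON shape shown in the example output; `console_pod_for_node` is a hypothetical helper name.

```shell
# Sketch: print the cray-console-node pod that a node (xname) is connected to.
# console_pod_for_node is a hypothetical helper name.
console_pod_for_node() {
    local xname="$1"
    local operator

    # Pick the first pod whose name contains cray-console-operator.
    operator=$(kubectl -n services get pods -o name |
        grep 'cray-console-operator' | head -n 1)
    if [ -z "${operator}" ]; then
        echo "no cray-console-operator pod found" >&2
        return 1
    fi

    # -it is omitted because the command is not interactive.
    # sed keeps only the value of the "podname" key from the JSON reply.
    kubectl -n services exec "${operator}" -- sh -c "/app/get-node ${xname}" |
        sed -n 's/.*"podname":"\([^"]*\)".*/\1/p'
}
```

Calling `console_pod_for_node` with an actual node xname from the system should print a pod name once the console services have settled; empty output suggests the database has not been repopulated yet.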