If a Postgres cluster has failed to the point that it must be recovered and no dump is available to restore the data, a full service-specific disaster recovery is required.
The service-specific steps below clean up any existing resources, redeploy those resources, and repopulate the data.
Disaster recovery procedures by service:
The following procedures are required to rebuild the automatically populated contents of Keycloak’s PostgreSQL database if the database has been lost and recreated.
Re-run the keycloak-setup job.
Fetch the current job definition.
ncn-mw# kubectl get job -n services -l app.kubernetes.io/name=cray-keycloak -oyaml |\
yq r - 'items[0]' | yq d - 'spec.selector' | \
yq d - 'spec.template.metadata.labels' > keycloak-setup.yaml
There should be no output.
Restart the keycloak-setup job.
ncn-mw# kubectl replace --force -f keycloak-setup.yaml
The output should be similar to the following:
job.batch "keycloak-setup-1" deleted
job.batch/keycloak-setup-1 replaced
Wait for the job to finish.
ncn-mw# kubectl wait --for=condition=complete -n services job -l app.kubernetes.io/name=cray-keycloak --timeout=-1s
The output should be similar to the following:
job.batch/keycloak-setup-1 condition met
Re-run the keycloak-users-localize job.
Fetch the current job definition.
ncn-mw# kubectl get job -n services -l app.kubernetes.io/name=cray-keycloak-users-localize -oyaml |\
yq r - 'items[0]' | yq d - 'spec.selector' | \
yq d - 'spec.template.metadata.labels' > keycloak-users-localize.yaml
There should be no output.
Restart the keycloak-users-localize job.
ncn-mw# kubectl replace --force -f keycloak-users-localize.yaml
The output should be similar to the following:
job.batch "keycloak-users-localize-1" deleted
job.batch/keycloak-users-localize-1 replaced
Wait for the job to finish.
ncn-mw# kubectl wait --for=condition=complete -n services job -l app.kubernetes.io/name=cray-keycloak-users-localize --timeout=-1s
The output should be similar to the following:
job.batch/keycloak-users-localize-1 condition met
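The two job re-runs above follow the same fetch, replace, and wait sequence, so they could be wrapped in one shell function. The following is a minimal sketch, not part of the documented procedure: the function name rerun_job and its output-file argument are hypothetical, and it assumes the same kubectl and yq commands used above are available on the node.

```shell
# Hypothetical helper: re-run the job selected by an app.kubernetes.io/name label.
# Usage: rerun_job LABEL_VALUE OUTPUT_FILE
rerun_job() {
    label="app.kubernetes.io/name=$1"
    file="$2"

    # Fetch the current job definition, dropping the fields that block re-creation.
    kubectl get job -n services -l "$label" -oyaml |
        yq r - 'items[0]' | yq d - 'spec.selector' |
        yq d - 'spec.template.metadata.labels' > "$file"

    # Stop if nothing was written to the file.
    if [ ! -s "$file" ]; then
        echo "no job definition saved for $label" >&2
        return 1
    fi

    # Delete and re-create the job, then wait for it to complete.
    kubectl replace --force -f "$file" &&
        kubectl wait --for=condition=complete -n services job -l "$label" --timeout=-1s
}
```

With this sketch, the two steps above would become rerun_job cray-keycloak keycloak-setup.yaml followed by rerun_job cray-keycloak-users-localize keycloak-users-localize.yaml.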
Restart the ingress oauth2-proxies.
Restart the deployments.
ncn-mw# kubectl rollout restart -n services deployment/cray-oauth2-proxies-customer-access-ingress && \
kubectl rollout restart -n services deployment/cray-oauth2-proxies-customer-high-speed-ingress && \
kubectl rollout restart -n services deployment/cray-oauth2-proxies-customer-management-ingress
Expected output:
deployment.apps/cray-oauth2-proxies-customer-access-ingress restarted
deployment.apps/cray-oauth2-proxies-customer-high-speed-ingress restarted
deployment.apps/cray-oauth2-proxies-customer-management-ingress restarted
Wait for the restart to complete.
ncn-mw# kubectl rollout status -n services deployment/cray-oauth2-proxies-customer-access-ingress && \
kubectl rollout status -n services deployment/cray-oauth2-proxies-customer-high-speed-ingress && \
kubectl rollout status -n services deployment/cray-oauth2-proxies-customer-management-ingress
Expected output:
deployment "cray-oauth2-proxies-customer-access-ingress" successfully rolled out
deployment "cray-oauth2-proxies-customer-high-speed-ingress" successfully rolled out
deployment "cray-oauth2-proxies-customer-management-ingress" successfully rolled out
Any other changes made to Keycloak, such as local users that have been created, will have to be manually re-applied.
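The oauth2-proxy restart in the Keycloak procedure above repeats the same command for three deployments, so it could also be written as a loop. This is a sketch under the assumption that only the three deployment names shown above exist; the function names are hypothetical.

```shell
# Print the three ingress oauth2-proxy deployment names used in this procedure.
oauth2_proxy_deployments() {
    for ingress in customer-access customer-high-speed customer-management; do
        echo "deployment/cray-oauth2-proxies-${ingress}-ingress"
    done
}

# Restart every deployment, then wait for each rollout to finish.
# Stops at the first kubectl failure.
restart_oauth2_proxies() {
    for deploy in $(oauth2_proxy_deployments); do
        kubectl rollout restart -n services "$deploy" || return 1
    done
    for deploy in $(oauth2_proxy_deployments); do
        kubectl rollout status -n services "$deploy" || return 1
    done
}
```

Running restart_oauth2_proxies should print the same "restarted" and "successfully rolled out" messages shown in the expected output above.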
In many cases, the PostgreSQL database used for the console services can be restored to health using the techniques described in the following documents:
If the database cannot be restored to health, follow the directions below to recover.
There is nothing in the console services PostgreSQL database that needs to be backed up and restored.
Once the database is healthy, the console services rebuild and populate it from the current system.
Recovery consists of uninstalling and reinstalling the Helm chart for the cray-console-data service.
Determine the version of cray-console-data that is deployed.
ncn-mw# helm history -n services cray-console-data
Output similar to the following will be returned:
REVISION UPDATED STATUS CHART APP VERSION DESCRIPTION
1 Thu Sep 2 19:56:24 2021 deployed cray-console-data-1.0.8 1.0.8 Install complete
Note the version of the Helm chart that is deployed.
Get the correct Helm chart package to reinstall.
Copy the chart from the local Nexus repository into the current directory:
Replace the version in the following example with the version noted in the previous step.
ncn-mw# wget https://packages.local/repository/charts/cray-console-data-1.0.8.tgz
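The chart name could also be read from the helm history output instead of being copied by hand. The following is a minimal sketch, assuming the whitespace-separated column layout shown in the example output above and the same URL layout as the wget example; the function names chart_from_history and chart_url are hypothetical and not part of Helm.

```shell
# Hypothetical helper: read `helm history` output on stdin and print the
# CHART value (for example cray-console-data-1.0.8) of the deployed revision.
chart_from_history() {
    awk '{
        for (i = 1; i < NF; i++) {
            if ($i == "deployed") { print $(i + 1); exit }
        }
    }'
}

# Build the download URL for a chart, using the URL layout shown above.
chart_url() {
    echo "https://packages.local/repository/charts/$1.tgz"
}

# Possible usage on a management node:
#   chart=$(helm history -n services cray-console-data | chart_from_history)
#   wget "$(chart_url "$chart")"
```

Check that the printed chart name matches the version noted in the previous step before downloading.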
Uninstall the current cray-console-data service.
ncn-mw# helm uninstall -n services cray-console-data
Example output:
release "cray-console-data" uninstalled
Wait for all resources to be removed.
Watch the deployed pods terminate.
Watch the services from the cray-console-data Helm chart as they are terminated and removed:
ncn-mw# watch -n .2 'kubectl -n services get pods | grep cray-console-data'
Output similar to the following will be returned:
cray-console-data-764f9d46b5-vbs7w 2/2 Running 0 4d20h
cray-console-data-postgres-0 3/3 Running 0 20d
cray-console-data-postgres-1 3/3 Running 0 20d
cray-console-data-postgres-2 3/3 Terminating 0 4d20h
This may take several minutes to complete. When all of the services have terminated and nothing is displayed any longer, press Ctrl-C to exit from the watch command.
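For scripted recovery, the interactive watch command above could be replaced by a polling loop that returns once the pod list is empty. This is a sketch only: wait_until_empty and console_data_pods are hypothetical names, and the interval and retry count are arbitrary. A command that fails also prints nothing, so confirm that kubectl can reach the cluster before relying on the result.

```shell
# Hypothetical helper: run a command every INTERVAL seconds, up to TRIES times,
# and return 0 as soon as it prints nothing. Returns 1 if output never stops.
# Usage: wait_until_empty INTERVAL TRIES COMMAND [ARGS...]
wait_until_empty() {
    interval="$1"
    tries="$2"
    shift 2
    while [ "$tries" -gt 0 ]; do
        if [ -z "$("$@" 2>/dev/null)" ]; then
            return 0
        fi
        tries=$((tries - 1))
        sleep "$interval"
    done
    return 1
}

# List the cray-console-data pods that still exist (same filter as the watch command).
console_data_pods() {
    kubectl -n services get pods | grep cray-console-data
}

# Possible usage: poll every 5 seconds for up to 10 minutes.
#   wait_until_empty 5 120 console_data_pods && echo "all pods removed"
```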
Check that the data PVC instances have been removed.
ncn-mw# kubectl -n services get pvc | grep console-data-postgres
There should be no PVC instances returned by this command. If there are, delete them manually with the following command:
Replace the name of the PVC in the following example with the PVC to be deleted.
ncn-mw# kubectl -n services delete pvc pgdata-cray-console-data-postgres-0
Repeat until all PVC instances whose names begin with pgdata-cray-console-data-postgres- are removed.
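The repeated manual deletion could be scripted by filtering the PVC list on the name prefix. A minimal sketch, assuming the PVC name is the first column of the kubectl get pvc output; leftover_pvcs and delete_leftover_pvcs are hypothetical names.

```shell
# Hypothetical helper: read `kubectl get pvc` output on stdin and print the
# names of any remaining console-data Postgres PVCs, one per line.
leftover_pvcs() {
    awk '$1 ~ /^pgdata-cray-console-data-postgres-/ { print $1 }'
}

# Delete every leftover PVC; does nothing when none remain.
delete_leftover_pvcs() {
    kubectl -n services get pvc | leftover_pvcs |
        while read -r pvc; do
            kubectl -n services delete pvc "$pvc"
        done
}
```

Piping the list through leftover_pvcs on its own first shows what would be deleted before delete_leftover_pvcs is run.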
Install the Helm chart.
Install using the file downloaded previously:
ncn-mw# helm install -n services cray-console-data ./cray-console-data-1.0.8.tgz
Example output:
NAME: cray-console-data
LAST DEPLOYED: Mon Oct 25 22:44:49 2021
NAMESPACE: services
STATUS: deployed
REVISION: 1
TEST SUITE: None
Verify that all services restart correctly.
Watch the services come back up again.
ncn-mw# watch -n .2 'kubectl -n services get pods | grep cray-console-data'
After a short time, the output should look similar to the following:
cray-console-data-764f9d46b5-vbs7w 2/2 Running 0 5m
cray-console-data-postgres-0 3/3 Running 0 4m
cray-console-data-postgres-1 3/3 Running 0 3m
cray-console-data-postgres-2 3/3 Running 0 2m
It will take a few minutes after these services are back up and running for the console services to settle and rebuild the database.
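Instead of reading the watch output by eye, the pod list could be checked with a small filter. The sketch below assumes the NAME, READY, STATUS column order shown in the example output and expects the header line to be filtered out first, as the grep in the watch command does; all_pods_ready is a hypothetical name.

```shell
# Hypothetical helper: read `kubectl get pods` lines on stdin (NAME READY STATUS ...)
# and succeed only if at least one pod is listed and every pod is Running with
# all of its containers ready (for example 3/3).
all_pods_ready() {
    awk '
        {
            split($2, ready, "/")
            if ($3 != "Running" || ready[1] != ready[2]) bad = 1
            seen = 1
        }
        END { exit (seen && !bad) ? 0 : 1 }
    '
}

# Possible usage:
#   kubectl -n services get pods | grep cray-console-data | all_pods_ready && echo "pods ready"
```

A successful check only shows that the pods are running; the console services still need a few minutes to rebuild the database.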
Query cray-console-operator for a node location.
After a few minutes, query cray-console-operator to find the pod a particular node is connected to.
In the following example, replace the cray-console-operator pod name with the actual name of the running pod, and replace the component name (xname) with an actual node xname on the system.
ncn-mw# kubectl -n services exec -it cray-console-operator-7fdc797f9f-xz8rt -- sh -c '/app/get-node x9000c3s3b0n1'
Example output:
{"podname":"cray-console-node-0"}
This confirms that the cray-console-data service is up and operational.
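The final check could be scripted by looking up the operator pod by name and extracting the podname field from the reply. This is a sketch under assumptions: the function names are hypothetical, the pod lookup takes the first pod whose name contains cray-console-operator without checking that it is Running, and the JSON is parsed with sed only because the reply is the single flat object shown above.

```shell
# Hypothetical helper: read the JSON printed by /app/get-node on stdin and
# print the value of "podname"; prints nothing if the field is absent.
podname_from_json() {
    sed -n 's/.*"podname":[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Find a cray-console-operator pod by name and ask it where a node is connected.
# Usage: console_pod_for_node XNAME
console_pod_for_node() {
    operator=$(kubectl -n services get pods -o name | grep cray-console-operator | head -n 1)
    kubectl -n services exec "$operator" -- sh -c "/app/get-node $1" | podname_from_json
}
```

An empty result from console_pod_for_node suggests the console services have not finished rebuilding the database yet, so retry after a few minutes.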