Below are the service-specific steps required to restore data to a Postgres cluster.
Restore Postgres procedures by service:
In the event that the Keycloak Postgres cluster must be rebuilt and the data restored, then the following procedures are recommended.
Copy the database dump to an accessible location.
postgres-backup
S3 bucket.
These will be needed in the steps below. List the files in the postgres-backup
S3 bucket and if the files exist, download the dump and secrets out of the S3 bucket.
The cray artifacts
CLI can be used list and download the files. Note that the .psql
file contains the database dump and the .manifest file contains the secrets.Setup the CRAY_CREDENTIALS
environment variable to permit simple CLI operations needed while restoring the Keycloak database. See Authenticate an Account with the Command Line.
List the available backups.
ncn-mw# cray artifacts list postgres-backup --format json | jq -r '.artifacts[].Key | select(contains("keycloak"))'
Example output:
keycloak-postgres-2022-09-14T02:10:05.manifest
keycloak-postgres-2022-09-14T02:10:05.psql
Set the environment variables to the name of the backup files.
In order to avoid a
kubectl cp
bug, the dump file will be downloaded with a slightly altered name (:
characters replaced with-
characters).
ncn-mw# MANIFEST=keycloak-postgres-2022-09-14T02:10:05.manifest
ncn-mw# DUMPFILE_SRC=keycloak-postgres-2022-09-14T02:10:05.psql
ncn-mw# DUMPFILE=${DUMPFILE_SRC//:/-}
Download the backup files.
ncn-mw# cray artifacts get postgres-backup "${DUMPFILE_SRC}" "${DUMPFILE}"
ncn-mw# cray artifacts get postgres-backup "${MANIFEST}" "${MANIFEST}"
Unset the CRAY_CREDENTIALS
environment variable:
unset CRAY_CREDENTIALS
rm /tmp/setup-token.json
Set helper variables.
ncn-mw# CLIENT=cray-keycloak
ncn-mw# NAMESPACE=services
ncn-mw# POSTGRESQL=keycloak-postgres
Scale the Keycloak service to 0.
ncn-mw# kubectl scale statefulset "${CLIENT}" -n "${NAMESPACE}" --replicas=0
Wait for the pods to terminate.
ncn-mw# while kubectl get pods -n "${NAMESPACE}" -l app.kubernetes.io/instance="${CLIENT}" | grep -qv NAME ; do
echo " waiting for pods to terminate"; sleep 2
done
Delete the Keycloak Postgres cluster.
ncn-mw# kubectl get postgresql "${POSTGRESQL}" -n "${NAMESPACE}" -o json | jq 'del(.spec.selector)' |
jq 'del(.spec.template.metadata.labels."controller-uid")' | jq 'del(.status)' > postgres-cr.json
ncn-mw# kubectl delete -f postgres-cr.json
Wait for the pods to terminate.
ncn-mw# while kubectl get pods -l "application=spilo,cluster-name=${POSTGRESQL}" -n "${NAMESPACE}" | grep -qv NAME ; do
echo " waiting for pods to terminate"; sleep 2
done
Create a new single instance Keycloak Postgres cluster.
ncn-mw# cp postgres-cr.json postgres-orig-cr.json
ncn-mw# jq '.spec.numberOfInstances = 1' postgres-orig-cr.json > postgres-cr.json
ncn-mw# kubectl create -f postgres-cr.json
Wait for the pod to start.
ncn-mw# while ! kubectl get pods -l "application=spilo,cluster-name=${POSTGRESQL}" -n "${NAMESPACE}" | grep -qv NAME ; do
echo " waiting for pod to start"; sleep 2
done
Wait for the Postgres cluster to start running.
ncn-mw# while [ $(kubectl get postgresql "${POSTGRESQL}" -n "${NAMESPACE}" -o json | jq -r '.status.PostgresClusterStatus') != "Running" ]
do
echo " waiting for postgresql to start running"; sleep 2
done
Copy the database dump file to the Postgres member.
ncn-mw# kubectl cp "./${DUMPFILE}" "${POSTGRESQL}-0:/home/postgres/${DUMPFILE}" -c postgres -n "${NAMESPACE}"
Restore the data.
ncn-mw# kubectl exec "${POSTGRESQL}-0" -c postgres -n "${NAMESPACE}" -it -- psql -U postgres < "${DUMPFILE}"
Errors such as ... already exists
can be ignored; the restore can be considered successful when it completes.
Either update or re-create the keycloak-postgres
secrets.
Update the secrets in Postgres.
If a manual dump was done, and the secrets were not saved, then the secrets in the newly created Postgres cluster will need to be updated.
From the three keycloak-postgres
secrets, collect the password for each Postgres username:
postgres
, service_account
, and standby
.
ncn-mw# for secret in postgres.keycloak-postgres.credentials service-account.keycloak-postgres.credentials \
standby.keycloak-postgres.credentials
do
echo -n "secret ${secret} username & password: "
echo -n "`kubectl get secret "${secret}" -n "${NAMESPACE}" -ojsonpath='{.data.username}' | base64 -d` "
echo `kubectl get secret "${secret}" -n "${NAMESPACE}" -ojsonpath='{.data.password}'| base64 -d`
done
Example output:
secret postgres.keycloak-postgres.credentials username & password: postgres ABCXYZ
secret service-account.keycloak-postgres.credentials username & password: service_account ABC123
secret standby.keycloak-postgres.credentials username & password: standby 123456
kubectl exec
into the Postgres pod.
ncn-mw# kubectl exec "${POSTGRESQL}-0" -n "${NAMESPACE}" -c postgres -it -- bash
Open a Postgres console.
pod# /usr/bin/psql postgres postgres
Update the password for each user to match the values found in the secrets.
Update the password for the postgres
user.
postgres# ALTER USER postgres WITH PASSWORD 'ABCXYZ';
Example of successful output:
ALTER ROLE
Update the password for the service_account
user.
postgres# ALTER USER service_account WITH PASSWORD 'ABC123';
Example of successful output:
ALTER ROLE
Update the password for the standby
user.
postgres# ALTER USER standby WITH PASSWORD '123456';
Example of successful output:
ALTER ROLE
Exit the Postgres console with the \q
command.
Exit the Postgres pod with the exit
command.
Re-create secrets in Kubernetes.
If the Postgres secrets were automatically backed up, then re-create the secrets in Kubernetes.
Delete and re-create the three keycloak-postgres
secrets using the manifest that was copied from S3 earlier.
ncn-mw# kubectl delete secret postgres.keycloak-postgres.credentials \
service-account.keycloak-postgres.credentials standby.keycloak-postgres.credentials -n "${NAMESPACE}"
ncn-mw# kubectl apply -f "${MANIFEST}"
Restart the Postgres cluster.
ncn-mw# kubectl delete pod -n "${NAMESPACE}" "${POSTGRESQL}-0"
Wait for the postgresql
pod to start.
ncn-mw# while ! kubectl get pods -l "application=spilo,cluster-name=${POSTGRESQL}" -n "${NAMESPACE}" | grep -qv NAME ; do
echo " waiting for pods to start"; sleep 2
done
Scale the Postgres cluster back to 3 instances.
ncn-mw# kubectl patch postgresql "${POSTGRESQL}" -n "${NAMESPACE}" --type='json' \
-p='[{"op" : "replace", "path":"/spec/numberOfInstances", "value" : 3}]'
Wait for the postgresql
cluster to start running.
This may take a few minutes to complete.
ncn-mw# while [ $(kubectl get postgresql "${POSTGRESQL}" -n "${NAMESPACE}" -o json | jq -r '.status.PostgresClusterStatus') != "Running" ]
do
echo " waiting for postgresql to start running"; sleep 2
done
Scale the Keycloak service back to 3 replicas.
ncn-mw# kubectl scale statefulset "${CLIENT}" -n "${NAMESPACE}" --replicas=3
Wait for the Keycloak pods to start.
ncn-mw# while [ $(kubectl get pods -n "${NAMESPACE}" -l app.kubernetes.io/instance="${CLIENT}" | grep -cv NAME) != 3 ]
do
echo " waiting for pods to start"; sleep 2
done
Wait for all Keycloak pods to be ready.
If there are pods that do not show that both containers are ready (READY
is 2/2
), then wait a few seconds and
re-run the command until all containers are ready.
ncn-mw# kubectl get pods -n "${NAMESPACE}" -l app.kubernetes.io/instance="${CLIENT}"
Example output:
NAME READY STATUS RESTARTS AGE
cray-keycloak-0 2/2 Running 0 35s
cray-keycloak-1 2/2 Running 0 35s
cray-keycloak-2 2/2 Running 0 35s
Run the keycloak-setup
job to restore the Kubernetes client secrets.
Run the job.
ncn-mw# kubectl get job -n "${NAMESPACE}" -l app.kubernetes.io/instance=cray-keycloak -o json > keycloak-setup.json
ncn-mw# cat keycloak-setup.json | jq '.items[0]' | jq 'del(.metadata.creationTimestamp)' |
jq 'del(.metadata.managedFields)' | jq 'del(.metadata.resourceVersion)' |
jq 'del(.metadata.selfLink)' | jq 'del(.metadata.uid)' | jq 'del(.spec.selector)' |
jq 'del(.spec.template.metadata.labels)' | jq 'del(.status)' | kubectl replace --force -f -
Wait for job to complete.
Check the status of the keycloak-setup
job. If the COMPLETIONS
value is not 1/1
, then wait a few seconds
and run the command again until the COMPLETIONS
value is 1/1
.
ncn-mw# kubectl get jobs -n "${NAMESPACE}" -l app.kubernetes.io/instance=cray-keycloak
Example output:
NAME COMPLETIONS DURATION AGE
keycloak-setup-2 1/1 59s 91s
Run the keycloak-users-localize
job to restore the users and groups in S3 and the Kubernetes ConfigMap.
Run the job.
ncn-mw# kubectl get job -n "${NAMESPACE}" -l app.kubernetes.io/instance=cray-keycloak-users-localize \
-o json > cray-keycloak-users-localize.json
ncn-mw# cat cray-keycloak-users-localize.json | jq '.items[0]' | jq 'del(.metadata.creationTimestamp)' |
jq 'del(.metadata.managedFields)' | jq 'del(.metadata.resourceVersion)' |
jq 'del(.metadata.selfLink)' | jq 'del(.metadata.uid)' | jq 'del(.spec.selector)' |
jq 'del(.spec.template.metadata.labels)' | jq 'del(.status)' | kubectl replace --force -f -
Wait for the job to complete.
Check the status of the cray-keycloak-users-localize
job. If the COMPLETIONS
value is not 1/1
, then wait a
few minutes and run the command again until the COMPLETIONS
value is 1/1
.
ncn-mw# kubectl get jobs -n "${NAMESPACE}" -l app.kubernetes.io/instance=cray-keycloak-users-localize
Example output:
NAME COMPLETIONS DURATION AGE
keycloak-users-localize-2 1/1 45s 49s
Restart the ingress oauth2-proxies
.
Issue the restarts.
ncn-mw# kubectl rollout restart -n "${NAMESPACE}" deployment/cray-oauth2-proxies-customer-access-ingress &&
kubectl rollout restart -n "${NAMESPACE}" deployment/cray-oauth2-proxies-customer-high-speed-ingress &&
kubectl rollout restart -n "${NAMESPACE}" deployment/cray-oauth2-proxies-customer-management-ingress
Wait for the restarts to complete.
ncn-mw# kubectl rollout status -n "${NAMESPACE}" deployment/cray-oauth2-proxies-customer-access-ingress &&
kubectl rollout status -n "${NAMESPACE}" deployment/cray-oauth2-proxies-customer-high-speed-ingress &&
kubectl rollout status -n "${NAMESPACE}" deployment/cray-oauth2-proxies-customer-management-ingress
Verify that the service is working.
The following should return an access_token
for an existing user. Replace the <username>
and <password>
as appropriate.
ncn-mw# curl -s -k -d grant_type=password -d client_id=shasta -d username=<username> -d password=<password> \
https://api-gw-service-nmn.local/keycloak/realms/shasta/protocol/openid-connect/token
In the event that the VCS Postgres cluster must be rebuilt and the data restored, then the following procedures are recommended.
Copy the database dump to an accessible location.
postgres-backup
S3 bucket.
These will be needed in the steps below. List the files in the postgres-backup
S3 bucket and if the files exist, download the dump and secrets out of the S3 bucket.
The cray artifacts
CLI can be used list and download the files. Note that the .psql
file contains the database dump and the .manifest file contains the secrets.List the available backups.
ncn-mw# cray artifacts list postgres-backup --format json | jq -r '.artifacts[].Key | select(contains("vcs"))'
Example output:
gitea-vcs-postgres-2022-09-14T01:10:04.manifest
gitea-vcs-postgres-2022-09-14T01:10:04.psql
Set the environment variables to the name of the backup files.
In order to avoid a
kubectl cp
bug, the dump file will be downloaded with a slightly altered name (:
characters replaced with-
characters).
ncn-mw# MANIFEST=gitea-vcs-postgres-2022-09-14T01:10:04.manifest
ncn-mw# DUMPFILE_SRC=gitea-vcs-postgres-2022-09-14T01:10:04.psql
ncn-mw# DUMPFILE=${DUMPFILE_SRC//:/-}
Download the backup files.
ncn-mw# cray artifacts get postgres-backup "${DUMPFILE_SRC}" "${DUMPFILE}"
ncn-mw# cray artifacts get postgres-backup "${MANIFEST}" "${MANIFEST}"
Set helper variables.
ncn-mw# SERVICE=gitea-vcs
ncn-mw# SERVICELABEL=vcs
ncn-mw# NAMESPACE=services
ncn-mw# POSTGRESQL=gitea-vcs-postgres
Scale the VCS service to 0.
ncn-mw# kubectl scale deployment ${SERVICE} -n "${NAMESPACE}" --replicas=0
Wait for the pods to terminate.
ncn-mw# while kubectl get pods -n "${NAMESPACE}" -l app.kubernetes.io/name="${SERVICELABEL}" | grep -qv NAME ; do
echo " waiting for pods to terminate"; sleep 2
done
Delete the VCS Postgres cluster.
ncn-mw# kubectl get postgresql "${POSTGRESQL}" -n "${NAMESPACE}" -o json | jq 'del(.spec.selector)' |
jq 'del(.spec.template.metadata.labels."controller-uid")' | jq 'del(.status)' > postgres-cr.json
ncn-mw# kubectl delete -f postgres-cr.json
Wait for the pods to terminate.
ncn-mw# while kubectl get pods -l "application=spilo,cluster-name=${POSTGRESQL}" -n "${NAMESPACE}" | grep -qv NAME ; do
echo " waiting for pods to terminate"; sleep 2
done
Create a new single instance VCS Postgres cluster.
ncn-mw# cp postgres-cr.json postgres-orig-cr.json
ncn-mw# jq '.spec.numberOfInstances = 1' postgres-orig-cr.json > postgres-cr.json
ncn-mw# kubectl create -f postgres-cr.json
Wait for the pod to start.
ncn-mw# while ! kubectl get pods -l "application=spilo,cluster-name=${POSTGRESQL}" -n "${NAMESPACE}" | grep -qv NAME ; do
echo " waiting for pod to start"; sleep 2
done
Wait for the Postgres cluster to start running.
ncn-mw# while [ $(kubectl get postgresql "${POSTGRESQL}" -n "${NAMESPACE}" -o json | jq -r '.status.PostgresClusterStatus') != "Running" ]
do
echo " waiting for postgresql to start running"; sleep 2
done
Copy the database dump file to the Postgres member.
ncn-mw# kubectl cp "./${DUMPFILE}" "${POSTGRESQL}-0:/home/postgres/${DUMPFILE}" -c postgres -n services
Restore the data.
ncn-mw# kubectl exec "${POSTGRESQL}-0" -c postgres -n services -it -- psql -U postgres < "${DUMPFILE}"
Errors such as ... already exists
can be ignored; the restore can be considered successful when it completes.
Either update or re-create the gitea-vcs-postgres
secrets.
Update the secrets in Postgres.
If a manual dump was done, and the secrets were not saved, then the secrets in the newly created Postgres cluster will need to be updated.
From the three gitea-vcs-postgres
secrets, collect the password for each Postgres username:
postgres
, service_account
, and standby
.
ncn-mw# for secret in postgres.gitea-vcs-postgres.credentials service-account.gitea-vcs-postgres.credentials \
standby.gitea-vcs-postgres.credentials
do
echo -n "secret ${secret} username & password: "
echo -n "`kubectl get secret "${secret}" -n "${NAMESPACE}" -ojsonpath='{.data.username}' | base64 -d` "
echo `kubectl get secret "${secret}" -n "${NAMESPACE}" -ojsonpath='{.data.password}'| base64 -d`
done
Example output:
secret postgres.gitea-vcs-postgres.credentials username & password: postgres ABCXYZ
secret service-account.gitea-vcs-postgres.credentials username & password: service_account ABC123
secret standby.gitea-vcs-postgres.credentials username & password: standby 123456
kubectl exec
into the Postgres pod.
ncn-mw# kubectl exec "${POSTGRESQL}-0" -n "${NAMESPACE}" -c postgres -it -- bash
Open a Postgres console.
pod# /usr/bin/psql postgres postgres
Update the password for each user to match the values found in the secrets.
Update the password for the postgres
user.
postgres# ALTER USER postgres WITH PASSWORD 'ABCXYZ';
Example of successful output:
ALTER ROLE
Update the password for the service_account
user.
postgres# ALTER USER service_account WITH PASSWORD 'ABC123';
Example of successful output:
ALTER ROLE
Update the password for the standby
user.
postgres# ALTER USER standby WITH PASSWORD '123456';
Example of successful output:
ALTER ROLE
Exit the Postgres console with the \q
command.
Exit the Postgres pod with the exit
command.
Re-create secrets in Kubernetes.
If the Postgres secrets were auto-backed up, then re-create the secrets in Kubernetes.
Delete and re-create the three gitea-vcs-postgres
secrets using the manifest that was copied from S3 earlier.
ncn-mw# kubectl delete secret postgres.gitea-vcs-postgres.credentials \
service-account.gitea-vcs-postgres.credentials standby.gitea-vcs-postgres.credentials -n services
ncn-mw# kubectl apply -f "${MANIFEST}"
Restart the Postgres cluster.
ncn-mw# kubectl delete pod -n "${NAMESPACE}" "${POSTGRESQL}-0"
Wait for the postgresql
pod to start.
ncn-mw# while ! kubectl get pods -l "application=spilo,cluster-name=${POSTGRESQL}" -n "${NAMESPACE}" | grep -qv NAME ; do
echo " waiting for pods to start"; sleep 2
done
Scale the Postgres cluster back to 3 instances.
ncn-mw# kubectl patch postgresql "${POSTGRESQL}" -n "${NAMESPACE}" --type='json' \
-p='[{"op" : "replace", "path":"/spec/numberOfInstances", "value" : 3}]'
Wait for the postgresql
cluster to start running.
ncn-mw# while [ $(kubectl get postgresql "${POSTGRESQL}" -n "${NAMESPACE}" -o json | jq -r '.status.PostgresClusterStatus') != "Running" ]
do
echo " waiting for postgresql to start running"; sleep 2
done
Scale the Gitea service back up.
ncn-mw# kubectl scale deployment ${SERVICE} -n "${NAMESPACE}" --replicas=1
Wait for the Gitea pods to start.
ncn-mw# while ! kubectl get pods -n "${NAMESPACE}" -l app.kubernetes.io/name="${SERVICELABEL}" | grep -qv NAME ; do
echo " waiting for pods to start"; sleep 2
done