Cray System Management Documentation > Cray System Management (CSM) Administration Guide > kubernetes > Troubleshoot Postgres Database

Troubleshoot Postgres Database

This page contains general Postgres troubleshooting topics.

The patronictl tool
Database unavailable
Database disk full
Replication lagging
Postgres status SyncFailed
- Check all the postgresql resources
Cluster member start failed
- Determine if invalid permissions
- Recover from a start failed member
Cluster member missing
- Determine if a cluster member is missing
- Recover from a missing member
Postgres leader missing
- Determine if the Postgres leader is missing
- Recover from a missing Postgres leader

The `patronictl` tool

The patronictl tool is used to call a REST API that interacts with Postgres databases. It handles a variety of tasks, such as listing cluster members and the replication status, configuring and restarting databases, and more.

(ncn-mw#) The tool is installed in the database containers:

kubectl exec -it -n services keycloak-postgres-0 -c postgres -- su postgres

(postgres-container#) Use the following command for more information on the patronictl command:

patronictl --help

Database unavailable

If there are no endpoints for the main service, then Patroni will mark the database as unavailable.

(ncn-mw#) The following is an example for keycloak-postgres where no endpoints are listed, which means the database is unavailable.

kubectl get endpoints keycloak-postgres -n services

Example output:

NAME                ENDPOINTS         AGE
keycloak-postgres   <none>            3d22h

If the database is unavailable, then check if Database disk full is the cause of the issue.

(ncn-mw#) Otherwise, check the postgres-operator logs for errors.

kubectl logs -l app.kubernetes.io/name=postgres-operator -n services

Database disk full

(ncn-mw#) The following is an example for keycloak-postgres. One cluster member is failing to start because of a full pgdata disk. This was likely due to replication issues, which caused the pg_wal files to grow.

POSTGRESQL=keycloak-postgres
NAMESPACE=services
kubectl exec "${POSTGRESQL}-1" -c postgres -it -n ${NAMESPACE} -- patronictl list

Example output:

+-------------------+---------------------+------------+--------+--------------+----+-----------+
|      Cluster      |        Member       |    Host    |  Role  |    State     | TL | Lag in MB |
+-------------------+---------------------+------------+--------+--------------+----+-----------+
| keycloak-postgres | keycloak-postgres-0 | 10.42.0.11 |        | start failed |    |   unknown |
| keycloak-postgres | keycloak-postgres-1 | 10.44.0.7  |        |   running    |  4 |         0 |
| keycloak-postgres | keycloak-postgres-2 | 10.36.0.40 | Leader |   running    |  4 |           |
+-------------------+---------------------+------------+--------+--------------+----+-----------+

for i in {0..2}; do echo "${POSTGRESQL}-${i}:"; kubectl exec "${POSTGRESQL}-${i}" -n ${NAMESPACE} -c postgres -- df -h pgdata; done

Example output:

keycloak-postgres-0:
Filesystem      Size  Used Avail Use% Mounted on
/dev/sde        976M  960M     0 100% /home/postgres/pgdata
keycloak-postgres-1:
Filesystem      Size  Used Avail Use% Mounted on
/dev/rbd12      976M  152M  809M  16% /home/postgres/pgdata
keycloak-postgres-2:
Filesystem      Size  Used Avail Use% Mounted on
/dev/rbd3       976M  136M  825M  15% /home/postgres/pgdata

kubectl logs "${POSTGRESQL}-0" -n ${NAMESPACE} -c postgres | grep FATAL

Example output:

2021-07-14 17:52:48 UTC [30495]: [1-1] 60ef2470.771f 0     FATAL:  could not write lock file "postmaster.pid": No space left on device

kubectl exec "${POSTGRESQL}-0" -n ${NAMESPACE} -c postgres -it -- du -h --max-depth 1 /home/postgres/pgdata/pgroot/data/pg_wal

To recover the cluster member that had failed to start because of disk pressure, attempt to reclaim some space on the pgdata disk.

kubectl exec into that pod, copy the logs off (optional), and then clear the logs in order to recover some disk space. Finally, restart the Postgres cluster and postgres-operator.

(ncn-mw#) Copy off the logs (optional).

kubectl cp "${POSTGRESQL}-1":/home/postgres/pgdata/pgroot/pg_log /tmp -c postgres -n ${NAMESPACE}

Clear the logs.

(ncn-mw#) Open a shell in the database container.

kubectl exec "${POSTGRESQL}-1" -n ${NAMESPACE} -c postgres -it -- bash

(postgres-container#) Clear the logs and exit the container.

for i in {0..7}; do > /home/postgres/pgdata/pgroot/pg_log/postgresql-$i.csv; done
exit

(ncn-mw#) Restart the pods by deleting them.

kubectl delete pod "${POSTGRESQL}-0" "${POSTGRESQL}-1" "${POSTGRESQL}-2" -n ${NAMESPACE}

(ncn-mw#) Restart the operator by deleting it.

kubectl delete pod -l app.kubernetes.io/name=postgres-operator -n services

If disk issues persist or exist on multiple nodes and the above does not resolve the issue, then see the Recover from Postgres WAL Event procedure.

Replication lagging

Postgres replication lag can be detected with Prometheus alerts and alert notifications (See Configure Prometheus Email Alert Notifications). If replication lag is not caught early, it can cause the disk mounted on /home/postgres/pgdata to fill up and the database to stop running. If this issue is caught before the database stops, it can be easily remediated using a patronictl command to reinitialize the lagging cluster member.

Check if replication is working

When services have a Postgres cluster of pods, they need to be able to replicate data between them. When the pods are not able to replicate data, the database will become full. The patronictl list command will show the status of replication.

Replication is working

(ncn-mw#) The following is an example where replication is working:

kubectl exec keycloak-postgres-0 -c postgres -n services -it -- patronictl list

Example output:

+-------------------+---------------------+------------+--------+---------+----+-----------+
|      Cluster      |        Member       |    Host    |  Role  |  State  | TL | Lag in MB |
+-------------------+---------------------+------------+--------+---------+----+-----------+
| keycloak-postgres | keycloak-postgres-0 | 10.40.0.23 | Leader | running |  1 |           |
| keycloak-postgres | keycloak-postgres-1 | 10.42.0.25 |        | running |  1 |         0 |
| keycloak-postgres | keycloak-postgres-2 | 10.42.0.29 |        | running |  1 |         0 |
+-------------------+---------------------+------------+--------+---------+----+-----------+

Replication is not working

The following is an example where replication is broken:

+-------------------+---------------------+--------------+--------+----------+----+-----------+
|      Cluster      |        Member       |     Host     |  Role  |  State   | TL | Lag in MB |
+-------------------+---------------------+--------------+--------+----------+----+-----------+
| keycloak-postgres | keycloak-postgres-0 | 10.42.10.22  |        | starting |    |   unknown |
| keycloak-postgres | keycloak-postgres-1 | 10.40.11.191 | Leader | running  | 47 |           |
| keycloak-postgres | keycloak-postgres-2 | 10.40.11.190 |        | running  | 14 |       608 |
+-------------------+---------------------+--------------+--------+----------+----+-----------+

Recover replication

In the event that a state of broken Postgres replication persists and the space allocated for the WAL files fills up, the affected database will likely shut down and create a state where it can be very difficult to recover. This can impact the reliability of the related service and may require that it be redeployed with data repopulation procedures. If replication lag is caught and remediated before the database shuts down, replication can be recovered using patronictl reinit.

A reinitialize will get the lagging replica member re-synced and replicating again. This should be done as soon as replication lag is detected. In the preceding example, keycloak-postgres-0 and keycloak-postgres-2 were not replicating properly (unknown or non-zero lag). To remediate, kubectl exec into the leader pod and use patronictl reinit <cluster> <lagging cluster member> to reinitialize the lagging member(s).

For example:

(ncn-mw#) Open a shell into the Postgres container.

kubectl exec keycloak-postgres-1 -c postgres -n services -it -- bash

(postgres-container#) Reinitialize the first lagging replica member.

patronictl reinit keycloak-postgres keycloak-postgres-0

Example output:

Are you sure you want to reinitialize members keycloak-postgres-0? [y/N]: y
Failed: reinitialize for member keycloak-postgres-0, status code=503, (restarting after failure already in progress)
Do you want to cancel it and reinitialize anyway? [y/N]: y
Success: reinitialize for member keycloak-postgres-0

(postgres-container#) Reinitialize the next lagging replica member.

patronictl reinit keycloak-postgres keycloak-postgres-2

Example output:

Are you sure you want to reinitialize members keycloak-postgres-2? [y/N]: y
Failed: reinitialize for member keycloak-postgres-2, status code=503, (restarting after failure already in progress)
Do you want to cancel it and reinitialize anyway? [y/N]: y
Success: reinitialize for member keycloak-postgres-2

(ncn-mw#) Verify that replication has recovered.

kubectl exec keycloak-postgres-0 -c postgres -n services -it -- patronictl list

Example output:

+-------------------+---------------------+--------------+--------+---------+----+-----------+
|      Cluster      |        Member       |     Host     |  Role  |  State  | TL | Lag in MB |
+-------------------+---------------------+--------------+--------+---------+----+-----------+
| keycloak-postgres | keycloak-postgres-0 | 10.42.10.22  |        | running | 47 |         0 |
| keycloak-postgres | keycloak-postgres-1 | 10.40.11.191 | Leader | running | 47 |           |
| keycloak-postgres | keycloak-postgres-2 | 10.40.11.190 |        | running | 47 |         0 |
+-------------------+---------------------+--------------+--------+---------+----+-----------+

Troubleshooting

If patronictl reinit fails with Failed: reinitialize for member … status code=503, (Cluster has no leader, can not reinitialize):

(ncn-mw#) For example:

kubectl exec -it cray-console-data-postgres-0 -c postgres -n services -- patronictl reinit cray-console-data-postgres cray-console-data-postgres-1

Example output:

+ Cluster: cray-console-data-postgres (7072784871993835594) ---+----+-----------+
|            Member            |    Host    |  Role  |  State  | TL | Lag in MB |
+------------------------------+------------+--------+---------+----+-----------+
| cray-console-data-postgres-0 | 10.39.0.74 | Leader | running |  1 |           |
| cray-console-data-postgres-1 | 10.36.0.37 |        | running |  1 |        16 |
| cray-console-data-postgres-2 | 10.32.0.1  |        | running |  1 |        16 |
+------------------------------+------------+--------+---------+----+-----------+
Are you sure you want to reinitialize members cray-console-data-postgres-1? [y/N]: y
Failed: reinitialize for member cray-console-data-postgres-1, status code=503, (Cluster has no leader, can not reinitialize)

Delete the Postgres leader pod and wait for the leader to restart.

(ncn-mw#) Delete the leader pod.

kubectl delete pod cray-console-data-postgres-0 -n services

(ncn-mw#) Wait for the leader to restart.

Re-run the following command until it succeeds and reports that the leader pod is running.

kubectl exec keycloak-postgres-0 -c postgres -n services -it -- patronictl list

Example output:

+ Cluster: cray-console-data-postgres (7072784871993835594) ---+----+-----------+
|            Member            |    Host    |  Role  |  State  | TL | Lag in MB |
+------------------------------+------------+--------+---------+----+-----------+
| cray-console-data-postgres-0 | 10.39.0.80 | Leader | running |  2 |           |
| cray-console-data-postgres-1 | 10.36.0.37 |        | running |  1 |        49 |
| cray-console-data-postgres-2 | 10.32.0.1  |        | running |  1 |        49 |
+------------------------------+------------+--------+---------+----+-----------+

Re-run Recover replication to reinit any lagging members.

If the reinit still fails, then delete member pods that are still reporting lag. This should clear up any remaining lag.

(ncn-mw#) Determine which pods are reporting lag.

kubectl exec cray-console-postgres-0 -c postgres -n services -it -- patronictl list

Example output:

+ Cluster: cray-console-data-postgres (7072784871993835594) ---+----+-----------+
|            Member            |    Host    |  Role  |  State  | TL | Lag in MB |
+------------------------------+------------+--------+---------+----+-----------+
| cray-console-data-postgres-0 | 10.39.0.80 | Leader | running |  2 |           |
| cray-console-data-postgres-1 | 10.36.0.37 |        | running |  1 |        49 |
| cray-console-data-postgres-2 | 10.32.0.1  |        | running |  1 |        49 |
+------------------------------+------------+--------+---------+----+-----------+

(ncn-mw#) Delete the pods that are still reporting lag.

kubectl delete pod cray-console-data-postgres-1 cray-console-data-postgres-2 -n services

(ncn-mw#) Once the pods restart, verify that the lag has resolved.

kubectl exec cray-console-postgres-0 -c postgres -n services -it -- patronictl list

Example output:

+ Cluster: cray-console-data-postgres (7072784871993835594) ---+----+-----------+
|            Member            |    Host    |  Role  |  State  | TL | Lag in MB |
+------------------------------+------------+--------+---------+----+-----------+
| cray-console-data-postgres-0 | 10.39.0.80 | Leader | running |  2 |           |
| cray-console-data-postgres-1 | 10.36.0.37 |        | running |  2 |         0 |
| cray-console-data-postgres-2 | 10.32.0.1  |        | running |  2 |         0 |
+------------------------------+------------+--------+---------+----+-----------+

If a cluster member is stopped after a successful reinitialization, check for pg_internal.init.* files that may need to be cleaned up. This can occur if the pgdata disk was full prior to the reinitialization, leaving truncated pg_internal.init.* files in the pgdata directory.

(ncn-mw#) Determine if any pods are stopped.

kubectl exec keycloak-postgres-0 -c postgres -n services -it -- patronictl list

Example output:

+-------------------+---------------------+--------------+--------+---------+----+-----------+
|      Cluster      |        Member       |     Host     |  Role  |  State  | TL | Lag in MB |
+-------------------+---------------------+--------------+--------+---------+----+-----------+
| keycloak-postgres | keycloak-postgres-0 | 10.42.10.22  |        | running | 47 |         0 |
| keycloak-postgres | keycloak-postgres-1 | 10.40.11.191 | Leader | running | 47 |           |
| keycloak-postgres | keycloak-postgres-2 | 10.40.11.190 |        | stopped |    |   unknown |
+-------------------+---------------------+--------------+--------+---------+----+-----------+

Check the most recent Postgres log in the stopped pod.

(ncn-mw#) Open a shell into that pod that is stopped.

kubectl exec keycloak-postgres-2 -c postgres -n services -it -- bash

(postgres-container#) Check the most recent Postgres log for any invalid segment number 0 errors relating to pg_internal.init.* files.

LOG=`ls -t /home/postgres/pgdata/pgroot/pg_log/*.csv | head -1`
grep pg_internal.init "$LOG" | grep "invalid segment number 0" | tail -1

Example output:

2022-02-01 16:59:35.529 UTC,"standby","",227600,"127.0.0.1:42264",61f966f7.37910,3,"sending backup ""pg_basebackup base backup""",2022-02-01 16:59:35 UTC,7/0,0,ERROR,XX000,"invalid segment number 0 in file ""pg_internal.init.2239188""",,,,,,,,,"pg_basebackup"

(postgres-container#) If the check in the previous step finds such files, then first find any zero length pg_internal.init.* files.

This command should be run inside the kubectl exec session from the previous steps.
```
find /home/postgres/pgdata -name pg_internal.init.* -size 0
```
Example output:
```
./pgroot/data/base/16622/pg_internal.init.2239004
...
./pgroot/data/base/16622/pg_internal.init.2239010
```
(postgres-container#) Delete the zero length pg_internal.init.* files.

This command should be run inside the kubectl exec session from the previous steps. Double check the syntax of the command in this step before executing it.
```
find /home/postgres/pgdata -name pg_internal.init.* -size 0 -exec rm {} \;
```

(postgres-container#) Find any non-zero length pg_internal.init.* files that were truncated when the file system filled up.

This command should be run inside the kubectl exec session from the previous step.

grep pg_internal.init $LOG | grep "invalid segment number 0" | tail -1

Example output:

2022-02-01 16:59:35.529 UTC,"standby","",227600,"127.0.0.1:42264",61f966f7.37910,3,"sending backup ""pg_basebackup base backup""",2022-02-01 16:59:35 UTC,7/0,0,ERROR,XX000,"invalid segment number 0 in file ""pg_internal.init.2239188""",,,,,,,,,"pg_basebackup"

(postgres-container#) Locate the non-zero length pg_internal.init.* file.

This command should be run inside the kubectl exec session from the previous step.
```
find ~/pgdata -name pg_internal.init.2239188
```
Example output:
```
/home/postgres/pgdata/pgroot/data/base/16622/pg_internal.init.2239188
```
(postgres-container#) Delete (or move to a different location) the non-zero length pg_internal.init.* file.

This command should be run inside the kubectl exec session from the previous step.
```
rm -v /home/postgres/pgdata/pgroot/data/base/16622/pg_internal.init.2239188
```
Repeat the above steps to find, locate, and delete non-zero length pg_internal.init.* files until there are no more new invalid segment number 0 messages.

(ncn-mw#) Verify that the cluster member has started.

kubectl exec keycloak-postgres-0 -c postgres -n services -it -- patronictl list

Example output:

+-------------------+---------------------+--------------+--------+---------+----+-----------+
|      Cluster      |        Member       |     Host     |  Role  |  State  | TL | Lag in MB |
+-------------------+---------------------+--------------+--------+---------+----+-----------+
| keycloak-postgres | keycloak-postgres-0 | 10.42.10.22  |        | running | 47 |         0 |
| keycloak-postgres | keycloak-postgres-1 | 10.40.11.191 | Leader | running | 47 |           |
| keycloak-postgres | keycloak-postgres-2 | 10.40.11.190 |        | running | 47 |         0 |
+-------------------+---------------------+--------------+--------+---------+----+-----------+

Setup alerts for replication lag

Alerts exist in Prometheus for the following:

PostgresqlReplicationLagSMA
PostgresqlReplicationLagServices
PostgresqlFollowerReplicationLagSMA
PostgresqlFollowerReplicationLagServices

When alert notifications are configured, replication issues can be detected quickly. If the replication issue persists such that the database becomes unavailable, recovery will likely be much more involved. Catching such issues as soon as possible is desired. See Configure Prometheus Email Alert Notifications.

Postgres status `SyncFailed`

Check all the `postgresql` resources

Check for any postgresql resource that has a STATUS of SyncFailed. SyncFailed generally means that there is something between the postgres-operator and the Postgres cluster that is out of sync. This does not always mean that the cluster is unhealthy. Check the postgres-operator logs for messages in order to further determine the root cause of the issue.

Other STATUS values such as Updating are a non-issue. It is expected that this will eventually change to Running or possibly SyncFailed, if the postgres-operator encounters issues syncing updates to the postgresql cluster.

(ncn-mw#) Check for any postgresql resource that has a STATUS of SyncFailed.

kubectl get postgresql -A

Example output:

NAMESPACE   NAME                         TEAM                VERSION   PODS   VOLUME   CPU-REQUEST   MEMORY-REQUEST   AGE     STATUS
services    cray-console-data-postgres   cray-console-data   11        3      2Gi                                     4h10m   Running
services    cray-sls-postgres            cray-sls            11        3      1Gi                                     4h12m   SyncFailed
services    cray-smd-postgres            cray-smd            11        3      30Gi     500m          8Gi              4h12m   Updating
services    gitea-vcs-postgres           gitea-vcs           11        3      50Gi                                    4h11m   Running
services    keycloak-postgres            keycloak            11        3      1Gi                                     4h13m   Running
spire       spire-postgres               spire               11        3      20Gi     1             4Gi              4h10m   Running

(ncn-mw#) Find the the postgres-operator pod name.

kubectl get pods -l app.kubernetes.io/name=postgres-operator -n services

Example output:

NAME                                      READY   STATUS    RESTARTS   AGE
cray-postgres-operator-6fffc48b4c-mqz7z   2/2     Running   0          5h26m

(ncn-mw#) Check the logs for the postgres-operator.

kubectl logs cray-postgres-operator-6fffc48b4c-mqz7z -n services -c postgres-operator | grep -i sync | grep -i msg

Case 1: `some persistent volumes are not compatible with existing resizing providers`

Case 1: Symptom

msg="could not sync cluster: could not sync persistent volumes: could not sync volumes: could not resize EBS volumes: some persistent volumes are not compatible with existing resizing providers"

Case 1: Details

This generally means that the postgresql resource was updated to change the volume size from the Postgres operator’s perspective, but the additional step to resize the actual PVCs was not done so the operator and the Postgres cluster are not able to sync the resize change. The cluster is still healthy, but to complete the resize of the underlying Postgres PVCs, additional steps are needed.

The following example assumes that cray-smd-postgres is in SyncFailed and the volume size was recently increased to 100Gi (possibly by editing the volume size of postgresql cray-smd-postgres resource), but the pgdata-cray-smd-postgres PVC’s storage capacity was not updated to align with the change.

(ncn-mw#) To confirm that this is the case:

kubectl get postgresql cray-smd-postgres -n services -o jsonpath="{.spec.volume.size}"

Example output:

100Gi

kubectl get pvc -n services -l application=spilo,cluster-name=cray-smd-postgres

Example output:

NAME                         STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS           AGE
pgdata-cray-smd-postgres-0   Bound    pvc-020cf339-e372-46ae-bc37-de2b55320e88   30Gi       RWO            k8s-block-replicated   70m
pgdata-cray-smd-postgres-1   Bound    pvc-3d42598a-188e-4301-a58e-0f0ce3944c89   30Gi       RWO            k8s-block-replicated   27m
pgdata-cray-smd-postgres-2   Bound    pvc-0d659080-7d39-409a-9ee5-1a1806971054   30Gi       RWO            k8s-block-replicated   27m

(ncn-mw#) To resolve this SyncFailed case, resize the pgdata PVCs for the selected Postgres cluster. Create the following function in the shell and execute the function by calling it with the appropriate arguments. For this example the pgdata-cray-smd-postgres PVCs will be resized to 100Gi to match that of the postgresql cray-smd-postgres volume size.

function resize-postgresql-pvc
{
    POSTGRESQL=$1
    PGDATA=$2
    NAMESPACE=$3
    PGRESIZE=$4

    # Check for required arguments
    if [ $# -ne 4 ]; then
        echo "Illegal number of parameters ($#). Function requires exactly 4 arguments."
        exit 2
    fi

    ## Check that PGRESIZE matches current postgresql volume size
    postgresql_volume_size=$(kubectl get postgresql "${POSTGRESQL}" -n "${NAMESPACE}" -o jsonpath="{.spec.volume.size}")
    if [ "${postgresql_volume_size}" != "${PGRESIZE}" ]; then
         echo "Invalid resize ${PGRESIZE}, expected ${postgresql_volume_size}"
         exit 2
    fi

    ## Scale the postgres cluster to 1 member
    kubectl patch postgresql "${POSTGRESQL}" -n "${NAMESPACE}" --type='json' -p='[{"op" : "replace", "path":"/spec/numberOfInstances", "value" : 1}]'
    while [ $(kubectl get pods -l "application=spilo,cluster-name=${POSTGRESQL}" -n "${NAMESPACE}" | grep -v NAME | wc -l) != 1 ] ; do
        echo "  waiting for pods to terminate"
        sleep 2
    done

    ## Delete the inactive PVCs, resize the active PVC, and wait for the resize to complete
    kubectl delete pvc "${PGDATA}-1" "${PGDATA}-2" -n "${NAMESPACE}"
    kubectl patch -p '{"spec": {"resources": {"requests": {"storage": "'${PGRESIZE}'"}}}}' "pvc/${PGDATA}-0" -n "${NAMESPACE}"
    while [ -z '$(kubectl describe pvc "{PGDATA}-0" -n "${NAMESPACE}" | grep FileSystemResizeSuccessful' ] ; do
        echo "  waiting for PVC to resize"
        sleep 2
    done

    ## Scale the postgres cluster back to 3 members
    kubectl patch postgresql "${POSTGRESQL}" -n "${NAMESPACE}" --type='json' -p='[{"op" : "replace", "path":"/spec/numberOfInstances", "value" : 3}]'
    while [ $(kubectl get pods -l "application=spilo,cluster-name=${POSTGRESQL}" -n "${NAMESPACE}" | grep -v NAME | grep -c "Running") != 3 ] ; do
        echo "  waiting for pods to restart"
        sleep 2
    done
}

resize-postgresql-pvc cray-smd-postgres pgdata-cray-smd-postgres services 100Gi

In order to persist any Postgres PVC storage volume size changes, it is necessary that this change also be made to the customer-managed customizations.yaml file. See the Postgres PVC Resize information in the Post Install Customizations document.

Case 2: `could not init db connection`

Case 2: Symptom

msg="could not sync cluster: could not sync roles: could not init db connection: could not init db connection: still failing after 8 retries"

Case 2: Details

This generally means that some state in the Postgres operator is out of sync with that of the postgresql cluster, resulting in database connection issues.

Case 2: Restart the Postgres operator

To resolve this SyncFailed case, restarting the Postgres operator by deleting the pod may clear up the issue.

(ncn-mw#) Delete the pod.

kubectl delete pod -l app.kubernetes.io/name=postgres-operator -n services

(ncn-mw#) Wait for the postgres-operator to restart.

kubectl get pods -l app.kubernetes.io/name=postgres-operator -n services

Example output of restarted pod:

NAME                                      READY   STATUS    RESTARTS   AGE
cray-postgres-operator-6fffc48b4c-mqz7z   2/2     Running   0           6m

Case 2: Restart the cluster and operator

If the database connection has been down for a long period of time and the SyncFailed persists after the above steps, a restart of the cluster and the postgres-operator may be needed for the service to reconnect to the Postgres cluster. For example, if the cray-gitea service is not able to connect to the Postgres database and the connection has been failing for many hours, restart the cluster and operator.

(ncn-mw#) Set necessary variables.

Replace the values of these variables for the appropriate ones for the particular cluster being remediated.
```
CLIENT=gitea-vcs
POSTGRESQL=gitea-vcs-postgres
NAMESPACE=services
```
(ncn-mw#) Scale the service to 0. (for different services this may be a statefulset instead of a deployment).
```
kubectl scale deployment ${CLIENT} -n ${NAMESPACE} --replicas=0
```

(ncn-mw#) Restart the Postgres cluster and the postgres-operator.

kubectl delete pod "${POSTGRESQL}-0" "${POSTGRESQL}-1" "${POSTGRESQL}-2" -n ${NAMESPACE}
kubectl delete pods -n services -lapp.kubernetes.io/name=postgres-operator
while [ $(kubectl get postgresql ${POSTGRESQL} -n ${NAMESPACE} -o json | jq -r '.status.PostgresClusterStatus') != "Running" ]; do
    echo "waiting for ${POSTGRESQL} to start running"; sleep 2
done

(ncn-mw#) Scale the service back to 1 (for different services this may be --replicas=3).
```
kubectl scale deployment ${CLIENT} -n ${NAMESPACE} --replicas=1
```

Case 3: `password authentication failed for user`

Case 3: Symptom

msg="error while syncing cluster state: could not sync roles: could not init db connection: could not init db connection: pq: password authentication failed for user \<username\>"

Case 3: Details

This generally means that the password for the given user is not the same as that specified in the Kubernetes secret. This can occur if the postgresql cluster was rebuilt and the data was restored, leaving the Kubernetes secrets out of sync with the Postgres cluster. To resolve this SyncFailed case, gather the username and password for the credential from Kubernetes, and update the database with these values. For example, if the user postgres is failing to authenticate between the cray-smd services and the cray-smd-postgres cluster, then get the password for the postgres user from the Kubernetes secret and update the password in the database.

(ncn-mw#) Set necessary variables.

CLIENT=cray-smd
POSTGRESQL=cray-smd-postgres
NAMESPACE=services

(ncn-mw#) Scale the service to 0.

kubectl scale deployment ${CLIENT} -n ${NAMESPACE} --replicas=0
while [ $(kubectl get pods -n ${NAMESPACE} -l app.kubernetes.io/name="${CLIENT}" | grep -v NAME | wc -l) != 0 ] ; do
    echo "  waiting for pods to terminate"
    sleep 2
done

(ncn-mw#) Determine what secrets are associated with the postgresql credentials.

kubectl get secrets -n ${NAMESPACE} | grep "${POSTGRESQL}.credentials"

Example output:

services            hmsdsuser.cray-smd-postgres.credentials                       Opaque                                2      31m
services            postgres.cray-smd-postgres.credentials                        Opaque                                2      31m
services            service-account.cray-smd-postgres.credentials                 Opaque                                2      31m
services            standby.cray-smd-postgres.credentials                         Opaque                                2      31m

Gather the decoded username and password for the user that is failing to authenticate.
1. (ncn-mw#) Save the name of the secret with the failing authentication in a variable.
  
  Replace the secret name in this command with the secret determined in the previous step.
```
SECRET=postgres.cray-smd-postgres.credentials
```
2. (ncn-mw#) Decode the username.
```
kubectl get secret ${SECRET} -n ${NAMESPACE} -ojsonpath='{.data.username}' | base64 -d
```
  Example output:
```
postgres
```
3. (ncn-mw#) Decode the password.
```
kubectl get secret ${SECRET} -n ${NAMESPACE} -ojsonpath='{.data.password}'| base64 -d
```
  Example output:
```
ABCXYZ
```

Update the username and password in the database.

(ncn-mw#) Determine which pod is the leader.

kubectl exec "${POSTGRESQL}-0" -n ${NAMESPACE} -c postgres -it -- patronictl list

Example output:

+-------------------+---------------------+------------+--------+---------+----+-----------+
|      Cluster      |        Member       |    Host    |  Role  |  State  | TL | Lag in MB |
+-------------------+---------------------+------------+--------+---------+----+-----------+
| cray-smd-postgres | cray-smd-postgres-0 | 10.42.0.25 | Leader | running |  1 |           |
| cray-smd-postgres | cray-smd-postgres-1 | 10.44.0.34 |        | running |    |         0 |
| cray-smd-postgres | cray-smd-postgres-2 | 10.36.0.44 |        | running |    |         0 |
+-------------------+---------------------+------------+--------+---------+----+-----------+

Update the username and password in the database in the Postgres leader container.
1. (ncn-mw#) Open a shell into the leader container.
```
POSTGRES_LEADER=cray-smd-postgres-0
kubectl exec ${POSTGRES_LEADER} -n ${NAMESPACE} -c postgres -it -- bash
```
2. (postgres-container#) Open Postgres.
```
/usr/bin/psql postgres postgres
```
3. (postgres#) Update the username and password in the database.
```
ALTER USER postgres WITH PASSWORD 'ABCXYZ';
```
  On success, output resembles the following:
```
ALTER ROLE
```

(ncn-mw#) Restart the postgresql cluster.

kubectl delete pod "${POSTGRESQL}-0" "${POSTGRESQL}-1" "${POSTGRESQL}-2" -n ${NAMESPACE}
while [ $(kubectl get postgresql ${POSTGRESQL} -n ${NAMESPACE} -o json | jq -r '.status.PostgresClusterStatus') != "Running" ]; do
    echo "waiting for ${POSTGRESQL} to start running"
    sleep 2
done

(ncn-mw#) Scale the service back to 3.

kubectl scale deployment ${CLIENT} -n ${NAMESPACE} --replicas=3
while [ $(kubectl get pods -n ${NAMESPACE} -l app.kubernetes.io/name="${CLIENT}" | grep -v NAME | wc -l) != 3 ] ; do
    echo "  waiting for pods to start running"
    sleep 2
done

Cluster member start failed

Due to a bug in the Postgres spilo image, there are cases where Postgres cluster members may fail to start due to invalid permissions.

Determine if invalid permissions

(ncn-mw#) The following is an example for cray-console-data-postgres. One cluster member is failing to start due to invalid permissions on /home/postgres/pgdata/pgroot/data.

POSTGRESQL=cray-console-data-postgres
NAMESPACE=services
kubectl exec "${POSTGRESQL}-1" -c postgres -it -n ${NAMESPACE} -- patronictl list

Example output:

+ Cluster: cray-console-data-postgres (7302102293929427021) ----------+----+-----------+
|            Member            |     Host     |  Role  |    State     | TL | Lag in MB |
+------------------------------+--------------+--------+--------------+----+-----------+
| cray-console-data-postgres-0 |  10.32.0.22  | Leader | start failed |    |           |
| cray-console-data-postgres-1 |  10.46.0.55  |        |   running    | 55 |         0 |
| cray-console-data-postgres-2 | 10.39.128.15 |        |   running    | 55 |         0 |
+------------------------------+--------------+--------+--------------+----+-----------+

For this example, the member cray-console-data-postgres-0 has failed to start. Check the pod logs for errors such as those shown below.

kubectl logs "${POSTGRESQL}-0" -n ${NAMESPACE} -c postgres | grep FATAL

Example output:

2024-03-30 07:41:35 UTC [2092847]: [1-1] 6607c22f.1fef2f 0     FATAL:  data directory "/home/postgres/pgdata/pgroot/data" has invalid permissions

Recover from a start failed member

For a member that has failed to start and is logging invalid permission errors on /home/postgres/pgdata/pgroot/data, run the following to correct the permissions. For this example, cray-console-data-postgres-0 encountered the issue. If one or more of the members have failed to start due to the invalid permission error, the following would need to be run on each member.

kubectl exec -it "${POSTGRESQL}-0" -n ${NAMESPACE} -c postgres -- chmod 700 -R /home/postgres/pgdata/pgroot/data

Check that the member has started.

kubectl exec "${POSTGRESQL}-1" -c postgres -it -n ${NAMESPACE} -- patronictl list

Example output:

+ Cluster: cray-console-data-postgres (7302102293929427021) ----------+----+-----------+
|            Member            |     Host     |  Role  |    State     | TL | Lag in MB |
+------------------------------+--------------+--------+--------------+----+-----------+
| cray-console-data-postgres-0 |  10.32.0.22  | Leader |   running    | 56 |           |
| cray-console-data-postgres-1 |  10.46.0.55  |        |   running    | 56 |         0 |
| cray-console-data-postgres-2 | 10.39.128.15 |        |   running    | 56 |         0 |
+------------------------------+--------------+--------+--------------+----+-----------+

Cluster member missing

Most services expect to maintain a Postgres cluster consisting of three pods for resiliency (System Monitoring Application (SMA) is one exception where only two pods are expected to exist).

Determine if a cluster member is missing

(ncn-mw#) For a given Postgres cluster, check how many pods are running.

POSTGRESQL=keycloak-postgres
NAMESPACE=services
kubectl get pods -A -l "application=spilo,cluster-name=${POSTGRESQL}"

Recover from a missing member

If the number of Postgres pods for the given cluster is more or less than expected, increase or decrease as needed. This example will patch the keycloak-postgres cluster resource so that three pods are running.

(ncn-mw#) Set the POSTGRESQL and NAMESPACE variables.

POSTGRESQL=keycloak-postgres
NAMESPACE=services

(ncn-mw#) Patch the keycloak-postgres cluster resource to ensure three pods are running.

kubectl patch postgresql "${POSTGRESQL}" -n "${NAMESPACE}" --type='json' \
    -p='[{"op" : "replace", "path":"/spec/numberOfInstances", "value" : 3}]'

(ncn-mw#) Confirm the number of cluster members, otherwise known as pods, by checking the postgresql resource.

kubectl get postgresql ${POSTGRESQL} -n ${NAMESPACE}

Example output:

NAME                TEAM       VERSION   PODS   VOLUME   CPU-REQUEST   MEMORY-REQUEST   AGE   STATUS
keycloak-postgres   keycloak   11        3      10Gi                                    29m   Running

If a pod is starting but remains in Pending, CrashLoopBackOff, ImagePullBackOff, or other non-Running states, then describe the pod and get logs from the pod for further analysis.

(ncn-mw#) Find the pod name.

kubectl get pods -A -l "application=spilo,cluster-name=${POSTGRESQL}"

Example output:

NAMESPACE   NAME                  READY   STATUS    RESTARTS   AGE
services    keycloak-postgres-0   0/3     Pending   0          36m
services    keycloak-postgres-1   3/3     Running   0          35m
services    keycloak-postgres-2   3/3     Running   0          34m

(ncn-mw#) Describe the pod.

kubectl describe pod "${POSTGRESQL}-0" -n ${NAMESPACE}

(ncn-mw#) View the pod logs.

kubectl logs "${POSTGRESQL}-0" -c postgres -n ${NAMESPACE}

Postgres leader missing

If a Postgres cluster no longer has a leader, the database will need to be recovered.

Determine if the Postgres leader is missing

(ncn-mw#) Set the POSTGRESQL and NAMESPACE variables.

POSTGRESQL=cray-smd-postgres
NAMESPACE=services

(ncn-mw#) Check if the leader is missing.

kubectl exec ${POSTGRESQL}-0 -n ${NAMESPACE} -c postgres -- patronictl list

Example output:

+-------------------+---------------------+------------+------+--------------+----+-----------+
|      Cluster      |        Member       |    Host    | Role |    State     | TL | Lag in MB |
+-------------------+---------------------+------------+------+--------------+----+-----------+
| cray-smd-postgres | cray-smd-postgres-0 | 10.42.0.25 |      |  running     |    |   unknown |
| cray-smd-postgres | cray-smd-postgres-1 | 10.44.0.34 |      | start failed |    |   unknown |
| cray-smd-postgres | cray-smd-postgres-2 | 10.36.0.44 |      | start failed |    |   unknown |
+-------------------+---------------------+------------+------+--------------+----+-----------+

If the output does not list a leader, then proceed to Recover from a missing Postgres leader.

Check if there is conflicting leader information.

It sometimes happen that the above check reports a leader, but other checks report no leader, or report conflicting leader information. The following steps show the status reported by each member of the cluster.

(ncn-mw#) Make a list of the Kubernetes pods of the cluster members.

PODS=$(kubectl get pods -n ${NAMESPACE} --no-headers -o custom-columns=:.metadata.name | grep "^${POSTGRESQL}-[0-9]$") ; echo ${PODS}

Example output:

cray-smd-postgres-0 cray-smd-postgres-1 cray-smd-postgres-2

(ncn-mw#) Query each pod about the status of the cluster.

This script reports the cluster status as perceived by each member of the cluster.

for POD in ${PODS} ; do
    echo "Checking ${POD}..."
    kubectl exec ${POD} -n ${NAMESPACE} -c postgres -- curl -s http://localhost:8008/cluster |  jq ; echo
done

Example output:

Checking cray-smd-postgres-0...

{
  "members": [
    {
      "name": "cray-smd-postgres-0",
      "role": "leader",
      "state": "running",
      "api_url": "http://10.32.0.33:8008/patroni",
      "host": "10.32.0.33",
      "port": 5432,
      "timeline": 1
    },
    {
      "name": "cray-smd-postgres-1",
      "role": "replica",
      "state": "running",
      "api_url": "http://10.44.0.30:8008/patroni",
      "host": "10.44.0.30",
      "port": 5432,
      "timeline": 1,
      "lag": 0
    },
    {
      "name": "cray-smd-postgres-2",
      "role": "replica",
      "state": "running",
      "api_url": "http://10.47.0.33:8008/patroni",
      "host": "10.47.0.33",
      "port": 5432,
      "timeline": 1,
      "lag": 0
    }
  ]
}

Checking cray-smd-postgres-1...

{
  "members": [
    {
      "name": "cray-smd-postgres-0",
      "role": "leader",
      "state": "running",
      "api_url": "http://10.32.0.33:8008/patroni",
      "host": "10.32.0.33",
      "port": 5432,
      "timeline": 1
    },
    {
      "name": "cray-smd-postgres-1",
      "role": "replica",
      "state": "running",
      "api_url": "http://10.44.0.30:8008/patroni",
      "host": "10.44.0.30",
      "port": 5432,
      "timeline": 1,
      "lag": 0
    },
    {
      "name": "cray-smd-postgres-2",
      "role": "replica",
      "state": "running",
      "api_url": "http://10.47.0.33:8008/patroni",
      "host": "10.47.0.33",
      "port": 5432,
      "timeline": 1,
      "lag": 0
    }
  ]
}

And so on for every member of the cluster. This script does not do any checking – it only displays the information.

Check the output for errors or inconsistencies.

In particular, validate the following:
- Every cluster member reports exactly one leader.
- Every cluster member reports the same leader.
- Every cluster member reports the same states for each member.
If any of the above are not true, this indicates that the cluster members are no longer properly synchronized. In this case, attempt the Recover replication remediation procedures.

Recover from a missing Postgres leader

See the Recover from Postgres WAL Event procedure.

Troubleshoot Postgres Database

The patronictl tool

Database unavailable

Database disk full

Replication lagging

Check if replication is working

Replication is working

Replication is not working

Recover replication

Troubleshooting

Setup alerts for replication lag

Postgres status SyncFailed

Check all the postgresql resources

Case 1: some persistent volumes are not compatible with existing resizing providers

Case 1: Symptom

Case 1: Details

Case 2: could not init db connection

Case 2: Symptom

Case 2: Details

Case 2: Restart the Postgres operator

Case 2: Restart the cluster and operator

Case 3: password authentication failed for user

Case 3: Symptom

Case 3: Details

Cluster member start failed

Determine if invalid permissions

Recover from a start failed member

Cluster member missing

Determine if a cluster member is missing

Recover from a missing member

Postgres leader missing

Determine if the Postgres leader is missing

Recover from a missing Postgres leader

The `patronictl` tool

Postgres status `SyncFailed`

Check all the `postgresql` resources

Case 1: `some persistent volumes are not compatible with existing resizing providers`

Case 2: `could not init db connection`

Case 3: `password authentication failed for user`