This page contains general Postgres troubleshooting topics.
patronictl
toolSyncFailed
patronictl
toolThe patronictl
tool is used to call a REST API that interacts with Postgres databases. It handles a variety of tasks, such as listing cluster members and the replication
status, configuring and restarting databases, and more.
(ncn-mw#
) The tool is installed in the database containers:
kubectl exec -it -n services keycloak-postgres-0 -c postgres -- su postgres
(postgres-container#
) Use the following command for more information on the patronictl
command:
patronictl --help
If there are no endpoints for the main service, then Patroni will mark the database as unavailable.
(ncn-mw#
) The following is an example for keycloak-postgres
where no endpoints are listed, which means the database is unavailable.
kubectl get endpoints keycloak-postgres -n services
Example output:
NAME ENDPOINTS AGE
keycloak-postgres <none> 3d22h
If the database is unavailable, then check if Database disk full is the cause of the issue.
(ncn-mw#
) Otherwise, check the postgres-operator
logs for errors.
kubectl logs -l app.kubernetes.io/name=postgres-operator -n services
(ncn-mw#
) The following is an example for keycloak-postgres
. One cluster member is failing to start because of a full pgdata
disk. This was likely due to
replication issues, which caused the pg_wal
files to grow.
POSTGRESQL=keycloak-postgres
NAMESPACE=services
kubectl exec "${POSTGRESQL}-1" -c postgres -it -n ${NAMESPACE} -- patronictl list
Example output:
+-------------------+---------------------+------------+--------+--------------+----+-----------+
| Cluster | Member | Host | Role | State | TL | Lag in MB |
+-------------------+---------------------+------------+--------+--------------+----+-----------+
| keycloak-postgres | keycloak-postgres-0 | 10.42.0.11 | | start failed | | unknown |
| keycloak-postgres | keycloak-postgres-1 | 10.44.0.7 | | running | 4 | 0 |
| keycloak-postgres | keycloak-postgres-2 | 10.36.0.40 | Leader | running | 4 | |
+-------------------+---------------------+------------+--------+--------------+----+-----------+
for i in {0..2}; do echo "${POSTGRESQL}-${i}:"; kubectl exec "${POSTGRESQL}-${i}" -n ${NAMESPACE} -c postgres -- df -h pgdata; done
Example output:
keycloak-postgres-0:
Filesystem Size Used Avail Use% Mounted on
/dev/sde 976M 960M 0 100% /home/postgres/pgdata
keycloak-postgres-1:
Filesystem Size Used Avail Use% Mounted on
/dev/rbd12 976M 152M 809M 16% /home/postgres/pgdata
keycloak-postgres-2:
Filesystem Size Used Avail Use% Mounted on
/dev/rbd3 976M 136M 825M 15% /home/postgres/pgdata
kubectl logs "${POSTGRESQL}-0" -n ${NAMESPACE} -c postgres | grep FATAL
Example output:
2021-07-14 17:52:48 UTC [30495]: [1-1] 60ef2470.771f 0 FATAL: could not write lock file "postmaster.pid": No space left on device
kubectl exec "${POSTGRESQL}-0" -n ${NAMESPACE} -c postgres -it -- du -h --max-depth 1 /home/postgres/pgdata/pgroot/data/pg_wal
To recover the cluster member that had failed to start because of disk pressure, attempt to reclaim some space on the pgdata
disk.
kubectl exec
into that pod, copy the logs off (optional), and then clear the logs in order to recover some disk space. Finally, restart the Postgres cluster and postgres-operator
.
(ncn-mw#
) Copy off the logs (optional).
kubectl cp "${POSTGRESQL}-1":/home/postgres/pgdata/pgroot/pg_log /tmp -c postgres -n ${NAMESPACE}
Clear the logs.
(ncn-mw#
) Open a shell in the database container.
kubectl exec "${POSTGRESQL}-1" -n ${NAMESPACE} -c postgres -it -- bash
(postgres-container#
) Clear the logs and exit the container.
for i in {0..7}; do > /home/postgres/pgdata/pgroot/pg_log/postgresql-$i.csv; done
exit
(ncn-mw#
) Restart the pods by deleting them.
kubectl delete pod "${POSTGRESQL}-0" "${POSTGRESQL}-1" "${POSTGRESQL}-2" -n ${NAMESPACE}
(ncn-mw#
) Restart the operator by deleting it.
kubectl delete pod -l app.kubernetes.io/name=postgres-operator -n services
If disk issues persist or exist on multiple nodes and the above does not resolve the issue, then see the Recover from Postgres WAL Event procedure.
Postgres replication lag can be detected with Prometheus alerts and alert notifications (See
Configure Prometheus Email Alert Notifications). If replication lag is not caught early, it
can cause the disk mounted on /home/postgres/pgdata
to fill up and the database to stop running. If this issue is caught before the database stops, it can be easily remediated
using a patronictl
command to reinitialize the lagging cluster member.
When services have a Postgres cluster of pods, they need to be able to replicate data between them. When the pods are not able to replicate data, the database will become full.
The patronictl list
command will show the status of replication.
(ncn-mw#
) The following is an example where replication is working:
kubectl exec keycloak-postgres-0 -c postgres -n services -it -- patronictl list
Example output:
+-------------------+---------------------+------------+--------+---------+----+-----------+
| Cluster | Member | Host | Role | State | TL | Lag in MB |
+-------------------+---------------------+------------+--------+---------+----+-----------+
| keycloak-postgres | keycloak-postgres-0 | 10.40.0.23 | Leader | running | 1 | |
| keycloak-postgres | keycloak-postgres-1 | 10.42.0.25 | | running | 1 | 0 |
| keycloak-postgres | keycloak-postgres-2 | 10.42.0.29 | | running | 1 | 0 |
+-------------------+---------------------+------------+--------+---------+----+-----------+
The following is an example where replication is broken:
+-------------------+---------------------+--------------+--------+----------+----+-----------+
| Cluster | Member | Host | Role | State | TL | Lag in MB |
+-------------------+---------------------+--------------+--------+----------+----+-----------+
| keycloak-postgres | keycloak-postgres-0 | 10.42.10.22 | | starting | | unknown |
| keycloak-postgres | keycloak-postgres-1 | 10.40.11.191 | Leader | running | 47 | |
| keycloak-postgres | keycloak-postgres-2 | 10.40.11.190 | | running | 14 | 608 |
+-------------------+---------------------+--------------+--------+----------+----+-----------+
In the event that a state of broken Postgres replication persists and the space allocated for the WAL files fills up, the affected database will likely shut down and create a
state where it can be very difficult to recover. This can impact the reliability of the related service and may require that it be redeployed with data repopulation procedures.
If replication lag is caught and remediated before the database shuts down, replication can be recovered using patronictl reinit
.
A reinitialize will get the lagging replica member re-synced and replicating again. This should be done as soon as replication lag is detected. In the preceding example,
keycloak-postgres-0
and keycloak-postgres-2
were not replicating properly (unknown
or non-zero lag). To remediate, kubectl exec
into the leader pod and use
patronictl reinit <cluster> <lagging cluster member>
to reinitialize the lagging member(s).
For example:
(ncn-mw#
) Open a shell into the Postgres container.
kubectl exec keycloak-postgres-1 -c postgres -n services -it -- bash
(postgres-container#
) Reinitialize the first lagging replica member.
patronictl reinit keycloak-postgres keycloak-postgres-0
Example output:
Are you sure you want to reinitialize members keycloak-postgres-0? [y/N]: y
Failed: reinitialize for member keycloak-postgres-0, status code=503, (restarting after failure already in progress)
Do you want to cancel it and reinitialize anyway? [y/N]: y
Success: reinitialize for member keycloak-postgres-0
(postgres-container#
) Reinitialize the next lagging replica member.
patronictl reinit keycloak-postgres keycloak-postgres-2
Example output:
Are you sure you want to reinitialize members keycloak-postgres-2? [y/N]: y
Failed: reinitialize for member keycloak-postgres-2, status code=503, (restarting after failure already in progress)
Do you want to cancel it and reinitialize anyway? [y/N]: y
Success: reinitialize for member keycloak-postgres-2
(ncn-mw#
) Verify that replication has recovered.
kubectl exec keycloak-postgres-0 -c postgres -n services -it -- patronictl list
Example output:
+-------------------+---------------------+--------------+--------+---------+----+-----------+
| Cluster | Member | Host | Role | State | TL | Lag in MB |
+-------------------+---------------------+--------------+--------+---------+----+-----------+
| keycloak-postgres | keycloak-postgres-0 | 10.42.10.22 | | running | 47 | 0 |
| keycloak-postgres | keycloak-postgres-1 | 10.40.11.191 | Leader | running | 47 | |
| keycloak-postgres | keycloak-postgres-2 | 10.40.11.190 | | running | 47 | 0 |
+-------------------+---------------------+--------------+--------+---------+----+-----------+
If patronictl reinit
fails with Failed: reinitialize for member
… status code=503, (Cluster has no leader, can not reinitialize)
:
(ncn-mw#
) For example:
kubectl exec -it cray-console-data-postgres-0 -c postgres -n services -- patronictl reinit cray-console-data-postgres cray-console-data-postgres-1
Example output:
+ Cluster: cray-console-data-postgres (7072784871993835594) ---+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+------------------------------+------------+--------+---------+----+-----------+
| cray-console-data-postgres-0 | 10.39.0.74 | Leader | running | 1 | |
| cray-console-data-postgres-1 | 10.36.0.37 | | running | 1 | 16 |
| cray-console-data-postgres-2 | 10.32.0.1 | | running | 1 | 16 |
+------------------------------+------------+--------+---------+----+-----------+
Are you sure you want to reinitialize members cray-console-data-postgres-1? [y/N]: y
Failed: reinitialize for member cray-console-data-postgres-1, status code=503, (Cluster has no leader, can not reinitialize)
Delete the Postgres leader pod and wait for the leader to restart.
(ncn-mw#
) Delete the leader pod.
kubectl delete pod cray-console-data-postgres-0 -n services
(ncn-mw#
) Wait for the leader to restart.
Re-run the following command until it succeeds and reports that the leader pod is running
.
kubectl exec keycloak-postgres-0 -c postgres -n services -it -- patronictl list
Example output:
+ Cluster: cray-console-data-postgres (7072784871993835594) ---+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+------------------------------+------------+--------+---------+----+-----------+
| cray-console-data-postgres-0 | 10.39.0.80 | Leader | running | 2 | |
| cray-console-data-postgres-1 | 10.36.0.37 | | running | 1 | 49 |
| cray-console-data-postgres-2 | 10.32.0.1 | | running | 1 | 49 |
+------------------------------+------------+--------+---------+----+-----------+
Re-run Recover replication to reinit
any lagging members.
If the reinit
still fails, then delete member pods that are still reporting lag. This should clear up any remaining lag.
(ncn-mw#
) Determine which pods are reporting lag.
kubectl exec cray-console-postgres-0 -c postgres -n services -it -- patronictl list
Example output:
+ Cluster: cray-console-data-postgres (7072784871993835594) ---+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+------------------------------+------------+--------+---------+----+-----------+
| cray-console-data-postgres-0 | 10.39.0.80 | Leader | running | 2 | |
| cray-console-data-postgres-1 | 10.36.0.37 | | running | 1 | 49 |
| cray-console-data-postgres-2 | 10.32.0.1 | | running | 1 | 49 |
+------------------------------+------------+--------+---------+----+-----------+
(ncn-mw#
) Delete the pods that are still reporting lag.
kubectl delete pod cray-console-data-postgres-1 cray-console-data-postgres-2 -n services
(ncn-mw#
) Once the pods restart, verify that the lag has resolved.
kubectl exec cray-console-postgres-0 -c postgres -n services -it -- patronictl list
Example output:
+ Cluster: cray-console-data-postgres (7072784871993835594) ---+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+------------------------------+------------+--------+---------+----+-----------+
| cray-console-data-postgres-0 | 10.39.0.80 | Leader | running | 2 | |
| cray-console-data-postgres-1 | 10.36.0.37 | | running | 2 | 0 |
| cray-console-data-postgres-2 | 10.32.0.1 | | running | 2 | 0 |
+------------------------------+------------+--------+---------+----+-----------+
If a cluster member is stopped
after a successful reinitialization, check for pg_internal.init.*
files that may need to be cleaned up. This can occur if the pgdata
disk was full prior to the reinitialization, leaving truncated pg_internal.init.*
files in the pgdata
directory.
(ncn-mw#
) Determine if any pods are stopped
.
kubectl exec keycloak-postgres-0 -c postgres -n services -it -- patronictl list
Example output:
+-------------------+---------------------+--------------+--------+---------+----+-----------+
| Cluster | Member | Host | Role | State | TL | Lag in MB |
+-------------------+---------------------+--------------+--------+---------+----+-----------+
| keycloak-postgres | keycloak-postgres-0 | 10.42.10.22 | | running | 47 | 0 |
| keycloak-postgres | keycloak-postgres-1 | 10.40.11.191 | Leader | running | 47 | |
| keycloak-postgres | keycloak-postgres-2 | 10.40.11.190 | | stopped | | unknown |
+-------------------+---------------------+--------------+--------+---------+----+-----------+
Check the most recent Postgres log in the stopped
pod.
(ncn-mw#
) Open a shell into that pod that is stopped
.
kubectl exec keycloak-postgres-2 -c postgres -n services -it -- bash
(postgres-container#
) Check the most recent Postgres log for any invalid segment number 0
errors relating to pg_internal.init.*
files.
LOG=`ls -t /home/postgres/pgdata/pgroot/pg_log/*.csv | head -1`
grep pg_internal.init "$LOG" | grep "invalid segment number 0" | tail -1
Example output:
2022-02-01 16:59:35.529 UTC,"standby","",227600,"127.0.0.1:42264",61f966f7.37910,3,"sending backup ""pg_basebackup base backup""",2022-02-01 16:59:35 UTC,7/0,0,ERROR,XX000,"invalid segment number 0 in file ""pg_internal.init.2239188""",,,,,,,,,"pg_basebackup"
(postgres-container#
) If the check in the previous step finds such files, then first find any zero length pg_internal.init.*
files.
This command should be run inside the kubectl exec
session from the previous steps.
find /home/postgres/pgdata -name pg_internal.init.* -size 0
Example output:
./pgroot/data/base/16622/pg_internal.init.2239004
...
./pgroot/data/base/16622/pg_internal.init.2239010
(postgres-container#
) Delete the zero length pg_internal.init.*
files.
This command should be run inside the kubectl exec
session from the previous steps.
Double check the syntax of the command in this step before executing it.
find /home/postgres/pgdata -name pg_internal.init.* -size 0 -exec rm {} \;
(postgres-container#
) Find any non-zero length pg_internal.init.*
files that were truncated when the file system filled up.
This command should be run inside the kubectl exec
session from the previous step.
grep pg_internal.init $LOG | grep "invalid segment number 0" | tail -1
Example output:
2022-02-01 16:59:35.529 UTC,"standby","",227600,"127.0.0.1:42264",61f966f7.37910,3,"sending backup ""pg_basebackup base backup""",2022-02-01 16:59:35 UTC,7/0,0,ERROR,XX000,"invalid segment number 0 in file ""pg_internal.init.2239188""",,,,,,,,,"pg_basebackup"
(postgres-container#
) Locate the non-zero length pg_internal.init.*
file.
This command should be run inside the kubectl exec
session from the previous step.
find ~/pgdata -name pg_internal.init.2239188
Example output:
/home/postgres/pgdata/pgroot/data/base/16622/pg_internal.init.2239188
(postgres-container#
) Delete (or move to a different location) the non-zero length pg_internal.init.*
file.
This command should be run inside the kubectl exec
session from the previous step.
rm -v /home/postgres/pgdata/pgroot/data/base/16622/pg_internal.init.2239188
Repeat the above steps to find, locate, and delete non-zero length pg_internal.init.*
files until there are no more new invalid segment number 0
messages.
(ncn-mw#
) Verify that the cluster member has started.
kubectl exec keycloak-postgres-0 -c postgres -n services -it -- patronictl list
Example output:
+-------------------+---------------------+--------------+--------+---------+----+-----------+
| Cluster | Member | Host | Role | State | TL | Lag in MB |
+-------------------+---------------------+--------------+--------+---------+----+-----------+
| keycloak-postgres | keycloak-postgres-0 | 10.42.10.22 | | running | 47 | 0 |
| keycloak-postgres | keycloak-postgres-1 | 10.40.11.191 | Leader | running | 47 | |
| keycloak-postgres | keycloak-postgres-2 | 10.40.11.190 | | running | 47 | 0 |
+-------------------+---------------------+--------------+--------+---------+----+-----------+
Alerts exist in Prometheus for the following:
PostgresqlReplicationLagSMA
PostgresqlReplicationLagServices
PostgresqlFollowerReplicationLagSMA
PostgresqlFollowerReplicationLagServices
When alert notifications are configured, replication issues can be detected quickly. If the replication issue persists such that the database becomes unavailable, recovery will likely be much more involved. Catching such issues as soon as possible is desired. See Configure Prometheus Email Alert Notifications.
SyncFailed
postgresql
resourcesCheck for any postgresql
resource that has a STATUS
of SyncFailed
. SyncFailed
generally means that there is something between the postgres-operator
and the Postgres
cluster that is out of sync. This does not always mean that the cluster is unhealthy. Check the postgres-operator
logs for messages in order to further determine the root
cause of the issue.
Other STATUS
values such as Updating
are a non-issue. It is expected that this will eventually change to Running
or possibly SyncFailed
, if the postgres-operator
encounters issues syncing updates to the postgresql
cluster.
(ncn-mw#
) Check for any postgresql
resource that has a STATUS
of SyncFailed
.
kubectl get postgresql -A
Example output:
NAMESPACE NAME TEAM VERSION PODS VOLUME CPU-REQUEST MEMORY-REQUEST AGE STATUS
services cray-console-data-postgres cray-console-data 11 3 2Gi 4h10m Running
services cray-sls-postgres cray-sls 11 3 1Gi 4h12m SyncFailed
services cray-smd-postgres cray-smd 11 3 30Gi 500m 8Gi 4h12m Updating
services gitea-vcs-postgres gitea-vcs 11 3 50Gi 4h11m Running
services keycloak-postgres keycloak 11 3 1Gi 4h13m Running
spire spire-postgres spire 11 3 20Gi 1 4Gi 4h10m Running
(ncn-mw#
) Find the postgres-operator
pod name.
kubectl get pods -l app.kubernetes.io/name=postgres-operator -n services
Example output:
NAME READY STATUS RESTARTS AGE
cray-postgres-operator-6fffc48b4c-mqz7z 2/2 Running 0 5h26m
(ncn-mw#
) Check the logs for the postgres-operator
.
kubectl logs cray-postgres-operator-6fffc48b4c-mqz7z -n services -c postgres-operator | grep -i sync | grep -i msg
some persistent volumes are not compatible with existing resizing providers
msg="could not sync cluster: could not sync persistent volumes: could not sync volumes: could not resize EBS volumes: some persistent volumes are not compatible with existing resizing providers"
This generally means that the postgresql
resource was updated to change the volume size from the Postgres operator’s perspective, but the additional step to resize the
actual PVCs was not done so the operator and the Postgres cluster are not able to sync the resize change. The cluster is still healthy, but to complete the resize of the
underlying Postgres PVCs, additional steps are needed.
The following example assumes that cray-smd-postgres
is in SyncFailed
and the volume size was recently increased to 100Gi
(possibly by editing the volume size of
postgresql
cray-smd-postgres
resource), but the pgdata-cray-smd-postgres
PVC’s storage capacity was not updated to align with the change.
(ncn-mw#
) To confirm that this is the case:
kubectl get postgresql cray-smd-postgres -n services -o jsonpath="{.spec.volume.size}"
Example output:
100Gi
kubectl get pvc -n services -l application=spilo,cluster-name=cray-smd-postgres
Example output:
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
pgdata-cray-smd-postgres-0 Bound pvc-020cf339-e372-46ae-bc37-de2b55320e88 30Gi RWO k8s-block-replicated 70m
pgdata-cray-smd-postgres-1 Bound pvc-3d42598a-188e-4301-a58e-0f0ce3944c89 30Gi RWO k8s-block-replicated 27m
pgdata-cray-smd-postgres-2 Bound pvc-0d659080-7d39-409a-9ee5-1a1806971054 30Gi RWO k8s-block-replicated 27m
(ncn-mw#
) To resolve this SyncFailed
case, resize the pgdata
PVCs for the selected Postgres cluster. Create the following function in the shell and execute the function by
calling it with the appropriate arguments. For this example the pgdata-cray-smd-postgres
PVCs will be resized to 100Gi
to match that of the postgresql
cray-smd-postgres
volume size.
function resize-postgresql-pvc
{
POSTGRESQL=$1
PGDATA=$2
NAMESPACE=$3
PGRESIZE=$4
# Check for required arguments
if [ $# -ne 4 ]; then
echo "Illegal number of parameters ($#). Function requires exactly 4 arguments."
exit 2
fi
## Check that PGRESIZE matches current postgresql volume size
postgresql_volume_size=$(kubectl get postgresql "${POSTGRESQL}" -n "${NAMESPACE}" -o jsonpath="{.spec.volume.size}")
if [ "${postgresql_volume_size}" != "${PGRESIZE}" ]; then
echo "Invalid resize ${PGRESIZE}, expected ${postgresql_volume_size}"
exit 2
fi
## Scale the postgres cluster to 1 member
kubectl patch postgresql "${POSTGRESQL}" -n "${NAMESPACE}" --type='json' -p='[{"op" : "replace", "path":"/spec/numberOfInstances", "value" : 1}]'
while [ $(kubectl get pods -l "application=spilo,cluster-name=${POSTGRESQL}" -n "${NAMESPACE}" | grep -v NAME | wc -l) != 1 ] ; do
echo " waiting for pods to terminate"
sleep 2
done
## Delete the inactive PVCs, resize the active PVC, and wait for the resize to complete
kubectl delete pvc "${PGDATA}-1" "${PGDATA}-2" -n "${NAMESPACE}"
kubectl patch -p '{"spec": {"resources": {"requests": {"storage": "'${PGRESIZE}'"}}}}' "pvc/${PGDATA}-0" -n "${NAMESPACE}"
while [ -z '$(kubectl describe pvc "{PGDATA}-0" -n "${NAMESPACE}" | grep FileSystemResizeSuccessful' ] ; do
echo " waiting for PVC to resize"
sleep 2
done
## Scale the postgres cluster back to 3 members
kubectl patch postgresql "${POSTGRESQL}" -n "${NAMESPACE}" --type='json' -p='[{"op" : "replace", "path":"/spec/numberOfInstances", "value" : 3}]'
while [ $(kubectl get pods -l "application=spilo,cluster-name=${POSTGRESQL}" -n "${NAMESPACE}" | grep -v NAME | grep -c "Running") != 3 ] ; do
echo " waiting for pods to restart"
sleep 2
done
}
resize-postgresql-pvc cray-smd-postgres pgdata-cray-smd-postgres services 100Gi
In order to persist any Postgres PVC storage volume size changes, it is necessary that this change also be made to the customer-managed customizations.yaml
file.
See the Postgres PVC Resize information in the Post Install Customizations document.
could not init db connection
msg="could not sync cluster: could not sync roles: could not init db connection: could not init db connection: still failing after 8 retries"
This generally means that some state in the Postgres operator is out of sync with that of the postgresql
cluster, resulting in database connection issues.
To resolve this SyncFailed
case, restarting the Postgres operator by deleting the pod may clear up the issue.
(ncn-mw#
) Delete the pod.
kubectl delete pod -l app.kubernetes.io/name=postgres-operator -n services
(ncn-mw#
) Wait for the postgres-operator
to restart.
kubectl get pods -l app.kubernetes.io/name=postgres-operator -n services
Example output of restarted pod:
NAME READY STATUS RESTARTS AGE
cray-postgres-operator-6fffc48b4c-mqz7z 2/2 Running 0 6m
If the database connection has been down for a long period of time and the SyncFailed
persists after the above steps, a restart of
the cluster and the postgres-operator
may be needed for the service to reconnect to the Postgres cluster. For example, if the
cray-gitea
service is not able to connect to the Postgres database and the connection has been failing for many hours, restart
the cluster and operator.
(ncn-mw#
) Set necessary variables.
Replace the values of these variables for the appropriate ones for the particular cluster being remediated.
CLIENT=gitea-vcs
POSTGRESQL=gitea-vcs-postgres
NAMESPACE=services
(ncn-mw#
) Scale the service to 0. (for different services this may be a statefulset
instead of a deployment
).
kubectl scale deployment ${CLIENT} -n ${NAMESPACE} --replicas=0
(ncn-mw#
) Restart the Postgres cluster and the postgres-operator
.
kubectl delete pod "${POSTGRESQL}-0" "${POSTGRESQL}-1" "${POSTGRESQL}-2" -n ${NAMESPACE}
kubectl delete pods -n services -lapp.kubernetes.io/name=postgres-operator
while [ $(kubectl get postgresql ${POSTGRESQL} -n ${NAMESPACE} -o json | jq -r '.status.PostgresClusterStatus') != "Running" ]; do
echo "waiting for ${POSTGRESQL} to start running"; sleep 2
done
(ncn-mw#
) Scale the service back to 1 (for different services this may be --replicas=3
).
kubectl scale deployment ${CLIENT} -n ${NAMESPACE} --replicas=1
password authentication failed for user
msg="error while syncing cluster state: could not sync roles: could not init db connection: could not init db connection: pq: password authentication failed for user \<username\>"
This generally means that the password for the given user is not the same as that specified in the Kubernetes secret. This can occur if the postgresql
cluster was rebuilt and the data was restored, leaving the Kubernetes secrets out of sync with the Postgres cluster. To resolve this SyncFailed
case,
gather the username and password for the credential from Kubernetes, and update the database with these values. For example, if the user postgres
is
failing to authenticate between the cray-smd
services and the cray-smd-postgres
cluster, then get the password for the postgres
user from the
Kubernetes secret and update the password in the database.
(ncn-mw#
) Set necessary variables.
CLIENT=cray-smd
POSTGRESQL=cray-smd-postgres
NAMESPACE=services
(ncn-mw#
) Scale the service to 0.
kubectl scale deployment ${CLIENT} -n ${NAMESPACE} --replicas=0
while [ $(kubectl get pods -n ${NAMESPACE} -l app.kubernetes.io/name="${CLIENT}" | grep -v NAME | wc -l) != 0 ] ; do
echo " waiting for pods to terminate"
sleep 2
done
(ncn-mw#
) Determine what secrets are associated with the postgresql
credentials.
kubectl get secrets -n ${NAMESPACE} | grep "${POSTGRESQL}.credentials"
Example output:
services hmsdsuser.cray-smd-postgres.credentials Opaque 2 31m
services postgres.cray-smd-postgres.credentials Opaque 2 31m
services service-account.cray-smd-postgres.credentials Opaque 2 31m
services standby.cray-smd-postgres.credentials Opaque 2 31m
Gather the decoded username and password for the user that is failing to authenticate.
(ncn-mw#
) Save the name of the secret with the failing authentication in a variable.
Replace the secret name in this command with the secret determined in the previous step.
SECRET=postgres.cray-smd-postgres.credentials
(ncn-mw#
) Decode the username.
kubectl get secret ${SECRET} -n ${NAMESPACE} -ojsonpath='{.data.username}' | base64 -d
Example output:
postgres
(ncn-mw#
) Decode the password.
kubectl get secret ${SECRET} -n ${NAMESPACE} -ojsonpath='{.data.password}'| base64 -d
Example output:
ABCXYZ
Update the username and password in the database.
(ncn-mw#
) Determine which pod is the leader.
kubectl exec "${POSTGRESQL}-0" -n ${NAMESPACE} -c postgres -it -- patronictl list
Example output:
+-------------------+---------------------+------------+--------+---------+----+-----------+
| Cluster | Member | Host | Role | State | TL | Lag in MB |
+-------------------+---------------------+------------+--------+---------+----+-----------+
| cray-smd-postgres | cray-smd-postgres-0 | 10.42.0.25 | Leader | running | 1 | |
| cray-smd-postgres | cray-smd-postgres-1 | 10.44.0.34 | | running | | 0 |
| cray-smd-postgres | cray-smd-postgres-2 | 10.36.0.44 | | running | | 0 |
+-------------------+---------------------+------------+--------+---------+----+-----------+
Update the username and password in the database in the Postgres leader container.
(ncn-mw#
) Open a shell into the leader container.
POSTGRES_LEADER=cray-smd-postgres-0
kubectl exec ${POSTGRES_LEADER} -n ${NAMESPACE} -c postgres -it -- bash
(postgres-container#
) Open Postgres.
/usr/bin/psql postgres postgres
(postgres#
) Update the username and password in the database.
ALTER USER postgres WITH PASSWORD 'ABCXYZ';
On success, output resembles the following:
ALTER ROLE
(ncn-mw#
) Restart the postgresql
cluster.
kubectl delete pod "${POSTGRESQL}-0" "${POSTGRESQL}-1" "${POSTGRESQL}-2" -n ${NAMESPACE}
while [ $(kubectl get postgresql ${POSTGRESQL} -n ${NAMESPACE} -o json | jq -r '.status.PostgresClusterStatus') != "Running" ]; do
echo "waiting for ${POSTGRESQL} to start running"
sleep 2
done
(ncn-mw#
) Scale the service back to 3.
kubectl scale deployment ${CLIENT} -n ${NAMESPACE} --replicas=3
while [ $(kubectl get pods -n ${NAMESPACE} -l app.kubernetes.io/name="${CLIENT}" | grep -v NAME | wc -l) != 3 ] ; do
echo " waiting for pods to start running"
sleep 2
done
Due to a bug in the Postgres spilo
image, there are cases where Postgres cluster members may fail to start due to invalid permissions.
(ncn-mw#
) The following is an example for cray-console-data-postgres
. One cluster member is failing to start due to invalid permissions on /home/postgres/pgdata/pgroot/data
.
POSTGRESQL=cray-console-data-postgres
NAMESPACE=services
kubectl exec "${POSTGRESQL}-1" -c postgres -it -n ${NAMESPACE} -- patronictl list
Example output:
+ Cluster: cray-console-data-postgres (7302102293929427021) ----------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+------------------------------+--------------+--------+--------------+----+-----------+
| cray-console-data-postgres-0 | 10.32.0.22 | Leader | start failed | | |
| cray-console-data-postgres-1 | 10.46.0.55 | | running | 55 | 0 |
| cray-console-data-postgres-2 | 10.39.128.15 | | running | 55 | 0 |
+------------------------------+--------------+--------+--------------+----+-----------+
For this example, the member cray-console-data-postgres-0
has failed to start. Check the pod logs for errors such as those shown below.
kubectl logs "${POSTGRESQL}-0" -n ${NAMESPACE} -c postgres | grep FATAL
Example output:
2024-03-30 07:41:35 UTC [2092847]: [1-1] 6607c22f.1fef2f 0 FATAL: data directory "/home/postgres/pgdata/pgroot/data" has invalid permissions
For a member that has failed to start and is logging invalid permission errors on /home/postgres/pgdata/pgroot/data
, run the following to correct the permissions.
For this example, cray-console-data-postgres-0
encountered the issue. If one or more of the members have failed to start due to the invalid permission error, the following would need to be run on each member.
kubectl exec -it "${POSTGRESQL}-0" -n ${NAMESPACE} -c postgres -- chmod 700 -R /home/postgres/pgdata/pgroot/data
Check that the member has started.
kubectl exec "${POSTGRESQL}-1" -c postgres -it -n ${NAMESPACE} -- patronictl list
Example output:
+ Cluster: cray-console-data-postgres (7302102293929427021) ----------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+------------------------------+--------------+--------+--------------+----+-----------+
| cray-console-data-postgres-0 | 10.32.0.22 | Leader | running | 56 | |
| cray-console-data-postgres-1 | 10.46.0.55 | | running | 56 | 0 |
| cray-console-data-postgres-2 | 10.39.128.15 | | running | 56 | 0 |
+------------------------------+--------------+--------+--------------+----+-----------+
Most services expect to maintain a Postgres cluster consisting of three pods for resiliency (System Monitoring Application (SMA) is one exception where only two pods are expected to exist).
(ncn-mw#
) For a given Postgres cluster, check how many pods are running.
POSTGRESQL=keycloak-postgres
NAMESPACE=services
kubectl get pods -A -l "application=spilo,cluster-name=${POSTGRESQL}"
If the number of Postgres pods for the given cluster is more or less than expected, increase or decrease as needed. This example will patch
the keycloak-postgres
cluster resource so that three pods are running.
(ncn-mw#
) Set the POSTGRESQL
and NAMESPACE
variables.
POSTGRESQL=keycloak-postgres
NAMESPACE=services
(ncn-mw#
) Patch the keycloak-postgres
cluster resource to ensure three pods are running.
kubectl patch postgresql "${POSTGRESQL}" -n "${NAMESPACE}" --type='json' \
-p='[{"op" : "replace", "path":"/spec/numberOfInstances", "value" : 3}]'
(ncn-mw#
) Confirm the number of cluster members, otherwise known as pods, by checking the postgresql
resource.
kubectl get postgresql ${POSTGRESQL} -n ${NAMESPACE}
Example output:
NAME TEAM VERSION PODS VOLUME CPU-REQUEST MEMORY-REQUEST AGE STATUS
keycloak-postgres keycloak 11 3 10Gi 29m Running
If a pod is starting but remains in Pending
, CrashLoopBackOff
, ImagePullBackOff
, or other non-Running
states, then describe
the pod and get logs from the pod for further analysis.
(ncn-mw#
) Find the pod name.
kubectl get pods -A -l "application=spilo,cluster-name=${POSTGRESQL}"
Example output:
NAMESPACE NAME READY STATUS RESTARTS AGE
services keycloak-postgres-0 0/3 Pending 0 36m
services keycloak-postgres-1 3/3 Running 0 35m
services keycloak-postgres-2 3/3 Running 0 34m
(ncn-mw#
) Describe the pod.
kubectl describe pod "${POSTGRESQL}-0" -n ${NAMESPACE}
(ncn-mw#
) View the pod logs.
kubectl logs "${POSTGRESQL}-0" -c postgres -n ${NAMESPACE}
If a Postgres cluster no longer has a leader, the database will need to be recovered.
(ncn-mw#
) Set the POSTGRESQL
and NAMESPACE
variables.
POSTGRESQL=cray-smd-postgres
NAMESPACE=services
(ncn-mw#
) Check if the leader is missing.
kubectl exec ${POSTGRESQL}-0 -n ${NAMESPACE} -c postgres -- patronictl list
Example output:
+-------------------+---------------------+------------+------+--------------+----+-----------+
| Cluster | Member | Host | Role | State | TL | Lag in MB |
+-------------------+---------------------+------------+------+--------------+----+-----------+
| cray-smd-postgres | cray-smd-postgres-0 | 10.42.0.25 | | running | | unknown |
| cray-smd-postgres | cray-smd-postgres-1 | 10.44.0.34 | | start failed | | unknown |
| cray-smd-postgres | cray-smd-postgres-2 | 10.36.0.44 | | start failed | | unknown |
+-------------------+---------------------+------------+------+--------------+----+-----------+
If the output does not list a leader, then proceed to Recover from a missing Postgres leader.
Check if there is conflicting leader information.
It sometimes happen that the above check reports a leader, but other checks report no leader, or report conflicting leader information. The following steps show the status reported by each member of the cluster.
(ncn-mw#
) Make a list of the Kubernetes pods of the cluster members.
PODS=$(kubectl get pods -n ${NAMESPACE} --no-headers -o custom-columns=:.metadata.name | grep "^${POSTGRESQL}-[0-9]$") ; echo ${PODS}
Example output:
cray-smd-postgres-0 cray-smd-postgres-1 cray-smd-postgres-2
(ncn-mw#
) Query each pod about the status of the cluster.
This script reports the cluster status as perceived by each member of the cluster.
for POD in ${PODS} ; do
echo "Checking ${POD}..."
kubectl exec ${POD} -n ${NAMESPACE} -c postgres -- curl -s http://localhost:8008/cluster | jq ; echo
done
Example output:
Checking cray-smd-postgres-0...
{
"members": [
{
"name": "cray-smd-postgres-0",
"role": "leader",
"state": "running",
"api_url": "http://10.32.0.33:8008/patroni",
"host": "10.32.0.33",
"port": 5432,
"timeline": 1
},
{
"name": "cray-smd-postgres-1",
"role": "replica",
"state": "running",
"api_url": "http://10.44.0.30:8008/patroni",
"host": "10.44.0.30",
"port": 5432,
"timeline": 1,
"lag": 0
},
{
"name": "cray-smd-postgres-2",
"role": "replica",
"state": "running",
"api_url": "http://10.47.0.33:8008/patroni",
"host": "10.47.0.33",
"port": 5432,
"timeline": 1,
"lag": 0
}
]
}
Checking cray-smd-postgres-1...
{
"members": [
{
"name": "cray-smd-postgres-0",
"role": "leader",
"state": "running",
"api_url": "http://10.32.0.33:8008/patroni",
"host": "10.32.0.33",
"port": 5432,
"timeline": 1
},
{
"name": "cray-smd-postgres-1",
"role": "replica",
"state": "running",
"api_url": "http://10.44.0.30:8008/patroni",
"host": "10.44.0.30",
"port": 5432,
"timeline": 1,
"lag": 0
},
{
"name": "cray-smd-postgres-2",
"role": "replica",
"state": "running",
"api_url": "http://10.47.0.33:8008/patroni",
"host": "10.47.0.33",
"port": 5432,
"timeline": 1,
"lag": 0
}
]
}
And so on for every member of the cluster. This script does not do any checking – it only displays the information.
Check the output for errors or inconsistencies.
In particular, validate the following:
If any of the above are not true, this indicates that the cluster members are no longer properly synchronized. In this case, attempt the Recover replication remediation procedures.
See the Recover from Postgres WAL Event procedure.