There is a known issue that can occur when a Kubernetes control-plane (master) node is rebooted or rebuilt.
Rebooting a control-plane node can cause a brief interruption of the Kubernetes API service, which may cause the leader of a PostgreSQL database cluster to demote itself and force a leader election. When this occurs, a different instance may become the leader of the cluster. Although the Kubernetes service load balancer is correctly updated to reflect the change, the existing database connection from the client is never broken, so the client continues to send transactions to the former leader, which is now a replica and can only handle read-only queries.
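To see which instance currently holds which role, the Pods of a cluster can be listed along with their Patroni roles. The following is a minimal sketch, assuming the clusters are managed by the Zalando postgres-operator, which applies the `cluster-name` and `spilo-role` labels to each Pod; the namespace and cluster name are examples.

```bash
# List the Pods of the cray-spire Postgres cluster together with their
# current Patroni role (master or replica). The label names follow the
# Zalando postgres-operator conventions and may differ on a given system.
kubectl -n spire get pods -l cluster-name=cray-spire-postgres -L spilo-role
```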
Symptoms of this issue include:

- The `spire-server` or `cray-spire-server` pods may be in a `CrashLoopBackOff` state.
- The `spire-agent` or `cray-spire-agent` pods may be in a `CrashLoopBackOff` state.
- Services may fail to acquire tokens from the `spire-server` or `cray-spire-server`.
- The `spire-server` or `cray-spire-server` pods contain the following error in the logs:

  ```text
  2024-02-04T22:57:25.145365496Z time="2024-02-04T22:57:25Z" level=error msg="Could not generate TLS config for gRPC client" address="10.47.0.0:63042" error="get bundle from datastore: datastore-sql: pq: cannot set transaction read-write mode during recovery" subsystem_name=endpoints
  ```

- The `spire-agent` or `cray-spire-agent` pods contain the following error in the logs:

  ```text
  time="2024-02-26T22:24:55Z" level=error msg="Agent crashed" error="failed to get SVID: error getting attestation response from SPIRE server: rpc error: code = Internal desc = failed to attest: k8s-sat: unable to get agent info: rpc error: code = Unknown desc = datastore-sql: pq: cannot set transaction read-write mode during recovery"
  ```

- Services that rely on PostgreSQL, such as `cray-sls`, `cray-smd`, or `cray-console-data`, may have errors similar to the following in their logs:

  ```text
  error="datastore-sql: pq: cannot set transaction read-write mode during recovery"
  ```

- Some `cray` commands that query services like SLS or HSM may fail with the following error:

  ```text
  Error: Internal Server Error: failed to query DB.
  ```
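The symptoms above can be checked directly from a master node. The following is a minimal sketch; the namespace and Pod name are examples that may differ depending on which SPIRE deployment is in use, and a container may need to be selected with `-c`.

```bash
# Look for spire-server or spire-agent Pods stuck in CrashLoopBackOff.
kubectl -n spire get pods | grep -E 'spire-(server|agent)'

# Inspect recent server logs for the read-only transaction error.
# The Pod name is an example; substitute an actual Pod from the output above.
kubectl -n spire logs cray-spire-server-0 --tail=100 | grep -i 'read-write mode during recovery'
```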
The workaround is to restart either the PostgreSQL replica that is receiving UPDATE transactions or the service that is sending them. The following script can automatically detect this condition and restart affected PostgreSQL replicas.
**NOTE:** Do not run this script multiple times in quick succession. The script may detect a false positive and restart replicas unnecessarily if the PostgreSQL server logs have not yet advanced beyond the detection threshold of the last 50 log entries.
(`ncn-mw#`) Run the `check_postgres_replica_transactions.sh` script.

```bash
/usr/share/doc/csm/troubleshooting/scripts/check_postgres_replica_transactions.sh
```
Example output:

```text
INFO: Checking cray-nls-postgres-0 in namespace argo
INFO: Checking cray-nls-postgres-1 in namespace argo
INFO: Checking capsules-dispatch-server-postgres-0 in namespace services
INFO: Checking capsules-dispatch-server-postgres-1 in namespace services
INFO: Checking capsules-warehouse-server-postgres-0 in namespace services
INFO: Checking capsules-warehouse-server-postgres-1 in namespace services
INFO: Checking cfs-ara-postgres-0 in namespace services
INFO: Checking cfs-ara-postgres-1 in namespace services
INFO: Checking cray-console-data-postgres-1 in namespace services
INFO: Checking cray-dhcp-kea-postgres-0 in namespace services
INFO: Checking cray-dhcp-kea-postgres-2 in namespace services
INFO: Checking cray-dns-powerdns-postgres-0 in namespace services
INFO: Checking cray-dns-powerdns-postgres-1 in namespace services
INFO: Checking cray-hms-badger-postgres-0 in namespace services
INFO: Checking cray-hms-badger-postgres-2 in namespace services
INFO: Checking cray-sls-postgres-0 in namespace services
INFO: Checking cray-sls-postgres-2 in namespace services
INFO: Checking cray-smd-postgres-0 in namespace services
INFO: Checking cray-smd-postgres-1 in namespace services
INFO: Checking gitea-vcs-postgres-1 in namespace services
INFO: Checking gitea-vcs-postgres-2 in namespace services
INFO: Checking keycloak-postgres-0 in namespace services
INFO: Checking keycloak-postgres-2 in namespace services
INFO: Checking cray-spire-postgres-1 in namespace spire
INFO: Checking cray-spire-postgres-2 in namespace spire
WARN: Restarting PostgreSQL replica Pod cray-spire-postgres-2 that is receiving UPDATE transactions
INFO: Checking spire-postgres-0 in namespace spire
WARN: Restarting PostgreSQL replica Pod spire-postgres-0 that is receiving UPDATE transactions
INFO: Checking spire-postgres-1 in namespace spire
```
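Before rerunning the script, whether a given Pod is currently a replica (and therefore read-only) can also be confirmed directly. The following is a minimal sketch, assuming the Spilo database container is named `postgres` and local `psql` access as the `postgres` user is permitted; the Pod name is taken from the example output above.

```bash
# Prints "t" if the Pod is a replica (in recovery) and can only serve
# read-only queries, or "f" if it is the current leader.
kubectl -n spire exec cray-spire-postgres-2 -c postgres -- \
    psql -U postgres -tAc 'SELECT pg_is_in_recovery();'
```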
It is necessary to restart the affected services or their databases in order to force them to connect to the correct instance of the database.
(`ncn-mw#`) Restart the Postgres connection pooler.

- If the `spire-server` has the errors, restart the `spire-postgres-pooler` deployment.

  ```bash
  kubectl -n spire rollout restart deployment spire-postgres-pooler
  ```

- If the `cray-spire-server` has the errors, restart the `cray-spire-postgres-pooler` deployment.

  ```bash
  kubectl -n spire rollout restart deployment cray-spire-postgres-pooler
  ```
(`ncn-mw#`) Wait for the service to restart and check the logs to ensure that there are no more errors.
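One way to verify this, sketched under the assumption that the `cray-spire-postgres-pooler` deployment was the one restarted; the server Pod name is an example, and a container may need to be selected with `-c`.

```bash
# Wait for the pooler rollout to complete.
kubectl -n spire rollout status deployment cray-spire-postgres-pooler

# Re-check recent server logs for the read-only transaction error.
kubectl -n spire logs cray-spire-server-0 --tail=100 \
    | grep -i 'read-write mode during recovery' || echo "No recent errors found"
```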
If other services are exhibiting this problem, restart their database cluster. The example below uses the `cray-sls-postgres` database cluster; refer to the output of `kubectl get postgresql -A` for all possible database clusters.
(`ncn-mw#`) Restart the impacted Postgres database StatefulSet.

```bash
kubectl -n services rollout restart statefulset cray-sls-postgres
```
(`ncn-mw#`) Wait for the cluster to restart and check the logs to ensure that there are no more errors.
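A minimal sketch of that check for the `cray-sls-postgres` example; the service deployment name is an example, and a container may need to be selected with `-c`.

```bash
# Wait for the restarted Postgres StatefulSet to become ready again.
kubectl -n services rollout status statefulset cray-sls-postgres

# Confirm the cluster reports a healthy status to the postgres-operator.
kubectl -n services get postgresql cray-sls-postgres

# Re-check the service logs for the read-only transaction error.
kubectl -n services logs deployment/cray-sls --tail=100 \
    | grep -i 'read-write mode during recovery' || echo "No recent errors found"
```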