PostgreSQL Database is in Recovery

Description

There is a known issue that can occur when a Kubernetes control-plane (master) node is rebooted or rebuilt.

Rebooting a control-plane node can cause a brief interruption of service to the Kubernetes API, which may cause the leader of a PostgreSQL database cluster to demote itself and force a leadership race. When this occurs, a different instance may become the leader of the cluster. Although the Kubernetes service load balancer is correctly updated to reflect this change, the existing client connection to the database is never broken, so the client continues to send transactions to the former leader, which is now a replica and can only handle read-only queries.
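
A quick way to see which pod Patroni currently considers the leader of a given cluster is to inspect the pod labels. This is a minimal sketch that assumes the clusters are managed by the Zalando Postgres Operator and carry its default cluster-name and spilo-role pod labels; the cray-sls-postgres cluster is used only as an example.

    kubectl -n services get pods -l cluster-name=cray-sls-postgres -L spilo-role

A pod whose spilo-role is not the leader role (master or primary, depending on the operator version) but which is still receiving writes from a client is in the state described above.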

Symptoms

  • The spire-server or cray-spire-server pods may be in a CrashLoopBackOff state.

  • The spire-agent or cray-spire-agent pods may be in a CrashLoopBackOff state.

  • Services may fail to acquire tokens from the spire-server or cray-spire-server.

  • The spire-server or cray-spire-server pods contain the following error in the logs.

    2024-02-04T22:57:25.145365496Z time="2024-02-04T22:57:25Z" level=error msg="Could not generate TLS config for gRPC client" address="10.47.0.0:63042" error="get bundle from datastore: datastore-sql: pq: cannot set transaction read-write mode during recovery" subsystem_name=endpoints
    
  • The spire-agent or cray-spire-agent pods contain the following error in the logs.

    time="2024-02-26T22:24:55Z" level=error msg="Agent crashed" error="failed to get SVID: error getting attestation response from SPIRE server: rpc error: code = Internal desc = failed to attest: k8s-sat: unable to get agent info: rpc error: code = Unknown desc = datastore-sql: pq: cannot set transaction read-write mode during recovery"
    
  • Services that rely on PostgreSQL, such as cray-sls, cray-smd, or cray-console-data, may have errors similar to the following in their logs (see the log-check sketch after this list).

    error="datastore-sql: pq: cannot set transaction read-write mode during recovery"
    
  • Some cray commands that query services like SLS or HSM may fail with the following error.

    Error: Internal Server Error: failed to query DB.
    

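To check whether any of these services are currently logging the recovery error, a loop similar to the following can be used. This is a sketch only; kubectl logs on a deployment reads from just one of its pods, and the deployment names are taken from the list above.

    for deploy in cray-sls cray-smd cray-console-data; do
        echo "=== ${deploy} ==="
        # Search the last hour of logs from one pod of each deployment.
        kubectl -n services logs "deployment/${deploy}" --since=1h --all-containers 2>/dev/null \
            | grep -i "read-write mode during recovery" \
            || echo "no matching errors in the last hour"
    done
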
Solution

Option 1 - Apply automated workaround

The workaround is to restart either the PostgreSQL replica that is receiving UPDATE transactions or the service that is sending them. The following script can automatically detect this condition and restart affected PostgreSQL replicas.
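
The condition the script looks for can also be spot-checked by hand. The sketch below asks one of the database pods whether it is a replica (in recovery) and then lists the non-idle client sessions on it; write statements showing up on a pod that reports it is in recovery are the problem case. It assumes the PostgreSQL container in each database pod is named postgres and that local psql access as the postgres user is allowed; the pod name is only an example.

    # Does this pod consider itself a replica? ("t" means it is in recovery / read-only)
    kubectl -n spire exec cray-spire-postgres-2 -c postgres -- \
        psql -U postgres -tAc 'SELECT pg_is_in_recovery();'

    # List the non-idle client sessions on the same pod. UPDATE/INSERT/DELETE
    # statements appearing here on a replica indicate a stale client connection.
    kubectl -n spire exec cray-spire-postgres-2 -c postgres -- \
        psql -U postgres -c "SELECT usename, state, left(query, 60) AS query FROM pg_stat_activity WHERE state <> 'idle' AND pid <> pg_backend_pid();"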

NOTE Do not run this script multiple times in quick succession. The script may report a false positive and restart replicas unnecessarily if the PostgreSQL server logs have not yet advanced beyond the detection window of the last 50 log entries.

  1. (ncn-mw#) Run the check_postgres_replica_transactions.sh script.

    /usr/share/doc/csm/troubleshooting/scripts/check_postgres_replica_transactions.sh
    

    Example output:

    INFO: Checking cray-nls-postgres-0 in namespace argo
    INFO: Checking cray-nls-postgres-1 in namespace argo
    INFO: Checking capsules-dispatch-server-postgres-0 in namespace services
    INFO: Checking capsules-dispatch-server-postgres-1 in namespace services
    INFO: Checking capsules-warehouse-server-postgres-0 in namespace services
    INFO: Checking capsules-warehouse-server-postgres-1 in namespace services
    INFO: Checking cfs-ara-postgres-0 in namespace services
    INFO: Checking cfs-ara-postgres-1 in namespace services
    INFO: Checking cray-console-data-postgres-1 in namespace services
    INFO: Checking cray-dhcp-kea-postgres-0 in namespace services
    INFO: Checking cray-dhcp-kea-postgres-2 in namespace services
    INFO: Checking cray-dns-powerdns-postgres-0 in namespace services
    INFO: Checking cray-dns-powerdns-postgres-1 in namespace services
    INFO: Checking cray-hms-badger-postgres-0 in namespace services
    INFO: Checking cray-hms-badger-postgres-2 in namespace services
    INFO: Checking cray-sls-postgres-0 in namespace services
    INFO: Checking cray-sls-postgres-2 in namespace services
    INFO: Checking cray-smd-postgres-0 in namespace services
    INFO: Checking cray-smd-postgres-1 in namespace services
    INFO: Checking gitea-vcs-postgres-1 in namespace services
    INFO: Checking gitea-vcs-postgres-2 in namespace services
    INFO: Checking keycloak-postgres-0 in namespace services
    INFO: Checking keycloak-postgres-2 in namespace services
    INFO: Checking cray-spire-postgres-1 in namespace spire
    INFO: Checking cray-spire-postgres-2 in namespace spire
    WARN: Restarting PostgreSQL replica Pod cray-spire-postgres-2 that is receiving UPDATE transactions
    INFO: Checking spire-postgres-0 in namespace spire
    WARN: Restarting PostgreSQL replica Pod spire-postgres-0 that is receiving UPDATE transactions
    INFO: Checking spire-postgres-1 in namespace spire
    

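After the script restarts a replica, Patroni can be asked for the state of the whole cluster from inside one of its pods. This is a sketch that assumes the Spilo image used by these pods ships the patronictl tool; the pod name comes from the example output above. Healthy output shows exactly one member in the leader role, with the replicas running or streaming with little or no lag.

    kubectl -n spire exec spire-postgres-1 -c postgres -- patronictl list
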
Option 2 - Manually restart affected services

Restart the affected services or their database clusters to force them to reconnect to the correct database instance.

Spire

  1. (ncn-mw#) Restart the Postgres connection pool.

    • If the errors appear in the spire-server logs, restart the spire-postgres-pooler deployment.

      kubectl -n spire rollout restart deployment spire-postgres-pooler
      
    • If the errors appear in the cray-spire-server logs, restart the cray-spire-postgres-pooler deployment.

      kubectl -n spire rollout restart deployment cray-spire-postgres-pooler
      
  2. (ncn-mw#) Wait for the service to restart and check the logs to ensure that there are no more errors.
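
One way to perform this check is sketched below; it assumes the relevant server pods have spire-server in their names.

    # Wait for the pooler rollout to complete (use cray-spire-postgres-pooler if
    # that was the deployment restarted).
    kubectl -n spire rollout status deployment spire-postgres-pooler

    # Re-check recent logs of the spire-server and cray-spire-server pods for the
    # recovery error; the name match below is a convenience and may need adjusting.
    for pod in $(kubectl -n spire get pods -o name | grep spire-server); do
        kubectl -n spire logs "${pod}" --since=10m --all-containers \
            | grep -i "read-write mode during recovery" \
            || echo "${pod}: no recovery errors in the last 10 minutes"
    done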

Other services

If other services are exhibiting this problem, restart their database cluster.

The example below uses the cray-sls-postgres database cluster. Refer to the output of kubectl get postgresql -A for the full list of database clusters.
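
For example, the following lists every database cluster and its namespace:

    kubectl get postgresql -A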

  1. (ncn-mw#) Restart the impacted Postgres database StatefulSet.

    kubectl -n services rollout restart statefulset cray-sls-postgres
    
  2. (ncn-mw#) Wait for the cluster to restart and check the logs to ensure that there are no more errors.
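
One way to confirm that the restart completed and the cluster is healthy is sketched below; it assumes the Zalando Postgres Operator, which reports an overall status on the postgresql resource.

    # Wait for the StatefulSet rollout to finish.
    kubectl -n services rollout status statefulset cray-sls-postgres

    # The operator reports a status on the postgresql resource; it should return
    # to a running state once the cluster is healthy.
    kubectl -n services get postgresql cray-sls-postgres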