Remove Duplicate Detected Events From the HSM Postgres Database

There is a known bug in the Hardware State Manager (HSM) that causes an excessive number of “Detected” events to be added to the hardware inventory history in the HSM Postgres database. This issue has been resolved in the CSM 1.7.0 release.

On large systems, these duplicate events can accumulate over time to a volume at which certain operations (e.g. adding or querying hardware event history) take a very long time to complete or time out. We have also seen specific HSM CT tests that exercise the hardware event history fail as a result.

This document is meant to provide relief for systems encountering these issues until they are able to upgrade to a CSM release containing the fix. The steps outlined here are repeatable and the scripts can be rerun at any time should the database continue to grow too large.

The steps in this document must also be completed immediately prior to starting a CSM 1.6.0 upgrade.

Prerequisites

  • Healthy HSM Postgres Cluster.

    Use patronictl list on the HSM Postgres cluster to determine the current state of the cluster. A healthy cluster will look similar to the following:

    kubectl exec cray-smd-postgres-0 -n services -c postgres -it -- patronictl list
    

    Example output:

    + Cluster: cray-smd-postgres ------+---------+---------+----+-----------+
    |        Member       |    Host    |  Role   |  State  | TL | Lag in MB |
    +---------------------+------------+---------+---------+----+-----------+
    | cray-smd-postgres-0 | 10.44.0.40 | Leader  | running |  1 |           |
    | cray-smd-postgres-1 | 10.36.0.37 | Replica | running |  1 |         0 |
    | cray-smd-postgres-2 | 10.42.0.42 | Replica | running |  1 |         0 |
    +---------------------+------------+---------+---------+----+-----------+
    
  • Healthy HSM Service.

    Verify that all three cray-smd-postgres pods backing the HSM service are up and running:

    kubectl -n services get pods -l cluster-name=cray-smd-postgres
    

    Example output:

    NAME                  READY   STATUS    RESTARTS   AGE
    cray-smd-postgres-0   3/3     Running   0          18d
    cray-smd-postgres-1   3/3     Running   0          18d
    cray-smd-postgres-2   3/3     Running   0          18d
    
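The two prerequisite checks above can also be scripted. The following is a minimal sketch, not part of the shipped scripts: the function names patroni_cluster_healthy and pods_all_ready are hypothetical, and the parsing assumes the table layouts shown in the example outputs. Accepting "streaming" as a healthy replica state is an additional assumption, since newer Patroni versions may report replicas that way.

```shell
#!/usr/bin/env bash
# Sketch only: hypothetical helpers that read command output on stdin.

# Succeeds if `patronictl list` output shows exactly one Leader and every
# member in the "running" (or, by assumption, "streaming") state.
patroni_cluster_healthy() {
    awk -F'|' '
        # Member rows are pipe-delimited; skip the header row and border lines.
        NF >= 6 && $2 !~ /Member/ {
            role = $4; state = $5
            gsub(/ /, "", role); gsub(/ /, "", state)
            members++
            if (role == "Leader") leaders++
            if (state != "running" && state != "streaming") unhealthy++
        }
        END { exit !(members > 0 && leaders == 1 && unhealthy == 0) }
    '
}

# Succeeds if `kubectl get pods` output lists the expected number of pods
# (default 3), all fully ready (for example 3/3) and in the Running status.
pods_all_ready() {
    awk -v want="${1:-3}" '
        NR > 1 {                      # skip the header row
            split($2, r, "/")
            if (r[1] == r[2] && $3 == "Running") ok++
            total++
        }
        END { exit !(total == want && ok == total) }
    '
}

# Possible usage (drop -it when piping the output):
#   kubectl exec cray-smd-postgres-0 -n services -c postgres -- patronictl list \
#       | patroni_cluster_healthy && echo "Postgres cluster healthy"
#   kubectl -n services get pods -l cluster-name=cray-smd-postgres \
#       | pods_all_ready 3 && echo "All pods ready"
```

Both functions only inspect text, so they can be tried against saved output before being pointed at a live cluster.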

Procedure

  1. Set an environment variable pointing to the location of the scripts we will be using:

    export HWINV_SCRIPT_DIR="/usr/share/doc/csm/upgrade/scripts/upgrade/smd/"
    
  2. Run the fru_history_backup.sh script to take a backup of the hardware inventory history table. Runtime will depend on the size of the table.

    For large systems we have seen this take up to several hours. Please do not interrupt the operation.

    ${HWINV_SCRIPT_DIR}/fru_history_backup.sh
    

    Example output:

    Determining the postgres leader...
    The SMD postgres leader is cray-smd-postgres-2
    Using pg_dump to dump the hwinv_hist table...
    Dump complete. Dump file is: smd_hwinv_hist_table_backup-07302025-161001.sql
    

    The backup file will be located in the current working directory.

  3. [OPTIONAL] This step is needed only if the hardware inventory history table becomes corrupted during subsequent operations. To restore the table from the backup generated in the previous step, set BACKUP_FILE to the full path of the backup file and run the fru_history_restore.sh script.

    This step may take several hours to complete. Please do not interrupt it.

    export BACKUP_FILE="/full/path/to/backup/file.sql"
    ${HWINV_SCRIPT_DIR}/fru_history_restore.sh
    

    Example output:

    Determining the postgres leader...
    The SMD postgres leader is cray-smd-postgres-2
    Copying /tmp/examples/smd_hwinv_hist_table_backup-07302025-152732.sql to /tmp/smd_hwinv_hist_table_backup-07302025-152732.sql in the postgres leader pod
    Using psql to restore the hwinv_hist table usign specified backup
    SET
    SET
    SET
    SET
    SET
     set_config
    ------------
    
    (1 row)
    
    SET
    SET
    SET
    SET
    DROP INDEX
    DROP INDEX
    DROP INDEX
    DROP INDEX
    DROP INDEX
    DROP INDEX
    DROP TABLE
    SET
    SET
    CREATE TABLE
    ALTER TABLE
    COPY 2696634
    CREATE INDEX
    CREATE INDEX
    CREATE INDEX
    CREATE INDEX
    CREATE INDEX
    CREATE INDEX
    Removing /tmp/smd_hwinv_hist_table_backup-07302025-152732.sql in the postgres leader pod
    Restore complete.
    
  4. Run the fru_history_remove_duplicate_detected_events.sh pruning script to remove the duplicate “Detected” events from the database.

    As with the prior steps, this step may take several hours on large systems. Do not interrupt the operation; let the script run to completion.

    ${HWINV_SCRIPT_DIR}/fru_history_remove_duplicate_detected_events.sh
    

    Example output:

    Determining the postgres leader...
    The SMD postgres leader is cray-smd-postgres-2
    NOTICE:  hwinv_history table size before pruning:  652 mb
    NOTICE:  Database size before pruning:             666 mb
    DO
    
    Operations may take considerable time - please do not interrupt
    
    Creating hwinvhist_id_ts_idx index on hwinv_hist table...
    CREATE INDEX
    
    Pruning hwinv_hist table...
    DELETE 1740689
    
    Running VACUUM FULL on hwinv_hist table to reclaim disk space...
    VACUUM
    
    NOTICE:  hwinv_history table size after pruning:  268 mb
    NOTICE:  Database size after pruning:             282 mb
    DO
    

    Should any issues arise requiring restoration of the hardware inventory history table, please refer back to step 3.
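
The backup and pruning steps above can be tied together with a couple of small helpers. The following is a rough sketch under stated assumptions, not part of the shipped scripts: the function names latest_backup_file and reclaimed_table_mb are hypothetical, the backup file name pattern and the NOTICE lines are taken from the example outputs above, and the size arithmetic assumes both table sizes are reported in the same unit.

```shell
#!/usr/bin/env bash
# Sketch only: hypothetical helpers around the documented scripts.

# Print the newest non-empty backup file in a directory (default: the current
# working directory, where fru_history_backup.sh writes its dump).
latest_backup_file() {
    local dir="${1:-.}" f newest=""
    for f in "$dir"/smd_hwinv_hist_table_backup-*.sql; do
        [ -s "$f" ] || continue       # skips an unmatched glob and empty files
        if [ -z "$newest" ] || [ "$f" -nt "$newest" ]; then
            newest="$f"
        fi
    done
    [ -n "$newest" ] && printf '%s\n' "$newest"
}

# Read the pruning script output on stdin and print how much the hwinv_hist
# table shrank. Assumes both sizes use the same unit (for example "mb").
reclaimed_table_mb() {
    awk '
        /table size before pruning:/ { before = $(NF-1) }
        /table size after pruning:/  { after  = $(NF-1) }
        END {
            if (before == "" || after == "") exit 1
            print before - after
        }
    '
}

# Possible usage. NOTICE lines are assumed to arrive on stderr, hence 2>&1;
# prune.log is a hypothetical file name.
#   "${HWINV_SCRIPT_DIR}/fru_history_backup.sh"
#   BACKUP_FILE="$(realpath "$(latest_backup_file)")" && export BACKUP_FILE
#   "${HWINV_SCRIPT_DIR}/fru_history_remove_duplicate_detected_events.sh" 2>&1 | tee prune.log
#   reclaimed_table_mb < prune.log
```

Applied to the example output in step 4 (652 mb before, 268 mb after), reclaimed_table_mb prints 384. Exporting BACKUP_FILE before pruning means the full path is already set if the restore in step 3 becomes necessary.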