Remove Duplicate Detected Events From the HSM Postgres Database

There is a known bug in the Hardware State Manager (HSM) that causes excessive “Detected” events to be added to the hardware inventory history in the HSM Postgres database. This issue is resolved in the CSM 1.7.0 release.

On large systems, these duplicate events can accumulate over time to such a volume that certain operations (e.g. adding or querying hardware event history) take a long time to complete or time out. We have also seen specific HSM CT tests fail due to issues associated with testing the hardware event history.

This document is meant to provide relief for systems encountering these issues until they are able to upgrade to a CSM release containing the fix. The steps outlined here are repeatable and the scripts can be rerun at any time should the database continue to grow too large.

The steps in this document must also be completed immediately prior to starting an upgrade from CSM 1.5 to CSM 1.6 or from CSM 1.6 to CSM 1.7.

This same material is also present in the CSM 1.5 documentation.

Unfortunately, because of the nature of this bug and the capabilities of the pruning script, the history of FRUIDs associated with CPUs and GPUs within a node may not be fully accurate in systems running CSM releases prior to CSM 1.7. After the system is upgraded to CSM 1.7, the FRUID associations with CPUs and GPUs within a node will be fully accurate from that point on. No other component types were affected by this bug.

Prerequisites

Healthy HSM Postgres Cluster.

Use `patronictl list` on the HSM Postgres cluster to determine the current state of the cluster. A healthy cluster will look similar to the following:

```
kubectl exec cray-smd-postgres-0 -n services -c postgres -it -- patronictl list
```

Example output:

```
+ Cluster: cray-smd-postgres ------+---------+---------+----+-----------+
|        Member       |    Host    |  Role   |  State  | TL | Lag in MB |
+---------------------+------------+---------+---------+----+-----------+
| cray-smd-postgres-0 | 10.44.0.40 | Leader  | running |  1 |           |
| cray-smd-postgres-1 | 10.36.0.37 | Replica | running |  1 |         0 |
| cray-smd-postgres-2 | 10.42.0.42 | Replica | running |  1 |         0 |
+---------------------+------------+---------+---------+----+-----------+
```
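
The health check above can also be scripted. The following is a minimal sketch, assuming the table layout shown in the example output; `check_patroni_health` is a hypothetical helper name, and treating `streaming` as a healthy replica state is an assumption about newer Patroni versions rather than something shown in this document.

```bash
# Hypothetical helper: reads `patronictl list` table output on stdin.
# Succeeds only if exactly one Leader exists and every member is in a
# healthy state ("running"; "streaming" is accepted as an assumption).
check_patroni_health() {
    awk -F'|' '
        /^\|/ {                              # table rows start with "|"
            role = $4; state = $5
            gsub(/ /, "", role); gsub(/ /, "", state)
            if (role == "Role") next         # skip the header row
            members++
            if (role == "Leader") leaders++
            if (state != "running" && state != "streaming") bad++
        }
        END {
            printf "members=%d leaders=%d unhealthy=%d\n", members, leaders, bad
            exit !(members > 0 && leaders == 1 && bad == 0)
        }'
}
```

Example usage: `kubectl exec cray-smd-postgres-0 -n services -c postgres -- patronictl list | check_patroni_health`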

Healthy HSM Service.

Verify that all three `cray-smd-postgres` pods are up and running:

```
kubectl -n services get pods -l cluster-name=cray-smd-postgres
```

Example output:

```
NAME                  READY   STATUS    RESTARTS   AGE
cray-smd-postgres-0   3/3     Running   0          18d
cray-smd-postgres-1   3/3     Running   0          18d
cray-smd-postgres-2   3/3     Running   0          18d
```
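
As a rough sketch, the same pod check can be automated by parsing the `kubectl get pods` table. `check_pods_ready` is a hypothetical helper; it assumes the default column layout shown above (NAME, READY, STATUS) and an expected pod count of 3 unless another count is passed as the first argument.

```bash
# Hypothetical helper: reads `kubectl get pods` output on stdin.
# Fails if the pod count differs from the expected count, or if any pod
# is not Running with all of its containers ready (READY column n/n).
check_pods_ready() {
    awk -v expected="${1:-3}" '
        NR == 1 { next }                     # skip the header row
        NF >= 3 {
            total++
            split($2, ready, "/")
            if ($3 != "Running" || ready[1] != ready[2]) {
                print "not ready: " $1
                bad++
            }
        }
        END { exit !(total == expected && bad == 0) }'
}
```

Example usage: `kubectl -n services get pods -l cluster-name=cray-smd-postgres | check_pods_ready 3`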

Procedure

1. Set an environment variable pointing to the location of the scripts used in this procedure:
```
export HWINV_SCRIPT_DIR="/usr/share/doc/csm/upgrade/scripts/upgrade/smd/"
```
2. Run the `fru_history_backup.sh` script to take a backup of the hardware inventory history table. Runtime will depend on the size of the table.

For large systems we have seen this take up to several hours. Please do not interrupt the operation.
```
${HWINV_SCRIPT_DIR}/fru_history_backup.sh
```
Example output:
```
Determining the postgres leader...
The SMD postgres leader is cray-smd-postgres-2
Using pg_dump to dump the hwinv_hist table...
Dump complete. Dump file is: smd_hwinv_hist_table_backup-07302025-161001.sql
```
The backup file will be located in the current working directory.
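
Before moving on to pruning, it may be worth confirming that a non-empty backup file was actually written. The sketch below assumes the `smd_hwinv_hist_table_backup-*.sql` naming seen in the example output; `latest_hwinv_backup` is a hypothetical helper, not part of the CSM scripts.

```bash
# Hypothetical helper: print the most recently modified backup file in a
# directory (default: current directory). Fails if no backup exists or
# if the newest one is empty.
latest_hwinv_backup() {
    dir=${1:-.}
    # `ls -t` sorts by modification time, newest first
    newest=$(ls -t "$dir"/smd_hwinv_hist_table_backup-*.sql 2>/dev/null | head -n 1)
    [ -n "$newest" ] && [ -s "$newest" ] || return 1
    printf '%s\n' "$newest"
}
```

Example usage: `latest_hwinv_backup "$PWD"` prints the full path of the newest backup in the current working directory.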

3. [OPTIONAL] This step is only required if the hardware inventory history table becomes corrupted during subsequent operations. Should the table need to be restored from the backup generated by the previous step, set `BACKUP_FILE` to the full path of the backup file and run the `fru_history_restore.sh` script.

This step may take several hours to complete. Please do not interrupt it.

```
export BACKUP_FILE="/full/path/to/backup/file.sql"
${HWINV_SCRIPT_DIR}/fru_history_restore.sh
```

Example output:

```
Determining the postgres leader...
The SMD postgres leader is cray-smd-postgres-2
Copying /tmp/examples/smd_hwinv_hist_table_backup-07302025-152732.sql to /tmp/smd_hwinv_hist_table_backup-07302025-152732.sql in the postgres leader pod
Using psql to restore the hwinv_hist table usign specified backup
SET
SET
SET
SET
SET
 set_config
------------

(1 row)

SET
SET
SET
SET
DROP INDEX
DROP INDEX
DROP INDEX
DROP INDEX
DROP INDEX
DROP INDEX
DROP TABLE
SET
SET
CREATE TABLE
ALTER TABLE
COPY 2696634
CREATE INDEX
CREATE INDEX
CREATE INDEX
CREATE INDEX
CREATE INDEX
CREATE INDEX
Removing /tmp/smd_hwinv_hist_table_backup-07302025-152732.sql in the postgres leader pod
Restore complete.
```
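
If the restore output is captured to a file, a quick scan can confirm that it finished and report how many rows were copied. This is a sketch that assumes the output format shown above (a `COPY <n>` line and a final `Restore complete.` line); `check_restore_log` is a hypothetical helper.

```bash
# Hypothetical helper: reads restore script output on stdin, prints the
# row count from the COPY line, and succeeds only if the closing
# "Restore complete." line is present.
check_restore_log() {
    awk '
        $1 == "COPY" && $2 ~ /^[0-9]+$/ { rows = $2 }
        /^Restore complete\.$/          { finished = 1 }
        END {
            if (rows != "") print "rows restored: " rows
            exit !finished
        }'
}
```

Example usage: `${HWINV_SCRIPT_DIR}/fru_history_restore.sh 2>&1 | tee restore.log | check_restore_log`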

4. Run the `fru_history_remove_duplicate_detected_events.sh` pruning script to remove the duplicate “Detected” events from the database.

Similar to the prior steps, on large systems this step may take up to several hours. It is very important to not interrupt the operation and let the script run to completion.

```
${HWINV_SCRIPT_DIR}/fru_history_remove_duplicate_detected_events.sh
```

Example output:

```
Batch size:        ALL (ALL = unlimited)
Max batches:       0 (0 = unlimited)
Replication delay: 1 seconds between batches
Vacuum type:       FULL

Set BATCH_SIZE, MAX_BATCHES, REPLICATION_SLEEP_DELAY, and VACUUM_TYPE variables to override

Determining the postgres leader...
The SMD postgres leader is cray-smd-postgres-2

NOTICE:  hwinv_history row count before pruning:    180,857,684
NOTICE:  hwinv_history table size before pruning:   79831 mb
NOTICE:  Database size before pruning:              79858 mb
DO

Operations may take considerable time - please do not interrupt

Creating hwinvhist_id_ts_idx index on hwinv_hist table...
CREATE INDEX

Pruning hwinv_hist table .
Pruning complete: 180431252 total rows deleted across 1 batches

Running VACUUM FULL on hwinv_hist table...
VACUUM

NOTICE:  hwinv_history row count after pruning:    426,432
NOTICE:  hwinv_history table size after pruning:   199 mb
NOTICE:  Database size after pruning:              226 mb
DO

Total execution time: 0h 17m 32s
```
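
To sanity-check a run, the before and after row counts in the output can be compared. The sketch below assumes the `NOTICE` line wording shown in the example output; `prune_summary` is a hypothetical helper that reads a saved copy of the script output.

```bash
# Hypothetical helper: reads pruning script output on stdin and reports
# how many rows were removed, based on the before/after NOTICE lines.
prune_summary() {
    awk '
        /row count before pruning:/ { before = $NF }
        /row count after pruning:/  { after = $NF }
        END {
            gsub(/,/, "", before); gsub(/,/, "", after)   # strip thousands separators
            if (before == "" || after == "" || before + 0 == 0) exit 1
            removed = before - after
            printf "rows removed: %d (%.2f%% of %d)\n", removed, 100 * removed / before, before
        }'
}
```

For the example output above, `prune_summary` prints `rows removed: 180431252 (99.76% of 180857684)`, which agrees with the `total rows deleted` line reported by the script.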

Should any issues arise requiring restoration of the hardware inventory history table, refer back to the optional restore step (step 3).

We have seen cases where the fru_history_remove_duplicate_detected_events.sh script can fail if there are too many duplicate “Detected” events present in the database. Should this occur, see the alternative procedure described further below.

Alternative Procedure for Large Databases

If the `fru_history_remove_duplicate_detected_events.sh` script fails due to an excessive number of duplicate “Detected” events in the database, the pruning operation must be broken down into smaller batches. Skip this section if the pruning script in the main procedure (step 4) completed successfully.

This procedure uses environment variables to control the batch size and number of iterations. Full descriptions of each variable are provided at the end of this section.

The first step is to determine an appropriate batch size through incremental testing. Begin with conservative settings: `VACUUM_TYPE` set to `ANALYZE`, `MAX_BATCHES` set to 1, and `BATCH_SIZE` set to 100,000:

```
BATCH_SIZE=100000 MAX_BATCHES=1 VACUUM_TYPE=ANALYZE ${HWINV_SCRIPT_DIR}/fru_history_remove_duplicate_detected_events.sh
```

If successful, increase the batch size to 1,000,000:

```
BATCH_SIZE=1000000 MAX_BATCHES=1 VACUUM_TYPE=ANALYZE ${HWINV_SCRIPT_DIR}/fru_history_remove_duplicate_detected_events.sh
```

If successful, increase the batch size to 10,000,000:

```
BATCH_SIZE=10000000 MAX_BATCHES=1 VACUUM_TYPE=ANALYZE ${HWINV_SCRIPT_DIR}/fru_history_remove_duplicate_detected_events.sh
```

If successful, attempt 100,000,000:

```
BATCH_SIZE=100000000 MAX_BATCHES=1 VACUUM_TYPE=ANALYZE ${HWINV_SCRIPT_DIR}/fru_history_remove_duplicate_detected_events.sh
```

The maximum viable batch size is typically either 10,000,000 or 100,000,000.
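
The incremental testing above can be expressed as a loop. This is only a sketch: it assumes the pruning script exits non-zero when a batch fails, and `find_max_batch_size` and the `PRUNE_SCRIPT` override are hypothetical names introduced for illustration. Note that each successful test run does prune one batch of duplicates.

```bash
# Hypothetical helper: try increasing batch sizes, one batch per run, and
# print the last size that succeeded. Assumes the pruning script returns a
# non-zero exit status on failure.
find_max_batch_size() {
    # PRUNE_SCRIPT is an illustrative override; defaults to the real script path
    script=${PRUNE_SCRIPT:-${HWINV_SCRIPT_DIR}/fru_history_remove_duplicate_detected_events.sh}
    last_ok=""
    for size in 100000 1000000 10000000 100000000; do
        # Script output goes to stderr so stdout carries only the result
        if BATCH_SIZE=$size MAX_BATCHES=1 VACUUM_TYPE=ANALYZE "$script" >&2; then
            last_ok=$size
        else
            break
        fi
    done
    [ -n "$last_ok" ] || return 1
    echo "$last_ok"
}
```

Example usage: `BATCH_SIZE=$(find_max_batch_size)` stores the largest batch size that completed, which can then be reused for the remaining runs.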

Once the maximum batch size has been determined, increase the number of batches processed per run by adjusting the MAX_BATCHES variable. Continue running the script until all duplicate events are pruned.
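
One way to "continue running the script" is a loop that stops once a run deletes nothing. The sketch below assumes each run prints a `Pruning complete: <n> total rows deleted ...` line as in the example output, and that a run with no remaining duplicates reports 0 rows; both are assumptions, as are the helper name `prune_until_done` and the `PRUNE_SCRIPT` override. If the summary line is missing, the loop stops after that run rather than guessing.

```bash
# Hypothetical helper: prune_until_done <batch_size> <max_batches>
# Re-runs the pruning script with an ANALYZE vacuum until a run reports
# zero deleted rows, the summary line is missing, or the script fails.
prune_until_done() {
    script=${PRUNE_SCRIPT:-${HWINV_SCRIPT_DIR}/fru_history_remove_duplicate_detected_events.sh}
    runs=0
    while :; do
        out=$(BATCH_SIZE=$1 MAX_BATCHES=$2 VACUUM_TYPE=ANALYZE "$script" 2>&1) || {
            printf '%s\n' "$out" >&2
            return 1
        }
        runs=$((runs + 1))
        # Extract <n> from "Pruning complete: <n> total rows deleted ..."
        deleted=$(printf '%s\n' "$out" |
            sed -n 's/^Pruning complete: \([0-9]*\) total rows deleted.*/\1/p')
        echo "run $runs: ${deleted:-unknown} rows deleted"
        [ "${deleted:-0}" -gt 0 ] || break
    done
}
```

Example usage: `prune_until_done 10000000 5` processes up to five batches of 10,000,000 per run until nothing is left to prune.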

After all duplicates have been removed, run the script one final time with a FULL vacuum to reclaim disk space:

```
VACUUM_TYPE=FULL ${HWINV_SCRIPT_DIR}/fru_history_remove_duplicate_detected_events.sh
```

Note: A FULL vacuum is required to return disk space to the operating system.

Environment Variables

The following environment variables control the behavior of the pruning script:

`BATCH_SIZE`

Default: ALL (no limit)

Limits the number of duplicate events to prune per iteration. When tuning this variable, start with a low value and increase incrementally until a failure occurs, then use the last successful batch size.

`MAX_BATCHES`

Default: 0 (no limit)

Limits the number of batches (iterations) to perform. When determining the appropriate BATCH_SIZE, set this to 1 to test one batch at a time. Once an appropriate BATCH_SIZE is determined, increase this value to process multiple batches per run.

`REPLICATION_SLEEP_DELAY`

Default: 1 (second)

Specifies the sleep delay in seconds between batches to allow database replication to catch up. Adjusting this value is typically not necessary.

`VACUUM_TYPE`

Default: FULL

Controls the type of vacuum to perform after pruning. Options are FULL or ANALYZE. Use ANALYZE during incremental pruning operations to save time. After all duplicates are removed, perform a final FULL vacuum to reclaim disk space and return it to the operating system.
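
Because a mistyped value could waste a long-running attempt, the settings can be checked before invoking the script. The following sketch validates only the value formats and defaults described above; `validate_prune_settings` and `is_uint` are hypothetical helpers, and the pruning script may apply additional checks of its own.

```bash
# Hypothetical helper: succeeds if the argument is a non-negative integer.
is_uint() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
    esac
}

# Hypothetical helper: checks the four pruning variables against the
# formats documented above, using the documented defaults when unset.
validate_prune_settings() {
    [ "${BATCH_SIZE:-ALL}" = ALL ] || is_uint "$BATCH_SIZE" ||
        { echo "BATCH_SIZE must be ALL or an integer" >&2; return 1; }
    is_uint "${MAX_BATCHES:-0}" ||
        { echo "MAX_BATCHES must be an integer (0 = no limit)" >&2; return 1; }
    is_uint "${REPLICATION_SLEEP_DELAY:-1}" ||
        { echo "REPLICATION_SLEEP_DELAY must be an integer number of seconds" >&2; return 1; }
    case "${VACUUM_TYPE:-FULL}" in
        FULL|ANALYZE) ;;
        *) echo "VACUUM_TYPE must be FULL or ANALYZE" >&2; return 1 ;;
    esac
}
```

Example usage: `validate_prune_settings && ${HWINV_SCRIPT_DIR}/fru_history_remove_duplicate_detected_events.sh`, with the variables exported beforehand.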