Recover from a Liquid Cooled Cabinet EPO Event

Identify an emergency power off (EPO) has occurred and restore cabinets to a healthy state.

CAUTION: Verify the reason why the EPO occurred and resolve that problem before clearing the EPO state.

If a Cray EX liquid-cooled cabinet or cooling group experiences an EPO event, the compute nodes may not boot. Use CAPMC to force off all the chassis affected by the EPO event.

Procedure

  1. Verify that the EPO event did not damage the system hardware.

  2. (ncn-mw#) Check the status of the chassis.

    cray capmc get_xname_status create --xnames x9000c[1,3] --format toml
    

    Example output:

    e = 0
    err_msg = ""
    off = [ "x9000c1", "x9000c3",]
    
  3. (ncn-mw#) Check the Chassis Controller Module (CCM) log for Critical messages and the EPO event.

    ssh x9000c1b0 egrep \"Critical\|= No\" /var/log/messages
    

    Example output:

    Apr 8 03:47:55 x9000c1 user.info redfish-cmmd[4453]: do_cmm_enclosure_reset_forceoff: Handling Enclosure (Force)Off request: clearing EPO = No
    Apr 8 03:47:55 x9000c1 user.info redfish-cmmd[4453]: rbe_set_chassis_status: Update Chassis 'Enclosure' Status: UnavailableOffline, Critical
    Apr 11 04:00:06 x9000c1 user.info redfish-cmmd[4453]: do_cmm_enclosure_reset_forceoff: Handling Enclosure (Force)Off request: clearing EPO = No
    Apr 11 04:00:06 x9000c1 user.info redfish-cmmd[4453]: rbe_set_chassis_status: Update Chassis 'Enclosure' Status: UnavailableOffline, Critical
    
  4. (ncn-mw#) Disable the hms-discovery Kubernetes cron job.

    kubectl -n services patch cronjobs hms-discovery -p '{"spec" : {"suspend" : true }}'
    

    CAUTION: Do not power the system on until it is safe to do so. Determine why the EPO event occurred before clearing the EPO state.

  5. (ncn-mw#) If it is safe to power on the hardware, clear all chassis in the EPO state in the cooling group.

    All chassis in cabinets 1000-1003 are forced off in this example. Power off all chassis in a cooling group simultaneously, or the EPO condition may persist.

    cray capmc xname_off create --xnames x[1000-1003]c[0-7] --force true --format toml
    

    Example output:

    e = 0
    err_msg = ""
    

    The HPE Cray EX EX TDS cabinet contains only two chassis: 1 (bottom) and 3 (top).

    cray capmc xname_off create --xnames x9000c[1,3] --force true --format toml
    

    Example output:

    e = 0
    err_msg = ""
    
  6. (ncn-mw#) Restart the hms-discovery cron job.

    kubectl -n services patch cronjobs hms-discovery -p '{"spec" : {"suspend" : false }}'
    

    About 5 minutes after hms-discovery restarts, the service will power on the chassis enclosures, switches, and compute blades. If components are not being powered back on, then power them on manually.

    cray capmc xname_on create --xnames x[1000-1003]c[0-7]r[0-7],x[1000-1003]c[0-7]s[0-7] --prereq true --continue true --format toml
    

    Example output:

    e = 0
    err_msg = ""
    
  7. Verify the Slingshot fabric is up and healthy.

    Refer to the following documentation for more information on how to verify the health of the Slingshot Fabric:

    • The Slingshot Administration Guide PDF for HPE Cray EX systems.
    • The Slingshot Troubleshooting Guide PDF.
  8. After the components have powered on, boot the nodes using the Boot Orchestration Services (BOS).

    See Power On and Boot Compute and User Access Nodes.