Cray System Management
  • Cray System Management (CSM) - Release Notes
  • Cray System Management (CSM) Administration Guide
    • Create a Backup of HMS Items for reinstall
    • Component Names (xnames)
    • Restore HSM
    • Validate CSM Health
    • Configure the Cray Command Line Interface (cray CLI)
    • User Access Service (UAS)
      • Add a Volume to UAS
      • Broker Mode UAI Management
      • Choosing UAI Resource Settings
      • Common UAI Configuration
      • Configure End-User UAI Classes for Broker Mode
      • Configure UAIs in UAS
      • Configure a Broker UAI Class
      • Configure a Default UAI Class for Legacy Mode
      • Create UAIs From Specific UAI Images in Legacy Mode
      • Create a UAI
      • Create a UAI Class
      • Create a UAI Resource Specification
      • Create a UAI with Additional Ports
      • Create and Use Default UAIs in Legacy Mode
      • Customize End-User UAI Images
      • Customize the Broker UAI Image
      • Delete a UAI
      • Delete a UAI Class
      • Delete a UAI Image Registration
      • Delete a UAI Resource Specification
      • Delete a Volume Configuration
      • Elements of a UAI
      • End-User UAIs
      • Examine a UAI Using a Direct Administrative Command
      • Legacy Mode User-Driven UAI Management
      • List Available UAI Classes
      • List Available UAI Images in Legacy Mode
      • List Registered UAI Images
      • List UAI Resource Specifications
      • List UAIs
      • List UAS Version Information
      • List Volumes Registered in UAS
      • Log in to a Broker UAI
      • Modify a UAI Class
      • Obtain the Configuration of a UAS Volume
      • Register a UAI Image
      • Clear UAS Configuration
      • Resource Specifications
      • Retrieve Resource Specification Details
      • Retrieve UAI Image Registration Information
      • Setting UAI Timeouts
      • Broker UAI Resiliency and Load Balancing
      • Special Purpose UAIs
      • Start a Broker UAI
      • Troubleshoot Broker UAI SSSD Cannot Use /etc/sssd/sssd.conf
      • Troubleshoot Common Mistakes when Creating a Custom End-User UAI Image
      • Troubleshoot Duplicate Mount Paths in a UAI
      • Troubleshoot Missing or Incorrect UAI Images
      • Troubleshoot Stale Brokered UAIs
      • Troubleshoot UAS / CLI Authentication Issues
      • Troubleshoot UAI Stuck in ContainerCreating
      • Troubleshoot UAIs by Viewing Log Output
      • Troubleshoot UAIs with Administrative Access
      • Troubleshoot UAS Issues
      • Troubleshoot UAS by Viewing Log Output
      • UAI Classes
      • UAI Host Node Selection
      • UAI Host Nodes
      • UAI Image Customization
      • UAI Images
      • UAI Management
      • UAI Network Attachment Customization
      • UAI macvlans Network Attachments
      • UAS Limitations
      • UAS and UAI Legacy Mode Health Checks
      • Update a Resource Specification
      • Update a UAI Image Registration
      • Update a UAS Volume
      • View a UAI Class
      • Volumes
    • sat
      • System Admin Toolkit (SAT) in CSM
    • argo
      • Using Argo Workflows
      • Using the Argo UI
    • bare metal
      • Bare-Metal Steps
      • Fresh Install Setting NodeBMC and RouterBMC Redfish Credentials
    • CSM product management
      • Change Passwords and Credentials
      • Configure CSM packages with CFS
      • Configure Keycloak Account
      • Configure the root password and SSH keys in Vault
      • Post-Install Customizations
      • Redeploying a Chart
      • Remove Artifacts from Product Installations
      • Set up passwordless SSH
      • Validate Signed RPMs
    • firmware
      • FASUpdate Script
      • FAS Admin Procedures
      • FAS CLI
      • Cleaning up FAS Database
      • FAS Filters
      • Backup and Restoring FAS Images
      • Updating Foxconn Paradise Nodes with FAS
      • FAS Recipes
      • Update iLO 5 firmware above v2.78
      • FAS Recipes and Procedures
      • Firmware Upgrade using SPP on HPE ProLiant Servers
      • Update Firmware with FAS
      • Updating BMC Firmware and BIOS for ncn-m001
      • Updating BMC Firmware and BIOS for NCNs without FAS
      • Upload BMC Recovery Firmware into TFTP Server
    • hmcollector
      • Adjust HM Collector Ingress Replicas and Resource Limits
    • observability
      • Install and Upgrade Observability Framework
    • power management
      • Cray Advanced Platform Monitoring and Control (CAPMC)
      • Ignore Nodes with CAPMC
      • Liquid-cooled Node Power Management
      • Power Off Compute Cabinets
      • Power Off Management Cabinets
      • Power Off Storage Cabinets
      • Power Off the External Lustre File System
      • Power On Compute Cabinets
      • Power On and Boot Compute and User Access Nodes
      • Power On and Start the Management Kubernetes Cluster
      • Power On the External Lustre File System
      • Prepare the System for Power Off
      • Recover from a Liquid Cooled Cabinet EPO Event
      • Save Management Network Switch Configuration Settings
      • Set the Turbo Boost Limit
      • Shut Down and Power Off Compute and User Access Nodes
      • Shut Down and Power Off the Management Kubernetes Cluster
      • Standard Rack Node Power Management
      • System Power Off Procedures
      • System Power On Procedures
      • User Access to Compute Node Power Data
      • Power Management
      • Power Control Service
        • Node Card Power Management
        • Power Control Service (PCS)
        • Power Off Compute Cabinets
        • Power On Compute Cabinets
        • Recover from a Liquid Cooled Cabinet EPO Event
    • system configuration service
      • Configure BMC and Controller Parameters with SCSD
      • Manage Parameters with the scsd Service
      • Set BMC Credentials Using SAT
      • System Configuration Service
    • system layout service
      • Add Liquid-Cooled Cabinets to SLS
      • Add UAN CAN IP Addresses to SLS
      • Add an alias to a service
      • Create a Backup of the SLS Postgres Database
      • Dump SLS Information
      • Load SLS Database with Dump File
      • Restore SLS Postgres Database from Backup
      • Restore SLS Postgres without an Existing Backup
      • System Layout Service (SLS)
      • Update SLS with UAN Aliases
    • System Recovery
      • PBS Service Recovery
      • Slurm Service Recovery
      • Beta Procedures for System Recovery
    • security and authentication
      • API Authorization
      • Access the Keycloak User Management UI
      • Add LDAP User Federation
      • Add Root Service Account for Gigabyte Controllers
      • Audit Logs
      • Authenticate an Account with the Command Line
      • Backup and Restore Vault Clusters
      • Certificate Types
      • Change Air-Cooled Node BMC Credentials Using SAT
      • Change Credentials on ServerTech PDUs
      • Change Cray EX Liquid-Cooled Cabinet Global Default Password
      • Change the Keycloak Token Lifetime
      • Set NCN Image Root Password, SSH Keys, and Timezone
      • Set NCN Image Root Password, SSH Keys, and Timezone on PIT Node
      • Change Root Passwords for Compute Nodes
      • Change the Keycloak Admin Password
      • Change the LDAP Server IP Address for Existing LDAP Server Content
      • Change the LDAP Server IP Address for New LDAP Server Content
      • Configure Keycloak for LDAP/AD authentication
      • Configure root user on HPE iLO BMCs
      • Configure the RSA Plugin in Keycloak
      • Create Internal Groups in the Keycloak Shasta Realm
      • Create Internal User Accounts in the Keycloak Shasta Realm
      • Create a Backup of the Keycloak Postgres Database
      • Create a Service Account in Keycloak
      • Default Keycloak Realms, Accounts, and Clients
      • Delete Internal User Accounts in the Keycloak Shasta Realm
      • Get a Long-Lived Token for a Service Account
      • HashiCorp Vault
      • Keycloak Operations
      • Keycloak Service Recovery
      • Keycloak User Localization
      • Keycloak User Management with kcadm.sh
      • Make HTTPS Requests from Sources Outside the Management Kubernetes Cluster
      • Manage Sealed Secrets
      • Manage System Passwords
      • PKI Certificate Authority (CA)
      • PKI Services
      • Preserve Username Capitalization for Users Exported from Keycloak
      • Provisioning a Liquid-Cooled EX Cabinet CEC with Default Credentials
      • Public Key Infrastructure (PKI)
      • Recovering from Mismatched BMC Credentials
      • Remove Internal Groups from the Keycloak Shasta Realm
      • Remove the Email Mapper from the LDAP User Federation
      • Remove the LDAP User Federation from Keycloak
      • Re-Sync Keycloak Users to Compute Nodes
      • Retrieve an Authentication Token
      • Retrieve the Client Secret for Service Accounts
      • Update NCN User SSH Keys
      • System Security and Authentication
      • Transport Layer Security (TLS) for Ingress Services
      • Troubleshoot Common Vault Cluster Issues
      • Troubleshoot Kyverno configuration manually
      • Update Default Air-Cooled BMC and Leaf-BMC Switch SNMP Credentials
      • Update Default ServerTech PDU Credentials used by the Redfish Translation Service (RTS)
      • Set NCN User Passwords
      • Update Root Secrets In Vault
      • Updating the Liquid-Cooled EX Cabinet CEC with Default Credentials after a CEC Password Change
      • Vault Service Recovery
    • utility storage
      • Adding a Ceph Node to the Ceph Cluster
      • Add Ceph OSDs
      • Adjust Ceph Pool Quotas
      • Alternate Storage Pools
      • Ceph Daemon Memory Profiling
      • Ceph Deep Scrubs
      • Ceph Health States
      • Ceph Orchestrator Usage
      • Ceph Service Check Script Usage
      • Ceph Storage Types
      • ceph-upgrade-tool.py Usage
      • Cephadm Reference Material
      • Collect Information about the Ceph Cluster
      • Dump Ceph Crash Data
      • Identify Ceph Latency Issues
      • Manage Ceph Services
      • Move Unmanaged Ceph OSDs
      • Shrink the Ceph Cluster
      • Shrink Ceph OSDs
      • Troubleshoot Ceph-Mon Processes Stopping and Exceeding Max Restarts
      • Troubleshoot Ceph MDS Client Connectivity Issues
      • Troubleshooting Ceph MDS Reporting Slow Requests and Failure on Client
      • Troubleshoot Ceph New RGW Deployment Failing
      • Troubleshoot Ceph OSDs Reporting Full
      • Troubleshoot Ceph Services Not Starting After a Server Crash
      • Troubleshoot Failure to Get Ceph Health
      • Troubleshoot HEALTH_ERR Module devicehealth has failed table Device already exists
      • Troubleshoot Insufficient Standby MDS Daemons Available
      • Troubleshoot Large Object Map Objects in Ceph Health
      • Troubleshoot Pods Failing to Restart on Other Worker Nodes
      • Fixing incorrect number of PG Issues
      • Troubleshoot if RGW Health Check Fails
      • Troubleshoot S3FS Cache Cleanup
      • Troubleshoot S3FS Mount Issues
      • Troubleshoot System Clock Skew
      • Troubleshoot a Down OSD
      • Troubleshoot an Unresponsive Rados-Gateway (radosgw) S3 Endpoint
      • Troubleshoot Ceph image with tag '<none>'
      • Utility Storage
      • Update ceph node-exporter config to monitor SNMP counters
    • multi-tenancy
      • Cray HNC Manager
      • Creating a Tenant
      • Modifying a Tenant
      • Multi-Tenancy Support
      • Removing a Tenant
      • Slurm Operator
      • TAPMS (Tenant and Partition Management System) Overview
      • Tenant Administrator Configuration
      • Multi-Tenancy Vault Overview
    • boot orchestration
      • Boot Orchestration
      • BOS Services
      • BOS Workflows
      • Compute Node Boot Issue Symptom Node Console or Logs Indicate that the Server Response has Timed Out
      • Boot Issue Symptom Node HSN Interface Does Not Appear or Show Detected Links
      • Boot UANs
      • BOS Commands Cheat Sheet
      • Check the Progress of BOS Session Operations
      • Clean Up After a BOS/BOA Job is Completed or Cancelled
      • Clean Up Logs After a BOA Kubernetes Job
      • Component Status
      • BOS Components
      • Compute Node Boot Issue Symptom Duplicate Address Warnings and Declined DHCP Offers in Logs
      • Compute Node Boot Issue Symptom Message About Invalid EEPROM Checksum in Node Console or Log
      • Compute Node Boot Issue Symptom Node is Not Able to Download the Required Artifacts
      • Compute Node Boot Sequence
      • Configure the BOS Timeout When Booting Compute Nodes
      • Create a Session Template to Boot Compute Nodes with CPS
      • Customize iPXE Binary Names
      • Determine Which BOS Session Booted a Node
      • Edit the iPXE Embedded Boot Script
      • Exporting and Importing BOS Data
      • Exporting and Importing BSS Data
      • Healthy Compute Node Boot Process
      • Kernel Boot Parameters
      • Limit the Scope of a BOS Session
      • BOS Limitations for Gigabyte BMC Hardware
      • Log File Locations and Ports Used in Compute Node Boot Troubleshooting
      • Manage a BOS Session
      • Manage a Session Template
      • Multi-tenancy with BOS
      • Node Boot Root Cause Analysis
      • BOS Options
      • Redeploy the iPXE and TFTP Services
      • Rolling Upgrades using BOS
      • BOS Session Templates
      • BOS Sessions
      • Staging Changes with BOS
      • Tools for Resolving Compute Node Boot Issues
      • Troubleshoot Booting Nodes with Hardware Issues
      • Troubleshoot Compute Node Boot Issues Related to Dynamic Host Configuration Protocol (DHCP)
      • Troubleshoot Compute Node Boot Issues Related to Slow Boot Times
      • Troubleshoot Compute Node Boot Issues Related to Trivial File Transfer Protocol (TFTP)
      • Troubleshoot Compute Node Boot Issues Related to Unified Extensible Firmware Interface (UEFI)
      • Troubleshoot Compute Node Boot Issues Related to the Boot Script Service (BSS)
      • Troubleshoot Compute Node Boot Issues Using Kubernetes
      • Troubleshoot UAN Boot Issues
      • Upload Node Boot Information to Boot Script Service (BSS)
      • View the Status of a BOS Session
    • kubernetes
      • About Kubernetes Taints and Labels
      • Kubernetes Encryption
      • About Postgres
      • About etcd
      • About kubectl
      • Backups for Etcd Clusters Running in Kubernetes
      • Kubernetes and Bare Metal etcd Certificate Renewal
      • Check for and Clear etcd Cluster Alarms
      • Check the Health of etcd Clusters
      • Clear Space in an etcd Cluster Database
      • Configure kubectl Credentials to Access the Kubernetes APIs
      • containerd
      • Create a Manual Backup of Bare-Metal etcd Cluster
      • Create a Manual Backup of a Healthy etcd Cluster
      • Determine if Pods are Hitting Resource Limits
      • Disaster Recovery for Postgres
      • Fix Failed to start etcd on Master NCN
      • Increase Kafka Pod Resource Limits
      • Increase the PVC size in an etcd Cluster Database
      • Increase Pod Resource Limits
      • Kubernetes
      • Kubernetes Networking
      • Kubernetes Storage
      • Kyverno policy management
      • Pod Resource Limits
      • Rebuild Unhealthy etcd Clusters
      • Recover from Postgres WAL Event
      • Repopulate Data in etcd Clusters When Rebuilding Them
      • Report the Endpoint Status for etcd Clusters
      • Restore Bare-Metal etcd Clusters from an S3 Snapshot
      • Restore Postgres
      • Restore an etcd Cluster from a Backup
      • Retrieve Cluster Health Information Using Kubernetes
      • TDS Lower CPU Requests
      • Troubleshoot Intermittent HTTP 503 Code Failures
      • Troubleshoot Postgres Database
      • View Postgres Information for System Databases
    • network
      • Management Network User Guide
        • Fresh Install
        • Manual Switch Configuration
        • Added Hardware
        • Generate Switch Configurations
        • Apply Custom Switch Configuration CSM 1.2
        • Apply Switch Configurations
        • CSM Automatic Network Utility
          • CANU Installation
          • Troubleshoot CANU Validation Errors
          • Use CANU to Verify, Generate, or Compare Switch Configurations
          • Generate Switch Configs Including Custom Configurations
          • Initializing CANU
          • Introduction to CANU
          • Quick start guide to CANU
          • Uninstall CANU
          • Update CANU From CSM Release Tarball
          • Use CANU to Generate Full Network Configuration
        • Dell Installation and Configuration Guide
          • Configure Access Control Lists (ACLs)
          • Configure Address Resolution Protocol (ARP)
          • Back Up a Switch Configuration
          • Configure Domain Name System (DNS) Client
          • Configure Domain Name
          • Configure Hostnames
          • Configure Internet Group Management Protocol (IGMP)
          • Configure Link Aggregation Group (LAG)
          • Link layer discovery protocol (LLDP)
          • Configure Locator LED
          • Configure Loopback Interface
          • Configure Management Interface
          • Configure Multiple Spanning Tree Protocol (MSTP)
          • Network Time Protocol (NTP) Client
          • Configure Physical Interfaces
          • Configure QoS
          • Configure Remote Logging
          • Reset Dell Switch Configuration
          • Configure SNMPv2c community
          • Dell SNMPv3 Users
          • Configure Secure Shell (SSH)
          • Configure System Images
          • Perform an Upgrade on Dell Switches
          • Configure Virtual Local Area Networks (VLANs)
          • Configure VLAN Interface
          • VLAN Trunking 802.1Q
        • Using canu-inventory with Ansible
        • Upgrade CANU
        • Collect Data
        • Configuration Management
        • Configuring SNMP in CSM
        • Mellanox Installation and Configuration Guide
          • Access control lists (ACLs)
          • Address resolution protocol (ARP)
          • Backing up switch configuration
          • BGP basics
          • Cable diagnostics
          • Check BGP and MetalLB
          • Check current DHCP leases
          • Check DHCP lease is getting allocated
          • Check HSM
          • Check KEA DHCP logs
          • Computes/UANs/Application Nodes
          • Large Number of DHCP Declines During a Node Boot
          • Domain name system (DNS) client
          • Domain name
          • You are getting an IP address, but not the correct one. Duplicate IP address check
          • Exec banners
          • Hostname
          • IGMP
          • IP filter
          • Key features used in the management network configuration
          • Link aggregation group (LAG)
          • Large
          • Link layer discovery protocol (LLDP)
          • Loopback interface
          • Management interface
          • Example of how to configure Scenario A or B
          • Management network functions in detail
          • Medium
          • Multi-chassis interface
          • MLAG (Multi-Chassis LAG)
          • MLAG
          • Multiple spanning tree protocol (MSTP)
          • Native VLAN
          • TCPDUMP
          • NCNs on Install
          • Network types – Naming and segment Function
          • Network traffic pattern inside of the system
          • Network Time Protocol (NTP) Client
          • Open shortest path first (OSPF) v2
          • Physical interfaces
          • PIM-SM bootstrap router (BSR) and rendezvous-point (RP)
          • Rebooting NCN and PXE fails
          • Remote logging
          • How to connect management network to your campus network
          • Routed interfaces
          • Scenario A network connection via management network
          • Scenario B network connection via high speed network
          • Small
          • SNMPv2c community
          • Mellanox SNMPv3 users
          • Spine-leaf Architecture
          • Spine-leaf architecture
          • Why are spine-leaf architectures becoming more popular?
          • Secure shell (SSH)
          • MAC address table
          • Static routing
          • Confirm the status of the cray-dhcp-kea pods/services
          • System images
          • Test TFTP traffic (Aruba Only)
          • Typical configuration of MLAG link connecting to NCN
          • Typical configuration of MLAG between switches
          • Performing Upgrade On Mellanox Switches
          • Verify the switches are forwarding DHCP traffic
          • Verify BGP
          • Verify the DHCP traffic on the workers
          • Verify route to TFTP
          • Very Large (Exascale)
          • Virtual local area networks (VLANs)
          • VLAN interface
          • VLAN trunking 802.1Q
          • Web user interface (WebUI)
        • Aruba Installation and Configuration Guide
          • 802.1X
          • Access Control Lists (ACLs)
          • Address Resolution Protocol (ARP)
          • Backup a Switch Configuration
          • Border Gateway Protocol (BGP) Basics
          • Bluetooth Capabilities
          • Cable Diagnostics
          • Check BGP and MetalLB
          • Check Current DHCP Leases
          • Check DHCP Lease is Getting Allocated
          • Check HSM
          • Check KEA DHCP Logs
          • Classifier Policies
          • Verify Computes/UANs/Application Nodes
          • Large Number of DHCP Declines During a Node Boot
          • Configure Domain Name Service (DNS) Clients
          • Configure Domain Names
          • Check for Duplicate IP Addresses
          • Configure Exec Banners
          • Configure Hostnames
          • Configure Internet Group Management Protocol (IGMP)
          • Initial Prioritization
          • Introduction
          • Key Features Used in the Management Network Configuration
          • Link Aggregation Group (LAG)
          • Link Layer Discovery Protocol (LLDP)
          • Locator LED
          • Loopback Interface
          • MAC Authentication
          • Management Interface
          • Example of How to Configure Scenario A or B
          • System Management Network Functions
          • VSX ISL HA
          • VSX MCLAG Link HA
          • VSX Member Power Failure
          • VSX Split
          • Multi-Chassis Link Aggregation Group (MCLAG)
          • Message-Of-The-Day (MOTD)
          • Multicast Source Discovery Protocol (MSDP)
          • Multiple Spanning Tree Protocol (MSTP)
          • Native VLAN
          • NCN tcpdump
          • NCNs on Install
          • Network Types – Naming and Segment Function
          • Network Topologies
          • Network Traffic Pattern
          • Notices
          • Network Time Protocol (NTP) Client
          • Open Shortest Path First (OSPF) v2
          • Physical Interfaces
          • PIM-SM Bootstrap Router (BSR) and Rendezvous Point (RP)
          • Port Mirroring
          • Port Security
          • Queuing and Scheduling
          • RADIUS
          • Rebooting NCNs and PXE Fails
          • Redundant Power Supplies
          • Remote Logging
          • Connect the Management Network to a Campus Network
          • Routed interfaces
          • Scenario A Network Connection via Management Network
          • Scenario B Network Connection via High-Speed Network
          • Simple Network Management Protocol (SNMP) Agent
          • SNMPv2c Community
          • SNMP traps
          • Aruba SNMPv3 Users
          • Spine-Leaf Architecture
          • Spine-leaf Architecture
          • Secure Shell (SSH)
          • Static Routing
          • Confirm the Status of the cray-dhcp-kea Pods
          • TACACS
          • Test TFTP Traffic (Aruba Only)
          • Typical Configuration of VSX
          • Typical Edge Port Configuration
          • Typical Configuration of MCLAG Link
          • Unidirectional Link Detection (UDLD)
          • Perform a VSX Upgrade on Aruba Switches
          • Verify the Switches are Forwarding DHCP Traffic
          • Verify BGP
          • Verify the DHCP Traffic on the Worker Nodes
          • Verify Route to TFTP
          • Virtual Local Area Networks (VLANs)
          • VLAN Interface
          • VLAN Trunking 802.1Q
          • Virtual Switching Framework (VSF) - 6300 Only
          • Virtual Switching Extension (VSX)
          • VSX Architecture
          • Switch Replacement in the VSX Cluster
          • VSX Sync
          • Web User Interface (WebUI)
          • Erase All zeroize
        • Edge switch cabling guide
        • Network Tests
        • Reinstall
        • Replace Switch
        • Save a Configuration
        • Prometheus SNMP Exporter
        • Transceivers and Cables
        • Example of the Connections Used in Shasta Management Network
        • Validate Cabling
        • Validate the SHCD
        • Validate Switch Configurations
        • Wipe Management Switch Configuration
        • Aruba splitting of QSFP+ and QSFP28 ports
        • Backup a Custom Configuration
        • BICAN Support Matrix - Shasta Customer Access Networks
        • BICAN switch configuration
        • Bifurcating the CAN - Feature Details
        • BICAN Summary
        • Bonded UAN Configuration
        • Cable Management Network Servers
        • firmware
          • Update Management Network Firmware
        • hardware
          • EX2500 Installation and Cabling
      • Access to System Management Services
      • Connect to Switch over USB-Serial Cable
      • Connect to the HPE Cray EX Environment
      • Create a CSM Configuration Upgrade Plan
      • Default IP Address Ranges
      • Gateway Testing
      • Network
      • customer accessible networks
        • Connect to the CMN and CAN
        • Customer Access Networks
          • network
            • Enabling Customer High Speed Network Routing
            • Management Network Upgrade CSM 1.2 to 1.3
          • scripts
            • sls
              • sls utils Library
        • Customer Accessible Networks
        • CAN/CMN with Dual-Spine Configuration
        • Externally Exposed Services
        • Troubleshoot CMN issues
        • BI-CAN Aruba/Arista Configuration
        • MetalLB Peering with Arista Edge Router
      • dhcp
        • DHCP boot file customization
        • DHCP
        • Troubleshoot DHCP Issues
      • dns
        • Domain Name Service (DNS) Overview
        • Enable nscd on UANs
        • Manage the DNS Unbound Resolver
        • PowerDNS Configuration
        • PowerDNS Migration Guide
        • Troubleshoot Common DNS Issues
        • Troubleshoot PowerDNS
      • external dns
        • External DNS
        • External DNS Failing to Discover Services Workaround
        • External DNS CSI Input Values
        • Ingress Routing
        • Troubleshoot DNS Configuration Issues
        • Troubleshoot Connectivity to Services with External IP addresses
        • Update the cmn-external-dns value post-installation
      • metallb bgp
        • Check BGP Status and Reset Sessions
        • MetalLB Configuration
        • MetalLB in BGP-Mode
        • Troubleshoot BGP not Accepting Routes from MetalLB
        • Troubleshoot Services without an Allocated IP Address
    • resiliency
      • Recreate StatefulSet Pods on Another Node
      • Resilience of System Management Services
      • Resiliency
      • Resiliency Testing Procedure
      • Restore System Functionality if a Kubernetes Worker Node is Down
    • hpe pdu
      • HPE PDU Admin Procedures
    • node management
      • Access and Update Settings for Replacement NCNs
      • Removing a Liquid-cooled blade from a System
      • Removing a Liquid-cooled blade from a System Using SAT
      • Removing a Standard rack node from a System
      • Replace a Compute Blade
      • Replace a Compute Blade Using SAT
      • Replace a Standard rack node from a System
      • Replacing Foxconn Username and Passwords in Vault
      • Add TLS Certificates to BMCs
      • Repurpose a Compute Node as a UAN
      • Add a Standard Rack Node
      • Reset Credentials on Redfish Devices
      • Add Additional Air-Cooled Cabinets to a System
      • S3FS Usage and Guidelines for Shasta
      • Add Additional Liquid-Cooled Cabinets to a System
      • Set Gigabyte Node BMC to Factory Defaults
      • Adding a Liquid-cooled Blade to a System
      • Swap a Compute Blade with a Different System
      • Adding a Liquid-cooled blade to a System Using SAT
      • Swap a Compute Blade with a Different System Using SAT
      • Boot a storage node into new image without upgrading CSM
      • Switch PXE Boot from Onboard NIC to PCIe
      • Build NCN Images Locally
      • TLS Certificates for Redfish BMCs
      • Change Java Security Settings
      • Troubleshoot Interfaces with IP Address Issues
      • Change Settings for HMS Collector Polling of Air-Cooled Nodes
      • Troubleshoot Issues with Redfish Endpoint Discovery
      • Check and Set the metal.no-wipe Setting on NCNs
      • Troubleshoot Loss of Console Connections and Logs on Gigabyte Nodes
      • Check the BMC Failover Mode
      • Update Compute Node Mellanox HSN NIC Firmware
      • Clear Space in Root File System on Worker Nodes
      • Update the Gigabyte Node BIOS Time
      • Configuration of NCN Bonding
      • Update the HPE Node BIOS Time
      • Configure NTP on NCNs
      • Updating Cabinet Routes on Management NCNs
      • Customize PCIe Hardware
      • Use the Physical KVM
      • Verify Node Removal
      • Defragment NID Numbering
      • View BIOS Logs for Liquid-Cooled Nodes
      • Disable Nodes
      • Manual Wipe Procedures
      • Dump a Non-Compute Node
      • Clear Gigabyte CMOS
      • Enable Nodes
      • Enable Passwordless Connections to Liquid Cooled Node BMCs
      • Enable IPMI access on HPE iLO BMCs
      • Find Node Type and Manufacturer
      • Launch a Virtual KVM on Gigabyte Servers
      • Launch a Virtual KVM on Intel Servers

System Recovery

Topics:

  1. PBS Service Recovery
  2. Slurm Service Recovery
  3. Beta Procedures for System Recovery