Cray System Management
  • Cray System Management (CSM) - Release Notes
  • Cray System Management (CSM) Administration Guide
    • Accessing LiveCD USB Device After Reboot
    • Component Names (xnames)
    • Validate CSM Health
    • Configure the Cray Command Line Interface (cray CLI)
    • User Access Service (UAS)
      • Add a Volume to UAS
      • Broker Mode UAI Management
      • Choosing UAI Resource Settings
      • Common UAI Configuration
      • Configure End-User UAI Classes for Broker Mode
      • Configure UAIs in UAS
      • Configure a Broker UAI Class
      • Configure a Default UAI Class for Legacy Mode
      • Create UAIs From Specific UAI Images in Legacy Mode
      • Create a UAI
      • Create a UAI Class
      • Create a UAI Resource Specification
      • Create a UAI with Additional Ports
      • Create and Use Default UAIs in Legacy Mode
      • Customize End-User UAI Images
      • Customize the Broker UAI Image
      • Delete a UAI
      • Delete a UAI Class
      • Delete a UAI Image Registration
      • Delete a UAI Resource Specification
      • Delete a Volume Configuration
      • Elements of a UAI
      • End-User UAIs
      • Examine a UAI Using a Direct Administrative Command
      • Legacy Mode User-Driven UAI Management
      • List Available UAI Classes
      • List Available UAI Images in Legacy Mode
      • List Registered UAI Images
      • List UAI Resource Specifications
      • List UAIs
      • List UAS Version Information
      • List Volumes Registered in UAS
      • Log in to a Broker UAI
      • Modify a UAI Class
      • Obtain the Configuration of a UAS Volume
      • Register a UAI Image
      • Clear UAS Configuration
      • Resource Specifications
      • Retrieve Resource Specification Details
      • Retrieve UAI Image Registration Information
      • Setting UAI Timeouts
      • Broker UAI Resiliency and Load Balancing
      • Special Purpose UAIs
      • Start a Broker UAI
      • Troubleshoot Broker UAI SSSD Cannot Use /etc/sssd/sssd.conf
      • Troubleshoot Common Mistakes when Creating a Custom End-User UAI Image
      • Troubleshoot Duplicate Mount Paths in a UAI
      • Troubleshoot Missing or Incorrect UAI Images
      • Troubleshoot Stale Brokered UAIs
      • Troubleshoot UAS / CLI Authentication Issues
      • Troubleshoot UAI Stuck in ContainerCreating
      • Troubleshoot UAIs by Viewing Log Output
      • Troubleshoot UAIs with Administrative Access
      • Troubleshoot UAS Issues
      • Troubleshoot UAS by Viewing Log Output
      • UAI Classes
      • UAI Host Node Selection
      • UAI Host Nodes
      • UAI Image Customization
      • UAI Images
      • UAI Management
      • UAI Network Attachment Customization
      • UAI macvlans Network Attachments
      • UAS Limitations
      • UAS and UAI Legacy Mode Health Checks
      • Update a Resource Specification
      • Update a UAI Image Registration
      • Update a UAS Volume
      • View a UAI Class
      • Volumes
    • resiliency
      • NTP Resiliency
      • Recreate StatefulSet Pods on Another Node
      • Resilience of System Management Services
      • Resiliency
      • Resiliency Testing Procedure
      • Restore System Functionality if a Kubernetes Worker Node is Down
    • system management health
      • Access System Management Health Services
      • Configure Prometheus Email Alert Notifications
      • Grafana Dashboards by Component
      • Grafterm
      • Remove Kiali
      • System Management Health
      • System Management Health Checks and Alerts
      • Troubleshoot Grafana Dashboard
      • Troubleshoot Prometheus Alerts
    • node management
      • Access and Update Settings for Replacement NCNs
      • Removing a Liquid-cooled blade from a System
      • Replace a Compute Blade
      • Reset Credentials on Redfish Devices
      • S3FS Usage and Guidelines for Shasta
      • Swap a Compute Blade with a Different System
      • Add TLS Certificates to BMCs
      • TLS Certificates for Redfish BMCs
      • Add a Standard Rack Node
      • Troubleshoot Interfaces with IP Address Issues
      • Add Additional Liquid-Cooled Cabinets to a System
      • Troubleshoot Issues with Redfish Endpoint Discovery
      • Adding a Liquid-cooled Blade to a System
      • Troubleshoot Loss of Console Connections and Logs on Gigabyte Nodes
      • Build NCN Images Locally
      • Update Compute Node Mellanox HSN NIC Firmware
      • Change Java Security Settings
      • Update the Gigabyte Node BIOS Time
      • Change Settings for HMS Collector Polling of Air-Cooled Nodes
      • Updating Cabinet Routes on Management NCNs
      • Change Settings in the Bond
      • Use the Physical KVM
      • Check and Set the metal.no-wipe Setting on NCNs
      • Verify Node Removal
      • Check the BMC Failover Mode
      • View BIOS Logs for Liquid-Cooled Nodes
      • Clear Space in Root File System on Worker Nodes
      • Customize PCIe Hardware
      • Configuration of NCN Bonding
      • Configure NTP on NCNs
      • Disable Nodes
      • Dump a Non-Compute Node
      • Enable Nodes
      • Enable Passwordless Connections to Liquid Cooled Node BMCs
      • Find Node Type and Manufacturer
      • Launch a Virtual KVM on Gigabyte Servers
      • Launch a Virtual KVM on Intel Servers
      • Move a Standard Rack Node
      • Move a Standard Rack Node (Same Rack/Same HSN Ports)
      • Move a liquid-cooled blade within a System
      • NCN Drive Identification
      • Node Management
      • Node Management Workflows
      • Reboot NCNs
      • Rebuild NCNs
        • Final Validation Steps
        • Identify Nodes and Update Metadata
        • Post Rebuild Storage Node Validation
        • Power Cycle and Rebuild Nodes
        • Prepare Storage Nodes
        • Re-Add a Storage Node to Ceph
        • Rebuild NCNs
        • Validate Boot Loader
        • Wipe Drives
      • Add Remove Replace NCNs
        • Add NCN Data
        • Alpha Framework to Add, Remove, Replace, or Move NCNs
        • Add Switch Configuration for NCN
        • Allocate NCN IP Addresses
        • Boot NCN
        • Collect NCN MAC Addresses
        • Redeploy Services Impacted by Adding or Permanently Removing Storage Nodes
        • Remove NCN Data
        • Remove NCN from Role
        • Remove Switch Configuration for NCN
        • Update Firmware
        • Validate Health
        • Validate Added NCN
    • conman
      • Access Compute Node Logs
      • Access Console Log Data Via the System Monitoring Framework (SMF)
      • ConMan
      • Disable ConMan After the System Software Installation
      • Establish a Serial Connection to NCNs
      • Log in to a Node Using ConMan
      • Manage Node Consoles
      • Troubleshoot ConMan Asking for Password on SSH Connection
      • Troubleshoot ConMan Blocking Access to a Node BMC
      • Troubleshoot ConMan Failing to Connect to a Console
    • image management
      • Build a New UAN Image Using the Default Recipe
      • Build an Image Using IMS REST Service
      • Configure IMS to Validate RPMs
      • Convert TGZ Archives to SquashFS Images
      • Create UAN Boot Images
      • Customize an Image Root Using IMS
      • Delete or Recover Deleted IMS Content
      • Image Management
      • Image Management Workflows
      • Import an External Image to IMS
      • Update IMS Job Access Network
      • Upload and Register an Image Recipe
    • preinstall
      • Fresh Install Setting NodeBMC and RouterBMC Redfish Credentials
      • Change Credentials on ServerTech PDUs
      • Pre-Install Steps
    • system layout service
      • Add Liquid-Cooled Cabinets to SLS
      • Add UAN CAN IP Addresses to SLS
      • Add an alias to a service
      • Create a Backup of the SLS Postgres Database
      • Dump SLS Information
      • Load SLS Database with Dump File
      • Restore SLS Postgres Database from Backup
      • Restore SLS Postgres without an Existing Backup
      • System Layout Service (SLS)
      • Update SLS with UAN Aliases
    • utility storage
      • Adding a Ceph Node to the Ceph Cluster
      • Add Ceph OSDs
      • Adjust Ceph Pool Quotas
      • Alternate Storage Pools
      • Ceph Daemon Memory Profiling
      • Ceph Deep Scrubs
      • Ceph Health States
      • Ceph Orchestrator Usage
      • Ceph Service Check Script Usage
      • Ceph Storage Types
      • Cephadm Reference Material
      • Collect Information about the Ceph Cluster
      • Dump Ceph Crash Data
      • Identify Ceph Latency Issues
      • Manage Ceph Services
      • Shrink the Ceph Cluster
      • Restore Nexus Data After Data Corruption
      • Shrink Ceph OSDs
      • Troubleshoot Ceph-Mon Processes Stopping and Exceeding Max Restarts
      • Troubleshoot Ceph MDS Client Connectivity Issues
      • Troubleshooting Ceph MDS Reporting Slow Requests and Failure on Client
      • Troubleshoot Ceph OSDs Reporting Full
      • Troubleshoot Ceph Services Not Starting After a Server Crash
      • Troubleshoot Failure to Get Ceph Health
      • Troubleshoot Insufficient Standby MDS Daemons Available
      • Troubleshoot Large Object Map Objects in Ceph Health
      • Troubleshoot Pods Failing to Restart on Other Worker Nodes
      • Troubleshoot if RGW Health Check Fails
      • Troubleshoot S3FS Mount Issues
      • Troubleshoot System Clock Skew
      • Troubleshoot a Down OSD
      • Troubleshoot an Unresponsive Rados-Gateway (radosgw) S3 Endpoint
      • Utility Storage
    • hardware state manager
      • Add a Switch to the HSM Database
      • Add an NCN to the HSM Database
      • Component Group Members
      • Component Groups and Partitions
      • Component Memberships
      • Component Partition Members
      • Create a Backup of the HSM Postgres Database
      • HSM Roles and Subroles
      • Hardware Management Services (HMS) Locking API
      • Hardware State Manager (HSM)
      • Hardware State Manager (HSM) State and Flag Fields
      • Lock and Unlock Management Nodes
      • Manage Component Groups
      • Manage Component Partitions
      • Manage HMS Locks
      • Restore Hardware State Manager (HSM) Postgres Database from Backup
      • Restore Hardware State Manager (HSM) Postgres without an Existing Backup
      • Set BMC Management Roles
    • security and authentication
      • API Authorization
      • Access the Keycloak User Management UI
      • Add LDAP User Federation
      • Add Root Service Account for Gigabyte Controllers
      • Audit Logs
      • Authenticate an Account with the Command Line
      • Backup and Restore Vault Clusters
      • Certificate Types
      • Change Air-Cooled Node BMC Credentials
      • Change Credentials on ServerTech PDUs
      • Change Cray EX Liquid-Cooled Cabinet Global Default Password
      • Change the Keycloak Token Lifetime
      • Set NCN Image Root Password, SSH Keys, and Timezone
      • Set NCN Image Root Password, SSH Keys, and Timezone on PIT Node
      • Change Root Passwords for Compute Nodes
      • Change SNMP Credentials on Leaf-BMC Switches
      • Change the Keycloak Admin Password
      • Change the LDAP Server IP Address for Existing LDAP Server Content
      • Change the LDAP Server IP Address for New LDAP Server Content
      • Configure Keycloak for LDAP/AD authentication
      • Configure the RSA Plugin in Keycloak
      • Create Internal Groups in the Keycloak Shasta Realm
      • Create Internal User Accounts in the Keycloak Shasta Realm
      • Create a Backup of the Keycloak Postgres Database
      • Create a Service Account in Keycloak
      • Default Keycloak Realms, Accounts, and Clients
      • Delete Internal User Accounts in the Keycloak Shasta Realm
      • Get a Long-Lived Token for a Service Account
      • HashiCorp Vault
      • Keycloak Operations
      • Keycloak User Localization
      • Keycloak User Management with kcadm.sh
      • Make HTTPS Requests from Sources Outside the Management Kubernetes Cluster
      • Manage Sealed Secrets
      • Manage System Passwords
      • PKI Certificate Authority (CA)
      • PKI Services
      • Preserve Username Capitalization for Users Exported from Keycloak
      • Provisioning a Liquid-Cooled EX Cabinet CEC with Default Credentials
      • Public Key Infrastructure (PKI)
      • Recovering from Mismatched BMC Credentials
      • Remove Internal Groups from the Keycloak Shasta Realm
      • Remove the Email Mapper from the LDAP User Federation
      • Remove the LDAP User Federation from Keycloak
      • Restrict Network Access to the ncn-images S3 Bucket
      • Re-Sync Keycloak Users to Compute Nodes
      • Retrieve an Authentication Token
      • Retrieve the Client Secret for Service Accounts
      • Update NCN User SSH Keys
      • System Security and Authentication
      • Transport Layer Security (TLS) for Ingress Services
      • Troubleshoot Common Vault Cluster Issues
      • Update Default Air-Cooled BMC and Leaf-BMC Switch SNMP Credentials
      • Update Default ServerTech PDU Credentials used by the Redfish Translation Service (RTS)
      • Set NCN User Passwords
      • Updating the Liquid-Cooled EX Cabinet CEC with Default Credentials after a CEC Password Change
    • spire
      • Create a Backup of the Spire Postgres Database
      • Restore missing Spire metadata
      • Restore Spire Postgres without an Existing Backup
      • Troubleshoot Spire Failing to Start on NCNs
      • Update Spire Intermediate CA Certificate
    • boot orchestration
      • BOS Workflows
      • Compute Node Boot Issue Symptom Node Console or Logs Indicate that the Server Response has Timed Out
      • Boot Issue Symptom Node HSN Interface Does Not Appear or Show Detected Links
      • Boot Orchestration
      • Boot UANs
      • Check the Progress of BOS Session Operations
      • Clean Up After a BOS/BOA Job is Completed or Cancelled
      • Clean Up Logs After a BOA Kubernetes Job
      • Compute Node Boot Issue Symptom Duplicate Address Warnings and Declined DHCP Offers in Logs
      • Compute Node Boot Issue Symptom Message About Invalid EEPROM Checksum in Node Console or Log
      • Compute Node Boot Issue Symptom Node is Not Able to Download the Required Artifacts
      • Compute Node Boot Sequence
      • Configure the BOS Timeout When Booting Compute Nodes
      • Create a Session Template to Boot Compute Nodes with CPS
      • Edit the iPXE Embedded Boot Script
      • Healthy Compute Node Boot Process
      • Kernel Boot Parameters
      • Limit the Scope of a BOS Session
      • BOS Limitations for Gigabyte BMC Hardware
      • Log File Locations and Ports Used in Compute Node Boot Troubleshooting
      • Manage a BOS Session
      • Manage a Session Template
      • Node Boot Root Cause Analysis
      • Redeploy the iPXE and TFTP Services
      • BOS Session Templates
      • BOS Sessions
      • Stage Changes Without BOS
      • Tools for Resolving Compute Node Boot Issues
      • Troubleshoot Booting Nodes with Hardware Issues
      • Troubleshoot Compute Node Boot Issues Related to Dynamic Host Configuration Protocol (DHCP)
      • Troubleshoot Compute Node Boot Issues Related to Slow Boot Times
      • Troubleshoot Compute Node Boot Issues Related to Trivial File Transfer Protocol (TFTP)
      • Troubleshoot Compute Node Boot Issues Related to Unified Extensible Firmware Interface (UEFI)
      • Troubleshoot Compute Node Boot Issues Related to the Boot Script Service (BSS)
      • Troubleshoot Compute Node Boot Issues Using Kubernetes
      • Troubleshoot UAN Boot Issues
      • Upload Node Boot Information to Boot Script Service (BSS)
      • View the Status of a BOS Session
    • CSM product management
      • Security Hardening
      • Change Passwords and Credentials
      • Configure CSM packages with CFS
      • Configure Keycloak Account
      • Configure Non-Compute Nodes with CFS
      • Perform NCN Personalization
      • Post-Install Customizations
      • Redeploying a Chart
      • Remove Artifacts from Product Installations
      • Validate Signed RPMs
    • firmware
      • FAS Admin Procedures
      • FAS CLI
      • FAS Filters
      • FAS Recipes
      • FAS Use Cases
      • Update Firmware with FAS
      • Updating BMC Firmware and BIOS for ncn-m001
      • Updating BMC Firmware and BIOS for NCNs without FAS
      • Upload BMC Recovery Firmware into TFTP Server
    • hpe pdu
      • HPE PDU Admin Procedures
    • power management
      • Cray Advanced Platform Monitoring and Control (CAPMC)
      • Ignore Nodes with CAPMC
      • Liquid Cooled Node Power Management
      • Power Off Compute and IO Cabinets
      • Power Off the External Lustre File System
      • Power On Compute and IO Cabinets
      • Power On and Boot Compute and User Access Nodes
      • Power On and Start the Management Kubernetes Cluster
      • Power On the External Lustre File System
      • Prepare the System for Power Off
      • Recover from a Liquid Cooled Cabinet EPO Event
      • Save Management Network Switch Configuration Settings
      • Set the Turbo Boost Limit
      • Shut Down and Power Off Compute and User Access Nodes
      • Shut Down and Power Off the Management Kubernetes Cluster
      • Standard Rack Node Power Management
      • System Power Off Procedures
      • System Power On Procedures
      • User Access to Compute Node Power Data
      • Worker Node COS Power Up Configuration
      • Power Management
    • artifact management
      • Artifact Management
      • Generate Temporary S3 Credentials
      • Manage Artifacts with the Cray CLI
      • Use S3 Libraries and Clients
    • kubernetes
      • About Kubernetes Taints and Labels
      • About Postgres
      • About etcd
      • About kubectl
      • Backups for etcd-operator Clusters
      • Kubernetes and Bare-Metal etcd Certificate Renewal
      • Check for and Clear etcd Cluster Alarms
      • Check the Health and Balance of etcd Clusters
      • Clear Space in an etcd Cluster Database
      • Configure kubectl Credentials to Access the Kubernetes APIs
      • containerd
      • Create a Manual Backup of a Healthy etcd Cluster
      • Determine if Pods are Hitting Resource Limits
      • Disaster Recovery for Postgres
      • Increase Kafka Pod Resource Limits
      • Increase Pod Resource Limits
      • Kubernetes
      • Kubernetes Networking
      • Kubernetes Storage
      • Configure Kubernetes API Audit Log Maximum Backups
      • Pod Resource Limits
      • Rebalance Healthy etcd Clusters
      • Rebuild Unhealthy etcd Clusters
      • Recover from Postgres WAL Event
      • Repopulate Data in etcd Clusters When Rebuilding Them
      • Report the Endpoint Status for etcd Clusters
      • Restore Bare-Metal etcd Clusters from an S3 Snapshot
      • Restore Postgres
      • Restore an etcd Cluster from a Backup
      • Retrieve Cluster Health Information Using Kubernetes
      • TDS Lower CPU Requests
      • Troubleshoot Intermittent HTTP 503 Code Failures
      • Troubleshoot Postgres Database
      • View Postgres Information for System Databases
    • package repository management
      • Manage Repositories with Nexus
      • Nexus Configuration
      • Nexus Deployment
      • Nexus Export and Restore
      • Nexus Space Cleanup
      • Package Repository Management
      • Package Repository Management with Nexus
      • Repair Yum Repository Metadata
      • Restrict Admin Privileges in Nexus
      • Troubleshoot Nexus
    • system configuration service
      • Configure BMC and Controller Parameters with SCSD
      • Manage Parameters with the scsd Service
      • Set BMC Credentials
      • System Configuration Service
    • network
      • Management Network User Guide
        • Management Network 1.0 (1.2 Preconfig) to 1.2
        • Load Saved Switch Configuration
        • Fresh Install
        • Added Hardware
        • Manual Switch Configuration
        • Generate Switch Configurations
        • Apply Custom Switch Configurations for CSM 1.0
        • CSM Automatic Network Utility
          • CANU Installation
          • Troubleshoot CANU Validation Errors
          • Use CANU to Verify, Generate, or Compare Switch Configurations
          • Initializing CANU
          • Introduction to CANU
          • Quick start guide to CANU
          • Uninstall CANU
          • Update CANU From CSM Release Tarball
          • Use CANU to Generate Full Network Configuration
        • Apply Custom Switch Configurations for CSM 1.2
        • Apply Switch Configurations
        • Dell Installation and Configuration Guide
          • Configure Access Control Lists (ACLs)
          • Configure Address Resolution Protocol (ARP)
          • Back Up a Switch Configuration
          • Configure Domain Name System (DNS) Client
          • Configure Domain Name
          • Configure Hostnames
          • Configure Internet Group Management Protocol (IGMP)
          • Configure Link Aggregation Group (LAG)
          • Link layer discovery protocol (LLDP)
          • Configure Locator LED
          • Configure Loopback Interface
          • Configure Management Interface
          • Configure Multiple Spanning Tree Protocol (MSTP)
          • Network Time Protocol (NTP) Client
          • Configure Physical Interfaces
          • Configure QoS
          • Configure Remote Logging
          • Reset Dell Switch Configuration
          • Configure SNMPv2c Community
          • Dell SNMPv3 Users
          • Configure Secure Shell (SSH)
          • Configure System Images
          • Perform an Upgrade on Dell Switches
          • Configure Virtual Local Area Networks (VLANs)
          • Configure VLAN Interface
          • VLAN Trunking 802.1Q
        • Upgrade CANU
        • Collect Data
        • Configuration Management
        • Configure SNMP
        • Mellanox Installation and Configuration Guide
          • Access control lists (ACLs)
          • Address resolution protocol (ARP)
          • Backing up switch configuration
          • BGP basics
          • Cable diagnostics
          • Check BGP and MetalLB
          • Check current DHCP leases
          • Check DHCP lease is getting allocated
          • Check HSM
          • Check KEA DHCP logs
          • Computes/UANs/Application Nodes
          • Large Number of DHCP Declines During a Node Boot
          • Domain name system (DNS) client
          • Domain name
          • Duplicate IP address check (node receives an IP address, but not the correct one)
          • Exec banners
          • Hostname
          • IGMP
          • IP filter
          • Key features used in the management network configuration
          • Link aggregation group (LAG)
          • Large
          • Link layer discovery protocol (LLDP)
          • Loopback interface
          • Management interface
          • Example of how to configure Scenario A or B
          • Management network functions in detail
          • Medium
          • Multi-chassis interface
          • MLAG (Multi-Chassis LAG)
          • MLAG
          • Multiple spanning tree protocol (MSTP)
          • Native VLAN
          • tcpdump
          • NCNs on Install
          • Network types – Naming and segment Function
          • Network traffic pattern inside of the system
          • Network Time Protocol (NTP) Client
          • Open shortest path first (OSPF) v2
          • Physical interfaces
          • PIM-SM bootstrap router (BSR) and rendezvous-point (RP)
          • Rebooting NCN and PXE fails
          • Remote logging
          • How to connect management network to your campus network
          • Routed interfaces
          • Scenario A network connection via management network
          • Scenario B network connection via high speed network
          • Small
          • SNMPv2c community
          • SNMPv3 users
          • Spine-leaf architecture
          • Why are spine-leaf architectures becoming more popular?
          • Secure shell (SSH)
          • MAC address table
          • Static routing
          • Confirm the status of the cray-dhcp-kea pods/services
          • System images
          • Test TFTP traffic (Aruba Only)
          • Typical configuration of MLAG link connecting to NCN
          • Typical configuration of MLAG between switches
          • Performing Upgrade On Mellanox Switches
          • Verify the switches are forwarding DHCP traffic
          • Verify BGP
          • Verify the DHCP traffic on the workers
          • Verify route to TFTP
          • Very Large (Exascale)
          • Virtual local area networks (VLANs)
          • VLAN interface
          • VLAN trunking 802.1Q
          • Web user interface (WebUI)
        • Aruba Installation and Configuration Guide
          • 802.1X
          • Access Control Lists (ACLs)
          • Address Resolution Protocol (ARP)
          • Back Up a Switch Configuration
          • Border Gateway Protocol (BGP) Basics
          • Bluetooth Capabilities
          • Cable Diagnostics
          • Check BGP and MetalLB
          • Check Current DHCP Leases
          • Check DHCP Lease is Getting Allocated
          • Check HSM
          • Check KEA DHCP Logs
          • Classifier Policies
          • Verify Computes/UANs/Application Nodes
          • Large Number of DHCP Declines During a Node Boot
          • Configure Domain Name Service (DNS) Clients
          • Configure Domain Names
          • Check for Duplicate IP Addresses
          • Configure Exec Banners
          • Configure Hostnames
          • Configure Internet Group Management Protocol (IGMP)
          • Initial Prioritization
          • Introduction
          • Key Features Used in the Management Network Configuration
          • Link Aggregation Group (LAG)
          • Link Layer Discovery Protocol (LLDP)
          • Locator LED
          • Loopback Interface
          • MAC Authentication
          • Management Interface
          • Example of How to Configure Scenario A or B
          • System Management Network Functions
          • VSX ISL HA
          • VSX MCLAG Link HA
          • VSX Member Power Failure
          • VSX Split
          • Multi-Chassis Link Aggregation Group (MCLAG)
          • Message-Of-The-Day (MOTD)
          • Multicast Source Discovery Protocol (MSDP)
          • Multiple Spanning Tree Protocol (MSTP)
          • Native VLAN
          • NCN tcpdump
          • NCNs on Install
          • Network Types – Naming and Segment Function
          • Network Topologies
          • Network Traffic Pattern
          • Notices
          • Network Time Protocol (NTP) Client
          • Open Shortest Path First (OSPF) v2
          • Physical Interfaces
          • PIM-SM Bootstrap Router (BSR) and Rendezvous Point (RP)
          • Port Mirroring
          • Port Security
          • Queuing and Scheduling
          • RADIUS
          • Rebooting NCNs and PXE Fails
          • Redundant Power Supplies
          • Remote Logging
          • Connect the Management Network to a Campus Network
          • Routed interfaces
          • Scenario A Network Connection via Management Network
          • Scenario B Network Connection via High-Speed Network
          • Simple Network Management Protocol (SNMP) Agent
          • SNMPv2c Community
          • SNMP traps
          • Aruba SNMPv3 Users
          • Spine-Leaf Architecture
          • Secure Shell (SSH)
          • Static Routing
          • Confirm the Status of the cray-dhcp-kea Pods
          • TACACS
          • Test TFTP Traffic (Aruba Only)
          • Typical Configuration of VSX
          • Typical Edge Port Configuration
          • Typical Configuration of MCLAG Link
          • Unidirectional Link Detection (UDLD)
          • Perform a VSX Upgrade on Aruba Switches
          • Verify the Switches are Forwarding DHCP Traffic
          • Verify BGP
          • Verify the DHCP Traffic on the Worker Nodes
          • Verify Route to TFTP
          • Virtual Local Area Networks (VLANs)
          • VLAN Interface
          • VLAN Trunking 802.1Q
          • Virtual Switching Framework (VSF) - 6300 Only
          • Virtual Switching Extension (VSX)
          • What is VSX?
          • Switch Replacement in the VSX Cluster
          • VSX Sync
          • Web User Interface (WebUI)
          • Erase All (zeroize)
        • External User Guides
        • Network Tests
        • Reinstall
        • Replace Switch
        • Save a Configuration
        • Prometheus SNMP Exporter
        • Upgrade Switches From 1.0 to 1.2 Preconfig
        • Validate Cabling
        • Validate the SHCD
        • Validate Switch Configurations
        • Wipe Management Switch Configuration
        • Back Up a Custom Configuration
        • Remove UAN Access to the CMN
        • Enabling Customer High Speed Network Routing
        • BICAN Support Matrix - Shasta Customer Access Networks
        • Bifurcating the CAN - CSM 1.2 Feature Details
        • BICAN Summary
        • firmware
          • Update Management Network Firmware
      • Access to System Management Services
      • Connect to the HPE Cray EX Environment
      • Create a CSM Configuration Upgrade Plan
      • Default IP Address Ranges
      • Network
      • Gateway Testing
      • dhcp
        • DHCP
        • Troubleshoot DHCP Issues
      • external dns
        • External DNS
        • External DNS Failing to Discover Services Workaround
        • External DNS CSI Input Values
        • Ingress Routing
        • Troubleshoot DNS Configuration Issues
        • Troubleshoot Connectivity to Services with External IP addresses
        • Update the cmn-external-dns value post-installation
      • customer accessible networks
        • Connect to the CMN and CAN
        • Customer Accessible Networks
        • CAN/CMN with Dual-Spine Configuration
        • Externally Exposed Services
        • Troubleshoot CMN issues
        • BI-CAN Aruba/Arista Configuration
        • MetalLB Peering with Arista Edge Router
      • dns
        • Domain Name Service (DNS) Overview
        • Enable nscd on UANs
        • Manage the DNS Unbound Resolver
        • PowerDNS Configuration
        • PowerDNS Migration Guide
        • Troubleshoot Common DNS Issues
        • Troubleshoot PowerDNS
      • metallb bgp
        • Check BGP Status and Reset Sessions
        • MetalLB Configuration
        • MetalLB in BGP-Mode
        • Troubleshoot BGP not Accepting Routes from MetalLB
        • Troubleshoot Services without an Allocated IP Address
    • compute rolling upgrades
      • CRUS Workflow
      • Compute Rolling Upgrades
      • Troubleshoot Nodes Failing to Upgrade in a CRUS Session
      • Troubleshoot a Failed CRUS Session Because of Bad Parameters
      • Troubleshoot a Failed CRUS Session Because of Unmet Conditions
      • Upgrade Compute Nodes with CRUS
    • configuration management
      • Ansible Execution Environments
      • Ansible Inventory
      • Automatic Session Deletion with sessionTTL
      • Backup and Restore VCS Data
      • CFS Flow
      • CFS Global Options
      • CFS Key Management and Permission Denied Errors
      • Change the Ansible Verbosity Logs
      • Configuration Layers
      • Configuration Management
      • Configuration Management of System Components
      • Configuration Management with the CFS Batcher
      • Configuration Sessions
      • Create a CFS Configuration
      • Create a CFS Session with Dynamic Inventory
      • Create an Image Customization CFS Session
      • Create and Populate a VCS Configuration Repository
      • Customize Configuration Values
      • Delete CFS Sessions
      • Enable Ansible Profiling
      • Git Operations
      • Manage Multiple Inventories in a Single Location
      • NCN Worker Image Customization
      • Set Limits for a Configuration Session
      • Set the ansible.cfg for a Session
      • Specifying Hosts and Groups
      • Target Ansible Tasks for Image Customization
      • Track the Status of a Session
      • Troubleshoot Ansible Play Failures in CFS Sessions
      • Troubleshoot CFS Session Failing to Complete
      • Troubleshoot CFS Sessions Failing to Start
      • Update a CFS Configuration
      • Update the Privacy Settings for Gitea Configuration Content Repositories
      • Use a Custom ansible.cfg File
      • Use a Specific Inventory in a Configuration Session
      • VCS Administrative User
      • VCS Branching Strategy
      • Version Control Service (VCS)
      • View Configuration Session Logs
      • Write Ansible Code for CFS
    • hmcollector
      • Adjust HM Collector resource limits and requests
  • CSM Background Information
    • Certificate Authority
    • NCN BIOS
    • NCN Boot Workflow
    • NCN Images
    • NCN Mounts and File Systems
    • NCN Networking
    • NCN Operating System Releases
    • NCN Packages
    • NCN Plan of Record
  • CSM Troubleshooting Information
    • Manual SSH Key Setting Process
    • Troubleshoot the CMS Barebones Image Boot Test
    • Running CT Tests Manually
    • Interpreting HMS Health Check Results
    • PXE Booting Runbook
    • known issues
      • CFS Component With Zero-Length ID
      • Gigabyte BMC Missing Redfish Data
      • Hang Listing BOS Sessions
      • Multiple Console Node Pods on the Same Worker
      • Nexus Fails Authentication with Keycloak Users
      • SLS Not Working During Node Rebuild
      • VCS Password With Illegal Characters
      • Known Issue admin-client-auth Not Found
      • SAT/HSM/CAPMC Component Power State Mismatch
      • Cray CLI 403 Forbidden Errors
      • HMS Discovery Job Not Creating RedfishEndpoints In Hardware State Manager
      • Etcd Cluster Backup Fails Due to Timeout
      • Known Issue Logging into the Gitea web UI requires logging in twice
      • HPE iLO dropping event subscriptions and not properly transitioning power state in CSM software
      • Kafka Failure after CSM 1.2 Upgrade
      • Mellanox lacp-individual Limitations
      • Common Platform CA Issues
      • Spire database connection pool configuration in an air-gapped environment
      • Spire Database Cluster DNS Lookup Failure
    • kubernetes
      • Kubernetes Log File Locations
      • Kubernetes Troubleshooting Information
      • Troubleshoot Kubernetes Master or Worker node in NotReady state
      • Troubleshoot Kubernetes Pods Not Starting
      • Troubleshoot Liveliness or Readiness Probe Failures
      • Troubleshoot Unresponsive kubectl Commands
  • Glossary
  • Install CSM
    • Set Gigabyte Node BMC to Factory Defaults
    • Boot LiveCD Virtual ISO
    • SHCD HMN Tab/HMN Connections Rules
    • Bootstrap PIT Node from LiveCD Remote ISO
    • Switch PXE Boot from Onboard NIC to PCIe
    • Bootstrap PIT Node from LiveCD USB
    • Troubleshooting Installation Problems
    • Cable Management Network Servers
    • Utility Storage Installation Troubleshooting
    • Ceph CSI Troubleshooting
    • Wipe NCN Disks for Reinstallation
    • Clear Gigabyte CMOS
    • Collect MAC Addresses for NCNs
    • Collecting the BMC MAC Addresses
    • Collecting NCN MAC Addresses
    • Configure Administrative Access
    • Configure Management Network
    • Connect to Switch over USB-Serial Cable
    • Create Application Node Config YAML
    • Create Cabinets YAML
    • Create HMN Connections JSON File
    • Create NCN Metadata CSV
    • Create Switch Metadata CSV
    • CSM Services Install Fails Because of Missing Secret
    • Deploy Final NCN
    • Deploy Management Nodes
    • Install CSM Services
    • Prepare Compute Nodes
    • Prepare Configuration Payload
    • Prepare Management Nodes
    • Prepare site-init
    • PXE Boot Troubleshooting
    • Reinstall LiveCD
    • Reset root Password on LiveCD
    • Restart Network Services and Interfaces on NCNs
    • Safeguards for CSM
  • Introduction to CSM Installation
    • CAPMC Deprecation Notice many CAPMC v1 features are being partially deprecated
    • CSM Overview
    • Differences from Previous Release
    • Documentation Conventions
    • Scenarios for Shasta v1.5
  • scripts
    • operations
      • node management
        • Add Remove Replace NCNs
          • Python Library for SLS Network Data
    • workarounds
      • Kernel Dump Workaround
      • Boot Order Workaround
  • Update CSM Product Stream
  • Upgrade CSM
    • CSM 1.2.1 Patch Installation Instructions
      • Upgrade and validate Switch Configurations
    • CSM 1.2.2 Patch Installation Instructions
      • Upgrade and validate Switch Configurations
    • Usage
      • k8s
        • Worker-Specific Manual Steps
      • storage
        • CEPHADM
    • CSM 1.0.x to 1.2.x Upgrade Process
      • Usage
        • k8s
          • Worker-Specific Manual Steps
        • storage
          • CEPHADM
      • Stage 0 - Prerequisites and Preflight Checks
      • Stage 1 - Ceph image upgrade
      • Stage 2 - Kubernetes Upgrade from 1.19.9 to 1.20.13
      • Stage 3 - CSM Service Upgrades
      • Stage 4 - Ceph Upgrade
      • Stage 5 - Perform NCN Personalization
      • Plan and coordinate network upgrade
      • scripts
        • sls
          • sls utils Library
          • SLS Updates Expert mode
          • Upgrade SLS Offline from CSM 1.0.x to CSM 1.2
          • sls updater.py Technical Details
        • upgrade
          • Upgrade Automation
    • Stage 0 - Prerequisites and Preflight Checks
    • Stage 1 - Ceph image upgrade
    • Prepare For Upgrade
    • Stage 2 - Kubernetes Upgrade from 1.19.9 to 1.20.13
    • Stage 3 - CSM Service Upgrades
    • Stage 4 - Ceph Upgrade
    • Stage 5 - Perform NCN Personalization
    • Plan and coordinate network upgrade
    • scripts
      • sls
        • sls utils Library
        • SLS Updates Expert mode
        • Upgrade SLS Offline from CSM 1.0.x to CSM 1.2
        • sls updater.py Technical Details
      • upgrade
        • Upgrade Automation

Hardware State Manager

Topics:

  1. Add a Switch to the HSM Database
  2. Add an NCN to the HSM Database
  3. Component Group Members
  4. Component Groups and Partitions
  5. Component Memberships
  6. Component Partition Members
  7. Create a Backup of the HSM Postgres Database
  8. HSM Roles and Subroles
  9. Hardware Management Services (HMS) Locking API
  10. Hardware State Manager (HSM)
  11. Hardware State Manager (HSM) State and Flag Fields
  12. Lock and Unlock Management Nodes
  13. Manage Component Groups
  14. Manage Component Partitions
  15. Manage HMS Locks
  16. Restore Hardware State Manager (HSM) Postgres Database from Backup
  17. Restore Hardware State Manager (HSM) Postgres without an Existing Backup
  18. Set BMC Management Roles