Cray System Management
  • Cray System Management (CSM) - Release Notes
  • Cray System Management (CSM) Administration Guide
    • Create a Backup of HMS Items for reinstall
    • Component Names (xnames)
    • Restore HSM
    • Validate CSM Health
    • Configure the Cray Command Line Interface (cray CLI)
    • User Access Service (UAS)
      • Add a Volume to UAS
      • Broker Mode UAI Management
      • Choosing UAI Resource Settings
      • Common UAI Configuration
      • Configure End-User UAI Classes for Broker Mode
      • Configure UAIs in UAS
      • Configure a Broker UAI Class
      • Configure a Default UAI Class for Legacy Mode
      • Create UAIs From Specific UAI Images in Legacy Mode
      • Create a UAI
      • Create a UAI Class
      • Create a UAI Resource Specification
      • Create a UAI with Additional Ports
      • Create and Use Default UAIs in Legacy Mode
      • Customize End-User UAI Images
      • Customize the Broker UAI Image
      • Delete a UAI
      • Delete a UAI Class
      • Delete a UAI Image Registration
      • Delete a UAI Resource Specification
      • Delete a Volume Configuration
      • Elements of a UAI
      • End-User UAIs
      • Examine a UAI Using a Direct Administrative Command
      • Legacy Mode User-Driven UAI Management
      • List Available UAI Classes
      • List Available UAI Images in Legacy Mode
      • List Registered UAI Images
      • List UAI Resource Specifications
      • List UAIs
      • List UAS Version Information
      • List Volumes Registered in UAS
      • Log in to a Broker UAI
      • This Page Has Moved
      • Modify a UAI Class
      • Obtain the Configuration of a UAS Volume
      • Register a UAI Image
      • Clear UAS Configuration
      • Resource Specifications
      • Retrieve Resource Specification Details
      • Retrieve UAI Image Registration Information
      • Setting UAI Timeouts
      • Broker UAI Resiliency and Load Balancing
      • Special Purpose UAIs
      • Start a Broker UAI
      • Troubleshoot Broker UAI SSSD Cannot Use /etc/sssd/sssd.conf
      • Troubleshoot Common Mistakes when Creating a Custom End-User UAI Image
      • Troubleshoot Duplicate Mount Paths in a UAI
      • Troubleshoot Missing or Incorrect UAI Images
      • Troubleshoot Stale Brokered UAIs
      • Troubleshoot UAS / CLI Authentication Issues
      • Troubleshoot UAI Stuck in ContainerCreating
      • Troubleshoot UAIs by Viewing Log Output
      • Troubleshoot UAIs with Administrative Access
      • Troubleshoot UAS Issues
      • Troubleshoot UAS by Viewing Log Output
      • UAI Classes
      • UAI Host Node Selection
      • UAI Host Nodes
      • UAI Image Customization
      • UAI Images
      • UAI Management
      • UAI Network Attachment Customization
      • UAI macvlans Network Attachments
      • UAS Limitations
      • UAS and UAI Legacy Mode Health Checks
      • Update a Resource Specification
      • Update a UAI Image Registration
      • Update a UAS Volume
      • View a UAI Class
      • Volumes
    • argo
      • Using Argo Workflows
      • Using the Argo UI
    • bare metal
      • Bare-Metal Steps
      • Fresh Install Setting NodeBMC and RouterBMC Redfish Credentials
    • firmware
      • FASUpdate Script
      • FAS Admin Procedures
      • FAS CLI
      • Cleaning up FAS Database
      • FAS Filters
      • Backup and Restoring FAS Images
      • Updating Foxconn Paradise Nodes with FAS
      • FAS Recipes
      • Update iLO 5 firmware above v2.78
      • FAS Recipes and Procedures
      • Firmware Upgrade using SPP on HPE ProLiant Servers
      • Update Firmware with FAS
      • Updating BMC Firmware and BIOS for ncn-m001
      • Updating BMC Firmware and BIOS for NCNs without FAS
      • Upload BMC Recovery Firmware into TFTP Server
    • hmcollector
      • Adjust HM Collector Ingress Replicas and Resource Limits
    • observability
      • Install and Upgrade Observability Framework
    • power management
      • Cray Advanced Platform Monitoring and Control (CAPMC)
      • Ignore Nodes with CAPMC
      • Liquid-cooled Node Power Management
      • Power Off Compute Cabinets
      • Power Off Management Cabinets
      • Power Off Storage Cabinets
      • Power Off the External Lustre File System
      • Power On Compute Cabinets
      • Power On and Boot Compute and User Access Nodes
      • Power On and Start the Management Kubernetes Cluster
      • Power On the External Lustre File System
      • Prepare the System for Power Off
      • Recover from a Liquid Cooled Cabinet EPO Event
      • Save Management Network Switch Configuration Settings
      • Set the Turbo Boost Limit
      • Shut Down and Power Off Compute and User Access Nodes
      • Shut Down and Power Off the Management Kubernetes Cluster
      • Standard Rack Node Power Management
      • System Power Off Procedures
      • System Power On Procedures
      • User Access to Compute Node Power Data
      • Power Management
      • Power Control Service
        • Node Card Power Management
        • Power Control Service (PCS)
        • Power Off Compute Cabinets
        • Power On Compute Cabinets
        • Recover from a Liquid Cooled Cabinet EPO Event
    • sat
      • System Admin Toolkit (SAT) in CSM
    • system configuration service
      • Configure BMC and Controller Parameters with SCSD
      • Manage Parameters with the scsd Service
      • Set BMC Credentials Using SAT
      • System Configuration Service
    • system layout service
      • Add Liquid-Cooled Cabinets to SLS
      • Add UAN CAN IP Addresses to SLS
      • Add an alias to a service
      • Create a Backup of the SLS Postgres Database
      • Dump SLS Information
      • Load SLS Database with Dump File
      • Restore SLS Postgres Database from Backup
      • Restore SLS Postgres without an Existing Backup
      • System Layout Service (SLS)
      • Update SLS with UAN Aliases
    • System Recovery
      • PBS Service Recovery
      • Slurm Service Recovery
      • Beta Procedures for System Recovery
    • CSM product management
      • Change Passwords and Credentials
      • Configure CSM packages with CFS
      • Configure Keycloak Account
      • Configure the root password and SSH keys in Vault
      • Post-Install Customizations
      • Redeploying a Chart
      • Remove Artifacts from Product Installations
      • Set up passwordless SSH
      • Validate Signed RPMs
    • security and authentication
      • API Authorization
      • Access the Keycloak User Management UI
      • Add LDAP User Federation
      • Add Root Service Account for Gigabyte Controllers
      • Audit Logs
      • Authenticate an Account with the Command Line
      • Backup and Restore Vault Clusters
      • Certificate Types
      • Change Air-Cooled Node BMC Credentials Using SAT
      • Change Credentials on ServerTech PDUs
      • Change Cray EX Liquid-Cooled Cabinet Global Default Password
      • Change the Keycloak Token Lifetime
      • Set NCN Image Root Password, SSH Keys, and Timezone
      • Set NCN Image Root Password, SSH Keys, and Timezone on PIT Node
      • Change Root Passwords for Compute Nodes
      • Change the Keycloak Admin Password
      • Change the LDAP Server IP Address for Existing LDAP Server Content
      • Change the LDAP Server IP Address for New LDAP Server Content
      • Configure Keycloak for LDAP/AD authentication
      • Configure root user on HPE iLO BMCs
      • Configure the RSA Plugin in Keycloak
      • Create Internal Groups in the Keycloak Shasta Realm
      • Create Internal User Accounts in the Keycloak Shasta Realm
      • Create a Backup of the Keycloak Postgres Database
      • Create a Service Account in Keycloak
      • Default Keycloak Realms, Accounts, and Clients
      • Delete Internal User Accounts in the Keycloak Shasta Realm
      • Get a Long-Lived Token for a Service Account
      • HashiCorp Vault
      • Keycloak Operations
      • Keycloak Service Recovery
      • Keycloak User Localization
      • Keycloak User Management with kcadm.sh
      • Make HTTPS Requests from Sources Outside the Management Kubernetes Cluster
      • Manage Sealed Secrets
      • Manage System Passwords
      • PKI Certificate Authority (CA)
      • PKI Services
      • Preserve Username Capitalization for Users Exported from Keycloak
      • Provisioning a Liquid-Cooled EX Cabinet CEC with Default Credentials
      • Public Key Infrastructure (PKI)
      • Recovering from Mismatched BMC Credentials
      • Remove Internal Groups from the Keycloak Shasta Realm
      • Remove the Email Mapper from the LDAP User Federation
      • Remove the LDAP User Federation from Keycloak
      • Re-Sync Keycloak Users to Compute Nodes
      • Retrieve an Authentication Token
      • Retrieve the Client Secret for Service Accounts
      • Update NCN User SSH Keys
      • System Security and Authentication
      • Transport Layer Security (TLS) for Ingress Services
      • Troubleshoot Common Vault Cluster Issues
      • Troubleshoot Kyverno configuration manually
      • Update Default Air-Cooled BMC and Leaf-BMC Switch SNMP Credentials
      • Update Default ServerTech PDU Credentials used by the Redfish Translation Service (RTS)
      • Set NCN User Passwords
      • Update Root Secrets In Vault
      • Updating the Liquid-Cooled EX Cabinet CEC with Default Credentials after a CEC Password Change
      • Vault Service Recovery
    • multi-tenancy
      • Cray HNC Manager
      • Creating a Tenant
      • Modifying a Tenant
      • Multi-Tenancy Support
      • Removing a Tenant
      • Slurm Operator
      • TAPMS (Tenant and Partition Management System) Overview
      • Tenant Administrator Configuration
      • Multi-Tenancy Vault Overview
    • utility storage
      • Adding a Ceph Node to the Ceph Cluster
      • Add Ceph OSDs
      • Adjust Ceph Pool Quotas
      • Alternate Storage Pools
      • Ceph Daemon Memory Profiling
      • Ceph Deep Scrubs
      • Ceph Health States
      • Ceph Orchestrator Usage
      • Ceph Service Check Script Usage
      • Ceph Storage Types
      • ceph-upgrade-tool.py Usage
      • Cephadm Reference Material
      • Collect Information about the Ceph Cluster
      • Dump Ceph Crash Data
      • Identify Ceph Latency Issues
      • Manage Ceph Services
      • Move Unmanaged Ceph OSDs
      • Shrink the Ceph Cluster
      • Shrink Ceph OSDs
      • Troubleshoot Ceph-Mon Processes Stopping and Exceeding Max Restarts
      • Troubleshoot Ceph MDS Client Connectivity Issues
      • Troubleshooting Ceph MDS Reporting Slow Requests and Failure on Client
      • Troubleshoot Ceph New RGW Deployment Failing
      • Troubleshoot Ceph OSDs Reporting Full
      • Troubleshoot Ceph Services Not Starting After a Server Crash
      • Troubleshoot Failure to Get Ceph Health
      • Troubleshoot HEALTH ERR Module devicehealth has failed table Device already exists
      • Troubleshoot Insufficient Standby MDS Daemons Available
      • Troubleshoot Large Object Map Objects in Ceph Health
      • Troubleshoot Pods Failing to Restart on Other Worker Nodes
      • Fixing incorrect number of PG Issues
      • Troubleshoot if RGW Health Check Fails
      • Troubleshoot S3FS Cache Cleanup
      • Troubleshoot S3FS Mount Issues
      • Troubleshoot System Clock Skew
      • Troubleshoot a Down OSD
      • Troubleshoot an Unresponsive Rados-Gateway (radosgw) S3 Endpoint
      • Troubleshoot Ceph image with tag '<none>'
      • Utility Storage
      • Update ceph node-exporter config to monitor SNMP counters
    • boot orchestration
      • Boot Orchestration
      • BOS Services
      • BOS Workflows
      • Compute Node Boot Issue Symptom Node Console or Logs Indicate that the Server Response has Timed Out
      • Boot Issue Symptom Node HSN Interface Does Not Appear or Show Detected Links
      • Boot Orchestration
      • Boot UANs
      • BOS Commands Cheat Sheet
      • Check the Progress of BOS Session Operations
      • Clean Up After a BOS/BOA Job is Completed or Cancelled
      • Clean Up Logs After a BOA Kubernetes Job
      • Component Status
      • BOS Components
      • Compute Node Boot Issue Symptom Duplicate Address Warnings and Declined DHCP Offers in Logs
      • Compute Node Boot Issue Symptom Message About Invalid EEPROM Checksum in Node Console or Log
      • Compute Node Boot Issue Symptom Node is Not Able to Download the Required Artifacts
      • Compute Node Boot Sequence
      • Configure the BOS Timeout When Booting Compute Nodes
      • Create a Session Template to Boot Compute Nodes with CPS
      • Customize iPXE Binary Names
      • Determine Which BOS Session Booted a Node
      • Edit the iPXE Embedded Boot Script
      • Exporting and Importing BOS Data
      • Exporting and Importing BSS Data
      • Healthy Compute Node Boot Process
      • Kernel Boot Parameters
      • Limit the Scope of a BOS Session
      • BOS Limitations for Gigabyte BMC Hardware
      • Log File Locations and Ports Used in Compute Node Boot Troubleshooting
      • Manage a BOS Session
      • Manage a Session Template
      • Multi-tenancy with BOS
      • Node Boot Root Cause Analysis
      • BOS Options
      • Redeploy the iPXE and TFTP Services
      • Rolling Upgrades using BOS
      • BOS Session Templates
      • BOS Sessions
      • Staging Changes with BOS
      • Tools for Resolving Compute Node Boot Issues
      • Troubleshoot Booting Nodes with Hardware Issues
      • Troubleshoot Compute Node Boot Issues Related to Dynamic Host Configuration Protocol (DHCP)
      • Troubleshoot Compute Node Boot Issues Related to Slow Boot Times
      • Troubleshoot Compute Node Boot Issues Related to Trivial File Transfer Protocol (TFTP)
      • Troubleshoot Compute Node Boot Issues Related to Unified Extensible Firmware Interface (UEFI)
      • Troubleshoot Compute Node Boot Issues Related to the Boot Script Service (BSS)
      • Troubleshoot Compute Node Boot Issues Using Kubernetes
      • Troubleshoot UAN Boot Issues
      • Upload Node Boot Information to Boot Script Service (BSS)
      • View the Status of a BOS Session
    • kubernetes
      • About Kubernetes Taints and Labels
      • Kubernetes Encryption
      • About Postgres
      • About etcd
      • About kubectl
      • Backups for Etcd Clusters Running in Kubernetes
      • Kubernetes and Bare Metal etcd Certificate Renewal
      • Check for and Clear etcd Cluster Alarms
      • Check the Health of etcd Clusters
      • Clear Space in an etcd Cluster Database
      • Configure kubectl Credentials to Access the Kubernetes APIs
      • containerd
      • Create a Manual Backup of Bare-Metal etcd Cluster
      • Create a Manual Backup of a Healthy etcd Cluster
      • Determine if Pods are Hitting Resource Limits
      • Disaster Recovery for Postgres
      • Fix Failed to start etcd on Master NCN
      • Increase Kafka Pod Resource Limits
      • Increase the PVC size in an etcd Cluster Database
      • Increase Pod Resource Limits
      • Kubernetes
      • Kubernetes Networking
      • Kubernetes Storage
      • Kyverno policy management
      • Pod Resource Limits
      • Rebuild Unhealthy etcd Clusters
      • Recover from Postgres WAL Event
      • Repopulate Data in etcd Clusters When Rebuilding Them
      • Report the Endpoint Status for etcd Clusters
      • Restore Bare-Metal etcd Clusters from an S3 Snapshot
      • Restore Postgres
      • Restore an etcd Cluster from a Backup
      • Retrieve Cluster Health Information Using Kubernetes
      • TDS Lower CPU Requests
      • Troubleshoot Intermittent HTTP 503 Code Failures
      • Troubleshoot Postgres Database
      • View Postgres Information for System Databases
    • network
      • Management Network User Guide
        • Manual Switch Configuration
        • Fresh Install
        • Added Hardware
        • Generate Switch Configurations
        • Apply Custom Switch Configuration CSM 1.2
        • Apply Switch Configurations
        • CSM Automatic Network Utility
          • CANU Installation
          • Troubleshoot CANU Validation Errors
          • Use CANU to Verify, Generate, or Compare Switch Configurations
          • Generate Switch Configs Including Custom Configurations
          • Initializing CANU
          • Introduction to CANU
          • Quick start guide to CANU
          • Uninstall CANU
          • Update CANU From CSM Release Tarball
          • Use CANU to Generate Full Network Configuration
        • Dell Installation and Configuration Guide
          • Configure Access Control Lists (ACLs)
          • Configure Address Resolution Protocol (ARP)
          • Back Up a Switch Configuration
          • Configure Domain Name System (DNS) Client
          • Configure Domain Name
          • Configure Hostnames
          • Configure Internet Group Management Protocol (IGMP)
          • Configure Link Aggregation Group (LAG)
          • Link layer discovery protocol (LLDP)
          • Configure Locator LED
          • Configure Loopback Interface
          • Configure Management Interface
          • Configure Multiple Spanning Tree Protocol (MSTP)
          • Network Time Protocol (NTP) Client
          • Configure Physical Interfaces
          • Configure QoS
          • Configure Remote Logging
          • Reset Dell Switch Configuration
          • Configure SNMPv2c community
          • Dell SNMPv3 Users
          • Configure Secure Shell (SSH)
          • Configure System Images
          • Perform an Upgrade on Dell Switches
          • Configure Virtual Local Area Networks (VLANs)
          • Configure VLAN Interface
          • VLAN Trunking 802.1Q
        • Using canu-inventory with Ansible
        • Upgrade CANU
        • Collect Data
        • Configuration Management
        • Configuring SNMP in CSM
        • Mellanox Installation and Configuration Guide
          • Access control lists (ACLs)
          • Address resolution protocol (ARP)
          • Backing up switch configuration
          • BGP basics
          • Cable diagnostics
          • Check BGP and MetalLB
          • Check current DHCP leases
          • Check DHCP lease is getting allocated
          • Check HSM
          • Check KEA DHCP logs
          • Computes/UANs/Application Nodes
          • Large Number of DHCP Declines During a Node Boot
          • Domain name system (DNS) client
          • Domain name
          • You are getting an IP address, but not the correct one. Duplicate IP address check
          • Exec banners
          • Hostname
          • IGMP
          • IP filter
          • Key features used in the management network configuration
          • Link aggregation group (LAG)
          • Large
          • Link layer discovery protocol (LLDP)
          • Loopback interface
          • Management interface
          • Example of how to configure Scenario A or B
          • Management network functions in detail
          • Medium
          • Multi-chassis interface
          • MLAG (Multi-Chassis LAG)
          • MLAG
          • Multiple spanning tree protocol (MSTP)
          • Native VLAN
          • TCPDUMP
          • NCNs on Install
          • Network types – Naming and segment Function
          • Network traffic pattern inside of the system
          • Network Time Protocol (NTP) Client
          • Open shortest path first (OSPF) v2
          • Physical interfaces
          • PIM-SM bootstrap router (BSR) and rendezvous-point (RP)
          • Rebooting NCN and PXE fails
          • Remote logging
          • How to connect management network to your campus network
          • Routed interfaces
          • Scenario A network connection via management network
          • Scenario B network connection via high speed network
          • Small
          • SNMPv2c community
          • Mellanox SNMPv3 users
          • Spine-leaf Architecture
          • Spine-leaf architecture
          • Why are spine-leaf architectures becoming more popular?
          • Secure shell (SSH)
          • MAC address table
          • Static routing
          • Confirm the status of the cray-dhcp-kea pods/services
          • System images
          • Test TFTP traffic (Aruba Only)
          • Typical configuration of MLAG link connecting to NCN
          • Typical configuration of MLAG between switches
          • Performing Upgrade On Mellanox Switches
          • Verify the switches are forwarding DHCP traffic
          • Verify BGP
          • Verify the DHCP traffic on the workers
          • Verify route to TFTP
          • Very Large (Exascale)
          • Virtual local area networks (VLANs)
          • VLAN interface
          • VLAN trunking 802.1Q
          • Web user interface (WebUI)
        • Aruba Installation and Configuration Guide
          • 802.1X
          • Access Control Lists (ACLs)
          • Address Resolution Protocol (ARP)
          • Backup a Switch Configuration
          • Border Gateway Protocol (BGP) Basics
          • Bluetooth Capabilities
          • Cable Diagnostics
          • Check BGP and MetalLB
          • Check Current DHCP Leases
          • Check DHCP Lease is Getting Allocated
          • Check HSM
          • Check KEA DHCP Logs
          • Classifier Policies
          • Verify Computes/UANs/Application Nodes
          • Large Number of DHCP Declines During a Node Boot
          • Configure Domain Name System (DNS) Clients
          • Configure Domain Names
          • Check for Duplicate IP Addresses
          • Configure Exec Banners
          • Configure Hostnames
          • Configure Internet Group Management Protocol (IGMP)
          • Initial Prioritization
          • Introduction
          • Key Features Used in the Management Network Configuration
          • Link Aggregation Group (LAG)
          • Link Layer Discovery Protocol (LLDP)
          • Locator LED
          • Loopback Interface
          • MAC Authentication
          • Management Interface
          • Example of How to Configure Scenario A or B
          • System Management Network Functions
          • VSX ISL HA
          • VSX MCLAG Link HA
          • VSX Member Power Failure
          • VSX Split
          • Multi-Chassis Link Aggregation Group (MCLAG)
          • Message-Of-The-Day (MOTD)
          • Multicast Source Discovery Protocol (MSDP)
          • Multiple Spanning Tree Protocol (MSTP)
          • Native VLAN
          • NCN tcpdump
          • NCNs on Install
          • Network Types – Naming and Segment Function
          • Network Topologies
          • Network Traffic Pattern
          • Notices
          • Network Time Protocol (NTP) Client
          • Open Shortest Path First (OSPF) v2
          • Physical Interfaces
          • PIM-SM Bootstrap Router (BSR) and Rendezvous Point (RP)
          • Port Mirroring
          • Port Security
          • Queuing and Scheduling
          • RADIUS
          • Rebooting NCNs and PXE Fails
          • Redundant Power Supplies
          • Remote Logging
          • Connect the Management Network to a Campus Network
          • Routed interfaces
          • Scenario A Network Connection via Management Network
          • Scenario B Network Connection via High-Speed Network
          • Simple Network Management Protocol (SNMP) Agent
          • SNMPv2c Community
          • SNMP traps
          • Aruba SNMPv3 Users
          • Spine-Leaf Architecture
          • Spine-leaf Architecture
          • Secure Shell (SSH)
          • Static Routing
          • Confirm the Status of the cray-dhcp-kea Pods
          • TACACS
          • Test TFTP Traffic (Aruba Only)
          • Typical Configuration of VSX
          • Typical Edge Port Configuration
          • Typical Configuration of MCLAG Link
          • Unidirectional Link Detection (UDLD)
          • Perform a VSX Upgrade on Aruba Switches
          • Verify the Switches are Forwarding DHCP Traffic
          • Verify BGP
          • Verify the DHCP Traffic on the Worker Nodes
          • Verify Route to TFTP
          • Virtual Local Area Networks (VLANs)
          • VLAN Interface
          • VLAN Trunking 802.1Q
          • Virtual Switching Framework (VSF) - 6300 Only
          • Virtual Switching Extension (VSX)
          • VSX Architecture
          • Switch Replacement in the VSX Cluster
          • VSX Sync
          • Web User Interface (WebUI)
          • Erase All zeroize
        • Edge switch cabling guide
        • Network Tests
        • Reinstall
        • Replace Switch
        • Save a Configuration
        • Prometheus SNMP Exporter
        • Transceivers and Cables
        • Example of the Connections Used in Shasta Management Network
        • Validate Cabling
        • Validate the SHCD
        • Validate Switch Configurations
        • Wipe Management Switch Configuration
        • Aruba splitting of QSFP+ and QSFP28 ports
        • Backup a Custom Configuration
        • BICAN Support Matrix - Shasta Customer Access Networks
        • BICAN switch configuration
        • Bifurcating the CAN - Feature Details
        • BICAN Summary
        • Bonded UAN Configuration
        • Cable Management Network Servers
        • firmware
          • Update Management Network Firmware
        • hardware
          • EX2500 Installation and Cabling
      • Access to System Management Services
      • Connect to Switch over USB-Serial Cable
      • Connect to the HPE Cray EX Environment
      • Create a CSM Configuration Upgrade Plan
      • Default IP Address Ranges
      • Gateway Testing
      • Network
      • dhcp
        • DHCP boot file customization
        • DHCP
        • Troubleshoot DHCP Issues
      • customer accessible networks
        • Connect to the CMN and CAN
        • Customer Access Networks
          • scripts
            • sls
              • sls utils Library
          • network
            • Enabling Customer High Speed Network Routing
            • Management Network Upgrade CSM 1.2 to 1.3
        • Customer Accessible Networks
        • CAN/CMN with Dual-Spine Configuration
        • Externally Exposed Services
        • Troubleshoot CMN issues
        • BI-CAN Aruba/Arista Configuration
        • MetalLB Peering with Arista Edge Router
      • external dns
        • External DNS
        • External DNS Failing to Discover Services Workaround
        • External DNS CSI Input Values
        • Ingress Routing
        • Troubleshoot DNS Configuration Issues
        • Troubleshoot Connectivity to Services with External IP addresses
        • Update the cmn-external-dns value post-installation
      • metallb bgp
        • Check BGP Status and Reset Sessions
        • MetalLB Configuration
        • MetalLB in BGP-Mode
        • Troubleshoot BGP not Accepting Routes from MetalLB
        • Troubleshoot Services without an Allocated IP Address
      • dns
        • Domain Name System (DNS) Overview
        • Enable nscd on UANs
        • Manage the DNS Unbound Resolver
        • PowerDNS Configuration
        • PowerDNS Migration Guide
        • Troubleshoot Common DNS Issues
        • Troubleshoot PowerDNS
    • resiliency
      • Recreate StatefulSet Pods on Another Node
      • Resilience of System Management Services
      • Resiliency
      • Resiliency Testing Procedure
      • Restore System Functionality if a Kubernetes Worker Node is Down
    • artifact management
      • Artifact Management
      • Generate Temporary S3 Credentials
      • Manage Artifacts with the Cray CLI
      • Use S3 Libraries and Clients
    • hpe pdu
      • HPE PDU Admin Procedures
    • node management
      • Access and Update Settings for Replacement NCNs
      • Removing a Liquid-cooled blade from a System
      • Removing a Liquid-cooled blade from a System Using SAT
      • Removing a Standard rack node from a System
      • Replace a Compute Blade
      • Replace a Compute Blade Using SAT
      • Replace a Standard rack node from a System
      • Replacing Foxconn Username and Passwords in Vault
      • Add TLS Certificates to BMCs
      • Repurpose a Compute Node as a UAN
      • Add a Standard Rack Node
      • Reset Credentials on Redfish Devices
      • Add Additional Air-Cooled Cabinets to a System
      • S3FS Usage and Guidelines for Shasta
      • Add Additional Liquid-Cooled Cabinets to a System
      • Set Gigabyte Node BMC to Factory Defaults
      • Adding a Liquid-cooled Blade to a System
      • Swap a Compute Blade with a Different System
      • Adding a Liquid-cooled blade to a System Using SAT
      • Swap a Compute Blade with a Different System Using SAT
      • Boot a storage node into new image without upgrading CSM
      • Switch PXE Boot from Onboard NIC to PCIe
      • Build NCN Images Locally
      • TLS Certificates for Redfish BMCs
      • Change Java Security Settings
      • Troubleshoot Interfaces with IP Address Issues
      • Change Settings for HMS Collector Polling of Air-Cooled Nodes
      • Troubleshoot Issues with Redfish Endpoint Discovery
      • Check and Set the metal.no-wipe Setting on NCNs
      • Troubleshoot Loss of Console Connections and Logs on Gigabyte Nodes
      • Check the BMC Failover Mode
      • Update Compute Node Mellanox HSN NIC Firmware
      • Clear Space in Root File System on Worker Nodes
      • Update the Gigabyte Node BIOS Time
      • Configuration of NCN Bonding
      • Update the HPE Node BIOS Time
      • Configure NTP on NCNs
      • Updating Cabinet Routes on Management NCNs
      • Customize PCIe Hardware
      • Use the Physical KVM
      • Customize PCIe Hardware
      • Verify Node Removal
      • Defragment NID Numbering
      • View BIOS Logs for Liquid-Cooled Nodes
      • Disable Nodes
      • Manual Wipe Procedures
      • Dump a Non-Compute Node
      • Clear Gigabyte CMOS
      • Enable Nodes
      • Enable Passwordless Connections to Liquid Cooled Node BMCs
      • Enable IPMI access on HPE iLO BMCs

spire

Topics:

  1. Restore missing Spire metadata
  2. Restore Spire Postgres without an Existing Backup
  3. Spire Service Recovery
  4. Troubleshoot Spire Failing to Start on NCNs
  5. Update Spire Intermediate CA Certificate
  6. Xname Validation