Cray System Management

Resiliency

Topics:

  1. NTP Resiliency
  2. Recreate StatefulSet Pods on Another Node
  3. Resilience of System Management Services
  4. Resiliency
  5. Resiliency Testing Procedure
  6. Restore System Functionality if a Kubernetes Worker Node is Down