Cray System Management
Cray System Management (CSM) - Release Notes
Cray System Management (CSM) Administration Guide
Create a Backup of HMS Items for reinstall
Component Names (xnames)
Restore HSM
Validate CSM Health
Configure the Cray Command Line Interface (cray CLI)
User Access Service (UAS)
Add a Volume to UAS
Broker Mode UAI Management
Choosing UAI Resource Settings
Common UAI Configuration
Configure End-User UAI Classes for Broker Mode
Configure UAIs in UAS
Configure a Broker UAI Class
Configure a Default UAI Class for Legacy Mode
Create UAIs From Specific UAI Images in Legacy Mode
Create a UAI
Create a UAI Class
Create a UAI Resource Specification
Create a UAI with Additional Ports
Create and Use Default UAIs in Legacy Mode
Customize End-User UAI Images
Customize the Broker UAI Image
Delete a UAI
Delete a UAI Class
Delete a UAI Image Registration
Delete a UAI Resource Specification
Delete a Volume Configuration
Elements of a UAI
End-User UAIs
Examine a UAI Using a Direct Administrative Command
Legacy Mode User-Driven UAI Management
List Available UAI Classes
List Available UAI Images in Legacy Mode
List Registered UAI Images
List UAI Resource Specifications
List UAIs
List UAS Version Information
List Volumes Registered in UAS
Log in to a Broker UAI
This Page Has Moved
Modify a UAI Class
Obtain the Configuration of a UAS Volume
Register a UAI Image
Clear UAS Configuration
Resource Specifications
Retrieve Resource Specification Details
Retrieve UAI Image Registration Information
Setting UAI Timeouts
Broker UAI Resiliency and Load Balancing
Special Purpose UAIs
Start a Broker UAI
Troubleshoot Broker UAI SSSD Cannot Use /etc/sssd/sssd.conf
Troubleshoot Common Mistakes when Creating a Custom End-User UAI Image
Troubleshoot Duplicate Mount Paths in a UAI
Troubleshoot Missing or Incorrect UAI Images
Troubleshoot Stale Brokered UAIs
Troubleshoot UAS / CLI Authentication Issues
Troubleshoot UAI Stuck in ContainerCreating
Troubleshoot UAIs by Viewing Log Output
Troubleshoot UAIs with Administrative Access
Troubleshoot UAS Issues
Troubleshoot UAS by Viewing Log Output
UAI Classes
UAI Host Node Selection
UAI Host Nodes
UAI Image Customization
UAI Images
UAI Management
UAI Network Attachment Customization
UAI macvlans Network Attachments
UAS Limitations
UAS and UAI Legacy Mode Health Checks
Update a Resource Specification
Update a UAI Image Registration
Update a UAS Volume
View a UAI Class
Volumes
argo
Using Argo Workflows
Using the Argo UI
bare metal
Bare-Metal Steps
Fresh Install Setting NodeBMC and RouterBMC Redfish Credentials
firmware
FASUpdate Script
FAS Admin Procedures
FAS CLI
Cleaning up FAS Database
FAS Filters
Backup and Restoring FAS Images
Updating Foxconn Paradise Nodes with FAS
FAS Recipes
Update iLO 5 firmware above v2.78
FAS Recipes and Procedures
Firmware Upgrade using SPP on HPE ProLiant Servers
Update Firmware with FAS
Updating BMC Firmware and BIOS for ncn-m001
Updating BMC Firmware and BIOS for NCNs without FAS
Upload BMC Recovery Firmware into TFTP Server
hmcollector
Adjust HM Collector Ingress Replicas and Resource Limits
observability
Install and Upgrade Observability Framework
power management
Cray Advanced Platform Monitoring and Control (CAPMC)
Ignore Nodes with CAPMC
Liquid-cooled Node Power Management
Power Off Compute Cabinets
Power Off Management Cabinets
Power Off Storage Cabinets
Power Off the External Lustre File System
Power On Compute Cabinets
Power On and Boot Compute and User Access Nodes
Power On and Start the Management Kubernetes Cluster
Power On the External Lustre File System
Prepare the System for Power Off
Recover from a Liquid Cooled Cabinet EPO Event
Save Management Network Switch Configuration Settings
Set the Turbo Boost Limit
Shut Down and Power Off Compute and User Access Nodes
Shut Down and Power Off the Management Kubernetes Cluster
Standard Rack Node Power Management
System Power Off Procedures
System Power On Procedures
User Access to Compute Node Power Data
Power Management
Power Control Service
Node Card Power Management
Power Control Service (PCS)
Power Off Compute Cabinets
Power On Compute Cabinets
Recover from a Liquid Cooled Cabinet EPO Event
sat
System Admin Toolkit (SAT) in CSM
system configuration service
Configure BMC and Controller Parameters with SCSD
Manage Parameters with the scsd Service
Set BMC Credentials Using SAT
System Configuration Service
system layout service
Add Liquid-Cooled Cabinets to SLS
Add UAN CAN IP Addresses to SLS
Add an alias to a service
Create a Backup of the SLS Postgres Database
Dump SLS Information
Load SLS Database with Dump File
Restore SLS Postgres Database from Backup
Restore SLS Postgres without an Existing Backup
System Layout Service (SLS)
Update SLS with UAN Aliases
System Recovery
PBS Service Recovery
Slurm Service Recovery
Beta Procedures for System Recovery
CSM product management
Change Passwords and Credentials
Configure CSM packages with CFS
Configure Keycloak Account
Configure the root password and SSH keys in Vault
Post-Install Customizations
Redeploying a Chart
Remove Artifacts from Product Installations
Set up passwordless SSH
Validate Signed RPMs
security and authentication
API Authorization
Access the Keycloak User Management UI
Add LDAP User Federation
Add Root Service Account for Gigabyte Controllers
Audit Logs
Authenticate an Account with the Command Line
Backup and Restore Vault Clusters
Certificate Types
Change Air-Cooled Node BMC Credentials Using SAT
Change Credentials on ServerTech PDUs
Change Cray EX Liquid-Cooled Cabinet Global Default Password
Change the Keycloak Token Lifetime
Set NCN Image Root Password, SSH Keys, and Timezone
Set NCN Image Root Password, SSH Keys, and Timezone on PIT Node
Change Root Passwords for Compute Nodes
Change the Keycloak Admin Password
Change the LDAP Server IP Address for Existing LDAP Server Content
Change the LDAP Server IP Address for New LDAP Server Content
Configure Keycloak for LDAP/AD authentication
Configure root user on HPE iLO BMCs
Configure the RSA Plugin in Keycloak
Create Internal Groups in the Keycloak Shasta Realm
Create Internal User Accounts in the Keycloak Shasta Realm
Create a Backup of the Keycloak Postgres Database
Create a Service Account in Keycloak
Default Keycloak Realms, Accounts, and Clients
Delete Internal User Accounts in the Keycloak Shasta Realm
Get a Long-Lived Token for a Service Account
HashiCorp Vault
Keycloak Operations
Keycloak Service Recovery
Keycloak User Localization
Keycloak User Management with kcadm.sh
Make HTTPS Requests from Sources Outside the Management Kubernetes Cluster
Manage Sealed Secrets
Manage System Passwords
PKI Certificate Authority (CA)
PKI Services
Preserve Username Capitalization for Users Exported from Keycloak
Provisioning a Liquid-Cooled EX Cabinet CEC with Default Credentials
Public Key Infrastructure (PKI)
Recovering from Mismatched BMC Credentials
Remove Internal Groups from the Keycloak Shasta Realm
Remove the Email Mapper from the LDAP User Federation
Remove the LDAP User Federation from Keycloak
Re-Sync Keycloak Users to Compute Nodes
Retrieve an Authentication Token
Retrieve the Client Secret for Service Accounts
Update NCN User SSH Keys
System Security and Authentication
Transport Layer Security (TLS) for Ingress Services
Troubleshoot Common Vault Cluster Issues
Troubleshoot Kyverno configuration manually
Update Default Air-Cooled BMC and Leaf-BMC Switch SNMP Credentials
Update Default ServerTech PDU Credentials used by the Redfish Translation Service (RTS)
Set NCN User Passwords
Update Root Secrets In Vault
Updating the Liquid-Cooled EX Cabinet CEC with Default Credentials after a CEC Password Change
Vault Service Recovery
multi-tenancy
Cray HNC Manager
Creating a Tenant
Modifying a Tenant
Multi-Tenancy Support
Removing a Tenant
Slurm Operator
TAPMS (Tenant and Partition Management System) Overview
Tenant Administrator Configuration
Multi-Tenancy Vault Overview
utility storage
Adding a Ceph Node to the Ceph Cluster
Add Ceph OSDs
Adjust Ceph Pool Quotas
Alternate Storage Pools
Ceph Daemon Memory Profiling
Ceph Deep Scrubs
Ceph Health States
Ceph Orchestrator Usage
Ceph Service Check Script Usage
Ceph Storage Types
ceph-upgrade-tool.py Usage
Cephadm Reference Material
Collect Information about the Ceph Cluster
Dump Ceph Crash Data
Identify Ceph Latency Issues
Manage Ceph Services
Move Unmanaged Ceph OSDs
Shrink the Ceph Cluster
Shrink Ceph OSDs
Troubleshoot Ceph-Mon Processes Stopping and Exceeding Max Restarts
Troubleshoot Ceph MDS Client Connectivity Issues
Troubleshooting Ceph MDS Reporting Slow Requests and Failure on Client
Troubleshoot Ceph New RGW Deployment Failing
Troubleshoot Ceph OSDs Reporting Full
Troubleshoot Ceph Services Not Starting After a Server Crash
Troubleshoot Failure to Get Ceph Health
Troubleshoot HEALTH_ERR Module devicehealth has failed table Device already exists
Troubleshoot Insufficient Standby MDS Daemons Available
Troubleshoot Large Object Map Objects in Ceph Health
Troubleshoot Pods Failing to Restart on Other Worker Nodes
Fixing incorrect number of PG Issues
Troubleshoot if RGW Health Check Fails
Troubleshoot S3FS Cache Cleanup
Troubleshoot S3FS Mount Issues
Troubleshoot System Clock Skew
Troubleshoot a Down OSD
Troubleshoot an Unresponsive Rados-Gateway (radosgw) S3 Endpoint
Troubleshoot Ceph image with tag '<none>'
Utility Storage
Update ceph node-exporter config to monitor SNMP counters
boot orchestration
Boot Orchestration
BOS Services
BOS Workflows
Compute Node Boot Issue Symptom Node Console or Logs Indicate that the Server Response has Timed Out
Boot Issue Symptom Node HSN Interface Does Not Appear or Show Detected Links
Boot Orchestration
Boot UANs
BOS Commands Cheat Sheet
Check the Progress of BOS Session Operations
Clean Up After a BOS/BOA Job is Completed or Cancelled
Clean Up Logs After a BOA Kubernetes Job
Component Status
BOS Components
Compute Node Boot Issue Symptom Duplicate Address Warnings and Declined DHCP Offers in Logs
Compute Node Boot Issue Symptom Message About Invalid EEPROM Checksum in Node Console or Log
Compute Node Boot Issue Symptom Node is Not Able to Download the Required Artifacts
Compute Node Boot Sequence
Configure the BOS Timeout When Booting Compute Nodes
Create a Session Template to Boot Compute Nodes with CPS
Customize iPXE Binary Names
Determine Which BOS Session Booted a Node
Edit the iPXE Embedded Boot Script
Exporting and Importing BOS Data
Exporting and Importing BSS Data
Healthy Compute Node Boot Process
Kernel Boot Parameters
Limit the Scope of a BOS Session
BOS Limitations for Gigabyte BMC Hardware
Log File Locations and Ports Used in Compute Node Boot Troubleshooting
Manage a BOS Session
Manage a Session Template
Multi-tenancy with BOS
Node Boot Root Cause Analysis
BOS Options
Redeploy the iPXE and TFTP Services
Rolling Upgrades using BOS
BOS Session Templates
BOS Sessions
Staging Changes with BOS
Tools for Resolving Compute Node Boot Issues
Troubleshoot Booting Nodes with Hardware Issues
Troubleshoot Compute Node Boot Issues Related to Dynamic Host Configuration Protocol (DHCP)
Troubleshoot Compute Node Boot Issues Related to Slow Boot Times
Troubleshoot Compute Node Boot Issues Related to Trivial File Transfer Protocol (TFTP)
Troubleshoot Compute Node Boot Issues Related to Unified Extensible Firmware Interface (UEFI)
Troubleshoot Compute Node Boot Issues Related to the Boot Script Service (BSS)
Troubleshoot Compute Node Boot Issues Using Kubernetes
Troubleshoot UAN Boot Issues
Upload Node Boot Information to Boot Script Service (BSS)
View the Status of a BOS Session
kubernetes
About Kubernetes Taints and Labels
Kubernetes Encryption
About Postgres
About etcd
About kubectl
Backups for Etcd Clusters Running in Kubernetes
Kubernetes and Bare Metal etcd Certificate Renewal
Check for and Clear etcd Cluster Alarms
Check the Health of etcd Clusters
Clear Space in an etcd Cluster Database
Configure kubectl Credentials to Access the Kubernetes APIs
containerd
Create a Manual Backup of Bare-Metal etcd Cluster
Create a Manual Backup of a Healthy etcd Cluster
Determine if Pods are Hitting Resource Limits
Disaster Recovery for Postgres
Fix Failed to start etcd on Master NCN
Increase Kafka Pod Resource Limits
Increase the PVC size in an etcd Cluster Database
Increase Pod Resource Limits
Kubernetes
Kubernetes Networking
Kubernetes Storage
Kyverno policy management
Pod Resource Limits
Rebuild Unhealthy etcd Clusters
Recover from Postgres WAL Event
Repopulate Data in etcd Clusters When Rebuilding Them
Report the Endpoint Status for etcd Clusters
Restore Bare-Metal etcd Clusters from an S3 Snapshot
Restore Postgres
Restore an etcd Cluster from a Backup
Retrieve Cluster Health Information Using Kubernetes
TDS Lower CPU Requests
Troubleshoot Intermittent HTTP 503 Code Failures
Troubleshoot Postgres Database
View Postgres Information for System Databases
network
Management Network User Guide
Manual Switch Configuration
Fresh Install
Added Hardware
Generate Switch Configurations
Apply Custom Switch Configuration CSM 1.2
Apply Switch Configurations
CSM Automatic Network Utility
CANU Installation
Troubleshoot CANU Validation Errors
Use CANU to Verify, Generate, or Compare Switch Configurations
Generate Switch Configs Including Custom Configurations
Initializing CANU
Introduction to CANU
Quick start guide to CANU
Uninstall CANU
Update CANU From CSM Release Tarball
Use CANU to Generate Full Network Configuration
Dell Installation and Configuration Guide
Configure Access Control Lists (ACLs)
Configure Address Resolution Protocol (ARP)
Back Up a Switch Configuration
Configure Domain Name System (DNS) Client
Configure Domain Name
Configure Hostnames
Configure Internet Group Management Protocol (IGMP)
Configure Link Aggregation Group (LAG)
Link layer discovery protocol (LLDP)
Configure Locator LED
Configure Loopback Interface
Configure Management Interface
Configure Multiple Spanning Tree Protocol (MSTP)
Network Time Protocol (NTP) Client
Configure Physical Interfaces
Configure QoS
Configure Remote Logging
Reset Dell Switch Configuration
Configure SNMPv2c community
Dell SNMPv3 Users
Configure Secure Shell (SSH)
Configure System Images
Perform an Upgrade on Dell Switches
Configure Virtual Local Area Networks (VLANs)
Configure VLAN Interface
VLAN Trunking 802.1Q
Using canu-inventory with Ansible
Upgrade CANU
Collect Data
Configuration Management
Configuring SNMP in CSM
Mellanox Installation and Configuration Guide
Access control lists (ACLs)
Address resolution protocol (ARP)
Backing up switch configuration
BGP basics
Cable diagnostics
Check BGP and MetalLB
Check current DHCP leases
Check DHCP lease is getting allocated
Check HSM
Check KEA DHCP logs
Computes/UANs/Application Nodes
Large Number of DHCP Declines During a Node Boot
Domain name system (DNS) client
Domain name
You are getting an IP address, but not the correct one. Duplicate IP address check
Exec banners
Hostname
IGMP
Ip filter
Key features used in the management network configuration
Link aggregation group (LAG)
Large
Link layer discovery protocol (LLDP)
Loopback interface
Management interface
Example of how to configure Scenario A or B
Management network functions in detail
Medium
Multi-chassis interface
MLAG (Multi-Chassis LAG)
MLAG
Multiple spanning tree protocol (MSTP)
Native VLAN
TCPDUMP
NCNs on Install
Network types – Naming and segment Function
Network traffic pattern inside of the system
Network Time Protocol (NTP) Client
Open shortest path first (OSPF) v2
Physical interfaces
PIM-SM bootstrap router (BSR) and rendezvous-point (RP)
Rebooting NCN and PXE fails
Remote logging
How to connect management network to your campus network
Routed interfaces
Scenario A network connection via management network
Scenario B network connection via high speed network
Small
SNMPv2c community
Mellanox SNMPv3 users
Spine-leaf Architecture
Spine-leaf architecture
Why are spine-leaf architectures becoming more popular?
Secure shell (SSH)
Mac address Table
Static routing
Confirm the status of the cray-dhcp-kea pods/services
System images
Test TFTP traffic (Aruba Only)
Typical configuration of MLAG link connecting to NCN
Typical configuration of MLAG between switches
Performing Upgrade On Mellanox Switches
Verify the switches are forwarding DHCP traffic
Verify BGP
Verify the DHCP traffic on the workers
Verify route to TFTP
Very Large (Exascale)
Virtual local area networks (VLANs)
VLAN interface
VLAN trunking 802.1Q
Web user interface (WebUI)
Aruba Installation and Configuration Guide
802.1X
Access Control Lists (ACLs)
Address Resolution Protocol (ARP)
Backup a Switch Configuration
Border Gateway Protocol (BGP) Basics
Bluetooth Capabilities
Cable Diagnostics
Check BGP and MetalLB
Check Current DHCP Leases
Check DHCP Lease is Getting Allocated
Check HSM
Check KEA DHCP Logs
Classifier Policies
Verify Computes/UANs/Application Nodes
Large Number of DHCP Declines During a Node Boot
Configure Domain Name Service (DNS) Clients
Configure Domain Names
Check for Duplicate IP Addresses
Configure Exec Banners
Configure Hostnames
Configure Internet Group Management Protocol (IGMP)
Initial Prioritization
Introduction
Key Features Used in the Management Network Configuration
Link Aggregation Group (LAG)
Link Layer Discovery Protocol (LLDP)
Locator LED
Loopback Interface
MAC Authentication
Management Interface
Example of How to Configure Scenario A or B
System Management Network Functions
VSX ISL HA
VSX MCLAG Link HA
VSX Member Power Failure
VSX Split
Multi-Chassis Link Aggregation Group (MCLAG)
Message-Of-The-Day (MOTD)
Multicast Source Discovery Protocol (MSDP)
Multiple Spanning Tree Protocol (MSTP)
Native VLAN
NCN tcpdump
NCNs on Install
Network Types – Naming and Segment Function
Network Topologies
Network Traffic Pattern
Notices
Network Time Protocol (NTP) Client
Open Shortest Path First (OSPF) v2
Physical Interfaces
PIM-SM Bootstrap Router (BSR) and Rendezvous Point (RP)
Port Mirroring
Port Security
Queuing and Scheduling
RADIUS
Rebooting NCNs and PXE Fails
Redundant Power Supplies
Remote Logging
Connect the Management Network to a Campus Network
Routed interfaces
Scenario A Network Connection via Management Network
Scenario B Network Connection via High-Speed Network
Simple Network Management Protocol (SNMP) Agent
SNMPv2c Community
SNMP traps
Aruba SNMPv3 Users
Spine-Leaf Architecture
Spine-leaf Architecture
Secure Shell (SSH)
Static Routing
Confirm the Status of the cray-dhcp-kea Pods
TACACS
Test TFTP Traffic (Aruba Only)
Typical Configuration of VSX
Typical Edge Port Configuration
Typical Configuration of MCLAG Link
Unidirectional Link Detection (UDLD)
Perform a VSX Upgrade on Aruba Switches
Verify the Switches are Forwarding DHCP Traffic
Verify BGP
Verify the DHCP Traffic on the Worker Nodes
Verify Route to TFTP
Virtual Local Area Networks (VLANs)
VLAN Interface
VLAN Trunking 802.1Q
Virtual Switching Framework (VSF) - 6300 Only
Virtual Switching Extension (VSX)
VSX Architecture
Switch Replacement in the VSX Cluster
VSX Sync
Web User Interface (WebUI)
Erase All zeroize
Edge switch cabling guide
Network Tests
Reinstall
Replace Switch
Save a Configuration
Prometheus SNMP Exporter
Transceivers and Cables
Example of the Connections Used in Shasta Management Network
Validate Cabling
Validate the SHCD
Validate Switch Configurations
Wipe Management Switch Configuration
Aruba splitting of QSFP+ and QSFP28 ports
Backup a Custom Configuration
BICAN Support Matrix - Shasta Customer Access Networks
BICAN switch configuration
Bifurcating the CAN - Feature Details
BICAN Summary
Bonded UAN Configuration
Cable Management Network Servers
firmware
Update Management Network Firmware
hardware
EX2500 Installation and Cabling
Access to System Management Services
Connect to Switch over USB-Serial Cable
Connect to the HPE Cray EX Environment
Create a CSM Configuration Upgrade Plan
Default IP Address Ranges
Gateway Testing
Network
dhcp
DHCP boot file customization
DHCP
Troubleshoot DHCP Issues
customer accessible networks
Connect to the CMN and CAN
Customer Access Networks
scripts
sls
sls_utils Library
network
Enabling Customer High Speed Network Routing
Management Network Upgrade CSM 1.2 to 1.3
Customer Accessible Networks
CAN/CMN with Dual-Spine Configuration
Externally Exposed Services
Troubleshoot CMN issues
BI-CAN Aruba/Arista Configuration
MetalLB Peering with Arista Edge Router
external dns
External DNS
External DNS Failing to Discover Services Workaround
External DNS CSI Input Values
Ingress Routing
Troubleshoot DNS Configuration Issues
Troubleshoot Connectivity to Services with External IP addresses
Update the cmn-external-dns value post-installation
metallb bgp
Check BGP Status and Reset Sessions
MetalLB Configuration
MetalLB in BGP-Mode
Troubleshoot BGP not Accepting Routes from MetalLB
Troubleshoot Services without an Allocated IP Address
dns
Domain Name Service (DNS) Overview
Enable nscd on UANs
Manage the DNS Unbound Resolver
PowerDNS Configuration
PowerDNS Migration Guide
Troubleshoot Common DNS Issues
Troubleshoot PowerDNS
resiliency
Recreate StatefulSet Pods on Another Node
Resilience of System Management Services
Resiliency
Resiliency Testing Procedure
Restore System Functionality if a Kubernetes Worker Node is Down
artifact management
Artifact Management
Generate Temporary S3 Credentials
Manage Artifacts with the Cray CLI
Use S3 Libraries and Clients
hpe pdu
HPE PDU Admin Procedures
node management
Access and Update Settings for Replacement NCNs
Removing a Liquid-cooled blade from a System
Removing a Liquid-cooled blade from a System Using SAT
Removing a Standard rack node from a System
Replace a Compute Blade
Replace a Compute Blade Using SAT
Replace a Standard rack node from a System
Replacing Foxconn Username and Passwords in Vault
Add TLS Certificates to BMCs
Repurpose a Compute Node as a UAN
Add a Standard Rack Node
Reset Credentials on Redfish Devices
Add Additional Air-Cooled Cabinets to a System
S3FS Usage and Guidelines for Shasta
Add Additional Liquid-Cooled Cabinets to a System
Set Gigabyte Node BMC to Factory Defaults
Adding a Liquid-cooled Blade to a System
Swap a Compute Blade with a Different System
Adding a Liquid-cooled blade to a System Using SAT
Swap a Compute Blade with a Different System Using SAT
Boot a storage node into new image without upgrading CSM
Switch PXE Boot from Onboard NIC to PCIe
Build NCN Images Locally
TLS Certificates for Redfish BMCs
Change Java Security Settings
Troubleshoot Interfaces with IP Address Issues
Change Settings for HMS Collector Polling of Air-Cooled Nodes
Troubleshoot Issues with Redfish Endpoint Discovery
Check and Set the metal.no-wipe Setting on NCNs
Troubleshoot Loss of Console Connections and Logs on Gigabyte Nodes
Check the BMC Failover Mode
Update Compute Node Mellanox HSN NIC Firmware
Clear Space in Root File System on Worker Nodes
Update the Gigabyte Node BIOS Time
Configuration of NCN Bonding
Update the HPE Node BIOS Time
Configure NTP on NCNs
Updating Cabinet Routes on Management NCNs
Customize PCIe Hardware
Use the Physical KVM
Customize PCIe Hardware
Verify Node Removal
Defragment NID Numbering
View BIOS Logs for Liquid-Cooled Nodes
Disable Nodes
Manual Wipe Procedures
Dump a Non-Compute Node
Clear Gigabyte CMOS
Enable Nodes
Enable Passwordless Connections to Liquid Cooled Node BMCs
Enable IPMI access on HPE iLO BMCs
Find Node Type and Manufacturer
Launch a Virtual KVM on Gigabyte Servers
Launch a Virtual KVM on Intel Servers
Move a Standard Rack Node
Move a Standard Rack Node (Same Rack/Same HSN Ports)
Move a liquid-cooled blade within a System
NCN Drive Identification
NCN NIC Replacement
NCN Network Troubleshooting
Node Management
Node Management Workflows
Reboot NCNs
Rebuild NCNs
Final Validation Steps
Identify Nodes and Update Metadata
Post Rebuild Storage Node Validation
Power Cycle and Rebuild Nodes
Prepare Storage Nodes
Re-Add a Storage Node to Ceph
Rebuild NCNs
Validate Boot Loader
Add Remove Replace NCNs
Add NCN Data
Alpha Framework to Add, Remove, Replace, or Move NCNs
Add Switch Configuration for NCN
Allocate NCN IP Addresses
Boot NCN
Collect NCN MAC Addresses
Redeploy Services Impacted by Adding or Permanently Removing Storage Nodes
Remove NCN Data
Remove NCN from Role
Remove Switch Configuration for NCN
Update Firmware
Update NCN BIOS TPM State
Validate Health
Validate Added NCN
package repository management
Manage Repositories with Nexus
Nexus Configuration
Nexus Deployment
Nexus Export and Restore
Nexus Service Recovery
Nexus Space Cleanup
Package Repository Management
Package Repository Management with Nexus
Repair Blobstore
Repair Yum Repository Metadata
Restrict Admin Privileges in Nexus
Troubleshoot Nexus
spire
Restore missing Spire metadata
Restore Spire Postgres without an Existing Backup
Spire Service Recovery
Troubleshoot Spire Failing to Start on NCNs
Update Spire Intermediate CA Certificate
Xname Validation
cani
Add A Blade To A Cabinet In SLS Using CANI
Add A Cabinet To SLS using CANI
system management health
Access System Management Health Services
Configure Prometheus Alerta Alert Notifications
Configure Prometheus Email Alert Notifications
Retrieve SMART data from ClusterStor E1000 nodes via Redfish Exporter
Grafana Dashboards by Component
Grafterm
grok-exporter pod status showing as ContainerStatusUnknown Error
prometheus-kafka-adapter errors during installation
Remove Kiali
System Management Health
System Management Health Checks and Alerts
Troubleshoot Grafana Dashboard
Troubleshoot Prometheus Alerts
Thanos
UAN NODE Exporter
conman
Access Compute Node Logs
Access Console Log Data Via the System Monitoring Framework (SMF)
Complete Reset of the Console Services
ConMan
Configure Log Rotation
Console Services Troubleshooting Guide
Disable ConMan After the System Software Installation
Establish a Serial Connection to NCNs
Log in to a Node Using ConMan
Manage Node Consoles
Troubleshoot ConMan Asking for Password on SSH Connection
Troubleshoot ConMan Blocking Access to a Node BMC
Troubleshoot ConMan Failing to Connect to a Console
Troubleshoot Console Node Pod Stuck in Terminating State
hardware state manager
Add a Switch to the HSM Database
Add an NCN to the HSM Database
Component Group Members
Component Groups and Partitions
Component Memberships
Component Partition Members
Create a Backup of the HSM Postgres Database
Backup/Restore HSM User Data (Locks, Groups, and Partitions)
HSM Roles and Subroles
Hardware Management Services (HMS) Locking API
Hardware State Manager (HSM)
Hardware State Manager (HSM) State and Flag Fields
Lock and Unlock Management Nodes
Manage Component Groups
Manage Component Partitions
Manage HMS Locks
Remove Duplicate Detected Events From the HSM Postgres Database
Restore Hardware State Manager (HSM) Postgres Database from Backup
Restore Hardware State Manager (HSM) Postgres without an Existing Backup
Set BMC Management Roles
image management
Build a New UAN Image Using the Default Recipe
Build an Image Using IMS REST Service
Configure IMS to Use DKMS
Configure IMS to Validate RPMs
Configure a Remote Build Node
Convert TGZ Archives to SquashFS Images
Create UAN Boot Images
Customize an Image Root Using IMS
Delete or Recover Deleted IMS Content
Exporting and Importing IMS Data
Image Job Performance
Image Management
Image Management Workflows
Import an External Image to IMS
Import an NCN Image to IMS
Troubleshoot Issues with Large Images
Troubleshoot Remote Build Node
Troubleshoot Interactions with zypper
Upload and Register an Image Recipe
Working With aarch64 Images
configuration management
ARP Cache Tuning Guide
Accessing sat bootprep Files
Adding Additional Inventory
Ansible Execution Environments
Ansible Log Collection
Automatic Configuration Management
Automatic Session Deletion with session_ttl
Backup and Restore VCS Data
CFS Commands Cheat Sheet
CFS Components
CFS Configurations
CFS Flow
CFS Global Options
CFS Key Management and Permission Denied Errors
CFS Session Inventory
CFS Sessions
CFS Sources
Change the Ansible Verbosity
Configuration Management
Configure Ansible
Create a Node Personalization CFS Session
Create an Image Customization CFS Session
Create and Populate a VCS Configuration Repository
Customize Configuration Values
Differences Between the V2 and V3 CFS APIs
Enable Ansible Profiling
Exporting and Importing CFS Data
Git Operations
Management Node Image Customization
Management Node Personalization
Paging CFS Records
Set Limits for a Configuration Session
Specifying Hosts and Groups
Target Ansible Tasks for Image Customization
Track the Status of a Session
Troubleshoot CFS Issues
Troubleshoot Failed CFS Sessions
Troubleshoot CFS Session Failing to Complete
Troubleshoot CFS Sessions Failing to Start
Update a CFS Configuration
Update the Privacy Settings for Gitea Configuration Content Repositories
VCS Administrative User
VCS Branching Strategy
Version Control Service (VCS)
View Configuration Session Logs
Write Ansible Code for CFS
iuf
Install and Upgrade Framework
stages
deliver-product
deploy-product
managed-nodes-rollout
management-nodes-rollout
post-install-check
post-install-service-check
pre-install-check
prepare-images
process-media
update-cfs-config
update-vcs-config
workflows
Populate Admin Directory with Files Defining Site Preferences
Backup
Configuration
Configuration of the Slingshot Fabric Manager
Deploy Product
Image Preparation
Install or Upgrade Additional Products with IUF
Managed Rollout
Management Rollout
Prepare for the Install or Upgrade
Product Delivery
Perform Slingshot Switch Firmware Updates
Upgrade CSM and Additional Products with IUF
Validate Deployment
examples
iuf abort Examples
iuf activity Examples
iuf list-activities Examples
iuf list-stages Examples
iuf restart Examples
iuf resume Examples
iuf run Examples
iuf workflow Examples
Cray System Management Install
SHCD HMN Tab/HMN Connections Rules
Ceph CSI Troubleshooting
Collect MAC Addresses for NCNs
Troubleshooting Installation Problems
Collecting the BMC MAC Addresses
PXE Boot Troubleshooting
Collecting NCN MAC Addresses
Troubleshooting Unused Drives on Storage Nodes
Configure Administrative Access
Utility Storage Installation Troubleshooting
Pre-Installation
Configure Management Network
Prepare Compute Nodes
Create Application Node Config YAML
Prepare site init
Create Cabinets YAML
Re-Installation
Create HMN Connections JSON File
Create NCN Metadata CSV
Create Switch Metadata CSV
Create System Configuration Using Cluster Discovery Service
Create System Configuration Using SHCD
CSM Services Install Fails Because of Missing Secret
Deploy Final NCN
Deploy Management Nodes
Install CSM Services
livecd
Accessing LiveCD USB Device After Reboot
Boot LiveCD RemoteISO
Boot LiveCD USB
Reinstall LiveCD
Reset root Password on a LiveCD USB
CSM Troubleshooting Information
Weave Container Network Interface Troubleshooting
Manual SSH Key Setting Process
Troubleshoot the CMS Barebones Image Boot Test
DHCP Troubleshooting
DNS Troubleshooting
Running HMS CT Tests Manually
Incrementally Configuring Images
PXE Booting Runbook
Interpreting HMS Health Check Results
known issues
BOS Operator Pods OOMKilled
BOS Sessions Stuck Pending
CFS Component With Zero-Length ID
CFS V2 Failures On Large Systems
Known Issue FAS Loader / HFP script post-deliver-product.sh
Gigabyte BMC Missing Redfish Data
HMS Resource Leaks
Hang Listing BOS V1 Sessions
Keycloak Error "Cannot read properties" in Web UI
Nexus Fails Authentication with Keycloak Users
PCS Power Capping Blanca Peak and Parry Peak
SLS Not Working During Node Rebuild
VCS Password With Illegal Characters
Known Issue admin-client-auth Not Found
Antero node NID allocation
Known Issue Ceph OSD latency
CFS Session for Image Customization on Remote Node Status Stuck at running
Known Issue check_bios_firmware_versions.sh script does not report valid expected firmware versions
SAT/HSM/CAPMC/PCS Component Power State Mismatch
cray-console-node pods in CrashLoopBackOff
Known Issue cray-tftp-upload errors
Cray CLI 403 Forbidden Errors
HMS Discovery Job Not Creating RedfishEndpoints In Hardware State Manager
Flags Set For Nodes In HSM
Goss Test Fails with Connection Refused
Helm Chart Deploy Timeouts
hms-discovery Timeout Due to Missing Switches
HPE iLO dropping event subscriptions and not properly transitioning power state in CSM software
IMS Image Customization Job Status Stuck at waiting on user
Known Issue IMS Images Orphaned in S3
Soft Deleted IMS Image Always Has arch=x86 64
Soft Deleted IMS Recipe Always Has arch=x86 64
Soft Deleted IMS Recipe Always Has require dkms=true
Known issues with NCN health checks
IUF fails with Not a directory /etc/cray/upgrade/csm/media/...
Known issue kubectl logs -f returns no space left on device
Missing binaries in aarch64 Images
Known issues with NCN resource checks
HPE Cray EX255a Boot Issue with Console Parameter
Transaction Size Limitation for PCS and CAPMC
PostgreSQL Cluster Upgrades Failing
PostgreSQL Database is in Recovery
PostgreSQL Clusters in SyncFailed State Due to Kyverno Webhook
Product Catalog Upgrade Error
QLogic driver crash
Known Issue Boot Orchestration Service (BOS) / Rolling reboots
Known Issue RTS fails to restart after a worker node has been rebooted
sat bootprep image customization error
Software Management Services health checks
Spire database connection pool configuration in an air-gapped environment
Spire Database Cluster DNS Lookup Failure
SSL Certificate Validation Issues
Storage node cloud-init fails with 'Timed out waiting for device' error
Test Failures Due To No Discovered Compute Nodes In HSM
Known Issue Velero Version Mismatch
Wait for unbound or cray-dns-unbound-manager hangs
kubernetes
Kubernetes kube-apiserver Failing
Kubernetes Log File Locations
Kubernetes Troubleshooting Information
Troubleshoot Kubernetes Master or Worker node in NotReady state
Troubleshoot Kubernetes Pods Not Starting
Troubleshoot Liveliness or Readiness Probe Failures
Troubleshoot Unresponsive kubectl Commands
Glossary
Introduction to CSM Installation
CSM Overview
Deprecated Features
CAPMC Deprecation Notice
Documentation Conventions
templates
Templates
Non-Compute Nodes
Certificate Authority
NCN BIOS
NCN Boot Workflow
NCN Firmware
NCN Images
Kernel Dumps
NCN Kernel
NCN Mounts and Filesystems
NCN Networking
NCN Operating System Releases
NCN Plan of Record
REST API Documentation
Boot Orchestration Service v2
Boot Script Service v1
Cray Advanced Platform Monitoring and Control (CAPMC) v3
Configuration Framework Service v1
Firmware Action Service v1
Heartbeat Tracker Service v1
HMS Notification Fanout Daemon v1
Image Management Service v3
NCN Lifecycle Service v1
Power Control Service (PCS) v1
System Configuration Service v1
System Layout Service v2
Hardware State Manager API v2
Cray STS Token Generator v1
TAPMS Tenant Status API v1
User Access Service v1
Update CSM Product Stream
Upgrade CSM
Resource Materials
k8s
Worker-Specific Manual Steps
storage
CEPHADM
CSM 1.5.3 Patch Installation Instructions
CSM 1.5.4 Patch Installation Instructions
CSM 1.5.1 Patch Installation Instructions
Prepare for Upgrade to Next CSM Major Version
CSM 1.5.2 Patch Installation Instructions
CSM Only Upgrade
CSM Only Upgrade on a System with Other Products
Upgrade NCNs during CSM 1.5.2 Patch
Stage 0 - Prerequisites and Preflight Checks
Stage 1 - CSM Service Upgrades
Stage 2 - Ceph image upgrade
Stage 3 - Kubernetes Upgrade
CSM 1.4 to 1.5 Upgrade Process
Upgrade only CSM
Validate CSM Health During a CSM Upgrade
scripts
upgrade
Upgrade Automation
sls
SLS Updates Expert mode
Upgrade SLS Offline from CSM 1.0.x to CSM 1.2
sls updater.py Technical Details
sls utils Library
workflows
iuf
operations
Argo Templates
Argo Templates
utility storage
Topics:
Adding a Ceph Node to the Ceph Cluster
Add Ceph OSDs
Adjust Ceph Pool Quotas
Alternate Storage Pools
Ceph Daemon Memory Profiling
Ceph Deep Scrubs
Ceph Health States
Ceph Orchestrator Usage
Ceph Service Check Script Usage
Ceph Storage Types
ceph-upgrade-tool.py Usage
Cephadm Reference Material
Collect Information about the Ceph Cluster
Dump Ceph Crash Data
Identify Ceph Latency Issues
Manage Ceph Services
Move Unmanaged Ceph OSDs
Shrink the Ceph Cluster
Shrink Ceph OSDs
Troubleshoot Ceph-Mon Processes Stopping and Exceeding Max Restarts
Troubleshoot Ceph MDS Client Connectivity Issues
Troubleshooting Ceph MDS Reporting Slow Requests and Failure on Client
Troubleshoot Ceph New RGW Deployment Failing
Troubleshoot Ceph OSDs Reporting Full
Troubleshoot Ceph Services Not Starting After a Server Crash
Troubleshoot Failure to Get Ceph Health
Troubleshoot HEALTH_ERR Module 'devicehealth' has failed: table Device already exists
Troubleshoot Insufficient Standby MDS Daemons Available
Troubleshoot Large Object Map Objects in Ceph Health
Troubleshoot Pods Failing to Restart on Other Worker Nodes
Fixing incorrect number of PG Issues
Troubleshoot if RGW Health Check Fails
Troubleshoot S3FS Cache Cleanup
Troubleshoot S3FS Mount Issues
Troubleshoot System Clock Skew
Troubleshoot a Down OSD
Troubleshoot an Unresponsive Rados-Gateway (radosgw) S3 Endpoint
Troubleshoot Ceph image with tag '<none>'
Utility Storage
Update ceph node-exporter config to monitor SNMP counters