Cray System Management
utility storage
Topics:
Adding a Ceph Node to the Ceph Cluster
Add Ceph OSDs
Adjust Ceph Pool Quotas
Alternate Storage Pools
Ceph Daemon Memory Profiling
Ceph Deep Scrubs
Ceph Health States
Ceph Orchestrator Usage
Ceph Service Check Script Usage
Ceph Storage Types
ceph-upgrade-tool.py Usage
Cephadm Reference Material
Collect Information about the Ceph Cluster
Dump Ceph Crash Data
Identify Ceph Latency Issues
Manage Ceph Services
Shrink the Ceph Cluster
Restore Nexus Data After Data Corruption
Shrink Ceph OSDs
Troubleshoot Ceph-Mon Processes Stopping and Exceeding Max Restarts
Troubleshoot Ceph MDS Client Connectivity Issues
Troubleshooting Ceph MDS Reporting Slow Requests and Failure on Client
Troubleshoot Ceph New RGW Deployment Failing
Troubleshoot Ceph OSDs Reporting Full
Troubleshoot Ceph Services Not Starting After a Server Crash
Troubleshoot Failure to Get Ceph Health
Troubleshoot Insufficient Standby MDS Daemons Available
Troubleshoot Large Object Map Objects in Ceph Health
Troubleshoot Pods Failing to Restart on Other Worker Nodes
Fixing incorrect number of PG Issues
Troubleshoot if RGW Health Check Fails
Troubleshoot S3FS Mount Issues
Troubleshoot System Clock Skew
Troubleshoot a Down OSD
Troubleshoot an Unresponsive Rados-Gateway (radosgw) S3 Endpoint
Troubleshoot Ceph image with tag '<none>'
Utility Storage