Cray System Management v1.0
Pre-Upgrade Scripts
These scripts must be invoked before an upgrade.