Cray System Management

node management

Topics:

  1. Access and Update Settings for Replacement NCNs
  2. Add Remove Replace NCNs
  3. Add TLS Certificates to BMCs
  4. Add a Standard Rack Node
  5. Add Additional Liquid-Cooled Cabinets to a System
  6. Adding a Liquid-cooled Blade to a System
  7. Build NCN Images Locally
  8. Change Java Security Settings
  9. Change Settings for HMS Collector Polling of Air-Cooled Nodes
  10. Change Settings in the Bond
  11. Check and Set the metal.no-wipe Setting on NCNs
  12. Check the BMC Failover Mode
  13. Clear Space in Root File System on Worker Nodes
  14. Configuration of NCN Bonding
  15. Configure NTP on NCNs
  16. Disable Nodes
  17. Dump a Non-Compute Node
  18. Enable Nodes
  19. Enable Passwordless Connections to Liquid Cooled Node BMCs
  20. Find Node Type and Manufacturer
  21. Launch a Virtual KVM on Gigabyte Nodes
  22. Launch a Virtual KVM on Intel Servers
  23. Move a Standard Rack Node
  24. Move a Standard Rack Node (Same Rack/Same HSN Ports)
  25. NCN Drive Identification
  26. Node Management
  27. Node Management Workflows
  28. Reboot NCNs
  29. Rebuild NCNs
  30. Replace a Compute Blade
  31. Reset Credentials on Redfish Devices
  32. Swap a Compute Blade with a Different System
  33. TLS Certificates for Redfish BMCs
  34. Troubleshoot Interfaces with IP Address Issues
  35. Troubleshoot Issues with Redfish Endpoint Discovery
  36. Troubleshoot Loss of Console Connections and Logs on Gigabyte Nodes
  37. Update Compute Node Mellanox HSN NIC Firmware
  38. Update the Gigabyte Node BIOS Time
  39. Updating Cabinet Routes on Management NCNs
  40. Use the Physical KVM
  41. Verify Node Removal
  42. View BIOS Logs for Liquid-Cooled Nodes