Cray System Management
  • Cray System Management (CSM) - Release Notes
  • Cray System Management (CSM) Administration Guide
    • Create a Backup of HMS Items for reinstall
    • Component Names (xnames)
    • Restore HSM
    • Validate CSM Health
    • Configure the Cray Command Line Interface (cray CLI)
    • User Access Service (UAS)
      • Add a Volume to UAS
      • Broker Mode UAI Management
      • Choosing UAI Resource Settings
      • Common UAI Configuration
      • Configure End-User UAI Classes for Broker Mode
      • Configure UAIs in UAS
      • Configure a Broker UAI Class
      • Configure a Default UAI Class for Legacy Mode
      • Create UAIs From Specific UAI Images in Legacy Mode
      • Create a UAI
      • Create a UAI Class
      • Create a UAI Resource Specification
      • Create a UAI with Additional Ports
      • Create and Use Default UAIs in Legacy Mode
      • Customize End-User UAI Images
      • Customize the Broker UAI Image
      • Delete a UAI
      • Delete a UAI Class
      • Delete a UAI Image Registration
      • Delete a UAI Resource Specification
      • Delete a Volume Configuration
      • Elements of a UAI
      • End-User UAIs
      • Examine a UAI Using a Direct Administrative Command
      • Legacy Mode User-Driven UAI Management
      • List Available UAI Classes
      • List Available UAI Images in Legacy Mode
      • List Registered UAI Images
      • List UAI Resource Specifications
      • List UAIs
      • List UAS Version Information
      • List Volumes Registered in UAS
      • Log in to a Broker UAI
      • This Page Has Moved
      • Modify a UAI Class
      • Obtain the Configuration of a UAS Volume
      • Register a UAI Image
      • Clear UAS Configuration
      • Resource Specifications
      • Retrieve Resource Specification Details
      • Retrieve UAI Image Registration Information
      • Setting UAI Timeouts
      • Broker UAI Resiliency and Load Balancing
      • Special Purpose UAIs
      • Start a Broker UAI
      • Troubleshoot Broker UAI SSSD Cannot Use /etc/sssd/sssd.conf
      • Troubleshoot Common Mistakes when Creating a Custom End-User UAI Image
      • Troubleshoot Duplicate Mount Paths in a UAI
      • Troubleshoot Missing or Incorrect UAI Images
      • Troubleshoot Stale Brokered UAIs
      • Troubleshoot UAS / CLI Authentication Issues
      • Troubleshoot UAI Stuck in ContainerCreating
      • Troubleshoot UAIs by Viewing Log Output
      • Troubleshoot UAIs with Administrative Access
      • Troubleshoot UAS Issues
      • Troubleshoot UAS by Viewing Log Output
      • UAI Classes
      • UAI Host Node Selection
      • UAI Host Nodes
      • UAI Image Customization
      • UAI Images
      • UAI Management
      • UAI Network Attachment Customization
      • UAI macvlans Network Attachments
      • UAS Limitations
      • UAS and UAI Legacy Mode Health Checks
      • Update a Resource Specification
      • Update a UAI Image Registration
      • Update a UAS Volume
      • View a UAI Class
      • Volumes
    • bare metal
      • Bare-Metal Steps
      • Fresh Install Setting NodeBMC and RouterBMC Redfish Credentials
    • resiliency
      • Recreate StatefulSet Pods on Another Node
      • Resilience of System Management Services
      • Resiliency
      • Resiliency Testing Procedure
      • Restore System Functionality if a Kubernetes Worker Node is Down
    • sat
      • System Admin Toolkit (SAT) in CSM
    • system management health
      • Access System Management Health Services
      • Configure Prometheus Alerta Alert Notifications
      • Configure Prometheus Email Alert Notifications
      • Retrieve SMART data from ClusterStor E1000 nodes via Redfish Exporter
      • Grafana Dashboards by Component
      • Grafterm
      • prometheus-kafka-adapter errors during installation
      • Remove Kiali
      • System Management Health
      • System Management Health Checks and Alerts
      • Troubleshoot Grafana Dashboard
      • Troubleshoot Prometheus Alerts
      • UAN Node Exporter
    • iuf
      • Install and Upgrade Framework
      • examples
        • iuf abort Examples
        • iuf activity Examples
        • iuf list-activities Examples
        • iuf list-stages Examples
        • iuf restart Examples
        • iuf resume Examples
        • iuf run Examples
      • stages
        • deliver-product
        • deploy-product
        • managed-nodes-rollout
        • management-nodes-rollout
        • post-install-check
        • post-install-service-check
        • pre-install-check
        • prepare-images
        • process-media
        • update-cfs-config
        • update-vcs-config
      • workflows
        • Backup
        • Configuration
        • Configuration of the Slingshot Fabric Manager
        • Deploy product
        • Image preparation
        • Install or upgrade additional products with IUF
        • Managed rollout
        • Management rollout
        • Prepare for the install or upgrade
        • Product delivery
        • Upgrade CSM and additional products with IUF
        • Validate deployment
    • System Recovery
      • PBS Service Recovery
      • Slurm Service Recovery
      • Beta Procedures for System Recovery
    • image management
      • Build a New UAN Image Using the Default Recipe
      • Build an Image Using IMS REST Service
      • Configure IMS to Use DKMS
      • Configure IMS to Validate RPMs
      • Convert TGZ Archives to SquashFS Images
      • Create UAN Boot Images
      • Customize an Image Root Using IMS
      • Delete or Recover Deleted IMS Content
      • Exporting and Importing IMS Data
      • Image Management
      • Image Management Workflows
      • Import an External Image to IMS
      • Import an NCN Image to IMS
      • Upload and Register an Image Recipe
    • node management
      • Access and Update Settings for Replacement NCNs
      • Removing a Liquid-cooled blade from a System
      • Removing a Liquid-cooled blade from a System Using SAT
      • Removing a Standard rack node from a System
      • Replace a Compute Blade
      • Replace a Compute Blade Using SAT
      • Replace a Standard rack node in a System
      • Repurpose a Compute Node as a UAN
      • Add TLS Certificates to BMCs
      • Reset Credentials on Redfish Devices
      • Add a Standard Rack Node
      • S3FS Usage and Guidelines for Shasta
      • Add Additional Air-Cooled Cabinets to a System
      • Set Gigabyte Node BMC to Factory Defaults
      • Add Additional Liquid-Cooled Cabinets to a System
      • Swap a Compute Blade with a Different System
      • Adding a Liquid-cooled Blade to a System
      • Swap a Compute Blade with a Different System Using SAT
      • Adding a Liquid-cooled blade to a System Using SAT
      • Switch PXE Boot from Onboard NIC to PCIe
      • Build NCN Images Locally
      • TLS Certificates for Redfish BMCs
      • Change Java Security Settings
      • Troubleshoot Interfaces with IP Address Issues
      • Change Settings for HMS Collector Polling of Air-Cooled Nodes
      • Troubleshoot Issues with Redfish Endpoint Discovery
      • Check and Set the metal.no-wipe Setting on NCNs
      • Troubleshoot Loss of Console Connections and Logs on Gigabyte Nodes
      • Check the BMC Failover Mode
      • Update Compute Node Mellanox HSN NIC Firmware
      • Clear Space in Root File System on Worker Nodes
      • Update the Gigabyte Node BIOS Time
      • Configuration of NCN Bonding
      • Update the HPE Node BIOS Time
      • Configure NTP on NCNs
      • Updating Cabinet Routes on Management NCNs
      • Customize PCIe Hardware
      • Use the Physical KVM
      • Verify Node Removal
      • Defragment NID Numbering
      • View BIOS Logs for Liquid-Cooled Nodes
      • Disable Nodes
      • Manual Wipe Procedures
      • Dump a Non-Compute Node
      • Clear Gigabyte CMOS
      • Enable Nodes
      • Enable Passwordless Connections to Liquid Cooled Node BMCs
      • Enable IPMI access on HPE iLO BMCs
      • Find Node Type and Manufacturer
      • Launch a Virtual KVM on Gigabyte Servers
      • Launch a Virtual KVM on Intel Servers
      • Move a Standard Rack Node
      • Move a Standard Rack Node (Same Rack/Same HSN Ports)
      • Move a liquid-cooled blade within a System
      • NCN Drive Identification
      • NCN Network Troubleshooting
      • Node Management
      • Node Management Workflows
      • Reboot NCNs
      • Rebuild NCNs
        • Final Validation Steps
        • Identify Nodes and Update Metadata
        • Post Rebuild Storage Node Validation
        • Power Cycle and Rebuild Nodes
        • Prepare Storage Nodes
        • Re-Add a Storage Node to Ceph
        • Rebuild NCNs
        • Validate Boot Loader
      • Add, Remove, or Replace NCNs
        • Add NCN Data
        • Alpha Framework to Add, Remove, Replace, or Move NCNs
        • Add Switch Configuration for NCN
        • Allocate NCN IP Addresses
        • Boot NCN
        • Collect NCN MAC Addresses
        • Redeploy Services Impacted by Adding or Permanently Removing Storage Nodes
        • Remove NCN Data
        • Remove NCN from Role
        • Remove Switch Configuration for NCN
        • Update Firmware
        • Update NCN BIOS TPM State
        • Validate Health
        • Validate Added NCN
    • system layout service
      • Add Liquid-Cooled Cabinets to SLS
      • Add UAN CAN IP Addresses to SLS
      • Add an alias to a service
      • Create a Backup of the SLS Postgres Database
      • Dump SLS Information
      • Load SLS Database with Dump File
      • Restore SLS Postgres Database from Backup
      • Restore SLS Postgres without an Existing Backup
      • System Layout Service (SLS)
      • Update SLS with UAN Aliases
    • conman
      • Access Compute Node Logs
      • Access Console Log Data Via the System Monitoring Framework (SMF)
      • Complete Reset of the Console Services
      • ConMan
      • Configure Log Rotation
      • Console Services Troubleshooting Guide
      • Disable ConMan After the System Software Installation
      • Establish a Serial Connection to NCNs
      • Log in to a Node Using ConMan
      • Manage Node Consoles
      • Troubleshoot ConMan Asking for Password on SSH Connection
      • Troubleshoot ConMan Blocking Access to a Node BMC
      • Troubleshoot ConMan Failing to Connect to a Console
      • Troubleshoot Console Node Pod Stuck in Terminating State
    • utility storage
      • Adding a Ceph Node to the Ceph Cluster
      • Add Ceph OSDs
      • Adjust Ceph Pool Quotas
      • Alternate Storage Pools
      • Ceph Daemon Memory Profiling
      • Ceph Deep Scrubs
      • Ceph Health States
      • Ceph Orchestrator Usage
      • Ceph Service Check Script Usage
      • Ceph Storage Types
      • ceph-upgrade-tool.py Usage
      • Cephadm Reference Material
      • Collect Information about the Ceph Cluster
      • Dump Ceph Crash Data
      • Identify Ceph Latency Issues
      • Manage Ceph Services
      • Shrink the Ceph Cluster
      • Restore Nexus Data After Data Corruption
      • Shrink Ceph OSDs
      • Troubleshoot Ceph-Mon Processes Stopping and Exceeding Max Restarts
      • Troubleshoot Ceph MDS Client Connectivity Issues
      • Troubleshooting Ceph MDS Reporting Slow Requests and Failure on Client
      • Troubleshoot Ceph New RGW Deployment Failing
      • Troubleshoot Ceph OSDs Reporting Full
      • Troubleshoot Ceph Services Not Starting After a Server Crash
      • Troubleshoot Failure to Get Ceph Health
      • Troubleshoot Insufficient Standby MDS Daemons Available
      • Troubleshoot Large Object Map Objects in Ceph Health
      • Troubleshoot Pods Failing to Restart on Other Worker Nodes
      • Fixing incorrect number of PG Issues
      • Troubleshoot if RGW Health Check Fails
      • Troubleshoot S3FS Mount Issues
      • Troubleshoot System Clock Skew
      • Troubleshoot a Down OSD
      • Troubleshoot an Unresponsive Rados-Gateway (radosgw) S3 Endpoint
      • Troubleshoot Ceph image with tag '<none>'
      • Utility Storage
    • hardware state manager
      • Add a Switch to the HSM Database
      • Add an NCN to the HSM Database
      • Component Group Members
      • Component Groups and Partitions
      • Component Memberships
      • Component Partition Members
      • Create a Backup of the HSM Postgres Database
      • Backup/Restore HSM User Data (Locks, Groups, and Partitions)
      • HSM Roles and Subroles
      • Hardware Management Services (HMS) Locking API
      • Hardware State Manager (HSM)
      • Hardware State Manager (HSM) State and Flag Fields
      • Lock and Unlock Management Nodes
      • Manage Component Groups
      • Manage Component Partitions
      • Manage HMS Locks
      • Restore Hardware State Manager (HSM) Postgres Database from Backup
      • Restore Hardware State Manager (HSM) Postgres without an Existing Backup
      • Set BMC Management Roles
    • argo
      • Using Argo Workflows
      • Using the Argo UI
    • security and authentication
      • API Authorization
      • Access the Keycloak User Management UI
      • Add LDAP User Federation
      • Add Root Service Account for Gigabyte Controllers
      • Audit Logs
      • Authenticate an Account with the Command Line
      • Backup and Restore Vault Clusters
      • Certificate Types
      • Change Air-Cooled Node BMC Credentials Using SAT
      • Change Credentials on ServerTech PDUs
      • Change Cray EX Liquid-Cooled Cabinet Global Default Password
      • Change the Keycloak Token Lifetime
      • Set NCN Image Root Password, SSH Keys, and Timezone
      • Set NCN Image Root Password, SSH Keys, and Timezone on PIT Node
      • Change Root Passwords for Compute Nodes
      • Change the Keycloak Admin Password
      • Change the LDAP Server IP Address for Existing LDAP Server Content
      • Change the LDAP Server IP Address for New LDAP Server Content
      • Configure Keycloak for LDAP/AD authentication
      • Configure root user on HPE iLO BMCs
      • Configure the RSA Plugin in Keycloak
      • Create Internal Groups in the Keycloak Shasta Realm
      • Create Internal User Accounts in the Keycloak Shasta Realm
      • Create a Backup of the Keycloak Postgres Database
      • Create a Service Account in Keycloak
      • Default Keycloak Realms, Accounts, and Clients
      • Delete Internal User Accounts in the Keycloak Shasta Realm
      • Get a Long-Lived Token for a Service Account
      • HashiCorp Vault
      • Keycloak Operations
      • Keycloak Service Recovery
      • Keycloak User Localization
      • Keycloak User Management with kcadm.sh
      • Make HTTPS Requests from Sources Outside the Management Kubernetes Cluster
      • Manage Sealed Secrets
      • Manage System Passwords
      • PKI Certificate Authority (CA)
      • PKI Services
      • Preserve Username Capitalization for Users Exported from Keycloak
      • Provisioning a Liquid-Cooled EX Cabinet CEC with Default Credentials
      • Public Key Infrastructure (PKI)
      • Recovering from Mismatched BMC Credentials
      • Remove Internal Groups from the Keycloak Shasta Realm
      • Remove the Email Mapper from the LDAP User Federation
      • Remove the LDAP User Federation from Keycloak
      • Re-Sync Keycloak Users to Compute Nodes
      • Retrieve an Authentication Token
      • Retrieve the Client Secret for Service Accounts
      • Update NCN User SSH Keys
      • System Security and Authentication
      • Transport Layer Security (TLS) for Ingress Services
      • Troubleshoot Common Vault Cluster Issues
      • Troubleshoot Kyverno configuration manually
      • Update Default Air-Cooled BMC and Leaf-BMC Switch SNMP Credentials
      • Update Default ServerTech PDU Credentials used by the Redfish Translation Service (RTS)
      • Set NCN User Passwords
      • Updating the Liquid-Cooled EX Cabinet CEC with Default Credentials after a CEC Password Change
      • Vault Service Recovery
    • spire
      • Create a Backup of the Spire Postgres Database
      • Restore missing Spire metadata
      • Restore Spire Postgres without an Existing Backup
      • Spire Service Recovery
      • Troubleshoot Spire Failing to Start on NCNs
      • Update Spire Intermediate CA Certificate
      • Xname Validation
    • boot orchestration
      • Boot Orchestration
      • BOS Services
      • BOS Workflows
      • Compute Node Boot Issue Symptom Node Console or Logs Indicate that the Server Response has Timed Out
      • Boot Issue Symptom Node HSN Interface Does Not Appear or Show Detected Links
      • Boot UANs
      • BOS Commands Cheat Sheet
      • Check the Progress of BOS Session Operations
      • Clean Up After a BOS/BOA Job is Completed or Cancelled
      • Clean Up Logs After a BOA Kubernetes Job
      • Component Status
      • BOS Components
      • Compute Node Boot Issue Symptom Duplicate Address Warnings and Declined DHCP Offers in Logs
      • Compute Node Boot Issue Symptom Message About Invalid EEPROM Checksum in Node Console or Log
      • Compute Node Boot Issue Symptom Node is Not Able to Download the Required Artifacts
      • Compute Node Boot Sequence
      • Configure the BOS Timeout When Booting Compute Nodes
      • Create a Session Template to Boot Compute Nodes with CPS
      • Customize iPXE Binary Names
      • Determine Which BOS Session Booted a Node
      • Edit the iPXE Embedded Boot Script
      • Exporting and Importing BOS Data
      • Exporting and Importing BSS Data
      • Healthy Compute Node Boot Process
      • Kernel Boot Parameters
      • Limit the Scope of a BOS Session
      • BOS Limitations for Gigabyte BMC Hardware
      • Log File Locations and Ports Used in Compute Node Boot Troubleshooting
      • Manage a BOS Session
      • Manage a Session Template
      • Node Boot Root Cause Analysis
      • BOS Options
      • Redeploy the iPXE and TFTP Services
      • Rolling Upgrades using BOS
      • BOS Session Templates
      • BOS Sessions
      • Staging Changes with BOS
      • Tools for Resolving Compute Node Boot Issues
      • Troubleshoot Booting Nodes with Hardware Issues
      • Troubleshoot Compute Node Boot Issues Related to Dynamic Host Configuration Protocol (DHCP)
      • Troubleshoot Compute Node Boot Issues Related to Slow Boot Times
      • Troubleshoot Compute Node Boot Issues Related to Trivial File Transfer Protocol (TFTP)
      • Troubleshoot Compute Node Boot Issues Related to Unified Extensible Firmware Interface (UEFI)
      • Troubleshoot Compute Node Boot Issues Related to the Boot Script Service (BSS)
      • Troubleshoot Compute Node Boot Issues Using Kubernetes
      • Troubleshoot UAN Boot Issues
      • Upload Node Boot Information to Boot Script Service (BSS)
      • View the Status of a BOS Session
    • CSM product management
      • Change Passwords and Credentials
      • Configure CSM packages with CFS
      • Configure Keycloak Account
      • Configure the root password and SSH keys in Vault
      • Post-Install Customizations
      • Redeploying a Chart
      • Remove Artifacts from Product Installations
      • Set up passwordless SSH
      • Validate Signed RPMs
    • multi-tenancy
      • Cray HNC Manager
      • Creating a Tenant
      • Modifying a Tenant
      • Multi-Tenancy Support
      • Removing a Tenant
      • Slurm Operator
      • TAPMS (Tenant and Partition Management System) Overview
      • Tenant Administrator Configuration
    • firmware
      • FASUpdate Script
      • FAS Admin Procedures
      • FAS CLI
      • FAS Filters
      • Backing Up and Restoring FAS Images
      • FAS Recipes
      • Update iLO 5 firmware above v2.78
      • FAS Recipes and Procedures
      • Firmware Upgrade using SPP on HPE ProLiant Servers
      • Update Firmware with FAS
      • Updating BMC Firmware and BIOS for ncn-m001
      • Updating BMC Firmware and BIOS for NCNs without FAS
      • Upload BMC Recovery Firmware into TFTP Server
    • hpe pdu
      • HPE PDU Admin Procedures
    • observability
      • Install and Upgrade Observability Framework
    • power management
      • Cray Advanced Platform Monitoring and Control (CAPMC)
      • Ignore Nodes with CAPMC
      • Liquid-cooled Node Power Management
      • Power Off Compute Cabinets
      • Power Off Management Cabinets
      • Power Off Storage Cabinets
      • Power Off the External Lustre File System
      • Power On Compute Cabinets
      • Power On and Boot Compute and User Access Nodes
      • Power On and Start the Management Kubernetes Cluster
      • Power On the External Lustre File System
      • Prepare the System for Power Off
      • Recover from a Liquid Cooled Cabinet EPO Event
      • Save Management Network Switch Configuration Settings
      • Set the Turbo Boost Limit
      • Shut Down and Power Off Compute and User Access Nodes
      • Shut Down and Power Off the Management Kubernetes Cluster
      • Standard Rack Node Power Management
      • System Power Off Procedures
      • System Power On Procedures
      • User Access to Compute Node Power Data
      • Power Management
      • Power Control Service
        • Node Card Power Management
        • Power Control Service (PCS)
        • Power Off Compute Cabinets
        • Power On Compute Cabinets
        • Recover from a Liquid Cooled Cabinet EPO Event
    • artifact management
      • Artifact Management
      • Generate Temporary S3 Credentials
      • Manage Artifacts with the Cray CLI
      • Use S3 Libraries and Clients
    • kubernetes
      • About Kubernetes Taints and Labels
      • Kubernetes Encryption
      • About Postgres
      • About etcd
      • About kubectl
      • Backups for etcd-operator Clusters
      • Kubernetes and Bare Metal etcd Certificate Renewal
      • Check for and Clear etcd Cluster Alarms
      • Check the Health and Balance of etcd Clusters
      • Clear Space in an etcd Cluster Database
      • Configure kubectl Credentials to Access the Kubernetes APIs
      • containerd
      • Create a Manual Backup of Bare-Metal etcd Cluster
      • Create a Manual Backup of a Healthy etcd Cluster
      • Determine if Pods are Hitting Resource Limits
      • Disaster Recovery for Postgres
      • Fix Failed to start etcd on Master NCN
      • Increase Kafka Pod Resource Limits
      • Increase Pod Resource Limits
      • Kubernetes
      • Kubernetes Networking
      • Kubernetes Storage
      • Kyverno policy management
      • Pod Resource Limits
      • Rebalance Healthy etcd Clusters
      • Rebuild Unhealthy etcd Clusters
      • Recover from Postgres WAL Event
      • Repopulate Data in etcd Clusters When Rebuilding Them
      • Report the Endpoint Status for etcd Clusters
      • Restore Bare-Metal etcd Clusters from an S3 Snapshot
      • Restore Postgres
      • Restore an etcd Cluster from a Backup
      • Retrieve Cluster Health Information Using Kubernetes
      • TDS Lower CPU Requests
      • Troubleshoot Intermittent HTTP 503 Code Failures
      • Troubleshoot Postgres Database
      • View Postgres Information for System Databases
    • package repository management
      • Manage Repositories with Nexus
      • Nexus Configuration
      • Nexus Deployment
      • Nexus Export and Restore
      • Nexus Service Recovery
      • Nexus Space Cleanup
      • Package Repository Management
      • Package Repository Management with Nexus
      • Repair Blobstore
      • Repair Yum Repository Metadata
      • Restrict Admin Privileges in Nexus
      • Troubleshoot Nexus
    • system configuration service
      • Configure BMC and Controller Parameters with SCSD
      • Manage Parameters with the scsd Service
      • Set BMC Credentials Using SAT
      • System Configuration Service
    • network
      • Management Network User Guide
        • Management Network Upgrade CSM 1.2 to 1.3
        • Fresh Install
        • Load Saved Switch Configuration
        • Generate Switch Configurations
        • Manual Switch Configuration
        • Added Hardware
        • Apply Custom Switch Configurations for CSM 1.0
        • Apply Custom Switch Configuration CSM 1.2
        • CSM Automatic Network Utility
          • CANU Installation
          • Troubleshoot CANU Validation Errors
          • Use CANU to Verify, Generate, or Compare Switch Configurations
          • Generate Switch Configs Including Custom Configurations
          • Initializing CANU
          • Introduction to CANU
          • Quick start guide to CANU
          • Uninstall CANU
          • Update CANU From CSM Release Tarball
          • Use CANU to Generate Full Network Configuration
        • Apply Switch Configurations
        • Dell Installation and Configuration Guide
          • Configure Access Control Lists (ACLs)
          • Configure Address Resolution Protocol (ARP)
          • Back Up a Switch Configuration
          • Configure Domain Name System (DNS) Client
          • Configure Domain Name
          • Configure Hostnames
          • Configure Internet Group Management Protocol (IGMP)
          • Configure Link Aggregation Group (LAG)
          • Link layer discovery protocol (LLDP)
          • Configure Locator LED
          • Configure Loopback Interface
          • Configure Management Interface
          • Configure Multiple Spanning Tree Protocol (MSTP)
          • Network Time Protocol (NTP) Client
          • Configure Physical Interfaces
          • Configure QoS
          • Configure Remote Logging
          • Reset Dell Switch Configuration
          • Configure SNMPv2c community
          • Dell SNMPv3 Users
          • Configure Secure Shell (SSH)
          • Configure System Images
          • Perform an Upgrade on Dell Switches
          • Configure Virtual Local Area Networks (VLANs)
          • Configure VLAN Interface
          • VLAN Trunking 802.1Q
        • Upgrade CANU
        • Collect Data
        • Configuration Management
        • Configuring SNMP in CSM
        • Mellanox Installation and Configuration Guide
          • Access control lists (ACLs)
          • Address resolution protocol (ARP)
          • Backing up switch configuration
          • BGP basics
          • Cable diagnostics
          • Check BGP and MetalLB
          • Check current DHCP leases
          • Check DHCP lease is getting allocated
          • Check HSM
          • Check KEA DHCP logs
          • Computes/UANs/Application Nodes
          • Large Number of DHCP Declines During a Node Boot
          • Domain name system (DNS) client
          • Domain name
          • Duplicate IP address check (an IP address is assigned, but not the correct one)
          • Exec banners
          • Hostname
          • IGMP
          • IP filter
          • Key features used in the management network configuration
          • Link aggregation group (LAG)
          • Large
          • Link layer discovery protocol (LLDP)
          • Loopback interface
          • Management interface
          • Example of how to configure Scenario A or B
          • Management network functions in detail
          • Medium
          • Multi-chassis interface
          • MLAG (Multi-Chassis LAG)
          • MLAG
          • Multiple spanning tree protocol (MSTP)
          • Native VLAN
          • TCPDUMP
          • NCNs on Install
          • Network types – Naming and segment Function
          • Network traffic pattern inside of the system
          • Network Time Protocol (NTP) Client
          • Open shortest path first (OSPF) v2
          • Physical interfaces
          • PIM-SM bootstrap router (BSR) and rendezvous-point (RP)
          • Rebooting NCN and PXE fails
          • Remote logging
          • How to connect management network to your campus network
          • Routed interfaces
          • Scenario A network connection via management network
          • Scenario B network connection via high speed network
          • Small
          • SNMPv2c community
          • Mellanox SNMPv3 users
          • Spine-leaf architecture
          • Why are spine-leaf architectures becoming more popular?
          • Secure shell (SSH)
          • MAC address table
          • Static routing
          • Confirm the status of the cray-dhcp-kea pods/services
          • System images
          • Test TFTP traffic (Aruba Only)
          • Typical configuration of MLAG link connecting to NCN
          • Typical configuration of MLAG between switches
          • Performing Upgrade On Mellanox Switches
          • Verify the switches are forwarding DHCP traffic
          • Verify BGP
          • Verify the DHCP traffic on the workers
          • Verify route to TFTP
          • Very Large (Exascale)
          • Virtual local area networks (VLANs)
          • VLAN interface
          • VLAN trunking 802.1Q
          • Web user interface (WebUI)
        • Aruba Installation and Configuration Guide
          • 802.1X
          • Access Control Lists (ACLs)
          • Address Resolution Protocol (ARP)
          • Back Up a Switch Configuration
          • Border Gateway Protocol (BGP) Basics
          • Bluetooth Capabilities
          • Cable Diagnostics
          • Check BGP and MetalLB
          • Check Current DHCP Leases
          • Check DHCP Lease is Getting Allocated
          • Check HSM
          • Check KEA DHCP Logs
          • Classifier Policies
          • Verify Computes/UANs/Application Nodes
          • Large Number of DHCP Declines During a Node Boot
          • Configure Domain Name Service (DNS) Clients
          • Configure Domain Names
          • Check for Duplicate IP Addresses
          • Configure Exec Banners
          • Configure Hostnames
          • Configure Internet Group Management Protocol (IGMP)
          • Initial Prioritization
          • Introduction
          • Key Features Used in the Management Network Configuration

External DNS

Topics:

  1. External DNS
  2. External DNS Failing to Discover Services Workaround
  3. External DNS CSI Input Values
  4. Ingress Routing
  5. Troubleshoot DNS Configuration Issues
  6. Troubleshoot Connectivity to Services with External IP addresses
  7. Update the cmn-external-dns value post-installation