Anytime after the installation of the CSM services, the health of the management nodes and all CSM services can be validated.
The following are examples of when to run health checks:
- After install.sh completes

The areas should be tested in the order they are listed on this page. Errors in an earlier check may cause errors in later checks because of dependencies.
The checks described on this page include the ncnHealthChecks and ncnPostgresHealthChecks scripts, as well as booting the CSM barebones image.
Scripts do not verify results. Script output includes analysis needed to determine pass/fail for each check. All health checks are expected to pass.
Health check scripts can be run any time after install.sh has been run (not before).

Available platform health checks:
- ncnHealthChecks
- ncnPostgresHealthChecks
ncnHealthChecks
NCN health check scripts can be found and run on any worker or master node (not on the PIT node), from any directory.
ncn-mw# /opt/cray/platform-utils/ncnHealthChecks.sh
The ncnHealthChecks script reports health information for each NCN, including the metal.no-wipe status. Execute the ncnHealthChecks script and analyze the output of each individual check.
Notes about the ncnHealthChecks output:

When the PIT node is booted, the NCN metal.no-wipe status is not available and is correctly reported as 'unavailable'. Once ncn-m001 has been booted, the NCN metal.no-wipe status is expected to be reported as metal.no-wipe=1.

If ncn-m001 has been booted and the output of the ncnHealthChecks.sh script shows that there are nodes that do not have the metal.no-wipe=1 status, then do the following:
ncn-mw# csi handoff bss-update-param --set metal.no-wipe=1 --limit <SERVER_XNAME>
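For example, for a hypothetical node xname of x3000c0s5b0n0 (the xname of an NCN can typically be read from /etc/cray/xname on that node), the fix and a follow-up check of the BSS boot parameters might look like the following sketch:
ncn-mw# csi handoff bss-update-param --set metal.no-wipe=1 --limit x3000c0s5b0n0
ncn-mw# cray bss bootparameters list --hosts x3000c0s5b0n0 --format json | jq -r '.[].params' | grep -o 'metal.no-wipe=[01]'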
If ncn-s001
is down when running the ncnHealthChecks
script, status from the ceph -s
command will be unavailable. In this case, the ceph -s
command can
be executed on any available master or storage node to determine the status of the Ceph cluster.
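For example, a minimal direct check from a master or storage node (assuming the Ceph CLI and configuration are present on that node); a healthy cluster reports HEALTH_OK:
ncn# ceph -s
ncn# ceph health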
If the output of pod statuses indicates that there are pods in the Evicted
state, it may be due to the /root
file system being filled up on the Kubernetes
node in question. Kubernetes will begin evicting pods once the root file system space is at 85% until it is back under 80%. This may commonly happen on ncn-m001
as it is a
location that install and documentation files may be downloaded to. It may be necessary to clean up space in the /root directory if this is the cause of pod evictions. The following commands can be used to determine if analysis of files under /root is needed to free up space.
ncn-mw# df -h /root
Filesystem Size Used Avail Use% Mounted on
LiveOS_rootfs 280G 245G 35G 88% /
ncn-mw# du -h -s /root/
225G /root/
ncn-mw# du -ah -B 1024M /root | sort -n -r | head -n 10
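Once space has been freed, any remaining Evicted pods can be listed and, after confirming they are safe to remove, deleted. A sketch using standard kubectl commands:
ncn-mw# kubectl get pods -A | grep Evicted
ncn-mw# kubectl get pods -A | awk '/Evicted/ {print "-n " $1 " " $2}' | xargs -r -L1 kubectl delete pod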
The cray-crus-
pod is expected to be in the Init
state until Slurm and munge
are installed. In particular, this will be the case when performing this validation after completing the Install CSM Services procedure.
If in doubt, validate the CRUS service using the CMS Validation Tool. If the CRUS check passes using that tool, do not worry about the cray-crus-
pod state.
The hms-discovery and cray-dns-unbound-manager cronjob pods may be in various transitional states such as Pending, Init, PodInitializing, NotReady, or Terminating. This is expected because these pods are started periodically and can often be caught in intermediate states.
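To view the current state of these cronjob pods directly (a sketch, assuming they run in the services namespace):
ncn-mw# kubectl get pods -n services | grep -E 'hms-discovery|cray-dns-unbound-manager'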
ncnPostgresHealthChecks
Postgres health check scripts can be found and run on any worker or master node (not on the PIT node), from any directory.
The ncnPostgresHealthChecks
script reports the following Postgres health information:
- postgresql resource status
- postgres pod status

Execute the ncnPostgresHealthChecks script and analyze the output of each individual check.
ncn# /opt/cray/platform-utils/ncnPostgresHealthChecks.sh
Check the STATUS
of the postgresql
resources which are managed by the operator:
NAMESPACE NAME TEAM VERSION PODS VOLUME CPU-REQUEST MEMORY-REQUEST AGE STATUS
services cray-sls-postgres cray-sls 11 3 1Gi 12d Running
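The same resource information can be viewed directly with kubectl (a sketch, assuming kubectl access from a master or worker node):
ncn-mw# kubectl get postgresql -A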
If any postgresql resource remains in a STATUS other than Running (such as SyncFailed), refer to Troubleshoot Postgres Database.
For a particular Postgres cluster, the expected output is similar to the following:
--- patronictl, version 1.6.5, list for services leader pod cray-sls-postgres-0 ---
+ Cluster: cray-sls-postgres (6938772644984361037) ---+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+---------------------+------------+--------+---------+----+-----------+
| cray-sls-postgres-0 | 10.47.0.35 | Leader | running | 1 | |
| cray-sls-postgres-1 | 10.36.0.33 | | running | 1 | 0 |
| cray-sls-postgres-2 | 10.44.0.42 | | running | 1 | 0 |
+---------------------+------------+--------+---------+----+-----------+
The points below cover the Member, Role, State, and Lag in MB columns of the table above.
For each Postgres cluster:
Verify that there are three cluster members (with the exception of sma-postgres-cluster
, where there should be only two cluster members).
If the number of cluster members is not correct, refer to Troubleshoot Postgres Database.
Verify that there is one cluster member with the Leader
Role
.
If there is no Leader
, refer to Troubleshoot Postgres Database.
Verify that the State
of each cluster member is running
.
If any cluster members are found not to be in running
state (such as start failed
), refer to
Troubleshoot Postgres Database.
Verify there is no large or growing lag.
If any cluster members are found to have lag or lag is unknown
, refer to Troubleshoot Postgres Database.
If all four of the above checks indicate that Postgres clusters are healthy, then the log output for the postgres
pods can be ignored. If possible health issues exist,
then re-check the health by re-running the ncnPostgresHealthChecks
script after waiting for 15 minutes. If health issues persist, then review the log output and consult
Troubleshoot Postgres Database. During NCN reboots, temporary errors related to re-election are common but should resolve
upon the re-check.
Check that all Kubernetes Postgres pods have a STATUS
of Running
.
ncn# kubectl get pods -A -o wide -l application=spilo
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
services cray-sls-postgres-0 3/3 Running 3 6d 10.38.0.102 ncn-w002 <none> <none>
services cray-sls-postgres-1 3/3 Running 3 5d20h 10.42.0.89 ncn-w001 <none> <none>
services cray-sls-postgres-2 3/3 Running 0 5d20h 10.36.0.31 ncn-w003 <none> <none>
If any Postgres pods have a STATUS other than Running, gather more information from the pod and refer to Troubleshoot Postgres Database.
ncn# kubectl describe pod <pod name> -n <pod namespace>
ncn# kubectl logs <pod name> -n <pod namespace> -c <pod container name>
Verify that Border Gateway Protocol (BGP) peering sessions are established for each worker node on the system.
Check the Border Gateway Protocol (BGP) status on the Aruba or Mellanox switches.
Verify that all sessions are in an Established
state. If the state of any
session in the table is Idle
, reset the BGP sessions.
On an NCN, determine the IP addresses of switches:
ncn# kubectl get cm config -n metallb-system -o yaml | head -12
Expected output looks similar to the following:
apiVersion: v1
data:
  config: |
    peers:
    - peer-address: 10.252.0.2
      peer-asn: 65533
      my-asn: 65533
    - peer-address: 10.252.0.3
      peer-asn: 65533
      my-asn: 65533
    address-pools:
    - name: customer-access
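Alternatively, just the peer addresses can be extracted (a sketch, assuming the ConfigMap layout shown above):
ncn# kubectl get cm config -n metallb-system -o jsonpath='{.data.config}' | grep 'peer-address:'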
Using the first peer-address
(10.252.0.2
here), log in using ssh
as the administrator to the switch and note in the returned output if Mellanox or Aruba is indicated.
ncn-m001# ssh admin@10.252.0.2
If Mellanox Onyx Switch Management or Mellanox Switch is displayed after logging in to the switch with ssh, proceed with the Mellanox steps below. If Please register your products now at: https://asp.arubanetworks.com is displayed, proceed with the Aruba steps below.

Mellanox steps:

Enable:
sw-spine-001# enable
Verify BGP is enabled:
sw-spine-001# show protocols | include bgp
Expected output looks similar to the following:
bgp: enabled
Check peering status:
sw-spine-001# show ip bgp summary
Expected output looks similar to the following:
VRF name : default
BGP router identifier : 10.252.0.2
local AS number : 65533
BGP table version : 3
Main routing table version: 3
IPV4 Prefixes : 59
IPV6 Prefixes : 0
L2VPN EVPN Prefixes : 0
------------------------------------------------------------------------------------------------------------------
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd
------------------------------------------------------------------------------------------------------------------
10.252.1.10 4 65533 2945 3365 3 0 0 1:00:21:33 ESTABLISHED/20
10.252.1.11 4 65533 2942 3356 3 0 0 1:00:20:49 ESTABLISHED/19
10.252.1.12 4 65533 2945 3363 3 0 0 1:00:21:33 ESTABLISHED/20
If one or more BGP sessions are reported in an Idle state, reset BGP to re-establish the sessions:
sw-spine-001# clear ip bgp all
Wait a minute or so, and then verify that all sessions are now reported as Established. If some sessions remain in an Idle state, re-run the clear ip bgp all command and check again. If sessions remain Idle, see Check BGP Status and Reset Sessions.

Repeat the above Mellanox procedure using the second peer-address (10.252.0.3 here).
On an Aruba switch, the prompt may include sw-spine
or sw-agg
.
Check BGP peering status.
sw-agg01# show bgp ipv4 unicast summary
Expected output looks similar to the following:
VRF : default
BGP Summary
-----------
Local AS : 65533 BGP Router Identifier : 10.252.0.4
Peers : 7 Log Neighbor Changes : No
Cfg. Hold Time : 180 Cfg. Keep Alive : 60
Confederation Id : 0
Neighbor Remote-AS MsgRcvd MsgSent Up/Down Time State AdminStatus
10.252.0.5 65533 19579 19588 20h:40m:30s Established Up
10.252.1.7 65533 34137 39074 20h:41m:53s Established Up
10.252.1.8 65533 34134 39036 20h:36m:44s Established Up
10.252.1.9 65533 34104 39072 00m:01w:04d Established Up
10.252.1.10 65533 34105 39029 00m:01w:04d Established Up
10.252.1.11 65533 34099 39042 00m:01w:04d Established Up
10.252.1.12 65533 34101 39012 00m:01w:04d Established Up
If one or more BGP sessions are reported in an Idle state, reset BGP to re-establish the sessions:
sw-agg01# clear bgp *
Wait a minute or so, and then verify that all sessions are now reported as Established. If some sessions remain in an Idle state, re-run the clear bgp * command and check again. If sessions remain Idle, see Check BGP Status and Reset Sessions.

Repeat the above Aruba procedure using the second peer-address (10.252.0.5 in this example).
Verify that KEA has active DHCP leases. After a fresh install of CSM, it is important to verify that KEA is currently handing out DHCP leases on the system. The following commands can be run on any of the master or worker nodes.
Get an API token:
ncn# export TOKEN=$(curl -s -S -d grant_type=client_credentials \
-d client_id=admin-client \
-d client_secret=`kubectl get secrets admin-client-auth \
-o jsonpath='{.data.client-secret}' | base64 -d` \
https://api-gw-service-nmn.local/keycloak/realms/shasta/protocol/openid-connect/token | jq -r '.access_token')
Retrieve all the leases currently in KEA:
ncn# curl -H "Authorization: Bearer ${TOKEN}" -X POST -H "Content-Type: application/json" \
-d '{ "command": "lease4-get-all", "service": [ "dhcp4" ] }' https://api-gw-service-nmn.local/apis/dhcp-kea | jq
If a non-zero number of DHCP leases for air-cooled hardware is returned, that is a good indication that KEA is working.
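To get a quick count of the leases returned, the same request can be piped through a jq filter. This is a sketch that assumes the standard Kea control API response format, where the lease list appears under .arguments.leases of the first array element:
ncn# curl -s -H "Authorization: Bearer ${TOKEN}" -X POST -H "Content-Type: application/json" \
     -d '{ "command": "lease4-get-all", "service": [ "dhcp4" ] }' \
     https://api-gw-service-nmn.local/apis/dhcp-kea | jq '.[0].arguments.leases | length'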
If unbound
is configured to resolve outside hostnames, then the following check should be performed. If this has not been done, then this check may be skipped.
Run the following on one of the master or worker nodes (not the PIT node):
ncn# nslookup cray.com ; echo "Exit code is $?"
Expected output looks similar to the following:
Server: 10.92.100.225
Address: 10.92.100.225#53
Non-authoritative answer:
Name: cray.com
Address: 52.36.131.229
Exit code is 0
Verify that the command has exit code zero, reports no errors, and resolves the address.
Execute the following command on all Kubernetes NCNs (all worker nodes and master nodes), excluding the PIT node:
ncn# goss -g /opt/cray/tests/install/ncn/tests/goss-spire-agent-service-running.yaml validate
Known failures and how to recover:
K8S Test: Verify spire-agent
is enabled and running
The spire-agent
service may fail to start on Kubernetes NCNs (all worker and master nodes). In this case, it may log errors
(using journalctl
) similar to join token does not exist or has already been used
, or the last log entries may contain multiple
instances of systemd[1]: spire-agent.service: Start request repeated too quickly.
. Deleting the request-ncn-join-token
daemonset
pod
running on the node may clear the issue. Even though the spire-agent
systemctl
service on the Kubernetes node should eventually
restart cleanly, the user may have to log in to the impacted nodes and restart the service. The following recovery procedure can
be run from any Kubernetes node in the cluster.
Set NODE
to the NCN which is experiencing the issue. In this example, ncn-w002
.
This command will not work on the PIT node.
ncn# export NODE=ncn-w002
Define the following function
ncn# function renewncnjoin() { for pod in $(kubectl get pods -n spire |grep request-ncn-join-token | awk '{print $1}'); do
if kubectl describe -n spire pods $pod | grep -q "Node:.*$1"; then echo "Restarting $pod running on $1"; kubectl delete -n spire pod "$pod"; fi
done }
Run the function as follows:
ncn# renewncnjoin $NODE
The spire-agent
service may also fail if an NCN was powered off for too long and its tokens expired. If this happens, delete /root/spire/agent_svid.der
,
/root/spire/bundle.der
, and /root/spire/data/svid.key
off the NCN before deleting the request-ncn-join-token
daemonset
pod.
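A sketch of that recovery, run from a Kubernetes NCN, assuming NODE is set as described above, passwordless SSH between NCNs, and the renewncnjoin function defined earlier:
ncn# ssh "$NODE" rm -f /root/spire/agent_svid.der /root/spire/bundle.der /root/spire/data/svid.key
ncn# renewncnjoin "$NODE"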
Execute the following commands on ncn-m002
:
ncn-m002# goss -g /opt/cray/tests/install/ncn/tests/goss-k8s-vault-cluster-health.yaml validate
Check the output to verify no failures are reported:
Count: 2, Failed: 0, Skipped: 0
There are multiple Goss test suites available that cover a variety of sub-systems.
Run the NCN health checks against the three different types of nodes with the following commands:
IMPORTANT: These tests should only be run while booted into the PIT node. Do not run these as part of upgrade testing. This includes the Kubernetes check in the next block.
IMPORTANT: It is possible that the first pass of running these tests may fail due to cloud-init
not being completed on the storage nodes. In this case please wait 5 minutes and re-run the tests.
pit# /opt/cray/tests/install/ncn/automated/ncn-healthcheck-master
pit# /opt/cray/tests/install/ncn/automated/ncn-healthcheck-worker
pit# /opt/cray/tests/install/ncn/automated/ncn-healthcheck-storage
And the Kubernetes test suite via:
pit# /opt/cray/tests/install/ncn/automated/ncn-kubernetes-checks
Known test issues:

Kubernetes Query BSS Cloud-init for ca-certs: This check may fail until the TrustedCerts operator has updated BSS with CA certificates.

Kubernetes Velero No Failed Backups:
Because of a known issue with Velero, a backup may be attempted immediately upon the deployment of a backup schedule (for example, Vault). It may be necessary to delete backups from a Kubernetes node to clear this situation. For example:
Find the failed backup.
ncn/pit# kubectl get backups -A -o json | jq -e '.items[] | select(.status.phase == "PartiallyFailed") | .metadata.name'
Delete the backup.
In the following command, replace <backup> with a backup returned in the previous step.

Note: This command will not work on the PIT node.
ncn# velero backup delete <backup> --confirm
If all designated prerequisites are met, the availability of system management health services may be validated by accessing the URLs listed in
Access System Management Health Services.
It is very important to check the Prerequisites
section for this topic.
If one or more of the URLs listed in the procedure are inaccessible, it does not necessarily mean that the system is not healthy. It may simply mean that not all of the prerequisites have been met to allow access to the system management health tools via URL.
Information to assist with troubleshooting some of the components mentioned in the prerequisites can be accessed here:
Execute the HMS smoke and functional tests after the CSM install to confirm that the Hardware Management Services are running and operational.
Note: Do not run HMS tests concurrently on multiple nodes. They may interfere with one another and cause false failures.
These tests should be executed as root on any worker or master NCN (but not the PIT node).
Run the HMS smoke tests.
ncn# /opt/cray/tests/ncn-resources/hms/hms-test/hms_run_ct_smoke_tests_ncn-resources.sh
Examine the output. If one or more failures occur, investigate the cause of each failure. See the Interpreting HMS Health Check Results documentation for more information.
If no failures occur, then run the HMS functional tests.
ncn# /opt/cray/tests/ncn-resources/hms/hms-test/hms_run_ct_functional_tests_ncn-resources.sh
Examine the output. If one or more failures occur, investigate the cause of each failure. See the Interpreting HMS Health Check Results documentation for more information.
NOTE: The Cray CLI must be configured in order to complete this task. See Configure the Cray Command Line Interface for details on how to do this.
By this point in the installation process, the Hardware State Manager (HSM) should have done its discovery of the system.
The foundational information for this discovery is from the System Layout Service (SLS). Thus, a comparison needs to be done to see that what is specified in SLS (focusing on BMC components and Redfish endpoints) are present in HSM.
Execute the hsm_discovery_verify.sh
script on a Kubernetes master or worker NCN:
ncn# /opt/cray/csm/scripts/hms_verification/hsm_discovery_verify.sh
The output will ideally appear as follows. If there are mismatches these will be displayed in the appropriate section of the output. Refer to 2.2.1 Interpreting results and 2.2.2 Known Issues below to troubleshoot any mismatched BMCs.
Fetching SLS Components...
Fetching HSM Components...
Fetching HSM Redfish endpoints...
=============== BMCs in SLS not in HSM components ===============
ALL OK
=============== BMCs in SLS not in HSM Redfish Endpoints ===============
ALL OK
Both sections BMCs in SLS not in HSM components
and BMCs in SLS not in HSM Redfish Endpoints
have the same format for mismatches between SLS and HSM. Each row starts with
the component name (xname) of the BMC. If the BMC does not have an associated MgmtSwitchConnector
in SLS, then # No mgmt port association
will be displayed alongside the BMC xname.
MgmtSwitchConnectors in SLS are used to represent the switch port on a leaf switch that is connected to the BMC of an air-cooled device.
=============== BMCs in SLS not in HSM components ===============
x3000c0s1b0 # No mgmt port association
For each of the BMCs that show up in either of mismatch lists use the following notes to determine if the issue with the BMC can be safely ignored, or if there is a legitimate issue with the BMC.
The node BMC of ncn-m001 will typically not be present in HSM component data, because it is usually connected to the site network instead of the HMN network.
The following can be used to determine the friendly name of the Node that the NodeBMC controls:
ncn# cray sls search hardware list --parent <NODE_BMC_XNAME> --format json | \
     jq '.[] | { Xname: .Xname, Aliases: .ExtraProperties.Aliases }' -c
Example mismatch for the BMC of ncn-m001
:
=============== BMCs in SLS not in HSM components ===============
x3000c0s1b0 # No mgmt port association
The node BMCs for HPE Apollo XL645D nodes may report as a mismatch depending on the state of the system when the hsm_discovery_verify.sh
script is run. If the system is
currently going through the process of installation, then this is an expected mismatch as the Prepare Compute Nodes procedure required
to configure the BMC of the HPE Apollo 6500 XL645D node may not have been completed yet.
Refer to Configure HPE Apollo 6500 XL645D Gen10 Plus Compute Nodes for additional required configuration for this type of BMC.
Example mismatch for the BMC of an HPE Apollo XL645D:
=============== BMCs in SLS not in HSM components ===============
x3000c0s30b1
=============== BMCs in SLS not in HSM Redfish Endpoints ===============
x3000c0s30b1
Chassis Management Controllers (CMCs) may show up as not being present in HSM. CMCs for Intel node blades can be ignored. A Gigabyte node blade CMC that is not found in HSM is not normal and should be investigated. If a Gigabyte CMC is expected to not be connected to the HMN network, then it can be ignored.
CMCs have component names (xnames) in the form of xXc0sSb999, where X is the cabinet and S is the rack U of the compute node chassis.
Example mismatch for a CMC of an Intel node blade:
=============== BMCs in SLS not in HSM components ===============
x3000c0s10b999 # No mgmt port association
=============== BMCs in SLS not in HSM Redfish Endpoints ===============
x3000c0s10b999 # No mgmt port association
Cabinet PDU Controllers have component names (xnames) in the form of xXmM
, where X
is the cabinet and M
is the ordinal of the Cabinet PDU Controller.
Example mismatch for a PDU:
=============== BMCs in SLS not in HSM components ===============
x3000m0
=============== BMCs in SLS not in HSM Redfish Endpoints ===============
x3000m0
If the PDU is accessible over the network, the following can be used to determine the vendor of the PDU.
ncn-m001# PDU=x3000m0
ncn-m001# curl -k -s --compressed https://$PDU -i | grep Server:
Example ServerTech PDU output:
Server: ServerTech-AWS/v8.0v
Example HPE PDU output:
Server: HPE/1.4.0
ServerTech PDUs may need passwords changed from their defaults to become functional. See Change Credentials on ServerTech PDUs.
HPE PDUs are not supported at this time and will likely show up as not being found in HSM. They can be ignored.
BMCs having no association with a management switch port will be annotated as such, and should be investigated. Exceptions to this are in Mountain or Hill configurations where Mountain BMCs will show this condition on SLS/HSM mismatches, which is normal.
In Hill configurations SLS assumes BMCs in chassis 1 and 3 are fully populated (32 Node BMCs), and in Mountain configurations SLS assumes all BMCs are fully populated (128 Node BMCs). Any non-populated BMCs will have no HSM data and will show up in the mismatch list.
If it was determined that the mismatch cannot be ignored, then proceed to the 2.2.2 Known Issues section below to troubleshoot any mismatched BMCs.
Known issues that may prevent hardware from getting discovered by Hardware State Manager:
- RedfishEndpoints in Hardware State Manager

The following test can be run on any Kubernetes node (any master or worker node, but not the PIT node).
ncn# /usr/local/bin/cmsdev test -q all
The cmsdev tool logs to /opt/cray/tests/cmsdev.log.

Interpreting cmsdev results:

If all checks pass, the last line of output begins with SUCCESS. For example:
SUCCESS: All 7 service tests passed: bos, cfs, conman, crus, ims, tftp, vcs

If one or more checks fail, the last line of output begins with FAILURE and will list which checks failed. For example:
FAILURE: 2 service tests FAILED (conman, ims), 5 passed (bos, cfs, crus, tftp, vcs)

A single service can be tested by replacing all in the cmsdev command line with the name of the service. For example: /usr/local/bin/cmsdev test -q cfs

Additional test execution details can be found in /opt/cray/tests/cmsdev.log.
Failed to create vcs organization
On a fresh install, it is possible that cmsdev
reports an error similar to the following:
ERROR (run tag zl7ak-vcs): POST https://api-gw-service-nmn.local/vcs/api/v1/orgs: expected status code 201, got 401
ERROR (run tag zl7ak-vcs): Failed to create vcs organization
In this case, follow the Gitea/VCS 401 Errors troubleshooting procedure.
On systems where too many BOS sessions exist, the cmsdev
test will hang when trying to list them. See
Hang Listing BOS Sessions for more information.
If a CFS component exists with a zero-length string for its id
field, then it may cause the cmsdev
CFS subtest to fail. The CFS subtest failure will resemble the following:
ERROR (run tag fhn3C-cfs): First list item has empty value for "id" field
For details on how to correct this problem, see CFS Component With Zero-Length ID.
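One way to check whether such a component exists (a sketch, assuming the Cray CLI is initialized and authenticated; a result greater than zero indicates a component with an empty id field):
ncn# cray cfs components list --format json | jq '[.[] | select(.id == "")] | length'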
CSM barebones image

Included with the Cray System Management (CSM) release is a pre-built node image that can be used to validate that core CSM services are available and responding as expected. The CSM barebones image contains only the minimal set of RPMs and configuration required to boot an image and is not suitable for production usage. To run production workloads, it is suggested that an image from the Cray OS (COS) product, or similar, be used.
Note: The barebones image is not expected to boot completely; it should reach the dracut stage of the boot process. If the dracut stage is reached, the boot can be considered successful, showing that the necessary CSM services needed to boot a node are up and available. The inability to boot the barebones image fully will be resolved in future releases of the CSM product.

Locate the CSM barebones image in IMS

Locate the CSM barebones image and note the etag and path fields in the output.
ncn# cray ims images list --format json | jq '.[] | select(.name | contains("barebones"))'
Expected output is similar to the following:
{
"created": "2021-01-14T03:15:55.146962+00:00",
"id": "293b1e9c-2bc4-4225-b235-147d1d611eef",
"link": {
"etag": "6d04c3a4546888ee740d7149eaecea68",
"path": "s3://boot-images/293b1e9c-2bc4-4225-b235-147d1d611eef/manifest.json",
"type": "s3"
},
"name": "cray-shasta-csm-sles15sp2-barebones.x86_64-shasta-1.5"
}
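The etag and path values can also be captured into shell variables for use in the next step. A sketch, assuming jq is available and exactly one barebones image is present (the variable names ETAG and MANIFEST_PATH are arbitrary):
ncn# ETAG=$(cray ims images list --format json | jq -r '[.[] | select(.name | contains("barebones"))][0].link.etag')
ncn# MANIFEST_PATH=$(cray ims images list --format json | jq -r '[.[] | select(.name | contains("barebones"))][0].link.path')
ncn# echo "$ETAG $MANIFEST_PATH"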
Create a BOS session template for the barebones image

The session template below can be copied and used as the basis for the BOS session template. As noted below, make sure the S3 path for the manifest matches the S3 path shown in the Image Management Service (IMS).
Create sessiontemplate.json
ncn# vi sessiontemplate.json
The session template should contain the following:
{
  "boot_sets": {
    "compute": {
      "boot_ordinal": 2,
      "etag": "etag_value_from_cray_ims_command",
      "kernel_parameters": "console=ttyS0,115200 bad_page=panic crashkernel=340M hugepagelist=2m-2g intel_iommu=off intel_pstate=disable iommu=pt ip=dhcp numa_interleave_omit=headless numa_zonelist_order=node oops=panic pageblock_order=14 pcie_ports=native printk.synchronous=y rd.neednet=1 rd.retry=10 rd.shell turbo_boost_limit=999 spire_join_token=${SPIRE_JOIN_TOKEN}",
      "network": "nmn",
      "node_roles_groups": [
        "Compute"
      ],
      "path": "path_value_from_cray_ims_command",
      "rootfs_provider": "cpss3",
      "rootfs_provider_passthrough": "dvs:api-gw-service-nmn.local:300:nmn0",
      "type": "s3"
    }
  },
  "cfs": {
    "configuration": "cos-integ-config-1.4.0"
  },
  "enable_cfs": false,
  "name": "shasta-1.5-csm-bare-bones-image"
}
NOTE: Be sure to replace the values of the etag
and path
fields with the ones you noted earlier in the cray ims images list
command.
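If the ETAG and MANIFEST_PATH variables from the earlier sketch are set, the substitution can be done with jq instead of editing the file by hand (again, a sketch under those assumptions):
ncn# jq --arg etag "$ETAG" --arg path "$MANIFEST_PATH" \
     '.boot_sets.compute.etag = $etag | .boot_sets.compute.path = $path' \
     sessiontemplate.json > sessiontemplate.json.new && mv sessiontemplate.json.new sessiontemplate.json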
Create the BOS session template using the file as input:
ncn# cray bos sessiontemplate create --file sessiontemplate.json --name shasta-1.5-csm-bare-bones-image
The expected output is:
/sessionTemplate/shasta-1.5-csm-bare-bones-image
List the compute nodes known to the Hardware State Manager (HSM) and choose one to use for the boot test:
ncn# cray hsm state components list --role Compute --enabled true --format toml
Example output:
[[Components]]
ID = "x3000c0s17b1n0"
Type = "Node"
State = "On"
Flag = "OK"
Enabled = true
Role = "Compute"
NID = 1
NetType = "Sling"
Arch = "X86"
Class = "River"
[[Components]]
ID = "x3000c0s17b2n0"
Type = "Node"
State = "On"
Flag = "OK"
Enabled = true
Role = "Compute"
NID = 2
NetType = "Sling"
Arch = "X86"
Class = "River"
If it is observed that expected compute nodes are missing from Hardware State Manager, then refer to Known issues with HSM discovery validation in order to troubleshoot any node BMCs that have not been discovered.
Choose a node from those listed and set XNAME
to its ID. In this example, x3000c0s17b2n0
:
ncn# export XNAME=x3000c0s17b2n0
Create a BOS session to reboot the chosen node using the BOS session template that was created:
ncn# cray bos session create --template-uuid shasta-1.5-csm-bare-bones-image --operation reboot --limit $XNAME --format toml
Expected output looks similar to the following:
limit = "x3000c0s17b2n0"
operation = "reboot"
templateUuid = "shasta-1.5-csm-bare-bones-image"
[[links]]
href = "/v1/session/8f2fc013-7817-4fe2-8e6f-c2136a5e3bd1"
jobId = "boa-8f2fc013-7817-4fe2-8e6f-c2136a5e3bd1"
rel = "session"
type = "GET"
[[links]]
href = "/v1/session/8f2fc013-7817-4fe2-8e6f-c2136a5e3bd1/status"
rel = "status"
type = "GET"
See Manage Node Consoles for information on how to connect to the node’s console (and for instructions on how to close it later).
The boot may take up to 10 or 15 minutes. The image being booted does not support a complete boot, so the node will not boot fully into an operating system. This test is merely to verify that the CSM services needed to boot a node are available and working properly.
This boot test is considered successful if the boot reaches the dracut
stage. You know this has happened if the console output has
something similar to the following somewhere within the final 20 lines of its output:
[ 7.876909] dracut: FATAL: Don't know how to handle 'root=craycps-s3:s3://boot-images/e3ba09d7-e3c2-4b80-9d86-0ee2c48c2214/rootfs:c77c0097bb6d488a5d1e4a2503969ac0-27:dvs:api-gw-service-nmn.local:300:nmn0'
[ 7.898169] dracut: Refusing to continue
NOTE: As long as the preceding text is found near the end of the console output, the test is considered successful. It is normal (and not indicative of a test failure) to see something similar to the following at the very end of the console output:
Starting Dracut Emergency Shell...
[ 11.591948] device-mapper: uevent: version 1.0.3
[ 11.596657] device-mapper: ioctl: 4.40.0-ioctl (2019-01-18) initialised: dm-devel@redhat.com
Warning: dracut: FATAL: Don't know how to handle
Press Enter for maintenance
(or press Control-D to continue):
After the node has reached this point, close the console session. The test is complete.
The procedures below use the CLI as an authorized user and run on two separate node types. The first part runs on the LiveCD node, while the second part runs on a non-LiveCD Kubernetes master or worker node. In either case, the CLI configuration needs to be initialized on the node and the user running the procedure needs to be authorized.
The following procedures run on separate nodes of the system. They are, therefore, separated into separate sub-sections.
This section can be run on any NCN or the PIT node.
Initialize the Cray CLI on the node where you are running this section. See Configure the Cray Command Line Interface for details on how to do this.
Show information about cray-uas-mgr
.
ncn# cray uas mgr-info list --format toml
Expected output looks similar to the following:
service_name = "cray-uas-mgr"
version = "1.11.5"
In this example output, it shows that UAS is installed and running the 1.11.5
version.
List UAIs on the system.
ncn# cray uas list --format toml
Expected output looks similar to the following:
results = []
This example output shows that there are no currently running UAIs. It is possible, if someone else has been using the UAS, that there could be UAIs in the list. That is acceptable too from a validation standpoint.
Verify that the pre-made UAI images are registered with UAS.
ncn# cray uas images list --format toml
Expected output looks similar to the following:
default_image = "registry.local/cray/cray-uai-sles15sp2:1.0.11"
image_list = [ "registry.local/cray/cray-uai-sles15sp2:1.0.11",]
This example output shows that the pre-made end-user UAI image (cray/cray-uai-sles15sp2:1.0.11
) is registered with UAS. This does not necessarily mean this image is
installed in the container image registry, but it is configured for use. If other UAI images have been created and registered, they may also show up here, which is acceptable.
IMPORTANT: If the site does not use UAIs, skip UAS and UAI validation. If UAIs are used, there are products that configure UAS like Cray Analytics and Cray Programming Environment that must be working correctly with UAIs, and should be validated (the procedures for this are beyond the scope of this document) prior to validating UAS and UAI. Failures in UAI creation that result from incorrect or incomplete installation of these products will generally take the form of UAIs stuck in waiting state trying to set up volume mounts. See the UAI Troubleshooting section for more information.
This procedure must run on a master or worker node (not the PIT node).
Initialize the Cray CLI on the node where you are running this section. See Configure the Cray Command Line Interface for details on how to do this.
Verify that a UAI can be created:
ncn# cray uas create --publickey ~/.ssh/id_rsa.pub --format toml
Expected output looks similar to the following:
uai_connect_string = "ssh vers@10.16.234.10"
uai_host = "ncn-w001"
uai_img = "registry.local/cray/cray-uai-sles15sp2:1.0.11"
uai_ip = "10.16.234.10"
uai_msg = ""
uai_name = "uai-vers-a00fb46b"
uai_status = "Pending"
username = "vers"
[uai_portmap]
This has created the UAI and the UAI is currently in the process of initializing and running. The uai_status
in
the output from this command may instead be Waiting
, which is also acceptable.
Set UAINAME
to the value of the uai_name
field in the previous command output (uai-vers-a00fb46b
in our example):
ncn# UAINAME=uai-vers-a00fb46b
Check the current status of the UAI:
ncn# cray uas list --format toml
Expected output looks similar to the following:
[[results]]
uai_age = "0m"
uai_connect_string = "ssh vers@10.16.234.10"
uai_host = "ncn-w001"
uai_img = "registry.local/cray/cray-uai-sles15sp2:1.0.11"
uai_ip = "10.16.234.10"
uai_msg = ""
uai_name = "uai-vers-a00fb46b"
uai_status = "Running: Ready"
username = "vers"
If the uai_status
field is Running: Ready
, proceed to the next step. Otherwise, wait and repeat this command until that is the case. It normally should not take more than a minute or two.
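If preferred, this can be polled in a simple loop until the UAI reports Running: Ready (a minimal sketch using the same command):
ncn# while ! cray uas list --format toml | grep -q 'Running: Ready'; do sleep 10; done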
The UAI is ready for use. Log into it with the command in the uai_connect_string
field in the previous command output:
ncn# ssh vers@10.16.234.10
vers@uai-vers-a00fb46b-6889b666db-4dfvn:~>
Run a command on the UAI:
vers@uai-vers-a00fb46b-6889b666db-4dfvn:~> ps -afe
Expected output looks similar to the following:
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 18:51 ? 00:00:00 /bin/bash /usr/bin/uai-ssh.sh
munge 36 1 0 18:51 ? 00:00:00 /usr/sbin/munged
root 54 1 0 18:51 ? 00:00:00 su vers -c /usr/sbin/sshd -e -f /etc/uas/ssh/sshd_config -D
vers 55 54 0 18:51 ? 00:00:00 /usr/sbin/sshd -e -f /etc/uas/ssh/sshd_config -D
vers 62 55 0 18:51 ? 00:00:00 sshd: vers [priv]
vers 67 62 0 18:51 ? 00:00:00 sshd: vers@pts/0
vers 68 67 0 18:51 pts/0 00:00:00 -bash
vers 120 68 0 18:52 pts/0 00:00:00 ps -afe
Log out from the UAI
vers@uai-vers-a00fb46b-6889b666db-4dfvn:~> exit
ncn#
Clean up the UAI.
ncn# cray uas delete --uai-list $UAINAME --format toml
Expected output looks similar to the following:
results = [ "Successfully deleted uai-vers-a00fb46b",]
If the commands ran with similar results, then the basic functionality of the UAS and UAI is working.
The following subsections include common failure modes seen with UAS / UAI operations and how to resolve them.
An error will be returned when running CLI commands if the user is not logged in as a valid Keycloak user or is accidentally using the CRAY_CREDENTIALS
environment variable.
This variable is set regardless of the user credentials being used.
For example:
ncn# cray uas list
The symptom of this problem is output similar to the following:
Usage: cray uas list [OPTIONS]
Try 'cray uas list --help' for help.
Error: Bad Request: Token not valid for UAS. Attributes missing: ['gidNumber', 'loginShell', 'homeDirectory', 'uidNumber', 'name']
Fix this by logging in as a real user (someone with actual Linux credentials) and making sure that CRAY_CREDENTIALS
is unset.
When running CLI commands, a Keycloak error may be returned.
For example:
ncn# cray uas list
The symptom of this problem is output similar to the following:
Usage: cray uas list [OPTIONS]
Try 'cray uas list --help' for help.
Error: Internal Server Error: An error was encountered while accessing Keycloak
If the wrong hostname was used to reach the API gateway, re-run the CLI initialization steps above and try again to check that. There may also be a problem with the Istio
service mesh inside of the system. Troubleshooting this is beyond the scope of this section, but there may be useful information in the UAS pod logs in Kubernetes. There are
generally two UAS pods, so the user may need to look at logs from both to find the specific failure. The logs tend to have a very large number of GET
events listed as part
of the liveness checking.
The following shows an example of looking at UAS logs effectively (this example shows only one UAS manager, normally there would be two):
Determine the pod name of the uas-mgr
pod.
ncn-mw# kubectl get po -n services | grep "^cray-uas-mgr" | grep -v etcd
Expected output looks similar to:
cray-uas-mgr-6bbd584ccb-zg8vx 2/2 Running 0 12d
Set PODNAME
to the name of the manager pod whose logs are going to be viewed.
ncn-mw# export PODNAME=cray-uas-mgr-6bbd584ccb-zg8vx
View the last 25 log entries of the cray-uas-mgr
container in that pod, excluding GET
events:
ncn-mw# kubectl logs -n services $PODNAME cray-uas-mgr | grep -v 'GET ' | tail -25
Example output:
2021-02-08 15:32:41,211 - uas_mgr - INFO - getting deployment uai-vers-87a0ff6e in namespace user
2021-02-08 15:32:41,225 - uas_mgr - INFO - creating deployment uai-vers-87a0ff6e in namespace user
2021-02-08 15:32:41,241 - uas_mgr - INFO - creating the UAI service uai-vers-87a0ff6e-ssh
2021-02-08 15:32:41,241 - uas_mgr - INFO - getting service uai-vers-87a0ff6e-ssh in namespace user
2021-02-08 15:32:41,252 - uas_mgr - INFO - creating service uai-vers-87a0ff6e-ssh in namespace user
2021-02-08 15:32:41,267 - uas_mgr - INFO - getting pod info uai-vers-87a0ff6e
2021-02-08 15:32:41,360 - uas_mgr - INFO - No start time provided from pod
2021-02-08 15:32:41,361 - uas_mgr - INFO - getting service info for uai-vers-87a0ff6e-ssh in namespace user
127.0.0.1 - - [08/Feb/2021 15:32:41] "POST /v1/uas?imagename=registry.local%2Fcray%2Fno-image-registered%3A1.0.11 HTTP/1.1" 200 -
2021-02-08 15:32:54,455 - uas_auth - INFO - UasAuth lookup complete for user vers
2021-02-08 15:32:54,455 - uas_mgr - INFO - UAS request for: vers
2021-02-08 15:32:54,455 - uas_mgr - INFO - listing deployments matching: host None, labels uas=managed,user=vers
2021-02-08 15:32:54,484 - uas_mgr - INFO - getting pod info uai-vers-87a0ff6e
2021-02-08 15:32:54,596 - uas_mgr - INFO - getting service info for uai-vers-87a0ff6e-ssh in namespace user
2021-02-08 15:40:25,053 - uas_auth - INFO - UasAuth lookup complete for user vers
2021-02-08 15:40:25,054 - uas_mgr - INFO - UAS request for: vers
2021-02-08 15:40:25,054 - uas_mgr - INFO - listing deployments matching: host None, labels uas=managed,user=vers
2021-02-08 15:40:25,085 - uas_mgr - INFO - getting pod info uai-vers-87a0ff6e
2021-02-08 15:40:25,212 - uas_mgr - INFO - getting service info for uai-vers-87a0ff6e-ssh in namespace user
2021-02-08 15:40:51,210 - uas_auth - INFO - UasAuth lookup complete for user vers
2021-02-08 15:40:51,210 - uas_mgr - INFO - UAS request for: vers
2021-02-08 15:40:51,210 - uas_mgr - INFO - listing deployments matching: host None, labels uas=managed,user=vers
2021-02-08 15:40:51,261 - uas_mgr - INFO - deleting service uai-vers-87a0ff6e-ssh in namespace user
2021-02-08 15:40:51,291 - uas_mgr - INFO - delete deployment uai-vers-87a0ff6e in namespace user
127.0.0.1 - - [08/Feb/2021 15:40:51] "DELETE /v1/uas?uai_list=uai-vers-87a0ff6e HTTP/1.1" 200 -
When listing or describing a UAI, an error in the uai_msg
field may be returned. For example:
ncn# cray uas list --format toml
There may be something similar to the following output:
[[results]]
uai_age = "0m"
uai_connect_string = "ssh vers@10.103.13.172"
uai_host = "ncn-w001"
uai_img = "registry.local/cray/cray-uai-sles15sp2:1.0.11"
uai_ip = "10.103.13.172"
uai_msg = "ErrImagePull"
uai_name = "uai-vers-87a0ff6e"
uai_status = "Waiting"
username = "vers"
This means the pre-made end-user UAI image is not in the local registry (or whatever registry it is being pulled from; see the uai_img
value for details). To correct
this, locate and push/import the image to the registry.
Various packages install volumes in the UAS configuration. All of those volumes must also have the underlying resources available, sometimes on the host node where the UAI is running and sometimes from within Kubernetes. If a UAI gets stuck with a ContainerCreating uai_msg field for an extended time, this is a likely cause. UAIs run in the user Kubernetes namespace, and are pods that can be examined using kubectl describe.
Locate the pod.
ncn-mw# kubectl get po -n user | grep <uai-name>
Investigate the problem using the pod name from the previous step.
ncn-mw# kubectl describe pod -n user <pod-name>
If volumes are missing they will show up in the Events:
section of the output. Other problems may show up there as well. The names of the missing volumes or other issues
should indicate what needs to be fixed to make the UAI run.