The CSM barebones image boot test verifies that the CSM services needed to boot a node are available and working properly. This test is very important to run, particularly during the CSM install prior to rebooting the PIT node, because it validates all of the services required for nodes to PXE boot from the cluster.
This page gives some information about the CSM barebones image, describes how the barebonesImageTest script works, explains how to interpret the results of the script, and provides a procedure to manually perform the test, if needed.
The CSM barebones image is a pre-built node image included with the CSM release. The CSM barebones image contains only the minimal set of RPMs and configuration required to boot a node, and is not suitable for production usage. To run production workloads, it is suggested that an image from the Cray OS (COS) product, or similar, be used.
The CSM barebones image included with the release will not successfully complete a boot beyond the dracut stage of the boot process. However, if the dracut stage is reached, then the boot can be considered successful, because this demonstrates that the CSM services needed to boot a node are up and available.
In addition to the CSM barebones image, the release also includes an IMS recipe that can be used to build the CSM barebones image. However, the CSM barebones recipe currently requires RPMs that are not installed with the CSM product. The CSM barebones recipe can be built after the COS product stream is installed on the system.
The barebonesImageTest script is provided by the cray-cmstools-crayctldeploy RPM. The script file location is /opt/cray/tests/integration/csm/barebonesImageTest.
Review the Test prerequisites before proceeding.
This script automates the following steps: it locates the CSM barebones image in IMS, creates a BOS session template for that image, initiates a BOS session to reboot the target compute node, and then monitors the node's console to verify that the boot reaches the dracut stage. If the script fails, then investigate the underlying service to ensure that it is operating correctly; examine the detailed test log file to find information on the exact error and cause of failure.
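(ncn-mw#) For example, to run the test with its default settings:
/opt/cray/tests/integration/csm/barebonesImageTest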
The boot may take up to 10 or 15 minutes. The image being booted does not support a complete boot, so the node will not boot fully into an operating system. This test is merely to verify that the CSM services needed to boot a node are available and working properly. This boot test is considered successful if the boot reaches the dracut stage.
By default, the script will list all enabled compute nodes in HSM and use the first one
as the target for the test. This may be overridden by using the --xname
command line argument
to specify the component name (xname) of the target compute node. The target
compute node must be enabled and present in HSM. If the specified compute node is
not available, then the test will fail with an appropriate error message.
(ncn-mw#) An example of specifying the target node:
/opt/cray/tests/integration/csm/barebonesImageTest --xname x3000c0s10b1n0
Troubleshooting: If any compute nodes are missing from the HSM database, then refer to 2.2.2 Known issues with HSM discovery validation in order to troubleshoot any node BMCs that have not been discovered.
By default, the script will list all IMS images with barebones
in their names, and use the first
one as the boot image for the test. This may be overridden using the --id
command line argument
to specify the ID of the desired IMS image. If the specified image is not found, then the test
will fail with an appropriate error message.
The most common reason that this option may be needed is if some other IMS image has barebones in its name, and the test is choosing it instead of the regular CSM barebones image.
(ncn-mw#) An example of specifying the image for the test:
/opt/cray/tests/integration/csm/barebonesImageTest --id 0eacdcaa-74ad-40d6-b2b3-801f244ef868
(ncn-mw#) Available IMS images on the system can be listed using the Cray Command Line Interface (CLI) with the following command:
cray ims images list --format json
For help configuring the Cray CLI, see Configure the Cray CLI.
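(ncn-mw#) To narrow the listing to only the candidate barebones images, a jq filter similar to the following can be used (the exact filter shown here is only an illustration, not part of the test itself):
cray ims images list --format json | jq '.[] | select(.name | contains("barebones")) | {id, name}'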
Output is directed both to the console calling the script and to a log file that holds more detailed information on the run and any potential problems found. The log file is written to /tmp/cray.barebones-boot-test.log and will overwrite any existing file at that location on each new run of the script.
The messages output to the console and the log file may be controlled separately through environment variables. To control the information being sent to the console, set the variable CONSOLE_LOG_LEVEL. To control the information being sent to the log file, set the variable FILE_LOG_LEVEL. Valid values in increasing levels of detail are: CRITICAL, ERROR, WARNING, INFO, DEBUG. The default for the console output is INFO and the default for the log file is DEBUG.
(ncn-mw#) Here is an example of running the script with more information displayed on the console during the execution of the test:
CONSOLE_LOG_LEVEL=DEBUG /opt/cray/tests/integration/csm/barebonesImageTest
Example output excerpt:
cray.barebones-boot-test: INFO barebones image boot test starting
cray.barebones-boot-test: INFO For complete logs look in the file /tmp/cray.barebones-boot-test.log
cray.barebones-boot-test: DEBUG Found boot image: cray-shasta-csm-sles15sp2-barebones.x86_64-shasta-1.5
cray.barebones-boot-test: DEBUG Creating bos session template with etag:bc390772fbe67107cd58b3c7c08ed92d, path:s3://boot-images/e360fae1-7926-4dee-85bb-f2b4eb216d9c/manifest.json
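(ncn-mw#) The log file verbosity can be adjusted in the same way. For example, to limit the log file to INFO-level messages while keeping the default console output:
FILE_LOG_LEVEL=INFO /opt/cray/tests/integration/csm/barebonesImageTest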
The following manual steps may be performed to reproduce the actions of this script. Review the Test prerequisites before beginning.
(ncn-mw#) Locate the CSM barebones image and note the etag and path fields in the output.
cray ims images list --format json | jq '.[] | select(.name | contains("barebones"))'
Expected output is similar to the following:
{
  "created": "2021-01-14T03:15:55.146962+00:00",
  "id": "293b1e9c-2bc4-4225-b235-147d1d611eef",
  "link": {
    "etag": "6d04c3a4546888ee740d7149eaecea68",
    "path": "s3://boot-images/293b1e9c-2bc4-4225-b235-147d1d611eef/manifest.json",
    "type": "s3"
  },
  "name": "cray-shasta-csm-sles15sp1-barebones.x86_64-shasta-1.4"
}
The session template below can be copied and used as the basis for the BOS session template. As noted below, make sure the S3 path for the manifest matches the S3 path shown in the Image Management Service (IMS).
Create the sessiontemplate.json file.
vi sessiontemplate.json
The session template should contain the following:
{
  "boot_sets": {
    "compute": {
      "boot_ordinal": 2,
      "etag": "etag_value_from_cray_ims_command",
      "kernel_parameters": "console=ttyS0,115200 bad_page=panic crashkernel=340M hugepagelist=2m-2g intel_iommu=off intel_pstate=disable iommu=pt ip=dhcp numa_interleave_omit=headless numa_zonelist_order=node oops=panic pageblock_order=14 pcie_ports=native printk.synchronous=y rd.neednet=1 rd.retry=10 rd.shell turbo_boost_limit=999 spire_join_token=${SPIRE_JOIN_TOKEN}",
      "network": "nmn",
      "node_roles_groups": [
        "Compute"
      ],
      "path": "path_value_from_cray_ims_command",
      "rootfs_provider": "cpss3",
      "rootfs_provider_passthrough": "dvs:api-gw-service-nmn.local:300:nmn0",
      "type": "s3"
    }
  },
  "cfs": {
    "configuration": "none"
  },
  "enable_cfs": false,
  "name": "shasta-csm-bare-bones-image"
}
NOTE: Be sure to replace the values of the etag and path fields with the ones noted earlier in the cray ims images list command.
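(ncn-mw#) Optionally, the two placeholder values can be filled in automatically. The following is only a sketch, assuming jq is available, that the first IMS image whose name contains barebones is the desired one, and that the variable names shown are used purely for illustration:
IMS_IMAGE=$(cray ims images list --format json | jq '[.[] | select(.name | contains("barebones"))][0]')
ETAG=$(echo "${IMS_IMAGE}" | jq -r '.link.etag')
S3PATH=$(echo "${IMS_IMAGE}" | jq -r '.link.path')
sed -i "s|etag_value_from_cray_ims_command|${ETAG}|;s|path_value_from_cray_ims_command|${S3PATH}|" sessiontemplate.json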
Create the BOS session template, using the sessiontemplate.json file as input:
cray bos v1 sessiontemplate create --file sessiontemplate.json --name shasta-csm-bare-bones-image
The expected output is:
/sessionTemplate/shasta-csm-bare-bones-image
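(ncn-mw#) If desired, the new session template can be reviewed to confirm that the etag and path values were set correctly (assuming the describe subcommand is available in this release of the Cray CLI):
cray bos v1 sessiontemplate describe shasta-csm-bare-bones-image --format json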
(ncn-mw#) List the compute nodes managed by HSM.
cray hsm state components list --role Compute --enabled true --format toml
Example output:
[[Components]]
ID = "x3000c0s17b1n0"
Type = "Node"
State = "On"
Flag = "OK"
Enabled = true
Role = "Compute"
NID = 1
NetType = "Sling"
Arch = "X86"
Class = "River"
[[Components]]
ID = "x3000c0s17b2n0"
Type = "Node"
State = "On"
Flag = "OK"
Enabled = true
Role = "Compute"
NID = 2
NetType = "Sling"
Arch = "X86"
Class = "River"
Troubleshooting: If any compute nodes are missing from the HSM database, then refer to 2.2.2 Known issues with HSM discovery validation in order to troubleshoot any node BMCs that have not been discovered.
(ncn-mw#) Choose a node.
Choose a node from those listed and set XNAME to its component name (xname). In this example, x3000c0s17b2n0 is used.
XNAME=x3000c0s17b2n0
(ncn-mw#) Create a BOS session to reboot the chosen node using the BOS session template that was just created.
cray bos v1 session create --template-name shasta-csm-bare-bones-image --operation reboot --limit "${XNAME}" --format toml
Expected output looks similar to the following:
limit = "x3000c0s17b2n0"
operation = "reboot"
templateName = "shasta-csm-bare-bones-image"
[[links]]
href = "/v1/session/8f2fc013-7817-4fe2-8e6f-c2136a5e3bd1"
jobId = "boa-8f2fc013-7817-4fe2-8e6f-c2136a5e3bd1"
rel = "session"
type = "GET"
[[links]]
href = "/v1/session/8f2fc013-7817-4fe2-8e6f-c2136a5e3bd1/status"
rel = "status"
type = "GET"
The boot may take up to 10 or 15 minutes. The image being booted does not support a complete boot, so the node will not boot fully into an operating system. This test is merely to verify that the CSM services needed to boot a node are available and working properly.
Connect to the node’s console.
See Manage Node Consoles for information on how to connect to the node’s console (and for instructions on how to close it later).
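(ncn-mw#) As a minimal sketch only, assuming the console services are deployed as described in Manage Node Consoles and that the cray-console-node-0 pod happens to be the one monitoring this node (that page explains how to identify the correct pod), the connection looks similar to the following:
kubectl -n services exec -it cray-console-node-0 -- conman -j "${XNAME}"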
Monitor the boot.
This boot test is considered successful if the boot reaches the dracut stage. The indication that this has happened is that the console output has something similar to the following somewhere within the final 20 lines of its output:
[ 7.876909] dracut: FATAL: Don't know how to handle 'root=craycps-s3:s3://boot-images/e3ba09d7-e3c2-4b80-9d86-0ee2c48c2214/rootfs:c77c0097bb6d488a5d1e4a2503969ac0-27:dvs:api-gw-service-nmn.local:300:nmn0'
[ 7.898169] dracut: Refusing to continue
NOTE: As long as the preceding text is found near the end of the console output, then the test is considered successful. It is normal (and not indicative of a test failure) to see something similar to the following at the very end of the console output:
Starting Dracut Emergency Shell...
[ 11.591948] device-mapper: uevent: version 1.0.3
[ 11.596657] device-mapper: ioctl: 4.40.0-ioctl (2019-01-18) initialised: dm-devel@redhat.com
Warning: dracut: FATAL: Don't know how to handle
Press Enter for maintenance
(or press Control-D to continue):
Exit the console.
Do this by typing &. (an ampersand followed by a period).
The test is complete.