Troubleshoot the CSM Barebones Image Boot Test

The CSM barebones image boot test verifies that the CSM services needed to boot a node are available and working properly. This test is very important to run, particularly during the CSM install prior to rebooting the PIT node, because it validates all of the services required for nodes to PXE boot from the cluster.

This page gives some information about the CSM barebones image, describes how the barebonesImageTest script works, explains how to interpret the results of the script, and provides a procedure to manually perform the test, if needed.

Notes on the CSM barebones image

The CSM barebones image is a pre-built node image included with the CSM release. The CSM barebones image contains only the minimal set of RPMs and configuration required to boot a node, and is not suitable for production use. To run production workloads, it is suggested that an image from the Cray OS (COS) product, or similar, be used.

The CSM barebones image included with the release will not successfully complete a boot beyond the dracut stage of the boot process. However, if the dracut stage is reached, then the boot can be considered successful, because this demonstrates that the necessary CSM services needed to boot a node are up and available.

In addition to the CSM barebones image, the release also includes an IMS recipe that can be used to build the CSM barebones image. However, the CSM barebones recipe currently requires RPMs that are not installed with the CSM product. The CSM barebones recipe can be built after the COS product stream is installed on the system.

Test prerequisites

  • This test can be run on any master or worker NCN, but not the PIT node.
  • The test script uses the Kubernetes API gateway to access CSM services. The gateway must be properly configured to allow an access token to be generated by the script.
  • The manual procedure uses the Cray CLI. The CLI must be configured and authenticated on the node where the manual steps are being performed.
  • The test script is installed as part of the cray-cmstools-crayctldeploy RPM.

Test script

The script file location is /opt/cray/tests/integration/csm/barebonesImageTest. Review the Test prerequisites before proceeding.

Steps the script performs

This script automates the following steps:

  1. Obtain the Kubernetes API gateway access token.
  2. Find the existing barebones boot image using IMS.
  3. Create a BOS session template for the barebones boot image.
  4. Find an enabled compute node using HSM.
  5. Watch the console log for the target compute node using console services.
  6. Create a BOS session to reboot the target compute node.
  7. Wait for the console output to show an error or successfully reach dracut.
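
The sequence above can be sketched as a short Python flow. Every function below is an illustrative stub standing in for one of the steps; none of these names are the real script's internals:

```python
# Illustrative sketch of the barebonesImageTest flow. Each function is a
# stand-in stub for one step of the script, not its actual implementation.

def get_access_token():
    return "token"                                    # step 1: API gateway token

def find_barebones_image():
    return {"id": "img-1", "name": "csm-barebones"}   # step 2: IMS lookup

def create_session_template(image):
    return f"template-for-{image['id']}"              # step 3: BOS session template

def find_enabled_compute_node():
    return "x3000c0s17b1n0"                           # step 4: HSM query

def boot_and_watch(template, xname):
    # steps 5-7: watch the console, create a BOS session to reboot the
    # node, then wait until dracut is reached (success) or an error appears
    return "dracut reached"

def run_test():
    get_access_token()
    image = find_barebones_image()
    template = create_session_template(image)
    node = find_enabled_compute_node()
    return boot_and_watch(template, node)

print(run_test())
```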

If the script fails, then examine the detailed test log file for information on the exact error and cause of the failure, and investigate the underlying service to ensure that it is operating correctly.

The boot may take up to 10 or 15 minutes. The image being booted does not support a complete boot, so the node will not boot fully into an operating system. This test is merely to verify that the CSM services needed to boot a node are available and working properly. This boot test is considered successful if the boot reaches the dracut stage.

Controlling which node is used

By default, the script will list all enabled compute nodes in HSM and use the first one as the target for the test. This may be overridden by using the --xname command line argument to specify the component name (xname) of the target compute node. The target compute node must be enabled and present in HSM. If the specified compute node is not available, then the test will fail with an appropriate error message.

(ncn-mw#) An example of specifying the target node:

/opt/cray/tests/integration/csm/barebonesImageTest --xname x3000c0s10b1n0

Troubleshooting: If any compute nodes are missing from the HSM database, then refer to 2.2.2 Known issues with HSM discovery validation in order to troubleshoot any node BMCs that have not been discovered.
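
The default target-node selection described above can be sketched as follows. This is a minimal illustration, assuming HSM components are available as a list of dicts shaped like the output of cray hsm state components list; the function name is illustrative, not the script's:

```python
# Sketch of the default target-node selection: use an explicit --xname if
# given, otherwise the first enabled compute node listed in HSM.

def pick_target_node(components, xname=None):
    """Return the xname to boot, or raise if no suitable node exists."""
    enabled = [c["ID"] for c in components
               if c.get("Role") == "Compute" and c.get("Enabled")]
    if xname is not None:
        if xname not in enabled:
            raise ValueError(f"{xname} is not an enabled compute node in HSM")
        return xname
    if not enabled:
        raise ValueError("no enabled compute nodes found in HSM")
    return enabled[0]

components = [
    {"ID": "x3000c0s17b1n0", "Role": "Compute", "Enabled": True},
    {"ID": "x3000c0s17b2n0", "Role": "Compute", "Enabled": True},
]
print(pick_target_node(components))                    # default: first enabled node
print(pick_target_node(components, "x3000c0s17b2n0"))  # --xname override
```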

Controlling which image is used

By default, the script will list all IMS images with barebones in their names, and use the first one as the boot image for the test. This may be overridden using the --id command line argument to specify the ID of the desired IMS image. If the specified image is not found, then the test will fail with an appropriate error message.

The most common reason that this option may be needed is if some other IMS image has barebones in its name, and the test is choosing it instead of the regular CSM barebones image.

(ncn-mw#) An example of specifying the image for the test:

/opt/cray/tests/integration/csm/barebonesImageTest --id 0eacdcaa-74ad-40d6-b2b3-801f244ef868

(ncn-mw#) Available IMS images on the system can be listed using the Cray Command Line Interface (CLI) with the following command:

cray ims images list --format json

For help configuring the Cray CLI, see Configure the Cray CLI.
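
The default image selection can be sketched the same way. A minimal illustration, assuming image dicts shaped like cray ims images list output; the function name and IDs are hypothetical:

```python
# Sketch of the default boot-image selection: keep IMS images whose names
# contain "barebones" and use the first, unless --id overrides the choice.

def pick_boot_image(images, image_id=None):
    if image_id is not None:
        for img in images:
            if img["id"] == image_id:
                return img
        raise ValueError(f"IMS image {image_id} not found")
    matches = [img for img in images if "barebones" in img["name"]]
    if not matches:
        raise ValueError("no barebones image found in IMS")
    return matches[0]

images = [
    {"id": "aaa", "name": "my-custom-barebones-copy"},
    {"id": "bbb", "name": "cray-shasta-csm-sles15sp2-barebones.x86_64-shasta-1.5"},
]
# The default pick may land on another image with "barebones" in its name...
print(pick_boot_image(images)["id"])          # aaa
# ...which is exactly the case where --id is needed:
print(pick_boot_image(images, "bbb")["id"])   # bbb
```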

Controlling test script output level

Output is directed both to the console calling the script and to a log file that holds more detailed information on the run and any potential problems found. The log file is written to /tmp/cray.barebones-boot-test.log and overwrites any existing file at that location on each new run of the script.

The messages output to the console and the log file may be controlled separately through environment variables. To control the information being sent to the console, set the variable CONSOLE_LOG_LEVEL. To control the information being sent to the log file, set the variable FILE_LOG_LEVEL. Valid values in increasing levels of detail are: CRITICAL, ERROR, WARNING, INFO, DEBUG. The default for the console output is INFO and the default for the log file is DEBUG.

(ncn-mw#) Here is an example of running the script with more information displayed on the console during the execution of the test:

CONSOLE_LOG_LEVEL=DEBUG /opt/cray/tests/integration/csm/barebonesImageTest

Example output excerpt:

cray.barebones-boot-test: INFO     barebones image boot test starting
cray.barebones-boot-test: INFO       For complete logs look in the file /tmp/cray.barebones-boot-test.log
cray.barebones-boot-test: DEBUG    Found boot image: cray-shasta-csm-sles15sp2-barebones.x86_64-shasta-1.5
cray.barebones-boot-test: DEBUG    Creating bos session template with etag:bc390772fbe67107cd58b3c7c08ed92d, path:s3://boot-images/e360fae1-7926-4dee-85bb-f2b4eb216d9c/manifest.json
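
This split between console and file verbosity can be sketched with Python's standard logging module. A minimal sketch, assuming one logger with two handlers whose levels are read from the same environment variables; this is not the script's actual implementation:

```python
# Sketch of per-handler log levels driven by CONSOLE_LOG_LEVEL and
# FILE_LOG_LEVEL, with the documented defaults of INFO and DEBUG.
import logging
import os

log = logging.getLogger("cray.barebones-boot-test")
log.setLevel(logging.DEBUG)  # pass everything; each handler filters itself

console = logging.StreamHandler()
console.setLevel(os.environ.get("CONSOLE_LOG_LEVEL", "INFO"))

# mode="w" mirrors the documented overwrite-on-each-run behavior
logfile = logging.FileHandler("/tmp/cray.barebones-boot-test.log", mode="w")
logfile.setLevel(os.environ.get("FILE_LOG_LEVEL", "DEBUG"))

log.addHandler(console)
log.addHandler(logfile)

log.info("barebones image boot test starting")    # console and file
log.debug("detail only written to the log file")  # file only, by default
```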

Manual test procedure

The following manual steps may be performed to reproduce the actions of this script. Review the Test prerequisites before beginning.

  1. Locate CSM barebones image in IMS
  2. Create a BOS session template for the CSM barebones image
  3. Find an available compute node
  4. Reboot the node using a BOS session template
  5. Connect to the node’s console and watch the boot

1. Locate CSM barebones image in IMS

(ncn-mw#) Locate the CSM barebones image and note the etag and path fields in the output.

cray ims images list --format json | jq '.[] | select(.name | contains("barebones"))'

Expected output is similar to the following:

{
  "created": "2021-01-14T03:15:55.146962+00:00",
  "id": "293b1e9c-2bc4-4225-b235-147d1d611eef",
  "link": {
    "etag": "6d04c3a4546888ee740d7149eaecea68",
    "path": "s3://boot-images/293b1e9c-2bc4-4225-b235-147d1d611eef/manifest.json",
    "type": "s3"
  },
  "name": "cray-shasta-csm-sles15sp1-barebones.x86_64-shasta-1.4"
}

2. Create a BOS session template for the CSM barebones image

The session template below can be copied and used as the basis for the BOS session template. As noted below, make sure the S3 path for the manifest matches the S3 path shown in the Image Management Service (IMS).

  1. Create the sessiontemplate.json file.

    vi sessiontemplate.json
    

    The session template should contain the following:

    {
      "boot_sets": {
        "compute": {
          "boot_ordinal": 2,
          "etag": "etag_value_from_cray_ims_command",
          "kernel_parameters": "console=ttyS0,115200 bad_page=panic crashkernel=340M hugepagelist=2m-2g intel_iommu=off intel_pstate=disable iommu=pt ip=dhcp numa_interleave_omit=headless numa_zonelist_order=node oops=panic pageblock_order=14 pcie_ports=native printk.synchronous=y rd.neednet=1 rd.retry=10 rd.shell turbo_boost_limit=999 spire_join_token=${SPIRE_JOIN_TOKEN}",
          "network": "nmn",
          "node_roles_groups": [
            "Compute"
          ],
          "path": "path_value_from_cray_ims_command",
          "rootfs_provider": "cpss3",
          "rootfs_provider_passthrough": "dvs:api-gw-service-nmn.local:300:nmn0",
          "type": "s3"
        }
      },
      "cfs": {
        "configuration": "none"
      },
      "enable_cfs": false,
      "name": "shasta-csm-bare-bones-image"
    }
    

NOTE: Be sure to replace the values of the etag and path fields with the ones noted earlier in the cray ims images list command.

  2. Create the BOS session template, using the sessiontemplate.json file created in the previous step as input:

    cray bos v1 sessiontemplate create --file sessiontemplate.json --name shasta-csm-bare-bones-image
    

    The expected output is:

    /sessionTemplate/shasta-csm-bare-bones-image
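
The etag and path substitution described in the note above can be sketched as follows. A minimal illustration using an abbreviated template; the ims_link values are stand-ins for the real cray ims images list output:

```python
# Sketch of filling in the etag and path placeholders with the values
# noted from "cray ims images list", then writing sessiontemplate.json.
import json

# Stand-in for the "link" block of the chosen IMS image:
ims_link = {
    "etag": "6d04c3a4546888ee740d7149eaecea68",
    "path": "s3://boot-images/293b1e9c-2bc4-4225-b235-147d1d611eef/manifest.json",
}

# Abbreviated session template with the placeholder values:
template = {
    "boot_sets": {
        "compute": {
            "etag": "etag_value_from_cray_ims_command",
            "path": "path_value_from_cray_ims_command",
            "type": "s3",
        }
    },
    "name": "shasta-csm-bare-bones-image",
}

template["boot_sets"]["compute"]["etag"] = ims_link["etag"]
template["boot_sets"]["compute"]["path"] = ims_link["path"]

with open("sessiontemplate.json", "w") as f:
    json.dump(template, f, indent=2)
```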
    

3. Find an available compute node

  1. (ncn-mw#) List the compute nodes managed by HSM.

    cray hsm state components list --role Compute --enabled true --format toml
    

    Example output:

    [[Components]]
    ID = "x3000c0s17b1n0"
    Type = "Node"
    State = "On"
    Flag = "OK"
    Enabled = true
    Role = "Compute"
    NID = 1
    NetType = "Sling"
    Arch = "X86"
    Class = "River"
    
    [[Components]]
    ID = "x3000c0s17b2n0"
    Type = "Node"
    State = "On"
    Flag = "OK"
    Enabled = true
    Role = "Compute"
    NID = 2
    NetType = "Sling"
    Arch = "X86"
    Class = "River"
    

    Troubleshooting: If any compute nodes are missing from the HSM database, then refer to 2.2.2 Known issues with HSM discovery validation in order to troubleshoot any node BMCs that have not been discovered.

  2. (ncn-mw#) Choose a node.

    Choose a node from those listed and set XNAME to its component name (xname). In this example, x3000c0s17b2n0 is used.

    XNAME=x3000c0s17b2n0
    

4. Reboot the node using a BOS session template

(ncn-mw#) Create a BOS session to reboot the chosen node using the BOS session template that was just created.

cray bos v1 session create --template-name shasta-csm-bare-bones-image --operation reboot --limit "${XNAME}" --format toml

Expected output looks similar to the following:

limit = "x3000c0s17b2n0"
operation = "reboot"
templateName = "shasta-csm-bare-bones-image"
[[links]]
href = "/v1/session/8f2fc013-7817-4fe2-8e6f-c2136a5e3bd1"
jobId = "boa-8f2fc013-7817-4fe2-8e6f-c2136a5e3bd1"
rel = "session"
type = "GET"

[[links]]
href = "/v1/session/8f2fc013-7817-4fe2-8e6f-c2136a5e3bd1/status"
rel = "status"
type = "GET"

5. Connect to the node’s console and watch the boot

The boot may take up to 10 or 15 minutes. The image being booted does not support a complete boot, so the node will not boot fully into an operating system. This test is merely to verify that the CSM services needed to boot a node are available and working properly.

  1. Connect to the node’s console.

    See Manage Node Consoles for information on how to connect to the node’s console (and for instructions on how to close it later).

  2. Monitor the boot.

    This boot test is considered successful if the boot reaches the dracut stage. The indication that this has happened is that the console output has something similar to the following somewhere within the final 20 lines of its output:

    [    7.876909] dracut: FATAL: Don't know how to handle 'root=craycps-s3:s3://boot-images/e3ba09d7-e3c2-4b80-9d86-0ee2c48c2214/rootfs:c77c0097bb6d488a5d1e4a2503969ac0-27:dvs:api-gw-service-nmn.local:300:nmn0'
    [    7.898169] dracut: Refusing to continue
    

    NOTE: As long as the preceding text is found near the end of the console output, then the test is considered successful. It is normal (and not indicative of a test failure) to see something similar to the following at the very end of the console output:

             Starting Dracut Emergency Shell...
    [   11.591948] device-mapper: uevent: version 1.0.3
    [   11.596657] device-mapper: ioctl: 4.40.0-ioctl (2019-01-18) initialised: dm-devel@redhat.com
    Warning: dracut: FATAL: Don't know how to handle
    Press Enter for maintenance
    (or press Control-D to continue):
    
  3. Exit the console.

    Do this by typing &. (an ampersand followed by a period).

The test is complete.
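
The success check in the monitoring step can be sketched as follows: the boot counts as successful if the dracut FATAL line appears within the final 20 lines of console output. The function name is illustrative:

```python
# Sketch of the dracut success check: look for the FATAL marker within
# the final 20 lines of the captured console output.

SUCCESS_MARKER = "dracut: FATAL: Don't know how to handle"

def boot_reached_dracut(console_text, tail=20):
    last_lines = console_text.splitlines()[-tail:]
    return any(SUCCESS_MARKER in line for line in last_lines)

console_text = """[    7.876909] dracut: FATAL: Don't know how to handle 'root=...'
[    7.898169] dracut: Refusing to continue
         Starting Dracut Emergency Shell...
Press Enter for maintenance
(or press Control-D to continue):"""

print(boot_reached_dracut(console_text))  # True
```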