This procedure adds one or more air-cooled cabinets and all associated hardware within the cabinet except for management NCNs.
(ncn-mw) Set a variable with the system’s name.
SYSTEM_NAME=eniac
(ncn-mw) Validate the system's SHCD using CANU to generate an updated CCJ file.
Note: Do not perform the step "Proceed to generate topology files", because it is not required.
(ncn-mw) Once the validation is completed, ensure that the system's CCJ file is present in the current directory, and set the CCJ_FILE environment variable to the name of the file.
CCJ_FILE=${SYSTEM_NAME}-full-paddle.json
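Because later steps mount the current directory into a container and read the CCJ file from it, a quick check that the file exists and is non-empty can catch a mistyped name early. The helper below is a minimal sketch; the function name check_ccj_file is hypothetical and not part of CSM or CANU.

```shell
# Hypothetical helper: confirm the CCJ file exists and is non-empty.
check_ccj_file() {
  # $1: path to the CCJ/Paddle JSON file
  if [ ! -s "$1" ]; then
    echo "CCJ file '$1' is missing or empty" >&2
    return 1
  fi
  echo "Using CCJ file: $1"
}

# Example (assumes CCJ_FILE was set as shown above):
#   check_ccj_file "${CCJ_FILE}"
```

This only checks presence and size; it does not validate the JSON contents.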
(ncn-mw) Retrieve an API token:
export TOKEN=$(curl -k -s -S -d grant_type=client_credentials \
-d client_id=admin-client \
-d client_secret=`kubectl get secrets admin-client-auth -o jsonpath='{.data.client-secret}' | base64 -d` \
https://api-gw-service-nmn.local/keycloak/realms/shasta/protocol/openid-connect/token | jq -r '.access_token')
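If the token request fails, the pipeline above can still finish without an obvious error: jq -r prints the literal string null when the access_token key is absent, and an empty string when there is no input. The following sketch shows one way to detect both cases before continuing; the function name token_looks_valid is hypothetical.

```shell
# Hypothetical helper: reject an empty token or the literal "null"
# that jq -r prints when .access_token is missing from the response.
token_looks_valid() {
  case "$1" in
    ""|null) return 1 ;;
    *) return 0 ;;
  esac
}

# Example (assumes TOKEN was exported as shown above):
#   token_looks_valid "${TOKEN:-}" || echo "Token retrieval failed" >&2
```

This does not verify that the token is accepted by the API gateway, only that something plausible was captured.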
Determine the version of the latest hardware-topology-assistant.
HTA_VERSION=$(curl https://registry.local/v2/artifactory.algol60.net/csm-docker/stable/hardware-topology-assistant/tags/list | jq -r '.tags[]' | sort -V | tail -n 1)
echo "${HTA_VERSION}"
Example output:
0.2.0
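The command above relies on sort -V (version sort) so that, for example, 0.10.0 is ranked after 0.9.1; a plain lexical sort would rank it before 0.2.0. The sketch below isolates that selection step using made-up tag values; the function name latest_version and the sample tags are illustrative only.

```shell
# Hypothetical helper: read one version per line on stdin and
# print the highest one according to version ordering.
latest_version() {
  sort -V | tail -n 1
}

# Illustrative tags (not real registry contents):
printf '%s\n' 0.2.0 0.10.0 0.9.1 | latest_version
```

With these sample tags the selected value is 0.10.0, whereas replacing sort -V with plain sort would select 0.9.1.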
Perform a dry run of the hardware-topology-assistant.
Each invocation of the hardware-topology-assistant creates a new folder in the current directory named similarly to hardware-topology-assistant_TIMESTAMP. This directory contains files with the following data:
- topology_changes.json, which enumerates the changes made to SLS.
Reminder: New management NCNs are not handled by this tool. They will be handled by a different procedure referenced in the last step of this procedure.
podman run --rm -it --name hardware-topology-assistant -v "$(realpath .)":/work -e TOKEN \
"registry.local/artifactory.algol60.net/csm-docker/stable/hardware-topology-assistant:${HTA_VERSION}" \
update "${CCJ_FILE}" --dry-run
If prompted to fill in generated application node metadata entries that have ~~FIXME~~ values, follow the directions in the command output to update the application node metadata file. This file is optional and is required only if application nodes are being added to the system.
2022/08/11 12:33:54 Application node x3001c0s16b0n0 has SubRole of ~~FIXME~~
2022/08/11 12:33:54 Application node x3001c0s16b0n0 has Alias of ~~FIXME~~
2022/08/11 12:33:54
2022/08/11 12:33:54 New Application nodes are being added to the system which requires additional metadata to be provided.
2022/08/11 12:33:54 Please fill in all of the ~~FIXME~~ values in the application node metadata file.
2022/08/11 12:33:54
2022/08/11 12:33:54 Application node metadata file is now available at: application_node_metadata.yaml
2022/08/11 12:33:54 Add --application-node-metadata=application_node_metadata.yaml to the command line arguments and try again.
The following is an example entry in the application_node_metadata.yaml file that requires additional information to be filled in. Do not change the SubRole or alias values of any other application nodes. The canu_common_name field contains the common name of the application node as represented in the CANU CCJ/Paddle file, which makes the node easier to recognize when editing the file.
x3001c0s16b0n0:
  canu_common_name: login010
  subrole: ~~FIXME~~
  aliases:
  - ~~FIXME~~
The following is the above example entry with its ~~FIXME~~ values filled in to designate the node x3001c0s16b0n0 as a UAN with the alias uan10.
x3001c0s16b0n0:
  canu_common_name: login010
  subrole: UAN
  aliases:
  - uan10
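Before re-running the tool, it may be useful to confirm that no ~~FIXME~~ placeholders remain in the metadata file. The sketch below counts the lines that still contain the placeholder; the function name fixme_count is hypothetical, and the check is a plain text search rather than a YAML-aware validation.

```shell
# Hypothetical helper: print how many lines of the given file still
# contain the ~~FIXME~~ placeholder.
fixme_count() {
  # grep -c prints the number of matching lines; it exits 1 when the
  # count is 0, so the exit status is deliberately ignored here.
  grep -c -F -- '~~FIXME~~' "$1" || true
}

# Example:
#   fixme_count application_node_metadata.yaml
```

A result of 0 suggests every placeholder has been replaced; any other value is the number of lines that still need attention.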
Valid HSM SubRoles can be viewed with the following command. To add SubRoles to HSM, refer to Add Custom Roles and Subroles.
cray hsm service values subrole list --format toml
Example output:
SubRole = [ "Visualization", "UserDefined", "Master", "Worker", "Storage", "UAN", "Gateway", "LNETRouter",]
Add --application-node-metadata=application_node_metadata.yaml to the list of CLI arguments, and attempt the dry run again.
The following advanced options are available to control the behavior of the hardware-topology-assistant. Use these options with care.
| Flag | Description |
| --- | --- |
| --ignore-unknown-canu-hardware-architectures | Ignore CANU hardware architectures that are unknown to this tool. Instead of erroring out, the hardware-topology-assistant will issue warnings for unknown CANU hardware architectures. |
| --ignore-removed-hardware | Ignore hardware removed from the system, and only add new hardware to the system. This will prevent the hardware-topology-assistant from refusing to continue when hardware was removed. |
| --hardware-ignore-list=xnames | Hardware to ignore, specified as xnames. Multiple xnames can be specified in a comma-separated list. For example, --hardware-ignore-list=x3000c0s36b0n0,x3001c0s37b0n0. |
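When several xnames need to be ignored, building the comma-separated value by hand is error-prone. The sketch below joins its arguments with commas so the result can be appended to --hardware-ignore-list=; the function name join_xnames is hypothetical, and no validation of the xname format is attempted.

```shell
# Hypothetical helper: join all arguments with commas.
join_xnames() {
  # "$*" joins the arguments using the first character of IFS;
  # the subshell keeps the IFS change from leaking to the caller.
  ( IFS=,; printf '%s\n' "$*" )
}

# Example using the xnames from the table above:
echo "--hardware-ignore-list=$(join_xnames x3000c0s36b0n0 x3001c0s37b0n0)"
```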
(ncn-mw) Perform changes on the system by running the same command without the --dry-run flag.
podman run --rm -it --name hardware-topology-assistant -v "$(realpath .)":/work -e TOKEN \
"registry.local/artifactory.algol60.net/csm-docker/stable/hardware-topology-assistant:${HTA_VERSION}" \
update "${CCJ_FILE}"
(ncn-mw) Locate the topology_changes.json file that was generated by the last run of the hardware-topology-assistant.
TOPOLOGY_CHANGES_JSON="$(find . -name 'hardware-topology-assistant_*' | sort -V | tail -n 1)/topology_changes.json"
echo ${TOPOLOGY_CHANGES_JSON}
Example output:
./hardware-topology-assistant_2022-08-19T19-09-27Z/topology_changes.json
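The command above assumes that the newest run directory sorts last and that it contains a topology_changes.json file. The sketch below is a slightly more defensive variant of the same idea: it restricts the search to directories directly under a base directory and fails if the file is absent. The function name latest_topology_changes is hypothetical, and the -maxdepth and -type filters are additions not present in the original command.

```shell
# Hypothetical helper: print the path of topology_changes.json from the
# most recent hardware-topology-assistant_* directory under $1 (default: .).
latest_topology_changes() {
  _base="${1:-.}"
  _latest_dir=$(find "$_base" -maxdepth 1 -type d -name 'hardware-topology-assistant_*' | sort -V | tail -n 1)
  if [ -z "$_latest_dir" ] || [ ! -f "$_latest_dir/topology_changes.json" ]; then
    echo "No topology_changes.json found under $_base" >&2
    return 1
  fi
  printf '%s\n' "$_latest_dir/topology_changes.json"
}

# Example:
#   TOPOLOGY_CHANGES_JSON="$(latest_topology_changes .)"
```

Failing early here avoids passing a nonexistent path to the scripts in the following steps.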
(ncn-mw) Update /etc/hosts on the management NCNs with any newly added management switches.
/usr/share/doc/csm/scripts/operations/node_management/Add_River_Cabinets/update_ncn_etc_hosts.py "${TOPOLOGY_CHANGES_JSON}" --perform-changes
(ncn-mw) Update cabinet routes on management NCNs.
/usr/share/doc/csm/scripts/operations/node_management/update-ncn-cabinet-routes.sh
Reconfigure the management network by following the CANU Added Hardware procedure.
DISCLAIMER: This procedure is for standard River cabinet network configurations and does not account for any site customizations that have been made to the management network. Site administrators and support teams are responsible for knowing the customizations in effect in Shasta/CSM and configuring CANU to respect them when generating new network configurations.
See examples of using CANU custom switch configurations and examples of other CSM features that require custom configurations in the following documentation:
Verify that new hardware has been discovered.
Perform the Hardware State Manager Discovery Validation procedure.
After the management network has been reconfigured, it may take up to 10 minutes for the hardware in the new cabinets to become discovered.
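Since discovery can take up to 10 minutes, repeating a check at intervals until it succeeds or a deadline passes can be more convenient than re-running it by hand. The sketch below is a generic polling helper; wait_until is a hypothetical name, and the command that decides whether hardware has been discovered is left to the reader (hardware_is_discovered in the comment is a placeholder, not a real CSM command).

```shell
# Hypothetical helper: run a command repeatedly until it succeeds or
# the timeout (in seconds) elapses.
#   $1: timeout in seconds, $2: interval in seconds, rest: command to run
wait_until() {
  _timeout="$1"
  _interval="$2"
  shift 2
  _deadline=$(( $(date +%s) + $_timeout ))
  while ! "$@"; do
    if [ "$(date +%s)" -ge "$_deadline" ]; then
      return 1
    fi
    sleep "$_interval"
  done
  return 0
}

# Example: poll every 30 seconds for up to 10 minutes.
#   wait_until 600 30 hardware_is_discovered
```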
To help troubleshoot why new hardware may be in the HTTPsGetFailed state, the following script checks all Redfish endpoints currently in HTTPsGetFailed for some common problems. These common problems include:
- The hostname of the BMC does not resolve in DNS.
- The BMC is not configured with the expected root user credentials. Here are some common causes of this issue:
  - Root user is not configured on the BMC.
  - Root user exists on the BMC, but with an unexpected password.
/usr/share/doc/csm/scripts/operations/node_management/Add_River_Cabinets/verify_bmc_credentials.sh
Potential scenarios:
The BMC has no connection to the HMN. This is typically seen with the BMC of ncn-m001, because its BMC is connected to the site network.
------------------------------------------------------------
Redfish Endpoint x3000c0s1b0 has discovery state HTTPsGetFailed
Has no connection to HMN, ignoring
The BMC credentials present in Vault do not match the root user credentials on the BMC.
------------------------------------------------------------
Redfish Endpoint x3000c0s3b0 has discovery state HTTPsGetFailed
Checking to see if $endpoint resolves in DNS
Hostname resolves
Retrieving BMC credentials for $endpoint from SCSD/Vault
Testing stored BMC credentials against the BMC
ERROR Received 401 Unauthorized. BMC credentials in Vault do not match current BMC credentials.
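When many endpoints are reported, it can help to reduce the script output to just the affected xnames. The sketch below extracts them from lines of the form shown in the example output above; the function name failed_endpoints is hypothetical, and it assumes the script writes those lines to standard output in exactly that format.

```shell
# Hypothetical helper: read verify_bmc_credentials.sh-style output on stdin
# and print the unique xnames reported as HTTPsGetFailed, one per line.
failed_endpoints() {
  sed -n 's/^[[:space:]]*Redfish Endpoint \([^[:space:]]*\) has discovery state HTTPsGetFailed.*$/\1/p' | sort -u
}

# Example (assumes the output format shown above):
#   /usr/share/doc/csm/scripts/operations/node_management/Add_River_Cabinets/verify_bmc_credentials.sh | failed_endpoints
```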
If the root user credentials do not work, then follow these procedures:
Validate BIOS and BMC firmware levels in the new nodes.
Perform the procedures in Update Firmware with FAS, applying updates as needed.
Slingshot switches are updated with procedures from the HPE Slingshot Operations Guide.
Continue on to the HPE Slingshot Operations Guide to bring up the additional cabinets in the fabric.
Update the workload manager configuration to include any compute nodes newly added to the system.
For Slurm, see the HPE Cray Programming Environment Installation Guide: CSM on HPE Cray EX Systems (S-8003) to regenerate the Slurm configuration to include any new compute nodes added to the system.
One at a time, add each new management NCN using the Add Remove Replace NCNs procedure.