The cray command line interface (CLI) is a framework created to integrate all of the system management REST APIs into easily usable commands.
Later procedures in the installation workflow use the cray CLI to interact with multiple services.
The cray CLI configuration needs to be initialized for the Linux account, and the Keycloak user running
the procedure needs to be authorized. This section describes how to initialize the cray CLI for use by
a user and how to authorize that user.
The cray CLI only needs to be initialized once per user on a node.
There are two ways to initialize the cray CLI:
NOTE: The cray CLI supports an optional parameter (--tenant <tenant-name>) when using the CLI for tenant-scoped operations.
This argument is not used by default, but the CLI should be configured with the appropriate tenant when operating on tenant-specific resources (for example, creating BOS session templates for Compute Nodes that are members of a tenant).
See Tenant Administrator Configuration or execute cray init --help for more information.
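For example, to initialize the CLI scoped to a tenant (the tenant name below is a placeholder; substitute a real tenant name):
cray init --tenant MY_TENANT_NAME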
There are times in normal operation that a particular user must be authenticated on the cray CLI. In
this case, the user must already be present in Keycloak and have the correct permissions to access the
system.
If a Keycloak user needs to be created, then see Keycloak User Management before proceeding.
(ncn-mws#) Initialize the cray CLI
cray init
Expected output (including responses to the prompts):
Overwrite configuration file at: MY_HOME_DIR/.config/cray/configurations/default ? [y/N]: y
Cray Hostname: api-gw-service-nmn.local
Username: MY_KEYCLOAK_USER_NAME
Password: MY_PASSWORD
Success!
Initialization complete.
(ncn-mws#) The cray CLI may need to be authenticated to complete the setup.
Use the same Keycloak username and password from the above initialization command. To authenticate to the cray CLI:
cray auth login
Expect the following prompts:
Username: MY_KEYCLOAK_USER_NAME
Password: MY_PASSWORD
Success!
The craycli_init.py script can be used to create a new Keycloak account that is authorized for the cray CLI.
That account can in turn be used to initialize and authorize the cray CLI on all master, worker, and storage nodes
in the cluster that have Kubernetes configured. This account is only intended to be used for the duration of the
install and should be removed when the install is complete.
As the script leverages Keycloak administrative APIs, the --keycloakHost command line option must be set to use the CMN load balancer, as detailed below.
NOTES:
- This script creates a temporary user that can be used for basic cray CLI commands only until Keycloak is populated with real users, at which point the cray CLI should be re-initialized with a real user.
- The temporary user that was created exists only in Keycloak; it is not a real user with login shells and home directories.
(ncn-mws#) Unset the CRAY_CREDENTIALS environment variable, if previously set.
Some of the installation procedures leading up to this point use the CLI with a Kubernetes managed service
account that is normally used for internal operations. There is a procedure for extracting the OAUTH token for
this service account and assigning it to the CRAY_CREDENTIALS environment variable to permit simple CLI
operations. This environment variable must be removed prior to cray CLI initialization.
unset CRAY_CREDENTIALS
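An optional check, not part of the documented procedure, to confirm that the variable is no longer set in the current shell:
# Prints a warning if CRAY_CREDENTIALS is still set in this shell
[[ -n "${CRAY_CREDENTIALS:-}" ]] && echo "WARNING: CRAY_CREDENTIALS is still set" || echo "CRAY_CREDENTIALS is not set"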
(ncn-mws#) Initialize the cray CLI for the root account on all master and worker nodes.
The script handles creating the temporary Keycloak user and initializing the cray CLI on all master and
worker nodes that are in a ready state. Call the script with the --run option:
SITE_DOMAIN="$(craysys metadata get site-domain)"
SYSTEM_NAME="$(craysys metadata get system-name)"
AUTH_FQDN="auth.cmn.${SYSTEM_NAME}.${SITE_DOMAIN}"
python3 /usr/share/doc/csm/install/scripts/craycli_init.py --run --keycloakHost "$AUTH_FQDN"
Expected output showing the results of the operation on each node:
2021-12-21 15:50:47,095 - INFO - Keycloak Admin URL: https://auth.cmn.system1.dev.cray.com/keycloak
2021-12-21 15:50:47,814 - INFO - Loading Keycloak secrets.
2021-12-21 15:50:48,095 - INFO - Created user 'craycli_tmp_user'
2021-12-21 15:50:52,714 - INFO - Initializing nodes:
2021-12-21 15:50:52,714 - INFO - ncn-m001: Success
2021-12-21 15:50:52,714 - INFO - ncn-m002: Success
2021-12-21 15:50:52,714 - INFO - ncn-m003: Success
2021-12-21 15:50:52,714 - INFO - ncn-s001: Success
2021-12-21 15:50:52,714 - INFO - ncn-w001: Success
2021-12-21 15:50:52,714 - INFO - ncn-w002: Success
2021-12-21 15:50:52,714 - INFO - ncn-w003: Success
2021-12-21 15:50:52,714 - WARNING - ncn-s002: WARNING: Kubernetes not configured on this node
2021-12-21 15:50:52,714 - WARNING - ncn-s003: WARNING: Kubernetes not configured on this node
NOTE: In the above example, Kubernetes was not configured on ncn-s002 and ncn-s003; the
cray CLI was not authenticated on those nodes, but is functional on the other nodes.
The cray CLI is now operational on all nodes where success was reported. If a node was
unsuccessful with initialization, then there will be a warning reported. See
Troubleshooting results of the automated script
for additional information.
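Optionally, a simple read-only query can be used to spot-check that the CLI is working on any node that reported success; the node and command below are only illustrative choices:
ssh ncn-w001 'cray artifacts buckets list'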
(ncn-mws#) Remove the temporary user after the install is complete.
IMPORTANT: If this section is not followed, then the temporary user will remain as a valid account in Keycloak. Be sure to clean this up when this user is no longer required.
When the install is completed and Keycloak is fully populated with the correct end users,
then call this script again with the --cleanup option to remove the temporary user from Keycloak
and uninitialize the cray CLI on all master and worker nodes in the cluster.
SITE_DOMAIN="$(craysys metadata get site-domain)"
SYSTEM_NAME="$(craysys metadata get system-name)"
AUTH_FQDN="auth.cmn.${SYSTEM_NAME}.${SITE_DOMAIN}"
python3 /usr/share/doc/csm/install/scripts/craycli_init.py --cleanup --keycloakHost "$AUTH_FQDN"
Expected output showing the results of the operation on each node:
2021-12-21 15:52:31,095 - INFO - Keycloak Admin URL: https://auth.cmn.system1.dev.cray.com/keycloak
2021-12-21 15:52:31,611 - INFO - Removing temporary user and uninitializing the cray CLI
2021-12-21 15:52:31,783 - INFO - Deleted user 'craycli_tmp_user'
2021-12-21 15:52:31,798 - INFO - Uninitializing nodes:
2021-12-21 15:52:32,714 - INFO - ncn-m001: Success
2021-12-21 15:52:32,714 - INFO - ncn-m002: Success
2021-12-21 15:52:32,714 - INFO - ncn-m003: Success
2021-12-21 15:52:32,714 - INFO - ncn-s001: Success
2021-12-21 15:52:32,714 - INFO - ncn-w001: Success
2021-12-21 15:52:32,714 - INFO - ncn-w002: Success
2021-12-21 15:52:32,714 - INFO - ncn-w003: Success
2021-12-21 15:50:52,714 - WARNING - ncn-s002: WARNING: Kubernetes not configured on this node
2021-12-21 15:50:52,714 - WARNING - ncn-s003: WARNING: Kubernetes not configured on this node
At this point, the cray CLI will no longer be operational on these nodes until they are
initialized and authorized again with a valid Keycloak user.
Optionally, the cray CLI may be initialized with a valid Keycloak user during the cleanup
operation so that it is left operational. To do this, pass a username and password with the
cleanup command:
SITE_DOMAIN="$(craysys metadata get site-domain)"
SYSTEM_NAME="$(craysys metadata get system-name)"
AUTH_FQDN="auth.cmn.${SYSTEM_NAME}.${SITE_DOMAIN}"
python3 /usr/share/doc/csm/install/scripts/craycli_init.py --cleanup --keycloakHost "$AUTH_FQDN" -u MY_USERNAME -p MY_PASSWORD
Expected output showing the cleanup of the temporary user on each node, then the results of
using the input user to initialize and authorize the cray CLI on each node:
2021-12-21 15:52:31,095 - INFO - Keycloak Admin URL: https://auth.cmn.system1.dev.cray.com/keycloak
2021-12-21 15:52:31,611 - INFO - Removing temporary user and uninitializing the cray CLI
2021-12-21 15:52:31,783 - INFO - Deleted user 'craycli_tmp_user'
2021-12-21 15:52:31,798 - INFO - Uninitializing nodes:
2021-12-21 15:52:32,714 - INFO - ncn-m001: Success
2021-12-21 15:52:32,714 - INFO - ncn-m002: Success
2021-12-21 15:52:32,714 - INFO - ncn-m003: Success
2021-12-21 15:52:32,714 - INFO - ncn-s001: Success
2021-12-21 15:52:32,714 - INFO - ncn-w001: Success
2021-12-21 15:52:32,714 - INFO - ncn-w002: Success
2021-12-21 15:52:32,714 - INFO - ncn-w003: Success
2021-12-21 15:50:52,714 - WARNING - ncn-s002: WARNING: Kubernetes not configured on this node
2021-12-21 15:50:52,714 - WARNING - ncn-s003: WARNING: Kubernetes not configured on this node
2021-12-21 15:52:33,079 - INFO - Re-initializing the cray CLI with existing Keycloak user MY_USERNAME
2021-12-21 15:52:33,131 - INFO - Initializing nodes:
2021-12-21 15:52:37,714 - INFO - ncn-m001: Success
2021-12-21 15:52:37,714 - INFO - ncn-m002: Success
2021-12-21 15:52:37,714 - INFO - ncn-m003: Success
2021-12-21 15:52:37,714 - INFO - ncn-s001: Success
2021-12-21 15:52:38,714 - INFO - ncn-w001: Success
2021-12-21 15:52:38,714 - INFO - ncn-w002: Success
2021-12-21 15:52:38,714 - INFO - ncn-w003: Success
2021-12-21 15:50:52,714 - WARNING - ncn-s002: WARNING: Kubernetes not configured on this node
2021-12-21 15:50:52,714 - WARNING - ncn-s003: WARNING: Kubernetes not configured on this node
At this point the cray CLI will be operational on all successful nodes and authenticated with
the input Keycloak account.
Success is reported for each node where everything worked: the node was initialized and the
cray CLI is operational on that node. For nodes with problems, a brief warning message
reports what the problem is on that node.
For all debugging steps, be sure to add --keycloakHost to the command line; otherwise, Keycloak requests may fail.
Results with problems on some nodes may look like the following:
2021-12-21 15:50:47,095 - INFO - Keycloak Admin URL: https://auth.cmn.system1.dev.cray.com/keycloak
2021-12-21 15:50:47,814 - INFO - Loading Keycloak secrets.
2021-12-21 15:50:48,095 - INFO - Created user 'craycli_tmp_user'
2021-12-21 15:50:52,714 - INFO - Initializing nodes:
2021-12-21 15:50:52,714 - INFO - ncn-m001: Success
2021-12-21 15:50:52,714 - INFO - ncn-m002: Success
2021-12-21 15:50:52,714 - INFO - ncn-w001: Success
2021-12-21 15:50:52,714 - WARNING - ncn-m003: WARNING: Call to cray init failed
2021-12-21 15:50:52,714 - WARNING - ncn-s001: WARNING: Python script failed
2021-12-21 15:50:52,714 - WARNING - ncn-w002: WARNING: Failed to copy script to remote host
2021-12-21 15:50:52,714 - WARNING - ncn-w003: WARNING: Verification that cray CLI is operational failed
2021-12-21 15:50:52,714 - WARNING - ncn-s002: WARNING: Kubernetes not configured on this node
2021-12-21 15:50:52,714 - WARNING - ncn-s003: WARNING: Kubernetes not configured on this node
At this point, the script may be re-run with the --debug flag added in order to display
debug-level log messages. Alternatively, each failing node may be examined individually.
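For example, to re-run the full initialization with debug logging enabled (reusing the same --keycloakHost value as before):
SITE_DOMAIN="$(craysys metadata get site-domain)"
SYSTEM_NAME="$(craysys metadata get system-name)"
AUTH_FQDN="auth.cmn.${SYSTEM_NAME}.${SITE_DOMAIN}"
python3 /usr/share/doc/csm/install/scripts/craycli_init.py --run --debug --keycloakHost "$AUTH_FQDN"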
(ncn-mws#) Log into the node that failed.
To try re-running the initialization on only a single node, ssh to that node, then run
the script with the --initnode and --debug options:
NOTE: The script copies itself to the /tmp/ directory on each target node.
It should still be there; if not, copy the script to any accessible location on the node.
ssh NODE_THAT_FAILED
python3 /tmp/craycli_init.py --initnode --debug
Use the debug-level messages to determine what is wrong on this node.
(ncn-mws#) Check for missing Python Modules
Some Python modules required by the script may be missing on individual nodes, particularly
on the PIT node or storage nodes. The script can be run from any of the NCNs, so if it fails
on one node, copy it to any location on another node and run it from there.
In the following example the script fails on an NCN because of the missing Python module oauthlib.
To work around that, the script is copied to ncn-m002, where it is successfully run.
python3 /usr/share/doc/csm/install/scripts/craycli_init.py --run
Error output:
Traceback (most recent call last):
File "craycli_init.py", line 50, in <module>
import oauthlib.oauth2
ModuleNotFoundError: No module named 'oauthlib'
Copy the script to the ncn-m002 node and run from there:
scp /usr/share/doc/csm/install/scripts/craycli_init.py ncn-m002:'~/my_dir/'
ssh ncn-m002 'cd my_dir && python3 ./craycli_init.py --run'
At this point, expect the script to proceed as documented. It will still report a failure for the node where it was originally attempted, because that node lacks the required Python modules, but it may complete successfully on the rest of the nodes.
Alternatively, the modules could be installed using pip or pip3 if that is available on the node.
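For example, to install the module missing in the error above (assuming the node has access to a suitable Python package index or local mirror):
pip3 install oauthlib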
(ncn-mws#) Check for Kubernetes setup on the node
The script relies on Kubernetes Secrets to store the credentials of the temporary Keycloak user. If
a node does not have Kubernetes initialized on it, the user must manually initialize the cray CLI with a
valid Keycloak user.
Run the following command:
kubectl get nodes
If Kubernetes is configured and operating correctly, then the output should show a list of the master and worker nodes:
NAME STATUS ROLES AGE VERSION
ncn-m001 Ready control-plane,master 120d v1.21.12
ncn-m002 Ready control-plane,master 120d v1.21.12
ncn-m003 Ready control-plane,master 120d v1.21.12
ncn-w001 Ready <none> 120d v1.21.12
ncn-w002 Ready <none> 120d v1.21.12
ncn-w003 Ready <none> 120d v1.21.12
If Kubernetes is not configured or operating correctly, then an error will be displayed instead. For example:
W0902 16:06:38.726121 61796 loader.go:223] Config not found: /etc/kubernetes/admin.conf
error: the server doesn't have a resource type "nodes"
If Kubernetes is not operational on this node, the cray CLI may still be initialized and authorized manually
with a valid existing Keycloak user by following the process in
Single User Already Configured in Keycloak.
NOTE: While resolving the following issues is beyond the scope of this section, more information about what is failing can be found by adding -vvvvv to the cray init commands.
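For example:
cray init -vvvvv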
(ncn-mws#) Troubleshoot failed initialization.
If initialization fails in the above step, then there are several common causes:
- api-gw-service-nmn.local, or the host provided via --keycloakHost, may be unreachable, preventing the CLI from reaching the API Gateway and Keycloak for authorization.

If the initialization fails and the error output from a command such as the following is similar to the example below, then restart radosgw on the storage nodes.
cray artifacts buckets list -vvv
The output may look something like:
Loaded token: /root/.config/cray/tokens/api_gw_service_nmn_local.vers
REQUEST: PUT to https://api-gw-service-nmn.local/apis/sts/token
OPTIONS: {'verify': False}
ERROR: {
"detail": "The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.",
"status": 500,
"title": "Internal Server Error",
"type": "about:blank"
}
Usage: cray artifacts buckets list [OPTIONS]
Try 'cray artifacts buckets list --help' for help.
Error: Internal Server Error: The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.
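The radosgw service name used in the next step (rgw.site1 in these examples) is system-specific. If it is not known, the configured rgw service can be listed first; this assumes the cephadm orchestrator is in use:
ceph orch ls rgw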
(ncn-s#) Restart the Ceph radosgw process on one of the first three storage nodes.
ceph orch restart rgw.site1
The expected output will be similar to the following, but it will vary based on the number of nodes running radosgw:
restart rgw.site1.ncn-s001.cshvbb from host 'ncn-s001'
restart rgw.site1.ncn-s002.tlegbb from host 'ncn-s002'
restart rgw.site1.ncn-s003.vwjwew from host 'ncn-s003'
(ncn-s#) Check to see that the processes restarted.
ceph orch ps --daemon_type rgw
The REFRESHED time should be in seconds. Restarting all of them could require a couple of minutes, depending on how many instances are running.
NAME HOST STATUS REFRESHED AGE VERSION IMAGE NAME IMAGE ID CONTAINER ID
rgw.site1.ncn-s001.cshvbb ncn-s001 running (29s) 23s ago 9h 15.2.8 registry.local/ceph/ceph:v15.2.8 5553b0cb212c 2a712824adc1
rgw.site1.ncn-s002.tlegbb ncn-s002 running (29s) 28s ago 9h 15.2.8 registry.local/ceph/ceph:v15.2.8 5553b0cb212c e423f22d06a5
rgw.site1.ncn-s003.vwjwew ncn-s003 running (29s) 23s ago 9h 15.2.8 registry.local/ceph/ceph:v15.2.8 5553b0cb212c 1e6ad6bc2c62
(ncn-s#) If more than 5 minutes have passed and the radosgw services have not restarted, then fail the ceph-mgr process over to the standby.
There are cases where an orchestration task gets stuck and the current remediation is to fail the Ceph manager process.
Get active ceph-mgr.
ceph mgr dump | jq -r .active_name
Expected output will be something similar to:
ncn-s002.zozbqp
Fail the active ceph-mgr.
ceph mgr fail $(ceph mgr dump | jq -r .active_name)
Confirm that ceph-mgr has moved to a different ceph-mgr container.
ceph mgr dump | jq -r .active_name
Expect the output to be a different manager than was previously reported:
ncn-s001.qucrpr
Verify that the processes restarted using the command from step 3.
At this point the processes should restart. If they do not, then retry steps 2 and 3.