cray CLI)
The cray command line interface (CLI) is a framework created to integrate all of the system management REST APIs into easily usable commands.
Later procedures in the installation workflow use the cray CLI to interact with multiple services.
The cray CLI configuration needs to be initialized for the Linux account, and the Keycloak user running the procedure needs to be authorized. This section describes how to initialize the cray CLI for use by a user and how to authorize that user.
The cray CLI only needs to be initialized once per user on a node.
There are two ways to initialize the cray CLI:
NOTE: The cray CLI supports an optional parameter (--tenant <tenant-name>) when using the CLI for tenant-scoped operations.
This argument is not used by default, but the CLI should be configured with the appropriate tenant when operating on tenant-specific resources (for example, creating BOS session templates for compute nodes that are members of a tenant).
See Tenant Administrator Configuration or execute cray init --help for more information.
There are times in normal operation when a particular user must be authenticated on the cray CLI. In this case, the user must already be present in Keycloak and have the correct permissions to access the system.
If a Keycloak user needs to be created, then see Keycloak User Management before proceeding.
(ncn-mws#) Initialize the cray CLI.
cray init
Expected output (including responses to the prompts):
Overwrite configuration file at: MY_HOME_DIR/.config/cray/configurations/default ? [y/N]: y
Cray Hostname: api-gw-service-nmn.local
Username: MY_KEYCLOAK_USER_NAME
Password: MY_PASSWORD
Success!
Initialization complete.
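The overwrite prompt in the output above only appears when a configuration file already exists for the account. As a minimal sketch, the check below reports whether that file is present before cray init is run. The path is taken from the prompt shown above; the function name cray_config_exists and its optional home-directory argument are hypothetical and are not part of the cray CLI.

```shell
# Report whether a cray CLI configuration file already exists for a user.
# The path mirrors the one shown in the cray init prompt; the function name
# and the optional home-directory argument are hypothetical.
cray_config_exists() {
    local home_dir
    home_dir="${1:-$HOME}"
    [ -f "${home_dir}/.config/cray/configurations/default" ]
}

# Example: predict whether cray init will ask to overwrite a configuration.
if cray_config_exists; then
    echo "Existing configuration found; cray init will prompt to overwrite it."
else
    echo "No existing configuration found; cray init will create one."
fi
```

Because the CLI is initialized per user on each node, this check only describes the account and node it is run on.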
(ncn-mws#) The cray CLI may need to be authenticated to complete the setup.
Use the same Keycloak username and password from the above initialization command. To authenticate to the cray CLI:
cray auth login
Expect the following prompts:
Username: MY_KEYCLOAK_USER_NAME
Password: MY_PASSWORD
Success!
The craycli_init.py script can be used to create a new Keycloak account that is authorized for the cray CLI.
That account can in turn be used to initialize and authorize the cray CLI on all master, worker, and storage nodes in the cluster that have Kubernetes configured. This account is only intended to be used for the duration of the install and should be removed when the install is complete.
As the script leverages Keycloak administrative APIs, the --keycloakHost command line option must be set to use the CMN load balancer, as detailed below.
NOTES:
- This script creates a temporary user that can be used for basic cray CLI commands only until Keycloak is populated with real users, at which point the cray CLI should be re-initialized with a real user.
- The temporary user that was created exists only in Keycloak; it is not a real user with a login shell and home directory.
(ncn-mws#) Unset the CRAY_CREDENTIALS environment variable, if previously set.
Some of the installation procedures leading up to this point use the CLI with a Kubernetes managed service account that is normally used for internal operations. There is a procedure for extracting the OAUTH token for this service account and assigning it to the CRAY_CREDENTIALS environment variable to permit simple CLI operations. This environment variable must be removed prior to cray CLI initialization.
unset CRAY_CREDENTIALS
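It can be useful to know whether the variable was actually set before removing it. The following is a minimal sketch of such a check; clear_cray_credentials is a hypothetical helper name, and the function must be defined and called in the current shell (pasted or sourced), because an unset performed in a child process does not affect the parent shell.

```shell
# Unset CRAY_CREDENTIALS only if it is currently set, and report what happened.
# clear_cray_credentials is a hypothetical helper name.
clear_cray_credentials() {
    if [ -n "${CRAY_CREDENTIALS+set}" ]; then
        unset CRAY_CREDENTIALS
        echo "CRAY_CREDENTIALS was set and has been removed."
    else
        echo "CRAY_CREDENTIALS was not set; nothing to do."
    fi
}
```

Calling clear_cray_credentials before cray init has the same effect as the plain unset above, with a message that records which case applied.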
(ncn-mws#) Initialize the cray CLI for the root account on all master and worker nodes.
The script will handle creation of the temporary Keycloak user and initialize all master and worker nodes that are in a ready state. Call the script with the --run option:
SITE_DOMAIN="$(craysys metadata get site-domain)"
SYSTEM_NAME="$(craysys metadata get system-name)"
AUTH_FQDN="auth.cmn.${SYSTEM_NAME}.${SITE_DOMAIN}"
python3 /usr/share/doc/csm/install/scripts/craycli_init.py --run --keycloakHost "$AUTH_FQDN"
Expected output showing the results of the operation on each node:
2021-12-21 15:50:47,095 - INFO - Keycloak Admin URL: https://auth.cmn.system1.dev.cray.com/keycloak
2021-12-21 15:50:47,814 - INFO - Loading Keycloak secrets.
2021-12-21 15:50:48,095 - INFO - Created user 'craycli_tmp_user'
2021-12-21 15:50:52,714 - INFO - Initializing nodes:
2021-12-21 15:50:52,714 - INFO - ncn-m001: Success
2021-12-21 15:50:52,714 - INFO - ncn-m002: Success
2021-12-21 15:50:52,714 - INFO - ncn-m003: Success
2021-12-21 15:50:52,714 - INFO - ncn-s001: Success
2021-12-21 15:50:52,714 - INFO - ncn-w001: Success
2021-12-21 15:50:52,714 - INFO - ncn-w002: Success
2021-12-21 15:50:52,714 - INFO - ncn-w003: Success
2021-12-21 15:50:52,714 - WARNING - ncn-s002: WARNING: Kubernetes not configured on this node
2021-12-21 15:50:52,714 - WARNING - ncn-s003: WARNING: Kubernetes not configured on this node
NOTE: In the above example, Kubernetes was not configured on ncn-s002 and ncn-s003; the cray CLI was not authenticated on those nodes, but is functional on the other nodes.
The cray CLI is now operational on all nodes where success was reported. If a node was unsuccessful with initialization, then there will be a warning reported. See Troubleshooting results of the automated script for additional information.
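The --keycloakHost value used in this step is composed from the system name and the site domain. If either craysys lookup returns an empty string, the composed name is malformed and Keycloak requests fail. The sketch below guards against that case; build_auth_fqdn is a hypothetical helper name, while the auth.cmn prefix and the two inputs follow the commands shown in this step.

```shell
# Compose the Keycloak host name from the system name and site domain,
# refusing to continue if either value is empty.
# build_auth_fqdn is a hypothetical helper name.
build_auth_fqdn() {
    local system_name
    local site_domain
    system_name="$1"
    site_domain="$2"
    if [ -z "$system_name" ] || [ -z "$site_domain" ]; then
        echo "system name and site domain must both be non-empty" >&2
        return 1
    fi
    printf 'auth.cmn.%s.%s\n' "$system_name" "$site_domain"
}

# Possible usage on a node where craysys is available:
#   AUTH_FQDN="$(build_auth_fqdn "$(craysys metadata get system-name)" \
#                                "$(craysys metadata get site-domain)")" || exit 1
```

With the values from the sample log output, build_auth_fqdn system1 dev.cray.com prints auth.cmn.system1.dev.cray.com.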
(ncn-mws#) Remove the temporary user after the install is complete.
IMPORTANT: If this section is not followed, then the temporary user will remain as a valid account in Keycloak. Be sure to clean this up when this user is no longer required.
When the install is completed and Keycloak is fully populated with the correct end users, call this script again with the --cleanup option to remove the temporary user from Keycloak and uninitialize the cray CLI on all master and worker nodes in the cluster.
SITE_DOMAIN="$(craysys metadata get site-domain)"
SYSTEM_NAME="$(craysys metadata get system-name)"
AUTH_FQDN="auth.cmn.${SYSTEM_NAME}.${SITE_DOMAIN}"
python3 /usr/share/doc/csm/install/scripts/craycli_init.py --cleanup --keycloakHost "$AUTH_FQDN"
Expected output showing the results of the operation on each node:
2021-12-21 15:52:31,095 - INFO - Keycloak Admin URL: https://auth.cmn.system1.dev.cray.com/keycloak
2021-12-21 15:52:31,611 - INFO - Removing temporary user and uninitializing the cray CLI
2021-12-21 15:52:31,783 - INFO - Deleted user 'craycli_tmp_user'
2021-12-21 15:52:31,798 - INFO - Uninitializing nodes:
2021-12-21 15:52:32,714 - INFO - ncn-m001: Success
2021-12-21 15:52:32,714 - INFO - ncn-m002: Success
2021-12-21 15:52:32,714 - INFO - ncn-m003: Success
2021-12-21 15:52:32,714 - INFO - ncn-s001: Success
2021-12-21 15:52:32,714 - INFO - ncn-w001: Success
2021-12-21 15:52:32,714 - INFO - ncn-w002: Success
2021-12-21 15:52:32,714 - INFO - ncn-w003: Success
2021-12-21 15:50:52,714 - WARNING - ncn-s002: WARNING: Kubernetes not configured on this node
2021-12-21 15:50:52,714 - WARNING - ncn-s003: WARNING: Kubernetes not configured on this node
At this point, the cray CLI will no longer be operational on these nodes until they are initialized and authorized again with a valid Keycloak user.
Optionally, the cray CLI may be initialized with a valid Keycloak user during the cleanup operation so that it is left operational. To do this, pass in a user and password with the cleanup command:
SITE_DOMAIN="$(craysys metadata get site-domain)"
SYSTEM_NAME="$(craysys metadata get system-name)"
AUTH_FQDN="auth.cmn.${SYSTEM_NAME}.${SITE_DOMAIN}"
python3 /usr/share/doc/csm/install/scripts/craycli_init.py --cleanup --keycloakHost "$AUTH_FQDN" -u MY_USERNAME -p MY_PASSWORD
Expected output showing the cleanup of the temporary user on each node, then the results of using the input user to initialize and authorize the cray CLI on each node:
2021-12-21 15:52:31,095 - INFO - Keycloak Admin URL: https://auth.cmn.system1.dev.cray.com/keycloak
2021-12-21 15:52:31,611 - INFO - Removing temporary user and uninitializing the cray CLI
2021-12-21 15:52:31,783 - INFO - Deleted user 'craycli_tmp_user'
2021-12-21 15:52:31,798 - INFO - Uninitializing nodes:
2021-12-21 15:52:32,714 - INFO - ncn-m001: Success
2021-12-21 15:52:32,714 - INFO - ncn-m002: Success
2021-12-21 15:52:32,714 - INFO - ncn-m003: Success
2021-12-21 15:52:32,714 - INFO - ncn-s001: Success
2021-12-21 15:52:32,714 - INFO - ncn-w001: Success
2021-12-21 15:52:32,714 - INFO - ncn-w002: Success
2021-12-21 15:52:32,714 - INFO - ncn-w003: Success
2021-12-21 15:50:52,714 - WARNING - ncn-s002: WARNING: Kubernetes not configured on this node
2021-12-21 15:50:52,714 - WARNING - ncn-s003: WARNING: Kubernetes not configured on this node
2021-12-21 15:52:33,079 - INFO - Re-initializing the cray CLI with existing Keycloak user MY_USERNAME
2021-12-21 15:52:33,131 - INFO - Initializing nodes:
2021-12-21 15:52:37,714 - INFO - ncn-m001: Success
2021-12-21 15:52:37,714 - INFO - ncn-m002: Success
2021-12-21 15:52:37,714 - INFO - ncn-m003: Success
2021-12-21 15:52:37,714 - INFO - ncn-s001: Success
2021-12-21 15:52:38,714 - INFO - ncn-w001: Success
2021-12-21 15:52:38,714 - INFO - ncn-w002: Success
2021-12-21 15:52:38,714 - INFO - ncn-w003: Success
2021-12-21 15:50:52,714 - WARNING - ncn-s002: WARNING: Kubernetes not configured on this node
2021-12-21 15:50:52,714 - WARNING - ncn-s003: WARNING: Kubernetes not configured on this node
At this point, the cray CLI will be operational on all successful nodes and authenticated with the input Keycloak account.
Each node will have Success reported if everything worked, the node was initialized, and the cray CLI is operational for that node. For nodes with problems, there will be a brief warning message that reports what the problem is on that node.
For all debugging steps, add --keycloakHost to the command line; otherwise, Keycloak requests may fail.
Results with problems on some nodes may look like the following:
2021-12-21 15:50:47,095 - INFO - Keycloak Admin URL: https://auth.cmn.system1.dev.cray.com/keycloak
2021-12-21 15:50:47,814 - INFO - Loading Keycloak secrets.
2021-12-21 15:50:48,095 - INFO - Created user 'craycli_tmp_user'
2021-12-21 15:50:52,714 - INFO - Initializing nodes:
2021-12-21 15:50:52,714 - INFO - ncn-m001: Success
2021-12-21 15:50:52,714 - INFO - ncn-m002: Success
2021-12-21 15:50:52,714 - INFO - ncn-w001: Success
2021-12-21 15:50:52,714 - WARNING - ncn-m003: WARNING: Call to cray init failed
2021-12-21 15:50:52,714 - WARNING - ncn-s001: WARNING: Python script failed
2021-12-21 15:50:52,714 - WARNING - ncn-w002: WARNING: Failed to copy script to remote host
2021-12-21 15:50:52,714 - WARNING - ncn-w003: WARNING: Verification that cray CLI is operational failed
2021-12-21 15:50:52,714 - WARNING - ncn-s002: WARNING: Kubernetes not configured on this node
2021-12-21 15:50:52,714 - WARNING - ncn-s003: WARNING: Kubernetes not configured on this node
At this point, the script may be re-run with the --debug flag added so that debug level log messages are displayed. Alternatively, each failing node may be examined individually.
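When many nodes are involved, it helps to reduce the script output to just the nodes that need attention. The sketch below does this by filtering on the log level; it assumes the "TIMESTAMP - LEVEL - NODE: message" layout shown in the sample output above, and warned_nodes is a hypothetical helper name rather than part of craycli_init.py.

```shell
# Print the name of every node that craycli_init.py reported with a WARNING.
# Reads the script's log output on standard input and assumes the
# "TIMESTAMP - LEVEL - NODE: message" layout shown in the sample output.
# warned_nodes is a hypothetical helper name.
warned_nodes() {
    awk -F' - ' '$2 == "WARNING" { split($3, parts, ":"); print parts[1] }'
}

# Possible usage (stderr is merged in case the log is written there):
#   python3 /usr/share/doc/csm/install/scripts/craycli_init.py --run \
#       --keycloakHost "$AUTH_FQDN" 2>&1 | warned_nodes
```

For the sample output above, the filter would list ncn-m003, ncn-s001, ncn-w002, ncn-w003, ncn-s002, and ncn-s003, one per line.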
(ncn-mws#) Log into the node that failed.
To try re-running the initialization on only a single node, ssh to that node, then run the script with the --initnode and --debug options:
NOTE: As part of its operation, the script copies itself to the /tmp/ directory on each target node. The script should still be there, but if it is not, copy the script to any accessible location on that node.
ssh NODE_THAT_FAILED
python3 /tmp/craycli_init.py --initnode --debug
Now use the enhanced messages to determine what is wrong on this node.
(ncn-mws#) Check for missing Python modules.
It is possible that some Python modules required for the script are missing on individual nodes, particularly on the PIT or storage nodes. This script can be run from any of the NCNs, so if it fails on one node, copy it to any location on another node and try to run it from there.
In the following example, the script fails on an NCN because of the missing Python module oauthlib.
To work around that, the script is copied to ncn-m002, where it is successfully run.
python3 /usr/share/doc/csm/install/scripts/craycli_init.py --run
Error output:
Traceback (most recent call last):
File "craycli_init.py", line 50, in <module>
import oauthlib.oauth2
ModuleNotFoundError: No module named 'oauthlib'
Copy the script to the ncn-m002 node and run from there:
scp /usr/share/doc/csm/install/scripts/craycli_init.py ncn-m002:'~/my_dir/'
ssh ncn-m002 'cd my_dir && python3 ./craycli_init.py --run'
At this point, expect the script to proceed as documented. It will fail again on the node where it was originally attempted, because that node still lacks the required Python modules, but it may complete successfully on the rest of the nodes.
Alternatively, the modules could be installed using pip or pip3 if that is available on the node.
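Before deciding which node to run the script from, it may help to test whether the required modules can be imported at all. The following is a minimal sketch; missing_python_modules is a hypothetical helper, PYTHON_BIN is an assumed override for the interpreter (defaulting to python3), and the only module name taken from this document is the one in the traceback above.

```shell
# Print each named Python module that cannot be imported on this node.
# missing_python_modules is a hypothetical helper; PYTHON_BIN is an assumed
# override for the interpreter and defaults to python3.
missing_python_modules() {
    local module
    for module in "$@"; do
        if ! "${PYTHON_BIN:-python3}" -c "import ${module}" >/dev/null 2>&1; then
            echo "$module"
        fi
    done
}

# Example: check the module named in the traceback above.
missing_python_modules oauthlib.oauth2
```

No output means every listed module imported successfully; any module name that is printed is a candidate for installation or a reason to run the script from a different node.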
(ncn-mws#) Check for Kubernetes setup on the node.
The script relies on Kubernetes Secrets to store the credentials of the temporary Keycloak user. If a node does not have Kubernetes initialized on it, the user must manually initialize the cray CLI with a valid Keycloak user.
Run the following command:
kubectl get nodes
If Kubernetes is configured and operating correctly, then the output should show a list of the master and worker nodes:
NAME STATUS ROLES AGE VERSION
ncn-m001 Ready control-plane,master 120d v1.21.12
ncn-m002 Ready control-plane,master 120d v1.21.12
ncn-m003 Ready control-plane,master 120d v1.21.12
ncn-w001 Ready <none> 120d v1.21.12
ncn-w002 Ready <none> 120d v1.21.12
ncn-w003 Ready <none> 120d v1.21.12
If Kubernetes is not configured or operating correctly, then an error will be displayed instead. For example:
W0902 16:06:38.726121 61796 loader.go:223] Config not found: /etc/kubernetes/admin.conf
error: the server doesn't have a resource type "nodes"
If Kubernetes is not operational on this node, the cray CLI may still be initialized and authorized manually with a valid existing Keycloak user following the process Single User Already Configured in Keycloak.
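When kubectl get nodes does return a table, individual nodes may still be in a state other than Ready. As a minimal sketch, the filter below lists those nodes; it assumes the default column layout shown above (NAME first, STATUS second), and not_ready_nodes is a hypothetical helper name.

```shell
# List Kubernetes nodes whose STATUS column is anything other than "Ready".
# Reads kubectl get nodes output on standard input and assumes the default
# column layout shown above (NAME first, STATUS second).
# not_ready_nodes is a hypothetical helper name.
not_ready_nodes() {
    awk 'NR > 1 && $2 != "Ready" { print $1 }'
}

# Possible usage on a node with kubectl configured:
#   kubectl get nodes | not_ready_nodes
```

The comparison is exact, so a compound status such as Ready,SchedulingDisabled is also listed; that is deliberate in this sketch, since such a node may deserve a closer look.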
NOTE: While resolving the following issues is beyond the scope of this section, more information about what is failing can be found by adding -vvvvv to the cray init commands.
(ncn-mws#) Troubleshoot failed initialization.
If initialization fails in the above step, then there are several common causes:
- api-gw-service-nmn.local or the host provided via --keycloakHost may be preventing the CLI from reaching the API Gateway and Keycloak for authorization
If the initialization fails and the reason output is similar to the following example, then restart radosgw on the storage nodes.
cray artifacts buckets list -vvv
The output may look something like:
Loaded token: /root/.config/cray/tokens/api_gw_service_nmn_local.vers
REQUEST: PUT to https://api-gw-service-nmn.local/apis/sts/token
OPTIONS: {'verify': False}
ERROR: {
"detail": "The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.",
"status": 500,
"title": "Internal Server Error",
"type": "about:blank"
}
Usage: cray artifacts buckets list [OPTIONS]
Try 'cray artifacts buckets list --help' for help.
Error: Internal Server Error: The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.
(ncn-s#) Restart the Ceph radosgw process on one of the first three storage nodes.
ceph orch restart rgw.site1
The expected output will be similar to the following, but it will vary based on the number of nodes running radosgw:
restart rgw.site1.ncn-s001.cshvbb from host 'ncn-s001'
restart rgw.site1.ncn-s002.tlegbb from host 'ncn-s002'
restart rgw.site1.ncn-s003.vwjwew from host 'ncn-s003'
(ncn-s#) Check to see that the processes restarted.
ceph orch ps --daemon_type rgw
The REFRESHED time should be in seconds. Restarting all of the daemons could require a couple of minutes, depending on how many there are.
NAME HOST STATUS REFRESHED AGE VERSION IMAGE NAME IMAGE ID CONTAINER ID
rgw.site1.ncn-s001.cshvbb ncn-s001 running (29s) 23s ago 9h 15.2.8 registry.local/ceph/ceph:v15.2.8 5553b0cb212c 2a712824adc1
rgw.site1.ncn-s002.tlegbb ncn-s002 running (29s) 28s ago 9h 15.2.8 registry.local/ceph/ceph:v15.2.8 5553b0cb212c e423f22d06a5
rgw.site1.ncn-s003.vwjwew ncn-s003 running (29s) 23s ago 9h 15.2.8 registry.local/ceph/ceph:v15.2.8 5553b0cb212c 1e6ad6bc2c62
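The "REFRESHED time should be in seconds" check can be applied mechanically to the table above. The sketch below lists any rgw daemon whose REFRESHED value is not a whole number of seconds. It assumes the whitespace-separated layout shown above, in which a status such as "running (29s)" occupies two tokens so that the REFRESHED value is the fifth token; rgw_not_refreshed_recently is a hypothetical helper name.

```shell
# List rgw daemons whose REFRESHED value is not expressed in seconds.
# Reads ceph orch ps --daemon_type rgw output on standard input and assumes
# the layout shown above, where the status "running (29s)" spans two tokens
# and the REFRESHED value (for example 23s) is therefore the fifth token.
# rgw_not_refreshed_recently is a hypothetical helper name.
rgw_not_refreshed_recently() {
    awk 'NR > 1 && $5 !~ /^[0-9]+s$/ { print $1 }'
}

# Possible usage on a storage node:
#   ceph orch ps --daemon_type rgw | rgw_not_refreshed_recently
```

A daemon whose status is a single token (for example, an error state) shifts the columns and is also listed, which errs on the side of flagging it for inspection.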
(ncn-s#) If more than 5 minutes have passed and the radosgw services have not restarted, then fail the ceph-mgr process over to the standby.
There are cases where an orchestration task gets stuck, and the current remediation is to fail the Ceph manager process.
Get the active ceph-mgr.
ceph mgr dump | jq -r .active_name
Expected output will be something similar to:
ncn-s002.zozbqp
Fail the active ceph-mgr.
ceph mgr fail $(ceph mgr dump | jq -r .active_name)
Confirm that ceph-mgr has moved to a different ceph-mgr container.
ceph mgr dump | jq -r .active_name
Expect the output to be a different manager than was previously reported:
ncn-s001.qucrpr
Verify that the processes restarted using the command from step 3.
At this point the processes should restart. If they do not, then retry steps 2 and 3.