## Configure the Cray Command Line Interface (`cray` CLI)

The `cray` command line interface (CLI) is a framework created to integrate all of the system management REST APIs into easily usable commands.
Procedures in the CSM installation workflow use the `cray` CLI to interact with multiple services.
The `cray` CLI configuration needs to be initialized for the Linux account, and the Keycloak user running the procedure needs to be authorized. This section describes how to initialize the `cray` CLI for use by a user and how to authorize that user. The `cray` CLI only needs to be initialized once per user on a node.
1. Unset the `CRAY_CREDENTIALS` environment variable, if previously set.
Some CSM installation procedures use the CLI with a Kubernetes managed service account that is normally used for internal operations. There is a procedure for extracting the OAUTH token for this service account and assigning it to the `CRAY_CREDENTIALS` environment variable to permit simple CLI operations. The variable must be unset in order to validate that the CLI is working with user authentication.

```bash
unset CRAY_CREDENTIALS
```
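A quick way to confirm the variable is no longer set in the current shell (a trivial check, shown here only for convenience):

```bash
# Prints "unset" if CRAY_CREDENTIALS is not defined in this shell
echo "CRAY_CREDENTIALS is ${CRAY_CREDENTIALS:-unset}"
```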
2. Initialize the `cray` CLI for the `root` account.
The `cray` CLI needs to know what host to use to obtain authorization and what user is requesting authorization, so that it can obtain an OAUTH token to talk to the API gateway. This is accomplished by initializing the CLI configuration.
In this example, the `vers` username is used. It should be replaced with an appropriate user account.

```bash
cray init --hostname api-gw-service-nmn.local
```
Expected output (including the typed input) should look similar to the following:

```text
Username: vers
Password:
Success!
Initialization complete.
```
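To confirm which hostname was saved, the stored configuration can be inspected. This is a light sketch; the exact `cray config` subcommands available may vary with the installed craycli version:

```bash
# List the active CLI configuration, including the configured API gateway hostname
cray config list
```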
3. Verify that the `cray` CLI is operational.

```bash
cray artifacts buckets list -vvv
```
Expected output looks similar to the following:

```text
Loaded token: /root/.config/cray/tokens/api_gw_service_nmn_local.vers
REQUEST: PUT to https://api-gw-service-nmn.local/apis/sts/token
OPTIONS: {'verify': False}
S3 credentials retrieved successfully
results = [ "alc", "badger", "benji-backups", "boot-images", "etcd-backup", "fw-update", "ims", "install-artifacts", "nmd", "postgres-backup",
"prs", "sat", "sds", "sls", "sma", "ssd", "ssm", "vbis", "velero", "wlm",]
```
If an error occurs, then continue to the troubleshooting section below. More information about what is failing can be found by adding `-vvvvv` to the `cray init ...` commands.
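For example (the exact flag placement here is illustrative; craycli accepts verbosity flags alongside its other options):

```bash
# Re-run initialization with maximum verbosity to see where it fails
cray init --hostname api-gw-service-nmn.local -vvvvv
```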
If CLI initialization fails, a common cause is a DNS failure looking up `api-gw-service-nmn.local`, which may be preventing the CLI from reaching the API gateway and Keycloak for authorization; a quick resolution check is shown below.
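A simple way to test name resolution (assuming `nslookup` is available on the node):

```bash
# Confirm the API gateway hostname resolves before debugging further
nslookup api-gw-service-nmn.local
```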
If an error similar to the following is seen, then restart `radosgw` on the storage nodes.
```text
The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.
```
Restart `radosgw` using the following steps. These steps must be run on one of the storage nodes running the Ceph `radosgw` process. By default, these nodes are `ncn-s001`, `ncn-s002`, and `ncn-s003`.
1. Restart the Ceph `radosgw` process.

The expected output will be similar to the following, but it will vary based on the nodes running `radosgw`.

```bash
ceph orch restart rgw.site1.zone1
```
Example output:

```text
restart rgw.site1.zone1.ncn-s001.cshvbb from host 'ncn-s001'
restart rgw.site1.zone1.ncn-s002.tlegbb from host 'ncn-s002'
restart rgw.site1.zone1.ncn-s003.vwjwew from host 'ncn-s003'
```
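The service name (`rgw.site1.zone1` in these examples) varies by site. If it is not known, the deployed `rgw` services can be listed first; a minimal sketch using the standard `ceph orch ls` service-type filter:

```bash
# List deployed rgw services to find the name to pass to 'ceph orch restart'
ceph orch ls rgw
```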
2. Check to see that the processes restarted.

```bash
ceph orch ps --daemon_type rgw
```
Example output:

```text
NAME                             HOST      STATUS         REFRESHED  AGE  VERSION  IMAGE NAME                        IMAGE ID      CONTAINER ID
rgw.site1.zone1.ncn-s001.cshvbb  ncn-s001  running (29s)  23s ago    9h   15.2.8   registry.local/ceph/ceph:v15.2.8  5553b0cb212c  2a712824adc1
rgw.site1.zone1.ncn-s002.tlegbb  ncn-s002  running (29s)  28s ago    9h   15.2.8   registry.local/ceph/ceph:v15.2.8  5553b0cb212c  e423f22d06a5
rgw.site1.zone1.ncn-s003.vwjwew  ncn-s003  running (29s)  23s ago    9h   15.2.8   registry.local/ceph/ceph:v15.2.8  5553b0cb212c  1e6ad6bc2c62
```
A process that has restarted will show an uptime of only seconds in its `STATUS` column (for example, `running (29s)`). Restarting all of them can take a couple of minutes, depending on how many `radosgw` processes are running.
If more than five minutes have passed and the `radosgw` processes have not restarted, then fail the `ceph-mgr` process.
1. Determine the active `ceph-mgr`.

```bash
ceph mgr dump | jq -r .active_name
```
Example output:

```text
ncn-s002.zozbqp
```
2. Fail the active `ceph-mgr`.

```bash
ceph mgr fail $(ceph mgr dump | jq -r .active_name)
```
3. Confirm that the active role has moved to a different `ceph-mgr` container.

```bash
ceph mgr dump | jq -r .active_name
```
Example output:

```text
ncn-s001.qucrpr
```
4. Verify that the `radosgw` processes restarted, using the `ceph orch ps` command from the earlier check.
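For convenience, that command is repeated here:

```bash
ceph orch ps --daemon_type rgw
```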
At this point the processes should restart. If they do not, then attempt this remediation procedure a second time.