This page describes how to install CSM applications and services into the CSM Kubernetes cluster.
SKIP IF ONLINE
- Online installs cannot upload container images to the bootstrap registry since it proxies an upstream source. DO NOT perform this procedure if the bootstrap registry was reconfigured to proxy from an upstream registry.
Verify that Nexus is running:
pit# systemctl status nexus
Verify that Nexus is ready. (Any HTTP response other than 200 OK indicates that Nexus is not ready.)
pit# curl -sSif http://localhost:8081/service/rest/v1/status/writable
Expected output looks similar to the following:
HTTP/1.1 200 OK
Date: Thu, 04 Feb 2021 05:27:44 GMT
Server: Nexus/3.25.0-03 (OSS)
X-Content-Type-Options: nosniff
Content-Length: 0
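If Nexus is still starting, the same readiness endpoint can be polled until it returns 200 OK. This is a minimal convenience sketch, not part of the documented procedure:
pit# until curl -sSif http://localhost:8081/service/rest/v1/status/writable >/dev/null 2>&1; do
       echo "Waiting for Nexus to become writable..."
       sleep 10
     done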
Load the skopeo image installed by the cray-nexus RPM:
pit# podman load -i /var/lib/cray/container-images/cray-nexus/skopeo-stable.tar quay.io/skopeo/stable
Use skopeo sync to upload container images from the CSM release:
pit# export CSM_RELEASE=csm-x.y.z
pit# podman run --rm --network host -v /var/www/ephemeral/${CSM_RELEASE}/docker/dtr.dev.cray.com:/images:ro quay.io/skopeo/stable sync \
--scoped --src dir --dest docker --dest-tls-verify=false --dest-creds admin:admin123 /images localhost:5000
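To spot-check that images reached the bootstrap registry, its catalog can be listed. This assumes the registry at localhost:5000 serves the standard Docker Registry v2 API over plain HTTP (consistent with --dest-tls-verify=false above) and that python3 is available for pretty-printing:
pit# curl -s http://localhost:5000/v2/_catalog | python3 -m json.tool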
The site-init secret in the loftsman namespace makes /var/www/ephemeral/prep/site-init/customizations.yaml available to product installers. The site-init secret should only be updated when the corresponding customizations.yaml data is changed, such as during system installation or upgrade. Create the site-init secret to contain /var/www/ephemeral/prep/site-init/customizations.yaml:
pit# kubectl create secret -n loftsman generic site-init --from-file=/var/www/ephemeral/prep/site-init/customizations.yaml
Expected output looks similar to the following:
secret/site-init created
NOTE
If the site-init secret already exists, then kubectl will error with a message similar to:
Error from server (AlreadyExists): secrets "site-init" already exists
In this case, delete the site-init secret and recreate it.
First delete it:
pit# kubectl delete secret -n loftsman site-init
Expected output looks similar to the following:
secret "site-init" deleted
Then recreate it:
pit# kubectl create secret -n loftsman generic site-init --from-file=/var/www/ephemeral/prep/site-init/customizations.yaml
Expected output looks similar to the following:
secret/site-init created
WARNING
If for some reason the system customizations need to be modified to complete product installation, administrators must first update customizations.yaml in the site-init Git repository, which may no longer be mounted on any cluster node, and then delete and recreate the site-init secret as shown below.
To read customizations.yaml from the site-init secret:
ncn# kubectl get secrets -n loftsman site-init -o jsonpath='{.data.customizations\.yaml}' | base64 -d > customizations.yaml
To delete the site-init secret:
ncn# kubectl -n loftsman delete secret site-init
To recreate the site-init secret:
ncn# kubectl create secret -n loftsman generic site-init --from-file=customizations.yaml
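To confirm the recreated secret matches the local file, the stored data can be compared against customizations.yaml; this is an illustrative check rather than a required step (the two checksums should be identical):
ncn# kubectl get secrets -n loftsman site-init -o jsonpath='{.data.customizations\.yaml}' | base64 -d | sha256sum
ncn# sha256sum customizations.yaml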
Deploy the corresponding key necessary to decrypt sealed secrets:
pit# /var/www/ephemeral/prep/site-init/deploy/deploydecryptionkey.sh
An error similar to the following may occur when deploying the key:
Error from server (NotFound): secrets "sealed-secrets-key" not found
W0304 17:21:42.749101 29066 helpers.go:535] --dry-run is deprecated and can be replaced with --dry-run=client.
secret/sealed-secrets-key created
Restarting sealed-secrets to pick up new keys
No resources found
This is expected and can safely be ignored.
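To verify that the key is now present and that the sealed-secrets controller restarted, the cluster can be searched for the relevant secret and pod. The cluster-wide queries below are used to avoid assuming which namespace sealed-secrets was deployed into:
pit# kubectl get secrets --all-namespaces | grep sealed-secrets-key
pit# kubectl get pods --all-namespaces | grep sealed-secrets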
Run install.sh to deploy CSM applications and services:
NOTE
install.sh requires various system configuration files which are expected to be found in the locations used in the preceding documentation; however, it needs to know SYSTEM_NAME in order to find the metallb.yaml and sls_input_file.json configuration files. Some commands will also need the CSM_RELEASE variable set.
pit# export SYSTEM_NAME=eniac
pit# export CSM_RELEASE=csm-x.y.z
pit# cd /var/www/ephemeral/$CSM_RELEASE
pit# ./install.sh
On success, install.sh will output OK to stderr and exit with status code 0, e.g.:
pit# ./install.sh
...
+ CSM applications and services deployed
install.sh: OK
In the event that install.sh
does not complete successfully, consult the
known issues below to resolve potential problems and then try
running install.sh
again.
Run ./lib/setup-nexus.sh
to configure Nexus and upload CSM RPM repositories,
container images, and Helm charts:
pit# ./lib/setup-nexus.sh
On success, setup-nexus.sh will output OK to stderr and exit with status code 0, e.g.:
pit# ./lib/setup-nexus.sh
...
+ Nexus setup complete
setup-nexus.sh: OK
In the event of an error, consult the known issues below to resolve potential problems and then try running setup-nexus.sh again. Note that subsequent runs of setup-nexus.sh may report FAIL when uploading duplicate assets. This is okay as long as setup-nexus.sh outputs setup-nexus.sh: OK and exits with status code 0.
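As an additional sanity check after setup-nexus.sh completes, the Nexus REST API can be queried from the PIT node to confirm that repositories were created; the python3 pretty-printing shown here is purely illustrative:
pit# curl -s https://packages.local/service/rest/v1/repositories | python3 -m json.tool | grep '"name"'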
First, verify that SLS properly reports all management NCNs in the system:
pit# ./lib/list-ncns.sh
On success, each management NCN will be output, e.g.:
pit# ./lib/list-ncns.sh
+ Getting admin-client-auth secret
+ Obtaining access token
+ Querying SLS
ncn-m001
ncn-m002
ncn-m003
ncn-s001
ncn-s002
ncn-s003
ncn-w001
ncn-w002
ncn-w003
If any management NCNs are missing from the output, take corrective action before proceeding.
Next, run lib/set-ncns-to-unbound.sh
to SSH to each management NCN and update
/etc/resolv.conf to use Unbound as the nameserver.
pit# ./lib/set-ncns-to-unbound.sh
NOTE
If passwordless SSH is not configured, the administrator will have to enter the corresponding password as the script attempts to connect to each NCN.
On success, the nameserver configuration in /etc/resolv.conf will be printed for each management NCN, e.g.:
pit# ./lib/set-ncns-to-unbound.sh
+ Getting admin-client-auth secret
+ Obtaining access token
+ Querying SLS
+ Updating ncn-m001
Password:
ncn-m001: nameserver 127.0.0.1
ncn-m001: nameserver 10.92.100.225
+ Updating ncn-m002
Password:
ncn-m002: nameserver 10.92.100.225
+ Updating ncn-m003
Password:
ncn-m003: nameserver 10.92.100.225
+ Updating ncn-s001
Password:
ncn-s001: nameserver 10.92.100.225
+ Updating ncn-s002
Password:
ncn-s002: nameserver 10.92.100.225
+ Updating ncn-s003
Password:
ncn-s003: nameserver 10.92.100.225
+ Updating ncn-w001
Password:
ncn-w001: nameserver 10.92.100.225
+ Updating ncn-w002
Password:
ncn-w002: nameserver 10.92.100.225
+ Updating ncn-w003
Password:
ncn-w003: nameserver 10.92.100.225
NOTE
The script connects to ncn-m001, which at this point is the PIT node, so its password may differ from that of the other NCNs.
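To double-check the result, the nameserver entries can be re-read from each NCN. This is a sketch using the example hostnames above; adjust the list for the actual system, and expect password prompts if passwordless SSH is not configured:
pit# for ncn in ncn-m002 ncn-m003 ncn-s001 ncn-s002 ncn-s003 ncn-w001 ncn-w002 ncn-w003; do
       echo "--- $ncn ---"
       ssh $ncn 'grep nameserver /etc/resolv.conf'
     done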
The cray command line interface (CLI) is a framework created to integrate all of the system management REST APIs into easily usable commands.
Later procedures in the installation process use the ‘cray’ CLI to interact with multiple services. The ‘cray’ CLI configuration needs to be initialized and the user running the procedure needs to be authorized. This section describes how to initialize the ‘cray’ CLI for use by a user and authorize that user.
The ‘cray’ CLI only needs to be initialized once per user on a node.
Unset the CRAY_CREDENTIALS environment variable, if previously set.
Some of the installation procedures leading up to this point use the CLI with a Kubernetes managed service
account normally used for internal operations. There is a procedure for extracting the OAUTH token for
this service account and assigning it to the CRAY_CREDENTIALS
environment variable to permit simple CLI operations.
ncn# unset CRAY_CREDENTIALS
Initialize the ‘cray’ CLI for the root account.
The ‘cray’ CLI needs to know what host to use to obtain authorization and what user is requesting authorization
so it can obtain an OAUTH token to talk to the API Gateway. This is accomplished by initializing the CLI
configuration. In this example, the vers
username and its password are used.
If LDAP configuration has been enabled, then use a valid LDAP account instead of ‘vers’.
If LDAP configuration was not enabled, or is not working, then a local Keycloak account can be created. See “Create a Service Account in Keycloak” in the HPE Cray EX System Administration Guide 1.4 S-80001.
ncn# cray init
When prompted, remember to substitute your username instead of ‘vers’. Expected output (including your typed input) should look similar to the following:
Cray Hostname: api-gw-service-nmn.local
Username: vers
Password:
Success!
Initialization complete.
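To confirm that the CLI can reach the API gateway with the new credentials, make a simple read-only call. The cray artifacts buckets list command is used here only as an example; any authenticated CLI query serves the same purpose:
ncn# cray artifacts buckets list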
If initialization fails in the above step, there are several common causes; for example, DNS or network problems involving api-gw-service-nmn.local may be preventing the CLI from reaching the API Gateway and Keycloak for authorization. While resolving these issues is beyond the scope of this section, clues to what is failing may be found by adding -vvvvv to the cray init ... commands.
Check for workarounds in the /opt/cray/csm/workarounds/after-sysmgmt-manifest directory within the CSM tarball. If there are any workarounds in that directory, run them now. Each workaround has instructions in its respective README.md file.
# Example
pit# ls /opt/cray/csm/workarounds/after-sysmgmt-manifest
If there is a workaround here, the output looks similar to the following:
CASMCMS-6857 CASMNET-423
NCNs require additional routing to enable access to Mountain, Hill and River Compute cabinets.
Requires:
To apply the routing, run:
ncn# /opt/cray/csm/workarounds/livecd-post-reboot/CASMINST-1570/CASMINST-1570.sh
NOTE
Currently, there is no automated procedure to apply routing changes to all worker NCNs to support Mountain, Hill and River Compute Node Cabinets.
The administrator should wait at least 15 minutes to let the various Kubernetes resources get initialized and started. Because there are a number of dependencies between them, some services are not expected to work immediately after the install script completes. After waiting, the administrator may start the CSM Validation process.
Once the CSM services are deemed healthy, the administrator may proceed to the final step of the CSM install: Reboot from the LiveCD to NCN.
The install.sh script changes cluster state and should not simply be rerun in the event of a failure without careful consideration of the specific error. It may be possible to resume installation from the last successful command executed by install.sh, but administrators will need to appropriately modify install.sh to pick up where the previous run left off. (Note: The install.sh script runs with set -x, so each command is printed to stderr prefixed with the expanded value of PS4, namely +.)
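Because the set -x trace is the main clue for determining where a failed run stopped, it can help to capture the installer output to a file; this is an optional convenience (the log path is arbitrary):
pit# ./install.sh 2>&1 | tee /root/csm-install.log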
Known potential issues with suggested fixes are listed below.
The following error may occur when running ./install.sh:
+ /var/www/ephemeral/csm-0.8.11/lib/wait-for-unbound.sh
+ kubectl wait -n services job cray-sls-init-load --for=condition=complete --timeout=20m
error: timed out waiting for the condition on jobs/cray-sls-init-load
Determine the name and state of the SLS init loader job pod:
pit# kubectl -n services get pods -l app=cray-sls-init-load
Expected output looks similar to the following:
NAME READY STATUS RESTARTS AGE
cray-sls-init-load-nh5k7 2/2 Running 0 21m
If the state is Running after the 20-minute timeout, it is likely that the SLS loader job is failing to ping the SLS S3 bucket due to a malformed URL. To verify this, inspect the logs of the cray-sls-init-load pod:
pit# kubectl -n services logs -l app=cray-sls-init-load -c cray-sls-loader
The symptom of this situation is the presence of something similar to the following in the output of the previous command:
{"level":"warn","ts":1612296611.2630196,"caller":"sls-s3-downloader/main.go:96","msg":"Failed to ping bucket.","error":"encountered error during head_bucket operation for bucket sls at https://: RequestError: send request failed\ncaused by: Head \"https:///sls\": http: no Host in request URL"}
This error is most likely intermittent, and deleting the cray-sls-init-load pod is expected to resolve the issue. The loader pod may need to be deleted multiple times until it succeeds.
pit# kubectl -n services delete pod cray-sls-init-load-nh5k7
Once the pod is deleted, verify that the new pod started by Kubernetes completes successfully. If it does not complete within a few minutes, inspect the logs for the pod. If it is still failing to ping the S3 bucket, delete the pod and try again.
pit# kubectl -n services get pods -l app=cray-sls-init-load
If the pod has completed successfully, the output looks similar to the following:
NAME READY STATUS RESTARTS AGE
cray-sls-init-load-pbzxv 0/2 Completed 0 55m
Since the above steps may need to be repeated several times before the pod succeeds, the following script can be used to automate the retry process:
pit# while true; do
       # Wait for a cray-sls-init-load pod to appear
       POD=""
       while [ -z "$POD" ]; do
         POD=$(kubectl get pods -n services --no-headers -o custom-columns=:metadata.name | grep "^cray-sls-init-load-")
       done
       GOOD=0
       # Watch the pod until it succeeds or logs the "no Host in request URL" error
       while true; do
         [ "$(kubectl get pod -n services $POD --no-headers -o custom-columns=:.status.phase)" = Succeeded ] && GOOD=1 && echo "Success!" && break
         kubectl logs -n services $POD --all-containers 2>/dev/null | grep -q "http: no Host in request URL" && break
         sleep 1
       done
       [ $GOOD -eq 1 ] && break
       # Delete the failed pod so Kubernetes starts a new one, then retry
       kubectl delete pod -n services $POD
     done
Once the loader job has completed successfully, running ./install.sh again is expected to succeed.
The infamous error not ready: https://packages.local indicates that, from the caller's perspective, Nexus is not ready to receive writes. However, it most likely indicates that a Nexus setup utility was unable to connect to Nexus via the packages.local name. Since the install does not attempt to connect to packages.local until Nexus has been successfully deployed, the error does not usually indicate something is actually wrong with Nexus. Instead, it is most commonly a network issue with name resolution (i.e., DNS), IP routes from the PIT node, switch misconfiguration, or Istio ingress.
Verify that packages.local resolves to ONLY the load balancer IP for the istio-ingressgateway service in the istio-system namespace, typically 10.92.100.71. If name resolution returns addresses on other networks (such as HMN) this must be corrected. Prior to DNS/DHCP hand-off to Unbound, these settings are controlled by dnsmasq. Unbound settings are based on SLS settings in sls_input_file.json and must be updated via the Unbound manager.
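For example, name resolution can be checked from the PIT node with nslookup (or dig/host, whichever is available); the address mentioned above, typically 10.92.100.71, should be the only result:
pit# nslookup packages.local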
If packages.local resolves to the correct address, verify basic connectivity using ping. If ping packages.local is unsuccessful, verify the IP routes from the PIT node to the NMN load balancer network; the typical ip route configuration is 10.92.100.0/24 via 10.252.0.1 dev vlan002. If pings are successful, check the status of Nexus by running curl -sS https://packages.local/service/rest/v1/status/writable. If the connection times out, there is a more complex connection issue; verify that switches are configured properly and BGP peering is operating correctly (see docs/400-SWITCH-BGP-NEIGHBORS.md for more information). Lastly, check Istio and OPA logs to see if connections to packages.local are not reaching Nexus, perhaps due to an authorization issue.
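The checks described above can be run from the PIT node roughly as follows; the route shown in the grep pattern is the typical configuration and may differ per system:
pit# ping -c 3 packages.local
pit# ip route show | grep 10.92.100
pit# curl -sSi https://packages.local/service/rest/v1/status/writable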
If https://packages.local/service/rest/v1/status/writable returns an HTTP code other than 200 OK, it indicates there is an issue with Nexus. Verify that the loftsman ship deployment of the nexus.yaml manifest was successful. If helm status -n nexus cray-nexus indicates the status is NOT deployed, then something is most likely wrong with the Nexus deployment and additional diagnosis is required. In this case, the current Nexus deployment probably needs to be uninstalled and the nexus-data PVC removed before attempting to deploy again.
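If the Nexus deployment does need to be removed, the cleanup would look roughly like the following sketch, based on the release, namespace, and PVC names mentioned above; note that deleting the nexus-data PVC destroys any data Nexus has stored, and the nexus.yaml manifest must then be redeployed:
pit# helm uninstall -n nexus cray-nexus
pit# kubectl delete pvc -n nexus nexus-data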
The following error may occur when running ./lib/setup-nexus.sh:
time="2021-02-07T20:25:22Z" level=info msg="Copying image tag 97/144" from="dir:/image/jettech/kube-webhook-certgen:v1.2.1" to="docker://registry.local/jettech/kube-webhook-certgen:v1.2.1"
Getting image source signatures
Copying blob sha256:f6e131d355612c71742d71c817ec15e32190999275b57d5fe2cd2ae5ca940079
Copying blob sha256:b6c5e433df0f735257f6999b3e3b7e955bab4841ef6e90c5bb85f0d2810468a2
Copying blob sha256:ad2a53c3e5351543df45531a58d9a573791c83d21f90ccbc558a7d8d3673ccfa
time="2021-02-07T20:25:33Z" level=fatal msg="Error copying tag \"dir:/image/jettech/kube-webhook-certgen:v1.2.1\": Error writing blob: Error initiating layer upload to /v2/jettech/kube-webhook-certgen/blobs/uploads/ in registry.local: received unexpected HTTP status: 200 OK"
+ return
This error is most likely intermittent and running ./lib/setup-nexus.sh
again is expected to succeed.
The following error may occur when running ./lib/setup-nexus.sh:
time="2021-02-23T19:55:54Z" level=fatal msg="Error copying tag \"dir:/image/grafana/grafana:7.0.3\": Error writing blob: Head \"https://registry.local/v2/grafana/grafana/blobs/sha256:cf254eb90de2dc62aa7cce9737ad7e143c679f5486c46b742a1b55b168a736d3\": dial tcp: lookup registry.local: no such host"
+ return
Or a similar error:
time="2021-03-04T22:45:07Z" level=fatal msg="Error copying ref \"dir:/image/cray/cray-ims-load-artifacts:1.0.4\": Error trying to reuse blob sha256:1ec886c351fa4c330217411b0095ccc933090aa2cd7ae7dcd33bb14b9f1fd217 at destination: Head \"https://registry.local/v2/cray/cray-ims-load-artifacts/blobs/sha256:1ec886c351fa4c330217411b0095ccc933090aa2cd7ae7dcd33bb14b9f1fd217\": dial tcp: lookup registry.local: Temporary failure in name resolution"
+ return
These errors are most likely intermittent and running ./lib/setup-nexus.sh
again is expected to succeed.