Non-Deterministic Unbound DNS Results Patch
This procedure covers applying a new version of the cray-dns-unbound Helm chart to enable the following setting in the configmap:
rrset-roundrobin: no
In April 2020, Unbound changed the default of this setting to yes, which has the effect of randomizing the order of records returned when more than one entry matches a query (as would be the case for PTR records, for example):
21 April 2020: George
- Change default value for 'rrset-roundrobin' to yes.
- Fix tests for new rrset-roundrobin default.
Some software is especially sensitive to this behavior and thus requires this setting to be no.
Update the cray-sysmgmt-health helm chart to address multiple alerts
Install/Update node_exporter on storage nodes
Update the cray-hms-hmnfd helm chart to include the timestamp fix
Update the cray-hms-hmcollector helm chart to include a fix to prevent crashing; its resource limits and requests can also be overridden via customizations.yaml.
Start a typescript to capture the commands and output from this procedure.
ncn-m001# script -af csm-update.$(date +%Y-%m-%d).txt
ncn-m001# export PS1='\u@\H \D{%Y-%m-%d} \t \w # '
NOTE: Installed CSM versions may be listed from the product catalog using:
ncn-m001# kubectl get cm cray-product-catalog -n services -o jsonpath='{.data.csm}' | yq r -j - | jq -r 'to_entries[] | .key' | sort -V
0.9.2
0.9.3
0.9.4
0.9.5
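The sort -V at the end of that pipeline orders version strings numerically rather than lexicographically, which matters once a version component reaches double digits. A quick local illustration using sample version strings (not real catalog output):

```shell
# Version sort: 0.9.10 correctly lands after 0.9.9, whereas a plain
# lexicographic sort would place 0.9.10 before 0.9.2.
printf '0.9.10\n0.9.2\n0.9.9\n' | sort -V
```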
Set CSM_DISTDIR to the directory of the extracted release distribution for CSM 0.9.6:
NOTE: Use the --no-same-owner and --no-same-permissions options to tar when extracting a CSM release distribution as root, to ensure that extracted files honor the current umask value.
If using a release distribution:
ncn-m001# tar --no-same-owner --no-same-permissions -zxvf csm-0.9.6.tar.gz
ncn-m001# export CSM_DISTDIR="$(pwd)/csm-0.9.6"
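The effect of --no-same-permissions can be demonstrated locally with a throwaway archive (the temporary paths below are illustrative only, not part of the procedure):

```shell
# Demonstrate that --no-same-permissions lets the current umask apply
# instead of the mode stored in the archive. Uses a throwaway temp dir.
workdir=$(mktemp -d)
cd "$workdir"
mkdir src
touch src/f
chmod 0777 src/f            # archive the file with wide-open permissions
tar -czf demo.tar.gz src
umask 022
mkdir out
cd out
tar --no-same-owner --no-same-permissions -xzf ../demo.tar.gz
stat -c '%a' src/f          # 755: the umask was honored, not the stored 777
```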
Set CSM_RELEASE_VERSION to the version reported by ${CSM_DISTDIR}/lib/version.sh:
ncn-m001# CSM_RELEASE_VERSION="$(${CSM_DISTDIR}/lib/version.sh --version)"
ncn-m001# echo $CSM_RELEASE_VERSION
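It can be worth guarding against an empty result here (for example, if version.sh failed silently); the `:` parameter-expansion idiom aborts with a clear message when the variable is unset or empty. A sketch, using a stand-in value rather than the real script output:

```shell
# Guard against an empty version variable before using it further.
# The value below is a stand-in for what lib/version.sh would report.
CSM_RELEASE_VERSION="0.9.6"
: "${CSM_RELEASE_VERSION:?CSM_RELEASE_VERSION is empty - check lib/version.sh}"
echo "$CSM_RELEASE_VERSION"
```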
Download and install/upgrade the latest documentation RPM. If this machine does not have direct internet access, the RPM must be downloaded externally and then copied to this machine before installing.
ncn-m001# rpm -Uvh https://storage.googleapis.com/csm-release-public/shasta-1.4/docs-csm/docs-csm-latest.noarch.rpm
It is important to first verify a healthy starting state. To do this, run the CSM validation checks. If any problems are found, correct them and re-run the appropriate validation checks before proceeding.
If no scaling changes are desired against the cray-hms-hmcollector deployment, or if none have been previously applied, then this section can be skipped; proceed to the Setup Nexus section.
Before upgrading services, customizations.yaml in the site-init secret in the loftsman namespace must be updated to apply or re-apply any manual scaling changes made to the cray-hms-hmcollector deployment. Follow the Adjust HM Collector resource limits and requests procedure for information about tuning and updating the resource limits used by the cray-hms-hmcollector deployment. The section Redeploy cray-hms-hmcollector with new resource limits and requests of the referenced procedure can be skipped, as the upgrade.sh script will re-deploy the collector with the new resource limit changes.
Run lib/setup-nexus.sh to configure Nexus and upload new CSM RPM repositories, container images, and Helm charts:
ncn-m001# cd "$CSM_DISTDIR"
ncn-m001# ./lib/setup-nexus.sh
On success, setup-nexus.sh will output OK on stderr and exit with status code 0, e.g.:
ncn-m001# ./lib/setup-nexus.sh
...
+ Nexus setup complete
setup-nexus.sh: OK
ncn-m001# echo $?
0
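Because a failed run can simply be retried, one common pattern is to loop until the script exits 0, with a cap on attempts. The sketch below substitutes a hypothetical stand-in function for the real ./lib/setup-nexus.sh so it can run anywhere:

```shell
# Retry-until-success pattern for an idempotent setup script.
# 'step' is a hypothetical stand-in for ./lib/setup-nexus.sh: here it
# fails twice and then succeeds, purely to exercise the loop.
attempts=0
step() { [ "$attempts" -ge 2 ]; }
until step; do
  attempts=$((attempts + 1))
  if [ "$attempts" -gt 5 ]; then
    echo "giving up after $attempts attempts"
    break
  fi
done
echo "retries needed: $attempts"
```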
In the event of an error, consult the known issues from the install documentation to resolve potential problems and then try running setup-nexus.sh again. Note that subsequent runs of setup-nexus.sh may report FAIL when uploading duplicate assets. This is OK as long as setup-nexus.sh outputs setup-nexus.sh: OK and exits with status code 0.
Set CSM_SCRIPTDIR to the scripts directory included in the docs-csm RPM for the CSM 0.9.6 upgrade:
ncn-m001# CSM_SCRIPTDIR=/usr/share/doc/metal/upgrade/0.9/csm-0.9.6/scripts
Execute the following script from the scripts directory determined in the previous step to update the NCNs.
ncn-m001# cd "$CSM_SCRIPTDIR"
ncn-m001# ./update-ncns.sh
Execute the following script from the scripts directory determined above to update BSS metadata.
ncn-m001# cd "$CSM_SCRIPTDIR"
ncn-m001# ./update-bss-metadata.sh
Run upgrade.sh to deploy upgraded CSM applications and services:
ncn-m001# cd "$CSM_DISTDIR"
ncn-m001# ./upgrade.sh
Instruct Kubernetes to gracefully restart the Unbound pods:
ncn-m001:~ # kubectl -n services rollout restart deployment cray-dns-unbound
deployment.apps/cray-dns-unbound restarted
ncn-m001:~ # kubectl -n services rollout status deployment cray-dns-unbound
Waiting for deployment "cray-dns-unbound" rollout to finish: 0 out of 3 new replicas have been updated...
Waiting for deployment "cray-dns-unbound" rollout to finish: 3 old replicas are pending termination...
Waiting for deployment "cray-dns-unbound" rollout to finish: 3 old replicas are pending termination...
Waiting for deployment "cray-dns-unbound" rollout to finish: 3 old replicas are pending termination...
Waiting for deployment "cray-dns-unbound" rollout to finish: 2 old replicas are pending termination...
Waiting for deployment "cray-dns-unbound" rollout to finish: 2 old replicas are pending termination...
Waiting for deployment "cray-dns-unbound" rollout to finish: 2 old replicas are pending termination...
Waiting for deployment "cray-dns-unbound" rollout to finish: 1 old replicas are pending termination...
Waiting for deployment "cray-dns-unbound" rollout to finish: 1 old replicas are pending termination...
deployment "cray-dns-unbound" successfully rolled out
Verify that the CSM version has been updated in the product catalog. Verify that the output of the following command includes version 0.9.6:
ncn-m001# kubectl get cm cray-product-catalog -n services -o jsonpath='{.data.csm}' | yq r -j - | jq -r 'to_entries[] | .key' | sort -V
0.9.2
0.9.3
0.9.4
0.9.5
0.9.6
Confirm that the import_date reflects the timestamp of the upgrade:
ncn-m001# kubectl get cm cray-product-catalog -n services -o jsonpath='{.data.csm}' | yq r - '"0.9.6".configuration.import_date'
Execute the following script from the scripts directory to verify that the images patched for CVE-2021-3711 have been updated:
ncn-m001# cd "$CSM_SCRIPTDIR"
ncn-m001# ./validate_versions.sh
Confirm node-exporter is running on each storage node. This command can be run from a master node. Validate that the result contains go_goroutines (replace ncn-s001 below with each storage node):
ncn-m# curl -s http://ncn-s001:9100/metrics | grep go_goroutines | grep -v "#"
go_goroutines 8
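The second grep in that pipeline simply drops the HELP/TYPE comment lines that node_exporter emits. Here is the same filtering applied to an inline sample payload (a stand-in for a live scrape, so it runs without network access):

```shell
# 'metrics' is an inline stand-in for a live scrape of ncn-s001:9100.
metrics='# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 8'
result=$(echo "$metrics" | grep go_goroutines | grep -v "#")
echo "$result"   # only the sample line survives the comment filter
```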
Confirm manifests were updated on each master node (repeat on each master node):
ncn-m# grep bind /etc/kubernetes/manifests/*
kube-controller-manager.yaml: - --bind-address=0.0.0.0
kube-scheduler.yaml: - --bind-address=0.0.0.0
Confirm the updated sysmgmt-health chart was deployed. This command can be executed on a master node; confirm the cray-sysmgmt-health-0.12.6 chart version:
ncn-m# helm ls -n sysmgmt-health
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
cray-sysmgmt-health sysmgmt-health 2 2021-09-10 16:45:12.00113666 +0000 UTC deployed cray-sysmgmt-health-0.12.6 8.15.4
Confirm updates to BSS for the cloud-init runcmd.
IMPORTANT: Ensure you replace XNAME with the correct xname in the examples below (executing the /opt/cray/platform-utils/getXnames.sh script on a master node will display the xnames):
Example for a master node; this should be checked for each master node. Validate that the three sed commands are returned in the output.
ncn-m# cray bss bootparameters list --name XNAME --format=json | jq '.[]|."cloud-init"."user-data"'
{
"hostname": "ncn-m001",
"local_hostname": "ncn-m001",
"mac0": {
"gateway": "10.252.0.1",
"ip": "",
"mask": "10.252.2.0/23"
},
"runcmd": [
"/srv/cray/scripts/metal/install-bootloader.sh",
"/srv/cray/scripts/metal/set-host-records.sh",
"/srv/cray/scripts/metal/set-dhcp-to-static.sh",
"/srv/cray/scripts/metal/set-dns-config.sh",
"/srv/cray/scripts/metal/set-ntp-config.sh",
"/srv/cray/scripts/metal/set-bmc-bbs.sh",
"/srv/cray/scripts/metal/disable-cloud-init.sh",
"/srv/cray/scripts/common/update_ca_certs.py",
"/srv/cray/scripts/common/kubernetes-cloudinit.sh",
"sed -i 's/--bind-address=127.0.0.1/--bind-address=0.0.0.0/' /etc/kubernetes/manifests/kube-controller-manager.yaml",
"sed -i '/--port=0/d' /etc/kubernetes/manifests/kube-scheduler.yaml",
"sed -i 's/--bind-address=127.0.0.1/--bind-address=0.0.0.0/' /etc/kubernetes/manifests/kube-scheduler.yaml"
]
}
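The three sed entries in runcmd rewrite the static pod manifests at boot so that the controller manager and scheduler listen on all interfaces, which is exactly what the earlier bind-address check validates. Their effect can be shown on a sample manifest line:

```shell
# Apply the same substitution the runcmd entry performs, on a sample
# manifest line, to show the resulting bind address.
line='    - --bind-address=127.0.0.1'
echo "$line" | sed 's/--bind-address=127.0.0.1/--bind-address=0.0.0.0/'
```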
Example for a storage node; this should be checked for each storage node. Validate that the zypper command is returned in the output.
ncn-m001:~ # cray bss bootparameters list --name XNAME --format=json | jq '.[]|."cloud-init"."user-data"'
{
"hostname": "ncn-s001",
"local_hostname": "ncn-s001",
"mac0": {
"gateway": "10.252.0.1",
"ip": "",
"mask": "10.252.2.0/23"
},
"runcmd": [
"/srv/cray/scripts/metal/install-bootloader.sh",
"/srv/cray/scripts/metal/set-host-records.sh",
"/srv/cray/scripts/metal/set-dhcp-to-static.sh",
"/srv/cray/scripts/metal/set-dns-config.sh",
"/srv/cray/scripts/metal/set-ntp-config.sh",
"/srv/cray/scripts/metal/set-bmc-bbs.sh",
"/srv/cray/scripts/metal/disable-cloud-init.sh",
"/srv/cray/scripts/common/update_ca_certs.py",
"zypper --no-gpg-checks in -y https://packages.local/repository/csm-sle-15sp2/x86_64/cray-node-exporter-1.2.2.1-1.x86_64.rpm"
]
}
NOTE: The following verification steps require SMA and SAT to be installed.
Once the patch is installed, the missing-timestamp fix can be validated by taking the following steps:
kubectl -n sma get pods | grep kafka
cluster-kafka-0 2/2 Running 1 30d
cluster-kafka-1 2/2 Running 1 26d
cluster-kafka-2 2/2 Running 0 73d
kubectl -n sma exec -it <pod_id> -- /bin/bash
Change to the 'bin' directory inside the kafka pod.
Execute the following command in the kafka pod to run a kafka consumer app:
./kafka-console-consumer.sh --bootstrap-server=localhost:9092 --topic=cray-hmsstatechange-notifications
sat status | grep Compute | grep Ready
...
| x1003c7s7b1n1 | Node | 2023 | Ready | OK | True | X86 | Mountain | Compute | Sling |
NOTE: All examples below will use the node seen in the above example.
TOKEN=$(curl -k -s -S -d grant_type=client_credentials \
    -d client_id=admin-client \
    -d client_secret="$(kubectl get secrets admin-client-auth -o jsonpath='{.data.client-secret}' | base64 -d)" \
    https://api-gw-service-nmn.local/keycloak/realms/shasta/protocol/openid-connect/token | jq -r '.access_token')
curl -s -k -H "Authorization: Bearer ${TOKEN}" -X POST -d '{"Components":["x1003c7s7b1n1"],"State":"Ready"}' https://api-gw-service-nmn.local/apis/hmnfd/hmi/v1/scn
{"Components":["x1003c7s7b1n1"],"Flag":"OK","State":"Ready","Timestamp":"2021-09-13T13:00:00"}
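The fix being verified is that the notification JSON now carries a populated Timestamp field. A quick local check of that shape, applied to the sample response above as inline data:

```shell
# Confirm the SCN response JSON carries a populated Timestamp field,
# using the sample response from the step above as inline data.
resp='{"Components":["x1003c7s7b1n1"],"Flag":"OK","State":"Ready","Timestamp":"2021-09-13T13:00:00"}'
if echo "$resp" | grep -Eq '"Timestamp":"[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9:]+"'; then
  echo "timestamp present"
else
  echo "timestamp missing"
fi
```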
IMPORTANT: Wait at least 15 minutes after upgrade.sh completes to allow the various Kubernetes resources to initialize and start.
Run the following validation checks to ensure that everything is still working properly after the upgrade:
Other health checks may be run as desired.
Remember to exit your typescript.
ncn-m001# exit
It is recommended to save the typescript file for later reference.