The following covers redeploying the Spire service and restoring the data.
(ncn-mw#) Uninstall the spire and cray-spire charts and wait for the resources to terminate.
Note the version of the spire chart that is currently deployed.
helm history -n spire spire
Example output:
REVISION  UPDATED                   STATUS    CHART         APP VERSION  DESCRIPTION
1         Wed Nov 15 12:41:47 2023  deployed  spire-2.14.2  0.12.2       Install complete
Uninstall the spire chart.
helm uninstall -n spire spire
Example output:
release "spire" uninstalled
Note the version of the cray-spire chart that is currently deployed.
helm history -n spire cray-spire
Example output:
REVISION  UPDATED                   STATUS    CHART             APP VERSION  DESCRIPTION
1         Wed Nov 15 12:41:50 2023  deployed  cray-spire-1.5.4  1.5.5        Install complete
Uninstall the cray-spire chart.
helm uninstall -n spire cray-spire
Example output:
release "cray-spire" uninstalled
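The two "note the version" steps above can also be captured in a variable instead of being read off the screen. The following is a minimal sketch, not part of the documented procedure: `deployed_chart` is a hypothetical helper name, and it assumes the `helm history` table layout shown in the example output, where the CHART column has the form name-version.

```shell
# deployed_chart NAME: reads `helm history` table output on stdin and prints
# the CHART value (name-version) of the revision whose STATUS is "deployed".
# Hypothetical helper; assumes the table layout shown in the example output.
deployed_chart() {
    awk -v name="$1" '
        /deployed/ {
            for (i = 1; i <= NF; i++) {
                # Accept only a field that starts with "<name>-" followed by a digit,
                # so "spire" does not match "cray-spire-1.5.4".
                if (index($i, name "-") == 1 && substr($i, length(name) + 2) ~ /^[0-9]/) {
                    print $i
                    exit
                }
            }
        }'
}

# Possible usage, run before uninstalling each chart:
#   helm history -n spire spire      | deployed_chart spire
#   helm history -n spire cray-spire | deployed_chart cray-spire
```

Recording the value before running helm uninstall matters because the release history is no longer available afterwards.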
Wait for the resources to terminate, delete the PVCs, and clean up spire-agent before reinstalling the charts.
Verify that only tpm-provisioner pods (or no pods) are running in the spire namespace.
watch "kubectl get pods -n spire"
Example output:
NAME                READY   STATUS    RESTARTS   AGE
tpm-provisioner-0   2/2     Running   0          17d
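For unattended runs, the visual check with watch can be replaced by a polling loop. The sketch below is one possible approach under stated assumptions: `only_tpm_provisioner_left` is a hypothetical helper that reads a pod listing on stdin and succeeds only when every pod name starts with tpm-provisioner (an empty listing also succeeds).

```shell
# Succeeds (exit 0) when every pod name on stdin starts with "tpm-provisioner",
# or when the listing is empty. Hypothetical helper for illustration only.
only_tpm_provisioner_left() {
    # Keep the NAME column, drop the header and tpm-provisioner pods,
    # then fail if anything is left over.
    ! awk 'NF && $1 != "NAME" {print $1}' | grep -v '^tpm-provisioner' | grep -q .
}

# Possible usage: poll until the Spire pods are gone.
#   until kubectl get pods -n spire --no-headers 2>/dev/null | only_tpm_provisioner_left; do
#       sleep 5
#   done
```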
Delete the Spire server PVCs.
kubectl get pvc -n spire | grep spire-server | awk '{print $1}' | xargs kubectl delete -n spire pvc
Example output:
persistentvolumeclaim "data-cray-spire-server-0" deleted
persistentvolumeclaim "data-cray-spire-server-1" deleted
persistentvolumeclaim "data-cray-spire-server-2" deleted
persistentvolumeclaim "spire-data-spire-server-0" deleted
persistentvolumeclaim "spire-data-spire-server-1" deleted
persistentvolumeclaim "spire-data-spire-server-2" deleted
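Because the delete command above acts on whatever the grep matches, it can be worth previewing the selection first. The following sketch assumes a hypothetical helper named `spire_server_pvcs`; it applies the same name match as the command above to a PVC listing on stdin and only prints the names, without deleting anything.

```shell
# Prints the names of PVCs whose name contains "spire-server".
# Hypothetical preview helper; it performs no deletion.
spire_server_pvcs() {
    awk '$1 ~ /spire-server/ {print $1}'
}

# Possible usage: review the list before running the delete command.
#   kubectl get pvc -n spire --no-headers | spire_server_pvcs
```

If the printed list matches the six PVCs in the example output, the delete command should remove exactly those claims.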
Clean up spire-agent.
for ncn in $(kubectl get nodes -o name | cut -d'/' -f2); do
    echo "Cleaning up NCN ${ncn}"
    ssh "${ncn}" systemctl stop spire-agent
    ssh "${ncn}" rm -v /var/lib/spire/data/keys.json /var/lib/spire/agent_svid.der /var/lib/spire/bundle.der
done
(ncn-mw#) Redeploy the spire and cray-spire charts and wait for the resources to start.
Follow the Redeploying a Chart procedure to redeploy the spire chart, using the following values:
Chart name: spire
Base manifest name: sysmgmt
Repeat the above procedure for the cray-spire chart, using the following values:
Chart name: cray-spire
Base manifest name: sysmgmt
Wait for the resources to start.
watch "kubectl get pods -n spire"
Example output:
NAME READY STATUS RESTARTS AGE
cray-spire-agent-7w6tc 1/1 Running 0 10m
cray-spire-agent-b6754 1/1 Running 0 10m
cray-spire-agent-pxqmq 1/1 Running 0 10m
cray-spire-agent-rxsbf 1/1 Running 0 10m
cray-spire-jwks-76f48d6484-b72jf 3/3 Running 0 10m
cray-spire-jwks-76f48d6484-v5b5n 3/3 Running 0 10m
cray-spire-jwks-76f48d6484-xgnxw 3/3 Running 0 10m
cray-spire-postgres-0 3/3 Running 0 10m
cray-spire-postgres-1 3/3 Running 0 10m
cray-spire-postgres-2 3/3 Running 0 10m
cray-spire-postgres-pooler-86797d8b9b-p2nkf 2/2 Running 0 10m
cray-spire-postgres-pooler-86797d8b9b-rfvr8 2/2 Running 0 10m
cray-spire-postgres-pooler-86797d8b9b-t9xwt 2/2 Running 0 10m
cray-spire-server-0 2/2 Running 0 10m
cray-spire-server-1 2/2 Running 0 10m
cray-spire-server-2 2/2 Running 0 10m
request-ncn-join-token-4hgnt 2/2 Running 0 15m
request-ncn-join-token-67qlz 2/2 Running 0 15m
request-ncn-join-token-75q2l 2/2 Running 0 15m
request-ncn-join-token-d24wv 2/2 Running 0 15m
request-ncn-join-token-q56zm 2/2 Running 0 15m
request-ncn-join-token-tmz4l 2/2 Running 0 15m
request-ncn-join-token-z87pl 2/2 Running 0 15m
spire-agent-42gb2 1/1 Running 0 15m
spire-agent-6lxv9 1/1 Running 0 15m
spire-agent-hhbqm 1/1 Running 0 15m
spire-agent-sztjm 1/1 Running 0 15m
spire-jwks-6cd9d5b5b5-6bmcb 3/3 Running 0 15m
spire-jwks-6cd9d5b5b5-gz2tl 3/3 Running 0 15m
spire-jwks-6cd9d5b5b5-pds25 3/3 Running 0 15m
spire-postgres-0 3/3 Running 0 15m
spire-postgres-1 3/3 Running 0 15m
spire-postgres-2 3/3 Running 0 15m
spire-postgres-pooler-75964fbc66-6hvvq 2/2 Running 0 15m
spire-postgres-pooler-75964fbc66-d52mg 2/2 Running 0 15m
spire-postgres-pooler-75964fbc66-nm6v6 2/2 Running 0 15m
spire-server-0 2/2 Running 0 15m
spire-server-1 2/2 Running 0 15m
spire-server-2 2/2 Running 0 15m
tpm-provisioner-0 2/2 Running 0 17d
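With this many pods, it is easy to miss one that is not fully ready. As a rough sketch, the hypothetical helper `not_ready_pods` below reads a pod listing on stdin and prints only the pods that are not in Running status or whose READY count is incomplete. It treats Completed pods as acceptable, which is an assumption and not something the procedure states.

```shell
# Prints the name of every pod that is not Running with all containers ready.
# Hypothetical helper; pods in "Completed" status are treated as fine (assumption).
not_ready_pods() {
    awk '$1 != "NAME" && NF >= 3 {
        if ($3 == "Completed") next
        split($2, ready, "/")            # READY column, e.g. "2/3"
        if ($3 != "Running" || ready[1] != ready[2]) print $1
    }'
}

# Possible usage: empty output means everything in the namespace is up.
#   kubectl get pods -n spire | not_ready_pods
```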
Rejoin the storage nodes to Spire and restart the spire-agent on all NCNs.
/opt/cray/platform-utils/spire/fix-spire-on-storage.sh
for i in $(kubectl get nodes -o name | cut -d"/" -f2) $(ceph node ls | jq -r '.[] | keys[]' | sort -u); do
    ssh "$i" systemctl start spire-agent
done
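After the restart, a quick check can confirm that the agent actually came up everywhere. The sketch below is an illustration only: `inactive_spire_agents` is a hypothetical helper, and it assumes passwordless ssh to each host, as the loops above already do. It prints the hosts on which systemctl is-active does not report active.

```shell
# Prints each host whose spire-agent unit is not reported as "active".
# Hypothetical helper; relies on ssh access to every host passed as an argument.
inactive_spire_agents() {
    for host in "$@"; do
        # is-active exits non-zero for inactive units, so ignore the exit code
        # and compare the printed state instead.
        state=$(ssh "$host" systemctl is-active spire-agent 2>/dev/null || true)
        [ "$state" = "active" ] || echo "$host"
    done
}

# Possible usage with the same host list as the restart loop:
#   inactive_spire_agents $(kubectl get nodes -o name | cut -d"/" -f2) \
#       $(ceph node ls | jq -r '.[] | keys[]' | sort -u)
```

Empty output suggests that every listed NCN has a running spire-agent.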