etag
iSCSI based boot content projection which is also known as “Scalable Boot Content Projection” (SBPS) for rootfs
and PE
images
is supported in CSM 1.6.0. SBPS fails if the rootfs
image manifest to be projected doesn’t have an etag
field or the etag
field
is present but set to NULL. Due to this problem with etags
, the SBPS core service, the SBPS Marshal agent
, crashes.
Issue can be identified with the symptoms below:
Compute/UAN node (iSCSI Initiator) boot fails with the following messages in the console log:
2024-11-12T20:07:50.17117 [ 100.620313] dracut-pre-mount[3827]: iscsiadm: No active sessions.
2024-11-12T20:13:00.44921 [ 402.905538] dracut-pre-mount[5379]: ls: cannot access '/dev/mapper/594b7c79-9cb5-48f6-bceb-18f3596f0dbb_rootfs': No such file or directory
2024-11-12T20:13:00.44925 [ 402.924254] dracut-pre-mount[3814]: Warning: sbps-add-content.sh failed.
2024-11-12T20:13:00.44926 [ 402.952550] dracut-pre-mount[3809]: Warning: Unable to prepare squashfs file /tmp/cps/rootfs, dropping to debug.
SBPS Marshal agent crashes with the following messages on the worker node (iSCSI Target):
ncn-w002:~ # systemctl status sbps-marshal.service
× sbps-marshal.service - System service that manages Squashfs images projected via iSCSI for IMS, PE, and other ancillary images similar to PE.
Loaded: loaded (/usr/lib/systemd/system/sbps-marshal.service; enabled; preset: disabled)
Active: failed (Result: exit-code) since Tue 2024-11-12 19:20:05 UTC; 18h ago
Duration: 2.232s
Process: 1239406 ExecStart=/usr/lib/sbps-marshal/bin/sbps-marshal (code=exited, status=1/FAILURE)
Main PID: 1239406 (code=exited, status=1/FAILURE)
CPU: 1.394s
Nov 12 19:20:04 ncn-w002 sbps-marshal[1239406]: File "/usr/lib/sbps-marshal/bin/sbps-marshal", line 8, in <module>
Nov 12 19:20:04 ncn-w002 sbps-marshal[1239406]: sys.exit(main())
Nov 12 19:20:04 ncn-w002 sbps-marshal[1239406]: ^^^^^^
Nov 12 19:20:04 ncn-w002 sbps-marshal[1239406]: File "/usr/lib/sbps-marshal/lib/python3.12/site-packages/bin/agent.py", line 270, in main
Nov 12 19:20:04 ncn-w002 sbps-marshal[1239406]: rootfs_s3_etag = artifact["link"]["etag"]
Nov 12 19:20:04 ncn-w002 sbps-marshal[1239406]: ~~~~~~~~~~~~~~~~^^^^^^^^
Nov 12 19:20:04 ncn-w002 sbps-marshal[1239406]: KeyError: 'etag'
Nov 12 19:20:05 ncn-w002 systemd[1]: sbps-marshal.service: Main process exited, code=exited, status=1/FAILURE
Nov 12 19:20:05 ncn-w002 systemd[1]: sbps-marshal.service: Failed with result 'exit-code'.
Nov 12 19:20:05 ncn-w002 systemd[1]: sbps-marshal.service: Consumed 1.394s CPU time.
The complete log for the SBPS Marshal agent systemd
service can be obtained from journalctl
log as below:
# journalctl -u sbps-marshal.service
Step 1: Identify the manifest(s) which are missing the etag
field
This can be fetched from worker node as s3:/boot-images
is mounted onto the worker node. The following command will
search the mounted directory for manifests that are missing the etag
field:
for f in /var/lib/cps-local/boot-images/*/manifest.json; do \
echo $f etag=$(cat $f | jq '.artifacts[] | select(.type == "application/vnd.cray.image.rootfs.squashfs").link.etag' ); \
done | grep etag=null
An example run of the command listing the problematic manifest.json
is as follows:
# for f in /var/lib/cps-local/boot-images/*/manifest.json; do \
echo $f etag=$(cat $f | jq '.artifacts[] | select(.type == "application/vnd.cray.image.rootfs.squashfs").link.etag' ); \
done | grep etag=null
/var/lib/cps-local/boot-images/3f89b76d-7578-4ac3-8891-0f8b4335b06d/manifest.json etag=null
Looking at the manifest.json
:
ncn-w006:~ # cat /var/lib/cps-local/boot-images/3f89b76d-7578-4ac3-8891-0f8b4335b06d/manifest.json
{
"artifacts" : [
{
"link" : {
"path" : "s3://boot-images/3f89b76d-7578-4ac3-8891-0f8b4335b06d/compute.squashfs",
"type" : "s3"
},
"md5" : "fa69a6665124629bfb1d74bfa3e4b7de",
"type" : "application/vnd.cray.image.rootfs.squashfs"
}
],
"created" : "20240815093713",
"version" : "1.0"
}
The “link” section has no etag
field.
Step 2: Update the manifest with etag
and update the same in s3
from master node
Example manifest.json
update with etag
obtained from s3
and update the same in s3
:
ncn-m001:~ # MANIFEST=$(cray ims images describe "3f89b76d-7578-4ac3-8891-0f8b4335b06d" --format json | jq -r .link.path)
ncn-m001:~ # echo $MANIFEST
s3://boot-images/3f89b76d-7578-4ac3-8891-0f8b4335b06d/manifest.json
ncn-m001:~ # cray artifacts get $(echo $MANIFEST | sed -e 's;^s3://;;g' -e 's;boot-images/;boot-images ;g' ) manifest.json
ncn-m001:~ # cray artifacts describe $(cat manifest.json | jq -r '.artifacts[] | select(.type == "application/vnd.cray.image.rootfs.squashfs").link.path' | sed -e 's;^s3://;;g' -e 's;boot-images/;boot-images ;g' ) --format json
{
"artifact": {
"AcceptRanges": "bytes",
"LastModified": "2024-08-15T14:45:58+00:00",
"ContentLength": 22095331328,
"ETag": "\"f4153fe0a18cfb2205fb002d81a23ab2-2634\"",
"ContentType": "binary/octet-stream",
"Metadata": {
"md5sum": "fa69a6665124629bfb1d74bfa3e4b7de"
}
}
}
ncn-m001:~ # cray artifacts describe $(cat manifest.json | jq -r '.artifacts[] | select(.type == "application/vnd.cray.image.rootfs.squashfs").link.path' | sed -e 's;^s3://;;g' -e 's;boot-images/;boot-images ;g' ) --format json | jq -r '.artifact.ETag' | tr -d '"'
f4153fe0a18cfb2205fb002d81a23ab2-2634
Now update the manifest.json
file to add the etag
obtained in the “link” section.
ncn-m001:~ # cat manifest.json
{
"artifacts" : [
{
"link" : {
"path" : "s3://boot-images/3f89b76d-7578-4ac3-8891-0f8b4335b06d/compute.squashfs",
"etag" : "f4153fe0a18cfb2205fb002d81a23ab2-2634",
"type" : "s3"
},
"md5" : "fa69a6665124629bfb1d74bfa3e4b7de",
"type" : "application/vnd.cray.image.rootfs.squashfs"
}
],
"created" : "20240815093713",
"version" : "1.0"
}
Now update the etag
in S3 as well:
ncn-m001:~ # cray artifacts create $( echo $MANIFEST | sed -e 's;^s3://;;g' -e 's;boot-images/;boot-images ;g' ) manifest.json
Step 3: Restart SBPS Marshal agent systemd
service on each impacted worker node and ensure it’s running successfully
# systemctl restart sbps-marshal.service
# systemctl status sbps-marshal.service