The iSCSI SBPS solution requires worker nodes to be configured as iSCSI targets (servers), with the necessary provisioning, configuration, and enablement of the required components. This is done using CFS-based Ansible plays, which do the following things:

- Install and enable the SBPS Marshal Agent `systemd` service (`sbps-marshal`).
- Mount S3 (`boot-images` bucket) read-only with a new, dedicated S3 user (`ISCSI-SBPS`).
- Label the worker nodes (`iscsi=sbps`), to be consumed by the Goss test suite and LIO Metrics, for identification of the nodes on which these have to run.

Target worker node selection is via dynamic inventory stored in Hardware State Manager (HSM); the default is all worker nodes (the `Management_Worker` group).
The LUN projection is over either HSN or NMN.
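To preview which nodes the default selection will target, the worker nodes can be listed directly from HSM. A minimal sketch, assuming an authenticated `cray` CLI and `jq`:
cray hsm state components list --role Management --subrole Worker --type Node --format json | jq -r '.Components[].ID'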
(`ncn-mw#`) Use the following procedure (with examples) to perform worker node personalization for iSCSI SBPS.
Determine the latest commit in the csm-config-management
VCS repository.
COMMIT=$(/usr/share/doc/csm/scripts/operations/configuration/get_git.py | awk '{ print $NF }'); echo "$COMMIT"
Example output:
3bb1fce7d7de4c2cce237ab19dd6f239158d6d07
Create an input file for the CFS configuration.
cat << EOF > iscsi-sbps-targets-config.json
{
"layers": [
{
"clone_url": "https://api-gw-service-nmn.local/vcs/cray/csm-config-management.git",
"commit": "${COMMIT}",
"name": "iscsi-sbps-iscsi-targets-config",
"playbook": "config_sbps_iscsi_targets.yml"
}
]
}
EOF
Create a CFS configuration.
See CFS Configuration for more information.
cray cfs v3 configurations update iscsi-sbps-targets-config --file ./iscsi-sbps-targets-config.json --format json
Example output:
{
"last_updated": "2024-08-31T21:05:52Z",
"layers": [
{
"clone_url": "https://api-gw-service-nmn.local/vcs/cray/csm-config-management.git",
"commit": "3bb1fce7d7de4c2cce237ab19dd6f239158d6d07",
"name": "iscsi-sbps-iscsi-targets-config",
"playbook": "config_sbps_iscsi_targets.yml"
}
],
"name": "iscsi-sbps-targets-config"
}
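Optionally, confirm the stored configuration before creating a session; a minimal check, using the configuration name created above:
cray cfs v3 configurations describe iscsi-sbps-targets-config --format json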
Choose a name for the new CFS session.
SESSION=iscsi-config-$(date +%Y%m%d%H%M%S)
Create a CFS session with the new CFS configuration.
See CFS Sessions for more information.
cray cfs v3 sessions create --name "${SESSION}" --configuration-name iscsi-sbps-targets-config --format json
Example output:
{
"ansible": {
"config": "cfs-default-ansible-cfg",
"limit": "",
"passthrough": "",
"verbosity": 0
},
"configuration": {
"limit": "",
"name": "iscsi-sbps-targets-config"
},
"debug_on_failure": false,
"logs": "ara.cmn.surtur.hpc.amslabs.hpecorp.net/?label=iscsi-sbps-targets-config",
"name": "iscsi-config-20240831210845",
"status": {
"artifacts": [],
"session": {
"completion_time": null,
"ims_job": null,
"job": null,
"start_time": "2024-08-31T21:08:47",
"status": "pending",
"succeeded": "none"
}
},
"tags": {},
"target": {
"definition": "dynamic",
"groups": [],
"image_map": []
}
}
Note: To run node personalization on specific target worker nodes (instead of the default of all workers), use the CFS `--ansible-limit` option with the xnames of those worker nodes when creating the session.
cray cfs v3 sessions create --name <session_name> --configuration-name <config_name> --ansible-limit <xname1,xname2,...>
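For example (the session name and xnames below are illustrative placeholders; substitute real values):
cray cfs v3 sessions create --name iscsi-config-limited --configuration-name iscsi-sbps-targets-config --ansible-limit x3000c0s13b0n0,x3000c0s15b0n0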
Monitor the status of the CFS session. See View CFS Sessions for more information.
cray cfs v3 sessions describe "${SESSION}" --format toml
Example output:
debug_on_failure = false
logs = "ara.cmn.surtur.hpc.amslabs.hpecorp.net/?label=iscsi-sbps-targets-config"
name = "iscsi-config-20240831210845"
[ansible]
config = "cfs-default-ansible-cfg"
limit = ""
passthrough = ""
verbosity = 0
[configuration]
limit = ""
name = "iscsi-sbps-targets-config"
[status]
artifacts = []
[tags]
[target]
definition = "dynamic"
groups = []
image_map = []
[status.session]
completion_time = "2024-08-31T21:09:19"
job = "cfs-b840455c-e919-4656-b64a-44d433f082dc"
start_time = "2024-08-31T21:08:47"
status = "complete"
succeeded = "True"
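The session takes some time to finish. As a convenience, a minimal polling sketch (assuming `jq` is installed, and using the `status.session.status` field shown in the JSON output above) can wait for completion:
while [[ "$(cray cfs v3 sessions describe "${SESSION}" --format json | jq -r '.status.session.status')" != "complete" ]]; do sleep 10; done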
Make sure that all of the CFS playbooks completed successfully. Look for the message `All playbooks completed successfully` in the CFS Ansible container log.
See Troubleshooting for more information.
Get the name of the CFS pod.
CFSPOD=$(kubectl get pods --no-headers -o custom-columns=":metadata.name" -n services -l cfsession=${SESSION}); echo "$CFSPOD"
Example output:
cfs-b840455c-e919-4656-b64a-44d433f082dc-5d4jp
View the last line of the Ansible container log.
kubectl logs -n services "${CFSPOD}" -c ansible --tail=1
Example output if the CFS session was successful:
All playbooks completed successfully
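If the log does not end with that message, the full Ansible container log can be inspected for the failing play; for example:
kubectl logs -n services "${CFSPOD}" -c ansible | less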
(`ncn-w#`) Run the following checks on each worker node to verify that it was configured correctly.
Verify that the SBPS Marshal Agent is running without any errors.
systemctl status sbps-marshal
Beginning of example output:
● sbps-marshal.service - System service that manages Squashfs images projected via iSCSI for IMS, PE, and other ancillary images similar to PE.
Loaded: loaded (/usr/lib/systemd/system/sbps-marshal.service; enabled; vendor preset: disabled)
Active: active (running) since Thu 2024-08-22 11:57:48 UTC; 2 weeks 4 days ago
Main PID: 2878373 (sbps-marshal)
Tasks: 1
CGroup: /system.slice/sbps-marshal.service
└─ 2878373 /usr/lib/sbps-marshal/bin/python /usr/lib/sbps-marshal/bin/sbps-marshal
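To confirm that the agent is not logging errors, the service journal can also be checked; a minimal sketch using standard `systemd` tooling:
journalctl -u sbps-marshal -p err --no-pager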
Verify that the image/LUN mappings are created.
Check that the `targetcli ls` output shows `fileio` backing stores created for the `rootfs` images, along with the corresponding iSCSI LUNs (with the `rootfs` IDs mapped) and network portals created (HSN and NMN).
targetcli ls
Example output:
o- / ......................................................................................................................... [...]
o- backstores .............................................................................................................. [...]
| o- block .................................................................................................. [Storage Objects: 0]
| o- fileio ................................................................................................. [Storage Objects: 4]
| | o- 059c573240acc07 [/var/lib/cps-local/boot-images/1681d5d6-bfaf-41e6-9dd4-2cc355314476/rootfs (3.7GiB) write-thru activated]
| | | o- alua ................................................................................................... [ALUA Groups: 1]
| | | o- default_tg_pt_gp ....................................................................... [ALUA state: Active/optimized]
| | o- 553a9957f5efcbf [/var/lib/cps-local/boot-images/c434edc1-8080-43c5-8393-4ab831d9eb00/rootfs (2.1GiB) write-thru activated]
| | | o- alua ................................................................................................... [ALUA Groups: 1]
| | | o- default_tg_pt_gp ....................................................................... [ALUA state: Active/optimized]
| | o- ad7d9e9736a1b8a [/var/lib/cps-local/boot-images/f1d6c8fe-32e2-420a-a051-377e34a8bd8a/rootfs (3.6GiB) write-thru activated]
| | | o- alua ................................................................................................... [ALUA Groups: 1]
| | | o- default_tg_pt_gp ....................................................................... [ALUA state: Active/optimized]
| | o- cd5c6354b0d12d2 [/var/lib/cps-local/boot-images/6e993608-068a-4f6d-8d3f-b904ff7d3602/rootfs (3.7GiB) write-thru activated]
| | o- alua ................................................................................................... [ALUA Groups: 1]
| | o- default_tg_pt_gp ....................................................................... [ALUA state: Active/optimized]
| o- pscsi .................................................................................................. [Storage Objects: 0]
| o- ramdisk ................................................................................................ [Storage Objects: 0]
| o- rbd .................................................................................................... [Storage Objects: 0]
o- iscsi ............................................................................................................ [Targets: 1]
| o- iqn.2023-06.csm.iscsi:ncn-w001 .................................................................................... [TPGs: 1]
| o- tpg1 .................................................................................................. [gen-acls, no-auth]
| o- acls .......................................................................................................... [ACLs: 0]
| o- luns .......................................................................................................... [LUNs: 4]
| | o- lun0 [fileio/ad7d9e9736a1b8a (/var/lib/cps-local/boot-images/f1d6c8fe-32e2-420a-a051-377e34a8bd8a/rootfs) (default_tg_pt_gp)]
| | o- lun1 [fileio/cd5c6354b0d12d2 (/var/lib/cps-local/boot-images/6e993608-068a-4f6d-8d3f-b904ff7d3602/rootfs) (default_tg_pt_gp)]
| | o- lun2 [fileio/059c573240acc07 (/var/lib/cps-local/boot-images/1681d5d6-bfaf-41e6-9dd4-2cc355314476/rootfs) (default_tg_pt_gp)]
| | o- lun3 [fileio/553a9957f5efcbf (/var/lib/cps-local/boot-images/c434edc1-8080-43c5-8393-4ab831d9eb00/rootfs) (default_tg_pt_gp)]
| o- portals .................................................................................................... [Portals: 3]
| o- 10.102.193.24:3260 ............................................................................................... [OK]
| o- 10.252.1.9:3260 .................................................................................................. [OK]
| o- 10.253.0.2:3260 .................................................................................................. [OK]
o- loopback ......................................................................................................... [Targets: 0]
o- vhost ............................................................................................................ [Targets: 0]
o- xen-pvscsi ....................................................................................................... [Targets: 0]
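As an additional sanity check, verify that the target is actually listening on the portal addresses (iSCSI port 3260); a minimal sketch using standard tooling:
ss -ltn | grep 3260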
Verify that all the DNS SRV and A records are configured for all the intended worker nodes.
dig -t SRV +short _sbps-hsn._tcp.odin.hpc.amslabs.hpecorp.net _sbps-nmn._tcp.odin.hpc.amslabs.hpecorp.net
Example output:
1 0 3260 iscsi-server-id-001.hsn.odin.hpc.amslabs.hpecorp.net.
1 0 3260 iscsi-server-id-004.hsn.odin.hpc.amslabs.hpecorp.net.
1 0 3260 iscsi-server-id-006.hsn.odin.hpc.amslabs.hpecorp.net.
1 0 3260 iscsi-server-id-002.hsn.odin.hpc.amslabs.hpecorp.net.
1 0 3260 iscsi-server-id-003.hsn.odin.hpc.amslabs.hpecorp.net.
1 0 3260 iscsi-server-id-005.hsn.odin.hpc.amslabs.hpecorp.net.
1 0 3260 iscsi-server-id-004.nmn.odin.hpc.amslabs.hpecorp.net.
1 0 3260 iscsi-server-id-001.nmn.odin.hpc.amslabs.hpecorp.net.
1 0 3260 iscsi-server-id-005.nmn.odin.hpc.amslabs.hpecorp.net.
1 0 3260 iscsi-server-id-003.nmn.odin.hpc.amslabs.hpecorp.net.
1 0 3260 iscsi-server-id-002.nmn.odin.hpc.amslabs.hpecorp.net.
1 0 3260 iscsi-server-id-006.nmn.odin.hpc.amslabs.hpecorp.net.
dig -t A +short iscsi-server-id-005.hsn.odin.hpc.amslabs.hpecorp.net. iscsi-server-id-002.hsn.odin.hpc.amslabs.hpecorp.net. iscsi-server-id-004.hsn.odin.hpc.amslabs.hpecorp.net. iscsi-server-id-006.hsn.odin.hpc.amslabs.hpecorp.net. iscsi-server-id-001.hsn.odin.hpc.amslabs.hpecorp.net. iscsi-server-id-003.hsn.odin.hpc.amslabs.hpecorp.net. iscsi-server-id-006.nmn.odin.hpc.amslabs.hpecorp.net. iscsi-server-id-004.nmn.odin.hpc.amslabs.hpecorp.net. iscsi-server-id-001.nmn.odin.hpc.amslabs.hpecorp.net. iscsi-server-id-005.nmn.odin.hpc.amslabs.hpecorp.net. iscsi-server-id-003.nmn.odin.hpc.amslabs.hpecorp.net. iscsi-server-id-002.nmn.odin.hpc.amslabs.hpecorp.net.
Example output:
10.253.0.6
10.253.0.16
10.253.0.18
10.253.0.20
10.253.0.2
10.253.0.4
10.252.1.7
10.252.1.9
10.252.1.12
10.252.1.8
10.252.1.10
10.252.1.11
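The two lookups above can also be combined: resolve each SRV record and feed the returned target host names straight into an A-record query. A minimal sketch (the `odin.hpc.amslabs.hpecorp.net` domain matches the example system above; substitute the actual system domain):
for srv in _sbps-hsn._tcp.odin.hpc.amslabs.hpecorp.net _sbps-nmn._tcp.odin.hpc.amslabs.hpecorp.net; do dig -t SRV +short "$srv" | awk '{ print $4 }' | xargs dig -t A +short; done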
Run readiness checks.
After worker node personalization, run the Goss tests to perform additional verification and confirm the overall readiness of the iSCSI targets before booting compute nodes or UANs.
See GOSS tests for SBPS for more information.