This page provides information about how to modify a tenant. Modifications that are supported are limited to:

- `childNamespaces`
- `xnames`

An example of a change to add a component name (xname) to a tenant:
```yaml
apiVersion: tapms.hpe.com/v1alpha2
kind: Tenant
metadata:
  name: vcluster-blue
spec:
  childnamespaces:
    - user
    - slurm
  tenantname: vcluster-blue
  tenantresources:
    - type: compute
      hsmgrouplabel: blue
      enforceexclusivehsmgroups: true
      xnames:
        - x0c3s5b0n0
        - x0c3s6b0n0
        - x0c3s7b0n0
```
(`ncn-mw#`) When the updated tenant custom resource is applied, `tapms` will determine any changes to the tenant, and reconcile any changes to `childNamespaces` and `xnames`.

```bash
kubectl apply -n tenants -f <tenant.yaml>
```
Example output:

```text
tenant.tapms.hpe.com/vcluster-blue configured
```
(`ncn-mw#`) It can take up to a minute for `tapms` to reconcile the change. The following command can be used to monitor the status of the tenant:

```bash
kubectl get tenant -n tenants vcluster-blue -o yaml
```
Example output:

```yaml
apiVersion: tapms.hpe.com/v1alpha2
kind: Tenant
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"tapms.hpe.com/v1alpha2","kind":"Tenant","metadata":{"annotations":{},"name":"vcluster-blue","namespace":"tenants"},"spec":{"childnamespaces":["user","slurm"],"tenantname":"vcluster-blue","tenantresources":[{"enforceexclusivehsmgroups":true,"hsmgrouplabel":"blue","type":"compute","xnames":["x0c3s5b0n0","x0c3s6b0n0"]}]}}
  creationTimestamp: "2022-08-23T18:37:25Z"
  finalizers:
  - tapms.hpe.com/finalizer
  generation: 3
  name: vcluster-blue
  namespace: tenants
  resourceVersion: "3157072"
  uid: 074b6db1-f504-4e9c-8245-259e9b22d2e6
spec:
  childnamespaces:
  - user
  - slurm
  state: Deployed
  tenantname: vcluster-blue
  tenantresources:
  - enforceexclusivehsmgroups: true
    hsmgrouplabel: blue
    type: compute
    xnames:
    - x0c3s5b0n0
    - x0c3s6b0n0
    - x0c3s7b0n0
status:
  childnamespaces:
  - vcluster-blue-user
  - vcluster-blue-slurm
  tenantresources:
  - enforceexclusivehsmgroups: true
    hsmgrouplabel: blue
    type: compute
    xnames:
    - x0c3s5b0n0
    - x0c3s6b0n0
    - x0c3s7b0n0
  uuid: 074b6db1-f504-4e9c-8245-259e9b22d2e6
```
(`ncn-mw#`) The `cray` command can now be used to display changes to the HSM group:

```bash
cray hsm groups describe blue --format toml
```
Example output:

```toml
label = "blue"
description = ""
exclusiveGroup = "tapms-exclusive-group-label"
tags = [ "vcluster-blue",]

[members]
ids = [ "x0c3s5b0n0", "x0c3s6b0n0", "x0c3s7b0n0"]
```
## `slurm` operator CR

To make changes to a Slurm tenant deployment, first update the Slurm custom resource file. The Slurm operator will attempt to reconcile the following changes:

- Changes to the `slurmctld` PVC.
- Changes to the `slurmctld` or `slurmdbd` deployments.

(`ncn-mw#`) To update the Slurm custom resource, apply the changed file:

```bash
kubectl apply -f <cluster>.yaml
```
Once the custom resource has been updated, the Slurm operator will attempt to update the relevant Kubernetes resources to reflect the changes.
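The usual Kubernetes tools can confirm that reconciliation happened; a minimal sketch, assuming the operator manages a deployment named `<cluster>-slurmctld` in the tenant namespace (list the deployments first if unsure):

```bash
# List the deployments the operator manages, then watch the rollout converge
kubectl -n <namespace> get deployments
kubectl -n <namespace> rollout status deployment/<cluster>-slurmctld
```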
If the list of nodes assigned to a tenant changes, the Slurm configuration must be updated by rerunning the Kubernetes job that generates it. This procedure can also be used to change Slurm configuration settings, or add new Slurm configuration files.
(`ncn-mw#`) Get the current configuration.

```bash
kubectl get configmap -n slurm-operator <cluster>-config-templates -o yaml >slurm-config-templates.yaml
```
(`ncn-mw#`) Make the desired edits.

To edit an existing configuration file (for example, `slurm.conf`):

```bash
yq r slurm-config-templates.yaml 'data."slurm.conf"' >slurm.conf
# Edit slurm.conf
yq w -i slurm-config-templates.yaml 'data."slurm.conf"' "$(cat slurm.conf)"
```
To add a new configuration file (for example, `topology.conf`):

```bash
yq w -i slurm-config-templates.yaml 'data."topology.conf"' "$(cat topology.conf)"
```
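For reference, a new file such as `topology.conf` follows the standard Slurm format; a minimal illustrative example (the switch and node names below are placeholders, not values from this system):

```text
# topology.conf - describes the network hierarchy to Slurm's topology/tree plugin
SwitchName=leaf1 Nodes=nid[000001-000004]
SwitchName=leaf2 Nodes=nid[000005-000008]
SwitchName=spine Switches=leaf[1-2]
```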
(`ncn-mw#`) Apply changes to the configuration templates.

```bash
kubectl apply -f slurm-config-templates.yaml
```
(`ncn-mw#`) Rerun the Slurm configuration job.

The `yq d` commands delete the auto-generated `spec.selector` and pod template metadata, which Kubernetes would otherwise reject when the job is recreated.

```bash
kubectl get job -n slurm-operator <cluster>-slurm-config -o yaml >slurm-config.yaml
yq d -i slurm-config.yaml spec.template.metadata
yq d -i slurm-config.yaml spec.selector
kubectl delete -f slurm-config.yaml
kubectl create -f slurm-config.yaml
```
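Completion of the recreated job can be awaited directly; a minimal sketch using `kubectl wait`:

```bash
# Block until the recreated job reports the Complete condition (up to 5 minutes)
kubectl wait --for=condition=complete job/<cluster>-slurm-config -n slurm-operator --timeout=300s
```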
(`ncn-mw#`) Once the job has completed, reconfigure Slurm.

```bash
SLURMCTLD_POD=$(kubectl get pod -n <namespace> -lapp.kubernetes.io/name=slurmctld -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n <namespace> ${SLURMCTLD_POD} -c slurmctld -- scontrol reconfigure
```
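To confirm that the change took effect, the running configuration can be dumped from `slurmctld`; a sketch, with `<setting>` standing in for whichever parameter was edited:

```bash
# Print the live slurmctld configuration and filter for the edited setting
kubectl exec -n <namespace> ${SLURMCTLD_POD} -c slurmctld -- scontrol show config | grep -i <setting>
```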
Any nodes that are moving from one tenant to another should have their records cleared in BOS. Clearing the desired state will have the secondary effect of causing BOS to shut down the node. This prevents a tenant from using a node that is booted with another tenant's boot artifacts or configuration.
This procedure requires the Cray CLI to be configured on the node where it is being performed. See Configure the Cray CLI.
(`ncn-mw#`) Clear the desired state.

Repeat this step for each component that is moving from one tenant to another.

```bash
cray bos v2 components update <xname> --enabled true --staged-state-session "" --staged-state-configuration "" \
    --staged-state-boot-artifacts-initrd "" --staged-state-boot-artifacts-kernel-parameters "" --staged-state-boot-artifacts-kernel "" \
    --desired-state-bss-token "" --desired-state-configuration "" --desired-state-boot-artifacts-initrd "" \
    --desired-state-boot-artifacts-kernel-parameters "" --desired-state-boot-artifacts-kernel ""
```
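When several components are moving at once, the same update can be wrapped in a loop; a sketch, with placeholder xnames:

```bash
# Clear staged and desired state for each component in the list
# (replace the placeholder xnames with the components being moved)
for XNAME in x0c3s5b0n0 x0c3s6b0n0; do
    cray bos v2 components update "${XNAME}" --enabled true --staged-state-session "" --staged-state-configuration "" \
        --staged-state-boot-artifacts-initrd "" --staged-state-boot-artifacts-kernel-parameters "" --staged-state-boot-artifacts-kernel "" \
        --desired-state-bss-token "" --desired-state-configuration "" --desired-state-boot-artifacts-initrd "" \
        --desired-state-boot-artifacts-kernel-parameters "" --desired-state-boot-artifacts-kernel ""
done
```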
(`ncn-mw#`) Wait until the component reaches a `stable` state.

Because the previous step cleared the desired state, the `stable` state indicates that the component is powered off.

```bash
cray bos v2 components describe <xname> --format json | jq .status.status
```
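This check can be polled until it reports `stable`; a small sketch, assuming `jq` is installed on the node:

```bash
# Poll BOS every 10 seconds until the component reaches the stable state
while [[ "$(cray bos v2 components describe <xname> --format json | jq -r .status.status)" != "stable" ]]; do
    sleep 10
done
echo "<xname> is stable (powered off)"
```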