One of the key ways to ensure the availability of CSM critical services during the failure of nodes or a single rack is to spread the replicas of these services across multiple zones and racks.

Some of the CSM critical services already have pod affinities established to spread their replicas across nodes. Because the nodes picked by the Kubernetes scheduler can still be on the same rack, it is necessary to include a topology constraint for these services; this helps the Kubernetes scheduler distribute the replicas across zones. This is achieved using the Kubernetes "Topology Spread Constraints" feature, applied through a new Kyverno cluster policy named `insert-labels-topology-constraints`.

This policy applies to all the Deployments and StatefulSets that have been identified as critical services for Rack Resiliency.
For more information on Kubernetes topology spread constraints, see Topology Spread Constraints.
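Topology spread constraints rely on each node carrying a `topology.kubernetes.io/zone` label. As an optional sanity check (not part of the documented procedure), the zone assignment of the nodes can be listed with a command along the following lines:

```bash
# List the Kubernetes nodes with the zone each node belongs to.
# The -L flag adds a column for the given label key.
kubectl get nodes -L topology.kubernetes.io/zone
```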
(`ncn-mw#`) View the policy and see the services for which the policy has been enabled.

```bash
kubectl get clusterpolicy insert-labels-topology-constraints -o yaml
```
Example output:

```yaml
kind: ClusterPolicy
metadata:
  annotations:
    ...
spec:
  admission: true
  background: true
  emitWarning: false
  rules:
  - match:
      any:
      - resources:
          kinds:
          - Deployment
          - StatefulSet
          names:
          - cray-dns-powerdns
          - coredns
          - sealed-secrets
          - cray-ceph-csi-cephfs-provisioner
          - cray-ceph-csi-rbd-provisioner
          - cray-activemq-artemis-operator-controller-manager
          - cray-dvs-mqtt-ss
          - cray-hmnfd-bitnami-etcd
          ...
    exclude: # Temporarily exclude resources in all namespaces
      any:
      - resources:
          namespaces:
          - "*"
    mutate:
      patchStrategicMerge:
        spec:
          template:
            metadata:
              labels:
                rrflag: rr-{{ request.object.metadata.name }}
            spec:
              +(topologySpreadConstraints):
              - labelSelector:
                  matchLabels:
                    rrflag: rr-{{ request.object.metadata.name }}
                maxSkew: 1
                topologyKey: topology.kubernetes.io/zone
                whenUnsatisfiable: ScheduleAnyway
    name: insert-rack-res-label
    skipBackgroundRequests: true
  validationFailureAction: Audit
...
```
Note: The `exclude` section of the policy shown in the example above is present when the policy is initially applied using Helm. If Rack Resiliency is enabled, then the `exclude` section is removed from the policy (using a post-install/post-upgrade hook).
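To confirm whether the `exclude` section is still present (that is, whether the policy mutation is still effectively suppressed), a query like the following can be used. The JSONPath expression is illustrative and assumes the rule of interest is the first rule in the policy:

```bash
# Print the exclude block of the first rule, if it exists.
# Empty output means the exclude section has been removed and the policy is active.
kubectl get clusterpolicy insert-labels-topology-constraints \
  -o jsonpath='{.spec.rules[0].exclude}'
```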
The Kyverno policy is applied to each service the first time that service restarts after the policy is in effect for that service. There are specific times when these restarts are expected to happen:

- At the end of a CSM upgrade from 1.6 to 1.7, the critical services (which are either Deployments or StatefulSets) are restarted only if Rack Resiliency is enabled and the Kyverno policy is applied.
- During a fresh install of CSM 1.7, as part of Configure Administrative Access, the critical services are restarted in the Restart Rack Resiliency critical services step. The critical services are restarted only when Rack Resiliency is enabled and the Kyverno policy is applied.
- Administrators are able to modify the Rack Resiliency critical services. If an administrator does nothing but remove critical services, then no service restarts are necessary. However, if any critical service is added or modified, then a service restart is performed as part of that procedure.

For more detailed information, see Manage Critical Services.
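The documented procedures above handle these restarts automatically. For illustration only, a manual restart of a single critical service would also cause the mutation to be applied on the next update of its specification; the service name and namespace below are assumptions, not a prescribed step:

```bash
# Restart one critical service so the Kyverno mutation is applied to its pods.
# The namespace is an assumption; adjust it to where the workload actually runs.
kubectl -n services rollout restart statefulset cray-bss-bitnami-etcd

# Wait for the rollout to complete.
kubectl -n services rollout status statefulset cray-bss-bitnami-etcd
```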
The policy engine adds the topology constraint to the Deployment or StatefulSet specifications of the critical services:
```yaml
spec:
  +(topologySpreadConstraints):
  - labelSelector:
      matchLabels:
        rrflag: rr-{{ request.object.metadata.name }}
    maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
```
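After a service restarts, the injected constraint can be seen in the rendered workload specification. A minimal sketch, assuming the StatefulSet runs in the `services` namespace:

```bash
# Show the topology spread constraints that the policy injected into the pod template.
# The namespace is an assumption; adjust it for the service being inspected.
kubectl -n services get statefulset cray-bss-bitnami-etcd \
  -o jsonpath='{.spec.template.spec.topologySpreadConstraints}'
```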
During the restart, the label mentioned in the policy is added to all the pods belonging to the specific critical service Deployment or StatefulSet that is being restarted.
For example, for the StatefulSet `cray-bss-bitnami-etcd`, the policy adds the `rrflag` label as shown below:
```text
cray-bss-bitnami-etcd-0 2/2 Running 0 4d12h app.kubernetes.io/component=etcd,app.kubernetes.io/instance=cray-hms-bss,app.kubernetes.io/managed-by=Helm,app.kubernetes.io/name=cray-bss-bitnami-etcd,app.kubernetes.io/version=3.5.21,apps.kubernetes.io/pod-index=0,controller-revision-hash=cray-bss-bitnami-etcd-855488694f,helm.sh/chart=etcd-11.2.3,rrflag=rr-cray-bss-bitnami-etcd,security.istio.io/tlsMode=istio,service.istio.io/canonical-name=cray-bss-bitnami-etcd,service.istio.io/canonical-revision=3.5.21,statefulset.kubernetes.io/pod-name=cray-bss-bitnami-etcd-0
cray-bss-bitnami-etcd-1 2/2 Running 0 4d12h app.kubernetes.io/component=etcd,app.kubernetes.io/instance=cray-hms-bss,app.kubernetes.io/managed-by=Helm,app.kubernetes.io/name=cray-bss-bitnami-etcd,app.kubernetes.io/version=3.5.21,apps.kubernetes.io/pod-index=1,controller-revision-hash=cray-bss-bitnami-etcd-855488694f,helm.sh/chart=etcd-11.2.3,rrflag=rr-cray-bss-bitnami-etcd,security.istio.io/tlsMode=istio,service.istio.io/canonical-name=cray-bss-bitnami-etcd,service.istio.io/canonical-revision=3.5.21,statefulset.kubernetes.io/pod-name=cray-bss-bitnami-etcd-1
cray-bss-bitnami-etcd-2 2/2 Running 0 4d12h app.kubernetes.io/component=etcd,app.kubernetes.io/instance=cray-hms-bss,app.kubernetes.io/managed-by=Helm,app.kubernetes.io/name=cray-bss-bitnami-etcd,app.kubernetes.io/version=3.5.21,apps.kubernetes.io/pod-index=2,controller-revision-hash=cray-bss-bitnami-etcd-855488694f,helm.sh/chart=etcd-11.2.3,rrflag=rr-cray-bss-bitnami-etcd,security.istio.io/tlsMode=istio,service.istio.io/canonical-name=cray-bss-bitnami-etcd,service.istio.io/canonical-revision=3.5.21,statefulset.kubernetes.io/pod-name=cray-bss-bitnami-etcd-2
```
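The listing above resembles `kubectl get pods --show-labels` output. A sketch of how to produce a similar view for one critical service, assuming its pods run in the `services` namespace:

```bash
# List the pods of one critical service together with their labels.
# The rrflag selector matches the label added by the policy; the namespace is an assumption.
kubectl -n services get pods -l rrflag=rr-cray-bss-bitnami-etcd --show-labels
```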
When the scheduler launches the pods, it uses the `rrflag` label selector to spread them across racks. During the failure of an NCN node or rack, when service replicas are restarted by Kubernetes, the policy helps spread the restarted replicas across zones.
Note: The policy is configured to allow replicas to run on the same rack if other racks are not available because of a failure (`whenUnsatisfiable: ScheduleAnyway`). This ensures that the number of running replicas remains the same during a failure.
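To observe how the replicas are actually distributed after a restart, the node placement of the pods can be inspected and compared against the zone labels of those nodes; a minimal sketch, again assuming the `services` namespace:

```bash
# Show which node each replica landed on; compare the nodes against their zone labels.
kubectl -n services get pods -l rrflag=rr-cray-bss-bitnami-etcd -o wide
```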