`containerd`

`containerd` is a container runtime that runs on the host as a systemd service. Kubernetes uses it to run containers.

- `/var/lib/containerd` filling up
- `containerd` slow startup after reboot
- Restarting `containerd` on a worker NCN

`/var/lib/containerd` filling up

In older versions of `containerd`, the `/var/lib/containerd` directory can fill up. If this occurs, use the following steps to remediate the issue.

1. Restart `containerd` on the NCN.

Whether or not this resolves the space issue, if this is a worker NCN, then also see the notes in the Restarting `containerd` on a worker NCN section for the steps that must be taken after `containerd` is restarted.
```
ncn-mw# systemctl restart containerd
```
This often frees up space in `/var/lib/containerd`. If it does not, proceed to the next step.

2. Restart `kubelet` on the NCN.
```
ncn-mw# systemctl restart kubelet
```
If restarting `kubelet` does not free up space in `/var/lib/containerd`, proceed to the next step.

3. Prune unused container images on the NCN.
```
ncn-mw# crictl rmi --prune
```
This removes all unused images. If disk space issues in `/var/lib/containerd` persist, proceed to the next step and reboot the NCN.

4. Reboot the NCN.

Follow the Reboot NCNs procedure to cordon and drain the NCN before rebooting it. Rebooting generally frees up space in `/var/lib/containerd`.
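Whether to move on to the next step can be checked with a small shell sketch. It assumes GNU coreutils `df` (for `--output=pcent`) and that `crictl` is on the `PATH`. The function names and the 90% threshold are hypothetical, chosen only for illustration; nothing in the sketch restarts services or deletes images.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for deciding whether to continue to the next
# remediation step. They only read state; they change nothing.

# Print the use percentage (digits only) of the filesystem holding $1.
# Defaults to /var/lib/containerd.
usage_pct() {
    local dir="${1:-/var/lib/containerd}"
    # GNU df prints a header line ("Use%") followed by a value such as " 87%".
    df --output=pcent "$dir" | tail -n 1 | tr -dc '0-9'
}

# Succeed (exit 0) if the filesystem holding $1 is at least $2 percent full.
# Exit 1 if it is below the threshold, 2 if usage could not be read.
# The default threshold of 90 is an arbitrary example value.
is_full() {
    local pct
    pct="$(usage_pct "$1")"
    # An empty value means df failed (for example, the path does not exist).
    [ -n "$pct" ] || return 2
    [ "$pct" -ge "${2:-90}" ]
}

# Print the number of distinct image IDs known to the container runtime.
# Comparing this before and after "crictl rmi --prune" shows how many
# images the prune removed.
image_count() {
    crictl images -q | sort -u | wc -l | tr -d ' '
}
```

For example, run `usage_pct` and `image_count` before and after `crictl rmi --prune`; `is_full /var/lib/containerd 90` keeps succeeding while the filesystem is still at least 90% full.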

`containerd` slow startup after reboot

On some systems, `containerd` can take a very long time to start after a reboot. This is fixed in CSM 1.3. On earlier versions, when this symptom occurs, the `containerd` logs may contain `cleaning up dead shim` messages. For example:

```
Aug 26 00:06:10 ncn-w001 containerd[4005]: time="2022-08-26T00:06:10.522985910Z" level=info msg="cleaning up dead shim"
Aug 26 00:06:10 ncn-w001 containerd[4005]: time="2022-08-26T00:06:10.556198245Z" level=warning msg="cleanup warnings time=\"2022-08-26T00:06:10Z\" level=info msg=\"starting signal loop\" namespace=k8s.io pid=57627\n"
Aug 26 00:06:10 ncn-w001 containerd[4005]: time="2022-08-26T00:06:10.556821890Z" level=info msg="loading plugin \"io.containerd.monitor.v1.cgroups\"..." type=io.containerd.monitor.v1
Aug 26 00:06:10 ncn-w001 containerd[4005]: time="2022-08-26T00:06:10.557576058Z" level=info msg="loading plugin \"io.containerd.service.v1.tasks-service\"..." type=io.containerd.service.v1
```
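To check whether a node shows this symptom, the `cleaning up dead shim` messages can be counted. This is a minimal sketch; the function name is hypothetical, and it assumes the `containerd` service logs are available through `journalctl`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count "cleaning up dead shim" messages in
# containerd log text read from standard input.
count_dead_shims() {
    # grep -c prints 0 and exits non-zero when nothing matches;
    # "|| true" keeps the function's exit status at zero in that case.
    grep -c 'msg="cleaning up dead shim"' || true
}
```

For example, on an NCN: `ncn-mw# journalctl -u containerd -b --no-pager | count_dead_shims` counts the messages logged since the last boot. A large count after a slow `containerd` start is consistent with this issue.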

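Instructing `containerd` to remove shims when it is shut down corrects this issue. The change below does this by deleting the shim `address` files under `/run/containerd/io.containerd.runtime.v2.task` after the service stops.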

1. Edit the `/srv/cray/resources/common/containerd/containerd.service` file.

2. Add the following `ExecStopPost` line to the `[Service]` section of the file:

```
ExecStopPost=/usr/bin/find /run/containerd/io.containerd.runtime.v2.task -name address -type f -delete
```

After the edit, the relevant section of the file should look similar to the following:

```
[Service]
ExecStartPre=/sbin/modprobe overlay && /sbin/modprobe br_netfilter
ExecStart=/usr/local/bin/containerd
ExecStopPost=/usr/bin/find /run/containerd/io.containerd.runtime.v2.task -name address -type f -delete
Restart=always
RestartSec=5
Delegate=yes
```

3. Restart `containerd` to pick up the change.

If this is a worker NCN, then also see the notes in the Restarting `containerd` on a worker NCN section for the steps that must be taken after `containerd` is restarted.
```
ncn-mw# systemctl restart containerd
```

NOTE: If this NCN is rebuilt before the system is upgraded to CSM 1.3, then this change must be reapplied.
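Because the change may have to be reapplied after a rebuild, it can help to apply it idempotently. The following sketch inserts the `ExecStopPost` line directly after the `ExecStart=` line only if it is not already present, and keeps a `.bak` copy of the original file. It assumes GNU `sed` (for `-i` and the one-line form of the `a` command); the function name is hypothetical. If `systemctl` then warns that the unit file changed on disk, run `systemctl daemon-reload` before restarting `containerd`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: add the ExecStopPost cleanup line to a containerd
# unit file if it is not already there.

STOP_LINE='ExecStopPost=/usr/bin/find /run/containerd/io.containerd.runtime.v2.task -name address -type f -delete'

add_execstoppost() {
    local unit="${1:-/srv/cray/resources/common/containerd/containerd.service}"
    # -x matches the whole line, -F treats the pattern as a fixed string.
    if grep -qxF "$STOP_LINE" "$unit"; then
        echo "already present in $unit"
        return 0
    fi
    # Append the line after "ExecStart=" (the pattern does not match
    # "ExecStartPre="), keeping a backup of the original as $unit.bak.
    sed -i.bak "/^ExecStart=/a $STOP_LINE" "$unit"
    # Fail if the line was not added (for example, no ExecStart= line).
    grep -qxF "$STOP_LINE" "$unit"
}
```

Running `add_execstoppost` a second time only reports that the line is already present, so it is safe to repeat after a rebuild.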

Restarting `containerd` on a worker NCN

Restarting the `containerd` service on a worker node may cause the `sonar-jobs-watcher` pod running on that node to fail when it attempts to clean up unneeded containers. The following procedure determines whether this has happened and remediates it, if necessary.

1. Retrieve the name of the `sonar-jobs-watcher` pod that is running on this worker node.

Modify the following command to specify the name of the worker NCN where `containerd` was restarted.
```
ncn-mw# kubectl get pods -l name=sonar-jobs-watcher -n services -o wide | grep ncn-w001
```
Example output:
```
sonar-jobs-watcher-8z6th   1/1     Running   0          95d   10.42.0.6    ncn-w001   <none>           <none>
```

2. View the logs for the `sonar-jobs-watcher` pod.

Modify the following command to specify the pod name identified in the previous step.
```
ncn-mw# kubectl logs sonar-jobs-watcher-8z6th -n services
```
Example output:
```
Found pod cray-dns-unbound-manager-1631116980-h69h6 with restartPolicy 'Never' and container 'manager' with status 'Completed'
All containers of job pod cray-dns-unbound-manager-1631116980-h69h6 has completed. Killing istio-proxy (1c65dacb960c2f8ff6b07dfc9780c4621beb8b258599453a08c246bbe680c511) to allow job to complete
time="2021-09-08T16:44:18Z" level=fatal msg="failed to connect: failed to connect, make sure you are running as root and the runtime has been started: context deadline exceeded"
```

When the watcher fails to connect like this, pods running on the node where `containerd` was restarted may remain in a `NotReady` state and never complete.

3. Check whether any pods are stuck in a `NotReady` state.
```
ncn-mw# kubectl get pods -o wide -A | grep NotReady
```
Example output:
```
services      cray-dns-unbound-manager-1631116980-h69h6             1/2   NotReady  0     10m   10.42.0.100  ncn-w001  <none>      <none>
```

4. If any pods are stuck in a `NotReady` state, then restart the `sonar-jobs-watcher` daemonset to resolve the issue.
```
ncn-mw# kubectl rollout restart -n services daemonset sonar-jobs-watcher
```
Expected output:
```
daemonset.apps/sonar-jobs-watcher restarted
```

5. Verify that the restart completed successfully.
```
ncn-mw# kubectl rollout status -n services daemonset sonar-jobs-watcher
```
Expected output:
```
daemon set "sonar-jobs-watcher" successfully rolled out
```

Once the `sonar-jobs-watcher` pods restart, any pods that were in a `NotReady` state should complete within about a minute.
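The read-only checks in this procedure can be scripted. The sketch below assumes `kubectl` is configured on the NCN; the function names are hypothetical. It looks up the watcher pod with a `spec.nodeName` field selector instead of `grep`, matches the fatal `failed to connect` log message shown above, and lists `NotReady` pods on the node. It does not restart anything; the daemonset restart stays a manual step.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for the sonar-jobs-watcher checks. They only read
# cluster state.

# Print the name of the sonar-jobs-watcher pod scheduled on node $1.
watcher_pod_on_node() {
    kubectl get pods -n services -l name=sonar-jobs-watcher \
        --field-selector "spec.nodeName=$1" \
        -o jsonpath='{.items[*].metadata.name}'
}

# Succeed if the log text on stdin contains the fatal connection error.
watcher_cannot_connect() {
    grep -q 'level=fatal msg="failed to connect'
}

# Read "kubectl get pods -o wide -A" output on stdin and print
# namespace/name for each NotReady pod on node $1. The node is matched
# against every field after STATUS so that a RESTARTS value such as
# "2 (5m ago)" does not shift the columns.
notready_pods_on_node() {
    awk -v node="$1" '
        $4 == "NotReady" {
            for (i = 5; i <= NF; i++) {
                if ($i == node) { print $1 "/" $2; break }
            }
        }'
}

# Example on an NCN (the node name is an example):
#   node=ncn-w001
#   pod="$(watcher_pod_on_node "$node")"
#   kubectl logs -n services "$pod" | watcher_cannot_connect && echo "watcher cannot reach containerd"
#   kubectl get pods -o wide -A | notready_pods_on_node "$node"
```

If the last command prints any pods, restart the daemonset as described above.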

For general information about `containerd`, refer to the `containerd` documentation.