containerd

containerd is a container runtime (systemd service) that runs on the host. It is used to run containers on the Kubernetes platform.
Topics:

/var/lib/containerd filling up
containerd slow startup after reboot
Restarting containerd on a worker NCN

/var/lib/containerd filling up

In older versions of containerd, there are cases where the /var/lib/containerd directory fills up. In the event that this occurs, the following steps can be used to remediate the issue.
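A quick way to confirm that /var/lib/containerd is what is consuming space (a minimal check; output and paths will vary by node):

ncn-mw# df -h /var/lib/containerd
ncn-mw# du -sh /var/lib/containerd/* 2>/dev/null | sort -h | tail -5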
Restart containerd on the NCN.
Whether or not this resolves the space issue, if this is a worker NCN, then also see the notes in the Restarting containerd on a worker NCN section for subsequent steps that must be taken after containerd is restarted.
ncn-mw# systemctl restart containerd
This often frees up space in /var/lib/containerd; if it does not, then proceed to the next step.
Restart kubelet on the NCN.
ncn-mw# systemctl restart kubelet
If restarting kubelet fails to free up space in /var/lib/containerd, then proceed to the next step.
Prune unused container images on the NCN.
ncn-mw# crictl rmi --prune
Any unused images will be pruned; the optional listing below shows which images remain. If disk space issues persist in /var/lib/containerd, then proceed to the next step to reboot the NCN.
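Optionally, list the images still stored on the node to confirm what remains after pruning (standard crictl listing):

ncn-mw# crictl images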
Reboot the NCN.
Follow the Reboot NCNs process to properly cordon/drain the NCN and reboot.
Generally this will free up space in /var/lib/containerd.
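Once the node has rebooted and rejoined the cluster, disk usage can be re-checked to confirm that space was freed (same check as at the start of this procedure):

ncn-mw# df -h /var/lib/containerd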
containerd slow startup after reboot

On some systems, containerd can take a very long time to start after a reboot. This has been fixed in CSM 1.3, but if this symptom occurs,
messages indicating cleaning up dead shim may appear in the containerd log files. For example:
Aug 26 00:06:10 ncn-w001 containerd[4005]: time="2022-08-26T00:06:10.522985910Z" level=info msg="cleaning up dead shim"
Aug 26 00:06:10 ncn-w001 containerd[4005]: time="2022-08-26T00:06:10.556198245Z" level=warning msg="cleanup warnings time=\"2022-08-26T00:06:10Z\" level=info msg=\"starting signal loop\" namespace=k8s.io pid=57627\n"
Aug 26 00:06:10 ncn-w001 containerd[4005]: time="2022-08-26T00:06:10.556821890Z" level=info msg="loading plugin \"io.containerd.monitor.v1.cgroups\"..." type=io.containerd.monitor.v1
Aug 26 00:06:10 ncn-w001 containerd[4005]: time="2022-08-26T00:06:10.557576058Z" level=info msg="loading plugin \"io.containerd.service.v1.tasks-service\"..." type=io.containerd.service.v1
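One way to check for these messages is to search the containerd journal on the affected NCN (assuming containerd logs to the systemd journal, as in the example above):

ncn-mw# journalctl -u containerd | grep "cleaning up dead shim" | tail -5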
Instructing containerd to remove shims when containerd is being shut down will correct this issue.
Edit the /srv/cray/resources/common/containerd/containerd.service file.
Add the following ExecStopPost line to the file:
ExecStopPost=/usr/bin/find /run/containerd/io.containerd.runtime.v2.task -name address -type f -delete
After the edit, the relevant section of the file should look similar to the following:
[Service]
ExecStartPre=/sbin/modprobe overlay && /sbin/modprobe br_netfilter
ExecStart=/usr/local/bin/containerd
ExecStopPost=/usr/bin/find /run/containerd/io.containerd.runtime.v2.task -name address -type f -delete
Restart=always
RestartSec=5
Delegate=yes
Restart containerd to pick up the change.
If this is a worker NCN, then also see the notes in the Restarting containerd on a worker NCN section for subsequent steps that must be taken after containerd is restarted.
ncn-mw# systemctl restart containerd
NOTE: If this NCN is rebuilt, then this change will need to be re-applied (until the system is upgraded to CSM 1.3).
Restarting containerd on a worker NCN

If the containerd service is restarted on a worker node, then this may cause the sonar-jobs-watcher pod running on that worker node to fail when attempting to clean up unneeded containers. The following procedure determines whether this is the case and remediates it, if necessary.
Retrieve the name of the sonar-jobs-watcher pod that is running on this worker node.
Modify the following command to specify the name of the specific worker NCN where containerd was restarted.
ncn-mw# kubectl get pods -l name=sonar-jobs-watcher -n services -o wide | grep ncn-w001
Example output:
sonar-jobs-watcher-8z6th 1/1 Running 0 95d 10.42.0.6 ncn-w001 <none> <none>
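For convenience in the following steps, the pod name can also be captured in a shell variable (a minimal sketch; ncn-w001 and the POD_NAME variable are placeholders for illustration):

ncn-mw# POD_NAME=$(kubectl get pods -l name=sonar-jobs-watcher -n services \
            --field-selector spec.nodeName=ncn-w001 -o jsonpath='{.items[0].metadata.name}')
ncn-mw# echo "${POD_NAME}"

The captured name can then be substituted into the kubectl logs command in the next step.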
View the logs for the sonar-jobs-watcher pod.
Modify the following command to specify the pod name identified in the previous step.
ncn-mw# kubectl logs sonar-jobs-watcher-8z6th -n services
Example output:
Found pod cray-dns-unbound-manager-1631116980-h69h6 with restartPolicy 'Never' and container 'manager' with status 'Completed'
All containers of job pod cray-dns-unbound-manager-1631116980-h69h6 has completed. Killing istio-proxy (1c65dacb960c2f8ff6b07dfc9780c4621beb8b258599453a08c246bbe680c511) to allow job to complete
time="2021-09-08T16:44:18Z" level=fatal msg="failed to connect: failed to connect, make sure you are running as root and the runtime has been started: context deadline exceeded"
When this occurs, pods that are running on the node where containerd was restarted may remain in a NotReady state and never complete.
Check if pods are stuck in a NotReady state.
ncn-mw# kubectl get pods -o wide -A | grep NotReady
Example output:
services cray-dns-unbound-manager-1631116980-h69h6 1/2 NotReady 0 10m 10.42.0.100 ncn-w001 <none> <none>
If any pods are stuck in a NotReady state, then restart the sonar-jobs-watcher daemonset to resolve the issue.
ncn-mw# kubectl rollout restart -n services daemonset sonar-jobs-watcher
Expected output:
daemonset.apps/sonar-jobs-watcher restarted
Verify that the restart completed successfully.
ncn-mw# kubectl rollout status -n services daemonset sonar-jobs-watcher
Expected output:
daemon set "sonar-jobs-watcher" successfully rolled out
Once the sonar-jobs-watcher pods restart, any pods that were in a NotReady state should complete within about a minute.
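To confirm this, repeat the earlier check after a minute or so; no output means no pods remain in a NotReady state:

ncn-mw# kubectl get pods -o wide -A | grep NotReady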
To learn more in general about containerd, refer to the containerd documentation.