Cray System Management Documentation > Cray System Management (CSM) Administration Guide > Boot Orchestration Service (BOS) > Redeploy the iPXE and TFTP Services

Redeploy the iPXE and TFTP Services

Redeploy the iPXE and TFTP services if a pod with a ceph-fs Process Virtualization Service (PVS) on a Kubernetes worker node is causing a HEALTH_WARN error.

Resolve issues with ceph-fs and ceph-mds by restarting the iPXE and TFTP services. The Ceph cluster will return to a healthy state after this procedure.

Prerequisites

This procedure requires administrative privileges.

Procedure

(ncn-mw#) Find the iPXE and TFTP deployments.

kubectl get deployments -n services|egrep 'tftp|ipxe'

Example output:

cray-ipxe-aarch64                           1/1     1            1           22m
cray-ipxe-x86-64                            1/1     1            1           22m
cray-tftp                                   3/3     3            3           28m

(ncn-mw#) Delete the deployments for the iPXE and TFTP services.

kubectl -n services delete deployment cray-tftp
kubectl -n services delete deployment cray-ipxe-aarch64
kubectl -n services delete deployment cray-ipxe-x86-64

Check the status of Ceph.

Ceph commands need to be run on ncn-m001. If a health warning is shown after checking the status, the ceph-mds daemons will need to be restarted on the manager nodes.

(ncn-m001#) Check the health of the Ceph cluster.

ceph -s

Example output:

  cluster:
    id:     bac74735-d804-49f3-b920-cd615b18316b
    health: HEALTH_WARN
            1 filesystem is degraded

  services:
    mon: 3 daemons, quorum ncn-m001,ncn-m002,ncn-m003 (age 13d)
    mgr: ncn-m001(active, since 24h), standbys: ncn-m002, ncn-m003
    mds: cephfs:1/1 {0=ncn-m002=up:reconnect} 2 up:standby
    osd: 60 osds: 60 up (since 4d), 60 in (since 4d)
    rgw: 5 daemons active (ncn-s001.rgw0, ncn-s002.rgw0, ncn-s003.rgw0, ncn-s004.rgw0, ncn-s005.rgw0)

  data:
    pools:   13 pools, 1664 pgs
    objects: 2.47M objects, 9.3 TiB
    usage:   26 TiB used, 78 TiB / 105 TiB avail
    pgs:     1664 active+clean

  io:
    client:   990 MiB/s rd, 111 MiB/s wr, 2.76k op/s rd, 1.03k op/s wr

(ncn-m001#) Obtain more information on the health of the cluster.

ceph health detail

Example output:

HEALTH_WARN 1 filesystem is degraded
FS_DEGRADED 1 filesystem is degraded
    fs cephfs is degraded

(ncn-m001#) Show the status of all CephFS components.

ceph fs status

Example output:

cephfs - 9 clients
======
+------+-----------+----------+----------+-------+-------+
| Rank |   State   |   MDS    | Activity |  dns  |  inos |
+------+-----------+----------+----------+-------+-------+
|  0   | reconnect | ncn-m002 |          | 11.0k |   74  |
+------+-----------+----------+----------+-------+-------+
+-----------------+----------+-------+-------+
|       Pool      |   type   |  used | avail |
+-----------------+----------+-------+-------+
| cephfs_metadata | metadata |  780M | 20.7T |
|   cephfs_data   |   data   |  150M | 20.7T |
+-----------------+----------+-------+-------+
+-------------+
| Standby MDS |
+-------------+
|   ncn-m003  |
|   ncn-m001  |

(ncn-m001#) Restart the ceph-mds service.

This step should only be done if a health warning is shown in the previous substeps.
```
for i in 1 2 3 ; do ansible ncn-m00$i -m shell -a "systemctl restart ceph-mds@ncn-m00$i"; done
```

(ncn-m001#) Failover the ceph-mds daemon.

This step should only be done if a health warning still exists after restarting the ceph-mds service.

ceph mds fail ncn-m002

The initial output will display the following:

cephfs - 0 clients
======
+------+--------+----------+----------+-------+-------+
| Rank | State  |   MDS    | Activity |  dns  |  inos |
+------+--------+----------+----------+-------+-------+
|  0   | **rejoin** | ncn-m003 |          |    0  |    0  |
+------+--------+----------+----------+-------+-------+
+-----------------+----------+-------+-------+
|       Pool      |   type   |  used | avail |
+-----------------+----------+-------+-------+
| cephfs_metadata | metadata |  781M | 20.7T |
|   cephfs_data   |   data   |  117M | 20.7T |
+-----------------+----------+-------+-------+
+-------------+
| Standby MDS |
+-------------+
|   ncn-m002  |
|   ncn-m001  |
+-------------+

The rejoin status should turn to active:

cephfs - 7 clients
======
+------+--------+----------+---------------+-------+-------+
| Rank | State  |   MDS    |    Activity   |  dns  |  inos |
+------+--------+----------+---------------+-------+-------+
|  0   | **active** | ncn-m003 | Reqs:    0 /s | 11.1k |  193  |
+------+--------+----------+---------------+-------+-------+
+-----------------+----------+-------+-------+
|       Pool      |   type   |  used | avail |
+-----------------+----------+-------+-------+
| cephfs_metadata | metadata |  781M | 20.7T |
|   cephfs_data   |   data   |  117M | 20.7T |
+-----------------+----------+-------+-------+
+-------------+
| Standby MDS |
+-------------+
|   ncn-m002  |
|   ncn-m001  |
+-------------+

(ncn-m001#) Ensure that the service is deleted along with the associated PVC.

The output for the command below should empty. If an output is displayed, such as in the example below, then the resources have not been deleted.
```
kubectl get pvc -n services|grep tftp
```
Example of resources not being deleted in returned output:
```
cray-tftp-shared-pvc Bound pvc-315d08b0-4d00-11ea-ad9d-b42e993b7096 5Gi RWX ceph-cephfs-external 29m
```
Optional: Use the following command to delete the associated PVC.
```
kubectl -n services delete pvc PVC_NAME
```
(ncn-m001#) Deploy the TFTP service.

Wait for the TFTP pods to come online and verify that the PVC was created.
```
loftsman helm upgrade cray-tftp loftsman/cray-tftp
```
(ncn-m001#) Deploy the iPXE service.

This may take a couple of minutes and the pod may initially show up in error state. Wait a couple minutes and it will go to running state.
```
loftsman helm upgrade cms-ipxe loftsman/cms-ipxe
```
(ncn-mw#) Log into each iPXE pod and verify the iPXE file was created.

This may take another couple of minutes while it is creating the files.
1. Find the iPXE pod IDs.
```
kubectl get pods -n services --no-headers -o wide | grep cray-ipxe | awk '{print $1}'
```
2. Log into the iPXE pods using their IDs.
```
kubectl exec -n services -it IPXE_POD_ID /bin/sh
```
  To see the containers in the pod:
```
kubectl describe pod/CRAY-IPXE_POD_NAME -n services
```

Log into the TFTP pods and verify it is seeing the correct file size.

(ncn-mw#) Find the TFTP pod ID.

kubectl get pods -n services --no-headers -o wide | grep cray-tftp | awk '{print $1}'

Example output:

cray-tftp-7dc77f9cdc-bn6ml
cray-tftp-7dc77f9cdc-ffgnh
cray-tftp-7dc77f9cdc-mr6zd
cray-tftp-modprobe-42648
cray-tftp-modprobe-4kmqg
cray-tftp-modprobe-4sqsk
cray-tftp-modprobe-hlfcc
cray-tftp-modprobe-r6bvb
cray-tftp-modprobe-v2txr

(ncn-mw#) Log into the pod using the TFTP pod ID.

kubectl exec -n services -it TFTP_POD_ID /bin/sh

(pod#) Change to the /var/lib/tftpboot directory.
```
cd /var/lib/tftpboot
```

(pod#) Check the ipxe.efi file size on the TFTP servers.

If there are any issues, the file will have a size of 0 bytes.

ls -l

Example output:

total 1919
-rw-r--r--    1 root     root        980768 May 15 16:49 debug.efi
-rw-r--r--    1 root     root        983776 May 15 16:50 ipxe.efi