Prepare a storage node before rebuilding it.
IMPORTANT: The output examples may not all reflect the cluster status at the time this operation is performed. For example, during a rebuild in place, Ceph components will not be reporting as down, in contrast to a failed node rebuild.
Ensure that the latest CSM documentation RPM is installed on `ncn-m001`.
(`ncn-m001#`) When rebuilding a node, make sure that `/srv/cray/scripts/common/storage-ceph-cloudinit.sh` and `/srv/cray/scripts/common/pre-load-images.sh` have been removed from the `runcmd` in BSS.
Set the node name and xname if not already set.

```bash
NODE=ncn-s00n
XNAME=$(ssh $NODE cat /etc/cray/xname)
```
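Before continuing, it can be worth confirming that the two variables look plausible. The following is an illustrative sketch, not part of the official procedure; the helper names and the xname pattern are assumptions based on typical NCN naming.

```bash
# Illustrative sanity checks: verify that NODE and XNAME look plausible.
# The helper names and the xname pattern are assumptions, not part of the
# documented procedure.
valid_node() {
  # Storage NCN hostnames follow the ncn-sNNN pattern
  echo "$1" | grep -qE '^ncn-s[0-9]{3}$'
}
valid_xname() {
  # Node xnames typically look like x3000c0s13b0n0
  echo "$1" | grep -qE '^x[0-9]+c[0-9]+s[0-9]+b[0-9]+n[0-9]+$'
}
```

For example, `valid_node ncn-s003` succeeds, while a management node name such as `ncn-m001` fails the check.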
Get the `runcmd` in BSS.

```bash
cray bss bootparameters list --name ${XNAME} --format=json | jq -r '.[]|.["cloud-init"]|.["user-data"].runcmd'
```
Expected output:

```json
[
  "/srv/cray/scripts/metal/net-init.sh",
  "/srv/cray/scripts/common/update_ca_certs.py",
  "/srv/cray/scripts/metal/install.sh",
  "/srv/cray/scripts/common/ceph-enable-services.sh",
  "touch /etc/cloud/cloud-init.disabled"
]
```
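As a quick scripted check, the saved `runcmd` output can be scanned for the two scripts that must not be present. This is a sketch under stated assumptions: `check_runcmd` is a hypothetical helper name, and it assumes the output of the `cray bss` command has first been redirected to a file.

```bash
# Hypothetical helper: scan a saved runcmd dump for scripts that must be
# removed before a rebuild. Assumes the cray bss output was first saved to
# a file (e.g. by appending "> runcmd.json" to the command).
check_runcmd() {
  # $1: file containing the runcmd output
  if grep -qE 'storage-ceph-cloudinit\.sh|pre-load-images\.sh' "$1"; then
    echo "runcmd needs patching"
    return 1
  fi
  echo "runcmd is clean"
}
```

A nonzero return status indicates that the patching procedure below is required.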
If `/srv/cray/scripts/common/storage-ceph-cloudinit.sh` or `/srv/cray/scripts/common/pre-load-images.sh` is in the `runcmd`, then it will need to be fixed using the following procedure:
Obtain an API authentication token.
A token will need to be generated and made available as an environment variable. Refer to the Retrieve an Authentication Token procedure for more information.
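It can help to fail early if the token was never exported. The following guard is an illustrative sketch: `require_token` is a hypothetical helper, and `TOKEN` is an assumed variable name that should be confirmed against the Retrieve an Authentication Token procedure.

```bash
# Illustrative guard: fail early if the TOKEN environment variable (an
# assumed name; confirm against the token retrieval procedure) is unset
# or empty before invoking the patch script.
require_token() {
  if [ -z "${TOKEN:-}" ]; then
    echo "TOKEN is not set; generate one first" >&2
    return 1
  fi
}
```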
Run the following command to patch BSS.

```bash
python3 /usr/share/doc/csm/scripts/patch-ceph-runcmd.py
```
Repeat the original Cray CLI command and verify that the expected output is obtained.
Upload Ceph container images into Nexus.
Log in to one of the first three storage NCNs.
This procedure must be performed on a `ceph-mon` node. By default, these will be any of the first three storage NCNs: `ncn-s001`, `ncn-s002`, or `ncn-s003`.
(`ncn-s#`) Check the status of Ceph.
Check the OSD status, weight, and location:

```bash
ceph osd tree
```
Example output:

```text
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 62.87558 root default
-5 20.95853 host ncn-s001
2 ssd 3.49309 osd.2 up 1.00000 1.00000
5 ssd 3.49309 osd.5 up 1.00000 1.00000
6 ssd 3.49309 osd.6 up 1.00000 1.00000
9 ssd 3.49309 osd.9 up 1.00000 1.00000
12 ssd 3.49309 osd.12 up 1.00000 1.00000
16 ssd 3.49309 osd.16 up 1.00000 1.00000
-3 20.95853 host ncn-s002
0 ssd 3.49309 osd.0 up 1.00000 1.00000
3 ssd 3.49309 osd.3 up 1.00000 1.00000
7 ssd 3.49309 osd.7 up 1.00000 1.00000
10 ssd 3.49309 osd.10 up 1.00000 1.00000
13 ssd 3.49309 osd.13 up 1.00000 1.00000
15 ssd 3.49309 osd.15 up 1.00000 1.00000
-7 20.95853 host ncn-s003
1 ssd 3.49309 osd.1 up 1.00000 1.00000
4 ssd 3.49309 osd.4 up 1.00000 1.00000
8 ssd 3.49309 osd.8 up 1.00000 1.00000
11 ssd 3.49309 osd.11 up 1.00000 1.00000
14 ssd 3.49309 osd.14 up 1.00000 1.00000
17 ssd 3.49309 osd.17 up 1.00000 1.00000
```
(`ncn-s#`) If the node is up, then stop and disable all of the Ceph services on the node being rebuilt.

```bash
ceph orch host maintenance enter <storage node hostname being rebuilt>
```
Example output:

```text
Daemons for Ceph cluster 5f79a490-c281-11ed-b6ec-fa163e741e89 stopped on host ncn-s003. Host ncn-s003 moved to maintenance mode
```
IMPORTANT: The `--force` flag is used to bypass warnings. These pertain to Ceph services which can handle failures, such as `rgw`.
Example:

```text
WARNING: Stopping 1 out of 1 daemons in Alertmanager service. Service will not be operational with no daemons left. At least 1 daemon must be running to guarantee service.
ALERT: Cannot stop active Mgr daemon, Please switch active Mgrs with 'ceph mgr fail ncn-s003.ydycwn'
WARNING: Removing RGW daemons can cause clients to lose connectivity.
```
In this example, the warnings for RGW and Alertmanager would be ignored by passing the `--force` flag. The alert for the active `Mgr` will need to be addressed with the provided command (`ceph mgr fail ncn-s003.ydycwn`).
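When the active-Mgr alert does appear, the exact remediation command is embedded in the alert text itself. The following sketch extracts it; `mgr_fail_hint` is a hypothetical helper name, and the alert wording is assumed to match the example above.

```bash
# Hypothetical helper: read "ceph orch host maintenance enter" output on
# stdin and print the "ceph mgr fail ..." command suggested by the
# active-Mgr alert, if one is present. Assumes the alert wording shown in
# the example above.
mgr_fail_hint() {
  sed -n "s/.*Please switch active Mgrs with '\(.*\)'.*/\1/p"
}
```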
(`ncn-s#`) Re-check the OSD status, weight, and location:

```bash
ceph osd tree
```
Example output:

```text
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 62.87558 root default
-5 20.95853 host ncn-s001
2 ssd 3.49309 osd.2 up 1.00000 1.00000
5 ssd 3.49309 osd.5 up 1.00000 1.00000
6 ssd 3.49309 osd.6 up 1.00000 1.00000
9 ssd 3.49309 osd.9 up 1.00000 1.00000
12 ssd 3.49309 osd.12 up 1.00000 1.00000
16 ssd 3.49309 osd.16 up 1.00000 1.00000
-3 20.95853 host ncn-s002
0 ssd 3.49309 osd.0 up 1.00000 1.00000
3 ssd 3.49309 osd.3 up 1.00000 1.00000
7 ssd 3.49309 osd.7 up 1.00000 1.00000
10 ssd 3.49309 osd.10 up 1.00000 1.00000
13 ssd 3.49309 osd.13 up 1.00000 1.00000
15 ssd 3.49309 osd.15 up 1.00000 1.00000
-7 20.95853 host ncn-s003
1 ssd 3.49309 osd.1 down 1.00000 1.00000
4 ssd 3.49309 osd.4 down 1.00000 1.00000
8 ssd 3.49309 osd.8 down 1.00000 1.00000
11 ssd 3.49309 osd.11 down 1.00000 1.00000
14 ssd 3.49309 osd.14 down 1.00000 1.00000
17 ssd 3.49309 osd.17 down 1.00000 1.00000
```
(`ncn-s#`) Check the status of the Ceph cluster:

```bash
ceph -s
```
Example output:

```text
  cluster:
    id:     4c9e9d74-a208-11ed-b008-98039bb427f6
    health: HEALTH_WARN
            1 host is in maintenance mode <-------- Expect this line.
            1/3 mons down, quorum ncn-s001,ncn-s002
            6 osds down
            1 host (6 osds) down
            Degraded data redundancy: 34257/102773 objects degraded (33.333%), 370 pgs degraded, 352 pgs undersized

  services:
    mon: 3 daemons, quorum ncn-s001,ncn-s002 (age 56s), out of quorum: ncn-s003
    mgr: ncn-s002.amfitm(active, since 43m), standbys: ncn-s001.rytusj
    mds: 1/1 daemons up, 1 hot standby
    osd: 18 osds: 12 up (since 55s), 18 in (since 13h)
    rgw: 2 daemons active (2 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   13 pools, 553 pgs
    objects: 34.26k objects, 58 GiB
    usage:   173 GiB used, 63 TiB / 63 TiB avail
    pgs:     34257/102773 objects degraded (33.333%)
             370 active+undersized+degraded
             159 active+undersized
             24  active+clean

  io:
    client: 8.7 KiB/s rd, 353 KiB/s wr, 3 op/s rd, 53 op/s wr
```
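To confirm in a script that the expected maintenance warning is the one being reported, the `ceph -s` output can be checked for that line. A minimal sketch, assuming the warning text shown above; `in_maintenance` is an illustrative name.

```bash
# Illustrative check: succeed only if "ceph -s" output (read on stdin)
# contains the expected "host is in maintenance mode" health warning.
in_maintenance() {
  grep -q 'host is in maintenance mode'
}
```

For example: `ceph -s | in_maintenance && echo "node is in maintenance mode"`.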
(`ncn-s#`) List the down Ceph OSDs.
IMPORTANT: Before proceeding, ensure that this rebuild requires OSD wipes. Storage node rebuilds performed on an active node do not require OSD removal; one example is a rebuild done to pick up a custom patched image.
The `ceph osd tree` capture indicated that there are down OSDs on `ncn-s003`.

```bash
ceph osd tree down
```
Example output:

```text
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 62.87758 root default
-7 20.95853 host ncn-s003
1 ssd 3.49309 osd.1 down 1.00000 1.00000
4 ssd 3.49309 osd.4 down 1.00000 1.00000
8 ssd 3.49309 osd.8 down 1.00000 1.00000
11 ssd 3.49309 osd.11 down 1.00000 1.00000
14 ssd 3.49309 osd.14 down 1.00000 1.00000
17 ssd 3.49309 osd.17 down 1.00000 1.00000
```
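The down OSD IDs can also be pulled out of the tree output programmatically, for example to feed a later loop. A sketch assuming the whitespace-separated column layout shown in the example output; `down_osds` is an illustrative name.

```bash
# Illustrative parser: read "ceph osd tree" output on stdin and print the
# numeric IDs of OSDs whose STATUS column reads "down". Assumes the
# column layout shown in the example output (NAME in field 4, STATUS in
# field 5).
down_osds() {
  awk '$4 ~ /^osd\./ && $5 == "down" { sub(/^osd\./, "", $4); print $4 }'
}
```

For example: `ceph osd tree | down_osds` would print one OSD ID per line.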
(`ncn-s#`) Remove the OSD references to allow the rebuild to re-use the original OSD references on the drives.
If the OSD references are not removed, there will still be references to them in the CRUSH map. This will result in OSDs that no longer exist appearing to be down.
The following command assumes the variables from the prerequisites section are set.

```bash
for osd in $(ceph osd ls-tree $NODE); do
  ceph osd destroy osd.$osd --force
  ceph osd purge osd.$osd --force
done
```
Example output:

```text
destroyed osd.1
purged osd.1
destroyed osd.4
purged osd.4
destroyed osd.8
purged osd.8
destroyed osd.11
purged osd.11
destroyed osd.14
purged osd.14
destroyed osd.17
purged osd.17
```
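Because destroy/purge is irreversible, it can be worth previewing the exact commands before running them. The following is a hypothetical dry-run wrapper around the loop above; `purge_osds` is an illustrative name, not part of the documented procedure.

```bash
# Hypothetical wrapper around the destroy/purge loop. With any second
# argument other than "run" it only prints the commands it would execute,
# so the OSD ID list can be reviewed before anything destructive happens.
purge_osds() {
  # $1: whitespace-separated OSD IDs (e.g. from "ceph osd ls-tree $NODE")
  # $2: "run" to execute, anything else for a dry run
  for osd in $1; do
    if [ "$2" = "run" ]; then
      ceph osd destroy "osd.$osd" --force
      ceph osd purge "osd.$osd" --force
    else
      echo "ceph osd destroy osd.$osd --force"
      echo "ceph osd purge osd.$osd --force"
    fi
  done
}
```

For example, `purge_osds "$(ceph osd ls-tree $NODE)" dry` prints the commands for review; re-run with `run` once the list looks correct.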
Proceed to the next step of the storage node rebuild procedure to identify nodes and update metadata. Otherwise, return to the main Rebuild NCNs page.