This procedure rearranges NIDs for the specified compute nodes to create a numerically (NID) and lexicographically (xname) contiguous block of NIDs beginning at the chosen starting NID.

It is recommended that the system be taken down for maintenance while performing this procedure. This procedure should only be performed if absolutely required; one common reason is to remove NID gaps or correct numbering mistakes introduced in SLS.

The example in this procedure removes NID gaps from two cabinets of compute nodes that were the result of incorrect numbering in SLS.
In the process of defragmenting NIDs, the `defragment_nids.py` script will modify node data in HSM under `/State/Components`, `/Inventory/ComponentEndpoints`, `/Inventory/Hardware`, and `/Inventory/EthernetInterfaces`, and will update the corresponding compute node entries (NIDs and aliases) in SLS.
Limitations of the `defragment_nids.py` script: it only operates on compute nodes, so the NIDs of NCNs and application nodes (such as UANs) are not changed.
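As a point of reference, the current NID of an individual node can be inspected in both HSM and SLS before and after the defragmentation. The following is a minimal sketch, assuming the `cray` CLI is initialized and authenticated; the xname is taken from the example report later in this procedure:

```bash
# Current NID of the node as recorded in HSM State Components
cray hsm state components describe x1000c0s0b0n0 --format json | jq '{ID, NID, Role}'

# Current NID and alias of the same node as recorded in SLS
cray sls hardware describe x1000c0s0b0n0 --format json | jq '.ExtraProperties | {Aliases, NID, Role}'
```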
1. (`ncn-mw#`) Choose the starting NID for the NID block (for example, 1000).

```bash
export NID_START=1000
```
2. (`ncn-mw#`) Choose the components to include in the NID block (for example, `x1000,x3000`).

This can be specified at the cabinet (`x#`), chassis (`x#c#`), slot (`x#c#s#`), or even node level (`x#c#s#b#n#`). This list always gets expanded to include all compute nodes contained by the specified parent components.

```bash
export INCLUDE_LIST=x1000,x3000
```
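To preview which compute nodes the include list will expand to, the compute nodes currently known to HSM in those cabinets can be listed. This is a sketch assuming the `cray` CLI is available; the `startswith` filters mirror the cabinets chosen in `INCLUDE_LIST`:

```bash
# List compute nodes in cabinets x1000 and x3000 with their current NIDs, sorted by xname
cray hsm state components list --type Node --role Compute --format json \
  | jq -r '.Components[] | select(.ID | startswith("x1000") or startswith("x3000")) | "\(.ID) \(.NID)"' \
  | sort
```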
3. Run `defragment_nids.py`.

NOTE: Administrators can do a dry run of `defragment_nids.py`, which prints a report of what will happen without affecting the system's NID numbering, by specifying `--dryrun`.
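For example, a dry run using the values chosen in the previous steps prints the report but changes nothing:

```bash
/usr/share/doc/csm/scripts/operations/node_management/defragment_nids.py --start ${NID_START} --include ${INCLUDE_LIST} --dryrun | jq .
```

When satisfied with the dry-run report, run the same command without `--dryrun` (shown below) to apply the changes.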
```bash
/usr/share/doc/csm/scripts/operations/node_management/defragment_nids.py --start ${NID_START} --include ${INCLUDE_LIST} | jq .
```
Example (summarized) output:
```json
{
"Description": "NID Defragmentation Report",
"StartingNID": 1000,
"Include": [
"x1000",
"x3000"
],
"HSMChanges": [
{
"ID": "x1000c0s0b0n0",
"OldNID": 1000,
"NewNID": 1000
},
{
"ID": "x1000c0s0b0n1",
"OldNID": 1001,
"NewNID": 1001
},
{
"ID": "x1000c0s0b1n0",
"OldNID": 1002,
"NewNID": 1002
},
...
{
"ID": "x1000c0s2b1n0",
"OldNID": 1010,
"NewNID": 1009
},
{
"ID": "x1000c0s3b0n0",
"OldNID": 1012,
"NewNID": 1010
},
{
"ID": "x1000c0s3b0n1",
"OldNID": 1013,
"NewNID": 1011
},
{
"ID": "x3001c0s1b1n0",
"OldNID": 1,
"NewNID": 1012
},
...
{
"ID": "x3000c0s6b0n0",
"OldNID": 20,
"NewNID": 1020
}
],
"SLSEntries": [
{
"Xname": "x1000c0s0b0n0",
"Class": "Hill",
"ExtraProperties": {
"Aliases": [
"nid001000"
],
"NID": 1000,
"Role": "Compute"
}
},
{
"Xname": "x1000c0s0b0n1",
"Class": "Hill",
"ExtraProperties": {
"Aliases": [
"nid001001"
],
"NID": 1001,
"Role": "Compute"
}
},
{
"Xname": "x1000c0s0b1n0",
"Class": "Hill",
"ExtraProperties": {
"Aliases": [
"nid001002"
],
"NID": 1002,
"Role": "Compute"
}
},
...
{
"Xname": "x1000c0s2b1n0",
"Class": "Hill",
"ExtraProperties": {
"Aliases": [
"nid001009"
],
"NID": 1009,
"Role": "Compute"
}
},
{
"Xname": "x1000c0s3b0n0",
"Class": "Hill",
"ExtraProperties": {
"Aliases": [
"nid001010"
],
"NID": 1010,
"Role": "Compute"
}
},
{
"Xname": "x1000c0s3b0n1",
"Class": "Hill",
"ExtraProperties": {
"Aliases": [
"nid001011"
],
"NID": 1011,
"Role": "Compute"
}
},
{
"Xname": "x3000c0s1b1n0",
"Class": "River",
"ExtraProperties": {
"Aliases": [
"nid001012"
],
"NID": 1012,
"Role": "Compute"
}
},
...
{
"Xname": "x3000c0s6b0n0",
"Class": "River",
"ExtraProperties": {
"Aliases": [
"nid001020"
],
"NID": 1020,
"Role": "Compute"
}
}
],
"NodesRemovedFromHSM": [],
"NodesRemovedFromSLS": [
"x1000c0s2b0n1",
"x1000c0s2b1n1"
],
"Errors": []
}
```
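If the report is captured to a file (for example, by redirecting the dry-run output to a file such as `nid_report.json`; the filename here is only illustrative), `jq` can be used to summarize the planned changes:

```bash
# Show only the nodes whose NID will actually change
jq -r '.HSMChanges[] | select(.OldNID != .NewNID) | "\(.ID): \(.OldNID) -> \(.NewNID)"' nid_report.json

# Confirm that no errors were reported
jq '.Errors' nid_report.json
```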
Example output if `--output text` is specified:
```text
NID Defragmentation Report
=================
Starting NID: 1000
Include: ['x1000', 'x3000']
=================
HSM Changes:
x1000c0s0b0n0 1000 -> 1000
x1000c0s0b0n1 1001 -> 1001
x1000c0s0b1n0 1002 -> 1002
x1000c0s0b1n1 1003 -> 1003
x1000c0s1b0n0 1004 -> 1004
x1000c0s1b0n1 1005 -> 1005
x1000c0s1b1n0 1006 -> 1006
x1000c0s1b1n1 1007 -> 1007
x1000c0s2b0n0 1008 -> 1008
x1000c0s2b1n0 1010 -> 1009
x1000c0s3b0n0 1012 -> 1010
x1000c0s3b0n1 1013 -> 1011
x3000c0s1b1n0 1 -> 1012
x3000c0s1b2n0 2 -> 1013
x3000c0s1b3n0 3 -> 1014
x3000c0s1b4n0 4 -> 1015
x3000c0s3b1n0 11 -> 1016
x3000c0s3b2n0 12 -> 1017
x3000c0s3b3n0 13 -> 1018
x3000c0s3b4n0 14 -> 1019
x3000c0s6b0n0 20 -> 1020
SLS Entries:
{"Xname": "x1000c0s0b0n0", "Class": "Hill", "ExtraProperties": {"Aliases": ["nid001000"], "NID": 1000, "Role": "Compute"}}
{"Xname": "x1000c0s0b0n1", "Class": "Hill", "ExtraProperties": {"Aliases": ["nid001001"], "NID": 1001, "Role": "Compute"}}
{"Xname": "x1000c0s0b1n0", "Class": "Hill", "ExtraProperties": {"Aliases": ["nid001002"], "NID": 1002, "Role": "Compute"}}
{"Xname": "x1000c0s0b1n1", "Class": "Hill", "ExtraProperties": {"Aliases": ["nid001003"], "NID": 1003, "Role": "Compute"}}
{"Xname": "x1000c0s1b0n0", "Class": "Hill", "ExtraProperties": {"Aliases": ["nid001004"], "NID": 1004, "Role": "Compute"}}
{"Xname": "x1000c0s1b0n1", "Class": "Hill", "ExtraProperties": {"Aliases": ["nid001005"], "NID": 1005, "Role": "Compute"}}
{"Xname": "x1000c0s1b1n0", "Class": "Hill", "ExtraProperties": {"Aliases": ["nid001006"], "NID": 1006, "Role": "Compute"}}
{"Xname": "x1000c0s1b1n1", "Class": "Hill", "ExtraProperties": {"Aliases": ["nid001007"], "NID": 1007, "Role": "Compute"}}
{"Xname": "x1000c0s2b0n0", "Class": "Hill", "ExtraProperties": {"Aliases": ["nid001008"], "NID": 1008, "Role": "Compute"}}
{"Xname": "x1000c0s2b1n0", "Class": "Hill", "ExtraProperties": {"Aliases": ["nid001009"], "NID": 1009, "Role": "Compute"}}
{"Xname": "x1000c0s3b0n0", "Class": "Hill", "ExtraProperties": {"Aliases": ["nid001010"], "NID": 1010, "Role": "Compute"}}
{"Xname": "x1000c0s3b0n1", "Class": "Hill", "ExtraProperties": {"Aliases": ["nid001011"], "NID": 1011, "Role": "Compute"}}
{"Xname": "x3000c0s1b1n0", "Class": "River", "ExtraProperties": {"Aliases": ["nid001012"], "NID": 1012, "Role": "Compute"}}
{"Xname": "x3000c0s1b2n0", "Class": "River", "ExtraProperties": {"Aliases": ["nid001013"], "NID": 1013, "Role": "Compute"}}
{"Xname": "x3000c0s1b3n0", "Class": "River", "ExtraProperties": {"Aliases": ["nid001014"], "NID": 1014, "Role": "Compute"}}
{"Xname": "x3000c0s1b4n0", "Class": "River", "ExtraProperties": {"Aliases": ["nid001015"], "NID": 1015, "Role": "Compute"}}
{"Xname": "x3000c0s3b1n0", "Class": "River", "ExtraProperties": {"Aliases": ["nid001016"], "NID": 1016, "Role": "Compute"}}
{"Xname": "x3000c0s3b2n0", "Class": "River", "ExtraProperties": {"Aliases": ["nid001017"], "NID": 1017, "Role": "Compute"}}
{"Xname": "x3000c0s3b3n0", "Class": "River", "ExtraProperties": {"Aliases": ["nid001018"], "NID": 1018, "Role": "Compute"}}
{"Xname": "x3000c0s3b4n0", "Class": "River", "ExtraProperties": {"Aliases": ["nid001019"], "NID": 1019, "Role": "Compute"}}
{"Xname": "x3000c0s6b0n0", "Class": "River", "ExtraProperties": {"Aliases": ["nid001020"], "NID": 1020, "Role": "Compute"}}
Nodes Removed From SLS:
x1000c0s2b0n1,x1000c0s2b1n1
```
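A report in this format can be produced by adding `--output text` to the command; for example, combined with the `--dryrun` option described above:

```bash
/usr/share/doc/csm/scripts/operations/node_management/defragment_nids.py --start ${NID_START} --include ${INCLUDE_LIST} --dryrun --output text
```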
DVS node maps on NCN worker nodes and gateway nodes contain entries for compute nodes that include their NIDs. Because of that, the NID defragmentation process will impact the NCN worker and gateway nodes.
Carry out the Procedure To Perform After CSM Defragmentation of Compute Node Identifiers documented in publication HPE Cray Supercomputing User Services Software Administration Guide: CSM on HPE Cray EX Systems (1.0.0 Rev A) (S-8063).
The `defragment_nids.py` script checks for HSM discovery errors on the specified nodes before proceeding, and returns an error if any are found. For example:
```json
{
"Message": "Discovery errors detected.",
"Severity": "Error",
"IDs": ["x1000c0s1b1", "x3000c0s6b0"]
}
```
To continue with the NID defragmentation, an administrator must first debug any discovery errors such that all specified components have a discovery status of `DiscoverOK` in HSM. See Troubleshoot Issues with Redfish Endpoint Discovery for help debugging discovery issues.
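One way to check the current discovery status of a reported BMC is to inspect its Redfish endpoint record in HSM. This is a sketch assuming the `cray` CLI is initialized; the xname comes from the error output above:

```bash
# Expect LastDiscoveryStatus to be "DiscoverOK" before rerunning defragment_nids.py
cray hsm inventory redfishEndpoints describe x1000c0s1b1 --format json | jq '.DiscoveryInfo'
```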
Alternatively, if these issues are known and will not affect the desired resulting NID numbering, the `--ignore-discovery-errors` option may be specified with `defragment_nids.py` to continue through these errors.

Warning: Continuing through discovery errors may result in incorrect NID numbering if HSM's inventory data for those nodes is missing or incorrect.
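For example, reusing the variables from the earlier steps, an invocation that continues past discovery errors might look like the following:

```bash
/usr/share/doc/csm/scripts/operations/node_management/defragment_nids.py --start ${NID_START} --include ${INCLUDE_LIST} --ignore-discovery-errors | jq .
```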
The `defragment_nids.py` script also checks for nodes with NIDs that fall within the specified NID block but are not covered by the include list. An example of this error is:
```json
{
"Message": "There is an unexpected node NID in the requested NID range, 1000-1100",
"Severity": "Error",
"IDs": ["x3001c0s0b0n0", "x3001c0s0b0n1"]
}
```
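To see which components currently hold NIDs inside the requested range, all node NIDs in HSM can be listed and filtered. This is a sketch assuming the `cray` CLI is available; adjust the range to match the one in the error message:

```bash
# List every node (any role) whose current NID falls inside the planned block 1000-1100
cray hsm state components list --type Node --format json \
  | jq -r '.Components[] | select(.NID != null and .NID >= 1000 and .NID <= 1100) | "\(.ID) \(.Role) \(.NID)"'
```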
These might be NCNs and UANs or compute nodes that were not covered by the specified include list. Here are some scenarios and how to fix them:
Compute nodes in cabinets `x1000` and `x1002` were specified in the include list, so the new NID block is 1000-1100, but the compute nodes in cabinet `x1001` have NIDs 1090-1140. This would create a conflict, so `defragment_nids.py` will return an error. This can be fixed by either:

- Adding `x1001` to the include list to include it in the new NID block.
- Running `defragment_nids.py` to first move the compute nodes in `x1001` to another NID block, then rerunning `defragment_nids.py` for the compute nodes in cabinets `x1000` and `x1002` (a sketch of this two-pass approach is shown below).
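The following is a hypothetical sketch of that two-pass approach; the temporary starting NID of 2000 is an arbitrary choice and must be a range that is actually free on the system:

```bash
# Pass 1: move the x1001 compute nodes into a temporary NID block out of the way
/usr/share/doc/csm/scripts/operations/node_management/defragment_nids.py --start 2000 --include x1001 | jq .

# Pass 2: defragment the compute nodes in x1000 and x1002 into the 1000 block
/usr/share/doc/csm/scripts/operations/node_management/defragment_nids.py --start 1000 --include x1000,x1002 | jq .
```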
Compute nodes in cabinet `x1000` were specified in the include list, so the new NID block is 1000-1100, but `x1000c1b0n0` is a UAN that was given the NID 1000. This would create a conflict, so `defragment_nids.py` will return an error. This can be fixed by changing the NID of the UAN so that it no longer falls within the new NID block (or by choosing a different starting NID), and then rerunning `defragment_nids.py` for the nodes in `x1000`.
Compute nodes in cabinet `x1000` were specified in the include list, so the new NID block is 1000-1100, but `x3000c0b0n0` is an NCN that was given the NID 1000. This would create a conflict, so `defragment_nids.py` will return an error. It is not recommended to try to change the NID of an NCN. The best course of action is to change the starting NID for the new NID block.