On systems with a large number of non-existent network switches, the `hms-discovery` job may fail to finish because it spends too much time attempting to communicate with switches that are not actually present. This can happen when incrementally building up a new system and running CSM before the full hardware installation is complete.

If `hms-discovery` does not complete in such an environment, this specific problem can be diagnosed as follows:
(`ncn-mw#`) Dump the logs for the failed `hms-discovery` jobs. Example:

```bash
kubectl -n services logs hms-discovery-28483755-mgvxn -f --timestamps
```
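The pod name above is only an example. If the names of the failed `hms-discovery` pods are not already known, one way to find them is to list the recent job pods in the `services` namespace:

```bash
# List hms-discovery job pods, oldest first, to locate failed runs and their names.
kubectl -n services get pods --sort-by=.metadata.creationTimestamp | grep hms-discovery
```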
(`ncn-mw#`) Look for messages similar to the following. This example is heavily abbreviated for illustrative purposes:

```text
... "msg":"Failed to get port map for management switch!", ..., "error":"failed to perform bulk get: read udp 10.38.0.46:55957->10.254.0.18:161: i/o timeout" ...
```
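To get a rough sense of how widespread the problem is, the timeout messages in a failed job's log can be counted; the pod name below is the same illustrative example used above, and the match string is taken from the error shown above:

```bash
# Count how many switch queries timed out in the failed job's log (pod name is an example).
kubectl -n services logs hms-discovery-28483755-mgvxn --timestamps | grep -c 'i/o timeout'
```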
If this problem is encountered, it can be worked around by increasing the `activeDeadlineSeconds` value in the `hms-discovery` CronJob.
(`ncn-mw#`) Edit the `hms-discovery` CronJob:

```bash
kubectl edit cronjob -n services hms-discovery
```
(`ncn-mw#`) In the edit session, look for the following line:

```yaml
activeDeadlineSeconds: 300
```
(`ncn-mw#`) Update it from `300` (5 minutes) to something much larger, such as `30000` (500 minutes):

```yaml
activeDeadlineSeconds: 30000
```
(`ncn-mw#`) Save the changes. The next `hms-discovery` job will have its timeout extended.
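As a non-interactive alternative sketch, assuming the `activeDeadlineSeconds` line above lives on the CronJob's job template (the standard location for this field), the value can also be patched and then verified directly:

```bash
# Patch the CronJob's job template deadline without opening an edit session
# (assumes the field is at spec.jobTemplate.spec.activeDeadlineSeconds).
kubectl -n services patch cronjob hms-discovery --type merge \
  -p '{"spec":{"jobTemplate":{"spec":{"activeDeadlineSeconds":30000}}}}'

# Confirm the new value took effect.
kubectl -n services get cronjob hms-discovery \
  -o jsonpath='{.spec.jobTemplate.spec.activeDeadlineSeconds}{"\n"}'
```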
(`ncn-mw#`) Gather a list of switches that were not discovered properly:

```bash
cray hsm inventory redfishEndpoints list --type RouterBMC --format json | jq -c '.RedfishEndpoints[] | {ID,DiscoveryInfo}'
```

A switch was not discovered properly if its `DiscoveryInfo` does not show `DiscoverOK` in this output.
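To narrow the output to only the problem switches, the same data can be filtered on the discovery status; the `LastDiscoveryStatus` field name used below is an assumption about the shape of the `DiscoveryInfo` object and should be checked against the actual output:

```bash
# List only RouterBMC endpoints whose last discovery did not end in DiscoverOK
# (assumes the status is reported in DiscoveryInfo.LastDiscoveryStatus).
cray hsm inventory redfishEndpoints list --type RouterBMC --format json | \
  jq -r '.RedfishEndpoints[] | select(.DiscoveryInfo.LastDiscoveryStatus != "DiscoverOK") | .ID'
```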
(`ncn-mw#`) Force re-discovery of the failed switches:

```bash
cray hsm inventory discover create --xnames <COMMA_SEPARATED_XNAME_LIST> --force true
```
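For example, with two hypothetical RouterBMC xnames substituted for the placeholder:

```bash
# Example invocation with hypothetical xnames; substitute the IDs gathered in the previous step.
cray hsm inventory discover create --xnames x1000c0r1b0,x1000c0r3b0 --force true
```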
This issue is expected to be fixed in the CSM 1.6 release.