This is a Prometheus Exporter for extracting metrics from a server using the Redfish API. The hostname of the server has to be passed as the target parameter in the HTTP call.
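For illustration, a manual scrape passes the server hostname as the target query parameter. The following is a hedged sketch using the /health path and port 9220 that this procedure configures later, with angle-bracket placeholders for site-specific values:
curl "http://<exporter-host>:9220/health?target=<server-fqdn>"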
All of these steps must be performed after installing or upgrading CSM services.
NOTE: The following steps need to be performed on the ClusterStor E1000 node.
In order to provide SMART data to the Prometheus time series database, the Redfish Exporter must be configured with the domain name of the ClusterStor primary management node.
Find the IP address of both management nodes on the external access network (EAN) of ClusterStor.
If static EAN IP addresses are configured on the management nodes, then the following command will show what they are:
cscli ean ipaddr show
Example output:
---------------------------------------------------
Node       Network  Interface  IP ADDRESS
---------------------------------------------------
kjcf01n00  EAN      pub0       172.30.53.54
kjcf01n01  EAN      pub0       172.30.53.55
---------------------------------------------------
If static IP addresses have not been configured on the management nodes in the cluster,
then the cscli ean ipaddr show command returns empty, as seen below:
cscli ean ipaddr show
Example output:
empty
In this case, perform the following steps:
Check what the primary EAN interface name is with the following command:
cscli ean primary show
Example output:
Interface: pub0
Prefix:
Gateway:
Added EAN primary interfaces:
pub0
Free interfaces:
pub0
pub1
pub2
pub3
This output indicates that the primary EAN interface is pub0, which is the default primary EAN interface on ClusterStor management nodes.
If no static IP address is set on this interface, it will default to DHCP.
Check the IP address of this interface on both management nodes with the following command:
pdsh -g mgmt ip a l pub0 | dshbak -c
Example output:
----------------
kjlmo1200
----------------
2: pub0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether b4:96:91:02:73:0c brd ff:ff:ff:ff:ff:ff
altname enp8s0f0
inet 10.214.135.37/21 brd 10.214.135.255 scope global dynamic pub0
valid_lft 80500sec preferred_lft 80500sec
----------------
kjlmo1201
----------------
2: pub0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether b4:96:91:03:10:74 brd ff:ff:ff:ff:ff:ff
altname enp22s0f0
inet 10.214.135.45/21 brd 10.214.135.255 scope global dynamic pub0
valid_lft 77093sec preferred_lft 77093sec
Use nslookup to get the FQDN of each management node from the primary EAN IP address found on each.
nslookup 10.214.135.37
Example output:
37.135.214.10.in-addr.arpa name = kjlmo1200.hpc.amslabs.hpecorp.net.
nslookup 10.214.135.45
Example output:
45.135.214.10.in-addr.arpa name = kjlmo1201.hpc.amslabs.hpecorp.net.
Determine which management node is currently the primary management node.
The RFSF API services run on the primary management node in the ClusterStor SMU.
To determine which node is currently the primary management node, look at cscli show_nodes output:
cscli show_nodes
Example output:
----------------------------------------------------------------------------------------
Hostname   Role         Power State  Service State  Targets  HA Partner  HA Resources
----------------------------------------------------------------------------------------
kjlmo1200  MGMT         On           N/a            0 / 0    kjlmo1201   None
kjlmo1201  (MGMT)       On           N/a            0 / 0    kjlmo1200   None
kjlmo1202  (MDS),(MGS)  On           Stopped        0 / 1    kjlmo1203   Local
kjlmo1203  (MDS),(MGS)  On           Stopped        0 / 1    kjlmo1202   Local
kjlmo1204  (OSS)        On           Stopped        0 / 3    kjlmo1205   Local
kjlmo1205  (OSS)        On           Stopped        0 / 3    kjlmo1204   Local
----------------------------------------------------------------------------------------
NOTE: The MGMT node where Role is NOT surrounded by parentheses is the current primary MGMT node (kjlmo1200 above).
This is also the node that can run cscli, so that is another indication of which node is primary vs. secondary.
If a node is failed over, the output changes as follows:
cscli show_nodes
Example output:
----------------------------------------------------------------------------------------
Hostname   Role         Power State  Service State  Targets  HA Partner  HA Resources
----------------------------------------------------------------------------------------
kjlmo1200  (MGMT)       On           N/a            0 / 0    kjlmo1201   None
kjlmo1201  MGMT         On           N/a            0 / 0    kjlmo1200   None
kjlmo1202  (MDS),(MGS)  On           Stopped        0 / 1    kjlmo1203   Local
kjlmo1203  (MDS),(MGS)  On           Stopped        0 / 1    kjlmo1202   Local
kjlmo1204  (OSS)        On           Stopped        0 / 3    kjlmo1205   Local
kjlmo1205  (OSS)        On           Stopped        0 / 3    kjlmo1204   Local
----------------------------------------------------------------------------------------
Select the FQDN of the primary management node to use as your RFSF API connection destination.
The FQDN of the primary EAN IP address discovered above on the primary management node is the FQDN that should be used to connect to the RFSF API.
The FQDN of the primary EAN IP address discovered above on the secondary node should be used in the case of a failover on the management nodes that causes the secondary node to become the primary.
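The primary management node can also be extracted from the cscli show_nodes output programmatically. The following one-liner is a minimal sketch that assumes the column layout shown above, where the primary node's Role column reads MGMT without parentheses:
# Print the hostname of the current primary MGMT node (assumes the output format above)
cscli show_nodes | awk '$2 == "MGMT" {print $1}'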
Create admin user on ClusterStor E1000 primary management node
NOTE: The following steps need to be performed on the ClusterStor E1000 node.
Add an admin user on the primary management node discovered in the previous section.
cscli admins add --username abcxyz --role full --password Abcxyz@123
NOTE: The password must be at least eight characters long and contain at least one lowercase letter, one uppercase letter, one number, and one special character.
View the created user.
cscli admins list
Output will look similar to:
---------------------------------------------------------------
Username  Role  Uid   SSH Enabled  Web Enabled  Policy
---------------------------------------------------------------
abcxyz    full  1503  True         True         default
---------------------------------------------------------------
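Optionally, confirm that the new account can authenticate. This is a hedged check that assumes SSH access is enabled for the account (see the SSH Enabled column above); substitute the primary management node FQDN found earlier:
ssh abcxyz@<primary-mgmt-node-fqdn>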
NOTE: The following steps need to be performed on the CSM cluster, on either a master or worker node.
(ncn-mw#) Check if the cray-sysmgmt-health-redfish ConfigMap already exists in the sysmgmt-health namespace.
kubectl get cm -n sysmgmt-health cray-sysmgmt-health-redfish
Example output:
NAME                          DATA   AGE
cray-sysmgmt-health-redfish   1      15d
(ncn-mw#) Delete the existing ConfigMap in the sysmgmt-health namespace.
kubectl delete cm -n sysmgmt-health cray-sysmgmt-health-redfish --force
(ncn-mw#) Create a ConfigMap file /tmp/configmap.yml.
Use the following content, replacing TARGET with the site-specific FQDN of the primary management node
identified at the end of Configure domain name for ClusterStor management node.
For example, TARGET=abc100.xyz.com.
apiVersion: v1
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: cray-sysmgmt-health
    meta.helm.sh/release-namespace: sysmgmt-health
  name: cray-sysmgmt-health-redfish
  namespace: sysmgmt-health
  labels:
    app.kubernetes.io/instance: cray-sysmgmt-health
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: redfish-exporter
    app.kubernetes.io/version: 0.11.0
    release: cray-sysmgmt-health
data:
  fetch_health.sh: |
    #!/bin/bash
    TARGET=""
    curl -o /tmp/redfish-smart-1.prom cray-sysmgmt-health-redfish-exporter.sysmgmt-health.svc:9220/health?target=${TARGET}
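Before applying the ConfigMap, the exporter endpoint can be sanity-checked from inside the cluster. The following is a hedged sketch; it assumes the service name and port shown above, uses the example FQDN from this procedure, and uses the public curlimages/curl image, which may need to be replaced with a site-approved image:
kubectl run redfish-check --rm -it --restart=Never --image=curlimages/curl -- \
  curl "cray-sysmgmt-health-redfish-exporter.sysmgmt-health.svc:9220/health?target=abc100.xyz.com"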
NOTE: If there is more than one ClusterStor primary management node (for example, with multiple ClusterStor systems), then multiple targets and curl commands can be used.
With multiple targets, the fetch_health.sh script under the data section will look similar to the following; note that each curl command writes to a distinct .prom file so that one target's metrics do not overwrite another's.
data:
  fetch_health.sh: |
    #!/bin/bash
    TARGET1=""
    curl -o /tmp/redfish-smart-1.prom cray-sysmgmt-health-redfish-exporter.sysmgmt-health.svc:9220/health?target=${TARGET1}
    TARGET2=""
    curl -o /tmp/redfish-smart-2.prom cray-sysmgmt-health-redfish-exporter.sysmgmt-health.svc:9220/health?target=${TARGET2}
    ...
    TARGETN=""
    curl -o /tmp/redfish-smart-N.prom cray-sysmgmt-health-redfish-exporter.sysmgmt-health.svc:9220/health?target=${TARGETN}
(ncn-mw#) Apply the above file to create a ConfigMap in the sysmgmt-health namespace.
kubectl apply -f /tmp/configmap.yml -n sysmgmt-health
(ncn-mw#) Verify that the ConfigMap exists in the sysmgmt-health namespace.
kubectl get configmap -n sysmgmt-health | grep redfish
Example output:
cray-sysmgmt-health-redfish 1 1m
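To double-check the stored target, print the script back out of the ConfigMap; a minimal sketch assuming the fetch_health.sh data key shown above:
kubectl get cm -n sysmgmt-health cray-sysmgmt-health-redfish \
  -o jsonpath='{.data.fetch_health\.sh}'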
This procedure configures the Redfish Exporter using the username and password created in Create admin user on ClusterStor E1000 primary management node.
This procedure can be performed on any master or worker NCN.
(ncn-mw#) Save the current redfish-exporter configuration, in case a rollback is needed.
kubectl get secret -n sysmgmt-health cray-sysmgmt-health-redfish-exporter \
-ojsonpath='{.data.config\.yml}' | base64 --decode > /tmp/config-default.yaml
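If a rollback is required later, the saved file can be re-encoded and patched back into the secret. This is a minimal sketch, assuming the secret name and config.yml key used in this procedure:
kubectl patch secret -n sysmgmt-health cray-sysmgmt-health-redfish-exporter \
  --type merge -p "{\"data\":{\"config.yml\":\"$(base64 -w0 /tmp/config-default.yaml)\"}}"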
(ncn-mw#) Create a secret and a redfish-exporter configuration that will be used to add the ClusterStor admin user credentials.
Create the secret file.
Create a file named /tmp/redfish-secret.yaml with the following contents:
apiVersion: v1
data:
  config.yml: REDFISH_CONFIG
kind: Secret
metadata:
  labels:
    app.kubernetes.io/instance: cray-sysmgmt-health
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: redfish-exporter
    app.kubernetes.io/version: 0.11.0
    helm.sh/chart: redfish-exporter-0.1.1
    release: cray-sysmgmt-health
  name: cray-sysmgmt-health-redfish-exporter
  namespace: sysmgmt-health
type: Opaque
Create the exporter configuration file.
Create a file named /tmp/redfish-new.yaml with the following contents:
listen_port: 9220
timeout: 30
username: "abcdef"
password: "Abcd@123"
rf_port: 8081
NOTE: In the example file above, rf_port is the port of the NEO RFSF API main RESTful server (by default it is set to 8081) and listen_port is the redfish-exporter port. Update username and password to reflect the desired configuration.
(ncn-mw#) Replace the redfish-exporter configuration based on the files created in the previous steps.
sed "s/REDFISH_CONFIG/$(cat /tmp/redfish-new.yaml \
| base64 -w0)/g" /tmp/redfish-secret.yaml \
| kubectl replace --force -f -
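To confirm that the new settings were stored, decode the secret again; this mirrors the backup command above:
kubectl get secret -n sysmgmt-health cray-sysmgmt-health-redfish-exporter \
  -ojsonpath='{.data.config\.yml}' | base64 --decode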
(ncn-mw#) Validate the configuration changes.
Get the redfish-exporter pod in sysmgmt-health namespace.
kubectl get pods -n sysmgmt-health | grep redfish
Example output:
cray-sysmgmt-health-redfish-exporter-86f7596c5-g6lxl 1/1 Running 0 3h25m
View the current configuration after a few minutes.
kubectl exec cray-sysmgmt-health-redfish-exporter-86f7596c5-g6lxl \
-n sysmgmt-health -c redfish-exporter -- cat /config/config.yml
If the configuration does not look accurate, check the logs for errors.
kubectl logs -f -n sysmgmt-health pod/cray-sysmgmt-health-redfish-exporter-86f7596c5-g6lxl
(ncn-mw#) Delete the redfish-exporter pod so that latest configuration is picked up.
Delete the redfish-exporter pod.
kubectl delete pod -n sysmgmt-health cray-sysmgmt-health-redfish-exporter-86f7596c5-g6lxl --force
Validate that the pod is running again after some time.
kubectl get pod -n sysmgmt-health | grep redfish
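Alternatively, pod readiness can be polled instead of re-running kubectl get. This sketch assumes the pod carries the app.kubernetes.io/name=redfish-exporter label from the chart metadata shown earlier:
kubectl wait --for=condition=Ready pod -n sysmgmt-health \
  -l app.kubernetes.io/name=redfish-exporter --timeout=120s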
The SMART data in Prometheus format would look like:
smartmon_temperature_celsius_raw_value{disk="/dev/sdk",host="kjlmo900.hpc.amslabs.hpecorp.net",endpoint="metrics", instance="10.252.1.6:9100", job="node-exporter", namespace="sysmgmt-health", pod="cray-sysmgmt-health-prometheus-node-exporter-74fd8",redfish_instance="10.214.132.198:9220",type="sas"} 33.0
smartmon_power_cycle_count_raw_value{disk="/dev/sdk",host="kjlmo900.hpc.amslabs.hpecorp.net",endpoint="metrics", instance="10.252.1.6:9100", job="node-exporter", namespace="sysmgmt-health", pod="cray-sysmgmt-health-prometheus-node-exporter-74fd8",redfish_instance="10.214.132.198:9220",type="sas"} 0.0
smartmon_power_on_hours_raw_value{disk="/dev/sdk",host="kjlmo900.hpc.amslabs.hpecorp.net",endpoint="metrics", instance="10.252.1.6:9100", job="node-exporter", namespace="sysmgmt-health", pod="cray-sysmgmt-health-prometheus-node-exporter-74fd8",redfish_instance="10.214.132.198:9220",type="sas"} 30531.0
smartmon_smartctl_run{disk="/dev/sdk",host="kjlmo900.hpc.amslabs.hpecorp.net",endpoint="metrics", instance="10.252.1.6:9100", job="node-exporter", namespace="sysmgmt-health", pod="cray-sysmgmt-health-prometheus-node-exporter-74fd8",redfish_instance="10.214.132.198:9220",type="sas"} 1.715076005e+09
smartmon_device_active{disk="/dev/sdm",host="kjlmo900.hpc.amslabs.hpecorp.net",endpoint="metrics", instance="10.252.1.6:9100", job="node-exporter", namespace="sysmgmt-health", pod="cray-sysmgmt-health-prometheus-node-exporter-74fd8",redfish_instance="10.214.132.198:9220",type="sas"} 1.0
NOTE: In the above metrics example, redfish_instance is the E1000 node primary management IP address;
instance is the master/worker node IP address where the Redfish Exporter pod is scheduled.
For open source Grafana dashboards, the instance variable in the dashboards needs to be replaced with redfish_instance in order to get the E1000 SMART data.
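The same metrics can also be queried through the Prometheus HTTP API. The following is a hedged sketch with a placeholder Prometheus endpoint (substitute the site's actual Prometheus URL) and the redfish_instance value from the example above:
curl -G 'http://<prometheus-host>:9090/api/v1/query' \
  --data-urlencode 'query=smartmon_temperature_celsius_raw_value{redfish_instance="10.214.132.198:9220"}'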