Shasta Liquid Cooled AMD EPYC compute node power management data available to users.
Shasta Liquid Cooled compute blade power management counters (pm_counters) give users access to energy usage over time for billing and job profiling.
The blade-level and node-level accumulated energy telemetry is point-in-time power data. Blade accumulated energy data is collected out-of-band and is made available via workload managers. Users have in-band access to the data at the node level via special sysfs files in /sys/cray/pm\_counters on the node.
Time-stamped energy data from each node can be captured for a specific job before, during, and after the job to generate a power profile about the job. This energy usage data can be used in conjunction with current energy costs to assign a monetary value to the job.
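
The before-and-after capture described above can be sketched as follows. This is a minimal illustration, not a supported tool: it assumes the first whitespace-separated field of the `energy` file is the integer joule count (the file layout is not specified in this section), and the helper names and the price used in the usage note are hypothetical.

```python
from pathlib import Path

PM_COUNTERS = Path("/sys/cray/pm_counters")  # directory named in the text
JOULES_PER_KWH = 3_600_000  # 1 kWh = 3.6e6 J


def read_energy_joules(counter_dir: Path = PM_COUNTERS) -> int:
    """Return the accumulated node energy in joules.

    Assumption: the first whitespace-separated field of the `energy`
    file is the integer joule count.
    """
    return int((counter_dir / "energy").read_text().split()[0])


def job_energy_cost(start_joules: int, end_joules: int, price_per_kwh: float) -> float:
    """Convert the energy used between two readings into a monetary cost."""
    used_kwh = (end_joules - start_joules) / JOULES_PER_KWH
    return used_kwh * price_per_kwh
```

A job wrapper would call `read_energy_joules()` once before the job starts and once after it ends, then pass both values to `job_energy_cost()` together with the site's current price per kWh. The result covers a single node; multi-node jobs need the per-node values summed.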
The node CPU vendor provides specific in-band and out-of-band interfaces for controlling power management. In-band interfaces are accessed from the node OS through /sys/cray/pm\_counters. Out-of-band interfaces are accessed from a node BMC or Redfish API.
Note that each node has a power supply that can support a fixed number of watts. The combined power consumption of the CPU and the accelerator can never exceed this limit, so power to either the CPU or the accelerator must be capped to stay within the total power available.
Access to compute node power and energy data is provided by a set of files located in /sys/cray/pm\_counters/ on the node. All pm_counters are accompanied by a timestamp.
File | Description |
---|---|
power | Point-in-time power (Watts). When accelerators are present, includes accel_power. See limitation below on data collection from accelerators. |
energy | Accumulated energy (Joules). When accelerators are present, includes accel_energy. See limitation below on data collection from accelerators. |
cpu_power | Point-in-time power (Watts) used by the CPU domain. |
cpu_energy | The total energy (Joules) used by the CPU domain. |
cpu_temp | Temperature reading (Celsius) of the CPU domain. |
memory_power | Point-in-time power (Watts) used by the memory domain. |
memory_energy | The total energy (Joules) used by the memory domain. |
accel_energy | Accumulated accelerator energy (Joules). The data is non-zero only when an accelerator is present on the node. |
accel_power | Accelerator point-in-time power (Watts). The data is non-zero only when an accelerator is present on the node. |
generation | A counter that increments each time a power cap value is changed. |
startup | Startup counter. |
freshness | Free-running counter that increments at a rate of approximately 10 Hz. |
version | Version number for power management counter support. |
power_cap | Current power cap limit in Watts; 0 indicates no capping. When accelerators are present, includes accel_power_cap. |
raw_scan_hz | The power management scanning rate for all data in pm_counters. |
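
As an illustration of reading these files from a job script, the sketch below parses a counter and takes a consistent snapshot of several counters. It rests on two assumptions that this section does not confirm: that each file holds a single line of the form `<value> [<unit> [<timestamp> us]]`, and that an unchanged `freshness` value before and after a read means the counters were not updated in between. The names `Reading`, `parse_counter`, `read_counter`, and `snapshot` are hypothetical.

```python
from pathlib import Path
from typing import Dict, Iterable, NamedTuple, Optional

PM_COUNTERS = Path("/sys/cray/pm_counters")  # directory named in the text


class Reading(NamedTuple):
    value: int
    unit: Optional[str]       # e.g. "W" or "J"; None if the file has no unit
    timestamp: Optional[int]  # None if the file carries no timestamp field


def parse_counter(text: str) -> Reading:
    """Parse one counter file.

    Assumed layout: "<value> [<unit> [<timestamp> us]]".
    """
    fields = text.split()
    value = int(fields[0])
    unit = fields[1] if len(fields) > 1 else None
    timestamp = int(fields[2]) if len(fields) > 2 else None
    return Reading(value, unit, timestamp)


def read_counter(name: str, counter_dir: Path = PM_COUNTERS) -> Reading:
    """Read and parse a single file such as `power` or `energy`."""
    return parse_counter((counter_dir / name).read_text())


def snapshot(names: Iterable[str], counter_dir: Path = PM_COUNTERS,
             retries: int = 5) -> Dict[str, Reading]:
    """Read several counters, retrying if `freshness` changed mid-read."""
    names = list(names)
    for _ in range(retries):
        before = read_counter("freshness", counter_dir).value
        readings = {n: read_counter(n, counter_dir) for n in names}
        after = read_counter("freshness", counter_dir).value
        if before == after:  # assumed to mean no update happened during the read
            return readings
    raise RuntimeError("counters kept updating during the read")
```

For example, `snapshot(["power", "energy", "cpu_power"])` would return one `Reading` per file. Because `freshness` advances at roughly 10 Hz, a retry is usually enough to land between updates.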