The management nodes serve Network Time Protocol (NTP) at stratum 10, except for ncn-m001, which serves at stratum 8 (or lower if an upstream NTP server is set). All management nodes peer with each other.
Until an upstream NTP server is configured, the time on the NCNs may not match the current time at the site, but they will stay in sync with each other.
The three methods for configuring NTP are described below. The first option is the recommended method.
Edit /etc/chrony.d/cray.conf and restart chronyd on each node.
ncn# vi /etc/chrony.d/cray.conf
ncn# systemctl restart chronyd
Edit the data.json file, restart basecamp, and run the NTP script on each node.
ncn-m001# vi data.json
ncn-m001# systemctl restart basecamp
Run the NTP script on each node.
ncn# /srv/cray/scripts/metal/set-ntp-config.sh
Edit the data.json file, restart basecamp, and restart nodes so cloud-init runs on boot.
ncn-m001# vi data.json
ncn-m001# systemctl restart basecamp
Reboot each node.
ncn# reboot
cloud-init caches data, so there could be inconsistent results with this method.
Verify NTP is configured correctly and troubleshoot any issues.
The chronyc command can be used to gather information on the state of NTP.
Check whether a given host is allowed NTP client access to this node.
This example checks whether 10.252.0.7 is allowed access.
ncn# chronyc accheck 10.252.0.7
208 Access allowed
Check the system clock performance.
ncn# chronyc tracking
Reference ID : 0AFC0104 (ncn-s003)
Stratum : 4
Ref time (UTC) : Mon Nov 30 20:02:24 2020
System time : 0.000007622 seconds slow of NTP time
Last offset : -0.000014609 seconds
RMS offset : 0.000015776 seconds
Frequency : 6.773 ppm fast
Residual freq : -0.000 ppm
Skew : 0.008 ppm
Root delay : 0.000075896 seconds
Root dispersion : 0.000484318 seconds
Update interval : 513.7 seconds
Leap status : Normal
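When checking many nodes, it can help to reduce the tracking output to the two values that matter most here: the stratum and the system clock offset. The following is a minimal sketch, not part of the CSM tooling; the function name tracking_summary is hypothetical, and it reads the tracking text on stdin so it can be tried without a running chronyd.

```bash
#!/usr/bin/env bash
# Sketch: summarize `chronyc tracking` output read from stdin.
# tracking_summary is a hypothetical helper name, not a CSM or chrony command.
tracking_summary() {
    # Split each line on the first-level "name : value" separator.
    awk -F' *: *' '
        $1 == "Stratum"     { stratum = $2 }
        $1 == "System time" {
            # Value looks like: "0.000007622 seconds slow of NTP time"
            split($2, parts, " ")
            offset = parts[1]
            direction = parts[3]
        }
        END { printf "stratum=%s offset=%s direction=%s\n", stratum, offset, direction }
    '
}
```

On a node, this would be used as `chronyc tracking | tracking_summary`.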
View information on drift and offset
ncn# chronyc sourcestats
210 Number of sources = 8
Name/IP Address NP NR Span Frequency Freq Skew Offset Std Dev
==============================================================================
ncn-w001 6 3 42m -0.029 0.126 +4104ns 28us
ncn-w002 6 6 42m -0.028 0.030 +44us 7278ns
ncn-w003 12 7 23m -0.059 0.023 -35us 8359ns
ncn-s002 36 17 213m -0.001 0.010 +5794ns 54us
ncn-s003 36 17 212m -0.000 0.007 -178ns 40us
ncn-m001 0 0 0 +0.000 2000.000 +0ns 4000ms
ncn-m002 28 15 192m -0.007 0.009 +9942ns 49us
ncn-m003 24 15 197m -0.005 0.009 +9442ns 46us
View the NTP servers, pools, and peers.
ncn# chronyc sources
210 Number of sources = 8
MS Name/IP address Stratum Poll Reach LastRx Last sample
===============================================================================
=? ncn-w001 4 9 377 435 +162us[ +164us] +/- 679us
=? ncn-w002 4 9 377 505 +118us[ +120us] +/- 277us
=? ncn-w003 4 7 377 82 +850ns[+2686ns] +/- 504us
=? ncn-s002 4 9 377 542 -38us[ -36us] +/- 892us
=* ncn-s003 3 9 377 19 +13us[ +15us] +/- 110us
=? ncn-m001 0 9 0 - +0ns[ +0ns] +/- 0ns
=? ncn-m002 4 8 377 161 -47us[ -45us] +/- 408us
=? ncn-m003 4 8 377 215 -11us[-9109ns] +/- 446us
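In the output above, the second character of the first column marks the source state (`*` is the currently selected source), and a Reach value of 0 means no recent responses were received from that source. As a rough sketch under those assumptions, the following hypothetical helper (sources_summary is not a CSM or chrony command) prints the selected source and any unreachable ones from text read on stdin.

```bash
#!/usr/bin/env bash
# Sketch: pick out the selected source and unreachable sources
# from `chronyc sources` output read on stdin.
sources_summary() {
    awk '
        # Data rows start with a two-character mode/state field such as "=*" or "^?".
        $1 ~ /^[=^#][*+?x~-]$/ {
            if (substr($1, 2, 1) == "*") print "selected " $2
            # Field 5 is the Reach register; 0 means no recent replies.
            if ($5 == "0")               print "unreachable " $2
        }
    '
}
```

On a node, this would be used as `chronyc sources | sources_summary`. For the example output above, it would report ncn-s003 as selected and ncn-m001 as unreachable.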
chrony Log Files
The chrony logs are stored in /var/log/chrony/.
If the time is out of sync, force a sync of NTP.
If Kubernetes or other services are already up, they do not always react well to a large time jump. Ideally, this action should be performed while the node is booting.
ncn# chronyc burst 4/4
Wait about 15 seconds while NTP measurements are gathered
ncn# sleep 15
Jump the clock manually
ncn# chronyc makestep
Older versions of CSM contained some NTP bugs that can carry forward through CSM upgrades. This can result in problems with time syncing correctly. This section describes how to diagnose and fix these.
These issues all relate to certain nodes not being in a correct state.
ncn-m001 should have these important settings in /etc/chrony.d/cray.conf:
server time.nist.gov iburst trust
# or
pool time.nist.gov iburst
# ncn-m001 should NOT use itself as a server and is known to cause issues
# this allows the clock to step itself during a restart without affecting running apps if it drifts more than 1 second
initstepslew 1 time.nist.gov
# the other ncns are set to 10, so in the event of a tie, ncn-m001 is chosen as the leader
local stratum 8 orphan
These settings ensure there is a low-stratum NTP server that ncn-m001 has access to. The other NCNs should instead have the following:
# all non-ncn-m001 NCNs use ncn-m001 as their server, and they trust it
server ncn-m001 iburst trust
# no pools are on the other ncns
# ncn-m001 should NOT use itself as a server and is known to cause issues
# this allows the clock to step itself during a restart without affecting running apps if it drifts more than 1 second
initstepslew 1 ncn-m001
# the ncns peer with each other at a high stratum, and choose ncn-m001 (stratum 8 or lower) in the event of a tie
local stratum 10 orphan
# The nodes should have a max of 9 peers and should not include themselves in the list
peer ncn-m001 minpoll -2 maxpoll 9 iburst
peer ncn-m003 minpoll -2 maxpoll 9 iburst
peer ncn-s001 minpoll -2 maxpoll 9 iburst
peer ncn-s002 minpoll -2 maxpoll 9 iburst
peer ncn-s003 minpoll -2 maxpoll 9 iburst
peer ncn-w001 minpoll -2 maxpoll 9 iburst
peer ncn-w002 minpoll -2 maxpoll 9 iburst
peer ncn-w003 minpoll -2 maxpoll 9 iburst
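A quick way to compare a node against the settings above is to check for the two directives that the known issues most often affect: the local stratum line and the initstepslew line. This is a minimal sketch only; check_cray_conf is a hypothetical helper, and it checks for the presence of the lines rather than validating the whole file.

```bash
#!/usr/bin/env bash
# Sketch: report missing directives in a chrony config file.
# Usage: check_cray_conf FILE EXPECTED_STRATUM
#   EXPECTED_STRATUM is 8 for ncn-m001 and 10 for the other NCNs, per the settings above.
check_cray_conf() {
    local conf="$1"
    local expected="$2"
    local status=0
    if ! grep -Eq "^local stratum ${expected} orphan\$" "$conf"; then
        echo "missing: local stratum ${expected} orphan"
        status=1
    fi
    if ! grep -Eq '^initstepslew ' "$conf"; then
        echo "missing: initstepslew"
        status=1
    fi
    return "$status"
}
```

For example, `check_cray_conf /etc/chrony.d/cray.conf 10` on a node other than ncn-m001 prints nothing and returns 0 when both lines are present.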
If nodes are missing metadata for NTP, you will need to generate the data using csi and your system's system_config.yaml. If you do not have your seed data in system_config.yaml, you will need to open a ticket for help generating the NTP data.
The following steps are structured to be executed on one node at a time. However, the step that runs the metadata script generates the relevant files for every node. If multiple nodes are missing NTP data in BSS, you can apply this fix to each node.
Update system_config.yaml to have the correct NTP settings:
ntp-servers:
- ncn-m001
- time.nist.gov
ntp-timezone: UTC
ncn# csi config init
In the generated system/basecamp directory, copy in and execute the metadata script that is included in the upgrade scripts of this documentation:
ncn# ./upgrade_ntp_timezone_metadata.sh
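Find the generated file for the node. The files are named in the form upgrade-metadata-000000000000.json, based on the MAC address of the node.
Find the xname of the node:
ncn# cat /etc/cray/xname
From ncn-m001, execute the following command to update BSS: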
ncn# csi handoff bss-update-cloud-init --user-data="upgrade-metadata-000000000000.json" --limit=<xname>
Copy the cloud-init code and template files from the scripts directory in the CSM documentation RPM:
ncn# cp ./usr/share/doc/csm/scripts/cc_ntp.py /usr/lib/python3.6/site-packages/cloudinit/config/cc_ntp.py
ncn# cp ./usr/share/doc/csm/scripts/chrony.conf.cray.tmpl /etc/cloud/templates/chrony.conf.cray.tmpl
Alternatively, you can download the latest versions from GitHub:
ncn# wget -O /usr/lib/python3.6/site-packages/cloudinit/config/cc_ntp.py https://raw.githubusercontent.com/Cray-HPE/metal-cloud-init/main/cloudinit/config/cc_ntp.py
ncn# wget -O /etc/cloud/templates/chrony.conf.cray.tmpl https://raw.githubusercontent.com/Cray-HPE/metal-cloud-init/main/config/cray.conf.j2
Remove pool.conf on all nodes:
ncn# rm /etc/chrony.d/pool.conf
Remove cray.conf.dist on all nodes:
ncn# rm /etc/chrony.d/cray.conf.dist
Comment out the lines that begin with ! in /etc/chrony.conf on all nodes:
ncn# sed -i 's/^\!/#/' /etc/chrony.conf
Restart chronyd on all nodes:
ncn# systemctl restart chronyd
ncn-m001
Most of the bugs from CSM 0.9 carried forward with upgrades. Most commonly, ncn-m001 is the problem because it either does not have a valid upstream server, or it has a bad configuration. This can be quickly remedied by running three commands to download the latest cc_ntp module, download an updated template, and re-run cloud-init.
ncn-m001# wget -O /usr/lib/python3.6/site-packages/cloudinit/config/cc_ntp.py https://raw.githubusercontent.com/Cray-HPE/metal-cloud-init/main/cloudinit/config/cc_ntp.py
ncn-m001# wget -O /etc/cloud/templates/chrony.conf.cray.tmpl https://raw.githubusercontent.com/Cray-HPE/metal-cloud-init/main/config/cray.conf.j2
ncn-m001# cloud-init single --name ntp --frequency always
The other NCNs sometimes have the wrong stratum set or are missing the initstepslew directive. These can be fixed fairly quickly with some sed commands:
Increase the stratum on NCNs (other than ncn-m001):
ncn# sed -i "s/local stratum 3 orphan/local stratum 10 orphan/" /etc/chrony.d/cray.conf
Add the initstepslew directive on a new line after the logchange directive:
ncn# sed -i "/^\(logchange 1.0\)\$/a initstepslew 1 ncn-m001" /etc/chrony.d/cray.conf
Restart chronyd
ncn# systemctl restart chronyd
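Running the second sed command above more than once would add a duplicate initstepslew line each time. The following is a sketch of a repeat-safe variant, assuming GNU sed; fix_ncn_chrony_conf is a hypothetical helper name, and unlike the commands above it anchors both patterns to whole lines. chronyd still needs to be restarted afterwards.

```bash
#!/usr/bin/env bash
# Sketch: apply both fixes to a chrony config so that re-running is harmless.
# Usage: fix_ncn_chrony_conf FILE   (intended for NCNs other than ncn-m001)
fix_ncn_chrony_conf() {
    local conf="$1"
    # Raise the stratum only where the old value of 3 is present.
    sed -i 's/^local stratum 3 orphan$/local stratum 10 orphan/' "$conf"
    # Add initstepslew after the logchange directive only if it is not already there.
    if ! grep -q '^initstepslew ' "$conf"; then
        sed -i '/^logchange 1\.0$/a initstepslew 1 ncn-m001' "$conf"
    fi
}
```

For example: `fix_ncn_chrony_conf /etc/chrony.d/cray.conf && systemctl restart chronyd`.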
This procedure needs to be completed on the PIT node before the other management nodes are deployed.
HPE Cray EX systems with CSM software have UTC as the default time zone. To change this, you will need to set an environment variable, as well as chroot into the node images and change some files there. You can find a list of timezones to use in the commands below by running timedatectl list-timezones.
Run the following commands, replacing America/Chicago with your timezone as needed.
pit# export NEWTZ=America/Chicago
pit# echo -e "\nTZ=${NEWTZ}" >> /etc/environment
pit# sed -i "s#^timedatectl set-timezone UTC#timedatectl set-timezone ${NEWTZ}#" /root/bin/configure-ntp.sh
pit# sed -i 's/--utc/--localtime/' /root/bin/configure-ntp.sh
pit# /root/bin/configure-ntp.sh
The configure-ntp.sh script should have the information for your local timezone in the output.
pit# /root/bin/configure-ntp.sh
Example output:
CURRENT TIME SETTINGS
rtc: 2021-03-26 11:34:45.873331+00:00
sys: 2021-03-26 11:34:46.015647+0000
200 OK
200 OK
NEW TIME SETTINGS
rtc: 2021-03-26 06:35:16.576477-05:00
sys: 2021-03-26 06:35:17.004587-0500
Verify the new timezone setting by running timedatectl and hwclock --verbose.
pit# timedatectl
Local time: Fri 2021-03-26 06:35:58 CDT
Universal time: Fri 2021-03-26 11:35:58 UTC
RTC time: Fri 2021-03-26 11:35:58
Time zone: America/Chicago (CDT, -0500)
Network time on: no
NTP synchronized: no
RTC in local TZ: no
pit# hwclock --verbose
hwclock from util-linux 2.33.1
System Time: 1616758841.688220
Trying to open: /dev/rtc0
Using the rtc interface to the clock.
Last drift adjustment done at 1616758836 seconds after 1969
Last calibration done at 1616758836 seconds after 1969
Hardware clock is on local time
Assuming hardware clock is kept in local time.
Waiting for clock tick...
...got clock tick
Time read from Hardware Clock: 2021/03/26 06:40:42
Hw clock time : 2021/03/26 06:40:42 = 1616758842 seconds since 1969
Time since last adjustment is 6 seconds
Calculated Hardware Clock drift is 0.000000 seconds
2021-03-26 06:40:41.685618-05:00
If the time is off and not accurate to your timezone, you will need to manually set the date and then run the NTP script again.
Manually set the time as close as possible to the real time.
pit# timedatectl set-time "2021-03-26 00:00:00"
Run the NTP script.
pit# /root/bin/configure-ntp.sh
The PIT is now configured to your local timezone.
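The echo command used earlier appends a new TZ= line to /etc/environment every time it is run, so repeating the procedure with a different timezone leaves several entries behind. A possible repeat-safe alternative is sketched below; set_tz_entry is a hypothetical helper, it assumes GNU sed, and it assumes the file already ends with a newline.

```bash
#!/usr/bin/env bash
# Sketch: keep exactly one TZ= entry in an environment file.
# Usage: set_tz_entry FILE TIMEZONE
set_tz_entry() {
    local env_file="$1"
    local tz="$2"
    if grep -q '^TZ=' "$env_file"; then
        # Replace the existing entry; '#' is the sed delimiter because zone names contain '/'.
        sed -i "s#^TZ=.*#TZ=${tz}#" "$env_file"
    else
        # No entry yet: append one (assumes the file ends with a newline).
        printf 'TZ=%s\n' "$tz" >> "$env_file"
    fi
}
```

For example: `set_tz_entry /etc/environment "${NEWTZ}"`.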
Adjust the node images so that they also boot in the local timezone. This is accomplished by chrooting into the unsquashed images, making some modifications, re-squashing them, and moving the new images into place. This is included as an optional image modification step in the two procedures below.
If the PIT node is booted, see Change NCN Image Root Password and SSH Keys on PIT Node for more information.
Note: When performing the csi handoff of NCN boot artifacts in Redeploy PIT Node, be sure to specify these new images. Otherwise, ncn-m001 will use the default timezone when it boots, and subsequent reboots of the other NCNs will also lose the customized timezone changes.
If the PIT node is not booted, see Change NCN Image Root Password and SSH Keys for more information.