Configure NTP on NCNs

The management nodes serve Network Time Protocol (NTP) at stratum 10, except for ncn-m001, which serves at stratum 8 (or lower if an upstream NTP server is set). All management nodes peer with each other.

Until an upstream NTP server is configured, the time on the NCNs may not match the current time at the site, but they will stay in sync with each other.

Change NTP Config

Three methods for configuring NTP are described below. The first is the recommended method.

  • Edit /etc/chrony.d/cray.conf and restart chronyd on each node.

    ncn# vi /etc/chrony.d/cray.conf
    ncn# systemctl restart chronyd
    
  • Edit the data.json file, restart basecamp, and run the NTP script on each node.

    ncn-m001# vi data.json
    ncn-m001# systemctl restart basecamp
    

    Run the NTP script on each node.

    ncn# /srv/cray/scripts/metal/set-ntp-config.sh
    
  • Edit the data.json file, restart basecamp, and restart nodes so cloud-init runs on boot.

    ncn-m001# vi data.json
    ncn-m001# systemctl restart basecamp
    

    Reboot each node.

    ncn# reboot
    

    cloud-init caches data, so there could be inconsistent results with this method.
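
The recommended method has to be repeated on every node after /etc/chrony.d/cray.conf has been edited there. The following is a minimal sketch of looping over the nodes. The function name restart_chronyd_on, the DRY_RUN variable, and the use of passwordless ssh between NCNs are assumptions, not part of the documented procedure. It defaults to a dry run that only prints the commands.

```shell
#!/usr/bin/env bash
# Hypothetical helper: restart chronyd on a list of NCNs over ssh.
# Assumes passwordless root ssh between NCNs; host names are supplied by the caller.
restart_chronyd_on() {
    local host
    for host in "$@"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            # Dry run (the default): print the command instead of running it.
            echo "ssh ${host} systemctl restart chronyd"
        else
            ssh "${host}" systemctl restart chronyd
        fi
    done
}
```

For example, `restart_chronyd_on ncn-m002 ncn-m003 ncn-w001` prints the three commands, and prefixing the call with `DRY_RUN=0` runs them.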

Troubleshooting NTP

Verify NTP is configured correctly and troubleshoot any issues.

The chronyc command can be used to gather information on the state of NTP.

  1. Check whether a given host is allowed NTP client access to this node.

    This example checks whether 10.252.0.7 is allowed to use this node as an NTP server.

    ncn# chronyc accheck 10.252.0.7
    208 Access allowed
    
  2. Check the system clock performance.

    ncn# chronyc tracking
    Reference ID    : 0AFC0104 (ncn-s003)
    Stratum         : 4
    Ref time (UTC)  : Mon Nov 30 20:02:24 2020
    System time     : 0.000007622 seconds slow of NTP time
    Last offset     : -0.000014609 seconds
    RMS offset      : 0.000015776 seconds
    Frequency       : 6.773 ppm fast
    Residual freq   : -0.000 ppm
    Skew            : 0.008 ppm
    Root delay      : 0.000075896 seconds
    Root dispersion : 0.000484318 seconds
    Update interval : 513.7 seconds
    Leap status     : Normal
    
  3. View information on drift and offset.

    ncn# chronyc sourcestats
    210 Number of sources = 8
    Name/IP Address            NP  NR  Span  Frequency  Freq Skew  Offset  Std Dev
    ==============================================================================
    ncn-w001                    6   3   42m     -0.029      0.126  +4104ns    28us
    ncn-w002                    6   6   42m     -0.028      0.030    +44us  7278ns
    ncn-w003                   12   7   23m     -0.059      0.023    -35us  8359ns
    ncn-s002                   36  17  213m     -0.001      0.010  +5794ns    54us
    ncn-s003                   36  17  212m     -0.000      0.007   -178ns    40us
    ncn-m001                    0   0     0     +0.000   2000.000     +0ns  4000ms
    ncn-m002                   28  15  192m     -0.007      0.009  +9942ns    49us
    ncn-m003                   24  15  197m     -0.005      0.009  +9442ns    46us
    
  4. View the NTP servers, pools, and peers.

    ncn# chronyc sources
    210 Number of sources = 8
    MS Name/IP address         Stratum Poll Reach LastRx Last sample
    ===============================================================================
    =? ncn-w001                      4   9   377   435   +162us[ +164us] +/-  679us
    =? ncn-w002                      4   9   377   505   +118us[ +120us] +/-  277us
    =? ncn-w003                      4   7   377    82   +850ns[+2686ns] +/-  504us
    =? ncn-s002                      4   9   377   542    -38us[  -36us] +/-  892us
    =* ncn-s003                      3   9   377    19    +13us[  +15us] +/-  110us
    =? ncn-m001                      0   9     0     -     +0ns[   +0ns] +/-    0ns
    =? ncn-m002                      4   8   377   161    -47us[  -45us] +/-  408us
    =? ncn-m003                      4   8   377   215    -11us[-9109ns] +/-  446us
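
The output of these commands can also be checked mechanically. The sketch below reads the command output on standard input. The function names unreachable_sources and tracking_offset are hypothetical, and the column positions are assumed to match the sample output shown above. In that sample, ncn-m001 has a Reach value of 0, meaning no recent sample was received from it.

```shell
#!/usr/bin/env bash
# Print the names of sources whose Reach register is 0.
# Reads `chronyc sources` output on stdin.
unreachable_sources() {
    # Data rows start with a two-character mode/state marker such as "=?" or "^*".
    # Columns: marker, name, stratum, poll, reach, ...
    awk '$1 ~ /^[=#^][*+?x~-]$/ && $5 == "0" { print $2 }'
}

# Print the "System time" offset in seconds, negative when the clock is slow.
# Reads `chronyc tracking` output on stdin.
tracking_offset() {
    awk -F': *' '/^System time/ {
        split($2, f, " ")
        sign = (f[3] == "slow") ? "-" : ""
        print sign f[1]
    }'
}
```

Typical use would be `chronyc sources | unreachable_sources` and `chronyc tracking | tracking_offset`.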
    

chrony Log Files

The chrony logs are stored in /var/log/chrony/.

Force a Time Sync

  1. If the time is out of sync, force a sync of NTP.

    If Kubernetes or other services are already up, they do not always react well to a large time jump. Ideally, this action should be performed while the node is booting.

    ncn# chronyc burst 4/4
    
  2. Wait about 15 seconds while NTP measurements are gathered.

    ncn# sleep 15
    
  3. Jump the clock manually.

    ncn# chronyc makestep
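
The three steps above can be wrapped in one function so that the clock is not stepped if the burst request fails. This is a sketch only: the function name force_time_sync and the WAIT_SECONDS variable are hypothetical, and the 15 second default simply mirrors the wait in step 2.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the burst / wait / makestep sequence.
# WAIT_SECONDS is an assumed knob; it defaults to the documented 15 seconds.
force_time_sync() {
    local wait_seconds="${WAIT_SECONDS:-15}"
    # Request a burst of measurements; stop here if chronyd rejects it.
    chronyc burst 4/4 || return 1
    # Give chronyd time to collect the measurements.
    sleep "${wait_seconds}"
    # Step the clock to the measured time.
    chronyc makestep || return 1
}
```

The same caution applies as for the manual steps: running services may not tolerate a large time jump.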
    

Known Issues and Bugs

Older versions of CSM contained some NTP bugs that can carry forward through CSM upgrades. This can result in problems with time syncing correctly. This section describes how to diagnose and fix these.

These issues all relate to certain nodes not being in a correct state.

Correct State

ncn-m001 should have these important settings in /etc/chrony.d/cray.conf:

server time.nist.gov iburst trust
# or
pool time.nist.gov iburst
# ncn-m001 should NOT use itself as a server; doing so is known to cause issues

# this allows the clock to step itself during a restart without affecting running apps if it drifts more than 1 second
initstepslew 1 time.nist.gov
# the other ncns are set to 10, so in the event of a tie, ncn-m001 is chosen as the leader
local stratum 8 orphan

These settings ensure there is a low-stratum NTP server that ncn-m001 has access to. The other NCNs should have the following settings:

# all non-ncn-m001 NCNs use ncn-m001 as their server, and they trust it
server ncn-m001 iburst trust
# no pools are on the other ncns
# ncn-m001 should NOT use itself as a server; doing so is known to cause issues

# this allows the clock to step itself during a restart without affecting running apps if it drifts more than 1 second
initstepslew 1 ncn-m001
# the ncns peer with each other at a high stratum, and choose ncn-m001 (stratum 8 or lower) in the event of a tie
local stratum 10 orphan

# The nodes should have a max of 9 peers and should not include themselves in the list
peer ncn-m001 minpoll -2 maxpoll 9 iburst
peer ncn-m003 minpoll -2 maxpoll 9 iburst
peer ncn-s001 minpoll -2 maxpoll 9 iburst
peer ncn-s002 minpoll -2 maxpoll 9 iburst
peer ncn-s003 minpoll -2 maxpoll 9 iburst
peer ncn-w001 minpoll -2 maxpoll 9 iburst
peer ncn-w002 minpoll -2 maxpoll 9 iburst
peer ncn-w003 minpoll -2 maxpoll 9 iburst
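
A configuration file for one of the non-ncn-m001 NCNs can be compared against the settings above with a few grep checks. The following is a minimal sketch; the function name check_ncn_chrony_conf and its messages are hypothetical, and it checks only the directives listed in this section.

```shell
#!/usr/bin/env bash
# Hypothetical check of a non-ncn-m001 cray.conf against the expected settings.
# Usage: check_ncn_chrony_conf <conf-file> <this-node-hostname>
# Prints one line per problem and returns non-zero if any problem is found.
check_ncn_chrony_conf() {
    local conf="$1" self="$2" rc=0
    grep -Eq '^local stratum 10 orphan$' "$conf" \
        || { echo "missing: local stratum 10 orphan"; rc=1; }
    grep -Eq '^initstepslew 1 ncn-m001$' "$conf" \
        || { echo "missing: initstepslew 1 ncn-m001"; rc=1; }
    grep -Eq '^server ncn-m001 ' "$conf" \
        || { echo "missing: server ncn-m001"; rc=1; }
    # A node should not include itself in its own peer list.
    if grep -Eq "^peer ${self}( |\$)" "$conf"; then
        echo "self listed as peer: ${self}"
        rc=1
    fi
    return "$rc"
}
```

On a node, this might be called as `check_ncn_chrony_conf /etc/chrony.d/cray.conf "$(hostname -s)"`.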

Quick Fixes

Fix BSS Metadata

If nodes are missing metadata for NTP, you will be required to generate the data using csi and your system’s system_config.yaml. If you do not have your seed data in the system_config.yaml then you will need to open a ticket to help generate the NTP data.

The following steps are structured to be executed on one node at a time. However, step #3 will generate all relevant files for each node. If multiple nodes are missing NTP data in BSS, you can apply this fix to each node.

  1. Update system_config.yaml to have the correct NTP settings:
    ntp-servers:
      - ncn-m001
      - time.nist.gov
    ntp-timezone: UTC
    
  2. Generate new configurations:
    ncn# csi config init
    
  3. In the newly created system/basecamp directory, copy in and execute the metadata script that is included in the upgrade scripts of this documentation:
    ncn# ./upgrade_ntp_timezone_metadata.sh
    
  4. Find the file(s) for the node(s) with missing metadata. The file name is based on the MAC address of the node, for example upgrade-metadata-000000000000.json.
  5. Find the component name (xname) for the node that needs to be fixed:
    ncn# cat /etc/cray/xname
    
  6. From ncn-m001 execute the following command to update BSS:
    ncn# csi handoff bss-update-cloud-init --user-data="upgrade-metadata-000000000000.json" --limit=<xname>
    
  7. Obtain the updated cloud-init code and template files from the scripts directory in the CSM documentation RPM:
    ncn# cp ./usr/share/doc/csm/scripts/cc_ntp.py /usr/lib/python3.6/site-packages/cloudinit/config/cc_ntp.py
    ncn# cp ./usr/share/doc/csm/scripts/chrony.conf.cray.tmpl /etc/cloud/templates/chrony.conf.cray.tmpl
    

    Alternatively, download the latest versions from GitHub:

    ncn# wget -O /usr/lib/python3.6/site-packages/cloudinit/config/cc_ntp.py https://raw.githubusercontent.com/Cray-HPE/metal-cloud-init/main/cloudinit/config/cc_ntp.py
    ncn# wget -O /etc/cloud/templates/chrony.conf.cray.tmpl https://raw.githubusercontent.com/Cray-HPE/metal-cloud-init/main/config/cray.conf.j2

  8. Continue with the upgrade.
  9. When the upgrade is completed, there might be extra files and configurations. If required, execute the following steps to clean up NTP:
    1. Remove pool.conf on all nodes:
      ncn# rm /etc/chrony.d/pool.conf
      
    2. Remove cray.conf.dist on all nodes:
      ncn# rm /etc/chrony.d/cray.conf.dist
      
    3. Comment out the default pool line in /etc/chrony.conf on all nodes:
      ncn# sed -i 's/^\!/#/' /etc/chrony.conf
      
    4. Restart chronyd on all nodes:
      ncn# systemctl restart chronyd
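
The file cleanup sub-steps above can be collected into a helper that is safe to run more than once. This is a sketch under assumptions: the function name cleanup_ntp_config is hypothetical, and the optional root prefix argument was added only so the helper can be tried against a scratch directory. It relies on GNU sed for in-place editing.

```shell
#!/usr/bin/env bash
# Hypothetical helper for the pool.conf, cray.conf.dist, and chrony.conf cleanup.
# Usage: cleanup_ntp_config [root-prefix]   (empty prefix acts on the live system)
cleanup_ntp_config() {
    local root="${1:-}"
    # rm -f does not fail when a file has already been removed.
    rm -f "${root}/etc/chrony.d/pool.conf" "${root}/etc/chrony.d/cray.conf.dist"
    # Equivalent to the documented substitution: turn a leading "!" into "#".
    sed -i 's/^!/#/' "${root}/etc/chrony.conf"
}
```

After running it on a node, restart chronyd as in the last sub-step.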
      
Fix ncn-m001

Most of the bugs from CSM 0.9 carried forward with upgrades. Most commonly, ncn-m001 is the problem because it either does not have a valid upstream server, or it has a bad configuration. This can be quickly remedied by running three commands to download the latest cc_ntp module, download an updated template, and re-run cloud-init.

ncn-m001# wget -O /usr/lib/python3.6/site-packages/cloudinit/config/cc_ntp.py https://raw.githubusercontent.com/Cray-HPE/metal-cloud-init/main/cloudinit/config/cc_ntp.py
ncn-m001# wget -O /etc/cloud/templates/chrony.conf.cray.tmpl https://raw.githubusercontent.com/Cray-HPE/metal-cloud-init/main/config/cray.conf.j2
ncn-m001# cloud-init single --name ntp --frequency always

Fix other NCNs

The other NCNs sometimes have the wrong stratum set or are missing the initstepslew directive. These can be added in fairly quickly with some sed commands:

Increase the stratum on NCNs (other than ncn-m001):

ncn# sed -i "s/local stratum 3 orphan/local stratum 10 orphan/" /etc/chrony.d/cray.conf

Add the initstepslew directive on a new line after the logchange directive:

ncn# sed -i "/^\(logchange 1.0\)\$/a initstepslew 1 ncn-m001" /etc/chrony.d/cray.conf

Restart chronyd:

ncn# systemctl restart chronyd
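
Running the second sed command twice adds the initstepslew line twice. A guarded variant avoids that. The sketch below is not part of the documented fix: the function name fix_ncn_chrony_conf is hypothetical, the patterns are anchored to whole lines (slightly stricter than the commands above), and GNU sed is assumed.

```shell
#!/usr/bin/env bash
# Hypothetical idempotent variant of the two sed edits.
# Usage: fix_ncn_chrony_conf [conf-file]   (defaults to /etc/chrony.d/cray.conf)
fix_ncn_chrony_conf() {
    local conf="${1:-/etc/chrony.d/cray.conf}"
    # Raise the local stratum; this is a no-op if it is already 10.
    sed -i 's/^local stratum 3 orphan$/local stratum 10 orphan/' "$conf"
    # Append initstepslew after logchange only if no initstepslew line exists yet.
    if ! grep -q '^initstepslew ' "$conf"; then
        sed -i '/^logchange 1\.0$/a initstepslew 1 ncn-m001' "$conf"
    fi
}
```

Restart chronyd afterwards, as in the manual procedure.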

Customize NTP

Set A Local Timezone

This procedure needs to be completed on the PIT node before the other management nodes are deployed.

Configure NTP on PIT to Local Timezone

HPE Cray EX systems with CSM software have UTC as the default time zone. To change this, you will need to set an environment variable, as well as chroot into the node images and change some files there. You can find a list of timezones to use in the commands below by running timedatectl list-timezones.

  1. Run the following commands, replacing America/Chicago with your timezone as needed.

    pit# export NEWTZ=America/Chicago
    pit# echo -e "\nTZ=${NEWTZ}" >> /etc/environment
    pit# sed -i "s#^timedatectl set-timezone UTC#timedatectl set-timezone ${NEWTZ}#" /root/bin/configure-ntp.sh
    pit# sed -i 's/--utc/--localtime/' /root/bin/configure-ntp.sh
    pit# /root/bin/configure-ntp.sh
    
  2. Confirm that the output of the configure-ntp.sh script shows the information for your local timezone.

    pit# /root/bin/configure-ntp.sh
    

    Example output:

    CURRENT TIME SETTINGS
    rtc: 2021-03-26 11:34:45.873331+00:00
    sys: 2021-03-26 11:34:46.015647+0000
    200 OK
    200 OK
    NEW TIME SETTINGS
    rtc: 2021-03-26 06:35:16.576477-05:00
    sys: 2021-03-26 06:35:17.004587-0500
    
  3. Verify the new timezone setting by running timedatectl and hwclock --verbose.

    pit# timedatectl
          Local time: Fri 2021-03-26 06:35:58 CDT
      Universal time: Fri 2021-03-26 11:35:58 UTC
            RTC time: Fri 2021-03-26 11:35:58
           Time zone: America/Chicago (CDT, -0500)
     Network time on: no
    NTP synchronized: no
     RTC in local TZ: no
    
    pit# hwclock --verbose
    hwclock from util-linux 2.33.1
    System Time: 1616758841.688220
    Trying to open: /dev/rtc0
    Using the rtc interface to the clock.
    Last drift adjustment done at 1616758836 seconds after 1969
    Last calibration done at 1616758836 seconds after 1969
    Hardware clock is on local time
    Assuming hardware clock is kept in local time.
    Waiting for clock tick...
    ...got clock tick
    Time read from Hardware Clock: 2021/03/26 06:40:42
    Hw clock time : 2021/03/26 06:40:42 = 1616758842 seconds since 1969
    Time since last adjustment is 6 seconds
    Calculated Hardware Clock drift is 0.000000 seconds
    2021-03-26 06:40:41.685618-05:00
    
  4. If the time is not accurate for your timezone, manually set the date and then run the NTP script again.

    Manually set the time as close as possible to the real time.

    pit# timedatectl set-time "2021-03-26 00:00:00"
    

    Run the NTP script.

    pit# /root/bin/configure-ntp.sh
    

    The PIT is now configured to your local timezone.
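
A mistyped timezone name in step 1 would be written into /etc/environment and configure-ntp.sh before anything rejects it. A small pre-check against the timedatectl list can catch that first. This is a sketch; the function name tz_in_list is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical pre-check: succeed only if the zone name given as $1
# appears as a complete line in the zone list read from stdin.
tz_in_list() {
    # -F: fixed string, -x: match the whole line, -q: no output.
    grep -Fxq -- "$1"
}
```

It could be used before the export in step 1, for example `timedatectl list-timezones | tz_in_list "America/Chicago" || echo "unknown timezone"`.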

Configure NCN Images to Use Local Timezone

Adjust the node images so that they also boot in the local timezone. This is accomplished by chrooting into the unsquashed images, making some modifications, re-squashing them, and moving the new images into place. This is included as an optional image modification step in the two procedures below.