The Pre-Install Toolkit (PIT) node needs to be bootstrapped from the LiveCD. There are two media available to bootstrap the PIT node: the RemoteISO or a bootable USB device. This procedure describes using the RemoteISO. If not using the RemoteISO, see Bootstrap PIT Node from LiveCD USB.
The installation process is similar to the USB-based installation, with adjustments to account for the lack of removable storage.
Important: Before starting this procedure, be sure to complete the procedure to Prepare Configuration Payload for the relevant installation scenario.
The LiveCD Remote ISO has known compatibility issues for nodes from certain vendors.
Warning: If this is a re-installation on a system that still has a USB device from a prior installation, then that USB device must be wiped before continuing. Failing to wipe the USB device, if present, may result in its stale content being booted or used unintentionally later in the install. If the USB device is still booted, then it can wipe itself using the basic wipe from Wipe NCN Disks for Reinstallation. If it is not booted, either boot it and wipe it, or disable the USB ports in the BIOS (not available for all vendors).
Obtain and attach the LiveCD cray-pre-install-toolkit
ISO file to the BMC. Depending on the vendor of the node,
the instructions for attaching to the BMC will differ.
The CSM software release should be downloaded and expanded for use.
Important: To ensure that the CSM release plus any patches, workarounds, or hot fixes are included, follow the instructions in Update CSM Product Stream.
The cray-pre-install-toolkit ISO and other files are now available in the directory from the extracted CSM tar file.
The ISO will have a name similar to cray-pre-install-toolkit-sle15sp2.x86_64-1.4.10-20210514183447-gc054094.iso.
This ISO file can be extracted from the CSM release tar file using the following command:
linux# tar --wildcards --no-anchored -xzvf <csm-release>.tar.gz 'cray-pre-install-toolkit-*.iso'
For this release of the CSM software, the cray-pre-install-toolkit ISO should be placed on a server which the PIT node will be able to contact using HTTP or HTTPS.
Note: A shorter path name is better than a long path name on the webserver. The ISO extracted from the CSM release tar file will have a long filename similar to cray-pre-install-toolkit-sle15sp2.x86_64-1.4.10-20210514183447-gc054094.iso, so pick a shorter name on the webserver.
See the respective procedure below to attach an ISO.
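Before attaching the ISO, it can help to confirm that it is reachable over HTTP from the network. This is a minimal sketch, not part of the official procedure; the webserver document root, hostname, and pit.iso name are hypothetical placeholders, so use whatever web service the site already provides.
linux# cp cray-pre-install-toolkit-*.iso /srv/www/htdocs/pit.iso    # on the webserver host; short name and path are hypothetical
linux# curl -sSI http://webserver.example.com/pit.iso | head -n 1   # from any host with network access; expect an HTTP 200 status line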
The chosen procedure should have rebooted the server. Observe the server boot into the LiveCD.
On first login (over SSH or at the local console), the LiveCD will prompt the administrator to change the password.
The initial password is empty; enter the username root and press return twice.
pit login: root
Expected output looks similar to the following:
Password: <-------just press Enter here for a blank password
You are required to change your password immediately (administrator enforced)
Changing password for root.
Current password: <------- press Enter here, again, for a blank password
New password: <------- type new password
Retype new password:<------- retype new password
Welcome to the CRAY Pre-Install Toolkit (LiveOS)
Set up the initial typescript.
pit# cd ~
pit# script -af csm-install-remoteiso.$(date +%Y-%m-%d).txt
pit# export PS1='\u@\H \D{%Y-%m-%d} \t \w # '
Set up the site-link, enabling SSH to work. You can reconnect with SSH after this step.
NOTICE REGARDING DHCP
If your site's network authority or network administrator has already provisioned an IPv4 address for the external NIC(s) of the master node(s), then skip this step.
Setup variables.
# The IPv4 address for the node's external interface(s); if not already known, this will be provided by the site's network administrator or network authority.
pit# site_ip=172.30.XXX.YYY/20
pit# site_gw=172.30.48.1
pit# site_dns=172.30.84.40
# The actual NIC name(s) for the external site interface; typically the first onboard NIC or the first 1GbE PCIe (RJ-45) NIC.
pit# site_nics='p2p1 p2p2 p2p3'
# another example:
pit# site_nics=em1
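If the correct NIC names are not known, the available interfaces and their link states can be listed first. This is a quick check only; the interface names that appear will vary by vendor.
pit# ip -br link show
pit# ip -br addr show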
Run the link setup script.
NOTE: USAGE
All of the /root/bin/csi-* scripts are harmless to run without parameters; doing so will print their usage statements.
pit# /root/bin/csi-setup-lan0.sh $site_ip $site_gw $site_dns $site_nics
Print the lan0 configuration. If it has an IP address, then exit the console and log in again using SSH.
pit# ip a show lan0
pit# exit
external# ssh root@${SYSTEM_NAME}-ncn-m001
(Recommended) After reconnecting, resume the typescript (the -a flag appends to the existing typescript file).
pit# cd ~
pit# script -af $(ls -tr csm-install-remoteiso* | head -n 1)
pit# export PS1='\u@\H \D{%Y-%m-%d} \t \w # '
Check hostname.
pit# hostnamectl
Note:
- The hostname should be similar to eniac-ncn-m001-pit when booted from the LiveCD, but it will be shown as pit# in the documentation command prompts from this point onward.
- If the hostname returned by the hostnamectl command is pit, then re-run the csi-set-hostname.sh script with the same parameters. Otherwise, an administrator should set the hostname manually with hostnamectl. In the latter case, do not confuse other administrators by using the hostname ncn-m001. Append the -pit suffix, indicating that the node is booted from the LiveCD.
Find a local disk for storing product installers.
pit# disk="$(lsblk -l -o SIZE,NAME,TYPE,TRAN | grep -E '(sata|nvme|sas)' | sort -h | awk '{print $2}' | head -n 1 | tr -d '\n')"
pit# echo $disk
pit# parted --wipesignatures -m --align=opt --ignore-busy -s /dev/$disk -- mklabel gpt mkpart primary ext4 2048s 100%
pit# mkfs.ext4 -L PITDATA "/dev/${disk}1"
In some cases the parted
command may give an error similar to the following:
Error: Partition(s) 4 on /dev/sda have been written, but we have been unable to inform the kernel of the change, probably
because it/they are in use. As a result, the old partition(s) will remain in use. You should reboot now before making
further changes.
In that case, the following steps may resolve the problem without needing to reboot. These commands will remove volume groups and RAID arrays that may be using the disk. They only need to be run if the earlier parted command failed.
pit# RAIDS=$(grep "${disk}[0-9]" /proc/mdstat | awk '{ print "/dev/"$1 }')
pit# echo $RAIDS
pit# VGS=$(echo $RAIDS | xargs -r pvs --noheadings -o vg_name 2>/dev/null)
pit# echo $VGS
pit# echo $VGS | xargs -r -t -n 1 vgremove -f -v
pit# echo $RAIDS | xargs -r -t -n 1 mdadm -S -f -v
After running the above procedure, retry the parted
command which failed. If it succeeds, resume the install from that point.
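Before mounting, the new partition and its PITDATA label can be confirmed. This is a quick check only, reusing the $disk variable set above.
pit# lsblk -f /dev/${disk}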
Mount the local disk, checking the output of each command as it runs.
pit# mount -v -L PITDATA
pit# pushd /var/www/ephemeral
pit# mkdir -v admin prep prep/admin configs data
Quit the typescript session with the exit command, copy the file (csm-install-remoteiso.<date>.txt) from its initial location to the newly created directory, and restart the typescript.
pit# exit # The typescript
pit# cp -v ~/csm-install-remoteiso.*.txt /var/www/ephemeral/prep/admin
pit# cd /var/www/ephemeral/prep/admin
pit# script -af $(ls -tr csm-install-remoteiso* | head -n 1)
pit# export PS1='\u@\H \D{%Y-%m-%d} \t \w # '
pit# pushd /var/www/ephemeral
Download the CSM software release to the PIT node.
Important: In an earlier step, the CSM release plus any patches, workarounds, or hot fixes were downloaded to a system using the instructions in Update CSM Product Stream. Either copy the release from that system to the PIT node, or set the ENDPOINT variable to the URL and use wget.
Set helper variables.
pit# ENDPOINT=https://arti.dev.cray.com/artifactory/shasta-distribution-stable-local/csm
pit# export CSM_RELEASE=csm-x.y.z
pit# export SYSTEM_NAME=eniac
Save the CSM_RELEASE and SYSTEM_NAME variables for use later; all subsequent shell sessions will have these variables set.
The echo prepends a newline to ensure that the variable assignments start on their own line, and not at the end of another.
pit# echo -e "\nCSM_RELEASE=${CSM_RELEASE}\nSYSTEM_NAME=${SYSTEM_NAME}" >>/etc/environment
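To confirm that the variables were appended as intended, inspect the end of the file (a quick check):
pit# tail -n 3 /etc/environment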
Fetch the release tar
file.
pit# wget ${ENDPOINT}/${CSM_RELEASE}.tar.gz -O /var/www/ephemeral/${CSM_RELEASE}.tar.gz
Expand the tar
file on the PIT node.
Note: Expansion of the
tar
file may take more than 45 minutes.
pit# tar -zxvf ${CSM_RELEASE}.tar.gz
pit# ls -l ${CSM_RELEASE}
Copy the artifacts into place.
pit# mkdir -pv data/{k8s,ceph}
pit# rsync -a -P --delete ./${CSM_RELEASE}/images/kubernetes/ ./data/k8s/
pit# rsync -a -P --delete ./${CSM_RELEASE}/images/storage-ceph/ ./data/ceph/
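To sanity-check the copies before moving on, compare the sizes of the copied directories against their sources (a quick check):
pit# du -sh ./${CSM_RELEASE}/images/kubernetes ./data/k8s
pit# du -sh ./${CSM_RELEASE}/images/storage-ceph ./data/ceph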
The PIT ISO, Helm charts/images, and bootstrap RPMs are now available in the extracted CSM tar file.
Install/upgrade the CSI and testing RPMs.
pit# rpm -Uvh --force \
$(find ./${CSM_RELEASE}/rpm/ -name "cray-site-init-*.x86_64.rpm" | sort -V | tail -1) \
$(find ./${CSM_RELEASE}/rpm/ -name "hpe-csm-goss-package*.rpm" | sort -V | tail -1) \
$(find ./${CSM_RELEASE}/rpm/ -name "csm-testing*.rpm" | sort -V | tail -1) \
$(find ./${CSM_RELEASE}/rpm/ -name "goss-servers*.rpm" | sort -V | tail -1)
Show the version of CSI installed.
pit# csi version
Expected output looks similar to the following:
CRAY-Site-Init build signature...
Build Commit : b3ed3046a460d804eb545d21a362b3a5c7d517a3-release-shasta-1.4
Build Time : 2021-02-04T21:05:32Z
Go Version : go1.14.9
Git Version : b3ed3046a460d804eb545d21a362b3a5c7d517a3
Platform : linux/amd64
App. Version : 1.5.18
Download and install/upgrade the workaround and documentation RPMs.
If this machine does not have direct Internet access, these RPMs will need to be externally downloaded and then copied to the system.
Important: In an earlier step, the CSM release plus any patches, workarounds, or hot fixes were downloaded to a system using the instructions in Check for Latest Workarounds and Documentation Updates. Use that set of RPMs rather than downloading again.
linux# wget https://storage.googleapis.com/csm-release-public/shasta-1.5/docs-csm/docs-csm-latest.noarch.rpm
linux# wget https://storage.googleapis.com/csm-release-public/shasta-1.5/csm-install-workarounds/csm-install-workarounds-latest.noarch.rpm
linux# scp -p docs-csm-*rpm csm-install-workarounds-*rpm ncn-m001:/root
linux# ssh ncn-m001
pit# rpm -Uvh --force docs-csm-latest.noarch.rpm
pit# rpm -Uvh --force csm-install-workarounds-latest.noarch.rpm
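To confirm that both RPMs are installed, query them by package name (names taken from the files downloaded above):
pit# rpm -q docs-csm csm-install-workarounds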
Some files are needed for generating the configuration payload. See the Command Line Configuration Payload and Configuration Payload Files topics if one has not already prepared the information for this system.
Create the hmn_connections.json
file by following the Create HMN Connections JSON procedure. Return to this section when completed.
Create the configuration input files if needed and copy them into the preparation directory.
The preparation directory is ${PITDATA}/prep
.
Copy these files into the preparation directory, or create them if this is an initial install of the system:
- application_node_config.yaml (optional - see below)
- cabinets.yaml (optional - see below)
- hmn_connections.json
- ncn_metadata.csv
- switch_metadata.csv
- system_config.yaml (only available after first-install generation of system files)
The optional application_node_config.yaml file may be provided for further definition of settings relating to how application nodes will appear in HSM for roles and subroles. See Create Application Node YAML.
The optional cabinets.yaml file allows cabinet naming and numbering as well as some VLAN overrides. See Create Cabinets YAML.
The system_config.yaml file is generated by the csi tool during the first install of a system, and can later be used for reinstalls of the system. For the initial install, the information in it must be provided as command line arguments to csi config init.
Change into the preparation directory.
linux# mkdir -pv /var/www/ephemeral/prep
linux# cd /var/www/ephemeral/prep
After gathering the files into this working directory, generate your configurations.
If doing a reinstall and the system_config.yaml parameter file is available, then generate the system configuration by reusing this parameter file (see avoiding parameters).
If not doing a reinstall of Shasta software, then the system_config.yaml file will not be available, so skip the rest of this step.
Check for the configuration files. The needed files should be in the current directory.
linux# ls -1
Expected output looks similar to the following:
application_node_config.yaml
cabinets.yaml
hmn_connections.json
ncn_metadata.csv
switch_metadata.csv
system_config.yaml
Generate the system configuration.
Note: Ensure that you specify a reachable NTP pool or server using the ntp-pools or ntp-servers fields, respectively. Adding an unreachable server can cause clock skew as chrony continually tries to reach a server it can never reach.
linux# csi config init
A new directory matching the system-name
field in system_config.yaml
will now exist in the working directory.
Note: These warnings from csi config init for issues in hmn_connections.json can be ignored.
- The node with the external connection (ncn-m001) will have a warning similar to this because its BMC is connected to the site and not the HMN like the other management NCNs. It can be ignored.
  "Couldn't find switch port for NCN: x3000c0s1b0"
- An unexpected component may have this message. If this component is an application node with an unusual prefix, it should be added to the application_node_config.yaml file. Then rerun csi config init. See the procedure to Create Application Node Config YAML.
  {"level":"warn","ts":1610405168.8705149,"msg":"Found unknown source prefix! If this is expected to be an Application node, please update application_node_config.yaml","row": {"Source":"gateway01","SourceRack":"x3000","SourceLocation":"u33","DestinationRack":"x3002","DestinationLocation":"u48","DestinationPort":"j29"}}
- If a cooling door is found in hmn_connections.json, there may be a message like the following. It can be safely ignored.
  {"level":"warn","ts":1612552159.2962296,"msg":"Cooling door found, but xname does not yet exist for cooling doors!","row": {"Source":"x3000door-Motiv","SourceRack":"x3000","SourceLocation":" ","DestinationRack":"x3000","DestinationLocation":"u36","DestinationPort":"j27"}}
Skip the next step and continue to the CSI Workarounds.
If doing a first-time install, or if the system_config.yaml parameter file for a reinstall is not available, generate the system configuration.
If doing a first-time install, this step is required. If the previous step was done as part of a reinstall, skip this step.
Check for the configuration files. The needed files should be in the current directory.
linux# ls -1
Expected output looks similar to the following:
application_node_config.yaml
cabinets.yaml
hmn_connections.json
ncn_metadata.csv
switch_metadata.csv
Generate the system configuration.
Notes:
- Run csi config init --help to print a full list of parameters that must be set. These will vary significantly depending on the system and site configuration.
- Ensure that you specify a reachable NTP pool or server using the --ntp-pools or --ntp-servers flags, respectively. Adding an unreachable server can cause clock skew as chrony continually tries to reach a server it can never reach.
linux# csi config init <options>
A new directory matching the system-name
field in system_config.yaml
will now exist in the working directory.
Important: After generating a configuration, a visual audit of the generated files for network data should be performed.
Special Notes: Certain parameters to csi config init may be hard to grasp on first-time configuration generations. Notes about the parameters follow (an illustrative invocation appears after these notes):
- The optional application_node_config.yaml file is used to map prefixes in hmn_connections.csv to HSM subroles. A command line option is required in order for csi to use the file. See Create Application Node YAML.
- The bootstrap-ncn-bmc-user and bootstrap-ncn-bmc-pass must match what is used for the BMC account and its password for the management NCNs.
- Set site parameters (site-domain, site-ip, site-gw, site-nic, site-dns) for the network information which connects ncn-m001 (the PIT node) to the site. The site-nic is the interface on ncn-m001 that is connected to the site network.
- There are other interfaces possible, but the install-ncn-bond-members are typically:
  - p1p1,p10p1 for HPE nodes
  - p1p1,p1p2 for Gigabyte nodes
  - p801p1,p801p2 for Intel nodes
- If not using a cabinets-yaml file, then set the three cabinet parameters (mountain-cabinets, hill-cabinets, and river-cabinets) to the quantity of each cabinet type included in this system.
- The starting cabinet number for each type of cabinet (for example, starting-mountain-cabinet) has a default that can be overridden. See csi config init --help.
- For systems that use non-sequential cabinet ID numbers, use the cabinets-yaml argument to include the cabinets.yaml file. This file gives the ability to explicitly specify the ID of every cabinet in the system. When specifying a cabinets.yaml file with the cabinets-yaml argument, other command line arguments related to cabinets will be ignored by csi. See Create Cabinets YAML.
- An override to default cabinet IPv4 subnets can be made with the hmn-mtn-cidr and nmn-mtn-cidr parameters.
- By default, spine switches are used as MetalLB peers. Use --bgp-peers aggregation to use aggregation switches instead.
- Several parameters (can-gateway, can-cidr, can-static-pool, can-dynamic-pool) describe the CAN (Customer Access Network). The can-gateway is the common gateway IP address used for both spine switches, commonly referred to as the Virtual IP address for the CAN. The can-cidr is the IP subnet for the CAN assigned to this system. The can-static-pool and can-dynamic-pool are the MetalLB static and dynamic address pools for the CAN. The can-external-dns is the static IP address assigned to the DNS instance running in the cluster, to which requests for the cluster subdomain will be forwarded. The can-external-dns IP address must be within the can-static-pool range.
- Set ntp-pools to reachable NTP pools.
Note: These warnings from csi config init for issues in hmn_connections.json can be ignored.
- The node with the external connection (ncn-m001) will have a warning similar to this because its BMC is connected to the site and not the HMN like the other management NCNs. It can be ignored.
  "Couldn't find switch port for NCN: x3000c0s1b0"
- An unexpected component may have this message. If this component is an application node with an unusual prefix, it should be added to the application_node_config.yaml file. Then rerun csi config init. See the procedure to Create Application Node Config YAML.
  {"level":"warn","ts":1610405168.8705149,"msg":"Found unknown source prefix! If this is expected to be an Application node, please update application_node_config.yaml","row": {"Source":"gateway01","SourceRack":"x3000","SourceLocation":"u33","DestinationRack":"x3002","DestinationLocation":"u48","DestinationPort":"j29"}}
- If a cooling door is found in hmn_connections.json, there may be a message like the following. It can be safely ignored.
  {"level":"warn","ts":1612552159.2962296,"msg":"Cooling door found, but xname does not yet exist for cooling doors!","row": {"Source":"x3000door-Motiv","SourceRack":"x3000","SourceLocation":" ","DestinationRack":"x3000","DestinationLocation":"u36","DestinationPort":"j27"}}
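The following invocation is illustrative only: every value shown (system name, addresses, credentials, NTP pool) is a placeholder, the flag names simply mirror the parameters described in the notes above, and this is not the complete set of options a real system requires. Confirm the exact flags and values against csi config init --help and the site survey before running the command.
# Illustrative example only -- all values below are placeholders
linux# csi config init \
    --system-name eniac \
    --site-domain example.com \
    --site-ip 172.30.53.79/20 \
    --site-gw 172.30.48.1 \
    --site-dns 172.30.84.40 \
    --site-nic p1p2 \
    --install-ncn-bond-members p1p1,p10p1 \
    --bootstrap-ncn-bmc-user root \
    --bootstrap-ncn-bmc-pass changeme \
    --can-cidr 10.102.9.0/24 \
    --can-gateway 10.102.9.20 \
    --can-static-pool 10.102.9.112/28 \
    --can-dynamic-pool 10.102.9.128/25 \
    --can-external-dns 10.102.9.113 \
    --ntp-pools time.example.com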
Link the generated system_config.yaml file into the prep/ directory. This is needed for pit-init to find and resolve the file.
NOTE: This step is needed only for fresh installs where system_config.yaml is missing from the prep/ directory.
pit# cd ${PITDATA}/prep && ln ${SYSTEM_NAME}/system_config.yaml
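To confirm that the link exists in the prep/ directory, list it (a quick check):
pit# ls -l ${PITDATA}/prep/system_config.yaml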
Continue with the next step to apply the csi-config workarounds.
Follow the workaround instructions for the csi-config
breakpoint.
Copy the interface configuration files generated earlier by csi config init into /etc/sysconfig/network/ using the first option below, or set up the interfaces with the provided scripts in the second option.
Option 1: Copy PIT files.
pit# cp -pv /var/www/ephemeral/prep/${SYSTEM_NAME}/pit-files/* /etc/sysconfig/network/
pit# wicked ifreload all
pit# systemctl restart wickedd-nanny && sleep 5
Option 2: Set up the VLAN interfaces by hand using the provided scripts.
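These scripts expect each VLAN's CIDR as an argument. The variable assignments below are a hedged sketch with placeholder values taken from the example addresses shown in the wicked output later in this step; substitute the CIDRs generated for this system.
pit# nmn_cidr=10.252.1.4/17    # placeholder NMN address/prefix for the PIT node
pit# hmn_cidr=10.254.1.4/17    # placeholder HMN address/prefix for the PIT node
pit# can_cidr=10.102.9.5/24    # placeholder CAN address/prefix for the PIT node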
pit# /root/bin/csi-setup-vlan002.sh $nmn_cidr
pit# /root/bin/csi-setup-vlan004.sh $hmn_cidr
pit# /root/bin/csi-setup-vlan007.sh $can_cidr
Check that IP addresses are set for each interface, and investigate any failures.
Check the IP addresses. If any are missing, do not run the tests; triage the issue instead.
pit# wicked show bond0 vlan002 vlan004 vlan007
bond0 up
link: #7, state up, mtu 1500
type: bond, mode ieee802-3ad, hwaddr b8:59:9f:fe:49:d4
config: compat:suse:/etc/sysconfig/network/ifcfg-bond0
leases: ipv4 static granted
addr: ipv4 10.1.1.2/16 [static]
vlan002 up
link: #8, state up, mtu 1500
type: vlan bond0[2], hwaddr b8:59:9f:fe:49:d4
config: compat:suse:/etc/sysconfig/network/ifcfg-vlan002
leases: ipv4 static granted
addr: ipv4 10.252.1.4/17 [static]
route: ipv4 10.92.100.0/24 via 10.252.0.1 proto boot
vlan007 up
link: #9, state up, mtu 1500
type: vlan bond0[7], hwaddr b8:59:9f:fe:49:d4
config: compat:suse:/etc/sysconfig/network/ifcfg-vlan007
leases: ipv4 static granted
addr: ipv4 10.102.9.5/24 [static]
vlan004 up
link: #10, state up, mtu 1500
type: vlan bond0[4], hwaddr b8:59:9f:fe:49:d4
config: compat:suse:/etc/sysconfig/network/ifcfg-vlan004
leases: ipv4 static granted
addr: ipv4 10.254.1.4/17 [static]
Run tests, inspect failures.
pit# csi pit validate --network
Copy the service configuration files generated earlier by csi config init for dnsmasq, Metal Basecamp (cloud-init), and ConMan.
Copy files (files only; -r is expressly not used).
pit# cp -pv /var/www/ephemeral/prep/${SYSTEM_NAME}/dnsmasq.d/* /etc/dnsmasq.d/
pit# cp -pv /var/www/ephemeral/prep/${SYSTEM_NAME}/conman.conf /etc/conman.conf
pit# cp -pv /var/www/ephemeral/prep/${SYSTEM_NAME}/basecamp/* /var/www/ephemeral/configs/
Enable and fully restart all PIT services.
pit# systemctl enable basecamp nexus dnsmasq conman
pit# systemctl stop basecamp nexus dnsmasq conman
pit# systemctl start basecamp nexus dnsmasq conman
Start and configure NTP on the LiveCD for a fallback/recovery server.
pit# /root/bin/configure-ntp.sh
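To spot-check time synchronization after the script completes (assuming chrony is the time daemon, as referenced earlier in this section):
pit# chronyc sources
pit# chronyc tracking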
Check that the services are ready and investigate any test failures.
pit# csi pit validate --services
Mount a shim to match the SHASTA-CFG
steps’ directory structure.
pit# mkdir -vp /mnt/pitdata
pit# mount -v -L PITDATA /mnt/pitdata
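To verify that the PITDATA label mounted at the expected path before continuing (a quick check):
pit# df -h /mnt/pitdata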
The following procedure will set up customized CA certificates for deployment using SHASTA-CFG.
Follow the site-init preparation procedure to create and prepare the site-init directory for your system.
After completing this procedure, the next step is to configure the management network switches.