This page walks the user through setting up the Cray LiveCD in preparation for installing Cray System Management (CSM).
Before proceeding, ensure that the other NCNs are powered off, that their BMCs' IP source is set to DHCP, and that external connectivity is working.
DHCP and external connectivity are required in order to download the CSM tarball.
NOTE: Each step denotes where its commands must run: `external#` refers to a server that is not the Cray, whereas `pit#` refers to the LiveCD itself.
On the first login, configure and verify the site link, DNS, and gateway IP addresses.
(pit#) Configure the site link (`lan0`), DNS, and gateway IP addresses. (Optional) At this stage, the admin node password can also be changed.
Set the `site_ip` variable.
Set the `site_ip` value in CIDR format (`A.B.C.D/N`):
site_ip=<IP CIDR>
Set the `site_gw` and `site_dns` variables.
Set the `site_gw` and `site_dns` values in IPv4 dotted decimal format (`A.B.C.D`):
site_gw=<Gateway IP address>
site_dns=<DNS IP address>
Set the `site_nics` variable.
The `site_nics` value or values are determined while in the LiveCD (for example, `site_nics='p2p1 p2p2 p2p3'` or `site_nics=em1`).
site_nics='<site NIC or NICs>'
Set the `SYSTEM_NAME` variable.
`SYSTEM_NAME` is the name of the system; it is used only for the PIT hostname.
This variable is capitalized because it will be used again in a subsequent section.
SYSTEM_NAME=<system name>
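Before running the setup script below, it can help to sanity-check the formats of the values just entered. The following sketch is not part of the official procedure; the sample values are placeholders, and the patterns are coarse format checks only (they do not range-check octets):

```shell
#!/bin/sh
# Hypothetical sample values -- replace with the real site settings.
site_ip='10.1.1.5/24'
site_gw='10.1.1.1'
site_dns='10.1.1.2'

# A.B.C.D/N -- coarse CIDR format check.
is_cidr() { echo "$1" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}/[0-9]{1,2}$'; }
# A.B.C.D -- coarse dotted-decimal format check.
is_ipv4() { echo "$1" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'; }

is_cidr "$site_ip"  && echo "site_ip OK"
is_ipv4 "$site_gw"  && echo "site_gw OK"
is_ipv4 "$site_dns" && echo "site_dns OK"
```

Catching a malformed value here avoids a failed `csi-setup-lan0.sh` run later.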
Set up the network device files.
Download the network file template tarball from here and extract its contents.
tar -xzvf network_template.tar.gz
Delete the existing network settings and copy the extracted files to `/etc/sysconfig/network/`.
rm -rf /etc/sysconfig/network/*
cp -r $PWD/network/* /etc/sysconfig/network/
(pit#) Run the `csi-setup-lan0.sh` script to set up the site link and set the hostname.
NOTE:
- Use an `ipmi sol` or `conman` session while performing this step, because an SSH session may disconnect when the network is reconfigured.
- All of the `/root/bin/csi-*` scripts can be run without parameters to display usage statements.
- The hostname is auto-resolved based on reverse DNS.
/root/bin/csi-setup-lan0.sh "${SYSTEM_NAME}" "${site_ip}" "${site_gw}" "${site_dns}" "${site_nics}"
(pit#) Verify that the assigned IP address was successfully applied to `lan0`.
wicked ifstatus --verbose lan0
NOTE: The output from the above command must say `leases: ipv4 static granted`.
If the IPv4 address was not granted, then go back and recheck the variable values. The output will indicate that the IP address failed to assign; this can happen if the given IP address is already in use on the connected network.
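If desired, this check can be scripted rather than read by eye. The sketch below greps for the granted-lease line; the sample `status` text is illustrative only (on the PIT, capture the real output with `status="$(wicked ifstatus --verbose lan0)"`):

```shell
#!/bin/sh
# Returns success when the wicked output reports a granted static IPv4 lease.
lease_granted() { echo "$1" | grep -q 'ipv4 static granted'; }

# Sample output standing in for the real `wicked ifstatus --verbose lan0`.
status='lan0            up
      leases:   ipv4 static granted'

if lease_granted "$status"; then
    echo 'lan0: IPv4 lease granted'
else
    echo 'lan0: IPv4 lease NOT granted -- recheck the variable values' >&2
fi
```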
Populate `/etc/fstab` as follows:
LABEL=PITDATA /var/www/ephemeral ext4 noauto,noatime 0 2
tmpfs /var/lib/containers/storage tmpfs auto,nodev,nosuid,size=64g 0 0
Ensure that the `tmpfs` is large enough, because almost 31 GB of data will be placed in `/var/lib/containers/storage` during the Install CSM Services step. If the `tmpfs` would be too small, back the mount with a disk partition instead, using entries such as the following:
LABEL=PITDATA /var/www/ephemeral ext4 noauto,noatime 0 2
/dev/sda1 /var/lib/containers/storage ext4 defaults 0 0
Create the required directories and mount the filesystems:
mkdir -p /var/www/ephemeral
mkdir -p /var/lib/containers/storage
mount -a
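As a quick illustration of how the fields in these `fstab` lines are consumed, the sketch below pulls the mount point, options, and `size=` setting out of an entry. The entry mirrors the `tmpfs` line above; this is a parsing demonstration, not part of the procedure:

```shell
#!/bin/sh
# An /etc/fstab entry: device, mount point, fstype, options, dump, pass.
entry='tmpfs /var/lib/containers/storage tmpfs auto,nodev,nosuid,size=64g 0 0'

mnt=$(echo "$entry"  | awk '{print $2}')                    # mount point
opts=$(echo "$entry" | awk '{print $4}')                    # mount options
size=$(echo "$opts"  | tr ',' '\n' | sed -n 's/^size=//p')  # size= option

echo "mount point: $mnt"
echo "tmpfs size:  $size"
```

The `size=64g` option is what guarantees the 31 GB of container data will fit.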
(pit#) Mount the `PITDATA` partition. Use a local disk for `PITDATA`:
disk="$(lsblk -l -o SIZE,NAME,TYPE,TRAN -e7 -e11 -d -n | grep -v usb | sort -h | awk '{print $2}' | xargs -I {} bash -c "if ! grep -Fq {} /proc/mdstat; then echo {}; fi" | head -n 1)"
echo "Using ${disk}"
parted --wipesignatures -m --align=opt --ignore-busy -s "/dev/${disk}" -- mklabel gpt mkpart primary ext4 2048s 100%
partprobe "/dev/${disk}"
mkfs.ext4 -L PITDATA "/dev/${disk}1"
mount -vL PITDATA
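The long pipeline above selects the smallest non-USB disk that is not already part of a software RAID. Its core logic can be demonstrated on canned `lsblk`-style output; the sizes, device names, and fake `mdstat` content below are made up for illustration:

```shell
#!/bin/sh
# Simulated `lsblk -l -o SIZE,NAME,TYPE,TRAN` output and /proc/mdstat content;
# on the PIT, the real command output and file are used instead.
lsblk_out='447.1G sda disk sata
1.8T sdb disk sata
28.7G sdc disk usb'
mdstat='Personalities :'

disk=$(echo "$lsblk_out" \
    | grep -v usb        \
    | sort -h            \
    | awk '{print $2}'   \
    | while read -r d; do
          # Skip disks that are already members of a software RAID.
          echo "$mdstat" | grep -Fq "$d" || { echo "$d"; break; }
      done)
echo "Using $disk"
```

Here the USB stick (`sdc`) is filtered out, `sort -h` orders the remaining disks by human-readable size, and the first disk absent from `mdstat` wins, so `sda` is chosen.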
These variables will need to be set for many procedures within the CSM installation process.
NOTE: Some of these variables were already set earlier; set them again regardless.
(pit#) Set the variables.
Set the `PITDATA` variable.
export PITDATA="$(lsblk -o MOUNTPOINT -nr /dev/disk/by-label/PITDATA)"
Set the `CSM_RELEASE` variable.
The value is based on the version of the CSM release being installed.
Example release versions:
- An alpha build: `CSM_RELEASE=1.4.0-alpha.99`
- A release candidate: `CSM_RELEASE=1.4.0-rc.1`
- A stable release: `CSM_RELEASE=1.4.0`
export CSM_RELEASE=<value>
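A coarse format check for the release string can catch typos before the value is baked into paths. This sketch (not part of the official procedure) accepts the three forms shown above:

```shell
#!/bin/sh
# X.Y.Z with an optional -alpha.N or -rc.N suffix, per the examples above.
valid_release() {
    echo "$1" | grep -Eq '^[0-9]+\.[0-9]+\.[0-9]+(-(alpha|rc)\.[0-9]+)?$'
}

for r in 1.4.0 1.4.0-rc.1 1.4.0-alpha.99; do
    valid_release "$r" && echo "$r: OK"
done
```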
Set the `CSM_PATH` variable.
After the CSM release tarball has been expanded, this will be the path to its base directory.
export CSM_PATH="${PITDATA}/csm-${CSM_RELEASE}"
Set the `SYSTEM_NAME` variable.
This is the user-friendly name for the system. For example, for `eniac-ncn-m001`, `SYSTEM_NAME` should be set to `eniac`.
export SYSTEM_NAME=<value>
(pit#) Update `/etc/environment`.
This ensures that these variables will be set in all future shells on the PIT node.
export GOSS_BASE=/opt/cray/tests/install/livecd
cat << EOF >/etc/environment
CSM_RELEASE=${CSM_RELEASE}
CSM_PATH=${PITDATA}/csm-${CSM_RELEASE}
GOSS_BASE=${GOSS_BASE}
PITDATA=${PITDATA}
SYSTEM_NAME=${SYSTEM_NAME}
EOF
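To confirm the file was written with every expected key, something like the following can be used. It is shown here against a temporary file with placeholder values rather than the real `/etc/environment`:

```shell
#!/bin/sh
# Write a sample environment file (placeholder values) and verify that
# every expected key is present.
envfile=$(mktemp)
cat > "$envfile" <<EOF
CSM_RELEASE=1.4.0
CSM_PATH=/var/www/ephemeral/csm-1.4.0
GOSS_BASE=/opt/cray/tests/install/livecd
PITDATA=/var/www/ephemeral
SYSTEM_NAME=eniac
EOF

missing=0
for key in CSM_RELEASE CSM_PATH GOSS_BASE PITDATA SYSTEM_NAME; do
    grep -q "^${key}=" "$envfile" || { echo "missing: $key" >&2; missing=1; }
done
[ "$missing" -eq 0 ] && echo "all keys present"
rm -f "$envfile"
```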
Update the `dnsmasq` and `apache2` configuration files.
Download the tarball from here and extract it into the current working directory.
tar -xf dhcp_http.tar.gz
Update the `apache2` and `dnsmasq` configurations as follows:
cp -rv dnsmasq/dnsmasq.conf /etc/dnsmasq.conf
cp -rv apache2/* /etc/apache2/
cp -rv conman/conman.conf /etc/conman.conf
cp -rv logrotate/conman /etc/logrotate.d/conman
cp -rv kubectl/kubectl /usr/bin/
(Optional) Uncomment the `tftp_secure` entry in the `dnsmasq.conf` file.
Stop the `dhcpd` and `named` services, then restart `apache2`.
systemctl stop dhcpd
systemctl stop named
systemctl restart apache2
If `ping dcldap3.us.cray.com` does not work, then add the following entry to `/etc/hosts`.
172.30.12.37 dcldap3.us.cray.com
(pit#) Get the artifact versions.
KUBERNETES_VERSION="$(find ${CSM_PATH}/images/kubernetes -name '*.squashfs' -exec basename {} .squashfs \; | awk -F '-' '{print $(NF-1)}')"
echo "${KUBERNETES_VERSION}"
CEPH_VERSION="$(find ${CSM_PATH}/images/storage-ceph -name '*.squashfs' -exec basename {} .squashfs \; | awk -F '-' '{print $(NF-1)}')"
echo "${CEPH_VERSION}"
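The `awk` in these commands takes the second-to-last `-`-separated field of the squashfs basename. The sketch below walks through that extraction on a made-up filename (the real names come from `${CSM_PATH}/images/`, and their exact layout may differ):

```shell
#!/bin/sh
# Hypothetical squashfs filename for illustration only.
name='kubernetes-0.3.51-20220504.squashfs'

base=$(basename "$name" .squashfs)        # strips the .squashfs suffix
version=$(echo "$base" | awk -F '-' '{print $(NF-1)}')  # second-to-last field
echo "version: $version"
```

With this sample name, the fields are `kubernetes`, `0.3.51`, and `20220504`, so `$(NF-1)` yields `0.3.51`.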
(pit#) Copy the NCN images from the expanded tarball.
NOTE: This uses hard links to make the copy as fast as possible and to avoid wasting space on the USB stick.
mkdir -pv "${PITDATA}/data/k8s/" "${PITDATA}/data/ceph/"
rsync -rltDP --delete "${CSM_PATH}/images/kubernetes/" --link-dest="${CSM_PATH}/images/kubernetes/" "${PITDATA}/data/k8s/${KUBERNETES_VERSION}"
rsync -rltDP --delete "${CSM_PATH}/images/storage-ceph/" --link-dest="${CSM_PATH}/images/storage-ceph/" "${PITDATA}/data/ceph/${CEPH_VERSION}"
(pit#) Modify the NCN images with SSH keys and `root` passwords.
The following substeps provide the most commonly used defaults for this process. For more advanced options, see Set NCN Image Root Password, SSH Keys, and Timezone on PIT Node.
Generate SSH keys.
NOTE: The code block below assumes there is an RSA key without a passphrase. This step can be customized to use a passphrase if desired.
ssh-keygen -N "" -t rsa
Export the password hash for `root`, which is needed by the `ncn-image-modification.sh` script.
This will set the NCN `root` user password to be the same as the `root` user password on the PIT.
export SQUASHFS_ROOT_PW_HASH="$(awk -F':' /^root:/'{print $2}' < /etc/shadow)"
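The `awk` expression extracts the second `:`-separated field (the password hash) from the `root` entry. The sketch below shows the same extraction on a fabricated shadow line, so nothing sensitive is involved:

```shell
#!/bin/sh
# A fabricated /etc/shadow line; the hash field is a dummy string, not a
# real password hash.
shadow_line='root:$6$dummysalt$dummyhash:19000:0:99999:7:::'

# Same extraction as the step above, applied to the sample string.
hash=$(echo "$shadow_line" | awk -F':' '/^root:/{print $2}')
echo "extracted hash field: $hash"
```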
Inject these into the NCN images by running the `ncn-image-modification.sh` script from the CSM documentation RPM.
NCN_MOD_SCRIPT=$(rpm -ql docs-csm | grep ncn-image-modification.sh)
echo "${NCN_MOD_SCRIPT}"
"${NCN_MOD_SCRIPT}" -p \
-d /root/.ssh \
-k "/var/www/ephemeral/data/k8s/${KUBERNETES_VERSION}/kubernetes-${KUBERNETES_VERSION}.squashfs" \
-s "/var/www/ephemeral/data/ceph/${CEPH_VERSION}/storage-ceph-${CEPH_VERSION}.squashfs"
(pit#) Log the currently installed PIT packages.
Having this information in the typescript can be helpful if problems are encountered during the install. This command was run once in a previous step; running it again now is intentional.
/root/bin/metalid.sh
Expected output looks similar to the following (the versions in the example below may differ). There should be no errors.
= PIT Identification = COPY/CUT START =======================================
VERSION=1.6.0
TIMESTAMP=20220504161044
HASH=g10e2532
2022/05/04 17:08:19 Using config file: /var/www/ephemeral/prep/system_config.yaml
CRAY-Site-Init build signature...
Build Commit : 0915d59f8292cfebe6b95dcba81b412a08e52ddf-main
Build Time : 2022-05-02T20:21:46Z
Go Version : go1.16.10
Git Version : v1.9.13-29-g0915d59f
Platform : linux/amd64
App. Version : 1.17.1
metal-ipxe-2.2.6-1.noarch
metal-net-scripts-0.0.2-20210722171131_880ba18.noarch
metal-basecamp-1.1.12-1.x86_64
pit-init-1.2.20-1.noarch
pit-nexus-1.1.4-1.x86_64
= PIT Identification = COPY/CUT END =========================================
This stage walks the user through creating the configuration payload for the system.
Run the following steps before starting any of the system configuration procedures.
(pit#) Make the `prep` directory.
mkdir -pv "${PITDATA}/prep"
(pit#) Change into the `prep` directory.
cd "${PITDATA}/prep"
NOTE: The following seed files are auto-generated by the common pre-installer: `application_node_config.yaml`, `hmn_connections.json`, `ncn_metadata.csv`, and `switch_metadata.csv`. See Seed file generation.
Check whether the `cabinets.yaml` configuration file has already been created manually.
If `cabinets.yaml` has not been created, create it now; otherwise, skip that step.
(pit#) Assuming all seed files are under the `$HOME/seedfiles` directory, copy the generated files into the `${PITDATA}/prep` directory.
cp $HOME/seedfiles/* "${PITDATA}/prep"
(pit#) Confirm that the following files exist.
ls -l "${PITDATA}"/prep/{application_node_config.yaml,cabinets.yaml,hmn_connections.json,ncn_metadata.csv,switch_metadata.csv}
Expected output looks similar to the following example:
-rw-r--r-- 1 root root 146 Jun 6 00:12 /var/www/ephemeral/prep/application_node_config.yaml
-rw-r--r-- 1 root root 392 Jun 6 00:12 /var/www/ephemeral/prep/cabinets.yaml
-rwxr-xr-x 1 root root 3768 Jun 6 00:12 /var/www/ephemeral/prep/hmn_connections.json
-rw-r--r-- 1 root root 1216 Jun 6 00:12 /var/www/ephemeral/prep/ncn_metadata.csv
-rw-r--r-- 1 root root 150 Jun 6 00:12 /var/www/ephemeral/prep/switch_metadata.csv
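The same existence check can be made script-friendly. This sketch loops over the expected filenames in a temporary stand-in directory (on the PIT, point `prep` at `${PITDATA}/prep` and drop the `touch` setup):

```shell
#!/bin/sh
# Create a stand-in prep directory with the expected seed files present.
prep=$(mktemp -d)
files='application_node_config.yaml cabinets.yaml hmn_connections.json ncn_metadata.csv switch_metadata.csv'
for f in $files; do touch "$prep/$f"; done

# Verify that every expected file exists.
ok=1
for f in $files; do
    [ -f "$prep/$f" ] || { echo "missing: $f" >&2; ok=0; }
done
[ "$ok" -eq 1 ] && echo "all seed files present"
rm -rf "$prep"
```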
system_config.yaml
(pit#) Create or copy `system_config.yaml`.
If one does not exist from a prior installation, then create an empty one:
csi config init empty
Otherwise, copy the existing `system_config.yaml` file into the working directory and proceed to the Run CSI step.
(pit#) Edit the `system_config.yaml` file with the appropriate values.
NOTE:
- For a short description of each key in the file, run `csi config init --help`.
- For more description of these settings and the default values, see Default IP Address Ranges and the other topics in CSM Overview.
- To enable or disable audit logging, refer to Audit Logs for more information.
- If the system is using a `cabinets.yaml` file, be sure to update the `cabinets-yaml` field with `'cabinets.yaml'` as its value.
vim system_config.yaml
(pit#) Generate the initial configuration for CSI.
This will validate whether the inputs for CSI are correct.
csi config init
Expected Output:
2022/09/29 06:40:15 Using config file: /var/www/ephemeral/prep/system_config.yaml
2022/09/29 06:40:15 Using application node config: /var/www/ephemeral/prep/application_node_config.yaml
2022/09/29 06:40:15 SLS Cabinet Map
2022/09/29 06:40:15 Class River
2022/09/29 06:40:15 x3000
{"level":"info","ts":1664433615.2577472,"msg":"Beginning SLS configuration generation."}
2022/09/29 06:40:15 WARNING (Not Fatal): Couldn't find switch port for NCN: x3000c0s1b0
2022/09/29 06:40:15 wrote 24725 bytes to /var/www/ephemeral/prep/system_name/sls_input_file.json
2022/09/29 06:40:15 wrote 2342 bytes to /var/www/ephemeral/prep/system_name/customizations.yaml
2022/09/29 06:40:15 Generating Installer Node (PIT) interface configurations for: ncn-m001
2022/09/29 06:40:15 wrote 509 bytes to /var/www/ephemeral/prep/system_name/pit-files/ifcfg-bond0
2022/09/29 06:40:15 wrote 376 bytes to /var/www/ephemeral/prep/system_name/pit-files/ifcfg-lan0
2022/09/29 06:40:15 wrote 1030 bytes to /var/www/ephemeral/prep/system_name/pit-files/config
2022/09/29 06:40:15 wrote 24 bytes to /var/www/ephemeral/prep/system_name/pit-files/ifroute-lan0
2022/09/29 06:40:15 wrote 335 bytes to /var/www/ephemeral/prep/system_name/pit-files/ifcfg-bond0.hmn0
2022/09/29 06:40:15 wrote 335 bytes to /var/www/ephemeral/prep/system_name/pit-files/ifcfg-bond0.nmn0
2022/09/29 06:40:15 wrote 39 bytes to /var/www/ephemeral/prep/system_name/pit-files/ifroute-bond0.nmn0
2022/09/29 06:40:15 wrote 336 bytes to /var/www/ephemeral/prep/system_name/pit-files/ifcfg-bond0.can0
2022/09/29 06:40:15 wrote 335 bytes to /var/www/ephemeral/prep/system_name/pit-files/ifcfg-bond0.cmn0
2022/09/29 06:40:15 wrote 320 bytes to /var/www/ephemeral/prep/system_name/dnsmasq.d/CMN.conf
2022/09/29 06:40:15 wrote 572 bytes to /var/www/ephemeral/prep/system_name/dnsmasq.d/HMN.conf
2022/09/29 06:40:15 wrote 572 bytes to /var/www/ephemeral/prep/system_name/dnsmasq.d/NMN.conf
2022/09/29 06:40:15 wrote 540 bytes to /var/www/ephemeral/prep/system_name/dnsmasq.d/MTL.conf
2022/09/29 06:40:15 wrote 324 bytes to /var/www/ephemeral/prep/system_name/dnsmasq.d/CAN.conf
2022/09/29 06:40:15 wrote 8917 bytes to /var/www/ephemeral/prep/system_name/dnsmasq.d/statics.conf
2022/09/29 06:40:15 wrote 1226 bytes to /var/www/ephemeral/prep/system_name/conman.conf
2022/09/29 06:40:15 wrote 894 bytes to /var/www/ephemeral/prep/system_name/metallb.yaml
2022/09/29 06:40:15 wrote 60609 bytes to /var/www/ephemeral/prep/system_name/basecamp/data.json
===== [system_name] Installation Summary =====
Installation Node: ncn-m001
Customer Management: 10.102.5.0/25 GW: 10.102.5.1
Customer Access: 10.102.5.128/25 GW: 10.102.5.129
Upstream DNS: 8.8.8.8, 9.9.9.9
MetalLB Peers: [spine]
Networking
BICAN user network toggle set to CAN
Supernet enabled! Using the supernet gateway for some management subnets
* Hardware Management Network 10.254.0.0/17 with 2 subnets
* High Speed Network 10.253.0.0/16 with 1 subnets
* Provisioning Network (untagged) 10.1.1.0/16 with 2 subnets
* Node Management Network 10.252.0.0/17 with 3 subnets
* Customer Access Network 10.102.5.128/25 with 2 subnets
* River Compute Hardware Management Network 10.107.0.0/17 with 1 subnets
* River Compute Node Management Network 10.106.0.0/17 with 1 subnets
* SystemDefaultRoute points the network name of the default route 0.0.0.0/0 with 0 subnets
* Customer Management Network 10.102.5.0/25 with 4 subnets
* Node Management Network LoadBalancers 10.92.100.0/24 with 1 subnets
* Hardware Management Network LoadBalancers 10.94.100.0/24 with 1 subnets
System Information
NCNs: 9
Mountain Compute Cabinets: 0
Hill Compute Cabinets: 0
River Compute Cabinets: 1
CSI Version Information
e7684168d062ed7276c6a349930f3582c0a7600f-heads-v1.26.1
v1.26.1
site init
Follow the Prepare site init procedure.
Follow Configure management network switches.
NOTE: The generated paddle file can be used as input to the CANU command to configure the switches.
NOTE: If starting an installation at this point, be sure to copy the previous `prep` directory back onto the system.
(pit#) Initialize the PIT.
NOTE: This step restarts the network interface, so perform it from an `ipmi sol` or `conman` session.
The `pit-init.sh` script will prepare the PIT server for deploying NCNs.
/root/bin/pit-init.sh
Set up the `tftp` boot directory and restart `dnsmasq`.
mkdir -p /srv/tftpboot/boot/
cp -r /var/www/boot/* /srv/tftpboot/boot/
systemctl restart dnsmasq
(pit#) Set the `IPMI_PASSWORD` variable.
read -r -s -p "NCN BMC root password: " IPMI_PASSWORD
(pit#) Export the `IPMI_PASSWORD` variable.
export IPMI_PASSWORD
(pit#) Set up links to the boot artifacts extracted from the CSM tarball.
NOTE:
- This will also set all of the BMCs to DHCP.
- Changing into the `$HOME` directory ensures the proper operation of the script.
cd $HOME && /root/bin/set-sqfs-links.sh
Expected Output:
Resolving images to boot ...
Images resolved
Kubernetes Boot Selection:
kernel: /var/www/ephemeral/data/k8s/0.3.51/5.3.18-150300.59.87-default-0.3.51.kernel
initrd: /var/www/ephemeral/data/k8s/0.3.51/initrd.img-0.3.51.xz
squash: /var/www/ephemeral/data/k8s/0.3.51/secure-kubernetes-0.3.51.squashfs
Storage Boot Selection:
kernel: /var/www/ephemeral/data/ceph/0.3.51/5.3.18-150300.59.87-default-0.3.51.kernel
initrd: /var/www/ephemeral/data/ceph/0.3.51/initrd.img-0.3.51.xz
squash: /var/www/ephemeral/data/ceph/0.3.51/secure-storage-ceph-0.3.51.squashfs
Attempting to set all known BMCs (from /etc/conman.conf) to DHCP mode
current BMC count: 8
Waiting on 8 to request DHCP ...
All [8] expected BMCs have requested DHCP.
/root/bin/set-sqfs-links.sh is creating boot directories for each NCN with a BMC that has a lease in /var/lib/misc/dnsmasq.leases
NOTE: Nodes without boot directories will still boot the non-destructive iPXE binary for bare-metal discovery usage.
Images will be stored on the NCN at /run/initramfs/live/1.3.0-rc.3/
/var/www is ready.
Go to `/var/www` and copy the boot artifacts into the TFTP directories as follows:
cp -r /var/www/ncn-* /srv/tftpboot/
mkdir /srv/tftpboot/ephemeral
cp -r /var/www/ephemeral/data/ /srv/tftpboot/ephemeral/
Start the `conman` service.
systemctl start conman.service
(pit#) Verify that the LiveCD is ready by running the preflight tests.
Run the following command to make the `kubectl` binary executable:
chmod +x /usr/bin/kubectl
Run preflight tests.
csi pit validate --livecd-preflight
Expected Output:
Running LiveCD preflight checks (may take a few minutes to complete)...
Writing full output to /opt/cray/tests/install/logs/print_goss_json_results/20220929_101501.528062-22314-Z7D4bWt9/out
Reading test results for node system_name-ncn-m001-pit (suites/livecd-preflight-tests.yaml)
Checking test results
Only errors will be printed to the screen
GRAND TOTAL: 162 passed, 0 failed
PASSED
If any tests fail, they need to be investigated. After actions have been taken to rectify the failures (for example, editing the configuration or CSI inputs), restart from the beginning of the Initialize the LiveCD procedure.
Save the `prep` directory for re-use.
Copy it off the system and store it either in a secure location or in a secured Git repository; it contains secrets that must not be accidentally exposed.
Grant necessary privileges by running the following command:
sed -i 's/podman run/podman run --privileged/g' /usr/share/doc/csm/install/scripts/csm_services/steps/1.initialize_bootstrap_registry.yaml
Check if there are any processes attached to port 5000 by running the following command:
netstat -tlnp | grep 5000
If there is a process attached to port 5000, kill it using the `kill` command.
kill -9 <pid>
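Extracting the PID from the `netstat` output can be scripted. The sketch below parses a canned `netstat -tlnp`-style line (the PID and program name are made up); on the PIT, feed it the real `netstat -tlnp | grep 5000` output and pass the resulting PID to `kill`:

```shell
#!/bin/sh
# A canned `netstat -tlnp` line for a listener on port 5000. The final
# column has the form "PID/Program name".
line='tcp  0  0 0.0.0.0:5000  0.0.0.0:*  LISTEN  12345/registry'

# Take the last whitespace-separated field, then the part before the slash.
pid=$(echo "$line" | awk '{print $NF}' | cut -d/ -f1)
echo "process on port 5000: PID $pid"
```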
Restart Nexus.
systemctl restart nexus.service
After completing the Pre-install step, the next step is to Deploy Management Nodes.