IMPORTANT: These procedures only need to be followed if upgrading from CSM 0.9 (Shasta 1.4). If upgrading from CSM 1.0.1 (Shasta 1.5), these procedures should already have been done.
Some of these changes were delivered as hotfixes and patches for CSM 0.9 (Shasta 1.4), so they may already be in place.
For some of the procedures on this page, you will need to SSH into your switches as the admin user to verify settings and possibly make changes.
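For example (the hostname here is illustrative; substitute the switch names or IP addresses determined below):
ncn# ssh admin@sw-spine-001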
If you do not already know the hostnames or IP addresses of the switches in the system, here are some methods to determine them.
The default switch hostnames are in the following formats:
sw-spine-001
sw-spine-002
sw-agg-001
sw-agg-002
...
sw-leaf-001
sw-leaf-002
...
sw-cdu-001
sw-cdu-002
...
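To quickly check which of these hostnames resolve on a given system, a small getent loop can be used. This is a sketch; adjust the name list to match the system:
ncn# for sw in sw-spine-001 sw-spine-002 sw-agg-001 sw-leaf-001 sw-cdu-001; do
getent hosts "$sw" || echo "$sw: not found"
done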
One way to get the IP addresses for the two spine switches is from the metallb-system
configmap in Kubernetes:
ncn# kubectl get cm config -n metallb-system -o json |
jq -r '.data | .config' |
yq r - -j |
jq -r '.peers | .[]."peer-address"'
Expected output is similar to the following:
10.252.0.2
10.252.0.3
Another quick method is to look up the hostnames and IP addresses in /etc/hosts on an NCN:
ncn# grep sw /etc/hosts
Expected output looks similar to the following:
10.252.0.2 sw-spine-001
10.252.0.3 sw-spine-002
10.252.0.4 sw-agg-001
10.252.0.5 sw-agg-002
10.252.0.6 sw-agg-003
10.252.0.7 sw-agg-004
10.252.0.8 sw-leaf-001
10.252.0.9 sw-leaf-002
10.252.0.10 sw-leaf-003
10.252.0.11 sw-leaf-004
10.252.0.12 sw-cdu-001
10.252.0.13 sw-cdu-002
On Mellanox switches, the enable command must be issued before some other commands will work properly.
If changes are made to a switch configuration, do not forget to save them with the write memory command.
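For example, entering enable mode and saving the configuration looks like the following (the prompt text will vary by system):
sw-spine-001 [standalone: master] > enable
sw-spine-001 [standalone: master] # write memory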
For systems with Mountain cabinets ONLY, verify the version of the Mountain CMM firmware. The firmware must be at version 1.4.20 or greater in order to support static LAGs on the CDU switches. Follow this procedure to verify:
Set environment variables with the credentials for the Mountain CMMs:
ncn# read -s BMC_USER
ncn# read -s BMC_PASS
ncn# export BMC_USER BMC_PASS
Obtain an API token:
ncn# export TOKEN=$(curl -s -k -S -d grant_type=client_credentials -d client_id=admin-client \
-d client_secret=`kubectl get secrets admin-client-auth -o jsonpath='{.data.client-secret}' | base64 -d` \
https://api-gw-service-nmn.local/keycloak/realms/shasta/protocol/openid-connect/token |
jq -r '.access_token')
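Before proceeding, a simple sanity check confirms that a token was actually returned (an empty variable means the request failed):
ncn# [[ -n "$TOKEN" ]] && echo "Token acquired" || echo "Token request failed"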
Run the following script to get a report on the Mountain CMM firmware levels:
ncn# /usr/share/doc/csm/upgrade/scripts/get_mountain_cmm_firmware_versions.py
Expected output looks similar to the following:
Retrieving list of Mountain CMM xnames
Making GET request to https://api-gw-service-nmn.local/apis/sls/v1/search/hardware?type=comptype_chassis_bmc&class=Mountain
Found 3 Mountain CMM(s) in the system: x1001c5, x1001c4, x1000c5
Retrieving list of Redfish endpoint FQDNs of the Mountain CMM(s)
Making GET request to https://api-gw-service-nmn.local/apis/smd/hsm/v2/Inventory/ComponentEndpoints?id=x1001c5&id=x1001c4&id=x1000c5
Found 3 Mountain CMM Redfish FQDN(s) in the system: x1001c5b0, x1000c5b0, x1001c4b0
Checking firmware version of x1001c5b0
Making GET request to https://x1001c5b0/redfish/v1/UpdateService/FirmwareInventory/BMC
Checking firmware version of x1000c5b0
Making GET request to https://x1000c5b0/redfish/v1/UpdateService/FirmwareInventory/BMC
Checking firmware version of x1001c4b0
Making GET request to https://x1001c4b0/redfish/v1/UpdateService/FirmwareInventory/BMC
Mountain CMM | Firmware Version
x1000c5b0 | cc.1.5-31-shasta-release.arm64.2021-11-03T03:50:18+00:00.b9ced71
x1001c4b0 | cc.1.5-31-shasta-release.arm64.2021-11-03T03:50:18+00:00.b9ced71
x1001c5b0 | cc.1.5-31-shasta-release.arm64.2021-11-03T03:50:18+00:00.b9ced71
Firmware versions successfully reported
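To spot-check a single CMM manually, its Redfish firmware inventory can be queried directly. In this sketch, x1000c5b0 is an example xname taken from the report above, and the Version field is assumed to carry the firmware string:
ncn# curl -s -k -u "${BMC_USER}:${BMC_PASS}" \
    https://x1000c5b0/redfish/v1/UpdateService/FirmwareInventory/BMC | jq -r '.Version'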
Proceed to the appropriate next step. The procedures that follow are broken up by switch type: Aruba first, then Mellanox.
For Aruba switches, log into each spine switch and check the BGP configuration:
sw-spine-001# show run | include bgp
router bgp 65533
bgp router-id 10.252.0.2
Check whether a static route to the TFTP address (10.92.100.60) exists:
sw-spine01# show ip route 10.92.100.60
Displaying IPv4 routes selected for forwarding
'[x/y]' denotes [distance/metric]
10.92.100.60/32, vrf default, tag 0
via 10.252.1.x, [1/0], static
via 10.252.1.x, [1/0], static
If static routes via 10.252.1.x appear in the output, as above, then you will need to remove them:
sw-spine01# config t
sw-spine01(config)# no ip route 10.92.100.60/32 10.252.1.x
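After removing the route, re-run the show command to confirm that the static entries are gone:
sw-spine01(config)# exit
sw-spine01# show ip route 10.92.100.60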
Next, run the Aruba BGP script:
ncn# /opt/cray/csm/scripts/networking/BGP/Aruba_BGP_Peers.py
Log into the switches that you ran the BGP script against and execute:
sw-spine-001# show run | begin "ip prefix-list"
Do not copy this configuration onto your switches; it is an example for verification only. The following configuration needs to be present:
ip prefix-list tftp seq 10 permit 10.92.100.60/32 ge 32 le 32
neighbor 10.252.1.x passive
The neighbors should be the NMN IP addresses of the worker nodes.
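One way to list the worker nodes' NMN IP addresses is to check /etc/hosts on an NCN, assuming the worker entries are present there:
ncn# grep ncn-w /etc/hosts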
Here is an example output from an Aruba switch with 3 worker nodes.
ip prefix-list pl-can seq 10 permit 10.103.11.0/24 ge 24
ip prefix-list pl-hmn seq 20 permit 10.94.100.0/24 ge 24
ip prefix-list pl-nmn seq 30 permit 10.92.100.0/24 ge 24
ip prefix-list tftp seq 10 permit 10.92.100.60/32 ge 32 le 32
!
!
!
!
route-map ncn-w001 permit seq 10
match ip address prefix-list tftp
match ip next-hop 10.252.1.7
set local-preference 1000
route-map ncn-w001 permit seq 20
match ip address prefix-list tftp
match ip next-hop 10.252.1.8
set local-preference 1100
route-map ncn-w001 permit seq 30
match ip address prefix-list tftp
match ip next-hop 10.252.1.9
set local-preference 1200
route-map ncn-w001 permit seq 40
match ip address prefix-list pl-can
set ip next-hop 10.103.11.10
route-map ncn-w001 permit seq 50
match ip address prefix-list pl-hmn
set ip next-hop 10.254.1.14
route-map ncn-w001 permit seq 60
match ip address prefix-list pl-nmn
set ip next-hop 10.252.1.9
route-map ncn-w002 permit seq 10
match ip address prefix-list tftp
match ip next-hop 10.252.1.7
set local-preference 1000
route-map ncn-w002 permit seq 20
match ip address prefix-list tftp
match ip next-hop 10.252.1.8
set local-preference 1100
route-map ncn-w002 permit seq 30
match ip address prefix-list tftp
match ip next-hop 10.252.1.9
set local-preference 1200
route-map ncn-w002 permit seq 40
match ip address prefix-list pl-can
set ip next-hop 10.103.11.9
route-map ncn-w002 permit seq 50
match ip address prefix-list pl-hmn
set ip next-hop 10.254.1.12
route-map ncn-w002 permit seq 60
match ip address prefix-list pl-nmn
set ip next-hop 10.252.1.8
route-map ncn-w003 permit seq 10
match ip address prefix-list tftp
match ip next-hop 10.252.1.7
set local-preference 1000
route-map ncn-w003 permit seq 20
match ip address prefix-list tftp
match ip next-hop 10.252.1.8
set local-preference 1100
route-map ncn-w003 permit seq 30
match ip address prefix-list tftp
match ip next-hop 10.252.1.9
set local-preference 1200
route-map ncn-w003 permit seq 40
match ip address prefix-list pl-can
set ip next-hop 10.103.11.8
route-map ncn-w003 permit seq 50
match ip address prefix-list pl-hmn
set ip next-hop 10.254.1.10
route-map ncn-w003 permit seq 60
match ip address prefix-list pl-nmn
set ip next-hop 10.252.1.7
!
router ospf 1
router-id 10.2.0.2
area 0.0.0.0
router ospfv3 1
router-id 10.2.0.2
area 0.0.0.0
router bgp 65533
bgp router-id 10.252.0.2
maximum-paths 8
neighbor 10.252.0.3 remote-as 65533
neighbor 10.252.1.7 remote-as 65533
neighbor 10.252.1.8 remote-as 65533
neighbor 10.252.1.9 remote-as 65533
address-family ipv4 unicast
neighbor 10.252.0.3 activate
neighbor 10.252.1.7 activate
neighbor 10.252.1.7 passive
neighbor 10.252.1.7 route-map ncn-w003 in
neighbor 10.252.1.8 activate
neighbor 10.252.1.8 passive
neighbor 10.252.1.8 route-map ncn-w002 in
neighbor 10.252.1.9 activate
neighbor 10.252.1.9 passive
neighbor 10.252.1.9 route-map ncn-w001 in
exit-address-family
!
If the configuration does not look like the example above, check the Update BGP Neighbors docs.
Once the configuration is correct, save any changes that were made by running write memory on all of the switches.
For Mellanox switches, first verify that BGP is enabled:
sw-spine-001 [standalone: master] > enable
sw-spine-001 [standalone: master] # show protocols | include bgp
bgp: enabled
The Mellanox BGP neighbors need to be configured as passive.
To do this, log into the switches and run the commands below. The neighbors will be the NMN IP addresses of ALL of the worker nodes; you may have more than three.
sw-spine-001 [standalone: master] > ena
sw-spine-001 [standalone: master] # conf t
sw-spine-001 [standalone: master] (config) # router bgp 65533 vrf default neighbor 10.252.1.10 transport connection-mode passive
sw-spine-001 [standalone: master] (config) # router bgp 65533 vrf default neighbor 10.252.1.11 transport connection-mode passive
sw-spine-001 [standalone: master] (config) # router bgp 65533 vrf default neighbor 10.252.1.12 transport connection-mode passive
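For systems with more worker nodes, a small shell loop on an NCN can generate one passive-mode command per worker. This is a sketch; it assumes the workers' NMN addresses are in /etc/hosts and that the NMN uses the 10.252.0.0/17 range shown in these examples:
ncn# for ip in $(awk '/ncn-w/ && $1 ~ /^10\.252\./ {print $1}' /etc/hosts | sort -u); do
echo "router bgp 65533 vrf default neighbor ${ip} transport connection-mode passive"
done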
Run the following command to verify that the configuration was applied correctly:
sw-spine-001 [standalone: master] (config) # show running-config protocol bgp
The configuration should look similar to the following example. The neighbors should be ALL of the NCN worker nodes at their NMN addresses; they will not peer over any other network.
protocol bgp
router bgp 65533 vrf default
router bgp 65533 vrf default router-id 10.252.0.2 force
router bgp 65533 vrf default maximum-paths ibgp 32
router bgp 65533 vrf default neighbor 10.252.1.10 remote-as 65533
router bgp 65533 vrf default neighbor 10.252.1.10 route-map ncn-w001
router bgp 65533 vrf default neighbor 10.252.1.11 remote-as 65533
router bgp 65533 vrf default neighbor 10.252.1.11 route-map ncn-w002
router bgp 65533 vrf default neighbor 10.252.1.12 remote-as 65533
router bgp 65533 vrf default neighbor 10.252.1.12 route-map ncn-w003
router bgp 65533 vrf default neighbor 10.252.1.10 transport connection-mode passive
router bgp 65533 vrf default neighbor 10.252.1.11 transport connection-mode passive
router bgp 65533 vrf default neighbor 10.252.1.12 transport connection-mode passive
If the configuration does not look like the example above, check the Update BGP Neighbors docs.
Once the configuration is correct, save any changes that were made by running write memory on all of the switches.
If the system has Apollo servers, the required configuration can be found on the Management Network Access Port Configurations page, under the section "Apollo Server Port Configuration".