Check the Border Gateway Protocol (BGP) status on the Aruba and Mellanox switches and verify that all sessions are in an Established state. If any session in the table is in an Idle state, the BGP sessions need to be reset.
This procedure requires administrative privileges.
The following procedures require a list of the switches that are BGP peers. Obtain this list by running the following command from an NCN:
ncn-m001# kubectl get cm config -n metallb-system -o yaml | head -12
Expected output looks similar to the following:
apiVersion: v1
data:
  config: |
    peers:
    - peer-address: 10.252.0.2
      peer-asn: 65533
      my-asn: 65533
    - peer-address: 10.252.0.3
      peer-asn: 65533
      my-asn: 65533
    address-pools:
    - name: customer-access
The switch IPs are the peer-address values.
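As a minimal sketch, the peer addresses can be pulled out of that output with a small filter. The function name extract_peer_addresses is hypothetical, and the sketch assumes the ConfigMap is printed in the block form shown above, with each peer written as a "- peer-address: <IP>" line.

```shell
# Hypothetical helper (not part of the product): print each peer-address
# value found in the MetalLB ConfigMap YAML read from standard input.
# Assumes each peer starts with a line of the form "- peer-address: <IP>".
extract_peer_addresses() {
  awk '$1 == "-" && $2 == "peer-address:" { print $3 }'
}

# Example usage on an NCN (shown as a comment, not executed here):
#   kubectl get cm config -n metallb-system -o yaml | extract_peer_addresses
```

Each printed address is a switch to connect to in the steps that follow.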
Verify that all BGP sessions are in an Established state for the Mellanox spine switches.
SSH to each BGP peer switch to check the status of all BGP sessions.
SSH to a BGP peer switch.
For example:
ncn-m001# ssh admin@10.252.0.2
View the status of the BGP sessions.
sw-spine-001 [standalone: master] > enable
sw-spine-001 [standalone: master] # show ip bgp summary
VRF name : default
BGP router identifier : 10.252.0.2
local AS number : 65533
BGP table version : 50
Main routing table version: 50
IPV4 Prefixes : 68
IPV6 Prefixes : 0
L2VPN EVPN Prefixes : 0
------------------------------------------------------------------------------------------------------------------
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd
------------------------------------------------------------------------------------------------------------------
10.252.1.10 4 65533 3144 3564 50 0 0 1:01:50:41 ESTABLISHED/13
10.252.1.11 4 65533 3144 3569 50 0 0 1:01:50:40 ESTABLISHED/14
10.252.1.12 4 65533 3145 3576 50 0 0 1:01:50:41 ESTABLISHED/14
10.252.1.13 4 65533 3144 3568 50 0 0 1:01:50:41 ESTABLISHED/13
10.252.1.14 4 65533 3145 3572 50 0 0 1:01:50:41 ESTABLISHED/14
If any of the sessions are in an Idle state, proceed to the next step.
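When the summary has many neighbors, a filter over saved output can make non-Established sessions easier to spot. The following is a sketch only: the function name mellanox_unestablished_peers is hypothetical, and it assumes the output has been captured to a file or pipe and that the last column always has the State/PfxRcd form shown above.

```shell
# Hypothetical helper: read saved "show ip bgp summary" output (Mellanox
# format) on standard input and print every neighbor whose state is not
# ESTABLISHED. The exit status is 1 if any such neighbor was found.
mellanox_unestablished_peers() {
  awk '
    BEGIN { bad = 0 }
    # Neighbor rows start with an IPv4 address; the last field is State/PfxRcd.
    $1 ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ {
      state = $NF
      sub(/\/.*/, "", state)   # drop the "/<prefix count>" suffix, if present
      if (toupper(state) != "ESTABLISHED") { print $1, state; bad = 1 }
    }
    END { exit bad }
  '
}

# Example usage (assumes the switch output was saved to summary.txt):
#   mellanox_unestablished_peers < summary.txt || echo "reset needed"
```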
Reset BGP to re-establish the sessions.
SSH to each BGP peer switch.
For example:
ncn-m001# ssh admin@10.252.0.2
Verify BGP is enabled.
sw-spine-001 [standalone: master] > show protocols | include bgp
bgp: enabled
Clear the BGP sessions.
sw-spine-001 [standalone: master] > enable
sw-spine-001 [standalone: master] # clear ip bgp all
Check the status of the BGP sessions to see if they are now Established.
It may take a few minutes for sessions to become Established.
sw-spine-001 [standalone: master] # show ip bgp summary
VRF name : default
BGP router identifier : 10.252.0.2
local AS number : 65533
BGP table version : 50
Main routing table version: 50
IPV4 Prefixes : 68
IPV6 Prefixes : 0
L2VPN EVPN Prefixes : 0
------------------------------------------------------------------------------------------------------------------
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd
------------------------------------------------------------------------------------------------------------------
10.252.1.10 4 65533 3144 3564 50 0 0 1:01:50:41 ESTABLISHED/13
10.252.1.11 4 65533 3144 3569 50 0 0 1:01:50:40 ESTABLISHED/14
10.252.1.12 4 65533 3145 3576 50 0 0 1:01:50:41 ESTABLISHED/14
10.252.1.13 4 65533 3144 3568 50 0 0 1:01:50:41 ESTABLISHED/13
10.252.1.14 4 65533 3145 3572 50 0 0 1:01:50:41 ESTABLISHED/14
Once all sessions are in an Established state, BGP reset is complete for the Mellanox switches.
Troubleshooting: If some sessions remain Idle, re-run the Mellanox reset steps to clear the sessions and re-check their status. The clear ip bgp all command may need to be run multiple times (up to 10 times). Wait a few minutes after each clear command before re-checking the BGP sessions. If some sessions still remain Idle, proceed to reapply the cray-metallb Helm chart, along with the BGP reset, to force the speaker pods to re-establish sessions with the switch.
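The clear, wait, and re-check cycle described above can be sketched as a bounded retry loop. Everything here is illustrative: retry_clear_until_established is a hypothetical name, and the clear and check commands are supplied by the operator, because how to run switch commands non-interactively is site-specific and not covered by this procedure.

```shell
# Hypothetical helper: run a "clear" command, wait, then run a "check"
# command, repeating until the check succeeds or the attempt limit is hit.
# Usage: retry_clear_until_established <max_attempts> <wait_seconds> <clear_cmd> <check_cmd>
retry_clear_until_established() {
  _attempt=1
  while [ "$_attempt" -le "$1" ]; do
    "$3"            # operator-supplied command that clears the BGP sessions
    sleep "$2"      # give the sessions time to re-establish
    if "$4"; then   # operator-supplied command that succeeds when all are Established
      echo "sessions established after ${_attempt} attempt(s)"
      return 0
    fi
    _attempt=$((_attempt + 1))
  done
  echo "sessions still not established after $1 attempts" >&2
  return 1
}

# Example usage with operator-defined functions (names are assumptions):
#   retry_clear_until_established 10 180 clear_sessions sessions_established
```

A non-zero return after the final attempt corresponds to the case where the cray-metallb Helm chart needs to be reapplied.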
Verify that all BGP sessions are in an Established state for the Aruba spine switches.
SSH to each BGP peer switch to check the status of all BGP sessions.
SSH to a BGP peer switch.
ncn-m001# ssh admin@10.252.0.2
View the status of the BGP sessions.
sw-spine-001# show bgp ipv4 unicast summary
VRF : default
BGP Summary
-----------
Local AS : 65533 BGP Router Identifier : 10.252.0.2
Peers : 4 Log Neighbor Changes : No
Cfg. Hold Time : 180 Cfg. Keep Alive : 60
Confederation Id : 0
Neighbor Remote-AS MsgRcvd MsgSent Up/Down Time State AdminStatus
10.252.0.3 65533 19704 19708 00m:01w:00d Established Up
10.252.1.10 65533 34455 39416 00m:01w:04d Established Up
10.252.1.11 65533 34458 39400 00m:01w:04d Established Up
10.252.1.12 65533 34448 39415 00m:01w:04d Established Up
If any of the sessions are in an Idle state, proceed to the next step.
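The Aruba summary uses a different column layout, with the state in the second-to-last column. A sketch of a filter for saved output follows; aruba_unestablished_peers is a hypothetical name, and the sketch assumes every neighbor row ends with the State and AdminStatus columns as shown above.

```shell
# Hypothetical helper: read saved "show bgp ipv4 unicast summary" output
# (Aruba format) on standard input and print every neighbor whose State
# column is not "Established". The exit status is 1 if any was found.
aruba_unestablished_peers() {
  awk '
    BEGIN { bad = 0 }
    # Neighbor rows start with an IPv4 address; State is the second-to-last field.
    $1 ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ && NF >= 2 {
      state = $(NF - 1)
      if (state != "Established") { print $1, state; bad = 1 }
    }
    END { exit bad }
  '
}

# Example usage (assumes the switch output was saved to summary.txt):
#   aruba_unestablished_peers < summary.txt || echo "reset needed"
```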
Reset BGP to re-establish the sessions.
SSH to each BGP peer switch.
For example:
ncn-m001# ssh admin@10.252.0.2
Clear the BGP sessions.
sw-spine-001# clear bgp *
Check the status of the BGP sessions.
It may take a few minutes for sessions to become Established.
sw-spine-001# show bgp ipv4 unicast summary
VRF : default
BGP Summary
-----------
Local AS : 65533 BGP Router Identifier : 10.252.0.2
Peers : 4 Log Neighbor Changes : No
Cfg. Hold Time : 180 Cfg. Keep Alive : 60
Confederation Id : 0
Neighbor Remote-AS MsgRcvd MsgSent Up/Down Time State AdminStatus
10.252.0.3 65533 19704 19708 00m:01w:00d Established Up
10.252.1.10 65533 34455 39416 00m:01w:04d Established Up
10.252.1.11 65533 34458 39400 00m:01w:04d Established Up
10.252.1.12 65533 34448 39415 00m:01w:04d Established Up
Once all sessions are in an Established state, BGP reset is complete for the Aruba switches.
Troubleshooting: If some sessions remain Idle, re-run the Aruba reset steps to clear the sessions and re-check their status. The clear bgp * command may need to be run multiple times (up to 10 times). Wait a few minutes after each clear command before re-checking the BGP sessions. If some sessions still remain Idle, proceed to the next step to reapply the cray-metallb Helm chart, along with the BGP reset, to force the speaker pods to re-establish sessions with the switch.
Reapply the cray-metallb Helm chart.
Determine the cray-metallb chart version that is currently deployed.
ncn-m001# helm ls -A -a | grep cray-metallb
cray-metallb metallb-system 1 2021-02-10 14:58:43.902752441 -0600 CST deployed cray-metallb-0.12.2 0.8.1
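The chart version is the suffix of the chart column (0.12.2 in the output above), not the final column, which is the application version. As a sketch, it can be extracted as follows; metallb_chart_version is a hypothetical name, and the column layout is assumed to match the output shown.

```shell
# Hypothetical helper: read `helm ls -A -a` output on standard input and
# print the version part of the cray-metallb chart name.
metallb_chart_version() {
  awk '$1 == "cray-metallb" {
    for (i = 2; i <= NF; i++) {
      if ($i ~ /^cray-metallb-/) {
        sub(/^cray-metallb-/, "", $i)   # strip the chart name, keep the version
        print $i
        exit
      }
    }
  }'
}

# Example usage on an NCN (shown as a comment, not executed here):
#   helm ls -A -a | metallb_chart_version
```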
Create a manifest file that will be used to reapply the same chart version.
ncn-m001# cat << EOF > ./metallb-manifest.yaml
apiVersion: manifests/v1beta1
metadata:
  name: reapply-metallb
spec:
  charts:
  - name: cray-metallb
    namespace: metallb-system
    values:
      imagesHost: dtr.dev.cray.com
    version: 0.12.2
EOF
Open SSH sessions to all spine switches.
Determine the CSM_RELEASE version that is currently running and set an environment variable.
For example:
ncn-m001# CSM_RELEASE=0.8.0
Mount PITDATA so that the Helm charts are available for the reinstall (it might already be mounted), and verify that the chart with the expected version exists.
ncn-m001# mkdir -pv /mnt/pitdata
ncn-m001# mount -L PITDATA /mnt/pitdata
ncn-m001# ls /mnt/pitdata/csm-${CSM_RELEASE}/helm/cray-metallb*
/mnt/pitdata/csm-0.8.0/helm/cray-metallb-0.12.2.tgz
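Because the next step uninstalls the running chart, it can be worth guarding on the archive being present first. The following is a minimal sketch; chart_archive_exists is a hypothetical name, and the cray-metallb-<version>.tgz file naming is taken from the listing above.

```shell
# Hypothetical helper: succeed only if the expected chart archive exists.
# Usage: chart_archive_exists <helm_dir> <chart_version>
chart_archive_exists() {
  [ -f "$1/cray-metallb-$2.tgz" ]
}

# Example usage (paths follow the listing above; not executed here):
#   chart_archive_exists "/mnt/pitdata/csm-${CSM_RELEASE}/helm" 0.12.2 \
#     || echo "chart archive missing; do not uninstall cray-metallb yet"
```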
Uninstall the current cray-metallb chart.
Until the chart is reapplied, this will also affect unbound name resolution, and all BGP sessions will be Idle for all of the worker nodes.
ncn-m001# helm del cray-metallb -n metallb-system
Use the open SSH sessions to the switches to clear the BGP sessions based on the above Mellanox or Aruba procedures.
Refer to substeps 1-3 for Mellanox.
Refer to substeps 1-2 for Aruba.
Reapply the cray-metallb chart based on the CSM_RELEASE.
ncn-m001# loftsman ship --manifest-path ./metallb-manifest.yaml \
--charts-path /mnt/pitdata/csm-${CSM_RELEASE}/helm
Check that the speaker pods are all running.
This may take a few minutes.
ncn-m001# kubectl get pods -n metallb-system
NAME READY STATUS RESTARTS AGE
cray-metallb-controller-6d545b5ccc-mm4qz 1/1 Running 0 79m
cray-metallb-speaker-4nrzq 1/1 Running 0 76m
cray-metallb-speaker-b5m2n 1/1 Running 0 79m
cray-metallb-speaker-h7s7b 1/1 Running 0 79m
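To avoid scanning the pod list by eye, the speaker check can be sketched as a filter over the kubectl output. The name speakers_not_running is hypothetical, and the sketch only inspects the STATUS column, so it does not catch a pod that is Running but not yet Ready.

```shell
# Hypothetical helper: read `kubectl get pods -n metallb-system` output on
# standard input and print any cray-metallb-speaker pod whose STATUS is not
# Running. The exit status is 1 if any such pod was found.
speakers_not_running() {
  awk '
    BEGIN { bad = 0 }
    $1 ~ /^cray-metallb-speaker-/ && $3 != "Running" { print $1, $3; bad = 1 }
    END { exit bad }
  '
}

# Example usage on an NCN (shown as a comment, not executed here):
#   kubectl get pods -n metallb-system | speakers_not_running
```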
Use the open SSH sessions to the switches to check the status of the BGP sessions.
Refer to substeps 1-3 for Mellanox.
Refer to substeps 1-2 for Aruba.