Check BGP Status and Reset Sessions

Check the Border Gateway Protocol (BGP) status on the Aruba and Mellanox switches and verify that all sessions are in an Established state. If any session in the table is Idle, the BGP sessions need to be reset.

Prerequisites

This procedure requires administrative privileges.

Procedure

The following procedures require a list of the switches that are BGP peers. Obtain this list by running the following command from an NCN:

ncn-m001# kubectl get cm config -n metallb-system -o yaml | head -12

Expected output looks similar to the following:

apiVersion: v1
data:
  config: |
    peers:
    - peer-address: 10.252.0.2
      peer-asn: 65533
      my-asn: 65533
    - peer-address: 10.252.0.3
      peer-asn: 65533
      my-asn: 65533
    address-pools:
    - name: customer-access

The switch IPs are the peer-address values.
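
As a minimal sketch, the peer addresses can be pulled out of that output with a small filter instead of being read by eye. The function name extract_peers is hypothetical and not part of any CSM tooling; it assumes each peer appears on a line of the form "- peer-address: <IP>", as in the example output above.

```shell
# Hypothetical helper: print each peer-address value found on stdin, one per line.
# Assumes the IP address is the last whitespace-separated field on the line.
extract_peers() {
    awk '/peer-address:/ { print $NF }'
}

# Assumed usage on an NCN (same kubectl command as above, without head):
#   kubectl get cm config -n metallb-system -o yaml | extract_peers
```

Each printed address is a switch to SSH to in the procedures that follow.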

Mellanox

  1. Verify that all BGP sessions are in an Established state for the Mellanox spine switches.

    SSH to each BGP peer switch to check the status of all BGP sessions.

    1. SSH to a BGP peer switch.

      For example:

      ncn-m001# ssh admin@10.252.0.2
      
    2. View the status of the BGP sessions.

      sw-spine-001 [standalone: master] > enable
      sw-spine-001 [standalone: master] # show ip bgp summary
      
      VRF name                  : default
      BGP router identifier     : 10.252.0.2
      local AS number           : 65533
      BGP table version         : 50
      Main routing table version: 50
      IPV4 Prefixes             : 68
      IPV6 Prefixes             : 0
      L2VPN EVPN Prefixes       : 0
      
      ------------------------------------------------------------------------------------------------------------------
      Neighbor          V    AS           MsgRcvd   MsgSent   TblVer    InQ    OutQ   Up/Down       State/PfxRcd
      ------------------------------------------------------------------------------------------------------------------
      10.252.1.10       4    65533        3144      3564      50        0      0      1:01:50:41    ESTABLISHED/13
      10.252.1.11       4    65533        3144      3569      50        0      0      1:01:50:40    ESTABLISHED/14
      10.252.1.12       4    65533        3145      3576      50        0      0      1:01:50:41    ESTABLISHED/14
      10.252.1.13       4    65533        3144      3568      50        0      0      1:01:50:41    ESTABLISHED/13
      10.252.1.14       4    65533        3145      3572      50        0      0      1:01:50:41    ESTABLISHED/14
      

      If any of the sessions are in an Idle state, proceed to the next step.

  2. Reset BGP to re-establish the sessions.

    1. SSH to each BGP peer switch.

      For example:

      ncn-m001# ssh admin@10.252.0.2
      
    2. Verify BGP is enabled.

      sw-spine-001 [standalone: master] > show protocols | include bgp
       bgp:                    enabled
      
    3. Clear the BGP sessions.

      sw-spine-001 [standalone: master] > enable
      sw-spine-001 [standalone: master] # clear ip bgp all
      
    4. Check the status of the BGP sessions to see if they are now Established.

      It may take a few minutes for sessions to become Established.

      sw-spine-001 [standalone: master] # show ip bgp summary
      
      VRF name                  : default
      BGP router identifier     : 10.252.0.2
      local AS number           : 65533
      BGP table version         : 50
      Main routing table version: 50
      IPV4 Prefixes             : 68
      IPV6 Prefixes             : 0
      L2VPN EVPN Prefixes       : 0
      
      ------------------------------------------------------------------------------------------------------------------
      Neighbor          V    AS           MsgRcvd   MsgSent   TblVer    InQ    OutQ   Up/Down       State/PfxRcd
      ------------------------------------------------------------------------------------------------------------------
      10.252.1.10       4    65533        3144      3564      50        0      0      1:01:50:41    ESTABLISHED/13
      10.252.1.11       4    65533        3144      3569      50        0      0      1:01:50:40    ESTABLISHED/14
      10.252.1.12       4    65533        3145      3576      50        0      0      1:01:50:41    ESTABLISHED/14
      10.252.1.13       4    65533        3144      3568      50        0      0      1:01:50:41    ESTABLISHED/13
      10.252.1.14       4    65533        3145      3572      50        0      0      1:01:50:41    ESTABLISHED/14
      

    Once all sessions are in an Established state, BGP reset is complete for the Mellanox switches.

    Troubleshooting: If some sessions remain Idle, re-run the Mellanox reset steps to clear the sessions and re-check their status. The clear ip bgp all command may need to be run multiple times (up to 10 times). Wait a few minutes after each clear command before re-checking the BGP sessions. If some sessions still remain Idle, reapply the cray-metallb Helm chart, along with the BGP reset, to force the speaker pods to re-establish sessions with the switch.
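
When the summary has many neighbors, a filter can make the sessions that still need attention easier to spot. The following is a sketch only: the function name mellanox_not_established is hypothetical, and it assumes that saved "show ip bgp summary" output is available as text and that a session that is not up reports its state before the slash in the last column (for example IDLE/0), in the same way ESTABLISHED/13 does above.

```shell
# Hypothetical helper: read "show ip bgp summary" output on stdin and print
# "<neighbor> <state>" for every neighbor whose state is not ESTABLISHED.
mellanox_not_established() {
    awk '
        # Neighbor rows start with an IPv4 address; headers and rulers do not.
        $1 ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ {
            # The last column is State/PfxRcd; keep only the state part.
            split($NF, parts, "/")
            if (toupper(parts[1]) != "ESTABLISHED")
                print $1, parts[1]
        }
    '
}
```

Empty output means every neighbor row was in the ESTABLISHED state.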

Aruba

  1. Verify that all BGP sessions are in an Established state for the Aruba spine switches.

    SSH to each BGP peer switch to check the status of all BGP sessions.

    1. SSH to a BGP peer switch.

      ncn-m001# ssh admin@10.252.0.2
      
    2. View the status of the BGP sessions.

      sw-spine-001# show bgp ipv4 unicast summary
      VRF : default
      BGP Summary
      -----------
       Local AS               : 65533        BGP Router Identifier  : 10.252.0.2
       Peers                  : 4            Log Neighbor Changes   : No
       Cfg. Hold Time         : 180          Cfg. Keep Alive        : 60
       Confederation Id       : 0
      
       Neighbor        Remote-AS MsgRcvd MsgSent   Up/Down Time State        AdminStatus
       10.252.0.3      65533       19704   19708   00m:01w:00d  Established   Up
       10.252.1.10     65533       34455   39416   00m:01w:04d  Established   Up
       10.252.1.11     65533       34458   39400   00m:01w:04d  Established   Up
       10.252.1.12     65533       34448   39415   00m:01w:04d  Established   Up
      

      If any of the sessions are in an Idle state, proceed to the next step.

  2. Reset BGP to re-establish the sessions.

    1. SSH to each BGP peer switch.

      For example:

      ncn-m001# ssh admin@10.252.0.2
      
    2. Clear the BGP sessions.

      sw-spine-001# clear bgp *
      
    3. Check the status of the BGP sessions.

      It may take a few minutes for sessions to become Established.

      sw-spine-001# show bgp ipv4 unicast summary
      VRF : default
      BGP Summary
      -----------
       Local AS               : 65533        BGP Router Identifier  : 10.252.0.2
       Peers                  : 4            Log Neighbor Changes   : No
       Cfg. Hold Time         : 180          Cfg. Keep Alive        : 60
       Confederation Id       : 0
      
       Neighbor        Remote-AS MsgRcvd MsgSent   Up/Down Time State        AdminStatus
       10.252.0.3      65533       19704   19708   00m:01w:00d  Established   Up
       10.252.1.10     65533       34455   39416   00m:01w:04d  Established   Up
       10.252.1.11     65533       34458   39400   00m:01w:04d  Established   Up
       10.252.1.12     65533       34448   39415   00m:01w:04d  Established   Up
      

    Once all sessions are in an Established state, BGP reset is complete for the Aruba switches.

    Troubleshooting: If some sessions remain Idle, re-run the Aruba reset steps to clear the sessions and re-check their status. The clear bgp * command may need to be run multiple times (up to 10 times). Wait a few minutes after each clear command before re-checking the BGP sessions. If some sessions still remain Idle, proceed to the Troubleshooting section below to reapply the cray-metallb Helm chart, along with the BGP reset, to force the speaker pods to re-establish sessions with the switch.
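
The clear, wait, and re-check cycle described in the troubleshooting notes can be sketched as a bounded retry loop. This is an illustration rather than a supported tool: retry_bgp_clear is a hypothetical name, and the clear and check actions are passed in as caller-supplied commands because how they are run against a given switch (interactively or otherwise) is not specified here.

```shell
# Hypothetical helper: repeat a clear-and-check cycle a bounded number of times.
#   $1 = maximum number of clear attempts (the text suggests up to 10)
#   $2 = seconds to wait after each clear before re-checking
#   $3 = command or function that clears the BGP sessions
#   $4 = command or function that returns 0 once all sessions are Established
retry_bgp_clear() {
    max_attempts=$1
    wait_seconds=$2
    clear_cmd=$3
    check_cmd=$4
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        # Stop as soon as the check passes; no further clear is needed.
        "$check_cmd" && return 0
        echo "attempt $attempt of $max_attempts: clearing BGP sessions" >&2
        "$clear_cmd"
        sleep "$wait_seconds"
        attempt=$((attempt + 1))
    done
    # Final check after the last clear decides the return status.
    "$check_cmd"
}
```

A call such as retry_bgp_clear 10 180 my_clear my_check, where my_clear and my_check are placeholders for site-specific functions, would clear at most 10 times with a three-minute pause after each clear. A non-zero return indicates that the sessions are still not Established and the Helm chart procedure is the next option.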

Troubleshooting

Re-apply the cray-metallb Helm Chart

  1. Determine the cray-metallb chart version that is currently deployed.

    ncn-m001# helm ls -A -a | grep cray-metallb
    cray-metallb   metallb-system   1   2021-02-10 14:58:43.902752441 -0600 CST  deployed  cray-metallb-0.12.2   0.8.1
    
  2. Create a manifest file that will be used to reapply the same chart version.

    ncn-m001# cat << EOF > ./metallb-manifest.yaml
    apiVersion: manifests/v1beta1
    metadata:
      name: reapply-metallb
    spec:
      charts:
      - name: cray-metallb
        namespace: metallb-system
        values:
          imagesHost: dtr.dev.cray.com
        version: 0.12.2
    EOF
    
  3. Open SSH sessions to all spine switches.

  4. Determine the CSM_RELEASE version that is currently running and set an environment variable.

    For example:

    ncn-m001# CSM_RELEASE=0.8.0
    
  5. Mount PITDATA so that the Helm charts are available for the re-install (it might already be mounted), and verify that the chart with the expected version exists.

    ncn-m001# mkdir -pv /mnt/pitdata
    ncn-m001# mount -L PITDATA /mnt/pitdata
    ncn-m001# ls /mnt/pitdata/csm-${CSM_RELEASE}/helm/cray-metallb*
    /mnt/pitdata/csm-0.8.0/helm/cray-metallb-0.12.2.tgz
    
  6. Uninstall the current cray-metallb chart.

    Until the chart is reapplied, this will also affect unbound name resolution, and all BGP sessions will be Idle for all of the worker nodes.

    ncn-m001# helm del cray-metallb -n metallb-system
    
  7. Use the open SSH sessions to the switches to clear the BGP sessions based on the above Mellanox or Aruba procedures.

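    For Mellanox, refer to substeps 1-3 of the reset step (step 2) in the Mellanox procedure.

    For Aruba, refer to substeps 1-2 of the reset step (step 2) in the Aruba procedure.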

  8. Reapply the cray-metallb chart based on the CSM_RELEASE.

    ncn-m001# loftsman ship --manifest-path ./metallb-manifest.yaml \
    --charts-path /mnt/pitdata/csm-${CSM_RELEASE}/helm
    
  9. Check that the speaker pods are all running.

    This may take a few minutes.

    ncn-m001# kubectl get pods -n metallb-system
    NAME                                       READY   STATUS    RESTARTS   AGE
    cray-metallb-controller-6d545b5ccc-mm4qz   1/1     Running   0          79m
    cray-metallb-speaker-4nrzq                 1/1     Running   0          76m
    cray-metallb-speaker-b5m2n                 1/1     Running   0          79m
    cray-metallb-speaker-h7s7b                 1/1     Running   0          79m
    
  10. Use the open SSH sessions to the switches to check the status of the BGP sessions.

    For Mellanox, refer to the verification step (step 1) in the Mellanox procedure.

    For Aruba, refer to the verification step (step 1) in the Aruba procedure.
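
The version lookup in step 1 and the pod check in step 9 can each be sketched as a small text filter. Both function names below are hypothetical. The first assumes the chart column in the helm ls output has the form cray-metallb-<version>, as in the example output; the second assumes the default kubectl get pods column order (NAME, READY, STATUS).

```shell
# Hypothetical helper: read "helm ls -A -a" output on stdin and print the
# version suffix of the cray-metallb chart (for example 0.12.2).
metallb_chart_version() {
    awk '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /^cray-metallb-[0-9]/) {
                sub(/^cray-metallb-/, "", $i)
                print $i
                exit
            }
    }'
}

# Hypothetical helper: read "kubectl get pods" output on stdin and print the
# name of every pod that is not Running with all of its containers ready.
pods_not_ready() {
    awk '$1 != "NAME" {
        split($2, ready, "/")
        if ($3 != "Running" || ready[1] != ready[2])
            print $1
    }'
}

# Assumed usage on an NCN:
#   helm ls -A -a | metallb_chart_version
#   kubectl get pods -n metallb-system | pods_not_ready
```

The printed version is the value to place in the version field of the manifest, and empty output from pods_not_ready suggests the controller and speaker pods have all come up.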