Troubleshoot Compute Node Boot Issues Related to Trivial File Transfer Protocol (TFTP)

TFTP issues can result in node boot failures. Use this procedure to determine whether TFTP requests reach the TFTP service and, if they do not, where they are being lost.

Prerequisites

This procedure requires administrative privileges.

Limitations

Encryption of compute node logs is not enabled, so passwords may be passed in clear text.

Procedure

  1. (ncn-mw#) Check that the TFTP service is running. Each cray-tftp pod listed should show a STATUS of Running.

     kubectl get pods -n services -o wide | grep cray-tftp
    
  2. Start a tcpdump session on the NCN that captures TFTP traffic (UDP port 69) on the Node Management Network (NMN) interface.

  3. (ncn-mw#) Obtain the TFTP pod’s name.

     PODID=$(kubectl get pods -n services --no-headers -o wide | grep cray-tftp | awk '{print $1}')
     echo "$PODID"
    
  4. (ncn-mw#) Enter the TFTP pod.

     If more than one cray-tftp pod is listed, PODID contains several names. Set PODID to just one of them before running the following command.

     kubectl exec -n services -it "$PODID" -- /bin/sh
    
  5. Start a tcpdump session from within the TFTP pod, again capturing TFTP traffic (UDP port 69).

  6. In another terminal, perform the following tasks:

    1. Use a TFTP client to request a file from the TFTP service, either from the NCN or from a laptop with network access to the TFTP server.

    2. Check the NCN tcpdump output to confirm that the TFTP read request (RRQ) is visible.

  7. Return to the original terminal and check the TFTP pod’s tcpdump output to confirm that the same TFTP request is visible inside the pod.
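Steps 1, 3, and 4 can also be scripted so that the pod status check and the single-pod selection are explicit. The following is a minimal sketch, assuming the default column layout of kubectl get pods (NAME, READY, STATUS, ...) and assuming, as the grep in step 3 does, that the TFTP pods are the ones whose names start with cray-tftp. The function names tftp_pods_running and first_tftp_pod are hypothetical and are not part of any Cray tooling.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for steps 1, 3, and 4 (not part of any Cray tool).
# Both read `kubectl get pods` output on stdin, so they can be tried
# without a cluster.

# Succeeds only if at least one cray-tftp pod is listed and every such
# pod has STATUS "Running" with all containers ready (READY "n/n").
tftp_pods_running() {
  awk '
    $1 ~ /^cray-tftp/ {
      found = 1
      split($2, ready, "/")   # READY column, for example "1/1"
      if ($3 != "Running" || ready[1] != ready[2]) {
        print "not ready: " $0 > "/dev/stderr"
        bad = 1
      }
    }
    END { if (found && !bad) exit 0; exit 1 }
  '
}

# Prints only the first cray-tftp pod name, so PODID holds a single name.
first_tftp_pod() {
  awk '$1 ~ /^cray-tftp/ { print $1; exit }'
}
```

For example, kubectl get pods -n services | tftp_pods_running prints any cray-tftp pod that is not ready and returns a non-zero status, and PODID=$(kubectl get pods -n services --no-headers | first_tftp_pod) sets PODID to exactly one pod name before the kubectl exec command in step 4.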
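The procedure does not name specific capture or client commands for steps 2 and 5 through 7. One possible set is sketched below, assuming that tcpdump is available on the NCN and inside the TFTP pod, and that curl (built with TFTP support) is used as the TFTP client. TFTP_IP and TFTP_FILE are hypothetical placeholders, and ipxe.efi is only an assumed example of a file served over TFTP; substitute values for your system.

```shell
#!/usr/bin/env bash
# Sketch of steps 2 and 5 to 7. TFTP_IP and TFTP_FILE are hypothetical
# placeholders; set them to values that match your system.
TFTP_IP="${TFTP_IP:-}"              # address of the TFTP service
TFTP_FILE="${TFTP_FILE:-ipxe.efi}"  # assumed name of a file served by TFTP

# Steps 2 and 5: print TFTP requests (UDP port 69) as they arrive, with
# numeric hosts and ports (-nn) and line-buffered output (-l).
# The interface defaults to "any"; pass the NMN interface to narrow it.
start_capture() {
  tcpdump -i "${1:-any}" -nn -l udp port 69
}

# Step 6.1: send a TFTP read request. curl supports tftp:// URLs.
send_tftp_request() {
  : "${TFTP_IP:?set TFTP_IP to the TFTP service address}"
  curl --silent --show-error --max-time 10 -o /dev/null \
    "tftp://${TFTP_IP}/${TFTP_FILE}"
}

# Steps 6.2 and 7: succeed if saved tcpdump text (on stdin) contains a
# TFTP read request (RRQ), optionally for a specific file name.
capture_has_rrq() {
  if [ -n "${1:-}" ]; then
    grep -q -F "RRQ \"$1\""
  else
    grep -q -w RRQ
  fi
}
```

For example, run start_capture <NMN interface> | tee /tmp/ncn-tftp.txt on the NCN and the equivalent tcpdump -i any -nn -l udp port 69 command inside the pod, run send_tftp_request from the other terminal, and then check the saved NCN output with capture_has_rrq ipxe.efi < /tmp/ncn-tftp.txt. A request that appears in the NCN capture but not in the pod capture points at the pod network.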

Troubleshooting

If the TFTP request is missing from one or both packet captures, consider the following:

  • Firewall issues: TFTP traffic (UDP port 69) may be blocked by firewall rules on the NCN or elsewhere in the network.
  • Wrong interface: Ensure that the TFTP request was sent over the interface connected to the Node Management Network (NMN), and that tcpdump is capturing on that interface.
  • Network routing: Verify that routing is configured correctly between the client and the TFTP server.
  • Pod network issues: If traffic reaches the worker node but not the pod, there may be issues with the pod network or Kubernetes networking components.
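The firewall, interface, and routing items can be given a first pass from the NCN with standard tools. The sketch below assumes that the NCN firewall rules can be listed with iptables-save and that the iproute2 ip command is installed; tftp_block_rules and tftp_route are hypothetical helper names. The firewall check is a heuristic only and does not replace reading the full rule set.

```shell
#!/usr/bin/env bash
# Hypothetical first-pass checks for the troubleshooting list above.

# Firewall: print iptables-save rules (read on stdin) that explicitly DROP
# or REJECT destination port 69 (tftp). Heuristic only: rules that block
# all traffic or use multiport port lists are not reported.
tftp_block_rules() {
  grep -E -- '-j (DROP|REJECT)' | grep -E -- '--dport (69|tftp)([^0-9]|$)'
}

# Wrong interface / routing: show the interface ("dev") and source address
# the kernel would use to reach the TFTP server.
tftp_route() {
  ip route get "${1:?usage: tftp_route <tftp-server-ip>}"
}
```

For example, run iptables-save | tftp_block_rules and tftp_route <TFTP server IP> on the NCN. If the dev shown by ip route get is not the interface you expect for the NMN, requests to the TFTP server are leaving through the wrong interface.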