There are cases where API calls or cray command invocations fail (sometimes intermittently) with an HTTP 503 error code.
If this occurs, attempt to remediate the issue by taking the following actions, according to the specific error codes
found in the pod or Envoy container log.
The Envoy container is typically named istio-proxy, and it runs as a sidecar for pods that are part of the Istio mesh.
For pods with this sidecar, the logs can be viewed by running a command similar to the following:
ncn-mw# kubectl logs <podname> -n <namespace> -c istio-proxy | grep 503
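In each access log entry, the Envoy response flag (for example, UF,URX or UAEX) appears immediately after the HTTP status code. As a rough illustration (not part of the product), the flag can be pulled out of a saved log line with awk. The sample line below is the one shown later on this page, and the field positions assume Envoy's default access log format:

```shell
# Sample Envoy access log line; in practice, pipe the output of
# `kubectl logs <podname> -n <namespace> -c istio-proxy` through the awk
# filter instead of using a fixed string.
line='[2022-05-10T16:27:29.232Z] "POST /apis/hbtd/hmi/v1/heartbeat HTTP/2" 503 UF,URX "-" "TLS error: Secret is not supplied by SDS"'

# With whitespace-separated fields, field 5 is the HTTP status and field 6
# is the Envoy response flag (this assumes the default access log format).
flags=$(echo "$line" | awk '$5 == 503 { print $6 }')
echo "$flags"
```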
For general Kubernetes troubleshooting information, including more information on viewing pod logs, see Kubernetes troubleshooting topics.
This page is broken into different sections, based on the errors found in the log.
UF,URX with TLS error

Example log entry:

[2022-05-10T16:27:29.232Z] "POST /apis/hbtd/hmi/v1/heartbeat HTTP/2" 503 UF,URX "-" "TLS error: Secret is not supplied by SDS"

Cause: Envoy containers can occasionally get into this state when NCNs are being rebooted or upgraded, as well as when many deployments are being created.

Fix: Do a Kubernetes delete or rolling restart:
If it is a single replica, then delete the pod.
If it is one of multiple replicas exhibiting the issue, then perform a rolling restart of the deployment or StatefulSet.
Here is an example of how to do that for the istio-ingressgateway deployment in the istio-system namespace.
Initiate a rolling restart of the deployment.
ncn-mw# kubectl rollout restart -n istio-system deployment istio-ingressgateway
Wait for the restart to complete.
ncn-mw# kubectl rollout status -n istio-system deployment istio-ingressgateway
Once the rollout is complete, or the new pod is running, the HTTP 503 errors should clear.
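The restart-then-wait pattern above can be wrapped in a small helper. This is an illustrative sketch, not a CSM-provided tool; the restart_and_wait name and the KUBECTL override are assumptions made so the sketch can be exercised without a cluster:

```shell
# Hypothetical helper (not part of CSM) that restarts a workload and blocks
# until the rollout finishes, mirroring the two manual steps above.
# KUBECTL is parameterized only so the sketch can be tried without a cluster.
KUBECTL="${KUBECTL:-kubectl}"

restart_and_wait() {
    local namespace="$1" kind="$2" name="$3"
    # Trigger the rolling restart, then wait for it to complete.
    "$KUBECTL" rollout restart -n "$namespace" "$kind" "$name" || return 1
    "$KUBECTL" rollout status -n "$namespace" "$kind" "$name"
}

# Example invocation (requires a cluster):
# restart_and_wait istio-system deployment istio-ingressgateway
```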
UAEX

Example log entry:

[2022-06-24T14:16:27.229Z] "POST /apis/hbtd/hmi/v1/heartbeat HTTP/2" 503 UAEX "-" 131 0 30 - "10.34.0.0" "-" "1797b0d3-56f0-4674-8cf2-a8a61f9adaea" "api-gw-service-nmn.local" "-" - - 10.40.0.29:443 10.34.0.0:15995 api-gw-service-nmn.local -

Cause: This error code typically indicates an issue with the authorization service (for example, Spire).

Fix: Initiate a rolling restart of Spire.
ncn-mw# kubectl rollout restart -n spire statefulset spire-postgres spire-server
ncn-mw# kubectl rollout restart -n spire daemonset spire-agent request-ncn-join-token
ncn-mw# kubectl rollout restart -n spire deployment spire-jwks spire-postgres-pooler
Wait for all of the restarts to complete.
ncn-mw# kubectl rollout status -n spire statefulset spire-postgres
ncn-mw# kubectl rollout status -n spire statefulset spire-server
ncn-mw# kubectl rollout status -n spire daemonset spire-agent
ncn-mw# kubectl rollout status -n spire daemonset request-ncn-join-token
ncn-mw# kubectl rollout status -n spire deployment spire-jwks
ncn-mw# kubectl rollout status -n spire deployment spire-postgres-pooler
Once the restarts are all complete, the HTTP 503 message should clear.
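The six status checks above can also be driven from one loop. This is a sketch, not part of the product; wait_for_spire_rollouts is a hypothetical helper, and KUBECTL is parameterized only so the loop can be exercised without a cluster:

```shell
# Hypothetical helper: wait on every Spire rollout from the restart step
# above in one loop. Resource kinds and names are taken from this page.
KUBECTL="${KUBECTL:-kubectl}"

wait_for_spire_rollouts() {
    local resource
    for resource in \
        statefulset/spire-postgres statefulset/spire-server \
        daemonset/spire-agent daemonset/request-ncn-join-token \
        deployment/spire-jwks deployment/spire-postgres-pooler
    do
        "$KUBECTL" rollout status -n spire "$resource" || return 1
    done
}
```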
Although the codes above are the most common, various other issues, such as networking or application errors, can cause different errors in the pod or sidecar logs. Refer to the Envoy access log documentation for a list of possible Envoy response flags. In general, it is good practice to run a rolling restart of the application itself to see if that clears the error. If it does not resolve the problem, then further troubleshooting requires understanding what the error message or response flag means.
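When it is unclear which response flag dominates, a quick frequency count of the flags on 503 entries can point at the relevant section above. This is a rough sketch: the sample lines stand in for real istio-proxy log output, and the field positions assume Envoy's default access log format:

```shell
# Sample log lines; in practice substitute the output of
# `kubectl logs <podname> -n <namespace> -c istio-proxy`.
logs='[2022-05-10T16:27:29.232Z] "POST /a HTTP/2" 503 UF,URX "-"
[2022-05-10T16:27:30.001Z] "GET /b HTTP/2" 503 UAEX "-"
[2022-05-10T16:27:31.500Z] "GET /c HTTP/2" 503 UF,URX "-"'

# Keep only 503 entries (field 5), print the response flag (field 6),
# then count occurrences of each flag, most frequent first.
summary=$(echo "$logs" | awk '$5 == 503 { print $6 }' | sort | uniq -c | sort -rn)
echo "$summary"
```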