Manually update the boa-job-template
ConfigMap to tune the timeout and sleep intervals for the Boot Orchestration Agent (BOA).
Correcting the timeout value is a good troubleshooting option for when Boot Orchestration Service (BOS) sessions hang waiting for nodes to be in a Ready
state.
If the BOS timeout occurs when booting compute nodes, the system will be unable to boot via BOS.
A BOS session was run and compute nodes are failing to move to a Ready
state.
Edit the boa-job-template
ConfigMap to add the new timeout values.
ncn-mw# kubectl edit configmap -n services boa-job-template
Node boots can be set to time out faster by adding the following environment variables to the boa-job-template
. These variables do not appear in the ConfigMap by default.
NODE_STATE_CHECK_NUMBER_OF_RETRIES
NODE_STATE_CHECK_SLEEP_INTERVAL
The current default behavior in the absence of these parameters is 5 seconds (sleep interval) x 120 (retries), which has a timeout of 600 seconds or 10 minutes. The default values are shown below:
- name: "NODE_STATE_CHECK_NUMBER_OF_RETRIES"
value: "120"
- name: "NODE_STATE_CHECK_SLEEP_INTERVAL"
value: "5"
The example below increases the number of retries to 360, which results in a timeout of 1800 seconds or 30 minutes if the sleep interval is not changed from the default value of 5 seconds. Different values might be needed depending on system size.
Add the following values to the ConfigMap:
- name: "NODE_STATE_CHECK_NUMBER_OF_RETRIES"
value: "360"
The new variables need to be placed under the environment (env:
) section in the ConfigMap. As an example, the env
section in the ConfigMap looks as below.
env:
- name: OPERATION
value: "{{ operation }}"
- name: SESSION_ID
value: "{{ session_id }}"
- name: SESSION_TEMPLATE_ID
value: "{{ session_template_id }}"
- name: SESSION_LIMIT
value: "{{ session_limit }}"
- name: DATABASE_NAME
value: "{{ DATABASE_NAME }}"
- name: DATABASE_PORT
value: "{{ DATABASE_PORT }}"
- name: LOG_LEVEL
value: "{{ log_level }}"
- name: SINGLE_THREAD_MODE
value: "{{ single_thread_mode }}"
- name: S3_ACCESS_KEY
valueFrom:
secretKeyRef:
name: {{ s3_credentials }}
key: access_key
- name: S3_SECRET_KEY
valueFrom:
secretKeyRef:
name: {{ s3_credentials }}
key: secret_key
- name: GIT_SSL_CAINFO
value: /etc/cray/ca/certificate_authority.crt
- name: S3_PROTOCOL
value: "{{ S3_PROTOCOL }}"
- name: S3_GATEWAY
value: "{{ S3_GATEWAY }}"
**- name: "NODE_STATE_CHECK_NUMBER_OF_RETRIES"
value: "360"**
Restart BOA.
Restarting BOA will allow the new timeout values to take effect.
ncn-mw# kubectl scale deployment -n services cray-bos --replicas=0 && kubectl scale deployment -n services cray-bos --replicas=1