On some systems, IMS image creation will fail with the following error in the CFS pod log:
Traceback (most recent call last):
File "/usr/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/usr/lib/python3.9/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib/python3.9/site-packages/cray/cfs/inventory/image/__init__.py", line 239, in _request_ims_ssh
mpq.put(ImageRootInventory._wait_for_ssh_container(ims_id, job_id, cfs_session))
File "/usr/lib/python3.9/site-packages/cray/cfs/inventory/image/__init__.py", line 290, in _wait_for_ssh_container
raise CFSInventoryError(
cray.cfs.inventory.CFSInventoryError: ('IMS status=error for IMS image=%r job=%r, SSH container was not created.', 'b72ca306-965c-4139-a8db-fb84233b2c1f', '68d681ca-cf5e-44f8-b089-8b482e009ea1')
2022-11-02 16:53:21,764 - INFO - cray.cfs.inventory.image - Removing public key from IMS.
2022-11-02 16:53:21,822 - ERROR - cray.cfs.inventory - An error occurred while attempting to generate the inventory. Error: One or more IMS jobs failed to launch.
When this happens, the IMS pod creating the image may contain this error:
File "/scripts/fetch.py", line 221, in download_file
for chunk in response.iter_content(chunk_size=1024*1024):
File "/scripts/venv/lib/python3.9/site-packages/requests/models.py", line 754, in generate
raise ChunkedEncodingError(e)
requests.exceptions.ChunkedEncodingError: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))
Run the following command from a master node. Be sure to change the -w ncn-w[0-n]
argument reflects all the worker nodes for the system (pdsh
supports multiple -w
arguments):
pdsh -w ncn-w00[1-n] 'sysctl -w net.netfilter.nf_conntrack_tcp_be_liberal=1'
Once the sysctl
command is complete, the Connection reset by peer
errors in the IMS pod should no longer appear and the CFS job should complete successfully.