multiprocessing: OSError: AF_UNIX path too long #1571
From your logs, it does not seem like the error is in RETURNN code itself; it seems the error occurred in some sub proc. Can you give details on what sub proc this is?

You posted the stack of that sub proc, but the main thread of the RETURNN main proc crashed, specifically in the Torch engine at:

```python
self._mp_manager = torch.multiprocessing.Manager()
```

So, this looks very much outside of the scope of RETURNN? This is either pure Python related or PyTorch related (but I assume just pure Python).
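As a side note (not from the original thread): the failing call is just the stdlib manager under the hood, and the underlying limit can be demonstrated without RETURNN or PyTorch at all. A minimal sketch, assuming Linux and its ~107-byte limit on AF_UNIX socket paths:

```python
# Minimal sketch (not from the issue): AF_UNIX socket paths are limited to
# roughly 107 bytes on Linux, so binding a listener under a very long temp dir
# fails with the same error message as in the report.
import socket
import tempfile

# Hypothetical overlong path, only to trigger the limit.
long_path = tempfile.gettempdir() + "/" + "x" * 200

sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
try:
    sock.bind(long_path)
except OSError as exc:
    print(exc)  # prints: AF_UNIX path too long
finally:
    sock.close()
```

`multiprocessing.Manager()` creates exactly such a listener socket under a directory derived from the temp dir, which is why an overly long temp path can make it fail.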
I assume some bad environment configuration. Maybe weird ulimits or so, or a wrong tmp path. I wonder a bit where it tries to bind the socket (i.e. what address it uses).
This is the relevant code from the stack trace where the read fails:
https://github.com/python/cpython/blob/26d24eeb90d781e381b97d64b4dcb1ee4dd891fe/Lib/multiprocessing/managers.py#L554-L570

The read goes over a connection, and the address comes from a temp dir:
https://github.com/python/cpython/blob/26d24eeb90d781e381b97d64b4dcb1ee4dd891fe/Lib/multiprocessing/connection.py#L70-L82

which is created like this:
https://github.com/python/cpython/blob/26d24eeb90d781e381b97d64b4dcb1ee4dd891fe/Lib/multiprocessing/util.py#L133-L145

For my account on the node, the temp dir path does not look excessive, so maybe some strange user config @michelwi? Apparently what's failing here is the child process trying to bind to that socket.
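To make that concrete on the affected node, a small check (a sketch, not from the thread) that prints the temp dir multiprocessing picks and the listener address it would generate from it, using only stdlib helpers:

```python
# Sketch: show which temp dir multiprocessing would use and what listener
# address it would generate, so the resulting path length can be inspected.
import tempfile
import multiprocessing.util as mpu
import multiprocessing.connection as mpc

print("tempfile.gettempdir():", tempfile.gettempdir())
print("multiprocessing temp dir:", mpu.get_temp_dir())

addr = mpc.arbitrary_address(mpc.default_family)
print("generated listener address:", addr)
if isinstance(addr, str):
    print("address length:", len(addr))
```

On Linux, `default_family` is `'AF_UNIX'`, so `addr` is a filesystem path; anything approaching ~107 characters is a problem.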
Note, the temp dir logic of Python, i.e. where it would create those temp files/dirs:

```python
def _candidate_tempdir_list():
    """Generate a list of candidate temporary directories which
    _get_default_tempdir will try."""

    dirlist = []

    # First, try the environment.
    for envname in 'TMPDIR', 'TEMP', 'TMP':
        dirname = _os.getenv(envname)
        if dirname: dirlist.append(dirname)

    # Failing that, try OS-specific locations.
    if _os.name == 'nt':
        dirlist.extend([ _os.path.expanduser(r'~\AppData\Local\Temp'),
                         _os.path.expandvars(r'%SYSTEMROOT%\Temp'),
                         r'c:\temp', r'c:\tmp', r'\temp', r'\tmp' ])
    else:
        dirlist.extend([ '/tmp', '/var/tmp', '/usr/tmp' ])

    # As a last resort, the current directory.
    try:
        dirlist.append(_os.getcwd())
    except (AttributeError, OSError):
        dirlist.append(_os.curdir)

    return dirlist
```

So, is TMPDIR (or TEMP/TMP) set to something unusual in your environment?
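To answer that on the failing node, a quick check along the lines of the code above (a sketch, not something from the thread): it prints the env vars that are consulted first and whether the fallback candidates are usable.

```python
# Sketch: inspect what the candidate list above resolves to on this machine.
import os
import tempfile

# The env vars checked first by _candidate_tempdir_list().
for name in ("TMPDIR", "TEMP", "TMP"):
    print(f"{name} = {os.environ.get(name)!r}")

# The Unix fallback candidates, plus the current directory as last resort.
for cand in ("/tmp", "/var/tmp", "/usr/tmp", os.getcwd()):
    print(f"{cand}: exists={os.path.isdir(cand)}, writable={os.access(cand, os.W_OK | os.X_OK)}")

print("tempfile.gettempdir() picks:", tempfile.gettempdir())
```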
But in any case, what is the address it actually tries to use?
I just got bitten by the same error in a training not using the new dataset or caching mechanism.
So what is the address?
Sorry, I could not figure it out. Then my node was rebooted and the error is gone again.
Got the address. It seems reproducible for me on g-16. I don't know the root cause yet, i.e. why it's behaving strangely there.
I'm not sure why it won't take e.g. /var/tmp. I checked from the container, it is writable for me:

```python
>>> os.open("/var/tmp/test2", os.O_RDWR | os.O_CREAT | os.O_EXCL)
3
```

Or is this something w.r.t. the child processes lacking some rights that the parent process has? I'm not sure what this is yet.
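One way to test the second hypothesis (a sketch, not from the thread; the socket file names are made up): do the same AF_UNIX bind once from the parent and once from a freshly spawned child process, and compare.

```python
# Sketch: compare AF_UNIX bind behavior between parent and child process.
import multiprocessing
import os
import socket


def try_bind(path):
    """Try to bind an AF_UNIX socket at the given path and report the result."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        sock.bind(path)
        print(f"pid {os.getpid()}: bind to {path} OK")
    except OSError as exc:
        print(f"pid {os.getpid()}: bind to {path} failed: {exc}")
    finally:
        sock.close()
        if os.path.exists(path):
            os.unlink(path)


if __name__ == "__main__":
    # Hypothetical test paths under the directory in question.
    try_bind("/var/tmp/bindtest-parent.sock")
    child = multiprocessing.Process(target=try_bind, args=("/var/tmp/bindtest-child.sock",))
    child.start()
    child.join()
```

If the child fails where the parent succeeds, that points at per-process restrictions (container setup or similar) rather than plain directory permissions.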
Yea that's a problem. It should definitely not use that.
Yea that's strange. We should find out, e.g. just step-by-step debug through it.
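A lighter alternative to stepping through with a debugger (a sketch, assuming the default fork start method on Linux so the manager's server process inherits the logging setup): enable multiprocessing's own debug logging, which reports the temp directory it creates.

```python
# Sketch: turn on multiprocessing's internal logging before creating the
# Manager; the output should include a "created temp directory ..." line,
# which shows where the AF_UNIX listener path will live.
import logging
import multiprocessing

multiprocessing.log_to_stderr(logging.DEBUG)

if __name__ == "__main__":
    mgr = multiprocessing.Manager()
    mgr.shutdown()
```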
Yesterday I started a training with DistributeFilesDataset and file caching, which today crashed, and which consistently crashes after restarting, with what I think is

OSError: AF_UNIX path too long

in _TouchFilesThread.run.

The original files reside on a Ceph cluster and are cached to an LVM volume with an ext4 filesystem. The basename should be within a reasonable length, but the absolute path can be quite long (see the path-length check sketched below).

I try to attach one version of each flavor of the traces, in order of appearance:

- OSError
- _TouchFilesThread
- Main
- torch.distributed

EDIT: fixed torch.distributed trace
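Not part of the original report, but the path-length concern above can be checked up front with a small sketch (the ~32-character overhead for multiprocessing's generated names is an estimate, not an exact figure):

```python
# Sketch: warn before training if the temp dir is so long that AF_UNIX
# listener paths created under it would exceed the Linux limit.
import tempfile

AF_UNIX_PATH_MAX = 107  # sun_path limit on Linux, excluding the trailing NUL
MP_NAME_OVERHEAD = 32   # rough length of "/pymp-XXXXXXXX/listener-XXXXXXXX"

tmpdir = tempfile.gettempdir()
estimated = len(tmpdir) + MP_NAME_OVERHEAD
print(f"temp dir: {tmpdir} (len {len(tmpdir)}, estimated socket path length ~{estimated})")
if estimated >= AF_UNIX_PATH_MAX:
    print("Likely too long for AF_UNIX sockets; consider exporting a short TMPDIR, e.g. /tmp.")
```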