Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

multiprocessing: OSError: AF_UNIX path too long #1571

Open
michelwi opened this issue Jul 3, 2024 · 11 comments
Open

multiprocessing: OSError: AF_UNIX path too long #1571

michelwi opened this issue Jul 3, 2024 · 11 comments
Assignees

Comments

@michelwi
Copy link
Collaborator

michelwi commented Jul 3, 2024

Yesterday I started a training with DistributeFilesDataset and file caching which today crashed and consistently crashes after restarting with what I think is OSError: AF_UNIX path too long in the _TouchFilesThread.run.

The original files reside on a ceph cluster and are cached to an LVM volume with ext4 filesystem. The basename should be within reasonable length but the absolute path can be quite long.

I try to attach one version of each flavor of traces in order of appearence:

OSError
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.10/multiprocessing/managers.py", line 591, in _run_server
    server = cls._Server(registry, address, authkey, serializer)
  File "/usr/lib/python3.10/multiprocessing/managers.py", line 156, in __init__
    self.listener = Listener(address=address, backlog=16)
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 448, in __init__
    self._listener = SocketListener(address, family, backlog)
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 591, in __init__
    self._socket.bind(address)
OSError: AF_UNIX path too long
_TouchFilesThread
Unhandled exception <class 'EOFError'> in thread <_MainThread(MainThread, started 139642623541696)>, proc 96008.

Thread <_TouchFilesThread(Thread-1, started daemon 139638043833920)>:
Process SyncManager-1:
  File "/usr/lib/python3.10/threading.py", line 324, in Condition.wait
    line: gotit = waiter.acquire(True, timeout)
    locals:
      gotit = <local> False
      waiter = <local> <locked _thread.lock object at 0x7f6b43d94640>
      waiter.acquire = <local> <built-in method acquire of _thread.lock object at 0x7f6b43d94640>
      timeout = <local> 1.0
  File "/usr/lib/python3.10/threading.py", line 607, in Event.wait
    line: signaled = self._cond.wait(timeout)
    locals:
      signaled = <local> False
      self = <local> <threading.Event object at 0x7f6b43d661d0>
      self._cond = <local> <Condition(<unlocked _thread.lock object at 0x7f6b43d85e80>, 1)>
      self._cond.wait = <local> <bound method Condition.wait of <Condition(<unlocked _thread.lock object at 0x7f6b43d85e80>, 1)>>
      timeout = <local> 1.0
  File "/nas/models/asr/am/EN/16kHz/20240625_wmichel_en_model_training/work/i6_core/tools/git/CloneGitRepositoryJob.h2Y3KqeU0C4d/output/returnn/returnn/util/file_cache.py", line 418, in _TouchFilesThread.run
    line: if self.stop.wait(self.interval):
    locals:
      self = <local> <_TouchFilesThread(Thread-1, started daemon 140098675930688)>
      self.stop = <local> <threading.Event object at 0x7f6b43d661d0>
      self.stop.wait = <local> <bound method Event.wait of <threading.Event object at 0x7f6b43d661d0>>
      self.interval = <local> 1.0
  File "/usr/lib/python3.10/threading.py", line 1016, in Thread._bootstrap_inner
    line: self.run()
    locals:
      self = <local> <_TouchFilesThread(Thread-1, started daemon 140098675930688)>
      self.run = <local> <bound method _TouchFilesThread.run of <_TouchFilesThread(Thread-1, started daemon 140098675930688)>>
  File "/usr/lib/python3.10/threading.py", line 973, in Thread._bootstrap
    line: self._bootstrap_inner()
    locals:
      self = <local> <_TouchFilesThread(Thread-1, started daemon 140098675930688)>
      self._bootstrap_inner = <local> <bound method Thread._bootstrap_inner of <_TouchFilesThread(Thread-1, started daemon 140098675930688)>>

Thread current, main, <_MainThread(MainThread, started 140103104491968)>:
(Excluded thread.)
Main
EXCEPTION
Traceback (most recent call last):
  File "/nas/models/asr/am/EN/16kHz/20240625_wmichel_en_model_training/work/i6_core/tools/git/CloneGitRepositoryJob.h2Y3KqeU0C4d/output/returnn/rnn.py", line 11, in <module>
    line: main()
    locals:
      main = <local> <function main at 0x7fd88b3a27a0>
  File "/nas/models/asr/am/EN/16kHz/20240625_wmichel_en_model_training/work/i6_core/tools/git/CloneGitRepositoryJob.h2Y3KqeU0C4d/output/returnn/returnn/__main__.py", line 735, in main
    line: init(command_line_options=argv[1:])
    locals:
      init = <global> <function init at 0x7fd88b3a24d0>
      command_line_options = <not found>
      argv = <local> ['/nas/models/asr/am/EN/16kHz/20240625_wmichel_en_model_training/work/i6_core/tools/git/CloneGitRepositoryJob.h2Y3KqeU0C4d/output/returnn/rnn.py', '/nas/models/asr/am/EN/16kHz/20240625_wmichel_en_model_training/work/i6_core/returnn/training/ReturnnTrainingJob.V7zBp10cMkXR/output/returnn.config'], _[0]: {len = 142}
  File "/nas/models/asr/am/EN/16kHz/20240625_wmichel_en_model_training/work/i6_core/tools/git/CloneGitRepositoryJob.h2Y3KqeU0C4d/output/returnn/returnn/__main__.py", line 487, in init
    line: init_engine()
    locals:
      init_engine = <global> <function init_engine at 0x7fd88b3a2320>
  File "/nas/models/asr/am/EN/16kHz/20240625_wmichel_en_model_training/work/i6_core/tools/git/CloneGitRepositoryJob.h2Y3KqeU0C4d/output/returnn/returnn/__main__.py", line 294, in init_engine
    line: engine = Engine(config=config)
    locals:
      engine = <global> None
      Engine = <local> <class 'returnn.torch.engine.Engine'>
      config = <global> <returnn.config.Config object at 0x7fd88b333fd0>
  File "/nas/models/asr/am/EN/16kHz/20240625_wmichel_en_model_training/work/i6_core/tools/git/CloneGitRepositoryJob.h2Y3KqeU0C4d/output/returnn/returnn/torch/engine.py", line 60, in Engine.__init__
    line: self._mp_manager = torch.multiprocessing.Manager()
    locals:
      self = <local> <returnn.torch.engine.Engine object at 0x7fd79384aa40>
      self._mp_manager = <local> !AttributeError: 'Engine' object has no attribute '_mp_manager'
      torch = <global> <module 'torch' from '/usr/local/lib/python3.10/dist-packages/torch/__init__.py'>
      torch.multiprocessing = <global> <module 'torch.multiprocessing' from '/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/__init__.py'>
      torch.multiprocessing.Manager = <global> <bound method BaseContext.Manager of <multiprocessing.context.DefaultContext object at 0x7fd7a1d772e0>>
  File "/usr/lib/python3.10/multiprocessing/context.py", line 57, in BaseContext.Manager
    line: m.start()
    locals:
      m = <local> <multiprocessing.managers.SyncManager object at 0x7fd88b56d930>
      m.start = <local> <bound method BaseManager.start of <multiprocessing.managers.SyncManager object at 0x7fd88b56d930>>
  File "/usr/lib/python3.10/multiprocessing/managers.py", line 566, in BaseManager.start
    line: self._address = reader.recv()
    locals:
      self = <local> <multiprocessing.managers.SyncManager object at 0x7fd88b56d930>
      self._address = <local> None
      reader = <local> <multiprocessing.connection.Connection object at 0x7fd88b56fb80>
      reader.recv = <local> <bound method _ConnectionBase.recv of <multiprocessing.connection.Connection object at 0x7fd88b56fb80>>
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 250, in _ConnectionBase.recv
    line: buf = self._recv_bytes()
    locals:
      buf = <not found>
      self = <local> <multiprocessing.connection.Connection object at 0x7fd88b56fb80>
      self._recv_bytes = <local> <bound method Connection._recv_bytes of <multiprocessing.connection.Connection object at 0x7fd88b56fb80>>
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 414, in Connection._recv_bytes
    line: buf = self._recv(4)
    locals:
      buf = <not found>
      self = <local> <multiprocessing.connection.Connection object at 0x7fd88b56fb80>
      self._recv = <local> <bound method Connection._recv of <multiprocessing.connection.Connection object at 0x7fd88b56fb80>>
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 383, in Connection._recv
    line: raise EOFError
    locals:
      EOFError = <builtin> <class 'EOFError'>
EOFError
Unhandled exception <class 'EOFError'> in thread <_MainThread(MainThread, started 139836318601664)>, proc 96015.
torch.distributed
E0703 09:18:03.509000 139702419329472 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 96008) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 883, in <module>
    main()
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/nas/models/asr/am/EN/16kHz/20240625_wmichel_en_model_training/work/i6_core/tools/git/CloneGitRepositoryJob.h2Y3KqeU0C4d/output/returnn/rnn.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-07-03_09:18:03
  host      : g-28.apptek.local
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 96009)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-07-03_09:18:03
  host      : g-28.apptek.local
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 96010)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-07-03_09:18:03
  host      : g-28.apptek.local
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 96011)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
  time      : 2024-07-03_09:18:03
  host      : g-28.apptek.local
  rank      : 4 (local_rank: 4)
  exitcode  : 1 (pid: 96012)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
  time      : 2024-07-03_09:18:03
  host      : g-28.apptek.local
  rank      : 5 (local_rank: 5)
  exitcode  : 1 (pid: 96013)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
  time      : 2024-07-03_09:18:03
  host      : g-28.apptek.local
  rank      : 6 (local_rank: 6)
  exitcode  : 1 (pid: 96014)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[7]:
  time      : 2024-07-03_09:18:03
  host      : g-28.apptek.local
  rank      : 7 (local_rank: 7)
  exitcode  : 1 (pid: 96015)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-07-03_09:18:03
  host      : g-28.apptek.local
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 96008)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

EDIT: fixed torch.distributed trace

@albertz

This comment was marked as resolved.

@albertz
Copy link
Member

albertz commented Jul 3, 2024

From your logs, it does not seem like the error is in _TouchFilesThread.run?

It seems the error occurred in some sub proc. Can you give details on what sub proc this is?

You posted the stack of _TouchFilesThread, but it seems all fine? This one is not crashed and just idling?

But the main thread of the RETURNN main proc crashed, specifically in Torch Engine.__init__, in this line:

self._mp_manager = torch.multiprocessing.Manager()

So, this looks very much outside of the scope of RETURNN? This is either pure Python related or PyTorch related (but I assume just pure Python; torch.multiprocessing is basically just the standard multiprocessing module).

@albertz
Copy link
Member

albertz commented Jul 3, 2024

I assume some bad environment configuration. Maybe weird ulimits or so. Or wrong tmp path. I wonder a bit where it tries to bind the socket (self._socket.bind(address)), i.e. what is address here? Can you get those local vars?

@NeoLegends
Copy link
Collaborator

NeoLegends commented Jul 3, 2024

This is the relevant code from the stack trace where the read fails: https://github.com/python/cpython/blob/26d24eeb90d781e381b97d64b4dcb1ee4dd891fe/Lib/multiprocessing/managers.py#L554-L570

The read goes over a connection.Pipe(), which communicates over unix sockets: https://github.com/python/cpython/blob/26d24eeb90d781e381b97d64b4dcb1ee4dd891fe/Lib/multiprocessing/connection.py#L552-L556

and the address comes from a temp dir: https://github.com/python/cpython/blob/26d24eeb90d781e381b97d64b4dcb1ee4dd891fe/Lib/multiprocessing/connection.py#L70-L82

which is created like this: https://github.com/python/cpython/blob/26d24eeb90d781e381b97d64b4dcb1ee4dd891fe/Lib/multiprocessing/util.py#L133-L145

For my account on the node the temp dir path does not look excessive, so maybe some strange user config @michelwi? Apparently what's failing here is the child process trying to bind to that socket.

@albertz
Copy link
Member

albertz commented Jul 3, 2024

Note, the temp dir logic of Python, i.e. where it would create those temp files/dirs:

def _candidate_tempdir_list():
    """Generate a list of candidate temporary directories which
    _get_default_tempdir will try."""

    dirlist = []

    # First, try the environment.
    for envname in 'TMPDIR', 'TEMP', 'TMP':
        dirname = _os.getenv(envname)
        if dirname: dirlist.append(dirname)

    # Failing that, try OS-specific locations.
    if _os.name == 'nt':
        dirlist.extend([ _os.path.expanduser(r'~\AppData\Local\Temp'),
                         _os.path.expandvars(r'%SYSTEMROOT%\Temp'),
                         r'c:\temp', r'c:\tmp', r'\temp', r'\tmp' ])
    else:
        dirlist.extend([ '/tmp', '/var/tmp', '/usr/tmp' ])

    # As a last resort, the current directory.
    try:
        dirlist.append(_os.getcwd())
    except (AttributeError, OSError):
        dirlist.append(_os.curdir)

    return dirlist

So, is TMPDIR (or some of the other) env var set? Where does it point to? If not, you should set it, and prefer /var/tmp.

@albertz
Copy link
Member

albertz commented Jul 3, 2024

But in any case, what is address actually? So we just know, and don't need to assume.

@albertz albertz changed the title _TouchFilesThread: OSError: AF_UNIX path too long multiprocessing: OSError: AF_UNIX path too long Jul 3, 2024
@NeoLegends
Copy link
Collaborator

NeoLegends commented Jul 3, 2024

I just got bitten by the same error in a training not using the new dataset or caching mechanism.

@albertz
Copy link
Member

albertz commented Jul 3, 2024

I just got bitten by the same error in a training not using the new dataset or caching mechanism.

So what is address in your case?

@michelwi
Copy link
Collaborator Author

michelwi commented Jul 4, 2024

what is address actually?

sorry I could not figure it out. Then my node was rebooted and the error is gone again.
As I cannot reproduce it any more it would be fine by me to close this issue for now. But if @NeoLegends wants to keep debugging we can keep it.

@NeoLegends
Copy link
Collaborator

Got the address. It seems reproducible for me on g-16. I don't know the root cause yet why it's behaving strangely.

tempdir._get_default_tempdir returns the project folder, i.e. /home/mgunz/setups/2024-06-24--[redacted], leading to a path like:

Initializing SocketListener with /nas/models/asr/mgunz/2024-06-24--[redacted]/i6_core/returnn/training/ReturnnTrainingJob.mFRyGSpd0CuH/work/pymp-zwqk0y_y/listener-kxe8tmti
Process SyncManager-1:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.10/multiprocessing/managers.py", line 591, in _run_server
    server = cls._Server(registry, address, authkey, serializer)
  File "/usr/lib/python3.10/multiprocessing/managers.py", line 156, in __init__
    self.listener = Listener(address=address, backlog=16)
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 448, in __init__
    self._listener = SocketListener(address, family, backlog)
  File "/home/mgunz/setups/2024-06-24--[redacted]/work/i6_core/returnn/training/ReturnnTrainingJob.mFRyGSpd0CuH/output/returnn.config", line 12, in print_addr
    return orig_init(self, address, *args, **kwargs)
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 591, in __init__
    self._socket.bind(address)
OSError: AF_UNIX path too long

>>> _candidate_tempdir_list()
['/tmp', '/tmp', '/var/tmp', '/usr/tmp', '/home/mgunz/setups/2024-06-24--[redacted]']

I'm not sure why it won't take e.g. /var/tmp. I checked from the container, it is writable for me.

>>> os.open("/var/tmp/test2", os.O_RDWR | os.O_CREAT | os.O_EXCL)
3

Or this is something wrt. the child processes lacking some rights the parent process has?

I'm not sure what this is yet.

@albertz
Copy link
Member

albertz commented Jul 4, 2024

tempdir._get_default_tempdir returns the project folder, i.e. /home/mgunz/setups/2024-06-24--[redacted]

Yea that's a problem. It should definitely not use that.

>>> _candidate_tempdir_list()
['/tmp', '/tmp', '/var/tmp', '/usr/tmp', '/home/mgunz/setups/2024-06-24--[redacted]']

I'm not sure why it won't take e.g. /var/tmp.

Yea that's strange. We should find out. Eg. just step-by-step debug through it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants