Code hangs on training #14

egaebel · 2018-08-14T05:48:42Z

I'm reading multiple datasets from a single file and after a certain number of iterations the code hangs indefinitely (I let it go overnight just to be absolutely certain). I have to ctrl+C out of it and I get the following exception. Looks like a hang in multitables somewhere? Maybe from the queue not being populated quickly enough?

Traceback (most recent call last):
  File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/workspace-mount/Programs/tachotron2-implementations/barronalex-tachotron/lib/python3.5/site-packages/multitables.py", line 389, in _Streamer__read_process
    with sync.do(cbuf.put_direct(), i, (i+read_size) % len(ary)) as put_ary:
  File "/workspace-mount/Programs/tachotron2-implementations/barronalex-tachotron/lib/python3.5/site-packages/multitables.py", line 136, in __enter__
    with self.sync.barrier_in.wait(*self.index):
  File "/workspace-mount/Programs/tachotron2-implementations/barronalex-tachotron/lib/python3.5/site-packages/multitables.py", line 87, in __enter__
    self.sync.cvar.wait()
  File "/usr/lib/python3.5/multiprocessing/synchronize.py", line 262, in wait
    return self._wait_semaphore.acquire(True, timeout)

egaebel · 2018-08-14T06:00:24Z

I think it occurs when the end of the dataset is reached. I have cyclic=True set on get_batch though...
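
For reference, here is a minimal sketch of what a cyclic read is expected to do when it runs past the end of the data. It is not the multitables implementation: `cyclic_slice`, `rows`, `start`, and `read_size` are hypothetical names, and only the wrap-around expression mirrors the `(i+read_size) % len(ary)` seen in the traceback above.

```python
def cyclic_slice(rows, start, read_size):
    """Return read_size rows beginning at start, wrapping to the front at the end.

    Illustrative only; assumes 0 < read_size <= len(rows).
    """
    n = len(rows)
    end = (start + read_size) % n  # same wrap-around arithmetic as in the traceback
    if start < end:
        # The read fits before the end of the data.
        return rows[start:end]
    # The read crosses the end: take the tail, then continue from the front.
    return rows[start:] + rows[:end]


rows = list(range(10))
wrapped = cyclic_slice(rows, 8, 4)   # crosses the end of the data
plain = cyclic_slice(rows, 2, 4)     # stays inside the data
```

If the hang only appears on the wrapped read, that would be consistent with the end-of-dataset theory, though this sketch cannot confirm it.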

ghcollin · 2018-08-14T22:23:24Z

It seems to be stuck waiting to write to the internal queue; maybe there is an issue with ordered access? Can you try setting ordered=False and see if it still hangs? (This will result in corrupted batches if you're splicing multiple datasets to create each training example, but it might help narrow things down.)
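
To illustrate why unordered access can corrupt spliced batches, here is a small pure-Python simulation. It does not use multitables at all: `read_blocks`, `splice`, and the block orders are made up for the example, under the assumption that with unordered access each dataset's blocks may arrive in a different order.

```python
def read_blocks(rows, block_size, block_order):
    """Yield rows block by block, in the given order of block indices."""
    blocks = [rows[i:i + block_size] for i in range(0, len(rows), block_size)]
    for index in block_order:
        yield from blocks[index]


def splice(features, labels, feature_order, label_order, block_size=2):
    """Pair up rows from two streams in the order they arrive."""
    f = read_blocks(features, block_size, feature_order)
    l = read_blocks(labels, block_size, label_order)
    return list(zip(f, l))


features = list(range(8))
labels = [x * 10 for x in features]  # row k of labels belongs to row k of features

# Ordered access: both streams deliver blocks 0, 1, 2, 3.
ordered_pairs = splice(features, labels, [0, 1, 2, 3], [0, 1, 2, 3])

# Unordered access (assumed): each stream delivers its blocks in its own order.
unordered_pairs = splice(features, labels, [2, 0, 3, 1], [0, 3, 1, 2])

mismatched = [(f, l) for f, l in unordered_pairs if l != f * 10]
```

With matching block orders every feature row meets its own label; with independent orders the rows are still all delivered, but paired with the wrong partner.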

egaebel · 2018-08-15T02:22:44Z

Thanks for your quick response!

This indeed makes the hanging go away, but I am doing some splicing with datasets.

ghcollin · 2018-08-16T18:42:34Z

If you run the multitables unit tests (https://github.com/ghcollin/multitables/blob/master/multitables_test.py), do they complete properly? Also, how large are your datasets, and how many rows does each have?

egaebel · 2018-09-12T13:46:37Z

Hey sorry for the long silence, I ended up changing my dataset to all be in one table and everything is working fine now.

However, the multitables unit test hangs as well.
My dataset has 68933 rows.

egaebel · 2018-09-16T20:43:50Z

So actually, with my one table approach when I set the reader's ordered=False it freezes, but when ordered=True it does not freeze. Very odd...

Any tips on how I can get the multitables unit tests to run? Seems like that is probably the same thing...
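
One way to at least bound the hang while investigating is to run the test file in a child process with a timeout. This is only a sketch using the standard library: `run_with_timeout` is a hypothetical helper, and the usage comment assumes you are in a checkout that contains the `multitables_test.py` file linked earlier in this thread. It does not fix the hang; it just reports it instead of blocking forever.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd in a child process.

    Returns (finished, returncode); finished is False if the command was
    still running after timeout_s seconds and had to be killed.
    """
    try:
        result = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising TimeoutExpired.
        return False, None
    return True, result.returncode


# Example usage (assumes the current directory contains multitables_test.py):
#   finished, code = run_with_timeout([sys.executable, "multitables_test.py"], 300)
#   print("completed" if finished else "hung", code)
```

Running the tests one at a time in this way may help narrow down which one stalls.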

sullivan-sean · 2019-04-09T21:40:03Z

I'm also running into the same issue.

I'm training on multiple datasets with ordered=True as recommended; however, this results in the following error:

Traceback (most recent call last):
  File "train.py", line 225, in <module>
    tf.app.run()
  File "/usr/lib/python3.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "train.py", line 221, in main
    train()
  File "train.py", line 216, in train
    filename)
  File "/usr/lib/python3.7/contextlib.py", line 130, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/lib/python3.7/site-packages/tftables-1.1.2-py3.7.egg/tftables.py", line 552, in begin
  File "/usr/lib/python3.7/site-packages/tftables-1.1.2-py3.7.egg/tftables.py", line 521, in stop
  File "/usr/lib/python3.7/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/usr/lib/python3.7/site-packages/six.py", line 692, in reraise
    raise value.with_traceback(tb)
  File "/usr/lib/python3.7/site-packages/tensorflow/python/training/coordinator.py", line 297, in stop_on_exception
    yield
  File "/usr/lib/python3.7/site-packages/tftables-1.1.2-py3.7.egg/tftables.py", line 473, in __read_thread
  File "/usr/lib/python3.7/site-packages/tftables-1.1.2-py3.7.egg/tftables.py", line 393, in feed
  File "/usr/lib/python3.7/site-packages/tftables-1.1.2-py3.7.egg/tftables.py", line 293, in read_batch
  File "/usr/lib/python3.7/site-packages/multitables-1.1.1-py3.7.egg/multitables.py", line 270, in __enter__
    return self.arys[self.idx]
IndexError: list index out of range

When I run the same code with ordered=False, the code runs as expected but with corrupted batches.
