Code hangs on training #14

Open
egaebel opened this issue Aug 14, 2018 · 7 comments

@egaebel

egaebel commented Aug 14, 2018

I'm reading multiple datasets from a single file, and after a certain number of iterations the code hangs indefinitely (I let it run overnight just to be absolutely certain). I have to Ctrl+C out of it, and then I get the following traceback. Looks like a hang in multitables somewhere? Maybe from the queue not being populated quickly enough?

Traceback (most recent call last):
  File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/workspace-mount/Programs/tachotron2-implementations/barronalex-tachotron/lib/python3.5/site-packages/multitables.py", line 389, in _Streamer__read_process
    with sync.do(cbuf.put_direct(), i, (i+read_size) % len(ary)) as put_ary:
  File "/workspace-mount/Programs/tachotron2-implementations/barronalex-tachotron/lib/python3.5/site-packages/multitables.py", line 136, in __enter__
    with self.sync.barrier_in.wait(*self.index):
  File "/workspace-mount/Programs/tachotron2-implementations/barronalex-tachotron/lib/python3.5/site-packages/multitables.py", line 87, in __enter__
    self.sync.cvar.wait()
  File "/usr/lib/python3.5/multiprocessing/synchronize.py", line 262, in wait
    return self._wait_semaphore.acquire(True, timeout)
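
Editor's note: a hang like this can also be caught without waiting overnight and pressing Ctrl+C. The sketch below uses Python's standard `faulthandler` module as a watchdog that is re-armed on every training step and dumps all thread stacks only when a step stalls. The helper names `arm_watchdog` and `disarm_watchdog` and the timeout value are hypothetical and not part of multitables or tftables.

```python
import faulthandler
import sys


def arm_watchdog(timeout_s, out=None):
    """(Re)start a timer that dumps every thread's stack if not re-armed in time."""
    # A second call to dump_traceback_later replaces the previous timer, so
    # calling this once per training step only fires when a step stalls.
    faulthandler.dump_traceback_later(
        timeout_s, repeat=False, file=out if out is not None else sys.stderr
    )


def disarm_watchdog():
    """Cancel the pending dump, e.g. after the training loop exits."""
    faulthandler.cancel_dump_traceback_later()
```

Calling `arm_watchdog(600)` at the top of each training step and `disarm_watchdog()` after the loop would be one way to use it. It only reports on the process that calls it; the reader processes seen in the traceback above run separately and would need their own instrumentation.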

@egaebel
Author

egaebel commented Aug 14, 2018

I think it occurs when the end of the dataset is reached. I have cyclic=True set on get_batch though...
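
Editor's note: the frame in the traceback above passes `i` and `(i+read_size) % len(ary)` as the two indices of a write, so the second index wraps around at the end of the array. The sketch below only illustrates that arithmetic; it is not the library's code, `block_bounds` is a hypothetical helper, and the block size of 1000 is an assumption (the row count is the one reported later in this thread).

```python
def block_bounds(n_rows, read_size):
    """Yield (start, end) index pairs for one pass, with `end` wrapped modulo n_rows."""
    for start in range(0, n_rows, read_size):
        yield start, (start + read_size) % n_rows


n_rows = 68933      # row count reported later in this thread
read_size = 1000    # hypothetical block size
bounds = list(block_bounds(n_rows, read_size))

# Every block except the last has end > start; the last one wraps around.
wrapped = [(s, e) for s, e in bounds if e <= s]
print(wrapped)
```

Under these assumptions only the final block has an end index that is not greater than its start index, which would be consistent with the problem appearing when the end of the dataset is reached. Whether this is the actual cause of the hang is not established here.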

@ghcollin
Owner

It seems to be stuck waiting to write to the internal queue. Maybe there is an issue with ordered access? Can you try setting ordered=False and see if it still hangs? (This will result in corrupted batches if you're splicing multiple datasets to create each training example, but it might help narrow things down.)
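
Editor's note: to make the corruption concrete, here is a small pure-Python simulation of splicing two datasets whose blocks arrive either in file order or in arbitrary order. It does not use multitables; `read_blocks`, `splice`, and the `arrival_order` argument are hypothetical stand-ins for parallel readers finishing out of order.

```python
def read_blocks(rows, block_size, arrival_order=None):
    """Split `rows` into blocks and return them in the order they 'arrive'."""
    blocks = [rows[i:i + block_size] for i in range(0, len(rows), block_size)]
    if arrival_order is None:
        return blocks                          # ordered: file order is preserved
    return [blocks[k] for k in arrival_order]  # unordered: arbitrary arrival order


def splice(feature_blocks, label_blocks):
    """Pair rows positionally, the way a training example is built from two datasets."""
    return [
        (f, l)
        for fb, lb in zip(feature_blocks, label_blocks)
        for f, l in zip(fb, lb)
    ]


features = list(range(6))
labels = list(range(6))  # label i belongs with feature i

ordered = splice(read_blocks(features, 2), read_blocks(labels, 2))
unordered = splice(
    read_blocks(features, 2),
    read_blocks(labels, 2, arrival_order=[2, 0, 1]),
)

print(sum(f != l for f, l in ordered))    # mismatched pairs with ordered reads
print(sum(f != l for f, l in unordered))  # mismatched pairs with unordered reads
```

With ordered reads every feature row is paired with its own label. Once the two datasets' blocks arrive in different orders, positional pairing silently mixes rows from different examples.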

@egaebel
Author

egaebel commented Aug 15, 2018

Thanks for your quick response!

This indeed makes the hanging go away, but I am doing some splicing with datasets.

@ghcollin
Owner

If you run the multitables unit tests (https://github.com/ghcollin/multitables/blob/master/multitables_test.py), do they complete properly? Also, how large is your dataset (how many rows)?

@egaebel
Author

egaebel commented Sep 12, 2018

Hey, sorry for the long silence. I ended up changing my dataset to all be in one table, and everything is working fine now.

However, the multitables unit test hangs as well.
My dataset has 68933 rows.

@egaebel
Author

egaebel commented Sep 16, 2018

So actually, with my one-table approach, when I set the reader's ordered=False it freezes, but when ordered=True it does not freeze. Very odd...

Any tips on how I can get the multitables unit tests to run? Seems like that is probably the same thing...
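
Editor's note: when a test script hangs, running it under a timeout at least turns the hang into a reportable result. A minimal sketch using only the standard library, assuming `multitables_test.py` has been downloaded into the current directory; `run_with_timeout` and the 300-second limit are hypothetical choices.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run `cmd`; return its exit code, or None if it was killed after `timeout_s`."""
    try:
        return subprocess.run(cmd, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before re-raising.
        return None


# Example call (not executed here):
#   code = run_with_timeout([sys.executable, "multitables_test.py"], timeout_s=300)
#   print("timed out" if code is None else "exit code %d" % code)
```

Only the direct child is killed on timeout, so any reader processes it started may need to be cleaned up separately.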

@sullivan-sean

I'm also running into the same issue.

I'm training on multiple datasets with ordered=True, as recommended, but this results in the following error:

Traceback (most recent call last):
  File "train.py", line 225, in <module>
    tf.app.run()
  File "/usr/lib/python3.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "train.py", line 221, in main
    train()
  File "train.py", line 216, in train
    filename)
  File "/usr/lib/python3.7/contextlib.py", line 130, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/lib/python3.7/site-packages/tftables-1.1.2-py3.7.egg/tftables.py", line 552, in begin
  File "/usr/lib/python3.7/site-packages/tftables-1.1.2-py3.7.egg/tftables.py", line 521, in stop
  File "/usr/lib/python3.7/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/usr/lib/python3.7/site-packages/six.py", line 692, in reraise
    raise value.with_traceback(tb)
  File "/usr/lib/python3.7/site-packages/tensorflow/python/training/coordinator.py", line 297, in stop_on_exception
    yield
  File "/usr/lib/python3.7/site-packages/tftables-1.1.2-py3.7.egg/tftables.py", line 473, in __read_thread
  File "/usr/lib/python3.7/site-packages/tftables-1.1.2-py3.7.egg/tftables.py", line 393, in feed
  File "/usr/lib/python3.7/site-packages/tftables-1.1.2-py3.7.egg/tftables.py", line 293, in read_batch
  File "/usr/lib/python3.7/site-packages/multitables-1.1.1-py3.7.egg/multitables.py", line 270, in __enter__
    return self.arys[self.idx]
IndexError: list index out of range

When I run the same code with ordered=False, the code runs as expected but with corrupted batches.
