
Suggestions to tackle this MemoryError #6

Open

hipoglucido opened this issue Jun 27, 2016 · 8 comments


hipoglucido commented Jun 27, 2016

Hello, I am using ami-125b2c72 (g2.2xlarge) at the spot price, as you suggested in another issue (thanks a lot). After struggling a bit with the CUDA drivers I finally got some epochs to run, and I am able to save and load all the training files from S3. I have 1441135 examples. I trained one epoch, saved the weights, stopped the instance, re-ran the script, loaded the weights, trained one more epoch, and then it crashed with the output below. I wonder if you, @udibr, could give me some ideas or intuitions about what the problem is. One of my questions: is the MemoryError about regular RAM or about GPU memory? (Maybe I could use another instance type.) I also got the warning about the epoch comprising more than samples_per_epoch samples, but I am not sure whether I should do anything about it.

```
ubuntu@ip-xxxxxx:~/auris$ python train2.py
Using Theano backend.
Using gpu device 0: GRID K520 (CNMeM is enabled with initial size: 95.0% of memory, cuDNN 4007)
READING WORD EMBEDDING
('/home/ubuntu/auris//en3_vocabulary-embedding.pkl', ' already downloaded')
('/home/ubuntu/auris//en3_vocabulary-embedding.data.pkl', ' already downloaded')
number of examples 1441135 1441135
dimension of embedding space for words 100
vocabulary size 40000 the last 10 words can be used as place holders for unknown/oov words
total number of different words 1481094 1481094
number of words outside vocabulary which we can substitue using glove similarity 208580
number of words that will be regarded as unknonw(unk)/out-of-vocabulary(oov) 1232514
H: Vuwani schools damage prompts call for new law to punish vandals
D: Department officials on Tuesday briefed MPs on the recovery plans for the protest-ravaged^ Vuwani area in Limpopo . Earlier in May Vuwani residents protested against the creation of a new municipality which will include Malamulele^ residents . The violent protests led to the torching of several schools and other public amenities in the area which disrupted classes .
H: Kathmandu Post- Mitsubishi Motors admits cheating fuel tests since 1991
D: Mitsubishi 's eK^ Wagon^ was one of the models affected Reuters Apr 26 , 2016- Mitsubishi Motors has admitted to falsifying some fuel consumption tests since 1991 . The admission follows last week 's revelation that it had falsified fuel economy data for more than 600,000 vehicles sold in Japan . 'For the domestic market , we have been using that method since 1991 , ' said vice-president Ryugo^ Nakao^ at a press conference in Tokyo on Tuesday .
MODEL
0 cls=Embedding name=embedding_1
40000x100
1 cls=LSTM name=lstm_1
100x512 512x512 512 100x512 512x512 512 100x512 512x512 512 100x512 512x512 512
2 cls=Dropout name=dropout_1

3 cls=LSTM name=lstm_2
512x512 512x512 512 512x512 512x512 512 512x512 512x512 512 512x512 512x512 512
4 cls=Dropout name=dropout_2

5 cls=LSTM name=lstm_3
512x512 512x512 512 512x512 512x512 512 512x512 512x512 512 512x512 512x512 512
6 cls=Dropout name=dropout_3

7 cls=SimpleContext name=simplecontext_1

8 cls=TimeDistributed name=timedistributed_1
944x40000 40000
9 cls=Activation name=activation_1

LOAD
downloading train.hdf5 to /home/ubuntu/auris/train.hdf5:
downloaded /home/ubuntu/auris/train.hdf5
Weights downloaded
TEST
....
....


H: ~ Kate Beckinsale and teenage daughter text naked pictures of Michael Sheen to each other to <0>^ themselves up ' _ _ _ _ _
D: the Port Talbot actor to each other . The underworld star , who lives in the States with 17-year-old Lily , made the odd revelation
TRAIN
Iteration 0
Epoch 1/1
29952/30000 [============================>.] - ETA: 1s - loss: 7.7543/home/ubuntu/anaconda2/lib/python2.7/site-packages/keras/engine/training.py:1403: UserWarning: Epoch comprised more than 'samples_per_epoch' samples, which might affect learning results. Set 'samples_per_epoch' correctly to avoid this warning.
30016/30000 [==============================] - 746s - loss: 7.7548 - val_loss: 7.8132
('Uploaded ', '/home/ubuntu/auris/train.history.pkl', ' succesfully')
('Uploaded ', '/home/ubuntu/auris/train.hdf5', ' succesfully')
HEAD: A Python^ Bit^ A Man 's P
DESC: The man fought to remove
HEADS:
34.5337700502 Syrian , Attaporn^ Attaporn^ at , to
43.0280796466 Former wife.She^ : for hour.Eventually^ wife.She^ to in wife.She^ wife.She^
Iteration 1
Epoch 1/1
29952/30000 [============================>.] - ETA: 1s - loss: 7.7520Exception in thread Thread-4:
Traceback (most recent call last):
File "/home/ubuntu/anaconda2/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/home/ubuntu/anaconda2/lib/python2.7/threading.py", line 754, in run
self.__target(_self.__args, *_self.__kwargs)
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/keras/engine/training.py", line 404, in data_generator_task
generator_output = next(generator)
File "train2.py", line 498, in gen
yield conv_seq_labels(xds, xhs, nflips=nflips, model=model, debug=debug)
File "train2.py", line 459, in conv_seq_labels
y = np.zeros((batch_size, maxlenh, vocab_size))
MemoryError

Traceback (most recent call last):
File "train2.py", line 538, in
nb_epoch=1, validation_data=valgen, nb_val_samples=nb_val_samples
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/keras/models.py", line 656, in fit_generator
max_q_size=max_q_size)
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/keras/engine/training.py", line 1412, in fit_generator
max_q_size=max_q_size)
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/keras/engine/training.py", line 1474, in evaluate_generator
'or (x, y). Found: ' + str(generator_output))
Exception: output of generator should be a tuple (x, y, sample_weight) or (x, y). Found: None
```
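For orientation, the MODEL printout in the log corresponds roughly to the Keras 1.x stack sketched below. This is a sketch only: SimpleContext is a custom layer defined in this repo's train.py and is merely stubbed here, and maxlend, maxlenh, and the dropout rate are assumptions rather than values taken from the log.

```python
# Rough sketch of the layer stack implied by the MODEL printout (Keras 1.x API).
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dropout, Dense, Activation, TimeDistributed

vocab_size, embedding_dim, rnn_size = 40000, 100, 512  # from the printout
maxlend, maxlenh = 25, 25                              # assumption
maxlen = maxlend + maxlenh

model = Sequential()
model.add(Embedding(vocab_size, embedding_dim, input_length=maxlen))
for _ in range(3):                    # lstm_1..lstm_3 with dropout_1..dropout_3
    model.add(LSTM(rnn_size, return_sequences=True))
    model.add(Dropout(0.5))           # rate is an assumption
# model.add(SimpleContext())          # custom layer from train.py; its output
#                                     # dim (944) explains the 944x40000 Dense
model.add(TimeDistributed(Dense(vocab_size)))
model.add(Activation('softmax'))
```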
And as always, thanks for giving us the opportunity to use state-of-the-art machine learning techniques in our own projects :)


udibr commented Jun 28, 2016

It looks like you ran out of RAM on your Amazon instance. You can open a terminal and use htop to watch it happen while the code is running.
As a check, reduce your data set to half its size and see if that solves the problem.
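As a complement to htop, here is a minimal sketch for logging the process's resident memory from inside train.py, e.g. around the conv_seq_labels call. It assumes psutil is installed, which is not a dependency of this repo:

```python
import os
import psutil  # assumption: pip install psutil

_proc = psutil.Process(os.getpid())

def log_rss(tag=''):
    # Resident set size of the current Python process, in GiB.
    print(tag, round(_proc.memory_info().rss / 2**30, 2), 'GiB')

log_rss('before batch')
```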


hipoglucido commented Jun 29, 2016

Hello @udibr, you are right: I ran out of RAM on my instance (15 GB). I reduced the dataset to half its size (750K) but got exactly the same error at the same point. It always happens at the end of the 2nd iteration (epoch). The line is
model.fit_generator(traingen, samples_per_epoch=nb_train_samples, nb_epoch=1, validation_data=valgen, nb_val_samples=nb_val_samples)
when the progress bar looks like this:
29952/30000 [============================>.]
It seems there is a peak in RAM usage at the end of the execution of that line (does that make sense to you?), but somehow the script survives the end of the first epoch and dies at the end of the second. (I have also checked that the upload of history.pkl and train.hdf5 to S3 is not the problem.)
I tried the script on a g2.8xlarge instance (60 GB RAM) and it doesn't crash. However, I had to shut it down because it is much more expensive than the g2.2xlarge (15 GB RAM) and I can't afford it.
I would like to understand why RAM consumption grows with the dataset size; I expected it to stay more or less the same, just train more slowly. Is it because of the vocabulary size? My plan is to train a model on a ~7M-article dataset I've collected, and I hope there is a way to do that without the expensive g2.8xlarge instance type.
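For what it's worth, the peak in the traceback does not come from holding the dataset; it comes from the one-hot label array allocated per batch in conv_seq_labels (the np.zeros line in the traceback), multiplied by the generator queue. A back-of-the-envelope sketch, where vocab_size=40000 is from the log but batch_size=64, maxlenh=25, and max_q_size=10 are assumptions based on the repo's defaults:

```python
batch_size, maxlenh, vocab_size = 64, 25, 40000  # batch_size/maxlenh assumed

# y = np.zeros((batch_size, maxlenh, vocab_size)) allocates float64 by default
one_batch = batch_size * maxlenh * vocab_size * 8
print(one_batch / 2**20, 'MiB per batch')          # ~488 MiB

# fit_generator pre-queues up to max_q_size batches in background threads,
# so the transient peak can be roughly max_q_size times larger:
max_q_size = 10
print(max_q_size * one_batch / 2**30, 'GiB peak')  # ~4.8 GiB
```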

For now I am going to run it again with 150K samples and see what happens.

Thanks 👍 👍

EDIT: with 150K samples I am getting the same error. I have also tried setting nb_train_samples=15000 and nb_val_samples=1500 and still get the same. I am confused :\


udibr commented Jun 29, 2016

Try training with nflips=0; it may help.


hipoglucido commented Jun 29, 2016

Ok. You mean
traingen = gen(X_train, Y_train, batch_size=batch_size, nflips=0, model=model)
right?

Thanks


udibr commented Jun 29, 2016

That's OK, or change the value assigned to nflips in cell 9 of train.ipynb.


hipoglucido commented Jun 30, 2016

Thanks, I tried it but unfortunately it crashed in the second iteration again. This time, however, it did not crash at the end of the epoch:

```
TRAIN
Iteration 0
Epoch 1/1
29952/30000 [============================>.] - ETA: 0s - loss: 7.4030/home/ubuntu/anaconda2/lib/python2.7/site-packages/keras/engine/training.py:1432: UserWarning: Epoch comprised more than 'samples_per_epoch' samples, which might affect learning results. Set 'samples_per_epoch' correctly to avoid this warning.
30016/30000 [==============================] - 532s - loss: 7.4035 - val_loss: 7.6468
Building historic...
... historic built
('Writting historic in ', '/home/ubuntu/auris/train.history.pkl', '...')
('Uploaded ', '/home/ubuntu/auris/train.history.pkl', ' succesfully')
('... historic saved in ', '/home/ubuntu/auris/train.history.pkl')
('Saving model weights in ', '/home/ubuntu/auris/train.hdf5', '...')
('Uploaded ', '/home/ubuntu/auris/train.hdf5', ' succesfully')
('... model weights saved in ', '/home/ubuntu/auris/train.hdf5')
Generating samples...
HEAD: Police bar tatooed^ peopl
DESC: A bizarre video has emerg
HEADS:
37.3261257818 The of of for after in after and
... samples generated
Iteration 1
Epoch 1/1
384/30000 [..............................] - ETA: 531s - loss: 7.5555Exception in thread Thread-3:
Traceback (most recent call last):
File "/home/ubuntu/anaconda2/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/home/ubuntu/anaconda2/lib/python2.7/threading.py", line 754, in run
self.__target(_self.__args, *_self.__kwargs)
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/keras/engine/training.py", line 416, in data_generator_task
generator_output = next(generator)
File "train.py", line 507, in gen
yield conv_seq_labels(xds, xhs, nflips=nflips, model=model, debug=debug)
File "train.py", line 468, in conv_seq_labels
y = np.zeros((batch_size, maxlenh, vocab_size))
MemoryError
448/30000 [..............................] - ETA: 528s - loss: 7.5818Traceback (most recent call last):
File "train.py", line 548, in
nb_epoch=1, validation_data=valgen, nb_val_samples=nb_val_samples
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/keras/models.py", line 661, in fit_generator
max_q_size=max_q_size)
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/keras/engine/training.py", line 1387, in fit_generator
'or (x, y). Found: ' + str(generator_output))
Exception: output of generator should be a tuple (x, y, sample_weight) or (x, y). Found: None
```

I am going to try it with the TensorFlow backend. Do you think that may be a good idea?
If it doesn't resolve the memory problem, maybe it would be worth using a g2.8xlarge (4 GPUs), since TensorFlow supports multi-GPU training (Theano doesn't, yet). It is much more expensive, but it should train faster...


hipoglucido commented Jun 30, 2016

I set batch_size=32 and it stopped complaining (using Theano)!

Thanks a lot for your help @udibr :)
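That fix is consistent with the arithmetic above: halving batch_size halves the per-batch label array (~244 MiB instead of ~488 MiB under the same assumptions). A hedged alternative sketch that keeps a larger batch: allocate the one-hot target in conv_seq_labels as float32, since with the usual GPU setup Theano's floatX is float32 anyway, so the float64 default of np.zeros only wastes RAM:

```python
import numpy as np

batch_size, maxlenh, vocab_size = 64, 25, 40000  # maxlenh is an assumption

# In conv_seq_labels (train.py), a float32 target halves the allocation
# relative to the float64 default of np.zeros:
y = np.zeros((batch_size, maxlenh, vocab_size), dtype=np.float32)
print(y.nbytes / 2**20, 'MiB')  # ~244 MiB vs ~488 MiB at float64
```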


udibr commented Jun 30, 2016

I'm using Theano, so there could be TF-related bugs that I have not noticed.
I found that Theano runs about 5 times faster than TF on this problem, and all you need to do is install Theano (and not set Keras to use TF).
