How to train on limited RAM when I have >10 million documents #273
Additionally, what is the difference between
This is a really cool use case, and I am happy to help out. Actually we can train on tons of data with limited RAM, but we need to make some changes. Let's start with the question: why does it still need 55 GB of RAM for 0.8 million documents? It is because you have to load all of the original data into RAM so that we can build the token and label dicts and the model structure. If we optimize this part, you can handle all of your data easily. Let me try something and come back to you a little bit later, then you can try it out.
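For reference, here is a minimal sketch of what that optimization could look like, assuming a CoNLL-style corpus file with one `token<TAB>label` pair per line and blank lines between sentences (this is not Kashgari's actual API): build the token and label dicts in a single streaming pass, so the corpus never has to be held in RAM.

```python
# Streaming first pass over the corpus: only one line is in memory at a time.
# Assumes a CoNLL-style file: "token<TAB>label" per line, blank line between sentences.

def build_dicts(corpus_path):
    token2idx = {"<PAD>": 0, "<UNK>": 1}
    label2idx = {"O": 0}
    with open(corpus_path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue                      # blank separator line
            token, label = line.split("\t")
            token2idx.setdefault(token, len(token2idx))
            label2idx.setdefault(label, len(label2idx))
    return token2idx, label2idx
```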
Let's keep it simple first. Which embedding are you using for this task?
Cool, thanks for the reply. I am using BERT embeddings and a BiGRU model for an NER labelling task. I have used ImageDataGenerator from Keras before, which reads only a few images into memory rather than everything. I wanted to check whether something like that is possible here.
Yeah, we need to implement something similar to the ImageDataGenerator. I will try to do that tomorrow and come back to you.
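As a rough illustration of that idea (not the implementation that later landed in the library), a plain Python generator can stream padded batches from disk, so only one batch is ever held in memory, much like Keras' ImageDataGenerator. It assumes the same CoNLL-style file layout and the `token2idx` / `label2idx` dicts from the sketch above.

```python
import numpy as np

def batch_generator(corpus_path, token2idx, label2idx, batch_size=64, seq_len=100):
    """Yield (x, y) integer batches, reading the corpus file lazily."""
    while True:                                   # loop forever so the model can run many epochs
        x_batch, y_batch = [], []
        with open(corpus_path, "r", encoding="utf-8") as f:
            tokens, labels = [], []
            for line in f:
                line = line.strip()
                if line:
                    tok, lab = line.split("\t")
                    tokens.append(token2idx.get(tok, token2idx["<UNK>"]))
                    labels.append(label2idx[lab])
                    continue
                if tokens:                        # blank line marks the end of a sentence
                    pad = [0] * max(0, seq_len - len(tokens))
                    x_batch.append(tokens[:seq_len] + pad)
                    y_batch.append(labels[:seq_len] + pad)
                    tokens, labels = [], []
                if len(x_batch) == batch_size:
                    yield np.array(x_batch), np.array(y_batch)
                    x_batch, y_batch = [], []

# Usage with a Keras-style model (hypothetical num_batches):
# model.fit_generator(batch_generator(path, token2idx, label2idx),
#                     steps_per_epoch=num_batches, epochs=4)
```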
Thanks, that would be a very cool solution. By the way, when I load a saved model and try to train it again on another dataset, why does it not work? When we save a model, we save all of its state, right?
Hi @BrikerMan, were you able to do something? Let me know. I will also try something out in the meantime. Thanks.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Sorry, it has been a very busy few weeks. I will come back to you ASAP.
@allhelllooz could you prepare the token dict and label dict yourself?
I should be able to do that. Can you send me the format for the token dict and label dict?
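Guessing at the format here (I have not checked it against Kashgari's internals, so treat the layout and the special-token names as hypothetical): plain token-to-index and label-to-index mappings, with special tokens reserved at the low indices.

```python
# Hypothetical dict layout: one integer index per token / label.
token2idx = {
    "<PAD>": 0,
    "<UNK>": 1,
    "the": 2,
    "patient": 3,
    # ... one entry per token in the corpus
}

label2idx = {
    "O": 0,
    "B-PER": 1,
    "I-PER": 2,
    # ... one entry per NER tag
}
```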
@allhelllooz Sorry for the long delay. I have started the TF2 version, Kashgari V2, which is very RAM-friendly. I have tested a classification task with a 10 GB corpus, and it used only 1 GB of RAM. Please try it out.
Environment
Ubuntu 16.04, Python 3.7
Question
I have a p2.xlarge machine with 60 GB of RAM and an 11 GB GPU.
When I take ~0.8 million documents for an NER labelling task, more than 55 GB of RAM is consumed, but I am able to train.
Now I want to train on all >10 million documents. How can I do that with the limited memory available?
I am going to try 0.8 million documents for 4 epochs, then save the model and load it again with the next 0.8 million documents for another 4 epochs, and so on (a rough sketch of this loop is shown below). Will that help?
I tried the above method for 2-3 sets, but accuracy does not improve.
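For concreteness, here is a rough sketch of that chunk-by-chunk loop. `load_chunk` is a hypothetical helper, `n_chunks` is just the ~10M / ~0.8M split, and the save/load calls follow Kashgari's documented `model.save()` / `kashgari.utils.load_model()` pattern, but this is an illustration of the plan rather than a tested recipe.

```python
import kashgari
from kashgari.tasks.labeling import BiGRU_Model

def load_chunk(i):
    """Hypothetical helper: return (x, y) token/label sequences for the i-th ~0.8M-doc slice."""
    raise NotImplementedError

n_chunks = 13                         # ~10M documents split into ~0.8M-doc chunks
model_path = "ner_model"
model = BiGRU_Model()                 # or a BERT-backed labeling model, as in the thread

for i in range(n_chunks):
    x, y = load_chunk(i)
    model.fit(x, y, epochs=4)         # 4 epochs per chunk, as described above
    model.save(model_path)
    model = kashgari.utils.load_model(model_path)   # reload before the next chunk
```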
Is there any option for lazy loading or something else? Let me know.
Thanks.