
How to train on limited RAM when I have >10 million documents #273

Open
allhelllooz opened this issue Oct 19, 2019 · 12 comments
Assignees
Labels: pinned · question (Further information is requested)

Comments

@allhelllooz

Environment

Ubuntu 16.04, Python 3.7

Question

I have a p2.xlarge machine with 60 GB of RAM and an 11 GB GPU.
When I take ~0.8 million documents for an NER labelling task, more than 55 GB of RAM is consumed, but I am able to train.

Now I want to train on all >10 million documents. How can I do that with the limited memory available?
I am going to try 0.8 million documents for 4 epochs, then save the model and load it again with the next 0.8 million documents for another 4 epochs, and so on. Will that help?

I tried the above method for 2-3 sets, but accuracy does not improve.
Is there an option for lazy loading or something similar? Let me know.
Thanks.
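The save-and-continue scheme described above could be sketched roughly as follows. This is a minimal, framework-agnostic illustration, not Kashgari's API: the chunking helper is plain Python, and the training calls are shown only as comments with illustrative names.

```python
# Split a large corpus into fixed-size chunks, then (in real use) train a
# few epochs per chunk, saving the model between chunks so each chunk can
# be released from RAM before the next one is loaded.

def iter_chunks(items, chunk_size):
    """Yield successive slices of `items` with at most `chunk_size` elements."""
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]

# Illustrative use (model, chunk_x, chunk_y are hypothetical names):
# for chunk in iter_chunks(all_documents, 800_000):
#     chunk_x, chunk_y = preprocess(chunk)
#     model.fit(chunk_x, chunk_y, epochs=4)
#     model.save('checkpoint/')
```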

@allhelllooz added the question (Further information is requested) label on Oct 19, 2019
@allhelllooz
Author

Additionally, what is the difference between fit and fit_without_generator?
Will it help me train the way I explained above?

@BrikerMan
Owner

This is a really cool use case, and I am happy to help out. We can actually train on tons of data with limited RAM, but it needs some changes.

Let's start with fit and fit_without_generator. fit is equivalent to Keras's fit_generator, a lazy-loading method that can handle lots of data with limited RAM. fit_without_generator is equivalent to Keras's fit: slightly faster than fit_generator, but it costs more RAM.

So why does it still need 55 GB of RAM with 0.8 million documents? It is because all of the original data has to be loaded into RAM so that we can build the token and label dictionaries and the model structure. If we can optimize that part, you will be able to handle all of your data easily.

Let me try something and come back to you a little bit later. Then you can try it out.
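To illustrate the difference: a generator-backed fit only ever holds one batch in RAM, because it consumes batches one at a time from a Python generator. The sketch below is framework-agnostic and the names are illustrative; it mirrors the Keras fit_generator contract, where a generator loops forever and the framework stops it via steps_per_epoch.

```python
# Yield one (samples, labels) batch at a time instead of materialising the
# whole dataset. Only the current batch needs to be in memory.

def batch_generator(samples, labels, batch_size):
    while True:  # Keras-style generators loop forever; training decides when to stop
        for i in range(0, len(samples), batch_size):
            yield samples[i:i + batch_size], labels[i:i + batch_size]
```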

@BrikerMan
Owner

Let's keep it simple first. Which embedding are you using for this task?

@allhelllooz
Author

Cool, thanks for the reply. I am using BERT embeddings and a BiGRU model for the NER labelling task.

I have used ImageDataGenerator from Keras before, which reads only a few images into memory rather than everything. I wanted to check whether something like that is possible here.
Also, I am a noob in TensorFlow and Keras, so I am not sure how to solve this use case.
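For text, the ImageDataGenerator-style approach would mean streaming examples from disk one sentence at a time. A possible sketch, assuming a CoNLL-style format (one "token tag" pair per line, blank line between sentences; the format is an assumption for illustration, not Kashgari's required one):

```python
# Parse a CoNLL-style stream lazily: yield one (tokens, tags) example per
# sentence, so the full corpus never sits in memory at once. `lines` can be
# an open file handle, so only one line is buffered at a time.

def read_conll(lines):
    tokens, tags = [], []
    for line in lines:
        line = line.strip()
        if not line:  # blank line ends the current sentence
            if tokens:
                yield tokens, tags
                tokens, tags = [], []
        else:
            token, tag = line.split()
            tokens.append(token)
            tags.append(tag)
    if tokens:  # flush the last sentence if the file lacks a trailing blank line
        yield tokens, tags
```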

@BrikerMan
Owner

Yes, we need to implement something similar to ImageDataGenerator. I will try to do that tomorrow and come back to you.

@allhelllooz
Author

Thanks, that would be a very cool solution. By the way, when I load the model and try to train it again on some other dataset, why does it not work? When we save a model, we save all of its state, right?

@allhelllooz
Author

Hi @BrikerMan, were you able to do something? Let me know. I will also try something out in the meantime. Thanks.

@stale

stale bot commented Nov 11, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix (This will not be worked on) label Nov 11, 2019
@BrikerMan
Owner

Sorry, the last several weeks have been very busy. I will come back to you ASAP.

@stale stale bot removed the wontfix (This will not be worked on) label Nov 11, 2019
@BrikerMan
Owner

@allhelllooz could you prepare the token-dict and label-dict by yourself?

@allhelllooz
Author

I should be able to do that. Can you send me the format for the token-dict and label-dict?
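One plausible format, to be confirmed by the maintainer, is a plain {string: index} mapping for each dictionary, built in a single streaming pass over the corpus so no full copy is held in RAM. The reserved [PAD]/[UNK] entries mirror common BERT-style vocabularies; the exact names Kashgari expects are an assumption here.

```python
# Build token and label dictionaries from a stream of (tokens, tags)
# examples. Because the stream is consumed lazily, only one example is in
# memory at a time.

def build_dicts(example_stream):
    token2idx = {'[PAD]': 0, '[UNK]': 1}  # reserved entries are an assumption
    label2idx = {}
    for tokens, tags in example_stream:
        for token in tokens:
            token2idx.setdefault(token, len(token2idx))
        for tag in tags:
            label2idx.setdefault(tag, len(label2idx))
    return token2idx, label2idx
```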

@BrikerMan
Owner

@allhelllooz Sorry for the long delay. I have started on the tf2 version, Kashgari V2, which is very RAM-friendly. I tested a classification task with a 10 GB corpus, and it used only 1 GB of RAM. Please try it out.
