How to train on limited RAM when I have >10 million documents #273
Additionally, what is the difference between
This is a really cool use case, and I am happy to help out. Actually we can train on tons of data with limited RAM, but we need to make some changes. Let's start with the question: why does it still need 55 GB of RAM for 0.8 million documents? It is because you have to load all of the original data into RAM so that we can build the token and label dicts and the model structure. If we optimize this part, you can handle all of your data easily. Let me try something and come back to you a little bit later, then you can try it out.
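For reference, here is a minimal sketch of what that optimization could look like, assuming a CoNLL-style corpus file with one `token<TAB>label` pair per line and blank lines between sentences (this is not Kashgari's actual API): build the token and label dicts in a single streaming pass, so the corpus never has to be held in RAM.

```python
# Streaming first pass over the corpus: only one line is in memory at a time.
# Assumes a CoNLL-style file: "token<TAB>label" per line, blank line between sentences.

def build_dicts(corpus_path):
    token2idx = {"<PAD>": 0, "<UNK>": 1}
    label2idx = {"O": 0}
    with open(corpus_path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue                      # blank separator line
            token, label = line.split("\t")
            token2idx.setdefault(token, len(token2idx))
            label2idx.setdefault(label, len(label2idx))
    return token2idx, label2idx
```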
Let's keep it simple first. Which embedding are you using for this task?
Cool, thanks for the reply. I am using BERT embeddings and a BiGRU model for an NER labelling task. I have used ImageDataGenerator from Keras before, which reads only a few images into memory rather than everything. I wanted to check whether something like that is possible here.
Yeah, we need to implement something similar to the ImageDataGenerator. I will try to do that tomorrow and come back to you.
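As a rough illustration of that idea (not the implementation that later landed in the library), a plain Python generator can stream padded batches from disk, so only one batch is ever held in memory, much like Keras' ImageDataGenerator. It assumes the same CoNLL-style file layout and the `token2idx` / `label2idx` dicts from the sketch above.

```python
import numpy as np

def batch_generator(corpus_path, token2idx, label2idx, batch_size=64, seq_len=100):
    """Yield (x, y) integer batches, reading the corpus file lazily."""
    while True:                                   # loop forever so the model can run many epochs
        x_batch, y_batch = [], []
        with open(corpus_path, "r", encoding="utf-8") as f:
            tokens, labels = [], []
            for line in f:
                line = line.strip()
                if line:
                    tok, lab = line.split("\t")
                    tokens.append(token2idx.get(tok, token2idx["<UNK>"]))
                    labels.append(label2idx[lab])
                    continue
                if tokens:                        # blank line marks the end of a sentence
                    pad = [0] * max(0, seq_len - len(tokens))
                    x_batch.append(tokens[:seq_len] + pad)
                    y_batch.append(labels[:seq_len] + pad)
                    tokens, labels = [], []
                if len(x_batch) == batch_size:
                    yield np.array(x_batch), np.array(y_batch)
                    x_batch, y_batch = [], []

# Usage with a Keras-style model (hypothetical num_batches):
# model.fit_generator(batch_generator(path, token2idx, label2idx),
#                     steps_per_epoch=num_batches, epochs=4)
```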
Thanks, that would be a very cool solution. By the way, when I load a saved model and try to train it again on another dataset, why does it not work? When we save a model, we save all of its state, right?
Hi @BrikerMan, were you able to do something? Let me know. I will also try something out in the meantime. Thanks.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Sorry, it has been a very busy few weeks. I will come back to you ASAP.
@allhelllooz could you prepare the token dict and label dict yourself?
I should be able to do that. Can you send me the format for the token dict and label dict?
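Guessing at the format here (I have not checked it against Kashgari's internals, so treat the layout and the special-token names as hypothetical): plain token-to-index and label-to-index mappings, with special tokens reserved at the low indices.

```python
# Hypothetical dict layout: one integer index per token / label.
token2idx = {
    "<PAD>": 0,
    "<UNK>": 1,
    "the": 2,
    "patient": 3,
    # ... one entry per token in the corpus
}

label2idx = {
    "O": 0,
    "B-PER": 1,
    "I-PER": 2,
    # ... one entry per NER tag
}
```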
@allhelllooz Sorry for the long delay. I have started the TF2 version, Kashgari V2, which is very RAM-friendly. I have tested a classification task with a 10 GB corpus, and it used only 1 GB of RAM. Please try it out.
Environment
Ubuntu 16.04, Python 3.7
Question
I have a p2.xlarge machine with 60 GB of RAM and an 11 GB GPU.
When I take ~0.8 million documents for an NER labelling task, more than 55 GB of RAM is consumed, but I am able to train.
Now I want to train on all >10 million documents. How can I do that with the limited memory available?
I am going to try 0.8 million documents for 4 epochs, then save the model and load it again with the next 0.8 million documents for another 4 epochs, and so on (a rough sketch of this loop is shown below). Will that help?
I tried the above method for 2-3 sets, but accuracy does not improve.
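For concreteness, here is a rough sketch of that chunk-by-chunk loop. `load_chunk` is a hypothetical helper, `n_chunks` is just the ~10M / ~0.8M split, and the save/load calls follow Kashgari's documented `model.save()` / `kashgari.utils.load_model()` pattern, but this is an illustration of the plan rather than a tested recipe.

```python
import kashgari
from kashgari.tasks.labeling import BiGRU_Model

def load_chunk(i):
    """Hypothetical helper: return (x, y) token/label sequences for the i-th ~0.8M-doc slice."""
    raise NotImplementedError

n_chunks = 13                         # ~10M documents split into ~0.8M-doc chunks
model_path = "ner_model"
model = BiGRU_Model()                 # or a BERT-backed labeling model, as in the thread

for i in range(n_chunks):
    x, y = load_chunk(i)
    model.fit(x, y, epochs=4)         # 4 epochs per chunk, as described above
    model.save(model_path)
    model = kashgari.utils.load_model(model_path)   # reload before the next chunk
```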
Is there any option for lazy loading or something else? Let me know.
Thanks.