
Library consumes more than 100MB just by importing it #244

Open
rragundez opened this issue Nov 30, 2024 · 5 comments
@rragundez

Hi, I noticed that my processes start out at > 100MB, which is not "normal", so I checked, and my findings show that importing this library adds > 100MB to memory consumption. That sounds a bit exaggerated; do you know why, or is there something I can do about it?
I can spin up a machine with more memory, but effectively this limits the number of workers I can spin up for performing a certain job.

@miohtama
Contributor

miohtama commented Dec 3, 2024

Interesting.

What modules are you importing? I'd guess the memory consumption is ABI files that are cached, but could be something else.

@miohtama
Contributor

miohtama commented Dec 3, 2024

You can debug this by watching the memory usage in a task manager and importing different modules by hand in an interactive Python REPL prompt.
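If you want to do that measurement programmatically rather than eyeballing a task manager, one option is the stdlib `tracemalloc` (a sketch; note it only sees Python-level allocations, so native buffers from C extensions such as numpy are not counted — for total RSS you still need the task manager or a tool like `psutil`):

```python
import importlib
import tracemalloc


def import_cost_mib(module_name: str) -> float:
    """Return the Python-level memory (MiB) allocated while importing a module.

    tracemalloc only tracks allocations made through Python's allocator,
    so memory allocated natively by C extensions is invisible here.
    """
    tracemalloc.start()
    importlib.import_module(module_name)
    current, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return current / 2**20


print(f"email: {import_cost_mib('email'):.3f} MiB")
```

Running this once per suspect module in a fresh interpreter gives a rough per-import breakdown.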

@rragundez
Author

rragundez commented Dec 4, 2024

Hi, I did some checks and I think the library is not optimised for memory with respect to dependency imports. There are unnecessarily big libraries being imported indirectly when importing this package from the root or from some modules. For example, I see Pandas and Numpy being loaded while they are not needed for the processing I am doing (together they add around 170MB). To optimize memory consumption, the usual approach is to import them only where necessary, or to push the responsibility to the user, who then needs to install the library as a requirement to use that part of the code.
Here is an example where the responsibility is pushed to the user:

https://github.com/rragundez/chunkdot/blob/main/chunkdot/__init__.py#L7
https://github.com/rragundez/chunkdot/blob/main/chunkdot/_cosine_similarity_top_k.py#L14
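For readers unfamiliar with that pattern, it usually looks something like this (a sketch, not the actual chunkdot code; the `require` helper name is made up):

```python
import importlib


def require(module_name: str):
    """Import an optional dependency, failing with an actionable message."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name!r} is required for this feature. "
            f"Install it with: pip install {module_name}"
        ) from exc


# The optional dependency is resolved only when the feature is used,
# so merely importing the package never pays its cost.
json_mod = require("json")  # stdlib stand-in for a heavy package
print(json_mod.dumps({"ok": True}))
```

If the optional package is missing, the user gets a clear install hint instead of a bare `ModuleNotFoundError` at import time.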

There are several ways to do it: very lightweight libraries declare the "heavy" libraries as dependencies but import them only inside the functions that need them, or they concentrate the functions that need them in a single module.
This has an especially big impact for use cases that parallelize via Python's multiprocessing (multithreading will be OK-ish since it shares memory), or that execute many separate jobs via some external tool.
If you could get this package below 25MB, the parallelisation achievable at very cheap infra cost would be amazing, and it would let users concentrate mainly on solving IO, since you already use generators across the library to reduce memory consumption.
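The function-local import variant mentioned above can be sketched like this (using a stdlib module as a stand-in for a heavy dependency such as numpy):

```python
import sys


def mean(values):
    # The heavy dependency is imported on the first call, not at
    # `import mypackage` time, keeping the base import cheap.
    import statistics  # stand-in for numpy / pandas

    return statistics.mean(values)


# Importing the module that defines `mean` costs nothing extra; the
# dependency only lands in sys.modules once the function actually runs.
result = mean([1, 2, 3])
print(result, "statistics" in sys.modules)
```

Python caches modules in `sys.modules`, so repeated calls pay no extra import cost after the first one.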

Hope this helps. I could also help by looking into it in more detail if you think that would be valuable. Have a nice one!

@miohtama
Contributor

> Pandas and Numpy being loaded while not needed for the processing I am doing (they both add around 170MB).

Thank you for your concern.

At the moment, focusing our efforts on removing these imports would take too much energy and would help only one very specific use case.

While we may do this in the future, I feel the correct way to spend the energy would be to make Pandas and Numpy themselves more lightweight and environment friendly.

@miohtama miohtama reopened this Dec 15, 2024
@miohtama
Contributor

If you think there is an easy way to fix this without too many lines of change, we are happy to take a patch, but we do not have enough resources ourselves to do this.
