
Library consumes more than 100MB just by importing it #244

Open
rragundez opened this issue Nov 30, 2024 · 5 comments
@rragundez

Hi, I noticed that my processes start out at > 100MB, which is not "normal", so I checked, and my findings show that importing this library adds > 100MB to memory consumption. That sounds a bit exaggerated; do you know why, or is there something I can do about it?
I can spin up a machine with more memory, but effectively this limits the number of workers I can spin up for performing a certain job.

@miohtama
Contributor

miohtama commented Dec 3, 2024

Interesting.

What modules are you importing? I'd guess the memory consumption is ABI files that are cached, but could be something else.

@miohtama
Contributor

miohtama commented Dec 3, 2024

You can debug this by watching the memory usage in a task manager and importing different modules by hand in an interactive Python REPL prompt.
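If you want to do that measurement programmatically rather than eyeballing a task manager, one option is the stdlib `tracemalloc` (a sketch; note it only sees Python-level allocations, so native buffers from C extensions such as numpy are not counted — for total RSS you still need the task manager or a tool like `psutil`):

```python
import importlib
import tracemalloc


def import_cost_mib(module_name: str) -> float:
    """Return the Python-level memory (MiB) allocated while importing a module.

    tracemalloc only tracks allocations made through Python's allocator,
    so memory allocated natively by C extensions is invisible here.
    """
    tracemalloc.start()
    importlib.import_module(module_name)
    current, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return current / 2**20


print(f"email: {import_cost_mib('email'):.3f} MiB")
```

Running this once per suspect module in a fresh interpreter gives a rough per-import breakdown.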

@rragundez
Author

rragundez commented Dec 4, 2024

Hi, I did some checks and I think the library is not optimised for memory with respect to dependency imports. There are unnecessarily big libraries being imported indirectly when importing this package from the root or from some modules. For example, I see Pandas and Numpy being loaded while they are not needed for the processing I am doing (together they add around 170MB). To optimize memory consumption, the usual approach is to import them only where necessary, or to push the responsibility to the user, who then needs to install the library as a requirement to use that part of the code.
Here is an example where the responsibility is pushed to the user:

https://github.com/rragundez/chunkdot/blob/main/chunkdot/__init__.py#L7
https://github.com/rragundez/chunkdot/blob/main/chunkdot/_cosine_similarity_top_k.py#L14
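For readers unfamiliar with that pattern, it usually looks something like this (a sketch, not the actual chunkdot code; the `require` helper name is made up):

```python
import importlib


def require(module_name: str):
    """Import an optional dependency, failing with an actionable message."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name!r} is required for this feature. "
            f"Install it with: pip install {module_name}"
        ) from exc


# The optional dependency is resolved only when the feature is used,
# so merely importing the package never pays its cost.
json_mod = require("json")  # stdlib stand-in for a heavy package
print(json_mod.dumps({"ok": True}))
```

If the optional package is missing, the user gets a clear install hint instead of a bare `ModuleNotFoundError` at import time.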

There are several ways to do it: very lightweight libraries declare the "heavy" libraries as dependencies but import them only inside the functions that need them, or they concentrate the functions that need them in a single module.
This has an especially big impact for use cases that parallelize via Python's multiprocessing (multithreading will be OK-ish since it shares memory), or that execute many separate jobs via some external tool.
If you could get this package below 25MB, the parallelisation achievable at very cheap infra cost would be amazing, and it would let users concentrate mainly on solving IO, since you already use generators across the library to reduce memory consumption.
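The function-local import variant mentioned above can be sketched like this (using a stdlib module as a stand-in for a heavy dependency such as numpy):

```python
import sys


def mean(values):
    # The heavy dependency is imported on the first call, not at
    # `import mypackage` time, keeping the base import cheap.
    import statistics  # stand-in for numpy / pandas

    return statistics.mean(values)


# Importing the module that defines `mean` costs nothing extra; the
# dependency only lands in sys.modules once the function actually runs.
result = mean([1, 2, 3])
print(result, "statistics" in sys.modules)
```

Python caches modules in `sys.modules`, so repeated calls pay no extra import cost after the first one.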

Hope this helps. I could also help by looking into it in more detail if you think that would be valuable. Have a nice one!

@miohtama
Contributor

> Pandas and Numpy being loaded while not needed for the processing I am doing (they both add around 170MB).

Thank you for your concern.

At the moment, focusing our efforts on removing these imports would take too much energy and would help only one very specific use case.

While we may do this in the future, I feel the correct way to spend the energy would be to make Pandas and Numpy themselves more lightweight and environment friendly.

@miohtama miohtama reopened this Dec 15, 2024
@miohtama
Contributor

If you think there is an easy way to fix this without too many lines of change, we are happy to take a patch, but we do not have enough resources ourselves to do this.
