Document memory behaviour and give tips for dealing with many files #226
Hi! Totally agree it'd be good to document the current state of things. I don't think it's easily possible at the moment to influence the amount of memory used per file, though (I'd have to take a closer look to be completely sure). The memory limit option only affects how many filesystem blocks can be queued for compression, so it isn't going to help with the memory consumed per file. I'll get back with more info when I find some time to look into this more closely.
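To illustrate what "only affects how many filesystem blocks can be queued for compression" means for memory, here's a rough conceptual model. This is not dwarfs's actual code; the class name and the 16 MiB block size in the comment are just made-up examples.

```cpp
// Rough conceptual model only -- NOT dwarfs's actual implementation.
// It illustrates one reading of the statement above: the memory limit
// bounds how many uncompressed filesystem blocks can wait in the
// compression queue, and nothing else (per-file metadata is unaffected).
#include <algorithm>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <vector>

struct Block {
  std::vector<std::byte> data;  // one uncompressed filesystem block
};

class BoundedBlockQueue {
 public:
  // e.g. memory_limit = 100 MiB and 16 MiB blocks -> a budget of ~6 blocks
  BoundedBlockQueue(std::size_t memory_limit, std::size_t block_size)
      : max_blocks_(std::max<std::size_t>(1, memory_limit / block_size)) {}

  // Producer side (scanner/segmenter): blocks once the budget is exhausted.
  void push(Block b) {
    std::unique_lock lock(m_);
    not_full_.wait(lock, [&] { return q_.size() < max_blocks_; });
    q_.push(std::move(b));
    not_empty_.notify_one();
  }

  // Consumer side (compression worker): frees one slot of the budget.
  Block pop() {
    std::unique_lock lock(m_);
    not_empty_.wait(lock, [&] { return !q_.empty(); });
    Block b = std::move(q_.front());
    q_.pop();
    not_full_.notify_one();
    return b;
  }

 private:
  std::size_t max_blocks_;
  std::queue<Block> q_;
  std::mutex m_;
  std::condition_variable not_full_;
  std::condition_variable not_empty_;
};
```

The point being: only the queued block payloads are covered by that budget, while per-file state (inodes, hashes, similarity data) lives elsewhere, which is why lowering the limit doesn't shrink per-file memory.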
Okay, I'm going to use this issue to collect info / random thoughts as I go through the code and hopefully this will spark further discussion before summarizing this in the docs and / or coming up with plans for future improvements.
I think it would be fantastic if there was a way to get a "constant-memory operation" mode. Sometimes you just have too many files, and it's the number of files that's the problem (e.g. causing disk seeks upon reads or disk scrubs, payment-per-request on cloud storage systems, and so on). Then you basically want constant-memory operation, so I think it would already be valuable to be able to disable dwarfs's other features just to gain those benefits.

But in addition, it would be even cooler to be able to opt into some of dwarfs's more advanced features in a "constant-memory" mode, e.g. deduplicating against only the last N MB read, or the hashes of the last N million files. This would unlock most of dwarfs's space-saving features while still being able to set up an automatic job and knowing it will never run out of RAM. So in this case, you could gradually tune between a minimal "only a few KB of memory needed" mode and dwarfs's full feature set.
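To make the "hashes of the last N million files" idea a bit more concrete, here's a minimal sketch of the kind of bounded structure I mean. It's purely illustrative and not dwarfs code; `BoundedDedupIndex` and everything else in it is invented.

```cpp
// Sketch only, not dwarfs code; all names are invented. The idea: keep at
// most N content hashes in an LRU structure, so dedup memory stays bounded
// no matter how many files are scanned in total. (A real implementation
// would also verify file contents on a hash match instead of trusting it.)
#include <cstddef>
#include <cstdint>
#include <list>
#include <unordered_map>

class BoundedDedupIndex {
 public:
  explicit BoundedDedupIndex(std::size_t max_entries)
      : max_entries_(max_entries) {}

  // Returns true if a file with this content hash was seen "recently";
  // otherwise remembers it, evicting the oldest hash once the cap is hit.
  bool seen_recently(std::uint64_t content_hash) {
    if (auto it = index_.find(content_hash); it != index_.end()) {
      // Keep hashes that keep matching near the front of the LRU list.
      lru_.splice(lru_.begin(), lru_, it->second);
      return true;
    }
    lru_.push_front(content_hash);
    index_.emplace(content_hash, lru_.begin());
    if (index_.size() > max_entries_) {  // evict the least recently used hash
      index_.erase(lru_.back());
      lru_.pop_back();
    }
    return false;
  }

 private:
  std::size_t max_entries_;       // e.g. "the last 10 million files"
  std::list<std::uint64_t> lru_;  // most recently seen hash at the front
  std::unordered_map<std::uint64_t, std::list<std::uint64_t>::iterator> index_;
};
```

The memory use of such an index is roughly `max_entries` times a few dozen bytes, independent of the total number of files archived.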
Hi,
I sometimes need to archive hundreds of millions of small files.
Lots of software fails on that with out-of-memory errors, for example:
It would be fantastic if somewhere in the `mkdwarfs` man page you could document its memory scaling behaviour.

For example: `--num-scanner-workers` looks like such an option; are there others recommended for the "many small files" use case?

As a quick benchmark, 500 k small files took 2 GB maxresident RAM for me with default options and `--num-scanner-workers 100`, on a 32-core machine. `--file-hash=none --max-similarity-size=0 --window-size 0 --memory-limit 100M` did not significantly reduce it, but maybe that changes at higher scale.

But it is definitely curious that the used memory was 20x higher than the requested memory limit; its documentation says "approximately", but this is a case that further motivates knowing what the other factors are in memory consumption.

It would be awesome if the scaling behaviour could be documented, so that one doesn't have to benchmark it to find out what would happen for 500 M files.
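For a rough back-of-the-envelope extrapolation (assuming the per-file overhead stays roughly constant, which is exactly the kind of thing the docs could confirm or refute): 2 GB / 500 k files ≈ 4 KB per file, so 500 M files would land on the order of 2 TB of maxresident RAM with the same settings.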