Strict Entries for out-of-order (parallel) processing #160
Comments
Oh awesome! I always assumed the slowness in rustup was around the handling of "transactions", which IIRC copied files at least 3 times, and rust-docs has a huge number of files. That being said, extraction in parallel is certain to be a nice boost! I think with parallel extraction `Seek` is basically required in one form or another. In that sense I think this may basically just want methods that are specialized to `Archive<R>` where `R: Read + Seek`.
Hi Alex! Thanks for the enthusiasm! I want to be precise: the performance problem in rustup is with the unpacking of `rust-docs`, which contains a huge number of small files. The `unpack` calls themselves are what's slow, not the archive iteration. The transactions mechanism is a second-order performance problem here, which I'll address later.

So the line of thought I had was to extract it all in memory (have strict `Entry`s with all the metadata together with the data), which would then be dispatched in parallel to the file system pipeline.

I also thought about the `Seek` option, which would let each entry be read independently. So we could add another option to the solution space: methods specialized to `R: Read + Seek`, as you suggest.

Also, just to be sure there isn't another little misunderstanding here - the 3 points under the solution space are independent routes to solve the problem - an `Entry` made strict doesn't need `Seek` at all, and a `Seek`-based route doesn't need strict entries.

What do you think would be the right way to go here?
Hey there. Switching to read the data eagerly instead of using the lazy `Read` implementation wouldn't really change the memory profile: the result would be that the memory would be allocated and dropped by the user on each iteration, instead of being read and dropped inside the crate's skipping logic.

So I think this is the way I'm going to start refactoring.
Hm so it's not entirely clear to me what the proposal here is of what to do? Could you elaborate a bit on what you're thinking?

One thing I just realized as well is that the first thing to do with any tarball is decompress it, which is effectively single-threaded. I wonder if this could work by perhaps quickly iterating over the decompressed archive once to record each entry's offset and size, and then unpacking the entries in parallel from those offsets (or maybe this is all already what you're thinking?)
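As an illustration of that two-pass route, here is a rough sketch that indexes an already-decompressed tarball and then unpacks regular files from independent handles. It leans on `Entry::raw_file_position` for the offsets, hard-codes four workers, and skips all metadata handling, so it is a sketch of the idea rather than a drop-in implementation:

```rust
use std::fs::{self, File};
use std::io::{self, Read, Seek, SeekFrom};
use std::path::{Path, PathBuf};
use std::thread;

use tar::Archive;

/// Pass 1: scan the headers sequentially (cheap, no data is copied) and
/// record (path, data offset, size) for every regular file.
fn index_entries(tarball: &Path) -> io::Result<Vec<(PathBuf, u64, u64)>> {
    let mut archive = Archive::new(File::open(tarball)?);
    let mut index = Vec::new();
    for entry in archive.entries()? {
        let entry = entry?;
        if entry.header().entry_type().is_file() {
            index.push((
                entry.path()?.into_owned(),
                entry.raw_file_position(), // absolute offset of the entry's data
                entry.header().size()?,
            ));
        }
    }
    Ok(index)
}

/// Pass 2: split the index across threads; each worker opens its own
/// handle, seeks to the recorded offset, and writes the file out.
fn unpack_parallel(tarball: &Path, dest: &Path, index: Vec<(PathBuf, u64, u64)>) -> io::Result<()> {
    let handles: Vec<_> = index
        .chunks(index.len() / 4 + 1)
        .map(|chunk| {
            let chunk = chunk.to_vec();
            let (tarball, dest) = (tarball.to_path_buf(), dest.to_path_buf());
            thread::spawn(move || -> io::Result<()> {
                let mut src = File::open(&tarball)?;
                for (rel, offset, size) in chunk {
                    src.seek(SeekFrom::Start(offset))?;
                    let out_path = dest.join(rel);
                    if let Some(parent) = out_path.parent() {
                        fs::create_dir_all(parent)?;
                    }
                    let mut out = File::create(&out_path)?;
                    io::copy(&mut (&mut src).take(size), &mut out)?;
                }
                Ok(())
            })
        })
        .collect();
    for h in handles {
        h.join().expect("worker panicked")?;
    }
    Ok(())
}
```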
@alexcrichton, are you available on any synchronous communication channels such as IRC/gitter, etc? What I'm currently thinking of implementing is as follows: read each entry strictly (header + data) into memory as it is encountered, and dispatch the owned entries to a pool of threads that do the filesystem work.

To be able to achieve this, the limitation of sequential processing of `Entry`s needs to be lifted, so that an entry can outlive the iteration step that produced it. I would like to note that this won't deteriorate performance much on the existing use case of sequential processing of entries: the same bytes are read either way; only where they are buffered changes.
@NightRa sure yeah, I'm on Discord. I think that could work, though I'm wary to put a whole entry's contents in memory by default.
The current behavior:

```rust
for entry in archive.entries() {
    entry.unpack(...)
}
```

In this case, the entry's data is allocated on each iteration and then deallocated.

```rust
for entry in archive.entries() {
    // no unpack call
}
```

In this case, the entry's data is read & allocated temporarily inside the iterator while it skips ahead to the next header.

If we have a strict `Entry` which owns its data, the picture stays the same; the only difference being where the heap allocation happens: at most 1 entry's data in memory at each point, versus a bounded intermediate buffer at each point. And I don't think it would be that terrible to have at most one entry in memory at each point in time.
Ok I talked with @NightRa on discord a bit and (I think?) we concluded that probably the best way to go for now is to have a way to take an `Entry` and read it fully into an owned value, which can then be sent to other threads to do the unpacking.
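For illustration, here is a sketch of how that could be used, under the assumption that reading the entry to the end (it implements `Read`) is the "make it owned" step. A single writer thread is shown where a real implementation would likely use a pool, and metadata (permissions, mtimes) is ignored:

```rust
use std::fs::{self, File};
use std::io::{self, Read, Write};
use std::path::PathBuf;
use std::sync::mpsc;
use std::thread;

use tar::Archive;

fn extract_pipelined(tarball: &str, dest: &str) -> io::Result<()> {
    // Bounded channel: the reader stalls once 16 entries are in flight,
    // so memory stays capped at roughly 16 entries' worth of data.
    let (tx, rx) = mpsc::sync_channel::<(PathBuf, Vec<u8>)>(16);
    let dest = PathBuf::from(dest);

    let writer = {
        let dest = dest.clone();
        thread::spawn(move || -> io::Result<()> {
            for (rel, data) in rx {
                let path = dest.join(rel);
                if let Some(parent) = path.parent() {
                    fs::create_dir_all(parent)?;
                }
                File::create(path)?.write_all(&data)?;
            }
            Ok(())
        })
    };

    // Reader side: still sequential (the tar format demands it), but it
    // no longer blocks on the filesystem between entries.
    let mut archive = Archive::new(File::open(tarball)?);
    for entry in archive.entries()? {
        let mut entry = entry?;
        if !entry.header().entry_type().is_file() {
            continue;
        }
        let rel = entry.path()?.into_owned();
        let mut data = Vec::with_capacity(entry.header().size()? as usize);
        entry.read_to_end(&mut data)?; // the "make it owned" step
        tx.send((rel, data)).expect("writer hung up");
    }
    drop(tx); // close the channel so the writer loop ends
    writer.join().expect("writer panicked")
}
```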
I would like to have this feature. @NightRa Do you plan to resubmit the patch?
Yes, I do have plans to resubmit the patch soon.
Wouldn't additional impl blocks for `Archive<R>` where `R: Read + Seek` be a way to expose out-of-order processing without breaking the existing API?
Hi, I'm interested in the rustup performance story too :) - I've dropped the copy count down to 0 - write once and move into place. I'm not sure if more concurrency is needed or not - I'd like to see @NightRa's prototype to experiment with. In the meantime though, I have analysed the current behaviour, and quite some time is being spent re-opening files to set mtime, permissions and so on. Adding a buffered writer may also be useful - submitting 4k blocks to the IO subsystem is way too small for today's devices (Linux being much better at coalescing than Windows).
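On the block-size point, something as simple as wrapping the destination in a `std::io::BufWriter` with a larger capacity would batch those 4k submissions. A minimal sketch; the 128 KiB capacity is an illustrative guess, not a measured optimum:

```rust
use std::fs::File;
use std::io::{self, BufWriter, Read, Write};

/// Copy `src` to a freshly created `path`, issuing writes to the OS in
/// 128 KiB chunks instead of whatever small blocks the source hands us.
fn write_buffered(path: &str, src: &mut impl Read) -> io::Result<()> {
    let file = File::create(path)?;
    let mut out = BufWriter::with_capacity(128 * 1024, file);
    io::copy(src, &mut out)?;
    out.flush()?; // make sure the tail buffer reaches the OS before close
    Ok(())
}
```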
Also a lot of time is being spent in CloseHandle - I haven't cracked out a full system profiler yet - that's pretty much next - but this is well documented as a thing for write-heavy tools like VCS's.... even with syscalls optimised in rustup we see

```
2:16:18.5880792 PM 0.0001307 rustup.exe 38704 CreateFile C:\Users\robertc\.rustup\tmp\73avhrved4ed3vw1_dir\rust-docs\share\doc\rust\html\std\sync\condvar\struct.Condvar.html SUCCESS Desired Access: Generic Write, Read Attributes, Disposition: Create, Options: Synchronous IO Non-Alert, Non-Directory File, Open Reparse Point, Attributes: n/a, ShareMode: Read, Write, Delete, AllocationSize: 0, OpenResult: Created
2:16:18.5882340 PM 0.0000520 rustup.exe 38704 WriteFile C:\Users\robertc\.rustup\tmp\73avhrved4ed3vw1_dir\rust-docs\share\doc\rust\html\std\sync\condvar\struct.Condvar.html SUCCESS Offset: 0, Length: 389, Priority: Normal
2:16:18.5883040 PM 0.0006485 rustup.exe 38704 CloseFile C:\Users\robertc\.rustup\tmp\73avhrved4ed3vw1_dir\rust-docs\share\doc\rust\html\std\sync\condvar\struct.Condvar.html SUCCESS
```

where CloseFile (aka CloseHandle) is taking up most of the time; I suspect that the benefit you're actually seeing is avoiding CloseHandle blocking, and if so, I have a much simpler patch up to permit parallelising that work externally without requiring any parallelisation of the iteration or extraction process.
The time in CloseHandle is Windows Defender scanning the written files via a filter driver. Parallelising (or doing async closes) submits more jobs to Defender, allowing parallel scanning, and avoiding the implicit serialization because of Defender.
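A sketch of one way to take those closes off the extraction thread without any OS-level async-close API: hand the open handles to a few dedicated threads whose only job is to drop them. The struct and its thread count are illustrative, not a proposed API:

```rust
use std::fs::File;
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

/// Hands `File` handles to background threads whose only job is to drop
/// them, taking the (expensive) CloseHandle call off the hot path.
struct DeferredCloser {
    tx: Option<mpsc::Sender<File>>,
    workers: Vec<thread::JoinHandle<()>>,
}

impl DeferredCloser {
    fn new(n_threads: usize) -> Self {
        let (tx, rx) = mpsc::channel::<File>();
        let rx = Arc::new(Mutex::new(rx));
        let workers = (0..n_threads)
            .map(|_| {
                let rx = Arc::clone(&rx);
                thread::spawn(move || loop {
                    // The lock guard is released before the handle is
                    // dropped, so the closes themselves run in parallel.
                    let next = rx.lock().unwrap().recv();
                    match next {
                        Ok(_file) => {} // `_file` dropped here => CloseHandle
                        Err(_) => break, // sender gone: shut down
                    }
                })
            })
            .collect();
        DeferredCloser { tx: Some(tx), workers }
    }

    /// Queue a handle for closing on a background thread.
    fn close(&self, file: File) {
        self.tx.as_ref().unwrap().send(file).expect("closer shut down");
    }
}

impl Drop for DeferredCloser {
    fn drop(&mut self) {
        self.tx.take(); // closes the channel
        for w in self.workers.drain(..) {
            let _ = w.join();
        }
    }
}
```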
@NightRa well - sometimes it is; it's not for my traces, which is why I haven't assigned specific blame. I'd be delighted to do async closes instead, but I'm not aware of any API for doing that! Please link me up :)
Hi there.

I'm in the process of performance improvement for rustup, and have settled on extracting tars in parallel, having a POC showing its effectiveness on Windows.

Currently, the tar package only supports sequential unpacking of files to disk (`unpack` being the primitive operation, as it also writes metadata).

To sum up the currently relevant motivations of the tar package, as I understand them:

- Work over any `Read` trait implementation, not assuming it's `Seek`.
- Don't read the data from the `Read`er unless the user asked to do so; instead, skip over it.

Both decisions are debatable, but are certainly very decent in the design space.

Now, in order to implement parallel extraction, I would need to process `Entry`ies out of order.

Solution space:

1. Require `Seek` - each entry can be independent, and still lazy, but this may potentially break clients.
2. Inform `entries` about out-of-order processing.
3. A `force` function on `Entry` - which returns a `StrictEntry`, being an independent structure, which reads & contains the data strictly. This would enlarge the user-facing API though - increasing implementation & api complexity. Doesn't break the current api. (See the sketch after this list.)

I like both approach #2 and #3, each with their own trade-offs, and choosing which way to go depends on the crate's direction.
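To make option 3 concrete, here is a rough sketch of the shape `force`/`StrictEntry` could take. The names come from this issue; the exact fields and the free-function form (the issue proposes a method) are guesses:

```rust
use std::io::{self, Read};
use std::path::PathBuf;

use tar::{Entry, Header};

/// An entry that has been read in full: no borrow of the archive,
/// freely movable across threads.
pub struct StrictEntry {
    pub path: PathBuf,
    pub header: Header, // Header is a plain byte blob, so it can be owned
    pub data: Vec<u8>,
}

/// The proposed `force` operation, written here as a free function over
/// any `Entry`.
pub fn force<R: Read>(mut entry: Entry<'_, R>) -> io::Result<StrictEntry> {
    let path = entry.path()?.into_owned();
    let header = entry.header().clone();
    let mut data = Vec::with_capacity(header.size()? as usize);
    entry.read_to_end(&mut data)?; // read strictly instead of lazily
    Ok(StrictEntry { path, header, data })
}
```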
What are your thoughts?
Thanks for the great crate.