Strict Entries for out-of-order (parallel) processing #160

Open · NightRa opened this issue Sep 25, 2018 · 15 comments
@NightRa (Contributor) commented Sep 25, 2018

Hi there.
I'm working on performance improvements for rustup and have settled on extracting tars in parallel; I have a POC showing its effectiveness on Windows.

Currently, the tar package only supports sequential unpacking of files to disk (unpack being the primitive operation, as it also writes metadata).

To sum up the currently relevant motivations of the tar package, as I understand them:

  1. Interface: support any Read implementation, without assuming it is also Seek.
  2. Laziness: don't read data from the Reader unless the user asks for it; call skip instead.

Both decisions are debatable, but both are certainly defensible points in the design space.
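For concreteness, a sketch of what these two points look like from the outside - streaming a compressed tar through any Read, with no Seek bound (the flate2 usage here is an assumption for illustration):

    use std::fs::File;
    use std::path::Path;

    // The Read-only interface in action: a .tar.gz streamed through a
    // decoder. Headers are parsed as we iterate; unconsumed file data is
    // skipped rather than read eagerly.
    fn list_entries(path: &Path) -> std::io::Result<()> {
        let file = File::open(path)?;
        let mut archive = tar::Archive::new(flate2::read::GzDecoder::new(file));
        for entry in archive.entries()? {
            println!("{}", entry?.path()?.display());
        }
        Ok(())
    }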

Now, in order to implement parallel extraction, I would need to process Entry values out of order.

Solution space:

  1. Make the interface require Seek - each entry can be independent, and still lazy, but this may break existing clients.
  2. Always be strict - with some performance hit, but only for complex applications that filter the tar. Doesn't break the current API.
    • This may also simplify the API and avoid some of the mistakes people make by not reading the entries docs about out-of-order processing.
  3. Add a force function on Entry, which returns a StrictEntry: an independent structure that strictly reads and owns its data (a sketch follows below). This would enlarge the user-facing API, increasing implementation & API complexity, but doesn't break the current API.
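A minimal sketch of option 3 against the crate as it exists today; StrictEntry and force are made-up names and error handling is simplified:

    use std::io::Read;

    // Hypothetical shape for option 3: an entry forced into owned data,
    // holding no borrow of the archive. Illustrative names only.
    pub struct StrictEntry {
        pub header: tar::Header,
        pub data: Vec<u8>,
    }

    pub fn force<R: Read>(mut entry: tar::Entry<'_, R>) -> std::io::Result<StrictEntry> {
        let header = entry.header().clone();
        let mut data = Vec::with_capacity(header.size()? as usize);
        entry.read_to_end(&mut data)?;
        Ok(StrictEntry { header, data })
    }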

I like both approaches 2 and 3, each with its own trade-offs; choosing which way to go depends on the crate's direction.

What are your thoughts?

  1. What's your preferred solution to the problem?
  2. Do you give a green light to start working on the suggested solution?

Thanks for the great crate.

@alexcrichton (Owner)

Oh awesome! I always assumed the slowness in rustup was around the handling of "transactions", which copied files at least 3 times IIRC, and rust-docs has a huge number of files. That being said, extraction in parallel is certain to be a nice boost!

I think parallel extraction using Seek won't really work though, right? If you want to extract to the filesystem in parallel, you'd almost surely need the tar in an in-memory buffer so it can be read from multiple threads. Otherwise, as the crate is organized today, if you've only got R then you can only read (or seek) on one thread at a time.

In that sense I think this may basically just want methods specialized to Archive<Vec<u8>>, perhaps?
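(For reference, the crate can already be pointed at an in-memory buffer today, since &[u8] implements Read - though the sequential-iteration restriction still applies. A sketch:)

    // An Archive over an in-memory buffer: &[u8] implements Read, so an
    // already-decompressed tarball can be iterated without touching disk.
    fn count_entries(tarball: &[u8]) -> std::io::Result<usize> {
        let mut archive = tar::Archive::new(tarball);
        Ok(archive.entries()?.count())
    }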

@NightRa (Contributor, Author) commented Sep 26, 2018

Hi Alex! Thanks for the enthusiasm!

I want to be precise: the performance problem in rustup is with the file writes (WriteFile) themselves - Windows Defender scans all 15k HTML & JS files of rust-docs synchronously, which is a pure CPU workload.

The tar extraction's performance itself is negligible.

The transactions mechanism is a second-order performance problem here, which I'll address later.

So my line of thought was to extract everything into memory (have Strict Entries holding the metadata together with the data), which would then be dispatched in parallel to the file-system pipeline.

I also thought about Archive<Vec<u8>> (or, more precisely, Archive<&[u8]>) - but it seems this usage would depend too much on the implementation details of this tar library, as it's a very intricate interaction between the mutable Read interface for &mut &[u8] and my requirement for immutable, self-standing entries that contain a lazy data field of &[u8]s.

So we could add another option to the solution space:
4. Try to stabilize & formalize the behavior of the library in the case of &[u8] (or other outer-mutability/inner-immutability interactions), which would then allow out-of-order processing of entries.
Note that I haven't verified the consistency of this line of thought yet.

Also, just to be sure there isn't another little misunderstanding here - the 3 points under the solution space are independent routes to solve the problem - an or, not an and.

What do you think would be the right way to go here?

@NightRa (Contributor, Author) commented Sep 26, 2018

Hey there.
I just noticed that Archive::skip calls Read::read on the underlying reader, so there is no laziness at all.

Switching to read the data eagerly instead of using the lazy EntryIo should have almost no impact on the library's current performance, and would simplify the surface API considerably - dropping the sequential processing requirement of Archive::entries() and also dropping the phantom lifetime, turning Entry<'a, R> into Entry.

The result would be that the buffer is dropped by the user on each iteration instead of being read and dropped inside Archive::skip, which shouldn't have much impact overall.

So I think this is the way I'm going to start refactoring.
I'd be glad to get a green light, or to hear about any problems you see with this approach, @alexcrichton.

@alexcrichton (Owner)

Hm, it's not entirely clear to me what the proposal here is. Could you elaborate a bit on what you're thinking?

One thing I just realized as well is that the first thing to do with any tarball is decompress it, which is effectively single-threaded. I wonder if this could work by quickly iterating over Archive<&[u8]> (somehow fixing skip; unsure how best to do that). The quick iteration could build a list of (Header, &[u8]) pairs (effectively), and then a new function could perhaps be added to easily extract those into files (or construct an Entry or something like that).

(or maybe this is all already what you're thinking?)

@NightRa (Contributor, Author) commented Sep 27, 2018

@alexcrichton, are you available on any synchronous communication channel, such as IRC, Gitter, etc.?

What I'm currently thinking of implementing is as follows:

  1. Be able to extract the whole archive into memory, sequentially.
  2. Iterate over a Vec of all extracted entries and write them to the file system in parallel.

To achieve this, the limitation of sequential processing in Archive::entries would need to be lifted.
The origin of this limitation is the EntryFields::data field being a lazy EntryIo, which holds a reference to the reader/archive.
The solution would then be to change EntryIo::Data from holding a reader/archive to holding a Vec<u8>.
This would also imply that we can remove the lifetime parameters on Entry, EntryFields, etc.

I would like to note that this won't deteriorate performance much for the existing use case of sequential processing of entries:

  • Archive::skip already reads the data from the Reader, and drops its memory buffer immediately.
  • Only one file at a time will be resident in memory - one at each iteration of entries.
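A minimal sketch of this two-phase plan, assuming regular files only (metadata, links, and real error handling omitted) and using std::thread::scope with a fixed four-way split for phase 2:

    use std::fs;
    use std::io::Read;
    use std::path::Path;

    // Phase 1 buffers each entry sequentially; phase 2 writes the owned
    // buffers to disk from several threads.
    fn extract_parallel(tarball: &[u8], dest: &Path) -> std::io::Result<()> {
        let mut archive = tar::Archive::new(tarball);
        let mut files = Vec::new();
        for entry in archive.entries()? {
            let mut entry = entry?;
            let path = dest.join(entry.path()?);
            let mut data = Vec::with_capacity(entry.header().size()? as usize);
            entry.read_to_end(&mut data)?;
            files.push((path, data));
        }
        std::thread::scope(|s| {
            for chunk in files.chunks(files.len() / 4 + 1) {
                s.spawn(move || {
                    for (path, data) in chunk {
                        if let Some(parent) = path.parent() {
                            let _ = fs::create_dir_all(parent);
                        }
                        // The write (and the handle close when the file
                        // drops) now happens off the decoding thread.
                        let _ = fs::write(path, data);
                    }
                });
            }
        });
        Ok(())
    }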

@alexcrichton (Owner)

@NightRa sure yeah, I'm on Discord and acrichto on IRC.

I think raw_file_position could be used, though? If you have prior knowledge that the tar is in memory, you can use raw_file_position afterwards to recover where each file lives in the archive. The only "slow" part would then be Archive::skip, which could perhaps be hacked around differently?

I'm wary of putting a Vec<u8> into EntryIo and removing all the lifetimes, because not all applications need to read all the data into memory.
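A sketch of that index-then-slice idea, assuming the decompressed tarball is already in buf; each (Header, &[u8]) pair borrows only the buffer, not the archive, so it can be handed to another thread and processed in any order:

    // Build a (Header, data-slice) index in one sequential pass; each slice
    // is recovered from `buf` via raw_file_position.
    fn index(buf: &[u8]) -> std::io::Result<Vec<(tar::Header, &[u8])>> {
        let mut archive = tar::Archive::new(buf);
        let mut out = Vec::new();
        for entry in archive.entries()? {
            let entry = entry?;
            let start = entry.raw_file_position() as usize;
            let len = entry.header().size()? as usize;
            out.push((entry.header().clone(), &buf[start..start + len]));
        }
        Ok(out)
    }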

@NightRa (Contributor, Author) commented Sep 28, 2018

The current behavior:

    for entry in archive.entries()? {
        entry?.unpack(...)?;
    }

In this case, the entry's data is allocated on each iteration and then deallocated.

    for entry in archive.entries()? {
        // no unpack call
    }

In this case, the entry's data is read & allocated temporarily in Archive::skip, then deallocated there as well.

If we have a Vec<u8> in EntryIo:

  1. When the user code calls unpack, there should be no difference.
  2. When the user does not call unpack, the data is read on each iteration and deallocated at the end of the iteration's scope, and there is no reading, allocation & deallocation in Archive::skip.

The only difference is in the heap-allocation calls: at most one entry's data would be resident in memory at any point, versus a bounded intermediate buffer today.

And I don't think it would be that terrible to have at most one entry in memory at any point in time.

@alexcrichton (Owner)

Ok, I talked with @NightRa on Discord a bit and (I think?) we concluded that probably the best way to go for now is to have a way to turn an Entry<T> into an Entry<U>, replacing the underlying I/O with a different one. That way we can replace it with a slice (for example), which disconnects the entry from the original archive and allows parallel processing.
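As a toy model of that direction (illustrative names and shape, not the crate's actual API):

    use std::io::Read;

    // Toy model only: the crate's real Entry is more involved. The point is
    // the signature - consuming an Entry<R> and producing an Entry<U>
    // detaches the entry from the archive's reader.
    struct Entry<R> {
        header: tar::Header,
        io: R,
    }

    impl<R: Read> Entry<R> {
        fn with_io<U: Read>(self, io: U) -> Entry<U> {
            Entry { header: self.header, io }
        }
    }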

@oblique commented Dec 22, 2018

I would like to have this feature. @NightRa Do you plan to resubmit the patch?

@NightRa (Contributor, Author) commented Dec 22, 2018

Yes, I do have plans to resubmit the patch soon.

@the8472 commented Apr 19, 2019

Wouldn't additional impl blocks for R: Read + Seek + Clone or similar allow parallelization in most cases, and also make it Send?
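A sketch of that idea, where offset and len are assumed to come from a prior sequential indexing pass (all names here are assumptions):

    use std::io::{Read, Seek, SeekFrom};

    // With R: Read + Seek + Clone, every worker gets its own cursor into the
    // archive, so entries can be read out of order on different threads.
    fn read_at<R: Read + Seek + Clone>(reader: &R, offset: u64, len: usize) -> std::io::Result<Vec<u8>> {
        let mut handle = reader.clone();
        handle.seek(SeekFrom::Start(offset))?;
        let mut data = vec![0u8; len];
        handle.read_exact(&mut data)?;
        Ok(data)
    }

(One wrinkle: std::fs::File is not Clone, though File::try_clone exists, so the exact bound would need some thought.)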

@rbtcollins (Contributor)

Hi, I'm interested in the rustup performance story too :) - I've dropped the copy count down to 0: write once and move into place. I'm not sure whether more concurrency is needed - I'd like to see @NightRa's prototype to experiment with. In the meantime, I have analysed the current behaviour, and quite some time is being spent re-opening files to set mtime, permissions, and so on. Adding a buffered writer may also be useful - submitting 4k blocks to the IO subsystem is far too small for today's devices (Linux is much better at coalescing than Windows).
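A minimal sketch of the buffered-writer point, using std's BufWriter with a larger-than-default buffer (128 KiB is an arbitrary pick):

    use std::fs::File;
    use std::io::{BufWriter, Write};
    use std::path::Path;

    // Coalesce many small writes into larger submissions to the OS instead
    // of handing the IO subsystem 4k blocks one at a time.
    fn open_buffered(path: &Path) -> std::io::Result<impl Write> {
        let file = File::create(path)?;
        Ok(BufWriter::with_capacity(128 * 1024, file))
    }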

@rbtcollins (Contributor)

Also, a lot of time is being spent in CloseHandle - I haven't cracked out a full system profiler yet (that's pretty much next), but this is well documented as a problem for write-heavy tools like VCSs. Even with syscalls optimised in rustup we see:


2:16:18.5880792 PM	0.0001307	rustup.exe	38704	CreateFile	C:\Users\robertc\.rustup\tmp\73avhrved4ed3vw1_dir\rust-docs\share\doc\rust\html\std\sync\condvar\struct.Condvar.html	SUCCESS	Desired Access: Generic Write, Read Attributes, Disposition: Create, Options: Synchronous IO Non-Alert, Non-Directory File, Open Reparse Point, Attributes: n/a, ShareMode: Read, Write, Delete, AllocationSize: 0, OpenResult: Created
2:16:18.5882340 PM	0.0000520	rustup.exe	38704	WriteFile	C:\Users\robertc\.rustup\tmp\73avhrved4ed3vw1_dir\rust-docs\share\doc\rust\html\std\sync\condvar\struct.Condvar.html	SUCCESS	Offset: 0, Length: 389, Priority: Normal
2:16:18.5883040 PM	0.0006485	rustup.exe	38704	CloseFile	C:\Users\robertc\.rustup\tmp\73avhrved4ed3vw1_dir\rust-docs\share\doc\rust\html\std\sync\condvar\struct.Condvar.html	SUCCESS	

where CloseFile (aka CloseHandle) is taking up most of the time. I suspect that the benefit you're actually seeing is avoiding blocking on CloseHandle, and if so, I have a much simpler patch up that permits parallelising that work externally, without requiring any parallelisation of the iteration or extraction process.
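For illustration of what "parallelising that work externally" could mean (a sketch of the general idea, not the actual patch): hand finished handles to a background thread so the blocking close happens off the extraction path:

    use std::fs::File;
    use std::sync::mpsc;
    use std::thread;

    // Dropping a File is what triggers CloseHandle on Windows; doing the
    // drop on a background thread keeps the slow close off the hot path.
    fn spawn_closer() -> (mpsc::Sender<File>, thread::JoinHandle<()>) {
        let (tx, rx) = mpsc::channel::<File>();
        let worker = thread::spawn(move || {
            for _file in rx {} // each received File is dropped (closed) here
        });
        (tx, worker)
    }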

@NightRa (Contributor, Author) commented May 12, 2019 via email

@rbtcollins (Contributor)

@NightRa well - sometimes it is; it's just not for my traces, which is why I haven't assigned specific blame.

I'd be delighted to do async closes instead, but I'm not aware of any API for doing that! Please link me up :)
