Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improve time to first csv load #975

Closed
wants to merge 11 commits into from
Closed

Conversation

rafaqz
Copy link

@rafaqz rafaqz commented Jan 31, 2022

This PR aims to at least partially solve #974

This first commit demonstrates that simply breaking up large unstable functions can greatly improve compilation time. Splitting File here into threaded and single threaded contexts means only the one we are using gets compiled. For me this shaves the time from #974 from 16seconds to 12 seconds. Without changing anything else.

There are probably a lot more potential for this. The functions in file.jl all seem to be too long.

@bkamins
Copy link
Member

bkamins commented Jan 31, 2022

bravo!

@rafaqz
Copy link
Author

rafaqz commented Feb 1, 2022

Fixing type stability and adding precompilination runs brings the CSV.File load time down to 7 seconds, and using CSV is 1 second.

There is another second available by removing the @refargs macro so Context precompiles, but I'm not sure what the purpose of this macro is?

@quinnj
Copy link
Member

quinnj commented Feb 1, 2022

Sorry for not being able to review earlier; I think this is exciting stuff! One comment I'll make is that a lot of the current code organization was done in an effort to avoid recompilation costs as much as possible. It's become clear to me now that doing that has brought on a pretty severe cost for TTFR, but I'd also like to keep that in mind as we make efforts to reduce TTFR. I.e. some of the changes here mean that if you slightly change a keyword argument, we have to pay significant compilation costs again, which hopefully we can avoid. In CSV.jl early days, we were at the opposite end of the spectrum where every single call to CSV.File almost guaranteed recompiling everything; but I think I agree that we've probably swung back too far the other way by compiling too much when you initially call CSV.File and we can allow more incremental compilation as various features are actually used/needed while parsing.

As a reviewer comment, it would be nice to probably split this up into separate PRs? As-is, it's very hard to review since there's so much moving/changing around.

@rafaqz
Copy link
Author

rafaqz commented Feb 1, 2022

I figured something like that might have been the case. If we can get precompilation working nearly everywhere and functions are broken up enough that they are decoupled, we should hopefully only see minor recompilation for small changes. Its really the file.jl internals that take a long time and they are precompiling for all common types now it seems. Context takes under a second the first time, so subsequent changes shouldn't be too bad either.

(Notice I've added type annotations to a lot of functions... that's just to remind me which functions will and wont need to recompile from changes further out, and to push more of them towards not recompiling)

I know this PR is currently pretty messy and buggy... its pretty much a blind search for type stability with Cthulhu.jl, not really taking continuity or clarity into account.

Unfortunately its hard to split up the PR while working on it because precompilation only works past some certain threshold of stability and decreased function size (that I don't totally understand to be honest), and before that most things have no noticeable effect. Once I can lock down everything required to make this work I can go back and split it all up into separate changes.

To summarise the mess of code changes, the main kinds of changes are:

  • smaller more specific functions
  • splitting unrelated functionality
  • splitting unstable sections from stable sections of code to isolate things
  • lots of @noinline to force functions to precompile separately (it seems to help with that anyway)
  • more concrete field types. pool being a Float64 or a Tuple causes a lot of instability, so 2 fields with a default of typemax(Int) for poollimit (which is already the default for limit) means it's always type stable (I guess it could also be one field thats always a tuple)
  • typed Context argments, removing the @refargs macro
  • removing the __init__ method and using precompilation instead, once it starts working

Some of these things are probably incantations that do nothing, its hard to tell exactly what causes precompilation to work and how much type instability and large functions is tolerable.

@rafaqz
Copy link
Author

rafaqz commented Feb 3, 2022

Precompile in Parsers.jl helps a lot here: JuliaData/Parsers.jl#108

TTFX with both of these changes is now about 5 seconds. (but there are quite a few bugs in this PR, and its a mess)

Surprisingly, once type stability was good enough, precompiling things here stopped improving things, and actually make ttfx worse!

@oscardssmith
Copy link
Contributor

Now that the Parsers PR is merged, what is the status of this?

@rafaqz
Copy link
Author

rafaqz commented Feb 14, 2022

There another 30-40% TTFX improvement in some of this code, but some changes are unnecessary now - this branch had larger gains without the Parsers.jl fix.

The changes here that still make sense will be more controversial for more modest improvements, so won't be high priority for me.

@rafaqz
Copy link
Author

rafaqz commented Feb 14, 2022

Probably restructuring Context and the preloaded of that would be the most obvious next step.

@jakobnissen jakobnissen mentioned this pull request Aug 30, 2022
@ViralBShah
Copy link
Contributor

I suppose this is outdated and can be closed.

@oscardssmith
Copy link
Contributor

Agreed, but a lot of these changes seem good (at least from a usability perspective)

@rafaqz
Copy link
Author

rafaqz commented Feb 28, 2024

I didnt mention here but after the later fix to Parsers.jl (removing just 4 @inlines 😂) this PR stopped doing much:

https://github.com/JuliaData/Parsers.jl/pull/1140

All the @noinlines and function barriers were effective because of the constants that Parsers.jl inlined. There are probably a few type stability things here worth keeping, but maybe starting a new PR and benchmarking without the Parsers.jl problems would make more sense.

@rafaqz rafaqz closed this Feb 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants