
subset with grouped data frame has worse compile times than transform #2806

Closed

pdeffebach opened this issue Jun 24, 2021 · 3 comments

pdeffebach (Contributor) commented:
julia> using DataFrames, Statistics

julia> d = DataFrame(n = 1:20, x = [3, 3, 3, 3, 1, 1, 1, 2, 1, 1,
                                    2, 1, 1, 2, 2, 2, 3, 1, 1, 2]);

julia> g = groupby(d, :x);

julia> @time transform(g, :n => (n -> n .> mean(n)) => :b);
  0.239001 seconds (795.97 k allocations: 40.779 MiB, 5.03% gc time, 99.38% compilation time)

julia> @time transform(g, :n => (n -> n .> mean(n)) => :b);
  0.235570 seconds (795.96 k allocations: 40.526 MiB, 5.09% gc time, 99.41% compilation time)

julia> @time subset(g, :n => (n -> n .> mean(n)));
  0.288238 seconds (833.20 k allocations: 42.992 MiB, 4.23% gc time, 99.40% compilation time)

julia> @time subset(g, :n => (n -> n .> mean(n)));
  0.295599 seconds (833.19 k allocations: 42.990 MiB, 4.21% gc time, 99.40% compilation time)

julia> @time subset(g, :n => (n -> n .> mean(n)));
  0.286430 seconds (833.20 k allocations: 42.993 MiB, 4.30% gc time, 99.41% compilation time)
@bkamins bkamins added this to the patch milestone Jun 24, 2021
bkamins (Member) commented Jun 24, 2021:

I will try to have a look at what can be done. It looks like the reason is that:

@time select(g, :n => (n -> n .> mean(n)));

is slower than transform (which is strange).

bkamins (Member) commented Jul 24, 2021:

I have tried to investigate this further. Unfortunately, the tooling we currently have does not allow us to diagnose the reason. The short answer is that subset does more work than transform, but the additional work is hard to split into:

  • inference
  • LLVM codegen
  • native codegen
  • actual execution time

I have checked that the inference part is roughly equal for both of them. Since the actual execution time is clearly also similar, the difference must come from either LLVM codegen or native codegen, but it is hard to say which, or why.
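As a rough sketch of how the inference comparison above can be performed (assuming SnoopCompile.jl is installed; `@snoopi_deep` and `inclusive` were its documented entry points at the time — run each measurement in a fresh session so nothing is already compiled):

```julia
# Sketch: measure time spent in type inference for the first call,
# using SnoopCompile's deep inference snooping.
using DataFrames, Statistics, SnoopCompile

d = DataFrame(n = 1:20, x = rand(1:3, 20))
g = groupby(d, :x)

# Capture the inference profile for the first (compiling) call.
tinf = @snoopi_deep subset(g, :n => (n -> n .> mean(n)))

# Total inclusive inference time for this call tree, in seconds.
println(inclusive(tinf))
```

Repeating the same measurement with `transform` in another fresh session lets the two inference totals be compared directly; the remaining gap in `@time` is then attributable to codegen or execution.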

bkamins (Member) commented Jul 24, 2021:

I am closing this issue as I am unable to move forward with it (and probably there is nothing that can be done). However, in timholy/SnoopCompile.jl#251 I hope to learn how to diagnose such things.

@bkamins bkamins closed this as completed Jul 24, 2021