
subset with grouped data frame has worse compile times than transform #2806

Closed

pdeffebach opened this issue Jun 24, 2021 · 3 comments

pdeffebach (Contributor) commented:
julia> using DataFrames, Statistics

julia> d = DataFrame(n = 1:20, x = [3, 3, 3, 3, 1, 1, 1, 2, 1, 1,
                                    2, 1, 1, 2, 2, 2, 3, 1, 1, 2]);

julia> g = groupby(d, :x);

julia> @time transform(g, :n => (n -> n .> mean(n)) => :b);
  0.239001 seconds (795.97 k allocations: 40.779 MiB, 5.03% gc time, 99.38% compilation time)

julia> @time transform(g, :n => (n -> n .> mean(n)) => :b);
  0.235570 seconds (795.96 k allocations: 40.526 MiB, 5.09% gc time, 99.41% compilation time)

julia> @time subset(g, :n => (n -> n .> mean(n)));
  0.288238 seconds (833.20 k allocations: 42.992 MiB, 4.23% gc time, 99.40% compilation time)

julia> @time subset(g, :n => (n -> n .> mean(n)));
  0.295599 seconds (833.19 k allocations: 42.990 MiB, 4.21% gc time, 99.40% compilation time)

julia> @time subset(g, :n => (n -> n .> mean(n)));
  0.286430 seconds (833.20 k allocations: 42.993 MiB, 4.30% gc time, 99.41% compilation time)
@bkamins bkamins added this to the patch milestone Jun 24, 2021
bkamins (Member) commented Jun 24, 2021:

I will try to have a look at what can be done. It looks like the reason is that:

@time select(g, :n => (n -> n .> mean(n)));

is slower than transform (which is strange).

bkamins (Member) commented Jul 24, 2021:

I have tried to investigate this further. Unfortunately, the tooling we currently have does not allow us to diagnose the reason. The short answer is that subset does more work than transform, but the additional work is hard to split into:

  • inference
  • LLVM codegen
  • native codegen
  • actual execution time

I have checked that the inference part is roughly equal for both of them. Since the actual execution time is clearly also similar, the difference must come from either LLVM codegen or native codegen, but it is hard to say which, or why.
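As a rough sketch of how the inference comparison above can be performed (assuming SnoopCompile.jl is installed; `@snoopi_deep` and `inclusive` were its documented entry points at the time — run each measurement in a fresh session so nothing is already compiled):

```julia
# Sketch: measure time spent in type inference for the first call,
# using SnoopCompile's deep inference snooping.
using DataFrames, Statistics, SnoopCompile

d = DataFrame(n = 1:20, x = rand(1:3, 20))
g = groupby(d, :x)

# Capture the inference profile for the first (compiling) call.
tinf = @snoopi_deep subset(g, :n => (n -> n .> mean(n)))

# Total inclusive inference time for this call tree, in seconds.
println(inclusive(tinf))
```

Repeating the same measurement with `transform` in another fresh session lets the two inference totals be compared directly; the remaining gap in `@time` is then attributable to codegen or execution.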

bkamins (Member) commented Jul 24, 2021:

I am closing this issue as I am unable to move forward with it (and probably there is nothing that can be done). However, in timholy/SnoopCompile.jl#251 I hope to learn how to diagnose such things.

@bkamins bkamins closed this as completed Jul 24, 2021