Performance improvements in chained subparts syntax #137

lukem12345 · 2024-05-24T03:11:32Z

As with recent work on e.g. issue #135 , we were looking at what there is any performance being left on the table with the ACSets set/ get API.

Here, I looked at the what's known in the docs as the "chaining" or "composition syntax". I trimmed down two versions of an operation used in CombinatorialSpaces.jl. One loops over some data using the composed syntax (loop_contracted!), and the other decomposed (loop_expanded!). All allocations are optimized away by the compiler when using the decomposed syntax, but the same is not done for the composed case. I think that recent PRs will give us a good starting off point.

julia> function loop_expanded!(buffer::Vector{Point3D}, sd::HasDeltaSet2D, ::Type{point_type}) where point_type
         @inbounds for t in parts(sd, :DualTri)
           buffer[t] = sd[sd[sd[t, :D_∂e1], :D_∂v1], :dual_point]
         end
       end;

julia> function loop_contracted!(buffer::Vector{Point3D}, sd::HasDeltaSet2D, ::Type{point_type}) where point_type
         @inbounds for t in parts(sd, :DualTri)
           buffer[t] = sd[t, [:D_∂e1, :D_∂v1, :dual_point]]
         end
       end;

julia> @btime sd[sd[sd[1, :D_∂e1], :D_∂v1], :dual_point];
  1.146 μs (9 allocations: 1.78 KiB)

julia> @btime sd[1, [:D_∂e1, :D_∂v1, :dual_point]];
  1.240 μs (12 allocations: 3.08 KiB)

julia> buffer = Vector{Point3D}(undef, nparts(sd, :DualTri));

julia> @btime loop_expanded!(buffer, sd, Point3{Float64});
  4.073 ms (0 allocations: 0 bytes)

julia> buffer = Vector{Point3D}(undef, nparts(sd, :DualTri));

julia> @btime loop_contracted!(buffer, sd, Point3{Float64});
  892.627 ms (7669178 allocations: 1.44 GiB)

The text was updated successfully, but these errors were encountered:

epatters · 2024-05-24T03:14:40Z

Hopefully this issue has already been resolved by #109 thanks to @slwu89, which provides a new tuple-based method for chained subpart access that generates fast code.

lukem12345 · 2024-05-24T03:17:20Z

Oh excellent!

lukem12345 · 2024-05-24T12:03:20Z

I was a little hasty closing this issue. It looks like the () syntax does indeed improve allocations but there is some allocation overhead left on the table.

julia> function loop_expanded!(buffer::Vector{Point3D}, sd::HasDeltaSet2D, ::Type                                                                                  {point_type}) where point_type
         @inbounds for t in parts(sd, :DualTri)
           buffer[t] = sd[sd[sd[t, :D_∂e1], :D_∂v1], :dual_point]
         end
       end;

julia> function loop_contracted!(buffer::Vector{Point3D}, sd::HasDeltaSet2D, ::Ty                                                                                  pe{point_type}) where point_type
         @inbounds for t in parts(sd, :DualTri)
           buffer[t] = sd[t, (:D_∂e1, :D_∂v1, :dual_point)]
         end
       end;

julia> @btime sd[sd[sd[1, :D_∂e1], :D_∂v1], :dual_point];
  1.181 μs (9 allocations: 1.78 KiB)

julia> @btime sd[1, (:D_∂e1, :D_∂v1, :dual_point)];
  302.128 ns (3 allocations: 688 bytes)

julia> buffer = Vector{Point3D}(undef, nparts(sd, :DualTri));

julia> @btime loop_expanded!(buffer, sd, Point3{Float64});
  4.191 ms (0 allocations: 0 bytes)

julia> buffer = Vector{Point3D}(undef, nparts(sd, :DualTri));

julia> @btime loop_contracted!(buffer, sd, Point3{Float64});
  207.264 ms (2398978 allocations: 329.57 MiB)

lukem12345 · 2024-05-24T12:54:41Z

I've made a more minimal example. I'm going to take a look at this but haven't interacted with CompTime much, so I'm not sure if the issue will eventually be upstreamed there.

julia> using Catlab.Graphs, ACSets, BenchmarkTools

julia> function loop_expanded!(buffer::Vector{Int}, grph::Graph)
         @inbounds for i in parts(grph, :E)
           buffer[i] = grph[i, :src]
         end
       end;

julia> function loop_contracted!(buffer::Vector{Int}, grph::Graph)
         @inbounds for i in parts(grph, :E)
           buffer[i] = grph[i, (:src,)]
         end
       end;

julia> g = path_graph(Graph, 80_000);

julia> buffer = Vector{Int}(undef, nparts(g, :E));

julia> @btime loop_expanded!(buffer, g);
  68.679 μs (0 allocations: 0 bytes)

julia> @btime loop_contracted!(buffer, g);
  21.828 ms (398462 allocations: 12.18 MiB)

epatters · 2024-05-24T17:24:16Z

Thanks for looking into it.

epatters added the performance label May 24, 2024

lukem12345 closed this as completed May 24, 2024

lukem12345 reopened this May 24, 2024

lukem12345 linked a pull request May 28, 2024 that will close this issue

Generating optimized chained subparts #138

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance improvements in chained subparts syntax #137

Performance improvements in chained subparts syntax #137

lukem12345 commented May 24, 2024

epatters commented May 24, 2024

lukem12345 commented May 24, 2024

lukem12345 commented May 24, 2024

lukem12345 commented May 24, 2024 •

edited

Loading

epatters commented May 24, 2024

Performance improvements in chained subparts syntax #137

Performance improvements in chained subparts syntax #137

Comments

lukem12345 commented May 24, 2024

epatters commented May 24, 2024

lukem12345 commented May 24, 2024

lukem12345 commented May 24, 2024

lukem12345 commented May 24, 2024 • edited Loading

epatters commented May 24, 2024

lukem12345 commented May 24, 2024 •

edited

Loading