Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance improvements in chained subparts syntax #137

Open
lukem12345 opened this issue May 24, 2024 · 5 comments · May be fixed by #138
Open

Performance improvements in chained subparts syntax #137

lukem12345 opened this issue May 24, 2024 · 5 comments · May be fixed by #138

Comments

@lukem12345
Copy link
Member

As with recent work on e.g. issue #135 , we were looking at what there is any performance being left on the table with the ACSets set/ get API.

Here, I looked at the what's known in the docs as the "chaining" or "composition syntax". I trimmed down two versions of an operation used in CombinatorialSpaces.jl. One loops over some data using the composed syntax (loop_contracted!), and the other decomposed (loop_expanded!). All allocations are optimized away by the compiler when using the decomposed syntax, but the same is not done for the composed case. I think that recent PRs will give us a good starting off point.

julia> function loop_expanded!(buffer::Vector{Point3D}, sd::HasDeltaSet2D, ::Type{point_type}) where point_type
         @inbounds for t in parts(sd, :DualTri)
           buffer[t] = sd[sd[sd[t, :D_∂e1], :D_∂v1], :dual_point]
         end
       end;

julia> function loop_contracted!(buffer::Vector{Point3D}, sd::HasDeltaSet2D, ::Type{point_type}) where point_type
         @inbounds for t in parts(sd, :DualTri)
           buffer[t] = sd[t, [:D_∂e1, :D_∂v1, :dual_point]]
         end
       end;

julia> @btime sd[sd[sd[1, :D_∂e1], :D_∂v1], :dual_point];
  1.146 μs (9 allocations: 1.78 KiB)

julia> @btime sd[1, [:D_∂e1, :D_∂v1, :dual_point]];
  1.240 μs (12 allocations: 3.08 KiB)

julia> buffer = Vector{Point3D}(undef, nparts(sd, :DualTri));

julia> @btime loop_expanded!(buffer, sd, Point3{Float64});
  4.073 ms (0 allocations: 0 bytes)

julia> buffer = Vector{Point3D}(undef, nparts(sd, :DualTri));

julia> @btime loop_contracted!(buffer, sd, Point3{Float64});
  892.627 ms (7669178 allocations: 1.44 GiB)
@epatters
Copy link
Member

Hopefully this issue has already been resolved by #109 thanks to @slwu89, which provides a new tuple-based method for chained subpart access that generates fast code.

@lukem12345
Copy link
Member Author

Oh excellent!

@lukem12345
Copy link
Member Author

I was a little hasty closing this issue. It looks like the () syntax does indeed improve allocations but there is some allocation overhead left on the table.

julia> function loop_expanded!(buffer::Vector{Point3D}, sd::HasDeltaSet2D, ::Type                                                                                  {point_type}) where point_type
         @inbounds for t in parts(sd, :DualTri)
           buffer[t] = sd[sd[sd[t, :D_∂e1], :D_∂v1], :dual_point]
         end
       end;

julia> function loop_contracted!(buffer::Vector{Point3D}, sd::HasDeltaSet2D, ::Ty                                                                                  pe{point_type}) where point_type
         @inbounds for t in parts(sd, :DualTri)
           buffer[t] = sd[t, (:D_∂e1, :D_∂v1, :dual_point)]
         end
       end;

julia> @btime sd[sd[sd[1, :D_∂e1], :D_∂v1], :dual_point];
  1.181 μs (9 allocations: 1.78 KiB)

julia> @btime sd[1, (:D_∂e1, :D_∂v1, :dual_point)];
  302.128 ns (3 allocations: 688 bytes)

julia> buffer = Vector{Point3D}(undef, nparts(sd, :DualTri));

julia> @btime loop_expanded!(buffer, sd, Point3{Float64});
  4.191 ms (0 allocations: 0 bytes)

julia> buffer = Vector{Point3D}(undef, nparts(sd, :DualTri));

julia> @btime loop_contracted!(buffer, sd, Point3{Float64});
  207.264 ms (2398978 allocations: 329.57 MiB)

@lukem12345 lukem12345 reopened this May 24, 2024
@lukem12345
Copy link
Member Author

lukem12345 commented May 24, 2024

I've made a more minimal example. I'm going to take a look at this but haven't interacted with CompTime much, so I'm not sure if the issue will eventually be upstreamed there.

julia> using Catlab.Graphs, ACSets, BenchmarkTools

julia> function loop_expanded!(buffer::Vector{Int}, grph::Graph)
         @inbounds for i in parts(grph, :E)
           buffer[i] = grph[i, :src]
         end
       end;

julia> function loop_contracted!(buffer::Vector{Int}, grph::Graph)
         @inbounds for i in parts(grph, :E)
           buffer[i] = grph[i, (:src,)]
         end
       end;

julia> g = path_graph(Graph, 80_000);

julia> buffer = Vector{Int}(undef, nparts(g, :E));

julia> @btime loop_expanded!(buffer, g);
  68.679 μs (0 allocations: 0 bytes)

julia> @btime loop_contracted!(buffer, g);
  21.828 ms (398462 allocations: 12.18 MiB)

@epatters
Copy link
Member

Thanks for looking into it.

@lukem12345 lukem12345 linked a pull request May 28, 2024 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants