
Reuse memory for ELBO calculation #75

Open · wants to merge 2 commits into main

Conversation

cscherrer

I noticed this line:

u = Random.randn!(rng, similar(μ, N, ndraws))

This allocates a new u on each iteration, which adds some overhead. This PR removes the per-iteration allocation by allocating u once in maximize_elbo and reusing it.
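The change is the standard preallocate-and-reuse idiom. A minimal standalone sketch of the two patterns (illustrative function names, not Pathfinder's actual internals):

```julia
using Random

# Pattern in the current code: a fresh matrix is allocated on every iteration.
function draws_alloc(rng, μ, N, ndraws, iters)
    total = 0.0
    for _ in 1:iters
        u = Random.randn!(rng, similar(μ, N, ndraws))  # new allocation each time
        total += sum(u)
    end
    return total
end

# Pattern in this PR: allocate once up front, then overwrite in place.
function draws_reuse(rng, μ, N, ndraws, iters)
    u = similar(μ, N, ndraws)  # single allocation
    total = 0.0
    for _ in 1:iters
        Random.randn!(rng, u)  # refill the same buffer
        total += sum(u)
    end
    return total
end
```

Both consume the RNG identically, so for the same seed they produce the same results; the second version simply allocates iters − 1 fewer matrices.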

Some caveats:

  1. The change didn't have nearly the impact I expected.
  2. The tests are not yet updated for this.

That said, here's the test case:

using Pathfinder

rosenbrock(x) = @inbounds -(1-x[1])^2  - 100 * (x[2] - x[1]^2)^2

using BenchmarkTools
result = pathfinder(rosenbrock; init=zeros(2))

The result is

julia> result = pathfinder(rosenbrock; init=zeros(2))
Single-path Pathfinder result
  tries: 1
  draws: 5
  fit iteration: 7 (total: 22)
  fit ELBO: -2.77 ± 0.78
  fit distribution: Distributions.MvNormal{Float64, Pathfinder.WoodburyPDMat{Float64, LinearAlgebra.Diagonal{Float64, Vector{Float64}}, Matrix{Float64}, Matrix{Float64}, LinearAlgebra.Diagonal{Float64, Vector{Float64}}, LinearAlgebra.QRCompactWYQ{Float64, Matrix{Float64}, Matrix{Float64}}, LinearAlgebra.UpperTriangular{Float64, Matrix{Float64}}}, Vector{Float64}}(
dim: 2
μ: [0.658221, 0.418772]
Σ: [0.122998 0.13468; 0.13468 0.151634]
)

Note that there are 22 ELBO evaluations.

Now for benchmarking.

BEFORE

julia> @benchmark result = pathfinder(rosenbrock; init=zeros(2))
BenchmarkTools.Trial: 4835 samples with 1 evaluation.
 Range (min … max):  902.530 μs …   8.335 ms  ┊ GC (min … max): 0.00% … 79.29%
 Time  (median):     966.893 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):     1.030 ms ± 514.814 μs  ┊ GC (mean ± σ):  3.46% ±  6.15%

 Memory estimate: 293.51 KiB, allocs estimate: 3359.

AFTER

julia> @benchmark result = pathfinder(rosenbrock; init=zeros(2))
BenchmarkTools.Trial: 4957 samples with 1 evaluation.
 Range (min … max):  898.102 μs …  12.059 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     956.924 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):     1.005 ms ± 503.189 μs  ┊ GC (mean ± σ):  3.37% ± 6.15%

 Memory estimate: 290.55 KiB, allocs estimate: 3338.

Now 3359 - 3338 = 21, because we've allocated once instead of 22 times.

It's very minor in this case, but maybe much larger problems could require many more ELBO evaluations. Then again, in those cases the non-allocation overhead will also be much higher.

@codecov

codecov bot commented Jun 9, 2022

Codecov Report

Merging #75 (e00e7ce) into main (3516702) will decrease coverage by 14.46%.
The diff coverage is 100.00%.

@@             Coverage Diff             @@
##             main      #75       +/-   ##
===========================================
- Coverage   92.89%   78.43%   -14.47%     
===========================================
  Files          13       13               
  Lines         521      524        +3     
===========================================
- Hits          484      411       -73     
- Misses         37      113       +76     
Impacted Files Coverage Δ
src/elbo.jl 80.00% <100.00%> (+2.72%) ⬆️
src/mvnormal.jl 100.00% <100.00%> (ø)
src/optimize.jl 54.23% <0.00%> (-40.68%) ⬇️
src/woodbury.jl 65.34% <0.00%> (-34.66%) ⬇️
src/transducers.jl 81.25% <0.00%> (-18.75%) ⬇️
src/resample.jl 83.33% <0.00%> (-16.67%) ⬇️
src/multipath.jl 53.70% <0.00%> (-12.97%) ⬇️
src/singlepath.jl 81.57% <0.00%> (-7.90%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@sethaxen
Member

sethaxen commented Jun 9, 2022

Thanks @cscherrer for the analysis and PR! I had considered reusing u before but ultimately decided not to in the interest of allowing the user to inspect all intermediates (they can currently access these draws if they desire). I'm not certain how useful that is in the end though.

The main reason I considered reusing u was memory. For very large models, say those with 10^6+ parameters, if Pathfinder runs 10^3 iterations, then we've already stored O(10^9) parameters. Drawing 5 draws from each distribution adds roughly 5 × 10^9 more. So at worst we're looking at a severalfold increase in the memory requirements by not reusing u. And it's not clear to me how big of a problem that is.
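As a quick sanity check of that arithmetic (using exactly the numbers above):

```julia
nparams = 10^6  # model dimension
niters  = 10^3  # length of the optimizer trace
ndraws  = 5     # draws per fit distribution

stored = nparams * niters           # parameters already kept across iterations
drawn  = ndraws * nparams * niters  # extra values kept if u is never reused
ratio  = drawn ÷ stored             # the overhead factor is just ndraws
```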

    EE = Core.Compiler.return_type(
-        elbo_and_samples, Tuple{typeof(rng),typeof(logp),eltype(dists),Int}
+        elbo_and_samples!, Tuple{typeof(rng),typeof(u),typeof(logp),eltype(dists),Int}
    )
    estimates = similar(dists, EE)
    isempty(estimates) && return 0, estimates
    Folds.map!(estimates, dists, executor) do dist
Member

Since u is being overwritten here, I think we cannot use Transducers, or different threads could be writing to it at the same time. This will probably just need to be made a simple map

end
_, iteration_opt = _findmax(estimates |> Transducers.Map(est -> est.value))
return iteration_opt, estimates
end
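One way to address the data-race concern raised above while keeping buffer reuse is one buffer (and one RNG) per thread. A standalone sketch, not Pathfinder's actual code; note that indexing by threadid() is only safe when tasks don't migrate between threads, hence the :static schedule:

```julia
using Random

function elbo_values_threaded(means, N, ndraws)
    nt   = Threads.nthreads()
    us   = [zeros(N, ndraws) for _ in 1:nt]          # one buffer per thread
    rngs = [Random.MersenneTwister(i) for i in 1:nt] # one RNG per thread
    out  = zeros(length(means))
    Threads.@threads :static for i in eachindex(means)
        tid = Threads.threadid()
        u = Random.randn!(rngs[tid], us[tid])   # overwrite this thread's buffer only
        out[i] = means[i] + sum(u) / length(u)  # stand-in for an ELBO estimate
    end
    return out
end
```

A simple map, as suggested, avoids the problem entirely at the cost of serial execution; per-thread buffers keep both the parallelism and the reuse.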

-function elbo_and_samples(rng, logp, dist, ndraws)
+function elbo_and_samples!(rng, u, logp, dist, ndraws)
     ϕ, logqϕ = rand_and_logpdf(rng, dist, ndraws)
Member
draws will need to be removed from ELBOEstimate as well.

@cscherrer
Author

Thanks @sethaxen , memory overhead is a great point. One possibility for getting the best of both worlds is a callback that can optionally copy the draws at each step. You could even copy conditionally, like thinning in MCMC, or based on some predicate.

There's a risk here of over-engineering, solving problems that don't exist, or at least don't exist yet. But then it's also useful to get to an API that can scale.
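The callback idea might look like this; everything here is hypothetical (the function name and keyword are invented for illustration, not part of Pathfinder):

```julia
using Random

# Shared buffer u is overwritten every iteration; `callback` decides, per
# iteration, whether to snapshot it (e.g. thinning: keep every 10th draw).
function maximize_with_snapshots!(u, niters; callback = i -> i % 10 == 0)
    snapshots = Dict{Int,Matrix{Float64}}()
    for i in 1:niters
        Random.randn!(u)                         # reuse the single buffer
        callback(i) && (snapshots[i] = copy(u))  # copy only on request
    end
    return snapshots
end
```

With the default predicate, 20 iterations keep only 2 copies; a user who wants every intermediate can pass `callback = _ -> true` and recover the current behavior.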

@sethaxen
Member

Keeping in mind #15, the goal is ultimately to allow the user to provide an object that configures an optimization procedure over the trace of distributions, with the default being ELBOMaximization or something like that. Such an object could have a type parameter that configures whether the samples drawn are kept or overwritten. So I'll take a stab at designing that interface. Then it should be pretty easy to finish up this PR based on that.
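A minimal sketch of how such a type parameter could work, assuming the ELBOMaximization name floated above (the field-free struct and the record_draws helper are invented for illustration):

```julia
# A compile-time flag on the optimizer config chooses between snapshotting
# the draws and handing back the shared, reused buffer.
struct ELBOMaximization{keep_draws} end

record_draws(::ELBOMaximization{true},  u) = copy(u)  # independent copy per iteration
record_draws(::ELBOMaximization{false}, u) = u        # shared buffer, no allocation
```

Because the flag is a type parameter, the branch is resolved at compile time, so the reuse path pays no runtime cost for the configurability.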
