Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Forward-over-reverse HVP ignores rows for matrix input, only with Julia 1.11 #2071

Open
gdalle opened this issue Nov 7, 2024 · 37 comments
Open

Comments

@gdalle
Copy link
Contributor

gdalle commented Nov 7, 2024

I diagnosed this problem with DifferentiationInterface, but it also appears using Enzyme's own hvp function.
The following code is run on Julia 1.11 with Enzyme v0.13.14. On Julia 1.10 it returns the correct result.

julia> using Enzyme: Enzyme

julia> f(x) = sum(vec(x .^ 3) .* transpose(vec(x .^ 4)))
f (generic function with 1 method)

julia> xv, dxv = float.(1:12), ones(12);  # vector input

julia> pv = Enzyme.hvp(f, xv, dxv)  # correct result
12-element Vector{Float64}:
 518076.0
      1.374984e6
      2.617524e6
      4.292496e6
      6.4467e6
      9.126936e6
      1.2380004e7
      1.6252704e7
      2.0791836e7
      2.60442e7
      3.2056596e7
      3.8875824e7

julia> xm, dxm = float.(reshape(1:12, 4, 3)), ones(4, 3);  # matrix input

julia> pm = Enzyme.hvp(f, xm, dxm)  # only the first row is correct
freeing without malloc   %_cache320.i.0 = phi double* [ undef, %L126.i ], [ %190, %L348.loopexit.i ]
freeing without malloc   %_cache193.i.0 = phi double* [ undef, %L493.i ], [ %423, %L714.i.loopexit ]
4×3 Matrix{Float64}:
 518076.0  6.4467e6  2.07918e7
      0.0  0.0       0.0
      0.0  0.0       0.0
      0.0  0.0       0.0

julia> reshape(pv, 4, 3) - pm  # should be zero everywhere
4×3 Matrix{Float64}:
 0.0        0.0        0.0
 1.37498e6  9.12694e6  2.60442e7
 2.61752e6  1.238e7    3.20566e7
 4.2925e6   1.62527e7  3.88758e7
@wsmoses
Copy link
Member

wsmoses commented Nov 8, 2024

is there a simpler code which triggers this (both removing the hvp as well as reducing the complexity of f)

@gdalle
Copy link
Contributor Author

gdalle commented Nov 8, 2024

The incorrect result only appears in the HVP and Hessian parts of the DI test suite, not in the first order, so I think the Enzyme.hvp is necessary here.
As for the function f, I just tried a few simplifications which didn't work, in the sense that Enzyme.hvp went back to giving the correct result. Maybe it has something to do with linear algebra, because this function (which is semantically equivalent) works fine:

function g(x)
    s = zero(eltype(x))
    for i in eachindex(x), j in eachindex(x)
        s += x[i]^3 * x[j]^4
    end
    return s
end

Interestingly though, my original function throws an error on this other laptop of mine (an old macbook pro), whereas it ran without issue on my linux computer:

julia> versioninfo()
Julia Version 1.11.1
Commit 8f5b7ca12ad (2024-10-16 10:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (x86_64-apple-darwin22.4.0)
  CPU: 4 × Intel(R) Core(TM) i7-7567U CPU @ 3.50GHz
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, skylake)
Threads: 1 default, 0 interactive, 1 GC (on 4 virtual cores)

julia> Pkg.status()
Status `/private/var/folders/cx/7tl61h5d5tvbhk6xjtnyb85m0000gn/T/jl_hS9xKD/Project.toml`
  [7da242da] Enzyme v0.13.14
julia> pv = Enzyme.hvp(g, xv, dxv)  # correct result
ERROR: Current scope: 
; Function Attrs: mustprogress noinline willreturn memory(readwrite, argmem: none, inaccessiblemem: none)
define private fastcc void @preprocess_diffejulia_mapreduce_impl_63711({} addrspace(10)* noalias nocapture nonnull readonly align 8 dereferenceable(32) "enzyme_type"="{[-1]:Pointer, [-1,0]:Pointer, [-1,0,-1]:Float@double, [-1,8]:Pointer, [-1,8,0]:Integer, [-1,8,1]:Integer, [-1,8,2]:Integer, [-1,8,3]:Integer, [-1,8,4]:Integer, [-1,8,5]:Integer, [-1,8,6]:Integer, [-1,8,7]:Integer, [-1,8,8]:Pointer, [-1,8,8,-1]:Float@double, [-1,16]:Integer, [-1,17]:Integer, [-1,18]:Integer, [-1,19]:Integer, [-1,20]:Integer, [-1,21]:Integer, [-1,22]:Integer, [-1,23]:Integer, [-1,24]:Integer, [-1,25]:Integer, [-1,26]:Integer, [-1,27]:Integer, [-1,28]:Integer, [-1,29]:Integer, [-1,30]:Integer, [-1,31]:Integer}" "enzymejl_parmtype"="4935644608" "enzymejl_parmtype_ref"="2" %"'", i64 signext "enzyme_inactive" "enzyme_type"="{[-1]:Integer}" "enzymejl_parmtype"="4979721520" "enzymejl_parmtype_ref"="0" %0, double %differeturn) unnamed_addr #15 !dbg !2938 {
top:
  %1 = call {}*** @julia.get_pgcstack() #25
  %ptls_field16 = getelementptr inbounds {}**, {}*** %1, i64 2
  %2 = bitcast {}*** %ptls_field16 to i64***
  %ptls_load1718 = load i64**, i64*** %2, align 8, !tbaa !28, !alias.scope !2939, !noalias !2942
  %3 = getelementptr inbounds i64*, i64** %ptls_load1718, i64 2
  %safepoint = load i64*, i64** %3, align 8, !tbaa !32, !alias.scope !2944, !noalias !2947
  fence syncscope("singlethread") seq_cst
  call void @julia.safepoint(i64* %safepoint) #26, !dbg !2949
  fence syncscope("singlethread") seq_cst
  %.not = icmp eq i64 %0, 1, !dbg !2950
  br i1 %.not, label %invertL17, label %L22, !dbg !2951

L83.epil.preheader:                               ; preds = %L83.lr.ph
  %_unwrap48 = add nuw i64 %spec.select, 9223372036854775806
  %unroll_iter_unwrap49 = and i64 %_unwrap48, 9223372036854775800
  %unroll_iter_unwrap49.op = or i64 %unroll_iter_unwrap49, 2
  %_unwrap50 = select i1 %12, i64 2, i64 %unroll_iter_unwrap49.op
  %storemerge1 = add nsw i64 %xtraiter, -1
  %_unwrap51 = add nuw nsw i64 %_unwrap50, %storemerge1, !dbg !2952
  %"'ipg39_unwrap" = getelementptr inbounds {} addrspace(10)*, {} addrspace(10)* addrspace(13)* %8, i64 %_unwrap51, !dbg !2952
  %"'ipc40_unwrap" = bitcast {} addrspace(10)* addrspace(13)* %"'ipg39_unwrap" to double addrspace(13)*, !dbg !2952
  %4 = load double, double addrspace(13)* %"'ipc40_unwrap", align 8, !dbg !2952, !tbaa !90, !alias.scope !2956, !noalias !2959
  %5 = fadd fast double %4, %differeturn, !dbg !2952
  store double %5, double addrspace(13)* %"'ipc40_unwrap", align 8, !dbg !2952, !tbaa !90, !alias.scope !2956, !noalias !2961
  %6 = icmp eq i64 %storemerge1, 0
  br i1 %6, label %invertL79.common.ret.loopexit_crit_edge.unr-lcssa, label %invertL83.epil.1

L22:                                              ; preds = %top
  %7 = add i64 %0, -1, !dbg !2966
  %.not19 = icmp slt i64 %7, 1024, !dbg !2968
  br i1 %.not19, label %L39, label %invertL128, !dbg !2967

L39:                                              ; preds = %L22
  %"'ipc85" = bitcast {} addrspace(10)* %"'" to { i8*, {} addrspace(10)* } addrspace(10)*, !dbg !2969
  %"'ipc86" = addrspacecast { i8*, {} addrspace(10)* } addrspace(10)* %"'ipc85" to { i8*, {} addrspace(10)* } addrspace(11)*, !dbg !2969
  %"'ipc91" = bitcast {} addrspace(10)* %"'" to {} addrspace(10)** addrspace(10)*, !dbg !2969
  %"'ipc92" = addrspacecast {} addrspace(10)** addrspace(10)* %"'ipc91" to {} addrspace(10)** addrspace(11)*, !dbg !2969
  %"'ipl93" = load {} addrspace(10)**, {} addrspace(10)** addrspace(11)* %"'ipc92", align 8, !dbg !2969, !tbaa !82, !alias.scope !2971, !noalias !2974, !invariant.group !2976, !enzyme_type !84, !enzymejl_byref_BITS_VALUE !0, !enzymejl_source_type_Ptr\7BFloat64\7D !0, !enzyme_nocache !0
  %"'ipg87" = getelementptr inbounds { i8*, {} addrspace(10)* }, { i8*, {} addrspace(10)* } addrspace(11)* %"'ipc86", i64 0, i32 1, !dbg !2969
  %"'ipl88" = load {} addrspace(10)*, {} addrspace(10)* addrspace(11)* %"'ipg87", align 8, !dbg !2969, !tbaa !82, !alias.scope !2971, !noalias !2974, !invariant.group !2977, !enzyme_type !87, !enzymejl_source_type_Memory\7BFloat64\7D !0, !enzymejl_byref_MUT_REF !0
  %8 = call "enzyme_type"="{[-1]:Pointer, [-1,-1]:Float@double}" {} addrspace(10)* addrspace(13)* @julia.gc_loaded({} addrspace(10)* noundef %"'ipl88", {} addrspace(10)** noundef %"'ipl93") #25, !dbg !2969
  %spec.select = call i64 @llvm.smax.i64(i64 %0, i64 noundef 2) #25, !dbg !2978
  %9 = add nsw i64 %spec.select, -3, !dbg !2982
  %10 = icmp ugt i64 %9, 9223372036854775806, !dbg !2986
  br i1 %10, label %invertL39, label %L83.lr.ph, !dbg !2987

L83.lr.ph:                                        ; preds = %L39
  %11 = add nuw nsw i64 %spec.select, 6, !dbg !2988
  %xtraiter = and i64 %11, 7, !dbg !2989
  %12 = icmp ult i64 %9, 7, !dbg !2989
  %lcmp.mod.not = icmp eq i64 %xtraiter, 0, !dbg !2989
  br i1 %lcmp.mod.not, label %invertL79.common.ret.loopexit_crit_edge.unr-lcssa, label %L83.epil.preheader, !dbg !2989

inverttop:                                        ; preds = %invertL128, %invertL39, %invertL17
  fence syncscope("singlethread") seq_cst
  ret void

invertL17:                                        ; preds = %top
  %"'ipc8" = bitcast {} addrspace(10)* %"'" to { i8*, {} addrspace(10)* } addrspace(10)*, !dbg !2990
  %"'ipc9" = addrspacecast { i8*, {} addrspace(10)* } addrspace(10)* %"'ipc8" to { i8*, {} addrspace(10)* } addrspace(11)*, !dbg !2990
  %"'ipc13" = bitcast {} addrspace(10)* %"'" to {} addrspace(10)** addrspace(10)*, !dbg !2990
  %"'ipc14" = addrspacecast {} addrspace(10)** addrspace(10)* %"'ipc13" to {} addrspace(10)** addrspace(11)*, !dbg !2990
  %"'ipl15" = load {} addrspace(10)**, {} addrspace(10)** addrspace(11)* %"'ipc14", align 8, !dbg !2990, !tbaa !82, !alias.scope !2971, !noalias !2974, !enzyme_type !84, !enzymejl_byref_BITS_VALUE !0, !enzymejl_source_type_Ptr\7BFloat64\7D !0, !enzyme_nocache !0
  %"'ipg10" = getelementptr inbounds { i8*, {} addrspace(10)* }, { i8*, {} addrspace(10)* } addrspace(11)* %"'ipc9", i64 0, i32 1, !dbg !2990
  %"'ipl" = load {} addrspace(10)*, {} addrspace(10)* addrspace(11)* %"'ipg10", align 8, !dbg !2990, !tbaa !82, !alias.scope !2971, !noalias !2974, !dereferenceable_or_null !690, !enzyme_type !87, !enzymejl_source_type_Memory\7BFloat64\7D !0, !enzymejl_byref_MUT_REF !0
  %13 = call "enzyme_type"="{[-1]:Pointer, [-1,-1]:Float@double}" {} addrspace(10)* addrspace(13)* @julia.gc_loaded({} addrspace(10)* noundef %"'ipl", {} addrspace(10)** noundef %"'ipl15") #25, !dbg !2990
  %"'ipc_unwrap" = bitcast {} addrspace(10)* addrspace(13)* %13 to double addrspace(13)*, !dbg !2990
  %14 = load double, double addrspace(13)* %"'ipc_unwrap", align 8, !dbg !2990, !tbaa !90, !alias.scope !2992, !noalias !2995, !enzyme_type !786, !enzymejl_byref_BITS_VALUE !0, !enzymejl_source_type_Float64 !0
  %15 = fadd fast double %14, %differeturn, !dbg !2990
  store double %15, double addrspace(13)* %"'ipc_unwrap", align 8, !dbg !2990, !tbaa !90, !alias.scope !2992, !noalias !2997
  br label %inverttop

mergeinvertL83_L79.common.ret.loopexit_crit_edge.unr-lcssa.loopexit: ; preds = %invertL79.common.ret.loopexit_crit_edge.unr-lcssa
  %_unwrap24 = add nuw i64 %spec.select, 9223372036854775806
  %unroll_iter_unwrap = and i64 %_unwrap24, 9223372036854775800
  %_unwrap25 = add nsw i64 %unroll_iter_unwrap, -8
  %_unwrap26 = lshr exact i64 %_unwrap25, 3
  %16 = add nuw nsw i64 %_unwrap26, 1
  %min.iters.check = icmp ult i64 %_unwrap25, 120
  br i1 %min.iters.check, label %scalar.ph, label %vector.scevcheck

vector.scevcheck:                                 ; preds = %mergeinvertL83_L79.common.ret.loopexit_crit_edge.unr-lcssa.loopexit
  %17 = add nsw i64 %unroll_iter_unwrap, -6
  %scevgep = getelementptr {} addrspace(10)*, {} addrspace(10)* addrspace(13)* %8, i64 %17
  %mul.result.neg = mul nsw i64 %_unwrap25, -8
  %mul.overflow = icmp ugt i64 %_unwrap25, 2305843009213693944
  %scevgep50 = bitcast {} addrspace(10)* addrspace(13)* %scevgep to i8 addrspace(13)*
  %18 = getelementptr i8, i8 addrspace(13)* %scevgep50, i64 %mul.result.neg
  %19 = icmp ugt i8 addrspace(13)* %18, %scevgep50
  %20 = add nsw i64 %unroll_iter_unwrap, -5
  %scevgep51 = getelementptr {} addrspace(10)*, {} addrspace(10)* addrspace(13)* %8, i64 %20
  %scevgep5155 = bitcast {} addrspace(10)* addrspace(13)* %scevgep51 to i8 addrspace(13)*
  %21 = getelementptr i8, i8 addrspace(13)* %scevgep5155, i64 %mul.result.neg
  %22 = icmp ugt i8 addrspace(13)* %21, %scevgep5155
  %23 = add nsw i64 %unroll_iter_unwrap, -4
  %scevgep56 = getelementptr {} addrspace(10)*, {} addrspace(10)* addrspace(13)* %8, i64 %23
  %scevgep5660 = bitcast {} addrspace(10)* addrspace(13)* %scevgep56 to i8 addrspace(13)*
  %24 = getelementptr i8, i8 addrspace(13)* %scevgep5660, i64 %mul.result.neg
  %25 = icmp ugt i8 addrspace(13)* %24, %scevgep5660
  %26 = or i1 %mul.overflow, %25
  %27 = add nsw i64 %unroll_iter_unwrap, -3
  %scevgep61 = getelementptr {} addrspace(10)*, {} addrspace(10)* addrspace(13)* %8, i64 %27
  %scevgep6165 = bitcast {} addrspace(10)* addrspace(13)* %scevgep61 to i8 addrspace(13)*
  %28 = getelementptr i8, i8 addrspace(13)* %scevgep6165, i64 %mul.result.neg
  %29 = icmp ugt i8 addrspace(13)* %28, %scevgep6165
  %30 = add nsw i64 %unroll_iter_unwrap, -2
  %scevgep66 = getelementptr {} addrspace(10)*, {} addrspace(10)* addrspace(13)* %8, i64 %30
  %scevgep6670 = bitcast {} addrspace(10)* addrspace(13)* %scevgep66 to i8 addrspace(13)*
  %31 = getelementptr i8, i8 addrspace(13)* %scevgep6670, i64 %mul.result.neg
  %32 = icmp ugt i8 addrspace(13)* %31, %scevgep6670
  %33 = add nsw i64 %unroll_iter_unwrap, -1
  %scevgep71 = getelementptr {} addrspace(10)*, {} addrspace(10)* addrspace(13)* %8, i64 %33
  %scevgep7175 = bitcast {} addrspace(10)* addrspace(13)* %scevgep71 to i8 addrspace(13)*
  %34 = getelementptr i8, i8 addrspace(13)* %scevgep7175, i64 %mul.result.neg
  %35 = icmp ugt i8 addrspace(13)* %34, %scevgep7175
  %scevgep76 = getelementptr {} addrspace(10)*, {} addrspace(10)* addrspace(13)* %8, i64 %unroll_iter_unwrap
  %scevgep7680 = bitcast {} addrspace(10)* addrspace(13)* %scevgep76 to i8 addrspace(13)*
  %36 = getelementptr i8, i8 addrspace(13)* %scevgep7680, i64 %mul.result.neg
  %37 = icmp ugt i8 addrspace(13)* %36, %scevgep7680
  %38 = or i64 %unroll_iter_unwrap, 1
  %scevgep81 = getelementptr {} addrspace(10)*, {} addrspace(10)* addrspace(13)* %8, i64 %38
  %scevgep8185 = bitcast {} addrspace(10)* addrspace(13)* %scevgep81 to i8 addrspace(13)*
  %39 = getelementptr i8, i8 addrspace(13)* %scevgep8185, i64 %mul.result.neg
  %40 = icmp ugt i8 addrspace(13)* %39, %scevgep8185
  %41 = or i1 %22, %19
  %42 = or i1 %41, %26
  %43 = or i1 %29, %42
  %44 = or i1 %32, %43
  %45 = or i1 %35, %44
  %46 = or i1 %37, %45
  %47 = or i1 %40, %46
  br i1 %47, label %scalar.ph, label %vector.ph

vector.ph:                                        ; preds = %vector.scevcheck
  %n.vec = and i64 %16, -8
  %48 = insertelement <4 x double> <double poison, double 0.000000e+00, double 0.000000e+00, double 0.000000e+00>, double %160, i64 0
  %.splatinsert = insertelement <4 x i64> poison, i64 %_unwrap26, i64 0
  %.splat = shufflevector <4 x i64> %.splatinsert, <4 x i64> poison, <4 x i32> zeroinitializer
  %induction = add nsw <4 x i64> %.splat, <i64 0, i64 -1, i64 -2, i64 -3>
  %broadcast.splatinsert = insertelement <4 x double> poison, double %159, i64 0
  %broadcast.splat = shufflevector <4 x double> %broadcast.splatinsert, <4 x double> poison, <4 x i32> zeroinitializer
  br label %vector.body

vector.body:                                      ; preds = %vector.body, %vector.ph
  %iv = phi i64 [ %iv.next, %vector.body ], [ 0, %vector.ph ]
  %vec.phi = phi <4 x double> [ %48, %vector.ph ], [ %138, %vector.body ]
  %vec.phi86 = phi <4 x double> [ zeroinitializer, %vector.ph ], [ %139, %vector.body ]
  %vec.ind = phi <4 x i64> [ %induction, %vector.ph ], [ %vec.ind.next, %vector.body ]
  %49 = shl nuw i64 %iv, 3
  %iv.next = add nuw nsw i64 %iv, 1
  %step.add = add <4 x i64> %vec.ind, <i64 -4, i64 -4, i64 -4, i64 -4>
  %50 = shl nuw <4 x i64> %vec.ind, <i64 3, i64 3, i64 3, i64 3>, !dbg !2952
  %51 = shl nuw <4 x i64> %step.add, <i64 3, i64 3, i64 3, i64 3>, !dbg !2952
  %52 = add <4 x i64> %50, <i64 9, i64 9, i64 9, i64 9>, !dbg !2952
  %53 = add <4 x i64> %51, <i64 9, i64 9, i64 9, i64 9>, !dbg !2952
  %54 = getelementptr inbounds {} addrspace(10)*, {} addrspace(10)* addrspace(13)* %8, <4 x i64> %52, !dbg !2952
  %55 = getelementptr inbounds {} addrspace(10)*, {} addrspace(10)* addrspace(13)* %8, <4 x i64> %53, !dbg !2952
  %56 = bitcast <4 x {} addrspace(10)* addrspace(13)*> %54 to <4 x double addrspace(13)*>, !dbg !2952
  %57 = bitcast <4 x {} addrspace(10)* addrspace(13)*> %55 to <4 x double addrspace(13)*>, !dbg !2952
  %wide.masked.gather = call <4 x double> @llvm.masked.gather.v4f64.v4p13f64(<4 x double addrspace(13)*> %56, i32 noundef 8, <4 x i1> noundef <i1 true, i1 true, i1 true, i1 true>, <4 x double> poison) #25, !dbg !2952, !tbaa !90, !alias.scope !2956, !noalias !2959
  %wide.masked.gather88 = call <4 x double> @llvm.masked.gather.v4f64.v4p13f64(<4 x double addrspace(13)*> %57, i32 noundef 8, <4 x i1> noundef <i1 true, i1 true, i1 true, i1 true>, <4 x double> poison) #25, !dbg !2952, !tbaa !90, !alias.scope !2956, !noalias !2959
  %58 = fadd fast <4 x double> %wide.masked.gather, %broadcast.splat, !dbg !2952
  %59 = fadd fast <4 x double> %wide.masked.gather88, %broadcast.splat, !dbg !2952
  %60 = add <4 x i64> %50, <i64 8, i64 8, i64 8, i64 8>, !dbg !2952
  %61 = add <4 x i64> %51, <i64 8, i64 8, i64 8, i64 8>, !dbg !2952
  %62 = getelementptr inbounds {} addrspace(10)*, {} addrspace(10)* addrspace(13)* %8, <4 x i64> %60, !dbg !2952
  %63 = getelementptr inbounds {} addrspace(10)*, {} addrspace(10)* addrspace(13)* %8, <4 x i64> %61, !dbg !2952
  %64 = bitcast <4 x {} addrspace(10)* addrspace(13)*> %62 to <4 x double addrspace(13)*>, !dbg !2952
  %65 = bitcast <4 x {} addrspace(10)* addrspace(13)*> %63 to <4 x double addrspace(13)*>, !dbg !2952
  %wide.masked.gather91 = call <4 x double> @llvm.masked.gather.v4f64.v4p13f64(<4 x double addrspace(13)*> %64, i32 noundef 8, <4 x i1> noundef <i1 true, i1 true, i1 true, i1 true>, <4 x double> poison) #25, !dbg !2952, !tbaa !90, !alias.scope !2956, !noalias !2959
  %wide.masked.gather92 = call <4 x double> @llvm.masked.gather.v4f64.v4p13f64(<4 x double addrspace(13)*> %65, i32 noundef 8, <4 x i1> noundef <i1 true, i1 true, i1 true, i1 true>, <4 x double> poison) #25, !dbg !2952, !tbaa !90, !alias.scope !2956, !noalias !2959
  %66 = fadd fast <4 x double> %wide.masked.gather91, %broadcast.splat, !dbg !2952
  %67 = fadd fast <4 x double> %wide.masked.gather92, %broadcast.splat, !dbg !2952
  %68 = or <4 x i64> %50, <i64 7, i64 7, i64 7, i64 7>, !dbg !2952
  %69 = or <4 x i64> %51, <i64 7, i64 7, i64 7, i64 7>, !dbg !2952
  %70 = getelementptr inbounds {} addrspace(10)*, {} addrspace(10)* addrspace(13)* %8, <4 x i64> %68, !dbg !2952
  %71 = getelementptr inbounds {} addrspace(10)*, {} addrspace(10)* addrspace(13)* %8, <4 x i64> %69, !dbg !2952
  %72 = bitcast <4 x {} addrspace(10)* addrspace(13)*> %70 to <4 x double addrspace(13)*>, !dbg !2952
  %73 = bitcast <4 x {} addrspace(10)* addrspace(13)*> %71 to <4 x double addrspace(13)*>, !dbg !2952
  %wide.masked.gather93 = call <4 x double> @llvm.masked.gather.v4f64.v4p13f64(<4 x double addrspace(13)*> %72, i32 noundef 8, <4 x i1> noundef <i1 true, i1 true, i1 true, i1 true>, <4 x double> poison) #25, !dbg !2952, !tbaa !90, !alias.scope !2956, !noalias !2959
  %wide.masked.gather94 = call <4 x double> @llvm.masked.gather.v4f64.v4p13f64(<4 x double addrspace(13)*> %73, i32 noundef 8, <4 x i1> noundef <i1 true, i1 true, i1 true, i1 true>, <4 x double> poison) #25, !dbg !2952, !tbaa !90, !alias.scope !2956, !noalias !2959
  %74 = fadd fast <4 x double> %wide.masked.gather93, %broadcast.splat, !dbg !2952
  %75 = fadd fast <4 x double> %wide.masked.gather94, %broadcast.splat, !dbg !2952
  %76 = or <4 x i64> %50, <i64 6, i64 6, i64 6, i64 6>, !dbg !2952
  %77 = or <4 x i64> %51, <i64 6, i64 6, i64 6, i64 6>, !dbg !2952
  %78 = getelementptr inbounds {} addrspace(10)*, {} addrspace(10)* addrspace(13)* %8, <4 x i64> %76, !dbg !2952
  %79 = getelementptr inbounds {} addrspace(10)*, {} addrspace(10)* addrspace(13)* %8, <4 x i64> %77, !dbg !2952
  %80 = bitcast <4 x {} addrspace(10)* addrspace(13)*> %78 to <4 x double addrspace(13)*>, !dbg !2952
  %81 = bitcast <4 x {} addrspace(10)* addrspace(13)*> %79 to <4 x double addrspace(13)*>, !dbg !2952
  %wide.masked.gather95 = call <4 x double> @llvm.masked.gather.v4f64.v4p13f64(<4 x double addrspace(13)*> %80, i32 noundef 8, <4 x i1> noundef <i1 true, i1 true, i1 true, i1 true>, <4 x double> poison) #25, !dbg !2952, !tbaa !90, !alias.scope !2956, !noalias !2959
  %wide.masked.gather96 = call <4 x double> @llvm.masked.gather.v4f64.v4p13f64(<4 x double addrspace(13)*> %81, i32 noundef 8, <4 x i1> noundef <i1 true, i1 true, i1 true, i1 true>, <4 x double> poison) #25, !dbg !2952, !tbaa !90, !alias.scope !2956, !noalias !2959
  %82 = fadd fast <4 x double> %wide.masked.gather95, %broadcast.splat, !dbg !2952
  %83 = fadd fast <4 x double> %wide.masked.gather96, %broadcast.splat, !dbg !2952
  %84 = or <4 x i64> %50, <i64 5, i64 5, i64 5, i64 5>, !dbg !2952
  %85 = or <4 x i64> %51, <i64 5, i64 5, i64 5, i64 5>, !dbg !2952
  %86 = getelementptr inbounds {} addrspace(10)*, {} addrspace(10)* addrspace(13)* %8, <4 x i64> %84, !dbg !2952
  %87 = getelementptr inbounds {} addrspace(10)*, {} addrspace(10)* addrspace(13)* %8, <4 x i64> %85, !dbg !2952
  %88 = bitcast <4 x {} addrspace(10)* addrspace(13)*> %86 to <4 x double addrspace(13)*>, !dbg !2952
  %89 = bitcast <4 x {} addrspace(10)* addrspace(13)*> %87 to <4 x double addrspace(13)*>, !dbg !2952
  %wide.masked.gather97 = call <4 x double> @llvm.masked.gather.v4f64.v4p13f64(<4 x double addrspace(13)*> %88, i32 noundef 8, <4 x i1> noundef <i1 true, i1 true, i1 true, i1 true>, <4 x double> poison) #25, !dbg !2952, !tbaa !90, !alias.scope !2956, !noalias !2959
  %wide.masked.gather98 = call <4 x double> @llvm.masked.gather.v4f64.v4p13f64(<4 x double addrspace(13)*> %89, i32 noundef 8, <4 x i1> noundef <i1 true, i1 true, i1 true, i1 true>, <4 x double> poison) #25, !dbg !2952, !tbaa !90, !alias.scope !2956, !noalias !2959
  %90 = fadd fast <4 x double> %wide.masked.gather97, %broadcast.splat, !dbg !2952
  %91 = fadd fast <4 x double> %wide.masked.gather98, %broadcast.splat, !dbg !2952
  %92 = or <4 x i64> %50, <i64 4, i64 4, i64 4, i64 4>, !dbg !2952
  %93 = or <4 x i64> %51, <i64 4, i64 4, i64 4, i64 4>, !dbg !2952
  %94 = getelementptr inbounds {} addrspace(10)*, {} addrspace(10)* addrspace(13)* %8, <4 x i64> %92, !dbg !2952
  %95 = getelementptr inbounds {} addrspace(10)*, {} addrspace(10)* addrspace(13)* %8, <4 x i64> %93, !dbg !2952
  %96 = bitcast <4 x {} addrspace(10)* addrspace(13)*> %94 to <4 x double addrspace(13)*>, !dbg !2952
  %97 = bitcast <4 x {} addrspace(10)* addrspace(13)*> %95 to <4 x double addrspace(13)*>, !dbg !2952
  %wide.masked.gather99 = call <4 x double> @llvm.masked.gather.v4f64.v4p13f64(<4 x double addrspace(13)*> %96, i32 noundef 8, <4 x i1> noundef <i1 true, i1 true, i1 true, i1 true>, <4 x double> poison) #25, !dbg !2952, !tbaa !90, !alias.scope !2956, !noalias !2959
  %wide.masked.gather100 = call <4 x double> @llvm.masked.gather.v4f64.v4p13f64(<4 x double addrspace(13)*> %97, i32 noundef 8, <4 x i1> noundef <i1 true, i1 true, i1 true, i1 true>, <4 x double> poison) #25, !dbg !2952, !tbaa !90, !alias.scope !2956, !noalias !2959
  %98 = fadd fast <4 x double> %wide.masked.gather99, %broadcast.splat, !dbg !2952
  %99 = fadd fast <4 x double> %wide.masked.gather100, %broadcast.splat, !dbg !2952
  %100 = or <4 x i64> %50, <i64 3, i64 3, i64 3, i64 3>, !dbg !2952
  %101 = or <4 x i64> %51, <i64 3, i64 3, i64 3, i64 3>, !dbg !2952
  %102 = getelementptr inbounds {} addrspace(10)*, {} addrspace(10)* addrspace(13)* %8, <4 x i64> %100, !dbg !2952
  %103 = getelementptr inbounds {} addrspace(10)*, {} addrspace(10)* addrspace(13)* %8, <4 x i64> %101, !dbg !2952
  %104 = bitcast <4 x {} addrspace(10)* addrspace(13)*> %102 to <4 x double addrspace(13)*>, !dbg !2952
  %105 = bitcast <4 x {} addrspace(10)* addrspace(13)*> %103 to <4 x double addrspace(13)*>, !dbg !2952
  %wide.masked.gather101 = call <4 x double> @llvm.masked.gather.v4f64.v4p13f64(<4 x double addrspace(13)*> %104, i32 noundef 8, <4 x i1> noundef <i1 true, i1 true, i1 true, i1 true>, <4 x double> poison) #25, !dbg !2952, !tbaa !90, !alias.scope !2956, !noalias !2959
  %wide.masked.gather102 = call <4 x double> @llvm.masked.gather.v4f64.v4p13f64(<4 x double addrspace(13)*> %105, i32 noundef 8, <4 x i1> noundef <i1 true, i1 true, i1 true, i1 true>, <4 x double> poison) #25, !dbg !2952, !tbaa !90, !alias.scope !2956, !noalias !2959
  %106 = fadd fast <4 x double> %wide.masked.gather101, %broadcast.splat, !dbg !2952
  %107 = fadd fast <4 x double> %wide.masked.gather102, %broadcast.splat, !dbg !2952
  %108 = or <4 x i64> %50, <i64 2, i64 2, i64 2, i64 2>, !dbg !2952
  %109 = or <4 x i64> %51, <i64 2, i64 2, i64 2, i64 2>, !dbg !2952
  %110 = getelementptr inbounds {} addrspace(10)*, {} addrspace(10)* addrspace(13)* %8, <4 x i64> %108, !dbg !2952
  %111 = getelementptr inbounds {} addrspace(10)*, {} addrspace(10)* addrspace(13)* %8, <4 x i64> %109, !dbg !2952
  %112 = bitcast <4 x {} addrspace(10)* addrspace(13)*> %110 to <4 x double addrspace(13)*>, !dbg !2952
  %113 = bitcast <4 x {} addrspace(10)* addrspace(13)*> %111 to <4 x double addrspace(13)*>, !dbg !2952
  %wide.masked.gather103 = call <4 x double> @llvm.masked.gather.v4f64.v4p13f64(<4 x double addrspace(13)*> %112, i32 noundef 8, <4 x i1> noundef <i1 true, i1 true, i1 true, i1 true>, <4 x double> poison) #25, !dbg !2952, !tbaa !90, !alias.scope !2956, !noalias !2959
  %wide.masked.gather104 = call <4 x double> @llvm.masked.gather.v4f64.v4p13f64(<4 x double addrspace(13)*> %113, i32 noundef 8, <4 x i1> noundef <i1 true, i1 true, i1 true, i1 true>, <4 x double> poison) #25, !dbg !2952, !tbaa !90, !alias.scope !2956, !noalias !2959
  %114 = fadd fast <4 x double> %wide.masked.gather103, %broadcast.splat, !dbg !2952
  %115 = fadd fast <4 x double> %wide.masked.gather104, %broadcast.splat, !dbg !2952
  %116 = extractelement <4 x double addrspace(13)*> %112, i64 0, !dbg !2952
  %117 = getelementptr double, double addrspace(13)* %116, i64 -24, !dbg !2952
  %118 = bitcast double addrspace(13)* %117 to <32 x double> addrspace(13)*, !dbg !2952
  %119 = extractelement <4 x double addrspace(13)*> %113, i64 0, !dbg !2952
  %120 = getelementptr double, double addrspace(13)* %119, i64 -24, !dbg !2952
  %121 = bitcast double addrspace(13)* %120 to <32 x double> addrspace(13)*, !dbg !2952
  %reverse = shufflevector <4 x double> %114, <4 x double> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>, !dbg !2952
  %reverse105 = shufflevector <4 x double> %106, <4 x double> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>, !dbg !2952
  %reverse106 = shufflevector <4 x double> %98, <4 x double> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>, !dbg !2952
  %reverse107 = shufflevector <4 x double> %90, <4 x double> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>, !dbg !2952
  %reverse108 = shufflevector <4 x double> %82, <4 x double> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>, !dbg !2952
  %reverse109 = shufflevector <4 x double> %74, <4 x double> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>, !dbg !2952
  %reverse110 = shufflevector <4 x double> %66, <4 x double> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>, !dbg !2952
  %reverse111 = shufflevector <4 x double> %58, <4 x double> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>, !dbg !2952
  %122 = shufflevector <4 x double> %reverse, <4 x double> %reverse105, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>, !dbg !2952
  %123 = shufflevector <4 x double> %reverse106, <4 x double> %reverse107, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>, !dbg !2952
  %124 = shufflevector <4 x double> %reverse108, <4 x double> %reverse109, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>, !dbg !2952
  %125 = shufflevector <4 x double> %reverse110, <4 x double> %reverse111, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>, !dbg !2952
  %126 = shufflevector <8 x double> %122, <8 x double> %123, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>, !dbg !2952
  %127 = shufflevector <8 x double> %124, <8 x double> %125, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>, !dbg !2952
  %interleaved.vec = shufflevector <16 x double> %126, <16 x double> %127, <32 x i32> <i32 0, i32 4, i32 8, i32 12, i32 16, i32 20, i32 24, i32 28, i32 1, i32 5, i32 9, i32 13, i32 17, i32 21, i32 25, i32 29, i32 2, i32 6, i32 10, i32 14, i32 18, i32 22, i32 26, i32 30, i32 3, i32 7, i32 11, i32 15, i32 19, i32 23, i32 27, i32 31>, !dbg !2952
  store <32 x double> %interleaved.vec, <32 x double> addrspace(13)* %118, align 8, !dbg !2952, !tbaa !90, !alias.scope !2956, !noalias !2961
  %reverse112 = shufflevector <4 x double> %115, <4 x double> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>, !dbg !2952
  %reverse113 = shufflevector <4 x double> %107, <4 x double> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>, !dbg !2952
  %reverse114 = shufflevector <4 x double> %99, <4 x double> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>, !dbg !2952
  %reverse115 = shufflevector <4 x double> %91, <4 x double> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>, !dbg !2952
  %reverse116 = shufflevector <4 x double> %83, <4 x double> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>, !dbg !2952
  %reverse117 = shufflevector <4 x double> %75, <4 x double> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>, !dbg !2952
  %reverse118 = shufflevector <4 x double> %67, <4 x double> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>, !dbg !2952
  %reverse119 = shufflevector <4 x double> %59, <4 x double> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>, !dbg !2952
  %128 = shufflevector <4 x double> %reverse112, <4 x double> %reverse113, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>, !dbg !2952
  %129 = shufflevector <4 x double> %reverse114, <4 x double> %reverse115, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>, !dbg !2952
  %130 = shufflevector <4 x double> %reverse116, <4 x double> %reverse117, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>, !dbg !2952
  %131 = shufflevector <4 x double> %reverse118, <4 x double> %reverse119, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>, !dbg !2952
  %132 = shufflevector <8 x double> %128, <8 x double> %129, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>, !dbg !2952
  %133 = shufflevector <8 x double> %130, <8 x double> %131, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>, !dbg !2952
  %interleaved.vec120 = shufflevector <16 x double> %132, <16 x double> %133, <32 x i32> <i32 0, i32 4, i32 8, i32 12, i32 16, i32 20, i32 24, i32 28, i32 1, i32 5, i32 9, i32 13, i32 17, i32 21, i32 25, i32 29, i32 2, i32 6, i32 10, i32 14, i32 18, i32 22, i32 26, i32 30, i32 3, i32 7, i32 11, i32 15, i32 19, i32 23, i32 27, i32 31>, !dbg !2952
  store <32 x double> %interleaved.vec120, <32 x double> addrspace(13)* %121, align 8, !dbg !2952, !tbaa !90, !alias.scope !2956, !noalias !2961
  %134 = icmp eq <4 x i64> %vec.ind, zeroinitializer
  %135 = icmp eq <4 x i64> %step.add, zeroinitializer
  %136 = select <4 x i1> %134, <4 x double> %broadcast.splat, <4 x double> zeroinitializer
  %137 = select <4 x i1> %135, <4 x double> %broadcast.splat, <4 x double> zeroinitializer
  %138 = fadd fast <4 x double> %136, %vec.phi
  %139 = fadd fast <4 x double> %137, %vec.phi86
  %index.next = add nuw i64 %49, 8
  %vec.ind.next = add <4 x i64> %vec.ind, <i64 -8, i64 -8, i64 -8, i64 -8>
  %140 = icmp eq i64 %index.next, %n.vec
  br i1 %140, label %middle.block, label %vector.body, !llvm.loop !2998

middle.block:                                     ; preds = %vector.body
  %bin.rdx = fadd fast <4 x double> %139, %138
  %141 = call fast double @llvm.vector.reduce.fadd.v4f64(double noundef -0.000000e+00, <4 x double> %bin.rdx) #25
  %cmp.n = icmp eq i64 %16, %n.vec
  br i1 %cmp.n, label %invertL39, label %middle.block.scalar.ph_crit_edge

middle.block.scalar.ph_crit_edge:                 ; preds = %middle.block
  %ind.end = sub nsw i64 %_unwrap26, %n.vec
  %.pre = add nsw i64 %ind.end, 1
  br label %scalar.ph

scalar.ph:                                        ; preds = %middle.block.scalar.ph_crit_edge, %vector.scevcheck, %mergeinvertL83_L79.common.ret.loopexit_crit_edge.unr-lcssa.loopexit
  %.pre-phi = phi i64 [ %.pre, %middle.block.scalar.ph_crit_edge ], [ %16, %vector.scevcheck ], [ %16, %mergeinvertL83_L79.common.ret.loopexit_crit_edge.unr-lcssa.loopexit ]
  %bc.resume.val = phi i64 [ %ind.end, %middle.block.scalar.ph_crit_edge ], [ %_unwrap26, %vector.scevcheck ], [ %_unwrap26, %mergeinvertL83_L79.common.ret.loopexit_crit_edge.unr-lcssa.loopexit ]
  %bc.merge.rdx = phi double [ %141, %middle.block.scalar.ph_crit_edge ], [ 0.000000e+00, %vector.scevcheck ], [ 0.000000e+00, %mergeinvertL83_L79.common.ret.loopexit_crit_edge.unr-lcssa.loopexit ]
  %142 = insertelement <4 x double> poison, double %159, i64 0, !dbg !2952
  %143 = shufflevector <4 x double> %142, <4 x double> poison, <4 x i32> zeroinitializer, !dbg !2952
  %xtraiter5 = and i64 %.pre-phi, 3
  %lcmp.mod.not6 = icmp eq i64 %xtraiter5, 0
  br i1 %lcmp.mod.not6, label %invertL83.prol.loopexit, label %invertL83.prol.preheader

invertL83.prol.preheader:                         ; preds = %scalar.ph
  br label %invertL83.prol

invertL83.prol:                                   ; preds = %invertL83.prol.preheader, %invertL83.prol
  %iv1 = phi i64 [ 0, %invertL83.prol.preheader ], [ %iv.next2, %invertL83.prol ]
  %"'de29.2.prol" = phi double [ %154, %invertL83.prol ], [ %bc.merge.rdx, %invertL83.prol.preheader ]
  %144 = mul nsw i64 %iv1, -1, !dbg !2952
  %iv.next2 = add nuw nsw i64 %iv1, 1, !dbg !2952
  %145 = add nsw i64 %bc.resume.val, %144, !dbg !2952
  %_unwrap105.prol = shl nuw i64 %145, 3, !dbg !2952
  %_unwrap136.prol = or i64 %_unwrap105.prol, 6, !dbg !2952
  %"'ipg133_unwrap.prol" = getelementptr inbounds {} addrspace(10)*, {} addrspace(10)* addrspace(13)* %8, i64 %_unwrap136.prol, !dbg !2952
  %146 = bitcast {} addrspace(10)* addrspace(13)* %"'ipg133_unwrap.prol" to <4 x double> addrspace(13)*, !dbg !2952
  %147 = load <4 x double>, <4 x double> addrspace(13)* %146, align 8, !dbg !2952, !tbaa !90, !alias.scope !2956, !noalias !2959
  %148 = fadd fast <4 x double> %147, %143, !dbg !2952
  store <4 x double> %148, <4 x double> addrspace(13)* %146, align 8, !dbg !2952, !tbaa !90, !alias.scope !2956, !noalias !2961
  %_unwrap175.prol = or i64 %_unwrap105.prol, 2, !dbg !2952
  %"'ipg172_unwrap.prol" = getelementptr inbounds {} addrspace(10)*, {} addrspace(10)* addrspace(13)* %8, i64 %_unwrap175.prol, !dbg !2952
  %149 = bitcast {} addrspace(10)* addrspace(13)* %"'ipg172_unwrap.prol" to <4 x double> addrspace(13)*, !dbg !2952
  %150 = load <4 x double>, <4 x double> addrspace(13)* %149, align 8, !dbg !2952, !tbaa !90, !alias.scope !2956, !noalias !2959
  %151 = fadd fast <4 x double> %150, %143, !dbg !2952
  store <4 x double> %151, <4 x double> addrspace(13)* %149, align 8, !dbg !2952, !tbaa !90, !alias.scope !2956, !noalias !2961
  %152 = icmp eq i64 %145, 0
  %153 = select fast i1 %152, double %159, double 0.000000e+00
  %154 = fadd fast double %153, %"'de29.2.prol"
  %155 = add nsw i64 %145, -1
  %prol.iter.cmp.not = icmp eq i64 %iv.next2, %xtraiter5
  br i1 %prol.iter.cmp.not, label %invertL83.prol.loopexit.loopexit, label %invertL83.prol, !llvm.loop !2999

invertL83.prol.loopexit.loopexit:                 ; preds = %invertL83.prol
  br label %invertL83.prol.loopexit

invertL83.prol.loopexit:                          ; preds = %invertL83.prol.loopexit.loopexit, %scalar.ph
  %.lcssa.unr = phi double [ undef, %scalar.ph ], [ %154, %invertL83.prol.loopexit.loopexit ]
  %"'de29.2.unr" = phi double [ %bc.merge.rdx, %scalar.ph ], [ %154, %invertL83.prol.loopexit.loopexit ]
  %storemerge.unr = phi i64 [ %bc.resume.val, %scalar.ph ], [ %155, %invertL83.prol.loopexit.loopexit ]
  %156 = icmp ult i64 %bc.resume.val, 3
  br i1 %156, label %invertL39, label %invertL83.preheader

invertL83.preheader:                              ; preds = %invertL83.prol.loopexit
  br label %invertL83

invertL79.common.ret.loopexit_crit_edge.unr-lcssa: ; preds = %invertL83.epil.6, %invertL83.epil.5, %invertL83.epil.4, %invertL83.epil.3, %invertL83.epil.2, %invertL83.epil.1, %L83.lr.ph, %L83.epil.preheader
  %157 = phi double [ %differeturn, %L83.lr.ph ], [ 0.000000e+00, %invertL83.epil.6 ], [ 0.000000e+00, %invertL83.epil.5 ], [ 0.000000e+00, %invertL83.epil.4 ], [ 0.000000e+00, %invertL83.epil.3 ], [ 0.000000e+00, %invertL83.epil.2 ], [ 0.000000e+00, %invertL83.epil.1 ], [ 0.000000e+00, %L83.epil.preheader ]
  %"value_phi223.unr'de.0" = phi double [ 0.000000e+00, %L83.lr.ph ], [ %181, %invertL83.epil.6 ], [ %178, %invertL83.epil.5 ], [ %173, %invertL83.epil.4 ], [ %differeturn, %invertL83.epil.3 ], [ %differeturn, %invertL83.epil.2 ], [ %differeturn, %invertL83.epil.1 ], [ %differeturn, %L83.epil.preheader ]
  %158 = fadd fast double %"value_phi223.unr'de.0", %157
  %159 = select fast i1 %12, double 0.000000e+00, double %158
  %160 = select fast i1 %12, double %"value_phi223.unr'de.0", double 0.000000e+00
  br i1 %12, label %invertL39, label %mergeinvertL83_L79.common.ret.loopexit_crit_edge.unr-lcssa.loopexit

invertL83.epil.1:                                 ; preds = %L83.epil.preheader
  %storemerge1.1 = add nsw i64 %xtraiter, -2
  %_unwrap51.1 = add nuw nsw i64 %_unwrap50, %storemerge1.1, !dbg !2952
  %"'ipg39_unwrap.1" = getelementptr inbounds {} addrspace(10)*, {} addrspace(10)* addrspace(13)* %8, i64 %_unwrap51.1, !dbg !2952
  %"'ipc40_unwrap.1" = bitcast {} addrspace(10)* addrspace(13)* %"'ipg39_unwrap.1" to double addrspace(13)*, !dbg !2952
  %161 = load double, double addrspace(13)* %"'ipc40_unwrap.1", align 8, !dbg !2952, !tbaa !90, !alias.scope !2956, !noalias !2959
  %162 = fadd fast double %161, %differeturn, !dbg !2952
  store double %162, double addrspace(13)* %"'ipc40_unwrap.1", align 8, !dbg !2952, !tbaa !90, !alias.scope !2956, !noalias !2961
  %163 = icmp eq i64 %storemerge1.1, 0
  br i1 %163, label %invertL79.common.ret.loopexit_crit_edge.unr-lcssa, label %invertL83.epil.2

invertL83.epil.2:                                 ; preds = %invertL83.epil.1
  %storemerge1.2 = add nsw i64 %xtraiter, -3
  %_unwrap51.2 = add nuw nsw i64 %_unwrap50, %storemerge1.2, !dbg !2952
  %"'ipg39_unwrap.2" = getelementptr inbounds {} addrspace(10)*, {} addrspace(10)* addrspace(13)* %8, i64 %_unwrap51.2, !dbg !2952
  %"'ipc40_unwrap.2" = bitcast {} addrspace(10)* addrspace(13)* %"'ipg39_unwrap.2" to double addrspace(13)*, !dbg !2952
  %164 = load double, double addrspace(13)* %"'ipc40_unwrap.2", align 8, !dbg !2952, !tbaa !90, !alias.scope !2956, !noalias !2959
  %165 = fadd fast double %164, %differeturn, !dbg !2952
  store double %165, double addrspace(13)* %"'ipc40_unwrap.2", align 8, !dbg !2952, !tbaa !90, !alias.scope !2956, !noalias !2961
  %166 = icmp eq i64 %storemerge1.2, 0
  br i1 %166, label %invertL79.common.ret.loopexit_crit_edge.unr-lcssa, label %invertL83.epil.3

invertL83.epil.3:                                 ; preds = %invertL83.epil.2
  %storemerge1.3 = add nsw i64 %xtraiter, -4
  %_unwrap51.3 = add nuw nsw i64 %_unwrap50, %storemerge1.3, !dbg !2952
  %"'ipg39_unwrap.3" = getelementptr inbounds {} addrspace(10)*, {} addrspace(10)* addrspace(13)* %8, i64 %_unwrap51.3, !dbg !2952
  %"'ipc40_unwrap.3" = bitcast {} addrspace(10)* addrspace(13)* %"'ipg39_unwrap.3" to double addrspace(13)*, !dbg !2952
  %167 = load double, double addrspace(13)* %"'ipc40_unwrap.3", align 8, !dbg !2952, !tbaa !90, !alias.scope !2956, !noalias !2959
  %168 = fadd fast double %167, %differeturn, !dbg !2952
  store double %168, double addrspace(13)* %"'ipc40_unwrap.3", align 8, !dbg !2952, !tbaa !90, !alias.scope !2956, !noalias !2961
  %169 = icmp eq i64 %storemerge1.3, 0
  br i1 %169, label %invertL79.common.ret.loopexit_crit_edge.unr-lcssa, label %invertL83.epil.4

invertL83.epil.4:                                 ; preds = %invertL83.epil.3
  %storemerge1.4 = add nsw i64 %xtraiter, -5
  %_unwrap51.4 = add nuw nsw i64 %_unwrap50, %storemerge1.4, !dbg !2952
  %"'ipg39_unwrap.4" = getelementptr inbounds {} addrspace(10)*, {} addrspace(10)* addrspace(13)* %8, i64 %_unwrap51.4, !dbg !2952
  %"'ipc40_unwrap.4" = bitcast {} addrspace(10)* addrspace(13)* %"'ipg39_unwrap.4" to double addrspace(13)*, !dbg !2952
  %170 = load double, double addrspace(13)* %"'ipc40_unwrap.4", align 8, !dbg !2952, !tbaa !90, !alias.scope !2956, !noalias !2959
  %171 = fadd fast double %170, %differeturn, !dbg !2952
  store double %171, double addrspace(13)* %"'ipc40_unwrap.4", align 8, !dbg !2952, !tbaa !90, !alias.scope !2956, !noalias !2961
  %172 = icmp eq i64 %storemerge1.4, 0
  %173 = select fast i1 %172, double %differeturn, double 0.000000e+00
  br i1 %172, label %invertL79.common.ret.loopexit_crit_edge.unr-lcssa, label %invertL83.epil.5

invertL83.epil.5:                                 ; preds = %invertL83.epil.4
  %storemerge1.5 = add nsw i64 %xtraiter, -6
  %_unwrap51.5 = add nuw nsw i64 %_unwrap50, %storemerge1.5, !dbg !2952
  %"'ipg39_unwrap.5" = getelementptr inbounds {} addrspace(10)*, {} addrspace(10)* addrspace(13)* %8, i64 %_unwrap51.5, !dbg !2952
  %"'ipc40_unwrap.5" = bitcast {} addrspace(10)* addrspace(13)* %"'ipg39_unwrap.5" to double addrspace(13)*, !dbg !2952
  %174 = load double, double addrspace(13)* %"'ipc40_unwrap.5", align 8, !dbg !2952, !tbaa !90, !alias.scope !2956, !noalias !2959
  %175 = fadd fast double %174, %differeturn, !dbg !2952
  store double %175, double addrspace(13)* %"'ipc40_unwrap.5", align 8, !dbg !2952, !tbaa !90, !alias.scope !2956, !noalias !2961
  %176 = icmp eq i64 %storemerge1.5, 0
  %177 = select fast i1 %176, double %differeturn, double 0.000000e+00
  %178 = fadd fast double %173, %177
  br i1 %176, label %invertL79.common.ret.loopexit_crit_edge.unr-lcssa, label %invertL83.epil.6

invertL83.epil.6:                                 ; preds = %invertL83.epil.5
  %storemerge1.6 = add nsw i64 %xtraiter, -7
  %_unwrap51.6 = add nuw nsw i64 %storemerge1.6, %_unwrap50, !dbg !2952
  %"'ipg39_unwrap.6" = getelementptr inbounds {} addrspace(10)*, {} addrspace(10)* addrspace(13)* %8, i64 %_unwrap51.6, !dbg !2952
  %"'ipc40_unwrap.6" = bitcast {} addrspace(10)* addrspace(13)* %"'ipg39_unwrap.6" to double addrspace(13)*, !dbg !2952
  %179 = load double, double addrspace(13)* %"'ipc40_unwrap.6", align 8, !dbg !2952, !tbaa !90, !alias.scope !2956, !noalias !2959
  %180 = fadd fast double %179, %differeturn, !dbg !2952
  store double %180, double addrspace(13)* %"'ipc40_unwrap.6", align 8, !dbg !2952, !tbaa !90, !alias.scope !2956, !noalias !2961
  %181 = fadd fast double %178, %differeturn
  br label %invertL79.common.ret.loopexit_crit_edge.unr-lcssa

invertL39.loopexit:                               ; preds = %invertL83
  br label %invertL39, !dbg !2969

invertL39:                                        ; preds = %invertL39.loopexit, %invertL79.common.ret.loopexit_crit_edge.unr-lcssa, %invertL83.prol.loopexit, %middle.block, %L39
  %"'de29.0" = phi double [ %"value_phi223.unr'de.0", %invertL79.common.ret.loopexit_crit_edge.unr-lcssa ], [ %differeturn, %L39 ], [ %141, %middle.block ], [ %.lcssa.unr, %invertL83.prol.loopexit ], [ %227, %invertL39.loopexit ]
  %182 = bitcast {} addrspace(10)* addrspace(13)* %8 to <2 x double> addrspace(13)*, !dbg !2969
  %183 = load <2 x double>, <2 x double> addrspace(13)* %182, align 8, !dbg !2969, !tbaa !90, !alias.scope !2956, !noalias !2959
  %184 = insertelement <2 x double> poison, double %"'de29.0", i64 0, !dbg !3000
  %185 = shufflevector <2 x double> %184, <2 x double> poison, <2 x i32> zeroinitializer, !dbg !3000
  %186 = fadd fast <2 x double> %185, %183, !dbg !2969
  store <2 x double> %186, <2 x double> addrspace(13)* %182, align 8, !dbg !2969, !tbaa !90, !alias.scope !2956, !noalias !2961
  br label %inverttop

invertL83:                                        ; preds = %invertL83.preheader, %invertL83
  %iv3 = phi i64 [ 0, %invertL83.preheader ], [ %iv.next4, %invertL83 ]
  %"'de29.2" = phi double [ %227, %invertL83 ], [ %"'de29.2.unr", %invertL83.preheader ]
  %187 = mul i64 %iv3, -4, !dbg !2952
  %iv.next4 = add nuw nsw i64 %iv3, 1, !dbg !2952
  %188 = add nsw i64 %storemerge.unr, %187, !dbg !2952
  %_unwrap105 = shl nuw i64 %188, 3, !dbg !2952
  %_unwrap136 = or i64 %_unwrap105, 6, !dbg !2952
  %"'ipg133_unwrap" = getelementptr inbounds {} addrspace(10)*, {} addrspace(10)* addrspace(13)* %8, i64 %_unwrap136, !dbg !2952
  %189 = bitcast {} addrspace(10)* addrspace(13)* %"'ipg133_unwrap" to <4 x double> addrspace(13)*, !dbg !2952
  %190 = load <4 x double>, <4 x double> addrspace(13)* %189, align 8, !dbg !2952, !tbaa !90, !alias.scope !2956, !noalias !2959
  %191 = fadd fast <4 x double> %190, %143, !dbg !2952
  store <4 x double> %191, <4 x double> addrspace(13)* %189, align 8, !dbg !2952, !tbaa !90, !alias.scope !2956, !noalias !2961
  %_unwrap175 = or i64 %_unwrap105, 2, !dbg !2952
  %"'ipg172_unwrap" = getelementptr inbounds {} addrspace(10)*, {} addrspace(10)* addrspace(13)* %8, i64 %_unwrap175, !dbg !2952
  %192 = bitcast {} addrspace(10)* addrspace(13)* %"'ipg172_unwrap" to <4 x double> addrspace(13)*, !dbg !2952
  %193 = load <4 x double>, <4 x double> addrspace(13)* %192, align 8, !dbg !2952, !tbaa !90, !alias.scope !2956, !noalias !2959
  %194 = fadd fast <4 x double> %193, %143, !dbg !2952
  store <4 x double> %194, <4 x double> addrspace(13)* %192, align 8, !dbg !2952, !tbaa !90, !alias.scope !2956, !noalias !2961
  %195 = icmp eq i64 %188, 0
  %196 = select fast i1 %195, double %159, double 0.000000e+00
  %197 = fadd fast double %196, %"'de29.2"
  %198 = add nsw i64 %188, -1
  %_unwrap105.1 = shl nuw i64 %198, 3, !dbg !2952
  %_unwrap136.1 = or i64 %_unwrap105.1, 6, !dbg !2952
  %"'ipg133_unwrap.1" = getelementptr inbounds {} addrspace(10)*, {} addrspace(10)* addrspace(13)* %8, i64 %_unwrap136.1, !dbg !2952
  %199 = bitcast {} addrspace(10)* addrspace(13)* %"'ipg133_unwrap.1" to <4 x double> addrspace(13)*, !dbg !2952
  %200 = load <4 x double>, <4 x double> addrspace(13)* %199, align 8, !dbg !2952, !tbaa !90, !alias.scope !2956, !noalias !2959
  %201 = fadd fast <4 x double> %200, %143, !dbg !2952
  store <4 x double> %201, <4 x double> addrspace(13)* %199, align 8, !dbg !2952, !tbaa !90, !alias.scope !2956, !noalias !2961
  %_unwrap175.1 = or i64 %_unwrap105.1, 2, !dbg !2952
  %"'ipg172_unwrap.1" = getelementptr inbounds {} addrspace(10)*, {} addrspace(10)* addrspace(13)* %8, i64 %_unwrap175.1, !dbg !2952
  %202 = bitcast {} addrspace(10)* addrspace(13)* %"'ipg172_unwrap.1" to <4 x double> addrspace(13)*, !dbg !2952
  %203 = load <4 x double>, <4 x double> addrspace(13)* %202, align 8, !dbg !2952, !tbaa !90, !alias.scope !2956, !noalias !2959
  %204 = fadd fast <4 x double> %203, %143, !dbg !2952
  store <4 x double> %204, <4 x double> addrspace(13)* %202, align 8, !dbg !2952, !tbaa !90, !alias.scope !2956, !noalias !2961
  %205 = icmp eq i64 %198, 0
  %206 = select fast i1 %205, double %159, double 0.000000e+00
  %207 = fadd fast double %197, %206
  %208 = add nsw i64 %188, -2
  %_unwrap105.2 = shl nuw i64 %208, 3, !dbg !2952
  %_unwrap136.2 = or i64 %_unwrap105.2, 6, !dbg !2952
  %"'ipg133_unwrap.2" = getelementptr inbounds {} addrspace(10)*, {} addrspace(10)* addrspace(13)* %8, i64 %_unwrap136.2, !dbg !2952
  %209 = bitcast {} addrspace(10)* addrspace(13)* %"'ipg133_unwrap.2" to <4 x double> addrspace(13)*, !dbg !2952
  %210 = load <4 x double>, <4 x double> addrspace(13)* %209, align 8, !dbg !2952, !tbaa !90, !alias.scope !2956, !noalias !2959
  %211 = fadd fast <4 x double> %210, %143, !dbg !2952
  store <4 x double> %211, <4 x double> addrspace(13)* %209, align 8, !dbg !2952, !tbaa !90, !alias.scope !2956, !noalias !2961
  %_unwrap175.2 = or i64 %_unwrap105.2, 2, !dbg !2952
  %"'ipg172_unwrap.2" = getelementptr inbounds {} addrspace(10)*, {} addrspace(10)* addrspace(13)* %8, i64 %_unwrap175.2, !dbg !2952
  %212 = bitcast {} addrspace(10)* addrspace(13)* %"'ipg172_unwrap.2" to <4 x double> addrspace(13)*, !dbg !2952
  %213 = load <4 x double>, <4 x double> addrspace(13)* %212, align 8, !dbg !2952, !tbaa !90, !alias.scope !2956, !noalias !2959
  %214 = fadd fast <4 x double> %213, %143, !dbg !2952
  store <4 x double> %214, <4 x double> addrspace(13)* %212, align 8, !dbg !2952, !tbaa !90, !alias.scope !2956, !noalias !2961
  %215 = icmp eq i64 %208, 0
  %216 = select fast i1 %215, double %159, double 0.000000e+00
  %217 = fadd fast double %207, %216
  %218 = add nsw i64 %188, -3
  %_unwrap105.3 = shl nuw i64 %218, 3, !dbg !2952
  %_unwrap136.3 = or i64 %_unwrap105.3, 6, !dbg !2952
  %"'ipg133_unwrap.3" = getelementptr inbounds {} addrspace(10)*, {} addrspace(10)* addrspace(13)* %8, i64 %_unwrap136.3, !dbg !2952
  %219 = bitcast {} addrspace(10)* addrspace(13)* %"'ipg133_unwrap.3" to <4 x double> addrspace(13)*, !dbg !2952
  %220 = load <4 x double>, <4 x double> addrspace(13)* %219, align 8, !dbg !2952, !tbaa !90, !alias.scope !2956, !noalias !2959
  %221 = fadd fast <4 x double> %220, %143, !dbg !2952
  store <4 x double> %221, <4 x double> addrspace(13)* %219, align 8, !dbg !2952, !tbaa !90, !alias.scope !2956, !noalias !2961
  %_unwrap175.3 = or i64 %_unwrap105.3, 2, !dbg !2952
  %"'ipg172_unwrap.3" = getelementptr inbounds {} addrspace(10)*, {} addrspace(10)* addrspace(13)* %8, i64 %_unwrap175.3, !dbg !2952
  %222 = bitcast {} addrspace(10)* addrspace(13)* %"'ipg172_unwrap.3" to <4 x double> addrspace(13)*, !dbg !2952
  %223 = load <4 x double>, <4 x double> addrspace(13)* %222, align 8, !dbg !2952, !tbaa !90, !alias.scope !2956, !noalias !2959
  %224 = fadd fast <4 x double> %223, %143, !dbg !2952
  store <4 x double> %224, <4 x double> addrspace(13)* %222, align 8, !dbg !2952, !tbaa !90, !alias.scope !2956, !noalias !2961
  %225 = icmp eq i64 %218, 0
  %226 = select fast i1 %225, double %159, double 0.000000e+00
  %227 = fadd fast double %217, %226
  %228 = add nsw i64 %188, -4
  br i1 %225, label %invertL39.loopexit, label %invertL83, !llvm.loop !3002

invertL128:                                       ; preds = %L22
  %_unwrap184 = lshr i64 %7, 1, !dbg !3003
  %_unwrap185 = add nuw nsw i64 %_unwrap184, 1, !dbg !3003
  %_unwrap186 = add nuw nsw i64 %_unwrap184, 2, !dbg !3003
  call fastcc void @diffejulia_mapreduce_impl_63711.6({} addrspace(10)* noalias nocapture nonnull readonly align 8 dereferenceable(32) %"'", i64 signext %_unwrap186, i64 signext %0, double %differeturn) #25, !dbg !3003
  call fastcc void @diffejulia_mapreduce_impl_63711.6({} addrspace(10)* noalias nocapture nonnull readonly align 8 dereferenceable(32) %"'", i64 noundef signext 1, i64 signext %_unwrap185, double %differeturn) #25, !dbg !3004
  br label %inverttop
}

cannot handle (forward) unknown intrinsic
llvm.masked.gather
  %wide.masked.gather = call <4 x double> @llvm.masked.gather.v4f64.v4p13f64(<4 x double addrspace(13)*> %56, i32 noundef 8, <4 x i1> noundef <i1 true, i1 true, i1 true, i1 true>, <4 x double> poison) #25, !dbg !49, !tbaa !58, !alias.scope !61, !noalias !66

Stacktrace:
 [1] getindex
   @ ./essentials.jl:917
 [2] macro expansion
   @ ./reduce.jl:264
 [3] macro expansion
   @ ./simdloop.jl:77
 [4] mapreduce_impl
   @ ./reduce.jl:263


Stacktrace:
  [1] getindex
    @ ./essentials.jl:917 [inlined]
  [2] macro expansion
    @ ./reduce.jl:264 [inlined]
  [3] macro expansion
    @ ./simdloop.jl:77 [inlined]
  [4] mapreduce_impl
    @ ./reduce.jl:263
  [5] mapreduce_impl
    @ ./reduce.jl:277 [inlined]
  [6] _mapreduce
    @ ./reduce.jl:444 [inlined]
  [7] _mapreduce_dim
    @ ./reducedim.jl:337 [inlined]
  [8] mapreduce
    @ ./reducedim.jl:329 [inlined]
  [9] _sum
    @ ./reducedim.jl:987 [inlined]
 [10] _sum
    @ ./reducedim.jl:986 [inlined]
 [11] sum
    @ ./reducedim.jl:982 [inlined]
 [12] g
    @ ./REPL[69]:1 [inlined]
 [13] diffejulia_g_63639wrap
    @ ./REPL[69]:0 [inlined]
 [14] macro expansion
    @ ~/.julia/packages/Enzyme/RvNgp/src/compiler.jl:8305 [inlined]
 [15] enzyme_call
    @ ~/.julia/packages/Enzyme/RvNgp/src/compiler.jl:7868 [inlined]
 [16] CombinedAdjointThunk
    @ ~/.julia/packages/Enzyme/RvNgp/src/compiler.jl:7641 [inlined]
 [17] autodiff_deferred
    @ ~/.julia/packages/Enzyme/RvNgp/src/Enzyme.jl:781 [inlined]
 [18] autodiff
    @ ~/.julia/packages/Enzyme/RvNgp/src/Enzyme.jl:512 [inlined]
 [19] gradient!
    @ ~/.julia/packages/Enzyme/RvNgp/src/Enzyme.jl:1787 [inlined]
 [20] fwddiffejulia_gradient__63612wrap
    @ ~/.julia/packages/Enzyme/RvNgp/src/Enzyme.jl:0
 [21] macro expansion
    @ ~/.julia/packages/Enzyme/RvNgp/src/compiler.jl:8305 [inlined]
 [22] enzyme_call
    @ ~/.julia/packages/Enzyme/RvNgp/src/compiler.jl:7868 [inlined]
 [23] ForwardModeThunk
    @ ~/.julia/packages/Enzyme/RvNgp/src/compiler.jl:7657 [inlined]
 [24] autodiff
    @ ~/.julia/packages/Enzyme/RvNgp/src/Enzyme.jl:647 [inlined]
 [25] autodiff
    @ ~/.julia/packages/Enzyme/RvNgp/src/Enzyme.jl:537 [inlined]
 [26] autodiff
    @ ~/.julia/packages/Enzyme/RvNgp/src/Enzyme.jl:504 [inlined]
 [27] hvp!
    @ ~/.julia/packages/Enzyme/RvNgp/src/Enzyme.jl:2498 [inlined]
 [28] hvp(f::typeof(g), x::Vector{Float64}, v::Vector{Float64})
    @ Enzyme ~/.julia/packages/Enzyme/RvNgp/src/Enzyme.jl:2465
 [29] top-level scope
    @ REPL[90]:1

@wsmoses
Copy link
Member

wsmoses commented Nov 8, 2024

when did julia emit masked gathers??? cc @gbaraldi

@wsmoses
Copy link
Member

wsmoses commented Nov 8, 2024

in particular, why is this not a masked_load ?

@wsmoses
Copy link
Member

wsmoses commented Nov 8, 2024

@gdalle I can't repro your first example locally on Julia 1.11.1, on my Linux box

julia> using Enzyme: Enzyme

julia> f(x) = sum(vec(x .^ 3) .* transpose(vec(x .^ 4)))
f (generic function with 1 method)

julia> xv, dxv = float.(1:12), ones(12);  # vector input

julia> pv = Enzyme.hvp(f, xv, dxv)  # correct result

12-element Vector{Float64}:
 518076.0
      1.374984e6
      2.617524e6
      4.292496e6
      6.4467e6
      9.126936e6
      1.2380004e7
      1.6252704e7
      2.0791836e7
      2.60442e7
      3.2056596e7
      3.8875824e7

julia> xm, dxm = float.(reshape(1:12, 4, 3)), ones(4, 3);  # matrix input

julia> pm = Enzyme.hvp(f, xm, dxm)  # only the first row is correct
4×3 Matrix{Float64}:
 518076.0        6.4467e6   2.07918e7
      1.37498e6  9.12694e6  2.60442e7
      2.61752e6  1.238e7    3.20566e7
      4.2925e6   1.62527e7  3.88758e7

julia> reshape(pv, 4, 3) - pm  # should be zero everywhere
4×3 Matrix{Float64}:
 0.0  0.0  0.0
 0.0  0.0  0.0
 0.0  0.0  0.0
 0.0  0.0  0.0

julia> 

@gdalle
Copy link
Contributor Author

gdalle commented Nov 8, 2024

Things are getting stranger and stranger. But at least that behavior is also present in my CI on 1.11, I didn't dream it:

https://github.com/JuliaDiff/DifferentiationInterface.jl/actions/runs/11723355094/job/32654862255?pr=615#step:5:330

The first few test failures are all happening during comparison of an HVP output to the expected result from the test scenario. As you can see in the CI log, the HVPs computed by DI using Enzyme (which is essentially the same as Enzyme.hvp, i.e. forward over reverse) all have zeros in the second row, not in the first row.

@gdalle
Copy link
Contributor Author

gdalle commented Nov 8, 2024

So, to sum up:

  • Guillaume's linux: only first row correct
  • CI's linux: same
  • Guillaume's OSX: error
  • Billy's linux: works like a charm

@wsmoses
Copy link
Member

wsmoses commented Nov 8, 2024

The osx error I can resolve without extra stuff.

The correctness one concerns me.

I really need you to minimize this further. Eg desugar broadcast hvp, whatever else you can.

Then maybe I have a chance of reproducing and thus fixing it

@gdalle
Copy link
Contributor Author

gdalle commented Nov 8, 2024

I'll try when I get back to my linux laptop, probably next week, maybe this weekend if I have some time

@gdalle
Copy link
Contributor Author

gdalle commented Nov 9, 2024

I removed the hvp sugar and found a tipping point where we go from correct behavior to incorrect:

using Enzyme

function mygradient(f, x)
    g = make_zero(x)
    autodiff(Reverse, f, Active, Duplicated(x, g))
    return g
end

myhvp(f, x, dx) = return autodiff(Forward, mygradient, Const(f), Duplicated(x, dx))[1]

xm, dxm = Matrix{Float64}(reshape(1:12, 4, 3)), ones(4, 3);  # matrix input

f1(x) = sum(vec(x) .* transpose(vec(x .^ 2)))
f2(x) = sum(vec(x .^ 2) .* transpose(vec(x .^ 2)))
julia> pm1 = myhvp(f1, xm, dxm)
4×3 Matrix{Float64}:
 336.0  432.0  528.0
 360.0  456.0  552.0
 384.0  480.0  576.0
 408.0  504.0  600.0

julia> pm2 = myhvp(f2, xm, dxm)
freeing without malloc   %_cache300.i.0 = phi double* [ undef, %L126.i ], [ %190, %L347.loopexit.i ]
freeing without malloc   %_cache323.i.0 = phi double* [ undef, %L492.i ], [ %422, %L713.i.loopexit ]
4×3 Matrix{Float64}:
 3224.0  5720.0  8216.0
    0.0     0.0     0.0
    0.0     0.0     0.0
    0.0     0.0     0.0

This was on my linux laptop:

julia> versioninfo()
Julia Version 1.11.1
Commit 8f5b7ca12ad (2024-10-16 10:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × Intel(R) Core(TM) i7-8665U CPU @ 1.90GHz
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, skylake)
Threads: 1 default, 0 interactive, 1 GC (on 8 virtual cores)
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 1

(@di) pkg> st
Status `~/.julia/environments/di/Project.toml`
  ...
  [7da242da] Enzyme v0.13.14
  ...

@gdalle
Copy link
Contributor Author

gdalle commented Nov 9, 2024

This test function may seem random but it's an easy and GPU-friendly way to get a Hessian where every coefficient $\partial^2 f / \partial x_i \partial x_j$. is non-zero and depends on the actual value of both $x_i$ and $x_j$

@wsmoses
Copy link
Member

wsmoses commented Nov 9, 2024

yeah for sure but broadcasting adds a ton of misc bloat, if you can try to reduce while maintaining the failure

@gdalle
Copy link
Contributor Author

gdalle commented Nov 9, 2024

yeah for sure but broadcasting adds a ton of misc bloat, if you can try to reduce while maintaining the failure

I've tried but did not succeed. As shown above, the version written as a loop gives the correct result. So the issue is probably with broadcasting itself.

@gdalle
Copy link
Contributor Author

gdalle commented Nov 11, 2024

Is there anything else I can do to help? I understand that this function is not the simplest you can possibly conceive, but it is still pretty elementary. And most importantly, it satisfies the essential conditions for my DI test suite:

  • type-stable
  • GPU friendly
  • nontrivial Hessian (every pair of inputs interacts)

@wsmoses
Copy link
Member

wsmoses commented Nov 11, 2024

Yeah I think just simplifying it (eg remove broadcast and call say Base.broadcast etc, and then simplifying the use of that code)

@wsmoses
Copy link
Member

wsmoses commented Nov 11, 2024

Broadcasting code generates a bunch of things and isn’t really elementary from what Julia generates =/

@gdalle
Copy link
Contributor Author

gdalle commented Nov 11, 2024

Yeah I think just simplifying it (eg remove broadcast and call say Base.broadcast etc, and then simplifying the use of that code)

As mentioned in this comment, I tried simplifying it but the version with just a manual loop does not trigger the same error. So it is probably due to broadcasting itself. I'll try with Base.broadcast.

Broadcasting code generates a bunch of things and isn’t really elementary from what Julia generates =/

I understand that but it is still a fundamental part of the language, so Enzyme needs to support it too. If you don't think it should be supported, then users need to be warned that Enzyme may return wrong results if their functions include broadcasts.

@wsmoses
Copy link
Member

wsmoses commented Nov 11, 2024

I agree and we do support broadcasting.

However you asked what you could do to help and I pointed you to how to simplify broadcasting code in this case while hopefully retaining the error.

I often use @which and @code_lowered to figure out what functions are called to try to inline, and simplify

@gdalle
Copy link
Contributor Author

gdalle commented Nov 11, 2024

I'll do my best

@gdalle
Copy link
Contributor Author

gdalle commented Nov 12, 2024

I started to peel off the layers of complexity but I hit a snag right away. When I replace abs2.(x) with myabs2(x) = abs2.(x) (in the hope of removing the broadcast later on), Enzyme starts throwing an error and I can't figure out why. What's the big difference between f and f_bis in the code below??

using Enzyme

function mygradient(f, x)
    g = make_zero(x)
    autodiff(Reverse, f, Active, Duplicated(x, g))
    return g
end

myhvp(f, x, dx) = return autodiff(Forward, mygradient, Const(f), Duplicated(x, dx))[1]

myabs2(x) = abs2.(x)

function f(x)
    a = vec(abs2.(x))
    b = transpose(vec(abs2.(x)))
    return sum(a .* b)
end

function f_bis(x)
    a = vec(myabs2(x))
    b = transpose(vec(myabs2(x)))
    return sum(a .* b)
end

xv, dxv = Vector{Float64}(1:4), ones(4);  # vector input
julia> myhvp(f, xv, dxv)
4-element Vector{Float64}:
 200.0
 280.0
 360.0
 440.0

julia> myhvp(f_bis, xv, dxv)
ERROR: Constant memory is stored (or returned) to a differentiable variable.
As a result, Enzyme cannot provably ensure correctness and throws this error.
This might be due to the use of a constant variable as temporary storage for active memory (https://enzyme.mit.edu/julia/stable/faq/#Runtime-Activity).
If Enzyme should be able to prove this use non-differentable, open an issue!
To work around this issue, either:
 a) rewrite this variable to not be conditionally active (fastest, but requires a code change), or
 b) set the Enzyme mode to turn on runtime activity (e.g. autodiff(set_runtime_activity(Reverse), ...) ). This will maintain correctness, but may slightly reduce performance.
Mismatched activity for:   %.unpack.fca.10.insert = insertvalue { {} addrspace(10)*, {} addrspace(10)*, {} addrspace(10)*, {} addrspace(10)*, i8*, {} addrspace(10)*, {} addrspace(10)*, i8*, {} addrspace(10)*, i64, {} addrspace(10)*, i64, double*, {} addrspace(10)*, {} addrspace(10)*, {} addrspace(10)**, {} addrspace(10)*, i1 } %.unpack.fca.9.insert, {} addrspace(10)* %.sroa.27.0, 10, !dbg !80 const val:   %.sroa.27.0 = phi {} addrspace(10)* [ %11, %L8 ], [ @ejl_jl_nothing, %L10 ]
 value=Unknown object of type Union{Nothing, Memory{Float64}}
 llvalue=  %.sroa.27.0 = phi {} addrspace(10)* [ %11, %L8 ], [ @ejl_jl_nothing, %L10 ]

Stacktrace:
 [1] copy
   @ ./broadcast.jl:892
 [2] materialize
   @ ./broadcast.jl:867
 [3] myabs2
   @ ~/Work/GitHub/Julia/DifferentiationInterface.jl/DifferentiationInterface/test/playground.jl:11

Stacktrace:
  [1] copy
    @ ./broadcast.jl:892 [inlined]
  [2] materialize
    @ ./broadcast.jl:867 [inlined]
  [3] myabs2
    @ ~/Work/GitHub/Julia/DifferentiationInterface.jl/DifferentiationInterface/test/playground.jl:11
  [4] f_bis
    @ ~/Work/GitHub/Julia/DifferentiationInterface.jl/DifferentiationInterface/test/playground.jl:20 [inlined]
  [5] diffejulia_f_bis_61640wrap
    @ ~/Work/GitHub/Julia/DifferentiationInterface.jl/DifferentiationInterface/test/playground.jl:0 [inlined]
  [6] macro expansion
    @ ~/.julia/packages/Enzyme/RvNgp/src/compiler.jl:8305 [inlined]
  [7] enzyme_call
    @ ~/.julia/packages/Enzyme/RvNgp/src/compiler.jl:7868 [inlined]
  [8] CombinedAdjointThunk
    @ ~/.julia/packages/Enzyme/RvNgp/src/compiler.jl:7641 [inlined]
  [9] autodiff_deferred
    @ ~/.julia/packages/Enzyme/RvNgp/src/Enzyme.jl:781 [inlined]
 [10] autodiff
    @ ~/.julia/packages/Enzyme/RvNgp/src/Enzyme.jl:512 [inlined]
 [11] mygradient
    @ ~/Work/GitHub/Julia/DifferentiationInterface.jl/DifferentiationInterface/test/playground.jl:5 [inlined]
 [12] fwddiffejulia_mygradient_61610wrap
    @ ~/Work/GitHub/Julia/DifferentiationInterface.jl/DifferentiationInterface/test/playground.jl:0
 [13] macro expansion
    @ ~/.julia/packages/Enzyme/RvNgp/src/compiler.jl:8305 [inlined]
 [14] enzyme_call
    @ ~/.julia/packages/Enzyme/RvNgp/src/compiler.jl:7868 [inlined]
 [15] ForwardModeThunk
    @ ~/.julia/packages/Enzyme/RvNgp/src/compiler.jl:7657 [inlined]
 [16] autodiff
    @ ~/.julia/packages/Enzyme/RvNgp/src/Enzyme.jl:647 [inlined]
 [17] autodiff(::ForwardMode{…}, ::Const{…}, ::Const{…}, ::Duplicated{…})
    @ Enzyme ~/.julia/packages/Enzyme/RvNgp/src/Enzyme.jl:537
 [18] autodiff
    @ ~/.julia/packages/Enzyme/RvNgp/src/Enzyme.jl:504 [inlined]
 [19] myhvp(f::Function, x::Vector{Float64}, dx::Vector{Float64})
    @ Main ~/Work/GitHub/Julia/DifferentiationInterface.jl/DifferentiationInterface/test/playground.jl:9
 [20] top-level scope
    @ ~/Work/GitHub/Julia/DifferentiationInterface.jl/DifferentiationInterface/test/playground.jl:29
Some type information was truncated. Use `show(err)` to see complete types.

@wsmoses
Copy link
Member

wsmoses commented Nov 12, 2024

julia inlining is fun and non obvious. Maybe mark the function @ inline?

@gdalle
Copy link
Contributor Author

gdalle commented Nov 12, 2024

Okay, I have dived deep into the broadcasting logic and inlined as much as I could. What I find is that the incorrectness hinges on the @simd annotation in this loop:

https://github.com/JuliaLang/julia/blob/505907bd11618e97e9f8d565487cf245df772362/base/broadcast.jl#L970-L976

In the code below, f_bad reproduces the behavior of the built-in broadcast (and gives the wrong HVP), while f_good removes the @simd (and gives the right HVP). @wsmoses do you think you can take it from here?

using Base.Broadcast:
    Broadcasted, DefaultArrayStyle, materialize, preprocess, preprocess_args
using Enzyme

function mygradient(f, x)
    g = make_zero(x)
    autodiff(Reverse, f, Active, Duplicated(x, g))
    return g
end

myhvp(f, x, dx) = return autodiff(Forward, mygradient, Const(f), Duplicated(x, dx))[1]

@inline cube(x) = x^3

@inline function arraycube_bad(x::AbstractArray{N}) where {N}
    y = similar(x)
    args = (x,)
    style = nothing
    ax = axes(x)
    bc = Broadcasted(style, cube, args, ax)
    bc´ = preprocess(y, bc)
    @inbounds @simd for i in eachindex(bc´)
        y[i] = bc´[i]
    end
    return y
end

@inline function arraycube_good(x)
    y = similar(x)
    args = (x,)
    style = nothing
    ax = axes(x)
    bc = Broadcasted(style, cube, args, ax)
    bc´ = preprocess(y, bc)
    @inbounds for i in eachindex(bc´)
        y[i] = bc´[i]
    end
    return y
end

f_bad(x) = sum(arraycube_bad(x))
f_good(x) = sum(arraycube_good(x))

xm, dxm = Matrix{Float64}(reshape(1:12, 4, 3)), ones(4, 3);  # matrix input
julia> myhvp(f_bad, xm, dxm)
freeing without malloc   %_cache120.i.0 = phi double* [ undef, %L103.i ], [ %253, %L310.i.loopexit ]
4×3 Matrix{Float64}:
 6.0  30.0  54.0
 0.0   0.0   0.0
 0.0   0.0   0.0
 0.0   0.0   0.0

julia> myhvp(f_good, xm, dxm)
4×3 Matrix{Float64}:
  6.0  30.0  54.0
 12.0  36.0  60.0
 18.0  42.0  66.0
 24.0  48.0  72.0

@wsmoses
Copy link
Member

wsmoses commented Nov 16, 2024

Would it be possible to remove the preprocess/broadcasted parts?

@gdalle
Copy link
Contributor Author

gdalle commented Nov 16, 2024

I'm not at my work computer right now but if I remember correctly, I had tried removing the preprocess and the incorrectness remained, in fact it was even worse (I think the whole HVP was uniformly zero, not just the n-1 last rows).
I'll try to see if I can remove the Broadcasted but since the bug is linked to broadcasting, I fear this may not be illuminating.

@gdalle
Copy link
Contributor Author

gdalle commented Nov 16, 2024

@wsmoses I finally have a minimum working (or failing) example without Broadcasted!
The weird behavior seems to come from the interaction between @simd and CartesianIndices. When you remove either of those, it works.

using Enzyme

function mygradient(f, x)
    g = make_zero(x)
    autodiff(Reverse, f, Active, Duplicated(x, g))
    return g
end

myhvp(f, x, dx) = return autodiff(Forward, mygradient, Const(f), Duplicated(x, dx))[1]

@inline function arraycube_bad(x::AbstractArray{N}) where {N}
    y = similar(x)
    @inbounds @simd for i in CartesianIndices(y)
        y[i] = x[i]^3
    end
    return y
end

@inline function arraycube_good1(x)
    y = similar(x)
    @inbounds for i in CartesianIndices(y)
        y[i] = x[i]^3
    end
    return y
end

@inline function arraycube_good2(x)
    y = similar(x)
    @inbounds @simd for i in eachindex(y)
        y[i] = x[i]^3
    end
    return y
end

f_bad(x) = sum(arraycube_bad(x))
f_good1(x) = sum(arraycube_good1(x))
f_good2(x) = sum(arraycube_good2(x))

xm, dxm = Matrix{Float64}(reshape(1:12, 4, 3)), ones(4, 3);  # matrix input

@assert f_bad(xm) == f_good1(xm) == f_good2(xm)

myhvp(f_bad, xm, dxm)  # incorrect
myhvp(f_good1, xm, dxm)  # correct
myhvp(f_good2, xm, dxm)  # correct

@wsmoses
Copy link
Member

wsmoses commented Nov 16, 2024

Awesome, yeah this should be enough to go on.

@wsmoses
Copy link
Member

wsmoses commented Nov 17, 2024

However, sadly on my laptop I still get correct answers...

julia> myhvp(f_bad, xm, dxm)  # incorrect

freeing without malloc   %_cache174.i.0 = phi i8* [ undef, %L23.i ], [ %_malloccache175.i, %L193.i.loopexit ]
4×3 Matrix{Float64}:
  6.0  30.0  54.0
 12.0  36.0  60.0
 18.0  42.0  66.0
 24.0  48.0  72.0

julia> myhvp(f_good1, xm, dxm)  # correct
4×3 Matrix{Float64}:
  6.0  30.0  54.0
 12.0  36.0  60.0
 18.0  42.0  66.0
 24.0  48.0  72.0

julia> myhvp(f_good2, xm, dxm)  # correct
4×3 Matrix{Float64}:
  6.0  30.0  54.0
 12.0  36.0  60.0
 18.0  42.0  66.0
 24.0  48.0  72.0

Let me try my desktop, and if that doesn't fail I'll need your help to figure out an even more specific example which triggers the issue (e.g. try to remove simd, inline cartesianindex iterators into a while loop of the iterate function, possibly even using llvmcall on the exact input IR)

@wsmoses
Copy link
Member

wsmoses commented Nov 17, 2024

Yeah it also gets the right answer on my linux box.

Some more reduction ideas in interim: Maybe also try to get rid of sum? [e.g. just use first, or potentially do a manual for loop over, and possibly inbounds].

Does squaring work instead of cube

It doesn't matter if it computes something different ffrom the original fn, so long as the result is wrong

@gdalle
Copy link
Contributor Author

gdalle commented Nov 17, 2024

My machine's specs are as follows:

julia> versioninfo()
Julia Version 1.11.1
Commit 8f5b7ca12ad (2024-10-16 10:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 12 × Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, skylake)
Threads: 1 default, 0 interactive, 1 GC (on 12 virtual cores)

Here's what I get with the previous example, both on Enzyme v0.13.14 and on the brand new Enzyme v0.13.15:

julia> myhvp(f_bad, xm, dxm)  # incorrect
freeing without malloc   %_cache37.i.0 = phi double* [ undef, %L23.i ], [ %134, %L193.i.loopexit ]
freeing without malloc   %_cache84.i.0 = phi double* [ undef, %L23.i ], [ %209, %L193.i.loopexit ]
4×3 Matrix{Float64}:
 0.0  0.0  0.0
 0.0  0.0  0.0
 0.0  0.0  0.0
 0.0  0.0  0.0

julia> myhvp(f_good1, xm, dxm)  # correct
4×3 Matrix{Float64}:
  6.0  30.0  54.0
 12.0  36.0  60.0
 18.0  42.0  66.0
 24.0  48.0  72.0

julia> myhvp(f_good2, xm, dxm)  # correct
4×3 Matrix{Float64}:
  6.0  30.0  54.0
 12.0  36.0  60.0
 18.0  42.0  66.0
 24.0  48.0  72.0

Note the following warning, which was a staple of every one of my previous attempts too. When the HVP is correct, it doesn't appear.

freeing without malloc   %_cache84.i.0 = phi double* [ undef, %L23.i ], [ %209, %L193.i.loopexit ]

On your platform it is slightly different:

freeing without malloc   %_cache174.i.0 = phi i8* [ undef, %L23.i ], [ %_malloccache175.i, %L193.i.loopexit ]

Any idea what this could mean?

@gdalle
Copy link
Contributor Author

gdalle commented Nov 17, 2024

Good catch with sum.
When I replace it with first, f_bad becomes correct again.
When I implement sum manually, same thing.

@inline function mysum(y::AbstractArray)
    s = zero(eltype(y))
    @inbounds @simd for i in eachindex(y)
        s += y[i]
    end
    return s
end

@wsmoses
Copy link
Member

wsmoses commented Nov 17, 2024

Can you try inlining the iterate in the for loop and the sum?

And yeah the machine specs are helpful, but without the ability to repro your issue I need something more precise to debug it.

@wsmoses
Copy link
Member

wsmoses commented Nov 17, 2024

Those are just warnings, I also got them on my end alongside correct answers

@gdalle
Copy link
Contributor Author

gdalle commented Nov 17, 2024

I'll try more inlining but I think you might be able to reproduce it on GitHub's CI machines, after all that's where I first diagnosed it. If you want I can set you up with a simple repo containing only this test, so that you can investigate too?

@wsmoses
Copy link
Member

wsmoses commented Nov 23, 2024

yeah I guess try de-sugaring the julia iterator through the CartesianIndices and see if you can find a simpler gap?

@gdalle
Copy link
Contributor Author

gdalle commented Nov 23, 2024

I got rid of CartesianIndices, @inbounds and @simd, managed to boil it down to a problem with two nested for loops versus one:

using Enzyme

function mygradient(f, x)
    g = make_zero(x)
    autodiff(Reverse, f, Active, Duplicated(x, g))
    return g
end

myhvp(f, x, dx) = return autodiff(Forward, mygradient, Const(f), Duplicated(x, dx))[1]

@inline function arraycube_bad(x::AbstractArray{N}) where {N}
    y = similar(x)
    for i in axes(y, 1)
        for j in axes(y, 2)
            y[i, j] = x[i, j]^3
        end
    end
    return y
end

@inline function arraycube_good(x)
    y = similar(x)
    for k in eachindex(y)
        y[k] = x[k]^3
    end
    return y
end

f_bad(x) = sum(arraycube_bad(x))
f_good(x) = sum(arraycube_good(x))
julia> xm, dxm = Matrix{Float64}(reshape(1:12, 4, 3)), ones(4, 3);

julia> myhvp(f_bad, xm, dxm)
4×3 Matrix{Float64}:
 0.0  0.0  0.0
 0.0  0.0  0.0
 0.0  0.0  0.0
 0.0  0.0  0.0

julia> myhvp(f_good, xm, dxm)
4×3 Matrix{Float64}:
  6.0  30.0  54.0
 12.0  36.0  60.0
 18.0  42.0  66.0
 24.0  48.0  72.0

I have also set up a repo with GitHub actions for testing, where the error is reliably reproducible in CI. That way you can experiment with it even it your hardware behaves differently. Check it out here: https://github.com/gdalle/EnzymeErrorReproducer.jl.

Can you take it from here or do I need to dig even further?

@wsmoses
Copy link
Member

wsmoses commented Nov 24, 2024

can you add me to the repo?

I'll see what I can do [but if it fails to reproduce again for me, I may ask for more help here]

@wsmoses
Copy link
Member

wsmoses commented Nov 24, 2024

Ironically @inbounds actually makes it simpler, if it still fails with that

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants