Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a copy(::LazyBranch) method to prevent branch materialization on … #286

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

grasph
Copy link
Member

@grasph grasph commented Oct 13, 2023

For a LazyBranch br, Base.copy(br) was returning a vector instead of a LazyBranch and leading to uncessary read. This PR fix the issue.

…copy

In absence of such method, the copy(::AbstractVector) method that resulted
in the materialization into a Vector was called.
@codecov
Copy link

codecov bot commented Oct 13, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 86.33%. Comparing base (799b290) to head (de28e08).
Report is 24 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #286      +/-   ##
==========================================
- Coverage   87.82%   86.33%   -1.49%     
==========================================
  Files          18       18              
  Lines        2366     2445      +79     
==========================================
+ Hits         2078     2111      +33     
- Misses        288      334      +46     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@Moelf
Copy link
Member

Moelf commented Oct 13, 2023

I'm not so sure about this one. For example, copy(reinterpret(...)) also changes type and return a Vector(). I think it could be argued that current behavior is the correct one already

@Moelf
Copy link
Member

Moelf commented Oct 13, 2023

Arrow.jl

julia> d = Arrow.Table("/tmp/a.feather");

julia> d.x #this is memory-mapped
100-element Arrow.Primitive{Int64, Vector{Int64}}:
...

julia> copy(d.x)
100-element Vector{Int64}:
...

@grasph
Copy link
Member Author

grasph commented Oct 13, 2023

According the Base.copy documentation it must "create a shallow copy of x: the outer structure is copied, but not all internal values."

@Moelf
Copy link
Member

Moelf commented Oct 13, 2023

but then you also have this from Julia itself being weird/inconsistent:

julia> a = rand(2,2)'
2×2 adjoint(::Matrix{Float64}) with eltype Float64:
 0.651229  0.952613
 0.723043  0.00959459

julia> copy(a)
2×2 Matrix{Float64}:
 0.651229  0.952613
 0.723043  0.00959459

julia> a = rand(2)'
1×2 adjoint(::Vector{Float64}) with eltype Float64:
 0.29677  0.177136

julia> copy(a)
1×2 adjoint(::Vector{Float64}) with eltype Float64:
 0.29677  0.177136

edit: it seems the adjoint of vector is a corner case because it's behavior is different from one-row matrix, but the intention here is to change type and return Array.

Let's hold off for a bit and I will ask other Julia folks on Slack about design/expectation. Potentially request a Documentation update

@Moelf
Copy link
Member

Moelf commented Oct 13, 2023

another thing is I just realized we no longer need mutable struct LazyBranch, it can be immutable.

But after making it immutable it breaks CI and seems like allocation in certain operation is increased somehow.

If we can make it immutable then clearly copy should just be a no-op. Except in this case we need clarification from others regarding copy(<:AbstractVector).

@grasph
Copy link
Member Author

grasph commented Oct 13, 2023

My idea was that the LazyBranch was conceptually immutable and this would avoid unnecessary materialization with DataFrame if copycols=false is not passed.

Looking at the DataFrames documentation, the behavior of Arrow.Table is to allow conversion of immutable to mutable columns: "[to] get mutable columns from an immutable input table (like Arrow.Table), pass copycols=true explicitly." Changing the immutable nature of the object is arguable, although here it has some practical use here. The issue is that copycols=true is the default, which can be very expansive in HEP use case.

The case of copy(A::Transpose) and copy(A::Adjoint) is special. According to the documentation it "eagerly evaluate[s] the lazy matrix transpose/adjoint.". For sure return an object of the same type as the original is not wrong, it is changing the type which is an exception.

@Moelf
Copy link
Member

Moelf commented Oct 13, 2023

The issue is that copycols=true is the default, which can be very expansive in HEP use case

But it is what it is, user of DataFrame needs to know what's default behavior. The problem of overriding copy is what user should do if they indeed want copycols=true?

I agree with your argument but I don't know if this is overly intrusive and confusing (i.e. what if user wants a mutable copy of a column?)

@grasph
Copy link
Member Author

grasph commented Oct 13, 2023

Concerning the method copy(<:AbstractVector), is is not documented. So I expect it can be considered as fallback implementation and should not be considered as providing the current behavior for all subtypes.

There is a Base.copymutable() function to turn an immutable into a mutable e.g, it turns a tuple to a vector.

@Moelf
Copy link
Member

Moelf commented Oct 13, 2023

I got this comment from @vtjnash so maybe we should just return the original object, what do you think?

But maybe the consideration is that, iterating the copy() shouldn't change buffer of the original?

image

@grasph
Copy link
Member Author

grasph commented Oct 15, 2023

I think both options w and w/o copy of the buffer are valid. Independent buffer should provide better performance in case of concurrent access on two copies as it will prevent buffer invalidation, while no copy option minimizes memory allocations.

About this PR in general, since it is not clear what the best option is, I propose to:

  • postpone the decision to specialize the copy function until we have a better view of the use cases;
  • document the possibility for the user to disable the materialization by defining himself the copy method as a no-op.

That way, we will keep the options opened. What do you think of this plan ?

Philippe.

@Moelf
Copy link
Member

Moelf commented Oct 15, 2023

that sounds good to me!

I personally think no-op is okay, because I *think vast majority use cases are similar to when you want to do something in a different container type, e.g. DataFrame, the user probably stops interacting with the original LazyBranch variable.

To me preventing full materialization seems useful enough

@Moelf
Copy link
Member

Moelf commented Oct 16, 2023

then that is a violation of the copy API, which states that operations on b do not affect a

ah ok we can't do no-op, so I think we know exactly what we can do in this case.

@tamasgal
Copy link
Member

This is really difficult since copying something lazy needs to decouple the structure from its laziness in order to fulfil what copy signals: having something which I can operate on without modifying the original thing.

We need to either break the laziness and materialise everything inside the structure (but then it's not a lazy thing anymore) or we create a completely different object with it's own references and independent caches, which sounds like a source of happy bugs 🙈

I don't know... I think changing the type is also a bit of a surprise, even though it makes sense in certain applications.

I agree that we should figure out what a user expects to be served when doing a copy of lazy branches.

src/iteration.jl Outdated Show resolved Hide resolved
src/iteration.jl Outdated Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants