Add a copy(::LazyBranch) method to prevent branch materialization on … #286
…copy In the absence of such a method, the copy(::AbstractVector) method was called, which materialized the branch into a Vector.
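The fallback described above can be reproduced with a toy type. This is only a sketch: `ToyLazyVector` is a hypothetical stand-in invented for illustration, not the package's LazyBranch, and the dedicated `copy` method shown is one possible choice rather than the method added by this PR.

```julia
# Hypothetical stand-in for a lazy branch: elements are computed on access.
struct ToyLazyVector <: AbstractVector{Int}
    n::Int
end

Base.size(v::ToyLazyVector) = (v.n,)
Base.getindex(v::ToyLazyVector, i::Int) = i^2   # pretend this is an expensive read

v = ToyLazyVector(5)

# Without a dedicated method, the generic array fallback reads every element
# and returns a plain Vector.
materialized = copy(v)
@assert materialized isa Vector{Int}

# A dedicated method keeps the lazy type. The struct is immutable and holds
# only an Int, so rebuilding it is enough in this toy case.
Base.copy(v::ToyLazyVector) = ToyLazyVector(v.n)

lazy_copy = copy(v)
@assert lazy_copy isa ToyLazyVector
@assert lazy_copy == materialized   # same elements, nothing stored
```

With the dedicated method in place, no element is read until the copy is actually indexed.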
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## main #286 +/- ##
==========================================
- Coverage 87.82% 86.33% -1.49%
==========================================
Files 18 18
Lines 2366 2445 +79
==========================================
+ Hits 2078 2111 +33
- Misses 288 334 +46
I'm not so sure about this one. For example, in Arrow.jl:
julia> d = Arrow.Table("/tmp/a.feather");
julia> d.x # this is memory-mapped
100-element Arrow.Primitive{Int64, Vector{Int64}}:
...
julia> copy(d.x)
100-element Vector{Int64}:
...
According to the Base.copy documentation, it must "create a shallow copy of x: the outer structure is copied, but not all internal values."
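As a small illustration of the quoted wording, here is what "shallow" means for a plain nested Vector. This only uses standard Base behavior; the variable names are arbitrary.

```julia
inner = [1, 2]
outer = [inner, [3, 4]]
shallow = copy(outer)

# The outer structure is independent: growing the copy leaves the original alone.
push!(shallow, [5, 6])
@assert length(outer) == 2

# The inner values are shared: mutating one through the copy is visible
# through the original.
shallow[1][1] = 99
@assert outer[1][1] == 99
@assert shallow[1] === inner
```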
But then you also have this from Julia itself being weird/inconsistent:
julia> a = rand(2,2)'
2×2 adjoint(::Matrix{Float64}) with eltype Float64:
0.651229 0.952613
0.723043 0.00959459
julia> copy(a)
2×2 Matrix{Float64}:
0.651229 0.952613
0.723043 0.00959459
julia> a = rand(2)'
1×2 adjoint(::Vector{Float64}) with eltype Float64:
0.29677 0.177136
julia> copy(a)
1×2 adjoint(::Vector{Float64}) with eltype Float64:
0.29677 0.177136
edit: it seems the adjoint of a vector is a corner case because its behavior is different from
Let's hold off for a bit; I will ask other Julia folks on Slack about the design/expectation, and potentially request a documentation update.
Another thing: I just realized we no longer need But after making it immutable, it breaks CI, and allocation in certain operations seems to increase somehow. If we can make it immutable then clearly
My idea was that the LazyBranch is conceptually immutable, and that this method would avoid unnecessary materialization with DataFrame when copycols=false is not passed. Looking at the DataFrames documentation, the behavior for Arrow.Table is to allow conversion of immutable to mutable columns: "[to] get mutable columns from an immutable input table (like Arrow.Table), pass copycols=true explicitly." Changing the immutable nature of the object is debatable, although it has some practical use here. The issue is that copycols=true is the default, which can be very expensive in HEP use cases. The case of copy(A::Transpose) and copy(A::Adjoint) is special. According to the documentation, it "eagerly evaluate[s] the lazy matrix transpose/adjoint." Returning an object of the same type as the original is certainly not wrong; it is changing the type that is the exception.
But it is what it is: users of DataFrame need to know what the default behavior is. The problem with overriding copy is: what should users do if they really do want copycols=true? I agree with your argument, but I don't know whether this is overly intrusive and confusing (i.e., what if a user wants a mutable copy of a column?)
Concerning the method There is a Base.copymutable() function that turns an immutable collection into a mutable one; e.g., it turns a tuple into a vector.
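For reference, a minimal sketch of what Base.copymutable does with immutable inputs. Note that it is an unexported Base function, so it has to be called with the `Base.` prefix, and relying on it in package code is a judgment call.

```julia
t = (1, 2, 3)

# A tuple is immutable; copymutable returns a mutable Vector with the same elements.
m = Base.copymutable(t)
@assert m isa Vector{Int}

m[1] = 10
@assert t == (1, 2, 3)      # the original tuple is untouched

# The same applies to an immutable array-like object such as a range.
r = Base.copymutable(1:3)
@assert r isa Vector{Int}
@assert r == [1, 2, 3]
```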
I got this comment from @vtjnash, so maybe we should just return the original object. What do you think? But maybe the consideration is that, iterating the
I think both options, with and without a copy of the buffer, are valid. An independent buffer should provide better performance in case of concurrent access on two copies, as it prevents buffer invalidation, while the no-copy option minimizes memory allocations. About this PR in general, since it is not clear what the best option is, I propose to:
That way, we will keep the options open. What do you think of this plan? Philippe.
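The two buffer options mentioned above can be sketched with a toy container. Everything here is hypothetical and invented for illustration (`ToyCachedBranch`, `read_entry!`, `copy_shared`, `copy_independent`); it is not how the package stores its buffers, only a way to see the trade-off.

```julia
# Hypothetical lazy container with a small read cache (not the package's type).
struct ToyCachedBranch
    source::Vector{Int}      # stands in for the data on disk
    cache::Dict{Int,Int}     # entries that have already been read
end

ToyCachedBranch(source::Vector{Int}) = ToyCachedBranch(source, Dict{Int,Int}())

# Read entry i, filling the cache on a miss.
function read_entry!(b::ToyCachedBranch, i::Int)
    get!(b.cache, i) do
        b.source[i]
    end
end

# Option 1: the copy shares the cache (no extra allocation).
copy_shared(b::ToyCachedBranch) = ToyCachedBranch(b.source, b.cache)

# Option 2: the copy gets its own cache (independent state).
copy_independent(b::ToyCachedBranch) = ToyCachedBranch(b.source, copy(b.cache))

b = ToyCachedBranch([10, 20, 30])
read_entry!(b, 1)

s = copy_shared(b)
d = copy_independent(b)
read_entry!(s, 2)   # also fills b's cache, since the Dict is shared
read_entry!(d, 3)   # only fills d's cache

@assert haskey(b.cache, 2)
@assert !haskey(b.cache, 3)
```

In the shared variant, reads through one copy change what the other sees in its cache, which is the invalidation concern raised for concurrent access; the independent variant avoids that at the cost of duplicating the cache.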
That sounds good to me! I personally think To me, preventing full materialization seems useful enough.
Ah, OK, we can't do
This is really difficult, since copying something lazy needs to decouple the structure from its laziness in order to fulfil what We need to either break the laziness and materialise everything inside the structure (but then it's not a lazy thing anymore), or create a completely different object with its own references and independent caches, which sounds like a source of happy bugs 🙈 I don't know... I think changing the type is also a bit of a surprise, even though it makes sense in certain applications. I agree that we should figure out what a user expects to be served when doing a copy of lazy branches.
For a LazyBranch br, Base.copy(br) was returning a Vector instead of a LazyBranch, leading to unnecessary reads. This PR fixes the issue.