feat: Only return parquet metadata if intending to write #549
Codecov Report
@@ Coverage Diff @@
## main #549 +/- ##
==========================================
- Coverage 93.06% 92.93% -0.14%
==========================================
Files 23 22 -1
Lines 3290 3395 +105
==========================================
+ Hits 3062 3155 +93
- Misses 228 240 +12
|
@pfackeldey : added fire_and_forget flag to to_parquet. Give it a try? You must already have your dask client instantiated. |
Thank you for adding this option @martindurant so quickly! |
@martindurant could we also get an option for the tree reduce? |
Reducing N*None -> None in each reduction? I suppose so. I'll get back to you in about an hour. Naturally, all these approaches are mutually exclusive, and purely experimental for now. |
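The "N*None -> None" idea can be sketched in plain Python. This is only an illustration under stated assumptions, not the dask-awkward implementation: `write_partition`, `combine_nones`, `tree_reduce`, and the `split_every` parameter are hypothetical names, and the "write" is simulated by appending to a list. Each write returns `None`, and every reduction step collapses a small group of `None` results into one `None`, so no single task ever has to hold all N results at once.

```python
from typing import Callable, List, Sequence


def write_partition(index: int, written: List[int]) -> None:
    """Hypothetical stand-in for writing one partition to a parquet file."""
    written.append(index)  # the side effect is the point; nothing is returned
    return None


def combine_nones(results: Sequence[None]) -> None:
    """Collapse a group of None results into a single None."""
    if any(r is not None for r in results):
        raise ValueError("expected only None results")
    return None


def tree_reduce(items: Sequence, combine: Callable, split_every: int = 4):
    """Reduce items in groups of at most split_every until one value remains."""
    if split_every < 2:
        raise ValueError("split_every must be at least 2")
    if not items:
        raise ValueError("nothing to reduce")
    level = list(items)
    while len(level) > 1:
        # Each pass shrinks the level by roughly a factor of split_every.
        level = [
            combine(level[i:i + split_every])
            for i in range(0, len(level), split_every)
        ]
    return level[0]


written: List[int] = []
results = [write_partition(i, written) for i in range(20)]
final = tree_reduce(results, combine_nones, split_every=4)  # 20 -> 5 -> 2 -> 1
```

In a real task graph each `combine` call would be its own task, so the fan-in of any one task stays bounded by `split_every`.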
@lgray : |
I will try those out. Takes some time because I’m trying a large enough sample to see those issues and also because I’m monitoring the dashboard as it’s running. |
So we need the client to be already up when we call dak.to_parquet (when building the graph)?
Correct. This was the quick and easy way to do it.
On 8 Nov 2024 17:50, Iason Krommydas wrote:
So we need the client to be already up when we call dak.to_parquet (when building the graph)?
|
So first report from trying out the new
|
Cool - this is useful information re: tree reduction. It would seem we should try to use it in as many remaining places as possible where we otherwise have N:1 input-to-output partitions (like collecting finalized arrays or similar things). Histograms are already a tree reduction but those face different issues. However, used in a few places here it could bring us the robustness we appear to be missing? This also brings up the issue - why the heck are distributed tasks taking up so much memory!? There's an additional class that represents a task in distributed which is surely eating up some space if tasks are hanging around. I guess we should think carefully about lifecycles. |
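To make the N:1 versus tree comparison concrete, here is a rough sketch that counts how many reduction tasks each level of a tree would contain. The function name `reduction_levels` and the `split_every` default of 8 are assumptions for illustration, not values taken from this PR.

```python
from math import ceil
from typing import List


def reduction_levels(n_partitions: int, split_every: int = 8) -> List[int]:
    """Return the number of reduction tasks at each level of the tree."""
    if n_partitions < 1:
        raise ValueError("need at least one partition")
    if split_every < 2:
        raise ValueError("split_every must be at least 2")
    levels: List[int] = []
    width = n_partitions
    while width > 1:
        # Every task at this level consumes at most split_every inputs.
        width = ceil(width / split_every)
        levels.append(width)
    return levels


few = reduction_levels(5)     # one task takes all 5 inputs
many = reduction_levels(100)  # three levels, each task takes at most 8 inputs
```

With few partitions the tree degenerates to a single reduction task, which is the same shape as a flat N:1 collection. With many partitions, no task depends on more than `split_every` inputs.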
Quite. I suggested that perhaps a worker plugin can figure out what's being allocated as tasks go through their lifecycles, perhaps on one-thread workers. Usual tools like object growth and reference cycle finders would be the first line of attack. I'm not certain that the worker plugin system ( |
We need to be careful with |
At least we would fail early at get_client, but your point is valid. As implemented in this PR, it is only for trialing and getting the kind of information @ikrommyd supplied, of course. |
I've just tried the |
In the weekly awkward meeting, we decided that tree reduction should become the only implementation for write-parquet (it amounts to the same layout in the case of few partitions). The fire-and-forget route will be removed from this PR and maybe can be resurrected in a separate one for those that want it. Aside from being distributed-specific, it comes with the problem of not knowing when your process is finished. |
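The "not knowing when your process is finished" problem can be illustrated with a standard-library analogy. This sketch uses `concurrent.futures`, not dask.distributed, and `write_partition` and `submit_writes` are hypothetical names: once the handles to submitted work are dropped, the caller has nothing left to wait on.

```python
from concurrent.futures import Future, ThreadPoolExecutor, wait
from typing import List, Optional


def write_partition(index: int) -> None:
    """Hypothetical stand-in for writing one partition; returns nothing."""
    return None


def submit_writes(
    pool: ThreadPoolExecutor, n: int, keep_handles: bool
) -> Optional[List[Future]]:
    """Submit n writes; optionally drop the futures (fire-and-forget style)."""
    futures = [pool.submit(write_partition, i) for i in range(n)]
    # Without the handles, the caller cannot ask whether the writes are done.
    return futures if keep_handles else None


# Note: leaving the "with" block still waits for all submitted work here;
# that is a property of this thread-pool analogy, not of fire-and-forget.
with ThreadPoolExecutor(max_workers=4) as pool:
    forgotten = submit_writes(pool, 8, keep_handles=False)
    tracked = submit_writes(pool, 8, keep_handles=True)
    wait(tracked)  # blocks until every tracked write has finished
    all_done = all(f.done() for f in tracked)
```

A tree reduction keeps a single final result to wait on, which is one reason it is easier to reason about than dropping the handles.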
Would be nice to add the same feature here: https://github.com/scikit-hep/uproot5/blob/734700ef1f822338b03a7573df484909b317b2c2/src/uproot/writing/_dask_write.py |
@ikrommyd - certainly, but that would be a separate PR of course. It does seem like the metadata in that case is simply discarded anyway. |
Oh yeah, I just mentioned it here for documentation purposes since there was a discussion above. |
I don't suppose there's any testing we should be doing here, except that the existing parquet tests continue to work? |
Yeah it's hard to achieve the scale in CI to test the actual performance impact of this PR. |
No description provided.