
feat: Only return parquet metadata if intending to write #549

Merged
merged 8 commits into dask-contrib:main from pq-metadata
Nov 20, 2024

Conversation

martindurant
Collaborator

No description provided.

@codecov-commenter

codecov-commenter commented Oct 22, 2024


Codecov Report

Attention: Patch coverage is 75.00000% with 3 lines in your changes missing coverage. Please review.

Project coverage is 92.93%. Comparing base (8cb8994) to head (64b3649).
Report is 148 commits behind head on main.

Files with missing lines Patch % Lines
src/dask_awkward/lib/io/parquet.py 75.00% 3 Missing ⚠️


Additional details and impacted files
@@            Coverage Diff             @@
##             main     #549      +/-   ##
==========================================
- Coverage   93.06%   92.93%   -0.14%     
==========================================
  Files          23       22       -1     
  Lines        3290     3395     +105     
==========================================
+ Hits         3062     3155      +93     
- Misses        228      240      +12     


@martindurant
Collaborator Author

@pfackeldey : added a fire_and_forget flag to to_parquet. Give it a try? Your dask client must already be instantiated.
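For anyone trying this, a minimal sketch (the array construction and output path are placeholders; `fire_and_forget` is the experimental keyword added in this PR):

```python
import dask
import dask_awkward as dak
from distributed import Client

client = Client()  # the client must exist before the graph is built

arr = dak.from_lists([[{"x": 1}, {"x": 2}], [{"x": 3}]])  # placeholder data
writes = dak.to_parquet(arr, "output/", fire_and_forget=True)
dask.compute(writes)  # returns once the writes have been handed off
```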

@pfackeldey
Collaborator

Thank you for adding this option @martindurant so quickly!
I'm redirecting this test to @ikrommyd as he has an analysis setup that can test this 👍

@lgray
Collaborator

lgray commented Nov 8, 2024

@martindurant could we also get an option for the tree reduce?

@martindurant
Collaborator Author

tree reduce?

Reducing N*None -> None in each reduction? I suppose so. I'll get back to you in about an hour. Naturally, all these approaches are mutually exclusive, and purely experimental for now.

@martindurant
Collaborator Author

@lgray : tree= option seems to work
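Roughly, the only change from the fire-and-forget sketch above is the keyword (`tree=` being the experimental option discussed here; it later became the only implementation, see below):

```python
writes = dak.to_parquet(arr, "output/", tree=True)  # tree-reduce the per-partition writes
dask.compute(writes)
```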

@ikrommyd
Contributor

ikrommyd commented Nov 8, 2024

I will try those out. It takes some time because I'm using a large enough sample to see those issues, and also because I'm monitoring the dashboard as it runs.

@ikrommyd
Contributor

ikrommyd commented Nov 8, 2024

@pfackeldey : added a fire_and_forget flag to to_parquet. Give it a try? Your dask client must already be instantiated.

So we need the client to already be up when we call dak.to_parquet (i.e., when building the graph)?

@martindurant
Collaborator Author

martindurant commented Nov 8, 2024 via email

@ikrommyd
Contributor

ikrommyd commented Nov 12, 2024

So first report from trying out the new fire_and_forget and tree options:

  1. The tree option seems to work very well. The workflow succeeded on the first try without errors. The workers still had unmanaged memory (old) in the GB scale, but the to-parquet tasks weren't accumulating on them; they were being forgotten rather than staying in memory. When the merge into the final write-parquet task after the tree reduction was about to happen, I saw no memory spike on 1-2 workers, which would have happened if they had to accumulate all the to-parquet tasks. Even better, because one worker had died for unrelated reasons, not many tasks had to be redone with tree reduction: the computation reached the end, couldn't gather the results from the dead worker, went back and re-did only ~100 tasks (out of many thousands), and then finished on the first try.

  2. The fire_and_forget option seems to work (I see files being written to disk), but there is a problem: dask.compute(to_compute) doesn't hold the interpreter. The script proceeds, reaches its end, and when the Python interpreter exits it kills the client and the scheduler. I don't know if there is a way to prevent this or if I'm doing something wrong. However, until the client got killed, I could see tasks being computed in the dashboard and files being written to disk. There is some sense of tracking in the dashboard as well: it shows 0/X to-parquet tasks completed, since this option doesn't track them, but X keeps shrinking as tasks are done, so you do have a sense of how many writing tasks remain.

@lgray
Collaborator

lgray commented Nov 12, 2024

Cool - this is useful information re: tree reduction. It would seem we should try to use it in as many remaining places as possible where we otherwise have N:1 input-to-output partitions (like collecting finalized arrays or similar things).

Histograms are already a tree reduction, but those face different issues. However, using it in a few more places here could bring us the robustness we appear to be missing?

This also brings up the issue - why the heck are distributed tasks taking up so much memory!? There's an additional class that represents a task in distributed which is surely eating up some space if tasks are hanging around.

I guess we should think carefully about lifecycles.

@martindurant
Collaborator Author

why the heck are distributed tasks taking up so much memory

Quite. I suggested that perhaps a worker plugin could figure out what's being allocated as tasks go through their lifecycles, perhaps on single-threaded workers. The usual tools, like object-growth and reference-cycle finders, would be the first line of attack. I'm not certain that the worker plugin system (the transition method) has enough granularity, but it's an easy place to start. https://distributed.dask.org/en/stable/plugins.html#worker-plugins
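A sketch of what such a probe might look like (the plugin class and event topic are hypothetical; WorkerPlugin.transition and Worker.log_event are the distributed hooks referenced above):

```python
import psutil
from distributed import WorkerPlugin


class TaskMemoryProbe(WorkerPlugin):
    """Hypothetical probe: record worker RSS at every task state transition.

    Most informative on single-threaded workers, where any growth can be
    attributed to the task that just changed state.
    """

    def setup(self, worker):
        self.worker = worker
        self.proc = psutil.Process()

    def transition(self, key, start, finish, **kwargs):
        self.worker.log_event(
            "task-memory",
            {"key": key, "start": start, "finish": finish,
             "rss": self.proc.memory_info().rss},
        )


# register on an existing client, e.g.:
# client.register_worker_plugin(TaskMemoryProbe())
```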

@lgray
Collaborator

lgray commented Nov 12, 2024

We need to be careful with fire_and_forget since it depends on whatever is executing tasks being interface-similar to distributed. We already have options that are not, some logic to check what's being used to execute the graph and error out if it isn't distributed is probably useful.

@martindurant
Collaborator Author

it depends on whatever is executing tasks being interface-similar to distributed.

At least we would fail early at get_client, but your point is valid. As implemented in this PR, it is only for trialing and getting the kind of information @ikrommyd supplied, of course.
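The early failure would look roughly like this; the guard function is hypothetical, but distributed.get_client raising ValueError outside a distributed context is the behaviour being relied on:

```python
from distributed import get_client


def _require_distributed_client():
    # fire_and_forget hands futures to the distributed scheduler, so a plain
    # threaded or synchronous scheduler cannot honour it.
    try:
        return get_client()
    except ValueError as err:
        raise RuntimeError(
            "fire_and_forget=True requires an active distributed Client"
        ) from err
```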

@ikrommyd
Contributor

ikrommyd commented Nov 14, 2024

I've just tried the fire_and_forget option as well. I stopped the interpreter from going past the dask.compute() call by just adding an input("Press enter when the computation finishes") and monitored the dashboard. All went fine, just like in the tree-reduction case. By the end, two workers had died, and the tasks those workers had in memory were simply redone by other workers that spawned at the end to do just that. I got exactly the same number of parquet files with tree reduction and fire-and-forget, and no memory problems (the unmanaged memory (old) of the workers was still in the GB scale for fire-and-forget as well).
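For reference, the keep-alive workaround described above, in sketch form:

```python
to_compute = dak.to_parquet(arr, "output/", fire_and_forget=True)
dask.compute(to_compute)  # returns before the background writes finish
input("Press enter when the computation finishes")  # keep the client (and scheduler) alive
```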

@martindurant
Collaborator Author

In the weekly awkward meeting, we decided that tree reduction should become the only implementation for write-parquet (it amounts to the same layout in the case of few partitions). The fire-and-forget route will be removed from this PR and can maybe be resurrected in a separate one for those who want it. Aside from being distributed-specific, it comes with the problem of not knowing when your process is finished.

@martindurant martindurant changed the title Only return parquet metadata if intending to write [feat]: Only return parquet metadata if intending to write Nov 18, 2024
@ikrommyd
Contributor

@martindurant
Collaborator Author

@ikrommyd - certainly, but that would be a separate PR of course. It does seem like the metadata in that case is simply discarded anyway.

@ikrommyd
Contributor

@ikrommyd - certainly, but that would be a separate PR of course. It does seem like the metadata in that case is simply discarded anyway.

Oh yeah, I just mentioned it here for documentation purposes since there was a discussion above.

@martindurant martindurant changed the title [feat]: Only return parquet metadata if intending to write feat: Only return parquet metadata if intending to write Nov 18, 2024
@martindurant martindurant marked this pull request as ready for review November 20, 2024 15:40
@martindurant
Collaborator Author

I don't suppose there's any testing we should be doing here, except that the existing parquet tests continue to work?

@lgray
Collaborator

lgray commented Nov 20, 2024

Yeah it's hard to achieve the scale in CI to test the actual performance impact of this PR.

@martindurant martindurant merged commit 1d4d4e9 into dask-contrib:main Nov 20, 2024
24 of 25 checks passed
@martindurant martindurant deleted the pq-metadata branch November 20, 2024 15:48