-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
massive touching performance regression when using dask 2024.2.0 #468
Comments
git bisect says the offending PR in dask is: dask/dask#10898 , dask/dask@8e10a14 |
Super minimal repro: import json
import awkward as ak
from dask.base import tokenize
a = ak.Array([1, 2, 3, 4, 5])
layout, report = ak.typetracer.typetracer_with_report(a.layout.form_with_key())
b = ak.Array(layout)
print(report.data_touched, report.shape_touched)
print(tokenize({"what": b}))
print(report.data_touched, report.shape_touched) results in
So this is an awkward-array issue, not dask-awkward. |
And here is why: https://github.com/dask/dask/blob/main/dask/base.py#L1065 The tokenizer for dicts sorts items in the dictionary using |
Even more minimal now: import json
import awkward as ak
a = ak.Array([1, 2, 3, 4, 5])
layout, report = ak.typetracer.typetracer_with_report(a.layout.form_with_key())
b = ak.Array(layout)
print(report.data_touched, report.shape_touched)
d = {"what": b}
normalized = sorted(d.items(), key=str)
print(report.data_touched, report.shape_touched) So there we have it, I believe that is the end of the story. |
Let's keep this issue open as a meta-issue to track this regression. It's easy to get sucked in to the details here (certainly for me!), so let's just start with a birds-eye view:
Now, we can certainly look to make What's really happening here is that we have a singular (?) case of an As such, I think the simplest solution is to implement |
You can prevent pickle serialization by either adding a |
We can fix this easily. |
I'm seeing two different analyses above, and I suspect that only one is correct:
|
On the subject of
It's not that |
@crusaderky both are correct but the instigating problem is actually the second one - the lack of a In our case it was the In the end, both are problems, but the analysis of the failure led to originally solving 1 but not 2. Now we have fixes for both (see #470). |
with dask 2024.2.0 results in (essentially reading the whole file...):
With dask < 2024.2.0, it results in:
For simple queries this results in multiple orders of magnitude slower performance. This is rather high severity.
@martindurant @agoose77
The text was updated successfully, but these errors were encountered: