TPC-H q8 hangs with xxhash64 enabled #517
Comments
The issue starts with the PR that added xxhash64 (#424). I wonder if this is an issue with hashing itself or just that more queries are running natively now and I am hitting some other issue 🤔
Is the query that failed with OOM one that uses xxhash64, so that it was not run natively before?
The issue is happening with query 8. The behavior I see is that all or most cores are at 100% utilization and memory usage is not actually high, but the cluster becomes unresponsive before an OutOfMemory error is shown in the driver.
So it might be more about CPU usage than memory consumption? If it is indeed related to xxhash64, one possible reason might be that every hash call creates an XXHash64 struct on the stack. Thanks for filing this. It would be great to have a minimal reproduction example with a smaller data size. I can help with this issue.
Was this running on a K8s cluster? Reaching CPU limits should not result in an OutOfMemory error.
This is running locally. I ran this again comparing xxhash64 enabled vs. disabled, and it is very clear that early on in processing q8 all the cores are busy and Spark logging stops. Then there are heartbeat timeouts, and eventually the driver fails with OOM. I do not see system memory usage being very high, so I think the cluster timeouts somehow lead to the OOM issue in the driver.
@andygrove how are you enabling/disabling xxhash64? I just ran TPC-H q8 locally in the profiler and I see no signs of xxhash64 in the output.
JFR output from async_profiler for TPC-H q8 (SF1) running locally on a Mac (unzip, then open in IntelliJ or a tool of your choice).
I have been doing this manually by commenting out the |
Also, I am using the configs from https://datafusion.apache.org/comet/contributor-guide/benchmarking.html#running-benchmarks-against-apache-spark-with-apache-datafusion-comet-enabled
Got it. I think the calls to the xxhash64 native code are not being profiled successfully by async_profiler. It looks like the profiler loses track of the native call stack in generated code. I'll experiment and see if I can find some way to get more info.
I spent some time trying to create a simple repro, but with no success. I wonder if the issue in my env is that we just have too much computation in parallel and are trying to run more tasks than there are available cores. I probably have more aggregation happening natively because we now support xxhash64. Are we doing any multi-threaded execution in aggregation, or is it just single-threaded? I guess I can go look at the code and find out.
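One rough way to sanity-check the "more work than cores" theory is to compare the number of concurrent task slots Spark is configured with against the cores the machine actually has. Below is a minimal Python sketch of that arithmetic. The function name and the `executor_instances` / `executor_cores` parameters are hypothetical and would be filled in from your own Spark settings; nothing here comes from Comet, and it does not tell you whether native aggregation itself is multi-threaded.

```python
import os
from typing import Optional


def oversubscription_ratio(executor_instances: int,
                           executor_cores: int,
                           available_cores: Optional[int] = None) -> float:
    """Return configured task slots divided by available cores.

    A value above 1.0 means more concurrent tasks are allowed than the
    machine has cores. All parameter names are hypothetical and only
    illustrate the arithmetic.
    """
    if executor_instances <= 0 or executor_cores <= 0:
        raise ValueError("executor_instances and executor_cores must be positive")
    if available_cores is None:
        # os.cpu_count() can return None on some platforms; fall back to 1.
        available_cores = os.cpu_count() or 1
    if available_cores <= 0:
        raise ValueError("available_cores must be positive")
    task_slots = executor_instances * executor_cores
    return task_slots / available_cores


if __name__ == "__main__":
    # Example values only; substitute the settings from your own run.
    ratio = oversubscription_ratio(executor_instances=2, executor_cores=8)
    print(f"task slots per available core: {ratio:.2f}")
```

If the ratio is well above 1.0, and each task also spawns its own native threads, sustained 100% CPU on every core would not be surprising. That is only a hint, not a diagnosis.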
Even with #575, q8 still crashes, so perhaps it is not related to xxhash64 support at all. I am still investigating.
The issue is not related to xxhash64. I am closing this and have filed #576.
Describe the bug
I can run TPC-H queries @ 100GB locally with commit 8f8a0d9, but not with any commit after that because the job hangs and eventually fails with out of memory errors.
I will investigate more and post findings on this issue.
Steps to reproduce
No response
Expected behavior
No response
Additional context
No response