
TPC-H q8 hangs with xxhash64 enabled #517

Closed
andygrove opened this issue Jun 5, 2024 · 15 comments

@andygrove
Member

Describe the bug

I can run the TPC-H queries at 100 GB locally with commit 8f8a0d9, but not with any commit after that: the job hangs and eventually fails with out-of-memory errors.

I will investigate more and post findings on this issue.

Steps to reproduce

No response

Expected behavior

No response

Additional context

No response

andygrove added the bug label Jun 5, 2024
@andygrove
Member Author

The issue starts with the PR that added xxhash64 (#424).

I wonder if this is an issue with hashing itself or just that more queries are running natively now and I am hitting some other issue 🤔

@viirya
Member

viirya commented Jun 5, 2024

Is the query that fails with OOM one that uses xxhash64, i.e. one that was not running natively before?

@andygrove
Member Author

The issue is happening with query 8. The behavior I see is that all/most cores are at 100% utilization, and memory is not actually high, but the cluster becomes unresponsive before an OutOfMemory error is shown in the driver.

andygrove added this to the 0.1.0 milestone Jun 5, 2024
@advancedxy
Contributor

all/most cores are at 100% utilization

So it might be more about CPU usage than memory consumption? If it is indeed related to xxhash64, one possible reason might be that every hash call creates a new XXHash64 struct on the stack.
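As an illustration of that per-call pattern (not Comet's actual native code; the twox_hash crate and seed value here are stand-ins chosen purely for the example), the hypothesized hot path looks roughly like this:

```rust
// Minimal sketch of the hypothesized hot path: a fresh XxHash64 struct is
// constructed for every value that gets hashed. Uses the twox_hash crate
// for illustration only; Comet's native implementation may differ.
use std::hash::Hasher;
use twox_hash::XxHash64;

fn hash_values_per_call(values: &[&[u8]], seed: u64) -> Vec<u64> {
    values
        .iter()
        .map(|bytes| {
            // A new hasher (and its internal state) is initialized per call.
            let mut hasher = XxHash64::with_seed(seed);
            hasher.write(bytes);
            hasher.finish()
        })
        .collect()
}

fn main() {
    let rows: Vec<&[u8]> = vec![&b"a"[..], &b"bb"[..], &b"ccc"[..]];
    println!("{:?}", hash_values_per_call(&rows, 42));
}
```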

Thanks for filing this. It would be great if we could have a minimal reproduction example with a smaller data size. I can help with this issue.

andygrove changed the title from Possible regression in memory usage to TPC-H q8 hangs with xxhash64 enabled Jun 5, 2024
@parthchandra
Contributor

Was this running on a K8s cluster? Hitting CPU limits should not result in an OutOfMemory error.

@andygrove
Member Author

This is running locally. I ran this again comparing xxhash64 enabled vs. disabled, and it is very clear that early in processing q8 all the cores are busy and Spark logging stops. Then there are heartbeat timeouts, and eventually the driver fails with OOM. However, I do not see system memory usage being very high, so I think the cluster timeouts somehow lead to the OOM error in the driver.

@parthchandra
Contributor

@andygrove how are you enabling/disabling xxhash64? I just ran tpch8 locally in the profiler and I see no signs of xxhash64 in the output.

@parthchandra
Contributor

JFR output from async_profiler for TPC-H q8 (sf1) running locally on a Mac (unzip, then open in IntelliJ or a tool of your choice).
CometTPCHQueryBenchmark_2024_06_06_102425.jfr.gz

@parthchandra
Contributor

hashAgg_doAggregateWithKeys_0$ stands out as an expensive call. I suppose it makes sense to profile the xxhash code independently; profiling an entire TPC-H query produces a lot of noise.

@andygrove
Member Author

andygrove commented Jun 6, 2024

@andygrove how are you enabling/disabling xxhash64?

I have been doing this manually, by commenting out the case XxHash64(children, seed) => section in QueryPlanSerde.

@andygrove
Member Author

I just ran tpch8 locally in the profiler and I see no signs of xxhash64 in the output.

Also, I am using the configs from https://datafusion.apache.org/comet/contributor-guide/benchmarking.html#running-benchmarks-against-apache-spark-with-apache-datafusion-comet-enabled

@parthchandra
Contributor

Got it. I think the calls to xxhash64 native code are not being profiled successfully by async_profiler. It looks like the profiler loses track of the native call stack in generated code. I'll experiment and see if I can find some way to get more info.
In the run above, while most of the executor threads were in hash aggregation (when CPU usage was at its peak), some threads were in the LZ4 decompressor and some in the Parquet reader, both of which were probably adding to the CPU load.
Also, the peak (CPU: 95%) lasted only a short while compared to the overall run.

@andygrove
Member Author

I spent some time trying to create a simple repro but with no success.

I wonder if the issue in my env is that we just have too much computation running in parallel and are trying to run more work than the available cores can handle. I probably have more aggregation happening natively because we now support xxhash64. Are we doing any multi-threaded execution in aggregation, or is it single threaded? I guess I can go look at the code and find out.
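For reference on how DataFusion itself bounds parallelism, here is a minimal standalone sketch (assuming a recent DataFusion release; the lineitem.csv file and the query are invented for the example, and this is not Comet's actual integration). The target_partitions setting controls how many partitions, and therefore how many concurrent streams, a hash aggregation is split across:

```rust
// Standalone DataFusion example, for illustration only: Comet embeds
// DataFusion differently, but this shows the knob that bounds how many
// partitions an aggregation fans out across.
use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    // Cap intra-query parallelism at 4 partitions; each partition's
    // partial/final aggregation runs as a separate stream on the tokio runtime.
    let config = SessionConfig::new().with_target_partitions(4);
    let ctx = SessionContext::new_with_config(config);

    // Hypothetical input file and query, just to drive an aggregation.
    ctx.register_csv("lineitem", "lineitem.csv", CsvReadOptions::new())
        .await?;
    let df = ctx
        .sql("SELECT l_orderkey, SUM(l_extendedprice) FROM lineitem GROUP BY l_orderkey")
        .await?;
    df.show().await?;
    Ok(())
}
```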

andygrove removed this from the 0.1.0 milestone Jun 10, 2024
andygrove self-assigned this Jun 14, 2024
@andygrove
Member Author

Even with #575, q8 still crashes, so perhaps it is not related to xxhash64 support at all. I am still investigating.

@andygrove
Member Author

andygrove commented Jun 14, 2024

The issue is not related to xxhash64. Closing this; filed #576.
