TPC-H q8 hangs with xxhash64 enabled #517

andygrove · 2024-06-05T14:39:29Z

Describe the bug

I can run TPC-H queries @ 100GB locally with commit 8f8a0d9, but not with any commit after that because the job hangs and eventually fails with out of memory errors.

I will investigate more and post findings on this issue.

Steps to reproduce

No response

Expected behavior

No response

Additional context

No response

The text was updated successfully, but these errors were encountered:

andygrove · 2024-06-05T20:22:44Z

The issue starts with the PR that added xxhash64 (#424).

I wonder if this is an issue with hashing itself or just that more queries are running natively now and I am hitting some other issue 🤔

viirya · 2024-06-05T20:56:19Z

Is the failed one with OOM a query with xxhash64 so it is not natively run before?

andygrove · 2024-06-05T22:24:28Z

The issue is happening with query 8. The behavior I see is that all/most cores are at 100% utilization, and memory is not actually high, but the cluster becomes unresponsive before an OutOfMemory error is shown in the driver.

advancedxy · 2024-06-05T23:05:41Z

all/most cores are at 100% utilization

So it might be more about cpu usage rather than memory consumption? If it’s indeed related to xxhash64, one possible reason might be that every hash call it creates a XXHash64 struct on the stack.

Thanks for filing this, it would be great that we can have a minimum reproduction example with smaller size. I can help with this issue.

parthchandra · 2024-06-05T23:31:43Z

Was this running on a K8s cluster? It should not result in a OutOfMemory if cpu limits are reached.

andygrove · 2024-06-06T16:34:44Z

This is running locally. I ran this again comparing xxhash64 enabled vs disabled and it is very clear that early on in processing q8 all the cores are busy and Spark logging stops and then there are heartbeat timeouts and eventually the driver fails with OOM but I do not see system memory usage being very high so I think the cluster timeouts lead to the OOM issue in the driver somehow.

parthchandra · 2024-06-06T17:01:34Z

@andygrove how are you enabling/disabling xxhash64? I just ran tpch8 locally in the profiler and I see no signs of xxhash64 in the output.

parthchandra · 2024-06-06T17:30:45Z

JFR output from async_profiler for tpch q8 (sf1) running locally on mac. (unzip, then open in intellij or tool of your choice).
CometTPCHQueryBenchmark_2024_06_06_102425.jfr.gz

parthchandra · 2024-06-06T17:44:29Z

hashAgg_doAggregateWithKeys_0$ stands out as an expensive call. I suppose it makes sense to profile the xxhash code independently, profiling an entire tpch query produces a lot of noise.

andygrove · 2024-06-06T20:54:25Z

@andygrove how are you enabling/disabling xxhash64?

I have been doing this manually by commenting out the case XxHash64(children, seed) => section in QueryPlanSerde

andygrove · 2024-06-06T21:00:36Z

I just ran tpch8 locally in the profiler and I see no signs of xxhash64 in the output.

Also, I am using the configs from https://datafusion.apache.org/comet/contributor-guide/benchmarking.html#running-benchmarks-against-apache-spark-with-apache-datafusion-comet-enabled

parthchandra · 2024-06-06T21:39:18Z

Got it. I think the calls to xxhash64 native code are not being profiled successfully by async_profiler. Looks like the profiler loses track of the native call stack in generated code. I'll experiment and see if I can find some way to get more info.
In the run above while most of the executor threads were in hash aggregation (when cpu usage was at its peak), some threads were in the LZ4 decompressor and some in the parquet reader both of which were probably adding to the CPU load.
Also, the peak (CPU: 95%) existed for a very short while compared to the overall run.

andygrove · 2024-06-06T23:40:50Z

I spent some time trying to create a simple repro but with no success.

I wonder if the issue in my env is that we just have too much computation in parallel and trying to run more than the available cores. I probably have more aggregation happening natively because we now support xxhash64. Are we doing any multi-threded execution in aggregation or just single threaded? I guess I can go look at the code and find out.

andygrove · 2024-06-14T20:45:31Z

Even with #575, q8 still crashes so perhaps it is not related to xxhash64 support at all. I am investigating still

andygrove · 2024-06-14T22:26:19Z

The issue is not related to xxhash64. Closing this and filed #576

andygrove added the bug Something isn't working label Jun 5, 2024

andygrove added this to the 0.1.0 milestone Jun 5, 2024

andygrove changed the title ~~Possible regression in memory usage~~ TPC-H q8 hangs with xxhash64 enabled Jun 5, 2024

andygrove removed this from the 0.1.0 milestone Jun 10, 2024

advancedxy mentioned this issue Jun 12, 2024

Avoid stack allocation in xxhash64 #547

Closed

andygrove self-assigned this Jun 14, 2024

andygrove mentioned this issue Jun 14, 2024

Infinite recursion when calling canonicalized on query plan for TPC-H q8 #576

Closed

andygrove closed this as completed Jun 14, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TPC-H q8 hangs with xxhash64 enabled #517

TPC-H q8 hangs with xxhash64 enabled #517

andygrove commented Jun 5, 2024

andygrove commented Jun 5, 2024

viirya commented Jun 5, 2024

andygrove commented Jun 5, 2024

advancedxy commented Jun 5, 2024

parthchandra commented Jun 5, 2024

andygrove commented Jun 6, 2024

parthchandra commented Jun 6, 2024

parthchandra commented Jun 6, 2024

parthchandra commented Jun 6, 2024

andygrove commented Jun 6, 2024 •

edited

Loading

andygrove commented Jun 6, 2024

parthchandra commented Jun 6, 2024

andygrove commented Jun 6, 2024

andygrove commented Jun 14, 2024

andygrove commented Jun 14, 2024 •

edited

Loading

TPC-H q8 hangs with xxhash64 enabled #517

TPC-H q8 hangs with xxhash64 enabled #517

Comments

andygrove commented Jun 5, 2024

Describe the bug

Steps to reproduce

Expected behavior

Additional context

andygrove commented Jun 5, 2024

viirya commented Jun 5, 2024

andygrove commented Jun 5, 2024

advancedxy commented Jun 5, 2024

parthchandra commented Jun 5, 2024

andygrove commented Jun 6, 2024

parthchandra commented Jun 6, 2024

parthchandra commented Jun 6, 2024

parthchandra commented Jun 6, 2024

andygrove commented Jun 6, 2024 • edited Loading

andygrove commented Jun 6, 2024

parthchandra commented Jun 6, 2024

andygrove commented Jun 6, 2024

andygrove commented Jun 14, 2024

andygrove commented Jun 14, 2024 • edited Loading

andygrove commented Jun 6, 2024 •

edited

Loading

andygrove commented Jun 14, 2024 •

edited

Loading