[CH] GraceHashJoin is easy to cause OOM #8003

Open
lgbo-ustc opened this issue Nov 20, 2024 · 3 comments
Labels
bug Something isn't working triage

Comments

@lgbo-ustc
Contributor

lgbo-ustc commented Nov 20, 2024

Backend

CH (ClickHouse)

Bug description

The memory usage of GraceHashJoin is not controlled well. It relies on a fixed byte limit, so spilling of the hash table cannot be triggered adaptively.
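To make the distinction concrete, here is a minimal Python sketch of the two policies. It is not ClickHouse or Gluten code: the function names, the 1 GiB per-join limit, the 0.8 spill ratio, and the 800 MB hash table sizes are assumptions chosen for illustration. Only the 1.5 GiB total limit is taken from this report.

```python
GIB = 1024 ** 3


def should_spill_fixed(join_bytes: int, fixed_limit: int) -> bool:
    """Fixed policy: looks only at this join's own hash table size."""
    return join_bytes >= fixed_limit


def should_spill_adaptive(tracked_bytes: int, total_limit: int, ratio: float = 0.8) -> bool:
    """Adaptive policy (assumed): looks at all memory tracked for the task."""
    return tracked_bytes >= total_limit * ratio


total_limit = int(1.5 * GIB)   # total memory limit from the report
fixed_limit = 1 * GIB          # hypothetical per-join byte limit
join_a = 800_000_000           # hypothetical hash table size of one join
join_b = 800_000_000           # hypothetical hash table size of a second join
tracked = join_a + join_b      # what a shared memory tracker would see

# Neither join reaches its own fixed limit, so neither spills ...
fixed_decision = should_spill_fixed(join_a, fixed_limit)
# ... while a policy that watches the shared budget would already spill.
adaptive_decision = should_spill_adaptive(tracked, total_limit)
```

With these assumed numbers, `fixed_decision` is `False` and `adaptive_decision` is `True`: each join stays below its own limit while the two together are past the shared budget.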

Spark version

None

Spark configurations

No response

System information

No response

Relevant logs

No response

@lgbo-ustc lgbo-ustc added bug Something isn't working triage labels Nov 20, 2024
@lgbo-ustc
Contributor Author

lgbo-ustc commented Nov 20, 2024

We tracked the memory usage of the hash table in each join:

2024-11-20 11:33:18.923 <Error> GraceHashJoin: xxx 0x7f41ebf6c340 total_rows: 577278, total_bytes: 781479072
2024-11-20 11:33:18.924 <Error> GraceHashJoin: xxx 0x7f41e5ad8840 total_rows: 569546, total_bytes: 772995144
2024-11-20 11:33:18.934 <Error> GraceHashJoin: xxx 0x7f41ebf6c340 total_rows: 581310, total_bytes: 786001384
2024-11-20 11:33:18.939 <Error> GraceHashJoin: xxx 0x7f41e5ad8840 total_rows: 573578, total_bytes: 777517456
2024-11-20 11:33:18.946 <Error> GraceHashJoin: xxx 0x7f41ebf6c340 total_rows: 585358, total_bytes: 790524104
2024-11-20 11:33:18.951 <Error> GraceHashJoin: xxx 0x7f41e5ad8840 total_rows: 577610, total_bytes: 782039768
2024-11-20 11:33:18.957 <Error> GraceHashJoin: xxx 0x7f41ebf6c340 total_rows: 589390, total_bytes: 795046416
2024-11-20 11:33:18.962 <Error> GraceHashJoin: xxx 0x7f41e5ad8840 total_rows: 581642, total_bytes: 786562080
2024-11-20 11:33:18.968 <Error> GraceHashJoin: xxx 0x7f41ebf6c340 total_rows: 593422, total_bytes: 799568728
2024-11-20 11:33:18.972 <Error> GraceHashJoin: xxx 0x7f41e5ad8840 total_rows: 585674, total_bytes: 791084392
2024-11-20 11:33:18.973 <Debug> MemoryTracker: Current memory usage: 3.00 GiB.
2024-11-20 11:33:18.979 <Error> GraceHashJoin: xxx 0x7f41ebf6c340 total_rows: 597454, total_bytes: 804091040
2024-11-20 11:33:18.983 <Error> GraceHashJoin: xxx 0x7f41e5ad8840 total_rows: 589706, total_bytes: 795606704
2024-11-20 11:33:18.989 <Error> GraceHashJoin: xxx 0x7f41ebf6c340 total_rows: 601486, total_bytes: 808613352
2024-11-20 11:33:18.995 <Error> GraceHashJoin: xxx 0x7f41e5ad8840 total_rows: 593738, total_bytes: 800129016
2024-11-20 11:33:19.000 <Error> GraceHashJoin: xxx 0x7f41ebf6c340 total_rows: 605518, total_bytes: 813135664
2024-11-20 11:33:19.006 <Error> GraceHashJoin: xxx 0x7f41e5ad8840 total_rows: 597780, total_bytes: 804651592
2024-11-20 11:33:19.010 <Error> GraceHashJoin: xxx 0x7f41ebf6c340 total_rows: 609550, total_bytes: 817657976
2024-11-20 11:33:19.017 <Error> GraceHashJoin: xxx 0x7f41e5ad8840 total_rows: 601812, total_bytes: 809173904
2024-11-20 11:33:19.029 <Error> GraceHashJoin: xxx 0x7f41e5ad8840 total_rows: 605844, total_bytes: 813696216
2024-11-20 11:33:19.130 <Error> local_engine: Enter java exception handle.
Exception Exception in thread "Executor task launch worker for task 248.0 in stage 11.0 (TID 1288)" org.apache.gluten.exception.GlutenException: Memory limit exceeded: would use 1.50 GiB (attempt to allocate chunk of 4440569 bytes), current RSS 2.79 GiB, maximum: 1.50 GiB.
0. ../contrib/llvm-project/libcxx/include/exception:141: Poco::Exception::Exception(String const&, int) @ 0x000000001469dc59
1. ./build/../src/Common/Exception.cpp:109: DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x00000000069da63c
2. ../src/Common/Exception.h:111: DB::Exception::Exception(PreformattedMessage&&, int) @ 0x00000000068ca54c
3. ../src/Common/Exception.h:129: DB::Exception::Exception<char const*, char const*, String, long&, String, String, char const*, std::basic_string_view<char, std::char_traits<char>>>(int, FormatStringHelperImpl<std::type_identity<char const*>::type, std::type_identity<char const*>::type, std::type_identity<String>::type, std::type_identity<long&>::type, std::type_identity<String>::type, std::type_identity<String>::type, std::type_identity<char const*>::type, std::type_identity<std::basic_string_view<char, std::char_traits<char>>>::type>, char const*&&, char const*&&, String&&, long&, String&&, String&&, char const*&&, std::basic_string_view<char, std::char_traits<char>>&&) @ 0x00000000069ea0c9
4. ./build/../src/Common/MemoryTracker.cpp:326: MemoryTracker::allocImpl(long, bool, MemoryTracker*, double) @ 0x00000000069e8ee1
5. ./build/../src/Common/MemoryTracker.cpp:383: MemoryTracker::allocImpl(long, bool, MemoryTracker*, double) @ 0x00000000069e8a96
6. ./build/../src/Common/CurrentMemoryTracker.cpp:64: CurrentMemoryTracker::alloc(long) @ 0x00000000069ccb1f
7. ./build/../src/Common/Allocator.cpp:233: Allocator<false, false>::realloc(void*, unsigned long, unsigned long, unsigned long) @ 0x00000000069bab7e
8. ../src/Common/PODArray.h:152: void DB::PODArrayBase<1ul, 4096ul, Allocator<false, false>, 63ul, 64ul>::resize<>(unsigned long) @ 0x0000000006a44e40
9. ./build/../src/Columns/ColumnString.cpp:156: DB::ColumnString::insertRangeFrom(DB::IColumn const&, unsigned long, unsigned long) @ 0x00000000102f5c09
10. ./build/../src/Columns/ColumnTuple.cpp:370: DB::ColumnTuple::insertRangeFrom(DB::IColumn const&, unsigned long, unsigned long) @ 0x000000001031e8a0
11. ./build/../src/Columns/ColumnArray.cpp:605: DB::ColumnArray::insertRangeFrom(DB::IColumn const&, unsigned long, unsigned long) @ 0x0000000010195cb7
12. ./build/../utils/extern-local-engine/Storages/IO/NativeReader.cpp:150: local_engine::readNormalComplexData(DB::ReadBuffer&, COW<DB::IColumn>::immutable_ptr<DB::IColumn>&, unsigned long, local_engine::NativeReader::ColumnParseUtil&) @ 0x0000000006e5e0d5
13. ../contrib/llvm-project/libcxx/include/__functional/function.h:848: ? @ 0x0000000006e5d8f0
14. ./build/../utils/extern-local-engine/Storages/IO/NativeReader.cpp:71: local_engine::NativeReader::read() @ 0x0000000006e5be89
15. ./build/../utils/extern-local-engine/Shuffle/ShuffleReader.cpp:51: local_engine::ShuffleReader::read() @ 0x0000000006f47142
16. ./build/../utils/extern-local-engine/local_engine_jni.cpp:554: Java_org_apache_gluten_vectorized_CHStreamReader_nativeNext @ 0x00000000068b61d7

The total memory limit is 1.5 GiB, and there are multiple joins in the plan. 0x7f41e5ad8840 and 0x7f41ebf6c340 are the addresses of two GraceHashJoin instances, each of which uses over 700 MB of memory.
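As a quick check of these numbers, the following Python sketch parses the last few log lines quoted above and adds up the most recent `total_bytes` value reported for each join address. The regular expression and the helper name `last_bytes_per_join` are illustrative assumptions; the log lines are copied verbatim from the output above.

```python
import re

LOG = """\
2024-11-20 11:33:19.010 <Error> GraceHashJoin: xxx 0x7f41ebf6c340 total_rows: 609550, total_bytes: 817657976
2024-11-20 11:33:19.017 <Error> GraceHashJoin: xxx 0x7f41e5ad8840 total_rows: 601812, total_bytes: 809173904
2024-11-20 11:33:19.029 <Error> GraceHashJoin: xxx 0x7f41e5ad8840 total_rows: 605844, total_bytes: 813696216
"""

PATTERN = re.compile(
    r"GraceHashJoin: xxx (0x[0-9a-f]+) total_rows: (\d+), total_bytes: (\d+)"
)


def last_bytes_per_join(log: str) -> dict:
    """Return the most recent total_bytes seen for each join address."""
    latest = {}
    for line in log.splitlines():
        match = PATTERN.search(line)
        if match:
            # Later lines overwrite earlier ones for the same address.
            latest[match.group(1)] = int(match.group(3))
    return latest


latest = last_bytes_per_join(LOG)
combined = sum(latest.values())
limit = int(1.5 * 1024 ** 3)  # 1.5 GiB in bytes

print(combined, limit, combined > limit)
# prints: 1631354192 1610612736 True
```

The two hash tables together hold 1,631,354,192 bytes, which is already above the 1.5 GiB limit (1,610,612,736 bytes) before any other allocation in the task is counted.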

@lgbo-ustc
Contributor Author

lgbo-ustc commented Nov 20, 2024

This could extend to more cases:

  • there is a join and an aggregation in the same plan segment
  • there are multiple joins in the same plan segment

In all of these cases, a fixed byte limit cannot make the join spill at the right time.
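The reasoning behind these cases can be sketched as simple worst-case arithmetic in Python. The function names and the 1 GiB per-join limit and 0.75 GiB aggregation footprint are hypothetical values for illustration; only the 1.5 GiB total limit comes from this issue.

```python
GIB = 1024 ** 3


def worst_case_bytes(fixed_limit: int, n_joins: int, other_operators_bytes: int = 0) -> int:
    """Memory held if every join grows right up to its fixed limit before spilling."""
    return fixed_limit * n_joins + other_operators_bytes


def fits(total_limit: int, fixed_limit: int, n_joins: int, other_operators_bytes: int = 0) -> bool:
    """True if the worst case still stays within the task's total limit."""
    return worst_case_bytes(fixed_limit, n_joins, other_operators_bytes) <= total_limit


total_limit = int(1.5 * GIB)   # total limit from the report
fixed_limit = 1 * GIB          # hypothetical per-join fixed limit

single_join_ok = fits(total_limit, fixed_limit, n_joins=1)
two_joins_ok = fits(total_limit, fixed_limit, n_joins=2)
join_plus_aggregate_ok = fits(
    total_limit, fixed_limit, n_joins=1, other_operators_bytes=int(0.75 * GIB)
)
```

Under these assumptions only the single-join plan fits. A limit tuned for one join stops being safe as soon as a second join or an aggregation shares the same budget, and no single fixed value is right for every plan shape.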

By the way, aggregation spills adaptively at present.

@lgbo-ustc
Contributor Author

lgbo-ustc commented Nov 20, 2024

Reimplementing a GraceHashJoin that can spill adaptively in Gluten is not a good choice.
We need to find a bad case in CH so that we can convince CH to accept this modification.
