GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation #37400
Conversation
This is a port of #35691. I was busy over the past few days and now have time for it. The previous comments are resolved. cc @pitrou @wgtmac @emkornfield
Thanks for adding this! I just did an initial review, except for the tests.
@mapleFU Hi, is there a planned merge time for this?
I believe this missed the feature freeze deadline and won't be included in 17.0.0; it will likely be part of 18.0.0. @raulcd will know better. Since this is one of the largest PRs of the past months, if not years, it's understandable that it's not rushed out the door.
I'm quite busy these few days, but I promise I'll try my best to get this checked in this month.
The feature freeze was yesterday. This will not make it in time for 17.0.0 |
@emkornfield I've tried to resolve the comments.
[6, "f"]
])"};
auto table = ::arrow::TableFromJSON(schema, contents);
auto non_dict_table = ::arrow::TableFromJSON(origin_schema, contents);
What about these renames:
- table -> dict_encoded_table / table_with_dict
- non_dict_table -> table
cpp/src/parquet/bloom_filter.h (outdated)
/// Compute hash for ByteArray value by using its plain encoding result.
///
/// @param value the value to hash.
uint64_t Hash(const ByteArray& value) const { return Hash(&value); }
It seems these would be better relocated below line 107.
if (iter == row_group_bloom_filter.end()) {
  auto block_split_bloom_filter =
      std::make_unique<BlockSplitBloomFilter>(properties_->memory_pool());
  block_split_bloom_filter->Init(BlockSplitBloomFilter::OptimalNumOfBytes(
I have reviewed that PR and it could be a follow-up change. The writer implementation has the freedom to try smart things.
FYI, parquet-java also discards the bloom filter if dictionary encoding is applied to all data pages, though I don't think we should do the same thing.
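For illustration, here is a hypothetical standalone sketch (not code from this PR, and all names are made up) of the parquet-java behavior described above: if every data page in a column chunk stayed dictionary-encoded, the dictionary page already provides exact membership information, so a writer could choose to drop the bloom filter before flushing the row group.

#include <memory>

// Hypothetical sketch: drop the bloom filter when the dictionary already covers
// all values exactly. BloomFilterT stands in for the writer's bloom filter type.
template <typename BloomFilterT>
void MaybeDropBloomFilter(bool all_pages_dictionary_encoded,
                          std::unique_ptr<BloomFilterT>* bloom_filter) {
  if (all_pages_dictionary_encoded && *bloom_filter != nullptr) {
    bloom_filter->reset();  // discard: readers can consult the dictionary instead
  }
}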
std::array<uint64_t, kHashBatchSize> hashes;
for (int64_t i = 0; i < num_values; i += kHashBatchSize) {
  int64_t current_hash_batch_size = std::min(kHashBatchSize, num_values - i);
  bloom_filter_->Hashes(values, static_cast<int>(current_hash_batch_size),
I might have forgotten something; where does the int32 range come from? @emkornfield
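As a minimal, self-contained sketch of the batching pattern in the quoted hunk (with a stand-in hash callback instead of the real Hashes() member, and an assumed kHashBatchSize of 256): the static_cast<int> is safe because the per-iteration batch size is bounded by kHashBatchSize, a small compile-time constant, even though num_values itself is int64_t.

#include <algorithm>
#include <array>
#include <cstdint>

constexpr int64_t kHashBatchSize = 256;  // assumed value for illustration

void HashInBatches(const int32_t* values, int64_t num_values,
                   void (*hashes_fn)(const int32_t*, int, uint64_t*)) {
  std::array<uint64_t, kHashBatchSize> hashes;
  for (int64_t i = 0; i < num_values; i += kHashBatchSize) {
    // Bounded by kHashBatchSize, so the cast to int cannot overflow.
    int64_t current_hash_batch_size = std::min(kHashBatchSize, num_values - i);
    hashes_fn(values + i, static_cast<int>(current_hash_batch_size), hashes.data());
    // ...the hashes would then be inserted into the bloom filter...
  }
}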
void TypedColumnWriterImpl<BooleanType>::UpdateBloomFilterSpaced(const bool*, int64_t,
                                                                 const uint8_t*,
                                                                 int64_t) {
  DCHECK(bloom_filter_ == nullptr);
+1 on an explicit exception
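A sketch of the "explicit exception" idea: the BooleanType specialization in the quoted hunk only DCHECKs, which is a no-op in release builds. A louder alternative is to throw if a bloom filter was somehow attached to a boolean column. This is a standalone illustration with a made-up helper name, not the member function from the PR.

#include "parquet/exception.h"

// Hypothetical helper: fail loudly instead of silently passing in release builds.
inline void EnsureNoBloomFilterForBoolean(const void* bloom_filter) {
  if (bloom_filter != nullptr) {
    throw parquet::ParquetException("Bloom filter is not supported for BOOLEAN columns");
  }
}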
# Conflicts:
#   cpp/src/parquet/column_writer_test.cc
Two things need fixing:
@pitrou @wgtmac @emkornfield Sorry for the late reply, I believe all comments are now replied to or fixed. The bloom filter location now becomes a map in all use cases, since it would be more sparse than page indices.
The feature freeze for Arrow 19 is planned for January 6, 2025, and I'm curious if there might be capacity to get this fully reviewed and merged by then (or soon after). If not, feel free to comment with what you'd need (more reviewers, more time, etc.). cc @mapleFU @pitrou @wgtmac @emkornfield
Sorry for missing this! I will take a look. Meanwhile, I don't think we need to hurry for the code freeze.
# Conflicts:
#   cpp/src/parquet/column_writer.cc
#   cpp/src/parquet/type_fwd.h
Rebased, ready for review now.
I have done another review pass and it generally LGTM.
It would be good if @pitrou @emkornfield can take a look after the holiday season.
bloom_filter_reader.cc
bloom_filter_builder.cc
Suggested change:
-bloom_filter_reader.cc
-bloom_filter_builder.cc
+bloom_filter_builder.cc
+bloom_filter_reader.cc
@@ -5541,7 +5543,7 @@ auto encode_double = [](double value) {

 } // namespace

-class ParquetPageIndexRoundTripTest : public ::testing::Test {
+class ParquetIndexRoundTripTest {
Suggested change:
-class ParquetIndexRoundTripTest {
+class TestingWithPageIndex {
ParquetIndexRoundTripTest is a little bit confusing since it is not a complete test.
ASSERT_EQ(nullptr, bloom_filter);
} else {
  ASSERT_NE(nullptr, bloom_filter);
  bloom_filters_.push_back(std::move(bloom_filter));
What about changing bloom_filters_ to be an output parameter of the ReadBloomFilters function instead of a class member variable?
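A sketch of this suggestion with a hypothetical signature: instead of appending into the bloom_filters_ member, the test helper takes the destination vector as an output parameter, so the helper stays stateless and each test owns its own results. BloomFilterT stands in for parquet::BloomFilter; the reader inputs are elided.

#include <memory>
#include <vector>

// Hypothetical helper signature: caller supplies the output vector.
template <typename BloomFilterT>
void ReadBloomFilters(int num_row_groups,
                      std::vector<std::unique_ptr<BloomFilterT>>* out) {
  out->clear();
  for (int rg = 0; rg < num_row_groups; ++rg) {
    std::unique_ptr<BloomFilterT> bloom_filter;  // ...read from the file here...
    out->push_back(std::move(bloom_filter));
  }
}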
std::vector<std::unique_ptr<BloomFilter>> bloom_filters_;
};

TEST_F(ParquetBloomFilterRoundTripTest, SimpleRoundTrip) {
The three test cases below share a lot of common logic (with exactly the same data). Should we refactor them to eliminate the duplication?
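As a hypothetical sketch of one way to do that (the names below are illustrative, not the PR's code): a fixture helper performs the common write/read/verify steps and each TEST_F only varies the writer properties it passes in.

#include <memory>

#include <gtest/gtest.h>

#include "parquet/properties.h"

// Hypothetical shared fixture for the three round-trip cases.
class BloomFilterRoundTripFixture : public ::testing::Test {
 protected:
  // Writes the shared test data with `properties`, reads the file back and
  // collects the bloom filters so the individual tests can assert on them.
  void RoundTripWith(std::shared_ptr<parquet::WriterProperties> properties) {
    // ...the write + read logic currently duplicated across the three tests...
  }
};

TEST_F(BloomFilterRoundTripFixture, SimpleRoundTrip) {
  // Build properties enabling the bloom filter, call RoundTripWith(properties),
  // then assert on the recovered filters.
}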
}
/// Compute hash for Int96 value by using its plain encoding result.
Suggested change:
}
/// Compute hash for Int96 value by using its plain encoding result.
struct BloomFilterLocation {
  /// Row group bloom filter index locations which uses row group ordinal as the key.
  ///
  /// Note: Before Parquet 2.10, the bloom filter index only have "offset". But here
Suggested change:
-/// Note: Before Parquet 2.10, the bloom filter index only have "offset". But here
+/// Note: Before Parquet Format v2.10, the bloom filter index only have "offset". But here
  ///
  /// Number of columns with a bloom filter to be relatively small compared to
  /// the number of overall columns, so map is used.
  using RowGroupBloomFilterLocation = std::map<int32_t, IndexLocation>;
What about defining RowGroupBloomFilterLocation and FileBloomFilterLocation inside BloomFilterLocation?
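A sketch of that suggestion, scoping the two aliases inside BloomFilterLocation itself. The IndexLocation fields shown here are illustrative stand-ins for the type used by the real header, and the file-level key structure is assumed from the surrounding comments.

#include <cstdint>
#include <map>

// Illustrative stand-in for the real IndexLocation type.
struct IndexLocation {
  int64_t offset;
  int32_t length;
};

struct BloomFilterLocation {
  /// Bloom filter locations within one row group, keyed by column ordinal.
  /// A map is used because few columns are expected to carry a bloom filter.
  using RowGroupBloomFilterLocation = std::map<int32_t, IndexLocation>;

  /// Bloom filter locations for the whole file, keyed by row group ordinal.
  using FileBloomFilterLocation = std::map<int32_t, RowGroupBloomFilterLocation>;

  FileBloomFilterLocation file_bloom_filter_location;
};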
auto& column = row_group_metadata.columns[column_id];
auto& column_metadata = column.meta_data;
column_metadata.__set_bloom_filter_offset(bloom_filter_location.offset);
// bloom_filter_length is added by Parquet format 2.10.0
I don't think we need this comment
this->sink_, column_properties.compression(), this->metadata_.get());
builder_ =
    internal::BloomFilterBuilder::Make(&this->schema_, this->writer_properties_.get());
// Initial RowGroup
Did you mean initialize?
writer->WriteBatch(this->values_.size(), nullptr, nullptr, this->values_ptr_);
writer->Close();

// Read all rows so we are sure that also the non-dictionary pages are read correctly
Suggested change:
-// Read all rows so we are sure that also the non-dictionary pages are read correctly
+// Make sure that column values are read correctly
Rationale for this change
Currently we only allow reading the bloom filter for a specific column and row group; this patch allows writing bloom filters as well.
This patch is just a skeleton. If reviewers think the interface is OK, I'll go on and add testing.
What changes are included in this PR?
Allow writing bloom filters:
ParquetPageIndexRoundTripTest
Are these changes tested?
Yes
Are there any user-facing changes?
Users can create bloom filters in Parquet files with the C++ API.
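As a small usage sketch of the lower-level pieces visible in the diff hunks above (pool-constructed BlockSplitBloomFilter, Init + OptimalNumOfBytes, Hash / InsertHash / FindHash). The exact OptimalNumOfBytes arguments (ndv, fpp) are assumed from the quoted call site, and the new writer-properties knob that turns this on per column is part of the PR but not reproduced here.

#include <cstdint>
#include <iostream>

#include "arrow/memory_pool.h"
#include "parquet/bloom_filter.h"
#include "parquet/types.h"

int main() {
  // Construct a block-split bloom filter backed by the default memory pool.
  parquet::BlockSplitBloomFilter bloom_filter(::arrow::default_memory_pool());
  // Size for roughly 1000 distinct values at a 1% false-positive probability
  // (argument meaning assumed).
  bloom_filter.Init(parquet::BlockSplitBloomFilter::OptimalNumOfBytes(1000, 0.01));

  // Hash a ByteArray value, insert it, and probe for it.
  const parquet::ByteArray value(5, reinterpret_cast<const uint8_t*>("hello"));
  const uint64_t hash = bloom_filter.Hash(&value);
  bloom_filter.InsertHash(hash);

  std::cout << "might contain 'hello': " << bloom_filter.FindHash(hash) << std::endl;
  return 0;
}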