GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation #37400

Open
wants to merge 60 commits into
base: main

Conversation

mapleFU
Member

@mapleFU mapleFU commented Aug 26, 2023

Rationale for this change

Currently we allow reading the bloom filter for a specific column and row group; this patch adds support for writing bloom filters.

This patch is just a skeleton. If reviewers think the interface is OK, I'll go on and add tests.

What changes are included in this PR?

Allow writing bloom filters:

  • Add WriterProperties config for writing bloom filters, including the bloom filter options and a (per-row-group) NDV (number of distinct values) estimate.
  • Add BloomFilterBuilder for Parquet
  • Pass the bloom filter from FileSerializer down to ColumnWriter
  • Ensure Bloom Filter info is written to the file
  • Testing logic for BloomFilterBuilder
  • Testing logic for BloomFilter and ColumnWriter
  • Testing whole roundtrip like ParquetPageIndexRoundTripTest
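To illustrate why the writer configuration carries an NDV estimate, here is a minimal standalone sketch of how an NDV estimate and a target false-positive probability could be turned into a filter size. It uses the common approximation for a filter that sets 8 bits per value, bits ≈ -8 · ndv / ln(1 - fpp^(1/8)), and rounds up to a power of two. The function names and the minimum/maximum bounds are assumptions made for this sketch; they are not taken from this PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical bounds for this sketch only (not taken from the PR).
constexpr uint64_t kSketchMinBytes = 32;
constexpr uint64_t kSketchMaxBytes = 128ull * 1024 * 1024;

// Smallest power of two that is >= v (for v >= 1).
inline uint64_t SketchNextPowerOfTwo(uint64_t v) {
  uint64_t p = 1;
  while (p < v) p <<= 1;
  return p;
}

// Estimate a filter size in bytes from the number of distinct values (ndv)
// and the desired false-positive probability (fpp, assumed to be in (0, 1)).
inline uint64_t SketchOptimalNumBytes(uint64_t ndv, double fpp) {
  // Approximation for a filter that sets 8 bits per inserted value.
  const double bits =
      -8.0 * static_cast<double>(ndv) / std::log(1.0 - std::pow(fpp, 1.0 / 8.0));
  double bytes = std::ceil(bits / 8.0);
  // Clamp into the allowed range before converting to an integer.
  bytes = std::max(bytes, static_cast<double>(kSketchMinBytes));
  bytes = std::min(bytes, static_cast<double>(kSketchMaxBytes));
  // Round up to a power of two, without exceeding the upper bound.
  return std::min(SketchNextPowerOfTwo(static_cast<uint64_t>(bytes)),
                  kSketchMaxBytes);
}
```

With this sketch, a larger NDV estimate or a smaller target probability yields a larger filter, which is why an NDV estimate that is far too low would degrade filter quality.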

Are these changes tested?

Yes

Are there any user-facing changes?

Users can write bloom filters to Parquet files with the C++ API.

@mapleFU
Member Author

mapleFU commented Aug 26, 2023

This is a port of #35691. I was busy over the past days and now have time to work on it.

The previous comments have been addressed. cc @pitrou @wgtmac @emkornfield

cpp/src/parquet/bloom_filter.h (outdated, resolved)
cpp/src/parquet/bloom_filter_parquet_test.cc (outdated, resolved)
cpp/src/parquet/column_writer.cc (resolved)
cpp/src/parquet/column_writer.cc (outdated, resolved)
cpp/src/parquet/column_writer.cc (resolved)
@github-actions github-actions bot added the "awaiting committer review" label and removed the "awaiting review" label on Aug 26, 2023
@mapleFU mapleFU requested review from pitrou and emkornfield August 26, 2023 18:43
@mapleFU mapleFU changed the title from "GH-34785: [C++][Parquet] Parquet Bloom Filter Implement" to "GH-34785: [C++][Parquet] Parquet Bloom Filter Write Implement" on Aug 27, 2023
Member

@wgtmac wgtmac left a comment

Thanks for adding this! I just did an initial review except the test.

cpp/src/parquet/bloom_filter.h (outdated, resolved)
cpp/src/parquet/properties.h (outdated, resolved)
cpp/src/parquet/properties.h (outdated, resolved)
cpp/src/parquet/properties.h (resolved)
cpp/src/parquet/properties.h (outdated, resolved)
cpp/src/parquet/CMakeLists.txt (outdated, resolved)
cpp/src/parquet/bloom_filter_builder.cc (outdated, resolved)
cpp/src/parquet/bloom_filter_builder.cc (outdated, resolved)
cpp/src/parquet/bloom_filter_builder.cc (outdated, resolved)
cpp/src/parquet/bloom_filter_builder.cc (outdated, resolved)
@wgtmac wgtmac changed the title from "GH-34785: [C++][Parquet] Parquet Bloom Filter Write Implement" to "GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation" on Aug 30, 2023
cpp/src/parquet/arrow/arrow_reader_writer_test.cc (outdated, resolved)
cpp/src/parquet/arrow/arrow_reader_writer_test.cc (outdated, resolved)
cpp/src/parquet/arrow/arrow_reader_writer_test.cc (outdated, resolved)
cpp/src/parquet/bloom_filter.h (outdated, resolved)
cpp/src/parquet/bloom_filter_builder.h (outdated, resolved)
cpp/src/parquet/bloom_filter_builder.h (outdated, resolved)
cpp/src/parquet/bloom_filter_builder.cc (outdated, resolved)
cpp/src/parquet/bloom_filter_builder.cc (outdated, resolved)
cpp/src/parquet/bloom_filter_builder.cc (outdated, resolved)
@github-actions github-actions bot added the "awaiting changes" label and removed the "awaiting committer review" label on Sep 1, 2023
@huberylee
Contributor

@mapleFU Hi, is there a planned merge time for this?

@alippai
Contributor

alippai commented Jul 2, 2024

I believe this missed the feature freeze deadline and won’t be included in 17.0.0; it will likely be part of 18.0.0. @raulcd will know better.

Since this is the largest PR of the past months, if not years, it’s understandable that it’s not being rushed out the door.

@mapleFU
Member Author

mapleFU commented Jul 2, 2024

I'm quite busy these days, but I promise I will try my best to get this checked in this month.
This will not be in the 17.0.0 release.

@raulcd
Member

raulcd commented Jul 2, 2024

The feature freeze was yesterday. This will not make it in time for 17.0.0

@mapleFU
Member Author

mapleFU commented Jul 3, 2024

@emkornfield I've tried to resolve the comments.
For bloom filter quality, I think a static config is a reasonable starting point for this patch. I've created an issue for that: #43138

@mapleFU
Member Author

mapleFU commented Jul 11, 2024

@pitrou @wgtmac This patch is ready for review; would you mind also taking a look?

cpp/src/parquet/arrow/arrow_reader_writer_test.cc (outdated, resolved)
[6, "f"]
])"};
auto table = ::arrow::TableFromJSON(schema, contents);
auto non_dict_table = ::arrow::TableFromJSON(origin_schema, contents);
Member

What about these:
table -> dict_encoded_table/table_with_dict
non_dict_table -> table

/// Compute hash for ByteArray value by using its plain encoding result.
///
/// @param value the value to hash.
uint64_t Hash(const ByteArray& value) const { return Hash(&value); }
Member

It seems these would be better relocated below line 107.

if (iter == row_group_bloom_filter.end()) {
auto block_split_bloom_filter =
std::make_unique<BlockSplitBloomFilter>(properties_->memory_pool());
block_split_bloom_filter->Init(BlockSplitBloomFilter::OptimalNumOfBytes(
Member

I have reviewed that PR and it could be a followup change. Writer implementation has the freedom to try smart things.

FYI, parquet-java also discards the bloom filter if dictionary encoding is applied to all data pages, though I don't think we should do the same thing.

std::array<uint64_t, kHashBatchSize> hashes;
for (int64_t i = 0; i < num_values; i += kHashBatchSize) {
int64_t current_hash_batch_size = std::min(kHashBatchSize, num_values - i);
bloom_filter_->Hashes(values, static_cast<int>(current_hash_batch_size),
Member

I might have forgotten something: where does the int32 range come from? @emkornfield
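For readers following the quoted snippet, here is a minimal standalone sketch of the batching pattern. It assumes a batch-hash function that takes an `int` count, which would make the `static_cast<int>` safe because each batch is capped at the batch size. That reading is an assumption, not something confirmed in this thread, and `SketchHashes`, `SketchUpdateInBatches`, and `kSketchHashBatchSize` are hypothetical names that use `std::hash` as a placeholder hash.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical batch size for this sketch.
constexpr int64_t kSketchHashBatchSize = 256;

// Stand-in for a batch-hash API that takes an `int` count (assumption).
inline void SketchHashes(const int64_t* values, int num_values, uint64_t* out) {
  for (int i = 0; i < num_values; ++i) {
    out[i] = std::hash<int64_t>{}(values[i]);
  }
}

// Hash `num_values` values in fixed-size batches and append the hashes to
// `sink`. Returns the number of values processed.
inline int64_t SketchUpdateInBatches(const int64_t* values, int64_t num_values,
                                     std::vector<uint64_t>* sink) {
  std::array<uint64_t, kSketchHashBatchSize> hashes;
  int64_t processed = 0;
  for (int64_t i = 0; i < num_values; i += kSketchHashBatchSize) {
    const int64_t batch = std::min(kSketchHashBatchSize, num_values - i);
    // batch <= kSketchHashBatchSize, so narrowing to int cannot overflow.
    SketchHashes(values + i, static_cast<int>(batch), hashes.data());
    sink->insert(sink->end(), hashes.begin(), hashes.begin() + batch);
    processed += batch;
  }
  return processed;
}
```

Under that assumption, the caller can accept a 64-bit value count while the per-batch call never sees more than the batch size.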

void TypedColumnWriterImpl<BooleanType>::UpdateBloomFilterSpaced(const bool*, int64_t,
const uint8_t*,
int64_t) {
DCHECK(bloom_filter_ == nullptr);
Member

+1 on an explicit exception
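A minimal sketch of the difference being discussed: a debug-only check such as `DCHECK` is compiled out of release builds, whereas an explicit exception stays active in every build. The struct and member names below are hypothetical, and `std::logic_error` stands in for whatever exception type the project would actually use.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-in for a boolean column writer; not the PR's class.
struct SketchBooleanColumnWriter {
  // Expected to stay null: no bloom filter is built for boolean columns here.
  const void* bloom_filter = nullptr;

  void UpdateBloomFilter() const {
    // Explicit check that remains in release builds, unlike a debug assert.
    if (bloom_filter != nullptr) {
      throw std::logic_error("bloom filter is not supported for boolean columns");
    }
  }
};
```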

@mapleFU mapleFU force-pushed the parquet/support-write-bloom-filter branch from 505e23b to e9c550a Compare November 12, 2024 05:48
@mapleFU
Member Author

mapleFU commented Nov 12, 2024

Two things need fixing:

/arrow/cpp/src/parquet/bloom_filter.h:118: error: The following parameter of parquet::BloomFilter::Hash(const FLBA &value, uint32_t type_len) const is not documented:
  parameter 'type_len' (warning treated as error, aborting now)
D:/a/arrow/arrow/build/cpp/src/parquet/CMakeFiles/parquet_shared.dir/Unity/unity_3_cxx.cxx
In file included from D:/a/arrow/arrow/build/cpp/src/parquet/CMakeFiles/parquet_shared.dir/Unity/unity_3_cxx.cxx:16:
D:/a/arrow/arrow/cpp/src/parquet/schema.cc: In function 'void parquet::schema::PrintRepLevel(parquet::Repetition::type, std::ostream&)':
D:/a/arrow/arrow/cpp/src/parquet/schema.cc:630:30: error: expected unqualified-id before ':' token
  630 |     case Repetition::OPTIONAL:
      |                              ^

@mapleFU mapleFU force-pushed the parquet/support-write-bloom-filter branch from fd3856d to 79bdaeb Compare November 15, 2024 04:49
@mapleFU mapleFU force-pushed the parquet/support-write-bloom-filter branch from 79bdaeb to d892819 Compare November 15, 2024 06:57
@mapleFU
Member Author

mapleFU commented Nov 19, 2024

@pitrou @wgtmac @emkornfield Sorry for the late reply. I believe all comments have been replied to or fixed now. The bloom filter now uses a map in all use cases, since it would be sparser than the page indices.

@amoeba
Member

amoeba commented Dec 19, 2024

The feature freeze for Arrow 19 is planned for January 6, 2025 and I'm curious if there might be capacity to get this fully reviewed and merged by then (or soon after). If not, feel free to comment with what you'd need (more reviewers, more time, etc). cc @mapleFU @pitrou @wgtmac @emkornfield

@wgtmac
Member

wgtmac commented Dec 20, 2024

Sorry for missing this! I will take a look. Meanwhile, I don't think we need to hurry for the code freeze.

# Conflicts:
#	cpp/src/parquet/column_writer.cc
#	cpp/src/parquet/type_fwd.h
@mapleFU
Member Author

mapleFU commented Dec 20, 2024

Rebased, ready for review now

Member

@wgtmac wgtmac left a comment

I have done another review pass and this generally LGTM.

It would be good if @pitrou @emkornfield can take a look after the holiday season.

Comment on lines 162 to +163
bloom_filter_reader.cc
bloom_filter_builder.cc
Member

Suggested change
- bloom_filter_reader.cc
- bloom_filter_builder.cc
+ bloom_filter_builder.cc
+ bloom_filter_reader.cc

@@ -5541,7 +5543,7 @@ auto encode_double = [](double value) {

} // namespace

- class ParquetPageIndexRoundTripTest : public ::testing::Test {
+ class ParquetIndexRoundTripTest {
Member

Suggested change
- class ParquetIndexRoundTripTest {
+ class TestingWithPageIndex {

ParquetIndexRoundTripTest is a little bit confusing since it is not a complete test.

ASSERT_EQ(nullptr, bloom_filter);
} else {
ASSERT_NE(nullptr, bloom_filter);
bloom_filters_.push_back(std::move(bloom_filter));
Member

What about making bloom_filters_ an output parameter of the function ReadBloomFilters instead of a class member variable?

std::vector<std::unique_ptr<BloomFilter>> bloom_filters_;
};

TEST_F(ParquetBloomFilterRoundTripTest, SimpleRoundTrip) {
Member

The three test cases below share a lot of common logic (with exactly the same data). Should we refactor them to eliminate the duplication?

Comment on lines +122 to +123
}
/// Compute hash for Int96 value by using its plain encoding result.
Member

Suggested change
}
/// Compute hash for Int96 value by using its plain encoding result.
}
/// Compute hash for Int96 value by using its plain encoding result.

struct BloomFilterLocation {
/// Row group bloom filter index locations which uses row group ordinal as the key.
///
/// Note: Before Parquet 2.10, the bloom filter index only have "offset". But here
Member

Suggested change
- /// Note: Before Parquet 2.10, the bloom filter index only have "offset". But here
+ /// Note: Before Parquet Format v2.10, the bloom filter index only have "offset". But here

///
/// Number of columns with a bloom filter to be relatively small compared to
/// the number of overall columns, so map is used.
using RowGroupBloomFilterLocation = std::map<int32_t, IndexLocation>;
Member

What about defining RowGroupBloomFilterLocation and FileBloomFilterLocation in the BloomFilterLocation?
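A minimal sketch of what nesting the aliases inside the struct could look like. Only the `std::map<int32_t, IndexLocation>` alias appears in the quoted diff; the shape of the file-level alias, the member name, and the `Sketch`-prefixed types are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>

// Hypothetical stand-in for the index location type (offset and length).
struct SketchIndexLocation {
  int64_t offset;
  int32_t length;
};

struct SketchBloomFilterLocation {
  // Column ordinal -> location. A map is used because only a few columns
  // are expected to carry a bloom filter.
  using RowGroupLocation = std::map<int32_t, SketchIndexLocation>;
  // Row group ordinal -> per-column locations (assumed shape).
  using FileLocation = std::map<size_t, RowGroupLocation>;

  FileLocation bloom_filter_location;
};
```

Scoping the aliases this way keeps them out of the enclosing namespace, and callers would refer to them as `SketchBloomFilterLocation::RowGroupLocation`.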

auto& column = row_group_metadata.columns[column_id];
auto& column_metadata = column.meta_data;
column_metadata.__set_bloom_filter_offset(bloom_filter_location.offset);
// bloom_filter_length is added by Parquet format 2.10.0
Member

I don't think we need this comment

this->sink_, column_properties.compression(), this->metadata_.get());
builder_ =
internal::BloomFilterBuilder::Make(&this->schema_, this->writer_properties_.get());
// Initial RowGroup
Member

Did you mean initialize?

writer->WriteBatch(this->values_.size(), nullptr, nullptr, this->values_ptr_);
writer->Close();

// Read all rows so we are sure that also the non-dictionary pages are read correctly
Member

Suggested change
- // Read all rows so we are sure that also the non-dictionary pages are read correctly
+ // Make sure that column values are read correctly

Development

Successfully merging this pull request may close these issues.

[C++][Parquet] Allow writing BloomFilter for specific column
8 participants