
Add BloomFilter configuration #572

Merged
merged 6 commits into from
Oct 6, 2023

Conversation

Collaborator

@SophieYu41 SophieYu41 commented Oct 2, 2023

Summary

  • Add an option to define a row-count threshold; if the threshold is exceeded, the bloom filter is skipped.
  • Default threshold is 100K for now (arbitrary pick).
  • Minor update to the table permission check: return true if an unexpected error is encountered, since we cannot confirm that access is denied.
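For context, the new threshold is read from Spark conf (key `spark.chronon.backfill.bloomfilter.threshold`, per the diff in this PR), so it can be overridden per job. A sketch of a spark-defaults / `--conf` override; the value shown is purely illustrative:

```
spark.chronon.backfill.bloomfilter.threshold=1000000
```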

Why / Goal

Test Plan

  • Added Unit Tests
  • Covered by existing CI
  • Integration tested
    Tested on Gateway with python3 ~/.local/bin/run.py --mode=backfill --conf=production/joins/zipline_test/test_online_join_small.v2 --chronon-jar=bloom-test-1002.jar

Checklist

  • Documentation update

Reviewers

@nikhilsimha @better365

@@ -24,6 +24,9 @@ case class TableUtils(sparkSession: SparkSession) {
sparkSession.conf.get("spark.chronon.partition.format", "yyyy-MM-dd")
val partitionSpec: PartitionSpec = PartitionSpec(partitionFormat, WindowUtils.Day.millis)
val backfillValidationEnforced = sparkSession.conf.get("spark.chronon.backfill.validation.enabled", "true").toBoolean
// Threshold to control whether or not to use bloomfilter on join backfill. If the row approximate count is under this threshold, we will use bloomfilter.
// We are choosing approximate count so that optimal number of bits is at-least 1G for default fpp of 0.01
val bloomFilterThreshold = sparkSession.conf.get("spark.chronon.backfill.bloomfilter.threshold", "800000000").toLong
Collaborator Author

@SophieYu41 SophieYu41 Oct 3, 2023


Planning to use 1M as the default. Seeking suggestions.

@hzding621
Collaborator

@SophieYu41 the implementation makes sense, but I'm curious: what's the rationale behind opting out of generating the bloom filter when the left row count is very big?

@SophieYu41
Collaborator Author

@SophieYu41 the implementation makes sense, but I'm curious: what's the rationale behind opting out of generating the bloom filter when the left row count is very big?

The rationale is to use the bloom filter when there are many keys to filter out, typically with a relatively small left side and a large right side. If the left side is already huge, the bloom filter is unlikely to help much.

This came up as a request from Homes while onboarding to chaining: they observed the bloom filter step taking a long time in the Spark job, which prevented the backfill job from getting through.
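The opt-out described in this thread boils down to a single comparison. A minimal sketch, using a hypothetical helper name (not Chronon's actual code); the conf key's default of 800M comes from the diff above:

```python
# Default from the PR diff (spark.chronon.backfill.bloomfilter.threshold);
# 1M was floated in review as a possible future default.
BLOOM_FILTER_THRESHOLD_DEFAULT = 800_000_000

def should_use_bloom_filter(left_approx_count: int,
                            threshold: int = BLOOM_FILTER_THRESHOLD_DEFAULT) -> bool:
    """Build a bloom filter only when the left side's approximate row count
    is under the threshold; for a huge left side, building the filter can
    cost more than the right-side rows it prunes."""
    return left_approx_count < threshold

print(should_use_bloom_filter(100_000))        # small left side: filter helps
print(should_use_bloom_filter(1_000_000_000))  # huge left side: skip the filter
```

An approximate count is used because an exact count of a huge left side would itself be an expensive job, defeating the purpose of the shortcut.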

@SophieYu41 SophieYu41 merged commit d0323d3 into master Oct 6, 2023
@SophieYu41 SophieYu41 deleted the sophie-bloom branch October 6, 2023 17:07