Add BloomFilter configuration #572
Conversation
```diff
@@ -24,6 +24,9 @@ case class TableUtils(sparkSession: SparkSession) {
     sparkSession.conf.get("spark.chronon.partition.format", "yyyy-MM-dd")
   val partitionSpec: PartitionSpec = PartitionSpec(partitionFormat, WindowUtils.Day.millis)
   val backfillValidationEnforced = sparkSession.conf.get("spark.chronon.backfill.validation.enabled", "true").toBoolean
+  // Threshold that controls whether to use a bloom filter on join backfill: if the approximate row count is under this threshold, the bloom filter is used.
+  // We use an approximate count so that the optimal number of bits stays around 1G for the default fpp of 0.01.
+  val bloomFilterThreshold = sparkSession.conf.get("spark.chronon.backfill.bloomfilter.threshold", "800000000").toLong
```
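The "optimal number of bits" mentioned in the comment comes from the standard bloom filter sizing formula, m = -n * ln(p) / (ln 2)^2. A minimal sketch (the object and method names are my own, not part of Chronon) checking the math at the default threshold of 800M rows and fpp 0.01:

```scala
object BloomSizing {
  // Optimal bloom filter size in bits for n expected insertions
  // and false-positive probability fpp: m = -n * ln(fpp) / (ln 2)^2
  def optimalBits(n: Long, fpp: Double): Long =
    math.ceil(-n * math.log(fpp) / (math.log(2) * math.log(2))).toLong

  def main(args: Array[String]): Unit = {
    val bits = optimalBits(800000000L, 0.01)
    // ~7.67 billion bits, i.e. roughly 0.96 GB of filter memory,
    // consistent with the "around 1G" note in the diff comment.
    println(s"bits = $bits, GB = ${bits / 8.0 / 1e9}")
  }
}
```

At the default fpp of 0.01 each inserted key costs about 9.6 bits, so the 800M-row threshold caps the filter at roughly 1 GB; raising the threshold grows the filter proportionally.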
Planning to use 1M as the default. Seeking suggestions.
Co-authored-by: Nikhil <[email protected]> Signed-off-by: Sophie <[email protected]>
@SophieYu41 the implementation makes sense, but just curious: what's the rationale behind opting out of generating the bloom filter when the left-side row count is very large?
The rationale is to use the bloom filter when there are lots of keys to be filtered out, typically a relatively small left side against a large right side. If the left side is already huge, the bloom filter is unlikely to help. This is a request from Homes when onboarding to chaining: they observed the bloom filter step taking a long time in the Spark job, so the backfill job could not get through.
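Given the discussion above, a team hitting this can tune the threshold at submit time without a code change. A sketch of the override (the config key is taken from the diff; the 100M value and the elided job arguments are illustrative, not a recommendation):

```shell
# Illustrative only: lower the cutoff so that joins whose left side
# exceeds ~100M approximate rows skip the bloom filter step entirely.
spark-submit \
  --conf spark.chronon.backfill.bloomfilter.threshold=100000000 \
  ...
```

Because the check compares an approximate count against the threshold, setting it lower trades bloom filter pruning on the right side for skipping the filter-generation cost on large left sides.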
Summary
Why / Goal
Test Plan
Tested on Gateway with:

```shell
python3 ~/.local/bin/run.py --mode=backfill --conf=production/joins/zipline_test/test_online_join_small.v2 --chronon-jar=bloom-test-1002.jar
```
Checklist
Reviewers
@nikhilsimha @better365