
Add BloomFilter configuration #572

Merged
merged 6 commits into from
Oct 6, 2023

Conversation

Collaborator

@SophieYu41 SophieYu41 commented Oct 2, 2023

Summary

  • Add an option to define a row-count threshold; if the threshold is exceeded, the bloom filter is skipped.
  • Default threshold is 100K for now (arbitrary pick).
  • Minor update to the table permission check: return true if an unexpected error is encountered, since we cannot confirm that access is denied.
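For context, the new threshold is read from Spark conf (key `spark.chronon.backfill.bloomfilter.threshold`, per the diff in this PR), so it can be overridden per job. A sketch of a spark-defaults / `--conf` override; the value shown is purely illustrative:

```
spark.chronon.backfill.bloomfilter.threshold=1000000
```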

Why / Goal

Test Plan

  • Added Unit Tests
  • Covered by existing CI
  • Integration tested
    Tested on Gateway with python3 ~/.local/bin/run.py --mode=backfill --conf=production/joins/zipline_test/test_online_join_small.v2 --chronon-jar=bloom-test-1002.jar

Checklist

  • Documentation update

Reviewers

@nikhilsimha @better365

@@ -24,6 +24,9 @@ case class TableUtils(sparkSession: SparkSession) {
sparkSession.conf.get("spark.chronon.partition.format", "yyyy-MM-dd")
val partitionSpec: PartitionSpec = PartitionSpec(partitionFormat, WindowUtils.Day.millis)
val backfillValidationEnforced = sparkSession.conf.get("spark.chronon.backfill.validation.enabled", "true").toBoolean
// Threshold to control whether or not to use bloomfilter on join backfill. If the row approximate count is under this threshold, we will use bloomfilter.
// We are choosing approximate count so that optimal number of bits is at-least 1G for default fpp of 0.01
val bloomFilterThreshold = sparkSession.conf.get("spark.chronon.backfill.bloomfilter.threshold", "800000000").toLong
Collaborator Author

@SophieYu41 SophieYu41 Oct 3, 2023


Planning to use 1M as the default. Seeking suggestions.

@hzding621
Collaborator

@SophieYu41 the implementation makes sense, but I'm curious: what's the rationale behind opting out of generating the bloom filter when the left row count is very big?

@SophieYu41
Collaborator Author

@SophieYu41 the implementation makes sense, but I'm curious: what's the rationale behind opting out of generating the bloom filter when the left row count is very big?

The rationale is to use the bloom filter when there are many keys to filter out, typically with a relatively small left side and a large right side. If the left side is already huge, the bloom filter is unlikely to help much.

This came up as a request from Homes while onboarding to chaining: they observed the bloom filter step taking a long time in the Spark job, which prevented the backfill job from getting through.
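The opt-out described in this thread boils down to a single comparison. A minimal sketch, using a hypothetical helper name (not Chronon's actual code); the conf key's default of 800M comes from the diff above:

```python
# Default from the PR diff (spark.chronon.backfill.bloomfilter.threshold);
# 1M was floated in review as a possible future default.
BLOOM_FILTER_THRESHOLD_DEFAULT = 800_000_000

def should_use_bloom_filter(left_approx_count: int,
                            threshold: int = BLOOM_FILTER_THRESHOLD_DEFAULT) -> bool:
    """Build a bloom filter only when the left side's approximate row count
    is under the threshold; for a huge left side, building the filter can
    cost more than the right-side rows it prunes."""
    return left_approx_count < threshold

print(should_use_bloom_filter(100_000))        # small left side: filter helps
print(should_use_bloom_filter(1_000_000_000))  # huge left side: skip the filter
```

An approximate count is used because an exact count of a huge left side would itself be an expensive job, defeating the purpose of the shortcut.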

@SophieYu41 SophieYu41 merged commit d0323d3 into master Oct 6, 2023
@SophieYu41 SophieYu41 deleted the sophie-bloom branch October 6, 2023 17:07