[Experimental] sampling #100
base: twitter-master
Upgrade to 0.139-tw-17
…scovery Add a few logs in DiscoveryNodeManager to debug 'No worker nodes' error.
…_queries Add query logging of all presto queries
Upgrade to 0.139-tw-0.18
…sitive_parquet_column_match Look up parquet columns by name case-insensitive
…are_parquet_column_match Handle hive keywords when doing a name-based parquet field lookup
Upgrade to 0.139-tw-0.19
[maven-release-plugin] copy for tag 0.141 # Conflicts: # pom.xml
Upgrade to 0.141-tw-0.20
…function signature
Upgrade to 181
Kafka07 plugin
Fix build and style
log.info("startTs and endTs are both empty");
// throw new IllegalArgumentException("Must provide filter on " + KafkaInternalFieldDescription.OFFSET_TIMESTAMP_FIELD.getName());
endTs = System.currentTimeMillis();
startTs = endTs - 10 * 60 * 1000;
Can we make it as a connector property?
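To illustrate the suggestion, here is a minimal sketch of how the hard-coded 10-minute fallback could take its lookback from configuration instead. `DefaultScanWindow`, `resolve`, and the `defaultLookback` parameter are hypothetical names, not part of this pull request, and filling in a single missing bound is an assumption that goes beyond the diff, which only handles the case where both bounds are empty.

```java
import java.time.Duration;

// Hypothetical helper, not part of the pull request.
final class DefaultScanWindow
{
    final long startTs;
    final long endTs;

    private DefaultScanWindow(long startTs, long endTs)
    {
        this.startTs = startTs;
        this.endTs = endTs;
    }

    // defaultLookback would be read from a connector property (assumed);
    // nowMillis is passed in so the logic can be tested deterministically.
    static DefaultScanWindow resolve(Long startTs, Long endTs, Duration defaultLookback, long nowMillis)
    {
        // A missing end bound falls back to "now".
        long end = (endTs != null) ? endTs : nowMillis;
        // A missing start bound falls back to "end minus the configured lookback".
        long start = (startTs != null) ? startTs : end - defaultLookback.toMillis();
        return new DefaultScanWindow(start, end);
    }
}
```

With both bounds empty and a 10-minute lookback, this reproduces the behaviour in the diff above: `endTs` is the current time and `startTs` is 10 minutes earlier.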
Can we separate them into 2 pull requests?
Also, do we need 2 session variables:
@maosongfu I prefer having two session vars: if we further research how sampling could work better (for example, jumping randomly in the segment instead of just sampling the first few percent), then
@xiaoyao1991 Fair enough. Can you add this justification as comments in the source code and in the description of the pull request?
@@ -174,7 +178,10 @@ public boolean advanceNextPosition()
 try {
 // Create a fetch request
-openFetchRequest();
+KafkaFetchStatus kafkaFetchStatus = openFetchRequest();
+if (kafkaFetchStatus == KafkaFetchStatus.SAMPLE_ACHIEVED) {
I think we can simply check whether messageAndOffsetIterator is still null for this case, so we can get rid of the KafkaFetchStatus modification.
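A rough sketch of what that could look like, assuming `openFetchRequest()` simply leaves `messageAndOffsetIterator` unset once the sample is complete. Everything except the names `advanceNextPosition`, `openFetchRequest`, and `messageAndOffsetIterator` is hypothetical: the in-memory batches stand in for Kafka fetch responses, and `sampleLimit` stands in for however the real cursor decides that the sample is achieved.

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for the record cursor; strings replace Kafka messages.
final class SampledCursorSketch
{
    private final Iterator<List<String>> fetches; // stand-in for fetch responses
    private final long sampleLimit;               // assumed stop condition
    private Iterator<String> messageAndOffsetIterator;
    private long consumed;
    String current;

    SampledCursorSketch(List<List<String>> batches, long sampleLimit)
    {
        this.fetches = batches.iterator();
        this.sampleLimit = sampleLimit;
    }

    boolean advanceNextPosition()
    {
        while (true) {
            if (messageAndOffsetIterator != null && messageAndOffsetIterator.hasNext()) {
                current = messageAndOffsetIterator.next();
                consumed++;
                return true;
            }
            // Current batch is exhausted: reset and try to open a new fetch.
            messageAndOffsetIterator = null;
            openFetchRequest();
            if (messageAndOffsetIterator == null) {
                // No status enum needed: a null iterator means the sample
                // is achieved or there is nothing left to read.
                return false;
            }
        }
    }

    private void openFetchRequest()
    {
        // The limit is only checked at fetch boundaries in this sketch.
        if (consumed >= sampleLimit || !fetches.hasNext()) {
            return; // leaves messageAndOffsetIterator null
        }
        messageAndOffsetIterator = fetches.next().iterator();
    }
}
```

The trade-off is that a null iterator no longer distinguishes "sample achieved" from "no more data"; if callers need that distinction, the status return value may still be worth keeping.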
Experimental feature:
Added session variables:
sampling_only: a boolean flag that controls whether we sample only the first X percent of a segment or scan through the entire segment.
sampling_percent: a double that specifies the X percent.
Notes:
Turning on the sampling_only flag speeds up query execution significantly. However, because only the first X percent of each segment is read and the rest is discarded, query results can be inaccurate. Use it only when the scan range is large, for example when querying the last hour. For smaller scan ranges (5 min, 10 min, etc.), the query should already complete fairly quickly, so sacrificing accuracy is not worthwhile.
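The effect of the two session variables on a single segment can be sketched as follows. This is an illustration under stated assumptions, not the code in this pull request: `SegmentSampling` and `sampleEndOffset` are hypothetical names, `sampling_percent` is assumed to be on a 0 to 100 scale, and the segment is assumed to be addressable by a start and end offset.

```java
// Hypothetical helper, not part of the pull request.
final class SegmentSampling
{
    private SegmentSampling() {}

    // Returns the offset at which reading should stop for one segment.
    static long sampleEndOffset(long startOffset, long endOffset, boolean samplingOnly, double samplingPercent)
    {
        if (!samplingOnly) {
            // sampling_only = false: scan the entire segment.
            return endOffset;
        }
        if (samplingPercent < 0 || samplingPercent > 100) {
            // Assumes a 0..100 scale for sampling_percent.
            throw new IllegalArgumentException("sampling_percent must be between 0 and 100");
        }
        long length = endOffset - startOffset;
        // Read only the first X percent; everything after this offset is skipped.
        return startOffset + (long) Math.floor(length * samplingPercent / 100.0);
    }
}
```

For a segment spanning offsets 1000 to 2000 with `sampling_only = true` and `sampling_percent = 10`, reading would stop at offset 1100, so 90 percent of the segment never contributes to the result.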