Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Experimental] sampling #100

Open
wants to merge 333 commits into
base: twitter-master
Choose a base branch
from

Conversation

xiaoyao1991
Copy link

@xiaoyao1991 xiaoyao1991 commented Aug 10, 2017

Experimental feature:
Added session variables:
sampling_only: a boolean flag that controls if we should only sample the first X percent from a segment, or scan through the entire segment.
sampling_percent: a double number that indicates the X percent.

Notes:
Turning on the sampling_only flag will speed up query execution significantly. However, since it is only sampling the first X percent, the rest of the segment will be discarded, and therefore would cause inaccuracy in query results. It should be used only when the scan range is large, say querying the last 1 hour. For smaller scan range(like 5min, 10min etc), the query should complete fairly quickly already, so it's not worth it to sacrifice too much accuracy for that.

Bill Graham and others added 30 commits February 24, 2016 09:38
…scovery

Add a few logs in DiscoveryNodeManager to debug 'No worker nodes' error.
…_queries

Add query logging of all presto queries
…sitive_parquet_column_match

Look up parquet columns by name case-insensitive
…are_parquet_column_match

Handle hive keywords when doing a name-based parquet field lookup
[maven-release-plugin]  copy for tag 0.141

# Conflicts:
#	pom.xml
@xiaoyao1991 xiaoyao1991 changed the title Kafka07 hardlimit and sampling [Experimental] Kafka07 hardlimit and sampling Aug 10, 2017
@xiaoyao1991 xiaoyao1991 changed the title [Experimental] Kafka07 hardlimit and sampling Kafka07 hardlimit and [Experimental] sampling Aug 14, 2017
@xiaoyao1991
Copy link
Author

@maosongfu

log.info("startTs and endTs are both empty");
// throw new IllegalArgumentException("Must provide filter on " + KafkaInternalFieldDescription.OFFSET_TIMESTAMP_FIELD.getName());
endTs = System.currentTimeMillis();
startTs = endTs - 10 * 60 * 1000;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make it as a connector property?

@maosongfu
Copy link

Can we separate them into 2 pull requests?

  • Hard limit the query range (default is 10mins)
  • Sampling

@maosongfu
Copy link

Also, do we need 2 session variables: sampling_only and sampling_percent?
It seems sampling_percent is enough. If it is not set, we can consider sampling_only is false. Otherwise, it is true.

@xiaoyao1991
Copy link
Author

@maosongfu I prefer having two session vars: if we further research into how sampling could work better(for example, jumping randomly in the segment instead of just sampling first few percent), then sampling_only is a global feature flag, where sampling_percent or sampling_jump could be minor feature flag.

@maosongfu
Copy link

@xiaoyao1991 Fair enough. Can you add this justification as comments in source code, and in the description of the pull request?

@@ -174,7 +178,10 @@ public boolean advanceNextPosition()

try {
// Create a fetch request
openFetchRequest();
KafkaFetchStatus kafkaFetchStatus = openFetchRequest();
if (kafkaFetchStatus == KafkaFetchStatus.SAMPLE_ACHIEVED) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can simply check if messageAndOffsetIterator is still null for this case so we can get rid of KafkaFetchStatus modification.

@xiaoyao1991 xiaoyao1991 changed the title Kafka07 hardlimit and [Experimental] sampling [Experimental] sampling Aug 14, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants