[Experimental] sampling #100

xiaoyao1991 · 2017-08-10T01:46:59Z

Experimental feature:
Added session variables:
sampling_only: a boolean flag that controls whether we sample only the first X percent of a segment or scan the entire segment.
sampling_percent: a double that sets X, the percentage of each segment to sample.

Notes:
Turning on the sampling_only flag speeds up query execution significantly. However, because only the first X percent of each segment is read and the rest is skipped, query results will be inaccurate. It should be used only when the scan range is large, for example when querying the last hour. For smaller scan ranges (5 min, 10 min, etc.) the query should already complete fairly quickly, so the loss of accuracy is not worth it.
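
To make the two variables concrete, here is a minimal sketch of how they could determine how much of a segment gets read. The class and method names are hypothetical, and sampling by byte count (rather than by message count) is an assumption made for illustration, not something taken from the patch.

```java
// Illustrative sketch only: class and method names are hypothetical,
// and sampling by byte count is an assumption, not confirmed by the patch.
final class SegmentSamplingSketch {
    private SegmentSamplingSketch() {}

    /**
     * Returns how many leading bytes of a segment to read.
     * With sampling_only off, the whole segment is scanned.
     */
    static long bytesToRead(long segmentBytes, boolean samplingOnly, double samplingPercent) {
        if (segmentBytes < 0) {
            throw new IllegalArgumentException("segmentBytes must be non-negative");
        }
        if (!samplingOnly) {
            return segmentBytes;
        }
        // The negated form also rejects NaN.
        if (!(samplingPercent >= 0.0 && samplingPercent <= 100.0)) {
            throw new IllegalArgumentException("sampling_percent must be between 0 and 100");
        }
        // Round down so we never read past the sampled prefix.
        return (long) Math.floor(segmentBytes * samplingPercent / 100.0);
    }
}
```

With a 1000-byte segment and sampling_percent = 10, this sketch reads the first 100 bytes and skips the remaining 900, which is where the inaccuracy described above comes from.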

xiaoyao1991 · 2017-08-14T18:39:14Z

@maosongfu

Yaliang · 2017-08-14T18:42:20Z

presto-kafka07/src/main/java/com/facebook/presto/kafka/KafkaMetadata.java

+            log.info("startTs and endTs are both empty");
+            // throw new IllegalArgumentException("Must provide filter on " + KafkaInternalFieldDescription.OFFSET_TIMESTAMP_FIELD.getName());
+            endTs = System.currentTimeMillis();
+            startTs = endTs - 10 * 60 * 1000;


Can we make this a connector property?
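
As a rough sketch of what that could look like: read the fallback window from connector configuration instead of hard-coding 10 minutes. The property name `kafka.default-query-range-ms` and the class below are made up for illustration; in the actual plugin this would go through the connector's own config mechanism.

```java
import java.util.Map;

// Illustrative sketch only: the property name and class are hypothetical.
final class DefaultQueryRangeSketch {
    // Hypothetical connector property; not defined by the plugin.
    static final String RANGE_PROPERTY = "kafka.default-query-range-ms";
    // Matches the 10-minute default hard-coded in the diff above.
    static final long FALLBACK_RANGE_MS = 10 * 60 * 1000L;

    private DefaultQueryRangeSketch() {}

    /** Reads the window length from connector properties, falling back to 10 minutes. */
    static long rangeMillis(Map<String, String> connectorProperties) {
        String raw = connectorProperties.get(RANGE_PROPERTY);
        if (raw == null) {
            return FALLBACK_RANGE_MS;
        }
        long value = Long.parseLong(raw.trim());
        if (value <= 0) {
            throw new IllegalArgumentException(RANGE_PROPERTY + " must be positive");
        }
        return value;
    }

    /** Window to use when neither startTs nor endTs is given: {startTs, endTs}. */
    static long[] defaultWindow(long nowMillis, long rangeMillis) {
        return new long[] {nowMillis - rangeMillis, nowMillis};
    }
}
```

The caller would pass `System.currentTimeMillis()` as `nowMillis`; taking it as a parameter just keeps the sketch easy to test.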

maosongfu · 2017-08-14T20:50:46Z

Can we separate this into 2 pull requests?

1. Hard limit on the query range (default is 10 min)
2. Sampling

maosongfu · 2017-08-14T20:53:36Z

Also, do we need 2 session variables, sampling_only and sampling_percent?
It seems sampling_percent is enough: if it is not set, we can treat sampling_only as false; otherwise, it is true.
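
A minimal sketch of that single-variable idea, assuming an unset sampling_percent is represented as an empty `OptionalDouble` (the class and method names are hypothetical):

```java
import java.util.OptionalDouble;

// Illustrative sketch only: names are hypothetical, not from the patch.
final class SingleVariableSamplingSketch {
    private SingleVariableSamplingSketch() {}

    /** Sampling is considered "on" exactly when sampling_percent is set. */
    static boolean samplingOnly(OptionalDouble samplingPercent) {
        return samplingPercent.isPresent();
    }

    /** Fraction of each segment to scan: 1.0 when sampling_percent is unset. */
    static double fractionToScan(OptionalDouble samplingPercent) {
        if (!samplingPercent.isPresent()) {
            return 1.0;
        }
        double percent = samplingPercent.getAsDouble();
        // The negated form also rejects NaN.
        if (!(percent >= 0.0 && percent <= 100.0)) {
            throw new IllegalArgumentException("sampling_percent must be between 0 and 100");
        }
        return percent / 100.0;
    }
}
```

The trade-off is that the on/off switch and the sampling strategy are tied to one variable, which is simpler now but leaves less room for other strategies later.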

xiaoyao1991 · 2017-08-14T20:56:21Z

@maosongfu I prefer having two session vars. If we later research how sampling could work better (for example, jumping randomly within the segment instead of sampling just the first few percent), then sampling_only is the global feature flag, while sampling_percent or sampling_jump would be minor feature flags.

maosongfu · 2017-08-14T21:26:21Z

@xiaoyao1991 Fair enough. Can you add this justification as comments in source code, and in the description of the pull request?

Yaliang · 2017-08-14T22:09:34Z

presto-kafka07/src/main/java/com/facebook/presto/kafka/KafkaRecordSet.java

@@ -174,7 +178,10 @@ public boolean advanceNextPosition()

                try {
                    // Create a fetch request
-                    openFetchRequest();
+                    KafkaFetchStatus kafkaFetchStatus = openFetchRequest();
+                    if (kafkaFetchStatus == KafkaFetchStatus.SAMPLE_ACHIEVED) {


I think we can simply check whether messageAndOffsetIterator is still null in this case, so we can drop the KafkaFetchStatus change.
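
Roughly what I have in mind, as a stripped-down stand-in rather than the real KafkaRecordSet cursor: the in-memory list, the message-count sample limit, and the class name are all assumptions for illustration. `openFetchRequest()` stays `void` and simply leaves the iterator null once the sample is reached.

```java
import java.util.Iterator;
import java.util.List;

// Illustrative sketch only: a simplified stand-in for the real cursor.
final class NullCheckCursorSketch {
    private final List<String> messages;   // stands in for the Kafka segment
    private final int sampleLimit;         // messages to hand out before stopping
    private int consumed;
    private Iterator<String> messageAndOffsetIterator; // null means no open fetch
    private String current;

    NullCheckCursorSketch(List<String> messages, int sampleLimit) {
        this.messages = messages;
        this.sampleLimit = sampleLimit;
    }

    // Leaves the iterator null when the sample is complete or data is exhausted.
    private void openFetchRequest() {
        if (consumed >= sampleLimit || consumed >= messages.size()) {
            return;
        }
        int end = Math.min(messages.size(), sampleLimit);
        messageAndOffsetIterator = messages.subList(consumed, end).iterator();
    }

    boolean advanceNextPosition() {
        if (messageAndOffsetIterator == null || !messageAndOffsetIterator.hasNext()) {
            messageAndOffsetIterator = null;
            openFetchRequest();
            // No status enum needed: a null iterator means there is nothing more to read.
            if (messageAndOffsetIterator == null) {
                return false;
            }
        }
        current = messageAndOffsetIterator.next();
        consumed++;
        return true;
    }

    String current() {
        return current;
    }
}
```

With four messages and a sample limit of two, the first two calls to `advanceNextPosition()` return true and the third returns false.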

Bill Graham and others added 30 commits February 24, 2016 09:38

Change version to 0.139 to not conflict with merge

f4fed25

merge from 0.139

b3a94e1

upgrade pom version to 0.139-tw-0.17

2c9da60

Merge pull request twitter-forks#27 from billonahill/upgrade_to_139_17

d295eea

Upgrade to 0.139-tw-17

Adding query logging to presto queries

582df38

Add a few logs in DiscoveryNodeManager to debug 'No worker nodes' error.

04e60cc

Merge pull request twitter-forks#29 from saileshmittal/add-logs-to-di…

56582c3

…scovery Add a few logs in DiscoveryNodeManager to debug 'No worker nodes' error.

Merge pull request twitter-forks#28 from billonahill/billg/log_presto…

280e37c

…_queries Add query logging of all presto queries

Upgrade to 0.139-tw-0.18

4ca01a7

Merge pull request twitter-forks#30 from saileshmittal/upgrade_to_139_18

3a4a156

Upgrade to 0.139-tw-0.18

Look up parquet columns by name case-insensitive

f1085e0

Merge pull request twitter-forks#31 from billonahill/billg/case_insen…

775eee8

…sitive_parquet_column_match Look up parquet columns by name case-insensitive

Upgrading to 0.139-tw-0.19

5b4647a

Handle hive keywords when doing a name-based parquet field lookup

f766a0c

Merge branch 'billg/keyword_aware_parquet_column_match' into upgrade_…

a062c56

…to_139_19

Refactor to make diff cleaner

ca9961e

Merge branch 'billg/keyword_aware_parquet_column_match' into upgrade_…

423f5cf

…to_139_19

Merge pull request twitter-forks#32 from billonahill/billg/keyword_aw…

a1807ed

…are_parquet_column_match Handle hive keywords when doing a name-based parquet field lookup

Merge branch 'twitter-master' into upgrade_to_139_19

d687155

Merge pull request twitter-forks#33 from billonahill/upgrade_to_139_19

7eb8909

Upgrade to 0.139-tw-0.19

Upgrade to 0.141.

d7d66a2

Merge tag '0.141' into upgrade_to_141_20

a7a5233

[maven-release-plugin] copy for tag 0.141 # Conflicts: # pom.xml

Upgrade to 0.141-tw-0.20.

94bcf3e

Merge pull request twitter-forks#34 from saileshmittal/upgrade_to_141_20

dc6bebe

Upgrade to 0.141-tw-0.20

Use modules and query events for logging

3de1ca4

Add javadocs

f055db6

Add javadocs

3c1c7b6

fix imports

d8afd62

add splits, rows and bytes

251db4d

Change to use QueryComplete

95035ec

xiaoyao1991 and others added 14 commits July 19, 2017 17:40

cleanup

d98b04e

avoid unnecessary version conflict during merge

d8a6e9e

merged 0.181, resolved conflicts

791d3d5

use twitter tag 0.181-tw-0.37, update ThriftHiveRecordCursorProvider …

c8a76bb

…function signature

Merge pull request twitter-forks#98 from dabaitu/upgrade_to_181

70baf50

Upgrade to 181

Merge pull request twitter-forks#95 from xiaoyao1991/thrift_decoder_fin

92a9ebc

Thrift decoder

clone presto-kafka as presto-kafka07 plugin

1936564

kafka07 changes

47ac252

address comments

a5f13ce

Merge pull request twitter-forks#96 from xiaoyao1991/kafka07_plugin

ab45afb

Kafka07 plugin

Fix build and style

5feb582

Merge pull request twitter-forks#99 from xiaoyao1991/fix_build

dadbfe0

Fix build and style

init

96b13a2

sampling enabled

556f02b

xiaoyao1991 changed the title ~~Kafka07 hardlimit and sampling~~ [Experimental] Kafka07 hardlimit and sampling Aug 10, 2017

default query interval to 10min

8431bb8

xiaoyao1991 changed the title ~~[Experimental] Kafka07 hardlimit and sampling~~ Kafka07 hardlimit and [Experimental] sampling Aug 14, 2017

Yaliang reviewed Aug 14, 2017

address comments

7df05c0

xiaoyao1991 added 2 commits August 14, 2017 14:54

fix tests

e7c9580

justification

fd2dffe

Yaliang reviewed Aug 14, 2017

cleanup

2a8e840

xiaoyao1991 changed the title ~~Kafka07 hardlimit and [Experimental] sampling~~ [Experimental] sampling Aug 14, 2017

beinan force-pushed the twitter-master branch from b51d979 to 9a609f3 Compare August 24, 2020 06:45
