Add single node execution #24172

kewang1024 · 2024-11-29T21:01:44Z

Description

== RELEASE NOTES ==

General Changes
* Add single worker execution. To improve latency of tiny queries running on a large cluster, we introduce single worker execution mode: query will only use one node to execute and plan would be optimized accordingly. This feature can be turned on by config `single-node-execution-enabled` or session property `single_node_execution_enabled`.:pr:`24172`

tdcmeehan · 2024-12-02T16:26:35Z

Can you help to explain why this feature is or has to be tied to native execution? Perhaps describe the background and motivation?

arhimondr

High level comments

Do we want to support single node execution on per-query basis?

This can be useful to improve latency of tiny queries running on a large cluster. For example a user may know that a query is small and may decide to run it on a multi node cluster in a single node mode.

If we decide to support it, we need to make sure the session property is used consistently throughout the code.

If we decide not to support it for now, I think the session property should be removed and only a configuration property should be used.
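
To make the "used consistently" point concrete, here is a minimal sketch of one way a per-query session property could override a cluster-wide configuration default. `ToyFeaturesConfig` and `ToySession` are hypothetical stand-ins written for illustration, not Presto classes; only the property name `single_node_execution_enabled` comes from this PR.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-in for a config object holding the cluster-wide default.
final class ToyFeaturesConfig
{
    private final boolean singleNodeExecutionEnabled;

    ToyFeaturesConfig(boolean singleNodeExecutionEnabled)
    {
        this.singleNodeExecutionEnabled = singleNodeExecutionEnabled;
    }

    boolean isSingleNodeExecutionEnabled()
    {
        return singleNodeExecutionEnabled;
    }
}

// Hypothetical stand-in for a session carrying per-query property overrides.
final class ToySession
{
    static final String SINGLE_NODE_EXECUTION_ENABLED = "single_node_execution_enabled";

    private final Map<String, String> properties;
    private final ToyFeaturesConfig config;

    ToySession(Map<String, String> properties, ToyFeaturesConfig config)
    {
        this.properties = Map.copyOf(properties);
        this.config = config;
    }

    // The single place that decides the mode: a session override wins,
    // otherwise the configuration default applies.
    boolean isSingleNodeExecutionEnabled()
    {
        return Optional.ofNullable(properties.get(SINGLE_NODE_EXECUTION_ENABLED))
                .map(Boolean::parseBoolean)
                .orElseGet(config::isSingleNodeExecutionEnabled);
    }
}
```

If every planner and scheduler call site asks the session (rather than reading the config directly), the per-query override cannot be silently ignored in one code path.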

Should the single node execution mode be native specific?

When running a Java cluster deployment with a dedicated coordinator and one or more dedicated workers, additional exchanges at the worker-coordinator boundary are necessary.

I'm wondering whether a simpler mental model would be to always add coordinator-to-worker exchanges when single node execution is requested.

arhimondr · 2024-12-02T16:42:19Z

presto-main/src/main/java/com/facebook/presto/SystemSessionProperties.java

@@ -328,6 +328,7 @@ public final class SystemSessionProperties
    // TODO: Native execution related session properties that are temporarily put here. They will be relocated in the future.
    public static final String NATIVE_AGGREGATION_SPILL_ALL = "native_aggregation_spill_all";
    private static final String NATIVE_EXECUTION_ENABLED = "native_execution_enabled";
+    private static final String NATIVE_SINGLE_WORKER_EXECUTION = "native_single_worker_execution";


nit: Should we stay consistent and use node instead of worker (e.g.: query_max_memory_per_node, force_single_node_output, etc.). Also maybe add the _enabled suffix to make it sound more natural, e.g.: single_node_execution_enabled, isSingleNodeExecutionEnabled(...)

I tried singleNodeExecutionEnabled, but then realized it would cause confusion with forceSingleNode

Node can either be worker or coordinator, but what we want is explicitly worker. So singleWorkerExecutionEnabled makes more sense

presto-main/src/main/java/com/facebook/presto/SystemSessionProperties.java

presto-main/src/main/java/com/facebook/presto/sql/analyzer/FeaturesConfig.java

presto-main/src/main/java/com/facebook/presto/sql/planner/PlanFragmenterUtils.java

arhimondr · 2024-12-02T16:53:46Z

...ain/java/com/facebook/presto/sql/planner/optimizations/AddExchangeForNativeSingleWorker.java

+        }
+
+        @Override
+        public PlanNode visitTableFinish(TableFinishNode node, RewriteContext<Void> context)


There are other nodes that need to be run on coordinator:

https://github.com/prestodb/presto/blob/master/presto-main/src/main/java/com/facebook/presto/sql/planner/BasePlanFragmenter.java#L181
https://github.com/prestodb/presto/blob/master/presto-main/src/main/java/com/facebook/presto/sql/planner/BasePlanFragmenter.java#L235

Do we need to add an exchange to support those as well?

setCoordinatorOnlyDistribution currently applies to ExplainAnalyze, TableFinish, MetadataDelete and StatisticsWriterNode.

For MetadataDelete, we don't need to add an exchange (it looks like a metadata-only operation: 02b1bf7).
I have added the exchange for the rest.
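
The classification described in this reply can be sketched as follows. This is only an illustration under stated assumptions: `ToyNodeKind` and `CoordinatorOnlyNodes` are hypothetical names, and the sets simply restate what the reply says (four coordinator-only node types, three of which get an exchange); it does not reproduce the optimizer code in this PR.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical labels for the plan node types mentioned in the thread.
enum ToyNodeKind
{
    EXPLAIN_ANALYZE, TABLE_FINISH, STATISTICS_WRITER, METADATA_DELETE, TABLE_SCAN, AGGREGATION
}

final class CoordinatorOnlyNodes
{
    // Node types that the thread says run with coordinator-only distribution.
    private static final Set<ToyNodeKind> COORDINATOR_ONLY = EnumSet.of(
            ToyNodeKind.EXPLAIN_ANALYZE,
            ToyNodeKind.TABLE_FINISH,
            ToyNodeKind.STATISTICS_WRITER,
            ToyNodeKind.METADATA_DELETE);

    // MetadataDelete is treated as a metadata-only operation, so no exchange is added for it.
    private static final Set<ToyNodeKind> NO_EXCHANGE_NEEDED = EnumSet.of(ToyNodeKind.METADATA_DELETE);

    private CoordinatorOnlyNodes() {}

    static boolean isCoordinatorOnly(ToyNodeKind kind)
    {
        return COORDINATOR_ONLY.contains(kind);
    }

    // True when the node runs on the coordinator and reads data produced on the worker.
    static boolean needsExchangeBelow(ToyNodeKind kind)
    {
        return isCoordinatorOnly(kind) && !NO_EXCHANGE_NEEDED.contains(kind);
    }
}
```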

arhimondr · 2024-12-02T16:58:21Z

presto-main/src/main/java/com/facebook/presto/sql/planner/PlanOptimizers.java

@@ -813,7 +814,11 @@ public PlanOptimizers(
                        costCalculator,
                        ImmutableSet.of(new ScaledWriterRule())));

-        if (!forceSingleNode) {
+        if (featuresConfig.isNativeExecutionEnabled() && featuresConfig.isNativeSingleWorkerExecution()) {


I wonder if a simpler mental model would be to always add worker-to-coordinator exchanges when single worker execution is enabled.

This way:

* Single worker execution can be enabled on a per-query basis for normal clusters with more than a single worker and a single coordinator (with scheduling on the coordinator disabled).
* For Java execution, if coordinator scheduling is enabled, an extra exchange is not going to hurt.

For example, this condition can be kept as isSingleNodeExecutionEnabled(session), and AddExchangeForNativeSingleWorker can be renamed to AddWorkerToCoordinatorExchanges.

But in some cases, wouldn't it be an exchange from coordinator to worker? For example, when scanning a system table:

Aggregation [Worker]
|
Exchange
|
TableScan (system table) [Coordinator]
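
The plan shape above can be reproduced with a small toy rewriter: an exchange is placed wherever a child must run on a different kind of node than its parent, in either direction. All names here (`Placement`, `ToyPlanNode`, `ToyExchangeInserter`) are hypothetical and the sketch is not the optimizer added in this PR; it only illustrates why the exchange direction depends on where each side has to run.

```java
import java.util.ArrayList;
import java.util.List;

enum Placement { COORDINATOR, WORKER }

// Hypothetical, minimal plan node: a name, where it must run, and its children.
final class ToyPlanNode
{
    final String name;
    final Placement placement;
    final List<ToyPlanNode> children;

    ToyPlanNode(String name, Placement placement, List<ToyPlanNode> children)
    {
        this.name = name;
        this.placement = placement;
        this.children = List.copyOf(children);
    }
}

final class ToyExchangeInserter
{
    private ToyExchangeInserter() {}

    // Rebuilds the tree bottom-up and inserts an "Exchange" between a parent
    // and any child whose placement differs from the parent's.
    static ToyPlanNode addBoundaryExchanges(ToyPlanNode node)
    {
        List<ToyPlanNode> rewritten = new ArrayList<>();
        for (ToyPlanNode child : node.children) {
            ToyPlanNode newChild = addBoundaryExchanges(child);
            if (newChild.placement != node.placement) {
                // The exchange sits on the parent's side and reads from the child's side.
                newChild = new ToyPlanNode("Exchange", node.placement, List.of(newChild));
            }
            rewritten.add(newChild);
        }
        return new ToyPlanNode(node.name, node.placement, rewritten);
    }

    // Renders a plan as "Parent <- Child <- ..."; intended for linear chains only.
    static String render(ToyPlanNode node)
    {
        StringBuilder sb = new StringBuilder(node.name);
        for (ToyPlanNode child : node.children) {
            sb.append(" <- ").append(render(child));
        }
        return sb.toString();
    }
}
```

With an `Aggregation` placed on the worker over a `TableScan` placed on the coordinator, the rewrite yields the three-node chain shown above; with both nodes on the worker, the plan is returned unchanged.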

presto-main/src/main/java/com/facebook/presto/sql/planner/PlanOptimizers.java

...ain/java/com/facebook/presto/sql/planner/optimizations/MergeJoinForSortedInputOptimizer.java

...rc/test/java/com/facebook/presto/nativeworker/TestPrestoNativeWindowQueriesSingleWorker.java

arhimondr · 2024-12-02T17:03:58Z

...ution/src/test/java/com/facebook/presto/nativeworker/TestPrestoNativeWriterSingleWorker.java

+    }
+
+    @Override
+    public void testScaleWriters() {


Scale writers should have no effect in single node mode, right? Do we need this test?

I want to use this test to check scaled writers under single worker execution mode: they should not scale out to multiple worker tasks.

kaikalur · 2024-12-02T17:33:49Z

High level comments

Do we want to support single node execution on per-query basis?

This can be useful to improve latency of tiny queries running on a large cluster. For example a user may know that a query is small and may decide to run it on a multi node cluster in a single node mode.

We can also potentially use HBO/CBO to decide to run some in single node mode

arhimondr

LGTM % nits and fixing test failures

presto-main/src/main/java/com/facebook/presto/SystemSessionProperties.java

.../java/com/facebook/presto/sql/planner/optimizations/AddExchangeForSingleWorkerExecution.java

...ain/java/com/facebook/presto/sql/planner/optimizations/MergeJoinForSortedInputOptimizer.java

steveburnett · 2024-12-05T14:51:09Z

Thanks for the release note entry! Some formatting nits.

== RELEASE NOTES ==

General Changes
* Add single worker execution. To improve latency of tiny queries running on a large cluster, we introduce single worker execution mode: query will only use one node to execute and plan would be optimized accordingly. This feature can be turned on by the configuration property ``single-node-execution-enabled`` or session property ``single_node_execution_enabled``. :pr:`24172`

Also, consider adding documentation for the new configuration property and session property to either the Presto [Configuration, Session] Properties pages, or the Presto C++ pages, as appropriate.

tdcmeehan · 2024-12-05T17:37:03Z

...n/java/com/facebook/presto/sql/planner/optimizations/AddExchangesForSingleNodeExecution.java

+            if (containsSystemTableScan(plan)) {
+                plan = gatheringExchange(idAllocator.getNextId(), REMOTE_STREAMING, plan);
+            }


Are these three lines of code the reason we've extended so many test cases to apply to Presto - single node - native? If so, I'm wondering if a few targeted, example-based tests are more appropriate. I'm concerned the additional tests don't add value and will make our CI slower and more expensive.

Not only that: under single-node mode we essentially do not use the addExchange optimizer. In theory, we should test many of the cases that originally have a remote exchange (especially a partitioned one) to check that they still return correct results under single-node mode.

Scheduling has also changed accordingly, and one of the test cases actually caught an issue with it.

I understand your concern, so I removed some of the tests (those I think could be redundant in terms of exchange pattern).

kewang1024 · 2024-12-06T00:42:19Z

It won't let me rerun some sporadically flaky tests, so I have to push to force a rerun :(

tdcmeehan · 2024-12-06T15:12:02Z

@kewang1024 this feature could be useful for lower latency deployments to support canned or bounded queries, where latency is expected to be low. Could you please introduce some documentation for it as @steveburnett requested?

tdcmeehan · 2024-12-06T15:18:48Z

High level comments

Do we want to support single node execution on per-query basis?

This can be useful to improve latency of tiny queries running on a large cluster. For example a user may know that a query is small and may decide to run it on a multi node cluster in a single node mode.

We can also potentially use HBO/CBO to decide to run some in single node mode

@kewang1024 /@kaikalur can you create an issue for this so we don't lose track of this suggestion?

I think we could also consider toggling this feature via a resource group, similar to per-query limits in resource groups. This might make the feature more convenient to toggle, and in a multinode cluster it would allow a more accurate value for the hard/soft concurrency limit (and perhaps make it safer to run in a multitenant deployment). I can create an issue for that; I don't think it needs to be added here.

kewang1024 · 2024-12-06T22:31:55Z

@steveburnett the Presto C++ pages are not a good place since this is not limited to C++. I failed to find the [Configuration, Session] one you're referring to; can you give me a pointer?

steveburnett · 2024-12-06T23:01:29Z

@steveburnett the Presto C++ pages are not a good place since this is not limited to C++. I failed to find the [Configuration, Session] one you're referring to; can you give me a pointer?

Of course! I was referring to the Presto Configuration Properties page
https://github.com/prestodb/presto/blob/master/presto-docs/src/main/sphinx/admin/properties.rst

or the Presto Session Properties page
https://github.com/prestodb/presto/blob/master/presto-docs/src/main/sphinx/admin/properties-session.rst

kewang1024 · 2024-12-07T00:28:43Z

Thanks @steveburnett for the prompt response, updated the doc.
cc: @tdcmeehan

presto-docs/src/main/sphinx/admin/properties.rst

kewang1024 · 2024-12-09T18:54:07Z

@tdcmeehan Updated accordingly, can you help take another look? Thanks!

tdcmeehan

Overall, wondering what's preventing us from using FixedBucketNodeMap, since it seems that BucketNodeMap is aligned with this use case (only using a single node-bucket).

tdcmeehan · 2024-12-09T19:30:36Z

presto-main/src/main/java/com/facebook/presto/execution/scheduler/SectionExecutionFactory.java

+                    @Override
+                    public boolean isDynamic()
+                    {
+                        return true;


Just curious, why is this true? Shouldn't this be false, since I think there is a single node and a single task?

It is only applicable for grouped execution. Grouped execution is currently not supported for single node mode. It can be supported if needed, but generally the idea is that only small queries should run single node, while grouped execution is generally applicable for very large queries.

tdcmeehan · 2024-12-09T19:32:38Z

presto-main/src/main/java/com/facebook/presto/execution/scheduler/SectionExecutionFactory.java

+                    @Override
+                    public boolean hasInitialMap()
+                    {
+                        return false;


Can't this be true? Since there's only one node being returned in getBucketToNode?

This is true. @kewang1024 I wonder if instead we can simply use a DynamicBucketNodeMap((split) -> 0, 1). Basically pretending there's only a single bucket for all splits to avoid a custom override?

Sure, can make that change now
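
As a rough illustration of the "single bucket for all splits" idea, here is a self-contained sketch. `ToyBucketNodeMap` is a hypothetical, simplified class and not Presto's `BucketNodeMap` or `DynamicBucketNodeMap` API; it fixes the node up front for brevity and only demonstrates the mapping `(split) -> 0` with a bucket count of 1.

```java
import java.util.List;
import java.util.function.ToIntFunction;

// Hypothetical, simplified bucket-to-node map used only for illustration.
final class ToyBucketNodeMap<S>
{
    private final ToIntFunction<S> splitToBucket;
    private final List<String> bucketToNode;

    ToyBucketNodeMap(ToIntFunction<S> splitToBucket, List<String> bucketToNode)
    {
        this.splitToBucket = splitToBucket;
        this.bucketToNode = List.copyOf(bucketToNode);
    }

    int getBucketCount()
    {
        return bucketToNode.size();
    }

    // Looks up the bucket for a split, then the node that owns that bucket.
    String getAssignedNode(S split)
    {
        return bucketToNode.get(splitToBucket.applyAsInt(split));
    }

    // Single-node mode: exactly one bucket, and every split maps to bucket 0.
    static <S> ToyBucketNodeMap<S> singleNode(String node)
    {
        return new ToyBucketNodeMap<>(split -> 0, List.of(node));
    }
}
```

Because the split-to-bucket function is constant, no custom override is needed to express "everything goes to the one node"; the general-purpose map already covers it.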

To improve performance for small queries that can be executed within a single node, we introduce a single worker execution mode: the query uses only one node to execute, and the plan is optimized accordingly.

kewang1024 requested review from a team, jaystarshot, feilong-liu and ClarenceThreepwood as code owners November 29, 2024 21:01

kewang1024 requested review from presto-oss and arhimondr November 29, 2024 21:01

kewang1024 force-pushed the native-single-worker branch 2 times, most recently from 0d9137e to d71d256 Compare December 2, 2024 08:45

arhimondr reviewed Dec 2, 2024

tdcmeehan self-assigned this Dec 2, 2024

kewang1024 force-pushed the native-single-worker branch 8 times, most recently from 15bf3b5 to 5aa836f Compare December 4, 2024 09:11

arhimondr reviewed Dec 4, 2024

kewang1024 changed the title ~~Add single worker execution mode for native execution~~ Add single node execution mode Dec 5, 2024

kewang1024 force-pushed the native-single-worker branch 5 times, most recently from 7685df7 to 3de09d3 Compare December 5, 2024 07:29

kewang1024 changed the title ~~Add single node execution mode~~ Add single node execution Dec 5, 2024

kewang1024 force-pushed the native-single-worker branch from 3de09d3 to 8aa6631 Compare December 5, 2024 07:34

kewang1024 force-pushed the native-single-worker branch from 46d7ec3 to 1ee77a7 Compare December 5, 2024 10:02

arhimondr previously approved these changes Dec 5, 2024

tdcmeehan requested changes Dec 5, 2024

kewang1024 dismissed arhimondr’s stale review via 77747b4 December 5, 2024 18:39

kewang1024 force-pushed the native-single-worker branch 3 times, most recently from 2024bac to 55ca461 Compare December 5, 2024 18:50

kewang1024 requested a review from tdcmeehan December 5, 2024 18:50

kewang1024 force-pushed the native-single-worker branch 2 times, most recently from ac7073b to 57fce4d Compare December 6, 2024 00:41

kewang1024 force-pushed the native-single-worker branch from 57fce4d to fb4ad58 Compare December 7, 2024 00:26

kewang1024 requested review from steveburnett and elharo as code owners December 7, 2024 00:26

kewang1024 force-pushed the native-single-worker branch from fb4ad58 to b91f0f5 Compare December 7, 2024 00:28

tdcmeehan reviewed Dec 9, 2024

kewang1024 force-pushed the native-single-worker branch from 040866e to 61d73da Compare December 9, 2024 18:46

kewang1024 requested a review from tdcmeehan December 9, 2024 18:46

tdcmeehan reviewed Dec 9, 2024

kewang1024 force-pushed the native-single-worker branch 2 times, most recently from 8376f0c to 6ae5e35 Compare December 11, 2024 05:51

Add single node execution

58a7e6c

kewang1024 force-pushed the native-single-worker branch from 6ae5e35 to 58a7e6c Compare December 11, 2024 07:16
