[SPARK-50372][CONNECT][SQL] Make all DF execution path collect observed metrics #48920

xupefei · 2024-11-21T12:56:23Z

What changes were proposed in this pull request?

This PR fixes an issue where some DataFrame execution paths did not process ObservedMetrics. The fix injects lazy processing logic into the result iterator, so metrics are handled as the results are consumed.

The following private execution APIs are affected by this issue:

SparkSession.execute(proto.Relation.Builder)
SparkSession.execute(proto.Command)
SparkSession.execute(proto.Plan)

The following user-facing API is affected by this issue:

DataFrame.write.format("...").mode("...").save()
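
As a rough illustration of injecting lazy processing into a result iterator, here is a minimal sketch in plain Scala. The `Response`, `Batch`, `ObservedMetrics`, and `MetricsProcessingIterator` types are hypothetical stand-ins invented for this example; they are not the classes used in the Connect client, and the actual change may be structured differently.

```scala
// Hypothetical stand-ins for responses streamed back to the client.
sealed trait Response
final case class Batch(rows: Seq[Int]) extends Response
final case class ObservedMetrics(planId: Long, values: Map[String, Long]) extends Response

// Wraps any result iterator so that observed metrics are handled as a side
// effect of consumption, no matter which execution path created the iterator.
final class MetricsProcessingIterator(
    underlying: Iterator[Response],
    onMetrics: ObservedMetrics => Unit)
  extends Iterator[Response] {

  override def hasNext: Boolean = underlying.hasNext

  override def next(): Response = {
    val response = underlying.next()
    response match {
      case m: ObservedMetrics => onMetrics(m) // processed lazily, on consumption
      case _ => ()
    }
    response
  }
}

object LazyMetricsDemo {
  // Drains a wrapped iterator the way a save()-style path would (rows are
  // discarded) and returns the metrics seen along the way, keyed by plan ID.
  def run(responses: Iterator[Response]): Map[Long, Map[String, Long]] = {
    val collected = scala.collection.mutable.Map.empty[Long, Map[String, Long]]
    val results = new MetricsProcessingIterator(responses, m => collected(m.planId) = m.values)
    results.foreach(_ => ())
    collected.toMap
  }
}
```

Because the callback lives in the iterator itself, a caller that only drains the results still triggers metric processing, which is the property the affected paths above were missing.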

This PR also fixes a server-side issue in which two observed metrics could be assigned the same Plan ID when they appear in the same plan (e.g., one observation is used as the input of another). The fix is to traverse the plan and find all observations along with their correct IDs.
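
The traversal idea can be sketched on a toy plan tree. The node types below (`Plan`, `Scan`, `Filter`, and this simplified `CollectMetrics`) are assumptions made for illustration and do not mirror the Catalyst classes the server actually uses.

```scala
// Hypothetical, simplified plan nodes used only for this sketch.
sealed trait Plan { def children: Seq[Plan] }
final case class Scan(table: String) extends Plan {
  val children: Seq[Plan] = Nil
}
final case class Filter(child: Plan) extends Plan {
  val children: Seq[Plan] = Seq(child)
}
final case class CollectMetrics(name: String, planId: Long, child: Plan) extends Plan {
  val children: Seq[Plan] = Seq(child)
}

object ObservationLookup {
  // Walks the whole tree and pairs every observation name with the plan ID
  // of its own node, instead of giving all of them the root's plan ID.
  def collectObservations(plan: Plan): Map[String, Long] = {
    val own = plan match {
      case CollectMetrics(name, planId, _) => Map(name -> planId)
      case _ => Map.empty[String, Long]
    }
    plan.children.foldLeft(own)((acc, child) => acc ++ collectObservations(child))
  }
}
```

With a nested plan such as `CollectMetrics("outer", 2L, Filter(CollectMetrics("inner", 1L, Scan("t"))))`, each observation keeps its own ID rather than sharing one.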

Another bug was discovered as a byproduct of introducing a new test case. The PR comment from SparkConnectPlanner.scala is copied here:

This fixes a bug where the input of a CollectMetrics can be processed twice, once in Line 1190 and once here/below.

When the input contains another CollectMetrics, transforming it twice causes the Observation object in the input to be initialised and registered with the system twice. Since only one of the two will be fulfilled when the query finishes, the one we look at may not have any data.

This issue is highlighted in the test case "Observation.get is blocked until the query is finished ...", where we specifically execute observedObservedDf, which is a CollectMetrics that has another CollectMetrics as its input.
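
To make the double-registration problem concrete, here is a small self-contained model. Everything in it (`Rel`, `Observe`, `Registry`, and the `Planner` methods) is hypothetical and only mimics the shape of the bug; it is not code from SparkConnectPlanner.scala.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-ins: Rel models an unresolved relation and Registry
// models the place where observations get registered.
sealed trait Rel
case object Source extends Rel
final case class Observe(name: String, input: Rel) extends Rel

final class Registry {
  val registered: ArrayBuffer[String] = ArrayBuffer.empty[String]
}

object Planner {
  // Transforming an Observe node registers its observation as a side effect.
  def transform(rel: Rel, registry: Registry): Unit = rel match {
    case Source => ()
    case Observe(name, input) =>
      transform(input, registry)
      registry.registered += name
  }

  // Buggy shape: the input is transformed once up front and then again
  // when the node is built, so a nested observation is registered twice.
  def transformObserveTwice(rel: Observe, registry: Registry): Unit = {
    transform(rel.input, registry) // first pass
    transform(rel.input, registry) // second pass over the same input
    registry.registered += rel.name
  }

  // Fixed shape: the already transformed input is reused, so there is one pass.
  def transformObserveOnce(rel: Observe, registry: Registry): Unit = {
    transform(rel.input, registry)
    registry.registered += rel.name
  }
}
```

For `Observe("outer", Observe("inner", Source))`, the buggy variant registers "inner" twice while the fixed variant registers it once, which corresponds to the nested CollectMetrics case exercised by observedObservedDf.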

Why are the changes needed?

To fix a bug.

Does this PR introduce any user-facing change?

Yes, this bug is user-facing.

How was this patch tested?

New tests.

Was this patch authored or co-authored using generative AI tooling?

No.

xupefei · 2024-11-21T13:03:42Z

...connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala

-      CollectMetrics(name, metrics.map(_.named), transformRelation(rel.getInput), planId)
+      CollectMetrics(name, metrics.map(_.named), input, planId)


This fixes a bug where the input of a CollectMetrics can be processed twice, once in Line 1190 and once here/below.

When the input contains another CollectMetrics, transforming it twice causes the Observation object in the input to be initialised and registered with the system twice. Since only one of the two will be fulfilled when the query finishes, the one we look at may not have any data.

This issue is highlighted in the test case "Observation.get is blocked until the query is finished ...", where we specifically execute observedObservedDf, which is a CollectMetrics that has another CollectMetrics as its input.

vicennial

Thanks for fixing! Mostly LGTM, minor ask

connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SparkSession.scala

HyukjinKwon

Looks good but mind listing up affected API in the PR description?

xupefei · 2024-11-22T19:39:24Z

Looks good but mind listing up affected API in the PR description?

Done!

xupefei added 2 commits November 21, 2024 13:47

fix

ebdf54e

.

7fa2757

github-actions bot added SQL CONNECT labels Nov 21, 2024

remove logging

8392da4

xupefei commented Nov 21, 2024

vicennial approved these changes Nov 21, 2024

rename & fmt

8eb3381

HyukjinKwon changed the title from "[SPARK-50372][Connect][SQL] Make all DF execution path collect observed metrics" to "[SPARK-50372][CONNECT][SQL] Make all DF execution path collect observed metrics" Nov 21, 2024

HyukjinKwon approved these changes Nov 21, 2024
