
feat: Add additional metrics for shuffle write #1173

Closed
wants to merge 12 commits

Conversation

@andygrove andygrove commented Dec 16, 2024

Which issue does this PR close?

N/A

Rationale for this change

I would like to understand how much time is spent on shuffle writing.

What changes are included in this PR?

  • Introduce a new write_time native metric to record write time instead of using elapsed_time
  • Use elapsed_time to measure total native time (excluding executing the child plan and fetching data)
  • Add new input_time native metric for measuring the time for ShuffleWriterExec to execute the child plan and fetch its input data
  • Add new shuffleWallTime JVM metric to measure total time of shuffle (see the sketch after this list)
  • Update docs
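
As a rough illustration of the shuffleWallTime JVM metric described above, here is a hedged sketch (not the code in this PR; the metric key and description are assumptions based on the UI labels below) using Spark's SQLMetrics API, which accumulates nanoseconds and only converts for display:

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.execution.metric.{SQLMetric, SQLMetrics}

    // Hypothetical registration of a nanosecond-based wall-time metric for shuffle.
    def shuffleMetrics(sc: SparkContext): Map[String, SQLMetric] = Map(
      "shuffleWallTime" -> SQLMetrics.createNanoTimingMetric(sc, "shuffle wall time")
    )

    // At runtime, the metric would be accumulated around the shuffle write, e.g.:
    //   val start = System.nanoTime()
    //   ... perform the shuffle write ...
    //   metrics("shuffleWallTime").add(System.nanoTime() - start)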

Spark UI

Note the new metrics:

  • "native shuffle time"
  • "native shuffle input time"
  • "shuffle wall time"

[Screenshot from 2024-12-16 15-21-24: Spark UI showing the new shuffle metrics]

Native plan

ShuffleWriterExec: ..., metrics=[elapsed_compute=42.425493ms, ..., input_time=255.873165ms, write_time=462.517µs]
  ScanExec: source=[ShuffleWriterInput], metrics=[elapsed_compute=254.994282ms, ...]

How are these changes tested?

@andygrove andygrove marked this pull request as ready for review December 16, 2024 22:29

### CometScanExec

`CometScanExec` uses nanoseconds for total scan time. Spark also measures scan time in nanoseconds but converts to milliseconds.
Contributor

This sounds like a problem statement. Did you mean that spark.comet.metrics.detailed=true will not lose the precision?

Contributor

afaik, the conversion happens only when the data is to be displayed in the UI. (https://github.com/apache/spark/blob/576caec1da85c4451fe63e2a5923f2dbf136e278/sql/core/src/main/scala/org/apache/spark/sql/execution/metric/SQLMetrics.scala#L248)
But this is what Spark does with all its nanosecond timing metrics, so we aren't doing anything different here.

Member Author

Thanks, I have updated this.

Member Author
@andygrove andygrove Dec 17, 2024

Spark converts nanos to millis on each batch:

        override def hasNext: Boolean = {
          // The `FileScanRDD` returns an iterator which scans the file during the `hasNext` call.
          val startNs = System.nanoTime()
          val res = batches.hasNext
          scanTime += NANOSECONDS.toMillis(System.nanoTime() - startNs)
          res
        }

We just use the nano time:

        override def hasNext: Boolean = {
          // The `FileScanRDD` returns an iterator which scans the file during the `hasNext` call.
          val startNs = System.nanoTime()
          val res = batches.hasNext
          scanTime += System.nanoTime() - startNs
          res
        }

It actually makes a large difference to the time reported in some cases.
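
To make the size of that difference concrete, here is a small hypothetical calculation (the batch count and per-batch time are made up for illustration):

    import java.util.concurrent.TimeUnit.NANOSECONDS

    // 10,000 batches, each spending 90µs in hasNext.
    val perBatchNs = 90000L
    val batches = 10000

    // Spark-style accumulation: truncate to millis on every batch.
    val sparkStyle = (1 to batches).map(_ => NANOSECONDS.toMillis(perBatchNs)).sum // 0 ms

    // Comet-style accumulation: keep nanos, convert once for display.
    val cometStyle = NANOSECONDS.toMillis(perBatchNs * batches) // 900 ms

With per-batch truncation the reported scan time collapses to zero, while nanosecond accumulation reports the full 900 ms.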

Member Author

I have updated the description in the metrics guide to explain this in more detail.

Contributor

I see. That could be a big difference in small datasets (I wonder if the same occurs when we have large files). Either way, better not to lose precision. We are not likely to run into overflow issues, are we?
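
On the overflow question, a back-of-the-envelope check (assuming the metric is a signed 64-bit counter, as Spark's SQLMetric value is):

    import java.util.concurrent.TimeUnit.NANOSECONDS

    // Long.MaxValue nanoseconds is roughly 106,751 days (~292 years),
    // so accumulating scan time in nanoseconds cannot realistically overflow a Long.
    val maxDays = NANOSECONDS.toDays(Long.MaxValue) // 106751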

@andygrove
Member Author

Also added ipc_time.

[Screenshot from 2024-12-16 16-58-39: Spark UI showing the ipc_time metric]

@andygrove
Member Author

Seeing some segmentation faults in CI:

# Problematic frame:
# C  [libcomet-9228012970092653184.so+0x274908f]  core::sync::atomic::AtomicUsize::fetch_add::h50346ebf04f5a2ef+0x4f

@andygrove
Member Author

I may have found a bug in DataFusion:

# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007fb91a124f46, pid=762, tid=833
#
# JRE version: OpenJDK Runtime Environment Zulu11.76+21-CA (11.0.25+9) (build 11.0.25+9-LTS)
# Java VM: OpenJDK 64-Bit Server VM Zulu11.76+21-CA (11.0.25+9-LTS, mixed mode, tiered, compressed oops, g1 gc, linux-amd64)
# Problematic frame:
# C  [libcomet-1968252056067325991.so+0x1524f46]  datafusion_physical_plan::metrics::value::ScopedTimerGuard::stop::ha579d5e7b5f1a919+0x46

@andygrove andygrove marked this pull request as draft December 17, 2024 16:18
@andygrove
Member Author

I created a new simpler PR to replace this one: #1175
