
[comet-parquet-exec] CometNativeScan metrics from ParquetFileMetrics and FileStreamMetrics #1172

Open · wants to merge 4 commits into comet-parquet-exec

Conversation

mbutrovich
Contributor

Still confirming whether there's a unit mismatch between the Spark elapsed time and the native elapsed time. Once I confirm that, I'll mark this as ready for review.

@mbutrovich
Contributor Author

[Screenshot: Spark UI SQL metrics for CometNativeScan, 2024-12-16]

As best I can tell, it is recording more metrics than that, but the UI cuts the list off.

@mbutrovich mbutrovich marked this pull request as ready for review December 17, 2024 00:19

override lazy val metrics: Map[String, SQLMetric] = {
  CometMetricNode.baselineMetrics(sparkContext) ++
    Map(
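The hunk above is cut off by the page. Below is a minimal, illustrative sketch of how the full definition might look, assuming Spark's SQLMetrics helpers and metric keys borrowed from the DataFusion explain output quoted later in this thread; the exact keys and display strings are assumptions, not the PR's actual code.

import org.apache.spark.sql.execution.metric.{SQLMetric, SQLMetrics}

// Sketch only: keys mirror the DataFusion ParquetFileMetrics/FileStreamMetrics
// names seen in the explain output below; the PR may register different ones.
override lazy val metrics: Map[String, SQLMetric] = {
  CometMetricNode.baselineMetrics(sparkContext) ++
    Map(
      "time_elapsed_opening" ->
        SQLMetrics.createNanoTimingMetric(sparkContext, "time opening files"),
      "time_elapsed_scanning_total" ->
        SQLMetrics.createNanoTimingMetric(sparkContext, "total scan time"),
      "metadata_load_time" ->
        SQLMetrics.createNanoTimingMetric(sparkContext, "time reading footers"),
      "bytes_scanned" ->
        SQLMetrics.createSizeMetric(sparkContext, "bytes scanned"),
      "row_groups_pruned_statistics" ->
        SQLMetrics.createMetric(sparkContext, "row groups pruned by statistics"))
}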
Contributor

Should we have some way of distinguishing between these metrics and those from the current native scan? Perhaps the display string can have a short prefix?

Contributor Author

Added a Native prefix. We may want to do this for all operators.
[Screenshot: Spark UI metrics showing the Native prefix, 2024-12-17]
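For reference, a prefixed display string might be produced like the sketch below; the prefix only affects the name shown in the Spark UI, not the map key used to wire the metric to the native side. The exact wording is an assumption, not the PR's actual strings.

import org.apache.spark.SparkContext
import org.apache.spark.sql.execution.metric.{SQLMetric, SQLMetrics}

// Hypothetical example: the "Native" prefix changes only the display string
// shown in the Spark UI; the keys stay aligned with the native metric names.
def prefixedScanMetrics(sc: SparkContext): Map[String, SQLMetric] = Map(
  "metadata_load_time" ->
    SQLMetrics.createNanoTimingMetric(sc, "Native metadata load time"),
  "time_elapsed_opening" ->
    SQLMetrics.createNanoTimingMetric(sc, "Native time opening files"))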

Contributor

+1. We need not do this for all operators. This is just so we can distinguish between metrics reported by different scan implementations.

@parthchandra
Contributor

Is the time spent reading the footer actually zero?

@mbutrovich
Contributor Author

If I do an explain on the native side, I see sub-millisecond values, which I'm not sure the Spark UI shows by default:

metrics=[output_rows=1, elapsed_compute=1ns, bytes_scanned=4744, file_open_errors=0, file_scan_errors=0, num_predicate_creation_errors=0, page_index_rows_matched=504, page_index_rows_pruned=1496, predicate_evaluation_errors=0, pushdown_rows_matched=505, pushdown_rows_pruned=503, row_groups_matched_bloom_filter=0, row_groups_matched_statistics=1, row_groups_pruned_bloom_filter=0, row_groups_pruned_statistics=0, bloom_filter_eval_time=49.043µs, metadata_load_time=506.667µs, page_index_eval_time=74.751µs, row_pushdown_eval_time=293.255µs, statistics_eval_time=116.584µs, time_elapsed_opening=835.042µs, time_elapsed_processing=2.705084ms, time_elapsed_scanning_total=2.349792ms, time_elapsed_scanning_until_data=2.247333ms]

The elapsed_compute number looks bogus though.
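The zero shown for footer time is consistent with a display-granularity effect rather than a missing value: if these are registered as nanosecond timing metrics on the Scala side, the Spark UI renders them in milliseconds, so anything under 1 ms shows as 0 ms. A minimal sketch, assuming Spark's SQLMetrics API (the metric name and helper function are illustrative):

import org.apache.spark.SparkContext
import org.apache.spark.sql.execution.metric.{SQLMetric, SQLMetrics}

// Hypothetical illustration: the accumulator holds ~506 µs in nanoseconds,
// but nanosecond timing metrics are formatted in milliseconds in the UI,
// so this displays as "0 ms" even though the underlying value is non-zero.
def recordFooterReadTime(sc: SparkContext): SQLMetric = {
  val footerReadTime = SQLMetrics.createNanoTimingMetric(sc, "time reading footers")
  footerReadTime.add(506667L) // matches metadata_load_time=506.667µs above
  footerReadTime
}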
