doc: Add initial doc how to expand Comet exceptions #170

Merged: 3 commits, Mar 6, 2024
56 changes: 56 additions & 0 deletions DEBUGGING.md

@@ -30,6 +30,62 @@ LLDB or the LLDB that is bundled with XCode. We will use the LLDB packaged with

_Caveat: The steps here have only been tested with JDK 11 on Mac (M1)_

## Expand Comet exception details
Contributor:

It seems a bit weird to put this between the first paragraph and the Debugging for Advanced Developers section, which both talk about IDEs.

How about putting this part after Additional Info?

Otherwise, this doc LGTM.

By default, Comet outputs exception details specific to Comet. You can extend these details by leveraging DataFusion [backtraces](https://arrow.apache.org/datafusion/user-guide/example-usage.html#enable-backtraces)

```scala
scala> spark.sql("my_failing_query").show(false)

24/03/05 17:00:07 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.comet.CometNativeException: Internal error: MIN/MAX is not expected to receive scalars of incompatible types (Date32("NULL"), Int32(15901)).
This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker
at org.apache.comet.Native.executePlan(Native Method)
at org.apache.comet.CometExecIterator.executeNative(CometExecIterator.scala:65)
at org.apache.comet.CometExecIterator.getNextBatch(CometExecIterator.scala:111)
at org.apache.comet.CometExecIterator.hasNext(CometExecIterator.scala:126)

```
To do this, enable the `backtrace` feature for DataFusion in https://github.com/apache/arrow-datafusion-comet/blob/main/core/Cargo.toml:

```toml
datafusion-common = { version = "36.0.0", features = ["backtrace"] }
```

Contributor: Seems like we should update the Cargo.toml to include the `backtrace` feature by default for local/debug builds?

`datafusion = { default-features = false, version = "36.0.0", features = ["unicode_expressions", "backtrace"] }`

Then build Comet as [described](https://github.com/apache/arrow-datafusion-comet/blob/main/README.md#getting-started)

Start Comet with `RUST_BACKTRACE=1`

```commandline
RUST_BACKTRACE=1 $SPARK_HOME/spark-shell --jars spark/target/comet-spark-spark3.4_2.12-0.1.0-SNAPSHOT.jar --conf spark.sql.extensions=org.apache.comet.CometSparkSessionExtensions --conf spark.comet.enabled=true --conf spark.comet.exec.enabled=true --conf spark.comet.exec.all.enabled=true
```

Member:

Just curious, if we don't enable RUST_BACKTRACE=1, is there any performance penalty from enabling backtrace?

Contributor:

Same here. If there's no penalty without enabling `RUST_BACKTRACE=1`, I think we can build with backtrace enabled and disable it at runtime.

Contributor (author):

> Just curious, if we don't enable RUST_BACKTRACE=1, is there any performance penalty from enabling backtrace?

We had some discussion on this back in the day: apache/datafusion#7522 (comment)

The problem was with the analyzer, and in Comet we use the physical plan. However, it might still have some overhead IMHO, as it has to call the Backtrace::enabled function, which will definitely take some CPU cycles.
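For context on the overhead question, here is a minimal, standalone Rust sketch (using only the standard library and a hypothetical helper, not DataFusion's actual error path) of the behaviour being discussed: `std::backtrace::Backtrace::capture()` performs only a cheap, cached environment-variable check when `RUST_BACKTRACE`/`RUST_LIB_BACKTRACE` are unset, and the expensive stack walk happens only when one of them enables backtraces.

```rust
use std::backtrace::{Backtrace, BacktraceStatus};

// Hypothetical helper (not part of Comet or DataFusion) showing how an error
// message can optionally carry a backtrace.
fn error_with_optional_backtrace(msg: &str) -> String {
    // With RUST_BACKTRACE / RUST_LIB_BACKTRACE unset, capture() only performs a
    // cached environment check and returns a disabled backtrace; the costly
    // stack walk happens only when backtraces are enabled.
    let bt = Backtrace::capture();
    match bt.status() {
        BacktraceStatus::Captured => format!("{msg}\n\nbacktrace: {bt}"),
        _ => msg.to_string(),
    }
}

fn main() {
    // Run with and without RUST_BACKTRACE=1 to compare the output.
    println!("{}", error_with_optional_backtrace("Internal error: example"));
}
```

Assuming DataFusion's `backtrace` feature follows this pattern, building with the feature enabled but leaving `RUST_BACKTRACE` unset should cost roughly one such check per constructed error.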

Contributor:

A little off topic: for distributed environments, I believe `--conf spark.executorEnv.RUST_BACKTRACE=1` should be passed to make sure the executors' JVMs are started with backtrace enabled.

Contributor (author):

That is a good idea if we decide to keep backtrace enabled always.

Get the expanded exception details
```scala
scala> spark.sql("my_failing_query").show(false)
24/03/05 17:00:49 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.comet.CometNativeException: Internal error: MIN/MAX is not expected to receive scalars of incompatible types (Date32("NULL"), Int32(15901))

backtrace: 0: std::backtrace::Backtrace::create
1: datafusion_physical_expr::aggregate::min_max::min
2: <datafusion_physical_expr::aggregate::min_max::MinAccumulator as datafusion_expr::accumulator::Accumulator>::update_batch
3: <futures_util::stream::stream::fuse::Fuse<S> as futures_core::stream::Stream>::poll_next
4: comet::execution::jni_api::Java_org_apache_comet_Native_executePlan::{{closure}}
5: _Java_org_apache_comet_Native_executePlan
.
This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker
at org.apache.comet.Native.executePlan(Native Method)
at org.apache.comet.CometExecIterator.executeNative(CometExecIterator.scala:65)
at org.apache.comet.CometExecIterator.getNextBatch(CometExecIterator.scala:111)
at org.apache.comet.CometExecIterator.hasNext(CometExecIterator.scala:126)

...
```
Note:
- Backtrace coverage in DataFusion is still improving, so your error may not be covered yet; feel free to file a [ticket](https://github.com/apache/arrow-datafusion/issues)
- Backtraces don't come for free and are therefore intended for debugging purposes only
Contributor:

Curious to know, what percentage of overhead are we talking about? Like 1%, 5%, or something like 10%?


## Debugging for Advanced Developers

Add a `.lldbinit` to comet/core. This is not strictly necessary but will be useful if you want to