
[SPARK-50310][PYTHON] Apply a flag to disable DataFrameQueryContext for PySpark #48964

Open
Wants to merge 1 commit into base: master
Conversation

itholic
Contributor

@itholic itholic commented Nov 26, 2024

What changes were proposed in this pull request?

We introduced a flag to disable the DataFrameQueryContext in #48827, and we need to apply the same flag to PySpark for the same performance reason.

Why are the changes needed?

To avoid the performance slowdown in cases where DataFrameQueryContext objects are stacked up too deeply, since each one collects a stack trace.
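The overhead comes from the stack trace collection noted in the conf's doc string. A minimal, self-contained sketch of that cost (using the standard-library `traceback` module as a stand-in; PySpark's actual capture mechanism may differ):

```python
# Rough illustration of the per-operation cost of capturing a stack
# trace, which is the overhead DataFrameQueryContext introduces
# (assumption: the internals are comparable to traceback.extract_stack).
import time
import traceback

start = time.time()
for _ in range(10_000):
    traceback.extract_stack()  # capture the current call stack
elapsed_with_capture = time.time() - start

start = time.time()
for _ in range(10_000):
    pass  # same loop, no stack capture
elapsed_without = time.time() - start

print(f"with capture:    {elapsed_with_capture:.3f}s")
print(f"without capture: {elapsed_without:.3f}s")
```

Even this toy loop shows the capture-per-operation pattern dominating the no-capture baseline, which is why a kill switch matters on hot paths.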

Does this PR introduce any user-facing change?

No API changes, but the DataFrameQueryContext will no longer be displayed when the flag is disabled.

How was this patch tested?

Manually tested:

1. FLAG ON (almost 25 sec)
>>> spark.conf.get("spark.sql.dataFrameQueryContext.enabled")
'true'
>>> import time
>>> import pyspark.sql.functions as F
>>>
>>> c = F.col("name")
>>> start = time.time()
>>> for i in range(10000):
...   _ = c.alias("a")
...
>>> print(time.time() - start)
24.78217577934265
2. FLAG OFF (about 1 sec)
>>> spark.conf.set("spark.sql.dataFrameQueryContext.enabled", "false")
>>> spark.conf.get("spark.sql.dataFrameQueryContext.enabled")
'false'
>>> import time
>>> import pyspark.sql.functions as F
>>>
>>> c = F.col("name")
>>> start = time.time()
>>> for i in range(10000):
...   _ = c.alias("a")
...
>>> print(time.time() - start)
1.0222370624542236

Was this patch authored or co-authored using generative AI tooling?

@itholic itholic changed the title [SPARK-50310][PYTHON] Add a flag to disable DataFrameQueryContext for PySpark [SPARK-50310][PYTHON] Apply a flag to disable DataFrameQueryContext for PySpark Nov 26, 2024
if spark is not None:
    _enable_debugging_cache = (
        spark.conf.get(
            "spark.sql.dataFrameQueryContext.enabled", "true"  # type: ignore[union-attr]
Member
spark.sql.dataFrameQueryContext.enabled has to be StaticSQLConf

Contributor Author

I see. Should we update the existing conf to a static conf, or should we add a new PySpark-specific one?

The existing conf is currently defined as a runtime SQLConf:

val DATA_FRAME_QUERY_CONTEXT_ENABLED = buildConf("spark.sql.dataFrameQueryContext.enabled")
  .internal()
  .doc(
    "Enable the DataFrame query context. This feature is enabled by default, but has a " +
    "non-trivial performance overhead because of the stack trace collection.")
  .version("4.0.0")
  .booleanConf
  .createWithDefault(true)

Member

I think so

@xinrong-meng
Member

A comment on the PR description: #48827 doesn't seem to disable DataFrameQueryContext, if I understand correctly :)

@itholic
Contributor Author

itholic commented Nov 27, 2024

A comment on the PR description: #48827 doesn't seem to disable DataFrameQueryContext, if I understand correctly :)

Yes, it doesn't disable DataFrameQueryContext outright; it just gives PySpark users an option to disable it. The default behavior still collects the stack traces. Please let me know if I'm missing the context.
