[FEATURE] Impl Spark DSv2 YARN Connector that supports reading YARN aggregation logs #6832

pan3793 · 2024-12-02T18:06:08Z

Code of Conduct

I agree to follow this project's Code of Conduct

Search before asking

I have searched in the issues and found no similar issues.

Describe the feature

Leverage the Spark DSv2 API to implement a connector that provides a SQL interface to YARN aggregated logs, and possibly other YARN resources in the future.

Motivation

In large-scale Spark on YARN deployments, tens or even hundreds of thousands of Spark applications may be submitted to a cluster per day. YARN collects and aggregates the application logs and stores them on HDFS. Sometimes we want to analyze these logs to identify cluster-level issues; for example, a machine with hardware problems may frequently produce disk or network exceptions. Using Spark to analyze those logs in parallel is a straightforward approach.

Describe the solution

The usage might look like this:

$ spark-sql --conf spark.sql.catalog.yarn=org.apache.kyuubi.spark.connector.yarn.YarnCatalog
> SELECT
    app_id, app_attempt_id,
    app_start_time, app_end_time,
    container_id, host,
    file_name, line_num, message
  FROM yarn.agg_logs
  WHERE app_id = 'application_1234'
    AND container_id='container_12345'
    AND host = 'hadoop123.example.com'
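To make the proposed columns concrete, here is a minimal Python sketch of one row of yarn.agg_logs and the kind of per-host aggregation the motivation describes. The column names are taken from the query above; the column types, the `AggLogRow` class, and the `exceptions_per_host` helper are hypothetical and only illustrate the idea, they are not part of the connector.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class AggLogRow:
    """One log line, using the column names from the proposed query.

    The types below are assumptions; the issue does not specify them.
    """
    app_id: str
    app_attempt_id: str
    app_start_time: int  # assumed to be epoch milliseconds
    app_end_time: int    # assumed to be epoch milliseconds
    container_id: str
    host: str
    file_name: str
    line_num: int
    message: str


def exceptions_per_host(rows: Iterable[AggLogRow],
                        keyword: str = "Exception") -> Counter:
    """Count log lines whose message contains `keyword`, grouped by host."""
    return Counter(row.host for row in rows if keyword in row.message)
```

Calling `exceptions_per_host(rows).most_common(10)` would list the ten hosts with the most matching lines, which is roughly what a GROUP BY host query over the proposed table would return.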

Additional context

No response

Are you willing to submit PR?

Yes. I would be willing to submit a PR with guidance from the Kyuubi community to improve.
No. I cannot submit a PR at this time.


naive-zhang · 2024-12-02T18:40:47Z

@pan3793 I'd like to try to implement this, please assign it to me, thx~

pan3793 · 2024-12-03T03:14:04Z

@naive-zhang Thank you for your interest in this ticket. If you are not familiar with the Spark DSv2 API, you can refer to the tpch connector. The first milestone might be implementing

SELECT * FROM yarn.agg_logs WHERE appId='<appId>'

so that it produces content similar to yarn logs -applicationId <appId>. After that, explore how to list YARN apps and filter them by other predicates.
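As a rough illustration of how the text printed by yarn logs -applicationId <appId> could map onto rows, here is a minimal Python sketch. It assumes the CLI output uses "Container: <id> on <node>", "LogType:<name>", "LogContents:" (or "Log Contents:") and "End of LogType:<name>" marker lines; that layout can differ between Hadoop versions, so verify it against your cluster. The `LogLine` tuple and `parse_yarn_logs` function are hypothetical names, and a real connector would more likely read the aggregated files on HDFS than parse CLI text.

```python
import re
from typing import Iterator, NamedTuple


class LogLine(NamedTuple):
    container_id: str
    host: str        # the raw node string after " on "; may include a port suffix
    file_name: str   # the log type, e.g. stderr or stdout
    line_num: int    # 1-based line number within the log file
    message: str


# Assumed marker lines in the CLI text output; not guaranteed across Hadoop versions.
_CONTAINER = re.compile(r"^Container: (\S+) on (\S+)$")
_LOG_TYPE = re.compile(r"^LogType:\s*(.+)$")
_CONTENTS = re.compile(r"^Log ?Contents:\s*$")
_END = re.compile(r"^End of LogType:")


def parse_yarn_logs(text: str) -> Iterator[LogLine]:
    """Yield one LogLine per content line found between the assumed markers."""
    container_id = host = file_name = None
    in_contents = False
    line_num = 0
    for raw in text.splitlines():
        if in_contents:
            # Caveat: a log line that itself starts with "End of LogType:"
            # would end the section early in this simple sketch.
            if _END.match(raw):
                in_contents = False
                continue
            line_num += 1
            yield LogLine(container_id, host, file_name, line_num, raw)
            continue
        match = _CONTAINER.match(raw)
        if match:
            container_id, host = match.group(1), match.group(2)
            continue
        match = _LOG_TYPE.match(raw)
        if match:
            file_name = match.group(1)
            continue
        if _CONTENTS.match(raw):
            in_contents = True
            line_num = 0  # restart numbering for each log file
```

The generator keeps only the current container and file in memory, so it can stream through large outputs line by line; blank lines inside a log section are emitted as rows with an empty message.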

pan3793 added kind:feature Feature request priority:major help wanted labels Dec 2, 2024

turboFei assigned naive-zhang Dec 2, 2024
