[FEATURE] Impl Spark DSv2 YARN Connector that supports reading YARN aggregation logs #6832

Open
2 of 4 tasks
pan3793 opened this issue Dec 2, 2024 · 2 comments
@pan3793
Member

pan3793 commented Dec 2, 2024

Code of Conduct

Search before asking

  • I have searched in the issues and found no similar issues.

Describe the feature

Leverage the Spark DSv2 API to implement a connector that provides a SQL interface to the YARN aggregated logs, and possibly to other YARN resources in the future.

Motivation

In large-scale Spark on YARN deployments, tens or even hundreds of thousands of Spark applications are submitted to a cluster per day, and YARN collects and aggregates the app logs and stores them on HDFS. Sometimes we want to analyze these logs to identify cluster-level issues; for example, a machine with hardware problems may frequently produce disk or network exceptions. It's straightforward to leverage Spark to analyze those logs in parallel.

Describe the solution

The usage might look like this:

$ spark-sql --conf spark.sql.catalog.yarn=org.apache.kyuubi.spark.connector.yarn.YarnCatalog
> SELECT
    app_id, app_attempt_id,
    app_start_time, app_end_time,
    container_id, host,
    file_name, line_num, message
  FROM yarn.agg_logs
  WHERE app_id = 'application_1234'
    AND container_id = 'container_12345'
    AND host = 'hadoop123.example.com'
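As a side note on how the ID columns relate to each other: a YARN container ID normally embeds the application ID and the attempt number, so app_id and app_attempt_id could in principle be derived from a container_id predicate. The sketch below is plain Python for illustration only, not connector code. The names `split_container_id` and `ContainerIds` and the sample ID are hypothetical, and the assumed layout `container_[e<epoch>_]<clusterTimestamp>_<appSeq>_<attemptSeq>_<containerSeq>` should be checked against the Hadoop version in use.

```python
import re
from typing import NamedTuple, Optional

# ASSUMPTION: container IDs follow the commonly seen layout
#   container_[e<epoch>_]<clusterTimestamp>_<appSeq>_<attemptSeq>_<containerSeq>
_CONTAINER_ID = re.compile(
    r"container_(?:e(?P<epoch>\d+)_)?"
    r"(?P<ts>\d+)_(?P<app>\d+)_(?P<attempt>\d+)_(?P<container>\d+)"
)


class ContainerIds(NamedTuple):
    """Hypothetical holder for the three ID columns of yarn.agg_logs."""
    app_id: str
    app_attempt_id: str
    container_id: str


def split_container_id(container_id: str) -> Optional[ContainerIds]:
    """Derive app_id and app_attempt_id from a container ID.

    Returns None when the input does not match the assumed layout.
    """
    m = _CONTAINER_ID.fullmatch(container_id)
    if m is None:
        return None
    ts = m.group("ts")
    app_seq = int(m.group("app"))
    attempt_seq = int(m.group("attempt"))
    # Application IDs zero-pad the sequence to 4 digits,
    # attempt IDs zero-pad the attempt number to 6 digits.
    app_id = f"application_{ts}_{app_seq:04d}"
    app_attempt_id = f"appattempt_{ts}_{app_seq:04d}_{attempt_seq:06d}"
    return ContainerIds(app_id, app_attempt_id, container_id)


if __name__ == "__main__":
    # Made-up ID, only to show the call shape.
    print(split_container_id("container_1733100000000_0042_01_000003"))
```

If this relationship holds on the target cluster, a filter on container_id alone would already identify the application whose aggregated log files need to be read.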

Additional context

No response

Are you willing to submit PR?

  • Yes. I would be willing to submit a PR with guidance from the Kyuubi community to improve.
  • No. I cannot submit a PR at this time.
@naive-zhang
Contributor

@pan3793 I'd like to try to implement this, please assign it to me, thx~

@pan3793
Member Author

pan3793 commented Dec 3, 2024

@naive-zhang Thank you for your interest in this ticket. If you are not familiar with the Spark DSv2 API, you can refer to the tpch connector. The first milestone might be implementing

SELECT * FROM yarn.agg_logs WHERE appId='<appId>'

which produces content similar to yarn logs -applicationId <appId>. After that, we can explore how to list YARN apps and filter them by other predicates.
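To make the target row shape of that first milestone a bit more concrete, here is a rough sketch in plain Python (not connector code) that turns text resembling yarn logs output into rows with the container_id, host, file_name, line_num and message columns. The names `LogRow` and `parse_yarn_logs` are hypothetical. The marker lines it looks for (`Container: <id> on <node>`, `LogType:`, `LogContents:`, `End of LogType:`) are an assumption based on the output of recent Hadoop releases and may differ between versions; an actual connector would more likely read the aggregated files on HDFS directly.

```python
from typing import Iterable, Iterator, NamedTuple


class LogRow(NamedTuple):
    """Hypothetical row shape, a subset of the proposed yarn.agg_logs columns."""
    container_id: str
    host: str        # node string exactly as printed after " on "
    file_name: str   # value of the LogType header, e.g. stderr
    line_num: int    # 1-based line number within the log file
    message: str


def parse_yarn_logs(lines: Iterable[str]) -> Iterator[LogRow]:
    """Yield one LogRow per log line.

    ASSUMPTION: the input uses these marker lines:
      Container: <container_id> on <node>
      LogType:<file_name>
      LogContents:
      ... log lines ...
      End of LogType:<file_name>
    Any other line outside a LogContents section is ignored.
    """
    container_id = host = file_name = ""
    in_contents = False
    line_num = 0
    for raw in lines:
        line = raw.rstrip("\r\n")
        if in_contents:
            if line.startswith("End of LogType:" + file_name):
                in_contents = False  # current log file is finished
                continue
            line_num += 1
            yield LogRow(container_id, host, file_name, line_num, line)
        elif line.startswith("Container: ") and " on " in line:
            container_id, host = line[len("Container: "):].split(" on ", 1)
        elif line.startswith("LogType:"):
            file_name = line[len("LogType:"):].strip()
        elif line.startswith("LogContents:"):
            in_contents = True
            line_num = 0  # numbering restarts for every log file
```

One known limitation of this sketch: a log line that itself starts with the end marker text would be mistaken for the end of the file, which is one more reason to treat it as an illustration of the row shape rather than a parser to rely on.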
