I have searched in the issues and found no similar issues.
Describe the feature
Leverage the Spark DSv2 API to implement a connector that provides a SQL interface to access the YARN agg logs, and maybe other YARN resources in the future.
Motivation
In large-scale Spark on YARN deployments, tens or even hundreds of thousands of Spark applications may be submitted to a cluster per day, and the application logs are collected and aggregated by YARN and stored on HDFS. Sometimes we want to analyze those logs to identify cluster-level issues; for example, a machine with a hardware problem may frequently produce disk or network exceptions. It is straightforward to leverage Spark to analyze those logs in parallel.
Describe the solution
The usage might look like:
$ spark-sql --conf spark.sql.catalog.yarn=org.apache.kyuubi.spark.connector.yarn.YarnCatalog
> SELECT
app_id, app_attempt_id,
app_start_time, app_end_time,
container_id, host,
file_name, line_num, message
FROM yarn.agg_logs
WHERE app_id = 'application_1234'
AND container_id = 'container_12345'
AND host = 'hadoop123.example.com'
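One way the connector could avoid scanning the whole log root is to translate the `app_id` predicate into the aggregated-log directory layout on HDFS. Below is a minimal sketch assuming the classic `${yarn.nodemanager.remote-app-log-dir}/<user>/<suffix>/<appId>` layout; the exact layout varies across Hadoop versions, and the class and method names here are hypothetical, not part of any existing API.

```java
// Hypothetical helper: map an app_id predicate to the HDFS directory that
// YARN log aggregation writes for that application, so the scan only needs
// to list and read files under that directory instead of the whole log root.
public class AggLogPathResolver {
    private final String remoteLogRoot; // yarn.nodemanager.remote-app-log-dir
    private final String suffix;        // yarn.nodemanager.remote-app-log-dir-suffix

    public AggLogPathResolver(String remoteLogRoot, String suffix) {
        this.remoteLogRoot = remoteLogRoot;
        this.suffix = suffix;
    }

    /** Directory containing one aggregated log file per NodeManager. */
    public String appLogDir(String user, String appId) {
        return remoteLogRoot + "/" + user + "/" + suffix + "/" + appId;
    }

    public static void main(String[] args) {
        AggLogPathResolver resolver = new AggLogPathResolver("/tmp/logs", "logs");
        System.out.println(resolver.appLogDir("alice", "application_1234"));
    }
}
```

With this kind of mapping, the connector's `ScanBuilder` could push the `app_id` (and possibly `container_id`) filters down to directory pruning, so only the relevant per-node log files are opened.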
Additional context
No response
Are you willing to submit PR?
Yes. I would be willing to submit a PR, with guidance from the Kyuubi community.
@naive-zhang Thank you for your interest in this ticket. You can refer to the TPC-H connector if you are not familiar with the Spark DSv2 API. The first milestone might be implementing
SELECT * FROM yarn.agg_logs WHERE app_id = '<appId>'
to produce content similar to yarn logs -applicationId <appId>, and then exploring how to list YARN apps and filter them by other predicates.
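For that first milestone, the table's rows could be derived by parsing the same textual dump that `yarn logs -applicationId <appId>` prints. A rough sketch is below; the `Container:` / `LogType:` framing is an assumption based on the CLI output of recent Hadoop versions (a real implementation would read the aggregated log file format directly rather than shelling out), and all names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: turn a `yarn logs`-style text dump into rows matching the
// proposed (container_id, file_name, line_num, message) columns.
public class AggLogParser {
    public record Row(String containerId, String fileName, long lineNum, String message) {}

    public static List<Row> parse(String dump) {
        List<Row> rows = new ArrayList<>();
        String container = null;
        String file = null;
        long lineNum = 0;
        boolean inBody = false;
        for (String line : dump.split("\n", -1)) {
            if (line.startsWith("Container: ")) {
                // e.g. "Container: container_12345_01_000001 on host.example.com_8041"
                container = line.substring("Container: ".length()).split(" ")[0];
                inBody = false;
            } else if (line.startsWith("LogType:")) {
                file = line.substring("LogType:".length()).trim();
                lineNum = 0;
                inBody = false;
            } else if (line.startsWith("Log Contents:")) {
                inBody = true;
            } else if (line.startsWith("End of LogType:")) {
                inBody = false;
            } else if (inBody && container != null && file != null) {
                rows.add(new Row(container, file, ++lineNum, line));
            }
        }
        return rows;
    }
}
```

Each per-line row carries its container id and log file name, which is what makes the `container_id` and `file_name` predicates in the proposed query work naturally.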