spark-dashboard V2
LucaCanali committed Mar 27, 2024
1 parent 4260d31 commit 43f7b22
Showing 35 changed files with 15,826 additions and 37 deletions.
168 changes: 131 additions & 37 deletions README.md
# Spark-Dashboard
![Docker Pulls](https://img.shields.io/docker/pulls/lucacanali/spark-dashboard)

Spark-dashboard is a solution for monitoring Apache Spark jobs.
It provides the tooling and configuration to deploy an Apache Spark performance dashboard using container technology.

**Why:** Troubleshooting Spark jobs and understanding how system resources are used by Spark executors can be complicated.
This type of data is precious for visualizing and understanding the root causes of performance issues.
Using the Spark Dashboard you can collect and visualize many of the key metrics exposed by the Spark metrics system
as time series, which helps with monitoring and troubleshooting Spark applications.

### Key Features
- You can find here all the components to deploy a monitoring application for Apache Spark
- Spark-dashboard collects metrics from Spark and visualizes them in Grafana
- This tool is intended for performance troubleshooting and DevOps monitoring of Spark workloads
- Use it with Spark 2.4 and higher (3.x)
- The provided container images are built for the Linux platform

### Contents
- [Architecture](#architecture)
- [How To Deploy the Spark Dashboard](#how-to-deploy-the-spark-dashboard)
- [How to run the Spark Dashboard V2 on a Docker container](#how-to-run-the-spark-dashboard-v2-on-a-docker-container)
- [Examples and testing the dashboard](#examples-and-testing-the-dashboard)
- [Old implementation (v1)](#old-implementation-v1)
- [How to run the Spark dashboard V1 on a Docker container](#how-to-run-the-spark-dashboard-v1-on-a-docker-container)
- [How to run the dashboard V1 on Kubernetes using Helm](#how-to-run-the-dashboard-v1-on-kubernetes-using-helm)
- [Advanced configurations and notes](#advanced-configurations-and-notes)

### Resources
- **[Short demo of the Spark dashboard](https://canali.web.cern.ch/docs/Spark_Dashboard_Demo.mp4)**
- [Blog entry on Spark Dashboard](https://db-blog.web.cern.ch/blog/luca-canali/2019-02-performance-dashboard-apache-spark)
- Talk on Spark performance at [Data+AI Summit 2021](https://databricks.com/session_na21/monitor-apache-spark-3-on-kubernetes-using-metrics-and-plugins), [slides](http://canali.web.cern.ch/docs/Monitor_Spark3_on_Kubernetes_DataAI2021_LucaCanali.pdf)
- Notes on [Spark Dashboard](https://github.com/LucaCanali/Miscellaneous/tree/master/Spark_Dashboard)

**Related work:**
- [sparkMeasure](https://github.com/LucaCanali/sparkMeasure): a tool for performance troubleshooting of Apache Spark workloads
- [TPCDS_PySpark](https://github.com/LucaCanali/Miscellaneous/tree/master/Performance_Testing/TPCDS_PySpark): a TPC-DS workload generator written in Python and designed to run at scale using Apache Spark

Main author and contact: [email protected]

---
### Architecture
The Spark Dashboard collects and displays Apache Spark workload metrics produced by
the [Spark metrics system](https://spark.apache.org/docs/latest/monitoring.html#metrics).
Spark metrics are exported via a Graphite endpoint, ingested by Telegraf, and stored in VictoriaMetrics.
Metrics are then queried from VictoriaMetrics and displayed using a set of pre-configured Grafana dashboards distributed with this repo.
Note that the provided installation instructions and code are intended as examples for testing and experimenting.
Hardening the installation will be necessary for production-quality use.

![Spark metrics dashboard architecture](https://raw.githubusercontent.com/LucaCanali/Miscellaneous/master/Spark_Dashboard/images/Spark_MetricsSystem_Grafana_Dashboard_V2.0.png "Spark metrics dashboard architecture")

This technical drawing outlines an integrated monitoring pipeline for Apache Spark using open-source components. The flow of the diagram illustrates the following components and their interactions:
- **Apache Spark's metrics:** This is the source of metrics data: [Spark metrics system](https://spark.apache.org/docs/latest/monitoring.html#metrics). Spark's executors and the driver emit metrics such
as executors' run time, CPU time, garbage collection (GC) time, memory usage, shuffle metrics, I/O metrics, and more.
Spark metrics are exported in Graphite format by Spark and then ingested by Telegraf.
- **Telegraf:** This component acts as the metrics collection agent (the sink in this context). It receives the
metrics emitted by Apache Spark's executors and driver, and it adds labels to the measurements to organize
the data effectively. Telegraf sends the measurements to VictoriaMetrics for storage and later querying
(see the configuration sketch at the end of this section).
- **VictoriaMetrics:** This is a time-series database that stores the labeled metrics data collected by Telegraf.
The use of a time-series database is appropriate for storing and querying the type of data emitted by
monitoring systems, which is often timestamped and sequential.
- **Grafana:** Finally, Grafana is used for visualization. It reads the metrics stored in VictoriaMetrics
using PromQL/MetricsQL, the Prometheus-style query languages for time-series data. Grafana provides
dashboards that present the data in the form of metrics and graphs, offering insights into the performance
and health of the Spark application.

Note: spark-dashboard v1 (the original implementation) uses InfluxDB as the time-series database, see also
[spark-dashboard v1 architecture](https://raw.githubusercontent.com/LucaCanali/Miscellaneous/master/Spark_Dashboard/images/Spark_metrics_dashboard_arch.PNG)
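
For illustration, a minimal Telegraf configuration matching this pipeline could look like the sketch below.
This is an assumption for clarity, not the `telegraf.conf` shipped with this repo; the listener address, the
graphite template, and the output URL are placeholders.
```
# Sketch only: ingest Spark's Graphite-protocol metrics and forward them to VictoriaMetrics

# Graphite listener on port 2003 (the port Spark's GraphiteSink writes to)
[[inputs.socket_listener]]
  service_address = "tcp://:2003"
  data_format = "graphite"
  # Example template turning graphite path components into labels (placeholder mapping)
  templates = [
    "username.applicationid.process.measurement*"
  ]

# VictoriaMetrics accepts the InfluxDB line-protocol write API on port 8428
[[outputs.influxdb]]
  urls = ["http://localhost:8428"]
  skip_database_creation = true
```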

---
## How To Deploy the Spark Dashboard

This is a quickstart guide to deploying the Spark Dashboard. Three different installation methods are described:
- **Recommended:** Dashboard v2 on a Docker container
- Dashboard v1 on a Docker container
- Dashboard v1 on Kubernetes via Helm

### How to run the Spark Dashboard V2 on a Docker container
If you choose to run the container image, these are the steps:

**1. Start the container**
The provided container image is configured to run Telegraf, VictoriaMetrics, and Grafana
- `docker run -p 3000:3000 -p 2003:2003 -d lucacanali/spark-dashboard`
- Note: port 2003 is for Graphite ingestion, port 3000 is for Grafana
- More options, including how to persist the metrics data across restarts, at: [Spark dashboard in a container](dockerfiles_v2)

**2. Spark configuration**
You need to configure Spark to send metrics to the desired Graphite endpoint and add the related configuration
(only part of the configuration block is shown here; a fuller sketch follows below):
```
bin/spark-shell (or spark-submit or pyspark)
...
--conf "spark.metrics.appStatusSource.enabled"=true
```
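
As a sketch only (the hostname is a placeholder and the period/unit values shown are typical rather than
authoritative; see the Spark metrics documentation and the notes linked above), a Graphite-sink configuration
looks like this:
```
bin/spark-shell \
  --conf "spark.metrics.conf.*.sink.graphite.class"="org.apache.spark.metrics.sink.GraphiteSink" \
  --conf "spark.metrics.conf.*.sink.graphite.host"=<dashboard_hostname> \
  --conf "spark.metrics.conf.*.sink.graphite.port"=2003 \
  --conf "spark.metrics.conf.*.sink.graphite.period"=10 \
  --conf "spark.metrics.conf.*.sink.graphite.unit"=seconds \
  --conf "spark.metrics.appStatusSource.enabled"=true
```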

Optional configuration if you want to display "Tree Process Memory Details":
```
--conf spark.executor.processTreeMetrics.enabled=true
```

**3. Visualize the metrics using a Grafana dashboard**
- Point your browser to `http://hostname:3000` (edit `hostname` as relevant)
- Credentials: use the default for the first login (user: admin, password: admin)
- Use the default dashboard bundled with the container (**Spark_Perf_Dashboard_v04_promQL**) and select the user name,
applicationId, and time range (the default is the last 5 minutes).
- You will need a running Spark application configured to use the dashboard to be able to select an application
and display the metrics.
- See also [TPCDS_PySpark](https://github.com/LucaCanali/Miscellaneous/tree/master/Performance_Testing/TPCDS_PySpark)
a TPC-DS workload generator written in Python and designed to run at scale using Apache Spark.

### Extended Spark dashboard
An extended Spark dashboard is available to collect and visualize OS and storage data.
This utilizes Spark Plugins to collect the extended metrics. The metrics are collected and stored in the
same VictoriaMetrics database as the Spark metrics.

- Configuration:
  - Add the following to the Spark configuration (a minimal invocation example follows this list):
  `--packages ch.cern.sparkmeasure:spark-plugins_2.12:0.3`
  `--conf spark.plugins=ch.cern.HDFSMetrics,ch.cern.CgroupMetrics,ch.cern.CloudFSMetrics`
- Use the extended dashboard
- Manually select the dashboard **Spark_Perf_Dashboard_v04_PromQL_with_SparkPlugins**
- The dashboard includes additional graphs for OS and storage metrics.
- Three new tabs are available:
- CGroup Metrics (use with Spark running on Kubernetes)
- Cloud Storage (use with S3A, GS, WASB, and cloud storage in general)
- HDFS Advanced Statistics (use with HDFS)
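
As a minimal sketch (package version as listed above; `spark.cernSparkPlugin.cloudFsName` is only needed when
monitoring cloud storage), the plugins can be wired into a Spark session like this:
```
bin/spark-shell \
  --packages ch.cern.sparkmeasure:spark-plugins_2.12:0.3 \
  --conf spark.plugins=ch.cern.HDFSMetrics,ch.cern.CgroupMetrics,ch.cern.CloudFSMetrics \
  --conf spark.cernSparkPlugin.cloudFsName=s3a
```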

### Examples and testing the dashboard:
- See some [examples of the graphs available in the dashboard at this link](https://github.com/LucaCanali/Miscellaneous/tree/master/Spark_Dashboard#example-graphs)

- You can use the [TPCDS_PySpark](https://github.com/LucaCanali/Miscellaneous/tree/master/Performance_Testing/TPCDS_PySpark)
package to generate a TPC-DS workload and test the dashboard.

- Example of running TPCDS on a YARN Spark cluster, monitored with the Spark dashboard:
```
TPCDS_PYSPARK=`which tpcds_pyspark_run.py`
spark-submit --master yarn --conf spark.log.level=error --conf spark.executor.cores=8 --conf spark.executor.memory=64g \
--conf spark.driver.memory=16g --conf spark.driver.extraClassPath=tpcds_pyspark/spark-measure_2.12-0.24.jar \
--conf spark.dynamicAllocation.enabled=false --conf spark.executor.instances=32 --conf spark.sql.shuffle.partitions=512 \
$TPCDS_PYSPARK -d hdfs://<PATH>/tpcds_10000_parquet_1.13.1
```

- Example of running TPCDS on a Kubernetes cluster with S3 storage, monitored with the extended dashboard using Spark plugins:
```
TPCDS_PYSPARK=`which tpcds_pyspark_run.py`
spark-submit --master k8s://https://xxx.xxx.xxx.xxx:6443 --conf spark.kubernetes.container.image=<URL>/spark:v3.5.1 --conf spark.kubernetes.namespace=xxx \
--conf spark.eventLog.enabled=false --conf spark.task.maxDirectResultSize=2000000000 --conf spark.shuffle.service.enabled=false --conf spark.executor.cores=8 --conf spark.executor.memory=32g --conf spark.driver.memory=4g \
--packages org.apache.hadoop:hadoop-aws:3.3.4,ch.cern.sparkmeasure:spark-measure_2.12:0.24,ch.cern.sparkmeasure:spark-plugins_2.12:0.3 --conf spark.plugins=ch.cern.HDFSMetrics,ch.cern.CgroupMetrics,ch.cern.CloudFSMetrics \
--conf spark.cernSparkPlugin.cloudFsName=s3a \
--conf spark.dynamicAllocation.enabled=false --conf spark.executor.instances=4 \
--conf spark.hadoop.fs.s3a.secret.key=$SECRET_KEY \
--conf spark.hadoop.fs.s3a.access.key=$ACCESS_KEY \
--conf spark.hadoop.fs.s3a.endpoint="https://s3.cern.ch" \
--conf spark.hadoop.fs.s3a.impl="org.apache.hadoop.fs.s3a.S3AFileSystem" \
--conf spark.executor.metrics.fileSystemSchemes="file,hdfs,s3a" \
--conf spark.hadoop.fs.s3a.fast.upload=true \
--conf spark.hadoop.fs.s3a.path.style.access=true \
--conf spark.hadoop.fs.s3a.list.version=1 \
$TPCDS_PYSPARK -d s3a://luca/tpcds_100
```

---
## Old implementation (v1)

### How to run the Spark dashboard V1 on a Docker container
This is the original implementation of the tool, using InfluxDB and Grafana.

**1. Start the container**
The provided container image is configured to run InfluxDB and Grafana
- `docker run -p 3000:3000 -p 2003:2003 -d lucacanali/spark-dashboard:v01`
- Note: port 2003 is for Graphite ingestion, port 3000 is for Grafana
- More options, including how to persist InfluxDB data across restarts, at: [Spark dashboard in a container](dockerfiles)

**2. Spark configuration**
See above

**3. Visualize the metrics using a Grafana dashboard**
- Point your browser to `http://hostname:3000` (edit `hostname` as relevant)
- See details above

---
### How to run the dashboard V1 on Kubernetes using Helm
If you choose to run on Kubernetes, these are the steps:

1. The Helm chart takes care of configuring and running InfluxDB and Grafana (a command sketch is shown after this list):
- Use `INFLUXDB_ENDPOINT=spark-dashboard-influx.default.svc.cluster.local` as the InfluxDB endpoint in
the Spark configuration.

3. Grafana visualization with Helm:
- The Grafana dashboard is reachable at port 3000 of the spark-dashboard-service.
- See service details: `kubectl get service spark-dashboard-grafana`
- When using NodePort and an internal cluster IP address, this is how you can port-forward to the service from
your local machine (see the sketch below):
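
As a rough sketch of the overall workflow (the chart reference below is an assumption; see [charts/README.md](charts/README.md)
for the authoritative commands):
```
# 1. Install the chart (chart location is a placeholder)
helm install spark-dashboard ./charts/spark-dashboard

# 2. Point the Spark Graphite sink at the InfluxDB service deployed by the chart
INFLUXDB_ENDPOINT=spark-dashboard-influx.default.svc.cluster.local
bin/spark-shell \
  --conf "spark.metrics.conf.*.sink.graphite.class"="org.apache.spark.metrics.sink.GraphiteSink" \
  --conf "spark.metrics.conf.*.sink.graphite.host"=$INFLUXDB_ENDPOINT \
  --conf "spark.metrics.conf.*.sink.graphite.port"=2003

# 3. Reach Grafana locally when the service is not externally exposed
kubectl port-forward service/spark-dashboard-grafana 3000:3000
```
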
## Advanced configurations and notes

### Graph annotations: display query/job/stage start and end times
Optionally, you can add annotation instrumentation to the performance dashboard v1.
Annotations provide additional info on start and end times for queries, jobs and stages.
To activate annotations, add the following additional configuration, needed for collecting and writing
extra performance data:
```
INFLUXDB_HTTP_ENDPOINT="http://`hostname`:8086"
...
```
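
A rough sketch, assuming the sparkMeasure InfluxDBSink listener is used to collect and write the extra data
(see the sparkMeasure documentation for the exact options):
```
bin/spark-shell \
  --packages ch.cern.sparkmeasure:spark-measure_2.12:0.24 \
  --conf spark.extraListeners=ch.cern.sparkmeasure.InfluxDBSink \
  --conf spark.sparkmeasure.influxdbURL=$INFLUXDB_HTTP_ENDPOINT
```
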
### Notes
- More details on how this works and alternative configurations at [Spark Dashboard](https://github.com/LucaCanali/Miscellaneous/tree/master/Spark_Dashboard)
- The dashboard can be used when running Spark on a cluster (Kubernetes, YARN, Standalone) or in local mode.
- When using Spark in local mode, use Spark version 3.1 or higher, see [SPARK-31711](https://issues.apache.org/jira/browse/SPARK-31711)

### Docker
- For dashboard v2: Telegraf will use port 2003 (graphite endpoint) and VictoriaMetrics will use port 8428 (http endpoint) of
your machine/VM (when running using `--network=host`).
- For dashboard v1: InfluxDB will use port 2003 (graphite endpoint) and port 8086 (http endpoint) of
your machine/VM (when running using `--network=host`).
- Note: the endpoints need to be available on the node where you started the Docker container and
reachable by Spark executors and driver (mind the firewall).
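
For example, a quick way to check that the Graphite endpoint is reachable from a node that will run Spark
executors or the driver (the hostname is a placeholder):
```
nc -vz <dashboard_hostname> 2003
```
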
19 files renamed without changes.
48 changes: 48 additions & 0 deletions dockerfiles_v2/Dockerfile
FROM ubuntu:22.04

ENV TELEGRAF_VERSION 1.30.0-1
ENV GRAFANA_VERSION 10.4.1
ENV VM_VERSION v1.99.0
ENV ARCH amd64
ENV GRAFANA_VM_PLUGIN_VERSION v0.6.0
ENV PLUGIN_PATH /var/lib/grafana/plugins

RUN set -ex && \
apt-get update && \
apt-get install -qq -y curl libfontconfig musl && \
curl -O https://dl.grafana.com/oss/release/grafana_${GRAFANA_VERSION}_${ARCH}.deb && \
dpkg -i grafana_${GRAFANA_VERSION}_${ARCH}.deb && \
rm -f grafana_${GRAFANA_VERSION}_${ARCH}.deb && \
curl -O https://repos.influxdata.com/debian/packages/telegraf_${TELEGRAF_VERSION}_${ARCH}.deb && \
dpkg -i telegraf_${TELEGRAF_VERSION}_${ARCH}.deb && \
rm -f telegraf_${TELEGRAF_VERSION}_${ARCH}.deb

# Configure the VictoriaMetrics Grafana datasource
RUN curl -L -O https://github.com/VictoriaMetrics/grafana-datasource/releases/download/${GRAFANA_VM_PLUGIN_VERSION}/victoriametrics-datasource-${GRAFANA_VM_PLUGIN_VERSION}.tar.gz && \
tar -xzf victoriametrics-datasource-${GRAFANA_VM_PLUGIN_VERSION}.tar.gz && \
find victoriametrics-datasource -type f -name "victoriametrics_backend_plugin*" ! -name "*linux_amd64" -exec rm -f {} + && \
mkdir ${PLUGIN_PATH} && \
mv victoriametrics-datasource ${PLUGIN_PATH} && \
rm victoriametrics-datasource-${GRAFANA_VM_PLUGIN_VERSION}.tar.gz
COPY grafana.ini /etc/grafana/grafana.ini
COPY victoriametrics-datasource.yml /etc/grafana/provisioning/datasources/victoriametrics-datasource.yml

# Copy the bundled dashboards for the spark-dashboard
COPY grafana_dashboards /var/lib/grafana/dashboards
COPY spark.yaml /etc/grafana/provisioning/dashboards/spark.yaml

# Configure telegraf
COPY telegraf.conf /etc/telegraf/telegraf.conf

# Download and install VictoriaMetrics (VM)
RUN curl -L -O https://github.com/VictoriaMetrics/VictoriaMetrics/releases/download/${VM_VERSION}/victoria-metrics-linux-${ARCH}-${VM_VERSION}.tar.gz && \
tar -xzvf victoria-metrics-*.tar.gz && \
rm -f victoria-metrics-linux-${ARCH}-${VM_VERSION}.tar.gz

COPY entrypoint.sh /opt/entrypoint.sh

EXPOSE 3000/tcp 2003/tcp 8428/tcp

WORKDIR /
ENTRYPOINT [ "/opt/entrypoint.sh" ]

31 changes: 31 additions & 0 deletions dockerfiles_v2/README.md
# How to build and run the Spark dashboard in a container image

## How to run
Run the dashboard using a container image from [Dockerhub](https://hub.docker.com/r/lucacanali/spark-dashboard):
- There are a few ports needed and multiple options on how to expose them
- Port 2003 is for Graphite ingestion, port 3000 is for Grafana, port 8428 is the VictoriaMetrics http endpoint (used internally)
- You can expose the ports from the container individually or just use `--network=host`.
- Examples:
```
docker run --network=host -d lucacanali/spark-dashboard
or
docker run -p 3000:3000 -p 2003:2003 -d lucacanali/spark-dashboard
or
docker run -p 3000:3000 -p 2003:2003 -p 8428:8428 -d lucacanali/spark-dashboard
```

## Advanced: persist VictoriaMetrics data across restarts
- This shows an example of how to use a volume to store data on VictoriaMetrics.
It allows preserving the history across runs when the container is restarted,
otherwise VictoriaMetrics starts from scratch each time.
```
mkdir metrics_data
docker run --network=host -v ./metrics_data:/victoria-metrics-data -d lucacanali/spark-dashboard:v02
```

## Example of how to build the image:
```
cd dockerfiles_v2
docker build -t spark-dashboard:v02 .
```
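
After building, the locally built tag can be run the same way as the published image, for example:
```
docker run --network=host -d spark-dashboard:v02
```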

11 changes: 11 additions & 0 deletions dockerfiles_v2/entrypoint.sh
#!/bin/bash

# Start the services
service grafana-server start
service telegraf start
# Run VictoriaMetrics in the background as well
./victoria-metrics-prod &

# when running with docker run -d option this keeps the container running
tail -f /dev/null


5 changes: 5 additions & 0 deletions dockerfiles_v2/grafana.ini
[plugins]
allow_loading_unsigned_plugins = victoriametrics-datasource
[dashboards]
default_home_dashboard_path = /var/lib/grafana/dashboards/Spark_Perf_Dashboard_v04_PromQL.json
