Initial commit (#285)
Techassi authored and fhennig committed Sep 21, 2023
1 parent 6117b45 commit 621f2ac
Showing 2 changed files with 58 additions and 25 deletions.
27 changes: 17 additions & 10 deletions docs/modules/spark-k8s/pages/getting_started/installation.adoc
@@ -1,10 +1,15 @@
= Installation

On this page you will install the Stackable Spark-on-Kubernetes operator as well as the Commons and Secret operators which are required by all Stackable operators.
On this page you will install the Stackable Spark-on-Kubernetes operator as well as the Commons and Secret operators
which are required by all Stackable operators.

== Dependencies

Spark applications almost always require dependencies like database drivers, REST api clients and many others. These dependencies must be available on the `classpath` of each executor (and in some cases of the driver, too). There are multiple ways to provision Spark jobs with such dependencies: some are built into Spark itself while others are implemented at the operator level. In this guide we are going to keep things simple and look at executing a Spark job that has a minimum of dependencies.
Spark applications almost always require dependencies like database drivers, REST api clients and many others. These
dependencies must be available on the `classpath` of each executor (and in some cases of the driver, too). There are
multiple ways to provision Spark jobs with such dependencies: some are built into Spark itself while others are
implemented at the operator level. In this guide we are going to keep things simple and look at executing a Spark job
that has a minimum of dependencies.

More information about the different ways to define Spark jobs and their dependencies is given on the following pages:

@@ -15,14 +20,13 @@ More information about the different ways to define Spark jobs and their depende

There are two ways to install Stackable operators:

1. Using xref:stackablectl::index.adoc[]

2. Using a Helm chart
. Using xref:management:stackablectl:index.adoc[]
. Using a Helm chart

=== stackablectl

`stackablectl` is the command line tool to interact with Stackable operators and our recommended way to install Operators.
Follow the xref:stackablectl::installation.adoc[installation steps] for your platform.
`stackablectl` is the command line tool to interact with Stackable operators and our recommended way to install
Operators. Follow the xref:management:stackablectl:installation.adoc[installation steps] for your platform.

After you have installed `stackablectl` run the following command to install the Spark-k8s operator:

@@ -39,7 +43,8 @@ The tool will show
[INFO ] Installing spark-k8s operator
----

TIP: Consult the xref:stackablectl::quickstart.adoc[] to learn more about how to use stackablectl. For example, you can use the `-k` flag to create a Kubernetes cluster with link:https://kind.sigs.k8s.io/[kind].
TIP: Consult the xref:management:stackablectl:quickstart.adoc[] to learn more about how to use stackablectl. For
example, you can use the `--cluster kind` flag to create a Kubernetes cluster with link:https://kind.sigs.k8s.io/[kind].

=== Helm

@@ -55,8 +60,10 @@ Then install the Stackable Operators:
include::example$getting_started/getting_started.sh[tag=helm-install-operators]
----

Helm will deploy the operators in a Kubernetes Deployment and apply the CRDs for the `SparkApplication` (as well as the CRDs for the required operators). You are now ready to create a Spark job.
Helm will deploy the operators in a Kubernetes Deployment and apply the CRDs for the `SparkApplication` (as well as the
CRDs for the required operators). You are now ready to create a Spark job.
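
One way to confirm the result is to list the operator Deployments and the installed CRDs with `kubectl`. The following is a minimal sketch, not part of the Stackable tooling: the function name `verify_install` and the `KUBECTL` override are assumptions introduced here for illustration, and the sketch assumes the operators were installed into the current namespace.

```shell
# Hypothetical helper. KUBECTL can be overridden (for example KUBECTL=echo)
# to print the commands instead of running them; it defaults to kubectl.
verify_install() {
  # The operators run as Deployments in the namespace they were installed into.
  "${KUBECTL:-kubectl}" get deployments
  # The CRD for SparkApplication should show up in the list of CRDs.
  "${KUBECTL:-kubectl}" get crds
}
```

Run `verify_install` after the Helm commands have finished and check that the operator Deployments report ready replicas.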

== What's next

xref:getting_started/first_steps.adoc[Execute a Spark Job] and xref:getting_started/first_steps.adoc#_verify_that_it_works[verify that it works] by inspecting the pod logs.
xref:getting_started/first_steps.adoc[Execute a Spark Job] and
xref:getting_started/first_steps.adoc#_verify_that_it_works[verify that it works] by inspecting the pod logs.
56 changes: 41 additions & 15 deletions docs/modules/spark-k8s/pages/index.adoc
@@ -2,55 +2,81 @@
:description: The Stackable Operator for Apache Spark is a Kubernetes operator that can manage Apache Spark clusters. Learn about its features, resources, dependencies and demos, and see the list of supported Spark versions.
:keywords: Stackable Operator, Apache Spark, Kubernetes, operator, data science, engineer, big data, CRD, StatefulSet, ConfigMap, Service, S3, demo, version

This operator manages https://spark.apache.org/[Apache Spark] on Kubernetes clusters. Apache Spark is a powerful open-source big data processing framework that allows for efficient and flexible distributed computing. Its in-memory processing and fault-tolerant architecture make it ideal for a variety of use cases, including batch processing, real-time streaming, machine learning, and graph processing.
:structured-streaming: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html

This operator manages https://spark.apache.org/[Apache Spark] on Kubernetes clusters. Apache Spark is a powerful
open-source big data processing framework that allows for efficient and flexible distributed computing. Its in-memory
processing and fault-tolerant architecture make it ideal for a variety of use cases, including batch processing,
real-time streaming, machine learning, and graph processing.

== Getting Started

Follow the xref:getting_started/index.adoc[] guide to get started with Apache Spark using the Stackable Operator. The guide will lead you through the installation of the Operator and running your first Spark application on Kubernetes.
Follow the xref:getting_started/index.adoc[] guide to get started with Apache Spark using the Stackable Operator. The
guide will lead you through the installation of the Operator and running your first Spark application on Kubernetes.

== How the Operator works

The Stackable Operator for Apache Spark reads a _SparkApplication custom resource_ which you use to define your spark job/application. The Operator creates the relevant Kubernetes resources for the job to run.
The Stackable Operator for Apache Spark reads a _SparkApplication custom resource_ which you use to define your spark
job/application. The Operator creates the relevant Kubernetes resources for the job to run.

=== Custom resources

The Operator manages two custom resource kinds: The _SparkApplication_ and the _SparkHistoryServer_.

The SparkApplication resource is the main point of interaction with the Operator. Unlike other Stackable Operator custom resources, the SparkApplication does not have xref:concepts:roles-and-role-groups.adoc[roles]. An exhaustive list of options is given on the xref:crd-reference.adoc[] page.
The SparkApplication resource is the main point of interaction with the Operator. Unlike other Stackable Operator custom
resources, the SparkApplication does not have xref:concepts:roles-and-role-groups.adoc[roles]. An exhaustive list of
options is given on the xref:crd-reference.adoc[] page.

The xref:usage-guide/history-server.adoc[SparkHistoryServer] does have a single `node` role. It is used to deploy a https://spark.apache.org/docs/latest/monitoring.html#viewing-after-the-fact[Spark history server]. It reads data from an S3 bucket that you configure. Your applications need to write their logs to the same bucket.
The xref:usage-guide/history-server.adoc[SparkHistoryServer] does have a single `node` role. It is used to deploy a
https://spark.apache.org/docs/latest/monitoring.html#viewing-after-the-fact[Spark history server]. It reads data from an
S3 bucket that you configure. Your applications need to write their logs to the same bucket.
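
In plain Spark, event logging is controlled by the `spark.eventLog.enabled` and `spark.eventLog.dir` properties. The sketch below only illustrates which values have to point at the shared bucket; the function name, bucket and prefix are hypothetical, and how these settings are actually supplied through a SparkApplication is described in the history server usage guide, not here.

```shell
# Hypothetical helper: prints the spark-submit --conf flags that make an
# application write its event logs to s3a://<bucket>/<prefix>.
# $1 = bucket name, $2 = prefix inside the bucket (both are placeholders).
event_log_conf() {
  printf '%s\n' "--conf spark.eventLog.enabled=true --conf spark.eventLog.dir=s3a://$1/$2"
}
```

For example, `event_log_conf my-bucket eventlogs` prints flags pointing at `s3a://my-bucket/eventlogs`, which would have to match the location the history server reads from.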

=== Kubernetes resources

For every SparkApplication deployed to the cluster the Operator creates a Job, a ServiceAccount and a few ConfigMaps.

image::spark_overview.drawio.svg[A diagram depicting the Kubernetes resources created by the operator]

The Job runs `spark-submit` in a Pod which then creates a Spark driver Pod. The driver creates its own Executors based on the configuration in the SparkApplication. The Job, driver and executors all use the same image, which is configured in the SparkApplication resource.
The Job runs `spark-submit` in a Pod which then creates a Spark driver Pod. The driver creates its own Executors based
on the configuration in the SparkApplication. The Job, driver and executors all use the same image, which is configured
in the SparkApplication resource.

The two main ConfigMaps are the `<name>-driver-pod-template` and `<name>-executor-pod-template` which define how the driver and executor Pods should be created.
The two main ConfigMaps are the `<name>-driver-pod-template` and `<name>-executor-pod-template` which define how the
driver and executor Pods should be created.

The Spark history server deploys like other Stackable-supported applications: A StatefulSet is created for every role group. A role group can have multiple replicas (Pods). A ConfigMap supplies the necessary configuration, and there is a service to connect to.
The Spark history server deploys like other Stackable-supported applications: A StatefulSet is created for every role
group. A role group can have multiple replicas (Pods). A ConfigMap supplies the necessary configuration, and there is a
service to connect to.
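
The resources described above can be inspected with `kubectl`. This is a minimal sketch under stated assumptions: the helper names and the `KUBECTL` override are invented for illustration, the application is assumed to live in the current namespace, and only the two ConfigMap name patterns are taken from the text above.

```shell
# Hypothetical helper: prints the names of the two pod template ConfigMaps
# for the SparkApplication named in $1.
pod_template_configmaps() {
  printf '%s-driver-pod-template\n%s-executor-pod-template\n' "$1" "$1"
}

# Hypothetical helper: lists Jobs, ServiceAccounts and ConfigMaps in the
# current namespace, then dumps both pod templates for the application in $1.
# KUBECTL can be overridden (for example KUBECTL=echo) for a dry run.
inspect_spark_app() {
  "${KUBECTL:-kubectl}" get jobs,serviceaccounts,configmaps
  for cm in $(pod_template_configmaps "$1"); do
    "${KUBECTL:-kubectl}" get configmap "$cm" -o yaml
  done
}
```

Calling `inspect_spark_app <name>` with the name of a deployed SparkApplication shows how the driver and executor Pods are going to be created.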

=== RBAC

The https://spark.apache.org/docs/latest/running-on-kubernetes.html#rbac[Spark-Kubernetes RBAC documentation] describes what is needed for `spark-submit` jobs to run successfully: minimally a role/cluster-role to allow the driver pod to create and manage executor pods.
The https://spark.apache.org/docs/latest/running-on-kubernetes.html#rbac[Spark-Kubernetes RBAC documentation] describes
what is needed for `spark-submit` jobs to run successfully: minimally a role/cluster-role to allow the driver pod to
create and manage executor pods.

However, to add security, each `spark-submit` job launched by the spark-k8s operator will be assigned its own ServiceAccount.
However, to add security, each `spark-submit` job launched by the spark-k8s operator will be assigned its own
ServiceAccount.

When the spark-k8s operator is installed via Helm, a cluster role named `spark-k8s-clusterrole` is created with pre-defined permissions.
When the spark-k8s operator is installed via Helm, a cluster role named `spark-k8s-clusterrole` is created with
pre-defined permissions.

When a new Spark application is submitted, the operator creates a new service account with the same name as the application and binds this account to the cluster role `spark-k8s-clusterrole` created by Helm.
When a new Spark application is submitted, the operator creates a new service account with the same name as the
application and binds this account to the cluster role `spark-k8s-clusterrole` created by Helm.
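
To see what this binding allows in practice, the cluster role can be displayed and the ServiceAccount's permissions probed with `kubectl auth can-i`. The helper names, the `KUBECTL` override and the choice of `create pods` as the probed permission are assumptions made for this sketch; only the cluster role name and the ServiceAccount naming rule come from the text above.

```shell
# Hypothetical helper: builds the user name Kubernetes uses for a ServiceAccount.
# $1 = namespace, $2 = application name (the ServiceAccount shares this name).
service_account_user() {
  printf 'system:serviceaccount:%s:%s\n' "$1" "$2"
}

# Hypothetical helper: shows the cluster role created by Helm and asks the
# API server whether the application's ServiceAccount may create Pods.
# KUBECTL can be overridden (for example KUBECTL=echo) for a dry run.
check_spark_rbac() {
  "${KUBECTL:-kubectl}" get clusterrole spark-k8s-clusterrole -o yaml
  "${KUBECTL:-kubectl}" auth can-i create pods \
    --namespace "$1" --as "$(service_account_user "$1" "$2")"
}
```

Running `check_spark_rbac <namespace> <name>` requires a user that is allowed to impersonate ServiceAccounts.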

== Integrations

You can read and write data from xref:usage-guide/s3.adoc[s3 buckets], load xref:usage-guide/job-dependencies[custom job dependencies]. Spark also supports easy integration with Apache Kafka which is also supported xref:kafka:index.adoc[on the Stackable Data Platform]. Have a look at the demos below to see it in action.
You can read and write data from xref:usage-guide/s3.adoc[S3 buckets] and load xref:usage-guide/job-dependencies[custom job
dependencies]. Spark also supports easy integration with Apache Kafka which is also supported xref:kafka:index.adoc[on
the Stackable Data Platform]. Have a look at the demos below to see it in action.

== [[demos]]Demos

The xref:stackablectl::demos/data-lakehouse-iceberg-trino-spark.adoc[] demo connects multiple components and datasets into a data Lakehouse. A Spark application with https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html[structured streaming] is used to stream data from Apache Kafka into the Lakehouse.
The xref:demos:data-lakehouse-iceberg-trino-spark.adoc[] demo connects multiple components and datasets into a data
Lakehouse. A Spark application with {structured-streaming}[structured streaming] is used to stream data from Apache
Kafka into the Lakehouse.

In the xref:stackablectl::demos/spark-k8s-anomaly-detection-taxi-data.adoc[] demo Spark is used to read training data from S3 and train an anomaly detection model on the data. The model is then stored in a Trino table.
In the xref:demos:spark-k8s-anomaly-detection-taxi-data.adoc[] demo Spark is used to read training data from S3 and
train an anomaly detection model on the data. The model is then stored in a Trino table.

== Supported Versions
