docs: Update references (#393)
* Initial commit

* Update xref
Techassi authored Sep 21, 2023
1 parent cf371d9 commit f55c8ff
Showing 2 changed files with 42 additions and 22 deletions.
24 changes: 14 additions & 10 deletions docs/modules/hdfs/pages/getting_started/installation.adoc
= Installation

On this page you will install the Stackable HDFS operator and its dependency, the ZooKeeper operator, as well as the
commons and secret operators, which are required by all Stackable operators.

== Stackable Operators

There are two ways to run Stackable Operators:

. Using xref:management:stackablectl:index.adoc[]
. Using Helm

=== stackablectl

`stackablectl` is the command line tool to interact with Stackable operators and our recommended way to install
operators. Follow the xref:management:stackablectl:installation.adoc[installation steps] for your platform.

After you have installed `stackablectl`, run the following command to install all operators necessary for the HDFS
cluster:

[source,bash]
----
include::example$getting_started/getting_started.sh[tag=stackablectl-install-operators]
----

The tool will show

[source]
----
[INFO ] Installing hdfs operator
----
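
For reference, the included script boils down to a single `stackablectl` call along these lines (a sketch; the exact
operator list and versions come from the getting started script):

[source,bash]
----
# Sketch: install the operators the HDFS cluster needs in one invocation
stackablectl operator install commons secret zookeeper hdfs
----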

TIP: Consult the xref:management:stackablectl:quickstart.adoc[] to learn more about how to use `stackablectl`. For
example, you can use the `--cluster kind` flag to create a Kubernetes cluster with link:https://kind.sigs.k8s.io/[kind].
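
Combined with the flag from the tip, a one-step local setup could look like this sketch:

[source,bash]
----
# Sketch: create a local kind cluster and install the operators into it
stackablectl operator install commons secret zookeeper hdfs --cluster kind
----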

=== Helm

Then install the Stackable Operators:

[source,bash]
----
include::example$getting_started/getting_started.sh[tag=helm-install-operators]
----
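
The included snippet corresponds to Helm commands roughly like the following (a sketch; the repository name, URL and
chart names are assumptions based on the public Stackable Helm repository):

[source,bash]
----
# Sketch: add the Stackable Helm repository and install each operator chart
helm repo add stackable-stable https://repo.stackable.tech/repository/helm-stable/
helm install commons-operator stackable-stable/commons-operator
helm install secret-operator stackable-stable/secret-operator
helm install zookeeper-operator stackable-stable/zookeeper-operator
helm install hdfs-operator stackable-stable/hdfs-operator
----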

Helm will deploy the operators in a Kubernetes Deployment and apply the CRDs for the HDFS cluster (as well as the CRDs
for the required operators). You are now ready to deploy HDFS in Kubernetes.
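
You can check the result with standard kubectl commands, for example:

[source,bash]
----
# Confirm the operator Deployments are running and the CRDs are registered
kubectl get deployments
kubectl get crds | grep stackable.tech
----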

== What's next

xref:getting_started/first_steps.adoc[Set up an HDFS cluster] and its dependencies and
xref:getting_started/first_steps.adoc#_verify_that_it_works[verify that it works].
40 changes: 28 additions & 12 deletions docs/modules/hdfs/pages/index.adoc
:description: The Stackable Operator for Apache HDFS is a Kubernetes operator that can manage Apache HDFS clusters. Learn about its features, resources, dependencies and demos, and see the list of supported HDFS versions.
:keywords: Stackable Operator, Hadoop, Apache HDFS, Kubernetes, k8s, operator, engineer, big data, metadata, storage, cluster, distributed storage

The Stackable Operator for https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html[Apache HDFS]
(Hadoop Distributed File System) is used to set up HDFS in high-availability mode. HDFS is a distributed file system
designed to store and manage massive amounts of data across multiple machines in a fault-tolerant manner. The Operator
depends on the xref:zookeeper:index.adoc[] to operate a ZooKeeper cluster to coordinate the active and standby NameNodes.

== Getting started

Follow the xref:getting_started/index.adoc[Getting started guide], which walks you through installing the Stackable
HDFS and ZooKeeper Operators, setting up ZooKeeper and HDFS, and writing a file to HDFS to verify that everything is set
up correctly.

Afterwards you can consult the xref:usage-guide/index.adoc[] to learn more about tailoring your HDFS configuration to
your needs, or have a look at the <<demos, demos>> for some example setups.

== Operator model

The Operator manages the _HdfsCluster_ custom resource. The cluster implements three
xref:home:concepts:roles-and-role-groups.adoc[roles]:

* DataNode - responsible for storing the actual data.
* JournalNode - responsible for keeping a shared edit log of namespace changes, which is used to perform failovers in
case the active NameNode fails. For details see: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html
* NameNode - responsible for keeping track of HDFS blocks and providing access to the data.


image::hdfs_overview.drawio.svg[A diagram depicting the Kubernetes resources created by the operator]
The operator creates the following K8S objects per role group defined in the custom resource.

* Service - ClusterIP used for intra-cluster communication.
* ConfigMap - HDFS configuration files like `core-site.xml`, `hdfs-site.xml` and `log4j.properties` are defined here and
mounted in the pods.
* StatefulSet - where the replica count, volume mounts and more for each role group are defined.
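
For a running cluster, these objects can be inspected with a label selector, for example (the
`app.kubernetes.io/instance` label and the cluster name `simple-hdfs` are assumptions for illustration):

[source,bash]
----
# Sketch: list the objects created for a cluster named "simple-hdfs"
kubectl get statefulsets,services,configmaps -l app.kubernetes.io/instance=simple-hdfs
----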

In addition, a `NodePort` service is created for each pod labeled with `hdfs.stackable.tech/pod-service=true` that
exposes all container ports to the outside world (from the perspective of K8S).
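
To see which pods are exposed this way, you can select on that label:

[source,bash]
----
# List the pods that get their own NodePort service
kubectl get pods -l hdfs.stackable.tech/pod-service=true
----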

In the custom resource you can specify the number of replicas per role group (NameNode, DataNode or JournalNode). A
minimal working configuration (sketched as a manifest below the list) requires:

* 2 NameNodes (HA)
* 1 JournalNode
* 1 DataNode (the number of DataNodes should be at least the `clusterConfig.dfsReplication` factor)
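
As a sketch, such a minimal configuration could be expressed like this. Field names, the cluster name `simple-hdfs`,
the ZooKeeper discovery ConfigMap name and the product version are assumptions; consult the getting started guide for
the exact schema:

[source,bash]
----
# Sketch: a minimal HdfsCluster matching the replica counts above
kubectl apply -f - <<EOF
apiVersion: hdfs.stackable.tech/v1alpha1
kind: HdfsCluster
metadata:
  name: simple-hdfs
spec:
  image:
    productVersion: "3.3.4"
  clusterConfig:
    zookeeperConfigMapName: simple-hdfs-znode
    dfsReplication: 1
  nameNodes:
    roleGroups:
      default:
        replicas: 2
  journalNodes:
    roleGroups:
      default:
        replicas: 1
  dataNodes:
    roleGroups:
      default:
        replicas: 1
EOF
----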

The Operator creates a xref:concepts:service_discovery.adoc[service discovery ConfigMap] for the HDFS instance. The
discovery ConfigMap contains the `core-site.xml` file and the `hdfs-site.xml` file.
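
Clients can read these files straight from the ConfigMap, for example (assuming the ConfigMap is named after a cluster
called `simple-hdfs`):

[source,bash]
----
# Sketch: print the hdfs-site.xml from the discovery ConfigMap
kubectl get configmap simple-hdfs -o jsonpath='{.data.hdfs-site\.xml}'
----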

== Dependencies

HDFS depends on ZooKeeper for coordination between nodes. You can run a ZooKeeper cluster with the
xref:zookeeper:index.adoc[]. Additionally, the xref:commons-operator:index.adoc[] and
xref:secret-operator:index.adoc[] are needed.

== [[demos]]Demos

Two demos that use HDFS are available.

**xref:demos:hbase-hdfs-load-cycling-data.adoc[]** loads a dataset of cycling data from S3 into HDFS and then uses HBase
to analyze the data.

**xref:demos:jupyterhub-pyspark-hdfs-anomaly-detection-taxi-data.adoc[]** showcases the integration between HDFS and
Jupyter. New York Taxi data is stored in HDFS and analyzed in a Jupyter notebook.

== Supported Versions

