Apache Spark Puppet Module

Module Description - What the module does and why it is useful
Setup - The basics of getting started with spark
- What spark affects
- Setup requirements
Usage - Configuration options and additional functionality
Reference - An under-the-hood peek at what the module is doing and how
- Classes
- Module Parameters (spark class)
Limitations - OS compatibility, etc.
Development - Guide for contributing to the module

Module Description

This Puppet module installs and sets up an Apache Spark cluster, optionally with security. Both the YARN and the Spark Master cluster modes are supported.

Setup

What spark affects

Packages: installs Spark packages as needed (core, python, history server, ...)
Files modified:
/etc/spark/conf/spark-defaults.conf
/etc/spark/conf/spark-env.sh (modified, when environment parameter set)
/etc/default/spark
/etc/profile.d/hadoop-spark.csh (frontend)
/etc/profile.d/hadoop-spark.sh (frontend)
Permissions modified:
/etc/security/keytab/spark.service.keytab (historyserver)
Alternatives:
alternatives are used for /etc/spark/conf in Cloudera
this module switches to the new alternative by default, so the Cloudera original configuration can be kept intact
Services:
master server (when spark::master or spark::master::service included)
history server (when spark::historyserver or spark::historyserver::service included)
worker node (when spark::worker or spark::worker::service included)
Helper files:
/var/lib/hadoop-hdfs/.puppet-spark-*

Setup Requirements

There are several known or intended limitations in this module.

Be aware of:

Hadoop repositories
neither Cloudera nor Hortonworks repositories are configured by this module (for Cloudera, the list and key files are available at http://archive.cloudera.com/cdh5/debian/wheezy/amd64/cdh/)
Java is not installed by this module (openjdk-7-jre-headless is sufficient for Debian 7/wheezy)
No inter-node dependencies: a working HDFS is required before the Spark History Server is deployed. The dependency of the Spark HDFS initialization on the HDFS namenode is handled properly if the class spark::hdfs is included on the HDFS namenode (see examples).
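
Because the module does not install Java, it has to be provided elsewhere in the site manifest. A minimal sketch, assuming Debian 7/wheezy and the package name mentioned above; the `require` ordering is only an illustration and is not something the module mandates:

```puppet
# Assumption: Debian 7/wheezy, where this JRE package name exists.
package { 'openjdk-7-jre-headless':
  ensure => installed,
}

# Declare the main class only after Java is present.
class { 'spark':
  require => Package['openjdk-7-jre-headless'],
}
```

The Hadoop package repositories still have to be configured separately, before the Spark packages can be installed.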

Usage

Spark can be used in two cluster modes (both can be enabled at the same time):

YARN mode: Hadoop is used for computing and scheduling
Spark mode: Spark Master Server and Worker Nodes are used for computing and scheduling

Optionally, the Spark History Server can be used in both YARN and Spark modes; it also requires Hadoop HDFS.

The Spark mode doesn't support security; only the YARN mode can be used with a secured Hadoop cluster.

Puppet classes to include:

everywhere: spark
YARN mode (requires Hadoop cluster with YARN, see CESNET Hadoop puppet module):
client: spark::frontend
Spark mode:
master: spark::master
slaves: spark::worker
optionally History Server (requires Hadoop cluster with HDFS, see CESNET Hadoop puppet module):
spark::historyserver
on HDFS namenode: spark::hdfs

Spark in YARN cluster mode

Example: Apache Spark over Hadoop cluster:

For simplicity, a one-machine Hadoop cluster is used (everything is on $::fqdn, replication factor 1).

class{'hadoop':
  hdfs_hostname => $::fqdn,
  yarn_hostname => $::fqdn,
  slaves => [ $::fqdn ],
  frontends => [ $::fqdn ],
  realm => '',
  properties => {
    'dfs.replication' => 1,
  },
}

class{'spark':
  # defaultFS is taken from hadoop class
}

node default {
  include stdlib

  include hadoop::namenode
  include hadoop::resourcemanager
  include hadoop::historyserver
  include hadoop::datanode
  include hadoop::nodemanager
  include hadoop::frontend

  include spark::frontend
  # should be collocated with hadoop::namenode
  include spark::hdfs
}

Notes:

if the Spark History Server is collocated with the HDFS namenode, add the dependency Class['hadoop::namenode::service'] -> Class['spark::historyserver::service']
if it is not collocated, the HDFS namenode must be running first (run puppet again later if the Spark History Server fails to start because HDFS is not available yet)
for Spark clients (in YARN mode): the user must log out and log in again, or run ". /etc/profile.d/hadoop-spark.sh"
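
The ordering from the first note can be sketched as follows for a node that hosts both the HDFS namenode and the Spark History Server. This is only an illustration built from the class names used in this README; the `hadoop` and `spark` classes are assumed to be declared as in the examples:

```puppet
node default {
  include hadoop::namenode
  include spark::hdfs
  include spark::historyserver

  # Start the HDFS namenode service before the Spark History Server service.
  Class['hadoop::namenode::service'] -> Class['spark::historyserver::service']
}
```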

Now you can submit Spark jobs in cluster mode over Hadoop YARN:

spark-submit --class org.apache.spark.examples.SparkPi --deploy-mode cluster --master yarn /usr/lib/spark/lib/spark-examples-1.2.0-cdh5.3.1-hadoop2.5.0-cdh5.3.1.jar 10

Spark in Spark Master cluster mode

Example: Apache Spark in Spark cluster mode:

A two-node cluster is used here.

$master_hostname='spark-master.example.com'

class{'hadoop':
  realm         => '',
  hdfs_hostname => $master_hostname,
  slaves        => ['spark1.example.com', 'spark2.example.com'],
}

class{'spark':
  master_hostname        => $master_hostname,
  historyserver_hostname => $master_hostname,
  yarn_enable            => false,
}

node 'spark-master.example.com' {
  include spark::master
  include spark::historyserver
  include hadoop::namenode
  include spark::hdfs
}

node /^spark(1|2)\.example\.com$/ {
  include spark::worker
  include hadoop::datanode
}

node 'client.example.com' {
  include hadoop::frontend
  include spark::frontend
}

Notes:

the Spark History Server (spark::historyserver) is also enabled here; it requires HDFS (master: hadoop::namenode, slaves: hadoop::datanode)
YARN is disabled completely. To enable YARN, also include hadoop::nodemanager on the slave nodes (collocation with spark::worker is not needed) and hadoop::resourcemanager on the master (see the previous example, or the CESNET Hadoop puppet module)
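
A sketch of the same two-node cluster with YARN enabled, combining the two examples above. The `yarn_hostname` value is an assumption (the master is reused as the resource manager host), and `yarn_enable` is simply left at its default of true:

```puppet
$master_hostname = 'spark-master.example.com'

class { 'hadoop':
  realm         => '',
  hdfs_hostname => $master_hostname,
  # assumption: the resource manager runs on the master too
  yarn_hostname => $master_hostname,
  slaves        => ['spark1.example.com', 'spark2.example.com'],
}

class { 'spark':
  master_hostname        => $master_hostname,
  historyserver_hostname => $master_hostname,
  # yarn_enable is not set to false, so the default (true) applies
}

node 'spark-master.example.com' {
  include spark::master
  include spark::historyserver
  include hadoop::namenode
  include hadoop::resourcemanager
  include spark::hdfs
}

node /^spark(1|2)\.example\.com$/ {
  include spark::worker
  include hadoop::datanode
  include hadoop::nodemanager
}
```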

Spark jar file optimization

The spark-assembly.jar file is copied into HDFS on each job submission. This can be optimized by copying the file beforehand. Keep in mind that the jar file needs to be refreshed on HDFS with each Spark software update.

...

class{'spark':
  jar_enable    => true,
}

...

Copy the jar file after installation and deployment (superuser credentials are needed if security in Hadoop is enabled):

hdfs dfs -put /usr/lib/spark/spark-assembly.jar /user/spark/share/lib/spark-assembly.jar

Add Spark History Server

The Spark History Server stores details about Spark jobs. It is provided by the class spark::historyserver. The parameter historyserver_hostname must also be specified (replace $::fqdn with the real hostname), and an HDFS cluster is required:

...
class{'spark':
  ...
  historyserver_hostname => $::fqdn,
}

node default {
  ...
  include spark::historyserver
}

Multihome

Multihome is not supported.

You may also need to set SPARK_LOCAL_IP to bind the RPC listen address to the default interface:

environment => {
  'SPARK_LOCAL_IP' => '0.0.0.0',
  #'SPARK_LOCAL_IP' => $::ipaddress_eth0,
}
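
For context, this is how the fragment might sit inside the `spark` class declaration. The sketch uses the per-interface variant from the commented line; the interface name eth0 is an assumption about the host:

```puppet
class { 'spark':
  # ... other parameters ...
  environment => {
    # Bind to the address of one specific interface instead of 0.0.0.0.
    # Assumption: the host has an eth0 interface (fact ipaddress_eth0).
    'SPARK_LOCAL_IP' => $::ipaddress_eth0,
  },
}
```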

Cluster with more HDFS Name nodes

If the Hadoop cluster uses multiple HDFS namenodes (high availability, namespaces, ...), the 'spark' system user must exist on all of them for authorization to work properly. You could install the full Spark client (using spark::frontend::install), but just creating the user is enough (using spark::user).

Note that the spark::hdfs class must be used too, but only on one of the HDFS namenodes. It includes spark::user.

Example:

node <HDFS_NAMENODE> {
  include spark::hdfs
}

node <HDFS_OTHER_NAMENODE> {
  include spark::user
}

Upgrade

The best way is to refresh the configuration from the new original (i.e. remove the old one) and relaunch puppet on top of it. There is also a problem with the start-up scripts on Debian, which needs to be worked around where the Spark History Server is used.

For example:

alternative='cluster'
d='spark'
mv /etc/${d}/conf.${alternative} /etc/${d}/conf.cdhXXX
update-alternatives --auto ${d}-conf

service spark-history-server stop || :
mv /etc/init.d/spark-history-server /etc/init.d/spark-history-server.prev

# upgrade
...

puppet agent --test
#or: puppet apply ...

# restore start-up script from spark-history-server.dpkg-new or spark-history-server.prev
...
service spark-history-server start

##Reference

###Classes

spark: Main configuration class for CESNET Apache Spark puppet module
spark::common:
spark::common::config
spark::common::postinstall
spark::frontend: Apache Spark Client
spark::frontend::config
spark::frontend::install
spark::hdfs: HDFS initialization
spark::historyserver: Apache Spark History Server
spark::historyserver::config
spark::historyserver::install
spark::historyserver::service
spark::master: Apache Spark Master Server
spark::master::config
spark::master::install
spark::master::service
spark::worker: Apache Spark Worker Node
spark::worker::config
spark::worker::install
spark::worker::service
spark::params
spark::user: Create spark system user

`spark` class

####Parameters

#####alternatives

Switches the alternatives used for the configuration. Default: 'cluster' (Debian) or undef.

It can be used only when supported (for example with Cloudera distribution).

#####confdir

Spark config directory. Default: platform specific ('/etc/spark/conf' or '/etc/spark').

#####defaultFS

Filesystem URI. Default: '::default' (from $::hadoop::_defaultFS).

Examples:

hdfs://hdfs.example.com:8020
hdfs://mycluster
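
A minimal sketch of setting the filesystem URI explicitly instead of taking it from the hadoop class. The hostname is the placeholder from the examples above:

```puppet
class { 'spark':
  # Explicit filesystem URI; overrides the value derived from the hadoop class.
  defaultFS => 'hdfs://hdfs.example.com:8020',
}
```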

#####hive_configfile

Hive config file. Default: platform specific ('../../hive/conf/hive-site.xml' or '../etc/hive/hive-site.xml').

#####keytab

Spark Historyserver keytab file. Default: '/etc/security/keytab/spark.service.keytab'.

#####keytab_source

Puppet source for the Spark keytab file. Default: undef.

When specified, the Spark keytab file is created from this puppet source (or sources). Otherwise only the permissions are set on the keytab file.
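
A sketch of a secured History Server setup that distributes the keytab through Puppet. The realm is a placeholder and the module name `site_keytabs` in the source URL is hypothetical; the Hadoop cluster is assumed to be secured with the same realm:

```puppet
class { 'spark':
  historyserver_hostname => $::fqdn,
  # Placeholder realm; a non-empty string enables security.
  realm                  => 'EXAMPLE.COM',
  keytab                 => '/etc/security/keytab/spark.service.keytab',
  # Hypothetical file source; without it only permissions are set.
  keytab_source          => 'puppet:///modules/site_keytabs/spark.service.keytab',
}
```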

#####logdir

Event log directory and history server log directory without the defaultFS prefix. Default: '/user/spark/applicationHistory'.

Note that this parameter is ignored by the spark::hdfs class. When a non-default value is used, the directory must be created explicitly.
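
A sketch of a non-default log directory. The path is hypothetical, and because spark::hdfs does not create it, it has to be created on HDFS by other means before the History Server is started:

```puppet
class { 'spark':
  historyserver_hostname => $::fqdn,
  # Hypothetical path, given without the defaultFS prefix.
  # Not created by spark::hdfs; create it on HDFS beforehand.
  logdir                 => '/shared/spark/applicationHistory',
}
```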

#####master_hostname

Spark Master hostname. Default: undef.

#####master_port

Spark Master port. Default: '7077'.

#####master_ui_port

Spark Master Web UI port. Default: '18080'.

#####historyserver_hostname

Spark History server hostname. Default: undef.

#####historyserver_port

Spark History Server Web UI port. Default: '18088'.

Notes:

the Spark default value is 18080, which conflicts with the default for the Master server Web UI
there is no historyserver_ui_port parameter (the Web UI port is the same as the RPC port)

#####worker_port

Spark Worker node port. Default: '7078'.

#####worker_ui_port

Spark Worker node Web UI port. Default: '18081'.
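
The port parameters can be set together in one declaration. This sketch spells out the master and worker defaults listed above and picks a History Server port that does not collide with the Master Web UI; the hostname is a placeholder:

```puppet
class { 'spark':
  master_hostname        => 'spark-master.example.com',
  historyserver_hostname => 'spark-master.example.com',
  master_port            => '7077',
  master_ui_port         => '18080',
  # Must differ from master_ui_port when both run on the same host.
  historyserver_port     => '18088',
  worker_port            => '7078',
  worker_ui_port         => '18081',
}
```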

#####environment

Environment variables to set for Apache Spark. Default: undef.

The value is a hash. A value of '::undef' unsets the particular variable.

Example: you may need to increase the daemon memory when there is a large number of jobs:

environment => {
  'SPARK_DAEMON_MEMORY' => '4096m',
}
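
A sketch combining a regular value with the special '::undef' value described above. Which variables are worth unsetting depends on the installation; SPARK_LOCAL_IP is used here only as an illustration:

```puppet
class { 'spark':
  environment => {
    # Set a variable.
    'SPARK_DAEMON_MEMORY' => '4096m',
    # Unset a variable (illustrative choice of variable).
    'SPARK_LOCAL_IP'      => '::undef',
  },
}
```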

#####properties

Spark properties to set. Default: undef.
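
The value is presumably a hash of Spark property names and values, analogous to the environment parameter. A sketch with two standard Spark properties; the chosen values are illustrative only:

```puppet
class { 'spark':
  properties => {
    # Standard Spark properties; the values are examples.
    'spark.executor.memory'   => '2g',
    'spark.eventLog.compress' => 'true',
  },
}
```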

#####realm

Kerberos realm. Default: undef.

Non-empty string enables security.

#####hive_enable

Enable support for Hive metastore. Default: true.

This only creates a symlink to the Hive configuration file in the Spark configuration directory on the frontend.

Hive JDBC (or a Spark assembly with Hive JDBC) must also be installed on all worker nodes.

#####jar_enable

Configure Apache Spark to look for the Spark jar file at $hdfs_hostname/user/spark/share/lib/spark-assembly.jar. Default: false.

The jar must be copied to HDFS manually after installation, and manually updated after each Spark software update:

hdfs dfs -put /usr/lib/spark/spark-assembly.jar /user/spark/share/lib/spark-assembly.jar

#####yarn_enable

Enable YARN mode. Default: true.

This requires Hadoop configured using the CESNET Hadoop puppet module.

Limitations

Tested with Cloudera distribution.

Development

Repository: https://github.com/MetaCenterCloudPuppet/cesnet-spark
Tests:
basic: see .travis.yml
vagrant: https://github.com/MetaCenterCloudPuppet/hadoop-tests
Email: František Dvořák <[email protected]>
