spark-cookbook

Chef cookbook to install an Apache Spark master and slaves in standalone mode.

Spark is installed and run as ['spark']['user']; that user's home directory is the install path.

In standalone mode, Spark slaves are started from the master over SSH connections to the slaves. So when you add slaves, add them to ['spark']['slaves'] and re-run the recipe on the master to update the list of slaves.
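
As an illustration of why the master must be re-converged: Spark's standalone launch scripts read the hosts to start from a conf/slaves file on the master, one hostname per line. The sketch below shows how such a file could be rendered from the ['spark']['slaves'] value. The helper name `render_slaves_file` and the hostnames are hypothetical, and the cookbook's own template may differ.

```ruby
# Hypothetical helper (not part of the cookbook): renders the content of a
# conf/slaves file, one hostname per line, from a list of slave hostnames.
def render_slaves_file(slaves)
  # Drop blank entries and duplicates so each slave is started only once.
  slaves.map(&:strip).reject(&:empty?).uniq.join("\n") + "\n"
end

# Example hostnames are placeholders.
puts render_slaves_file(['ip-10-0-0-1.internal', 'ip-10-0-0-2.internal'])
```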

Supported Platforms

Only tested on Debian. Please file a ticket if you find incompatibilities with your platform.

Attributes

Key	Type	Description	Default
`["chef"]["data_bag_secret_path"]`	String	Path to secret file to decrypt data bag secret.	`/var/chef/encrypted_data_bag_secret`
`["spark"]["master_host"]`	String	Hostname of the Spark master.	`localhost`
`["spark"]["master_port"]`	String	Spark master port.	`7077`
`["spark"]["bin_url"]`	String	URL to download the Spark binary archive.	`http://d3kbcqa49mib13.cloudfront.net/spark-1.0.1-bin-hadoop2.tgz`
`["spark"]["bin_checksum"]`	String	SHA256 checksum to help Chef cache the file. Set it if you change `["spark"]["bin_url"]`.
`["spark"]["install_dir"]`	String	Where to install Spark. Also the home directory for `['spark']['user']`.	`/opt/local/spark`
`["spark"]["user"]`	String	Spark runtime user name.	`spark`
`["spark"]["group"]`	String	Spark runtime group name.	`spark`
`["spark"]["slaves"]`	Array[String]	List of slave hostnames. You probably want to use private network hostnames or IP addresses.	`[]`
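
If you change `["spark"]["bin_url"]`, the matching `["spark"]["bin_checksum"]` value can be computed from a local copy of the archive. A minimal sketch, assuming the archive has already been downloaded; the helper name `archive_sha256` is hypothetical and not part of the cookbook.

```ruby
require 'digest'

# Hypothetical helper (not part of the cookbook): returns the hex-encoded
# SHA256 digest of a local file, suitable for ["spark"]["bin_checksum"].
def archive_sha256(path)
  Digest::SHA256.file(path).hexdigest
end

# Usage: ruby checksum.rb path/to/archive.tgz
puts archive_sha256(ARGV[0]) if __FILE__ == $PROGRAM_NAME && ARGV[0]
```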

You can also set spark-env.sh environment variables with ['spark']['env']['lowercase key']. For example:

SPARK_LOCAL_IP = 'ec2-ip-xxx.internal'

Here are the supported parameters:

# Options read when launching programs locally with
# ./bin/run-example or ./bin/spark-submit
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - SPARK_LOCAL_IP, to set the IP address Spark binds to on this node
# - SPARK_PUBLIC_DNS, to set the public dns name of the driver program
# - SPARK_CLASSPATH, default classpath entries to append

# Options read by executors and drivers running inside the cluster
# - SPARK_LOCAL_IP, to set the IP address Spark binds to on this node
# - SPARK_PUBLIC_DNS, to set the public DNS name of the driver program
# - SPARK_CLASSPATH, default classpath entries to append
# - SPARK_LOCAL_DIRS, storage directories to use on this node for shuffle and RDD data
# - MESOS_NATIVE_LIBRARY, to point to your libmesos.so if you use Mesos

# Options read in YARN client mode
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - SPARK_EXECUTOR_INSTANCES, Number of workers to start (Default: 2)
# - SPARK_EXECUTOR_CORES, Number of cores for the workers (Default: 1).
# - SPARK_EXECUTOR_MEMORY, Memory per Worker (e.g. 1000M, 2G) (Default: 1G)
# - SPARK_DRIVER_MEMORY, Memory for Master (e.g. 1000M, 2G) (Default: 512 Mb)
# - SPARK_YARN_APP_NAME, The name of your application (Default: Spark)
# - SPARK_YARN_QUEUE, The hadoop queue to use for allocation requests (Default: 'default')
# - SPARK_YARN_DIST_FILES, Comma separated list of files to be distributed with the job.
# - SPARK_YARN_DIST_ARCHIVES, Comma separated list of archives to be distributed with the job.

# Options for the daemons used in the standalone deploy mode:
# - SPARK_MASTER_IP, to bind the master to a different IP address or hostname
# - SPARK_MASTER_PORT / SPARK_MASTER_WEBUI_PORT, to use non-default ports for the master
# - SPARK_MASTER_OPTS, to set config properties only for the master (e.g. "-Dx=y")
# - SPARK_WORKER_CORES, to set the number of cores to use on this machine
# - SPARK_WORKER_MEMORY, to set how much total memory workers have to give executors (e.g. 1000m, 2g)
# - SPARK_WORKER_PORT / SPARK_WORKER_WEBUI_PORT, to use non-default ports for the worker
# - SPARK_WORKER_INSTANCES, to set the number of worker processes per node
# - SPARK_WORKER_DIR, to set the working directory of worker processes
# - SPARK_WORKER_OPTS, to set config properties only for the worker (e.g. "-Dx=y")
# - SPARK_HISTORY_OPTS, to set config properties only for the history server (e.g. "-Dx=y")
# - SPARK_DAEMON_JAVA_OPTS, to set config properties for all daemons (e.g. "-Dx=y")
# - SPARK_PUBLIC_DNS, to set the public dns name of the master or workers
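
To make the lowercase-key convention for ['spark']['env'] concrete, here is a minimal sketch of how a hash of such keys might be turned into spark-env.sh lines. It assumes each key is simply upcased to obtain the variable name; the function name `render_spark_env` and the sample values are hypothetical, and the cookbook's actual template may render the file differently.

```ruby
require 'shellwords'

# Hypothetical illustration (not the cookbook's template): upcase each
# lowercase key and emit one shell export line per variable.
def render_spark_env(env)
  env.sort.map do |key, value|
    # Shellwords.escape keeps values with spaces or quotes safe for sh.
    "export #{key.to_s.upcase}=#{Shellwords.escape(value.to_s)}"
  end.join("\n")
end

puts render_spark_env(
  'spark_local_ip' => 'ec2-ip-xxx.internal',
  'spark_worker_memory' => '2g'
)
```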

Data bags

You must provide a key pair (RSA or DSA) in a data bag item:

spark
  ssh_key
    type: "rsa" or "dsa"
    private_key: the private key
    public_key: the public key

You can generate a key pair with:

ssh-keygen -t dsa -f spark_key

Put the content of the spark_key.pub file in public_key, the content of the spark_key file in private_key, and set type to dsa.
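
The item can also be assembled with a short script. The sketch below builds the ssh_key item of the spark data bag from the two key files generated above and prints it as JSON. The helper name `build_ssh_key_item` is hypothetical; the "id" field is added because Chef data bag items require one.

```ruby
require 'json'

# Hypothetical helper (not part of the cookbook): builds the "ssh_key" item
# of the "spark" data bag from the key type and the two key strings.
def build_ssh_key_item(type, private_key, public_key)
  unless %w[rsa dsa].include?(type)
    raise ArgumentError, "type must be \"rsa\" or \"dsa\", got #{type.inspect}"
  end

  {
    'id' => 'ssh_key', # Chef data bag items need an "id"
    'type' => type,
    'private_key' => private_key,
    'public_key' => public_key
  }
end

# Print the item as JSON when run next to the generated key files.
if __FILE__ == $PROGRAM_NAME && File.exist?('spark_key') && File.exist?('spark_key.pub')
  item = build_ssh_key_item('dsa', File.read('spark_key'), File.read('spark_key.pub'))
  puts JSON.pretty_generate(item)
end
```

Since the cookbook reads a data bag secret path, the item is presumably expected to be encrypted; in that case the JSON file would be uploaded with knife's data bag commands together with the secret file.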

Usage

Include spark::master and/or spark::slave in your node's run_list:

{
  "run_list": [
    "recipe[spark::master]"
  ]
}

Here is a more concrete example that also configures Java and Scala:

{
  "java": {
    "install_flavor": "oracle",
    "jdk_version": "8",
    "oracle": {
      "accept_oracle_download_terms": true
    }
  },
  "scala": {
    "version": "2.10.4",
    "home": "/usr/lib/scala",
    "checksum": "b46db638c5c6066eee21f00c447fc13d1dfedbfb60d07db544e79db67ba810c3",
    "url": "http://www.scala-lang.org/files/archive/scala-2.10.4.tgz"
  },
  "spark": {
    "slaves": ["ip-xxx-xxx-xxx-xxx.eu-west-1.compute.internal"]
  },
  "run_list": [
    "recipe[spark::master]"
  ]
}

Contributing

Fork the repository on GitHub
Create a named feature branch (e.g. add-new-recipe)
Write your change
Write tests for your change (if applicable)
Run the tests, ensuring they all pass
Submit a Pull Request

License and Authors

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
