## The main site module for a Hadoop environment
#### Table of Contents
- Module Description - What the module does and why it is useful
- Setup - The basics of getting started with site_hadoop
- Usage - Configuration options and additional functionality
- Reference - An under-the-hood peek at what the module is doing and how
- Limitations - OS compatibility, etc.
- Development - Guide for contributing to the module
## Module Description
This is the main Puppet module for a Hadoop environment. It performs settings and decisions that are not meant to be in the generic Hadoop modules:
- sets up the Cloudera repository
- enables custom accounting
- enables custom bookkeeping
- installs Hadoop and its addons using roles

This module provides roles. Roles are helper classes that join together the external, Hadoop, and Hadoop addon Puppet modules. They resolve dependencies and hide the complexity of putting all the pieces together.
Puppet configured with `stringify_facts=false` is recommended (see also the `hive::schema` parameter).
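For Puppet 3.x this setting goes into puppet.conf, for example (a minimal sketch; the file location may differ by platform):

```ini
# /etc/puppet/puppet.conf (assumed path; Puppet 3.x)
[main]
stringify_facts = false
```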
Tested with:
- Debian 7/wheezy + Cloudera distribution (tested on Hadoop 2.5.0/2.6.0, CDH 5.14.0)
- Debian 8/jessie
- RHEL 6, 7 and clones
- Ubuntu 18.04
### Accounting
Several basic values are regularly measured and saved to a local MySQL database:
- disk space of each node
- disk space of each user (`/user/*`)
- number of jobs and a basic summary (like elapsed times) for each user for the last 24 hours

The last item (the 24-hour job statistics) could also be mined from the data gathered by bookkeeping (see below). Accounting is more lightweight, though: only a summary is gathered, without cloning the job metadata.
With accounting you will have basic statistics of the Hadoop cluster.
### Bookkeeping
Job metadata are regularly copied from Hadoop to a local MySQL database using the HTTP REST API of the YARN Resource Manager and the MapRed Job History Server.
Information stored:
- jobs: top-level information about submitted jobs (elapsed time, ...)
- subjobs: individual map/reduce tasks (node used, elapsed time, ...)
- job nodes: subjob information aggregated per node (number of map/reduce tasks, summary of elapsed times, ...)

With all the job metadata in the local database you will have detailed historical information about the Hadoop cluster.
## Setup
### What the cesnet-site_hadoop module affects
- Packages: Java JRE
- Files modified:
    - `/etc/apt/sources.list.d/*.list`
    - `/etc/apt/preferences.d/10_*.pref`
    - apt gpg keys
    - `/usr/local/bin/launch` (when the scripts_enable parameter is true)
    - `/usr/lib/bigtop-tomcat/lib/core-site.xml`: a link to the `/etc/hadoop/conf/core-site.xml` file, as a workaround for problems with the HDFS configuration during login in some components (HDFS HTTPFS, Oozie), for example to use the Kerberos mapping rules often needed in a Kerberos cross-realm environment

Note: Security files are NOT handled by this module. They need to be copied to the proper places.
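For illustration only, a keytab could be distributed with a plain Puppet file resource (a sketch; the source URL and ownership are assumptions, the target path matches the script default from the bookkeeping reference below):

```puppet
# hypothetical example: distribute the NameNode service keytab
file { '/etc/security/keytab/nn.service.keytab':
  ensure => file,
  owner  => 'hdfs',   # assumed owner
  group  => 'hdfs',
  mode   => '0400',
  source => 'puppet:///modules/my_site/nn.service.keytab',  # hypothetical module
}
```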
## Usage
This is a basic multi-node Hadoop cluster with addons. The Hadoop and addon modules still need to be configured:
```puppet
$clients = [
  'client.example.com',
]
$master = 'master.example.com'
$slaves = [
  'node1.example.com',
  'node2.example.com',
  'node3.example.com',
]
$zookeepers = [
  $master,
]
# set to false for initial run
$hdfs_deployed = true

class { '::hadoop':
  hdfs_hostname       => $master,
  yarn_hostname       => $master,
  slaves              => $slaves,
  frontends           => $clients,
  nfs_hostnames       => $clients,
  zookeeper_hostnames => $zookeepers,
  hdfs_deployed       => $hdfs_deployed,
}

class { '::hbase':
  hdfs_hostname       => $master,
  master_hostname     => $master,
  slaves              => $slaves,
  zookeeper_hostnames => $zookeepers,
}

class { '::hive':
  metastore_hostname  => $master,
  server2_hostname    => $master,
  zookeeper_hostnames => $zookeepers,
}

class { '::spark':
  # defaultFS is taken from the hadoop class
  historyserver_hostname => $master,
}

class { '::site_hadoop':
  users => [
    'hawking',
  ],
}

# required for hive, oozie
class { '::mysql::bindings':
  java_enable => true,
}

node 'master.example.com' {
  class { '::zookeeper':
    hostnames => $zookeepers,
  }
  include ::site_hadoop::role::master
}

node /node\d+\.example\.com/ {
  include ::site_hadoop::role::slave
}

node 'client.example.com' {
  include ::site_hadoop::role::frontend
}
```
Note: all the classes with parameters can be replaced by hiera.
Note 2: all the classes with parameters here are configuration-only classes, except the zookeeper class. Zookeeper must be declared only on the proper nodes, or hiera can be used instead.
Note 3: some parameters are taken from the other (configuration) classes. For example, the HDFS and YARN parts are optional, and they are enabled according to the ::hadoop::hdfs_hostname and ::hadoop::yarn_hostname parameters.
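For instance, the parameterized class declarations above could be expressed as hiera data instead (a minimal sketch; the hierarchy layout is site-specific):

```yaml
# hiera equivalents of some of the class parameters used above
hadoop::hdfs_hostname: 'master.example.com'
hadoop::yarn_hostname: 'master.example.com'
spark::historyserver_hostname: 'master.example.com'
mysql::bindings::java_enable: true
site_hadoop::users:
  - 'hawking'
```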
### Accounting
Accounting is already included in the "primary master" roles:
- ::site_hadoop::role::master
- ::site_hadoop::role::master_hdfs
- ::site_hadoop::role::master_ha1

It can be disabled with the accounting_enable parameter.
```puppet
class { 'hadoop':
  ...
  yarn_hostname          => 'rm.example.com',
  historyserver_hostname => 'jhs.example.com',
  ...
}

class { '::mysql::server':
  root_password => 'strongpassword',
}

mysql::db { 'accounting':
  user     => 'accounting',
  password => 'accpass',
  host     => 'localhost',
  grant    => ['SELECT', 'INSERT', 'UPDATE', 'DELETE'],
  sql      => '/usr/local/share/hadoop/accounting.sql',
}

class { 'site_hadoop::accounting':
  db_user          => 'accounting',
  db_password      => 'accpass',
  email            => '[email protected]',
  accounting_hdfs  => '0 */4 * * *',
  accounting_quota => '0 */4 * * *',
  accounting_jobs  => '10 2 * * *',
  # needs to be an empty string when not using Kerberos security
  principal        => '',
}

# site_hadoop::accounting provides the SQL import script
Class['site_hadoop::accounting'] -> Mysql::Db['accounting']

# start accounting after Hadoop startup (not strictly needed)
#Class['hadoop::namenode::service'] -> Class['site_hadoop::accounting']
```
### Bookkeeping
Bookkeeping is already included in the "primary master" roles:
- ::site_hadoop::role::master
- ::site_hadoop::role::master_hdfs
- ::site_hadoop::role::master_ha1

It can be disabled with the accounting_enable parameter.
```puppet
class { 'hadoop':
  ...
  yarn_hostname => 'rm.example.com',
  ...
}

class { 'mysql::server':
  root_password => 'strong_password',
}

mysql::db { 'bookkeeping':
  user     => 'bookkeeping',
  password => 'bkpass',
  grant    => ['SELECT', 'INSERT', 'UPDATE', 'DELETE'],
  sql      => '/usr/local/share/hadoop/bookkeeping.sql',
}

class { 'site_hadoop::bookkeeping':
  email       => '[email protected]',
  db_name     => 'bookkeeping',
  db_user     => 'bookkeeping',
  db_password => 'bkpass',
  freq        => '*/12 * * * *',
  interval    => 3600,
}

# site_hadoop::bookkeeping provides the SQL import script
Class['site_hadoop::bookkeeping'] -> Mysql::Db['bookkeeping']
```
## Reference
### Classes
site_hadoop
: The main class

site_hadoop::devel

site_hadoop::devel::hadoop
: Local post-installation steps for Hadoop for testing in Vagrant

site_hadoop::server::accounting
: Custom Hadoop accounting scripts

site_hadoop::server::bookkeeping
: Custom Hadoop bookkeeping scripts

site_hadoop::config
: Configuration of Hadoop cluster machines

site_hadoop::install
: Installation of packages required by the site_hadoop module

site_hadoop::params
: Parameters and default values for the site_hadoop module

site_hadoop::repo::cloudera
: Set up the Cloudera repository

site_hadoop::repo::bigtop
: Set up the Bigtop repository

site_hadoop::role::common
: Hadoop initialization and dependencies needed on all nodes

site_hadoop::role::frontend
: Hadoop Frontend

site_hadoop::role::frontend_ext
: Hadoop External Frontend

site_hadoop::role::ha
: Hadoop HA quorum server

site_hadoop::role::hue
: Apache Hue web interface

site_hadoop::role::master
: Hadoop Master server in a cluster without high availability

site_hadoop::role::master_ha1
: Primary Hadoop master server in a cluster with high availability

site_hadoop::role::master_ha2
: Secondary Hadoop master server in a cluster with high availability

site_hadoop::role::master_hdfs
: Hadoop master providing the HDFS Namenode in a cluster without high availability

site_hadoop::role::master_yarn
: Hadoop master providing the YARN Resourcemanager and MapRed Historyserver in a cluster without high availability

site_hadoop::role::simple
: Hadoop cluster completely on one machine

site_hadoop::role::slave
: Hadoop worker node
### site_hadoop class
#### distribution
Hadoop distribution. Default: 'cloudera'.
Values:
- bigtop: Apache Bigtop
- cloudera: Cloudera
- undef: no repository setup
#### email
Email address to send errors from cron. Default: undef.
#### key
Repository key. Default: (auto).
Used in site_hadoop::repo::cloudera.
#### mirror
Repository mirror to use. Default: 'cloudera'.
- Bigtop:
    - amazon
    - apache
- Cloudera:
    - cloudera
    - scientific
    - scientific/test
#### priority
Debian repository priority. Default: 900.
#### release
Apt release for Debian platforms. Default: undef (automatic).
#### url
Override the repository URL. Default: undef.
#### users
Accounts to create. Default: undef.
#### user_realms
Realms to add to .k5login files. Default: undef.
#### version
Hadoop distribution version to install. Default: '5' (for Cloudera), '1.2.1' (for Bigtop).
#### accounting_enable
Installs MySQL/MariaDB on the primary master node and enables accounting and bookkeeping. Default: true.
See site_hadoop::accounting and site_hadoop::bookkeeping.
#### database_setup_enable
Installs and sets up the database server and databases, if needed. Default: true.
The database is installed only if enabled by the parameters of:
- Hive: db
- Oozie: db
- site_hadoop: accounting_enable
#### hbase_enable
Deploys the Apache HBase addon. Default: true.
#### hive_enable
Deploys the Apache Hive addon. Default: true.
#### hue_enable
Deploys the Apache Hue web interface. Default: false.
#### impala_enable
Deploys the Cloudera Impala addon. Default: false.
Disabled by default because of crashes with security (IMPALA-2645).
#### nfs_frontend_enable
Launches the HDFS NFS Gateway and mounts HDFS on the frontend. Default: true.
#### nfs_yarn_enable
Launches the HDFS NFS Gateway and mounts HDFS on the YARN master. Default: false.
#### oozie_enable
Installs the Apache Oozie addon. Default: true.
It is used by Apache Hue in the workflow editor and for submitting jobs.
#### pig_enable
Installs the Apache Pig addon. Default: true.
#### scripts_enable
Also creates useful helper scripts in /usr/local. Default: true.
#### spark_enable
Deploys Apache Spark. Default: true.
#### spark_standalone_enable
Deploys a complete standalone Apache Spark cluster. Default: false.
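As a combined illustration of the parameters above (a sketch; the values are illustrative, not recommendations):

```puppet
class { '::site_hadoop':
  distribution  => 'bigtop',
  version       => '1.2.1',
  hue_enable    => true,
  impala_enable => false,
  users         => ['alice', 'bob'],  # hypothetical accounts
}
```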
### site_hadoop::accounting class
#### accounting_hdfs
Enable storing global HDFS disk and data statistics. Default: undef.
The value is a time in cron format. See `man 5 crontab`.
#### accounting_quota
Enable storing user data statistics. Default: undef.
The value is a time in cron format. See `man 5 crontab`.
#### accounting_jobs
Enable storing user job statistics. Default: undef.
The value is a time in cron format. See `man 5 crontab`.
#### db_name
Database name for statistics. Default: undef (system default is accounting).
#### db_user
Database user for statistics. Default: undef (system default is accounting).
#### db_password
Database password for statistics. Default: undef.
#### email
Email address to send errors from cron. Default: undef.
#### mapred_hostname
Hadoop Job History Node hostname for gathering user job statistics. Default: $::fqdn.
#### mapred_url
HTTP REST URL of the Hadoop Job History Node for gathering user job statistics. Default: http://mapred_hostname:19888, or https://mapred_hostname:19890 with security enabled.
It is derived from mapred_hostname and principal, but it may still need to be overridden (different hosts due to high availability, a non-default port, ...).
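For example, when the Job History Server runs on a different host, the URL can be overridden directly (a sketch; the hostname is illustrative, the elided parameters follow the accounting example above):

```puppet
class { 'site_hadoop::accounting':
  ...
  mapred_hostname => 'jhs.example.com',
  mapred_url      => 'http://jhs.example.com:19888',
  ...
}
```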
#### principal
Kerberos principal to access Hadoop. Default: undef (system default is nn/`hostname -f`).
An undef value means the default principal is used. It needs to be an empty string to disable security and not use Kerberos tickets!
### site_hadoop::bookkeeping class
#### db_name
Database name for statistics. Default: undef (system default is bookkeeping).
#### db_host
Database hostname for statistics. Default: undef (system default is the local socket).
#### db_user
Database user for statistics. Default: undef (system default is bookkeeping).
#### db_password
Database password for statistics. Default: undef (system default is an empty password).
#### email
Email address to send errors from cron. Default: undef.
#### freq
Frequency of Hadoop job metadata polling. Default: '*/10 * * * *'.
The value is a time in cron format. See `man 5 crontab`.
#### historyserver_hostname
Hadoop Job History Server hostname. Default: $::fqdn.
#### https
Enable HTTPS. Default: false.
#### interval
Interval (in seconds) to scan Hadoop. Default: undef (script default: 3600).
#### keytab
Service keytab for ticket refresh. Default: undef (script default: /etc/security/keytab/nn.service.keytab).
#### principal
Kerberos principal name for gathering metadata. Default: undef (script default: nn/`hostname -f`@REALM).
Undef means the default principal value is used.
#### realm
Kerberos realm. Default: undef.
A non-empty value enables security.
#### refresh
Ticket refresh frequency. Default: '0 */4 * * *'.
The value is a time in cron format. See `man 5 crontab`.
#### resourcemanager_hostname
Hostname of the Hadoop YARN Resource Manager. Default: $::fqdn.
#### resourcemanager_hostname2
Hostname of the second Hadoop YARN Resource Manager, used with high availability. Default: undef.
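With YARN high availability, both resource managers can be listed (a sketch; the hostnames are illustrative, the elided parameters follow the bookkeeping example above):

```puppet
class { 'site_hadoop::bookkeeping':
  ...
  resourcemanager_hostname  => 'master1.example.com',
  resourcemanager_hostname2 => 'master2.example.com',
  ...
}
```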
### site_hadoop::role::frontend
Hadoop Frontend.
Installed clients:
- Hadoop Frontend + basic packages
- HBase Frontend (optional, hbase_enable)
- Hive Frontend (optional, hive_enable)
- Pig Frontend (optional, pig_enable)
- Spark Frontend (optional, spark_enable)
- HDFS NFS Gateway (optional, nfs_frontend_enable)

Required additional parameters (a hiera sketch follows below):
- hadoop::frontends
- hadoop::nfs_hostnames
- hbase::frontends

Also add the 'nfs' user to the security.client.protocol.acl authorization (not needed by default).
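A minimal hiera sketch of the required frontend parameters (the hostname is illustrative):

```yaml
hadoop::frontends:
  - 'client.example.com'
hadoop::nfs_hostnames:
  - 'client.example.com'
hbase::frontends:
  - 'client.example.com'
```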
### site_hadoop::role::master
Hadoop Master server in a cluster without high availability.
Use case: non-HA, single master, multiple nodes.
Services:
- HDFS Namenode (optional, hadoop::hdfs_hostname)
    - initialization for Spark, HBase, Hive, ...
- HDFS NFS Gateway (optional, nfs_yarn_enable)
- YARN Resourcemanager (optional, hadoop::yarn_hostname)
- MapRed Historyserver
- HBase Master (optional, hbase_enable)
- Hive Metastore (optional, hive_enable)
- Hive Server2 (optional, hive_enable)
- Impala Catalog (optional, impala_enable)
- Impala Statestore (optional, impala_enable)
- MySQL (HDFS accounting+bookkeeping, Hive, Oozie)
- Oozie Server (optional, oozie_enable)
- Spark Master (optional, spark_standalone_enable)
- Spark Historyserver (optional, spark_enable)
- Zookeeper

Requires many parameters (hostnames for each service, ...).
MariaDB/MySQL database is supported. To set it up and also use it for the Hive, Hue, and Oozie addons, add these parameters:

```yaml
hive::db: mysql
hive::db_password: HIVE_DB_PASSWORD
hue::db: mysql
hue::db_password: HUE_DB_PASSWORD
mysql::bindings::java_enable: true
oozie::db: mysql
oozie::db_password: OOZIE_DB_PASSWORD
```
### site_hadoop::role::hue
Apache Hue web interface.
Services:
- Hadoop HTTPFS (in case of HDFS HA)
- Hue
- MySQL

Required additional parameters (a hiera sketch follows below):
- hadoop::hue_hostnames
- hadoop::httpfs_hostnames
- hadoop::oozie_hostnames
- hue::hdfs_hostname or hue::defaultFS
- hue::httpfs_hostname
- hue::oozie_hostname
- mysql::bindings::java_enable: true
- oozie::hue_hostnames
Keep oozie enabled as well.
Also add the 'hue' user and the 'oozie' group to the security.client.protocol.acl authorization (not needed by default).
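A minimal hiera sketch of the required parameters listed above (the hostnames are illustrative):

```yaml
hadoop::hue_hostnames:
  - 'hue.example.com'
hadoop::httpfs_hostnames:
  - 'hue.example.com'
hadoop::oozie_hostnames:
  - 'master.example.com'
hue::hdfs_hostname: 'master.example.com'
hue::httpfs_hostname: 'hue.example.com'
hue::oozie_hostname: 'master.example.com'
mysql::bindings::java_enable: true
oozie::hue_hostnames:
  - 'hue.example.com'
```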
MariaDB/MySQL database is supported. To set it up and use it for the Hive, Hue, and Oozie addons, add these parameters:

```yaml
hive::db: mysql
hive::db_password: HIVE_DB_PASSWORD
hue::db: mysql
hue::db_password: HUE_DB_PASSWORD
oozie::db: mysql
oozie::db_password: OOZIE_DB_PASSWORD
```
### site_hadoop::role::slave
Hadoop worker node.
Services:
- HDFS Datanode
- YARN Nodemanager (optional, yarn_enable)
- HBase Regionserver (optional, hbase_enable)
- Impala Server (optional, impala_enable)
- Spark Worker (optional, spark_standalone_enable)

Requires many parameters (hostnames for each service, ...).
## Limitations
To avoid Puppet dependency hell, some packages are installed in the setup stage.
## Development
- Repository: https://github.com/MetaCenterCloudPuppet/cesnet-site_hadoop
- Testing:
    - basic: see .travis.yml
    - vagrant: https://github.com/MetaCenterCloudPuppet/hadoop-tests