####Table of Contents
- Module Description - What the module does and why it is useful
- Setup - The basics of getting started with Hive
- Usage - Configuration options and additional functionality
- Reference - An under-the-hood peek at what the module is doing and how
- Limitations - OS compatibility, etc.
- Development - Guide for contributing to the module
This module installs and sets up the Apache Hive data warehouse software running on top of a Hadoop cluster. Hive services can be collocated with, or separated from, other services in the cluster. Optionally, Kerberos-based security can be enabled; security should be enabled if the Hadoop cluster security is enabled.
A Puppet client configured with stringify_facts=false is recommended, but not required (see also the schema_file parameter).
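For reference, a minimal puppet.conf sketch of that setting (the path /etc/puppet/puppet.conf is an assumption and may differ per distribution and Puppet version):

```ini
# /etc/puppet/puppet.conf (path is an assumption; may differ per distribution)
[main]
stringify_facts = false
```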
Tested with:
- Debian 7/wheezy, 8/jessie: Cloudera distribution (tested on Hive 0.13.1, 2.1.1)
- RHEL 6 and clones: Cloudera distribution (tested with Hadoop 2.6.0)
###What cesnet-hive module affects
- Packages: installs Hive packages (common packages, subsets for requested services, hcatalog, and/or hive client)
- Files modified:
- /etc/hive/* (or /etc/hive/conf/*)
- /usr/local/sbin/hivemanager (only when the administrator manager script is requested via the features parameter)
- Alternatives:
- alternatives are used for /etc/hive/conf in Cloudera
- this module switches to the new alternative by default, so the Cloudera original configuration can be kept intact
- Services: only requested Hive services are setup and started
- metastore
- server2
- Helper Files:
- /var/lib/hadoop-hdfs/.puppet-hive-dir-created (created by cesnet-hadoop module)
- Secret Files (keytabs): permissions are modified for hive service keytab (/etc/security/keytab/hive.service.keytab)
- Facts: hive_schemas (stringify_facts=false is needed when using this fact)
- Databases: for supported databases (when not disabled): user created and database schema imported using puppetlabs modules
There are several known or intended limitations in this module.
Be aware of:
- Repositories: see the cesnet-hadoop module Setup Requirements for details
- No inter-node dependencies: a running HDFS namenode is required for Hive metastore server startup
- Secure mode: keytabs must be prepared in /etc/security/keytab/ (see the realm parameter)
- Database setup: MariaDB/MySQL and PostgreSQL are supported. You need to install the puppetlabs-mysql or puppetlabs-postgresql module yourself, because they are not listed as dependencies.
- Hadoop: it should be configured locally, or you should use the hdfs_hostname parameter (see Module Parameters)
Let's start with basic examples.
Example: the simplest setup, without security or zookeeper, with everything on a single machine:
```puppet
class { 'hive':
  hdfs_hostname      => $::fqdn,
  metastore_hostname => $::fqdn,
  server2_hostname   => $::fqdn,
}
```
```puppet
node <HDFS_NAMENODE> {
  # HDFS initialization must be done on the namenode
  # (or /user/hive on HDFS must be created)
  include hive::hdfs
}

node default {
  # server
  include ::hive::metastore
  include ::hive::server2

  # client
  include ::hive::frontend
  include ::hive::hcatalog

  # worker nodes
  include ::hive::worker
}
```
Modify $::fqdn and the node sections as needed.
We recommend:
- using zookeeper and setting the hive parameter zookeeper_hostnames (the cesnet-zookeeper module can be used to install zookeeper)
- if collocated with the HDFS namenode, adding the dependency Class['hadoop::namenode::service'] -> Class['hive::metastore::service']
- if not collocated, having the HDFS namenode running first, or restarting the Hive metastore later
- using the hadoop class plus some other component (or the hadoop::common::config class); see the hdfs_hostname parameter
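The recommendations above can be sketched like this (the zookeeper hostnames are placeholder assumptions, adjust them to your site):

```puppet
# Sketch only: the zookeeper hostnames below are placeholders.
class { 'hive':
  hdfs_hostname       => $::fqdn,
  metastore_hostname  => $::fqdn,
  server2_hostname    => $::fqdn,
  zookeeper_hostnames => ['zk1.example.com', 'zk2.example.com', 'zk3.example.com'],
}

# When collocated with the HDFS namenode:
Class['hadoop::namenode::service'] -> Class['hive::metastore::service']
```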
It is highly recommended to use a real database backend instead of Derby. Security can also be enabled.
Hive is used together with other components in the roles of the cesnet::site_hadoop puppet module. Alternatively, the examples below show how to use the hive puppet module directly:
Example 1: setup with security.
Additional permissions are needed in the Hadoop cluster: add the hive proxy user.
```puppet
class { 'hadoop':
  ...
  properties => {
    'hadoop.proxyuser.hive.groups' => 'hive,impala,oozie,users',
    'hadoop.proxyuser.hive.hosts'  => '*',
  },
  ...
}
```
```puppet
class { 'hive':
  group              => 'users',
  metastore_hostname => $::fqdn,
  realm              => 'MY.REALM',
}
```
Use the node sections from the initial example, modifying $::fqdn and the node sections as needed.
Example 2: MySQL database; the puppetlabs-mysql puppet module must be installed.
Add this to the initial example:
```puppet
class { 'hive':
  ...
  db          => 'mysql',
  #db          => 'mariadb',
  db_password => 'hivepassword',
}
```
```puppet
node default {
  ...
  class { 'mysql::server':
    root_password => 'strongpassword',
  }
  class { 'mysql::bindings':
    java_enable => true,
    #java_package_name => 'libmariadb-java',
  }
}
```
The database is created in the hive::metastore::db class (included from hive::metastore).
Example 3: PostgreSQL database; the puppetlabs-postgresql puppet module must be installed.
Add this to the initial example:
```puppet
class { 'hive':
  ...
  db          => 'postgresql',
  db_password => 'hivepassword',
}
```
```puppet
node default {
  ...
  class { 'postgresql::server':
    postgres_password => 'strongpassword',
  }
  include postgresql::lib::java
  ...
}
```
Security in Hadoop (and Hive) is based on Kerberos. Keytab files need to be placed in the proper locations before enabling security.
The following parameters are used for security (see also the hive class):
- realm
  Enables security and specifies the Kerberos realm to use. An empty string disables security. To enable security, the following are required:
  - an installed Kerberos client (Debian: krb5-user/heimdal-clients; RedHat: krb5-workstation)
  - a configured Kerberos client (/etc/krb5.conf, /etc/krb5.keytab)
  - /etc/security/keytab/hive.service.keytab (on all server nodes)
- sentry_hostname
  Enables use of the Sentry authorization service. When not specified, Hive server2 impersonation is enabled and authorization works using HDFS permissions.
####Impersonation
Authorization by impersonation of the user. Used when sentry_hostname is not specified.
Hadoop needs to have the proxyuser enabled for it:

```
# 'users' is the group from the *group* parameter
hadoop.proxyuser.hive.groups => 'hive,users'
hadoop.proxyuser.hive.hosts  => '*'
```
Users need to have access to the warehouse directory. The group is set to 'users' by default. Other add-ons (like Impala) need to be in the users group too!
Another option is to add users to the hive group and use that group instead (simpler, but less secure).
####Sentry
Authorization by Sentry. Used when sentry_hostname is specified.
Hive itself runs under the 'hive' user. Both Hadoop and Hive must have security enabled.
The warehouse directory must have 'hive' group ownership; this is set by the puppet module by default.
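A minimal sketch of enabling Sentry-based authorization (the Sentry hostname is an assumption; see the sentry_hostname parameter):

```puppet
# Sketch only: 'sentry.example.com' is a placeholder for your Sentry service host.
class { 'hive':
  metastore_hostname => $::fqdn,
  realm              => 'MY.REALM',
  sentry_hostname    => 'sentry.example.com',
}
```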
Multihome is supported by Hive out-of-the-box.
<a name="defaultfs"></a>
###Changing defaultFS (converting non-HA cluster, ...)
Changing defaultFS may be needed when, for example:
- changing the Hadoop cluster name
- starting to use the cluster name because of converting a non-HA cluster to High Availability
Existing objects in the Hive schema still use URLs with the previous defaultFS and need to be converted.
Getting the old URL:

```
hive --service metatool -listFSRoot 2>/dev/null
```
Convert (you can do a test run first using --dryRun):

```
OLD_URL="hdfs://NAMENODE_HOSTNAME:8020"
NEW_URL="hdfs://CLUSTER_NAME"
hive --service metatool -updateLocation ${NEW_URL} ${OLD_URL} --dryRun
hive --service metatool -updateLocation ${NEW_URL} ${OLD_URL}
```
###Cluster with multiple HDFS namenodes
If multiple HDFS namenodes are used in the Hadoop cluster (high availability, namespaces, ...), the 'hive' system user must exist on all of them for authorization to work properly. You could install the full Hive client (using hive::frontend::install), but just creating the user is enough (using hive::user).
Note that the hive::hdfs class must be used too, but only on one of the HDFS namenodes; it includes hive::user.
Example:
```puppet
node <HDFS_NAMENODE> {
  include hive::hdfs
}

node <HDFS_OTHER_NAMENODE> {
  include hive::user
}
```
The best way is to refresh the configuration from the new original (i.e. remove the old one) and re-run puppet on top of it. The metastore schema also needs to be upgraded using schematool or the upgrade scripts in /usr/lib/hive/scripts/metastore/upgrade/DATABASE/.
For example (using MySQL, upgrading from Hive 0.13.0):

```shell
alternative='cluster'
d='hive'
mv /etc/${d}/conf.${alternative} /etc/${d}/conf.cdhXXX
update-alternatives --auto ${d}-conf

# upgrade
...

# metadata schema upgrade
mysqldump --opt metastore > metastore-backup.sql
mysqldump --skip-add-drop-table --no-data metastore > my-schema-backup.mysql.sql
/usr/lib/hive/bin/schematool -dbType mysql -upgradeSchemaFrom 0.13.0 -userName root -passWord MYSQL_ROOT_PASSWORD

puppet agent --test
#or: puppet apply ...
```
hive
: The main configuration class for Apache Hive

hive::hbase
: Client support for HBase

hive::hdfs
: HDFS initializations

hive::params

hive::service

- common:
  - hive::common::config
  - hive::common::daemon
  - hive::common::postinstall

hive::frontend
: Hive client

hive::frontend::config

hive::frontend::install

hive::hcatalog
: Hive HCatalog client

hive::hcatalog::config

hive::hcatalog::install

hive::metastore
: Hive metastore

hive::metastore::config

hive::metastore::install

hive::metastore::db

hive::metastore::service

hive::server2
: Hive server

hive::server2::config

hive::server2::install

hive::server2::service

hive::user
: Creates the hive system user, if needed

hive::worker
: Hive support on the worker node
hive_schemas
: database schema file for each database backend
####confdir
Hive config directory. Default: '/etc/hive/conf' or '/etc/hive'.
####group
Hive group on HDFS. Default: 'users' (without sentry), 'hive' (with sentry).
For Hive impersonation (without Sentry), all users are expected to belong to the specified group.
The group is not updated when this parameter changes; when changing it, remove the /var/lib/hadoop-hdfs/.puppet-hive-dir-created file, or update the group of /user/hive on HDFS manually.
####hdfs_hostname
HDFS hostname (or defaultFS value), if different from core-site.xml Hadoop file. Default: undef.
It is recommended to use the core-site.xml file instead; it is created when installing any Hadoop component or when the hadoop::common::config class is included.
####keytab
Hive keytab file. Default: '/etc/security/keytab/hive.service.keytab'.
Only used with security (realm parameter).
####keytab_source
Puppet source for keytab file. Default: undef.
When specified, the Hive keytab file is created from this puppet source. Otherwise only permissions are set on the keytab file.
Only used with security (realm parameter).
####metastore_hostname
Hostname of the metastore server. Default: undef.
When specified, remote mode is activated (recommended).
####principal
Hive Kerberos principal. Default: '::default' (="hive/_HOST@${hive::realm}").
####sentry_hostname
Hostname of the (external) Sentry service. Default: undef.
Non-empty value will enable Hive settings needed to use Sentry authorization service.
When Sentry is enabled, the hive user also needs to be added to allowed.system.users in the Hadoop YARN containers.
####server2_hostname
Hostname of the Hive server. Default: undef.
Used only for hivemanager script.
####zookeeper_hostnames
Array of zookeeper hostnames quorum. Default: undef.
Used for lock management (recommended).
####zookeeper_port
Zookeeper port, if different from the default (2181). Default: undef.
####realm
Kerberos realm. Default: ''.
Empty string disables the security.
When security is enabled, you also need either the Sentry service (sentry_hostname parameter) or proxyuser properties added to the Hadoop cluster for Hive impersonation. See Enable Security.
####properties
Additional properties. Default: undef.
####descriptions
Descriptions for the additional properties. Default: undef.
####alternatives
Switches the alternatives used for the configuration. Default: 'cluster' (Debian) or undef.
Use it only when supported (for example with Cloudera distribution).
####database_setup_enable
Enables database setup (if supported). Default: true.
####db
Database behind the metastore. Default: undef.
The default is the embedded database (Derby), but it is recommended to use a proper database.
Values:
- derby (default): embedded database
- mysql: MySQL/MariaDB,
- postgresql: PostgreSQL
####db_host
Database hostname for mysql, postgresql, and oracle. Default: 'localhost'.
It can be overridden by javax.jdo.option.ConnectionURL property.
####db_name
Database name for mysql and postgresql. Default: 'metastore'.
For oracle 'xe' schema is used. Can be overridden by javax.jdo.option.ConnectionURL property.
####db_user
Database user for mysql, postgresql, and oracle. Default: 'hive'.
####db_password
Database password for mysql, postgresql, and oracle. Default: undef.
####features
Enable additional features. Default: {}.
Values:
- manager: script in /usr/local to start/stop the Hive daemons relevant for the given node
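A sketch of enabling the manager feature (the hash format below, feature name mapped to a boolean, is an assumption based on the {} default):

```puppet
# Assumption: features is a hash of feature names to boolean switches.
class { 'hive':
  server2_hostname => $::fqdn,
  features         => {
    'manager' => true,
  },
}
```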
####schema_dir
Hive directory with database schemas. Default: undef (/usr/lib/hive/scripts/metastore/upgrade).
####schema_file
Hive database schema file. Default: undef (autodetect).
Autodetection requires puppet configured with stringify_facts=false. Alternatively, the value can be set directly (for example hive-schema-2.1.1.mysql.sql).
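A sketch of setting the schema file explicitly (the filename is the example from above; which file applies depends on your Hive version and database backend):

```puppet
# Sketch only: pick the schema file matching your Hive version and db backend.
class { 'hive':
  db          => 'mysql',
  schema_file => 'hive-schema-2.1.1.mysql.sql',
}
```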
The idea of this module is to do only one thing (set up the Hive software) and not to limit its generic usage by doing other stuff. You can have your own repository with the Hadoop software, and you can select which Kerberos implementation or Java version to use.
On the other hand, this leads to some limitations, as mentioned in the Setup Requirements section, and usage is more complicated: you may need a site-specific puppet module together with this one, like cesnet-site_hadoop.
For the databases, the puppetlabs-mysql and puppetlabs-postgresql modules are used, but they are not listed as dependencies. You can disable the database setup altogether with the database_setup_enable parameter.
- Repository: https://github.com/MetaCenterCloudPuppet/cesnet-hive
- Tests:
- basic: see .travis.yml
- vagrant: https://github.com/MetaCenterCloudPuppet/hadoop-tests