
ETL driver

Sri Harsha Boda edited this page Sep 15, 2017 · 1 revision

Description

ETLDriver is an API for importing data from files into Hive tables. It has three main stages:

1. File2Raw: Loads data from a file into a Raw table in Hive.

2. Raw2Stage: Creates a temporary Stage table with the same schema as the Base table and inserts data from the Raw table into the Stage table.

3. Stage2Base: Moves the data from the Stage table to the Base table, completing the file -> table process.

ETLDriver uses the FileIngester API to fetch the files from the remote server to the Hadoop cluster. FileIngester inserts records into the Batch table and the File table to enqueue the process. To load the data, ETLDriver uses a Raw table, a view, a Stage table and a Base table.

Process workflow

The following flow diagram summarizes the workflow of an ETLDriver process:

[Figure: f2sflow.png]

File2Raw

Reads the Hive table schemas from Hive_table for the given process id and creates the Raw table, Raw view and Base table. Invokes the ListOfFiles API to get the files and their corresponding batch ids for the given process, then inserts the data into the Hive table, partitioned by the batch ids returned by ListOfFiles.
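The load into per-batch partitions can be pictured with a small sketch. This is not the ETLDriver implementation: the class name `File2RawSketch`, the table name `raw_tbl` and the partition column name `batch_id` are assumptions made for illustration. It only builds a HiveQL `LOAD DATA` statement for one file and one batch id.

```java
// Hypothetical sketch: builds the HiveQL statement that would load one file
// into the Raw table partition for its batch id. Names are illustrative only.
class File2RawSketch {

    // rawTable and the partition column "batch_id" are assumptions, not taken from BDRE.
    // Note: the path is embedded as-is, so paths containing a single quote are not handled.
    static String loadStatement(String rawTable, long batchId, String hdfsPath) {
        return "LOAD DATA INPATH '" + hdfsPath + "' INTO TABLE " + rawTable
                + " PARTITION (batch_id='" + batchId + "')";
    }

    public static void main(String[] args) {
        // Values reused from the File2Raw invocation example on this page.
        System.out.println(loadStatement("raw_tbl", 3025L, "/home/dropuser/data/file.csv"));
    }
}
```

One statement is produced per file, so every batch returned by ListOfFiles ends up in its own partition of the Raw table.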

How to Invoke

INPUT parameters :

process-id : The process id of the entire File-to-stage sub-process.

list-of-files : The list of all files whose data needs to be inserted into the table.
BeginProcess returns a min-batch-id and a max-batch-id. All files belonging to batches whose batch_id lies between min-batch-id and max-batch-id are returned as a list. Each row of the list has the format batch_id,server_id,path,file_size,file_hash,creation_ts. Rows are separated by semicolons and columns by commas.

environment : The configuration environment id defined in im-config.xml (passed as -env in the command below).
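The list-of-files format described above can be parsed as in the following minimal sketch. The class and field names (`ListOfFilesParser`, `FileEntry`) are hypothetical and not part of the ETLDriver API. The sketch assumes that paths contain neither commas nor semicolons.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: parses "batch_id,server_id,path,file_size,file_hash,creation_ts"
// rows separated by semicolons. Not part of the ETLDriver API.
class ListOfFilesParser {

    static final class FileEntry {
        final long batchId;
        final int serverId;
        final String path;
        final long fileSize;
        final String fileHash;
        final String creationTs;

        FileEntry(long batchId, int serverId, String path,
                  long fileSize, String fileHash, String creationTs) {
            this.batchId = batchId;
            this.serverId = serverId;
            this.path = path;
            this.fileSize = fileSize;
            this.fileHash = fileHash;
            this.creationTs = creationTs;
        }
    }

    static List<FileEntry> parse(String listOfFiles) {
        List<FileEntry> entries = new ArrayList<>();
        for (String row : listOfFiles.split(";")) {
            if (row.trim().isEmpty()) {
                continue; // tolerate blank rows, e.g. around a stray semicolon
            }
            String[] cols = row.split(",", -1);
            if (cols.length != 6) {
                throw new IllegalArgumentException(
                        "Expected 6 columns but got " + cols.length + " in row: " + row);
            }
            entries.add(new FileEntry(
                    Long.parseLong(cols[0].trim()),
                    Integer.parseInt(cols[1].trim()),
                    cols[2].trim(),
                    Long.parseLong(cols[3].trim()),
                    cols[4].trim(),
                    cols[5].trim()));
        }
        return entries;
    }

    public static void main(String[] args) {
        String arg = "3025,1,/home/dropuser/data/file.csv,39844,Z433uuYzddsl323,2014-10-10 09:30:45";
        for (FileEntry e : parse(arg)) {
            System.out.println(e.batchId + " -> " + e.path + " (" + e.fileSize + " bytes)");
        }
    }
}
```

Because creation_ts contains a space, the whole list has to reach the program as a single argument, which in most shells means quoting it.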

The following statement demonstrates how to execute the File2Raw step:

 java com.wipro.ats.bdre.im.etl.api.oozie.OozieFile2Raw -process-id 210 -list-of-files "3025,1,/home/dropuser/data/file.csv,39844,Z433uuYzddsl323,2014-10-10 09:30:45" -env env2

Raw2Stage

Creates the temporary Stage table, partitioned by process run id.

Using the view created in File2Raw, reads the data from the Raw table for the last processed batches and inserts the records into the Stage table.
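As a rough illustration of this step, the sketch below builds a HiveQL `INSERT OVERWRITE` statement that selects the rows of a batch range from the view and writes them into one Stage partition. It is a sketch under stated assumptions, not the ETLDriver code: the class name `Raw2StageSketch`, the table and view names, the column list, and the column names `process_run_id` and `batch_id` are all hypothetical.

```java
// Hypothetical sketch of the Raw2Stage insert. All table, view and column
// names are illustrative assumptions; only the HiveQL syntax is standard.
class Raw2StageSketch {

    static String insertIntoStage(String stageTable, String rawView, String columns,
                                  long processRunId, long minBatchId, long maxBatchId) {
        return "INSERT OVERWRITE TABLE " + stageTable
                // static partition keyed by the process run id
                + " PARTITION (process_run_id='" + processRunId + "')"
                + " SELECT " + columns
                + " FROM " + rawView
                // only the batches handled by this run
                + " WHERE batch_id BETWEEN " + minBatchId + " AND " + maxBatchId;
    }

    public static void main(String[] args) {
        // Values reused from the Raw2Stage invocation example on this page.
        System.out.println(insertIntoStage("stage_tbl", "raw_view", "col1, col2", 125, 10, 11));
    }
}
```

Filtering on the batch id range keeps the statement limited to the batches between min-batch-id and max-batch-id, so earlier batches still present in the Raw table are not copied again.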

INPUT parameters :

process-id

process-run-id

min-batch-id

max-batch-id

environment

       java com.wipro.ats.bdre.im.etl.api.oozie.OozieRaw2Stage -process-id 210 -process-run-id 125 -min-batch-id 10 -max-batch-id 11 -env env2

Stage2Base

Moves the Stage table partition files from their HDFS directory to the Base table HDFS directory, then drops the Stage table.
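The move-then-drop sequence can be sketched as follows. To stay self-contained, this example moves files on the local filesystem with `java.nio.file` as a stand-in for HDFS; the actual step works on HDFS directories. The class name `Stage2BaseSketch`, the directory layout and the file name `000000_0` are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch: local-filesystem stand-in for the HDFS move in Stage2Base.
class Stage2BaseSketch {

    // Moves every entry of the Stage partition directory into the Base directory
    // and returns how many entries were moved.
    static int movePartitionFiles(Path stagePartitionDir, Path basePartitionDir) {
        try {
            Files.createDirectories(basePartitionDir);
            List<Path> files;
            try (Stream<Path> listing = Files.list(stagePartitionDir)) {
                // Collect first so the directory is not modified while it is being listed.
                files = listing.collect(Collectors.toList());
            }
            for (Path file : files) {
                Files.move(file, basePartitionDir.resolve(file.getFileName()),
                        StandardCopyOption.REPLACE_EXISTING);
            }
            return files.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // HiveQL to drop the temporary Stage table once its files have been moved.
    static String dropStageStatement(String stageTable) {
        return "DROP TABLE IF EXISTS " + stageTable;
    }

    // Runs the move on temporary directories; true if the file ended up under base.
    static boolean demo() {
        try {
            Path root = Files.createTempDirectory("stage2base");
            Path stage = Files.createDirectories(root.resolve("stage").resolve("process_run_id=125"));
            Path base = root.resolve("base").resolve("process_run_id=125");
            Files.write(stage.resolve("000000_0"), "row1\n".getBytes(StandardCharsets.UTF_8));
            int moved = movePartitionFiles(stage, base);
            return moved == 1
                    && Files.exists(base.resolve("000000_0"))
                    && !Files.exists(stage.resolve("000000_0"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("moved ok: " + demo());
        System.out.println(dropStageStatement("stage_tbl"));
    }
}
```

Moving files rather than re-inserting rows avoids rewriting data that is already stored in the Stage partition.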

How to Invoke

The following are the input parameters for Stage2Base:

process-id

process-run-id

environment

        java com.wipro.ats.bdre.im.etl.api.oozie.OoziePre2Core -process-id 210 -process-run-id 125 -env env2