ETL driver
ETLDriver is an API that imports data from files into Hive tables. It has three main stages:
1. File2Raw: loads data from a file into a Raw table in Hive.
2. Raw2Stage: creates a temporary Stage table with the same schema as the Base table and inserts data from the Raw table into the Stage table.
3. Stage2Base: moves the data from the Stage table to the Base table, completing the file -> table process.
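Taken together, the three stages can be chained from a driver script. The sketch below strings together the Oozie wrapper invocations shown elsewhere on this page; the `RUN_CMD` indirection is illustrative (it echoes the command instead of launching the JVM), and all ids, paths, and batch numbers are example values:

```shell
# Sketch: chain the three ETLDriver stages for one process run.
# RUN_CMD echoes instead of launching the JVM; replace 'echo java'
# with 'java' (with the BDRE jars on the classpath) on a real cluster.
RUN_CMD="echo java"

PROCESS_ID=210
PROCESS_RUN_ID=125
ENV=env2

# 1. File2Raw: load the listed files into the Raw table
$RUN_CMD com.wipro.ats.bdre.im.etl.api.oozie.OozieFile2Raw \
  -process-id $PROCESS_ID \
  -list-of-files "3025,1,/home/dropuser/data/file.csv,39844,Z433uuYzddsl323,2014-10-10 09:30:45" \
  -env $ENV

# 2. Raw2Stage: copy the last processed batches into the Stage table
$RUN_CMD com.wipro.ats.bdre.im.etl.api.oozie.OozieRaw2Stage \
  -process-id $PROCESS_ID -process-run-id $PROCESS_RUN_ID \
  -min-batch-id 10 -max-batch-id 11 -env $ENV

# 3. Stage2Base: move Stage partitions into the Base table
$RUN_CMD com.wipro.ats.bdre.im.etl.api.oozie.OoziePre2Core \
  -process-id $PROCESS_ID -process-run-id $PROCESS_RUN_ID -env $ENV
```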
This uses the FileIngester API to bring files from the remote server to the Hadoop cluster, inserting records into the Batch table and File table to enqueue the process. It uses a Raw table, a view, a Stage table, and a Base table to land the data.
The flow diagram f2sflow.png summarizes the workflow of an ETLDriver process.
File2Raw reads the Hive table schema from Hive_table for the given process id and creates the Raw table, Raw view, and Base table. It invokes the ListOfFiles API to get the files and their corresponding batch ids for the given process, then inserts the data into the Hive table, partitioned by the batch ids returned by ListOfFiles.
INPUT parameters:
process-id : The process id of the entire File-to-stage sub-process.
list-of-files : List of all files whose data needs to be inserted into the table.
BeginProcess returns a min-batch-id and a max-batch-id; all files belonging to batches whose batch_id lies between this min-batch-id and max-batch-id are returned as a list.
The format of the list of files is batch_id,server_id,path,file_size,file_hash,creation_ts; rows are separated by semicolons and columns by commas.
environment : The configuration environment id defined in im-config.xml.
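The list-of-files format above can be split with standard tools. The sketch below parses a two-row sample (file names and hashes are made-up example values) into its rows and columns:

```shell
# Sketch: split a list-of-files string into rows and columns.
# Per the documented format: rows are ';'-separated, columns ','-separated.
LIST='3025,1,/home/dropuser/data/file.csv,39844,Z433uuYzddsl323,2014-10-10 09:30:45;3026,1,/home/dropuser/data/file2.csv,40000,A1b2C3d4,2014-10-11 09:30:45'

# Print the batch_id and path of every file (columns 1 and 3)
echo "$LIST" | tr ';' '\n' | awk -F',' '{ print $1, $3 }'
```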
The following statement demonstrates how to execute the File2Raw step:
java com.wipro.ats.bdre.im.etl.api.oozie.OozieFile2Raw -process-id 210 -list-of-files "3025,1,/home/dropuser/data/file.csv,39844,Z433uuYzddsl323,2014-10-10 09:30:45" -env env2
Raw2Stage creates the temporary Stage table, partitioned by process run id. Using the view created in File2Raw, it reads the data for the last processed batches from the Raw table and inserts the records into the Stage table.
INPUT parameters:
process-id
process-run-id
min-batch-id
max-batch-id
environment
java com.wipro.ats.bdre.im.etl.api.oozie.OozieRaw2Stage -process-id 210 -process-run-id 125 -min-batch-id 10 -max-batch-id 11 -env env2
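Conceptually, this step issues HiveQL along the lines of the sketch below. The table, view, and column names (stage_table, raw_view, batch_id, process_run_id) are illustrative, not BDRE's actual generated DDL:

```shell
# Sketch of the HiveQL that Raw2Stage conceptually issues: read the
# last processed batches through the view over the Raw table and land
# them in the Stage partition for this run. Names are illustrative.
PROCESS_RUN_ID=125
MIN_BATCH_ID=10
MAX_BATCH_ID=11

HQL="INSERT INTO TABLE stage_table PARTITION (process_run_id=${PROCESS_RUN_ID})
SELECT * FROM raw_view
WHERE batch_id BETWEEN ${MIN_BATCH_ID} AND ${MAX_BATCH_ID};"

echo "$HQL"   # on a real cluster this would be handed to Hive
```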
Stage2Base moves the Stage table partition files located in the HDFS directory to the Base table HDFS directory, then drops the Stage table.
The following are the input parameters for Stage2Base:
process-id
process-run-id
environment
java com.wipro.ats.bdre.im.etl.api.oozie.OoziePre2Core -process-id 210 -process-run-id 125 -env env2
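The partition move described above is a directory-level rename. The sketch below is a local-filesystem analogue: on a real cluster the moves would be `hdfs dfs -mv` operations under the warehouse directory, and all paths and partition names here are illustrative:

```shell
# Local-filesystem analogue of the Stage2Base partition move
# (on a real cluster: 'hdfs dfs -mv'; all paths are illustrative).
WORK=$(mktemp -d)
mkdir -p "$WORK/stage_table/process-run-id=125" "$WORK/base_table"
echo "row1" > "$WORK/stage_table/process-run-id=125/part-00000"

# Move the whole partition directory under the Base table location...
mv "$WORK/stage_table/process-run-id=125" "$WORK/base_table/"
# ...then drop what is left of the Stage table
rm -rf "$WORK/stage_table"
```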