file monitor

Sri Harsha Boda edited this page Sep 15, 2017 · 1 revision

File Monitor

Description

File Monitor is an API in BDRE that monitors incoming files and processes the file events that occur in a particular folder. It continuously monitors the folders defined in the configuration. When a new file is created and its name matches the pattern defined in the configuration, the file is registered in metadata management and a new batch is enqueued for the given process.

How it works

The directories and the file-matching patterns have to be configured in the im-config.xml file. A thread continually monitors the directories defined in the configuration file. When a new file is created inside a monitored directory, a fileCreated event is triggered. The API then validates the file name against the configured regular expression. If the name matches the pattern, the RegisterFile API is called to register the file in metadata.
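
The filtering step described above can be sketched as follows. This is a minimal illustration, not BDRE's implementation: the class name `FileEventFilterSketch` and the method `newMatchingFiles` are hypothetical, new files are detected by comparing two directory listings, and the sketch assumes the whole file name must match the pattern.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical sketch; names are illustrative and not taken from BDRE.
class FileEventFilterSketch {

    // Returns the names that are in 'current' but not in 'previous'
    // and whose whole name matches 'filter' (assumption: full match).
    static Set<String> newMatchingFiles(Set<String> previous, Set<String> current, Pattern filter) {
        Set<String> result = new LinkedHashSet<>();
        for (String name : current) {
            boolean isNew = !previous.contains(name);
            if (isNew && filter.matcher(name).matches()) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Pattern filter = Pattern.compile("merchant\\s[0-9]");
        Set<String> before = new LinkedHashSet<>(Arrays.asList("merchant 1"));
        Set<String> after = new LinkedHashSet<>(Arrays.asList("merchant 1", "merchant 2", "notes.txt"));
        // Only the new file that matches the filter would be passed on for registration.
        System.out.println(newMatchingFiles(before, after, filter)); // prints [merchant 2]
    }
}
```

Keeping the comparison and the pattern check in one pure method makes the rule easy to test without touching the file system.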

Registering the file in Metadata

A new entry is inserted into the batch table with a new batch id.

            mysql> select * from batch;
            +----------+-----------------------+------------+
            | batch_id | source_process_run_id | batch_type |
            +----------+-----------------------+------------+
            |     3053 |                  NULL | file       |
            |     3056 |                  NULL | file       |  
            +----------+-----------------------+------------+

A corresponding row is created in the process_batch_queue table so that the batch is enqueued for the sub process.

            mysql> select * from process_batch_queue;
            +-----------------+-----------------+----------+---------------------+-------------------+----------+--------+-------------+---------------+------------+
            | source_batch_id | target_batch_id | queue_id | insert_ts           | source_process_id | start_ts | end_ts | batch_state | batch_marking | process_id |
            +-----------------+-----------------+----------+---------------------+-------------------+----------+--------+-------------+---------------+------------+
            |            3053 |            NULL |       37 | 2015-01-05 22:43:01 |               140 | NULL     | NULL   |           0 | 2014-12-16    |        186 |
            |            3056 |            NULL |       38 | 2015-01-05 22:43:01 |               140 | NULL     | NULL   |           0 | 2014-12-16    |        186 |
            +-----------------+-----------------+----------+---------------------+-------------------+----------+--------+-------------+---------------+------------+
            2 rows in set (0.00 sec)

A new entry is added to the file table with the file path, file hash, file size, creation time and the batch id above:

          mysql> select * from file;
          +----------+-----------+------------------------------+-----------+----------------------------------+---------------------+
          | batch_id | server_id | path                         | file_size | file_hash                        | creation_ts         |
          +----------+-----------+------------------------------+-----------+----------------------------------+---------------------+
          |     3053 |         1 | /home/cloudera/oozies/merc 3 |      1209 | ce7c2be23dde29677e1bf85e1e6b3137 | 2015-01-13 04:44:53 |
          |     3056 |         1 | /home/cloudera/oozies/merc 5 |      1209 | ce7c2be23dde29677e1bf85e1e6b3137 | 2015-01-13 04:44:53 |
          +----------+-----------+------------------------------+-----------+----------------------------------+---------------------+
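
To illustrate the values stored in the file table, the sketch below computes a file's size, hash and timestamp. The 32-character hex values in the sample rows are consistent with MD5, but the hash algorithm is an assumption here, as are the class and method names. The last-modified time stands in for the creation time.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch; not BDRE's RegisterFile implementation.
class FileMetadataSketch {

    // Returns the MD5 digest of 'content' as 32 lowercase hex characters.
    // Assumption: MD5 is used only because the sample hashes are 32 hex characters long.
    static String md5Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(content)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Create a throwaway file so the example runs anywhere.
        Path tmp = Files.createTempFile("filemon-sketch", ".txt");
        Files.write(tmp, "hello".getBytes(StandardCharsets.UTF_8));
        byte[] content = Files.readAllBytes(tmp);

        System.out.println("path        = " + tmp);
        System.out.println("file_size   = " + content.length);
        System.out.println("file_hash   = " + md5Hex(content));
        System.out.println("creation_ts = " + Files.getLastModifiedTime(tmp));
        Files.delete(tmp);
    }
}
```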

The following are the main configuration parameters required in im-config.xml to run this API.

1.thread-wait: The interval at which the directories are monitored.
2.dirs: The directories to monitor, comma separated.
3.filter: The regular expression used to validate the file name. If the name of a newly created file matches the pattern, the API registers the file. To match multiple patterns in one directory, give the same directory name multiple times, comma separated, once per pattern.
4.sub-processIds: The process ids, comma separated, one for each directory.
5.serverIds: Valid server ids, comma separated. Server ids can be obtained from the servers table.

Example

           <environment id="env2">
              <file-mon>
                <thread-wait>500</thread-wait>
                <dirs>/home/cloudera/oozies,/home/cloudera/datasets</dirs>
                <filter>merchant\s[0-9],test\s[0-9]</filter>
                <sub-processIds>213,214</sub-processIds>
                <serverIds>1,1</serverIds>
              </file-mon>
           </environment>
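
The parameter descriptions suggest that the comma-separated lists are paired by position, so the first directory goes with the first filter, sub-process id and server id. The sketch below shows one way to express that pairing. The class `FileMonConfigSketch`, its `Entry` type and the `pair` method are hypothetical, and positional pairing is an assumption inferred from the example rather than a confirmed detail of BDRE.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not BDRE's configuration loader.
class FileMonConfigSketch {

    // One monitored directory together with its filter, sub-process id and server id.
    static final class Entry {
        final String dir;
        final String filter;
        final int subProcessId;
        final int serverId;

        Entry(String dir, String filter, int subProcessId, int serverId) {
            this.dir = dir;
            this.filter = filter;
            this.subProcessId = subProcessId;
            this.serverId = serverId;
        }
    }

    // Splits the four comma-separated values and pairs them by position.
    // Note: a filter that itself contains a comma (for example "{1,3}") would be split incorrectly.
    static List<Entry> pair(String dirs, String filters, String subProcessIds, String serverIds) {
        String[] d = dirs.split(",");
        String[] f = filters.split(",");
        String[] p = subProcessIds.split(",");
        String[] s = serverIds.split(",");
        if (d.length != f.length || d.length != p.length || d.length != s.length) {
            throw new IllegalArgumentException(
                "dirs, filter, sub-processIds and serverIds must have the same number of entries");
        }
        List<Entry> entries = new ArrayList<>();
        for (int i = 0; i < d.length; i++) {
            entries.add(new Entry(d[i].trim(), f[i],
                Integer.parseInt(p[i].trim()), Integer.parseInt(s[i].trim())));
        }
        return entries;
    }

    public static void main(String[] args) {
        // Values copied from the example configuration above.
        List<Entry> entries = pair("/home/cloudera/oozies,/home/cloudera/datasets",
            "merchant\\s[0-9],test\\s[0-9]", "213,214", "1,1");
        for (Entry e : entries) {
            System.out.println(e.dir + " | " + e.filter + " | " + e.subProcessId + " | " + e.serverId);
        }
    }
}
```

Checking that all four lists have the same length up front catches a mismatched configuration before any directory is monitored.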

How to run the program

1.Pass the environment name to this API as an input parameter.
2.Add the required jars to the classpath.
3.Run the following command:

        java com.wipro.ats.bdre.filemon.FileMonRunnableMain -env env2