The main challenge is to ingest data from a CSV file and an API using Apache Airflow and EMR, build a star schema, and display three graphs in a dashboard.
Our solution uses five main technologies:
- ECS: as the platform to run our containers
- EMR: as the platform to run our ETL code
- Spark: as the main engine to process the data
- Metabase: as the data visualization tool
- Apache Airflow: as the scheduling and orchestration tool
In Apache Airflow, a DAG called star_schema.py was implemented to perform the ETL. It uses the following operators:
Only two types of operators are used in this DAG: EmrCreateJobFlowOperator and EmrJobFlowSensor. The first is responsible for creating the EMR Spot instance and assigning a Step to it. The second checks whether the execution succeeded: if the EMR job fails, the sensor fails as well; otherwise it turns green (success) and the DAG proceeds to the next operator.
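For reference, below is a minimal sketch of how these two operators are usually wired together. The task ids, cluster configuration, and S3 path are illustrative assumptions, not the actual values used in star_schema.py:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import EmrCreateJobFlowOperator
from airflow.providers.amazon.aws.sensors.emr import EmrJobFlowSensor

# Hypothetical cluster definition: a Spot instance and a single Spark step
# that runs the job template stored on S3 (bucket path is assumed).
JOB_FLOW_OVERRIDES = {
    "Name": "star-schema-etl",
    "ReleaseLabel": "emr-6.4.0",
    "Instances": {
        "InstanceGroups": [
            {
                "Name": "Primary node",
                "Market": "SPOT",
                "InstanceRole": "MASTER",
                "InstanceType": "m5.xlarge",
                "InstanceCount": 1,
            }
        ],
        "KeepJobFlowAliveWhenNoSteps": False,
        "TerminationProtected": False,
    },
    "Steps": [
        {
            "Name": "run_etl_step",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://my-emr-bucket/emr/job_template.py"],
            },
        }
    ],
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}

with DAG(
    dag_id="star_schema",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Creates the (Spot) EMR cluster and assigns the Step to it.
    create_job_flow = EmrCreateJobFlowOperator(
        task_id="create_job_flow",
        job_flow_overrides=JOB_FLOW_OVERRIDES,
        aws_conn_id="aws_default",
    )

    # Waits for the job flow to finish; fails if the EMR execution fails.
    watch_job_flow = EmrJobFlowSensor(
        task_id="watch_job_flow",
        job_flow_id=create_job_flow.output,
        aws_conn_id="aws_default",
    )

    create_job_flow >> watch_job_flow
```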
A single file, emr/job_template.py, is executed by all jobs on EMR. This job calls the ETLFactory class to build each class that will be executed in the EMR Step. The image below shows a diagram of how the packages directory is structured.
- Logger: A class that implements/configures a logger for the project
- RestApiHook: A hook for REST APIs that implements the requests methods
- DatabaseManager: A class specialized in performing actions on Postgres; implemented with SQLAlchemy
- ETLBase: A base abstraction for the Extractor, Transformer, and Loader classes. It uses the Logger class through composition and sets a class-level attribute with the 'root' path of the 'filesystem'.
- RestApiExtractor: An abstract class for Extractor classes specialized in extracting from REST APIs
- Transformer: An abstraction that sets the structure for a Transformer class, which is responsible for transforming the extracted data
- Loader: The Loader class is responsible for loading/writing the (extracted, transformed) data to the database or the filesystem.
- ETLFactory: An implementation of the Factory pattern, responsible for building the Extractor, Transformer, and Loader classes. A simplified sketch of this arrangement is shown below.
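To make this layout more concrete, here is a simplified sketch of how an ETLBase/ETLFactory arrangement like the one described above could look. The class names follow the list, but the method signatures, the registry, and the placeholder data are assumptions for illustration only:

```python
import logging


class Logger:
    """Configures and exposes a logger for the project."""

    def __init__(self, name: str):
        logging.basicConfig(level=logging.INFO)
        self.log = logging.getLogger(name)


class ETLBase:
    """Base abstraction shared by the Extractor, Transformer and Loader classes."""

    root_path = "/tmp/data"  # assumed 'root' path of the 'filesystem'

    def __init__(self):
        # The Logger is used through composition, as described above.
        self.logger = Logger(self.__class__.__name__).log


class RestApiExtractor(ETLBase):
    """Stand-in for an Extractor specialized in pulling data from a REST API."""

    def extract(self) -> list[dict]:
        self.logger.info("Extracting records from the REST API")
        return [{"id": 1, "value": 42}]  # placeholder payload


class Transformer(ETLBase):
    """Stand-in for a Transformer that shapes the extracted data."""

    def transform(self, records: list[dict]) -> list[dict]:
        self.logger.info("Transforming %d records", len(records))
        return [{**r, "value_doubled": r["value"] * 2} for r in records]


class Loader(ETLBase):
    """Stand-in for a Loader that writes data to the database or filesystem."""

    def load(self, records: list[dict]) -> None:
        self.logger.info("Loading %d records to %s", len(records), self.root_path)


class ETLFactory:
    """Builds the Extractor, Transformer and Loader for a given job name."""

    _registry = {
        "rest_api_job": (RestApiExtractor, Transformer, Loader),
    }

    @classmethod
    def build(cls, job_name: str):
        extractor_cls, transformer_cls, loader_cls = cls._registry[job_name]
        return extractor_cls(), transformer_cls(), loader_cls()


if __name__ == "__main__":
    # Roughly the role of emr/job_template.py: ask the factory for the pieces
    # of a job, then run extract -> transform -> load.
    extractor, transformer, loader = ETLFactory.build("rest_api_job")
    loader.load(transformer.transform(extractor.extract()))
```

The factory keeps the job template generic: the EMR Step only needs to pass a job name, and the factory decides which concrete Extractor, Transformer, and Loader to build.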
Using Metabase, three graphs were generated and displayed in the dashboard.
Local Environment
In your terminal, execute the following command:
$ docker-compose up -d --build
- Airflow: http://localhost:8080/
- Metabase: http://localhost:3000/
Development Environment
In your terminal, execute the following command:
$ source deploy.sh
This script is located in the root of the project-4 folder and basically performs the following actions:
- Asking for some required arguments (such as AWS_KEY and PASS info, if they aren't set as environment variables)
- Executing Terraform
- Sending the required files to the emr bucket on S3
- Building and pushing Airflow's Docker image (check the Dockerfile) to ECR
To (re)generate the architecture diagram, execute the following command in your terminal:
$ python architecture_diagram/architecture.py
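Architecture diagrams like this one are commonly generated with the diagrams Python package (which also requires Graphviz to be installed). The snippet below is only a hedged sketch of that approach; the actual nodes and edges drawn by architecture_diagram/architecture.py may differ:

```python
# Simplified sketch using the `diagrams` package (pip install diagrams).
# The layout and edges are assumptions based on the components listed above.
from diagrams import Cluster, Diagram
from diagrams.aws.analytics import EMR
from diagrams.aws.storage import S3
from diagrams.onprem.analytics import Metabase
from diagrams.onprem.database import PostgreSQL
from diagrams.onprem.workflow import Airflow

with Diagram("Project Architecture", show=False, filename="architecture"):
    # Assumed: Airflow and Metabase run as containers on ECS.
    with Cluster("ECS"):
        airflow = Airflow("Airflow")
        metabase = Metabase("Metabase")

    emr = EMR("EMR (Spark ETL)")
    s3 = S3("emr bucket")
    postgres = PostgreSQL("star schema")

    # Airflow triggers the EMR job flow; the Spark job reads/writes S3
    # and loads the star schema that Metabase visualizes.
    airflow >> emr
    emr >> s3
    emr >> postgres >> metabase
```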