
What's this project about

This is my second project in Python: a hands-on workshop on building an ETL (Extract, Transform, Load) pipeline with Apache Airflow. The main goal is to demonstrate how to extract information from two different data sources (a CSV file and a database), perform data transformations, merge the transformed data, and finally load the result into Google Drive as a CSV file and store it in a database. As a final step, we create a dashboard from the data stored in the database to visualize the information as clearly as possible.
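
As a rough orientation, the pipeline can be pictured as the Airflow DAG sketched below. This is only a minimal illustration of the flow described above, not the actual contents of etl_dag.py: the callables, schedule, and task wiring are assumptions made for the example (only the dag_id etl_dag is taken from the trigger command used later in this README).

    # Illustrative sketch only -- the real tasks live in 0_src/etl.py and 0_src/etl_dag.py.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator


    def extract_csv():          # read the CSV source (placeholder)
        ...

    def extract_db():           # read the database source (placeholder)
        ...

    def transform_and_merge():  # clean both sources and join them (placeholder)
        ...

    def load_to_drive():        # upload the merged CSV to Google Drive (placeholder)
        ...

    def load_to_db():           # store the merged data in the database (placeholder)
        ...


    with DAG(
        dag_id="etl_dag",
        start_date=datetime(2024, 1, 1),
        schedule_interval=None,  # triggered manually in this workshop
        catchup=False,
    ) as dag:
        t_csv = PythonOperator(task_id="extract_csv", python_callable=extract_csv)
        t_db = PythonOperator(task_id="extract_db", python_callable=extract_db)
        t_merge = PythonOperator(task_id="transform_and_merge",
                                 python_callable=transform_and_merge)
        t_drive = PythonOperator(task_id="load_to_drive", python_callable=load_to_drive)
        t_store = PythonOperator(task_id="load_to_db", python_callable=load_to_db)

        # Extract from both sources, merge, then fan out to Drive and the database.
        [t_csv, t_db] >> t_merge >> [t_drive, t_store]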

Folder Structure

workshop_02/
├── 0_src/                      # Source scripts
│   ├── db.py                   # Database handling script
│   ├── drive.py                # Google Drive operations script
│   ├── etl_dag.py              # DAG setup for ETL process
│   ├── etl.py                  # Main ETL process script
│   └── transform.py            # Data transformation script
├── 1_notebooks/                # Jupyter notebooks
│   ├── EDA_Grammy.ipynb        # EDA on Grammy data
│   └── EDA_Spotify.ipynb       # EDA on Spotify data
├── 2_data/                     # Data and visualizations
│   ├── data_result.csv         # Processed data results
│   └── Visualization.pdf       # PDF with visualizations
├── .gitignore                  # Ignored files for Git
├── README.md                   # Project description and guide
└── requirements.txt            # Project dependencies          
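
The drive.py script handles the Google Drive upload mentioned above. As a point of reference, a minimal upload helper could look like the sketch below; it assumes PyDrive2 as the client library and an already-configured OAuth credentials file, and the folder ID is a placeholder, so the real drive.py may well differ.

    # Hypothetical sketch -- PyDrive2 and the values below are assumptions, not taken from this repo.
    from pydrive2.auth import GoogleAuth
    from pydrive2.drive import GoogleDrive


    def upload_csv(local_path, drive_folder_id):
        """Upload a local CSV file into a Google Drive folder."""
        gauth = GoogleAuth()
        gauth.LocalWebserverAuth()       # opens a browser window for OAuth consent
        drive = GoogleDrive(gauth)

        file = drive.CreateFile({
            "title": "data_result.csv",
            "parents": [{"id": drive_folder_id}],
        })
        file.SetContentFile(local_path)  # attach the local file's contents
        file.Upload()                    # perform the upload


    # Example call with placeholder values:
    # upload_csv("2_data/data_result.csv", "<your-drive-folder-id>")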

Prerequisites

Before getting started with this project, make sure you have the following ready: Python 3 with pip, a database server you can connect to (for the extraction and load steps), and a Google account with Drive access for the CSV upload. The Python dependencies, including Apache Airflow, are installed from requirements.txt in the steps below.

Environment Setup

Here are the steps to set up your development environment:

  1. Create a virtual environment: Run the following command to create a virtual environment called venv:

    python -m venv venv
    
  2. Activate your venv: Run the following command (on Linux/macOS) to activate the environment:

    source venv/bin/activate
    
  3. Install Dependencies: With the virtual environment active, run the following command from the project root to install the necessary dependencies:

    pip install -r requirements.txt
    
  4. Create db_config: You need to create a JSON file called "db_config" with the following content; make sure you replace the values with your own connection details (a sketch of how this file can be read is shown after these steps):

    {
     "user" : "myuser",
     "passwd" : "mypass",
     "server" : "XXX.XX.XX.XX",
     "database" : "demo_db"
    }  
    
  5. Airflow Scheduler: Now go to the project's main folder and run the commands below to configure and start Airflow:

    airflow scheduler
    airflow standalone
    
  6. Running the DAG: At this point Airflow is running and you can trigger etl_dag. If you want to use the Airflow web interface, the server runs on port 8080, and the login credentials are printed to your terminal when you run airflow standalone. If you prefer to trigger the DAG from the terminal, run the command below (Airflow 2 CLI syntax):

    airflow dags trigger etl_dag
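
As referenced in step 4, here is a minimal sketch of how db.py might read the db_config file and open a database connection. It assumes SQLAlchemy with the PostgreSQL driver psycopg2, which is an assumption rather than something stated in this README; adjust the URL for your own database and keep the file name consistent with the file you created in step 4.

    # Hypothetical sketch -- the actual 0_src/db.py may use a different library or driver.
    import json

    from sqlalchemy import create_engine


    def get_engine(config_path="db_config.json"):
        """Build a SQLAlchemy engine from the db_config JSON file created in step 4."""
        with open(config_path) as f:
            cfg = json.load(f)

        # Keys match the example above: user, passwd, server, database.
        url = (
            f"postgresql+psycopg2://{cfg['user']}:{cfg['passwd']}"
            f"@{cfg['server']}/{cfg['database']}"
        )
        return create_engine(url)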
    
    

Contact

If you have any questions or suggestions, feel free to contact me at [[email protected]].
