Hey there! 🏍️ Welcome to my Data Engineering project in the world of MotoGP. Get ready for a thrilling journey through the twists and turns of data collection, transformation, and loading.
The MotoGP Data Engineering project is more than just a project; it's an adventure filled with data-driven excitement. It all began on a casual day when I visited a friend's house to hang out and have a great time. As destiny would have it, we stumbled upon a MotoGP race and decided to watch it.
As we sat glued to the screen, witnessing the high-speed action and the jaw-dropping performance of riders like Jorge Martin, we couldn't help but be amazed. The adrenaline, the race strategy, and the incredible overtakes had us on the edge of our seats.
After the race, we couldn't stop talking about it. Jorge Martin's fiery performance became the hot topic of our discussion. His skills were so blazing that they ignited a spark in me. That's when the idea of creating a data engineering project was born.
And here we are! This project is the result of that fateful day, when MotoGP not only entertained us but also inspired a data-driven journey. Join me on this adventure, where data takes center stage in the high-speed world of MotoGP.
- Introduction
- Project Overview
- Getting Started
- Data Collection
- Data Transformation
- Data Loading
- Running the Project
- Dashboard Tableau
- Contributing
The MotoGP Data Engineering project involves the following key steps:
- Data Collection: Obtaining data from various sources, including web scraping or APIs, to gather relevant information about MotoGP events, riders, sessions, and more.
- Data Transformation: Cleaning, structuring, and transforming the collected data to make it suitable for storage and analysis. This may include handling missing values, converting data types, and structuring data into a format that aligns with the chosen storage solution.
- Data Loading: Loading the transformed data into a data storage system. In this project, Apache Kafka handles real-time data ingestion and Apache Cassandra handles storage.
- Data Analysis: Once the data is loaded, it can be analyzed to gain insights, generate reports, and support decision-making.
To get started with the project, make sure you have the following prerequisites:
- Python (for data transformation and processing)
- Apache Kafka (for real-time data ingestion)
- Apache Cassandra (for data storage)
- Spark (for data processing)
- An active internet connection to access external data sources
Data collection in this project is a crucial step. Depending on the specific requirements, you may choose to scrape data from websites using libraries like Beautiful Soup or utilize APIs if available.
For web scraping, you can use libraries such as Beautiful Soup or Scrapy to extract information from websites; one example is extracting race data from the official MotoGP website, as in the sketch below.
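Here is a minimal scraping sketch with requests and Beautiful Soup. The URL and CSS selectors are hypothetical placeholders, not the real structure of motogp.com, so inspect the target page and adjust them before use:

```python
# Minimal web-scraping sketch with requests + Beautiful Soup.
# The URL and selectors are hypothetical placeholders; inspect the
# actual page markup and adjust them before running this for real.
import requests
from bs4 import BeautifulSoup

URL = "https://example.com/motogp/results"  # placeholder URL

response = requests.get(URL, timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
for row in soup.select("table.results tr"):  # hypothetical selector
    cells = [cell.get_text(strip=True) for cell in row.find_all("td")]
    if cells:
        print(cells)
```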
I initially considered using Beautiful Soup to scrape the data, but fortunately, I was able to find an API on the official motogp.com website, which greatly simplified the data retrieval process.
APIs offer a more structured and reliable way to access data than scraping. This project therefore uses the official MotoGP API, which provides information about events, riders, sessions, and more.
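As a rough sketch of the API-based approach, the snippet below fetches JSON over HTTP with requests. The base URL, endpoint path, and response fields are assumptions made for illustration, not the documented MotoGP API, so check the real routes before use:

```python
# Minimal API-client sketch. The base URL, endpoint, and response fields
# are assumed for illustration; consult the actual API for real routes.
import requests

BASE_URL = "https://api.example.com/motogp"  # placeholder base URL

def fetch_events(season):
    """Fetch the list of events for a season and return the parsed JSON."""
    response = requests.get(
        f"{BASE_URL}/events", params={"season": season}, timeout=10
    )
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    for event in fetch_events(2023):
        print(event.get("name"), event.get("date"))
```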
Data transformation involves cleaning, structuring, and converting the collected data into a format suitable for storage and analysis. This step is essential to ensure data quality and consistency.
If your data includes JSON or other object data stored as strings, you can use PySpark to convert these strings back into structured objects with the `pyspark.sql.functions.from_json` function, as in the sketch below.
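Here is a minimal, self-contained sketch; the column name and schema are assumptions for illustration, so adapt them to the actual shape of your data:

```python
# Sketch: parsing a JSON string column back into a struct with from_json.
# The column name "rider_json" and its schema are assumed for illustration.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("motogp-transform").getOrCreate()

rider_schema = StructType([
    StructField("name", StringType()),
    StructField("number", IntegerType()),
    StructField("team", StringType()),
])

df = spark.createDataFrame(
    [('{"name": "Jorge Martin", "number": 89, "team": "Pramac Racing"}',)],
    ["rider_json"],
)

parsed = df.withColumn("rider", from_json(col("rider_json"), rider_schema))
parsed.select("rider.name", "rider.number", "rider.team").show()
```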
The data loading phase consists of ingesting data into the chosen data storage system. For this project, we recommend using Apache Kafka for real-time data ingestion and Apache Cassandra for storage.
Apache Kafka serves as a data streaming platform that can ingest real-time data from various sources. The data can be pushed to Kafka topics and consumed by Spark for further processing.
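Below is a minimal Structured Streaming sketch of consuming a Kafka topic from Spark. The broker address and topic name are assumptions, and the job needs the spark-sql-kafka-0-10 package (matching your Spark version) on the classpath at submit time:

```python
# Sketch: reading a Kafka topic with Spark Structured Streaming.
# The broker address and topic name are assumed; supply the
# spark-sql-kafka-0-10 package via spark-submit --packages.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("motogp-kafka-consumer").getOrCreate()

stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # assumed broker
    .option("subscribe", "motogp_events")                 # assumed topic
    .option("startingOffsets", "earliest")
    .load()
)

# Kafka delivers keys/values as bytes; cast the value to string for parsing.
messages = stream.select(col("value").cast("string").alias("raw_json"))

# Print incoming messages to the console for a quick sanity check.
query = messages.writeStream.format("console").outputMode("append").start()
query.awaitTermination()
```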
Apache Cassandra is a NoSQL database suitable for storing large volumes of data. Data ingested from Kafka can be written to Cassandra tables, and it can be queried for analysis.
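As a sketch of that last step, the snippet below writes a small batch DataFrame to Cassandra through the DataStax Spark Cassandra Connector. The connection host, keyspace, and table names are assumptions, the connector package must be supplied at submit time, and the target table must already exist in Cassandra:

```python
# Sketch: writing a DataFrame to Cassandra via the Spark Cassandra Connector.
# Host, keyspace, and table names are assumed for illustration; the
# spark-cassandra-connector package must be on the classpath, and the
# motogp.results table must already exist.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("motogp-cassandra-writer")
    .config("spark.cassandra.connection.host", "localhost")  # assumed host
    .getOrCreate()
)

df = spark.createDataFrame(
    [("qatar-gp", "Jorge Martin", 1)],
    ["event_id", "rider", "position"],
)

(
    df.write
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="motogp", table="results")  # assumed keyspace/table
    .mode("append")
    .save()
)
```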
To run the project, follow these general steps:
- Install and set up the required software, including Python, Apache Kafka, Apache Cassandra, and Spark.
- Collect data from the MotoGP API or relevant websites using web scraping tools or API integration.
- Transform and clean the data to prepare it for loading.
- Set up Kafka and create Kafka topics for data ingestion.
- Ingest data into Kafka using Spark or any other data producer.
- Load the data into Apache Cassandra for storage.
- Analyze and report on the data as needed.
Please refer to the specific scripts and instructions provided with the project code for detailed implementation steps.
Follow these steps to run the MotoGP Data Engineering Project on your local machine:
Before you begin, make sure you have the prerequisites listed above installed, along with Docker and Docker Compose, since the project's services run in containers:
```bash
# clone the repository
git clone https://github.com/duonggiakhanhb/Motogp-Data-Engineering-Project.git
cd Motogp-Data-Engineering-Project

# add execute permission for the entrypoint script
chmod +x scripts/entrypoint.sh

# start the Docker containers
docker-compose up -d

# trigger the ETL DAG in Airflow
docker exec motogp-data-engineer-project-scheduler-1 airflow dags trigger motogp_etl

# add execute permission for the Spark submit script
chmod +x scripts/spark_submit.sh

# submit the job to the Spark cluster
./scripts/spark_submit.sh
```
Once the containers are up, the following web UIs are available:
- Apache Airflow: http://localhost:8181
- Confluent Control Center: http://localhost:9021
- Apache Spark master: http://localhost:9090
- Apache Spark job history: http://localhost:4040
Contributions to this project are welcome. If you have suggestions, enhancements, or bug fixes, feel free to submit a pull request. Your contributions can help improve the project and make it more valuable for the MotoGP community.
- Author: Adam Nguyen
- License: MIT