Extract, Transform, Load
- Create an ETL pipeline from raw data to a SQL database.
- Extract data from disparate sources using Python.
- Clean and transform data using Pandas.
- Use regular expressions (Regex) to parse data and to transform text into numbers.
- Load data with PostgreSQL and verify in PgAdmin.
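The regex step above can be sketched as a small parsing helper. This is a hedged illustration, not the project's exact rules: the dollar-amount patterns and scale words (`million`, `billion`) are assumptions about how the raw box-office strings looked.

```python
import re

# Hypothetical helper: turn box-office strings like "$123.4 million"
# into floats. The patterns below are illustrative assumptions.
def parse_dollars(s):
    if not isinstance(s, str):
        return None
    s = s.replace(',', '')
    # Match "$123.4", optionally followed by "million" or "billion".
    match = re.match(r'\$\s*(\d+\.?\d*)\s*(million|billion)?', s,
                     flags=re.IGNORECASE)
    if not match:
        return None
    value = float(match.group(1))
    scale = (match.group(2) or '').lower()
    if scale == 'million':
        value *= 1e6
    elif scale == 'billion':
        value *= 1e9
    return value
```

A function like this can be applied to a whole column with `df['box_office'].map(parse_dollars)`, returning `None` (which pandas treats as missing) for rows that do not match.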
The project involved extracting a large dataset from Kaggle and a Wikipedia movies JSON file, then transforming the data into a usable dataset for a hack-a-thon. Once the data was cleaned and narrowed in scope, the DataFrames were loaded into PostgreSQL.
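The extract step described above might look like the sketch below. The file names are assumptions based on the sources listed (a Wikipedia movies JSON file and two Kaggle CSVs), not the project's actual paths.

```python
import json
import pandas as pd

# Sketch of the extract step; file names are illustrative assumptions.
def extract(wiki_json_path, kaggle_metadata_path, ratings_path):
    with open(wiki_json_path, mode='r') as file:
        wiki_movies_raw = json.load(file)           # list of dicts
    wiki_movies_df = pd.DataFrame(wiki_movies_raw)  # one column per JSON key
    # low_memory=False avoids mixed-dtype warnings on the large metadata file.
    kaggle_metadata = pd.read_csv(kaggle_metadata_path, low_memory=False)
    ratings = pd.read_csv(ratings_path)
    return wiki_movies_df, kaggle_metadata, ratings
```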
Wikipedia Movies JSON file, starting with 193 columns:
Kaggle Movie Metadata, 24 columns
Kaggle Ratings data, 2,602,489 rows by 4 columns
Wikipedia Movies transformed, 22 columns
Wikipedia Movies with more succinct and uniform column names, 7,033 rows of data.
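The column cleanup described above can be sketched as a rename plus a null-ratio filter. The rename map and the 90% threshold are illustrative assumptions; the actual project trimmed 193 Wikipedia columns down to 22.

```python
import pandas as pd

# Hypothetical rename map; the real project mapped many more columns.
RENAME_MAP = {
    'Directed by': 'director',
    'Produced by': 'producers',
    'Running time': 'running_time',
    'Box office': 'box_office',
}

def clean_columns(df):
    # Shorten verbose Wikipedia headers, then drop columns that are
    # more than 90% null (threshold is an assumption).
    df = df.rename(columns=RENAME_MAP)
    keep = [col for col in df.columns
            if df[col].isnull().sum() < len(df) * 0.9]
    return df[keep]
```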
Wikipedia Movies merged with Kaggle Movies data, showing all column names and row counts, 6,052 rows.
Merged movies joined with Kaggle ratings, showing all column names and row counts, 6,052 rows.
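The two joins above can be sketched as follows. The key column names (`imdb_id`, `kaggle_id`, `movieId`, `rating`) and the use of an average-rating aggregate are assumptions about the schema, not the project's exact merge logic.

```python
import pandas as pd

# Sketch of the merge step under assumed column names.
def merge_all(wiki_df, kaggle_df, ratings_df):
    # Inner join keeps only movies present in both sources.
    movies_df = pd.merge(wiki_df, kaggle_df, on='imdb_id',
                         suffixes=('_wiki', '_kaggle'))
    # Aggregate ratings per movie, then left join so movies without
    # ratings survive with NaN.
    avg_ratings = (ratings_df.groupby('movieId')['rating']
                   .mean().rename('avg_rating'))
    return movies_df.merge(avg_ratings, left_on='kaggle_id',
                           right_index=True, how='left')
```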
Sending the data to PostgreSQL
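The load step can be sketched with SQLAlchemy's `create_engine` and pandas' `to_sql`. The connection string and table names below are placeholders, and the chunked read is an assumption motivated by the ~2.6M-row ratings file.

```python
import pandas as pd
from sqlalchemy import create_engine

# Sketch of the load step. The connection string is a placeholder,
# e.g. 'postgresql://postgres:password@localhost:5432/movie_data'.
def load_to_sql(movies_df, ratings_csv_path, db_string,
                chunksize=1_000_000):
    engine = create_engine(db_string)
    movies_df.to_sql(name='movies', con=engine,
                     if_exists='replace', index=False)
    # ratings is large (~2.6M rows), so stream it in chunks
    # rather than reading the whole file at once.
    for chunk in pd.read_csv(ratings_csv_path, chunksize=chunksize):
        chunk.to_sql(name='ratings', con=engine,
                     if_exists='append', index=False)
    return engine
```

After the load, a quick `SELECT COUNT(*)` on each table in PgAdmin confirms the row counts match the DataFrames.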
Movies Query
Ratings Query
A JSON file and two Kaggle files were extracted, transformed, and joined. The resulting movies and ratings tables were loaded into a PostgreSQL database for the hack-a-thon event.