Load song play events of a mobile application and register to Redshift database.
This project's goal is analyze data to find what attribute of users often play songs, what songs or artists are played often and when songs are often played.
First, script create_tables.py
creates staging, dimension and fact tables.
Then, script etl.py
loads json data from S3 to Redshift staging table.
Finally, it loads from staging table to dimension and fact table by using INSERT INTO ... SELECT ...
syntax.
-
log_data
provide event data recorded in json file underlog_data
directory -
song_data
provide song and artist data recorded in json file undersong_data
directory
-
songs
provide song data which were played -
artists
provide artist data whose songs were played -
users
provide user data who played songs -
time
provide time data when users played songs
songplays
provide event data which describe who, when and what songs are played
Those tables are designed using STAR schema, which can easily analyze various aspects. I assumed we can specify artists and songs of event data by its name and title.
# pull docker images and build docker images
docker-compose up --build -d
# create tables
docker-compose exec app python3 create_tables.py
# run ETL
docker-compose exec app python3 etl.py
Query:
SELECT sp.artist_id, a.name, COUNT(sp.artist_id)
FROM songplays sp
INNER JOIN artists a on sp.artist_id = a.artist_id
GROUP BY sp.artist_id, a.name
ORDER BY 3 DESC
LIMIT 5
Result:
artist_id | name | cnt |
---|---|---|
AR5E44Z1187B9A1D74 | Dwight Yoakam | 37 |
ARD46C811C8A414F3F | Kid Cudi / Kanye West / Common | 10 |
ARD46C811C8A414F3F | Kid Cudi | 10 |
AR37SX11187FB3E164 | Ron Carter | 9 |
AR5EYTL1187B98EDA0 | Lonnie Gordon | 9 |
Query:
SELECT sp.song_id, s.title, COUNT(sp.song_id) AS cnt
FROM songplays sp
INNER JOIN songs s on sp.song_id = s.song_id
INNER JOIN time t ON sp.start_time = t.start_time
WHERE t.start_time BETWEEN '2018-11-01 00:00:00' AND '2018-11-08 00:00:00'
GROUP BY sp.song_id, s.title
ORDER BY 3 DESC
LIMIT 5
song_id | title | cnt |
---|---|---|
SOBONKR12A58A7A7E0 | You're The One | 8 |
SONTFNG12A8C13FF69 | If I Ain't Got You | 3 |
SOHTKMO12AB01843B0 | Catch You Baby (Steve Pitron & Max Sanna Radio Edit) | 3 |
SOULTKQ12AB018A183 | Nothin' On You [feat. Bruno Mars] (Album Version) | 2 |
SOLZOBD12AB0185720 | Hey Daddy (Daddy's Home) | 2 |
-
create_tables.py
creates necessary tables -
docker-compose.yml
preference file to organize docker images -
Dockerfile
provides docker image template -
dwh.cfg
parameter file for paths of json data, IAM Role and redshift connection -
etl.py
defines etl processes -
init.sh
script launched initially when docker image started -
README.md
describes about this project -
requirements.txt
defines libraries need to execute processes -
sql_queries.py
defines SQLs need to execute