For the homework, we'll be working with the green taxi dataset located here:
https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green
The goal will be to construct an ETL pipeline that loads the data, performs some transformations, and writes the data to a database (and Google Cloud!).
- Create a new pipeline, call it
green_taxi_etl
- Add a data loader block and use Pandas to read data for the final quarter of 2020 (months
10
,11
,12
).- You can use the same datatypes and date parsing methods shown in the course.
BONUS
: load the final three months using a for loop andpd.concat
- Add a transformer block and perform the following:
- Remove rows where the passenger count is equal to 0 or the trip distance is equal to zero.
- Create a new column
lpep_pickup_date
by convertinglpep_pickup_datetime
to a date. - Rename columns in Camel Case to Snake Case, e.g.
VendorID
tovendor_id
. - Add three assertions:
vendor_id
is one of the existing values in the column (currently)passenger_count
is greater than 0trip_distance
is greater than 0
- Using a Postgres data exporter (SQL or Python), write the dataset to a table called
green_taxi
in a schemamage
. Replace the table if it already exists. - Write your data as Parquet files to a bucket in GCP, partioned by
lpep_pickup_date
. Use thepyarrow
library! - Schedule your pipeline to run daily at 5AM UTC.
Once the dataset is loaded, what's the shape of the data?
- 266,855 rows x 20 columns
- 544,898 rows x 18 columns
- 544,898 rows x 20 columns
- 133,744 rows x 20 columns
Upon filtering the dataset where the passenger count is equal to 0 or the trip distance is equal to zero, how many rows are left?
- 544,897 rows
- 266,855 rows
- 139,370 rows
- 266,856 rows
Which of the following creates a new column lpep_pickup_date
by converting lpep_pickup_datetime
to a date?
- data = data['lpep_pickup_datetime'].date
- data('lpep_pickup_date') = data['lpep_pickup_datetime'].date
- data['lpep_pickup_date'] = data['lpep_pickup_datetime'].dt.date
- data['lpep_pickup_date'] = data['lpep_pickup_datetime'].dt().date()
What are the existing values of VendorID
in the dataset?
- 1, 2, or 3
- 1 or 2
- 1, 2, 3, 4
- 1
How many columns need to be renamed to snake case?
- 3
- 6
- 2
- 4
Once exported, how many partitions (folders) are present in Google Cloud?
- 96
- 56
- 67
- 108
- Form for submitting: TBA
Deadline: TBA
Will be added after the due date