This project analyzes data from New York City's Citi Bike share program for the years 2019, 2020, and 2023. The primary goal is to investigate ridership trends before the pandemic (2019), during it (2020), and afterward (2023). The project aims to answer the following questions:
- How did the 2020 pandemic impact ridership, and did it rebound significantly by 2023?
- How do monthly usage trends compare across the three years, particularly during the 2020 pandemic?
- How does total usage across all three years combined compare between bike program members and non-members?
- Which bike stations are the most popular, and what are the total trips and average trip times for each of the top 5 stations?
By analyzing the data, we can gain insights into the popularity of certain areas in New York City among Citi Bike users and determine if the most popular stations change over time, even as Citi Bike relocates bike stations. The project also examines whether the average trip time for users changes significantly when starting and completing trips at these popular stations.
Through this analysis, we aim to better understand the impact of the pandemic on bike-sharing in New York City and identify any lasting changes in ridership patterns and user behavior.
- Terraform: Used for Infrastructure as Code (IaC) to create a Google Cloud Storage (GCS) bucket and a BigQuery dataset
- Google Cloud Storage (GCS): Used as a data lake to store the raw Citi Bike data in parquet format
- Google BigQuery: Used as the data warehouse for storing and analyzing the Citi Bike data
- PyArrow: Used in conjunction with the Data Load Tool (dlt) for the initial data pipeline (a minimal conversion sketch follows this list)
- Data Load Tool (dlt): Utilized for the ETL process, creating data pipelines, running dbt transformations, and setting up GitHub Actions
- dbt (Data Build Tool): Employed for performing data transformations and running tests on the data stored in BigQuery
- GitHub Actions: Integrated with dlt to deploy a data pipeline triggered on a quarterly schedule
- Google Looker Studio: Used for creating a dashboard to visualize the analyzed Citi Bike data
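As a rough illustration of the PyArrow step, here is a minimal sketch of converting one month of raw Citi Bike CSV data to parquet before it is loaded. The file names are hypothetical; the project's actual pipeline lives in the `dlt_data_load` directory.

```python
import pyarrow.csv as pv
import pyarrow.parquet as pq

# Hypothetical file names for one month of raw Citi Bike trip data.
csv_file = "201901-citibike-tripdata.csv"
parquet_file = "201901-citibike-tripdata.parquet"

# Read the raw CSV into an Arrow table and write it back out as parquet.
table = pv.read_csv(csv_file)
pq.write_table(table, parquet_file)
```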
| 2019 | 2020 | 2023 |
|---|---|---|
| tripduration | tripduration | - |
| starttime | starttime | started_at |
| stoptime | stoptime | ended_at |
| start station id | start station id | start_station_id |
| start station name | start station name | start_station_name |
| start station latitude | start station latitude | start_lat |
| start station longitude | start station longitude | start_lng |
| end station id | end station id | end_station_id |
| end station name | end station name | end_station_name |
| end station latitude | end station latitude | end_lat |
| end station longitude | end station longitude | end_lng |
| bikeid | bikeid | ride_id |
| usertype | usertype | member_casual |
| birth year | birth year | - |
| gender | gender | - |
| - | - | rideable_type |
Records per year:
- 2019: 20,551,697
- 2020: 19,506,857
- 2023: 36,226,600
| 2019 - 2023 (Combined) |
|---|
| ride_id |
| start_time_date |
| stop_time_date |
| trip_duration_seconds |
| start_station_name |
| stop_station_name |
| user_category |
Records in the combined table: 76,285,697
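To make the schema comparison above concrete, here is a hypothetical pandas sketch of how the differently named 2019/2020 and 2023 columns could be mapped onto the combined schema. The project's actual transformations are dbt models; these mappings are illustrative only.

```python
import pandas as pd

# Illustrative mappings from each source schema to the combined schema.
MAP_2019_2020 = {
    "bikeid": "ride_id",
    "starttime": "start_time_date",
    "stoptime": "stop_time_date",
    "tripduration": "trip_duration_seconds",
    "start station name": "start_station_name",
    "end station name": "stop_station_name",
    "usertype": "user_category",
}

MAP_2023 = {
    "ride_id": "ride_id",
    "started_at": "start_time_date",
    "ended_at": "stop_time_date",
    "start_station_name": "start_station_name",
    "end_station_name": "stop_station_name",
    "member_casual": "user_category",
}

def harmonize(df: pd.DataFrame, mapping: dict) -> pd.DataFrame:
    """Rename source columns and keep only the combined-schema columns."""
    renamed = df.rename(columns=mapping)
    return renamed[[c for c in renamed.columns if c in mapping.values()]]

# Note: the 2023 data has no tripduration column, so trip_duration_seconds
# would be derived from ended_at - started_at in the real transformation.
```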
A link to the dashboard is here. (If the dashboard doesn't display, it's because the GCS credits have ended.)
- Create a Google Cloud account if you don't have one already. Google Cloud offers a $300 credit that's valid for 90 days, which can be used for this project.
- Create a new project and project ID within Google Cloud. You can do this either through the Google Cloud Console or using the Google Cloud SDK CLI. If you prefer the CLI approach, follow the steps in this video.
- Create a Service Account through the IAM (Identity and Access Management) section in the Google Cloud Console. Assign the following permission roles to the service account:
  - Viewer
  - Storage Admin
  - Storage Object Admin
  - BigQuery Admin
  - Owner (if needed)
- Generate and download a JSON key for the newly created Service Account. It will be used when interacting with the Terraform, dbt, and dlt scripts, and when using the Google Cloud SDK CLI.

  Note: Save the JSON key to a local folder. If the folder is located within the project directory, consider adding it to your `.gitignore` file to prevent the key from being accidentally committed to version control.
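As a quick sanity check that the key works, here is a minimal, hypothetical Python sketch using the Google Cloud client libraries. The key path is an assumption; the project itself reads the key through its Terraform, dlt, and dbt configuration.

```python
from google.cloud import bigquery, storage

# Hypothetical path to the downloaded service-account JSON key
# (keep this file out of version control).
KEY_PATH = "keys/gcp-service-account.json"

# Both client libraries can authenticate directly from a service-account JSON key.
storage_client = storage.Client.from_service_account_json(KEY_PATH)
bq_client = bigquery.Client.from_service_account_json(KEY_PATH)

# Print the project the credentials resolve to as a basic sanity check.
print(storage_client.project, bq_client.project)
```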
- Clone the repository:

  ```bash
  git clone https://github.com/agutiernc/nyc-citi-bike-insights.git
  ```

- Switch to the project directory:

  ```bash
  cd nyc-citi-bike-insights
  ```

- In a terminal, run the setup bash script, which renames the secret directories and files (that have been added to `.gitignore`):

  ```bash
  chmod +x setup.sh   # Make it executable
  ./setup.sh          # Run it
  ```
Detailed instructions for setting up Terraform can be found in the Terraform directory.
Step 4: Create a Python virtual environment in the project directory, activate it, and install the requirements:

```bash
python -m venv venv
source venv/bin/activate          # activate the virtual environment
pip install -r requirements.txt   # install project dependencies
```
Detailed instructions for performing the ETL process using the Data Load Tool (dlt) can be found in the dlt_data_load directory.
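For orientation, here is a minimal, hypothetical sketch of what a dlt pipeline loading parquet data might look like. The project's actual pipeline is `load_bike_data_to_gcs.py` in the `dlt_data_load` directory; the resource name and file path below are assumptions.

```python
import dlt
import pyarrow.parquet as pq

# Sketch of a dlt pipeline; the real pipeline lives in dlt_data_load/.
pipeline = dlt.pipeline(
    pipeline_name="citi_bike",
    destination="filesystem",   # dlt's filesystem destination, configured to point at the GCS bucket
    dataset_name="citi_bike_data",
)

@dlt.resource(name="citi_bike_2019", write_disposition="replace")
def rides_2019():
    # Yield Arrow tables read from the raw files (file path is hypothetical).
    yield pq.read_table("data/201901-citibike-tripdata.parquet")

# Run the pipeline and print the load summary.
load_info = pipeline.run(rides_2019())
print(load_info)
```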
In this step, 3 external tables are created in BigQuery using the parquet files stored in the GCS bucket. You have two options to accomplish this:
- Open `parquet_to_external_tables.sql`, in the `sql_external_tables` folder, and copy the SQL commands.
- Open the BigQuery dataset `citi_bike_data` in the Google Cloud Console and open a new Query tab.
- Paste the SQL commands from the `parquet_to_external_tables.sql` file into the Query editor.
- Edit the fields in the SQL commands with the necessary information, such as the `project name` and the `gsutil URI` for the `.parquet` files that are in the bucket folders.
- Run the query, one command at a time, to create the external tables.
Alternatively, follow the step-by-step guide in this PDF document (skip to step 3) to create the external tables using the Google Cloud Console GUI within BigQuery.
- The dataset will be `citi_bike_data`, and each external table should be named as follows: `citi_bike_2019`, `citi_bike_2020`, `citi_bike_2023`.
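As an alternative to the two options above, the same external tables could also be created programmatically. This is a minimal, hypothetical sketch using the BigQuery Python client; the project ID and bucket paths are assumptions and must be replaced with your own.

```python
from google.cloud import bigquery

client = bigquery.Client(project="your-project-id")  # replace with your project ID

# Map each external table name to the gsutil URI of its parquet files (hypothetical paths).
SOURCES = {
    "citi_bike_2019": "gs://your-bucket/citi_bike_2019/*.parquet",
    "citi_bike_2020": "gs://your-bucket/citi_bike_2020/*.parquet",
    "citi_bike_2023": "gs://your-bucket/citi_bike_2023/*.parquet",
}

for table_name, uri in SOURCES.items():
    # Describe the parquet files in GCS as the external data source.
    external_config = bigquery.ExternalConfig("PARQUET")
    external_config.source_uris = [uri]

    # Attach the external configuration to a table in the citi_bike_data dataset.
    table = bigquery.Table(f"your-project-id.citi_bike_data.{table_name}")
    table.external_data_configuration = external_config

    client.create_table(table, exists_ok=True)  # create the external table if it doesn't exist
    print(f"Created external table {table_name}")
```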
In this step, data transformations are performed on the raw data stored in the external tables to create a single materialized table. The transformations will be done using the Data Load Tool (dlt) to create the pipelines for BigQuery and the dbt (Data Build Tool) runner to execute the transformations.
Detailed instructions and necessary files for this step can be found in the dbt_dlt_transformations directory.
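As a rough illustration of how dlt's dbt runner can execute a dbt package against BigQuery, here is a sketch only; the pipeline name and package path below are assumptions, and the actual setup is documented in the dbt_dlt_transformations directory.

```python
import dlt

# Sketch of running dbt models through dlt's dbt runner; names and paths are assumptions.
pipeline = dlt.pipeline(
    pipeline_name="citi_bike_dbt",
    destination="bigquery",
    dataset_name="citi_bike_data",
)

# Point the runner at the local dbt package directory (hypothetical path).
dbt = dlt.dbt.package(pipeline, "dbt_dlt_transformations/dbt_project")

# Run all models in the package and print the results.
models = dbt.run_all()
for model in models:
    print(model)
```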
In this step, we'll use dlt to deploy our pipelines with GitHub Actions, enabling automated execution every quarter of the year.
Prerequisites:
- The project should be hosted in a GitHub repository.
Steps:
- Deploy the ETL pipeline:
  - Navigate to the `dlt_data_load` directory in the terminal.
  - Run:

    ```bash
    dlt deploy load_bike_data_to_gcs.py github-action --schedule "0 0 1 */3 *"
    ```

    This command creates GitHub Actions workflow `.yml` files in the `.github/workflows` directory with the necessary environment variables. (The cron expression `0 0 1 */3 *` triggers the workflow at 00:00 on the first day of every third month.)

- Add secret values to GitHub:
  - After running the deploy command, dlt will print out each `Name` and `Secret` pair.
  - Copy and paste each `Name` and `Secret` pair into the GitHub UI, which can be accessed at the link printed by the dlt deploy command (e.g., `github.com/.../settings/secrets/actions`).

- Commit and push the changes to GitHub:
  - Add and commit the changes to your local repository:

    ```bash
    git add . && git commit -m 'pipeline deployed with github action'
    ```

  - Push the changes to the GitHub repository:

    ```bash
    git push origin
    ```
That's it! With this setup, the data pipelines will be automatically triggered and executed every three months, ensuring that your data remains up to date and ready for analysis.