Developed using Python 3.
Working AWS credentials must be available on the system, e.g. on macOS in ~/.aws/credentials. The AWS account needs permission to read from and write to S3 and to start Athena queries (athena:StartQueryExecution).
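As a minimal, optional sanity check (not part of the script), the snippet below confirms that boto3 can find the credentials and can see the raw data prefix used later in this README:

```python
# optional sanity check: confirm boto3 picks up credentials via the default
# credential chain (e.g. ~/.aws/credentials) and can see the raw data prefix
import boto3

sts = boto3.client("sts")
print(sts.get_caller_identity()["Arn"])  # fails fast if credentials are missing/invalid

s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket="sid-coding-test-data", Prefix="rawdata/", MaxKeys=1)
print(resp.get("KeyCount", 0), "object(s) visible under rawdata/")
```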
Install the required Python modules with
pip3 install -r requirements.txt
- helper_modules - Modules to help process the data
- process_sensor_data - Main module to start the script
- tests - Unit tests
- requirements.txt - List of required Python modules
The two data files have been manually downloaded and added to S3 at s3://sid-coding-test-data/rawdata/ for easier access. Given more time, this step could be automated as well.
# What the script does
1. Download the two data files from S3
2. Load the data files into pandas DataFrames
3. Process the DataFrames to extract the Top 10 sensor locations by pedestrian count, by day and by month, and write them to the local files topn_by_day_loc.csv and topn_by_month_loc.csv (a sketch of steps 1-3 follows this list)
4. Write the original data to S3 in Parquet format for future querying. The original downloaded data was ~370 MB, while the Parquet files written to S3 are about 70 MB.
5. Create external tables in AWS Athena to enable querying of the data: ped_loc_data and sensor_locations. The Athena queries take about 17 minutes to complete, after which the data can be queried through Athena. These external tables in Athena refer to the data in S3 written in step 4 (a sketch of steps 4-5 also follows).
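The sketch below illustrates steps 1 to 3 with boto3 and pandas. The S3 object keys, the join key and the column names (sensor_id, sensor_name, hourly_count, date, month) are assumptions for illustration only; the real names are handled inside helper_modules.

```python
# minimal sketch of steps 1-3; file keys and column names are illustrative
import boto3
import pandas as pd

BUCKET = "sid-coding-test-data"

s3 = boto3.client("s3")
s3.download_file(BUCKET, "rawdata/ped_counts.csv", "ped_counts.csv")              # assumed key
s3.download_file(BUCKET, "rawdata/sensor_locations.csv", "sensor_locations.csv")  # assumed key

counts = pd.read_csv("ped_counts.csv")
locations = pd.read_csv("sensor_locations.csv")

# join counts to their sensor locations (assumed join key: sensor_id)
df = counts.merge(locations, on="sensor_id", how="left")

def top_n(frame, group_cols, n=10):
    """Top n sensor locations by total pedestrian count within each group."""
    totals = frame.groupby(group_cols + ["sensor_name"], as_index=False)["hourly_count"].sum()
    ordered = totals.sort_values(group_cols + ["hourly_count"],
                                 ascending=[True] * len(group_cols) + [False])
    return ordered.groupby(group_cols).head(n)

top_n(df, ["date"]).to_csv("topn_by_day_loc.csv", index=False)
top_n(df, ["month"]).to_csv("topn_by_month_loc.csv", index=False)
```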
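And a sketch of steps 4 and 5: writing the data back to S3 as Parquet and registering an Athena external table via StartQueryExecution. The output prefixes, Athena database name and column schema below are illustrative assumptions, not the exact ones used by the script.

```python
# minimal sketch of steps 4-5; S3 prefixes, database name and schema are assumed
import boto3
import pandas as pd

BUCKET = "sid-coding-test-data"

df = pd.read_csv("ped_counts.csv")  # local copy downloaded in the earlier step

# step 4: write to S3 as Parquet (needs pyarrow plus s3fs for the s3:// path)
df.to_parquet(f"s3://{BUCKET}/parquet/ped_loc_data/data.parquet", index=False)

# step 5: register an external table in Athena that points at the Parquet prefix
athena = boto3.client("athena")
ddl = f"""
CREATE EXTERNAL TABLE IF NOT EXISTS ped_loc_data (
    sensor_id int,
    sensor_name string,
    hourly_count int
)
STORED AS PARQUET
LOCATION 's3://{BUCKET}/parquet/ped_loc_data/'
"""
athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "default"},                             # assumed database
    ResultConfiguration={"OutputLocation": f"s3://{BUCKET}/athena-results/"},  # assumed prefix
)
```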
# Running
python3 process_sensor_data.py
# Testing
pytest tests.py
# Dashboard
A very basic Tableau dashboard, connected to this data in AWS Athena, has been published at:
https://public.tableau.com/app/profile/siddharth.bose/viz/TopNpedSensorsMelb/TopNSensorsbypedestriancount