In this project, our team generated 70 days of noisy synthetic traffic data (550 GB) for a region of Manhattan. Our goal was to embed certain insights into the data and use Apache Spark libraries (Spark SQL, MLlib, GraphX) to recover them. Running our Spark code on a distributed Hadoop cluster, we did the following:
- Classified vehicles as cars or buses
- Forecasted the density of traffic on each street segment in our chosen region of Manhattan
- Recommended a shortest path for a car to take from a start node to an end node
Our method and results can be found in the file: BDAD Violet Noise Presentation.pdf.
.
├── Research
├── sumoDataGeneration
├── runscripts
├── ScalaETL
├── NoiseGenerator
├── edgeWeightForecast
└── VehicleClassification
- Research: Domain research, simulation experiments, and unpackaged testing scripts for Scala development.
- sumoDataGeneration: Final simulation code that was run on the Greene cluster.
- runscripts: Scripts used throughout development to manage the data and HDFS storage. Also contains the script for the final run-through example.
- ScalaETL: Raw data processing whose output is used by multiple downstream insights.
- NoiseGenerator: Adds noise on top of the output data to further bury the insights.
- edgeWeightForecast: All feature generation, model development, and graph algorithms used to forecast edge weights and recommend a shortest path.
- VehicleClassification: All feature generation and model development used to classify vehicle types in the simulation.
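
To illustrate the idea of recommending a shortest path over forecast edge weights, here is a minimal sketch in plain Scala. It is not the project's code: it uses Dijkstra's algorithm on an in-memory edge list rather than GraphX, and the `Edge` type, the `shortestPath` function, and the example node names and weights are all hypothetical stand-ins for forecast street-segment costs.

```scala
import scala.collection.mutable

object ShortestPathSketch {
  // Hypothetical edge type: a directed street segment with a forecast traversal cost.
  final case class Edge(from: String, to: String, weight: Double)

  /** Dijkstra's algorithm; returns (total cost, node sequence), or None if `end` is unreachable. */
  def shortestPath(edges: Seq[Edge], start: String, end: String): Option[(Double, List[String])] = {
    require(edges.forall(_.weight >= 0.0), "Dijkstra needs non-negative weights")
    val adjacency: Map[String, Seq[Edge]] = edges.groupBy(_.from)
    val dist = mutable.Map(start -> 0.0)           // best known cost to each node
    val prev = mutable.Map.empty[String, String]   // predecessor on the best known path
    val done = mutable.Set.empty[String]           // nodes whose cost is final
    // PriorityQueue is a max-heap, so reverse the ordering to pop the cheapest node first.
    val queue = mutable.PriorityQueue((0.0, start))(
      Ordering.by[(Double, String), Double](_._1).reverse)

    while (queue.nonEmpty) {
      val (d, node) = queue.dequeue()
      if (!done(node)) {
        done += node
        for (e <- adjacency.getOrElse(node, Seq.empty)) {
          val candidate = d + e.weight
          if (candidate < dist.getOrElse(e.to, Double.PositiveInfinity)) {
            dist(e.to) = candidate
            prev(e.to) = node
            queue.enqueue((candidate, e.to))
          }
        }
      }
    }

    dist.get(end).map { cost =>
      // Walk predecessor links back from the end node to rebuild the path.
      var path = List(end)
      while (path.head != start) path = prev(path.head) :: path
      (cost, path)
    }
  }

  def main(args: Array[String]): Unit = {
    // Made-up example graph; weights stand in for forecast segment costs.
    val edges = Seq(
      Edge("A", "B", 4.0), Edge("A", "C", 1.0),
      Edge("C", "B", 2.0), Edge("B", "D", 5.0)
    )
    println(shortestPath(edges, "A", "D"))
  }
}
```

On a cluster-sized road graph, a distributed approach (for example one built on GraphX) would likely be preferable to this single-machine version; the sketch only shows how forecast weights feed into the path choice.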