Tom Schenk Jr edited this page Dec 8, 2015 · 20 revisions

Project Workflow

A kanban board is located at Waffle.io. The project contains the following high-level tasks:

  • Combine the raw data files of the lab tests for 2008 and beyond.
  • Create a variable indicating lab results above the acceptable threshold.
  • Clean up advisories from DrekBeach and remove advisories not caused by high predicted values of E. coli.
  • Merge (cleaned) advisories from above with lab results to determine if the advisory was correct.
  • Determine the baseline performance of the current model.
  • Create alternative models.
  • Use a test-train framework to compare the performance of the new models with the baseline.

Replication

Split the raw data from the Excel workbooks into individual CSVs:

python "e-coli-beach-predictions/data/ChicagoParkDistrict/raw/Standard 18 hr Testing/split_sheets.py"

Stack the sheets into a single CSV file for each year:

csvstack 2006\ *.csv > 2006.csv
csvstack 2007\ *.csv > 2007.csv
csvstack 2008\ *.csv > 2008.csv
csvstack 2009\ *.csv > 2009.csv
csvstack 2010\ *.csv > 2010.csv
csvstack 2011\ *.csv > 2011.csv
csvstack 2012\ *.csv > 2012.csv
csvstack 2013\ *.csv > 2013.csv
csvstack 2014\ *.csv > 2014.csv
csvstack 2015\ *.csv > 2015.csv
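
The ten commands above follow one pattern, so they could also be written as a loop. The following is a minimal sketch, not part of the original workflow: the function name stack_years is hypothetical, and it assumes that csvstack (from csvkit) is on the PATH and that the per-sheet files in the current directory are named "<year> <sheet>.csv", as the glob patterns above imply.

```shell
# Hypothetical helper: stack the per-sheet CSVs of each year into <year>.csv.
# Usage: stack_years FIRST_YEAR LAST_YEAR
stack_years() {
  year=$1
  last=$2
  while [ "$year" -le "$last" ]; do
    # "$year "*.csv matches files such as "2006 <sheet>.csv";
    # the space keeps the output file 2006.csv out of the match.
    csvstack "$year "*.csv > "$year.csv"
    year=$((year + 1))
  done
}
```

Calling stack_years 2006 2015 should produce the same ten annual files as the individual commands.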

Then, combine the annual files into a single file:

csvstack 2006.csv 2007.csv 2008.csv 2009.csv 2010.csv 2011.csv 2012.csv 2013.csv 2014.csv 2015.csv > beach_lab_readings.csv
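
After combining, it may be worth confirming that no rows were lost. The sketch below is an assumption-laden check rather than part of the original workflow: the function name check_rows is hypothetical, and it assumes every file has exactly one header line, ends with a newline, and contains no line breaks inside quoted fields (otherwise wc -l miscounts rows).

```shell
# Hypothetical helper: compare the data-row count of the combined file
# with the sum of the data-row counts of the annual files.
# Usage: check_rows COMBINED.csv ANNUAL1.csv [ANNUAL2.csv ...]
check_rows() {
  combined=$1
  shift
  expected=0
  for f in "$@"; do
    rows=$(( $(wc -l < "$f") - 1 ))   # subtract the header line
    expected=$((expected + rows))
  done
  actual=$(( $(wc -l < "$combined") - 1 ))
  if [ "$actual" -eq "$expected" ]; then
    echo "OK: $actual data rows"
  else
    echo "MISMATCH: expected $expected data rows, found $actual" >&2
    return 1
  fi
}
```

For example, check_rows beach_lab_readings.csv 2006.csv 2007.csv 2008.csv 2009.csv 2010.csv 2011.csv 2012.csv 2013.csv 2014.csv 2015.csv reports a mismatch if the combined file and the annual files disagree.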