
Roadmap Nov. 13 2015


Introduction

Periodically, I (@willengler) would like to post in this wiki to let stakeholders, users, and anyone else who's interested see what the development team is planning for Plenario.

Current Priorities

  • Improving performance
  • Supporting a broader set of data

Planned Work through Jan. 2016

  1. Expose new shape features in Explorer.
  2. Restructure database to improve query performance.
  3. Modify ETL for point data to eliminate duplicates and remove unique ID constraint.

Expose new shape features

I added experimental shapefile ingestion and endpoints in early November. Now @lw334 is working on the Backbone frontend to display summary information about shape datasets after a user draws a query polygon on the Leaflet map, and to let users inspect individual shape datasets.

Restructure database

I'm going to try to eliminate Master Table, the table that holds a record for every row of every point dataset in Plenario. The /timeseries endpoint uses Master Table to generate counts of points that fit the user's query. The /detail and /grid endpoints also use Master Table to filter before joining to the point dataset. I suspect removing Master Table as a middleman will improve query speed. Of course, I'll benchmark the effects of bypassing Master Table on /timeseries and /detail to be sure.
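To make the change concrete, here's a rough sketch of how a /timeseries count resolves before and after. All table and column names below (dat_master, chicago_crime, obs_date, and so on) are illustrative, not our actual schema:

```sql
-- Roughly how /timeseries counts points today, through the shared Master Table:
SELECT date_trunc('week', obs_date) AS bucket, count(*)
FROM dat_master
WHERE dataset_name = 'chicago_crime'
  AND obs_date BETWEEN '2015-01-01' AND '2015-11-01'
  AND ST_Within(location, ST_GeomFromGeoJSON('...'))  -- user's query polygon
GROUP BY bucket
ORDER BY bucket;

-- After the restructuring, the same count would come straight from the point
-- table, using its own indexed date and geometry columns:
SELECT date_trunc('week', point_date) AS bucket, count(*)
FROM chicago_crime
WHERE point_date BETWEEN '2015-01-01' AND '2015-11-01'
  AND ST_Within(geom, ST_GeomFromGeoJSON('...'))  -- user's query polygon
GROUP BY bucket
ORDER BY bucket;
```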

The new database structure will distinguish between types of datasets. To start, we will have point and shape datasets. Each dataset type will get its own "meta table" that serves as a registry for datasets of that type. So if Plenario has x shape datasets, there will be x shape tables, and the shape meta table will have x rows.
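A minimal sketch of what the registries could look like (names are assumptions, not a final schema):

```sql
-- One meta table per dataset type acts as a registry. Registering a shape
-- dataset adds a row here and creates a matching physical shape table.
CREATE TABLE meta_shape (
    dataset_name TEXT PRIMARY KEY,  -- also the name of the physical shape table
    human_name   TEXT,
    source_url   TEXT,
    last_update  TIMESTAMP
);

-- Point datasets get the same treatment in their own registry.
CREATE TABLE meta_point (
    dataset_name TEXT PRIMARY KEY,
    human_name   TEXT,
    source_url   TEXT,
    last_update  TIMESTAMP
);
```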

Master Table's other important role is storing references to tables that individual points can join to (e.g., weather, census blocks). Adding an hstore column on both the individual tables and the meta tables can fill this role in Master Table's absence. The hstore column on each meta table will indicate which connections have been precomputed. The hstore column on the individual datasets will store the references themselves. For example, say the chicago_crime point dataset has a weather observation associated with each crime observation. Then the chicago_crime entry in the meta table will have an hstore key indicating that WEATHER_HOURLY should be present, and each entry in the chicago_crime dataset will have a key WEATHER_HOURLY pointing to some weather observation id. Using Postgres's key-value store this way lets us establish flexible relationships between datasets without updating our schema. This should pave the way for interesting connections between point and shape datasets.
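Here's a minimal sketch of the hstore approach, assuming illustrative table names (meta_point, chicago_crime, weather_observations_hourly):

```sql
CREATE EXTENSION IF NOT EXISTS hstore;

-- On the meta table, the hstore records which joins have been precomputed.
ALTER TABLE meta_point ADD COLUMN precomputed hstore;
UPDATE meta_point
   SET precomputed = hstore('WEATHER_HOURLY', 'true')
 WHERE dataset_name = 'chicago_crime';

-- On the point table, the hstore holds the actual references.
ALTER TABLE chicago_crime ADD COLUMN refs hstore;
UPDATE chicago_crime
   SET refs = hstore('WEATHER_HOURLY', '12345');  -- id of a weather observation

-- Joining through the key-value column (hstore values are text, so cast):
SELECT c.*, w.*
  FROM chicago_crime c
  JOIN weather_observations_hourly w
    ON w.id = (c.refs -> 'WEATHER_HOURLY')::integer;
```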

Modify ETL

Assuming the database restructuring goes well, I'll improve our point ETL. Not needing to duplicate data across the point datasets and Master Table will make this task easier. For point datasets, we currently require that users specify a column that will serve as a unique identifier. We've found that's too restrictive. I'll add support for composite IDs, where the user can specify that some combination of two or more columns is unique.
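For example, if no single column of a dataset is unique, the user could point us at a pair of columns that is. The column names below are hypothetical:

```sql
-- Enforce uniqueness on the combination rather than on a single column.
ALTER TABLE chicago_crime
  ADD CONSTRAINT chicago_crime_composite_id UNIQUE (case_number, obs_date);

-- The ETL could also derive a single surrogate id by hashing the parts,
-- which keeps downstream code that expects one id column working.
SELECT md5(coalesce(case_number, '') || '|' || coalesce(obs_date::text, ''))
       AS surrogate_id
  FROM chicago_crime;
```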

The desired update behavior for our ETL is an "upsert". One of the big sources of complexity in our current ETL is that we maintain duplicate records (same ID, at least one differing attribute value) and mark them with a dup_ver column. The intended behavior is to return only the record with the highest dup_ver, but that behavior was never fully implemented. Instead, we return all duplicates, making it look like there are more observations than there really are in the source dataset. I'd like to eliminate storage of duplicate rows altogether. Duplicate tracking could have enabled interesting history-tracking features, but it's not a priority.
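One way to get upsert semantics without storing duplicates is the classic staging-table pattern, sketched below with hypothetical names. (PostgreSQL 9.5's upcoming INSERT ... ON CONFLICT DO UPDATE would be another option once we can rely on it.)

```sql
BEGIN;

-- A fresh extract lands in an empty staging table.
COPY staging_chicago_crime FROM '/tmp/chicago_crime.csv' WITH (FORMAT csv, HEADER);

-- Drop existing rows that the fresh extract supersedes...
DELETE FROM chicago_crime c
 USING staging_chicago_crime s
 WHERE c.id = s.id;

-- ...then insert the fresh versions, so each id appears exactly once.
INSERT INTO chicago_crime
SELECT * FROM staging_chicago_crime;

COMMIT;
```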

I'd also like to work pgloader into our ETL process so that a single bad row doesn't sink our entire COPY operation.
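pgloader keeps going when individual rows fail, writing rejects to a separate file instead of aborting the whole load. A rough sketch of a load command (the path, connection string, and options here are assumptions):

```
LOAD CSV
     FROM '/tmp/chicago_crime.csv'
     INTO postgresql://localhost/plenario?chicago_crime
     WITH skip header = 1,
          fields optionally enclosed by '"',
          fields terminated by ','
;
```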