Ben Best edited this page Mar 5, 2014 · 12 revisions

Tracking data quality

Purpose: determine the level to which individual country scores are reliable based on whether and how the values were gap-filled (and year mismatches?).

The types of gapfilling we used:

  • average
    • what spatial level: neighbors, sovereign country, region, world
  • disaggregate model (i.e. value split from sovereign country using some form of weighting, e.g. relative proportion of coastal population)
  • modeled (trash as proportional to population density?)

Level of gap-filling:

  • data set
  • data layer
  • score

Proposed process:

  • generate a label for the data layer that describes type and level of gapfilling
  • generate an additional label if the gapfilling occurs during further calculations -> this needs to be embedded in the toolbox
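
As an illustration of the averaging type and the label idea above, here is a minimal sketch. It assumes the spatial levels are tried in the order listed (neighbors, sovereign country, region, world) and that the label joins type and level with "|". The function name `gapfill_average`, the `groups` structure, the "raw" and "missing" labels, and the sample values are all hypothetical, not part of the toolbox.

```python
from statistics import mean

# Assumed fallback order, taken from the spatial levels listed above.
LEVELS = ["neighbors", "sovereign country", "region", "world"]

def gapfill_average(country, values, groups):
    """Return (value, label) for one country.

    values: dict country -> observed value (countries without data are absent)
    groups: dict level -> dict country -> list of donor countries to average
    """
    if country in values:
        return values[country], "raw"  # no gap-filling needed
    for level in LEVELS:
        donor_names = groups.get(level, {}).get(country, [])
        donors = [values[c] for c in donor_names if c in values]
        if donors:
            # Label records both the type (average) and the spatial level used.
            return mean(donors), f"average|{level}"
    return None, "missing"

# Made-up data: country C has no value, and its only neighbor D has none either.
values = {"A": 10.0, "B": 20.0}
groups = {
    "neighbors": {"C": ["D"]},
    "region": {"C": ["A", "B"]},
}
value, label = gapfill_average("C", values, groups)
```

Returning the label together with the value means the gap-filling function itself produces the record, so nothing has to be reconstructed afterwards.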

Year mismatches – matter when:

  • the score is a ratio of two data layers from different years (e.g. employees in the tourism sector/workforce size)
  • a score uses a spatial reference point such as a global average or max (e.g. wages - countries with less recent data are at a disadvantage)
  • comparing scores across countries with data from very different years
  • Any other ways that year mismatches may affect interpretation of results?

Is there an easy way to know what year is associated with a country’s goal score? How should we keep track of this? Maybe just for certain goals, e.g. Livelihoods.
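
One possible way to keep track of years, sketched under the assumption that each data layer stores a (value, year) pair per country. The function `ratio_with_years` and all numbers are made up for illustration; the ratio mirrors the tourism employees / workforce size example above.

```python
def ratio_with_years(numerator, denominator):
    """Compute a ratio per country and carry the source years along.

    numerator, denominator: dict country -> (value, year)
    """
    out = {}
    # Only countries present in both layers can get a ratio.
    for country in numerator.keys() & denominator.keys():
        num_value, num_year = numerator[country]
        den_value, den_year = denominator[country]
        out[country] = {
            "ratio": num_value / den_value,
            "years": (num_year, den_year),
            "year_gap": abs(num_year - den_year),  # 0 means no mismatch
        }
    return out

# Made-up values, for illustration only.
tourism_jobs = {"Eritrea": (50.0, 2009), "Neverland": (30.0, 2011)}
workforce = {"Eritrea": (1000.0, 2011), "Neverland": (600.0, 2011)}
result = ratio_with_years(tourism_jobs, workforce)
```

Here `result["Eritrea"]["year_gap"]` is 2, which flags the mismatch, while Neverland's gap is 0.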

Provenance

Data Structure

  • field: value_whence_v01 column with basic data types. This can get tallied up by the toolbox. See 2013 documentation. Still to decide: how it is carried through. Shapefile field names are limited to 10 characters, so we could have a small lookup table to go back and forth. Suggest breaking up procedures with a delimiter (like "|"). Associate with functions for each procedure.
  • uncertainty_v01 having some free text format like: [measure]: [value]. [description]. Some datasets have only a point estimate whereas others get gapfilled.
  • file: layername_whence-v01.csv

    • format for details & children
    • incorporate to gapfilling functions
    • spatial_id_output, whence procedure, whence procedure order, arguments, input vs output, spatial_id_input, uncertainty measure associated
  • similarity. ecological vs political.
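
A minimal sketch of how a "|"-delimited whence value could be parsed and tallied, plus a lookup between full and shapefile-safe field names. The short names in `FIELD_LOOKUP`, the helper names, and the sample regions are assumptions for illustration. The procedure names reuse the gapfilling types listed earlier.

```python
from collections import Counter

# Hypothetical lookup: full field name -> name of at most 10 characters,
# short enough for a shapefile attribute table.
FIELD_LOOKUP = {
    "value_whence_v01": "whence_v01",
    "uncertainty_v01": "uncert_v01",
}

def parse_whence(whence):
    """Split a whence string such as 'average|modeled' into procedures."""
    return [part.strip() for part in whence.split("|") if part.strip()]

def tally_whence(whence_by_region):
    """Count how often each procedure occurs across regions."""
    counts = Counter()
    for whence in whence_by_region.values():
        counts.update(parse_whence(whence))
    return counts

# Made-up regions; an empty string means no gap-filling was recorded.
whence_by_region = {"r1": "average|modeled", "r2": "average", "r3": ""}
counts = tally_whence(whence_by_region)
```

Keeping the order of procedures in the string also preserves the order in which they were applied, which the details file could then expand.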

Toolbox report

  • Per reporting region

Examples

  • Uncertainty
    • FIS B/B_msy
    • AO: avg GDP based on linear model

Document

  • See section 2013 documentation

KLo's notes

Questions:

  • how to label input data files when gap-filling occurred at that stage?
  • How to combine this label if subsequent gap-filling operations occur? (e.g. Darren just had “raw”, “modeled”, “mixed”)

Aspects of gap-filling:

  1. just counting number of steps (cumulative number of gap-filling steps)
  2. qualifying the type of step (use the acronyms)
  3. actual sources identified (perhaps later)

Run through some examples to test the procedure and the acronyms. Keep a loose text format for statistical uncertainty. We can add a “whence” column that has a value per region.

| Country | DatasetFIS_1 | DatasetFIS_2 | DatalayerFIS | DatalayerMAR_1 | ... | Score |
|---|---|---|---|---|---|---|
| Eritrea | 0 | TG | SG1 | SG2 | ... | =0+1+1+2+… |
| Neverland | 0 | TG | SG1 | SG2 | ... | =0+1+1+2+… |

The table would have as many columns as the total sum of individual datasets, datalayers and scores we gapfill
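
The cumulative count in the Score column could be computed along these lines. This is only a sketch: the step counts per code are an assumption read off the example row (0, TG, SG1, SG2 summing as 0+1+1+2), and `STEPS` and `cumulative_steps` are hypothetical names.

```python
# Assumed number of gap-filling steps per code, inferred from "=0+1+1+2+…".
STEPS = {"0": 0, "TG": 1, "SG1": 1, "SG2": 2}

def cumulative_steps(codes):
    """Sum the gap-filling steps over all datasets and data layers of a country."""
    return sum(STEPS[code] for code in codes)

# Only the four columns shown in the example; the real table has more.
table = {
    "Eritrea": ["0", "TG", "SG1", "SG2"],
    "Neverland": ["0", "TG", "SG1", "SG2"],
}
scores = {country: cumulative_steps(codes) for country, codes in table.items()}
```

An unknown code raises a `KeyError`, which would surface any acronym that has not been added to the lookup yet.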

We can script the gapfilling functions to generate an independent file, to get a list of operations in the “basic” whence column, then broken down in the details column.
