jules32 edited this page Mar 14, 2014 · 12 revisions

Tracking data quality

Purpose: determine how reliable individual country scores are, based on whether and how the values were gap-filled (and possibly on year mismatches?).

The types of gapfilling we used:

  • average
    • what spatial level: neighbors, sovereign country, region, world
  • disaggregate model (i.e. value split from sovereign country using some form of weighting, e.g. relative proportion of coastal population)
  • modeled (trash as proportional to population density?)
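As a rough sketch of the first two gap-filling types listed above, the snippet below averages over progressively larger groupings and splits a sovereign-country value by weights. The function names, the dictionary-based inputs and the "subregion"/"world" level names are assumptions made for illustration; this is not the toolbox's implementation.

```python
from statistics import mean


def fill_by_fallback(region, values, groups_by_level):
    """Return (value, level) for `region`.

    values: dict of region -> observed value (regions without data are absent).
    groups_by_level: list of (level_name, dict of region -> group id), ordered
    from the most preferred spatial level to the least preferred one.
    """
    if region in values:
        return values[region], "original"
    for level_name, group_of in groups_by_level:
        group = group_of.get(region)
        if group is None:
            continue
        # Average every region in the same group that has data.
        peers = [v for r, v in values.items() if group_of.get(r) == group]
        if peers:
            return mean(peers), level_name
    return None, "unfilled"


def disaggregate(total, weights):
    """Split a sovereign-country total across regions in proportion to weights
    (for example, coastal population)."""
    weight_sum = sum(weights.values())
    return {region: total * w / weight_sum for region, w in weights.items()}
```

Returning the level name alongside the value keeps the "what spatial level" information available for labeling later on.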

Level of gap-filling:

  • data set
  • data layer
  • score

Proposed process:

  • generate a label for the data layer that describes type and level of gapfilling
  • generate an additional label if the gapfilling occurs during further calculations -> this needs to be embedded in the toolbox

Year mismatches – matter when:

  • the score is a ratio of two data layers from different years (e.g. employees in the tourism sector/workforce size)
  • a score uses a spatial reference point such as a global average or max (e.g. wages - countries with less recent data are at a disadvantage)
  • comparing scores across countries with data from very different years
  • Any other ways that year mismatches may affect interpretation of results?

Is there an easy way to know what year is associated with a country’s goal score? How do we keep track of this? Maybe just for certain goals, e.g. Livelihoods?
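One possible way to keep track of this for ratio-based scores is to carry the year alongside each value and flag mismatches when the ratio is computed. The sketch below assumes each layer is a dictionary of country -> (year, value); the function name, the `max_gap` parameter and the output fields are hypothetical.

```python
def ratio_with_year_gap(numerator, denominator, max_gap=0):
    """Compute numerator/denominator per country and record the years used.

    numerator, denominator: dict of country -> (year, value).
    max_gap: largest difference in years that is not flagged as a mismatch.
    """
    out = {}
    for country in numerator.keys() & denominator.keys():
        num_year, num_value = numerator[country]
        den_year, den_value = denominator[country]
        gap = abs(num_year - den_year)
        out[country] = {
            "ratio": num_value / den_value,
            "years": (num_year, den_year),
            "year_gap": gap,
            "mismatch": gap > max_gap,
        }
    return out
```

The same year pair could be written out next to the goal score, which would answer the "what year is this score from" question at least for goals such as Livelihoods.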

Provenance

Data Structure

  • field: value_whence_v01, a column with basic data types. This can get tallied up by the toolbox. See the 2013 documentation, and then how it is carried through. Shapefiles have a 10-character limitation, so we could have a little lookup table to go back and forth. Suggest separating procedures with a delimiter (like "|"). Associate each procedure with a function.
  • field: uncertainty_v01, with a free-text format like: [measure]: [value]. [description]. Some datasets have only a point estimate, whereas others get gap-filled.
  • file: layername_whence-v01.csv

    • format for details & children
    • incorporate to gapfilling functions
    • spatial_id_output, whence procedure, whence procedure order, arguments, input vs output, spatial_id_input, uncertainty measure associated
  • similarity. ecological vs political.
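A minimal sketch of the delimiter and lookup-table ideas above. It assumes the 10-character limitation refers to shapefile (DBF) attribute field names, and the short names, function names and example codes are hypothetical.

```python
from collections import Counter

DELIMITER = "|"

# Hypothetical short names (at most 10 characters) for the long field names.
FIELD_LOOKUP = {
    "value_whence_v01": "whence_v01",
    "uncertainty_v01": "uncert_v01",
}
REVERSE_LOOKUP = {short: long for long, short in FIELD_LOOKUP.items()}


def encode_whence(procedures):
    """Join procedure codes, in the order they were applied, into one string."""
    return DELIMITER.join(procedures)


def decode_whence(whence):
    """Split a whence string back into its ordered list of procedure codes."""
    return whence.split(DELIMITER) if whence else []


def tally(whence_values):
    """Count how often each procedure code appears across a whence column."""
    counts = Counter()
    for whence in whence_values:
        counts.update(decode_whence(whence))
    return counts
```

For example, `tally(["OD", "TF|SG", "SG"])` counts SG twice and OD and TF once each, which is the kind of summary the toolbox could report per layer.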

Toolbox report

  • Per reporting region

Examples

  • Uncertainty
    • FIS B/B_msy
    • AO: avg GDP based on linear model

Document

  • See section 2013 documentation

KLo's notes

Questions:

  • how to label input data files when gap-filling occurred at that stage?
  • How to combine this label if subsequent gap-filling operations occur? (e.g. Darren just had “raw”, “modeled”, “mixed”)

Aspects of gap-filling:

  1. just counting number of steps (cumulative number of gap-filling steps)
  2. qualifying the type of step (use the acronyms)
  3. actual sources identified (perhaps later)

Run through some examples to test the procedure and the acronyms. For statistical uncertainty, keep a loose text format. We can add a “whence” column that has a value per region.

| Country | DatasetFIS_1 | DatasetFIS_2 | DatalayerFIS | DatalayerMAR_1 | ... | Score |
|---|---|---|---|---|---|---|
| Eritrea | 0 | TG | SG1 | SG2 | ... | =0+1+1+2+… |
| Neverland | 0 | TG | SG1 | SG2 | ... | =0+1+1+2+… |

The table would have as many columns as the total sum of individual datasets, datalayers and scores we gapfill.

Can script the gapfilling functions to generate an independent file, to get a list of operations in the “basic” whence column, then broken down in the details column.
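The example rows above sum a step count per column (=0+1+1+2+…). A small sketch of that tally follows; the weight assigned to each code is an assumption read off the example row (0 counts as 0, TG and SG1 as 1, SG2 as 2), and the function name is hypothetical.

```python
# Assumed number of gap-filling steps implied by each label in the example row.
STEP_WEIGHTS = {"0": 0, "TG": 1, "SG1": 1, "SG2": 2}


def cumulative_steps(row, columns):
    """Sum the gap-filling steps over the dataset and data layer columns."""
    return sum(STEP_WEIGHTS[row[column]] for column in columns)


columns = ["DatasetFIS_1", "DatasetFIS_2", "DatalayerFIS", "DatalayerMAR_1"]
eritrea = {
    "DatasetFIS_1": "0",
    "DatasetFIS_2": "TG",
    "DatalayerFIS": "SG1",
    "DatalayerMAR_1": "SG2",
}
steps = cumulative_steps(eritrea, columns)  # 0 + 1 + 1 + 2
```

Only the four columns shown in the example are summed here; a real table would include every dataset, data layer and score that was gap-filled.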

Categories

from GapfillingCategories.xlsx sent by JLo Mar 5, 2014:

| code | name | category | description_SOM_v2013 |
|---|---|---|---|
| OD | original data | original | the original data, un-modified, to differentiate from other whence categories. |
| TP | previous year | temporal | the value from the previous year is used to replace the current year’s value. This approach assumes no change in the past 2 years and was implemented in cases where the current year could have been missing due to a delay in reporting at the time the Index was calculated. This approach was only implemented for the natural products goal (i.e. for harvested tonnage of each product), and for the mariculture subgoal (i.e. for harvested tonnage of each species). |
| TF | fitted values | temporal | the available data were used to fit a linear model to the time series and predict missing values. Data within a 10-year window centered on the gap year (i.e. ± 5 years) were used as input in the fitted model. When the missing year was less than 5 years from the most recent year in the data set, the window was shifted to still include 10 years of data even though it was no longer centered upon the missing year. Temporal gap-filling of this kind was done when at least 2 years of data were available. |
| TF10 | fitted values for data older than 10 years | temporal | in the cases of livelihoods & economies, the goals based on habitats, i.e. coastal protection, carbon storage and biodiversity, and the monetary value data for natural products, due to the scarcity of data available, the 10 year rule was relaxed so as to include older data. For more details see Halpern et al. (2012), and see the sections 4.3 and 5.53 on natural products value data. |
| SG | georegional | spatial | in general, we assumed nearby regions (with data) could serve as reasonable proxies for a region missing data, and so we averaged values from geographically nearby regions to fill the gap. We used two levels of spatial aggregation to determine which regions defined ‘nearby’, derived from United Nations definitions of geopolitical regions (Table S7). The first level aggregates geographically closer regions (preferred), while the second defines much larger regions, in some cases coinciding with entire continents (used only when no countries within the ‘first level’ aggregation had data). |
| SCG | sovereign country + georegional | spatial | often data were missing for small remote islands. Several of these are under the governance of distant countries that would not fall within the same georegion. For institutional and socioeconomic data, we assumed that offshore domains would have more in common with their administrative country than with geographically closer regions. In these cases, the values from the administrative country were used to gap-fill when present, otherwise the georegional averages were used as described above. |
| SH | habitat regions | spatial | for goals using habitat data (i.e., natural products, carbon storage, coastal protection, and biodiversity), when the habitat extent data indicated that a given habitat was present, but data on its condition was missing, geo-ecological regional averages were used specific to each habitat type (see Halpern et al. 2012, Selig et al. 2013 for descriptions of these regions). Because no habitat data could be updated for this current assessment, we did not need to repeat this method, but its implications for results remain. |
| XSI | southern islands | special | For a group of small, remote islands found in the Southern oceans (see Table S6), data are often missing. Due to their remote location, a spatial gap-filling approach would result in values from very distant regions, that may have no similarities with these islands, being used to gap-fill, thus leading to biased scores. For the tourism & recreation and coastal livelihoods & economies goals, we assume that these scarcely inhabited areas do not develop either aspect and these goals therefore drop out. For the artisanal fishing opportunities goal, on the other hand, we assigned a perfect score because we assume that there is need for it and that it is fully satisfied, since legislative or economic constraints on people’s access to artisanal fishing are unlikely in these regions. Note that this only applies to the southern islands that are inhabited (Table S6); uninhabited Southern Islands get no score as do all other uninhabited regions. |
| XP | using other data as a proxy | special | for the evaluation of trends in pesticides, fertilizers and trash pollution, it was assumed that the relative rate of change would mimic that of population. Hence, this was used when data to calculate these trends were missing. On the other hand, the fertilizer and pesticide consumption used as input values to calculate pressures were gap-filled by using a linear relationship between these two layers. |
| XN | new reporting regions | special | Some gap-filling was necessary even when the data sources used were identical to those in the 2012 calculation (i.e., no updated data were available), due to the presence of new, better resolved, reporting regions added for this 2013 calculation (see section 3). Most of the new reporting regions were offshore territorial holdings that in 2012 had been aggregated with their administrative country. For spatially explicit data sources this was not an issue, as we simply re-calculated the values using the new regional delimitations (i.e., all calculations based on habitat coverage, including the soft bottom layer, the exposure layers for the natural products status, the relative weights assigned to the pressure and resilience matrices pertinent to habitat-based goals, most of the pressure layers). However, tabular data were not always reported at the scale of the smaller reporting regions added in 2013 (e.g., trash, targeted harvest, artisanal fishing high & low bycatch, habitat destruction of subtidal hard-bottom, etc.). In these cases, values from 2012 aggregated reporting regions were disaggregated for corresponding 2013 reporting regions using one of three approaches: a. identical value assigned, for example with some regulatory measures used in resilience measures (the World Governance Indicators (WGI), the Convention on Biological Diversity (CBD), the Global Competitiveness Index (GCI), the Mariculture Sustainability Index (MSI), the Convention on International Trade in Endangered Species of Wild Fauna and Flora (CITES), alien species regulations, and sector evenness); for the status and trend of habitat-based goals (habitat-specific “health” condition and its trend); for artisanal opportunities (access based on regulations reported by Mora et al. 2009); for pressures (artisanal fishing with high bycatch, and targeted harvest); b. weighted by the relative proportions of coastal population (namely: revenue, number of jobs, adjusted workforce size, pressure from artisanal fishing with low bycatch, and from intertidal habitat destruction); c. weighted by the relative proportions of corresponding EEZ area (e.g. alien species pressure). |
| XA | alternate data sources | special | for total population, data were not reported in 59 of the 2013 reporting regions. For these cases we manually searched Wikipedia for population estimates to fill in the missing values. |
| XH | by hand | special | when values are entered by hand. For example, when North Korea is assigned the minimum score of all other regions (example: WEF GCI data). |
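To make the TF rule above concrete (a linear fit over a window of ± 5 years around the gap year, shifted back when the gap is close to the most recent year, and applied only when at least 2 years of data are available), here is a minimal sketch. The function names and the dictionary input are assumptions, and the least-squares fit is written out by hand rather than taken from the code used for the 2013 assessment.

```python
def window_years(gap_year, last_year, half_width=5):
    """Return (start, end) of the fitting window, shifted back so that it
    never extends past the most recent year in the data set."""
    start, end = gap_year - half_width, gap_year + half_width
    if end > last_year:
        shift = end - last_year
        start, end = start - shift, end - shift
    return start, end


def fit_line(points):
    """Ordinary least-squares fit of value ~ year; returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fill_year(series, gap_year, min_points=2):
    """Predict the value for gap_year from a dict of year -> value.

    Returns None when fewer than min_points years fall inside the window."""
    start, end = window_years(gap_year, max(series))
    points = [(year, value) for year, value in series.items() if start <= year <= end]
    if len(points) < min_points:
        return None
    slope, intercept = fit_line(points)
    return slope * gap_year + intercept
```

A caller would label the filled value TF in the whence column; relaxing the window for older data (TF10) would need a different `half_width` or no window at all.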