Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

managements value constraints: needed cleanup and decisions #4

Closed
7 of 18 tasks
dlebauer opened this issue Sep 13, 2018 · 1 comment
Closed
7 of 18 tasks

managements value constraints: needed cleanup and decisions #4

dlebauer opened this issue Sep 13, 2018 · 1 comment

Comments

@dlebauer
Copy link
Member

@gsrohde commented on Mon Feb 16 2015

citation_id

  • Decide if this should be not NULL
  • If so, clean up violators [UNCHECKED THIS 9/11/2015: 16 rows still have null citation_id]
  • Decide on consistency constraint
    • not necessary / applicable (see notes added below)
  • Clean up data if needed
  • Add to migration

date, dateloc

  • Decide what constraints to use
    • dateloc is defined in the data entry documentation. I think integer between 0-100 is sufficient. [COMMENT 9/11/2015: The data entry documentation only specifies meanings for the following values: 5, 6, 7, 8, 9, 95, 96, 97. Does it really make sense to allow anything between 0 and 100? There are currently many outliers. See comment below.]
  • Clean up data if needed
  • Add to migration

mgmttype

  • Decide on what constraints to use
  • Clean up data if needed [UNCHECKED THIS 9/11/2015: many rows have values for this that aren't values the Rails app offers. See comment below.]
  • Expand table in Rails app if needed (#341)
  • Add constraints to migration

level

  • Decide if any meaningful constraints are possible
    • NULL or numeric
  • AND non-negative? Even if this isn't made a database constraint, we can validate this easily in Rails.

units

  • Decide if any meaningful constraints are possible
    • could require fertilizer* to be kg ha-1 and fertilizer*rate to be kg ha-1 y-1 or g m-2 y-1
  • Clean up data if needed [UNCHECKED THIS 9/15/2015: THERE CERTAINLY IS MUCH CLEANUP THAT CAN AND SHOULD BE DONE. FOR EXAMPLE, ENTRIES LIKE g/m-2 ARE CERTAINLY INCORRECT. AT LEAST WE KNOW THIS WAS PROBABLY MEANT TO BE g m-2 or g/m2. OTHER ENTRIES ARE MUCH MORE AMBIGUOUS (e.g. g/m-2, g/m-1).
  • Add constraints to migration if needed
  • If meaningful constraints aren't feasible, using an autocompletion list containing the list of existing values (but allowing others) would at least help limit the number of variants. Decide if this should be done, and if so, implement.

Details

citation_id

A management is related to one or more citations indirectly via the managements_treatments and citations_managements tables but also directly via the citation_id column. It seems like some kind of consistency constraint might be in order.

SELECT COUNT(*) FROM managements m WHERE citation_id NOT IN (SELECT citation_id FROM citations_treatments ct JOIN managements_treatments mt USING (treatment_id) WHERE mt.management_id = m.id);

shows 1262 violations of what seems like a reasonable consistency constraint. For a better picture of the two ways of associating with citations, try

SELECT m.citation_id, ARRAY_AGG(ct.citation_id ORDER BY ct.citation_id) AS
     citation_list FROM managements m JOIN managements_treatments mt ON
     mt.management_id = m.id JOIN citations_treatments ct USING (treatment_id)
     GROUP BY m.id, m.citation_id ORDER BY m.citation_id;

response (by @dlebauer)

it is not necessary for this to be consistent. details of a management could come from a citation other than where data come from.

date

When a year is not specified, a date beginning with '0001-' is used. So to impose a range restriction on date, the following might be reasonable:

CHECK ((date > '1800-01-01' OR EXTRACT(year FROM date) = '0001') AND date < NOW())

Another check that might be reasonable is that if date is like '0001-%', then dateloc is 95, 96, or ???

dateloc should be constrained to the defined values--we should define a domain for this since it is used elsewhere, and a comment on the domain should tell what the values mean.

It might also be useful to define a function that combines date with dateloc and outputs a string showing the significant portions of the date (given the dateloc value) in human readable form. For example INPUT date = '0001-04-21', dateloc = 96 might yield OUTPUT 'April'.

Since dates are fundamentally ambiguous when dateloc isn't given ('01' can mean either "unspecified" or "January" when it appears in the month portion of the date), it seems reasonable to require dateloc to be non-null.

mgmttype

The Rails model for managments contains a list of 29 values for mgmttype that the user can choose from when creating a new management. But the query

SELECT COUNT(*) FROM managements WHERE mgmttype NOT IN ( 'burned', 'coppice', 'cultivated', 'cultivated or grazed', 'fertilization_Ca', 'fertilization_K', 'fertilization_N', 'fertilization_P', 'fertilization_other', 'fungicide', 'grazed', 'harvest', 'herbicide', 'irrigation', 'light', 'pesticide', 'planting (plants / m2)', 'row spacing', 'seeding', 'tillage','warming_soil','warming_air','initiation of natural succession','major storm','root exclusion', 'trenching', 'CO2 fumigation', 'soil disturbance', 'rain exclusion');

shows that 1592 rows contain types not in the list. At least some of these are variants of the "official" values--for example "fertilizer_N" in place of "fertilization_N". These should probably be changed to match the "official list". Other values not like anything in the list may mean we need to expand the list. If this list will change often, perhaps we need a "management_types" table. See discussion at https://www.overleaf.com/2086241dwjyrd#/5297403/.

To see the unsanctioned values that occur in this column, use

SELECT DISTINCT mgmttype FROM managements WHERE mgmttype NOT IN ( 'burned', 'coppice', 'cultivated', 'cultivated or grazed', 'fertilization_Ca', 'fertilization_K', 'fertilization_N', 'fertilization_P', 'fertilization_other', 'fungicide', 'grazed', 'harvest', 'herbicide', 'irrigation', 'light', 'pesticide', 'planting (plants / m2)', 'row spacing', 'seeding', 'tillage','warming_soil','warming_air','initiation of natural succession','major storm','root exclusion', 'trenching', 'CO2 fumigation', 'soil disturbance', 'rain exclusion');

units

Since we have a finite set of values for mgmttype (at least we do in the Rails interface), it seems like we could also limit units to some finite list of values. Currently, there are many variants of what certainly is the same unit; for example, "plant/m2", "plant/ m2", "plants m-2", "plants/m2", and "plants/ m2" all occur.

Ideally, the units would be constrained to be applicable to the mgmttype value. An SQL function could facilitate this. The question is, can will any restriction we impose be too restrictive to account for future possibilites?


@dlebauer commented on Fri Aug 07 2015

Cleaning up managements - went through and checked lots of units in g m-2 that should've been kg ha-1 https://gist.github.com/dlebauer/2fa2fa9bedf7fc3d5ae7


@dlebauer commented on Fri Aug 07 2015

@gsrohde did we clean up orphaned managements and treatments in #171? There appear to still be some, depending on how I write my queries (#340)


@gsrohde commented on Fri Sep 11 2015

Current dateloc values (as of 2015-09-11):

ebi_production=# select dateloc, count(*) from managements group by dateloc order by dateloc;
 dateloc | count 
---------+-------
    1.00 |     4           <-- doesn't make sense for dates without time
    2.00 |     1           <-- doesn't make sense for dates without time
    4.00 |   114         <-- doesn't make sense for dates without time
    4.50 |     2           <-- doesn't make sense for dates without time
    5.00 |   538
    5.20 |     1           <-- the Data Entry docs don't specify what fractional values mean
    5.50 |   241         <-- the Data Entry docs don't specify what fractional values mean
    6.00 |  1520
    6.50 |    56          <-- the Data Entry docs don't specify what fractional values mean
    7.00 |   149
    7.50 |    72          <-- the Data Entry docs don't specify what fractional values mean
    8.00 |   950
    8.50 |     2           <-- the Data Entry docs don't specify what fractional values mean
    9.00 |    59
   95.00 |    16
         |  1108           <-- this is NULL
(16 rows)

@gsrohde commented on Fri Sep 11 2015

These are the values for mgmttype that exist in the database but aren't selectable in on the New/Editing Management pages:

ebi_production=# SELECT mgmttype, count(*) FROM managements WHERE mgmttype NOT IN ( 'burned', 'coppice', 'cultivated', 'cultivated or grazed', 'fertilizer_Ca', 'fertilizer_K', 'fertilizer_N', 'fertilizer_P', 'fertilizer_other',  'fertilizer_Ca_rate', 'fertilizer_K_rate', 'fertilizer_N_rate', 'fertilizer_P_rate', 'fertilizer_other_rate', 'fungicide', 'grazed', 'harvest', 'herbicide', 'irrigation', 'light', 'pesticide', 'planting (plants / m2)', 'row spacing', 'seeding', 'tillage','warming_soil','warming_air','initiation_of_natural_succession','major_storm','root_exclusion', 'trenching', 'CO2_fumigation', 'soil_disturbance', 'rain_exclusion') group by mgmttype;
      mgmttype       | count 
---------------------+-------
 mowing              |     4
 planting            |   821
 fertilizer_Mg_rate  |     2
 fertilizer_hoagland |     1
 O3_fumigation       |    49
 fertilizer_NPK      |     1
 row_spacing         |    67
 fertilizer_Mg       |     2
 fertilizer_S        |     2
 thinning            |     7
 PAR                 |     1
 fertilizer_NP       |     3
(12 rows)

The bulk of these have the value planting, which could perhaps be folded in to the value planting (plants / m2).

@dlebauer
Copy link
Member Author

duplicate of #2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant