missingValues field is not taken into account during load #6

Closed
cschloer opened this issue Apr 10, 2019 · 11 comments
@cschloer

When load is run with force_strings=false and validate=true, it fails on values like nd (for non-string types) even if you add nd to the missingValues datapackage parameter. If instead you set force_strings=true and later use the set_types processor to manually set the types, it works.

Here is a pipeline-spec.yaml file demonstrating the addition of missingValues to the datapackage (using a BCODMO custom processor) that will fail on load.

demo_2019-04-09:
  description: demo missing_values not working in load
  pipeline:
  - parameters:
      missingValues: ['', nd]
      resources: [temperature]
    run: bcodmo_pipeline_processors.add_schema_metadata
  - parameters:
      force_strings: false 
      format: csv
      from: https://raw.githubusercontent.com/BCODMO/frictionless-usecases/master/usecases/752613_Temperature/orig/Field_Temperature_Data_set1.txt
      headers: 1
      name: temperature
      skip_rows: []
      validate: true
    run: load
  - parameters: {out-path: /home/conrad/Projects/whoi/pipeline-generator/bcodmo_pipeline/tmp/a769fd44-5aad-11e9-a9ad-6057181bb7cd/results}
    run: dump_to_path
  title: demo_2019-04-09

This is high priority because until it is fixed there is no way to use the validate=true, force_strings=false functionality with datasets that have missing data values.
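For anyone reproducing this outside of DPP, a minimal sketch of the same failure in plain dataflows (missing.csv is a hypothetical two-column file whose integer column contains an nd cell; the validate and force_strings parameter names follow the 2019-era load API quoted later in this thread):

from dataflows import Flow, load, dump_to_path

# Hedged repro sketch: with validate=True and force_strings=False, the
# inferred integer type is enforced while loading, so the 'nd' cell raises
# dataflows.base.schema_validator.ValidationError even though a later step
# would have added 'nd' to the schema's missingValues.
Flow(
    load('missing.csv', name='missing', format='csv',
         validate=True, force_strings=False),
    dump_to_path('output'),
).process()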

@cschloer cschloer changed the title missing_values field is not taken into account during load missingValues field is not taken into account during load Apr 10, 2019
@adyork adyork transferred this issue from another repository Apr 18, 2019
@adyork adyork removed their assignment Apr 18, 2019
@adyork commented Apr 18, 2019

I ran into dataflows load errors when there was a missing data identifier as well, even with force_strings=True, validate=False.

load(orig_path, name=row.obj_name, validate=False, force_strings=True, sheet=row.sheet_name, strip=True),

dataflows.base.schema_validator.ValidationError: 
ROW: {'Date': datetime.datetime(2015, 11, 13, 0, 0), 'Treatment days': -4, 'Treatment': 870, 'Flume': 4, 'Temperature (24h)': 'NA', 'Temperature (day)': 'NA', 'Temperature (night)': 'NA', 'Irrandiance': 'NA'}

Tried to load this data:
754644_Carpenter2018_physical_data

@cschloer

Important to note: you need to use the pipeline-generator repo (https://github.com/BCODMO/pipeline-generator) to load the custom processor needed to run the test case.

@adyork commented Apr 19, 2019

@cschloer do you agree that the meat of this issue is that we had to build a workaround to be able to load data files that have missing data identifiers? I was having issues doing that outside of our custom stuff.

If we can get some help loading our test case, which includes the missing data identifier "NA", with vanilla load, that would be great!

https://raw.githubusercontent.com/BCODMO/frictionless-usecases/master/usecases/752613_Temperature/orig/Field_Temperature_Data_set1.txt

@roll commented May 28, 2019

I checked different options and I think this one could be the best:

- we don't cast data during loading (though a schema will still be inferred)
- we update the schema in the next step (setting missingValues)
- on the dump_to_path step the data is validated anyway (which actually makes this pipeline more efficient, because the data is not validated twice: once on load and once on dump)

missing:
  title: missing
  description: "test missing values"
  pipeline:

  - run: load
    parameters:
      from: 'missing.csv'
      name: missing
      format: csv
      cast_strategy: nothing

  - run: bcodmo_pipeline_processors.add_schema_metadata
    parameters:
      resources: [missing]
      missingValues: ['nd']

  - run: dump.to_path
    parameters:
      resources: [missing]
      out-path: 'output'
      pretty-descriptor: true

PASSING:

index,city
1,london
2,paris
3,rome
... (100 lines or more)
3,rome
nd,nd

FAILING:

index,city
1,london
2,paris
3,rome
... (100 lines or more)
3,rome
nd,nd
na, na # fails here on dump_to_path
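For completeness, a plain-dataflows sketch of the same pipeline (the custom add_schema_metadata processor is approximated here by a small inline package step, which is an assumption about its behavior, not its actual code):

from dataflows import Flow, load, dump_to_path

# Mirrors the pipeline-spec above: no casting at load time, missingValues
# set in a follow-up step, and validation happening only once, on dump.

def add_schema_metadata(package):
    # Assumed stand-in for bcodmo_pipeline_processors.add_schema_metadata:
    # mark 'nd' as a missing value on the 'missing' resource.
    for resource in package.pkg.descriptor['resources']:
        if resource['name'] == 'missing':
            resource['schema']['missingValues'] = ['nd']
    yield package.pkg
    yield from package

Flow(
    load('missing.csv', name='missing', format='csv', cast_strategy='nothing'),
    add_schema_metadata,
    dump_to_path('output'),
).process()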

If it's not good enough, as an alternative we need to check with @akariv whether it makes sense to add a missingValues or schema parameter (merged into the inferred schema) to the load processor. But I think having two separate steps for loading and for setting schema metadata feels more right, given the datapackage-pipelines concept.

@akariv commented May 28, 2019 via email

@roll commented May 28, 2019

Thanks @akariv!

@adyork
@cschloer
Please let me know what you think regarding these two options.

@roll commented Jul 5, 2019

Now it's possible to use override_schema and override_fields on the load step:

pip install --upgrade datapackage-pipelines==2.1.9

missing_on_load:
  title: missing
  description: "test missing values"
  pipeline:

  - run: load
    parameters:
      from: 'missing.csv'
      name: missing
      format: csv
      override_schema:
        missingValues: ['nd']
      override_fields:
        index:
          type: string
        city:
          type: string

  - run: dump.to_path
    parameters:
      resources: [missing]
      out-path: 'output'
      pretty-descriptor: true

@roll roll closed this as completed Jul 5, 2019
@cschloer commented Sep 5, 2019

Hey @roll @akariv, I'm not going to reopen this because we have a workaround (I have my own concatenate processor that overrides the default one), but for other people it would probably be useful to have this same override_schema parameter in the standard concatenate processor. Currently there is no way to update the schema of a resource created by concatenate (or join) using the method applied here to load.
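For illustration only, one shape such a workaround can take in plain dataflows (the input file names, field list, and target name 'combined' are all hypothetical; the schema is patched by an inline package step because concatenate itself accepts no override_schema):

from dataflows import Flow, load, concatenate, dump_to_path

# Hedged sketch: patch the schema of the resource that concatenate creates,
# since there is no way to pass schema overrides to concatenate directly.

def patch_concatenated_schema(package):
    for resource in package.pkg.descriptor['resources']:
        if resource['name'] == 'combined':  # hypothetical target name
            resource['schema']['missingValues'] = ['', 'nd']
    yield package.pkg
    yield from package

Flow(
    load('part1.csv'),  # hypothetical input files
    load('part2.csv'),
    concatenate(fields={'index': [], 'city': []},
                target={'name': 'combined'}),
    patch_concatenated_schema,
    dump_to_path('output'),
).process()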

@roll commented Sep 9, 2019

@cschloer
Would you like to create a feature request on the dataflows issue tracker, for future reference?

@cschloer commented Sep 9, 2019

👍
Made it here! datahq/dataflows#109
