missingValues field is not taken into account during load #6

Closed
cschloer opened this issue Apr 10, 2019 · 11 comments
@cschloer

When load is run with force_strings=false and validate=true, it fails on values like nd (for non-string types) even if you add nd to the missingValues datapackage parameter. If instead you set force_strings=true and later use the set_types processor to manually set the types, it works.

Here is a pipeline-spec.yaml file demonstrating the addition of missingValues to the datapackage (using a BCODMO custom processor) that will fail on load.

demo_2019-04-09:
  description: demo missing_values not working in load
  pipeline:
  - parameters:
      missingValues: ['', nd]
      resources: [temperature]
    run: bcodmo_pipeline_processors.add_schema_metadata
  - parameters:
      force_strings: false 
      format: csv
      from: https://raw.githubusercontent.com/BCODMO/frictionless-usecases/master/usecases/752613_Temperature/orig/Field_Temperature_Data_set1.txt
      headers: 1
      name: temperature
      skip_rows: []
      validate: true
    run: load
  - parameters: {out-path: /home/conrad/Projects/whoi/pipeline-generator/bcodmo_pipeline/tmp/a769fd44-5aad-11e9-a9ad-6057181bb7cd/results}
    run: dump_to_path
  title: demo_2019-04-09

This is high priority because until it is fixed there is no way to use the validate=true, force_strings=false functionality with datasets that have missing data values.
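For anyone reproducing this outside of DPP, a minimal sketch of the same failure in plain dataflows (missing.csv is a hypothetical two-column file whose integer column contains an nd cell; the validate and force_strings parameter names follow the 2019-era load API quoted later in this thread):

from dataflows import Flow, load, dump_to_path

# Hedged repro sketch: with validate=True and force_strings=False, the
# inferred integer type is enforced while loading, so the 'nd' cell raises
# dataflows.base.schema_validator.ValidationError even though a later step
# would have added 'nd' to the schema's missingValues.
Flow(
    load('missing.csv', name='missing', format='csv',
         validate=True, force_strings=False),
    dump_to_path('output'),
).process()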

@cschloer cschloer changed the title missing_values field is not taken into account during load missingValues field is not taken into account during load Apr 10, 2019
@adyork adyork transferred this issue from another repository Apr 18, 2019
@adyork adyork removed their assignment Apr 18, 2019
@adyork commented Apr 18, 2019

I ran into dataflows load errors when there was a missing data identifier as well, even with force_strings=True, validate=False.

load(orig_path, name=row.obj_name, validate=False, force_strings=True, sheet=row.sheet_name, strip=True),

dataflows.base.schema_validator.ValidationError: 
ROW: {'Date': datetime.datetime(2015, 11, 13, 0, 0), 'Treatment days': -4, 'Treatment': 870, 'Flume': 4, 'Temperature (24h)': 'NA', 'Temperature (day)': 'NA', 'Temperature (night)': 'NA', 'Irrandiance': 'NA'}

Tried to load this data:
754644_Carpenter2018_physical_data

@cschloer

Important to note: you need to use the pipeline-generator repo (https://github.com/BCODMO/pipeline-generator) to load the custom processor needed to run the test case.

@adyork commented Apr 19, 2019

@cschloer do you agree that the meat of this issue is that we had to build a workaround to be able to load data files that have missing data identifiers? I was having issues doing that outside of our custom stuff.

If we can get some help loading our test case, which includes the missing data identifier "NA", with vanilla load, that would be great!

https://raw.githubusercontent.com/BCODMO/frictionless-usecases/master/usecases/752613_Temperature/orig/Field_Temperature_Data_set1.txt

@roll commented May 28, 2019

I checked different options and I think this one could be the best:

- we don't cast data during loading (though a schema will still be inferred)
- we update the schema in the next step (setting missingValues)
- on the dump_to_path step the data is validated anyway (which actually makes this pipeline more efficient, because the data is not validated twice: once on load and once on dump)

missing:
  title: missing
  description: "test missing values"
  pipeline:

  - run: load
    parameters:
      from: 'missing.csv'
      name: missing
      format: csv
      cast_strategy: nothing

  - run: bcodmo_pipeline_processors.add_schema_metadata
    parameters:
      resources: [missing]
      missingValues: ['nd']

  - run: dump.to_path
    parameters:
      resources: [missing]
      out-path: 'output'
      pretty-descriptor: true

PASSING:

index,city
1,london
2,paris
3,rome
... (100 lines or more)
3,rome
nd,nd

FAILING:

index,city
1,london
2,paris
3,rome
... (100 lines or more)
3,rome
nd,nd
na, na # fails here on dump_to_path
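For completeness, a plain-dataflows sketch of the same pipeline (the custom add_schema_metadata processor is approximated here by a small inline package step, which is an assumption about its behavior, not its actual code):

from dataflows import Flow, load, dump_to_path

# Mirrors the pipeline-spec above: no casting at load time, missingValues
# set in a follow-up step, and validation happening only once, on dump.

def add_schema_metadata(package):
    # Assumed stand-in for bcodmo_pipeline_processors.add_schema_metadata:
    # mark 'nd' as a missing value on the 'missing' resource.
    for resource in package.pkg.descriptor['resources']:
        if resource['name'] == 'missing':
            resource['schema']['missingValues'] = ['nd']
    yield package.pkg
    yield from package

Flow(
    load('missing.csv', name='missing', format='csv', cast_strategy='nothing'),
    add_schema_metadata,
    dump_to_path('output'),
).process()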

If it's not good enough, as an alternative we need to check with @akariv whether it makes sense to add a missingValues or schema parameter (merged into the inferred schema) to the load processor. But I think having two separate steps for loading and for setting schema metadata feels more right, given the datapackage-pipelines concept.

@akariv commented May 28, 2019 via email

@roll commented May 28, 2019

Thanks @akariv!

@adyork
@cschloer
Please let me know what you think regarding these two options.

@roll commented Jul 5, 2019

Now it's possible to use override_schema and override_fields on the load step:

pip install --upgrade datapackage-pipelines==2.1.9

missing_on_load:
  title: missing
  description: "test missing values"
  pipeline:

  - run: load
    parameters:
      from: 'missing.csv'
      name: missing
      format: csv
      override_schema:
        missingValues: ['nd']
      override_fields:
        index:
          type: string
        city:
          type: string

  - run: dump.to_path
    parameters:
      resources: [missing]
      out-path: 'output'
      pretty-descriptor: true

@roll roll closed this as completed Jul 5, 2019
@cschloer commented Sep 5, 2019

Hey @roll @akariv, I'm not going to reopen this because we have a workaround (I have my own concatenate processor that overrides the default one), but for other people it would probably be useful to have this same override_schema parameter in the standard concatenate processor. Currently there is no way to update the schema of a resource created by concatenate (or join) using the method applied here to load.
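For illustration only, one shape such a workaround can take in plain dataflows (the input file names, field list, and target name 'combined' are all hypothetical; the schema is patched by an inline package step because concatenate itself accepts no override_schema):

from dataflows import Flow, load, concatenate, dump_to_path

# Hedged sketch: patch the schema of the resource that concatenate creates,
# since there is no way to pass schema overrides to concatenate directly.

def patch_concatenated_schema(package):
    for resource in package.pkg.descriptor['resources']:
        if resource['name'] == 'combined':  # hypothetical target name
            resource['schema']['missingValues'] = ['', 'nd']
    yield package.pkg
    yield from package

Flow(
    load('part1.csv'),  # hypothetical input files
    load('part2.csv'),
    concatenate(fields={'index': [], 'city': []},
                target={'name': 'combined'}),
    patch_concatenated_schema,
    dump_to_path('output'),
).process()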

@roll commented Sep 9, 2019

@cschloer
Would you like to create a feature request on the dataflows issue tracker, for future reference?

@cschloer commented Sep 9, 2019

👍
Made it here! datahq/dataflows#109
