missingValues field is not taken into account during load #6
See 752613_Temperature for more details about the data file: https://raw.githubusercontent.com/BCODMO/frictionless-usecases/master/usecases/752613_Temperature/orig/Field_Temperature_Data_set1.txt |
I ran into dataflow load errors when there was a missing data identifier as well, even with force_strings=True,validate=False.
Tried to load this data: |
Important to note that you need to use the pipeline generator repo (https://github.com/BCODMO/pipeline-generator) to load in the custom processor to run the test case |
@cschloer do you agree that the meat of this issue is that we had to do a workaround to be able to load data files that had missing data identifiers? I was having issues with doing that outside of our custom stuff. If we can get some help on loading our test case which includes a missing data identifier "NA" with vanilla load, that would be great! |
I checked different options and I think this one could be the best:
missing:
  title: missing
  description: "test missing values"
  pipeline:
    - run: load
      parameters:
        from: 'missing.csv'
        name: missing
        format: csv
        cast_strategy: nothing
    - run: bcodmo_pipeline_processors.add_schema_metadata
      parameters:
        resources: [missing]
        missingValues: ['nd']
    - run: dump.to_path
      parameters:
        resources: [missing]
        out-path: 'output'
        pretty-descriptor: true
PASSING:
FAILING:
If it's not good enough, as an alternative we need to check with @akariv if it makes sense to add a missingValues or schema (merging into inferred) parameter to the load processor. |
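The idea behind the pipeline above (load without casting, set missingValues on the schema, cast once at dump time) can be sketched in plain Python. This is only an illustration under stated assumptions: the sample data, the inferred schema, and the helpers `load_as_strings`, `add_schema_metadata`, and `cast_rows` are hypothetical stand-ins, not the datapackage-pipelines or BCODMO implementations.

```python
import csv
import io

# Made-up sample data; 'nd' marks a missing temperature.
RAW_CSV = "index,city,temp\n1,london,12\n2,paris,nd\n"

# Stand-in for the schema that load would infer (assumed, not real output).
INFERRED_SCHEMA = {
    "fields": [
        {"name": "index", "type": "integer"},
        {"name": "city", "type": "string"},
        {"name": "temp", "type": "integer"},
    ]
}

CASTERS = {"integer": int, "string": str}


def load_as_strings(text):
    """Step 1: read rows without casting; every value stays a string."""
    return list(csv.DictReader(io.StringIO(text)))


def add_schema_metadata(schema, **metadata):
    """Step 2: return a copy of the schema with extra top-level keys set."""
    return {**schema, **metadata}


def cast_rows(rows, schema):
    """Step 3: cast once, at dump time, checking missingValues first."""
    missing = set(schema.get("missingValues", [""]))
    for row in rows:
        cast = {}
        for field in schema["fields"]:
            raw = row[field["name"]]
            cast[field["name"]] = (
                None if raw in missing else CASTERS[field["type"]](raw)
            )
        yield cast


schema = add_schema_metadata(INFERRED_SCHEMA, missingValues=["nd"])
rows = list(cast_rows(load_as_strings(RAW_CSV), schema))
```

Because nothing is cast in step 1, the `nd` cell survives until the schema knows it is a missing-value marker, and the data is cast only once.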
I think that adding a schema (to be used as overrides over the inferred schema) would be a good and generic solution.
On Tue, May 28, 2019 at 10:05 AM roll wrote:
I checked different options and I think this one could be the best:
- we don't cast data during loading (though a schema will be inferred)
- we update schema on the next step (setting missingValues)
- on the dump_to_path step data is validated anyway (actually making this pipeline more effective because it's not validated two times: on load and on dump)
missing:
  title: missing
  description: "test missing values"
  pipeline:
    - run: load
      parameters:
        from: 'missing.csv'
        name: missing
        format: csv
        cast_strategy: nothing
    - run: bcodmo_pipeline_processors.add_schema_metadata
      parameters:
        resources: [missing]
        missingValues: ['nd']
    - run: dump.to_path
      parameters:
        resources: [missing]
        out-path: 'output'
        pretty-descriptor: true
If it's not good enough, as an alternative we need to check with @akariv if it makes sense to add a missingValues or schema (merging into inferred) parameter to the load processor.
|
Now it's possible to set missingValues directly in load (via override_schema and override_fields) after upgrading:
pip install --upgrade datapackage-pipelines==2.1.9
missing_on_load:
  title: missing
  description: "test missing values"
  pipeline:
    - run: load
      parameters:
        from: 'missing.csv'
        name: missing
        format: csv
        override_schema:
          missingValues: ['nd']
        override_fields:
          index:
            type: string
          city:
            type: string
    - run: dump.to_path
      parameters:
        resources: [missing]
        out-path: 'output'
        pretty-descriptor: true
|
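Conceptually, override_schema and override_fields act as overrides merged into the inferred schema. The sketch below shows one plausible way such a merge could work; `apply_overrides` is a hypothetical helper and the merge rules are an assumption, not the actual load processor code.

```python
import copy


def apply_overrides(inferred_schema, override_schema=None, override_fields=None):
    """Merge overrides into a copy of an inferred schema (illustrative only)."""
    schema = copy.deepcopy(inferred_schema)
    # Top-level keys such as missingValues replace or extend the inferred ones.
    schema.update(override_schema or {})
    # Per-field overrides are matched by field name.
    for field in schema.get("fields", []):
        field.update((override_fields or {}).get(field["name"], {}))
    return schema


# Assumed inferred schema for the example; not real processor output.
inferred = {
    "fields": [
        {"name": "index", "type": "integer"},
        {"name": "city", "type": "string"},
    ]
}

merged = apply_overrides(
    inferred,
    override_schema={"missingValues": ["nd"]},
    override_fields={"index": {"type": "string"}, "city": {"type": "string"}},
)
```

The inferred schema is copied rather than mutated, so the original inference remains available for comparison.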
Hey @roll @akariv I'm not going to reopen this because we have a workaround (I have my own concatenate processor that is overwriting the default one) but for other people it would probably be useful to have this same override_schema parameter in the standard concatenate processor. Currently there is no way to update the schema of a resource created by concatenate (or join) with the method used here for load. |
@cschloer |
👍 |
When load is run using force_strings=false and validate=true, it fails on values like nd (for non-string types) even if you add nd to the missingValues datapackage parameter. If instead you set force_strings=true and you later use the set_types processor to manually set the types, it works.
Here is a pipeline-spec.yaml file demonstrating the addition of missingValues to the datapackage (using a BCODMO custom processor) that will fail on load.
This is high priority because until it is fixed there is no way to use the validate=true, force_strings=false functionality with datasets that have missing data values.