
Add the ability to preserve missing_values #32

Open
cschloer opened this issue Nov 22, 2019 · 4 comments
Labels
laminar An issue for the laminar tool

Comments

@cschloer

An issue here to track the PRs made

frictionlessdata/tableschema-py#260
frictionlessdata/datapackage-pipelines#175
datahq/dataflows#119

Original justification:

If missing-data identifiers are declared in the load step, each identifier becomes a blank cell. We want to be able to preserve the provided missing-data identifiers (e.g. bdl, NaN, NA, nd, -99999.99).

@roll

roll commented Nov 25, 2019

@cschloer
The first one is released as [email protected]

Handing over to @akariv 😃

@akariv
Collaborator

akariv commented Nov 28, 2019

To be honest, this doesn't feel like the right way to go about it.

missingValue is, by definition, for specifying values that convey that this is an empty cell.
Since in your case these values actually carry information, they're not really empty.
Marking them as 'missingValues' and then adding a parameter to ignore that setting seems like an abuse of the option.

Furthermore, with your suggestion, it's possible that the next steps of the pipeline might get a string for a numeric field, breaking one of the basic contracts in dataflows: that a processor's input data always adheres to the schema.

The correct way to load such datasets (imo) would be to load these fields as strings, extract the required information to a separate field (e.g. value-status={valid/empty/incomplete-data/etc...}) and then call 'set-types' with the missingValues option.
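That split can be sketched in plain Python (the field names and the set of indeterminate markers below are illustrative, not part of the tableschema/dataflows API):

```python
from decimal import Decimal

# Illustrative set of values that signal "no usable measurement" in the raw data.
INDETERMINATE = {"bdl", "NaN", "NA", "nd", "-99999.99"}

def split_value(raw):
    """Return (parsed_value, status) for one raw string cell."""
    if raw in INDETERMINATE:
        return None, raw          # keep the identifier in a separate status field
    return Decimal(raw), "valid"

# One field becomes two: the typed value and its status.
row = {"temperature": "bdl"}
row["temperature"], row["temperature_status"] = split_value(row["temperature"])
```

After this step the value column contains only numbers (or None), so a subsequent `set-types` with `missingValues` can type it safely.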

@cschloer
Author

cschloer commented Dec 2, 2019

I understand how this doesn't fit into the scope of the missingValue key, and that breaking the basic contract of flows is especially a problem. The workaround you mention is indeed possible (and is how we currently work around this issue), but it is a headache to deal with when a dataset has many fields and many different kinds of "ignoreValues". A more direct solution would significantly improve our workflow.

Could we consider adding a new key? Maybe something like "indeterminateValue" or "ignoreValue"? I haven't dived deeply enough into dpp or dataflows to know if this is possible, but is there a wrapper/runner that executes each step of the pipeline? Within that wrapper, it could check the ignoreValue key and see whether a value in the row matches a string in ignoreValue. If it does, it replaces the value with None, but only for that specific processor. After the processor yields the row back, the wrapper replaces the None with the original ignoreValue. That would happen for every processor except the dump_to processors.
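The proposed wrapper could be sketched like this (the `ignore_values_wrapper` name and the row-level processor shape are hypothetical, not an existing dpp/dataflows API):

```python
def ignore_values_wrapper(processor, ignore_values):
    """Hypothetical wrapper: hide ignoreValues from a row processor,
    then restore them afterwards. `processor` maps a row dict to a row dict."""
    def wrapped(row):
        # Remember which cells hold an ignore-value identifier.
        hidden = {k: v for k, v in row.items() if v in ignore_values}
        # Present those cells to the processor as empty (None).
        masked = {k: (None if k in hidden else v) for k, v in row.items()}
        out = processor(masked)
        # Restore the original identifiers where the processor left None.
        for k, v in hidden.items():
            if out.get(k) is None:
                out[k] = v
        return out
    return wrapped

# Example: a processor that doubles numeric cells never sees "bdl".
double = lambda row: {k: (v * 2 if isinstance(v, (int, float)) else v)
                      for k, v in row.items()}
wrapped = ignore_values_wrapper(double, {"bdl"})
```

A dump_to processor would simply be run unwrapped, so the identifiers reach the output file intact.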

Is that possible? If so, could you point me to the area in the code where I might implement that? And let me know if you think a new ignoreValue parameter is necessary or if a flag to implement this behavior for missingValue is enough.

Thanks @akariv !

@akariv
Collaborator

akariv commented Dec 2, 2019

A few ideas on how to tackle this problem:

My first idea was to create a processor/flow that does all of that in one step: parse the values from a specific column and output two columns, one for the values and another for the indeterminate value. Another option is for this processor to output, instead of two columns, a tuple or a simple object in a single column.

A more fancy approach would work in a single dataflows.Flow (but not in a dpp pipeline...).
You could subclass decimal.Decimal to create a class that holds the value and the extra information in case it's invalid. Further processors in the same flow would get that Python object and would be able to manipulate it - but for every other purpose (e.g. writing to a file) it would behave as a simple Decimal.
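A sketch of that subclass (the class name and the `note` attribute are assumptions for illustration):

```python
from decimal import Decimal

class AnnotatedDecimal(Decimal):
    """Behaves as a Decimal everywhere (comparisons, arithmetic, writing
    out), but can carry the original indeterminate identifier alongside
    a sentinel numeric value."""
    def __new__(cls, value, note=None):
        # Decimal is immutable, so the value must be set in __new__.
        obj = super().__new__(cls, value)
        obj.note = note
        return obj

x = AnnotatedDecimal("-99999.99", note="bdl")
```

Note that arithmetic on the object returns plain Decimal results, so the annotation survives only as long as the value itself is passed through unchanged.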

Either way, if you think there's a real use case here for extending missingValues, I would suggest creating an issue or a pull request against the spec itself, so the suggestion is properly reviewed in a wider context (and not just to answer a specific use case).
