
Add the ability to preserve missing_values #32

Open
cschloer opened this issue Nov 22, 2019 · 4 comments
Labels
laminar An issue for the laminar tool

Comments

@cschloer

An issue here to track the PRs made

frictionlessdata/tableschema-py#260
frictionlessdata/datapackage-pipelines#175
datahq/dataflows#119

Original justification:

If missing-data identifiers are declared in the load step, each identifier becomes a blank cell. We want to be able to preserve the provided missing-data identifiers (e.g. bdl, NaN, NA, nd, -99999.99).

@roll

roll commented Nov 25, 2019

@cschloer
The first one is released as [email protected]

Handing over to @akariv 😃

@akariv
Collaborator

akariv commented Nov 28, 2019

To be honest, this doesn't feel like the right way to go about it.

missingValue is, by definition, for specifying values that convey that this is an empty cell.
Since in your case these values actually carry information, they're not really empty.
Marking them as 'missingValues' and then adding a parameter to ignore that setting seems like an abuse of the option.

Furthermore, with your suggestion, it's possible that the next steps of the pipeline might get a string for a numeric field, breaking one of the basic contracts in dataflows: that a processor's input data always adheres to the schema.

The correct way to load such datasets (imo) would be to load these fields as strings, extract the required information to a separate field (e.g. value-status={valid/empty/incomplete-data/etc...}) and then call 'set-types' with the missingValues option.
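That split can be sketched in plain Python (the field names and the set of indeterminate markers below are illustrative, not part of the tableschema/dataflows API):

```python
from decimal import Decimal

# Illustrative set of values that signal "no usable measurement" in the raw data.
INDETERMINATE = {"bdl", "NaN", "NA", "nd", "-99999.99"}

def split_value(raw):
    """Return (parsed_value, status) for one raw string cell."""
    if raw in INDETERMINATE:
        return None, raw          # keep the identifier in a separate status field
    return Decimal(raw), "valid"

# One field becomes two: the typed value and its status.
row = {"temperature": "bdl"}
row["temperature"], row["temperature_status"] = split_value(row["temperature"])
```

After this step the value column contains only numbers (or None), so a subsequent `set-types` with `missingValues` can type it safely.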

@cschloer
Author

cschloer commented Dec 2, 2019

I understand how this doesn't fit into the scope of the missingValue key, and that breaking the basic contract of flows is especially a problem. The workaround you mention is indeed possible (and is how we currently work around this issue), but it is a headache to deal with when a dataset has many fields and many different kinds of "ignoreValues". A more direct solution would significantly improve our workflow.

Could we consider adding a new key? Maybe something like "indeterminateValue" or "ignoreValue"? I haven't dived deeply enough into dpp or dataflows to know if this is possible, but is there a wrapper/runner that executes each step of the pipeline? Within that wrapper, it could check the ignoreValue key and see whether a value in the row matches a string in ignoreValue. If it does, it replaces the value with None, but only for that specific processor. After the processor yields the row back, the wrapper replaces the None with the original ignoreValue. That would happen for every processor except the dump_to processors.
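The proposed wrapper could be sketched like this (the `ignore_values_wrapper` name and the row-level processor shape are hypothetical, not an existing dpp/dataflows API):

```python
def ignore_values_wrapper(processor, ignore_values):
    """Hypothetical wrapper: hide ignoreValues from a row processor,
    then restore them afterwards. `processor` maps a row dict to a row dict."""
    def wrapped(row):
        # Remember which cells hold an ignore-value identifier.
        hidden = {k: v for k, v in row.items() if v in ignore_values}
        # Present those cells to the processor as empty (None).
        masked = {k: (None if k in hidden else v) for k, v in row.items()}
        out = processor(masked)
        # Restore the original identifiers where the processor left None.
        for k, v in hidden.items():
            if out.get(k) is None:
                out[k] = v
        return out
    return wrapped

# Example: a processor that doubles numeric cells never sees "bdl".
double = lambda row: {k: (v * 2 if isinstance(v, (int, float)) else v)
                      for k, v in row.items()}
wrapped = ignore_values_wrapper(double, {"bdl"})
```

A dump_to processor would simply be run unwrapped, so the identifiers reach the output file intact.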

Is that possible? If so, could you point me to the area in the code where I might implement that? And let me know if you think a new ignoreValue parameter is necessary or if a flag to implement this behavior for missingValue is enough.

Thanks @akariv !

@akariv
Collaborator

akariv commented Dec 2, 2019

A few ideas on how to tackle this problem:

My first idea was to create a processor/flow that does all of that in one step: parse the values from a specific column and output two columns, one for the values and another for the indeterminate value. Another option is for this processor to output, instead of two columns, a tuple or a simple object in a single column.

A more fancy approach would work in a single dataflows.Flow (but not in a dpp pipeline...).
You could subclass decimal.Decimal to create a class that holds the value and the extra information in case it's invalid. Further processors in the same flow would get that Python object and would be able to manipulate it - but for every other purpose (e.g. writing to a file) it would behave as a simple Decimal.
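A sketch of that subclass (the class name and the `note` attribute are assumptions for illustration):

```python
from decimal import Decimal

class AnnotatedDecimal(Decimal):
    """Behaves as a Decimal everywhere (comparisons, arithmetic, writing
    out), but can carry the original indeterminate identifier alongside
    a sentinel numeric value."""
    def __new__(cls, value, note=None):
        # Decimal is immutable, so the value must be set in __new__.
        obj = super().__new__(cls, value)
        obj.note = note
        return obj

x = AnnotatedDecimal("-99999.99", note="bdl")
```

Note that arithmetic on the object returns plain Decimal results, so the annotation survives only as long as the value itself is passed through unchanged.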

Either way, if you think there's a real use case here for extending missingValues, I would suggest creating an issue or a pull request against the spec itself, so the suggestion is properly reviewed in a wider context (and not just to answer a specific use case).
