Add the ability to preserve missing_values #32
@cschloer: Handing over to @akariv 😃
To be honest, this doesn't feel like the right way to go about it.
Furthermore, with your suggestion, the next steps of the pipeline might receive a string for a numeric field, breaking one of the basic contracts in dataflows: a processor's input data always adheres to the schema. The correct way to load such datasets (imo) would be to load these fields as strings, extract the required information into a separate field (e.g. value-status={valid/empty/incomplete-data/etc...}) and then call 'set-types' with the missingValues option.
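A minimal stdlib sketch of the workaround described above, assuming a hypothetical `reading` column and made-up marker values: load the column as strings, move the "indeterminate" markers into a separate value-status field, then cast the remainder with a missingValues list. In a real pipeline the final cast would be dataflows' `set_type` with the `missingValues` option rather than the `cast_number` stand-in here.

```python
MISSING_VALUES = ['', 'n/a']          # values that simply mean "no data"
INDETERMINATE = {'<5', 'pending'}     # values that carry partial information

def extract_status(row):
    """Split row['reading'] into a castable value and a status field."""
    raw = row['reading']
    if raw in INDETERMINATE:
        row['reading-status'] = raw   # preserve the original marker
        row['reading'] = ''           # now a plain missing value
    else:
        row['reading-status'] = 'valid'
    return row

def cast_number(value, missing_values):
    """Stand-in for casting with a missingValues option: missing -> None."""
    return None if value in missing_values else float(value)

rows = [{'reading': '7.2'}, {'reading': '<5'}, {'reading': 'n/a'}]
processed = [extract_status(dict(r)) for r in rows]
values = [cast_number(r['reading'], MISSING_VALUES) for r in processed]
```

After this, `values` is `[7.2, None, None]`, while the original `'<5'` marker survives in the `reading-status` column, which is exactly the information that a plain missingValues cast would discard.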
I understand how this doesn't fit into the scope of the missingValue key, and that breaking the basic contract of flows is especially a problem. The workaround you mention is indeed possible (and is how we currently work around this issue), but it is a bit of a headache to deal with when you have a dataset with many fields and many different kinds of "ignoreValues". It would significantly improve our workflow to have a more direct solution for this. Could we consider adding a new key? Maybe something like "indeterminateValue" or "ignoreValue"? I haven't dived deeply enough into dpp or dataflows to know if this is possible, but is there a wrapper/runner that executes each step of the pipeline? Maybe within that wrapper it could check the […] Is that possible? If so, could you point me to the area in the code where I might implement that? And let me know if you think a new key makes sense. Thanks @akariv!
A few ideas on how to tackle this problem:
- My first idea was to create a processor/flow that does all of that in one step: parse values from a specific column and output two columns, one for the values and another for the indeterminate value.
- Another option is for this processor not to output two columns, but rather a tuple or a simple object in a single column.
- A fancier approach would work in a single […]
Either way, if you think there's a real use case here for extending […]
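The second idea above, a tuple in a single column, can be sketched in plain Python; the `reading` field name and the parse rule (anything non-numeric is treated as an indeterminate marker) are illustrative assumptions, not part of any existing dataflows API:

```python
def parse_cell(raw):
    """Parse one cell into a (numeric_value, original_marker) pair.

    Exactly one of the two slots is None: a parseable number yields
    (value, None), while anything else yields (None, raw) so the
    indeterminate marker is preserved alongside the data.
    """
    try:
        return (float(raw), None)
    except (TypeError, ValueError):
        return (None, raw)            # keep the indeterminate marker

row = {'reading': '<5'}
row['reading'] = parse_cell(row['reading'])   # (None, '<5')

row2 = {'reading': '7.2'}
row2['reading'] = parse_cell(row2['reading'])  # (7.2, None)
```

The trade-off versus two separate columns is that a tuple keeps the value and its marker atomically together, but the column no longer has a simple Table Schema type, which is presumably why extending the schema itself comes up next.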
An issue here to track the PRs made:
frictionlessdata/tableschema-py#260
frictionlessdata/datapackage-pipelines#175
datahq/dataflows#119
Original justification: