Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Contingent validation & optional columns #31

Draft
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

kotarf
Copy link

@kotarf kotarf commented Feb 23, 2020

There is a use case for validation where we want to contingently validate a column. We might not require such a column, but will want to validate it if it happens to be provided. This would come up when PandasSchema is used in an 'ETL' pipeline where data is submitted by users or 3rd parties.

  • If the column is optional and present and at least one non-null value exists, validate as we do normally.
  • If the column is optional and present but is empty, do not validate.
  • If the column is optional and not present, do not raise w/ the missing column exception.

I don't think allow_empty has this behavior so I introduced a new parameter. It's close but it has subtly different behavior.

This might go against the design of the library in that all provided columns are considered required in the Schema. In which case, I guess this feature doesn't make sense.

Otherwise, I don't think this PR is ready to go (eg. more tests and the logic might be off) but I'm happy to continue it so let me know what you think. Thx

@multimeric
Copy link
Owner

Hi @kotarf, thanks for the PR. I can appreciate your use-case and would like to support it. However, considering that I'm in the middle of a pseudo-rewrite that will better support this need, I'm hesitant to work through a PR too much. That said, if you need this functionality soon, it doesn't look like a huge PR and I'm happy to review it.

@kotarf
Copy link
Author

kotarf commented Feb 23, 2020

Hi @kotarf, thanks for the PR. I can appreciate your use-case and would like to support it. However, considering that I'm in the middle of a pseudo-rewrite that will better support this need, I'm hesitant to work through a PR too much. That said, if you need this functionality soon, it doesn't look like a huge PR and I'm happy to review it.

Nope, not really urgent. I also had to add a fix; the logic should be correct now.

I'd like to work against this re-write. Is it being worked on in #29?

@multimeric
Copy link
Owner

It is being worked on there, but it's all in so much flux I wouldn't recommend merging that branch in right now. Broadly speaking, the PR decouples Validations from Series, so your problem of optional Series this is naturally something that I'll have to solve anyway.

If it's not urgent, I'll check in here after I've finished that branch.

@kotarf
Copy link
Author

kotarf commented Feb 25, 2020

It is being worked on there, but it's all in so much flux I wouldn't recommend merging that branch in right now. Broadly speaking, the PR decouples Validations from Series, so your problem of optional Series this is naturally something that I'll have to solve anyway.

If it's not urgent, I'll check in here after I've finished that branch.

That's fine, I actually have another potential feature that's slightly more involved. I'll try to get up to date with the rework and code against that. Let me know if this PR becomes irrelevant and I'll close it out.

@whege
Copy link

whege commented Nov 6, 2020

Hi, wanted to check on the status of the PR since I see it has been a while since any discussion and I currently have a use case for this feature in a project. If it is still being worked on, I am happy to help contribute if needed to help complete the PR.

@multimeric
Copy link
Owner

Unfortunately it's in the same place as before, because I've been quite busy. If you or anyone else wants to expedite this, you could try to implement this by writing a PR against my 1.0 branch: https://github.com/TMiguelT/PandasSchema/tree/bitwise.

Unfortunately I haven't written a comprehensive changes document, that will come later. However the docstrings are generally up-to-date, so that should help you in writing this. Basically the main changes in that branch that impacts this feature are:

  • I have abolished the notion of Columns, because for situations like this one, it doesn't necessarily make sense
  • Each Validation object has its own internal "index" which is an instance of AxisIndexer, a thin wrapper around .loc and .iloc

So to implement the behaviour mentioned in the OP, you would have your ordinary validation set, and-ed together, then you would or it with a validator that passes when the column doesn't exist, or-ed with a validator that passes when all of the column is empty. e.g. (CanConvertValidation(int) & InRangeValidation(min=1, max=8)) | IsAbsentValidation() | AllEmptyValidation().

Then you have to implement the IsAbsentValidation() such that it returns True if the index doesn't exist in the DataFrame (yes really a scalar True, not a Series, we use numpy broadcasting logic to handle this), and also the AllEmptyValidation() which returns True if the entire Series is empty. Note, this differs from the existing IsEmptyValidation, which returns a boolean Series of True and False depending on whether each cell in that Series is empty.

Let me know if you have any issues or need any clarifications.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants