[WIP] Schemas for arbitrary subsets #2 #33

Draft
wants to merge 34 commits into
base: master
Commits
2f7e7f3
Initial groundwork for the rewrite
multimeric Nov 17, 2019
e92045d
InRangeValidation working with tests
multimeric Dec 24, 2019
dcb04c4
Clarify and cleanup Warning class, add back in the standard validations
multimeric Jan 20, 2020
8bffe93
Re-use some old validations for nicer diff; fix some tests
multimeric Jan 23, 2020
9c6b910
Some miscellaneous design docs and updates
multimeric Jan 31, 2020
28e2c11
Sort out new ValidationWarning structure
multimeric Feb 1, 2020
04893b3
Add indexer class, solidify message format and ValidationWarning
multimeric Feb 3, 2020
7d8aa93
First attempt at CombinedValidations in the new API
multimeric Feb 5, 2020
c36761a
Rework CombinedValidations
multimeric Feb 16, 2020
9bd2704
More work
multimeric Feb 24, 2020
f502167
Fix more tests
multimeric Mar 16, 2020
bc7f269
All tests passing; fixed message generation, fixed negation
multimeric Mar 19, 2020
f8ce653
Fix or operator
multimeric Mar 21, 2020
cc1e8c8
Initial bitwise rewrite
multimeric Mar 22, 2020
3115dcb
Simple use cases working
multimeric Mar 22, 2020
a6c98ec
Update
multimeric Mar 27, 2020
73e86f1
Most tests working with bitwise rewrite
multimeric Mar 30, 2020
8fe1c90
Implement negation
multimeric Apr 1, 2020
bc6d0de
First attempt at combined validations
multimeric Apr 4, 2020
f3dee89
Update
multimeric Apr 7, 2020
e7bfb4b
Merge branch 'master' of github.com:TMiguelT/PandasSchema into bitwise
multimeric Apr 7, 2020
1e8ec23
Merge branch 'bitwise' of github.com:TMiguelT/PandasSchema into bitwise
multimeric Apr 7, 2020
b0105ca
All tests passing
multimeric Apr 7, 2020
5306728
Restructure test
multimeric Apr 13, 2020
a8fa041
Restructure tests
multimeric Apr 13, 2020
d216e48
Update docstrings
multimeric Apr 13, 2020
947410f
Merge branch 'bitwise' of github.com:TMiguelT/PandasSchema into bitwise
multimeric Apr 25, 2020
aae44a7
Update
multimeric Apr 25, 2020
3a8e437
Nested validations seem to be working
multimeric May 4, 2020
b662759
Implement .optional() method, which works a bit like allow_empty
multimeric May 14, 2020
24f3bb2
Rework negation (again), to fix tests
multimeric May 25, 2020
36f6ee6
Add row-uniqueness validation
multimeric May 28, 2020
9452513
Improve column functions; add recurse; improve CombinedValidation
multimeric Jul 8, 2020
02e2a34
Some notes about a better indexer interface
multimeric Nov 7, 2020
Empty file modified .gitignore
100644 → 100755
Empty file.
Empty file modified .travis.yml
100644 → 100755
Empty file.
Empty file modified LICENSE
100644 → 100755
Empty file.
Empty file modified README.rst
100644 → 100755
Empty file.
14 changes: 14 additions & 0 deletions TODO.md
@@ -0,0 +1,14 @@
* [ ] Add validations that apply to every column in the DF equally (for the moment, users can just duplicate their validations)
* [x] Add validations that use the entire DF, such as uniqueness
* [x] Fix CombinedValidations
* [x] Add replacement for allow_empty Columns
* [ ] New column() tests
* [x] New CombinedValidation tests
* [x] Implement the negate flag in the indexer
* [x] Add facility for allow_empty
* [x] Fix messages
* [x] Re-implement the or/and using operators
* [ ] Allow and/or operators between Series-level and row-level validations
* [ ] Separate ValidationClasses for each scope
* [ ] Add row-level validations
* [x] Fix message for DateAndOr test
47 changes: 47 additions & 0 deletions UPDATE.md
@@ -0,0 +1,47 @@
# ValidationWarnings
## Options for the ValidationWarning data
* We keep it as is, with one single ValidationWarning class that stores a `message` and a reference to the validation
that spawned it
* PREFERRED: As above, but we add a dictionary of miscellaneous kwargs to the ValidationWarning for storing stuff like the row index that failed
* We have a dataclass for each Validation type that stores things in a more structured way
* Why bother doing this if the Validation stores its own structure for the column index etc?

## Options for the ValidationWarning message
* It's generated from the Validation as a fixed string, as it is now
* It's generated dynamically by the VW
* This means that custom messages mean overriding the VW class
* PREFERRED: It's generated dynamically in the VW by calling the parent Validation with a reference to itself, e.g.
```python
class ValidationWarning:
    def __str__(self):
        return self.validation.generate_message(self)

class Validation:
    def generate_message(self, warning: ValidationWarning) -> str:
        pass
```
* This lets the message function use all the validation properties, and the dictionary of kwargs that it specified
* `generate_message()` will call `default_message(**kwargs)`, the dynamic class method, or `self.custom_message`, the
non-dynamic string specified by the user
* Each category of Validation will define a `create_prefix()` method that creates the `{row: 1, column: 2}` prefix
that goes before each message. `generate_message()` will then concatenate that prefix with the actual message
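The preferred option above could look roughly like the following. This is only a sketch: `RangeCheck`, the `props` dictionary and the exact prefix format are hypothetical stand-ins chosen for illustration, not the classes in this branch.

```python
from typing import Any, Optional


class Validation:
    """Illustrative base class; not the real pandas_schema class."""

    def __init__(self, custom_message: Optional[str] = None):
        # Non-dynamic string specified by the user, if any
        self.custom_message = custom_message

    def create_prefix(self, warning: "ValidationWarning") -> str:
        # Each category of Validation would override this to describe the failing location
        return f"{{row: {warning.props.get('row')}, column: {warning.props.get('column')}}}"

    def default_message(self, **kwargs: Any) -> str:
        # Dynamic message built from the kwargs stored on the warning
        return "failed validation"

    def generate_message(self, warning: "ValidationWarning") -> str:
        # Prefer the user's custom message, otherwise fall back to the dynamic default
        message = self.custom_message or self.default_message(**warning.props)
        return f"{self.create_prefix(warning)}: {message}"


class ValidationWarning:
    def __init__(self, validation: Validation, **props: Any):
        self.validation = validation
        self.props = props  # miscellaneous kwargs, e.g. the row index that failed

    def __str__(self) -> str:
        return self.validation.generate_message(self)


class RangeCheck(Validation):
    """Hypothetical validation used only to exercise the message flow."""

    def __init__(self, low: float, high: float, **kwargs: Any):
        super().__init__(**kwargs)
        self.low, self.high = low, high

    def default_message(self, **kwargs: Any) -> str:
        return f"value {kwargs.get('value')} was not in the range [{self.low}, {self.high})"


default_warning = ValidationWarning(RangeCheck(0, 10), row=1, column=2, value=42)
custom_warning = ValidationWarning(
    RangeCheck(0, 10, custom_message="out of bounds"), row=1, column=2, value=42
)
print(default_warning)
print(custom_warning)
```

Because the warning only stores a reference to its validation plus a dictionary of kwargs, a custom message needs no subclass of the warning itself.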

## Options for placing CombinedValidation in the inheritance hierarchy
* In order for CombinedValidation and BooleanSeriesValidation to share a class, so they can be chained together,
either we had to make a mixin that creates a "side path" that doesn't call `validate` (in this case, `validate_with_series`),
or we

# Rework of Validation Indexing
## All Indexed
* All Validations now have an index and an axis
* However, this index can be `None`, column only, row only, or both
* When combined with each other, the resulting boolean series will be broadcast using numpy broadcasting rules
* e.g.
* A per-series validation might have index 0 (column 0) and return a scalar (the whole series is okay)
* A per-cell validation might have index 0 (column 0) and return a series (True, True, False) indicating that cell 0 and 1 of column 0 are okay
* A per-frame validation would have index None, and might return True if the whole frame meets the validation, or a series indicating which columns or rows match the validation
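
The broadcasting described above can be sketched with plain NumPy and pandas. The frame and the three checks below are made up for illustration and do not use any PandasSchema API; they only show how a scalar result and a per-cell result combine.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, -3], "b": [4, 5, 6]})
column_0 = df.iloc[:, 0]

# Per-series validation on column 0: a single scalar for the whole series
series_ok = np.bool_(column_0.is_unique)

# Per-cell validation on column 0: one boolean per cell
cell_ok = (column_0 > 0).to_numpy()

# Per-frame validation (index None): a single scalar for the whole frame
frame_ok = np.bool_(df.shape[1] == 2)

# The scalars are broadcast against the per-cell array, giving one result per row
combined = np.logical_and(np.logical_and(series_ok, cell_ok), frame_ok)
print(combined)
```

A scalar `False` from the per-series or per-frame check would mark every row as failing, which is the behaviour NumPy broadcasting gives for free.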

# Rework of CombinedValidations
## Bitwise
* Could assign each validation a bit in a large bitwise enum, and `or` together a number each time that index fails a validation. This lets us track the origin of each warning, allowing us to slice them out by bit and generate an appropriate list of warnings
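
A minimal sketch of the bitwise idea, assuming two made-up checks on a single series. The check names, the bit assignments and the message format are all hypothetical.

```python
import numpy as np
import pandas as pd

series = pd.Series([5, -1, 200, 50])

# Hypothetical checks, each assigned its own bit: bit -> (message, per-cell pass mask)
checks = {
    0b01: ("was not positive", series > 0),
    0b10: ("was not below 100", series < 100),
}

# OR each validation's bit into the slot of every index that fails it
failures = np.zeros(len(series), dtype=np.uint64)
for bit, (_, passed) in checks.items():
    failures |= np.where(passed.to_numpy(), 0, bit).astype(np.uint64)

# Slice the failures back out by bit to recover which validation produced each warning
messages = [
    f"{{row: {row}}}: {message}"
    for bit, (message, _) in checks.items()
    for row in np.flatnonzero(failures & np.uint64(bit))
]
print(failures.tolist())
print(messages)
```

One limitation of this layout is that a fixed-width integer caps the number of validations that can be tracked (64 here).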
Empty file modified doc/common/introduction.rst
100644 → 100755
Empty file.
Empty file modified doc/readme/README.rst
100644 → 100755
Empty file.
Empty file modified doc/readme/conf.py
100644 → 100755
Empty file.
Empty file modified doc/site/Makefile
100644 → 100755
Empty file.
Empty file modified doc/site/conf.py
100644 → 100755
Empty file.
Empty file modified doc/site/index.rst
100644 → 100755
Empty file.
Empty file modified example/boolean.py
100644 → 100755
Empty file.
Empty file modified example/boolean.txt
100644 → 100755
Empty file.
Empty file modified example/example.py
100644 → 100755
Empty file.
Empty file modified example/example.txt
100644 → 100755
Empty file.
2 changes: 0 additions & 2 deletions pandas_schema/__init__.py
100644 → 100755
@@ -1,4 +1,2 @@
from .column import Column
from .validation_warning import ValidationWarning
from .schema import Schema
from .version import __version__
144 changes: 117 additions & 27 deletions pandas_schema/column.py
100644 → 100755
@@ -1,27 +1,117 @@
import typing
import pandas as pd

from . import validation
from .validation_warning import ValidationWarning

class Column:
    def __init__(self, name: str, validations: typing.Iterable['validation._BaseValidation'] = [], allow_empty=False):
        """
        Creates a new Column object

        :param name: The column header that defines this column. This must be identical to the header used in the CSV/Data Frame you are validating.
        :param validations: An iterable of objects implementing _BaseValidation that will generate ValidationErrors
        :param allow_empty: True if an empty column is considered valid. False if we leave that logic up to the Validation
        """
        self.name = name
        self.validations = list(validations)
        self.allow_empty = allow_empty

    def validate(self, series: pd.Series) -> typing.List[ValidationWarning]:
        """
        Creates a list of validation errors using the Validation objects contained in the Column

        :param series: A pandas Series to validate
        :return: An iterable of ValidationError instances generated by the validation
        """
        return [error for validation in self.validations for error in validation.get_errors(series, self)]
from typing import Union, Iterable

from pandas_schema.core import IndexValidation, BaseValidation
from pandas_schema.index import AxisIndexer, IndexValue


def column(
        validations: Union[Iterable['IndexValidation'], 'IndexValidation'],
        index = None,
        override: bool = False,
        recurse: bool = True,
        allow_empty: bool = False
) -> Union[Iterable['IndexValidation'], 'IndexValidation']:
    """A utility method for setting the index data on a set of Validations

    Args:
      validations: A single validation, or an iterable of validations, to modify
      index: The index of the series that these validations will now consider
      override: If true, override existing index values. Otherwise keep the existing ones
      recurse: If true, recurse into child validations
      allow_empty: Allow empty rows (NaN) to pass the validation
        See :py:class:`pandas_schema.validation.IndexSeriesValidation` (Default value = False)
    Returns:
      The updated validations: a list if an iterable was passed in, otherwise a single validation
    """
    # TODO: Abolish this, and instead propagate the individual validator indexes when we And/Or them together
    def update_validation(validation: BaseValidation):
        # Only validations that carry an index can have one assigned
        if isinstance(validation, IndexValidation):
            if override or validation.index is None:
                validation.index = index

        if allow_empty:
            return validation.optional()
        else:
            return validation

    if isinstance(validations, Iterable):
        ret = []
        for valid in validations:
            if recurse:
                ret.append(valid.map(update_validation))
            else:
                ret.append(update_validation(valid))
        return ret
    else:
        if recurse:
            return validations.map(update_validation)
        else:
            return update_validation(validations)


def column_sequence(
        validations: Iterable['IndexValidation'],
        override: bool = False
) -> Iterable['IndexValidation']:
    """A utility method for setting the index data on a set of Validations. Applies a sequential position-based index, so
    that the first validation gets index 0, the second gets index 1 etc. Note: this will not modify any validation that
    already has some kind of index unless you set override=True

    Args:
      validations: A list of validations to modify
      override: If true, override existing index values. Otherwise keep the existing ones (Default value = False)

    Returns:
      The same validations, with their indexes set

    """
    # enumerate() supplies the position; iterating the validations directly does not yield (i, valid) pairs
    for i, valid in enumerate(validations):
        if override or valid.index is None:
            valid.index = AxisIndexer(i, typ='positional')
    return validations


def each_column(validations: Iterable[IndexValidation], columns: IndexValue):
    """Duplicates a validation and applies it to each column specified

    Args:
      validations: A list of validations to apply to each column
      columns: An index that, when applied to the column index, should return all the columns you want the
        validations applied to

    Returns:

    """

#
# def label_column(
# validations: typing.Iterable['pandas_schema.core.IndexSeriesValidation'],
# index: typing.Union[int, str],
# ):
# """
# A utility method for setting the label-based column for each validation
# :param validations: A list of validations to modify
# :param index: The label of the series that these validations will now consider
# """
# return _column(
# validations,
# index,
# position=False
# )
#
# def positional_column(
# validations: typing.Iterable['pandas_schema.core.IndexSeriesValidation'],
# index: int,
# ):
# """
# A utility method for setting the position-based column for each validation
# :param validations: A list of validations to modify
# :param index: The index of the series that these validations will now consider
# """
# return _column(
# validations,
# index,
# position=True