-
Notifications
You must be signed in to change notification settings - Fork 36
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Column info for invalid column number #8
base: master
Are you sure you want to change the base?
Column info for invalid column number #8
Conversation
This is a good idea, I'm sure extra warnings would be helpful. That said, the code could be a bit cleaner if you used sets: schema_columns = set(self.get_column_names())
df_columns = set(df.columns)
add_schema_columns = schema_columns - df_columns
add_df_columns = df_columns - schema_columns Also could you please add at least one test that test that your warning is correct in |
…s are shown in output message of validation warning
I have added a unit test for the warning. Thanks for the hint. |
df = pd.DataFrame.from_dict({'a': [1, 2, 3]}) | ||
|
||
out = self.schema.validate(df, columns=['a', 'b']) | ||
assert out[0].message == 'The column b exists in the schema but not in the data frame' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be a self.assert()
method (https://docs.python.org/3.7/library/unittest.html#unittest.TestCase.debug)
@@ -135,6 +135,13 @@ def test_column_subset_error(self): | |||
# should raise a PanSchArgumentError | |||
self.assertRaises(PanSchArgumentError, self.schema.validate, df, columns=['c']) | |||
|
|||
def test_column_not_present_shown(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a docstring that explains this test (can be one sentence)
) | ||
) | ||
return errors | ||
|
||
errors.append( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you have a second ValidationWarning
added here? You already have one error if add_schema_columns
and one error if add_df_columns
. If there is an error with the number of columns this will produce two ValidationWarning
objects, when it should produce one.
schema_columns = set(self.get_column_names()) | ||
df_columns = set(df.columns) | ||
|
||
add_schema_columns = [col for col in schema_columns if col not in df_columns] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use set operations here, (https://docs.python.org/3/library/stdtypes.html#set) e.g. add_schema_columns = schema_columns - df_columns
I have updated the ValidationError message in case the number of columns for the schema and the date frame do not match to include the names of the columns which are additionally present in the schema and/or data frame. This should be more informative for the user.