Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Column info for invalid column number #8

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

oliverfu89
Copy link

I have updated the ValidationError message in case the number of columns for the schema and the date frame do not match to include the names of the columns which are additionally present in the schema and/or data frame. This should be more informative for the user.

@multimeric
Copy link
Owner

This is a good idea, I'm sure extra warnings would be helpful. That said, the code could be a bit cleaner if you used sets:

schema_columns = set(self.get_column_names())
df_columns = set(df.columns)

add_schema_columns = schema_columns - df_columns
add_df_columns  =  df_columns - schema_columns

Also could you please add at least one test that test that your warning is correct in test/test_schema.py

…s are shown in output message of validation warning
@oliverfu89
Copy link
Author

I have added a unit test for the warning. Thanks for the hint.

@oliverfu89 oliverfu89 closed this Jan 18, 2019
@oliverfu89 oliverfu89 reopened this Jan 18, 2019
df = pd.DataFrame.from_dict({'a': [1, 2, 3]})

out = self.schema.validate(df, columns=['a', 'b'])
assert out[0].message == 'The column b exists in the schema but not in the data frame'
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -135,6 +135,13 @@ def test_column_subset_error(self):
# should raise a PanSchArgumentError
self.assertRaises(PanSchArgumentError, self.schema.validate, df, columns=['c'])

def test_column_not_present_shown(self):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a docstring that explains this test (can be one sentence)

)
)
return errors

errors.append(
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you have a second ValidationWarning added here? You already have one error if add_schema_columns and one error if add_df_columns. If there is an error with the number of columns this will produce two ValidationWarning objects, when it should produce one.

schema_columns = set(self.get_column_names())
df_columns = set(df.columns)

add_schema_columns = [col for col in schema_columns if col not in df_columns]
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use set operations here, (https://docs.python.org/3/library/stdtypes.html#set) e.g. add_schema_columns = schema_columns - df_columns

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants