-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[GEN-809] Validate allele columns #539
Conversation
genie/validate.py
Outdated
pd.Index: A pandas index object indicating the row indices that | ||
don't match the allowed alleles | ||
""" | ||
search_str = rf"^[{''.join(allowed_alleles)}]+$" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this regex does actually check if for alleles like 'ATATATATAT'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does. Have to update it for '-', empty strings and NAs based on our discussion
genie_registry/maf.py
Outdated
@@ -70,6 +70,8 @@ class maf(FileTypeFormat): | |||
_fileType = "maf" | |||
|
|||
_process_kwargs = [] | |||
_allele_cols = ["REFERENCE_ALLELE", "TUMOR_SEQ_ALLELE1", "TUMOR_SEQ_ALLELE2"] | |||
_allowed_alleles = ["A", "T", "C", "G", "N", " ", "-"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would remove " " and "-".
I like what you're thinking here for class attributes. In the future, these classes should follow SOLID principles, right now it is doing too much. I can imagine having a Validate
class that ValidateMaf
inherits from and then different compositions if necessary.
For example, I can imagine having a _required_columns
attribute.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes agreed, I originally had them defined in the _validate
function and was debating between the two but was thinking future wise when we do want to start separating out into the separate classes we talked about (e.g Processing class, Validate class and regular read in data class), thought it would be easier if we put constants like these up in the class attributes given how big _validate is getting. Even before then, we might have other allele variable specific validation to add.
genie/validate.py
Outdated
if len(invalid_indices) > 0: | ||
errors = ( | ||
f"{fileformat}: Your {invalid_col} column has invalid allele values. " | ||
f"These are the accepted allele values: {allowed_alleles}.\n" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The alleles should be combinations of the allowed alleles.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, message will be updated based on discussions earlier.
tests/test_validate.py
Outdated
[ | ||
( | ||
pd.DataFrame( | ||
{"REFERENCE_ALLELE": ["ACGT-G", "A-CGT ", "A", "C", "T", "G", "-", " "]} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
{"REFERENCE_ALLELE": ["ACGT-G", "A-CGT ", "A", "C", "T", "G", "-", " "]} | |
{"REFERENCE_ALLELE": ["ACGTG", "ACGT ", "A", "C", "T", "G", "-", " "]} |
lets make the white space invalid. Lets try and rely on pandas as much as possible, if pandas reads in the " " as float('nan') then it's fine to include it here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, thankfully pandas.str.match
has a parameter that handles anything NA so we should be good on that front and don't need to add the long list of possible NA values here.
tests/test_validate.py
Outdated
True, | ||
), | ||
( | ||
pd.DataFrame({"REFERENCE_ALLELE": ["ACGT-G", pd.NA, None]}), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are float('nan')
currently allowed in the maf?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to this code, it seems like it could be possible for allele columns to be read in as numeric and get something like float('nan')
(i believe this is only for numeric columns).
This makes me wonder if we should have standardization of columns and their datatypes before hand (either they have the wrong datatype and stop validation there, or they have the wrong datatype, and it gets converted before further validation) otherwise this just adds a lot more responsibilities for the validations that are string column specific as they will error out with numeric columns.
Otherwise, will just have to add another if statement logic to check that the column is string or convert it to str because we won't be able to do our regex matching at all
|
||
# special handling for all NA column | ||
is_all_na = pd.isna(input_data[input_col]).all() | ||
if is_all_na and allow_na: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not the biggest fan of what has to be done here to allow the checking of non-str columns and columns with all NAs. I almost feel like it would be better to have the _check_allele_col
be an enforced rule where it skips certain validation rules if the column has any NAs OR if we choose to allow NAs, have a standardized handling for those (if that's even possible...) so that we don't have to do it in the individual validation functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, can you create a ticket to track that? for now, let's do the 80/20 rule. This does exactly what we need it to do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🔥 I'm going to pre-approve - great work! I'll look closer at the tests when I get home from the office.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👌👌 LGTM!
Purpose: There is no current allele validation so this PR adds it for the VCF and MAF file formats. Each format has its own list of accepted alleles.
Changes: I added two general allele validation and allele validation message functions to
genie/validate.py
:so that they can be more general and used by both vcf and maf file formats (and any relevant future file formats). I separated out the retrieval of the rows with the invalid allele values and the retrieval of the error message to make it easier for testing and also easier to implement row-based validation in the future if we would like (or keep our more high level validation).
Special Considerations:
NA handling: Based on our discussion, we will wait to handle NAs (default solution right now is to flag NAs as invalid). Because I'm using regex to match on the allele values, I would run into an issue with any NAs/missingness in the columns. I think that it's currently enforced that we can't have emptiness for the allele columns in these file formats but I added this parameter allow_na to flag any NAs as invalid otherwise the code will break when using
pandas.str.match
. We also have to have special handling to validate a column with all NA values in this function as well...I talked more about the caveats of handling NAs here
Testing: