Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GEN-809] Validate allele columns #539

Merged
merged 9 commits into from
Nov 9, 2023
Merged

[GEN-809] Validate allele columns #539

merged 9 commits into from
Nov 9, 2023

Conversation

rxu17
Copy link
Contributor

@rxu17 rxu17 commented Nov 3, 2023

Purpose: There is no current allele validation so this PR adds it for the VCF and MAF file formats. Each format has its own list of accepted alleles.

Changes: I added two general allele validation and allele validation message functions to genie/validate.py:

so that they can be more general and used by both vcf and maf file formats (and any relevant future file formats). I separated out the retrieval of the rows with the invalid allele values and the retrieval of the error message to make it easier for testing and also easier to implement row-based validation in the future if we would like (or keep our more high level validation).

Special Considerations:
NA handling: Based on our discussion, we will wait to handle NAs (default solution right now is to flag NAs as invalid). Because I'm using regex to match on the allele values, I would run into an issue with any NAs/missingness in the columns. I think that it's currently enforced that we can't have emptiness for the allele columns in these file formats but I added this parameter allow_na to flag any NAs as invalid otherwise the code will break when using pandas.str.match. We also have to have special handling to validate a column with all NA values in this function as well...

I talked more about the caveats of handling NAs here

Testing:

  • Unit tests were written to test the general allele validation functions.
  • Maf and Vcf classes' tests were adjusted to account for the new allele validation.
  • Validation was run on the test genie project.

@rxu17 rxu17 requested a review from a team as a code owner November 3, 2023 01:28
pd.Index: A pandas index object indicating the row indices that
don't match the allowed alleles
"""
search_str = rf"^[{''.join(allowed_alleles)}]+$"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this regex does actually check if for alleles like 'ATATATATAT'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does. Have to update it for '-', empty strings and NAs based on our discussion

@@ -70,6 +70,8 @@ class maf(FileTypeFormat):
_fileType = "maf"

_process_kwargs = []
_allele_cols = ["REFERENCE_ALLELE", "TUMOR_SEQ_ALLELE1", "TUMOR_SEQ_ALLELE2"]
_allowed_alleles = ["A", "T", "C", "G", "N", " ", "-"]
Copy link
Member

@thomasyu888 thomasyu888 Nov 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would remove " " and "-".

I like what you're thinking here for class attributes. In the future, these classes should follow SOLID principles, right now it is doing too much. I can imagine having a Validate class that ValidateMaf inherits from and then different compositions if necessary.

For example, I can imagine having a _required_columns attribute.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes agreed, I originally had them defined in the _validate function and was debating between the two but was thinking future wise when we do want to start separating out into the separate classes we talked about (e.g Processing class, Validate class and regular read in data class), thought it would be easier if we put constants like these up in the class attributes given how big _validate is getting. Even before then, we might have other allele variable specific validation to add.

if len(invalid_indices) > 0:
errors = (
f"{fileformat}: Your {invalid_col} column has invalid allele values. "
f"These are the accepted allele values: {allowed_alleles}.\n"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The alleles should be combinations of the allowed alleles.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, message will be updated based on discussions earlier.

[
(
pd.DataFrame(
{"REFERENCE_ALLELE": ["ACGT-G", "A-CGT ", "A", "C", "T", "G", "-", " "]}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
{"REFERENCE_ALLELE": ["ACGT-G", "A-CGT ", "A", "C", "T", "G", "-", " "]}
{"REFERENCE_ALLELE": ["ACGTG", "ACGT ", "A", "C", "T", "G", "-", " "]}

lets make the white space invalid. Lets try and rely on pandas as much as possible, if pandas reads in the " " as float('nan') then it's fine to include it here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, thankfully pandas.str.match has a parameter that handles anything NA so we should be good on that front and don't need to add the long list of possible NA values here.

True,
),
(
pd.DataFrame({"REFERENCE_ALLELE": ["ACGT-G", pd.NA, None]}),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are float('nan') currently allowed in the maf?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to this code, it seems like it could be possible for allele columns to be read in as numeric and get something like float('nan') (i believe this is only for numeric columns).

This makes me wonder if we should have standardization of columns and their datatypes before hand (either they have the wrong datatype and stop validation there, or they have the wrong datatype, and it gets converted before further validation) otherwise this just adds a lot more responsibilities for the validations that are string column specific as they will error out with numeric columns.

Otherwise, will just have to add another if statement logic to check that the column is string or convert it to str because we won't be able to do our regex matching at all

genie_registry/vcf.py Outdated Show resolved Hide resolved

# special handling for all NA column
is_all_na = pd.isna(input_data[input_col]).all()
if is_all_na and allow_na:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not the biggest fan of what has to be done here to allow the checking of non-str columns and columns with all NAs. I almost feel like it would be better to have the _check_allele_col be an enforced rule where it skips certain validation rules if the column has any NAs OR if we choose to allow NAs, have a standardized handling for those (if that's even possible...) so that we don't have to do it in the individual validation functions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, can you create a ticket to track that? for now, let's do the 80/20 rule. This does exactly what we need it to do.

@rxu17 rxu17 requested a review from thomasyu888 November 8, 2023 01:41
Copy link
Member

@thomasyu888 thomasyu888 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔥 I'm going to pre-approve - great work! I'll look closer at the tests when I get home from the office.

Copy link
Member

@thomasyu888 thomasyu888 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👌👌 LGTM!

@rxu17 rxu17 merged commit c430754 into develop Nov 9, 2023
7 checks passed
@rxu17 rxu17 deleted the gen-809-validate-allele branch November 9, 2023 00:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants