Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ADT: Add Biomarker/Pathology GX validation suites #153

Merged
merged 13 commits into from
Nov 6, 2024

Conversation

beatrizsaldana
Copy link
Member

@beatrizsaldana beatrizsaldana commented Oct 29, 2024

Jira Ticket

Adding Great Expectations validation for the Model AD Biomarker and Pathology data transforms. The points column is a nested column, so a points.json file was made to validate the column using the expect_column_values_to_match_json_schema() function.

@beatrizsaldana beatrizsaldana changed the title Beatrizsaldana/mg 108/gx validation suites ADT: Add Biomarker/Pathology GX validation suites Oct 29, 2024
@beatrizsaldana beatrizsaldana marked this pull request as ready for review October 31, 2024 19:53
Copy link
Contributor

@BWMac BWMac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@beatrizsaldana can you please link to the GX reports generated by the new suites?

"validator.expect_column_values_to_be_of_type(\"type\", \"str\")\n",
"validator.expect_column_values_to_not_be_null(\"type\")\n",
"# allows all alphanumeric characters, underscores, periods, and dashes\n",
"validator.expect_column_values_to_match_regex(\"type\", \"^[A-Za-z0-9\\s_.-]+$\")"
Copy link
Contributor

@jaclynbeck-sage jaclynbeck-sage Oct 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question. Even better: do we have a limited list of values that can go in this field (i.e. ["Type A", "Type B", "Type C"...] or is it a free-for-all?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JessterB do you know the answer to this question?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not a free-for-all, each of the two datasets has a unique set of valid values.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are they and should we make a list of them instead of using regex?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should be able to use the unique set of values in the type column in the current files, as your list. There's code in one of the other GX suites that Brad wrote, that tells GX to compare values to a list of pre-defined values. Off the top of my head I don't remember which suite to point you to though. Biodomains probably has it?

Copy link
Contributor

@BWMac BWMac Nov 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, you are thinking of expect_column_values_to_be_in_set which is used in this suite (among others).

"validator.expect_column_values_to_be_of_type(\"model\", \"str\")\n",
"validator.expect_column_values_to_not_be_null(\"model\")\n",
"# allows all alphanumeric characters, underscores, periods, and dashes\n",
"validator.expect_column_values_to_match_regex(\"type\", \"^[A-Za-z0-9\\s\\(\\)\\*_.-]+$\")\n"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assume my questions in the biomarkers file also apply here.

CONTRIBUTING.md Outdated
#### Nested Columns

If the transform includes nested columns (example: `druggability` column in `gene_info` tranform), please follow these four steps:
1. Add the nested column name to the `gx_nested_columns` flag for the specific transform. This will convert the column values to a JSON parsable string.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mention that this goes in the config file.

@thomasyu888 thomasyu888 removed the request for review from BryanFauble October 31, 2024 23:12
@BWMac BWMac requested a review from a team November 1, 2024 18:00
gx_suite_definitions/biomarkers.ipynb Outdated Show resolved Hide resolved
"validator.expect_column_values_to_be_of_type(\"type\", \"str\")\n",
"validator.expect_column_values_to_not_be_null(\"type\")\n",
"# allows all alphanumeric characters, underscores, periods, and dashes\n",
"validator.expect_column_values_to_match_regex(\"type\", \"^[A-Za-z0-9\\s_.-]+$\")"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not a free-for-all, each of the two datasets has a unique set of valid values.

gx_suite_definitions/biomarkers.ipynb Outdated Show resolved Hide resolved
gx_suite_definitions/biomarkers.ipynb Show resolved Hide resolved
gx_suite_definitions/biomarkers.ipynb Show resolved Hide resolved
"outputs": [],
"source": [
"# unique entries ExpectSelectColumnValuesToBeUniqueWithinRecord\n",
"validator.expect_select_column_values_to_be_unique_within_record(column_list=[\"model\", \"type\", \"age_death\", \"tissue\", \"units\"])"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's remove units from this list; we should never have the same model/type/age/tissue with different units...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I remove it from this list or also from the column-grouping in the transform?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove it from this mutil-value unique expectation only.

@BWMac BWMac self-requested a review November 4, 2024 15:19
Beatriz Saldana added 2 commits November 6, 2024 09:33
…sed those unique values for gx validation for biomarkers and pathology datasets
"outputs": [],
"source": [
"# unique entries ExpectSelectColumnValuesToBeUniqueWithinRecord\n",
"validator.expect_select_column_values_to_be_unique_within_record(column_list=[\"model\", \"type\", \"age_death\", \"tissue\", \"units\"])"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove it from this mutil-value unique expectation only.

gx_suite_definitions/pathology.ipynb Show resolved Hide resolved
Copy link

sonarcloud bot commented Nov 6, 2024

Copy link
Contributor

@JessterB JessterB left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

@beatrizsaldana beatrizsaldana merged commit 6fd726d into dev Nov 6, 2024
8 checks passed
@beatrizsaldana beatrizsaldana deleted the beatrizsaldana/MG-108/GX-validation-suites branch December 2, 2024 18:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants