-
Notifications
You must be signed in to change notification settings - Fork 47
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Revise parameter grid #114
Conversation
This looks good to me. I'm not sure if parameterizing |
Thanks for reviewing this @patrick-miller!
I think it's only better (or different at all) when there are queries that don't use all the samples (that are subset by disease). For example:
I think Query B should use more components than Query A because Query B will likely need more components to capture a similar amount of the variance and Query B will be less prone to over-fitting than Query A. Let me know if that made sense. |
I'm not positive, but I think you are right. |
Positive... I love a good pun 😃 (I'm terrible I know) I'll give @dhimmel a chance to look at this if he wants before we merge it. |
# Typically, this type of split can only be done | ||
# for genes where the number of mutations is large enough | ||
X = pd.concat([covariate_df, expression_df], axis='columns') | ||
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0) | ||
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.1, random_state=0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great! It'd also be nice to stratify by disease type, but only if there is an elegant implementation. Can also do this a later PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll look into this in a future PR. Based on a quick review it looks like multi-column stratify isn't supported in scikit-learn 18.x (cognoma is currently using 18.1) but was added/fixed in 19.0:
scikit-learn/scikit-learn#9044
scikit-learn/scikit-learn#9037
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! In a future upgrade, we may want to consider upgrading everything as much as possible... and could also make this change.
scripts/2.mutation-classifier.py
Outdated
regularization_l1_ratio = 0.15 | ||
regularization_alpha_list = [10 ** x for x in range(-10, 10)] | ||
# Chose n_components based on number of positives (or negatives, if that is less) | ||
n_positives = min(y.sum(),len(y)-y.sum()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Style: spaces after comma and surrounding operators (-
).
Let use a more accurate variable name, like min_class_size
.
@rdvelazquez or @patrick-miller someone squash merge this! |
Builds on #113 and revises the parameter grid in
n.mutation-classifier
as follows:l1_ratio
: changed from 0.15 to 0alpha
: changed from[10 ** x for x in range(-3, 1)]
to[10 ** x for x in range(-10, 10)]
[50, 100]
to a function that selects the number of components based on the number of positive samples in the query (or the number of negatives if it is a rare instance with more positives than negatives). The function is shown below:This PR also added
stratify=y
to thetest_train_split
and revised the markdown note about the gene (below cell 3) to be more general as opposed to just referencing TP53.