Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revise parameter grid #114

Merged
merged 2 commits into from
Sep 26, 2017
Merged

Conversation

rdvelazquez
Copy link
Member

Builds on #113 and revises the parameter grid in n.mutation-classifier as follows:

  • l1_ratio: changed from 0.15 to 0
  • alpha: changed from [10 ** x for x in range(-3, 1)] to [10 ** x for x in range(-10, 10)]
  • 'n_components: changed from [50, 100] to a function that selects the number of components based on the number of positive samples in the query (or the number of negatives if it is a rare instance with more positives than negatives). The function is shown below:
n_positives = min(y.sum(),len(y)-y.sum())
if n_positives > 500:
    n_components_list = [100]
elif n_positives > 250:
    n_components_list = [50]
else:
    n_components_list = [30]

This PR also added stratify=y to the test_train_split and revised the markdown note about the gene (below cell 3) to be more general as opposed to just referencing TP53.

@patrick-miller
Copy link
Member

This looks good to me.

I'm not sure if parameterizing n_components by a the number of positives as opposed to the % of positives is better. This only really matters if we obtain more data, which would probably lead to other design changes as well anyway.

@rdvelazquez
Copy link
Member Author

Thanks for reviewing this @patrick-miller!

I'm not sure if parameterizing n_components by the number of positives as opposed to the % of positives is better.

I think it's only better (or different at all) when there are queries that don't use all the samples (that are subset by disease). For example:

  • Query A: 50% positives (100 samples, 50 positives)
  • Query B: 10% positives (5,000 samples, 500 positives)

I think Query B should use more components than Query A because Query B will likely need more components to capture a similar amount of the variance and Query B will be less prone to over-fitting than Query A. Let me know if that made sense.

@patrick-miller
Copy link
Member

I'm not positive, but I think you are right.

@rdvelazquez
Copy link
Member Author

rdvelazquez commented Sep 22, 2017

I'm not positive, but I think you are right

Positive... I love a good pun 😃 (I'm terrible I know)

I'll give @dhimmel a chance to look at this if he wants before we merge it.

# Typically, this type of split can only be done
# for genes where the number of mutations is large enough
X = pd.concat([covariate_df, expression_df], axis='columns')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.1, random_state=0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! It'd also be nice to stratify by disease type, but only if there is an elegant implementation. Can also do this a later PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll look into this in a future PR. Based on a quick review it looks like multi-column stratify isn't supported in scikit-learn 18.x (cognoma is currently using 18.1) but was added/fixed in 19.0:
scikit-learn/scikit-learn#9044
scikit-learn/scikit-learn#9037

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! In a future upgrade, we may want to consider upgrading everything as much as possible... and could also make this change.

regularization_l1_ratio = 0.15
regularization_alpha_list = [10 ** x for x in range(-10, 10)]
# Chose n_components based on number of positives (or negatives, if that is less)
n_positives = min(y.sum(),len(y)-y.sum())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style: spaces after comma and surrounding operators (-).

Let use a more accurate variable name, like min_class_size.

@dhimmel
Copy link
Member

dhimmel commented Sep 26, 2017

@rdvelazquez or @patrick-miller someone squash merge this!

@rdvelazquez rdvelazquez merged commit bf533ae into cognoma:master Sep 26, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants