Revise parameter grid #114

rdvelazquez · 2017-09-21T20:57:47Z

Builds on #113 and revises the parameter grid in 2.mutation-classifier as follows:

l1_ratio: changed from 0.15 to 0
alpha: changed from [10 ** x for x in range(-3, 1)] to [10 ** x for x in range(-10, 10)]
n_components: changed from [50, 100] to a function that selects the number of components based on the number of positive samples in the query (or the number of negatives, in the rare case that a query has more positives than negatives). The function is shown below:

n_positives = min(y.sum(),len(y)-y.sum())
if n_positives > 500:
    n_components_list = [100]
elif n_positives > 250:
    n_components_list = [50]
else:
    n_components_list = [30]

This PR also adds stratify=y to train_test_split and revises the markdown note about the gene (below cell 3) so that it is general rather than referring only to TP53.
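To illustrate what stratify=y changes, here is a minimal sketch on synthetic data. The label vector and feature matrix are made up for illustration and are not the cognoma data; only the train_test_split arguments mirror the ones in this PR.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 100 positives out of 1,000 samples (illustrative only)
rng = np.random.RandomState(0)
y = np.array([1] * 100 + [0] * 900)
X = rng.normal(size=(len(y), 5))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.1, random_state=0
)

# With stratify=y, both partitions keep the overall 10% positive rate
train_rate = y_train.mean()
test_rate = y_test.mean()
```

Without stratify, the number of positives in a 10% test partition can drift from split to split, which matters most for genes with few mutated samples.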

patrick-miller · 2017-09-22T00:03:39Z

This looks good to me.

I'm not sure if parameterizing n_components by the number of positives as opposed to the % of positives is better. This only really matters if we obtain more data, which would probably lead to other design changes anyway.

rdvelazquez · 2017-09-22T00:29:31Z

Thanks for reviewing this @patrick-miller!

> I'm not sure if parameterizing n_components by the number of positives as opposed to the % of positives is better.

I think it's only better (or different at all) for queries that don't use all the samples, i.e. queries that are subset by disease. For example:

Query A: 50% positives (100 samples, 50 positives)
Query B: 10% positives (5,000 samples, 500 positives)

I think Query B should use more components than Query A, for two reasons: Query B will likely need more components to capture a similar amount of the variance, and Query B will be less prone to over-fitting than Query A. Let me know if that makes sense.
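The two queries above can be worked through with the thresholds from this PR. This is only a sketch: the function name n_components_by_count and the summary dictionary are hypothetical helpers introduced for illustration, not code from the PR.

```python
def n_components_by_count(min_class_size):
    """Apply this PR's thresholds to the size of the smaller class."""
    if min_class_size > 500:
        return 100
    elif min_class_size > 250:
        return 50
    return 30


# The two example queries from the comment above
queries = {
    "A": {"n_samples": 100, "n_positives": 50},
    "B": {"n_samples": 5000, "n_positives": 500},
}

summary = {}
for name, query in queries.items():
    n_pos = query["n_positives"]
    min_class_size = min(n_pos, query["n_samples"] - n_pos)
    summary[name] = {
        "percent_positive": 100 * n_pos / query["n_samples"],
        "min_class_size": min_class_size,
        "n_components": n_components_by_count(min_class_size),
    }
```

Under the count-based rule, Query A gets 30 components and Query B gets 50, even though Query A has the higher percentage of positives. Note that Query B lands on 50 rather than 100 because the comparison is strict: 500 is not greater than 500.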

patrick-miller · 2017-09-22T00:47:30Z

I'm not positive, but I think you are right.

rdvelazquez · 2017-09-22T00:50:39Z

> I'm not positive, but I think you are right

Positive... I love a good pun 😃 (I'm terrible I know)

I'll give @dhimmel a chance to look at this if he wants before we merge it.

dhimmel · 2017-09-22T15:14:59Z

scripts/2.mutation-classifier.py

 # Typically, this type of split can only be done 
 # for genes where the number of mutations is large enough
 X = pd.concat([covariate_df, expression_df], axis='columns')
-X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)
+X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.1, random_state=0)


Great! It'd also be nice to stratify by disease type, but only if there is an elegant implementation. We can also do this in a later PR.

rdvelazquez · 2017-09-26

I'll look into this in a future PR. Based on a quick review, it looks like multi-column stratify isn't supported in scikit-learn 0.18.x (cognoma is currently using 0.18.1) but was added/fixed in 0.19.0:
scikit-learn/scikit-learn#9044
scikit-learn/scikit-learn#9037

dhimmel · 2017-09-26

Nice! In a future upgrade, we may want to consider upgrading everything as much as possible, and could also make this change.
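One possible way to stratify on mutation status and disease type together is to collapse the two columns into a single 1-D label and pass that to stratify. This is a sketch of a common workaround, not what this PR does. The column names status and disease and the toy data are hypothetical, and it has not been checked against the scikit-learn version pinned in cognoma.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data; the column names 'status' and 'disease' are hypothetical
df = pd.DataFrame({
    "status": [1, 0] * 40,
    "disease": ["BRCA"] * 40 + ["LUAD"] * 40,
    "feature": range(80),
})

# Collapse both columns into one 1-D label, e.g. "BRCA_1"
strata = df["disease"] + "_" + df["status"].astype(str)

train_df, test_df = train_test_split(
    df, stratify=strata, test_size=0.25, random_state=0
)

# Each disease/status combination should be equally represented in the test set
test_counts = test_df.groupby(["disease", "status"]).size()
```

A caveat with this approach: any disease and status combination with fewer than two samples makes the stratified split raise an error, so rare strata would need to be merged or handled separately.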

dhimmel · 2017-09-22T15:16:41Z

scripts/2.mutation-classifier.py

-regularization_l1_ratio = 0.15
+regularization_alpha_list = [10 ** x for x in range(-10, 10)]
+# Chose n_components based on number of positives (or negatives, if that is less)
+n_positives = min(y.sum(),len(y)-y.sum())


Style: add spaces after commas and around operators (-).

Let's use a more accurate variable name, like min_class_size.
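Applied to the hunk above, the two suggestions might look like the sketch below. The toy y Series is an assumption standing in for the real mutation status vector; the alpha list is copied from the diff.

```python
import pandas as pd

# Toy binary labels standing in for the mutation status vector y (illustrative only)
y = pd.Series([1, 0, 0, 1, 0, 0, 0, 1])

# Alpha grid from the diff: 20 powers of ten, from 10 ** -10 up to 10 ** 9
regularization_alpha_list = [10 ** x for x in range(-10, 10)]

# Size of the smaller class, whichever label that happens to be
min_class_size = min(y.sum(), len(y) - y.sum())
```

The name min_class_size stays accurate in the rare case where positives outnumber negatives, which n_positives does not.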

dhimmel · 2017-09-26T17:41:37Z

@rdvelazquez or @patrick-miller someone squash merge this!

Commit b5a0131: revise parameter-grid
dhimmel reviewed Sep 22, 2017
Commit 3277274: address dh comments
dhimmel approved these changes Sep 26, 2017
rdvelazquez merged commit bf533ae into cognoma:master Sep 26, 2017

This was referenced Sep 26, 2017:
Machine Learning Punch List for Launch #110 (Closed)
Selecting the elastic net mixing parameter #56 (Closed)
