Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add cell cycle score baseline #706

Conversation

scottgigante-immunai
Copy link
Collaborator

cc_score was not well captured by existing baselines. This method should get a perfect score.

@LuckyMD
Copy link
Collaborator

LuckyMD commented Nov 28, 2022

Not sure this will get a perfect score, as the metric checks whether the variance contribution has changes per batch from unintegrated data. This will have more variance contribution and therefore perform poorly again. I suggest we just add "unintegrated" as a baseline.

Copy link
Collaborator

@LuckyMD LuckyMD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned, I don't think this will get a score of 1 as it increases CC variance contribution (I think we also penalize for that). Also, should organism be added on the level of data loader or dataset function?

Really good catch that this wasn't defined in the API though!

@scottgigante-immunai
Copy link
Collaborator Author

scottgigante-immunai commented Nov 28, 2022 via email

@LuckyMD
Copy link
Collaborator

LuckyMD commented Nov 28, 2022

Okay, unintegrated shouldn't be bad... by definition this should work. Maybe @danielStrobl can take a look at this?

@LuckyMD
Copy link
Collaborator

LuckyMD commented Nov 28, 2022

Re: loaders, I don't know if we want to enforce this on all loaders yet
(but we could!)

We don't have to enforce on all loaders yet, but it does seem to me that it's something that is part of adding a dataset to open problems. Ideally, if someone adds a dataset, then it should be re-usable by other tasks without someone having to research what this actually is. If we use CELLxGENE more frequently in future, this field will also exist there. We could even borrow from their schema here if we want. Looks like i'm arguing myself into the position of "it should be used by all data loaders"... hmm... maybe do this in steps? First for these datasets, then for the rest in future.

@scottgigante-immunai
Copy link
Collaborator Author

Okay, unintegrated shouldn't be bad... by definition this should work. Maybe @danielStrobl can take a look at this?

Ah, I see. by definition here:

CCconservation=1−|Varafter−Varbefore|/Varbefore

unintegrated should get a perfect score... if you look at https://openproblems-nbt2022-reproducibility.netlify.app/results/batch_integration_embed/ it gets a score of 0.75, compared to scanorama, harmony, combat that score closer to 0.9.

@scottgigante-immunai
Copy link
Collaborator Author

Looks like i'm arguing myself into the position of "it should be used by all data loaders"... hmm... maybe do this in steps? First for these datasets, then for the rest in future.

I agree. Ultimately it should be something that every dataset defines, but we don't need it right now.

@scottgigante-immunai
Copy link
Collaborator Author

scottgigante-immunai commented Nov 28, 2022

I think something is wrong with the cc_score metric.

>>> import openproblems
>>> adata = openproblems.tasks.batch_integration_embed.datasets.immune_batch()
>>> adata_combat = openproblems.tasks.batch_integration_embed.methods.combat_hvg_unscaled(adata.copy())
>>> adata_baseline = openproblems.tasks.batch_integration_embed.methods.no_integration(adata.copy())
>>> openproblems.tasks.batch_integration_embed.metrics.cc_score(adata_combat)
0.8884892448780356
>>> openproblems.tasks.batch_integration_embed.metrics.cc_score(adata_baseline)
0.7510949842852523

if I call directly from scib:

>>> from scib.metrics import cell_cycle
>>> adata_baseline.obsm["X_pca"] = adata_baseline.obsm["X_uni_pca"]
>>> cell_cycle(adata_baseline, adata_baseline, "batch", embed="X_emb", organism="human")
0.7510949842852523

or if I compute the PCA myself:

>>> import scanpy as sc
>>> sc.pp.pca(adata)
>>> adata.obsm["X_emb"] = adata.obsm["X_pca"]
>>> cell_cycle(adata_baseline, adata_baseline, "batch", embed="X_emb", organism="human")
0.7510949842852523

cell_cycle recomputes PCA (see https://github.com/theislab/scib/blob/f0be8267256427e307c5979f4d20dc3e5dc33d04/scib/metrics/cell_cycle.py#L175) even if PCA is already computed. Could this be the cause?

@LuckyMD
Copy link
Collaborator

LuckyMD commented Nov 29, 2022

I don't see why recomputing should be the issue tbh... As long as it's using the embedding. I think we decided to recompute X_pca from X_emb as sometimes X_pca is a remnant of the unintegrated embedding (e.g., in FastMNN or Scanorama), so it makes sense to recompute as you don't know where the existing X_pca comes from.

@scottgigante-immunai
Copy link
Collaborator Author

It scores 1.0 if you don't give it an embedding (i.e., it computes PCA on adata.X for both pre and post.) If you give it X_emb == X_uni_pca, it scores 0.75. This is definitely a bug.

@scottgigante-immunai
Copy link
Collaborator Author

New approach -- simply passing the raw data gives a perfect score, so let's just do that. In theory passing X_uni_pca should be right, but pending a fix from scIB, this will work.

@scottgigante-immunai
Copy link
Collaborator Author

scottgigante-immunai commented Dec 1, 2022

@scottgigante-immunai scottgigante-immunai marked this pull request as ready for review December 1, 2022 22:14
@scottgigante-immunai
Copy link
Collaborator Author

@danielStrobl I assume Malte is gone by now. Mind reviewing this for me?

@codecov
Copy link

codecov bot commented Dec 1, 2022

Codecov Report

Base: 95.04% // Head: 95.04% // Decreases project coverage by -0.00% ⚠️

Coverage data is based on head (d981d42) compared to base (aa22537).
Patch coverage: 86.95% of modified lines in pull request are covered.

❗ Current head d981d42 differs from pull request most recent head a69f851. Consider uploading reports for the commit a69f851 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #706      +/-   ##
==========================================
- Coverage   95.04%   95.04%   -0.01%     
==========================================
  Files         154      154              
  Lines        4073     4093      +20     
  Branches      207      207              
==========================================
+ Hits         3871     3890      +19     
- Misses        131      132       +1     
  Partials       71       71              
Flag Coverage Δ
unittests 95.04% <86.95%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...egration/batch_integration_embed/metrics/_utils.py 100.00% <ø> (+33.33%) ⬆️
...egration/batch_integration_graph/methods/_utils.py 61.11% <50.00%> (-3.18%) ⬇️
...ration/batch_integration_embed/methods/baseline.py 97.29% <87.50%> (-2.71%) ⬇️
.../_batch_integration/batch_integration_embed/api.py 100.00% <100.00%> (ø)
...ration/batch_integration_embed/metrics/cc_score.py 100.00% <100.00%> (ø)
...gration/batch_integration_graph/datasets/immune.py 100.00% <100.00%> (ø)
...ation/batch_integration_graph/datasets/pancreas.py 100.00% <100.00%> (ø)
...integration/batch_integration_graph/methods/mnn.py 100.00% <100.00%> (ø)
...ation/batch_integration_graph/methods/scanorama.py 100.00% <100.00%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@scottgigante-immunai
Copy link
Collaborator Author

Achieves perfect performance per theislab/scib#351 (comment)

@scottgigante-immunai scottgigante-immunai merged commit 7ffc855 into openproblems-bio:main Dec 3, 2022
@scottgigante-immunai scottgigante-immunai deleted the batch_integration_embed/baseline/cc_score branch December 3, 2022 02:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants