Add cell cycle score baseline #706

scottgigante-immunai · 2022-11-26T20:24:54Z

cc_score was not well captured by existing baselines. This method should get a perfect score.

LuckyMD · 2022-11-28T12:19:03Z

Not sure this will get a perfect score, as the metric checks whether the per-batch variance contribution has changed relative to the unintegrated data. This method will have a higher variance contribution and will therefore perform poorly again. I suggest we just add "unintegrated" as a baseline.

LuckyMD

As mentioned, I don't think this will get a score of 1 as it increases CC variance contribution (I think we also penalize for that). Also, should organism be added on the level of data loader or dataset function?

Really good catch that this wasn't defined in the API though!

scottgigante-immunai · 2022-11-28T12:58:52Z

Valid point, but unintegrated is already there and performs poorly. I'll look at the paper and figure out what this should look like. Re: loaders, I don't know if we want to enforce this on all loaders yet (but we could!)

LuckyMD · 2022-11-28T13:03:25Z

Okay, unintegrated shouldn't be bad... by definition this should work. Maybe @danielStrobl can take a look at this?

LuckyMD · 2022-11-28T13:07:02Z

> Re: loaders, I don't know if we want to enforce this on all loaders yet (but we could!)

We don't have to enforce this on all loaders yet, but it does seem to me that it's part of adding a dataset to Open Problems. Ideally, if someone adds a dataset, it should be reusable by other tasks without someone having to research what the organism actually is. If we use CELLxGENE more frequently in future, this field will also exist there. We could even borrow from their schema here if we want. Looks like I'm arguing myself into the position of "it should be used by all data loaders"... hmm... maybe do this in steps? First for these datasets, then for the rest in future.

scottgigante-immunai · 2022-11-28T14:51:35Z

> Okay, unintegrated shouldn't be bad... by definition this should work. Maybe @danielStrobl can take a look at this?

Ah, I see. By the definition here:

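$$\mathrm{CC}_{\mathrm{conservation}} = 1 - \frac{|\mathrm{Var}_{\mathrm{after}} - \mathrm{Var}_{\mathrm{before}}|}{\mathrm{Var}_{\mathrm{before}}}$$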

unintegrated should get a perfect score, since the variance before and after would be identical. But if you look at https://openproblems-nbt2022-reproducibility.netlify.app/results/batch_integration_embed/ it gets a score of 0.75, while scanorama, harmony, and combat score closer to 0.9.
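
For reference, here is a minimal sketch of the formula above in plain Python. It is not the scIB implementation: the helper names, the per-batch dictionaries, and the choice to average per-batch scores with a simple mean are all assumptions made for illustration, and the variance numbers are invented.

```python
from statistics import mean


def cc_conservation(var_before: float, var_after: float) -> float:
    """Score for a single batch: 1 - |Var_after - Var_before| / Var_before."""
    if var_before <= 0:
        raise ValueError("var_before must be positive")
    return 1 - abs(var_after - var_before) / var_before


def mean_cc_conservation(before: dict, after: dict) -> float:
    """Average the per-batch scores (simple mean; an assumption, not scIB's code)."""
    return mean(cc_conservation(before[b], after[b]) for b in before)


# Hypothetical cell-cycle variance contributions per batch (made-up numbers).
before = {"batch_a": 0.20, "batch_b": 0.10}
unintegrated = dict(before)                      # nothing changed
inflated = {"batch_a": 0.30, "batch_b": 0.12}    # variance contribution increased

print(mean_cc_conservation(before, unintegrated))
print(mean_cc_conservation(before, inflated))
```

Under these assumptions, identical before and after values give the maximum score, and any change in the variance contribution, up or down, lowers it because of the absolute value.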

scottgigante-immunai · 2022-11-28T14:52:14Z

> Looks like I'm arguing myself into the position of "it should be used by all data loaders"... hmm... maybe do this in steps? First for these datasets, then for the rest in future.

I agree. Ultimately it should be something that every dataset defines, but we don't need it right now.

scottgigante-immunai · 2022-11-28T17:49:09Z

I think something is wrong with the cc_score metric.

>>> import openproblems
>>> adata = openproblems.tasks.batch_integration_embed.datasets.immune_batch()
>>> adata_combat = openproblems.tasks.batch_integration_embed.methods.combat_hvg_unscaled(adata.copy())
>>> adata_baseline = openproblems.tasks.batch_integration_embed.methods.no_integration(adata.copy())
>>> openproblems.tasks.batch_integration_embed.metrics.cc_score(adata_combat)
0.8884892448780356
>>> openproblems.tasks.batch_integration_embed.metrics.cc_score(adata_baseline)
0.7510949842852523

if I call directly from scib:

>>> from scib.metrics import cell_cycle
>>> adata_baseline.obsm["X_pca"] = adata_baseline.obsm["X_uni_pca"]
>>> cell_cycle(adata_baseline, adata_baseline, "batch", embed="X_emb", organism="human")
0.7510949842852523

or if I compute the PCA myself:

>>> import scanpy as sc
>>> sc.pp.pca(adata)
>>> adata.obsm["X_emb"] = adata.obsm["X_pca"]
>>> cell_cycle(adata_baseline, adata_baseline, "batch", embed="X_emb", organism="human")
0.7510949842852523

cell_cycle recomputes PCA (see https://github.com/theislab/scib/blob/f0be8267256427e307c5979f4d20dc3e5dc33d04/scib/metrics/cell_cycle.py#L175) even if PCA is already computed. Could this be the cause?

LuckyMD · 2022-11-29T12:04:22Z

I don't see why recomputing should be the issue tbh... As long as it's using the embedding. I think we decided to recompute X_pca from X_emb as sometimes X_pca is a remnant of the unintegrated embedding (e.g., in FastMNN or Scanorama), so it makes sense to recompute as you don't know where the existing X_pca comes from.

scottgigante-immunai · 2022-11-29T14:13:06Z

It scores 1.0 if you don't give it an embedding (i.e., it computes PCA on adata.X for both pre and post.) If you give it X_emb == X_uni_pca, it scores 0.75. This is definitely a bug.

scottgigante-immunai · 2022-12-01T17:26:47Z

New approach -- simply passing the raw data gives a perfect score, so let's just do that. In theory passing X_uni_pca should be right, but pending a fix from scIB, this will work.

scottgigante-immunai · 2022-12-01T17:26:55Z

Tests passing at https://tower.nf/orgs/openproblems-bio/workspaces/openproblems-bio/watch/5djcm9XQYGrhkJ

scottgigante-immunai · 2022-12-01T22:20:05Z

@danielStrobl I assume Malte is gone by now. Mind reviewing this for me?

codecov · 2022-12-01T22:56:07Z

Codecov Report

Base: 95.04% // Head: 95.04% // Decreases project coverage by -0.00% ⚠️

Coverage data is based on head (d981d42) compared to base (aa22537).
Patch coverage: 86.95% of modified lines in pull request are covered.

❗ Current head d981d42 differs from pull request most recent head a69f851. Consider uploading reports for the commit a69f851 to get more accurate results

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #706      +/-   ##
==========================================
- Coverage   95.04%   95.04%   -0.01%     
==========================================
  Files         154      154              
  Lines        4073     4093      +20     
  Branches      207      207              
==========================================
+ Hits         3871     3890      +19     
- Misses        131      132       +1     
  Partials       71       71

| Flag | Coverage Δ | |
|---|---|---|
| unittests | `95.04% <86.95%> (-0.01%)` | ⬇️ |

| Impacted Files | Coverage Δ | |
|---|---|---|
| ...egration/batch_integration_embed/metrics/_utils.py | `100.00% <ø> (+33.33%)` | ⬆️ |
| ...egration/batch_integration_graph/methods/_utils.py | `61.11% <50.00%> (-3.18%)` | ⬇️ |
| ...ration/batch_integration_embed/methods/baseline.py | `97.29% <87.50%> (-2.71%)` | ⬇️ |
| .../_batch_integration/batch_integration_embed/api.py | `100.00% <100.00%> (ø)` | |
| ...ration/batch_integration_embed/metrics/cc_score.py | `100.00% <100.00%> (ø)` | |
| ...gration/batch_integration_graph/datasets/immune.py | `100.00% <100.00%> (ø)` | |
| ...ation/batch_integration_graph/datasets/pancreas.py | `100.00% <100.00%> (ø)` | |
| ...integration/batch_integration_graph/methods/mnn.py | `100.00% <100.00%> (ø)` | |
| ...ation/batch_integration_graph/methods/scanorama.py | `100.00% <100.00%> (ø)` | |

scottgigante-immunai · 2022-12-03T02:26:47Z

Achieves perfect performance per theislab/scib#351 (comment)

scottgigante-immunai added 2 commits November 26, 2022 15:23

add cc_score baseline

020d907

document

fe14551

scottgigante-immunai requested review from rcannood and danielStrobl November 26, 2022 20:24

scottgigante-immunai and others added 13 commits November 26, 2022 15:27

Merge branch 'main' into batch_integration_embed/baseline/cc_score

5f257fd

Merge branch 'main' into batch_integration_embed/baseline/cc_score

f81f946

Merge branch 'main' into batch_integration_embed/baseline/cc_score

838da80

Make sure method didn't remove uns

5596e45

Combat tramples uns

cd3f88b

Revert

9e85204

Scale and hvg trample uns

da437b5

scanorama clears uns

eadbc8d

mnn tramples uns

5fff381

just copy uns

f755ae7

Merge branch 'main' into batch_integration_embed/baseline/cc_score

ec14bc2

just copy uns

ecb88b2

Merge branch 'batch_integration_embed/baseline/cc_score' of github.co…

d273c9f

…m:scottgigante-immunai/openproblems into batch_integration_embed/baseline/cc_score

LuckyMD requested changes Nov 28, 2022

scottgigante-immunai added 2 commits November 28, 2022 13:38

don't set X_emb if missing; it shouldn't ever be missing

9c70c22

Merge branch 'main' into batch_integration_embed/baseline/cc_score

09adbba

scottgigante-immunai mentioned this pull request Nov 29, 2022

cell_cycle returns poor scores on perfect data input theislab/scib#351

Open

scottgigante-immunai added 2 commits December 1, 2022 10:16

Merge remote-tracking branch 'base/main' into batch_integration_embed…

fdd6ea8

…/baseline/cc_score

use true features as embedding

d981d42

scottgigante-immunai marked this pull request as ready for review December 1, 2022 22:14

scottgigante-immunai requested a review from LuckyMD December 1, 2022 22:14

scottgigante-immunai and others added 3 commits December 2, 2022 10:34

compute PCA per batch

0d2dd69

Merge branch 'main' into batch_integration_embed/baseline/cc_score

fa4ec33

Set code version

a69f851

scottgigante-immunai merged commit 7ffc855 into openproblems-bio:main Dec 3, 2022

scottgigante-immunai deleted the batch_integration_embed/baseline/cc_score branch December 3, 2022 02:26
