`cell_cycle` returns poor scores on perfect data input #351

scottgigante-immunai · 2022-11-29T14:18:30Z

If we pass PCA on the unintegrated data, we should get a perfect score. We don't.

>>> import scanpy as sc
>>> adata = sc.datasets.paul15()
>>> adata.obsm["X_emb"] = adata.X
>>> from scib.metrics import cell_cycle
>>> cell_cycle(adata, adata, "paul15_clusters", embed="X_emb")
1.0
>>> adata.obsm["X_emb"] = sc.tl.pca(adata, n_comps=50, use_highly_variable=False, svd_solver="arpack", copy=True).obsm["X_pca"]
>>> cell_cycle(adata, adata, "paul15_clusters", embed="X_emb")
0.8083203810376348

Related: openproblems-bio/openproblems#706

The text was updated successfully, but these errors were encountered:

mumichae · 2022-12-02T14:04:52Z

Hi Scott,
this is likely related to the fact that when the PCA is recomputed, it's done so per batch. So essentially, the PC regression will be computed on different principle components, than if you were to use the globally computed PCA.

scib/scib/metrics/cell_cycle.py

Line 71 in 2fe05c7

for batch in batches:

@LuckyMD I think we decided on per-batch computation of the PCA. Is this still the behaviour we want? If so it might make sense to default to encourage PCA recomputation or remove the reuse of existings PC components altogether to avoid confusion.

scottgigante-immunai · 2022-12-02T15:17:54Z

Oh! That make sense. It's a bit of a funny result then that the "perfect embedding" here is one in which the batches are embedded with PCA separately and then smashed together, while an embedding that keeps the raw data as-is performs relatively poorly... but at least from the openproblems perspective there is a simple solution here.

scottgigante-immunai · 2022-12-02T15:30:56Z

>>> import numpy as np
>>> import scanpy as sc
>>> adata = sc.datasets.paul15()
>>> adata.obsm["X_emb"] = np.zeros((adata.shape[0], 50), dtype=float)
>>> for batch in adata.obs["paul15_clusters"].unique():
...     batch_idx = adata.obs["paul15_clusters"] == batch
...     n_comps = min(50, np.sum(batch_idx))
...     solver = "full" if n_comps == np.sum(batch_idx) else "arpack"
...     adata.obsm["X_emb"][batch_idx,:n_comps] = sc.tl.pca(adata[batch_idx], n_comps=n_comps, use_highly_variable=False, svd_solver=solver, copy=True).obsm["X_pca"]
... 
>>> from scib.metrics import cell_cycle
>>> cell_cycle(adata, adata, "paul15_clusters", embed="X_emb")
0.9999998942341607

LuckyMD · 2022-12-04T11:45:53Z

Hah... yes! This still makes sense I think. We don't want the PCA to capture batch differences in the CC score. I forgot this completely! Thanks so much for highlighting this @mumichae !

scottgigante-immunai mentioned this issue Dec 3, 2022

Add cell cycle score baseline openproblems-bio/openproblems#706

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`cell_cycle` returns poor scores on perfect data input #351

`cell_cycle` returns poor scores on perfect data input #351

scottgigante-immunai commented Nov 29, 2022 •

edited

Loading

mumichae commented Dec 2, 2022

scottgigante-immunai commented Dec 2, 2022

scottgigante-immunai commented Dec 2, 2022

LuckyMD commented Dec 4, 2022

cell_cycle returns poor scores on perfect data input #351

cell_cycle returns poor scores on perfect data input #351

Comments

scottgigante-immunai commented Nov 29, 2022 • edited Loading

mumichae commented Dec 2, 2022

scottgigante-immunai commented Dec 2, 2022

scottgigante-immunai commented Dec 2, 2022

LuckyMD commented Dec 4, 2022

`cell_cycle` returns poor scores on perfect data input #351

`cell_cycle` returns poor scores on perfect data input #351

scottgigante-immunai commented Nov 29, 2022 •

edited

Loading