cell_cycle returns poor scores on perfect data input #351

Open
scottgigante-immunai opened this issue Nov 29, 2022 · 4 comments

scottgigante-immunai (Contributor) commented Nov 29, 2022

If we pass the PCA of the unintegrated data as the embedding, we should get a perfect score. We don't.

>>> import scanpy as sc
>>> adata = sc.datasets.paul15()
>>> adata.obsm["X_emb"] = adata.X
>>> from scib.metrics import cell_cycle
>>> cell_cycle(adata, adata, "paul15_clusters", embed="X_emb")
1.0
>>> adata.obsm["X_emb"] = sc.tl.pca(adata, n_comps=50, use_highly_variable=False, svd_solver="arpack", copy=True).obsm["X_pca"]
>>> cell_cycle(adata, adata, "paul15_clusters", embed="X_emb")
0.8083203810376348
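
For context on why I expected a perfect score: as far as I understand, the metric compares how much variance the cell-cycle score explains in a PCA of the unintegrated data with how much it explains in the provided embedding (a PC regression), so feeding in the same information should leave that fraction unchanged. A minimal sketch of the PC-regression idea, just to illustrate the comparison (this is not the scib implementation; the function name is made up):

import numpy as np
from sklearn.linear_model import LinearRegression

def pc_regression_sketch(X_pca, covariate):
    # Fraction of the embedding's total variance explained by the covariate:
    # a variance-weighted sum of per-PC R^2 values.
    covariate = np.asarray(covariate, dtype=float).reshape(-1, 1)
    pc_var = X_pca.var(axis=0)
    r2 = np.array([
        LinearRegression().fit(covariate, X_pca[:, i]).score(covariate, X_pca[:, i])
        for i in range(X_pca.shape[1])
    ])
    return float(np.sum(r2 * pc_var) / pc_var.sum())

If the "before" and "after" sides of such a comparison saw the same principal components, the two variance fractions would match and the score would be 1.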

Related: openproblems-bio/openproblems#706

mumichae (Collaborator) commented Dec 2, 2022

Hi Scott,
this is likely related to the fact that when the PCA is recomputed, it is done per batch. So essentially, the PC regression will be computed on different principal components than if you were to use the globally computed PCA.

for batch in batches:

@LuckyMD I think we decided on per-batch computation of the PCA. Is this still the behaviour we want? If so, it might make sense to change the default to encourage PCA recomputation, or to remove the reuse of existing PC components altogether, to avoid confusion.
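
To spell out the difference (a rough, illustrative sketch of the per-batch comparison, not the actual scib code; the function and variable names are made up):

import numpy as np
import scanpy as sc

def _pcr(X_pca, covariate):
    # Variance-weighted sum of per-PC R^2 against the covariate.
    pc_var = X_pca.var(axis=0)
    r2 = np.array([np.corrcoef(covariate, X_pca[:, i])[0, 1] ** 2
                   for i in range(X_pca.shape[1])])
    return float(np.sum(r2 * pc_var) / pc_var.sum())

def per_batch_cc_comparison_sketch(adata_pre, adata_post, batch_key, embed, cc_score):
    # cc_score: a precomputed per-cell cell-cycle score (e.g. an S or G2M score).
    # Assumes every batch has more cells than the default number of PCs.
    scores = []
    for batch in adata_pre.obs[batch_key].unique():
        mask = (adata_pre.obs[batch_key] == batch).values
        # "before": PCA recomputed on the raw data of this batch only
        pca_before = sc.tl.pca(adata_pre[mask], copy=True).obsm["X_pca"]
        pcr_before = _pcr(pca_before, cc_score[mask])
        # "after": the provided embedding, subset to the same batch
        pcr_after = _pcr(adata_post.obsm[embed][mask], cc_score[mask])
        scores.append(max(0.0, 1 - abs(pcr_after - pcr_before) / pcr_before))
    return float(np.mean(scores))

The "before" side always uses per-batch principal components, so an embedding built from per-batch PCAs can score (near-)perfectly while a single global PCA of the same data does not.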

scottgigante-immunai (Contributor, Author)

Oh! That makes sense. It's a bit of a funny result, then, that the "perfect embedding" here is one in which the batches are embedded with PCA separately and then smashed together, while an embedding that keeps the raw data as-is performs relatively poorly... but at least from the openproblems perspective there is a simple solution here.

scottgigante-immunai (Contributor, Author)

>>> import numpy as np
>>> import scanpy as sc
>>> adata = sc.datasets.paul15()
>>> adata.obsm["X_emb"] = np.zeros((adata.shape[0], 50), dtype=float)
>>> for batch in adata.obs["paul15_clusters"].unique():
...     batch_idx = adata.obs["paul15_clusters"] == batch
...     n_comps = min(50, np.sum(batch_idx))
...     solver = "full" if n_comps == np.sum(batch_idx) else "arpack"
...     adata.obsm["X_emb"][batch_idx,:n_comps] = sc.tl.pca(adata[batch_idx], n_comps=n_comps, use_highly_variable=False, svd_solver=solver, copy=True).obsm["X_pca"]
... 
>>> from scib.metrics import cell_cycle
>>> cell_cycle(adata, adata, "paul15_clusters", embed="X_emb")
0.9999998942341607

LuckyMD (Collaborator) commented Dec 4, 2022

Hah... yes! This still makes sense, I think. We don't want the PCA to capture batch differences in the CC score. I had forgotten this completely! Thanks so much for highlighting this, @mumichae!
