Add cell cycle score baseline #706
…m:scottgigante-immunai/openproblems into batch_integration_embed/baseline/cc_score
Not sure this will get a perfect score, as the metric checks whether the variance contribution per batch has changed relative to the unintegrated data. This will have more variance contribution and will therefore perform poorly again. I suggest we just add "unintegrated" as a baseline. |
As mentioned, I don't think this will get a score of 1 as it increases CC variance contribution (I think we also penalize for that). Also, should organism be added on the level of data loader or dataset function?
Really good catch that this wasn't defined in the API though!
Valid point, but unintegrated is already there and performs poorly. I'll
look at the paper and figure out what this should look like.
Re: loaders, I don't know if we want to enforce this on all loaders yet
(but we could!)
…On Mon, 28 Nov 2022, 7:22 am MalteDLuecken, ***@***.***> wrote:
***@***.**** requested changes on this pull request.
As mentioned, I don't think this will get a score of 1 as it increases CC
variance contribution (I think we also penalize for that). Also, should
organism be added on the level of data loader or dataset function?
|
Okay, unintegrated shouldn't be bad... by definition this should work. Maybe @danielStrobl can take a look at this? |
We don't have to enforce this on all loaders yet, but it does seem to me that it's part of adding a dataset to Open Problems. Ideally, if someone adds a dataset, it should be reusable by other tasks without someone having to research what the organism actually is. If we use CELLxGENE more frequently in future, this field will also exist there. We could even borrow from their schema here if we want. Looks like I'm arguing myself into the position of "it should be used by all data loaders"... hmm... maybe do this in steps? First for these datasets, then for the rest in future. |
Ah, I see. By definition here:
unintegrated should get a perfect score... if you look at https://openproblems-nbt2022-reproducibility.netlify.app/results/batch_integration_embed/ it gets a score of 0.75, compared to scanorama, harmony, combat that score closer to 0.9. |
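To make the "by definition" argument concrete, here is a minimal sketch of how I understand the score to be scaled: per batch, the cell cycle variance contribution after integration is compared to the one before, as `1 - |after - before| / before`, floored at 0, then averaged over batches. The formula, the function names and the numbers are all assumptions for illustration. This is not the scIB code, and it skips the PCA and principal component regression that produce the variance contributions.

```python
from statistics import mean


def cc_conservation(before, after):
    """Score one batch from its cell cycle variance contributions.

    Assumed scaling: 1 - |after - before| / before, floored at 0.
    """
    score = 1 - abs(after - before) / before
    return max(score, 0.0)


def cc_score(before_by_batch, after_by_batch):
    """Average the per-batch scores (hypothetical helper)."""
    return mean(
        cc_conservation(before_by_batch[b], after_by_batch[b])
        for b in before_by_batch
    )


# Made-up variance contributions per batch, for illustration only.
before = {"batch1": 0.20, "batch2": 0.10}

# Unintegrated data: "after" equals "before", so every batch is unchanged.
unintegrated = cc_score(before, before)

# A method that inflates the cell cycle variance contribution by 50%.
inflated = cc_score(before, {"batch1": 0.30, "batch2": 0.15})

print(unintegrated, inflated)
```

Under these assumptions the unintegrated data scores exactly 1, and any change in variance contribution is penalized whether it goes up or down, which is why a baseline that adds cell cycle variance would not be expected to reach a perfect score.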
I agree. Ultimately it should be something that every dataset defines, but we don't need it right now. |
I think something is wrong with the
if I call directly from scib:
or if I compute the PCA myself:
|
I don't see why recomputing should be the issue tbh... As long as it's using the embedding. I think we decided to recompute |
It scores 1.0 if you don't give it an embedding (i.e., it computes PCA on |
New approach -- simply passing the raw data gives a perfect score, so let's just do that. In theory passing X_uni_pca should be right, but pending a fix from scIB, this will work. |
@danielStrobl I assume Malte is gone by now. Mind reviewing this for me? |
Codecov Report
Base: 95.04% // Head: 95.04% // Decreases project coverage by
Additional details and impacted files
@@ Coverage Diff @@
## main #706 +/- ##
==========================================
- Coverage 95.04% 95.04% -0.01%
==========================================
Files 154 154
Lines 4073 4093 +20
Branches 207 207
==========================================
+ Hits 3871 3890 +19
- Misses 131 132 +1
Partials 71 71
Achieves perfect performance per theislab/scib#351 (comment) |
cc_score was not well captured by existing baselines. This method should get a perfect score.