-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RuntimeError when creating custom reference #23
Comments
This is a use case I did not anticipate. I am assuming that the problem is that CS20230722_CLUS_001 does not appear in the region for which you are constructing a taxonomy. The way the code is set up, the nodes on leaf level of the taxonomy tree (in our case the CLUS_NNNN nodes) end up pointing to lists of the cells associated with that leaf node. If you create the taxonomy tree with Unless you want to re-engineer What I don't know for sure (and would appreciate your feedback on) is: what is the right way to handle this? My inclination is to add a step to the taxonomy construction code that trims the tree, removing all cell types that have zero cells assigned to them (effectively pretending that cell types which occur outside of your region of interest do not exist). Is this acceptable to you, or do you need/want those cell types to persist in the taxonomy*? *if they did persist in the taxonomy, I don't think that anything would ever get mapped to them, as they would all have NULL mean expression profiles. Let me know what you think of this rough sketch of a solution. |
By the way: I am excited to see what comes of this analysis. Though I didn't anticipate this particular use case, that was just a deficiency in my imagination. What you are proposing ought to be possible in a perfect world. |
Thank you for the detailed explanation and proposed solutions @danielsf ! I appreciate the clarification regarding the root cause of the issue. I will give re-engineering the cluster specific CSVs a shot. Since the re-engineering the CSV files has constraints and potential problems, I find the second solution to trim trees of cell types with zero cells to be practical and efficient. However, I have some concerns about the potential implications of trimming the tree, particularly in scenarios such as:
To mitigate these concerns, is a hybrid approach where the cell types with zero counts are flagged and excluded rather than permanently removed? This would provide the ability to retain the complete taxonomy structure for future use while avoiding immediate computational issues. Please let me know your thoughts on this approach. Thank you for your assistance! |
Flagging and excluding empty cell types would take a lot more coding to implement, since several steps in the pipeline involve iterating over the taxonomy and all of its nodes. Each of those steps would have to be altered to check for the validity of the cell type. In principle, we could just fill in the leaf level of the taxonomy with empty lists of cells wherever necessary. I think that will bog down the code, though, as the marker gene finder tries to find markers to distinguish between cell types with no data assigned to them. It would also undermine the advantage of doing a region-specific taxonomy (which, I assume, is finding marker genes that are targeted for discriminating only between cell types that exist in the chosen region), since the taxonomy would still appear to have all ~ 5,000 cell type clusters from the full Whole Mouse Brain data release. This branch implements the solution I originally proposed (it had the virtue of only taking a few hours to code up). If you add If you want a reference taxonomy that includes all of the nodes, you could do something like
The form of the taxonomy tree as serialized to Does this solution work for you? |
Thanks @danielsf for the quick implementation! I will give this solution a try. I am curious to know if one would expect to see major differences going this route when compared to building a single region specific h5ad file and using the |
Building a region-specific h5ad file and calling
The problem you've run into with So: if you feel comfortable assembling a region-specific h5ad file in which the cell type annotations are in the |
Incidentally, this Jupyter notebook constructs an h5ad file with cartoon data and a cartoon taxonomy and then walks through all of the steps in the process of mapping data onto that cartoon taxonomy, starting with |
Thank you so much for the quick clarification. I was able to complete the single h5ad route to use |
The new branch ( |
Hi @danielsf, After successful creation of region specific multiple chemistry files, I am running into an error that I am a bit confused about. I am running Not supplying the |
There is a subtlety that I obviously failed to capture in the documentation.
When you created the multiple precomputed_stats files, you should have seen files like
The Sorry for the confusion. I'll try to put together an example notebook demonstrating this process (probably after SfN is past). Quick note: the example notebook I pointed you towards intentionally sets |
Thank you for the clarification! The errors and the description had me stumped. I am hopeful about the process running without any more errors! If you are open to it, I am also happy to contribute to the documentation to showcase the region specific reference file setup as an example. |
@danielsf - Is there a threshold for minimum cells to be included in the reference files to lead to meangingful analysis? Another error I ran into which took a long time to debug was due to the sparse amount of cells in one of the chemistry files contributing to the region specific reference file. The But when I move on to the mapping step using |
Good debugging. I don't have an off-the-cuff answer to the question that you have actually asked. I've run this code on taxonomies that contain leaf nodes with a few dozen cells in them and the code gave somewhat reasonable results. What ultimately matters is the statistical distribution of genes in the cells you do have. Can you quickly assess the number of cells in the least populous cluster? Since a variance needs to be calculated, a cluster with 2 or fewer cells in it will put some zeros in unfortunate (but not necessarily fatal) places. Note: as long as each cluster has a sensible (again, ~ a dozen) cell in at least one of the chemistry files, your results should be fine. The final step in mapping only considers the most populous chemistry file for each cluster. On to the question you didn't ask: I've made a quick bugfix to the code in the |
And thanks for the catch. |
Sorry. Doubling back this morning to add a unit test for yesterday afternoon's bugfix, I discovered that I hadn't actually fixed the problem. I just pushed another commit that actually will allow the code to run over your data. Sorry about that. One note: in the course of constructing the test, I got a better understanding of this failure mode. The error you saw means that, for the offending |
Thank you for the quick fix and for fixing the fix :) You are right in your assessment that the reference file with sparse data for this attempt had too few cells to contribute (< 10). But I also anticipate the same problem to creep up if one of the chemistry files has skewed representation of cell types for example. So the solution to combine both pre-check for the cell contribution per chemistry file as well as the fix for the chunk size seems reasonable to me. |
@danielsf - I was able to successfully run both versions of the cell type mapping pipelines i.e However, the results are not accurate. For concordance with a previous attempt, I ran the same sample set using the online MapMyCells which identified all major neuronal and non-neuronal cell types in the dataset. The online version of the tool uses v1.3.1 of the codebase. My attempt with the Since the online version uses a pre-calculated marker lookup table, I also ran the Both pipelines print a long list of warning messages to the console of type: By selecting only subset of the regions, are any assumptions for marker gene selection and/or mapping being broken ? I am not sure if the difference in code version is a contributing factor for the big change in results, but if v1.3.1 of the code is available with |
My initial thought is that using your downsampled cell_metadata.csv file to find markers means there are too few cells in the various cell types to find statistically significant marker genes. I don't have a hard-and-fast number representing what I mean by "too few." One thing to try would be to construct a cell_metadata.csv file as follows:
This way, you are still mapping to a taxonomy that is limited to the cell types represented in your region of interest, but you are using the full statistical power of the whole brain data to find the marker genes for discriminating between them. I guess it is possible that there is no quantitative difference between what I just described and what you are already doing (maybe the cluster_alias values from (1) above are already well-segregated into your region of interest). Is there an appreciable difference between the number of cells in the modified cell_metadata.csv that I am proposing and what you are already using? |
I think what I hypothesized above might be correct. Without knowing exactly how you are slicing the data, I sliced the cell_metadata.csv file by region_of_interest_acronym. Below is what I found. The "just cells in roi" line lists how many cells have that
|
This seems very much plausible. I went back to check the unique values for cluster_alias and the cell count distribution. In one of my region files, ~28% of cluster_alias has less than 10 cells assigned to it which is why a large number of them came back empty. This leads to two questions:
If I move forward with your suggestion to add power from borrowing cells from the whole brain set to the identified cell types, would I essentially subset all the expression matrices to only contains cells of interest and re-run the entire pipeline on it ? Or would it be easier to just rebuild a single expression matrix for each chemistry type and then run the pipeline as usual ? |
To answer your questions
Marker selection in this code repository is not really concerned with individual cell types. It is concerned with pairs of cell types. We are not asking "what genes uniquely define cell type A in a vacuum." We are asking "what genes are good at distinguishing between cell type A and cell type B" for all possible values of A and B. Thus, if cell type A is well sampled (say, a few thousand cells) but cell type B is poorly sampled (only 5 cells), the code likely won't be able to find markers for that specific (A, B) pair, which will cause problems in the mapping. You can imagine an extreme case in which your taxonomy has 6 cell types in it, but only two of them are well-sampled. The only marker genes you will find will be markers that are useful at distinguishing between those two types and the mapping will fail hopelessly for cells of any other type (and will probably try to assign all of the cells to one of those well sampled types... I haven't really thought about how this failure would play out). The specific marker-finding algorithm used by the codebase is documented on this page.
I am not really qualified to answer this question. My role in principally as a combination data scientist and software engineer. The question you are asking should probably be asked of the scientists who derived the cell type taxonomy. You can ask on the Allen Community Forum (probably under this topic). Warning: I suspect most of the people who can answer are about to travel to the annual Society for Neuroscience meeting, so the response may not come until next week.
You will need to re-run the whole pipeline (because you will need a new |
Thank you @danielsf for the detailed answers! I think I was specifically thrown off by the lack of distinguishing markers genes for taxonomy levels that had ample data to support it as well as the actual marker genes that were listed which were all non-specific. When I run the same dataset via the online tool, I get a rather believable list of markers for each of class and subclass levels. I also verifed the expression levels of the top markers in my query dataset. I will follow up with a post on the Allen Community Forum about the actual taxonomy values. I hesitate to change the reference set without proper context about why some taxonomy values were left in regions that are typically not associated with them. |
Hello @danielsf,
Thank you for putting together this wonderful tool. For our analysis workflow, we would like to be able to create region specific reference based on the ABC WMB taxonomy. I followed the instructions to create the
precomputed_stats.h5
file from a ABC release as documented here.The inputs for the function include all of the original taxonomy CSV files, the region specific cell metadata file and region specific list of h5ad files. However, I am running into a RuntimeError raised by the validate_taxonomy_tree function.
RuntimeError: node CS20230722_CLUS_0001 (child of CCN20230722_SUPT:CS20230722_SUPT_0001 -- type <class 'str'>) is not present in the keys at level CCN20230722_CLUS e.g. of keys ['CS20230722_CLUS_0326', 'CS20230722_CLUS_0328', 'CS20230722_CLUS_0033', 'CS20230722_CLUS_0038', 'CS20230722_CLUS_0044']
I separately confirmed that the CSV file includes all the parent-child relationships and that level CCN20230722_CLUS does include the node CS20230722_CLUS_0001. I am a bit confused about the setup since the above function is getting the taxonomy tree input from the h5ad file which for the ABC releases are devoid of taxonomy information. Is there a work around for this error ? Thank you for your help!
The text was updated successfully, but these errors were encountered: