diff --git a/jupyter-book/air_repertoire/clonotype.ipynb b/jupyter-book/air_repertoire/clonotype.ipynb index aeece56e..213c5e1e 100644 --- a/jupyter-book/air_repertoire/clonotype.ipynb +++ b/jupyter-book/air_repertoire/clonotype.ipynb @@ -51,7 +51,7 @@ "\n", ":::{warning}\n", "Scirpy changed the format of [its datastructure](https://scirpy.scverse.org/en/latest/data-structure.html#storing-airr-rearrangement-data-in-anndata)\n", - "with v0.13. While the overall anlaysis workflow has not changed, some outputs shown in this chapter might not be accurate anymore. \n", + "with v0.13. While the overall analysis workflow has not changed, some outputs shown in this chapter might not be accurate anymore. \n", "\n", "See [the scirpy release notes](https://scirpy.scverse.org/en/latest/changelog.html#v0-13-0-new-data-structure-based-on-awkward-arrays) for more details about this change. \n", "Until we update this chapter, please also refer to the [official scirpy documentation](https://scirpy.scverse.org).\n", @@ -314,7 +314,7 @@ "id": "c1cf919e", "metadata": {}, "source": [ - "Once the identity between T-cells is obtained for V(D)J CDR3, it is time to define the cluster of cells corresponding to one specific clonotype. A clonotype will be a set of cells with identical sequences, considering the parameters used in the previous step. However, it is possible to define clonotypes as a set of cells with just identical VJ or just identical VDJ sequences. Furthermore, it is possible definding the clonotypes by comparing either or both pairs of VJ or VDJ sequences.\n", + "Once the identity between T-cells is obtained for V(D)J CDR3, it is time to define the cluster of cells corresponding to one specific clonotype. A clonotype will be a set of cells with identical sequences, considering the parameters used in the previous step. However, it is possible to define clonotypes as a set of cells with just identical VJ or just identical VDJ sequences. 
Furthermore, it is possible to define the clonotypes by comparing either or both pairs of VJ or VDJ sequences.\n", "\n", "The set of parameters to define clonotypes should be the same as used previously. In our case, the sequences of amino acids must be compared using identity as a metric. In addition, we are setting the additional parameters to define clonotypes if the V(D)J are identical using the most abundant pair as the target sequence." ] @@ -1678,7 +1678,7 @@ "source": [ "So far, we have shown all the analyses you can perform to characterize the T-cell receptor repertoire, including the cell clones' identification and expansion. Besides, the representation in both, cell clusters and biological samples. Furthermore, the sequence motif for V(D)J gene segments, which is highlighted by interpretation of gene usage and spectratype results.\n", "\n", - "Those methods could be applied to characterize B-cell receptors as well {cite}`gupta2015change`. However, over the lifetime of B-cells mutual mutations occur in the V gene segment helping the low-affinity receptors to aquire a high affinity phenotype. This process is known as **affinity maturation**, and the high rate of mutual mutations (~10000 more than germline cells) is called **somatic hypermutation** {cite}`papavasiliou2002somatic`. Therefore, the clonotype definition for B-cells should take this phenomenon into account. One way to deal with this is through distance-based clonotype analysis.\n", + "Those methods could be applied to characterize B-cell receptors as well {cite}`gupta2015change`. However, over the lifetime of B-cells mutual mutations occur in the V gene segment helping the low-affinity receptors to acquire a high affinity phenotype. This process is known as **affinity maturation**, and the high rate of mutual mutations (~10000 more than germline cells) is called **somatic hypermutation** {cite}`papavasiliou2002somatic`.
Therefore, the clonotype definition for B-cells should take this phenomenon into account. One way to deal with this is through distance-based clonotype analysis.\n", "\n", "Here, we use **Dandelion**, a python library focused on BCR analysis which interoperates with *Scanpy* and *Scirpy* and provides a BCR distance-based method for clone definition, which is explained below in more detail {cite}`stephenson2021single`." ] @@ -2240,7 +2240,7 @@ "id": "018129e2", "metadata": {}, "source": [ - "Compared to *Scirpy*, the clonotypes visualization in *Dandelion* does not show their sizes (number of cells). This process should be done separately, i.e., first, it is necessary to calculate the size of the clones and transfer this information to the *annData* object to performe the visualization via *Scanpy*." + "Compared to *Scirpy*, the clonotypes visualization in *Dandelion* does not show their sizes (number of cells). This process should be done separately, i.e., first, it is necessary to calculate the size of the clones and transfer this information to the *annData* object to perform the visualization via *Scanpy*." ] }, { @@ -2393,7 +2393,7 @@ "id": "9837bcde", "metadata": {}, "source": [ - "As you can appreciate before, **IGHV3-48** and **IGHV1-18** were the gene segments consistently more abundants in comparison to the rest of the segments in the plot, providing evidence of strong V gene preferiantial usage for the samples analyzed here.\n", + "As you can appreciate before, **IGHV3-48** and **IGHV1-18** were the gene segments consistently more abundant in comparison to the rest of the segments in the plot, providing evidence of strong V gene preferential usage for the samples analyzed here.\n", "\n", "The previous analysis can be improved by just adding information for the visualization. For example, let us see if those privilege V segments are shared between isotypes." 
] @@ -2488,7 +2488,7 @@ "\n", "We have identified key expanded clonotypes and the isotype they represented. In addition, we can explore spectratype to observe the dominance in terms of sequence length. As well as in the previous analysis, we discarded the multi-chain cells, and we conserved those clonotypes whose sizes were higher than 50 cells to keep the analysis consistency.\n", "\n", - "The plot below shown an interesting behaviour, despite the clear spectratype dominance reflected in our previous TCR analysis. Here, two squence lengths rased, the first and the most dominant conformed by sequences of 23 aminoacids, and the second one composed by 15 aminoacids." + "The plot below shows an interesting behaviour, despite the clear spectratype dominance reflected in our previous TCR analysis. Here, two sequence lengths arose, the first and the most dominant conformed by sequences of 23 aminoacids, and the second one composed by 15 aminoacids." ] }, { @@ -2586,7 +2586,7 @@ "\n", "![](../_static/images/air_repertoire/bcr_logo_motif.svg)\n", "\n", - "On the other hand, we analyzed the same V gene segments for the V(D)J chain but with a sequence lentgh of 23 aminoacids." + "On the other hand, we analyzed the same V gene segments for the V(D)J chain but with a sequence length of 23 aminoacids."
] }, { diff --git a/jupyter-book/air_repertoire/ir_profiling.ipynb b/jupyter-book/air_repertoire/ir_profiling.ipynb index 247c90b8..1fa8e53a 100644 --- a/jupyter-book/air_repertoire/ir_profiling.ipynb +++ b/jupyter-book/air_repertoire/ir_profiling.ipynb @@ -23,7 +23,7 @@ "- **Fc receptors**: Epitope-antibody complex\n", "- **Cytokine receptors**: Cytokines\n", "- **B-cell receptor (BCRs)**: Epitopes\n", - "- **T-cell recpetors (TCRs)**: Linear epitopes bound to the Major Histocompatibility Complex (MHC)" + "- **T-cell receptors (TCRs)**: Linear epitopes bound to the Major Histocompatibility Complex (MHC)" ] }, { @@ -82,11 +82,11 @@ "\n", "- **Fluorescence Activated Cell Sorting (FACS)** is a method based on flow cytometry with the power to label the cells of interest based on fluorescent probes over the raw cell suspension. A cell suspension is carried by a rapidly flowing stream of liquid. This stream of cells is broken up into individual droplets through a vibrating mechanism. Just before the stream breaks into droplets, the flow passes through a fluorescence measuring station where the fluorescence signal of every cell is measured. The droplets can be further charged for further separations.\n", "\n", - "- **Magnetic-Activated Cell Sorting (MACS)** can use antibodies, enzymes, lectins, or strepavidins attached to a magnetic bead to label the target cells. Once the cells are labeled from the raw suspension, a magnetic field is applied to attract the magnetic beads and discard the remaining cells from the suspensions. The targeted cells are collected once the magnetic field is turned off. 
One advantage of this method is the capacity to collect targeted cells with no specific markers to be labeled, in that case, a cocktail of markers is used to label the untargeted cells, and the cells of interests are collected by washing them out once the magnetic field captures the untargeted cells conjugated to a magnetic bead.\n", + "- **Magnetic-Activated Cell Sorting (MACS)** can use antibodies, enzymes, lectins, or streptavidins attached to a magnetic bead to label the target cells. Once the cells are labeled from the raw suspension, a magnetic field is applied to attract the magnetic beads and discard the remaining cells from the suspensions. The targeted cells are collected once the magnetic field is turned off. One advantage of this method is the capacity to collect targeted cells with no specific markers to be labeled, in that case, a cocktail of markers is used to label the untargeted cells, and the cells of interests are collected by washing them out once the magnetic field captures the untargeted cells conjugated to a magnetic bead.\n", "\n", - "- **Laser Capture Microdissection (LCM)** has the power to extract cell populations or single cells from microscope preparations without detriment of the surrounding tissue. The components to perform LCM includes a reverse micrsocope, a laser control unit, a microscope joy stick to plate stabilization, a CCD camera, and a color monitor. The idea behind LCM consists on labelling cells by visual detection of morphological characteristics of target cells, the plate is immobilized and the laser pulse melts the thin thermoplastic film removing the cells or cells of interest without any damage to the surrounding tissue.\n", + "- **Laser Capture Microdissection (LCM)** has the power to extract cell populations or single cells from microscope preparations without detriment of the surrounding tissue. 
The components to perform LCM includes a reverse microscope, a laser control unit, a microscope joy stick to plate stabilization, a CCD camera, and a color monitor. The idea behind LCM consists on labelling cells by visual detection of morphological characteristics of target cells, the plate is immobilized and the laser pulse melts the thin thermoplastic film removing the cells or cells of interest without any damage to the surrounding tissue.\n", "\n", - "- **Microfluidics** is a versatile method able to work with small quantities of raw suspension even at the order of nanoliters. There are different kinds of microfluidic approaches including cell-affinity chromatographu based microfluidics, physical characteristics of cell based microfluidics, immunomagnetics beads based microfluidics, and separation by dielectric properties of some cell-types based microfluifdics. The most used microfluidics based method is the chromatographic separation using a chip assay as stationary phase which is modified to include the necessary antibodies to capture the target cells in the mobile phase. After the buffer flows off from the chip, a solution is used to separate the cells attached to the antibodies to collect them for further analysis {cite}`hu2016single`.\n", + "- **Microfluidics** is a versatile method able to work with small quantities of raw suspension even at the order of nanoliters. There are different kinds of microfluidic approaches including cell-affinity chromatography based microfluidics, physical characteristics of cell based microfluidics, immunomagnetics beads based microfluidics, and separation by dielectric properties of some cell-types based microfluidics. The most used microfluidics based method is the chromatographic separation using a chip assay as stationary phase which is modified to include the necessary antibodies to capture the target cells in the mobile phase.
After the buffer flows off from the chip, a solution is used to separate the cells attached to the antibodies to collect them for further analysis {cite}`hu2016single`.\n", "\n", "### Immune receptor sequencing\n", "\n", @@ -195,7 +195,7 @@ "metadata": {}, "source": [ "## Load data\n", - "In this tutorial we will mainly use two python packages for loading, cell-level ordering, and visualiation:\n", + "In this tutorial we will mainly use two python packages for loading, cell-level ordering, and visualization:\n", "- **Scanpy**: general package for single cell analysis (https://github.com/theislab/scanpy, {cite}`wolf2018scanpy`)\n", "- **Scirpy**: scanpy extension for immune receptor analysis (https://github.com/scverse/scirpy, {cite}`sturm2020scirpy`)\n", "\n", @@ -203,7 +203,7 @@ "\n", ":::{warning}\n", "Scirpy changed the format of [its datastructure](https://scirpy.scverse.org/en/latest/data-structure.html#storing-airr-rearrangement-data-in-anndata)\n", - "with v0.13. While the overall anlaysis workflow has not changed, some outputs shown in this chapter might not be accurate anymore. \n", + "with v0.13. While the overall analysis workflow has not changed, some outputs shown in this chapter might not be accurate anymore. \n", "\n", "See [the scirpy release notes](https://scirpy.scverse.org/en/latest/changelog.html#v0-13-0-new-data-structure-based-on-awkward-arrays) for more details about this change. \n", "Until we update this chapter, please also refer to the [official scirpy documentation](https://scirpy.scverse.org).\n", @@ -502,7 +502,7 @@ "- **barcode**: tag of the cell the contig was measured from\n", "- **is_cell**: indicates whether the barcode is associated with a cell\n", "- **high_confidence**: confidence of the measurement being a IR\n", - "- **chain**: chain of the IR (e.g. TRA: T Cell Receptor α-chain, IGH: Immuneglobulin Heavy chain)\n", + "- **chain**: chain of the IR (e.g. 
TRA: T Cell Receptor α-chain, IGH: Immunoglobulin Heavy chain)\n", "- **{v,d,j,c}_gene**: gene used to form the specific segment of the IR\n", "- **full_length**: whether the full IR was captured (see below)\n", "- **productive**: whether the IR is productive (see below)\n", @@ -662,7 +662,7 @@ "id": "a4b42743", "metadata": {}, "source": [ - "Example 2: Contigs express full length but there is not identifieable CDR3." + "Example 2: Contigs express full length but there is not identifiable CDR3." ] }, { @@ -1421,7 +1421,7 @@ "id": "e2633249", "metadata": {}, "source": [ - "Notice, that the patient-level information is not automatically added here. Let's add them by loading the raw data, alligning them on a cell level and indexing them by their barcode. " + "Notice, that the patient-level information is not automatically added here. Let's add them by loading the raw data, aligning them on a cell level and indexing them by their barcode. " ] }, { diff --git a/jupyter-book/air_repertoire/multimodal_integration.ipynb b/jupyter-book/air_repertoire/multimodal_integration.ipynb index 1493e91f..f3213777 100644 --- a/jupyter-book/air_repertoire/multimodal_integration.ipynb +++ b/jupyter-book/air_repertoire/multimodal_integration.ipynb @@ -29,7 +29,7 @@ "\n", ":::{warning}\n", "Scirpy changed the format of [its datastructure](https://scirpy.scverse.org/en/latest/data-structure.html#storing-airr-rearrangement-data-in-anndata)\n", - "with v0.13. While the overall anlaysis workflow has not changed, some outputs shown in this chapter might not be accurate anymore. \n", + "with v0.13. While the overall analysis workflow has not changed, some outputs shown in this chapter might not be accurate anymore. \n", "\n", "See [the scirpy release notes](https://scirpy.scverse.org/en/latest/changelog.html#v0-13-0-new-data-structure-based-on-awkward-arrays) for more details about this change. 
\n", "Until we update this chapter, please also refer to the [official scirpy documentation](https://scirpy.scverse.org).\n", @@ -147,7 +147,7 @@ "id": "3eb46cbd", "metadata": {}, "source": [ - "To get an overview, we will plot the data with cluster assigment as a UMAP visualisation." + "To get an overview, we will plot the data with cluster assignment as a UMAP visualisation." ] }, { @@ -264,7 +264,7 @@ "id": "50cda721", "metadata": {}, "source": [ - "**Leiden Clusters**: Above we clustered the gene expression data via the Leiden algorithm. We can now use the resulting groups to performane any kind of sequence analysis, e.g. spectratyping. For that, we simply define the group parameter (here: color) with the column name of the Leiden clustering (\"leiden\")." + "**Leiden Clusters**: Above we clustered the gene expression data via the Leiden algorithm. We can now use the resulting groups to perform any kind of sequence analysis, e.g. spectratyping. For that, we simply define the group parameter (here: color) with the column name of the Leiden clustering (\"leiden\")." ] }, { @@ -591,7 +591,7 @@ "\n", "Zhang et al. developed TCR functional landscape estimation supervised with scRNA-seq analysis (TESSA) {cite}`zhang2021mapping`, which aims at embedding and clustering T cell clones based on their TCR sequence and transcriptome via Bayesian modelling. The CDR3β sequence is first compressed to a 30-dimensional numeric representation using a pretrained autoencoder. Following, the dimensions are upweighted to correlate the TCR representation with the gene expression of similar TCR-groups, thereby assigning importance of TCR position for explaining the cells' gene expression.
In an interactive process, weights and groups are updated until convergence to reach a maximal alignment between both modalities.\n", "\n", - "TESSA produced clusters of high purity when embedding T-cells with known epitope specificity from {cite}`10x2019new`, surpasing the uni-modal model GLIPH {cite}`glanville2017identifying`, which is commonly used for clustering TCR sequences. Further, cluster centrallity was indicative for higher avidity clones shown by clonal expansion and high ADT counts. Using TESSA on data from {cite}`yost2019clonal`, the author detected novel clusters of responder T cell in patients undergoing PD-1 blockade. \n", + "TESSA produced clusters of high purity when embedding T-cells with known epitope specificity from {cite}`10x2019new`, surpassing the uni-modal model GLIPH {cite}`glanville2017identifying`, which is commonly used for clustering TCR sequences. Further, cluster centrality was indicative for higher avidity clones shown by clonal expansion and high ADT counts. Using TESSA on data from {cite}`yost2019clonal`, the author detected novel clusters of responder T cell in patients undergoing PD-1 blockade. \n", "\n", "The code and instructions for installation can be found [here](https://github.com/jcao89757/TESSA)." ] @@ -1032,7 +1032,7 @@ "id": "d68ce520", "metadata": {}, "source": [ - "In the next step, we will create the command for running TESSA by specifying the environmnet and adding the settings." + "In the next step, we will create the command for running TESSA by specifying the environment and adding the settings." ] }, { @@ -1195,7 +1195,7 @@ "id": "ad57da7e", "metadata": {}, "source": [ - "For plotting the cluster assignemnt in the GEX space, we add the assignment to the original adata object. When plotting, we can see that TESSA clusters also share similar phenotypes at GEX level." + "For plotting the cluster assignment in the GEX space, we add the assignment to the original adata object. 
When plotting, we can see that TESSA clusters also share similar phenotypes at GEX level." ] }, { @@ -1238,7 +1238,7 @@ "id": "9ecb259a", "metadata": {}, "source": [ - "In both visualisations, we can see that the clusters are related on a TCR and GEX level. However, the full clustering can not directly observed in a single modality. Due to similar TCR and GEX profile, these cells might be specific to the same epitopes. Examplatory, we could use this annotation for DEG analysis between cell networks as described above." + "In both visualisations, we can see that the clusters are related on a TCR and GEX level. However, the full clustering can not directly observed in a single modality. Due to similar TCR and GEX profile, these cells might be specific to the same epitopes. As an example, we could use this annotation for DEG analysis between cell networks as described above." ] }, { @@ -1256,7 +1256,7 @@ "id": "f42926cf", "metadata": {}, "source": [ - "##### Data Preperation\n", + "##### Data Preparation\n", "\n", "CoNGA conveniently supports several input types for the gene expression matrix including h5ad-files. For the clonotype information, we need to create a clone_file first by running a preprocessing script on the Cell Ranger output." ] @@ -1426,7 +1426,7 @@ "id": "0ee3486c", "metadata": {}, "source": [ - "The AnnData object requires a clonotype assignement (see chapter {ref}`air:sequence`). Since mvTCR uses both chains of the primary receptor, we will use this information for defining the clonotype as well. # todo => use the one provided" + "The AnnData object requires a clonotype assignment (see chapter {ref}`air:sequence`). Since mvTCR uses both chains of the primary receptor, we will use this information for defining the clonotype as well. # todo => use the one provided" ] }, { @@ -1465,7 +1465,7 @@ "id": "99d33391", "metadata": {}, "source": [ - "The model requires a numeric encoding of the CDR3α and CDR3β chain. 
For that, we need to provide the columns containing the amino acid sequences. Additional, the pad attribute indicates the maximal length of the sequence. If we do not want to embedd aditional data afterwards, we can set it to the maximal sequence length." + "The model requires a numeric encoding of the CDR3α and CDR3β chain. For that, we need to provide the columns containing the amino acid sequences. Additional, the pad attribute indicates the maximal length of the sequence. If we do not want to embed additional data afterwards, we can set it to the maximal sequence length." ] }, { @@ -2268,7 +2268,7 @@ "source": [ "##### Output\n", "\n", - "We now recieve several output files, which are described in more detail in the authors GitHub repository. In the following, we will examine:\n", + "We now receive several output files, which are described in more detail in the authors GitHub repository. In the following, we will examine:\n", "- **connectionplot.pdf**: Visualisation of the derived clustering\n", "- **clone_annotation.csv**: cluster assignment by cell" ] @@ -2313,7 +2313,7 @@ "id": "ac5a4e04", "metadata": {}, "source": [ - "The connection plot prowides us with a visualisation of the graph build by Benisse. Each clonotype is represented by a node in the learned embedding space. These clustering is informed by BCR as well as gene expression similarity." + "The connection plot provides us with a visualisation of the graph build by Benisse. Each clonotype is represented by a node in the learned embedding space. These clustering is informed by BCR as well as gene expression similarity." 
] }, { @@ -2398,13 +2398,13 @@ "What information provides us the AIR sequence, that is not directly captured in GEX?\n", "- A count matrix between cells and antibody-tagged epitope bindings.\n", "- The IR sequence can be used for demultiplexing between different donors.\n", - "- \\+ The cell's clonotype and, thereby, cell ancestory is defined by the AIR sequence.\n", + "- \\+ The cell's clonotype and, thereby, cell ancestry is defined by the AIR sequence.\n", "- \\+ The AIR sequence determines specificity and is therefor a barcode for recognizing the same epitope.\n", "\n", "On what premise rely multi-modal integration approaches?\n", "- \\+ Cells of same or alike AIRs often have a similar phenotype.\n", "- Information of AIR and GEX provide orthogonal information to each other, since they are independent.\n", - "- Knowledge is transfered between large gene expression datasets into which AIR data can be mapped.\n", + "- Knowledge is transferred between large gene expression datasets into which AIR data can be mapped.\n", "- Each cell occurs clonally expanded and thereby provides multiple gene profiles." ] }, diff --git a/jupyter-book/air_repertoire/specificity.ipynb b/jupyter-book/air_repertoire/specificity.ipynb index db76ca46..679f26b0 100644 --- a/jupyter-book/air_repertoire/specificity.ipynb +++ b/jupyter-book/air_repertoire/specificity.ipynb @@ -27,7 +27,7 @@ "- **Clustering and distances**: {cite}`glanville2017identifying` showed, that IRs with similar receptors have common specificity. This property has been used in multiple approaches for comparing AIRs with distance metrics and unsupervised clustering.\n", "- **Epitope prediction**: Recently, several machine-learning methods were developed that directly predict binding between AIRs and a target. In theory, these methods could be used to directly assign specificity to the AIRs involved in single-cell studies.\n", "\n", - "However, all three approaches have major pitfalls. 
The amount of samples in the public databases is severly biased towards diseases and use cases that are commonly researched. Examplatory, this leads to known bindings for only several 100 epitopes sequences for TCRs in the major public databases. Further, a majority amount of samples in these databases does not provide the full AIR sequence (V-, (D-,) and J-genes and CDRs for both chains), but rather focuses on the CDR3, while often only reporting VJ or VDJ sequences." + "However, all three approaches have major pitfalls. The amount of samples in the public databases is severely biased towards diseases and use cases that are commonly researched. Examplatory, this leads to known bindings for only several 100 epitopes sequences for TCRs in the major public databases. Further, a majority amount of samples in these databases does not provide the full AIR sequence (V-, (D-,) and J-genes and CDRs for both chains), but rather focuses on the CDR3, while often only reporting VJ or VDJ sequences." ] }, { @@ -45,7 +45,7 @@ "\n", ":::{warning}\n", "Scirpy changed the format of [its datastructure](https://scirpy.scverse.org/en/latest/data-structure.html#storing-airr-rearrangement-data-in-anndata)\n", - "with v0.13. While the overall anlaysis workflow has not changed, some outputs shown in this chapter might not be accurate anymore. \n", + "with v0.13. While the overall analysis workflow has not changed, some outputs shown in this chapter might not be accurate anymore. \n", "\n", "See [the scirpy release notes](https://scirpy.scverse.org/en/latest/changelog.html#v0-13-0-new-data-structure-based-on-awkward-arrays) for more details about this change. \n", "Until we update this chapter, please also refer to the [official scirpy documentation](https://scirpy.scverse.org).\n", @@ -110,7 +110,7 @@ "metadata": {}, "source": [ "### Database Queries\n", - "Here, we will search for TCRs with specificity annotatation from previous studies. 
Common large-scale databases with TCR-epitope pairs are:\n", + "Here, we will search for TCRs with specificity annotation from previous studies. Common large-scale databases with TCR-epitope pairs are:\n", "\n", "- [IEDB](https://www.iedb.org/) {cite}`fleri2017immune`\n", "- [vdjDB](https://vdjdb.cdr3.net/) {cite}`shugay2018vdjdb`\n", @@ -893,11 +893,11 @@ "To detect cells with shared specificity, we can calculate pairwise sequence distances between their TCRs. This can be either used to group cells within a dataset or to increase the amount of hits, we receive from database queries. For sequence distances, there are generally three different approaches:\n", "\n", "- **Edit distances**: calculate the cost of transforming the first into the second sequence.\n", - "- **k-mer matching**: compares the occurence of short motifs of length k between two sequences.\n", + "- **k-mer matching**: compares the occurrence of short motifs of length k between two sequences.\n", "- **Embeddings**: the sequence is embedded into a numeric representation (e.g. via deep learning).\n", "\n", "Note, that these approaches have not been independently benchmarked. We will therefore focus here on two selected distance metrics:\n", - "- **TCRdist**: this commonly used metric uses the sequences of all CDRs and compares them via transformation cost and gap panelties {cite}`dash2017quantifiable`. The costs are based on the BLOSUM matrix, which indicates the probabilities of substituting one amino acid against another {cite}`henikoff1992amino`. By incorporating the full sequence, accuracy is most likely increased as compared to other approaches, but limits its use, when only a subset of information is provided.\n", + "- **TCRdist**: this commonly used metric uses the sequences of all CDRs and compares them via transformation cost and gap penalties {cite}`dash2017quantifiable`. 
The costs are based on the BLOSUM matrix, which indicates the probabilities of substituting one amino acid against another {cite}`henikoff1992amino`. By incorporating the full sequence, accuracy is most likely increased as compared to other approaches, but limits its use, when only a subset of information is provided.\n", "- **TCRmatch**: this novel metric uses all k-mers to compare the overlap in motifs between two TCRs based on their CDR3β sequences {cite}`chronister2021tcrmatch`. It can therefore, also be utilized on most databases, that mainly contain this information. It is also conveniently integrated into IEDB." ] }, @@ -1560,7 +1560,7 @@ "\n", "- **-i**: the query data, here: our input file\n", "- **-t**: amount of cores used for calculation\n", - "- **-d**: the reference data, either a database or our input file (for pairwise matches)- **-s**: treshold for considering a match, where 0 is no similarity and 1 is perfect match. Here we use the threshold of medium confidence 0.9. Alternative, you can use the more stricter threshold of 0.97 for high confidence binding. Note, that less stringent cutoffs also result in higher computation times." + "- **-d**: the reference data, either a database or our input file (for pairwise matches)- **-s**: threshold for considering a match, where 0 is no similarity and 1 is perfect match. Here we use the threshold of medium confidence 0.9. Alternative, you can use the more stricter threshold of 0.97 for high confidence binding. Note, that less stringent cutoffs also result in higher computation times." ] }, { @@ -2919,7 +2919,7 @@ "id": "2e1ec167", "metadata": {}, "source": [ - "All AIRs in the database are tested against SARS-CoV-2 epitopes. To make the result more clearer to view, we will annotate the binding to include only SARS-CoV-2 and None. This is however heavily dependend on your specific research question." + "All AIRs in the database are tested against SARS-CoV-2 epitopes. 
To make the result more clearer to view, we will annotate the binding to include only SARS-CoV-2 and None. This is however heavily dependent on your specific research question." ] }, { @@ -3107,7 +3107,7 @@ "source": [ "#### Query via Hamming Distance\n", "\n", - "Due to Somantic Hypermutation, BCRs of the same lineage often differ with mutations on one position, with deletion and addition of amino acids being more unlikely. We will therefore query the database using a Hamming distance, which will mark CDR3 with a single mutation. You can set the threshold for calculating matches during distance calculation. However, we advise using conservative thresholds to limit the amount of false positives. " + "Due to Somatic Hypermutation, BCRs of the same lineage often differ with mutations on one position, with deletion and addition of amino acids being more unlikely. We will therefore query the database using a Hamming distance, which will mark CDR3 with a single mutation. You can set the threshold for calculating matches during distance calculation. However, we advise using conservative thresholds to limit the amount of false positives. " ] }, { @@ -3179,7 +3179,7 @@ "source": [ "### Distance Measurements\n", "\n", - "Contrary to TCRs, a BCR clonotype may contain different sequences due mutations stemming from somantic hypermutation (see chapter 02_clonotypes). The BCRs within this clonotype often target the same epitope with varying strength due to affinity maturation. We therefore already explained distance based clustering in the chapter . \n", + "Contrary to TCRs, a BCR clonotype may contain different sequences due mutations stemming from somatic hypermutation (see chapter 02_clonotypes). The BCRs within this clonotype often target the same epitope with varying strength due to affinity maturation. We therefore already explained distance based clustering in the chapter . \n", " \n", "However, several, different clonotype lineages can share their specificity. 
While these clonotypes are not ancestrally related, they might be related by similar BCR sequences. Recently, methods were developed to compare Antibody sequences based on shared specificity, which can also be applied to BCRs. Since they often rely on structural information (often from prediction) applying these methods is not feasible for large single-cell studies and are, therefore, not included in this tutorial." ] @@ -3191,7 +3191,7 @@ "source": [ "### Prediction\n", "\n", - "While TCRs bind linear peptides constrained by their binding to the MHC, BCRs can bind to linear or non-continoues antigens formed from proteins and polysacherides. This unconstraint binding is highly dependent on the three-dimensional structure of BCRs and antigens. There are several prediction tools for antibodies/BCRs focusing on identifying paratopes (binding residues in the AB) or epitopes (binding residues in the antigen). Further, deep learning is used for AB structure prediction, design, optimization, and docking prediction, which often rely on (inferred) spatial structure implying high computational costs. However, these models rather focus on the application of AB development for therapeutics than the analysis of large-scale single cell studies and are therefore out of the scope for this tutorial." + "While TCRs bind linear peptides constrained by their binding to the MHC, BCRs can bind to linear or non-continoues antigens formed from proteins and polysaccharides. This unconstrained binding is highly dependent on the three-dimensional structure of BCRs and antigens. There are several prediction tools for antibodies/BCRs focusing on identifying paratopes (binding residues in the AB) or epitopes (binding residues in the antigen). Further, deep learning is used for AB structure prediction, design, optimization, and docking prediction, which often rely on (inferred) spatial structure implying high computational costs. 
However, these models rather focus on the application of AB development for therapeutics than the analysis of large-scale single cell studies and are therefore out of the scope for this tutorial." ] }, { @@ -3201,8 +3201,8 @@ "source": [ "## Key Takeaways\n", "- The AIR sequence determines the epitope-specificity of the cell. Cells with similar AIR sequence, bind to the same antigen.\n", - "- Specificity can be inferred via Database queries, AIR comparisson, or prediction.\n", - "- Most approaches are not idependently benchmarked and should be used with some caution and additional validation." + "- Specificity can be inferred via Database queries, AIR comparison, or prediction.\n", + "- Most approaches are not independently benchmarked and should be used with some caution and additional validation." ] }, { diff --git a/jupyter-book/cellular_structure/annotation.ipynb b/jupyter-book/cellular_structure/annotation.ipynb index 9c4b32d6..97d65f8a 100644 --- a/jupyter-book/cellular_structure/annotation.ipynb +++ b/jupyter-book/cellular_structure/annotation.ipynb @@ -358,7 +358,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Now perform PCA. We use the highly deviant genes (set as \"highly variable\" above) to reduce noise and strenghten signal in our data and set number of components to the default n=50. 50 is on the high side for data of a single sample, but it will ensure that we don't ignore important variation in our data." + "Now perform PCA. We use the highly deviant genes (set as \"highly variable\" above) to reduce noise and strengthen signal in our data and set number of components to the default n=50. 50 is on the high side for data of a single sample, but it will ensure that we don't ignore important variation in our data." ] }, { @@ -1146,7 +1146,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The abovementioned points highlight possible disadvantages of using classifiers, depending on the training data and model type. 
Nonetheless, there are several important advantages of using pre-trained classifiers to annotate your data. First, it is a fast and and easy way to annotate your data. The annotation does not require the downloading nor preprocessing of the training data and sometimes merely involves the upload of your data to an online webpage. Second, these methods don't rely on a partitioning of your data into clusters, as the manual annotation does. Third, pre-trained classifiers enable you to directly leverage the knowledge and information from previous studies, such as a high quality annotation. And finally, using such classifiers can help with harmonizing cell-type definitions across a field, thereby clearing the path towards a field-wide consensus on these definitions. " + "The aforementioned points highlight possible disadvantages of using classifiers, depending on the training data and model type. Nonetheless, there are several important advantages of using pre-trained classifiers to annotate your data. First, it is a fast and and easy way to annotate your data. The annotation does not require the downloading nor preprocessing of the training data and sometimes merely involves the upload of your data to an online webpage. Second, these methods don't rely on a partitioning of your data into clusters, as the manual annotation does. Third, pre-trained classifiers enable you to directly leverage the knowledge and information from previous studies, such as a high quality annotation. And finally, using such classifiers can help with harmonizing cell-type definitions across a field, thereby clearing the path towards a field-wide consensus on these definitions. " ] }, { @@ -1758,7 +1758,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "To highlight this, let's look at a marker for the eryhtroid lineage: hemoglobulin B. 
Most likely the cells annotated as \"Tcm/Naive helper T\" (already flagged as possibly wrongly annotated based on the dendrogram above) are from the erythroid lineage!" + "To highlight this, let's look at a marker for the erythroid lineage: hemoglobulin B. Most likely the cells annotated as \"Tcm/Naive helper T\" (already flagged as possibly wrongly annotated based on the dendrogram above) are from the erythroid lineage!" ] }, { diff --git a/jupyter-book/cellular_structure/clustering.ipynb b/jupyter-book/cellular_structure/clustering.ipynb index 47551a3a..cbdc3105 100644 --- a/jupyter-book/cellular_structure/clustering.ipynb +++ b/jupyter-book/cellular_structure/clustering.ipynb @@ -202,9 +202,9 @@ "id": "9d50ca54-9e43-4f12-bfc4-eeccdf0740c9", "metadata": {}, "source": [ - "We now clearly inspect the impact of different resolutions on the clustering result. For a resolution of 0.25, the clustering is much coarser and the algorthm detected fewer communities. Additionally, clustered regions are less dense compared to the clustering obtained at a resolution of 1.0. \n", + "We now clearly inspect the impact of different resolutions on the clustering result. For a resolution of 0.25, the clustering is much coarser and the algorithm detected fewer communities. Additionally, clustered regions are less dense compared to the clustering obtained at a resolution of 1.0. \n", "\n", - "We would like to highlight again that distances between the displayed clusters must be interpreted with caution. As the UMAP embedding is in 2D, distances are not necessarily captured well between all points. We recommend to not interprete distances between clusters visualized on UMAP embeddings." + "We would like to highlight again that distances between the displayed clusters must be interpreted with caution. As the UMAP embedding is in 2D, distances are not necessarily captured well between all points. We recommend to not interpret distances between clusters visualized on UMAP embeddings." 
] }, { diff --git a/jupyter-book/chromatin_accessibility/gene_regulatory_networks_atac.ipynb b/jupyter-book/chromatin_accessibility/gene_regulatory_networks_atac.ipynb index 0cdf64d3..7ddd4ea0 100644 --- a/jupyter-book/chromatin_accessibility/gene_regulatory_networks_atac.ipynb +++ b/jupyter-book/chromatin_accessibility/gene_regulatory_networks_atac.ipynb @@ -610,7 +610,7 @@ "source": [ "This analysis through cisTopic allows us to retrieve and interpret two specific values, as line plots:\n", "- A curve showing the main log-likelihood calculated by cisTopic, versus the number of topics. A plateau of this curve number indicates that additional topics are not main factors to explain the data, versus previous topics.\n", - "- The first derivative of likelihood versus number of declared topics. This visualization is useful to assess convergence in the likelihood value, and random oscilations around zero, as the number of topics increases." + "- The first derivative of likelihood versus number of declared topics. This visualization is useful to assess convergence in the likelihood value, and random oscillations around zero, as the number of topics increases." ] }, { diff --git a/jupyter-book/chromatin_accessibility/introduction.ipynb b/jupyter-book/chromatin_accessibility/introduction.ipynb index f3622873..14e80a51 100644 --- a/jupyter-book/chromatin_accessibility/introduction.ipynb +++ b/jupyter-book/chromatin_accessibility/introduction.ipynb @@ -17,7 +17,7 @@ "source": [ "## Motivation\n", "\n", - "Every cell of an organism shares the same DNA with the same set of functional units referred to as genes. With this in mind, what determines the tremendous diversity of cells reaching from natural killer cells of the immune system to neurons transmitting electrochemical signals throughout the body? In the previous chapters, we saw that cell identity and function can be inferred from gene expression profiles in each cell. 
The control of gene expression is driven by a complex interplay of regulatory mechanisms such as DNA methylation, histone modifications, and transcription factor activity. {term}`Chromatin` accessibility largely reflects the combined regulatory state of a cell, serving as an orthogonal layer of information to mRNA levels describing cell identity. Furthermore, exploring the chromain accessibility profile enables additional insights into gene regulatory mechanisms and cell differentiation processes that might not be captured by scRNA-seq data." + "Every cell of an organism shares the same DNA with the same set of functional units referred to as genes. With this in mind, what determines the tremendous diversity of cells reaching from natural killer cells of the immune system to neurons transmitting electrochemical signals throughout the body? In the previous chapters, we saw that cell identity and function can be inferred from gene expression profiles in each cell. The control of gene expression is driven by a complex interplay of regulatory mechanisms such as DNA methylation, histone modifications, and transcription factor activity. {term}`Chromatin` accessibility largely reflects the combined regulatory state of a cell, serving as an orthogonal layer of information to mRNA levels describing cell identity. Furthermore, exploring the chromatin accessibility profile enables additional insights into gene regulatory mechanisms and cell differentiation processes that might not be captured by scRNA-seq data." ] }, { @@ -45,13 +45,13 @@ "source": [ "As depicted above, chromatin accessibility is influenced by higher-order structure down to low-level DNA modifications. **(1)** Chromatin scaffolding driven by scaffold/matrix attachment regions (S/MARs) and proteins in the nuclear periphery such as nuclear pore complexes (NPCs) or lamins influences chromatin compactness and gene expression {cite}`atac:narwade_mapping_2019, atac:buchwalter_coaching_2019`. 
**(2, 3)** More local accessibility often referred to as densly packed heterochromatin versus open euchromatin can be actively controlled by ATP-dependent and ATP-independent chromatin remodeling complexes and histone modifications such as acetylation, methylation and phosphorylation. **(4)** Also the binding of transcription factors can influence nucleosome positioning and lead to the recruitment of histone-modifying enzymes and chromatin remodelers. **(5)** On a DNA level, methylation of CpG sites influences the binding affinity of various proteins including transcription factors and histone-modifying enzymes which combined leads to the silencing of the corresponding genomic regions. For an animated visualization we also recommend [this 2 minute video](https://www.youtube.com/watch?v=XelGO582s4U) on epigenetics and the regulation of gene activity (credits to Nicole Ethen from the SQE, University of Illinois). For a comprehensive and up-to-date review on genome regulation and TF activity, we refer to {cite}`atac:isbel_generating_2022`.\n", "\n", - "Taken together, an essential component defining cell identity is the regulatory state of each cell. In this chapter, we focus on chomatin accessibility data measured by the **Single-Cell Assay for Transposase-Accessible Chromatin with High-Throughput Sequencing (scATAC-seq)** or as part of the **10x Multiome assay (scATAC combined with scRNA-seq)**. \n", + "Taken together, an essential component defining cell identity is the regulatory state of each cell. In this chapter, we focus on chromatin accessibility data measured by the **Single-Cell Assay for Transposase-Accessible Chromatin with High-Throughput Sequencing (scATAC-seq)** or as part of the **10x Multiome assay (scATAC combined with scRNA-seq)**. 
\n", "\n", "After walking you through the preprocessing steps this analysis will allow us to:\n", "1) characterize cell identity with an orthogonal approach to scRNA-seq analysis\n", "2) identify cell state specific transcriptional regulators\n", "3) link gene expression to sequence features\n", - "4) disentagle epigenetic mechanisms driving cell differentiation and disease states\n" + "4) disentangle epigenetic mechanisms driving cell differentiation and disease states\n" ] }, { @@ -110,7 +110,7 @@ "id": "ca71d10e-b3d4-4de9-bcbd-31cd4ff13a85", "metadata": {}, "source": [ - "## Overview of the data analyis workflow" + "## Overview of the data analysis workflow" ] }, { diff --git a/jupyter-book/chromatin_accessibility/muon_to_seurat.ipynb b/jupyter-book/chromatin_accessibility/muon_to_seurat.ipynb index be5466d7..5147a0ea 100644 --- a/jupyter-book/chromatin_accessibility/muon_to_seurat.ipynb +++ b/jupyter-book/chromatin_accessibility/muon_to_seurat.ipynb @@ -13,7 +13,7 @@ "id": "e9f41665-b267-45fb-a910-74777ea955f5", "metadata": {}, "source": [ - "As we saw in the introduction, there are some downstream analyses that are currently only available in R. In this notebook, we therfore want to showcase how you can convert a muon object into a Seurat object for downstream analysis using Signac." + "As we saw in the introduction, there are some downstream analyses that are currently only available in R. In this notebook, we therefore want to showcase how you can convert a muon object into a Seurat object for downstream analysis using Signac." ] }, { @@ -79,7 +79,7 @@ "id": "db95bc97-3855-43ce-8cab-c900a4f0c786", "metadata": {}, "source": [ - "With Signac, we have the possibility to visualize the coverage of the features or perform other analysis like footprinting. 
For some of these downstream analysis, it is necessary to add the fragment file to the Seurat object which we specifiy here with `fragmen_file`" + "With Signac, we have the possibility to visualize the coverage of the features or perform other analysis like footprinting. For some of these downstream analysis, it is necessary to add the fragment file to the Seurat object which we specify here with `fragmen_file`" ] }, { @@ -167,7 +167,7 @@ "id": "2450fa73-ba9f-4dc9-8d3d-d1e43699cbc8", "metadata": {}, "source": [ - "We have succesfully loaded the muon object into R and it has been converted to a Seurat object with three assays `atac`, `rna` and `gene_activity`. The dimensionality reductions and metadata have also been transferred. " + "We have successfully loaded the muon object into R and it has been converted to a Seurat object with three assays `atac`, `rna` and `gene_activity`. The dimensionality reductions and metadata have also been transferred. " ] }, { @@ -288,7 +288,7 @@ "id": "c6d2c985-d508-4eb1-8cfb-a96cb9acdf5b", "metadata": {}, "source": [ - "We verfiy that the correct embedding and clustering has been added using `DimPlot`." + "We verify that the correct embedding and clustering has been added using `DimPlot`." ] }, { @@ -348,7 +348,7 @@ "id": "f5c1be57-b12f-4920-a7cc-b3e08fab9620", "metadata": {}, "source": [ - "Loading using `MuDataSeurat` does not support featues with the same name in different assays (the same gene name for `rna` and `gene_activity`), which is why we had to add the assay name to the feature with `assay:feature`. Additionally, the `atac` assay is still a regular Seurat `Assay` and not a Signac `ChromatinAssay` which we need for chromatin specific downstream analyses. " + "Loading using `MuDataSeurat` does not support features with the same name in different assays (the same gene name for `rna` and `gene_activity`), which is why we had to add the assay name to the feature with `assay:feature`. 
Additionally, the `atac` assay is still a regular Seurat `Assay` and not a Signac `ChromatinAssay` which we need for chromatin specific downstream analyses. " ] }, { diff --git a/jupyter-book/chromatin_accessibility/quality_control.ipynb b/jupyter-book/chromatin_accessibility/quality_control.ipynb index 14fd31f8..d87883dd 100644 --- a/jupyter-book/chromatin_accessibility/quality_control.ipynb +++ b/jupyter-book/chromatin_accessibility/quality_control.ipynb @@ -41,7 +41,7 @@ "id": "f6d73a6a-e370-4cfe-aae1-fdcab47e82e7", "metadata": {}, "source": [ - "To showcase the processing of scATAC-seq data, we use a 10x Multiome data set generated for the single cell data integration challenge at the NeurIPS conference 2021 {cite}`atacqc:luecken2021sandbox`. Note that this data set containes multiple samples, making feature harmonization and integration important to consider before analysing them jointly (discussed in later chapters). However, the most unbiased quality assessment can be derived by examining each sample individually. Therefore, we describe the preprocessing of one selected sample in this notebook.\n", + "To showcase the processing of scATAC-seq data, we use a 10x Multiome data set generated for the single cell data integration challenge at the NeurIPS conference 2021 {cite}`atacqc:luecken2021sandbox`. Note that this data set contains multiple samples, making feature harmonization and integration important to consider before analysing them jointly (discussed in later chapters). However, the most unbiased quality assessment can be derived by examining each sample individually. Therefore, we describe the preprocessing of one selected sample in this notebook.\n", "\n", "Our starting point is the output of `cellranger-arc`, the software solution of 10x to perform alignment, peak calling and initial QC of their 10x Multiome assay. By default, the output files contain the snRNA-seq and the scATAC-seq data. 
Since the preprocessing of scRNA-seq or snRNA-seq data has been described extensively in previous chapters, here, we only discuss the processing of the chromatin accessibility data (which are also applicable to data from a unimodal scATAC-seq assay).\n" ] @@ -948,7 +948,7 @@ "id": "06cb1456-1d07-44fe-89d4-80c92e58c397", "metadata": {}, "source": [ - "Let us now plot the scores we derived from the two appraoches." + "Let us now plot the scores we derived from the two approaches." ] }, { @@ -1002,7 +1002,7 @@ "- **total_fragment_counts**: Total number of fragments per cell representing cellular sequencing depth. This metric is analogous to the number of total counts in scRNA-seq data.\n", "- **tss_enrichment**: Transcription start site (TSS) enrichment score, which is the ratio of fragments centered at the TSS to fragments in TSS-flanking regions. This metric can be interpreted as a signal-to noise ratio of each cell.\n", "- **n_features_per_cell**: The number of peaks with non-zero counts in each cell. This metric is analogous to the number of genes detected in scRNA-seq data.\n", - "- **nucleosome_signal**: The nucleosome signal refers to the ratio of mono-nucleosomal to nucloesome-free fragments and can also be interpreted as a signal-to-noise ratio in each cell (more details below).\n", + "- **nucleosome_signal**: The nucleosome signal refers to the ratio of mono-nucleosomal to nucleosome-free fragments and can also be interpreted as a signal-to-noise ratio in each cell (more details below).\n", "\n", "Additional metrics that can be considered:\n", "- **reads_in_peaks_frac:** The fraction of fragments in peak regions versus fragments outside of peaks. Similar to the TSS score, this is an indicator for the signal-to-noise ratio.\n", @@ -1962,7 +1962,7 @@ "id": "253df1e9-1ddc-4480-8c5b-219c28b59b2d", "metadata": {}, "source": [ - "To ensure we keep a version of the raw counts we save them as a seperate layer." 
+ "To ensure we keep a version of the raw counts we save them as a separate layer." ] }, { diff --git a/jupyter-book/chromatin_accessibility/resources/celltype_markers.ipynb b/jupyter-book/chromatin_accessibility/resources/celltype_markers.ipynb index fa23c349..3386384e 100644 --- a/jupyter-book/chromatin_accessibility/resources/celltype_markers.ipynb +++ b/jupyter-book/chromatin_accessibility/resources/celltype_markers.ipynb @@ -5,7 +5,7 @@ "id": "6fef2d89-99b9-4c83-9e38-e02fd30a0b88", "metadata": {}, "source": [ - "# Markers for cluster annotations acros modalities" + "# Markers for cluster annotations across modalities" ] }, { diff --git a/jupyter-book/conditions/compositional.ipynb b/jupyter-book/conditions/compositional.ipynb index 13d25f9b..01fdc806 100644 --- a/jupyter-book/conditions/compositional.ipynb +++ b/jupyter-book/conditions/compositional.ipynb @@ -700,7 +700,7 @@ } }, "source": [ - "[scCODA](https://sccoda.readthedocs.io/en/latest) belongs to the family of tools that require pre-defined clusters, most commony cell types, to statistically derive changes in composition. Inspired by methods for compositional analysis of microbiome data, scCODA proposes a Bayesian approach to address the low replicate issue as commonly encountered in single-cell analysis{cite}`Büttner2021`. It models cell-type counts using a hierarchical Dirichlet-Multinomial model, which accounts for uncertainty in cell-type proportions and the negative correlative bias via joint modeling of all measured cell-type proportions. To ensure a uniquely identifiable solution and easy interpretability, the reference in scCODA is chosen to be a specific cell type. Hence, any detected compositional changes by scCODA always have to be viewed in relation to the selected reference." + "[scCODA](https://sccoda.readthedocs.io/en/latest) belongs to the family of tools that require pre-defined clusters, most commonly cell types, to statistically derive changes in composition. 
Inspired by methods for compositional analysis of microbiome data, scCODA proposes a Bayesian approach to address the low replicate issue as commonly encountered in single-cell analysis{cite}`Büttner2021`. It models cell-type counts using a hierarchical Dirichlet-Multinomial model, which accounts for uncertainty in cell-type proportions and the negative correlative bias via joint modeling of all measured cell-type proportions. To ensure a uniquely identifiable solution and easy interpretability, the reference in scCODA is chosen to be a specific cell type. Hence, any detected compositional changes by scCODA always have to be viewed in relation to the selected reference." ] }, { @@ -712,7 +712,7 @@ } }, "source": [ - "However, scCODA assumes a log-linear relationship between covariates and cell abundance, which may not always reflect the underlying biological processes when using continuoaus covariates. A further limitation of scCODA is the inability to infer correlation structures among cell compositions beyond compositional effects. Furthermore, scCODA only models shifts in mean abundance, but does not detect changes in response variability{cite}`Büttner2021`." + "However, scCODA assumes a log-linear relationship between covariates and cell abundance, which may not always reflect the underlying biological processes when using continuous covariates. A further limitation of scCODA is the inability to infer correlation structures among cell compositions beyond compositional effects. Furthermore, scCODA only models shifts in mean abundance, but does not detect changes in response variability{cite}`Büttner2021`." ] }, { @@ -824,9 +824,9 @@ } }, "source": [ - "The boxplots highlight some differences in the distributions of the cell types. Clearly noticable is the high proportion of enterocytes for the Salmonella condition. But other cell types such as transit-amplifying (TA) cells also show stark differences in abundance for the Salmonella condition compared to control. 
Whether any of these differences are statistically significant has to be properly evaluated.\n", + "The boxplots highlight some differences in the distributions of the cell types. Clearly noticeable is the high proportion of enterocytes for the Salmonella condition. But other cell types such as transit-amplifying (TA) cells also show stark differences in abundance for the Salmonella condition compared to control. Whether any of these differences are statistically significant has to be properly evaluated.\n", "\n", - "An alternative visualization is a stacked barplot as provided by scCODA. This visualization nicely displays the characteristics of compositional data: If we compare the Control and Salmonella groups, we can see that the proportion of Enterocytes greatly increases in the infected mice. Since the data is proportional, this leads to a decreased share of all other cell types to fulfil the sum-to-one constraint." + "An alternative visualization is a stacked barplot as provided by scCODA. This visualization nicely displays the characteristics of compositional data: If we compare the Control and Salmonella groups, we can see that the proportion of Enterocytes greatly increases in the infected mice. Since the data is proportional, this leads to a decreased share of all other cell types to fulfill the sum-to-one constraint." ] }, { @@ -3044,13 +3044,13 @@ "source": [ "It is not always possible or practical to use precisely labeled clusters such as cell-type definitions, especially when we are interested in studying transitional states between cell type clusters, such as during developmental processes, or when we expect only a subpopulation of a cell type to be affected by the condition of interest. In such cases, determining compositional changes based on known annotations may not be appropriate. 
\n", "\n", - "A set of methods exist to detect compositional changes occuring in subpopulations of cells smaller than the cell type clusters, usually defined starting from a k-nearest neighbor (KNN) graph computed from similarities in the same low dimensional space used for clustering. \n", + "A set of methods exist to detect compositional changes occurring in subpopulations of cells smaller than the cell type clusters, usually defined starting from a k-nearest neighbor (KNN) graph computed from similarities in the same low dimensional space used for clustering. \n", "\n", "- DA-seq computes, for each cell, a score based on the relative prevalence of cells from both biological states in the cell’s neighborhood, using a range of k values{cite}`Zhao2021`. The scores are used as input for a logistic classifier to predict the biological condition of each cell. \n", "- Milo assigns cells to partially overlapping neighborhoods on the KNN graph, then differential abundance (DA) testing is performed modelling cell counts with a generalized linear model (GLM) {cite}`Dann2022`. \n", "- MELD calculates a relative likelihood estimate of observing each cell in every condition using graph-based density estimate{cite}`Burkhardt2021`. \n", "\n", - "These methods have unique strenghts and weaknesses. Because it relies on logistic classification, DA-seq is designed for pairwise comparisons between two biological conditions, but can't be applied to test for differences associated with a continuous covariate (such as age or timepoints). DA-seq and Milo use the variance in the abundance statistic between replicate samples of the same condition to estimate the significance of the differential abundance, while MELD doesn't use this information. 
While considering consistency across replicates reduces the number of false positives driven by one or a few samples, all KNN-based methods are sensitive to a loss of information if the conditions of interest and confounders, defined by technical or experimental sources of variation, are strongly correlated. The impact of confounders can be mitigated using batch integration methods before KNN graph construction and/or incorporating the confounding covariates in the model for DA testing, as we discuss further in the example below. Another limitation of KNN-based methods to bare in mind is that cells in a neighborhood may not necessarily represent a specific, unique biological subpopulation, because a cellular state may span over multiple neighborhoods. Reducing k for the KNN graph or constructing a graph on cells from a particular lineage of interest can help mitigate this issue and ensure the predicted effects are robust to the choice of parameters and to the data subset used{cite}`Dann2022`. \n", + "These methods have unique strengths and weaknesses. Because it relies on logistic classification, DA-seq is designed for pairwise comparisons between two biological conditions, but can't be applied to test for differences associated with a continuous covariate (such as age or timepoints). DA-seq and Milo use the variance in the abundance statistic between replicate samples of the same condition to estimate the significance of the differential abundance, while MELD doesn't use this information. While considering consistency across replicates reduces the number of false positives driven by one or a few samples, all KNN-based methods are sensitive to a loss of information if the conditions of interest and confounders, defined by technical or experimental sources of variation, are strongly correlated. 
The impact of confounders can be mitigated using batch integration methods before KNN graph construction and/or incorporating the confounding covariates in the model for DA testing, as we discuss further in the example below. Another limitation of KNN-based methods to bare in mind is that cells in a neighborhood may not necessarily represent a specific, unique biological subpopulation, because a cellular state may span over multiple neighborhoods. Reducing k for the KNN graph or constructing a graph on cells from a particular lineage of interest can help mitigate this issue and ensure the predicted effects are robust to the choice of parameters and to the data subset used{cite}`Dann2022`. \n", "\n", "Generally, if large differences are apparent in large clusters by visualization or the imbalances between cell types are of interest, direct analysis with cell-type aware methods, such as scCODA, might be more suitable. KNN-based methods are more powerful when we are interested in differences in cell abundances that might appear in transitional states between cell types or in a specific subset of cells of a given cell type." ] @@ -4314,7 +4314,7 @@ } }, "source": [ - "Interestingly the DA test on the neighbourhoods detects an enrichment upon infection in Tuft cells and in a subset of goblet cells. We can characterize the difference between cell type subpopulations enriched upon infection by examining the mean gene expression profiles of cells in neighbourhoods. For example, if we take the neighbourhoods of Goblet cells, we can see that neighbourhoods enriched upon infection display a higher expression of Retnlb, which is a gene implicated in anti-parassitic immunity {cite}`comp:Haber2017`. " + "Interestingly the DA test on the neighbourhoods detects an enrichment upon infection in Tuft cells and in a subset of goblet cells. 
We can characterize the difference between cell type subpopulations enriched upon infection by examining the mean gene expression profiles of cells in neighbourhoods. For example, if we take the neighbourhoods of Goblet cells, we can see that neighbourhoods enriched upon infection display a higher expression of Retnlb, which is a gene implicated in anti-parasitic immunity {cite}`comp:Haber2017`. " ] }, { diff --git a/jupyter-book/conditions/gsea_pathway.ipynb b/jupyter-book/conditions/gsea_pathway.ipynb index bc51beb9..f62716bf 100644 --- a/jupyter-book/conditions/gsea_pathway.ipynb +++ b/jupyter-book/conditions/gsea_pathway.ipynb @@ -54,7 +54,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Gene set tests can be *competitive* or *self-contained* as defined by Goeman and Buhlmann (2007) {cite}`goeman2007analyzing`. Competitive gene set testing tests whether the genes in the set are highly ranked in terms of differential expression relative to the genes not in the set. The sampling unit here is genes, so the test can be done with a single sample (i.e. single-sample GSEA). The test requires genes that are not in the set (i.e background genes). In self-contained gene set testing, the sampling unit is the subject, so multiple samples per group are required, but it is not required to have genes that are not present in the set. A self-contained gene set test tests whether genes in the test set are differentially expressed without regard to any other gene measured in the dataset. These distinctions between the two null hypotheses make differences to the interpretation of gene set enrichment results. Note that in biological data there exist inter-gene correlations, that is the expression of genes in the same pathways are correlated. There are only a few tests that accomodate inter-gene correlations. We will discuss these methods later. 
Detailed explanations on various gene set tests can be found in [*limma* user manual](https://bioconductor.org/packages/release/bioc/manuals/limma/man/limma.pdf)." + "Gene set tests can be *competitive* or *self-contained* as defined by Goeman and Buhlmann (2007) {cite}`goeman2007analyzing`. Competitive gene set testing tests whether the genes in the set are highly ranked in terms of differential expression relative to the genes not in the set. The sampling unit here is genes, so the test can be done with a single sample (i.e. single-sample GSEA). The test requires genes that are not in the set (i.e background genes). In self-contained gene set testing, the sampling unit is the subject, so multiple samples per group are required, but it is not required to have genes that are not present in the set. A self-contained gene set test tests whether genes in the test set are differentially expressed without regard to any other gene measured in the dataset. These distinctions between the two null hypotheses make differences to the interpretation of gene set enrichment results. Note that in biological data there exist inter-gene correlations, that is the expression of genes in the same pathways are correlated. There are only a few tests that accommodate inter-gene correlations. We will discuss these methods later. Detailed explanations on various gene set tests can be found in [*limma* user manual](https://bioconductor.org/packages/release/bioc/manuals/limma/man/limma.pdf)." ] }, { @@ -69,7 +69,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "In scRNA-seq data analysis, gene set enrichment is generally carried out on clusters of cells or cell types, one-at-a-time. Genes differentially expressed in a cluster or cell type are used to identify over-represented gene sets from the selected collection, using simple hypergeomtric tests or Fisher's exact test (as in *Enrichr* {cite}`chen2013enrichr`), for example. 
Such tests do not require the actual gene expression measurements and read counts to compute enrichment statistics, as they rely on testing how significant it is that an $X$ number of genes in a gene set are differentially expressed in the experiment compared to the number of non-DE genes in the set.\n", + "In scRNA-seq data analysis, gene set enrichment is generally carried out on clusters of cells or cell types, one-at-a-time. Genes differentially expressed in a cluster or cell type are used to identify over-represented gene sets from the selected collection, using simple hypergeometric tests or Fisher's exact test (as in *Enrichr* {cite}`chen2013enrichr`), for example. Such tests do not require the actual gene expression measurements and read counts to compute enrichment statistics, as they rely on testing how significant it is that an $X$ number of genes in a gene set are differentially expressed in the experiment compared to the number of non-DE genes in the set.\n", "\n", "*fgsea* {cite}`korotkevich2021fast` is a more common tool for gene set enrichment test. *fgsae* is a computationally faster implementation of the well established *Gene Set Enrichment Analysis (GSEA)* algorithm {cite}`subramanian2005gene`, which computes enrichment statistics on the basis of some preranked gene-level test statistics. *fgsea* computes an enrichment score using some signed statistics of the genes in the gene set, such as the t-statistics, log fold-changes (logFC) or p-values from the differential expression test. An empirical (estimated from the data) null distribution is computed for the enrichment score using some random gene sets of the same size, and a p-value is computed to determine the significance of the enrichment score. The p-values are then adjusted for multiple hypothesis testing. GSVA {cite}`hanzelmann2013gsva` is another example of preranked gene set enrichment approaches. 
We should note that the pre-ranked gene set tests are not specific to single cell datasets and apply to Bulk-seq assays as well.\n", "\n", @@ -1360,7 +1360,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "PC1 now captures difference between lymphoid (T, NK, B) and myeloid (Mono, DC) populations, while the second PC captures variation due to administration of stimulus (i.e. difference between control and stimulated pseduo-replicates). Ideally, the variation of interest has to be detectable in top few PCs of the pseudo-bulk data. \n", + "PC1 now captures difference between lymphoid (T, NK, B) and myeloid (Mono, DC) populations, while the second PC captures variation due to administration of stimulus (i.e. difference between control and stimulated pseudo-replicates). Ideally, the variation of interest has to be detectable in top few PCs of the pseudo-bulk data. \n", "\n", "In this case, since we are indeed interested in stimulation effect per cell type, we proceed to gene set testing. We re-iterate that the purpose of plotting PCs is to explore various axes of variability in the data and to spot unwanted variabilities that can substantial influence the test results. Users may proceed with the rest of the analyses should they be satisfied with the the variations in their data." ] diff --git a/jupyter-book/conditions/perturbation_modeling.ipynb b/jupyter-book/conditions/perturbation_modeling.ipynb index 323c7a90..015ed6bd 100644 --- a/jupyter-book/conditions/perturbation_modeling.ipynb +++ b/jupyter-book/conditions/perturbation_modeling.ipynb @@ -33,7 +33,7 @@ "\n", "Robust and accessible tooling for all of these steps is still under development. Hence, we will solely introduce three approaches for a subset of these tasks that can be tackled with single-cell perturbation data in the following sections:\n", "\n", - "1. 
Finding the cell types that were most affected by pertubations using [Augur](https://github.com/neurorestore/Augur) applied to Kang 2018 {cite}`pemo:kang2018`.\n", + "1. Finding the cell types that were most affected by perturbations using [Augur](https://github.com/neurorestore/Augur) applied to Kang 2018 {cite}`pemo:kang2018`.\n", "2. Predicting the transcriptional response of single cells to perturbations using [scGen](https://github.com/theislab/scgen) applied to Kang 2018 {cite}`pemo:kang2018`.\n", "3. Quantifying the sensitivity of genetic CRISPR perturbations using [Mixscape](https://github.com/satijalab/seurat/blob/master/R/mixscape.R) applied to Papalexi 2021 {cite}`Papalexi2021`." ] @@ -59,7 +59,7 @@ "id": "ad956267", "metadata": {}, "source": [ - "Perturbations rarely have the same effect on all cells. In particular, different cell types or cells in different states in their cell cycle can be affected to varying degrees. Here we will leverage [Augur](https://github.com/neurorestore/Augur) by Skinnidier et al. {cite}`Skinnider2021,Squair2021Augur`, which provides one way of quantifying the degree of response, for this purpose." + "Perturbations rarely have the same effect on all cells. In particular, different cell types or cells in different states in their cell cycle can be affected to varying degrees. Here we will leverage [Augur](https://github.com/neurorestore/Augur) by Skinnider et al. {cite}`Skinnider2021,Squair2021Augur`, which provides one way of quantifying the degree of response, for this purpose." ] }, { @@ -1390,7 +1390,7 @@ "metadata": {}, "source": [ "\n", - "Here, we demonstrate the application of scGen {cite}`lotfollahi2019`, a variational autoencoder combined with vector arithmetics. The model learns a latent representation of the data in which it estimates a difference vector between control (untreated) and perturbed (treated) cells. 
The estimated difference vector is then added to control cells for the cell type or population of interest to predict the gene expression response for each single cell. Here, we apply scGen to predict the response to IFN-β for a population of CD4-T cells that are artificially held out (unseen) during training to simulate one of the aforementioned real-world scenarios. We again leverage a dataset that contains peripher blood mononuclear cells (PBMCs) from eight patients with Lupus treated with IFN-β or left untreated from {cite}`pemo:kang2018` across seven different cell-types.\n", + "Here, we demonstrate the application of scGen {cite}`lotfollahi2019`, a variational autoencoder combined with vector arithmetics. The model learns a latent representation of the data in which it estimates a difference vector between control (untreated) and perturbed (treated) cells. The estimated difference vector is then added to control cells for the cell type or population of interest to predict the gene expression response for each single cell. Here, we apply scGen to predict the response to IFN-β for a population of CD4-T cells that are artificially held out (unseen) during training to simulate one of the aforementioned real-world scenarios. We again leverage a dataset that contains peripheral blood mononuclear cells (PBMCs) from eight patients with Lupus treated with IFN-β or left untreated from {cite}`pemo:kang2018` across seven different cell-types.\n", "\n", "As a first step, we import `scanpy` and `scgen` to allow us to work with AnnData objects and employ scGen." 
] diff --git a/jupyter-book/introduction/interoperability.ipynb b/jupyter-book/introduction/interoperability.ipynb index 3fb33f33..e860a16a 100644 --- a/jupyter-book/introduction/interoperability.ipynb +++ b/jupyter-book/introduction/interoperability.ipynb @@ -545,7 +545,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The [Loom file format](http://loompy.org/) is an older HDF5 specification for omics data. Unlike H5AD, it is not linked to a specific analysis ecosystem, although the structure is similar to `AnnData` and `SingleCellExperiment` objects. Packages implementing the Loom format exist for both [R](https://github.com/mojaveazure/loomR) and [Python](https://pypi.org/project/loompy/) as well as a [Bioconductor package](https://bioconductor.org/packages/LoomExperiment/) for writing Loom files. However, it is often more convenient to use the higher-level interfaces provided by the core ecosystem packages. Apart from sharing datasets another common place Loom files are encountered is when spliced/unspliced reads are quantified using [velocycto](http://velocyto.org/) for {ref}`RNA velocity analysis `." + "The [Loom file format](http://loompy.org/) is an older HDF5 specification for omics data. Unlike H5AD, it is not linked to a specific analysis ecosystem, although the structure is similar to `AnnData` and `SingleCellExperiment` objects. Packages implementing the Loom format exist for both [R](https://github.com/mojaveazure/loomR) and [Python](https://pypi.org/project/loompy/) as well as a [Bioconductor package](https://bioconductor.org/packages/LoomExperiment/) for writing Loom files. However, it is often more convenient to use the higher-level interfaces provided by the core ecosystem packages. Apart from sharing datasets another common place Loom files are encountered is when spliced/unspliced reads are quantified using [velocyto](http://velocyto.org/) for {ref}`RNA velocity analysis `." 
] }, { diff --git a/jupyter-book/mechanisms/cell_cell_communication.ipynb b/jupyter-book/mechanisms/cell_cell_communication.ipynb index 707616a7..543441ad 100644 --- a/jupyter-book/mechanisms/cell_cell_communication.ipynb +++ b/jupyter-book/mechanisms/cell_cell_communication.ipynb @@ -2105,7 +2105,7 @@ "### Summary & Outlook:\n", "\n", "In this chapter we presented two applications of CCC inference methods, namely we used CellPhoneDB and LIANA to predict relevant ligand-receptor interactions from single-context (or steady-state) data and NicheNet to infer potentially active ligands and their targets in a differential-expression context. \n", - "As the focus of the single-cell field moves further away from the definition of linages, and into characterizing changes within cell types between conditions, approaches to disentangle CCC insights across contexts are becoming essential. Thus, in addition to NicheNet, we refer the user to other approaches that enable cross-condition comparisons, such as NATMI's differential cell-connectivity analysis {cite}`hou2020predicting`, Crosstalker's network topological measures {cite}`nagai2021crosstalker`, CellChat's pathway-focused manifold learning {cite}`jin_2021`, as well as Tensor-cell2cell's untargetted factorization approach to infer CCC patterns across contexts {cite}`armingol_2022`. \n", + "As the focus of the single-cell field moves further away from the definition of lineages, and into characterizing changes within cell types between conditions, approaches to disentangle CCC insights across contexts are becoming essential. 
Thus, in addition to NicheNet, we refer the user to other approaches that enable cross-condition comparisons, such as NATMI's differential cell-connectivity analysis {cite}`hou2020predicting`, Crosstalker's network topological measures {cite}`nagai2021crosstalker`, CellChat's pathway-focused manifold learning {cite}`jin_2021`, as well as Tensor-cell2cell's untargeted factorization approach to infer CCC patterns across contexts {cite}`armingol_2022`. \n", "\n", "As a consequence of the ongoing developments within the single-cell and the cell-cell communication field specifically, there are is an ever-growing number of methods, some of which propose alternative ways to predict CCC events, such as those that work at the single-cell resolution {cite}`raredon_2023,wang_2019,wilk_2022`. While others attempt to address some of the limitations above, e.g. by including the inference of interactions mediated by metabolites or small molecules {cite}`zheng_2022,garciaalonso_2022,zhang_2021`.\n", "\n", diff --git a/jupyter-book/mechanisms/gene_regulatory_networks.ipynb b/jupyter-book/mechanisms/gene_regulatory_networks.ipynb index e9219756..61b20498 100644 --- a/jupyter-book/mechanisms/gene_regulatory_networks.ipynb +++ b/jupyter-book/mechanisms/gene_regulatory_networks.ipynb @@ -1184,7 +1184,7 @@ } }, "source": [ - "This step will use TFs to calculate Area Under the Curve scores, that summarize how well the gene expression observed in each cell can be associated by the regulation of target genes regulatred by the mentioned TFs." + "This step will use TFs to calculate Area Under the Curve scores, that summarize how well the gene expression observed in each cell can be associated by the regulation of target genes regulated by the mentioned TFs." 
] }, { @@ -1367,7 +1367,7 @@ } }, "source": [ - "A visualization of the tSNE values generated by SCENIC also confimrs this cell-type separation, for the majority of cell-types" + "A visualization of the tSNE values generated by SCENIC also confirms this cell-type separation, for the majority of cell-types" ] }, { diff --git a/jupyter-book/multimodal_integration/advanced_integration.ipynb b/jupyter-book/multimodal_integration/advanced_integration.ipynb index ad5e2fab..4c2af3b2 100644 --- a/jupyter-book/multimodal_integration/advanced_integration.ipynb +++ b/jupyter-book/multimodal_integration/advanced_integration.ipynb @@ -279,7 +279,7 @@ "id": "ea3e32a5", "metadata": {}, "source": [ - "There are some differences in `.obs_names` of RNA and ADT of CITE-seq data, so we update them to make sure they allign between modalities." + "There are some differences in `.obs_names` of RNA and ADT of CITE-seq data, so we update them to make sure they align between modalities." ] }, { @@ -521,7 +521,7 @@ "source": [ "Opposed to paired integration that we demonstrated before, it is also possible to perform completely unpaired integration. In this case, there is no intersection of cell barcodes or features. Hence, we need some prior knowledge to connect different modalities.\n", "\n", - "GLUE {cite}`cao2022` is a deep learning model for unpaired integration which makes use of a regulatory graph helping connect features from different modalities. The model is based on conditional variational autoencoders, where the model learns to reconstruct while simultaneously allowing for batch correction. To guide the integration, GLUE learns an embedding for each modality for each feature by utilizing a prior knowledge graph. We demonstrate how to use GLUE to integrate unpaired RNA and ADT data using RNA part of Multiome data and ADT part of CITE-seq data from the NeurIPS competiiton (https://openproblems.bio/neurips_2021/). 
To construct the graph, we connect nodes from RNA modality to nodes from ADT modality if and only if the RNA node is a protein encoding gene of a given protein from ADT modality. The output of the GLUE model is a representation of each cell in a shared latent space. \n", + "GLUE {cite}`cao2022` is a deep learning model for unpaired integration which makes use of a regulatory graph helping connect features from different modalities. The model is based on conditional variational autoencoders, where the model learns to reconstruct while simultaneously allowing for batch correction. To guide the integration, GLUE learns an embedding for each modality for each feature by utilizing a prior knowledge graph. We demonstrate how to use GLUE to integrate unpaired RNA and ADT data using RNA part of Multiome data and ADT part of CITE-seq data from the NeurIPS competition (https://openproblems.bio/neurips_2021/). To construct the graph, we connect nodes from RNA modality to nodes from ADT modality if and only if the RNA node is a protein encoding gene of a given protein from ADT modality. The output of the GLUE model is a representation of each cell in a shared latent space. \n", "\n", "We refer the reader to GLUE tutorial to see how one can integrate unpaired RNA and ATAC with GLUE and for more details about the model https://scglue.readthedocs.io/en/latest/tutorials.html." ] @@ -1252,7 +1252,7 @@ "source": [ "### Query-to-reference mapping\n", "\n", - "Now we demontrate how to map new unimodal (RNA-only) and multimodal query (CITE-seq) onto the above reference.\n", + "Now we demonstrate how to map new unimodal (RNA-only) and multimodal query (CITE-seq) onto the above reference.\n", "\n", "We mimic RNA-only query by setting the protein counts of one of the two batches to zero." ] @@ -1621,7 +1621,7 @@ "id": "56b2d3ef", "metadata": {}, "source": [ - "Now we need to find anchors between the reference and the query. 
We specify that we want to use SPCA dimentionality reduction from the reference." + "Now we need to find anchors between the reference and the query. We specify that we want to use SPCA dimensionality reduction from the reference." ] }, { @@ -2325,7 +2325,7 @@ "id": "757856d6", "metadata": {}, "source": [ - "Finally, we obtain the latent representation, save in explicitely in `.obsm['latent_ref']` as we will overwrite `.obsm['latent']` later when we work with fine-tuned model after query mapping and visualize the result." + "Finally, we obtain the latent representation, save it explicitly in `.obsm['latent_ref']` as we will overwrite `.obsm['latent']` later when we work with fine-tuned model after query mapping and visualize the result." ] }, { diff --git a/jupyter-book/multimodal_integration/paired_integration.ipynb b/jupyter-book/multimodal_integration/paired_integration.ipynb index 5aab936b..413a2e8e 100644 --- a/jupyter-book/multimodal_integration/paired_integration.ipynb +++ b/jupyter-book/multimodal_integration/paired_integration.ipynb @@ -23,7 +23,7 @@ "source": [ "In the recent years several technologies appeared that allow us to measure several modalities in a single-cell. Modalities in this context refer to different type of information that we can capture in each cell. For instance, CITE-seq allows measuring gene expression and surface protein abundance in the same cell. Alternatively, paired RNA-seq/ATAC-seq experiments using, for example, the Multiome assay, capture gene expression and chromatin accessibility simultaneously. \n", "\n", - "We are interested in the most holistic representation of single cells that incorporate information from all the available modalities, but several challenges might arise when integrating these different modalities. Data stemming from different sequencing technologies can vary in dimensions: RNA-seq experiments usually capture 20-30 thousand genes, but the protein panel can be as small just a few proteins up to 200. ATAC-seq experiments on the other hand can have more than 200000 peaks. On top of having different dimensionality, the data can follow different distributions. RNA-seq counts are often modelled with negative binomial distribution, while chromatine accesibility can be binarized and modelled as either open or closed and therefore with a Bernoulli distribution {cite}`pi:ashuach2021`. Alternatively raw ATAC-seq counts can be modelled following Poisson distribution {cite}`pi:martens2022`.\n", + "We are interested in the most holistic representation of single cells that incorporate information from all the available modalities, but several challenges might arise when integrating these different modalities. Data stemming from different sequencing technologies can vary in dimensions: RNA-seq experiments usually capture 20-30 thousand genes, but the protein panel can range from just a few proteins up to 200. 
ATAC-seq experiments on the other hand can have more than 200000 peaks. On top of having different dimensionality, the data can follow different distributions. RNA-seq counts are often modelled with negative binomial distribution, while chromatine accesibility can be binarized and modelled as either open or closed and therefore with a Bernoulli distribution {cite}`pi:ashuach2021`. Alternatively raw ATAC-seq counts can be modelled following Poisson distribution {cite}`pi:martens2022`.\n", + "We are interested in the most holistic representation of single cells that incorporate information from all the available modalities, but several challenges might arise when integrating these different modalities. Data stemming from different sequencing technologies can vary in dimensions: RNA-seq experiments usually capture 20-30 thousand genes, but the protein panel can be as small just a few proteins up to 200. ATAC-seq experiments on the other hand can have more than 200000 peaks. On top of having different dimensionality, the data can follow different distributions. RNA-seq counts are often modelled with negative binomial distribution, while chromatin accessibility can be binarized and modelled as either open or closed and therefore with a Bernoulli distribution {cite}`pi:ashuach2021`. Alternatively raw ATAC-seq counts can be modelled following Poisson distribution {cite}`pi:martens2022`.\n", "\n", "Here, we showcase several tools for paired integration including MOFA+ {cite}`pi:argelaguet2020`, WNN {cite}`pi:hao2021`, totalVI {cite}`pi:gayoso2021` and multiVI {cite}`pi:ashuach2021`. We use 10x Multiome and CITE-seq data generated for the single cell data integration challenge at the NeurIPS conference 2021 {cite}`pi:luecken2021sandbox`. This dataset captures single-cell data from bone marrow mononuclear cells of 12 healthy human donors measured at four different sites to obtain nested batch effects. 
In this tutorial, we will use 3 batches from one site to showcase the integration tools. \n", "\n", @@ -2304,7 +2304,7 @@ "id": "3c582fba", "metadata": {}, "source": [ - "MultiVI is also based on variational inference and conditional variational autoencoders. The gene expression counts are modeled exactly the same way as in totalVI, i.e. using raw counts and NB distribution. Chromatin accessibility on the other hand is modeled using Bernouli distribution modeling how likely a particular region is to be open. Hence, the input data for ATAC assay has to be binary where 0 means a closed region and 1 means an open region." + "MultiVI is also based on variational inference and conditional variational autoencoders. The gene expression counts are modeled exactly the same way as in totalVI, i.e. using raw counts and NB distribution. Chromatin accessibility on the other hand is modeled using Bernoulli distribution modeling how likely a particular region is to be open. Hence, the input data for ATAC assay has to be binary where 0 means a closed region and 1 means an open region." ] }, { @@ -2368,7 +2368,7 @@ "id": "1c979b93", "metadata": {}, "source": [ - "We also make sure that we pass raw counts as input to the model by specifiying `layer='counts'` in the `setup_anndata` funciton." + "We also make sure that we pass raw counts as input to the model by specifying `layer='counts'` in the `setup_anndata` function." ] }, { diff --git a/jupyter-book/preprocessing_visualization/feature_selection.ipynb b/jupyter-book/preprocessing_visualization/feature_selection.ipynb index 197b3067..e347a6af 100644 --- a/jupyter-book/preprocessing_visualization/feature_selection.ipynb +++ b/jupyter-book/preprocessing_visualization/feature_selection.ipynb @@ -243,7 +243,7 @@ "id": "cfa8dd2a-62b0-4e05-9357-2febaa5ccf68", "metadata": {}, "source": [ - "Last, we visualise the feature selection results. We use a scanpy function to compute the mean and dispersion for each gene accross all cells." 
+ "Last, we visualise the feature selection results. We use a scanpy function to compute the mean and dispersion for each gene across all cells." ] }, { @@ -318,7 +318,7 @@ "id": "84b8e5aa-90ce-40c0-b5d1-4bcaba0c5b25", "metadata": {}, "source": [ - "We observe that genes with a high mean expression are selected as highly deviant. This is in agreement with emprical observations by {cite}`Townes2019`." + "We observe that genes with a high mean expression are selected as highly deviant. This is in agreement with empirical observations by {cite}`Townes2019`." ] }, { diff --git a/jupyter-book/preprocessing_visualization/normalization.ipynb b/jupyter-book/preprocessing_visualization/normalization.ipynb index 9ab6315f..cde82306 100644 --- a/jupyter-book/preprocessing_visualization/normalization.ipynb +++ b/jupyter-book/preprocessing_visualization/normalization.ipynb @@ -90,7 +90,7 @@ "id": "31b68498-ccfa-4a47-b4fe-614f132a2fb5", "metadata": {}, "source": [ - "We can now inspect the distrubution of the raw counts which we already calculated during quality control. This step can be neglected during a standard single-cell analysis pipeline, but might be helpful to understand the different normalization concepts. " + "We can now inspect the distribution of the raw counts which we already calculated during quality control. This step can be neglected during a standard single-cell analysis pipeline, but might be helpful to understand the different normalization concepts. " ] }, { @@ -230,7 +230,7 @@ "id": "947bae04-7447-413f-aac4-2d4343e43f3e", "metadata": {}, "source": [ - "scran requires a coarse clustering input to improve size factor esimation performance. In this tutorial, we use a simple preprocessing approach and cluster the data at a low resolution to get an input for the size factor estimation. The basic preprocessing includes assuming all size factors are equal (library size normalization to counts per million - CPM) and log-transforming the count data." 
+ "scran requires a coarse clustering input to improve size factor estimation performance. In this tutorial, we use a simple preprocessing approach and cluster the data at a low resolution to get an input for the size factor estimation. The basic preprocessing includes assuming all size factors are equal (library size normalization to counts per million - CPM) and log-transforming the count data." ] }, { diff --git a/jupyter-book/preprocessing_visualization/quality_control.ipynb b/jupyter-book/preprocessing_visualization/quality_control.ipynb index ab1bf730..ee9e1946 100644 --- a/jupyter-book/preprocessing_visualization/quality_control.ipynb +++ b/jupyter-book/preprocessing_visualization/quality_control.ipynb @@ -554,7 +554,7 @@ "id": "6dce115c-204f-48c5-851b-ca1eefb59566", "metadata": {}, "source": [ - "Next, we compute the principle components of the data to obtain a lower dimentional representation. This representation is then used to generate a neighbourhood graph of the data and run leiden clustering on the KNN-graph. We add the clusters as `soupx_groups` to `.obs` and save them as a vector. " + "Next, we compute the principal components of the data to obtain a lower-dimensional representation. This representation is then used to generate a neighbourhood graph of the data and run Leiden clustering on the KNN-graph. We add the clusters as `soupx_groups` to `.obs` and save them as a vector. " ] }, { diff --git a/jupyter-book/spatial/deconvolution.ipynb b/jupyter-book/spatial/deconvolution.ipynb index a23b0326..8b01fd16 100644 --- a/jupyter-book/spatial/deconvolution.ipynb +++ b/jupyter-book/spatial/deconvolution.ipynb @@ -18,7 +18,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Spot-based spatial transcriptomics data extends the classical readout from dissociated single-cell RNA sequencing with spatial locations. While still being sequencing based, and therefore unbiased in gene space, available methods do not have single-cell resolution. 
Visium, the commercial version of the original Spatial Transcriptomics protocol, for example, has circular capture areas with a diameter of 55mu. Dependet on the tissue and spatial location multiple cells map onto a single capture area. In addition, a single cell might only be partly contained within the capture area, leading to further differences from the expression profiles we are used to when dealing with classical scRNA-seq data. " + "Spot-based spatial transcriptomics data extends the classical readout from dissociated single-cell RNA sequencing with spatial locations. While still being sequencing based, and therefore unbiased in gene space, available methods do not have single-cell resolution. Visium, the commercial version of the original Spatial Transcriptomics protocol, for example, has circular capture areas with a diameter of 55mu. Dependent on the tissue and spatial location multiple cells map onto a single capture area. In addition, a single cell might only be partly contained within the capture area, leading to further differences from the expression profiles we are used to when dealing with classical scRNA-seq data. " ] }, { @@ -45,7 +45,7 @@ "source": [ "Spatial deconvolution is a rather complex approach that requires some understanding of the underlying methods. This section aims to explain the mathematical concepts of the task at hand. Readers only interested in how to apply deconvolution in practice can directly jump to the section **Cell2location in practice**.\n", "\n", - "In spatial transciptomics, the observed transcriptome can be described as a latent variable model. The observed counts $x_{sg}$ for gene $g$ and spot $s$ is the sum of the cells' contributions $x_{sig}$ that belong to this spot:\n", + "In spatial transcriptomics, the observed transcriptome can be described as a latent variable model. 
The observed counts $x_{sg}$ for gene $g$ and spot $s$ is the sum of the cells' contributions $x_{sig}$ that belong to this spot:\n", "\n", "$$x_{sg} = \\sum_{i=1}^{C(s)} x_{sig} \\quad .$$\n", "\n", @@ -53,7 +53,7 @@ "\n", "$$x_{sg} = \\sum_{i=1}^{C(s)} c_{t(i)g} = \\sum_{t=1}^{T} \\beta_{st} c_{tg} \\quad ,$$\n", "\n", - "where $t(i)$ is cell's $i$ cell type and $c_{tg}$ the prototype expression profiles. Note that the sum changes from summing over individual cells $i$ to summing over the distinct set of cell types $t \\in \\{1, \\dots, C\\}$. Here, the parameter $\\tilde \\beta_{st}$ counts how often a cell type occures in a spot. Through normalisation of the spot's library size $l_s$, this count vector can be changed to indicate cell type proportions:\n", + "where $t(i)$ is cell $i$'s cell type and $c_{tg}$ the prototype expression profiles. Note that the sum changes from summing over individual cells $i$ to summing over the distinct set of cell types $t \\in \\{1, \\dots, T\\}$. Here, the parameter $\\tilde \\beta_{st}$ counts how often a cell type occurs in a spot. Through normalisation of the spot's library size $l_s$, this count vector can be changed to indicate cell type proportions:\n", "\n", "$$x_{sg} = l_s \\sum_{t=1}^{T} \\beta_{st} c_{tg} \\quad .$$\n", "\n", @@ -491,7 +491,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "As we can see, ENSEMBL gene identifiers are now correctly stroed in `adata_st.var_names`." + "As we can see, ENSEMBL gene identifiers are now correctly stored in `adata_st.var_names`." ] }, { @@ -685,7 +685,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The respective function returns a 2D histogram, where the orange rectangle indicated genes that are excluded based on the defined theshold. The Y-axis shows the number of cells expressing that gene and the X-axis the average RNA count for cells with a detected gene." 
+ "The respective function returns a 2D histogram, where the orange rectangle indicated genes that are excluded based on the defined threshold. The Y-axis shows the number of cells expressing that gene and the X-axis the average RNA count for cells with a detected gene." ] }, { diff --git a/jupyter-book/spatial/imputation.ipynb b/jupyter-book/spatial/imputation.ipynb index 2abc2924..fda04fab 100644 --- a/jupyter-book/spatial/imputation.ipynb +++ b/jupyter-book/spatial/imputation.ipynb @@ -28,7 +28,7 @@ "\n", "$$ \\textbf{d} \\in \\mathbb{R}^{n_\\text{voxels}} \\ \\ \\text{ with } \\ \\ d_j \\in [0,1] \\ \\ \\text{ and } \\ \\ \\sum_j d_j = 1 \\quad . $$ \n", "\n", - "Given the matrices $S$ and $G$ as well as the density $\\textbf d$, Tangram aims to learn a thrid matrix $M$ that expresses the probability $M_{ij} \\in [0,1]$ that cell $i$ belongs to voxel $j$. Being probabilistic, every cell has to be mapped exactly once, that is the rows of $M$ have to be normalised: \n", + "Given the matrices $S$ and $G$ as well as the density $\\textbf d$, Tangram aims to learn a third matrix $M$ that expresses the probability $M_{ij} \\in [0,1]$ that cell $i$ belongs to voxel $j$. Being probabilistic, every cell has to be mapped exactly once, that is the rows of $M$ have to be normalised: \n", "\n", "$$ M \\in \\mathbb{R}^{n_{\\text{cells}}\\times n_{\\text{voxels}}}_+ \\ \\ \\text{ with } \\ \\ \\sum_{j}^{n_\\text{voxel}} M_{ij}=1 \\quad . $$\n", "\n", @@ -544,7 +544,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Next, we will compare the new spatial data with the originial measurements. This will provide us with a better feeling why some training scores might be bad. Note that this explicit mapping of Tangram relies on entirely different premises than those in probabilistic models. Here, we are inclined to trust the predicted gene expression patterns based on the good mapping performance of most training genes. 
The fact the some genes show a very sparse and dispersed spatial signal can be understood a result of technical dropout of the spatial technology rather than a shortcoming of the mapping method." + "Next, we will compare the new spatial data with the original measurements. This will provide us with a better feeling why some training scores might be bad. Note that this explicit mapping of Tangram relies on entirely different premises than those in probabilistic models. Here, we are inclined to trust the predicted gene expression patterns based on the good mapping performance of most training genes. The fact that some genes show a very sparse and dispersed spatial signal can be understood as a result of technical dropout of the spatial technology rather than a shortcoming of the mapping method." ] }, { diff --git a/jupyter-book/spatial/introduction.ipynb b/jupyter-book/spatial/introduction.ipynb index 46d78b53..5613dd70 100644 --- a/jupyter-book/spatial/introduction.ipynb +++ b/jupyter-book/spatial/introduction.ipynb @@ -50,7 +50,7 @@ "metadata": {}, "source": [ "### Single-cell resolution\n", - "Spatial omics data obtained at single-cell resolution either directly capture single cells at their exact position or capture spots on the scale of single-cell. Examples for spot-based methods are HDST, slide-seqV2 or stero-seq. These methods capture the whole transcriptome but still have a low capture efficiency. \n", + "Spatial omics data obtained at single-cell resolution either directly capture single cells at their exact position or capture spots on the scale of single-cell. Examples for spot-based methods are HDST, slide-seqV2 or stereo-seq. These methods capture the whole transcriptome but still have a low capture efficiency. \n", "\n", "Targeted methods provide an alternative for measuring cells at their exact position. Common examples are MERFISH, seqFISH+, IMC or multiplexed IHC (e.g. cyCIF and CODEX). 
These technologies are usually expensive and only measure a limited features space. These methods do not capture spots at a predefined location or grid, but measure individual transcript or cellular locations.\n", "\n" @@ -119,4 +119,4 @@ }, "nbformat": 4, "nbformat_minor": 5 -} \ No newline at end of file +} diff --git a/jupyter-book/surface_protein/quality_control.ipynb b/jupyter-book/surface_protein/quality_control.ipynb index f2e640cc..cee44b79 100644 --- a/jupyter-book/surface_protein/quality_control.ipynb +++ b/jupyter-book/surface_protein/quality_control.ipynb @@ -336,7 +336,7 @@ "id": "f7773b29-2d12-4315-a69a-95dea73f4078", "metadata": {}, "source": [ - "Now, we have two Mudata objects. `mdata_raw` contrains the unfiltered barcode x features MuData object and `mdata` contains all barcodes that passed the cellranger filtering. While the raw object contains over 24 million droplets, the filtered object only contains 122,016. The `rna` modality has 36601 features while the `prot` modality has 140 features." + "Now, we have two Mudata objects. `mdata_raw` contains the unfiltered barcode x features MuData object and `mdata` contains all barcodes that passed the cellranger filtering. While the raw object contains over 24 million droplets, the filtered object only contains 122,016. The `rna` modality has 36601 features while the `prot` modality has 140 features." ] }, { @@ -383,7 +383,7 @@ "tags": [] }, "source": [ - "We first look at the distribution of ADTs per cell over all samples. We plot this using the seaborn library. We first take a look at the wole range and can see that most cells\n", + "We first look at the distribution of ADTs per cell over all samples. We plot this using the seaborn library. We first take a look at the whole range and can see that most cells\n", "express between 70 and 140 proteins." 
] }, diff --git a/jupyter-book/trajectories/lineage_tracing.ipynb b/jupyter-book/trajectories/lineage_tracing.ipynb index 4b38a109..6c9ed29a 100644 --- a/jupyter-book/trajectories/lineage_tracing.ipynb +++ b/jupyter-book/trajectories/lineage_tracing.ipynb @@ -11,7 +11,7 @@ "source": [ "# Lineage tracing\n", "\n", - "_TL;DR we provide a brief overview assays providing measurments of both cell state and lineage history and on the available computational pipelines using a leading example, tracing tumor development in a mouse model of lung cancer._" + "_TL;DR we provide a brief overview of assays providing measurements of both cell state and lineage history and of the available computational pipelines using a leading example, tracing tumor development in a mouse model of lung cancer._" ] }, { @@ -25,7 +25,7 @@ "source": [ "## Motivation\n", "\n", - "Cellular lineages are ubiquitious in biology. Perhaps the most famous example is that of embyrogenesis: the process by which an organism like a human being is generated from from a single cell, the fertilized egg. During this process, subsequent cell divisions give rise to daughter cells and over time entire \"lineages\" that take on specialized roles within the developing embryo. The amazing complexity of this process has captured the imagination of scientists for centuries, and over the past century and a half our understanding of this process has been bolstered by the development of high-throughput sequencing assays and new \"lineage tracing\" technologies for visualizing and characterizing this process {cite}`woodworth2017`. Amongst the most exciting of these methods allow investigators to link measurements of cell state with models of their history, thus providing a window into how differentiation trajectories might have unfolded.\n", + "Cellular lineages are ubiquitous in biology. 
Perhaps the most famous example is that of embryogenesis: the process by which an organism like a human being is generated from a single cell, the fertilized egg. During this process, subsequent cell divisions give rise to daughter cells and over time entire \"lineages\" that take on specialized roles within the developing embryo. The amazing complexity of this process has captured the imagination of scientists for centuries, and over the past century and a half our understanding of this process has been bolstered by the development of high-throughput sequencing assays and new \"lineage tracing\" technologies for visualizing and characterizing this process {cite}`woodworth2017`. Amongst the most exciting of these methods allow investigators to link measurements of cell state with models of their history, thus providing a window into how differentiation trajectories might have unfolded.\n", "\n", "The marriage of single-cell assays and lineage tracing approaches has yielded an exponential growth in the complexity of datasets, requiring the development of new computational methodology for their analysis. As such, there has been a strong need in developing new computational methodology for processing these datasets {cite}`gong2021`. Sourcing heavily from population genetics literature, the past half decade has witnessed an exciting confluence of traditional concepts in evolutionary biology with cutting-edge genome engineering techniques. \n", "\n", @@ -1905,7 +1905,7 @@ "\n", "In terms of new lineage inference algorithms, there are many promising directions: \n", "\n", - "- **Scalable bayesian inference**: Mirroring the trends of more traditional phylogenetic algorithms, one potential direction is that of scaling Bayesian approaches to larger inputs. 
While most Bayesian algorithms have leveraged Markov chain Monte Carlo (MCMC) to estimate the posterior distribution {cite}`huelsenbeck2001`, advances in variational inference would greatly improve the scalability of Bayesian algorithms {cite}`zhang2018`. The probablistic nature of such an advance would support high-throughput uncertainty estimation of the tree as well as fit naturally with other single-cell transcriptomic Bayesian approaches, like scVI {cite}`lt:Lopez2018`.\n", + "- **Scalable bayesian inference**: Mirroring the trends of more traditional phylogenetic algorithms, one potential direction is that of scaling Bayesian approaches to larger inputs. While most Bayesian algorithms have leveraged Markov chain Monte Carlo (MCMC) to estimate the posterior distribution {cite}`huelsenbeck2001`, advances in variational inference would greatly improve the scalability of Bayesian algorithms {cite}`zhang2018`. The probabilistic nature of such an advance would support high-throughput uncertainty estimation of the tree as well as fit naturally with other single-cell transcriptomic Bayesian approaches, like scVI {cite}`lt:Lopez2018`.\n", "- **Improved distance-based algorithms**: A fundamental aspect of distance-based algorithms is the estimation of dissimilarities between samples based on their mutation data and known properties of the dataset. With this in mind, a promising direction is that of developing more statistically robust and consistent dissimilarity functions for evolving lineage tracers by taking into account properites of how mutations arise and the priors on the likelihoods of specific mutations. Already, this has proven successful with DCLEAR {cite}`gong2021` and {cite}`fang2022`. Continuing advances in this realm would be greatly enabling as distance-based algorithms can run in polynomial time and produce very accurate trees with an adequate dissimilarity function." 
] }, diff --git a/jupyter-book/trajectories/pseudotemporal.ipynb b/jupyter-book/trajectories/pseudotemporal.ipynb index f818c5b8..0fa93c60 100644 --- a/jupyter-book/trajectories/pseudotemporal.ipynb +++ b/jupyter-book/trajectories/pseudotemporal.ipynb @@ -53,13 +53,13 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "1. Observations are first clustered and, following, connections between these clusters identified. The clusters can be ordered and, thereby, a pseudotime constructed. Henceforth, we will refer to this apporoach as the _cluster approach_. Classical cluster algorithms include $k$-means {cite}`Lloyd1982`, {cite}`MacQueen1967`, Leiden {cite}`pt:Traag2019`, or hierarchical clustering {cite}`Mueller2011`. Clusters may be connected based on similarity, or by constructing a minimum spanning tree (MST) {cite}`Pettie2002`.\n", + "1. Observations are first clustered and, following, connections between these clusters identified. The clusters can be ordered and, thereby, a pseudotime constructed. Henceforth, we will refer to this approach as the _cluster approach_. Classical cluster algorithms include $k$-means {cite}`Lloyd1982`, {cite}`MacQueen1967`, Leiden {cite}`pt:Traag2019`, or hierarchical clustering {cite}`Mueller2011`. Clusters may be connected based on similarity, or by constructing a minimum spanning tree (MST) {cite}`Pettie2002`.\n", "\n", "2. The _graph approach_ first finds connections between the lower dimensional representation of the observations. This procedure defines a graph based on which clusters, and thus an ordering, are defined. *PAGA* {cite}`Wolf2019`, for example, partitions the graph into Leiden clusters and estimates connections between them. Intuitively, this approach preserves the global topology of the data while analyzing it at a lower resolution. Consequently, the computational efficiency is increased.\n", "\n", "3. *Manifold-learning based approaches* proceed similar to the *cluster approach*. 
However, connections between clusters are defined by using principal curves or graphs to estimate the underlying trajectories. Principal curves find a one-dimensional curve that connects cellular observations in the higher dimensional space. A notable representation of this approach is Slingshot {cite}`Street2018`.\n", "\n", - "4. Probabilistic frameworks assign transition probabilities to ordered cell-cell pairs. Each transition probabilitiy quantifies how likely the reference cell is the ancestor of the other cell. These probabilities define random processes that are used to define a pseudotime. DPT, for example, is defined as the difference between consecutive states of a random walk. Contrastingly, Palantir {cite}`Setty2019` models trajectories themselves as Markov chains. While both approaches rely on a probabilistic framework, they require a root cell to be specified. The pseudotime itself is computed with respect to this cell." + "4. Probabilistic frameworks assign transition probabilities to ordered cell-cell pairs. Each transition probability quantifies how likely the reference cell is the ancestor of the other cell. These probabilities define random processes that are used to define a pseudotime. DPT, for example, is defined as the difference between consecutive states of a random walk. Contrastingly, Palantir {cite}`Setty2019` models trajectories themselves as Markov chains. While both approaches rely on a probabilistic framework, they require a root cell to be specified. The pseudotime itself is computed with respect to this cell." ] }, { @@ -80,14 +80,14 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Even though TI and pseudotime can already provide valuable insight, they usually act as a stepping stone for more fine grained analysis. Identifying terminal states, for example, is a classical biological question that can be studied. Similarly, lineage bifurcation and genes driving fate decisions can be identified based on TI and pseudotime. 
Which question can answere and how the answer is found is usually method specific. Palantir, for example, identifies terminal states as absorbing states of its constructed Markov chain." + "Even though TI and pseudotime can already provide valuable insight, they usually act as a stepping stone for more fine grained analysis. Identifying terminal states, for example, is a classical biological question that can be studied. Similarly, lineage bifurcation and genes driving fate decisions can be identified based on TI and pseudotime. Which question can be answered and how the answer is found is usually method specific. Palantir, for example, identifies terminal states as absorbing states of its constructed Markov chain." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "The success of trajectory inference is well documented and, consequently, many methods have been proposed. However, with the advances of sequencing technologies, new sources of information become available. ATAC-seq {cite}`Buenrostro2015`, CITE-seq {cite}`pt:Stoeckius2017`, and DOGMA-seq {cite}`pt:Mimitou2021`, for example, measure additional modalities beyond the transcriptome. Lineage tracing {cite}`Weinreb2020` and metabolic labeling {cite}`Erhard2019`, {cite}`Battich2020`, {cite}`Qiu2020`, {cite}`Erhard2022` even provide the (likely) future state of a given cell. Consequently, future TI tools will be able to include more information to estimate trajectories and pseudotime more accuractely and robustly, and allow answering novel questions. For example, RNA velocity {cite}`LaManno2018`, {cite}`Bergen2020`, {cite}`Bergen2021` is one technique that uses unspliced and spliced mRNA to infer directed, dynamic information beyond classical, static snapshot data." + "The success of trajectory inference is well documented and, consequently, many methods have been proposed. However, with the advances of sequencing technologies, new sources of information become available. 
ATAC-seq {cite}`Buenrostro2015`, CITE-seq {cite}`pt:Stoeckius2017`, and DOGMA-seq {cite}`pt:Mimitou2021`, for example, measure additional modalities beyond the transcriptome. Lineage tracing {cite}`Weinreb2020` and metabolic labeling {cite}`Erhard2019`, {cite}`Battich2020`, {cite}`Qiu2020`, {cite}`Erhard2022` even provide the (likely) future state of a given cell. Consequently, future TI tools will be able to include more information to estimate trajectories and pseudotime more accurately and robustly, and allow answering novel questions. For example, RNA velocity {cite}`LaManno2018`, {cite}`Bergen2020`, {cite}`Bergen2021` is one technique that uses unspliced and spliced mRNA to infer directed, dynamic information beyond classical, static snapshot data." ] }, { @@ -298,7 +298,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Different pseudotime methods give different results. Sometimes, one pseudotime captures the underlying developmental processes more accurately. Here, we compare the just computed DPT with the pre-computed Palantir pseudotime (see [here](https://github.com/dpeerlab/Palantir/blob/master/notebooks/Palantir_sample_notebook.ipynb) for the corresponding tutorial). One option to compare different pseudotimes is by coloring the low dimensional embedding of the data (here, t-SNE). Here, DPT is extremly high in the cluster of CLPs compared to all other cell types. Contrastingly, the Palantir pseudotime increases continuously with developmental maturity." + "Different pseudotime methods give different results. Sometimes, one pseudotime captures the underlying developmental processes more accurately. Here, we compare the just computed DPT with the pre-computed Palantir pseudotime (see [here](https://github.com/dpeerlab/Palantir/blob/master/notebooks/Palantir_sample_notebook.ipynb) for the corresponding tutorial). One option to compare different pseudotimes is by coloring the low dimensional embedding of the data (here, t-SNE). 
Here, DPT is extremely high in the cluster of CLPs compared to all other cell types. Contrastingly, the Palantir pseudotime increases continuously with developmental maturity." ] }, { diff --git a/jupyter-book/trajectories/rna_velocity.ipynb b/jupyter-book/trajectories/rna_velocity.ipynb index f47de8cd..f940a26d 100644 --- a/jupyter-book/trajectories/rna_velocity.ipynb +++ b/jupyter-book/trajectories/rna_velocity.ipynb @@ -36,7 +36,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The change in the transcriptomic profile of a cell is triggered by a cascade of events: Broadly speaking, DNA is transcribed to produce so-called unspliced precursor messenger RNA (pre-mRNA). Unspliced pre-mRNA contains regions relevant for translation (exons) as well as non-coding regions (introns). These non-coding regions are spliced out, *i.e.*, removed, to form spliced, mature mRNA. While single-cell RNA sequencing (scRNA-seq) protocols fail to capture the transcriptome at multiple timepoints, they do include the necessay information to disassociate unspliced and spliced mRNA reads {cite}`velo:LaManno2018, velo:Srivastava2019, velo:He2022, velo:Melsted2021`." + "The change in the transcriptomic profile of a cell is triggered by a cascade of events: Broadly speaking, DNA is transcribed to produce so-called unspliced precursor messenger RNA (pre-mRNA). Unspliced pre-mRNA contains regions relevant for translation (exons) as well as non-coding regions (introns). These non-coding regions are spliced out, *i.e.*, removed, to form spliced, mature mRNA. While single-cell RNA sequencing (scRNA-seq) protocols fail to capture the transcriptome at multiple timepoints, they do include the necessary information to disassociate unspliced and spliced mRNA reads {cite}`velo:LaManno2018, velo:Srivastava2019, velo:He2022, velo:Melsted2021`." 
] }, { @@ -81,7 +81,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The first attempt at estimating RNA velocity assumed gene independence and the underlying kinetics to be goverened by the above model. Additionally, it is assumed that (1) kinetics reached their equilibrium, (2) rates are constant, and (3) there is a single, common splicing rate across all genes. In the following, we will refer to this model as the *steady-state model* due to the first assumption. The steady-states itself are found in the upper right corner of the phase portrait (induction phase) and its origin (repression phase). Based on these extreme quantiles, the *steady-state model* estimates the steady-state ratio with a linear regression fit. RNA velocity is then defined as the residual to this fit." + "The first attempt at estimating RNA velocity assumed gene independence and the underlying kinetics to be governed by the above model. Additionally, it is assumed that (1) kinetics reached their equilibrium, (2) rates are constant, and (3) there is a single, common splicing rate across all genes. In the following, we will refer to this model as the *steady-state model* due to the first assumption. The steady-states itself are found in the upper right corner of the phase portrait (induction phase) and its origin (repression phase). Based on these extreme quantiles, the *steady-state model* estimates the steady-state ratio with a linear regression fit. RNA velocity is then defined as the residual to this fit." ] }, { @@ -126,7 +126,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "For a practical example of how RNA velocity can be inferred, we analyze the endocrine development in pancrease {cite}`velo:BastidasPonce2019`. In this system, pre-endocrine cells (*Ductal*, *Ngn3 low EP*, *Ngn3 high EP*, *Pre-endocrine*) develop into four endocrine cell types (*Alpha*, *Beta*, *Delta*, *Epsilon*). Here, we use *scVelo* {cite}`velo:Bergen2020` to infer RNA velocity." 
+ "For a practical example of how RNA velocity can be inferred, we analyze the endocrine development in pancreas {cite}`velo:BastidasPonce2019`. In this system, pre-endocrine cells (*Ductal*, *Ngn3 low EP*, *Ngn3 high EP*, *Pre-endocrine*) develop into four endocrine cell types (*Alpha*, *Beta*, *Delta*, *Epsilon*). Here, we use *scVelo* {cite}`velo:Bergen2020` to infer RNA velocity." ] }, { @@ -283,7 +283,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "In a typical workflow, we would cluster the data, infer cell types, and visualize the data in a two-dimensional embedding. Luckily, for the pancrease data, this information has already been calculated a priori and directly be used." + "In a typical workflow, we would cluster the data, infer cell types, and visualize the data in a two-dimensional embedding. Luckily, for the pancreas data, this information has already been calculated a priori and directly be used." ] }, { @@ -420,7 +420,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "In order to calculate RNA velocity with the *EM model*, the parameters of splicing kinetics need to be infered first. The inference is taken care of by *scVelo*'s `recover_dynamics` function." + "In order to calculate RNA velocity with the *EM model*, the parameters of splicing kinetics need to be inferred first. The inference is taken care of by *scVelo*'s `recover_dynamics` function." ] }, { @@ -467,7 +467,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The parameters of the splicing model are inferred by maximizing a given likelihood. To study which genes were fit most confidently by *scVelo*, we can study the corresponding phase portraits as well as the inferred trajectory (plotted in purple) and steady-state ratio (dashed purple line). Here, three out of the five shown genes (*Pcsk2*, *Top2a*, *Ppp1r1a*) exhibit phase portraits in a (partial) almond shape. 
We observe a clear transition either within a single cell type (*Top2a*, *Ppp1r1a*) or across several cell types (*Pcsk2*, from Pre-endocrine to Alpha and Beta). In the case of *Nfib*, we observe two cellular populations in steady state. This most likely an artifact of undersampling the phenotypic manifold around Ngn3 low/high EP cells. Similary, *Ghrl* is highly expressed in Epsilon cells although only a few due to the small cluster size. While current best practices are limited to analysing model fits and the confidence therein by hand, recently proposed methods can help automate the process (New directions). Here, *Nfib* abd *Ghrl* would be assigned with a lower confidence score." + "The parameters of the splicing model are inferred by maximizing a given likelihood. To study which genes were fit most confidently by *scVelo*, we can study the corresponding phase portraits as well as the inferred trajectory (plotted in purple) and steady-state ratio (dashed purple line). Here, three out of the five shown genes (*Pcsk2*, *Top2a*, *Ppp1r1a*) exhibit phase portraits in a (partial) almond shape. We observe a clear transition either within a single cell type (*Top2a*, *Ppp1r1a*) or across several cell types (*Pcsk2*, from Pre-endocrine to Alpha and Beta). In the case of *Nfib*, we observe two cellular populations in steady state. This is most likely an artifact of undersampling the phenotypic manifold around Ngn3 low/high EP cells. Similarly, *Ghrl* is highly expressed in Epsilon cells although only a few due to the small cluster size. While current best practices are limited to analysing model fits and the confidence therein by hand, recently proposed methods can help automate the process (New directions). Here, *Nfib* and *Ghrl* would be assigned with a lower confidence score." ] }, { @@ -588,8 +588,8 @@ "To understand if RNA velocity analysis is applicable to a given dataset, we remark the following points:\n", "\n", "1. 
To infer RNA velocity, the time scale of the developmental process under investigation must be comparable to the half-life of RNA molecules. This requirement is, for example, met in pancreatic endocrinogenesis {cite}`velo:BastidasPonce2019` but not in long term diseases such as Alzheimer's or Parkinson's disease. Similarly, RNA velocity analysis is not applicable to steady-state systems such as peripheral blood mononuclear cells lacking any transitions between (mature) cell types.\n", - "2. RNA velocity can only be inferred robustly and reliantly if the underlying model assumptions (approximately) hold true. To check the assumptions, the phase portraits can be studied to verify that they exhibit the expected almond shape. If a gene includes multiple, pronounced kinetcs, RNA velocity analysis should be applied with caution and the data possibly subsetted to individual lineages.\n", - "3. Classically, the high-dimensional RNA velocity vectors have been visualized by projecting them onto a low-dimensional representation of the data. This approach for verifying hypotheses can be erronous and misleading as the projecteceted velocity stream is highly dependend on (1) the number of included genes and (2) chosen plotting parameters. Additionally, the projection quality decreases at the boundary of the low dimensional embedding {cite}`velo:LaManno2018`." + "2. RNA velocity can only be inferred robustly and reliably if the underlying model assumptions (approximately) hold true. To check the assumptions, the phase portraits can be studied to verify that they exhibit the expected almond shape. If a gene includes multiple, pronounced kinetics, RNA velocity analysis should be applied with caution and the data possibly subsetted to individual lineages.\n", + "3. Classically, the high-dimensional RNA velocity vectors have been visualized by projecting them onto a low-dimensional representation of the data. 
This approach for verifying hypotheses can be erroneous and misleading as the projected velocity stream is highly dependent on (1) the number of included genes and (2) chosen plotting parameters. Additionally, the projection quality decreases at the boundary of the low dimensional embedding {cite}`velo:LaManno2018`." ] }, { @@ -604,7 +604,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Although RNA velocity has been applied successfully to many systems, some model limitations persist. Violated model assumptions may cause erronous result {cite}`velo:Bergen2021, velo:Barile2021`, and projecting the high dimensional velocity vectors onto a low dimensional representation of the data misleading. To overcome these pitfalls several tools have been developed. CellRank {cite}`velo:Lange2022`, for example, uses the inferred velocity field to infer likely future states of a cell. As the algorithm operates on the higher dimensional representation of the data, misleading velocity streams on embeddings are circumvented. Contrastingly, a recent publication tries to improve the quality of the lower dimensional embedding {cite}`velo:MarotLassauzaie2022`.\n", + "Although RNA velocity has been applied successfully to many systems, some model limitations persist. Violated model assumptions may cause erroneous results {cite}`velo:Bergen2021, velo:Barile2021`, and projecting the high dimensional velocity vectors onto a low dimensional representation of the data can be misleading. To overcome these pitfalls several tools have been developed. CellRank {cite}`velo:Lange2022`, for example, uses the inferred velocity field to infer likely future states of a cell. As the algorithm operates on the higher dimensional representation of the data, misleading velocity streams on embeddings are circumvented. 
Contrastingly, a recent publication tries to improve the quality of the lower dimensional embedding {cite}`velo:MarotLassauzaie2022`.\n", "\n", "To soften current assumptions of RNA velocity inference, several new approaches have been suggested {cite}`velo:Qiao2021, velo:MarotLassauzaie2022, velo:Chen2022, velo:Riba2022, velo:Gu2022, velo:Gu2022-PLMR`, {cite}`velo:Gayoso2022`. For example, these methods try to no longer assume constant rates {cite}`velo:Chen2022, velo:Gu2022-PLMR`, work with raw counts {cite}`velo:Gu2022-PLMR`, or reformulate the inference methods in a variational inference framework to associate uncertainty with estimates {cite}`velo:Gayoso2022`. Additionally, to aid in understanding if RNA velocity analysis can be inferred for individual genes or entire datasets, different procedures have been proposed {cite}`velo:Zheng2022, velo:Gayoso2022`." ]