-
Notifications
You must be signed in to change notification settings - Fork 1
/
README.Rmd
230 lines (191 loc) · 7.33 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
---
output: github_document
bibliography: vignettes/references.bib
link-citations: true
nocite: |
@R-cowplot
@R-ggtext
@R-magrittr
@R-ggtext
@R-dplyr
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "75%"
)
# Bioconductor vignette links
link_sce <-
paste0(
"https://bioconductor.org/packages/release/bioc/",
"vignettes/SingleCellExperiment/inst/doc/intro.html"
)
link_se <-
paste0(
"https://bioconductor.org/packages/release/bioc/",
"vignettes/SummarizedExperiment/inst/doc/SummarizedExperiment.html"
)
link_DESeq2 <-
paste0(
"https://bioconductor.org/packages/devel/bioc/",
"vignettes/DESeq2/inst/doc/DESeq2.html"
)
```
# aggregateBioVar
<!-- badges: start -->
<!-- badges: end -->
Single cell RNA sequencing (scRNA-seq) studies allow gene expression
quantification at the level of individual cells.
These studies introduce multiple layers of biological complexity, including
variations in gene expression between cell states within a sample
(*e.g.*, T cells versus macrophages), between samples within a population
(*e.g.*, biological or technical replicates), and between populations
(*e.g.*, healthy versus diseased individuals). Many early scRNA-seq studies
involved analysis of gene expression within cells from a *single sample*.
For single cell RNA-seq data collected from more than one subject,
`aggregateBioVar` provides tools to summarize summarize single cell gene
expression profiles at the level of *samples* (*i.e.*, subjects) or
*populations*. Given an input [`SingleCellExperiment`](`r link_sce`) object
[@SingleCellExperiment2020] with pre-defined cell states, `aggregateBioVar()`
stratifies data as a list of [`SummarizedExperiment`](`r link_se`) objects
[@R-SummarizedExperiment].
For each cell type, gene counts are aggregated by subject into a
**gene-by-subject** count matrix, and column metadata are summarized to retain
*inter-subject* variation for downstream analysis with bulk RNA-seq tools.
## Installation
<!-- # TODO Replace with BiocManager::install() -->
<!-- You can install the released version of aggregateBioVar from -->
<!-- [CRAN](https://CRAN.R-project.org) with: -->
<!-- ``` r -->
<!-- install.packages("aggregateBioVar") -->
<!-- ``` -->
Install the development version of `aggregateBioVar` from
[GitHub](https://github.com/) with:
```r
# install.packages("devtools")
devtools::install_github("jasonratcliff/aggregateBioVar", build_vignettes=TRUE)
```
## Multi-subject scRNA-seq
```{r example, message=FALSE}
library(aggregateBioVar)
# Bioconductor Packages
library(SummarizedExperiment, quietly = TRUE)
library(SingleCellExperiment, quietly = TRUE)
library(DESeq2, quietly = TRUE)
# Data analysis and visualization
library(dplyr, quietly = TRUE)
library(magrittr, quietly = TRUE)
library(ggplot2, quietly = TRUE)
library(ggtext, quietly = TRUE)
```
```{r sceMetadata, echo=FALSE}
sce_samples <- sort(unique(small_airway$orig.ident))
sce_genotypes <- list(
"Wildtype" = grep("WT", sce_samples, value = TRUE),
"CFTRKO" = grep("CF", sce_samples, value = TRUE)
)
```
To illustrate the utility of *biological replication* for scRNA-seq
sequencing experiments, consider a `SingleCellExperiment` object with
scRNA-seq data from 7 subjects (`r sce_samples`) in the context of a cystic
fibrosis phenotype. Samples were collected from small airway epithelium of
newborn *Sus scrofa* with genotypes from wild type
(**CFTR+/+**, n=`r length(sce_genotypes$Wildtype)`) and *CFTR*-knockout
(**CFTR-/-**, n=`r length(sce_genotypes$CFTRKO)`) individuals.
Note the dimensions of this object, with `r nrow(small_airway)` genes from
`r ncol(small_airway)` cells:
```{r singleCellExperiment}
small_airway
```
The primary function `aggregateBioVar()` takes a `SingleCellExperiment`
object with column metadata variables indicating subject identity
(*e.g.*, biological sample; `subjectVar`) and assigned cell type (`cellVar`).
The column metadata of a `SingleCellExperiment` object can be obtained by
`SummarizedExperiment::colData()`.
Here, the metadata variable `orig.ident` indicates the biological sample
identifier and `celltype` the inferred cell type.
```{r aggregateBioVar}
# Perform aggregation of counts and metadata by subject and cell type.
aggregate_counts <-
aggregateBioVar(
scExp = small_airway,
subjectVar = "orig.ident", cellVar = "celltype"
)
```
Each element of the returned list contains a `SummarizedExperiment` object with
aggregated counts from cells in the assigned cell type (indicated by `cellVar`).
```{r aggregateList}
aggregate_counts
```
For each cell type subset, within-subject gene counts are aggregated and column
metadata are summarized to exclude variables with *intercellular* variation.
This effectively retains subject metadata and can be used for downstream
analysis with bulk RNA-seq tools. After aggregation, the number of columns
in the `SingleCellExperiment` object matches the number of unique values in
the subject metadata variable indicated by `subjectVar`.
```{r aggregateCounts}
assay(aggregate_counts$`Immune cell`, "counts")
```
```{r aggregateMetadata}
colData(aggregate_counts$`Immune cell`)
```
### Differential Gene Expression
The aggregate gene-by-subject matrix and subject metadata can be used as inputs
for bulk RNA-seq tools to investigate gene expression. Here, an example is
provided using [`DESeq2`](`r link_DESeq2`) [@DESeq2].
A `DESeqDataSet` can be constructed from the aggregate gene-by-subject
count matrix and summarized column metadata.
```{r DESeq2}
subj_dds_dataset <-
DESeqDataSetFromMatrix(
countData = assay(aggregate_counts$`Secretory cell`, "counts"),
colData = colData(aggregate_counts$`Secretory cell`),
design = ~ Genotype
)
subj_dds <- DESeq(subj_dds_dataset)
subj_dds_results <-
results(subj_dds, contrast = c("Genotype", "WT", "CFTRKO"))
```
Add negative log~10~ adjusted P-values, then plot against log~2~ fold change.
Genes with adjusted P-values < 0.05 and fold-change absolute values > 1.0 are
highlighted in red and labeled by feature.
```{r volcanoPlot}
subj_dds_transf <- as.data.frame(subj_dds_results) %>%
bind_cols(feature = rownames(subj_dds_results)) %>%
mutate(log_padj = - log(.data$padj, base = 10))
ggplot(data = subj_dds_transf) +
geom_point(aes(x = log2FoldChange, y = log_padj), na.rm = TRUE) +
geom_point(
data = filter(
.data = subj_dds_transf,
abs(.data$log2FoldChange) > 1, .data$padj < 0.05
),
aes(x = log2FoldChange, y = log_padj), color = "red"
) +
geom_label(
data = filter(
.data = subj_dds_transf,
abs(.data$log2FoldChange) > 1, .data$padj < 0.05
),
aes(x = log2FoldChange, y = log_padj + 0.4, label = feature)
) +
theme_classic() +
labs(
x = "log<sub>2</sub> (fold change)",
y = "-log<sub>10</sub> (p<sub>adj</sub>)"
) +
theme(
axis.title.x = element_markdown(),
axis.title.y = element_markdown())
```
## Vignettes
For a detailed workflow and description of package components,
see the package vignette:
```{r vignette, eval=FALSE}
vignette("multi-subject-scRNA-seq", package = "aggregateBioVar")
```
## References
<div id="refs"></div>