README.Rmd

---
output: github_document
bibliography: vignettes/references.bib
link-citations: true
nocite: |
    @R-cowplot
    @R-ggtext
    @R-magrittr
    @R-ggtext
    @R-dplyr
---

<!-- README.md is generated from README.Rmd. Please edit that file -->

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "75%"
)

# Bioconductor vignette links
link_sce <-
    paste0(
        "https://bioconductor.org/packages/release/bioc/",
        "vignettes/SingleCellExperiment/inst/doc/intro.html"
    )
link_se <-
    paste0(
        "https://bioconductor.org/packages/release/bioc/",
        "vignettes/SummarizedExperiment/inst/doc/SummarizedExperiment.html"
    )
link_DESeq2 <-
    paste0(
        "https://bioconductor.org/packages/devel/bioc/",
        "vignettes/DESeq2/inst/doc/DESeq2.html"
    )
```

# aggregateBioVar

<!-- badges: start -->
<!-- badges: end -->

Single cell RNA sequencing (scRNA-seq) studies allow gene expression
quantification at the level of individual cells.
These studies introduce multiple layers of biological complexity, including
variations in gene expression between cell states within a sample
(*e.g.*, T cells versus macrophages), between samples within a population
(*e.g.*, biological or technical replicates), and between populations
(*e.g.*, healthy versus diseased individuals). Many early scRNA-seq studies
involved analysis of gene expression within cells from a *single sample*.
For single cell RNA-seq data collected from more than one subject,
`aggregateBioVar` provides tools to summarize summarize single cell gene
expression profiles at the level of *samples* (*i.e.*, subjects) or
*populations*. Given an input [`SingleCellExperiment`](`r link_sce`) object
[@SingleCellExperiment2020] with pre-defined cell states, `aggregateBioVar()`
stratifies data as a list of [`SummarizedExperiment`](`r link_se`) objects
[@R-SummarizedExperiment].
For each cell type, gene counts are aggregated by subject into a
**gene-by-subject** count matrix, and column metadata are summarized to retain 
*inter-subject* variation for downstream analysis with bulk RNA-seq tools.

## Installation

<!-- # TODO Replace with BiocManager::install() -->
<!-- You can install the released version of aggregateBioVar from -->
<!-- [CRAN](https://CRAN.R-project.org) with: -->

<!-- ``` r -->
<!-- install.packages("aggregateBioVar") -->
<!-- ``` -->

Install the development version of `aggregateBioVar` from
[GitHub](https://github.com/) with:

```r
# install.packages("devtools")
devtools::install_github("jasonratcliff/aggregateBioVar", build_vignettes=TRUE)
```

## Multi-subject scRNA-seq

```{r example, message=FALSE}
library(aggregateBioVar)

# Bioconductor Packages
library(SummarizedExperiment, quietly = TRUE)
library(SingleCellExperiment, quietly = TRUE)
library(DESeq2, quietly = TRUE)

# Data analysis and visualization
library(dplyr, quietly = TRUE)
library(magrittr, quietly = TRUE)
library(ggplot2, quietly = TRUE)
library(ggtext, quietly = TRUE)
```

```{r sceMetadata, echo=FALSE}
sce_samples <- sort(unique(small_airway$orig.ident))
sce_genotypes <- list(
    "Wildtype" = grep("WT", sce_samples, value = TRUE),
    "CFTRKO" = grep("CF", sce_samples, value = TRUE)
)
```

To illustrate the utility of *biological replication* for scRNA-seq
sequencing experiments, consider a `SingleCellExperiment` object with
scRNA-seq data from 7 subjects (`r sce_samples`) in the context of a cystic
fibrosis phenotype. Samples were collected from small airway epithelium of
newborn *Sus scrofa* with genotypes from wild type
(**CFTR+/+**, n=`r length(sce_genotypes$Wildtype)`) and *CFTR*-knockout
(**CFTR-/-**, n=`r length(sce_genotypes$CFTRKO)`) individuals.
Note the dimensions of this object, with `r nrow(small_airway)` genes from
`r ncol(small_airway)` cells:

```{r singleCellExperiment}
small_airway
```

The primary function `aggregateBioVar()` takes a `SingleCellExperiment`
object with column metadata variables indicating subject identity
(*e.g.*, biological sample; `subjectVar`) and assigned cell type (`cellVar`).
The column metadata of a `SingleCellExperiment` object can be obtained by
`SummarizedExperiment::colData()`.
Here, the metadata variable `orig.ident` indicates the biological sample
identifier and `celltype` the inferred cell type.

```{r aggregateBioVar}
# Perform aggregation of counts and metadata by subject and cell type.
aggregate_counts <-
    aggregateBioVar(
        scExp = small_airway,
        subjectVar = "orig.ident", cellVar = "celltype"
    )
```

Each element of the returned list contains a `SummarizedExperiment` object with
aggregated counts from cells in the assigned cell type (indicated by `cellVar`).

```{r aggregateList}
aggregate_counts
```

For each cell type subset, within-subject gene counts are aggregated and column
metadata are summarized to exclude variables with *intercellular* variation.
This effectively retains subject metadata and can be used for downstream
analysis with bulk RNA-seq tools. After aggregation, the number of columns
in the `SingleCellExperiment` object matches the number of unique values in
the subject metadata variable indicated by `subjectVar`.

```{r aggregateCounts}
assay(aggregate_counts$`Immune cell`, "counts")
```

```{r aggregateMetadata}
colData(aggregate_counts$`Immune cell`)
```

### Differential Gene Expression

The aggregate gene-by-subject matrix and subject metadata can be used as inputs
for bulk RNA-seq tools  to investigate gene expression. Here, an example is
provided using [`DESeq2`](`r link_DESeq2`) [@DESeq2].
A `DESeqDataSet` can be constructed from the aggregate gene-by-subject
count matrix and summarized column metadata.

```{r DESeq2}
subj_dds_dataset <-
    DESeqDataSetFromMatrix(
        countData = assay(aggregate_counts$`Secretory cell`, "counts"),
        colData = colData(aggregate_counts$`Secretory cell`),
        design = ~ Genotype
    )

subj_dds <- DESeq(subj_dds_dataset)

subj_dds_results <-
    results(subj_dds, contrast = c("Genotype", "WT", "CFTRKO"))
```

Add negative log~10~ adjusted P-values, then plot against log~2~ fold change.
Genes with adjusted P-values < 0.05 and fold-change absolute values > 1.0 are
highlighted in red and labeled by feature.

```{r volcanoPlot}
subj_dds_transf <- as.data.frame(subj_dds_results) %>%
    bind_cols(feature = rownames(subj_dds_results)) %>%
    mutate(log_padj = - log(.data$padj, base = 10))

ggplot(data = subj_dds_transf) +
    geom_point(aes(x = log2FoldChange, y = log_padj), na.rm = TRUE) +
    geom_point(
        data = filter(
            .data = subj_dds_transf,
            abs(.data$log2FoldChange) > 1, .data$padj < 0.05
        ),
        aes(x = log2FoldChange, y = log_padj), color = "red"
    ) +
    geom_label(
        data = filter(
            .data = subj_dds_transf,
            abs(.data$log2FoldChange) > 1, .data$padj < 0.05
        ),
        aes(x = log2FoldChange, y = log_padj + 0.4, label = feature)
    ) +
    theme_classic() +
    labs(
        x = "log<sub>2</sub> (fold change)",
        y = "-log<sub>10</sub> (p<sub>adj</sub>)"
    ) +
    theme(
        axis.title.x = element_markdown(),
        axis.title.y = element_markdown())

```

## Vignettes

For a detailed workflow and description of package components,
see the package vignette:

```{r vignette, eval=FALSE}
vignette("multi-subject-scRNA-seq", package = "aggregateBioVar")
```

## References

<div id="refs"></div>