MiniMarS finds the protein markers that best define biological clusters, using a number of different pre-existing methods for marker selection.
Please install the following packages first.
# CiteFuse
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("CiteFuse")
devtools::install_github("tpq/propr") # propr package required for CiteFuse to run
# sc2marker
if (!require("devtools", quietly = TRUE))
install.packages("devtools")
devtools::install_github("https://github.com/CostaLab/sc2marker", build_vignettes = TRUE)
# geneBasisR
devtools::install_github("MarioniLab/geneBasisR")
# Seurat
install.packages('Seurat')
# xgboost
install.packages("xgboost")
# SingleCellExperiment
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("SingleCellExperiment")
# dplyr
install.packages("dplyr")
Please run the following to install the MiniMarS
package from the Development branch:
devtools::install_github("raymondlouie/MiniMarS", ref = "Dev")
or download the package here and install it using the following command
install.packages("~/Downloads/MiniMarS_0.3.0.tar.gz", type = "source", repos = NULL)
# Check to see if all the packages are installed.
packages_required = c("CiteFuse","sc2marker","geneBasisR","xgboost","dplyr","MiniMarS")
packages_required_not_installed=setdiff(packages_required, rownames(installed.packages()))
if (length(packages_required_not_installed)>0){
stop(paste0("Please install packages ",packages_required_not_installed,"\n"))
}
library(MiniMarS)
library(dplyr)
library(SingleCellExperiment)
The input data can be i) an SCE object, ii) a matrix of features and a vector of cell type annotations, or iii) a Seurat object.
# Load the data
sc_in = processInputFormat(sc_object = sce,
sce_cluster = "cell_type", #pre-defined in the SCE object
verbose = TRUE)
ii) If you have a Feature matrix (feature x cell) and a vector of cell type annotations for each cell (the length of this vector should be the same as the number of columns of the input_matrix):
input_matrix = object@assays@data$counts
clusters = object$cell_type
sc_in = processInputFormat(sc_object = input_matrix,
clusters_all = clusters,
verbose = TRUE)
library(Seurat)
# Set identity class for your Seurat object to the column that includes the cell type annotations for each cell.
Idents(seurat_object) <- seurat_object$cell_type
clusters <- Idents(object = seurat_object)
sc_in = processInputFormat(sc_object = seurat_object,
verbose=TRUE)
# If 'clusters_sel' is not defined, then the default is to use all clusters.
clusters_sel = c("CD4-positive, alpha-beta memory T cell",
"naive thymus-derived CD8-positive, alpha-beta T cell")
cluster_selection_out= processClusterSelection(sc_in,
clusters_sel = clusters_sel,
verbose = TRUE)
final_out = processSubsampling(cluster_selection_out,
subsample_num = 1000,
cluster_proportion = "proportional",
verbose = TRUE,
seed = 8)
Several methods implemented to find markers for identifying the clusters using the method
argument: "citeFUSE", "sc2marker", "geneBasis", "xgBoost", "fstat", "seurat_wilcox", "seurat_bimod", "seurat_roc", "seurat_t", "seurat_LR", "consensus_weighted", "consensus_naive", "consensus_fstat", and "consensus_xgboost". The default option is to use "all" methods. The methods with "consensus_" naming return an integration of results from several methods run.
list_markers_time = findClusterMarkers(final_out,
num_markers = 15,
method = "all",
verbose = TRUE)
list_time = list_markers_time$runtime_secs
names(list_time) = names(list_markers_time)[which(!(names(list_markers_time) %in% c("consensus",
"runtime_secs")))]
list_markers = list_markers_time[which(!(names(list_markers_time) %in% c("runtime_secs")))]
There are two methods implemented to assess the performance of each marker selection method using the method
argument: "xgBoost" and "geneBasis". The default option is to use "xgBoost".
list_performance = performanceAllMarkers(list_markers,
final_out = final_out,
method = "xgBoost",
nrounds = 1500,
nthread = 6,
testSet = "test",
verbose = TRUE)
library(ggplot2)
plotMarkers(list_markers)
plotPerformance(list_performance)
## Visualisation
library(RColorBrewer)
plotExpression(list_markers,
sc_in,
plot_type="violin")
plotExpression(list_markers,
sc_in,
plot_type="umap")
sessionInfo()
[1] sp_1.4-6 SeuratObject_4.1.0 Seurat_4.1.1.9002 SingleCellExperiment_1.16.0
[5] SummarizedExperiment_1.24.0 Biobase_2.54.0 GenomicRanges_1.46.1 GenomeInfoDb_1.30.0
[9] IRanges_2.28.0 S4Vectors_0.32.3 BiocGenerics_0.40.0 MatrixGenerics_1.6.0
[13] matrixStats_0.61.0 dplyr_1.0.7 MiniMarS_0.1.0
Under some scenarios, the users may want to test their own customised marker panel instead of the predicted ones. We recommend using the following codes to evaluate the performance of the user's customised marker input.
user_markers = c("CD80", "CD86", "CD274", "CD273", "CD275", "CD11b", "CD137L", "CD70", "unidentified_marker1","unidentified_marker2")
own_list_performance = performanceOwnMarkers(user_markers,
final_out = final_out,
method = "all",
nrounds = 1500,
nthread = 6,
verbose = TRUE)
print(own_list_performance)
We provide below public datasets for users to try the MiniMarS
package. The datasets all contain protein features as well as either cell annotation or cluster labels from the corresponding papers or websites.
Human bone marrow samples from the BD Rhapsody assay.
Number of cells in this dataset: around 100,000 (after QC: 49,100 from healthy controls and 31,600 from leukemia patients)
Number of protein features: 97
43 cell types/clusters from the data provider
Click here to access the data
Human mucosa-associated lymphoid tissue samples from the 10X technology.
Number of cells in this dataset: around 10,000 (after QC: 8,500)
Number of protein features: 17
11 cell types/clusters from the data provider
Click here to access the data
Mouse spleen and lymph nodes samples from the 10X technology.
Number of cells in this dataset: around 40,000 (after QC: 16,800 with 111 protein features, 15,800 with 206 protein features)
Number of protein features: 111 and 206 (two sets of data) - need to find the name of each protein feature (16 Mar 2023)
Both sets, 35 cell types/clusters from the data provider
Click here to access the raw data
Click here to access the processed data
Human PBMC (peripheral blood mononuclear cells) samples from the 10X technology.
Number of cells in this dataset: around 10,000 (after QC: 7,800)
Number of protein features: 17
11 cell types/clusters from the data provider
Click here to access the data