From be132b56781bede5dc6e020aa80ca315546666cd Mon Sep 17 00:00:00 2001
From: y4nnick8 <151630397+y4nnick8@users.noreply.github.com>
Date: Thu, 16 May 2024 14:53:22 +0200
Subject: [PATCH] Add Snapatac2 wrappers (#5740)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* Add Snapatac2 wrappers
* add test data
* updated tools
* Fix some tests
* update tools
* update
* added metrics.tsse to preprocessing.xml
* tests tutorial1
* added tl.umap test
* fix tl.umap by setting NUMBA_CACHE_DIR variable
* add .shed.yaml
* add location for large files
* delete test data larger than 1MB and remove checksum attribute for params
* fix tool category in shed.yml
* formatting and another try to test location attribute
* add ftype
* review and correct preprocessing params, tests need to be added
* review and correct preprocessing params
* update preprocessing tests with new data
* add more tests for snap.tl functions
* Remove unnecessary functions and change tool naming
* remove macs3 and multiprocess dependencies
* remove ipynb
* read anndata into memory to overcome anndata metadata problem
* Use test data from Zenodo
* update the help text and fix the last test
* add two more dbscan clustering methods and fix njobs param
* add dbscan help
* some fixes and styles
* Update plotting.xml
* Update dimension_reduction_clustering.xml
* Update preprocessing.xml

---------

Co-authored-by: Pavankumar Videm
Co-authored-by: Björn Grüning
---
 tools/snapatac2/.shed.yml                     |  28 +
 .../dimension_reduction_clustering.xml        | 579 +++++++++++++++++
 tools/snapatac2/macros.xml                    | 187 ++++++
 tools/snapatac2/plotting.xml                  | 229 +++++++
 tools/snapatac2/preprocessing.xml             | 580 ++++++++++++++++++
 5 files changed, 1603 insertions(+)
 create mode 100644 tools/snapatac2/.shed.yml
 create mode 100644 tools/snapatac2/dimension_reduction_clustering.xml
 create mode 100644 tools/snapatac2/macros.xml
 create mode 100644 tools/snapatac2/plotting.xml
 create mode 100644 tools/snapatac2/preprocessing.xml

diff --git a/tools/snapatac2/.shed.yml b/tools/snapatac2/.shed.yml
new file mode 100644
index 00000000000..39feccd03f8
--- /dev/null
+++ b/tools/snapatac2/.shed.yml
@@ -0,0 +1,28 @@
+name: snapatac2
+owner: iuc
+description: "SnapATAC2 – A Python/Rust package for single-cell epigenomics analysis"
+homepage_url: https://kzhang.org/SnapATAC2/
+long_description: |
+  SnapATAC2 is a flexible, versatile, and scalable single-cell omics analysis framework.
+
+remote_repository_url: https://github.com/galaxyproject/tools-iuc/tree/master/tools/snapatac2
+type: unrestricted
+categories:
+- Epigenetics
+- Sequence Analysis
+auto_tool_repositories:
+  name_template: "{{ tool_id }}"
+  description_template: "Wrapper for the snapatac2 tool suite: {{ tool_name }}"
+suite:
+  name: "suite_snapatac2"
+  description: "SnapATAC2 – A Python/Rust package for single-cell epigenomics analysis"
+  long_description: |
+    SnapATAC2 is a flexible, versatile, and scalable single-cell omics analysis framework, featuring:
+
+    * Scale to more than 10 million cells.
+    * Blazingly fast preprocessing tools for BAM to fragment files conversion and count matrix generation.
+    * Matrix-free spectral embedding algorithm that is applicable to a wide range of single-cell omics data, including single-cell ATAC-seq, single-cell RNA-seq, single-cell Hi-C, and single-cell methylation.
+    * Efficient and scalable co-embedding algorithm for single-cell multi-omics data integration.
+    * End-to-end analysis pipeline for single-cell ATAC-seq data, including preprocessing, dimension reduction, clustering, data integration, peak calling, differential analysis, motif analysis, regulatory network analysis.
+    * Seamless integration with other single-cell analysis packages such as Scanpy.
+    * Implementation of fully backed AnnData.
\ No newline at end of file diff --git a/tools/snapatac2/dimension_reduction_clustering.xml b/tools/snapatac2/dimension_reduction_clustering.xml new file mode 100644 index 00000000000..26691c616be --- /dev/null +++ b/tools/snapatac2/dimension_reduction_clustering.xml @@ -0,0 +1,579 @@ + + and dimension reduction + + macros.xml + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + advanced_common['show_log'] + + + method['method'] and 'tl.diff_test' in method['method'] + + + + + + + + + + + + + + +
+ +
+ + + + + + + + + + + +
+ + + + + + + + + + +
+ +
+ + + + + + + + + + +
+ + + + + + + + + + + +
+ +
+ + + + + + + + + + + +
+ + + + + + + + + + + + + +
+ +
+ + + + + + + + + + + + + +
+ + + + + + + + + + +
+ +
+ + + + + + + + + + +
+ + + + + + + + + + + +
+ +
+ + + + + + + + + + + +
+ + + + + + + + + + + + +
+ +
+ + + + + + + + + + + + +
+ + + + + + + +
+ +
+ + + + + + + +
+ + + + + + + + + + + +
+ +
+ + + + + + + + + + + +
+
+ `__
+
+Compute UMAP, using `tl.umap`
+=============================
+
+Compute UMAP.
+
+More details on the `SnapATAC2 documentation
+`__
+
+Compute a neighborhood graph of observations, using `pp.knn`
+============================================================
+
+Compute a neighborhood graph of observations.
+
+Computes a neighborhood graph of observations stored in `adata` using the method specified by `method`. The distance metric used is Euclidean.
+
+More details on the `SnapATAC2 documentation
+`__
+
+Cluster cells into subgroups, using `tl.leiden`
+===============================================
+
+Cluster cells into subgroups.
+
+Cluster cells using the Leiden algorithm, an improved version of the Louvain algorithm. It has been proposed for single-cell analysis. This requires having run `knn`.
+
+More details on the `SnapATAC2 documentation
+`__
+
+Cluster cells into subgroups using the K-means algorithm, using `tl.kmeans`
+===========================================================================
+
+Cluster cells into subgroups using the K-means algorithm, a classical algorithm in data mining.
+
+More details on the `SnapATAC2 documentation
+`__
+
+Cluster cells into subgroups using the DBSCAN algorithm, using `tl.dbscan`
+==========================================================================
+
+Cluster cells into subgroups using the DBSCAN algorithm.
+
+More details on the `SnapATAC2 documentation
+`__
+
+Cluster cells into subgroups using the HDBSCAN algorithm, using `tl.hdbscan`
+============================================================================
+
+Cluster cells into subgroups using the HDBSCAN algorithm.
+
+More details on the `SnapATAC2 documentation
+`__
+
+Aggregate values in adata.X in a row-wise fashion, using `tl.aggregate_X`
+=========================================================================
+
+Aggregate values in adata.X in a row-wise fashion.
+
+Aggregate values in adata.X in a row-wise fashion. This is used to compute RPKM or RPM values stratified by user-provided groupings.
+
+More details on the `SnapATAC2 documentation
+`__
+
+Aggregate cells into pseudo-cells, using `tl.aggregate_cells`
+=============================================================
+
+Aggregate cells into pseudo-cells.
+
+Aggregate cells into pseudo-cells by iterative clustering.
+
+More details on the `SnapATAC2 documentation
+`__
+ ]]>
+
+
diff --git a/tools/snapatac2/macros.xml b/tools/snapatac2/macros.xml new file mode 100644 index 00000000000..2e34577fc17 --- /dev/null +++ b/tools/snapatac2/macros.xml @@ -0,0 +1,187 @@ + + 2.5.3 + 0 + 23.0 + + snapatac2 + plotly + python-kaleido + polars + pyarrow + python-igraph + hdbscan + harmonypy + scanorama + + + + + + + '$hidden_output' && + python '$script_file' >> '$hidden_output' && + touch 'anndata_info.txt' && + cat 'anndata_info.txt' @CMD_prettify_stdout@ + ]]> + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ +
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + +s + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 10.1038/s41592-023-02139-9 + + + + + + + + + + + + + + + + + + + + + + + +
diff --git a/tools/snapatac2/plotting.xml b/tools/snapatac2/plotting.xml new file mode 100644 index 00000000000..d7edfaea346 --- /dev/null +++ b/tools/snapatac2/plotting.xml @@ -0,0 +1,229 @@ + + + macros.xml + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + method['out_file'] == 'png' + + + method['out_file'] == 'pdf' + + + method['out_file'] == 'svg' + + + + + + + + + + + + + +
+ +
+ + + + + + + + +
+ + + + + + + + + +
+ +
+ + + + + + + + +
+ + + + + + + + + + + + + +
+ +
+ + + + + + + + + + + + +
+ + + + + + + + +
+ +
+ + + + + + + +
+
+ `__
+
+Plot the UMAP embedding, using `pl.umap`
+========================================
+
+Plot the UMAP embedding.
+
+More details on the `SnapATAC2 documentation
+`__
+
+Plot the eigenvalues of spectral embedding, using `pl.spectral_eigenvalues`
+===========================================================================
+
+Plot the eigenvalues of spectral embedding.
+
+More details on the `SnapATAC2 documentation
+`__
+ ]]>
+
+
diff --git a/tools/snapatac2/preprocessing.xml b/tools/snapatac2/preprocessing.xml new file mode 100644 index 00000000000..ed898a497c6 --- /dev/null +++ b/tools/snapatac2/preprocessing.xml @@ -0,0 +1,580 @@ + + and integration + + macros.xml + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + method['method'] == 'pp.make_fragment_file' + + + method['method'] != 'pp.make_fragment_file' + + + advanced_common['show_log'] + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ +
+ + + + + + + + + + + + +
+
+ `__
+
+Import data fragment files and compute basic QC metrics, using `pp.import_data`
+===============================================================================
+
+Import data fragment files and compute basic QC metrics.
+
+A fragment refers to the sequence data originating from a distinct location in the genome. In single-ended sequencing, one read equates to a fragment. However, in paired-ended sequencing, a fragment is defined by a pair of reads. This function is designed to handle, store, and process input files with fragment data, further yielding a range of basic Quality Control (QC) metrics. These metrics include the total number of unique fragments, duplication rates, and the percentage of mitochondrial DNA detected.
+
+How fragments are stored is dependent on the sequencing approach utilized. For single-ended sequencing, fragments are found in `.obsm['fragment_single']`. In contrast, for paired-ended sequencing, they are located in `.obsm['fragment_paired']`.
+
+More details on the `SnapATAC2 documentation
+`__
+
+Generate cell by bin count matrix, using `pp.add_tile_matrix`
+=============================================================
+
+Generate cell by bin count matrix.
+
+This function is used to generate and add a cell by bin count matrix to the AnnData object.
+
+`import_data` must be run first in order to use this function.
+
+More details on the `SnapATAC2 documentation
+`__
+
+Generate cell by gene activity matrix, using `pp.make_gene_matrix`
+==================================================================
+
+Generate cell by gene activity matrix.
+
+Generate cell by gene activity matrix by counting the Tn5 insertions in gene body regions. The result will be stored in a new file and a new AnnData object will be created.
+
+`import_data` must be run first in order to use this function.
+
+More details on the `SnapATAC2 documentation
+`__
+
+Filter cell outliers based on counts and numbers of genes expressed, using `pp.filter_cells`
+============================================================================================
+
+Filter cell outliers based on counts and numbers of genes expressed. For instance, only keep cells with at least `min_counts` counts or `min_tsse` TSS enrichment scores. This is to filter measurement outliers, i.e. “unreliable” observations.
+
+More details on the `SnapATAC2 documentation
+`__
+
+Perform feature selection, using `pp.select_features`
+=====================================================
+
+Perform feature selection by selecting the most accessible features across all cells unless `max_iter` > 1.
+
+More details on the `SnapATAC2 documentation
+`__
+
+Compute probability of being a doublet using the scrublet algorithm, using `pp.scrublet`
+========================================================================================
+
+Compute probability of being a doublet using the scrublet algorithm.
+
+This function identifies doublets by generating simulated doublets by randomly pairing chromatin accessibility profiles of individual cells. The simulated doublets are then embedded alongside the original cells using the spectral embedding algorithm in this package. A k-nearest-neighbor classifier is trained to distinguish between the simulated doublets and the authentic cells. This trained classifier produces a “doublet score” for each cell. The doublet scores are then converted into probabilities using a Gaussian mixture model.
+
+More details on the `SnapATAC2 documentation
+`__
+
+Remove doublets according to the doublet probability or doublet score, using `pp.filter_doublets`
+=================================================================================================
+
+Remove doublets according to the doublet probability or doublet score.
+
+The user can choose to remove doublets by either the doublet probability or the doublet score. `scrublet` must be run first in order to use this function.
+
+More details on the `SnapATAC2 documentation
+`__
+
+A modified MNN-Correct algorithm based on cluster centroid, using `pp.mnc_correct`
+==================================================================================
+
+A modified MNN-Correct algorithm based on cluster centroid.
+
+More details on the `SnapATAC2 documentation
+`__
+
+Use harmonypy to integrate different experiments, using `pp.harmony`
+=====================================================================
+
+Use harmonypy to integrate different experiments.
+
+Harmony is an algorithm for integrating single-cell data from multiple experiments. This function uses the Python port of Harmony, `harmonypy`, to integrate single-cell data stored in an AnnData object. This function should be run after performing dimension reduction.
+
+More details on the `SnapATAC2 documentation
+`__
+
+Use Scanorama to integrate different experiments, using `pp.scanorama_integrate`
+=================================================================================
+
+Use Scanorama to integrate different experiments.
+
+Scanorama is an algorithm for integrating single-cell data from multiple experiments stored in an AnnData object. This function should be run after performing `tl.spectral` but before computing the neighbor graph.
+
+More details on the `SnapATAC2 documentation
+`__
+
+Compute the fragment size distribution of the dataset, using `metrics.frag_size_distr`
+======================================================================================
+
+Compute the fragment size distribution of the dataset.
+
+This function computes the fragment size distribution of the dataset. Note that it does not operate at the single-cell level. The result is stored in a vector where each element represents the number of fragments and the index represents the fragment length. The first position of the vector is reserved for fragments with size larger than the `max_recorded_size` parameter. `import_data` must be run first in order to use this function.
+
+More details on the `SnapATAC2 documentation
+`__
+
+Compute the TSS enrichment score (TSSe) for each cell, using `metrics.tsse`
+===========================================================================
+
+Compute the TSS enrichment score (TSSe) for each cell.
+
+`import_data` must be run first in order to use this function.
+
+More details on the `SnapATAC2 documentation
+`__
+
+ ]]>
+
+