This project contains a STARR-seq analysis pipeline based on the tools mka, starcode, and drmr. In addition to a template file for mka, the project provides the starrseq module, which must be loaded and which contains helper scripts for running the pipeline.
To run the pipeline on the lab server, you must load the starrseq and starcode modules, and either source /lab/sw/bin/pl or prefix the mka command with MKA_TEMPLATES=/lab/sw/lib/mka/templates. These extra steps are needed because, in its current state, the pipeline is not very general, so its files reside in the lab's directory for custom templates. Regardless, the necessary setup can be handled automatically by adding the following to your .bashrc and restarting the shell:
loadmodule starcode
loadmodule starrseq
. /lab/sw/bin/pl
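If you would rather not edit your .bashrc, the template path can instead be supplied per invocation, as mentioned above; a minimal sketch (the remaining mka arguments are shown later in this document):
MKA_TEMPLATES=/lab/sw/lib/mka/templates mka --analysis-type starr-seq ...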
In order for mka to run correctly, a bit of setup is needed to encode metadata in the file names so that mka knows how to iterate over the files. To do this, write a simple CSV in which each column is a sample and each row lists that sample's FASTQ files. Here is an example CSV file (make sure read 1 comes first and read 3 second):
inputDNA,rep1,rep2,....
inputDNA.read1.fq.gz,rep1.read1.fq.gz,rep2.read1.fq.gz,....
inputDNA.read3.fq.gz,rep1.read3.fq.gz,rep2.read3.fq.gz,....
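If it is more convenient, the metadata file can be written directly from the shell. The snippet below is only an illustration, assuming two replicates and the read-1/read-3 file names shown above:
cat > metadata.csv <<'EOF'
inputDNA,rep1,rep2
inputDNA.read1.fq.gz,rep1.read1.fq.gz,rep2.read1.fq.gz
inputDNA.read3.fq.gz,rep1.read3.fq.gz,rep2.read3.fq.gz
EOF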
Next, run the command:
starr_screname_custom analysis_name /path/to/metadata.csv /path/to/data/directory /path/to/data/destination
Provided the CSV is in the right format, this will create a data directory of symbolic links, named according to mka's conventions, that point to the original data files. Here are example contents of the directory:
lrwxrwxrwx 1 collinwa parkerlab-users 84 Jan 1 18:05 83213_modSTARRseq_121917___inputDNA___L001___inputDNA.1.fq.gz -> /home/ykyono/data/83213_modSTARRseq_121917/scp1-STARRseq_inputDNA_120717.read1.fq.gz
lrwxrwxrwx 1 collinwa parkerlab-users 84 Jan 1 18:05 83213_modSTARRseq_121917___inputDNA___L001___inputDNA.2.fq.gz -> /home/ykyono/data/83213_modSTARRseq_121917/scp1-STARRseq_inputDNA_120717.read3.fq.gz
lrwxrwxrwx 1 collinwa parkerlab-users 85 Jan 1 18:05 83213_modSTARRseq_121917___rep1___L001___rep1.1.fq.gz -> /home/ykyono/data/83213_modSTARRseq_121917/83213_scp1-STARRseq_cDNA_rep_1.read1.fq.gz
lrwxrwxrwx 1 collinwa parkerlab-users 85 Jan 1 18:05 83213_modSTARRseq_121917___rep1___L001___rep1.2.fq.gz -> /home/ykyono/data/83213_modSTARRseq_121917/83213_scp1-STARRseq_cDNA_rep_1.read3.fq.gz
lrwxrwxrwx 1 collinwa parkerlab-users 85 Jan 1 18:05 83213_modSTARRseq_121917___rep2___L001___rep2.1.fq.gz -> /home/ykyono/data/83213_modSTARRseq_121917/83213_scp1-STARRseq_cDNA_rep_2.read1.fq.gz
lrwxrwxrwx 1 collinwa parkerlab-users 85 Jan 1 18:10 83213_modSTARRseq_121917___rep2___L001___rep2.2.fq.gz -> /home/ykyono/data/83213_modSTARRseq_121917/83213_scp1-STARRseq_cDNA_rep_2.read3.fq.gz
lrwxrwxrwx 1 collinwa parkerlab-users 85 Jan 1 18:05 83213_modSTARRseq_121917___rep3___L001___rep3.1.fq.gz -> /home/ykyono/data/83213_modSTARRseq_121917/83213_scp1-STARRseq_cDNA_rep_3.read1.fq.gz
lrwxrwxrwx 1 collinwa parkerlab-users 85 Jan 1 18:05 83213_modSTARRseq_121917___rep3___L001___rep3.2.fq.gz -> /home/ykyono/data/83213_modSTARRseq_121917/83213_scp1-STARRseq_cDNA_rep_3.read3.fq.gz
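Before running mka, it can be worth confirming that none of the links are broken. One quick check with standard GNU find (the destination path is the one passed to the rename script; the command prints nothing if every link resolves):
find /path/to/data/destination -xtype l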
Once the files are renamed, mka is ready to be run. When running mka, make sure that only the FASTQ files containing cDNA and DNA are specified, not sub-assemblies. The pipeline is currently equipped to create count tables for any number of replicates, as long as those replicates have the necessary metadata encoded in their filenames and were placed in /path/to/dna_cdna/fq_files/* by the previous command. Use hg19 as the reference genome.
mka --analysis-type starr-seq --description 'Random STARR-seq pipeline' -a /path/to/analysis /path/to/control /path/to/dna_cdna/fq_files/*
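As a concrete illustration, assuming the symlinked FASTQ files were written to the destination directory used in the rename step above (all paths here are placeholders):
mka --analysis-type starr-seq --description 'modSTARRseq count tables' -a /path/to/analysis /path/to/control /path/to/data/destination/*.fq.gz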
Set IONICE=1 to limit read/write load; the pipeline is very lightweight, so this step is probably unnecessary. To run the pipeline, go to the control directory, which will contain two files: commands and Makefile. From this directory, use the following command to build the pipeline:
make pipeline
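If you do want to throttle I/O, one plausible way to pass the setting, assuming IONICE is read as a make variable or from the environment (check the generated Makefile to confirm which applies):
make pipeline IONICE=1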
Finally, go to the analysis directory and run the command:
drmr pipeline
and the job scheduler will handle the tasks associated with the pipeline. You can check whether the commands have been queued using:
squeue -u YOUR_USERNAME | grep pipeline
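Once the queue has drained, a quick way to look for jobs that did not finish cleanly, assuming Slurm accounting (sacct) is available on the cluster:
sacct -u YOUR_USERNAME --state=FAILED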
The final counts table can be found in
/path/to/analysis/work/count_table/final_counts_table.txt.gz
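Since the table is gzip-compressed text, it can be previewed without decompressing it to disk, for example:
zcat /path/to/analysis/work/count_table/final_counts_table.txt.gz | head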