Developed and maintained by Matthew Galbraith
This pipeline processes high-throughput sequencing data (PE/SE) through QC,trim/filter,alignment,counting etc via sequential stages, with individual samples run in parallel via submission to a Slurm queue.
This specific version was customized to run on HDC Eureka instances (running CentOS 7) on Google Compute Engine and uses a custom Conda environment to supply most of the tools/dependencies required by each stage, with only a few problemmatic tools running from Singularity containers.
BaSH_seq/
├── CutRun
│ └── stageScripts # Cut&Run-specific stage scripts + symbolic links to common stage scripts
├── RNAseq
│ └── stageScripts # RNAseq-specific stage scripts + symbolic links to common stage scripts
├── misc
├── multiqc
├── other_scripts
├── pipeline_templates
└── stageScripts # common stage scripts live here
Bash Slurm Tools required by each stage (in PATH / Conda env / Containers); Conda / Docker/Singularity References and related files MultiQC script (need to update and integrate) R (if integrating RPKM script)
need to download required refs etc need to modify paths to stage scripts, references, etc fastq_screen.conf For Conda version: need to add env.yaml For container version: modify stage scripts to use correct call to each tool or container
FASTQ merging?
- In
Project/
: create top-level working dirs:Project/raw_date
andProject/analysis_date
- In
Project/analysis_date/
create sample_locations.txt (field 1 = SAMPLE_NAME; field 2 = path/to/raw_fastq_file.gz; field 3 = read1 / read2 labels)
See also: SampleInfo.xlsx template - Edit variables in top section (SEE XXX FOR MORE DETAILS eg Human vs Mouse & strandedness) of pipeline script template and save a project-specific version in
Project/analysis_date/scripts
(specify path to this version as indicated below) - From
Project/analysis_date/
run analysis_setup.sh as follows:
sh path/to/analysis_setup.sh <FULL/PATH/TO/PIPELINE_SCRIPT> <START_AT_STAGE> <END_AT_STAGE>
This will create and populateProject/analysis_date/Sample_*
directories with symbolic links to FASTQ files inProject/raw_date
as well as write out 'submit' scripts toProject/analysis_date/scripts
Usually will specify subsets of pipeline stages to allow for QC checks/troubleshooting eg stages 1-3, 4-4, 4-11 - Then from
Project/analysis_date/scripts
run submit scripts as follows (command can be copied from within submit script):
sbatch submitAll_START_AT_STAGE-END_AT_STAGE.sh
If using conda to magae required tools, will need to first activate appropriate env - Monitor progress using
squeue
(see watch alias)
- check pipeline logs
- (if needed) check Sample/Stage logs
- Run MultiQC and inspect report
- (RNAseq only) Gather counts files
- Run temp FASTQ and BAM file cleanup
- (if needed) sync back to main data store