Created by: Bercin Cenik Date: June 5, 2023 2:21 PM
Overview: this is a set of instructions for creating merged fastq files and compressing minibam files for ONT long-read sequencing runs, written for NGS technicians in the Shilatifard lab.
Ensure that when creating the experiments, you use the following directory structure:
CHACHAx
CHACHA_Pm
CHACHA_Pn
…
where x is the CHACHA# and m and n are the barcode number. Always save experiments in data_offload.
Example: as part of CHACHA5, there are 5 pools, each containing 4 barcoded samples, being run on 5 separate flow cells. This folder structure would be:
- CHACHA5
-
CHACHA5_P1
-
fastq_pass
-
barcode01
*.fastq.gz
-
barcode02
*.fastq.gz
-
barcode03
*.fastq.gz
-
barcode04
*.fastq.gz
-
-
bam_pass
-
barcode01
*.bam
-
barcode02
*.bam
-
barcode03
*.bam
-
barcode04
*.bam
-
-
-
CHACHA5_P2
fastq_pass..
bam_pass..
-
Create a .csv file in excel or a text editor, in the following format:
pool,firstbarcode,lastbarcode
1,01,04
2,05,08
3,09,12
4,13,16
5,17,20
This looks like a table similar to the following:
pool | firstbarcode | lastbarcode |
---|---|---|
1 | 01 | 04 |
2 | 05 | 08 |
3 | 09 | 12 |
4 | 13 | 16 |
5 | 17 | 20 |
Save this as CHACHAx.csv and move it to the experiment directory.
Copy make.merged.fastq.sh
and make.tar.sh
into the experiment directory as well.
make.merged.fastq.sh
#!/bin/bash
# make merged fastq scripts: make.merged.fastq.sh
# written by: Bercin Cenik
#### instructions
# create a CSV file with CHACHA number, e.g., CHACHAx.csv, save it in /data/data_offload/CHACHAx/
# header of CSV file should be: pool,firstbarcode,lastbarcode
# then enter the pool number and barcodes in corresponding rows. eg. 1,01,04 etc.
# in CHACHAx, run the script: sh make.merged.fastq.sh
# it will ask you for the CHACHA number, e.g., x, and then the path to the csv file (type CHACHAx.csv)
# a fastq merging script will be created, with each pool having its own script.
# Prompt for CHACHA number
read -p "Enter CHACHA number (x): " chacha
# Prompt for CSV file path
read -p "Enter the path of the CSV file: " csv_file
# Create the CHACHAx.fastq.merge.sh script
merge_script_filename="CHACHA${chacha}.fastq.merge.sh"
echo "#!/bin/bash" > "$merge_script_filename"
# Read the CSV file and process each line
while IFS=',' read -r pool start end
do
# Skip processing the first line (header)
if [ "$pool" = "pool" ]; then
continue
fi
# Generate the desired command for each barcode in the range
for barcode in $(seq -w "$start" "$end")
do
echo "for fastq in \$(ls /data/data_offload/CHACHA${chacha}/CHACHA_P${pool}/*/fastq_pass/barcode${barcode}/fastq_pass/*.fastq.gz); do" >> "$merge_script_filename"
echo " cat \"\$fastq\" >> pool${pool}.barcode${barcode}.merge.fastq.gz" >> "$merge_script_filename"
echo "done" >> "$merge_script_filename"
done
echo "Script commands for pool${pool} have been added to '$merge_script_filename'."
done < "$csv_file"
# Make the merge script file executable
chmod +x "$merge_script_filename"
echo "The CHACHA${chacha}.fastq.merge.sh script has been created."
Run this script:
sh make.merged.fastq.sh
It will prompt the user for the CHACHA# and the path to the CSV file with pool and barcode numbers, and generate the merge script:
>Enter CHACHA number (x): 5
>Enter the path of the CSV file: CHACHA5.csv
>Script commands for pool1 have been added to 'CHACHA5.fastq.merge.sh'.
>Script commands for pool2 have been added to 'CHACHA5.fastq.merge.sh'.
>Script commands for pool3 have been added to 'CHACHA5.fastq.merge.sh'.
>Script commands for pool4 have been added to 'CHACHA5.fastq.merge.sh'.
>Script commands for pool5 have been added to 'CHACHA5.fastq.merge.sh'.
>The CHACHA5.fastq.merge.sh script has been created.
These are the contents of the CHACHAx.fastq.merge.sh
script:
#!/bin/bash
for fastq in $(ls /data/data_offload/CHACHA5/CHACHA_P1/*/fastq_pass/barcode01/fastq_pass/*.fastq.gz); do
cat "$fastq" >> pool1.barcode01.merge.fastq.gz
done
for fastq in $(ls /data/data_offload/CHACHA5/CHACHA_P1/*/fastq_pass/barcode02/fastq_pass/*.fastq.gz); do
cat "$fastq" >> pool1.barcode02.merge.fastq.gz
done
for fastq in $(ls /data/data_offload/CHACHA5/CHACHA_P1/*/fastq_pass/barcode03/fastq_pass/*.fastq.gz); do
cat "$fastq" >> pool1.barcode03.merge.fastq.gz
done
for fastq in $(ls /data/data_offload/CHACHA5/CHACHA_P1/*/fastq_pass/barcode04/fastq_pass/*.fastq.gz); do
cat "$fastq" >> pool1.barcode04.merge.fastq.gz
done
for fastq in $(ls /data/data_offload/CHACHA5/CHACHA_P2/*/fastq_pass/barcode05/fastq_pass/*.fastq.gz); do
cat "$fastq" >> pool2.barcode05.merge.fastq.gz
done
for fastq in $(ls /data/data_offload/CHACHA5/CHACHA_P2/*/fastq_pass/barcode06/fastq_pass/*.fastq.gz); do
cat "$fastq" >> pool2.barcode06.merge.fastq.gz
done
for fastq in $(ls /data/data_offload/CHACHA5/CHACHA_P2/*/fastq_pass/barcode07/fastq_pass/*.fastq.gz); do
cat "$fastq" >> pool2.barcode07.merge.fastq.gz
done
for fastq in $(ls /data/data_offload/CHACHA5/CHACHA_P2/*/fastq_pass/barcode08/fastq_pass/*.fastq.gz); do
cat "$fastq" >> pool2.barcode08.merge.fastq.gz
done
for fastq in $(ls /data/data_offload/CHACHA5/CHACHA_P3/*/fastq_pass/barcode09/fastq_pass/*.fastq.gz); do
cat "$fastq" >> pool3.barcode09.merge.fastq.gz
done
for fastq in $(ls /data/data_offload/CHACHA5/CHACHA_P3/*/fastq_pass/barcode10/fastq_pass/*.fastq.gz); do
cat "$fastq" >> pool3.barcode10.merge.fastq.gz
done
for fastq in $(ls /data/data_offload/CHACHA5/CHACHA_P3/*/fastq_pass/barcode11/fastq_pass/*.fastq.gz); do
cat "$fastq" >> pool3.barcode11.merge.fastq.gz
done
for fastq in $(ls /data/data_offload/CHACHA5/CHACHA_P3/*/fastq_pass/barcode12/fastq_pass/*.fastq.gz); do
cat "$fastq" >> pool3.barcode12.merge.fastq.gz
done
for fastq in $(ls /data/data_offload/CHACHA5/CHACHA_P4/*/fastq_pass/barcode13/fastq_pass/*.fastq.gz); do
cat "$fastq" >> pool4.barcode13.merge.fastq.gz
done
for fastq in $(ls /data/data_offload/CHACHA5/CHACHA_P4/*/fastq_pass/barcode14/fastq_pass/*.fastq.gz); do
cat "$fastq" >> pool4.barcode14.merge.fastq.gz
done
for fastq in $(ls /data/data_offload/CHACHA5/CHACHA_P4/*/fastq_pass/barcode15/fastq_pass/*.fastq.gz); do
cat "$fastq" >> pool4.barcode15.merge.fastq.gz
done
for fastq in $(ls /data/data_offload/CHACHA5/CHACHA_P4/*/fastq_pass/barcode16/fastq_pass/*.fastq.gz); do
cat "$fastq" >> pool4.barcode16.merge.fastq.gz
done
for fastq in $(ls /data/data_offload/CHACHA5/CHACHA_P5/*/fastq_pass/barcode17/fastq_pass/*.fastq.gz); do
cat "$fastq" >> pool5.barcode17.merge.fastq.gz
done
for fastq in $(ls /data/data_offload/CHACHA5/CHACHA_P5/*/fastq_pass/barcode18/fastq_pass/*.fastq.gz); do
cat "$fastq" >> pool5.barcode18.merge.fastq.gz
done
for fastq in $(ls /data/data_offload/CHACHA5/CHACHA_P5/*/fastq_pass/barcode19/fastq_pass/*.fastq.gz); do
cat "$fastq" >> pool5.barcode19.merge.fastq.gz
done
for fastq in $(ls /data/data_offload/CHACHA5/CHACHA_P5/*/fastq_pass/barcode20/fastq_pass/*.fastq.gz); do
cat "$fastq" >> pool5.barcode20.merge.fastq.gz
done
Running this script (sh CHACHAx.fastq.merge.sh
) will output the merged fastq files.
While still in the experiment directory, make a merged fastq directory and move all the files there.
cd /data/data_offload/CHACHAx
mkdir fastq.merge
cd fastq.merge
mv ../*fastq.gz .
Following this, login to Quest and create the directories for the experiments in b1042, then log out of Quest.
ssh -X <netID>@quest.it.northwestern.edu
# enter password when prompted
mkdir /projects/b1042/Shilatifard/PromethION/CHACHAx
cd CHACHAx
mkdir ./bam ./minibam ./tars ./scripts ./QC.reports ./seq.summary ./fastq
exit
Transfer the files to b1042. This will take a while:
cd /data/data_offload/CHACHAx/
scp -r ./fastq.merge <netID>@quest.it.northwestern.edu:/projects/b1042/Shilatifard/PromethION/CHACHAx/fastq
# enter password when prompted.
make.tar.sh
#!/bin/bash
# make minibam file compression scripts: maketar.sh
# written by: Bercin Cenik
#### instructions
# create a CSV file with CHACHA number, e.g., CHACHAx.csv, save it in /data/data_offload/CHACHAx/
# header of CSV file should be: pool,firstbarcode,lastbarcode
# then enter the pool number and barcodes in corresponding rows. eg. 1,01,04 etc.
# in CHACHAx, run the script eg: sh maketar.sh
# it will ask you for the CHACHA number eg. x, and then the path to the csv file (type CHACHAx.csv)
# a tar merging script will be created eg. pool1.tar.sh
# each pool will have its own script.
# run these scripts individually, for example: sh pool1.tar.sh
# this will compress the minibam files as such: pool<x>.barcode<y>.bam.tar.gz
# Prompt the user for the path to the CSV file
read -p "Enter the path to the CSV file: " csv_file
# Check if the CSV file exists
if [ ! -f "$csv_file" ]; then
echo "Error: CSV file not found."
exit 1
fi
# Prompt the user for the CHACHA number
read -p "Enter the CHACHA number: " chacha_number
# Read the CSV file and process each line
tail -n +2 "$csv_file" | while IFS=',' read -r pool first last; do
# Generate the pool tar file path
pool_tar_file="pool${pool}.tar.sh"
# Create the pool tar script
echo "#!/bin/bash" > "$pool_tar_file"
echo " " >> "$pool_tar_file"
echo "# Pool ${pool} tar compression script" >> "$pool_tar_file"
echo " " >> "$pool_tar_file"
for barcode in $(seq -w "$first" "$last"); do
# Generate the individual tar file path
tar_file="pool${pool}.barcode${barcode}.bam.tar.gz"
# Append the tar compression command to the pool tar script
echo "tar cvzf ${tar_file} /data/data_offload/CHACHA${chacha_number}/CHACHA${chacha_number}_P${pool}/*/bam_pass/barcode${barcode}/*.bam" >> "$pool_tar_file"
# Delete the individual barcode tar script
# rm "pool${pool}.barcode${barcode}.bam.tar.gz.sh"
done
echo "echo \"Finished compressing all barcodes for Pool ${pool}\"" >> "$pool_tar_file"
done
cat pool*.sh > CHACHA5.tarminibams.sh
rm pool*.sh
Run make.tar.sh
:
sh make.tar.sh
This will create CHACHA5.tarminibams.sh
:
!/bin/bash
# Pool 1 tar compression script
tar cvzf pool1.barcode01.bam.tar.gz /data/data_offload/CHACHA5/CHACHA5_P1/*/bam_pass/barcode01/*.bam
tar cvzf pool1.barcode02.bam.tar.gz /data/data_offload/CHACHA5/CHACHA5_P1/*/bam_pass/barcode02/*.bam
tar cvzf pool1.barcode03.bam.tar.gz /data/data_offload/CHACHA5/CHACHA5_P1/*/bam_pass/barcode03/*.bam
tar cvzf pool1.barcode04.bam.tar.gz /data/data_offload/CHACHA5/CHACHA5_P1/*/bam_pass/barcode04/*.bam
echo "Finished compressing all barcodes for Pool 1"
#!/bin/bash
# Pool 2 tar compression script
tar cvzf pool2.barcode05.bam.tar.gz /data/data_offload/CHACHA5/CHACHA5_P2/*/bam_pass/barcode05/*.bam
tar cvzf pool2.barcode06.bam.tar.gz /data/data_offload/CHACHA5/CHACHA5_P2/*/bam_pass/barcode06/*.bam
tar cvzf pool2.barcode07.bam.tar.gz /data/data_offload/CHACHA5/CHACHA5_P2/*/bam_pass/barcode07/*.bam
tar cvzf pool2.barcode08.bam.tar.gz /data/data_offload/CHACHA5/CHACHA5_P2/*/bam_pass/barcode08/*.bam
echo "Finished compressing all barcodes for Pool 2"
#!/bin/bash
# Pool 3 tar compression script
tar cvzf pool3.barcode09.bam.tar.gz /data/data_offload/CHACHA5/CHACHA5_P3/*/bam_pass/barcode09/*.bam
tar cvzf pool3.barcode10.bam.tar.gz /data/data_offload/CHACHA5/CHACHA5_P3/*/bam_pass/barcode10/*.bam
tar cvzf pool3.barcode11.bam.tar.gz /data/data_offload/CHACHA5/CHACHA5_P3/*/bam_pass/barcode11/*.bam
tar cvzf pool3.barcode12.bam.tar.gz /data/data_offload/CHACHA5/CHACHA5_P3/*/bam_pass/barcode12/*.bam
echo "Finished compressing all barcodes for Pool 3"
#!/bin/bash
# Pool 4 tar compression script
tar cvzf pool4.barcode13.bam.tar.gz /data/data_offload/CHACHA5/CHACHA5_P4/*/bam_pass/barcode13/*.bam
tar cvzf pool4.barcode14.bam.tar.gz /data/data_offload/CHACHA5/CHACHA5_P4/*/bam_pass/barcode14/*.bam
tar cvzf pool4.barcode15.bam.tar.gz /data/data_offload/CHACHA5/CHACHA5_P4/*/bam_pass/barcode15/*.bam
tar cvzf pool4.barcode16.bam.tar.gz /data/data_offload/CHACHA5/CHACHA5_P4/*/bam_pass/barcode16/*.bam
echo "Finished compressing all barcodes for Pool 4"
#!/bin/bash
# Pool 5 tar compression script
tar cvzf pool5.barcode17.bam.tar.gz /data/data_offload/CHACHA5/CHACHA5_P5/*/bam_pass/barcode17/*.bam
tar cvzf pool5.barcode18.bam.tar.gz /data/data_offload/CHACHA5/CHACHA5_P5/*/bam_pass/barcode18/*.bam
tar cvzf pool5.barcode19.bam.tar.gz /data/data_offload/CHACHA5/CHACHA5_P5/*/bam_pass/barcode19/*.bam
tar cvzf pool5.barcode20.bam.tar.gz /data/data_offload/CHACHA5/CHACHA5_P5/*/bam_pass/barcode20/*.bam
echo "Finished compressing all barcodes for Pool 5"
Run CHACHA5.tarminibams.sh
:
sh CHACHA5.tarminibams.sh
When finished, create and move compresed bams into the bam.tars directory.
cd /data/data_offload/CHACHAx
mkdir bam.tars
cd bam.tars
mv ../*bam.tar.gz *
Transfer into the minibam directory using scp, this will take a while:
cd /data/data_offload/CHACHAx/
scp -r ./bam.tars <netID>@quest.it.northwestern.edu:/projects/b1042/Shilatifard/PromethION/CHACHAx/minibam
# enter password when prompted.