Add Reference Alignment + UTR Trimming, Dehosting and iVar update (#16)

* Index reference before samtools and ivar access the reference in callVariants step * Try indexing the reference before callVariants * Qc update (#56) * Update qc.py to reflect changed standards * Filter out Illumina Undetermined_S0 * Make sure to negate the matcher * Add minReadsPerBarcode parameter for Nanopore workflow (connor-lab#57) The barcode<n*> directories of de-plexed Fastq input are filtered to exclude any containing fewer than 5 files (guppy makes all of the barcode directories, whether you used the barcodes or not). This patch changes the filter to scan the files and count the records within to 1) avoid excluding barcodes where many reads are in fewer than 5 files and 2) avoid including barcodes where there are very few reads in more than 5 files. This threshold can be changed with the new optional parameter minReadsPerBarcode, which defaults to 100 reads. Co-authored-by: Keith James <[email protected]> * Remove unnecessary samtools view -bS (#48) Co-authored-by: ac55-sanger <[email protected]> * Added libxcb, default Qt library backend for graphics rendering, to dependencies (connor-lab#60) * use ivar default of 30 for illuminaKeepLen, and comment (connor-lab#61) Co-authored-by: David K. Jackson <[email protected]> * add basic host filter step * remove hardcoded paths * move host filter to within workflow * synchronize host removal with SIGNAL * limit host removal cpus * Move performHostFilter to illuminaNcov workflow * Add pysam dependency for filter_non_human_reads.py * update ivar to 1.2.3 * increase minimum read length after trimming * add step for filtering residual adapters missed by trimming * fix samtools/pysam version in environment * lower required match length for adapter check * restrict short adapter matches to the suffix of the read * update to ivar 1.3, add option for amplicon filtering * remove CLIMB-specific modules/calls * Align to ref (#15) * Add alignment step using MAFFT and UTR-trimming step * Remove extra = char * Tidy up illumina environment * Remove local executor definition * Remove addition of whitespace * Remove deletion of whitespace * Remove addition of whitespace * Publish dehosted fastqs * Fix process name for alignment step in illumina workflow * Set default viral_contig_name * Add sampleId tag to performHostFilter process * Enable testing against development branch * Remove testing scripts not in use * index bam files and publish indexes (#23) * index bam files and publish indexes * Move hostFilter to sequenceAnalysis, publish dehosted reads and consensus to nml_upload (#25) Co-authored-by: Michael Laszloffy <[email protected]> Co-authored-by: Michael Laszloffy <[email protected]> Co-authored-by: Matt Bull <[email protected]> Co-authored-by: Keith James <[email protected]> Co-authored-by: Keith James <[email protected]> Co-authored-by: Ashwini Chhipa <[email protected]> Co-authored-by: ac55-sanger <[email protected]> Co-authored-by: nodrogluap <[email protected]> Co-authored-by: dkj <[email protected]> Co-authored-by: David K. Jackson <[email protected]> Co-authored-by: Jared Simpson <[email protected]> Co-authored-by: Lawrence Heisler <[email protected]>
BCCDC-PHL · Dec 18, 2020 · f478d47 · f478d47
1 parent be8635b
commit f478d47
Show file tree

Hide file tree

Showing 14 changed files with 307 additions and 96 deletions.
diff --git a/.github/workflows/pull_request.yml b/.github/workflows/pull_request.yml
@@ -2,6 +2,7 @@ on:
   pull_request:
     branches:
     - master
+    - development
 name: master Pull Request
 jobs:
   test:

diff --git a/bin/filter_non_human_reads.py b/bin/filter_non_human_reads.py
@@ -0,0 +1,67 @@
+#!/usr/bin/env python
+import pysam
+import sys
+import argparse
+
+def filter_reads(contig_name, input_sam_fp, output_bam_fp):
+
+    # use streams if args are None
+    if input_sam_fp:
+        input_sam = pysam.AlignmentFile(input_sam_fp, 'r')
+    else:
+        input_sam = pysam.AlignmentFile('-', 'r')
+
+    if output_bam_fp:
+        output_bam = pysam.AlignmentFile(output_bam_fp, 'wb',
+                                         template=input_sam)
+    else:
+        output_bam = pysam.AlignmentFile('-', 'wb', template=input_sam)
+
+    # if read isn't mapped or mapped to viral reference contig name
+    viral_reads = 0
+    human_reads = 0
+    unmapped_reads = 0
+
+    # iterate over input from BWA
+    for read in input_sam:
+        # only look at primary alignments
+        if not read.is_supplementary and not read.is_secondary:
+            if read.reference_name == contig_name:
+                output_bam.write(read)
+                viral_reads += 1
+            elif read.is_unmapped:
+                output_bam.write(read)
+                unmapped_reads +=1
+            else:
+                human_reads += 1
+
+    total_reads = viral_reads + human_reads + unmapped_reads
+
+    print(f"viral read count = {viral_reads} "
+            f"({viral_reads/total_reads * 100:.2f}%)", file=sys.stderr)
+    print(f"human read count = {human_reads} "
+            f"({human_reads/total_reads * 100:.2f}%) ", file=sys.stderr)
+    print(f"unmapped read count = {unmapped_reads} "
+            f"({unmapped_reads/total_reads * 100:.2f}%)", file=sys.stderr)
+
+if __name__ == '__main__':
+
+    parser = argparse.ArgumentParser(description="Get all reads that are "
+                                                  "unmapped and map to a "
+                                                  "specific reference "
+                                                  "contig")
+    parser.add_argument('-i', '--input', required=False, default=False,
+                        help="Input SAM formatted file (stdin used if "
+                              " not specified)")
+
+    parser.add_argument('-o', '--output', required=False, default=False,
+                        help="Output BAM formatted file (stdout used if not "
+                             "specified)")
+
+    parser.add_argument('-c', '--contig_name', required=False,
+                        default="MN908947.3",
+                        help="Contig name to retain e.g. viral")
+
+    args = parser.parse_args()
+
+    filter_reads(args.contig_name, args.input, args.output)
diff --git a/bin/filter_residual_adapters.py b/bin/filter_residual_adapters.py
@@ -0,0 +1,85 @@
+#!/usr/bin/env python
+import gzip
+import pysam
+import sys
+import argparse
+
+# require that filename contains the specified extension
+def require_extension(filename, ext):
+    assert(args.input_R1[-len(ext):] == ext)
+
+def contains_adapter(read_sequence, adapter_sequence, min_match_length):
+    # discard all reads that contain the full adapter 
+    if adapter_sequence in read_sequence:
+        return True
+
+    # discard all reads where the suffix of length min_match_length is contained within the adapter
+    # this handles the case where there is a fairly good adapter match that is truncated
+    # by the end of the read
+    rl = len(read_sequence)
+    if rl >= min_match_length:
+        read_suffix = read_sequence[(rl - min_match_length):]
+        if read_suffix in adapter_sequence:
+            return True
+    return False
+
+def filter_reads(filter_sequences, min_match_length, input_R1_fp, input_R2_fp, output_R1_fp, output_R2_fp):
+
+    input_R1 = pysam.FastxFile(input_R1_fp)
+    input_R2 = pysam.FastxFile(input_R2_fp)
+
+    output_R1 = gzip.open(output_R1_fp, 'w')
+    output_R2 = gzip.open(output_R2_fp, 'w')
+
+    reads_filtered = 0
+    reads_kept = 0
+    for R1, R2 in zip(input_R1, input_R2):
+
+        discard = False
+        for f in filter_sequences:
+            if contains_adapter(R1.sequence, f, min_match_length) or contains_adapter(R2.sequence, f, min_match_length):
+                discard = True
+                break
+
+        if discard:
+            reads_filtered += 1
+            continue
+        else:
+            reads_kept += 1
+
+            fq1 = str(R1) + "\n"
+            fq2 = str(R2) + "\n"
+            output_R1.write(fq1.encode('utf-8'))
+            output_R2.write(fq2.encode('utf-8'))
+
+    print(f"reads kept: {reads_kept}, reads filtered: {reads_filtered}")
+
+if __name__ == '__main__':
+
+    parser = argparse.ArgumentParser(description="Discard read pairs that contain sequencing adapters that are missed by a trimmer")
+    parser.add_argument('--input_R1', required=True, default=False,
+                        help="Input fasta/fastq file for first half of pair")
+
+    parser.add_argument('--input_R2', required=True, default=False,
+                        help="Input fasta/fastq file for second half of pair")
+
+    args = parser.parse_args()
+
+    # this is designed to only work in the nextflow pipeline so we put strict requirements on the input/output names
+    in_ext = ".fq.gz"
+    require_extension(args.input_R1, in_ext)
+    require_extension(args.input_R2, in_ext)
+
+    out_ext = "_posttrim_filter.fq.gz"
+    output_R1 = args.input_R1.replace(in_ext, out_ext)
+    output_R2 = args.input_R2.replace(in_ext, out_ext)
+
+    # Illumina adapters that are occasionally leftover in reads
+    S7 = "CCGAGCCCACGAGAC"
+    P7 = "ATCTCGTATGCCGTCTTCTGCTTG"
+
+    # require a minimum match between read and adapter
+    min_length = 10
+    filter_sequences = [ S7, P7 ]
+
+    filter_reads(filter_sequences, min_length, args.input_R1, args.input_R2, output_R1, output_R2)
diff --git a/conf/base.config b/conf/base.config
@@ -10,6 +10,10 @@ params{
     ref = false
     bed = false
     profile = false
+
+    // host filtering
+    composite_ref = false
+    viral_contig_name = 'MN908947.3'
 
     // Repo to download your primer scheme from
     schemeRepoURL = 'https://github.com/artic-network/artic-ncov2019.git'
@@ -21,7 +25,10 @@ params{
     scheme =  'nCoV-2019'
 
     // Scheme version
-    schemeVersion = 'V2'
+    schemeVersion = 'V3'
+
+    // For ivar's amplicon filter
+    primer_pairs_tsv = false
 
     // Run experimental medaka pipeline? Specify in the command using "--medaka"
     medaka = false
@@ -31,18 +38,6 @@ params{
 
     // Run nanopolish pipeline
     nanopolish = false
-
-    // Upload data to CLIMB?
-    upload = false
-
-    // CLIMB username
-    CLIMBUser = false
-
-    // CLIMB SSH pubkey
-    CLIMBkey = false
-
-    // CLIMB hostname
-    CLIMBHostname = false
 }
 
 
diff --git a/conf/illumina.config b/conf/illumina.config
@@ -28,7 +28,7 @@ params {
     allowNoprimer = true
 
     // Length of illumina reads to keep after primer trimming
-    illuminaKeepLen = 20
+    illuminaKeepLen = 50
 
     // Sliding window quality threshold for keeping reads after primer trimming (illumina)
     illuminaQualThreshold = 20

diff --git a/environments/extras.yml b/environments/extras.yml
@@ -6,5 +6,5 @@ dependencies:
   - biopython
   - matplotlib
   - pandas
-  - samtools=1.10-0
+  - samtools
   - libxcb
diff --git a/environments/illumina/environment.yml b/environments/illumina/environment.yml
@@ -8,6 +8,8 @@ dependencies:
   - biopython=1.74
   - pandas=0.23.0=py36_1
   - bwa=0.7.17=pl5.22.0_2
-  - samtools=1.9
+  - samtools=1.10
   - trim-galore=0.6.5
-  - ivar=1.2.2
+  - ivar=1.3
+  - pysam=0.16.0.1
+  - mafft=7.471