Skip to content

Commit

Permalink
Add Reference Alignment + UTR Trimming, Dehosting and iVar update (#16)
Browse files Browse the repository at this point in the history
* Index reference before samtools and ivar access the reference in callVariants step

* Try indexing the reference before callVariants

* Qc update (#56)

* Update qc.py to reflect changed standards

* Filter out Illumina Undetermined_S0

* Make sure to negate the matcher

* Add minReadsPerBarcode parameter for Nanopore workflow (connor-lab#57)

The barcode<n*> directories of de-plexed Fastq input are filtered to
exclude any containing fewer than 5 files (guppy makes all of the
barcode directories, whether you used the barcodes or not).

This patch changes the filter to scan the files and count the records
within to 1) avoid excluding barcodes where many reads are in fewer
than 5 files and 2) avoid including barcodes where there are very few
reads in more than 5 files.

This threshold can be changed with the new optional parameter
minReadsPerBarcode, which defaults to 100 reads.

Co-authored-by: Keith James <[email protected]>

* Remove unnecessary samtools view -bS (#48)

Co-authored-by: ac55-sanger <[email protected]>

* Added libxcb, default Qt library backend for graphics rendering, to dependencies (connor-lab#60)

* use ivar default of 30 for illuminaKeepLen, and comment (connor-lab#61)

Co-authored-by: David K. Jackson <[email protected]>

* add basic host filter step

* remove hardcoded paths

* move host filter to within workflow

* synchronize host removal with SIGNAL

* limit host removal cpus

* Move performHostFilter to illuminaNcov workflow

* Add pysam dependency for filter_non_human_reads.py

* update ivar to 1.2.3

* increase minimum read length after trimming

* add step for filtering residual adapters missed by trimming

* fix samtools/pysam version in environment

* lower required match length for adapter check

* restrict short adapter matches to the suffix of the read

* update to ivar 1.3, add option for amplicon filtering

* remove CLIMB-specific modules/calls

* Align to ref (#15)

* Add alignment step using MAFFT and UTR-trimming step

* Remove extra = char

* Tidy up illumina environment

* Remove local executor definition

* Remove addition of whitespace

* Remove deletion of whitespace

* Remove addition of whitespace

* Publish dehosted fastqs

* Fix process name for alignment step in illumina workflow

* Set default viral_contig_name

* Add sampleId tag to performHostFilter process

* Enable testing against development branch

* Remove testing scripts not in use

* index bam files and publish indexes (#23)

* index bam files and publish indexes

* Move hostFilter to sequenceAnalysis, publish dehosted reads and consensus to nml_upload (#25)

Co-authored-by: Michael Laszloffy <[email protected]>
Co-authored-by: Michael Laszloffy <[email protected]>
Co-authored-by: Matt Bull <[email protected]>
Co-authored-by: Keith James <[email protected]>
Co-authored-by: Keith James <[email protected]>
Co-authored-by: Ashwini Chhipa <[email protected]>
Co-authored-by: ac55-sanger <[email protected]>
Co-authored-by: nodrogluap <[email protected]>
Co-authored-by: dkj <[email protected]>
Co-authored-by: David K. Jackson <[email protected]>
Co-authored-by: Jared Simpson <[email protected]>
Co-authored-by: Lawrence Heisler <[email protected]>
  • Loading branch information
13 people authored Dec 18, 2020
1 parent be8635b commit f478d47
Show file tree
Hide file tree
Showing 14 changed files with 307 additions and 96 deletions.
1 change: 1 addition & 0 deletions .github/workflows/pull_request.yml
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,7 @@ on:
pull_request:
branches:
- master
- development
name: master Pull Request
jobs:
test:
Expand Down
67 changes: 67 additions & 0 deletions bin/filter_non_human_reads.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,67 @@
#!/usr/bin/env python
import pysam
import sys
import argparse

def filter_reads(contig_name, input_sam_fp, output_bam_fp):

# use streams if args are None
if input_sam_fp:
input_sam = pysam.AlignmentFile(input_sam_fp, 'r')
else:
input_sam = pysam.AlignmentFile('-', 'r')

if output_bam_fp:
output_bam = pysam.AlignmentFile(output_bam_fp, 'wb',
template=input_sam)
else:
output_bam = pysam.AlignmentFile('-', 'wb', template=input_sam)

# if read isn't mapped or mapped to viral reference contig name
viral_reads = 0
human_reads = 0
unmapped_reads = 0

# iterate over input from BWA
for read in input_sam:
# only look at primary alignments
if not read.is_supplementary and not read.is_secondary:
if read.reference_name == contig_name:
output_bam.write(read)
viral_reads += 1
elif read.is_unmapped:
output_bam.write(read)
unmapped_reads +=1
else:
human_reads += 1

total_reads = viral_reads + human_reads + unmapped_reads

print(f"viral read count = {viral_reads} "
f"({viral_reads/total_reads * 100:.2f}%)", file=sys.stderr)
print(f"human read count = {human_reads} "
f"({human_reads/total_reads * 100:.2f}%) ", file=sys.stderr)
print(f"unmapped read count = {unmapped_reads} "
f"({unmapped_reads/total_reads * 100:.2f}%)", file=sys.stderr)

if __name__ == '__main__':

parser = argparse.ArgumentParser(description="Get all reads that are "
"unmapped and map to a "
"specific reference "
"contig")
parser.add_argument('-i', '--input', required=False, default=False,
help="Input SAM formatted file (stdin used if "
" not specified)")

parser.add_argument('-o', '--output', required=False, default=False,
help="Output BAM formatted file (stdout used if not "
"specified)")

parser.add_argument('-c', '--contig_name', required=False,
default="MN908947.3",
help="Contig name to retain e.g. viral")

args = parser.parse_args()

filter_reads(args.contig_name, args.input, args.output)
85 changes: 85 additions & 0 deletions bin/filter_residual_adapters.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,85 @@
#!/usr/bin/env python
import gzip
import pysam
import sys
import argparse

# require that filename contains the specified extension
def require_extension(filename, ext):
assert(args.input_R1[-len(ext):] == ext)

def contains_adapter(read_sequence, adapter_sequence, min_match_length):
# discard all reads that contain the full adapter
if adapter_sequence in read_sequence:
return True

# discard all reads where the suffix of length min_match_length is contained within the adapter
# this handles the case where there is a fairly good adapter match that is truncated
# by the end of the read
rl = len(read_sequence)
if rl >= min_match_length:
read_suffix = read_sequence[(rl - min_match_length):]
if read_suffix in adapter_sequence:
return True
return False

def filter_reads(filter_sequences, min_match_length, input_R1_fp, input_R2_fp, output_R1_fp, output_R2_fp):

input_R1 = pysam.FastxFile(input_R1_fp)
input_R2 = pysam.FastxFile(input_R2_fp)

output_R1 = gzip.open(output_R1_fp, 'w')
output_R2 = gzip.open(output_R2_fp, 'w')

reads_filtered = 0
reads_kept = 0
for R1, R2 in zip(input_R1, input_R2):

discard = False
for f in filter_sequences:
if contains_adapter(R1.sequence, f, min_match_length) or contains_adapter(R2.sequence, f, min_match_length):
discard = True
break

if discard:
reads_filtered += 1
continue
else:
reads_kept += 1

fq1 = str(R1) + "\n"
fq2 = str(R2) + "\n"
output_R1.write(fq1.encode('utf-8'))
output_R2.write(fq2.encode('utf-8'))

print(f"reads kept: {reads_kept}, reads filtered: {reads_filtered}")

if __name__ == '__main__':

parser = argparse.ArgumentParser(description="Discard read pairs that contain sequencing adapters that are missed by a trimmer")
parser.add_argument('--input_R1', required=True, default=False,
help="Input fasta/fastq file for first half of pair")

parser.add_argument('--input_R2', required=True, default=False,
help="Input fasta/fastq file for second half of pair")

args = parser.parse_args()

# this is designed to only work in the nextflow pipeline so we put strict requirements on the input/output names
in_ext = ".fq.gz"
require_extension(args.input_R1, in_ext)
require_extension(args.input_R2, in_ext)

out_ext = "_posttrim_filter.fq.gz"
output_R1 = args.input_R1.replace(in_ext, out_ext)
output_R2 = args.input_R2.replace(in_ext, out_ext)

# Illumina adapters that are occasionally leftover in reads
S7 = "CCGAGCCCACGAGAC"
P7 = "ATCTCGTATGCCGTCTTCTGCTTG"

# require a minimum match between read and adapter
min_length = 10
filter_sequences = [ S7, P7 ]

filter_reads(filter_sequences, min_length, args.input_R1, args.input_R2, output_R1, output_R2)
21 changes: 8 additions & 13 deletions conf/base.config
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,10 @@ params{
ref = false
bed = false
profile = false

// host filtering
composite_ref = false
viral_contig_name = 'MN908947.3'

// Repo to download your primer scheme from
schemeRepoURL = 'https://github.com/artic-network/artic-ncov2019.git'
Expand All @@ -21,7 +25,10 @@ params{
scheme = 'nCoV-2019'

// Scheme version
schemeVersion = 'V2'
schemeVersion = 'V3'

// For ivar's amplicon filter
primer_pairs_tsv = false

// Run experimental medaka pipeline? Specify in the command using "--medaka"
medaka = false
Expand All @@ -31,18 +38,6 @@ params{

// Run nanopolish pipeline
nanopolish = false

// Upload data to CLIMB?
upload = false

// CLIMB username
CLIMBUser = false

// CLIMB SSH pubkey
CLIMBkey = false

// CLIMB hostname
CLIMBHostname = false
}


2 changes: 1 addition & 1 deletion conf/illumina.config
Original file line number Diff line number Diff line change
Expand Up @@ -28,7 +28,7 @@ params {
allowNoprimer = true

// Length of illumina reads to keep after primer trimming
illuminaKeepLen = 20
illuminaKeepLen = 50

// Sliding window quality threshold for keeping reads after primer trimming (illumina)
illuminaQualThreshold = 20
Expand Down
2 changes: 1 addition & 1 deletion environments/extras.yml
Original file line number Diff line number Diff line change
Expand Up @@ -6,5 +6,5 @@ dependencies:
- biopython
- matplotlib
- pandas
- samtools=1.10-0
- samtools
- libxcb
6 changes: 4 additions & 2 deletions environments/illumina/environment.yml
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,8 @@ dependencies:
- biopython=1.74
- pandas=0.23.0=py36_1
- bwa=0.7.17=pl5.22.0_2
- samtools=1.9
- samtools=1.10
- trim-galore=0.6.5
- ivar=1.2.2
- ivar=1.3
- pysam=0.16.0.1
- mafft=7.471
Loading

0 comments on commit f478d47

Please sign in to comment.