How to load pairs of FASTQ files from paired-end reads #102

abearab · 2024-02-28T01:59:36Z

As you know, it's very common to have paired-end reads mainly in Illumina NGS technologies. I was wondering to know your recommendation for having a single session for processing paired-end FASTQ files. Looking forward to hear your thoughts.

The text was updated successfully, but these errors were encountered:

abearab · 2024-02-28T02:03:13Z

There are several common computational task with paired-end reads, here are some examples:

1. trimming task

https://cutadapt.readthedocs.io/en/stable/guide.html#paired-end

2. merge-paired-end-sequences

It might be a separate conversation but it can be a good idea if your tool can have guidance for these tasks. Based on what I have learned about your tool so far, I assume these task can be very efficient with the SQL core in your tool and it can be combined with other tasks at the same time.

tshauck · 2024-02-28T23:37:07Z

Thanks as always for filing the issue. In the short term please see the following example for loading pair end reads. I also agree w/ you point about having some guidance around trimming, and filed in ticket to update the docs with an example.

>>> !wget https://raw.githubusercontent.com/marcelm/cutadapt/main/tests/cut/paired.1.fastq
>>> !wget https://raw.githubusercontent.com/marcelm/cutadapt/main/tests/cut/paired.2.fastq
>>>
>>> import biobear as bb
>>> session = bb.connect()
>>>
>>> session.sql("""
   ...: WITH one AS (SELECT REPLACE(name, '/1', '') trimmed_name, * FROM fastq_scan('paired.1.fastq')),
   ...:      two AS (SELECT REPLACE(name, '/2', '') trimmed_name, * FROM fastq_scan('paired.2.fastq'))
   ...:      
   ...: SELECT one.trimmed_name, one.name one_name, one.sequence one_sequence, two.name two_name, two.sequence two_sequence
   ...: FROM one
   ...:   JOIN two
   ...:     ON one.trimmed_name = two.trimmed_name
   ...: 
   ...: """).to_polars()

Outside of python the SQL looks like the following. The idea is to create two CTEs which are used to standardize the read names (removing /1 or /2), then the body joins the two files together on that new name and returns a bit of info about both reads. You could add quality score or description if you want those.

WITH one AS (
    SELECT REPLACE(name, '/1', '') trimmed_name, *
    FROM fastq_scan('paired.1.fastq')
), two AS (
    SELECT REPLACE(name, '/2', '') trimmed_name, *
    FROM fastq_scan('paired.2.fastq')
)

SELECT one.trimmed_name, one.name one_name, one.sequence one_sequence, two.name two_name, two.sequence two_sequence
FROM one
  JOIN two
    ON one.trimmed_name = two.trimmed_name

The results of which looks like:

shape: (4, 5)
┌──────────────┬──────────┬────────────────────────────────┬──────────┬────────────────────────────────┐
│ trimmed_name ┆ one_name ┆ one_sequence                   ┆ two_name ┆ two_sequence                   │
│ ---          ┆ ---      ┆ ---                            ┆ ---      ┆ ---                            │
│ str          ┆ str      ┆ str                            ┆ str      ┆ str                            │
╞══════════════╪══════════╪════════════════════════════════╪══════════╪════════════════════════════════╡
│ read5        ┆ read5    ┆ TTATTTGTCTCCAGC                ┆ read5    ┆ CAACAGGCCACATTAGACATATCGGATGGT │
│ read1        ┆ read1/1  ┆ TTATTTGTCTCCAGC                ┆ read1/2  ┆ GCTGGAGACAAATAA                │
│ read4        ┆ read4/1  ┆ GACAGGCCGTTTGAATGTTGACGGGATGTT ┆ read4/2  ┆ CATCCCGTCAACATTCAAACGGCCTGTCCA │
│ read3        ┆ read3/1  ┆ CCAACTTGATATTAATAACA           ┆ read3/2  ┆ TGTTATTAATATCAAGTTGG           │
└──────────────┴──────────┴────────────────────────────────┴──────────┴────────────────────────────────┘

I'd also just add that session will persist state within a given program or interpreter session. So you can create a table backed by a file, then query it multiple times if needed.

import biobear as bb

s = bb.connect()

s.execute("""
CREATE EXTERNAL TABLE test STORED AS FASTQ LOCATION 'paired.1.fastq'
""")

s.sql("SELECT * FROM test").to_polars()

┌─────────┬─────────────┬────────────────────────────────┬────────────────────────────────┐
│ name    ┆ description ┆ sequence                       ┆ quality_scores                 │
│ ---     ┆ ---         ┆ ---                            ┆ ---                            │
│ str     ┆ str         ┆ str                            ┆ str                            │
╞═════════╪═════════════╪════════════════════════════════╪════════════════════════════════╡
│ read1/1 ┆ some text   ┆ TTATTTGTCTCCAGC                ┆ ##HHHHHHHHHHHHH                │
│ read3/1 ┆ null        ┆ CCAACTTGATATTAATAACA           ┆ HHHHHHHHHHHHHHHHHHHH           │
│ read4/1 ┆ null        ┆ GACAGGCCGTTTGAATGTTGACGGGATGTT ┆ HHHHHHHHHHHHHHHHHHHHHHHHHHHHHH │
│ read5   ┆ null        ┆ TTATTTGTCTCCAGC                ┆ #####HHHHHHHHHH                │
└─────────┴─────────────┴────────────────────────────────┴────────────────────────────────┘

abearab · 2024-03-04T23:11:39Z

Hi @tshauck – thank you so much for your response! I'll look more carefully and will get back to you.

Actually, my main goal for loading both R1/R2 reads at the same time is doing this:

trimming each read from certain position
counting unique sequences based on both reads

Basically, I'm going to expand this function to work with more complex inputs: https://github.com/ArcInstitute/ScreenPro2/blob/master/screenpro/ngs.py#L26-L50

Do you have any suggestion? Thanks agin for your insightful responses.

tshauck · 2024-03-05T21:47:15Z

Thanks for the link 👍 , I'm gonna work through the trimming doc (#103) and how it refers to your link tonight/tomorrow morning and I'll follow-up then w/ more.

abearab · 2024-03-05T23:37:00Z

Thanks for the link 👍 , I'm gonna work through the trimming doc (#103) and how it refers to your link tonight/tomorrow morning and I'll follow-up then w/ more.

Awesome! Looking forward to that.

I'm also excited to hear your thoughts about this #102 (comment) but there is no rush; whenever you have time please let me know what you think. I already released a new sub-bersion for my tool which is now using biobear – ArcInstitute/ScreenPro2@85445c6. I'll add paired-end read support in the future (hopefully next few weeks).

abearab · 2024-03-07T03:22:01Z

I could follow your example here and it worked well. Thanks!!

ArcInstitute/ScreenPro2@fe9a27d

Thanks as always for filing the issue. In the short term please see the following example for loading pair end reads. I also agree w/ you point about having some guidance around trimming, and filed in ticket to update the docs with an example.

>>> !wget https://raw.githubusercontent.com/marcelm/cutadapt/main/tests/cut/paired.1.fastq
>>> !wget https://raw.githubusercontent.com/marcelm/cutadapt/main/tests/cut/paired.2.fastq
>>>
>>> import biobear as bb
>>> session = bb.connect()
>>>
>>> session.sql("""
   ...: WITH one AS (SELECT REPLACE(name, '/1', '') trimmed_name, * FROM fastq_scan('paired.1.fastq')),
   ...:      two AS (SELECT REPLACE(name, '/2', '') trimmed_name, * FROM fastq_scan('paired.2.fastq'))
   ...:      
   ...: SELECT one.trimmed_name, one.name one_name, one.sequence one_sequence, two.name two_name, two.sequence two_sequence
   ...: FROM one
   ...:   JOIN two
   ...:     ON one.trimmed_name = two.trimmed_name
   ...: 
   ...: """).to_polars()

Outside of python the SQL looks like the following. The idea is to create two CTEs which are used to standardize the read names (removing /1 or /2), then the body joins the two files together on that new name and returns a bit of info about both reads. You could add quality score or description if you want those.

WITH one AS (
    SELECT REPLACE(name, '/1', '') trimmed_name, *
    FROM fastq_scan('paired.1.fastq')
), two AS (
    SELECT REPLACE(name, '/2', '') trimmed_name, *
    FROM fastq_scan('paired.2.fastq')
)

SELECT one.trimmed_name, one.name one_name, one.sequence one_sequence, two.name two_name, two.sequence two_sequence
FROM one
  JOIN two
    ON one.trimmed_name = two.trimmed_name

The results of which looks like:

shape: (4, 5)
┌──────────────┬──────────┬────────────────────────────────┬──────────┬────────────────────────────────┐
│ trimmed_name ┆ one_name ┆ one_sequence                   ┆ two_name ┆ two_sequence                   │
│ ---          ┆ ---      ┆ ---                            ┆ ---      ┆ ---                            │
│ str          ┆ str      ┆ str                            ┆ str      ┆ str                            │
╞══════════════╪══════════╪════════════════════════════════╪══════════╪════════════════════════════════╡
│ read5        ┆ read5    ┆ TTATTTGTCTCCAGC                ┆ read5    ┆ CAACAGGCCACATTAGACATATCGGATGGT │
│ read1        ┆ read1/1  ┆ TTATTTGTCTCCAGC                ┆ read1/2  ┆ GCTGGAGACAAATAA                │
│ read4        ┆ read4/1  ┆ GACAGGCCGTTTGAATGTTGACGGGATGTT ┆ read4/2  ┆ CATCCCGTCAACATTCAAACGGCCTGTCCA │
│ read3        ┆ read3/1  ┆ CCAACTTGATATTAATAACA           ┆ read3/2  ┆ TGTTATTAATATCAAGTTGG           │
└──────────────┴──────────┴────────────────────────────────┴──────────┴────────────────────────────────┘

I'd also just add that session will persist state within a given program or interpreter session. So you can create a table backed by a file, then query it multiple times if needed.

import biobear as bb

s = bb.connect()

s.execute("""
CREATE EXTERNAL TABLE test STORED AS FASTQ LOCATION 'paired.1.fastq'
""")

s.sql("SELECT * FROM test").to_polars()

┌─────────┬─────────────┬────────────────────────────────┬────────────────────────────────┐
│ name    ┆ description ┆ sequence                       ┆ quality_scores                 │
│ ---     ┆ ---         ┆ ---                            ┆ ---                            │
│ str     ┆ str         ┆ str                            ┆ str                            │
╞═════════╪═════════════╪════════════════════════════════╪════════════════════════════════╡
│ read1/1 ┆ some text   ┆ TTATTTGTCTCCAGC                ┆ ##HHHHHHHHHHHHH                │
│ read3/1 ┆ null        ┆ CCAACTTGATATTAATAACA           ┆ HHHHHHHHHHHHHHHHHHHH           │
│ read4/1 ┆ null        ┆ GACAGGCCGTTTGAATGTTGACGGGATGTT ┆ HHHHHHHHHHHHHHHHHHHHHHHHHHHHHH │
│ read5   ┆ null        ┆ TTATTTGTCTCCAGC                ┆ #####HHHHHHHHHH                │
└─────────┴─────────────┴────────────────────────────────┴────────────────────────────────┘

tshauck · 2024-03-07T19:09:18Z

@abearab Nice, congrats on the release! And glad that worked well enough.

I am still working on the adapter/quality trimming info. I'm trying to add some functionality as I go to make it easier to use, e.g. SELECT quality_scores_to_list('!!!') returning [0, 0, 0]. (https://github.com/wheretrue/exon/pull/427/files)

tshauck mentioned this issue Feb 28, 2024

Update https://www.wheretrue.dev/ w/ Trimming Example #103

Open

abearab closed this as completed Mar 7, 2024

abearab mentioned this issue Mar 7, 2024

Design and implement fastq2count module ArcInstitute/ScreenPro2#28

Closed

10 tasks

abearab mentioned this issue Apr 5, 2024

merge paired-end sequences #116

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How to load pairs of FASTQ files from paired-end reads #102

How to load pairs of FASTQ files from paired-end reads #102

abearab commented Feb 28, 2024

abearab commented Feb 28, 2024 •

edited

Loading

tshauck commented Feb 28, 2024 •

edited

Loading

abearab commented Mar 4, 2024 •

edited

Loading

tshauck commented Mar 5, 2024

abearab commented Mar 5, 2024

abearab commented Mar 7, 2024

tshauck commented Mar 7, 2024 •

edited

Loading

How to load pairs of FASTQ files from paired-end reads #102

How to load pairs of FASTQ files from paired-end reads #102

Comments

abearab commented Feb 28, 2024

abearab commented Feb 28, 2024 • edited Loading

tshauck commented Feb 28, 2024 • edited Loading

abearab commented Mar 4, 2024 • edited Loading

tshauck commented Mar 5, 2024

abearab commented Mar 5, 2024

abearab commented Mar 7, 2024

tshauck commented Mar 7, 2024 • edited Loading

abearab commented Feb 28, 2024 •

edited

Loading

tshauck commented Feb 28, 2024 •

edited

Loading

abearab commented Mar 4, 2024 •

edited

Loading

tshauck commented Mar 7, 2024 •

edited

Loading