Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to load pairs of FASTQ files from paired-end reads #102

Closed
Tracked by #28
abearab opened this issue Feb 28, 2024 · 7 comments
Closed
Tracked by #28

How to load pairs of FASTQ files from paired-end reads #102

abearab opened this issue Feb 28, 2024 · 7 comments

Comments

@abearab
Copy link
Contributor

abearab commented Feb 28, 2024

As you know, it's very common to have paired-end reads mainly in Illumina NGS technologies. I was wondering to know your recommendation for having a single session for processing paired-end FASTQ files. Looking forward to hear your thoughts.

@abearab
Copy link
Contributor Author

abearab commented Feb 28, 2024

There are several common computational task with paired-end reads, here are some examples:

1. trimming task

2. merge-paired-end-sequences

It might be a separate conversation but it can be a good idea if your tool can have guidance for these tasks. Based on what I have learned about your tool so far, I assume these task can be very efficient with the SQL core in your tool and it can be combined with other tasks at the same time.

@tshauck
Copy link
Member

tshauck commented Feb 28, 2024

Thanks as always for filing the issue. In the short term please see the following example for loading pair end reads. I also agree w/ you point about having some guidance around trimming, and filed in ticket to update the docs with an example.

>>> !wget https://raw.githubusercontent.com/marcelm/cutadapt/main/tests/cut/paired.1.fastq
>>> !wget https://raw.githubusercontent.com/marcelm/cutadapt/main/tests/cut/paired.2.fastq
>>>
>>> import biobear as bb
>>> session = bb.connect()
>>>
>>> session.sql("""
   ...: WITH one AS (SELECT REPLACE(name, '/1', '') trimmed_name, * FROM fastq_scan('paired.1.fastq')),
   ...:      two AS (SELECT REPLACE(name, '/2', '') trimmed_name, * FROM fastq_scan('paired.2.fastq'))
   ...:      
   ...: SELECT one.trimmed_name, one.name one_name, one.sequence one_sequence, two.name two_name, two.sequence two_sequence
   ...: FROM one
   ...:   JOIN two
   ...:     ON one.trimmed_name = two.trimmed_name
   ...: 
   ...: """).to_polars()

Outside of python the SQL looks like the following. The idea is to create two CTEs which are used to standardize the read names (removing /1 or /2), then the body joins the two files together on that new name and returns a bit of info about both reads. You could add quality score or description if you want those.

WITH one AS (
    SELECT REPLACE(name, '/1', '') trimmed_name, *
    FROM fastq_scan('paired.1.fastq')
), two AS (
    SELECT REPLACE(name, '/2', '') trimmed_name, *
    FROM fastq_scan('paired.2.fastq')
)

SELECT one.trimmed_name, one.name one_name, one.sequence one_sequence, two.name two_name, two.sequence two_sequence
FROM one
  JOIN two
    ON one.trimmed_name = two.trimmed_name

The results of which looks like:

shape: (4, 5)
┌──────────────┬──────────┬────────────────────────────────┬──────────┬────────────────────────────────┐
│ trimmed_name ┆ one_name ┆ one_sequence                   ┆ two_name ┆ two_sequence                   │
│ ---          ┆ ---      ┆ ---                            ┆ ---      ┆ ---                            │
│ str          ┆ str      ┆ str                            ┆ str      ┆ str                            │
╞══════════════╪══════════╪════════════════════════════════╪══════════╪════════════════════════════════╡
│ read5        ┆ read5    ┆ TTATTTGTCTCCAGC                ┆ read5    ┆ CAACAGGCCACATTAGACATATCGGATGGT │
│ read1        ┆ read1/1  ┆ TTATTTGTCTCCAGC                ┆ read1/2  ┆ GCTGGAGACAAATAA                │
│ read4        ┆ read4/1  ┆ GACAGGCCGTTTGAATGTTGACGGGATGTT ┆ read4/2  ┆ CATCCCGTCAACATTCAAACGGCCTGTCCA │
│ read3        ┆ read3/1  ┆ CCAACTTGATATTAATAACA           ┆ read3/2  ┆ TGTTATTAATATCAAGTTGG           │
└──────────────┴──────────┴────────────────────────────────┴──────────┴────────────────────────────────┘

I'd also just add that session will persist state within a given program or interpreter session. So you can create a table backed by a file, then query it multiple times if needed.

import biobear as bb

s = bb.connect()

s.execute("""
CREATE EXTERNAL TABLE test STORED AS FASTQ LOCATION 'paired.1.fastq'
""")

s.sql("SELECT * FROM test").to_polars()
┌─────────┬─────────────┬────────────────────────────────┬────────────────────────────────┐
│ name    ┆ description ┆ sequence                       ┆ quality_scores                 │
│ ---     ┆ ---         ┆ ---                            ┆ ---                            │
│ str     ┆ str         ┆ str                            ┆ str                            │
╞═════════╪═════════════╪════════════════════════════════╪════════════════════════════════╡
│ read1/1 ┆ some text   ┆ TTATTTGTCTCCAGC                ┆ ##HHHHHHHHHHHHH                │
│ read3/1 ┆ null        ┆ CCAACTTGATATTAATAACA           ┆ HHHHHHHHHHHHHHHHHHHH           │
│ read4/1 ┆ null        ┆ GACAGGCCGTTTGAATGTTGACGGGATGTT ┆ HHHHHHHHHHHHHHHHHHHHHHHHHHHHHH │
│ read5   ┆ null        ┆ TTATTTGTCTCCAGC                ┆ #####HHHHHHHHHH                │
└─────────┴─────────────┴────────────────────────────────┴────────────────────────────────┘

@abearab
Copy link
Contributor Author

abearab commented Mar 4, 2024

Hi @tshauck – thank you so much for your response! I'll look more carefully and will get back to you.

Actually, my main goal for loading both R1/R2 reads at the same time is doing this:

  1. trimming each read from certain position
  2. counting unique sequences based on both reads

Basically, I'm going to expand this function to work with more complex inputs: https://github.com/ArcInstitute/ScreenPro2/blob/master/screenpro/ngs.py#L26-L50

Do you have any suggestion? Thanks agin for your insightful responses.

@tshauck
Copy link
Member

tshauck commented Mar 5, 2024

Thanks for the link 👍 , I'm gonna work through the trimming doc (#103) and how it refers to your link tonight/tomorrow morning and I'll follow-up then w/ more.

@abearab
Copy link
Contributor Author

abearab commented Mar 5, 2024

Thanks for the link 👍 , I'm gonna work through the trimming doc (#103) and how it refers to your link tonight/tomorrow morning and I'll follow-up then w/ more.

Awesome! Looking forward to that.

I'm also excited to hear your thoughts about this #102 (comment) but there is no rush; whenever you have time please let me know what you think. I already released a new sub-bersion for my tool which is now using biobearArcInstitute/ScreenPro2@85445c6. I'll add paired-end read support in the future (hopefully next few weeks).

@abearab
Copy link
Contributor Author

abearab commented Mar 7, 2024

I could follow your example here and it worked well. Thanks!!

ArcInstitute/ScreenPro2@fe9a27d

Thanks as always for filing the issue. In the short term please see the following example for loading pair end reads. I also agree w/ you point about having some guidance around trimming, and filed in ticket to update the docs with an example.

>>> !wget https://raw.githubusercontent.com/marcelm/cutadapt/main/tests/cut/paired.1.fastq
>>> !wget https://raw.githubusercontent.com/marcelm/cutadapt/main/tests/cut/paired.2.fastq
>>>
>>> import biobear as bb
>>> session = bb.connect()
>>>
>>> session.sql("""
   ...: WITH one AS (SELECT REPLACE(name, '/1', '') trimmed_name, * FROM fastq_scan('paired.1.fastq')),
   ...:      two AS (SELECT REPLACE(name, '/2', '') trimmed_name, * FROM fastq_scan('paired.2.fastq'))
   ...:      
   ...: SELECT one.trimmed_name, one.name one_name, one.sequence one_sequence, two.name two_name, two.sequence two_sequence
   ...: FROM one
   ...:   JOIN two
   ...:     ON one.trimmed_name = two.trimmed_name
   ...: 
   ...: """).to_polars()

Outside of python the SQL looks like the following. The idea is to create two CTEs which are used to standardize the read names (removing /1 or /2), then the body joins the two files together on that new name and returns a bit of info about both reads. You could add quality score or description if you want those.

WITH one AS (
    SELECT REPLACE(name, '/1', '') trimmed_name, *
    FROM fastq_scan('paired.1.fastq')
), two AS (
    SELECT REPLACE(name, '/2', '') trimmed_name, *
    FROM fastq_scan('paired.2.fastq')
)

SELECT one.trimmed_name, one.name one_name, one.sequence one_sequence, two.name two_name, two.sequence two_sequence
FROM one
  JOIN two
    ON one.trimmed_name = two.trimmed_name

The results of which looks like:

shape: (4, 5)
┌──────────────┬──────────┬────────────────────────────────┬──────────┬────────────────────────────────┐
│ trimmed_name ┆ one_name ┆ one_sequence                   ┆ two_name ┆ two_sequence                   │
│ ---          ┆ ---      ┆ ---                            ┆ ---      ┆ ---                            │
│ str          ┆ str      ┆ str                            ┆ str      ┆ str                            │
╞══════════════╪══════════╪════════════════════════════════╪══════════╪════════════════════════════════╡
│ read5        ┆ read5    ┆ TTATTTGTCTCCAGC                ┆ read5    ┆ CAACAGGCCACATTAGACATATCGGATGGT │
│ read1        ┆ read1/1  ┆ TTATTTGTCTCCAGC                ┆ read1/2  ┆ GCTGGAGACAAATAA                │
│ read4        ┆ read4/1  ┆ GACAGGCCGTTTGAATGTTGACGGGATGTT ┆ read4/2  ┆ CATCCCGTCAACATTCAAACGGCCTGTCCA │
│ read3        ┆ read3/1  ┆ CCAACTTGATATTAATAACA           ┆ read3/2  ┆ TGTTATTAATATCAAGTTGG           │
└──────────────┴──────────┴────────────────────────────────┴──────────┴────────────────────────────────┘

I'd also just add that session will persist state within a given program or interpreter session. So you can create a table backed by a file, then query it multiple times if needed.

import biobear as bb

s = bb.connect()

s.execute("""
CREATE EXTERNAL TABLE test STORED AS FASTQ LOCATION 'paired.1.fastq'
""")

s.sql("SELECT * FROM test").to_polars()
┌─────────┬─────────────┬────────────────────────────────┬────────────────────────────────┐
│ name    ┆ description ┆ sequence                       ┆ quality_scores                 │
│ ---     ┆ ---         ┆ ---                            ┆ ---                            │
│ str     ┆ str         ┆ str                            ┆ str                            │
╞═════════╪═════════════╪════════════════════════════════╪════════════════════════════════╡
│ read1/1 ┆ some text   ┆ TTATTTGTCTCCAGC                ┆ ##HHHHHHHHHHHHH                │
│ read3/1 ┆ null        ┆ CCAACTTGATATTAATAACA           ┆ HHHHHHHHHHHHHHHHHHHH           │
│ read4/1 ┆ null        ┆ GACAGGCCGTTTGAATGTTGACGGGATGTT ┆ HHHHHHHHHHHHHHHHHHHHHHHHHHHHHH │
│ read5   ┆ null        ┆ TTATTTGTCTCCAGC                ┆ #####HHHHHHHHHH                │
└─────────┴─────────────┴────────────────────────────────┴────────────────────────────────┘

@tshauck
Copy link
Member

tshauck commented Mar 7, 2024

@abearab Nice, congrats on the release! And glad that worked well enough.

I am still working on the adapter/quality trimming info. I'm trying to add some functionality as I go to make it easier to use, e.g. SELECT quality_scores_to_list('!!!') returning [0, 0, 0]. (https://github.com/wheretrue/exon/pull/427/files)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants