How to load pairs of FASTQ files from paired-end reads #102
There are several common computational tasks with paired-end reads; here are some examples: 1. trimming, 2. merge-paired-end-sequences.
It might be a separate conversation, but it could be a good idea for your tool to have guidance for these tasks. Based on what I have learned about your tool so far, I assume these tasks can be handled very efficiently with the SQL core in your tool, and they can be combined with other tasks at the same time.
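As a loose illustration of that point (my sketch, not from the original thread): fixed-length trimming can be expressed directly in SQL over fastq_scan. The use of substr here is only a stand-in for real adapter- or quality-aware trimming, and the 10-base cut is arbitrary.

```python
import biobear as bb

session = bb.connect()

# Illustrative only: drop the first 10 bases of every read in R1.
# substr() is a stand-in for real adapter/quality trimming logic.
trimmed = session.sql("""
    SELECT name, substr(sequence, 11) AS trimmed_sequence
    FROM fastq_scan('paired.1.fastq')
""").to_polars()

print(trimmed.head())
```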
Thanks as always for filing the issue. In the short term, please see the following example for loading paired-end reads. I also agree with your point about having some guidance around trimming, and I filed a ticket to update the docs with an example.

>>> !wget https://raw.githubusercontent.com/marcelm/cutadapt/main/tests/cut/paired.1.fastq
>>> !wget https://raw.githubusercontent.com/marcelm/cutadapt/main/tests/cut/paired.2.fastq
>>>
>>> import biobear as bb
>>> session = bb.connect()
>>>
>>> session.sql("""
...: WITH one AS (SELECT REPLACE(name, '/1', '') trimmed_name, * FROM fastq_scan('paired.1.fastq')),
...: two AS (SELECT REPLACE(name, '/2', '') trimmed_name, * FROM fastq_scan('paired.2.fastq'))
...:
...: SELECT one.trimmed_name, one.name one_name, one.sequence one_sequence, two.name two_name, two.sequence two_sequence
...: FROM one
...: JOIN two
...: ON one.trimmed_name = two.trimmed_name
...:
...: """).to_polars()

Outside of Python, the SQL looks like the following. The idea is to create two CTEs that standardize the read names (removing the /1 and /2 mate suffixes), then join the two scans on the trimmed name.

WITH one AS (
SELECT REPLACE(name, '/1', '') trimmed_name, *
FROM fastq_scan('paired.1.fastq')
), two AS (
SELECT REPLACE(name, '/2', '') trimmed_name, *
FROM fastq_scan('paired.2.fastq')
)
SELECT one.trimmed_name, one.name one_name, one.sequence one_sequence, two.name two_name, two.sequence two_sequence
FROM one
JOIN two
ON one.trimmed_name = two.trimmed_name

The result is a table keyed by the trimmed name, with the original name and sequence from each mate side by side.
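One simple way to sanity-check the pairing (my addition, not part of the original reply) is to compare the row count of the joined result against the per-file read counts; if they differ, some reads were unpaired or the name-trimming rule didn't match your naming convention.

```python
import biobear as bb

session = bb.connect()

paired_sql = """
WITH one AS (SELECT REPLACE(name, '/1', '') trimmed_name, * FROM fastq_scan('paired.1.fastq')),
     two AS (SELECT REPLACE(name, '/2', '') trimmed_name, * FROM fastq_scan('paired.2.fastq'))
SELECT one.trimmed_name
FROM one
JOIN two ON one.trimmed_name = two.trimmed_name
"""

n_r1 = session.sql("SELECT count(*) AS n FROM fastq_scan('paired.1.fastq')").to_polars()["n"][0]
n_r2 = session.sql("SELECT count(*) AS n FROM fastq_scan('paired.2.fastq')").to_polars()["n"][0]
n_paired = session.sql(paired_sql).to_polars().height

# All three counts should match for fully paired files.
print(n_r1, n_r2, n_paired)
```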
I'd also just add that a session will persist state within a given program or interpreter session, so you can create a table backed by a file and then query it multiple times if needed.

import biobear as bb
s = bb.connect()
s.execute("""
CREATE EXTERNAL TABLE test STORED AS FASTQ LOCATION 'paired.1.fastq'
""")
s.sql("SELECT * FROM test").to_polars()
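Building on that, here is a sketch of how a persisted session might hold both mates as external tables and pair them with the same trimmed-name join. The table names r1 and r2 are made up for the example, and the join-on-REPLACE form is my variation on the CTE query above.

```python
import biobear as bb

s = bb.connect()

# Register each mate file as a table in the session (table names are illustrative).
s.execute("CREATE EXTERNAL TABLE r1 STORED AS FASTQ LOCATION 'paired.1.fastq'")
s.execute("CREATE EXTERNAL TABLE r2 STORED AS FASTQ LOCATION 'paired.2.fastq'")

# The tables persist in the session, so the pairing query can be rerun as needed.
pairs = s.sql("""
    SELECT REPLACE(r1.name, '/1', '') AS trimmed_name,
           r1.sequence AS r1_sequence,
           r2.sequence AS r2_sequence
    FROM r1
    JOIN r2 ON REPLACE(r1.name, '/1', '') = REPLACE(r2.name, '/2', '')
""").to_polars()
```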
Hi @tshauck – thank you so much for your response! I'll look more carefully and will get back to you. Actually, my main goal for loading both R1/R2 reads at the same time is the following. Basically, I'm going to expand this function to work with more complex inputs: https://github.com/ArcInstitute/ScreenPro2/blob/master/screenpro/ngs.py#L26-L50 Do you have any suggestions? Thanks again for your insightful responses.
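For what it's worth, here is a rough sketch of how the paired-end query above could be wrapped into a reusable helper in the spirit of the linked ngs.py function. The function name, signature, and returned DataFrame are assumptions for illustration, not part of ScreenPro2 or biobear.

```python
import biobear as bb


def load_paired_fastq(r1_path: str, r2_path: str):
    """Load mate pairs from two FASTQ files into a single polars DataFrame.

    Hypothetical helper: assumes read names end in '/1' and '/2'.
    """
    session = bb.connect()
    query = f"""
        WITH one AS (SELECT REPLACE(name, '/1', '') trimmed_name, * FROM fastq_scan('{r1_path}')),
             two AS (SELECT REPLACE(name, '/2', '') trimmed_name, * FROM fastq_scan('{r2_path}'))
        SELECT one.trimmed_name,
               one.sequence AS r1_sequence,
               two.sequence AS r2_sequence
        FROM one
        JOIN two ON one.trimmed_name = two.trimmed_name
    """
    return session.sql(query).to_polars()


pairs = load_paired_fastq("paired.1.fastq", "paired.2.fastq")
```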
Thanks for the link 👍, I'm gonna work through the trimming doc (#103) and how it refers to your link tonight/tomorrow morning, and I'll follow up then with more.
Awesome! Looking forward to that. I'm also excited to hear your thoughts about this #102 (comment), but there is no rush; whenever you have time, please let me know what you think. I already released a new sub-version of my tool, which is now using it.
I was able to follow your example here and it worked well. Thanks!! ArcInstitute/ScreenPro2@fe9a27d
@abearab Nice, congrats on the release! And glad that worked well enough. I am still working on the adapter/quality trimming info. I'm trying to add some functionality as I go to make it easier to use, e.g.
As you know, it's very common to have paired-end reads, mainly in Illumina NGS technologies. I was wondering what your recommendation would be for having a single session for processing paired-end FASTQ files. Looking forward to hearing your thoughts.