Feature request: control info-file column separator #816

Open
ramongallego opened this issue Nov 6, 2024 · 5 comments

@ramongallego

I use the info-file quite often, either with awk one-liners or by importing it into R, and it has worked very well with Illumina data. Now I am trying to use it with Nanopore data, and there is a catch: the fields in the Nanopore FASTQ header are separated by tabs, just like the columns of the info-file, which makes these files much harder to parse. Would it be possible, in a future version, to add a way of choosing the column separator for the info-file?

  • Cutadapt v4.6 and Python version 3.10.3
  • Installed with miniconda
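
To illustrate the problem, here is a minimal sketch in Python. The row layout is a simplified, hypothetical stand-in for an info-file line (read name followed by three fields), not Cutadapt's exact column set, and both read names are made up.

```python
# Hypothetical, simplified info-file rows: read name followed by three fields.
illumina_name = "M01234:1:000000000-ABCDE:1:1101:15000:1000 1:N:0:1"
nanopore_name = "54c591fa-c560\tst:Z:2024-10-17\tRG:Z:example"  # tabs inside the header


def make_row(name, fields):
    """Join a read name and the remaining fields with tabs, as a TSV row."""
    return "\t".join([name, *fields])


fields = ["-1", "ACGT", "IIII"]
illumina_cols = make_row(illumina_name, fields).split("\t")
nanopore_cols = make_row(nanopore_name, fields).split("\t")

print(len(illumina_cols))  # 4 columns, as expected
print(len(nanopore_cols))  # 6 columns: the two header tabs add two extra
```

Any parser that splits on tabs therefore sees a varying number of columns depending on how many tags the header carries.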
@marcelm
Owner

marcelm commented Nov 7, 2024

Interesting. Where do these tab characters come from? Looking at some of the Nanopore data that I have here, I don’t see any.

If possible, I try to avoid adding options to Cutadapt if I can just make the default behavior better. I’m wondering whether an alternative would be to replace all tab characters in the read header with a space character. I consider it a bug that this isn’t done at the moment because the output is an invalid TSV otherwise, as you found out.

@rhpvorderman
Collaborator

samtools fastq with the -T flag generates tab-delimited fields.
Would quoting the name be an option?
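
A rough sketch of what quoting could look like, using Python's standard csv module rather than anything Cutadapt currently does. The read name and the row layout are hypothetical.

```python
import csv
import io

# Hypothetical header with embedded tabs, followed by three made-up fields.
name = "read1\tst:Z:2024-10-17\tDS:Z:gpu:NVIDIA GeForce"
row = [name, "-1", "ACGT", "IIII"]

buf = io.StringIO()
writer = csv.writer(
    buf, delimiter="\t", quoting=csv.QUOTE_MINIMAL, lineterminator="\n"
)
writer.writerow(row)  # fields containing the delimiter are wrapped in quotes
line = buf.getvalue()

# A quote-aware reader restores the original four fields.
parsed = next(csv.reader(io.StringIO(line), delimiter="\t"))
print(parsed == row)

# A naive split on tabs still sees the extra columns.
naive_cols = line.rstrip("\n").split("\t")
print(len(naive_cols))
```

The trade-off is that quoting only helps consumers that understand it; a one-liner that simply splits on tabs would still see the extra columns.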

@ramongallego
Author

@marcelm Attached is the first line of a FASTQ file generated with dorado basecaller v0.7 using the --emit-fastq option (so it is not a FASTQ that I extracted from a BAM file afterwards).
I was worried that substituting the tabs with spaces would take a lot of time, but it might be the best approach. I also realized that any other column separator I can think of is also a valid quality-score character, so choosing one would cause even more unpredictable parsing issues.
good_length.txt
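
The point about alternative separators can be checked with a small sketch. It assumes Phred+33 encoded qualities, which use the printable ASCII characters 33 to 126; the list of candidate separators is only for illustration.

```python
# Phred+33 quality strings use the printable ASCII characters '!' (33) to '~' (126).
QUALITY_CHARS = {chr(code) for code in range(33, 127)}

# Illustrative candidates for a column separator.
candidates = {"tab": "\t", "space": " ", "comma": ",", "semicolon": ";", "pipe": "|"}

# True means the character can occur inside a quality string.
clashes = {label: ch in QUALITY_CHARS for label, ch in candidates.items()}
for label, clash in clashes.items():
    print(f"{label}: can appear in a quality string = {clash}")
```

Only tab and space fall outside the quality alphabet, and spaces already occur inside read headers.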

@marcelm
Owner

marcelm commented Nov 18, 2024

Quoting from the read header you attached:

@54c591fa-c560-405b-bc82-b3cd603b84fc	st:Z:2024-10-17T01:47:46.145+00:00	RG:Z:bf71225953a48fb33134df81086f6e5d64deeca6_dna_r10.4.1_e8.2_400bps_sup@v5.0.0	DS:Z:gpu:NVIDIA GeForce RTX 3070 Laptop GPU

This is apparently intended to be used by a read mapper to be added to its SAM output, such as with BWA-MEM’s -C option:

-C            append FASTA/FASTQ comment to SAM output

This is kind of the inverse of samtools fastq -T that Ruben mentioned.

So if you want to let Cutadapt output an info file, manipulate the info file and then write back a FASTQ file that would still be usable in this way, then something needs to be done to the tabs that is reversible. Just replacing them with spaces won’t work because then they cannot be distinguished from spaces. Even in your example, there’s already a value NVIDIA GeForce RTX 3070 Laptop GPU that contains spaces.

I’m not sure what is best here. Maybe replace tab with backslash plus t ("\\t")?
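
A sketch of what such a reversible replacement could look like. The function names are hypothetical and not part of Cutadapt. The sketch also escapes literal backslashes, which goes beyond the suggestion above; without that step, a header that already contains a backslash followed by t could not be told apart from an escaped tab.

```python
def escape_name(name: str) -> str:
    """Escape backslashes first, then tabs, so the mapping stays reversible."""
    return name.replace("\\", "\\\\").replace("\t", "\\t")


def unescape_name(escaped: str) -> str:
    """Invert escape_name by scanning the string from left to right."""
    out = []
    i = 0
    while i < len(escaped):
        ch = escaped[i]
        if ch == "\\" and i + 1 < len(escaped):
            nxt = escaped[i + 1]
            # Backslash + t is a tab; backslash + backslash is one backslash.
            out.append("\t" if nxt == "t" else nxt)
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


# Made-up header: two real tabs, spaces in a value, and a literal backslash + t.
header = "read1\tDS:Z:gpu:NVIDIA GeForce RTX 3070 Laptop GPU\tXX:Z:a\\tb"

escaped = escape_name(header)
print("\t" in escaped)                   # no raw tabs remain
print(unescape_name(escaped) == header)  # the round trip restores the header

# Replacing tabs with spaces, by contrast, cannot be undone unambiguously.
lossy = header.replace("\t", " ")
print(lossy.split(" "))
```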

@ramongallego
Author

ramongallego commented Nov 20, 2024 via email
