
Why different handling between GFF and mzml/genbank in polars. #93

Open
tshauck opened this issue Feb 5, 2024 · 4 comments

Comments

@tshauck
Member

tshauck commented Feb 5, 2024

All three formats have map columns, but GFF converts cleanly into a polars DataFrame while the latter two (mzML and GenBank) do not.

@abearab
Contributor

abearab commented Mar 9, 2024

@tshauck have you seen https://github.com/BiocPy?

@tshauck
Member Author

tshauck commented Mar 11, 2024

@abearab I've seen it, but not used it. I am a fan of those packages in R (e.g. granges). I also know there are some other analogues in Python for some of those Bioconductor packages (e.g. AnnData).

@abearab
Contributor

abearab commented Mar 11, 2024

> I've seen it, but not used it. I am a fan of those packages in R (e.g. granges). I also know there's some other analogues in Python for some of those bioconductor packages (e.g. AnnData).

Yeah, I like AnnData and the scverse ecosystem a lot. However, a good implementation of granges for Python has been missing for a long time! I thought granges could be very relevant to your implementation of annotation file formats (e.g. GTF, GFF, BED, BAM). This is just another suggestion, feel free to ignore it :)

@tshauck
Member Author

tshauck commented Mar 12, 2024

100% -- you can actually do a bit of granges-style work via SQL joins, but it's not quite as intuitive or as specialized to genomic intervals. E.g. say you ran bakta and wanted to get CDSs where a spacer annotation is within 100bp of a CDS:

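-- CDS features from the bakta annotation
WITH cds AS (
  SELECT *
  FROM gff_scan('bakta.gff')
  WHERE type = 'cds'
), spacers AS (
  -- CRISPR spacer features from the same file
  SELECT *
  FROM gff_scan('bakta.gff')
  WHERE type = 'crispr-spacer'
)

-- Keep CDS/spacer pairs that overlap or lie within 100bp of each other:
-- the spacer must start before the CDS end + 100 and end after the CDS start - 100.
SELECT *
FROM cds
  JOIN spacers
    ON spacers.start <= (cds.end + 100) AND spacers.end >= (cds.start - 100)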

I would certainly like to make it easier to do more complex granges stuff.
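
For comparison, the "spacer within 100bp of a CDS" lookup described above can be sketched as a plain interval predicate in Python. This is only an illustration of the interval logic, not biobear's or granges' API: the `Interval` type and the `within_distance` and `nearby_pairs` helpers are hypothetical names, and the coordinates are made up. It also assumes GFF-style 1-based inclusive coordinates and that both features must sit on the same sequence.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """Hypothetical minimal feature record (GFF-style, 1-based inclusive)."""
    seqname: str
    start: int
    end: int


def within_distance(a: Interval, b: Interval, max_gap: int = 100) -> bool:
    """True if a and b share a sequence and overlap or lie within max_gap bp."""
    if a.seqname != b.seqname:
        return False
    # b must start no later than max_gap past a's end,
    # and end no earlier than max_gap before a's start.
    return b.start <= a.end + max_gap and b.end >= a.start - max_gap


def nearby_pairs(cds, spacers, max_gap=100):
    """Naive O(n*m) pairing; fine for a sketch, slow for large annotations."""
    return [(c, s) for c in cds for s in spacers if within_distance(c, s, max_gap)]


# Made-up example coordinates.
cds = [Interval("contig_1", 1000, 2000), Interval("contig_1", 5000, 6000)]
spacers = [
    Interval("contig_1", 2050, 2080),  # 50bp downstream of the first CDS
    Interval("contig_1", 3000, 3030),  # too far from either CDS
    Interval("contig_2", 1500, 1530),  # different contig
]
pairs = nearby_pairs(cds, spacers)
```

A dedicated genomic-interval library would typically replace the nested loop with a sorted sweep or an interval tree, which is part of what makes granges-style tooling more convenient than hand-written joins.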
