Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Genome class to handle fasta files and chromsizes throughout package #76

Open
wants to merge 20 commits into
base: main
Choose a base branch
from

Conversation

LukasMahieu
Copy link
Collaborator

Updates

Added a genome class and a register_genome(...) function to make working with genomes easier.
Updated all functions in the package to allow for this genome instance as input while keeping backward compatibility (you can still provide a path if you want).
Added unit tests.

The advantages of this are:

  • The Genome class will deduce the chromsizes from the Fasta if not provided
  • The Genome class ensures that we only open one pysam.FastaFile in a session instead of multiple in different functions
  • A user can use crested.register_genome(...) at the beginning of a session (notebook or script). If done, all the other functions that require a genome_path and/or chromsizes will use the registered genome if not provided. This is important for the functional refactor I'm working on since many functions will require a genome as input

@casblaauw The genome also has an "annotations" attribute that is currently unused and not implemented, but we should use that when working with genes.

Haven't updated the tutorial yet, will do so when I finish the functional refactor.

There's one breaking change in this PR, since the crested.tl.data.AnnDataset now only expects a crested.Genome object instead of a genome_path and chromsizes. However, since this is more of a backend functionality that a normal user will never have used, I don't think this is such a big deal.

Example usage

Genome class

>>> genome = crested.Genome(
...     fasta="tests/data/test.fa",
...     chrom_sizes="tests/data/test.chrom.sizes",
... )
>>> print(genome.fasta)
<pysam.libcfaidx.FastaFile at 0x7f4d8b4a8f40>
>>> print(genome.chrom_sizes)
{'chr1': 1000, 'chr2': 2000}
>>> print(genome.name)
test

Registering

>>> genome = Genome(
...     fasta="tests/data/hg38.fa",
...     chrom_sizes="tests/data/test.chrom.sizes",
... )
>>> crested.register_genome(genome)
INFO Genome hg38 registered.

@LukasMahieu LukasMahieu linked an issue Dec 2, 2024 that may be closed by this pull request
Copy link
Collaborator

@nkempynck nkempynck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me

@casblaauw
Copy link
Collaborator

casblaauw commented Dec 3, 2024

Thanks for this, I was just hoping we'd get something like this! I'll give it a spin to see if I run into something, but it looks very good already.

One thing I was considering is that we should maybe add a fetch method? We have the genome object, I feel like it'd make sense that we can use that to extract a sequence easily, especially in the light of functions like Crested.calculate_contribution_scores_sequence. I've added it as a commit, feel free to reverse it if you disagree.
You can currently do Genome.fasta.fetch(), but you need to read the pysam docs to find that. There is also the (unused?) util function crested.utils.fetch_sequences() which I didn't know about until this PR, but I still think it'd make more sense to wrap that functionality in the genome object.

@LukasMahieu
Copy link
Collaborator Author

Yes, good point. We indeed already had the crested.utils.fetch_sequences (which wasn't in the tutorials anywhere, only the API docs) but now it would make more sense to only have it as a method in the Genome class

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Better handling of genomics files.
3 participants