Genome class to handle fasta files and chromsizes throughout package #76

LukasMahieu · 2024-12-02T16:33:25Z

Updates

Added a Genome class and a register_genome(...) function to make working with genomes easier.
Updated all functions in the package to accept a Genome instance as input while keeping backward compatibility (you can still provide a path if you want).
Added unit tests.

The advantages of this are:

The Genome class deduces the chromsizes from the FASTA file if they are not provided.
The Genome class ensures that we only open one pysam.FastaFile per session instead of several across different functions.
A user can call crested.register_genome(...) at the beginning of a session (notebook or script). Once a genome is registered, every other function that requires a genome_path and/or chromsizes falls back to the registered genome when those arguments are not provided. This is important for the functional refactor I'm working on, since many functions will require a genome as input.
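To illustrate the registration fallback described above, here is a minimal sketch of the pattern. It is not the code in this PR: SketchGenome, the module-level _REGISTERED slot and resolve_genome are hypothetical names chosen for illustration, and the real Genome class holds an opened pysam.FastaFile rather than a plain path.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Union


@dataclass
class SketchGenome:
    """Hypothetical stand-in for a genome object (not crested.Genome)."""

    fasta: str
    chrom_sizes: Optional[Dict[str, int]] = None
    name: Optional[str] = None


# Module-level slot holding the genome registered for this session.
_REGISTERED: Optional[SketchGenome] = None


def register_genome(genome: SketchGenome) -> None:
    """Remember a genome so later calls can omit it."""
    global _REGISTERED
    _REGISTERED = genome


def resolve_genome(genome: Union[SketchGenome, str, None] = None) -> SketchGenome:
    """Return an explicit genome, wrap a bare path, or fall back to the registry."""
    if isinstance(genome, SketchGenome):
        return genome
    if isinstance(genome, str):
        # Backward compatibility: a plain path is still accepted.
        return SketchGenome(fasta=genome)
    if _REGISTERED is None:
        raise ValueError("No genome provided and no genome registered.")
    return _REGISTERED
```

A function that needs a genome would call resolve_genome(genome) first, so explicit arguments always win and the registered genome is only used when nothing is passed.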

@casblaauw The genome also has an "annotations" attribute that is currently unused and not implemented, but we should use that when working with genes.

Haven't updated the tutorial yet, will do so when I finish the functional refactor.

There's one breaking change in this PR, since the crested.tl.data.AnnDataset now only expects a crested.Genome object instead of a genome_path and chromsizes. However, since this is more of a backend functionality that a normal user will never have used, I don't think this is such a big deal.

Example usage

Genome class

>>> genome = crested.Genome(
...     fasta="tests/data/test.fa",
...     chrom_sizes="tests/data/test.chrom.sizes",
... )
>>> print(genome.fasta)
<pysam.libcfaidx.FastaFile at 0x7f4d8b4a8f40>
>>> print(genome.chrom_sizes)
{'chr1': 1000, 'chr2': 2000}
>>> print(genome.name)
test
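When chrom_sizes is left out, the sizes have to come from the FASTA itself. The sketch below shows one dependency-free way to do that by counting bases per record; chrom_sizes_from_lines is a hypothetical helper, not part of crested. With pysam, the same information is available from the references and lengths attributes of an opened FastaFile, which is likely the more practical route for indexed files.

```python
from typing import Dict, Iterable, Optional


def chrom_sizes_from_lines(lines: Iterable[str]) -> Dict[str, int]:
    """Count bases per record from the lines of a well-formed FASTA file."""
    sizes: Dict[str, int] = {}
    current: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # The record name is the header text up to the first whitespace.
            current = line[1:].split()[0]
            sizes[current] = 0
        elif current is not None:
            sizes[current] += len(line)
    return sizes
```

For a file on disk this could be used as `with open(path) as handle: sizes = chrom_sizes_from_lines(handle)`.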

Registering

>>> genome = crested.Genome(
...     fasta="tests/data/hg38.fa",
...     chrom_sizes="tests/data/test.chrom.sizes",
... )
>>> crested.register_genome(genome)
INFO Genome hg38 registered.


nkempynck

looks good to me

casblaauw · 2024-12-03T13:47:13Z

Thanks for this, I was just hoping we'd get something like this! I'll give it a spin to see if I run into something, but it looks very good already.

One thing I was considering is that we should maybe add a fetch method? We have the genome object, so I feel like it'd make sense to use it to extract a sequence easily, especially in light of functions like Crested.calculate_contribution_scores_sequence. I've added it as a commit, feel free to revert it if you disagree.
You can currently do Genome.fasta.fetch(), but you need to read the pysam docs to find that. There is also the (unused?) util function crested.utils.fetch_sequences(), which I didn't know about until this PR, but I still think it'd make more sense to wrap that functionality in the genome object.

LukasMahieu · 2024-12-03T14:07:01Z

Yes, good point. We did already have crested.utils.fetch_sequences (which wasn't in any of the tutorials, only in the API docs), but it now makes more sense to have it only as a method on the Genome class.
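For readers following the fetch discussion, a rough sketch of what wrapping the underlying handle could look like is shown here. It is an illustration under assumptions, not the method added in this PR: FetchableGenome, InMemoryFasta and the strand argument are hypothetical, and InMemoryFasta only mimics the fetch(reference, start, end) call that pysam.FastaFile provides.

```python
from typing import Dict, Optional

# Translation table for complementing DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


class InMemoryFasta:
    """Stand-in for an opened FASTA handle with a fetch(reference, start, end) method."""

    def __init__(self, records: Dict[str, str]) -> None:
        self._records = records

    def fetch(
        self, reference: str, start: Optional[int] = None, end: Optional[int] = None
    ) -> str:
        # 0-based, half-open coordinates; None means "from start" / "to end".
        return self._records[reference][start:end]


class FetchableGenome:
    """Hypothetical genome wrapper that exposes fetch() directly."""

    def __init__(self, fasta: InMemoryFasta) -> None:
        self.fasta = fasta

    def fetch(
        self,
        chrom: str,
        start: Optional[int] = None,
        end: Optional[int] = None,
        strand: str = "+",
    ) -> str:
        if strand not in ("+", "-"):
            raise ValueError("strand must be '+' or '-'")
        seq = self.fasta.fetch(chrom, start, end)
        if strand == "-":
            # Reverse-complement for the minus strand.
            seq = seq.translate(_COMPLEMENT)[::-1]
        return seq
```

The point of the wrapper is discoverability: callers write genome.fetch(...) without needing to know that the sequence access is delegated to the FASTA handle.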

LukasMahieu added 14 commits November 30, 2024 15:26

genome object and registration for easier working with genomes

2b5de26

genome object and registration for easier working with genomes

1534b66

check for genome object if chromsizes not provided in inputs

579e918

unit tests and handling of path chromsizes

d3229fe

more explicit docstring

7f3a5b8

remove old debug print

be2cd39

check if fasta files exist on init of Genome

2d8edf2

remove incorrect info log about dir creating

b96da5d

examples in docstrings and _resolve_genome for backward compatibility

f5ce2d5

update funcs to accept genome object

0e29e3a

genome and register_genome doc

3862d16

bugfix - unused dense bias and output activation

83a6bf1

update pipeline test to use genome obj

30fe270

make fasta attribute into FastaFile instead of simply the path and add genome name

edab6dd

LukasMahieu requested review from nkempynck and casblaauw December 2, 2024 16:33

LukasMahieu linked an issue Dec 2, 2024 that may be closed by this pull request

Better handling of genomics files. #52

Open

correctly init the keras backend in the unit tests

61c080a

nkempynck approved these changes Dec 3, 2024


casblaauw and others added 5 commits December 3, 2024 15:15

Add fetch() to genome object

192c882

Resolve circular dependencies, add missing arg check

2a22aea

Add more tests

307282d

Fix test bug and some ruff stuff

4ddeeb9

Fix more test bugs

8be54c5
