This repository has been archived by the owner on Mar 17, 2023. It is now read-only.

Index fasta files by containing hashes, and bam files by containing read ids #50

Open
olgabot opened this issue May 14, 2020 · 2 comments

olgabot commented May 14, 2020

Use tools from spacegraphcats to do the indexing; see spacegraphcats/spacegraphcats#273.


olgabot commented May 14, 2020

Here's a schematic of what I'm thinking of doing:

[Image: schematic, "Screen Shot 2020-05-14 at 11 50 37 AM"]

I want to be able to query with a hash, get all reads containing that hash, and then use those read IDs to query the BAM file. I think this is possible given make_bgzf.py and the rest of the tools in the spacegraphcats/utils/bgzf/ folder.

But then will all the querying need to happen with SQLite as in label_cdbg.py? I'm afraid of SQL...

cc @ctb


ctb commented May 18, 2020

This should be lightweight and straightforward if you are using downsampled hashes (either regular MinHash or scaled hashes, as in sourmash). "All k-mers" is hard; you might look at BLight (https://www.biorxiv.org/content/10.1101/546309v2). Happy to put you in touch with people in that group!
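To make "scaled hash" concrete, here is a rough sketch of the downsampling idea: hash every k-mer and keep only hashes below `2**64 / scaled`, so roughly 1 in `scaled` k-mers survives. This is not sourmash code. The function names are made up for illustration, `blake2b` is a stand-in (sourmash uses MurmurHash3), and reverse-complement canonicalization is skipped.

```python
import hashlib

MAX_HASH = 2**64  # size of the 64-bit hash space


def kmer_hash(kmer: str) -> int:
    """Hypothetical stand-in hash: 8 bytes of blake2b read as an unsigned int."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def scaled_hashes(sequence: str, ksize: int, scaled: int) -> set:
    """Return the k-mer hashes of `sequence` that fall below MAX_HASH // scaled."""
    threshold = MAX_HASH // scaled
    kept = set()
    for start in range(len(sequence) - ksize + 1):
        hashval = kmer_hash(sequence[start:start + ksize])
        if hashval < threshold:
            kept.add(hashval)
    return kept
```

Because the cutoff depends only on the hash value, the same k-mer is kept or dropped consistently in every read and every file, which is what makes the retained hashes usable as index keys.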

I have been using sqlite for ages, because it's so blindingly fast that there's no hope of competing. See http://ivory.idyll.org/blog/storing-and-retrieving-sequences.html.

sqlite is also ridiculously robust and well tested, and very widely used, with interfaces in most languages. Well worth the time investment in my experience.
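For a sense of how little SQL this needs, here is a minimal sketch of a hash-to-read-ID index using Python's built-in `sqlite3` module. The table name, column names, and function names are assumptions made for illustration; this is not the schema used by label_cdbg.py. Hash values are stored as text because SQLite integers are signed 64-bit and unsigned 64-bit hashes can overflow them.

```python
import sqlite3


def build_index(conn: sqlite3.Connection, read_hashes: dict) -> None:
    """Store (hash, read ID) pairs; `read_hashes` maps read ID -> iterable of hashes."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS hash_to_read "
        "(hashval TEXT NOT NULL, read_id TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO hash_to_read (hashval, read_id) VALUES (?, ?)",
        (
            (str(hashval), read_id)
            for read_id, hashes in read_hashes.items()
            for hashval in hashes
        ),
    )
    # The index on hashval is what makes per-hash lookups fast.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_hashval ON hash_to_read (hashval)"
    )
    conn.commit()


def reads_containing(conn: sqlite3.Connection, hashval: int) -> list:
    """Return the sorted read IDs whose hash set contains `hashval`."""
    rows = conn.execute(
        "SELECT read_id FROM hash_to_read WHERE hashval = ? ORDER BY read_id",
        (str(hashval),),
    )
    return [row[0] for row in rows]
```

With a connection from `sqlite3.connect(":memory:")` or a file path, `build_index` is called once and `reads_containing` per query. The returned read IDs would then be the keys for pulling records out of the BAM file, which is not shown here.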
