Add human (and animal) reference genome to prepared databases #2717

Open
dportik opened this issue Aug 16, 2023 · 8 comments

Comments

@dportik

dportik commented Aug 16, 2023

Hi Titus et al,
Given the recent fiasco related to mapping reads to microbial databases without human references (links at bottom), it might be a good time to create a small human genome database for use with sourmash. A standalone database on the database page would be ideal, so that researchers can include it alongside the other databases of interest.

Thanks for considering!

social media discussion: https://twitter.com/StevenSalzberg1/status/1686350449069244416
pre-print: https://doi.org/10.1101/2023.07.28.550993

@luizirber
Member

On the "raw" side [1] there are both GRCh38.p14 and T2T-CHM13v2.0 signatures in wort, would that work?

Footnotes

  1. just downloaded the data and calculated a signature, no other pre-processing like repeat masking

@dportik
Author

dportik commented Aug 22, 2023

Yep! Those should be plenty.

@ctb ctb changed the title Add human reference to prepared databases Add human reference genome to prepared databases Sep 28, 2023
@ctb
Contributor

ctb commented May 11, 2024

Repo to sketch hg38, including all unmapped chromosomes: https://github.com/ctb/2024-human-sketch
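For readers new to what "sketching" a genome means here, the following is a toy illustration of the FracMinHash idea that sourmash is built on: hash every canonical k-mer and keep only the hashes below a cutoff of `2**64 / scaled`. It is a minimal sketch under stated assumptions, not the code in the linked repo. In particular, it uses SHA-256 from the standard library as a stand-in hash, whereas sourmash uses MurmurHash3, so these hashes are not compatible with real sourmash signatures. The function names `frac_sketch` and `containment` are hypothetical.

```python
import hashlib

MAX_HASH = 2**64
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse-complement an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def frac_sketch(seq, ksize=31, scaled=1000):
    """Return the set of canonical k-mer hashes below MAX_HASH // scaled.

    Toy stand-in: SHA-256 truncated to 64 bits replaces MurmurHash3.
    """
    cutoff = MAX_HASH // scaled
    seq = seq.upper()
    hashes = set()
    for i in range(len(seq) - ksize + 1):
        kmer = seq[i:i + ksize]
        if set(kmer) - set("ACGT"):
            continue  # skip k-mers containing N or other ambiguous bases
        # Use the lexicographically smaller strand so both strands agree.
        canonical = min(kmer, revcomp(kmer))
        digest = hashlib.sha256(canonical.encode()).digest()
        value = int.from_bytes(digest[:8], "big")
        if value < cutoff:
            hashes.add(value)
    return hashes


def containment(query, reference):
    """Fraction of query hashes that also appear in the reference sketch."""
    return len(query & reference) / len(query) if query else 0.0
```

With a host sketch built this way, `containment(sample_sketch, host_sketch)` gives a rough estimate of how much of a sample's k-mer content the host genome accounts for, which is the kind of question a human reference database is meant to answer.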

@ctb
Contributor

ctb commented May 11, 2024

note: decontaminating human WGS samples, #3151

@ctb
Contributor

ctb commented May 11, 2024

download at: https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db/hg38/hg38-entire.sig.zip

@ctb
Contributor

ctb commented Dec 6, 2024

@ctb
Contributor

ctb commented Dec 7, 2024

added here - #3422 - should add the t2t ones, too, though.

@ctb ctb changed the title Add human reference genome to prepared databases Add human (and animal) reference genome to prepared databases Dec 8, 2024
@ctb
Contributor

ctb commented Dec 8, 2024

@ccbaumler suggests adding more animal genomes over in #3422 (comment):

  1. rat https://www.ncbi.nlm.nih.gov/datasets/taxonomy/10116/

  2. xenopus https://www.ncbi.nlm.nih.gov/datasets/taxonomy/8355/

  3. zebrafish https://www.ncbi.nlm.nih.gov/datasets/taxonomy/7955/

  4. drosophila https://ncbi.nlm.nih.gov/datasets/taxonomy/7227/

  5. c. elegans https://www.ncbi.nlm.nih.gov/datasets/taxonomy/6239/

Rather than doing these piecemeal, I think we should come up with a set of accessions we care about and then use directsketch to get them, so for now I'm punting on that suggestion, but it is definitely the way we want to go!
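One way to make the "set of accessions we care about" concrete is a small, validated CSV that a batch sketching tool can consume. The sketch below is an illustration only: the `accession,name` column layout is an assumption about what directsketch expects and should be checked against its documentation, the helper `write_accession_csv` is hypothetical, and the two RefSeq accessions shown for the human assemblies mentioned earlier in this thread should be verified against NCBI before use. Accessions for the animal genomes are deliberately left out rather than guessed.

```python
import csv
import io
import re

# Matches GenBank/RefSeq assembly accessions such as GCF_000001405.40.
ACCESSION_RE = re.compile(r"^GC[AF]_\d{9}\.\d+$")


def write_accession_csv(entries, handle):
    """Write (accession, name) pairs as CSV; reject malformed or duplicate accessions."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["accession", "name"])  # assumed column layout
    seen = set()
    for accession, name in entries:
        if not ACCESSION_RE.match(accession):
            raise ValueError(f"not an assembly accession: {accession!r}")
        if accession in seen:
            raise ValueError(f"duplicate accession: {accession!r}")
        seen.add(accession)
        writer.writerow([accession, name])


# Assumed accessions for the two human assemblies; verify before use.
entries = [
    ("GCF_000001405.40", "GRCh38.p14 Homo sapiens"),
    ("GCF_009914755.1", "T2T-CHM13v2.0 Homo sapiens"),
]

buffer = io.StringIO()
write_accession_csv(entries, buffer)
print(buffer.getvalue(), end="")
```

Validating the list up front means a typo in one accession fails immediately instead of partway through a long download-and-sketch run.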

ctb added a commit that referenced this issue Dec 8, 2024
Adds common hosts and also hg38.

Tackles #2717

<img width="667" alt="Screenshot 2024-12-07 at 9 17 40 AM"
src="https://github.com/user-attachments/assets/bfeff595-1759-4569-8adb-1e950f75a03e">

## Rendered preview:
3 participants