Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to interpret the results from sourmash gather --picklist from prefetch? #2833

Open
yuzie0314 opened this issue Nov 2, 2023 · 5 comments

Comments

@yuzie0314
Copy link

yuzie0314 commented Nov 2, 2023

Hello @ctb thanks for your contribution to this amazing software,
I referred to this topic #2095 wanting to run gather on extremely big index file, and followed your suggestion using formally correct option. I obtained a csv file from the gather command:

sourmash gather ${query_sig} \\
    --picklist ${combined_prefetch_csv}::prefetch \\
    ${params.representatives_db}/*.zip \\
    --output **${meta}_find_genomes.csv**

This csv file only shows the db_name (*.sbt.zip) in column 9, namely, filename, which make me a little confused.
What I expected is that I can retrieve the results to tell me which genomes within each db present in my query signature.

Or I can use the results from prefetch command to gain this information?

Thanks in advance and wish you have a nice day,
YZ

@ctb
Copy link
Contributor

ctb commented Nov 2, 2023

Hi @yuzie0314 the contents of column 8 of the gather output, name, should the name of matching signatures from the database. It's named match_name in the prefetch file. Does that work?

@yuzie0314
Copy link
Author

yuzie0314 commented Nov 7, 2023

Hi @ctb , sorry for the late response.
Yea, I didn't see any contents in the filename [10] or name [11] columns.

the following is the header including first row:

**intersect_bp,f_orig_query,f_match,f_unique_to_query,f_unique_weighted,average_abund,median_abund,std_abund,filename,name,md5,f_match_orig,unique_intersect_bp,gather_result_rank,remaining_bp,query_filename,query_name,query_md5,query_bp,ksize,moltype,scaled,query_n_hashes,query_abundance,query_containment_ani,match_containment_ani,average_containment_ani,max_containment_ani,potential_false_negative,n_unique_weighted_found,sum_weighted_found,total_weighted_hashes**
3255000,0.043493365758494905,0.47078391669077235,0.043493365758494905,0.043493365758494905,,,,/fsx/LongContigMapper/inputs/ref_genomes/sm_mash/uhgg_representatives_db/db-8.sbt.zip,,19752a9a9f605e68917194a28f8d6fea,0.47078391669077235,3255000,0,68735000,AMD0002201105.clean.fasta,,b224eb71,74839000,21,DNA,1000,74839,False,0.8613169638031671,0.9647617509109547,0.913039357357061,0.9647617509109547,False,,3255,74839

My sourmash version is 4.8.2 latest versin on conda.
Another thing is that how can I evaluate the abundance from these values provided by sourmash, could you please guide me this?

@ctb
Copy link
Contributor

ctb commented Nov 7, 2023

hi @yuzie0314, the only identifying thing in your match is the md5 column, which for the first row looks like 19752a9a9f605e68917194a28f8d6fea. You can add a more informative name by specifying it when sketching your database signatures, or you could run sourmash sig rename on each of your database signatures. I think the former (naming when sketching) is nicer and more convenient; we don't have good mass renaming approaches implemented, I'm afraid.

How did you build the sketches for the database you're searching, if I might ask?


In order to get abundance information, you'll need to sketch your query with -p abund - e.g. something like sourmash sketch dna AMD0002201105.clean.fasta -p abund - and then rerun the gather. That will populate the abundance columns.

@ctb
Copy link
Contributor

ctb commented Nov 7, 2023

(a script for large scale renaming is here: #1883 - but it may not work well for your use case)

@yuzie0314
Copy link
Author

yuzie0314 commented Nov 8, 2023

Hi @ctb,
Thanks for your informative descriptions, but I do have some doubts.
I have check the rename signature part, and I might have two options from your document tutorials:

1.0 sourmash sketch dna genomeX.fa --name genomeX -o genomeX.sig
1.1 sourmash sketch dna genomeX.fa --name-from-first -o genomeX.sig
2. sourmash signature rename file1.sig "new name" -o renamed.sig

To retrieve the abundance information, the -p abund flag should be added to each reference genome and the query genome, correct? Or I can only add this to my query signature?
I just checked the script mass-rename.py, the ident is md5 or not?
Really appreciate your efforts to develop this great and amazing tools and spend your time on explaining thie ideas behind your software to me.

Many thanks,
YZ

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants