How to interpret the results from sourmash gather --picklist from prefetch? #2833

yuzie0314 · 2023-11-02T01:17:22Z

Hello @ctb thanks for your contribution to this amazing software,
I referred to this topic #2095 wanting to run gather on extremely big index file, and followed your suggestion using formally correct option. I obtained a csv file from the gather command:

sourmash gather ${query_sig} \\
    --picklist ${combined_prefetch_csv}::prefetch \\
    ${params.representatives_db}/*.zip \\
    --output **${meta}_find_genomes.csv**

This csv file only shows the db_name (*.sbt.zip) in column 9, namely, filename, which make me a little confused.
What I expected is that I can retrieve the results to tell me which genomes within each db present in my query signature.

Or I can use the results from prefetch command to gain this information?

Thanks in advance and wish you have a nice day,
YZ

The text was updated successfully, but these errors were encountered:

ctb · 2023-11-02T03:33:41Z

Hi @yuzie0314 the contents of column 8 of the gather output, name, should the name of matching signatures from the database. It's named match_name in the prefetch file. Does that work?

yuzie0314 · 2023-11-07T02:01:29Z

Hi @ctb , sorry for the late response.
Yea, I didn't see any contents in the filename [10] or name [11] columns.

the following is the header including first row:

**intersect_bp,f_orig_query,f_match,f_unique_to_query,f_unique_weighted,average_abund,median_abund,std_abund,filename,name,md5,f_match_orig,unique_intersect_bp,gather_result_rank,remaining_bp,query_filename,query_name,query_md5,query_bp,ksize,moltype,scaled,query_n_hashes,query_abundance,query_containment_ani,match_containment_ani,average_containment_ani,max_containment_ani,potential_false_negative,n_unique_weighted_found,sum_weighted_found,total_weighted_hashes**
3255000,0.043493365758494905,0.47078391669077235,0.043493365758494905,0.043493365758494905,,,,/fsx/LongContigMapper/inputs/ref_genomes/sm_mash/uhgg_representatives_db/db-8.sbt.zip,,19752a9a9f605e68917194a28f8d6fea,0.47078391669077235,3255000,0,68735000,AMD0002201105.clean.fasta,,b224eb71,74839000,21,DNA,1000,74839,False,0.8613169638031671,0.9647617509109547,0.913039357357061,0.9647617509109547,False,,3255,74839

My sourmash version is 4.8.2 latest versin on conda.
Another thing is that how can I evaluate the abundance from these values provided by sourmash, could you please guide me this?

ctb · 2023-11-07T13:43:02Z

hi @yuzie0314, the only identifying thing in your match is the md5 column, which for the first row looks like 19752a9a9f605e68917194a28f8d6fea. You can add a more informative name by specifying it when sketching your database signatures, or you could run sourmash sig rename on each of your database signatures. I think the former (naming when sketching) is nicer and more convenient; we don't have good mass renaming approaches implemented, I'm afraid.

How did you build the sketches for the database you're searching, if I might ask?

In order to get abundance information, you'll need to sketch your query with -p abund - e.g. something like sourmash sketch dna AMD0002201105.clean.fasta -p abund - and then rerun the gather. That will populate the abundance columns.

ctb · 2023-11-07T13:43:39Z

(a script for large scale renaming is here: #1883 - but it may not work well for your use case)

yuzie0314 · 2023-11-08T01:59:27Z

Hi @ctb,
Thanks for your informative descriptions, but I do have some doubts.
I have check the rename signature part, and I might have two options from your document tutorials:

1.0 sourmash sketch dna genomeX.fa --name genomeX -o genomeX.sig
1.1 sourmash sketch dna genomeX.fa --name-from-first -o genomeX.sig
2. sourmash signature rename file1.sig "new name" -o renamed.sig

To retrieve the abundance information, the -p abund flag should be added to each reference genome and the query genome, correct? Or I can only add this to my query signature?
I just checked the script mass-rename.py, the ident is md5 or not?
Really appreciate your efforts to develop this great and amazing tools and spend your time on explaining thie ideas behind your software to me.

Many thanks,
YZ

ctb mentioned this issue Nov 2, 2023

provide complete documentation on gather and prefetch output columns #2812

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How to interpret the results from sourmash gather --picklist from prefetch? #2833

How to interpret the results from sourmash gather --picklist from prefetch? #2833

yuzie0314 commented Nov 2, 2023 •

edited

Loading

ctb commented Nov 2, 2023

yuzie0314 commented Nov 7, 2023 •

edited

Loading

ctb commented Nov 7, 2023

ctb commented Nov 7, 2023

yuzie0314 commented Nov 8, 2023 •

edited

Loading

How to interpret the results from sourmash gather --picklist from prefetch? #2833

How to interpret the results from sourmash gather --picklist from prefetch? #2833

Comments

yuzie0314 commented Nov 2, 2023 • edited Loading

ctb commented Nov 2, 2023

yuzie0314 commented Nov 7, 2023 • edited Loading

ctb commented Nov 7, 2023

ctb commented Nov 7, 2023

yuzie0314 commented Nov 8, 2023 • edited Loading

yuzie0314 commented Nov 2, 2023 •

edited

Loading

yuzie0314 commented Nov 7, 2023 •

edited

Loading

yuzie0314 commented Nov 8, 2023 •

edited

Loading