Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

differences between this plugin's gather and sourmash gather #331

Closed
ctb opened this issue May 12, 2024 · 7 comments · Fixed by #361
Closed

differences between this plugin's gather and sourmash gather #331

ctb opened this issue May 12, 2024 · 7 comments · Fixed by #361

Comments

@ctb
Copy link
Collaborator

ctb commented May 12, 2024

Post-#298, we now get full gather results out of fastgather and fastmultigather. But there are some differences between what the plugin outputs and what the OG sourmash gather outputs 😱 .

First, note that {'filename', 'md5', 'name'} in OG gather are now {'match_filename', 'match_md5', 'match_name'}.

Also, 'potential_false_negative' is missing from plugin gather.

After that is dealt with, the following columns are the same 🎉 -

  • f_orig_query
  • median_abund
  • md5
  • gather_result_rank
  • n_unique_weighted_found
  • query_abundance
  • ksize
  • f_match_orig
  • scaled
  • f_unique_to_query
  • query_name
  • query_filename
  • average_abund
  • unique_intersect_bp
  • name
  • f_match
  • intersect_bp
  • total_weighted_hashes

rounding differences?

std_abund, f_unique_weighted appear to be different just because they are floats.

trivial/easy to fix differences

  • moltype is lowercase in plugin gather, so dna instead of DNA
  • query_md5 is truncated to 8 characters in the OG gather.
  • filename means different things in OG gather and the plugin - in the OG gather, it's the filename of the database being searched, in the plugin it's ... the filename of the sig? not sure.

real differences

  • query_n_hashes and query_bp are the original query (so, constant) in OG gather, while in the plugin they are the size of the remaining query at that rank
  • remaining_bp is just different - looks like it's just being calculated very differently.
  • max_containment_ani is just quite different...??
  • average_containment_ani is just different, too
  • query_containment_ani is also different
  • aaand match_containment_ani is also different

twilight zone differences

  • sum_weighted_found values are all the same except for in one specific row. WTF.
@ctb
Copy link
Collaborator Author

ctb commented May 12, 2024

@bluegenes here's the notebook I'm using: https://github.com/ctb/2024-debug-gather-difference/blob/main/compare-picklist.ipynb

it's mildly tricksy, because I had to force sourmash gather to use the identical set of sketches used by fastgather, via a picklist. but it seems to work ok ;)

@ctb
Copy link
Collaborator Author

ctb commented Jun 9, 2024

It looks like remaining_bp is being calculated correctly over in sourmash-rs land (once sourmash-bio/sourmash#3193 is merged, at least), and Python-land is doing it incorrectly:

sourmash-bio/sourmash#3194

which suggests (but does not confirm ;) that remaining_bp is correct in this plugin's implementation.

@ctb
Copy link
Collaborator Author

ctb commented Jun 10, 2024

Running with all these fixes, I now see the following differences remaining between the sourmash gather and sourmash scripts fastmultigather:

  • filename
  • max_containment_ani
  • sum_weighted_found
  • query_bp
  • average_containment_ani
  • query_n_hashes
  • match_containment_ani
  • query_containment_ani

Results over in ctb/2024-debug-gather-difference#1

@bluegenes
Copy link
Contributor

bluegenes commented Jun 14, 2024

After #353, remaining changes needed:

-sum_weighted_found (only in second to last result)

  • I think this may actually be correct here? At least, when I print out the hashes + abundances, the numbers check out. Could we be missing something or calculating wrong in sourmash? Note we calculate this differently: In sourmash, we subtract (n_total_weighted - n_weighted_found), whereas in branchwater, we are adding (total_weighted_found + n_weighted_found).

I think the following require sourmash core changes:

  • filename
    • I'm not sure how to get the zipfile name, as is used for sourmash gather's filename. I don't think we store it after reading in the collection. Might need to add a param.
  • max_containment_ani
  • average_containment_ani
  • query_containment_ani
  • match_containment_ani

@ctb
Copy link
Collaborator Author

ctb commented Jun 18, 2024

Wondering if maybe ANI numbers in Rust are generally busted - per sourmash-bio/sourmash#3134 (comment). Something to investigate.

@ctb
Copy link
Collaborator Author

ctb commented Jun 18, 2024

@bluegenes nah, looks like the ANI calculation code is good from pairwise/multisearch. So it's probably something gather specific! 😅

@ctb
Copy link
Collaborator Author

ctb commented Jun 19, 2024

I COME BEARING NEWS!

With the ANI fixes in #361 and sourmash-bio/sourmash#3218, the only remaining difference is in match_filename/filename, which is ok for now.

🎉

Just to be clear:

sourmash gather, fastgather, fastmultigather, and fastmultigather+rocksdb all return the SAME RESULTS now when using the comparison approach in ctb/2024-debug-gather-difference#1. Yay!

The only remaining tricky bit that explained the sum_weighted_found and also led to differences in the rocksdb results was that there were two Sphingomonas matches:

  • GCF_000379045.1 Sphingomonas melonis DAPP-PG 224 strain=DAPP-PG 224, ASM37904v1
  • GCF_000419565.1 Sphingomonas melonis FR1 strain=FR1, ASM41956v1

and somehow these two had equivalent unweighted matches (so you could pick either one legitimately!) but led to two different weighted results. I removed ASM41956v1 from the sig.zip file and that led to identical results across the board!!

🎉

ctb added a commit to sourmash-bio/sourmash that referenced this issue Jun 19, 2024
Calculate ANI of matches against original query with `f_orig_query` and
`f_match_orig`, instead of against `f_unique_to_query` and `f_match`.

This fixes the ANI differences between `sourmash gather` and RocksDB
branchwater gather for the columns `query_containment_ani`,
`match_containment_ani`, `max_containment_ani`, and
`average_containment_ani`.

Refs:
* Used by
sourmash-bio/sourmash_plugin_branchwater#361
* Fixes RocksDB-based calculations for
sourmash-bio/sourmash_plugin_branchwater#331
@ctb ctb closed this as completed in #361 Jun 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants