feat: Add DocumentNDCGEvaluator component #8419
Conversation
Pull Request Test Coverage Report for Build 11124601375
Warning: This coverage report may be inaccurate. This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
💛 - Coveralls
`ground_truth_documents` and `retrieved_documents` must have the same length.

:param ground_truth_documents:
    A list of expected documents for each question with relevance scores or sorted by relevance.
In light of the above comments, maybe this can also be refined. Currently, it sounds like we are expecting a list of documents.
I updated the docstring. Please let me know whether it reads better now or whether it was better before.
```python
relevant_id_to_score = {doc.id: doc.score for doc in gt_docs}
for i, doc in enumerate(ret_docs):
    if doc.id in relevant_id_to_score:  # TODO Related to https://github.com/deepset-ai/haystack/issues/8412
        # If the gt document has a float score, use it; otherwise, use the inverse of the rank
```
Why are we using the inverse of the rank as a fallback? Effectively this would double the "rank discount" of the retrieved document: once by dividing by `(i + 1)` in line 85 and again by dividing by `log2(i + 2)` in line 86.
I guess a better fallback would be to just use the value 1, which would translate into a simple binary relevance schema according to https://en.wikipedia.org/wiki/Discounted_cumulative_gain
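To make the double discount concrete, here is a minimal standalone sketch (the document ids and the retrieval order are invented for illustration; this is not the component's actual code):

```python
import math

# Hypothetical example: three relevant ground-truth document ids, all retrieved.
gt_ids = {"a", "b", "c"}
retrieved_ids = ["a", "b", "c", "x"]

# Inverse-of-rank fallback: relevance 1 / (i + 1) is discounted again by log2(i + 2),
# which is the "double rank-discount" described above.
dcg_inverse_rank = sum(
    (1 / (i + 1)) / math.log2(i + 2)
    for i, doc_id in enumerate(retrieved_ids)
    if doc_id in gt_ids
)

# Binary-relevance fallback: each relevant document contributes 1,
# discounted only once by its retrieval rank.
dcg_binary = sum(
    1 / math.log2(i + 2)
    for i, doc_id in enumerate(retrieved_ids)
    if doc_id in gt_ids
)

print(round(dcg_inverse_rank, 3), round(dcg_binary, 3))  # ~1.482 vs ~2.131
```

With the inverse-of-rank fallback, even a perfect retrieval of all relevant documents yields a smaller DCG than the binary variant, because each hit is penalized twice for its position.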
My idea was that the user can provide the relevant documents as a sorted list without scores. With the current fallback, the retrieved documents get the highest NDCG score only if all relevant documents are retrieved in this particular order.
With a fallback to 1, the order of the relevant documents wouldn't matter anymore. I agree that's then simple binary relevance. Happy to change the fallback to that if users benefit more from it.
@bilgeyucel You wanted to pass a sorted list of documents without scores, right?
@julian-risch yes, I'm using the HotPot QA dataset from Hugging Face and it doesn't provide scores.
If we change the fallback to binary relevance, what you could do is calculate scores yourself before passing the documents to the DocumentNDCGEvaluator. For example:

```python
for i, doc in enumerate(docs, 1):
    doc.score = 1 / i
```

That would work for you too, right?
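Spelled out end to end, that workaround could look roughly like the following (the import path and the exact `run()` signature are assumptions inferred from the docstring excerpt above, and the documents are made-up examples):

```python
from haystack import Document
from haystack.components.evaluators import DocumentNDCGEvaluator  # assumed import path

# Ground-truth documents sorted by relevance but without scores (e.g. from HotPot QA).
gt_docs = [
    Document(content="Paris is the capital of France."),
    Document(content="France is in Europe."),
    Document(content="The Eiffel Tower is in Paris."),
]

# Assign graded scores from the position yourself, so the order still matters
# once the component falls back to binary relevance.
for i, doc in enumerate(gt_docs, 1):
    doc.score = 1 / i

retrieved_docs = [
    Document(content="France is in Europe."),
    Document(content="Paris is the capital of France."),
]

evaluator = DocumentNDCGEvaluator()
result = evaluator.run(ground_truth_documents=[gt_docs], retrieved_documents=[retrieved_docs])
print(result)
```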
@julian-risch I can understand the intuition now. Still, I'd probably not make this the default behavior: when supplying ground-truth docs, I wouldn't expect their order to make a difference, tbh.
And if you really need that, you could simply pass scores as you showed in the preceding comment.
Anyway, I think there is an error in the implementation of the intuition. If I understand it correctly, document relevance should be based on the order of the passed ground-truth docs. In the current implementation it's instead based on the order of the retrieved documents: `relevance = 1 / (i + 1)`, where `i` is the index of the retrieved doc, not the ground-truth doc.
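For illustration, a minimal standalone sketch of what the order-based intuition would require, with hypothetical stand-in types rather than the PR's code:

```python
import math
from dataclasses import dataclass

@dataclass
class Doc:
    # Stand-in for Haystack's Document; only the id field is needed here.
    id: str

gt_docs = [Doc("a"), Doc("b"), Doc("c")]   # ground truth, sorted by relevance
ret_docs = [Doc("b"), Doc("c"), Doc("a")]  # order in which the retriever returned them

# Relevance is derived from the document's position in the ground-truth list
# (looked up by id), not from its position in the retrieved list.
gt_rank = {doc.id: rank for rank, doc in enumerate(gt_docs, 1)}

dcg = 0.0
for i, doc in enumerate(ret_docs):
    if doc.id in gt_rank:
        relevance = 1 / gt_rank[doc.id]      # uses the ground-truth rank ...
        dcg += relevance / math.log2(i + 2)  # ... discounted once by the retrieval rank
print(round(dcg, 3))
```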
Yes, true. I will change the fallback to binary relevance. 👍
Looking pretty good. I like the tests as well. Two things:
- We should ensure that the code doesn't break when both inputs are empty lists, or at least it should fail in a friendly way (see the sketch after this list).
- I wouldn't create our own flavour of nDCG by using the "inverse of document ranks" score fallback. Instead I'd prefer to go with a simple binary relevance assumption. That is at least what I would expect to get if I haven't specified graded relevance values.
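On the first point, a small sketch of what failing in a friendly way could look like (the function name echoes the validate_inputs mentioned in the commit list further down, but the exact checks here are illustrative, not the PR's code):

```python
from typing import List

def validate_inputs(ground_truth_documents: List[list], retrieved_documents: List[list]) -> None:
    # Fail with a clear message instead of a confusing error (or a silent 0.0) later on.
    if not ground_truth_documents or not retrieved_documents:
        raise ValueError("ground_truth_documents and retrieved_documents must not be empty.")
    if len(ground_truth_documents) != len(retrieved_documents):
        raise ValueError("ground_truth_documents and retrieved_documents must have the same length.")
```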
@Amnah199 @tstadel Thank you for your reviews!
Thanks for the changes! 💯
👍
* draft new component and tests
* draft new component and tests
* fix tests, replace usage of get_attr
* improve docstrings, refactor tests
* add test for mixed documents w/wo scores
* add test with multiple lists and update docstring
* validate inputs, add tests, make methods static
* change fallback to binary relevance
* rename validate_init_parameters to validate_inputs
Related Issues
Proposed Changes:
How did you test it?
Notes for the reviewer
Checklist
- I used one of the conventional commit types for my PR title: `fix:`, `feat:`, `build:`, `chore:`, `ci:`, `docs:`, `style:`, `refactor:`, `perf:`, `test:`.