feat: Add DocumentNDCGEvaluator component #8419

Merged
merged 9 commits into from
Oct 1, 2024

Conversation

julian-risch
Member

@julian-risch julian-risch commented Sep 30, 2024

Related Issues

Proposed Changes:

  • Adding a new DocumentNDCGEvaluator component and corresponding unit tests

How did you test it?

  • New unit tests

Notes for the reviewer

Checklist

@github-actions github-actions bot added the type:documentation Improvements on the docs label Sep 30, 2024
@coveralls
Collaborator

coveralls commented Sep 30, 2024

Pull Request Test Coverage Report for Build 11124601375

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 43 unchanged lines in 2 files lost coverage.
  • Overall coverage increased (+0.05%) to 90.329%

Files with Coverage Reduction | New Missed Lines | %
core/pipeline/pipeline.py | 8 | 79.55%
core/pipeline/base.py | 35 | 92.36%

Totals
Change from base Build 11054954171: 0.05%
Covered Lines: 7435
Relevant Lines: 8231

💛 - Coveralls

@julian-risch julian-risch marked this pull request as ready for review September 30, 2024 13:21
@julian-risch julian-risch requested review from a team as code owners September 30, 2024 13:21
@julian-risch julian-risch requested review from dfokina and vblagoje and removed request for a team September 30, 2024 13:21
@shadeMe shadeMe requested a review from sjrl September 30, 2024 13:26
@julian-risch julian-risch requested review from Amnah199 and removed request for vblagoje September 30, 2024 13:43
`ground_truth_documents` and `retrieved_documents` must have the same length.

:param ground_truth_documents:
A list of expected documents for each question with relevance scores or sorted by relevance.
Contributor

In light of the above comments, maybe this can also be refined. Currently, it sounds like we are expecting a single list of documents.

Member Author

I updated the docstring. Please let me know if it was better before or not.

relevant_id_to_score = {doc.id: doc.score for doc in gt_docs}
for i, doc in enumerate(ret_docs):
    if doc.id in relevant_id_to_score:  # TODO Related to https://github.com/deepset-ai/haystack/issues/8412
        # If the gt document has a float score, use it; otherwise, use the inverse of the rank
Member

@tstadel tstadel Oct 1, 2024

Why are we using the inverse of the rank as a fallback? Effectively, this would apply the rank discount to the retrieved document twice: once by dividing by (i + 1) in line 85 and once by dividing by log2(i + 2) in line 86.

I guess a better fallback would be to just use the value 1, which would translate into a simple binary relevance scheme according to https://en.wikipedia.org/wiki/Discounted_cumulative_gain
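
To make the double discount concrete, here is a minimal sketch. It is not the code from this PR: it assumes the standard DCG discount of log2(i + 2) for a 0-based position i, and the `dcg` helper and the three-document example are made up for illustration.

```python
import math


def dcg(gains):
    """Standard DCG: each gain is divided by log2(i + 2), with i the 0-based position."""
    return sum(gain / math.log2(i + 2) for i, gain in enumerate(gains))


# Hypothetical case: three retrieved documents, all relevant, none with a ground-truth score.
n = 3

# Inverse-rank fallback: the gain already shrinks with the retrieved position (1 / (i + 1)),
# and dcg() then discounts it by log2(i + 2) a second time.
inverse_rank_gains = [1 / (i + 1) for i in range(n)]

# Binary fallback: every relevant document has gain 1, so only the log2 discount applies.
binary_gains = [1.0] * n

dcg_inverse_rank = dcg(inverse_rank_gains)
dcg_binary = dcg(binary_gains)
```

Under these assumptions, the inverse-rank variant penalizes lower positions more strongly than plain DCG would.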

Member Author

My idea was that the user can provide the relevant documents as a sorted list without scores. With the current fallback, the retrieved documents get the highest NDCG score only if all relevant documents are retrieved in this particular order.
With a fallback to 1, the order of the relevant documents wouldn't matter anymore. I agree that's then simple binary relevance. Happy to change the fallback to that if users benefit more from that.
@bilgeyucel You wanted to pass a sorted list of documents without scores, right?

Contributor

@julian-risch Yes, I'm using the HotpotQA dataset from Hugging Face and it doesn't provide scores.

Member Author

If we change the fallback to binary relevance, what you could do is calculate scores yourself before passing the documents to the DocumentNDCGEvaluator. For example:

for i, doc in enumerate(docs, 1):
    doc.score = 1 / i

That would work for you too, right?
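
A small sketch of what such precomputed scores would do, assuming gains are looked up by document id and discounted by log2(position + 2). The ids and the `dcg` helper are hypothetical and not part of this PR.

```python
import math

# Hypothetical ground truth, most relevant first, scored as 1 / rank as suggested above.
gt_order = ["d1", "d2", "d3"]
scores = {doc_id: 1 / rank for rank, doc_id in enumerate(gt_order, 1)}


def dcg(ranking):
    """DCG of a retrieved ranking, using the ground-truth score as gain (0 if not relevant)."""
    return sum(scores.get(doc_id, 0.0) / math.log2(pos + 2) for pos, doc_id in enumerate(ranking))


same_order = dcg(["d1", "d2", "d3"])
reversed_order = dcg(["d3", "d2", "d1"])
```

With these scores, retrieving the documents in the ground-truth order yields a higher DCG than retrieving them in reverse, so the order sensitivity is kept even if the default fallback is binary.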

Member

@tstadel tstadel Oct 1, 2024

@julian-risch I can understand the intuition now. Still I'd probably not make this the default behavior: When supplying ground-truth docs, I wouldn't expect that the order of them makes a difference, tbh.
And if you really need that, you could simply pass scores as you showed in the preceding comment.

Anyways, I think there is an error in the implementation of the intuition. If I understood it correctly, document relevance should be based on the order of the passed ground-truth docs. In the current implementation it's instead based on the order of the retrieved documents: relevance = 1 / (i + 1) where i is the index of the retrieved doc, not the ground-truth doc.
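
The difference can be sketched as follows. This is an illustration with made-up ids, assuming the fallback relevance is 1 / (index + 1); it is not the code from the PR.

```python
# Hypothetical example: ground truth sorted by relevance, and a retrieval that returns
# the least relevant ground-truth document first.
gt_ids = ["a", "b", "c"]
ret_ids = ["c", "a", "b"]

# Behavior described in the comment above: the index comes from the retrieved list,
# so whichever relevant document is retrieved first gets relevance 1.
relevance_from_retrieved_index = {
    doc_id: 1 / (i + 1) for i, doc_id in enumerate(ret_ids) if doc_id in gt_ids
}

# Intended intuition: the index comes from the ground-truth list,
# so "c" keeps its low relevance no matter where it is retrieved.
relevance_from_gt_index = {doc_id: 1 / (i + 1) for i, doc_id in enumerate(gt_ids)}
```

In this example, "c" gets relevance 1 under the first variant and 1/3 under the second.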

Member Author

Yes, true. I will change the fallback to binary relevance. 👍

Member

@tstadel tstadel left a comment

Looking pretty good. I like the tests as well. Two things:

  • we should ensure that the code doesn't break when both inputs are empty lists. Or at least it should break in a friendly way
  • I wouldn't create our own flavour of nDCG by using the "inverse of document ranks" score-fallback. Instead I'd prefer to go with simple binary relevance assumption. This is at least what I would expect to get, if I haven't specified graded relevance values.

@julian-risch
Member Author

@Amnah199 @tstadel Thank you for your reviews!

  • I extended the input validation and now raise an error if one of the inputs is `[]`. An input of `[[]]` still works and raises no error.
  • I now changed the fallback to score 1, which is binary relevance. Simplifies the implementation too.
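
The two changes can be sketched roughly as below. This is a simplified illustration under stated assumptions, not the merged implementation: `Doc` is a hypothetical stand-in for Haystack's `Document`, the error messages are invented, and applying the fallback per document is an assumption of this sketch.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Doc:
    """Hypothetical stand-in for a document with an id and an optional relevance score."""
    id: str
    score: Optional[float] = None


def validate_inputs(gt_lists: List[List[Doc]], ret_lists: List[List[Doc]]) -> None:
    """Reject [] for either input and mismatched lengths; [[]] passes."""
    if len(gt_lists) == 0 or len(ret_lists) == 0:
        raise ValueError("ground_truth_documents and retrieved_documents must not be empty.")
    if len(gt_lists) != len(ret_lists):
        raise ValueError("ground_truth_documents and retrieved_documents must have the same length.")


def gain(doc: Doc) -> float:
    """Use the given score if present; otherwise fall back to binary relevance (1)."""
    return doc.score if doc.score is not None else 1.0


def ndcg(gt_docs: List[Doc], ret_docs: List[Doc]) -> float:
    """nDCG for one question: DCG of the retrieved ranking divided by the ideal DCG."""
    gains = {doc.id: gain(doc) for doc in gt_docs}
    dcg = sum(gains[doc.id] / math.log2(i + 2) for i, doc in enumerate(ret_docs) if doc.id in gains)
    ideal = sum(g / math.log2(i + 2) for i, g in enumerate(sorted(gains.values(), reverse=True)))
    return dcg / ideal if ideal > 0 else 0.0
```

In this sketch, unscored ground-truth documents can be retrieved in any order and still reach the maximum value, which is the binary relevance behavior discussed in the review.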

@julian-risch julian-risch requested a review from tstadel October 1, 2024 11:47
Member

@tstadel tstadel left a comment

Thanks for the changes! 💯

Contributor

@Amnah199 Amnah199 left a comment

👍

@julian-risch julian-risch merged commit 08686d9 into main Oct 1, 2024
19 checks passed
@julian-risch julian-risch deleted the document-ndcg-evaluator branch October 1, 2024 14:15
julian-risch added a commit that referenced this pull request Oct 1, 2024
* draft new component and tests

* draft new component and tests

* fix tests, replace usage of get_attr

* improve docstrings, refactor tests

* add test for mixed documents w/wo scores

* add test with multiple lists and update docstring

* validate inputs, add tests, make methods static

* change fallback to binary relevance

* rename validate_init_parameters to validate_inputs
Labels
topic:tests type:documentation Improvements on the docs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add DocumentNDCGEvaluator component
5 participants