
archive: e2e test for ranking against sourcegraph repo #695

Merged: 4 commits merged into main on Nov 20, 2023

Conversation

keegancsmith (Member) commented on Nov 15, 2023

This is an initial framework for having golden file results for search results against a real repository. At first we have only added one query and one repository, but it should be straightforward to grow this list further.

The golden files we write to disk are a summary of results with debug information. This matches how we have been using the zoekt CLI tool on the keyword branch during our ranking work.

Test Plan: go test

Fixes https://github.com/sourcegraph/sourcegraph/issues/57666
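
For illustration only, here is a minimal sketch of the golden-file pattern described above, assuming a conventional -update flag for regenerating the files on disk; the helper name and file layout are assumptions, not the actual e2e_rank_test.go code.

package main

import (
	"flag"
	"os"
	"path/filepath"
	"testing"
)

// -update regenerates the golden files instead of comparing against them.
var update = flag.Bool("update", false, "update golden files instead of comparing")

// assertGolden compares a rendered result summary against the golden file on
// disk, or rewrites the golden file when -update is passed (hypothetical helper).
func assertGolden(t *testing.T, name string, got []byte) {
	t.Helper()
	path := filepath.Join("testdata", name+".golden")
	if *update {
		if err := os.WriteFile(path, got, 0o600); err != nil {
			t.Fatal(err)
		}
		return
	}
	want, err := os.ReadFile(path)
	if err != nil {
		t.Fatal(err)
	}
	if string(got) != string(want) {
		t.Errorf("results for %s differ from golden file %s; rerun with -update to regenerate", name, path)
	}
}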

jtibshirani (Member) left a comment

This seems like a solid + simple way to index a repo snapshot at a particular time!

One overall comment: I had been thinking that along with each query, we'd also provide the 1-2 files we consider to be most relevant. We could show some visual indication of the result that's relevant, and also report a metric like "recall at 5". This helps make trade-offs when reviewing changes to results. For example, maybe we see a bunch of changes in results: knowing what files are relevant and how their positions changed can help determine if the changes are overall positive.

What do you think? How were you thinking we'd make use of this test for evaluating changes to ranking?

keegancsmith (Member, Author) replied

One overall comment: I had been thinking that along with each query, we'd also provide the 1-2 files we consider to be most relevant.

Agreed, I can add that. Mind me doing that as a follow-up PR to avoid this one getting too big?

We could show some visual indication of the result that's relevant, and also report a metric like "recall at 5". This helps make trade-offs when reviewing changes to results.

In tests we want to assert on behaviour. I am thinking we could assert on acceptable recall? Or maybe just log it? Or hook this up to a tool we can run outside of tests? WDYT?

For example, maybe we see a bunch of changes in results: knowing what files are relevant and how their positions changed can help determine if the changes are overall positive.

Agreed, this would be useful. Right now I am concerned that, for example, including the debugscore information is too noisy.

What do you think? How were you thinking we'd make use of this test for evaluating changes to ranking?

I was imagining we would inspect the changes to the snapshot files to see the impact. That would still be "feeling based", so your idea of outputting a single metric is really useful. I'll follow up with those ideas.

I've added a task to the tracking issue titled "ranking: add recall to zoekt test"
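
For concreteness, the two options discussed above (just logging recall versus asserting an acceptable floor) might look roughly like the sketch below; the helper name and threshold are illustrative, not part of this PR.

package main

import "testing"

// checkRecall is a hypothetical helper: recall is assumed to be the fraction
// of the labelled-relevant files for a query that appear in the top 5 results.
func checkRecall(t *testing.T, query string, recall float64) {
	t.Helper()

	// Option A: just log the metric so reviewers see it in test output.
	t.Logf("query %q: recall@5 = %.2f", query, recall)

	// Option B: assert an acceptable floor so regressions fail the test.
	const minRecall = 0.8 // illustrative threshold, not a real target
	if recall < minRecall {
		t.Errorf("query %q: recall@5 = %.2f, want >= %.2f", query, recall, minRecall)
	}
}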

jtibshirani (Member) replied

Agreed, I can add that. Mind me doing that as a follow-up PR to avoid this one getting too big?

Makes sense to do this in a follow-up. I'll review this PR right now.

In tests we want to assert on behaviour. I am thinking we could assert on acceptable recall?

I like your approach of asserting on these as "gold" results -- it helps prevent accidental ranking changes, and also forces us to evaluate any ranking changes in a more rigorous way. For me it'd be useful to just log the recall as part of the test output. It'd also be nice if we could show a visual indication in results of what file is relevant... something like

** github.com/sourcegraph/sourcegraph/cmd/frontend/graphqlbackend/schema.graphql **	score:8807.21 <- atom(4):300.00, fragment:8500.00, doc-order:7.21
6376:type User implements Node & SettingsSubject & Namespace {	score:8500.00 <- WordMatch:500.00, Symbol:7000.00, kind:GraphQL:type:1000.00
3862:        type: GitRefType	score:8050.00 <- WordMatch:500.00, Symbol:7000.00, kind:GraphQL:field:550.00
5037:    type: GitRefType!	score:8050.00 <- WordMatch:500.00, Symbol:7000.00, kind:GraphQL:field:550.00
hidden 460 more line matches

github.com/sourcegraph/sourcegraph/internal/types/types.go	score:8759.73 <- atom(4):300.00, fragment:8450.00, doc-order:9.73
850:type User struct {	score:8450.00 <- WordMatch:500.00, Symbol:7000.00, kind:Go:struct:950.00
1372:	Type               *SearchCountStatistics	score:8250.00 <- WordMatch:500.00, Symbol:7000.00, kind:Go:member:750.00
1766:	Type       string	score:8250.00 <- WordMatch:500.00, Symbol:7000.00, kind:Go:member:750.00
hidden 234 more line matches

jtibshirani (Member) left a comment

Looks good to me! I left a few non-blocking comments.

Three review threads on cmd/zoekt-archive-index/e2e_rank_test.go (resolved; one outdated)
keegancsmith merged commit 0f685d8 into main on Nov 20, 2023
8 checks passed
keegancsmith deleted the k/integration branch on Nov 20, 2023
keegancsmith (Member, Author) commented

@jtibshirani I'm going to follow up with a recall measurement. I realised it isn't clear to me what exactly the metric should be. Searching around, recall is usually presented as a percentage of the total corpus, which is not that helpful to us. I can think of two systems: 1 point every time a document we want appears in the top 5, or a score where the top position is worth 5 points and decreases from there.

Additionally, I was thinking it may be useful to track which line we show. E.g. some of the improvements we have made make us more likely to show the class definition as the top line match in a file, rather than a random other part of the document that matches.
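
To make the two proposed scoring systems precise, here is a minimal sketch of how each might be computed; the function names and the map of relevant files are hypothetical, written only to illustrate the proposal.

package main

// hitAt5 gives 1 point each time a relevant document appears in the
// top 5 results (the first scheme above).
func hitAt5(results []string, relevant map[string]bool) int {
	score := 0
	for i, doc := range results {
		if i >= 5 {
			break
		}
		if relevant[doc] {
			score++
		}
	}
	return score
}

// positionScore gives a relevant document 5 points in first place,
// decreasing by one per position down to 1 point at fifth place (the second scheme).
func positionScore(results []string, relevant map[string]bool) int {
	score := 0
	for i, doc := range results {
		if i >= 5 {
			break
		}
		if relevant[doc] {
			score += 5 - i
		}
	}
	return score
}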

jtibshirani (Member) replied

I can think of two systems: 1 point every time a document we want appears in the top 5, or a score where the top position is worth 5 points and decreases from there.

In my experience it's common to report a couple of metrics to try to capture overall quality. For our problem, I think these are the most helpful:

  • Recall@k, where k is some small-ish number. One idea is to report Recall@1 and Recall@5 -- to me these capture the user experience pretty well (did the most relevant result appear first? did it at least appear in the first screen and require little to no scrolling?)
  • Mean reciprocal rank (MRR). This tries to capture the average rank of the relevant result across queries. So if a result moves up from 10th to 6th place, this metric would capture that and improve. It's not as interpretable but captures more cases. This doesn't seem as critical to me but is a "nice to have".

As background, a lot of the traditional metrics (recall as % of corpus, mAP, NDCG) assume that there are many docs throughout the corpus that are relevant to the query to different degrees. I don't think that matches our use case well -- there are usually 1-2 docs that are highly relevant or "correct" answers, so we can use simple binary metrics that focus on whether we retrieved those.
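
As a rough sketch of the two metrics described above, assuming each query has a small set of relevant files, the computations could look like the following; the helper names are illustrative, and MRR would be the average of reciprocalRank over all queries.

package main

// recallAtK reports the fraction of relevant documents that appear in
// the top k results.
func recallAtK(results []string, relevant map[string]bool, k int) float64 {
	if len(relevant) == 0 {
		return 0
	}
	found := 0
	for i, doc := range results {
		if i >= k {
			break
		}
		if relevant[doc] {
			found++
		}
	}
	return float64(found) / float64(len(relevant))
}

// reciprocalRank returns 1/rank of the first relevant result (0 if none);
// averaging this over a set of queries gives the mean reciprocal rank (MRR).
func reciprocalRank(results []string, relevant map[string]bool) float64 {
	for i, doc := range results {
		if relevant[doc] {
			return 1.0 / float64(i+1)
		}
	}
	return 0
}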
