
archive: e2e test for ranking against sourcegraph repo #695

Merged: 4 commits merged into main on Nov 20, 2023

Conversation

keegancsmith (Member) commented on Nov 15, 2023

This is an initial framework for having golden file results for search results against a real repository. At first we have only added one query and one repository, but it should be straightforward to grow this list further.

The golden files we write to disk are a summary of results with debug information. This matches how we have been using the zoekt CLI tool on the keyword branch during our ranking work.

Test Plan: go test

Fixes https://github.com/sourcegraph/sourcegraph/issues/57666
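
For illustration only, here is a minimal sketch of the golden-file pattern described above, assuming a conventional -update flag for regenerating the files on disk; the helper name and file layout are assumptions, not the actual e2e_rank_test.go code.

package main

import (
	"flag"
	"os"
	"path/filepath"
	"testing"
)

// -update regenerates the golden files instead of comparing against them.
var update = flag.Bool("update", false, "update golden files instead of comparing")

// assertGolden compares a rendered result summary against the golden file on
// disk, or rewrites the golden file when -update is passed (hypothetical helper).
func assertGolden(t *testing.T, name string, got []byte) {
	t.Helper()
	path := filepath.Join("testdata", name+".golden")
	if *update {
		if err := os.WriteFile(path, got, 0o600); err != nil {
			t.Fatal(err)
		}
		return
	}
	want, err := os.ReadFile(path)
	if err != nil {
		t.Fatal(err)
	}
	if string(got) != string(want) {
		t.Errorf("results for %s differ from golden file %s; rerun with -update to regenerate", name, path)
	}
}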

jtibshirani (Member) left a comment

This seems like a solid + simple way to index a repo snapshot at a particular time!

One overall comment: I had been thinking that along with each query, we'd also provide the 1-2 files we consider to be most relevant. We could show some visual indication of the result that's relevant, and also report a metric like "recall at 5". This helps make trade-offs when reviewing changes to results. For example, maybe we see a bunch of changes in results: knowing what files are relevant and how their positions changed can help determine if the changes are overall positive.

What do you think? How were you thinking we'd make use of this test for evaluating changes to ranking?

keegancsmith (Member, Author) replied

One overall comment: I had been thinking that along with each query, we'd also provide the 1-2 files we consider to be most relevant.

Agreed, I can add that. Mind me doing that as a follow-up PR to avoid this one getting too big?

We could show some visual indication of the result that's relevant, and also report a metric like "recall at 5". This helps make trade-offs when reviewing changes to results.

In tests we want to assert on behaviour. I am thinking we could assert on acceptable recall? Or maybe just log it? Or hook this up to a tool we can run outside of tests? WDYT?

For example, maybe we see a bunch of changes in results: knowing what files are relevant and how their positions changed can help determine if the changes are overall positive.

Agreed, this would be useful. Right now I am concerned that, for example, including the debugscore information is too noisy.

What do you think? How were you thinking we'd make use of this test for evaluating changes to ranking?

I was imagining we would inspect the changes to the snapshot files to see the impact. That would still be "feeling based", so your idea of outputting a single metric is really useful. I'll follow up with those ideas.

I've added a task to the tracking issue titled "ranking: add recall to zoekt test"
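
For concreteness, the two options discussed above (just logging recall versus asserting an acceptable floor) might look roughly like the sketch below; the helper name and threshold are illustrative, not part of this PR.

package main

import "testing"

// checkRecall is a hypothetical helper: recall is assumed to be the fraction
// of the labelled-relevant files for a query that appear in the top 5 results.
func checkRecall(t *testing.T, query string, recall float64) {
	t.Helper()

	// Option A: just log the metric so reviewers see it in test output.
	t.Logf("query %q: recall@5 = %.2f", query, recall)

	// Option B: assert an acceptable floor so regressions fail the test.
	const minRecall = 0.8 // illustrative threshold, not a real target
	if recall < minRecall {
		t.Errorf("query %q: recall@5 = %.2f, want >= %.2f", query, recall, minRecall)
	}
}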

jtibshirani (Member) replied

Agreed, I can add that. Mind me doing that as a follow-up PR to avoid this one getting too big?

Makes sense to do this in a follow-up. I'll review this PR right now.

In tests we want to assert on behaviour. I am thinking we could assert on acceptable recall?

I like your approach of asserting on these as "gold" results -- it helps prevent accidental ranking changes, and also forces us to evaluate any ranking changes in a more rigorous way. For me it'd be useful to just log the recall as part of the test output. It'd also be nice if we could show a visual indication in results of what file is relevant... something like

** github.com/sourcegraph/sourcegraph/cmd/frontend/graphqlbackend/schema.graphql **	score:8807.21 <- atom(4):300.00, fragment:8500.00, doc-order:7.21
6376:type User implements Node & SettingsSubject & Namespace {	score:8500.00 <- WordMatch:500.00, Symbol:7000.00, kind:GraphQL:type:1000.00
3862:        type: GitRefType	score:8050.00 <- WordMatch:500.00, Symbol:7000.00, kind:GraphQL:field:550.00
5037:    type: GitRefType!	score:8050.00 <- WordMatch:500.00, Symbol:7000.00, kind:GraphQL:field:550.00
hidden 460 more line matches

github.com/sourcegraph/sourcegraph/internal/types/types.go	score:8759.73 <- atom(4):300.00, fragment:8450.00, doc-order:9.73
850:type User struct {	score:8450.00 <- WordMatch:500.00, Symbol:7000.00, kind:Go:struct:950.00
1372:	Type               *SearchCountStatistics	score:8250.00 <- WordMatch:500.00, Symbol:7000.00, kind:Go:member:750.00
1766:	Type       string	score:8250.00 <- WordMatch:500.00, Symbol:7000.00, kind:Go:member:750.00
hidden 234 more line matches

jtibshirani (Member) left a comment

Looks good to me! I left a few non-blocking comments.

Three review threads on cmd/zoekt-archive-index/e2e_rank_test.go (resolved; one outdated)
keegancsmith merged commit 0f685d8 into main on Nov 20, 2023
8 checks passed
keegancsmith deleted the k/integration branch on Nov 20, 2023
keegancsmith (Member, Author) commented

@jtibshirani I'm going to follow up with a recall measurement. I realised it isn't clear to me what exactly the metric should be. Searching around, recall is usually presented as a percentage of the total corpus, which is not that helpful to us. I can think of two systems: 1 point every time a document we want appears in the top 5, or a score where the top position is worth 5 points and decreases from there.

Additionally, I was thinking it may be useful to track which line we show. E.g. some of the improvements we have made make us more likely to show the class definition as the top line match in a file, rather than a random other part of the document that matches.
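
To make the two proposed scoring systems precise, here is a minimal sketch of how each might be computed; the function names and the map of relevant files are hypothetical, written only to illustrate the proposal.

package main

// hitAt5 gives 1 point each time a relevant document appears in the
// top 5 results (the first scheme above).
func hitAt5(results []string, relevant map[string]bool) int {
	score := 0
	for i, doc := range results {
		if i >= 5 {
			break
		}
		if relevant[doc] {
			score++
		}
	}
	return score
}

// positionScore gives a relevant document 5 points in first place,
// decreasing by one per position down to 1 point at fifth place (the second scheme).
func positionScore(results []string, relevant map[string]bool) int {
	score := 0
	for i, doc := range results {
		if i >= 5 {
			break
		}
		if relevant[doc] {
			score += 5 - i
		}
	}
	return score
}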

jtibshirani (Member) replied

I can think of two systems: 1 point every time a document we want appears in the top 5, or a score where the top position is worth 5 points and decreases from there.

In my experience it's common to report a couple of metrics to try to capture overall quality. For our problem, I think these are the most helpful:

  • Recall@k, where k is some small-ish number. One idea is to report Recall@1 and Recall@5 -- to me these capture the user experience pretty well (did the most relevant result appear first? did it at least appear in the first screen and require little to no scrolling?)
  • Mean reciprocal rank (MRR). This tries to capture the average rank of the relevant result across queries. So if a result moves up from 10th to 6th place, this metric would capture that and improve. It's not as interpretable but captures more cases. This doesn't seem as critical to me but is a "nice to have".

As background, a lot of the traditional metrics (recall as % of corpus, mAP, NDCG) assume that there are many docs throughout the corpus that are relevant to the query to different degrees. I don't think that matches our use case well -- there are usually 1-2 docs that are highly relevant or "correct" answers, so we can use simple binary metrics that focus on whether we retrieved those.
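
As a rough sketch of the two metrics described above, assuming each query has a small set of relevant files, the computations could look like the following; the helper names are illustrative, and MRR would be the average of reciprocalRank over all queries.

package main

// recallAtK reports the fraction of relevant documents that appear in
// the top k results.
func recallAtK(results []string, relevant map[string]bool, k int) float64 {
	if len(relevant) == 0 {
		return 0
	}
	found := 0
	for i, doc := range results {
		if i >= k {
			break
		}
		if relevant[doc] {
			found++
		}
	}
	return float64(found) / float64(len(relevant))
}

// reciprocalRank returns 1/rank of the first relevant result (0 if none);
// averaging this over a set of queries gives the mean reciprocal rank (MRR).
func reciprocalRank(results []string, relevant map[string]bool) float64 {
	for i, doc := range results {
		if relevant[doc] {
			return 1.0 / float64(i+1)
		}
	}
	return 0
}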
