Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

languages: copy over latest version from sourcegraph #828

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

keegancsmith
Copy link
Member

I realised we haven't been updating this package as we updated the package in the sourcegraph repo.

We don't need all the functionality it has, but its easier to just copy paste everything.

Test Plan: go test

I realised we haven't been updating this package as we updated the
package in the sourcegraph repo.

We don't need all the functionality it has, but its easier to just copy
paste everything.

Test Plan: go test
@keegancsmith keegancsmith requested review from mmanela, varungandhi-src and a team September 17, 2024 10:20
@cla-bot cla-bot bot added the cla-signed label Sep 17, 2024
Copy link
Contributor

Fuzz test failed on commit 78a7ea4. To troubleshoot locally, use the GitHub CLI to download the seed corpus with

gh run download 10901106986 -n testdata

@keegancsmith keegancsmith marked this pull request as draft September 17, 2024 13:47
@keegancsmith
Copy link
Member Author

Ok there are issues here. The issue is boils down to how we changed normalization between the two versions of the package.

I actually have bigger concerns around this now. We just lookup by query.Language.Language and do no normalization on it. We expect normalization to be done before. This then relies on the normalization being the same at index time as the time we construct the query. I think this is quite easy to mess up, so I need to read more code to see how this can be made more robust.

But essentially I feel like when we read in index data from disk we should normalize by the rules of the running process. Then we do the same normilization when executing a query. There is one other issue where we directly look into the go-enry maps to get out extensions. That code also seems very fragile.

cc @jtibshirani who I think looked at all this code recently.

Copy link
Contributor

@varungandhi-src varungandhi-src left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's move the lib/codeintel/languages package to a public repo and add a dependency on that? Otherwise, having to maintain this copy separately is going to get tedious quickly, and reviewing changes is also not that simple.

Wrt the normalization, I don't fully understand the requirements -- do we need to maintain some kind of compatibility between successive versions of the package?

@keegancsmith
Copy link
Member Author

Let's move the lib/codeintel/languages package to a public repo and add a dependency on that? Otherwise, having to maintain this copy separately is going to get tedious quickly, and reviewing changes is also not that simple.

Agreed, this makes sense to be shared. However, there is a tricky requirement around testing since the implementation of this package is tied directly to the version of go-enry. Maybe what we can do is expose a package which contains the tests, then as part of CI for sourcegraph and zoekt we run those tests to ensure we are compatible with the resolved version of go-enry?

Wrt the normalization, I don't fully understand the requirements -- do we need to maintain some kind of compatibility between successive versions of the package?

We do case-sensitive string comparisons on the language value. I noticed that while doing this upgrade the casing we chose for some stuff changed. I didn't exactly track it down, but I believe it has something to do with things migrating between being inside of go-enry vs our custom support (I believe the code made different decisions around deciding to ToLower a string). I need to look into it more to give you a definite answer, but I saw enough to mark this PR as draft :)

@jtibshirani
Copy link
Member

Let's move the lib/codeintel/languages package to a public repo and add a dependency on that?

+1 let's do that! Could we commit to languages being our only interface for language info (including the IsVendor, etc. methods we added)? Then we could completely remove a direct dependency on go-enry.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants