Moniker Size and Storage Optimization #276

Closed
JamyDev opened this issue Sep 11, 2024 · 1 comment

Comments

@JamyDev

JamyDev commented Sep 11, 2024

We've been working with SCIP indices for quite a while now, and one thing has become clear: they take up a lot of space, especially at the scale of our codebase.

One thought was that monikers take up a large chunk of the space, and much of it is redundant information. For example:
Every symbol within `some/package/path` has `scip-go gomod some/package/path v0.0.4` as a preamble, and then the actual descriptors are `some/package/path/Struct#`, which duplicates the package path already given in the preamble. In our use case we only need the descriptors, and maybe the package version, so one idea was to strip the prefix from the indices. Before doing that, I wanted to ask whether any other considerations were made around the size of the indices.
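To make the redundancy concrete, here is a minimal sketch of the prefix stripping described above. It assumes the symbol has exactly the five space-separated fields shown in the example (scheme, package manager, package name, version, descriptors) and that no field contains an escaped space. The helper names `split_symbol` and `strip_preamble` are hypothetical, not part of any SCIP tooling.

```python
def split_symbol(symbol: str) -> tuple[str, str, str, str, str]:
    """Split a symbol into (scheme, manager, package, version, descriptors).

    Assumption: fields are separated by single spaces and contain no
    escaped spaces, as in the example from this issue.
    """
    parts = symbol.split(" ", 4)
    if len(parts) != 5:
        raise ValueError(f"unexpected symbol shape: {symbol!r}")
    return parts[0], parts[1], parts[2], parts[3], parts[4]


def strip_preamble(symbol: str, keep_version: bool = True) -> str:
    """Drop the scheme/manager/package preamble, optionally keeping the version."""
    _scheme, _manager, _package, version, descriptors = split_symbol(symbol)
    return f"{version} {descriptors}" if keep_version else descriptors


symbol = "scip-go gomod some/package/path v0.0.4 some/package/path/Struct#"
stripped = strip_preamble(symbol)
print(stripped)
print(f"saved {len(symbol) - len(stripped)} of {len(symbol)} characters")
```

Note that the stripped form is only unambiguous if the reader can still recover the package from the descriptors, which holds for this example but is an assumption in general.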

We considered compression, since these strings are highly compressible, but our index reader would still need to scan through the uncompressed index.
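As a rough illustration of how compressible such strings are, the sketch below compresses a synthetic list of symbols with Python's standard `zlib` module. The symbol list is made up for the example (it only mimics the shape shown in this issue), so the ratio it prints says nothing about a real index.

```python
import zlib

# Synthetic symbols that share the same preamble and package path.
symbols = [
    f"scip-go gomod some/package/path v0.0.4 some/package/path/Struct{i}#"
    for i in range(1000)
]

raw = "\n".join(symbols).encode("utf-8")
compressed = zlib.compress(raw, 9)  # 9 = highest compression level

print(f"raw: {len(raw)} bytes, compressed: {len(compressed)} bytes")

# A reader still has to decompress everything before it can scan the symbols.
restored = zlib.decompress(compressed).decode("utf-8").split("\n")
```

This reduces bytes on disk, but as noted above it does not reduce the amount of data the reader has to process once the index is decompressed.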

Another option was to define symbol/moniker mappings for the index, which would map each moniker to a unique ID so it can be reused, similar to how LSIF handles it. This could be an optional feature in the index definition, at either the document or the index level. It would likely also give a good indication of all the symbols the index references without having to read through the documents and occurrences.
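The mapping idea could look roughly like the sketch below: each distinct moniker is stored once in a table, and occurrences refer to it by integer ID. The `SymbolTable` class and the sample occurrence list are hypothetical and only illustrate the proposal; this is not a layout that SCIP defines.

```python
class SymbolTable:
    """Hypothetical per-index table that stores each moniker string once."""

    def __init__(self) -> None:
        self._ids: dict[str, int] = {}
        self.symbols: list[str] = []

    def intern(self, symbol: str) -> int:
        """Return the ID for `symbol`, assigning the next free ID if it is new."""
        existing = self._ids.get(symbol)
        if existing is not None:
            return existing
        new_id = len(self.symbols)
        self._ids[symbol] = new_id
        self.symbols.append(symbol)
        return new_id

    def lookup(self, symbol_id: int) -> str:
        """Resolve an ID back to the full moniker string."""
        return self.symbols[symbol_id]


# Made-up occurrences: the same moniker is referenced several times.
occurrences = [
    "scip-go gomod some/package/path v0.0.4 some/package/path/Struct#",
    "scip-go gomod some/package/path v0.0.4 some/package/path/Other#",
    "scip-go gomod some/package/path v0.0.4 some/package/path/Struct#",
    "scip-go gomod some/package/path v0.0.4 some/package/path/Struct#",
]

table = SymbolTable()
encoded = [table.intern(symbol) for symbol in occurrences]

print(encoded)        # one small integer per occurrence
print(table.symbols)  # every symbol the index references, listed once
```

Listing `table.symbols` is what would let a reader see all referenced symbols without walking the documents and occurrences, at the cost of an extra lookup step whenever an occurrence is read.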

@varungandhi-src
Contributor

I've created a PR with design docs here: #289

That includes rationale on why we've avoided integer IDs as well as other kinds of redundancy (that would push more work onto indexer authors).

Please leave comments on the PR if you have follow-up questions.

We did get a request for SCIP to SQLite conversion (#233). We'd be open to brainstorming a design and/or accepting support for that as part of the SCIP CLI if that's something you would find useful, but we don't have the bandwidth to add support for it ourselves.

varungandhi-src closed this as not planned on Oct 24, 2024