Code base Search #4

Open
shiv810 opened this issue Oct 21, 2024 · 5 comments

Comments

@shiv810
Collaborator

shiv810 commented Oct 21, 2024

The model can currently parse code files shared via links. However, this functionality should be improved to allow searching through the entire codebase repository where the bot operates. Similar solutions, like Copilot, chunk code into units, convert them into graphs, and use these for retrieval.

Proposed Solution: AST-Based System for Codebase Queries

  1. Automatic Analysis: The plugin would analyze the entire codebase, converting it into Abstract Syntax Trees (ASTs).

  2. Chunking and Storage: These ASTs would be segmented into meaningful units, such as functions, classes, and methods, and stored in Supabase together with their metadata, semantic descriptions, and vector embeddings. The data would be split across three tables (AST_Nodes, Code_Chunks, Chunk_Relationships).

  3. Query Processing: When a user submits a query, it would be converted into an embedding and used to search for similar code chunks. The system would retrieve relevant chunks, their relationships, and semantic descriptions from the database to effectively answer queries.
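
To make the three steps above more concrete, here is a minimal sketch of the retrieval side. Everything in it is an assumption for illustration: the row shapes and field names for AST_Nodes, Code_Chunks, and Chunk_Relationships are guesses, in-memory arrays stand in for Supabase, and `keywordEmbed` is a toy keyword counter standing in for a real embedding model. The embedding function is passed in as a parameter so a real model could be swapped in later.

```typescript
// Hypothetical row shapes for the three proposed tables (field names are assumptions).
interface AstNode {
  id: string;
  filePath: string;
  kind: "function" | "class" | "method";
  name: string;
}

interface CodeChunk {
  id: string;
  astNodeId: string;
  description: string; // semantic description of the chunk
  embedding: number[];
}

interface ChunkRelationship {
  fromChunkId: string;
  toChunkId: string;
  kind: "calls" | "contains" | "imports";
}

interface SearchHit {
  chunk: CodeChunk;
  score: number;
  related: CodeChunk[];
}

// Cosine similarity between two vectors of equal length; 0 if either is all zeros.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Embed the query, rank chunks by similarity, and attach each hit's outgoing relationships.
function searchChunks(
  query: string,
  chunks: CodeChunk[],
  relationships: ChunkRelationship[],
  embed: (text: string) => number[],
  topK: number
): SearchHit[] {
  const queryEmbedding = embed(query);
  return chunks
    .map((chunk) => ({ chunk, score: cosineSimilarity(queryEmbedding, chunk.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK)
    .map(({ chunk, score }) => ({
      chunk,
      score,
      related: relationships
        .filter((r) => r.fromChunkId === chunk.id)
        .map((r) => chunks.filter((c) => c.id === r.toChunkId)[0])
        .filter((c): c is CodeChunk => c !== undefined),
    }));
}

// Toy stand-in for an embedding model: counts occurrences of a tiny fixed vocabulary.
const vocabulary = ["parse", "comment", "fetch", "file"];
function keywordEmbed(text: string): number[] {
  const tokens = text.toLowerCase().match(/[a-z0-9]+/g) || [];
  return vocabulary.map((word) => tokens.filter((t) => t === word).length);
}

// Example data (entirely made up).
const nodes: AstNode[] = [
  { id: "n1", filePath: "src/comments.ts", kind: "function", name: "parseComment" },
  { id: "n2", filePath: "src/files.ts", kind: "function", name: "fetchLinkedFile" },
];
const descriptions = ["parse a comment and extract links", "fetch a linked file"];
const chunks: CodeChunk[] = nodes.map((node, i) => ({
  id: "c" + (i + 1),
  astNodeId: node.id,
  description: descriptions[i],
  embedding: keywordEmbed(descriptions[i]),
}));
const relationships: ChunkRelationship[] = [{ fromChunkId: "c1", toChunkId: "c2", kind: "calls" }];

const hits = searchChunks("where do we parse a comment", chunks, relationships, keywordEmbed, 1);
console.log(hits[0].chunk.id, hits[0].related.map((c) => c.id));
```

In a real deployment the ranking would presumably run inside the database as a vector similarity query rather than in application code; the in-memory loop here is only meant to show the data flow from query to chunk to related chunks.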

@shiv810
Collaborator Author

shiv810 commented Oct 21, 2024

@0x4007 rfc

@0x4007
Member

0x4007 commented Oct 21, 2024

I think we should copy what cursor.com does, if that's public knowledge. They seem to do a pretty solid job at codebase indexing.

As a second choice, there's also Microsoft's Semantic Kernel, which I assume is used inside GitHub's web-based Copilot and codebase indexing. I kind of got it to work in the past, but it was finicky.

> The model can currently parse code files shared via links.

Can you elaborate on this?

@shiv810
Collaborator Author

shiv810 commented Oct 21, 2024

> > The model can currently parse code files shared via links.
>
> Can you elaborate on this?

The current chatbot can parse code files linked in comments and add their contents to the context it uses to answer questions.

QA: Link
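
As a rough illustration of the link handling described above, the sketch below pulls GitHub blob links out of a comment body and splits them into their parts. This is not the plugin's actual code: `BlobLink` and `extractBlobLinks` are hypothetical names, and the pattern assumes the ref (branch or commit SHA) contains no slashes.

```typescript
// Parts of a GitHub "blob" URL; all names here are illustrative.
interface BlobLink {
  owner: string;
  repo: string;
  ref: string; // branch name or commit SHA (assumed to contain no "/")
  path: string;
  line: number | null; // from an optional "#L<number>" anchor
}

// Find every https://github.com/<owner>/<repo>/blob/<ref>/<path>[#L<n>] link in a comment.
function extractBlobLinks(commentBody: string): BlobLink[] {
  const pattern =
    /https:\/\/github\.com\/([^\/\s]+)\/([^\/\s]+)\/blob\/([^\/\s]+)\/([^\s#]+)(?:#L(\d+))?/g;
  const links: BlobLink[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(commentBody)) !== null) {
    links.push({
      owner: match[1],
      repo: match[2],
      ref: match[3],
      path: match[4],
      line: match[5] ? Number(match[5]) : null,
    });
  }
  return links;
}

const sampleComment =
  "See https://github.com/cline/cline/blob/52f86b188786d5fe8eb4ad5df800b9a95aaba6e6/src/services/tree-sitter/index.ts#L98 for details.";
console.log(extractBlobLinks(sampleComment));
```

Each extracted link could then be used to fetch the file contents and append them to the prompt context, which is the step the comment above describes.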

@shiv810
Collaborator Author

shiv810 commented Oct 22, 2024

Instead of Cursor, we can use Cline; it is open source and, in my experience, performs better than Cursor in real-world use cases.

I am not sure how Semantic Kernel would work here. Semantic Kernel and its plugins, along with the core kernel, work quite well for processing unstructured data, but I don't think they have anything particular for code or code parsing.

Cline gives the model tools that let it decide, based on context, whether to read a file. After reading the file, it offers suggestions. I believe Cursor's approach is closer to what I proposed above: they semantically chunk the code and use an external embedding server or OpenAI to obtain vector embeddings for retrieval.
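
To show the difference from the embedding approach, here is a minimal sketch of a tool-driven loop of the kind described above. It is not Cline's actual API: the tool names, `runTool`, `chooseNextCall`, and `gatherContext` are all hypothetical, an in-memory map stands in for the repository, and a simple keyword match stands in for the model's decision about which file to read.

```typescript
// Hypothetical tool calls a model could emit.
type ToolCall = { tool: "list_files" } | { tool: "read_file"; path: string };

// In-memory stand-in for the repository the bot operates in.
const repoFiles: Record<string, string> = {
  "src/parser.ts": "export function parseComment(body: string) { return body.trim(); }",
  "README.md": "# Demo repository",
};

// Execute one tool call and return its result as text for the model.
function runTool(call: ToolCall, files: Record<string, string>): string {
  switch (call.tool) {
    case "list_files":
      return Object.keys(files).sort().join("\n");
    case "read_file": {
      const content = files[call.path];
      return content === undefined ? `Error: no such file: ${call.path}` : content;
    }
  }
}

// Stand-in for the model: read the first listed file whose path mentions a word from the question.
function chooseNextCall(question: string, listing: string): ToolCall | null {
  const words = (question.toLowerCase().match(/[a-z]+/g) || []).filter((w) => w.length > 3);
  const candidates = listing
    .split("\n")
    .filter((path) => words.some((w) => path.toLowerCase().indexOf(w) !== -1));
  return candidates.length > 0 ? { tool: "read_file", path: candidates[0] } : null;
}

// List files first, then read at most one file the "model" considers relevant.
function gatherContext(question: string, files: Record<string, string>): string[] {
  const context: string[] = [];
  const listing = runTool({ tool: "list_files" }, files);
  const next = chooseNextCall(question, listing);
  if (next !== null) {
    context.push(runTool(next, files));
  }
  return context;
}

console.log(gatherContext("How does the parser trim input?", repoFiles));
```

The trade-off this sketch is meant to highlight: the tool-driven loop needs no index or embeddings up front, but it costs extra model round trips per question, whereas the chunk-and-embed approach pays the indexing cost once and retrieves in a single lookup.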

@0x4007
Member

0x4007 commented Oct 22, 2024

Cline looks epic, good find. Especially great that it's open source. We should take the useful bits from it, such as codebase parsing.

https://github.com/cline/cline/blob/52f86b188786d5fe8eb4ad5df800b9a95aaba6e6/src/services/tree-sitter/index.ts#L98

> but I don't think they have anything particular for code or code parsing.

There was a demo I ran locally last year that allowed you to do Q&A on any codebase. It was very similar to the Copilot chat we have in the GitHub UI now, and I'm assuming it's using the same code. microsoft/semantic-kernel#2506
