Code base Search #4
Comments
@0x4007 rfc
I think we should copy what cursor.com does, if that's public knowledge. They seem to do a pretty solid job at codebase indexing. As a second choice, there's also Microsoft's Semantic Kernel, which I assume is used inside GitHub's web-based Copilot and codebase indexing. I kind of got it to work in the past, but it was finicky.
Can you elaborate on this?
The current chatbot can parse code files, along with the comments, and then add them to the context to answer questions. QA: Link
Instead of Cursor, we can use Cline; it is open source and performs better than Cursor in real-world use cases. I am not sure how Semantic Kernel would work here. Semantic Kernel and its plugins, along with the core kernel, work quite well for unstructured data processing, but I don't think they have anything specific for code or code parsing. Cline provides models with tools that help them decide, based on context, whether to read a file. After reading the file, it offers suggestions. I believe Cursor's approach is similar to what I had in mind earlier: they semantically chunk the code and use an external embedding server or OpenAI to obtain vector embeddings for retrieval.
Cline looks epic - good find. Especially great that it's open source. We should take the useful bits from it then, like codebase parsing.
There was a demo I ran locally last year that allowed you to Q&A any codebase. Very similar to the Copilot chat we have on the GitHub UI now. I'm assuming it's using the same code. microsoft/semantic-kernel#2506
The model can currently parse code files shared via links. However, this functionality should be improved to allow searching through the entire codebase repository where the bot operates. Similar solutions, like Copilot, chunk code into units, convert them into graphs, and use these for retrieval.
Proposed Solution: AST-Based System for Codebase Queries
Automatic Analysis: The plugin would analyze the entire codebase, converting it into Abstract Syntax Trees (ASTs).
Chunking and Storage: These ASTs would be segmented into meaningful units, such as functions, classes, and methods, and stored in Supabase with associated metadata, semantic descriptions, and vector embeddings. The data would be split across three tables: AST_Nodes, Code_Chunks, and Chunk_Relationships.
Query Processing: When a user submits a query, it would be converted into an embedding and used to search for similar code chunks. The system would retrieve relevant chunks, their relationships, and semantic descriptions from the database to effectively answer queries.
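To make the three steps above concrete, here is a rough sketch of the chunk-then-retrieve flow. It rests on several assumptions that are not part of the proposal: it uses Python's built-in `ast` module as a stand-in for whatever parser the plugin would use, a toy bag-of-words `embed` function in place of a real embedding model, and an in-memory list in place of the Supabase tables. The names `CodeChunk`, `chunk_source`, `embed`, `cosine`, and `search` are hypothetical.

```python
import ast
import math
import re
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class CodeChunk:
    name: str              # function, class, or method name
    kind: str              # "function", "class", or "method"
    parent: Optional[str]  # enclosing class; a crude stand-in for Chunk_Relationships
    source: str            # source text of the chunk


def chunk_source(source: str) -> list:
    """Split Python source into function, class, and method chunks via its AST."""
    chunks = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            chunks.append(CodeChunk(node.name, "function", None,
                                    ast.get_source_segment(source, node)))
        elif isinstance(node, ast.ClassDef):
            chunks.append(CodeChunk(node.name, "class", None,
                                    ast.get_source_segment(source, node)))
            for item in node.body:
                if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef)):
                    chunks.append(CodeChunk(item.name, "method", node.name,
                                            ast.get_source_segment(source, item)))
    return chunks


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model (assumption): lower-cased word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def search(query: str, chunks: list, top_k: int = 3) -> list:
    """Return the top_k chunks whose source is most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c.source)), reverse=True)
    return ranked[:top_k]


SAMPLE = '''
def parse_config(path):
    """Read the config file from disk."""
    return open(path).read()

class Indexer:
    def add_chunk(self, chunk):
        """Store a code chunk in the index."""
        self.chunks.append(chunk)
'''

chunks = chunk_source(SAMPLE)
best = search("read config file", chunks, top_k=1)[0]
print(best.kind, best.name)
```

In a real setup the embeddings would presumably be computed once at indexing time and stored alongside each chunk, so that a query only needs one embedding call plus a vector similarity lookup in the database.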