Code base Search #4

shiv810 · 2024-10-21T01:49:47Z

The model can currently parse code files shared via links. However, this functionality should be improved to allow searching through the entire codebase repository where the bot operates. Similar solutions, like Copilot, chunk code into units, convert them into graphs, and use these for retrieval.

Proposed Solution: AST-Based System for Codebase Queries

Automatic Analysis: The plugin would analyze the entire codebase, converting it into Abstract Syntax Trees (ASTs).
Chunking and Storage: These ASTs would be segmented into meaningful units, such as functions, classes, and methods, and stored in Supabase across three tables (AST_Nodes, Code_Chunks, Chunk_Relationships), together with associated metadata, semantic descriptions, and vector embeddings.
Query Processing: When a user submits a query, it would be converted into an embedding and used to search for similar code chunks. The system would retrieve relevant chunks, their relationships, and semantic descriptions from the database to effectively answer queries.
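To make the query step concrete, here is a minimal sketch of ranking stored chunks against a query embedding and pulling in their related chunks. Everything in it is an assumption for illustration: the `CodeChunk` and `ChunkRelationship` field names are hypothetical (the proposal only names the tables), the embeddings are plain number arrays that some external embedding model would supply, and the search runs in memory rather than against Supabase.

```typescript
// Hypothetical row shape for a Code_Chunks-style table (field names are assumptions).
interface CodeChunk {
  id: number;
  filePath: string;
  kind: "function" | "class" | "method";
  name: string;
  description: string; // semantic description of the chunk
  embedding: number[]; // vector produced by an external embedding model
}

// Hypothetical row shape for a Chunk_Relationships-style table.
interface ChunkRelationship {
  fromId: number;
  toId: number;
  relation: "calls" | "contains" | "imports";
}

interface SearchHit {
  chunk: CodeChunk;
  score: number;
  relatedIds: number[];
}

// Cosine similarity between two vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("embedding dimensions differ");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) {
    return 0; // a zero vector has no direction, so treat it as dissimilar
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank all chunks by similarity to the query and attach outgoing relationships.
function searchChunks(
  queryEmbedding: number[],
  chunks: CodeChunk[],
  relationships: ChunkRelationship[],
  topK: number
): SearchHit[] {
  return chunks
    .map((chunk) => ({
      chunk,
      score: cosineSimilarity(queryEmbedding, chunk.embedding),
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK)
    .map(({ chunk, score }) => ({
      chunk,
      score,
      relatedIds: relationships
        .filter((r) => r.fromId === chunk.id)
        .map((r) => r.toId),
    }));
}
```

In a real deployment the ranking would more likely be pushed into the database than done in application code; the linear scan here is only meant to show the shape of the data flow.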

shiv810 · 2024-10-21T01:49:53Z

@0x4007 rfc

0x4007 · 2024-10-21T08:29:37Z

I think let's copy what cursor.com does, if that's public knowledge. They seem to do a pretty solid job at codebase indexing.

As a second choice, there's also Microsoft's Semantic Kernel, which I assume is used inside of GitHub's web-based Copilot and codebase indexing. I kind of got it to work in the past, but it was finicky.

> The model can currently parse code files shared via links.

Can you elaborate on this?

shiv810 · 2024-10-21T13:44:46Z

> > The model can currently parse code files shared via links.
>
> Can you elaborate on this?

The current chatbot can parse code files that are linked in comments and add their contents to the context when answering questions.

QA: Link

shiv810 · 2024-10-22T04:06:21Z

Instead of Cursor, we can use Cline; it is open source and performs better than Cursor in real-world use cases.

I am not sure how Semantic Kernel would work here. Semantic Kernel and its plugins, along with the core kernel, work quite well with unstructured data processing, but I don't think they have anything particular for code or code parsing.

Cline provides models with tools that help them determine whether to read a file based on context. After reading the file, it offers suggestions. I believe Cursor's approach is similar to what I proposed earlier: they semantically chunk the code and use an external embedding server or OpenAI to obtain vector embeddings for retrieval operations.
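For contrast with the embedding approach, a tool-driven flow can be sketched roughly as follows. This is not Cline's actual API: the tool names `list_files` and `read_file`, the `ToolCall` shape, and the in-memory `Repo` are all hypothetical, and the model that would decide which call to make is left out entirely.

```typescript
// In-memory stand-in for a repository: path -> file contents.
// A real plugin would read from disk or the GitHub API instead.
type Repo = Record<string, string>;

// Hypothetical tool calls a model could emit while exploring the repo.
type ToolCall =
  | { tool: "list_files" }
  | { tool: "read_file"; path: string };

// Execute one tool call and return text to feed back into the model's context.
function runTool(repo: Repo, call: ToolCall): string {
  switch (call.tool) {
    case "list_files":
      // Listing paths first lets the model decide which files are worth reading.
      return Object.keys(repo).sort().join("\n");
    case "read_file": {
      const exists = Object.prototype.hasOwnProperty.call(repo, call.path);
      return exists ? repo[call.path] : `Error: no such file: ${call.path}`;
    }
  }
}
```

The design difference is that nothing is indexed ahead of time: the model requests only the files it judges relevant, at the cost of extra round trips per question.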

0x4007 · 2024-10-22T05:40:13Z

Cline looks epic - good find. Especially great that it's open source. We should take the useful bits from it, like codebase parsing, then.

https://github.com/cline/cline/blob/52f86b188786d5fe8eb4ad5df800b9a95aaba6e6/src/services/tree-sitter/index.ts#L98
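As a rough idea of what "codebase parsing" could produce, here is a sketch that lists top-level function and class definitions in a source file. To be clear about the assumptions: this uses a naive line-based regex as a stand-in and is not tree-sitter, it is not taken from the linked Cline file, and `listTopLevelDefinitions` and `Definition` are hypothetical names. It also treats "starts at column zero" as "top level", which only holds for conventionally formatted code.

```typescript
interface Definition {
  kind: "function" | "class";
  name: string;
  line: number; // 1-based line number of the declaration
}

// Naive regex stand-in for an AST query: matches unindented
// `function` / `class` declarations, optionally exported or async.
function listTopLevelDefinitions(source: string): Definition[] {
  const pattern =
    /^(?:export\s+)?(?:default\s+)?(?:async\s+)?(function|class)\s+([A-Za-z_$][\w$]*)/;
  const definitions: Definition[] = [];
  source.split("\n").forEach((text, index) => {
    const match = pattern.exec(text);
    if (match) {
      definitions.push({
        kind: match[1] as "function" | "class",
        name: match[2],
        line: index + 1,
      });
    }
  });
  return definitions;
}
```

A real parser would also handle methods, arrow functions assigned to constants, and other languages, which is the main reason to reuse an existing tree-sitter based implementation rather than grow a regex like this.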

> but I don't think they have anything particular for code or code parsing.

There was a demo I ran locally last year that allowed you to Q&A any codebase. Very similar to the Copilot chat we have on the GitHub UI now. I'm assuming it's using the same code. microsoft/semantic-kernel#2506

devpool-directory-superintendent bot mentioned this issue Oct 21, 2024

Code base Search ubiquity/devpool-directory#1741

Open
