Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Model Prompt rewrite #31

Draft
wants to merge 7 commits into
base: development
Choose a base branch
from

Conversation

sshivaditya2019
Copy link
Collaborator

Resolves #30

  • The reasoning is reorganized with a more modular approach to context.
  • The model receives a list of tools that it utilizes to gather additional context and information.
  • Context is integrated based on the tool calls performed by the LLM.
  • The LLM employs a multi-step Chain of Thought prompt, focusing on reasoning between each step.

return formattedText.trim();
}

private static _removeWeakPhrases(text: string): string {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is only viable with temperature 0 etc. couldn't this also cause a problem with quotes? If the text is quoted from elsewhere and it contains one of these expressions it might cause problems.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This only removes, the phrases from the generated output, and not quoted text. I don't think this is necessary, basically ensures the tone of the response.

Copy link
Contributor

@Keyrxng Keyrxng Nov 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn't this remove uncertainty and caution from the statements that would follow these phrases? Isn't an LLM likely to interpret these strings as indicating subjectivity, I'd think that these would give key context that we are not 100% on the following fact.

https://chatgpt.com/share/672642c9-0e90-8000-871c-613f8a2768b2

I see it used at the end of createCompletion which would imply it's stripping these from final output that we receive as well as completions sent between llm calls, is that right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While it does remove the contextual cues you mentioned, the second model, like o1-mini, offers better reasoning capabilities that can address ambiguous cases. It manages all the formatting and processing, which is beneficial. Although this isn't a perfect solution, it ensures a consistent tone across different outputs without unnecessary preambles before each message.


const response = await askGpt(context, question);
context.logger.info(`Answer: ${response.answer}`, {
caller: "_Logs.<anonymous>",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was hardcoded into the tests, for askQuestion. I'll remove this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is added via the logger you don't need caller

@Keyrxng
Copy link
Contributor

Keyrxng commented Nov 2, 2024

I don't think this is the right thing to do. I'm leaving my opinion early on, absorb or ignore it whatever the case may be but:

Rewriting the model prompt doesn't involve taking all of our functions and turning them into LLM tools and have the AI be involved in the collection of the data via parsing comments or fetching diffs or anything like that...

We should programmatically obtain all the data, the LLM should not play a part in that at all. It having these tools, in my mind, is a poor decision.


If the intention is to improve the context the AI has by allowing it to custom search context against the GitHub API then clearly embeddings search is the problem as it should have everything that we have on GitHub.

@sshivaditya2019
Copy link
Collaborator Author

sshivaditya2019 commented Nov 2, 2024

We should programmatically obtain all the data, the LLM should not play a part in that at all. It having these tools, in my mind, is a poor decision.

The model is effectively performing the same actions: calling Similar Comments, then Similar Issues, and finally Issue Search. Previously, these tasks would have been executed simultaneously, but now the model processes the data sequentially. If I recall correctly, this was one of your concerns, as the prompt was already pulling in data from multiple tasks, which resulted in the context being cluttered with information from various sources.

Apart from that, this can achieve performance close to o1-mini using just GPT-4o. I believe that would provide significant cost savings.

@Keyrxng
Copy link
Contributor

Keyrxng commented Nov 2, 2024

The model's process hasn't changed: it calls Similar Comments, then Similar Issues, and finally Issue Search.

But why does it need to do this?

From your description, search should pull in the most relevant context beyond linked issues (with an unlimited fetch depth, a key goal for 0x4007). Now we lack control over linked issue collection, embedding search, etc.

Previously, createCompletion was the final step, pulling context from recursively fetched linked issues and embedding search results. We then passed this to a reranker before feeding it to the LLM. Is the embedding search underperforming in relevance assignment?

The LLM's output depends entirely on what we input, which we can control and refine.


This PR seems to move the plugin toward an autonomous L2 agent, which was not its intended purpose. If it’s meant to act autonomously, it should be a separate plugin. This approach adds opacity to the process, which has already been problematic. While reasoning logs provide insight, this shift is a departure from the plugin’s initial design and is undocumented in the spec or PR about why these changes are suddenly necessary.


If this PR was based on this comment, it doesn’t address that problem. #28 already fixes it. The root issue was missing context and hallucinations due to hashMatching being removed.

The relevant issues and fixes are documented in the PR I requested you review. Your PR's changes are a major breaking shift; ideally, we’d merge mine and test in prod first, and if issues persist, then document why handing full control to the LLM is necessary.

I feel instead of solving the root issue, we've just given the LLM tools to be able pick up where we fall short in terms of A) context fetching B) embeddings search C) prompt engineering.


That's my two cents on things but 0x4007 is the boss obviously, I'm just giving my input since I was the go-to for AI features for quite some time and have enough experience building with OpenAI to be confident in what I'm saying.

I've said before, I ask questions so that I understand the software that's being written so that I can effectively work on that codebase as well as perform review on it. Not out of avg curiosity or 'to learn' per-se but to become effective at fault finding, triage, future features, etc.

@sshivaditya2019
Copy link
Collaborator Author

From your description, search should pull in the most relevant context beyond linked issues (with an unlimited fetch depth, a key goal for 0x4007). Now we lack control over linked issue collection, embedding search, etc.

We have control over what is passed and what is called. The recursive search is still in place; the main difference is in the execution order. Instead of processing everything at once, the LLM now handles it more sequentially.

Previously, createCompletion was the final step, pulling context from recursively fetched linked issues and embedding search results. We then passed this to a reranker before feeding it to the LLM. Is the embedding search underperforming in relevance assignment?

The relevance assignment is functioning as intended. The final step remains createCompletion. I’m unclear why you believe the embedding search is problematic; that’s not the case.


This PR seems to move the plugin toward an autonomous L2 agent, which was not its intended purpose. If it’s meant to act autonomously, it should be a separate plugin. This approach adds opacity to the process, which has already been problematic. While reasoning logs provide insight, this shift is a departure from the plugin’s initial design and is undocumented in the spec or PR about why these changes are suddenly necessary.

As I mentioned earlier, we’re aiming to integrate everything retrieved so far into the model’s context window. This ensures the process occurs over multiple calls, allowing the model to prioritize what's essential in each iteration. I’m not sure what you’re referring to, but every tool call is logged, allowing us to see the input given to the model, just like in the previous version.


I feel instead of solving the root issue, we've just given the LLM tools to be able pick up where we fall short in terms of A) context fetching B) embeddings search C) prompt engineering.

If context fetching or embedding search is an issue, it would be a problem regardless of the approach taken. It’s not productive to place blame solely on the model. Regarding prompt engineering, if the model isn’t receiving pull diffs and ground truths, then improving the prompts is futile, both of which you claim to be fixed by your PR. Then, I don't see an issue here.

@Keyrxng
Copy link
Contributor

Keyrxng commented Nov 2, 2024

We have control over what is passed and what is called. The recursive search is still in place; the main difference is in the execution order. Instead of processing everything at once, the LLM now handles it more sequentially.

No, now the LLM decides whenever it wants to call a tool, so the gathered data is no longer static which is going to make life much more difficult in evaluating and refining things, yes logs help you see what it thought but now we are having to optimize it's internal reasoning logic as well as the global goal. When we already know what context to fetch A) All linked issues B) relevant embeddings (as per the spec for this feature).

As I mentioned earlier, we’re aiming to integrate everything retrieved so far into the model’s context window. This ensures the process occurs over multiple calls, allowing the model to prioritize what's essential in each iteration. I’m not sure what you’re referring to, but every tool call is logged, allowing us to see the input given to the model, just like in the previous version.

Then we should improve our logic for obtaining context as we can easily benchmark and test that. Starting a loop and allowing the blackbox to take over is going to become a nightmare imo.

The LLM should have one job: Take this context we gathered specifically from linked issues and embeddings search, and do what the instruction tells you. It shouldn't have the job of gathering all that data or it has the very real potential to run off endlessly fetching context it thinks is relevant. The search feature and statically linked issues has all the context we could want for, if both of those features work correctly because embeddings cover every comment across the org and task participants don't link to random off-topic issues/prs/repos.

I’m unclear why you believe the embedding search is problematic; that’s not the case.

  1. I assume we are assuming a ubiquity DB for production rather than your own? If you are using your DB and Ubiquity is using a production DB then the embeddings won't match exactly unless you are scripting to collect the new comments/tasks etc.
  2. We have not established benchmarks or baselines for any of this, which we absolutely should, but I am not saying it's a problem. I'm saying it's a bit of a black box and we need it to be clearer and easier to evaluate.

If I recall correctly, this was one of your concerns, as the prompt was already pulling in data from multiple tasks, which resulted in the context being cluttered with information from various sources.

To which I suggested using GPT to take the all the data we fetched and embeddings etc, and then create a far more succinct ctx window with the noise removed. I see you implemented a new creation call for something almost similar. I didn't think that giving the LLM full control was the solution to refining our collected context.

If context fetching or embedding search is an issue, it would be a problem regardless of the approach taken. It’s not productive to place blame solely on the model. Regarding prompt engineering, if the model isn’t receiving pull diffs and ground truths, then improving the prompts is futile, both of which you claim to be fixed by your PR. Then, I don't see an issue here.

That's my point exactly buddy thank you and I'm not blaming anything on the model dude. I'm saying that giving the LLM more freedom to do as it wants via the ability to call tools whenever it deems fit is wrong when there's other issues at play.

Instead we should refine and improve the two base methods of data collection: A) linked issues (static, never changes once an issue is complete) B) embeddings search (dynamic always as they are updated per org comment, of which Ubiquity has 4 to consider).

So before we give full control to an LLM and hope it delivers, let's establish benchmarks and baselines that we can repeat with optimizations or logic/prompt changes.

I can make a pretty educated guess how we'd create these benchmarks for our use-case but maybe you have some more specific experience or learnings with that sort of thing? Wouldn't this be a much better way to triage and optimize things rather than us taking educated-stabs-in-the-dark at it?

@sshivaditya2019
Copy link
Collaborator Author

sshivaditya2019 commented Nov 2, 2024

No, now the LLM decides whenever it wants to call a tool, so the gathered data is no longer static which is going to make life much more difficult in evaluating and refining things, yes logs help you see what it thought but now we are having to optimize it's internal reasoning logic as well as the global goal. When we already know what context to fetch A) All linked issues B) relevant embeddings (as per the spec for this feature).

I am not sure why you believe it is like this, but that's not quite accurate. The model primarily uses three functions to fetch context, with recursive issue search running independently of parameters, while the other two require specific queries. If you’ve reviewed the logs, you'd see all three methods being called consistently in each run.

Then we should improve our logic for obtaining context as we can easily benchmark and test that. Starting a loop and allowing the blackbox to take over is going to become a nightmare imo.

I am not sure, why you keep calling this blackbox, but you can easily debug which action/word in the prompt or query is leading to a particular reasoning step and the idea behind choosing a particular tool.

The LLM should have one job: Take this context we gathered specifically from linked issues and embeddings search, and do what the instruction tells you. It shouldn't have the job of gathering all that data or it has the very real potential to run off endlessly fetching context it thinks is relevant. The search feature and statically linked issues has all the context we could want for, if both of those features work correctly because embeddings cover every comment across the org and task participants don't link to random off-topic issues/prs/repos.

Yeah, it does have the potentially to keep fetching context, but then there is the token limit as well. So practically, we won’t run into that issue at all.

  1. I assume we are assuming a ubiquity DB for production rather than your own? If you are using your DB and Ubiquity is using a production DB then the embeddings won't match exactly unless you are scripting to collect the new comments/tasks etc.

It is running, off my DB, which I periodically update with comments and issues across organizations. Is that an issue ?

  1. We have not established benchmarks or baselines for any of this, which we absolutely should, but I am not saying it's a problem. I'm saying it's a bit of a black box and we need it to be clearer and easier to evaluate.

Good point. We should establish strong retrieval metrics or benchmarks. This will streamline evaluation.

That's my point exactly buddy thank you and I'm not blaming anything on the model dude. I'm saying that giving the LLM more freedom to do as it wants via the ability to call tools whenever it deems fit is wrong when there's other issues at play.

I’m not sure what you mean by this. Reasoning prompts already guide the model’s function calls at the start, according to the prompt, and it's not set up to randomly invoke functions. It’s designed to follow prompts precisely.

Instead we should refine and improve the two base methods of data collection: A) linked issues (static, never changes once an issue is complete) B) embeddings search (dynamic always as they are updated per org comment, of which Ubiquity has 4 to consider).

Yeah, it should be refined further, but they are independent of model architecture. With this PR I am trying to accomplish the following things:

  • A more modular approach between context gathering and model processing
  • Sequential data fetching
  • Reducing prompt clutter from unrelated tasks.

The other two base methods should be improved regardless of this approach, I don't think that is relevant to be mentioned here.

I can make a pretty educated guess how we'd create these benchmarks for our use-case but maybe you have some more specific experience or learnings with that sort of thing? Wouldn't this be a much better way to triage and optimize things rather than us taking educated-stabs-in-the-dark at it?

We should have benchmarks for embeddings and optimize them for retrieval process, but even if we do I am not sure what we are benchmarking, we are using off the self embeddings, which are top of the leaderboard (MTEB). We do not have any prompts to optimize, and any distance metric chosen would yield similar results, albeit with slight variations in thresholds.


I appreciate you asking questions and all. But, Please give a moment to read through the code. Tool calling operates differently; it invokes functions based on the provided prompt and any necessary context. To avoid endless function calls or the issues you've mentioned, we have limits and optimizations in place. In summary, you have two main concerns:

  1. Blackbox Approach : This isn’t accurate. The model strictly follows prompts, calling appropriate functions initially, and logging details along the way. It also passes all current test cases.

  2. Improving embedding and context retrieval: Agreed. These enhancements are valuable independently of the model approach chosen.

Let me know, if there are any other concerns apart from the ones raised above.

@Keyrxng
Copy link
Contributor

Keyrxng commented Nov 2, 2024

https://platform.openai.com/docs/guides/function-calling#lifecycle

Feed LLM systemMessage and query > it decides: "call a tool" or "respond" > repeating until it is satisfised or the tokenLimit is hit.

If you’ve reviewed the logs, you'd see all three methods being called consistently in each run.

Your QA is just the issue the LLM responded on not a link to your runs. I just checked your runs and the last one was a week ago so your logs were inside your worker I assume?

I am not sure, why you keep calling this blackbox

Because the search results are not logged specifically, we have no idea their influence on responses and we haven't benchmarked or started small and built up to a stable knowledgebase nor do we know how the reranker is actually reranking in terms of input/output. But AI features in general are a blackbox the farther we stray from that the easier it is to dev.

So practically, we won’t run into that issue at all.

That's a bold assumption without evidence I think.

Is that an issue ?

idk lol, is it? I'd think a prod DB would be best for embeddings gathering then they can actually be reviewed by reviewers and we don't have to run scripts to fill our own DB which you seem to be doing. (I still would but atleast 0x4007 and mentlegen could peep the supabase instance)

It’s designed to follow prompts precisely.

Haven't the issues with things recently alone been enough to show that it doesn't always follow your prompts, that's my concern, all of this logic and realistically it would be perfect so will need optimized again but at that point we are dealing with things that are much harder to quantify and test.

With this PR I am trying to accomplish the following things:

  • If the LLM is now responsible for tool calling (data fetching), then context fetching is now embedded into the LLM which is less modular than before
  • I don't understand how it's more "sequential" when it's using the same fetching fns in the same way like you said. We fetched in sequence of current issue > current linked > additional linked > ..., the llm now just tells us what fns to call?
  • Imo that is a real concern that should be addressed for sure you are spot on, but as it's own task and without such a massive breaking change to things before we implement baselines and benchmarks

I am not sure what we are benchmarking

I wrote a spec #32 please add to it. Benchmarkings:

  1. Data retrieval: linked (static) & search (dynamic)
  2. Response: Test conditions where we evaluate both individually and them with combined context (maybe all we need is search and we can remove static completely? Hopefully actually)
  3. Factual accuracy: Of both the returned embeddings and of the answer to the query given the context

With our QAs, that's very unstructured and for sure not test conditions but it's a sort of benchmark isn't it? We want to break this down into repeatable and gradable steps independent of each other and then combined and if it can be automated great but probably requires marking a few problematic issues/prs and using those any time we merge a feature to get new updated benchmarks for each step in the process.

@Keyrxng
Copy link
Contributor

Keyrxng commented Nov 2, 2024

tldr; I'm not against giving the LLM full control but first we should effectively benchmarks our core components and then take it from there. An autonomous chatbot would be pretty cool fetching whatever it wants to but we need better foundations first.

@0x4007
Copy link
Member

0x4007 commented Nov 2, 2024

@Keyrxng you should let shiv focus on shipping this stuff and we'll address potential future problems iteratively.

@Keyrxng
Copy link
Contributor

Keyrxng commented Nov 2, 2024

@Keyrxng you should let shiv focus on shipping this stuff and we'll address potential future problems iteratively.

I apologize for intruding, I did based on the fact that #28 should resolve these issues and the review is being avoided for a massive breaking change to the plugin. It would have saved development time had that PR been reviewed and QA'd first.

I'll stop providing reviews on this plugin as I seem to slow progress and to handle future problems I need to understand what's going on at a deep level so I'll do what I can if I'm asked basically.

@Keyrxng Keyrxng mentioned this pull request Nov 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Model Hallucinations
4 participants