-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Model Prompt rewrite #31
base: development
Are you sure you want to change the base?
Conversation
return formattedText.trim(); | ||
} | ||
|
||
private static _removeWeakPhrases(text: string): string { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is only viable with temperature 0 etc. couldn't this also cause a problem with quotes? If the text is quoted from elsewhere and it contains one of these expressions it might cause problems.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This only removes, the phrases from the generated output, and not quoted text. I don't think this is necessary, basically ensures the tone of the response.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doesn't this remove uncertainty and caution from the statements that would follow these phrases? Isn't an LLM likely to interpret these strings as indicating subjectivity, I'd think that these would give key context that we are not 100% on the following fact.
https://chatgpt.com/share/672642c9-0e90-8000-871c-613f8a2768b2
I see it used at the end of createCompletion
which would imply it's stripping these from final output that we receive as well as completions sent between llm calls, is that right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While it does remove the contextual cues you mentioned, the second model, like o1-mini, offers better reasoning capabilities that can address ambiguous cases. It manages all the formatting and processing, which is beneficial. Although this isn't a perfect solution, it ensures a consistent tone across different outputs without unnecessary preambles before each message.
|
||
const response = await askGpt(context, question); | ||
context.logger.info(`Answer: ${response.answer}`, { | ||
caller: "_Logs.<anonymous>", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was hardcoded into the tests, for askQuestion. I'll remove this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is added via the logger you don't need caller
I don't think this is the right thing to do. I'm leaving my opinion early on, absorb or ignore it whatever the case may be but: Rewriting the model prompt doesn't involve taking all of our functions and turning them into LLM tools and have the AI be involved in the collection of the data via parsing comments or fetching diffs or anything like that... We should programmatically obtain all the data, the LLM should not play a part in that at all. It having these tools, in my mind, is a poor decision. If the intention is to improve the context the AI has by allowing it to custom search context against the GitHub API then clearly embeddings search is the problem as it should have everything that we have on GitHub. |
The model is effectively performing the same actions: calling Similar Comments, then Similar Issues, and finally Issue Search. Previously, these tasks would have been executed simultaneously, but now the model processes the data sequentially. If I recall correctly, this was one of your concerns, as the prompt was already pulling in data from multiple tasks, which resulted in the context being cluttered with information from various sources. Apart from that, this can achieve performance close to o1-mini using just GPT-4o. I believe that would provide significant cost savings. |
But why does it need to do this? From your description, search should pull in the most relevant context beyond linked issues (with an unlimited fetch depth, a key goal for 0x4007). Now we lack control over linked issue collection, embedding search, etc. Previously, The LLM's output depends entirely on what we input, which we can control and refine. This PR seems to move the plugin toward an autonomous L2 agent, which was not its intended purpose. If it’s meant to act autonomously, it should be a separate plugin. This approach adds opacity to the process, which has already been problematic. While reasoning logs provide insight, this shift is a departure from the plugin’s initial design and is undocumented in the spec or PR about why these changes are suddenly necessary. If this PR was based on this comment, it doesn’t address that problem. #28 already fixes it. The root issue was missing context and hallucinations due to hashMatching being removed. The relevant issues and fixes are documented in the PR I requested you review. Your PR's changes are a major breaking shift; ideally, we’d merge mine and test in prod first, and if issues persist, then document why handing full control to the LLM is necessary. I feel instead of solving the root issue, we've just given the LLM tools to be able pick up where we fall short in terms of A) context fetching B) embeddings search C) prompt engineering. That's my two cents on things but 0x4007 is the boss obviously, I'm just giving my input since I was the go-to for AI features for quite some time and have enough experience building with OpenAI to be confident in what I'm saying. I've said before, I ask questions so that I understand the software that's being written so that I can effectively work on that codebase as well as perform review on it. Not out of avg curiosity or 'to learn' per-se but to become effective at fault finding, triage, future features, etc. |
We have control over what is passed and what is called. The recursive search is still in place; the main difference is in the execution order. Instead of processing everything at once, the LLM now handles it more sequentially.
The relevance assignment is functioning as intended. The final step remains
As I mentioned earlier, we’re aiming to integrate everything retrieved so far into the model’s context window. This ensures the process occurs over multiple calls, allowing the model to prioritize what's essential in each iteration. I’m not sure what you’re referring to, but every tool call is logged, allowing us to see the input given to the model, just like in the previous version.
If context fetching or embedding search is an issue, it would be a problem regardless of the approach taken. It’s not productive to place blame solely on the model. Regarding prompt engineering, if the model isn’t receiving pull diffs and ground truths, then improving the prompts is futile, both of which you claim to be fixed by your PR. Then, I don't see an issue here. |
No, now the LLM decides whenever it wants to call a tool, so the gathered data is no longer static which is going to make life much more difficult in evaluating and refining things, yes logs help you see what it thought but now we are having to optimize it's internal reasoning logic as well as the global goal. When we already know what context to fetch A) All linked issues B) relevant embeddings (as per the spec for this feature).
Then we should improve our logic for obtaining context as we can easily benchmark and test that. Starting a loop and allowing the blackbox to take over is going to become a nightmare imo. The LLM should have one job: Take this context we gathered specifically from linked issues and embeddings search, and do what the instruction tells you. It shouldn't have the job of gathering all that data or it has the very real potential to run off endlessly fetching context it thinks is relevant. The search feature and statically linked issues has all the context we could want for, if both of those features work correctly because embeddings cover every comment across the org and task participants don't link to random off-topic issues/prs/repos.
To which I suggested using GPT to take the all the data we fetched and embeddings etc, and then create a far more succinct ctx window with the noise removed. I see you implemented a new creation call for something almost similar. I didn't think that giving the LLM full control was the solution to refining our collected context.
That's my point exactly buddy thank you and I'm not blaming anything on the model dude. I'm saying that giving the LLM more freedom to do as it wants via the ability to call tools whenever it deems fit is wrong when there's other issues at play. Instead we should refine and improve the two base methods of data collection: A) linked issues (static, never changes once an issue is complete) B) embeddings search (dynamic always as they are updated per org comment, of which Ubiquity has 4 to consider). So before we give full control to an LLM and hope it delivers, let's establish benchmarks and baselines that we can repeat with optimizations or logic/prompt changes. I can make a pretty educated guess how we'd create these benchmarks for our use-case but maybe you have some more specific experience or learnings with that sort of thing? Wouldn't this be a much better way to triage and optimize things rather than us taking educated-stabs-in-the-dark at it? |
I am not sure why you believe it is like this, but that's not quite accurate. The model primarily uses three functions to fetch context, with recursive issue search running independently of parameters, while the other two require specific queries. If you’ve reviewed the logs, you'd see all three methods being called consistently in each run.
I am not sure, why you keep calling this blackbox, but you can easily debug which action/word in the prompt or query is leading to a particular reasoning step and the idea behind choosing a particular tool.
Yeah, it does have the potentially to keep fetching context, but then there is the token limit as well. So practically, we won’t run into that issue at all.
It is running, off my DB, which I periodically update with comments and issues across organizations. Is that an issue ?
Good point. We should establish strong retrieval metrics or benchmarks. This will streamline evaluation.
I’m not sure what you mean by this. Reasoning prompts already guide the model’s function calls at the start, according to the prompt, and it's not set up to randomly invoke functions. It’s designed to follow prompts precisely.
Yeah, it should be refined further, but they are independent of model architecture. With this PR I am trying to accomplish the following things:
The other two base methods should be improved regardless of this approach, I don't think that is relevant to be mentioned here.
We should have benchmarks for embeddings and optimize them for retrieval process, but even if we do I am not sure what we are benchmarking, we are using off the self embeddings, which are top of the leaderboard (MTEB). We do not have any prompts to optimize, and any distance metric chosen would yield similar results, albeit with slight variations in thresholds. I appreciate you asking questions and all. But, Please give a moment to read through the code. Tool calling operates differently; it invokes functions based on the provided prompt and any necessary context. To avoid endless function calls or the issues you've mentioned, we have limits and optimizations in place. In summary, you have two main concerns:
Let me know, if there are any other concerns apart from the ones raised above. |
https://platform.openai.com/docs/guides/function-calling#lifecycle Feed LLM systemMessage and query > it decides: "call a tool" or "respond" > repeating until it is satisfised or the tokenLimit is hit.
Your QA is just the issue the LLM responded on not a link to your runs. I just checked your runs and the last one was a week ago so your logs were inside your worker I assume?
Because the search results are not logged specifically, we have no idea their influence on responses and we haven't benchmarked or started small and built up to a stable knowledgebase nor do we know how the reranker is actually reranking in terms of input/output. But AI features in general are a blackbox the farther we stray from that the easier it is to dev.
That's a bold assumption without evidence I think.
idk lol, is it? I'd think a prod DB would be best for embeddings gathering then they can actually be reviewed by reviewers and we don't have to run scripts to fill our own DB which you seem to be doing. (I still would but atleast 0x4007 and mentlegen could peep the supabase instance)
Haven't the issues with things recently alone been enough to show that it doesn't always follow your prompts, that's my concern, all of this logic and realistically it would be perfect so will need optimized again but at that point we are dealing with things that are much harder to quantify and test.
I wrote a spec #32 please add to it. Benchmarkings:
With our QAs, that's very unstructured and for sure not test conditions but it's a sort of benchmark isn't it? We want to break this down into repeatable and gradable steps independent of each other and then combined and if it can be automated great but probably requires marking a few problematic issues/prs and using those any time we merge a feature to get new updated benchmarks for each step in the process. |
tldr; I'm not against giving the LLM full control but first we should effectively benchmarks our core components and then take it from there. An autonomous chatbot would be pretty cool fetching whatever it wants to but we need better foundations first. |
@Keyrxng you should let shiv focus on shipping this stuff and we'll address potential future problems iteratively. |
I apologize for intruding, I did based on the fact that #28 should resolve these issues and the review is being avoided for a massive breaking change to the plugin. It would have saved development time had that PR been reviewed and QA'd first. I'll stop providing reviews on this plugin as I seem to slow progress and to handle future problems I need to understand what's going on at a deep level so I'll do what I can if I'm asked basically. |
Resolves #30