Evals #37
Conversation
@0x4007 Evaluations cannot be configured as action requests since repository secrets cannot be accessed in pull requests (without the pull_request workflow trigger, which I think is not safe). I have currently implemented
@rndquu knows the solution for secrets and pull requests, I think. I'm not sure because I have never implemented evals before. I suppose you might know best offhand and without research.
@sshivaditya2019, this task has been idle for a while. Please provide an update.
@sshivaditya2019 Regarding the evals-testing.yml workflow. Yes, GitHub secrets can't be accessed in workflows triggered from forks by the
What to do:
577d87d to 58c01ac (compare)
.github/workflows/evals-testing.yml
Outdated
@@ -19,7 +17,7 @@ jobs:
        VOYAGEAI_API_KEY: ${{ secrets.VOYAGEAI_API_KEY }}
        OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        OPENROUTER_API_KEY: ${{ secrets.OPENROUTER_API_KEY }}
-       UBIQUITY_OS_APP_NAME: "ubiquity-agent" # Hardcoded value
+       UBIQUITY_OS_APP_NAME: "ubiquity-agent"
I assume ubiquity-agent is for testing purposes only.
Yes, I'll revert the changes. I am trying to make some examples for QA.
This will work once deployed. I'm trying to make some examples for QA and will try running this locally. I'll update the changes based on this.
QA:
@0x4007 Should this workflow be triggered by the codebase contributors? It costs around
I'm not sure what makes the most sense to do here. I think the safest option is for collaborators to run this manually, but I also wonder if we can run it automatically just on merges to main.
We could either run this automatically on merges to the main branch, or trigger it after a PR receives 2 approvals for merging. |
evals/data/eval-gold-responses.json
Outdated
"body": "Manifests need to be updated so the name matches the intended name, which is the name of the repo it lives in.\n\nAny mismatch in manifest.name and the plugin repo, and we will not be able to install those plugins. The config will look like this:\n\nThis is because the worker URL contains the repo name, and we use that to match against manifest.name.",
"number": 27,
"html_url": "https://github.com/ubiquity-os/ubiquity-os-plugin-installer/issues/27/",
"question": "@ubosshivaditya could you please provide a summary of the issue ?"
Will it be a problem to leave in @ubosshivaditya?
Also, this seems like a random example; can you explain the context of this file further?
We can modify the app name to whatever is stored in the secrets; it doesn't matter, as the askQuestion function will be triggered either way.
This file primarily contains solid baseline examples, including "gold star" responses to questions. We run the model with the same context and should expect similar performance.
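As a sketch of how such a gold-response file might be consumed (the field names mirror the JSON fragment above, but the similarity metric and helper names here are assumptions for illustration, not the plugin's actual code):

```typescript
// Hypothetical shape of one entry in the gold-responses file.
interface GoldResponse {
  body: string; // the "gold" answer the model is expected to approximate
  number: number; // issue number the question refers to
  html_url: string; // link to the issue
  question: string; // the prompt posed to the model
}

// Placeholder metric: shared-prefix ratio, standing in for whatever the
// eval actually uses (cosine similarity, Levenshtein score, etc.).
function similarity(expected: string, actual: string): number {
  let i = 0;
  while (i < Math.min(expected.length, actual.length) && expected[i] === actual[i]) i++;
  return expected.length === 0 ? 1 : i / expected.length;
}

// Run each question through the model and check the answer against the gold
// body, passing when the similarity clears the configured threshold.
function evaluate(gold: GoldResponse[], answer: (q: string) => string, threshold = 0.8): boolean[] {
  return gold.map((g) => similarity(g.body, answer(g.question)) >= threshold);
}
```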
We can modify the app name to whatever is stored in the secrets; it doesn't matter,
We have the production and beta instances of the app, so I'm not sure about dealing with secrets to save the names. Think through how this will be configured and let me know what you think makes sense.
including "gold star" responses to questions.
I guess it's "gold standard" I just messed up the terminology when I called it "gold star".
We have the production and beta instances of the app, so I'm not sure about dealing with secrets to save the names. Think through how this will be configured and let me know what you think makes sense.
I think it would be better if we could just hard-code the names and keep them consistent in the workflow.
I guess it's "gold standard" I just messed up the terminology when I called it "gold star".
No, you were correct; it's called a "gold star response". "Gold standard" is a different approach, but not the one we're discussing here.
Hard coding seems questionable for developers but generally yes I agree that it's easier to deal with vs secrets. @gentlementlegen please decide
I think this is no longer relevant, as we are using the LLM command router.
openai: OpenAI;
}

export const initAdapters = (context: Context, clients: EvalClients): Context => {
This shouldn't have passed linter settings. Make sure you run yarn format in case the git hooks aren't working. We don't generally allow arrow functions unless they are used in some tiny callback context.
});
formattedChat += SEPERATOR;
// Iterate through the ground truths and add them to the final formatted chat
formattedChat += "#################### Ground Truths ####################\n";
Are this many hashtags really necessary? I thought writing in plain Markdown syntax would be sufficient.
config: {
  model: "gpt-4o",
  similarityThreshold: 0.8,
  maxTokens: 1000,
Why did you set such a low max? Also should we really be using GPT?
The expected answer should be short, not too long; making it longer would just make the model talk too much and be irrelevant. For the model, I think it should be gpt-4o because it's relatively stable right now. I tried the same with o1-mini and o1-preview, but each test gave very different results. It's also more expensive, using about 75,000 input tokens on average.
There's also o1 now to consider using.
To be frank, I'm not super concerned with costs because we don't even use this feature every day, but it can be tremendously valuable to automatically populate all the context and ask a question conveniently, particularly from mobile, which is my normal client.
If it's providing high quality answers it's okay to spend a bit more in exchange for all the time savings.
However, obviously, if the difference is marginal then we can optimize and use the cheaper models.
Why are we using two different models?
Ground-truths is simple enough to be handled by gpt-4o or even gpt-4o-mini. I don't think it would be productive to use a much more powerful model for such a simple task.
I wonder if we should just migrate everything that needs a large context window to Claude at this point.
It should be pretty simple, we just need to update the
Is this good to go?
It is. I've also added the
If you mean the CI question, yes, good to go.
Merged a3814c8 into ubiquity-os-marketplace:development
Resolves #185
Command-ask using the Levenshtein score.
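The PR doesn't show the scoring code itself; as a hedged sketch, a Levenshtein score usually means the edit distance normalized into [0, 1], where 1 means the strings are identical (the function names here are assumptions):

```typescript
// Classic single-row dynamic-programming edit distance: the minimum number
// of insertions, deletions, and substitutions turning `a` into `b`.
function levenshtein(a: string, b: string): number {
  const m = a.length;
  const n = b.length;
  const dp: number[] = Array.from({ length: n + 1 }, (_, j) => j);
  for (let i = 1; i <= m; i++) {
    let prev = dp[0]; // dp[i-1][j-1] from the previous row
    dp[0] = i;
    for (let j = 1; j <= n; j++) {
      const tmp = dp[j]; // dp[i-1][j] before overwriting
      dp[j] = Math.min(
        dp[j] + 1, // deletion
        dp[j - 1] + 1, // insertion
        prev + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution
      );
      prev = tmp;
    }
  }
  return dp[n];
}

// Normalize the distance to a similarity score in [0, 1].
function levenshteinScore(a: string, b: string): number {
  const maxLen = Math.max(a.length, b.length);
  return maxLen === 0 ? 1 : 1 - levenshtein(a, b) / maxLen;
}
```

A score like this can then be compared against a threshold (such as the 0.8 similarityThreshold in the config above) to decide whether a model answer is close enough to the gold response.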