Evals #37

Merged
Changes from 1 commit
37 commits
5177c29
fix: setup evals actions
sshivaditya Nov 21, 2024
e61e401
fix: setup evals actions
sshivaditya Nov 21, 2024
148a73f
feat: adds ClosedQA, ContextPrecision
sshivaditya Nov 22, 2024
ae6d163
fix: removed ClosedQA
sshivaditya Nov 22, 2024
e075c71
fix: cspell and knip
sshivaditya Nov 22, 2024
a077ad8
fix: deps issues
sshivaditya Nov 22, 2024
122efad
fix: downgraded typebox
sshivaditya Nov 22, 2024
43e5e50
fix: evals action
sshivaditya Nov 22, 2024
a518e5e
fix: braintrust action
sshivaditya Nov 22, 2024
8dfe444
fix: update permission
sshivaditya Nov 22, 2024
2bad0fe
fix: add github token to the braintrust
sshivaditya Nov 22, 2024
399d80a
fix: update permission
sshivaditya Nov 22, 2024
139770e
fix: github action
sshivaditya Nov 22, 2024
633ebfe
fix: action
sshivaditya Nov 22, 2024
75a2f87
fix: action
sshivaditya Nov 22, 2024
a4351af
fix: action file dir
sshivaditya Nov 22, 2024
022b9ce
fix: updated package.json
sshivaditya Nov 22, 2024
98d3777
fix: add env to the action
sshivaditya Nov 22, 2024
1d2309c
fix: log env
sshivaditya Nov 23, 2024
2cb4a14
fix: actions env
sshivaditya Nov 23, 2024
58c01ac
fix: updated workflow
sshivaditya Nov 30, 2024
311fc4a
fix: updated workflow
sshivaditya Nov 30, 2024
4a9df8a
fix: package.json duplicate items
sshivaditya Nov 30, 2024
171e427
fix: fixed lockfile
sshivaditya Nov 30, 2024
64263de
fix: lockfile
sshivaditya Dec 1, 2024
766e23f
fix: remove unused deps
sshivaditya Dec 1, 2024
75779f8
fix: update workflow
sshivaditya Dec 1, 2024
fa8934f
fix: evals workflow
sshivaditya Dec 10, 2024
a8922db
fix: add workflow dispatch
sshivaditya Dec 10, 2024
2f0cb7a
fix: evals secrets
sshivaditya Dec 11, 2024
cab2349
fix: prettier and jest
sshivaditya Dec 11, 2024
dd39b3b
fix: fix knip
sshivaditya Dec 11, 2024
cfe8c4d
fix: formatting
sshivaditya Dec 13, 2024
fe11d1b
fix: ncc compile error
sshivaditya Dec 13, 2024
7ec1087
fix: ncc compile error
sshivaditya Dec 13, 2024
5d27598
chore: updated manifest.json and dist build
github-actions[bot] Dec 13, 2024
a5d9302
fix: change to /ask
sshivaditya Dec 13, 2024
10 changes: 4 additions & 6 deletions .github/workflows/evals-testing.yml
@@ -1,10 +1,8 @@
  name: Run Braintrust Evals

  on:
-   workflow_run:
-     workflows: ["Knip"]
-     types:
-       - completed
+   workflow_dispatch:
+   pull_request:

  permissions: write-all

@@ -19,7 +17,7 @@ jobs:
  VOYAGEAI_API_KEY: ${{ secrets.VOYAGEAI_API_KEY }}
  OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
  OPENROUTER_API_KEY: ${{ secrets.OPENROUTER_API_KEY }}
- UBIQUITY_OS_APP_NAME: "ubiquity-agent" # Hardcoded value
+ UBIQUITY_OS_APP_NAME: "ubiquity-agent"
Member

I assume ubiquity-agent is for testing purposes only.

Collaborator Author

Yes, I'll revert the changes. I'm trying to put together some examples for OA.


  steps:
  - uses: actions/setup-node@v4
@@ -43,7 +41,7 @@ jobs:
  VOYAGEAI_API_KEY: ${{ secrets.VOYAGEAI_API_KEY }}
  OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
  OPENROUTER_API_KEY: ${{ secrets.OPENROUTER_API_KEY }}
- UBIQUITY_OS_APP_NAME: "ubiquity-agent" # Hardcoded value
+ UBIQUITY_OS_APP_NAME: "ubiquity-agent"

  - name: Add Evals Report to Github Comment
  if: always()
12 changes: 6 additions & 6 deletions evals/data/eval-gold-responses.json
@@ -1,14 +1,14 @@
  {
    "issueResponses": [
      {
-       "scenario": "UI Improvements Issue",
+       "scenario": "manifest.name mismatch ",
        "issue": {
-         "body": "Depends on #13\n\nWhat should be done:\n\n1. The \"select configuration\" step should be removed. Always use the \"production\" config.\n2. Show plugin readme on installing a new plugin (original comment)\n3. Show indicator if plugin is already installed (original comment)\n4. Load current config values in the editor (if config is present)\n5. Add \"Reset to defaults\" button which resets plugin options to default values (original comment)\n6. Add github redirect to the same page (example)",
-         "number": 19,
-         "html_url": "https://github.com/ubiquity-os/ubiquity-os-plugin-installer/issues/19/",
-         "question": "@ubosshivaditya could you please provide a summary of the UI improvements needed based on the issue discussion and linked issues?"
+         "body": "Manifests need to be updated so the name matches the intended name, which is the name of the repo it lives in.\n\nAny mismatch in manifest.name and the plugin repo, and we will not be able to install those plugins. The config will look like this:\n\nThis is because the worker URL contains the repo name, and we use that to match against manifest.name.",
+         "number": 27,
+         "html_url": "https://github.com/ubiquity-os/ubiquity-os-plugin-installer/issues/27/",
+         "question": "@ubosshivaditya could you please provide a summary of the issue ?"
Member

Will it be a problem to leave in @ubosshivaditya?

Also, this seems like a random example. Can you explain the context of this file further?

Collaborator Author

We can modify the app name to whatever is stored in the secrets; it doesn't matter, as the askQuestion function will be triggered either way.

This file primarily contains solid baseline examples, including "gold star" responses to questions. We run the model with the same context and should expect similar performance.
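
As a rough illustration of how a gold response can act as a baseline, the sketch below scores a model answer against `expectedResponse` with a plain token-overlap (Jaccard) measure. This is not the scorer used in this PR: the `GoldExample` shape only mirrors the fields visible in `eval-gold-responses.json`, and `tokenize`, `jaccard`, `scoreAgainstGold`, and the 0.5 threshold are hypothetical names and values chosen for the example.

```typescript
// Hypothetical shape, limited to the fields visible in eval-gold-responses.json.
interface GoldExample {
  scenario: string;
  issue: { body: string; number: number; html_url: string; question: string };
  expectedResponse: string;
}

// Lowercase the text and collect its alphanumeric word tokens.
function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]+/g) || [];
}

// Jaccard similarity over unique tokens: |A ∩ B| / |A ∪ B|.
function jaccard(a: string, b: string): number {
  const setA = new Set(tokenize(a));
  const setB = new Set(tokenize(b));
  if (setA.size === 0 && setB.size === 0) return 1;
  let shared = 0;
  setA.forEach((token) => {
    if (setB.has(token)) shared++;
  });
  return shared / (setA.size + setB.size - shared);
}

// Compare one model answer with its gold response. The 0.5 threshold is arbitrary.
function scoreAgainstGold(example: GoldExample, actual: string, threshold = 0.5) {
  const score = jaccard(example.expectedResponse, actual);
  return { scenario: example.scenario, score, passed: score >= threshold };
}
```

A lexical measure like this only catches gross regressions; a semantic or LLM-judged scorer would likely be needed to decide whether two differently worded answers are equivalent.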

Member
0x4007, Dec 13, 2024

> We can modify the app name to whatever is stored in the secrets; it doesn't matter,

We have the production and beta instances of the app, so I'm not sure about dealing with secrets to save the names. Think through how this will be configured and let me know what you think makes sense.

> including "gold star" responses to questions.

I guess it's "gold standard"; I just messed up the terminology when I called it "gold star".

Collaborator Author

> We have the production and beta instances of the app, so I'm not sure about dealing with secrets to save the names. Think through how this will be configured and let me know what you think makes sense.

I think it would be better if we could just hard-code the names and keep them consistent in the workflow.

> I guess it's "gold standard"; I just messed up the terminology when I called it "gold star".

No, you were correct: it's called a "gold star response" [1]. "Gold standard" is a different approach, but not the one we're discussing here.

Footnotes

  1. https://arxiv.org/html/2410.23214v1

Member

Hard-coding seems questionable for developers, but generally, yes, I agree that it's easier to deal with than secrets. @gentlementlegen, please decide.

Collaborator Author

I think this is no longer relevant, as we are using the LLM command router.

        },
-       "expectedResponse": "Based on the issue discussion and linked issues, here's a summary of the UI improvements needed:\n\n1. Configuration Simplification:\n- Remove the \"select configuration\" step\n- Always use \"production\" config as this is for partners\n- Development configs can be created manually when needed\n\n2. Plugin Information Display:\n- Show plugin readme when installing new plugins\n- Add indicator to show if a plugin is already installed\n- Load and display current config values in the editor when available\n\n3. Configuration Management:\n- Add \"Reset to defaults\" button for plugin options\n- Ensure proper handling of config values\n\n4. GitHub Integration:\n- Add GitHub redirect functionality to the same page\n- Reference implementation available in work.ubq.fi repository\n\nThis implementation depends on the config installer feature (#13) being completed first. The changes focus on improving user experience for partners while maintaining development flexibility through manual configuration options.",
+       "expectedResponse": "The manifest.name should match the name of the repo it lives in. This is because the worker URL contains the repo name, and we use that to match against manifest.name.",
        "sender": {
          "login": "sshivaditya2019",
          "type": "User"
4 changes: 1 addition & 3 deletions evals/llm.eval.ts
@@ -7,13 +7,11 @@ import { createClient } from "@supabase/supabase-js";
  import { createAdapters } from "../src/adapters";
  import { Context } from "../src/types/context";
  import { logger } from "../src/helpers/errors";
- import { Octokit } from "@octokit/rest";
+ import { customOctokit as Octokit } from "@ubiquity-os/plugin-sdk/octokit";
  import issueTemplate from "../tests/__mocks__/issue-template";
  import { writeFileSync } from "fs";
  import { fetchContext, formattedHistory, initAdapters } from "./handlers/setup-context";

- console.log(process.env);
-
  // Required environment variables with type assertion
  const requiredEnvVars = {
    OPENAI_API_KEY: process.env.OPENAI_API_KEY as string,
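
The hunk above drops `console.log(process.env)`, which would have printed secret values into the workflow log, and keeps the `as string` assertions, which do not check anything at runtime. A minimal sketch of one alternative follows, assuming a hypothetical `requireEnv` helper that is not part of this PR: it fails fast and reports only the names of missing variables, never their values.

```typescript
// Hypothetical helper (not in the PR): returns the requested variables,
// or throws an error that lists only the names that are missing or empty.
function requireEnv<K extends string>(
  names: readonly K[],
  env: Record<string, string | undefined>
): Record<K, string> {
  const missing = names.filter((name) => !env[name]);
  if (missing.length > 0) {
    // Names only: values are never included in the message.
    throw new Error(`Missing required environment variables: ${missing.join(", ")}`);
  }
  const result = {} as Record<K, string>;
  for (const name of names) {
    result[name] = env[name] as string;
  }
  return result;
}
```

In the eval script this could be called as `requireEnv(["OPENAI_API_KEY", "VOYAGEAI_API_KEY"], process.env)`, so a missing secret surfaces as one clear error before any API call is made.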
5 changes: 5 additions & 0 deletions src/adapters/openai/helpers/completions.ts
@@ -118,6 +118,11 @@ export class Completions extends SuperOpenAi {
  },
  });

+ if (!res.choices || !res.choices.length) {
+   logger.debug(`No completion found for query: ${query} Response: ${JSON.stringify(res)}`, { res });
+   return { answer: "", tokenUsage: { input: 0, output: 0, total: 0 }, groundTruths };
+ }
+
  const answer = res.choices[0].message;
  if (answer && answer.content && res.usage) {
  return {
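
The guard added in this hunk can be read in isolation as a small pure function. The sketch below is an illustration under assumptions, not the PR's code: `ChatResponseLike` is a trimmed-down stand-in for the chat completion response, `toCompletionResult` is a hypothetical name, `groundTruths` is assumed to be a string array, and falling back to zero token counts when `usage` is absent is a choice made for this example.

```typescript
interface TokenUsage {
  input: number;
  output: number;
  total: number;
}

interface CompletionResult {
  answer: string;
  tokenUsage: TokenUsage;
  groundTruths: string[];
}

// Assumed minimal subset of a chat completion response.
interface ChatResponseLike {
  choices?: { message?: { content?: string | null } }[];
  usage?: { prompt_tokens: number; completion_tokens: number; total_tokens: number };
}

// Map a response to a result without ever indexing into an empty `choices` array.
function toCompletionResult(res: ChatResponseLike, groundTruths: string[]): CompletionResult {
  const zeroUsage: TokenUsage = { input: 0, output: 0, total: 0 };
  const first = res.choices && res.choices.length > 0 ? res.choices[0] : undefined;
  const content = first && first.message ? first.message.content : undefined;
  if (!content) {
    // No choices, no message, or empty content: return an empty answer.
    return { answer: "", tokenUsage: zeroUsage, groundTruths };
  }
  const tokenUsage = res.usage
    ? { input: res.usage.prompt_tokens, output: res.usage.completion_tokens, total: res.usage.total_tokens }
    : zeroUsage;
  return { answer: content, tokenUsage, groundTruths };
}
```

Returning an empty answer instead of throwing keeps an eval run going when a provider returns no choices, at the cost of that case showing up as a low score rather than as an error.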