
Evals #37

Merged

Conversation

sshivaditya2019
Collaborator

Resolves #185

  • Adds evals for command-ask using the Levenshtein score (see the sketch below).
  • New GitHub Action for running the evals.
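
For context, here is a minimal sketch of how a Levenshtein-scored eval can be wired up with Braintrust's `Eval` runner and the `autoevals` scorers. The project name, sample data, and `askQuestion` helper are illustrative assumptions, not necessarily this PR's exact code:

```typescript
// Illustrative sketch only: assumes the "braintrust" SDK and "autoevals" scorers
// are installed and that BRAINTRUST_API_KEY is set in the environment.
import { Eval } from "braintrust";
import { Levenshtein } from "autoevals";

// Hypothetical wrapper; the real task would invoke the command-ask flow.
async function askQuestion(question: string): Promise<string> {
  return `Answer for: ${question}`;
}

Eval("command-ask", {
  // Each datum pairs a question with a "gold star" expected answer.
  data: () => [
    {
      input: "Could you please provide a summary of the issue?",
      expected: "The manifest name must match the plugin repository name.",
    },
  ],
  task: async (input) => askQuestion(input),
  // Levenshtein scores the output against the expected answer on a 0..1 scale.
  scores: [Levenshtein],
});
```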

Contributor

github-actions bot commented Nov 21, 2024

Unused devDependencies (1)

| Filename | devDependencies |
| -------- | --------------- |
| package.json | ts-node |

@sshivaditya2019
Collaborator Author

@0x4007 Evaluations cannot be configured as action requests since repository secrets cannot be accessed in pull requests (without a pull request workflow trigger, which I think is not safe). I have currently implemented Levenshtein and ContextRelevance metrics. Are there any other metrics that should be added to the evals?

@0x4007
Member

0x4007 commented Nov 23, 2024

@rndquu knows the solution for secrets and pull requests, I think. I'm not sure because I've never implemented evals before. I suppose you might know best offhand, without research.


@sshivaditya2019, this task has been idle for a while. Please provide an update.

@0x4007
Member

0x4007 commented Nov 28, 2024

@rndquu

@rndquu
Member

rndquu commented Nov 28, 2024

@sshivaditya2019 Regarding the evals-testing.yml workflow.

Yes, GitHub secrets can't be accessed in workflows triggered from forks by the pull_request trigger.

What to do:

  1. I suppose UBIQUITY_OS_APP_NAME can be hardcoded
  2. SUPABASE_URL and SUPABASE_KEY can also be hardcoded (example)
  3. Refactor the evals-testing.yml workflow to run on workflow_run (example). This way, when the knip workflow finishes, evals-testing.yml will run in a privileged context (i.e. with access to GitHub secrets in a safe way).

```diff
@@ -19,7 +17,7 @@ jobs:
       VOYAGEAI_API_KEY: ${{ secrets.VOYAGEAI_API_KEY }}
       OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
       OPENROUTER_API_KEY: ${{ secrets.OPENROUTER_API_KEY }}
-      UBIQUITY_OS_APP_NAME: "ubiquity-agent" # Hardcoded value
+      UBIQUITY_OS_APP_NAME: "ubiquity-agent"
```
Member


I assume ubiquity-agent is for testing purposes only.

Collaborator Author


Yes, I'll revert the changes. I am trying to make some examples for QA.

@sshivaditya2019
Collaborator Author

sshivaditya2019 commented Dec 10, 2024

This comment describes the solution.

Change these lines to:

```yaml
on:
  workflow_run:
    workflows: ["Knip"]
    types:
      - completed
```
and all the secrets will be available on a PR from a fork in a safe way.

This will work once deployed. I'm trying to make some examples for QA and will try running this locally. I'll update the changes based on this.

@sshivaditya2019
Collaborator Author

QA:

[Screenshot: Summary #1]

[Screenshot: Summary #2]

@sshivaditya2019 sshivaditya2019 marked this pull request as ready for review December 11, 2024 00:36
@sshivaditya2019
Collaborator Author

@0x4007 Should this workflow be triggered only by codebase contributors? It costs around $0.20 per run on average, so it could get expensive quickly after a few executions.

@0x4007
Member

0x4007 commented Dec 11, 2024

> @0x4007 Should this workflow be triggered only by codebase contributors? It costs around $0.20 per run on average, so it could get expensive quickly after a few executions.

I'm not sure what makes the most sense here. I think the safest option is for collaborators to run this manually, but I also wonder if we could run it automatically just on merges to main?

@sshivaditya2019
Collaborator Author

sshivaditya2019 commented Dec 11, 2024

> I'm not sure what makes the most sense here. I think the safest option is for collaborators to run this manually, but I also wonder if we could run it automatically just on merges to main?

We could either run this automatically on merges to the main branch, or trigger it after a PR receives 2 approvals for merging.

"body": "Manifests need to be updated so the name matches the intended name, which is the name of the repo it lives in.\n\nAny mismatch in manifest.name and the plugin repo, and we will not be able to install those plugins. The config will look like this:\n\nThis is because the worker URL contains the repo name, and we use that to match against manifest.name.",
"number": 27,
"html_url": "https://github.com/ubiquity-os/ubiquity-os-plugin-installer/issues/27/",
"question": "@ubosshivaditya could you please provide a summary of the issue ?"
Member


Will it be a problem to leave in @ubosshivaditya?

Also, this seems like a random example; can you explain the context of this file further?

Collaborator Author


We can modify the app name to whatever is stored in the secrets; it doesn't matter, as the askQuestion function will be triggered either way.

This file primarily contains solid baseline examples, including "gold star" responses to questions. We run the model with the same context and should expect similar performance.
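
For illustration, a hypothetical TypeScript shape for one of these baseline records, inferred from the JSON fields quoted above (any field beyond those four is an assumption):

```typescript
// Hypothetical record shape for a baseline ("gold star") example, inferred from
// the JSON snippet above; the actual file may differ.
interface GoldStarExample {
  body: string; // issue body used as context for the question
  number: number; // issue number the question refers to
  html_url: string; // link back to the source issue
  question: string; // question posed to command-ask
  expected?: string; // assumed: the gold star answer the eval compares against
}
```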

Member

@0x4007 0x4007 Dec 13, 2024


> We can modify the app name to whatever is stored in the secrets; it doesn't matter,

We have production and beta instances of the app, so I'm not sure about dealing with secrets to store the names. Think through how this will be configured and let me know what you think makes sense.

including "gold star" responses to questions.

I guess it's "gold standard"; I just messed up the terminology when I called it "gold star".

Collaborator Author


> We have production and beta instances of the app, so I'm not sure about dealing with secrets to store the names. Think through how this will be configured and let me know what you think makes sense.

I think it would be better if we could just hard-code the names and keep them consistent in the workflow.

> I guess it's "gold standard"; I just messed up the terminology when I called it "gold star".

No, you were correct—it's called a "gold star response"[^1]. "Gold standard" is a different approach, but not the one we're discussing here.

[^1]: https://arxiv.org/html/2410.23214v1

Member


Hard-coding seems questionable for developers, but generally yes, I agree that it's easier to deal with than secrets. @gentlementlegen please decide.

Collaborator Author


I think this is no longer relevant, as we are using the LLM command router.

```typescript
  openai: OpenAI;
}

export const initAdapters = (context: Context, clients: EvalClients): Context => {
```
Member


This shouldn't have passed the linter settings. Make sure you run `yarn format` in case the git hooks aren't working. We don't generally allow arrow functions unless they are used in some tiny callback context.

```typescript
});
formattedChat += SEPERATOR;
//Iterate through the ground truths and add it to the final formatted chat
formattedChat += "#################### Ground Truths ####################\n";
```
Member


Are this many hashtags really necessary? I thought plain markdown syntax would be sufficient.

Collaborator Author


[Screenshot: Braintrust context viewer]

It's not necessary, but it helps a lot when looking through the context in the Braintrust context viewer. This formatting is only used for the context stored in Braintrust; otherwise, the flow uses the regular formatted chat history.

```typescript
config: {
  model: "gpt-4o",
  similarityThreshold: 0.8,
  maxTokens: 1000,
```
Member


Why did you set such a low max? Also should we really be using GPT?

Collaborator Author


The expected answer should be short rather than long; raising the max would just make the model ramble and drift off topic. For the model, I think it should be gpt-4o because it's relatively stable right now. I tried the same with o1-mini and o1-preview, but each test run gave very different results, and those models are also more expensive, using about 75,000 input tokens on average.
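
As an illustration of how the `similarityThreshold` could gate a result, here is a hypothetical helper built on the `autoevals` Levenshtein scorer; it is not the PR's actual implementation:

```typescript
// Hypothetical helper: turn a Levenshtein similarity score into a pass/fail
// signal using the configured threshold. Names here are illustrative.
import { Levenshtein } from "autoevals";

const SIMILARITY_THRESHOLD = 0.8;

async function meetsThreshold(output: string, expected: string): Promise<boolean> {
  const result = await Levenshtein({ output, expected });
  // Levenshtein returns a score in [0, 1]; treat a missing score as a failure.
  return (result.score ?? 0) >= SIMILARITY_THRESHOLD;
}
```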

Member

@0x4007 0x4007 Dec 13, 2024


There's also o1 now to consider using.

To be frank, I'm not super concerned with costs because we don't even use this feature every day, but it can be tremendously valuable to automatically populate all the context and ask a question conveniently, particularly from mobile, which is my usual client.

If it's providing high-quality answers, it's okay to spend a bit more in exchange for all the time savings.

However, obviously, if the difference is marginal then we can optimize and use the cheaper models.

evals/llm.eval.ts (outdated review thread, resolved)
Member


Why are we using two different models?

Collaborator Author


Ground-truth generation is simple enough to be handled by gpt-4o or even gpt-4o-mini. I don't think it would be productive to use a much more powerful model for such a simple task.

@0x4007
Member

0x4007 commented Dec 11, 2024

I wonder if we should just migrate everything that needs a large context window to Claude at this point.

@sshivaditya2019
Collaborator Author

> I'm wondering if we should just move everything that requires a large context window to Claude at this point.

It should be pretty simple; we just need to update the model parameter in the completion calls since we're already using OpenRouter.
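
For illustration, a minimal sketch of what that swap looks like when the OpenAI SDK is pointed at OpenRouter; the client setup, helper, and model IDs are assumptions rather than the plugin's exact code:

```typescript
// Illustrative sketch: with the OpenAI SDK pointed at OpenRouter, switching
// providers is just a change to the `model` string. Not the plugin's actual code.
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://openrouter.ai/api/v1",
  apiKey: process.env.OPENROUTER_API_KEY,
});

async function complete(prompt: string, model = "openai/gpt-4o"): Promise<string> {
  // e.g. pass model = "anthropic/claude-3.5-sonnet" to route to Claude instead.
  const response = await client.chat.completions.create({
    model,
    messages: [{ role: "user", content: prompt }],
  });
  return response.choices[0]?.message?.content ?? "";
}
```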

@0x4007 0x4007 requested a review from rndquu December 13, 2024 00:19
@0x4007
Member

0x4007 commented Dec 13, 2024

Is this good to go?

@sshivaditya2019
Collaborator Author

sshivaditya2019 commented Dec 13, 2024

> Is this good to go?

It is. I've also added the BRAINTRUST_API_KEY to the repo secrets.

@rndquu
Member

rndquu commented Dec 13, 2024

> Is this good to go?

If you mean the CI question, yes, it's good to go.

@sshivaditya2019 sshivaditya2019 merged commit a3814c8 into ubiquity-os-marketplace:development Dec 14, 2024
2 checks passed