
Evals #37

Merged

Conversation

sshivaditya2019
Collaborator

Resolves #185

  • Adds evals for command-ask using the Levenshtein score (see the sketch below).
  • New GitHub Action for running the evals.
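
For context, here is a minimal sketch of how a Levenshtein-scored eval can be wired up with Braintrust's `Eval` runner and the `autoevals` scorers. The project name, sample data, and `askQuestion` helper are illustrative assumptions, not necessarily this PR's exact code:

```typescript
// Illustrative sketch only: assumes the "braintrust" SDK and "autoevals" scorers
// are installed and that BRAINTRUST_API_KEY is set in the environment.
import { Eval } from "braintrust";
import { Levenshtein } from "autoevals";

// Hypothetical wrapper; the real task would invoke the command-ask flow.
async function askQuestion(question: string): Promise<string> {
  return `Answer for: ${question}`;
}

Eval("command-ask", {
  // Each datum pairs a question with a "gold star" expected answer.
  data: () => [
    {
      input: "Could you please provide a summary of the issue?",
      expected: "The manifest name must match the plugin repository name.",
    },
  ],
  task: async (input) => askQuestion(input),
  // Levenshtein scores the output against the expected answer on a 0..1 scale.
  scores: [Levenshtein],
});
```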

Contributor

github-actions bot commented Nov 21, 2024

Unused devDependencies (1)

| Filename | devDependencies |
| -------- | --------------- |
| package.json | ts-node |

@sshivaditya2019
Collaborator Author

@0x4007 Evaluations cannot be configured as action requests since repository secrets cannot be accessed in pull requests (without a pull request workflow trigger, which I think is not safe). I have currently implemented Levenshtein and ContextRelevance metrics. Are there any other metrics that should be added to the evals?

@0x4007
Member

0x4007 commented Nov 23, 2024

@rndquu knows the solution for secrets and pull requests, I think. I'm not sure because I've never implemented evals before. I suppose you might know best offhand, without research.


@sshivaditya2019, this task has been idle for a while. Please provide an update.

@0x4007
Member

0x4007 commented Nov 28, 2024

@rndquu

@rndquu
Member

rndquu commented Nov 28, 2024

@sshivaditya2019 Regarding the evals-testing.yml workflow.

Yes, GitHub secrets can't be accessed in workflows triggered from forks by the pull_request trigger.

What to do:

  1. I suppose UBIQUITY_OS_APP_NAME can be hardcoded
  2. SUPABASE_URL and SUPABASE_KEY can also be hardcoded (example)
  3. Refactor the evals-testing.yml workflow to run on workflow_run (example). This way, when the knip workflow finishes, evals-testing.yml will run in a privileged context (i.e. with access to GitHub secrets in a safe way).

```diff
@@ -19,7 +17,7 @@ jobs:
       VOYAGEAI_API_KEY: ${{ secrets.VOYAGEAI_API_KEY }}
       OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
       OPENROUTER_API_KEY: ${{ secrets.OPENROUTER_API_KEY }}
-      UBIQUITY_OS_APP_NAME: "ubiquity-agent" # Hardcoded value
+      UBIQUITY_OS_APP_NAME: "ubiquity-agent"
```
Member


I assume ubiquity-agent is for testing purposes only.

Collaborator Author


Yes, I'll revert the changes. I am trying to make some examples for QA.

@sshivaditya2019
Collaborator Author

sshivaditya2019 commented Dec 10, 2024

This comment describes the solution.

Change these lines to:

```yaml
on:
  workflow_run:
    workflows: ["Knip"]
    types:
      - completed
```
and all the secrets will be available on a PR from a fork in a safe way.

This will work once deployed. I'm trying to make some examples for QA and will try running this locally. I'll update the changes based on this.

@sshivaditya2019
Collaborator Author

QA:

[Screenshot: Summary #1]

[Screenshot: Summary #2]

@sshivaditya2019 sshivaditya2019 marked this pull request as ready for review December 11, 2024 00:36
@sshivaditya2019
Collaborator Author

@0x4007 Should this workflow be triggered only by codebase contributors? It costs around $0.20 per run on average, so it could get expensive quickly after a few executions.

@0x4007
Member

0x4007 commented Dec 11, 2024

> @0x4007 Should this workflow be triggered only by codebase contributors? It costs around $0.20 per run on average, so it could get expensive quickly after a few executions.

I'm not sure what makes the most sense here. I think the safest option is for collaborators to run this manually, but I also wonder if we could run it automatically just on merges to main?

@sshivaditya2019
Collaborator Author

sshivaditya2019 commented Dec 11, 2024

> I'm not sure what makes the most sense here. I think the safest option is for collaborators to run this manually, but I also wonder if we could run it automatically just on merges to main?

We could either run this automatically on merges to the main branch, or trigger it after a PR receives 2 approvals for merging.

"body": "Manifests need to be updated so the name matches the intended name, which is the name of the repo it lives in.\n\nAny mismatch in manifest.name and the plugin repo, and we will not be able to install those plugins. The config will look like this:\n\nThis is because the worker URL contains the repo name, and we use that to match against manifest.name.",
"number": 27,
"html_url": "https://github.com/ubiquity-os/ubiquity-os-plugin-installer/issues/27/",
"question": "@ubosshivaditya could you please provide a summary of the issue ?"
Member


Will it be a problem to leave in @ubosshivaditya?

Also, this seems like a random example; can you explain the context of this file further?

Collaborator Author


We can modify the app name to whatever is stored in the secrets; it doesn't matter, as the askQuestion function will be triggered either way.

This file primarily contains solid baseline examples, including "gold star" responses to questions. We run the model with the same context and should expect similar performance.
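
For illustration, a hypothetical TypeScript shape for one of these baseline records, inferred from the JSON fields quoted above (any field beyond those four is an assumption):

```typescript
// Hypothetical record shape for a baseline ("gold star") example, inferred from
// the JSON snippet above; the actual file may differ.
interface GoldStarExample {
  body: string; // issue body used as context for the question
  number: number; // issue number the question refers to
  html_url: string; // link back to the source issue
  question: string; // question posed to command-ask
  expected?: string; // assumed: the gold star answer the eval compares against
}
```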

Member

@0x4007 0x4007 Dec 13, 2024


> We can modify the app name to whatever is stored in the secrets; it doesn't matter,

We have production and beta instances of the app, so I'm not sure about dealing with secrets to store the names. Think through how this will be configured and let me know what you think makes sense.

including "gold star" responses to questions.

I guess it's "gold standard"; I just messed up the terminology when I called it "gold star".

Collaborator Author


> We have production and beta instances of the app, so I'm not sure about dealing with secrets to store the names. Think through how this will be configured and let me know what you think makes sense.

I think it would be better if we could just hard-code the names and keep them consistent in the workflow.

> I guess it's "gold standard"; I just messed up the terminology when I called it "gold star".

No, you were correct—it's called a "gold star response"[^1]. "Gold standard" is a different approach, but not the one we're discussing here.

[^1]: https://arxiv.org/html/2410.23214v1

Member


Hard-coding seems questionable for developers, but generally yes, I agree that it's easier to deal with than secrets. @gentlementlegen please decide.

Collaborator Author


I think this is no longer relevant, as we are using the LLM command router.

```typescript
  openai: OpenAI;
}

export const initAdapters = (context: Context, clients: EvalClients): Context => {
```
Member


This shouldn't have passed the linter settings. Make sure you run `yarn format` in case the git hooks aren't working. We don't generally allow arrow functions unless they are used in some tiny callback context.

```typescript
});
formattedChat += SEPERATOR;
//Iterate through the ground truths and add it to the final formatted chat
formattedChat += "#################### Ground Truths ####################\n";
```
Member


Are this many hashtags really necessary? I thought plain markdown syntax would be sufficient.

Collaborator Author


[Screenshot: Braintrust context viewer]

It's not necessary, but it helps a lot when looking through the context in the Braintrust context viewer. This formatting is only used for the context stored in Braintrust; otherwise, the flow uses the regular formatted chat history.

```typescript
config: {
  model: "gpt-4o",
  similarityThreshold: 0.8,
  maxTokens: 1000,
```
Member


Why did you set such a low max? Also should we really be using GPT?

Collaborator Author


The expected answer should be short rather than long; raising the max would just make the model ramble and drift off topic. For the model, I think it should be gpt-4o because it's relatively stable right now. I tried the same with o1-mini and o1-preview, but each test run gave very different results, and those models are also more expensive, using about 75,000 input tokens on average.
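
As an illustration of how the `similarityThreshold` could gate a result, here is a hypothetical helper built on the `autoevals` Levenshtein scorer; it is not the PR's actual implementation:

```typescript
// Hypothetical helper: turn a Levenshtein similarity score into a pass/fail
// signal using the configured threshold. Names here are illustrative.
import { Levenshtein } from "autoevals";

const SIMILARITY_THRESHOLD = 0.8;

async function meetsThreshold(output: string, expected: string): Promise<boolean> {
  const result = await Levenshtein({ output, expected });
  // Levenshtein returns a score in [0, 1]; treat a missing score as a failure.
  return (result.score ?? 0) >= SIMILARITY_THRESHOLD;
}
```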

Member

@0x4007 0x4007 Dec 13, 2024


There's also o1 now to consider using.

To be frank, I'm not super concerned with costs because we don't even use this feature every day, but it can be tremendously valuable to automatically populate all the context and ask a question conveniently, particularly from mobile, which is my usual client.

If it's providing high-quality answers, it's okay to spend a bit more in exchange for all the time savings.

However, obviously, if the difference is marginal then we can optimize and use the cheaper models.

evals/llm.eval.ts (outdated review thread, resolved)
Member


Why are we using two different models?

Collaborator Author


Ground-truth generation is simple enough to be handled by gpt-4o or even gpt-4o-mini. I don't think it would be productive to use a much more powerful model for such a simple task.

@0x4007
Member

0x4007 commented Dec 11, 2024

I wonder if we should just migrate everything that needs a large context window to Claude at this point.

@sshivaditya2019
Collaborator Author

> I'm wondering if we should just move everything that requires a large context window to Claude at this point.

It should be pretty simple; we just need to update the model parameter in the completion calls since we're already using OpenRouter.
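
For illustration, a minimal sketch of what that swap looks like when the OpenAI SDK is pointed at OpenRouter; the client setup, helper, and model IDs are assumptions rather than the plugin's exact code:

```typescript
// Illustrative sketch: with the OpenAI SDK pointed at OpenRouter, switching
// providers is just a change to the `model` string. Not the plugin's actual code.
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://openrouter.ai/api/v1",
  apiKey: process.env.OPENROUTER_API_KEY,
});

async function complete(prompt: string, model = "openai/gpt-4o"): Promise<string> {
  // e.g. pass model = "anthropic/claude-3.5-sonnet" to route to Claude instead.
  const response = await client.chat.completions.create({
    model,
    messages: [{ role: "user", content: prompt }],
  });
  return response.choices[0]?.message?.content ?? "";
}
```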

@0x4007 0x4007 requested a review from rndquu December 13, 2024 00:19
@0x4007
Member

0x4007 commented Dec 13, 2024

Is this good to go?

@sshivaditya2019
Collaborator Author

sshivaditya2019 commented Dec 13, 2024

> Is this good to go?

It is. I've also added the BRAINTRUST_API_KEY to the repo secrets.

@rndquu
Member

rndquu commented Dec 13, 2024

> Is this good to go?

If you mean the CI question, yes, it's good to go.

@sshivaditya2019 sshivaditya2019 merged commit a3814c8 into ubiquity-os-marketplace:development Dec 14, 2024
2 checks passed