
Improvements to Evals #851

Draft · wants to merge 13 commits into main
Conversation

@bborn (Contributor) commented on Oct 21, 2024

Just exploring some ideas:

  • Should be able to run evals against a full dataset, instead of one item at a time (see the sketch after this list)
  • Should be able to run multiple evaluators at once
  • Add a Regex evaluator
  • Add a cosine similarity evaluator
  • Add an LLM evaluator
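
A rough sketch of how running multiple evaluators over a JSONL dataset could look. The dataset field names and the score signature are assumptions for illustration; the evaluator class names follow the proposal later in this thread and may not match the final code.

require "json"

# Illustrative only: iterate a JSONL dataset and apply every evaluator
# to each item, collecting per-item scores.
dataset = File.readlines("/path/to/dataset.jsonl", chomp: true).map { |line| JSON.parse(line) }

evaluators = [
    Langchain::Evals::LLM::LLM.new(llm: llm),
    Langchain::Evals::LLM::CosineSimilarity.new(llm: llm)
]

results = dataset.map do |item|
  scores = evaluators.to_h do |evaluator|
    [evaluator.class.name, evaluator.score(answer: item["answer"], expected_answer: item["expected_answer"])]
  end
  item.merge("scores" => scores)
end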

@andreibondarev (Collaborator) commented
@bborn How useful is the regex evaluator in your opinion?

Do you have any thoughts on evals that calculate vector distance?

@bborn (Contributor, Author) commented on Oct 21, 2024

@andreibondarev I think regex or other string comparison is pretty important. You might want to ensure your model/agent is returning a number, a URL, an email, etc. Or you might want to ensure that the answer contains the expected output string (this doesn't exist in Langchain.rb yet), something like:

regex_evaluator.score(answer: "The answer to 2 + 2 is 4", expected_answer: "4")
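
A minimal sketch of what such a code-graded regex evaluator could look like (the class name and the score API here are assumptions for illustration, not the merged implementation):

# Hypothetical regex / string-containment evaluator.
# Returns 1 if the expected answer, treated as a pattern, matches the
# generated answer, and 0 otherwise.
class RegexEvaluator
  def score(answer:, expected_answer:)
    Regexp.new(expected_answer).match?(answer) ? 1 : 0
  end
end

RegexEvaluator.new.score(answer: "The answer to 2 + 2 is 4", expected_answer: "4") # => 1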

Vector (or levenshtein, etc.) distance seems useful too. Not so much as an absolute score but as something you could look at over time (if our agent was getting a vector score of .75 for the last three months, and then we changed the prompt and now it's getting .45, we'd be concerned).
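
For reference, a bare-bones cosine similarity over embedding vectors. This assumes an llm object whose embed call returns a plain array of floats, which may not match the actual Langchain.rb response shape.

# Cosine similarity between two embedding vectors (arrays of floats).
def cosine_similarity(a, b)
  dot    = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  dot / (norm_a * norm_b)
end

# Assumed embedding calls; the real API may wrap the vector in a response object.
answer_vector   = llm.embed(text: "The answer to 2 + 2 is 4")
expected_vector = llm.embed(text: "4")
cosine_similarity(answer_vector, expected_vector) # e.g. 0.75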

I think the evaluators kind of break down into LLM Graded, LLM Labeled, and Code Graded:

[Screenshot: diagram of the LLM Graded / LLM Labeled / Code Graded evaluator categories]

LLM Graded: ask another LLM to score the dataset item based on some criteria
LLM Labeled: ask an LLM to label the dataset item
Code Graded: run the dataset item through some grading algorithm (regex, JSON, or other)
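
As a concrete illustration of the LLM Graded bucket, a hedged sketch that asks a grader LLM for a numeric score. The prompt wording, criterion parameter, and completion parsing are assumptions; it only relies on the llm.complete(prompt:) call already shown in this thread.

# Hypothetical LLM-graded evaluator: ask another LLM to rate the answer
# against a criterion on a 0-10 scale and normalize to 0.0-1.0.
class LLMGradedEvaluator
  def initialize(llm:, criterion:)
    @llm = llm
    @criterion = criterion
  end

  def score(question:, answer:)
    prompt = <<~PROMPT
      Rate the following answer from 0 to 10 for: #{@criterion}.
      Question: #{question}
      Answer: #{answer}
      Reply with only the number.
    PROMPT
    @llm.complete(prompt: prompt).completion.to_f / 10.0
  end
end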

bborn changed the title from "First pass at evaluating a dataset with multiple evaluators" to "Improvements to Evals" on Oct 21, 2024
bborn closed this on Oct 21, 2024
bborn reopened this on Oct 22, 2024
@bborn (Contributor, Author) commented on Oct 22, 2024

Another thought: maybe you should be able to add an eval to your Agent or LLM call like this:

dataset = "/path/to/dataset.jsonl"

evaluators = [
    Langchain::Evals::LLM::LLM.new(llm: llm),
    Langchain::Evals::LLM::CosineSimilarity.new(llm: llm)
]

eval_service = EvalService.new(evaluators, dataset)

response = llm.complete(prompt: "Once upon a time", evaluate_with: eval_service)

By default this would store eval results in a CSV (could be anything, sqlite, whatever) in the same location as the dataset.
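
For instance, one illustrative way to append results to a CSV next to the dataset; the file naming, columns, and variable names (prompt, completion_text, evaluator_name, score) are made up for the sketch.

require "csv"
require "time"

# Hypothetical: write one row per evaluator result alongside the dataset file.
results_path = dataset.sub(/\.jsonl\z/, "_eval_results.csv")

CSV.open(results_path, "a") do |csv|
  csv << [Time.now.iso8601, prompt, completion_text, evaluator_name, score]
end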

Another idea would be the ability to log the completion results to the dataset before evaluating them (e.g. if you don't already have a dataset):

dataset = "/path/to/dataset.jsonl"

evaluators = [
    Langchain::Evals::LLM::LLM.new(llm: llm),
    Langchain::Evals::LLM::CosineSimilarity.new(llm: llm)
]

options = {
    log_completions: true,
    log_rate: 0.5    # log 50% of completions
}

eval_service = EvalService.new(evaluators, dataset, options)

response = llm.complete(prompt: "Once upon a time", evaluate_with: eval_service)
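
One possible reading of the log_rate option, sketched as a sampling check inside such a service; the behavior and method name are assumptions, not what the PR implements.

require "json"

# Hypothetical: append a completion to the JSONL dataset only when a
# random draw falls under log_rate (0.5 => roughly half of completions).
def maybe_log_completion(dataset_path, prompt, answer, log_rate)
  return unless rand < log_rate

  File.open(dataset_path, "a") do |f|
    f.puts({ prompt: prompt, answer: answer }.to_json)
  end
end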
