
Improvements to Evals #851

Draft · wants to merge 13 commits into main
Conversation

@bborn (Contributor) commented on Oct 21, 2024

Just exploring some ideas:

  • Should be able to run evals against a full dataset, instead of one item at a time (see the sketch after this list)
  • Should be able to run multiple evaluators at once
  • Add a Regex evaluator
  • Add a cosine similarity evaluator
  • Add an LLM evaluator
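
A rough sketch of how running multiple evaluators over a JSONL dataset could look. The dataset field names and the score signature are assumptions for illustration; the evaluator class names follow the proposal later in this thread and may not match the final code.

require "json"

# Illustrative only: iterate a JSONL dataset and apply every evaluator
# to each item, collecting per-item scores.
dataset = File.readlines("/path/to/dataset.jsonl", chomp: true).map { |line| JSON.parse(line) }

evaluators = [
    Langchain::Evals::LLM::LLM.new(llm: llm),
    Langchain::Evals::LLM::CosineSimilarity.new(llm: llm)
]

results = dataset.map do |item|
  scores = evaluators.to_h do |evaluator|
    [evaluator.class.name, evaluator.score(answer: item["answer"], expected_answer: item["expected_answer"])]
  end
  item.merge("scores" => scores)
end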

@andreibondarev (Collaborator) commented
@bborn How useful is the regex evaluator in your opinion?

Do you have any thoughts on evals that calculate vector distance?

@bborn (Contributor, Author) commented on Oct 21, 2024

@andreibondarev I think regex or other string comparison is pretty important. You might want to ensure your model/agent is returning a number, a URL, an email, etc. Or you might want to ensure that the answer contains the expected output string (this doesn't exist in Langchain.rb yet), something like:

regex_evaluator.score(answer: "The answer to 2 + 2 is 4", expected_answer: "4")
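
A minimal sketch of what such a code-graded regex evaluator could look like (the class name and the score API here are assumptions for illustration, not the merged implementation):

# Hypothetical regex / string-containment evaluator.
# Returns 1 if the expected answer, treated as a pattern, matches the
# generated answer, and 0 otherwise.
class RegexEvaluator
  def score(answer:, expected_answer:)
    Regexp.new(expected_answer).match?(answer) ? 1 : 0
  end
end

RegexEvaluator.new.score(answer: "The answer to 2 + 2 is 4", expected_answer: "4") # => 1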

Vector (or levenshtein, etc.) distance seems useful too. Not so much as an absolute score but as something you could look at over time (if our agent was getting a vector score of .75 for the last three months, and then we changed the prompt and now it's getting .45, we'd be concerned).
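
For reference, a bare-bones cosine similarity over embedding vectors. This assumes an llm object whose embed call returns a plain array of floats, which may not match the actual Langchain.rb response shape.

# Cosine similarity between two embedding vectors (arrays of floats).
def cosine_similarity(a, b)
  dot    = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  dot / (norm_a * norm_b)
end

# Assumed embedding calls; the real API may wrap the vector in a response object.
answer_vector   = llm.embed(text: "The answer to 2 + 2 is 4")
expected_vector = llm.embed(text: "4")
cosine_similarity(answer_vector, expected_vector) # e.g. 0.75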

I think the evaluators kind of break down into LLM Graded, LLM Labeled, and Code Graded:

[Screenshot: diagram of the LLM Graded / LLM Labeled / Code Graded evaluator categories]

LLM Graded: ask another LLM to score the dataset item based on some criteria
LLM Labeled: ask an LLM to label the dataset item
Code Graded: run the dataset item through some grading algorithm (regex, JSON, or other)
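
As a concrete illustration of the LLM Graded bucket, a hedged sketch that asks a grader LLM for a numeric score. The prompt wording, criterion parameter, and completion parsing are assumptions; it only relies on the llm.complete(prompt:) call already shown in this thread.

# Hypothetical LLM-graded evaluator: ask another LLM to rate the answer
# against a criterion on a 0-10 scale and normalize to 0.0-1.0.
class LLMGradedEvaluator
  def initialize(llm:, criterion:)
    @llm = llm
    @criterion = criterion
  end

  def score(question:, answer:)
    prompt = <<~PROMPT
      Rate the following answer from 0 to 10 for: #{@criterion}.
      Question: #{question}
      Answer: #{answer}
      Reply with only the number.
    PROMPT
    @llm.complete(prompt: prompt).completion.to_f / 10.0
  end
end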

bborn changed the title from "First pass at evaluating a dataset with multiple evaluators" to "Improvements to Evals" on Oct 21, 2024
bborn closed this on Oct 21, 2024
bborn reopened this on Oct 22, 2024
@bborn (Contributor, Author) commented on Oct 22, 2024

Another thought: maybe you should be able to add an eval to your Agent or LLM call like this:

dataset = "/path/to/dataset.jsonl"

evaluators = [
    Langchain::Evals::LLM::LLM.new(llm: llm),
    Langchain::Evals::LLM::CosineSimilarity.new(llm: llm)
]

eval_service = EvalService.new(evaluators, dataset)

response = llm.complete(prompt: "Once upon a time", evaluate_with: eval_service)

By default this would store eval results in a CSV (could be anything, sqlite, whatever) in the same location as the dataset.
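
For instance, one illustrative way to append results to a CSV next to the dataset; the file naming, columns, and variable names (prompt, completion_text, evaluator_name, score) are made up for the sketch.

require "csv"
require "time"

# Hypothetical: write one row per evaluator result alongside the dataset file.
results_path = dataset.sub(/\.jsonl\z/, "_eval_results.csv")

CSV.open(results_path, "a") do |csv|
  csv << [Time.now.iso8601, prompt, completion_text, evaluator_name, score]
end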

Another idea would be the ability to log the completion results to the dataset before evaluating them (e.g. if you don't already have a dataset):

dataset = "/path/to/dataset.jsonl"

evaluators = [
    Langchain::Evals::LLM::LLM.new(llm: llm),
    Langchain::Evals::LLM::CosineSimilarity.new(llm: llm)
]

options = {
    log_completions: true,
    log_rate: 0.5    # log 50% of completions
}

eval_service = EvalService.new(evaluators, dataset, options)

response = llm.complete(prompt: "Once upon a time", evaluate_with: eval_service)
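
One possible reading of the log_rate option, sketched as a sampling check inside such a service; the behavior and method name are assumptions, not what the PR implements.

require "json"

# Hypothetical: append a completion to the JSONL dataset only when a
# random draw falls under log_rate (0.5 => roughly half of completions).
def maybe_log_completion(dataset_path, prompt, answer, log_rate)
  return unless rand < log_rate

  File.open(dataset_path, "a") do |f|
    f.puts({ prompt: prompt, answer: answer }.to_json)
  end
end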
