Fix evaluation PR #1711

Merged · 22 commits · May 30, 2024
Conversation

@mmabrouk (Member) commented May 28, 2024

I've significantly revised the evaluation PR, especially our handling of the correct_answer_key as a string or list, which was problematic. The following changes have been made:

  • I've changed our approach to the correct answer, offering a universal solution. Now each evaluator can have advanced settings, such as correct_answer_key. Ground truth keys that should appear in the evaluation result are marked in the configuration (see the sketch after this list).
  • I've enhanced the security of the code so that we no longer execute a function whose name is taken directly from the frontend.
  • I've replaced LangChain with OpenAI for the AI critic. The versions we were using were outdated and incompatible with Ragas and other features, and LangChain was not suitable for our needs.
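
For illustration, a settings entry under this approach could look roughly like the sketch below. This is a minimal sketch: the field names advanced and ground_truth_key follow the example discussed later in this thread, but the exact schema may differ.

    "settings_template": {
        "correct_answer_key": {
            "label": "Correct Answer",
            "default": "correct_answer",
            "type": "string",
            "advanced": True,           # shown under the evaluator's advanced settings
            "ground_truth_key": True,   # marks the column to surface in the evaluation result
            "description": "The name of the column in the test data that contains the correct answer",
        },
    },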


Commit: Removed langchain and use openai directly
@mmabrouk (Member Author) commented

@aakrem can you please review the general logic?
@aybruhm can you please review the changes to the tests and see whether they make sense?
@bekossy can you please QA the evaluation? I QA'ed it myself, but between my changes and Akrem's a lot has been done. Please try all the evaluators and see whether they work, then try again with a different correct_answer. Thank you!

@mmabrouk (Member Author) commented

@bekossy OSS and cloud.beta both point to this PR for now

@aakrem (Collaborator) left a comment

Thanks for the PR @mmabrouk

  • I like the idea of introducing the advanced key, it frees up a lot of logic in the frontend.
  • I also like a lot of the refactoring you did.
  • Regarding the refactor in evaluators.py, tbh, I prefer the initial implementation with correct_answer_keys: it groups all possible values together, and I find it abstracts more things.

I also would love to know how we are planning to define a Ragas evaluator. Is it going to be something like:

    "settings_template": {
      >>>  "correct_answer_keyS": {
            "label": "Correct Answers",
            "default": "correct_answers",
        >>>    "type": "array",
            "advanced": True,
          >>>  "ground_truth_keyS": True,
            "description": "The name of the column in the test data that contains the correct answer",
        },
    },
  • We are mixing the terms correct_answer and ground_truths. I tried to unify them but gave up, as they are used everywhere. Let's either use the same terminology now, or refactor everything to ground truth together with the schema at a later stage.

@mmabrouk (Member Author) commented

Regarding the refactor in evaluators.py, tbh, I prefer the initial implementation with correct_answer_keys: it groups all possible values together, and I find it abstracts more things.

I don't agree at all. 1) I think the previous implementation does not even work in the case of multiple ground truths. 2) This is a bad abstraction.

A good abstraction encapsulates a concept and conceals its details; as a result, it ends up simplifying the code. It does this because it mirrors the reality of the problem. That's not the case here, and you can see it by looking at the code and how we handle the correct_answer_keys:

  1. First, let's look at the code in the frontend (which does not make any sense to me). We handle the correct_answer_keys as a string in the UI yet as an array in the code. We end up having custom code that converts between string and array. The reason is that the abstraction is not aligned with reality. There is nothing in reality that is a list of correct_answer_keys; it's just a random encapsulation that has been created to put things together. In reality, the fact of being a ground truth key is just a property of the setting.

     [screenshot of the frontend code]

  2. In the evaluator functions, we take the first value of a list of keys. This isn't clean code: it unnecessarily complicates the code (now we have to get the first element of the list) without providing inherent meaning. For example, if you read the function on its own, you might wonder: why are the correct_answer_keys a list? Why are we ignoring the rest? What's special about the first element? These questions indicate a problem. Another red flag is when you find yourself writing a constant in the code, like correct_answers[0]. It's a sign that the abstraction might be flawed. It's a good habit to 1) read functions by themselves and see if they make sense, and 2) check whether you have any constants in the code or a seemingly random instruction that does not fit any business logic (see the sketch after this list).

     [screenshot of the evaluator function]
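
To make the red flag concrete, here is a minimal sketch (hypothetical function and variable names, not the actual agenta code) contrasting the list-based access with a single-key setting:

    # List-based abstraction: the constant index [0] is the red flag.
    # Read on its own, the function cannot explain why a list, or why only the first element matters.
    def evaluate_with_list(settings, data_point, output):
        correct_answer = data_point[settings["correct_answer_keys"][0]]
        return output == correct_answer

    # Single-key setting: the name says exactly what the value means.
    def evaluate_with_key(settings, data_point, output):
        correct_answer = data_point[settings["correct_answer_key"]]
        return output == correct_answer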

Now, why is the abstraction that you have used bad? The data structure you used does not mirror reality: a list means that all elements are interchangeable and have the same meaning. This is not the case here. Let's assume we have an evaluator that uses two columns, context and correct_answer. Obviously it needs to know which is which to function properly. How do you even define this when you have a list? Do you expect the user to provide these in some specific order?

In conclusion, while grouping values together is often a feature of a good abstraction, it doesn't necessarily make an abstraction good. A good abstraction reflects the problem at hand and simplifies its complexity.

I hope this helps. I strongly recommend to everyone reading (or even skimming) the book Clean Code; I think it is very helpful.

I also would love to know how we are planning to define a Ragas evaluator.

    "settings_template": {
        "context_key": {
            "label": "Column Containing Contexts",
            "default": "contexts",
            "type": "string"
            "advanced": True,
            "ground_truth_key": True,
            "description": "The name of the column in the test data that contains the contexts",
        },
    },

Not sure why you were under the impression here that we would be saving each of the contexts in a different column and providing a list of column names. That would not work, since the number of contexts might change from one case to another.

@mmabrouk (Member Author) commented

Just a small addition here. The right abstraction, that is, the abstraction that mirrors reality here, is that a value in the test set can be not only a string but also a list. Using this abstraction would remove any future complexity we would otherwise add when using Ragas (for instance, parsing a JSON). A sketch of what this could look like follows.
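
For illustration only (the helper name and the cell format are assumptions, not the actual implementation), reading a test-set value under this abstraction could look like:

    import json

    def read_cell(value):
        """Return a test-set cell as either a plain string or a list (e.g. of contexts)."""
        try:
            parsed = json.loads(value)  # a cell may store a JSON list, e.g. '["ctx 1", "ctx 2"]'
            if isinstance(parsed, list):
                return parsed
        except (json.JSONDecodeError, TypeError):
            pass
        return value  # otherwise treat the cell as a plain string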

@mmabrouk (Member Author) commented

We are mixing the terms correct_answer and ground_truths. I tried to unify them but gave up, as they are used everywhere. Let's either use the same terminology now, or refactor everything to ground truth together with the schema at a later stage.

Actually, the issue is that ground truth is not the right word here. For instance, the context in the case of Ragas is not a ground truth. Maybe a better term is "evaluation context", not sure.

@mmabrouk (Member Author) commented

@aybruhm @aakrem @bekossy Please review and approve (or request changes) when you are done, so we know whether we can merge.

@mmabrouk requested a review from aakrem on May 29, 2024 10:13
@aakrem (Collaborator) commented May 29, 2024

@mmabrouk

First, let's look at the code in the frontend (which does not make any sense to me). We handle the correct_answer_keys as a string in the UI yet as an array in the code.

In the frontend it's also considered a list.
https://github.com/Agenta-AI/agenta/pull/1606/files#diff-a5b1a701e0665f6eb124977e2185fc01a12aaf7bf65ddb00e4b61c1e9c41ab41R7

Regarding the UI, we still don't even have a design for an evaluator containing a list of correct answers.
I don't understand the concern there.

There is nothing in reality that is a list of correct_answer_keys

Regarding all your "reality" points, the reality as I see it is that evaluators can have a single correct answer or multiple ones.
Some will use only one, some will use many.

I think the original implementation focused a lot on making things ready to integrate evaluators with multiple correct answers.

It's still not clear how you are planning to create an evaluator with multiple correct answers with this implementation.

Can you simply give an example?

@mmabrouk (Member Author) commented May 30, 2024

@aakrem

@mmabrouk

First, let's look at the code in the frontend (which does not make any sense to me). We handle the correct_answer_keys as a string in the UI yet as an array in the code.

In the frontend it's also considered a list. https://github.com/Agenta-AI/agenta/pull/1606/files#diff-a5b1a701e0665f6eb124977e2185fc01a12aaf7bf65ddb00e4b61c1e9c41ab41R7

Which we set as a string.

Regarding the UI, we still don't even have a design for an evaluator containing a list of correct answers. I don't understand the concern there.

I think you don't understand the problem...

There is nothing in reality that is a list of correct_answer_keys

Regarding all your "reality" points, the reality as I see it is that evaluators can have a single correct answer or multiple ones. Some will use only one, some will use many.

No, an evaluator can have multiple columns, each with a different meaning: one for context, one for correct_answer. Not multiple columns with the same meaning (correct_answer_1, correct_answer_2). Each column contains one type of information.

I think the original implementation focused a lot on making things ready to integrate evaluators with multiple correct answers.

Again, I think you did not understand the problem.

It's still not clear how you are planning to create an evaluator with multiple correct answers with this implementation.

Can you simply give an example?

I gave one in my answer: the Ragas example. There the evaluator needs access to a list of contexts. This list can be of different sizes, so it does not make any sense to have each context in a different column. The list of contexts should be in one column (see the sketch below).

I hope this helps you understand the user story. If there are no other comments besides this one, let's merge.
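
To make the user story concrete, a hypothetical test-set row for such an evaluator could look like this (column names are illustrative, not the actual schema):

    {
        "question": "What is the capital of France?",
        "contexts": '["Paris is the capital of France.", "France is in Europe."]',  # one column holding a JSON list of contexts
        "correct_answer": "Paris",
    }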

@aakrem (Collaborator) left a comment

Thanks for the PR @mmabrouk

@dosubot (bot) added the lgtm label (This PR has been approved by a maintainer) on May 30, 2024
@mmabrouk merged commit 2080680 into access-to-all-columns on May 30, 2024
9 checks passed
@mmabrouk deleted the fix-all-columns branch on May 30, 2024 18:08
Labels: Backend, evaluation, Frontend, lgtm (This PR has been approved by a maintainer), size:XL (This PR changes 500-999 lines, ignoring generated files)