Fix evaluation PR #1711
Conversation
…low running any function
…f it is specified as such in the evaluator definition file
Removed langchain and use openai directly
@aakrem can you please review the general logic?
@bekossy oss and cloud.beta both point to this PR for now
Thanks for the PR @mmabrouk
- I like the idea of introducing the advanced key; it frees up a lot of logic in the frontend.
- I also like a lot of the refactoring you did
- Regarding the refactor in evaluators.py, tbh, I prefer the initial implementation with correct_answer_keys, as it groups all possible values together; I find it abstracts more things.
I also would love to know how we are planning to define a ragas evaluator. Is it going to be something like:
    "settings_template": {
        "correct_answer_keys": {
            "label": "Correct Answers",
            "default": "correct_answers",
            "type": "array",
            "advanced": True,
            "ground_truth_key": True,
            "description": "The name of the column in the test data that contains the correct answer",
        },
    },
- We are mixing the terms correct_answer and ground_truths. I tried to fix this but gave up, as it's used everywhere. Let's either use the same terminology now, or refactor everything to ground truth together with the schema at a later stage.
A good abstraction encapsulates a concept and conceals its details; as a result, it simplifies the code. It can do this because a good abstraction mirrors the reality of the problem. That's not the case here, and you can see it by looking at the code and how we handle correct_answer_keys.
Now, why is the abstraction you used a bad one? The data structure does not mirror reality: a list implies that all elements are interchangeable and have the same meaning. That is not the case here. Assume we have an evaluator that uses two columns with different roles; putting both names into a single list erases the distinction between those roles. In conclusion, while grouping values together is often a feature of a good abstraction, it doesn't by itself make an abstraction good. A good abstraction reflects the problem at hand and simplifies its complexity. I hope this helps. I strongly recommend the book Clean Code to everyone reading (or even skimming) this thread; I think it is very helpful.
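To make the argument concrete, here is a minimal sketch of the two schemas being contrasted. All names (`contexts_key`, `resolve_columns`, etc.) are illustrative assumptions, not identifiers from the actual codebase:

```python
# List-based schema: both column names sit in one list, so their
# distinct roles (context vs. correct answer) are invisible.
list_based = {"correct_answer_keys": ["contexts", "correct_answer"]}

# Per-key schema: each column's role is its own named setting,
# mirroring the fact that the columns are not interchangeable.
per_key = {
    "contexts_key": {"default": "contexts", "ground_truth_key": True},
    "correct_answer_key": {"default": "correct_answer", "ground_truth_key": True},
}

def resolve_columns(settings):
    """Map each role to the test-set column it reads from."""
    return {role: cfg["default"] for role, cfg in settings.items()}

print(resolve_columns(per_key))
```

With the per-key schema, an evaluator can address each role by name instead of guessing a position in a list.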
Not sure why you were under the impression that we would save each context in a different column and provide a list of column names in that case. That would not work, since the number of contexts might change from one case to another.
Just a small addition here. The right abstraction, that is, the one that mirrors reality, is that a value in the test set can be not only a string but also a list. Using this abstraction would absorb any future complexity we add when using ragas (for instance, parsing a JSON list).
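A minimal sketch of that "string or list" cell abstraction, assuming lists are stored as JSON-encoded strings in the test set (the function name and storage convention are assumptions for illustration):

```python
import json

def read_cell(value):
    """Return a test-set cell: lists pass through, JSON-encoded lists
    are parsed, and plain strings stay strings."""
    if isinstance(value, list):
        return value
    if isinstance(value, str):
        try:
            parsed = json.loads(value)
        except json.JSONDecodeError:
            return value
        if isinstance(parsed, list):
            return parsed
    return value

print(read_cell('["ctx one", "ctx two"]'))  # parsed into a list
print(read_cell("Paris"))                    # left as a string
```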
In the frontend it's also treated as a list. Regarding the UI, we still don't even have a design for an evaluator containing a list of correct answers.
Regarding all your "reality" points: the reality as I see it is that evaluators can have a single correct answer or multiple ones. I think the original implementation focused a lot on making things ready to integrate evaluators with multiple correct answers. It's still not clear how you plan to create an evaluator with multiple correct answers with this implementation. Can you simply give an example?
Which we set as a string.
I think you don't understand the problem...
No, an evaluator has multiple columns, each with a different meaning: one for context, one for correct_answer... Not multiple columns with the same meaning (correct_answer_1, correct_answer_2). Each column contains one type of information.
Again, I think you did not understand the problem.
I gave my answer, which is the ragas example. There, the evaluator needs access to a list of contexts. This list can be of different sizes from one case to another, so it does not make any sense to have each context in a different column; the list of contexts should live in one column. I hope this helps explain the user story. If there are no comments other than this one, let's merge.
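A sketch of that user story with hypothetical test-set rows (column names and contents are assumptions, not from the real test data):

```python
# Variable-length contexts live in a single column, so the schema
# stays fixed no matter how many contexts each case retrieved.
rows = [
    {"question": "Capital of France?",
     "contexts": ["Paris is the capital of France.", "France is in Europe."],
     "correct_answer": "Paris"},
    {"question": "Largest ocean?",
     "contexts": ["The Pacific is the largest ocean."],
     "correct_answer": "The Pacific"},
]

widths = [len(r["contexts"]) for r in rows]
print(widths)  # the context count differs per row, yet columns are identical
```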
Thanks for the PR @mmabrouk
I've significantly revised the evaluation PR, especially our handling of the correct_answer_key as a string or list, which was problematic. The following changes have been made: