
Fix issues in ResponseMatching operator #589

Closed

Conversation

Dominastorm

Hello everyone,

Apologies for the inconvenience experienced with #564 and #571.

In this pull request, I've made question a mandatory column for the ResponseMatching operator. This update addresses the failures encountered with uptrain==0.5.0. We are currently making further modifications on our side, so we think it's best to add support for the latest version of uptrain later. For the time being, this change should resolve the issues.
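For illustration, a rough sketch of how the evaluator might be called with the additional column; the component, metric, and parameter names (UpTrainEvaluator, UpTrainMetric.RESPONSE_MATCHING, questions, responses, ground_truths) are assumptions based on the integration's public interface, not code from this PR:

```python
from haystack.utils import Secret
from haystack_integrations.components.evaluators.uptrain import (
    UpTrainEvaluator,
    UpTrainMetric,
)

# Assumed interface: a ResponseMatching evaluator resolving the OpenAI key
# from the environment.
evaluator = UpTrainEvaluator(
    metric=UpTrainMetric.RESPONSE_MATCHING,
    api_key=Secret.from_env_var("OPENAI_API_KEY"),
)

# With this change, `questions` must be supplied alongside the responses
# and the ground-truth answers.
results = evaluator.run(
    questions=["When was the Eiffel Tower completed?"],
    responses=["The Eiffel Tower was completed in 1889."],
    ground_truths=["The Eiffel Tower was completed in March 1889."],
)
print(results["results"])
```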

Thanks!

@Dominastorm Dominastorm requested a review from a team as a code owner March 16, 2024 11:15
@Dominastorm Dominastorm requested review from vblagoje and removed request for a team March 16, 2024 11:15
@CLAassistant

CLAassistant commented Mar 16, 2024

CLA assistant check
All committers have signed the CLA.

@Dominastorm
Author

Could you also let me know how I can update the documentation here: https://docs.haystack.deepset.ai/docs/uptrainevaluator

@shadeMe shadeMe requested review from shadeMe and removed request for vblagoje March 18, 2024 12:10
@shadeMe
Contributor

shadeMe commented Mar 18, 2024

Many thanks for the PR, @Dominastorm. It looks like it fixes the issue with the response matching metric, but unfortunately I'm still seeing some other metrics fail with a 500 Internal Server Error:

tests/test_evaluator.py::test_integration_run[response_relevance-inputs2-None] FAILED   [ 78%]
tests/test_evaluator.py::test_integration_run[response_completeness-inputs3-None] FAILED   [ 80%]
tests/test_evaluator.py::test_integration_run[response_completeness_wrt_context-inputs4-None] PASSED   [ 82%]
tests/test_evaluator.py::test_integration_run[response_consistency-inputs5-None] FAILED   [ 85%]
tests/test_evaluator.py::test_integration_run[response_conciseness-inputs6-None] FAILED

Log:

2024-03-18 13:18:50.316 | INFO     | uptrain.framework.evalllm:evaluate:100 - Sending evaluation request for rows 0 to <50 to the Uptrain
2024-03-18 13:18:50.975 | ERROR    | uptrain.framework.remote:raise_or_return:32 - {"detail":"Error running the evaluation: 500: Error running the evaluation: concurrent.futures.process._RemoteTraceback: \n\"\"\"\nTraceback (most recent call last):\n  File \"/usr/local/lib/python3.10/concurrent/futures/process.py\", line 246, in _process_worker\n    r = call_item.fn(*call_item.args, **call_item.kwargs)\n  File \"/app/uptrain_server/worker.py\", line 426, in _run_log_and_eval\n    res = func(settings, dataset)\n  File \"/app/uptrain_server/evals/chatgpt.py\", line 1575, in score_response_conciseness\n    argument = res.response.choices[0].message.content\nAttributeError: 'NoneType' object has no attribute 'choices'\n\"\"\"\n\nThe above exception was the direct cause of the following exception:\n\nTraceback (most recent call last):\n  File \"/app/uptrain_server/worker.py\", line 960, in run_eval_v2\n    result = future.result()\n  File \"/usr/local/lib/python3.10/concurrent/futures/_base.py\", line 451, in result\n    return self.__get_result()\n  File \"/usr/local/lib/python3.10/concurrent/futures/_base.py\", line 403, in __get_result\n    raise self._exception\nAttributeError: 'NoneType' object has no attribute 'choices'\n"}
2024-03-18 13:18:50.976 | ERROR    | uptrain.framework.remote:evaluate:74 - Evaluation failed with error: Server error '500 Internal Server Error' for url 'https://demo.uptrain.ai/api/open/evaluate_no_auth'
For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/500
2024-03-18 13:18:50.976 | INFO     | uptrain.framework.evalllm:evaluate:113 - Retrying evaluation request
2024-03-18 13:18:50.976 | INFO     | uptrain.framework.evalllm:evaluate:100 - Sending evaluation request for rows 0 to <50 to the Uptrain
2024-03-18 13:18:51.413 | ERROR    | uptrain.framework.remote:raise_or_return:32 - {"detail":"Error running the evaluation: 500: Error running the evaluation: concurrent.futures.process._RemoteTraceback: \n\"\"\"\nTraceback (most recent call last):\n  File \"/usr/local/lib/python3.10/concurrent/futures/process.py\", line 246, in _process_worker\n    r = call_item.fn(*call_item.args, **call_item.kwargs)\n  File \"/app/uptrain_server/worker.py\", line 426, in _run_log_and_eval\n    res = func(settings, dataset)\n  File \"/app/uptrain_server/evals/chatgpt.py\", line 1575, in score_response_conciseness\n    argument = res.response.choices[0].message.content\nAttributeError: 'NoneType' object has no attribute 'choices'\n\"\"\"\n\nThe above exception was the direct cause of the following exception:\n\nTraceback (most recent call last):\n  File \"/app/uptrain_server/worker.py\", line 960, in run_eval_v2\n    result = future.result()\n  File \"/usr/local/lib/python3.10/concurrent/futures/_base.py\", line 451, in result\n    return self.__get_result()\n  File \"/usr/local/lib/python3.10/concurrent/futures/_base.py\", line 403, in __get_result\n    raise self._exception\nAttributeError: 'NoneType' object has no attribute 'choices'\n"}
2024-03-18 13:18:51.414 | ERROR    | uptrain.framework.remote:evaluate:74 - Evaluation failed with error: Server error '500 Internal Server Error' for url 'https://demo.uptrain.ai/api/open/evaluate_no_auth'
For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/500
2024-03-18 13:18:51.414 | INFO     | uptrain.framework.evalllm:evaluate:113 - Retrying evaluation request
2024-03-18 13:18:51.414 | INFO     | uptrain.framework.evalllm:evaluate:100 - Sending evaluation request for rows 0 to <50 to the Uptrain
2024-03-18 13:18:51.737 | ERROR    | uptrain.framework.remote:raise_or_return:32 - {"detail":"Error running the evaluation: 500: Error running the evaluation: concurrent.futures.process._RemoteTraceback: \n\"\"\"\nTraceback (most recent call last):\n  File \"/usr/local/lib/python3.10/concurrent/futures/process.py\", line 246, in _process_worker\n    r = call_item.fn(*call_item.args, **call_item.kwargs)\n  File \"/app/uptrain_server/worker.py\", line 426, in _run_log_and_eval\n    res = func(settings, dataset)\n  File \"/app/uptrain_server/evals/chatgpt.py\", line 1575, in score_response_conciseness\n    argument = res.response.choices[0].message.content\nAttributeError: 'NoneType' object has no attribute 'choices'\n\"\"\"\n\nThe above exception was the direct cause of the following exception:\n\nTraceback (most recent call last):\n  File \"/app/uptrain_server/worker.py\", line 960, in run_eval_v2\n    result = future.result()\n  File \"/usr/local/lib/python3.10/concurrent/futures/_base.py\", line 451, in result\n    return self.__get_result()\n  File \"/usr/local/lib/python3.10/concurrent/futures/_base.py\", line 403, in __get_result\n    raise self._exception\nAttributeError: 'NoneType' object has no attribute 'choices'\n"}
2024-03-18 13:18:51.738 | ERROR    | uptrain.framework.remote:evaluate:74 - Evaluation failed with error: Server error '500 Internal Server Error' for url 'https://demo.uptrain.ai/api/open/evaluate_no_auth'
For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/500
2024-03-18 13:18:51.739 | INFO     | uptrain.framework.evalllm:evaluate:113 - Retrying evaluation request
2024-03-18 13:18:51.739 | ERROR    | uptrain.framework.evalllm:evaluate:115 - Evaluation failed with error: Server error '500 Internal Server Error' for url 'https://demo.uptrain.ai/api/open/evaluate_no_auth'

Is this something you can look into as well? We can extend the scope of this PR to include them, for what it's worth.

On a related note, can/does UpTrain version its API endpoints? That would help prevent such retroactive failures in the future.

@github-actions github-actions bot added the type:documentation Improvements or additions to documentation label Mar 19, 2024
@Dominastorm
Author

The issue was that Secret.from_token("Aaa") was sending "Aaa" as the OpenAI API key. Here's a screenshot from our logs:

[screenshot of our logs omitted]

This was causing the evaluations to fail. Changing it to Secret.from_env_var("OPENAI_API_KEY") solves this issue.
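A minimal sketch of the two constructions being contrasted, assuming Haystack's Secret utility (the placeholder value "Aaa" is the one from the tests):

```python
from haystack.utils import Secret

# Before: the literal placeholder "Aaa" is forwarded verbatim as the OpenAI
# API key, so requests against the real backend fail authentication.
api_key = Secret.from_token("Aaa")

# After: resolve the key from the OPENAI_API_KEY environment variable at
# runtime instead of hard-coding a token.
api_key = Secret.from_env_var("OPENAI_API_KEY")
```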

We do not currently version our endpoints, but that's a good suggestion and we will look into it!

@shadeMe
Contributor

shadeMe commented Mar 19, 2024

The issue was that Secret.from_token("Aaa") was sending "Aaa" as the OpenAI API key. Here's a screenshot from our logs:

[screenshot of our logs omitted]

This was causing the evaluations to fail. Changing it to Secret.from_env_var("OPENAI_API_KEY") solves this issue.

I was very surprised that changing those lines fixed the issue, because none of those tests were actually invoking your client to begin with (and, in some cases, a mock client/backend is used). Digging in a bit more, I discovered that the EvalLLM class creates a Settings instance, which ends up modifying the execution environment's variables in its constructor.

This is very problematic: is there a good reason why this is being done? The code shouldn't be modifying the user/execution environment unless it has an extremely good reason to do so, and definitely not silently. It's very insidious behaviour, as it can break other code that's executing side-by-side in a hard-to-debug manner. Speculating here, but maybe this code was primarily written to be executed in your server/deployment environment, where making such modifications is handled gracefully? Furthermore, all of this happens even before your client is invoked, which also violates the principle of least surprise.
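To illustrate the pattern being described, here is a hypothetical sketch (not UpTrain's actual implementation) of a settings object whose constructor silently writes its credentials into the process environment:

```python
import os


class HypotheticalSettings:
    """Illustration only; mirrors the constructor side effect described above."""

    def __init__(self, openai_api_key: str):
        self.openai_api_key = openai_api_key
        # Problematic: silently overwrites the caller's environment, so any
        # other code running in the same process now sees this key.
        os.environ["OPENAI_API_KEY"] = openai_api_key


# Merely constructing the settings object clobbers the environment, even
# before any client call is made.
HypotheticalSettings(openai_api_key="Aaa")
assert os.environ["OPENAI_API_KEY"] == "Aaa"
```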

In any event, this appears to be the underlying issue here w.r.t the above failures. However, I'm still stumped as to why only some of the metrics fail during the integration test even if the API token is junk. Could you shed some light on this?

@Dominastorm
Author

Speculating here, but maybe this code was primarily written to be executed in your server/deployment environment, where making such modifications is handled gracefully?

You are correct. This was done because it is handled gracefully on our servers. Removing these assignments from the Settings class is in our pipeline; however, since it is a breaking change, it will require extensive testing on our end to make sure it does not break any of our current code.



However, I'm still stumped as to why only some of the metrics fail during the integration test even if the API token is junk. Could you shed some light on this?

This happened because we didn't have a check for valid OpenAI keys; we have added one in our latest release. The four evaluations that failed raised an Authentication Error, while the rest returned dictionaries with None values. This was an oversight on our end, which we have now resolved. With the current code, all of them should throw an error when an invalid key is passed.

This also brings up a shortcoming of the tests being performed: they do not check whether the result contains valid data, only whether the data type is correct. That is another change that needs to be made.
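As a rough sketch of what such a check might look like in the integration tests (the output keys "results" and "score" are assumptions about the evaluator's output shape, not the actual test code):

```python
def assert_scores_are_valid(output: dict) -> None:
    # Illustrative helper: reject placeholder results such as None scores
    # instead of only asserting on types.
    for per_input_results in output["results"]:
        for entry in per_input_results:
            assert entry["score"] is not None
            assert 0.0 <= entry["score"] <= 1.0
```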

As for upgrading to the latest version of UpTrain, I will make those changes as well. Upgrading to v0.6.9 should primarily change two things:

  1. All evaluations will be sent directly to OpenAI instead of going through our servers.
  2. The critique language operator now returns only one score instead of four.

I will make the above changes and add support for the latest version of UpTrain. We will address the environment-variable behaviour at a later stage, once we ensure the change is stable on our end.

@lbux
Contributor

lbux commented Apr 10, 2024

Hey @Dominastorm. I'm wondering if there is any update on this?

@Dominastorm
Author

Dominastorm commented Apr 12, 2024

Hey @shadeMe, you can merge it now. It will work with uptrain>=0.6.13.

@shadeMe
Contributor

shadeMe commented May 6, 2024

Closing as the ownership of this integration has been transferred over to UpTrain.

@shadeMe shadeMe closed this May 6, 2024
Labels
integration:uptrain, type:documentation