
Fix issues in ResponseMatching operator #589

Closed

Conversation

Dominastorm

Hello everyone,

Apologies for the inconvenience experienced with #564 and #571.

In this pull request, I've made question a mandatory column for the ResponseMatching operator. This update addresses the failures encountered with uptrain==0.5.0. We are currently making further modifications on our side, so we think it's best to add support for the latest version of uptrain later. For the time being, this change should resolve the issues.
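For illustration, a rough sketch of how the evaluator might be called with the additional column; the component, metric, and parameter names (UpTrainEvaluator, UpTrainMetric.RESPONSE_MATCHING, questions, responses, ground_truths) are assumptions based on the integration's public interface, not code from this PR:

```python
from haystack.utils import Secret
from haystack_integrations.components.evaluators.uptrain import (
    UpTrainEvaluator,
    UpTrainMetric,
)

# Assumed interface: a ResponseMatching evaluator resolving the OpenAI key
# from the environment.
evaluator = UpTrainEvaluator(
    metric=UpTrainMetric.RESPONSE_MATCHING,
    api_key=Secret.from_env_var("OPENAI_API_KEY"),
)

# With this change, `questions` must be supplied alongside the responses
# and the ground-truth answers.
results = evaluator.run(
    questions=["When was the Eiffel Tower completed?"],
    responses=["The Eiffel Tower was completed in 1889."],
    ground_truths=["The Eiffel Tower was completed in March 1889."],
)
print(results["results"])
```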

Thanks!

@Dominastorm Dominastorm requested a review from a team as a code owner March 16, 2024 11:15
@Dominastorm Dominastorm requested review from vblagoje and removed request for a team March 16, 2024 11:15
@CLAassistant

CLAassistant commented Mar 16, 2024

CLA assistant check
All committers have signed the CLA.

@Dominastorm
Author

Could you also let me know how I can update the documentation here: https://docs.haystack.deepset.ai/docs/uptrainevaluator

@shadeMe shadeMe requested review from shadeMe and removed request for vblagoje March 18, 2024 12:10
@shadeMe
Contributor

shadeMe commented Mar 18, 2024

Many thanks for the PR, @Dominastorm. It looks like it fixes the issue with the response matching metric, but unfortunately I'm still seeing some other metrics fail with a 500 Internal Server Error:

tests/test_evaluator.py::test_integration_run[response_relevance-inputs2-None] FAILED   [ 78%]
tests/test_evaluator.py::test_integration_run[response_completeness-inputs3-None] FAILED   [ 80%]
tests/test_evaluator.py::test_integration_run[response_completeness_wrt_context-inputs4-None] PASSED   [ 82%]
tests/test_evaluator.py::test_integration_run[response_consistency-inputs5-None] FAILED   [ 85%]
tests/test_evaluator.py::test_integration_run[response_conciseness-inputs6-None] FAILED

Log:

2024-03-18 13:18:50.316 | INFO     | uptrain.framework.evalllm:evaluate:100 - Sending evaluation request for rows 0 to <50 to the Uptrain
2024-03-18 13:18:50.975 | ERROR    | uptrain.framework.remote:raise_or_return:32 - {"detail":"Error running the evaluation: 500: Error running the evaluation: concurrent.futures.process._RemoteTraceback: \n\"\"\"\nTraceback (most recent call last):\n  File \"/usr/local/lib/python3.10/concurrent/futures/process.py\", line 246, in _process_worker\n    r = call_item.fn(*call_item.args, **call_item.kwargs)\n  File \"/app/uptrain_server/worker.py\", line 426, in _run_log_and_eval\n    res = func(settings, dataset)\n  File \"/app/uptrain_server/evals/chatgpt.py\", line 1575, in score_response_conciseness\n    argument = res.response.choices[0].message.content\nAttributeError: 'NoneType' object has no attribute 'choices'\n\"\"\"\n\nThe above exception was the direct cause of the following exception:\n\nTraceback (most recent call last):\n  File \"/app/uptrain_server/worker.py\", line 960, in run_eval_v2\n    result = future.result()\n  File \"/usr/local/lib/python3.10/concurrent/futures/_base.py\", line 451, in result\n    return self.__get_result()\n  File \"/usr/local/lib/python3.10/concurrent/futures/_base.py\", line 403, in __get_result\n    raise self._exception\nAttributeError: 'NoneType' object has no attribute 'choices'\n"}
2024-03-18 13:18:50.976 | ERROR    | uptrain.framework.remote:evaluate:74 - Evaluation failed with error: Server error '500 Internal Server Error' for url 'https://demo.uptrain.ai/api/open/evaluate_no_auth'
For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/500
2024-03-18 13:18:50.976 | INFO     | uptrain.framework.evalllm:evaluate:113 - Retrying evaluation request
2024-03-18 13:18:50.976 | INFO     | uptrain.framework.evalllm:evaluate:100 - Sending evaluation request for rows 0 to <50 to the Uptrain
2024-03-18 13:18:51.413 | ERROR    | uptrain.framework.remote:raise_or_return:32 - {"detail":"Error running the evaluation: 500: Error running the evaluation: concurrent.futures.process._RemoteTraceback: \n\"\"\"\nTraceback (most recent call last):\n  File \"/usr/local/lib/python3.10/concurrent/futures/process.py\", line 246, in _process_worker\n    r = call_item.fn(*call_item.args, **call_item.kwargs)\n  File \"/app/uptrain_server/worker.py\", line 426, in _run_log_and_eval\n    res = func(settings, dataset)\n  File \"/app/uptrain_server/evals/chatgpt.py\", line 1575, in score_response_conciseness\n    argument = res.response.choices[0].message.content\nAttributeError: 'NoneType' object has no attribute 'choices'\n\"\"\"\n\nThe above exception was the direct cause of the following exception:\n\nTraceback (most recent call last):\n  File \"/app/uptrain_server/worker.py\", line 960, in run_eval_v2\n    result = future.result()\n  File \"/usr/local/lib/python3.10/concurrent/futures/_base.py\", line 451, in result\n    return self.__get_result()\n  File \"/usr/local/lib/python3.10/concurrent/futures/_base.py\", line 403, in __get_result\n    raise self._exception\nAttributeError: 'NoneType' object has no attribute 'choices'\n"}
2024-03-18 13:18:51.414 | ERROR    | uptrain.framework.remote:evaluate:74 - Evaluation failed with error: Server error '500 Internal Server Error' for url 'https://demo.uptrain.ai/api/open/evaluate_no_auth'
For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/500
2024-03-18 13:18:51.414 | INFO     | uptrain.framework.evalllm:evaluate:113 - Retrying evaluation request
2024-03-18 13:18:51.414 | INFO     | uptrain.framework.evalllm:evaluate:100 - Sending evaluation request for rows 0 to <50 to the Uptrain
2024-03-18 13:18:51.737 | ERROR    | uptrain.framework.remote:raise_or_return:32 - {"detail":"Error running the evaluation: 500: Error running the evaluation: concurrent.futures.process._RemoteTraceback: \n\"\"\"\nTraceback (most recent call last):\n  File \"/usr/local/lib/python3.10/concurrent/futures/process.py\", line 246, in _process_worker\n    r = call_item.fn(*call_item.args, **call_item.kwargs)\n  File \"/app/uptrain_server/worker.py\", line 426, in _run_log_and_eval\n    res = func(settings, dataset)\n  File \"/app/uptrain_server/evals/chatgpt.py\", line 1575, in score_response_conciseness\n    argument = res.response.choices[0].message.content\nAttributeError: 'NoneType' object has no attribute 'choices'\n\"\"\"\n\nThe above exception was the direct cause of the following exception:\n\nTraceback (most recent call last):\n  File \"/app/uptrain_server/worker.py\", line 960, in run_eval_v2\n    result = future.result()\n  File \"/usr/local/lib/python3.10/concurrent/futures/_base.py\", line 451, in result\n    return self.__get_result()\n  File \"/usr/local/lib/python3.10/concurrent/futures/_base.py\", line 403, in __get_result\n    raise self._exception\nAttributeError: 'NoneType' object has no attribute 'choices'\n"}
2024-03-18 13:18:51.738 | ERROR    | uptrain.framework.remote:evaluate:74 - Evaluation failed with error: Server error '500 Internal Server Error' for url 'https://demo.uptrain.ai/api/open/evaluate_no_auth'
For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/500
2024-03-18 13:18:51.739 | INFO     | uptrain.framework.evalllm:evaluate:113 - Retrying evaluation request
2024-03-18 13:18:51.739 | ERROR    | uptrain.framework.evalllm:evaluate:115 - Evaluation failed with error: Server error '500 Internal Server Error' for url 'https://demo.uptrain.ai/api/open/evaluate_no_auth'

Is this something you can look into as well? We can extend the scope of this PR to include them, for what it's worth.

On a related note, can/does UpTrain version its API endpoints? That would help prevent such retroactive failures in the future.

@github-actions github-actions bot added the type:documentation Improvements or additions to documentation label Mar 19, 2024
@Dominastorm
Author

The issue was that Secret.from_token("Aaa") was sending "Aaa" as the OpenAI API key. Here's a screenshot from our logs:

[screenshot of our logs omitted]

This was causing the evaluations to fail. Changing it to Secret.from_env_var("OPENAI_API_KEY") solves this issue.
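A minimal sketch of the two constructions being contrasted, assuming Haystack's Secret utility (the placeholder value "Aaa" is the one from the tests):

```python
from haystack.utils import Secret

# Before: the literal placeholder "Aaa" is forwarded verbatim as the OpenAI
# API key, so requests against the real backend fail authentication.
api_key = Secret.from_token("Aaa")

# After: resolve the key from the OPENAI_API_KEY environment variable at
# runtime instead of hard-coding a token.
api_key = Secret.from_env_var("OPENAI_API_KEY")
```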

We do not currently version our endpoints, but that's a good suggestion and we will look into it!

@shadeMe
Contributor

shadeMe commented Mar 19, 2024

The issue was that Secret.from_token("Aaa") was sending "Aaa" as the OpenAI API key. Here's a screenshot from our logs:

[screenshot of our logs omitted]

This was causing the evaluations to fail. Changing it to Secret.from_env_var("OPENAI_API_KEY") solves this issue.

I was very surprised that changing those lines fixed the issue, because none of those tests were actually invoking your client to begin with (and, in some cases, a mock client/backend is used). Digging in a bit more, I discovered that the EvalLLM class creates a Settings instance, which ends up modifying the execution environment's variables in its constructor.

This is very problematic: is there a good reason why this is being done? The code shouldn't be modifying the user/execution environment unless it has an extremely good reason to do so, and definitely not silently. It's very insidious behaviour, as it can break other code that's executing side-by-side in a hard-to-debug manner. Speculating here, but maybe this code was primarily written to be executed in your server/deployment environment, where making such modifications is handled gracefully? Furthermore, all of this happens even before your client is invoked, which also violates the principle of least surprise.
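To illustrate the pattern being described, here is a hypothetical sketch (not UpTrain's actual implementation) of a settings object whose constructor silently writes its credentials into the process environment:

```python
import os


class HypotheticalSettings:
    """Illustration only; mirrors the constructor side effect described above."""

    def __init__(self, openai_api_key: str):
        self.openai_api_key = openai_api_key
        # Problematic: silently overwrites the caller's environment, so any
        # other code running in the same process now sees this key.
        os.environ["OPENAI_API_KEY"] = openai_api_key


# Merely constructing the settings object clobbers the environment, even
# before any client call is made.
HypotheticalSettings(openai_api_key="Aaa")
assert os.environ["OPENAI_API_KEY"] == "Aaa"
```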

In any event, this appears to be the underlying issue here w.r.t the above failures. However, I'm still stumped as to why only some of the metrics fail during the integration test even if the API token is junk. Could you shed some light on this?

@Dominastorm
Author

Speculating here, but maybe this code was primarily written to be executed in your server/deployment environment, where making such modifications is handled gracefully?

You are correct. This was done because it is handled gracefully on our servers. Removing these assignments from the Settings class is in our pipeline; however, since it is a breaking change, it will require extensive testing on our end to make sure it does not break any of our current code.



However, I'm still stumped as to why only some of the metrics fail during the integration test even if the API token is junk. Could you shed some light on this?

This happened because we didn't have a check for valid OpenAI keys; we have added one in our latest release. The four evaluations that failed raised an Authentication Error, while the rest returned dictionaries with None values. This was an oversight on our end, which we have now resolved. With the current code, all of them should throw an error when an invalid key is passed.

This also brings up a shortcoming of the tests being performed: they do not check whether the result contains valid data, only whether the data type is correct. That is another change that needs to be made.
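As a rough sketch of what such a check might look like in the integration tests (the output keys "results" and "score" are assumptions about the evaluator's output shape, not the actual test code):

```python
def assert_scores_are_valid(output: dict) -> None:
    # Illustrative helper: reject placeholder results such as None scores
    # instead of only asserting on types.
    for per_input_results in output["results"]:
        for entry in per_input_results:
            assert entry["score"] is not None
            assert 0.0 <= entry["score"] <= 1.0
```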

As for upgrading to the latest version of UpTrain, I will make those changes as well. Upgrading to v0.6.9 should primarily change two things:

  1. All evaluations will be sent directly to OpenAI instead of going through our servers.
  2. The critique language operator now returns only one score instead of four.

I will make the above changes and add support for the latest version of UpTrain. We will address the environment-variable behaviour at a later stage, once we ensure the change is stable on our end.

@lbux
Contributor

lbux commented Apr 10, 2024

Hey @Dominastorm. I'm wondering if there is any update on this?

@Dominastorm
Author

Dominastorm commented Apr 12, 2024

Hey @shadeMe, you can merge it now. It will work with uptrain>=0.6.13.

@shadeMe
Contributor

shadeMe commented May 6, 2024

Closing as the ownership of this integration has been transferred over to UpTrain.

@shadeMe shadeMe closed this May 6, 2024
Labels
integration:uptrain, type:documentation