Releases: confident-ai/deepeval
Version v2.0
Here are the new features we're bringing to you in the latest release:
⚙️ Automated LLM red teaming, a.k.a. vulnerability and security safety scanning. You can now scan for 40+ vulnerabilities using 10+ SOTA attack enhancement techniques in <10 lines of Python code.
🪄 Synthetic dataset generation with a highly customizable synthetic data generation pipeline to cover literally any use case.
🖼️ Multi-modal LLM evaluation - perfect for image editing or text-to-image use cases.
💬 Conversational evaluation - perfect for evaluating LLM chatbots.
💥 More LLM system metrics: Prompt Alignment (to determine whether your LLM is able to follow instructions specified in your prompt template), Tool Correctness (for agents; see the sketch below), and JSON Correctness (to validate whether LLM outputs conform to your desired schema)
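As a quick illustration of the Tool Correctness metric above, here is a minimal sketch. The string-based `tools_called` / `expected_tools` fields on `LLMTestCase` are an assumption based on the docs; treat the exact parameter shapes as illustrative rather than definitive.

```python
# Hedged sketch: assumes ToolCorrectnessMetric and the string-based
# tools_called / expected_tools fields on LLMTestCase.
from deepeval.metrics import ToolCorrectnessMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="Book me a table for two at 7pm.",
    actual_output="Done! Your table for two is booked for 7pm.",
    # Tools the agent actually invoked vs. the tools it was expected to invoke
    tools_called=["search_restaurants", "make_reservation"],
    expected_tools=["make_reservation"],
)

metric = ToolCorrectnessMetric()
metric.measure(test_case)
print(metric.score, metric.is_successful())
```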
Red teaming, safety testing, improved synthesizer, conversational metrics, and multi-modal metrics
In DeepEval 1.4.7, we're releasing:
- LLM red teaming. Safety test your LLM application for 40+ vulnerabilities with 10+ attack enhancements, docs here: https://docs.confident-ai.com/docs/red-teaming-introduction
- Improved synthetic data synthesizer with much more functionality and customizability (see the sketch after this list): https://docs.confident-ai.com/docs/evaluation-datasets-synthetic-data
- Conversational metrics: Dedicated metrics to evaluate LLM turns
- Multi-modal metrics: Image editing and text-to-image evaluation
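For the improved synthesizer mentioned above, something along these lines should work; the `Synthesizer` entry point and the `generate_goldens_from_docs()` keyword arguments are assumptions based on the synthetic-data docs linked above.

```python
# Hedged sketch: generating synthetic goldens (inputs + expected outputs) from documents.
from deepeval.synthesizer import Synthesizer

synthesizer = Synthesizer()
goldens = synthesizer.generate_goldens_from_docs(
    document_paths=["knowledge_base.pdf"],  # your own source documents
)
for golden in goldens:
    print(golden.input)
```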
Agentic Evaluation Metric, Custom Evaluation LLMs, and Async for Synthetic Data Generation
In DeepEval v0.21.74, we have:
- Agentic evaluation metric to evaluate tool-calling correctness for LLM agents: https://docs.confident-ai.com/docs/metrics-tool-correctness
- Pydantic schemas to enforce JSON outputs for custom, smaller LLMs (see the sketch after this list): https://docs.confident-ai.com/docs/guides-using-custom-llms
- Asynchronous support for synthetic data generation: https://docs.confident-ai.com/docs/evaluation-datasets-synthetic-data
- Tracing integration for LlamaIndex and LangChain: https://docs.confident-ai.com/docs/confident-ai-tracing
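To illustrate the Pydantic schema point above, here is a rough sketch of a custom evaluation LLM whose `generate()` accepts a schema and returns a parsed object, in the spirit of the custom-LLM guide. The `DeepEvalBaseLLM` import path, the schema-aware `generate()` signature, and the use of OpenAI structured outputs as the JSON-enforcing backend are all assumptions for illustration.

```python
# Hedged sketch: a custom evaluation LLM that enforces JSON output via a Pydantic schema.
from openai import OpenAI
from pydantic import BaseModel
from deepeval.models import DeepEvalBaseLLM  # assumed import path


class SchemaEnforcedLLM(DeepEvalBaseLLM):
    def __init__(self, model_name: str = "gpt-4o-mini"):
        self.model_name = model_name
        self.client = OpenAI()

    def load_model(self):
        return self.client

    def generate(self, prompt: str, schema: type[BaseModel]) -> BaseModel:
        # Ask the backend for JSON that conforms to the given Pydantic schema
        completion = self.client.beta.chat.completions.parse(
            model=self.model_name,
            messages=[{"role": "user", "content": prompt}],
            response_format=schema,
        )
        return completion.choices[0].message.parsed

    async def a_generate(self, prompt: str, schema: type[BaseModel]) -> BaseModel:
        # For brevity this sketch reuses the sync path; a real wrapper would use AsyncOpenAI
        return self.generate(prompt, schema)

    def get_model_name(self):
        return self.model_name
```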
Verbosity in Metrics, Hyperparameter Logging, Improved Synthetic Data Generation, Better Async Support
In DeepEval v0.21.62, we:
- added an option to print out intermediate steps during metric execution, which can be configured via the `verbose_mode` parameter (see the sketch after this list): https://docs.confident-ai.com/docs/metrics-answer-relevancy#example
- allowed hyperparameters to be logged to Confident AI via the evaluate() function: https://docs.confident-ai.com/docs/getting-started#optimizing-hyperparameters
- improved synthetic data generation to give more realistic results and be more customizable: https://docs.confident-ai.com/docs/evaluation-datasets-synthetic-data
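A small sketch of the `verbose_mode` flag is below; the metric, threshold, and test-case values are just placeholders.

```python
# Hedged sketch: verbose_mode prints a metric's intermediate steps while it runs.
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What is your refund policy?",
    actual_output="You can get a full refund within 30 days of purchase.",
)

metric = AnswerRelevancyMetric(threshold=0.7, verbose_mode=True)
metric.measure(test_case)  # intermediate steps are printed to stdout
print(metric.score)
```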
Synthetic Data, Caching, Benchmarks, and GEval improvement
In DeepEval v0.21.15, we're releasing:
- Synthetic Data generation. Generate synthetic data from documents easily: https://docs.confident-ai.com/docs/evaluation-datasets-synthetic-data
- Caching. If you're running 10k test cases and it fails at the 9,999th test case, you no longer have to rerun the first 9,999 test cases; you can just read from the cache using the `-c` flag: https://docs.confident-ai.com/docs/evaluation-introduction#cache
- Repeats. If you want to repeat each test case for statistical significance, use the `-r` flag: https://docs.confident-ai.com/docs/evaluation-introduction#repeats
- LLM Benchmarks. Supporting popular benchmarks such as MMLU, HellaSwag, and BIG-Bench Hard so anyone can evaluate ANY model on research-backed benchmarks in a few lines of code (see the sketch after this list).
- G-Eval improvements. The G-Eval metric now supports using logprobs of tokens to compute a weighted summed score.
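For the benchmarks, a minimal sketch is below; `your_model` is a placeholder for any evaluation LLM wrapper (e.g. a `DeepEvalBaseLLM` subclass as described in the custom-LLM guide), and `overall_score` is an assumed attribute name.

```python
# Hedged sketch: running a research-backed benchmark against your own model.
from deepeval.benchmarks import MMLU

benchmark = MMLU()
# your_model: placeholder for any DeepEvalBaseLLM subclass wrapping the model under test
benchmark.evaluate(model=your_model)
print(benchmark.overall_score)
```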
Async Support for Prod
In deepeval v0.20.85:
- asynchronous support throughout deepeval, no longer using threads. Users can also call individual metrics asynchronously (see the sketch after this list): https://docs.confident-ai.com/docs/metrics-introduction#measuring-metrics-in-async
- improved the way in which you create a custom LLM for evaluation. You'll now have to implement an asynchronous generate() method to use deepeval's async features: https://docs.confident-ai.com/docs/metrics-introduction#using-a-custom-llm
- strict mode for all metrics!
- improved the `evaluate()` function for more customizability: https://docs.confident-ai.com/docs/evaluation-introduction#evaluating-without-pytest
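Below is a minimal sketch of calling a metric asynchronously with strict mode enabled; `a_measure()` and `strict_mode` are taken from the docs linked above, while the metric choice and field values are placeholders.

```python
# Hedged sketch: awaiting a metric's a_measure() instead of relying on threads,
# with strict mode enabled (strict mode forces a binary pass/fail score).
import asyncio

from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase


async def main():
    test_case = LLMTestCase(
        input="How long is the refund window?",
        actual_output="Refunds are accepted within 30 days of purchase.",
    )
    metric = AnswerRelevancyMetric(threshold=0.7, strict_mode=True)
    await metric.a_measure(test_case)
    print(metric.score, metric.reason)


asyncio.run(main())
```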
Conversational Metrics and Synthetic Data Generation
In DeepEval's latest release, there is now:
- conversational metrics: https://docs.confident-ai.com/docs/metrics-knowledge-retention. The knowledge retention metric evaluates whether your LLM is able to retain factual information presented to it throughout a conversation (see the sketch after this list)
- synthetic data generation. Generate evaluation datasets from scratch: https://docs.confident-ai.com/docs/evaluation-datasets#generate-an-evaluation-dataset
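A rough sketch of the knowledge retention metric is below. The `ConversationalTestCase` shape shown here (a list of `LLMTestCase` turns) follows the current docs and may differ from the API at this release, so treat it as an assumption.

```python
# Hedged sketch: checking whether an LLM retains facts across a conversation.
from deepeval.metrics import KnowledgeRetentionMetric
from deepeval.test_case import ConversationalTestCase, LLMTestCase

convo = ConversationalTestCase(
    turns=[
        LLMTestCase(
            input="My order number is 11111.",
            actual_output="Thanks! I've noted order 11111.",
        ),
        LLMTestCase(
            input="Can you check the status?",
            actual_output="Sure, what's your order number?",  # fails to retain the number
        ),
    ]
)

metric = KnowledgeRetentionMetric(threshold=0.5)
metric.measure(convo)
print(metric.score, metric.reason)
```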
Production Stability
For the newest release, deepeval is now stable for production use:
- reduced package size
- separated functionality of pytest vs deepeval test run command
- included coverage score for summarization
- fixed a contextual precision node error
- released docs for better transparency into metrics calculation
- allows users to configure RAGAS metrics for custom embedding models: https://docs.confident-ai.com/docs/metrics-ragas#example
- fixed bugs with checking for package updates
Hugging Face and LlamaIndex integration
For the latest release, DeepEval:
- Supports Hugging Face users by providing real-time evaluations during fine-tuning: https://docs.confident-ai.com/docs/integrations-huggingface
- Supports LlamaIndex users by allowing unit testing of LlamaIndex apps in CI/CD and offering metrics for LlamaIndex's evaluators: https://docs.confident-ai.com/docs/integrations-llamaindex
- Improves accuracy and reliability of the Faithfulness and Answer Relevancy metrics
- The Summarization metric now offers an explanation (see the sketch after this list)
- You can now use ANY LLM for evaluation: https://docs.confident-ai.com/docs/metrics-introduction#using-a-custom-llm
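A small sketch combining two of the points above: the Summarization metric returning a reason, and the `model` parameter pointing at the evaluation LLM of your choice (an OpenAI model name here; a custom wrapper should work the same way per the custom-LLM docs).

```python
# Hedged sketch: Summarization metric with an explanation, using any model for evaluation.
from deepeval.metrics import SummarizationMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="(the original text that was summarized)",
    actual_output="(the LLM-generated summary)",
)

# `model` accepts an OpenAI model name, or any custom evaluation LLM wrapper
metric = SummarizationMetric(threshold=0.5, model="gpt-4")
metric.measure(test_case)
print(metric.score)
print(metric.reason)  # the new explanation
```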
LLM-Evals now support all LangChain chat models
- LLM-Evals (LLM evaluated metrics) now support all of langchain's chat models.
- `LLMTestCase` now has `execution_time` and `cost`, useful for those looking to evaluate on these parameters
- `minimum_score` is now `threshold` instead, meaning you can now create custom metrics that either have a "minimum" or "maximum" threshold
- `LLMEvalMetric` is now `GEval` (see the sketch after this list)
- LlamaIndex tracing integration: https://docs.llamaindex.ai/en/stable/module_guides/observability/observability.html#deepeval
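To illustrate the rename and the new `threshold` parameter, a minimal sketch (criteria and field values are placeholders):

```python
# Hedged sketch: LLMEvalMetric is now GEval, and minimum_score is now threshold.
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

correctness = GEval(
    name="Correctness",
    criteria="Determine whether the actual output is factually consistent with the input.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.7,  # replaces the old minimum_score
)

test_case = LLMTestCase(
    input="Where is the Eiffel Tower?",
    actual_output="The Eiffel Tower is in Paris, France.",
)

correctness.measure(test_case)
print(correctness.score, correctness.is_successful())
```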