
Releases: confident-ai/deepeval

🎉 New Arena GEval Metric, for Pairwise Comparisons

25 Jun 18:49

A Metric Like LLM Arena Is Here

In DeepEval's latest release, we are introducing ArenaGEval, the first-ever metric that compares test cases against one another and picks the best-performing one based on your custom criteria.

It looks something like this:

from deepeval.test_case import ArenaTestCase, LLMTestCase, LLMTestCaseParams
from deepeval.metrics import ArenaGEval

a_test_case = ArenaTestCase(
    contestants={
        "GPT-4": LLMTestCase(
            input="What is the capital of France?",
            actual_output="Paris",
        ),
        "Claude-4": LLMTestCase(
            input="What is the capital of France?",
            actual_output="Paris is the capital of France.",
        ),
    },
)
arena_geval = ArenaGEval(
    name="Friendly",
    criteria="Choose the winner based on which contestant is more friendly, judging from the input and actual output",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
    ],
)

arena_geval.measure(a_test_case)
print(arena_geval.winner, arena_geval.reason)

Docs here: https://deepeval.com/docs/metrics-arena-g-eval

🎉 New Multimodal Metrics, with Platform Support

19 Jun 07:46

In DeepEval's latest release, we are introducing multimodal G-Eval, plus 7+ multimodal metrics!

Previously we had great support for single-turn, text-only evaluation in the form of LLMTestCases, but now we're adding MLLMTestCase, which also accepts images:

from deepeval.metrics import MultimodalGEval
from deepeval.test_case import MLLMTestCaseParams, MLLMTestCase, MLLMImage
from deepeval import evaluate

m_test_case = MLLMTestCase(
    input=["Show me how to fold an airplane"],
    actual_output=[
        "1. Take the sheet of paper and fold it lengthwise",
        MLLMImage(url="./paper_plane_1", local=True),
        "2. Unfold the paper. Fold the top left and right corners towards the center.",
        MLLMImage(url="./paper_plane_2", local=True)
    ]
)
text_image_coherence = MultimodalGEval(
    name="Text-Image Coherence",
    criteria="Determine whether the images and text is coherence in the actual output.",
    evaluation_params=[MLLMTestCaseParams.ACTUAL_OUTPUT],
)

evaluate(test_cases=[m_test_case], metrics=[text_image_coherence])

Docs here: https://deepeval.com/docs/multimodal-metrics-g-eval

P.S. This release also includes platform support on Confident AI.


🎉 New Conversational Evaluation, LiteLLM Integration

10 Jun 09:16

In DeepEval's latest release, we are introducing a slight change in how a conversation is evaluated.

Previously we assumed a conversation was a list of LLMTestCases, which might not necessarily be the case. Now a conversational test case is made up of a list of Turns instead, which follows OpenAI's standard messages format:

from deepeval.test_case import Turn

turns = [Turn(role="user", content="...")]
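
These turns then slot into a conversational test case. For example (a minimal sketch; see the docs below for the full set of ConversationalTestCase fields):

from deepeval.test_case import ConversationalTestCase, Turn

convo_test_case = ConversationalTestCase(
    turns=[
        Turn(role="user", content="Can you reset my password?"),
        Turn(role="assistant", content="Sure, I've just sent a reset link to your email."),
    ]
)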

Docs here: https://deepeval.com/docs/evaluation-test-cases#conversational-test-case

New Loading Bars, And Cloud Storage

07 Jun 11:09

Added new loading bars for component-level evals, and the deepeval view command to see results on Confident AI.
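
For example, after running an evaluation (and assuming you're already logged in to Confident AI via deepeval login), you can open the results with:

deepeval view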

LLM Evals - v3.0

27 May 17:59

🚀 DeepEval v3.0 — Evaluate Any LLM Workflow, Anywhere

We’re excited to introduce DeepEval v3.0, a major milestone that transforms how you evaluate LLM applications — from complex multi-step agents to simple prompt chains. This release brings component-level granularity, production-ready observability, and simulation tools to empower devs building modern AI systems.


🔍 Component-Level Evaluation for Agentic Workflows

You can now apply DeepEval metrics to any step of your LLM workflow — tools, memories, retrievers, generators — and monitor them in both development and production.

  • Evaluate individual function calls, not just final outputs
  • Works with any framework or custom agent logic
  • Real-time evaluation in production using observe()
  • Track sub-component performance over time

📘 Learn more →
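
For example, instrumenting a toy agent so that only its generation step is scored might look like this (a minimal sketch: the retriever and generator bodies are placeholders, and decorator arguments are kept minimal, so check the tracing docs for span types and evaluation entry points):

from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase
from deepeval.tracing import observe, update_current_span_test_case

@observe()  # retriever component: traced, but not scored
def retrieve(query: str) -> list:
    return ["Paris is the capital of France."]  # placeholder retrieval logic

@observe(metrics=[AnswerRelevancyMetric()])  # only this component is scored
def generate(query: str, context: list) -> str:
    answer = "Paris."  # placeholder for your actual LLM call
    update_current_span_test_case(
        test_case=LLMTestCase(input=query, actual_output=answer, retrieval_context=context)
    )
    return answer

@observe()  # outer agent span that ties the components together
def agent(query: str) -> str:
    return generate(query, retrieve(query))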


🧠 Conversation Simulation

Automatically simulate realistic multi-turn conversations to test your chatbots and agents.

  • Define model goals and user behavior
  • Generate labeled conversations at scale
  • Use DeepEval metrics to assess response quality
  • Customize turn count, persona types, and more

📘 Try the simulator →
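
A rough sketch of what this can look like in code (the module path and every argument name below, such as user_intentions, model_callback, and max_turns, are assumptions for illustration; the simulator docs have the exact API):

from deepeval.conversation_simulator import ConversationSimulator  # assumed module path

# Hypothetical callback wrapping your chatbot: takes the simulated user's message, returns a reply
def model_callback(user_input: str) -> str:
    return "I'd be happy to help with that!"

simulator = ConversationSimulator(
    user_intentions={"open a new bank account": 3},  # assumed argument name
)
conversational_test_cases = simulator.simulate(
    model_callback=model_callback,  # assumed argument name
    max_turns=10,                   # assumed argument name
)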


🧬 Generate Goldens from Goldens

Bootstrapping eval datasets just got easier. Now you can exponentially expand your test cases using LLM-generated variants of existing goldens.

  • Transform goldens into many meaningful test cases
  • Preserve structure while diversifying content
  • Control tone, complexity, length, and more

📘 Read the guide →
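
In code this might look something like the following (a minimal sketch; the Synthesizer method and argument names shown are assumptions, so defer to the guide above):

from deepeval.dataset import Golden
from deepeval.synthesizer import Synthesizer

# A handful of hand-written goldens to expand from
goldens = [Golden(input="What does your premium plan include?")]

synthesizer = Synthesizer()
expanded_goldens = synthesizer.generate_goldens_from_goldens(
    goldens=goldens,  # assumed argument name
)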


🔒 Red Teaming Moved to DeepTeam

All red teaming functionality now lives in its own focused project: DeepTeam. DeepTeam is built for LLM security — adversarial testing, attack generation, and vulnerability discovery.


🛠️ Install or Upgrade

pip install deepeval --upgrade

🧠 Why v3.0 Matters

DeepEval v3.0 is more than an evaluation framework — it's a foundation for LLM observability. Whether you're debugging agents, simulating conversations, or continuously monitoring production performance, DeepEval now meets you wherever your LLM logic runs.

Ready to explore?
📚 Full docs at deepeval.com →

G-Eval Rubric

15 May 05:13

Cleanup Tracing, Component Evals, Etc.

06 May 11:58

In this release we've cleaned up some dependencies to separate out dev packages, and added more verbose tracing logs for debugging.

v3.0 Pre-Release

28 Apr 07:48

🚨 Breaking Changes

⚠️ This release introduces breaking changes in preparation for DeepEval v3.0.
Please review carefully and adjust your code as needed.

The evaluate() function now has "configs"

  • Previously the evaluate() function had 13+ arguments to control display, async behavior, caching, etc., and it was growing out of control. We've now abstracted these into "configs" instead:
from deepeval.evaluate.configs import AsyncConfig
from deepeval import evaluate

evaluate(..., async_config=AsyncConfig(max_concurrent=20))

Full docs here: https://www.deepeval.com/docs/evaluation-running-llm-evals#configs-for-evaluate

Red Teaming Officially Migrated to DeepTeam

This shouldn't be a surprise, but DeepTeam now takes care of everything red-teaming related for the foreseeable future. Docs here: https://trydeepteam.com

🥳 New Feature

Dynamic Evaluations for Nested Components

Nested components are a mess to evaluate. In this version, in preparation for v3.0, we introduced dynamic evals, which let you apply a different set of metrics to different components of your LLM application:

from openai import OpenAI

from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.tracing import observe, update_current_span_test_case

client = OpenAI()

@observe(metrics=[AnswerRelevancyMetric()])
def complete(query: str):
    response = client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": query}]
    ).choices[0].message.content

    # Set the test case for this span so AnswerRelevancyMetric can score it
    update_current_span_test_case(
        test_case=LLMTestCase(input=query, actual_output=response)
    )
    return response

Full docs here: https://www.deepeval.com/docs/evaluation-running-llm-evals#setup-tracing-highly-recommended

Dependency Cleaning

23 Apr 03:14

Cleaned up dependencies for upcoming 3.0 release:

  • Removed automatic updates; they are now opt-in: https://www.deepeval.com/docs/miscellaneous

  • Removed instructor; we double-checked and it wasn't used anywhere

  • Moved LlamaIndex to an optional dependency, since it's only needed for one module

Conversation Simulator

07 Apr 04:27

The new conversation simulator simulates fake user interactions to generate conversations on your behalf. These conversations can be used for evaluation right afterward, similar to how the goldens synthesizer works. Docs here: https://docs.confident-ai.com/docs/evaluation-conversation-simulator