
🎉 New Arena GEval Metric for Pairwise Comparisons

@penguine-ip released this 25 Jun 18:49

An LLM Arena-Style Metric Is Here

In DeepEval's latest release, we are introducing ArenaGEval, the first ever metric that compares test cases against each other and chooses the best-performing one based on your custom criteria.

It looks something like this:

from deepeval import evaluate
from deepeval.test_case import ArenaTestCase, LLMTestCase, LLMTestCaseParams
from deepeval.metrics import ArenaGEval

a_test_case = ArenaTestCase(
    contestants={
        "GPT-4": LLMTestCase(
            input="What is the capital of France?",
            actual_output="Paris",
        ),
        "Claude-4": LLMTestCase(
            input="What is the capital of France?",
            actual_output="Paris is the capital of France.",
        ),
    },
)
arena_geval = ArenaGEval(
    name="Friendly",
    criteria="Choose the winter of the more friendly contestant based on the input and actual output",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
    ],
)

arena_geval.measure(a_test_case)
print(arena_geval.winner, arena_geval.reason)
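Once measure() has run, arena_geval.winner holds the name of the winning contestant and arena_geval.reason holds the judge's explanation. As an illustrative sketch (the actual winner and wording depend on your evaluation model, so this output is an assumption rather than a guaranteed result), the final print statement might show something like:

Claude-4 Claude-4's response is friendlier because it answers in a complete, conversational sentence rather than a single word.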

Docs here: https://deepeval.com/docs/metrics-arena-g-eval