
🎉 New Arena GEval Metric for Pairwise Comparisons

@penguine-ip released this 25 Jun 18:49

An LLM Arena-Style Metric Is Here

In DeepEval's latest release, we are introducing ArenaGEval, the first ever metric that compares test cases against each other and chooses the best-performing one based on your custom criteria.

It looks something like this:

from deepeval import evaluate
from deepeval.test_case import ArenaTestCase, LLMTestCase, LLMTestCaseParams
from deepeval.metrics import ArenaGEval

a_test_case = ArenaTestCase(
    contestants={
        "GPT-4": LLMTestCase(
            input="What is the capital of France?",
            actual_output="Paris",
        ),
        "Claude-4": LLMTestCase(
            input="What is the capital of France?",
            actual_output="Paris is the capital of France.",
        ),
    },
)
arena_geval = ArenaGEval(
    name="Friendly",
    criteria="Choose the winter of the more friendly contestant based on the input and actual output",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
    ],
)

arena_geval.measure(a_test_case)
print(arena_geval.winner, arena_geval.reason)
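Once measure() has run, arena_geval.winner holds the name of the winning contestant and arena_geval.reason holds the judge's explanation. As an illustrative sketch (the actual winner and wording depend on your evaluation model, so this output is an assumption rather than a guaranteed result), the final print statement might show something like:

Claude-4 Claude-4's response is friendlier because it answers in a complete, conversational sentence rather than a single word.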

Docs here: https://deepeval.com/docs/metrics-arena-g-eval