An LLM Arena-Like Metric Is Here
In DeepEval's latest release, we are introducing ArenaGEval, the first-ever metric that compares test cases against each other and picks the best-performing one based on your custom criteria.
It looks something like this:
from deepeval.test_case import ArenaTestCase, LLMTestCase, LLMTestCaseParams
from deepeval.metrics import ArenaGEval
# Each contestant answers the same input; ArenaGEval judges between their outputs
a_test_case = ArenaTestCase(
    contestants={
        "GPT-4": LLMTestCase(
            input="What is the capital of France?",
            actual_output="Paris",
        ),
        "Claude-4": LLMTestCase(
            input="What is the capital of France?",
            actual_output="Paris is the capital of France.",
        ),
    },
)
arena_geval = ArenaGEval(
    name="Friendly",
    criteria="Choose the winner of the more friendly contestant based on the input and actual output",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
    ],
)
arena_geval.measure(a_test_case)
print(arena_geval.winner, arena_geval.reason)
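Because the metric only needs an ArenaTestCase plus the attributes shown above (winner and reason), it is easy to reuse across prompts. Here is a minimal sketch that assumes nothing beyond the API demonstrated in this example; the compare_contestants helper is hypothetical:

from deepeval.test_case import ArenaTestCase, LLMTestCase, LLMTestCaseParams
from deepeval.metrics import ArenaGEval

def compare_contestants(contestants: dict[str, LLMTestCase]) -> str:
    # Hypothetical helper: builds an arena test case, judges it with the same
    # "Friendly" ArenaGEval metric as above, and returns the winner's name.
    test_case = ArenaTestCase(contestants=contestants)
    metric = ArenaGEval(
        name="Friendly",
        criteria="Choose the winner of the more friendly contestant based on the input and actual output",
        evaluation_params=[
            LLMTestCaseParams.INPUT,
            LLMTestCaseParams.ACTUAL_OUTPUT,
        ],
    )
    metric.measure(test_case)
    print(metric.winner, metric.reason)
    return metric.winner

Calling it with a new contestants dict runs another round of the arena and prints the winner along with the judge model's reasoning.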