Issue encountered
Reasoning models like Qwen3 emit a long chain of thought enclosed in <think> and </think> tags before providing the final answer, as follows:
<|im_start|>user
{query} /think<|im_end|>
<|im_start|>assistant
<think>
{thinking content}
</think>
{response}<|im_end|>
The chain of thought often contains preliminary solutions, code snippets, etc. that should not be graded, since they can produce false positives/negatives.
We already remove the chain of thought for IFEval, but this logic should be applied consistently across all evals. In particular, if a completion fails to produce a closing </think> tag, it should be graded as incomplete (i.e. as an empty string).
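A minimal sketch of what the stripping logic could look like (the helper name and signature are assumptions, not existing code):

```python
def strip_thinking(
    completion: str,
    open_tag: str = "<think>",
    close_tag: str = "</think>",
) -> str:
    """Remove the chain-of-thought block before grading.

    A completion that opens a think block but never closes it is
    treated as incomplete and graded as an empty string.
    """
    if open_tag not in completion:
        return completion
    if close_tag not in completion:
        # Truncated chain of thought: grade as incomplete.
        return ""
    # Keep only what follows the closing tag, i.e. the final answer.
    return completion.split(close_tag, 1)[1].lstrip()
```

With something like this in place, graders would only ever see {response}, and truncated generations (no closing </think>) would score as empty.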
If we implement this feature, we should run careful before/after comparisons. The Qwen3 models are good candidates since they support both generation modes (with and without thinking): https://arxiv.org/abs/2505.09388
Solution/Feature
I think we could expose a --remove-thinking arg in the CLI (defaulting to True), along with an optional --think-tags arg (defaulting to [<think>, </think>]) that lets users target models like Magistral, which use their own think tags.
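Roughly, the CLI surface could look like this (argparse is only for illustration; the flag names are the ones proposed above, not existing options):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--remove-thinking",
    action=argparse.BooleanOptionalAction,  # also allows --no-remove-thinking
    default=True,
    help="Strip the <think>...</think> block from completions before grading.",
)
parser.add_argument(
    "--think-tags",
    nargs=2,
    metavar=("OPEN", "CLOSE"),
    default=["<think>", "</think>"],
    help="Opening/closing tags delimiting the chain of thought, "
         "for models that use their own tags.",
)

args = parser.parse_args()
if args.remove_thinking:
    open_tag, close_tag = args.think_tags
    # completion = strip_thinking(completion, open_tag, close_tag)
```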
Possible alternatives
The alternative is to store a hard-coded list of think tags and remove the thought blocks automatically, instead of exposing the tags via the CLI.
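A sketch of that alternative with a hard-coded registry (the tag pairs below are illustrative and would need to be verified per model):

```python
# Illustrative registry of think-tag pairs; not an existing structure.
DEFAULT_THINK_TAGS = [
    ("<think>", "</think>"),    # Qwen3-style tags
    ("[THINK]", "[/THINK]"),    # bracket-style tags used by some models
]

def strip_all_thinking(completion: str) -> str:
    # Reuses the strip_thinking helper sketched above.
    for open_tag, close_tag in DEFAULT_THINK_TAGS:
        completion = strip_thinking(completion, open_tag, close_tag)
    return completion
```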