Skip to content

[FT] Remove thinking from all evals #869

@lewtun

Description

@lewtun

Issue encountered

Reasoning models like Qwen3 emit a long chain of thought that is enclosed by <think> and </think> tags before providing the final answer as follows:

<|im start|>user
{query} /think<|im end|>
<|im start|>assistant
<think>
{thinking content}
</think>
{response}<|im end|>

The content of the chain of thought often contains preliminary solutions, code snippets etc that should not be graded as they can give false positives/negatives.

We already remove the chain of thought for IFEval, but this logic should be applied consistently to all evals. In particular, if a completion fails to produce a closing </think> tag, it should be graded as incomplete (i.e. empty string).

If we implement this feature, we should run some careful comparisons before/after. The Qwen3 models are good candidates as they allow both modes of generation: https://arxiv.org/abs/2505.09388

Solution/Feature

I think we could expose a --remove-thinking arg in the CLI (default to True), along with an optional --think-tags (default to [<think>,</think>]) that allows users to target models like Magistral which have their own think tags.

Possible alternatives

The alternative is to store a hard coded list of think tags and remove the thought blocks automatically instead of exposing them in the CLI.

Metadata

Metadata

Assignees

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions