Issue encountered
Reasoning models like Qwen3 emit a long chain of thought enclosed in <think> and </think> tags before providing the final answer, as follows:
<|im_start|>user
{query} /think<|im_end|>
<|im_start|>assistant
<think>
{thinking content}
</think>
{response}<|im_end|>
The chain of thought often contains preliminary solutions, code snippets, etc. that should not be graded, since they can produce false positives/negatives.
We already remove the chain of thought for IFEval, but this logic should be applied consistently across all evals. In particular, if a completion fails to produce a closing </think> tag, it should be graded as incomplete (i.e. as an empty string).
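A minimal sketch of what the stripping logic could look like (the helper name and signature are assumptions, not existing code):

```python
def strip_thinking(
    completion: str,
    open_tag: str = "<think>",
    close_tag: str = "</think>",
) -> str:
    """Remove the chain-of-thought block before grading.

    A completion that opens a think block but never closes it is
    treated as incomplete and graded as an empty string.
    """
    if open_tag not in completion:
        return completion
    if close_tag not in completion:
        # Truncated chain of thought: grade as incomplete.
        return ""
    # Keep only what follows the closing tag, i.e. the final answer.
    return completion.split(close_tag, 1)[1].lstrip()
```

With something like this in place, graders would only ever see {response}, and truncated generations (no closing </think>) would score as empty.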
If we implement this feature, we should run careful before/after comparisons. The Qwen3 models are good candidates since they support both generation modes (with and without thinking): https://arxiv.org/abs/2505.09388
Solution/Feature
I think we could expose a --remove-thinking arg in the CLI (defaulting to True), along with an optional --think-tags arg (defaulting to [<think>, </think>]) that lets users target models like Magistral, which use their own think tags.
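Roughly, the CLI surface could look like this (argparse is only for illustration; the flag names are the ones proposed above, not existing options):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--remove-thinking",
    action=argparse.BooleanOptionalAction,  # also allows --no-remove-thinking
    default=True,
    help="Strip the <think>...</think> block from completions before grading.",
)
parser.add_argument(
    "--think-tags",
    nargs=2,
    metavar=("OPEN", "CLOSE"),
    default=["<think>", "</think>"],
    help="Opening/closing tags delimiting the chain of thought, "
         "for models that use their own tags.",
)

args = parser.parse_args()
if args.remove_thinking:
    open_tag, close_tag = args.think_tags
    # completion = strip_thinking(completion, open_tag, close_tag)
```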
Possible alternatives
The alternative is to store a hard-coded list of think tags and remove the thought blocks automatically, instead of exposing the tags via the CLI.
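A sketch of that alternative with a hard-coded registry (the tag pairs below are illustrative and would need to be verified per model):

```python
# Illustrative registry of think-tag pairs; not an existing structure.
DEFAULT_THINK_TAGS = [
    ("<think>", "</think>"),    # Qwen3-style tags
    ("[THINK]", "[/THINK]"),    # bracket-style tags used by some models
]

def strip_all_thinking(completion: str) -> str:
    # Reuses the strip_thinking helper sketched above.
    for open_tag, close_tag in DEFAULT_THINK_TAGS:
        completion = strip_thinking(completion, open_tag, close_tag)
    return completion
```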