Open LLM Benchmark · v1.0 · Updated Jun 2026

BRIEF Index

Business Real-world Instruction Evaluation Framework

Real workplace writing is never one instruction. It is many at once: include this, never mention that, stay under the word count, hit the tone, hold the structure. BRIEF Index measures how well 25 frontier and open models satisfy all of them simultaneously.

The metric

Weighted Constraint Satisfaction Score

WCSS = Σ(score × weight) / Σ(weight)

Every constraint is scored as a binary 0 or 1, but not every failure costs the same. Producing prohibited content is the expensive mistake in an enterprise context, so exclusion constraints carry double weight.

Inclusion ×1 · Exclusion ×2 · Format ×1 · Structure ×1 · Tone ×1
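
As a worked illustration of why the double weight matters, consider a hypothetical task with exactly one constraint of each type (this five-constraint mix is an assumption made for the example, not a case from the benchmark). Failing only the exclusion constraint costs twice as much as failing only the format constraint:

```latex
\[
\mathrm{WCSS}_{\text{exclusion failed}}
  = \frac{1\cdot 1 + 0\cdot 2 + 1\cdot 1 + 1\cdot 1 + 1\cdot 1}{1 + 2 + 1 + 1 + 1}
  = \frac{4}{6} \approx 66.7\%
\]
\[
\mathrm{WCSS}_{\text{format failed}}
  = \frac{1\cdot 1 + 1\cdot 2 + 0\cdot 1 + 1\cdot 1 + 1\cdot 1}{1 + 2 + 1 + 1 + 1}
  = \frac{5}{6} \approx 83.3\%
\]
```

The terms are ordered inclusion, exclusion, format, structure, tone, matching the weight list above.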
25 models evaluated · 6 job-function archetypes · 30 task types · 150 test cases

Leaderboard

Overall WCSS · all archetypes

25 models, 150 tasks. Lower EVR is better.

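| # | Model | Release | Developer | WCSS | EVR |
|---|---|---|---|---|---|
| 1 | Gemma 4 31B IT | Open | Google | 92.78 | 7.28 |
| 2 | Qwen 3.7 Max | Open | Alibaba | 91.71 | 3.97 |
| 3 | GPT-5.4 mini | Proprietary | OpenAI | 91.37 | 6.62 |
| 4 | Claude Haiku 4.5 | Proprietary | Anthropic | 90.40 | 7.95 |
| 5 | GLM 5.1 | Open | Zhipu | 90.24 | 5.96 |
| 6 | GPT-5.4 | Proprietary | OpenAI | 89.64 | 3.97 |
| 7 | Claude Opus 4.8 | Proprietary | Anthropic | 89.58 | 3.31 |
| 8 | GPT-5.5 Pro | Proprietary | OpenAI | 89.40 | 3.31 |
| 9 | Kimi K2.7 Code | Open | Moonshot | 88.36 | 4.64 |
| 10 | Llama 3.3 70B | Open | Meta | 87.38 | 7.95 |
| 11 | DeepSeek V4 Pro | Open | DeepSeek | 86.55 | 6.62 |
| 12 | GPT-5.4 nano | Proprietary | OpenAI | 85.79 | 5.30 |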
WCSS: weighted constraint satisfaction (higher is better) · EVR: exclusion violation rate (lower is better)
Why BRIEF Index

Most instruction-following benchmarks reward a model for doing the one thing asked. In real enterprise work, the costly failures are different in kind.

Leaking a confidential figure, naming a competitor, making an unapproved commitment: a response can be fluent, helpful, and still unacceptable. BRIEF Index makes that distinction measurable by weighting exclusion failures twice as heavily as everything else.

The result reorders the field. The current leader is a 31B open model, and several compact systems outrank far larger frontier releases, because at the top, the difference is no longer fluency. It is discipline under competing constraints.

Key takeaways
92.8 · Gemma 4 31B IT leads the index, a compact open model ahead of every frontier release.
×2 · Exclusion failures are weighted double. EVR ranges from 3.3% (Claude Opus 4.8, GPT-5.5 Pro) to 44% (Qwen 3.5 397B).
57% · Format adherence is the field-wide weak point, far below inclusion, structure and tone.

Performance by job function

WCSS %

Six archetypes, five task types each. Executive communications is the hardest column across the board: its tight word counts and exclusion rules punish models that ramble.

| Model | Overall | Product Manager | Project Coordinator | Executive Comms | HR Business Partner | Operations Analyst | Marketing Comms |
|---|---|---|---|---|---|---|---|
| Gemma 4 31B IT | 92.8 | 94.0 | 96.7 | 90.7 | 89.9 | 96.7 | 88.8 |
| Qwen 3.7 Max | 91.7 | 88.7 | 94.9 | 86.7 | 90.7 | 96.0 | 93.3 |
| GPT-5.4 mini | 91.4 | 90.9 | 92.0 | 90.0 | 93.3 | 91.7 | 90.3 |
| Claude Haiku 4.5 | 90.4 | 92.7 | 88.7 | 88.0 | 89.9 | 92.7 | 90.5 |
| GLM 5.1 | 90.2 | 90.0 | 93.9 | 84.0 | 90.0 | 92.0 | 91.6 |
| GPT-5.4 | 89.6 | 92.7 | 91.9 | 82.0 | 90.7 | 91.7 | 88.9 |
| Claude Opus 4.8 | 89.6 | 91.3 | 91.5 | 88.0 | 89.3 | 90.5 | 86.8 |
| GPT-5.5 Pro | 89.4 | 90.7 | 93.6 | 84.0 | 88.0 | 90.4 | 89.7 |
| Kimi K2.7 Code | 88.4 | 92.0 | 87.3 | 84.0 | 89.3 | 90.4 | 87.1 |
| Llama 3.3 70B | 87.4 | 88.7 | 90.0 | 81.3 | 87.3 | 90.0 | 86.9 |
| DeepSeek V4 Pro | 86.5 | 90.8 | 84.4 | 81.3 | 85.3 | 88.9 | 88.5 |
| GPT-5.4 nano | 85.8 | 84.8 | 89.6 | 80.0 | 89.3 | 83.3 | 87.7 |
| GPT-5.5 | 85.6 | 86.9 | 94.4 | 75.3 | 90.0 | 84.3 | 82.9 |
| Claude Sonnet 4.5 | 85.3 | 89.4 | 85.7 | 79.3 | 88.7 | 85.6 | 83.3 |
| Claude Opus 4.7 | 84.2 | 85.4 | 83.3 | 81.3 | 82.5 | 86.8 | 86.0 |
| GLM 5.2 | 84.0 | 84.9 | 95.9 | 60.7 | 86.7 | 91.3 | 84.4 |
| Llama 3 8B Instruct Lite | 82.8 | 77.4 | 79.7 | 83.3 | 81.3 | 87.6 | 87.6 |
| Claude Sonnet 4.6 | 82.0 | 81.6 | 84.9 | 77.9 | 80.5 | 80.9 | 86.3 |
| Qwen 3 235B Instruct | 81.7 | 79.3 | 78.8 | 78.7 | 85.3 | 84.9 | 83.3 |
| Qwen 2.5 7B Turbo | 81.7 | 82.9 | 83.5 | 74.7 | 80.7 | 87.7 | 80.7 |
| MiniMax M2.7 | 80.0 | 82.8 | 83.6 | 72.0 | 82.5 | 77.6 | 81.5 |
| MiniMax M3 | 79.4 | 80.8 | 80.0 | 72.0 | 81.9 | 77.9 | 83.7 |
| Kimi K2.6 | 73.2 | 82.7 | 71.6 | 62.7 | 77.3 | 79.7 | 65.2 |
| Nemotron 3 Ultra 550B | 67.7 | 67.3 | 51.5 | 40.7 | 75.3 | 85.5 | 86.1 |
| Qwen 3.5 397B | 52.7 | 52.2 | 62.0 | 42.0 | 47.1 | 62.8 | 50.0 |

Methodology

Scoring real work, not single instructions

The WCSS formula

Each task carries 4–6 constraints. Every constraint is graded a binary 0 or 1, either by a deterministic Python validator (word counts, required sections, forbidden placeholders) or by an LLM judge pinned to Claude Sonnet 4.5 in strict mode. Scores combine by weight:

// weighted satisfaction
WCSS = Σ(score × weight) / Σ(weight)

// highest-stakes failure mode
EVR  = violated exclusions / total exclusions × 100

We report EVR alongside WCSS to isolate the failures that matter most in enterprise contexts: producing content that was explicitly prohibited.
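
The deterministic validators mentioned above (word counts, required sections, forbidden placeholders) might look roughly like the sketch below. The function names, the placeholder pattern and the sample draft are all hypothetical; the benchmark's actual validators are not shown in this page.

```python
import re

# Hypothetical placeholder pattern: bracketed stubs such as [NAME], plus "TBD"
# and "lorem ipsum". The real benchmark's list of forbidden placeholders is unknown.
PLACEHOLDER_PATTERN = re.compile(r"\[[^\]]*\]|\bTBD\b|\blorem ipsum\b", re.IGNORECASE)


def check_word_limit(text: str, max_words: int) -> int:
    """Return 1 if the text has at most max_words whitespace-separated words, else 0."""
    return int(len(text.split()) <= max_words)


def check_required_sections(text: str, sections: list[str]) -> int:
    """Return 1 if every required section title starts some line (case-insensitive), else 0."""
    lines = [line.strip().lower() for line in text.splitlines()]
    return int(all(any(line.startswith(s.lower()) for line in lines) for s in sections))


def check_no_placeholders(text: str) -> int:
    """Return 1 if no forbidden placeholder appears in the text, else 0."""
    return int(PLACEHOLDER_PATTERN.search(text) is None)


# Illustrative draft, not a benchmark test case.
draft = "Summary\nLaunch is on track.\nRisks\nVendor delay of one week."
scores = {
    "word_limit": check_word_limit(draft, max_words=50),
    "sections": check_required_sections(draft, ["Summary", "Risks"]),
    "no_placeholders": check_no_placeholders(draft),
}
```

Each check returns a binary 0 or 1, so its result can feed straight into the weighted score without any further normalisation.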

| Constraint type | Weight |
|---|---|
| Inclusion | 1.0 |
| Exclusion | 2.0 |
| Format | 1.0 |
| Structure | 1.0 |
| Tone | 1.0 |
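
A minimal sketch of how the two metrics could be computed from graded constraints, using the weights in the table above. Several things here are assumptions: the `ConstraintResult` type and function names are hypothetical, scores are multiplied by 100 to match the percentages reported on the leaderboard, and EVR is taken over exclusion constraints only. Whether the benchmark pools constraints across tasks or averages per-task scores is not stated, so this version simply scores whatever list of results it is given.

```python
from dataclasses import dataclass

# Weights from the methodology table.
WEIGHTS = {"inclusion": 1.0, "exclusion": 2.0, "format": 1.0, "structure": 1.0, "tone": 1.0}


@dataclass(frozen=True)
class ConstraintResult:
    kind: str     # one of the keys in WEIGHTS
    passed: bool  # binary grade from a validator or the LLM judge


def wcss(results: list[ConstraintResult]) -> float:
    """Weighted constraint satisfaction as a percentage."""
    total_weight = sum(WEIGHTS[r.kind] for r in results)
    if total_weight == 0:
        raise ValueError("no constraints to score")
    earned = sum(WEIGHTS[r.kind] for r in results if r.passed)
    return 100.0 * earned / total_weight


def evr(results: list[ConstraintResult]) -> float:
    """Exclusion violation rate as a percentage (assumed denominator: exclusion constraints)."""
    exclusions = [r for r in results if r.kind == "exclusion"]
    if not exclusions:
        raise ValueError("no exclusion constraints to score")
    violated = sum(1 for r in exclusions if not r.passed)
    return 100.0 * violated / len(exclusions)


# Illustrative results: one of two exclusion constraints is violated.
results = [
    ConstraintResult("inclusion", True),
    ConstraintResult("exclusion", False),
    ConstraintResult("exclusion", True),
    ConstraintResult("format", True),
    ConstraintResult("tone", True),
]
print(round(wcss(results), 2), evr(results))  # 5 of 7 weight units earned; 1 of 2 exclusions violated
```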

Where models break

Averaged across all 25 models, inclusion and structure are nearly solved. Format is the universal weak spot: word limits, character caps and required layouts are missed far more often than any rule about what to say.

| Constraint type | Satisfaction rate |
|---|---|
| Inclusion | 94.4% |
| Structure | 94.0% |
| Exclusion | 89.3% |
| Tone | 87.6% |
| Format | 56.7% |

Satisfaction rate per constraint type, mean of all evaluated models.
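
One way to read "mean of all evaluated models" is as a macro-average: compute each model's pass rate per constraint type first, then average those rates across models. The sketch below follows that reading, which is an assumption; the record layout and the model names are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def satisfaction_rates(records):
    """records: iterable of (model, constraint_type, passed) tuples.

    Returns {constraint_type: mean over models of that model's pass rate, in percent}.
    """
    counts = defaultdict(lambda: [0, 0])  # (model, type) -> [passed, total]
    for model, kind, passed in records:
        counts[(model, kind)][0] += int(passed)
        counts[(model, kind)][1] += 1

    per_type = defaultdict(list)  # type -> list of per-model pass rates
    for (_model, kind), (ok, total) in counts.items():
        per_type[kind].append(100.0 * ok / total)

    return {kind: mean(rates) for kind, rates in per_type.items()}


# Illustrative data: model_a passes 1 of 2 format checks, model_b passes 4 of 4.
records = [
    ("model_a", "format", True),
    ("model_a", "format", False),
    ("model_b", "format", True),
    ("model_b", "format", True),
    ("model_b", "format", True),
    ("model_b", "format", True),
]
rates = satisfaction_rates(records)  # {"format": 75.0}
```

Under this reading every model counts equally, regardless of how many constraints of a given type it was graded on; pooling all six checks instead would give 5/6.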

The benchmark

6 job-function archetypes × 5 task types × 5 test cases = 150 tasks. Each carries at least one exclusion constraint and one deterministic format check.

Product Manager (product_manager)
Feature prioritization memo · PRD problem statement · stakeholder status email · roadmap change · sprint review

Project Coordinator (project_coordinator)
Weekly status report · risk log entry · meeting minutes · change request summary · post-mortem entry

Executive Comms (executive_comms)
Board update one-pager · executive briefing · all-hands talking points · decision memo · crisis comms draft

HR Business Partner (hr_business_partner)
Job description · performance review · policy update · PIP communication · onboarding note

Operations Analyst (operations_analyst)
Standard operating procedure · process gap analysis · vendor comparison · business requirements · incident post-mortem

Marketing Comms (marketing_comms)
Campaign brief · press release · multi-platform social posts · product launch email · media Q&A prep
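
The 6 × 5 × 5 grid described above can be enumerated with a short sketch. The archetype identifiers come from the list above; the `task_N/case_N` ID scheme is a hypothetical naming choice, not the benchmark's actual layout.

```python
from itertools import product

# Archetype identifiers as listed in the benchmark description.
ARCHETYPES = [
    "product_manager",
    "project_coordinator",
    "executive_comms",
    "hr_business_partner",
    "operations_analyst",
    "marketing_comms",
]
TASK_TYPES_PER_ARCHETYPE = 5
CASES_PER_TASK_TYPE = 5


def build_case_ids() -> list[str]:
    """Enumerate one ID per test case across archetypes, task types and cases."""
    task_numbers = range(1, TASK_TYPES_PER_ARCHETYPE + 1)
    case_numbers = range(1, CASES_PER_TASK_TYPE + 1)
    return [
        f"{archetype}/task_{t}/case_{c}"  # hypothetical ID format
        for archetype, t, c in product(ARCHETYPES, task_numbers, case_numbers)
    ]


case_ids = build_case_ids()
print(len(case_ids))  # 6 * 5 * 5 = 150
```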