Open LLM Benchmark · v1.0 · Updated Jun 2026

BRIEF Index

Business Real-world Instruction Evaluation Framework

Real workplace writing is never one instruction. It is many at once: include this, never mention that, stay under the word count, hit the tone, hold the structure. BRIEF Index measures how well 25 frontier and open models satisfy all of them simultaneously.


The metric

Weighted Constraint Satisfaction Score

WCSS = Σ(score × weight) / Σ(weight)

Every constraint is scored as a binary 0 or 1, but not every failure costs the same. Producing prohibited content is the expensive mistake in an enterprise context, so exclusion constraints carry double weight.

Inclusion ×1 · Exclusion ×2 · Format ×1 · Structure ×1 · Tone ×1
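As a rough illustration of how the double weight changes a score, the sketch below applies the WCSS formula to two hypothetical responses, each missing exactly one of five constraints. The weights come from the list above; the function name `wcss`, the `(constraint_type, score)` data layout, and the sample responses are assumptions made for this example, not part of the published harness.

```python
# Weights as listed above; everything else in this sketch is illustrative.
WEIGHTS = {
    "inclusion": 1.0,
    "exclusion": 2.0,
    "format": 1.0,
    "structure": 1.0,
    "tone": 1.0,
}

def wcss(results):
    """Weighted constraint satisfaction for one response.

    results: list of (constraint_type, score) pairs, where score is 0 or 1.
    Returns a fraction in [0, 1]; multiply by 100 for a percentage.
    """
    total_weight = sum(WEIGHTS[ctype] for ctype, _ in results)
    earned = sum(WEIGHTS[ctype] * score for ctype, score in results)
    return earned / total_weight

# Two hypothetical responses, each failing a single constraint.
missed_format = [("inclusion", 1), ("exclusion", 1), ("format", 0),
                 ("structure", 1), ("tone", 1)]
missed_exclusion = [("inclusion", 1), ("exclusion", 0), ("format", 1),
                    ("structure", 1), ("tone", 1)]

print(round(wcss(missed_format), 3))     # 0.833
print(round(wcss(missed_exclusion), 3))  # 0.667
```

Both responses fail one constraint out of five, yet the exclusion failure costs twice as much of the total weight (2 of 6 rather than 1 of 6).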

Models evaluated: 25

Job-function archetypes: 6

Task types: 5 per archetype

Test cases: 150

Leaderboard

Overall WCSS · all archetypes

25 models, 150 tasks. Lower EVR is better.

#   Model             Release      Developer  WCSS (%)  EVR (%)
1   Gemma 4 31B IT    Open         Google     92.78     7.28
2   Qwen 3.7 Max      Open         Alibaba    91.71     3.97
3   GPT-5.4 mini      Proprietary  OpenAI     91.37     6.62
4   Claude Haiku 4.5  Proprietary  Anthropic  90.40     7.95
5   GLM 5.1           Open         Zhipu      90.24     5.96
6   GPT-5.4           Proprietary  OpenAI     89.64     3.97
7   Claude Opus 4.8   Proprietary  Anthropic  89.58     3.31
8   GPT-5.5 Pro       Proprietary  OpenAI     89.40     3.31
9   Kimi K2.7 Code    Open         Moonshot   88.36     4.64
10  Llama 3.3 70B     Open         Meta       87.38     7.95
11  DeepSeek V4 Pro   Open         DeepSeek   86.55     6.62
12  GPT-5.4 nano      Proprietary  OpenAI     85.79     5.30

WCSS: weighted constraint satisfaction (higher is better)
EVR: exclusion violation rate (lower is better)

Why BRIEF Index

Most instruction-following benchmarks reward a model for doing the one thing asked. In real enterprise work, the costly failures are different in kind.

Leaking a confidential figure, naming a competitor, making an unapproved commitment: a response can be fluent, helpful, and still unacceptable. BRIEF Index makes that distinction measurable by weighting exclusion failures twice as heavily as everything else.

The result reorders the field. The current leader is a 31B open model, and several compact systems outrank far larger frontier releases, because at the top, the difference is no longer fluency. It is discipline under competing constraints.

Key takeaways

92.8 · Gemma 4 31B IT leads the index, a compact open model ahead of every frontier release.

×2 · Exclusion failures are weighted double. EVR ranges from 3.3% (Claude Opus 4.8, GPT-5.5 Pro) to 44% (Qwen 3.5 397B).

57% · Format adherence is the field-wide weak point, far below inclusion, structure and tone.

Performance by job function

WCSS %

Six archetypes, five task types each. Executive communications is the hardest column across the board: its tight word counts and exclusion rules punish models that ramble.

Model                     Overall  Product Mgr  Project Coord  Exec Comms  HR Partner  Ops Analyst  Mktg Comms
Gemma 4 31B IT            92.8     94.0         96.7           90.7        89.9        96.7         88.8
Qwen 3.7 Max              91.7     88.7         94.9           86.7        90.7        96.0         93.3
GPT-5.4 mini              91.4     90.9         92.0           90.0        93.3        91.7         90.3
Claude Haiku 4.5          90.4     92.7         88.7           88.0        89.9        92.7         90.5
GLM 5.1                   90.2     90.0         93.9           84.0        90.0        92.0         91.6
GPT-5.4                   89.6     92.7         91.9           82.0        90.7        91.7         88.9
Claude Opus 4.8           89.6     91.3         91.5           88.0        89.3        90.5         86.8
GPT-5.5 Pro               89.4     90.7         93.6           84.0        88.0        90.4         89.7
Kimi K2.7 Code            88.4     92.0         87.3           84.0        89.3        90.4         87.1
Llama 3.3 70B             87.4     88.7         90.0           81.3        87.3        90.0         86.9
DeepSeek V4 Pro           86.5     90.8         84.4           81.3        85.3        88.9         88.5
GPT-5.4 nano              85.8     84.8         89.6           80.0        89.3        83.3         87.7
GPT-5.5                   85.6     86.9         94.4           75.3        90.0        84.3         82.9
Claude Sonnet 4.5         85.3     89.4         85.7           79.3        88.7        85.6         83.3
Claude Opus 4.7           84.2     85.4         83.3           81.3        82.5        86.8         86.0
GLM 5.2                   84.0     84.9         95.9           60.7        86.7        91.3         84.4
Llama 3 8B Instruct Lite  82.8     77.4         79.7           83.3        81.3        87.6         87.6
Claude Sonnet 4.6         82.0     81.6         84.9           77.9        80.5        80.9         86.3
Qwen 3 235B Instruct      81.7     79.3         78.8           78.7        85.3        84.9         83.3
Qwen 2.5 7B Turbo         81.7     82.9         83.5           74.7        80.7        87.7         80.7
MiniMax M2.7              80.0     82.8         83.6           72.0        82.5        77.6         81.5
MiniMax M3                79.4     80.8         80.0           72.0        81.9        77.9         83.7
Kimi K2.6                 73.2     82.7         71.6           62.7        77.3        79.7         65.2
Nemotron 3 Ultra 550B     67.7     67.3         51.5           40.7        75.3        85.5         86.1
Qwen 3.5 397B             52.7     52.2         62.0           42.0        47.1        62.8         50.0

Methodology

Scoring real work, not single instructions

The WCSS formula

Each task carries 4–6 constraints. Every constraint is graded a binary 0 or 1, either by a deterministic Python validator (word counts, required sections, forbidden placeholders) or by an LLM judge pinned to Claude Sonnet 4.5 in strict mode. Scores combine by weight:

// weighted satisfaction
WCSS = Σ(score × weight) / Σ(weight)

// highest-stakes failure mode
EVR  = violated exclusions / total exclusions × 100

We report EVR alongside WCSS to isolate the failures that matter most in enterprise contexts: producing content that was explicitly prohibited.
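A minimal sketch of the EVR calculation follows. It assumes the denominator is the number of exclusion constraints graded for a model, which is one reading of "total" in the formula; the function name `evr` and the `(constraint_type, score)` layout are hypothetical and chosen only for this example.

```python
def evr(graded):
    """Exclusion violation rate, as a percentage.

    graded: iterable of (constraint_type, score) pairs collected across
    all tasks for one model, where score is 0 or 1.
    Assumption: only exclusion constraints enter the denominator.
    """
    exclusion_scores = [score for ctype, score in graded if ctype == "exclusion"]
    if not exclusion_scores:
        raise ValueError("no exclusion constraints to evaluate")
    violated = sum(1 for score in exclusion_scores if score == 0)
    return violated / len(exclusion_scores) * 100

# Hypothetical run: 20 exclusion constraints with one violation,
# plus a failed format check that EVR ignores.
sample = [("exclusion", 1)] * 19 + [("exclusion", 0)] + [("format", 0)]
print(evr(sample))  # 5.0
```

Because non-exclusion failures never enter the calculation, a model can post a low EVR while still losing WCSS points on format or tone.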

Constraint type  Weight
Inclusion        1.0
Exclusion        2.0
Format           1.0
Structure        1.0
Tone             1.0
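The deterministic checks named in this section (word counts, required sections, forbidden placeholders) might look something like the sketch below. The function names, the placeholder pattern, and the sample draft are all assumptions made for illustration; the benchmark's actual validators are not shown in this document.

```python
import re

# Hypothetical placeholder pattern: bracketed fill-ins, "TBD", "lorem ipsum".
PLACEHOLDER = re.compile(r"\[[^\]]*\]|\bTBD\b|lorem ipsum", re.IGNORECASE)

def within_word_limit(text, max_words):
    """Return 1 if the whitespace-delimited word count is within the limit."""
    return int(len(text.split()) <= max_words)

def has_required_sections(text, sections):
    """Return 1 if every required heading appears on a line of its own."""
    headings = {line.strip().lower() for line in text.splitlines()}
    return int(all(section.lower() in headings for section in sections))

def free_of_placeholders(text):
    """Return 1 if no placeholder-like text is found."""
    return int(PLACEHOLDER.search(text) is None)

draft = "Summary\nLaunch moved to Q3.\n\nRisks\nVendor contract is [TBD].\n"

print(within_word_limit(draft, 50))                        # 1
print(has_required_sections(draft, ["Summary", "Risks"]))  # 1
print(free_of_placeholders(draft))                         # 0
```

Each check returns a binary 0 or 1, so its result can feed directly into the weighted sum alongside the LLM-judged constraints.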

Where models break

Averaged across all 25 models, inclusion and structure are nearly solved. Format is the universal weak spot: word limits, character caps and required layouts are missed far more often than any rule about what to say.

Inclusion  94.4%
Structure  94.0%
Exclusion  89.3%
Tone       87.6%
Format     56.7%

Satisfaction rate per constraint type, mean of all evaluated models.

The benchmark

6 job-function archetypes × 5 task types × 5 test cases = 150 tasks. Each carries at least one exclusion constraint and one deterministic format check.

product_manager

Product Manager

Feature prioritization memo · PRD problem statement · stakeholder status email · roadmap change · sprint review

project_coordinator

Project Coordinator

Weekly status report · risk log entry · meeting minutes · change request summary · post-mortem entry

executive_comms

Executive Comms

Board update one-pager · executive briefing · all-hands talking points · decision memo · crisis comms draft

hr_business_partner

HR Business Partner

Job description · performance review · policy update · PIP communication · onboarding note

operations_analyst

Operations Analyst

Standard operating procedure · process gap analysis · vendor comparison · business requirements · incident post-mortem

marketing_comms

Marketing Comms

Campaign brief · press release · multi-platform social posts · product launch email · media Q&A prep
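The 6 × 5 × 5 grid can be enumerated in a few lines, as sketched below. The archetype identifiers are the ones listed above; the `task`/`case` numbering and the ID format are hypothetical, chosen only to make the count concrete.

```python
from itertools import product

# Archetype identifiers as listed above.
ARCHETYPES = [
    "product_manager",
    "project_coordinator",
    "executive_comms",
    "hr_business_partner",
    "operations_analyst",
    "marketing_comms",
]
TASK_TYPES_PER_ARCHETYPE = 5
CASES_PER_TASK_TYPE = 5

# Hypothetical ID scheme: archetype/taskN/caseM.
task_ids = [
    f"{archetype}/task{task}/case{case}"
    for archetype, task, case in product(
        ARCHETYPES,
        range(1, TASK_TYPES_PER_ARCHETYPE + 1),
        range(1, CASES_PER_TASK_TYPE + 1),
    )
]

print(len(task_ids))  # 150
print(task_ids[0])    # product_manager/task1/case1
```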