Business Real-world Instruction Evaluation Framework
Real workplace writing never comes with a single instruction. It comes with many at once: include this, never mention that, stay under the word count, hit the tone, hold the structure. BRIEF Index measures how well 25 frontier and open models satisfy all of them simultaneously.
Every constraint is scored as a binary 0 or 1, but not every failure costs the same. Producing prohibited content in an enterprise context is the expensive mistake, so exclusion constraints carry twice the weight of every other constraint.
25 models, 150 tasks. Lower EVR is better.
Most instruction-following benchmarks reward a model for doing the one thing asked. In real enterprise work, the costly failures are different in kind.
Leaking a confidential figure, naming a competitor, making an unapproved commitment: a response can be fluent, helpful, and still unacceptable. BRIEF Index makes that distinction measurable by weighting exclusion failures twice as heavily as everything else.
The result reorders the field. The current leader is a 31B open model, and several compact systems outrank far larger frontier releases, because at the top, the difference is no longer fluency. It is discipline under competing constraints.
Six archetypes, five task types each. Executive communications is the hardest column across the board: its tight word counts and exclusion rules punish models that ramble.
Each task carries 4–6 constraints. Every constraint is graded as a binary 0 or 1, either by a deterministic Python validator (word counts, required sections, forbidden placeholders) or by an LLM judge pinned to Claude Sonnet 4.5 in strict mode. The binary grades are then combined into a weighted score, with exclusion constraints counting double.
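As a rough illustration of the grading and weighting described above, here is a minimal Python sketch. It is not the benchmark's actual code: the names `ConstraintResult`, `word_count_ok` and `weighted_score` are hypothetical, the whitespace-based word count is an assumed rule, and normalizing by total weight is an assumption. The only weighting stated in the text is that exclusion constraints count twice as much as the others.

```python
from dataclasses import dataclass

# Assumption: exclusion counts double, every other constraint type counts once.
WEIGHTS = {"exclusion": 2.0}
DEFAULT_WEIGHT = 1.0


@dataclass(frozen=True)
class ConstraintResult:
    kind: str     # "inclusion", "exclusion", "format" or "structure"
    passed: bool  # binary grade from a validator or the LLM judge


def word_count_ok(text: str, max_words: int) -> bool:
    """Deterministic format check: whitespace-delimited words within the cap."""
    return len(text.split()) <= max_words


def weighted_score(results: list[ConstraintResult]) -> float:
    """Share of the total constraint weight that the response earned."""
    total = sum(WEIGHTS.get(r.kind, DEFAULT_WEIGHT) for r in results)
    if total == 0:
        raise ValueError("a task needs at least one constraint")
    earned = sum(WEIGHTS.get(r.kind, DEFAULT_WEIGHT) for r in results if r.passed)
    return earned / total


# Hypothetical response that meets every constraint except the exclusion rule.
response = "Q3 revenue grew and the team shipped on time"
results = [
    ConstraintResult("inclusion", True),
    ConstraintResult("format", word_count_ok(response, max_words=20)),
    ConstraintResult("structure", True),
    ConstraintResult("exclusion", False),
]
print(weighted_score(results))
```

Under these assumptions, a single exclusion failure costs two fifths of the task score here, where a format or inclusion failure would cost one fifth.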
We report EVR alongside WCSS to isolate the failures that matter most in enterprise contexts: producing content that was explicitly prohibited.
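The page does not spell out how EVR is computed. The sketch below assumes it is the share of exclusion constraints that a model violated, which is consistent with "lower is better" but remains an assumption; the function name and the `(kind, passed)` input shape are hypothetical.

```python
def exclusion_violation_rate(results: list[tuple[str, bool]]) -> float:
    """Assumed definition: violated exclusion constraints / all exclusion constraints.

    `results` holds (constraint_kind, passed) pairs for one model.
    """
    exclusion_grades = [passed for kind, passed in results if kind == "exclusion"]
    if not exclusion_grades:
        raise ValueError("no exclusion constraints to evaluate")
    violations = sum(1 for passed in exclusion_grades if not passed)
    return violations / len(exclusion_grades)


# Hypothetical grades: one of two exclusion constraints was violated.
grades = [
    ("exclusion", True),
    ("exclusion", False),
    ("format", False),     # ignored: not an exclusion constraint
    ("inclusion", True),
]
print(exclusion_violation_rate(grades))
```

Failures on other constraint types do not move this number, which is what lets it isolate prohibited-content errors from the overall score.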
Averaged across all 25 models, inclusion and structure are nearly solved. Format is the universal weak spot: word limits, character caps and required layouts are missed far more often than any rule about what to say.
Satisfaction rate per constraint type, mean of all evaluated models.
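One plausible way to compute such a figure is to take each model's pass rate per constraint type and then average those rates across models. The sketch below does that. Whether the published numbers average per model first or pool all grades together is not stated, so the averaging order is an assumption, and `satisfaction_by_type` is a hypothetical name.

```python
from collections import defaultdict
from statistics import mean


def satisfaction_by_type(grades: dict[str, list[tuple[str, bool]]]) -> dict[str, float]:
    """Map each constraint type to the mean of the per-model pass rates.

    `grades` maps a model name to its (constraint_kind, passed) pairs.
    """
    per_type_rates: dict[str, list[float]] = defaultdict(list)
    for model_results in grades.values():
        passed: dict[str, int] = defaultdict(int)
        total: dict[str, int] = defaultdict(int)
        for kind, ok in model_results:
            total[kind] += 1
            passed[kind] += int(ok)
        for kind, count in total.items():
            per_type_rates[kind].append(passed[kind] / count)
    return {kind: mean(rates) for kind, rates in per_type_rates.items()}


# Hypothetical grades for two models.
example = {
    "model_a": [("format", True), ("format", False), ("inclusion", True)],
    "model_b": [("format", False), ("format", False), ("inclusion", True)],
}
print(satisfaction_by_type(example))
```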
6 job-function archetypes × 5 task types × 5 test cases = 150 tasks. Each carries at least one exclusion constraint and one deterministic format check.
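The grid arithmetic can be checked with a short sketch. The archetype and task-type names below are placeholders, since the page does not list all of them, and the `kind` and `grader` fields in `meets_minimums` are hypothetical.

```python
from itertools import product

# Placeholder labels: only the counts (6, 5, 5) come from the text.
ARCHETYPES = [f"archetype_{i}" for i in range(1, 7)]
TASK_TYPES = [f"task_type_{i}" for i in range(1, 6)]
TEST_CASES = range(1, 6)

tasks = [
    {"archetype": archetype, "task_type": task_type, "case": case}
    for archetype, task_type, case in product(ARCHETYPES, TASK_TYPES, TEST_CASES)
]


def meets_minimums(constraints: list[dict]) -> bool:
    """Check the stated floor: one exclusion constraint and one deterministic format check."""
    has_exclusion = any(c["kind"] == "exclusion" for c in constraints)
    has_deterministic_format = any(
        c["kind"] == "format" and c["grader"] == "validator" for c in constraints
    )
    return has_exclusion and has_deterministic_format


print(len(tasks))
```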