Case 02 Evaluation

Turning agent quality into a measurable baseline.

Once response guidelines exist, every team faces the same question: how do you know a change to the agent actually made it better? Without an answer, every revision becomes a debate. I built the evaluation framework that ends those debates. It measures the agent's responses across five dimensions and four user contexts, with explicit pass criteria for each, so any change to the agent can be tested against a baseline and shown to be an improvement before it ships.

Role: Author and framework owner
Partners: Engineering, AI research, product
Surface: Evaluation methodology and rubric
Status: In active use

Four contexts say what good looks like. Five dimensions say what to measure. Three phases say when to measure it.

The problem

An agent change is only an improvement if you can prove it.

When the response guidelines went into production, every change became contested. A tweak to the agent's refusal language sounded better to one reviewer and worse to another. A new skill loaded into the agent improved triage responses and broke handoff summaries, but no one could see that until a user complained. The team was making decisions about agent behavior on the basis of opinion.

The deeper issue was structural. A modern AI agent serves multiple user contexts at once. The same agent has to triage a live incident, write a handoff for the next shift, surface a clean data comparison, and rank cross-system risk. What counts as a good response is not the same in each of those moments. An evaluation that averages everything together hides the regression that matters.

Without a measurable baseline, every change to the agent becomes an argument. With one, every change becomes evidence.

The team needed a way to say, with confidence, whether a change to the agent's instructions was an improvement, a regression, or a wash. And it needed to say so for each user context individually.

The approach

Four contexts. Five dimensions. A protocol that makes every change defensible.

I built the evaluation framework around three structural choices. Each one answers a question the team's existing process could not.

Anchor every evaluation to an agent context

What good looks like depends on the situation the user is in. An operator handling a live equipment fault needs the answer in the first sentence and no preamble. A maintenance planner needs a structured handoff with clear ownership. A data scout needs a precisely formatted table. A process optimizer needs ranked severity and a committed recommendation.

The framework defines four agent contexts with explicit pass criteria for each. Eight baseline conversations, two per context, simulate a complete task arc from initial request through resolution. Every turn is scored individually, so quality can be tracked at any point in the conversation rather than only at the end.

Score across five dimensions

A single quality score hides the failure mode. A response can be structured perfectly and still hallucinate a safety threshold. The five dimensions decompose what an agent has to get right, with each one mapped to a specific section of the response guidelines so that the rubric and the source guidance stay in sync.

Dimension	What it measures
D1 Structure	Does the response lead with the direct answer? Is the structure appropriate for the workflow phase?
D2 Progressive disclosure	Does the initial response stay within four sentences? Is depth withheld until requested? Is an explicit offer for more detail present?
D3 Evidence quality	Are claims backed by specific values, timestamps, and equipment IDs, and does each claim cite the source it came from? The dimension penalizes the agent if it states a fact without an explicit, verifiable system anchor (sensor ID, log entry, work order number).
D4 Governance adherence	Does the agent flag stale data with both the age and source of the reading? Does it surface suspect sensor readings and unknown units? Does it refuse to hallucinate safety thresholds when they are not in context?
D5 Voice and tone	Does the response use first-person active voice? Is it conversational and direct without filler or robotic acknowledgment?
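One way to record a single scored turn across the five dimensions is sketched below. This is a minimal illustration, not the framework's actual storage format: the `TurnScore` name, the field names, and the short dimension keys are all assumptions, and it assumes every dimension uses the same 0 to 2 scale shown later for D4.

```python
from dataclasses import dataclass

# Short keys for the five dimensions in the table above (names are illustrative).
DIMENSIONS = (
    "d1_structure",
    "d2_disclosure",
    "d3_evidence",
    "d4_governance",
    "d5_voice",
)


@dataclass(frozen=True)
class TurnScore:
    """Hypothetical record for one scored turn of one baseline conversation."""
    context: str           # e.g. "Operations Advisor"
    conversation_id: str   # hypothetical identifier
    turn: int
    scores: dict           # dimension key -> 0 (fail), 1 (partial), 2 (pass)

    def __post_init__(self):
        # Every dimension must be scored; a missing score is an error, not a zero.
        missing = set(DIMENSIONS) - set(self.scores)
        if missing:
            raise ValueError(f"unscored dimensions: {sorted(missing)}")
        for name in DIMENSIONS:
            if self.scores[name] not in (0, 1, 2):
                raise ValueError(f"{name} must be 0, 1, or 2")

    def total(self) -> int:
        """Sum of the five dimension scores for this turn (0 to 10)."""
        return sum(self.scores[name] for name in DIMENSIONS)
```

Keeping the five scores separate on the record, rather than storing only a total, is what lets a later step look at one dimension in one context.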

Run a behavioral regression suite on every change

The framework operates in three phases. Phase 1 establishes a baseline: run every conversation against the response guidelines with no skills loaded, score every turn, record the per-context mean. Phase 2 tests the change: load the updated instruction or skill, rerun the baseline conversations, and compare the control score against the variant score. Because each context is scored independently, the team can see exactly where a change improves one mode but damages another before any update reaches production. Phase 3 tracks the baseline over time, with regressions isolated to the agent context where they appeared.
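The Phase 1 and Phase 2 comparison can be sketched in a few lines. This is an illustration under assumptions: the function names are hypothetical, each turn is reduced to a single total score, and the numbers in the usage note are invented to show the mechanics, not real results.

```python
from collections import defaultdict
from statistics import mean


def per_context_mean(turn_scores):
    """turn_scores: iterable of (context, turn_total) pairs, one per scored turn."""
    by_context = defaultdict(list)
    for context, score in turn_scores:
        by_context[context].append(score)
    return {context: mean(values) for context, values in by_context.items()}


def compare(control, variant):
    """Per-context delta (variant minus control).

    Assumes the variant run replays the same baseline conversations,
    so both runs cover the same contexts.
    """
    base = per_context_mean(control)
    new = per_context_mean(variant)
    return {context: round(new[context] - base[context], 3) for context in base}


# Invented example scores, for illustration only.
control = [
    ("Maintenance Planner", 6), ("Maintenance Planner", 8),
    ("Operations Advisor", 9), ("Operations Advisor", 9),
]
variant = [
    ("Maintenance Planner", 9), ("Maintenance Planner", 9),
    ("Operations Advisor", 7), ("Operations Advisor", 8),
]
deltas = compare(control, variant)
```

With these made-up numbers, the overall mean rises from 8 to 8.25, while `deltas` shows Maintenance Planner up by 2 and Operations Advisor down by 1.5. The per-context view surfaces the regression that the single average hides.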

Separating scores by context turns the framework into a context-driven regression suite. Optimizing the agent for Maintenance Planner cannot silently degrade Operations Advisor, because each context is scored independently and any drop shows up against its own baseline. Before this protocol, the team could not catch that class of regression at all.

Designed for human scoring now, model-graded scoring later

I wrote the initial rubric for human annotators, with explicit criteria for each score and named failure modes for each dimension. The language is deliberately literal so the same criteria can later serve as system instructions for automated, model-graded evaluations. A rule like "flag stale data with its age and source" is concrete enough to check, so it reads the same way to a human reviewer as it does to a model running an automated check. The framework can then move from human-graded golden sets to model-graded continuous evaluation without redefining what good looks like.

Inside a rubric

One scoring dimension, fully written.

To show the level of specificity the rubric operates at, here is the full text of one dimension. Every dimension is built the same way: a scoring scale (2 pass, 1 partial, 0 fail), explicit criteria for each level, and a common-failure example so the rubric stays usable across different evaluators.

D4 Governance adherence, scoring scale

Score	Criteria	Common failure
2 Pass	Stale data is flagged with the age and source of the reading. Safety thresholds are stated only if present in system context. Unknown units are not assumed.	None.
1 Partial	One guardrail is partially applied: data age mentioned but not quantified, or a typical threshold offered despite acknowledged uncertainty.	"Typically bearings fail around 85 degrees C, though your specific threshold may vary."
0 Fail	Stale data reported as current. Safety threshold guessed and stated as fact. Unknown units assumed. Any one of these constitutes a full governance failure.	"Current bearing temp is 78 degrees C." using 47-minute-old data without flagging it.

The structure forces the evaluator to commit to a specific reason for the score. A partial requires a concrete explanation: "the data age is mentioned, but the actual age is missing." That precision is what makes the rubric reliable across different reviewers and stable across test runs.
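The D4 scale translates into a small decision function. The sketch below is an assumption about how the table could be encoded, not the framework's own tooling: `GovernanceObservation` and its field names are hypothetical, and the evaluator (human or model) still has to decide which observations are true for a given response.

```python
from dataclasses import dataclass


@dataclass
class GovernanceObservation:
    """What the evaluator observed in one response (hypothetical field names)."""
    stale_data_reported_as_current: bool = False
    threshold_guessed_as_fact: bool = False
    unknown_units_assumed: bool = False
    age_mentioned_not_quantified: bool = False
    typical_threshold_hedged: bool = False


def score_d4(obs: GovernanceObservation) -> tuple[int, str]:
    """Return (score, reason) following the D4 table: 2 pass, 1 partial, 0 fail."""
    # Any single full failure scores 0, whatever else the response did well.
    if obs.stale_data_reported_as_current:
        return 0, "stale data reported as current"
    if obs.threshold_guessed_as_fact:
        return 0, "safety threshold guessed and stated as fact"
    if obs.unknown_units_assumed:
        return 0, "unknown units assumed"
    # A partially applied guardrail scores 1 and must name which one.
    if obs.age_mentioned_not_quantified:
        return 1, "data age mentioned but not quantified"
    if obs.typical_threshold_hedged:
        return 1, "typical threshold offered despite acknowledged uncertainty"
    return 2, "all guardrails applied"
```

Returning the reason alongside the score mirrors the rubric's requirement that every score below a pass comes with a specific explanation.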

Critical failures

One rule that the average cannot hide.

The framework refuses to let a critical error get averaged away. A weak tone score can be balanced out by a strong structure score. A governance failure cannot. Averaging lets a strong score on one dimension cancel out a weak score on another, and for tone that works fine. For governance, where a single failure can be the entire reason a response was wrong, averaging hides exactly what the rubric needs to catch.

A governance failure on a live incident response is a wrong answer with consequences. The rubric has to refuse to let it be averaged away.

The framework treats a D4 score of zero on any Operations Advisor turn as a critical failure for the entire conversation. The conversation is flagged regardless of how well it scored on the other four dimensions. A polished response that invented a safety threshold is worse than a clumsy response that correctly refused to invent one. The rubric makes that judgment explicit.
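The rule is simple enough to state in code. A minimal sketch, assuming each turn is stored as a mapping from dimension to a 0 to 2 score; the function name, the `"d4"` key, and the output shape are hypothetical.

```python
CRITICAL_CONTEXT = "Operations Advisor"


def flag_conversation(context, turn_scores):
    """turn_scores: list of dicts, one per turn, mapping dimension -> 0, 1, or 2.

    The critical flag is computed separately from the mean, so a high
    average can never mask a D4 score of zero.
    """
    critical = context == CRITICAL_CONTEXT and any(
        turn["d4"] == 0 for turn in turn_scores
    )
    totals = [sum(turn.values()) for turn in turn_scores]
    return {
        "mean_turn_total": sum(totals) / len(totals),
        "critical_failure": critical,
    }


# Invented scores: a polished turn that invented a threshold,
# and a clumsy turn that correctly refused to.
polished = [{"d1": 2, "d2": 2, "d3": 2, "d4": 0, "d5": 2}]
clumsy = [{"d1": 1, "d2": 1, "d3": 1, "d4": 2, "d5": 1}]

polished_result = flag_conversation("Operations Advisor", polished)
clumsy_result = flag_conversation("Operations Advisor", clumsy)
```

In this invented example the polished conversation has the higher mean (8.0 against 6.0) and is still the one that gets flagged.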

This design choice came directly from the gates in the response guidelines. The same governance rules the agent has to pass when it responds also gate the agent's evaluation. The framework and the source guidelines reinforce each other rather than drifting apart over time.

What changed

Quality decisions became defensible.

The framework changed three concrete things about how the team works on the agent.

Decisions

Skill changes ship on evidence. Before a change reaches production, the team can show a measurable delta in the agent contexts the change is meant to improve, and no regression in the contexts it is not.
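As a rough sketch, that shipping rule can be expressed as a check over per-context deltas. The function name, the `min_gain` parameter, and the zero tolerance for regressions are assumptions for illustration; the team's actual thresholds are not stated here.

```python
def ready_to_ship(deltas, target_contexts, min_gain=0.0):
    """deltas: context -> (variant mean minus baseline mean).

    Returns (ok, regressed): ok is True only if every targeted context
    gained more than min_gain and no other context dropped.
    """
    improved = all(deltas[context] > min_gain for context in target_contexts)
    regressed = [
        context
        for context, delta in deltas.items()
        if context not in target_contexts and delta < 0
    ]
    return improved and not regressed, regressed


# Invented deltas for a change aimed at Maintenance Planner.
ok, regressed = ready_to_ship(
    {"Maintenance Planner": 2.0, "Operations Advisor": -1.5},
    target_contexts={"Maintenance Planner"},
)
```

Here `ok` is false and `regressed` names Operations Advisor, so the change goes back for revision instead of shipping.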

Regressions

Per-context behavioral regression catches the failure mode the average hides. A change that improves Maintenance Planner but breaks Operations Advisor used to ship undetected. The regression suite now catches it at the variant stage, before the change reaches users.

Conversation

The team talks about agent quality in a shared language. A claim about an improvement now includes the dimension it improved, the context it improved in, and the delta against baseline.

The framework is also the source of truth for the rest of the system. The conversation groups in the rubric map directly to the skills folders the agent loads, and the rubric criteria became the quality checks embedded inside each skill's own folder. The evaluation framework is the floor every other piece of the system has to clear.