Case 01 AI conversation design

The response guidelines that govern how an AI agent speaks and reasons.

The agent serves operations, maintenance, data investigation, and cross-system risk. Each of those user contexts places different demands on how the agent should speak and what it should reason about before answering. I authored the response guidelines that make the agent behave consistently across all four: how it speaks, how it formats, and how it reasons about its own data before opening its mouth.

Role: Author, sole content design owner
Partners: Engineering, AI research, product design
Surface: Foundational response guidelines
Status: In production

The agent reasons before it speaks. Part two rules the reasoning. Part one rules the speech.

The problem

The same question, asked twice, returned two different agents.

An AI agent serving operators, maintenance planners, and engineers in energy, manufacturing, and process industries was producing responses that were technically correct but inconsistent in voice, structure, and depth. One reply came back as a single line. The next, on the same query, came back as five paragraphs. Bold formatting landed on phrases that did not matter for decisions. Tables arrived as prose. The agent treated forty-seven-minute-old sensor readings as current. It guessed at safety thresholds when no threshold was in its context.

For these users, those are not edge cases. These are people running live equipment. They need an answer they can act on within seconds, in a format that matches the urgency of the situation, with explicit signals when the agent's data is uncertain or its context is incomplete. The agent was not producing that consistently.

Engineering had been refining model selection and retrieval. The remaining variance was a content design problem.

The instructions governing agent behavior needed to be authored with the same rigor as the model itself. Two layers needed work. How the agent reasons about the data and context it has before it responds. And how the agent shapes that response once the reasoning is done.

The approach

One instruction set. Two layers. Three gates.

The agent needed two kinds of rules. One for how it shapes a response. One for how it reasons about its data before responding. I authored both as a single connected guideline that became the agent's foundational instructions, with each part addressing a different layer of how the agent communicates.

Part one: Voice and formatting

Rules for how the agent speaks. The defaults are tight. Lead with the direct answer in the first sentence. Keep the initial response within four sentences. Offer more depth at the end. Use first-person active voice. Remove filler openers.
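Two of these defaults lend themselves to an automated check. The following is a minimal sketch, assuming a naive sentence splitter; the filler phrases, function names, and violation labels are illustrative assumptions and do not come from the guidelines themselves.

```python
import re

MAX_SENTENCES = 4  # "within four sentences", per the rule above

# Hypothetical filler openers; the guidelines do not enumerate specific phrases.
FILLER_OPENER = re.compile(r"^(sure|certainly|great question|of course)\b", re.IGNORECASE)


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def check_voice_defaults(response: str) -> list[str]:
    """Return the violated defaults; an empty list means the response passes."""
    violations = []
    sentences = split_sentences(response)
    if len(sentences) > MAX_SENTENCES:
        violations.append(f"too_long:{len(sentences)}")
    if FILLER_OPENER.match(response.strip()):
        violations.append("filler_opener")
    return violations
```

A two-sentence reply that leads with the answer returns an empty list. A reply that opens with "Sure," and runs to five sentences returns both violations.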

Formatting rules are equally specific. Markdown elements are assigned to information types and reserved for that use. The most consequential rule is the one on bold.

Rule, with rationale

Bold is for decision-critical information only. Equipment IDs the user has to act on, thresholds that have been crossed, timestamps that signal urgency. Anything else stays unbolded.

When bold is used for general emphasis, the visual signal flattens. In a high-velocity environment, the user skims past the critical data points if the entire sentence is competing for attention. Reserving bold for the equipment ID and the out-of-bounds value puts the user's eye on what they need to act on first.

Poor

**Check** the **C-19 compressor** first. The **temperature** is **15% above baseline** as of **04:12**.

Bold on every clause. The signal flattens. The reader has no anchor for where to look first.

Good

Check **C-19** first, then P-15. Temperature **15% above baseline** at **04:12**.

Bold reserved for the equipment ID, the threshold crossing, and the timestamp. The eye lands on what the operator needs to act on.
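The bold rule can be approximated as a lint over the agent's Markdown output. This is a rough sketch: the three patterns below are hypothetical stand-ins for "equipment ID", "threshold value", and "timestamp", and would need to match the real ID and unit conventions of the deployment.

```python
import re

BOLD_SPAN = re.compile(r"\*\*(.+?)\*\*")

# Hypothetical patterns for decision-critical content (assumptions, not the author's rules).
DECISION_CRITICAL = [
    re.compile(r"\b[A-Z]{1,3}-\d+\b"),   # equipment ID such as C-19
    re.compile(r"\d+(?:\.\d+)?\s*%"),    # percentage deviation such as 15%
    re.compile(r"\b\d{1,2}:\d{2}\b"),    # clock time such as 04:12
]


def unjustified_bold(markdown: str) -> list[str]:
    """Return bold spans that contain no decision-critical pattern."""
    spans = BOLD_SPAN.findall(markdown)
    return [s for s in spans if not any(p.search(s) for p in DECISION_CRITICAL)]
```

Run against the poor example above, the function returns `["Check", "temperature"]`. It only checks what kind of content is bolded. Whether a threshold has actually been crossed still has to come from the data.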

The rest of part one covers heading hierarchy, code blocks, blockquotes, lists versus prose, and the language register. Definitive verbs when the data supports them. Explicit hedge words when it does not. Together these rules turn agent output into something with predictable shape.

Part two: Reasoning and context

Rules for what the agent does before it formats anything. This part runs first. It is organized around three gates the agent passes through on every response involving operational data.

GATE 01

Data integrity and lineage

Is the data current, plausible, and verifiable? Freshness checks come first: flag any sensor reading older than fifteen minutes, identify pinned values as potential sensor errors, and state when units are unknown. Beyond those, the agent has to surface the lineage of what it cites: which sensor produced the reading, at what timestamp, from which log entry. Whether the agent is handling industrial sensors or a document database, the rule is the same. The user has to be able to trace any claim back to its source. Otherwise the response is a number without provenance.
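The checks in this gate could be expressed as a small function over a reading. The sketch below makes several assumptions that are not in the guidelines: the `Reading` fields, the definition of "pinned" as three or more identical recent samples, and the wording of the flags are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

STALE_AFTER = timedelta(minutes=15)  # freshness limit stated in the gate


@dataclass
class Reading:
    sensor_id: str
    value: float
    unit: Optional[str]   # None when the unit is not known
    timestamp: datetime
    log_entry: str        # hypothetical pointer to the source record


def integrity_flags(reading: Reading, now: datetime, recent_values: list[float]) -> list[str]:
    """Return the warnings the agent should state alongside the reading."""
    flags = []
    age = now - reading.timestamp
    if age > STALE_AFTER:
        minutes = int(age.total_seconds() // 60)
        flags.append(f"stale: reading is {minutes} min old")
    # Assumption: "pinned" means the last few samples are exactly identical.
    if len(recent_values) >= 3 and len(set(recent_values)) == 1:
        flags.append("pinned: value unchanged, possible sensor error")
    if reading.unit is None:
        flags.append("unit unknown")
    return flags


def lineage(reading: Reading) -> str:
    """Format the provenance of a reading: sensor, time, and source record."""
    return f"{reading.sensor_id} at {reading.timestamp:%H:%M} ({reading.log_entry})"
```

For a reading taken at 04:13 and checked at 05:00, the first flag reads `stale: reading is 47 min old`, and `lineage` gives the user the sensor, timestamp, and log entry to trace it back.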

GATE 02

Operational safety

Is the agent staying inside its scope? If a specific safety threshold is not present in context, state that the threshold is unknown. Never guess a threshold and present it as fact. Never provide regulatory compliance interpretations or final permit approvals.
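One way to make "never guess a threshold" hard to violate is to route every threshold statement through a lookup that has no fallback value. A minimal sketch, assuming thresholds arrive in context as a mapping keyed by equipment ID and metric (a hypothetical structure):

```python
from typing import Mapping


def threshold_statement(
    equipment_id: str,
    metric: str,
    thresholds: Mapping[tuple[str, str], float],
) -> str:
    """State a threshold only if it is present in context; otherwise say it is unknown."""
    limit = thresholds.get((equipment_id, metric))
    if limit is None:
        # No default and no estimate: the absence itself is what gets reported.
        return f"The {metric} threshold for {equipment_id} is unknown: it is not in my context."
    return f"The {metric} threshold for {equipment_id} is {limit}."
```

The design choice is the missing `else` branch with a default number. There is no code path that produces a threshold the context did not supply.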

GATE 03

Transparency and confidence

Is the agent honest about what it knows? Distinguish hard sensor data from soft log data in the wording of the response. If critical context is missing, state explicitly what is needed. Calibrate confidence to what the data actually supports.
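The hard-versus-soft distinction can be carried in the data model so that the wording follows from the source type. The sketch below is illustrative only: the `SourceKind` enum and the two sentence templates are assumptions, not the phrasing the guidelines prescribe.

```python
from enum import Enum


class SourceKind(Enum):
    SENSOR = "sensor"  # hard data: measured values
    LOG = "log"        # soft data: human-entered or free-text records


def phrase_claim(claim: str, kind: SourceKind) -> str:
    """Pick a definitive or hedged sentence frame based on where the claim came from."""
    if kind is SourceKind.SENSOR:
        return f"Sensor data shows {claim}."
    return f"The log reports {claim}, which is not confirmed by sensor data."
```

With this shape, a claim cannot be phrased without first declaring whether it came from a sensor or from a log.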

The gates run on every turn as part of the response pipeline. Together they produce an agent that says "I cannot give you that threshold because it is not in my context" rather than inventing a plausible number, and that flags a forty-seven-minute-old reading rather than presenting it as current.

Edge cases

The language of limitation.

The guidelines treat edge cases as first-class. Refusals, clarifications, multi-solution recommendations. The default response shape for each is named, with explicit examples. Below are poor and good responses for each of the three cases, in that order.

Poor

I don't know the answer to that question.

Closes the conversation. Leaves the user with nowhere to go.

Good

I can help you troubleshoot the compressor issue. Want to diagnose the current problem, review historical failure patterns, or plan preventive maintenance? Let me know which applies.

Names what the agent can do. Offers forward-pointing options the user can pick from.

Poor

Can you clarify what you mean?

Generic. Forces the user to do the framing work. Slows them down at a moment when they need momentum.

Good

Are you asking about (1) the current state of P-15, (2) the failure pattern from last week, or (3) the maintenance schedule for next month?

Numbered options scoped to what the user is likely asking. Quick to read, quick to answer.

Poor

There are several approaches. You could replace the filter, schedule maintenance, or wait and see. Each has tradeoffs.

Neutral menu. The agent refuses to take a position. The user is left to decide on their own.

Good

Two viable approaches. (1) Immediate replacement: eliminates risk, requires shutdown. (2) Scheduled maintenance: no downtime, but risk of failure before then. Given C-19 is critical to production, I recommend immediate replacement.

Options with pros and cons, then a committed recommendation grounded in context. The agent has a position.

What changed

From variance to a shape engineering can rely on.

Once the guidelines were in place, three concrete shifts followed in how the agent communicated and how the team could work on it.

Consistency

Responses gained a predictable shape. The same question now produces a response of the same structure, regardless of surface domain. The bold-on-bold-on-bold pattern that the agent had drifted toward stopped appearing.

Governance

Defensible behavior on the hardest cases. The agent flags stale data with the age and source of the reading, refuses to invent thresholds, and distinguishes hard from soft data in its wording. Each gate maps to a measurable behavior with explicit pass criteria. These are not aspirational guidelines but rules that can be tested against a golden-set conversation and scored on every change to the agent.
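Scoring against a golden set might look roughly like this. It is a sketch under stated assumptions: the `GoldenCase` structure, the string-matching pass criteria, and the scoring formula are hypothetical, and a production harness would likely use richer checks.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GoldenCase:
    name: str
    response: str                  # agent output captured for a fixed prompt
    passes: Callable[[str], bool]  # explicit pass criterion for one behavior


def score(cases: list[GoldenCase]) -> tuple[float, list[str]]:
    """Return the pass rate and the names of failing cases."""
    if not cases:
        raise ValueError("golden set is empty")
    failed = [c.name for c in cases if not c.passes(c.response)]
    return (len(cases) - len(failed)) / len(cases), failed
```

Re-running `score` after each change to the agent gives a pass rate plus the list of named regressions, so a drop points to the specific behavior that broke.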

Foundation

The instruction set became the system's spine. Every subsequent layer of agent behavior, from skill orchestration to content generation, extends from the voice and reasoning rules established here.

With the guidelines holding, the next question became how to know whether ongoing changes to them were improvements or regressions, and how to catch that distinction before it reached users.