Eval as a Release Criterion: Catching Agent Drift Before Production

Problem

An AI feature that works in the demo can fail in production in ways that are hard to predict or explain, and most teams find out only after the release. The usual cause is a "demo-grade" eval: the eval that approved the feature used the same inputs as the pitch, a curated dataset under controlled conditions. The long tail of real traffic was never run.

Methodology

Make AI output quality a release criterion, not a retrospective finding. The minimum eval at release has three layers.

Layer 1: Regression (non-negotiable). A fixed set of reference cases with known expected output: edge cases from past incidents, the demo cases that got the feature approved, and at least one adversarial input per output type (prompt injection, empty input, malformed input). Any new failure blocks the release.
```
# cases/rag-stale-corpus.yaml
id: rag-stale-corpus
origin: incident-2026-03   # RAG hallucinated on the full corpus
input:
  query: "current rate for tariff X"
expected:
  # Checks on the generated answer
  must_contain:
    - "current document"
  must_not_contain:
    - "status: archived"
  # Checks on the retrieval step
  retrieved_chunks:
    min_count: 1
    must_contain:
      - "active"
    must_not_contain:
      - "archived"
# Named checks that must all hold for the case to pass
pass_criteria:
  - no_archived_docs
  - answer_grounded
```
The case comes from an incident (the origin field). That is what distinguishes a regression set from a demo dataset: its inputs are taken from what has already broken in production.
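A minimal sketch of how a case like this could be evaluated, assuming the YAML has already been parsed into a dict and the system under test returns an answer string plus a list of retrieved chunk texts. The function names (`check_answer`, `check_chunks`, `check_case`) and the case-insensitive substring matching are illustrative assumptions, not the harness's actual interface.

```python
# Hypothetical in-memory form of cases/rag-stale-corpus.yaml (already parsed).
CASE = {
    "id": "rag-stale-corpus",
    "expected": {
        "must_contain": ["current document"],
        "must_not_contain": ["status: archived"],
        "retrieved_chunks": {
            "min_count": 1,
            "must_contain": ["active"],
            "must_not_contain": ["archived"],
        },
    },
}


def check_answer(answer, expected):
    """Return failures for the generated answer; an empty list means pass."""
    text = answer.lower()
    failures = [f"answer is missing {s!r}"
                for s in expected.get("must_contain", []) if s.lower() not in text]
    failures += [f"answer contains forbidden {s!r}"
                 for s in expected.get("must_not_contain", []) if s.lower() in text]
    return failures


def check_chunks(chunks, rules):
    """Return failures for the retrieval step.

    Assumption: must_contain needs at least one matching chunk,
    must_not_contain forbids a match in any chunk.
    """
    lowered = [c.lower() for c in chunks]
    failures = []
    min_count = rules.get("min_count", 0)
    if len(chunks) < min_count:
        failures.append(f"retrieved {len(chunks)} chunks, expected at least {min_count}")
    for s in rules.get("must_contain", []):
        if not any(s.lower() in c for c in lowered):
            failures.append(f"no retrieved chunk contains {s!r}")
    for s in rules.get("must_not_contain", []):
        if any(s.lower() in c for c in lowered):
            failures.append(f"a retrieved chunk contains forbidden {s!r}")
    return failures


def check_case(case, answer, chunks):
    """Run all checks for one regression case; any failure blocks the release."""
    expected = case["expected"]
    return check_answer(answer, expected) + check_chunks(
        chunks, expected.get("retrieved_chunks", {})
    )
```

Substring matching is deliberately naive (for example, "inactive" would satisfy "active"), so a real harness would likely match on structured chunk metadata instead.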
Layer 2: Distribution check (RAG and classifiers). Run 20–50 fresh real inputs through the new version. The comparison is not against an "ideal" output but a diff against the previous release snapshot: the distribution of output length and format, changes in retrieved chunks, and confidence shift (a shift toward the extremes signals brittleness).
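The diff against the previous snapshot could be sketched roughly as follows. The record shape (`output`, `chunk_ids`, `confidence`), the 0.1/0.9 cut-offs for "extreme" confidence, and the function names are all assumptions made for illustration; the three metrics map to the three signals named above.

```python
from statistics import mean, median


def extreme_share(confidences, low=0.1, high=0.9):
    """Fraction of confidence scores at or beyond the (assumed) extreme cut-offs."""
    if not confidences:
        return 0.0
    return sum(1 for c in confidences if c <= low or c >= high) / len(confidences)


def chunk_overlap(prev_ids, new_ids):
    """Jaccard overlap of retrieved chunk ids for the same input."""
    a, b = set(prev_ids), set(new_ids)
    if not a and not b:
        return 1.0  # nothing retrieved in either version counts as unchanged
    return len(a & b) / len(a | b)


def distribution_diff(prev, new):
    """Compare two runs over the same inputs, given in the same order.

    Each record is a dict with hypothetical keys:
    'output' (str), 'chunk_ids' (list of str), 'confidence' (float in [0, 1]).
    """
    if not prev or len(prev) != len(new):
        raise ValueError("prev and new must be non-empty and the same length")
    prev_len = median(len(r["output"]) for r in prev)
    new_len = median(len(r["output"]) for r in new)
    return {
        # Above 1.0 means outputs got longer; max() guards against division by zero.
        "median_length_ratio": new_len / max(prev_len, 1),
        # 1.0 means retrieval is unchanged; lower means different chunks.
        "mean_chunk_overlap": mean(
            chunk_overlap(p["chunk_ids"], n["chunk_ids"]) for p, n in zip(prev, new)
        ),
        # Positive means more scores moved toward 0 or 1.
        "extreme_confidence_delta": extreme_share([r["confidence"] for r in new])
        - extreme_share([r["confidence"] for r in prev]),
    }
```

The function only reports numbers; which deltas are acceptable is a threshold the team would need to fix before the run.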
Layer 3: Human spot-check. Required before the first production deploy of a new type of output. A domain expert reads 10–20 real outputs with specific questions in mind: does the output contain data the model should not have access to, would an expert consider it correct, and is there any sign of adversarial exploitation?
Owner. One person signs off. If you can't name the owner before the release, that is already the first red flag.
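Taken together, the three layers and the owner rule amount to a gate that either returns blockers or lets the release through. The sketch below is one possible encoding; the `ReleaseEvidence` fields and the `release_blockers` function are hypothetical names, not part of any existing tool.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ReleaseEvidence:
    """Hypothetical summary of what was collected before a release."""
    new_regression_failures: int      # Layer 1: failures not present in the last release
    distribution_diff_reviewed: bool  # Layer 2: diff vs previous snapshot was reviewed
    new_output_type: bool             # True on the first deploy of a new type of output
    spot_check_done: bool             # Layer 3: domain expert read real outputs
    owner: Optional[str]              # the one person who signs off


def release_blockers(evidence: ReleaseEvidence) -> List[str]:
    """Return the reasons the release is blocked; an empty list means it can ship."""
    blockers = []
    if evidence.new_regression_failures > 0:
        blockers.append(
            f"{evidence.new_regression_failures} new regression failure(s)"
        )
    if not evidence.distribution_diff_reviewed:
        blockers.append("distribution diff against previous snapshot not reviewed")
    if evidence.new_output_type and not evidence.spot_check_done:
        blockers.append("human spot-check missing for a new type of output")
    if not (evidence.owner or "").strip():
        blockers.append("no named owner for sign-off")
    return blockers
```

A CI job could call `release_blockers` and fail the pipeline whenever the returned list is non-empty.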

Artifact

Minimal harness for this checklist: a Python script that runs the regression set against a snapshot and compares distribution (Layer 1 + Layer 2).

github.com/dobryakov/eval-harness: regression and distribution checks, an adapter interface, and a demo case from a real incident.

Series signature

Where it breaks

Eval dataset = demo dataset. This is the most common anti-pattern: the "eval" turns out to be the stakeholder demo, not a measurement.
Pass criteria defined after the run: "looks reasonable" instead of a concrete, measurable threshold.
No owner: sign-off is delegated to "it looked fine in staging."
Real failure modes that produced this methodology: RAG that is correct on a curated corpus but hallucinates on the full one; a classifier that silently degrades after an upstream input schema change; AI-generated code that passes automated tests but introduces an insecure pattern the tests didn't cover.
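The first two anti-patterns can be guarded against mechanically. The sketch below assumes pass criteria are declared as a frozen object before the run, and measures how much of the eval set is just the demo set. `PassCriteria`, `demo_overlap`, `violations`, and the threshold values used with them are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)  # frozen: criteria cannot be mutated once the run starts
class PassCriteria:
    max_demo_overlap: float           # allowed share of eval inputs that are demo inputs
    max_new_regression_failures: int  # usually 0


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different inputs match."""
    return " ".join(text.lower().split())


def demo_overlap(eval_inputs: Sequence[str], demo_inputs: Sequence[str]) -> float:
    """Share of eval inputs that also appear in the demo set (0.0 to 1.0)."""
    if not eval_inputs:
        raise ValueError("eval set is empty")
    demo = {normalize(s) for s in demo_inputs}
    hits = sum(1 for s in eval_inputs if normalize(s) in demo)
    return hits / len(eval_inputs)


def violations(criteria: PassCriteria, eval_inputs: Sequence[str],
               demo_inputs: Sequence[str], new_failures: int) -> List[str]:
    """Compare results to the pre-declared criteria; an empty list means pass."""
    found = []
    overlap = demo_overlap(eval_inputs, demo_inputs)
    if overlap > criteria.max_demo_overlap:
        found.append(f"eval set is {overlap:.0%} demo inputs")
    if new_failures > criteria.max_new_regression_failures:
        found.append(f"{new_failures} new regression failure(s)")
    return found
```

Some overlap is expected, since the regression set intentionally includes the demo cases; the check only flags an eval set that is dominated by them.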

For whom and why

Risk: eval green, prod red. This is the standard story: teams find out after the release, and the fix costs more than it would have before it.

Solution: a feature doesn't ship until all three eval layers pass, not because it "looks good" but because it meets a criterion.

Metric: when the board asks "what's our AI quality story," you don't answer "we tested it"; you show the regression results and the distribution diff.

Closest to a production playbook: a practice-level artifact for a Head of AI or a technical CTO. Vibe-coding the harness turns the methodology into a reproducible code artifact, closing the code gap and not just the playbook gap.

Want AI release quality as a first-class engineering criterion?

Structured eval before every release that changes model behavior — regression, distribution check, human spot-check, named owner.

