Grigoriy Dobryakov

Howto · breakdown

Breakdown 10

Eval as a release criterion: harness vs demo-grade

Make AI output quality a release criterion, not a retrospective finding. The minimum eval at release has three layers.

CTO · Head of AI · Tech Lead

Problem

An AI feature that works in the demo breaks in production, unpredictably and without an obvious cause. Most teams find out after the release. The cause is a "demo-grade" eval: the eval that approved the feature used the same inputs as the pitch, a curated dataset under controlled conditions. The long tail of real traffic was never run.

Methodology

Gate every release on three eval layers and one named owner:

  1. Layer 1: Regression (non-negotiable). A fixed set of reference cases with known expected output: edge cases from past incidents, the demo cases that got the feature approved, and at least one adversarial input per output type (prompt injection, empty input, malformed input). Any new failure blocks the release.
    # cases/rag-stale-corpus.yaml
    id: rag-stale-corpus
    origin: incident-2026-03   # RAG hallucinated on a full corpus
    input:
      query: "current rate for tariff X"
    expected:
      must_contain:
        - "current document"
      must_not_contain:
        - "status: archived"
      retrieved_chunks:
        min_count: 1
        must_contain:
          - "active"
        must_not_contain:
          - "archived"
    pass_criteria:
      - no_archived_docs
      - answer_grounded

    The case originates from an incident (the origin field). That is what distinguishes a regression set from a demo dataset: its inputs come from what has already broken in production.

  2. Layer 2: Distribution check (RAG/classifiers). Run 20–50 fresh real inputs through the new version. The comparison is not against an "ideal" output but against the previous release snapshot: diff the distribution of output length and format, the change in retrieved chunks, and the confidence shift (a shift toward the extremes signals brittleness).
  3. Layer 3: Human spot-check. Required before the first production deploy of a new type of output. A domain expert reads 10–20 real outputs with specific questions in mind: does the output contain data the model shouldn't have access to, would an expert consider it correct, and does it show signs of adversarial exploitation?
  4. Owner. One person signs off. If you can't name the owner before release, that is already the first red flag.
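
A minimal sketch of how a Layer 1 case like rag-stale-corpus could be checked. It is an illustration, not the harness from the repository. The names run_case, check_text, and CaseResult are hypothetical, and the case is written as a Python dict to keep the example standard-library only. Two assumptions are baked in: rules are plain substring matches, and every retrieved chunk must satisfy the chunk rules. The named pass_criteria (no_archived_docs, answer_grounded) are not implemented here.

```python
from dataclasses import dataclass, field


@dataclass
class CaseResult:
    case_id: str
    failures: list[str] = field(default_factory=list)

    @property
    def passed(self) -> bool:
        return not self.failures


def check_text(label, text, rules, failures):
    """Apply must_contain / must_not_contain substring rules to one text."""
    for needle in rules.get("must_contain", []):
        if needle not in text:
            failures.append(f"{label}: missing {needle!r}")
    for needle in rules.get("must_not_contain", []):
        if needle in text:
            failures.append(f"{label}: forbidden {needle!r}")


def run_case(case, answer, chunks):
    """Check one regression case against an answer and its retrieved chunks."""
    result = CaseResult(case["id"])
    expected = case["expected"]
    check_text("answer", answer, expected, result.failures)

    chunk_rules = expected.get("retrieved_chunks", {})
    min_count = chunk_rules.get("min_count", 0)
    if len(chunks) < min_count:
        result.failures.append(
            f"retrieved_chunks: got {len(chunks)}, need at least {min_count}"
        )
    # Assumption: the chunk rules apply to every chunk, not to any one chunk.
    for i, chunk in enumerate(chunks):
        check_text(f"chunk[{i}]", chunk, chunk_rules, result.failures)
    return result


# The YAML case above, transcribed as a dict (input and origin omitted).
CASE = {
    "id": "rag-stale-corpus",
    "expected": {
        "must_contain": ["current document"],
        "must_not_contain": ["status: archived"],
        "retrieved_chunks": {
            "min_count": 1,
            "must_contain": ["active"],
            "must_not_contain": ["archived"],
        },
    },
}

# Made-up answers and chunks, for illustration only.
good = run_case(
    CASE,
    "Per the current document, the rate is 5%.",
    ["status: active; tariff X rate 5%"],
)
bad = run_case(
    CASE,
    "The rate is 4% (status: archived).",
    ["status: archived; tariff X rate 4%"],
)
print(good.passed, bad.failures)
```

In a release pipeline, any case whose failures list is non-empty would block the release, which matches the "any new failure blocks" rule of Layer 1.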

Artifact

Minimal harness for this checklist: a Python script that runs the regression set against a snapshot and compares distribution (Layer 1 + Layer 2).

github.com/dobryakov/eval-harness — regression + distribution check, adapter interface, demo case from a real incident.
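
The repository's code is not reproduced here. As an independent sketch of the Layer 2 diff, the snippet below compares two snapshots on output length, retrieved chunks, and confidence. Everything in it is an assumption made for illustration: the record fields (output, chunk_ids, confidence), the function names, the 0.1 / 0.9 cut-offs for "extreme" confidence, and the premise that the previous and new versions were both run on the same fresh inputs in the same order.

```python
from statistics import mean


def extreme_share(confidences, low=0.1, high=0.9):
    """Fraction of confidence scores at or beyond the assumed cut-offs."""
    return sum(c <= low or c >= high for c in confidences) / len(confidences)


def jaccard(a, b):
    """Overlap between two sets of retrieved chunk ids (1.0 = identical)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0


def distribution_diff(previous, current):
    """Diff two snapshots taken on the same inputs, in the same order.

    Assumes both snapshots are non-empty and outputs are non-empty strings.
    """
    prev_len = mean(len(r["output"]) for r in previous)
    curr_len = mean(len(r["output"]) for r in current)
    return {
        # > 1.0 means the new version writes longer outputs on average.
        "mean_length_ratio": curr_len / prev_len,
        # Low overlap means retrieval changed for the same inputs.
        "mean_chunk_overlap": mean(
            jaccard(p["chunk_ids"], c["chunk_ids"])
            for p, c in zip(previous, current)
        ),
        # Positive means more scores moved toward the extremes.
        "extreme_confidence_shift": (
            extreme_share([r["confidence"] for r in current])
            - extreme_share([r["confidence"] for r in previous])
        ),
    }


# Made-up records, for illustration only.
previous = [
    {"output": "abcd", "chunk_ids": [1, 2], "confidence": 0.6},
    {"output": "abcdef", "chunk_ids": [3], "confidence": 0.7},
]
current = [
    {"output": "abcdefgh", "chunk_ids": [1, 2], "confidence": 0.95},
    {"output": "abcdefghijkl", "chunk_ids": [4], "confidence": 0.5},
]
diff = distribution_diff(previous, current)
print(diff)
```

The output is a diff for a human to read, not a pass/fail verdict. Which thresholds should block a release depends on the feature and is left open here.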

Series signature

Where it breaks

For whom and why

Risk: eval green, prod red is the standard story. Teams find out after release, when the fix costs more than it would have before shipping.

Solution: a feature doesn't ship until all three eval layers pass. It ships because it meets a criterion, not because it "looks good."

Metric: when the board asks "what's our AI quality story," you don't answer "we tested it"; you show the regression results and the distribution diff.

Closest to a production playbook: a practice-level artifact for a Head of AI or a technical CTO. Vibe-coding the harness turns the methodology into a reproducible code artifact, so it closes the code gap and not just the playbook gap.
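
The "doesn't ship until the layers pass" rule can be written down as an explicit gate. The sketch below is one hypothetical way to do it: ReleaseEvidence, its fields, and release_blockers are illustrative names, not part of the published harness. It assumes the spot-check is only required when the release introduces a new type of output, as described for Layer 3.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReleaseEvidence:
    new_regression_failures: int      # Layer 1 result
    distribution_diff_reviewed: bool  # Layer 2 result
    new_output_type: bool             # whether Layer 3 is required
    spot_check_done: bool             # Layer 3 result
    owner: Optional[str]              # the one person who signs off


def release_blockers(ev):
    """Return the reasons a release is blocked; an empty list means ship."""
    blockers = []
    if not ev.owner:
        blockers.append("no named owner")
    if ev.new_regression_failures > 0:
        blockers.append(
            f"{ev.new_regression_failures} new regression failure(s)"
        )
    if not ev.distribution_diff_reviewed:
        blockers.append("distribution diff not reviewed")
    if ev.new_output_type and not ev.spot_check_done:
        blockers.append("human spot-check missing for new output type")
    return blockers


# Made-up evidence, for illustration only.
ready = ReleaseEvidence(0, True, True, True, "jane.doe")
blocked = ReleaseEvidence(2, True, True, False, None)
print(release_blockers(ready))
print(release_blockers(blocked))
```

Returning the list of blockers rather than a bare boolean keeps the answer to "why didn't this ship" in the same place as the decision.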

Want AI release quality as a first-class engineering criterion?

Structured eval before every release that changes model behavior: regression, distribution check, human spot-check, named owner.


Other breakdowns

An engineering breakdown series: real task → methodology → working artifact → honest breakdown of where it fails.
