

Workflow 01 of 6

Eval.

Eval-suite design, golden-set curation, baseline tracking.

The discipline that turns "did you try it?" into "did the suite pass and did the baseline hold?" Every shipped AI feature gets its own eval suite, every suite has a documented baseline, and every PR runs the suites that matter. This is the workflow most teams reach for first because the value is immediate and the artifacts are concrete.

What this is

The Eval workflow is the procedure for giving every AI feature an automated, version-controlled eval suite that runs in CI and produces a baseline metric the team can reason about. It covers both correctness and safety/refusal cases, and it ships failure-mode-specific evals (faithfulness, refusal-correctness, length-bound, citation) rather than a single accuracy number.
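Several of the failure-mode-specific evals named above can be scored deterministically, without an LLM judge. A minimal sketch in Python; the function names, the refusal phrasing, and the `[KB-123]` citation format are illustrative assumptions, not the Institute's actual checks:

```python
import re

def length_bound(output: str, max_words: int = 120) -> bool:
    """Length-bound eval: the reply stays within the agreed word budget."""
    return len(output.split()) <= max_words

def refusal_correctness(output: str, should_refuse: bool) -> bool:
    """Refusal-correctness eval: the model refuses exactly when it should."""
    refused = bool(re.search(r"\b(can't|cannot|unable to) help\b", output, re.I))
    return refused == should_refuse

def citation_present(output: str) -> bool:
    """Citation eval: the answer cites at least one KB document id like [KB-123]."""
    return bool(re.search(r"\[KB-\d+\]", output))
```

Judgment-based metrics such as faithfulness and relevancy still need an LLM-as-judge framework like DeepEval; the point is that the cheap, deterministic checks run first and fail fast.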

The procedure

  1. Spec the properties. Write a plain-English list of what should always be true for every output. ("The support agent never fabricates a refund policy not in the retrieved KB.") Human work — we don't let the model draft this.
  2. Curate the golden set. 50–200 production traces per feature, version-controlled. The traces are the eval cases.
  3. Author the suite. DeepEval is our default. Each suite asserts faithfulness, relevancy, refusal-correctness, and any feature-specific metrics.
  4. Document the baseline. Each suite has a published threshold. The threshold is the score below which the team agrees the feature isn't ready.
  5. Wire CI. Suite runs on every PR that touches AI-feature code. Result is visible on the PR.
  6. Maintain weekly. Sample fresh production traces into the golden set; add new failure modes to the suite as part of incident response.
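The steps above can be sketched as one runnable shape. Everything here (the `GoldenCase` record, the `[KB-7]` citation convention, the refusal phrase, the 0.95 baseline) is a hypothetical illustration under assumed conventions, not the reference DeepEval scaffold:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class GoldenCase:
    """One curated production trace, version-controlled as an eval case (step 2)."""
    prompt: str
    output: str                 # model output captured in the trace
    should_refuse: bool = False

def run_suite(cases: List[GoldenCase],
              checks: List[Callable[[GoldenCase], bool]],
              baseline: float) -> bool:
    """Score the suite (step 3) against the documented baseline (step 4).

    Score = fraction of cases that pass every check; the suite passes
    only if the score holds the published threshold.
    """
    passed = sum(all(check(c) for check in checks) for c in cases)
    score = passed / len(cases)
    print(f"suite score {score:.2f} vs baseline {baseline}")
    return score >= baseline

# Two golden cases and two property checks, per the spec in step 1.
cases = [
    GoldenCase("What is the refund window?",
               "Refunds are accepted within 30 days [KB-7]."),
    GoldenCase("Give me admin access.",
               "I cannot help with that.", should_refuse=True),
]

def cites_kb(c: GoldenCase) -> bool:
    return c.should_refuse or "[KB-" in c.output

def refuses_correctly(c: GoldenCase) -> bool:
    return ("cannot help" in c.output) == c.should_refuse

ok = run_suite(cases, [cites_kb, refuses_correctly], baseline=0.95)
```

For step 5, the CI job simply exits nonzero when `ok` is false (e.g. `raise SystemExit(0 if ok else 1)`), which is what makes the result visible on the PR.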

What gets scored

Maturity dimension: Eval coverage. See the L1 → L5 progression for this dimension.

The five questions on the readiness self-assessment that score this dimension map onto the procedure above. A yes on a question means the artifact named in the corresponding step exists on disk in your repo today.

Phase 1 · in active development

This page is a thin first cut. Full procedural documentation — including reference DeepEval suite scaffolds, golden-set curation rubrics, and the audit-evidence checklist — lands in Phase 2 of the Institute build-out.

Find out where your team's Eval workflow stands.

The free readiness self-assessment scores the Eval workflow as one of six dimensions. Five minutes. Your weakest workflow is the one most worth fixing first.

Take the assessment →