Eval-suite design, golden-set curation, baseline tracking.
The discipline that turns "did you try it?" into "did the suite pass and did the baseline hold?" Every shipped AI feature gets its own eval suite, every suite has a documented baseline, and every PR runs the suites that matter. This is the workflow most teams reach for first because the value is immediate and the artifacts are concrete.
The Eval workflow is the procedure for giving every AI feature an automated, version-controlled eval suite that runs in CI and produces a baseline metric the team can reason about. It covers correctness and safety/refuse cases, and it ships failure-mode-specific evals (faithfulness, refusal-correctness, length-bound, citation) rather than a single accuracy number.
The five questions on the readiness self-assessment that score this dimension are the five rungs of the procedure above. Yes on a question means the artifact named in that step exists on disk in your repo today.
This page is a thin first cut. Full procedural documentation — including reference DeepEval suite scaffolds, golden-set curation rubrics, and the audit-evidence checklist — lands in Phase 2 of the Institute build-out.
The free readiness self-assessment scores the Eval workflow as one of six dimensions. Five minutes. Your weakest workflow is the one most worth fixing first.
Take the assessment →