
The Institute's rubric

The AI-QA Maturity Model.

Every team shipping AI features sits somewhere on this ladder, whether they've named it or not. This page names it. Five levels, six capability dimensions (one per Institute workflow), and the specific artifacts and behaviors you'd expect to find at each. The model is descriptive — what's actually true at each rung — not a wish list. Your team is at the level where the artifacts exist on disk, not the level where leadership wishes they were. The free readiness self-assessment scores you against this rubric in five minutes; the paid readiness audit anchors a Gloxx reviewer's judgment to the same rubric with documented evidence per question.

Six dimensions

What we measure across the model.

Maturity is multi-dimensional. A team can be Measured on eval coverage but Ad-hoc on drift monitoring — and that combination is the most common failure mode in AI-feature QA. The model scores six capability dimensions independently. Your overall level is the floor, not the ceiling. A minimal sketch of that scoring rule follows the list below.

  • 1. Eval coverage: Do AI features have version-controlled eval suites? Are correctness AND safety/refuse cases covered?
  • 2. Release gating: Do evals run in CI on every PR? Does a threshold breach block release?
  • 3. Drift monitoring: Are production inputs/outputs logged, sampled, and continuously evaluated? Do drift alerts page someone?
  • 4. Failure taxonomy: Is there a written list of known failure modes? Do incidents get tagged and added to eval suites?
  • 5. Feedback loops: Can users report bad AI outputs in-product? Does the report flow into a triage queue and onward into evals?
  • 6. Policy & refuse behavior: Is there a written refuse policy? Reflected in the system prompt? Tested by automated evals? Reviewed by legal?
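To make the floor rule concrete, here is a minimal sketch in Python, assuming each dimension gets an integer score from 1 (Ad-hoc) to 5 (Continuous). The dimension keys and example values are illustrative only, not the assessment's actual data model.

    # Minimal sketch: per-dimension scores, overall level = the floor (minimum).
    # Dimension keys and example values are illustrative only.

    DIMENSIONS = [
        "eval_coverage",
        "release_gating",
        "drift_monitoring",
        "failure_taxonomy",
        "feedback_loops",
        "policy_and_refuse_behavior",
    ]

    def overall_level(scores: dict[str, int]) -> int:
        """The overall maturity level is the minimum across all six dimensions."""
        missing = [d for d in DIMENSIONS if d not in scores]
        if missing:
            raise ValueError(f"unscored dimensions: {missing}")
        return min(scores[d] for d in DIMENSIONS)

    # A team that is Measured (3) almost everywhere but Ad-hoc (1) on drift monitoring:
    example = {
        "eval_coverage": 3,
        "release_gating": 3,
        "drift_monitoring": 1,
        "failure_taxonomy": 2,
        "feedback_loops": 2,
        "policy_and_refuse_behavior": 3,
    }
    print(overall_level(example))  # 1 -- the lagging dimension sets the level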

The five levels

L1

Ad-hoc

AI testing happens by accident. Engineers test their own prompts, mostly by trying them and looking at the output.

What's true here

  • Prompts ship without a written eval suite
  • "Did you try it?" is the gate
  • Bugs are discovered by users in production, not by tests
  • If eval cases exist, they live in a notebook or one-off script that nobody re-runs
  • Nobody on the team can tell you which AI features have eval coverage

Artifacts present

  • Possibly: a Notion page with example prompts
  • Possibly: a Slack channel for "weird outputs"
  • Otherwise: nothing version-controlled

Failure modes still possible

  • Silent regressions on every prompt edit
  • Hallucinations reach users with no detection
  • Model-provider degradation goes unnoticed for weeks
  • Refuse-list violations only surface via screenshots on Twitter

Trigger to L2

  • An incident severe enough that leadership asks "how do we know this won't happen again?"
  • A customer churn or compliance escalation tied to AI output
  • A board-level question about AI risk

L2

Reactive

Eval suites exist, but only for things that already broke. Coverage is a map of past pain.

What's true here

  • A tests/evals/ folder exists with sparse coverage
  • Cases were added after specific incidents
  • Tests run locally; not always in CI
  • "Did the eval pass on my machine?" is the gate
  • No regression coverage for cases that haven't broken yet

Artifacts present

  • Sparse tests/evals/ directory, organized around incidents (a minimal example follows this list)
  • Maybe a FAILURES.md file
  • One or two retrospective documents naming the worst incidents
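As a hedged illustration of what an incident-derived case tends to look like at this level, here is a minimal example. The file name, the call_model stub, and the incident itself are hypothetical placeholders, not a prescribed format.

    # tests/evals/test_incident_refund_hallucination.py  (hypothetical path and incident)
    # Typical L2 artifact: a single regression case written after a production failure.

    def call_model(prompt: str) -> str:
        # Placeholder for the real provider call; replace with your client code.
        return "I can't confirm a refund amount from here. Please check your order page."

    INCIDENT_PROMPT = "What refund am I owed for order 1234?"  # the prompt that broke in prod

    def test_no_invented_refund_amount():
        """The model must not invent a concrete dollar amount (the original failure)."""
        output = call_model(INCIDENT_PROMPT)
        assert "$" not in output, f"possible hallucinated amount: {output!r}"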

Failure modes still possible

  • Untested edge cases continue to leak to prod
  • No drift detection — model degradation invisible
  • New AI features ship with zero eval coverage
  • Eval coverage decays as incidents are forgotten

Trigger to L3

  • Leadership commits to a quarterly eval review cadence
  • Engineering accepts eval discipline as a first-class concern, not a fire drill
  • An eval coverage metric appears on a roadmap or scorecard

L3

Measured

Every AI feature has a dedicated, version-controlled eval suite with a baseline. CI runs them. Reviews happen on a cadence.

What's true here

  • Every shipped AI feature has its own eval suite
  • Suites cover correctness AND safety/refuse cases
  • Suites are versioned alongside the code they test
  • CI runs evals on every PR touching AI-feature code
  • Each suite has a documented baseline metric
  • A quarterly eval review meeting actually happens

Artifacts present

  • Per-feature eval suite in tests/evals/
  • Golden set per feature (50–200 curated cases)
  • Baseline metrics documented in PRD or wiki
  • CI workflow that runs evals; results visible on PRs (a minimal sketch follows this list)
  • Quarterly review document with last-quarter trend
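One plausible shape for that CI step, sketched below: a script that replays a feature's golden set, aggregates a score, and reports it against the documented baseline. The path, the grading function, and the baseline value are assumptions for illustration; at this level the comparison is advisory (enforcement arrives at L4).

    # run_evals.py -- hypothetical CI step for one feature: replay the golden set,
    # score it, and report the result against the documented baseline on the PR.
    import json
    from pathlib import Path

    GOLDEN_SET = Path("tests/evals/summarize/golden.jsonl")  # 50-200 curated cases
    BASELINE = 0.92  # documented baseline for this feature (assumed value)

    def score_case(case: dict) -> float:
        # Placeholder grader: substitute your real correctness / safety checks,
        # which would call the model on case["input"] and judge the output.
        return 1.0 if case.get("expected_substring", "") in case.get("model_output", "") else 0.0

    def main() -> None:
        cases = [json.loads(line) for line in GOLDEN_SET.read_text().splitlines() if line.strip()]
        aggregate = sum(score_case(c) for c in cases) / len(cases)
        print(f"summarize eval score: {aggregate:.3f} (baseline {BASELINE:.3f})")
        if aggregate < BASELINE:
            # At L3 this is advisory and visible on the PR; L4 turns it into a hard gate.
            print("warning: below documented baseline")

    if __name__ == "__main__":
        main()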

Failure modes still possible

  • Eval thresholds are advisory, not enforced — devs override silently
  • Production drift goes unnoticed because no production sampling
  • No failure-mode taxonomy: incidents don't get categorized
  • Refuse-list policy lives in tribal knowledge, not in code

Trigger to L4

  • A release where evals "passed" but production output was bad enough to escalate
  • Compliance, legal, or sales asks "can you prove the AI never does X?"
  • Drift incident that takes > 24h to detect

L4

Governed

Evals enforce policy. Releases block on threshold breach. Drift is monitored. Refuse policy is written, tested, and dated.

What's true here

  • Eval thresholds are documented, reviewed quarterly, and enforced by CI
  • Release blocks if eval scores drop below threshold
  • Refuse policy is written, system-prompted, and tested by automated evals
  • Production traffic is sampled for ongoing eval
  • Drift alerts page on-call (a minimal drift-check sketch follows this list)
  • Each production incident gets tagged with a failure mode and added to the eval suite
  • Override policy for failed gates is documented with audit trail
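A minimal sketch of the production-sampling and paging behavior described above. The sampling source, the grader, and the paging call are placeholders, and the drift margin is an assumed value.

    # drift_check.py -- hypothetical scheduled job: score a sample of recent
    # production outputs and page on-call if the rolling score drifts below baseline.
    import statistics

    BASELINE = 0.92      # the same documented baseline the CI gate uses (assumed value)
    DRIFT_MARGIN = 0.05  # how much degradation triggers a page (assumed value)

    def fetch_recent_outputs(n: int = 200) -> list[str]:
        # Placeholder: pull a sample of logged production outputs for the feature.
        return ["example output"] * n

    def score_output(output: str) -> float:
        # Placeholder grader: substitute your real faithfulness / refusal checks.
        return 1.0

    def page_oncall(message: str) -> None:
        # Placeholder: call your paging provider's API here.
        print(f"PAGE: {message}")

    def main() -> None:
        scores = [score_output(o) for o in fetch_recent_outputs()]
        rolling = statistics.mean(scores)
        if rolling < BASELINE - DRIFT_MARGIN:
            page_oncall(f"AI eval drift: rolling score {rolling:.3f} vs baseline {BASELINE:.3f}")

    if __name__ == "__main__":
        main()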

Artifacts present

  • Enforced threshold config (e.g. evals.config.yml with per-feature minimums; a gate sketch follows this list)
  • Drift monitoring dashboard with paging integration
  • Refuse-policy document, dated and versioned
  • Failure-mode taxonomy in the repo
  • Override log: who overrode what, when, why
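A sketch of what the enforced threshold config and its gate can look like. The config keys, feature names, and scores are assumptions, and overrides are deliberately left to the documented override policy rather than a code path.

    # gate_release.py -- hypothetical hard gate: read per-feature minimums from
    # evals.config.yml and block the release when any feature falls below its floor.
    #
    # Assumed config shape:
    #   features:
    #     summarize:        {min_score: 0.90}
    #     support_answers:  {min_score: 0.95}
    import sys
    from pathlib import Path

    import yaml  # PyYAML

    def gate(config_path: str, scores: dict[str, float]) -> int:
        config = yaml.safe_load(Path(config_path).read_text())
        breaches = []
        for feature, rules in config["features"].items():
            score = scores.get(feature)
            if score is None or score < rules["min_score"]:
                breaches.append((feature, score, rules["min_score"]))
        for feature, score, minimum in breaches:
            print(f"BLOCKED: {feature} scored {score} (minimum {minimum})")
        # Overrides are intentionally not a code path here: per the rubric they go
        # through a documented override policy with an audit trail.
        return 1 if breaches else 0

    if __name__ == "__main__":
        # Scores would come from the CI eval run; hard-coded here for illustration.
        sys.exit(gate("evals.config.yml", {"summarize": 0.93, "support_answers": 0.91}))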

Failure modes still possible

  • User-reported failures don't reliably reach the eval suite
  • Evals are dev-driven, not prod-driven — golden set goes stale
  • Eval improvements ship in batches, not continuously
  • Time-from-report-to-eval-coverage is unmeasured

Trigger to L5

  • Leadership commits to closed-loop continuous improvement as a measured SLO
  • Compliance review surfaces a need for documented response time
  • An incident slips through despite L4 controls because the eval suite was 4 weeks behind reality

L5

Continuous

Evals run continuously in production. User-reported failures flow back into eval suites in days, not quarters. The failure taxonomy is alive.

What's true here

  • Online evals on production traffic — not just sampled batches
  • In-product feedback widget routes directly into a triage queue
  • Triaged failures get added to eval suites in days
  • Time-from-report-to-eval-coverage is measured against an SLO
  • Eval suites evolve weekly, not quarterly
  • Refuse policy goes through legal/compliance review with documented date
  • Failure-mode-specific evals exist (not just general accuracy)

Artifacts present

  • Production eval pipeline (online metrics, not just CI)
  • In-product "report this output" widget
  • Triage queue with named owner
  • Time-to-coverage SLO documented and measured (a minimal sketch follows this list)
  • Dated compliance and legal reviews of refuse policy
  • Failure-mode-specific eval files (faithfulness, refusal-correctness, length-bound, source-citation, etc.)
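As an illustration of how the time-to-coverage SLO can be measured, a minimal sketch follows. The field names, the 7-day target, and the triage-queue shape are assumptions, not part of the rubric.

    # time_to_coverage.py -- hypothetical SLO check: days from a user report landing
    # in the triage queue to the commit that added eval coverage for it.
    from dataclasses import dataclass
    from datetime import datetime

    SLO_DAYS = 7.0  # assumed target; the rubric only requires that it be documented and measured

    @dataclass
    class TriagedReport:
        report_id: str
        reported_at: datetime
        eval_covered_at: datetime | None  # None = no eval case covers it yet

    def days_to_coverage(report: TriagedReport, now: datetime) -> float:
        """Elapsed days to coverage, or elapsed days so far if still uncovered."""
        end = report.eval_covered_at or now
        return (end - report.reported_at).total_seconds() / 86400

    def slo_breaches(reports: list[TriagedReport], now: datetime) -> list[TriagedReport]:
        """Reports whose time to eval coverage exceeds the SLO."""
        return [r for r in reports if days_to_coverage(r, now) > SLO_DAYS]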

Failure modes still possible

  • Unknown unknowns — failure modes nobody has named yet
  • Adversarial inputs designed to evade the taxonomy
  • Model-provider changes that violate prior assumptions
  • Cost-efficiency tradeoffs that re-open closed gaps

Beyond L5

  • The work stops being a maturity climb and becomes a discipline
  • The team is the model
  • External audit finds nothing the team hasn't already filed
  • You're now writing the next maturity model, not consuming this one

How to use this

Three rules of thumb.

Be honest, not aspirational. Your team is at the level where the artifacts exist on disk, not the level where leadership wishes they were. If you can't show the eval-results dashboard to a new engineer in their first week, you don't have it.

Score per dimension, not overall. Most teams have one dimension two levels behind the others. The fastest way to move up is to raise that lagging dimension, because it sets your floor. Take the self-assessment — it scores all six dimensions independently.

The trigger matters more than the level. Knowing you're at L2 is less useful than knowing what L3 looks like and what would force you to commit to it. The "trigger to next level" line on each level above is the real load-bearing detail.

Find out where your team actually sits.

The 30-question self-assessment scores all six dimensions and gives you a personalized progression map in return. Five minutes. No sales call required.