
The Institute's rubric

The AI-QA Maturity Model.

Every team shipping AI features sits somewhere on this ladder, whether they've named it or not. This page names it. Five levels, six capability dimensions (one per Institute workflow), and the specific artifacts and behaviors you'd expect to find at each. The model is descriptive — what's actually true at each rung — not a wish list. Your team is at the level where the artifacts exist on disk, not the level where leadership wishes they were. The free readiness self-assessment scores you against this rubric in five minutes; the paid readiness audit anchors a Gloxx reviewer's judgment to the same rubric with documented evidence per question.

Six dimensions

What we measure across the model.

Maturity is multi-dimensional. A team can be Measured on eval coverage but Ad-hoc on drift monitoring — and that combination is the most common failure mode in AI-feature QA. The model scores six capability dimensions independently. Your overall level is the floor, not the ceiling. A minimal sketch of that scoring rule follows the list below.

  • 1. Eval coverage: Do AI features have version-controlled eval suites? Are correctness AND safety/refuse cases covered?
  • 2. Release gating: Do evals run in CI on every PR? Does a threshold breach block release?
  • 3. Drift monitoring: Are production inputs/outputs logged, sampled, and continuously evaluated? Do drift alerts page someone?
  • 4. Failure taxonomy: Is there a written list of known failure modes? Do incidents get tagged and added to eval suites?
  • 5. Feedback loops: Can users report bad AI outputs in-product? Does the report flow into a triage queue and onward into evals?
  • 6. Policy & refuse behavior: Is there a written refuse policy? Reflected in the system prompt? Tested by automated evals? Reviewed by legal?
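To make the floor rule concrete, here is a minimal sketch in Python, assuming each dimension gets an integer score from 1 (Ad-hoc) to 5 (Continuous). The dimension keys and example values are illustrative only, not the assessment's actual data model.

    # Minimal sketch: per-dimension scores, overall level = the floor (minimum).
    # Dimension keys and example values are illustrative only.

    DIMENSIONS = [
        "eval_coverage",
        "release_gating",
        "drift_monitoring",
        "failure_taxonomy",
        "feedback_loops",
        "policy_and_refuse_behavior",
    ]

    def overall_level(scores: dict[str, int]) -> int:
        """The overall maturity level is the minimum across all six dimensions."""
        missing = [d for d in DIMENSIONS if d not in scores]
        if missing:
            raise ValueError(f"unscored dimensions: {missing}")
        return min(scores[d] for d in DIMENSIONS)

    # A team that is Measured (3) almost everywhere but Ad-hoc (1) on drift monitoring:
    example = {
        "eval_coverage": 3,
        "release_gating": 3,
        "drift_monitoring": 1,
        "failure_taxonomy": 2,
        "feedback_loops": 2,
        "policy_and_refuse_behavior": 3,
    }
    print(overall_level(example))  # 1 -- the lagging dimension sets the level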

The five levels

L1

Ad-hoc

AI testing happens by accident. Engineers test their own prompts, mostly by trying them and looking at the output.

What's true here

  • Prompts ship without a written eval suite
  • "Did you try it?" is the gate
  • Bugs are discovered by users in production, not by tests
  • If eval cases exist, they live in a notebook or one-off script that nobody re-runs
  • Nobody on the team can tell you which AI features have eval coverage

Artifacts present

  • Possibly: a Notion page with example prompts
  • Possibly: a Slack channel for "weird outputs"
  • Otherwise: nothing version-controlled

Failure modes still possible

  • Silent regressions on every prompt edit
  • Hallucinations reach users with no detection
  • Model-provider degradation goes unnoticed for weeks
  • Refuse-list violations only surface via screenshots on Twitter

Trigger to L2

  • An incident severe enough that leadership asks "how do we know this won't happen again?"
  • A customer churn or compliance escalation tied to AI output
  • A board-level question about AI risk

L2

Reactive

Eval suites exist, but only for things that already broke. Coverage is a map of past pain.

What's true here

  • A tests/evals/ folder exists with sparse coverage
  • Cases were added after specific incidents
  • Tests run locally; not always in CI
  • "Did the eval pass on my machine?" is the gate
  • No regression coverage for cases that haven't broken yet

Artifacts present

  • Sparse tests/evals/ directory, organized around incidents (a minimal example follows this list)
  • Maybe a FAILURES.md file
  • One or two retrospective documents naming the worst incidents
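As a hedged illustration of what an incident-derived case tends to look like at this level, here is a minimal example. The file name, the call_model stub, and the incident itself are hypothetical placeholders, not a prescribed format.

    # tests/evals/test_incident_refund_hallucination.py  (hypothetical path and incident)
    # Typical L2 artifact: a single regression case written after a production failure.

    def call_model(prompt: str) -> str:
        # Placeholder for the real provider call; replace with your client code.
        return "I can't confirm a refund amount from here. Please check your order page."

    INCIDENT_PROMPT = "What refund am I owed for order 1234?"  # the prompt that broke in prod

    def test_no_invented_refund_amount():
        """The model must not invent a concrete dollar amount (the original failure)."""
        output = call_model(INCIDENT_PROMPT)
        assert "$" not in output, f"possible hallucinated amount: {output!r}"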

Failure modes still possible

  • Untested edge cases continue to leak to prod
  • No drift detection — model degradation invisible
  • New AI features ship with zero eval coverage
  • Eval coverage decays as incidents are forgotten

Trigger to L3

  • Leadership commits to a quarterly eval review cadence
  • Engineering accepts eval discipline as a first-class concern, not a fire drill
  • An eval coverage metric appears on a roadmap or scorecard

L3

Measured

Every AI feature has a dedicated, version-controlled eval suite with a baseline. CI runs them. Reviews happen on a cadence.

What's true here

  • Every shipped AI feature has its own eval suite
  • Suites cover correctness AND safety/refuse cases
  • Suites are versioned alongside the code they test
  • CI runs evals on every PR touching AI-feature code
  • Each suite has a documented baseline metric
  • A quarterly eval review meeting actually happens

Artifacts present

  • Per-feature eval suite in tests/evals/
  • Golden set per feature (50–200 curated cases)
  • Baseline metrics documented in PRD or wiki
  • CI workflow that runs evals; results visible on PRs (a minimal sketch follows this list)
  • Quarterly review document with last-quarter trend
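One plausible shape for that CI step, sketched below: a script that replays a feature's golden set, aggregates a score, and reports it against the documented baseline. The path, the grading function, and the baseline value are assumptions for illustration; at this level the comparison is advisory (enforcement arrives at L4).

    # run_evals.py -- hypothetical CI step for one feature: replay the golden set,
    # score it, and report the result against the documented baseline on the PR.
    import json
    from pathlib import Path

    GOLDEN_SET = Path("tests/evals/summarize/golden.jsonl")  # 50-200 curated cases
    BASELINE = 0.92  # documented baseline for this feature (assumed value)

    def score_case(case: dict) -> float:
        # Placeholder grader: substitute your real correctness / safety checks,
        # which would call the model on case["input"] and judge the output.
        return 1.0 if case.get("expected_substring", "") in case.get("model_output", "") else 0.0

    def main() -> None:
        cases = [json.loads(line) for line in GOLDEN_SET.read_text().splitlines() if line.strip()]
        aggregate = sum(score_case(c) for c in cases) / len(cases)
        print(f"summarize eval score: {aggregate:.3f} (baseline {BASELINE:.3f})")
        if aggregate < BASELINE:
            # At L3 this is advisory and visible on the PR; L4 turns it into a hard gate.
            print("warning: below documented baseline")

    if __name__ == "__main__":
        main()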

Failure modes still possible

  • Eval thresholds are advisory, not enforced — devs override silently
  • Production drift goes unnoticed because no production sampling
  • No failure-mode taxonomy: incidents don't get categorized
  • Refuse-list policy lives in tribal knowledge, not in code

Trigger to L4

  • A release where evals "passed" but production output was bad enough to escalate
  • Compliance, legal, or sales asks "can you prove the AI never does X?"
  • Drift incident that takes > 24h to detect

L4

Governed

Evals enforce policy. Releases block on threshold breach. Drift is monitored. Refuse policy is written, tested, and dated.

What's true here

  • Eval thresholds are documented, reviewed quarterly, and enforced by CI
  • Release blocks if eval scores drop below threshold
  • Refuse policy is written, system-prompted, and tested by automated evals
  • Production traffic is sampled for ongoing eval
  • Drift alerts page on-call (a minimal drift-check sketch follows this list)
  • Each production incident gets tagged with a failure mode and added to the eval suite
  • Override policy for failed gates is documented with audit trail
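A minimal sketch of the production-sampling and paging behavior described above. The sampling source, the grader, and the paging call are placeholders, and the drift margin is an assumed value.

    # drift_check.py -- hypothetical scheduled job: score a sample of recent
    # production outputs and page on-call if the rolling score drifts below baseline.
    import statistics

    BASELINE = 0.92      # the same documented baseline the CI gate uses (assumed value)
    DRIFT_MARGIN = 0.05  # how much degradation triggers a page (assumed value)

    def fetch_recent_outputs(n: int = 200) -> list[str]:
        # Placeholder: pull a sample of logged production outputs for the feature.
        return ["example output"] * n

    def score_output(output: str) -> float:
        # Placeholder grader: substitute your real faithfulness / refusal checks.
        return 1.0

    def page_oncall(message: str) -> None:
        # Placeholder: call your paging provider's API here.
        print(f"PAGE: {message}")

    def main() -> None:
        scores = [score_output(o) for o in fetch_recent_outputs()]
        rolling = statistics.mean(scores)
        if rolling < BASELINE - DRIFT_MARGIN:
            page_oncall(f"AI eval drift: rolling score {rolling:.3f} vs baseline {BASELINE:.3f}")

    if __name__ == "__main__":
        main()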

Artifacts present

  • Enforced threshold config (e.g. evals.config.yml with per-feature minimums; a gate sketch follows this list)
  • Drift monitoring dashboard with paging integration
  • Refuse-policy document, dated and versioned
  • Failure-mode taxonomy in the repo
  • Override log: who overrode what, when, why
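A sketch of what the enforced threshold config and its gate can look like. The config keys, feature names, and scores are assumptions, and overrides are deliberately left to the documented override policy rather than a code path.

    # gate_release.py -- hypothetical hard gate: read per-feature minimums from
    # evals.config.yml and block the release when any feature falls below its floor.
    #
    # Assumed config shape:
    #   features:
    #     summarize:        {min_score: 0.90}
    #     support_answers:  {min_score: 0.95}
    import sys
    from pathlib import Path

    import yaml  # PyYAML

    def gate(config_path: str, scores: dict[str, float]) -> int:
        config = yaml.safe_load(Path(config_path).read_text())
        breaches = []
        for feature, rules in config["features"].items():
            score = scores.get(feature)
            if score is None or score < rules["min_score"]:
                breaches.append((feature, score, rules["min_score"]))
        for feature, score, minimum in breaches:
            print(f"BLOCKED: {feature} scored {score} (minimum {minimum})")
        # Overrides are intentionally not a code path here: per the rubric they go
        # through a documented override policy with an audit trail.
        return 1 if breaches else 0

    if __name__ == "__main__":
        # Scores would come from the CI eval run; hard-coded here for illustration.
        sys.exit(gate("evals.config.yml", {"summarize": 0.93, "support_answers": 0.91}))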

Failure modes still possible

  • User-reported failures don't reliably reach the eval suite
  • Evals are dev-driven, not prod-driven — golden set goes stale
  • Eval improvements ship in batches, not continuously
  • Time-from-report-to-eval-coverage is unmeasured

Trigger to L5

  • Leadership commits to closed-loop continuous improvement as a measured SLO
  • Compliance review surfaces a need for documented response time
  • An incident slips through despite L4 controls because the eval suite was 4 weeks behind reality

L5

Continuous

Evals run continuously in production. User-reported failures flow back into eval suites in days, not quarters. The failure taxonomy is alive.

What's true here

  • Online evals on production traffic — not just sampled batches
  • In-product feedback widget routes directly into a triage queue
  • Triaged failures get added to eval suites in days
  • Time-from-report-to-eval-coverage is measured against an SLO
  • Eval suites evolve weekly, not quarterly
  • Refuse policy goes through legal/compliance review with documented date
  • Failure-mode-specific evals exist (not just general accuracy)

Artifacts present

  • Production eval pipeline (online metrics, not just CI)
  • In-product "report this output" widget
  • Triage queue with named owner
  • Time-to-coverage SLO documented and measured (a minimal sketch follows this list)
  • Dated compliance and legal reviews of refuse policy
  • Failure-mode-specific eval files (faithfulness, refusal-correctness, length-bound, source-citation, etc.)
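As an illustration of how the time-to-coverage SLO can be measured, a minimal sketch follows. The field names, the 7-day target, and the triage-queue shape are assumptions, not part of the rubric.

    # time_to_coverage.py -- hypothetical SLO check: days from a user report landing
    # in the triage queue to the commit that added eval coverage for it.
    from dataclasses import dataclass
    from datetime import datetime

    SLO_DAYS = 7.0  # assumed target; the rubric only requires that it be documented and measured

    @dataclass
    class TriagedReport:
        report_id: str
        reported_at: datetime
        eval_covered_at: datetime | None  # None = no eval case covers it yet

    def days_to_coverage(report: TriagedReport, now: datetime) -> float:
        """Elapsed days to coverage, or elapsed days so far if still uncovered."""
        end = report.eval_covered_at or now
        return (end - report.reported_at).total_seconds() / 86400

    def slo_breaches(reports: list[TriagedReport], now: datetime) -> list[TriagedReport]:
        """Reports whose time to eval coverage exceeds the SLO."""
        return [r for r in reports if days_to_coverage(r, now) > SLO_DAYS]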

Failure modes still possible

  • Unknown unknowns — failure modes nobody has named yet
  • Adversarial inputs designed to evade the taxonomy
  • Model-provider changes that violate prior assumptions
  • Cost-efficiency tradeoffs that re-open closed gaps

Beyond L5

  • The work stops being a maturity climb and becomes a discipline
  • The team is the model
  • External audit finds nothing the team hasn't already filed
  • You're now writing the next maturity model, not consuming this one

How to use this

Three rules of thumb.

Be honest, not aspirational. Your team is at the level where the artifacts exist on disk, not the level where leadership wishes they were. If you can't show the eval-results dashboard to a new engineer in their first week, you don't have it.

Score per dimension, not overall. Most teams have one dimension two levels behind the others. The fastest way to move up is to raise that lagging dimension, because it sets your floor. Take the self-assessment — it scores all six dimensions independently.

The trigger matters more than the level. Knowing you're at L2 is less useful than knowing what L3 looks like and what would force you to commit to it. The "trigger to next level" line on each level above is the real load-bearing detail.

Find out where your team actually sits.

The 30-question self-assessment scores all six dimensions and gives you a personalized progression map in return. Five minutes. No sales call required.