Framework · Inaugural essay

Introducing the AI-QA Maturity Model.

Why we built a five-level rubric for testing AI features — and what each rung looks like in practice.

Every team shipping AI features is somewhere on a maturity ladder, whether they've named it or not. Most teams, in our experience auditing release paths over the last two years, are at Level 1 or 2 — testing prompts by trying them, adding eval cases only after a customer escalation, and shipping AI features whose only QA is "did the engineer who wrote it look at the output?"

That's not a moral failing. It's a recognizable stage. The problem is that most teams don't know they're there because nobody's published a vocabulary for naming the rungs.

So we did. The AI-QA Maturity Model v1.0 is a five-level descriptive rubric for how mature an organization's AI-feature QA practice is — Ad-hoc, Reactive, Measured, Governed, Continuous — with concrete, observable signals at each rung. It scores six capability dimensions independently because the most common failure mode in AI-feature QA is imbalanced maturity: a team that's Measured on eval coverage but Ad-hoc on drift monitoring, hemorrhaging quality silently in production while feeling rigorous in CI.

This essay is the first published artifact in the Institute Journal. It introduces the model, explains the design constraints, and walks through why the rubric is descriptive — not aspirational — and what that distinction costs you if you get it wrong.

The problem the model solves

We've audited the AI-QA practice of around two dozen software teams over the last 18 months. The pattern is remarkably consistent. Asked "how do you test your AI features?", the answers fall into one of three buckets:

  • "We don't, really." — The honest version. Engineers test their own prompts. Bugs are reported by users. Nobody on the team can tell you which features have eval coverage. This is L1, Ad-hoc.
  • "We have evals." — There's a tests/evals/ directory. The contents are sparse. The cases were added after specific incidents. The tests run sometimes — locally on a developer's machine, not in CI on every PR. This is L2, Reactive.
  • "We have a mature AI-QA practice." — Sometimes true. More often, the team has invested in one dimension (say, a comprehensive eval suite) and zero in others (no drift monitoring, no refuse policy, no time-to-coverage SLO). They're convinced they're at L4. They're at L2 with a strong eval suite. This is the most common — and most consequential — confusion.

That third bucket is why the model matters. Without a vocabulary that scores capability dimensions independently, leadership over-credits the dimension the team has invested in and under-credits the dimensions they haven't. Maturity feels like a single number; it isn't.

Your team is at the level where the artifacts exist on disk, not the level where leadership wishes they were.

Five levels, six dimensions

The Maturity Model has five vertical levels and six horizontal dimensions. The levels are descriptive labels — what's actually true at this stage — not aspirational targets. The dimensions are independently scored, and the overall level is the floor across dimensions, not the average.

Briefly, each level:

  • L1, Ad-hoc. AI testing happens by accident. Engineers test their own prompts, mostly by trying them. If eval cases exist, they live in a notebook nobody re-runs. Bugs are discovered by users in production.
  • L2, Reactive. Eval suites exist, but only for things that already broke. A tests/evals/ folder exists with sparse, incident-shaped coverage. New AI features ship with zero eval coverage. Coverage decays as incidents are forgotten.
  • L3, Measured. Every shipped AI feature has its own version-controlled eval suite with a documented baseline. CI runs evals on every PR. A quarterly eval review actually happens. But thresholds are advisory, drift is invisible, and refuse policy lives in tribal knowledge.
  • L4, Governed. Evals enforce policy. Releases block on threshold breach (a minimal gate sketch follows this list). Drift is monitored and pages on-call. Refuse policy is written, system-prompted, and tested. Each incident gets tagged with a failure mode and added to the eval suite. Override policy has an audit trail.
  • L5, Continuous. Evals run continuously in production. User-reported failures flow back into eval suites in days, not quarters. The failure taxonomy is alive. Time-from-report-to-eval-coverage is measured against an SLO. Refuse policy goes through dated legal review.
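
What "releases block on threshold breach" means in practice is concrete enough to sketch. The following is a minimal, hypothetical CI gate, not a prescribed implementation: the file paths, JSON shapes, and thresholds are invented for illustration. The step loads eval results produced earlier in the pipeline, compares them against a version-controlled baseline, and exits nonzero so the release cannot proceed.

```python
# Hypothetical CI gate: fail the pipeline when eval results breach the baseline.
# Paths, field names, and thresholds are illustrative assumptions, not a prescribed layout.
import json
import sys

BASELINE_PATH = "tests/evals/baseline.json"   # version-controlled, reviewed in PRs
RESULTS_PATH = "eval_results/latest.json"     # produced by the eval run earlier in CI

def main() -> int:
    with open(BASELINE_PATH) as f:
        baseline = json.load(f)    # e.g. {"summarize_ticket": {"min_pass_rate": 0.92}, ...}
    with open(RESULTS_PATH) as f:
        results = json.load(f)     # e.g. {"summarize_ticket": {"pass_rate": 0.88}, ...}

    breaches = []
    for feature, policy in baseline.items():
        observed = results.get(feature, {}).get("pass_rate", 0.0)
        if observed < policy["min_pass_rate"]:
            breaches.append(f"{feature}: {observed:.2f} < {policy['min_pass_rate']:.2f}")

    if breaches:
        print("Eval gate breached; blocking release:")
        for line in breaches:
            print("  " + line)
        return 1                   # nonzero exit stops the pipeline

    print("Eval gate passed.")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

The script is not the point. The point is that the threshold lives in version control, the comparison runs on every release candidate, and the failure is a hard stop rather than an advisory comment; any override goes through a separate, audited path.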

The six dimensions — eval coverage, release gating, drift monitoring, failure taxonomy, feedback loops, and policy & refuse behavior — are scored independently. Each dimension maps 1:1 to one of the six Institute Workflows. So when the rubric says you're at L3 on Eval Coverage but L1 on Drift Monitoring, that maps to two specific procedures with two specific sets of artifacts you'd need to produce to move up.
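
Because the overall level is the floor, the arithmetic of scoring is trivial; the discipline is in assigning the per-dimension scores honestly. A small sketch, using the six dimensions above with scores that are purely illustrative:

```python
# Hypothetical scoring helper: overall maturity is the floor across dimensions, not the average.
# Dimension names follow the rubric; the scores here are made up for illustration.
from statistics import mean

scores = {
    "eval_coverage": 4,
    "release_gating": 3,
    "drift_monitoring": 1,
    "failure_taxonomy": 2,
    "feedback_loops": 2,
    "policy_and_refuse_behavior": 1,
}

overall = min(scores.values())      # the floor: L1 for this team
average = mean(scores.values())     # the flattering number teams tend to quote
lagging = [d for d, s in scores.items() if s == overall]

print(f"Overall level: L{overall} (the average, {average:.1f}, would have flattered you)")
print("Drag these up first:", ", ".join(lagging))
```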

Why the model is descriptive, not aspirational

This is the design choice that took the longest to land, and the one that costs most when you get it wrong.

An aspirational maturity model says "L4 is where mature teams operate." It implies a target. Leadership reads it, picks L4, and tells the team to get there. The team writes a roadmap. Six months later, the roadmap is half-shipped and someone declares L4-ish.

A descriptive maturity model says "if your team has X, Y, and Z artifacts on disk and runs them on this cadence, you are at L4. If not, you are not." There is no roadmap aspiration. There is a check: do the artifacts exist? Are they versioned? Do they actually run? Did the last quarterly review happen, and is its diff documented?

The descriptive framing is uncomfortable because it forecloses a popular kind of ambiguity. You can't be "L4-ish." The artifacts either exist and are dated, or they don't. Whether they're "good enough" is a separate question; the level question is binary per dimension.

If you can't show the eval-results dashboard to a new engineer in their first week, you don't have it.
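
Descriptive scoring reduces, in principle, to checks you could automate. Here is a rough sketch of what "do the artifacts exist, are they versioned, did the last review happen" looks like as code; every path, file name, and the 90-day cadence below is an assumption for illustration, not part of the rubric:

```python
# Hypothetical artifact-presence checks for one dimension of a descriptive assessment.
# Repository layout, file names, and the review cadence are invented for this sketch.
from datetime import datetime, timedelta
from pathlib import Path

REPO = Path(".")

def check(name: str, ok: bool) -> bool:
    print(f"[{'PASS' if ok else 'FAIL'}] {name}")
    return ok

def last_review_is_recent(review_dir: Path, max_age_days: int = 90) -> bool:
    # Expects dated review notes, e.g. tests/evals/reviews/2025-04-14.md
    dates = []
    for p in review_dir.glob("*.md"):
        try:
            dates.append(datetime.strptime(p.stem, "%Y-%m-%d"))
        except ValueError:
            continue
    return bool(dates) and datetime.now() - max(dates) <= timedelta(days=max_age_days)

checks = [
    check("eval suites exist on disk", any((REPO / "tests" / "evals").glob("**/*.yaml"))),
    check("evals wired into CI", (REPO / ".github" / "workflows" / "evals.yml").exists()),
    check("baseline is version-controlled", (REPO / "tests" / "evals" / "baseline.json").exists()),
    check("last review is dated and recent", last_review_is_recent(REPO / "tests" / "evals" / "reviews")),
]

print("Artifact check passes for this dimension." if all(checks) else "It does not; the level question is settled.")
```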

This framing is borrowed from the maturity-model traditions that work. CMMI [1], the SEI's CMMs, and ISTQB rubrics [2] all rely on artifact-presence rather than self-rating. ISO/IEC 25010 — the international standard for software-product quality [3] — uses the same artifact-grounded approach. The rubrics that don't work are the ones where teams self-score on a 1-to-5 Likert scale and end up at 3.5 across the board.

The most common failure mode: imbalanced dimensions

If you only remember one thing from this essay: maturity is multi-dimensional, and the overall level is the floor, not the average.

The most common scenario we see: a team has poured engineering time into eval coverage, scoring L3 or L4 on that dimension, while sitting at L1 on drift monitoring and L1 on refuse policy. Their answer to "what's your AI-QA maturity?" is "we're solid — we have a comprehensive eval suite." Their actual maturity is L1, because the floor across dimensions is L1.

This isn't pedantic. The L1 dimensions are where production incidents come from. Drift goes uncaught for weeks. Refusals fail in customer-facing chat. Failure modes never get named, so they keep happening. The strong dimension feels like adequate coverage; it isn't.

The remediation isn't to push the strong dimension to L5. It's to identify the lagging dimension and drag it to L2, then L3, then onward. Most of the leverage in moving up the ladder comes from dragging the weakest dimension up by one level — not from polishing the strongest.

How to use the rubric

Three rules of thumb, lifted from the rubric's own "How to use this" section, expanded a little:

1. Be honest, not aspirational.

For every "yes" you'd give yourself on a self-assessment question, pick one specific person on your team and ask: "could you walk a new engineer through this artifact in their first week?" If the artifact exists but only one person can find it, you're at L1 for that question. The artifact requires institutional knowledge, which means the institution doesn't actually have it.

2. Score per dimension, not overall.

Most teams have one dimension two levels behind the others. The fastest way to move up overall is to raise the floor by dragging that lagging dimension up a level. The free Self-Assessment scores all six dimensions independently — five minutes, no email blast back at you afterward.

3. The trigger matters more than the level.

Knowing you're at L2 is less useful than knowing what L3 looks like and what would force you to commit to it. The "trigger to next level" line on each level in the rubric is the load-bearing detail. It names the kind of organizational event — customer churn tied to AI output, a board-level question about AI risk, a compliance review surfacing a gap — that historically pushes teams to invest in the next rung. If none of the triggers have happened to you, you may not need to move yet.

What the rubric won't do

A few things the Maturity Model is deliberately not:

  • It is not a certification. Gloxx is a publisher of methodology and a service provider; it is not a regulatory or standards body. The rubric is a methodology, not a credential. Audits scored against it are dated, scoped to artifacts reviewed, and explicit about what was not reviewed. See the Doctrine.
  • It is not a prescription for tooling. The rubric names artifacts and behaviors. It doesn't tell you to use DeepEval over promptfoo, or LangSmith over Phoenix. The choice of tool is downstream of having a discipline that can use any of them.
  • It is not an EU AI Act compliance map. The Act's high-risk obligations [4] overlap meaningfully with L4 and L5 behaviors, but the rubric is not a gap analysis. Don't substitute it for one.

What's coming

This essay introduces the model. The next two pieces in the Institute Journal go deeper:

  • Week of May 12 — The L2 → L3 jump. The single biggest discontinuity in the ladder. How reactive teams become measured ones, and the four artifacts whose presence on disk closes the gap.
  • Week of May 26 — Why most "AI evals" don't gate releases. The difference between an advisory eval and an enforced gate, and why the override policy matters more than the threshold.

And in parallel: the State of AI-Feature QA 2026 report, in the field now, will publish the actual distribution of teams across the maturity model — based on anonymized aggregate data from the Self-Assessment. If you want to be counted, take it.

Read next

The AI-QA Maturity Model v1.0 — the full rubric, with concrete artifacts and behaviors at each level. Or jump straight to the Self-Assessment and find out where you sit. Five minutes.

Footnotes
  1. CMMI Institute, Capability Maturity Model Integration — the canonical software process maturity model, originally developed at the SEI.
  2. ISTQB, International Software Testing Qualifications Board — vendor-neutral testing rubrics with versioned syllabi.
  3. ISO/IEC 25010:2023, Systems and software engineering — SQuaRE — Product quality model.
  4. European Union, EU AI Act Implementation Timeline — high-risk obligations begin landing August 2026.

Find out where your team actually sits.

The 30-question self-assessment scores all six dimensions and gives you a personalized progression map back. Five minutes. No sales call required.