AI-Feature QA

AI-feature QA. Done as discipline, not vibes.

We test the AI features your team is shipping. Eval suites, golden sets, prompt regression, drift monitoring, refusal-correctness — and the release gate that ties them all to a number you can ship against. The discipline that turns "did you try it?" into "did the suite pass and did the baseline hold?" The published methodology behind it lives at the Gloxx QA Institute.

"Your existing test suite catches deterministic regressions. It does not catch a model update that quietly drops faithfulness from 94% to 81%. AI-feature QA is the layer that does."

The discipline

What AI-feature QA is.

Testing applied to product features that use LLMs or other AI models. Different failure modes than regular code, different artifacts, different release-gate criteria. We bring the discipline; your engineers stay responsible for the prompts and the code. One accountable lead.

The six workflows that ship in every engagement

Eval — golden-set curation, suite design, baseline tracking. workflow
Release Gate — gate spec, threshold enforcement, override policy. workflow
Drift Monitoring — production sampling, online evals, paging. workflow
Failure Taxonomy — tagged postmortems, named-mode catalog. workflow
Feedback Loops — triage queue, time-to-coverage SLO. workflow
Refuse Policy — written list, enforcement, dated review. workflow

The fit

Who this is for.

AI-feature QA is the specialty layer of the Gloxx Retainer — switched on when your roadmap calls for it. Some teams need it on day one; some teams shouldn't pay for it yet.

Strong fit

Series A–B SaaS with AI features in production or pre-launch
AI-native products approaching their first paid users
Regulated fintech / healthtech with LLM features and compliance exposure
Teams shipping weekly where a quiet model regression is a real cost
Engineering orgs that already test their deterministic code well and feel a gap on the AI side
Teams where "we test in production by reading user complaints" is the current state of AI QA

In every retainer · no surcharge

What's included.

When AI-feature QA is in scope, the deliverables are concrete artifacts on disk in your repo, not slides. Every artifact is yours to keep on day one.

Deliverables

Coverage map of every AI feature in your product, with named failure modes per feature
Golden set per feature — 50–200 production traces, version-controlled
Eval suite per feature — DeepEval, promptfoo, or LangSmith depending on stack
Baseline scores per metric (faithfulness, refusal-correctness, length, citation, on-topic)
Release-gate spec extended with AI-specific thresholds and override policy
CI integration so the suite runs on every PR that touches AI-feature code
Drift-monitoring pipeline with sampling cadence and paging thresholds
Failure-mode catalog tied to your incident postmortems
Refuse-policy document with system-prompt enforcement and dated review cadence
Monthly QA scorecard tracking eval pass rates, drift incidents, and time-to-coverage

The 90-day arc

How an engagement runs.

Two-week ramp, then a 90-day arc that maps to where you score weakest on the AI-QA Maturity Model. The same model that anchors the free Readiness Self-Assessment is the one we use to prioritize work in your engagement.

Engagement rhythm

Week 1–2: Coverage map + named failure modes per AI feature; readiness score per dimension
Week 3–4: First golden set + eval suite live in CI; release-gate threshold proposed
Week 5–6: Drift monitoring sampling-cadence live; first failure-taxonomy review
Week 7–9: Refuse-policy formalized; legal review if regulated; second AI feature added
Week 10–12: Time-to-coverage SLO measured against real production reports; second 90-day plan
Async PR review, weekly sync, monthly QA scorecard throughout

vs. agencies / vs. building internally

What makes this different.

Most AI-feature QA today is one of two patterns: agencies that lift a generic eval template onto your stack and call it done, or in-house teams that learn the discipline by accumulating incidents over twelve months. Gloxx is neither.

The actually-different parts

Published methodology. The six workflows are documented at the Institute, not held as agency IP
Founder-led. Brandon Jensen on every engagement — 15 years of QA leadership, not a junior pool
Audited, not certified. Methodology-based assessments, dated, scoped, no warranty theater
No surcharge. AI-feature QA is included in the $15k retainer when the roadmap calls for it
Vendor-neutral tooling. DeepEval, promptfoo, LangSmith — whichever fits your stack. No lock-in
Open methodology, open tools. The published workflows and tools at github.com/gloxxai

The refuse list

What we won't do.

Honest scoping is part of the offer. We turn down categories of AI-feature QA work even at retainer price — because saying yes would make us bad at the work we actually sell.

We will not

Issue compliance certifications or regulatory attestations — that's not what an audit is
Sell you a 60-page test plan instead of artifacts that run in CI
Pretend a model is "safe" when the eval suite says it isn't
Replace your engineers' responsibility for the prompts and code they ship
Vendor-lock you into proprietary eval tooling — every artifact is yours
Recommend AI-feature QA when your product hasn't shipped any AI features yet
Quietly let a refuse-policy regression slip through to keep a release on schedule

FAQ.

What is AI-feature QA, and how is it different from regular QA?

AI-feature QA is the testing discipline applied to product features that use LLMs or other AI models. Regular QA verifies that deterministic code does what it's supposed to. AI-feature QA verifies that non-deterministic model output stays within named bounds — faithfulness, refusal-correctness, length, citation, on-topic — across versions of the model, the prompt, and the retrieval context. Different failure modes, different artifacts (eval suites, golden sets, prompt regression), different release-gate criteria.

Do we need this if we already have a regular QA process?

If you ship AI features, yes. Your existing test suite catches deterministic regressions; it doesn't catch a model update that quietly drops faithfulness from 94% to 81%. AI-feature QA layers in alongside the discipline you already have — same release-gate spec, with AI-specific thresholds added.

What tools do you use?

DeepEval is our default eval harness. promptfoo and LangSmith where they fit. Claude Code for prompt iteration and trace review. Playwright for AI flows that ship in browser-based products. We bring tooling to where your stack already is — no vendor lock-in. Every artifact is yours to keep on day one.

What's a golden set?

50 to 200 production traces per AI feature, version-controlled, with expected-output annotations. The golden set is the input side of every eval suite. New failure modes found in production get added to the golden set as part of incident response, not "later when we have time." Full procedure documented in the Eval workflow.

How do you charge for AI-feature QA?

It's included in the Gloxx Retainer ($15,000/month, month-to-month) at no surcharge when the roadmap calls for it. We don't sell AI-feature QA as a separate paid module because that incentive would push us to upsell teams that don't need it.

How fast is the ramp?

Two-week ramp to coverage map and named failure modes for the AI features in scope. First eval suite live in CI by week three. The 90-day plan tells you which workflows from the Gloxx QA Institute we'll harden first based on your weakest dimension on the Maturity Model.

Do you replace the engineers writing the AI features?

No. Engineers stay responsible for the code and the prompts they ship. We own the eval-suite, golden-set, drift-monitoring, and release-gate discipline that wraps it. Two responsibilities, one accountable lead.

What if we're a regulated fintech or healthtech?

Refuse policy gets formal treatment — written list across out-of-scope, unsafe, and regulatory buckets, system-prompt enforcement, refusal-correctness evals, and dated legal review. Audit-friendly artifacts at every step. We don't issue certifications and we don't make compliance claims; we produce the methodology-based evidence your auditor wants to see. The full procedure is at the Refuse Policy workflow.

AI-feature QA. Done as discipline, not vibes.

What AI-feature QA is.

Who this is for.

What's included.

How an engagement runs.

What makes this different.

What we won't do.

FAQ.

What is AI-feature QA, and how is it different from regular QA?

Do we need this if we already have a regular QA process?

What tools do you use?

What's a golden set?

How do you charge for AI-feature QA?

How fast is the ramp?

Do you replace the engineers writing the AI features?

What if we're a regulated fintech or healthtech?

From the Gloxx QA Institute.

Shipping AI features?
Let's talk through where the gate should be.

AI-feature QA. Done as discipline, not vibes.

What AI-feature QA is.

Who this is for.

What's included.

How an engagement runs.

What makes this different.

What we won't do.

FAQ.

What is AI-feature QA, and how is it different from regular QA?

Do we need this if we already have a regular QA process?

What tools do you use?

What's a golden set?

How do you charge for AI-feature QA?

How fast is the ramp?

Do you replace the engineers writing the AI features?

What if we're a regulated fintech or healthtech?

From the Gloxx QA Institute.

Shipping AI features?Let's talk through where the gate should be.

Shipping AI features?
Let's talk through where the gate should be.