Home Services Approach Institute About Contact Book a call

AI-feature QA. Done as discipline, not vibes.

We test the AI features your team is shipping. Eval suites, golden sets, prompt regression, drift monitoring, refusal-correctness — and the release gate that ties them all to a number you can ship against. The discipline that turns "did you try it?" into "did the suite pass and did the baseline hold?" The published methodology behind it lives at the Gloxx QA Institute.

"Your existing test suite catches deterministic regressions. It does not catch a model update that quietly drops faithfulness from 94% to 81%. AI-feature QA is the layer that does."
01
The discipline

What AI-feature QA is.

Testing applied to product features that use LLMs or other AI models. Different failure modes than regular code, different artifacts, different release-gate criteria. We bring the discipline; your engineers stay responsible for the prompts and the code. One accountable lead.

The six workflows that ship in every engagement
  • Eval — golden-set curation, suite design, baseline tracking. workflow
  • Release Gate — gate spec, threshold enforcement, override policy. workflow
  • Drift Monitoring — production sampling, online evals, paging. workflow
  • Failure Taxonomy — tagged postmortems, named-mode catalog. workflow
  • Feedback Loops — triage queue, time-to-coverage SLO. workflow
  • Refuse Policy — written list, enforcement, dated review. workflow
02
The fit

Who this is for.

AI-feature QA is the specialty layer of the Gloxx Retainer — switched on when your roadmap calls for it. Some teams need it on day one; some teams shouldn't pay for it yet.

Strong fit
  • Series A–B SaaS with AI features in production or pre-launch
  • AI-native products approaching their first paid users
  • Regulated fintech / healthtech with LLM features and compliance exposure
  • Teams shipping weekly where a quiet model regression is a real cost
  • Engineering orgs that already test their deterministic code well and feel a gap on the AI side
  • Teams where "we test in production by reading user complaints" is the current state of AI QA
03
In every retainer · no surcharge

What's included.

When AI-feature QA is in scope, the deliverables are concrete artifacts on disk in your repo, not slides. Every artifact is yours to keep on day one.

Deliverables
  • Coverage map of every AI feature in your product, with named failure modes per feature
  • Golden set per feature — 50–200 production traces, version-controlled
  • Eval suite per feature — DeepEval, promptfoo, or LangSmith depending on stack
  • Baseline scores per metric (faithfulness, refusal-correctness, length, citation, on-topic)
  • Release-gate spec extended with AI-specific thresholds and override policy
  • CI integration so the suite runs on every PR that touches AI-feature code
  • Drift-monitoring pipeline with sampling cadence and paging thresholds
  • Failure-mode catalog tied to your incident postmortems
  • Refuse-policy document with system-prompt enforcement and dated review cadence
  • Monthly QA scorecard tracking eval pass rates, drift incidents, and time-to-coverage
04
The 90-day arc

How an engagement runs.

Two-week ramp, then a 90-day arc that maps to where you score weakest on the AI-QA Maturity Model. The same model that anchors the free Readiness Self-Assessment is the one we use to prioritize work in your engagement.

Engagement rhythm
  • Week 1–2: Coverage map + named failure modes per AI feature; readiness score per dimension
  • Week 3–4: First golden set + eval suite live in CI; release-gate threshold proposed
  • Week 5–6: Drift monitoring sampling-cadence live; first failure-taxonomy review
  • Week 7–9: Refuse-policy formalized; legal review if regulated; second AI feature added
  • Week 10–12: Time-to-coverage SLO measured against real production reports; second 90-day plan
  • Async PR review, weekly sync, monthly QA scorecard throughout
05
vs. agencies / vs. building internally

What makes this different.

Most AI-feature QA today is one of two patterns: agencies that lift a generic eval template onto your stack and call it done, or in-house teams that learn the discipline by accumulating incidents over twelve months. Gloxx is neither.

The actually-different parts
  • Published methodology. The six workflows are documented at the Institute, not held as agency IP
  • Founder-led. Brandon Jensen on every engagement — 15 years of QA leadership, not a junior pool
  • Audited, not certified. Methodology-based assessments, dated, scoped, no warranty theater
  • No surcharge. AI-feature QA is included in the $15k retainer when the roadmap calls for it
  • Vendor-neutral tooling. DeepEval, promptfoo, LangSmith — whichever fits your stack. No lock-in
  • Open methodology, open tools. The published workflows and tools at github.com/gloxxai
06
The refuse list

What we won't do.

Honest scoping is part of the offer. We turn down categories of AI-feature QA work even at retainer price — because saying yes would make us bad at the work we actually sell.

We will not
  • Issue compliance certifications or regulatory attestations — that's not what an audit is
  • Sell you a 60-page test plan instead of artifacts that run in CI
  • Pretend a model is "safe" when the eval suite says it isn't
  • Replace your engineers' responsibility for the prompts and code they ship
  • Vendor-lock you into proprietary eval tooling — every artifact is yours
  • Recommend AI-feature QA when your product hasn't shipped any AI features yet
  • Quietly let a refuse-policy regression slip through to keep a release on schedule

FAQ.

What is AI-feature QA, and how is it different from regular QA?

AI-feature QA is the testing discipline applied to product features that use LLMs or other AI models. Regular QA verifies that deterministic code does what it's supposed to. AI-feature QA verifies that non-deterministic model output stays within named bounds — faithfulness, refusal-correctness, length, citation, on-topic — across versions of the model, the prompt, and the retrieval context. Different failure modes, different artifacts (eval suites, golden sets, prompt regression), different release-gate criteria.

Do we need this if we already have a regular QA process?

If you ship AI features, yes. Your existing test suite catches deterministic regressions; it doesn't catch a model update that quietly drops faithfulness from 94% to 81%. AI-feature QA layers in alongside the discipline you already have — same release-gate spec, with AI-specific thresholds added.

What tools do you use?

DeepEval is our default eval harness. promptfoo and LangSmith where they fit. Claude Code for prompt iteration and trace review. Playwright for AI flows that ship in browser-based products. We bring tooling to where your stack already is — no vendor lock-in. Every artifact is yours to keep on day one.

What's a golden set?

50 to 200 production traces per AI feature, version-controlled, with expected-output annotations. The golden set is the input side of every eval suite. New failure modes found in production get added to the golden set as part of incident response, not "later when we have time." Full procedure documented in the Eval workflow.

How do you charge for AI-feature QA?

It's included in the Gloxx Retainer ($15,000/month, month-to-month) at no surcharge when the roadmap calls for it. We don't sell AI-feature QA as a separate paid module because that incentive would push us to upsell teams that don't need it.

How fast is the ramp?

Two-week ramp to coverage map and named failure modes for the AI features in scope. First eval suite live in CI by week three. The 90-day plan tells you which workflows from the Gloxx QA Institute we'll harden first based on your weakest dimension on the Maturity Model.

Do you replace the engineers writing the AI features?

No. Engineers stay responsible for the code and the prompts they ship. We own the eval-suite, golden-set, drift-monitoring, and release-gate discipline that wraps it. Two responsibilities, one accountable lead.

What if we're a regulated fintech or healthtech?

Refuse policy gets formal treatment — written list across out-of-scope, unsafe, and regulatory buckets, system-prompt enforcement, refusal-correctness evals, and dated legal review. Audit-friendly artifacts at every step. We don't issue certifications and we don't make compliance claims; we produce the methodology-based evidence your auditor wants to see. The full procedure is at the Refuse Policy workflow.

From the Gloxx QA Institute.

The published procedures we run in every AI-feature QA engagement. Open methodology, dated, audit-friendly. Take the free 30-question Readiness Self-Assessment to find your weakest dimension.

Shipping AI features?
Let's talk through where the gate should be.

A free 30-minute call. We'll talk through what you're shipping, what the failure modes are, and whether AI-feature QA at the retainer level is the right move now or later. No pitch, no slides, no pressure.

Book the call → Or take the free assessment first →