What is AI-feature QA, and how is it different from regular QA?
AI-feature QA is the testing discipline applied to product features that use LLMs or other AI models. Regular QA verifies that deterministic code does what it's supposed to. AI-feature QA verifies that non-deterministic model output stays within named bounds — faithfulness, refusal-correctness, length, citation, on-topic — across versions of the model, the prompt, and the retrieval context. Different failure modes, different artifacts (eval suites, golden sets, prompt regression), different release-gate criteria.
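Some of those bounds are model-graded (faithfulness, on-topic need a judge model or an eval harness), but others are deterministic. A minimal sketch of what checking the deterministic bounds looks like; the bound names and threshold values here are hypothetical, not the spec:

```python
import re

# Hypothetical bounds for one AI feature; real values live in the release-gate spec.
BOUNDS = {
    "max_chars": 1200,        # length bound
    "require_citation": True, # every answer must carry a [n]-style citation
}

def check_output(text: str) -> dict:
    """Deterministic checks only; faithfulness and on-topic need a model-graded eval."""
    results = {
        "length_ok": len(text) <= BOUNDS["max_chars"],
        "citation_ok": bool(re.search(r"\[\d+\]", text)) or not BOUNDS["require_citation"],
    }
    results["pass"] = all(results.values())
    return results
```

The point of the sketch: each named bound becomes a named check, so a failing output tells you which bound it broke, not just that it "looked wrong."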
Do we need this if we already have a regular QA process?
If you ship AI features, yes. Your existing test suite catches deterministic regressions; it doesn't catch a model update that quietly drops faithfulness from 94% to 81%. AI-feature QA layers in alongside the discipline you already have — same release-gate spec, with AI-specific thresholds added.
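What "AI-specific thresholds added" means mechanically: the CI gate compares eval scores against floors and blocks the deploy on a breach, even when the deterministic suite is green. A sketch, with hypothetical metric names and floor values:

```python
# Hypothetical release-gate floors; real values come from the release-gate spec.
THRESHOLDS = {"faithfulness": 0.90, "refusal_correctness": 0.95}

def gate(scores: dict) -> tuple[bool, list]:
    """Return (passed, failure messages) for a set of eval-suite scores."""
    failures = [
        f"{metric} {scores.get(metric, 0.0):.2f} < {floor:.2f}"
        for metric, floor in THRESHOLDS.items()
        if scores.get(metric, 0.0) < floor
    ]
    return (not failures, failures)
```

A model update that silently drops faithfulness from 0.94 to 0.81 fails this gate with a named reason, which is exactly the regression the deterministic suite can't see.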
What tools do you use?
DeepEval is our default eval harness. promptfoo and LangSmith where they fit. Claude Code for prompt iteration and trace review. Playwright for AI flows that ship in browser-based products. We bring tooling to where your stack already is — no vendor lock-in. Every artifact is yours to keep on day one.
What's a golden set?
A golden set is 50 to 200 production traces per AI feature, version-controlled, with expected-output annotations. The golden set is the input side of every eval suite. New failure modes found in production get added to the golden set as part of incident response, not "later when we have time." Full procedure documented in the Eval workflow.
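A sketch of what one golden-set record can look like and how a suite reads it. The field names are illustrative, not a documented schema; the real annotations come from the Eval workflow:

```python
import json

# Hypothetical golden-set record: one annotated production trace.
RECORD = {
    "trace_id": "t-0042",
    "input": "How do I rotate my API key?",
    "retrieval_context": ["API keys can be rotated from Settings > Keys."],
    "expected": {"must_mention": ["Settings"], "must_refuse": False},
    "source_incident": None,  # set when the case was added during incident response
}

def parse_golden_set(lines):
    """One JSON record per line (JSONL keeps version-control diffs reviewable)."""
    return [json.loads(line) for line in lines if line.strip()]
```

JSONL is the assumed storage format here because one-record-per-line makes "a new failure mode was added" a one-line diff in review.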
How do you charge for AI-feature QA?
It's included in the Gloxx Retainer ($15,000/month, month-to-month) at no surcharge when the roadmap calls for it. We don't sell AI-feature QA as a separate paid module because that incentive would push us to upsell teams that don't need it.
How fast is the ramp?
Two-week ramp to a coverage map and named failure modes for the AI features in scope. First eval suite live in CI by week three. The 90-day plan tells you which workflows from the Gloxx QA Institute we'll harden first, based on your weakest dimension on the Maturity Model.
Do you replace the engineers writing the AI features?
No. Engineers stay responsible for the code and the prompts they ship. We own the eval-suite, golden-set, drift-monitoring, and release-gate discipline that wraps it. Two responsibilities, one accountable lead.
What if we're a regulated fintech or healthtech?
The refuse policy gets formal treatment: a written list of refusals across out-of-scope, unsafe, and regulatory buckets, system-prompt enforcement, refusal-correctness evals, and dated legal review. Audit-friendly artifacts at every step. We don't issue certifications and we don't make compliance claims; we produce the methodology-based evidence your auditor wants to see. The full procedure is documented in the Refuse Policy workflow.
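A minimal sketch of a refusal-correctness eval, assuming each golden-set case is labeled with a bucket from the written refuse policy (bucket names from the answer above; field names hypothetical):

```python
# Buckets from the written refuse policy: answers in these buckets must be refused.
REFUSE_BUCKETS = {"out_of_scope", "unsafe", "regulatory"}

def refusal_correct(case: dict, model_refused: bool) -> bool:
    """A case passes when the model's refusal matches what the policy requires."""
    should_refuse = case.get("bucket") in REFUSE_BUCKETS
    return model_refused == should_refuse

def refusal_correctness(cases: list, outcomes: list) -> float:
    """Fraction of cases where refusal behavior matched the policy."""
    hits = sum(refusal_correct(c, o) for c, o in zip(cases, outcomes))
    return hits / len(cases)
```

Note the metric is symmetric: it penalizes both failing to refuse an unsafe request and wrongly refusing an in-scope one, which is why it belongs in the release gate rather than a one-off audit.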