How Gloxx works — Approach

QA for any software company

The QA discipline you'd expect from a senior in-house team. Tools, test pyramid, release gates. This is the half of the engagement that's the same whether you ship AI features or not — it's the floor every retainer starts from.

The general stack. Playwright for E2E and visual regression. axe-core for WCAG 2.1 AA accessibility gates. Pact for service-boundary contract tests. k6 for load and soak. Sentry / Datadog for production observability and error budgets. GitHub Actions (or your CI of choice) as the orchestration layer that runs every gate before merge or deploy.

The general test pyramid. Engineering owns unit tests — we extend them but don't rebuild them. We own integration (~80 tests, the service-boundary layer where most missed bugs live). We own and maintain E2E flows (~40 critical paths in Playwright). We run weekly exploratory testing on the high-risk surface — the layer that catches what scripts miss because no one wrote them yet.

The general release gate. Every pull request runs unit + integration + a fast E2E smoke. Every release runs the full E2E suite + accessibility check + a manual exploratory pass on critical user journeys. If you have a contract test, it runs. If you have a load benchmark, it runs. If you have a security scan, it runs. The gate is a checklist, not a hope.
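The "checklist, not a hope" idea can be sketched as a gate that runs every check a repository registers and blocks on any failure. This is a minimal illustration only: `run_gate`, `gate_passes`, and the check names are hypothetical, and real checks would call the actual test runners rather than return a hard-coded boolean.

```python
from typing import Callable, Dict, List

# Hypothetical registry type: gate name -> callable returning True on pass.
Check = Callable[[], bool]


def run_gate(checks: Dict[str, Check]) -> List[str]:
    """Run every registered check and return the names of those that failed.

    All checks run even after an earlier failure, so the report is complete.
    A check that raises counts as a failure instead of crashing the gate.
    """
    failed = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False
        if not ok:
            failed.append(name)
    return failed


def gate_passes(checks: Dict[str, Check]) -> bool:
    """The gate is green only when no registered check failed."""
    return not run_gate(checks)
```

If a contract test, load benchmark, or security scan exists, it is registered and therefore always runs; a check that is not registered cannot silently count as a pass.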

What we own day to day. Bug triage with severity grading and reproducible reports. Test plan authorship for new features before they ship, not after. Release-readiness sign-off — written, named, dated. A weekly QA sync with engineering. A monthly QA scorecard for your leadership.

When you ship AI features — the specialty.

The work below is what we layer on top when AI is in the product. It's the half of Gloxx's discipline that most QA shops don't have, and it's why the retainer includes AI-feature QA at no surcharge. Read on if it's relevant; skip ahead to §7 if it isn't. If you want a structured way to see where your team currently sits on this discipline, take the AI-QA Readiness Self-Assessment, the flagship instrument of the Gloxx QA Institute.

The AI-QA tool stack

Eight tools, used in specific combinations. None of them is magic; the value is knowing when to reach for each, and which layer of the release gate each one anchors.

DeepEval: Eval runner

Our default. PyTest-based, so any team with Python test discipline adopts it in a day. Ships metrics for faithfulness, relevancy, toxicity, bias, summarization — the coverage most product teams need on day one. Every Gloxx eval suite lands here unless the client is already deep into a different runner.

promptfoo: Prompt regression

YAML-native, declarative, fast to run in CI. We use it specifically for prompt-version regression — when the prompt changes, does every case in the golden set still pass? It's the diff tool that makes prompt edits safe to merge.
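promptfoo expresses this check declaratively in YAML. The underlying idea, sketched here in plain Python rather than with promptfoo's own API, is to render the new prompt for every golden case and list the cases that no longer pass. The case layout (`id`, `vars`, `must_contain`) and the `model` callable are assumptions made for the illustration.

```python
from typing import Callable, Dict, List


def regression_failures(
    prompt_template: str,
    golden_set: List[Dict],
    model: Callable[[str], str],
) -> List[str]:
    """Return the ids of golden cases whose output lacks the required text.

    Each case is assumed to carry template variables under "vars" and a
    case-insensitive substring the answer must contain under "must_contain".
    """
    failures = []
    for case in golden_set:
        prompt = prompt_template.format(**case["vars"])
        output = model(prompt)
        if case["must_contain"].lower() not in output.lower():
            failures.append(case["id"])
    return failures
```

An empty list means the prompt edit is safe to merge against this golden set; any id in the list is a regression to investigate before merging.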

LangSmith / Langfuse: Trace observability

Production traces are the raw material for the golden set. We wire one of these in on day one of every retainer so every eval case is traceable back to a real user query, not a speculative one.

Playwright: End-to-end tests

For the non-AI surface that still matters. Every AI feature sits inside a product with forms, auth, billing, and navigation — and those break the old-fashioned way. Playwright is our default for E2E + visual regression on that surface.

axe-core: Accessibility gate

Shipped as part of every Playwright run. WCAG 2.1 AA violations block merge. If your AI product can't be used by a screen reader, it's not shipped; it's leaked.
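One way to turn an axe-core report into a merge gate is to filter its `violations` by the WCAG tags axe-core attaches to each rule. The sketch below assumes the results object has already been collected from the page (how it is wired into the Playwright run is out of scope here); `blocking_violations` is a hypothetical helper name.

```python
from typing import Any, Dict, List

# axe-core tags each rule with the standard it belongs to. These four tags
# together cover WCAG 2.0 and 2.1 at levels A and AA.
WCAG_21_AA_TAGS = {"wcag2a", "wcag2aa", "wcag21a", "wcag21aa"}


def blocking_violations(axe_results: Dict[str, Any]) -> List[str]:
    """Return the rule ids of violations that fall under WCAG 2.1 AA."""
    blocking = []
    for violation in axe_results.get("violations", []):
        if WCAG_21_AA_TAGS & set(violation.get("tags", [])):
            blocking.append(violation["id"])
    return blocking
```

A test would then assert that the returned list is empty, so best-practice findings are reported without blocking while A and AA violations fail the run.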

Pact: Contract tests

For the service boundary where the AI feature calls the model provider or where downstream services call the AI-gated endpoint. Pact pins the contract so an upstream schema drift doesn't silently degrade output quality.

Claude Code: Agent runtime

The glue. Drafts candidate eval cases from production traces, operates the rest of the stack during authoring, and runs unattended work as Routines between engagements (see §2 and §3).

Claude Code Routines: Scheduled automation

How we run continuous monitoring inside a retainer. Nightly eval sweeps, weekly drift diffs, webhook-triggered PR review gates — all configured once, run forever, every run producing an auditable session transcript. This is how Gloxx delivers continuous QA without billing for the hours.

How we use AI agents

Concrete workflow. When the retainer covers an AI feature, this is the loop we run on it:

Step 1 — Spec + golden-set curation. The engineer (with us) writes a plain-English list of the properties that should always hold for every AI output. ("The support agent never fabricates a refund policy not present in the retrieved knowledge base." "The summarization output is never longer than the source.") We pair this with 50–200 real production traces curated into a version-controlled golden set. This is the most valuable human work in the process — we don't let the model draft it.

Step 2 — Claude Code drafts candidate eval cases. We feed the model the spec plus the relevant traces and ask for a first-draft DeepEval suite. We prompt explicitly for cases the model is unsure about, not just the obvious ones — adversarial inputs, edge-case retrievals, long context. Output looks like this:

tests/evals/support_agent_faithfulness.py
# Generated by Claude Code from PRD §2.1: "Support agent never invents
# policy details that aren't in the KB." Human-reviewed by Bran on 2026-04-16.
# See commit 3f91a2c for the discussion on edge case 4 (out-of-scope queries).

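import pytest
from deepeval import assert_test
from deepeval.metrics import FaithfulnessMetric, AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Project-local helpers, not part of DeepEval. These module paths are
# placeholders; point them at wherever your repo defines the two names.
from evals.golden_sets import load_golden_set
from app.agents import support_agent

GOLDEN_SET = load_golden_set("support-refund-policy.jsonl")  # 127 cases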

faithfulness = FaithfulnessMetric(threshold=0.95)
relevancy    = AnswerRelevancyMetric(threshold=0.85)

@pytest.mark.parametrize("case", GOLDEN_SET)
def test_support_agent_grounded_in_kb(case):
    response = support_agent.run(case.input, context=case.retrieved_chunks)
    test_case = LLMTestCase(
        input=case.input,
        actual_output=response.content,
        retrieval_context=case.retrieved_chunks,
        expected_output=case.expected,
    )
    assert_test(test_case, [faithfulness, relevancy])
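The snippet above calls `load_golden_set` without showing it. One possible shape is sketched below, assuming the golden set is stored as JSON Lines with one case per line. The field names `input`, `retrieved_chunks`, and `expected` are taken from how the test uses each case; the file layout and the `GoldenCase` class are assumptions.

```python
import json
from dataclasses import dataclass
from pathlib import Path
from typing import List


@dataclass
class GoldenCase:
    """One curated production trace, as consumed by the eval test."""
    input: str
    retrieved_chunks: List[str]
    expected: str


def load_golden_set(path: str) -> List[GoldenCase]:
    """Read a JSONL golden set, skipping blank lines.

    Raises ValueError naming the file and line when a required field is
    missing, so a malformed case fails loudly instead of being dropped.
    """
    cases = []
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        record = json.loads(line)
        try:
            cases.append(
                GoldenCase(
                    input=record["input"],
                    retrieved_chunks=list(record["retrieved_chunks"]),
                    expected=record["expected"],
                )
            )
        except KeyError as missing:
            raise ValueError(f"{path}:{line_no} is missing field {missing}") from None
    return cases
```

Keeping the file in version control alongside the test means every change to the golden set shows up in review like any other diff.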

Step 3 — Human review before merge. We read every generated eval line-by-line. About 40% of drafts ship unchanged, 40% need edits (wrong threshold, missed edge case, ambiguous expected output), 20% get rejected outright because the model misread the spec. Rejections are fed back as counterexamples in the prompt for the next round.

Step 4 — The suite runs in CI, every push. We don't ship evals that only run on a developer's laptop. Every Gloxx-authored suite lands in CI from day one, with a failing-eval gate on the merge queue.

Our Claude Code operating protocol

Six principles that sit underneath §2. Any CTO can tell the difference between "someone who uses AI" and "someone who has an opinion about how to use it." These are ours.

Plan mode by default — "move slow to move fast."

Roughly 80% of any active Claude Code session is spent in Plan Mode before a single line of test code is generated. A locked-in plan makes execution nearly automatic and prevents "quick fix" failure modes — like the well-known case where Claude, asked to fix a UI display error, silently modified the underlying database values to match the expected output and corrupted the source of truth. On a production AI system, that class of mistake is a rollback-and-incident event. Plan first, execute second, always.

The interview prompt — surface assumptions before they become bugs.

Before generating anything, we run a fixed interview prompt. The goal is to drag every hidden assumption into the open before it becomes a regression.

"Before we start building, interview me about this. What are the core problems this solves? Who is this for? What does success look like? What should this NOT do? Summarize it back to me before you write any code."

Verification feedback loops — the 2–3× quality lever.

Every Claude Code session we run has a verification tool wired in — a headless browser, a test runner, a linter, or a direct DeepEval / Playwright invocation — and an explicit instruction to use that tool to confirm state before declaring a task complete. This single practice is the largest quality lever we've measured. For long sessions we close with a forced audit: "Go back and verify all of your work so far. Flag anything that skipped best practice or introduced risk."

Partitioned parallel sessions — avoid contextual fog.

For non-overlapping tasks we run multiple independent Claude Code contexts simultaneously rather than piling them into one window. Deep-dive sessions accumulate baggage that hides obvious solutions; a fresh window often sees what the long one missed. Two isolated contexts routinely beat one overloaded context on the same problem.

Minimalist CLAUDE.md — aggressive instruction hygiene.

We maintain the smallest possible instruction set per repo. Heavy prompt engineering becomes obsolete within ~6 months as the underlying model improves; we don't pay that tax. When a CLAUDE.md drifts into contradiction or bloat, we delete it and re-seed from zero, adding rules back only as the current model provably needs them.

Skills for inner loops — productize every repetitive task.

Recurring processes — release-gate reports, eval-case templates, post-incident summary documents, compliance exports — get codified as Claude Skills once, then invoked by slash command. When a Skill needs to run without us (nightly, on a webhook, or via API), we deploy it as a scheduled or webhook-triggered Claude Code Routine. Every run produces an auditable session transcript by default — that's the evidence trail an AI-QA retainer needs.

Underpinning all six: context over prompt engineering. We spend our time feeding the model high-quality context — codebase, docs, system state, the spec on this page — rather than micro-tweaking prompts that will be obsolete by Q3. The "what" we feed the model is our moat; the "how" of any individual prompt is not.

Our test pyramid for AI-native apps

Six layers. Most AI-native teams we audit are heavy on unit tests at the base and light on everything that actually scores the model's output. The gap between "our CI is green" and "the feature works" is where the upper layers should be built.

RED-TEAM: ~10 adversarial sets
TRACE REVIEW: 50 traces / wk
EVAL (golden sets): ~150 cases
E2E (Playwright): ~40 flows
INTEGRATION: ~80 tests
UNIT: ~200+ tests

Unit — function-level correctness. Cheap, fast, and the floor most teams already have. We don't rebuild these; we extend them.

Integration — service-boundary interactions (DB, message bus, model provider, retrieval layer). Where most missed bugs around AI features actually live — the glue code, not the model call.

E2E — Playwright user flows through the surrounding product. The feature isn't shipped if the login, billing, or empty-state screens break around it.

Eval — golden-set scoring against faithfulness, relevancy, toxicity, structure, or custom metrics. The single highest-leverage layer for AI features. Ten good evals with a real golden set beat a hundred vibe-checked hand-written prompts.

Trace review — sampled production traces read by a human every week. The layer that catches drift the eval suite hasn't learned to measure yet. Every flagged trace becomes a new eval case next sprint.

Red-team — adversarial prompts, jailbreaks, prompt-injection fixtures, data-leak probes. Run on every material prompt change and before every feature launch. Five good red-team sets beat fifty speculative ones.
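For the trace-review layer, the weekly sample can be made reproducible so that two reviewers looking at the same week see the same traces. A minimal sketch, assuming trace ids are available as strings and the week is identified by a label such as an ISO week; `sample_traces` is a hypothetical helper.

```python
import random
from typing import List, Sequence


def sample_traces(trace_ids: Sequence[str], week: str, k: int = 50) -> List[str]:
    """Pick up to k trace ids for human review, deterministically per week.

    Seeding the generator with the week label means re-running the sampler
    for the same week and the same trace list returns the same sample.
    """
    ids = list(trace_ids)
    if len(ids) <= k:
        return ids
    rng = random.Random(week)
    return rng.sample(ids, k)
```

The default of 50 matches the weekly review volume in the pyramid; uniform sampling is the simplest choice, and a team might instead weight toward low-confidence or user-flagged traces.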

Our release-gate checklist

We give this away. Every Gloxx Retainer that touches an AI feature uses a customized version of the checklist below as the gate between "code merged" and "code deployed." If you can't answer yes to every item, you're not ready to ship. It is what an L4 release gate looks like in our AI-QA Maturity Model: enforced thresholds, refusal-policy testing, and a rollback path that takes less than 15 minutes.

Pre-deploy release gate (Gloxx standard)

Has the full eval suite plus traditional test suite passed on the exact commit being deployed, not a close ancestor?
Did the prompt regression suite pass against the last N production prompt versions?
Do faithfulness, relevancy, and toxicity metrics meet published thresholds on the current golden set?
Has the golden set been re-sampled from production traces in the last 30 days?
If this is a prompt or model change: has a red-team pass been run against prompt-injection and jailbreak fixtures?
Is cost-per-request within tolerance vs. the previous version on the same benchmark?
Is latency-at-p95 within the documented SLO on representative input?
Has at least one human reviewed 50 sampled production traces in the last 7 days?
Is there a documented, tested rollback path to the previous prompt or model version — and does it take less than 15 minutes to flip?
Is there a post-deploy monitoring plan for the first 48 hours, with owner + on-call assigned?
Has the change been communicated to downstream teams (support, customer success, sales) with sufficient notice?
If a similar release has failed before on this team: has the specific failure mode been regression-tested?
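Several items on the checklist are numeric and can be enforced mechanically. The sketch below covers four of them (golden-set freshness, trace-review recency, rollback time, and p95 latency) using the limits stated above; the `ReleaseEvidence` fields and function names are illustrative assumptions, and the remaining items still need a human answer.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Sequence


def p95(latencies_ms: Sequence[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty sample."""
    ordered = sorted(latencies_ms)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


@dataclass
class ReleaseEvidence:
    golden_set_sampled_on: date
    last_trace_review_on: date
    rollback_minutes: float
    latencies_ms: Sequence[float]
    latency_slo_ms: float


def gate_blockers(ev: ReleaseEvidence, today: date) -> List[str]:
    """Return a human-readable reason for every numeric item that fails."""
    blockers = []
    if (today - ev.golden_set_sampled_on).days > 30:
        blockers.append("golden set older than 30 days")
    if (today - ev.last_trace_review_on).days > 7:
        blockers.append("no trace review in the last 7 days")
    if ev.rollback_minutes >= 15:
        blockers.append("rollback takes 15 minutes or more")
    if p95(ev.latencies_ms) > ev.latency_slo_ms:
        blockers.append("p95 latency above SLO")
    return blockers
```

An empty list means these four items pass. Returning reasons rather than a bare boolean keeps the sign-off auditable.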

What we refuse to do

Saying no to the wrong engagement is how we stay useful for the right one. These are non-negotiable.

We don't replace AI red teams or safety audits. Adversarial safety testing and an ongoing AI-QA function are different products with different incentive structures. We pair with red-team specialists; we don't substitute for them. When a client asks us to run the full adversarial gauntlet ourselves, we refer out and offer to run the release-gate layer alongside.
We don't ship untested AI-generated code. Every line of eval or test code that bears a Gloxx name has been read, understood, and signed off by a human. No eval theater. If that constraint slows us down, so be it: one engagement that ends in a false green light costs more than a whole year of speed gains.
We don't lock you into an annual contract. The retainer is month-to-month with 30 days' notice on either side. If we're not earning the renewal every month, you shouldn't have to pay for the privilege of leaving. The first two weeks are the discovery ramp — if it's not a fit by week three, we both know.
We don't bill by the hour. The retainer is a flat $15k/month regardless of how busy a given week is — we take the timeline risk, you take the scope-change risk. This aligns incentives: we don't get paid more for being slow, and you don't get a surprise invoice when a release week runs hot.
We don't publish client work without written consent. Even sanitized. Even as a case study. The AI posture of a live product is sensitive information and we treat it that way.

Want a QA partner who works this way?