Glossary.

A canonical definition for every term the Institute uses. The Gloxx vocabulary is methodological, not regulatory — these are conventions for talking about AI-feature QA precisely, not standards Gloxx invented or accredits. Where a term is also used by NIST, ISO/IEC, or OWASP, we follow their meaning.

Eval & suite design.

Eval suite
A collection of test cases, each with an expected output or graded property, run automatically against an AI feature on every code change. Distinct from traditional unit tests because correctness is graded against a rubric — faithfulness, refusal-correctness, length-bound, citation — not asserted with assertEquals. See also: Eval Workflow · Golden set · Baseline.
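
A minimal sketch of what one rubric-graded case can look like; the EvalCase shape and the length-bound rubric below are illustrative, not a Gloxx artifact. Faithfulness or refusal graders would plug into the same rubric slot.

```python
# Hypothetical eval case graded against a rubric, not string equality.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    case_id: str
    prompt: str
    context: str                         # source material the answer must be grounded in
    rubric: Callable[[str, str], bool]   # (output, context) -> pass/fail

def length_bound(max_words: int) -> Callable[[str, str], bool]:
    """Rubric: the output stays inside a word budget."""
    return lambda output, _context: len(output.split()) <= max_words

case = EvalCase(
    case_id="billing-summary-001",
    prompt="Summarize the attached invoice.",
    context="Invoice #4471: total $120.00, due 2024-06-01.",
    rubric=length_bound(50),
)

# The suite runs every case on every code change and grades the output.
output = "Invoice #4471 totals $120.00 and is due June 1, 2024."
print(case.case_id, "PASS" if case.rubric(output, case.context) else "FAIL")
```
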
Golden set
A curated, version-controlled collection of input/output pairs (typically 50–200 per feature) that anchor the eval suite. The golden set evolves: production failures get added to it, edge cases get pruned, and the diff between this quarter's golden set and last quarter's is itself an eval-discipline signal.
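
One plausible shape for a golden-set record, stored one per line in a version-controlled JSONL file; the field names are illustrative. A provenance field is what lets a team audit how production failures flow into the set.

```python
# Hypothetical golden-set record; one JSON object per line, checked into git.
import json

record = {
    "id": "gs-0142",
    "input": "What is your refund window?",
    "expected": "30 days from delivery, per the published refund policy.",
    "source": "production-failure",   # provenance: how this case entered the set
    "added": "2024-05-17",            # dating entries makes the quarterly diff legible
}

with open("golden_set.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")
```
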
Faithfulness
The degree to which an AI output is grounded in the retrieved context (for retrieval-augmented generation) or the source material (for summarization). High faithfulness = the model didn't fabricate. Faithfulness is one of the load-bearing metrics in the Eval Workflow because it's the metric that catches hallucinations before users do.
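
To illustrate only the shape of the metric, here is a deliberately crude lexical proxy: the fraction of output sentences whose words all appear in the retrieved context. Production graders typically use NLI models or LLM judges; nothing below is a real grader.

```python
# Crude faithfulness proxy: fraction of output sentences fully supported by
# context words. Illustrative only; real graders are far more robust.
def crude_faithfulness(output: str, context: str) -> float:
    context_words = set(context.lower().split())
    sentences = [s.strip() for s in output.split(".") if s.strip()]
    if not sentences:
        return 0.0
    supported = sum(
        all(word in context_words for word in s.lower().split())
        for s in sentences
    )
    return supported / len(sentences)

context = "invoice 4471 total is $120 due june 1"
print(crude_faithfulness("invoice 4471 total is $120. it is due june 1", context))  # 0.5
```
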
Refusal-correctness
A bidirectional metric: did the model correctly refuse queries that violate the refuse policy, and did it correctly answer queries that should be allowed? A model that refuses everything has 100% refusal-correctness on the negative set and 0% on the positive set. Both halves matter. See also: Refuse Policy Workflow.
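
A sketch of the computation, assuming each query carries a ground-truth should-refuse label and an observed refusal; it reproduces the refuse-everything example from the definition.

```python
# Refusal-correctness over both halves of a labeled query set.
# Each result is (should_refuse, refused): ground truth vs. observed behavior.
def refusal_correctness(results: list[tuple[bool, bool]]) -> dict[str, float]:
    negative = [r for should, r in results if should]      # must be refused
    positive = [r for should, r in results if not should]  # must be answered
    return {
        "negative_set": sum(negative) / len(negative),                 # refused when required
        "positive_set": sum(not r for r in positive) / len(positive),  # answered when allowed
    }

# A model that refuses everything: perfect on one half, useless on the other.
refuse_everything = [(True, True)] * 10 + [(False, True)] * 10
print(refusal_correctness(refuse_everything))
# {'negative_set': 1.0, 'positive_set': 0.0}
```
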
Baseline
The documented threshold score below which an eval suite is considered failing for a given feature. Baselines are written down, dated, and reviewed quarterly. A team that says "the eval passed" without naming a baseline is at L1 — Ad-hoc — on the Maturity Model.
Release gate.

Release gate
The set of automated checks that decide whether a code change is allowed to deploy. For AI features, the gate includes the eval suite plus traditional tests, accessibility, and any feature-specific metrics. A gate that doesn't block is advisory — and an advisory gate is no gate at all. See also: Release-Gate Workflow.
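
A minimal sketch of a blocking gate step, assuming scores arrive from an upstream eval run; the metric names and numbers are illustrative.

```python
# Gate step: compare eval scores to documented baselines, exit nonzero on breach.
import sys

BASELINES = {"faithfulness": 0.90, "refusal_correctness": 0.95}  # written down, dated
scores = {"faithfulness": 0.87, "refusal_correctness": 0.97}     # from the current eval run

breaches = {m: (scores[m], b) for m, b in BASELINES.items() if scores[m] < b}
for metric, (score, baseline) in breaches.items():
    print(f"THRESHOLD BREACH: {metric} = {score:.2f} < baseline {baseline:.2f}")

sys.exit(1 if breaches else 0)  # nonzero exit is what blocks the merge or deploy
```

The nonzero exit code is the design point: a gate that only prints a warning is advisory, and an advisory gate is no gate at all.
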
Threshold breach
An eval score falling below its documented baseline. A breach blocks merge or deploy until the team either fixes the regression or invokes the override policy.
Override policy
The documented procedure for shipping a release that fails the gate. Includes who can override, what justification is required (incident severity, business deadline, false-positive analysis), and how the override is logged. Without an override policy, teams either ship through failed gates silently (worst) or stop using the gate (also bad).
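
One possible shape for the logged override record, following the fields the definition names; everything below is illustrative.

```python
# Hypothetical override log entry: who, why, and when, in a queryable record.
import datetime
import json

override = {
    "release": "web-2024.05.21",
    "failed_metrics": ["faithfulness"],
    "overridden_by": "qa-lead",   # drawn from the documented list of who can override
    "justification": "false-positive analysis attached; grader regression, not model",
    "logged_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
}
print(json.dumps(override, indent=2))
```
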
Drift monitoring.

Drift
The phenomenon where AI feature quality silently degrades over time without code changes. Causes include: model-provider updates, retrieval-corpus drift, distribution shift in production traffic, prompt edits with downstream effects. Drift is invisible without instrumentation, which is why drift monitoring is its own workflow.
Online eval
Running the eval rubric against sampled production traffic in real time, comparing scores to the dev/CI baseline. The delta between online-eval scores and CI scores is the drift signal.
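
A sketch of the drift signal as that delta, with illustrative scores and an assumed alerting threshold:

```python
# Drift signal: CI baseline scores minus online-eval scores on sampled traffic.
ci_scores = {"faithfulness": 0.92, "refusal_correctness": 0.96}
online_scores = {"faithfulness": 0.84, "refusal_correctness": 0.95}

ALERT_DELTA = 0.05  # assumed per-feature alerting threshold

for metric, ci in ci_scores.items():
    delta = ci - online_scores[metric]
    status = "DRIFT" if delta > ALERT_DELTA else "ok"
    print(f"{metric}: CI {ci:.2f} vs online {online_scores[metric]:.2f} "
          f"(delta {delta:+.2f}) {status}")
```
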
Trace
A single production input/output pair plus the full execution context — system prompt version, retrieved chunks, model version, latency — sufficient to reproduce the eval case offline. Without traces, a drift incident is unreproducible.
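
A sketch of a trace record carrying the fields the definition lists; the names are illustrative, and the test of sufficiency is whether the case can be replayed offline.

```python
# Hypothetical trace: one production I/O pair plus the context to replay it.
from dataclasses import dataclass, field

@dataclass
class Trace:
    trace_id: str
    input_text: str
    output_text: str
    system_prompt_version: str
    model_version: str
    latency_ms: int
    retrieved_chunks: list[str] = field(default_factory=list)

trace = Trace(
    trace_id="tr-88412",
    input_text="What's the refund window?",
    output_text="Refunds are accepted within 30 days of delivery.",
    system_prompt_version="sp-2024-05-01",
    model_version="provider-model-2024-04",
    latency_ms=840,
    retrieved_chunks=["Refund policy: 30 days from delivery."],
)
# Offline reproduction: pin the prompt and model versions, replay the input
# with the same retrieved chunks, and re-grade with the eval rubric.
```
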
Failure taxonomy.

Failure mode
A named category of AI failure — hallucination, refusal-mismatch, retrieval miss, drift, length-bound violation, prompt-injection, citation fabrication. Used to tag postmortems so incident counts can be trended and named at the leadership table. See also: Failure Taxonomy Workflow.
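
One way to make the taxonomy executable: the named failure modes as a closed enum, so postmortem tags cannot drift into free text. The values follow the list above; the code shape is illustrative.

```python
# Closed vocabulary of failure modes, so tags are countable and trendable.
from enum import Enum

class FailureMode(Enum):
    HALLUCINATION = "hallucination"
    REFUSAL_MISMATCH = "refusal-mismatch"
    RETRIEVAL_MISS = "retrieval miss"
    DRIFT = "drift"
    LENGTH_BOUND_VIOLATION = "length-bound violation"
    PROMPT_INJECTION = "prompt-injection"
    CITATION_FABRICATION = "citation fabrication"

# A postmortem tagged from the closed set, not with ad-hoc strings:
incident_tags = [FailureMode.HALLUCINATION, FailureMode.RETRIEVAL_MISS]
```
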
Postmortem tag
A label applied to an incident postmortem identifying the failure mode(s) involved. An untagged postmortem isn't done — the work of identifying which named failure mode this was is the work of getting the eval suite to catch the next one.
Feedback loops.

Time-to-coverage
The elapsed time from a user-reported failure being filed to the corresponding test case being added to the eval suite. Tracked as an SLO (days, not weeks) at L4 and L5 of the Maturity Model. Time-to-coverage is the metric that turns feedback from a complaint queue into a ratchet.
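
The arithmetic, with illustrative dates and an assumed SLO value:

```python
# Time-to-coverage: report filed -> matching eval case merged into the suite.
from datetime import date

filed = date(2024, 5, 3)        # user-reported failure filed
case_merged = date(2024, 5, 7)  # corresponding test case added to the eval suite

time_to_coverage_days = (case_merged - filed).days
SLO_DAYS = 7                    # assumed SLO: days, not weeks

print(f"time-to-coverage: {time_to_coverage_days} days "
      f"({'within' if time_to_coverage_days <= SLO_DAYS else 'breaching'} SLO)")
```
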
Triage queue
A prioritized list of user-reported AI failures awaiting investigation, with a named owner. A queue without a named owner is a backlog. See also: Feedback Loops Workflow.
Refuse policy.

Refuse list
A written enumeration of query categories the AI must never answer. Three buckets: out-of-scope (queries the product isn't designed for), unsafe (jailbreaks, harmful instructions), regulatory (legal, medical, financial advice the company isn't licensed for).
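
One plausible checked-in shape for the refuse list, with the three buckets from the definition; the category entries are examples, not a recommended policy.

```python
# Hypothetical refuse list as version-controlled config, one key per bucket.
REFUSE_LIST = {
    "out_of_scope": [
        "competitor product support",
        "general web search",
    ],
    "unsafe": [
        "jailbreak attempts",
        "instructions for causing harm",
    ],
    "regulatory": [
        "legal advice",
        "medical advice",
        "financial advice",
    ],
}
```

The same artifact can feed system-prompt enforcement and the negative set for refusal-correctness.
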
System-prompt enforcement
Reflecting the refuse policy in the model's system prompt so refusals are produced by the model rather than filtered after the fact. Tribal knowledge of what to refuse is not enforcement; the policy lives in the prompt or it doesn't live at all.
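
A sketch of what "the policy lives in the prompt" can mean mechanically: the refuse list rendered into the system prompt at build time. The rendering function and prompt wording are illustrative.

```python
# Render the checked-in refuse list into the system prompt, so refusals are
# produced by the model instead of being filtered after the fact.
REFUSE_LIST = {
    "out_of_scope": ["competitor product support"],
    "unsafe": ["jailbreak attempts", "instructions for causing harm"],
    "regulatory": ["legal advice", "medical advice"],
}

def render_refuse_section(refuse_list: dict[str, list[str]]) -> str:
    lines = ["Refuse, with a brief explanation, any query in these categories:"]
    for bucket, categories in refuse_list.items():
        lines.extend(f"- {category} [{bucket}]" for category in categories)
    return "\n".join(lines)

system_prompt = "You are the billing assistant.\n\n" + render_refuse_section(REFUSE_LIST)
print(system_prompt)
```
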
Compliance review
A dated, documented review of the refuse policy by legal/compliance. Recurring cadence (typically quarterly). The output is a diff against the prior version with sign-off.
Maturity model.

L1 → L5
The five levels of the AI-QA Maturity Model: Ad-hoc, Reactive, Measured, Governed, Continuous. Descriptive labels — what's actually true at each level — not aspirational. Your team is at the level where the artifacts exist on disk, not the level leadership wishes they were. See also: AI-QA Maturity Model v1.0.
Capability dimension
One of six independently-scored axes of the Maturity Model, mapped 1:1 to a Workflow. A team can be Measured on eval coverage but Ad-hoc on drift monitoring — and that combination is the most common failure mode in AI-feature QA. The overall level is the floor across dimensions, not the average.
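
The floor-not-average rule in two lines, with illustrative dimension scores:

```python
# Overall maturity level is the floor across the six dimensions.
dimension_levels = {
    "eval_and_suite_design": 3,  # Measured
    "release_gate": 3,
    "drift_monitoring": 1,       # Ad-hoc: the common blind spot
    "failure_taxonomy": 2,
    "feedback_loops": 2,
    "refuse_policy": 3,
}

overall = min(dimension_levels.values())
average = sum(dimension_levels.values()) / len(dimension_levels)
print(f"overall: L{overall}; the average ({average:.1f}) would hide the Ad-hoc dimension")
```
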
Readiness audit
A Gloxx-conducted, methodology-based review of release readiness, scored against the Maturity Model with documented evidence per question. Audit, not certification. Reports are dated, scoped to the artifacts reviewed, and explicit about what was not reviewed.

Want to score your team against these definitions?

The free Self-Assessment uses every term on this page. Five minutes; a concrete picture of where you sit.

Take the assessment →