Approach

How Gloxx works.

We don't sell a methodology. We sell a habit: how we actually use our tools, how we drive Claude Code without letting it drive us, and what we refuse to do regardless of price. If you want to understand whether Gloxx belongs on your release-critical path, this is the page that tells you.

The Gloxx QA stack

Eight tools, used in specific combinations. None of them is magic; the value is knowing when to reach for each.

Foundry — Test runner

Our default. Fast, Solidity-native, and the only test framework with first-class invariant and fuzz runners baked in. We write new suites in Foundry unless the client's existing Hardhat pipeline is too entrenched to move.
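To make "first-class fuzz runner" concrete, here is a minimal sketch of what a Foundry fuzz test looks like. The `Vault` contract and function names are illustrative, not from any real engagement; Foundry picks up any test function with typed parameters and calls it with many randomized inputs per run.

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.0;

import {Test} from "forge-std/Test.sol";

// Toy contract under test -- purely illustrative.
contract Vault {
    mapping(address => uint256) public balanceOf;

    function deposit() external payable {
        balanceOf[msg.sender] += msg.value;
    }
}

contract VaultFuzzTest is Test {
    Vault vault;

    function setUp() public {
        vault = new Vault();
    }

    // Foundry's fuzzer calls this with many random values of `amount`,
    // rather than the single hand-picked value a unit test would use.
    function testFuzz_depositCreditsBalance(uint96 amount) public {
        vm.assume(amount > 0);
        vault.deposit{value: amount}();
        assertEq(vault.balanceOf(address(this)), amount);
    }
}
```

A plain unit test would check one deposit amount; the fuzz variant checks the same property across the whole `uint96` range for near-zero extra effort.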

Hardhat — Legacy runner

We maintain and extend Hardhat suites when the client already has one. We don't migrate to Hardhat. JavaScript-based, slower, but still the right choice when the surrounding tooling is Node-centric.

Slither — Static analysis

First pass on every audit. Catches reentrancy, uninitialized storage, shadowing, and about 90 other low-cost findings that no smart auditor should still be paid to discover manually.

Mythril — Symbolic execution

Slower than Slither, deeper. We run it when Slither's output is clean and we want to surface integer-overflow / state-dependency bugs that show up only across call sequences.

Aderyn — Rust-based linter

Newer, faster than Slither for large codebases, with complementary detector coverage. We run both — their disagreements are signal.

Halmos — Symbolic verification

Formal verification without a separate language. We use it when an invariant is critical enough that fuzz counterexamples aren't sufficient — think reserves-greater-than-liabilities properties.
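As a rough illustration of "formal verification without a separate language": Halmos runs Solidity test functions (conventionally prefixed `check_`) with symbolic rather than sampled inputs. The toy pool below is an assumption for the sake of the example, not a real target.

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.0;

// Illustrative toy -- a pool whose reserves should always cover liabilities.
contract SimplePool {
    uint256 public reserves;
    uint256 public liabilities;

    function mint(uint256 amount) external {
        reserves += amount;
        liabilities += amount;
    }
}

contract SolvencyCheck {
    SimplePool pool = new SimplePool();

    // Halmos treats `amount` as a symbolic value: a pass here is a
    // proof over every uint128 input, not a sample of random ones.
    function check_reservesCoverLiabilities(uint128 amount) public {
        pool.mint(amount);
        assert(pool.reserves >= pool.liabilities);
    }
}
```

The point of reaching for Halmos is exactly this shift: a fuzzer samples the input space and can miss a counterexample; a symbolic run covers it exhaustively for the bounded types involved.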

Echidna — Property-based fuzzer

Where Foundry's fuzzer runs out of depth. Echidna's Haskell core finds sequences of calls that violate properties; we reach for it on stateful bugs the invariant runner couldn't provoke.
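For readers unfamiliar with Echidna's shape: it calls a contract's public state-changing functions in random sequences and checks that every boolean function prefixed `echidna_` keeps returning true. The token below is a made-up example, not client code.

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.0;

// Illustrative Echidna property contract. Echidna explores random
// sequences of `transfer` calls and flags any sequence after which an
// echidna_-prefixed property returns false.
contract TokenEchidnaTest {
    uint256 internal constant INITIAL_SUPPLY = 1_000_000;
    uint256 internal supply = INITIAL_SUPPLY;
    mapping(address => uint256) internal balances;

    constructor() {
        balances[msg.sender] = supply;
    }

    function transfer(address to, uint256 amount) public {
        require(balances[msg.sender] >= amount);
        balances[msg.sender] -= amount;
        balances[to] += amount;
    }

    // Property: no sequence of calls should change total supply.
    function echidna_supply_constant() public view returns (bool) {
        return supply == INITIAL_SUPPLY;
    }
}
```

This sequence-of-calls exploration is what "stateful bugs the invariant runner couldn't provoke" means in practice: the violation often only appears after a specific ordering of operations.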

Claude Code — Agent runtime

The glue. Drafts candidate tests from specs, operates the stack above, and runs unattended work as Routines between engagements (see §2 and §3).

How we use AI agents

Concrete workflow. Every Test Suite Rebuild (Tier 2) and every Fractional retainer (Tier 3) runs this loop:

Step 1 — PRD / invariant spec. The engineer writes a plain-English list of the properties that should always hold. ("The pool's token reserves should always equal the sum of user balances minus pending withdrawals." "No user should be able to withdraw more than their share of LP tokens represents.") This is the most valuable human work in the process — we don't let the model draft it.

Step 2 — Claude Code drafts candidate Foundry invariants. We feed the model the PRD plus the relevant Solidity file and ask for a first-draft invariant test. We prompt explicitly for properties the model is unsure about, not just the obvious ones. Output looks like this:

test/invariants/PoolReserves.invariant.t.sol
// Generated by Claude Code from PRD §3.2: "Pool reserves never diverge
// from bookkeeping." Human-reviewed by Bran on 2026-04-14. See commit
// 3f91a2c for the discussion on edge case 4 (zero-amount donation).
contract PoolReservesInvariant is StdInvariant, Test {
    Pool pool;
    PoolHandler handler;

    function setUp() public {
        pool = new Pool();
        handler = new PoolHandler(pool);
        targetContract(address(handler));
    }

    // INVARIANT: reserves == sum(balances) - pendingWithdrawals
    function invariant_reservesMatchBookkeeping() public {
        assertEq(
            pool.reserves(),
            handler.sumBalances() - pool.pendingWithdrawals()
        );
    }
}
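The test above leans on the standard Foundry handler pattern: the invariant runner calls the handler's public functions in random sequences, and the handler maintains ghost bookkeeping (`sumBalances`) for the invariant to compare against. What a matching handler might look like, assuming a `Pool` exposing `deposit` and `requestWithdrawal` (the function names and bounds here are illustrative, not the shipped code):

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.0;

import {StdUtils} from "forge-std/StdUtils.sol";

// Assumed interface of the contract under test -- illustrative only.
interface IPool {
    function deposit(uint256 amount) external;
    function requestWithdrawal(uint256 amount) external;
}

contract PoolHandler is StdUtils {
    IPool public pool;
    uint256 public sumBalances; // ghost variable mirroring net deposits

    constructor(IPool _pool) {
        pool = _pool;
    }

    function deposit(uint256 amount) external {
        amount = bound(amount, 0, 1e24); // keep fuzzed inputs in a sane range
        pool.deposit(amount);
        sumBalances += amount;
    }

    function withdraw(uint256 amount) external {
        amount = bound(amount, 0, sumBalances);
        pool.requestWithdrawal(amount);
        sumBalances -= amount;
    }
}
```

Bounding inputs in the handler, rather than rejecting them in the test, keeps the runner's call budget spent on meaningful sequences instead of reverts.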

Step 3 — Human review before merge. We read every generated test line-by-line. About 40% of drafts ship unchanged, 40% need edits (wrong helper, missed edge case, ambiguous invariant), 20% get rejected outright because the model misread the spec. Rejections are fed back as counterexamples in the prompt for the next round.

Step 4 — The suite runs in CI, every push. We don't ship tests that only run on a developer's laptop. Every Gloxx-authored suite lands in CI from day one.
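One way such a CI gate can be wired up, sketched here as a GitHub Actions workflow. The action names and versions are assumptions for illustration, not a description of Gloxx's actual pipeline.

```yaml
# Illustrative CI gate: run the full Foundry suite on every push.
name: tests
on: push

jobs:
  foundry:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          submodules: recursive
      - uses: foundry-rs/foundry-toolchain@v1
      - run: forge build
      - run: forge test -vvv
```

The essential property is that the suite is a required status check on the deploy branch, so "runs in CI, every push" is enforced by the repository rather than by discipline.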

Our Claude Code operating protocol

Six principles that sit underneath §2. Any CTO can tell the difference between "someone who uses AI" and "someone who has an opinion about how to use it." These are ours.

Plan mode by default — "move slow to move fast."

Roughly 80% of any active Claude Code session is spent in Plan Mode before a single line of test code is generated. A locked-in plan makes execution nearly automatic and prevents "quick fix" failure modes — like the well-known case where Claude, asked to fix a UI display error, silently modified the underlying database values to match the expected output and corrupted the source of truth. On a protocol, that class of mistake is a migration rollback event. Plan first, execute second, always.

The interview prompt — surface assumptions before they become bugs.

Before generating anything, we run a fixed interview prompt. The goal is to drag every hidden assumption into the open before it becomes a regression.

"Before we start building, interview me about this. What are the core problems this solves? Who is this for? What does success look like? What should this NOT do? Summarize it back to me before you write any code."

Verification feedback loops — the 2–3× quality lever.

Every Claude Code session we run has a verification tool wired in — a headless browser, a test runner, a linter, or a direct Foundry/Slither invocation — and an explicit instruction to use that tool to confirm state before declaring a task complete. This single practice is the largest quality lever we've measured. For long sessions we close with a forced audit: "Go back and verify all of your work so far. Flag anything that skipped best practice or introduced risk."

Partitioned parallel sessions — avoid contextual fog.

For non-overlapping tasks we run multiple independent Claude Code contexts simultaneously rather than piling them into one window. Deep-dive sessions accumulate baggage that hides obvious solutions; a fresh window often sees what the long one missed. Two isolated contexts routinely beat one overloaded context on the same problem.

Minimalist CLAUDE.md — aggressive instruction hygiene.

We maintain the smallest possible instruction set per repo. Heavy prompt engineering becomes obsolete within ~6 months as the underlying model improves; we don't pay that tax. When a CLAUDE.md drifts into contradiction or bloat, we delete it and re-seed from zero, adding rules back only as the current model provably needs them.

Skills for inner loops — productize every repetitive task.

Recurring processes — release-gate reports, invariant-test templates, post-audit summary documents, compliance exports — get codified as Claude Skills once, then invoked by slash command. When a Skill needs to run without us (nightly, on a webhook, or via API), we deploy it as a scheduled or webhook-triggered Claude Code Routine. Every run produces an auditable session transcript by default — that's the evidence trail a blockchain-QA retainer needs.

Underpinning all six: context over prompt engineering. We spend our time feeding the model high-quality context — codebase, docs, system state, the spec on this page — rather than micro-tweaking prompts that will be obsolete by Q3. The "what" we feed the model is our moat; the "how" of any individual prompt is not.

Our test pyramid for smart contract protocols

Six layers. Most protocols we audit are top-heavy on unit tests and light on everything below. The gap between audits is where the lower layers should be built.

SCENARIO — ~5 tests
FUZZ (Echidna) — ~10 properties
INVARIANT (Foundry) — ~20 invariants
FORK — ~40 tests
INTEGRATION — ~80 tests
UNIT — ~200+ tests

Unit — function-level correctness. Cheap, fast, and the floor most teams already have. We don't rebuild these; we extend them.

Integration — multi-contract interactions. Where most missed bugs live. We add these aggressively during Rebuilds.

Fork — tests that run against a forked mainnet/testnet snapshot. Critical for anything that touches oracles, router contracts, or live AMMs.
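A minimal sketch of what a fork test looks like in Foundry, assuming an RPC alias named `mainnet` in `foundry.toml`; the oracle interface, address, and block number are placeholders, not real values.

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.0;

import {Test} from "forge-std/Test.sol";

// Assumed oracle interface -- illustrative only.
interface IPriceOracle {
    function latestPrice() external view returns (uint256);
}

contract OracleForkTest is Test {
    IPriceOracle oracle;

    function setUp() public {
        // Pin the test to one block of a live network so every run --
        // local or CI -- sees identical chain state.
        vm.createSelectFork("mainnet", 19_000_000);
        oracle = IPriceOracle(address(0x1234)); // placeholder address
    }

    function test_oracleReturnsSanePrice() public view {
        uint256 price = oracle.latestPrice();
        assertGt(price, 0);
    }
}
```

Pinning the block number is what makes these tests deterministic; an unpinned fork test silently changes behavior as the chain advances.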

Invariant — property-based tests via Foundry's invariant runner. The single highest-leverage layer for DeFi protocols. Ten good invariants beat a hundred unit tests.

Fuzz — Echidna/Halmos for sequences of calls the Foundry fuzzer can't provoke. We reach for these on stateful bugs only.

Scenario — end-to-end user flows replayed in deterministic forks. The "does this actually work in anger" layer. Five good scenarios beat fifty speculative ones.

Our release-gate checklist

We give this away. Every Fractional QA retainer uses a customized version of the checklist below as the gate between "code merged" and "code deployed." If you can't answer yes to all of these, you're not ready to ship.

Pre-deploy release gate (Gloxx standard)
  1. Has the full Foundry suite passed on the exact commit being deployed, not a close ancestor?
  2. Have Slither, Mythril, and Aderyn run on the deployed artifact? Were any net-new findings triaged?
  3. Do the invariant tests still hold under 10,000+ runs on a forked mainnet state?
  4. Has the storage layout been diffed against the currently-deployed version? Any slot collisions?
  5. If this is an upgradeable contract: has the initializer for the new version been tested from uninitialized state?
  6. Are gas regressions within tolerance vs. the previous version on the same benchmark suite?
  7. Does the deploy script re-run cleanly from a fresh state in CI, not just locally?
  8. Have all external oracles, routers, or integrations been exercised in a fork test against the target block number?
  9. Is there a documented, tested rollback path? Does it require multisig coordination we can reach right now?
  10. Is there a post-deploy monitoring plan for the first 48 hours, with owner + on-call assigned?
  11. Has the change been communicated to downstream integrators (aggregators, frontends, explorers) with sufficient notice?
  12. If a similar deploy has failed before on this team: has the specific failure mode been regression-tested?

What we refuse to do

Saying no to the wrong engagement is how we stay useful for the right one. These are non-negotiable.

  • We don't replace security audits. A security audit and an ongoing QA function are different products with different incentive structures. We pair with auditors; we don't substitute for them. When a client asks us to "just do the audit," we refer them to Trail of Bits, Sherlock, Spearbit, or Cantina and offer to run the QA layer alongside.
  • We don't ship untested AI-generated code. Every line of test code that bears a Gloxx name has been read, understood, and signed off by a human. No "AI security theater." If that constraint slows us down, so be it — the cost of one false-positive-green-light engagement is worse than any speed gain across a whole year.
  • We don't take clients on month-to-month without an Audit first. The Audit exists to make sure we actually know what we're signing up for. Skipping it means the retainer starts with guesses, and guesses on protocol QA are how teams get burned.
  • We don't bill by the hour on engagements with fixed deliverables. Audits, Rebuilds, and War Rooms are flat-fee. We take the timeline risk; you take the scope-change risk. This aligns incentives: we don't get paid for being slow.
  • We don't publish client work without written consent. Even sanitized. Even as a case study. The QA posture of a live protocol is sensitive information and we treat it that way.

Want a QA partner who works this way?

Book a free 30-minute assessment. We'll talk through what you're shipping and whether the Audit is the right first step. No pitch.

Book the call →