Home Services Approach Institute About Contact Book a call

Institute/Maturity Model/Self-Assessment

Readiness Self-Assessment · Free tier

Where does your team actually sit?

Thirty questions across the Institute's six workflows. Answer honestly — yes only if the artifact exists on disk, partial if it's in flight, no if it's a wish. Five minutes. You'll get your AI-QA maturity level back, your weakest workflow flagged, and a concrete picture of what the next level looks like. The same rubric the paid readiness audit anchors to. (If you haven't read the rubric yet, start there.)

Who's taking the assessment.

We email your full breakdown so you can share it with your team. We don't put you on a sales sequence — one human, one reply, only if you ask for it.

Eval coverage

Do AI features have version-controlled eval suites? Do they cover correctness AND safety/refuse behavior?

01

We have at least one automated eval for an AI feature in production.

02

Every shipped AI feature has its own dedicated eval suite (not shared, not generic).

03

Eval suites cover BOTH correctness (does it produce the right output?) AND safety/refuse (does it not produce wrong outputs?).

04

Eval suites are version-controlled alongside the code they test, with a documented baseline metric per suite.

05

We have failure-mode-specific eval files (faithfulness, refusal-correctness, length-bound, citation), not just general accuracy.

Release gating

Do evals run automatically? Does a threshold breach actually block release, or is it advisory?

06

We can run AI evals locally before merging.

07

CI runs evals on every PR that touches AI-feature code.

08

A documented threshold exists per AI feature — the score below which the team agrees the feature isn't ready.

09

Release blocks if eval scores drop below threshold — enforced by CI, not advisory.

10

We have a documented override policy with audit trail for failed eval gates (who overrode, when, why).

Drift monitoring

Are production inputs/outputs logged, sampled, and continuously evaluated? Do drift alerts page someone?

11

We log every AI-feature input and output in production.

12

We sample production traffic for ongoing eval review (not just CI).

13

We compare production eval scores against dev/CI eval scores to detect drift.

14

Drift alerts page on-call when production eval scores degrade below threshold.

15

We run online evals on production traffic continuously, not just sampled batches.

Failure taxonomy

Is there a written list of known failure modes? Do incidents get tagged and added to eval suites?

16

We have a written list of known AI failure modes for our product (hallucination, refusal-mismatch, drift, etc.).

17

Each AI-related production incident gets a written postmortem.

18

Each incident is tagged against a failure-mode taxonomy (so we can count "we had 3 hallucination incidents this quarter").

19

New failure modes always get added to the eval suite as part of the incident response — not "later when we have time."

20

We review and update the failure taxonomy at least monthly.

Feedback loops

Can users report bad outputs in-product? Does the report flow into a triage queue and onward into evals?

21

Users can report bad AI outputs (in-product widget or via support).

22

Reported bad outputs flow into a triage queue with a named owner.

23

Triaged bad outputs get added to eval suites as new test cases.

24

We measure time-from-report-to-eval-coverage as a tracked metric.

25

Eval improvements ship in a regular weekly cadence — not ad-hoc batches when someone has time.

Policy & refuse behavior

Is there a written refuse policy? Reflected in the system prompt? Tested by automated evals? Reviewed by legal?

26

We have a written list of things our AI must refuse to do (out-of-scope, unsafe, regulatory).

27

The refuse policy is reflected in the system prompt (not just tribal knowledge).

28

Refuse behavior is tested by automated evals (the model is failed if it answers a question it should refuse).

29

The refuse policy has been reviewed by legal or compliance.

30

Compliance reviews are documented, dated, and re-run on a recurring cadence.

By submitting, you'll get your level emailed to the address above. We won't put you on a sequence.