11 · Governance¶
← 10 Setup and stages · Contents · Next: 12 Roles →
Governance is what keeps the method honest when a team runs it at speed. It is the same regardless of which AI tool writes the code, because it lives in the process and the pipeline, not in the agent.
The autonomy ladder¶
How much the AI is allowed to do is not one switch; it is a setting chosen per area, rising with evidence and with your capacity to verify.
| Level | The AI… | A person… | Typical use |
|---|---|---|---|
| Suggest | proposes options | decides and writes | early, exploratory work |
| Draft-and-review | drafts artifacts | edits and approves each one | specs, scenarios, contracts |
| Generate-behind-gate | generates code | reviews the change; it merges only if the contract and tests pass | the normal build |
| Auto-with-evidence | generates and merges | samples and audits; auto-merge allowed only with a full evidence bundle attached | narrow, well-tested areas |
The governing rule, restated from the principles: operate only at the level your review capacity can sustain. If the AI produces more than the team can verify, drop a level.
The per-scope default is auto-with-evidence behind a one-approval decision point: the AI drafts the specification bundle, a human approves the frozen contract once, and the build auto-gates on evidence. You lower a scope toward draft-and-review or suggest wherever risk is high or evidence is thin — and a high-risk or method-defining scope is always lowered (it is never auto-run). The default sets where you start; review capacity and risk set where you stay.
The engine expresses this per task as an explicit three-rung level — autonomy: manual | conservative | auto, an ordered ladder manual < conservative < auto declared in the TASK.md header and reviewed at the freeze. auto is auto-with-evidence behind the one approval (the seeded default); conservative is the deliberate lowering that keeps a person at the verify gate; manual is the strict floor where the human owns the gate and nothing auto-resolves. A high-risk or method-defining scope refuses an unguarded auto (unguarded_high_risk_auto) — it must be lowered to conservative or manual. The prose here and that engine token are one rule: prose ≡ enforcement.
Autonomy is earned by goal-clarity — the auto-ready goal. The autonomy level decides who resolves Verify; an auto-ready goal decides whether a self-verifying run is even meaningful. A milestone goal is auto-ready when every exit criterion cites a verifier — (verify: <test | command | metric>) — so the engine can check the result against the goal without human judgment. add.py check raises a goal_not_auto_ready WARN (never red, the active milestone only) while the goal has criteria not all cited, and status surfaces a goal-ready: line every session, so the goal-clarity gap stays visible. The WARN measures, it never blocks: it changes neither the freeze gate nor the autonomy level — clarifying the goal is the prerequisite that earns trust, not a new gate (a zero-criteria goal reads not-auto-ready and is milestone-shaping's nudge, not this one's). The lint raises the floor — a citation slot per criterion — but cannot prove the citation is honest: a human can still write (verify: it works), and closing that is a person's judgment, not the engine's.
The gate-fail protocol and the three reports¶
Every checkpoint produces three short reports — Test (does it pass?), Quality (is it well-made and conformant?), and Risk (what could go wrong, and who owns it?) — and resolves to exactly one outcome:
PASS— criteria met; proceed.RISK-ACCEPTED— proceed with a signed waiver carrying a named owner, a linked ticket, and an expiry. Allowed for non-security gaps only.HARD-STOP— cannot proceed. Triggered by any failing test or any security finding. A non-security limitation may proceed only with a signedRISK-ACCEPTEDrecord carrying an owner and an expiry; security is never waved through.
The rule behind the protocol is no silent skips. A report nobody is accountable for approving is just a document; an outcome with an owner is governance.
Why each step exists (institutional memory)¶
When someone proposes skipping a step "to go faster," this table is the answer:
| Step skipped | What happens | How you notice |
|---|---|---|
| Specify | the wrong thing gets built | shipped, but users do not use it |
| Scenarios | the feature is vague, edges missing | the AI keeps asking questions mid-build |
| Contract | interfaces drift | front, back, and AI disagree on shapes |
| Tests | AI code is uncontrollable | no way to know it is right but to test by hand |
| Verify (architecture check) | entropy explodes | the codebase is a tangle within months |
| Operate / loop | silent rot | the same incidents recur |
The continuous concerns¶
Four concerns are not steps but threads that run through every step, starting at project setup. Pulling them forward ("shifting left") is far cheaper than bolting them on at the end.
| Concern | Begins at | Enforced at the build gate by |
|---|---|---|
| Security | setup (secret scanning, dependency allow-list) | zero high-severity findings; every AI-suggested package verified to exist |
| Testing | the scenarios step | coverage must not decrease; no test weakened to pass |
| Observability | setup (logging/metric conventions) | instrumentation present; service objectives verified after release |
| Cost | setup (an AI-usage budget per task) | a task may not exceed its budget without escalation |
AI-specific governance¶
A method built on AI agents needs controls older methods did not:
- Pin the model. Record the model and version; re-check the prompt library before adopting an upgrade. AI output is non-deterministic, so provenance matters.
- Test the prompts. The reusable instructions in
playbook/are themselves artifacts: give each golden input/output cases, and re-check them when edited. A prompt that fails its check does not ship. - Guard the supply chain. No package outside the allow-list without human approval; verify each suggested package actually exists, to defeat the risk of an agent inventing a plausible name an attacker has registered.
- Track provenance and licensing. License-scan both generated and pulled-in code; keep a record of what the AI produced.
Metrics that matter — and the anti-metrics¶
Measure the scarce things:
- Contract stability — how rarely the frozen contracts change; high churn is genuinely expensive.
- Validated requirement coverage — the share of rules confirmed against real behavior.
- Review throughput — the team's verification capacity, which sets the safe autonomy level.
- Delivery and reliability — lead time, deployment frequency, change-failure rate, time to recover.
Do not optimize: lines of AI code generated, code-reuse percentage, prompt counts, or velocity measured in code volume. These count the cheap, disposable thing and create incentives to keep bad code to protect a number.
Profiles: one method, three intensities¶
| Express (startup) | Standard (most teams) | Regulated (audited) | |
|---|---|---|---|
| Steps | combine Specify + Scenarios into a one-page brief; light contract | full flow | full flow, all HARD-STOP |
| Scenarios | happy path only | happy + key alternatives | exhaustive, incl. compliance |
| Autonomy ceiling | generate-behind-gate from day one | up to auto-with-evidence | generate-behind-gate max; the AI never merges its own work |
| Gate default | RISK-ACCEPTED allowed |
PASS required to advance |
HARD-STOP; full audit trail |
Choose the profile deliberately — a startup spike and a banking system are not the same risk — and run different products at different profiles as appropriate. The choice is owned by the delivery lead (see 12 Roles).