11 · Governance

Governance is what keeps the method honest when a team runs it at speed. It is the same regardless of which AI tool writes the code, because it lives in the process and the pipeline, not in the agent.


The autonomy ladder

How much the AI is allowed to do is not one switch; it is a setting chosen per area, rising with evidence and with your capacity to verify.

| Level | The AI… | A person… | Typical use |
|---|---|---|---|
| Suggest | proposes options | decides and writes | early, exploratory work |
| Draft-and-review | drafts artifacts | edits and approves each one | specs, scenarios, contracts |
| Generate-behind-gate | generates code | reviews the change; it merges only if the contract and tests pass | the normal build |
| Auto-with-evidence | generates and merges | samples and audits; auto-merge allowed only with a full evidence bundle attached | narrow, well-tested areas |

The governing rule, restated from the principles: operate only at the level your review capacity can sustain. If the AI produces more than the team can verify, drop a level.
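The rule can be sketched in a few lines of Python. This is a minimal illustration, not part of the method's tooling: the names Level and sustainable_level, and the idea of counting produced and verifiable changes per review period, are assumptions made for the example.

```python
from enum import IntEnum


class Level(IntEnum):
    """The four rungs of the autonomy ladder, lowest first."""
    SUGGEST = 1
    DRAFT_AND_REVIEW = 2
    GENERATE_BEHIND_GATE = 3
    AUTO_WITH_EVIDENCE = 4


def sustainable_level(current: Level, produced: int, verifiable: int) -> Level:
    """Drop one rung when the AI produces more than the team can verify.

    `produced` and `verifiable` are hypothetical counts of changes per
    review period; how a team measures them is left open.
    """
    if produced > verifiable and current > Level.SUGGEST:
        return Level(current - 1)
    return current
```

For instance, a scope at AUTO_WITH_EVIDENCE that produces 30 changes while the team can verify 20 would fall back to GENERATE_BEHIND_GATE; a scope already at SUGGEST stays where it is.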

The per-scope default is auto-with-evidence behind a one-approval decision point: the AI drafts the specification bundle, a human approves the frozen contract once, and the build auto-gates on evidence. You lower a scope toward draft-and-review or suggest wherever risk is high or evidence is thin — and a high-risk or method-defining scope is always lowered (it is never auto-run). The default sets where you start; review capacity and risk set where you stay.

The engine expresses this per task as an explicit three-rung level, autonomy: manual | conservative | auto, an ordered ladder (manual < conservative < auto) declared in the TASK.md header and reviewed at the freeze:

  • auto is auto-with-evidence behind the one approval (the seeded default).
  • conservative is the deliberate lowering that keeps a person at the verify gate.
  • manual is the strict floor where the human owns the gate and nothing auto-resolves.

A high-risk or method-defining scope refuses an unguarded auto (unguarded_high_risk_auto); it must be lowered to conservative or manual. The prose here and that engine token are one rule: prose ≡ enforcement.
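A minimal sketch of the ordered ladder and the high-risk refusal, assuming nothing about how the engine is actually written. Only the level names, their ordering, and the unguarded_high_risk_auto token come from the text above; the Autonomy class, the check_autonomy function, and the high_risk flag are hypothetical.

```python
from enum import IntEnum


class Autonomy(IntEnum):
    """Three-rung ladder: manual < conservative < auto."""
    MANUAL = 0
    CONSERVATIVE = 1
    AUTO = 2


def check_autonomy(level: str, high_risk: bool) -> list[str]:
    """Return finding tokens for a declared autonomy level.

    `high_risk` is a hypothetical flag standing in for however a scope
    is classified as high-risk or method-defining.
    """
    declared = Autonomy[level.upper()]  # KeyError on an unknown level
    if high_risk and declared is Autonomy.AUTO:
        return ["unguarded_high_risk_auto"]
    return []
```

Under these assumptions, check_autonomy("auto", high_risk=True) reports the refusal, while the same scope declared as conservative or manual reports nothing.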

Autonomy is earned by goal-clarity: the auto-ready goal. The autonomy level decides who resolves Verify; an auto-ready goal decides whether a self-verifying run is even meaningful. A milestone goal is auto-ready when every exit criterion cites a verifier, written (verify: <test | command | metric>), so the engine can check the result against the goal without human judgment.

add.py check raises a goal_not_auto_ready WARN (never red, and for the active milestone only) while any criterion of the goal lacks a citation, and status surfaces a goal-ready: line every session, so the goal-clarity gap stays visible. The WARN measures; it never blocks. It changes neither the freeze gate nor the autonomy level: clarifying the goal is the prerequisite that earns trust, not a new gate. (A zero-criteria goal reads not-auto-ready and is milestone-shaping's nudge, not this one's.)

The lint raises the floor by requiring a citation slot per criterion, but it cannot prove the citation is honest: a human can still write (verify: it works), and closing that gap is a person's judgment, not the engine's.
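The auto-ready test can be approximated as a pattern match over the exit criteria. This is a sketch only: the real parsing inside add.py is not shown here, and the regular expression and the function names goal_auto_ready and uncited are assumptions.

```python
import re

# Matches a non-empty citation slot such as "(verify: pytest tests/test_login.py)".
# The exact syntax add.py accepts is an assumption.
VERIFY_SLOT = re.compile(r"\(verify:\s*[^\s)][^)]*\)")


def uncited(criteria: list[str]) -> list[str]:
    """Criteria that still lack a (verify: ...) slot."""
    return [c for c in criteria if not VERIFY_SLOT.search(c)]


def goal_auto_ready(criteria: list[str]) -> bool:
    """True only when there is at least one criterion and none is uncited."""
    return bool(criteria) and not uncited(criteria)
```

Note that goal_auto_ready(["It works (verify: it works)"]) returns True, which illustrates the limit described above: the check sees that a slot is filled, not that the citation is honest.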

The gate-fail protocol and the three reports

Every checkpoint produces three short reports — Test (does it pass?), Quality (is it well-made and conformant?), and Risk (what could go wrong, and who owns it?) — and resolves to exactly one outcome:

  • PASS — criteria met; proceed.
  • RISK-ACCEPTED — proceed with a signed waiver carrying a named owner, a linked ticket, and an expiry. Allowed for non-security gaps only.
  • HARD-STOP — cannot proceed. Triggered by any failing test or any security finding. A non-security limitation may proceed only with a signed RISK-ACCEPTED record carrying an owner and an expiry; security is never waved through.

The rule behind the protocol is no silent skips. A report nobody is accountable for approving is just a document; an outcome with an owner is governance.
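The three outcomes can be read as a small decision function. The sketch below is one possible encoding, not the method's implementation: the Waiver fields mirror the owner, ticket, and expiry named above, while the function name resolve, its parameters, and the choice to treat an expired or incomplete waiver as HARD-STOP are assumptions.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class Outcome(Enum):
    PASS = "PASS"
    RISK_ACCEPTED = "RISK-ACCEPTED"
    HARD_STOP = "HARD-STOP"


@dataclass
class Waiver:
    owner: str
    ticket: str
    expiry: date


def resolve(tests_pass: bool, security_findings: int, open_gaps: int,
            waiver: Optional[Waiver], today: date) -> Outcome:
    """Resolve a checkpoint to exactly one outcome."""
    # Failing tests and security findings can never be waived.
    if not tests_pass or security_findings > 0:
        return Outcome.HARD_STOP
    if open_gaps == 0:
        return Outcome.PASS
    # A non-security gap proceeds only under a complete, unexpired waiver
    # (treating an expired waiver as a stop is an assumption).
    if waiver and waiver.owner and waiver.ticket and waiver.expiry >= today:
        return Outcome.RISK_ACCEPTED
    return Outcome.HARD_STOP
```

Every path through the function returns an outcome, which is the "no silent skips" rule in code form: a gap with no owner cannot fall through to PASS.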

Why each step exists (institutional memory)

When someone proposes skipping a step "to go faster," this table is the answer:

| Step skipped | What happens | How you notice |
|---|---|---|
| Specify | the wrong thing gets built | shipped, but users do not use it |
| Scenarios | the feature is vague, edges missing | the AI keeps asking questions mid-build |
| Contract | interfaces drift | front, back, and AI disagree on shapes |
| Tests | AI code is uncontrollable | no way to know it is right but to test by hand |
| Verify (architecture check) | entropy explodes | the codebase is a tangle within months |
| Operate / loop | silent rot | the same incidents recur |

The continuous concerns

Four concerns are not steps but threads that run through every step, starting at project setup. Pulling them forward ("shifting left") is far cheaper than bolting them on at the end.

| Concern | Begins at | Enforced at the build gate by |
|---|---|---|
| Security | setup (secret scanning, dependency allow-list) | zero high-severity findings; every AI-suggested package verified to exist |
| Testing | the scenarios step | coverage must not decrease; no test weakened to pass |
| Observability | setup (logging/metric conventions) | instrumentation present; service objectives verified after release |
| Cost | setup (an AI-usage budget per task) | a task may not exceed its budget without escalation |
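The mechanical parts of these four checks could be combined into a single build-gate function, sketched below. The BuildFacts fields and the gate_failures name are hypothetical, and two rules from the table are deliberately left out because they need judgment or post-release data: "no test weakened to pass" and "service objectives verified after release".

```python
from dataclasses import dataclass


@dataclass
class BuildFacts:
    """Hypothetical inputs a pipeline would collect before the gate."""
    high_severity_findings: int
    unverified_packages: int
    coverage_before: float
    coverage_after: float
    instrumented: bool
    spend: float
    budget: float
    escalated: bool = False


def gate_failures(f: BuildFacts) -> list[str]:
    """Return the continuous concerns that fail at the build gate."""
    failures = []
    if f.high_severity_findings > 0 or f.unverified_packages > 0:
        failures.append("security")
    if f.coverage_after < f.coverage_before:
        failures.append("testing")
    if not f.instrumented:
        failures.append("observability")
    if f.spend > f.budget and not f.escalated:
        failures.append("cost")
    return failures
```

An empty list means the gate is clear on these checks; anything else names the concern to fix before the build proceeds.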

AI-specific governance

A method built on AI agents needs controls older methods did not:

  • Pin the model. Record the model and version; re-check the prompt library before adopting an upgrade. AI output is non-deterministic, so provenance matters.
  • Test the prompts. The reusable instructions in playbook/ are themselves artifacts: give each golden input/output cases, and re-check them when edited. A prompt that fails its check does not ship.
  • Guard the supply chain. No package outside the allow-list without human approval; verify that each suggested package actually exists, so that a plausible name invented by an agent, and possibly registered by an attacker, never reaches the build.
  • Track provenance and licensing. License-scan both generated and pulled-in code; keep a record of what the AI produced.
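The supply-chain control can be sketched as a classification of each AI-suggested package. This is an illustration under assumptions: vet_packages and its verdict strings are hypothetical, and the exists callable stands in for a lookup against whichever package registry the project uses.

```python
from typing import Callable, Iterable


def vet_packages(suggested: Iterable[str], allow_list: set[str],
                 exists: Callable[[str], bool]) -> dict[str, str]:
    """Classify each AI-suggested package name.

    `exists` is an injected, hypothetical registry lookup; it is passed
    in so the sketch runs without network access.
    """
    verdicts = {}
    for name in suggested:
        if not exists(name):
            # The agent invented a name that nobody has published.
            verdicts[name] = "reject: not found in registry"
        elif name not in allow_list:
            # Published, but possibly by an attacker: a person decides.
            verdicts[name] = "hold: needs human approval"
        else:
            verdicts[name] = "ok"
    return verdicts
```

The two checks cover different failures: the existence check catches names that were never published, and the allow-list catches names that exist but were never approved, including ones an attacker registered to match a plausible invention.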

Metrics that matter — and the anti-metrics

Measure the scarce things:

  • Contract stability — how rarely the frozen contracts change; high churn is genuinely expensive.
  • Validated requirement coverage — the share of rules confirmed against real behavior.
  • Review throughput — the team's verification capacity, which sets the safe autonomy level.
  • Delivery and reliability — lead time, deployment frequency, change-failure rate, time to recover.

Do not optimize: lines of AI code generated, code-reuse percentage, prompt counts, or velocity measured in code volume. These count the cheap, disposable thing and create incentives to keep bad code to protect a number.

Profiles: one method, three intensities

| Aspect | Express (startup) | Standard (most teams) | Regulated (audited) |
|---|---|---|---|
| Steps | combine Specify + Scenarios into a one-page brief; light contract | full flow | full flow, all HARD-STOP |
| Scenarios | happy path only | happy + key alternatives | exhaustive, incl. compliance |
| Autonomy ceiling | generate-behind-gate from day one | up to auto-with-evidence | generate-behind-gate max; the AI never merges its own work |
| Gate default | RISK-ACCEPTED allowed | PASS required to advance | HARD-STOP; full audit trail |

Choose the profile deliberately — a startup spike and a banking system are not the same risk — and run different products at different profiles as appropriate. The choice is owned by the delivery lead (see 12 Roles).
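The autonomy ceiling of each profile can be expressed as a clamp on the requested level. The sketch below takes the ceilings from the profile table; the lowercase profile keys, the LADDER and CEILING names, and the effective_level function are assumptions made for the example.

```python
# The four-level autonomy ladder, lowest first.
LADDER = [
    "suggest",
    "draft-and-review",
    "generate-behind-gate",
    "auto-with-evidence",
]

# Autonomy ceiling per profile, as given in the profile table.
CEILING = {
    "express": "generate-behind-gate",
    "standard": "auto-with-evidence",
    "regulated": "generate-behind-gate",
}


def effective_level(profile: str, requested: str) -> str:
    """Clamp a requested autonomy level to the profile's ceiling."""
    cap = LADDER.index(CEILING[profile])
    return LADDER[min(LADDER.index(requested), cap)]
```

A regulated product that requests auto-with-evidence is held at generate-behind-gate, while a request below the ceiling passes through unchanged.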