test(harness): require ## Decisions section under --disallowedTools plan_ready

Adversarial review (during /ship Step 11) found that the previous gate-test
envelope ['asked', 'plan_ready'] for the AskUserQuestion-blocked regression
cases accepted the bug they exist to catch: a model that silently skips
Step 0 entirely (writes a plan with no questions, no `## Decisions to
confirm` section, just ExitPlanModes) reaches plan_ready and passes.

The fix tightens the contract in two layers:

1. Harness: PlanSkillObservation gains a `planFile?: string` field
   populated when outcome is plan_ready. extractPlanFilePath() walks the
   visible TTY buffer for "Plan saved to:", "Plan file:", or
   ".claude/plans/<name>.md" patterns and resolves tilde to absolute.
   planFileHasDecisionsSection() reads the resolved file and returns true
   if it contains a `## Decisions` heading (any form: "to confirm",
   "needed", etc.).

2. Tests: 5 of 6 regression cases now require, when outcome is plan_ready,
   that obs.planFile is set AND planFileHasDecisionsSection returns true.
   Otherwise the test fails with a "Step 0 was silently skipped" diagnosis.
   plan-design-review remains the sole exception — it legitimately
   short-circuits to plan_ready on no-UI-scope branches and we have no
   deterministic way to distinguish that from a silent skip.

This closes the loophole the adversarial review identified. The fix
preamble flow already tells the model to write `## Decisions to confirm`
when neither AUQ variant is callable — now the test verifies the model
actually did it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Garry Tan
2026-05-01 07:10:37 -07:00
parent d3ce41b2a6
commit 9ef34603df
7 changed files with 147 additions and 7 deletions
+19 -1
View File
@@ -34,7 +34,7 @@
*/
import { describe, test, expect } from 'bun:test';
import { runPlanSkillObservation } from './helpers/claude-pty-runner';
import { runPlanSkillObservation, planFileHasDecisionsSection } from './helpers/claude-pty-runner';
const shouldRun = !!process.env.EVALS && process.env.EVALS_TIER === 'gate';
const describeE2E = shouldRun ? describe : describe.skip;
@@ -110,6 +110,24 @@ describeE2E('plan-ceo-review plan-mode smoke (gate)', () => {
`--- evidence (last 2KB visible) ---\n${obs.evidence}`,
);
}
// plan_ready under --disallowedTools is only a pass when the model used
// the plan-file fallback (wrote a `## Decisions to confirm` section).
// Without that section, plan_ready means the model silently skipped Step 0
// and went straight to ExitPlanMode — the regression we're catching.
if (obs.outcome === 'plan_ready') {
if (!obs.planFile) {
throw new Error(
`plan-ceo-review AskUserQuestion-blocked regression: outcome=plan_ready but no plan file path detected in TTY output. Cannot verify the model used the fallback flow.\n` +
`--- evidence (last 2KB visible) ---\n${obs.evidence}`,
);
}
if (!planFileHasDecisionsSection(obs.planFile)) {
throw new Error(
`plan-ceo-review AskUserQuestion-blocked regression: model wrote ${obs.planFile} without a "## Decisions" section. Step 0 was silently skipped.\n` +
`--- evidence (last 2KB visible) ---\n${obs.evidence}`,
);
}
}
expect(['asked', 'plan_ready']).toContain(obs.outcome);
}, 360_000);
});