mirror of
https://github.com/garrytan/gstack.git
synced 2026-05-06 13:45:35 +02:00
test(harness): require ## Decisions section under --disallowedTools plan_ready
Adversarial review (during /ship Step 11) found that the previous gate-test envelope ['asked', 'plan_ready'] for the AskUserQuestion-blocked regression cases accepted the bug they exist to catch: a model that silently skips Step 0 entirely (writes a plan with no questions, no `## Decisions to confirm` section, just ExitPlanModes) reaches plan_ready and passes.

The fix tightens the contract in two layers:

1. Harness: PlanSkillObservation gains a `planFile?: string` field populated when outcome is plan_ready. extractPlanFilePath() walks the visible TTY buffer for "Plan saved to:", "Plan file:", or ".claude/plans/<name>.md" patterns and resolves tilde to absolute. planFileHasDecisionsSection() reads the resolved file and returns true if it contains a `## Decisions` heading (any form: "to confirm", "needed", etc.).

2. Tests: 5 of 6 regression cases now require, when outcome is plan_ready, that obs.planFile is set AND planFileHasDecisionsSection returns true. Otherwise the test fails with a "Step 0 was silently skipped" diagnosis. plan-design-review remains the sole exception: it legitimately short-circuits to plan_ready on no-UI-scope branches and we have no deterministic way to distinguish that from a silent skip.

This closes the loophole the adversarial review identified. The fix preamble flow already tells the model to write `## Decisions to confirm` when neither AUQ variant is callable; now the test verifies the model actually did it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
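
The two harness helpers described in the commit message are not part of the diff shown here. The following is a minimal sketch of how such helpers might work, not the actual code in claude-pty-runner. The function names ending in `Sketch`, the regular expressions, and the `homeDir` parameter are all assumptions made for illustration, and the heading check takes the file contents as a string instead of reading from disk.

```typescript
// Hypothetical sketch only: NOT the real helpers from ./helpers/claude-pty-runner.
// All names and regexes below are assumptions for illustration.

/**
 * Look for a plan file path in the visible terminal text.
 * A leading "~/" is expanded against the supplied home directory.
 */
function extractPlanFilePathSketch(visible: string, homeDir: string): string | undefined {
  const patterns: RegExp[] = [
    /Plan saved to:\s*(\S+\.md)/,
    /Plan file:\s*(\S+\.md)/,
    /(\S*\.claude\/plans\/[\w.-]+\.md)/,
  ];
  for (const pattern of patterns) {
    const raw = visible.match(pattern)?.[1];
    if (raw) {
      // Resolve "~/..." to an absolute path under homeDir.
      return raw.startsWith('~/') ? homeDir + raw.slice(1) : raw;
    }
  }
  return undefined;
}

/**
 * True if the markdown contains a level-2 heading that starts with
 * "Decisions" ("## Decisions to confirm", "## Decisions needed", ...).
 */
function hasDecisionsHeadingSketch(markdown: string): boolean {
  return /^##\s+Decisions\b/m.test(markdown);
}
```

Keeping the heading check as a pure function over a string makes it easy to test without touching the file system; a real helper would read the resolved plan file first and then apply a check of this kind.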
@@ -10,7 +10,7 @@
  */
 
 import { describe, test, expect } from 'bun:test';
-import { runPlanSkillObservation } from './helpers/claude-pty-runner';
+import { runPlanSkillObservation, planFileHasDecisionsSection } from './helpers/claude-pty-runner';
 
 const shouldRun = !!process.env.EVALS && process.env.EVALS_TIER === 'gate';
 const describeE2E = shouldRun ? describe : describe.skip;
@@ -61,6 +61,12 @@ describeE2E('plan-design-review plan-mode smoke (gate)', () => {
         `--- evidence (last 2KB visible) ---\n${obs.evidence}`,
       );
     }
+    // plan-design-review legitimately short-circuits to plan_ready on no-UI
+    // branches. Allow plan_ready WITHOUT a decisions section ONLY if the
+    // plan file genuinely has no UI scope (we don't have a deterministic way
+    // to check this from the test, so this skill keeps the looser envelope).
+    // Other plan-mode skills require the decisions section under
+    // --disallowedTools; design is the special case.
     expect(['asked', 'plan_ready']).toContain(obs.outcome);
   }, 360_000);
 });
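
The diff above shows only the plan-design-review exception; the five stricter regression cases described in the commit message are not shown. As a rough sketch of the rule they enforce, the check could be written as a pure function like the one below. The `ObservationSketch` type, the `strictGateFailure` name, and the injected `hasDecisionsSection` callback are hypothetical and stand in for the real PlanSkillObservation and planFileHasDecisionsSection.

```typescript
// Hypothetical sketch only: the real PlanSkillObservation has more fields,
// and the real tests may be structured differently.
interface ObservationSketch {
  outcome: string;
  planFile?: string;
}

/**
 * Returns null when the observation satisfies the stricter contract,
 * otherwise a human-readable failure reason.
 */
function strictGateFailure(
  obs: ObservationSketch,
  hasDecisionsSection: (planFile: string) => boolean,
): string | null {
  // The model asked its questions: nothing further to verify.
  if (obs.outcome === 'asked') return null;
  if (obs.outcome !== 'plan_ready') return `unexpected outcome: ${obs.outcome}`;
  // plan_ready is acceptable only with a plan file that has the section.
  if (obs.planFile && hasDecisionsSection(obs.planFile)) return null;
  return 'Step 0 was silently skipped: plan_ready without a ## Decisions section';
}
```

In a test, a non-null return value would be turned into a failure message; the looser plan-design-review case skips this check and keeps the plain `['asked', 'plan_ready']` envelope.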