mirror of
https://github.com/garrytan/gstack.git
synced 2026-05-01 19:25:10 +02:00
a81be53621
* fix(preamble): reorder AskUserQuestion Format above model overlay + rewrite Opus 4.7 pacing directive
Root cause of plan-review regression (v1.6.4.0): model overlays rendered
ABOVE the pacing rule in every SKILL.md, so Opus 4.7 read "Batch your
questions" first and absorbed it as the ambient default. The overlay's
claimed subordination ("skill wins on pacing, always") didn't stick —
literal-interpretation mode reads physical order, not claimed hierarchy.
Part 1 of 4 (plan: ~/.claude/plans/system-instruction-you-are-working-polymorphic-twilight.md):
scripts/resolvers/preamble.ts
- Move generateAskUserFormat above generateModelOverlay in section array
- Comment explains why — prevents future refactors from silently reverting
model-overlays/opus-4-7.md
- Replace "Batch your questions" block with "Pace questions to the skill"
- New wording makes one-question-per-turn the default when the skill
contains STOP directives; batching becomes the explicit exception
Regenerated 30 SKILL.md files via bun run gen:skill-docs.
Verified:
- With --model opus-4-7: Format renders at line 359, Model-Specific
Patch at 373, "Pace questions" at 419 (Format comes first, overlay
second, pacing directive intact).
- bun test passes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
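The ordering invariant from Part 1 can be illustrated with a small check. This is a minimal sketch, assuming the rendered preamble is available as a plain string; the `rendersBefore` helper and both heading strings are hypothetical and are not taken from the gstack source.

```typescript
// Hypothetical helper: true when `first` appears before `second` in `doc`.
// Both markers must be present; a missing marker counts as a failure.
function rendersBefore(doc: string, first: string, second: string): boolean {
  const i = doc.indexOf(first);
  const j = doc.indexOf(second);
  return i !== -1 && j !== -1 && i < j;
}

// Assumed heading text, for illustration only.
const FORMAT_HEADING = '## AskUserQuestion Format';
const OVERLAY_HEADING = '## Model-Specific Behavioral Patch';

// Fixed order: Format first, overlay second.
const good = [FORMAT_HEADING, 'rules...', OVERLAY_HEADING, 'nudges...'].join('\n');
// Regressed order: overlay rendered above the Format section.
const regressed = [OVERLAY_HEADING, 'nudges...', FORMAT_HEADING, 'rules...'].join('\n');

console.log(rendersBefore(good, FORMAT_HEADING, OVERLAY_HEADING));      // true
console.log(rendersBefore(regressed, FORMAT_HEADING, OVERLAY_HEADING)); // false
```

Comparing physical positions rather than trusting a declared hierarchy mirrors the root cause above: the model reads the file top to bottom.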
* fix(plan-reviews): tighten STOP/escape-hatch directives across 4 templates
Part 2 of 4 (plan: ~/.claude/plans/system-instruction-you-are-working-polymorphic-twilight.md).
Codex caught that v1.6.3.0's reasoning collapsed on Opus 4.7: the old
escape-hatch wording ("If no issues or fix is obvious, state what
you'll do and move on — don't waste a question") let the literal
interpreter classify every finding as having an "obvious fix" and skip
AskUserQuestion entirely. Reviews became reports.
Per-template hardening (16 sites total, verified by rg):
plan-ceo-review/SKILL.md.tmpl (13 sites):
- 12 inline STOP directives: replace the full escape-hatch clause with
"zero findings → say so and proceed; findings → MUST call AskUserQuestion
as a tool_use, including for obvious fixes."
- 1 Escape hatch bullet in CRITICAL RULE section: tightened.
plan-eng-review, plan-design-review, plan-devex-review (1 site each):
- Each template's Escape hatch bullet tightened to match the new CEO wording,
adapted for each review's domain (issue/gap, decision/design/DX alternatives).
After regeneration: rg "don't waste a question" returns 0 across all
*SKILL.md.tmpl and *SKILL.md files. "zero findings, state" wording
present 16 times (matches prior count of escape-hatch sites).
bun test passes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
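The rg verification in Part 2 amounts to counting a literal phrase across files. A minimal sketch of that count over in-memory strings follows; the file names and contents are made up for illustration, and `countPhrase` / `countAcross` are hypothetical helpers, not part of the repository.

```typescript
// Count non-overlapping occurrences of a literal phrase in one string.
function countPhrase(text: string, phrase: string): number {
  if (phrase.length === 0) return 0;
  return text.split(phrase).length - 1;
}

// Sum occurrences across a map of file name -> file contents.
function countAcross(files: Record<string, string>, phrase: string): number {
  let total = 0;
  for (const name of Object.keys(files)) {
    total += countPhrase(files[name], phrase);
  }
  return total;
}

// Made-up contents standing in for generated templates.
const files: Record<string, string> = {
  'a/SKILL.md.tmpl': 'STOP. zero findings, state so and proceed.\nSTOP. zero findings, state so.',
  'b/SKILL.md.tmpl': 'Escape hatch: zero findings, state so and proceed.',
};

console.log(countAcross(files, "don't waste a question")); // 0
console.log(countAcross(files, 'zero findings, state'));   // 3
```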
* feat(preamble): upgrade AskUserQuestion format to Pros/Cons decision brief
Part 4 of 4 (plan: ~/.claude/plans/system-instruction-you-are-working-polymorphic-twilight.md).
Every AskUserQuestion now renders as a decision brief, not a bullet list:
D-numbered header, ELI10, Stakes-if-we-pick-wrong, Recommendation, Pros/Cons
with ✅/❌ markers per option, closing Net: tradeoff synthesis.
scripts/resolvers/preamble/generate-ask-user-format.ts
- Full rewrite. Preserves prior rules (Re-ground, ELI10, Recommend,
Completeness, Options) and adds:
- D-numbering per skill invocation (model-level, not runtime state)
- Stakes line (pain avoided / capability unlocked / consequence named)
- Pros/Cons block with min 2 ✅ + 1 ❌ per option, min 40 chars/bullet
- Hard-stop escape: "✅ No cons — this is a hard-stop choice" for
genuine one-sided choices (destructive-action confirmations)
- Neutral-posture handling (CT1-compliant): (recommended) label
STAYS on default option to preserve AUTO_DECIDE contract; neutrality
expressed as prose in Recommendation line only
- Net line closes the decision with a one-sentence tradeoff frame
- Rule 11: tool_use mandate (prose "Question:" blocks don't count)
- Self-check list before emitting
test/skill-validation.test.ts
- Update format assertions to check for new Pros/Cons tokens
(Pros / cons:, Recommendation: <choice>, Net:, ELI10, Stakes if we
pick wrong:, ✅, ❌) across all tier-2+ skills
- Old "RECOMMENDATION: Choose" expectation removed (the new format uses
mixed-case "Recommendation:" with no literal "Choose")
test/skill-e2e-plan-format.test.ts
- Add v1.7.0.0 format token regexes (PROS_CONS_HEADER_RE, PRO_BULLET_RE,
CON_BULLET_RE, NET_LINE_RE, D_NUMBER_RE, STAKES_RE)
- Existing RECOMMENDATION_RE loosened to accept mixed-case "Recommendation:"
(canonical v1.7.0.0 form) alongside all-caps (legacy). Tests are
additive — the strict new-format gate is the upcoming cadence eval.
Regenerated 30 SKILL.md files via bun run gen:skill-docs.
Verified:
- bun test: 319 pass (1 pre-existing security-bench fixture oversize
failure on main, unrelated — confirmed via git stash test on main HEAD)
- New format tokens render in all tier-2+ skills (plan-ceo-review,
plan-eng-review, ship, office-hours verified)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
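The per-option rule from Part 4 (at least 2 ✅, at least 1 ❌, at least 40 characters per bullet) can be sketched as a validator. This is an illustration only: the `OptionBrief` shape and `checkOption` are hypothetical, the 40-character minimum is assumed to apply to the text after the marker, and the hard-stop escape is not modeled.

```typescript
// Hypothetical input shape: one option with its already-parsed bullets.
interface OptionBrief {
  label: string;
  bullets: string[]; // each bullet is assumed to start with "✅" or "❌"
}

// Returns a list of rule violations; an empty list means the option passes.
function checkOption(opt: OptionBrief, minChars: number = 40): string[] {
  const problems: string[] = [];
  const pros = opt.bullets.filter((b) => b.startsWith('✅'));
  const cons = opt.bullets.filter((b) => b.startsWith('❌'));
  if (pros.length < 2) problems.push(`${opt.label}: needs at least 2 pros`);
  if (cons.length < 1) problems.push(`${opt.label}: needs at least 1 con`);
  for (const b of opt.bullets) {
    const text = b.slice(1).trim(); // drop the one-character marker
    if (text.length < minChars) problems.push(`${opt.label}: bullet too short: "${text}"`);
  }
  return problems;
}

const optionA: OptionBrief = {
  label: 'A',
  bullets: [
    '✅ Shared across every pod, so all users see the same warm cache',
    '✅ Fails gracefully on a cache miss by recomputing the aggregate',
    '❌ Requires new Redis infrastructure plus cache invalidation logic',
  ],
};
const optionB: OptionBrief = {
  label: 'B',
  bullets: ['✅ No new infrastructure to run', '❌ Stale reads'],
};

console.log(checkOption(optionA).length); // 0
console.log(checkOption(optionB).length); // 3 (one pro short of the minimum, two short bullets)
```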
* test: gate-tier units + periodic Pros/Cons evals for AskUserQuestion format
Part 3 of 4 (plan: ~/.claude/plans/system-instruction-you-are-working-polymorphic-twilight.md).
Gate-tier (E1, free, runs on every `bun test`):
test/preamble-compose.test.ts — pins the composition order
Asserts AskUserQuestion Format section renders BEFORE Model-Specific
Behavioral Patch in tier-≥2 preamble output. Covers claude default,
opus-4-7 overlay, tier 2/3, and codex host. Catches any future edit
to scripts/resolvers/preamble.ts that silently reverts the order.
test/resolver-ask-user-format.test.ts — pins the Pros/Cons contract
14 assertions against generateAskUserFormat output: D<N>, ELI10,
Stakes if we pick wrong:, Recommendation: <choice>, Pros / cons:,
✅/❌ markers, min 2 pros + 1 con rules, hard-stop escape exact
phrase, neutral-posture CT1 rule ((recommended) label preserved for
AUTO_DECIDE), Completeness coverage-vs-kind, tool_use mandate
(rule 11), self-check list, D-numbering model-level caveat.
test/model-overlay-opus-4-7.test.ts — pins the pacing directive
Asserts raw overlay file + resolved overlay output contain "Pace
questions to the skill" and NOT "Batch your questions". Verifies
INHERIT:claude chain still works (Todo-list, subordination wrapper),
Fan out / Effort-match / Literal interpretation nudges preserved.
Also asserts claude base overlay does NOT carry the Opus-specific
pacing directive (no cross-contamination).
Periodic-tier (E2, Opus-dependent, ~$1-2/run):
test/skill-e2e-plan-prosons.test.ts — 4 cases extending v1.6.3.0 harness
1. Format positive — every token present when plan has real tradeoff
2. Hard-stop NEGATIVE — plan with genuine tradeoff must NOT dodge to
"No cons — hard-stop choice" escape
3. Neutral-posture NEGATIVE — plan where one option dominates must emit
(recommended) label + "because <reason>", must NOT dodge to
"taste call" / "no preference"
4. Hard-stop POSITIVE — destructive-action plan may legitimately use
the hard-stop escape
test/helpers/touchfiles.ts — entries for all new eval cases
Dependencies: overlay, preamble.ts, generate-ask-user-format.ts, and
the 4 plan-review templates. Diff-based selection triggers the evals
whenever those files change. Also added entries for 7 expanded-coverage
cases (ship, office-hours, investigate, qa, review, design-review,
document-release) — test cases will land in follow-up PRs per skill.
Follow-ups noted in test file header:
- True multi-turn cadence eval (3 findings → 3 distinct asks) — current
harness captures one $OUT_FILE per session; multi-turn capture needs
new harness support.
- Expanded-coverage test cases for the 7 non-plan-review skills.
Verified:
- bun test: 349 pass (30 new + 319 baseline), 1 pre-existing security-bench
oversize failure on main (unrelated, unchanged).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
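Diff-based selection, as described for test/helpers/touchfiles.ts, can be sketched as a lookup from test name to dependency paths. The shape below is an assumption for illustration: the real touchfiles format is not shown here, and this sketch uses plain path-prefix matching rather than whatever matching the helper actually performs.

```typescript
// Hypothetical shape: test name -> path prefixes the test depends on.
type Touchfiles = Record<string, string[]>;

// Select every test with at least one dependency touched by the diff.
function selectTests(touchfiles: Touchfiles, changed: string[]): string[] {
  return Object.keys(touchfiles)
    .filter((name) =>
      touchfiles[name].some((dep) => changed.some((file) => file.startsWith(dep))),
    )
    .sort();
}

const touchfiles: Touchfiles = {
  'plan-review-prosons-format': [
    'model-overlays/opus-4-7.md',
    'scripts/resolvers/preamble.ts',
    'plan-ceo-review/',
  ],
  'unrelated-eval': ['some-other-skill/'],
};

console.log(selectTests(touchfiles, ['model-overlays/opus-4-7.md'])); // [ 'plan-review-prosons-format' ]
console.log(selectTests(touchfiles, ['README.md']));                  // []
```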
* test: regenerate golden fixtures + update ELI10 phrase check for v1.7.0.0
Pros/Cons format rewrite (6b99df9d) changed the resolver output across all
tier-2+ SKILL.md files. Three golden-file regression tests in
test/host-config.test.ts and one phrase-check test in test/gen-skill-docs.test.ts
were failing as expected.
- test/fixtures/golden/claude-ship-SKILL.md
- test/fixtures/golden/codex-ship-SKILL.md
- test/fixtures/golden/factory-ship-SKILL.md
Regenerated via `bun run gen:skill-docs --host all` + cp into fixtures.
- test/gen-skill-docs.test.ts line 244: rename test from "ELI16 simplification
rules" to "ELI10 simplification rules" and match the new phrase pattern.
v1.7.0.0 uses "ELI10 (ALWAYS)" rather than legacy "Simplify (ELI10, ALWAYS)".
bun test: 744 pass, 1 fail (pre-existing security-bench fixture oversize,
unrelated to this branch).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* v1.7.0.0: plan reviews walk you through each issue with Pros/Cons
Restores AskUserQuestion cadence on Opus 4.7 (v1.6.4.0 regression) and
upgrades the format to a numbered decision brief — D<N> header, ELI10,
Stakes, Recommendation, per-option ✅/❌ bullets, Net: closing line.
Fix: composition reorder + overlay rewrite + 16-site escape-hatch hardening
across the 4 plan-review templates.
Feature: Pros/Cons format in the preamble resolver, inherited by every
tier-2+ skill automatically.
30 new gate-tier unit tests pin the format contract (runs in <100ms, $0).
4 new periodic-tier eval cases defend against escape-hatch abuse
(2 positive, 2 negative). Golden fixtures regenerated.
CEO + Eng + Codex reviews completed. 5 of 8 Codex findings incorporated;
CT2 (16 sites, not 31) and CT1 (AUTO_DECIDE contract break) were
load-bearing catches the primary reviews missed.
bun test: 774 pass, 1 fail (pre-existing security-bench oversize, unrelated).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* v1.10.0.0: bump VERSION (was v1.7.0.0, align with branch discipline)
Per user direction — jumping to 1.10.0.0 for versioning alignment.
No functional changes from the prior ship commit (5f038ab7). The
regression fix + Pros/Cons format are identical.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
353 lines
14 KiB
TypeScript
/**
 * v1.7.0.0 Pros/Cons format regression tests for plan reviews.
 *
 * Extends the v1.6.3.0 format harness (skill-e2e-plan-format.test.ts) with
 * four new cases covering the Pros/Cons decision-brief format:
 *
 *   1. Format positive — every AskUserQuestion renders with D<N> / ELI10 /
 *      Stakes / Recommendation / Pros/cons / ✅×2+ / ❌×1+ / Net tokens.
 *   2. Hard-stop positive — destructive-action question may use the single
 *      "No cons — this is a hard-stop choice" escape.
 *   3. Hard-stop NEGATIVE (CT2) — plan with genuine tradeoff, model must NOT
 *      dodge to the hard-stop escape. Forces real tradeoff articulation.
 *   4. Neutral-posture NEGATIVE (CT2) — plan with one clearly-dominant option,
 *      model must emit (recommended) label and concrete recommendation, NOT
 *      "no preference — taste call" dodge.
 *
 * Capture pattern matches existing harness: agent writes verbatim
 * AskUserQuestion text to $OUT_FILE; regex predicates run on the captured
 * file. Classified periodic (Opus 4.7 non-deterministic).
 *
 * FOLLOW-UP (not in v1.7.0.0):
 *   - True cadence eval (3 findings → 3 distinct asks across turns). Current
 *     $OUT_FILE harness captures ONE would-be question per session. Multi-turn
 *     cadence needs new harness support. Filed in TODOs.
 *   - Expanded coverage for /ship /office-hours /investigate /qa /review
 *     /design-review /document-release. Touchfiles entries already exist; eval
 *     cases will land as follow-up PRs per skill.
 */
import { describe, test, expect, beforeAll, afterAll } from 'bun:test';
import { runSkillTest } from './helpers/session-runner';
import {
  ROOT, runId,
  describeIfSelected, testConcurrentIfSelected,
  logCost, recordE2E,
  createEvalCollector, finalizeEvalCollector,
} from './helpers/e2e-helpers';
import { spawnSync } from 'child_process';
import * as fs from 'fs';
import * as path from 'path';
import * as os from 'os';

const evalCollector = createEvalCollector('e2e-plan-prosons');

// v1.7.0.0 format tokens
const D_NUMBER_RE = /D\d+\s+—/;
const ELI10_RE = /ELI10:/i;
const STAKES_RE = /Stakes if we pick wrong:/i;
const RECOMMENDATION_RE = /[Rr]ecommendation:/;
const PROS_CONS_HEADER_RE = /Pros\s*\/\s*cons:/i;
const NET_LINE_RE = /^Net:/m;
const HARD_STOP_ESCAPE_RE = /✅\s+No cons\s+—\s+this is a hard-stop choice/;
const NEUTRAL_POSTURE_RE = /taste call/i;
const RECOMMENDED_LABEL_RE = /\(recommended\)/;

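// Count literal occurrences of `char` in `text`. Splitting on the literal
// avoids building a RegExp from `char`, so regex metacharacters are never
// interpreted as pattern syntax.
function countChars(text: string, char: string): number {
  return text.split(char).length - 1;
}
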
const TRADEOFF_PLAN = `# Plan: Add user dashboard caching

## Context
Dashboard renders in 3s on cold load, 800ms on warm cache. Users complain.

## Approach options

### Option A: Redis cache layer (complete)
- Add Redis with 5min TTL for dashboard aggregates.
- Cold path: compute + cache. Warm path: fetch from cache.
- Needs Redis infra, cache invalidation logic for activity updates.
- Covers all users, all flows, fails gracefully on cache miss.

### Option B: In-memory LRU cache (happy path only)
- Per-process LRU with 100-entry cap.
- No cross-process sharing; cache warms per-pod.
- Skips cache invalidation; stale reads up to 5min.

Both options have real pros and cons. This is a genuine tradeoff.
`;

const HARDSTOP_PLAN = `# Plan: Delete all user sessions

## Context
Security incident. All active sessions need to be terminated immediately.

## Action
Run \`DELETE FROM sessions WHERE TRUE\`. No dry-run mode.

This is a one-way door. There is no "partial" version.
`;

const DOMINANT_PLAN = `# Plan: Add input validation to signup endpoint

## Context
Signup endpoint currently accepts any email string and any password length.
Bug report: users type gibberish, signup succeeds, they can't log in.

## Options

### Option A: Full RFC 5322 email validation + min 8-char password + server-side checks
- Catches malformed emails, rejects weak passwords, validated on server.
- Prevents the reported bug and adjacent bugs.
- Standard web practice.

### Option B: Client-side type="email" only, no password validation
- Only catches some browsers' built-in validation.
- Attackers bypass by disabling JS.
- Does not fix the reported bug.

Option A clearly dominates on coverage. This is NOT a taste call.
`;

function setupPlanDir(tmpPrefix: string, planContent: string, skillName: string): string {
  const planDir = fs.mkdtempSync(path.join(os.tmpdir(), tmpPrefix));
  const run = (cmd: string, args: string[]) =>
    spawnSync(cmd, args, { cwd: planDir, stdio: 'pipe', timeout: 5000 });

  run('git', ['init', '-b', 'main']);
  run('git', ['config', 'user.email', 'test@test.com']);
  run('git', ['config', 'user.name', 'Test']);

  fs.writeFileSync(path.join(planDir, 'plan.md'), planContent);
  run('git', ['add', '.']);
  run('git', ['commit', '-m', 'add plan']);

  fs.mkdirSync(path.join(planDir, skillName), { recursive: true });
  fs.copyFileSync(
    path.join(ROOT, skillName, 'SKILL.md'),
    path.join(planDir, skillName, 'SKILL.md'),
  );

  return planDir;
}

function captureInstruction(outFile: string): string {
  return `Write the verbatim text of the single AskUserQuestion you would have made to ${outFile} (full text including D<N> header, ELI10, Stakes, Recommendation, Pros/cons, and Net line — the complete rich markdown body). Do NOT call any tool to ask the user. Do NOT paraphrase. This is a format-capture test.`;
}

// --- Case 1: Format positive — all v1.7.0.0 tokens present ---

describeIfSelected('Plan Prosons — Format Positive', ['plan-review-prosons-format'], () => {
  let planDir: string;
  let outFile: string;

  beforeAll(() => {
    planDir = setupPlanDir('skill-e2e-plan-prosons-format-', TRADEOFF_PLAN, 'plan-ceo-review');
    outFile = path.join(planDir, 'ask-capture.md');
  });

  afterAll(() => {
    try { fs.rmSync(planDir, { recursive: true, force: true }); } catch {}
  });

  testConcurrentIfSelected('plan-review-prosons-format', async () => {
    const result = await runSkillTest({
      prompt: `Read plan-ceo-review/SKILL.md for the review workflow.

Read plan.md — two cache approaches with real tradeoffs. Pick the architectural approach via AskUserQuestion (Step 0C-bis / Implementation Alternatives). These options differ in coverage.

${captureInstruction(outFile)}

After writing the file, stop.`,
      workingDirectory: planDir,
      maxTurns: 10,
      timeout: 240_000,
      testName: 'plan-review-prosons-format',
      runId,
      model: 'claude-opus-4-7',
    });

    logCost('/plan-review prosons format positive', result);
    recordE2E(evalCollector, '/plan-review-prosons-format', 'Plan Prosons — Format Positive', result, {
      passed: ['success', 'error_max_turns'].includes(result.exitReason),
    });
    expect(['success', 'error_max_turns']).toContain(result.exitReason);

    expect(fs.existsSync(outFile)).toBe(true);
    const captured = fs.readFileSync(outFile, 'utf-8');
    expect(captured.length).toBeGreaterThan(200);

    // Every Pros/Cons token present
    expect(captured).toMatch(D_NUMBER_RE);
    expect(captured).toMatch(ELI10_RE);
    expect(captured).toMatch(STAKES_RE);
    expect(captured).toMatch(RECOMMENDATION_RE);
    expect(captured).toMatch(PROS_CONS_HEADER_RE);
    expect(captured).toMatch(NET_LINE_RE);

    // Pro/con bullet counts: ≥2 ✅ and ≥1 ❌ per option (total ≥4 ✅ and ≥2 ❌ for 2 options)
    expect(countChars(captured, '✅')).toBeGreaterThanOrEqual(4);
    expect(countChars(captured, '❌')).toBeGreaterThanOrEqual(2);

    // (recommended) label on one option
    expect(captured).toMatch(RECOMMENDED_LABEL_RE);
  }, 300_000);
});

// --- Case 2: Hard-stop escape NEGATIVE (CT2) ---

describeIfSelected('Plan Prosons — Hard-stop Negative', ['plan-review-prosons-hardstop-neg'], () => {
  let planDir: string;
  let outFile: string;

  beforeAll(() => {
    planDir = setupPlanDir('skill-e2e-plan-prosons-hardstop-neg-', TRADEOFF_PLAN, 'plan-ceo-review');
    outFile = path.join(planDir, 'ask-capture.md');
  });

  afterAll(() => {
    try { fs.rmSync(planDir, { recursive: true, force: true }); } catch {}
  });

  testConcurrentIfSelected('plan-review-prosons-hardstop-neg', async () => {
    const result = await runSkillTest({
      prompt: `Read plan-ceo-review/SKILL.md.

Read plan.md — this has REAL tradeoffs between Redis and in-memory caching (both have pros and cons). Pick the architectural approach via AskUserQuestion.

${captureInstruction(outFile)}

After writing the file, stop.`,
      workingDirectory: planDir,
      maxTurns: 10,
      timeout: 240_000,
      testName: 'plan-review-prosons-hardstop-neg',
      runId,
      model: 'claude-opus-4-7',
    });

    logCost('/plan-review prosons hard-stop negative', result);
    recordE2E(evalCollector, '/plan-review-prosons-hardstop-neg', 'Plan Prosons — Hard-stop Negative', result, {
      passed: ['success', 'error_max_turns'].includes(result.exitReason),
    });
    expect(['success', 'error_max_turns']).toContain(result.exitReason);

    expect(fs.existsSync(outFile)).toBe(true);
    const captured = fs.readFileSync(outFile, 'utf-8');
    expect(captured.length).toBeGreaterThan(200);

    // Genuine tradeoff — must NOT dodge to hard-stop escape.
    expect(captured).not.toMatch(HARD_STOP_ESCAPE_RE);
    // Must have real pros and cons (≥2 ✅ + ≥1 ❌ per option)
    expect(countChars(captured, '✅')).toBeGreaterThanOrEqual(4);
    expect(countChars(captured, '❌')).toBeGreaterThanOrEqual(2);
  }, 300_000);
});

// --- Case 3: Neutral-posture NEGATIVE (CT2) ---

describeIfSelected('Plan Prosons — Neutral-posture Negative', ['plan-review-prosons-neutral-neg'], () => {
  let planDir: string;
  let outFile: string;

  beforeAll(() => {
    planDir = setupPlanDir('skill-e2e-plan-prosons-neutral-neg-', DOMINANT_PLAN, 'plan-ceo-review');
    outFile = path.join(planDir, 'ask-capture.md');
  });

  afterAll(() => {
    try { fs.rmSync(planDir, { recursive: true, force: true }); } catch {}
  });

  testConcurrentIfSelected('plan-review-prosons-neutral-neg', async () => {
    const result = await runSkillTest({
      prompt: `Read plan-ceo-review/SKILL.md.

Read plan.md — Option A dominates Option B on coverage. This is NOT a taste call. Pick the approach via AskUserQuestion (Step 0C-bis / Implementation Alternatives — coverage-differentiated, so Completeness: N/10 applies).

${captureInstruction(outFile)}

After writing the file, stop.`,
      workingDirectory: planDir,
      maxTurns: 10,
      timeout: 240_000,
      testName: 'plan-review-prosons-neutral-neg',
      runId,
      model: 'claude-opus-4-7',
    });

    logCost('/plan-review prosons neutral negative', result);
    recordE2E(evalCollector, '/plan-review-prosons-neutral-neg', 'Plan Prosons — Neutral Negative', result, {
      passed: ['success', 'error_max_turns'].includes(result.exitReason),
    });
    expect(['success', 'error_max_turns']).toContain(result.exitReason);

    expect(fs.existsSync(outFile)).toBe(true);
    const captured = fs.readFileSync(outFile, 'utf-8');
    expect(captured.length).toBeGreaterThan(200);

    // One option dominates — must NOT use "taste call" neutral-posture dodge.
    expect(captured).not.toMatch(NEUTRAL_POSTURE_RE);
    // (recommended) label MUST be present on the dominant option.
    expect(captured).toMatch(RECOMMENDED_LABEL_RE);
    // Recommendation line must contain "because" (concrete reason, not "no preference")
    expect(captured).toMatch(/[Rr]ecommendation:.*because/);
  }, 300_000);
});

// --- Case 4: Hard-stop POSITIVE (escape allowed when legitimately one-sided) ---

describeIfSelected('Plan Prosons — Hard-stop Positive', ['plan-ceo-review-prosons-cadence'], () => {
  let planDir: string;
  let outFile: string;

  beforeAll(() => {
    planDir = setupPlanDir('skill-e2e-plan-prosons-hardstop-pos-', HARDSTOP_PLAN, 'plan-ceo-review');
    outFile = path.join(planDir, 'ask-capture.md');
  });

  afterAll(() => {
    try { fs.rmSync(planDir, { recursive: true, force: true }); } catch {}
  });

  testConcurrentIfSelected('plan-ceo-review-prosons-cadence', async () => {
    const result = await runSkillTest({
      prompt: `Read plan-ceo-review/SKILL.md.

Read plan.md — this is a destructive one-way action (terminate all sessions). Ask the user to confirm via AskUserQuestion. This is a legitimate hard-stop choice — the hard-stop escape (\`✅ No cons — this is a hard-stop choice\`) is allowed here because there is no meaningful alternative besides doing or not doing the action.

${captureInstruction(outFile)}

After writing the file, stop.`,
      workingDirectory: planDir,
      maxTurns: 10,
      timeout: 240_000,
      testName: 'plan-ceo-review-prosons-cadence',
      runId,
      model: 'claude-opus-4-7',
    });

    logCost('/plan-review prosons hard-stop positive', result);
    recordE2E(evalCollector, '/plan-ceo-review-prosons-cadence', 'Plan Prosons — Hard-stop Positive', result, {
      passed: ['success', 'error_max_turns'].includes(result.exitReason),
    });
    expect(['success', 'error_max_turns']).toContain(result.exitReason);

    expect(fs.existsSync(outFile)).toBe(true);
    const captured = fs.readFileSync(outFile, 'utf-8');
    expect(captured.length).toBeGreaterThan(100);

    // Format scaffolding still required
    expect(captured).toMatch(PROS_CONS_HEADER_RE);
    // Hard-stop escape is ACCEPTED here (destructive one-way action)
    // Either the escape is used OR real pros/cons are present — both are valid.
    const hasEscape = HARD_STOP_ESCAPE_RE.test(captured);
    const hasProsAndCons = countChars(captured, '✅') >= 1 && countChars(captured, '❌') >= 1;
    expect(hasEscape || hasProsAndCons).toBe(true);
  }, 300_000);
});

afterAll(async () => {
  await finalizeEvalCollector(evalCollector);
});