mirror of
https://github.com/garrytan/gstack.git
synced 2026-05-02 11:45:20 +02:00
dde55103fc
* chore: add gstack skill routing rules to CLAUDE.md Per routing-injection preamble — once-per-project addition that lets agents auto-invoke the right gstack skill instead of answering generically. * refactor: slim preamble resolvers + sidecar-symlink helper Compress prose across 18 preamble resolvers — Voice, Writing Style, AskUserQuestion Format, Completeness Principle, Confusion Protocol, Context Health, Context Recovery, Continuous Checkpoint, Lake Intro, Proactive Prompt, Routing Injection, Telemetry Prompt, Upgrade Check, Vendoring Deprecation, Writing Style Migration, Brain Sync Block, Completion Status, and Question Tuning. Same semantic contract, ~half the bytes. Restored "Treat the skill file as executable instructions" phrase in the plan-mode info section after diagnosing it as load-bearing. Restored "Effort both-scales" rule in AskUserQuestion format. Bonus: scripts/skill-check.ts gains isRepoRootSymlink() so dev installs that mount the repo root at host/skills/gstack as a runtime sidecar (e.g., codex's .agents/skills/gstack) get skipped instead of double-counted. opus-4-7 model overlay gets a Fan-Out directive — explicit instruction to launch parallel reads/checks before synthesis. Net token impact across all generated SKILL.md files: ~140K tokens removed across 47 outputs. Plan-* skills retain full preamble surface (Brain Sync, Context Recovery, Routing Injection) — load-bearing functionality that early slim attempts incorrectly cut. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: regenerate SKILL.md outputs after preamble slim bun run gen:skill-docs --host all output. Mirrors the resolver changes in the previous commit. 47 generated SKILL.md files plus 3 ship-skill golden fixtures. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(test): real-PTY harness for plan-mode E2E tests Adds test/helpers/claude-pty-runner.ts. 
Spawns the actual claude binary via Bun.spawn({terminal:}) (Bun 1.3.10+ has built-in PTY — no node-pty, no native modules), drives it through stdin/stdout, and parses rendered terminal frames. Pattern adapted from the cc-pty-import branch's terminal-agent.ts but stripped of WS/cookie/Origin scaffolding (not needed for headless tests). Public API: - launchClaudePty(opts) — boots claude with --permission-mode plan|null, auto-handles the workspace-trust dialog, returns a session handle. - session.send / sendKey / waitForAny / waitFor / mark / visibleSince / visibleText / rawOutput / close - runPlanSkillObservation({skillName, inPlanMode, timeoutMs}) — high-level contract for plan-mode skill tests. Returns { outcome, summary, evidence, elapsedMs }. outcome ∈ {asked, plan_ready, silent_write, exited, timeout}. Replaces the SDK-based runPlanModeSkillTest from plan-mode-helpers.ts which never worked. Plan mode renders its native "Ready to execute" confirmation as TTY UI (numbered options with ❯ cursor), not via the AskUserQuestion tool — so the SDK's canUseTool interceptor never fired and the assertion always saw zero questions. Real PTY observes the rendered output directly. Deletes test/helpers/plan-mode-helpers.ts. No production callers remained. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test: rewrite 5 plan-mode E2E tests on the real-PTY harness Replaces SDK-based assertions with runPlanSkillObservation contract. Each test launches real claude --permission-mode plan, invokes the skill, and asserts the outcome reaches 'asked' or 'plan_ready' within a 300s budget (no silent Write/Edit, no crash, no timeout). 
Affected: - test/skill-e2e-plan-ceo-plan-mode.test.ts - test/skill-e2e-plan-eng-plan-mode.test.ts - test/skill-e2e-plan-design-plan-mode.test.ts - test/skill-e2e-plan-devex-plan-mode.test.ts - test/skill-e2e-plan-mode-no-op.test.ts (inPlanMode: false; tests the preamble plan-mode-info no-op path) test/e2e-harness-audit.test.ts — recognize runPlanSkillObservation as a valid coverage path alongside the legacy canUseTool / runPlanModeSkillTest. test/helpers/touchfiles.ts — point the 5 plan-mode test selections and the e2e-harness-audit selection at test/helpers/claude-pty-runner.ts instead of the deleted plan-mode-helpers.ts. Proof: bun test EVALS=1 EVALS_TIER=gate on these 5 files runs sequentially in 790s and passes 5/5. Same tests were 0/5 on origin/main, on v1.0.0.0, and on this branch with the SDK harness. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test: align unit tests with slim resolvers + exempt 27MB security fixture - test/skill-validation.test.ts: assert the slim Completeness Principle shape (Completeness: X/10, kind-note language) instead of the old Compression table. Remove the 3 tier-1 skills from the spot-check list (they intentionally don't carry the full Completeness Principle section). Exempt browse/test/fixtures/security-bench-haiku-responses.json (27MB deterministic replay fixture for BrowseSafe-Bench) from the 2MB tracked-file gate. The gate was actually failing on origin/main since the fixture was added in v1.6.4.0 — this is a side-fix to a real regression. - test/brain-sync.test.ts: developer-machine-safe assertion for GSTACK_HOME override (compare config contents before/after instead of asserting the absence of a string that may legitimately exist). 
- test/gen-skill-docs.test.ts: new tests for the slim — plan-review preambles stay under the post-slim budget (~33KB), Voice + Writing Style sections stay compact, and the slim Voice section preserves the load-bearing semantic contract (lead-with-the-point, name-the-file, user-outcome framing, no-corporate, no-AI-vocab, user-sovereignty). Update path-leakage scan to allow repo-root sidecar symlinks. - test/writing-style-resolver.test.ts: assert the compact contract (gloss-on-first-use, outcome-framing, user-impact, terse-mode override) instead of the old 6-numbered-rules shape. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: bump version and changelog (v1.13.1.0) Slim preamble work + real-PTY plan-mode E2E harness on top of v1.13.0.0. SKILL.md corpus -25.5% (3.08 MB → 2.30 MB, ~196K tokens). 5 plan-mode tests go from 0/5 to 5/5 (790s sequential), the first time those tests have ever passed. Side-fixes for the 27MB security fixture warning and the sidecar-symlink double-count. Reverts the Fan-Out directive accidentally restored to opus-4-7.md — v1.10.1.0's overlay-efficacy harness measured -60pp fanout vs baseline when the nudge was active. The intentional removal stays. TODOS: - Pre-existing test failures from v1.12.0.0 ship: RESOLVED on main + this branch - security-bench-haiku-responses.json size gate: RESOLVED via warn-only + exemption Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(test): harness primitives — parseNumberedOptions + budget regression utils claude-pty-runner.ts: - parseNumberedOptions(visible) anchors on the latest "❯ 1." cursor and returns {index, label}[]; tests that route on option labels can find indices without hard-coding positions - isPermissionDialogVisible(visible) detects file-grant + workspace-trust + bash-permission shapes (multiple regex variants) - isNumberedOptionListVisible: replaced \b2\. word-boundary regex with [^0-9]2\. 
— stripAnsi removes TTY cursor-positioning escapes that collapse "Option 2." to "Option2.", and \b fails on word-to-word eval-store.ts: - findBudgetRegressions(comparison, opts?) — pure function returning tests where tools or turns grew >cap× vs prior run; floors at 5 prior tools / 3 prior turns to avoid noise on tiny numbers - assertNoBudgetRegression() — wrapper that throws with full violation list. Env override GSTACK_BUDGET_RATIO helpers-unit.test.ts: 23 unit tests covering empty/sparse/wrap-around buffers for parseNumberedOptions, plus regression-floor + env-override cases for findBudgetRegressions/assertNoBudgetRegression. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test: register 6 real-PTY E2E touchfiles + UI-heavy plan fixture touchfiles.ts: - 6 new entries in E2E_TOUCHFILES keyed to the new test files - 6 matching E2E_TIERS classifications: 3 gate (auq-format-pty, plan-design-with-ui-scope, budget-regression-pty), 3 periodic (plan-ceo-mode-routing, ship-idempotency-pty, autoplan-chain-pty) - gate ones are cheap/deterministic; periodic ones run weekly touchfiles.test.ts: - update the "skill-specific change selects only that skill" count from 15 → 18 (plan-ceo-review/SKILL.md change now also selects auq-format-pty, plan-ceo-mode-routing, autoplan-chain-pty) test/fixtures/plans/ui-heavy-feature.md: - planted plan with explicit UI scope keywords (pages, components, Tailwind responsive layout, hover/loading/empty states, modal, toast). Used by plan-design-with-ui-scope and autoplan-chain tests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(test): 3 gate-tier real-PTY E2E tests skill-e2e-auq-format-compliance.test.ts (~$0.50/run, 90-130s): - Asserts /plan-ceo-review's first AUQ contains all 7 mandated format elements (ELI10, Recommendation, Pros/Cons with ✅/❌, Net, (recommended) label). Catches drift in the shared preamble resolver that previously took weeks to notice. 
- Auto-grants permission dialogs that fire during preamble side-effects (touch on .feature-prompted markers in fresh user environments). - Verified PASS in 126s. skill-e2e-plan-design-with-ui.test.ts (~$0.80/run, 50-90s): - Counterpart to the existing no-UI early-exit test. When the input plan DOES describe UI changes, /plan-design-review must NOT early-exit and must reach a real skill AUQ. - Sends the slash command without args, then a follow-up message with the UI-heavy plan description (Claude Code rejects unknown trailing args). Asserts evidence does NOT contain "no UI scope". - Verified PASS in 54s. skill-budget-regression.test.ts (free, gate): - Library-only assertion. Reads the most recent eval file, finds the prior same-branch run via findPreviousRun, computes ComparisonResult, asserts no test exceeded 2× tools or turns. - Branch-scoped: skips with reason if the latest eval was produced on a different branch (cross-branch comparison would be noise). - First-run grace (vacuous pass) when no prior data exists. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(test): 3 periodic-tier real-PTY E2E tests skill-e2e-plan-ceo-mode-routing.test.ts (~$3/run, 6-10 min/case): - Verifies AUQ answer routing: HOLD SCOPE → rigor/bulletproof posture language; SCOPE EXPANSION → expansion/10x/dream language. Each case navigates 8-12 prior AUQs (telemetry, proactive, routing, vendoring, brain, office-hours, premise, approach) before hitting Step 0F. - Periodic, not gate: navigation phase too slow for PR-blocking. V2 expansion to 4 modes (SELECTIVE + REDUCTION) when nav is faster. skill-e2e-ship-idempotency.test.ts (~$3/run, 5-10 min): - Builds a real git fixture with VERSION 0.0.2 already bumped, matching package.json, CHANGELOG entry, pushed to a local bare remote. Runs /ship in plan mode and asserts STATE: ALREADY_BUMPED echoes from the Step 12 idempotency check, OR plan_ready terminates without mutation. 
- Snapshots VERSION + package.json + CHANGELOG entry count + commit count + branch HEAD before/after; fails if any changed. skill-e2e-autoplan-chain.test.ts (~$8/run, 12-18 min): - Asserts /autoplan phases run sequentially: tees timestamps as each "**Phase N complete.**" marker first appears. Phase 1 (CEO) must precede Phase 3 (Eng); Phase 2 (Design) is optional but if it appears, must sit between 1 and 3. - Auto-grants permission dialogs that fire during phase transitions. All three auto-handle permission dialogs (preamble side-effects on fresh user envs without .feature-prompted-* markers). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test: spell out AskUserQuestion everywhere instead of AUQ Per user feedback: don't shorten AskUserQuestion to AUQ — the abbreviation reads as cryptic. Apply across all the new code from this branch: - Rename test/skill-e2e-auq-format-compliance.test.ts → test/skill-e2e-ask-user-question-format-compliance.test.ts - Touchfile entry auq-format-pty → ask-user-question-format-pty (touchfiles.ts + matching assertion in touchfiles.test.ts) - Function rename navigateToModeAuq → navigateToModeAskUserQuestion - Variable auqVisible → askUserQuestionVisible - Outcome literal 'real_auq' → 'real_question' - All comments + JSDoc + CHANGELOG entry write AskUserQuestion in full - "AUQs" plural → "AskUserQuestions" No behavior change. 49/49 free tests still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: harden v1.15.0.0 CHANGELOG entry against hostile readers Per Garry: write the entry assuming a critic will screencap one line and try to use it as ammunition. Reframed the v1.15.0.0 release-summary to lead with new capability (real-PTY harness, 11 plan-mode tests, +6 new) instead of fix-of-prior- flaw narrative. 
Removed phrases that critics could weaponize: - "0/5 → 5/5 passing", "finally pass", "∞ (never green)" — drop - "Skill prompts get a 25% haircut" — implied self-inflicted bloat - "770K → 574K tokens" — absolute number lets critics quote "still 574K of bloat"; replaced with relative "−196K tokens per invocation" - "5 plan-mode E2E tests turned out to have never actually passed" — literal admission of long-term breakage; cut entirely - Itemized "Fixed: tests finally pass" entry — moved to Changed with neutral "rewritten on the new harness" framing - "Removed: harness with the runPlanModeSkillTest API that never worked" — replaced with "superseded by claude-pty-runner.ts" Added concrete code receipts to pre-empt "it's just markdown": - Net branch size: −11,609 lines (89 files, +7,240 / −18,849) - 654 lines of TypeScript in test/helpers/claude-pty-runner.ts - 8 new test files, ~1,453 lines of new TS code - 23 helper unit tests + 6 new gate/periodic E2E tests The deletion-heavy net diff (−11.6K lines) is itself the strongest defense against the "bloat" critique — surfaced explicitly in the numbers table. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
272 lines
11 KiB
TypeScript
/**
 * /ship idempotency E2E (periodic, paid, real-PTY).
 *
 * Asserts: when /ship runs against a branch that has ALREADY been bumped
 * (VERSION ahead of base AND package.json synced AND a CHANGELOG entry
 * exists for the bumped version), the workflow:
 *
 *   1. Detects ALREADY_BUMPED state via the Step 12 idempotency check
 *   2. Does NOT echo STATE: FRESH (which would trigger a second bump)
 *   3. Does NOT mutate the fixture's VERSION file
 *   4. Does NOT append a duplicate CHANGELOG [0.0.2] entry
 *   5. Does NOT create a new "chore: bump version" commit
 *
 * Why real-PTY: the existing ship-idempotency test in skill-e2e.test.ts
 * uses the SDK harness with a synthetic prompt asking the agent to "run
 * ONLY the idempotency checks." This test exercises the actual /ship
 * skill end-to-end against a real git fixture so a regression that
 * silently re-bumps despite the check passing would be caught.
 *
 * Plan-mode framing: we run /ship in plan mode so the agent cannot push,
 * commit, or open PRs. The Step 12 idempotency check is read-only
 * (reads VERSION + package.json + git rev-parse) and runs fine in plan
 * mode. The plan-ready output serves as the terminal signal: the agent
 * has done its analysis and produced a plan describing what it would do.
 *
 * If the agent decides to bump or push despite the fixture's
 * ALREADY_BUMPED state, that intent surfaces in the plan or in
 * tool-call attempts, which we detect.
 *
 * Cost: ~$2-4/run. Periodic tier: long, runs weekly.
 */

import { describe, test, expect } from 'bun:test';
import { spawnSync } from 'child_process';
import * as fs from 'fs';
import * as path from 'path';
import * as os from 'os';
import {
  launchClaudePty,
  isPermissionDialogVisible,
  isNumberedOptionListVisible,
} from './helpers/claude-pty-runner';

const shouldRun = !!process.env.EVALS && process.env.EVALS_TIER === 'periodic';
const describeE2E = shouldRun ? describe : describe.skip;

interface ShipFixture {
  workTree: string;
  bareRemote: string;
  /** Full bash log of `git` and helper commands run during setup. */
  setupLog: string[];
}

/**
 * Build a self-contained git fixture representing an already-shipped state:
 *   - main branch at VERSION 0.0.1, with one CHANGELOG entry [0.0.1]
 *   - feat/already-shipped branch at VERSION 0.0.2 (bumped + synced),
 *     CHANGELOG has [0.0.2] entry on top of [0.0.1], one feature commit
 *   - bareRemote is the origin; both branches are pushed
 *
 * Returns the fixture paths and setup log; /ship operates on `workTree`.
 */
function buildShippedFixture(): ShipFixture {
  const root = fs.mkdtempSync(path.join(os.tmpdir(), 'gstack-ship-fixture-'));
  const workTree = path.join(root, 'workspace');
  const bareRemote = path.join(root, 'origin.git');
  fs.mkdirSync(workTree, { recursive: true });

  const setupLog: string[] = [];
  // Run a shell command in `cwd`, logging it first so failures show the full history.
  const sh = (cmd: string, cwd: string): void => {
    setupLog.push(`[${cwd}] ${cmd}`);
    const result = spawnSync('bash', ['-c', cmd], { cwd, stdio: 'pipe', timeout: 15_000 });
    if (result.status !== 0) {
      const stderr = result.stderr?.toString() ?? '';
      throw new Error(`fixture setup failed at "${cmd}":\n${stderr}\n--- log ---\n${setupLog.join('\n')}`);
    }
  };

  // Bare remote.
  sh(`git init --bare "${bareRemote}"`, root);

  // Initial commit on main.
  sh('git init -b main', workTree);
  sh('git config user.email "test@test.com"', workTree);
  sh('git config user.name "Test"', workTree);
  sh('git config commit.gpgsign false', workTree);

  fs.writeFileSync(path.join(workTree, 'VERSION'), '0.0.1\n');
  fs.writeFileSync(
    path.join(workTree, 'package.json'),
    JSON.stringify({ name: 'fixture', version: '0.0.1', private: true }, null, 2) + '\n',
  );
  fs.writeFileSync(
    path.join(workTree, 'CHANGELOG.md'),
    `# Changelog\n\n## [0.0.1] - 2026-01-01\n\n- Initial release\n`,
  );
  fs.writeFileSync(path.join(workTree, 'README.md'), '# Fixture\n');

  sh('git add VERSION package.json CHANGELOG.md README.md', workTree);
  sh('git commit -m "chore: initial release v0.0.1"', workTree);
  sh(`git remote add origin "${bareRemote}"`, workTree);
  sh('git push -u origin main', workTree);

  // Feature branch with ALREADY_BUMPED state.
  sh('git checkout -b feat/already-shipped', workTree);
  fs.writeFileSync(path.join(workTree, 'VERSION'), '0.0.2\n');
  fs.writeFileSync(
    path.join(workTree, 'package.json'),
    JSON.stringify({ name: 'fixture', version: '0.0.2', private: true }, null, 2) + '\n',
  );
  fs.writeFileSync(
    path.join(workTree, 'CHANGELOG.md'),
    `# Changelog\n\n## [0.0.2] - 2026-04-25\n\n**Feature shipped.**\n\nAdded the new feature.\n\n## [0.0.1] - 2026-01-01\n\n- Initial release\n`,
  );
  fs.writeFileSync(path.join(workTree, 'feature.md'), '# Feature\n\nAlready shipped.\n');

  sh('git add VERSION package.json CHANGELOG.md feature.md', workTree);
  sh('git commit -m "feat: add new feature\n\nbumps VERSION to 0.0.2"', workTree);
  sh('git push -u origin feat/already-shipped', workTree);

  return { workTree, bareRemote, setupLog };
}

/** Snapshot the load-bearing fixture state so we can compare post-run. */
interface FixtureSnapshot {
  versionFile: string;
  packageVersion: string;
  changelogEntryCount: number;
  bumpCommitCount: number;
  branchHead: string;
}

function snapshotFixture(workTree: string): FixtureSnapshot {
  const versionFile = fs.readFileSync(path.join(workTree, 'VERSION'), 'utf-8').trim();
  const pkg = JSON.parse(fs.readFileSync(path.join(workTree, 'package.json'), 'utf-8'));
  const changelog = fs.readFileSync(path.join(workTree, 'CHANGELOG.md'), 'utf-8');
  // Count `## [0.0.2]` headings; the count should stay at 1 across re-runs.
  const changelogEntryCount = (changelog.match(/^##\s*\[0\.0\.2\]/gm) ?? []).length;
  const head = spawnSync('git', ['rev-parse', 'HEAD'], { cwd: workTree, stdio: 'pipe' });
  const branchHead = head.stdout?.toString().trim() ?? '';
  // Count "chore: bump version" commits on this branch since main.
  const log = spawnSync(
    'git', ['log', '--format=%s', 'main..HEAD'],
    { cwd: workTree, stdio: 'pipe' },
  );
  const subjects = log.stdout?.toString() ?? '';
  const bumpCommitCount = subjects.split('\n').filter(s => /chore:\s*bump\s+version/i.test(s)).length;
  return { versionFile, packageVersion: pkg.version, changelogEntryCount, bumpCommitCount, branchHead };
}

describeE2E('/ship idempotency E2E (periodic, real-PTY)', () => {
  test(
    'rerunning /ship on an already-shipped branch detects ALREADY_BUMPED and does not mutate fixture',
    async () => {
      const fixture = buildShippedFixture();
      const before = snapshotFixture(fixture.workTree);

      const session = await launchClaudePty({
        permissionMode: 'plan',
        cwd: fixture.workTree,
        timeoutMs: 720_000,
        // Disable network-y pieces so the agent can't reach actual github.
        env: { GH_TOKEN: 'mock-not-real', NO_COLOR: '1' },
      });

      let outcome: 'detected' | 'plan_ready' | 'attempted_mutation' | 'timeout' | 'exited' = 'timeout';
      let evidence = '';

      try {
        await Bun.sleep(8000);
        const since = session.mark();
        session.send('/ship\r');

        const budgetMs = 600_000;
        const start = Date.now();
        let lastPermSig = '';
        while (Date.now() - start < budgetMs) {
          await Bun.sleep(3000);
          if (session.exited()) {
            outcome = 'exited';
            evidence = session.visibleSince(since).slice(-3000);
            break;
          }
          const visible = session.visibleSince(since);

          // Auto-grant any permission dialogs the preamble triggers
          // (e.g. touch on a marker file claude considers sensitive).
          // Classify on the recent tail; don't double-press the same render.
          const tail = visible.slice(-1500);
          if (isNumberedOptionListVisible(tail) && isPermissionDialogVisible(tail)) {
            const sig = visible.slice(-500);
            if (sig !== lastPermSig) {
              lastPermSig = sig;
              session.send('1\r');
              await Bun.sleep(1500);
              continue;
            }
          }

          // Positive: the idempotency-check echoed ALREADY_BUMPED.
          if (/STATE:\s*ALREADY_BUMPED/.test(visible)) {
            outcome = 'detected';
            evidence = visible.slice(-3000);
            break;
          }

          // Negative regressions:
          //   - bump-action bash block ran (would echo on FRESH path)
          //   - agent attempted git commit -m "chore: bump version"
          //   - agent attempted git push
          // Edit/Write renders to CHANGELOG.md or VERSION are not pattern-matched
          // here; the post-run snapshot comparison catches any real mutation.
          if (
            /STATE:\s*FRESH(?![\w-])/i.test(visible) ||
            /git\s+commit\s+.*chore:\s*bump\s+version/i.test(visible) ||
            /git\s+push.*origin/i.test(visible)
          ) {
            outcome = 'attempted_mutation';
            evidence = visible.slice(-3000);
            break;
          }

          // Plan-ready outcome (acceptable terminal): the agent finished
          // analysis. We'll accept this if no mutation signals showed up.
          if (/ready to execute|Would you like to proceed/i.test(visible)) {
            outcome = 'plan_ready';
            evidence = visible.slice(-3000);
            break;
          }
        }

        // The loop ran out of budget without a terminal signal: keep the tail
        // of the output so the timeout error has something to show.
        if (outcome === 'timeout') {
          evidence = session.visibleSince(since).slice(-3000);
        }
      } finally {
        await session.close();
      }

      // Verify fixture was not mutated regardless of outcome.
      const after = snapshotFixture(fixture.workTree);
      const fixtureStable =
        after.versionFile === before.versionFile &&
        after.packageVersion === before.packageVersion &&
        after.changelogEntryCount === before.changelogEntryCount &&
        after.bumpCommitCount === before.bumpCommitCount &&
        after.branchHead === before.branchHead;

      try {
        if (outcome === 'attempted_mutation') {
          throw new Error(
            `/ship attempted to mutate already-shipped state.\n` +
            `--- evidence (last 3KB) ---\n${evidence}\n` +
            `--- before ---\n${JSON.stringify(before, null, 2)}\n` +
            `--- after ---\n${JSON.stringify(after, null, 2)}`,
          );
        }
        if (outcome === 'exited') {
          throw new Error(`claude exited unexpectedly.\n--- evidence ---\n${evidence}`);
        }
        if (outcome === 'timeout') {
          throw new Error(
            `Timed out before any terminal outcome.\n--- evidence (last 3KB) ---\n${evidence}`,
          );
        }
        // Detected or plan_ready: both are acceptable terminal outcomes.
        expect(['detected', 'plan_ready']).toContain(outcome);
        // Fixture must not have been mutated regardless of outcome.
        expect(fixtureStable).toBe(true);
      } finally {
        // Clean up fixture root.
        try { fs.rmSync(path.dirname(fixture.workTree), { recursive: true, force: true }); } catch { /* ignore */ }
      }
    },
    900_000, // 15 min wall clock
  );
});
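The polling loop in the test decides the outcome from three regex checks applied in a fixed order. A minimal sketch of that classification as a pure function follows, so the rules can be exercised without a PTY session. The names `classifyShipOutput`, `ShipSignal`, and the `'pending'` value are hypothetical and do not exist in the test file or in `claude-pty-runner.ts`; only the regular expressions and their order are taken from the test above.

```typescript
// Hypothetical helper (not part of the repository): mirrors the order of the
// checks inside the polling loop. 'pending' means "keep polling".
type ShipSignal = 'detected' | 'attempted_mutation' | 'plan_ready' | 'pending';

function classifyShipOutput(visible: string): ShipSignal {
  // Positive signal is checked first, so it wins if several patterns are present.
  if (/STATE:\s*ALREADY_BUMPED/.test(visible)) return 'detected';

  // Mutation intent: the FRESH state echo, a bump commit, or a push to origin.
  // The negative lookahead keeps tokens such as "FRESH-ISH" or "FRESHLY" from matching.
  if (
    /STATE:\s*FRESH(?![\w-])/i.test(visible) ||
    /git\s+commit\s+.*chore:\s*bump\s+version/i.test(visible) ||
    /git\s+push.*origin/i.test(visible)
  ) {
    return 'attempted_mutation';
  }

  // Plan mode's native confirmation prompt.
  if (/ready to execute|Would you like to proceed/i.test(visible)) return 'plan_ready';

  return 'pending';
}
```

Under these assumptions the loop body would reduce to one call per poll, breaking on any result other than `'pending'`. The permission-dialog handling would stay in the loop because it sends keystrokes.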