gstack/test/skill-e2e-ship-idempotency.test.ts
Garry Tan 46c1fae7f1 v1.54.0.0 feat: carve /ship into skeleton + on-demand sections (-59% always-loaded) (#1806)
* feat(test): transcript-section-logger + ship-action fingerprint (T10)

Pure-analysis module over a SkillTestResult/NDJSON transcript:
- extractSectionReads(): which sections/*.md a run opened (post-carve check)
- extractShipActions(): observable action fingerprint (merge/test/bump/
  changelog/commit/push/pr) that works on the MONOLITH too, so a baseline
  captured before the carve can detect a sectioned-ship regression
- baseline read/write + compareShipActions() for baseline-first dogfooding (T10)

Baseline-first answers the Codex outside-voice critique that a logger in the
same PR as the carve is post-failure telemetry without a pre-carve reference.
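
As a rough illustration of the two analysis steps above, the sketch below pulls section reads out of an NDJSON transcript and diffs an action fingerprint against a baseline. The event shape (`tool`, `input.file_path`), the `ShipAction` union, and both function bodies are assumptions made for this example; the real module's types and signatures are not shown in this commit message.

```typescript
// Hypothetical transcript event: one JSON object per NDJSON line.
interface ToolEvent {
  tool?: string;
  input?: { file_path?: string };
}

type ShipAction = 'merge' | 'test' | 'bump' | 'changelog' | 'commit' | 'push' | 'pr';

/** Collect the sections/*.md files a run opened via Read, in first-seen order. */
function extractSectionReads(ndjson: string): string[] {
  const seen = new Set<string>();
  for (const line of ndjson.split('\n')) {
    if (!line.trim()) continue;
    let event: ToolEvent | null;
    try {
      event = JSON.parse(line) as ToolEvent | null;
    } catch {
      continue; // tolerate non-JSON noise in the transcript
    }
    const file = event?.input?.file_path;
    if (event?.tool === 'Read' && file) {
      const match = file.match(/sections\/([\w-]+\.md)$/);
      if (match) seen.add(match[1]);
    }
  }
  return Array.from(seen);
}

/** Report actions present in the baseline but absent from the run, and vice versa. */
function compareShipActions(baseline: ShipAction[], run: ShipAction[]) {
  const missing = baseline.filter(a => !run.includes(a));
  const extra = run.filter(a => !baseline.includes(a));
  return { missing, extra, matches: missing.length === 0 && extra.length === 0 };
}
```

Because the fingerprint is built from observable actions rather than section names, a baseline captured on the monolith stays comparable after the carve.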

11 unit tests, all green. Paid monolith baseline capture runs separately.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(pipeline): section discovery + generation machinery (T9)

- discover-skills.ts: discoverSectionTemplates() scans <skill>/sections/*.md.tmpl
- gen-skill-docs.ts: extract resolvePlaceholders + applyHostRewrites + buildContext
  as shared helpers (processTemplate and the new processSectionTemplate both call
  them, so a sanitization/rewrite fix can't miss sections) [C1]
- processSectionTemplate: body-fragment generation (no frontmatter/catalog/voice),
  parent-skill TemplateContext (skillName pinned to parent, not 'sections', so
  appliesTo gating + tier behave identically), per-host output routing
- --host all now fails the build on ANY host failure, not just claude, so a stale
  external-host output can't slip the freshness gate [Codex outside-voice #9]

Inert until a skill is carved (no sections/ dirs exist yet). Refactor is
output-neutral: gen:skill-docs --dry-run --host all reports 0 STALE.
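
A minimal sketch of the discovery step, assuming skills are direct children of the repo root and that a section id is the file name minus `.md.tmpl`. The `SectionTemplate` shape and the directory layout are illustrative guesses, not the actual discover-skills.ts implementation.

```typescript
import * as fs from 'node:fs';
import * as path from 'node:path';

// Hypothetical result shape for this example.
interface SectionTemplate {
  skill: string;        // parent skill directory name
  id: string;           // file name without the .md.tmpl suffix
  templatePath: string; // absolute or root-relative path to the template
}

/** Scan <root>/<skill>/sections/*.md.tmpl; skills without sections/ yield nothing. */
function discoverSectionTemplates(root: string): SectionTemplate[] {
  const suffix = '.md.tmpl';
  const skills = fs
    .readdirSync(root, { withFileTypes: true })
    .filter(entry => entry.isDirectory())
    .map(entry => entry.name)
    .sort();

  const found: SectionTemplate[] = [];
  for (const skill of skills) {
    const sectionsDir = path.join(root, skill, 'sections');
    if (!fs.existsSync(sectionsDir) || !fs.statSync(sectionsDir).isDirectory()) continue;
    for (const file of fs.readdirSync(sectionsDir).sort()) {
      if (!file.endsWith(suffix)) continue; // ignore non-template files
      found.push({
        skill,
        id: file.slice(0, -suffix.length),
        templatePath: path.join(sectionsDir, file),
      });
    }
  }
  return found;
}
```

With no sections/ directories present the function returns an empty list, which matches the "inert until a skill is carved" behaviour described above.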

5 discovery unit tests + 389 gen-skill-docs tests green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(setup): install sections/ for cherry-pick targets (claude + kiro) (T9)

Two install targets cherry-pick SKILL.md and would leave a carved skill's
sections/ behind, 404ing a runtime 'Read sections/<name>.md':
- link_claude_skill_dirs: link the sections/ subdir via _link_or_copy (windows
  gets a fresh copy on every ./setup)
- kiro per-skill loop: sed-rewrite + copy each sections/* so paths resolve under
  ~/.kiro, not ~/.codex or ~/.claude

codex/factory/opencode link the whole generated dir, so sections ride free.
Addresses Codex outside-voice #4/#6 (runtime pathing landmine). Inert until a
skill is carved. Static-tripwire test + windows-fallback invariant green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(ship): gstack-version-bump CLI — tested idempotency classify + write (T9)

Hybrid CLI extraction (CM1): the deterministic core of ship Step 12 becomes a
tested CLI instead of bash prose the agent re-derives each run.
- classify: FRESH/ALREADY_BUMPED/DRIFT_STALE_PKG/DRIFT_UNEXPECTED from VERSION
  vs origin/<base>:VERSION vs package.json.version (pure reader)
- write: validated dual-write to VERSION + package.json (FRESH bump)
- repair: DRIFT_STALE_PKG sync, no re-bump
Bump-LEVEL choice + queue collision stay agent judgment; slot pick stays
bin/gstack-next-version. This removes the re-bump-a-shipped-branch footgun from
skippable prose into code that can't be skipped or misread.
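
The classify step can be pictured as a small pure function over three version strings. The mapping below is an assumption inferred from the state names and the repair note above (the real gstack-version-bump may handle more cases, such as a missing package.json); `VersionInputs` is a hypothetical type for this sketch.

```typescript
type BumpState = 'FRESH' | 'ALREADY_BUMPED' | 'DRIFT_STALE_PKG' | 'DRIFT_UNEXPECTED';

// Hypothetical input shape: the three values the classifier compares.
interface VersionInputs {
  version: string;        // contents of VERSION on the branch
  baseVersion: string;    // contents of origin/<base>:VERSION
  packageVersion: string; // package.json "version"
}

/** Pure reader: never writes, only names the state the branch is in. */
function classify({ version, baseVersion, packageVersion }: VersionInputs): BumpState {
  const bumped = version !== baseVersion;          // branch VERSION differs from base
  const pkgSynced = packageVersion === version;    // package.json agrees with VERSION
  if (!bumped) return pkgSynced ? 'FRESH' : 'DRIFT_UNEXPECTED';
  return pkgSynced ? 'ALREADY_BUMPED' : 'DRIFT_STALE_PKG';
}

const state = classify({ version: '0.0.2', baseVersion: '0.0.1', packageVersion: '0.0.2' });
console.log(JSON.stringify({ state }));
```

Under these assumptions only FRESH leads to `write`, DRIFT_STALE_PKG leads to `repair`, and ALREADY_BUMPED is a no-op, which is the property the idempotency E2E relies on.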

15 tests (exhaustive state matrix + write/repair fs + real-git classify).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(parity): sectioned-skill parity capability — guards the carve (T9)

Carved skills (skeleton + sections/*.md) need parity checks that see relocated
content, or moving a phrase into a section reads as 'lost':
- readSkillForParity(): union skeleton + all sections/*.md
- checkSkillParity sectioned mode: content checks against the union; minBytes/
  maxSizeRatio against union bytes (total behavior preserved); maxSkeletonBytes
  asserts the always-loaded skeleton actually shrank. Lowering minBytes to fit a
  small skeleton would otherwise make the size floor toothless [Codex #12].
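
To make the three size and content checks concrete, here is a sketch that works on in-memory strings instead of files. `checkSectionedParity`, its options, and the failure messages are hypothetical names for this example; the real checkSkillParity reads from disk and has more options.

```typescript
// Hypothetical options for this sketch.
interface SectionedParityOptions {
  minBytes: number;         // floor for skeleton + sections combined
  maxSkeletonBytes: number; // ceiling for the always-loaded skeleton alone
  mustContain: string[];    // phrases that must survive somewhere in the union
}

/** Returns a list of failures; an empty list means parity holds. */
function checkSectionedParity(
  skeleton: string,
  sections: string[],
  opts: SectionedParityOptions,
): string[] {
  const byteLength = (s: string) => new TextEncoder().encode(s).length;
  const union = [skeleton, ...sections].join('\n');
  const failures: string[] = [];

  // Content checks run against the union, so relocated phrases still count.
  for (const phrase of opts.mustContain) {
    if (!union.includes(phrase)) failures.push(`missing phrase: ${phrase}`);
  }
  // The size floor also uses the union, so total behaviour is preserved.
  if (byteLength(union) < opts.minBytes) failures.push('union below minBytes');
  // The skeleton ceiling proves the always-loaded part actually shrank.
  if (byteLength(skeleton) > opts.maxSkeletonBytes) failures.push('skeleton above maxSkeletonBytes');
  return failures;
}
```

Keeping minBytes on the union and adding a separate skeleton ceiling is what lets the floor stay meaningful after content moves into sections.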

Built + tested BEFORE the carve so ship's invariant can flip to sectioned in the
same commit it lands. Monolith path byte-identical (verified: pre-existing
investigate 1.053 ratio drift fails the same with this change stashed).

7 sectioned-parity tests + existing parity tests green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(ship): carve into skeleton + on-demand sections (Claude) (T9)

ship/SKILL.md drops 167KB → 68.7KB (a ~59% cut to the always-loaded skill) by moving
8 prose-heavy steps into ship/sections/*.md, read on demand:
tests, test-coverage, plan-completion, review-army, greptile, adversarial,
changelog, pr-body. Step 12's version logic now calls the tested
gstack-version-bump CLI instead of inline bash.

Claude-first (S2): {{SECTION:id}} emits a STOP-Read pointer on Claude (skeleton +
generated section files) and INLINES the content on every other host, so external
hosts keep the full monolith — verified factory at 162KB with no sections dir.
{{SECTION_INDEX:ship}} renders the situation→section table from the PASSIVE
manifest (CM2 / v2_PLAN.md:663); required-reads live only in test fixtures.
Multi-pass resolve expands inlined sections' own resolvers.
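
The host-dependent behaviour and the multi-pass resolve can be sketched as follows. The pointer wording, the `renderTemplate` name, the `Host` union, and the flat `resolvers` map are all assumptions for illustration; the real resolver lives in scripts/resolvers/sections.ts and is not reproduced here.

```typescript
type Host = 'claude' | 'codex' | 'factory' | 'opencode' | 'kiro';

/**
 * Expand {{SECTION:id}} and plain {{NAME}} placeholders until the text settles.
 * On Claude a section becomes a pointer; on other hosts its body is inlined,
 * and later passes resolve any placeholders that body brought along.
 */
function renderTemplate(
  template: string,
  host: Host,
  sections: Record<string, string>,
  resolvers: Record<string, string>,
  maxPasses = 5,
): string {
  let out = template;
  for (let pass = 0; pass < maxPasses; pass++) {
    const next = out.replace(
      /\{\{(SECTION:)?([\w-]+)\}\}/g,
      (_match: string, sectionPrefix: string | undefined, name: string) => {
        if (sectionPrefix) {
          const body = sections[name];
          if (body === undefined) throw new Error(`unknown section: ${name}`);
          // Hypothetical pointer text; the real wording is not in this commit message.
          return host === 'claude' ? `STOP. Read sections/${name}.md before continuing.` : body;
        }
        const value = resolvers[name];
        if (value === undefined) throw new Error(`unknown placeholder: ${name}`);
        return value;
      },
    );
    if (next === out) return out; // nothing left to expand
    out = next;
  }
  throw new Error(`placeholders did not settle after ${maxPasses} passes`);
}
```

A single pass would leave an inlined section's own placeholders unresolved on non-Claude hosts, which is why the loop repeats until the output stops changing.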

Parity: ship invariant flipped to sectioned (union content checks + maxSkeletonBytes
asserts the shrink). Carve-fallout fixed across gen-skill-docs/skill-validation/
golden/plan-completion/#1539/size-budget tests via skeleton+sections union reads.
Free suite green except the pre-existing investigate parity drift.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(ship): manifest-consistency + context-parity + requiredReads helper (T9)

Free deterministic guards for the carve:
- required-reads.ts + unit test: assertRequiredReads(run, requiredFiles) — the
  mechanical layer-5 check that the agent Read the sections its situation needs
  (required set comes from the fixture, not the passive manifest)
- section-manifest-consistency: 3-tier orphan classification (generated orphan +
  hand-edited generated file → FAIL; manifest orphan → WARN per v2_PLAN.md) and
  pins the PASSIVE-manifest contract (no applies_when/required_for)
- template-context-parity: generated sections have zero unresolved placeholders
  and gated resolvers (ADVERSARIAL_STEP/CONFIDENCE_CALIBRATION/CHANGELOG_WORKFLOW)
  rendered — proving sections resolve with the parent skillName, not 'sections'
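
The required-reads guard amounts to a set-membership check over the run's Read calls. The `RunResult` shape below is a hypothetical stand-in for the harness's result type; only the `assertRequiredReads(run, requiredFiles)` call shape comes from the description above.

```typescript
// Hypothetical minimal run shape for this sketch.
interface RunResult {
  toolCalls: { tool: string; input: { file_path?: string } }[];
}

/** Throw if any required section file was never opened with the Read tool. */
function assertRequiredReads(run: RunResult, requiredFiles: string[]): void {
  const readPaths = run.toolCalls
    .filter(call => call.tool === 'Read' && typeof call.input.file_path === 'string')
    .map(call => call.input.file_path as string);

  // Suffix match so the check works whatever directory the skill is installed in.
  const missing = requiredFiles.filter(
    required => !readPaths.some(readPath => readPath.endsWith(required)),
  );
  if (missing.length > 0) {
    throw new Error(`required sections never Read: ${missing.join(', ')}`);
  }
}
```

The required set is passed in by the caller, which mirrors the point above that it comes from the test fixture rather than from the passive manifest.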

16 tests, all green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(ship): section-loading E2E + idempotency CLI detection (T9)

- skill-e2e-ship-section-loading.test.ts (new, periodic): runs real /ship in plan
  mode against a fresh version-changing fixture and asserts the agent Read the
  required sections (review-army + changelog). Runs against the INSTALLED skill
  (~/.claude/skills/gstack/ship), not repo paths, so install-layout 404s surface
  [Codex outside-voice #5]. Layer-5 mechanical guard against silent section-skip.
- skill-e2e-ship-idempotency.test.ts: detection updated for the carve — Step 12
  now runs gstack-version-bump classify (JSON "state":"ALREADY_BUMPED") instead
  of the inline bash echo (STATE: ALREADY_BUMPED). Accept both; add a
  gstack-version-bump-write re-bump regression signal.
- touchfiles: register ship-section-loading (periodic) + extend idempotency deps
  with bin/gstack-version-bump + scripts/resolvers/sections.ts.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(ship): union-read redaction wiring test for the carve (T9)

main's PR-body redaction-at-sink lives in sections/pr-body.md.tmpl after the
carve, not the skeleton template. Read skeleton + section templates union so the
redaction-wiring assertions follow the relocated content. 9/9 green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* v1.54.0.0 feat: carve /ship into skeleton + on-demand sections (-59% always-loaded)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 12:09:10 -07:00

278 lines
11 KiB
TypeScript

/**
 * /ship idempotency E2E (periodic, paid, real-PTY).
 *
 * Asserts: when /ship runs against a branch that has ALREADY been bumped
 * (VERSION ahead of base AND package.json synced AND a CHANGELOG entry
 * exists for the bumped version), the workflow:
 *
 *   1. Detects ALREADY_BUMPED state via the Step 12 idempotency check
 *   2. Does NOT report FRESH, either as the legacy `STATE: FRESH` echo or as
 *      the CLI JSON `"state":"FRESH"` (which would trigger a second bump)
 *   3. Does NOT mutate the fixture's VERSION file
 *   4. Does NOT append a duplicate CHANGELOG [0.0.2] entry
 *   5. Does NOT create a new "chore: bump version" commit
 *
 * Why real-PTY: the existing ship-idempotency test in skill-e2e.test.ts
 * uses the SDK harness with a synthetic prompt asking the agent to "run
 * ONLY the idempotency checks." This test exercises the actual /ship
 * skill end-to-end against a real git fixture so a regression that
 * silently re-bumps despite the check passing would be caught.
 *
 * Plan-mode framing: we run /ship in plan mode so the agent cannot push,
 * commit, or open PRs. The Step 12 idempotency check is read-only
 * (reads VERSION + package.json + git rev-parse) and runs fine in plan
 * mode. The plan-ready output serves as the terminal signal: the agent
 * has done its analysis and produced a plan describing what it would do.
 *
 * If the agent decides to bump or push despite the fixture's
 * ALREADY_BUMPED state, that intent surfaces in the plan or in
 * tool-call attempts, which we detect.
 *
 * Cost: ~$2-4/run. Periodic tier: long, runs weekly.
 */
import { describe, test, expect } from 'bun:test';
import { spawnSync } from 'child_process';
import * as fs from 'fs';
import * as path from 'path';
import * as os from 'os';
import {
  launchClaudePty,
  isPermissionDialogVisible,
  isNumberedOptionListVisible,
} from './helpers/claude-pty-runner';

const shouldRun = !!process.env.EVALS && process.env.EVALS_TIER === 'periodic';
const describeE2E = shouldRun ? describe : describe.skip;

interface ShipFixture {
  workTree: string;
  bareRemote: string;
  /** Full bash log of `git` and helper commands run during setup. */
  setupLog: string[];
}
/**
 * Build a self-contained git fixture representing an already-shipped state:
 *   - main branch at VERSION 0.0.1, with one CHANGELOG entry [0.0.1]
 *   - feat/already-shipped branch at VERSION 0.0.2 (bumped + synced),
 *     CHANGELOG has [0.0.2] entry on top of [0.0.1], one feature commit
 *   - bareRemote is the origin; both branches are pushed
 *
 * Returns the fixture paths and setup log; /ship operates on `workTree`.
 */
function buildShippedFixture(): ShipFixture {
  const root = fs.mkdtempSync(path.join(os.tmpdir(), 'gstack-ship-fixture-'));
  const workTree = path.join(root, 'workspace');
  const bareRemote = path.join(root, 'origin.git');
  fs.mkdirSync(workTree, { recursive: true });

  const setupLog: string[] = [];
  // Run a shell command in `cwd`, logging it and failing loudly with the full log.
  const sh = (cmd: string, cwd: string): void => {
    setupLog.push(`[${cwd}] ${cmd}`);
    const result = spawnSync('bash', ['-c', cmd], { cwd, stdio: 'pipe', timeout: 15_000 });
    if (result.status !== 0) {
      const stderr = result.stderr?.toString() ?? '';
      throw new Error(`fixture setup failed at "${cmd}":\n${stderr}\n--- log ---\n${setupLog.join('\n')}`);
    }
  };

  // Bare remote.
  sh(`git init --bare "${bareRemote}"`, root);

  // Initial commit on main.
  sh('git init -b main', workTree);
  sh('git config user.email "test@test.com"', workTree);
  sh('git config user.name "Test"', workTree);
  sh('git config commit.gpgsign false', workTree);
  fs.writeFileSync(path.join(workTree, 'VERSION'), '0.0.1\n');
  fs.writeFileSync(
    path.join(workTree, 'package.json'),
    JSON.stringify({ name: 'fixture', version: '0.0.1', private: true }, null, 2) + '\n',
  );
  fs.writeFileSync(
    path.join(workTree, 'CHANGELOG.md'),
    `# Changelog\n\n## [0.0.1] - 2026-01-01\n\n- Initial release\n`,
  );
  fs.writeFileSync(path.join(workTree, 'README.md'), '# Fixture\n');
  sh('git add VERSION package.json CHANGELOG.md README.md', workTree);
  sh('git commit -m "chore: initial release v0.0.1"', workTree);
  sh(`git remote add origin "${bareRemote}"`, workTree);
  sh('git push -u origin main', workTree);

  // Feature branch with ALREADY_BUMPED state.
  sh('git checkout -b feat/already-shipped', workTree);
  fs.writeFileSync(path.join(workTree, 'VERSION'), '0.0.2\n');
  fs.writeFileSync(
    path.join(workTree, 'package.json'),
    JSON.stringify({ name: 'fixture', version: '0.0.2', private: true }, null, 2) + '\n',
  );
  fs.writeFileSync(
    path.join(workTree, 'CHANGELOG.md'),
    `# Changelog\n\n## [0.0.2] - 2026-04-25\n\n**Feature shipped.**\n\nAdded the new feature.\n\n## [0.0.1] - 2026-01-01\n\n- Initial release\n`,
  );
  fs.writeFileSync(path.join(workTree, 'feature.md'), '# Feature\n\nAlready shipped.\n');
  sh('git add VERSION package.json CHANGELOG.md feature.md', workTree);
  sh('git commit -m "feat: add new feature\n\nbumps VERSION to 0.0.2"', workTree);
  sh('git push -u origin feat/already-shipped', workTree);

  return { workTree, bareRemote, setupLog };
}
/** Snapshot the load-bearing fixture state so we can compare post-run. */
interface FixtureSnapshot {
  versionFile: string;
  packageVersion: string;
  changelogEntryCount: number;
  bumpCommitCount: number;
  branchHead: string;
}

function snapshotFixture(workTree: string): FixtureSnapshot {
  const versionFile = fs.readFileSync(path.join(workTree, 'VERSION'), 'utf-8').trim();
  const pkg = JSON.parse(fs.readFileSync(path.join(workTree, 'package.json'), 'utf-8'));
  const changelog = fs.readFileSync(path.join(workTree, 'CHANGELOG.md'), 'utf-8');
  // Count `## [0.0.2]` headings — should stay at 1 across re-runs.
  const changelogEntryCount = (changelog.match(/^##\s*\[0\.0\.2\]/gm) ?? []).length;
  const head = spawnSync('git', ['rev-parse', 'HEAD'], { cwd: workTree, stdio: 'pipe' });
  const branchHead = head.stdout?.toString().trim() ?? '';
  // Count "chore: bump version" commits on this branch since main.
  const log = spawnSync(
    'git', ['log', '--format=%s', 'main..HEAD'],
    { cwd: workTree, stdio: 'pipe' },
  );
  const subjects = log.stdout?.toString() ?? '';
  const bumpCommitCount = subjects.split('\n').filter(s => /chore:\s*bump\s+version/i.test(s)).length;
  return { versionFile, packageVersion: pkg.version, changelogEntryCount, bumpCommitCount, branchHead };
}
describeE2E('/ship idempotency E2E (periodic, real-PTY)', () => {
  test(
    'rerunning /ship on an already-shipped branch detects ALREADY_BUMPED and does not mutate fixture',
    async () => {
      const fixture = buildShippedFixture();
      const before = snapshotFixture(fixture.workTree);

      const session = await launchClaudePty({
        permissionMode: 'plan',
        cwd: fixture.workTree,
        timeoutMs: 720_000,
        // A fake GH_TOKEN keeps gh from authenticating against real GitHub.
        env: { GH_TOKEN: 'mock-not-real', NO_COLOR: '1' },
      });

      let outcome: 'detected' | 'plan_ready' | 'attempted_mutation' | 'timeout' | 'exited' = 'timeout';
      let evidence = '';

      try {
        await Bun.sleep(8000);
        const since = session.mark();
        session.send('/ship\r');

        const budgetMs = 600_000;
        const start = Date.now();
        let lastPermSig = '';
        while (Date.now() - start < budgetMs) {
          await Bun.sleep(3000);
          if (session.exited()) {
            outcome = 'exited';
            evidence = session.visibleSince(since).slice(-3000);
            break;
          }
          const visible = session.visibleSince(since);

          // Auto-grant any permission dialogs the preamble triggers
          // (e.g. touch on a marker file claude considers sensitive).
          // Classify on the recent tail; don't double-press the same render.
          const tail = visible.slice(-1500);
          if (isNumberedOptionListVisible(tail) && isPermissionDialogVisible(tail)) {
            const sig = visible.slice(-500);
            if (sig !== lastPermSig) {
              lastPermSig = sig;
              session.send('1\r');
              await Bun.sleep(1500);
              continue;
            }
          }

          // Positive: idempotency classify reported ALREADY_BUMPED. Post-carve
          // (T9), Step 12 runs `gstack-version-bump classify` which emits JSON
          // (`"state":"ALREADY_BUMPED"`); the legacy inline bash echoed
          // `STATE: ALREADY_BUMPED`. Accept either so the test survives the carve.
          if (/STATE:\s*ALREADY_BUMPED|"state":\s*"ALREADY_BUMPED"/.test(visible)) {
            outcome = 'detected';
            evidence = visible.slice(-3000);
            break;
          }

          // Negative regressions:
          //   - classify reported FRESH (CLI JSON or legacy echo) → would re-bump
          //   - agent attempted git commit -m "chore: bump version"
          //   - agent attempted git push
          //   - agent ran the CLI write path (gstack-version-bump write), i.e. a
          //     re-bump on an already-shipped branch
          if (
            /"state":\s*"FRESH"/.test(visible) ||
            /STATE:\s*FRESH(?![\w-])/i.test(visible) ||
            /gstack-version-bump\s+write/i.test(visible) ||
            /git\s+commit\s+.*chore:\s*bump\s+version/i.test(visible) ||
            /git\s+push.*origin/i.test(visible)
          ) {
            outcome = 'attempted_mutation';
            evidence = visible.slice(-3000);
            break;
          }

          // Plan-ready outcome (acceptable terminal): the agent finished
          // analysis. We'll accept this if no mutation signals showed up.
          if (/ready to execute|Would you like to proceed/i.test(visible)) {
            outcome = 'plan_ready';
            evidence = visible.slice(-3000);
            break;
          }
        }

        // On timeout no branch above captured evidence; grab the tail now so
        // the failure message below is not empty.
        if (outcome === 'timeout') {
          evidence = session.visibleSince(since).slice(-3000);
        }
      } finally {
        await session.close();
      }

      // Verify fixture was not mutated regardless of outcome.
      const after = snapshotFixture(fixture.workTree);
      const fixtureStable =
        after.versionFile === before.versionFile &&
        after.packageVersion === before.packageVersion &&
        after.changelogEntryCount === before.changelogEntryCount &&
        after.bumpCommitCount === before.bumpCommitCount &&
        after.branchHead === before.branchHead;

      try {
        if (outcome === 'attempted_mutation') {
          throw new Error(
            `/ship attempted to mutate already-shipped state.\n` +
            `--- evidence (last 3KB) ---\n${evidence}\n` +
            `--- before ---\n${JSON.stringify(before, null, 2)}\n` +
            `--- after ---\n${JSON.stringify(after, null, 2)}`,
          );
        }
        if (outcome === 'exited') {
          throw new Error(`claude exited unexpectedly.\n--- evidence ---\n${evidence}`);
        }
        if (outcome === 'timeout') {
          throw new Error(
            `Timed out before any terminal outcome.\n--- evidence (last 3KB) ---\n${evidence}`,
          );
        }
        // Detected or plan_ready: both are acceptable terminal outcomes.
        expect(['detected', 'plan_ready']).toContain(outcome);
        // Fixture must not have been mutated regardless of outcome.
        expect(fixtureStable).toBe(true);
      } finally {
        // Clean up fixture root.
        try { fs.rmSync(path.dirname(fixture.workTree), { recursive: true, force: true }); } catch { /* ignore */ }
      }
    },
    900_000, // 15 min wall clock
  );
});