v1.54.0.0 feat: carve /ship into skeleton + on-demand sections (-59% always-loaded) (#1806)

* feat(test): transcript-section-logger + ship-action fingerprint (T10)

Pure-analysis module over a SkillTestResult/NDJSON transcript:
- extractSectionReads(): which sections/*.md a run opened (post-carve check)
- extractShipActions(): observable action fingerprint (merge/test/bump/changelog/commit/push/pr) that works on the MONOLITH too, so a baseline captured before the carve can detect a sectioned-ship regression
- baseline read/write + compareShipActions() for baseline-first dogf(T10)

Baseline-first answers the Codex outside-voice critique that a logger in the same PR as the carve is post-failure telemetry without a pre-carve reference. 11 unit tests, all green. Paid monolith baseline capture runs separately.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(pipeline): section discovery + generation machinery (T9)

- discover-skills.ts: discoverSectionTemplates() scans <skill>/sections/*.md.tmpl
- gen-skill-docs.ts: extract resolvePlaceholders + applyHostRewrites + buildContext as shared helpers (processTemplate and the new processSectionTemplate both call them, so a sanitization/rewrite fix can't miss sections) [C1]
- processSectionTemplate: body-fragment generation (no frontmatter/catalog/voice), parent-skill TemplateContext (skillName pinned to parent, not 'sections', so appliesTo gating + tier behave identically), per-host output routing
- --host all now fails the build on ANY host failure, not just claude, so a stale external-host output can't slip the freshness gate [Codex outside-voice #9]

Inert until a skill is carved (no sections/ dirs exist yet). Refactor is output-neutral: gen:skill-docs --dry-run --host all reports 0 STALE. 5 discovery unit tests + 389 gen-skill-docs tests green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(setup): install sections/ for cherry-pick targets (claude + kiro) (T9)

Two install targets cherry-pick SKILL.md and would leave a carved skill's sections/ behind, 404ing a runtime 'Read sections/<name>.md':
- link_claude_skill_dirs: link the sections/ subdir via _link_or_copy (windows gets a fresh copy on every ./setup)
- kiro per-skill loop: sed-rewrite + copy each sections/* so paths resolve under ~/.kiro, not ~/.codex/~/.claude

codex/factory/opencode link the whole generated dir, so sections ride free. Addresses Codex outside-voice #4/#6 (runtime pathing landmine). Inert until a skill is carved. Static-tripwire test + windows-fallback invariant green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(ship): gstack-version-bump CLI — tested idempotency classify + write (T9)

Hybrid CLI extraction (CM1): the deterministic core of ship Step 12 becomes a tested CLI instead of bash prose the agent re-derives each run.
- classify: FRESH/ALREADY_BUMPED/DRIFT_STALE_PKG/DRIFT_UNEXPECTED from VERSION vs origin/<base>:VERSION vs package.json.version (pure reader)
- write: validated dual-write to VERSION + package.json (FRESH bump)
- repair: DRIFT_STALE_PKG sync, no re-bump

Bump-LEVEL choice + queue collision stay agent judgment; slot pick stays bin/gstack-next-version. This removes the re-bump-a-shipped-branch footgun from skippable prose into code that can't be skipped or misread. 15 tests (exhaustive state matrix + write/repair fs + real-git classify).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(parity): sectioned-skill parity capability — guards the carve (T9)

Carved skills (skeleton + sections/*.md) need parity checks that see relocated content, or moving a phrase into a section reads as 'lost':
- readSkillForParity(): union skeleton + all sections/*.md
- checkSkillParity sectioned mode: content checks against the union; minBytes/maxSizeRatio against union bytes (total behavior preserved); maxSkeletonBytes asserts the always-loaded skeleton actually shrank. Lowering minBytes to fit a small skeleton would otherwise make the size floor toothless [Codex #12].

Built + tested BEFORE the carve so ship's invariant can flip to sectioned in the same commit it lands. Monolith path byte-identical (verified: pre-existing investigate 1.053 ratio drift fails the same with this change stashed). 7 sectioned-parity tests + existing parity tests green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(ship): carve into skeleton + on-demand sections (Claude) (T9)

ship/SKILL.md drops 167KB → 68.7KB (~59% of the always-loaded skill) by moving 8 prose-heavy steps into ship/sections/*.md, read on demand: tests, test-coverage, plan-completion, review-army, greptile, adversarial, changelog, pr-body. Step 12's version logic now calls the tested gstack-version-bump CLI instead of inline bash.

Claude-first (S2): {{SECTION:id}} emits a STOP-Read pointer on Claude (skeleton + generated section files) and INLINES the content on every other host, so external hosts keep the full monolith — verified factory at 162KB with no sections dir. {{SECTION_INDEX:ship}} renders the situation→section table from the PASSIVE manifest (CM2 / v2_PLAN.md:663); required-reads live only in test fixtures. Multi-pass resolve expands inlined sections' own resolvers.

Parity: ship invariant flipped to sectioned (union content checks + maxSkeletonBytes asserts the shrink). Carve-fallout fixed across gen-skill-docs/skill-validation/golden/plan-completion/#1539/size-budget tests via skeleton+sections union reads. Free suite green except the pre-existing investigate parity drift.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(ship): manifest-consistency + context-parity + requiredReads helper (T9)

Free deterministic guards for the carve:
- required-reads.ts + unit test: assertRequiredReads(run, requiredFiles) — the mechanical layer-5 check that the agent Read the sections its situation needs (required set comes from the fixture, not the passive manifest)
- section-manifest-consistency: 3-tier orphan classification (generated orphan + hand-edited generated file → FAIL; manifest orphan → WARN per v2_PLAN.md) and pins the PASSIVE-manifest contract (no applies_when/required_for)
- template-context-parity: generated sections have zero unresolved placeholders and gated resolvers (ADVERSARIAL_STEP/CONFIDENCE_CALIBRATION/CHANGELOG_WORKFLOW) rendered — proving sections resolve with the parent skillName, not 'sections'

16 tests, all green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(ship): section-loading E2E + idempotency CLI detection (T9)

- skill-e2e-ship-section-loading.test.ts (new, periodic): runs real /ship in plan mode against a fresh version-changing fixture and asserts the agent Read the required sections (review-army + changelog). Runs against the INSTALLED skill (~/.claude/skills/gstack/ship), not repo paths, so install-layout 404s surface [Codex outside-voice #5]. Layer-5 mechanical guard against silent section-skip.
- skill-e2e-ship-idempotency.test.ts: detection updated for the carve — Step 12 now runs gstack-version-bump classify (JSON "state":"ALREADY_BUMPED") instead of the inline bash echo (STATE: ALREADY_BUMPED). Accept both; add a gstack-version-bump-write re-bump regression signal.
- touchfiles: register ship-section-loading (periodic) + extend idempotency deps with bin/gstack-version-bump + scripts/resolvers/sections.ts.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(ship): union-read redaction wiring test for the carve (T9)

main's PR-body redaction-at-sink lives in sections/pr-body.md.tmpl after the carve, not the skeleton template. Read skeleton + section templates union so the redaction-wiring assertions follow the relocated content. 9/9 green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* v1.54.0.0 feat: carve /ship into skeleton + on-demand sections (-59% always-loaded)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
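
The classify step of gstack-version-bump described in the message above can be pictured with a small sketch. The commit names the four states and the three inputs but does not spell out the mapping, so the decision table here is an assumption, and `classifyBump` and `VersionInputs` are hypothetical names, not the CLI's actual code.

```typescript
// Hypothetical sketch of the classify state matrix; the mapping from inputs
// to states is assumed, not taken from bin/gstack-version-bump.
type BumpState = 'FRESH' | 'ALREADY_BUMPED' | 'DRIFT_STALE_PKG' | 'DRIFT_UNEXPECTED';

interface VersionInputs {
  version: string;     // contents of VERSION on the branch
  baseVersion: string; // contents of origin/<base>:VERSION
  pkgVersion: string;  // "version" field of package.json
}

function classifyBump({ version, baseVersion, pkgVersion }: VersionInputs): BumpState {
  const bumped = version !== baseVersion;
  const pkgInSync = pkgVersion === version;

  // VERSION still equals base: a bump is needed, provided package.json agrees.
  if (!bumped) return pkgInSync ? 'FRESH' : 'DRIFT_UNEXPECTED';

  // VERSION already moved past base and package.json followed it.
  if (pkgInSync) return 'ALREADY_BUMPED';

  // VERSION moved but package.json stayed at base: repairable by a sync.
  // Any other combination is treated as unexpected (assumed).
  return pkgVersion === baseVersion ? 'DRIFT_STALE_PKG' : 'DRIFT_UNEXPECTED';
}

// The fresh fixture in the E2E test (VERSION, base, and package.json all 0.0.1):
console.log(classifyBump({ version: '0.0.1', baseVersion: '0.0.1', pkgVersion: '0.0.1' }));
```

Under this assumed table, only FRESH would lead to a write, and DRIFT_STALE_PKG would lead to a repair without a second bump, which is the behavior the message attributes to the CLI.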
2026-07-19 05:57:37 +02:00 · 2026-05-30 12:09:10 -07:00
parent 9562ad4e70
commit 46c1fae7f1
51 changed files with 4445 additions and 4891 deletions
@@ -0,0 +1,120 @@
+/**
+ * /ship section-loading E2E (periodic, paid, real-PTY) — v2 plan T9 mitigation
+ * layer 5, the ONLY CI-failing guard against silent section-skip.
+ *
+ * After the carve, ship is a skeleton whose STOP-Read directives point at
+ * sections/*.md. This test runs the REAL /ship skill in plan mode against a
+ * fresh version-changing fixture and asserts the agent actually Read the
+ * sections its situation requires (review-army + changelog at minimum — every
+ * version-changing ship needs the pre-landing review and a CHANGELOG entry).
+ *
+ * Runs against the INSTALLED skill at ~/.claude/skills/gstack/ship (Codex
+ * outside-voice #5: an E2E that reads repo paths would miss install-layout
+ * 404s). Section reads are detected from the PTY scrollback — when the agent
+ * Reads a section the tool render shows the `sections/<file>.md` path.
+ *
+ * Plan-mode framing keeps the agent from committing/pushing; producing a plan
+ * is the terminal signal. Cost: ~$2-4/run. Periodic tier.
+ *
+ * Situation matrix (T1 = B): this file covers the fresh version-changing ship;
+ * the already-bumped re-run is covered by skill-e2e-ship-idempotency.test.ts,
+ * and a no-plan-file variant can be added to FIXTURES below.
+ */
+
+import { describe, test, expect } from 'bun:test';
+import { spawnSync } from 'child_process';
+import * as fs from 'fs';
+import * as path from 'path';
+import * as os from 'os';
+import {
+  launchClaudePty,
+  isPermissionDialogVisible,
+  isNumberedOptionListVisible,
+} from './helpers/claude-pty-runner';
+
+const shouldRun = !!process.env.EVALS && process.env.EVALS_TIER === 'periodic';
+const describeE2E = shouldRun ? describe : describe.skip;
+
+/** Fresh fixture: feature branch with a real change but VERSION still == base,
+ *  so /ship must bump (FRESH) and walk the full pre-landing + changelog flow. */
+function buildFreshFixture(): { workTree: string; root: string } {
+  const root = fs.mkdtempSync(path.join(os.tmpdir(), 'gstack-ship-secload-'));
+  const workTree = path.join(root, 'workspace');
+  const bareRemote = path.join(root, 'origin.git');
+  fs.mkdirSync(workTree, { recursive: true });
+  const sh = (cmd: string, cwd: string): void => {
+    const r = spawnSync('bash', ['-c', cmd], { cwd, stdio: 'pipe', timeout: 15_000 });
+    if (r.status !== 0) throw new Error(`fixture setup failed at "${cmd}":\n${r.stderr?.toString()}`);
+  };
+  sh(`git init --bare "${bareRemote}"`, root);
+  sh('git init -b main', workTree);
+  sh('git config user.email "t@t.com" && git config user.name "T" && git config commit.gpgsign false', workTree);
+  fs.writeFileSync(path.join(workTree, 'VERSION'), '0.0.1\n');
+  fs.writeFileSync(path.join(workTree, 'package.json'), JSON.stringify({ name: 'fx', version: '0.0.1', private: true }, null, 2) + '\n');
+  fs.writeFileSync(path.join(workTree, 'CHANGELOG.md'), '# Changelog\n\n## [0.0.1] - 2026-01-01\n\n- Initial release\n');
+  fs.writeFileSync(path.join(workTree, 'app.js'), '// base\n');
+  sh('git add -A && git commit -m "chore: initial v0.0.1"', workTree);
+  sh(`git remote add origin "${bareRemote}" && git push -u origin main`, workTree);
+  // Feature branch: a real code change, VERSION untouched → FRESH (needs a bump).
+  sh('git checkout -b feat/new-thing', workTree);
+  fs.writeFileSync(path.join(workTree, 'app.js'), '// base\nexport function newThing() { return 42; }\n');
+  fs.writeFileSync(path.join(workTree, 'app.test.js'), 'test("newThing", () => {});\n');
+  sh('git add -A && git commit -m "feat: add newThing"', workTree);
+  sh('git push -u origin feat/new-thing', workTree);
+  return { workTree, root };
+}
+
+// Sections every version-changing ship must consult.
+const REQUIRED_SECTIONS = ['review-army.md', 'changelog.md'];
+
+describeE2E('/ship section-loading E2E (periodic, real-PTY, installed skill)', () => {
+  test(
+    'fresh version-changing ship Reads the required sections',
+    async () => {
+      const { workTree, root } = buildFreshFixture();
+      const session = await launchClaudePty({
+        permissionMode: 'plan',
+        cwd: workTree,
+        timeoutMs: 720_000,
+        env: { GH_TOKEN: 'mock-not-real', NO_COLOR: '1' },
+      });
+
+      const readSections = new Set<string>();
+      let planReady = false;
+      try {
+        await Bun.sleep(8000);
+        const since = session.mark();
+        session.send('/ship\r');
+        const start = Date.now();
+        let lastPermSig = '';
+        while (Date.now() - start < 600_000) {
+          await Bun.sleep(3000);
+          if (session.exited()) break;
+          const visible = session.visibleSince(since);
+          const tail = visible.slice(-1500);
+          if (isNumberedOptionListVisible(tail) && isPermissionDialogVisible(tail)) {
+            const sig = visible.slice(-500);
+            if (sig !== lastPermSig) { lastPermSig = sig; session.send('1\r'); await Bun.sleep(1500); continue; }
+          }
+          // Detect section reads from the scrollback (tool render shows the path).
+          for (const m of visible.matchAll(/sections\/([A-Za-z0-9._-]+\.md)/g)) readSections.add(m[1]);
+          if (/ready to execute|Would you like to proceed|GSTACK REVIEW REPORT/i.test(visible)) {
+            planReady = true;
+            break;
+          }
+        }
+      } finally {
+        await session.close();
+        try { fs.rmSync(root, { recursive: true, force: true }); } catch { /* ignore */ }
+      }
+
+      const missing = REQUIRED_SECTIONS.filter(s => !readSections.has(s));
+      expect({ planReady, read: [...readSections], missing }).toEqual({
+        planReady: true,
+        read: expect.any(Array),
+        missing: [],
+      });
+    },
+    900_000,
+  );
+});
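
The section-read detection in the test above can be exercised without a PTY session. The sketch below reuses the same `sections/<file>.md` pattern as the test, but `sectionsSeen` and `missingSections` are hypothetical helper names, and the sample scrollback is invented for illustration, not real Claude tool-render output.

```typescript
// Collect the distinct section file names mentioned anywhere in the scrollback.
function sectionsSeen(scrollback: string): Set<string> {
  // Same character class as the regex in the E2E test; created per call so
  // the global regex's lastIndex never leaks between invocations.
  const pattern = /sections\/([A-Za-z0-9._-]+\.md)/g;
  const seen = new Set<string>();
  let m: RegExpExecArray | null;
  while ((m = pattern.exec(scrollback)) !== null) {
    const name = m[1];
    if (name) seen.add(name);
  }
  return seen;
}

// Required sections that never appeared in the scrollback, in the given order.
function missingSections(scrollback: string, required: readonly string[]): string[] {
  const seen = sectionsSeen(scrollback);
  return required.filter((name) => !seen.has(name));
}

// Invented sample: one section path shown twice, the other never shown.
const sample = [
  'Read(~/.claude/skills/gstack/ship/sections/review-army.md)',
  'Read(sections/review-army.md)',
].join('\n');

console.log(missingSections(sample, ['review-army.md', 'changelog.md']));
// prints [ 'changelog.md' ]
```

Because the match is on any occurrence of the path, a section that is merely mentioned in the scrollback also counts as read. That is a property of the pattern itself and is worth keeping in mind when interpreting a passing run.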