feat: context rot defense for /ship — subagent isolation + clean step numbering (v0.18.1.0) (#1030)

* refactor: renumber /ship steps to clean integers (1-20) Replaces fractional step numbers (1.5, 2.5, 3.25, 3.4, 3.45, 3.47, 3.48, 3.5, 3.55, 3.56, 3.57, 3.75, 3.8, 5.5, 6.5, 8.5, 8.75) with clean integers 1 through 20, plus allowed resolver sub-steps 8.1, 8.2, 9.1, 9.2, 9.3. Fractional numbering signaled "optional appendix" and contributed to /ship's habit of skipping late-stage steps. Affects: - ship/SKILL.md.tmpl (all headings + ~30 cross-references) - scripts/resolvers/review.ts (ship-side 3.47/3.48/3.57/3.8 conditionals) - scripts/resolvers/review-army.ts (ship-side 3.55/3.56 conditionals) - scripts/resolvers/testing.ts (ship-side 2.5/3.4 references, 5 sites) - scripts/resolvers/utility.ts (CHANGELOG heading gets Step 13 prefix) - test/gen-skill-docs.test.ts (5 step-number assertions updated) - test/skill-validation.test.ts (3 step-number assertions updated) /review step numbering (1.5, 2.5, 4.5, 5.5-5.8) intentionally unchanged — only the ship-side of each isShip conditional was updated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat: subagent isolation for /ship's 4 context-heaviest sub-workflows Fights context rot. By late /ship, the parent context is bloated with 500-1,750 lines of intermediate tool output from tests, coverage audits, reviews, adversarial checks, and PR body construction. The model is at its least intelligent when it reaches doc-sync — which is why /document-release was being skipped ~80% of the time. Applies subagent dispatch (proven pattern from Review Army at Step 9.1 and Adversarial at Step 11) to four sub-workflows where the parent only needs the conclusion, not the intermediate output: - Step 7 (Test Coverage Audit) — subagent returns coverage_pct, gaps, diagram, tests_added - Step 8 (Plan Completion Audit) — subagent returns total_items, done, changed, deferred, summary - Step 10 (Greptile Triage) — subagent fetches + classifies, parent handles user interaction and commits fixes (AskUserQuestion + Edit can't run in subagents) - Step 18 (Documentation Sync) — subagent invokes full /document-release skill in fresh context; parent embeds documentation_section in PR body Sequencing fix for Step 18: runs AFTER Step 17 (Push) and BEFORE Step 19 (Create PR). The PR is created once from final HEAD with the ## Documentation section baked into the initial body — no create-then- re-edit dance, no race conditions with document-release's own PR body editor. Adds "You are NOT done" guardrail after Step 17 (Push) to break the natural stopping point that currently causes doc-release skips. Each subagent falls back to inline execution if it fails or returns invalid JSON. /ship never blocks on subagent failure. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test: regression guard for /ship step numbering Three regression guards in skill-validation.test.ts to prevent future drift back to fractional step numbering: 1. ship/SKILL.md.tmpl contains no fractional step numbers except the allowed resolver sub-steps (8.1, 8.2, 9.1, 9.2, 9.3). A contributor adding "Step 3.75" next month will fail this test with a clear error. 2. ship/SKILL.md main headings use clean integer step numbers. If a renumber accidentally leaves a decimal heading, this catches it. 3. review/SKILL.md step numbers unchanged — regression guard for the resolver conditionals in review.ts/review-army.ts. If a future edit accidentally touches the review-side of an isShip ternary, /review's fractional numbering (1.5, 4.5, 5.7) would vanish. This test catches that cross-contamination. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: sync ship step references after renumber CLAUDE.md: "At /ship time (Step 5)" → "(Step 13)" — CHANGELOG is now explicitly Step 13 after the renumber (was implicit between old Step 4 and Step 5.5). TODOS.md: "Step 3.4 coverage audit" → "Step 7" — references the open TODO for auto-upgrading ★-rated tests, which hooks into the coverage audit step. Both are historical references to ship's step numbering that became stale when clean integer renumbering landed in 566d42c2. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test: update golden ship skill baselines after renumber + subagent refactor The golden fixtures at test/fixtures/golden/{claude,codex,factory}-ship-SKILL.md regression-test that generated ship/SKILL.md output matches a committed baseline. After renumbering steps to clean integers and converting 4 sub-workflows to subagent dispatches, the generated output changed substantially — refresh the baselines to reflect the new expected output. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: bump version and changelog (v0.18.1.0) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: gitignore Claude Code harness runtime artifacts .claude/scheduled_tasks.lock appears when ScheduleWakeup fires. It's a runtime lock file owned by the Claude Code harness, not project source. Add .claude/*.lock too so future harness artifacts in that directory don't need their own gitignore entries. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 19:25:10 +02:00 · 2026-04-16 23:14:03 -07:00
parent 822e843a60
commit b3eaffce07
19 changed files with 900 additions and 552 deletions
@@ -1005,7 +1005,7 @@ describe('Test Bootstrap ({{TEST_BOOTSTRAP}}) integration', () => {
  test('TEST_BOOTSTRAP appears in ship/SKILL.md', () => {
    const content = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md'), 'utf-8');
    expect(content).toContain('Test Framework Bootstrap');
-    expect(content).toContain('Step 2.5');
+    expect(content).toContain('Step 4');
  });

  test('TEST_BOOTSTRAP appears in design-review/SKILL.md', () => {
@@ -1100,9 +1100,9 @@ describe('Phase 8e.5 regression test generation', () => {
 // --- Step 3.4 coverage audit validation ---

 describe('Step 3.4 test coverage audit', () => {
-  test('ship/SKILL.md contains Step 3.4', () => {
+  test('ship/SKILL.md contains Step 7', () => {
    const content = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md'), 'utf-8');
-    expect(content).toContain('Step 3.4: Test Coverage Audit');
+    expect(content).toContain('Step 7: Test Coverage Audit');
    expect(content).toContain('CODE PATH COVERAGE');
  });

@@ -1127,7 +1127,7 @@ describe('Step 3.4 test coverage audit', () => {

  test('ship rules include test generation rule', () => {
    const content = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md'), 'utf-8');
-    expect(content).toContain('Step 3.4 generates coverage tests');
+    expect(content).toContain('Step 7 generates coverage tests');
    expect(content).toContain('Never commit failing tests');
  });

@@ -1161,6 +1161,53 @@ describe('Step 3.4 test coverage audit', () => {
  });
 });

+// --- Ship step numbering regression guard ---
+
+describe('ship step numbering', () => {
+  // Allowed sub-steps that are resolver-generated and intentionally nested:
+  // 8.1 (Plan Verification), 8.2 (Scope Drift), 9.1 (Review Army), 9.2 (Findings Merge), 9.3 (Cross-review dedup)
+  const ALLOWED_SUBSTEPS = new Set(['8.1', '8.2', '9.1', '9.2', '9.3']);
+
+  test('ship/SKILL.md.tmpl contains no unexpected fractional step numbers', () => {
+    const tmpl = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md.tmpl'), 'utf-8');
+    // Match "Step X.Y" where X.Y is a decimal step reference (e.g., "Step 3.47", "Step 8.1")
+    const matches = Array.from(tmpl.matchAll(/Step (\d+\.\d+)/g));
+    const violations = matches
+      .map((m) => m[1])
+      .filter((n) => !ALLOWED_SUBSTEPS.has(n));
+    if (violations.length > 0) {
+      const unique = Array.from(new Set(violations)).sort();
+      throw new Error(
+        `ship/SKILL.md.tmpl contains fractional step numbers that are not in the allowed sub-step list.\n` +
+          `  Found: ${unique.join(', ')}\n` +
+          `  Allowed sub-steps: ${Array.from(ALLOWED_SUBSTEPS).sort().join(', ')}\n` +
+          `  Fix: use clean integer step numbers (1-20), or add to ALLOWED_SUBSTEPS if intentional.`
+      );
+    }
+  });
+
+  test('ship/SKILL.md main headings use clean integer step numbers', () => {
+    const skill = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md'), 'utf-8');
+    // Headings like "## Step 7: Test Coverage Audit" — NOT sub-steps like "## Step 8.1:"
+    const headings = Array.from(skill.matchAll(/^## Step (\d+(?:\.\d+)?):/gm)).map(
+      (m) => m[1]
+    );
+    const fractional = headings.filter((n) => n.includes('.'));
+    const unexpected = fractional.filter((n) => !ALLOWED_SUBSTEPS.has(n));
+    expect(unexpected).toEqual([]);
+  });
+
+  test('review/SKILL.md step numbers unchanged (regression guard for resolver conditionals)', () => {
+    const skill = fs.readFileSync(path.join(ROOT, 'review', 'SKILL.md'), 'utf-8');
+    // /review uses its own fractional numbering: 1.5, 2.5, 4.5, 5.5, 5.6, 5.7, 5.8
+    // If the ship-side renumber accidentally touched the review-side of resolver conditionals,
+    // these would vanish. This test catches that.
+    expect(skill).toContain('## Step 1.5: Scope Drift Detection');
+    expect(skill).toContain('## Step 4.5: Review Army');
+    expect(skill).toContain('## Step 5.7: Adversarial review');
+  });
+});
+
 // --- Retro test health validation ---

 describe('Retro test health tracking', () => {