test: migrate section-loading E2E to lossless SDK tool-stream detection

The /ship and /plan-ceo-review section-loading tests drove a real PTY and scraped the ANSI screen buffer for sections/<file>.md paths. In a Conductor PTY that scrape silently saw nothing (cursor-positioned tool renders and an unanswered Step 0 question loop both defeat the regex), so both tests reported read: [] even when the agent did the work.

They now run the skill through claude -p (the same SDK path the AUQ matrix uses) and detect section reads from the tool-use stream (Read calls whose file_path contains sections/<file>.md), with no rendering layer to mangle the paths.

The run is also hermetic: the freshly-generated worktree skeleton + sections are copied into a throwaway fixture with the absolute path pinned, so the test validates this branch's carve without mutating the user's ~/.claude install.

Validated with EVALS_TIER=periodic: both pass (plan-ceo Reads review-sections.md; ship Reads review-army.md + changelog.md), ~6.5 min for both vs ~23 min combined on the old PTY path, where both were failing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 00:00:13 +02:00 · 2026-06-03 07:28:44 -07:00
parent 857f100d79
commit 8bb733f54d
5 changed files with 191 additions and 254 deletions
@@ -1,120 +1,83 @@
 /**
- * /ship section-loading E2E (periodic, paid, real-PTY) — v2 plan T9 mitigation
- * layer 5, the ONLY CI-failing guard against silent section-skip.
+ * /ship section-loading E2E (periodic, paid, SDK capture) — v2 plan T9 mitigation
+ * layer 5: the behavioral guard that a real agent Reads the carved sections a
+ * version-changing ship requires instead of working from the skeleton's memory.
 *
- * After the carve, ship is a skeleton whose STOP-Read directives point at
- * sections/*.md. This test runs the REAL /ship skill in plan mode against a
- * fresh version-changing fixture and asserts the agent actually Read the
- * sections its situation requires (review-army + changelog at minimum — every
- * version-changing ship needs the pre-landing review and a CHANGELOG entry).
+ * Detection is LOSSLESS. Earlier this test drove a real PTY and scraped the ANSI
+ * screen buffer for `sections/<file>.md` paths, which silently saw nothing in a
+ * Conductor PTY (cursor-positioned tool renders + an unanswered question loop
+ * defeat the regex — it reported `read: []` even when the agent did the work). It
+ * now runs the skill through `claude -p` (the SDK path the AUQ matrix uses) and
+ * detects section reads from the tool-use stream (`Read` calls whose file_path
+ * contains `sections/review-army.md` / `sections/changelog.md`).
 *
- * Runs against the INSTALLED skill at ~/.claude/skills/gstack/ship (Codex
- * outside-voice #5: an E2E that reads repo paths would miss install-layout
- * 404s). Section reads are detected from the PTY scrollback — when the agent
- * Reads a section the tool render shows the `sections/<file>.md` path.
+ * Hermetic, not install-mutating: the freshly-generated worktree skeleton +
+ * sections are copied into a throwaway fixture dir and the absolute path is pinned,
+ * so the test validates the current carve without touching the user's active
+ * ~/.claude install. (Install-layout linking is covered by
+ * setup-sections-linking.test.ts.)
 *
- * Plan-mode framing keeps the agent from committing/pushing; producing a plan
- * is the terminal signal. Cost: ~$2-4/run. Periodic tier.
- *
- * Situation matrix (T1 = B): this file covers the fresh version-changing ship;
- * the already-bumped re-run is covered by skill-e2e-ship-idempotency.test.ts,
- * and a no-plan-file variant can be added to FIXTURES below.
+ * The agent is told AskUserQuestion is unavailable and is given the version-changing
+ * situation explicitly (no Bash, so it can't and needn't probe git), so it follows
+ * the skeleton's STOP-Read directives for that situation. Cost: ~$1-2/run.
+ * Periodic tier.
 */

 import { describe, test, expect } from 'bun:test';
-import { spawnSync } from 'child_process';
-import * as fs from 'fs';
-import * as path from 'path';
-import * as os from 'os';
 import {
-  launchClaudePty,
-  isPermissionDialogVisible,
-  isNumberedOptionListVisible,
-} from './helpers/claude-pty-runner';
+  setupSkillDir,
+  skillFromWorktree,
+  captureSectionReads,
+} from './helpers/auq-sdk-capture';

 const shouldRun = !!process.env.EVALS && process.env.EVALS_TIER === 'periodic';
 const describeE2E = shouldRun ? describe : describe.skip;
-
-/** Fresh fixture: feature branch with a real change but VERSION still == base,
- *  so /ship must bump (FRESH) and walk the full pre-landing + changelog flow. */
-function buildFreshFixture(): { workTree: string; root: string } {
-  const root = fs.mkdtempSync(path.join(os.tmpdir(), 'gstack-ship-secload-'));
-  const workTree = path.join(root, 'workspace');
-  const bareRemote = path.join(root, 'origin.git');
-  fs.mkdirSync(workTree, { recursive: true });
-  const sh = (cmd: string, cwd: string): void => {
-    const r = spawnSync('bash', ['-c', cmd], { cwd, stdio: 'pipe', timeout: 15_000 });
-    if (r.status !== 0) throw new Error(`fixture setup failed at "${cmd}":\n${r.stderr?.toString()}`);
-  };
-  sh(`git init --bare "${bareRemote}"`, root);
-  sh('git init -b main', workTree);
-  sh('git config user.email "t@t.com" && git config user.name "T" && git config commit.gpgsign false', workTree);
-  fs.writeFileSync(path.join(workTree, 'VERSION'), '0.0.1\n');
-  fs.writeFileSync(path.join(workTree, 'package.json'), JSON.stringify({ name: 'fx', version: '0.0.1', private: true }, null, 2) + '\n');
-  fs.writeFileSync(path.join(workTree, 'CHANGELOG.md'), '# Changelog\n\n## [0.0.1] - 2026-01-01\n\n- Initial release\n');
-  fs.writeFileSync(path.join(workTree, 'app.js'), '// base\n');
-  sh('git add -A && git commit -m "chore: initial v0.0.1"', workTree);
-  sh(`git remote add origin "${bareRemote}" && git push -u origin main`, workTree);
-  // Feature branch: a real code change, VERSION untouched → FRESH (needs a bump).
-  sh('git checkout -b feat/new-thing', workTree);
-  fs.writeFileSync(path.join(workTree, 'app.js'), '// base\nexport function newThing() { return 42; }\n');
-  fs.writeFileSync(path.join(workTree, 'app.test.js'), 'test("newThing", () => {});\n');
-  sh('git add -A && git commit -m "feat: add newThing"', workTree);
-  sh('git push -u origin feat/new-thing', workTree);
-  return { workTree, root };
-}
+const runId = `ship-section-loading-${process.env.EVALS_RUN_ID ?? 'local'}`;

 // Sections every version-changing ship must consult.
 const REQUIRED_SECTIONS = ['review-army.md', 'changelog.md'];

-describeE2E('/ship section-loading E2E (periodic, real-PTY, installed skill)', () => {
+const FIXTURES: Record<string, string> = {
+  VERSION: '0.0.1\n',
+  'package.json': JSON.stringify({ name: 'fx', version: '0.0.1', private: true }, null, 2) + '\n',
+  'CHANGELOG.md': '# Changelog\n\n## [0.0.1] - 2026-01-01\n\n- Initial release\n',
+  'app.js': '// base\nexport function newThing() { return 42; }\n',
+  'app.test.js': 'test("newThing", () => {});\n',
+};
+
+describeE2E('/ship section-loading E2E (periodic, SDK capture)', () => {
  test(
    'fresh version-changing ship Reads the required sections',
    async () => {
-      const { workTree, root } = buildFreshFixture();
-      const session = await launchClaudePty({
-        permissionMode: 'plan',
-        cwd: workTree,
-        timeoutMs: 720_000,
-        env: { GH_TOKEN: 'mock-not-real', NO_COLOR: '1' },
+      const { skillMd, sectionsFrom } = skillFromWorktree('ship');
+      const planDir = setupSkillDir({
+        skillName: 'ship',
+        skillMd,
+        sectionsFrom,
+        fixtures: FIXTURES,
+        tmpPrefix: 'gstack-ship-secload-',
      });

-      const readSections = new Set<string>();
-      let planReady = false;
-      try {
-        await Bun.sleep(8000);
-        const since = session.mark();
-        session.send('/ship\r');
-        const start = Date.now();
-        let lastPermSig = '';
-        while (Date.now() - start < 600_000) {
-          await Bun.sleep(3000);
-          if (session.exited()) break;
-          const visible = session.visibleSince(since);
-          const tail = visible.slice(-1500);
-          if (isNumberedOptionListVisible(tail) && isPermissionDialogVisible(tail)) {
-            const sig = visible.slice(-500);
-            if (sig !== lastPermSig) { lastPermSig = sig; session.send('1\r'); await Bun.sleep(1500); continue; }
-          }
-          // Detect section reads from the scrollback (tool render shows the path).
-          for (const m of visible.matchAll(/sections\/([A-Za-z0-9._-]+\.md)/g)) readSections.add(m[1]);
-          if (/ready to execute|Would you like to proceed|GSTACK REVIEW REPORT/i.test(visible)) {
-            planReady = true;
-            break;
-          }
-        }
-      } finally {
-        await session.close();
-        try { fs.rmSync(root, { recursive: true, force: true }); } catch { /* ignore */ }
-      }
+      const { readSections, reportProduced, output } = await captureSectionReads({
+        planDir,
+        skillName: 'ship',
+        scenario:
+          'This is a FRESH version-changing ship: the branch has a real code change (app.js gained a new function with a test), VERSION still equals the base version (0.0.1, so it needs a bump), and CHANGELOG.md needs a new entry. Follow the skill\'s flow for a version-changing ship: run the pre-landing review and prepare the CHANGELOG entry. Produce the ship plan / review report. Do NOT actually commit, push, or open a PR.',
+        requiredSections: REQUIRED_SECTIONS,
+        reportMarker: /version|changelog|review|ship/i,
+        testName: '/ship section-loading',
+        runId,
+      });

      const missing = REQUIRED_SECTIONS.filter(s => !readSections.has(s));
-      expect({ planReady, read: [...readSections], missing }).toEqual({
-        planReady: true,
+      expect({ reportProduced, read: [...readSections], missing }).toEqual({
+        reportProduced: true,
        read: expect.any(Array),
        missing: [],
      });
+      // Guard against an empty pass: the report must have real content.
+      expect(output.trim().length).toBeGreaterThan(200);
    },
-    900_000,
+    360_000,
  );
 });
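
Note on the detection described in the commit message: the sketch below shows one way to collect section reads from a tool-use stream, assuming each event has already been parsed into an object with a tool `name` and an `input` record. The `ToolUseEvent` shape and the `collectSectionReads` and `missingSections` helpers are hypothetical names for illustration only. They are not the implementation in ./helpers/auq-sdk-capture, which this diff does not show.

```typescript
// Hypothetical event shape (an assumption, not the SDK's actual type).
interface ToolUseEvent {
  type: 'tool_use';
  name: string;
  input: Record<string, unknown>;
}

// Same path pattern the old PTY scrape used, applied to structured input
// instead of rendered terminal text. No `g` flag, so exec() is stateless.
const SECTION_PATH = /sections\/([A-Za-z0-9._-]+\.md)/;

/** Collect the section file names named by Read calls in the stream. */
function collectSectionReads(events: ToolUseEvent[]): Set<string> {
  const read = new Set<string>();
  for (const ev of events) {
    // Only Read calls count; a Bash command that mentions the path does not.
    if (ev.name !== 'Read') continue;
    const filePath = ev.input.file_path;
    if (typeof filePath !== 'string') continue;
    const match = SECTION_PATH.exec(filePath);
    if (match) read.add(match[1]);
  }
  return read;
}

/** Required sections that no Read call touched. */
function missingSections(required: string[], read: Set<string>): string[] {
  return required.filter(s => !read.has(s));
}
```

Because the match runs on the `file_path` argument rather than on rendered output, cursor positioning and redraws cannot hide a read, which is the property the commit message calls lossless.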