test: add multi-finding batching regression test (periodic tier)

Adds a periodic-tier E2E that catches the May 2026 transcript bug shape the existing single-finding gate-tier floor test cannot detect: a model that fires one AskUserQuestion and then batches the remaining findings into a single "## Decisions to confirm" plan write + ExitPlanMode.

Why a separate test from skill-e2e-plan-eng-finding-floor: the gate-tier floor (runPlanSkillFloorCheck) exits on the first AUQ render and returns success, so a once-then-batch model would pass it trivially. This test uses runPlanSkillCounting at periodic tier with N-AUQ tracking and asserts >= 3 distinct review-phase AUQs on a 4-finding seeded plan.

- test/fixtures/forcing-finding-seeds.ts: FORCING_BATCHING_ENG fixture (4 distinct non-trivial findings spread across Architecture, Code Quality, Tests, Performance; mirrors the D1-D4 transcript shape)
- test/skill-e2e-plan-eng-multi-finding-batching.test.ts: new test
- test/helpers/touchfiles.ts: registered in BOTH E2E_TOUCHFILES and E2E_TIERS (touchfiles.test.ts asserts exact equality)

Test will fail on baseline today because today's model uses the preamble fallback to batch findings; passes after the architectural fix lands in a follow-up commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
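For readers unfamiliar with the harness, the difference between the two checks can be sketched roughly as below. This is an illustrative sketch only: the `ToolEvent` shape, `countDistinctReviewAuqs`, and `meetsBatchingFloor` are hypothetical names invented for this example and are not the actual runPlanSkillCounting implementation, whose internals this commit does not show. The only things taken from the commit are the idea of counting distinct review-phase AUQs and the N-1 floor (3 for a 4-finding plan).

```typescript
// Hypothetical event shape; the real harness types are not shown in this commit.
type ToolEvent =
  | { tool: 'AskUserQuestion'; phase: 'preamble' | 'review'; question: string }
  | { tool: 'Write' | 'ExitPlanMode'; phase: 'preamble' | 'review' };

// Count distinct review-phase AUQs, de-duplicating on normalized question text
// (the de-duplication rule is an assumption made for this sketch).
function countDistinctReviewAuqs(events: ToolEvent[]): number {
  const seen = new Set<string>();
  for (const e of events) {
    if (e.tool === 'AskUserQuestion' && e.phase === 'review') {
      seen.add(e.question.trim().toLowerCase());
    }
  }
  return seen.size;
}

// Floor is N-1 for N seeded findings (3 for the 4-finding seed).
function meetsBatchingFloor(events: ToolEvent[], seededFindings: number): boolean {
  return countDistinctReviewAuqs(events) >= seededFindings - 1;
}

// Once-then-batch shape: one AUQ, then a single plan write and ExitPlanMode.
const onceThenBatch: ToolEvent[] = [
  { tool: 'AskUserQuestion', phase: 'review', question: 'D1: custom scheduler?' },
  { tool: 'Write', phase: 'review' },
  { tool: 'ExitPlanMode', phase: 'review' },
];

// Desired shape: one AUQ per independent finding.
const perFinding: ToolEvent[] = [
  { tool: 'AskUserQuestion', phase: 'review', question: 'D1: custom scheduler?' },
  { tool: 'AskUserQuestion', phase: 'review', question: 'D2: duplicated envelope?' },
  { tool: 'AskUserQuestion', phase: 'review', question: 'D3: missing regression test?' },
  { tool: 'AskUserQuestion', phase: 'review', question: 'D4: payload re-fetch?' },
];

console.log(meetsBatchingFloor(onceThenBatch, 4)); // false
console.log(meetsBatchingFloor(perFinding, 4)); // true
```

A check that returns success on the first AUQ would accept both transcripts above; counting to the N-1 floor is what separates them.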
2026-06-19 08:10:08 +02:00 · 2026-05-08 15:25:17 -07:00
parent 443bde054c
commit 4c2bcf5c17
3 changed files with 143 additions and 0 deletions
@@ -81,3 +81,42 @@ export const FORCING_FLOOR_DEVEX = [
  '',
  'No quickstart command, no hosted sandbox, no copy-pasteable curl example.',
 ].join('\n');
+
+/**
+ * Multi-finding batching regression seed (periodic tier).
+ *
+ * Mirrors the May 2026 transcript bug shape: 4 distinct non-trivial findings
+ * spread across plan-eng-review's standard sections (Architecture, Code
+ * Quality, Tests, Performance). Each finding is independent — there is no
+ * legitimate reason to batch them into a single AskUserQuestion.
+ *
+ * Used by test/skill-e2e-plan-eng-multi-finding-batching.test.ts to assert
+ * the agent fires >= 3 review-phase AUQs (i.e., does NOT batch them into a
+ * "## Decisions to confirm" section + ExitPlanMode). Floor of 3 (not 4) is
+ * the [N-1] tolerance from the existing finding-count band convention.
+ */
+export const FORCING_BATCHING_ENG = [
+  'Please review this plan thoroughly. As you go, write your plan-mode plan to /tmp/gstack-test-plan-eng-batching.md (use Edit/Write to that exact path).',
+  '',
+  '# Plan: Add background job retry framework',
+  '',
+  '## Architecture',
+  "We'll roll a custom exponential-backoff scheduler inline in each worker",
+  "rather than use the existing job library's built-in retry hooks. Same",
+  'shape as the library version, but we want full control over the curve.',
+  '',
+  '## Code quality',
+  'The retry envelope (compute delay, log attempt, dispatch) is duplicated',
+  'across 5 worker files with copy-pasted bodies. We will leave the',
+  'duplication for now and refactor "later."',
+  '',
+  '## Tests',
+  'The existing `processWebhookJob()` flow gets rewritten as part of this',
+  'change. No regression test for the prior at-most-once delivery guarantee',
+  'is planned.',
+  '',
+  '## Performance',
+  'On every retry we re-fetch the full job payload from the database, then',
+  'iterate the payload to recompute the dependency graph. Could cache the',
+  'graph on the first attempt; not planned.',
+].join('\n');