From 3492f98e828416c4aa41dbfed8be69f344501d64 Mon Sep 17 00:00:00 2001
From: Garry Tan <garrytan@gmail.com>
Date: Mon, 16 Mar 2026 10:21:46 -0500
Subject: [PATCH] docs: add E2E eval failure blame protocol
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

"Not related to our changes" is an extraordinary claim that requires
extraordinary proof. When evals fail during /ship:

1. Run the same eval on main — prove it fails there too
2. If it passes on main, it IS your change — trace the blame
3. If you can't verify, say "unverified" not "pre-existing"

Added to CLAUDE.md and as a comment in test/skill-e2e.test.ts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---
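Reviewer note (not part of the commit): the three rules in the message above
reduce to a small decision table. What follows is a minimal TypeScript sketch
under stated assumptions. The names `BaseOutcome`, `Attribution`, and
`attributeBranchFailure` are hypothetical and do not exist in the repo. The
function only classifies an outcome you have already collected by running the
eval on the base branch; it does not run anything itself.

```typescript
// Hypothetical sketch: none of these names exist in the repo.
// Result of re-running the failing eval on main (or the base branch).
// "not-run" means the eval could not be run there at all.
type BaseOutcome = "pass" | "fail" | "not-run";

type Attribution =
  | "pre-existing"      // rule 1: it fails on the base branch too
  | "caused-by-branch"  // rule 2: it passes on base, so the branch broke it
  | "unverified";       // rule 3: no evidence either way

// Assumes the eval has already failed on the branch under review.
function attributeBranchFailure(base: BaseOutcome): Attribution {
  switch (base) {
    case "fail":
      return "pre-existing";
    case "pass":
      return "caused-by-branch";
    case "not-run":
      return "unverified";
  }
}

console.log(attributeBranchFailure("pass"));    // prints "caused-by-branch"
console.log(attributeBranchFailure("not-run")); // prints "unverified"
```

Keeping "unverified" as its own value, rather than folding it into either
side, mirrors rule 3: a missing base-branch run is not evidence.
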
 CLAUDE.md              | 15 +++++++++++++++
 test/skill-e2e.test.ts |  5 +++++
 2 files changed, 20 insertions(+)

diff --git a/CLAUDE.md b/CLAUDE.md
index bc21f606..6f12deae 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -118,6 +118,21 @@ CHANGELOG.md is **for users**, not contributors. Write it like product release n
 - No jargon: say "every question now tells you which project and branch you're in" not
   "AskUserQuestion format standardized across skill templates via preamble resolver."
 
+## E2E eval failure blame protocol
+
+When an E2E eval fails during `/ship` or any other workflow, **never claim "not
+related to our changes" without proving it.** These systems have invisible couplings:
+a preamble text change can alter agent behavior, a new helper can change timing, and
+a regenerated SKILL.md can shift prompt context.
+
+**Required before calling a failure "pre-existing":**
+1. Run the same eval on main (or the base branch) and show it fails there too.
+2. If it passes on the base branch but fails on yours, it IS your change. Trace the blame.
+3. If you can't run it on the base branch, say "unverified" (it may or may not be
+   related) and flag it as a risk in the PR body.
+
+"Pre-existing" without receipts is a lazy claim. Prove it or don't say it.
+
 ## Deploying to the active skill
 
 The active skill lives at `~/.claude/skills/gstack/`. After making changes:
diff --git a/test/skill-e2e.test.ts b/test/skill-e2e.test.ts
index 02e9fc24..aa50a976 100644
--- a/test/skill-e2e.test.ts
+++ b/test/skill-e2e.test.ts
@@ -13,6 +13,11 @@ import * as os from 'os';
 const ROOT = path.resolve(import.meta.dir, '..');
 
 // Skip unless EVALS=1. Session runner strips CLAUDE* env vars to avoid nested session issues.
+//
+// BLAME PROTOCOL: When an eval fails, do NOT claim "pre-existing" or "not related
+// to our changes" without proof. Run the same eval on main to verify. These tests
+// have invisible couplings — preamble text, SKILL.md content, and timing all affect
+// agent behavior. See CLAUDE.md "E2E eval failure blame protocol" for details.
 const evalsEnabled = !!process.env.EVALS;
 const describeE2E = evalsEnabled ? describe : describe.skip;