diff --git a/CLAUDE.md b/CLAUDE.md index bc21f606..6f12deae 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -118,6 +118,21 @@ CHANGELOG.md is **for users**, not contributors. Write it like product release n - No jargon: say "every question now tells you which project and branch you're in" not "AskUserQuestion format standardized across skill templates via preamble resolver." +## E2E eval failure blame protocol + +When an E2E eval fails during `/ship` or any other workflow, **never claim "not +related to our changes" without proving it.** These systems have invisible couplings — +a preamble text change affects agent behavior, a new helper changes timing, a +regenerated SKILL.md shifts prompt context. + +**Required before attributing a failure to "pre-existing":** +1. Run the same eval on main (or base branch) and show it fails there too +2. If it passes on main but fails on the branch — it IS your change. Trace the blame. +3. If you can't run on main, say "unverified — may or may not be related" and flag it + as a risk in the PR body + +"Pre-existing" without receipts is a lazy claim. Prove it or don't say it. + ## Deploying to the active skill The active skill lives at `~/.claude/skills/gstack/`. After making changes: diff --git a/test/skill-e2e.test.ts b/test/skill-e2e.test.ts index 02e9fc24..aa50a976 100644 --- a/test/skill-e2e.test.ts +++ b/test/skill-e2e.test.ts @@ -13,6 +13,11 @@ import * as os from 'os'; const ROOT = path.resolve(import.meta.dir, '..'); // Skip unless EVALS=1. Session runner strips CLAUDE* env vars to avoid nested session issues. +// +// BLAME PROTOCOL: When an eval fails, do NOT claim "pre-existing" or "not related +// to our changes" without proof. Run the same eval on main to verify. These tests +// have invisible couplings — preamble text, SKILL.md content, and timing all affect +// agent behavior. See CLAUDE.md "E2E eval failure blame protocol" for details. const evalsEnabled = !!process.env.EVALS; const describeE2E = evalsEnabled ? describe : describe.skip;