docs: document hermetic-by-default E2E + eval:bg detached runs in CONTRIBUTING

The Testing & evals section now tells contributors that local E2E runners spawn children through a sealed clean room (allowlist-scrubbed env, seeded CLAUDE_CONFIG_DIR, temp GSTACK_HOME, --strict-mcp-config) so local signal matches CI, with EVALS_HERMETIC=0 as the escape hatch. The eval-tools list gains the eval:bg* detached-run scripts (gstack-detach: SIGTERM-proof, caffeinate-wrapped, machine-locked, run-scoped logs, EXIT= sentinel). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 07:10:12 +02:00 · 2026-06-14 09:28:39 -07:00
parent a4197b962c
commit d5b869afd5
1 changed files with 31 additions and 0 deletions
@@ -176,6 +176,18 @@ EVALS=1 bun test test/skill-e2e-*.test.ts
 - Saves full NDJSON transcripts and failure JSON for debugging
 - Tests live in `test/skill-e2e-*.test.ts` (split by category), runner logic in `test/helpers/session-runner.ts`

+**Hermetic by default.** Every E2E runner (claude -p, the real-PTY plan-mode
+runner, the Agent SDK runner, plus the codex and gemini runners) spawns its child
+through `test/helpers/hermetic-env.ts`: an allowlist-scrubbed environment, a fresh
+seeded `CLAUDE_CONFIG_DIR`, a temp `GSTACK_HOME`, and `--strict-mcp-config`. Your
+operator `~/.claude` config, MCP servers (gbrain, Conductor), skills, `~/.gstack`
+decision logs, and `CONDUCTOR_*` env never leak into the child, so local eval
+signal matches CI instead of disagreeing for reasons unrelated to the code under
+test. Set `EVALS_HERMETIC=0` to debug against your real operator state (this also
+drops `--strict-mcp-config`). The wiring is pinned by `test/hermetic-wiring.test.ts`
+(a free static tripwire) and two gate-tier isolation canaries in
+`test/skill-e2e-hermetic-canary.test.ts`.
+
 ### E2E observability

 When E2E tests run, they produce machine-readable artifacts in `~/.gstack-dev/`:
@@ -198,6 +210,25 @@ bun run eval:compare         # compare two runs — shows per-test deltas + Take
 bun run eval:summary         # aggregate stats + per-test efficiency averages across runs
 ```

+**Detached runs for agents and long suites.** When an agent (or you, for a run
+you don't want to babysit) launches a long eval, use the `eval:bg*` scripts. They
+wrap the eval command in `bin/gstack-detach`: a fresh session that escapes a
+turn-boundary SIGTERM, a `caffeinate` wrapper that blocks idle-sleep, a machine-wide
+`gstack-evals` lock so concurrent worktrees serialize instead of saturating the
+model API, a run-scoped log under `~/.gstack-dev/eval-runs/`, a per-tier watchdog,
+and a guaranteed `### gstack-detach EXIT=<code> ###` sentinel so a poller never
+mistakes silence for success.
+
+```bash
+bun run eval:bg              # detached test:evals (diff-based)
+bun run eval:bg:all          # detached test:evals:all
+bun run eval:bg:gate         # detached gate-tier suite
+bun run eval:bg:periodic     # detached periodic-tier suite
+```
+
+Each prints its log path. Humans running `bun run test:evals` foreground in their
+own terminal don't need this — Ctrl-C is intended there.
+
 **Eval comparison commentary:** `eval:compare` generates natural-language Takeaway sections interpreting what changed between runs — flagging regressions, noting improvements, calling out efficiency gains (fewer turns, faster, cheaper), and producing an overall summary. This is driven by `generateCommentary()` in `eval-store.ts`.

 Artifacts are never cleaned up — they accumulate in `~/.gstack-dev/` for post-mortem debugging and trend analysis.