From d5b869afd56bbe0d688b6570bcd67352c9f9bddf Mon Sep 17 00:00:00 2001
From: Garry Tan
Date: Sun, 14 Jun 2026 09:28:39 -0700
Subject: [PATCH] docs: document hermetic-by-default E2E + eval:bg detached runs in CONTRIBUTING

The Testing & evals section now tells contributors that local E2E runners
spawn children through a sealed clean room (allowlist-scrubbed env, seeded
CLAUDE_CONFIG_DIR, temp GSTACK_HOME, --strict-mcp-config) so local signal
matches CI, with EVALS_HERMETIC=0 as the escape hatch. The eval-tools list
gains the eval:bg* detached-run scripts (gstack-detach: SIGTERM-proof,
caffeinate-wrapped, machine-locked, run-scoped logs, EXIT= sentinel).

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 CONTRIBUTING.md | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 5a56ef5d3..b75d4a898 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -176,6 +176,18 @@ EVALS=1 bun test test/skill-e2e-*.test.ts
 - Saves full NDJSON transcripts and failure JSON for debugging
 - Tests live in `test/skill-e2e-*.test.ts` (split by category), runner logic in `test/helpers/session-runner.ts`
 
+**Hermetic by default.** Every E2E runner (claude -p, the real-PTY plan-mode
+runner, the Agent SDK runner, plus the codex and gemini runners) spawns its child
+through `test/helpers/hermetic-env.ts`: an allowlist-scrubbed environment, a fresh
+seeded `CLAUDE_CONFIG_DIR`, a temp `GSTACK_HOME`, and `--strict-mcp-config`. Your
+operator `~/.claude` config, MCP servers (gbrain, Conductor), skills, `~/.gstack`
+decision logs, and `CONDUCTOR_*` env never leak into the child, so local eval
+signal matches CI instead of disagreeing for reasons unrelated to the code under
+test. Set `EVALS_HERMETIC=0` to debug against your real operator state (this also
+drops `--strict-mcp-config`). The wiring is pinned by `test/hermetic-wiring.test.ts`
+(a free static tripwire) and two gate-tier isolation canaries in
+`test/skill-e2e-hermetic-canary.test.ts`.
+
 ### E2E observability
 
 When E2E tests run, they produce machine-readable artifacts in `~/.gstack-dev/`:
@@ -198,6 +210,25 @@ bun run eval:compare # compare two runs — shows per-test deltas + Take
 bun run eval:summary # aggregate stats + per-test efficiency averages across runs
 ```
 
+**Detached runs for agents and long suites.** When an agent (or you, for a run
+you don't want to babysit) launches a long eval, use the `eval:bg*` scripts. They
+wrap the eval command in `bin/gstack-detach`: a fresh session that escapes a
+turn-boundary SIGTERM, a `caffeinate` wrapper that blocks idle-sleep, a machine-wide
+`gstack-evals` lock so concurrent worktrees serialize instead of saturating the
+model API, a run-scoped log under `~/.gstack-dev/eval-runs/`, a per-tier watchdog,
+and a guaranteed `### gstack-detach EXIT= ###` sentinel so a poller never
+mistakes silence for success.
+
+```bash
+bun run eval:bg           # detached test:evals (diff-based)
+bun run eval:bg:all       # detached test:evals:all
+bun run eval:bg:gate      # detached gate-tier suite
+bun run eval:bg:periodic  # detached periodic-tier suite
+```
+
+Each prints its log path. Humans running `bun run test:evals` foreground in their
+own terminal don't need this — Ctrl-C is intended there.
+
 **Eval comparison commentary:** `eval:compare` generates natural-language Takeaway sections interpreting what changed between runs — flagging regressions, noting improvements, calling out efficiency gains (fewer turns, faster, cheaper), and producing an overall summary. This is driven by `generateCommentary()` in `eval-store.ts`.
 
 Artifacts are never cleaned up — they accumulate in `~/.gstack-dev/` for post-mortem debugging and trend analysis.
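
Reviewer note: the "Hermetic by default" paragraph describes an allowlist scrub plus two overridden directories, and a small sketch may make that behavior easier to picture. This patch does not include `test/helpers/hermetic-env.ts`, so everything below is an illustration under assumptions. The names `buildHermeticEnv`, `hermeticEnabled`, `extraChildArgs`, and `CleanRoomDirs` are hypothetical, and so are the contents of the allowlist. Only `CLAUDE_CONFIG_DIR`, `GSTACK_HOME`, `EVALS_HERMETIC=0`, and `--strict-mcp-config` come from the patch text.

```typescript
// Hypothetical sketch of an allowlist-scrubbed child environment.
// Not the real test/helpers/hermetic-env.ts, which this patch does not show.
type Env = Record<string, string | undefined>;

interface CleanRoomDirs {
  claudeConfigDir: string; // fresh, seeded config dir handed to the child
  gstackHome: string;      // temp dir used instead of the operator's ~/.gstack
}

// Assumed allowlist for illustration only; the real list may differ.
const ASSUMED_ALLOWLIST: readonly string[] = ["PATH", "LANG", "TERM"];

// Copy only allowlisted variables, then pin the two clean-room directories.
// Anything not on the allowlist (for example CONDUCTOR_* variables) is dropped.
function buildHermeticEnv(
  parent: Env,
  dirs: CleanRoomDirs,
  allowlist: readonly string[] = ASSUMED_ALLOWLIST,
): Record<string, string> {
  const child: Record<string, string> = {};
  for (const name of allowlist) {
    const value = parent[name];
    if (value !== undefined) child[name] = value;
  }
  child["CLAUDE_CONFIG_DIR"] = dirs.claudeConfigDir;
  child["GSTACK_HOME"] = dirs.gstackHome;
  return child;
}

// Per the patch text, EVALS_HERMETIC=0 is the escape hatch; anything else keeps the clean room.
function hermeticEnabled(parent: Env): boolean {
  return parent["EVALS_HERMETIC"] !== "0";
}

// Per the patch text, --strict-mcp-config is passed only when the clean room is on.
function extraChildArgs(parent: Env): string[] {
  return hermeticEnabled(parent) ? ["--strict-mcp-config"] : [];
}
```

Building the child environment from an empty object, rather than deleting known-bad keys from a copy of the parent, means a newly introduced operator variable is excluded by default. That matches the "never leak into the child" wording above, though the actual helper may achieve it differently.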