mirror of
https://github.com/garrytan/gstack.git
synced 2026-06-17 07:10:12 +02:00
docs: document hermetic-by-default E2E + eval:bg detached runs in CONTRIBUTING
The Testing & evals section now tells contributors that local E2E runners spawn children through a sealed clean room (allowlist-scrubbed env, seeded CLAUDE_CONFIG_DIR, temp GSTACK_HOME, --strict-mcp-config) so local signal matches CI, with EVALS_HERMETIC=0 as the escape hatch.

The eval-tools list gains the eval:bg* detached-run scripts (gstack-detach: SIGTERM-proof, caffeinate-wrapped, machine-locked, run-scoped logs, EXIT= sentinel).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@@ -176,6 +176,18 @@ EVALS=1 bun test test/skill-e2e-*.test.ts
- Saves full NDJSON transcripts and failure JSON for debugging
- Tests live in `test/skill-e2e-*.test.ts` (split by category), runner logic in `test/helpers/session-runner.ts`

**Hermetic by default.** Every E2E runner (`claude -p`, the real-PTY plan-mode
runner, the Agent SDK runner, plus the codex and gemini runners) spawns its child
through `test/helpers/hermetic-env.ts`: an allowlist-scrubbed environment, a fresh
seeded `CLAUDE_CONFIG_DIR`, a temp `GSTACK_HOME`, and `--strict-mcp-config`. Your
operator `~/.claude` config, MCP servers (gbrain, Conductor), skills, `~/.gstack`
decision logs, and `CONDUCTOR_*` environment variables never leak into the child,
so local eval signal matches CI instead of disagreeing for reasons unrelated to
the code under test. Set `EVALS_HERMETIC=0` to debug against your real operator
state (this also drops `--strict-mcp-config`). The wiring is pinned by
`test/hermetic-wiring.test.ts` (a free static tripwire) and two gate-tier
isolation canaries in `test/skill-e2e-hermetic-canary.test.ts`.

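To make the allowlist idea concrete, here is a minimal sketch of how a scrubbed child environment could be assembled. It is not the contents of `test/helpers/hermetic-env.ts`: the function name `buildHermeticEnv`, the `ENV_ALLOWLIST` entries, and the temp directory layout are all assumptions for illustration. Only `CLAUDE_CONFIG_DIR`, `GSTACK_HOME`, `EVALS_HERMETIC=0`, and `--strict-mcp-config` come from the text above.

```typescript
import { mkdirSync, mkdtempSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// ASSUMPTION: an illustrative allowlist, not the project's real one.
const ENV_ALLOWLIST = ["PATH", "LANG", "TERM", "ANTHROPIC_API_KEY"];

interface HermeticEnv {
  env: Record<string, string>; // environment handed to the child process
  extraArgs: string[];         // extra CLI flags for the child
}

// Hypothetical helper: builds the child environment from the parent's.
function buildHermeticEnv(parent: Record<string, string | undefined>): HermeticEnv {
  const env: Record<string, string> = {};

  // Escape hatch described above: EVALS_HERMETIC=0 passes operator state
  // through untouched and drops --strict-mcp-config.
  if (parent.EVALS_HERMETIC === "0") {
    for (const [key, value] of Object.entries(parent)) {
      if (value !== undefined) env[key] = value;
    }
    return { env, extraArgs: [] };
  }

  // Copy only allowlisted variables; everything else (for example
  // CONDUCTOR_* variables) is dropped rather than selectively deleted.
  for (const key of ENV_ALLOWLIST) {
    const value = parent[key];
    if (value !== undefined) env[key] = value;
  }

  // Fresh per-run directories so the operator's ~/.claude and ~/.gstack
  // are never read. A real helper would also seed the config directory;
  // this sketch only creates empty ones.
  const root = mkdtempSync(join(tmpdir(), "hermetic-"));
  env.CLAUDE_CONFIG_DIR = join(root, "claude-config");
  env.GSTACK_HOME = join(root, "gstack-home");
  mkdirSync(env.CLAUDE_CONFIG_DIR, { recursive: true });
  mkdirSync(env.GSTACK_HOME, { recursive: true });

  return { env, extraArgs: ["--strict-mcp-config"] };
}
```

Building the environment up from an allowlist, rather than deleting known-bad variables from a copy, means a newly introduced operator variable is excluded by default.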

### E2E observability

When E2E tests run, they produce machine-readable artifacts in `~/.gstack-dev/`:
@@ -198,6 +210,25 @@ bun run eval:compare # compare two runs — shows per-test deltas + Take
bun run eval:summary # aggregate stats + per-test efficiency averages across runs
```

**Detached runs for agents and long suites.** When an agent (or you, for a run
you don't want to babysit) launches a long eval, use the `eval:bg*` scripts. They
wrap the eval command in `bin/gstack-detach`: a fresh session that escapes a
turn-boundary SIGTERM, a `caffeinate` wrapper that blocks idle-sleep, a machine-wide
`gstack-evals` lock so concurrent worktrees serialize instead of saturating the
model API, a run-scoped log under `~/.gstack-dev/eval-runs/`, a per-tier watchdog,
and a guaranteed `### gstack-detach EXIT=<code> ###` sentinel so a poller never
mistakes silence for success.

```bash
bun run eval:bg            # detached test:evals (diff-based)
bun run eval:bg:all        # detached test:evals:all
bun run eval:bg:gate       # detached gate-tier suite
bun run eval:bg:periodic   # detached periodic-tier suite
```

Each prints its log path. Humans running `bun run test:evals` in the foreground
in their own terminal don't need this; Ctrl-C is intended there.

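A poller that reads a detached run's log can treat the exit sentinel as the only evidence of completion. The sketch below shows one way to do that; the helper name `parseRunStatus` and the `RunStatus` type are hypothetical, and it assumes the sentinel appears on its own line with a numeric exit code, which the text above implies but does not spell out.

```typescript
// Sentinel format taken from the text: ### gstack-detach EXIT=<code> ###
// ASSUMPTION: <code> is an integer and the sentinel sits on its own line.
const EXIT_SENTINEL = /^### gstack-detach EXIT=(-?\d+) ###\s*$/m;

type RunStatus =
  | { state: "running" }
  | { state: "finished"; exitCode: number; ok: boolean };

// Hypothetical helper: classify a run from the text of its log so far.
function parseRunStatus(logText: string): RunStatus {
  const match = EXIT_SENTINEL.exec(logText);
  // No sentinel yet: the run is still going (or died before writing one).
  // Either way it must not be reported as a success.
  if (!match) return { state: "running" };
  const exitCode = Number(match[1]);
  return { state: "finished", exitCode, ok: exitCode === 0 };
}
```

A caller would re-read the log file printed by the `eval:bg*` script on an interval and stop once `state` is `"finished"`, reporting failure for any nonzero `exitCode`.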
**Eval comparison commentary:** `eval:compare` generates natural-language Takeaway sections interpreting what changed between runs: flagging regressions, noting improvements, calling out efficiency gains (fewer turns, faster, cheaper), and producing an overall summary. This is driven by `generateCommentary()` in `eval-store.ts`.

Artifacts are never cleaned up — they accumulate in `~/.gstack-dev/` for post-mortem debugging and trend analysis.