From b608b060b93291f5bffd36776940b50e74e60845 Mon Sep 17 00:00:00 2001 From: Garry Tan Date: Thu, 11 Jun 2026 23:07:11 -0700 Subject: [PATCH] =?UTF-8?q?docs:=20CLAUDE.md=20=E2=80=94=20agents=20must?= =?UTF-8?q?=20run=20long=20evals=20via=20gstack-detach?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Codifies the detached-execution default: agent-launched eval/benchmark runs go through bin/gstack-detach (or the eval:bg* scripts) so a harness SIGTERM or macOS idle-sleep can't kill a 30-60 min run, then poll the log with a death-aware watcher. Humans keep foreground scripts. Co-Authored-By: Claude Fable 5 --- CLAUDE.md | 21 +++++++++++++++++++++ 1 file changed, 21 insertions(+) diff --git a/CLAUDE.md b/CLAUDE.md index 03384ae79..830d7b189 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -828,6 +828,27 @@ them. Report progress at each check (which tests passed, which are running, any failures so far). The user wants to see the run complete, not a promise that you'll check later. +## Running evals as an agent: always detach (SIGTERM-proof) + +When **you (an agent/harness)** launch a long eval/benchmark run, run it through +`bin/gstack-detach` — NEVER as a plain backgrounded Bash task. A plain background +task lives in the harness's process group, so a SIGTERM ("polite quit") on a turn +boundary, a stopped Monitor, or an interruption kills the run mid-flight (observed: +`script "test:gate" was terminated by signal SIGTERM` ~40 min into a run). On macOS +the run can also die to idle-sleep. `gstack-detach` fixes both: a fresh session +(escapes the group SIGTERM) wrapped in `caffeinate -i` (blocks idle-sleep). + +- Use the `eval:bg*` scripts (`eval:bg`, `eval:bg:all`, `eval:bg:gate`, + `eval:bg:periodic`) — they wrap the eval command in `gstack-detach` and stream to + `/tmp/gstack-evals.log`. Or call `gstack-detach -- ` directly for any + long agent job. Export `ANTHROPIC_API_KEY` first (never pass keys in argv). +- Then **poll the logfile** with a death-aware watcher: break on a `### DONE ###` / + `EXIT=` marker OR on "runner process gone before DONE" (silence is not success — + cover the crash/kill path, not just the happy path). The detached run survives + even if your watcher gets reaped, so re-checking the log always works. +- Humans running `bun run test:evals` foreground in their own terminal don't need + this — Ctrl-C is intended there. Detachment is for agent-launched runs only. + ## E2E test fixtures: extract, don't copy **NEVER copy a full SKILL.md file into an E2E test fixture.** SKILL.md files are