From a38089aac6a0a2a2a45a24581386619b15aebaf1 Mon Sep 17 00:00:00 2001
From: Garry Tan
Date: Fri, 12 Jun 2026 06:24:39 -0700
Subject: [PATCH] docs: wire detached-eval guidance into /ship + correct CLAUDE.md flags
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- /ship eval step (sections/tests.md): long eval suites launch via
  gstack-detach (own session, machine lock, EXIT sentinel) so a turn boundary
  can't kill a 30+ min run mid-ship — the exact failure observed during this
  branch's ship.
- CLAUDE.md: correct the now-stale /tmp reference; document the --lock
  (serialize worktrees, no API saturation), --timeout watchdog, run-scoped
  log, and the guaranteed EXIT sentinel the poller breaks on.

Co-Authored-By: Claude Fable 5
---
 CLAUDE.md                   | 19 +++++++++++++------
 ship/sections/tests.md      | 16 ++++++++++++++++
 ship/sections/tests.md.tmpl | 16 ++++++++++++++++
 3 files changed, 45 insertions(+), 6 deletions(-)

diff --git a/CLAUDE.md b/CLAUDE.md
index 830d7b189..4d385723c 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -839,13 +839,20 @@ the run can also die to idle-sleep. `gstack-detach` fixes both: a fresh
 session (escapes the group SIGTERM) wrapped in `caffeinate -i` (blocks idle-sleep).
 
 - Use the `eval:bg*` scripts (`eval:bg`, `eval:bg:all`, `eval:bg:gate`,
-  `eval:bg:periodic`) — they wrap the eval command in `gstack-detach` and stream to
-  `/tmp/gstack-evals.log`. Or call `gstack-detach -- ` directly for any
-  long agent job. Export `ANTHROPIC_API_KEY` first (never pass keys in argv).
-- Then **poll the logfile** with a death-aware watcher: break on a `### DONE ###` /
-  `EXIT=` marker OR on "runner process gone before DONE" (silence is not success —
-  cover the crash/kill path, not just the happy path). The detached run survives
+  `eval:bg:periodic`) — they wrap the eval command in `gstack-detach` with the
+  machine-wide `gstack-evals` lock (concurrent worktrees serialize instead of
+  saturating the shared model API), a per-tier watchdog, and a **run-scoped** log
+  under `~/.gstack-dev/eval-runs/` (no shared-`/tmp` collision). Each prints its
+  log path. Or call `gstack-detach [--lock NAME] [--timeout SECS] [--label LBL] --
+  ` directly for any long agent job. Export `ANTHROPIC_API_KEY` first (never
+  pass keys in argv).
+- Then **poll the printed logfile** with a death-aware watcher: break on the
+  guaranteed `### gstack-detach EXIT= ###` sentinel (success AND failure are
+  both marked, so silence is never mistaken for success). The detached run survives
   even if your watcher gets reaped, so re-checking the log always works.
+- Why the lock: a shared dev box with several Conductor worktrees will rate-limit
+  the model API if two eval suites run at once (15-way concurrency each), which
+  mass-times-out E2E tests. The lock makes the second run WAIT, not collide.
 - Humans running `bun run test:evals` foreground in their own terminal don't need
   this — Ctrl-C is intended there. Detachment is for agent-launched runs only.
 
diff --git a/ship/sections/tests.md b/ship/sections/tests.md
index 2ba17a96b..e9d4dd819 100644
--- a/ship/sections/tests.md
+++ b/ship/sections/tests.md
@@ -332,6 +332,22 @@ EVAL_JUDGE_TIER=full EVAL_VERBOSE=1 bin/test-lane --eval test/evals/_eval
 If multiple suites need to run, run them sequentially (each needs a test lane). If the
 first suite fails, stop immediately — don't burn API cost on remaining suites.
 
+**Long eval suites (30+ min): launch detached so a turn boundary can't kill them.**
+A plain backgrounded eval lives in the harness's process group and dies to a
+SIGTERM ("polite quit") on a turn boundary, a stopped monitor, or an interruption
+(observed mid-`/ship`: `script terminated by signal SIGTERM`). Run it through
+`~/.claude/skills/gstack/bin/gstack-detach` instead — it survives in its own
+session, serializes against other worktrees via a machine lock (no API
+saturation), and writes a guaranteed `### gstack-detach EXIT= ###` sentinel:
+
+```bash
+~/.claude/skills/gstack/bin/gstack-detach --label ship-evals --lock gstack-evals --timeout 5400 --
+```
+
+Then poll the printed log path; break on the `EXIT=` sentinel (covers both pass
+and crash — silence is never success). The detached run survives even if your
+poller is reaped.
+
 **4. Check results:**
 
 - **If any eval fails:** Show the failures, the cost dashboard, and **STOP**. Do not proceed.
diff --git a/ship/sections/tests.md.tmpl b/ship/sections/tests.md.tmpl
index c9ba9ed6f..e6d53d495 100644
--- a/ship/sections/tests.md.tmpl
+++ b/ship/sections/tests.md.tmpl
@@ -76,6 +76,22 @@ EVAL_JUDGE_TIER=full EVAL_VERBOSE=1 bin/test-lane --eval test/evals/_eval
 If multiple suites need to run, run them sequentially (each needs a test lane). If the
 first suite fails, stop immediately — don't burn API cost on remaining suites.
 
+**Long eval suites (30+ min): launch detached so a turn boundary can't kill them.**
+A plain backgrounded eval lives in the harness's process group and dies to a
+SIGTERM ("polite quit") on a turn boundary, a stopped monitor, or an interruption
+(observed mid-`/ship`: `script terminated by signal SIGTERM`). Run it through
+`~/.claude/skills/gstack/bin/gstack-detach` instead — it survives in its own
+session, serializes against other worktrees via a machine lock (no API
+saturation), and writes a guaranteed `### gstack-detach EXIT= ###` sentinel:
+
+```bash
+~/.claude/skills/gstack/bin/gstack-detach --label ship-evals --lock gstack-evals --timeout 5400 --
+```
+
+Then poll the printed log path; break on the `EXIT=` sentinel (covers both pass
+and crash — silence is never success). The detached run survives even if your
+poller is reaped.
+
 **4. Check results:**
 
 - **If any eval fails:** Show the failures, the cost dashboard, and **STOP**. Do not proceed.
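
Review note: the "poll the printed log path; break on the `EXIT=` sentinel" step could look roughly like the sketch below. It is an illustration under stated assumptions, not code from this patch. The function names `sentinel_value` and `wait_for_sentinel` are hypothetical, and so are the 30-second poll interval and the give-up bound (defaulted to the 5400 seconds used for `--timeout` above). The only detail taken from the patch is the `### gstack-detach EXIT=` prefix. The sketch assumes some value follows `EXIT=` on that line and simply echoes it without interpreting it.

```bash
#!/usr/bin/env bash
# Sketch only: helper names, interval, and give-up bound are assumptions.
# The sentinel prefix '### gstack-detach EXIT=' is the one detail from the patch.

# Print whatever follows "EXIT=" on the last sentinel line of the log.
# Returns 1 if the log is missing or holds no sentinel yet.
sentinel_value() {
  local log="$1" line
  line="$(grep -F '### gstack-detach EXIT=' "$log" 2>/dev/null | tail -n 1)"
  [ -n "$line" ] || return 1
  line="${line#*EXIT=}"                   # drop everything through "EXIT="
  line="${line%%#*}"                      # drop the trailing "###"
  printf '%s\n' "${line//[[:space:]]/}"   # strip remaining whitespace
}

# Poll the log until the sentinel appears or max_wait seconds have elapsed.
# Prints the sentinel value and returns 0 on success; returns 2 on give-up.
wait_for_sentinel() {
  local log="$1" max_wait="${2:-5400}" interval="${3:-30}" waited=0 value
  while [ "$waited" -lt "$max_wait" ]; do
    if value="$(sentinel_value "$log")"; then
      printf '%s\n' "$value"
      return 0
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  return 2
}
```

Because the poller only reads the log, it can be killed and restarted at any point without touching the detached run, which matches the patch's note that re-checking the log always works. A caller would pass the log path that the launch command printed, for example `wait_for_sentinel "$LOG_PATH"`, where `LOG_PATH` is a placeholder for that printed path.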