mirror of
https://github.com/garrytan/gstack.git
synced 2026-06-17 23:30:09 +02:00
docs: wire detached-eval guidance into /ship + correct CLAUDE.md flags

- /ship eval step (sections/tests.md): long eval suites launch via gstack-detach (own session, machine lock, EXIT sentinel) so a turn boundary can't kill a 30+ min run mid-ship — the exact failure observed during this branch's ship.
- CLAUDE.md: correct the now-stale /tmp reference; document the --lock (serialize worktrees, no API saturation), --timeout watchdog, run-scoped log, and the guaranteed EXIT sentinel the poller breaks on.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

@@ -839,13 +839,20 @@ the run can also die to idle-sleep. `gstack-detach` fixes both: a fresh session
 (escapes the group SIGTERM) wrapped in `caffeinate -i` (blocks idle-sleep).

 - Use the `eval:bg*` scripts (`eval:bg`, `eval:bg:all`, `eval:bg:gate`,
-`eval:bg:periodic`) — they wrap the eval command in `gstack-detach` and stream to
-`/tmp/gstack-evals.log`. Or call `gstack-detach <log> -- <cmd>` directly for any
-long agent job. Export `ANTHROPIC_API_KEY` first (never pass keys in argv).
-- Then **poll the logfile** with a death-aware watcher: break on a `### DONE ###` /
-`EXIT=` marker OR on "runner process gone before DONE" (silence is not success —
-cover the crash/kill path, not just the happy path). The detached run survives
+`eval:bg:periodic`) — they wrap the eval command in `gstack-detach` with the
+machine-wide `gstack-evals` lock (concurrent worktrees serialize instead of
+saturating the shared model API), a per-tier watchdog, and a **run-scoped** log
+under `~/.gstack-dev/eval-runs/` (no shared-`/tmp` collision). Each prints its
+log path. Or call `gstack-detach [--lock NAME] [--timeout SECS] [--label LBL] --
+<cmd>` directly for any long agent job. Export `ANTHROPIC_API_KEY` first (never
+pass keys in argv).
+- Then **poll the printed logfile** with a death-aware watcher: break on the
+guaranteed `### gstack-detach EXIT=<code> ###` sentinel (success AND failure are
+both marked, so silence is never mistaken for success). The detached run survives
 even if your watcher gets reaped, so re-checking the log always works.
+- Why the lock: a shared dev box with several Conductor worktrees will rate-limit
+the model API if two eval suites run at once (15-way concurrency each), which
+mass-times-out E2E tests. The lock makes the second run WAIT, not collide.
 - Humans running `bun run test:evals` foreground in their own terminal don't need
 this — Ctrl-C is intended there. Detachment is for agent-launched runs only.

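The polling step described in this hunk can be sketched roughly as follows. This is a minimal illustration, not code from the repository: the function name `wait_for_exit`, its arguments, and the default interval and poll count are hypothetical. Only the sentinel text `### gstack-detach EXIT=<code> ###` comes from the diff, and the sketch assumes `<code>` is a decimal exit status.

```bash
# wait_for_exit LOG [INTERVAL_SECS] [MAX_POLLS]   (hypothetical helper)
# Polls LOG until a "### gstack-detach EXIT=<code> ###" line appears.
# Prints <code> and returns 0, or returns 2 if MAX_POLLS polls pass
# without a sentinel. Defaults (30s x 240 polls) are arbitrary.
wait_for_exit() {
  wfe_log="$1"; wfe_interval="${2:-30}"; wfe_max="${3:-240}"
  wfe_i=0
  while [ "$wfe_i" -lt "$wfe_max" ]; do
    # The log may not exist yet right after launch; treat that as "not done".
    if [ -f "$wfe_log" ]; then
      wfe_line="$(grep -E '### gstack-detach EXIT=[0-9]+ ###' "$wfe_log" | tail -n 1)"
      if [ -n "$wfe_line" ]; then
        wfe_line="${wfe_line#*EXIT=}"      # drop everything through "EXIT="
        printf '%s\n' "${wfe_line%% *}"    # keep the code, drop " ###"
        return 0
      fi
    fi
    sleep "$wfe_interval"
    wfe_i=$((wfe_i + 1))
  done
  return 2
}
```

Possible usage, with `$LOG` standing in for whatever path the launcher printed: `code="$(wait_for_exit "$LOG")" && echo "evals exited with $code"`. Because the poller only reads the log, it can be re-run at any time if an earlier watcher was reaped.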
@@ -332,6 +332,22 @@ EVAL_JUDGE_TIER=full EVAL_VERBOSE=1 bin/test-lane --eval test/evals/<suite>_eval

 If multiple suites need to run, run them sequentially (each needs a test lane). If the first suite fails, stop immediately — don't burn API cost on remaining suites.

+**Long eval suites (30+ min): launch detached so a turn boundary can't kill them.**
+A plain backgrounded eval lives in the harness's process group and dies to a
+SIGTERM ("polite quit") on a turn boundary, a stopped monitor, or an interruption
+(observed mid-`/ship`: `script terminated by signal SIGTERM`). Run it through
+`~/.claude/skills/gstack/bin/gstack-detach` instead — it survives in its own
+session, serializes against other worktrees via a machine lock (no API
+saturation), and writes a guaranteed `### gstack-detach EXIT=<code> ###` sentinel:
+
+```bash
+~/.claude/skills/gstack/bin/gstack-detach --label ship-evals --lock gstack-evals --timeout 5400 -- <project eval command>
+```
+
+Then poll the printed log path; break on the `EXIT=` sentinel (covers both pass
+and crash — silence is never success). The detached run survives even if your
+poller is reaped.
+
 **4. Check results:**

 - **If any eval fails:** Show the failures, the cost dashboard, and **STOP**. Do not proceed.
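The "run suites sequentially, stop at the first failure" rule in the context lines of this hunk could look something like the sketch below. `run_suites` is a hypothetical helper, and the per-suite command is passed in by the caller because the real eval invocation is truncated in the hunk header and is not reconstructed here.

```bash
# run_suites CMD SUITE...   (hypothetical helper)
# Runs CMD once per suite, in order, and stops at the first failure so the
# remaining suites never start (and never spend API budget).
run_suites() {
  rs_cmd="$1"; shift
  for rs_suite in "$@"; do
    echo "running suite: $rs_suite"
    if ! "$rs_cmd" "$rs_suite"; then
      echo "suite failed: $rs_suite (skipping the rest)" >&2
      return 1
    fi
  done
}
```

A caller would define its own one-suite function (for example one that wraps the project's eval command) and invoke `run_suites that_function suite_a suite_b`.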
@@ -76,6 +76,22 @@ EVAL_JUDGE_TIER=full EVAL_VERBOSE=1 bin/test-lane --eval test/evals/<suite>_eval

 If multiple suites need to run, run them sequentially (each needs a test lane). If the first suite fails, stop immediately — don't burn API cost on remaining suites.

+**Long eval suites (30+ min): launch detached so a turn boundary can't kill them.**
+A plain backgrounded eval lives in the harness's process group and dies to a
+SIGTERM ("polite quit") on a turn boundary, a stopped monitor, or an interruption
+(observed mid-`/ship`: `script terminated by signal SIGTERM`). Run it through
+`~/.claude/skills/gstack/bin/gstack-detach` instead — it survives in its own
+session, serializes against other worktrees via a machine lock (no API
+saturation), and writes a guaranteed `### gstack-detach EXIT=<code> ###` sentinel:
+
+```bash
+~/.claude/skills/gstack/bin/gstack-detach --label ship-evals --lock gstack-evals --timeout 5400 -- <project eval command>
+```
+
+Then poll the printed log path; break on the `EXIT=` sentinel (covers both pass
+and crash — silence is never success). The detached run survives even if your
+poller is reaped.
+
 **4. Check results:**

 - **If any eval fails:** Show the failures, the cost dashboard, and **STOP**. Do not proceed.
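To make the "success AND failure are both marked" idea concrete, here is a minimal sketch of a wrapper that always appends an exit sentinel to a log. It is an illustration only and is not taken from `gstack-detach`: `run_with_sentinel` is a hypothetical name, and the sketch leaves out the fresh session, `caffeinate`, lock, and watchdog that the diff describes. It also cannot write anything if the wrapper itself is killed outright.

```bash
# run_with_sentinel LOG CMD [ARGS...]   (hypothetical helper)
# Runs CMD with stdout and stderr appended to LOG, then appends the sentinel
# line whether CMD succeeded or failed. Returns CMD's exit status.
run_with_sentinel() {
  rws_log="$1"; shift
  "$@" >>"$rws_log" 2>&1
  rws_code=$?
  printf '### gstack-detach EXIT=%s ###\n' "$rws_code" >>"$rws_log"
  return "$rws_code"
}
```

Since the sentinel is written on both paths, a poller only has to look for one pattern and can read pass or fail from the code it carries.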