Contributing to gstack
Thanks for wanting to make gstack better. Whether you're fixing a typo in a skill prompt or building an entirely new workflow, this guide will get you up and running fast.
Quick start
gstack skills are Markdown files that Claude Code discovers from a skills/ directory. Normally they live at ~/.claude/skills/gstack/ (your global install). But when you're developing gstack itself, you want Claude Code to use the skills in your working tree — so edits take effect instantly without copying or deploying anything.
That's what dev mode does. It symlinks your repo into the local .claude/skills/ directory so Claude Code reads skills straight from your checkout.
git clone https://github.com/garrytan/gstack.git && cd gstack
bun install # install dependencies
bin/dev-setup # activate dev mode
Full clone vs shallow. The README's user-facing install uses --depth 1 for speed. As a contributor, use a full clone (no --depth flag): you'll need history for git log, git blame, git bisect, and reviewing PRs against earlier versions. If you already have a --depth 1 clone from following the README, promote it to a full clone with git fetch --unshallow.
Now edit any SKILL.md, invoke it in Claude Code (e.g. /review), and see your changes live. When you're done developing:
bin/dev-teardown # deactivate — back to your global install
Operational self-improvement
gstack automatically learns from failures. At the end of every skill session, the agent
reflects on what went wrong (CLI errors, wrong approaches, project quirks) and logs
operational learnings to ~/.gstack/projects/{slug}/learnings.jsonl. Future sessions
surface these learnings automatically, so gstack gets smarter on your codebase over time.
No setup needed. Learnings are logged automatically. View them with /learn.
The contributor workflow
- Use gstack normally; operational learnings are captured automatically
- Check your learnings: /learn or ls ~/.gstack/projects/*/learnings.jsonl
- Fork and clone gstack (if you haven't already)
- Symlink your fork into the project where you hit the bug:
  # In your core project (the one where gstack annoyed you)
  ln -sfn /path/to/your/gstack-fork .claude/skills/gstack
  cd .claude/skills/gstack && bun install && bun run build && ./setup
  Setup creates per-skill directories with SKILL.md symlinks inside (qa/SKILL.md -> gstack/qa/SKILL.md) and asks your prefix preference. Pass --no-prefix to skip the prompt and use short names.
- Fix the issue: your changes are live immediately in this project
- Test by actually using gstack: do the thing that annoyed you, verify it's fixed
- Open a PR from your fork
This is the best way to contribute: fix gstack while doing your real work, in the project where you actually felt the pain.
Session awareness
When you have 3+ gstack sessions open simultaneously, every question tells you which project, which branch, and what's happening. No more staring at a question thinking "wait, which window is this?" The format is consistent across all skills.
Working on gstack inside the gstack repo
When you're editing gstack skills and want to test them by actually using gstack
in the same repo, bin/dev-setup wires this up. It creates .claude/skills/
symlinks (gitignored) pointing back to your working tree, so Claude Code uses
your local edits instead of the global install.
gstack/ <- your working tree
├── .claude/skills/ <- created by dev-setup (gitignored)
│ ├── gstack -> ../../ <- symlink back to repo root
│ ├── review/ <- real directory (short name, default)
│ │ └── SKILL.md -> gstack/review/SKILL.md
│ ├── ship/ <- or gstack-review/, gstack-ship/ if --prefix
│ │ └── SKILL.md -> gstack/ship/SKILL.md
│ └── ... <- one directory per skill
├── review/
│ └── SKILL.md <- edit this, test with /review
├── ship/
│ └── SKILL.md
├── browse/
│ ├── src/ <- TypeScript source
│ └── dist/ <- compiled binary (gitignored)
└── ...
Setup creates real directories (not symlinks) at the top level with a SKILL.md
symlink inside. This ensures Claude discovers them as top-level skills, not nested
under gstack/. Names depend on your prefix setting (~/.gstack/config.yaml).
Short names (/review, /ship) are the default. Run ./setup --prefix if you
prefer namespaced names (/gstack-review, /gstack-ship).
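The naming rule above can be written down as a pair of small pure functions. This is only a sketch: skillDirName and planLink are hypothetical names, not functions from bin/dev-setup or ./setup, and the link target simply mirrors the tree shown earlier.

```typescript
// Name of the top-level skill directory for a given prefix setting.
// Short names are the default; --prefix selects the namespaced form.
function skillDirName(skill: string, prefix: boolean): string {
  return prefix ? `gstack-${skill}` : skill;
}

// One planned symlink, mirroring the layout in the tree above:
//   .claude/skills/<dir>/SKILL.md -> gstack/<skill>/SKILL.md
function planLink(skill: string, prefix: boolean): { link: string; target: string } {
  return {
    link: `.claude/skills/${skillDirName(skill, prefix)}/SKILL.md`,
    target: `gstack/${skill}/SKILL.md`,
  };
}

const shortName = planLink("review", false);
// shortName.link === ".claude/skills/review/SKILL.md"
const namespaced = planLink("ship", true);
// namespaced.link === ".claude/skills/gstack-ship/SKILL.md"
```

In both cases the target stays the same; only the directory that Claude discovers changes with the prefix setting.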
Day-to-day workflow
# 1. Enter dev mode
bin/dev-setup
# 2. Edit a skill
vim review/SKILL.md
# 3. Test it in Claude Code — changes are live
# > /review
# 4. Editing browse source? Rebuild the binary
bun run build
# 5. Done for the day? Tear down
bin/dev-teardown
Brain-aware blocks in a dev workspace (gbrain installed)
If gbrain is installed and usable (bin/gstack-gbrain-detect --is-ok exits 0),
bin/dev-setup keeps your tracked SKILL.md files canonical and renders the
brain-aware variant (the GBRAIN_CONTEXT_LOAD / GBRAIN_SAVE_RESULTS blocks)
into .claude/gstack-rendered/ (gitignored, per-workspace). It then repoints the
workspace's SKILL.md symlinks at that render, so your Claude sessions get the
full gbrain experience while git status stays clean. Under the hood, dev-setup
passes GSTACK_SKIP_GBRAIN_REGEN=1 inline to the nested ./setup (so it never
dirties tracked source) and runs gen:skill-docs:user --out-dir .claude/gstack-rendered,
which rewrites only the section-base paths to point at the render. bin/dev-teardown
removes the render. To make the blocks live across your other projects' Claude
sessions, run gstack-config gbrain-refresh, which renders them into the global
install (~/.claude/skills/gstack), guarded so it never touches a symlinked or
non-gstack directory.
Testing & evals
Setup
# 1. Copy .env.example and add your API key
cp .env.example .env
# Edit .env → set ANTHROPIC_API_KEY=sk-ant-...
# 2. Install deps (if you haven't already)
bun install
Bun auto-loads .env — no extra config. Conductor workspaces inherit .env from the main worktree automatically (see "Conductor workspaces" below).
Test tiers
| Tier | Command | Cost | What it tests |
|---|---|---|---|
| 1: Static | bun test | Free | Command validation, snapshot flags, SKILL.md correctness, TODOS-format.md refs, observability unit tests |
| 2: E2E | bun run test:e2e | ~$3.85 | Full skill execution via claude -p subprocess |
| 3: LLM eval | bun run test:evals | ~$0.15 standalone | LLM-as-judge scoring of generated SKILL.md docs |
| 2+3 | bun run test:evals | ~$4 combined | E2E + LLM-as-judge (runs both) |
bun test # Tier 1 only (runs on every commit, <5s)
bun run test:e2e # Tier 2: E2E only (needs EVALS=1, can't run inside Claude Code)
bun run test:evals # Tier 2 + 3 combined (~$4/run)
Tier 1: Static validation (free)
Runs automatically with bun test. No API keys needed.
- Skill parser tests (test/skill-parser.test.ts): Extracts every $B command from SKILL.md bash code blocks and validates against the command registry in browse/src/commands.ts. Catches typos, removed commands, and invalid snapshot flags.
- Skill validation tests (test/skill-validation.test.ts): Validates that SKILL.md files reference only real commands and flags, and that command descriptions meet quality thresholds.
- Generator tests (test/gen-skill-docs.test.ts): Tests the template system: verifies placeholders resolve correctly, output includes value hints for flags (e.g. -d <N> not just -d), enriched descriptions for key commands (e.g. is lists valid states, press lists key examples).
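To make the skill parser idea concrete, here is a minimal sketch of pulling $B commands out of bash code blocks and checking them against a registry. It is not the code in test/skill-parser.test.ts: extractBrowseCommands, KNOWN_COMMANDS, and the three registry entries are assumptions for illustration, and the real registry lives in browse/src/commands.ts.

```typescript
// Built at runtime so this sample nests cleanly inside a Markdown document.
const FENCE = "`".repeat(3);

// Hypothetical registry standing in for browse/src/commands.ts.
const KNOWN_COMMANDS = new Set(["goto", "snapshot", "click"]);

// Collect the first word after "$B" on every line inside a bash fenced block.
function extractBrowseCommands(markdown: string): string[] {
  const commands: string[] = [];
  let inBash = false;
  for (const line of markdown.split("\n")) {
    const trimmed = line.trim();
    if (trimmed.startsWith(FENCE)) {
      // A fence closes an open bash block, or opens one only if tagged "bash".
      inBash = !inBash && trimmed === FENCE + "bash";
      continue;
    }
    if (!inBash) continue;
    const match = /\$B\s+([a-z][\w-]*)/.exec(trimmed);
    if (match) commands.push(match[1] ?? "");
  }
  return commands;
}

// Commands used in the doc that the registry does not know about.
function unknownCommands(markdown: string): string[] {
  return extractBrowseCommands(markdown).filter((c) => !KNOWN_COMMANDS.has(c));
}

const sample = [
  "Prose mentioning $B goto is ignored.",
  FENCE + "bash",
  "$B goto https://example.com",
  "$B snapshto -i",
  FENCE,
].join("\n");
const typos = unknownCommands(sample);
// typos is ["snapshto"]: the misspelled command is the only one flagged.
```

Restricting the scan to bash blocks keeps prose mentions of a command from producing false positives.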
Tier 2: E2E via claude -p (~$3.85/run)
Spawns claude -p as a subprocess with --output-format stream-json --verbose, streams NDJSON for real-time progress, and scans for browse errors. This is the closest thing to "does this skill actually work end-to-end?"
# Must run from a plain terminal — can't nest inside Claude Code or Conductor
EVALS=1 bun test test/skill-e2e-*.test.ts
- Gated by EVALS=1 env var (prevents accidental expensive runs)
- Auto-skips if running inside Claude Code (claude -p can't nest)
- API connectivity pre-check: fails fast on ConnectionRefused before burning budget
- Real-time progress to stderr: [Ns] turn T tool #C: Name(...)
- Saves full NDJSON transcripts and failure JSON for debugging
- Tests live in test/skill-e2e-*.test.ts (split by category), runner logic in test/helpers/session-runner.ts
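Streaming NDJSON means a chunk from the subprocess can end in the middle of a line, so the reader has to hold back the unfinished tail. The sketch below shows that buffering technique in isolation. NdjsonBuffer is a hypothetical class, not the implementation in test/helpers/session-runner.ts, and the event shape in the usage lines is made up.

```typescript
// Accumulates stream chunks and returns one parsed value per complete line.
class NdjsonBuffer {
  private pending = "";

  push(chunk: string): unknown[] {
    this.pending += chunk;
    const lines = this.pending.split("\n");
    // The last element is an unfinished line (or "" if the chunk ended on "\n").
    this.pending = lines.pop() ?? "";
    const events: unknown[] = [];
    for (const line of lines) {
      if (line.trim() === "") continue; // skip blank lines between records
      events.push(JSON.parse(line)); // throws on a malformed record
    }
    return events;
  }
}

const buffer = new NdjsonBuffer();
const first = buffer.push('{"type":"a"}\n{"ty'); // one complete record, one partial
const second = buffer.push('pe":"b"}\n'); // the partial record is now complete
// first.length === 1 and second.length === 1
```

A real runner would likely wrap JSON.parse in a try/catch and record the bad line rather than abort the whole run.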
Hermetic by default. Every E2E runner (claude -p, the real-PTY plan-mode
runner, the Agent SDK runner, plus the codex and gemini runners) spawns its child
through test/helpers/hermetic-env.ts: an allowlist-scrubbed environment, a fresh
seeded CLAUDE_CONFIG_DIR, a temp GSTACK_HOME, and --strict-mcp-config. Your
operator ~/.claude config, MCP servers (gbrain, Conductor), skills, ~/.gstack
decision logs, and CONDUCTOR_* env never leak into the child, so local eval
signal matches CI instead of disagreeing for reasons unrelated to the code under
test. Set EVALS_HERMETIC=0 to debug against your real operator state (this also
drops --strict-mcp-config). The wiring is pinned by test/hermetic-wiring.test.ts
(a free static tripwire) and two gate-tier isolation canaries in
test/skill-e2e-hermetic-canary.test.ts.
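The core of an allowlist scrub is that the child env is built from nothing rather than copied and pruned. The following is a minimal sketch under stated assumptions: scrubEnv and the ALLOW list are hypothetical, the choice to apply overrides last is an assumption, and the real behavior is defined by test/helpers/hermetic-env.ts.

```typescript
type Env = Record<string, string | undefined>;

// Hypothetical allowlist; the real one lives in test/helpers/hermetic-env.ts.
const ALLOW: readonly string[] = ["PATH", "HOME", "LANG", "ANTHROPIC_API_KEY"];

// Build the child env from scratch: copy only allowlisted keys, then apply
// explicit overrides last so redirected directories always win.
function scrubEnv(parent: Env, overrides: Record<string, string> = {}): Record<string, string> {
  const child: Record<string, string> = {};
  for (const key of ALLOW) {
    const value = parent[key];
    if (value !== undefined) child[key] = value;
  }
  return { ...child, ...overrides };
}

const childEnv = scrubEnv(
  { PATH: "/usr/bin", CONDUCTOR_PORT: "5555", GSTACK_HOME: "/home/me/.gstack" },
  { GSTACK_HOME: "/tmp/run-1/gstack", CLAUDE_CONFIG_DIR: "/tmp/run-1/.claude" },
);
// Object.keys(childEnv).sort() is ["CLAUDE_CONFIG_DIR", "GSTACK_HOME", "PATH"]
```

Because nothing is inherited by default, a new operator variable can't leak into the child until someone adds it to the allowlist on purpose.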
E2E observability
When E2E tests run, they produce machine-readable artifacts in ~/.gstack-dev/:
| Artifact | Path | Purpose |
|---|---|---|
| Heartbeat | e2e-live.json | Current test status (updated per tool call) |
| Partial results | evals/_partial-e2e.json | Completed tests (survives kills) |
| Progress log | e2e-runs/{runId}/progress.log | Append-only text log |
| NDJSON transcripts | e2e-runs/{runId}/{test}.ndjson | Raw claude -p output per test |
| Failure JSON | e2e-runs/{runId}/{test}-failure.json | Diagnostic data on failure |
Live dashboard: Run bun run eval:watch in a second terminal to see a live dashboard showing completed tests, the currently running test, and cost. Use --tail to also show the last 10 lines of progress.log.
Eval history tools:
bun run eval:list # list all eval runs (turns, duration, cost per run)
bun run eval:compare # compare two runs — shows per-test deltas + Takeaway commentary
bun run eval:summary # aggregate stats + per-test efficiency averages across runs
Detached runs for agents and long suites. When an agent (or you, for a run
you don't want to babysit) launches a long eval, use the eval:bg* scripts. They
wrap the eval command in bin/gstack-detach: a fresh session that escapes a
turn-boundary SIGTERM, a caffeinate wrapper that blocks idle-sleep, a machine-wide
gstack-evals lock so concurrent worktrees serialize instead of saturating the
model API, a run-scoped log under ~/.gstack-dev/eval-runs/, a per-tier watchdog,
and a guaranteed ### gstack-detach EXIT=<code> ### sentinel so a poller never
mistakes silence for success.
bun run eval:bg # detached test:evals (diff-based)
bun run eval:bg:all # detached test:evals:all
bun run eval:bg:gate # detached gate-tier suite
bun run eval:bg:periodic # detached periodic-tier suite
Each prints its log path. Humans running bun run test:evals foreground in their
own terminal don't need this — Ctrl-C is intended there.
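A poller that follows the sentinel contract only needs to answer one question: has the EXIT line appeared yet? Here is a minimal sketch of that check. readRunState is a hypothetical helper, not part of bin/gstack-detach, and it assumes <code> is either a numeric exit status or a single word, with "0" meaning success.

```typescript
type RunState =
  | { done: false }
  | { done: true; code: string; ok: boolean };

// Matches the sentinel on a line of its own anywhere in the log.
const SENTINEL = /^### gstack-detach EXIT=(\S+) ###$/m;

// Classify a log snapshot: no sentinel means "not finished", never "success".
function readRunState(logText: string): RunState {
  const match = SENTINEL.exec(logText);
  if (match === null) return { done: false };
  const code = match[1] ?? "";
  return { done: true, code, ok: code === "0" };
}

const running = readRunState("12 pass\n3 pending\n");
// running.done === false
const finished = readRunState("15 pass\n### gstack-detach EXIT=0 ###\n");
// finished is { done: true, code: "0", ok: true }
```

A caller would re-read the log file on an interval and stop as soon as done is true, treating any code other than "0" as a failed run.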
Eval comparison commentary: eval:compare generates natural-language Takeaway sections interpreting what changed between runs — flagging regressions, noting improvements, calling out efficiency gains (fewer turns, faster, cheaper), and producing an overall summary. This is driven by generateCommentary() in eval-store.ts.
Artifacts are never cleaned up — they accumulate in ~/.gstack-dev/ for post-mortem debugging and trend analysis.
Tier 3: LLM-as-judge (~$0.15/run)
Uses Claude Sonnet to score generated SKILL.md docs on three dimensions:
- Clarity — Can an AI agent understand the instructions without ambiguity?
- Completeness — Are all commands, flags, and usage patterns documented?
- Actionability — Can the agent execute tasks using only the information in the doc?
Each dimension is scored 1-5. Threshold: every dimension must score ≥ 4. There's also a regression test that compares generated docs against the hand-maintained baseline from origin/main — generated must score equal or higher.
# Needs ANTHROPIC_API_KEY in .env — included in bun run test:evals
- Uses claude-sonnet-4-6 for scoring stability
- Tests live in test/skill-llm-eval.test.ts
- Calls the Anthropic API directly (not claude -p), so it works from anywhere including inside Claude Code
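The pass rule for the judge is per dimension, so a strong score in one area can't offset a weak one. The sketch below illustrates that rule and the regression comparison. The type and function names are hypothetical, not taken from test/skill-llm-eval.test.ts, and comparing the baseline dimension by dimension (rather than by total) is an assumption.

```typescript
interface JudgeScores {
  clarity: number;
  completeness: number;
  actionability: number;
}

const THRESHOLD = 4;
const DIMENSIONS = ["clarity", "completeness", "actionability"] as const;

// Every dimension must reach the threshold; a high average does not compensate.
function failingDimensions(scores: JudgeScores): string[] {
  return DIMENSIONS.filter((d) => scores[d] < THRESHOLD);
}

// Regression check: generated docs must score equal or higher than the baseline.
function regressedDimensions(generated: JudgeScores, baseline: JudgeScores): string[] {
  return DIMENSIONS.filter((d) => generated[d] < baseline[d]);
}

const failing = failingDimensions({ clarity: 5, completeness: 3, actionability: 5 });
// failing is ["completeness"] even though the average is above 4.
```

An empty array from both functions corresponds to a passing run.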
CI
A GitHub Action (.github/workflows/skill-docs.yml) runs bun run gen:skill-docs --dry-run on every push and PR. If the generated SKILL.md files differ from what's committed, CI fails. This catches stale docs before they merge.
Tests run against the browse binary directly — they don't require dev mode.
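The freshness check described above boils down to comparing what the generator would write with what is committed. A minimal sketch of that comparison follows. staleFiles and FileMap are hypothetical names, and this is not how gen:skill-docs --dry-run is implemented.

```typescript
// Maps a file path to its full text content.
type FileMap = Record<string, string>;

// Return paths whose freshly generated content differs from the committed copy,
// including files that exist on only one side.
function staleFiles(generated: FileMap, committed: FileMap): string[] {
  const paths = new Set([...Object.keys(generated), ...Object.keys(committed)]);
  return Array.from(paths)
    .filter((p) => generated[p] !== committed[p])
    .sort();
}

const stale = staleFiles(
  { "review/SKILL.md": "v2", "ship/SKILL.md": "same" },
  { "review/SKILL.md": "v1", "ship/SKILL.md": "same" },
);
// stale is ["review/SKILL.md"]; a non-empty result would fail the check.
```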
Editing SKILL.md files
SKILL.md files are generated from .tmpl templates. Don't edit the .md directly — your changes will be overwritten on the next build.
# 1. Edit the template
vim SKILL.md.tmpl # or browse/SKILL.md.tmpl
# 2. Regenerate for all hosts
bun run gen:skill-docs --host all
# 3. Check health (reports all hosts)
bun run skill:check
# Or use watch mode — auto-regenerates on save
bun run dev:skill
For template authoring best practices (natural language over bash-isms, dynamic branch detection, {{BASE_BRANCH_DETECT}} usage), see CLAUDE.md's "Writing SKILL templates" section.
To add a browse command, add it to browse/src/commands.ts. To add a snapshot flag, add it to SNAPSHOT_FLAGS in browse/src/snapshot.ts. Then rebuild.
Don't bundle puppeteer/Chromium in a skill. browse is the one shared
Chromium per box, including offline local-render workloads. A skill that needs to
rasterize its own HTML/JSON (diagrams, cards, og-images) should route through
browse — screenshot --selector for visual output, load-html + js --out for
bytes a render function returns — instead of npm i puppeteer and downloading a
second Chromium that drifts out of version sync. One install to pin, one daemon to
manage.
Jargon list (V1 writing style)
gstack's Writing Style section (injected into every tier-≥2 skill's preamble)
glosses technical terms on first use per skill invocation. The list of terms
that qualify for glossing lives at scripts/jargon-list.json — ~50 curated
high-frequency terms (idempotent, race condition, N+1, backpressure, etc.).
Terms not on the list are assumed plain-English enough.
Adding or removing a term: open a PR editing scripts/jargon-list.json.
Run bun run gen:skill-docs after the edit — terms are baked into every
generated SKILL.md at gen time, so changes take effect only after regeneration.
No runtime loading; no user-side override. The repo list is the source of truth.
Good candidates for addition: high-frequency terms that non-technical users encounter in review output without context (common database/concurrency terminology, security jargon, frontend framework concepts). Don't add terms that only appear in one or two niche skills — the cost-to-value trade isn't worth the review overhead.
Multi-host development
gstack generates SKILL.md files for 8 hosts from one set of .tmpl templates.
Each host is a typed config in hosts/*.ts. The generator reads these configs
to produce host-appropriate output (different frontmatter, paths, tool names).
Supported hosts: Claude (primary), Codex, Factory, Kiro, OpenCode, Slate, Cursor, OpenClaw.
Generating for all hosts
# Generate for a specific host
bun run gen:skill-docs # Claude (default)
bun run gen:skill-docs --host codex # Codex
bun run gen:skill-docs --host opencode # OpenCode
bun run gen:skill-docs --host all # All 8 hosts
# Or use build, which does all hosts + compiles binaries
bun run build
What changes between hosts
Each host config (hosts/*.ts) controls:
| Aspect | Example (Claude vs Codex) |
|---|---|
| Output directory | {skill}/SKILL.md vs .agents/skills/gstack-{skill}/SKILL.md |
| Frontmatter | Full (name, description, hooks, version) vs minimal (name + description) |
| Paths | ~/.claude/skills/gstack vs $GSTACK_ROOT |
| Tool names | "use the Bash tool" vs same (Factory rewrites to "run this command") |
| Hook skills | hooks: frontmatter vs inline safety advisory prose |
| Suppressed sections | None vs Codex self-invocation sections stripped |
See scripts/host-config.ts for the full HostConfig interface.
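As a rough illustration of how a typed per-host config can drive output paths, here is a minimal sketch. `SketchHostConfig` and `outputPath` are hypothetical names, far smaller than the real HostConfig interface; only the two path patterns and the full/minimal frontmatter split come from the table above.

```typescript
// Hypothetical, simplified stand-in for the real HostConfig interface.
interface SketchHostConfig {
  name: string;
  // Pattern for the generated file; "{skill}" is replaced with the skill name.
  outputPattern: string;
  // Full frontmatter (name, description, hooks, version) or minimal (name + description).
  frontmatter: "full" | "minimal";
}

const claude: SketchHostConfig = {
  name: "claude",
  outputPattern: "{skill}/SKILL.md",
  frontmatter: "full",
};

const codex: SketchHostConfig = {
  name: "codex",
  outputPattern: ".agents/skills/gstack-{skill}/SKILL.md",
  frontmatter: "minimal",
};

// Resolve where a skill's generated SKILL.md would land for a given host.
function outputPath(host: SketchHostConfig, skill: string): string {
  return host.outputPattern.split("{skill}").join(skill);
}

const claudeQaPath = outputPath(claude, "qa"); // "qa/SKILL.md"
const codexQaPath = outputPath(codex, "qa"); // ".agents/skills/gstack-qa/SKILL.md"
```

Keeping the differences in data rather than in generator branches is what lets a new host be added without touching generator code.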
Testing host output
# Run all static tests (includes parameterized smoke tests for all hosts)
bun test
# Check freshness for all hosts
bun run gen:skill-docs --host all --dry-run
# Health dashboard covers all hosts
bun run skill:check
Adding a new host
See docs/ADDING_A_HOST.md for the full guide. Short version:
- Create hosts/myhost.ts (copy from hosts/opencode.ts)
- Add to hosts/index.ts
- Add .myhost/ to .gitignore
- Run bun run gen:skill-docs --host myhost
- Run bun test (parameterized tests auto-cover it)
Zero generator, setup, or tooling code changes needed.
Adding a new skill
When you add a new skill template, all hosts get it automatically:
- Create {skill}/SKILL.md.tmpl
- Run bun run gen:skill-docs --host all
- The dynamic template discovery picks it up; there is no static list to update
- Commit {skill}/SKILL.md. External host output is generated at setup time and gitignored
Conductor workspaces
If you're using Conductor to run multiple Claude Code sessions in parallel, conductor.json wires up workspace lifecycle automatically:
| Hook | Script | What it does |
|---|---|---|
| setup | bin/dev-setup | Copies .env from main worktree, installs deps, symlinks skills, runs ./setup non-interactively, and (if gbrain is installed) renders brain-aware blocks into .claude/gstack-rendered/ without dirtying tracked source |
| archive | bin/dev-teardown | Removes skill symlinks, the .claude/gstack-rendered/ render, and cleans up .claude/ directory |
When Conductor creates a new workspace, bin/dev-setup runs automatically. It detects the main worktree (via git worktree list), copies your .env so API keys carry over, and sets up dev mode — no manual steps needed.
bin/dev-setup runs ./setup fully non-interactively (it passes --plan-tune-hooks=prompt and closes stdin), so a forwarded Conductor TTY can never hang on a hidden setup prompt. It also never installs the plan-tune Claude Code hooks, which means a throwaway workspace can't rewrite your global ~/.claude/settings.json to point at an ephemeral worktree path. To install the plan-tune hooks deliberately, run ./setup --plan-tune-hooks outside dev-setup (or gstack-config set plan_tune_hooks yes).
First-time setup: Put your ANTHROPIC_API_KEY in .env in the main repo (see .env.example). Every Conductor workspace inherits it automatically.
GSTACK_* env prefix (Conductor-injected keys). Conductor explicitly strips ANTHROPIC_API_KEY and OPENAI_API_KEY from every workspace's process env. The .env copy path doesn't restore them either — the strip happens after env inheritance. Users who want paid evals, /sync-gbrain embeddings, or claude-agent-sdk calls to work in a Conductor workspace must set GSTACK_ANTHROPIC_API_KEY and GSTACK_OPENAI_API_KEY in Conductor's workspace env config; Conductor passes those through untouched. On the gstack side, TS entry points import lib/conductor-env-shim.ts as a side effect, which promotes GSTACK_FOO_API_KEY to FOO_API_KEY when the canonical name is empty. If you add a new TS entry point that hits a paid API, add import "../lib/conductor-env-shim"; to the top of the file. Today the shim is imported from bin/gstack-gbrain-sync.ts, bin/gstack-model-benchmark, scripts/preflight-agent-sdk.ts, and test/helpers/e2e-helpers.ts.
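The promotion rule described above can be sketched as a small function. This is an assumption-laden illustration, not the contents of lib/conductor-env-shim.ts: `promoteGstackKeys`, `Env`, and `CANONICAL_KEYS` are hypothetical names, and the sketch takes the environment as an argument, whereas the real shim acts on the process environment as an import side effect.

```typescript
type Env = Record<string, string | undefined>;

// Canonical key names that Conductor strips, per the paragraph above.
const CANONICAL_KEYS = ["ANTHROPIC_API_KEY", "OPENAI_API_KEY"];

// Copy GSTACK_<KEY> onto <KEY> when the canonical name is unset or empty.
// Mutates and returns the env passed in; an already-set canonical value wins.
function promoteGstackKeys(env: Env): Env {
  for (const key of CANONICAL_KEYS) {
    const prefixed = env[`GSTACK_${key}`];
    if (!env[key] && prefixed) {
      env[key] = prefixed;
    }
  }
  return env;
}

const conductorEnv: Env = {
  GSTACK_ANTHROPIC_API_KEY: "sk-test",
  OPENAI_API_KEY: "already-set",
  GSTACK_OPENAI_API_KEY: "not-used",
};
promoteGstackKeys(conductorEnv);
// conductorEnv.ANTHROPIC_API_KEY is now "sk-test";
// conductorEnv.OPENAI_API_KEY stays "already-set".
```

Promoting only when the canonical name is empty means a key set outside Conductor is never overwritten by the prefixed one.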
Things to know
- SKILL.md files are generated. Edit the .tmpl template, not the .md. Run bun run gen:skill-docs to regenerate.
- TODOS.md is the unified backlog. Organized by skill/component with P0-P4 priorities. /ship auto-detects completed items. All planning/review/retro skills read it for context.
- Browse source changes need a rebuild. If you touch browse/src/*.ts, run bun run build.
- Dev mode shadows your global install. Project-local skills take priority over ~/.claude/skills/gstack. bin/dev-teardown restores the global one.
- Conductor workspaces are independent. Each workspace is its own git worktree. bin/dev-setup runs automatically via conductor.json.
- .env propagates across worktrees. Set it once in the main repo and all Conductor workspaces get it.
- .claude/skills/ is gitignored. The symlinks never get committed.
- Never write raw ln -snf in setup. Every link site in setup MUST route through the _link_or_copy SRC DST helper near the IS_WINDOWS detection. The helper preserves ln -snf on Unix and switches to cp -R / cp -f on Windows without Developer Mode, where plain ln -snf produces frozen file copies that don't refresh on git pull. test/setup-windows-fallback.test.ts enforces this with a static invariant: a single raw ln call outside the helper body fails CI.
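To make the "no raw ln outside the helper" invariant concrete, here is a minimal sketch of one way such a static check could work. It is not the code in test/setup-windows-fallback.test.ts: `findRawLnCalls` is a hypothetical function, and it assumes the helper is declared as `_link_or_copy() {` and closed by a `}` in column 0.

```typescript
// Return 1-based line numbers of raw `ln` invocations outside the helper body.
// Assumption: the helper starts with "<name>() {" and ends at a "}" in column 0.
function findRawLnCalls(script: string, helperName = "_link_or_copy"): number[] {
  const offenders: number[] = [];
  const lines = script.split("\n");
  let insideHelper = false;

  for (let i = 0; i < lines.length; i++) {
    const line = lines[i];
    const trimmed = line.trim();

    if (trimmed.startsWith(`${helperName}()`)) {
      insideHelper = true; // entering the helper definition
      continue;
    }
    if (insideHelper) {
      if (line.startsWith("}")) insideHelper = false; // helper body ends
      continue; // ln is allowed inside the helper
    }
    if (trimmed.startsWith("#")) continue; // ignore shell comments

    // `ln` as a command word: at line start, after ; & | (, or after then/do/else.
    if (/(^|[;&|(]|\b(?:then|do|else))\s*ln\s/.test(trimmed)) {
      offenders.push(i + 1);
    }
  }
  return offenders;
}

const sampleSetup = [
  "_link_or_copy() {",
  '  ln -snf "$1" "$2"',
  "}",
  '_link_or_copy "$SRC" "$DST"',
  'ln -snf "$SRC" "$DST"',
  "# ln -snf mentioned in a comment",
  "true && ln -s a b",
].join("\n");

const rawCalls = findRawLnCalls(sampleSetup); // [5, 7]
```

A line-based scan like this is deliberately crude; it is cheap to run in CI and fails loudly on the first stray call.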
Testing your changes in a real project
This is the recommended way to develop gstack. Symlink your gstack checkout into the project where you actually use it, so your changes are live while you do real work.
Step 1: Symlink your checkout
# In your core project (not the gstack repo)
ln -sfn /path/to/your/gstack-checkout .claude/skills/gstack
Step 2: Run setup to create per-skill symlinks
The gstack symlink alone isn't enough. Claude Code discovers skills through
individual top-level directories (qa/SKILL.md, ship/SKILL.md, etc.), not through
the gstack/ directory itself. Run ./setup to create them:
cd .claude/skills/gstack && bun install && bun run build && ./setup
Setup will ask whether you want short names (/qa) or namespaced (/gstack-qa).
Your choice is saved to ~/.gstack/config.yaml and remembered for future runs.
To skip the prompt, pass --no-prefix (short names) or --prefix (namespaced).
Step 3: Develop
Edit a template, run bun run gen:skill-docs, and the next /review or /qa
call picks it up immediately. No restart needed.
Going back to the stable global install
Remove the project-local symlink. Claude Code falls back to ~/.claude/skills/gstack/:
rm .claude/skills/gstack
The per-skill directories (qa/, ship/, etc.) contain SKILL.md symlinks that point
to gstack/..., so they'll resolve to the global install automatically.
Switching prefix mode
If you installed gstack with one prefix setting and want to switch:
cd .claude/skills/gstack && ./setup --no-prefix # switch to /qa, /ship
cd .claude/skills/gstack && ./setup --prefix # switch to /gstack-qa, /gstack-ship
Setup cleans up the old symlinks automatically. No manual cleanup needed.
Alternative: point your global install at a branch
If you don't want per-project symlinks, you can switch the global install:
cd ~/.claude/skills/gstack
git fetch origin
git checkout origin/<branch>
bun install && bun run build && ./setup
This affects all projects. To revert: git checkout main && git pull && bun run build && ./setup.
Community PR triage (wave process)
When community PRs accumulate, batch them into themed waves:
- Categorize: group by theme (security, features, infra, docs)
- Deduplicate: if two PRs fix the same thing, pick the one that changes fewer lines. Close the other with a note pointing to the winner.
- Collector branch: create pr-wave-N, merge clean PRs, resolve conflicts for dirty ones, verify with bun test && bun run build
- Close with context: every closed PR gets a comment explaining why and what (if anything) supersedes it. Contributors did real work; respect that with clear communication.
- Ship as one PR: single PR to main with all attributions preserved in merge commits. Include a summary table of what merged and what closed.
See PR #205 (v0.8.3) for the first wave as an example.
Upgrade migrations
When a release changes on-disk state (directory structure, config format, stale
files) in ways that ./setup alone can't fix, add a migration script so existing
users get a clean upgrade.
When to add a migration
- Changed how skill directories are created (symlinks vs real dirs)
- Renamed or moved config keys in ~/.gstack/config.yaml
- Need to delete orphaned files from a previous version
- Changed the format of ~/.gstack/ state files
Don't add a migration for: new features (users get them automatically), new skills (setup discovers them), or code-only changes (no on-disk state).
How to add one
- Create gstack-upgrade/migrations/v{VERSION}.sh where {VERSION} matches the VERSION file for the release that needs the fix.
- Make it executable: chmod +x gstack-upgrade/migrations/v{VERSION}.sh
- The script must be idempotent (safe to run multiple times) and non-fatal (failures are logged but don't block the upgrade).
- Include a comment block at the top explaining what changed, why the migration is needed, and which users are affected.
Example:
#!/usr/bin/env bash
# Migration: v0.15.2.0 — Fix skill directory structure
# Affected: users who installed with --no-prefix before v0.15.2.0
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "$0")/../.." && pwd)"
"$SCRIPT_DIR/bin/gstack-relink" 2>/dev/null || true
How it runs
During /gstack-upgrade, after ./setup completes (Step 4.75), the upgrade
skill scans gstack-upgrade/migrations/ and runs every v*.sh script whose
version is newer than the user's old version. Scripts run in version order.
Failures are logged but never block the upgrade.
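The selection rule above (every v*.sh newer than the old version, in version order) can be sketched as follows. This illustrates the rule only; the upgrade skill's actual mechanism may differ, and `parseVersion`, `compareVersions`, and `pendingMigrations` are hypothetical names. The point is that versions are compared numerically part by part, since a plain string sort would place v0.9 after v0.15.

```typescript
// Parse "v0.15.2.0" or "0.15.2.0" into numeric parts: [0, 15, 2, 0].
function parseVersion(version: string): number[] {
  return version.replace(/^v/, "").split(".").map((part) => Number(part));
}

// Numeric, part-by-part comparison; missing parts count as 0.
function compareVersions(a: string, b: string): number {
  const pa = parseVersion(a);
  const pb = parseVersion(b);
  const length = Math.max(pa.length, pb.length);
  for (let i = 0; i < length; i++) {
    const diff = (pa[i] ?? 0) - (pb[i] ?? 0);
    if (diff !== 0) return diff;
  }
  return 0;
}

// Keep v*.sh scripts strictly newer than oldVersion, sorted oldest first.
function pendingMigrations(files: string[], oldVersion: string): string[] {
  const versionOf = (file: string): string => file.replace(/\.sh$/, "");
  return files
    .filter((file) => /^v[\d.]+\.sh$/.test(file))
    .filter((file) => compareVersions(versionOf(file), oldVersion) > 0)
    .sort((a, b) => compareVersions(versionOf(a), versionOf(b)));
}

const toRun = pendingMigrations(
  ["v0.16.0.0.sh", "v0.9.1.0.sh", "v0.15.2.0.sh", "README.md"],
  "0.15.0.0",
);
// toRun is ["v0.15.2.0.sh", "v0.16.0.0.sh"]
```

Because each selected script is expected to be idempotent, re-running the same selection after a partial upgrade is safe.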
Testing migrations
Migrations are tested as part of bun test (tier 1, free). The test suite
verifies that all migration scripts in gstack-upgrade/migrations/ are
executable and parse without syntax errors.
Shipping your changes
When you're happy with your skill edits:
/ship
This runs tests, reviews the diff, triages Greptile comments (with 2-tier escalation), manages TODOS.md, bumps the version, and opens a PR. See ship/SKILL.md for the full workflow.