diff --git a/.gitattributes b/.gitattributes
index 713416057..e67042f0a 100644
--- a/.gitattributes
+++ b/.gitattributes
@@ -37,3 +37,9 @@ bin/* text eol=lf
 *.gif binary
 *.ico binary
 *.pdf binary
+
+# The committed diagram-render bundle is hash-pinned (BUILD_INFO sha256);
+# a CRLF rewrite on Windows checkout would break the drift test and change
+# the content-addressed staged filename.
+lib/diagram-render/dist/*.html text eol=lf
+lib/diagram-render/dist/*.json text eol=lf
diff --git a/.github/workflows/make-pdf-gate.yml b/.github/workflows/make-pdf-gate.yml
index 769fccd2b..cd07e26bc 100644
--- a/.github/workflows/make-pdf-gate.yml
+++ b/.github/workflows/make-pdf-gate.yml
@@ -4,6 +4,8 @@ on:
     branches: [main]
     paths:
       - 'make-pdf/**'
+      - 'lib/diagram-render/**'
+      - 'test/diagram-render-drift.test.ts'
       - 'browse/src/meta-commands.ts'
       - 'browse/src/write-commands.ts'
       - 'browse/src/commands.ts'
@@ -81,7 +83,7 @@ jobs:
           which pdftotext && pdftotext -v 2>&1 | head -1 || true

       - name: Run make-pdf unit tests
-        run: bun test make-pdf/test/*.test.ts
+        run: bun test make-pdf/test/*.test.ts test/diagram-render-drift.test.ts

       - name: Run E2E gates (combined-features copy-paste + emoji render)
         env:
diff --git a/.gitignore b/.gitignore
index 42b2c2a04..5196c0d05 100644
--- a/.gitignore
+++ b/.gitignore
@@ -4,9 +4,13 @@ dist/
 browse/dist/
 design/dist/
 make-pdf/dist/
+# diagram-render ships its built bundle (offline-at-install premise, eng-review D2)
+!lib/diagram-render/dist/
+!lib/diagram-render/dist/**
 bin/gstack-global-discover*
 .gstack/
 .claude/skills/
+.claude/gstack-rendered/
 .claude/scheduled_tasks.lock
 .claude/*.lock
 .agents/
diff --git a/AGENTS.md b/AGENTS.md
index a3d1fdb48..69651022d 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -104,6 +104,7 @@ End-to-end walkthrough: [docs/howto-ios-testing-with-gstack.md](docs/howto-ios-t
 | `/guard` | Activate both careful + freeze at once. |
 | `/unfreeze` | Remove directory edit restrictions. |
 | `/make-pdf` | Turn any markdown file into a publication-quality PDF. |
+| `/diagram` | English in, diagram out: mermaid source + editable .excalidraw + SVG/PNG, offline. |

 ## Build commands

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 4e858d879..1d59026dc 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -44,6 +44,371 @@ If you only want to merge, run `/land` and stop. Got ten PRs green and ready? Ru
 #### For contributors
 - `lib/merge.ts` holds the pure regime logic (detection precedence, submit planning, landing classification, handoff schema + validation); `test/gstack-merge.test.ts` (30) and `test/gstack-merge-cli.test.ts` (11) pin it. A generated-doc scrub test fails CI if `/land`'s SKILL.md ever grows deploy/canary machinery. The merge SHA → revert handoff and the never-blind-retry invariant (cli/cli#3442, cli/cli#13380) moved into `/land` with their tests.

+## [1.58.1.0] - 2026-06-14
+
+## **Local evals stop lying. Spawned `claude` test children run in a sealed clean room,**
+## **and in Conductor every decision is a plain-text brief you answer with a letter.**
+
+Two things shipped here. First, the local E2E harness is now hermetic by default:
+every spawned agent (claude -p, the real-PTY plan-mode runner, the Agent SDK
+runner, plus the codex and gemini runners) gets an allowlist-scrubbed environment,
+a fresh seeded `CLAUDE_CONFIG_DIR`, a temp `GSTACK_HOME`, and `--strict-mcp-config`.
+Before this, a dev machine leaked the operator's `~/.claude` config, MCP servers
+(gbrain, Conductor), skills, `~/.gstack` decision logs, and `CONDUCTOR_*`/`CLAUDECODE`
+env into every child, so local eval results disagreed with CI for reasons that had
+nothing to do with the code under test. Now local signal matches CI. Set
+`EVALS_HERMETIC=0` to debug against real operator state.
+
+Second, in a Conductor session gstack no longer fights Conductor's flaky
+AskUserQuestion tool. It detects the session and renders every decision as a prose
+brief, a labeled question with a recommendation, per-option completeness scores, and
+"reply with a letter," enforced by a PreToolUse hook that denies the tool and
+redirects to prose. Destructive confirmations demand an explicit typed answer.
+
+Agents that launch long eval runs get `gstack-detach`: a SIGTERM-proof, idle-sleep-proof
+wrapper (fresh session + `caffeinate`) with a machine-wide lock so concurrent
+worktrees serialize instead of saturating the model API, run-scoped logs, and a
+guaranteed `EXIT=` sentinel so a poller never mistakes silence for success.
+
+### The numbers that matter
+
+Measured against the gate eval suite on a contaminated dev box (gbrain MCP up, live
+Conductor session, sibling worktrees). Reproduce: `bun test` (free unit + wiring
+tripwire) and `EVALS=1 EVALS_TIER=gate bun test test/skill-e2e-hermetic-canary.test.ts`.
+
+| Metric | Before | After | Δ |
+|--------|--------|-------|---|
+| Spawned-child env | full operator `process.env` | allowlist-scrubbed | sealed |
+| Runners hermeticized | 0 of 5 | 5 of 5 | +5 |
+| Operator MCP servers visible to child | all (gbrain, Conductor) | 0 (`--strict-mcp-config`) | isolated |
+| Config isolation proof | none | poisoned-operator sentinel canary | falsifiable |
+| Long eval runs surviving a turn-boundary SIGTERM | no | yes (`gstack-detach`) | survives |
+
+The clean room is falsifiable, not asserted: a `hermetic-sentinel` gate canary
+plants a poisoned operator config (a user `CLAUDE.md` + an MCP server) and fails if
+the child can see any of it, and a free static tripwire fails CI if any runner
+reverts to a raw `process.env` spread.
+
+### What this means for contributors
+
+Run evals locally and trust the result. You no longer have to push to CI to find
+out whether a failure was real or just your machine bleeding context into the agent.
+Three latent bugs the old harness hid surfaced the moment the suite ran clean and
+are fixed: a coverage-judge that scored carved skills against half a document, an
+ios-qa daemon test that collided on a shared pidfile under concurrency, and an
+operational-learning fixture missing a lib it imports. Start a run with
+`bun run eval:bg:gate`; flip `EVALS_HERMETIC=0` only when you deliberately want your
+real `~/.claude` in the loop.
+
+### Itemized changes
+
+#### Added
+- **Hermetic E2E environment** (`test/helpers/hermetic-env.ts`): allowlist env
+  builder (process basics, network/proxy vars, named `ANTHROPIC_*` auth, per-runner
+  `extraAllow`), pure `promotedEnv()` shared with `lib/conductor-env-shim.ts`, a
+  sync-memoized singleton temp dir (`/.claude` keeps the plan-file path
+  contract), a seeded `.claude.json` for non-interactive first run, and pid-aware GC
+  of crashed runs. Default-on; `EVALS_HERMETIC=0` restores the legacy env AND drops
+  `--strict-mcp-config`.
+- **Two gate-tier isolation canaries** (`test/skill-e2e-hermetic-canary.test.ts`):
+  `hermetic-canary` asserts env redirect + scrub + zero MCP servers + nonzero
+  API-key cost from the Bash tool_result (not model prose); `hermetic-sentinel`
+  proves the child cannot see a planted poisoned operator config.
+- **Static wiring tripwire** (`test/hermetic-wiring.test.ts`): free-tier invariants
+  that fail CI if any of the five runners drops `hermeticChildEnv()`, the gated
+  `--strict-mcp-config`, or leaks `process.env` through a callsite override.
+- **`gstack-detach`** + `eval:bg` / `eval:bg:all` / `eval:bg:gate` / `eval:bg:periodic`
+  scripts: detached, SIGTERM-proof, `caffeinate`-wrapped eval runs with a machine-wide
+  lock, per-run logs under `~/.gstack-dev/eval-runs/`, a watchdog, and an `EXIT=`
+  sentinel.
+- **Conductor prose AskUserQuestion**: when a Conductor session is detected, every
+  decision renders as a prose brief (labeled question, recommendation, per-option
+  completeness, reply-with-a-letter), enforced by a PreToolUse hook that denies the
+  tool and redirects. Auto-decide preferences still apply first; destructive
+  confirmations require an explicit typed answer. Installed for Conductor even in
+  non-interactive setup, with an upgrade migration for existing installs.
+
+#### Changed
+- All five E2E runners (`session-runner`, `claude-pty-runner`, `agent-sdk-runner`,
+  `codex-session-runner`, `gemini-session-runner`) spawn children through
+  `hermeticChildEnv()`. The Agent SDK runner now receives a COMPLETE hermetic env
+  via `Options.env` (the old "never pass env: to the SDK" rule was partial-env
+  replacement; a complete env is safe).
+- `hermetic-env.ts` is a global touchfile, so any change to it selects every E2E +
+  judge test.
+- CLAUDE.md documents hermetic-by-default local evals and retires the stale SDK env
+  warning.
+
+#### Fixed
+- The workflow LLM-judge now re-appends body-carved `sections/*.md` after the marker
+  slice, so carved skills (document-release) are judged on the full workflow the
+  agent executes instead of a half-document.
+- ios-qa daemon scenarios use unique pidfiles, fixing `already_running` collisions
+  under `bun test --concurrent`.
+
+## [1.58.0.0] - 2026-06-12
+
+## **Your documents grow diagrams. Mermaid and excalidraw fences render as real pictures,**
+## **and make-pdf now ships single-file HTML and Word output from the same markdown.**
+
+Put a ` ```mermaid ` fence in your markdown and `make-pdf` renders it as a crisp
+vector diagram, fully offline, with the source preserved for round-trips. A broken
+fence prints a loud red diagnostic block with the parse error, never silent raw
+code. The new `/diagram` skill goes the other way: describe a flow in English and
+get a triplet back, the mermaid source, an editable `.excalidraw` file you can open
+at excalidraw.com in the hand-drawn style, and rendered SVG + PNG. Images got the
+same care: local paths inline automatically and never truncate, phone photos
+downscale to print resolution instead of blowing up the file, and a wide small-text
+diagram promotes itself onto a vertically centered landscape page inside an
+otherwise portrait document. One markdown file now exports three ways:
+`--to pdf | html | docx`, where html is one self-contained file with zero network
+references. Type is bigger across the board (12pt body, 56pt cover titles), TOC
+links actually jump, and `--strict` turns missing, remote, out-of-tree, or
+oversized images into hard CI failures.
+
+### The numbers that matter
+
+Measured on this repo's README (5,940 words, lists, code, screenshots, one
+diagram fence) and the free gate suite. Reproduce: `make-pdf generate README.md
+--cover --toc` and `bun test make-pdf/test/`.
+
+| Metric | Before | After | Δ |
+|--------|--------|-------|---|
+| A mermaid fence in your PDF | raw code block | vector diagram | rendered |
+| Output formats from one markdown | 1 (pdf) | 3 (pdf, html, docx) | +2 |
+| Network requests at render time | up to 1 per remote image | 0 by default | sealed |
+| Wide-diagram handling | shrunk into portrait | own centered landscape page | rotated |
+| Free make-pdf gate tests | 121 | 189 | +68 |
+| README → 29-page PDF with diagram | n/a | 4.4s | one command |
+
+The sealed-network number is the one to notice: the mermaid and excalidraw
+runtimes are vendored into a 9.2MB sha-pinned bundle, so rendering works on a
+plane and a tracking pixel in pasted markdown fetches nothing.
+
+### What this means for your documents
+
+The diagram you describe in English stays editable forever: `/diagram` writes the
+source, you embed the source in markdown, and every export renders it fresh. Stop
+pasting screenshots of diagrams into documents. Run `/diagram` for the picture,
+` ```mermaid ` for the document, and `--to html` when the reader doesn't want a PDF.
+
+### Itemized changes
+
+#### Added
+- ` ```mermaid ` and ` ```excalidraw ` fences render as inline vector SVG in pdf
+  and html output (docx embeds them as 300dpi PNGs). Fence options: `title="..."` (caption + aria-label),
+  `render=false` (keep as code), `page=landscape|portrait` (orientation override).
+  Render failures produce a visible diagnostic block with the parse error.
+- `/diagram` skill: English in, editable triplet out (`.mmd` source,
+  `.excalidraw` scene, SVG + PNG). Flowcharts convert to fully editable
+  excalidraw scenes; other mermaid types render with an explicit limitation note.
+- `lib/diagram-render/`: vendored offline bundle (mermaid 11.12.2, excalidraw
+  0.18.0, exact pins), deterministic build, committed dist with sha256 + source
+  fingerprint, drift tests, THIRD-PARTY-LICENSES.
+- `--to pdf|html|docx` output formats. HTML is one self-contained file (inline
+  SVG diagrams, data-URI images, zero network refs, screen-readable). DOCX is a
+  content-fidelity export with diagrams embedded as 300dpi PNGs and alt text.
+- Per-image directives: `![x](a.png){width=full|50%|3in}` and
+  `{page=landscape|portrait}`.
+- Conservative auto-landscape: wide, small-text, diagram-like images get their
+  own vertically centered landscape page (aspect ≥ 1.8, width over ~2.5x the
+  content box, diagram-ish alt word). Directives override in both directions.
+- `--strict` for CI: missing images, remote images, out-of-tree image reads,
+  oversized files, and non-regular files fail the run instead of degrading to
+  placeholders.
+- `docs/howto-diagrams-and-formats.md`: the full walkthrough, fences to formats.
+
+#### Changed
+- Typography scale: 12pt body, 26pt h1, 56pt poster cover with 13pt meta, 12pt
+  TOC entries, larger code and tables. Auto-hyphenation is off so copy-paste
+  yields clean words.
+- Local images inline as data URIs with byte-probed dimensions and never
+  truncate; oversized photos downscale to print resolution at inline time;
+  repeated images are read once.
+- TOC links resolve in every format (headings get real anchor ids); the screen
+  layer hides print-only page-number dots in HTML output.
+- Remote images are blocked with a visible placeholder unless `--allow-network`
+  is passed; out-of-tree image reads (including via symlink) warn loudly.
+- `make-pdf preview` prints a note when the document contains fences or local
+  images that only `generate` renders fully.
+
+#### Fixed
+- Relative image paths render correctly in PDFs (previously resolved against the
+  wrong base and could show as broken boxes).
+- Fenced code inside lists survives the render byte-for-byte; indented fences
+  keep their list placement.
+- Documents containing `$&`-style sequences in diagram labels render exactly;
+  Windows drive-letter image paths resolve as local files; malformed
+  percent-encoded image URLs degrade gracefully instead of failing the run.
+- Per-side margins (`--margin-left` etc.) are honored on documents containing
+  landscape pages.
+
+#### For contributors
+- 68 new free-tier gates (fence extraction, image policy, landscape promotion
+  with negative fixtures, format contracts, bundle drift) plus a paid gate-tier
+  /diagram triplet test and a periodic authoring-quality judge.
+- make-pdf-gate CI now covers `lib/diagram-render/**` and the drift test; the
+  committed bundle is pinned to LF in .gitattributes.
+- Fixed the `operational-learning` E2E fixture (bin scripts now ship with the
+  lib module they import).
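Reviewer note on the 1.58.0.0 "Conservative auto-landscape" entry: the rule it states (aspect ≥ 1.8, width over ~2.5x the content box, a diagram-ish alt word, directives overriding in both directions) might be sketched roughly as follows. This is a hypothetical illustration, not make-pdf's code: the function name, the `ImageInfo` shape, the word list, and the choice to require all three signals together are assumptions.

```typescript
// Hypothetical sketch of the auto-landscape decision; names are illustrative only.
interface ImageInfo {
  widthPx: number;
  heightPx: number;
  alt: string;
  directive?: "landscape" | "portrait"; // explicit {page=...} override, if present
}

// Assumed word list: the changelog only says "diagram-ish alt word".
const DIAGRAM_WORDS = ["diagram", "flow", "architecture", "chart"];

function shouldPromoteToLandscape(img: ImageInfo, contentBoxWidthPx: number): boolean {
  // Directives win over the heuristic in both directions.
  if (img.directive === "landscape") return true;
  if (img.directive === "portrait") return false;
  if (img.heightPx <= 0) return false;

  const wideAspect = img.widthPx / img.heightPx >= 1.8;
  const muchWiderThanBox = img.widthPx > 2.5 * contentBoxWidthPx;
  const alt = img.alt.toLowerCase();
  const diagramLikeAlt = DIAGRAM_WORDS.some((word) => alt.includes(word));

  // Assumption: "conservative" means every signal must agree before promoting.
  return wideAspect && muchWiderThanBox && diagramLikeAlt;
}
```

Requiring all three signals keeps an ordinary wide photo in portrait flow. A document that disagrees with the heuristic can still force either orientation with the `{page=landscape|portrait}` directive.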
+
+## [1.57.10.0] - 2026-06-10
+
+## **Codex review now runs by default everywhere it matters.**
+## **One switch governs it, and it falls back to Claude when Codex is missing or unauthed.**
+
+Codex cross-model review used to be inconsistent. `/review` and `/ship` ran it
+automatically, but plan reviews hid it behind a "Want an outside voice?" question
+you had to say yes to every time, `/document-release` never ran it at all, and every
+entry point only checked whether the `codex` binary existed, not whether it was
+logged in. Now `codex_reviews` is one master switch (default `enabled`) that governs
+Codex review across `/review`, `/ship`, `/plan-ceo-review`, `/plan-eng-review`,
+`/plan-design-review`, `/plan-devex-review`, `/document-release`, and `/autoplan`.
+The plan-review outside voice runs automatically. `/document-release` gets a new
+Codex pass that checks your docs against what actually shipped. Every call site now
+detects install AND auth separately, and degrades to a Claude subagent with a clear
+one-line reason instead of silently skipping. Turn the whole thing off with one
+command: `gstack-config set codex_reviews disabled`.
+
+### The numbers that matter
+
+Verified by the gate-tier E2E evals that exercise these exact paths
+(`codex-offered-ceo-review`, `codex-offered-eng-review`, `document-release`,
+`codex-review-findings`), all green this run.
+
+| Metric | Before | After | Δ |
+|--------|--------|-------|---|
+| Skills where Codex review runs by default | 2 | 8 | +6 |
+| Prompts to get a plan-review outside voice | 1 (opt-in each time) | 0 (automatic) | -1 |
+| Codex readiness detection | install only | install + auth | sharper |
+| Master switches to disable it all | 0 (per-skill only) | 1 (`codex_reviews`) | +1 |
+| `/document-release` Codex doc audit | none | doc-vs-diff pass | new |
+
+When Codex is installed but not logged in, you used to get nothing on the paths that
+checked only `command -v codex`. Now you get a named reason ("Codex installed but not
+authenticated, using Claude subagent") and the review still happens. A typo on the
+switch (`gstack-config set codex_reviews disabledd`) is rejected and your existing
+setting is preserved, so a fat-finger can never silently turn paid Codex calls on or
+off.
+
+### What this means for you
+
+If you run gstack day to day, you stop deciding whether to get a second model's eyes
+on every plan and every release. It is just there, on by default, the way the strong
+reviewers already worked on diffs. If you do not have Codex set up, nothing breaks:
+you get the Claude outside voice instead, with a one-line note telling you how to add
+Codex for true cross-model coverage. If you want it gone, one command turns off all
+eight surfaces at once.
+
+### Itemized changes
+
+#### Added
+- **`codex_reviews` as the master switch** for Codex review across `/review`, `/ship`,
+  `/document-release`, all four plan reviews, and `/autoplan` (`bin/gstack-config`).
+  Default `enabled`. Invalid values on `set` are rejected with the existing value
+  preserved, so a typo cannot flip paid Codex calls.
+- **`/document-release` Codex doc audit** (`generateCodexDocReview`): reviews the
+  docs you touched against the release diff for stale claims, undocumented new
+  surface, and over/under-sold CHANGELOG entries. Informational, with an explicit
+  apply-fixes decision point. Never auto-edits docs.
+- **`codexPreflight()` shared helper** (`scripts/resolvers/constants.ts`): one
+  self-contained bash block that reads the switch, sources the probe, checks install
+  and auth, and emits a single canonical mode (`ready` / `not_installed` /
+  `not_authed` / `disabled`).
+
+#### Changed
+- **Plan-review outside voice is default-on**, not opt-in. The "Want an outside
+  voice?" question is gone; it runs automatically and falls back to a Claude subagent
+  when Codex is unavailable. Incorporating its findings still requires your explicit
+  approval (cross-model tension is presented, never auto-applied).
+- **Adversarial review detects auth, not just install** (`generateAdversarialStep`):
+  distinct "not installed" vs "not authenticated" guidance. The 200-line threshold
+  for the heavier structured `codex review` is unchanged.
+- **`/autoplan` honors `codex_reviews=disabled`** in its Phase 0.5 preflight, so the
+  switch is truly global.
+
+#### Fixed
+- Three `gstack-config` tests asserted `get`/`list` print empty for unset keys; the
+  tool falls back to the documented defaults table. Assertions now match real behavior.
+
+#### For contributors
+- Size-budget guards widened for the default-on outside-voice prose, each with a
+  rationale comment (`test/helpers/carve-guards.ts`, `test/helpers/parity-harness.ts`).
+- Static guards added: plan reviews must not carry the opt-in question and must render
+  the default-on voice; `/document-release` must carry the doc review; the codex host
+  strips all of it (`test/skill-validation.test.ts`).
+
+## [1.57.9.0] - 2026-06-09
+
+## **Your gstack checkout stays clean when gbrain is installed.**
+## **Brain-aware skill blocks render to an untracked spot, never into tracked source.**
+
+Before this, finishing a Conductor or dev-workspace setup with gbrain installed
+rewrote 16 planning and review SKILL.md files in place, adding 326 lines of
+brain-aware blocks straight into tracked source. Your working tree came back dirty,
+one stray `git add` away from committing a token regression for everyone who does
+not run gbrain. Now `gen-skill-docs --out-dir` renders the brain-aware variant into
+an untracked per-workspace directory, and `bin/dev-setup` repoints the workspace's
+skill symlinks at it. The dev workspace gets the full gbrain experience (context-load
+and save-to-brain blocks live at runtime), while the tracked SKILL.md files stay
+byte-for-byte canonical. To turn the blocks on across all your projects' Claude
+sessions, `gstack-config gbrain-refresh` now renders them into your global install,
+guarded so it never mutates a symlinked or non-gstack directory.
+
+### The numbers that matter
+
+Structural facts of the change, verifiable from the diff plus `bun run gen:skill-docs`
+(zero drift) and the new behavioral test (`test/gen-skill-docs-out-dir.test.ts`).
+
+| When gbrain is installed | Before | After |
+|---|---|---|
+| Tracked SKILL.md files dirtied by dev-setup | 16 (+326 lines) | 0 |
+| Where brain-aware blocks render in a dev workspace | in-place, tracked source | `.claude/gstack-rendered/`, untracked |
+| Brain-aware blocks across other projects | re-run `./setup` or hand-edit | `gstack-config gbrain-refresh` (idempotent) |
+| "Is gbrain usable" check | per-caller JSON grep, can read stale state | `gstack-gbrain-detect --is-ok` (one live gate) |
+
+The section-path rewrite is surgical: only `~/.claude/skills/gstack//sections/`
+references move to the render dir, so `bin/` and `docs/` references still resolve to
+the install.
+
+### What this means for you
+
+If you develop gstack with gbrain on, `git status` is clean again after setup, and
+you can stop fishing brain-block drift out of your commits. After a
+`git reset --hard` deploy of your install, re-run `gstack-config gbrain-refresh` to
+restore the machine-wide blocks (it is idempotent, and the deploy note in CLAUDE.md
+spells this out).
+
+### Itemized changes
+
+#### Added
+- `gen-skill-docs --out-dir `: render the Claude SKILL.md + sections into a
+  separate directory instead of in place, rewriting only the section-base path so
+  section reads resolve to the render. Default (no flag) output is unchanged.
+- `gstack-gbrain-detect --is-ok`: live-detection exit-code gate (0 iff gbrain is
+  usable), so setup, dev-setup, and gstack-config share one check.
+- `gstack-config gbrain-refresh` now renders brain-aware blocks into the global
+  install (`~/.claude/skills/gstack`), guarded against symlinked or non-gstack
+  targets and self-documenting about the `reset --hard` re-run cycle.
+
+#### Changed
+- `bin/dev-setup` renders the brain-aware variant into `.claude/gstack-rendered`
+  (gitignored) and repoints workspace skill symlinks at it; the worktree stays
+  canonical. `GSTACK_SKIP_GBRAIN_REGEN` is passed inline to the nested setup, never
+  exported.
+- `setup` honors `GSTACK_SKIP_GBRAIN_REGEN` (skips the in-place brain regen on dev
+  trees) and writes detection state to a PID-unique tmp so concurrent workspaces
+  cannot clobber it.
+- `scripts/dev-skill.ts` refreshes the workspace render on template change, only
+  when the render dir already exists.
+- `bin/dev-teardown` removes the untracked render.
+
+#### For contributors
+- New tests: `test/gen-skill-docs-out-dir.test.ts` (behavioral: worktree unchanged,
+  blocks rendered, section paths rewritten), `test/dev-setup-render-isolation.test.ts`
+  and `test/gbrain-refresh-install-render.test.ts` (static tripwires), plus
+  `--is-ok` coverage in `test/gbrain-detect-shape.test.ts`.
+
 ## [1.57.8.0] - 2026-06-09

 ## **`browse` is now the one Chromium on the box, for offline rendering too.**
diff --git a/CLAUDE.md b/CLAUDE.md
index 41db0093e..984844902 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -31,11 +31,26 @@ use Codex's own auth from `~/.codex/` config — no `OPENAI_API_KEY` env var nee
 `lib/conductor-env-shim.ts`) promotes `GSTACK_ANTHROPIC_API_KEY` /
 `GSTACK_OPENAI_API_KEY` to their canonical names inside gstack's TS binaries.
 Tests run through gstack entrypoints inherit this promotion automatically.
-Don't echo the key value to stdout, logs, or shell history. When passing to a
-test's Agent SDK, do NOT pass `env: {...}` to `runAgentSdkTest` — the SDK's
-auth pipeline doesn't pick up the key the same way when env is supplied as an
-object (confirmed failure mode). Mutate `process.env.ANTHROPIC_API_KEY`
-ambiently before the call and restore in `finally`.
+Don't echo the key value to stdout, logs, or shell history. The historical
+"never pass `env:` to `runAgentSdkTest`" rule is retired: the failure was
+partial-env replacement (the SDK's `Options.env` REPLACES the child's entire
+environment, so an object without the key broke auth). The runner now always
+passes a COMPLETE hermetic env with per-test `env:` merged last, so per-test
+overrides are safe; ambient `process.env.ANTHROPIC_API_KEY` mutation also
+still works (the env builder reads process.env at call time).
+
+**Hermetic local E2E (default).** Every E2E runner (claude -p, PTY, Agent
+SDK, codex, gemini) spawns children through `test/helpers/hermetic-env.ts`:
+allowlist-scrubbed env (operator `CONDUCTOR_*`, `CLAUDE_*`, `GSTACK_*`,
+`MCP_*`, `GBRAIN_*`, and credentials like `GH_TOKEN` never reach children),
+a fresh seeded `CLAUDE_CONFIG_DIR` (no operator `~/.claude` CLAUDE.md /
+MCP servers / skills), a temp `GSTACK_HOME`, and `--strict-mcp-config`.
+Local eval signal matches CI. Debug against real operator state with
+`EVALS_HERMETIC=0` (restores the legacy env AND drops the strict-MCP flag).
+Per-test `env:` overrides merge last, so deliberate contamination
+(`CONDUCTOR_WORKSPACE_PATH`, per-test `GSTACK_HOME`) keeps working. Wiring
+is pinned by `test/hermetic-wiring.test.ts` (static tripwire) and two
+gate-tier canaries in `test/skill-e2e-hermetic-canary.test.ts`.

 E2E tests stream progress in real-time (tool-by-tool via `--output-format
 stream-json --verbose`). Results are persisted to `~/.gstack-dev/evals/` with auto-comparison
@@ -828,6 +843,34 @@ them. Report progress at each check (which tests passed, which are running, any
 failures so far). The user wants to see the run complete, not a promise that
 you'll check later.

+## Running evals as an agent: always detach (SIGTERM-proof)
+
+When **you (an agent/harness)** launch a long eval/benchmark run, run it through
+`bin/gstack-detach` — NEVER as a plain backgrounded Bash task. A plain background
+task lives in the harness's process group, so a SIGTERM ("polite quit") on a turn
+boundary, a stopped Monitor, or an interruption kills the run mid-flight (observed:
+`script "test:gate" was terminated by signal SIGTERM` ~40 min into a run). On macOS
+the run can also die to idle-sleep. `gstack-detach` fixes both: a fresh session
+(escapes the group SIGTERM) wrapped in `caffeinate -i` (blocks idle-sleep).
+
+- Use the `eval:bg*` scripts (`eval:bg`, `eval:bg:all`, `eval:bg:gate`,
+  `eval:bg:periodic`) — they wrap the eval command in `gstack-detach` with the
+  machine-wide `gstack-evals` lock (concurrent worktrees serialize instead of
+  saturating the shared model API), a per-tier watchdog, and a **run-scoped** log
+  under `~/.gstack-dev/eval-runs/` (no shared-`/tmp` collision). Each prints its
+  log path. Or call `gstack-detach [--lock NAME] [--timeout SECS] [--label LBL] --
+  ` directly for any long agent job. Export `ANTHROPIC_API_KEY` first (never
+  pass keys in argv).
+- Then **poll the printed logfile** with a death-aware watcher: break on the
+  guaranteed `### gstack-detach EXIT= ###` sentinel (success AND failure are
+  both marked, so silence is never mistaken for success). The detached run survives
+  even if your watcher gets reaped, so re-checking the log always works.
+- Why the lock: a shared dev box with several Conductor worktrees will rate-limit
+  the model API if two eval suites run at once (15-way concurrency each), which
+  mass-times-out E2E tests. The lock makes the second run WAIT, not collide.
+- Humans running `bun run test:evals` foreground in their own terminal don't need
+  this — Ctrl-C is intended there. Detachment is for agent-launched runs only.
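Reviewer note on the "poll the printed logfile" bullet: a watcher along those lines might look like the sketch below. It is a minimal illustration under stated assumptions, not code from this repo. The function names are hypothetical, and because the text shows the sentinel only as `### gstack-detach EXIT= ###`, the value after `EXIT=` is captured as an opaque string rather than interpreted.

```typescript
import { readFile } from "node:fs/promises";

// Assumed sentinel shape; whatever follows EXIT= is kept as a raw string.
const SENTINEL = /^### gstack-detach EXIT=(\S*) ###\s*$/m;

/** Returns the raw EXIT value when the sentinel line is present, else null. */
function findExitSentinel(logText: string): string | null {
  const match = SENTINEL.exec(logText);
  return match ? (match[1] ?? "") : null;
}

/**
 * Polls a logfile until the sentinel appears or the deadline passes.
 * Resolves with the raw EXIT value, or null on timeout, so the caller
 * never treats a silent log as success.
 */
async function waitForExit(
  logPath: string,
  timeoutMs: number,
  intervalMs = 5_000,
): Promise<string | null> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    // The log may not exist yet right after launch; read that as "no sentinel yet".
    const text = await readFile(logPath, "utf8").catch(() => "");
    const exit = findExitSentinel(text);
    if (exit !== null) return exit;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  return null;
}
```

Returning `null` on timeout keeps "no sentinel yet" distinct from any marked exit. Since the detached run is described as outliving its watcher, calling `waitForExit` again on the same log path is a safe way to resume.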
+ ## E2E test fixtures: extract, don't copy **NEVER copy a full SKILL.md file into an E2E test fixture.** SKILL.md files are @@ -883,6 +926,12 @@ The active skill lives at `~/.claude/skills/gstack/`. After making changes: 2. Fetch and reset in the skill directory: `cd ~/.claude/skills/gstack && git fetch origin && git reset --hard origin/main` 3. Rebuild: `cd ~/.claude/skills/gstack && bun run build` +**If you use gbrain:** the `git reset --hard` in step 2 reverts the brain-aware +(`GBRAIN_CONTEXT_LOAD` / `GBRAIN_SAVE_RESULTS`) blocks that `gstack-config +gbrain-refresh` renders into the install (those generated blocks differ from +`main` by design). After deploying, re-run `gstack-config gbrain-refresh` to +restore them across all your projects' Claude sessions. It's idempotent. + Or copy the binaries directly: - `cp browse/dist/browse ~/.claude/skills/gstack/browse/dist/browse` - `cp design/dist/design ~/.claude/skills/gstack/design/dist/design` diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index a4872fc47..b75d4a898 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -106,6 +106,22 @@ bun run build bin/dev-teardown ``` +### Brain-aware blocks in a dev workspace (gbrain installed) + +If gbrain is installed and usable (`bin/gstack-gbrain-detect --is-ok` exits 0), +`bin/dev-setup` keeps your tracked `SKILL.md` files canonical and renders the +brain-aware variant (the `GBRAIN_CONTEXT_LOAD` / `GBRAIN_SAVE_RESULTS` blocks) +into `.claude/gstack-rendered/` (gitignored, per-workspace). It then repoints the +workspace's `SKILL.md` symlinks at that render, so your Claude sessions get the +full gbrain experience while `git status` stays clean. Under the hood, dev-setup +passes `GSTACK_SKIP_GBRAIN_REGEN=1` inline to the nested `./setup` (so it never +dirties tracked source) and runs `gen:skill-docs:user --out-dir .claude/gstack-rendered`, +which rewrites only the section-base paths to point at the render. `bin/dev-teardown` +removes the render. 
To make the blocks live across your *other* projects' Claude +sessions, run `gstack-config gbrain-refresh`, which renders them into the global +install (`~/.claude/skills/gstack`), guarded so it never touches a symlinked or +non-gstack directory. + ## Testing & evals ### Setup @@ -160,6 +176,18 @@ EVALS=1 bun test test/skill-e2e-*.test.ts - Saves full NDJSON transcripts and failure JSON for debugging - Tests live in `test/skill-e2e-*.test.ts` (split by category), runner logic in `test/helpers/session-runner.ts` +**Hermetic by default.** Every E2E runner (claude -p, the real-PTY plan-mode +runner, the Agent SDK runner, plus the codex and gemini runners) spawns its child +through `test/helpers/hermetic-env.ts`: an allowlist-scrubbed environment, a fresh +seeded `CLAUDE_CONFIG_DIR`, a temp `GSTACK_HOME`, and `--strict-mcp-config`. Your +operator `~/.claude` config, MCP servers (gbrain, Conductor), skills, `~/.gstack` +decision logs, and `CONDUCTOR_*` env never leak into the child, so local eval +signal matches CI instead of disagreeing for reasons unrelated to the code under +test. Set `EVALS_HERMETIC=0` to debug against your real operator state (this also +drops `--strict-mcp-config`). The wiring is pinned by `test/hermetic-wiring.test.ts` +(a free static tripwire) and two gate-tier isolation canaries in +`test/skill-e2e-hermetic-canary.test.ts`. + ### E2E observability When E2E tests run, they produce machine-readable artifacts in `~/.gstack-dev/`: @@ -182,6 +210,25 @@ bun run eval:compare # compare two runs — shows per-test deltas + Take bun run eval:summary # aggregate stats + per-test efficiency averages across runs ``` +**Detached runs for agents and long suites.** When an agent (or you, for a run +you don't want to babysit) launches a long eval, use the `eval:bg*` scripts. 
They
+wrap the eval command in `bin/gstack-detach`: a fresh session that escapes a
+turn-boundary SIGTERM, a `caffeinate` wrapper that blocks idle-sleep, a machine-wide
+`gstack-evals` lock so concurrent worktrees serialize instead of saturating the
+model API, a run-scoped log under `~/.gstack-dev/eval-runs/`, a per-tier watchdog,
+and a guaranteed `### gstack-detach EXIT= ###` sentinel so a poller never
+mistakes silence for success.
+
+```bash
+bun run eval:bg # detached test:evals (diff-based)
+bun run eval:bg:all # detached test:evals:all
+bun run eval:bg:gate # detached gate-tier suite
+bun run eval:bg:periodic # detached periodic-tier suite
+```
+
+Each prints its log path. Humans running `bun run test:evals` foreground in their
+own terminal don't need this — Ctrl-C is intended there.
+
 **Eval comparison commentary:** `eval:compare` generates natural-language Takeaway sections interpreting what changed between runs — flagging regressions, noting improvements, calling out efficiency gains (fewer turns, faster, cheaper), and producing an overall summary. This is driven by `generateCommentary()` in `eval-store.ts`.

 Artifacts are never cleaned up — they accumulate in `~/.gstack-dev/` for post-mortem debugging and trend analysis.
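Reviewer note on the detached-run sentinel in the CONTRIBUTING.md hunk above: the contract ("a poller never mistakes silence for success") could be exercised on the poller side roughly as follows. This is a minimal sketch, not code from this change. `classifyDetachLog`, `DetachState`, and `succeeded` are hypothetical names, and it assumes the exit status is the text between `EXIT=` and the closing `###` (possibly empty) and that a status of `0` means success.

```typescript
// Hypothetical helper (not part of gstack): classify a detached-run log.
// Assumption: the sentinel is a line of the form
//   ### gstack-detach EXIT=<status> ###
// where <status> is whatever non-space text follows "EXIT=" (may be empty).
type DetachState =
  | { kind: "running" } // no sentinel yet: silence is never treated as success
  | { kind: "finished"; status: string }; // raw text captured after "EXIT="

// The "m" flag lets ^ and $ match at line boundaries inside the log text.
const SENTINEL = /^### gstack-detach EXIT=(\S*) ###\s*$/m;

function classifyDetachLog(logText: string): DetachState {
  const match = SENTINEL.exec(logText);
  if (match === null) {
    return { kind: "running" };
  }
  return { kind: "finished", status: match[1] ?? "" };
}

// Assumption: a status of "0" is the only success value.
function succeeded(state: DetachState): boolean {
  return state.kind === "finished" && state.status === "0";
}
```

A poller built this way would keep waiting (or let the watchdog fire) while the state is `running`, and report failure for any finished status other than `0`, including an empty one.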
@@ -334,8 +381,8 @@ If you're using [Conductor](https://conductor.build) to run multiple Claude Code

 | Hook | Script | What it does |
 |------|--------|-------------|
-| `setup` | `bin/dev-setup` | Copies `.env` from main worktree, installs deps, symlinks skills, runs `./setup` non-interactively |
-| `archive` | `bin/dev-teardown` | Removes skill symlinks, cleans up `.claude/` directory |
+| `setup` | `bin/dev-setup` | Copies `.env` from main worktree, installs deps, symlinks skills, runs `./setup` non-interactively, and (if gbrain is installed) renders brain-aware blocks into `.claude/gstack-rendered/` without dirtying tracked source |
+| `archive` | `bin/dev-teardown` | Removes skill symlinks, the `.claude/gstack-rendered/` render, and cleans up `.claude/` directory |

 When Conductor creates a new workspace, `bin/dev-setup` runs automatically. It detects the main worktree (via `git worktree list`), copies your `.env` so API keys carry over, and sets up dev mode — no manual steps needed.
diff --git a/README.md b/README.md
index c8b20b308..4bb177c3a 100644
--- a/README.md
+++ b/README.md
@@ -206,6 +206,8 @@ Each skill feeds into the next. `/office-hours` writes a design doc that `/plan-
 | `/autoplan` | **Review Pipeline** | One command, fully reviewed plan. Runs CEO → design → eng review automatically with encoded decision principles. Surfaces only taste decisions for your approval. |
 | `/spec` | **Spec Author** | Turn vague intent into a precise, executable spec in five phases (why, scope, technical with mandatory code-reading, draft, file). Codex quality gate before file (blocks below 7/10), fail-closed secret redaction, dedupe against existing issues, archive to `$GSTACK_STATE_ROOT/projects/$SLUG/specs/` for team-corpus recall. `--execute` spawns `claude -p` in a fresh worktree; `/ship` auto-closes the source issue on merge. Plan-mode aware. |
 | `/learn` | **Memory** | Manage what gstack learned across sessions.
Review, search, prune, and export project-specific patterns, pitfalls, and preferences. Learnings compound across sessions so gstack gets smarter on your codebase over time. |
+| `/make-pdf` | **Publisher** | Markdown in, publication-quality document out. Mermaid and excalidraw fences render as vector diagrams, fully offline. Images scale to the page and never truncate; wide diagrams get their own landscape page. `--to html` emits one self-contained file, `--to docx` a Word doc. |
+| `/diagram` | **Diagram Maker** | English in, editable diagram out. Emits a triplet: mermaid source, `.excalidraw` you can open and edit on excalidraw.com (hand-drawn style), and rendered SVG/PNG. Zero network. Embed the source in markdown and `/make-pdf` renders it. |

 ### Which review should I use?

@@ -429,6 +431,7 @@ Other references: [docs/gbrain-sync.md](docs/gbrain-sync.md) (sync-specific guid
 | Doc | What it covers |
 |-----|---------------|
 | [Skill Deep Dives](docs/skills.md) | Philosophy, examples, and workflow for every skill (includes Greptile integration) |
+| [Diagrams & Document Formats](docs/howto-diagrams-and-formats.md) | Mermaid/excalidraw fences in PDFs, image sizing and safety defaults, `--to html\|docx`, `/diagram` triplets |
 | [Builder Ethos](ETHOS.md) | Builder philosophy: Boil the Ocean, Search Before Building, three layers of knowledge |
 | [Using GBrain with GStack](USING_GBRAIN_WITH_GSTACK.md) | Every path, flag, bin helper, and troubleshooting step for `/setup-gbrain` |
 | [GBrain Sync](docs/gbrain-sync.md) | Cross-machine memory setup, privacy modes, troubleshooting |
diff --git a/SKILL.md b/SKILL.md
index 8711ae7f3..90774950e 100644
--- a/SKILL.md
+++ b/SKILL.md
@@ -48,6 +48,13 @@ echo "REPO_MODE: $REPO_MODE"
 _SESSION_KIND=$(~/.claude/skills/gstack/bin/gstack-session-kind 2>/dev/null || echo "interactive")
 case "$_SESSION_KIND" in spawned|headless|interactive) ;; *) _SESSION_KIND="interactive" ;; esac
 echo "SESSION_KIND: $_SESSION_KIND"
+# Conductor host:
AskUserQuestion is unreliable here (native disabled, MCP
+# variant flaky), so skills render decisions as prose instead of calling the
+# tool. Gated on !headless so an eval/CI run INSIDE Conductor (GSTACK_HEADLESS)
+# still BLOCKs rather than rendering prose to nobody.
+if [ "$_SESSION_KIND" != "headless" ] && { [ -n "${CONDUCTOR_WORKSPACE_PATH:-}" ] || [ -n "${CONDUCTOR_PORT:-}" ]; }; then
+  echo "CONDUCTOR_SESSION: true"
+fi
 _LAKE_SEEN=$([ -f ~/.gstack/.completeness-intro-seen ] && echo "yes" || echo "no")
 echo "LAKE_INTRO: $_LAKE_SEEN"
 _TEL=$(~/.claude/skills/gstack/bin/gstack-config get telemetry 2>/dev/null || true)
diff --git a/TODOS.md b/TODOS.md
index df510e032..52e806af3 100644
--- a/TODOS.md
+++ b/TODOS.md
@@ -2377,3 +2377,41 @@ Pre-existing in `auq-sdk-capture.ts` — affects `skill-e2e-ship-section-loading
 path to the fixture during the run.

 **Effort:** S (human ~3h, CC ~30min). **Depends on:** None.
+
+### P3: Content-hash diagram render cache for make-pdf
+
+**What:** Cache rendered diagram SVG/PNG in `~/.gstack/cache/diagram-render/`,
+keyed on `sha256(fence source + bundle version + render options)`, so repeat
+`make-pdf` runs skip the browse render tab for unchanged diagrams.
+
+**Why:** Every run currently re-renders every fence (~150-300ms each). Docs with
+10+ diagrams pay seconds per iteration during write-preview loops. Codex
+outside-voice flagged the missing cache story during the eng review of the
+diagram engine plan (2026-06-11, D7).
+
+**Context:** The diagram-render bundle ships a `BUILD_INFO.json` with a content
+hash (see `lib/diagram-render/`) — use that as the bundle-version cache key
+component so bundle bumps invalidate cleanly. Invalidation surface is the main
+risk: stale renders after a mermaid theme change must not survive. Only worth
+building once users hit multi-diagram docs; wedge perf is fine without it.
+
+**Effort:** S (human ~1d, CC ~30min).
**Depends on:** diagram engine wedge
+shipping (lib/diagram-render bundle versioning).
+
+### P3: Dedupe the make-pdf e2e gate-test harness
+
+**What:** Five e2e files (`combined-gate`, `emoji-gate`, `diagram-gate`,
+`landscape-gate`, `format-gate`) each hand-roll the same prerequisite probe
+(binary/browse/poppler checks with CI hard-fail vs local skip), mkdtemp/rm
+lifecycle, and child-timeout constants. Extract a shared
+`make-pdf/test/e2e/helpers.ts` (prerequisites(), withWorkDir(), runGenerate()).
+
+**Why:** Review-army maintainability finding on v1.58.0.0 — the boilerplate
+diverges a little more with each new gate (diagram-gate now captures stderr
+via Bun.spawnSync while the others use execFileSync), and a future fix to the
+CI-hard-fail contract has to land five times.
+
+**Context:** Deferred at ship time (D8.2) because it's test-only churn across
+five green files at the tail of a release. Zero user-facing value; pure DRY.
+
+**Effort:** S (human ~3h, CC ~20min). **Depends on:** None.
diff --git a/autoplan/SKILL.md b/autoplan/SKILL.md
index bd372a4c3..49db38ff9 100644
--- a/autoplan/SKILL.md
+++ b/autoplan/SKILL.md
@@ -57,6 +57,13 @@ echo "REPO_MODE: $REPO_MODE"
 _SESSION_KIND=$(~/.claude/skills/gstack/bin/gstack-session-kind 2>/dev/null || echo "interactive")
 case "$_SESSION_KIND" in spawned|headless|interactive) ;; *) _SESSION_KIND="interactive" ;; esac
 echo "SESSION_KIND: $_SESSION_KIND"
+# Conductor host: AskUserQuestion is unreliable here (native disabled, MCP
+# variant flaky), so skills render decisions as prose instead of calling the
+# tool. Gated on !headless so an eval/CI run INSIDE Conductor (GSTACK_HEADLESS)
+# still BLOCKs rather than rendering prose to nobody.
+if [ "$_SESSION_KIND" != "headless" ] && { [ -n "${CONDUCTOR_WORKSPACE_PATH:-}" ] || [ -n "${CONDUCTOR_PORT:-}" ]; }; then
+  echo "CONDUCTOR_SESSION: true"
+fi
 _LAKE_SEEN=$([ -f ~/.gstack/.completeness-intro-seen ] && echo "yes" || echo "no")
 echo "LAKE_INTRO: $_LAKE_SEEN"
 _TEL=$(~/.claude/skills/gstack/bin/gstack-config get telemetry 2>/dev/null || true)
@@ -306,7 +313,9 @@ AI orchestrator (e.g., OpenClaw). In spawned sessions:
 "AskUserQuestion" can resolve to two tools at runtime: the **host MCP variant** (e.g. `mcp__conductor__AskUserQuestion` — appears in your tool list when the host registers it) or the **native** Claude Code tool.

-**Rule:** if any `mcp__*__AskUserQuestion` variant is in your tool list, prefer it. Hosts may disable native AUQ via `--disallowedTools AskUserQuestion` (Conductor does, by default) and route through their MCP variant; calling native there silently fails. Same questions/options shape; same decision-brief format applies.
+**Conductor rule (read before the MCP rule):** if `CONDUCTOR_SESSION: true` was echoed by the preamble, do NOT call AskUserQuestion at all — neither native nor any `mcp__*__AskUserQuestion` variant. Render EVERY decision brief as the **prose form** below and STOP. This is proactive, not a reaction to a failure: Conductor disables native AUQ and its MCP variant is flaky (it returns `[Tool result missing due to internal error]`), so prose is the reliable path. **Auto-decide preferences still apply first:** if a `[plan-tune auto-decide]