From 6a785c57293e507e8f94cb881031c0ccf5a7d013 Mon Sep 17 00:00:00 2001 From: Garry Tan Date: Thu, 16 Apr 2026 13:49:04 -0700 Subject: [PATCH 1/4] fix: ngrok Windows build + close CI error-swallowing gap (v0.18.0.1) (#1024) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * fix(browse): externalize @ngrok/ngrok so Node server bundle builds on Windows @ngrok/ngrok has a native .node addon that causes `bun build --outfile` to fail with "cannot write multiple output files without an output directory". Externalize it alongside the existing runtime deps (playwright, diff, bun:sqlite), matching the exact pattern used for every other dynamic import in server.ts. Adds a policy comment explaining when to extend the externals list so the next native dep doesn't repeat this failure. Two community contributors independently converged on this fix: - @tomasmontbrun-hash (#1019) - @scarson (#1013) Also fixes issues #1010 and #960. Co-Authored-By: Claude Opus 4.7 (1M context) * fix(package.json): subshell cleanup so || true stops masking build/test failures Shell operator precedence trap in both the build and test scripts: cmd1 && cmd2 && ... && rm -f .*.bun-build || true bun test ... && bun run slop:diff 2>/dev/null || true The trailing `|| true` was intended to suppress cleanup errors, but it applies to the entire `&&` chain — so ANY failure (including the build-node-server.sh failure that broke Windows installs since v0.15.12) silently exits 0. CI ran the build, the build failed, and CI reported green. Wrap the cleanup/slop-diff commands in subshells so `|| true` only scopes to the intended step: ... && (rm -f .*.bun-build || true) bun test ... && (bun run slop:diff 2>/dev/null || true) Verified: `bash -c 'false && echo A && rm -f X || true'` exits 0 (old, broken), `bash -c 'false && echo A && (rm -f X || true)'` exits 1 (new, correct). 
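The trap reproduces in isolation. This is a minimal sketch with placeholder commands (not the real build chain) showing why the subshell form propagates the failure:

```shell
#!/usr/bin/env sh
# Old form: `|| true` binds to the entire `&&` chain, so an early
# failure (here, `false` standing in for the build step) is swallowed
# and the script exits 0.
sh -c 'false && echo built && rm -f /tmp/example.bun-build || true'
echo "old form exit: $?"    # 0 — failure masked

# New form: the subshell scopes `|| true` to the cleanup step only,
# so the early failure propagates as the chain's exit status.
sh -c 'false && echo built && (rm -f /tmp/example.bun-build || true)'
echo "new form exit: $?"    # 1 — failure visible
```

POSIX AND-OR lists evaluate left to right with equal precedence, so a trailing `|| true` catches any failure earlier in the chain; the parentheses create a subshell whose compound exit status is what `&&` sees, which is why the fix is a one-character-pair change.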
Co-Authored-By: Claude Opus 4.7 (1M context) * test(browse): add build validation test for server-node.mjs Two assertions: 1. `node --check` passes on the built `server-node.mjs` (valid ES module syntax). This catches regressions where the post-processing steps (perl regex replacements) corrupt the bundle. 2. No inlined `@ngrok/ngrok` module identifiers (ngrok_napi, platform- specific binding packages). Verifies the --external flag actually kept it external. Skips gracefully when `browse/dist/server-node.mjs` is missing — the dist dir is gitignored, so a fresh clone + `bun test` without a prior build is a valid state, not a failure. Co-Authored-By: Claude Opus 4.7 (1M context) * fix(setup): verify @ngrok/ngrok can load on Windows Mirror the existing Playwright verification step. Since @ngrok/ngrok is now externalized in server-node.mjs (resolved at runtime from node_modules), confirm the platform-specific native binary (@ngrok/ngrok-win32-x64-msvc et al.) is installed at setup time rather than surfacing the failure later when the user runs /pair-agent. Same fallback pattern: if `node -e "require('@ngrok/ngrok')"` fails, fall back to `npm install --no-save @ngrok/ngrok` to pull the missing binary. Co-Authored-By: Claude Opus 4.7 (1M context) * chore: bump to v0.18.0.1 for ngrok Windows fix + CI error-propagation Fixes shipped in this version: - Externalize @ngrok/ngrok so the Node server bundle builds on Windows (PRs #1019, #1013; issues #1010, #960) - Shell precedence fix so build/test failures no longer exit 0 in CI - Build validation test for server-node.mjs - Windows setup verifies @ngrok/ngrok native binary is loadable Credit: @tomasmontbrun-hash (#1019), @scarson (#1013). 
Co-Authored-By: Claude Opus 4.7 (1M context) --------- Co-authored-by: Claude Opus 4.7 (1M context) --- CHANGELOG.md | 11 +++++++++++ VERSION | 2 +- browse/scripts/build-node-server.sh | 8 +++++++- browse/test/build.test.ts | 28 ++++++++++++++++++++++++++++ package.json | 6 +++--- setup | 4 ++++ 6 files changed, 54 insertions(+), 5 deletions(-) create mode 100644 browse/test/build.test.ts diff --git a/CHANGELOG.md b/CHANGELOG.md index b078e05f..3cc4f230 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,5 +1,16 @@ # Changelog +## [0.18.0.1] - 2026-04-16 + +### Fixed +- **Windows install no longer fails with a build error.** If you installed gstack on Windows (or a fresh Linux box), `./setup` was dying with `cannot write multiple output files without an output directory`. The Windows-compat Node server bundle now builds cleanly, so `/browse`, `/canary`, `/pair-agent`, `/open-gstack-browser`, `/setup-browser-cookies`, and `/design-review` all work on Windows again. If you were stuck on gstack v0.15.11-era features without knowing it, this is why. Thanks to @tomasmontbrun-hash (#1019) and @scarson (#1013) for independently tracking this down, and to the issue reporters on #1010 and #960. +- **CI stops lying about green builds.** The `build` and `test` scripts in `package.json` had a shell precedence trap where a trailing `|| true` swallowed failures from the *entire* command chain, not just the cleanup step it was meant for. That's how the Windows build bug above shipped in the first place — CI ran the build, the build failed, and CI reported success anyway. Now build and test failures actually fail. Silent CI is the worst kind of CI. +- **`/pair-agent` on Windows surfaces install problems at install time, not tunnel time.** `./setup` now verifies Node can load `@ngrok/ngrok` on Windows, just like it already did for Playwright. If the native binary didn't install, you find out now instead of the first time you try to pair an agent. 
+ +### For contributors +- New `browse/test/build.test.ts` validates `server-node.mjs` is well-formed ES module syntax and that `@ngrok/ngrok` was actually externalized (not inlined). Gracefully skips when no prior build has run. +- Added a policy comment in `browse/scripts/build-node-server.sh` explaining when and why to externalize a dependency. If you add a dep with a native addon or a dynamic `await import()`, the comment tells you where to plug it in. + ## [0.18.0.0] - 2026-04-15 ### Added diff --git a/VERSION b/VERSION index 42b43e04..d6bda5aa 100644 --- a/VERSION +++ b/VERSION @@ -1 +1 @@ -0.18.0.0 +0.18.0.1 diff --git a/browse/scripts/build-node-server.sh b/browse/scripts/build-node-server.sh index 539e391c..3ab652ac 100755 --- a/browse/scripts/build-node-server.sh +++ b/browse/scripts/build-node-server.sh @@ -14,13 +14,19 @@ DIST_DIR="$GSTACK_DIR/browse/dist" echo "Building Node-compatible server bundle..." # Step 1: Transpile server.ts to a single .mjs bundle (externalize runtime deps) +# +# Externalize packages with native addons, dynamic imports, or runtime resolution. +# If you add a new dependency that uses `await import()` or has a .node addon, +# add it here. Otherwise `bun build --outfile` will fail with +# "cannot write multiple output files without an output directory". 
bun build "$SRC_DIR/server.ts" \ --target=node \ --outfile "$DIST_DIR/server-node.mjs" \ --external playwright \ --external playwright-core \ --external diff \ - --external "bun:sqlite" + --external "bun:sqlite" \ + --external "@ngrok/ngrok" # Step 2: Post-process # Replace import.meta.dir with a resolvable reference diff --git a/browse/test/build.test.ts b/browse/test/build.test.ts new file mode 100644 index 00000000..050f3576 --- /dev/null +++ b/browse/test/build.test.ts @@ -0,0 +1,28 @@ +import { describe, test, expect } from 'bun:test'; +import { execSync } from 'child_process'; +import * as fs from 'fs'; +import * as path from 'path'; + +const DIST_DIR = path.resolve(__dirname, '..', 'dist'); +const SERVER_NODE = path.join(DIST_DIR, 'server-node.mjs'); + +describe('build: server-node.mjs', () => { + test('passes node --check if present', () => { + if (!fs.existsSync(SERVER_NODE)) { + // browse/dist is gitignored; no build has run in this checkout. + // Skip rather than fail so plain `bun test` without a prior build passes. + return; + } + expect(() => execSync(`node --check ${SERVER_NODE}`, { stdio: 'pipe' })).not.toThrow(); + }); + + test('does not inline @ngrok/ngrok (must be external)', () => { + if (!fs.existsSync(SERVER_NODE)) return; + const bundle = fs.readFileSync(SERVER_NODE, 'utf-8'); + // Dynamic imports of externalized packages show up as string literals in the bundle, + // not as inlined module code. The heuristic: ngrok's native binding loader would + // reference its own internals. If any ngrok internal identifier appears, the module + // got inlined despite the --external flag. 
+ expect(bundle).not.toMatch(/ngrok_napi|ngrokNapi|@ngrok\/ngrok-darwin|@ngrok\/ngrok-linux|@ngrok\/ngrok-win32/); + }); +}); diff --git a/package.json b/package.json index 09c6bbc0..bbc1a6d1 100644 --- a/package.json +++ b/package.json @@ -1,6 +1,6 @@ { "name": "gstack", - "version": "0.18.0.0", + "version": "0.18.0.1", "description": "Garry's Stack — Claude Code skills + fast headless browser. One repo, one install, entire AI engineering workflow.", "license": "MIT", "type": "module", @@ -8,12 +8,12 @@ "browse": "./browse/dist/browse" }, "scripts": { - "build": "bun run gen:skill-docs --host all; bun build --compile browse/src/cli.ts --outfile browse/dist/browse && bun build --compile browse/src/find-browse.ts --outfile browse/dist/find-browse && bun build --compile design/src/cli.ts --outfile design/dist/design && bun build --compile bin/gstack-global-discover.ts --outfile bin/gstack-global-discover && bash browse/scripts/build-node-server.sh && git rev-parse HEAD > browse/dist/.version && git rev-parse HEAD > design/dist/.version && chmod +x browse/dist/browse browse/dist/find-browse design/dist/design bin/gstack-global-discover && rm -f .*.bun-build || true", + "build": "bun run gen:skill-docs --host all; bun build --compile browse/src/cli.ts --outfile browse/dist/browse && bun build --compile browse/src/find-browse.ts --outfile browse/dist/find-browse && bun build --compile design/src/cli.ts --outfile design/dist/design && bun build --compile bin/gstack-global-discover.ts --outfile bin/gstack-global-discover && bash browse/scripts/build-node-server.sh && git rev-parse HEAD > browse/dist/.version && git rev-parse HEAD > design/dist/.version && chmod +x browse/dist/browse browse/dist/find-browse design/dist/design bin/gstack-global-discover && (rm -f .*.bun-build || true)", "dev:design": "bun run design/src/cli.ts", "gen:skill-docs": "bun run scripts/gen-skill-docs.ts", "dev": "bun run browse/src/cli.ts", "server": "bun run browse/src/server.ts", - "test": "bun 
test browse/test/ test/ --ignore 'test/skill-e2e-*.test.ts' --ignore test/skill-llm-eval.test.ts --ignore test/skill-routing-e2e.test.ts --ignore test/codex-e2e.test.ts --ignore test/gemini-e2e.test.ts && bun run slop:diff 2>/dev/null || true", + "test": "bun test browse/test/ test/ --ignore 'test/skill-e2e-*.test.ts' --ignore test/skill-llm-eval.test.ts --ignore test/skill-routing-e2e.test.ts --ignore test/codex-e2e.test.ts --ignore test/gemini-e2e.test.ts && (bun run slop:diff 2>/dev/null || true)", "test:evals": "EVALS=1 bun test --retry 2 --concurrent --max-concurrency ${EVALS_CONCURRENCY:-15} test/skill-llm-eval.test.ts test/skill-e2e-*.test.ts test/skill-routing-e2e.test.ts test/codex-e2e.test.ts test/gemini-e2e.test.ts", "test:evals:all": "EVALS=1 EVALS_ALL=1 bun test --retry 2 --concurrent --max-concurrency ${EVALS_CONCURRENCY:-15} test/skill-llm-eval.test.ts test/skill-e2e-*.test.ts test/skill-routing-e2e.test.ts test/codex-e2e.test.ts test/gemini-e2e.test.ts", "test:e2e": "EVALS=1 bun test --retry 2 --concurrent --max-concurrency ${EVALS_CONCURRENCY:-15} test/skill-e2e-*.test.ts test/skill-routing-e2e.test.ts test/codex-e2e.test.ts test/gemini-e2e.test.ts", diff --git a/setup b/setup index b00608b8..5b974e23 100755 --- a/setup +++ b/setup @@ -292,6 +292,10 @@ if ! ensure_playwright_browser; then cd "$SOURCE_GSTACK_DIR" # Bun's node_modules already has playwright; verify Node can require it node -e "require('playwright')" 2>/dev/null || npm install --no-save playwright + # @ngrok/ngrok is externalized in server-node.mjs and resolved at runtime. + # Verify the platform-specific native binary is installed so /pair-agent + # tunnels don't fail later with a cryptic module-not-found error. 
+ node -e "require('@ngrok/ngrok')" 2>/dev/null || npm install --no-save @ngrok/ngrok ) fi fi From 0cc830b65f8016fb24fd89b097087e119ba425d6 Mon Sep 17 00:00:00 2001 From: Boyu Liu Date: Fri, 17 Apr 2026 05:49:56 +0800 Subject: [PATCH 2/4] fix: avoid tilde-in-assignment to silence Claude Code permission prompts (#993) Thanks @byliu-labs. Replaces `VAR=~/path` with `VAR="$HOME/path"` in two source-of-truth locations (scripts/resolvers/browse.ts + gstack-upgrade/SKILL.md.tmpl) so Claude Code's sandbox stops asking for permission on every skill invocation. Co-Authored-By: Boyu Liu --- SKILL.md | 2 +- benchmark/SKILL.md | 2 +- browse/SKILL.md | 2 +- canary/SKILL.md | 2 +- design-consultation/SKILL.md | 2 +- design-html/SKILL.md | 2 +- design-review/SKILL.md | 2 +- devex-review/SKILL.md | 2 +- gstack-upgrade/SKILL.md | 2 +- gstack-upgrade/SKILL.md.tmpl | 2 +- land-and-deploy/SKILL.md | 2 +- office-hours/SKILL.md | 2 +- open-gstack-browser/SKILL.md | 2 +- pair-agent/SKILL.md | 2 +- qa-only/SKILL.md | 2 +- qa/SKILL.md | 2 +- scripts/resolvers/browse.ts | 2 +- setup-browser-cookies/SKILL.md | 2 +- 18 files changed, 18 insertions(+), 18 deletions(-) diff --git a/SKILL.md b/SKILL.md index edd41954..70d576cd 100644 --- a/SKILL.md +++ b/SKILL.md @@ -473,7 +473,7 @@ Auto-shuts down after 30 min idle. State persists between calls (cookies, tabs, _ROOT=$(git rev-parse --show-toplevel 2>/dev/null) B="" [ -n "$_ROOT" ] && [ -x "$_ROOT/.claude/skills/gstack/browse/dist/browse" ] && B="$_ROOT/.claude/skills/gstack/browse/dist/browse" -[ -z "$B" ] && B=~/.claude/skills/gstack/browse/dist/browse +[ -z "$B" ] && B="$HOME/.claude/skills/gstack/browse/dist/browse" if [ -x "$B" ]; then echo "READY: $B" else diff --git a/benchmark/SKILL.md b/benchmark/SKILL.md index efb0ae7d..b7d5a3b5 100644 --- a/benchmark/SKILL.md +++ b/benchmark/SKILL.md @@ -435,7 +435,7 @@ plan's living status. 
_ROOT=$(git rev-parse --show-toplevel 2>/dev/null) B="" [ -n "$_ROOT" ] && [ -x "$_ROOT/.claude/skills/gstack/browse/dist/browse" ] && B="$_ROOT/.claude/skills/gstack/browse/dist/browse" -[ -z "$B" ] && B=~/.claude/skills/gstack/browse/dist/browse +[ -z "$B" ] && B="$HOME/.claude/skills/gstack/browse/dist/browse" if [ -x "$B" ]; then echo "READY: $B" else diff --git a/browse/SKILL.md b/browse/SKILL.md index 47519f9b..c0bcb353 100644 --- a/browse/SKILL.md +++ b/browse/SKILL.md @@ -439,7 +439,7 @@ State persists between calls (cookies, tabs, login sessions). _ROOT=$(git rev-parse --show-toplevel 2>/dev/null) B="" [ -n "$_ROOT" ] && [ -x "$_ROOT/.claude/skills/gstack/browse/dist/browse" ] && B="$_ROOT/.claude/skills/gstack/browse/dist/browse" -[ -z "$B" ] && B=~/.claude/skills/gstack/browse/dist/browse +[ -z "$B" ] && B="$HOME/.claude/skills/gstack/browse/dist/browse" if [ -x "$B" ]; then echo "READY: $B" else diff --git a/canary/SKILL.md b/canary/SKILL.md index 5a42ab11..d2535d8f 100644 --- a/canary/SKILL.md +++ b/canary/SKILL.md @@ -557,7 +557,7 @@ plan's living status. 
_ROOT=$(git rev-parse --show-toplevel 2>/dev/null) B="" [ -n "$_ROOT" ] && [ -x "$_ROOT/.claude/skills/gstack/browse/dist/browse" ] && B="$_ROOT/.claude/skills/gstack/browse/dist/browse" -[ -z "$B" ] && B=~/.claude/skills/gstack/browse/dist/browse +[ -z "$B" ] && B="$HOME/.claude/skills/gstack/browse/dist/browse" if [ -x "$B" ]; then echo "READY: $B" else diff --git a/design-consultation/SKILL.md b/design-consultation/SKILL.md index 4bb1b015..36d89123 100644 --- a/design-consultation/SKILL.md +++ b/design-consultation/SKILL.md @@ -622,7 +622,7 @@ If the codebase is empty and purpose is unclear, say: *"I don't have a clear pic _ROOT=$(git rev-parse --show-toplevel 2>/dev/null) B="" [ -n "$_ROOT" ] && [ -x "$_ROOT/.claude/skills/gstack/browse/dist/browse" ] && B="$_ROOT/.claude/skills/gstack/browse/dist/browse" -[ -z "$B" ] && B=~/.claude/skills/gstack/browse/dist/browse +[ -z "$B" ] && B="$HOME/.claude/skills/gstack/browse/dist/browse" if [ -x "$B" ]; then echo "READY: $B" else diff --git a/design-html/SKILL.md b/design-html/SKILL.md index c9e75ba9..ea73c852 100644 --- a/design-html/SKILL.md +++ b/design-html/SKILL.md @@ -699,7 +699,7 @@ else a few taps away with an obvious path to get there. 
_ROOT=$(git rev-parse --show-toplevel 2>/dev/null) B="" [ -n "$_ROOT" ] && [ -x "$_ROOT/.claude/skills/gstack/browse/dist/browse" ] && B="$_ROOT/.claude/skills/gstack/browse/dist/browse" -[ -z "$B" ] && B=~/.claude/skills/gstack/browse/dist/browse +[ -z "$B" ] && B="$HOME/.claude/skills/gstack/browse/dist/browse" if [ -x "$B" ]; then echo "READY: $B" else diff --git a/design-review/SKILL.md b/design-review/SKILL.md index 19c7f752..f2c136f9 100644 --- a/design-review/SKILL.md +++ b/design-review/SKILL.md @@ -631,7 +631,7 @@ After the user chooses, execute their choice (commit or stash), then continue wi _ROOT=$(git rev-parse --show-toplevel 2>/dev/null) B="" [ -n "$_ROOT" ] && [ -x "$_ROOT/.claude/skills/gstack/browse/dist/browse" ] && B="$_ROOT/.claude/skills/gstack/browse/dist/browse" -[ -z "$B" ] && B=~/.claude/skills/gstack/browse/dist/browse +[ -z "$B" ] && B="$HOME/.claude/skills/gstack/browse/dist/browse" if [ -x "$B" ]; then echo "READY: $B" else diff --git a/devex-review/SKILL.md b/devex-review/SKILL.md index e93a7866..8978872d 100644 --- a/devex-review/SKILL.md +++ b/devex-review/SKILL.md @@ -619,7 +619,7 @@ branch name wherever the instructions say "the base branch" or ``. _ROOT=$(git rev-parse --show-toplevel 2>/dev/null) B="" [ -n "$_ROOT" ] && [ -x "$_ROOT/.claude/skills/gstack/browse/dist/browse" ] && B="$_ROOT/.claude/skills/gstack/browse/dist/browse" -[ -z "$B" ] && B=~/.claude/skills/gstack/browse/dist/browse +[ -z "$B" ] && B="$HOME/.claude/skills/gstack/browse/dist/browse" if [ -x "$B" ]; then echo "READY: $B" else diff --git a/gstack-upgrade/SKILL.md b/gstack-upgrade/SKILL.md index 99a820d1..81bb1228 100644 --- a/gstack-upgrade/SKILL.md +++ b/gstack-upgrade/SKILL.md @@ -53,7 +53,7 @@ Tell user: "Auto-upgrade enabled. Future updates will install automatically." Th **If "Not now":** Write snooze state with escalating backoff (first snooze = 24h, second = 48h, third+ = 1 week), then continue with the current skill. Do not mention the upgrade again. 
```bash -_SNOOZE_FILE=~/.gstack/update-snoozed +_SNOOZE_FILE="$HOME/.gstack/update-snoozed" _REMOTE_VER="{new}" _CUR_LEVEL=0 if [ -f "$_SNOOZE_FILE" ]; then diff --git a/gstack-upgrade/SKILL.md.tmpl b/gstack-upgrade/SKILL.md.tmpl index 19f3a0d5..5402a1da 100644 --- a/gstack-upgrade/SKILL.md.tmpl +++ b/gstack-upgrade/SKILL.md.tmpl @@ -55,7 +55,7 @@ Tell user: "Auto-upgrade enabled. Future updates will install automatically." Th **If "Not now":** Write snooze state with escalating backoff (first snooze = 24h, second = 48h, third+ = 1 week), then continue with the current skill. Do not mention the upgrade again. ```bash -_SNOOZE_FILE=~/.gstack/update-snoozed +_SNOOZE_FILE="$HOME/.gstack/update-snoozed" _REMOTE_VER="{new}" _CUR_LEVEL=0 if [ -f "$_SNOOZE_FILE" ]; then diff --git a/land-and-deploy/SKILL.md b/land-and-deploy/SKILL.md index 4661fab7..5415179d 100644 --- a/land-and-deploy/SKILL.md +++ b/land-and-deploy/SKILL.md @@ -574,7 +574,7 @@ plan's living status. _ROOT=$(git rev-parse --show-toplevel 2>/dev/null) B="" [ -n "$_ROOT" ] && [ -x "$_ROOT/.claude/skills/gstack/browse/dist/browse" ] && B="$_ROOT/.claude/skills/gstack/browse/dist/browse" -[ -z "$B" ] && B=~/.claude/skills/gstack/browse/dist/browse +[ -z "$B" ] && B="$HOME/.claude/skills/gstack/browse/dist/browse" if [ -x "$B" ]; then echo "READY: $B" else diff --git a/office-hours/SKILL.md b/office-hours/SKILL.md index 50ad2740..0c31095f 100644 --- a/office-hours/SKILL.md +++ b/office-hours/SKILL.md @@ -585,7 +585,7 @@ plan's living status. 
_ROOT=$(git rev-parse --show-toplevel 2>/dev/null) B="" [ -n "$_ROOT" ] && [ -x "$_ROOT/.claude/skills/gstack/browse/dist/browse" ] && B="$_ROOT/.claude/skills/gstack/browse/dist/browse" -[ -z "$B" ] && B=~/.claude/skills/gstack/browse/dist/browse +[ -z "$B" ] && B="$HOME/.claude/skills/gstack/browse/dist/browse" if [ -x "$B" ]; then echo "READY: $B" else diff --git a/open-gstack-browser/SKILL.md b/open-gstack-browser/SKILL.md index 1f134137..0ec96ac5 100644 --- a/open-gstack-browser/SKILL.md +++ b/open-gstack-browser/SKILL.md @@ -579,7 +579,7 @@ anti-bot stealth, and custom branding. You see every action in real time. _ROOT=$(git rev-parse --show-toplevel 2>/dev/null) B="" [ -n "$_ROOT" ] && [ -x "$_ROOT/.claude/skills/gstack/browse/dist/browse" ] && B="$_ROOT/.claude/skills/gstack/browse/dist/browse" -[ -z "$B" ] && B=~/.claude/skills/gstack/browse/dist/browse +[ -z "$B" ] && B="$HOME/.claude/skills/gstack/browse/dist/browse" if [ -x "$B" ]; then echo "READY: $B" else diff --git a/pair-agent/SKILL.md b/pair-agent/SKILL.md index 5787693b..33403034 100644 --- a/pair-agent/SKILL.md +++ b/pair-agent/SKILL.md @@ -598,7 +598,7 @@ The skill will tell you if one is needed and how to set it up. _ROOT=$(git rev-parse --show-toplevel 2>/dev/null) B="" [ -n "$_ROOT" ] && [ -x "$_ROOT/.claude/skills/gstack/browse/dist/browse" ] && B="$_ROOT/.claude/skills/gstack/browse/dist/browse" -[ -z "$B" ] && B=~/.claude/skills/gstack/browse/dist/browse +[ -z "$B" ] && B="$HOME/.claude/skills/gstack/browse/dist/browse" if [ -x "$B" ]; then echo "READY: $B" else diff --git a/qa-only/SKILL.md b/qa-only/SKILL.md index ec8a28d5..8e57eced 100644 --- a/qa-only/SKILL.md +++ b/qa-only/SKILL.md @@ -596,7 +596,7 @@ You are a QA engineer. 
Test web applications like a real user — click everythi _ROOT=$(git rev-parse --show-toplevel 2>/dev/null) B="" [ -n "$_ROOT" ] && [ -x "$_ROOT/.claude/skills/gstack/browse/dist/browse" ] && B="$_ROOT/.claude/skills/gstack/browse/dist/browse" -[ -z "$B" ] && B=~/.claude/skills/gstack/browse/dist/browse +[ -z "$B" ] && B="$HOME/.claude/skills/gstack/browse/dist/browse" if [ -x "$B" ]; then echo "READY: $B" else diff --git a/qa/SKILL.md b/qa/SKILL.md index db9711fb..3a04bd78 100644 --- a/qa/SKILL.md +++ b/qa/SKILL.md @@ -673,7 +673,7 @@ After the user chooses, execute their choice (commit or stash), then continue wi _ROOT=$(git rev-parse --show-toplevel 2>/dev/null) B="" [ -n "$_ROOT" ] && [ -x "$_ROOT/.claude/skills/gstack/browse/dist/browse" ] && B="$_ROOT/.claude/skills/gstack/browse/dist/browse" -[ -z "$B" ] && B=~/.claude/skills/gstack/browse/dist/browse +[ -z "$B" ] && B="$HOME/.claude/skills/gstack/browse/dist/browse" if [ -x "$B" ]; then echo "READY: $B" else diff --git a/scripts/resolvers/browse.ts b/scripts/resolvers/browse.ts index ef7e9485..a0ae37a7 100644 --- a/scripts/resolvers/browse.ts +++ b/scripts/resolvers/browse.ts @@ -106,7 +106,7 @@ export function generateBrowseSetup(ctx: TemplateContext): string { _ROOT=$(git rev-parse --show-toplevel 2>/dev/null) B="" [ -n "$_ROOT" ] && [ -x "$_ROOT/${ctx.paths.localSkillRoot}/browse/dist/browse" ] && B="$_ROOT/${ctx.paths.localSkillRoot}/browse/dist/browse" -[ -z "$B" ] && B=${ctx.paths.browseDir}/browse +[ -z "$B" ] && B="$HOME${ctx.paths.browseDir.replace(/^~/, '')}/browse" if [ -x "$B" ]; then echo "READY: $B" else diff --git a/setup-browser-cookies/SKILL.md b/setup-browser-cookies/SKILL.md index 846b4377..5b228986 100644 --- a/setup-browser-cookies/SKILL.md +++ b/setup-browser-cookies/SKILL.md @@ -454,7 +454,7 @@ If `CDP_MODE=true`: tell the user "Not needed — you're connected to your real _ROOT=$(git rev-parse --show-toplevel 2>/dev/null) B="" [ -n "$_ROOT" ] && [ -x 
"$_ROOT/.claude/skills/gstack/browse/dist/browse" ] && B="$_ROOT/.claude/skills/gstack/browse/dist/browse" -[ -z "$B" ] && B=~/.claude/skills/gstack/browse/dist/browse +[ -z "$B" ] && B="$HOME/.claude/skills/gstack/browse/dist/browse" if [ -x "$B" ]; then echo "READY: $B" else From cc42f14a589e173d64d93ece20b73155a6b0df2d Mon Sep 17 00:00:00 2001 From: Garry Tan Date: Thu, 16 Apr 2026 15:04:26 -0700 Subject: [PATCH 3/4] docs: gstack compact design doc (tabled pending Anthropic API) (#1027) Preserves the full architecture, 15 locked eng-review decisions, B-series benchmark spec, codex review findings, and research that confirmed Claude Code's PostToolUse cannot replace non-MCP tool output today. Tracks anthropics/claude-code#36843 for the unblocking API. Co-authored-by: Claude Opus 4.7 --- docs/designs/GCOMPACTION.md | 831 ++++++++++++++++++++++++++++++++++++ 1 file changed, 831 insertions(+) create mode 100644 docs/designs/GCOMPACTION.md diff --git a/docs/designs/GCOMPACTION.md b/docs/designs/GCOMPACTION.md new file mode 100644 index 00000000..3937eccf --- /dev/null +++ b/docs/designs/GCOMPACTION.md @@ -0,0 +1,831 @@ +# GCOMPACTION.md — Design & Architecture (TABLED) + +**Target path on approval:** `docs/designs/GCOMPACTION.md` + +This is the preserved design artifact for `gstack compact`. Everything above the first `---` divider below gets extracted verbatim to `docs/designs/GCOMPACTION.md` on plan approval. Everything after that divider is archived research (office hours + competitive deep-dive + eng-review notes + codex review + research findings) that informed the design. + +--- + +## Status: TABLED (2026-04-17) — pending Anthropic `updatedBuiltinToolOutput` API + +**Why tabled.** The v1 architecture assumed a Claude Code `PostToolUse` hook could REPLACE the tool output that enters the model's context for built-in tools (Bash, Read, Grep, Glob, WebFetch). Research on 2026-04-17 confirmed this is not possible today. + +**Evidence:** + +1. 
**Official docs** (https://code.claude.com/docs/en/hooks): The only output-replace field documented for `PostToolUse` is `hookSpecificOutput.updatedMCPToolOutput`, and the docs explicitly state: *"For MCP tools only: replaces the tool's output with the provided value."* No equivalent field exists for built-in tools. +2. **Anthropic issue [#36843](https://github.com/anthropics/claude-code/issues/36843)** (OPEN): Anthropic themselves acknowledge the gap. *"PostToolUse hooks can replace MCP tool output via `updatedMCPToolOutput`, but there is no equivalent for built-in tools (WebFetch, WebSearch, Bash, Read, etc.)... They can only add warnings via `decision: block` (which injects a reason string) or `additionalContext`. The original malicious content still reaches the model."* +3. **RTK mechanism** (source-reviewed at `src/hooks/init.rs:906-912` and `hooks/claude/rtk-rewrite.sh:83-100`): RTK is NOT a PostToolUse compactor. It's a **PreToolUse** Bash matcher that rewrites `tool_input.command` (e.g., `git status` → `rtk git status`). The wrapped command produces compact stdout itself. RTK README confirms: *"the hook only runs on Bash tool calls. Claude Code built-in tools like Read, Grep, and Glob do not pass through the Bash hook, so they are not auto-rewritten."* RTK is Bash-only by architectural constraint, not by choice. +4. **tokenjuice mechanism** (source-reviewed at `src/core/claude-code.ts:160, 491, 540-549`): tokenjuice DOES register `PostToolUse` with `matcher: "Bash"` but has no real output-replace API available — it hijacks `decision: "block"` + `reason` to inject compacted text. Whether this actually reduces model-context tokens or just overlays UI output is disputed. tokenjuice is also Bash-only. +5. **Read/Grep/Glob execute in-process inside Claude Code** and bypass hooks entirely. Wedge (ii) "native-tool coverage" was architecturally impossible from day one regardless of replacement API. 
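+
+For concreteness, the only documented output-replacement shape — MCP tools only, per the hooks docs quoted in item 1 — looks like the sketch below. Field values are illustrative; no built-in-tool equivalent of `updatedMCPToolOutput` exists as of this writing.
+
+```json
+{
+  "hookSpecificOutput": {
+    "hookEventName": "PostToolUse",
+    "updatedMCPToolOutput": "compacted text the model sees instead of the raw tool result"
+  }
+}
+```
+
+A hypothetical `updatedBuiltinToolOutput` field with the same contract is exactly what issue #36843 asks for and what un-tabling this design requires.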
+ +**Consequence.** Both wedges are dead in their original form: +- Wedge (i) "Conditional LLM verifier" — still technically possible, but only for Bash output, via PreToolUse command wrapping (RTK's mechanism). The verifier stops being a differentiator once we're also Bash-only. +- Wedge (ii) "Native-tool coverage" — impossible today. Read/Grep/Glob don't fire hooks. Even if they did, no output-replace field exists. + +**Decision.** Shelve `gstack compact` entirely. Track Anthropic issue #36843 for the arrival of `updatedBuiltinToolOutput` (or equivalent). When that API ships, this design doc + the 15 locked decisions below + the research archive at the bottom become the unblocking artifacts for a fresh implementation sprint. + +**If un-tabling:** Start from the "Decisions locked during plan-eng-review" block below — most remain valid. Then re-verify the hooks reference against the newly-shipped API, update the Architecture data-flow diagram to use whatever real output-replacement field exists, and re-run `/codex review` against the revised plan before coding. + +**What we're NOT doing:** +- Not shipping a Bash-only PreToolUse wrapper. That's RTK's product; they're at 28K stars and 3 years of rule scars. No wedge. +- Not shipping the `decision: block` + `reason` hack. Undocumented behavior, Anthropic could break it, and the model may still see the raw output alongside the compacted overlay — context savings are disputed. +- Not shipping B-series benchmark in isolation. Without a working compactor, there's nothing to benchmark. + +**Cost of tabling:** ~0. No code was written. The design doc + research + decisions remain as a ready-to-unblock artifact. + +--- + +## Decisions locked during plan-eng-review (2026-04-17) + +Preserved for the un-tabling sprint if/when Anthropic ships the built-in-tool output-replace API. + +Summary of every decision made during the engineering review. 
Full rationale is preserved throughout the sections below; this block is the single source of truth if anything else drifts. + +**Scope (Section 0):** +1. **Claude-first v1.** Ship compact + rules + verifier on Claude Code only. Codex + OpenClaw land at v1.1 after the wedge is proven on the primary host. Cuts ~2 days of host integration and derisks launch. The original "wedge (ii) native-tool coverage" claim applies to Claude Code at v1; we make no cross-host claim until v1.1. +2. **13-rule launch library.** v1 ships tests (jest/vitest/pytest/cargo-test/go-test/rspec) + git (diff/log/status) + install (npm/pnpm/pip/cargo). Build/lint/log families defer to v1.1, driven by `gstack compact discover` telemetry from real users. +3. **Verifier default ON at v1.0.** `failureCompaction` trigger (exit≠0 AND >50% reduction) is enabled out of the box. The verifier IS the wedge — defaulting it off hides the differentiating feature. Trigger bounds already keep expected fire rate ≤10% of tool calls. + +**Architecture (Section 1):** +4. **Exact line-match sanitization for Haiku output.** Split raw output by `\n`, put lines in a set, only append lines from Haiku that appear verbatim in that set. Tightest adversarial contract; prompt-injection attempts cannot slip in novel text. +5. **Layered failureCompaction signal.** Prefer `exitCode` from the envelope; if the host omits it, fall back to `/FAIL|Error|Traceback|panic/` regex on the output. Log which signal fired in `meta.failureSignal` ("exit" | "pattern" | "none"). Pre-implementation task #1 still verifies Claude Code's envelope empirically, but the system no longer breaks if it doesn't. +6. **Deep-merge rule resolution.** User/project rules inherit built-in fields they don't override. Escape hatch: `"extends": null` in a rule file triggers full replacement semantics. Matches the mental model of eslint/tsconfig/.gitignore — override a piece without losing the rest. + +**Code quality (Section 2):** +7. 
**Per-rule regex timeout, no RE2 dep.** Run each rule's regex via a 50ms AbortSignal budget; on timeout, skip the rule and record `meta.regexTimedOut: [ruleId]`. Avoids a WASM dependency and keeps rule-author syntax unconstrained. +8. **Pre-compiled rule bundle.** `gstack compact install` and `gstack compact reload` produce `~/.gstack/compact/rules.bundle.json` (deep-merged, regex-compiled metadata cached). Hook reads that single file instead of parsing N source files. +9. **Auto-reload on mtime drift.** Hook stats rule source files on startup; if any source file is newer than the bundle, rebuild in-line before applying. Adds ~0.5ms/invocation but eliminates the "I edited a rule and nothing changed" footgun. +10. **Expanded v1 redaction set.** Tee files redact: AWS keys, GitHub tokens (`ghp_/gho_/ghs_/ghu_`), GitLab tokens (`glpat-`), Slack webhooks, generic JWT (three base64 segments), generic bearer tokens, SSH private-key headers (`-----BEGIN * PRIVATE KEY-----`). Credit cards / SSNs / per-key env-pairs deferred to a full DLP layer in v2. + +**Testing (Section 3):** +11. **P-series gate subset.** v1 gate-tier P-tests: P1 (binary garbage), P3 (empty output), P6 (RTK-killer critical stack frame), P8 (secrets to tee), P15 (hook timeout), P18 (prompt injection), P26 (malformed user rule JSON), P28 (regex DoS), P30 (Haiku hallucination). Remaining 21 P-cases grow R-series as real bugs hit. +12. **Fixture version-stamping.** Every golden fixture has a `toolVersion:` frontmatter. CI warns when fixture toolVersion ≠ currently installed. No more calendar-based rotation. +13. **B-series real-world benchmark testbench (hard v1 gate).** New component `compact/benchmark/` scans `~/.claude/projects/**/*.jsonl`, ranks the noisiest tool calls, clusters them into named scenarios, replays the compactor against them, and reports reduction-by-rule-family. 
v1 cannot ship until B-series on the author's own 30-day corpus shows ≥15% reduction AND zero critical-line loss on planted bugs. Local-only; never uploads. Community-shared corpus is v2. + +**Performance (Section 4):** +14. **Revised latency budgets.** Bun cold-start on macOS ARM is 15-25ms; the original 10ms p50 target was unrealistic. New budgets: <30ms p50 / <80ms p99 on macOS ARM, <20ms p50 / <60ms p99 on Linux (verifier off). Verifier-fires budget stays <600ms p50 / <2s p99. Daemon mode is a v2 option gated on B-series showing cold-start hurts session savings. +15. **Line-oriented streaming pipeline.** Readline over stdin → filter → group → dedupe → ring-buffered tail truncation → stdout. Any single line >1MB hits P9 (truncate to 1KB with `[... truncated ...]` marker). Caps memory at 64MB regardless of total output size. + +Every row above is a `MUST` in the implementation. Drift requires a new eng-review. + +--- + +## Summary + +`gstack compact` was designed as a `PostToolUse` hook that reduces tool-output noise before it reaches an AI coding agent's context window. Deterministic JSON rules would shrink noisy test runners, build logs, git diffs, and package installs. A conditional Claude Haiku verifier would act as a safety net when over-compaction risk was high. + +**Current status: TABLED.** See "Status" section above. The architecture depends on a Claude Code API (`updatedBuiltinToolOutput` or equivalent for built-in tools) that does not exist as of 2026-04-17. Anthropic issue #36843 tracks the gap. + +**Intended goal (preserved for the un-tabling sprint):** 15–30% tool-output token reduction per long session, with zero increase in task-failure rate. + +**Original wedge (vs RTK, the 28K-star incumbent) — both invalidated by research:** +1. ~~**Conditional LLM verifier.**~~ Still technically viable via PreToolUse command wrapping, but only for Bash. Stops being a differentiator once we're Bash-only. Reconsider if the built-in-tool API arrives. +2. 
~~**Native-tool coverage.**~~ Architecturally impossible today. Read/Grep/Glob execute in-process inside Claude Code and do not fire hooks. Even for tools that do fire `PostToolUse`, no output-replacement field exists for non-MCP tools. + +**Original positioning (now moot):** *"RTK is fast. gstack compact is fast AND safe, and it covers every tool in your toolbox, not just Bash."* + +## Non-goals + +- Summarizing user messages or prior agent turns (Claude's own Compaction API owns that). +- Compressing agent response output (caveman's layer). +- Caching tool calls to avoid re-execution (token-optimizer-mcp's layer). +- Acting as a general-purpose log analyzer. +- Replacing the agent's own judgement about when to re-run a command with `GSTACK_RAW=1`. + +## Why this is worth building + +**Problem is measured, not hypothetical.** + +- [Chroma research (2025)](https://research.trychroma.com/context-rot) tested 18 frontier models. Every model degrades as context grows. Rot starts well before the window limit — a 200K model rots at 50K. +- Coding agents are the worst case: accumulative context + high distractor density + long task horizon. Tool output is explicitly named as a primary noise source. +- The market has voted: Anthropic shipped Opus 4.6 Compaction API; OpenAI shipped a compaction guide; Google ADK shipped context compression; LangChain shipped autonomous compression; sst/opencode has built-in compaction. The hybrid deterministic + LLM pattern is industry consensus. 
+ +**Existing field (what gstack compact joins and differentiates from):** + +| Project | Stars | License | Layer | Threat | Note | +|---------|-------|---------|-------|--------|------| +| **RTK (rtk-ai/rtk)** | **28K** | Apache-2.0 | Tool output | Primary benchmark | Pure Rust, Bash-only, zero LLM | +| caveman | 34.8K | MIT | Output tokens | Different axis | Terse system prompt; pairs WITH us | +| claude-token-efficient | 4.3K | MIT | Response verbosity | Different axis | Single CLAUDE.md | +| token-optimizer-mcp | 49 | MIT | MCP caching | Different axis | Prevents calls rather than compresses output | +| tokenjuice | ~12 | MIT | Tool output | Too new | 2 days old; inspired our JSON envelope | +| 6-Layer Token Savings Stack | — | Public gist | Recipe | Zero | Documentation; validates stacked compaction thesis | + +RTK is the only direct competitor. Everything else compresses a different token source. + +**License compatibility:** Every referenced project is permissive-licensed (MIT or Apache-2.0) and compatible with gstack's MIT license. No AGPL, GPL, or other copyleft dependencies. See the "License & attribution" section below for the clean-room policy. + +## Architecture + +### Data flow + +``` +┌─────────────────────────────────────────────────────────────────┐ +│ Host (Claude Code / Codex / OpenClaw) │ +│ ───────────────────────────────────────── │ +│ 1. Agent requests tool call: Bash|Read|Grep|Glob|MCP │ +│ 2. Host executes tool │ +│ 3. Host invokes PostToolUse hook with: {tool, input, output} │ +└────────────────────┬────────────────────────────────────────────┘ + │ stdin (JSON envelope) + ▼ +┌─────────────────────────────────────────────────────────────────┐ +│ gstack-compact hook binary │ +│ ─────────────────────────── │ +│ a. Parse envelope │ +│ b. Match rule by (tool, command, pattern) │ +│ c. Apply rule primitives: filter / group / truncate / dedupe │ +│ d. Record reduction metadata │ +│ e. Evaluate verifier triggers │ +│ f. 
If trigger met: call Haiku, append preserved lines      │
+│   g. On failure exit code: tee raw to ~/.gstack/compact/tee/...  │
+│   h. Emit JSON envelope to stdout                                │
+└────────────────────┬────────────────────────────────────────────┘
+                     │ stdout (JSON envelope)
+                     ▼
+        Host substitutes compacted output into agent context
+```
+
+### Rule resolution
+
+Three-tier hierarchy (listed lowest to highest precedence; project beats user beats built-in), same pattern as tokenjuice and gstack's existing host-config-export model:
+
+1. Built-in rules: `compact/rules/` shipped with gstack
+2. User rules: `~/.config/gstack/compact-rules/`
+3. Project rules: `.gstack/compact-rules/`
+
+Rules match tool calls by rule ID. Resolution is a deep merge (decision #6 in the Status block): a project rule with ID `tests/jest` inherits every built-in `tests/jest` field it does not override, so a rule author can adjust one primitive without restating the rest. Setting `"extends": null` in a rule file opts out of inheritance and replaces the built-in rule entirely.
+
+### JSON envelope contract (adopted from tokenjuice)
+
+Input:
+```json
+{
+  "tool": "Bash",
+  "command": "bun test test/billing.test.ts",
+  "argv": ["bun", "test", "test/billing.test.ts"],
+  "combinedText": "...",
+  "exitCode": 1,
+  "cwd": "/Users/garry/proj",
+  "host": "claude-code"
+}
+```
+
+Output:
+```json
+{
+  "reduced": "compacted output with [gstack-compact: N → M lines, rule: X] header",
+  "meta": {
+    "rule": "tests/jest",
+    "linesBefore": 247,
+    "linesAfter": 18,
+    "bytesBefore": 18234,
+    "bytesAfter": 892,
+    "verifierFired": false,
+    "teeFile": null,
+    "durationMs": 8
+  }
+}
+```
+
+### Rule schema
+
+Compact, minimal. Total rules-payload must stay <5KB on disk (lesson from claude-token-efficient: rule files themselves consume tokens on every session).
+ +```json +{ + "id": "tests/jest", + "family": "test-results", + "description": "Jest/Vitest output — preserve failures and summary counts", + "match": { + "tools": ["Bash"], + "commands": ["jest", "vitest", "bun test"], + "patterns": ["jest", "vitest", "PASS", "FAIL"] + }, + "primitives": { + "filter": { + "strip": ["\\x1b\\[[0-9;]*m", "^\\s*at .+node_modules"], + "keep": ["FAIL", "PASS", "Error:", "Expected:", "Received:", "✓", "✗", "Tests:"] + }, + "group": { + "by": "error-kind", + "header": "Errors grouped by type:" + }, + "truncate": { + "headLines": 5, + "tailLines": 15, + "onFailure": { "headLines": 20, "tailLines": 30 } + }, + "dedupe": { + "pattern": "^\\s*$", + "format": "[... {count} blank lines ...]" + } + }, + "tee": { + "onExit": "nonzero", + "maxBytes": 1048576 + }, + "counters": [ + { "name": "failed", "pattern": "^FAIL\\s", "flags": "m" }, + { "name": "passed", "pattern": "^PASS\\s", "flags": "m" } + ] +} +``` + +The four primitives — `filter`, `group`, `truncate`, `dedupe` — are lifted directly from RTK's technique taxonomy (the only thing every serious compactor needs to handle). Any rule can combine any subset of the four; omitted primitives are no-ops. + +### Verifier layer (tiered, opt-in) + +The verifier is a cheap Haiku call that fires only under specific triggers. Never on every tool call. + +**Trigger matrix (user-configurable):** + +| Trigger | Default | Condition | +|---------|---------|-----------| +| `failureCompaction` | **ON** | exit code ≠ 0 AND reduction >50% (diagnosis at risk) | +| `aggressiveReduction` | off | reduction >80% AND original >200 lines | +| `largeNoMatch` | off | no rule matched AND output >500 lines | +| `userOptIn` | on (env-gated) | `GSTACK_COMPACT_VERIFY=1` forces verifier for that call | + +Default config ships with `failureCompaction` only — the highest-leverage case (agent is debugging; rule may have filtered the critical stack frame). 
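The default-on trigger and the exact-line sanitization contract (decision #4) can be sketched in a few lines. This is illustrative only; the interface and helper names are hypothetical, not the shipped `verifier.ts`/`sanitize.ts` API:

```typescript
// Illustrative sketch of the failureCompaction trigger and the
// exact-line sanitization contract. Names are hypothetical.
interface ReductionMeta {
  exitCode: number | null; // null when the host omits it (see decision #5)
  linesBefore: number;
  linesAfter: number;
}

// failureCompaction: exit code != 0 AND reduction > 50%.
function failureCompactionFires(meta: ReductionMeta, minReductionPct = 50): boolean {
  if (meta.exitCode === null || meta.exitCode === 0) return false;
  const reductionPct = 100 * (1 - meta.linesAfter / Math.max(meta.linesBefore, 1));
  return reductionPct > minReductionPct;
}

// Exact line-match sanitization: only lines that appear verbatim in the
// raw output may be appended from the Haiku reply, so injected or
// hallucinated text cannot survive.
function sanitizeVerifierLines(raw: string, haikuReply: string): string[] {
  const rawLines = new Set(raw.split("\n"));
  return haikuReply.split("\n").filter((l) => l.length > 0 && rawLines.has(l));
}
```

With these semantics, V1/V2/V3 from the test plan fall out directly: 200 lines reduced to 5 at exit 1 fires, 10 reduced to 9 does not, and any exit-0 run stays silent under the default config.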
+
+**Haiku's job (bounded):**
+
+```
+Here is raw output (truncated to first 2000 lines) and a compacted version.
+Return any important lines from the raw that are missing from the compacted,
+or `NONE` if nothing critical is missing.
+```
+
+The verifier never rewrites the compacted output. It only appends missing lines under a header:
+
+```
+[gstack-compact: 247 → 18 lines, rule: tests/jest]
+[gstack-verify: 2 additional lines preserved by Haiku]
+  TypeError: Cannot read property 'foo' of undefined
+    at parseConfig (src/config.ts:42:18)
+```
+
+**Why Haiku, not Sonnet:** ~1/12th the cost, ~500ms vs ~2s, and the task is simple line-level classification, not reasoning.
+
+**Verifier config (`compact/rules/_verifier.json`):**
+
+```json
+{
+  "verifier": {
+    "enabled": true,
+    "model": "claude-haiku-4-5-20251001",
+    "maxInputLines": 2000,
+    "triggers": {
+      "aggressiveReduction": { "enabled": false, "thresholdPct": 80, "minLines": 200 },
+      "failureCompaction": { "enabled": true, "minReductionPct": 50 },
+      "largeNoMatch": { "enabled": false, "minLines": 500 },
+      "userOptIn": { "enabled": true, "envVar": "GSTACK_COMPACT_VERIFY" }
+    },
+    "fallback": "passthrough"
+  }
+}
+```
+
+**Failure modes (verifier is strictly additive — never breaks the baseline):**
+
+- No `ANTHROPIC_API_KEY` → skip verifier, use pure rule output.
+- Haiku call times out (>5s) → skip verifier, use pure rule output.
+- Haiku returns malformed JSON → skip, use pure rule output.
+- Haiku returns prompt-injection attempt → sanitize: only append lines that appear verbatim as whole lines in the original raw output (exact line match, per decision #4).
+- Haiku returns hallucinated lines (not present in raw) → drop them.
+
+### Tee mode (adopted from RTK)
+
+On any command with exit code ≠ 0, the full unfiltered output is written to `~/.gstack/compact/tee/{timestamp}_{cmd-slug}.log`.
The compacted output includes a tee-file pointer: + +``` +[gstack-compact: 247 → 18 lines, rule: tests/jest, tee: ~/.gstack/compact/tee/20260416-143022_bun-test.log] +``` + +The agent can read the tee file directly if it needs the full stack trace. This replaces the earlier `onFailure.preserveFull` mechanic with a cleaner design: compacted output always stays small; raw output is always one `cat` away. + +**Tee safety:** + +- File mode `0600` — not world-readable. +- Built-in secret-regex set redacts AWS keys, bearer tokens, and common credential patterns before write. +- Failed writes (read-only filesystem, permission denied) degrade gracefully: still emit compacted output, record `meta.teeFailed: true`. +- Tee files auto-expire after 7 days (cleanup on hook startup). + +### Host integration matrix + +| Host | Hook type | Supported matchers | Config path | +|------|-----------|-------------------|-------------| +| Claude Code | `PostToolUse` | Bash, Read, Grep, Glob, Edit, Write, WebFetch, WebSearch, mcp__* | `~/.claude/settings.json` | +| Codex (v1.1) | `PostToolUse` equivalent | Bash (primary); tool subset TBD — empirical verification is a v1.1 prereq | `~/.codex/hooks.json` | +| OpenClaw (v1.1) | Native hook API | Bash + MCP | OpenClaw config | + +**v1 is Claude-first.** Wedge (ii) — native-tool coverage — is confirmed on Claude Code via [the hooks reference](https://code.claude.com/docs/en/hooks). Codex and OpenClaw integration ships at v1.1 only after the wedge is proven on the primary host via B-series benchmark data. CHANGELOG for v1 makes the Claude-only scope explicit. 
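For concreteness, the hook registration that `gstack compact install` writes into `~/.claude/settings.json` would look roughly like this. This is a sketch: the matcher string and the hook command name are assumptions, not the verified on-disk shape.

```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Bash|Read|Grep|Glob|Edit|Write|WebFetch|WebSearch|mcp__.*",
        "hooks": [{ "type": "command", "command": "gstack-compact-hook" }]
      }
    ]
  }
}
```

Under this shape, `gstack compact uninstall` only needs to remove this one entry, and `gstack compact doctor` can diff the live file against the expected registration.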
+ +### Config surface + +User config (`~/.config/gstack/compact.toml`): + +```toml +[compact] +enabled = true +level = "normal" # minimal | normal | aggressive (caveman pattern) +exclude_commands = ["curl", "playwright"] # RTK pattern + +[compact.bundle] +auto_reload_on_mtime_drift = true # hook rebuilds bundle if source rule files are newer +bundle_path = "~/.gstack/compact/rules.bundle.json" + +[compact.regex] +per_rule_timeout_ms = 50 # AbortSignal budget per regex; timeout → skip rule + +[compact.verifier] +enabled = true +trigger_failure_compaction = true +trigger_aggressive_reduction = false +trigger_large_no_match = false +failure_signal_fallback = true # use /FAIL|Error|Traceback|panic/ when exitCode missing +sanitization = "exact-line-match" # only append lines present verbatim in raw output + +[compact.tee] +on_exit = "nonzero" +max_bytes = 1048576 +redact_patterns = ["aws", "github", "gitlab", "slack", "jwt", "bearer", "ssh-private-key"] +cleanup_days = 7 + +[compact.benchmark] +local_only = true # hard-coded; config is documentary, cannot be changed +transcript_root = "~/.claude/projects" +output_dir = "~/.gstack/compact/benchmark" +scenario_cap = 20 # top-N clusters by aggregate output volume +``` + +**Intensity levels (caveman pattern):** + +- **minimal:** only `filter` + `dedupe`; no truncation. Safest. +- **normal:** `filter` + `dedupe` + `truncate`. Default. +- **aggressive:** adds `group`; more savings, more edge-case risk. 
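A sketch of how the three intensity levels could map onto the enabled primitive set (the constant and function names here are hypothetical, not the shipped API):

```typescript
// Maps the caveman-style intensity levels onto the four rule primitives.
type Primitive = "filter" | "group" | "truncate" | "dedupe";
type Level = "minimal" | "normal" | "aggressive";

const LEVEL_PRIMITIVES: Record<Level, readonly Primitive[]> = {
  minimal: ["filter", "dedupe"],                         // safest: no truncation
  normal: ["filter", "dedupe", "truncate"],              // default
  aggressive: ["filter", "dedupe", "truncate", "group"], // max savings, most risk
};

// A rule primitive only runs when the rule defines it AND the active
// level enables it.
function primitiveEnabled(level: Level, primitive: Primitive): boolean {
  return LEVEL_PRIMITIVES[level].includes(primitive);
}
```

The intersection semantics keep the knob one-dimensional: raising the level never disables a primitive that a lower level allowed.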
+
+### CLI surface
+
+| Command | Purpose | Source |
+|---------|---------|--------|
+| `gstack compact install <host>` | Register PostToolUse hook in host config; builds `rules.bundle.json` | new |
+| `gstack compact uninstall <host>` | Idempotent removal | new |
+| `gstack compact reload` | Rebuild `rules.bundle.json` after editing user/project rules | new |
+| `gstack compact doctor` | Detect drift / broken hook config, offer to repair | tokenjuice |
+| `gstack compact gain` | Show token/dollar savings over time (per-rule breakdown) | RTK |
+| `gstack compact discover` | Find commands with no matching rule, ranked by noise volume | RTK |
+| `gstack compact verify <fixture>` | Dry-run verifier on a fixture | new |
+| `gstack compact list-rules` | Show effective rule set after deep-merge (built-in + user + project) | new |
+| `gstack compact test <rule> <fixture>` | Apply a rule to a fixture and show the diff | new |
+| `gstack compact benchmark` | Run B-series testbench against local transcript corpus (see Benchmark section) | new |
+
+Escape hatch: `GSTACK_RAW=1` env var bypasses the hook entirely for the duration of a command (same pattern as tokenjuice's `--raw` flag). The hook also auto-reloads the bundle if any source rule file's mtime is newer than the bundle file.
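The auto-reload staleness check is small enough to sketch. A minimal version, assuming the bundle and source paths shown in the config above (helper name is hypothetical; the real logic lives in `hook.ts`/`bundle.ts`):

```typescript
import { statSync } from "node:fs";

// True when rules.bundle.json is missing or older than any rule source
// file, i.e. the hook must rebuild the bundle in-line before applying
// rules (decision #9).
function bundleIsStale(bundlePath: string, sourcePaths: string[]): boolean {
  let bundleMtime: number;
  try {
    bundleMtime = statSync(bundlePath).mtimeMs;
  } catch {
    return true; // no compiled bundle yet -> build it
  }
  return sourcePaths.some((p) => {
    try {
      return statSync(p).mtimeMs > bundleMtime;
    } catch {
      return false; // a missing source file is a doctor concern, not staleness
    }
  });
}
```

At one `stat` per source file, a 13-rule library should stay within the <0.5ms drift-check budget in the performance table.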
+ +## File layout + +``` +compact/ +├── SKILL.md.tmpl # template; regen via `bun run gen:skill-docs` +├── src/ +│ ├── hook.ts # entry point; reads stdin, writes stdout; mtime-checks bundle +│ ├── engine.ts # rule matching + reduction metadata +│ ├── apply.ts # primitive application (line-oriented streaming pipeline) +│ ├── merge.ts # deep-merge of built-in/user/project rules; honors `extends: null` +│ ├── bundle.ts # compile source rules → rules.bundle.json (install/reload) +│ ├── primitives/ +│ │ ├── filter.ts +│ │ ├── group.ts +│ │ ├── truncate.ts # ring-buffered tail; safe for arbitrary input size +│ │ └── dedupe.ts +│ ├── regex-sandbox.ts # AbortSignal-bounded regex execution (50ms budget per rule) +│ ├── verifier.ts # Haiku integration (triggers + failure-signal fallback + sanitization) +│ ├── sanitize.ts # exact-line-match filter for verifier output +│ ├── tee.ts # raw-output archival with secret redaction + 7-day cleanup +│ ├── redact.ts # secret-pattern set (AWS/GitHub/GitLab/Slack/JWT/bearer/SSH) +│ ├── envelope.ts # JSON I/O contract parsing + validation +│ ├── doctor.ts # hook drift detection + repair +│ ├── analytics.ts # gain + discover queries against local metadata +│ └── cli.ts # argv dispatch; one thin dispatch per subcommand +├── benchmark/ # B-series testbench (hard v1 gate) +│ └── src/ +│ ├── scanner.ts # walk ~/.claude/projects/**/*.jsonl; pair tool_use × tool_result +│ ├── sizer.ts # tokens per call (ceil(len/4) heuristic); rank heavy tail +│ ├── cluster.ts # group high-leverage calls by (tool, command pattern) +│ ├── scenarios.ts # emit B1-Bn real-world scenario fixtures +│ ├── replay.ts # run compactor against scenarios; measure reduction +│ ├── pathology.ts # layer planted-bug P-cases on top of real scenarios +│ └── report.ts # dashboard: per-scenario before/after + overall reduction +├── rules/ # v1 built-in JSON rule library (13 rules) +│ ├── tests/ +│ │ ├── jest.json +│ │ ├── vitest.json +│ │ ├── pytest.json +│ │ ├── cargo-test.json +│ │ 
├── go-test.json +│ │ └── rspec.json +│ ├── install/ +│ │ ├── npm.json +│ │ ├── pnpm.json +│ │ ├── pip.json +│ │ └── cargo.json +│ ├── git/ +│ │ ├── diff.json +│ │ ├── log.json +│ │ └── status.json +│ ├── _verifier.json # verifier config (not a rule per se) +│ └── _HOLD/ # v1.1 rule families (not shipped at v1; kept for reference) +│ ├── build/ +│ ├── lint/ +│ └── log/ +└── test/ + ├── unit/ + ├── golden/ + ├── fuzz/ # P-series — v1 gate subset only (P1/P3/P6/P8/P15/P18/P26/P28/P30) + ├── cross-host/ # v1: claude-code.test.ts only; codex/openclaw stub files + ├── adversarial/ # R-series — grows with shipped bugs + ├── benchmark/ # B-series scenario fixtures + expected reduction ranges + ├── fixtures/ # version-stamped golden inputs (toolVersion: frontmatter) + └── evals/ +``` + +## Testing Strategy + +The test plan is comprehensive by design. Shipping into a space where the 28K-star incumbent has three years of regex battle-scars, with our wedges (Haiku verifier + native-tool coverage) introducing new failure surfaces, means we get ONE shot at "the compactor made my agent dumb" going viral. Zero appetite for that. 
+ +### Test tiers + +| Tier | Cost | Frequency | Blocks merge | +|------|------|-----------|--------------| +| Unit | free, <1s | every PR | yes | +| Golden file (with `toolVersion:` frontmatter) | free, <1s | every PR | yes | +| Rule schema validation | free, <1s | every PR | yes | +| Fuzz (P-series gate subset: P1/P3/P6/P8/P15/P18/P26/P28/P30) | free, <10s | every PR | yes | +| Cross-host E2E — Claude Code only at v1 | free, ~1min | every PR (gate tier) | yes | +| E2E with verifier (mocked Haiku) | free, ~15s | every PR | yes | +| E2E with verifier (real Haiku) | paid, ~$0.10/run | PR touching verifier files | yes | +| **B-series benchmark (real-world scenarios)** | **free, ~2min** | **pre-release gate** | **yes (hard gate for v1)** | +| Token-savings eval (E1-E4 synthetic) | paid, ~$4/run | periodic weekly | no (informational) | +| Adversarial regression (R-series) | free, <5s | every PR | yes | +| Tool-version drift warning | free, <1s | every PR | warning only | + +Test file layout: + +``` +compact/test/ +├── unit/ +│ ├── engine.test.ts # rule matching + primitive application +│ ├── primitives.test.ts # filter / group / truncate / dedupe +│ ├── envelope.test.ts # JSON input/output contract +│ ├── triggers.test.ts # verifier trigger evaluation +│ └── verifier.test.ts # Haiku call (mocked) +├── golden/ +│ ├── tests/ # one fixture per test runner +│ │ ├── jest-success.input.txt +│ │ ├── jest-success.expected.txt +│ │ ├── jest-fail.input.txt +│ │ ├── jest-fail.expected.txt +│ │ └── ... 
(vitest, pytest, cargo-test, go-test, rspec) +│ ├── install/ +│ ├── git/ +│ ├── build/ +│ ├── lint/ +│ └── log/ +├── fuzz/ +│ └── pathological.test.ts # P-series +├── cross-host/ +│ ├── claude-code.test.ts +│ ├── codex.test.ts +│ └── openclaw.test.ts +├── adversarial/ +│ └── regression.test.ts # R-series; past bugs that must never recur +├── fixtures/ +│ └── {tool}/ # shared raw output fixtures +└── evals/ + └── token-savings.eval.ts # periodic-tier; measures real reduction +``` + +### G-series: good cases (must produce expected reduction) + +| ID | Scenario | Expected reduction | +|----|----------|-------------------| +| G1 | `jest` 47 passing tests, clean run | 150+ lines → ≤10 lines | +| G2 | `jest` 47 tests with 2 failures | 200+ lines → keep both failures + summary | +| G3 | `vitest` run with `--reporter=verbose` | 300+ lines → ≤15 lines | +| G4 | `pytest` collection then run | preserve failure tracebacks | +| G5 | `cargo test` with one panic | panic location preserved verbatim | +| G6 | `go test -v` with 200 subtests passing | collapse to `PASS: 200 subtests` | +| G7 | `git diff` on a file with 2 hunks in 500 lines of context | keep hunks, drop context | +| G8 | `git log -50` | preserve SHA + subject + author, drop body | +| G9 | `git status` with 30 modified files | group by directory | +| G10 | `pnpm install` fresh | final count + warnings; drop resolved packages | +| G11 | `pip install -r requirements.txt` | drop download progress; keep final install list + errors | +| G12 | `cargo build` success | drop compilation progress; keep final target | +| G13 | `docker build` success | drop layer pulls; keep final image digest | +| G14 | `tsc --noEmit` clean | compact to `tsc: 0 errors` | +| G15 | `tsc --noEmit` with 3 errors | keep all 3 errors with location | +| G16 | `eslint .` clean | compact to `eslint: 0 problems` | +| G17 | `eslint .` with violations | group by rule; preserve location + fix suggestion | +| G18 | `docker logs -f` with 1000 repeating lines | 
dedupe with count: `[last message repeated 973 times]` | +| G19 | `kubectl get pods -A` | group by namespace | +| G20 | `ls -la` deep tree | directory grouping (RTK pattern) | +| G21 | `find . -type f` 10K files | group by extension with counts | +| G22 | `grep -r "foo" .` with 500 hits | cap at 50; suffix `[... 450 more matches; use --ripgrep for full]` | +| G23 | `curl -v https://api.example.com` | strip verbose headers; keep response body | +| G24 | `aws ec2 describe-instances` 50 instances | columnar summary | + +### P-series: pathological cases (must NOT break the agent) + +These turn "nice feature" into "catastrophic regression" if we get any of them wrong. + +| ID | Scenario | Required behavior | +|----|----------|-------------------| +| P1 | Binary garbage in output (non-UTF8 bytes) | Pass through unchanged; don't crash | +| P2 | ANSI escape explosion (10K+ codes) | Strip cleanly, don't choke regex | +| P3 | Empty output (`""`) | Pass through empty; do NOT inject header | +| P4 | Stdout+stderr interleaved | Rule matches across both streams | +| P5 | Truncated output (SIGPIPE mid-stream) | Don't mis-compact partial output | +| P6 | **Failed test, critical stack frame at line 4 of 200** | Must NOT filter the frame (the RTK-killer case) | +| P7 | Exit 0 but `ERROR:` in output | Rule must not trust exit code alone | +| P8 | Output contains AWS key / bearer token / password | Tee file must NOT be world-readable; redact in compacted output | +| P9 | Single-line minified JS error (40KB one line) | Truncate to first 1KB; append `[... 
truncated ...]` |
+| P10 | Unicode (emoji, RTL, combining chars, CJK) | Byte-safe truncation; don't split codepoints |
+| P11 | Two rules match same command | Deterministic priority: longest `match.commands` prefix wins; tie → rule ID alphabetical |
+| P12 | Rule's compacted output matches another rule's pattern | No recursive application; hook runs once per tool call |
+| P13 | Command contains embedded newlines in quoted arg | Rule doesn't misparse args |
+| P14 | Concurrent tool calls (parallel Bash invocations) | No shared mutable state in hook; each call is isolated |
+| P15 | Hook execution >5s | Pass through raw; emit `meta.timedOut: true` |
+| P16 | Haiku API offline/rate-limited | Skip verifier silently; use pure rule output |
+| P17 | Haiku returns malformed JSON | Skip verifier; do NOT feed raw response to agent |
+| P18 | Haiku response contains prompt-injection (`"Ignore all prior instructions..."`) | Sanitize: only append lines that appear verbatim as whole lines in the original raw output (exact line match) |
+| P19 | 1M-line output | Stream-process, cap memory at 64MB; truncate with clear marker |
+| P20 | Rapid-fire: 50 tool calls / sec | Hook latency stays within the revised p99 budgets (<80ms macOS ARM / <60ms Linux, verifier off) |
+| P21 | Command with shell redirects (`cmd >file 2>&1`) | Match on the underlying command name, not the redirect wrapper |
+| P22 | Deeply nested quotes/escapes in command string | Robust arg parser; no shell injection possible |
+| P23 | NULL bytes in output | Strip safely; don't truncate |
+| P24 | Command that exits then writes more to stderr after | Hook receives final combined output; handles gracefully |
+| P25 | Read-only filesystem / no tee write permission | Degrade gracefully; still emit compacted output; record `meta.teeFailed: true` |
+| P26 | User's rule JSON is malformed | Skip that rule; emit warning to stderr; don't break hook |
+| P27 | Rule references a non-existent primitive field | Ignore unknown field; apply rest of rule |
+| P28 | Rule regex has catastrophic backtracking | RE2-compatible
engine (no backtracking) OR per-rule timeout |
+| P29 | Exit code 137 (OOM kill) | Rule treats same as generic failure; preserves full output |
+| P30 | Haiku returns lines NOT present in raw output (hallucination) | Drop hallucinated lines; keep only exact line matches |
+
+### CH-series: cross-host E2E
+
+Run each scenario on each supported host. Same input, same expected output. If a host does not support a matcher, the test is marked `skip-on-{host}` with a comment linking the upstream limitation.
+
+| ID | Scenario | Hosts |
+|----|----------|-------|
+| CH1 | Install hook via `gstack compact install <host>` | Claude Code, Codex, OpenClaw |
+| CH2 | Uninstall hook is idempotent | All |
+| CH3 | Re-install doesn't duplicate entries | All |
+| CH4 | Hook co-exists with user's other PostToolUse hooks | All |
+| CH5 | Hook fires on Bash tool | All |
+| CH6 | Hook fires on Read tool | Claude Code (confirmed); Codex/OpenClaw verify-then-require |
+| CH7 | Hook fires on Grep tool | Same as CH6 |
+| CH8 | Hook fires on Glob tool | Same as CH6 |
+| CH9 | Hook fires on MCP tool (`mcp__*` matcher) | Claude Code; verify on others |
+| CH10 | Config precedence: project > user > built-in | All |
+| CH11 | `GSTACK_RAW=1` env var bypasses hook | All |
+| CH12 | Rule ID override works (project rule deep-merges over built-in; `extends: null` replaces) | All |
+| CH13 | `gstack compact doctor` detects drift on each host | All |
+| CH14 | Hook error does not crash the agent session | All |
+
+Implementation note: cross-host tests reuse the fixture corpus from the `golden/` tree; the harness wraps each fixture in a host-specific hook invocation envelope and asserts the output is byte-identical across hosts (modulo the `host` field).
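The byte-identical assertion described in the implementation note above can be sketched as follows. The helper names are illustrative, and a real harness would obtain each output by invoking the hook binary rather than holding strings in memory:

```typescript
// Sketch of the cross-host harness contract: one fixture, one envelope
// per host, and the hook outputs must match byte-for-byte once the
// `host` field is masked.
interface Fixture { command: string; text: string; exitCode: number; }

function wrapFixture(host: string, f: Fixture) {
  return { tool: "Bash", command: f.command, combinedText: f.text, exitCode: f.exitCode, host };
}

// Mask the one field allowed to differ, then require a single distinct value.
function assertByteIdentical(outputsByHost: Map<string, string>): void {
  const masked = new Set(
    Array.from(outputsByHost.values()).map((o) =>
      o.replace(/"host"\s*:\s*"[^"]*"/g, '"host":"*"'),
    ),
  );
  if (masked.size > 1) throw new Error("cross-host outputs diverged");
}
```

Masking instead of parsing keeps the comparison byte-exact everywhere else, so even whitespace drift between host adapters fails the test.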
+
+### V-series: verifier tests (paid)
+
+| ID | Scenario | Expected |
+|----|----------|----------|
+| V1 | Rule reduces 200-line test output to 5 lines, exit=1 | Verifier fires (failure + >50% reduction), appends any missing critical lines |
+| V2 | Rule reduces 10-line output to 9 lines, exit=1 | Verifier does NOT fire (reduction too small) |
+| V3 | Rule reduces 200-line output to 5 lines, exit=0 | Verifier does NOT fire (success path, default config) |
+| V4 | `aggressiveReduction` trigger enabled, 300 lines → 20 lines, exit=0 | Verifier fires |
+| V5 | `GSTACK_COMPACT_VERIFY=1` env var set | Verifier fires once for that call |
+| V6 | `ANTHROPIC_API_KEY` missing | Verifier silently skipped; raw rule output returned |
+| V7 | Verifier mocked to return "NONE" | Output identical to pure-rule path |
+| V8 | Verifier mocked to return prompt injection | Injection discarded; only exact-line-matched lines appended |
+| V9 | Verifier mocked to time out >5s | Skipped; `meta.verifierTimedOut: true` |
+| V10 | Verifier mocked to return 500 error | Skipped; rule output returned |
+
+### R-series: adversarial regression
+
+Every bug caught after v1 ship gets a permanent R-series test. Starts empty; grows with scars.
Template: + +``` +R{N}: {commit-sha} — {1-line summary} +Scenario: {reproducer} +Fix: {PR link} +``` + +### Performance budgets (enforced in CI; revised for realistic Bun cold-start) + +| Metric | Target | Hard limit | +|--------|--------|-----------| +| Hook overhead macOS ARM (verifier disabled) | <30ms p50 | <80ms p99 | +| Hook overhead Linux (verifier disabled) | <20ms p50 | <60ms p99 | +| Hook overhead (verifier fires) | <600ms p50 | <2s p99 | +| Bundle deserialize (rules.bundle.json) | <2ms | <10ms | +| mtime drift check (stat of source files) | <0.5ms | <3ms | +| Single-regex execution budget (per rule) | <5ms | <50ms (hard abort) | +| Memory per hook invocation (line-streamed) | <16MB typical | <64MB max | +| Total rule-payload size on disk (source files) | <5KB | <15KB | +| Compiled bundle size on disk | <25KB | <80KB | + +Daemon mode is a v2 optimization. If B-series benchmark on the author's corpus shows cold-start meaningfully hurts session-total savings (e.g., total hook overhead >5% of saved tokens' wall time), promote to v1.1. + +### B-series real-world benchmark testbench (hard v1 gate) + +**Why it exists.** Every competing compactor ships with hand-picked fixture numbers. B-series proves the compactor works on the user's *actual* coding sessions before they enable the hook. It's both the ship-gate and the marketing artifact. + +**Architecture** (components in `compact/benchmark/src/`): + +``` +┌──────────────────────────────────────────────────────────────┐ +│ 1. SCAN scanner.ts walks ~/.claude/projects/**/*.jsonl │ +│ → pairs tool_use × tool_result blocks │ +│ → emits {tool, command, outputBytes, lineCount, │ +│ estimatedTokens, sessionId, timestamp} │ +├──────────────────────────────────────────────────────────────┤ +│ 2. RANK sizer.ts sorts corpus by estimatedTokens desc │ +│ → cluster.ts groups by (tool, command-pattern) │ +│ → identifies heavy-tail: which 10% of calls │ +│ produced 80% of the tokens? 
│ +├──────────────────────────────────────────────────────────────┤ +│ 3. SCENARIO scenarios.ts emits fixture files: │ +│ B1_bun_test_heavy.jsonl │ +│ B2_git_diff_huge.jsonl │ +│ B3_tsc_errors_production.jsonl │ +│ B4_pnpm_install_fresh.jsonl ... (one per │ +│ high-leverage cluster, up to ~20 scenarios) │ +├──────────────────────────────────────────────────────────────┤ +│ 4. REPLAY replay.ts runs compactor against each scenario, │ +│ measures token reduction + diff of dropped lines│ +│ → per-rule reduction numbers │ +│ → per-scenario before/after token counts │ +├──────────────────────────────────────────────────────────────┤ +│ 5. PATHOLOGY pathology.ts injects planted critical lines │ +│ (line 4 of 200 in a failing test fixture) into │ +│ real B-scenarios. Confirms verifier restores │ +│ them. Real data + real threats = real proof. │ +├──────────────────────────────────────────────────────────────┤ +│ 6. REPORT report.ts emits HTML + JSON dashboard to │ +│ ~/.gstack/compact/benchmark/latest/ │ +│ "On YOUR 30 days of Claude Code data, gstack │ +│ compact would save X tokens in Y scenarios." │ +└──────────────────────────────────────────────────────────────┘ +``` + +**v1 ship gate (hard):** +- ≥15% total-token reduction across the aggregated scenario corpus on the author's own 30-day transcript set. +- Zero critical-line loss on planted-bug scenarios (every planted stack frame must survive either the rule or the verifier). +- No scenario regresses to <5% reduction under the new rules (catch over-compaction edge cases). + +**Privacy (non-negotiable):** +- Reads `~/.claude/projects/**/*.jsonl` locally only. Never uploads. Never shares. Never logs scenarios to telemetry. +- Output files live under `~/.gstack/compact/benchmark/` with mode `0600`. 
+- The command prints a confirmation banner: *"Scanning local transcripts at ~/.claude/projects/ (local-only; nothing leaves this machine)."* +- Any future community corpus is a separate v2 workstream built from hand-contributed, secret-scanned fixtures on OSS projects. + +**Ports from analyze_transcripts (TypeScript reimplementation; not a subprocess call):** +- JSONL parsing + tool_use/tool_result pairing pattern (from `event_extractor.rb`). +- Token estimate `ceil(len/4)` (same char-ratio heuristic; sufficient for ranking). +- Event-type taxonomy (`bash_command`, `file_read`, `test_run`, `error_encountered`) for scenario clustering. +- Stress-fixture generation pattern for pathology layering. + +**What we do NOT port:** behavioral scoring, pgvector embeddings, decision-exchange graphs, velocity metrics, the Rails/ActiveRecord layer. Out of scope; not what we're measuring. + +### Synthetic token-savings evals (E-series, periodic/informational only) + +Retained from the original plan but now informational-only because B-series is the real gate. + +- **E1:** simulated 30-min coding session on a medium TypeScript project. Measure total tokens with/without gstack compact enabled. Target: ≥15% reduction. +- **E2:** same session at `level=aggressive`. Target: ≥25% reduction, zero test-failure increase. +- **E3:** same session with verifier on `failureCompaction` only. Verifier fire rate ≤10% of tool calls. +- **E4:** adversarial — inject a planted bug in a test output and confirm the verifier restores the critical stack frame. + +### Test corpus sourcing + +For each rule family, capture 3+ real outputs: + +1. Run the tool against a real project (gstack itself for TS; popular OSS for Rust/Go/Python). +2. Capture stdout+stderr+exit code into a fixture file with `toolVersion:` frontmatter (e.g., `jest@29.7.0`). +3. Hand-author the expected compacted output once. +4. Golden file test: rule application must produce byte-identical output. +5. 
CI drift warning: if installed tool version differs from fixture's `toolVersion:`, CI warns (not fails). Drift-warning dashboard is checked pre-release. + +Draw from: +- tokenjuice's fixture directory patterns (`tests/fixtures/`) +- RTK's per-command examples (their README lists real before/after metrics; verify independently) +- gstack's own test output (eat our own dog food) +- Real failure archives from `~/.gstack/compact/tee/` (once volunteers contribute) +- **B-series real-world scenarios are the primary corpus for reduction measurements.** + +## Pattern adoption table + +Concrete patterns borrowed from the competitive landscape: + +| From | Adopt as | Why | +|------|----------|-----| +| RTK | 4 reduction primitives (filter/group/truncate/dedupe) as JSON rule verbs | Table stakes for a serious compactor | +| RTK | `gstack compact tee` for failure-mode raw save | Better than the original `onFailure.preserveFull` design | +| RTK | `gstack compact gain` + `gstack compact discover` | Trust + continuous improvement | +| RTK | `exclude_commands` per-user blocklist | Must-have config | +| tokenjuice | JSON envelope contract for hook I/O | Clean machine adapter | +| tokenjuice | `gstack compact doctor` | Hooks drift; self-repair matters | +| caveman | Intensity levels (minimal/normal/aggressive) | User-tunable safety/savings knob | +| claude-token-efficient | Rules-file size budget (<5KB total) | Don't bloat context | + +## Rollout plan + +**ALL PHASES TABLED pending Anthropic `updatedBuiltinToolOutput` API.** See Status section at the top of this doc. The rollout below is the intended sequence if/when the API ships and this design un-tables. + +### Un-tabling checklist (do in order when the API arrives) + +1. **Confirm the new API's shape.** Read the updated Claude Code hooks reference. Capture a real envelope containing the new output-replacement field for Bash, Read, Grep, Glob. Record in `docs/designs/GCOMPACTION_envelope.md`. +2. 
**Re-validate the wedge.** Does the new API cover Read/Grep/Glob (do they fire `PostToolUse` now), or just Bash/WebFetch? If Bash-only, wedge (ii) stays dead and the product needs a new pitch before implementation. +3. **Re-run `/plan-eng-review`** against the revised plan with the new API. Most of the 15 locked decisions should carry forward; adjust the Architecture data-flow and any envelope-dependent decisions. +4. **Re-run `/codex review`** against the revised plan. The prior BLOCK verdict's concerns about hook substitution disappear once the API exists; remaining criticals (B-series privacy, regex DoS, JSON-envelope streaming) still apply. +5. **Execute the original rollout below.** + +### Original rollout (preserved for un-tabling) + +Each tier blocks on the prior passing all gate-tier tests. Claude-first — Codex and OpenClaw land at v1.1 after the wedge is proven on the primary host. + +1. **v0.0 (1 day):** rule engine + 4 primitives + line-oriented streaming pipeline + deep-merge + bundle compiler + envelope contract + golden tests for `tests/*` family only. No host integration yet. Measure savings on offline fixtures. +2. **v0.1 (1 day):** Claude Code hook integration + `gstack compact install` + mtime-based auto-reload. Ship as opt-in; off by default. Ask 10 gstack power users to try it; collect feedback. +3. **v0.5 (1 day):** B-series benchmark testbench (`compact/benchmark/`). Ship `gstack compact benchmark` so users can measure on their own data. Collect anonymous-from-the-start (nothing uploaded) reduction numbers from dogfooders. +4. **v1.0 (1 day):** verifier layer with `failureCompaction` trigger on by default + exact-line-match sanitization + layered exitCode/pattern fallback + expanded tee redaction set. **Hard ship gate:** B-series on the author's 30-day local corpus shows ≥15% total reduction AND zero critical-line loss on planted bugs. Publish CHANGELOG entry leading with wedge framing (Claude Code only at v1). +5. 
**v1.1 (+1 day):** Codex + OpenClaw hook integration. Cross-host E2E suite green. Build/lint/log rule families land with `gstack compact discover`-derived priorities. +6. **v1.2+:** expand rule families, community rule contribution workflow, community-corpus benchmark (hand-authored public fixtures, separate from local B-series). + +## Risk analysis + +| Risk | Severity | Mitigation | +|------|----------|------------| +| RTK adds an LLM verifier in response | Low | Creator is vocal about zero-dependency Rust. Ship first, build the pattern library. | +| Platform compaction subsumes us (Anthropic Compaction API in Claude Code) | Medium | We operate at a different layer (per-tool output vs whole-context). Position as complementary. | +| Rules drop something critical → "compactor made my agent dumb" | High | B-series real-world benchmark as hard ship gate; tee mode always available; verifier default-on for failures; exact-line-match sanitization. | +| Haiku cost creep (triggers fire more than expected) | Medium | E3 eval + B-series fire-rate metric; cost visible in `gstack compact gain`; per-session rate cap in v1.1 if rate >10%. | +| Rule maintenance debt (jest/vitest output formats change) | Medium | `toolVersion:` fixture frontmatter + CI drift warning; community rule PRs; `discover` flags bypassing commands. | +| Rules file bloats context | Low | CI-enforced <5KB source + <25KB compiled bundle budget; per-rule size warning at schema-validation. | +| Regex DoS blocks the agent | Medium | 50ms AbortSignal budget per rule; timeout logged to `meta.regexTimedOut`; stale rules quarantined on repeated failure. | +| Bundle staleness silently breaks user edits | Low | mtime-check on every hook invocation auto-rebuilds; `gstack compact reload` is a backup not a requirement. | +| Benchmark leaks user's private data | High | Local-only by construction: no network call, mode-0600 output, explicit banner at runtime. Privacy review before v1 ship. | + +## Open questions + +1. 
~~Does Codex's PostToolUse hook support matchers for Read/Grep/Glob?~~ (Deferred to v1.1 — Claude-first at v1.) +2. ~~Does OpenClaw's hook API support PostToolUse specifically?~~ (Deferred to v1.1.) +3. Should the verifier model be pinned, or version-tracked like gstack's other AI calls? (Inclined to pin `claude-haiku-4-5-20251001` and bump explicitly in CHANGELOG.) +4. ~~Built-in secret-redaction regex set for tee files~~ **(resolved: expanded set — AWS/GitHub/GitLab/Slack/JWT/bearer/SSH-private-key. See decision #10.)** +5. Should `gstack compact discover` propose auto-generated rules via Haiku? (Deferred to v2; skill-creep risk.) +6. **New:** Does Claude Code's PostToolUse envelope include `exitCode`? (Still needs empirical verification per pre-implementation task #1; system now has a layered fallback regardless.) +7. **New:** What's the right scenario-count cap for B-series? Cluster.ts can produce 5-50 scenarios depending on heavy-tail shape. Plan: cap at top 20 clusters by aggregate output volume. + +## Pre-implementation assignment (must complete before coding) + +1. **Verify Claude Code's PostToolUse envelope contents empirically.** Ship a no-op hook; confirm `exitCode`, `command`, `argv`, `combinedText` are all present. This is the pivot for wedge (ii) native-tool coverage AND for the failureCompaction trigger. Output: `docs/designs/GCOMPACTION_envelope.md` with real captured envelopes for Bash + Read + Grep + Glob. +2. **Read RTK's rule definitions** (`ARCHITECTURE.md`, `src/rules/`) and write a 1-paragraph summary of which of the 4 primitives they handle best. Inform our v1 rule set. This is the Search Before Building layer. +3. **Port analyze_transcripts JSONL parser to TypeScript.** `compact/benchmark/src/scanner.ts`. Write a quick-look output that lists the top-50 noisiest tool calls on the author's `~/.claude/projects/`. Confirms the testbench premise before we build the replay loop. This is the B-series foundation. +4. 
**Write the CHANGELOG entry FIRST.** Target sentence: *"Every tool in your agent's toolbox on Claude Code now produces less noise — test runners, git diffs, package installs — with an intelligent Haiku safety net that restores critical stack frames when our rules over-compact, and a local benchmark that proves the savings on your actual 30 days of coding sessions. Codex + OpenClaw land in v1.1."* If we cannot write that sentence honestly, the wedge isn't there yet. +5. **Ship a rule-only v0** (no Haiku verifier, no benchmark). Measure real token savings with current gstack evals + early B-series prototype. If <10% on local corpus, the whole premise is weaker than claimed — iterate the rules before adding the verifier on top. + +## License & attribution + +gstack ships under MIT. To keep the license clean for downstream users, this project follows a strict clean-room policy for everything borrowed from the competitive landscape: + +- **Every project referenced above is permissive-licensed** (MIT or Apache-2.0). No AGPL, GPL, SSPL, or other copyleft exposure. + - RTK (rtk-ai/rtk): **Apache-2.0** — MIT-compatible; Apache patent grant is a bonus for us. + - tokenjuice, caveman, claude-token-efficient, token-optimizer-mcp, sst/opencode: **MIT**. +- **Patterns, not code.** We read these projects to understand what they solved and why. We implement independently in TypeScript inside `compact/src/`. We do not copy source files, translate source files line-for-line, or lift test fixtures verbatim. +- **Attribution.** Where a pattern is directly borrowed (the 4 primitives from RTK, the JSON envelope from tokenjuice, intensity levels from caveman, rules-file size budget from claude-token-efficient), we credit the source inline in comments and in the "Pattern adoption table" above. The project's `README` and `NOTICE` file (if we add one) list the inspirations. 
+- **Fixture sourcing.** Golden-file fixtures come from running real tools against real projects — they are our own captures, not imported from RTK or tokenjuice. This keeps the test corpus free of license-tangled content. +- **Forbidden sources.** Before adding any new reference project, run `gh api repos/OWNER/REPO --jq '.license'` and verify the license key is one of: `mit`, `apache-2.0`, `bsd-2-clause`, `bsd-3-clause`, `isc`, `cc0-1.0`, `unlicense`. If the project has no license field, treat it as "all rights reserved" and do not draw from it. Reject `agpl-3.0`, `gpl-*`, `sspl-*`, and any custom or source-available license. + +CI enforcement: a `scripts/check-references.ts` script parses `docs/designs/GCOMPACTION.md` for GitHub URLs and re-runs the license check, failing if any referenced project's license moves off the allowlist. + +## References + +- [RTK (Rust Token Killer) — rtk-ai/rtk](https://github.com/rtk-ai/rtk) +- [RTK issue #538 — native-tool gap](https://github.com/rtk-ai/rtk/issues/538) +- [tokenjuice — vincentkoc/tokenjuice](https://github.com/vincentkoc/tokenjuice) +- [caveman — juliusbrussee/caveman](https://github.com/juliusbrussee/caveman) +- [claude-token-efficient — drona23](https://github.com/drona23/claude-token-efficient) +- [token-optimizer-mcp — ooples](https://github.com/ooples/token-optimizer-mcp) +- [6-Layer Token Savings Stack — doobidoo gist](https://gist.github.com/doobidoo/e5500be6b59e47cadc39e0b7c5cd9871) +- [Claude Code hooks reference](https://code.claude.com/docs/en/hooks) +- [Chroma context rot research](https://research.trychroma.com/context-rot) +- [Morph: Why LLMs Degrade as Context Grows](https://www.morphllm.com/context-rot) +- [Anthropic Opus 4.6 Compaction API — InfoQ](https://www.infoq.com/news/2026/03/opus-4-6-context-compaction/) +- [OpenAI compaction docs](https://developers.openai.com/api/docs/guides/compaction) +- [Google ADK context compression](https://google.github.io/adk-docs/context/compaction/) +- 
[LangChain autonomous context compression](https://blog.langchain.com/autonomous-context-compression/) +- [sst/opencode context management](https://deepwiki.com/sst/opencode/2.4-context-management-and-compaction) +- [DEV: Deterministic vs. LLM Evaluators — 2026 trade-off study](https://dev.to/anshd_12/deterministic-vs-llm-evaluators-a-2026-technical-trade-off-study-11h) +- [MadPlay: RTK 80% token reduction experiment](https://madplay.github.io/en/post/rtk-reduce-ai-coding-agent-token-usage) +- [Esteban Estrada: RTK 70% Claude Code reduction](https://codestz.dev/experiments/rtk-rust-token-killer) + +**End of GCOMPACTION.md canonical section.** On plan approval, everything above is copied verbatim to `docs/designs/GCOMPACTION.md` as a **tabled design artifact**. No code is written; no hook is installed; no CHANGELOG entry is added. The doc exists so a future sprint can unblock quickly when Anthropic ships the built-in-tool output-replace API. From 822e843a60c6c13508f70dd1ffcc163e8fc79be5 Mon Sep 17 00:00:00 2001 From: Garry Tan Date: Thu, 16 Apr 2026 15:39:44 -0700 Subject: [PATCH 4/4] fix: headed browser auto-shutdown + disconnect cleanup (v0.18.1.0) (#1025) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * fix: headed browser no longer auto-shuts down after 15 seconds The parent-process watchdog in server.ts polls the spawning CLI's PID every 15s and self-terminates if it is gone. The connect command in cli.ts exits with process.exit(0) immediately after launching the server, so the watchdog would reliably kill the headed browser within ~15s. This contradicted the idle timer's own design: server.ts:745 explicitly skips headed mode because "the user is looking at the browser. Never auto-die." The watchdog had no such exemption. Two-layer fix: 1. CLI layer: connect handler always sets BROWSE_PARENT_PID=0 (was only pass-through for pair-agent subprocesses). 
The user owns the headed browser lifecycle; cleanup happens via browser disconnect event or $B disconnect. 2. CLI layer: startServer() honors caller's BROWSE_PARENT_PID=0 in the headless spawn path too. Lets CI, non-interactive shells, and Claude Code Bash calls opt into persistent servers across short-lived CLI invocations. 3. Server layer: defense-in-depth. Watchdog now also skips when BROWSE_HEADED=1, so even if a future launcher forgets PID=0, headed browsers won't die. Adds log lines when the watchdog is disabled so lifecycle debugging is easier. Four community contributors diagnosed variants of this bug independently. Thanks for the clear analyses and reproductions. Closes #1020 (rocke2020) Closes #1018 (sanghyuk-seo-nexcube) Closes #1012 (rodbland2021) Closes #986 (jbetala7) Closes #1006 Closes #943 Co-Authored-By: rocke2020 Co-Authored-By: sanghyuk-seo-nexcube Co-Authored-By: rodbland2021 Co-Authored-By: jbetala7 Co-Authored-By: Claude Opus 4.7 (1M context) * fix: disconnect handler runs full cleanup before exiting When the user closed the headed browser window, the disconnect handler in browser-manager.ts called process.exit(2) directly, bypassing the server's shutdown() function entirely. That meant: - sidebar-agent daemon kept polling a dead server - session state wasn't saved - Chromium profile locks (SingletonLock, SingletonSocket, SingletonCookie) weren't cleaned — causing "profile in use" errors on next $B connect - state file at .gstack/browse.json was left stale Now the disconnect handler calls onDisconnect(), which server.ts wires up to shutdown(2). Full cleanup runs first, then the process exits with code 2 — preserving the existing semantic that distinguishes user-close (exit 2) from crashes (exit 1). shutdown() now accepts an optional exitCode parameter (default 0) so the SIGTERM/SIGINT paths and the disconnect path can share cleanup code while preserving their distinct exit codes. Surfaced by Codex during /plan-eng-review of the watchdog fix. 
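The cleanup-sharing described in this commit can be sketched as follows (illustrative TypeScript, not the actual server.ts source; `cleanupSteps` stands in for the real work, and the sketch returns the exit code instead of calling process.exit so it stays runnable):

```typescript
// Sketch of the shutdown(exitCode) pattern: one async cleanup routine shared
// by the signal path (default exit 0) and the disconnect path (exit 2).
// cleanupSteps is a stand-in for the real steps (kill sidebar-agent,
// save session, remove profile locks, delete state file).
const cleanupSteps: string[] = [];
let isShuttingDown = false;

async function shutdown(exitCode: number = 0): Promise<number> {
  if (isShuttingDown) return exitCode; // signal + disconnect may race; run once
  isShuttingDown = true;
  cleanupSteps.push('kill sidebar-agent', 'save session', 'remove profile locks');
  // The real implementation ends with process.exit(exitCode); the sketch
  // returns it instead so the snippet stays runnable.
  return exitCode;
}

// Signal path: wrapped in an arrow function so the signal name ('SIGTERM')
// never lands in the exitCode slot.
process.on('SIGTERM', () => { void shutdown(); });
// Disconnect path: user closed the window, exit 2 distinguishes from crashes (1).
// browserManager.onDisconnect = () => shutdown(2);
```

The wrapping matters because Node passes the signal name as the listener's first argument; a bare `process.on('SIGTERM', shutdown)` would call `shutdown('SIGTERM')`.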
Co-Authored-By: Claude Opus 4.7 (1M context) * fix: pre-existing test flakiness in relink.test.ts The 23 tests in this file all shell out to gstack-config + gstack-relink (bash scripts doing subprocess work). Under parallel bun test load, those subprocess spawns contend with other test suites and each test can drift ~200ms past Bun's 5s default timeout, causing 5+ flaky timeouts per run in the gate-tier ship gate. Wrap the `test` import to default the per-test timeout to 15s. Explicit per-test timeouts (third arg) still win, so individual tests can lower it if needed. No behavior change — only gives subprocess-heavy tests more headroom under parallel load. Noticed by /ship pre-flight test run. Unrelated to the main PR fix but blocking the gate, so fixing as a separate commit per the test ownership protocol. Co-Authored-By: Claude Opus 4.7 (1M context) * fix: SIGTERM/SIGINT shutdown exit code regression Node's signal listeners receive the signal name ('SIGTERM' / 'SIGINT') as the first argument. When shutdown() started accepting an optional exitCode parameter in the prior disconnect-cleanup commit, the bare `process.on('SIGTERM', shutdown)` registration started silently calling shutdown('SIGTERM'). The string passed through to process.exit(), Node coerced it to NaN, and the process exited with code 1 instead of 0. Wrap both listeners so they call shutdown() with no args — signal name never leaks into the exitCode slot. Surfaced by /ship's adversarial subagent. Co-Authored-By: Claude Opus 4.7 (1M context) * fix: onDisconnect async rejection leaves process running The disconnect handler calls this.onDisconnect() without awaiting it, but server.ts wires the callback to shutdown(2) — which is async. If that promise rejects, the rejection drops on the floor as an unhandled rejection, the browser is already disconnected, and the server keeps running indefinitely with no browser attached. Add a sync try/catch for throws and a .catch() chain for promise rejections. 
Both fall back to process.exit(2) so a dead browser never leaves a live server.

Also widen the callback type from `() => void` to `() => void | Promise<void>` to match the actual runtime shape of the wired shutdown(2) call.

Surfaced by /ship's adversarial subagent.

Co-Authored-By: Claude Opus 4.7 (1M context)

* fix: honor BROWSE_PARENT_PID=0 with trailing whitespace

The strict string compare `process.env.BROWSE_PARENT_PID === '0'` meant any stray newline or whitespace (common from shell `export` in a pipe or heredoc) would fail the check and re-enable the watchdog against the caller's intent.

Switch to parseInt + === 0, matching the server's own parseInt at server.ts:760. Handles '0', '0\n', ' 0 ', and unset correctly; non-numeric values (parseInt returns NaN, NaN === 0 is false) fail safe — watchdog stays active, which is the safe default for unexpected input.

Surfaced by /ship's adversarial subagent.

Co-Authored-By: Claude Opus 4.7 (1M context)

* fix: preserve bun:test sub-APIs in relink test wrapper

The previous commit wrapped bun:test's `test` to bump the per-test timeout default to 15s but cast the wrapper `as typeof _bunTest` without copying the sub-properties (`.only`, `.skip`, `.each`, `.todo`, `.failing`, `.if`) from the original. The cast was a lie: the wrapper was a plain function, not the full callable with those chained properties attached.

The file doesn't use any of them today, but a future test.only or test.skip would fail with a cryptic "undefined is not a function." Object.assign the original _bunTest's properties onto the wrapper so sub-APIs chain correctly forever.

Surfaced by /ship's adversarial subagent.

Co-Authored-By: Claude Opus 4.7 (1M context)

* chore: bump version and changelog (v0.18.1.0)

Co-Authored-By: Claude Opus 4.7 (1M context)

* test: regression tests for parent-process watchdog

End-to-end tests in browse/test/watchdog.test.ts that prove the three invariants v0.18.1.0 depends on.
Each test spawns the real server.ts (not a mock), so any future change that breaks the watchdog logic fails here — the thing /ship's adversarial review flagged as missing. 1. BROWSE_PARENT_PID=0 disables the watchdog Spawns server with PID=0, reads stdout, confirms the "watchdog disabled (BROWSE_PARENT_PID=0)" log line appears and "Parent process ... exited" does NOT. ~2s. 2. BROWSE_HEADED=1 disables the watchdog (server-side guard) Spawns server with BROWSE_HEADED=1 and a bogus parent PID (999999). Proves BROWSE_HEADED takes precedence over a present PID — if the server-side defense-in-depth regresses, the watchdog would try to poll 999999 and fire on the "dead parent." ~2s. 3. Default headless mode: watchdog fires when parent dies The regression guard for the original orphan-prevention behavior. Spawns a real `sleep 60` parent and a server watching its PID, then kills the parent and waits up to 25s for the server to exit. The watchdog polls every 15s so first tick is 0-15s after death, plus shutdown() cleanup. ~18s. Total runtime: ~21s for all 3 tests. They catch the class of bug this branch exists to fix: "does the process live or die when it should?" 
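The parseInt-based `BROWSE_PARENT_PID=0` opt-out check from the earlier commit behaves as sketched below (`honorsOptOut` is a hypothetical helper name; the real cli.ts inlines the comparison):

```typescript
// Sketch of the BROWSE_PARENT_PID=0 opt-out check. parseInt tolerates
// surrounding whitespace ('0\n', ' 0 '), while non-numeric input yields NaN,
// and NaN === 0 is false, so unexpected values fail safe (watchdog stays on).
function honorsOptOut(raw: string | undefined): boolean {
  const pid = parseInt(raw ?? '', 10);
  return pid === 0;
}
```

This is why the old strict string compare failed: `'0\n' === '0'` is false, so a shell-exported value with a trailing newline silently re-enabled the watchdog.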
Co-Authored-By: Claude Opus 4.7 (1M context) --------- Co-authored-by: rocke2020 Co-authored-by: Claude Opus 4.7 (1M context) --- CHANGELOG.md | 14 ++++ TODOS.md | 14 ++++ VERSION | 2 +- browse/src/browser-manager.ts | 29 ++++++- browse/src/cli.ts | 22 +++-- browse/src/server.ts | 29 +++++-- browse/test/watchdog.test.ts | 147 ++++++++++++++++++++++++++++++++++ package.json | 2 +- test/relink.test.ts | 12 ++- 9 files changed, 254 insertions(+), 17 deletions(-) create mode 100644 browse/test/watchdog.test.ts diff --git a/CHANGELOG.md b/CHANGELOG.md index 3cc4f230..75f09431 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,5 +1,19 @@ # Changelog +## [0.18.1.0] - 2026-04-16 + +### Fixed +- **`/open-gstack-browser` actually stays open now.** If you ran `/open-gstack-browser` or `$B connect` and your browser vanished roughly 15 seconds later, this was why: a watchdog inside the browse server was polling the CLI process that spawned it, and when the CLI exited (which it does, immediately, right after launching the browser), the watchdog said "orphan!" and killed everything. The fix disables that watchdog for headed mode, both in the CLI (always set `BROWSE_PARENT_PID=0` for headed launches) and in the server (skip the watchdog entirely when `BROWSE_HEADED=1`). Two layers of defense in case a future launcher forgets to pass the env var. Thanks to @rocke2020 (#1020), @sanghyuk-seo-nexcube (#1018), @rodbland2021 (#1012), and @jbetala7 (#986) for independently diagnosing this and sending in clean, well-documented fixes. +- **Closing the headed browser window now cleans up properly.** Before this release, clicking the X on the GStack Browser window skipped the server's cleanup routine and exited the process directly. That left behind stale sidebar-agent processes polling a dead server, unsaved chat session state, leftover Chromium profile locks (which cause "profile in use" errors on the next `$B connect`), and a stale `browse.json` state file. 
Now the disconnect handler routes through the full `shutdown()` path first, cleans everything, and then exits with code 2 (which still distinguishes user-close from crash). +- **CI/Claude Code Bash calls can now share a persistent headless server.** The headless spawn path used to hardcode the CLI's own PID as the watchdog target, ignoring `BROWSE_PARENT_PID=0` even if you set it in your environment. Now `BROWSE_PARENT_PID=0 $B goto https://...` keeps the server alive across short-lived CLI invocations, which is what multi-step workflows (CI matrices, Claude Code's Bash tool, cookie picker flows) actually want. +- **`SIGTERM` / `SIGINT` shutdown now exits with code 0 instead of 1.** Regression caught during /ship's adversarial review: when `shutdown()` started accepting an `exitCode` argument, Node's signal listeners silently passed the signal name (`'SIGTERM'`) as the exit code, which got coerced to `NaN` and used `1`. Wrapped the listeners so they call `shutdown()` with no args. Your `Ctrl+C` now exits clean again. + +### For contributors +- `test/relink.test.ts` no longer flakes under parallel test load. The 23 tests in that file each shell out to `gstack-config` + `gstack-relink` (bash subprocess work), and under `bun test` with other suites running, each test drifted ~200ms past Bun's 5s default. Wrapped `test` to default the per-test timeout to 15s with `Object.assign` preserving `.only`/`.skip`/`.each` sub-APIs. +- `BrowserManager` gained an `onDisconnect` callback (wired by `server.ts` to `shutdown(2)`), replacing the direct `process.exit(2)` in the disconnect handler. The callback is wrapped with try/catch + Promise rejection handling so a rejecting cleanup path still exits the process instead of leaving a live server attached to a dead browser. +- `shutdown()` now accepts an optional `exitCode: number = 0` parameter, used by the disconnect path (exit 2) and the signal path (default 0). Same cleanup code, two call sites, distinct exit codes. 
+- `BROWSE_PARENT_PID` parsing in `cli.ts` now matches `server.ts`: `parseInt` instead of strict string equality, so `BROWSE_PARENT_PID=0\n` (common from shell `export`) is honored. + ## [0.18.0.1] - 2026-04-16 ### Fixed diff --git a/TODOS.md b/TODOS.md index 0e3ac932..7bb06d01 100644 --- a/TODOS.md +++ b/TODOS.md @@ -1,5 +1,19 @@ # TODOS +## Browse + +### Scope sidebar-agent kill to session PID, not `pkill -f sidebar-agent\.ts` + +**What:** `shutdown()` in `browse/src/server.ts:1193` uses `pkill -f sidebar-agent\.ts` to kill the sidebar-agent daemon, which matches every sidebar-agent on the machine, not just the one this server spawned. Replace with PID tracking: store the sidebar-agent PID when `cli.ts` spawns it (via state file or env), then `process.kill(pid, 'SIGTERM')` in `shutdown()`. + +**Why:** A user running two Conductor worktrees (or any multi-session setup), each with its own `$B connect`, closes one browser window ... and the other worktree's sidebar-agent gets killed too. The blast radius was there before, but the v0.18.1.0 disconnect-cleanup fix makes it more reachable: every user-close now runs the full `shutdown()` path, whereas before user-close bypassed it. + +**Context:** Surfaced by /ship's adversarial review on v0.18.1.0. Pre-existing code, not introduced by the fix. Fix requires propagating the sidebar-agent PID from `cli.ts` spawn site (~line 885) into the server's state file so `shutdown()` can target just this session's agent. Related: `browse/src/cli.ts` spawns with `Bun.spawn(...).unref()` and already captures `agentProc.pid`. 
+
+**Effort:** S (human: ~2h / CC: ~15min)
+**Priority:** P2
+**Depends on:** None
+
 ## Sidebar Security
 
 ### ML Prompt Injection Classifier
diff --git a/VERSION b/VERSION
index d6bda5aa..72ad141a 100644
--- a/VERSION
+++ b/VERSION
@@ -1 +1 @@
-0.18.0.1
+0.18.1.0
diff --git a/browse/src/browser-manager.ts b/browse/src/browser-manager.ts
index 63d78358..6b9242da 100644
--- a/browse/src/browser-manager.ts
+++ b/browse/src/browser-manager.ts
@@ -72,6 +72,12 @@ export class BrowserManager {
   private connectionMode: 'launched' | 'headed' = 'launched';
   private intentionalDisconnect = false;
 
+  // Called when the headed browser disconnects without intentional teardown
+  // (user closed the window). Wired up by server.ts to run full cleanup
+  // (sidebar-agent, state file, profile locks) before exiting with code 2.
+  // Returns void or a Promise; rejections are caught and fall back to exit(2).
+  public onDisconnect: (() => void | Promise<void>) | null = null;
+
   getConnectionMode(): 'launched' | 'headed' { return this.connectionMode; }
 
   // ─── Watch Mode Methods ─────────────────────────────────
@@ -467,13 +467,32 @@ export class BrowserManager {
       await this.newTab();
     }
 
-    // Browser disconnect handler — exit code 2 distinguishes from crashes (1)
+    // Browser disconnect handler — exit code 2 distinguishes from crashes (1).
+    // Calls onDisconnect() to trigger full shutdown (kill sidebar-agent, save
+    // session, clean profile locks + state file) before exit. Falls back to
+    // direct process.exit(2) if no callback is wired up, or if the callback
+    // throws/rejects — never leave the process running with a dead browser.
     if (this.browser) {
       this.browser.on('disconnected', () => {
         if (this.intentionalDisconnect) return;
         console.error('[browse] Real browser disconnected (user closed or crashed).');
         console.error('[browse] Run `$B connect` to reconnect.');
-        process.exit(2);
+        if (!this.onDisconnect) {
+          process.exit(2);
+          return;
+        }
+        try {
+          const result = this.onDisconnect();
+          if (result && typeof (result as Promise<void>).catch === 'function') {
+            (result as Promise<void>).catch((err) => {
+              console.error('[browse] onDisconnect rejected:', err);
+              process.exit(2);
+            });
+          }
+        } catch (err) {
+          console.error('[browse] onDisconnect threw:', err);
+          process.exit(2);
+        }
       });
     }
diff --git a/browse/src/cli.ts b/browse/src/cli.ts
index ae287515..eb58cd7d 100644
--- a/browse/src/cli.ts
+++ b/browse/src/cli.ts
@@ -210,12 +210,20 @@ async function startServer(extraEnv?: Record): Promise): Promise {
 // server can become an orphan — keeping chrome-headless-shell alive and
 // causing console-window flicker on Windows. Poll the parent PID every 15s
 // and self-terminate if it is gone.
+//
+// Headed mode (BROWSE_HEADED=1 or BROWSE_PARENT_PID=0): The user controls
+// the browser window lifecycle. The CLI exits immediately after connect,
+// so the watchdog would kill the server prematurely. Disabled in both cases
+// as defense-in-depth — the CLI sets PID=0 for headed mode, and the server
+// also checks BROWSE_HEADED in case a future launcher forgets.
+// Cleanup happens via browser disconnect event or $B disconnect.
 const BROWSE_PARENT_PID = parseInt(process.env.BROWSE_PARENT_PID || '0', 10);
-if (BROWSE_PARENT_PID > 0) {
+const IS_HEADED_WATCHDOG = process.env.BROWSE_HEADED === '1';
+if (BROWSE_PARENT_PID > 0 && !IS_HEADED_WATCHDOG) {
   setInterval(() => {
     try {
       process.kill(BROWSE_PARENT_PID, 0); // signal 0 = existence check only, no signal sent
@@ -767,6 +775,10 @@ if (BROWSE_PARENT_PID > 0) {
       shutdown();
     }
   }, 15_000);
+} else if (IS_HEADED_WATCHDOG) {
+  console.log('[browse] Parent-process watchdog disabled (headed mode)');
+} else if (BROWSE_PARENT_PID === 0) {
+  console.log('[browse] Parent-process watchdog disabled (BROWSE_PARENT_PID=0)');
 }
 
 // ─── Command Sets (from commands.ts — single source of truth) ───
@@ -793,6 +805,10 @@ function emitInspectorEvent(event: any): void {
 // ─── Server ────────────────────────────────────────────────────
 const browserManager = new BrowserManager();
+// When the user closes the headed browser window, run full cleanup
+// (kill sidebar-agent, save session, remove profile locks, delete state file)
+// before exiting with code 2. Exit code 2 distinguishes user-close from crashes (1).
+browserManager.onDisconnect = () => shutdown(2);
 let isShuttingDown = false;
 
 // Test if a port is available by binding and immediately releasing.
@@ -1180,7 +1196,7 @@ async function handleCommand(body: any, tokenInfo?: TokenInfo | null): Promise
-process.on('SIGTERM', shutdown);
-process.on('SIGINT', shutdown);
+process.on('SIGTERM', () => shutdown());
+process.on('SIGINT', () => shutdown());
 // Windows: taskkill /F bypasses SIGTERM, but 'exit' fires for some shutdown paths.
 // Defense-in-depth — primary cleanup is the CLI's stale-state detection via health check.
 if (process.platform === 'win32') {
diff --git a/browse/test/watchdog.test.ts b/browse/test/watchdog.test.ts
new file mode 100644
index 00000000..1a6fd9af
--- /dev/null
+++ b/browse/test/watchdog.test.ts
@@ -0,0 +1,147 @@
+import { describe, test, expect, afterEach } from 'bun:test';
+import { spawn, type Subprocess } from 'bun';
+import * as path from 'path';
+import * as fs from 'fs';
+import * as os from 'os';
+
+// End-to-end regression tests for the parent-process watchdog in server.ts.
+// Proves three invariants that the v0.18.1.0 fix depends on:
+//
+// 1. BROWSE_PARENT_PID=0 disables the watchdog (opt-in used by CI and pair-agent).
+// 2. BROWSE_HEADED=1 disables the watchdog (server-side defense-in-depth).
+// 3. Default headless mode still kills the server when its parent dies
+//    (the original orphan-prevention must keep working).
+//
+// Each test spawns the real server.ts, not a mock. Tests 1 and 2 verify the
+// code path via stdout log line (fast). Test 3 waits for the watchdog's 15s
+// poll cycle to actually fire (slow — ~25s).
+
+const ROOT = path.resolve(import.meta.dir, '..');
+const SERVER_SCRIPT = path.join(ROOT, 'src', 'server.ts');
+
+let tmpDir: string;
+let serverProc: Subprocess | null = null;
+let parentProc: Subprocess | null = null;
+
+afterEach(async () => {
+  // Kill any survivors so subsequent tests get a clean slate.
+  try { parentProc?.kill('SIGKILL'); } catch {}
+  try { serverProc?.kill('SIGKILL'); } catch {}
+  // Give processes a moment to exit before tmpDir cleanup.
+  await Bun.sleep(100);
+  try { fs.rmSync(tmpDir, { recursive: true, force: true }); } catch {}
+  parentProc = null;
+  serverProc = null;
+});
+
+function spawnServer(env: Record<string, string>, port: number): Subprocess {
+  const stateFile = path.join(tmpDir, 'browse-state.json');
+  return spawn(['bun', 'run', SERVER_SCRIPT], {
+    env: {
+      ...process.env,
+      BROWSE_STATE_FILE: stateFile,
+      BROWSE_PORT: String(port),
+      ...env,
+    },
+    stdio: ['ignore', 'pipe', 'pipe'],
+  });
+}
+
+function isProcessAlive(pid: number): boolean {
+  try {
+    process.kill(pid, 0); // signal 0 = existence check, no signal sent
+    return true;
+  } catch {
+    return false;
+  }
+}
+
+// Read stdout until we see the expected marker or timeout. Returns the captured
+// text. Used to verify the watchdog code path ran as expected at startup.
+async function readStdoutUntil(
+  proc: Subprocess,
+  marker: string,
+  timeoutMs: number,
+): Promise<string> {
+  const deadline = Date.now() + timeoutMs;
+  const decoder = new TextDecoder();
+  let captured = '';
+  const reader = (proc.stdout as ReadableStream<Uint8Array>).getReader();
+  try {
+    while (Date.now() < deadline) {
+      const readPromise = reader.read();
+      const timed = Bun.sleep(Math.max(0, deadline - Date.now()));
+      const result = await Promise.race([readPromise, timed.then(() => null)]);
+      if (!result || result.done) break;
+      captured += decoder.decode(result.value);
+      if (captured.includes(marker)) return captured;
+    }
+  } finally {
+    try { reader.releaseLock(); } catch {}
+  }
+  return captured;
+}
+
+describe('parent-process watchdog (v0.18.1.0)', () => {
+  test('BROWSE_PARENT_PID=0 disables the watchdog', async () => {
+    tmpDir = fs.mkdtempSync(path.join(os.tmpdir(), 'watchdog-pid0-'));
+    serverProc = spawnServer({ BROWSE_PARENT_PID: '0' }, 34901);
+
+    const out = await readStdoutUntil(
+      serverProc,
+      'Parent-process watchdog disabled (BROWSE_PARENT_PID=0)',
+      5000,
+    );
+    expect(out).toContain('Parent-process watchdog disabled (BROWSE_PARENT_PID=0)');
+    // Control: the "parent exited, shutting down" line must NOT appear —
+    // that would mean the watchdog ran after we said to skip it.
+    expect(out).not.toContain('Parent process');
+  }, 15_000);
+
+  test('BROWSE_HEADED=1 disables the watchdog (server-side guard)', async () => {
+    tmpDir = fs.mkdtempSync(path.join(os.tmpdir(), 'watchdog-headed-'));
+    // Pass a bogus parent PID to prove BROWSE_HEADED takes precedence.
+    // If the server-side guard regresses, the watchdog would try to poll
+    // this PID and eventually fire on the "dead parent."
+    serverProc = spawnServer(
+      { BROWSE_HEADED: '1', BROWSE_PARENT_PID: '999999' },
+      34902,
+    );
+
+    const out = await readStdoutUntil(
+      serverProc,
+      'Parent-process watchdog disabled (headed mode)',
+      5000,
+    );
+    expect(out).toContain('Parent-process watchdog disabled (headed mode)');
+    expect(out).not.toContain('Parent process 999999 exited');
+  }, 15_000);
+
+  test('default headless mode: watchdog fires when parent dies', async () => {
+    tmpDir = fs.mkdtempSync(path.join(os.tmpdir(), 'watchdog-default-'));
+
+    // Spawn a real, short-lived "parent" that the watchdog will poll.
+    parentProc = spawn(['sleep', '60'], { stdio: ['ignore', 'ignore', 'ignore'] });
+    const parentPid = parentProc.pid!;
+
+    // Default headless: no BROWSE_HEADED, real parent PID — watchdog active.
+    serverProc = spawnServer({ BROWSE_PARENT_PID: String(parentPid) }, 34903);
+    const serverPid = serverProc.pid!;
+
+    // Give the server a moment to start and register the watchdog interval.
+    await Bun.sleep(2000);
+    expect(isProcessAlive(serverPid)).toBe(true);
+
+    // Kill the parent. The watchdog polls every 15s, so the first tick after
+    // parent death lands within ~15s, plus shutdown() cleanup time.
+    parentProc.kill('SIGKILL');
+
+    // Poll for up to 25s for the server to exit.
+    const deadline = Date.now() + 25_000;
+    while (Date.now() < deadline) {
+      if (!isProcessAlive(serverPid)) break;
+      await Bun.sleep(500);
+    }
+    expect(isProcessAlive(serverPid)).toBe(false);
+  }, 45_000);
+});
diff --git a/package.json b/package.json
index bbc1a6d1..68edadf1 100644
--- a/package.json
+++ b/package.json
@@ -1,6 +1,6 @@
 {
   "name": "gstack",
-  "version": "0.18.0.1",
+  "version": "0.18.1.0",
   "description": "Garry's Stack — Claude Code skills + fast headless browser. One repo, one install, entire AI engineering workflow.",
   "license": "MIT",
   "type": "module",
diff --git a/test/relink.test.ts b/test/relink.test.ts
index d0c48f19..e5cd5206 100644
--- a/test/relink.test.ts
+++ b/test/relink.test.ts
@@ -1,9 +1,19 @@
-import { describe, test, expect, beforeEach, afterEach } from 'bun:test';
+import { describe, test as _bunTest, expect, beforeEach, afterEach } from 'bun:test';
 import { execSync } from 'child_process';
 import * as fs from 'fs';
 import * as path from 'path';
 import * as os from 'os';

+// Every test in this file shells out to gstack-config + gstack-relink (bash scripts
+// invoking subprocess work). Under parallel bun test load, subprocess spawn contends
+// with other suites and each test can drift ~200ms past the 5s default. Bump to 15s.
+// Object.assign preserves test.only / test.skip / test.each / test.todo sub-APIs.
+const test = Object.assign(
+  ((name: any, fn: any, timeout?: number) =>
+    _bunTest(name, fn, timeout ?? 15_000)) as typeof _bunTest,
+  _bunTest,
+);
+
 const ROOT = path.resolve(import.meta.dir, '..');
 const BIN = path.join(ROOT, 'bin');
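Reviewer note: the sync-or-async handler guard added to browser.ts can be exercised in isolation. The sketch below is illustrative only — `invokeSafely` and `DisconnectHandler` are hypothetical names, not part of the codebase — and shows why both the `try/catch` and the `.catch` duck-type check are needed: a handler may throw synchronously or return a rejecting promise.

```typescript
type DisconnectHandler = () => void | Promise<void>;

// Run a handler that may be sync or async; funnel synchronous throws
// and asynchronous rejections into one error callback.
function invokeSafely(handler: DisconnectHandler, onError: (err: unknown) => void): void {
  try {
    const result = handler();
    // Duck-type check: a thenable with .catch means an async failure is possible.
    if (result && typeof (result as Promise<void>).catch === 'function') {
      (result as Promise<void>).catch(onError);
    }
  } catch (err) {
    onError(err); // synchronous throw
  }
}

const errors: string[] = [];
invokeSafely(() => { throw new Error('sync boom'); }, (e) => errors.push((e as Error).message));
invokeSafely(async () => { throw new Error('async boom'); }, (e) => errors.push((e as Error).message));

// The async rejection lands on a later microtask, after the sync one.
queueMicrotask(() => console.log(errors.join(','))); // → "sync boom,async boom"
```

Without the `.catch` attachment, the async case would surface as an unhandled rejection instead of reaching the `process.exit(2)` fallback.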
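The `Object.assign` wrapper trick in relink.test.ts generalizes to any callable with attached sub-APIs. A minimal standalone sketch, using a stand-in for bun:test's `test` (all names here are hypothetical):

```typescript
// Stand-in for a test runner's callable-with-properties API.
type Runner = ((name: string, fn: () => void, timeout?: number) => string) & {
  skip: (name: string) => string;
};

const baseRunner: Runner = Object.assign(
  (name: string, _fn: () => void, timeout?: number) => `${name}@${timeout ?? 5000}`,
  { skip: (name: string) => `skipped:${name}` },
);

// Wrap the callable to change the default timeout, then copy the original's
// own enumerable properties so .skip (etc.) still works on the wrapper.
const runner: Runner = Object.assign(
  ((name: string, fn: () => void, timeout?: number) =>
    baseRunner(name, fn, timeout ?? 15_000)) as Runner,
  baseRunner,
);

console.log(runner('a', () => {}));       // → "a@15000" — new default applied
console.log(runner('b', () => {}, 1000)); // → "b@1000" — explicit timeout wins
console.log(runner.skip('c'));            // → "skipped:c" — sub-API preserved
```

The design choice matters because a plain wrapper function would silently drop `test.only` / `test.skip` / `test.each`, breaking any test that uses them.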