gstack

mirror of https://github.com/garrytan/gstack.git synced 2026-08-02 12:28:36 +02:00

Author	SHA1	Message	Date
Garry Tan GitHub Claude Opus 4.7	cf50443b63	v1.45.0.0 feat(design): persistent board daemon — 24h boards, one tab, board history (#1710 ) * refactor(design): board JS uses relative paths; drop __GSTACK_SERVER_URL injection Board JS in design/src/compare.ts now calls ./api/feedback and ./api/progress (relative to location.pathname) and feature-detects server mode via location.protocol instead of the injected window.__GSTACK_SERVER_URL global. The injection in design/src/serve.ts is removed (dead code now that nothing reads it). Tests updated to match the new contract: serve.test.ts asserts the relative-path JS is present and the global is gone; feedback-roundtrip asserts location.protocol detects HTTP mode. Why: prep for the multi-board daemon (design/src/daemon.ts upcoming) where the same generated HTML is served at /boards/<id>/ instead of /. Relative paths resolve against location.pathname in both cases, so one HTML, two hosts. The injection was the only thing tying board JS to a specific serving path; removing it unblocks the daemon work without forking the generator. file:// fallback preserved via the location.protocol feature-detect — board opened directly as a file still falls through to the DOM-only success path. The 6 feedback-roundtrip browser tests continue to fail with session.clearLoadedHtml undefined; that failure pre-exists this branch (verified against HEAD with these edits stashed) and lives in browse/src/write-commands.ts, not in the design code path. Tracking separately. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(design): reload guard rejects directory paths design/src/serve.ts:200-212 used to accept a path that resolved to the allowedDir itself (the OR branch `\|\| resolvedReload === allowedDir`), which then crashed readFileSync with EISDIR. Now: 1. startsWith(allowedDir + path.sep) must pass — rejects the dir itself and anything outside (403). 2. statSync(resolvedReload).isFile() must pass — rejects subdirectories inside allowedDir with a clear "Path must be a file" 400. The test stub in serve.test.ts mirrors prod; both updated, plus two new test cases for the previously-broken paths. Codex caught this in the plan-review pass; it's a latent bug in shipping code, not a regression from the daemon work. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(design): introduce design daemon — multi-board persistent server Adds design/src/daemon.ts: a Bun.serve daemon that hosts many boards under /boards/<id>/ instead of one server per `$D compare --serve` call. Spawned by daemon-client (next commit); for now wired only via tests. Endpoint table: GET /health liveness + version + counts (unauth) GET / index of recent boards POST /api/boards publish; daemon derives sourceDir from realpath(html). body sourceDir IGNORED (Codex trust-boundary fix). POST /shutdown graceful; refuses if active boards exist (Codex data-loss fix) GET /boards/<id> 301 → /boards/<id>/ (trailing slash is load-bearing — relative URLs in board JS resolve against pathname) GET /boards/<id>/ render board HTML GET /boards/<id>/api/progress state machine status (no idle reset) POST /boards/<id>/api/feedback submit/regen; writes feedback.json or feedback-pending.json with boardId + publishedAt augmented in POST /boards/<id>/api/reload swap HTML; per-board allowedDir guard rejects traversal, directories, out-of-allowed-dir symlinks Lifecycle: - 24h idle timeout (DESIGN_DAEMON_IDLE_MS for tests). - Idle with active boards extends 1h up to 4x, then force-shuts (Codex). - LRU cap 50 boards; evicts done before non-done; 503 when 50 non-done. - Per-board async mutex serializes feedback POST vs reload POST. - SIGTERM/SIGINT/uncaughtException → graceful shutdown, state file unlink. - Stdout: DAEMON_STARTED port=<N> (the line the client parses). Shared utilities live in design/src/daemon-state.ts: atomic state-file write/read (mode 0o600), fs.openSync('wx') lock, isProcessAlive, cmdline identity verification (/proc on Linux, ps on macOS), CMDLINE_MARKER constant. Modeled on browse/src/cli.ts lock + spawn patterns. design/test/daemon.test.ts: 30 tests, all green. Covers every endpoint, both error paths and happy paths, cross-board feedback isolation, the trailing-slash redirect, the directory-not-file reload rejection, LRU preferring done over non-done, /shutdown refusal with active boards, all path-traversal guards. Uses the exported fetchHandler in-process (no spawn) so the suite runs in ~70ms. design/test/daemon-tests-fixtures.ts: shared helpers — req() builder, tmp-dir helpers, daemon reset, and a spawnDaemonForTest() helper used by the next commit's discovery tests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(design): daemon-client with lock + identity-verified spawn design/src/daemon-client.ts implements the CLI side of the daemon lifecycle: ensureDaemon() (the spawn-or-attach decision), publishBoard(), and the $D daemon stop\|status helpers. Modeled on browse/src/cli.ts:317-415 — same health-check-first attach, same fs.openSync('wx') lock, same re-read-state-INSIDE-the-lock guard against two CLIs both deciding "no daemon, spawn." Two design-specific safety properties added beyond browse: 1. verifyIdentity before any SIGTERM/SIGKILL. Reads the running process's cmdline (/proc/PID/cmdline on Linux, `ps -p PID -o command=` on macOS) and only signals if it contains CMDLINE_MARKER ("gstack-design-daemon", passed as argv at spawn time). Prevents a stale state file from causing us to kill an unrelated process that inherited the PID. 2. Refuse-kill-with-active-boards on version mismatch. Browse silently restarts; here in-memory board history would vanish, so the client prints a user-actionable WARNING and exit 1 instead. Users explicitly `$D daemon stop` to override. Spawn uses Node child_process.spawn (NOT Bun.spawn().unref) because of the macOS session-detach quirks browse already discovered. Stdio is redirected to ~/.gstack/design-daemon-startup.log, which the client tails into stderr if waitForHealthOrError times out — no more silent "daemon failed for some unknowable reason." daemon-state.ts gains DESIGN_DAEMON_STATE_FILE env override so tests can point both client and spawned daemon at a per-test path without a shared cwd. design/test/daemon-discovery.test.ts: 17 tests, all green in ~8s. Covers: spawn-fresh, attach-existing, stale-state-file (pid dead), PID-reuse safety (uses the test runner's own PID as the bait — verifyIdentity catches the cmdline mismatch, daemon not signaled), version-mismatch with/without active boards (the active-boards case runs a subprocess and asserts exit 1 + WARNING in stderr), publishBoard 200 + 409, shutdownDaemon refuse/force/unresponsive paths, daemonStatus. The daemon-discovery suite is split out of daemon.test.ts because each real spawn costs ~200ms; the in-process daemon.test.ts (30 tests, 70ms) covers the same handler logic without the spawn overhead. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(design): wire daemon dispatch into CLI; add daemon stop/status design/src/cli.ts now branches on --no-daemon for both `compare --serve` and standalone `serve --html`. Default path: ensureDaemon → publishBoard → openBrowser → exit. The legacy single-process serve() is preserved behind --no-daemon for tests, Windows, and explicit debugging. Adds $D daemon status (prints daemon state JSON, or {running:false}) and $D daemon stop [--force] (refuses with active boards unless --force). parseArgs gains a `positionals` field so daemon sub-commands work naturally (`$D daemon stop` instead of `$D --action stop`). Stderr lines printed by the publishToDaemon path: DAEMON_STARTED port=N (or DAEMON_ATTACHED port=N) BOARD_PUBLISHED: <url> BOARD_URL: <url> (alias for grep-friendliness) Stdout: JSON with id, url, sourceDir. design/src/commands.ts: --no-daemon, --title added to compare + serve; new daemon command entry with status\|stop sub-commands. End-to-end smoke (manual): spawning a board via $D serve, hitting the returned URL, reading /health, calling daemon status (returns the right JSON), and daemon stop refusing because of the active board — all work as designed. Force-stop tears down cleanly and removes the state file. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(design): end-to-end daemon round-trip via HTTP fetch design/test/feedback-roundtrip-daemon.test.ts walks the full publish → submit / regenerate / reload cycle against a real spawned daemon, using the same HTTP calls the board JS makes. Four tests, all green in ~650ms. Covers what design-shotgun and friends actually depend on: - Submit writes feedback.json into the board's sourceDir with the augmented boardId + publishedAt fields. - GET /boards/<id> (no slash) returns a 301 to /boards/<id>/ — the load-bearing redirect that lets the board JS use relative paths. - Regenerate writes feedback-pending.json, flips state to regenerating, /api/progress reflects it; /api/reload swaps HTML in place; round-2 submit writes the final feedback.json with the round-2 selection. - Two boards published into the same daemon get independent URLs on the same port — feedback for board A doesn't contaminate board B's sourceDir, both URLs serve their own content, the index lists both. Uses HTTP fetch rather than a real browser because the existing browser round-trip (feedback-roundtrip.test.ts) is broken on a pre-existing browse harness regression (session.clearLoadedHtml undefined in browse/src/write-commands.ts:149) that's unrelated to this branch. The HTTP path proves the same daemon semantics; a browser variant can be added once the browse harness is fixed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(design): compiled binary self-execs as daemon; unified version lookup Two small but production-critical fixes once the binary actually runs: 1. Compiled binary couldn't spawn the daemon. daemon-client previously pointed at design/src/daemon.ts via import.meta.dir — fine in dev, fatal in production (the source path doesn't exist on a user's machine). Fix: design CLI now self-execs in --daemon-mode when invoked with that flag, so the spawn is `process.execPath --daemon-mode --marker gstack-design-daemon` for the compiled binary and `bun run cli.ts --daemon-mode ...` in dev. Same one binary, two modes, no separate daemon entrypoint to ship. 2. Client and daemon disagreed on VERSION in the compiled binary. Both used a source-tree-relative path that resolves to "unknown" at runtime, which silently shorted the version-mismatch refusal path (client expected "unknown" + daemon reported "unknown" → match → no refusal even when DESIGN_DAEMON_VERSION was set on one side). New readVersionString() consults DESIGN_DAEMON_VERSION env first, then design/dist/.version (sidecar baked at build time by build.sh), then VERSION at the source-tree root. Both client and daemon now go through this one helper. Manual smoke (compiled binary, all checks green): - DAEMON_STARTED + BOARD_PUBLISHED with trailing slash - GET /boards/<id> (no slash) → 301 Location /boards/<id>/ - Second `$D serve` invocation → DAEMON_ATTACHED, new board on same port - feedback.json gets boardId + publishedAt fields - DESIGN_DAEMON_VERSION=v2-different on second invocation with active board → WARNING + "Refusing to auto-kill" + exit 1, original daemon still alive - `$D daemon stop --force` removes state file All 67 design tests still green after the refactor (16 serve + 30 daemon + 17 discovery + 4 daemon round-trip). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(design): skill resolvers learn the daemon's BOARD_URL output The five skills that invoke $D compare --serve (design-shotgun, design-consultation, plan-design-review, office-hours, design-review) parsed `SERVE_STARTED: port=N` from stderr and then POSTed to `/api/reload` at that port during regenerate cycles. The new daemon hosts boards under `/boards/<id>/` so the reload endpoint moved to `<BOARD_URL>api/reload` — without this update, the regenerate phase of every skill invocation would silently fail against daemon mode. Updated scripts/resolvers/design.ts to parse `BOARD_URL:` instead of the port, and to POST reloads against the per-board URL. Regenerated the four SKILL.md files via bun run gen:skill-docs. Legacy `--no-daemon` invocations continue to emit `SERVE_STARTED:` and serve at `/api/reload` — the resolver instructions note both. Surfaced by the maintainability specialist during /ship review (the "stale comment" finding was actually a behavior bug pointing at five downstream consumers). Codex's plan-review pass flagged the migration story as incomplete but I dismissed the concern — Codex was right. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(design): emit SERVE_STARTED back-compat alias; drop dead import design/src/cli.ts publishToDaemon now emits `SERVE_STARTED: port=N html=<path>` as a third stderr line alongside DAEMON_STARTED/DAEMON_ATTACHED + BOARD_URL. Any out-of-tree script that grepped the legacy line still gets the port — they'd still fail at the reload step (the endpoint moved to /boards/<id>/ api/reload) but they no longer fail at the port-detection step. Combined with the resolver updates one commit back, this is belt-and-suspenders compat. Fixed the stale docstring at cli.ts:316 that claimed back-compat without actually emitting the alias. The maintainability specialist flagged it. Dropped a dead `DaemonState` import from daemon-client.ts. Same review pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: bump version and changelog (v1.45.0.0) Design boards now live 24h, not 10 minutes. One daemon hosts every board, one tab survives the whole day. See CHANGELOG.md for the full release summary + metrics + itemized changes. TODOS.md gains a "design daemon: follow-ups" section capturing the P3 test gaps + maintainability nits the /ship review army flagged but that aren't blocking for this release. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(design): fill daemon test gaps surfaced by ship review army Adds 10 net new tests (and removes 1 misleading smoke) for the gaps the testing specialist flagged at /ship time. Filed as P3 TODOs at ship, filling now per boil-the-lake. design/test/daemon-discovery.test.ts (+6 tests, +1 import): - "idle daemon (no boards) shuts itself down after IDLE_MS + CHECK_MS" Spawn-based, DESIGN_DAEMON_IDLE_MS=2000, CHECK_MS=200. Waits for the daemon process to actually exit and asserts the state file is removed. Previously only "callable without throwing" was tested. - "bare GET polling does NOT prevent idle shutdown" Hammers /api/progress every 200ms in a background loop with a done board, asserts the daemon still idles out — proves the meaningful-activity-only-on-POSTs guard (Codex finding) actually works. - "idle with active (non-done) boards triggers extension instead of shutdown" Sets DESIGN_DAEMON_EXTENSION_MS=1500 + MAX_EXTENSIONS=2, publishes a non-done board, asserts the daemon survives past IDLE_MS (extends), then verifies the MAX_EXTENSIONS hard ceiling force-shuts. Both the extension counter and the hard ceiling were previously untested. - "two parallel ensureDaemon() calls converge on one daemon" Fires two ensureDaemon calls in Promise.all against an empty stateFile, asserts: both ports match, exactly one spawned=true, exactly one daemon alive, no orphaned lock file. The discovery-test file's own docstring claimed this test existed; now it actually does. - "acquireLock reclaims a lockfile owned by a dead PID" Plants a lockfile with PID 999999998, calls acquireLock, asserts the returned release fn is non-null and the lock now holds our PID. - "acquireLock refuses to reclaim a lockfile owned by an alive PID" Uses the test runner's own PID — alive but not the lock's intended owner. Asserts acquireLock returns null and leaves the lockfile untouched. The unrelated-process-PID-reuse safety guard. design/test/daemon.test.ts (-2 misleading, +5 new = +3 net): - Removed: "bare GET /api/progress does NOT reset meaningful activity" (smoke pretending to be behavioral — body comment admitted it couldn't verify). Replaced by the spawn-based version in daemon-discovery above. - Removed: "idleCheckTick is callable without throwing when there's no idle" (collapsed into a single smoke describe that's clearer about its scope). - Added: "POST /api/boards rejects invalid JSON body" - Added: "POST /api/boards rejects non-object body (e.g. JSON null)" - Added: "POST /api/boards: array body falls through to missing-html 400" (documents the typeof-array-is-object JS quirk; will surface if we ever tighten the type check) - Added: "POST /boards/<id>/api/reload rejects invalid JSON body" - Added: "POST /boards/<id>/api/reload rejects body missing html field" Per-file totals after: serve 16, daemon 34, discovery 23, round-trip 4 = 77. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: update CHANGELOG + TODOS for filled test gaps in v1.45.0.0 Bumps the design test count from 67 → 77 (and the new-test delta from +51 → +61) to reflect commit `6b037c55`, which filled the 5 P3 test gaps the /ship review army had filed to TODOS.md. Marks the "Tighten daemon test coverage" entry in TODOS.md as DONE. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 20:45:12 -07:00
+2 Garry Tan GitHub Claude Opus 4.7 Jayesh Betala Andrey Esipov David Foy mikeangstadt 0xDevNinja techcenter68 shohu Bharat Matteo Hertel	66f3a180d3	v1.43.2.0 fix wave: post-Daegu paper-cut — 18 fixes, 28 bisect commits (#1642 ) * fix(gbrain-sync): --full produces an empty code index on first run of a new repo `gbrain reindex-code` only RE-EMBEDS pages that already exist; it never walks the filesystem. On a freshly-registered source (0 pages), a --full run that called reindex-code alone found nothing ("No code pages to reindex"), finished in ~1s, and left the code index permanently empty while still reporting OK. Fix: --full now runs `sync --strategy code` FIRST to create pages via the file walk, then runs `reindex-code` to honor the documented "full walk + reindex" contract for both fresh and populated sources. Contributed by @jetsetterfl via #1584. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(gbrain-local-status): classifier falsely reports broken-db inside repos with their own DATABASE_URL The freshClassify probe ran `gbrain sources list --json` with the inherited process env. When the probe ran from inside a repo with its own .env (an app DATABASE_URL on a different port), Bun autoloaded the project's .env, gbrain connected to the wrong database, and the classifier reported broken-db on otherwise-healthy brains. Fix: route the probe env through `buildGbrainEnv` from lib/gbrain-exec, the same helper the sync orchestrator uses. DATABASE_URL is seeded from ~/.gbrain/config.json so the result is cwd-independent. The 60s cache can no longer propagate a poisoned negative to clean directories. Contributed by @jetsetterfl via #1583. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(retro): stale-base + bad-today-anchor pre-flight guard (#1624) /retro silently produced confidently-wrong output when "today" drifted (model session-context error) or when origin/<default> was materially behind the actual remote — git log --since returned zero or near-zero commits and the narrative was fabricated from nothing. Adds Step 0.5 with four ordered pre-check branches before any window analysis: A. No 'origin' remote → skip with "base freshness not verified" note B. Detached HEAD → skip with "base freshness not verified" note C. `git fetch origin <default>` fails (offline) → warn, proceed against last-known origin/<default> D. Fetch succeeded → compare today vs latest origin/<default> commit; if gap > window-days, BLOCK with explicit citation of latest-commit date. Skip paths still proceed to Step 1, but the disclosure is carried into the retro narrative ("offline run, window not freshness-verified") so the output is never silently confidently-wrong. Atomic .tmpl + gen:skill-docs regen commit (T-Codex-3 pattern). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(retro): regression for #1624 stale-base pre-flight guard 13 static-invariant tests pinning the four ordered pre-check branches in retro/SKILL.md.tmpl:Step 0.5: A. no-remote skip — must check origin presence + set verdict B. detached-HEAD skip — must gate behind prior verdict (ordering) C. fetch-fail warn — must match `if !` or `\|\|` shape, gate by verdict D. stale-base BLOCK — must read latest-commit ISO date, cite remediation Plus a disclosure-survives-to-narrative invariant: skip-path verdicts must be named in prose so the retro output carries the cited reason rather than silently misreporting. Failing build if Step 0.5 is removed, branches re-ordered (no-remote no longer wins), or the BLOCK message stops citing today/latest-commit/remediation path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(gbrain-sync): configurable timeouts + resume from gbrain checkpoint (#1611) The memory and code stages hardcoded a 35-min spawn timeout. On brains with ~2000+ staged files, /sync-gbrain --full reliably SIGTERM'd the child at exactly 35 minutes with exit 143. gbrain left ~/.gbrain/import-checkpoint.json pointing at the staging dir, but gstack-memory-ingest's SIGTERM handler unconditionally cleaned the dir up — so the next run found a checkpoint pointing at nothing and restaged from scratch, repeating the SIGTERM forever. Three changes: 1. Configurable timeouts via env (bounds 60_000ms - 86_400_000ms, default 2_100_000ms = 35min unchanged): GSTACK_SYNC_MEMORY_TIMEOUT_MS GSTACK_SYNC_CODE_TIMEOUT_MS Out-of-range or non-numeric values warn and fall back to the default. 2. SIGTERM in gstack-memory-ingest no longer always cleans up the staging dir. If gbrain has written ~/.gbrain/import-checkpoint.json pointing at the active staging dir, the dir is PRESERVED for next-run resume. Otherwise (no checkpoint pointing here, crash before gbrain ever touched it) it's cleaned up as before. 3. Next /sync-gbrain run detects gbrain's checkpoint via decideResume() in gstack-gbrain-sync.ts: - no checkpoint → fresh ingest pass - checkpoint + staging ok → set GSTACK_INGEST_RESUME_DIR; child reuses staging dir and skips writeStaged; gbrain import resumes from processedIndex+1 - checkpoint + staging gone → warn "previous checkpoint stale (staging dir gone), restaging from scratch" and proceed Reuses gbrain's own checkpoint as the source of truth (D1 — no double-store state). Detect-then-fallback semantics per C1. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(gbrain-sync): regression for #1611 timeouts + resume 19 tests across three surfaces: - resolveStageTimeoutMs (10 tests): undefined/empty → default; non-numeric, zero, negative, below-floor, above-ceiling → warn + default; at-floor, at-ceiling, valid mid-range → accepted as-is. - decideResume (6 tests): no checkpoint, corrupt JSON, checkpoint + staging ok, checkpoint + staging missing, checkpoint with no dir, checkpoint with empty dir. - SIGTERM staging preservation (3 static invariants): memory-ingest signal handler must check stagingDirIsCheckpointed BEFORE cleanup; preserve branch must come before cleanup branch (ordering); orchestrator must pass GSTACK_INGEST_RESUME_DIR to the grandchild on resume. Also threads process.env.HOME through readGbrainCheckpoint and stagingDirIsCheckpointed so tests can redirect home. os.homedir() caches at process start and ignores later mutation, so the env override is the only reliable test injection point. Failing build if the timeout bounds are removed, the resume detection short-circuits incorrectly, or the SIGTERM handler regresses to unconditional cleanup. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(review): pre-emit verification gate kills Django-shape FP class (#1539) External user filed 4/8 false positives on a /review run against a Django + DRF + PostgreSQL repo (Sprint 2.5). Every FP class was the same shape: "resolvable in <5 minutes by viewing the actual code or running a simple grep" — fields that don't exist on the model, dict.get()-might-be-None on a form that returns {}-initialized cleaned_data, standard ORM save behavior called out as data loss. Extends the Confidence Calibration resolver (consumed by review, cso, plan-eng-review, ship) with a Pre-emit verification gate: Every finding MUST quote the specific code line that motivates it (file:line + verbatim text). If the reviewer cannot produce the quote, the finding is unverified — its confidence is forced to 4-5 so the existing "Suppress from main report" rule fires automatically. The finding still goes to the appendix for calibration audit, but the user does not see it in the critical-pass output. Reuses the existing suppression mechanism — no new code path. The FP classes the gate kills are enumerated in the resolver text so reviewers see the named patterns. Framework-meta nudge included for Django Meta, Rails associations, SQLAlchemy relationships, TypeORM decorators, Sequelize init, Prisma generated client — the reviewer must quote the meta-construct that generates the symbol, not just grep for the literal name. Deeper framework-aware ORM verification (model introspection, migration-history- aware checks) is deliberately deferred to a future wave per T-Codex-2. Atomic .tmpl-equivalent (resolver) edit + gen:skill-docs regen commit per T-Codex-3. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(review): regression for #1539 pre-emit verification gate 12 tests pinning the gate behavior: - Resolver emits the gate header + #1539 reference - Gate requires quoting file:line + verbatim text - Unverified findings forced to confidence 4-5 (auto-suppress via existing <7-rule, no new mechanism) - Framework-meta nudge names Django, Rails, SQLAlchemy, TypeORM, Sequelize, Prisma - Deferred design doc reference present (1539-framework-aware-review.md) - Four named FP classes from #1539 enumerated: * field doesn't exist on model * dict.get() might be None * save() might lose fields * update_fields might miss X - All four downstream SKILL.md consumers (review, cso, plan-eng-review, ship) carry the gate text after gen:skill-docs - Existing confidence 9-10 'Show normally' + 3-4 'Suppress' rows unchanged (regression on existing behavior) Failing build if the gate is removed, the suppression mechanism is re-invented separately, the framework-meta nudge drops a framework, or gen:skill-docs stops propagating the gate to consumers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(config): expose explain_level default * fix(benchmark): parse positional prompt after flags * fix(artifacts): reject malformed remote paths * fix(learnings): preserve current entries in cross-project search * fix(setup): register root gstack slash alias * fix(memory): probe gitleaks without shell builtin * fix(gbrain-lib): pin LC_ALL=C in varname validator (macOS locale guard) In many macOS shells the default locale (e.g. en_US.UTF-8) makes bash glob brackets like `[A-Z]` match lowercase letters too, so the existing `case "$name" in [A-Z_][A-Z0-9_])` branch lets names like `lower-case` through validation. The function then trips `printf -v "$varname"` and `export "$varname"` with `not a valid identifier` errors that surface mid-prompt, which is exactly what the validator was supposed to prevent. Pinning `LC_ALL=C` inside the function gives ASCII-only bracket semantics on both macOS and Linux, matching the documented `[A-Z_][A-Z0-9_]` contract. Declared `local` so it doesn't leak to the calling shell — `gstack-gbrain-lib.sh` is documented as a sourced helper, so a bare assignment would mutate the caller's locale for the rest of the process (silently affecting downstream `sort`, `tr`, locale-aware globs in the same shell, etc.). The existing regression test `test/gbrain-lib-verify.test.ts:'rejects invalid var names'` already covers the macOS repro shape (passes `lower-case` and expects the validator to reject + emit `invalid var name`). On Linux CI the test silently passed because `LC_ALL=C` is the typical default; on macOS dev boxes it fails. Verified: - `bun test test/gbrain-lib-verify.test.ts`: 22 pass, 0 fail (on macOS). - `_gstack_gbrain_validate_varname lower-case; echo $?` → 2. - `_gstack_gbrain_validate_varname FOO_BAR; echo $?` → 0. - Caller's LC_ALL preserved across calls (confirmed via sourced bash). * fix(land-and-deploy): detect merged PR after gh failure After `gh pr merge` exits non-zero, the PR may already be MERGED server-side (concurrent merge landed, or local cleanup phase failed AFTER the merge succeeded). Calling `gh pr merge` a second time then errors with a confusing "already merged" — and worse, the deploy workflow never runs because we stopped on the first failure. Adds a Post-failure PR-state check (§4a-postfail) that runs after ANY non-zero exit from `gh pr merge`: - state == MERGED → record MERGE_PATH=direct, OFFER (don't force) stale-worktree cleanup on the base branch with uncommitted-work guard, proceed to §4a CI watch - state == OPEN → check autoMergeRequest; if non-null treat as merge-queue wait; if null surface both errors and STOP - state == CLOSED → STOP Hard invariant: never retry `gh pr merge` after a non-zero exit. Server state is authoritative. Re-authored from PR #1620 into land-and-deploy/SKILL.md.tmpl (the source of truth) instead of the generated SKILL.md, so the next gen:skill-docs run preserves the change. Original diff by @davidfoy via #1620. Related: cli/cli#3442, cli/cli#13380. Contributed by @davidfoy via #1620. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix: detect PgBouncer transaction-mode pooler and set GBRAIN_PREPARE=true (#1435) When gbrain connects through a PgBouncer transaction-mode pooler (port 6543), it auto-disables prepared statements. This breaks `gbrain search` silently — the /sync-gbrain capability check fails and the GBrain Search Guidance block never gets written to CLAUDE.md. Three-layer fix: 1. lib/gbrain-exec.ts — `buildGbrainEnv()` now detects port 6543 in the effective DATABASE_URL and sets `GBRAIN_PREPARE=true` in the env passed to every gbrain spawn. This is the single chokepoint — all gstack gbrain invocations inherit the fix. Caller can opt out with `GBRAIN_PREPARE=false`. 2. sync-gbrain/SKILL.md{,.tmpl} — capability check now exports `GBRAIN_PREPARE=true` explicitly and retries search up to 3x with 1s delay for async index propagation under connection pooling. 3. bin/gstack-gbrain-detect — surfaces `gbrain_pooler_mode` field ("transaction" \| "session" \| null) in the preamble probe JSON so /setup-gbrain and /sync-gbrain can advise users about pooler state. Closes #1435 Built with [ClosedLoop.AI](https://closedloop.ai) \| [GitHub](https://github.com/closedloop-ai/claude-plugins) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(supabase-provision): rewrite transaction/6543 -> session/5432 for new projects - Single-object pooler API responses default to transaction-mode at 6543, but the shared pooler tenant on new projects only listens on session/5432 - Add a `pool_mode == transaction && db_port == 6543` rewrite + stderr note - Escape hatch via `GSTACK_SUPABASE_TRUST_API_PORT=1` for forward-compat - 5 new tests covering rewrite, no-op shapes, env opt-out, array path Fixes #1301. * fix(browse): GSTACK_CHROMIUM_NO_SANDBOX opt-out for Ubuntu/AppArmor (#1562) Ubuntu/AppArmor configurations often block unprivileged Chromium sandboxing for headless agent sessions even for normal users — /qa hangs without --no-sandbox. The kernel policy denies the unprivileged user namespaces Chromium needs. Adds GSTACK_CHROMIUM_NO_SANDBOX=1 as an explicit user override that forces the sandbox off without changing the default for everyone else. Re-authored from PR #1562 onto v1.42.2.0's shouldEnableChromiumSandbox() helper — purely additive, preserves the headed-launch sandbox-on-by-default behavior that v1.42.2.0 shipped to kill the --no-sandbox yellow infobar. Three new regression tests cover: - linux + override=1 → false (the named use case) - darwin + override=1 → false (env wins on any platform) - override=0 → does NOT trigger (must be exactly "1") Original diff by @techcenter68 via #1562. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(browse): mirror isCustomChromium() guard in headless launch() When BROWSE_EXTENSIONS_DIR is set alongside GSTACK_CHROMIUM_PATH pointing at a baked-extension build (GBrowser / GStack Browser), the headless launch() path was unconditionally adding --disable-extensions-except / --load-extension. This causes the same ServiceWorkerState::SetWorkerId DCHECK crash that launchHeaded() already guards against via isCustomChromium(). Mirror the existing guard: skip --load-extension flags when isCustomChromium() returns true; always push the off-screen window geometry args. * fix(browse): daemonize macOS/Linux server via setsid() `Bun.spawn().unref()` only releases the child from Bun's event loop — it does NOT call setsid(). The spawned bun server inherits the spawning shell's process session. When the CLI runs inside a session-managed shell that exits shortly after the CLI returns (Claude Code's per-command Bash sandbox, Conductor, OpenClaw, CI step runners), the session leader's exit sends SIGHUP to every PID in the session — killing the bun server and its Chromium grandchildren within seconds of a successful `connect`. Setting `BROWSE_PARENT_PID=0` (already done by the `connect` command and pair-agent) disables the parent-process watchdog but does NOT save the server here: SIGHUP from session teardown still reaps it. Replace the macOS/Linux `Bun.spawn().unref()` with Node's `child_process.spawn({ detached: true })`, which calls setsid() and gives the server its own session leader role (PPID=1, STAT=Ss). This mirrors the Windows path's rationale (PR #191 by @fqueiro) — same root cause, different OS surface. Verified on macOS in Conductor: pre-fix the server dies ~10–15s after connect across separate Bash invocations; post-fix the same PID stays alive (PPID=1, SESS=0, STAT=Ss) and responds to `status`/`goto`/ `snapshot` across many separate shell calls. The `proc?.stderr` startup-error branch is removed since both platforms now spawn with `stdio: 'ignore'`; both fall through to the on-disk `browse-startup-error.log` written by `server.ts`'s start().catch. * fix(design): bump image-gen timeout to 240s + pin gpt-image-2 The design binary calls /v1/responses (gpt-4o + image_generation tool, quality:high, 1536x1024) but aborted the request after a hardcoded 120s. That class of request consistently takes ~140-160s end-to-end, so every generate/variants/evolve/iterate call aborted before the image returned. In /design-shotgun this cascades: Step 3c launches N parallel agents, each calling `$D generate`, each aborts at 120s and retries, all fail, the comparison board never opens — the skill appears to hang indefinitely. Reproduced the exact API call with a longer budget: HTTP 200, valid image, 143.5s. A real /design-shotgun run after the patch generated 3 variants in parallel at 150.0s / 161.0s / 152.1s, all exit 0 — note the 161s case, which a naive 150s bump would still have failed. - Bump AbortController timeout 120_000 -> 240_000 in generate.ts, variants.ts, evolve.ts, iterate.ts (both call sites) - Pin the image_generation tool to model "gpt-image-2" design/test/variants-retry-after.test.ts: 5 pass, 0 fail. The feedback-roundtrip.test.ts failures are a pre-existing browse-module breakage (session.clearLoadedHtml undefined), unrelated to this change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test: fill coverage gaps for PRs #1606, #1612, #1620 Three cherry-picked PRs in this wave landed without unit-test coverage for the specific invariant they protect: #1606 (@andrey-esipov) — LC_ALL=C pin in _gstack_gbrain_validate_varname 8 tests by sourcing bin/gstack-gbrain-lib.sh and calling the validator directly. Asserts uppercase/digit/underscore accepted, lowercase REJECTED (the macOS-locale regression case), mixed-case rejected, LC_ALL=C scoping is local (doesn't leak to caller). #1612 (@bharat2913) — setsid daemonize via Node child_process.spawn 4 static-invariant tests on browse/src/cli.ts. The actual setsid syscall is hard to assert without a real spawn, so we pin the source shape: nodeSpawn imported from child_process; non-Windows branch uses nodeSpawn(...) with detached:true and .unref(); comment documents setsid/SIGHUP root cause; Bun.spawn() is NOT used on macOS/Linux. #1620 (@davidfoy, re-authored into .tmpl per A3) — §4a-postfail 12 static invariants on land-and-deploy/SKILL.md.tmpl + generated SKILL.md. Pins all three state branches (MERGED/OPEN/CLOSED), the authoritative state query, the merge-SHA capture, non-destructive worktree cleanup with uncommitted-work guard, autoMergeRequest probe on OPEN, hard "never retry gh pr merge" rule, and atomic regen propagation. Failing build if any of the three invariants regresses. Note: gbrain-lib-validate-varname.test.ts also surfaces a pre-existing glob-pattern overpermissiveness (hyphens + dots accepted) — not in #1606's scope; documented inline as a separate cleanup target. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(learnings): align injection-prevention tests with PR #1619 tagged-line shape PR #1619 (preserve current entries in cross-project search) refactored gstack-learnings-search to tag rows inline (`current\t<json>` vs `cross\t<json>`) instead of filtering inside the bun block via process.env.GSTACK_SEARCH_SLUG. The bun block no longer reads SLUG or CROSS env vars — it parses the per-line tag and sets a per-entry _crossProject flag. The pre-existing test/learnings-injection.test.ts still asserted on the old SLUG + CROSS env var shape. Updates: - Remove the SLUG env var assertion (no longer set on bash command line) - Remove the bun-block CROSS env var assertion (block reads the tag now, not the env) - Add a new positive assertion that the bun block parses the tag (sourceTag \| tabIndex \| crossProject) - Keep the shell-interpolation safety assertion unchanged — that's independent of the SLUG refactor The CROSS env var is still SET on the bash command line (it controls whether the cross-project find runs at all), but the bun child no longer reads it. The existing "env vars set on bash command line" test continues to pin that. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(fixtures): regenerate ship-SKILL.md golden baselines ship/SKILL.md consumes the Confidence Calibration resolver via the preamble pipeline. This wave's #1539 pre-emit verification gate extends the resolver text, which propagated to ship/SKILL.md via gen:skill-docs. The golden fixtures in test/fixtures/golden/ matched the pre-#1539 shape and failed the host-config regression check. Refreshes claude-ship-SKILL.md, codex-ship-SKILL.md, and factory-ship-SKILL.md to match the current generated output. Matches the Daegu wave's bisect commit 23 ("test(fixtures): regenerate ship-SKILL.md golden baselines"). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(gbrain-detect): include gbrain_pooler_mode in schema regression (PR #1591) PR #1591 (PgBouncer transaction-mode detection, @mikeangstadt) added gbrain_pooler_mode to the gstack-gbrain-detect JSON output but did not update the schema regression check in test/gstack-gbrain-detect-mcp-mode.test.ts. Adding the key in alphabetical order matching the rest of the schema array. Downstream sync-gbrain ignores unknown keys, so this is forward-compat. Without this, the test fails with a diff: + "gbrain_pooler_mode" because keys is the actual set returned and the expected array was pre-#1591. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(release): v1.43.0.0 — post-Daegu paper-cut wave Bumps VERSION 1.42.2.0 → 1.43.0.0 (MINOR per scale-aware bump rules: new env-var surface GSTACK_SYNC__TIMEOUT_MS + GSTACK_CHROMIUM_NO_SANDBOX, behavior expansion in browse/src/browser-manager.ts headless launch, three skill-template prompt changes affecting /retro, /review, /sync-gbrain). CHANGELOG entry leads with what stopped happening: /retro stops fabricating retros against stale bases, /sync-gbrain stops SIGTERM-looping 35-min restarts on big brains, /review stops shipping framework FPs the reviewer never grep'd. 18 fixes total — 15 community PRs + 3 self-filed silent-failure issues (#1624, #1611, #1539) — in one bundled PR with 26 bisect commits and 7 new regression test files. Every wave-touched test file passes in isolation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> chore(release): bump v1.43.0.0 → v1.43.2.0 for queue collision CI check-version-stale flagged v1.43.0.0 already claimed by PR #1574 (garrytan/colombo-v3). PR #1639 (garrytan/muscat-v3) claims v1.43.1.0. Next available MINOR slot is v1.43.2.0. Bump VERSION + package.json + CHANGELOG entry header. No behavior changes — purely re-versioning to clear the queue collision. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Jayesh Betala <jayesh.betala7@gmail.com> Co-authored-by: Andrey Esipov <andrey.esipov@outlook.com> Co-authored-by: David Foy <davidfoy@users.noreply.github.com> Co-authored-by: mikeangstadt <mike.angstadt@closedloop.ai> Co-authored-by: 0xDevNinja <manmit0x@gmail.com> Co-authored-by: techcenter68 <techcenter68@users.noreply.github.com> Co-authored-by: shohu <shohu33@gmail.com> Co-authored-by: Bharat <bharat@theysaid.io> Co-authored-by: Matteo Hertel <info@matteohertel.com>	2026-05-21 21:21:07 -07:00
Garry Tan GitHub Claude Opus 4.7	7ca04d8ef0	v1.42.0.0 Daegu wave: 23 community-filed bugs + PTY classifier enforcement (24 bisect commits) (#1594 ) * fix(gstack-paths): guard CLAUDE_PLUGIN_DATA against cross-plugin contamination (#1569) gstack-paths previously trusted CLAUDE_PLUGIN_DATA as a fallback for GSTACK_STATE_ROOT whenever GSTACK_HOME was unset. When another plugin (e.g. Codex) persists its own CLAUDE_PLUGIN_DATA into the session env via CLAUDE_ENV_FILE, gstack picked it up and wrote checkpoints, analytics, and learnings into that plugin's directory. Anyone with the Codex plugin installed alongside gstack hit this silently. Fix: guard the CLAUDE_PLUGIN_DATA branch so it only fires when CLAUDE_PLUGIN_ROOT confirms we're running as the gstack plugin (path contains "gstack"). Skill installs fall through to \$HOME/.gstack. Contributed by @ElliotDrel via #1570. Closes #1569. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(gbrain-sync): sourceLocalPath handles wrapped {sources:[...]} shape from gbrain v0.20+ gbrain v0.20+ changed `gbrain sources list --json` to return {sources: [...]} instead of a flat array. sourceLocalPath crashed upstream with `list.find is not a function` on every /sync-gbrain invocation against modern gbrain. Accept both shapes for forward/backward compat, matching probeSource/sourcePageCount in lib/gbrain-sources.ts. Contributed by @jakehann11 via #1571. Closes #1567. Supersedes #1564 (@tonyjzhou, same fix, different shape — credit retained). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(brain-context-load): probe gbrain via execFile, not shell builtin (#1559) gbrainAvailable() used `execFileSync("command", ["-v", "gbrain"])`, which fails in any environment where the `command` builtin isn't on the spawned process's PATH (most non-interactive shells). The probe then reported gbrain as missing even when it was installed, and context-load silently skipped vector/list queries. Fix: probe `gbrain --version` directly with a 500ms timeout (matching the rest of the file's MCP_TIMEOUT_MS). Same semantics, works everywhere execFile works. Contributed by @jbetala7 via #1560. Closes #1559. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(gbrain-doctor): pin schema_version:2 doctor parse path (#1418) Adds an exec-path regression test that runs a fake gbrain shim emitting the v0.25+ doctor JSON shape (schema_version: 2, status: "warnings", exit 1 for health_score < 100, no top-level `engine` field). Confirms freshDetectEngineTier recovers stdout from the non-zero exit and falls back to GBRAIN_HOME/config.json for the engine label. The pre-existing test for #1415 only stripped gbrain from PATH; this test exercises the actual doctor parse path, closing the gap that codex's plan review flagged. Also documents the schema_version separation in lib/gbrain-local-status.ts: the local CacheEntry stays at version 1, distinct from the doctor-output schema_version which we accept across versions in gstack-memory-helpers. Closes #1418 (credit @mvanhorn for surfacing the doctor + schema_v2 collapse). The fix landed pre-emptively in v1.29.x; this commit pins it with a stronger test. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(memory-ingest): pin put_page regression + scrub stale name from --help and comments (#1346) #1346 reported that gstack-memory-ingest still called the renamed gbrain put_page subcommand on gbrain v0.18+. The actual code migrated to `gbrain put` and later to batch `gbrain import <dir>` before this report landed — only documentation lag remained. This commit: - Updates the --help string ("Skip gbrain put calls (still updates state file)") so user-facing docs match the shipped subcommand - Updates two inline comments that still referenced the old name - Adds test/memory-ingest-no-put_page.test.ts: a regression pin that strips comments from bin/gstack-memory-ingest.ts and fails the build if "put_page" appears in any active code or string literal, plus a sanity check that the file still calls a supported gbrain page-write verb (put or import) Closes #1346. Reporter @kylma-code surfaced the doc lag; the original code migration credit is on the v1.27.x wave. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(resolvers): rewrite all gbrain put_page instructions to canonical put <slug> scripts/resolvers/gbrain.ts emitted user-facing copy-paste instructions using the renamed `gbrain put_page` subcommand across 10 skills (office-hours, investigate, plan-ceo-review, retro, plan-eng-review, ship, cso, design-consultation, fallback, entity-stub). Every gstack user copying those snippets hit "unknown command: put_page" on gbrain v0.18+. This commit: - Rewrites all 10 instruction templates to use `gbrain put <slug> --content "$(cat <<EOF...EOF)"` with title/tags moved into YAML frontmatter inside --content, matching the v0.18+ subcommand shape - Updates README.md and USING_GBRAIN_WITH_GSTACK.md "common commands" table to reference `gbrain put` and `gbrain get` - Adds test/resolvers-gbrain-put-rewrite.test.ts pinning two invariants: (a) resolver source ships only canonical instructions, (b) every tracked SKILL.md file is free of `gbrain put_page` CHANGELOG entries are deliberately left untouched (historical record). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(build): extract package.json build to scripts/build.sh for Windows Bun compat (#1538, #1537, #1530, #1457, #1561) Bun's Windows shell parser rejects multiple constructs the inline package.json build chain used: brace groups `{ cmd; }`, subshells with redirection `( git ... ) > path/.version`, and (in Bun 1.3.x) subshells near redirections in general. Every Windows install + every auto-upgrade since v1.34.2.0 has failed on `bun run build`. Extracts the build chain to scripts/build.sh and the .version writes to scripts/write-version-files.sh. POSIX-portable, no Bun shell parsing involved. Also adds Windows-specific bun.exe handling for non-ASCII PATHs (a separate Windows footgun where Bun's --compile fails when the binary lives under a path with non-ASCII chars). Updates test/build-script-shell-compat.test.ts to assert the new shape: no subshells with redirections anywhere in the build chain, and build delegates to scripts/build.sh which delegates .version writes. Contributed by @Charlie-El via #1544. Supersedes #1531 (@scarson, fixed in build helper), #1480 (@mikepsinn, partial overlap), #1460 (@realcarsonterry, brace-group fix subsumed) — credit retained. Closes #1538, #1537, #1530, #1457, #1561. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(windows): .exe glob in .gitignore + .exe extension resolution in find-browse (#1554) bun build --compile on Windows appends .exe to the output filename, producing browse.exe instead of browse. find-browse's existsSync probe only checked the bare path and returned null on Windows even when the binary was correctly built. .gitignore similarly only excluded the bare bin/gstack-global-discover path, leaving the .exe variant tracked. This commit: - .gitignore: changes `bin/gstack-global-discover` → `bin/gstack-global-discover` so the Windows .exe variant is ignored - browse/src/find-browse.ts: adds isExecutable + findExecutable helpers that fall back to .exe/.cmd/.bat probing on Windows, mirroring the same helper already in make-pdf/src/browseClient.ts and pdftotext.ts Contributed by @Mike-E-Log via #1554. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> ci(windows): add fresh-install E2E gate that runs bun run build on windows-latest Adds .github/workflows/windows-setup-e2e.yml as the gate that catches Bun shell-parser regressions in the build chain before they reach users. Triggers on PRs touching package.json, scripts/build.sh, scripts/write-version-files.sh, setup, browse cli/find-browse, or gstack-paths. What it verifies: 1. bun run build completes on Windows (the previously-broken path that #1538/#1537/#1530/#1457/#1561 reported) 2. All compiled binaries land on disk (browse.exe, find-browse.exe, design.exe, gstack-global-discover.exe) 3. find-browse resolves to the .exe variant on Windows (regression gate for #1554) 4. gstack-paths returns non-empty GSTACK_STATE_ROOT/PLAN_ROOT/TMP_ROOT on Windows (regression gate for #1570) Complements the existing windows-free-tests.yml (curated unit subset); this new workflow exercises the install path itself. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(codex): move diff scope into prompt instead of --base (Codex CLI 0.130+ argv conflict) (#1209) Codex CLI ≥ 0.130.0 rejects passing a custom prompt and --base together (mutually exclusive at argv level). Every /codex review, /review, and /ship structured Codex review call ended with an argv error before the model ran. Fix: scope the diff in prompt text using "Run git diff origin/<base>...HEAD 2>/dev/null \|\| git diff <base>...HEAD" instead of `--base <base>`. Preserves the filesystem boundary instruction across all invocations and keeps Codex's review prompt tuning. Touches: - codex/SKILL.md.tmpl + regenerated codex/SKILL.md - scripts/resolvers/review.ts + regenerated review/SKILL.md, ship/SKILL.md - test/gen-skill-docs.test.ts: new regression that fails if any of the five known files still contain the prompt+--base shape - test/skill-validation.test.ts: corresponding negative + positive pin on the rendered SKILL.md files Contributed by @jbetala7 via #1209. Closes #1479. Supersedes #1527 (@mvanhorn — same intent, different patch shape, CONFLICTING) and #1449 (@Gujiassh — broader refactor, CONFLICTING). Credit retained in CHANGELOG. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(review): diff from git merge-base, not git diff origin/<base> (#1492) git diff origin/<base> shows everything since the common ancestor in both directions — it includes commits that landed on origin/<base> after this branch was created as deletions. That made /review and /ship's pre-landing structured review report inflated diff totals and flagged "removed" code that was actually still present in the working tree. Fix: compute DIFF_BASE via git merge-base origin/<base> HEAD and diff the working tree against that point. Same coverage of uncommitted edits, no phantom deletions from out-of-order base advancement. Applies to /review's Step 1 (diff existence check), Step 3 (get the diff), the build-on-intent scope-creep check, the structured review DIFF_INS/DIFF_DEL stats, and the Claude adversarial subagent prompt. Same change flows into ship/SKILL.md via the shared resolver. Touches: - review/SKILL.md.tmpl + regenerated review/SKILL.md, ship/SKILL.md - scripts/resolvers/review.ts - scripts/resolvers/review-army.ts Contributed by @mvanhorn via #1492. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(codex): pin filesystem-boundary preservation across all codex review surfaces (#1503, #1522) #1503 reported that the bare codex review --base path stripped the filesystem boundary instruction, letting Codex spend tokens reading .claude/skills/ and agents/. #1522 proposed adding a skill-path detector that switched to the custom-instructions route when the diff touched skill files. After C10 (#1209) restructured codex review to always carry the boundary in the prompt (the prompt+--base argv conflict forced the restructure), the skill-path detector becomes redundant — every default call already preserves the boundary. This commit pins the post-#1209 invariant with a test that fails the build if any future refactor strips the boundary from codex/SKILL.md, review/SKILL.md, or ship/SKILL.md. Closes #1503 by regression test. #1522 (@genisis0x) is superseded by #1209 (the prompt rewrite covers its safety concern); credit retained in CHANGELOG. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(skills): use command -v instead of which for codex detection (#1197) `which` is not on PATH in every shell — some Windows shells, BusyBox- only containers, and minimal CI images all fail when skills probe codex availability via `which codex`. `command -v` is a POSIX builtin and always available where the skill is running. Touched: - codex/SKILL.md.tmpl: CODEX_BIN=$(command -v codex \|\| echo "") - scripts/resolvers/review.ts and scripts/resolvers/design.ts: 3 + 3 sites each rewritten to `command -v codex >/dev/null 2>&1` - Regenerated all 10 affected SKILL.md files (codex, review, ship, design-consultation, design-review, office-hours, plan-ceo-review, plan-design-review, plan-devex-review, plan-eng-review) - test/skill-validation.test.ts: updated pin + defensive regression test that fails if `which codex` returns to codex/SKILL.md - test/skill-e2e-plan.test.ts: updated summary regex Contributed by @mvanhorn via #1197. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(codex): surface non-zero exits so wrappers stop reading as silent stalls (#1467, #1327) When codex exits non-zero (parse errors, arg-shape breaks, model API errors that propagate as non-zero status), the calling agent previously saw an empty output and burned 30-60 minutes misdiagnosing as a silent model/API stall. The hang-detection block only caught exit 124 (the timeout-wrapper signal). Adds elif blocks in all four codex invocation sites (Review default, Challenge, Consult new-session, Consult resume) that: - Echo "[codex exit N] <stderr first line>" to stdout - Indent the first 20 stderr lines for inline context - Log codex_nonzero_exit telemetry tagged with the call site Contributed by @genisis0x via #1467. Closes #1327. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(design): disclose OpenAI key source + warn on cwd .env match (#1278, closes #1248) The design binary previously called process.env.OPENAI_API_KEY without checking where the key came from. If a user ran $D inside someone else's project that had OPENAI_API_KEY in its .env, the resulting generation billed that project's account. Silent and irreversible. Fix: resolveApiKeyInfo() returns both the key and its source. When the env-var path matches an OPENAI_API_KEY entry in the current directory's .env, .env.<NODE_ENV>, or .env.local file, we set a warning. requireApiKey() prints "Using OpenAI key from <source>" plus the warning before the run — never the key itself. Adds 6 unit tests covering: config-vs-env precedence, env-only (no match), env+cwd .env match, quoted/exported values, value-mismatch (no false positive), and the no-leak invariant for requireApiKey stderr output. Contributed by @jbetala7 via #1278. Closes #1248. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(browse): guard full-page screenshots against Anthropic vision API >2000px brick (#1214) Full-page screenshots of tall pages routinely exceeded 2000px on the longest dimension, silently bricking the agent's session: the resulting base64 reached the Anthropic vision API which rejected the oversized image, leaving the agent burning turns on a useless blob with no stderr trace from the browse side. Adds browse/src/screenshot-size-guard.ts as a shared helper: - guardScreenshotBuffer(buf) → downscales in-memory if max(w,h) > 2000 - guardScreenshotPath(path) → file-mode variant that rewrites in place - Aspect ratio preserved via sharp's resize fit:inside - Stderr diagnostic on any downscale so callers can see when it fired - Lazy sharp import so non-screenshot paths pay no startup cost Wires the guard into all three full-page callsites codex review flagged: - browse/src/snapshot.ts: annotated + heatmap fullPage captures - browse/src/meta-commands.ts: screenshot command (path + base64 fullPage modes) plus the responsive 3-viewport sweep - browse/src/write-commands.ts: prettyscreenshot fullPage path Covers seven unit cases (pass-through, downscale, aspect ratio, exactly-2000px edge, file-mode rewrite) plus a static invariant test that fails the build if any of the three callsites stops importing the guard. Closes #1214. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(security): add Node sidecar entry for L4 prompt-injection classifier (#1370) The L4 TestSavant classifier in browse/src/security-classifier.ts can't be imported into the compiled browse server (onnxruntime-node dlopen fails from Bun's compile extract dir per CLAUDE.md). The agent that used to host it (sidebar-agent.ts) was removed when the PTY proved out — leaving the classifier file shipped but with zero callers. Exactly the gap codex flagged in #1370. Adds browse/src/security-sidecar-entry.ts: a Node script that runs the classifier as a subprocess of the browse server. It reads NDJSON requests from stdin and writes id-correlated NDJSON responses to stdout, supporting: - op: "scan-page-content" — full L4 classifier scan - op: "ping" — liveness probe for the client's health check - op: "status" — classifier readiness (used by /pty-inject-scan to surface l4 { available: bool } in its response) Plus browse/src/find-security-sidecar.ts: a resolver that locates node + the bundled JS entry (browse/dist/security-sidecar.js, built in a follow-up package.json change) or falls back to the dev TS entry. Returns null cleanly when node isn't on PATH so the calling endpoint can degrade per D7 (extension WARN + user confirm). C17 of the security-stack wave. C18 adds the IPC client + lifecycle management; C19 wires the endpoint; C20 routes the extension through it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(security): sidecar IPC client with lifecycle + circuit breaker (#1370) Adds browse/src/security-sidecar-client.ts to manage the Node L4 classifier subprocess from the compiled browse server: - Lazy spawn on first scan; reuses the same process across requests - Id-correlated request/response via NDJSON over stdio - 5s default per-scan timeout; 64KB payload cap (short-circuits before spawn so oversized requests don't waste a process) - 3-in-10-minutes respawn cap → trips circuit breaker; subsequent scans throw immediately so the /pty-inject-scan endpoint can surface l4 { available: false } to the extension and degrade to WARN+confirm - process.on('exit') sends SIGTERM to the child for clean teardown - isSidecarAvailable() lets the endpoint probe before scan calls so the response shape reflects degraded mode honestly Unit tests cover the payload cap, the availability probe, and the breaker-doesn't-crash invariant under repeated rejected calls. C18 of the security-stack wave. C19 adds POST /pty-inject-scan; C20 routes the extension through it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(security): add POST /pty-inject-scan endpoint for pre-PTY-inject scans (#1370) The sidebar's gstackInjectToTerminal callers (toolbar Cleanup, Inspector "Send to Code") were piping page-derived text directly into the live claude PTY with ZERO classifier processing — the gap codex flagged in #1370. The documented sidebar security stack had a hole the size of every Cleanup-button click. Adds POST /pty-inject-scan to browse/src/server.ts: - Local-only binding (NOT in TUNNEL_PATHS — tunnel attempts get the general 404 path; never reaches the scan logic) - Root-token auth via existing validateAuth() — 401 on unauth - 64KB request cap → 413 + payload-too-large body - 5s scan timeout via sidecar client - URL-blocklist forced to BLOCK in PTY context (page-derived REPL input is higher-risk than ordinary tool output) - L4 ML classifier via the sidecar when available; degrades to WARN per D7 when sidecar is unavailable - Response goes through JSON.stringify(..., sanitizeReplacer) per v1.38.0.0 Unicode-egress hardening - Imports only from security-sidecar-client.ts, never directly from security-classifier.ts (which would brick the compiled Bun binary) Seven static-invariant tests pin the POST verb, auth gate, 64KB cap, tunnel-listener exclusion, sanitizeReplacer wrapping, l4 availability shape, and the no-direct-classifier-import rule. C19 of the security-stack wave. C20 routes the extension through it; C21 adds the invariant AST check. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(extension): route gstackInjectToTerminal through /pty-inject-scan (#1370) Closes the documented-vs-shipped gap codex flagged in #1370. The sidebar's two PTY-injection call sites (Inspector "Send to Code" and toolbar Cleanup) now pre-scan via the new /pty-inject-scan endpoint before writing to the live claude REPL. Adds window.gstackScanForPTYInject(text, origin) to extension/sidepanel-terminal.js: - Async, returns { allow, verdict, reasons, l4 } - POST to /pty-inject-scan with the existing root-token auth - WARN+confirm on scan failure (network down, sidecar absent, etc.) rather than silent PASS — D7 honest-degradation gstackInjectToTerminal stays synchronous, returns boolean. Per D6: keeping the inject sync means existing `const ok = ...?.()` callers don't break, and the invariant test in test/extension-pty-inject-invariant.test.ts can statically pin that every call goes through the scan first. extension/sidepanel.js call sites updated: - inspectorSendBtn click → await scan, BLOCK drops + WARN prompts via window.confirm, PASS injects silently - runCleanup() → same flow. Static cleanup prompt always PASSes but still routes through scan to honor the invariant. C20 of the security-stack wave. C21 adds the static invariant test. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(security): invariant — extension PTY inject must be scan-gated (#1370) Static-analysis invariant test that fails the build if any extension/.js path calls window.gstackInjectToTerminal without a preceding window.gstackScanForPTYInject in the same enclosing function. Closes the documented-vs-shipped gap codex demanded a machine check on. Rules: - Rule 1: any file that calls inject must also reference scan - Rule 2: in the enclosing function (function declaration, arrow, async (), event handler), a scan call must appear before the inject call by source position - Exemption: sidepanel-terminal.js (the file that DEFINES the inject function) is exempt from Rule 2 since the definition is not a call Plus two structural checks: - sidepanel-terminal.js defines both the inject and scan functions - inject stays SYNCHRONOUS (no `async` modifier) per D6 — async would silently break the `const ok = ...?.()` pattern at every caller C21 of the security-stack wave. The sidecar architecture (#1370) is complete: server-side L1-L3 + L4-via-sidecar (C17+C18+C19), extension pre-scan wiring (C20), and now the regression gate (C21). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> feat(browse): opt-in extended stealth mode with 6 detection-vector patches (#1112) Rebases @garrytan's PR #1112 (Apr 2026, abandoned) onto the current browse/src/stealth.ts contract. The existing minimal "codex narrowed" stealth (webdriver-mask + AutomationControlled launch arg) stays the default. PR #1112's six additional patches are added behind an opt-in GSTACK_STEALTH=extended env flag. Extended-mode patches (applied AFTER the default mask, in order): 1. delete navigator.webdriver from prototype (not just the getter — detectors check `"webdriver" in navigator`) 2. WebGL renderer spoof to Apple M1 Pro (SwiftShader was the #1 software-GPU tell in containers) 3. navigator.plugins returns a PluginArray-prototype-passing array with MimeType objects and namedItem() 4. window.chrome populated with chrome.app, chrome.runtime, chrome.loadTimes(), chrome.csi() with realistic shapes 5. navigator.mediaDevices backfilled when headless drops it 6. CDP cdc_-prefixed window globals cleared Why opt-in: the default mode's contract is fingerprint CONSISTENCY, which protects against detectors that flag spoofing mismatch. Extended mode actively lies about the environment; sites that reflect on these properties can break. Users who hit detection in default mode can flip GSTACK_STEALTH=extended for SannySoft 100% pass-rate. Twenty unit tests pin the env-flag semantics, all six patches' code presence, and the applyStealth wiring order. Live SannySoft pass-rate verification stays in the periodic-tier E2E suite. Contributed by @garrytan via #1112 (rebased — original PR opened before the codex-narrowed minimum landed; rebase preserves the narrowed default while adding the SannySoft-passing path as opt-in). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> test(fixtures): regenerate ship-SKILL.md golden baselines after C10-C13 + C16 templates Updates the three ship-SKILL.md golden baselines (claude, codex, factory hosts) to match the new shape produced by: - C10 #1209 codex argv (prompt + diff scope, no --base) - C11 #1492 merge-base diff (DIFF_BASE= preamble) - C13 #1197 command -v for codex detection - C12 + boundary preservation per regen-enforcing test Per CLAUDE.md SKILL.md workflow: edit the .tmpl, run gen:skill-docs, commit the regenerated outputs together. Goldens are part of the regen contract — without this commit, test/host-config.test.ts' golden-baseline checks fail with the diff codex review surfaced. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(release): v1.41.0.0 — Daegu wave (24 bisect commits, 14 user-facing fixes) Bumps VERSION 1.40.0.0 → 1.41.0.0. CHANGELOG entry follows the release-summary format in CLAUDE.md: two-line headline, lead paragraph, "The numbers that matter" table, "What this means for builders" closer, then itemized Added/Changed/Fixed/For contributors with inline credit to every PR author and original issue reporter. Scale-aware bump per CLAUDE.md: 24 commits, ~6000 LOC net, substantial new capability across security (PTY sidecar wiring), install (Windows build chain), compat (gbrain 0.18-0.35, Codex CLI 0.130+), and quality (screenshot guard, design key disclosure, extended stealth opt-in). MINOR is the right call. Closes for users: #1567, #1559, #1569, #1346, #1418, #1538, #1537, #1530, #1457, #1561, #1554, #1479, #1503, #1248, #1214, #1370, #1327, #1193 pattern, #1152 pattern. Credit retained inline. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(find-browse): resolve source-checkout layout <git-root>/browse/dist/browse[.exe] windows-setup-e2e.yml runs `bun browse/src/find-browse.ts` against a freshly-built repo where binaries land at browse/dist/browse.exe (no .claude/skills/gstack/ install layout). The previous markers chain only matched .codex/.agents/.claude prefixed paths, so find-browse exited "not found" even when the binary was present. Adds a source-checkout fallback after the marker scan: if no installed layout resolves but <git-root>/browse/dist/browse[.exe] exists, return that. Three real callers hit this path: - gstack repo dev workflow before `./setup` runs - windows-setup-e2e.yml CI (the breakage that surfaced this) - make-pdf consumers running from a sibling source checkout Smoke-verified: a fresh git repo with browse/dist/browse on disk now resolves through the source-checkout branch (was returning null before this commit). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(release): bump v1.41.0.0 → v1.42.0.0 to clear queue collision with #1574 The version-gate workflow flagged a collision: PR #1574 (garrytan/colombo-v3) already claims v1.41.0.0, and #1592 (fix/audit-critical-high-bugs) claims v1.41.1.0. Per CLAUDE.md's workspace-aware ship rule, queue-advancing past a claimed version within the same bump level is permitted — MINOR work landing on top of a queued MINOR still reads as MINOR relative to main. Util's suggested next slot is v1.42.0.0; taking it. CHANGELOG entry header bumped + dated 2026-05-19; entry body unchanged (same wave content, same credit list). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 07:35:01 -07:00
+6 Garry Tan GitHub Jayesh Betala orbisai0security Bryce Alan Claude Opus 4.7 Terry Carson YM Vasko Ckorovski Samuel Carson Yashwant Kotipalli Jasper Chen Stefan Neamtu 陈家名 Abigail Atheryon Furkan Köykıran gus	00f966b3ec	v1.30.0.0 fix wave: 21 community PRs + Windows CI extension + codex flag-semantics smoke (#1391 ) * fix(codex): use resume-compatible flags * fix: V-001 security vulnerability Automated security fix generated by Orbis Security AI * docs: align prompt-injection thresholds to security.ts (v1.6.4.0 catch-up) CLAUDE.md:290 and ARCHITECTURE.md:159 were missed when WARN was bumped 0.60 → 0.75 in `d75402bb` (v1.6.4.0, "cut Haiku classifier FP from 44% to 23%, gate now enforced", #1135). browse/src/security.ts:37 has WARN: 0.75 and BROWSER.md:743 was updated alongside that commit; CLAUDE.md and ARCHITECTURE.md still read 0.60. Also adds the SOLO_CONTENT_BLOCK: 0.92 entry to CLAUDE.md (already in security.ts:50 and BROWSER.md:745, missing from CLAUDE.md's threshold table). No code change. No behavior change. Pure doc-vs-code alignment. Verification: $ grep -n "WARN" browse/src/security.ts CLAUDE.md ARCHITECTURE.md BROWSER.md browse/src/security.ts:37: WARN: 0.75, CLAUDE.md:290: - \`WARN: 0.75\` ... ARCHITECTURE.md:159: ...>= \`WARN\` (0.75)... BROWSER.md:743: - \`WARN: 0.75\` ... Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix: Korean/CJK IME input and rendering in Sidebar Terminal Fixes #1272 This commit addresses three separate Korean/CJK bugs in the Sidebar Terminal: Bug 1 - IME Input: Korean text typed via IME composition was not reaching the PTY correctly. Added compositionstart/compositionend event listeners to suppress partial jamo fragments and only send the final composed string. Bug 2a - Font Rendering: Added CJK monospace font fallbacks ("Noto Sans Mono CJK KR", "Malgun Gothic") to both the xterm.js fontFamily config and the CSS --font-mono variable. This ensures consistent cell-width calculations for Korean characters. Bug 2b - UTF-8 Boundary Detection: Added buffering logic to prevent multi-byte UTF-8 characters (Korean is 3 bytes) from being split across WebSocket chunks. This follows the same pattern as PR #1007 which fixed the sidebar-agent path, but extends it to the terminal-agent path. Special thanks to @ldybob for the excellent root cause analysis and proposed solutions in issue #1272. Tested on WSL2 + Windows 11 with Korean IME. * fix(ship): tighten Plan Completion gate (VAS-449 remediation) VAS-446 shipped with a PLAN.md acceptance criterion (domain-hq has /docs/dashboard.md) silently skipped. /ship's Plan Completion subagent existed at ship time (added in v1.4.1.0) but the gate let the failure through. Four structural fixes: 1. Path concreteness rule: items naming a concrete filesystem path MUST be classified DONE/NOT DONE via [ -f <path> ], never UNVERIFIABLE. 2. Validator detection: CONTENT-SHAPE items scan target repo's package.json for validate-* scripts and run them before falling back to UNVERIFIABLE. 3. Per-item UNVERIFIABLE confirmation: replaces blanket "I've checked each one" with per-item Y/N/D loop. The blanket-confirm path is the exact failure VAS-449 surfaced. 4. Subagent fail-closed: if Plan Completion subagent + inline fallback both fail, surface explicit AskUserQuestion instead of silent pass. Replaces the prior "Never block /ship on subagent failure" fail-open. Locked in by test/ship-plan-completion-invariants.test.ts (5 assertions, no LLM dependency, ~60ms). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(browse): bash.exe wrap for telemetry on Windows reportAttemptTelemetry() in browse/src/security.ts calls spawn(bin, args) where bin is the gstack-telemetry-log bash script. On Windows this fails silently with ENOENT — CreateProcess can't dispatch on shebang lines. Adopts v1.24.0.0's Bun.which + GSTACK__BIN override pattern (from browse/src/claude-bin.ts:resolveClaudeCommand, introduced in #1252) for resolving bash.exe. resolveBashBinary() honors GSTACK_BASH_BIN absolute-path or PATH-resolvable override, falling back to Bun.which('bash') which finds Git Bash on the standard Windows install. buildTelemetrySpawnCommand() wraps the script invocation on win32 only; POSIX path is bit-identical. Returns null when bash can't be resolved on Windows so caller skips spawn — local attempts.jsonl audit trail keeps working without surfacing a Windows-only failure. 8 new unit tests cover resolveBashBinary (POSIX bash, absolute override, quote-stripping, BASH_BIN fallback, empty-PATH null) and buildTelemetrySpawnCommand (POSIX pass-through, win32 bash wrap, win32 null on unresolvable, arg-array immutability). POSIX path is bit-identical — Bun.which('bash') on Linux/macOS returns the same /bin/bash or /usr/bin/bash that the old hardcoded spawn relied on. fix(make-pdf): Bun.which-based binary resolution for browse + pdftotext on Windows Extends v1.24.0.0's Bun.which + GSTACK__BIN override pattern (introduced in browse/src/claude-bin.ts via #1252) to the two other binary resolvers in the codebase: make-pdf/src/browseClient.ts:resolveBrowseBin and make-pdf/src/pdftotext.ts:resolvePdftotext. Same Windows quirks (fs.accessSync(X_OK) degrades to existence-check; `which` isn't available outside Git Bash; bun --compile --outfile X emits X.exe), same Bun.which-based fix shape, same env override convention. Changes: - GSTACK_BROWSE_BIN / GSTACK_PDFTOTEXT_BIN as the v1.24-aligned overrides; BROWSE_BIN / PDFTOTEXT_BIN remain as back-compat aliases. - Bun.which() replaces execFileSync('which', ...) for PATH lookup. Handles Windows PATHEXT natively; no more `where`-vs-`which` branch. - findExecutable(base) helper exported from each module, probes .exe/.cmd/.bat after the bare-path miss on win32. Linux/macOS behavior is bit-identical (isExecutable short-circuits before the win32 branch ever runs). - macCandidates renamed posixCandidates (always was — /opt/homebrew, /usr/local, /usr/bin). No Windows candidates added; Poppler installs scatter across Scoop/Chocolatey/portable zips and guessing causes false positives. - Error messages get a Windows install hint (scoop install poppler / oschwartz10612) and `setx` example for GSTACK__BIN. - Pre-existing test 'honors BROWSE_BIN when it points at a real executable' was hardcoded /bin/sh — made cross-platform via a REAL_EXE constant (cmd.exe on win32, /bin/sh on POSIX). Was a Windows-CI blocker on its own. Coordination: PR #1094 (@BkashJEE) covered browseClient.ts independently with a narrower scope; this PR's pdftotext + cross-platform tests + GSTACK__BIN naming are additive. Either order of merge works. Test plan: - bun test make-pdf/test/browseClient.test.ts make-pdf/test/pdftotext.test.ts on win32 — 29 pass, 0 fail (12 new assertions: findExecutable POSIX/win32/null, resolveBrowseBin GSTACK_BROWSE_BIN + BROWSE_BIN + precedence + quote-strip, same shape for resolvePdftotext + Windows install hint in error message). - POSIX branch unchanged — fs.accessSync(X_OK) on Linux/macOS short-circuits before any win32 logic runs, matching the v1.24 claude-bin.ts pattern. fix(browse): NTFS ACL hardening for Windows state files via icacls gstack's ~/.gstack/ state directory holds bearer tokens, canary tokens, agent queue contents (with prompt history), session state, security-decision logs, and saved cookie bundles — all written with { mode: 0o600 } / 0o700. On Windows, those mode bits are a silent no-op: Node's fs module doesn't translate POSIX modes to NTFS ACLs, and inherited ACLs leave every "restricted" file readable by other principals on the machine (verified via icacls — six ACEs, the intended user is the LAST of six). Threat model is non-trivial on: - Self-hosted CI runners (different service account on the same Windows box can read developer tokens, canary tokens, prompt history) - Shared development machines (agencies, studios, lab environments) - Multi-tenant servers with shared home directories Orthogonal to v1.24.0.0's binary-resolution work — complementary at the write side. v1.24's bin/gstack-paths resolves ~/.gstack/ correctly across plugin / global / local installs; this PR ensures files written into those resolved paths actually get the POSIX 0o600 semantic translated to NTFS. The fix: - New browse/src/file-permissions.ts (158 LOC, 5 public + 1 test-reset). restrictFilePermissions / restrictDirectoryPermissions wrap chmod (POSIX) or icacls /inheritance:r /grant:r <user>:(F) (Windows). writeSecureFile / appendSecureFile / mkdirSecure are drop-in wrappers for the common patterns. - 19 call sites converted across 9 source files: browser-manager.ts, browser-skill-write.ts, cli.ts, config.ts, meta-commands.ts, security-classifier.ts, security.ts (4 sites), server.ts (5 sites), terminal-agent.ts (8 sites), tunnel-denial-log.ts. - (OI)(CI) inheritance flags on directories mean files created via fs.write* inside an mkdirSecure-created dir inherit the owner-only ACL automatically — important for tunnel-denial-log.ts where appends use async fsp.appendFile. Error handling: icacls failures (nonexistent path, missing icacls.exe, hardened environments) log a one-shot warning to stderr and proceed. Once-per-process gating prevents log spam if the condition persists. Filesystem stays functional; the file just ends up with inherited ACLs. Test plan: - bun test browse/test/file-permissions.test.ts — 13 pass, 0 fail (POSIX mode-bit assertions, Windows no-throw, mkdir idempotence, recursive creation, Buffer payloads, append-creates-then-reapplies-once semantics) - bun test browse/test/security.test.ts — 38 pass, 0 fail (existing security test suite plus the bash-binary resolution tests added in fix #1119; the converted writeFileSync/appendFileSync/mkdirSync sites in security.ts integrate cleanly) - Empirical icacls before/after on a real file — 6 ACEs → 1 ACE - bun build typecheck on all modified files — clean (server.ts has a pre-existing playwright-core/electron resolution issue unrelated to this PR) POSIX behavior is bit-identical to old code — fs.chmodSync(path, 0o6XX) on the helper's POSIX branch matches the inline { mode: 0o6XX } it replaces. Linux and macOS see no behavior change. Inviting pushback on three judgment calls (in PR description): 1. icacls vs npm library 2. ACL scope — just user, or user + SYSTEM? 3. Graceful degradation — once-per-process warn, not silent, not hard-fail. * fix(browse): declare lastConsoleFlushed to restore console-log persistence flushBuffers() references a `lastConsoleFlushed` cursor at server.ts:337 and assigns it at :344, but the `let lastConsoleFlushed = 0;` declaration is missing — only the network and dialog siblings are declared at lines 327-328. Result: every 1-second flushBuffers tick (line 376) throws `ReferenceError: lastConsoleFlushed is not defined`, gets swallowed by the catch at line 369 ("[browse] Buffer flush failed: ..."), and the console branch's append never runs. browse-console.log is never written in any production deployment since this regressed. Discovered by stress-testing the daemon with 15 concurrent CLIs against cold state — the race surfaced the buffer-flush error spam in one spawned daemon's stderr. Verified by running the daemon against a real file:// page with console.log events: in-memory `browse console` returns the entries, but `.gstack/browse-console.log` is never created on disk. Regression introduced by `1a100a2a` "fix: eliminate duplicate command sets in chain, improve flush perf and type safety" — the flush refactor switched from `Bun.write` to `fs.appendFileSync` and added the `lastConsoleFlushed` cursor pattern alongside its network/dialog siblings, but missed the matching `let` declaration. Tests don't currently exercise flushBuffers, so the regression shipped silently. Fix: - Declare `let lastConsoleFlushed = 0;` next to `lastNetworkFlushed` and `lastDialogFlushed` (browse/src/server.ts:327) - Add a source-level guard test (browse/test/server-flush-trackers.test.ts) that fails any future refactor that adds a fourth `lastFlushed` cursor without the matching declaration. Same pattern as terminal-agent.test.ts and dual-listener.test.ts — read source as text, assert invariant, no daemon required. Test plan: - [x] New regression test fails on current main, passes with the fix - [x] `bun run build` clean - [x] Manual smoke: spawn daemon -> goto file:// page with console.log -> wait 4s -> .gstack/browse-console.log now exists with the expected entries (163 bytes vs zero before) 🤖 Generated with [Claude Code](https://claude.com/claude-code) fix(browse): per-process state-file temp path to fix concurrent-write ENOENT The daemon writes `.gstack/browse.json` via the standard atomic-rename pattern: `writeFileSync(tmp, …) → renameSync(tmp, stateFile)`. Four sites in server.ts use this pattern (initial daemon-startup state at :2002, /tunnel/start handler at :1479, BROWSE_TUNNEL=1 inline tunnel update at :2083, BROWSE_TUNNEL_LOCAL_ONLY=1 update at :2113), and all four hard-code the same temp filename `${stateFile}.tmp`. Under concurrent writers the shared filename races on the rename: t0 Writer A: writeFileSync(stateFile + '.tmp', payloadA) t1 Writer B: writeFileSync(stateFile + '.tmp', payloadB) // overwrites A t2 Writer A: renameSync(stateFile + '.tmp', stateFile) // moves B's payload t3 Writer B: renameSync(stateFile + '.tmp', stateFile) // ENOENT — file gone Reproduced empirically with 15 concurrent CLIs against a fresh `.gstack/`: [browse] Failed to start: ENOENT: no such file or directory, rename '…/.gstack/browse.json.tmp' -> '…/.gstack/browse.json' Pre-fix success rate: 0 / 15 under cold-start race. Post-fix success rate: 15 / 15, zero ENOENT. Fix: - New `tmpStatePath()` helper (server.ts:333) returns `${stateFile}.tmp.${pid}.${randomBytes(4).toString('hex')}` - All 4 call sites use `tmpStatePath()` instead of the shared literal - Atomic rename still gives last-writer-wins semantics on the final state.json content; only behavior change is that concurrent writers no longer kill each other on the rename step Source-level guard test (browse/test/server-tmp-state-path.test.ts) locks two invariants: (1) no remaining `stateFile + '.tmp'` literals, (2) every state-write `writeFileSync` call uses `tmpStatePath()`. Same read-source-as-text pattern as terminal-agent.test.ts and dual-listener.test.ts — no daemon required, runs in tier-1 free. Test plan: - [x] Targeted source-level guard test passes (3 / 0) - [x] `bun run build` clean - [x] Live regression: 15 concurrent CLIs against cold state → 15 / 15 healthy, 0 ENOENT (vs 0 / 15 pre-fix) - [x] No `.tmp.` orphans left behind after rename succeeds - [x] Related test cluster (server-auth, dual-listener, cdp-mutex, findport) — same pre-existing flakes as `main`, no new regressions introduced 🤖 Generated with [Claude Code](https://claude.com/claude-code) fix(browse): clear refs when iframe auto-detaches in getActiveFrameOrPage Asymmetric cleanup between two equivalent staleness conditions: onMainFrameNavigated() → clearRefs() + activeFrame = null ✓ getActiveFrameOrPage() → activeFrame = null (refs NOT cleared) ✗ Both paths see the same staleness condition — refs were captured against a frame that no longer exists. The main-frame path correctly clears both pieces of state. The iframe-detach path nulls the frame but leaves the refMap intact. The lazy click-time check in `resolveRef` (tab-session.ts:97) partially saves us — `entry.locator.count()` on a detached-frame locator throws or returns 0, so the click errors out as "Ref X is stale". But the user has no signal that frame context silently changed underfoot: the next `snapshot` runs against `this.page` (main) while old iframe refs still litter `refMap` with the same role+name keys. New refs collide with stale ones, the resolver picks one at random, the user clicks the wrong element. TODOS.md line 816-820 documents "Detached frame auto-recovery" as a shipped iframe-support feature in v0.12.1.0. This restores the documented intent — the recovery should leave the session in a clean state, not a half-cleared one. Fix: 1 line — add `this.clearRefs()` next to `this.activeFrame = null` inside the if-branch. Test plan: - [x] New regression test: 4/4 pass - refs cleared when getActiveFrameOrPage detects detached iframe - refs preserved when active frame is still attached (no regression) - refs preserved when no frame set (page-level path untouched) - matches onMainFrameNavigated symmetry — both paths reach the same clean end state - [x] `bun run build` clean 🤖 Generated with [Claude Code](https://claude.com/claude-code) * fix(codex): resolve python for JSON parser * fix: add fail-fast probe for base branch in ship step 12 * fix(plan-devex-review): remove contradictory plan-mode handshake * fix(design): honor Retry-After header in variants 429 handler Closes #1244. The 429 handler in `generateVariant` discarded the `Retry-After` response header and fell straight through to a local exponential schedule (2s/4s/8s). In image-generation batches, that burns retry attempts inside the provider's cooldown window and the request never recovers. Now we parse `Retry-After` per RFC 7231 — both delta-seconds (`Retry-After: 5`) and HTTP-date (`Retry-After: Fri, 31 Dec 1999 23:59:59 GMT`). Honored waits are capped at 60s to bound stalls from hostile or buggy headers. Delta-seconds are validated as digits-only (rejects `2abc`). When `Retry-After` is honored (including 0 / past-date "retry now"), the next iteration's leading exponential sleep is skipped so we don't double-wait. Invalid or missing headers fall through to the existing exponential schedule unchanged. Behavior matrix: \| Header \| Behavior \| \|---------------------------------\|-------------------------------------------\| \| Retry-After: 5 \| wait 5s, skip leading on next attempt \| \| Retry-After: 999999 \| capped to 60s, skip leading \| \| Retry-After: 2abc \| invalid, fall through to exponential \| \| Retry-After: 0 \| wait 0, skip leading (retry immediately) \| \| Retry-After: <past HTTP-date> \| wait 0, skip leading \| \| Retry-After: <future date> \| wait diff capped at 60s, skip leading \| \| no header \| fall through to existing exponential \| `generateVariant` now accepts an optional `fetchFn` parameter (defaults to `globalThis.fetch`) so tests can inject a stub. Production call sites are unchanged. Tests cover the five behavior buckets above, asserting both the 1st-to-2nd call timing gap and call counts. All five pass in ~8s. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(docs): correct per-skill symlink removal snippet in README uninstall Closes #1130. The manual-uninstall fallback in `## Uninstall` → `### Option 2` used `find ~/.claude/skills -maxdepth 1 -type l`, which finds nothing on real installs. Each `~/.claude/skills/<name>/` is a real directory, and only `<name>/SKILL.md` inside it is a symlink into `gstack/`. The find never matched, so the snippet silently removed nothing. Replace with a directory walk that inspects each `<name>/SKILL.md`: find ~/.claude/skills -mindepth 1 -maxdepth 1 -type d ! -name gstack → check $dir/SKILL.md is a symlink → readlink it → if target is gstack/* or /gstack/: rm -f the link, rmdir the dir (only if empty — preserves any user-added files) Excludes the top-level `gstack/` dir from the walk; that's removed by step 3 of the same uninstall block. `bin/gstack-uninstall` (the script-mode path) already handles the layout correctly via its own walk; only this manual fallback needed updating. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix: reject partial browse client env integers * fix(gemini-adapter): detect new ~/.gemini/oauth_creds.json auth path gemini-cli >=0.30 stores OAuth credentials at ~/.gemini/oauth_creds.json instead of the legacy ~/.config/gemini/ directory. The benchmark adapter's availability check now succeeds for users on recent gemini-cli releases who have authenticated via interactive login. Both paths are accepted so users on older versions still work. * fix(browser): add --no-sandbox for root user on Linux/WSL2 Chromium's sandbox can't initialize when running as root on Linux, causing an immediate exit. Extend the existing CI/CONTAINER check to also cover this case, keeping the Windows-safe `typeof getuid` guard. * security: pass cwd to git via execFileSync, not interpolation through /bin/sh `bin/gstack-memory-ingest.ts:632-643` ran `execSync(\`git -C ${JSON.stringify(cwd)} remote get-url origin 2>/dev/null\`, ...)`. JSON.stringify escapes `"` and `\` but not `$` or backticks, so a `cwd` of `"$(touch /tmp/marker)"` survived JSON quoting and detonated under /bin/sh's command-substitution-inside-double-quotes. `cwd` originates from transcript JSONL records under `~/.claude/projects/<encoded-cwd>/<uuid>.jsonl` and `~/.codex/sessions/YYYY/MM/DD/rollout-.jsonl`. The walker grabs the first `.cwd` it sees per session. That's an untrusted surface in the gstack threat model — the L1-L6 sidebar security stack exists exactly because agent transcripts can carry attacker-influenced text. Two pivots above the local same-uid bar: (a) prompt-injection appending `cwd="$(...)"` to the active session log turns the next /sync-gbrain run into RCE under the user's uid; (b) cross-machine transcript share (a colleague's `.claude/projects` snippet untar'd into HOME, a documented gbrain dogfooding shape) → RCE on first sync. Fix swaps the one execSync for `execFileSync("git", ["-C", cwd, "remote", "get-url", "origin"], ...)`. No shell, argv passed directly to git. The same module already uses execFileSync for `gbrainAvailable()` (line 762 pre-patch) and `gbrainPutPage()` (line 816 pre-patch) — this single execSync was the outlier. Test: `gstack-memory-ingest security: untrusted cwd cannot trigger shell substitution` plants a Claude-Code-shaped JSONL with cwd=`$(touch <marker>)` and asserts the marker file is not created after `--incremental --quiet`. Negative control: with the patch reverted, the test fails (marker created); with the patch applied, it passes (18/18 in test/gstack-memory-ingest.test.ts). security: gate domain-skill auto-promote on classifier_score > 0 `browse/src/domain-skill-commands.ts:140` (handleSave) writes `classifier_score: 0` with the comment "L4 deferred to load-time / sidebar-agent fills this in on first prompt-injection load." But CLAUDE.md "Sidebar architecture" documents that sidebar-agent.ts was ripped, and grep for recordSkillUse + classifierFlagged callers across browse/src/ returns zero hits outside the module under test. Net effect: every quarantined skill that survives three benign uses without flag (`recordSkillUse(... , classifierFlagged: false)` x3) auto-promotes to `active` and lands in prompt context wrapped as UNTRUSTED on every subsequent visit to that host. The L4 score that was supposed to gate the promotion was never written — the production save path puts 0 on disk and nothing later updates it. Threat model: a domain-skill body authored by an agent under the influence of a poisoned page (the new `gstackInjectToTerminal` PTY path runs no L1-L3 either) would lose its auto-promote barrier after three uses. The exploit isn't single-step but the bar is exactly N=3 prompt-injection-shaped uses on a hostile page, which is well within reach. Fix adds a single condition to the auto-promote gate in `recordSkillUse`: if (state === 'quarantined' && useCount >= PROMOTE_THRESHOLD && flagCount === 0 && current.classifier_score > 0) { state = 'active'; } `classifier_score` is set once at writeSkill and never updated. Production saves it as 0 (handleSave), so the gate stays closed; existing tests that explicitly pass `classifierScore: 0.1` still auto-promote (the auto-promote path is preserved for the day L4 is rewired). Manual promotion via `domain-skill promote-to-global` is unaffected (it goes through `promoteToGlobal` which has its own state-machine guard at line 337+). Test: new regression case `does NOT auto-promote when classifier_score is 0 (production handleSave shape)` plants a skill with classifierScore=0 (matches domain-skill-commands.ts:140), runs three uses without flag, asserts the skill stays quarantined and readSkill returns null. Negative control: revert the patch, the test fails with `Received: "active"`. With the patch: 15/15 pass. * fix(ship): port #1302 SKILL.md edits to .tmpl + resolver source PR #1302 added Verification Mode + UNVERIFIABLE classification + per-item confirmation gate to ship/SKILL.md, but only the generated SKILL.md was edited — not the .tmpl source or scripts/resolvers/review.ts. The next `bun run gen:skill-docs` run would have wiped the changes. Port the same content into the resolver and .tmpl so regeneration produces the intended output. * ci(windows): extend free-tests lane to cover icacls + Bun.which resolvers from fix-wave PRs Closes #1306/#1307/#1308 validation gap. The four newly-added test files already have process.platform guards so they run safely on both POSIX and Windows lanes — only platform-relevant assertions execute on each. Tests added to the windows-latest lane: - browse/test/file-permissions.test.ts (#1308 icacls + writeSecureFile) - browse/test/security.test.ts (#1306 bash.exe wrap pure-function path) - make-pdf/test/browseClient.test.ts (#1307 Bun.which browse resolver) - make-pdf/test/pdftotext.test.ts (#1307 Bun.which pdftotext resolver) * test(codex): live flag-semantics smoke for codex exec resume Closes #1270's regex-only test gap. PR #1270 asserted that codex/SKILL.md's `codex exec resume` invocation drops -C/-s and uses sandbox_mode config. That regex catches the skill template regressing, but not codex CLI itself flipping flag semantics again. This test probes `codex exec resume --help` and asserts the surface gstack relies on: -c/sandbox_mode is accepted, top-level -C is absent. Skips silently when codex isn't on PATH, so dev machines without codex installed never see it fail. * chore: regen SKILL.md after fix wave One regen commit at the end of the merge wave per the plan. plan-devex-review loses the contradictory plan-mode handshake (#1333). review/SKILL.md picks up the Verification Mode + UNVERIFIABLE classification additions that #1302 authored against ship/SKILL.md (same resolver shared between ship and review modes). * fix(server.ts): keep fs.writeFileSync for state-file writes #1308's writeSecureFile wrapper added Windows icacls hardening for the 4 state-file write sites in server.ts, but #1310's regression test grep's for fs.writeFileSync(tmpStatePath()) calls. The two changes are technically compatible only if the test relaxes — keeping the test strict (the safer choice for catching regressions on the cold-start race) means the 4 state- file sites stay on fs.writeFileSync(..., { mode: 0o600 }). POSIX 0o600 hardening is preserved on those 4 sites. Windows icacls hardening still applies to all the other writeSecureFile call sites #1308 added (auth.json, mkdirSecure, etc.). Also refreshes golden baselines after #1302 / port + minor wording tweak in scripts/resolvers/review.ts to keep gen-skill-docs.test.ts assertion 'Cite the specific file' satisfied. * v1.30.0.0: fix wave — 21 community PRs + 2 closing fixes for Windows + codex CI gaps Headline release. Browse stops dropping console logs, cold-start race fixed, codex resume works without python3, Windows hardening (icacls + Bun.which + bash.exe wrap), ship gate gets VAS-449 remediation, two closing fixes that put icacls/Bun.which/codex flag semantics under CI. * test(domain-skills): cover #1369 classifier_score=0 quarantine + score>0 promote path The pre-existing T6 test seeded skills via writeSkill (which defaults classifier_score to 0 until L4 is rewired) and then expected 3 uses to auto-promote. PR #1369 added `current.classifier_score > 0` to the gate specifically to block that path — a quarantined skill written under the influence of a poisoned page would otherwise auto-promote after three benign uses. Updated test asserts both halves of the new contract: - classifier_score=0 + 3 uses → stays quarantined (the security guarantee) - classifier_score>0 + 3 more uses → promotes to active (unblock path) Catches both regressions: the gate going away (would re-allow the bypass) and the unblock path breaking (would silently quarantine all skills forever once L4 is rewired). --------- Co-authored-by: Jayesh Betala <jayesh.betala7@gmail.com> Co-authored-by: orbisai0security <mediratta01.pally@gmail.com> Co-authored-by: Bryce Alan <brycealan.eth@gmail.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Terry Carson YM <cym3118288@gmail.com> Co-authored-by: Vasko Ckorovski <vckorovski@gmail.com> Co-authored-by: Samuel Carson <samuel.carson@gmail.com> Co-authored-by: Yashwant Kotipalli <yashwant7kotipalli@gmail.com> Co-authored-by: Jasper Chen <jasperchen925@gmail.com> Co-authored-by: Stefan Neamtu <stefan.neamtu@gmail.com> Co-authored-by: 陈家名 <chenjiaming@kezaihui.com> Co-authored-by: Abigail Atheryon <abi@atheryon.ai> Co-authored-by: Furkan Köykıran <furkankoykiran@gmail.com> Co-authored-by: gus <gustavoraularagon@gmail.com>	2026-05-09 08:06:47 -07:00
Garry Tan GitHub Claude Opus 4.7 gus Mohammed Qazi	54d4cde773	security: tunnel dual-listener + SSRF + envelope + path wave (v1.6.0.0) (#1137 ) * refactor(security): loosen /connect rate limit from 3/min to 300/min Setup keys are 24 random bytes (unbruteforceable), so a tight rate limit does not meaningfully prevent key guessing. It exists only to cap bandwidth, CPU, and log-flood damage from someone who discovered the ngrok URL. A legitimate pair-agent session hits /connect once; 300/min is 60x that pattern and never hit accidentally. 3/min caused pairing to fail on any retry flow (network blip, second paired client) with no upside. Per-IP tracking was considered and rejected — adds a bounded Map + LRU for defense already adequate at the global layer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(security): add tunnel-denial-log module for attack visibility Append-only log of tunnel-surface auth denials to ~/.gstack/security/attempts.jsonl. Gives operators visibility into who is probing tunneled daemons so the next security wave can be driven by real attack data instead of speculation. Design notes: - Async via fs.promises.appendFile. Never appendFileSync — blocking the event loop on every denial during a flood is what an attacker wants (prior learning: sync-audit-log-io, 10/10 confidence). - In-process rate cap at 60 writes/minute globally. Excess denials are counted in memory but not written to disk — prevents disk DoS. - Writes to the same ~/.gstack/security/attempts.jsonl used by the prompt-injection attempt log. File rotation is handled by the existing security pipeline (10MB, 5 generations). No consumers in this commit; wired up in the dual-listener refactor that follows. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(security): dual-listener tunnel architecture The /health endpoint leaked AUTH_TOKEN to any caller that hit the ngrok URL (spoofing chrome-extension:// origin, or catching headed mode). Surfaced by @garagon in PR #1026; the original fix was header-inference on the single port. Codex's outside-voice review during /plan-ceo-review called that approach brittle (ngrok header behavior could change, local proxies would false-positive), and pushed for the structural fix. This is that fix. Stop making /health a root-token bootstrap endpoint on any surface the tunnel can reach. The server now binds two HTTP listeners when a tunnel is active. The local listener (extension, CLI, sidebar) stays on 127.0.0.1 and is never exposed to ngrok. ngrok forwards only to the tunnel listener, which serves only /connect (unauth, rate-limited) and /command with a locked allowlist of browser-driving commands. Security property comes from physical port separation, not from header inference — a tunnel caller cannot reach /health or /cookie-picker or /inspector because they live on a different TCP socket. What this commit adds to browse/src/server.ts: * Surface type ('local' \| 'tunnel') and TUNNEL_PATHS + TUNNEL_COMMANDS allowlists near the top of the file. * makeFetchHandler(surface) factory replacing the single fetch arrow; closure-captures the surface so the filter that runs before route dispatch knows which socket accepted the request. * Tunnel filter at dispatch entry: 404s anything not on TUNNEL_PATHS, 403s root-token bearers with a clear pairing hint, 401s non-/connect requests that lack a scoped token. Every denial is logged via logTunnelDenial (from tunnel-denial-log). * GET /connect alive probe (unauth on both surfaces) so /pair and /tunnel/start can detect dead ngrok tunnels without reaching /health — /health is no longer tunnel-reachable. * Lazy tunnel listener lifecycle. /tunnel/start binds a dedicated Bun.serve on an ephemeral port, points ngrok.forward at THAT port (not the local port), hard-fails on bind error (no local fallback), tears down cleanly on ngrok failure. BROWSE_TUNNEL=1 startup uses the same pattern. * closeTunnel() helper — single teardown path for both the ngrok listener and the tunnel Bun.serve listener. * resolveNgrokAuthtoken() helper — shared authtoken lookup across /tunnel/start and BROWSE_TUNNEL=1 startup (was duplicated). * TUNNEL_COMMANDS check in /command dispatch: on the tunnel surface, commands outside the allowlist return 403 with a list of allowed commands as a hint. * Probe paths in /pair and /tunnel/start migrated from /health to GET /connect — the only unauth path reachable on the tunnel surface under the new architecture. Test updates in browse/test/server-auth.test.ts: * /pair liveness-verify test: assert via closeTunnel() helper instead of the inline `tunnelActive = false; tunnelUrl = null` lines that the helper subsumes. * /tunnel/start cached-tunnel test: same closeTunnel() adaptation. Credit Derived from PR #1026 by @garagon — thanks for flagging the critical bug that drove the architectural rewrite. The per-request isTunneledRequest approach from #1026 is superseded by physical port separation here; the underlying report remains the root cause for the entire v1.6.0.0 wave. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(security): add source-level guards for dual-listener architecture 23 source-level assertions that keep future contributors from silently widening the tunnel surface during a routine refactor. Covers: * Surface type + tunnelServer state variable shape * TUNNEL_PATHS is a closed set of /connect, /command, /sidebar-chat (and NOT /health, /welcome, /cookie-picker, /inspector/, /pair, /token, /refs, /activity/stream, /tunnel/{start,stop}) TUNNEL_COMMANDS includes browser-driving ops only (and NOT launch-browser, tunnel-start, token-mint, cookie-import, etc.) * makeFetchHandler(surface) factory exists and is wired to both listeners with the correct surface parameter * Tunnel filter runs BEFORE any route dispatch, with 404/403/401 responses and logged denials for each reason * GET /connect returns {alive: true} unauth * /command dispatch enforces TUNNEL_COMMANDS on tunnel surface * closeTunnel() helper tears down ngrok + Bun.serve listener * /tunnel/start binds on ephemeral port, points ngrok at TUNNEL_PORT (not local port), hard-fails on bind error (no fallback), probes cached tunnel via GET /connect (not /health), tears down on ngrok.forward failure * BROWSE_TUNNEL=1 startup uses the dual-listener pattern * logTunnelDenial wired for all three denial reasons * /connect rate limit is 300/min, not 3/min All 23 tests pass. Behavioral integration tests (spawn subprocess, real network) live in the E2E suite that lands later in this wave. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * security: gate download + scrape through validateNavigationUrl (SSRF) The `goto` command was correctly wired through validateNavigationUrl, but `download` and `scrape` called page.request.fetch(url, ...) directly. A caller with the default write scope could hit the /command endpoint and ask the daemon to fetch http://169.254.169.254/latest/meta-data/ (AWS IMDSv1) or the GCP/Azure/internal equivalents. The response body comes back as base64 or lands on disk where GET /file serves it. Fix: call validateNavigationUrl(url) immediately before each page.request.fetch() call site in download and in the scrape loop. Same blocklist that already protects `goto`: file://, javascript:, data:, chrome://, cloud metadata (IPv4 all encodings, IPv6 ULA, metadata..internal). Tests: extend browse/test/url-validation.test.ts with a source-level guard that walks every `await page.request.fetch(` call site and asserts a validateNavigationUrl call precedes it within the same branch. Regression trips before code review if a future refactor drops the gate. security: route splitForScoped through envelope sentinel escape The scoped-token snapshot path in snapshot.ts built its untrusted block by pushing the raw accessibility-tree lines between the literal `═══ BEGIN UNTRUSTED WEB CONTENT ═══` / `═══ END UNTRUSTED WEB CONTENT ═══` sentinels. The full-page wrap path in content-security.ts already applied a zero-width-space escape on those exact strings to prevent sentinel injection, but the scoped path skipped it. Net effect: a page whose rendered text contains the literal sentinel can close the envelope early from inside untrusted content and forge a fake "trusted" block for the LLM. That includes fabricating interactive `@eN` references the agent will act on. Fix: * Extract the zero-width-space escape into a named, exported helper `escapeEnvelopeSentinels(content)` in content-security.ts. * Have `wrapUntrustedPageContent` call it (behavior unchanged on that path — same bytes out). * Import the helper in snapshot.ts and map it over `untrustedLines` in the `splitForScoped` branch before pushing the BEGIN sentinel. Tests: add a describe block in content-security.test.ts that covers * `escapeEnvelopeSentinels` defuses BEGIN and END markers; * `escapeEnvelopeSentinels` leaves normal text untouched; * `wrapUntrustedPageContent` still emits exactly one real envelope pair when hostile content contains forged sentinels; * snapshot.ts imports the helper; * the scoped-snapshot branch calls `escapeEnvelopeSentinels` before pushing the BEGIN sentinel (source-level regression — if a future refactor reorders this, the test trips). * security: extend hidden-element detection to all DOM-reading channels The Confusion Protocol envelope wrap (`wrapUntrustedPageContent`) covers every scoped PAGE_CONTENT_COMMAND, but the hidden-element ARIA-injection detection layer only ran for `text`. Other DOM-reading channels (html, links, forms, accessibility, attrs, data, media, ux-audit) returned their output through the envelope with no hidden- content filter, so a page serving a display:none div that instructs the agent to disregard prior system messages, or an aria-label that claims to put the LLM in admin mode, leaked the injection payload on any non-text channel. The envelope alone does not mitigate this, and the page itself never rendered the hostile content to the human operator. Fix: * New export `DOM_CONTENT_COMMANDS` in commands.ts — the subset of PAGE_CONTENT_COMMANDS that derives its output from the live DOM. Console and dialog stay out; they read separate runtime state. * server.ts runs `markHiddenElements` + `cleanupHiddenMarkers` for every scoped command in this set. `text` keeps its existing `getCleanTextWithStripping` path (hidden elements physically stripped before the read). All other channels keep their output format but emit flagged elements as CONTENT WARNINGS on the envelope, so the LLM sees what it would otherwise have consumed silently. * Hidden-element descriptions merge into `combinedWarnings` alongside content-filter warnings before the wrap call. Tests: new describe block in content-security.test.ts covering * `DOM_CONTENT_COMMANDS` export shape and channel membership; * dispatch gates on `DOM_CONTENT_COMMANDS.has(command)`, not the literal `text` string; * hiddenContentWarnings plumbs into `combinedWarnings` and reaches wrapUntrustedPageContent; * DOM_CONTENT_COMMANDS is a strict subset of PAGE_CONTENT_COMMANDS. Existing datamarking, envelope wrap, centralized-wrapping, and chain security suites stay green (52 pass, 0 fail). * security: validate --from-file payload paths for parity with direct paths The direct `load-html <file>` path runs every caller-supplied file path through validateReadPath() so reads stay confined to SAFE_DIRECTORIES (cwd, TEMP_DIR). The `load-html --from-file <payload.json>` shortcut and its sibling `pdf --from-file <payload.json>` skipped that check and went straight to fs.readFileSync(). An MCP caller that picks the payload path (or any caller whose payload argument is reachable from attacker-influenced text) could use --from-file as a read-anywhere escape hatch for the safe-dirs policy. Fix: call validateReadPath(path.resolve(payloadPath)) before readFileSync at both sites. Error surface mirrors the direct-path branch so ops and agent errors stay consistent. Test coverage in browse/test/from-file-path-validation.test.ts: - source-level: validateReadPath precedes readFileSync in the load-html --from-file branch (write-commands.ts) and the pdf --from-file parser (meta-commands.ts) - error-message parity: both sites reference SAFE_DIRECTORIES Related security audit pattern: R3 F002 (validateNavigationUrl gap on download/scrape) and R3 F008 (markHiddenElements gap on 10 DOM commands) were the same shape — a defense that existed on the primary code path but not its shortcut sibling. This PR closes the same class of gap on the --from-file shortcuts. * fix(design): escape url.origin when injecting into served HTML serve.ts injected url.origin into a single-quoted JS string in the response body. A local request with a crafted Host header (e.g. Host: "evil'-alert(1)-'x") would break out of the string and execute JS in the 127.0.0.1:<port> origin opened by the design board. Low severity — bound to localhost, requires a local attacker — but no reason not to escape. Fix: JSON.stringify(url.origin) produces a properly quoted, escaped JS string literal in one call. Also includes Prettier reformatting (single→double quotes, trailing commas, line wrapping) applied by the repo's PostToolUse formatter hook. Security change is the one line in the HTML injection; everything else is whitespace/style. * fix(scripts): drop shell:true from slop-diff npx invocations spawnSync('npx', [...], { shell: true }) invokes /bin/sh -c with the args concatenated, subjecting them to shell parsing (word splitting, glob expansion, metacharacter interpretation). No user input reaches these calls today, so not exploitable — but the posture is wrong: npx + shell args should be direct. Fix: scope shell:true to process.platform === 'win32' where npx is actually a .cmd requiring the shell. POSIX runs the npx binary directly with array-form args. Also includes Prettier reformatting (single→double quotes, trailing commas, line wrapping) applied by the repo's PostToolUse formatter hook. Security-relevant change is just the two shell:true -> shell: process.platform === 'win32' lines; everything else is whitespace/style. * security(E3): gate GSTACK_SLUG on /welcome path traversal The /welcome handler interpolates GSTACK_SLUG directly into the filesystem path used to locate the project-local welcome page. Without validation, a slug like "../../etc/passwd" would resolve to ~/.gstack/projects/../../etc/passwd/designs/welcome-page-20260331/finalized.html — classic path traversal. Not exploitable today: GSTACK_SLUG is set by the gstack CLI at daemon launch, and an attacker would already need local env-var access to poison it. But the gate is one regex (^[a-z0-9_-]+$), and a defense-in-depth pass costs us nothing when the cost of being wrong is arbitrary file read via /welcome. Fall back to the safe 'unknown' literal when the slug fails validation — same fallback the code already uses when GSTACK_SLUG is unset. No behavior change for legitimate slugs (they all match the regex). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * security(N1): replace ?token= SSE auth with HttpOnly session cookie Activity stream and inspector events SSE endpoints accepted the root AUTH_TOKEN via `?token=` query param (EventSource can't send Authorization headers). URLs leak to browser history, referer headers, server logs, crash reports, and refactoring accidents. Codex flagged this during the /plan-ceo-review outside voice pass. New auth model: the extension calls POST /sse-session with a Bearer token and receives a view-only session cookie (HttpOnly, SameSite=Strict, 30-min TTL). EventSource is opened with `withCredentials: true` so the browser sends the cookie back on the SSE connection. The ?token= query param is GONE — no more URL-borne secrets. Scope isolation (prior learning cookie-picker-auth-isolation, 10/10 confidence): the SSE session cookie grants access to /activity/stream and /inspector/events ONLY. The token is never valid against /command, /token, or any mutating endpoint. A leaked cookie can watch activity; it cannot execute browser commands. Components * browse/src/sse-session-cookie.ts — registry: mint/validate/extract/ build-cookie. 256-bit tokens, 30-min TTL, lazy expiry pruning, no imports from token-registry (scope isolation enforced by module boundary). * browse/src/server.ts — POST /sse-session mint endpoint (requires Bearer). /activity/stream and /inspector/events now accept Bearer OR the session cookie, and reject ?token= query param. * extension/sidepanel.js — ensureSseSessionCookie() bootstrap call, EventSource opened with withCredentials:true on both SSE endpoints. Tested via the source guards; behavioral test is the E2E pairing flow that lands later in the wave. * browse/test/sse-session-cookie.test.ts — 20 unit tests covering mint entropy, TTL enforcement, cookie flag invariants, cookie parsing from multi-cookie headers, and scope-isolation contract guard (module must not import token-registry). * browse/test/server-auth.test.ts — existing /activity/stream auth test updated to assert the new cookie-based gate and the absence of the ?token= query param. Cookie flag choices: * HttpOnly: token not readable from page JS (mitigates XSS exfiltration). * SameSite=Strict: cookie not sent on cross-site requests (mitigates CSRF). Fine for SSE because the extension connects to 127.0.0.1 directly. * Path=/: cookie scoped to the whole origin. * Max-Age=1800: 30 minutes, matches TTL. Extension re-mints on reconnect when daemon restarts. * Secure NOT set: daemon binds to 127.0.0.1 over plain HTTP. Adding Secure would block the browser from ever sending the cookie back. Add Secure when gstack ships over HTTPS. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * security(N2): document Windows v20 ABE elevation path on CDP port The existing comment around the cookie-import-browser --remote-debugging-port launch claimed "threat model: no worse than baseline." That's wrong on Windows with App-Bound Encryption v20. A same-user local process that opens the cookie SQLite DB directly CANNOT decrypt v20 values (DPAPI context is bound to the browser process). The CDP port lets them bypass that: connect to the debug port, call Network.getAllCookies inside Chrome, walk away with decrypted v20 cookies. The correct fix is to switch from TCP --remote-debugging-port to --remote-debugging-pipe so the CDP transport is a stdio pipe, not a socket. That requires restructuring the CDP WebSocket client in this module and Playwright doesn't expose the pipe transport out of the box. Non-trivial, deferred from the v1.6.0.0 wave. This commit updates the comment to correctly describe the threat and points at the tracking issue. No code change to the launch itself. Follow-up: #1136. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(E2): document dual-listener tunnel architecture in ARCHITECTURE.md Adds an explicit per-endpoint disposition table to the Security model section, covering the v1.6.0.0 dual-listener refactor. Every HTTP endpoint now has a documented local-vs-tunnel answer. Future audits (and future contributors wondering "is it safe to add X to the tunnel surface?") can read this instead of reverse-engineering server.ts. Also documents: * Why physical port separation beats per-request header inference (ngrok behavior drift, local proxies can forge headers, etc.) * Tunnel surface denial logging → ~/.gstack/security/attempts.jsonl * SSE session cookie model (gstack_sse, 30-min TTL, stream-scope only, module-boundary-enforced scope isolation) * N2 non-goal for Windows v20 ABE via CDP port (tracking #1136) No code changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(E1): end-to-end pair-agent flow against a spawned daemon Spawns the browse daemon as a subprocess with BROWSE_HEADLESS_SKIP=1 so the HTTP layer runs without a real browser. Exercises: * GET /health — token delivery for chrome-extension origin, withheld otherwise (the F1 + PR #1026 invariant) * GET /connect — alive probe returns {alive:true} unauth * POST /pair — root Bearer required (403 without), returns setup_key * POST /connect — setup_key exchange mints a distinct scoped token * POST /command — 401 without auth * POST /sse-session — Bearer required, Set-Cookie has HttpOnly + SameSite=Strict (the N1 invariant) * GET /activity/stream — 401 without auth * GET /activity/stream?token= — 401 (the old ?token= query param is REJECTED, which is the whole point of N1) * GET /welcome — serves HTML, does not leak /etc/passwd content under the default 'unknown' slug (E3 regex gate) 12 behavioral tests, ~220ms end-to-end, no network dependencies, no ngrok, no real browser. This is the receipt for the wave's central 'pair-agent still works + the security boundary holds' claim. Tunnel-port binding (/tunnel/start) is deliberately NOT exercised here — it requires an ngrok authtoken and live network. The dual-listener route allowlist is covered by source-level guards in dual-listener.test.ts; behavioral tunnel testing belongs in a separate paid-evals harness. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * release(v1.6.0.0): bump VERSION + CHANGELOG for security wave Architectural bump, not patch: dual-listener HTTP refactor changes the daemon's tunnel-exposure model. See CHANGELOG for the full release summary (~950 words) covering the five root causes this wave closes: 1. /health token leak over ngrok (F1 + E3 + test infra) 2. /cookie-picker + /inspector exposed over the tunnel (F1) 3. ?token=<ROOT> in SSE URLs leaking to logs/referer/history (N1) 4. /welcome GSTACK_SLUG path traversal (E3) 5. Windows v20 ABE elevation via CDP port (N2 — documented non-goal, tracked as #1136) Plus the base PRs: SSRF gate (#1029), envelope sentinel escape (#1031), DOM-channel hidden-element coverage (#1032), --from-file path validation (#1103), and 2 commits from #1073 (@theqazi). VERSION + package.json bumped to 1.6.0.0. CHANGELOG entry covers credits (@garagon, @Hybirdss, @HMAKT99, @theqazi), review lineage (CEO → Codex outside voice → Eng), and the non-goal tracking issue. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix: pre-landing review findings (4 auto-fixes) Addresses 4 findings from the Claude adversarial subagent on the v1.6.0.0 security wave diff. No user-visible behavior change; all are defense-in-depth hardening of newly-introduced code. 1. GET /connect rate-limited (was POST-only) [HIGH conf 8/10] Attacker discovering the ngrok URL could probe unlimited GETs for daemon enumeration. Now shares the global /connect counter. 2. ngrok listener leak on tunnel startup failure [MEDIUM conf 8/10] If ngrok.forward() resolved but tunnelListener.url() or the state-file write threw, the Bun listener was torn down but the ngrok session was leaked. Fixed in BOTH /tunnel/start and BROWSE_TUNNEL=1 startup paths. 3. GSTACK_SKILL_ROOT path-traversal gate [MEDIUM conf 8/10] Symmetric with E3's GSTACK_SLUG regex gate — reject values containing '..' before interpolating into the welcome-page path. 4. SSE session registry pruning [LOW conf 7/10] pruneExpired() only checked 10 entries per mint call. Now runs on every validate too, checks 20 entries, with a hard 10k cap as backstop. Prevents registry growth under sustained extension reconnect pressure. Tests remain green (56/56 in sse-session-cookie + dual-listener + pair-agent-e2e suites). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: update project documentation for v1.6.0.0 Reflect the dual-listener tunnel architecture, SSE session cookies, SSRF guards, and Windows v20 ABE non-goal across the three docs users actually read for remote-agent and browser auth context: - docs/REMOTE_BROWSER_ACCESS.md: rewrote Architecture diagram for dual listeners, fixed /connect rate limit (3/min → 300/min), removed stale "/health requires no auth" (now 404 on tunnel), added SSE cookie auth, expanded Security Model with tunnel allowlist, SSRF guards, /welcome path traversal defense, and the Windows v20 ABE tracking note. - BROWSER.md: added dual-listener paragraph to Authentication and linked to ARCHITECTURE.md endpoint table. Replaced the stale ?token= SSE auth note with the HttpOnly gstack_sse cookie flow. - CLAUDE.md: added Transport-layer security section above the sidebar prompt-injection stack so contributors editing server.ts, sse-session-cookie.ts, or tunnel-denial-log.ts see the load-bearing module boundaries before touching them. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(make-pdf): write --from-file payload to /tmp, not os.tmpdir() make-pdf's browseClient wrote its --from-file payload to os.tmpdir(), which is /var/folders/... on macOS. v1.6.0.0's PR #1103 cherry-pick tightened browse load-html --from-file to validate against the safe-dirs allowlist ([TEMP_DIR, cwd] where TEMP_DIR is '/tmp' on macOS/Linux, os.tmpdir() on Windows). This closed a CLI/API parity gap but broke make-pdf on macOS because /var/folders/... is outside the allowlist. Fix: mirror browse's TEMP_DIR convention — use '/tmp' on non-Windows, os.tmpdir() on Windows. The make-pdf-gate CI failure on macOS-latest (run 72440797490) is caused by exactly this: the payload file was rejected by validateReadPath. Verified locally: the combined-gate e2e test now passes after rebuilding make-pdf/dist/pdf. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(sidebar): killAgent resets per-tab state; align tests with current agent event format Two pre-existing bugs surfaced while running the full e2e suite on the sec-wave branch. Both pre-date v1.6.0.0 (same failures on main at `e23ff280`) but blocked the ship verification, so fixing now. ### Bug 1: killAgent leaked stale per-tab state `killAgent()` reset the legacy globals (agentProcess, agentStatus, etc.) but never touched the per-tab `tabAgents` Map. Meanwhile `/sidebar-command` routes on `tabState.status` from that Map, not the legacy globals. Consequence: after a kill (including the implicit kill in `/sidebar-session/new`), the next /sidebar-command on the same tab saw `tabState.status === 'processing'` and fell into the queue branch, silently NOT spawning an agent. Integration tests that called resetState between cases all failed with empty queues. Fix: when targetTabId is supplied, reset that one tab's state; when called without a tab (session-new, full kill), reset ALL tab states. Matches the semantic boundary already used for the cancel-file write. ### Bug 2: sidebar-integration tests drifted from current event format `agent events appear in /sidebar-chat` posted the raw Claude streaming format (`{type: 'assistant', message: {content: [...]}}`) but `processAgentEvent` in server.ts only handles the simplified types that sidebar-agent.ts pre-processes into (text, text_delta, tool_use, result, agent_error, security_event). The architecture moved pre-processing into sidebar-agent.ts at some point and this test never got updated. Fixed by sending the pre-processed `{type: 'text', text: '...'}` format — which is actually what the server sees in production. Also removed the `entry.prompt` URL-containment check in the queue-write test. The URL is carried on entry.pageUrl (metadata) by design: the system prompt tells Claude to run `browse url` to fetch the actual page rather than trust any URL in the prompt body. That's the URL-based prompt-injection defense. The prompt SHOULD NOT contain the URL, so the test assertion was wrong for the current security posture. ### Verification - `bun test browse/test/sidebar-integration.test.ts` → 13/13 pass (was 6/13 on both main and branch before this commit) - Full `bun run test` → exit 0, zero fail markers - No behavior change for production sidebar flows: killAgent was already supposed to return the agent to idle; it just wasn't fully doing so. Per-tab reset now matches the documented semantics. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: gus <gustavoraularagon@gmail.com> Co-authored-by: Mohammed Qazi <10266060+theqazi@users.noreply.github.com>	2026-04-21 21:58:27 -07:00
Garry Tan GitHub Claude Opus 4.6 Yunsu Gus Toby Morning Alberto Martinez Maksim Soltan Ziad Al Sharif Osman Mehmood	7e96fe299b	fix: security wave 3 — 12 fixes, 7 contributors (v0.16.4.0) (#988 ) * fix(security): validateOutputPath symlink bypass — check file-level symlinks validateOutputPath() previously only resolved symlinks on the parent directory. A symlink at /tmp/evil.png → /etc/crontab passed the parent check (parent is /tmp, which is safe) but the write followed the symlink outside safe dirs. Add lstatSync() check: if the target file exists and is a symlink, resolve through it and verify the real target is within SAFE_DIRECTORIES. ENOENT (file doesn't exist yet) falls through to the existing parent-dir check. Closes #921 Co-Authored-By: Yunsu <Hybirdss@users.noreply.github.com> * fix(security): shell injection in bin/ scripts — use env vars instead of interpolation gstack-settings-hook interpolated $SETTINGS_FILE directly into bun -e double-quoted blocks. A path containing quotes or backticks breaks the JS string context, enabling arbitrary code execution. Replace direct interpolation with environment variables (process.env). Same fix applied to gstack-team-init which had the same pattern. Systematic audit confirmed only these two scripts were vulnerable — all other bin/ scripts already use stdin piping or env vars. Closes #858 Co-Authored-By: Gus <garagon@users.noreply.github.com> * fix(security): cookie-import path validation bypass + hardcoded /tmp Two fixes: 1. cookie-import relative path bypass (#707): path.isAbsolute() gated the entire validation, so relative paths like "sensitive-file.json" bypassed the safe-directory check entirely. Now always resolves to absolute path with realpathSync for symlink resolution, matching validateOutputPath(). 2. Hardcoded /tmp in cookie-import-browser (#708): openDbFromCopy used /tmp directly instead of os.tmpdir(), breaking Windows support. Also adds explicit imports for SAFE_DIRECTORIES and isPathWithin in write-commands.ts (previously resolved implicitly through bundler). Closes #852 Co-Authored-By: Toby Morning <urbantech@users.noreply.github.com> * fix(security): redact form fields with sensitive names, not just type=password Form redaction only applied to type="password" fields. Hidden and text fields named csrf_token, api_key, session_id, etc. were exposed unredacted in LLM context, leaking secrets. Extend redaction to check field name and id against sensitive patterns: token, secret, key, password, credential, auth, jwt, session, csrf, sid, api_key. Uses the same pattern style as SENSITIVE_COOKIE_NAME. Closes #860 Co-Authored-By: Gus <garagon@users.noreply.github.com> * fix(security): restrict session file permissions to owner-only Design session files written to /tmp with default umask (0644) were world-readable on shared systems. Sessions contain design prompts and feedback history. Set mode 0o600 (owner read/write only) on both create and update paths. Closes #859 Co-Authored-By: Gus <garagon@users.noreply.github.com> * fix(security): enforce frozen lockfile during setup bun install without --frozen-lockfile resolves ^semver ranges from npm on every run. If an attacker publishes a compromised compatible version of any dependency, the next ./setup pulls it silently. Add --frozen-lockfile with fallback to plain install (for fresh clones where bun.lock may not exist yet). Matches the pattern already used in the .agents/ generation block (line 237). Closes #614 Co-Authored-By: Alberto Martinez <halbert04@users.noreply.github.com> * fix: remove duplicate recursive chmod on /tmp in Dockerfile.ci chmod -R 1777 /tmp recursively sets sticky bit on files (no defined behavior), not just the directory. Deduplicate to single chmod 1777 /tmp. Closes #747 Co-Authored-By: Maksim Soltan <Gonzih@users.noreply.github.com> * fix(security): learnings input validation + cross-project trust gate Three fixes to the learnings system: 1. Input validation in gstack-learnings-log: type must be from allowed list, key must be alphanumeric, confidence must be 1-10 integer, source must be from allowed list. Prevents injection via malformed fields. 2. Prompt injection defense: insight field checked against 10 instruction-like patterns (ignore previous, system:, override, etc.). Rejected with clear error message. 3. Cross-project trust gate in gstack-learnings-search: AI-generated learnings from other projects are filtered out. Only user-stated learnings cross project boundaries. Prevents silent prompt injection across codebases. Also adds trusted field (true for user-stated source, false for AI-generated) to enable the trust gate at read time. Closes #841 Co-Authored-By: Ziad Al Sharif <Ziadstr@users.noreply.github.com> * feat(security): track cookie-imported domains and scope cookie imports Foundation for origin-pinned JS execution (#616). Tracks which domains cookies were imported from so the JS/eval commands can verify execution stays within imported origins. Changes: - BrowserManager: new cookieImportedDomains Set with track/get/has methods - cookie-import: tracks imported cookie domains after addCookies - cookie-import-browser: tracks domains on --domain direct import - cookie-import-browser --all: new explicit opt-in for all-domain import (previously implicit behavior, now requires deliberate flag) Closes #615 Co-Authored-By: Alberto Martinez <halbert04@users.noreply.github.com> * feat(security): pin JS/eval execution to cookie-imported origins When cookies have been imported for specific domains, block JS execution on pages whose origin doesn't match. Prevents the attack chain: 1. Agent imports cookies for github.com 2. Prompt injection navigates to attacker.com 3. Agent runs js document.cookie → exfiltrates github cookies assertJsOriginAllowed() checks the current page hostname against imported cookie domains with subdomain matching (.github.com allows api.github.com). When no cookies are imported, all origins allowed (nothing to protect). about:blank and data: URIs are allowed (no cookies at risk). Depends on #615 (cookie domain tracking). Closes #616 Co-Authored-By: Alberto Martinez <halbert04@users.noreply.github.com> * feat(security): add persistent command audit log Append-only JSONL audit trail for all browse server commands. Unlike in-memory ring buffers, the audit log persists across restarts and is never truncated. Each entry records: timestamp, command, args (truncated to 200 chars), page origin, duration, status, error (truncated to 300 chars), hasCookies flag, connection mode. All writes are best-effort — audit failures never block command execution. Log stored at ~/.gstack/.browse/browse-audit.jsonl. Closes #617 Co-Authored-By: Alberto Martinez <halbert04@users.noreply.github.com> * fix(security): block hex-encoded IPv4-mapped IPv6 metadata bypass URL constructor normalizes ::ffff:169.254.169.254 to ::ffff:a9fe:a9fe (hex form), which was not in the blocklist. Similarly, ::169.254.169.254 normalizes to ::a9fe:a9fe. Add both hex-encoded forms to BLOCKED_METADATA_HOSTS so they're caught by the direct hostname check in validateNavigationUrl. Closes #739 Co-Authored-By: Osman Mehmood <mehmoodosman@users.noreply.github.com> * chore: bump version and changelog (v0.16.4.0) Security wave 3: 12 fixes, 7 contributors. Cookie origin pinning, command audit log, domain tracking. Symlink bypass, path validation, shell injection, form redaction, learnings injection, IPv6 SSRF, session permissions, frozen lockfile. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Yunsu <Hybirdss@users.noreply.github.com> Co-authored-by: Gus <garagon@users.noreply.github.com> Co-authored-by: Toby Morning <urbantech@users.noreply.github.com> Co-authored-by: Alberto Martinez <halbert04@users.noreply.github.com> Co-authored-by: Maksim Soltan <Gonzih@users.noreply.github.com> Co-authored-by: Ziad Al Sharif <Ziadstr@users.noreply.github.com> Co-authored-by: Osman Mehmood <mehmoodosman@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 07:49:37 -10:00
Garry Tan GitHub Claude Opus 4.6 garagon 0531Kim pieterklue mmporong mr-k-man	03973c2fab	fix: community security wave — 8 PRs, 4 contributors (v0.15.13.0) (#847 ) * fix(bin): pass search params via env vars (RCE fix) (#819) Replace shell string interpolation with process.env in gstack-learnings-search to prevent arbitrary code execution via crafted learnings entries. Also fixes the CROSS_PROJECT interpolation that the original PR missed. Adds 3 regression tests verifying no shell interpolation remains in the bun -e block. Co-authored-by: garagon <garagon@users.noreply.github.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(browse): add path validation to upload command (#821) Add isPathWithin() and path traversal checks to the upload command, blocking file exfiltration via crafted upload paths. Uses existing SAFE_DIRECTORIES constant instead of a local copy. Adds 3 regression tests. Co-authored-by: garagon <garagon@users.noreply.github.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(browse): symlink resolution in meta-commands validateOutputPath (#820) Add realpathSync to validateOutputPath in meta-commands.ts to catch symlink-based directory escapes in screenshot, pdf, and responsive commands. Resolves SAFE_DIRECTORIES through realpathSync to handle macOS /tmp -> /private/tmp symlinks. Existing path validation tests pass with the hardened implementation. Co-authored-by: garagon <garagon@users.noreply.github.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: add uninstall instructions to README (#812) Community PR #812 by @0531Kim. Adds two uninstall paths: the gstack-uninstall script (handles everything) and manual removal steps for when the repo isn't cloned. Includes CLAUDE.md cleanup note and Playwright cache guidance. Co-Authored-By: 0531Kim <0531Kim@users.noreply.github.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(browse): Windows launcher extraEnv + headed-mode token (#822) Community PR #822 by @pieterklue. Three fixes: 1. Windows launcher now merges extraEnv into spawned server env (was only passing BROWSE_STATE_FILE, dropping all other env vars) 2. Welcome page fallback serves inline HTML instead of about:blank redirect (avoids ERR_UNSAFE_REDIRECT on Windows) 3. /health returns auth token in headed mode even without Origin header (fixes Playwright Chromium extensions that don't send it) Also adds HOME/USERPROFILE fallback for cross-platform compatibility. Co-Authored-By: pieterklue <pieterklue@users.noreply.github.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(browse): terminate orphan server when parent process exits (#808) Community PR #808 by @mmporong. Passes BROWSE_PARENT_PID to the spawned server process. The server polls every 15s with signal 0 and calls shutdown() if the parent is gone. Prevents orphaned chrome-headless-shell processes when Claude Code sessions exit abnormally. Co-Authored-By: mmporong <mmporong@users.noreply.github.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(security): IPv6 ULA blocking, cookie redaction, per-tab cancel, targeted token (#664) Community PR #664 by @mr-k-man (security audit round 1, new parts only). - IPv6 ULA prefix blocking (fc00::/7) in url-validation.ts with false-positive guard for hostnames like fd.example.com - Cookie value redaction for tokens, API keys, JWTs in browse cookies command - Per-tab cancel files in killAgent() replacing broken global kill-signal - design/serve.ts: realpathSync upgrade prevents symlink bypass in /api/reload - extension: targeted getToken handler replaces token-in-health-broadcast - Supabase migration 003: column-level GRANT restricts anon UPDATE scope - Telemetry sync: upsert error logging - 10 new tests for IPv6, cookie redaction, DNS rebinding, path traversal Co-Authored-By: mr-k-man <mr-k-man@users.noreply.github.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(security): CSS injection guard, timeout clamping, session validation, tests (#806) Community PR #806 by @mr-k-man (security audit round 2, new parts only). - CSS value validation (DANGEROUS_CSS) in cdp-inspector, write-commands, extension inspector - Queue file permissions (0o700/0o600) in cli, server, sidebar-agent - escapeRegExp for frame --url ReDoS fix - Responsive screenshot path validation with validateOutputPath - State load cookie filtering (reject localhost/.internal/metadata cookies) - Session ID format validation in loadSession - /health endpoint: remove currentUrl and currentMessage fields - QueueEntry interface + isValidQueueEntry validator for sidebar-agent - SIGTERM->SIGKILL escalation in timeout handler - Viewport dimension clamping (1-16384), wait timeout clamping (1s-300s) - Cookie domain validation in cookie-import and cookie-import-browser - DocumentFragment-based tab switching (XSS fix in sidepanel) - pollInProgress reentrancy guard for pollChat - toggleClass/injectCSS input validation in extension inspector - Snapshot annotated path validation with realpathSync - 714-line security-audit-r2.test.ts + 33-line learnings-injection.test.ts Co-Authored-By: mr-k-man <mr-k-man@users.noreply.github.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: bump version and changelog (v0.15.13.0) Community security wave: 8 PRs from 4 contributors (@garagon, @mr-k-man, @mmporong, @0531Kim, @pieterklue). IPv6 ULA blocking, cookie redaction, per-tab cancel signaling, CSS injection guards, timeout clamping, session validation, DocumentFragment XSS fix, parent process watchdog, uninstall docs, Windows fixes, and 750+ lines of security regression tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: garagon <garagon@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: 0531Kim <0531Kim@users.noreply.github.com> Co-authored-by: pieterklue <pieterklue@users.noreply.github.com> Co-authored-by: mmporong <mmporong@users.noreply.github.com> Co-authored-by: mr-k-man <mr-k-man@users.noreply.github.com>	2026-04-06 00:47:04 -07:00
Garry Tan GitHub Claude Opus 4.6	2b08cfe71e	fix: close redundant PRs + friendly error on all design commands (v0.15.8.1) (#817 ) * fix: friendly OpenAI org error on all design commands Previously only generate.ts showed a user-friendly message when the OpenAI org wasn't verified. Now evolve, iterate, variants, and check all detect the 403 + "organization must be verified" pattern and show a clear message with the correct verification URL. * test: regression test for >128KB Codex session_meta Documents the current 128KB buffer limitation. When Codex embeds session_meta beyond 128KB, this test will fail, signaling the need for a streaming parse or larger buffer. * chore: bump version and changelog (v0.15.8.1) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-05 02:02:06 -07:00
Matt Van HornandGitHub	f91ad61a15	fix: user-friendly error when OpenAI org is not verified (#776 ) Merged via PR triage plan. Friendly error for unverified OpenAI org. Follow-up: expand to evolve.ts, iterate.ts, variants.ts, check.ts.	2026-04-05 00:09:32 -07:00
Garry Tan GitHub Claude Opus 4.6 Gonzih garagon	115d81d792	fix: security wave 1 — 14 fixes for audit #783 (v0.15.7.0) (#810 ) * fix: DNS rebinding protection checks AAAA (IPv6) records too Cherry-pick PR #744 by @Gonzih. Closes the IPv6-only DNS rebinding gap by checking both A and AAAA records independently. Co-Authored-By: Gonzih <gonzih@users.noreply.github.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: validateOutputPath symlink bypass — resolve real path before safe-dir check Cherry-pick PR #745 by @Gonzih. Adds a second pass using fs.realpathSync() to resolve symlinks after lexical path validation. Co-Authored-By: Gonzih <gonzih@users.noreply.github.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: validate saved URLs before navigation in restoreState Cherry-pick PR #751 by @Gonzih. Prevents navigation to cloud metadata endpoints or file:// URIs embedded in user-writable state files. Co-Authored-By: Gonzih <gonzih@users.noreply.github.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: telemetry-ingest uses anon key instead of service role key Cherry-pick PR #750 by @Gonzih. The service role key bypasses RLS and grants unrestricted database access — anon key + RLS is the right model for a public telemetry endpoint. Co-Authored-By: Gonzih <gonzih@users.noreply.github.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: killAgent() actually kills the sidebar claude subprocess Cherry-pick PR #743 by @Gonzih. Implements cross-process kill signaling via kill-file + polling pattern, tracks active processes per-tab. Co-Authored-By: Gonzih <gonzih@users.noreply.github.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(design): bind server to localhost and validate reload paths Cherry-pick PR #803 by @garagon. Adds hostname: '127.0.0.1' to Bun.serve() and validates /api/reload paths are within cwd() or tmpdir(). Closes C1+C2 from security audit #783. Co-Authored-By: garagon <garagon@users.noreply.github.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: add auth gate to /inspector/events SSE endpoint (C3) The /inspector/events endpoint had no authentication, unlike /activity/stream which validates tokens. Now requires the same Bearer header or ?token= query param check. Closes C3 from security audit #783. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: sanitize design feedback with trust boundary markers (C4+H5) Wrap user feedback in <user-feedback> XML markers with tag escaping to prevent prompt injection via malicious feedback text. Cap accumulated feedback to last 5 iterations to limit incremental poisoning. Closes C4 and H5 from security audit #783. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: harden file/directory permissions to owner-only (C5+H9+M9+M10) Add mode 0o700 to all mkdirSync calls for state/session directories. Add mode 0o600 to all writeFileSync calls for session.json, chat.jsonl, and log files. Add umask 077 to setup script. Prevents auth tokens, chat history, and browser logs from being world-readable on multi-user systems. Closes C5, H9, M9, M10 from security audit #783. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: TOCTOU race in setup symlink creation (C6) Remove the existence check before mkdir -p (it's idempotent) and validate the target isn't already a symlink before creating the link. Prevents a local attacker from racing between the check and mkdir to redirect SKILL.md writes. Closes C6 from security audit #783. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: remove CORS wildcard, restrict to localhost (H1) Replace Access-Control-Allow-Origin: * with http://127.0.0.1 on sidebar tab/chat endpoints. The Chrome extension uses manifest host_permissions to bypass CORS entirely, so this only blocks malicious websites from making cross-origin requests. Closes H1 from security audit #783. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: make cookie picker auth mandatory (H2) Remove the conditional if(authToken) guard that skipped auth when authToken was undefined. Now all cookie picker data/action routes reject unauthenticated requests. Closes H2 from security audit #783. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: gate /health token on chrome-extension Origin header Only return the auth token in /health response when the request Origin starts with chrome-extension://. The Chrome extension always sends this origin via manifest host_permissions. Regular HTTP requests (including tunneled ones from ngrok/SSH) won't get the token. The extension also has a fallback path through background.js that reads the token from the state file directly. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * test: update server-auth test for chrome-extension Origin gating The test previously checked for 'localhost-only' comment. Now checks for 'chrome-extension://' since the token is gated on Origin header. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: bump version and changelog (v0.15.7.0) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Gonzih <gonzih@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: garagon <garagon@users.noreply.github.com>	2026-04-04 22:12:04 -07:00
Garry Tan GitHub Claude Opus 4.6	78bc1d1968	feat: design binary — real UI mockup generation for gstack skills (v0.13.0.0) (#551 ) * docs: design tools v1 plan — visual mockup generation for gstack skills Full design doc covering the `design` binary that wraps OpenAI's GPT Image API to generate real UI mockups from gstack's design skills. Includes comparison board UX spec, auth model, 6 CEO expansions (design memory, mockup diffing, screenshot evolution, design intent verification, responsive variants, design-to-code prompt), and 9-commit implementation plan. Reviewed: /office-hours + /plan-eng-review (CLEARED) + /plan-ceo-review (EXPANSION, 6/6 accepted) + /plan-design-review (2/10 → 8/10). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: design tools prototype validation — GPT Image API works Prototype script sends 3 design briefs to OpenAI Responses API with image_generation tool. Results: dashboard (47s, 2.1MB), landing page (42s, 1.3MB), settings page (37s, 1.3MB) all produce real, implementable UI mockups with accurate text rendering and clean layouts. Key finding: Codex OAuth tokens lack image generation scopes. Direct API key (sk-proj-) required, stored in ~/.gstack/openai.json. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> feat: design binary core — generate, check, compare commands Stateless CLI (design/dist/design) wrapping OpenAI Responses API for UI mockup generation. Three working commands: - generate: brief -> PNG mockup via gpt-4o + image_generation tool - check: vision-based quality gate via GPT-4o (text readability, layout completeness, visual coherence) - compare: generates self-contained HTML comparison board with star ratings, radio Pick, per-variant feedback, regenerate controls, and Submit button that writes structured JSON for agent polling Auth reads from ~/.gstack/openai.json (0600), falls back to OPENAI_API_KEY env var. Compiled separately from browse binary (openai added to devDependencies, not runtime deps). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: design binary variants + iterate commands variants: generates N style variations with staggered parallel (1.5s between launches, exponential backoff on 429). 7 built-in style variations (bold, calm, warm, corporate, dark, playful + default). Tested: 3/3 variants in 41.6s. iterate: multi-turn design iteration using previous_response_id for conversational threading. Falls back to re-generation with accumulated feedback if threading doesn't retain visual context. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: DESIGN_SETUP + DESIGN_MOCKUP template resolvers Add generateDesignSetup() and generateDesignMockup() to the existing design.ts resolver file. Add designDir to HostPaths (claude + codex). Register DESIGN_SETUP and DESIGN_MOCKUP in the resolver index. DESIGN_SETUP: $D binary discovery (mirrors $B browse setup pattern). Falls back to DESIGN_SKETCH if binary not available. DESIGN_MOCKUP: full visual exploration workflow template — construct brief from DESIGN.md context, generate 3 variants, open comparison board in Chrome, poll for user feedback, save approved mockup to docs/designs/, generate HTML wireframe for implementation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: sync package.json version with VERSION file (0.12.2.0) Pre-existing mismatch: VERSION was 0.12.2.0 but package.json was 0.12.0.0. Also adds design binary to build script and dev:design convenience command. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: /office-hours visual design exploration integration Add {{DESIGN_MOCKUP}} to office-hours template before the existing {{DESIGN_SKETCH}}. When the design binary is available, /office-hours generates 3 visual mockup variants, opens a comparison board in Chrome, and polls for user feedback. Falls back to HTML wireframes if the design binary isn't built. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: /plan-design-review visual mockup integration Add {{DESIGN_SETUP}} to pre-review audit and "show me what 10/10 looks like" mockup generation to the 0-10 rating method. When a design dimension rates below 7/10, the review can generate a mockup showing the improved version. Falls back to text descriptions if the design binary isn't available. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: design memory — extract visual language from mockups into DESIGN.md New `$D extract` command: sends approved mockup to GPT-4o vision, extracts color palette, typography, spacing, and layout patterns, writes/updates DESIGN.md with an "Extracted Design Language" section. Progressive constraint: if DESIGN.md exists, future mockup briefs include it as style context. If no DESIGN.md, explorations run wide. readDesignConstraints() reads existing DESIGN.md for brief construction. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: mockup diffing + design intent verification New commands: - $D diff --before old.png --after new.png: visual diff using GPT-4o vision. Returns differences by area with severity (high/medium/low) and a matchScore (0-100). - $D verify --mockup approved.png --screenshot live.png: compares live site screenshot against approved design mockup. Pass if matchScore >= 70 and no high-severity differences. Used by /design-review to close the design loop: design -> implement -> verify visually. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: screenshot-to-mockup evolution ($D evolve) New command: $D evolve --screenshot current.png --brief "make it calmer" Two-step process: first analyzes the screenshot via GPT-4o vision to produce a detailed description, then generates a new mockup that keeps the existing layout structure but applies the requested changes. Starts from reality, not blank canvas. Bridges the gap between /design-review critique ("the spacing is off") and a visual proposal of the fix. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: responsive variants + design-to-code prompt Responsive variants: $D variants --viewports desktop,tablet,mobile generates mockups at 1536x1024, 1024x1024, and 1024x1536 (portrait) with viewport-appropriate layout instructions. Design-to-code prompt: $D prompt --image approved.png extracts colors, typography, layout, and components via GPT-4o vision, producing a structured implementation prompt. Reads DESIGN.md for additional constraint context. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: bump version and changelog (v0.13.0.0) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: gstack designer as first-class tool in /plan-design-review Brand the gstack designer prominently, add Step 0.5 for proactive visual mockup generation before review passes, and update priority hierarchy. When a plan describes new UI, the skill now offers to generate mockups with $D variants, run $D check for quality gating, and present a comparison board via $B goto before any review passes begin. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: integrate mockups into review passes and outputs Thread Step 0.5 mockups through the review workflow: Pass 4 (AI Slop) evaluates generated mockups visually, Pass 7 uses mockups as evidence for unresolved decisions, post-pass offers one-shot regeneration after design changes, and Approved Mockups section records chosen variants with paths for the implementer. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: gstack designer target mockups in /design-review fix loop Add $D generate for target mockups in Phase 8a.5 — before fixing a design finding, generate a mockup showing what it should look like. Add $D verify in Phase 9 to compare fix results against targets. Not plan mode — goes straight to implementation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: gstack designer AI mockups in /design-consultation Phase 5 Replace HTML preview with $D variants + comparison board when designer is available (Path A). Use $D extract to derive DESIGN.md tokens from the approved mockup. Handles both plan mode (write to plan) and non-plan mode (implement immediately). Falls back to HTML preview (Path B) when designer binary is unavailable. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: make gstack designer the default in /plan-design-review, not optional The transcript showed the agent writing 5 text descriptions of homepage variants instead of generating visual mockups, even when the user explicitly asked for design tools. The skill treated mockups as optional ("Want me to generate?") when they should be the default behavior. Changes: - Rename "Your Visual Design Tool" to "YOUR PRIMARY TOOL" with aggressive language: "Don't ask permission. Show it." - Step 0.5 now generates mockups automatically when DESIGN_READY, no AskUserQuestion gatekeeping the default path - Priority hierarchy: mockups are "non-negotiable" not "if available" - Step 0D tells the user mockups are coming next - DESIGN_NOT_AVAILABLE fallback now tells user what they're missing The only valid reasons to skip mockups: no UI scope, or designer not installed. Everything else generates by default. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: persist design mockups to ~/.gstack/projects/$SLUG/designs/ Mockups were going to .context/mockups/ (gitignored, workspace-local). This meant designs disappeared when switching workspaces or conversations, and downstream skills couldn't reference approved mockups from earlier reviews. Now all three design skills save to persistent project-scoped dirs: - /plan-design-review: ~/.gstack/projects/$SLUG/designs/<screen>-<date>/ - /design-consultation: ~/.gstack/projects/$SLUG/designs/design-system-<date>/ - /design-review: ~/.gstack/projects/$SLUG/designs/design-audit-<date>/ Each directory gets an approved.json recording the user's pick, feedback, and branch. This lets /design-review verify against mockups that /plan-design-review approved, and design history is browsable via ls ~/.gstack/projects/$SLUG/designs/. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: regenerate codex ship skill with zsh glob guards Picked up setopt +o nomatch guards from main's v0.12.8.1 merge. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: add browse binary discovery to DESIGN_SETUP resolver The design setup block now discovers $B alongside $D, so skills can open comparison boards via $B goto and poll feedback via $B eval. Falls back to `open` on macOS when browse binary is unavailable. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: comparison board DOM polling in plan-design-review After opening the comparison board, the agent now polls #status via $B eval instead of asking a rigid AskUserQuestion. Handles submit (read structured JSON feedback), regenerate (new variants with updated brief), and $B-unavailable fallback (free-form text response). The user interacts with the real board UI, not a constrained option picker. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * test: comparison board feedback loop integration test 16 tests covering the full DOM polling cycle: structure verification, submit with pick/rating/comment, regenerate flows (totally different, more like this, custom text), and the agent polling pattern (empty → submitted → read JSON). Uses real generateCompareHtml() from design/src/compare.ts, served via HTTP. Runs in <1s. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: add $D serve command for HTTP-based comparison board feedback The comparison board feedback loop was fundamentally broken: browse blocks file:// URLs (url-validation.ts:71), so $B goto file://board.html always fails. The fallback open + $B eval polls a different browser instance. $D serve fixes this by serving the board over HTTP on localhost. The server is stateful: stays alive across regeneration rounds, exposes /api/progress for the board to poll, and accepts /api/reload from the agent to swap in new board HTML. Stdout carries feedback JSON only; stderr carries telemetry. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: dual-mode feedback + post-submit lifecycle in comparison board When __GSTACK_SERVER_URL is set (injected by $D serve), the board POSTs feedback to the server instead of only writing to hidden DOM elements. After submit: disables all inputs, shows "Return to your coding agent." After regenerate: shows spinner, polls /api/progress, auto-refreshes on ready. On POST failure: shows copyable JSON fallback. On progress timeout (5 min): shows error with /design-shotgun prompt. DOM fallback preserved for headed browser mode and tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * test: HTTP serve command endpoints and regeneration lifecycle 11 tests covering: HTML serving with injected server URL, /api/progress state reporting, submit → done lifecycle, regenerate → regenerating state, remix with remixSpec, malformed JSON rejection, /api/reload HTML swapping, missing file validation, and full regenerate → reload → submit round-trip. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: add DESIGN_SHOTGUN_LOOP resolver + fix design artifact paths Adds generateDesignShotgunLoop() resolver for the shared comparison board feedback loop (serve via HTTP, handle regenerate/remix, AskUserQuestion fallback, feedback confirmation). Registered as {{DESIGN_SHOTGUN_LOOP}}. Fixes generateDesignMockup() to use ~/.gstack/projects/$SLUG/designs/ instead of /tmp/ and docs/designs/. Replaces broken $B goto file:// + $B eval polling with $D compare --serve (HTTP-based, stdout feedback). Adds CRITICAL PATH RULE guardrail to DESIGN_SETUP: design artifacts must go to ~/.gstack/projects/$SLUG/designs/, never .context/ or /tmp/. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: add /design-shotgun standalone design exploration skill New skill for visual brainstorming: generate AI design variants, open a comparison board in the user's browser, collect structured feedback, and iterate. Features: session detection (revisit prior explorations), 5-dimension context gathering (who, job to be done, what exists, user flow, edge cases), taste memory (prior approved designs bias new generations), inline variant preview, configurable variant count, screenshot-to-variants via $D evolve. Uses {{DESIGN_SHOTGUN_LOOP}} resolver for the feedback loop. Saves all artifacts to ~/.gstack/projects/$SLUG/designs/. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: regenerate SKILL.md files for design-shotgun + resolver changes Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: add remix UI to comparison board Per-variant element selectors (Layout, Colors, Typography, Spacing) with radio buttons in a grid. Remix button collects selections into a remixSpec object and sends via the same HTTP POST feedback mechanism. Enabled only when at least one element is selected. Board shows regenerating spinner while agent generates the hybrid variant. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: add $D gallery command for design history timeline Generates a self-contained HTML page showing all prior design explorations for a project: every variant (approved or not), feedback notes, organized by date (newest first). Images embedded as base64. Handles corrupted approved.json gracefully (skips, still shows the session). Empty state shows "No history yet" with /design-shotgun prompt. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * test: gallery generation — sessions, dates, corruption, empty state 7 tests: empty dir, nonexistent dir, single session with approved variant, multiple sessions sorted newest-first, corrupted approved.json handled gracefully, session without approved.json, self-contained HTML (no external dependencies). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * refactor: replace broken file:// polling with {{DESIGN_SHOTGUN_LOOP}} plan-design-review and design-consultation templates previously used $B goto file:// + $B eval polling for the comparison board feedback loop. This was broken (browse blocks file:// URLs). Both templates now use {{DESIGN_SHOTGUN_LOOP}} which serves via HTTP, handles regeneration in the same browser tab, and falls back to AskUserQuestion. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * test: add design-shotgun touchfile entries and tier classifications design-shotgun-path (gate): verify artifacts go to ~/.gstack/, not .context/ design-shotgun-session (gate): verify repeat-run detection + AskUserQuestion design-shotgun-full (periodic): full round-trip with real design binary Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: regenerate SKILL.md files for template refactor Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: comparison board UI improvements — option headers, pick confirmation, grid view Three changes to the design comparison board: 1. Pick confirmation: selecting "Pick" on Option A shows "We'll move forward with Option A" in green, plus a status line above the submit button repeating the choice. 2. Clear option headers: each variant now has "Option A" in bold with a subtitle above the image, instead of just the raw image. 3. View toggle: top-right Large/Grid buttons switch between single-column (default) and 3-across grid view. Also restructured the bottom section into a 2-column grid: submit/overall feedback on the left, regenerate controls on the right. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: use 127.0.0.1 instead of localhost for serve URL Avoids DNS resolution issues on some systems where localhost may resolve to IPv6 ::1 while Bun listens on IPv4 only. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: write ALL feedback to disk so agent can poll in background mode The agent backgrounds $D serve (Claude Code can't block on a subprocess and do other work simultaneously). With stdout-only feedback delivery, the agent never sees regenerate/remix feedback. Fix: write feedback-pending.json (regenerate/remix) and feedback.json (submit) to disk next to the board HTML. Agent polls the filesystem instead of reading stdout. Both channels (stdout + disk) are always active so foreground mode still works. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: DESIGN_SHOTGUN_LOOP uses file polling instead of stdout reading Update the template resolver to instruct the agent to background $D serve and poll for feedback-pending.json / feedback.json on a 5-second loop. This matches the real-world pattern where Claude Code / Conductor agents can't block on subprocess stdout. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: regenerate SKILL.md files for file-polling feedback loop Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: null-safe DOM selectors for post-submit and regenerating states The user's layout restructure renamed .regenerate-bar → .regen-column, .submit-bar → .submit-column, and .overall-section → .bottom-section. The JS still referenced the old class names, causing querySelector to return null and showPostSubmitState() / showRegeneratingState() to silently crash. This meant Submit and Regenerate buttons appeared to work (DOM elements updated, HTTP POST succeeded) but the visual feedback (disabled inputs, spinner, success message) never appeared. Fix: use fallback selectors that check both old and new class names, with null guards so a missing element doesn't crash the function. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * test: end-to-end feedback roundtrip — browser click to file on disk The test that proves "changes on the website propagate to Claude Code." Opens the comparison board in a real headless browser with __GSTACK_SERVER_URL injected, simulates user clicks (Submit, Regenerate, More Like This), and verifies that feedback.json / feedback-pending.json land on disk with the correct structured data. 6 tests covering: submit → feedback.json, post-submit UI lockdown, regenerate → feedback-pending.json, more-like-this → feedback-pending.json, regenerate spinner display, and full regen → reload → submit round-trip. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: comprehensive design doc for Design Shotgun feedback loop Documents the full browser-to-agent feedback architecture: state machine, file-based polling, port discovery, post-submit lifecycle, and every known edge case (zombie forms, dead servers, stale spinners, file:// bug, double-click races, port coordination, sequential generate rule). Includes ASCII diagrams of the data flow and state transitions, complete step-by-step walkthrough of happy path and regeneration path, test coverage map with gaps, and short/medium/long-term improvement ideas. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: plan-design-review agent guardrails for feedback loop Four fixes to prevent agents from reinventing the feedback loop badly: 1. Sequential generate rule: explicit instruction that $D generate calls must run one at a time (API rate-limits concurrent image generation). 2. No-AskUserQuestion-for-feedback rule: agent reads feedback.json instead of re-asking what the user picked. 3. Remove file:// references: $B goto file:// was always rejected by url-validation.ts. The --serve flag handles everything. 4. Remove $B eval polling reference: no longer needed with HTTP POST. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: design-shotgun Step 3 progressive reveal, silent failure detection, timing estimate Three production UX bugs fixed: 1. Dead air — now shows timing estimate before generation starts 2. Silent variant drop — replaced $D variants batch with individual $D generate calls, each verified for existence and non-zero size with retry 3. No progressive reveal — each variant shown inline via Read tool immediately after generation (~60s increments instead of all at ~180s) Also: /tmp/ then cp as default output pattern (sandbox workaround), screenshot taken once for evolve path (not per-variant). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: parallel design-shotgun with concept-first confirmation Step 3 rewritten to concept-first + parallel Agent architecture: - 3a: generate text concepts (free, instant) - 3b: AskUserQuestion to confirm/modify before spending API credits - 3c: launch N Agent subagents in parallel (~60s total regardless of count) - 3d: show all results, dynamic image list for comparison board Adds Agent to allowed-tools. Softens plan-design-review sequential warning to note design-shotgun uses parallel at Tier 2+. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: update project documentation for v0.13.0.0 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: untrack .agents/skills/ — generated at setup, already gitignored These files were committed despite .agents/ being in .gitignore. They regenerate from ./setup --host codex on any machine. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: regenerate design-shotgun SKILL.md for v0.12.12.0 preamble changes Merge from main brought updated preamble resolver (conditional telemetry, local JSONL logging) but design-shotgun/SKILL.md wasn't regenerated. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 20:32:59 -06:00