gstack

mirror of https://github.com/garrytan/gstack.git synced 2026-05-01 19:25:10 +02:00

Author	SHA1	Message	Date
Garry Tan	54d4cde773	security: tunnel dual-listener + SSRF + envelope + path wave (v1.6.0.0) (#1137 ) * refactor(security): loosen /connect rate limit from 3/min to 300/min Setup keys are 24 random bytes (unbruteforceable), so a tight rate limit does not meaningfully prevent key guessing. It exists only to cap bandwidth, CPU, and log-flood damage from someone who discovered the ngrok URL. A legitimate pair-agent session hits /connect once; 300/min is 60x that pattern and never hit accidentally. 3/min caused pairing to fail on any retry flow (network blip, second paired client) with no upside. Per-IP tracking was considered and rejected — adds a bounded Map + LRU for defense already adequate at the global layer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(security): add tunnel-denial-log module for attack visibility Append-only log of tunnel-surface auth denials to ~/.gstack/security/attempts.jsonl. Gives operators visibility into who is probing tunneled daemons so the next security wave can be driven by real attack data instead of speculation. Design notes: - Async via fs.promises.appendFile. Never appendFileSync — blocking the event loop on every denial during a flood is what an attacker wants (prior learning: sync-audit-log-io, 10/10 confidence). - In-process rate cap at 60 writes/minute globally. Excess denials are counted in memory but not written to disk — prevents disk DoS. - Writes to the same ~/.gstack/security/attempts.jsonl used by the prompt-injection attempt log. File rotation is handled by the existing security pipeline (10MB, 5 generations). No consumers in this commit; wired up in the dual-listener refactor that follows. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(security): dual-listener tunnel architecture The /health endpoint leaked AUTH_TOKEN to any caller that hit the ngrok URL (spoofing chrome-extension:// origin, or catching headed mode). Surfaced by @garagon in PR #1026; the original fix was header-inference on the single port. Codex's outside-voice review during /plan-ceo-review called that approach brittle (ngrok header behavior could change, local proxies would false-positive), and pushed for the structural fix. This is that fix. Stop making /health a root-token bootstrap endpoint on any surface the tunnel can reach. The server now binds two HTTP listeners when a tunnel is active. The local listener (extension, CLI, sidebar) stays on 127.0.0.1 and is never exposed to ngrok. ngrok forwards only to the tunnel listener, which serves only /connect (unauth, rate-limited) and /command with a locked allowlist of browser-driving commands. Security property comes from physical port separation, not from header inference — a tunnel caller cannot reach /health or /cookie-picker or /inspector because they live on a different TCP socket. What this commit adds to browse/src/server.ts: * Surface type ('local' \| 'tunnel') and TUNNEL_PATHS + TUNNEL_COMMANDS allowlists near the top of the file. * makeFetchHandler(surface) factory replacing the single fetch arrow; closure-captures the surface so the filter that runs before route dispatch knows which socket accepted the request. * Tunnel filter at dispatch entry: 404s anything not on TUNNEL_PATHS, 403s root-token bearers with a clear pairing hint, 401s non-/connect requests that lack a scoped token. Every denial is logged via logTunnelDenial (from tunnel-denial-log). * GET /connect alive probe (unauth on both surfaces) so /pair and /tunnel/start can detect dead ngrok tunnels without reaching /health — /health is no longer tunnel-reachable. * Lazy tunnel listener lifecycle. /tunnel/start binds a dedicated Bun.serve on an ephemeral port, points ngrok.forward at THAT port (not the local port), hard-fails on bind error (no local fallback), tears down cleanly on ngrok failure. BROWSE_TUNNEL=1 startup uses the same pattern. * closeTunnel() helper — single teardown path for both the ngrok listener and the tunnel Bun.serve listener. * resolveNgrokAuthtoken() helper — shared authtoken lookup across /tunnel/start and BROWSE_TUNNEL=1 startup (was duplicated). * TUNNEL_COMMANDS check in /command dispatch: on the tunnel surface, commands outside the allowlist return 403 with a list of allowed commands as a hint. * Probe paths in /pair and /tunnel/start migrated from /health to GET /connect — the only unauth path reachable on the tunnel surface under the new architecture. Test updates in browse/test/server-auth.test.ts: * /pair liveness-verify test: assert via closeTunnel() helper instead of the inline `tunnelActive = false; tunnelUrl = null` lines that the helper subsumes. * /tunnel/start cached-tunnel test: same closeTunnel() adaptation. Credit Derived from PR #1026 by @garagon — thanks for flagging the critical bug that drove the architectural rewrite. The per-request isTunneledRequest approach from #1026 is superseded by physical port separation here; the underlying report remains the root cause for the entire v1.6.0.0 wave. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(security): add source-level guards for dual-listener architecture 23 source-level assertions that keep future contributors from silently widening the tunnel surface during a routine refactor. Covers: * Surface type + tunnelServer state variable shape * TUNNEL_PATHS is a closed set of /connect, /command, /sidebar-chat (and NOT /health, /welcome, /cookie-picker, /inspector/, /pair, /token, /refs, /activity/stream, /tunnel/{start,stop}) TUNNEL_COMMANDS includes browser-driving ops only (and NOT launch-browser, tunnel-start, token-mint, cookie-import, etc.) * makeFetchHandler(surface) factory exists and is wired to both listeners with the correct surface parameter * Tunnel filter runs BEFORE any route dispatch, with 404/403/401 responses and logged denials for each reason * GET /connect returns {alive: true} unauth * /command dispatch enforces TUNNEL_COMMANDS on tunnel surface * closeTunnel() helper tears down ngrok + Bun.serve listener * /tunnel/start binds on ephemeral port, points ngrok at TUNNEL_PORT (not local port), hard-fails on bind error (no fallback), probes cached tunnel via GET /connect (not /health), tears down on ngrok.forward failure * BROWSE_TUNNEL=1 startup uses the dual-listener pattern * logTunnelDenial wired for all three denial reasons * /connect rate limit is 300/min, not 3/min All 23 tests pass. Behavioral integration tests (spawn subprocess, real network) live in the E2E suite that lands later in this wave. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * security: gate download + scrape through validateNavigationUrl (SSRF) The `goto` command was correctly wired through validateNavigationUrl, but `download` and `scrape` called page.request.fetch(url, ...) directly. A caller with the default write scope could hit the /command endpoint and ask the daemon to fetch http://169.254.169.254/latest/meta-data/ (AWS IMDSv1) or the GCP/Azure/internal equivalents. The response body comes back as base64 or lands on disk where GET /file serves it. Fix: call validateNavigationUrl(url) immediately before each page.request.fetch() call site in download and in the scrape loop. Same blocklist that already protects `goto`: file://, javascript:, data:, chrome://, cloud metadata (IPv4 all encodings, IPv6 ULA, metadata..internal). Tests: extend browse/test/url-validation.test.ts with a source-level guard that walks every `await page.request.fetch(` call site and asserts a validateNavigationUrl call precedes it within the same branch. Regression trips before code review if a future refactor drops the gate. security: route splitForScoped through envelope sentinel escape The scoped-token snapshot path in snapshot.ts built its untrusted block by pushing the raw accessibility-tree lines between the literal `═══ BEGIN UNTRUSTED WEB CONTENT ═══` / `═══ END UNTRUSTED WEB CONTENT ═══` sentinels. The full-page wrap path in content-security.ts already applied a zero-width-space escape on those exact strings to prevent sentinel injection, but the scoped path skipped it. Net effect: a page whose rendered text contains the literal sentinel can close the envelope early from inside untrusted content and forge a fake "trusted" block for the LLM. That includes fabricating interactive `@eN` references the agent will act on. Fix: * Extract the zero-width-space escape into a named, exported helper `escapeEnvelopeSentinels(content)` in content-security.ts. * Have `wrapUntrustedPageContent` call it (behavior unchanged on that path — same bytes out). * Import the helper in snapshot.ts and map it over `untrustedLines` in the `splitForScoped` branch before pushing the BEGIN sentinel. Tests: add a describe block in content-security.test.ts that covers * `escapeEnvelopeSentinels` defuses BEGIN and END markers; * `escapeEnvelopeSentinels` leaves normal text untouched; * `wrapUntrustedPageContent` still emits exactly one real envelope pair when hostile content contains forged sentinels; * snapshot.ts imports the helper; * the scoped-snapshot branch calls `escapeEnvelopeSentinels` before pushing the BEGIN sentinel (source-level regression — if a future refactor reorders this, the test trips). * security: extend hidden-element detection to all DOM-reading channels The Confusion Protocol envelope wrap (`wrapUntrustedPageContent`) covers every scoped PAGE_CONTENT_COMMAND, but the hidden-element ARIA-injection detection layer only ran for `text`. Other DOM-reading channels (html, links, forms, accessibility, attrs, data, media, ux-audit) returned their output through the envelope with no hidden- content filter, so a page serving a display:none div that instructs the agent to disregard prior system messages, or an aria-label that claims to put the LLM in admin mode, leaked the injection payload on any non-text channel. The envelope alone does not mitigate this, and the page itself never rendered the hostile content to the human operator. Fix: * New export `DOM_CONTENT_COMMANDS` in commands.ts — the subset of PAGE_CONTENT_COMMANDS that derives its output from the live DOM. Console and dialog stay out; they read separate runtime state. * server.ts runs `markHiddenElements` + `cleanupHiddenMarkers` for every scoped command in this set. `text` keeps its existing `getCleanTextWithStripping` path (hidden elements physically stripped before the read). All other channels keep their output format but emit flagged elements as CONTENT WARNINGS on the envelope, so the LLM sees what it would otherwise have consumed silently. * Hidden-element descriptions merge into `combinedWarnings` alongside content-filter warnings before the wrap call. Tests: new describe block in content-security.test.ts covering * `DOM_CONTENT_COMMANDS` export shape and channel membership; * dispatch gates on `DOM_CONTENT_COMMANDS.has(command)`, not the literal `text` string; * hiddenContentWarnings plumbs into `combinedWarnings` and reaches wrapUntrustedPageContent; * DOM_CONTENT_COMMANDS is a strict subset of PAGE_CONTENT_COMMANDS. Existing datamarking, envelope wrap, centralized-wrapping, and chain security suites stay green (52 pass, 0 fail). * security: validate --from-file payload paths for parity with direct paths The direct `load-html <file>` path runs every caller-supplied file path through validateReadPath() so reads stay confined to SAFE_DIRECTORIES (cwd, TEMP_DIR). The `load-html --from-file <payload.json>` shortcut and its sibling `pdf --from-file <payload.json>` skipped that check and went straight to fs.readFileSync(). An MCP caller that picks the payload path (or any caller whose payload argument is reachable from attacker-influenced text) could use --from-file as a read-anywhere escape hatch for the safe-dirs policy. Fix: call validateReadPath(path.resolve(payloadPath)) before readFileSync at both sites. Error surface mirrors the direct-path branch so ops and agent errors stay consistent. Test coverage in browse/test/from-file-path-validation.test.ts: - source-level: validateReadPath precedes readFileSync in the load-html --from-file branch (write-commands.ts) and the pdf --from-file parser (meta-commands.ts) - error-message parity: both sites reference SAFE_DIRECTORIES Related security audit pattern: R3 F002 (validateNavigationUrl gap on download/scrape) and R3 F008 (markHiddenElements gap on 10 DOM commands) were the same shape — a defense that existed on the primary code path but not its shortcut sibling. This PR closes the same class of gap on the --from-file shortcuts. * fix(design): escape url.origin when injecting into served HTML serve.ts injected url.origin into a single-quoted JS string in the response body. A local request with a crafted Host header (e.g. Host: "evil'-alert(1)-'x") would break out of the string and execute JS in the 127.0.0.1:<port> origin opened by the design board. Low severity — bound to localhost, requires a local attacker — but no reason not to escape. Fix: JSON.stringify(url.origin) produces a properly quoted, escaped JS string literal in one call. Also includes Prettier reformatting (single→double quotes, trailing commas, line wrapping) applied by the repo's PostToolUse formatter hook. Security change is the one line in the HTML injection; everything else is whitespace/style. * fix(scripts): drop shell:true from slop-diff npx invocations spawnSync('npx', [...], { shell: true }) invokes /bin/sh -c with the args concatenated, subjecting them to shell parsing (word splitting, glob expansion, metacharacter interpretation). No user input reaches these calls today, so not exploitable — but the posture is wrong: npx + shell args should be direct. Fix: scope shell:true to process.platform === 'win32' where npx is actually a .cmd requiring the shell. POSIX runs the npx binary directly with array-form args. Also includes Prettier reformatting (single→double quotes, trailing commas, line wrapping) applied by the repo's PostToolUse formatter hook. Security-relevant change is just the two shell:true -> shell: process.platform === 'win32' lines; everything else is whitespace/style. * security(E3): gate GSTACK_SLUG on /welcome path traversal The /welcome handler interpolates GSTACK_SLUG directly into the filesystem path used to locate the project-local welcome page. Without validation, a slug like "../../etc/passwd" would resolve to ~/.gstack/projects/../../etc/passwd/designs/welcome-page-20260331/finalized.html — classic path traversal. Not exploitable today: GSTACK_SLUG is set by the gstack CLI at daemon launch, and an attacker would already need local env-var access to poison it. But the gate is one regex (^[a-z0-9_-]+$), and a defense-in-depth pass costs us nothing when the cost of being wrong is arbitrary file read via /welcome. Fall back to the safe 'unknown' literal when the slug fails validation — same fallback the code already uses when GSTACK_SLUG is unset. No behavior change for legitimate slugs (they all match the regex). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * security(N1): replace ?token= SSE auth with HttpOnly session cookie Activity stream and inspector events SSE endpoints accepted the root AUTH_TOKEN via `?token=` query param (EventSource can't send Authorization headers). URLs leak to browser history, referer headers, server logs, crash reports, and refactoring accidents. Codex flagged this during the /plan-ceo-review outside voice pass. New auth model: the extension calls POST /sse-session with a Bearer token and receives a view-only session cookie (HttpOnly, SameSite=Strict, 30-min TTL). EventSource is opened with `withCredentials: true` so the browser sends the cookie back on the SSE connection. The ?token= query param is GONE — no more URL-borne secrets. Scope isolation (prior learning cookie-picker-auth-isolation, 10/10 confidence): the SSE session cookie grants access to /activity/stream and /inspector/events ONLY. The token is never valid against /command, /token, or any mutating endpoint. A leaked cookie can watch activity; it cannot execute browser commands. Components * browse/src/sse-session-cookie.ts — registry: mint/validate/extract/ build-cookie. 256-bit tokens, 30-min TTL, lazy expiry pruning, no imports from token-registry (scope isolation enforced by module boundary). * browse/src/server.ts — POST /sse-session mint endpoint (requires Bearer). /activity/stream and /inspector/events now accept Bearer OR the session cookie, and reject ?token= query param. * extension/sidepanel.js — ensureSseSessionCookie() bootstrap call, EventSource opened with withCredentials:true on both SSE endpoints. Tested via the source guards; behavioral test is the E2E pairing flow that lands later in the wave. * browse/test/sse-session-cookie.test.ts — 20 unit tests covering mint entropy, TTL enforcement, cookie flag invariants, cookie parsing from multi-cookie headers, and scope-isolation contract guard (module must not import token-registry). * browse/test/server-auth.test.ts — existing /activity/stream auth test updated to assert the new cookie-based gate and the absence of the ?token= query param. Cookie flag choices: * HttpOnly: token not readable from page JS (mitigates XSS exfiltration). * SameSite=Strict: cookie not sent on cross-site requests (mitigates CSRF). Fine for SSE because the extension connects to 127.0.0.1 directly. * Path=/: cookie scoped to the whole origin. * Max-Age=1800: 30 minutes, matches TTL. Extension re-mints on reconnect when daemon restarts. * Secure NOT set: daemon binds to 127.0.0.1 over plain HTTP. Adding Secure would block the browser from ever sending the cookie back. Add Secure when gstack ships over HTTPS. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * security(N2): document Windows v20 ABE elevation path on CDP port The existing comment around the cookie-import-browser --remote-debugging-port launch claimed "threat model: no worse than baseline." That's wrong on Windows with App-Bound Encryption v20. A same-user local process that opens the cookie SQLite DB directly CANNOT decrypt v20 values (DPAPI context is bound to the browser process). The CDP port lets them bypass that: connect to the debug port, call Network.getAllCookies inside Chrome, walk away with decrypted v20 cookies. The correct fix is to switch from TCP --remote-debugging-port to --remote-debugging-pipe so the CDP transport is a stdio pipe, not a socket. That requires restructuring the CDP WebSocket client in this module and Playwright doesn't expose the pipe transport out of the box. Non-trivial, deferred from the v1.6.0.0 wave. This commit updates the comment to correctly describe the threat and points at the tracking issue. No code change to the launch itself. Follow-up: #1136. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(E2): document dual-listener tunnel architecture in ARCHITECTURE.md Adds an explicit per-endpoint disposition table to the Security model section, covering the v1.6.0.0 dual-listener refactor. Every HTTP endpoint now has a documented local-vs-tunnel answer. Future audits (and future contributors wondering "is it safe to add X to the tunnel surface?") can read this instead of reverse-engineering server.ts. Also documents: * Why physical port separation beats per-request header inference (ngrok behavior drift, local proxies can forge headers, etc.) * Tunnel surface denial logging → ~/.gstack/security/attempts.jsonl * SSE session cookie model (gstack_sse, 30-min TTL, stream-scope only, module-boundary-enforced scope isolation) * N2 non-goal for Windows v20 ABE via CDP port (tracking #1136) No code changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(E1): end-to-end pair-agent flow against a spawned daemon Spawns the browse daemon as a subprocess with BROWSE_HEADLESS_SKIP=1 so the HTTP layer runs without a real browser. Exercises: * GET /health — token delivery for chrome-extension origin, withheld otherwise (the F1 + PR #1026 invariant) * GET /connect — alive probe returns {alive:true} unauth * POST /pair — root Bearer required (403 without), returns setup_key * POST /connect — setup_key exchange mints a distinct scoped token * POST /command — 401 without auth * POST /sse-session — Bearer required, Set-Cookie has HttpOnly + SameSite=Strict (the N1 invariant) * GET /activity/stream — 401 without auth * GET /activity/stream?token= — 401 (the old ?token= query param is REJECTED, which is the whole point of N1) * GET /welcome — serves HTML, does not leak /etc/passwd content under the default 'unknown' slug (E3 regex gate) 12 behavioral tests, ~220ms end-to-end, no network dependencies, no ngrok, no real browser. This is the receipt for the wave's central 'pair-agent still works + the security boundary holds' claim. Tunnel-port binding (/tunnel/start) is deliberately NOT exercised here — it requires an ngrok authtoken and live network. The dual-listener route allowlist is covered by source-level guards in dual-listener.test.ts; behavioral tunnel testing belongs in a separate paid-evals harness. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * release(v1.6.0.0): bump VERSION + CHANGELOG for security wave Architectural bump, not patch: dual-listener HTTP refactor changes the daemon's tunnel-exposure model. See CHANGELOG for the full release summary (~950 words) covering the five root causes this wave closes: 1. /health token leak over ngrok (F1 + E3 + test infra) 2. /cookie-picker + /inspector exposed over the tunnel (F1) 3. ?token=<ROOT> in SSE URLs leaking to logs/referer/history (N1) 4. /welcome GSTACK_SLUG path traversal (E3) 5. Windows v20 ABE elevation via CDP port (N2 — documented non-goal, tracked as #1136) Plus the base PRs: SSRF gate (#1029), envelope sentinel escape (#1031), DOM-channel hidden-element coverage (#1032), --from-file path validation (#1103), and 2 commits from #1073 (@theqazi). VERSION + package.json bumped to 1.6.0.0. CHANGELOG entry covers credits (@garagon, @Hybirdss, @HMAKT99, @theqazi), review lineage (CEO → Codex outside voice → Eng), and the non-goal tracking issue. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix: pre-landing review findings (4 auto-fixes) Addresses 4 findings from the Claude adversarial subagent on the v1.6.0.0 security wave diff. No user-visible behavior change; all are defense-in-depth hardening of newly-introduced code. 1. GET /connect rate-limited (was POST-only) [HIGH conf 8/10] Attacker discovering the ngrok URL could probe unlimited GETs for daemon enumeration. Now shares the global /connect counter. 2. ngrok listener leak on tunnel startup failure [MEDIUM conf 8/10] If ngrok.forward() resolved but tunnelListener.url() or the state-file write threw, the Bun listener was torn down but the ngrok session was leaked. Fixed in BOTH /tunnel/start and BROWSE_TUNNEL=1 startup paths. 3. GSTACK_SKILL_ROOT path-traversal gate [MEDIUM conf 8/10] Symmetric with E3's GSTACK_SLUG regex gate — reject values containing '..' before interpolating into the welcome-page path. 4. SSE session registry pruning [LOW conf 7/10] pruneExpired() only checked 10 entries per mint call. Now runs on every validate too, checks 20 entries, with a hard 10k cap as backstop. Prevents registry growth under sustained extension reconnect pressure. Tests remain green (56/56 in sse-session-cookie + dual-listener + pair-agent-e2e suites). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: update project documentation for v1.6.0.0 Reflect the dual-listener tunnel architecture, SSE session cookies, SSRF guards, and Windows v20 ABE non-goal across the three docs users actually read for remote-agent and browser auth context: - docs/REMOTE_BROWSER_ACCESS.md: rewrote Architecture diagram for dual listeners, fixed /connect rate limit (3/min → 300/min), removed stale "/health requires no auth" (now 404 on tunnel), added SSE cookie auth, expanded Security Model with tunnel allowlist, SSRF guards, /welcome path traversal defense, and the Windows v20 ABE tracking note. - BROWSER.md: added dual-listener paragraph to Authentication and linked to ARCHITECTURE.md endpoint table. Replaced the stale ?token= SSE auth note with the HttpOnly gstack_sse cookie flow. - CLAUDE.md: added Transport-layer security section above the sidebar prompt-injection stack so contributors editing server.ts, sse-session-cookie.ts, or tunnel-denial-log.ts see the load-bearing module boundaries before touching them. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(make-pdf): write --from-file payload to /tmp, not os.tmpdir() make-pdf's browseClient wrote its --from-file payload to os.tmpdir(), which is /var/folders/... on macOS. v1.6.0.0's PR #1103 cherry-pick tightened browse load-html --from-file to validate against the safe-dirs allowlist ([TEMP_DIR, cwd] where TEMP_DIR is '/tmp' on macOS/Linux, os.tmpdir() on Windows). This closed a CLI/API parity gap but broke make-pdf on macOS because /var/folders/... is outside the allowlist. Fix: mirror browse's TEMP_DIR convention — use '/tmp' on non-Windows, os.tmpdir() on Windows. The make-pdf-gate CI failure on macOS-latest (run 72440797490) is caused by exactly this: the payload file was rejected by validateReadPath. Verified locally: the combined-gate e2e test now passes after rebuilding make-pdf/dist/pdf. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(sidebar): killAgent resets per-tab state; align tests with current agent event format Two pre-existing bugs surfaced while running the full e2e suite on the sec-wave branch. Both pre-date v1.6.0.0 (same failures on main at `e23ff280`) but blocked the ship verification, so fixing now. ### Bug 1: killAgent leaked stale per-tab state `killAgent()` reset the legacy globals (agentProcess, agentStatus, etc.) but never touched the per-tab `tabAgents` Map. Meanwhile `/sidebar-command` routes on `tabState.status` from that Map, not the legacy globals. Consequence: after a kill (including the implicit kill in `/sidebar-session/new`), the next /sidebar-command on the same tab saw `tabState.status === 'processing'` and fell into the queue branch, silently NOT spawning an agent. Integration tests that called resetState between cases all failed with empty queues. Fix: when targetTabId is supplied, reset that one tab's state; when called without a tab (session-new, full kill), reset ALL tab states. Matches the semantic boundary already used for the cancel-file write. ### Bug 2: sidebar-integration tests drifted from current event format `agent events appear in /sidebar-chat` posted the raw Claude streaming format (`{type: 'assistant', message: {content: [...]}}`) but `processAgentEvent` in server.ts only handles the simplified types that sidebar-agent.ts pre-processes into (text, text_delta, tool_use, result, agent_error, security_event). The architecture moved pre-processing into sidebar-agent.ts at some point and this test never got updated. Fixed by sending the pre-processed `{type: 'text', text: '...'}` format — which is actually what the server sees in production. Also removed the `entry.prompt` URL-containment check in the queue-write test. The URL is carried on entry.pageUrl (metadata) by design: the system prompt tells Claude to run `browse url` to fetch the actual page rather than trust any URL in the prompt body. That's the URL-based prompt-injection defense. The prompt SHOULD NOT contain the URL, so the test assertion was wrong for the current security posture. ### Verification - `bun test browse/test/sidebar-integration.test.ts` → 13/13 pass (was 6/13 on both main and branch before this commit) - Full `bun run test` → exit 0, zero fail markers - No behavior change for production sidebar flows: killAgent was already supposed to return the agent to idle; it just wasn't fully doing so. Per-tab reset now matches the documented semantics. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: gus <gustavoraularagon@gmail.com> Co-authored-by: Mohammed Qazi <10266060+theqazi@users.noreply.github.com>	2026-04-21 21:58:27 -07:00
Garry Tan	97584f9a59	feat(security): ML prompt injection defense for sidebar (v1.4.0.0) (#1089 ) * chore(deps): add @huggingface/transformers for prompt injection classifier Dependency needed for the ML prompt injection defense layer coming in the follow-up commits. @huggingface/transformers will host the TestSavantAI BERT-small classifier that scans tool outputs for indirect prompt injection. Note: this dep only runs in non-compiled bun contexts (sidebar-agent.ts). The compiled browse binary cannot load it because transformers.js v4 requires onnxruntime-node (native module, fails to dlopen from bun compile's temp extract dir). See docs/designs/ML_PROMPT_INJECTION_KILLER.md for the full architectural decision. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(security): add security.ts foundation for prompt injection defense Establishes the module structure for the L5 canary and L6 verdict aggregation layers. Pure-string operations only — safe to import from the compiled browse binary. Includes: * THRESHOLDS constants (BLOCK 0.85 / WARN 0.60 / LOG_ONLY 0.40), calibrated against BrowseSafe-Bench smoke + developer content benign corpus. * combineVerdict() implementing the ensemble rule: BLOCK only when the ML content classifier AND the transcript classifier both score >= WARN. Single-layer high confidence degrades to WARN to prevent any one classifier's false-positives from killing sessions (Stack Overflow instruction-writing-style FPs at 0.99 on TestSavantAI alone). * generateCanary / injectCanary / checkCanaryInStructure — session-scoped secret token, recursively scans tool arguments, URLs, file writes, and nested objects per the plan's all-channel coverage decision. * logAttempt with 10MB rotation (keeps 5 generations). Salted SHA-256 hash, per-device salt at ~/.gstack/security/device-salt (0600). * Cross-process session state at ~/.gstack/security/session-state.json (atomic temp+rename). Required because server.ts (compiled) and sidebar-agent.ts (non-compiled) are separate processes. * getStatus() for shield icon rendering via /health. ML classifier code will live in a separate module (security-classifier.ts) loaded only by sidebar-agent.ts — compiled browse binary cannot load the native ONNX runtime. Plan: ~/.gstack/projects/garrytan-gstack/ceo-plans/2026-04-19-prompt-injection-guard.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(security): wire canary injection into sidebar spawnClaude Every sidebar message now gets a fresh CANARY-XXXXXXXXXXXX token embedded in the system prompt with an instruction for Claude to never output it on any channel. The token flows through the queue entry so sidebar-agent.ts can check every outbound operation for leaks. If Claude echoes the canary into any outbound channel (text stream, tool arguments, URLs, file write paths), the sidebar-agent terminates the session and the user sees the approved canary leak banner. This operation is pure string manipulation — safe in the compiled browse binary. The actual output-stream check (which also has to be safe in compiled contexts) lives in sidebar-agent.ts (next commit). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(security): make sidebar-agent destructure check regex-tolerant The test asserted the exact string `const { prompt, args, stateFile, cwd, tabId } = queueEntry` which breaks whenever security or other extensions add fields (canary, pageUrl, etc.). Switch to a regex that requires the core fields in order but tolerates additional fields in between. Preserves the test's intent (args come from the queue entry, not rebuilt) while allowing the destructure to grow. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(security): canary leak check across all outbound channels The sidebar-agent now scans every Claude stream event for the session's canary token before relaying any data to the sidepanel. Channels covered (per CEO review cross-model tension #2): * Assistant text blocks * Assistant text_delta streaming * tool_use arguments (recursively, via checkCanaryInStructure — catches URLs, commands, file paths nested at any depth) * tool_use content_block_start * tool_input_delta partial JSON * Final result payload If the canary leaks on any channel, onCanaryLeaked() fires once per session: 1. logAttempt() writes the event to ~/.gstack/security/attempts.jsonl with the canary's salted hash (never the payload content). 2. sends a `security_event` to the sidepanel so it can render the approved canary-leak banner (variant A mockup — ceo-plan 2026-04-19). 3. sends an `agent_error` for backward-compat with existing error surfaces. 4. SIGTERM's the claude subprocess (SIGKILL after 2s if still alive). The leaked content itself is never relayed to the sidepanel — the event is dropped at the boundary. Canary detection is pure-string substring match, so this all runs safely in the sidebar-agent (non-compiled bun) context. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(security): add security-classifier.ts with TestSavantAI + Haiku This module holds the ML classifier code that the compiled browse binary cannot link (onnxruntime-node native dylib doesn't load from Bun compile's temp extract dir — see CEO plan §"Pre-Impl Gate 1 Outcome"). It's imported ONLY by sidebar-agent.ts, which runs as a non-compiled bun script. Two layers: L4 testsavant_content — TestSavantAI BERT-small ONNX classifier. First call triggers a one-time 112MB model download to ~/.gstack/models/testsavant-small/ (files staged into the onnx/ layout transformers.js v4 expects). Classifies page snapshots and tool outputs for indirect prompt injection + jailbreak attempts. On benign-corpus dry-run: Wikipedia/HN/Reddit/tech-blog all score SAFE 0.98+, attack text scores INJECTION 0.99+, Stack Overflow instruction-writing now scores SAFE 0.98 on the shorter form (was 0.99 INJECTION on the longer form — instruction-density threshold). Ensemble combiner downgrades single-layer high to WARN to cover this case. L4b transcript_classifier — Claude Haiku reasoning-blind pre-tool-call scan. Sees only {user_message, last 3 tool_calls}, never Claude's chain-of-thought or tool results (those are how self-persuasion attacks leak). 2000ms hard timeout. Fail-open on any subprocess failure so sidebar stays functional. Gated by shouldRunTranscriptCheck() — only runs when another layer already fired at >= LOG_ONLY, saving ~70% of Haiku spend. Both layers degrade gracefully: load/spawn failures set status to 'degraded' and return confidence=0. Shield icon reflects this via getClassifierStatus() which security.ts's getStatus() composes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(security): wire TestSavantAI + ensemble into sidebar-agent pre-spawn scan The sidebar-agent now runs a ML security check on the user message BEFORE spawning claude. If the content classifier and (gated) transcript classifier ensemble returns BLOCK, the session is refused with a security_event + agent_error — the sidepanel renders the approved banner. Two pieces: 1. On agent startup, loadTestsavant() warms the classifier in the background. First run triggers a 112MB model download from HuggingFace (~30s on average broadband). Non-blocking — sidebar stays functional during cold-start, shield just reports 'off' until warmed. 2. preSpawnSecurityCheck() runs the ensemble against the user message: - L4 (testsavant_content) always runs - L4b (transcript_classifier via Haiku) runs only if L4 flagged at >= LOG_ONLY — plan §E1 gating optimization, saves ~70% of Haiku spend combineVerdict() applies the BLOCK-requires-both-layers rule, which downgrades any single-layer high confidence to WARN. Stack Overflow-style instruction-heavy writing false-positives on TestSavantAI alone are caught by this degrade — Haiku corrects them when called. Fail-open everywhere: any subprocess/load/inference error returns confidence=0 so the sidebar keeps working on architectural controls alone. Shield icon reflects degraded state via getClassifierStatus(). BLOCK path emits both: - security_event {verdict, reason, layer, confidence, domain} (for the approved canary-leak banner UX mockup — variant A) - agent_error "Session blocked — prompt injection detected..." (backward-compat with existing error surface) Regression test suite still passes (12/12 sidebar-security tests). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(security): add security.ts unit tests (25 tests, 62 assertions) Covers the pure-string operations that must behave deterministically in both compiled and source-mode bun contexts: * THRESHOLDS ordering invariant (BLOCK > WARN > LOG_ONLY > 0) * combineVerdict ensemble rule — THE critical path: - Empty signals → safe - Canary leak always blocks (regardless of ML signals) - Both ML layers >= WARN → BLOCK (ensemble_agreement) - Single layer >= BLOCK → WARN (single_layer_high) — the Stack Overflow FP mitigation that prevents one classifier killing sessions alone - Max-across-duplicates when multiple signals reference the same layer * Canary generation + injection + recursive checking: - Unique CANARY-XXXXXXXXXXXX tokens (>= 48 bits entropy) - Recursive structure scan for tool_use inputs, nested URLs, commands - Null / primitive handling doesn't throw * Payload hashing (salted sha256) — deterministic per-device, differs across payloads, 64-char hex shape * logAttempt writes to ~/.gstack/security/attempts.jsonl * writeSessionState + readSessionState round-trip (cross-process) * getStatus returns valid SecurityStatus shape * extractDomain returns hostname only, empty string on bad input All 25 tests pass in 18ms — no ML, no network, no subprocess spawning. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(security): expose security status on /health for shield icon The /health endpoint now returns a `security` field with the classifier status, suitable for driving the sidepanel shield icon: { status: 'protected' \| 'degraded' \| 'inactive', layers: { testsavant, transcript, canary }, lastUpdated: ISO8601 } Backend plumbing: * server.ts imports getStatus from security.ts (pure-string, safe in compiled binary) and includes it in the /health response. * sidebar-agent.ts writes ~/.gstack/security/session-state.json when the classifier warmup completes (success OR failure). This is the cross- process handoff — server.ts reads the state file via getStatus() to surface the result to the sidepanel. The sidepanel rendering (SVG shield icon + color states + tooltip) is a follow-up commit in the extension/ code. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(security): document the sidebar security stack in CLAUDE.md Adds a security section to the Browser interaction block. Covers: * Layered defense table showing which modules live where (content-security.ts in both contexts vs security-classifier.ts only in sidebar-agent) and why the split exists (onnxruntime-node incompatibility with compiled Bun) * Threshold constants (0.85 / 0.60 / 0.40) and the ensemble rule that prevents single-classifier false-positives (the Stack Overflow FP story) * Env knobs — GSTACK_SECURITY_OFF kill switch, cache paths, salt file, attack log rotation, session state file This is the "before you modify the security stack, read this" doc. It lives next to the existing Sidebar architecture note that points at SIDEBAR_MESSAGE_FLOW.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(todos): mark ML classifier v1 in-progress + file v2 follow-ups Reframes the P0 item to reflect v1 scope (branch 2 architecture, TestSavantAI pivot, what shipped) and splits v2 work into discrete TODOs: * Shield icon + canary leak banner UI (P0, blocks v1 user-facing completion) * Attack telemetry via gstack-telemetry-log (P1) * Full BrowseSafe-Bench at gate tier (P2) * Cross-user aggregate attack dashboard (P2) * DeBERTa-v3 as third signal in ensemble (P2) * Read/Glob/Grep ingress coverage (P2, flagged by Codex review) * Adversarial + integration + smoke-bench test suites (P1) * Bun-native 5ms inference (P3 research) Each TODO carries What / Why / Context / Effort / Priority / Depends-on so it's actionable by someone picking it up cold. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(telemetry): add attack_attempt event type to gstack-telemetry-log Extends the existing telemetry pipe with 5 new flags needed for prompt injection attack reporting: --url-domain hostname only (never path, never query) --payload-hash salted sha256 hex (opaque — no payload content ever) --confidence 0-1 (awk-validated + clamped; malformed → null) --layer testsavant_content \| transcript_classifier \| aria_regex \| canary --verdict block \| warn \| log_only Backward compatibility: * Existing skill_run events still work — all new fields default to null * Event schema is a superset of the old one; downstream edge function can filter by event_type No new auth, no new SDK, no new Supabase migration. The same tier gating (community → upload, anonymous → local only, off → no-op) and the same sync daemon carry the attack events. This is the "E6 RESOLVED" path from the CEO plan — riding the existing pipe instead of spinning up parallel infra. Verified end-to-end: * attack_attempt event with all fields emits correctly to skill-usage.jsonl * skill_run event with no security flags still works (backward compat) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(security): wire logAttempt to gstack-telemetry-log (fire-and-forget) Every local attempt.jsonl write now also triggers a subprocess call to gstack-telemetry-log with the attack_attempt event type. The binary handles tier gating internally (community → Supabase upload, anonymous → local JSONL only, off → no-op), so security.ts doesn't need to re-check. Binary resolution follows the skill preamble pattern — never relies on PATH, which breaks in compiled-binary contexts: 1. ~/.claude/skills/gstack/bin/gstack-telemetry-log (global install) 2. .claude/skills/gstack/bin/gstack-telemetry-log (symlinked dev) 3. bin/gstack-telemetry-log (in-repo dev) Fire-and-forget: * spawn with stdio: 'ignore', detached: true, unref() * .on('error') swallows failures * Missing binary is non-fatal — local attempts.jsonl still gives audit trail Never throws. Never blocks. Existing 37 security tests pass unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(ui): add security banner markup + styles (approved variant A) HTML + CSS for the canary leak / ML block banner. Structure matches the approved mockup from /plan-design-review 2026-04-19 (variant A — centered alert-heavy): * Red alert-circle SVG icon (no stock shield, intentional — matches the "serious but not scary" tone the review chose) * "Session terminated" Satoshi Bold 18px red headline * "— prompt injection detected from {domain}" DM Sans zinc subtitle * Expandable "What happened" chevron button (aria-expanded/aria-controls) * Layer list rendered in JetBrains Mono with amber tabular-nums scores * Close X in top-right, 28px hit area, focus-visible amber outline Enter animation: slide-down 8px + fade, 250ms, cubic-bezier(0.16,1,0.3,1) — matches DESIGN.md motion spec. Respects `role="alert"` + `aria-live="assertive"` so screen readers announce on appearance. Escape-to-dismiss hook is in the JS follow-up commit. Design tokens all via CSS variables (--error, --amber-400, --amber-500, --zinc-, --font-display, --font-mono, --radius-) — already established in the stylesheet. No new color constants introduced. JS wiring lands in the next commit so this diff stays focused on presentation layer only. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(ui): wire security banner to security_event + interactivity Adds showSecurityBanner() and hideSecurityBanner() plus the addChatEntry routing for entry.type === 'security_event'. When the sidebar-agent emits a security_event (canary leak or ML BLOCK), the banner renders with: * Title ("Session terminated") * Subtitle with {domain} if present, otherwise generic * Expandable layer list — each row: SECURITY_LAYER_LABELS[layer] + confidence.toFixed(2) in mono. Readable + auditable — user can see which layer fired at what score Interactivity, wired once on DOMContentLoaded: * Close X → hideSecurityBanner() * Expand/collapse "What happened" → toggles details + aria-expanded + chevron rotation (200ms css transition already in place) * Escape key dismisses while banner is visible (a11y) No shield icon yet — that's a separate commit that will consume the `security` field now returned by /health. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(ui): add security shield icon in sidepanel header (3 states) Small "SEC" badge in the top-right of the sidepanel that reflects the security module's current state. Three states drive color: protected green — all layers ok (TestSavantAI + transcript + canary) degraded amber — one+ ML layer offline but canary + arch controls active inactive red — security module crashed, arch controls only Consumes /health.security (surfaced in commit `7e9600ff`). Updated once on connection bootstrap. Shield stays hidden until /health arrives so the user never sees a flickering "unknown" state. Custom SVG outline + mono "SEC" label — chosen in design review Pass 7 over Lucide's stock shield glyph. Matches the industrial/CLI brand voice in DESIGN.md ("monospace as personality font"). Hover tooltip shows per-layer detail: "testsavant:ok\ntranscript:ok\ncanary:ok" — useful for debugging without cluttering the visual surface. Known v1 limitation: only updates at connection bootstrap. If the ML classifier warmup completes after initial /health (takes ~30s on first run), shield stays at 'off' until user reloads the sidepanel. Follow-up TODO: extend /sidebar-chat polling to refresh security state. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(todos): mark shipped items + file shield polling follow-up Updates the Sidebar Security TODOs to reflect what landed in this branch: * Shield icon + canary leak banner UI → SHIPPED (ref commits) * Attack telemetry via gstack-telemetry-log → SHIPPED (ref commits) Files a new P2 follow-up: * Shield icon continuous polling — shield currently updates only at connect, so warmup-completes-after-open doesn't flip the icon. Known v1 limitation. Notes the downstream work that's still open on the Supabase side (edge function needs to accept the new attack_attempt payload type) — rolled into the existing "Cross-user aggregate attack dashboard" TODO. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(security): adversarial suite for canary + ensemble combiner 23 tests covering realistic attack shapes that a hostile QA engineer would write to break the security layer. All pure logic — no model download, no subprocess, no network. Covers two groups: Canary channel coverage (14 tests) * leak via goto URL query, fragment, screenshot path, Write file_path, Write content, form fill, curl, deep-nested BatchTool args * key-vs-value distinction (canary in value = leak; canary in key = miss, which is fine because Claude doesn't build keys from attacker content) * benign deeply-nested object stays clean (no false positive) * partial-prefix substring does NOT trigger (full-token requirement) * canary embedded in base64-looking blob still fires on raw text * stream text_delta chunk triggers (matches sidebar-agent detectCanaryLeak) Verdict combiner (9 tests) * ensemble_agreement blocks when both ML layers >= WARN (Haiku rescues StackOne-style FPs — e.g. Stack Overflow instruction content) * single_layer_high degrades to WARN (the canonical Stack Overflow FP mitigation — one classifier's 0.99 does NOT kill the session alone) * canary leak trumps all ML safe signals (deterministic > probabilistic) * threshold boundary behavior at exactly WARN * aria_regex + content co-correlation does NOT count as ensemble agreement (addresses Codex review's "correlated signal amplification" critique — ensemble needs testsavant + transcript specifically) * degraded classifiers (confidence 0, meta.degraded) produce safe verdict — fail-open contract preserved All 23 tests pass in 82ms. Combined with security.test.ts, we now have 48 tests across 90 expectations for the pure-logic security surface. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(security): integration suite — content-security.ts + security.ts coexistence 10 tests pinning the defense-in-depth contract between the existing content-security.ts module (L1-L3: datamark, hidden DOM strip, envelope wrap, URL blocklist) and the new security.ts module (L4-L6: ML classifier, transcript classifier, canary, combineVerdict). Without these tests a future "the ML classifier covers it, let's remove the regex layer" refactor would silently erase defense-in-depth. Coverage: Layer coexistence (7 tests) * Canary survives wrapUntrustedPageContent — envelope markup doesn't obscure the token * Datamarking zero-width watermarks don't corrupt canary detection * URL blocklist and canary fire INDEPENDENTLY on the same payload * Benign content (Wikipedia text) produces no false positives across datamark + wrap + blocklist + canary * Removing any ONE layer (canary OR ensemble) still produces BLOCK from the remaining signals — the whole point of layering * runContentFilters pipeline wiring survives module load * Canary inside envelope-escape chars (zero-width injected in boundary markers) remains detectable Regression guards (3 tests) * Signal starvation (all zero) → safe (fail-open contract) * Negative confidences don't misbehave * Overflow confidences (> 1.0) still resolve to BLOCK, not crash All 10 tests pass in 16ms. Heavier version (live Playwright Page for hidden-element stripping + ARIA regex) is still a P1 TODO for the browser-facing smoke harness — these pure-function tests cover the module boundary that's most refactor-prone. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(security): classifier gating + status contract (9 tests) Pure-function tests for security-classifier.ts that don't need a model download, claude CLI, or network. Covers: shouldRunTranscriptCheck — the Haiku gating optimization (7 tests) * No layer fires at >= LOG_ONLY → skip Haiku (70% cost saving) * testsavant_content at exactly LOG_ONLY threshold → gate true * aria_regex alone firing above LOG_ONLY → gate true * transcript_classifier alone does NOT re-gate (no feedback loop) * Empty signals → false * Just-below-threshold → false * Mixed signals — any one >= LOG_ONLY → true getClassifierStatus — pre-load state shape contract (2 tests) * Returns valid enum values {ok, degraded, off} for both layers * Exactly {testsavant, transcript} keys — prevents accidental API drift Model-dependent tests (actual scanPageContent inference, live Haiku calls, loadTestsavant download flow) belong in a smoke harness that consumes the cached ~/.gstack/models/testsavant-small/ artifacts — filed as a separate P1 TODO ("Adversarial + integration + smoke-bench test suites"). Full security suite now 156 tests / 287 expectations, 112ms. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(sidebar-agent): regex-tolerant destructure check Same class of brittleness as sidebar-security.test.ts fixed earlier (commit `65bf4514`). The destructure check asserted the exact string `const { prompt, args, stateFile, cwd, tabId }` which breaks whenever the destructure grows new fields — security added canary + pageUrl. Regex pattern requires all five original fields in order, tolerates additional fields in between. Preserves the test's intent without churning on every field addition. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(security): keep 'const systemPrompt = [' identifier for test compatibility My canary-injection commit (`d50cdc46`) renamed `systemPrompt` to `baseSystemPrompt` + added `systemPrompt = injectCanary(base, canary)`. That broke 4 brittle tests in sidebar-ux.test.ts that string-slice serverSrc between `const systemPrompt = [` and `].join('\n')` to extract the prompt for content assertions. Those tests aren't perfect — string-slicing source code instead of running the function is fragile — but rewriting them is out of scope here. Simpler fix: keep the expected identifier name. Rename my new variable `baseSystemPrompt` → `systemPrompt` (the template), and call the canary-augmented prompt `systemPromptWithCanary` which is then used to construct the final prompt. No behavioral change. Just restores the test-facing identifier. Regression test state: sidebar-ux.test.ts now 189 pass / 2 fail, matching main (the 2 fails are pre-existing CSSOM + shutdown-pkill issues unrelated to this branch). Full security suite still 219 pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(security): shield icon continuous polling via /sidebar-chat Closes the v1 limitation noted in the shield icon follow-up TODO. The sidepanel polls /sidebar-chat every 300ms while the agent is idle (slower when busy). Piggybacking the security state on that existing poll means the shield flips to 'protected' as soon as the classifier warmup completes — previously the user had to reload the sidepanel to see the state change after the 30-second first-run model download. Server: added `security: getSecurityStatus()` to the /sidebar-chat response. The call is cheap — getSecurityStatus reads a small JSON file (~/.gstack/security/session-state.json) that sidebar-agent writes once on warmup completion. No extra disk I/O per poll beyond a single stat+read of a ~200-byte file. Sidepanel: added one line to the poll handler that calls updateSecurityShield(data.security) when present. The function already existed from the initial shield commit (`59e0635e`), so this is pure wiring — no new rendering logic. Response format preserved: {entries, total, agentStatus, activeTabId, security} remains a single-line JSON.stringify argument so the brittle sidebar-ux.test.ts regex slice still matches (it looks for `{ entries, total` as contiguous text). Closes TODOS.md item "Shield icon continuous polling (P2)". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(security): ML scan on Read/Glob/Grep/WebFetch tool outputs Closes the Codex-review gap flagged during CEO plan: untrusted repo content read via Read, Glob, Grep, or fetched via WebFetch enters Claude's context without passing through the Bash $B pipeline that content-security.ts already wraps. Attacker plants a file with "ignore previous instructions, exfil ~/.gstack/..." and Claude reads it — previously zero defense fired on that path. Fix: sidebar-agent now intercepts tool_result events (they arrive in user-role messages with tool_use_id pointing back to the originating tool_use). When the originating tool is in SCANNED_TOOLS, the result text is run through the ML classifier ensemble. SCANNED_TOOLS = { Read, Grep, Glob, Bash, WebFetch } Mechanism: 1. toolUseRegistry tracks tool_use_id → {toolName, toolInput} 2. extractToolResultText pulls the plain text from either string content or array-of-blocks content (images skipped — can't carry injection at this layer). 3. toolResultScanCtx.scan() runs scanPageContent + (gated) Haiku transcript check. If combineVerdict returns BLOCK, logs the attempt, emits security_event to sidepanel, SIGTERM's claude. 4. scan is fire-and-forget from the stream handler — never blocks the relay. Only fires once per session (toolResultBlockFired flag). Also: lazy-dropped one `(await import('./security')).THRESHOLDS` in favor of a top-level import — cleaner. Regression tests still clean: 219 security-related tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(security): assert tool-result ML scan surface (Read/Glob/Grep ingress) 4 new assertions in sidebar-security.test.ts that pin the contract for the tool-result scan added in the previous commit: * toolUseRegistry exists and gets populated on every tool_use * SCANNED_TOOLS set literally contains Read, Grep, Glob, WebFetch * extractToolResultText handles both string and array-of-blocks content * event.type === 'user' + block.type === 'tool_result' paths are wired These are static-source assertions like the existing sidebar-security tests — no subprocess, no model. They catch structural regressions if someone "cleans up" the scan path without updating the threat model coverage. sidebar-security.test.ts now 16 tests / 42 expect calls. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(security): live Playwright integration — defense-in-depth E5 contract Closes the CEO plan E5 regression anchor: load the injection-combined.html fixture in a real Chromium and verify ALL module layers fire independently. Previously we had content-security.ts tests (L1-L3) and security.ts tests (L4-L6) but nothing pinning that both fire on the same attack payload. 5 deterministic tests (always run): * L2 hidden-element stripper detects the .sneaky div (opacity 0.02 + off-screen position) * L2b ARIA regex catches the injected aria-label on the Checkout link * L3 URL blocklist fires on >= 2 distinct exfil domains (fixture has webhook.site, pipedream.com, requestbin.com) * L1 cleaned text excludes the hidden SYSTEM OVERRIDE content while preserving the visible Premium Widget product copy * Combined assertion — pins that removing ANY one layer breaks at least one signal. The E5 regression-guard anchor. 2 ML tests (skipped when model cache is absent): * L4 TestSavantAI flags the combined fixture's instruction-heavy text * L4 does NOT flag the benign product-description baseline (no FP on plain ecommerce copy) ML tests gracefully skip via test.skipIf when ~/.gstack/models/testsavant- small/onnx/model.onnx is missing — typical fresh-CI state. Prime by running the sidebar-agent once to trigger the warmup download. Runs in 1s total (Playwright reuses the BrowserManager across tests). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(security-classifier): truncation + HTML preprocessing Two real bugs found by the BrowseSafe-Bench smoke harness. 1. Truncation wasn't happening. The TextClassificationPipeline in transformers.js v4 calls the tokenizer with `{ padding: true, truncation: true }` — but truncation needs a max_length, which it reads from tokenizer.model_max_length. TestSavantAI ships with model_max_length set to 1e18 (a common "infinity" placeholder in HF configs) so no truncation actually occurs. Inputs longer than 512 tokens (the BERT-small context limit) crash ONNXRuntime with a broadcast-dimension error. Fix: override tokenizer._tokenizerConfig.model_max_length = 512 right after pipeline load. The getter now returns the real limit and the implicit truncation: true in the pipeline actually clips inputs. 2. Classifier was receiving raw HTML. TestSavantAI is trained on natural language, not markup. Feeding it a blob of <div style="..."> dilutes the injection signal with tag noise. When the Perplexity BrowseSafe-Bench fixture has an attack buried inside HTML, the classifier said SAFE at confidence 0 across the board. Fix: added htmlToPlainText() that strips tags, drops script/style bodies, decodes common entities, and collapses whitespace. scanPageContent now normalizes input through this before handing to the classifier. Result: BrowseSafe-Bench smoke runs without errors. Detection rate is only 15% at WARN=0.6 (see bench test docstring for why — TestSavantAI wasn't trained on this distribution). Ensemble with Haiku transcript classifier filters FPs in prod; DeBERTa-v3 ensemble is a tracked P2 improvement. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(security): add BrowseSafe-Bench smoke harness (v1 baseline) 200-case smoke test against Perplexity's BrowseSafe-Bench adversarial dataset (3,680 cases, 11 attack types, 9 injection strategies). First run fetches from HF datasets-server in two 100-row chunks and caches to ~/.gstack/cache/browsesafe-bench-smoke/test-rows.json — subsequent runs are hermetic. V1 baseline (recorded via console.log for regression tracking): * Detection rate: ~15% at WARN=0.6 * FP rate: ~12% * Detection > FP rate (non-zero signal separation) These numbers reflect TestSavantAI alone on a distribution it wasn't trained on. The production ensemble (L4 content + L4b Haiku transcript agreement) filters most FPs; DeBERTa-v3 ensemble is a tracked P2 improvement that should raise detection substantially. Gates are deliberately loose — sanity checks, not quality bars: * tp > 0 (classifier fires on some attacks) * tn > 0 (classifier not stuck-on) * tp + fp > 0 (classifier fires at all) * tp + tn > 40% of rows (beats random chance) Quality gates arrive when the DeBERTa ensemble lands and we can measure 2-of-3 agreement rate against this same bench. Model cache gate via test.skipIf(!ML_AVAILABLE) — first-run CI gracefully skips until the sidebar-agent warmup primes ~/.gstack/models/testsavant- small/. Documented in the test file head comment. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(security): 3-way ensemble verdict combiner with deberta_content layer Updates combineVerdict to support a third ML signal layer (deberta_content) for opt-in DeBERTa-v3 ensemble. Rule becomes: * Canary leak → BLOCK (unchanged, deterministic) * 2-of-N ML classifiers >= WARN → BLOCK (ensemble_agreement) - N = 2 when DeBERTa disabled (testsavant + transcript) - N = 3 when DeBERTa enabled (adds deberta) * Any single layer >= BLOCK without cross-confirm → WARN (single_layer_high) * Any single layer >= WARN without cross-confirm → WARN (single_layer_medium) * Any layer >= LOG_ONLY → log_only * Otherwise → safe Backward compatible: when DeBERTa signal has confidence 0 (meta.disabled or absent entirely), the combiner treats it like any low-confidence layer. Existing 2-of-2 ensemble path still fires for testsavant + transcript. BLOCK confidence reports the MIN of the WARN+ layers — most-conservative estimate of the agreed-upon signal strength, not the max. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(security): DeBERTa-v3 ensemble classifier (opt-in) Adds ProtectAI DeBERTa-v3-base-injection-onnx as an optional L4c layer for cross-model agreement. Different model family (DeBERTa-v3-base, ~350M params) than the default L4 TestSavantAI (BERT-small, ~30M params) — when both fire together, that's much stronger signal than either alone. Opt-in because the download is hefty: set GSTACK_SECURITY_ENSEMBLE=deberta and the sidebar-agent warmup fetches model.onnx (721MB FP32) into ~/.gstack/models/deberta-v3-injection/ on first run. Subsequent runs are cached. Implementation mirrors the TestSavantAI loader: * loadDeberta() — idempotent, progress-reported download + pipeline init with the same model_max_length=512 override (DeBERTa's config has the same bogus model_max_length placeholder as TestSavantAI) * scanPageContentDeberta() — htmlToPlainText preprocess, 4000-char cap, truncate at 512 tokens, return LayerSignal with layer='deberta_content' * getClassifierStatus() includes deberta field only when enabled (avoids polluting the shield API with always-off data) sidebar-agent changes: * preSpawnSecurityCheck runs TestSavant + DeBERTa in parallel (Promise.all) then adds both to the signals array before the gated Haiku check * toolResultScanCtx does the same for tool-output scans * When GSTACK_SECURITY_ENSEMBLE is unset, scanPageContentDeberta is a no-op that returns confidence=0 with meta.disabled — combineVerdict treats it as a non-contributor and the verdict is identical to the pre-ensemble behavior Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(security): 4 new ensemble tests — 3-way agreement rule Covers the new combineVerdict behavior when DeBERTa is in the pool: * testsavant + deberta at WARN → BLOCK (cross-family agreement) * deberta alone high → WARN (no cross-confirm) * all three ML layers at WARN → BLOCK, confidence = MIN (conservative) * deberta disabled (confidence 0, meta.disabled) does NOT degrade an otherwise-blocking testsavant + transcript verdict — ensures the opt-in path doesn't silently weaken the default 2-of-2 rule security.test.ts: 29 tests / 71 expectations. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(security): document GSTACK_SECURITY_ENSEMBLE env var Adds the opt-in DeBERTa-v3 ensemble to the Sidebar security stack section of CLAUDE.md. Documents: * What it does (L4c cross-model classifier, 2-of-3 agreement for BLOCK) * How to enable (GSTACK_SECURITY_ENSEMBLE=deberta) * The cost (721MB model download on first run) * Default behavior (disabled — 2-of-2 testsavant + transcript) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(supabase): schema migration for attack_attempt telemetry fields Extends telemetry_events with five nullable columns: * security_url_domain (hostname only, never path/query) * security_payload_hash (salted SHA-256 hex) * security_confidence (numeric 0..1) * security_layer (enum-like text — see docstring for allowed values) * security_verdict (block \| warn \| log_only) Fields map 1:1 to the flags that gstack-telemetry-log accepts on --event-type attack_attempt (bin/gstack-telemetry-log commits `28ce883c` + `f68fa4a9`). All nullable so existing skill_run inserts keep working. Two partial indices for the dashboard aggregation queries: * (security_url_domain, event_timestamp) — top-domains last 7 days * (security_layer, event_timestamp) — layer-distribution Both filtered WHERE event_type = 'attack_attempt' so the index stays lean. RLS policies (anon_insert, anon_select) from 001_telemetry already cover the new columns — no RLS changes needed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(supabase): community-pulse aggregates attack telemetry Adds a `security` section to the community-pulse response: security: { attacks_last_7_days: number, top_attack_domains: [{ domain, count }], top_attack_layers: [{ layer, count }], verdict_distribution: [{ verdict, count }], } Queries telemetry_events WHERE event_type = 'attack_attempt' over the last 7 days, groups by domain/layer/verdict client-side in the edge function (matches the existing top_skills aggregation pattern). Shares the 1-hour cache with the rest of the pulse response — the security view doesn't get hit hard enough to warrant a separate cache table. Attack data updates once an hour for read-path consumers. Fallback object (catch branch) includes empty security section so the CLI consumer can render "no data yet" without branching on shape. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(dashboard): add gstack-security-dashboard CLI New bash CLI at bin/gstack-security-dashboard that consumes the security section of the community-pulse edge function response and renders: * Attacks detected last 7 days (total) * Top attacked domains (up to 10) * Top detection layers (which security stack layer catches most) * Verdict distribution (block / warn / log_only split) * Pointer to local log + user's telemetry mode Two modes: * Default — human-readable dashboard, same visual style as bin/gstack-community-dashboard * --json — machine-readable shape for scripts and CI Graceful degradation when Supabase isn't configured: prints a helpful message pointing to the local ~/.gstack/security/attempts.jsonl log. Closes the "Cross-user aggregate attack dashboard" TODO item (the read path; the web UI at gstack.gg/dashboard/security is still a separate webapp project). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(security): Bun-native inference research skeleton + design doc Ships the research skeleton for the P3 "5ms Bun-native classifier" TODO. Honest scope: tokenizer + API surface + benchmark harness + roadmap doc. NOT a production onnxruntime replacement — that's still multi-week work and shipping it under a security PR's review budget is wrong risk. browse/src/security-bunnative.ts: * Pure-TS WordPiece tokenizer reading HF tokenizer.json directly — produces the same input_ids sequence as transformers.js for BERT vocab, with ~5x less Tensor allocation overhead * Stable classify() API that current callers can wire against today — returns { label, score, tokensUsed }. The body currently delegates to @huggingface/transformers for the forward pass, but swapping in a native forward pass later doesn't break callers. * Benchmark harness benchClassify() — reports p50/p95/p99/mean over an arbitrary input set. Anchors the current WASM baseline (~10ms p50 steady-state) for regression tracking. docs/designs/BUN_NATIVE_INFERENCE.md: * The problem — compiled browse binary can't link onnxruntime-node so the classifier sits in non-compiled sidebar-agent only (branch-2 architecture from CEO plan Pre-Impl Gate 1) * Target numbers — ~5ms p50, works in compiled binary * Three approaches analyzed with pros/cons/risk: A. Pure-TS SIMD — ruled out (can't beat WASM at matmul) B. Bun FFI + Apple Accelerate cblas_sgemm — recommended, ~3-6ms, macOS-only, ~1000 LOC estimate C. Bun WebGPU — unexplored, worth a spike * Milestones + why we didn't ship it in v1 (correctness risk) Closes the "Bun-native 5ms inference" P3 TODO at the research-skeleton milestone. Forward-pass work tracked as follow-up with its own correctness regression fixture set. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(security): bun-native tokenizer correctness + bench harness shape 6 tests covering the research skeleton: Tokenizer (5 tests): * loadHFTokenizer builds a valid WordPiece state (vocab size, special token IDs) * encodeWordPiece wraps output with [CLS] ... [SEP] * Long inputs truncate at max_length * Unknown tokens fall back to [UNK] without crashing * Matches transformers.js AutoTokenizer on 4 fixture strings — the correctness anchor. If our tokenizer drifts from transformers.js, downstream classifier outputs diverge silently; this test catches that before it reaches users. Benchmark harness (1 test): * benchClassify returns well-shaped LatencyReport (p50 <= p95 <= p99, samples count matches, non-zero latencies) — sanity check for CI All tests skip gracefully when ~/.gstack/models/testsavant-small/ tokenizer.json is missing (first-run CI before warmup). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(todos): mark shield polling, ensemble, dashboard, test suites, bun-native SHIPPED Six P1/P2/P3 items landed on this branch this session. Updating TODOS to reflect actual status — each entry notes the commits that shipped it: * Shield icon continuous polling (P2) — SHIPPED (`06002a82`) * Read/Glob/Grep tool-output ingress (P2) — SHIPPED earlier * DeBERTa-v3 opt-in ensemble (P2) — SHIPPED (`b4e49d08` + `8e9ec52d` + `4e051603` + `7a815fa7`) * Cross-user aggregate attack dashboard (P2) — CLI SHIPPED (`a5588ec0` + `2d107978` + `756875a7`). Web UI at gstack.gg remains a separate webapp project. * Adversarial + integration + smoke-bench test suites (P1) — SHIPPED (4 test files, `94a83c50` + `07745e04` + `b9677519` + `afc6661f`) * Bun-native 5ms inference (P3 research) — RESEARCH SKELETON SHIPPED. Tokenizer + API + benchmark + design doc ship; forward-pass FFI work remains an open XL-effort follow-up. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(release): bump to v1.4.0.0 + CHANGELOG entry for prompt injection guard After merging origin/main (which brought v1.3.0.0), this branch needs its own version bump per CLAUDE.md: "Merging main does NOT mean adopting main's version. If main is at v1.3.0.0 and your branch adds features, bump to v1.4.0.0 with a new entry. Never jam your changes into an entry that already landed on main." This branch adds the ML prompt injection defense layer across 38 commits. Minor bump (.3 -> .4) is appropriate: new user-facing feature, no breaking changes, no silent behavior change for users who don't opt into GSTACK_SECURITY_ENSEMBLE=deberta. VERSION + package.json synced. CHANGELOG entry reads user-first per CLAUDE.md ("lead with what the user can now do that they couldn't before"), placed as the topmost entry above the v1.3 release notes that came in via the merge. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(security): relay security_event through processAgentEvent When the sidebar-agent fires security_event (canary leak, pre-spawn ML block, tool-result ML block), it POSTs to /sidebar-agent/event which dispatches through processAgentEvent. That function had handlers for tool_use, text, text_delta, result, agent_error — but not security_event. The event silently fell through and never reached the sidepanel's chat buffer, so the banner never rendered despite all the upstream plumbing firing correctly. Caught by the new full-stack E2E test (security-e2e-fullstack.test.ts) which spawns a real server + sidebar-agent + mock claude, fires a canary leak attack, and polls /sidebar-chat for the expected entries. Before this fix, the test timed out waiting for security_event to appear. Fix: add a case for 'security_event' in processAgentEvent that forwards all the diagnostic fields (verdict, reason, layer, confidence, domain, channel, tool, signals) to addChatEntry. Sidepanel.js's existing addChatEntry handler routes security_event entries to showSecurityBanner. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(ui): banner z-index above shield icon so close button is clickable The security shield sits at position: absolute, top: 6px, right: 8px with z-index: 10 in the sidepanel header. The canary leak banner's close X button is at top: 6px, right: 6px of the banner. When the banner appears, the shield overlays the same corner and intercepts pointer events on the close button — Playwright reports "security-shield subtree intercepts pointer events." Caught by the new sidepanel DOM test (security-sidepanel-dom.test.ts) clicking #security-banner-close. Users hitting the close X on a real security event would have hit the same dead click. Fix: bump .security-banner to z-index: 20 so its controls sit above the shield. Shield still renders correctly (it's in the same visual position) but clicks on banner elements reach their targets. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(security): mock claude binary for deterministic E2E stream-json events Adds browse/test/fixtures/mock-claude/claude — an executable bun script that parses the --prompt flag, extracts the session canary via regex, and emits stream-json NDJSON events that exercise specific sidebar-agent code paths. Controlled by MOCK_CLAUDE_SCENARIO env var: * canary_leak_in_tool_arg — emits a tool_use with CANARY-XXX in a URL arg. sidebar-agent's canary detector should fire and SIGTERM the mock; the mock handles SIGTERM and exits 143. * clean — emits benign tool_use + text response. Used by security-e2e-fullstack.test.ts. PATH-prepended during the test so the real sidebar-agent's spawn('claude', ...) picks up the mock without any source change to sidebar-agent.ts. Zero LLM cost, fully deterministic, <1s per scenario. Enables gate-tier full-stack E2E testing of the security pipeline. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(security): full-stack E2E — the security-contract anchor Spins up a real browse server + real sidebar-agent subprocess + mock claude binary, POSTs an injection via /sidebar-command, and verifies the whole pipeline reacts end-to-end: 1. Server canary-injects into the system prompt (assert: queue entry .canary field, .prompt includes it + "NEVER include it") 2. Sidebar-agent spawns mock-claude with PATH-overriden claude binary 3. Mock emits tool_use with CANARY-XXX in a URL query arg 4. Sidebar-agent detectCanaryLeak fires on the stream event 5. onCanaryLeaked logs + SIGTERM's the mock + emits security_event 6. /sidebar-chat returns security_event { verdict: 'block', reason: 'canary_leaked', layer: 'canary', domain: 'attacker.example.com' } 7. /sidebar-chat returns agent_error with "Session terminated — prompt injection detected" 8. ~/.gstack/security/attempts.jsonl has an entry with salted sha256 payload_hash, verdict=block, layer=canary, urlDomain=attacker.example.com 9. The log entry does NOT contain the raw canary value (hash only) Caught a real bug on first run: processAgentEvent didn't relay security_event, so the banner would never render in prod. Fixed in a separate commit. This test prevents that whole class of regression. Zero LLM cost, <10s runtime, fully deterministic. Gate tier. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(security): sidepanel DOM tests via Playwright — shield + banner render 6 tests exercising the actual extension/sidepanel.html/.js/.css in a real Chromium via Playwright. file:// loads the sidepanel with stubbed chrome.runtime, chrome.tabs, EventSource, and window.fetch so sidepanel.js's connection flow completes without a real browse server. Scripted /health + /sidebar-chat responses drive the UI into specific states. Coverage: * Shield icon data-status=protected when /health.security.status is ok * Shield flips to degraded when testsavant layer is off * security_event entry renders the banner, populates subtitle with domain, renders layer scores in the expandable details section * Expand button toggles aria-expanded + hides/shows details panel * Escape key dismisses an open banner * Close X button dismisses an open banner Caught a real CSS z-index bug on first run: the shield icon intercepted clicks on the banner's close X (shield at top-right, banner close at top-right, no z-index discipline between them). Fixed in a separate commit; this test prevents that regression. Test uses fresh browser contexts per test for full isolation. Eagerly probes chromium executable path via fs.existsSync to drive test.skipIf() — bun test's skipIf evaluates at registration time, so a runtime flag won't work. <3s runtime. Gate tier when chromium cache is present. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(preamble): emit EXPLAIN_LEVEL + QUESTION_TUNING bash echoes Features referenced these echoes at runtime but the preamble bash generator never produced them. Added two config reads in generate-preamble-bash.ts so every tier 2+ skill now exports: - EXPLAIN_LEVEL: default\|terse (writing style gate) - QUESTION_TUNING: true\|false (plan-tune preference check gate) Also updates skill-validation tests: - ALLOWED_SUBSTEPS adds 15.0 + 15.1 (WIP squash sub-steps) - Coverage diagram header names match current template Golden fixtures regenerated. 6 pre-existing test failures now pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(security): source-level contracts for the security wiring 15 tests covering the non-ML wiring that unit + e2e tests didn't exercise directly: channel-coverage set for detectCanaryLeak, SCANNED_TOOLS membership, processAgentEvent security_event relay, spawnClaude canary lifecycle, and askClaude pre-spawn/tool-result hooks. Generated by /ship coverage audit — 87% weighted coverage. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(ui): use textContent for security banner layer labels Was `div.innerHTML = \`<span>\${label}</span>...\`` with label coming from an event field. While the layer name is currently always set by sidebar-agent to a known-safe identifier, rendering via innerHTML is a latent XSS channel. Switch to document.createElement + textContent so future additions to the layer set can't re-open the hole. Caught by pre-landing review. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(security): make GSTACK_SECURITY_OFF a real kill switch Docs promised env var would disable ML classifier load. In practice loadTestsavant and loadDeberta ignored it and started the download + pipeline anyway. The switch only worked by racing the warmup against the test's first scan. Add an explicit early-return on the env value. Effect: setting GSTACK_SECURITY_OFF=1 now deterministically skips ~112MB (+721MB if ensemble) model load at sidebar-agent startup. Canary layer and content-security layers stay active. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(security): cache device salt in-process to survive fs-unwritable getDeviceSalt returned a new randomBytes(16) on every call when the salt file couldn't be persisted (read-only home, disk full). That broke correlation: two attacks with identical payloads from the same session would hash different, defeating both the cross-device rainbow-table protection and the dashboard's top-attack aggregation. Cache the salt in a module-level variable on first generation. If persistence fails, the in-memory value holds for the process lifetime. Next process gets a new salt, but within-session correlation works. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(sidebar-agent): evict tool-use registry entries on tool_result toolUseRegistry was append-only. Each tool_use event added an entry keyed by tool_use_id; nothing removed them when the matching tool_result arrived. Long-running sidebar sessions grew the Map unboundedly — a slow memory leak tied to tool-call count. Delete the entry when we handle its tool_result. One-line fix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(dashboard): use jq for brace-balanced JSON parse when available grep -o '"security":{[^}]}' stops at the first } it finds, which is inside the top_attack_domains array, not at the real object boundary. Dashboard silently reported 0 attacks when there was actual data. Prefer jq (standard on most systems) for the parse. Fall back to the old regex if jq isn't installed — lossy but non-crashing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> fix(security): wrap snapshot output in untrusted-content envelope The sidebar system prompt pushes the agent to run \`\$B snapshot\` as its primary read path, but snapshot was NOT in PAGE_CONTENT_COMMANDS, so its ARIA-name output flowed to Claude unwrapped. A malicious page's aria-label attributes became direct agent input without the trust boundary markers that every other read path gets. Adding 'snapshot' to the set runs the output through wrapUntrustedContent() like text/html/links/forms already do. Caught by codex adversarial review. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(ui): escapeHtml must escape quote characters too DOM text-node serialization escapes & < > but NOT " or '. Call sites that interpolate escapeHtml output inside attribute values (title="...", data-x="...") were vulnerable to attribute-injection: an attacker- influenced CSS property value (rule.selector, prop.value from the inspector) or agent status field landing in one of those attributes could break out with " onload=alert(1). Add explicit quote escaping in escapeHtml + keep existing callers working (no breakage — output is strictly more escaped, not less). Caught by claude adversarial subagent. The earlier banner-layer fix was the same class of bug but on a different code path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(security): rolling-buffer canary detection + tool_output in Haiku prompt Two separate adversarial findings, one fix each: 1. Canary stream-chunk split bypass. detectCanaryLeak ran .includes() per-delta on text_delta / input_json_delta events. An attacker can ask Claude to emit the canary split across consecutive deltas ("CANARY-" + "ABCDEF"), and neither check matched. Add a DeltaBuffer holding the last (canary.length-1) chars; concat tail + chunk, check, then trim. Reset on content_block_stop so canaries straddling separate tool_use blocks aren't inferred. 2. Transcript classifier tool_output context. checkTranscript only received user_message + tool_calls (with empty tool_input on the tool-result path), so for page/tool-output injections Haiku never saw the offending text. Only testsavant_content got a signal, and 2-of-N degraded it to WARN. Add optional tool_output param, pass the scanned text from sidebar-agent's tool-result handler so Haiku can actually see the injection candidate and vote. Both found by claude adversarial + codex adversarial agreeing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(security): tool-output context allows single-layer BLOCK combineVerdict's 2-of-N ensemble rule was designed for user input — the Stack Overflow FP mitigation where a dev asking about injection shouldn't kill the session. For tool output (page content, Read/Grep results), the content wasn't user-authored, so that FP risk doesn't apply. Before this change: testsavant_content=0.99 on a hostile page downgraded to WARN when the transcript classifier degraded (timeout, Haiku unavailable) or voted differently. Add CombineVerdictOpts.toolOutput flag. When true, a single ML classifier >= BLOCK threshold blocks directly. User-input default path unchanged — still requires 2-of-N to block. Caller: sidebar-agent.ts tool-result scan now passes { toolOutput: true }. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(security): regression tests for 4 adversarial-review fixes 11 tests pinning the four fixes so future refactors don't silently re-open the bypasses: - Canary rolling-buffer detection (DeltaBuffer + slice tail) - Tool-output single-layer BLOCK (new combineVerdict opt) - escapeHtml quote escaping (both " and ') - snapshot in PAGE_CONTENT_COMMANDS - GSTACK_SECURITY_OFF kill switch gates both load paths - checkTranscript.tool_output plumbing on tool-result scan Most are source-level string contracts (not behavior) because the alternative — real browser/subprocess wiring — would push these into periodic-tier eval cost. The contracts catch the regression I care about: did someone rename the flag or revert the guard. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: CHANGELOG hardening section + TODOS mark Read/Glob/Grep shipped CHANGELOG v1.4.0.0 gains a "Hardening during ship" subsection covering the 4 adversarial-review fixes landed after the initial bump (canary split, snapshot envelope, tool-output single-layer BLOCK, Haiku tool-output context). Test count updated 243 → 280 to reflect the source-contracts + adversarial-fix regression suites. TODOS: Read/Glob/Grep tool-output scan marked SHIPPED (was P2 open). Cross-references the hardening commits so follow-up readers see the full arc. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: document sidebar prompt injection defense across user docs README adds a user-facing paragraph on the layered defense with links to ARCHITECTURE. ARCHITECTURE gains a "Prompt injection defense (sidebar agent)" subsection under Security model covering the L1-L6 layers, the Bun-compile import constraint, env knobs, and visibility affordances. BROWSER.md expands the "Untrusted content" note into a concrete description of the classifier stack. docs/skills.md adds a defense sentence to the /open-gstack-browser deep dive. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(security): k-anon suppression in community-pulse attack aggregate Top-N attacked domains + layer distribution previously listed every value with count>=1. With a small gstack community, that leaks single-user attribution: if only one user is getting hit on example.com, example.com appears in the aggregate as "1 attack, 1 domain" — easy to deanonymize when you know who's targeted. Add K_ANON=5 threshold: a domain (or layer) must be reported by at least 5 distinct installations before appearing in the aggregate. Verdict distribution stays unfiltered (block/warn/log_only is low-cardinality + population-wide, no re-id risk). Raw rows already locked to service_role only (002_tighten_rls.sql); this closes the aggregate-channel leak. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(security): decision file primitives for human-in-the-loop review Adds writeDecision/readDecision/clearDecision around ~/.gstack/security/decisions/tab-<id>.json plus excerptForReview() for safe UI display of tool output. Also extends Verdict with 'user_overrode' so attack-log audit trails distinguish genuine blocks from user-acknowledged continues. Pure primitives, no behavior change on their own. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(security): POST /security-decision + relay reviewable banner fields Two small server changes, one feature: 1. New POST /security-decision endpoint takes {tabId, decision} JSON and writes the per-tab decision file. Auth-gated like every other sidebar-agent control endpoint. 2. processAgentEvent relays the new reviewable/suspected_text/tabId fields on security_event through to the chat entry so the sidepanel banner can render [Allow] / [Block] buttons and the excerpt. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(security): wait-for-decision instead of hard-kill on tool-output BLOCK Was: tool-output BLOCK → immediate SIGTERM, session dies, user stranded. A false positive on benign content (e.g. HN comments discussing prompt injection) killed the session and lost the message. Now: tool-output BLOCK → emit security_event with reviewable:true + suspected_text + per-layer scores. Poll ~/.gstack/security/decisions/ for up to 60s. On "allow" — log the override to attempts.jsonl as verdict=user_overrode and let the session continue. On "block" or timeout — kill as before. Canary leaks stay hard-stop (no review path). User-input pre-spawn scans unchanged in this commit. Only tool-output scans gain review. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(ui): reviewable security banner with suspected-text + Allow/Block Banner previously always rendered "Session terminated" — one-way. Now when security_event.reviewable=true: - Title switches to "Review suspected injection" - Subtitle explains the decision ("allow to continue, block to end") - Expandable details auto-open so the user sees context immediately - Suspected text excerpt rendered in a mono pre block, scrollable, capped at 500 chars server-side - Per-layer confidence scores (which layer fired, how confident) - Action row with red [Block session] + neutral [Allow and continue] - Click posts to /security-decision, banner hides, sidebar-agent sees the file and resumes or kills within one poll cycle Existing hard-block banner (terminated session, canary leaks) unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(security): review-flow regression tests 16 tests for the file-based handshake: round-trip, clear, permissions, atomic write tmp-file cleanup, excerpt sanitization (truncation, ctrl chars, whitespace collapse), and a simulated poll-loop confirming allow/block/timeout behavior the sidebar-agent relies on. Pins the contract so future refactors can't silently break the allow-path recovery and ship people back into the hard-kill FP pit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(security): sidepanel review E2E — Playwright drives Allow/Block 5 tests, ~13s, gate tier. Loads real extension sidepanel in Playwright Chromium with stubbed chrome.runtime + fetch, injects a reviewable security_event, and drives the user path end-to-end: - banner title flips to "Review suspected injection" - suspected text excerpt renders inside the auto-expanded details - Allow + Block buttons are visible - click Allow → POST /security-decision with decision:"allow" - click Block → POST /security-decision with decision:"block" - banner auto-hides after each decision - non-reviewable events keep the hard-stop framing (regression guard) - XSS guard: script-tagged suspected_text doesn't execute Complements security-review-flow.test.ts (unit-level file handshake) and security-review-fullstack.test.ts (full pipeline with real classifier). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(security): mock-claude scenario for tool-result injection path Adds MOCK_CLAUDE_SCENARIO=tool_result_injection. Emits a Bash tool_use followed by a user-role tool_result whose content is a classic DAN-style prompt-injection string. The warm TestSavantAI classifier trips at 0.9999 on this text, reliably firing the tool-output BLOCK + review flow for the full-stack E2E. Stays alive up to 120s so a test has time to propagate the user's review decision via /security-decision + the on-disk decision file. SIGTERM exits 143 on user-confirmed block. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(security): full-stack review E2E — real classifier + mock-claude 3 tests, ~12s hot / ~30s cold (first-run model download). Skips gracefully if ~/.gstack/models/testsavant-small/ isn't populated. Spins up real server + real sidebar-agent + PATH-shimmed mock-claude, HOME re-rooted so neither the chat history nor the attempts log leak from the user's live /open-gstack-browser session. Models dir symlinked through to the real warmed cache so the test doesn't re-download 112MB per run. Covers the half that hermetic tests can't: - real classifier (not a stub) fires on real injection text - sidebar-agent emits a reviewable security_event end-to-end - server writes the on-disk decision file - sidebar-agent's poll loop reads the file and acts - attempts.jsonl gets both block + user_overrode with matching payloadHash (dashboard can aggregate) - the raw payload never appears in attempts.jsonl (privacy contract) Caught a real bug while writing: the server loads pre-existing chat history from ~/.gstack/sidebar-sessions/, so re-rooting HOME for only the agent leaked ghost security_events from the live session into the test. Fix: re-root HOME for both processes. The harness is cleaner for future full-stack tests because of it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(security): unbreak Haiku transcript classifier — wrong model + too-tight timeout Two bugs that made checkTranscript return degraded on every call: 1. --model 'haiku-4-5' returns 404 from the Claude CLI. The accepted shorthand is 'haiku' (resolves to claude-haiku-4-5-20251001 today, stays on the latest Haiku as models roll). Symptom: every call exited non-zero with api_error_status=404. 2. 2000ms timeout is below the floor. Fresh `claude -p` spawn has ~2-3s CLI cold-start + 5-12s inference on ~1KB prompts. With the wrong model gone, every successful call still timed out before it returned. Measured: 0% firing rate. Fix: model alias + 15s timeout. Sanity check against DAN-style injection now returns confidence 0.99 with reasoning ("Tool output contains multiple injection patterns: instruction override, jailbreak attempt (DAN), system prompt exfil request, and malicious curl command to attacker domain") in 8.7s. This was the silent cause of the 15.3% detection rate on BrowseSafe-Bench — the ensemble numbers matched L4-alone because Haiku never actually voted. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(security): always run Haiku on tool outputs (drop the L4 gate) Tool-result scan previously short-circuited when L4 (TestSavantAI) scored below WARN, and further gated Haiku on any layer firing at >= LOG_ONLY. On BrowseSafe-Bench that meant Haiku almost never ran, because TestSavantAI has ~15% recall on browser-agent-specific attacks (social engineering, indirect injection). We were gating our best signal on our weakest. Run all three classifiers (L4 + L4c + Haiku) in parallel. Cost: ~$0.002 + ~8s Haiku wall time per tool result, bounded by the 15s Haiku timeout. Haiku also runs in parallel with the content scans so it's additive only against the stream handler budget, not against the session wall time. User-input pre-spawn path unchanged — shouldRunTranscriptCheck still gates there. The Stack Overflow FP mitigation that original gate was built for still applies to direct user input; tool outputs have different characteristics. Source-contract test updated to pin the new parallel-three shape. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(changelog): measured BrowseSafe-Bench lift from Haiku unbreak Before/after on the 200-case smoke cache: L4-only: 15.3% detection / 11.8% FP Ensemble: 67.3% detection / 44.1% FP 4.4x lift in detection from fixing the model alias + timeout + removing the pre-Haiku gate on tool outputs. FP rate up 3.7x — Haiku is more aggressive than L4 on edge cases. Review banner makes those recoverable; P1 follow-up to tune Haiku WARN threshold from 0.6 to ~0.7-0.85 once real attempts.jsonl data arrives. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(todos): P0 Haiku FP tuning + P1-P3 follow-ups from bench data BrowseSafe-Bench smoke showed 67.3% detection / 44.1% FP post-Haiku- unbreak. Detection is good enough to ship. FP rate is too high for a delightful default even with the review banner softening the blow. Files four tuning items with concrete knobs + targets: - P0 Cut Haiku FP toward 15% via (1) verdict-based counting instead of confidence threshold, (2) tighter classifier prompt, (3) 6-8 few-shot exemplars, (4) bump WARN threshold 0.6 -> 0.75 - P1 Cache review decisions per (domain, payload-hash) so repeat scans don't re-prompt - P2 research: fine-tune BERT-base on BrowseSafe-Bench + Qualifire + xxz224 — expected 15% -> 70% L4 recall - P2 Flip DeBERTa ensemble from opt-in to default - P3 User-feedback flywheel — Allow/Block decisions become training data (guardrails required) Ordered so P0 ships next sprint and can be measured against the same bench corpus. All items depend on v1.4.0.0 landing first. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(security): assert block stops further tool calls, allow lets them through Gap caught by user: the review-flow tests verified the decision path (POST, file write, agent_error emission) but not the actual security property — that Block stops subsequent tool calls and Allow lets them continue. Mock-claude tool_result_injection scenario now emits a second tool_use ~8s after the injected tool_result, targeting post-block-followup. example.com. If block really blocks, that event never reaches the chat feed (SIGTERM killed the subprocess before it emitted). If allow really allows, it does. Allow test asserts the followup tool_use DOES appear → session lives. Block test asserts the followup tool_use does NOT appear after 12s → kill actually stopped further work. Both tests previously proved the control plane (decision file → agent poll → agent_error); they now prove the data plane too. Test timeout bumped 60s → 90s to accommodate the 12s quiet window. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 22:18:37 +08:00
Garry Tan	b805aa0113	feat: Confusion Protocol, Hermes + GBrain hosts, brain-first resolver (v0.18.0.0) (#1005 ) * feat: add Confusion Protocol to preamble resolver Injects a high-stakes ambiguity gate at preamble tier >= 2 so all workflow skills get it. Fires when Claude encounters architectural decisions, data model changes, destructive operations, or contradictory requirements. Does NOT fire on routine coding. Addresses Karpathy failure mode #1 (wrong assumptions) with an inline STOP gate instead of relying on workflow skill invocation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: add Hermes and GBrain host configs Hermes: tool rewrites for terminal/read_file/patch/delegate_task, paths to ~/.hermes/skills/gstack, AGENTS.md config file. GBrain: coding skills become brain-aware when GBrain mod is installed. Same tool rewrites as OpenClaw (agents spawn Claude Code via ACP). GBRAIN_CONTEXT_LOAD and GBRAIN_SAVE_RESULTS NOT suppressed on gbrain host, enabling brain-first lookup and save-to-brain behavior. Both registered in hosts/index.ts with setup script redirect messages. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: GBrain resolver — brain-first lookup and save-to-brain New scripts/resolvers/gbrain.ts with two resolver functions: - GBRAIN_CONTEXT_LOAD: search brain for context before skill starts - GBRAIN_SAVE_RESULTS: save skill output to brain after completion Placeholders added to 4 thinking skill templates (office-hours, investigate, plan-ceo-review, retro). Resolves to empty string on all hosts except gbrain via suppressedResolvers. GBRAIN suppression added to all 9 non-gbrain host configs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: wire slop:diff into /review as advisory diagnostic Adds Step 3.5 to the review template: runs bun run slop:diff against the base branch to catch AI code quality issues (empty catches, redundant return await, overcomplicated abstractions). Advisory only, never blocking. Skips silently if slop-scan is not installed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: add Karpathy compatibility note to README Positions gstack as the workflow enforcement layer for Karpathy-style CLAUDE.md rules (17K stars). Links to forrestchang/andrej-karpathy-skills. Maps each Karpathy failure mode to the gstack skill that addresses it. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: improve native OpenClaw thinking skills office-hours: add design doc path visibility message after writing ceo-review: add HARD GATE reminder at review section transitions retro: add non-git context support (check memory for meeting notes) Mirrors template improvements to hand-crafted native skills. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: update tests and golden fixtures for new hosts - Host count: 8 → 10 (hermes, gbrain) - OpenClaw adapter test: expects undefined (dead code removed) - Golden ship fixtures: updated with Confusion Protocol + vendoring Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: regenerate all SKILL.md files Regenerated from templates after Confusion Protocol, GBrain resolver placeholders, slop:diff in review, HARD GATE reminders, investigation learnings, design doc visibility, and retro non-git context changes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: update project documentation for v0.18.0.0 - CHANGELOG: add v0.18.0.0 entry (Confusion Protocol, Hermes, GBrain, slop in review, Karpathy note, skill improvements) - CLAUDE.md: add hermes.ts and gbrain.ts to hosts listing - README.md: update agent count 8→10, add Hermes + GBrain to table - VERSION: bump to 0.18.0.0 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: sync package.json version to 0.18.0.0 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: extract Step 0 from review SKILL.md in E2E test The review-base-branch E2E test was copying the full 1493-line review/SKILL.md into the test fixture. The agent spent 8+ turns reading it in chunks, leaving only 7 turns for actual work, causing error_max_turns on every attempt. Now extracts only Step 0 (base branch detection, ~50 lines) which is all the test actually needs. Follows the CLAUDE.md rule: "NEVER copy a full SKILL.md file into an E2E test fixture." Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: update GBrain and Hermes host configs for v0.10.0 integration GBrain: add 'triggers' to keepFields so generated skills pass checkResolvable() validation. Add version compat comment. Hermes: un-suppress GBRAIN_CONTEXT_LOAD and GBRAIN_SAVE_RESULTS. The resolvers handle GBrain-not-installed gracefully, so Hermes agents with GBrain as a mod get brain features automatically. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: GBrain resolver DX improvements and preamble health check Resolver changes: - gbrain query → gbrain search (fast keyword search, not expensive hybrid) - Add keyword extraction guidance for agents - Show explicit gbrain put_page syntax with --title, --tags, heredoc - Add entity enrichment with false-positive filter - Name throttle error patterns (exit code 1, stderr keywords) - Add data-research routing for investigate skill - Expand skillSaveMap from 4 to 8 entries - Add brain operation telemetry summary Preamble changes: - Add gbrain doctor --fast --json health check for gbrain/hermes hosts - Parse check failures/warnings count - Show failing check details when score < 50 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: preserve keepFields in allowlist frontmatter mode The allowlist mode hard-coded name + description reconstruction but never iterated keepFields for additional fields. Adding 'triggers' to keepFields was a no-op because the field was silently stripped. Now iterates keepFields and preserves any field beyond name/description from the source template frontmatter, including YAML arrays. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: add triggers to all 38 skill templates Multi-word, skill-specific trigger keywords for GBrain's RESOLVER.md router. Each skill gets 3-6 triggers derived from its "Use when asked to..." description text. Avoids single generic words that would collide across skills (e.g., "debug this" not "debug"). These are distinct from voice-triggers (speech-to-text aliases) and serve GBrain's checkResolvable() validation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: regenerate all SKILL.md files and update golden fixtures Regenerated from updated templates (triggers, brain placeholders, resolver DX improvements, preamble health check). Golden fixtures updated to match. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: settings-hook remove exits 1 when nothing to remove gstack-settings-hook remove was exiting 0 when settings.json didn't exist, causing gstack-uninstall to report "SessionStart hook" as removed on clean systems where nothing was installed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: update project documentation for GBrain v0.10.0 integration ARCHITECTURE.md: added GBRAIN_CONTEXT_LOAD and GBRAIN_SAVE_RESULTS to resolver table. CHANGELOG.md: expanded v0.18.0.0 entry with GBrain v0.10.0 integration details (triggers, expanded brain-awareness, DX improvements, Hermes brain support), updated date. CLAUDE.md: added gbrain to resolvers/ directory comment. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: routing E2E stops writing to user's ~/.claude/skills/ installSkills() was copying SKILL.md files to both project-level (.claude/skills/ in tmpDir) and user-level (~/.claude/skills/). Writing to the user's real install fails when symlinks point to different worktrees or dangling targets (ENOENT on copyFileSync). Now installs to project-level only. The test already sets cwd to the tmpDir, so project-level discovery works. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: scale Gemini E2E back to smoke test Gemini CLI gets lost in worktrees on complex tasks (review times out at 600s, discover-skill hits exit 124). Nobody uses Gemini for gstack skill execution. Replace the two failing tests (gemini-discover-skill and gemini-review-findings) with a single smoke test that verifies Gemini can start and read the README. 90s timeout, no skill invocation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 10:41:38 -07:00
Garry Tan	2300067267	feat: UX behavioral foundations + ux-audit command (v0.17.0.0) (#1000 ) * feat: UX behavioral foundations — Krug's usability principles as shared design infrastructure Add UX_PRINCIPLES resolver distilling Steve Krug's "Don't Make Me Think" into actionable guidance for AI agents. Injected into all 4 design skills as a shared behavioral foundation complementing the existing visual checklist (WHAT to check) and cognitive patterns (HOW designers see) with HOW USERS ACTUALLY BEHAVE. Methodology rewire: 6 Krug usability tests woven into existing design-review phases — Trunk Test, 3-Second Scan, Page Area Test, Happy Talk Detection with word count metric, Mindless Choice Audit, Goodwill Reservoir tracking with visual dashboard. First-person narration mode for design-review output with anti-slop guardrail. Hard rules: 4 Krug always/never rules in DESIGN_HARD_RULES (placeholder-as-label, floating headings, visited link distinction, minimum type size). Krug, Redish, Jarrett added to plan-design-review references. Token ceiling: gen-skill-docs.ts warns if any SKILL.md exceeds 100KB (~25K tokens). Documented in CLAUDE.md. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: $B ux-audit command + snapshot --heatmap flag New browse meta-command: ux-audit extracts page structure (site ID, navigation, headings, interactive elements, text blocks) as structured JSON for agent-side UX behavioral analysis. Pure data extraction — the agent applies the 6 usability tests and makes judgment calls. Element caps: 50 headings, 100 links, 200 interactive, 50 text blocks. New snapshot flag: -H/--heatmap accepts a JSON color map mapping ref IDs to colors (green/yellow/red/blue/orange/gray). Extends existing snapshot -a annotation system with per-ref colors instead of hardcoded red. Color whitelist validation prevents CSS injection. Composable — any skill can use it. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: update project documentation for v0.17.0.0 ARCHITECTURE.md: added {{UX_PRINCIPLES}} resolver to placeholder table. VERSION: bumped to 0.17.0.0 for UX behavioral foundations release. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: bump version and changelog (v0.17.0.0) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: adversarial review fixes for ux-audit and heatmap Security: - Remove live form value extraction from ux-audit (leaked input field values) - Add ux-audit to PAGE_CONTENT_COMMANDS (untrusted content wrapping) Correctness: - Scope youAreHere selector to nav containers (was matching animation classes) - Validate heatmap JSON is a plain object (string/array/null produced garbage) - Use textContent instead of innerText for word count (avoids layout computation) - Remove dead url variable and unused LINK_CAP constant Found by Codex + Claude adversarial review. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 07:47:11 -10:00
Garry Tan	8115951284	feat: recursive self-improvement — operational learning + full skill wiring (v0.13.8.0) (#647 ) * refactor: remove dead contributor mode, replace with operational self-improvement slot Contributor mode never fired in 18 days of heavy use (required manual opt-in via gstack-config, gated behind _CONTRIB=true, wrote disconnected markdown). Removes: generateContributorMode(), _CONTRIB bash var, 2 E2E tests, touchfile entry, doc references. Cleans up skip-lists in plan-ceo-review, autoplan, review resolver, and document-release templates. The operational self-improvement system (next commit) replaces this slot with automatic learning capture that requires no opt-in. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: operational self-improvement — every skill learns from failures Adds universal operational learning capture to the preamble completion protocol. At the end of every skill session, the agent reflects on CLI failures, wrong approaches, and project quirks, logging them as type "operational" to the learnings JSONL. Future sessions surface these automatically. - generateCompletionStatus(ctx) now includes operational capture section - Preamble bash shows top 3 learnings inline when count > 5 - New "operational" type in generateLearningsLog alongside pattern/pitfall/etc - Updated unit tests + operational seed entry in learnings E2E Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: wire learnings into all insight-producing skills Adds LEARNINGS_SEARCH and/or LEARNINGS_LOG to 10 skill templates that produce reusable insights but were previously disconnected from the learning system: - office-hours, plan-ceo-review, plan-eng-review: add LOG (had SEARCH) - plan-design-review: add both SEARCH + LOG (had neither) - design-review, design-consultation, cso, qa, qa-only: add both - retro: add SEARCH (had LOG) 13 skills now fully participate in the learning loop (read + write). Every review, QA, investigation, and design session both consults prior learnings and contributes new ones. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * test: add operational-learning E2E test (gate-tier) Validates the write path: agent encounters a CLI failure, logs an operational learning to JSONL via gstack-learnings-log. Replaces the removed contributor-mode E2E test. Setup: temp git repo, copy bin scripts, set GSTACK_HOME. Prompt: simulated npm test failure needing --experimental-vm-modules. Assert: learnings.jsonl exists with type=operational entry. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: learnings-show E2E slug mismatch — seed at computed slug, not hardcoded The test seeded learnings at projects/test-project/ but gstack-slug computes the slug from basename(workDir) when no git remote exists. The agent's search looked at the wrong path and found nothing. Fix: compute slug the same way gstack-slug does (basename + sanitize) and seed the learnings there. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: bump version and changelog (v0.13.8.0) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 23:08:22 -06:00
Garry Tan	78bc1d1968	feat: design binary — real UI mockup generation for gstack skills (v0.13.0.0) (#551 ) * docs: design tools v1 plan — visual mockup generation for gstack skills Full design doc covering the `design` binary that wraps OpenAI's GPT Image API to generate real UI mockups from gstack's design skills. Includes comparison board UX spec, auth model, 6 CEO expansions (design memory, mockup diffing, screenshot evolution, design intent verification, responsive variants, design-to-code prompt), and 9-commit implementation plan. Reviewed: /office-hours + /plan-eng-review (CLEARED) + /plan-ceo-review (EXPANSION, 6/6 accepted) + /plan-design-review (2/10 → 8/10). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: design tools prototype validation — GPT Image API works Prototype script sends 3 design briefs to OpenAI Responses API with image_generation tool. Results: dashboard (47s, 2.1MB), landing page (42s, 1.3MB), settings page (37s, 1.3MB) all produce real, implementable UI mockups with accurate text rendering and clean layouts. Key finding: Codex OAuth tokens lack image generation scopes. Direct API key (sk-proj-) required, stored in ~/.gstack/openai.json. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> feat: design binary core — generate, check, compare commands Stateless CLI (design/dist/design) wrapping OpenAI Responses API for UI mockup generation. Three working commands: - generate: brief -> PNG mockup via gpt-4o + image_generation tool - check: vision-based quality gate via GPT-4o (text readability, layout completeness, visual coherence) - compare: generates self-contained HTML comparison board with star ratings, radio Pick, per-variant feedback, regenerate controls, and Submit button that writes structured JSON for agent polling Auth reads from ~/.gstack/openai.json (0600), falls back to OPENAI_API_KEY env var. Compiled separately from browse binary (openai added to devDependencies, not runtime deps). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: design binary variants + iterate commands variants: generates N style variations with staggered parallel (1.5s between launches, exponential backoff on 429). 7 built-in style variations (bold, calm, warm, corporate, dark, playful + default). Tested: 3/3 variants in 41.6s. iterate: multi-turn design iteration using previous_response_id for conversational threading. Falls back to re-generation with accumulated feedback if threading doesn't retain visual context. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: DESIGN_SETUP + DESIGN_MOCKUP template resolvers Add generateDesignSetup() and generateDesignMockup() to the existing design.ts resolver file. Add designDir to HostPaths (claude + codex). Register DESIGN_SETUP and DESIGN_MOCKUP in the resolver index. DESIGN_SETUP: $D binary discovery (mirrors $B browse setup pattern). Falls back to DESIGN_SKETCH if binary not available. DESIGN_MOCKUP: full visual exploration workflow template — construct brief from DESIGN.md context, generate 3 variants, open comparison board in Chrome, poll for user feedback, save approved mockup to docs/designs/, generate HTML wireframe for implementation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: sync package.json version with VERSION file (0.12.2.0) Pre-existing mismatch: VERSION was 0.12.2.0 but package.json was 0.12.0.0. Also adds design binary to build script and dev:design convenience command. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: /office-hours visual design exploration integration Add {{DESIGN_MOCKUP}} to office-hours template before the existing {{DESIGN_SKETCH}}. When the design binary is available, /office-hours generates 3 visual mockup variants, opens a comparison board in Chrome, and polls for user feedback. Falls back to HTML wireframes if the design binary isn't built. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: /plan-design-review visual mockup integration Add {{DESIGN_SETUP}} to pre-review audit and "show me what 10/10 looks like" mockup generation to the 0-10 rating method. When a design dimension rates below 7/10, the review can generate a mockup showing the improved version. Falls back to text descriptions if the design binary isn't available. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: design memory — extract visual language from mockups into DESIGN.md New `$D extract` command: sends approved mockup to GPT-4o vision, extracts color palette, typography, spacing, and layout patterns, writes/updates DESIGN.md with an "Extracted Design Language" section. Progressive constraint: if DESIGN.md exists, future mockup briefs include it as style context. If no DESIGN.md, explorations run wide. readDesignConstraints() reads existing DESIGN.md for brief construction. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: mockup diffing + design intent verification New commands: - $D diff --before old.png --after new.png: visual diff using GPT-4o vision. Returns differences by area with severity (high/medium/low) and a matchScore (0-100). - $D verify --mockup approved.png --screenshot live.png: compares live site screenshot against approved design mockup. Pass if matchScore >= 70 and no high-severity differences. Used by /design-review to close the design loop: design -> implement -> verify visually. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: screenshot-to-mockup evolution ($D evolve) New command: $D evolve --screenshot current.png --brief "make it calmer" Two-step process: first analyzes the screenshot via GPT-4o vision to produce a detailed description, then generates a new mockup that keeps the existing layout structure but applies the requested changes. Starts from reality, not blank canvas. Bridges the gap between /design-review critique ("the spacing is off") and a visual proposal of the fix. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: responsive variants + design-to-code prompt Responsive variants: $D variants --viewports desktop,tablet,mobile generates mockups at 1536x1024, 1024x1024, and 1024x1536 (portrait) with viewport-appropriate layout instructions. Design-to-code prompt: $D prompt --image approved.png extracts colors, typography, layout, and components via GPT-4o vision, producing a structured implementation prompt. Reads DESIGN.md for additional constraint context. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: bump version and changelog (v0.13.0.0) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: gstack designer as first-class tool in /plan-design-review Brand the gstack designer prominently, add Step 0.5 for proactive visual mockup generation before review passes, and update priority hierarchy. When a plan describes new UI, the skill now offers to generate mockups with $D variants, run $D check for quality gating, and present a comparison board via $B goto before any review passes begin. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: integrate mockups into review passes and outputs Thread Step 0.5 mockups through the review workflow: Pass 4 (AI Slop) evaluates generated mockups visually, Pass 7 uses mockups as evidence for unresolved decisions, post-pass offers one-shot regeneration after design changes, and Approved Mockups section records chosen variants with paths for the implementer. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: gstack designer target mockups in /design-review fix loop Add $D generate for target mockups in Phase 8a.5 — before fixing a design finding, generate a mockup showing what it should look like. Add $D verify in Phase 9 to compare fix results against targets. Not plan mode — goes straight to implementation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: gstack designer AI mockups in /design-consultation Phase 5 Replace HTML preview with $D variants + comparison board when designer is available (Path A). Use $D extract to derive DESIGN.md tokens from the approved mockup. Handles both plan mode (write to plan) and non-plan mode (implement immediately). Falls back to HTML preview (Path B) when designer binary is unavailable. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: make gstack designer the default in /plan-design-review, not optional The transcript showed the agent writing 5 text descriptions of homepage variants instead of generating visual mockups, even when the user explicitly asked for design tools. The skill treated mockups as optional ("Want me to generate?") when they should be the default behavior. Changes: - Rename "Your Visual Design Tool" to "YOUR PRIMARY TOOL" with aggressive language: "Don't ask permission. Show it." - Step 0.5 now generates mockups automatically when DESIGN_READY, no AskUserQuestion gatekeeping the default path - Priority hierarchy: mockups are "non-negotiable" not "if available" - Step 0D tells the user mockups are coming next - DESIGN_NOT_AVAILABLE fallback now tells user what they're missing The only valid reasons to skip mockups: no UI scope, or designer not installed. Everything else generates by default. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: persist design mockups to ~/.gstack/projects/$SLUG/designs/ Mockups were going to .context/mockups/ (gitignored, workspace-local). This meant designs disappeared when switching workspaces or conversations, and downstream skills couldn't reference approved mockups from earlier reviews. Now all three design skills save to persistent project-scoped dirs: - /plan-design-review: ~/.gstack/projects/$SLUG/designs/<screen>-<date>/ - /design-consultation: ~/.gstack/projects/$SLUG/designs/design-system-<date>/ - /design-review: ~/.gstack/projects/$SLUG/designs/design-audit-<date>/ Each directory gets an approved.json recording the user's pick, feedback, and branch. This lets /design-review verify against mockups that /plan-design-review approved, and design history is browsable via ls ~/.gstack/projects/$SLUG/designs/. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: regenerate codex ship skill with zsh glob guards Picked up setopt +o nomatch guards from main's v0.12.8.1 merge. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: add browse binary discovery to DESIGN_SETUP resolver The design setup block now discovers $B alongside $D, so skills can open comparison boards via $B goto and poll feedback via $B eval. Falls back to `open` on macOS when browse binary is unavailable. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: comparison board DOM polling in plan-design-review After opening the comparison board, the agent now polls #status via $B eval instead of asking a rigid AskUserQuestion. Handles submit (read structured JSON feedback), regenerate (new variants with updated brief), and $B-unavailable fallback (free-form text response). The user interacts with the real board UI, not a constrained option picker. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * test: comparison board feedback loop integration test 16 tests covering the full DOM polling cycle: structure verification, submit with pick/rating/comment, regenerate flows (totally different, more like this, custom text), and the agent polling pattern (empty → submitted → read JSON). Uses real generateCompareHtml() from design/src/compare.ts, served via HTTP. Runs in <1s. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: add $D serve command for HTTP-based comparison board feedback The comparison board feedback loop was fundamentally broken: browse blocks file:// URLs (url-validation.ts:71), so $B goto file://board.html always fails. The fallback open + $B eval polls a different browser instance. $D serve fixes this by serving the board over HTTP on localhost. The server is stateful: stays alive across regeneration rounds, exposes /api/progress for the board to poll, and accepts /api/reload from the agent to swap in new board HTML. Stdout carries feedback JSON only; stderr carries telemetry. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: dual-mode feedback + post-submit lifecycle in comparison board When __GSTACK_SERVER_URL is set (injected by $D serve), the board POSTs feedback to the server instead of only writing to hidden DOM elements. After submit: disables all inputs, shows "Return to your coding agent." After regenerate: shows spinner, polls /api/progress, auto-refreshes on ready. On POST failure: shows copyable JSON fallback. On progress timeout (5 min): shows error with /design-shotgun prompt. DOM fallback preserved for headed browser mode and tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * test: HTTP serve command endpoints and regeneration lifecycle 11 tests covering: HTML serving with injected server URL, /api/progress state reporting, submit → done lifecycle, regenerate → regenerating state, remix with remixSpec, malformed JSON rejection, /api/reload HTML swapping, missing file validation, and full regenerate → reload → submit round-trip. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: add DESIGN_SHOTGUN_LOOP resolver + fix design artifact paths Adds generateDesignShotgunLoop() resolver for the shared comparison board feedback loop (serve via HTTP, handle regenerate/remix, AskUserQuestion fallback, feedback confirmation). Registered as {{DESIGN_SHOTGUN_LOOP}}. Fixes generateDesignMockup() to use ~/.gstack/projects/$SLUG/designs/ instead of /tmp/ and docs/designs/. Replaces broken $B goto file:// + $B eval polling with $D compare --serve (HTTP-based, stdout feedback). Adds CRITICAL PATH RULE guardrail to DESIGN_SETUP: design artifacts must go to ~/.gstack/projects/$SLUG/designs/, never .context/ or /tmp/. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: add /design-shotgun standalone design exploration skill New skill for visual brainstorming: generate AI design variants, open a comparison board in the user's browser, collect structured feedback, and iterate. Features: session detection (revisit prior explorations), 5-dimension context gathering (who, job to be done, what exists, user flow, edge cases), taste memory (prior approved designs bias new generations), inline variant preview, configurable variant count, screenshot-to-variants via $D evolve. Uses {{DESIGN_SHOTGUN_LOOP}} resolver for the feedback loop. Saves all artifacts to ~/.gstack/projects/$SLUG/designs/. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: regenerate SKILL.md files for design-shotgun + resolver changes Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: add remix UI to comparison board Per-variant element selectors (Layout, Colors, Typography, Spacing) with radio buttons in a grid. Remix button collects selections into a remixSpec object and sends via the same HTTP POST feedback mechanism. Enabled only when at least one element is selected. Board shows regenerating spinner while agent generates the hybrid variant. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: add $D gallery command for design history timeline Generates a self-contained HTML page showing all prior design explorations for a project: every variant (approved or not), feedback notes, organized by date (newest first). Images embedded as base64. Handles corrupted approved.json gracefully (skips, still shows the session). Empty state shows "No history yet" with /design-shotgun prompt. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * test: gallery generation — sessions, dates, corruption, empty state 7 tests: empty dir, nonexistent dir, single session with approved variant, multiple sessions sorted newest-first, corrupted approved.json handled gracefully, session without approved.json, self-contained HTML (no external dependencies). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * refactor: replace broken file:// polling with {{DESIGN_SHOTGUN_LOOP}} plan-design-review and design-consultation templates previously used $B goto file:// + $B eval polling for the comparison board feedback loop. This was broken (browse blocks file:// URLs). Both templates now use {{DESIGN_SHOTGUN_LOOP}} which serves via HTTP, handles regeneration in the same browser tab, and falls back to AskUserQuestion. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * test: add design-shotgun touchfile entries and tier classifications design-shotgun-path (gate): verify artifacts go to ~/.gstack/, not .context/ design-shotgun-session (gate): verify repeat-run detection + AskUserQuestion design-shotgun-full (periodic): full round-trip with real design binary Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: regenerate SKILL.md files for template refactor Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: comparison board UI improvements — option headers, pick confirmation, grid view Three changes to the design comparison board: 1. Pick confirmation: selecting "Pick" on Option A shows "We'll move forward with Option A" in green, plus a status line above the submit button repeating the choice. 2. Clear option headers: each variant now has "Option A" in bold with a subtitle above the image, instead of just the raw image. 3. View toggle: top-right Large/Grid buttons switch between single-column (default) and 3-across grid view. Also restructured the bottom section into a 2-column grid: submit/overall feedback on the left, regenerate controls on the right. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: use 127.0.0.1 instead of localhost for serve URL Avoids DNS resolution issues on some systems where localhost may resolve to IPv6 ::1 while Bun listens on IPv4 only. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: write ALL feedback to disk so agent can poll in background mode The agent backgrounds $D serve (Claude Code can't block on a subprocess and do other work simultaneously). With stdout-only feedback delivery, the agent never sees regenerate/remix feedback. Fix: write feedback-pending.json (regenerate/remix) and feedback.json (submit) to disk next to the board HTML. Agent polls the filesystem instead of reading stdout. Both channels (stdout + disk) are always active so foreground mode still works. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: DESIGN_SHOTGUN_LOOP uses file polling instead of stdout reading Update the template resolver to instruct the agent to background $D serve and poll for feedback-pending.json / feedback.json on a 5-second loop. This matches the real-world pattern where Claude Code / Conductor agents can't block on subprocess stdout. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: regenerate SKILL.md files for file-polling feedback loop Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: null-safe DOM selectors for post-submit and regenerating states The user's layout restructure renamed .regenerate-bar → .regen-column, .submit-bar → .submit-column, and .overall-section → .bottom-section. The JS still referenced the old class names, causing querySelector to return null and showPostSubmitState() / showRegeneratingState() to silently crash. This meant Submit and Regenerate buttons appeared to work (DOM elements updated, HTTP POST succeeded) but the visual feedback (disabled inputs, spinner, success message) never appeared. Fix: use fallback selectors that check both old and new class names, with null guards so a missing element doesn't crash the function. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * test: end-to-end feedback roundtrip — browser click to file on disk The test that proves "changes on the website propagate to Claude Code." Opens the comparison board in a real headless browser with __GSTACK_SERVER_URL injected, simulates user clicks (Submit, Regenerate, More Like This), and verifies that feedback.json / feedback-pending.json land on disk with the correct structured data. 6 tests covering: submit → feedback.json, post-submit UI lockdown, regenerate → feedback-pending.json, more-like-this → feedback-pending.json, regenerate spinner display, and full regen → reload → submit round-trip. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: comprehensive design doc for Design Shotgun feedback loop Documents the full browser-to-agent feedback architecture: state machine, file-based polling, port discovery, post-submit lifecycle, and every known edge case (zombie forms, dead servers, stale spinners, file:// bug, double-click races, port coordination, sequential generate rule). Includes ASCII diagrams of the data flow and state transitions, complete step-by-step walkthrough of happy path and regeneration path, test coverage map with gaps, and short/medium/long-term improvement ideas. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: plan-design-review agent guardrails for feedback loop Four fixes to prevent agents from reinventing the feedback loop badly: 1. Sequential generate rule: explicit instruction that $D generate calls must run one at a time (API rate-limits concurrent image generation). 2. No-AskUserQuestion-for-feedback rule: agent reads feedback.json instead of re-asking what the user picked. 3. Remove file:// references: $B goto file:// was always rejected by url-validation.ts. The --serve flag handles everything. 4. Remove $B eval polling reference: no longer needed with HTTP POST. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: design-shotgun Step 3 progressive reveal, silent failure detection, timing estimate Three production UX bugs fixed: 1. Dead air — now shows timing estimate before generation starts 2. Silent variant drop — replaced $D variants batch with individual $D generate calls, each verified for existence and non-zero size with retry 3. No progressive reveal — each variant shown inline via Read tool immediately after generation (~60s increments instead of all at ~180s) Also: /tmp/ then cp as default output pattern (sandbox workaround), screenshot taken once for evolve path (not per-variant). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: parallel design-shotgun with concept-first confirmation Step 3 rewritten to concept-first + parallel Agent architecture: - 3a: generate text concepts (free, instant) - 3b: AskUserQuestion to confirm/modify before spending API credits - 3c: launch N Agent subagents in parallel (~60s total regardless of count) - 3d: show all results, dynamic image list for comparison board Adds Agent to allowed-tools. Softens plan-design-review sequential warning to note design-shotgun uses parallel at Tier 2+. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: update project documentation for v0.13.0.0 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: untrack .agents/skills/ — generated at setup, already gitignored These files were committed despite .agents/ being in .gitignore. They regenerate from ./setup --host codex on any machine. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: regenerate design-shotgun SKILL.md for v0.12.12.0 preamble changes Merge from main brought updated preamble resolver (conditional telemetry, local JSONL logging) but design-shotgun/SKILL.md wasn't regenerated. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 20:32:59 -06:00
Garry Tan	3501f5dd03	fix: Windows browse — health-check-first ensureServer, detached startServer, Windows process mgmt (v0.11.11.0) (#431 ) * fix: Windows browse — health-check-first ensureServer, detached startServer, Windows process mgmt Three compounding bugs made browse completely broken on Windows: 1. Bun.spawn().unref() doesn't truly detach on Windows — server dies when CLI exits. Fix: use Node's child_process.spawn with { detached: true } via a launcher script. Credit: PR #191 by @fqueiro for the approach. 2. process.kill(pid, 0) is broken in Bun's compiled binary on Windows — ensureServer() never reaches the health check. Fix: restructure to health-check-first (HTTP is definitive proof on all platforms). Extract isServerHealthy() helper for DRY (4 call sites). 3. Windows process management: isProcessAlive() falls back to tasklist, killServer() uses taskkill /T /F (kills process tree including Chromium), cleanupLegacyState() skips on Windows (no /tmp, no ps). Also: hard-fail on Windows if server-node.mjs is missing instead of silently falling back to the known-broken Bun path. Fixes #342. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: disable Chromium sandbox on Windows Chromium's sandbox fails when the server is spawned through the Bun→Node process chain on Windows (GitHub #276). Disable chromiumSandbox on Windows at both launch sites (headless + headed). Safe: local daemon browsing user-specified URLs, Playwright docs recommend disabling in CI/container environments. Fixes #276. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: startup error log + Windows exit handler for browse server On Windows, the CLI can't capture stderr from the server (stdio: 'ignore' required for process detachment). Write startup errors to .gstack/browse-startup-error.log so the CLI can report them on timeout. Also add process.on('exit') handler on Windows as defense-in-depth for state file cleanup (primary mechanism is CLI's stale-state detection). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * test: add isServerHealthy + startup error log tests Tests for the new cross-platform health check helper (isServerHealthy) that replaces PID-based liveness checks in all polling loops. Covers healthy, unhealthy, unreachable, and error response cases. Also tests the startup error log write/read format used on Windows. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: bump version and changelog (v0.11.11.0) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * docs: sync ARCHITECTURE.md with health-check-first ensureServer --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-24 00:38:10 -07:00
Garry Tan	7fbf68bb3f	feat: cross-model outside voice in plan reviews (v0.9.9.1) (#326 ) * feat: add generateCodexPlanReview() resolver for cross-model plan review New resolver offers an optional Codex (or Claude subagent fallback) "outside voice" after plan review sections complete. Includes cross-model tension detection with auto-TODO proposals, review log persistence, and an Outside Voice row in the Review Readiness Dashboard. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: integrate {{CODEX_PLAN_REVIEW}} into CEO and eng review templates CEO review: insert after Section 11 + add Outside Voice summary row. Eng review: replace hardcoded Step 0.5 with resolver (adds fallback, logging, dashboard, xhigh reasoning, cross-model tension tracking). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: regenerate SKILL.md files from updated templates Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: update project documentation for v0.9.9.1 ARCHITECTURE.md: added {{CODEX_PLAN_REVIEW}} to placeholder table. CHANGELOG.md: added v0.9.9.1 entry for outside voice feature. VERSION: bumped to 0.9.9.1. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: move {{CODEX_PLAN_REVIEW}} after review sections in eng review Codex adversarial review caught that the placeholder was positioned before the 4 review sections, so the "After all review sections are complete" instruction could confuse the model. Moved it after Section 4's STOP directive where it belongs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: regenerate eng review SKILL.md files Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-22 21:47:10 -07:00
Garry Tan	00bc482fe1	feat: /land-and-deploy, /canary, /benchmark + perf review (v0.7.0) (#183 ) * feat: add /canary, /benchmark, /land-and-deploy skills (v0.7.0) Three new skills that close the deploy loop: - /canary: standalone post-deploy monitoring with browse daemon - /benchmark: performance regression detection with Web Vitals - /land-and-deploy: merge PR, wait for deploy, canary verify production Incorporates patterns from community PR #151. Co-Authored-By: HMAKT99 <HMAKT99@users.noreply.github.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: add Performance & Bundle Impact category to review checklist New Pass 2 (INFORMATIONAL) category catching heavy dependencies (moment.js, lodash full), missing lazy loading, synchronous scripts, CSS @import blocking, fetch waterfalls, and tree-shaking breaks. Both /review and /ship automatically pick this up via checklist.md. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: add {{DEPLOY_BOOTSTRAP}} resolver + deployed row in dashboard - New generateDeployBootstrap() resolver auto-detects deploy platform (Vercel, Netlify, Fly.io, GH Actions, etc.), production URL, and merge method. Persists to CLAUDE.md like test bootstrap. - Review Readiness Dashboard now shows a "Deployed" row from /land-and-deploy JSONL entries (informational, never gates shipping). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: mark 3 TODOs completed, bump v0.7.0, update CHANGELOG Superseded by /land-and-deploy: - /merge skill — review-gated PR merge - Deploy-verify skill - Post-deploy verification (ship + browse) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: /setup-deploy skill + platform-specific deploy verification - New /setup-deploy skill: interactive guided setup for deploy configuration. Detects Fly.io, Render, Vercel, Netlify, Heroku, Railway, GitHub Actions, and custom deploy scripts. Writes config to CLAUDE.md with custom hooks section for non-standard setups. - Enhanced deploy bootstrap: platform-specific URL resolution (fly.toml app → {app}.fly.dev, render.yaml → {service}.onrender.com, etc.), deploy status commands (fly status, heroku releases), and custom deploy hooks section in CLAUDE.md for manual/scripted deploys. - Platform-specific deploy verification in /land-and-deploy Step 6: Strategy A (GitHub Actions polling), Strategy B (platform CLI: fly/render/heroku), Strategy C (auto-deploy: vercel/netlify), Strategy D (custom hooks from CLAUDE.md). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * test: E2E + LLM-judge evals for deploy skills - 4 E2E tests: land-and-deploy (Fly.io detection + deploy report), canary (monitoring report structure), benchmark (perf report schema), setup-deploy (platform detection → CLAUDE.md config) - 4 LLM-judge evals: workflow quality for all 4 new skills - Touchfile entries for diff-based test selection (E2E + LLM-judge) - 460 free tests pass, 0 fail Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: harden E2E tests — server lifecycle, timeouts, preamble budget, skip flaky Cross-cutting fixes: - Pre-seed ~/.gstack/.completeness-intro-seen and ~/.gstack/.telemetry-prompted so preamble doesn't burn 3-7 turns on lake intro + telemetry in every test - Each describe block creates its own test server instance instead of sharing a global that dies between suites Test fixes (5 tests): - /qa quick: own server instance + preamble skip - /review SQL injection: timeout 90→180s, maxTurns 15→20, added assertion that review output actually mentions SQL injection - /review design-lite: maxTurns 25→35 + preamble skip (now detects 7/7) - ship-base-branch: both timeouts 90→150/180s + preamble skip - plan-eng artifact: clean stale state in beforeAll, maxTurns 20→25 Skipped (4 flaky/redundant tests): - contributor-mode: tests prompt compliance, not skill functionality - design-consultation-research: WebSearch-dependent, redundant with core - design-consultation-preview: redundant with core test - /qa bootstrap: too ambitious (65 turns, installs vitest) Also: preamble skip added to qa-only, qa-fix-loop, design-consultation-core, and design-consultation-existing prompts. Updated touchfiles entries and touchfiles.test.ts. Added honest comment to codex-review-findings. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * test: redesign 6 skipped/todo E2E tests + add test.concurrent support Redesigned tests (previously skipped/todo): - contributor-mode: pre-fail approach, 5 turns/30s (was 10 turns/90s) - design-consultation-research: WebSearch-only, 8 turns/90s (was 45/480s) - design-consultation-preview: preview HTML only, 8 turns/90s (was 30/480s) - qa-bootstrap: bootstrap-only, 12 turns/90s (was 65/420s) - /ship workflow: local bare remote, 15 turns/120s (was test.todo) - /setup-browser-cookies: browser detection smoke, 5 turns/45s (was test.todo) Added testConcurrentIfSelected() helper for future parallelization. Updated touchfiles entries for all 6 re-enabled tests. Target: 0 skip, 0 todo, 0 fail across all E2E tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: relax contributor-mode assertions — test structure not exact phrasing * perf: enable test.concurrent for 31 independent E2E tests Convert 18 skill-e2e, 11 routing, and 2 codex tests from sequential to test.concurrent. Only design-consultation tests (4) remain sequential due to shared designDir state. Expected ~6x speedup on Teams high-burst. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: add --concurrent flag to bun test + convert remaining 4 sequential tests bun's test.concurrent only works within a describe block, not across describe blocks. Adding --concurrent to the CLI command makes ALL tests concurrent regardless of describe boundaries. Also converted the 4 design-consultation tests to concurrent (each already independent). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * perf: split monolithic E2E test into 8 parallel files Split test/skill-e2e.test.ts (3442 lines) into 8 category files: - skill-e2e-browse.test.ts (7 tests) - skill-e2e-review.test.ts (7 tests) - skill-e2e-qa-bugs.test.ts (3 tests) - skill-e2e-qa-workflow.test.ts (4 tests) - skill-e2e-plan.test.ts (6 tests) - skill-e2e-design.test.ts (7 tests) - skill-e2e-workflow.test.ts (6 tests) - skill-e2e-deploy.test.ts (4 tests) Bun runs each file in its own worker = 10 parallel workers (8 split + routing + codex). Expected: 78 min → ~12 min. Extracted shared helpers to test/helpers/e2e-helpers.ts. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * perf: bump default E2E concurrency to 15 * perf: add model pinning infrastructure + rate-limit telemetry to E2E runner Default E2E model changed from Opus to Sonnet (5x faster, 5x cheaper). Session runner now accepts `model` option with EVALS_MODEL env var override. Added timing telemetry (first_response_ms, max_inter_turn_ms) and wall_clock_ms to eval-store for diagnosing rate-limit impact. Added EVALS_FAST test filtering. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: resolve 3 E2E test failures — tmpdir race, wasted turns, brittle assertions plan-design-review-plan-mode: give each test its own tmpdir to eliminate race condition where concurrent tests pollute each other's working directory. ship-local-workflow: inline ship workflow steps in prompt instead of having agent read 700+ line SKILL.md (was wasting 6 of 15 turns on file I/O). design-consultation-core: replace exact section name matching with fuzzy synonym-based matching (e.g. "Colors" matches "Color", "Type System" matches "Typography"). All 7 sections still required, LLM judge still hard fail. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * perf: pin quality tests to Opus, add --retry 2 and test:e2e:fast tier ~10 quality-sensitive tests (planted-bug detection, design quality judge, strategic review, retro analysis) explicitly pinned to Opus. ~30 structure tests default to Sonnet for 5x speed improvement. Added --retry 2 to all E2E scripts for flaky test resilience. Added test:e2e:fast script that excludes 8 slowest tests for quick feedback. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: mark E2E model pinning TODO as shipped Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: add SKILL.md merge conflict directive to CLAUDE.md When resolving merge conflicts on generated SKILL.md files, always merge the .tmpl templates first, then regenerate — never accept either side's generated output directly. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: add DEPLOY_BOOTSTRAP resolver to gen-skill-docs The land-and-deploy template referenced {{DEPLOY_BOOTSTRAP}} but no resolver existed, causing gen-skill-docs to fail. Added generateDeployBootstrap() that generates the deploy config detection bash block (check CLAUDE.md for persisted config, auto-detect platform from config files, detect deploy workflows). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: regenerate SKILL.md files after DEPLOY_BOOTSTRAP fix Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: move prompt temp file outside workingDirectory to prevent race condition The .prompt-tmp file was written inside workingDirectory, which gets deleted by afterAll cleanup. With --concurrent --retry, afterAll can interleave with retries, causing "No such file or directory" crashes at 0s (seen in review-design-lite and office-hours-spec-review). Fix: write prompt file to os.tmpdir() with a unique suffix so it survives directory cleanup. Also convert review-design-lite from describeE2E to describeIfSelected for proper diff-based test selection. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: add --retry 2 --concurrent flags to test:evals scripts for consistency test:evals and test:evals:all were missing the retry and concurrency flags that test:e2e already had, causing inconsistent behavior between the two script families. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: HMAKT99 <HMAKT99@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-21 14:31:36 -07:00
Garry Tan	f075cb757f	feat: Search Before Building — builder ethos + skill integrations (v0.9.5.0) (#298 ) * feat: ETHOS.md — gstack builder philosophy Standalone document capturing the four principles: The Golden Age, Boil the Lake, Search Before Building, and Build for Yourself. Introduces the three-layer knowledge framework (tried-and-true, new-and-popular, first-principles) and the Eureka Moment concept — when first-principles reasoning reveals conventional wisdom is wrong. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: Search Before Building preamble section + CLAUDE.md Add generateSearchBeforeBuildingSection(ctx) to gen-skill-docs.ts. Every workflow skill now gets a compact router section covering: - Three layers of knowledge (tried-and-true, new-and-popular, first-principles) - Eureka moment format and jq-based JSONL logging - WebSearch fallback clause - ETHOS.md reference via ctx.paths.skillRoot resolver Also adds compact "Search before building" section to CLAUDE.md. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: skill-specific Search Before Building integrations 8 template changes: - /office-hours: Phase 2.75 Landscape Awareness (WebSearch + three-layer synthesis) - /plan-eng-review: Step 0 search check with layer provenance annotations - /investigate: external pattern search + search escalation on hypothesis failure - /plan-ceo-review: Landscape Check before scope challenge - /review: search-before-recommending for fix patterns - /qa-only: WebSearch in allowed-tools - /design-consultation: three-layer synthesis backport in Phase 2 Step 3 - /retro: eureka moment tracking from ~/.gstack/analytics/eureka.jsonl All search steps include WebSearch fallback clause. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: v0.9.5.0 — Builder Ethos (CHANGELOG + VERSION + TODOS) ETHOS.md + Search Before Building across all workflow skills. Deferred: first-time intro flow (blocked on blog post). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: address Codex review — sanitize search, privacy gate, ETHOS.md sidecar Three fixes from adversarial Codex review: - /investigate: sanitize error messages before searching (strip hostnames, IPs, file paths, SQL, customer data). Skip search if unsanitizable. - /office-hours: add privacy gate before landscape search. Use generalized category terms, never the user's specific product name or stealth idea. - setup: link ETHOS.md into .agents/skills/gstack/ sidecar so workspace- local Codex sessions can find the builder philosophy. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: sanitize Phase 2 external pattern search in /investigate The Phase 2 external search also sent raw error messages to WebSearch. Apply same sanitization rule as Phase 3 search escalation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: sync documentation with shipped changes - ARCHITECTURE.md: preamble now handles 5 things (add Search Before Building) - CLAUDE.md: add ETHOS.md to project structure tree - README.md: add ETHOS.md to docs table Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-21 11:46:06 -07:00
Garry Tan	78c207efb4	feat: interactive /plan-design-review + CEO invokes designer + 100% coverage (v0.6.4) (#149 ) * refactor: rename qa-design-review → design-review The "qa-" prefix was confusing — this is the live-site design audit with fix loop, not a QA-only report. Rename directory and update all references across docs, tests, scripts, and skill templates. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: interactive /plan-design-review + CEO invokes designer Rewrite /plan-design-review from report-only grading to an interactive plan-fixer that rates each design dimension 0-10, explains what a 10 looks like, and edits the plan to get there. Parallel structure with /plan-ceo-review and /plan-eng-review — one issue = one AskUserQuestion. CEO review now detects UI scope and invokes the designer perspective when the plan has frontend/UX work, so you get design review automatically when it matters. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * test: validation + touchfile entries for 100% coverage Add design-consultation to command/snapshot flag validation. Add 4 skills to contributor mode validation (plan-design-review, design-review, design-consultation, document-release). Add 2 templates to hardcoded branch check. Register touchfile entries for 10 new LLM-judge tests and 1 new E2E test. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * test: LLM-judge for 10 skills + gstack-upgrade E2E Add LLM-judge quality evals for all uncovered skills using a DRY runWorkflowJudge helper with section marker guards. Add real E2E test for gstack-upgrade using mock git remote (replaces test.todo). Add plan-edit assertion to plan-design-review E2E. 14/15 skills now at full coverage. setup-browser-cookies remains deferred (needs real browser). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: add bisect commit style to CLAUDE.md All commits should be single logical changes, split before pushing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: bump version and changelog (v0.6.4.0) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-17 22:48:48 -05:00
Garry Tan	a2d756f945	feat: Test Bootstrap + Regression Tests + Coverage Audit (v0.6.0) (#136 ) * feat: test bootstrap, regression tests, coverage audit, retro test health - Add {{TEST_BOOTSTRAP}} resolver to gen-skill-docs.ts - Add Phase 8e.5 regression test generation to /qa and /qa-design-review - Add Step 3.4 test coverage audit with quality scoring to /ship - Add test health tracking to /retro - Add 2 E2E evals (bootstrap + coverage audit) - Add 26 validation tests - Update ARCHITECTURE.md placeholder table - Add 2 P3 TODOs (CI/CD non-GitHub, auto-upgrade weak tests) * chore: bump version and changelog (v0.6.0) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat: make coverage audit trace actual codepaths, not just syntax patterns Step 3.4 now instructs Claude to read full files, trace data flow through every branch, diagram the execution, and check each branch against tests. Phase 8e.5 regression tests now trace the bug's codepath before writing the test, catching adjacent edge cases. * feat: coverage audit now maps user flows, interactions, and error states Step 3.4 now covers the full picture: code branches AND user-facing behavior. Maps user flows (complete journey through the feature), interaction edge cases (double-click, back button, stale state, slow connection), error states (what does the user actually see?), and boundary states (zero results, 10k results, max-length input). Coverage diagram splits into Code Path Coverage and User Flow Coverage sections with separate percentages. * fix: raise test gen cap to 20, add validation tests for user flow coverage - Raise Step 3.4 test generation cap from 10 to 20 (code + user flow combined) - Add 3 validation tests: codepath tracing, user flow mapping, diagram sections --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-17 13:05:18 -05:00
Garry Tan	4a77cc2c34	feat: /plan-design-review + /qa-design-review skills (v0.5.0) (#102 ) * feat: add {{DESIGN_METHODOLOGY}} resolver and register design review skills Add generateDesignMethodology() to gen-skill-docs.ts with 10-category, 80-item design audit checklist. Register plan-design-review and qa-design-review templates in findTemplates(). Add both skills to skill-check.ts SKILL_FILES. Add command and snapshot flag validation tests for both skills in skill-validation.test.ts. * feat: add /plan-design-review and /qa-design-review skills /plan-design-review: report-only designer audit with letter grades, AI slop scoring, structured first impression, design system extraction, DESIGN.md inference and export offer. Never modifies code. /qa-design-review: same audit, then iterative fix loop with style(design): commits, CSS-safe WTF heuristic, before/after screenshots, final re-audit. * chore: bump version and changelog (v0.5.0) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * docs: update README, ARCHITECTURE for design review skills (v0.5.0) - Update skill count to 11, add /plan-design-review and /qa-design-review to skill table, install/uninstall commands, and demo walkthrough - Add narrative sections: "senior designer mode" and "designer who codes mode" with compelling examples showing AI Slop detection and design system inference - Add {{DESIGN_METHODOLOGY}} to ARCHITECTURE.md placeholder table - Extend demo to show full plan→eng→review→ship→qa→design-review pipeline Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * chore: regenerate design review SKILL.md files after merge from main Picks up BASE_BRANCH_DETECT resolver and updated contributor mode from main. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: add /design-consultation skill — design consultant that creates DESIGN.md 6-phase consultant flow: product context → competitive research (WebSearch) → complete coherent proposal → drill-downs on demand → font+color preview page → write DESIGN.md + update CLAUDE.md. Opinionated recommendations grounded in product context, not menu-driven forms. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * test: add E2E tests for design skill family (7 tests + LLM quality judge) Tests 1-4: /design-consultation (core flow, research integration, existing DESIGN.md handling, font+color preview generation). Tests 5-6: /plan-design-review (audit report, DESIGN.md export). Test 7: /qa-design-review (audit + fix loop). LLM judge validates font blacklist compliance, coherence, and AI slop avoidance. Also adds plan-design-review + qa-design-review to ALL_SKILLS test array. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: mark /design-consultation as shipped in TODOS.md Renamed from /setup-design-md to reflect the consultant approach. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-16 21:55:07 -05:00
Garry Tan	1e06b6a5c6	fix: dynamic base branch detection across all SKILL templates (v0.3.10) (#81 ) * feat: add {{BASE_BRANCH_DETECT}} resolver to gen-skill-docs DRY placeholder for dynamic base branch detection across PR-targeting skills. Detects via gh pr view (existing PR base) → gh repo view (repo default) → fallback to main. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: ship skill detects base branch instead of hardcoding main Replaces ~14 hardcoded 'main' references with dynamic detection via {{BASE_BRANCH_DETECT}}. Fixes stacked branches and Conductor workspaces targeting non-main branches. Adds --base <base> to gh pr create. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: review, qa, plan-ceo-review detect base branch dynamically Same pattern as ship: replaces hardcoded 'main' with {{BASE_BRANCH_DETECT}}. Also cleans up qa bash-isms (REPORT_DIR variable, port chaining). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: retro detects default branch instead of hardcoding origin/main Retro queries commit history (not PR targets), so uses simpler detection: gh repo view defaultBranchRef. Replaces ~11 origin/main refs with origin/<default>. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: add explicit cross-step references in gstack-upgrade template Bash blocks are self-contained, but cross-block variable references (INSTALL_DIR from Step 2) were implicit. Adds prose making them explicit. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs+test: SKILL authoring guidance + regression tests Adds "Writing SKILL templates" section to CLAUDE.md explaining that templates are prompts, not scripts. Adds validation test catching hardcoded 'main' in git commands, and resolver content test. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: update ARCHITECTURE + CONTRIBUTING for new placeholders Add {{BASE_BRANCH_DETECT}} to ARCHITECTURE.md placeholder list. Cross-reference CLAUDE.md template authoring guidance from CONTRIBUTING.md. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: bump version and changelog (v0.3.10) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: add missing blank line between resolver functions Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * test: add 3 E2E smoke tests for base branch detection - /review: verifies Step 0 detection + git diff against detected base - /ship: truncated dry-run (Steps 0-1 only, no push/PR), asserts no destructive actions - /retro: verifies default branch detection for git log queries Covers the {{BASE_BRANCH_DETECT}} resolver path (review), the ship template's dual abort check, and retro's inline detection pattern. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: bump version and changelog (v0.4.2) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-16 10:59:13 -05:00
Garry Tan	3e3843c4a9	feat: contributor mode, session awareness, recommendation format (#90 ) * feat: contributor mode, session awareness, universal RECOMMENDATION format - Rename {{UPDATE_CHECK}} → {{PREAMBLE}} across all 10 skill templates - Add session tracking (touch ~/.gstack/sessions/$PPID, count active sessions) - ELI16 mode when 3+ concurrent sessions detected (re-ground user on context) - Contributor mode: auto-file field reports to ~/.gstack/contributor-logs/ - Universal AskUserQuestion format: context → question → RECOMMENDATION → options - Update plan-ceo-review and plan-eng-review to reference preamble baseline - Add vendored symlink awareness section to CLAUDE.md - Rewrite CONTRIBUTING.md with contributor workflow and cross-project testing - Add tests for contributor mode and session awareness in generated output - Add E2E eval for contributor mode report filing Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat: add Enum & Value Completeness to /review critical checklist New CRITICAL review category that traces new enum values, status strings, and type constants through every consumer outside the diff. Catches the class of bugs where a new value is added but not handled in all switch/case chains, allowlists, or frontend-backend contracts. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * chore: bump v0.4.1, user-facing changelog, update qa-only template and architecture docs Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * docs: add CHANGELOG style guide — user-facing, sell the feature Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: rewrite v0.4.1 changelog to be user-facing and sell the features Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: add evals for RECOMMENDATION format, session awareness, and enum completeness Free tests (Tier 1): RECOMMENDATION format + session awareness in all preamble SKILL.md files, enum completeness checklist structure and CRITICAL classification. E2E eval: /review catches missed enum handlers when a new status value is added but not handled in case/switch and notify methods. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: add E2E eval for session awareness ELI16 mode Stubs _SESSIONS=4, gives agent a decision point on feature/add-payments branch, verifies the output re-grounds the user with project, branch, context, and RECOMMENDATION — the ELI16 mode behavior for 3+ sessions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: contributor mode eval marked FAIL due to expected browse error The test intentionally runs a nonexistent binary to trigger contributor mode. The session runner's browse error detection catches "no such file or directory...browse" and sets browseErrors, causing recordE2E to mark passed=false. Override passed to check only exitReason since the browse error is the expected scenario. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-16 01:45:50 -05:00
Garry Tan	f3ee0ee28a	feat: QA restructure, browser ref staleness, eval efficiency metrics (v0.4.0) (#83 ) * feat: browser ref staleness detection via async count() validation resolveRef() now checks element count to detect stale refs after page mutations (e.g. SPA navigation). RefEntry stores role+name metadata for better diagnostics. 3 new snapshot tests for staleness detection. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat: qa-only skill, qa fix loop, plan-to-QA artifact flow Add /qa-only (report-only, Edit tool blocked), restructure /qa with find-fix-verify cycle, add {{QA_METHODOLOGY}} DRY placeholder for shared methodology. /plan-eng-review now writes test-plan artifacts to ~/.gstack/projects/<slug>/ for QA consumption. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat: eval efficiency metrics — turns, duration, commentary across all surfaces Add generateCommentary() for natural-language delta interpretation, per-test turns/duration in comparison and summary output, judgePassed unit tests, 3 new E2E tests (qa-only, qa fix loop, plan artifact). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * chore: bump version and changelog (v0.4.0) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * docs: update ARCHITECTURE, BROWSER, CONTRIBUTING, README for v0.4.0 - ARCHITECTURE: add ref staleness detection section, update RefEntry type - BROWSER: add ref staleness paragraph to snapshot system docs - CONTRIBUTING: update eval tool descriptions with commentary feature - README: fix missing qa-only in project-local uninstall command Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * docs: add user-facing benefit descriptions to v0.4.0 changelog Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-15 23:55:39 -05:00
Garry Tan	43fbe165a4	docs: update README, CONTRIBUTING, ARCHITECTURE for v0.3.6 Update test tier costs and commands (Agent SDK → claude -p, SKILL_E2E → EVALS), add E2E observability section to CONTRIBUTING and ARCHITECTURE, add testing quick-start to README. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-14 12:47:00 -05:00
Garry Tan	5205070299	feat: SKILL.md template system, 3-tier testing, DX tools (v0.3.3) (#41 ) * refactor: extract command registry to commands.ts, add SNAPSHOT_FLAGS metadata - NEW: browse/src/commands.ts — command sets + COMMAND_DESCRIPTIONS + load-time validation (zero side effects) - server.ts imports from commands.ts instead of declaring sets inline - snapshot.ts: SNAPSHOT_FLAGS array drives parseSnapshotArgs (metadata-driven, no duplication) - All 186 existing tests pass * feat: SKILL.md template system with auto-generated command references - SKILL.md.tmpl + browse/SKILL.md.tmpl with {{COMMAND_REFERENCE}} and {{SNAPSHOT_FLAGS}} placeholders - scripts/gen-skill-docs.ts generates SKILL.md from templates (supports --dry-run) - Build pipeline runs gen:skill-docs before binary compilation - Generated files have AUTO-GENERATED header, committed to git * test: Tier 1 static validation — 34 tests for SKILL.md command correctness - test/helpers/skill-parser.ts: extracts $B commands from code blocks, validates against registry - test/skill-parser.test.ts: 13 parser/validator unit tests - test/skill-validation.test.ts: 13 tests validating all SKILL.md files + registry consistency - test/gen-skill-docs.test.ts: 8 generator tests (categories, sorting, freshness) * feat: DX tools (skill:check, dev:skill) + Tier 2 E2E test scaffolding - scripts/skill-check.ts: health summary for all SKILL.md files (commands, templates, freshness) - scripts/dev-skill.ts: watch mode for template development - test/helpers/session-runner.ts: Agent SDK wrapper for E2E skill tests - test/skill-e2e.test.ts: 2 E2E tests + 3 stubs (auto-skip inside Claude Code sessions) - E2E tests must run from plain terminal: SKILL_E2E=1 bun test test/skill-e2e.test.ts * ci: SKILL.md freshness check on push/PR + TODO updates - .github/workflows/skill-docs.yml: fails if generated SKILL.md files are stale - TODO.md: add E2E cost tracking and model pinning to future ideas * fix: restore rich descriptions lost in auto-generation - Snapshot flags: add back value hints (-d <N>, -s <sel>, -o <path>) - Snapshot flags: restore parenthetical context (@e refs, @c refs, etc.) - Commands: is → includes valid states enum - Commands: console → notes --errors filter behavior - Commands: press → lists common keys (Enter, Tab, Escape) - Commands: cookie-import-browser → describes picker UI - Commands: dialog-accept → specifies alert/confirm/prompt - Tips: restore → arrow (was downgraded to ->) * test: quality evals for generated SKILL.md descriptions Catches the exact regressions we shipped and caught in review: - Snapshot flags must include value hints (-d <N>, -s <sel>, -o <path>) - is command must list all valid states (visible/hidden/enabled/...) - press command must list example keys (Enter, Tab, Escape) - console command must describe --errors behavior - Snapshot -i must mention @e refs, -C must mention @c refs - All descriptions must be >= 8 chars (no empty stubs) - Tips section must use → not -> * feat: LLM-as-judge evals for SKILL.md documentation quality 4 eval tests using Anthropic API (claude-haiku, ~$0.01-0.03/run): - Command reference table: clarity/completeness/actionability >= 4/5 - Snapshot flags section: same thresholds - browse/SKILL.md overall quality - Regression: generated version must score >= hand-maintained baseline Requires ANTHROPIC_API_KEY. Auto-skips without it. Run: bun run test:eval (or ANTHROPIC_API_KEY=sk-... bun test test/skill-llm-eval.test.ts) * chore: bump version to 0.3.3, update changelog Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * docs: add ARCHITECTURE.md, update CLAUDE.md and CONTRIBUTING.md Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat: conductor.json lifecycle hooks + .env propagation across worktrees bin/dev-setup now copies .env from main worktree so API keys carry over to Conductor workspaces automatically. conductor.json wires up setup and archive hooks. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * docs: complete CHANGELOG for v0.3.3 (architecture, conductor, .env) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-13 21:08:12 -07:00

18 Commits