mirror of
https://github.com/garrytan/gstack.git
synced 2026-05-01 19:25:10 +02:00
9dbaf906cffa689fe0d93b1f6f5ce233a50bf754
18 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
54d4cde773 |
security: tunnel dual-listener + SSRF + envelope + path wave (v1.6.0.0) (#1137)
* refactor(security): loosen /connect rate limit from 3/min to 300/min
Setup keys are 24 random bytes (unbruteforceable), so a tight rate limit
does not meaningfully prevent key guessing. It exists only to cap
bandwidth, CPU, and log-flood damage from someone who discovered the
ngrok URL. A legitimate pair-agent session hits /connect once; 300/min
is 60x that pattern and never hit accidentally.
3/min caused pairing to fail on any retry flow (network blip, second
paired client) with no upside. Per-IP tracking was considered and
rejected — adds a bounded Map + LRU for defense already adequate at the
global layer.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(security): add tunnel-denial-log module for attack visibility
Append-only log of tunnel-surface auth denials to
~/.gstack/security/attempts.jsonl. Gives operators visibility into who
is probing tunneled daemons so the next security wave can be driven by
real attack data instead of speculation.
Design notes:
- Async via fs.promises.appendFile. Never appendFileSync — blocking the
event loop on every denial during a flood is what an attacker wants
(prior learning: sync-audit-log-io, 10/10 confidence).
- In-process rate cap at 60 writes/minute globally. Excess denials are
counted in memory but not written to disk — prevents disk DoS.
- Writes to the same ~/.gstack/security/attempts.jsonl used by the
prompt-injection attempt log. File rotation is handled by the existing
security pipeline (10MB, 5 generations).
No consumers in this commit; wired up in the dual-listener refactor that
follows.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(security): dual-listener tunnel architecture
The /health endpoint leaked AUTH_TOKEN to any caller that hit the ngrok
URL (spoofing chrome-extension:// origin, or catching headed mode).
Surfaced by @garagon in PR #1026; the original fix was header-inference
on the single port. Codex's outside-voice review during /plan-ceo-review
called that approach brittle (ngrok header behavior could change, local
proxies would false-positive), and pushed for the structural fix.
This is that fix. Stop making /health a root-token bootstrap endpoint on
any surface the tunnel can reach. The server now binds two HTTP
listeners when a tunnel is active. The local listener (extension, CLI,
sidebar) stays on 127.0.0.1 and is never exposed to ngrok. ngrok
forwards only to the tunnel listener, which serves only /connect
(unauth, rate-limited) and /command with a locked allowlist of
browser-driving commands. Security property comes from physical port
separation, not from header inference — a tunnel caller cannot reach
/health or /cookie-picker or /inspector because they live on a
different TCP socket.
What this commit adds to browse/src/server.ts:
* Surface type ('local' | 'tunnel') and TUNNEL_PATHS +
TUNNEL_COMMANDS allowlists near the top of the file.
* makeFetchHandler(surface) factory replacing the single fetch arrow;
closure-captures the surface so the filter that runs before route
dispatch knows which socket accepted the request.
* Tunnel filter at dispatch entry: 404s anything not on TUNNEL_PATHS,
403s root-token bearers with a clear pairing hint, 401s non-/connect
requests that lack a scoped token. Every denial is logged via
logTunnelDenial (from tunnel-denial-log).
* GET /connect alive probe (unauth on both surfaces) so /pair and
/tunnel/start can detect dead ngrok tunnels without reaching
/health — /health is no longer tunnel-reachable.
* Lazy tunnel listener lifecycle. /tunnel/start binds a dedicated
Bun.serve on an ephemeral port, points ngrok.forward at THAT port
(not the local port), hard-fails on bind error (no local fallback),
tears down cleanly on ngrok failure. BROWSE_TUNNEL=1 startup uses
the same pattern.
* closeTunnel() helper — single teardown path for both the ngrok
listener and the tunnel Bun.serve listener.
* resolveNgrokAuthtoken() helper — shared authtoken lookup across
/tunnel/start and BROWSE_TUNNEL=1 startup (was duplicated).
* TUNNEL_COMMANDS check in /command dispatch: on the tunnel surface,
commands outside the allowlist return 403 with a list of allowed
commands as a hint.
* Probe paths in /pair and /tunnel/start migrated from /health to
GET /connect — the only unauth path reachable on the tunnel surface
under the new architecture.
Test updates in browse/test/server-auth.test.ts:
* /pair liveness-verify test: assert via closeTunnel() helper instead
of the inline `tunnelActive = false; tunnelUrl = null` lines that
the helper subsumes.
* /tunnel/start cached-tunnel test: same closeTunnel() adaptation.
Credit
Derived from PR #1026 by @garagon — thanks for flagging the critical
bug that drove the architectural rewrite. The per-request
isTunneledRequest approach from #1026 is superseded by physical port
separation here; the underlying report remains the root cause for the
entire v1.6.0.0 wave.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(security): add source-level guards for dual-listener architecture
23 source-level assertions that keep future contributors from silently
widening the tunnel surface during a routine refactor. Covers:
* Surface type + tunnelServer state variable shape
* TUNNEL_PATHS is a closed set of /connect, /command, /sidebar-chat
(and NOT /health, /welcome, /cookie-picker, /inspector/*, /pair,
/token, /refs, /activity/stream, /tunnel/{start,stop})
* TUNNEL_COMMANDS includes browser-driving ops only (and NOT
launch-browser, tunnel-start, token-mint, cookie-import, etc.)
* makeFetchHandler(surface) factory exists and is wired to both
listeners with the correct surface parameter
* Tunnel filter runs BEFORE any route dispatch, with 404/403/401
responses and logged denials for each reason
* GET /connect returns {alive: true} unauth
* /command dispatch enforces TUNNEL_COMMANDS on tunnel surface
* closeTunnel() helper tears down ngrok + Bun.serve listener
* /tunnel/start binds on ephemeral port, points ngrok at TUNNEL_PORT
(not local port), hard-fails on bind error (no fallback), probes
cached tunnel via GET /connect (not /health), tears down on
ngrok.forward failure
* BROWSE_TUNNEL=1 startup uses the dual-listener pattern
* logTunnelDenial wired for all three denial reasons
* /connect rate limit is 300/min, not 3/min
All 23 tests pass. Behavioral integration tests (spawn subprocess, real
network) live in the E2E suite that lands later in this wave.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* security: gate download + scrape through validateNavigationUrl (SSRF)
The `goto` command was correctly wired through validateNavigationUrl,
but `download` and `scrape` called page.request.fetch(url, ...) directly.
A caller with the default write scope could hit the /command endpoint
and ask the daemon to fetch http://169.254.169.254/latest/meta-data/
(AWS IMDSv1) or the GCP/Azure/internal equivalents. The response body
comes back as base64 or lands on disk where GET /file serves it.
Fix: call validateNavigationUrl(url) immediately before each
page.request.fetch() call site in download and in the scrape loop.
Same blocklist that already protects `goto`: file://, javascript:,
data:, chrome://, cloud metadata (IPv4 all encodings, IPv6 ULA,
metadata.*.internal).
Tests: extend browse/test/url-validation.test.ts with a source-level
guard that walks every `await page.request.fetch(` call site and
asserts a validateNavigationUrl call precedes it within the same
branch. Regression trips before code review if a future refactor
drops the gate.
* security: route splitForScoped through envelope sentinel escape
The scoped-token snapshot path in snapshot.ts built its untrusted
block by pushing the raw accessibility-tree lines between the literal
`═══ BEGIN UNTRUSTED WEB CONTENT ═══` / `═══ END UNTRUSTED WEB CONTENT ═══`
sentinels. The full-page wrap path in content-security.ts already
applied a zero-width-space escape on those exact strings to prevent
sentinel injection, but the scoped path skipped it.
Net effect: a page whose rendered text contains the literal sentinel
can close the envelope early from inside untrusted content and forge
a fake "trusted" block for the LLM. That includes fabricating
interactive `@eN` references the agent will act on.
Fix:
* Extract the zero-width-space escape into a named, exported helper
`escapeEnvelopeSentinels(content)` in content-security.ts.
* Have `wrapUntrustedPageContent` call it (behavior unchanged on
that path — same bytes out).
* Import the helper in snapshot.ts and map it over `untrustedLines`
in the `splitForScoped` branch before pushing the BEGIN sentinel.
Tests: add a describe block in content-security.test.ts that covers
* `escapeEnvelopeSentinels` defuses BEGIN and END markers;
* `escapeEnvelopeSentinels` leaves normal text untouched;
* `wrapUntrustedPageContent` still emits exactly one real envelope
pair when hostile content contains forged sentinels;
* snapshot.ts imports the helper;
* the scoped-snapshot branch calls `escapeEnvelopeSentinels` before
pushing the BEGIN sentinel (source-level regression — if a future
refactor reorders this, the test trips).
* security: extend hidden-element detection to all DOM-reading channels
The Confusion Protocol envelope wrap (`wrapUntrustedPageContent`)
covers every scoped PAGE_CONTENT_COMMAND, but the hidden-element
ARIA-injection detection layer only ran for `text`. Other DOM-reading
channels (html, links, forms, accessibility, attrs, data, media,
ux-audit) returned their output through the envelope with no hidden-
content filter, so a page serving a display:none div that instructs
the agent to disregard prior system messages, or an aria-label that
claims to put the LLM in admin mode, leaked the injection payload on
any non-text channel. The envelope alone does not mitigate this, and
the page itself never rendered the hostile content to the human
operator.
Fix:
* New export `DOM_CONTENT_COMMANDS` in commands.ts — the subset of
PAGE_CONTENT_COMMANDS that derives its output from the live DOM.
Console and dialog stay out; they read separate runtime state.
* server.ts runs `markHiddenElements` + `cleanupHiddenMarkers` for
every scoped command in this set. `text` keeps its existing
`getCleanTextWithStripping` path (hidden elements physically
stripped before the read). All other channels keep their output
format but emit flagged elements as CONTENT WARNINGS on the
envelope, so the LLM sees what it would otherwise have consumed
silently.
* Hidden-element descriptions merge into `combinedWarnings`
alongside content-filter warnings before the wrap call.
Tests: new describe block in content-security.test.ts covering
* `DOM_CONTENT_COMMANDS` export shape and channel membership;
* dispatch gates on `DOM_CONTENT_COMMANDS.has(command)`, not the
literal `text` string;
* hiddenContentWarnings plumbs into `combinedWarnings` and reaches
wrapUntrustedPageContent;
* DOM_CONTENT_COMMANDS is a strict subset of PAGE_CONTENT_COMMANDS.
Existing datamarking, envelope wrap, centralized-wrapping, and chain
security suites stay green (52 pass, 0 fail).
* security: validate --from-file payload paths for parity with direct paths
The direct `load-html <file>` path runs every caller-supplied file path
through validateReadPath() so reads stay confined to SAFE_DIRECTORIES
(cwd, TEMP_DIR). The `load-html --from-file <payload.json>` shortcut
and its sibling `pdf --from-file <payload.json>` skipped that check and
went straight to fs.readFileSync(). An MCP caller that picks the
payload path (or any caller whose payload argument is reachable from
attacker-influenced text) could use --from-file as a read-anywhere
escape hatch for the safe-dirs policy.
Fix: call validateReadPath(path.resolve(payloadPath)) before readFileSync
at both sites. Error surface mirrors the direct-path branch so ops and
agent errors stay consistent.
Test coverage in browse/test/from-file-path-validation.test.ts:
- source-level: validateReadPath precedes readFileSync in the load-html
--from-file branch (write-commands.ts) and the pdf --from-file parser
(meta-commands.ts)
- error-message parity: both sites reference SAFE_DIRECTORIES
Related security audit pattern: R3 F002 (validateNavigationUrl gap on
download/scrape) and R3 F008 (markHiddenElements gap on 10 DOM commands)
were the same shape — a defense that existed on the primary code path
but not its shortcut sibling. This PR closes the same class of gap on
the --from-file shortcuts.
* fix(design): escape url.origin when injecting into served HTML
serve.ts injected url.origin into a single-quoted JS string in
the response body. A local request with a crafted Host header
(e.g. Host: "evil'-alert(1)-'x") would break out of the string
and execute JS in the 127.0.0.1:<port> origin opened by the
design board. Low severity — bound to localhost, requires a
local attacker — but no reason not to escape.
Fix: JSON.stringify(url.origin) produces a properly quoted,
escaped JS string literal in one call.
Also includes Prettier reformatting (single→double quotes,
trailing commas, line wrapping) applied by the repo's
PostToolUse formatter hook. Security change is the one line
in the HTML injection; everything else is whitespace/style.
* fix(scripts): drop shell:true from slop-diff npx invocations
spawnSync('npx', [...], { shell: true }) invokes /bin/sh -c
with the args concatenated, subjecting them to shell parsing
(word splitting, glob expansion, metacharacter interpretation).
No user input reaches these calls today, so not exploitable —
but the posture is wrong: npx + shell args should be direct.
Fix: scope shell:true to process.platform === 'win32' where
npx is actually a .cmd requiring the shell. POSIX runs the
npx binary directly with array-form args.
Also includes Prettier reformatting (single→double quotes,
trailing commas, line wrapping) applied by the repo's
PostToolUse formatter hook. Security-relevant change is just
the two shell:true -> shell: process.platform === 'win32'
lines; everything else is whitespace/style.
* security(E3): gate GSTACK_SLUG on /welcome path traversal
The /welcome handler interpolates GSTACK_SLUG directly into the filesystem
path used to locate the project-local welcome page. Without validation, a
slug like "../../etc/passwd" would resolve to
~/.gstack/projects/../../etc/passwd/designs/welcome-page-20260331/finalized.html
— classic path traversal.
Not exploitable today: GSTACK_SLUG is set by the gstack CLI at daemon launch,
and an attacker would already need local env-var access to poison it. But
the gate is one regex (^[a-z0-9_-]+$), and a defense-in-depth pass costs us
nothing when the cost of being wrong is arbitrary file read via /welcome.
Fall back to the safe 'unknown' literal when the slug fails validation —
same fallback the code already uses when GSTACK_SLUG is unset. No behavior
change for legitimate slugs (they all match the regex).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* security(N1): replace ?token= SSE auth with HttpOnly session cookie
Activity stream and inspector events SSE endpoints accepted the root
AUTH_TOKEN via `?token=` query param (EventSource can't send Authorization
headers). URLs leak to browser history, referer headers, server logs,
crash reports, and refactoring accidents. Codex flagged this during the
/plan-ceo-review outside voice pass.
New auth model: the extension calls POST /sse-session with a Bearer token
and receives a view-only session cookie (HttpOnly, SameSite=Strict, 30-min
TTL). EventSource is opened with `withCredentials: true` so the browser
sends the cookie back on the SSE connection. The ?token= query param is
GONE — no more URL-borne secrets.
Scope isolation (prior learning cookie-picker-auth-isolation, 10/10
confidence): the SSE session cookie grants access to /activity/stream and
/inspector/events ONLY. The token is never valid against /command, /token,
or any mutating endpoint. A leaked cookie can watch activity; it cannot
execute browser commands.
Components
* browse/src/sse-session-cookie.ts — registry: mint/validate/extract/
build-cookie. 256-bit tokens, 30-min TTL, lazy expiry pruning,
no imports from token-registry (scope isolation enforced by module
boundary).
* browse/src/server.ts — POST /sse-session mint endpoint (requires
Bearer). /activity/stream and /inspector/events now accept Bearer
OR the session cookie, and reject ?token= query param.
* extension/sidepanel.js — ensureSseSessionCookie() bootstrap call,
EventSource opened with withCredentials:true on both SSE endpoints.
Tested via the source guards; behavioral test is the E2E pairing
flow that lands later in the wave.
* browse/test/sse-session-cookie.test.ts — 20 unit tests covering
mint entropy, TTL enforcement, cookie flag invariants, cookie
parsing from multi-cookie headers, and scope-isolation contract
guard (module must not import token-registry).
* browse/test/server-auth.test.ts — existing /activity/stream auth
test updated to assert the new cookie-based gate and the absence
of the ?token= query param.
Cookie flag choices:
* HttpOnly: token not readable from page JS (mitigates XSS
exfiltration).
* SameSite=Strict: cookie not sent on cross-site requests (mitigates
CSRF). Fine for SSE because the extension connects to 127.0.0.1
directly.
* Path=/: cookie scoped to the whole origin.
* Max-Age=1800: 30 minutes, matches TTL. Extension re-mints on
reconnect when daemon restarts.
* Secure NOT set: daemon binds to 127.0.0.1 over plain HTTP. Adding
Secure would block the browser from ever sending the cookie back.
Add Secure when gstack ships over HTTPS.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* security(N2): document Windows v20 ABE elevation path on CDP port
The existing comment around the cookie-import-browser --remote-debugging-port
launch claimed "threat model: no worse than baseline." That's wrong on
Windows with App-Bound Encryption v20. A same-user local process that
opens the cookie SQLite DB directly CANNOT decrypt v20 values (DPAPI
context is bound to the browser process). The CDP port lets them bypass
that: connect to the debug port, call Network.getAllCookies inside Chrome,
walk away with decrypted v20 cookies.
The correct fix is to switch from TCP --remote-debugging-port to
--remote-debugging-pipe so the CDP transport is a stdio pipe, not a
socket. That requires restructuring the CDP WebSocket client in this
module and Playwright doesn't expose the pipe transport out of the box.
Non-trivial, deferred from the v1.6.0.0 wave.
This commit updates the comment to correctly describe the threat and
points at the tracking issue. No code change to the launch itself.
Follow-up: #1136.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(E2): document dual-listener tunnel architecture in ARCHITECTURE.md
Adds an explicit per-endpoint disposition table to the Security model
section, covering the v1.6.0.0 dual-listener refactor. Every HTTP
endpoint now has a documented local-vs-tunnel answer. Future audits
(and future contributors wondering "is it safe to add X to the tunnel
surface?") can read this instead of reverse-engineering server.ts.
Also documents:
* Why physical port separation beats per-request header inference
(ngrok behavior drift, local proxies can forge headers, etc.)
* Tunnel surface denial logging → ~/.gstack/security/attempts.jsonl
* SSE session cookie model (gstack_sse, 30-min TTL, stream-scope only,
module-boundary-enforced scope isolation)
* N2 non-goal for Windows v20 ABE via CDP port (tracking #1136)
No code changes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(E1): end-to-end pair-agent flow against a spawned daemon
Spawns the browse daemon as a subprocess with BROWSE_HEADLESS_SKIP=1 so
the HTTP layer runs without a real browser. Exercises:
* GET /health — token delivery for chrome-extension origin, withheld
otherwise (the F1 + PR #1026 invariant)
* GET /connect — alive probe returns {alive:true} unauth
* POST /pair — root Bearer required (403 without), returns setup_key
* POST /connect — setup_key exchange mints a distinct scoped token
* POST /command — 401 without auth
* POST /sse-session — Bearer required, Set-Cookie has HttpOnly +
SameSite=Strict (the N1 invariant)
* GET /activity/stream — 401 without auth
* GET /activity/stream?token= — 401 (the old ?token= query param is
REJECTED, which is the whole point of N1)
* GET /welcome — serves HTML, does not leak /etc/passwd content under
the default 'unknown' slug (E3 regex gate)
12 behavioral tests, ~220ms end-to-end, no network dependencies, no
ngrok, no real browser. This is the receipt for the wave's central
'pair-agent still works + the security boundary holds' claim.
Tunnel-port binding (/tunnel/start) is deliberately NOT exercised here
— it requires an ngrok authtoken and live network. The dual-listener
route allowlist is covered by source-level guards in
dual-listener.test.ts; behavioral tunnel testing belongs in a separate
paid-evals harness.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* release(v1.6.0.0): bump VERSION + CHANGELOG for security wave
Architectural bump, not patch: dual-listener HTTP refactor changes the
daemon's tunnel-exposure model. See CHANGELOG for the full release
summary (~950 words) covering the five root causes this wave closes:
1. /health token leak over ngrok (F1 + E3 + test infra)
2. /cookie-picker + /inspector exposed over the tunnel (F1)
3. ?token=<ROOT> in SSE URLs leaking to logs/referer/history (N1)
4. /welcome GSTACK_SLUG path traversal (E3)
5. Windows v20 ABE elevation via CDP port (N2 — documented non-goal,
tracked as #1136)
Plus the base PRs: SSRF gate (#1029), envelope sentinel escape (#1031),
DOM-channel hidden-element coverage (#1032), --from-file path validation
(#1103), and 2 commits from #1073 (@theqazi).
VERSION + package.json bumped to 1.6.0.0. CHANGELOG entry covers
credits (@garagon, @Hybirdss, @HMAKT99, @theqazi), review lineage (CEO
→ Codex outside voice → Eng), and the non-goal tracking issue.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix: pre-landing review findings (4 auto-fixes)
Addresses 4 findings from the Claude adversarial subagent on the
v1.6.0.0 security wave diff. No user-visible behavior change; all
are defense-in-depth hardening of newly-introduced code.
1. GET /connect rate-limited (was POST-only) [HIGH conf 8/10]
Attacker discovering the ngrok URL could probe unlimited GETs for
daemon enumeration. Now shares the global /connect counter.
2. ngrok listener leak on tunnel startup failure [MEDIUM conf 8/10]
If ngrok.forward() resolved but tunnelListener.url() or the
state-file write threw, the Bun listener was torn down but the
ngrok session was leaked. Fixed in BOTH /tunnel/start and
BROWSE_TUNNEL=1 startup paths.
3. GSTACK_SKILL_ROOT path-traversal gate [MEDIUM conf 8/10]
Symmetric with E3's GSTACK_SLUG regex gate — reject values
containing '..' before interpolating into the welcome-page path.
4. SSE session registry pruning [LOW conf 7/10]
pruneExpired() only checked 10 entries per mint call. Now runs
on every validate too, checks 20 entries, with a hard 10k cap as
backstop. Prevents registry growth under sustained extension
reconnect pressure.
Tests remain green (56/56 in sse-session-cookie + dual-listener +
pair-agent-e2e suites).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs: update project documentation for v1.6.0.0
Reflect the dual-listener tunnel architecture, SSE session cookies,
SSRF guards, and Windows v20 ABE non-goal across the three docs
users actually read for remote-agent and browser auth context:
- docs/REMOTE_BROWSER_ACCESS.md: rewrote Architecture diagram for
dual listeners, fixed /connect rate limit (3/min → 300/min),
removed stale "/health requires no auth" (now 404 on tunnel),
added SSE cookie auth, expanded Security Model with tunnel
allowlist, SSRF guards, /welcome path traversal defense, and
the Windows v20 ABE tracking note.
- BROWSER.md: added dual-listener paragraph to Authentication and
linked to ARCHITECTURE.md endpoint table. Replaced the stale
?token= SSE auth note with the HttpOnly gstack_sse cookie flow.
- CLAUDE.md: added Transport-layer security section above the
sidebar prompt-injection stack so contributors editing server.ts,
sse-session-cookie.ts, or tunnel-denial-log.ts see the load-bearing
module boundaries before touching them.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(make-pdf): write --from-file payload to /tmp, not os.tmpdir()
make-pdf's browseClient wrote its --from-file payload to os.tmpdir(),
which is /var/folders/... on macOS. v1.6.0.0's PR #1103 cherry-pick
tightened browse load-html --from-file to validate against the
safe-dirs allowlist ([TEMP_DIR, cwd] where TEMP_DIR is '/tmp' on
macOS/Linux, os.tmpdir() on Windows). This closed a CLI/API parity
gap but broke make-pdf on macOS because /var/folders/... is outside
the allowlist.
Fix: mirror browse's TEMP_DIR convention — use '/tmp' on non-Windows,
os.tmpdir() on Windows. The make-pdf-gate CI failure on macOS-latest
(run 72440797490) is caused by exactly this: the payload file was
rejected by validateReadPath.
Verified locally: the combined-gate e2e test now passes after
rebuilding make-pdf/dist/pdf.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(sidebar): killAgent resets per-tab state; align tests with current agent event format
Two pre-existing bugs surfaced while running the full e2e suite on the
sec-wave branch. Both pre-date v1.6.0.0 (same failures on main at
|
||
|
|
97584f9a59 |
feat(security): ML prompt injection defense for sidebar (v1.4.0.0) (#1089)
* chore(deps): add @huggingface/transformers for prompt injection classifier Dependency needed for the ML prompt injection defense layer coming in the follow-up commits. @huggingface/transformers will host the TestSavantAI BERT-small classifier that scans tool outputs for indirect prompt injection. Note: this dep only runs in non-compiled bun contexts (sidebar-agent.ts). The compiled browse binary cannot load it because transformers.js v4 requires onnxruntime-node (native module, fails to dlopen from bun compile's temp extract dir). See docs/designs/ML_PROMPT_INJECTION_KILLER.md for the full architectural decision. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(security): add security.ts foundation for prompt injection defense Establishes the module structure for the L5 canary and L6 verdict aggregation layers. Pure-string operations only — safe to import from the compiled browse binary. Includes: * THRESHOLDS constants (BLOCK 0.85 / WARN 0.60 / LOG_ONLY 0.40), calibrated against BrowseSafe-Bench smoke + developer content benign corpus. * combineVerdict() implementing the ensemble rule: BLOCK only when the ML content classifier AND the transcript classifier both score >= WARN. Single-layer high confidence degrades to WARN to prevent any one classifier's false-positives from killing sessions (Stack Overflow instruction-writing-style FPs at 0.99 on TestSavantAI alone). * generateCanary / injectCanary / checkCanaryInStructure — session-scoped secret token, recursively scans tool arguments, URLs, file writes, and nested objects per the plan's all-channel coverage decision. * logAttempt with 10MB rotation (keeps 5 generations). Salted SHA-256 hash, per-device salt at ~/.gstack/security/device-salt (0600). * Cross-process session state at ~/.gstack/security/session-state.json (atomic temp+rename). Required because server.ts (compiled) and sidebar-agent.ts (non-compiled) are separate processes. * getStatus() for shield icon rendering via /health. ML classifier code will live in a separate module (security-classifier.ts) loaded only by sidebar-agent.ts — compiled browse binary cannot load the native ONNX runtime. Plan: ~/.gstack/projects/garrytan-gstack/ceo-plans/2026-04-19-prompt-injection-guard.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(security): wire canary injection into sidebar spawnClaude Every sidebar message now gets a fresh CANARY-XXXXXXXXXXXX token embedded in the system prompt with an instruction for Claude to never output it on any channel. The token flows through the queue entry so sidebar-agent.ts can check every outbound operation for leaks. If Claude echoes the canary into any outbound channel (text stream, tool arguments, URLs, file write paths), the sidebar-agent terminates the session and the user sees the approved canary leak banner. This operation is pure string manipulation — safe in the compiled browse binary. The actual output-stream check (which also has to be safe in compiled contexts) lives in sidebar-agent.ts (next commit). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(security): make sidebar-agent destructure check regex-tolerant The test asserted the exact string `const { prompt, args, stateFile, cwd, tabId } = queueEntry` which breaks whenever security or other extensions add fields (canary, pageUrl, etc.). Switch to a regex that requires the core fields in order but tolerates additional fields in between. Preserves the test's intent (args come from the queue entry, not rebuilt) while allowing the destructure to grow. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(security): canary leak check across all outbound channels The sidebar-agent now scans every Claude stream event for the session's canary token before relaying any data to the sidepanel. Channels covered (per CEO review cross-model tension #2): * Assistant text blocks * Assistant text_delta streaming * tool_use arguments (recursively, via checkCanaryInStructure — catches URLs, commands, file paths nested at any depth) * tool_use content_block_start * tool_input_delta partial JSON * Final result payload If the canary leaks on any channel, onCanaryLeaked() fires once per session: 1. logAttempt() writes the event to ~/.gstack/security/attempts.jsonl with the canary's salted hash (never the payload content). 2. sends a `security_event` to the sidepanel so it can render the approved canary-leak banner (variant A mockup — ceo-plan 2026-04-19). 3. sends an `agent_error` for backward-compat with existing error surfaces. 4. SIGTERM's the claude subprocess (SIGKILL after 2s if still alive). The leaked content itself is never relayed to the sidepanel — the event is dropped at the boundary. Canary detection is pure-string substring match, so this all runs safely in the sidebar-agent (non-compiled bun) context. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(security): add security-classifier.ts with TestSavantAI + Haiku This module holds the ML classifier code that the compiled browse binary cannot link (onnxruntime-node native dylib doesn't load from Bun compile's temp extract dir — see CEO plan §"Pre-Impl Gate 1 Outcome"). It's imported ONLY by sidebar-agent.ts, which runs as a non-compiled bun script. Two layers: L4 testsavant_content — TestSavantAI BERT-small ONNX classifier. First call triggers a one-time 112MB model download to ~/.gstack/models/testsavant-small/ (files staged into the onnx/ layout transformers.js v4 expects). Classifies page snapshots and tool outputs for indirect prompt injection + jailbreak attempts. On benign-corpus dry-run: Wikipedia/HN/Reddit/tech-blog all score SAFE 0.98+, attack text scores INJECTION 0.99+, Stack Overflow instruction-writing now scores SAFE 0.98 on the shorter form (was 0.99 INJECTION on the longer form — instruction-density threshold). Ensemble combiner downgrades single-layer high to WARN to cover this case. L4b transcript_classifier — Claude Haiku reasoning-blind pre-tool-call scan. Sees only {user_message, last 3 tool_calls}, never Claude's chain-of-thought or tool results (those are how self-persuasion attacks leak). 2000ms hard timeout. Fail-open on any subprocess failure so sidebar stays functional. Gated by shouldRunTranscriptCheck() — only runs when another layer already fired at >= LOG_ONLY, saving ~70% of Haiku spend. Both layers degrade gracefully: load/spawn failures set status to 'degraded' and return confidence=0. Shield icon reflects this via getClassifierStatus() which security.ts's getStatus() composes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(security): wire TestSavantAI + ensemble into sidebar-agent pre-spawn scan The sidebar-agent now runs a ML security check on the user message BEFORE spawning claude. If the content classifier and (gated) transcript classifier ensemble returns BLOCK, the session is refused with a security_event + agent_error — the sidepanel renders the approved banner. Two pieces: 1. On agent startup, loadTestsavant() warms the classifier in the background. First run triggers a 112MB model download from HuggingFace (~30s on average broadband). Non-blocking — sidebar stays functional during cold-start, shield just reports 'off' until warmed. 2. preSpawnSecurityCheck() runs the ensemble against the user message: - L4 (testsavant_content) always runs - L4b (transcript_classifier via Haiku) runs only if L4 flagged at >= LOG_ONLY — plan §E1 gating optimization, saves ~70% of Haiku spend combineVerdict() applies the BLOCK-requires-both-layers rule, which downgrades any single-layer high confidence to WARN. Stack Overflow-style instruction-heavy writing false-positives on TestSavantAI alone are caught by this degrade — Haiku corrects them when called. Fail-open everywhere: any subprocess/load/inference error returns confidence=0 so the sidebar keeps working on architectural controls alone. Shield icon reflects degraded state via getClassifierStatus(). BLOCK path emits both: - security_event {verdict, reason, layer, confidence, domain} (for the approved canary-leak banner UX mockup — variant A) - agent_error "Session blocked — prompt injection detected..." (backward-compat with existing error surface) Regression test suite still passes (12/12 sidebar-security tests). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(security): add security.ts unit tests (25 tests, 62 assertions) Covers the pure-string operations that must behave deterministically in both compiled and source-mode bun contexts: * THRESHOLDS ordering invariant (BLOCK > WARN > LOG_ONLY > 0) * combineVerdict ensemble rule — THE critical path: - Empty signals → safe - Canary leak always blocks (regardless of ML signals) - Both ML layers >= WARN → BLOCK (ensemble_agreement) - Single layer >= BLOCK → WARN (single_layer_high) — the Stack Overflow FP mitigation that prevents one classifier killing sessions alone - Max-across-duplicates when multiple signals reference the same layer * Canary generation + injection + recursive checking: - Unique CANARY-XXXXXXXXXXXX tokens (>= 48 bits entropy) - Recursive structure scan for tool_use inputs, nested URLs, commands - Null / primitive handling doesn't throw * Payload hashing (salted sha256) — deterministic per-device, differs across payloads, 64-char hex shape * logAttempt writes to ~/.gstack/security/attempts.jsonl * writeSessionState + readSessionState round-trip (cross-process) * getStatus returns valid SecurityStatus shape * extractDomain returns hostname only, empty string on bad input All 25 tests pass in 18ms — no ML, no network, no subprocess spawning. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(security): expose security status on /health for shield icon The /health endpoint now returns a `security` field with the classifier status, suitable for driving the sidepanel shield icon: { status: 'protected' | 'degraded' | 'inactive', layers: { testsavant, transcript, canary }, lastUpdated: ISO8601 } Backend plumbing: * server.ts imports getStatus from security.ts (pure-string, safe in compiled binary) and includes it in the /health response. * sidebar-agent.ts writes ~/.gstack/security/session-state.json when the classifier warmup completes (success OR failure). This is the cross- process handoff — server.ts reads the state file via getStatus() to surface the result to the sidepanel. The sidepanel rendering (SVG shield icon + color states + tooltip) is a follow-up commit in the extension/ code. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(security): document the sidebar security stack in CLAUDE.md Adds a security section to the Browser interaction block. Covers: * Layered defense table showing which modules live where (content-security.ts in both contexts vs security-classifier.ts only in sidebar-agent) and why the split exists (onnxruntime-node incompatibility with compiled Bun) * Threshold constants (0.85 / 0.60 / 0.40) and the ensemble rule that prevents single-classifier false-positives (the Stack Overflow FP story) * Env knobs — GSTACK_SECURITY_OFF kill switch, cache paths, salt file, attack log rotation, session state file This is the "before you modify the security stack, read this" doc. It lives next to the existing Sidebar architecture note that points at SIDEBAR_MESSAGE_FLOW.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(todos): mark ML classifier v1 in-progress + file v2 follow-ups Reframes the P0 item to reflect v1 scope (branch 2 architecture, TestSavantAI pivot, what shipped) and splits v2 work into discrete TODOs: * Shield icon + canary leak banner UI (P0, blocks v1 user-facing completion) * Attack telemetry via gstack-telemetry-log (P1) * Full BrowseSafe-Bench at gate tier (P2) * Cross-user aggregate attack dashboard (P2) * DeBERTa-v3 as third signal in ensemble (P2) * Read/Glob/Grep ingress coverage (P2, flagged by Codex review) * Adversarial + integration + smoke-bench test suites (P1) * Bun-native 5ms inference (P3 research) Each TODO carries What / Why / Context / Effort / Priority / Depends-on so it's actionable by someone picking it up cold. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(telemetry): add attack_attempt event type to gstack-telemetry-log Extends the existing telemetry pipe with 5 new flags needed for prompt injection attack reporting: --url-domain hostname only (never path, never query) --payload-hash salted sha256 hex (opaque — no payload content ever) --confidence 0-1 (awk-validated + clamped; malformed → null) --layer testsavant_content | transcript_classifier | aria_regex | canary --verdict block | warn | log_only Backward compatibility: * Existing skill_run events still work — all new fields default to null * Event schema is a superset of the old one; downstream edge function can filter by event_type No new auth, no new SDK, no new Supabase migration. The same tier gating (community → upload, anonymous → local only, off → no-op) and the same sync daemon carry the attack events. This is the "E6 RESOLVED" path from the CEO plan — riding the existing pipe instead of spinning up parallel infra. Verified end-to-end: * attack_attempt event with all fields emits correctly to skill-usage.jsonl * skill_run event with no security flags still works (backward compat) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(security): wire logAttempt to gstack-telemetry-log (fire-and-forget) Every local attempt.jsonl write now also triggers a subprocess call to gstack-telemetry-log with the attack_attempt event type. The binary handles tier gating internally (community → Supabase upload, anonymous → local JSONL only, off → no-op), so security.ts doesn't need to re-check. Binary resolution follows the skill preamble pattern — never relies on PATH, which breaks in compiled-binary contexts: 1. ~/.claude/skills/gstack/bin/gstack-telemetry-log (global install) 2. .claude/skills/gstack/bin/gstack-telemetry-log (symlinked dev) 3. bin/gstack-telemetry-log (in-repo dev) Fire-and-forget: * spawn with stdio: 'ignore', detached: true, unref() * .on('error') swallows failures * Missing binary is non-fatal — local attempts.jsonl still gives audit trail Never throws. Never blocks. Existing 37 security tests pass unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(ui): add security banner markup + styles (approved variant A) HTML + CSS for the canary leak / ML block banner. Structure matches the approved mockup from /plan-design-review 2026-04-19 (variant A — centered alert-heavy): * Red alert-circle SVG icon (no stock shield, intentional — matches the "serious but not scary" tone the review chose) * "Session terminated" Satoshi Bold 18px red headline * "— prompt injection detected from {domain}" DM Sans zinc subtitle * Expandable "What happened" chevron button (aria-expanded/aria-controls) * Layer list rendered in JetBrains Mono with amber tabular-nums scores * Close X in top-right, 28px hit area, focus-visible amber outline Enter animation: slide-down 8px + fade, 250ms, cubic-bezier(0.16,1,0.3,1) — matches DESIGN.md motion spec. Respects `role="alert"` + `aria-live="assertive"` so screen readers announce on appearance. Escape-to-dismiss hook is in the JS follow-up commit. Design tokens all via CSS variables (--error, --amber-400, --amber-500, --zinc-*, --font-display, --font-mono, --radius-*) — already established in the stylesheet. No new color constants introduced. JS wiring lands in the next commit so this diff stays focused on presentation layer only. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(ui): wire security banner to security_event + interactivity Adds showSecurityBanner() and hideSecurityBanner() plus the addChatEntry routing for entry.type === 'security_event'. When the sidebar-agent emits a security_event (canary leak or ML BLOCK), the banner renders with: * Title ("Session terminated") * Subtitle with {domain} if present, otherwise generic * Expandable layer list — each row: SECURITY_LAYER_LABELS[layer] + confidence.toFixed(2) in mono. Readable + auditable — user can see which layer fired at what score Interactivity, wired once on DOMContentLoaded: * Close X → hideSecurityBanner() * Expand/collapse "What happened" → toggles details + aria-expanded + chevron rotation (200ms css transition already in place) * Escape key dismisses while banner is visible (a11y) No shield icon yet — that's a separate commit that will consume the `security` field now returned by /health. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(ui): add security shield icon in sidepanel header (3 states) Small "SEC" badge in the top-right of the sidepanel that reflects the security module's current state. Three states drive color: protected green — all layers ok (TestSavantAI + transcript + canary) degraded amber — one+ ML layer offline but canary + arch controls active inactive red — security module crashed, arch controls only Consumes /health.security (surfaced in commit |
||
|
|
b805aa0113 |
feat: Confusion Protocol, Hermes + GBrain hosts, brain-first resolver (v0.18.0.0) (#1005)
* feat: add Confusion Protocol to preamble resolver Injects a high-stakes ambiguity gate at preamble tier >= 2 so all workflow skills get it. Fires when Claude encounters architectural decisions, data model changes, destructive operations, or contradictory requirements. Does NOT fire on routine coding. Addresses Karpathy failure mode #1 (wrong assumptions) with an inline STOP gate instead of relying on workflow skill invocation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: add Hermes and GBrain host configs Hermes: tool rewrites for terminal/read_file/patch/delegate_task, paths to ~/.hermes/skills/gstack, AGENTS.md config file. GBrain: coding skills become brain-aware when GBrain mod is installed. Same tool rewrites as OpenClaw (agents spawn Claude Code via ACP). GBRAIN_CONTEXT_LOAD and GBRAIN_SAVE_RESULTS NOT suppressed on gbrain host, enabling brain-first lookup and save-to-brain behavior. Both registered in hosts/index.ts with setup script redirect messages. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: GBrain resolver — brain-first lookup and save-to-brain New scripts/resolvers/gbrain.ts with two resolver functions: - GBRAIN_CONTEXT_LOAD: search brain for context before skill starts - GBRAIN_SAVE_RESULTS: save skill output to brain after completion Placeholders added to 4 thinking skill templates (office-hours, investigate, plan-ceo-review, retro). Resolves to empty string on all hosts except gbrain via suppressedResolvers. GBRAIN suppression added to all 9 non-gbrain host configs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: wire slop:diff into /review as advisory diagnostic Adds Step 3.5 to the review template: runs bun run slop:diff against the base branch to catch AI code quality issues (empty catches, redundant return await, overcomplicated abstractions). Advisory only, never blocking. Skips silently if slop-scan is not installed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: add Karpathy compatibility note to README Positions gstack as the workflow enforcement layer for Karpathy-style CLAUDE.md rules (17K stars). Links to forrestchang/andrej-karpathy-skills. Maps each Karpathy failure mode to the gstack skill that addresses it. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: improve native OpenClaw thinking skills office-hours: add design doc path visibility message after writing ceo-review: add HARD GATE reminder at review section transitions retro: add non-git context support (check memory for meeting notes) Mirrors template improvements to hand-crafted native skills. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: update tests and golden fixtures for new hosts - Host count: 8 → 10 (hermes, gbrain) - OpenClaw adapter test: expects undefined (dead code removed) - Golden ship fixtures: updated with Confusion Protocol + vendoring Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: regenerate all SKILL.md files Regenerated from templates after Confusion Protocol, GBrain resolver placeholders, slop:diff in review, HARD GATE reminders, investigation learnings, design doc visibility, and retro non-git context changes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: update project documentation for v0.18.0.0 - CHANGELOG: add v0.18.0.0 entry (Confusion Protocol, Hermes, GBrain, slop in review, Karpathy note, skill improvements) - CLAUDE.md: add hermes.ts and gbrain.ts to hosts listing - README.md: update agent count 8→10, add Hermes + GBrain to table - VERSION: bump to 0.18.0.0 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: sync package.json version to 0.18.0.0 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: extract Step 0 from review SKILL.md in E2E test The review-base-branch E2E test was copying the full 1493-line review/SKILL.md into the test fixture. The agent spent 8+ turns reading it in chunks, leaving only 7 turns for actual work, causing error_max_turns on every attempt. Now extracts only Step 0 (base branch detection, ~50 lines) which is all the test actually needs. Follows the CLAUDE.md rule: "NEVER copy a full SKILL.md file into an E2E test fixture." Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: update GBrain and Hermes host configs for v0.10.0 integration GBrain: add 'triggers' to keepFields so generated skills pass checkResolvable() validation. Add version compat comment. Hermes: un-suppress GBRAIN_CONTEXT_LOAD and GBRAIN_SAVE_RESULTS. The resolvers handle GBrain-not-installed gracefully, so Hermes agents with GBrain as a mod get brain features automatically. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: GBrain resolver DX improvements and preamble health check Resolver changes: - gbrain query → gbrain search (fast keyword search, not expensive hybrid) - Add keyword extraction guidance for agents - Show explicit gbrain put_page syntax with --title, --tags, heredoc - Add entity enrichment with false-positive filter - Name throttle error patterns (exit code 1, stderr keywords) - Add data-research routing for investigate skill - Expand skillSaveMap from 4 to 8 entries - Add brain operation telemetry summary Preamble changes: - Add gbrain doctor --fast --json health check for gbrain/hermes hosts - Parse check failures/warnings count - Show failing check details when score < 50 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: preserve keepFields in allowlist frontmatter mode The allowlist mode hard-coded name + description reconstruction but never iterated keepFields for additional fields. Adding 'triggers' to keepFields was a no-op because the field was silently stripped. Now iterates keepFields and preserves any field beyond name/description from the source template frontmatter, including YAML arrays. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: add triggers to all 38 skill templates Multi-word, skill-specific trigger keywords for GBrain's RESOLVER.md router. Each skill gets 3-6 triggers derived from its "Use when asked to..." description text. Avoids single generic words that would collide across skills (e.g., "debug this" not "debug"). These are distinct from voice-triggers (speech-to-text aliases) and serve GBrain's checkResolvable() validation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: regenerate all SKILL.md files and update golden fixtures Regenerated from updated templates (triggers, brain placeholders, resolver DX improvements, preamble health check). Golden fixtures updated to match. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: settings-hook remove exits 1 when nothing to remove gstack-settings-hook remove was exiting 0 when settings.json didn't exist, causing gstack-uninstall to report "SessionStart hook" as removed on clean systems where nothing was installed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: update project documentation for GBrain v0.10.0 integration ARCHITECTURE.md: added GBRAIN_CONTEXT_LOAD and GBRAIN_SAVE_RESULTS to resolver table. CHANGELOG.md: expanded v0.18.0.0 entry with GBrain v0.10.0 integration details (triggers, expanded brain-awareness, DX improvements, Hermes brain support), updated date. CLAUDE.md: added gbrain to resolvers/ directory comment. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: routing E2E stops writing to user's ~/.claude/skills/ installSkills() was copying SKILL.md files to both project-level (.claude/skills/ in tmpDir) and user-level (~/.claude/skills/). Writing to the user's real install fails when symlinks point to different worktrees or dangling targets (ENOENT on copyFileSync). Now installs to project-level only. The test already sets cwd to the tmpDir, so project-level discovery works. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: scale Gemini E2E back to smoke test Gemini CLI gets lost in worktrees on complex tasks (review times out at 600s, discover-skill hits exit 124). Nobody uses Gemini for gstack skill execution. Replace the two failing tests (gemini-discover-skill and gemini-review-findings) with a single smoke test that verifies Gemini can start and read the README. 90s timeout, no skill invocation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> |
||
|
|
2300067267 |
feat: UX behavioral foundations + ux-audit command (v0.17.0.0) (#1000)
* feat: UX behavioral foundations — Krug's usability principles as shared design infrastructure Add UX_PRINCIPLES resolver distilling Steve Krug's "Don't Make Me Think" into actionable guidance for AI agents. Injected into all 4 design skills as a shared behavioral foundation complementing the existing visual checklist (WHAT to check) and cognitive patterns (HOW designers see) with HOW USERS ACTUALLY BEHAVE. Methodology rewire: 6 Krug usability tests woven into existing design-review phases — Trunk Test, 3-Second Scan, Page Area Test, Happy Talk Detection with word count metric, Mindless Choice Audit, Goodwill Reservoir tracking with visual dashboard. First-person narration mode for design-review output with anti-slop guardrail. Hard rules: 4 Krug always/never rules in DESIGN_HARD_RULES (placeholder-as-label, floating headings, visited link distinction, minimum type size). Krug, Redish, Jarrett added to plan-design-review references. Token ceiling: gen-skill-docs.ts warns if any SKILL.md exceeds 100KB (~25K tokens). Documented in CLAUDE.md. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: $B ux-audit command + snapshot --heatmap flag New browse meta-command: ux-audit extracts page structure (site ID, navigation, headings, interactive elements, text blocks) as structured JSON for agent-side UX behavioral analysis. Pure data extraction — the agent applies the 6 usability tests and makes judgment calls. Element caps: 50 headings, 100 links, 200 interactive, 50 text blocks. New snapshot flag: -H/--heatmap accepts a JSON color map mapping ref IDs to colors (green/yellow/red/blue/orange/gray). Extends existing snapshot -a annotation system with per-ref colors instead of hardcoded red. Color whitelist validation prevents CSS injection. Composable — any skill can use it. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: update project documentation for v0.17.0.0 ARCHITECTURE.md: added {{UX_PRINCIPLES}} resolver to placeholder table. VERSION: bumped to 0.17.0.0 for UX behavioral foundations release. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: bump version and changelog (v0.17.0.0) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: adversarial review fixes for ux-audit and heatmap Security: - Remove live form value extraction from ux-audit (leaked input field values) - Add ux-audit to PAGE_CONTENT_COMMANDS (untrusted content wrapping) Correctness: - Scope youAreHere selector to nav containers (was matching animation classes) - Validate heatmap JSON is a plain object (string/array/null produced garbage) - Use textContent instead of innerText for word count (avoids layout computation) - Remove dead url variable and unused LINK_CAP constant Found by Codex + Claude adversarial review. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> |
||
|
|
8115951284 |
feat: recursive self-improvement — operational learning + full skill wiring (v0.13.8.0) (#647)
* refactor: remove dead contributor mode, replace with operational self-improvement slot Contributor mode never fired in 18 days of heavy use (required manual opt-in via gstack-config, gated behind _CONTRIB=true, wrote disconnected markdown). Removes: generateContributorMode(), _CONTRIB bash var, 2 E2E tests, touchfile entry, doc references. Cleans up skip-lists in plan-ceo-review, autoplan, review resolver, and document-release templates. The operational self-improvement system (next commit) replaces this slot with automatic learning capture that requires no opt-in. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: operational self-improvement — every skill learns from failures Adds universal operational learning capture to the preamble completion protocol. At the end of every skill session, the agent reflects on CLI failures, wrong approaches, and project quirks, logging them as type "operational" to the learnings JSONL. Future sessions surface these automatically. - generateCompletionStatus(ctx) now includes operational capture section - Preamble bash shows top 3 learnings inline when count > 5 - New "operational" type in generateLearningsLog alongside pattern/pitfall/etc - Updated unit tests + operational seed entry in learnings E2E Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: wire learnings into all insight-producing skills Adds LEARNINGS_SEARCH and/or LEARNINGS_LOG to 10 skill templates that produce reusable insights but were previously disconnected from the learning system: - office-hours, plan-ceo-review, plan-eng-review: add LOG (had SEARCH) - plan-design-review: add both SEARCH + LOG (had neither) - design-review, design-consultation, cso, qa, qa-only: add both - retro: add SEARCH (had LOG) 13 skills now fully participate in the learning loop (read + write). Every review, QA, investigation, and design session both consults prior learnings and contributes new ones. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * test: add operational-learning E2E test (gate-tier) Validates the write path: agent encounters a CLI failure, logs an operational learning to JSONL via gstack-learnings-log. Replaces the removed contributor-mode E2E test. Setup: temp git repo, copy bin scripts, set GSTACK_HOME. Prompt: simulated npm test failure needing --experimental-vm-modules. Assert: learnings.jsonl exists with type=operational entry. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: learnings-show E2E slug mismatch — seed at computed slug, not hardcoded The test seeded learnings at projects/test-project/ but gstack-slug computes the slug from basename(workDir) when no git remote exists. The agent's search looked at the wrong path and found nothing. Fix: compute slug the same way gstack-slug does (basename + sanitize) and seed the learnings there. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: bump version and changelog (v0.13.8.0) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> |
||
|
|
78bc1d1968 |
feat: design binary — real UI mockup generation for gstack skills (v0.13.0.0) (#551)
* docs: design tools v1 plan — visual mockup generation for gstack skills Full design doc covering the `design` binary that wraps OpenAI's GPT Image API to generate real UI mockups from gstack's design skills. Includes comparison board UX spec, auth model, 6 CEO expansions (design memory, mockup diffing, screenshot evolution, design intent verification, responsive variants, design-to-code prompt), and 9-commit implementation plan. Reviewed: /office-hours + /plan-eng-review (CLEARED) + /plan-ceo-review (EXPANSION, 6/6 accepted) + /plan-design-review (2/10 → 8/10). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: design tools prototype validation — GPT Image API works Prototype script sends 3 design briefs to OpenAI Responses API with image_generation tool. Results: dashboard (47s, 2.1MB), landing page (42s, 1.3MB), settings page (37s, 1.3MB) all produce real, implementable UI mockups with accurate text rendering and clean layouts. Key finding: Codex OAuth tokens lack image generation scopes. Direct API key (sk-proj-*) required, stored in ~/.gstack/openai.json. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: design binary core — generate, check, compare commands Stateless CLI (design/dist/design) wrapping OpenAI Responses API for UI mockup generation. Three working commands: - generate: brief -> PNG mockup via gpt-4o + image_generation tool - check: vision-based quality gate via GPT-4o (text readability, layout completeness, visual coherence) - compare: generates self-contained HTML comparison board with star ratings, radio Pick, per-variant feedback, regenerate controls, and Submit button that writes structured JSON for agent polling Auth reads from ~/.gstack/openai.json (0600), falls back to OPENAI_API_KEY env var. Compiled separately from browse binary (openai added to devDependencies, not runtime deps). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: design binary variants + iterate commands variants: generates N style variations with staggered parallel (1.5s between launches, exponential backoff on 429). 7 built-in style variations (bold, calm, warm, corporate, dark, playful + default). Tested: 3/3 variants in 41.6s. iterate: multi-turn design iteration using previous_response_id for conversational threading. Falls back to re-generation with accumulated feedback if threading doesn't retain visual context. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: DESIGN_SETUP + DESIGN_MOCKUP template resolvers Add generateDesignSetup() and generateDesignMockup() to the existing design.ts resolver file. Add designDir to HostPaths (claude + codex). Register DESIGN_SETUP and DESIGN_MOCKUP in the resolver index. DESIGN_SETUP: $D binary discovery (mirrors $B browse setup pattern). Falls back to DESIGN_SKETCH if binary not available. DESIGN_MOCKUP: full visual exploration workflow template — construct brief from DESIGN.md context, generate 3 variants, open comparison board in Chrome, poll for user feedback, save approved mockup to docs/designs/, generate HTML wireframe for implementation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: sync package.json version with VERSION file (0.12.2.0) Pre-existing mismatch: VERSION was 0.12.2.0 but package.json was 0.12.0.0. Also adds design binary to build script and dev:design convenience command. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: /office-hours visual design exploration integration Add {{DESIGN_MOCKUP}} to office-hours template before the existing {{DESIGN_SKETCH}}. When the design binary is available, /office-hours generates 3 visual mockup variants, opens a comparison board in Chrome, and polls for user feedback. Falls back to HTML wireframes if the design binary isn't built. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: /plan-design-review visual mockup integration Add {{DESIGN_SETUP}} to pre-review audit and "show me what 10/10 looks like" mockup generation to the 0-10 rating method. When a design dimension rates below 7/10, the review can generate a mockup showing the improved version. Falls back to text descriptions if the design binary isn't available. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: design memory — extract visual language from mockups into DESIGN.md New `$D extract` command: sends approved mockup to GPT-4o vision, extracts color palette, typography, spacing, and layout patterns, writes/updates DESIGN.md with an "Extracted Design Language" section. Progressive constraint: if DESIGN.md exists, future mockup briefs include it as style context. If no DESIGN.md, explorations run wide. readDesignConstraints() reads existing DESIGN.md for brief construction. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: mockup diffing + design intent verification New commands: - $D diff --before old.png --after new.png: visual diff using GPT-4o vision. Returns differences by area with severity (high/medium/low) and a matchScore (0-100). - $D verify --mockup approved.png --screenshot live.png: compares live site screenshot against approved design mockup. Pass if matchScore >= 70 and no high-severity differences. Used by /design-review to close the design loop: design -> implement -> verify visually. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: screenshot-to-mockup evolution ($D evolve) New command: $D evolve --screenshot current.png --brief "make it calmer" Two-step process: first analyzes the screenshot via GPT-4o vision to produce a detailed description, then generates a new mockup that keeps the existing layout structure but applies the requested changes. Starts from reality, not blank canvas. Bridges the gap between /design-review critique ("the spacing is off") and a visual proposal of the fix. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: responsive variants + design-to-code prompt Responsive variants: $D variants --viewports desktop,tablet,mobile generates mockups at 1536x1024, 1024x1024, and 1024x1536 (portrait) with viewport-appropriate layout instructions. Design-to-code prompt: $D prompt --image approved.png extracts colors, typography, layout, and components via GPT-4o vision, producing a structured implementation prompt. Reads DESIGN.md for additional constraint context. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: bump version and changelog (v0.13.0.0) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: gstack designer as first-class tool in /plan-design-review Brand the gstack designer prominently, add Step 0.5 for proactive visual mockup generation before review passes, and update priority hierarchy. When a plan describes new UI, the skill now offers to generate mockups with $D variants, run $D check for quality gating, and present a comparison board via $B goto before any review passes begin. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: integrate mockups into review passes and outputs Thread Step 0.5 mockups through the review workflow: Pass 4 (AI Slop) evaluates generated mockups visually, Pass 7 uses mockups as evidence for unresolved decisions, post-pass offers one-shot regeneration after design changes, and Approved Mockups section records chosen variants with paths for the implementer. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: gstack designer target mockups in /design-review fix loop Add $D generate for target mockups in Phase 8a.5 — before fixing a design finding, generate a mockup showing what it should look like. Add $D verify in Phase 9 to compare fix results against targets. Not plan mode — goes straight to implementation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: gstack designer AI mockups in /design-consultation Phase 5 Replace HTML preview with $D variants + comparison board when designer is available (Path A). Use $D extract to derive DESIGN.md tokens from the approved mockup. Handles both plan mode (write to plan) and non-plan mode (implement immediately). Falls back to HTML preview (Path B) when designer binary is unavailable. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: make gstack designer the default in /plan-design-review, not optional The transcript showed the agent writing 5 text descriptions of homepage variants instead of generating visual mockups, even when the user explicitly asked for design tools. The skill treated mockups as optional ("Want me to generate?") when they should be the default behavior. Changes: - Rename "Your Visual Design Tool" to "YOUR PRIMARY TOOL" with aggressive language: "Don't ask permission. Show it." - Step 0.5 now generates mockups automatically when DESIGN_READY, no AskUserQuestion gatekeeping the default path - Priority hierarchy: mockups are "non-negotiable" not "if available" - Step 0D tells the user mockups are coming next - DESIGN_NOT_AVAILABLE fallback now tells user what they're missing The only valid reasons to skip mockups: no UI scope, or designer not installed. Everything else generates by default. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: persist design mockups to ~/.gstack/projects/$SLUG/designs/ Mockups were going to .context/mockups/ (gitignored, workspace-local). This meant designs disappeared when switching workspaces or conversations, and downstream skills couldn't reference approved mockups from earlier reviews. Now all three design skills save to persistent project-scoped dirs: - /plan-design-review: ~/.gstack/projects/$SLUG/designs/<screen>-<date>/ - /design-consultation: ~/.gstack/projects/$SLUG/designs/design-system-<date>/ - /design-review: ~/.gstack/projects/$SLUG/designs/design-audit-<date>/ Each directory gets an approved.json recording the user's pick, feedback, and branch. This lets /design-review verify against mockups that /plan-design-review approved, and design history is browsable via ls ~/.gstack/projects/$SLUG/designs/. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: regenerate codex ship skill with zsh glob guards Picked up setopt +o nomatch guards from main's v0.12.8.1 merge. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: add browse binary discovery to DESIGN_SETUP resolver The design setup block now discovers $B alongside $D, so skills can open comparison boards via $B goto and poll feedback via $B eval. Falls back to `open` on macOS when browse binary is unavailable. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: comparison board DOM polling in plan-design-review After opening the comparison board, the agent now polls #status via $B eval instead of asking a rigid AskUserQuestion. Handles submit (read structured JSON feedback), regenerate (new variants with updated brief), and $B-unavailable fallback (free-form text response). The user interacts with the real board UI, not a constrained option picker. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * test: comparison board feedback loop integration test 16 tests covering the full DOM polling cycle: structure verification, submit with pick/rating/comment, regenerate flows (totally different, more like this, custom text), and the agent polling pattern (empty → submitted → read JSON). Uses real generateCompareHtml() from design/src/compare.ts, served via HTTP. Runs in <1s. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: add $D serve command for HTTP-based comparison board feedback The comparison board feedback loop was fundamentally broken: browse blocks file:// URLs (url-validation.ts:71), so $B goto file://board.html always fails. The fallback open + $B eval polls a different browser instance. $D serve fixes this by serving the board over HTTP on localhost. The server is stateful: stays alive across regeneration rounds, exposes /api/progress for the board to poll, and accepts /api/reload from the agent to swap in new board HTML. Stdout carries feedback JSON only; stderr carries telemetry. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: dual-mode feedback + post-submit lifecycle in comparison board When __GSTACK_SERVER_URL is set (injected by $D serve), the board POSTs feedback to the server instead of only writing to hidden DOM elements. After submit: disables all inputs, shows "Return to your coding agent." After regenerate: shows spinner, polls /api/progress, auto-refreshes on ready. On POST failure: shows copyable JSON fallback. On progress timeout (5 min): shows error with /design-shotgun prompt. DOM fallback preserved for headed browser mode and tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * test: HTTP serve command endpoints and regeneration lifecycle 11 tests covering: HTML serving with injected server URL, /api/progress state reporting, submit → done lifecycle, regenerate → regenerating state, remix with remixSpec, malformed JSON rejection, /api/reload HTML swapping, missing file validation, and full regenerate → reload → submit round-trip. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: add DESIGN_SHOTGUN_LOOP resolver + fix design artifact paths Adds generateDesignShotgunLoop() resolver for the shared comparison board feedback loop (serve via HTTP, handle regenerate/remix, AskUserQuestion fallback, feedback confirmation). Registered as {{DESIGN_SHOTGUN_LOOP}}. Fixes generateDesignMockup() to use ~/.gstack/projects/$SLUG/designs/ instead of /tmp/ and docs/designs/. Replaces broken $B goto file:// + $B eval polling with $D compare --serve (HTTP-based, stdout feedback). Adds CRITICAL PATH RULE guardrail to DESIGN_SETUP: design artifacts must go to ~/.gstack/projects/$SLUG/designs/, never .context/ or /tmp/. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: add /design-shotgun standalone design exploration skill New skill for visual brainstorming: generate AI design variants, open a comparison board in the user's browser, collect structured feedback, and iterate. Features: session detection (revisit prior explorations), 5-dimension context gathering (who, job to be done, what exists, user flow, edge cases), taste memory (prior approved designs bias new generations), inline variant preview, configurable variant count, screenshot-to-variants via $D evolve. Uses {{DESIGN_SHOTGUN_LOOP}} resolver for the feedback loop. Saves all artifacts to ~/.gstack/projects/$SLUG/designs/. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: regenerate SKILL.md files for design-shotgun + resolver changes Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: add remix UI to comparison board Per-variant element selectors (Layout, Colors, Typography, Spacing) with radio buttons in a grid. Remix button collects selections into a remixSpec object and sends via the same HTTP POST feedback mechanism. Enabled only when at least one element is selected. Board shows regenerating spinner while agent generates the hybrid variant. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: add $D gallery command for design history timeline Generates a self-contained HTML page showing all prior design explorations for a project: every variant (approved or not), feedback notes, organized by date (newest first). Images embedded as base64. Handles corrupted approved.json gracefully (skips, still shows the session). Empty state shows "No history yet" with /design-shotgun prompt. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * test: gallery generation — sessions, dates, corruption, empty state 7 tests: empty dir, nonexistent dir, single session with approved variant, multiple sessions sorted newest-first, corrupted approved.json handled gracefully, session without approved.json, self-contained HTML (no external dependencies). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * refactor: replace broken file:// polling with {{DESIGN_SHOTGUN_LOOP}} plan-design-review and design-consultation templates previously used $B goto file:// + $B eval polling for the comparison board feedback loop. This was broken (browse blocks file:// URLs). Both templates now use {{DESIGN_SHOTGUN_LOOP}} which serves via HTTP, handles regeneration in the same browser tab, and falls back to AskUserQuestion. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * test: add design-shotgun touchfile entries and tier classifications design-shotgun-path (gate): verify artifacts go to ~/.gstack/, not .context/ design-shotgun-session (gate): verify repeat-run detection + AskUserQuestion design-shotgun-full (periodic): full round-trip with real design binary Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: regenerate SKILL.md files for template refactor Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: comparison board UI improvements — option headers, pick confirmation, grid view Three changes to the design comparison board: 1. Pick confirmation: selecting "Pick" on Option A shows "We'll move forward with Option A" in green, plus a status line above the submit button repeating the choice. 2. Clear option headers: each variant now has "Option A" in bold with a subtitle above the image, instead of just the raw image. 3. View toggle: top-right Large/Grid buttons switch between single-column (default) and 3-across grid view. Also restructured the bottom section into a 2-column grid: submit/overall feedback on the left, regenerate controls on the right. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: use 127.0.0.1 instead of localhost for serve URL Avoids DNS resolution issues on some systems where localhost may resolve to IPv6 ::1 while Bun listens on IPv4 only. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: write ALL feedback to disk so agent can poll in background mode The agent backgrounds $D serve (Claude Code can't block on a subprocess and do other work simultaneously). With stdout-only feedback delivery, the agent never sees regenerate/remix feedback. Fix: write feedback-pending.json (regenerate/remix) and feedback.json (submit) to disk next to the board HTML. Agent polls the filesystem instead of reading stdout. Both channels (stdout + disk) are always active so foreground mode still works. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: DESIGN_SHOTGUN_LOOP uses file polling instead of stdout reading Update the template resolver to instruct the agent to background $D serve and poll for feedback-pending.json / feedback.json on a 5-second loop. This matches the real-world pattern where Claude Code / Conductor agents can't block on subprocess stdout. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: regenerate SKILL.md files for file-polling feedback loop Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: null-safe DOM selectors for post-submit and regenerating states The user's layout restructure renamed .regenerate-bar → .regen-column, .submit-bar → .submit-column, and .overall-section → .bottom-section. The JS still referenced the old class names, causing querySelector to return null and showPostSubmitState() / showRegeneratingState() to silently crash. This meant Submit and Regenerate buttons appeared to work (DOM elements updated, HTTP POST succeeded) but the visual feedback (disabled inputs, spinner, success message) never appeared. Fix: use fallback selectors that check both old and new class names, with null guards so a missing element doesn't crash the function. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * test: end-to-end feedback roundtrip — browser click to file on disk The test that proves "changes on the website propagate to Claude Code." Opens the comparison board in a real headless browser with __GSTACK_SERVER_URL injected, simulates user clicks (Submit, Regenerate, More Like This), and verifies that feedback.json / feedback-pending.json land on disk with the correct structured data. 6 tests covering: submit → feedback.json, post-submit UI lockdown, regenerate → feedback-pending.json, more-like-this → feedback-pending.json, regenerate spinner display, and full regen → reload → submit round-trip. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: comprehensive design doc for Design Shotgun feedback loop Documents the full browser-to-agent feedback architecture: state machine, file-based polling, port discovery, post-submit lifecycle, and every known edge case (zombie forms, dead servers, stale spinners, file:// bug, double-click races, port coordination, sequential generate rule). Includes ASCII diagrams of the data flow and state transitions, complete step-by-step walkthrough of happy path and regeneration path, test coverage map with gaps, and short/medium/long-term improvement ideas. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: plan-design-review agent guardrails for feedback loop Four fixes to prevent agents from reinventing the feedback loop badly: 1. Sequential generate rule: explicit instruction that $D generate calls must run one at a time (API rate-limits concurrent image generation). 2. No-AskUserQuestion-for-feedback rule: agent reads feedback.json instead of re-asking what the user picked. 3. Remove file:// references: $B goto file:// was always rejected by url-validation.ts. The --serve flag handles everything. 4. Remove $B eval polling reference: no longer needed with HTTP POST. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: design-shotgun Step 3 progressive reveal, silent failure detection, timing estimate Three production UX bugs fixed: 1. Dead air — now shows timing estimate before generation starts 2. Silent variant drop — replaced $D variants batch with individual $D generate calls, each verified for existence and non-zero size with retry 3. No progressive reveal — each variant shown inline via Read tool immediately after generation (~60s increments instead of all at ~180s) Also: /tmp/ then cp as default output pattern (sandbox workaround), screenshot taken once for evolve path (not per-variant). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: parallel design-shotgun with concept-first confirmation Step 3 rewritten to concept-first + parallel Agent architecture: - 3a: generate text concepts (free, instant) - 3b: AskUserQuestion to confirm/modify before spending API credits - 3c: launch N Agent subagents in parallel (~60s total regardless of count) - 3d: show all results, dynamic image list for comparison board Adds Agent to allowed-tools. Softens plan-design-review sequential warning to note design-shotgun uses parallel at Tier 2+. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: update project documentation for v0.13.0.0 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: untrack .agents/skills/ — generated at setup, already gitignored These files were committed despite .agents/ being in .gitignore. They regenerate from ./setup --host codex on any machine. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: regenerate design-shotgun SKILL.md for v0.12.12.0 preamble changes Merge from main brought updated preamble resolver (conditional telemetry, local JSONL logging) but design-shotgun/SKILL.md wasn't regenerated. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> |
||
|
|
3501f5dd03 |
fix: Windows browse — health-check-first ensureServer, detached startServer, Windows process mgmt (v0.11.11.0) (#431)
* fix: Windows browse — health-check-first ensureServer, detached startServer, Windows process mgmt
Three compounding bugs made browse completely broken on Windows:
1. Bun.spawn().unref() doesn't truly detach on Windows — server dies when
CLI exits. Fix: use Node's child_process.spawn with { detached: true }
via a launcher script. Credit: PR #191 by @fqueiro for the approach.
2. process.kill(pid, 0) is broken in Bun's compiled binary on Windows —
ensureServer() never reaches the health check. Fix: restructure to
health-check-first (HTTP is definitive proof on all platforms). Extract
isServerHealthy() helper for DRY (4 call sites).
3. Windows process management: isProcessAlive() falls back to tasklist,
killServer() uses taskkill /T /F (kills process tree including Chromium),
cleanupLegacyState() skips on Windows (no /tmp, no ps).
Also: hard-fail on Windows if server-node.mjs is missing instead of
silently falling back to the known-broken Bun path.
Fixes #342.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: disable Chromium sandbox on Windows
Chromium's sandbox fails when the server is spawned through the
Bun→Node process chain on Windows (GitHub #276). Disable
chromiumSandbox on Windows at both launch sites (headless + headed).
Safe: local daemon browsing user-specified URLs, Playwright docs
recommend disabling in CI/container environments.
Fixes #276.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: startup error log + Windows exit handler for browse server
On Windows, the CLI can't capture stderr from the server (stdio: 'ignore'
required for process detachment). Write startup errors to
.gstack/browse-startup-error.log so the CLI can report them on timeout.
Also add process.on('exit') handler on Windows as defense-in-depth for
state file cleanup (primary mechanism is CLI's stale-state detection).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* test: add isServerHealthy + startup error log tests
Tests for the new cross-platform health check helper (isServerHealthy)
that replaces PID-based liveness checks in all polling loops. Covers
healthy, unhealthy, unreachable, and error response cases.
Also tests the startup error log write/read format used on Windows.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* chore: bump version and changelog (v0.11.11.0)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* docs: sync ARCHITECTURE.md with health-check-first ensureServer
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
||
|
|
7fbf68bb3f |
feat: cross-model outside voice in plan reviews (v0.9.9.1) (#326)
* feat: add generateCodexPlanReview() resolver for cross-model plan review New resolver offers an optional Codex (or Claude subagent fallback) "outside voice" after plan review sections complete. Includes cross-model tension detection with auto-TODO proposals, review log persistence, and an Outside Voice row in the Review Readiness Dashboard. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: integrate {{CODEX_PLAN_REVIEW}} into CEO and eng review templates CEO review: insert after Section 11 + add Outside Voice summary row. Eng review: replace hardcoded Step 0.5 with resolver (adds fallback, logging, dashboard, xhigh reasoning, cross-model tension tracking). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: regenerate SKILL.md files from updated templates Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: update project documentation for v0.9.9.1 ARCHITECTURE.md: added {{CODEX_PLAN_REVIEW}} to placeholder table. CHANGELOG.md: added v0.9.9.1 entry for outside voice feature. VERSION: bumped to 0.9.9.1. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: move {{CODEX_PLAN_REVIEW}} after review sections in eng review Codex adversarial review caught that the placeholder was positioned before the 4 review sections, so the "After all review sections are complete" instruction could confuse the model. Moved it after Section 4's STOP directive where it belongs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: regenerate eng review SKILL.md files Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> |
||
|
|
00bc482fe1 |
feat: /land-and-deploy, /canary, /benchmark + perf review (v0.7.0) (#183)
* feat: add /canary, /benchmark, /land-and-deploy skills (v0.7.0) Three new skills that close the deploy loop: - /canary: standalone post-deploy monitoring with browse daemon - /benchmark: performance regression detection with Web Vitals - /land-and-deploy: merge PR, wait for deploy, canary verify production Incorporates patterns from community PR #151. Co-Authored-By: HMAKT99 <HMAKT99@users.noreply.github.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: add Performance & Bundle Impact category to review checklist New Pass 2 (INFORMATIONAL) category catching heavy dependencies (moment.js, lodash full), missing lazy loading, synchronous scripts, CSS @import blocking, fetch waterfalls, and tree-shaking breaks. Both /review and /ship automatically pick this up via checklist.md. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: add {{DEPLOY_BOOTSTRAP}} resolver + deployed row in dashboard - New generateDeployBootstrap() resolver auto-detects deploy platform (Vercel, Netlify, Fly.io, GH Actions, etc.), production URL, and merge method. Persists to CLAUDE.md like test bootstrap. - Review Readiness Dashboard now shows a "Deployed" row from /land-and-deploy JSONL entries (informational, never gates shipping). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: mark 3 TODOs completed, bump v0.7.0, update CHANGELOG Superseded by /land-and-deploy: - /merge skill — review-gated PR merge - Deploy-verify skill - Post-deploy verification (ship + browse) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: /setup-deploy skill + platform-specific deploy verification - New /setup-deploy skill: interactive guided setup for deploy configuration. Detects Fly.io, Render, Vercel, Netlify, Heroku, Railway, GitHub Actions, and custom deploy scripts. Writes config to CLAUDE.md with custom hooks section for non-standard setups. - Enhanced deploy bootstrap: platform-specific URL resolution (fly.toml app → {app}.fly.dev, render.yaml → {service}.onrender.com, etc.), deploy status commands (fly status, heroku releases), and custom deploy hooks section in CLAUDE.md for manual/scripted deploys. - Platform-specific deploy verification in /land-and-deploy Step 6: Strategy A (GitHub Actions polling), Strategy B (platform CLI: fly/render/heroku), Strategy C (auto-deploy: vercel/netlify), Strategy D (custom hooks from CLAUDE.md). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * test: E2E + LLM-judge evals for deploy skills - 4 E2E tests: land-and-deploy (Fly.io detection + deploy report), canary (monitoring report structure), benchmark (perf report schema), setup-deploy (platform detection → CLAUDE.md config) - 4 LLM-judge evals: workflow quality for all 4 new skills - Touchfile entries for diff-based test selection (E2E + LLM-judge) - 460 free tests pass, 0 fail Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: harden E2E tests — server lifecycle, timeouts, preamble budget, skip flaky Cross-cutting fixes: - Pre-seed ~/.gstack/.completeness-intro-seen and ~/.gstack/.telemetry-prompted so preamble doesn't burn 3-7 turns on lake intro + telemetry in every test - Each describe block creates its own test server instance instead of sharing a global that dies between suites Test fixes (5 tests): - /qa quick: own server instance + preamble skip - /review SQL injection: timeout 90→180s, maxTurns 15→20, added assertion that review output actually mentions SQL injection - /review design-lite: maxTurns 25→35 + preamble skip (now detects 7/7) - ship-base-branch: both timeouts 90→150/180s + preamble skip - plan-eng artifact: clean stale state in beforeAll, maxTurns 20→25 Skipped (4 flaky/redundant tests): - contributor-mode: tests prompt compliance, not skill functionality - design-consultation-research: WebSearch-dependent, redundant with core - design-consultation-preview: redundant with core test - /qa bootstrap: too ambitious (65 turns, installs vitest) Also: preamble skip added to qa-only, qa-fix-loop, design-consultation-core, and design-consultation-existing prompts. Updated touchfiles entries and touchfiles.test.ts. Added honest comment to codex-review-findings. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * test: redesign 6 skipped/todo E2E tests + add test.concurrent support Redesigned tests (previously skipped/todo): - contributor-mode: pre-fail approach, 5 turns/30s (was 10 turns/90s) - design-consultation-research: WebSearch-only, 8 turns/90s (was 45/480s) - design-consultation-preview: preview HTML only, 8 turns/90s (was 30/480s) - qa-bootstrap: bootstrap-only, 12 turns/90s (was 65/420s) - /ship workflow: local bare remote, 15 turns/120s (was test.todo) - /setup-browser-cookies: browser detection smoke, 5 turns/45s (was test.todo) Added testConcurrentIfSelected() helper for future parallelization. Updated touchfiles entries for all 6 re-enabled tests. Target: 0 skip, 0 todo, 0 fail across all E2E tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: relax contributor-mode assertions — test structure not exact phrasing * perf: enable test.concurrent for 31 independent E2E tests Convert 18 skill-e2e, 11 routing, and 2 codex tests from sequential to test.concurrent. Only design-consultation tests (4) remain sequential due to shared designDir state. Expected ~6x speedup on Teams high-burst. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: add --concurrent flag to bun test + convert remaining 4 sequential tests bun's test.concurrent only works within a describe block, not across describe blocks. Adding --concurrent to the CLI command makes ALL tests concurrent regardless of describe boundaries. Also converted the 4 design-consultation tests to concurrent (each already independent). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * perf: split monolithic E2E test into 8 parallel files Split test/skill-e2e.test.ts (3442 lines) into 8 category files: - skill-e2e-browse.test.ts (7 tests) - skill-e2e-review.test.ts (7 tests) - skill-e2e-qa-bugs.test.ts (3 tests) - skill-e2e-qa-workflow.test.ts (4 tests) - skill-e2e-plan.test.ts (6 tests) - skill-e2e-design.test.ts (7 tests) - skill-e2e-workflow.test.ts (6 tests) - skill-e2e-deploy.test.ts (4 tests) Bun runs each file in its own worker = 10 parallel workers (8 split + routing + codex). Expected: 78 min → ~12 min. Extracted shared helpers to test/helpers/e2e-helpers.ts. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * perf: bump default E2E concurrency to 15 * perf: add model pinning infrastructure + rate-limit telemetry to E2E runner Default E2E model changed from Opus to Sonnet (5x faster, 5x cheaper). Session runner now accepts `model` option with EVALS_MODEL env var override. Added timing telemetry (first_response_ms, max_inter_turn_ms) and wall_clock_ms to eval-store for diagnosing rate-limit impact. Added EVALS_FAST test filtering. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: resolve 3 E2E test failures — tmpdir race, wasted turns, brittle assertions plan-design-review-plan-mode: give each test its own tmpdir to eliminate race condition where concurrent tests pollute each other's working directory. ship-local-workflow: inline ship workflow steps in prompt instead of having agent read 700+ line SKILL.md (was wasting 6 of 15 turns on file I/O). design-consultation-core: replace exact section name matching with fuzzy synonym-based matching (e.g. "Colors" matches "Color", "Type System" matches "Typography"). All 7 sections still required, LLM judge still hard fail. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * perf: pin quality tests to Opus, add --retry 2 and test:e2e:fast tier ~10 quality-sensitive tests (planted-bug detection, design quality judge, strategic review, retro analysis) explicitly pinned to Opus. ~30 structure tests default to Sonnet for 5x speed improvement. Added --retry 2 to all E2E scripts for flaky test resilience. Added test:e2e:fast script that excludes 8 slowest tests for quick feedback. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: mark E2E model pinning TODO as shipped Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: add SKILL.md merge conflict directive to CLAUDE.md When resolving merge conflicts on generated SKILL.md files, always merge the .tmpl templates first, then regenerate — never accept either side's generated output directly. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: add DEPLOY_BOOTSTRAP resolver to gen-skill-docs The land-and-deploy template referenced {{DEPLOY_BOOTSTRAP}} but no resolver existed, causing gen-skill-docs to fail. Added generateDeployBootstrap() that generates the deploy config detection bash block (check CLAUDE.md for persisted config, auto-detect platform from config files, detect deploy workflows). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: regenerate SKILL.md files after DEPLOY_BOOTSTRAP fix Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: move prompt temp file outside workingDirectory to prevent race condition The .prompt-tmp file was written inside workingDirectory, which gets deleted by afterAll cleanup. With --concurrent --retry, afterAll can interleave with retries, causing "No such file or directory" crashes at 0s (seen in review-design-lite and office-hours-spec-review). Fix: write prompt file to os.tmpdir() with a unique suffix so it survives directory cleanup. Also convert review-design-lite from describeE2E to describeIfSelected for proper diff-based test selection. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: add --retry 2 --concurrent flags to test:evals scripts for consistency test:evals and test:evals:all were missing the retry and concurrency flags that test:e2e already had, causing inconsistent behavior between the two script families. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: HMAKT99 <HMAKT99@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> |
||
|
|
f075cb757f |
feat: Search Before Building — builder ethos + skill integrations (v0.9.5.0) (#298)
* feat: ETHOS.md — gstack builder philosophy Standalone document capturing the four principles: The Golden Age, Boil the Lake, Search Before Building, and Build for Yourself. Introduces the three-layer knowledge framework (tried-and-true, new-and-popular, first-principles) and the Eureka Moment concept — when first-principles reasoning reveals conventional wisdom is wrong. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: Search Before Building preamble section + CLAUDE.md Add generateSearchBeforeBuildingSection(ctx) to gen-skill-docs.ts. Every workflow skill now gets a compact router section covering: - Three layers of knowledge (tried-and-true, new-and-popular, first-principles) - Eureka moment format and jq-based JSONL logging - WebSearch fallback clause - ETHOS.md reference via ctx.paths.skillRoot resolver Also adds compact "Search before building" section to CLAUDE.md. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: skill-specific Search Before Building integrations 8 template changes: - /office-hours: Phase 2.75 Landscape Awareness (WebSearch + three-layer synthesis) - /plan-eng-review: Step 0 search check with layer provenance annotations - /investigate: external pattern search + search escalation on hypothesis failure - /plan-ceo-review: Landscape Check before scope challenge - /review: search-before-recommending for fix patterns - /qa-only: WebSearch in allowed-tools - /design-consultation: three-layer synthesis backport in Phase 2 Step 3 - /retro: eureka moment tracking from ~/.gstack/analytics/eureka.jsonl All search steps include WebSearch fallback clause. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: v0.9.5.0 — Builder Ethos (CHANGELOG + VERSION + TODOS) ETHOS.md + Search Before Building across all workflow skills. Deferred: first-time intro flow (blocked on blog post). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: address Codex review — sanitize search, privacy gate, ETHOS.md sidecar Three fixes from adversarial Codex review: - /investigate: sanitize error messages before searching (strip hostnames, IPs, file paths, SQL, customer data). Skip search if unsanitizable. - /office-hours: add privacy gate before landscape search. Use generalized category terms, never the user's specific product name or stealth idea. - setup: link ETHOS.md into .agents/skills/gstack/ sidecar so workspace- local Codex sessions can find the builder philosophy. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: sanitize Phase 2 external pattern search in /investigate The Phase 2 external search also sent raw error messages to WebSearch. Apply same sanitization rule as Phase 3 search escalation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: sync documentation with shipped changes - ARCHITECTURE.md: preamble now handles 5 things (add Search Before Building) - CLAUDE.md: add ETHOS.md to project structure tree - README.md: add ETHOS.md to docs table Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> |
||
|
|
78c207efb4 |
feat: interactive /plan-design-review + CEO invokes designer + 100% coverage (v0.6.4) (#149)
* refactor: rename qa-design-review → design-review The "qa-" prefix was confusing — this is the live-site design audit with fix loop, not a QA-only report. Rename directory and update all references across docs, tests, scripts, and skill templates. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: interactive /plan-design-review + CEO invokes designer Rewrite /plan-design-review from report-only grading to an interactive plan-fixer that rates each design dimension 0-10, explains what a 10 looks like, and edits the plan to get there. Parallel structure with /plan-ceo-review and /plan-eng-review — one issue = one AskUserQuestion. CEO review now detects UI scope and invokes the designer perspective when the plan has frontend/UX work, so you get design review automatically when it matters. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * test: validation + touchfile entries for 100% coverage Add design-consultation to command/snapshot flag validation. Add 4 skills to contributor mode validation (plan-design-review, design-review, design-consultation, document-release). Add 2 templates to hardcoded branch check. Register touchfile entries for 10 new LLM-judge tests and 1 new E2E test. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * test: LLM-judge for 10 skills + gstack-upgrade E2E Add LLM-judge quality evals for all uncovered skills using a DRY runWorkflowJudge helper with section marker guards. Add real E2E test for gstack-upgrade using mock git remote (replaces test.todo). Add plan-edit assertion to plan-design-review E2E. 14/15 skills now at full coverage. setup-browser-cookies remains deferred (needs real browser). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: add bisect commit style to CLAUDE.md All commits should be single logical changes, split before pushing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: bump version and changelog (v0.6.4.0) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> |
||
|
|
a2d756f945 |
feat: Test Bootstrap + Regression Tests + Coverage Audit (v0.6.0) (#136)
* feat: test bootstrap, regression tests, coverage audit, retro test health
- Add {{TEST_BOOTSTRAP}} resolver to gen-skill-docs.ts
- Add Phase 8e.5 regression test generation to /qa and /qa-design-review
- Add Step 3.4 test coverage audit with quality scoring to /ship
- Add test health tracking to /retro
- Add 2 E2E evals (bootstrap + coverage audit)
- Add 26 validation tests
- Update ARCHITECTURE.md placeholder table
- Add 2 P3 TODOs (CI/CD non-GitHub, auto-upgrade weak tests)
* chore: bump version and changelog (v0.6.0)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* feat: make coverage audit trace actual codepaths, not just syntax patterns
Step 3.4 now instructs Claude to read full files, trace data flow through
every branch, diagram the execution, and check each branch against tests.
Phase 8e.5 regression tests now trace the bug's codepath before writing
the test, catching adjacent edge cases.
* feat: coverage audit now maps user flows, interactions, and error states
Step 3.4 now covers the full picture: code branches AND user-facing behavior.
Maps user flows (complete journey through the feature), interaction edge cases
(double-click, back button, stale state, slow connection), error states
(what does the user actually see?), and boundary states (zero results,
10k results, max-length input). Coverage diagram splits into Code Path
Coverage and User Flow Coverage sections with separate percentages.
* fix: raise test gen cap to 20, add validation tests for user flow coverage
- Raise Step 3.4 test generation cap from 10 to 20 (code + user flow combined)
- Add 3 validation tests: codepath tracing, user flow mapping, diagram sections
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
|
||
|
|
4a77cc2c34 |
feat: /plan-design-review + /qa-design-review skills (v0.5.0) (#102)
* feat: add {{DESIGN_METHODOLOGY}} resolver and register design review skills
Add generateDesignMethodology() to gen-skill-docs.ts with 10-category, 80-item
design audit checklist. Register plan-design-review and qa-design-review templates
in findTemplates(). Add both skills to skill-check.ts SKILL_FILES. Add command
and snapshot flag validation tests for both skills in skill-validation.test.ts.
* feat: add /plan-design-review and /qa-design-review skills
/plan-design-review: report-only designer audit with letter grades, AI slop
scoring, structured first impression, design system extraction, DESIGN.md
inference and export offer. Never modifies code.
/qa-design-review: same audit, then iterative fix loop with style(design):
commits, CSS-safe WTF heuristic, before/after screenshots, final re-audit.
* chore: bump version and changelog (v0.5.0)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* docs: update README, ARCHITECTURE for design review skills (v0.5.0)
- Update skill count to 11, add /plan-design-review and /qa-design-review
to skill table, install/uninstall commands, and demo walkthrough
- Add narrative sections: "senior designer mode" and "designer who codes mode"
with compelling examples showing AI Slop detection and design system inference
- Add {{DESIGN_METHODOLOGY}} to ARCHITECTURE.md placeholder table
- Extend demo to show full plan→eng→review→ship→qa→design-review pipeline
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* chore: regenerate design review SKILL.md files after merge from main
Picks up BASE_BRANCH_DETECT resolver and updated contributor mode from main.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: add /design-consultation skill — design consultant that creates DESIGN.md
6-phase consultant flow: product context → competitive research (WebSearch) →
complete coherent proposal → drill-downs on demand → font+color preview page →
write DESIGN.md + update CLAUDE.md. Opinionated recommendations grounded in
product context, not menu-driven forms.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* test: add E2E tests for design skill family (7 tests + LLM quality judge)
Tests 1-4: /design-consultation (core flow, research integration, existing
DESIGN.md handling, font+color preview generation).
Tests 5-6: /plan-design-review (audit report, DESIGN.md export).
Test 7: /qa-design-review (audit + fix loop).
LLM judge validates font blacklist compliance, coherence, and AI slop avoidance.
Also adds plan-design-review + qa-design-review to ALL_SKILLS test array.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* chore: mark /design-consultation as shipped in TODOS.md
Renamed from /setup-design-md to reflect the consultant approach.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
|
||
|
|
1e06b6a5c6 |
fix: dynamic base branch detection across all SKILL templates (v0.3.10) (#81)
* feat: add {{BASE_BRANCH_DETECT}} resolver to gen-skill-docs
DRY placeholder for dynamic base branch detection across PR-targeting
skills. Detects via gh pr view (existing PR base) → gh repo view
(repo default) → fallback to main.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: ship skill detects base branch instead of hardcoding main
Replaces ~14 hardcoded 'main' references with dynamic detection via
{{BASE_BRANCH_DETECT}}. Fixes stacked branches and Conductor workspaces
targeting non-main branches. Adds --base <base> to gh pr create.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: review, qa, plan-ceo-review detect base branch dynamically
Same pattern as ship: replaces hardcoded 'main' with {{BASE_BRANCH_DETECT}}.
Also cleans up qa bash-isms (REPORT_DIR variable, port chaining).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: retro detects default branch instead of hardcoding origin/main
Retro queries commit history (not PR targets), so uses simpler detection:
gh repo view defaultBranchRef. Replaces ~11 origin/main refs with
origin/<default>.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: add explicit cross-step references in gstack-upgrade template
Bash blocks are self-contained, but cross-block variable references
(INSTALL_DIR from Step 2) were implicit. Adds prose making them explicit.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs+test: SKILL authoring guidance + regression tests
Adds "Writing SKILL templates" section to CLAUDE.md explaining that
templates are prompts, not scripts. Adds validation test catching
hardcoded 'main' in git commands, and resolver content test.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: update ARCHITECTURE + CONTRIBUTING for new placeholders
Add {{BASE_BRANCH_DETECT}} to ARCHITECTURE.md placeholder list.
Cross-reference CLAUDE.md template authoring guidance from CONTRIBUTING.md.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* chore: bump version and changelog (v0.3.10)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: add missing blank line between resolver functions
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* test: add 3 E2E smoke tests for base branch detection
- /review: verifies Step 0 detection + git diff against detected base
- /ship: truncated dry-run (Steps 0-1 only, no push/PR), asserts no
destructive actions
- /retro: verifies default branch detection for git log queries
Covers the {{BASE_BRANCH_DETECT}} resolver path (review), the ship
template's dual abort check, and retro's inline detection pattern.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* chore: bump version and changelog (v0.4.2)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
||
|
|
3e3843c4a9 |
feat: contributor mode, session awareness, recommendation format (#90)
* feat: contributor mode, session awareness, universal RECOMMENDATION format
- Rename {{UPDATE_CHECK}} → {{PREAMBLE}} across all 10 skill templates
- Add session tracking (touch ~/.gstack/sessions/$PPID, count active sessions)
- ELI16 mode when 3+ concurrent sessions detected (re-ground user on context)
- Contributor mode: auto-file field reports to ~/.gstack/contributor-logs/
- Universal AskUserQuestion format: context → question → RECOMMENDATION → options
- Update plan-ceo-review and plan-eng-review to reference preamble baseline
- Add vendored symlink awareness section to CLAUDE.md
- Rewrite CONTRIBUTING.md with contributor workflow and cross-project testing
- Add tests for contributor mode and session awareness in generated output
- Add E2E eval for contributor mode report filing
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* feat: add Enum & Value Completeness to /review critical checklist
New CRITICAL review category that traces new enum values, status strings,
and type constants through every consumer outside the diff. Catches the
class of bugs where a new value is added but not handled in all switch/case
chains, allowlists, or frontend-backend contracts.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* chore: bump v0.4.1, user-facing changelog, update qa-only template and architecture docs
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* docs: add CHANGELOG style guide — user-facing, sell the feature
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: rewrite v0.4.1 changelog to be user-facing and sell the features
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: add evals for RECOMMENDATION format, session awareness, and enum completeness
Free tests (Tier 1): RECOMMENDATION format + session awareness in all
preamble SKILL.md files, enum completeness checklist structure and CRITICAL
classification.
E2E eval: /review catches missed enum handlers when a new status value
is added but not handled in case/switch and notify methods.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: add E2E eval for session awareness ELI16 mode
Stubs _SESSIONS=4, gives agent a decision point on feature/add-payments
branch, verifies the output re-grounds the user with project, branch,
context, and RECOMMENDATION — the ELI16 mode behavior for 3+ sessions.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: contributor mode eval marked FAIL due to expected browse error
The test intentionally runs a nonexistent binary to trigger contributor
mode. The session runner's browse error detection catches "no such file
or directory...browse" and sets browseErrors, causing recordE2E to mark
passed=false. Override passed to check only exitReason since the browse
error is the expected scenario.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
|
||
|
|
f3ee0ee28a |
feat: QA restructure, browser ref staleness, eval efficiency metrics (v0.4.0) (#83)
* feat: browser ref staleness detection via async count() validation resolveRef() now checks element count to detect stale refs after page mutations (e.g. SPA navigation). RefEntry stores role+name metadata for better diagnostics. 3 new snapshot tests for staleness detection. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat: qa-only skill, qa fix loop, plan-to-QA artifact flow Add /qa-only (report-only, Edit tool blocked), restructure /qa with find-fix-verify cycle, add {{QA_METHODOLOGY}} DRY placeholder for shared methodology. /plan-eng-review now writes test-plan artifacts to ~/.gstack/projects/<slug>/ for QA consumption. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat: eval efficiency metrics — turns, duration, commentary across all surfaces Add generateCommentary() for natural-language delta interpretation, per-test turns/duration in comparison and summary output, judgePassed unit tests, 3 new E2E tests (qa-only, qa fix loop, plan artifact). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * chore: bump version and changelog (v0.4.0) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * docs: update ARCHITECTURE, BROWSER, CONTRIBUTING, README for v0.4.0 - ARCHITECTURE: add ref staleness detection section, update RefEntry type - BROWSER: add ref staleness paragraph to snapshot system docs - CONTRIBUTING: update eval tool descriptions with commentary feature - README: fix missing qa-only in project-local uninstall command Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * docs: add user-facing benefit descriptions to v0.4.0 changelog Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
43fbe165a4 |
docs: update README, CONTRIBUTING, ARCHITECTURE for v0.3.6
Update test tier costs and commands (Agent SDK → claude -p, SKILL_E2E → EVALS), add E2E observability section to CONTRIBUTING and ARCHITECTURE, add testing quick-start to README. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
5205070299 |
feat: SKILL.md template system, 3-tier testing, DX tools (v0.3.3) (#41)
* refactor: extract command registry to commands.ts, add SNAPSHOT_FLAGS metadata
- NEW: browse/src/commands.ts — command sets + COMMAND_DESCRIPTIONS + load-time validation (zero side effects)
- server.ts imports from commands.ts instead of declaring sets inline
- snapshot.ts: SNAPSHOT_FLAGS array drives parseSnapshotArgs (metadata-driven, no duplication)
- All 186 existing tests pass
* feat: SKILL.md template system with auto-generated command references
- SKILL.md.tmpl + browse/SKILL.md.tmpl with {{COMMAND_REFERENCE}} and {{SNAPSHOT_FLAGS}} placeholders
- scripts/gen-skill-docs.ts generates SKILL.md from templates (supports --dry-run)
- Build pipeline runs gen:skill-docs before binary compilation
- Generated files have AUTO-GENERATED header, committed to git
* test: Tier 1 static validation — 34 tests for SKILL.md command correctness
- test/helpers/skill-parser.ts: extracts $B commands from code blocks, validates against registry
- test/skill-parser.test.ts: 13 parser/validator unit tests
- test/skill-validation.test.ts: 13 tests validating all SKILL.md files + registry consistency
- test/gen-skill-docs.test.ts: 8 generator tests (categories, sorting, freshness)
* feat: DX tools (skill:check, dev:skill) + Tier 2 E2E test scaffolding
- scripts/skill-check.ts: health summary for all SKILL.md files (commands, templates, freshness)
- scripts/dev-skill.ts: watch mode for template development
- test/helpers/session-runner.ts: Agent SDK wrapper for E2E skill tests
- test/skill-e2e.test.ts: 2 E2E tests + 3 stubs (auto-skip inside Claude Code sessions)
- E2E tests must run from plain terminal: SKILL_E2E=1 bun test test/skill-e2e.test.ts
* ci: SKILL.md freshness check on push/PR + TODO updates
- .github/workflows/skill-docs.yml: fails if generated SKILL.md files are stale
- TODO.md: add E2E cost tracking and model pinning to future ideas
* fix: restore rich descriptions lost in auto-generation
- Snapshot flags: add back value hints (-d <N>, -s <sel>, -o <path>)
- Snapshot flags: restore parenthetical context (@e refs, @c refs, etc.)
- Commands: is → includes valid states enum
- Commands: console → notes --errors filter behavior
- Commands: press → lists common keys (Enter, Tab, Escape)
- Commands: cookie-import-browser → describes picker UI
- Commands: dialog-accept → specifies alert/confirm/prompt
- Tips: restore → arrow (was downgraded to ->)
* test: quality evals for generated SKILL.md descriptions
Catches the exact regressions we shipped and caught in review:
- Snapshot flags must include value hints (-d <N>, -s <sel>, -o <path>)
- is command must list all valid states (visible/hidden/enabled/...)
- press command must list example keys (Enter, Tab, Escape)
- console command must describe --errors behavior
- Snapshot -i must mention @e refs, -C must mention @c refs
- All descriptions must be >= 8 chars (no empty stubs)
- Tips section must use → not ->
* feat: LLM-as-judge evals for SKILL.md documentation quality
4 eval tests using Anthropic API (claude-haiku, ~$0.01-0.03/run):
- Command reference table: clarity/completeness/actionability >= 4/5
- Snapshot flags section: same thresholds
- browse/SKILL.md overall quality
- Regression: generated version must score >= hand-maintained baseline
Requires ANTHROPIC_API_KEY. Auto-skips without it.
Run: bun run test:eval (or ANTHROPIC_API_KEY=sk-... bun test test/skill-llm-eval.test.ts)
* chore: bump version to 0.3.3, update changelog
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* docs: add ARCHITECTURE.md, update CLAUDE.md and CONTRIBUTING.md
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* feat: conductor.json lifecycle hooks + .env propagation across worktrees
bin/dev-setup now copies .env from main worktree so API keys carry
over to Conductor workspaces automatically. conductor.json wires up
setup and archive hooks.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* docs: complete CHANGELOG for v0.3.3 (architecture, conductor, .env)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
|