* feat(plan-tune): explicit-consent surface + setup gate for question_tuning Step 0 grows two implicit gates that run before user-intent routing: - Consent gate: question_tuning=false + no marker → offer opt-in (contributor-specific copy variant) - Setup gate: question_tuning=true + declared empty + no marker → run 5-Q wizard Markers (~/.gstack/.question-tuning-prompted, ~/.gstack/.declared-setup-prompted) ensure each user is asked at most once. The Enable+setup section split into "Consent + opt-in" (with contributor framing) and standalone "5-Q setup" reachable from both the consent flow and the setup gate. Also aligns the calibration gate across three docs (V0 said 90+ days, TODOS said 2+ weeks, binary uses 7 days). The fix distinguishes: - Display gate (sample_size>=20, skills>=3, question_ids>=8, days_span>=7): for rendering inferred values in /plan-tune output - Promotion gate (90+ days stable across 3+ skills): for shipping E1 behavior-adapting defaults TODOS.md E1 card updated to reference 90+ days, plus Codex's substrate risk note: generated skill prose is agent-compliance-based, so E1 ships as advisory annotations on AskUserQuestion recommendations, not silent AUTO_DECIDE. Tests can verify templates contain right reads but can't prove agents obey them. Per /plan-eng-review + Codex outside-voice 2026-05-26. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * chore: bump version and changelog (v1.49.0.0) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(bins): honor GSTACK_STATE_ROOT override for test isolation Plan-tune cathedral T1 (per D16 / Codex outside voice). The 3 bins that back /plan-tune (question-log, question-preference, developer-profile) previously ignored GSTACK_STATE_ROOT, so tests that tried to point state at a tempdir via that env var silently wrote to the real ~/.gstack. Make STATE_ROOT take precedence over GSTACK_HOME so the cathedral's E2E + unit tests can isolate cleanly without sledgehammering HOME. Order of precedence: GSTACK_STATE_ROOT > GSTACK_HOME > $HOME/.gstack Matches the existing gstack-paths emission order. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(plan-tune): regression coverage for v1.49 consent + setup gates Plan-tune cathedral T2 + part of T1 follow-up (Codex IRON RULE — regressions get tests). v1.49 shipped two prose-driven implicit gates inside plan-tune Step 0 (consent, setup) with zero test coverage. The cathedral refactors that template heavily; without tests, silent breakage is possible. Three regression families plus a static template assertion: 1. Consent gate fires under qt=false + no marker; goes silent on marker write or qt=true flip. 2. Setup gate fires under qt=true + empty declared + no marker; goes silent when declared populates, marker is written, or qt is still false. 3. Marker idempotency: gates stay silent across 5 re-invocations after a single decline/bail. Markers honored independently. 4. Static template assertion: gate language can't be silently deleted without breaking a test. Also extends gstack-config to honor GSTACK_STATE_ROOT (it was the last bin still ignoring it — caught while writing the tests; without this, tests would silently mutate the user's real config.yaml). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(spikes): Claude hook mutation + Codex session format Plan-tune cathedral T4 (per D5/D10). Two Phase 1 design spikes that downstream tasks (T3, T5, T6, T8, T9) depend on. claude-code-hook-mutation.md - Confirms PreToolUse allow + updatedInput is supported and is the right mechanism for substituting an auto-decided answer. - Pins stdin/stdout JSON schemas with field-by-field reference. - Documents matcher regex syntax for "(AskUserQuestion|mcp__.*__AskUserQuestion)" so Conductor's MCP-routed AUQ is covered. - Captures parallel-hook merge order caveat and our settings.json snippet. codex-session-format.md - Maps the on-disk ~/.codex/sessions/<date>/rollout-*.jsonl schema by event type (response_item 76%, event_msg 19%, turn_context, session_meta). - Critical finding: Codex has NO AskUserQuestion tool. Gstack AUQ-shaped Decision Briefs surface as agent_message text; answer is the next user_message. Two-tier recovery: marker-first (D18), then pattern fallback for hash-only logging. - Confirms logs_2.sqlite is internal telemetry, not session content. - Lists open questions to answer during T9 implementation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(settings-hook): schema-aware PreToolUse/PostToolUse registration Plan-tune cathedral T3 (per D4 + Codex correction). The previous bin only knew SessionStart and dedup'd on the hardcoded `gstack-session-update` substring. The cathedral needs PreToolUse + PostToolUse hooks registered side-by-side with the user's own hooks, with explicit consent UX, backups, and rollback. New subcommands: - add-event --event <SessionStart|PreToolUse|PostToolUse|...> --command <cmd> --source <tag> [--matcher <re>] [--timeout <s>] - remove-source --source <tag> # removes all entries tagged by source - diff-event ... # preview without mutating - rollback # restore latest backup - list-sources # audit gstack-tagged hooks Multi-source dedup via a new `_gstack_source` field on each hook entry (Claude Code preserves unknown fields). Source tag lets plan-tune-cathedral register PreToolUse + PostToolUse without colliding with the existing SessionStart wiring, and lets remove-source clean up cleanly during gstack-uninstall. Backups written automatically to settings.json.bak.<ts> before any mutation, with a .bak-latest pointer the rollback subcommand reads. Existing legacy `add <cmd>` / `remove <cmd>` shape preserved verbatim so setup --team and gstack-uninstall keep working unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(hooks): PostToolUse capture hook for AskUserQuestion Plan-tune cathedral T5. Closes the substrate hole that motivated this entire branch: agent-compliance-only logging produced zero events in weeks of dogfood. PostToolUse hook captures every AUQ fire deterministically. What ships: - hosts/claude/hooks/question-log-hook.ts — TS hook that reads Claude Code's hook stdin, walks tool_input.questions[*], extracts user choice + recommended option from tool_response, spawns gstack-question-log per question. - hosts/claude/hooks/question-log-hook — bash shim Claude Code's hook runner invokes; execs bun against the .ts file. - Marker-first question_id extraction (D18 progressive markers): <gstack-qid:foo-bar> stripped from question text, used as the id. Hash fallback hook-<sha1[:10]> for unmarked questions (observed-only, never used as preference key — D18 hash drift mitigation). - (recommended) label parsing for the user_choice/recommended fields, with refuse-on-ambiguous when two labels are present (D2 safety). - Free-text capture: source=auq-other + free_text field when user picks Other and types (Layer 8 dream cycle input). - Matcher covers both native AskUserQuestion and mcp__*__AskUserQuestion (Codex/Conductor catch from outside voice review). - Crash safety: always exits 0; errors land in ~/.gstack/hook-errors.log so the user's session is never blocked by a hook failure. gstack-question-log extended to: - Accept `source` field (default 'agent', new values: hook, auq-other, auto-decided, codex-import-marker, codex-import-pattern). - Accept `tool_use_id` (<=128 chars) for dedup. - Composite dedup on (source, tool_use_id) across the last 100 lines — protects against hook + preamble both firing on the same tool call (D3 belt+suspenders). - Async fire `gstack-developer-profile --derive` after each successful write so inferred.sample_size actually grows (D17 — without this, the cathedral's "before 0, after >0" metric never moves). - GSTACK_QUESTION_LOG_NO_DERIVE=1 escape hatch for tests. 9 new unit tests covering capture, marker extraction, MCP variant, free-text, dedup, ambiguous-recommended safety, crash paths. All pass plus the existing 88 tests across related files. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(hooks): PreToolUse enforcement hook for AskUserQuestion preferences Plan-tune cathedral T6 — the keystone that makes never-ask actually bind. Today preferences are agent-convention (silently ignored). This hook enforces them via Claude Code's hook protocol: when a never-ask preference matches an AUQ that is two-way + has a marker + has a clear recommendation, the hook returns permissionDecision: "deny" with permissionDecisionReason naming the auto-decided option. The agent obeys the rejection feedback and proceeds with the recommended option without re-firing AUQ. Decision tree (per question): - marker absent → defer (D18: hash IDs are observed-only) - one-way door → defer (safety override — never auto-decide one-way) - always-ask preference → defer - no preference set → defer - ambiguous recommendation (two (recommended) labels OR no parseable rec) → defer (D2 refuse-on-ambiguous) - never-ask / ask-only-for-one-way + two-way + clean rec → deny+reason Preference precedence per D8: project-local (~/.gstack/projects/<slug>/question-preferences.json) wins, global (~/.gstack/global-question-preferences.json) is fallback. Why deny+reason instead of allow+updatedInput: AskUserQuestion's updatedInput shape for "pre-resolve this question" isn't structurally pinned in Claude Code docs (T4 spike open question). deny with a reason that names the auto-decided option is the conservative + reliable v1 — the model receives the rejection, reads the recommended option from the reason, proceeds without re-prompting. Swap to allow+updatedInput once the AUQ input shape is verified against real Claude Code. Since deny prevents PostToolUse from firing, this hook logs the auto-decided event itself via gstack-question-log (source=auto-decided) so /plan-tune's Recent auto-decisions surface picks it up. Also writes a session marker ~/.gstack/sessions/<id>/.auto-decided-<tool_use_id> for coordination when the AUQ-shape switch lands. Multi-question AUQ: enforcement is all-or-nothing per call. If any question in the batch isn't eligible (no marker, no preference, ambiguous rec, etc.), the whole call defers so the user still gets to answer the rest normally. Registry lookup: cheap regex extraction from scripts/question-registry.ts (reading + bun-importing the TS file from a hook is too slow). Door type defaults to two-way for unregistered. Matcher covers both native AskUserQuestion and mcp__*__AskUserQuestion (Conductor disables native — Codex outside-voice catch). 15 unit tests cover defer paths, enforcement, one-way safety override, ambiguous-rec refuse, precedence (project wins, global fallback, project-overrides-global), MCP matcher, auto-decided event logging, session marker writing, crash safety. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(scripts): declared-annotation helper + autonomy signal_key wiring Plan-tune cathedral T7. Adds the helper that lets skills inject one-line plain-English annotations on AUQ recommendations based on the user's declared profile — read-only, advisory-only, per TODOS.md E1 substrate-risk guidance (no AUTO_DECIDE off inferred). scripts/declared-annotation.ts - getDeclaredAnnotation(signal_key) → annotation | null - primaryDimensionFor(signal_key) → Dimension | null - Signature uses kebab signal_key per D2/Codex correction (registry uses hyphens; profile dimensions use underscores; helper maps internally). - Bands: >= 0.7 high, <= 0.3 low, else null. Middle band stays silent. - Per-dimension plain-English phrasing: 5 dimensions × 2 bands = 10 phrases. - Reads ~/.gstack/developer-profile.json (honors GSTACK_STATE_ROOT). scripts/psychographic-signals.ts - New signal_key 'decision-autonomy' that maps user_choice → autonomy dimension nudges. This was the missing signal for the 'autonomy' dimension — without it, the cathedral could annotate four of five declared dimensions but autonomy stayed silent. scripts/question-registry.ts - Add signal_key: 'decision-autonomy' to land-and-deploy-merge-confirm and land-and-deploy-rollback. These are the highest-leverage autonomy questions in the surface — "let me decide" vs "go ahead" is exactly what the dimension captures. 13 unit tests cover the helper's full contract (unknown keys, missing profile, middle-band null, both band thresholds, all five dimensions rendering distinct phrases). Existing 47 plan-tune.test.ts tests still pass after the registry + signal-map enrichment. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(setup): install plan-tune cathedral hooks with explicit consent UX Plan-tune cathedral T8. Wires the new PostToolUse capture hook and PreToolUse enforcement hook into ~/.claude/settings.json via the schema-aware gstack-settings-hook (T3) — respecting D4's "never mutate settings.json silently" boundary and the Codex outside-voice warning. Behavior at setup time: - Idempotency: if list-sources already shows 'plan-tune-cathedral', no-op with a one-line note. - Marker present (previously declined): no-op, no re-prompt. - Interactive terminal: print rationale + diff preview from settings-hook, rollback command, and prompt y/N. On accept, register both hooks (PostToolUse and PreToolUse) with --source plan-tune-cathedral. On decline, touch ~/.gstack/.plan-tune-hooks-prompted so we don't re-ask. - Non-interactive (CI / scripted): no prompt; print the two exact commands the user would need to install manually. - --no-team teardown also removes the plan-tune hooks via remove-source. gstack-uninstall extended to clean up plan-tune-cathedral hooks alongside the existing SessionStart cleanup. Listed as a separate "plan-tune cathedral hooks" line in the REMOVED summary when it fires. No new test file — coverage from T3's gstack-settings-hook-schema-aware tests proves the underlying bin behavior; setup-level integration is verified manually (re-running ./setup is cheap and the prompt makes it obvious whether install happened). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(bin): gstack-codex-session-import — structured Codex transcript parser Plan-tune cathedral T9. Backfills question-log.jsonl from Codex sessions since Codex has no AskUserQuestion tool (per docs/spikes/codex-session-format.md) and gstack AUQ-shaped Decision Briefs show up as agent_message prose. Walks ~/.codex/sessions/<date>/rollout-*.jsonl, matches each agent_message that contains either a <gstack-qid:foo-bar> marker or a D-numbered Decision Brief header, then pairs it with the next user_message for the answer. Two-tier recovery per D5: - marker present → source=codex-import-marker, stable question_id - no marker but D-shape detected → source=codex-import-pattern with hash-only question_id (never used as preference key per D18) Subcommands: gstack-codex-session-import # latest session gstack-codex-session-import <file> # explicit path gstack-codex-session-import --since <iso> # all sessions newer than User-choice extraction handles A/B/C letter responses and prose responses that start with the option label. Recommended option parsed via the "(recommended)" label suffix (same convention as Layer 2). Each extracted event written via gstack-question-log, so source tagging, dedup, and async derive all apply uniformly. spawnSync uses the cwd from session_meta so gstack-slug buckets events into the project the user was actually working in, not the importer's cwd. 7 unit tests cover marker path, pattern fallback, multiple briefs in sequence, missing user_message, numeric/letter user response forms, empty-sessions-dir handling. Smoke-tested against a real ~/.codex/sessions/ file from earlier today — returns IMPORTED: 0 because that session was autonomous (no AUQ-shaped prose), proving the bin doesn't false-positive on unrelated agent_message events. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(bin): gstack-distill-free-text — Layer 8 dream cycle distiller Plan-tune cathedral T10. Reads auq-other free-text events from this project's question-log.jsonl, calls Claude via the Anthropic SDK to extract structured proposals (preference candidates, declared-profile nudges, memory nuggets), writes them to distillation-proposals.json for the user to review via /plan-tune (never autonomous — every apply requires explicit Y). Subcommands: gstack-distill-free-text # sync distill gstack-distill-free-text --background # detach + return PID gstack-distill-free-text --dry-run # emit prompt + events, no API call gstack-distill-free-text --status # run history + cost-to-date D7 rate cap: 3 distills per slug per day. Reads ~/.gstack/distill-cost.jsonl for the count, exits with RATE_CAPPED when limit hit. Cost log lines tagged by slug so sibling projects don't share the cap. Yesterday runs don't count. D6 API auth: Anthropic SDK direct, fail-loud on missing ANTHROPIC_API_KEY with explicit message that distill is a separate billing surface from the interactive Claude Code session. Uses claude-haiku-4-5 for cost (~$0.001/ 1k input, $0.005/1k output) — sufficient for structured extraction. D14 execution context: --background spawns detached (nohup) so auto-trigger during /ship doesn't add 30s of pause; results surface on next /plan-tune. Source events get distilled_at:<ts> stamped on them after the run so they don't re-propose on the next distill. Match by ts + question_id. Cost-log line per run includes: slug, proposals_count, rejected_low_confidence, input_tokens, output_tokens, cost_usd_est. /plan-tune stats reads this to show "$X estimated, N runs this month" per Layer 4 surface. 10 unit tests cover --status, rate cap (3/day, yesterday-not-counted, other-slug-not-counted), no-log/no-free-text paths, --dry-run, missing API key, --background spawn. The actual SDK call is exercised by the T16 E2E test (uses real key, ~$0.001 per run). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(bin): gstack-distill-apply — apply distillation proposals with gbrain tag Plan-tune cathedral T11. Bin that applies a single user-approved proposal from distillation-proposals.json to the right surface: - memory-nugget → appended to ~/.gstack/free-text-memory.json (durable local source-of-truth; gbrain is mirror when configured). - preference → routed through gstack-question-preference --write with source=plan-tune (clears the user-origin gate). - declared-nudge → atomic update to developer-profile.json declared dim, small=0.05, medium=0.10, large=0.15, clamped to [0, 1]. Why a separate bin (not inline in the skill template): /plan-tune's apply step needs to be invokable from any host (Claude, Codex, etc) and must write to multiple state files atomically. A bin centralizes the schema + clamp logic; the skill template just calls it after user Y. gbrain coordination: --gbrain-published true marks the nugget so /plan-tune stats can show "12 nuggets, 8 mirrored to gbrain". The skill template invokes mcp__gbrain__put_page / extract_facts / add_tag in the same turn (those are MCP tools, not CLI-callable) before calling this bin. Local file remains canonical so the PreToolUse hook injection path (T12) doesn't depend on gbrain availability. Subcommands: gstack-distill-apply --list # show pending proposals gstack-distill-apply --proposal <N> # apply, file fallback gstack-distill-apply --proposal <N> --gbrain-published true Applied proposals get applied_at + gbrain_published stamped on them so re-running --list shows only unconsumed ones. 11 unit tests cover --list (all three kinds + quotes), memory-nugget append + non-clobber, preference routing through the gate-respecting bin, declared-nudge math (medium=0.10, small=0.05, large=0.15, clamp at [0,1]), proposal mark-applied with gbrain flag, and error paths (bad index, missing --proposal). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(hooks): Layer 8 memory injection via per-session cache Plan-tune cathedral T12. Extends the PreToolUse hook to inject matching free-text-memory.json nuggets into AskUserQuestion responses, giving the agent + user the distilled context from past 'Other' answers right when the related question fires. Per-session cache (D13 perf): first read of free-text-memory.json writes ~/.gstack/sessions/<id>/memory-cache.json. Subsequent hooks on the same session take the cached path. Invalidation is by file-missing: when the canonical file changes (via gstack-distill-apply), the per-session cache either reflects the staler view for the rest of the session or the session restarts and the cache rebuilds. Cheap, correct enough for v1. Matching logic: - Walk this AUQ batch's questions, extract marker question_ids. - Look up signal_key in scripts/question-registry.ts. - Collect nuggets whose applies_to_signal_keys include any of the matched signal_keys. - Cap to 3 most-recent (by applied_at) so the additionalContext stays short. - Surface as additionalContext on the hookSpecificOutput response. Memory + enforcement interact cleanly: the same hook can both surface nuggets AND deny the tool when a never-ask preference matches. Memory context isn't doubled in the deny reason — the auto-decided option name in the deny path is sufficient signal. 6 new tests cover injection on defer, no-match silence, 3-most-recent cap, memory-alongside-deny enforcement, cache file write-through, empty-canonical graceful degradation. Existing 15 preference-hook tests still green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(plan-tune): SKILL.md surfaces for cathedral T13 Plan-tune cathedral T13. Rewires plan-tune/SKILL.md.tmpl to expose the new cathedral surfaces: Step 0 routing: - Implicit gate #3 (dream-cycle): fires when distillation-proposals.json has unapplied proposals. Marker is per-proposal applied_at so re-firing naturally skips already-handled items. - Added user-intent route for "dream cycle" / "distill" / "what have I been free-texting". - Power-user shortcuts: distill, dream, audit. Stats: - Host-aware source breakdown (SOURCE_HOOK, SOURCE_AGENT, SOURCE_AUTO_DECIDED, SOURCE_CODEX_IMPORT_*, SOURCE_AUQ_OTHER). - MARKED percentage so D18 progressive-markers progress is visible. - Distill cost-to-date via gstack-distill-free-text --status. Recent auto-decisions: - Last 10 source=auto-decided events with question_id + user_choice. Lets the user spot-check enforcement and flip via always-ask. Audit unmarked questions: - Top N hash-only ids by frequency. Surfaces next candidates for the D18 marker retrofit. Dream cycle review + manual distill: - Walks unapplied proposals via AskUserQuestion (one per call), routes accepts through gstack-distill-apply with --gbrain-published flag. Skill template invokes mcp__gbrain__put_page when MCP is available; local file remains source-of-truth. Regenerated SKILL.md via `bun run gen:skill-docs`. All 60 plan-tune tests still green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(preamble): inject <gstack-qid:...> marker convention into question-tuning resolver Plan-tune cathedral T14. Per D18 progressive markers, the PreToolUse enforcement hook only fires when the AUQ question text contains a <gstack-qid:foo-bar> marker the hook can extract. Without a marker, the hook logs the fire as observed-only and skips enforcement (hash IDs drift with prose so they're never used as preference keys). The high-leverage retrofit point is the preamble's Question Tuning section, not 10 individual skill templates. Updating scripts/resolvers/question-tuning.ts adds the marker convention to every tier-≥2 skill in one change — agents running ANY of the 30+ tier-≥2 skills now embed the marker by default when the question matches a registered question_id. Two convention additions in the preamble: 1. "Embed the question_id as a marker (<gstack-qid:{id}>) somewhere in the rendered question." With explanation that the marker is the only path for the PreToolUse hook to enforce preferences. 2. "Embed the option recommendation via the (recommended) label suffix on exactly one option per AUQ." Documents the D2 parser contract: label first, prose fallback, refuse-on-ambiguous. Net cost: ~700 bytes added to the preamble per generated skill. Plan-review preamble budget ratcheted from 39000 → 40000 (test/gen-skill-docs.test.ts) with a comment explaining the cathedral T14 expansion is load-bearing. Regenerated 42 SKILL.md files via `bun run gen:skill-docs`. The token ceiling warning on ship/SKILL.md (~41K tokens) is pre-existing; this PR doesn't change ship's preamble materially. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(ship): plan-tune discoverability nudge after first successful ship Plan-tune cathedral T15 (the ship-side surface; the setup-side surface shipped in T8 with explicit hook-install consent UX). Adds Step 21 to ship/SKILL.md.tmpl: after Step 20 (persist metrics) succeeds, surface /plan-tune once per machine via a marker-gated single-line nudge. Behavior: - If ~/.gstack/.plan-tune-nudge-shown exists → no-op. - If question_tuning is already true → no-op (user already on board). - Otherwise: print one nudge line, touch marker. The nudge mentions both the observational substrate AND the hook-installed auto-decide enforcement so users know what they get when they opt in. Non-blocking — never asks a question, doesn't gate ship completion. To re-show: rm ~/.gstack/.plan-tune-nudge-shown before next ship. Setup-side discoverability shipped in T8 via the hook install prompt (explicit consent + diff preview + backup). Together these two surfaces cover first-install AND first-ship moments — the user discovers plan-tune organically rather than needing to know /plan-tune exists. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(plan-tune): 5 cathedral E2E scenarios + touchfile registration Plan-tune cathedral T16 (per D12 — all 5 in gate tier). One consolidated file with five describeIfSelected scenarios, each selectable by its own touchfile entry so they only run when the relevant code changes (or EVALS_ALL=1 forces all): plan-tune-hook-capture — PostToolUse hook fires → question-log fills plan-tune-enforcement — never-ask + marker + 2-way → deny+reason + auto-decided event logged plan-tune-annotation — declared profile + memory nugget → additionalContext surfaced on defer plan-tune-codex-import — synthetic JSONL → import bin → log with source=codex-import-marker plan-tune-dream-cycle — apply proposal → re-fire question → memory injected via additionalContext Each scenario fixtures an isolated git repo + bins + scripts + hooks under tmp, then exercises the cathedral chain end-to-end against real on-disk binaries (no mocks at the bin layer). GSTACK_STATE_ROOT keeps the user's real ~/.gstack untouched. These five complement the existing unit tests by proving the full sub-process chain works (not just individual functions in isolation). They DON'T spawn claude -p because the cathedral's substrate behavior is deterministic — agent compliance is no longer the variable. The existing test/skill-e2e-plan-tune.test.ts (plan-tune-inspect) still covers the LLM-driven intent-routing behavior. Cost: each scenario runs in ~1s with $0 because no claude -p invocations. Touchfile-gated, so they only run on PRs that touch cathedral code. Also fixes a bug found by the E2E: question-log-hook didn't pass the incoming tool call's cwd to spawnSync when invoking gstack-question-log, so the bin used the hook process's cwd (the repo root) instead of the session's cwd. Result: log writes landed in the wrong project bucket. Fix mirrors the same cwd-passing pattern from question-preference-hook. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: bump VERSION to 1.50.0.0 + plan-tune cathedral CHANGELOG Plan-tune cathedral T17. Bumps VERSION 1.49.0.0 → 1.50.0.0 (MINOR per CLAUDE.md scale-aware rule: this is substantial new capability — 8 layers, ~3000 LOC, 96 new tests, deterministic substrate + dream-cycle distillation). CHANGELOG entry follows the release-summary format from CLAUDE.md: - Two-line bold headline naming what changed for users (deterministic capture, binding preferences, free-text memory loop) - Lead paragraph: before/after framed concretely (zero events captured → every fire, agent-honored → hook-enforced, declared profile → injected context, regex backfill → structured JSONL parser) - Two tables: metric deltas + layer/where-it-lives. Real numbers (96 tests, ~$0.01 per distill, 3/day cap), no AI vocabulary, no em dashes. - "What this means for solo builders" close: ties dream cycle to the compounding loop and points to ./setup as the on-ramp. - Itemized Added/Changed/For contributors sections list every layer's surfaces with file paths. Also: - Refreshed test/fixtures/golden/{claude,codex,factory}-ship-SKILL.md to match the regenerated ship templates (Step 21 nudge added). - Rebased plan-tune entry in parity-baseline-v1.47.0.0.json from 51717 → 64017 bytes with a baseline_note explaining the cathedral T13 expansion. Documents that the new Dream cycle, Recent auto-decisions, Audit unmarked, Dream cycle review/distill sections are load-bearing, not bloat. Without the rebase, the size-budget gate fails — and the cathedral's whole point is making /plan-tune do more, not less. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: bump VERSION 1.50.0.0 → 1.52.0.0 (queue collision with #1742) CI version gate caught: PR #1742 (garrytan/upgrade-gstack-gbrain-v1) already claims v1.50.0.0 and #1751 (garrytan/browser-memory-leak) claims v1.51.0.0. gstack-next-version util recommends v1.52.0.0 as the next free slot. Updates: - VERSION 1.50.0.0 → 1.52.0.0 - package.json version sync - CHANGELOG.md header + metric table label - parity-baseline-v1.47.0.0.json baseline_note reference No content changes; pure slot rebase per the queue. The cathedral scope (8 layers, 96 tests) and CHANGELOG narrative stay identical — same ship, different release number. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: cap audit — remove distill rate cap, loosen size/budget gates Plan-tune cathedral follow-up. The 3/day distill cap was theatrical: at ~$0.01 per Haiku call, even a runaway loop firing every minute would cost ~$14/day, and free-text events are rare enough that the natural input rate self-limits to 1-2 fires/day. Count caps don't protect against runaway bugs (which fire 1000x/second, not 4 times/day) but DO punish heavy users who'd legitimately distill multiple times during a busy week. Removed: 3/day rate cap on bin/gstack-distill-free-text. --status output swapped from "TODAY: N / 3" to "TODAY: N run(s), $X" so users see what they're spending instead of how close they are to a meaningless count. Loosened (caps that exist for real-runaway protection, not normal scope): - EVALS_BUDGET_HARD_CAP_GATE $25 → $200/run - EVALS_BUDGET_HARD_CAP_PERIODIC $70 → $500/run - EVALS_BUDGET_HARD_CAP $30 → $300/run (umbrella fallback) - GSTACK_SIZE_BUDGET_RATIO 1.05 → 1.50 per-skill ratio - plan-review preamble byte budget 40K → 60K Principle: caps exist to catch obvious bugs (infinite retry, model price change, prompt blowup), not to gate legitimate scope growth. Set high enough that real growth never trips them, only bug territory does. Adjusted defaults are 4-8× historical worst case, leaving ample headroom for the next 12 months of legitimate expansion. Tests updated: distill-free-text removes the 3-test rate-cap describe block in favor of "no rate cap" assertion that 10 runs/day pass. Other budget tests still pass because they were never near the old ceilings. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
68 KiB
Skill Deep Dives
Detailed guides for every gstack skill — philosophy, workflow, and examples.
| Skill | Your specialist | What they do |
|---|---|---|
/office-hours |
YC Office Hours | Start here. Six forcing questions that reframe your product before you write code. Pushes back on your framing, challenges premises, generates implementation alternatives. Design doc feeds into every downstream skill. |
/spec |
Spec Author | Turn vague intent into a precise, executable spec in five phases. Backlog-ready output that downstream skills can pick up. Optional agent spawn at the end. |
/plan-ceo-review |
CEO / Founder | Rethink the problem. Find the 10-star product hiding inside the request. Four modes: Expansion, Selective Expansion, Hold Scope, Reduction. |
/plan-eng-review |
Eng Manager | Lock in architecture, data flow, diagrams, edge cases, and tests. Forces hidden assumptions into the open. |
/plan-design-review |
Senior Designer | Interactive plan-mode design review. Rates each dimension 0-10, explains what a 10 looks like, fixes the plan. Works in plan mode. |
/design-consultation |
Design Partner | Build a complete design system from scratch. Knows the landscape, proposes creative risks, generates realistic product mockups. Design at the heart of all other phases. |
/review |
Staff Engineer | Find the bugs that pass CI but blow up in production. Auto-fixes the obvious ones. Flags completeness gaps. |
/investigate |
Debugger | Systematic root-cause debugging. Iron Law: no fixes without investigation. Traces data flow, tests hypotheses, stops after 3 failed fixes. |
/design-review |
Designer Who Codes | Live-site visual audit + fix loop. 80-item audit, then fixes what it finds. Atomic commits, before/after screenshots. |
/design-shotgun |
Design Explorer | Generate multiple AI design variants, open a comparison board in your browser, and iterate until you approve a direction. Taste memory biases toward your preferences. |
/design-html |
Design Engineer | Generates production-quality Pretext-native HTML. Works with approved mockups, CEO plans, design reviews, or from scratch. Text reflows on resize, heights adjust to content. Smart API routing per design type. Framework detection for React/Svelte/Vue. |
/qa |
QA Lead | Test your app, find bugs, fix them with atomic commits, re-verify. Auto-generates regression tests for every fix. |
/qa-only |
QA Reporter | Same methodology as /qa but report only. Use when you want a pure bug report without code changes. |
/scrape |
Browser Data Extractor | Pull data from a web page. First call prototypes via $B; subsequent calls on a matching intent run a codified browser-skill in ~200ms. |
/skillify |
Skill Codifier | Walks back through your conversation, finds the last /scrape prototype, synthesizes script + test + fixture, runs the test, asks before committing. |
/ship |
Release Engineer | Sync main, run tests, audit coverage, push, open PR. Bootstraps test frameworks if you don't have one. One command. |
/land-and-deploy |
Release Engineer | Merge the PR, wait for CI and deploy, verify production health. One command from "approved" to "verified in production." |
/canary |
SRE | Post-deploy monitoring loop. Watches for console errors, performance regressions, and page failures using the browse daemon. |
/benchmark |
Performance Engineer | Baseline page load times, Core Web Vitals, and resource sizes. Compare before/after on every PR. Track trends over time. |
/cso |
Chief Security Officer | OWASP Top 10 + STRIDE threat modeling security audit. Scans for injection, auth, crypto, and access control issues. |
/document-release |
Technical Writer | Update all project docs to match what you just shipped. Catches stale READMEs automatically. |
/document-generate |
Technical Writer | Generate Diataxis docs (tutorial / how-to / reference / explanation) for a feature from code. |
/retro |
Eng Manager | Team-aware weekly retro. Per-person breakdowns, shipping streaks, test health trends, growth opportunities. |
/browse |
QA Engineer | Give the agent eyes. Real Chromium browser, real clicks, real screenshots. ~100ms per command. |
/setup-browser-cookies |
Session Manager | Import cookies from your real browser (Chrome, Arc, Brave, Edge) into the headless session. Test authenticated pages. |
/autoplan |
Review Pipeline | One command, fully reviewed plan. Runs CEO → design → eng → DX review automatically with encoded decision principles. Surfaces only taste decisions for your approval. |
/plan-devex-review |
DX Reviewer | Plan-stage DX review. TTHW (time-to-hello-world), magical moments, friction points, persona traces. Three modes: Expansion, Polish, Triage. |
/devex-review |
DX Reviewer (live) | Live developer experience audit. Walks the actual onboarding flow, measures TTHW, catches the docs lies. |
/plan-tune |
Question Tuner | Self-tune AskUserQuestion sensitivity per question. Mark questions as never-ask, always-ask, or only-for-one-way. |
/spec |
Spec Author | Turn vague intent into a precise, executable spec in five phases. Files a GitHub issue, optionally spawns a Claude Code agent in a fresh worktree, and lets /ship close the source issue on merge. |
/learn |
Memory | Manage what gstack learned across sessions. Review, search, prune, and export project-specific patterns and preferences. |
/context-save |
Save State | Save working context (git state, decisions, remaining work) so any future session can resume. |
/context-restore |
Restore State | Resume from a saved context, even across Conductor workspace handoffs. |
/health |
Code Quality Dashboard | Wraps type checker, linter, tests, dead code detection. Computes a weighted 0-10 score; tracks trends over time. |
/landing-report |
Ship Queue Dashboard | Read-only snapshot of the workspace-aware ship queue. Which version slots are claimed, which sibling workspaces have WIP. |
/benchmark-models |
Model Benchmark | Side-by-side cross-model benchmark for skills (Claude vs GPT vs Gemini). Latency, tokens, cost, optional LLM-judged quality. |
| Multi-AI | ||
/codex |
Second Opinion | Independent review from OpenAI Codex CLI. Three modes: code review (pass/fail gate), adversarial challenge, and open consultation with session continuity. Cross-model analysis when both /review and /codex have run. |
/pair-agent |
Remote Agent Bridge | Pair a remote AI agent (OpenClaw, Codex, Cursor, Hermes) with your browser. Scoped tunnel, locked allowlist, session token. |
/setup-gbrain |
Memory Sync | Set up gbrain for cross-machine session memory sync. One command from zero to live. |
/sync-gbrain |
Keep Brain Current | Refresh gbrain against this repo's code; teach the agent when to use gbrain search/code-def over Grep. Idempotent; safe to re-run. |
| Safety & Utility | ||
/careful |
Safety Guardrails | Warns before destructive commands (rm -rf, DROP TABLE, force-push, git reset --hard). Override any warning. Common build cleanups whitelisted. |
/freeze |
Edit Lock | Restrict all file edits to a single directory. Blocks Edit and Write outside the boundary. Accident prevention for debugging. |
/guard |
Full Safety | Combines /careful + /freeze in one command. Maximum safety for prod work. |
/unfreeze |
Unlock | Remove the /freeze boundary, allowing edits everywhere again. |
/open-gstack-browser |
GStack Browser | Launch GStack Browser with sidebar, anti-bot stealth, auto model routing, cookie import, and Claude Code integration. Watch every action live. |
/setup-deploy |
Deploy Configurator | One-time setup for /land-and-deploy. Detects your platform, production URL, and deploy commands. |
/gstack-upgrade |
Self-Updater | Upgrade gstack to the latest version. Detects global vs vendored install, syncs both, shows what changed. |
/make-pdf |
PDF Generator | Turn any markdown file into a publication-quality PDF. Proper margins, page numbers, cover pages, clickable TOC. |
/ios-qa |
iOS QA Lead | Live-device iOS QA via USB CoreDevice tunnel + embedded StateServer. Reads Swift source, codegens accessors, drives the real iPhone. Optionally exposes the device over Tailscale for remote agents. |
/ios-fix |
iOS Autonomous Fixer | Closes the find→fix→verify loop on a real iPhone. Captures a reproducing snapshot, fixes the source, rebuilds, redeploys, verifies. |
/ios-design-review |
iOS Designer's Eye | 10-dimension Apple HIG audit on a real iPhone. Rates each screen, says what would make it a 10. |
/ios-clean |
iOS Bridge Cleanup | Convenience wrapper to strip DebugBridge SPM + #if DEBUG wiring. The structural Release-build guard is in Package.swift + CI; this skill is for guided manual removals. |
/ios-sync |
iOS Bridge Resync | Regenerate accessors and Swift templates against the latest upstream gstack. Run when you add new @Observable classes or upgrade gstack. |
/office-hours
This is where every project should start.
Before you plan, before you review, before you write code — sit down with a YC-style partner and think about what you're actually building. Not what you think you're building. What you're actually building.
The reframe
Here's what happened on a real project. The user said: "I want to build a daily briefing app for my calendar." Reasonable request. Then it asked about the pain — specific examples, not hypotheticals. They described an assistant missing things, calendar items across multiple Google accounts with stale info, prep docs that were AI slop, events with wrong locations that took forever to track down.
It came back with: "I'm going to push back on the framing, because I think you've outgrown it. You said 'daily briefing app for multi-Google-Calendar management.' But what you actually described is a personal chief of staff AI."
Then it extracted five capabilities the user didn't realize they were describing:
- Watches your calendar across all accounts and detects stale info, missing locations, permission gaps
- Generates real prep work — not logistics summaries, but the intellectual work of preparing for a board meeting, a podcast, a fundraiser
- Manages your CRM — who are you meeting, what's the relationship, what do they want, what's the history
- Prioritizes your time — flags when prep needs to start early, blocks time proactively, ranks events by importance
- Trades money for leverage — actively looks for ways to delegate or automate
That reframe changed the entire project. They were about to build a calendar app. Now they're building something ten times more valuable — because the skill listened to their pain instead of their feature request.
Premise challenge
After the reframe, it presents premises for you to validate. Not "does this sound good?" — actual falsifiable claims about the product:
- The calendar is the anchor data source, but the value is in the intelligence layer on top
- The assistant doesn't get replaced — they get superpowered
- The narrowest wedge is a daily briefing that actually works
- CRM integration is a must-have, not a nice-to-have
You agree, disagree, or adjust. Every premise you accept becomes load-bearing in the design doc.
Implementation alternatives
Then it generates 2-3 concrete implementation approaches with honest effort estimates:
- Approach A: Daily Briefing First — narrowest wedge, ships tomorrow, M effort (human: ~3 weeks / CC: ~2 days)
- Approach B: CRM-First — build the relationship graph first, L effort (human: ~6 weeks / CC: ~4 days)
- Approach C: Full Vision — everything at once, XL effort (human: ~3 months / CC: ~1.5 weeks)
Recommends A because you learn from real usage. CRM data comes naturally in week two.
Two modes
Startup mode — for founders and intrapreneurs building a business. You get six forcing questions distilled from how YC partners evaluate products: demand reality, status quo, desperate specificity, narrowest wedge, observation & surprise, and future-fit. These questions are uncomfortable on purpose. If you can't name a specific human who needs your product, that's the most important thing to learn before writing any code.
Builder mode — for hackathons, side projects, open source, learning, and having fun. You get an enthusiastic collaborator who helps you find the coolest version of your idea. What would make someone say "whoa"? What's the fastest path to something you can share? The questions are generative, not interrogative.
The design doc
Both modes end with a design doc written to ~/.gstack/projects/ — and that doc feeds directly into /plan-ceo-review and /plan-eng-review. The full lifecycle is now: office-hours → plan → implement → review → QA → ship → retro.
After the design doc is approved, /office-hours reflects on what it noticed about how you think — not generic praise, but specific callbacks to things you said during the session. The observations appear in the design doc too, so you re-encounter them when you re-read later.
/plan-ceo-review
This is my founder mode.
This is where I want the model to think with taste, ambition, user empathy, and a long time horizon. I do not want it taking the request literally. I want it asking a more important question first:
What is this product actually for?
I think of this as Brian Chesky mode.
The point is not to implement the obvious ticket. The point is to rethink the problem from the user's point of view and find the version that feels inevitable, delightful, and maybe even a little magical.
Example
Say I am building a Craigslist-style listing app and I say:
"Let sellers upload a photo for their item."
A weak assistant will add a file picker and save an image.
That is not the real product.
In /plan-ceo-review, I want the model to ask whether "photo upload" is even the feature. Maybe the real feature is helping someone create a listing that actually sells.
If that is the real job, the whole plan changes.
Now the model should ask:
- Can we identify the product from the photo?
- Can we infer the SKU or model number?
- Can we search the web and draft the title and description automatically?
- Can we pull specs, category, and pricing comps?
- Can we suggest which photo will convert best as the hero image?
- Can we detect when the uploaded photo is ugly, dark, cluttered, or low-trust?
- Can we make the experience feel premium instead of like a dead form from 2007?
That is what /plan-ceo-review does for me.
It does not just ask, "how do I add this feature?" It asks, "what is the 10-star product hiding inside this request?"
Four modes
- SCOPE EXPANSION — dream big. The agent proposes the ambitious version. Every expansion is presented as an individual decision you opt into. Recommends enthusiastically.
- SELECTIVE EXPANSION — hold your current scope as the baseline, but see what else is possible. The agent surfaces opportunities one by one with neutral recommendations — you cherry-pick the ones worth doing.
- HOLD SCOPE — maximum rigor on the existing plan. No expansions surfaced.
- SCOPE REDUCTION — find the minimum viable version. Cut everything else.
Visions and decisions are persisted to ~/.gstack/projects/ so they survive beyond the conversation. Exceptional visions can be promoted to docs/designs/ in your repo for the team.
/plan-eng-review
This is my eng manager mode.
Once the product direction is right, I want a different kind of intelligence entirely. I do not want more sprawling ideation. I do not want more "wouldn't it be cool if." I want the model to become my best technical lead.
This mode should nail:
- architecture
- system boundaries
- data flow
- state transitions
- failure modes
- edge cases
- trust boundaries
- test coverage
And one surprisingly big unlock for me: diagrams.
LLMs get way more complete when you force them to draw the system. Sequence diagrams, state diagrams, component diagrams, data-flow diagrams, even test matrices. Diagrams force hidden assumptions into the open. They make hand-wavy planning much harder.
So /plan-eng-review is where I want the model to build the technical spine that can carry the product vision.
Example
Take the same listing app example.
Let's say /plan-ceo-review already did its job. We decided the real feature is not just photo upload. It is a smart listing flow that:
- uploads photos
- identifies the product
- enriches the listing from the web
- drafts a strong title and description
- suggests the best hero image
Now /plan-eng-review takes over.
Now I want the model to answer questions like:
- What is the architecture for upload, classification, enrichment, and draft generation?
- Which steps happen synchronously, and which go to background jobs?
- Where are the boundaries between app server, object storage, vision model, search/enrichment APIs, and the listing database?
- What happens if upload succeeds but enrichment fails?
- What happens if product identification is low-confidence?
- How do retries work?
- How do we prevent duplicate jobs?
- What gets persisted when, and what can be safely recomputed?
And this is where I want diagrams — architecture diagrams, state models, data-flow diagrams, test matrices. Diagrams force hidden assumptions into the open. They make hand-wavy planning much harder.
That is /plan-eng-review.
Not "make the idea smaller." Make the idea buildable.
Review Readiness Dashboard
Every review (CEO, Eng, Design) logs its result. At the end of each review, you see a dashboard:
+====================================================================+
| REVIEW READINESS DASHBOARD |
+====================================================================+
| Review | Runs | Last Run | Status | Required |
|-----------------|------|---------------------|-----------|----------|
| Eng Review | 1 | 2026-03-16 15:00 | CLEAR | YES |
| CEO Review | 1 | 2026-03-16 14:30 | CLEAR | no |
| Design Review | 0 | — | — | no |
+--------------------------------------------------------------------+
| VERDICT: CLEARED — Eng Review passed |
+====================================================================+
Eng Review is the only required gate (disable with gstack-config set skip_eng_review true). CEO and Design are informational — recommended for product and UI changes respectively.
Plan-to-QA flow
When /plan-eng-review finishes the test review section, it writes a test plan artifact to ~/.gstack/projects/. When you later run /qa, it picks up that test plan automatically — your engineering review feeds directly into QA testing with no manual copy-paste.
/plan-design-review
This is my senior designer reviewing your plan — before you write a single line of code.
Most plans describe what the backend does but never specify what the user actually sees. Empty states? Error states? Loading states? Mobile layout? AI slop risk? These decisions get deferred to "figure it out during implementation" — and then an engineer ships "No items found." as the empty state because nobody specified anything better.
/plan-design-review catches all of this during planning, when it's cheap to fix.
It works like /plan-ceo-review and /plan-eng-review — interactive, one issue at a time, with the STOP + AskUserQuestion pattern. It rates each design dimension 0-10, explains what a 10 looks like, then edits the plan to get there. The rating drives the work: rate low = lots of fixes, rate high = quick pass.
Seven passes over the plan: information architecture, interaction state coverage, user journey, AI slop risk, design system alignment, responsive/accessibility, and unresolved design decisions. For each pass, it finds gaps and either fixes them directly (obvious ones) or asks you to make a design choice (genuine tradeoffs).
Example
You: /plan-design-review
Claude: Initial Design Rating: 4/10
"This plan describes a user dashboard but never specifies
what the user sees first. It says 'cards with icons' —
which looks like every SaaS template. It mentions zero
loading states, zero empty states, and no mobile behavior."
Pass 1 (Info Architecture): 3/10
"A 10 would define primary/secondary/tertiary content
hierarchy for every screen."
→ Added information hierarchy section to plan
Pass 2 (Interaction States): 2/10
"The plan has 4 UI features but specifies 0 out of 20
interaction states (4 features × 5 states each)."
→ Added interaction state table to plan
Pass 4 (AI Slop): 4/10
"The plan says 'clean, modern UI with cards and icons'
and 'hero section with gradient'. These are the top 2
AI-generated-looking patterns."
→ Rewrote UI descriptions with specific, intentional alternatives
Overall: 4/10 → 8/10 after fixes
"Plan is design-complete. Run /design-review after
implementation for visual QA."
When you re-run it, sections already at 8+ get a quick pass. Sections below 8 get full treatment. For live-site visual audits post-implementation, use /design-review.
/design-consultation
This is my design partner mode.
/plan-design-review audits a site that already exists. /design-consultation is for when you have nothing yet — no design system, no font choices, no color palette. You are starting from zero and you want a senior designer to sit down with you and build the whole visual identity together.
It is a conversation, not a form. The agent asks about your product, your users, and your audience. It thinks about what your product needs to communicate — trust, speed, craft, warmth, whatever fits — and works backward from that to concrete choices. Then it proposes a complete, coherent design system: aesthetic direction, typography (3+ fonts with specific roles), color palette with hex values, spacing scale, layout approach, and motion strategy. Every recommendation comes with a rationale. Every choice reinforces every other choice.
But coherence is table stakes. Every dev tool dashboard looks the same — clean sans-serif, muted grays, a blue accent. They are all coherent. They are all forgettable. The difference between a product that looks "nice" and one that people actually recognize is the deliberate creative risks: the unexpected serif for headings, the bold accent nobody else in your category uses, the tighter spacing that makes your data feel authoritative instead of airy.
That is what /design-consultation is really about. It does not just propose a safe system. It proposes safe choices AND risks — and tells you which is which. "Here are the choices that keep you literate in your category. And here is where I think you should break from convention, and why." You pick which risks to take. The agent checks that the whole system still coheres either way.
If you want, the agent will research what's already out there in your space — take screenshots of real sites, analyze their fonts and colors and spacing — so you can see the landscape before you make choices. This is not about copying. It is about getting in the ballpark so you know what the conventions are, and then deciding which ones are worth breaking.
After you agree on the system, it generates an interactive HTML preview page — not just swatches and font samples, but realistic product pages. If you are building a dashboard, you see a dashboard with a sidebar, data tables, and stat cards. If you are building a marketing site, you see a hero section with real copy and a CTA. Everything rendered in your design system, with your product name, in light and dark mode. You see what your product could feel like before a single line of production code is written.
Then it writes DESIGN.md to your repo root — your project's design source of truth — and updates CLAUDE.md so every future Claude Code session respects the system. From that point on, /design-review can audit against it, and any agent working on your frontend knows the rules.
Example
You: /design-consultation
Claude: From the README, this looks like a developer analytics dashboard
for engineering teams. Sound right? Tell me about the feeling
you want — and would you like me to look at what's out there
first so we can see the landscape?
You: Yes, it's for eng managers. Show me what's out there.
Claude: [browses Linear, Vercel, Datadog, Posthog — takes screenshots]
Here's the landscape. They all converge on clean sans-serifs
and muted blue-gray palettes. Linear stands out with dark-first
and tight spacing. The rest feel interchangeable.
Here's my proposal — and where I'd play it safe vs. take risks:
AESTHETIC: Industrial/Utilitarian — data-first, every pixel earns its place
TYPOGRAPHY: Geist (body + UI), Instrument Serif (display), JetBrains Mono (data)
COLOR: Restrained — slate neutrals, single teal accent (#0D9488)
SPACING: 4px base, compact density
LAYOUT: Grid-disciplined, 12-column
MOTION: Minimal-functional — only transitions that aid comprehension
SAFE CHOICES (match category expectations):
Geist for body — your users already know this font from Vercel
Slate neutrals — warm grays feel right for data-heavy UIs
Grid-disciplined layout — dashboards need predictable alignment
RISKS (where you'd stand out):
Instrument Serif for display headings — nobody in dev tools
uses a serif. It creates instant hierarchy contrast and
signals "we care about craft" in a space full of sans-serifs.
Single teal accent instead of blue — everyone else in this space uses blue.
Teal is close enough to feel trustworthy, different enough
to be recognizable.
4px compact density — tighter than Linear, says "we respect
your screen real estate."
The safe choices keep you literate. The risks are where your
product gets its own face. Which risks appeal to you?
You: Love the serif risk. Let's see it.
Claude: Preview opened in your browser.
[/tmp/design-consultation-preview-1710000000.html]
The page shows a realistic dashboard: sidebar nav, header with
avatar, stat cards, data table — all in your design system.
Instrument Serif at hero scale, Geist in body text, JetBrains
Mono in the data table. Teal accent on buttons and active states.
Light and dark mode toggle in the corner.
How does it feel? Want to adjust anything before I write DESIGN.md?
You: Ship it.
Claude: Wrote DESIGN.md (typography, color, spacing, layout, motion).
Updated CLAUDE.md with design system guidelines.
All future sessions will respect these choices.
/design-review
This is my designer who codes mode.
/plan-design-review reviews your plan before implementation. /design-review audits and fixes the live site after.
It runs an 80-item visual audit on your live site, then enters a fix loop: for each design finding, it locates the source file, makes the minimal CSS/styling change, commits with style(design): FINDING-NNN, re-navigates to verify, and takes before/after screenshots. One commit per fix, fully bisectable.
The self-regulation heuristic is tuned for design work — CSS-only changes get a free pass (they are inherently safe and reversible), but changes to component JSX/TSX files count against the risk budget. Hard cap at 30 fixes. If the risk score exceeds 20%, it stops and asks.
Example
You: /design-review https://myapp.com
Claude: [Runs full 80-item visual audit on the live site]
Design Score: C | AI Slop Score: D
12 findings (4 high, 5 medium, 3 polish)
Fixing 9 design issues...
style(design): FINDING-001 — replace 3-column icon grid with asymmetric layout
style(design): FINDING-002 — add heading scale 48/32/24/18/16
style(design): FINDING-003 — remove gradient hero, use bold typography
style(design): FINDING-004 — add second font for headings
style(design): FINDING-005 — vary border-radius by element role
style(design): FINDING-006 — left-align body text, reserve center for headings
style(design): FINDING-007 — add hover/focus states to all interactive elements
style(design): FINDING-008 — add prefers-reduced-motion media query
style(design): FINDING-009 — set max content width to 680px for body text
Final audit:
Design Score: C → B+ | AI Slop Score: D → A
9 fixes applied (8 verified, 1 best-effort). 3 deferred.
[Report with before/after screenshots saved to .gstack/design-reports/]
Nine commits, each touching one concern. The AI Slop score went from D to A because the three most recognizable patterns (gradient hero, 3-column grid, uniform radius) are gone.
/design-shotgun
This is my design exploration mode.
You know the feeling. You have a feature, a page, a landing screen... and you're not sure what it should look like. You could describe it to Claude and get one answer. But one answer means one perspective, and design is a taste game. You need to see options.
/design-shotgun generates 3 visual design variants using the GPT Image API, opens a comparison board in your browser, and waits for your feedback. You pick a direction, request changes, or ask for entirely new variants. The board supports remix, regenerate, and approval actions.
The loop
- You describe what you want (or point at an existing page)
- The skill reads your
DESIGN.mdfor brand constraints (if it exists) - It generates 3 distinct design variants as PNGs
- A comparison board opens in your browser with all 3 side-by-side
- You click "Approve" on the one you like, or give feedback for another round
- The approved variant saves to
~/.gstack/projects/$SLUG/designs/with anapproved.json
That approved.json is one way to feed /design-html. The design pipeline chains: shotgun picks the direction, design-html renders it as working code. But /design-html also works with CEO plans, design reviews, or just a description.
Taste memory
The skill remembers your preferences across sessions. If you consistently prefer minimal designs over busy ones, it biases future generations. This isn't a setting you configure... it emerges from your approvals.
Example
You: /design-shotgun — hero section for a developer tools landing page
Claude: [Generates 3 variants]
Variant A: Bold typography, dark background, code snippet hero
Variant B: Split layout, product screenshot left, copy right
Variant C: Minimal, centered headline, gradient accent
[Opens comparison board at localhost:PORT]
You: [Clicks "Approve" on Variant A in the browser]
Claude: Approved Variant A. Saved to ~/.gstack/projects/myapp/designs/
Next: run /design-html to generate production HTML from this mockup.
/design-html
This is my design-to-code mode.
Every AI code generation tool produces static CSS. Hardcoded heights. Text that overflows on resize. Breakpoints that snap instead of flowing. The output looks right at exactly one viewport size and breaks at every other.
/design-html fixes this. It generates HTML using Pretext by Cheng Lou (ex-React core, Midjourney frontend). Pretext is a 15KB library that computes text layout without DOM measurement. Text reflows. Heights adjust to content. Cards size themselves. Chat bubbles shrinkwrap. All sub-millisecond, all dynamic.
It works with multiple input sources: an approved mockup from /design-shotgun, a CEO plan from /plan-ceo-review, design review context from /plan-design-review, a PNG you provide, or just a description of what you want. It detects what context exists and asks how you want to proceed.
Smart API routing
Not every page needs the full Pretext engine. The skill reads the design and picks the right tools:
- Simple layouts (landing, marketing):
prepare()+layout()for resize-aware heights - Card grids (dashboard, listing):
prepare()+layout()for self-sizing cards - Chat UIs:
walkLineRanges()for tight-fit bubbles with zero wasted pixels - Editorial layouts:
layoutNextLine()for text flowing around obstacles - Complex editorial: Full engine with
layoutWithLines()for manual line rendering
The refinement loop
- Reads the approved mockup from
approved.json - Uses GPT-4o vision to extract implementation spec (colors, typography, layout)
- Generates self-contained HTML with Pretext inlined (15KB, zero network dependency)
- Spins up a live-reload server so you see changes instantly
- Screenshots at 3 viewports (mobile, tablet, desktop) to verify layout
- AskUserQuestion: what needs to change?
- Surgical edits via the Edit tool (not full regeneration)
- Repeat until you say "done"
Framework detection
If your project uses React, Svelte, or Vue (detected from package.json), the skill offers to generate a framework component instead of vanilla HTML. Framework output uses npm install @chenglou/pretext instead of inline vendoring.
Example
You: /design-html
Claude: Found approved mockup: variant-A.png (hero section)
Design type: simple layout → using prepare() + layout()
[Generates HTML with Pretext, opens live-reload server]
[Screenshots at 375px, 768px, 1440px — all verified]
The HTML is live at localhost:3456. Resize your browser.
Text reflows dynamically. What needs to change?
You: Make the headline bigger and add more spacing below the CTA
Claude: [Edits font-size and margin, browser auto-reloads]
Updated. Two changes applied.
You: Done
Claude: Saved to ~/.gstack/projects/myapp/designs/hero-20260330/finalized.html
/review
This is my paranoid staff engineer mode.
Passing tests do not mean the branch is safe.
/review exists because there is a whole class of bugs that can survive CI and still punch you in the face in production. This mode is not about dreaming bigger. It is not about making the plan prettier. It is about asking:
What can still break?
This is a structural audit, not a style nitpick pass. I want the model to look for things like:
- N+1 queries
- stale reads
- race conditions
- bad trust boundaries
- missing indexes
- escaping bugs
- broken invariants
- bad retry logic
- tests that pass while missing the real failure mode
- forgotten enum handlers — add a new status or type constant, and
/reviewtraces it through every switch statement and allowlist in your codebase, not just the files you changed
Fix-First
Findings get action, not just listed. Obvious mechanical fixes (dead code, stale comments, N+1 queries) are applied automatically — you see [AUTO-FIXED] file:line Problem → what was done for each one. Genuinely ambiguous issues (security, race conditions, design decisions) get surfaced for your call.
Completeness gaps
/review now flags shortcut implementations where the complete version costs less than 30 minutes of CC time. If you chose the 80% solution and the 100% solution is a lake, not an ocean, the review will call it out.
Example
Suppose the smart listing flow is implemented and the tests are green.
/review should still ask:
- Did I introduce an N+1 query when rendering listing photos or draft suggestions?
- Am I trusting client-provided file metadata instead of validating the actual file?
- Can two tabs race and overwrite cover-photo selection or item details?
- Do failed uploads leave orphaned files in storage forever?
- Can the "exactly one hero image" rule break under concurrency?
- If enrichment APIs partially fail, do I degrade gracefully or save garbage?
- Did I accidentally create a prompt injection or trust-boundary problem by pulling web data into draft generation?
That is the point of /review.
I do not want flattery here. I want the model imagining the production incident before it happens.
/investigate
When something is broken and you don't know why, /investigate is your systematic debugger. It follows the Iron Law: no fixes without root cause investigation first.
Instead of guessing and patching, it traces data flow, matches against known bug patterns, and tests hypotheses one at a time. If three fix attempts fail, it stops and questions the architecture instead of thrashing. This prevents the "let me try one more thing" spiral that wastes hours.
/qa
This is my QA lead mode.
/browse gives the agent eyes. /qa gives it a testing methodology.
The most common use case: you're on a feature branch, you just finished coding, and you want to verify everything works. Just say /qa — it reads your git diff, identifies which pages and routes your changes affect, spins up the browser, and tests each one. No URL required. No manual test plan.
Four modes:
- Diff-aware (automatic on feature branches) — reads
git diff main, identifies affected pages, tests them specifically - Full — systematic exploration of the entire app. 5-15 minutes. Documents 5-10 well-evidenced issues.
- Quick (
--quick) — 30-second smoke test. Homepage + top 5 nav targets. - Regression (
--regression baseline.json) — run full mode, then diff against a previous baseline.
Automatic regression tests
When /qa fixes a bug and verifies it, it automatically generates a regression test that catches the exact scenario that broke. Tests include full attribution tracing back to the QA report.
Example
You: /qa https://staging.myapp.com
Claude: [Explores 12 pages, fills 3 forms, tests 2 flows]
QA Report: staging.myapp.com — Health Score: 72/100
Top 3 Issues:
1. CRITICAL: Checkout form submits with empty required fields
2. HIGH: Mobile nav menu doesn't close after selecting an item
3. MEDIUM: Dashboard chart overlaps sidebar below 1024px
[Full report with screenshots saved to .gstack/qa-reports/]
Testing authenticated pages: Use /setup-browser-cookies first to import your real browser sessions, then /qa can test pages behind login.
/ship
This is my release machine mode.
Once I have decided what to build, nailed the technical plan, and run a serious review, I do not want more talking. I want execution.
/ship is for the final mile. It is for a ready branch, not for deciding what to build.
This is where the model should stop behaving like a brainstorm partner and start behaving like a disciplined release engineer: sync with main, run the right tests, make sure the branch state is sane, update changelog or versioning if the repo expects it, push, and create or update the PR.
Test bootstrap
If your project doesn't have a test framework, /ship sets one up — detects your runtime, researches the best framework, installs it, writes 3-5 real tests for your actual code, sets up CI/CD (GitHub Actions), and creates TESTING.md. 100% test coverage is the goal — tests make vibe coding safe instead of yolo coding.
Coverage audit
Every /ship run builds a code path map from your diff, searches for corresponding tests, and produces an ASCII coverage diagram with quality stars. Gaps get tests auto-generated. Your PR body shows the coverage: Tests: 42 → 47 (+5 new).
Review gate
/ship checks the Review Readiness Dashboard before creating the PR. If the Eng Review is missing, it asks — but won't block you. Decisions are saved per-branch so you're never re-asked.
A lot of branches die when the interesting work is done and only the boring release work is left. Humans procrastinate that part. AI should not.
/land-and-deploy
This is my deploy pipeline mode.
/ship creates the PR. /land-and-deploy finishes the job: merge, deploy, verify.
It merges the PR, waits for CI, waits for the deploy to finish, then runs canary checks against production. One command from "approved" to "verified in production." If the deploy breaks, it tells you what failed and whether to rollback.
First run on a new project triggers a dry-run walk-through so you can verify the pipeline before it does anything irreversible. After that, it trusts the config and runs straight through.
Setup
Run /setup-deploy first. It detects your platform (Fly.io, Render, Vercel, Netlify, Heroku, GitHub Actions, or custom), discovers your production URL and health check endpoints, and writes the config to CLAUDE.md. One-time, 60 seconds.
Example
You: /land-and-deploy
Claude: Merging PR #42...
CI: 3/3 checks passed
Deploy: Fly.io — deploying v2.1.0...
Health check: https://myapp.fly.dev/health → 200 OK
Canary: 5 pages checked, 0 console errors, p95 < 800ms
Production verified. v2.1.0 is live.
/canary
This is my post-deploy monitoring mode.
After deploy, /canary watches the live site for trouble. It loops through your key pages using the browse daemon, checking for console errors, performance regressions, page failures, and visual anomalies. Takes periodic screenshots and compares against pre-deploy baselines.
Use it right after /land-and-deploy, or schedule it to run periodically after a risky deploy.
You: /canary https://myapp.com
Claude: Monitoring 8 pages every 2 minutes...
Cycle 1: ✓ All pages healthy. p95: 340ms. 0 console errors.
Cycle 2: ✓ All pages healthy. p95: 380ms. 0 console errors.
Cycle 3: ⚠ /dashboard — new console error: "TypeError: Cannot read
property 'map' of undefined" at dashboard.js:142
Screenshot saved.
Alert: 1 new console error after 3 monitoring cycles.
/benchmark
This is my performance engineer mode.
/benchmark establishes performance baselines for your pages: load time, Core Web Vitals (LCP, CLS, INP), resource counts, and total transfer size. Run it before and after a PR to catch regressions.
It uses the browse daemon for real Chromium measurements, not synthetic estimates. Multiple runs averaged. Results persist so you can track trends across PRs.
You: /benchmark https://myapp.com
Claude: Benchmarking 5 pages (3 runs each)...
/ load: 1.2s LCP: 0.9s CLS: 0.01 resources: 24 (890KB)
/dashboard load: 2.1s LCP: 1.8s CLS: 0.03 resources: 31 (1.4MB)
/settings load: 0.8s LCP: 0.6s CLS: 0.00 resources: 18 (420KB)
Baseline saved. Run again after changes to compare.
/cso
This is my Chief Security Officer.
Run /cso on any codebase and it performs an OWASP Top 10 + STRIDE threat model audit. It scans for injection vulnerabilities, broken authentication, sensitive data exposure, XML external entities, broken access control, security misconfiguration, XSS, insecure deserialization, known-vulnerable components, and insufficient logging. Each finding includes severity, evidence, and a recommended fix.
You: /cso
Claude: Running OWASP Top 10 + STRIDE security audit...
CRITICAL: SQL injection in user search (app/models/user.rb:47)
HIGH: Session tokens stored in localStorage (app/frontend/auth.ts:12)
MEDIUM: Missing rate limiting on /api/login endpoint
LOW: X-Frame-Options header not set
4 findings across 12 files scanned. 1 critical, 1 high.
/document-release
This is my technical writer mode.
After /ship creates the PR but before it merges, /document-release reads every documentation file in the project and cross-references it against the diff. It updates file paths, command lists, project structure trees, and anything else that drifted. Risky or subjective changes get surfaced as questions — everything else is handled automatically.
You: /document-release
Claude: Analyzing 21 files changed across 3 commits. Found 8 documentation files.
README.md: updated skill count from 9 to 10, added new skill to table
CLAUDE.md: added new directory to project structure
CONTRIBUTING.md: current — no changes needed
TODOS.md: marked 2 items complete, added 1 new item
All docs updated and committed. PR body updated with doc diff.
It also polishes CHANGELOG voice (without ever overwriting entries), cleans up completed TODOS, checks cross-doc consistency, and asks about VERSION bumps only when appropriate.
/retro
This is my engineering manager mode.
At the end of the week I want to know what actually happened. Not vibes — data. /retro analyzes commit history, work patterns, and shipping velocity and writes a candid retrospective.
It is team-aware. It identifies who is running the command, gives you the deepest treatment on your own work, then breaks down every contributor with specific praise and growth opportunities. It computes metrics like commits, LOC, test ratio, PR sizes, and fix ratio. It detects coding sessions from commit timestamps, finds hotspot files, tracks shipping streaks, and identifies the biggest ship of the week.
It also tracks test health: total test files, tests added this period, regression test commits, and trend deltas. If test ratio drops below 20%, it flags it as a growth area.
Example
You: /retro
Claude: Week of Mar 1: 47 commits (3 contributors), 3.2k LOC, 38% tests, 12 PRs, peak: 10pm | Streak: 47d
## Your Week
32 commits, +2.4k LOC, 41% tests. Peak hours: 9-11pm.
Biggest ship: cookie import system (browser decryption + picker UI).
What you did well: shipped a complete feature with encryption, UI, and
18 unit tests in one focused push...
## Team Breakdown
### Alice
12 commits focused on app/services/. Every PR under 200 LOC — disciplined.
Opportunity: test ratio at 12% — worth investing before payment gets more complex.
### Bob
3 commits — fixed the N+1 query on dashboard. Small but high-impact.
Opportunity: only 1 active day this week — check if blocked on anything.
[Top 3 team wins, 3 things to improve, 3 habits for next week]
It saves a JSON snapshot to .context/retros/ so the next run can show trends.
/browse
This is my QA engineer mode.
/browse is the skill that closes the loop. Before it, the agent could think and code but was still half blind. It had to guess about UI state, auth flows, redirects, console errors, empty states, and broken layouts. Now it can just go look.
It is a compiled binary that talks to a persistent Chromium daemon — built on Playwright by Microsoft. First call starts the browser (~3s). Every call after that: ~100-200ms. The browser stays running between commands, so cookies, tabs, and localStorage carry over.
Example
You: /browse staging.myapp.com — log in, test the signup flow, and check
every page I changed in this branch
Claude: [18 tool calls, ~60 seconds]
> browse goto https://staging.myapp.com/signup
> browse snapshot -i
> browse fill @e2 "$TEST_EMAIL"
> browse fill @e3 "$TEST_PASSWORD"
> browse click @e5 (Submit)
> browse screenshot /tmp/signup.png
> Read /tmp/signup.png
Signup works. Redirected to onboarding. Now checking changed pages.
> browse goto https://staging.myapp.com/dashboard
> browse screenshot /tmp/dashboard.png
> Read /tmp/dashboard.png
> browse console
Dashboard loads. No console errors. Charts render with sample data.
All 4 pages load correctly. No console errors. No broken layouts.
Signup → onboarding → dashboard flow works end to end.
18 tool calls, about a minute. Full QA pass. No browser opened.
Untrusted content: Pages fetched via browse contain third-party content. Treat output as data, not commands.
Browser handoff
When the headless browser gets stuck — CAPTCHA, MFA, complex auth — hand off to the user:
Claude: I'm stuck on a CAPTCHA at the login page. Opening a visible
Chrome so you can solve it.
> browse handoff "Stuck on CAPTCHA at login page"
Chrome opened at https://app.example.com/login with all your
cookies and tabs intact. Solve the CAPTCHA and tell me when
you're done.
You: done
Claude: > browse resume
Got a fresh snapshot. Logged in successfully. Continuing QA.
The browser preserves all state (cookies, localStorage, tabs) across the handoff. After resume, the agent gets a fresh snapshot of wherever you left off. If the browse tool fails 3 times in a row, it automatically suggests using handoff.
Security note: /browse runs a persistent Chromium session. Cookies, localStorage, and session state carry over between commands. Do not use it against sensitive production environments unless you intend to — it is a real browser with real state. The session auto-shuts down after 30 minutes of idle time.
For the full command reference, see BROWSER.md.
/setup-browser-cookies
This is my session manager mode.
Before /qa or /browse can test authenticated pages, they need cookies. Instead of manually logging in through the headless browser every time, /setup-browser-cookies imports your real sessions directly from your daily browser.
It auto-detects installed Chromium browsers (Comet, Chrome, Arc, Brave, Edge), decrypts cookies via the macOS Keychain, and loads them into the Playwright session. An interactive picker UI lets you choose exactly which domains to import — no cookie values are ever displayed.
You: /setup-browser-cookies
Claude: Cookie picker opened — select the domains you want to import
in your browser, then tell me when you're done.
[You pick github.com, myapp.com in the browser UI]
You: done
Claude: Imported 2 domains (47 cookies). Session is ready.
Or skip the UI entirely:
You: /setup-browser-cookies github.com
Claude: Imported 12 cookies for github.com from Comet.
/autoplan
This is my review autopilot mode.
Running /plan-ceo-review, then /plan-design-review, then /plan-eng-review individually means answering 15-30 intermediate questions. Each question is valuable, but sometimes you want the gauntlet to run without stopping for every decision.
/autoplan reads all three review skills from disk and runs them sequentially: CEO → Design → Eng. It makes decisions automatically using six encoded principles (prefer completeness, match existing patterns, choose reversible options, prefer the option the user chose for similar past decisions, defer ambiguous items, and escalate security). Taste decisions (close approaches, borderline scope expansions, cross-model disagreements) get saved and presented at a final approval gate.
One command, fully reviewed plan out.
You: /autoplan
Claude: Running CEO review... [4 scope decisions auto-resolved]
Running design review... [3 design dimensions auto-scored]
Running eng review... [2 architecture decisions auto-resolved]
TASTE DECISIONS (need your input):
1. Scope: Codex suggested adding search — borderline expansion. Add?
2. Design: Two approaches scored within 1 point. Which feels right?
[Shows both options with context]
You: 1) Yes, add search. 2) Option A.
Claude: Plan complete. 9 decisions auto-resolved, 2 taste decisions approved.
/learn
This is my institutional memory mode.
gstack learns from every session. Patterns, pitfalls, preferences, architectural decisions... they accumulate in ~/.gstack/projects/$SLUG/learnings.jsonl. Each learning has a confidence score, source attribution, and the files it references.
/learn lets you see what gstack has absorbed, search for specific patterns, prune stale entries (when referenced files no longer exist), and export learnings for team sharing. The real magic is in other skills... they automatically search learnings before making recommendations, and display "Prior learning applied" when a past insight is relevant.
You: /learn
Claude: 23 learnings for this project (14 high confidence, 6 medium, 3 low)
Top patterns:
- [9/10] API responses always wrapped in { data, error } envelope
- [8/10] Tests use factory helpers in test/support/factories.ts
- [8/10] All DB queries go through repository pattern, never direct
3 potentially stale (referenced files deleted):
- "auth middleware uses JWT" — auth/middleware.ts was deleted
[Prune these? Y/N]
/open-gstack-browser
This is my co-presence mode.
/browse runs headless by default. You don't see what the agent sees. /open-gstack-browser changes that. It launches GStack Browser (rebranded Chromium with anti-bot stealth) controlled by Playwright, with the sidebar extension auto-loaded. You watch every action in real time.
The sidebar chat is a Claude instance that controls the browser. It auto-routes to the right model: Sonnet for navigation and actions (click, goto, fill, screenshot), Opus for reading and analysis (summarize, find bugs, describe). One-click cookie import from the sidebar footer. The browser stays alive as long as the window is open... no idle timeout in headed mode. The menu bar says "GStack Browser" instead of "Chrome for Testing."
The sidebar agent ships a layered prompt injection defense: a local 22MB ML classifier scans every page and tool output, a Haiku transcript check votes on the full conversation, a canary token catches session-exfil attempts, and a verdict combiner requires two classifiers to agree before blocking. A shield icon in the header shows status (green/amber/red). Details in ARCHITECTURE.md.
You: /open-gstack-browser
Claude: Launched GStack Browser with sidebar extension.
Anti-bot stealth active. All $B commands run in headed mode.
Type in the sidebar to direct the browser agent.
Sidebar model routing: sonnet for actions, opus for analysis.
/setup-deploy
One-time deploy configuration. Run this before your first /land-and-deploy.
It auto-detects your deploy platform (Fly.io, Render, Vercel, Netlify, Heroku, GitHub Actions, or custom), discovers your production URL, health check endpoints, and deploy status commands. Writes everything to CLAUDE.md so all future deploys are automatic.
You: /setup-deploy
Claude: Detected: Fly.io (fly.toml found)
Production URL: https://myapp.fly.dev
Health check: /health → expects 200
Deploy command: fly deploy
Status command: fly status
Written to CLAUDE.md. Run /land-and-deploy when ready.
/codex
This is my second opinion mode.
When /review catches bugs from Claude's perspective, /codex brings a completely different AI — OpenAI's Codex CLI — to review the same diff. Different training, different blind spots, different strengths. The overlap tells you what's definitely real. The unique findings from each are where you find the bugs neither would catch alone.
Three modes
Review — run codex review against the current diff. Codex reads every changed file, classifies findings by severity (P1 critical, P2 high, P3 medium), and returns a PASS/FAIL verdict. Any P1 finding = FAIL. The review is fully independent — Codex doesn't see Claude's review.
Challenge — adversarial mode. Codex actively tries to break your code. It looks for edge cases, race conditions, security holes, and assumptions that would fail under load. Uses maximum reasoning effort (xhigh). Think of it as a penetration test for your logic.
Consult — open conversation with session continuity. Ask Codex anything about the codebase. Follow-up questions reuse the same session, so context carries over. Great for "am I thinking about this correctly?" moments.
Cross-model analysis
When both /review (Claude) and /codex (OpenAI) have reviewed the same branch, you get a cross-model comparison: which findings overlap (high confidence), which are unique to Codex (different perspective), and which are unique to Claude. This is the "two doctors, same patient" approach to code review.
You: /codex review
Claude: Running independent Codex review...
CODEX REVIEW: PASS (3 findings)
[P2] Race condition in payment handler — concurrent charges
can double-debit without advisory lock
[P3] Missing null check on user.email before downcase
[P3] Token comparison not using constant-time compare
Cross-model analysis (vs /review):
OVERLAP: Race condition in payment handler (both caught it)
UNIQUE TO CODEX: Token comparison timing attack
UNIQUE TO CLAUDE: N+1 query in listing photos
Safety & Guardrails
Four skills that add safety rails to any Claude Code session. They work via Claude Code's PreToolUse hooks — transparent, session-scoped, no configuration files.
/careful
Say "be careful" or run /careful when you're working near production, running destructive commands, or just want a safety net. Every Bash command gets checked against known-dangerous patterns:
rm -rf/rm -r— recursive deleteDROP TABLE/DROP DATABASE/TRUNCATE— data lossgit push --force/git push -f— history rewritegit reset --hard— discard commitsgit checkout ./git restore .— discard uncommitted workkubectl delete— production resource deletiondocker rm -f/docker system prune— container/image loss
Common build artifact cleanups (rm -rf node_modules, dist, .next, __pycache__, build, coverage) are whitelisted — no false alarms on routine operations.
You can override any warning. The guardrails are accident prevention, not access control.
/freeze
Restrict all file edits to a single directory. When you're debugging a billing bug, you don't want Claude accidentally "fixing" unrelated code in src/auth/. /freeze src/billing blocks all Edit and Write operations outside that path.
/investigate activates this automatically — it detects the module being debugged and freezes edits to that directory.
You: /freeze src/billing
Claude: Edits restricted to src/billing/. Run /unfreeze to remove.
[Later, Claude tries to edit src/auth/middleware.ts]
Claude: BLOCKED — Edit outside freeze boundary (src/billing/).
Skipping this change.
Note: this blocks Edit and Write tools only. Bash commands like sed can still modify files outside the boundary — it's accident prevention, not a security sandbox.
/guard
Full safety mode — combines /careful + /freeze in one command. Destructive command warnings plus directory-scoped edits. Use when touching prod or debugging live systems.
/unfreeze
Remove the /freeze boundary, allowing edits everywhere again. The hooks stay registered for the session — they just allow everything. Run /freeze again to set a new boundary.
/gstack-upgrade
Keep gstack current with one command. It detects your install type (global at ~/.claude/skills/gstack vs vendored in your project at .claude/skills/gstack), runs the upgrade, syncs both copies if you have dual installs, and shows you what changed.
You: /gstack-upgrade
Claude: Current version: 0.7.4
Latest version: 0.8.2
What's new:
- Browse handoff for CAPTCHAs and auth walls
- /codex multi-AI second opinion
- /qa always uses browser now
- Safety skills: /careful, /freeze, /guard
- Proactive skill suggestions
Upgraded to 0.8.2. Both global and project installs synced.
Set auto_upgrade: true in ~/.gstack/config.yaml to skip the prompt entirely — gstack upgrades silently at the start of each session when a new version is available.
Greptile integration
Greptile is a YC company that reviews your PRs automatically. It catches real bugs — race conditions, security issues, things that pass CI and blow up in production. It has genuinely saved my ass more than once. I love these guys.
Setup
Install Greptile on your GitHub repo at greptile.com — it takes about 30 seconds. Once it's reviewing your PRs, gstack picks up its comments automatically. No additional configuration.
How it works
The problem with any automated reviewer is triage. Greptile is good, but not every comment is a real issue. Some are false positives. Some flag things you already fixed three commits ago. Without a triage layer, the comments pile up and you start ignoring them — which defeats the purpose.
gstack solves this. /review and /ship are now Greptile-aware. They read Greptile's comments, classify each one, and take action:
- Valid issues get added to the critical findings and fixed before shipping
- Already-fixed issues get an auto-reply acknowledging the catch
- False positives get pushed back — you confirm, and a reply goes out explaining why it's wrong
The result is a two-layer review: Greptile catches things asynchronously on the PR, then /review and /ship triage those findings as part of the normal workflow. Nothing falls through the cracks.
Learning from history
Every false positive you confirm gets saved to ~/.gstack/greptile-history.md. Future runs auto-skip known FP patterns for your codebase. And /retro tracks Greptile's batting average over time — so you can see whether the signal-to-noise ratio is improving.
Example
You: /ship
Claude: [syncs main, runs tests, pre-landing review...]
Greptile found 3 comments on this PR:
[VALID] app/services/payment_service.rb:47 — Race condition:
concurrent charges can double-debit. Recommend DB-level advisory
lock around the charge block.
[ALREADY FIXED] app/models/user.rb:12 — Missing null check
on email before downcase.
→ Already fixed in commit abc1234. Auto-replying.
[FALSE POSITIVE] lib/auth.rb:88 — "Token comparison should
use constant-time comparison." We already use
ActiveSupport::SecurityUtils.secure_compare here.
Fixing the race condition in payment_service.rb...
The auth token comment is a false positive.
Options: A) Reply to Greptile explaining B) Fix anyway C) Ignore
You: A
Claude: Replied to Greptile. All tests pass.
PR: github.com/you/app/pull/42
Three Greptile comments. One real fix. One auto-acknowledged. One false positive pushed back with a reply. Total extra time: about 30 seconds.
/ios-qa
Live-device iOS QA. The fork's load-bearing insight was: don't simulate, don't run XCTest, don't bring up WebDriverAgent. Embed an HTTP server in the app under test, drive it from a Mac-side daemon over the USB CoreDevice IPv6 tunnel.
The agent reads your Swift source, finds @Observable classes with @Snapshotable-marked fields, codegens typed accessors, deploys a debug bridge, then runs a closed find→fix→verify loop.
Architecture in one diagram
┌──────────────────────┐ USB CoreDevice (IPv6) ┌──────────────────┐
│ gstack-ios-qa daemon │ ────────────────────────▶ │ iOS app │
│ (Mac, bun/TS) │ bearer + X-Session-Id │ StateServer │
│ - rotates boot token │ │ (loopback only) │
│ - mints session toks │ └──────────────────┘
│ - capability tiers │
│ - audit + redact │
└──────────────────────┘
▲
│ Tailscale (optional, --tailnet)
│
┌──────────────────────┐
│ Remote agent │
│ (OpenClaw, etc.) │
└──────────────────────┘
The iOS app's StateServer binds loopback only (::1 + 127.0.0.1). The Mac daemon owns tailnet identity validation, capability tiers, and the audit trail. Remote agents NEVER see the boot token — only short-lived session tokens (1h default, 24h hard cap) minted via Tailscale identity gating.
The unlock: USB-tethered + Tailscale = remote iOS QA from any agent
A Mac plus an iPhone you already own plus the Tailscale free tier replaces what most teams pay BrowserStack/Sauce Labs for. Any HTTP-capable agent on your tailnet can drive the iOS app once you've minted them a session token. Tailscale ACLs scope which identities can reach the Mac at which capability tier.
See ios-qa/docs/tailscale-acl-example.md for the runnable setup.
Capability tiers
| Tier | Endpoints |
|---|---|
| observe | /screenshot, /elements, GET /state/*, /state/snapshot, /healthz |
| interact | observe + /tap, /swipe, /type, /session/* |
| mutate | interact + POST /state/<key> |
| restore | mutate + POST /state/restore |
Default minted tokens get interact. Higher tiers require explicit owner mint.
/ios-fix
Iron Law: no fix without a reproducing snapshot. The agent captures pre-bug state via GET /state/snapshot, writes the fix, rebuilds, redeploys, restores the snapshot, and verifies the bug is gone. The snapshot becomes a regression test fixture so the bug can't recur silently.
Mirrors /qa's find-bug → fix → re-verify loop for iOS.
/ios-design-review
Designer's-eye QA on a real iPhone. Connects to the same /ios-qa daemon in observe-tier mode and screenshots every screen. Scores 10 dimensions 0-10: typography hierarchy, spacing rhythm, color hierarchy, touch targets, loading/empty/error states, accessibility, animation discipline, iOS idiom alignment, information density, AI-slop check.
For each score < 7, uses AskUserQuestion to present the issue with recommended fix.
/ios-clean
Convenience wrapper. The structural Release-build guard against shipping DebugBridge is in Package.swift (.when(configuration: .debug)) plus a CI invariant test. /ios-clean is for developers who want a guided removal flow or who manually added the SPM dependency without going through /ios-qa.
/ios-sync
Run after upgrading gstack or adding new @Observable classes. Detects what's installed, runs gen-accessors against the latest upstream templates, refreshes any changed Swift files, verifies the app rebuilds. Cache-key invalidation handles Swift version changes, generator git rev changes, and source changes.