* docs: add design doc for /plan-tune v1 (observational substrate) Canonical record of the /plan-tune v1 design: typed question registry, per-question explicit preferences, inline tune: feedback with user-origin gate, dual-track profile (declared + inferred separately), and plain-English inspection skill. Captures every decision with pros/cons, what's deferred to v2 with explicit acceptance criteria, and what was rejected entirely. Codex review drove a substantial scope rollback from the initial CEO EXPANSION plan. 15+ legitimate findings (substrate claim was false without a typed registry; E4/E6/clamp logical contradiction; profile poisoning attack surface; LANDED preamble side effect; implementation order) shaped the final shape. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat: typed question registry for /plan-tune v1 foundation scripts/question-registry.ts declares 53 recurring AskUserQuestion categories across 15 skills (ship, review, office-hours, plan-ceo-review, plan-eng-review, plan-design-review, plan-devex-review, qa, investigate, land-and-deploy, cso, gstack-upgrade, preamble, plan-tune, autoplan). Each entry has: stable kebab-case id, skill owner, category (approval | clarification | routing | cherry-pick | feedback-loop), door_type (one-way | two-way), optional stable option keys, optional psychographic signal_key, and a one-line description. 12 of 53 are one-way doors (destructive ops, architecture/data forks, security/compliance). These are ALWAYS asked regardless of user preference. Helpers: getQuestion(id), getOneWayDoorIds(), getAllRegisteredIds(), getRegistryStats(). No binary or resolver wiring yet — this is the schema substrate the rest of /plan-tune builds on. Ad-hoc question_ids (not registered) still log but skip psychographic signal attribution. Future /plan-tune skill surfaces frequently-firing ad-hoc ids as candidates for registry promotion. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test: registry schema + safety + coverage tests (gate tier) 20 tests validating the question registry: Schema (7 tests): - Every entry has required fields - All ids are kebab-case and start with their skill name - No duplicate ids - Categories are from the allowed set - door_type is one-way | two-way - Options arrays are well-formed - Descriptions are short and single-line Helpers (5 tests): - getQuestion returns entry for known id, undefined for unknown - getOneWayDoorIds includes destructive questions, excludes two-way - getAllRegisteredIds count matches QUESTIONS keys - getRegistryStats totals are internally consistent One-way door safety (2 tests): - Every critical question (test failure, SQL safety, LLM trust boundary, security scan, merge confirm, rollback, fix apply, premise revise, arch finding, privacy gate, user challenge) is declared one-way - At least 10 one-way doors exist (catches regression if declarations are accidentally dropped) Registry breadth (3 tests): - 11 high-volume skills each have >= 1 registered question - Preamble one-time prompts are registered - /plan-tune's own questions are registered Signal map references (1 test): - signal_key values are typed kebab-case strings Template coverage (2 tests, informational): - AskUserQuestion usage across templates is non-trivial (>20) - Registry spans >= 10 skills 20 pass, 0 fail. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat: one-way door classifier (belt-and-suspenders safety fallback) scripts/one-way-doors.ts — secondary keyword-pattern classifier that catches destructive questions even when the registry doesn't have an entry for them. The registry's door_type field (from scripts/question-registry.ts) is the PRIMARY safety gate. This classifier is the fallback for ad-hoc question_ids that agents generate at runtime. Classification priority: 1. Registry lookup by question_id → use declared door_type 2. Skill:category fallback (cso:approval, land-and-deploy:approval) 3. Keyword pattern match against question_summary 4. Default: treat as two-way (safer to log the miss than auto-decide unsafely) Covers 21 destructive patterns across: - File system (rm -rf, delete, wipe, purge, truncate) - Database (drop table/database/schema, delete from) - Git/VCS (force-push, reset --hard, checkout --, branch -D) - Deploy/infra (kubectl delete, terraform destroy, rollback) - Credentials (revoke/reset/rotate API key|token|secret|password) - Architecture (breaking change, schema migration, data model change) 7 new tests in test/plan-tune.test.ts covering: registry-first lookup, unknown-id fallthrough, keyword matching on destructive phrasings including embedded filler words ("rotate the API key"), skill-category fallback, benign questions defaulting to two-way, pattern-list non-empty. 27 pass, 0 fail. 1270 expect() calls. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat: psychographic signal map + builder archetypes scripts/psychographic-signals.ts — hand-crafted {signal_key, user_choice} → {dimension, delta} map. Version 0.1.0. Conservative deltas (±0.03 to ±0.06 per event). Covers 9 signal keys: scope-appetite, architecture-care, code-quality-care, test-discipline, detail-preference, design-care, devex-care, distribution-care, session-mode. Helpers: applySignal() mutates running totals, newDimensionTotals() creates empty starting state, normalizeToDimensionValue() sigmoid-clamps accumulated delta to [0,1] (0 → 0.5 neutral), validateRegistrySignalKeys() checks that every signal_key in the registry has a SIGNAL_MAP entry. In v1 the signal map is used ONLY to compute inferred dimension values for /plan-tune inspection output. No skill behavior adapts to these signals until v2. scripts/archetypes.ts — 8 named archetypes + Polymath fallback: - Cathedral Builder (boil-the-ocean + architecture-first) - Ship-It Pragmatist (small scope + fast) - Deep Craft (detail-verbose + principled) - Taste Maker (intuitive, overrides recommendations) - Solo Operator (high-autonomy, delegates) - Consultant (hands-on, consulted on everything) - Wedge Hunter (narrow scope aggressively) - Builder-Coach (balanced steering) - Polymath (fallback when no archetype matches) matchArchetype() uses L2 distance scaled by tightness, with a 0.55 threshold below which we return Polymath. v1 ships the model stable; v2 narrative/vibe commands wire it into user-facing output. 14 new tests: signal map consistency vs registry, applySignal behavior for known/unknown keys, normalization bounds, archetype schema validity, name uniqueness, matchArchetype correctness for each reference profile, Polymath fallback for outliers. 41 pass, 0 fail total in test/plan-tune.test.ts. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat: bin/gstack-question-log — append validated AskUserQuestion events Append-only JSONL log at ~/.gstack/projects/{SLUG}/question-log.jsonl. Schema: {skill, question_id, question_summary, category?, door_type?, options_count?, user_choice, recommended?, followed_recommendation?, session_id?, ts} Validates: - skill is kebab-case - question_id is kebab-case, <= 64 chars - question_summary non-empty, <= 200 chars, newlines flattened - category is one of approval/clarification/routing/cherry-pick/feedback-loop - door_type is one-way or two-way - options_count is integer in [1, 26] - user_choice non-empty string, <= 64 chars Injection defense on question_summary rejects the same patterns as gstack-learnings-log (ignore previous instructions, system:, override:, do not report, etc). followed_recommendation is auto-computed when both user_choice and recommended are present. ts auto-injected as ISO 8601 if missing. 21 tests covering: valid payloads, full field preservation, auto-followed computation, appending, long-summary truncation, newline flattening, invalid JSON, missing fields, bad case, oversized ids, invalid enum values, out-of-range options_count, and 6 injection attack patterns. 21 pass, 0 fail, 43 expect() calls. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat: bin/gstack-developer-profile — unified profile with migration bin/gstack-developer-profile supersedes bin/gstack-builder-profile. The old binary becomes a one-line legacy shim delegating to --read for /office-hours backward compat. Subcommands: --read legacy KEY:VALUE output (tier, session_count, etc) --migrate folds ~/.gstack/builder-profile.jsonl into ~/.gstack/developer-profile.json. Atomic (temp + rename), idempotent (no-op when target exists or source absent), archives source as .migrated-YYYY-MM-DD-HHMMSS --derive recomputes inferred dimensions from question-log.jsonl using the signal map in scripts/psychographic-signals.ts --profile full profile JSON --gap declared vs inferred diff JSON --trace <dim> event-level trace of what contributed to a dimension --check-mismatch flags dimensions where declared and inferred disagree by > 0.3 (requires >= 10 events first) --vibe archetype name + description from scripts/archetypes.ts --narrative (v2 stub) Auto-migration on first read: if legacy file exists and new file doesn't, migrate before reading. Creates a neutral (all-0.5) stub if nothing exists. Unified schema (see docs/designs/PLAN_TUNING_V0.md §Architecture): {identity, declared, inferred: {values, sample_size, diversity}, gap, overrides, sessions, signals_accumulated, schema_version} 25 new tests across subcommand behaviors: - --read defaults + stub creation - --migrate: 3 sessions preserved with signal tallies, idempotency, archival - Tier calculation: welcome_back / regular / inner_circle boundaries - --derive: neutral-when-empty, upward nudge on 'expand', downward on 'reduce', recomputable (same input → same output), ad-hoc unregistered ids ignored - --trace: contributing events, empty for untouched dims, error without arg - --gap: empty when no declared, correctly computed otherwise - --vibe: returns archetype name + description - --check-mismatch: threshold behavior, 10+ sample requirement - Unknown subcommand errors 25 pass, 0 fail, 60 expect() calls. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat: bin/gstack-question-preference — explicit preferences + user-origin gate Subcommands: --check <id> → ASK_NORMALLY | AUTO_DECIDE (decides if a registered question should be auto-decided by the agent) --write '{…}' → set a preference (requires user-origin source) --read → dump preferences JSON --clear [id] → clear one or all --stats → short counts summary Preference values: always-ask | never-ask | ask-only-for-one-way. Stored at ~/.gstack/projects/{SLUG}/question-preferences.json. Safety contract (the core of Codex finding #16, profile-poisoning defense from docs/designs/PLAN_TUNING_V0.md §Security model): 1. One-way doors ALWAYS return ASK_NORMALLY from --check, regardless of user preference. User's never-ask is overridden with a visible safety note so the user knows why their preference didn't suppress the prompt. 2. --write requires an explicit `source` field: - Allowed: "plan-tune", "inline-user" - REJECTED with exit code 2: "inline-tool-output", "inline-file", "inline-file-content", "inline-unknown" Rejection is explicit ("profile poisoning defense") so the caller can log and surface the attempt. 3. free_text on --write is sanitized against injection patterns (ignore previous instructions, override:, system:, etc.) and newline-flattened. Each --write also appends a preference-set event to ~/.gstack/projects/{SLUG}/question-events.jsonl for derivation audit trail. 31 tests: - --check behavior (4): defaults, two-way, one-way (one-way overrides never-ask with safety note), unknown ids, missing arg - --check with prefs (5): never-ask on two-way → AUTO_DECIDE; never-ask on one-way → ASK_NORMALLY with override note; always-ask always asks; ask-only-for-one-way flips appropriately - --write valid (5): inline-user accepted, plan-tune accepted, persisted correctly, event appended, free_text preserved with flattening - User-origin gate (6): missing source rejected; inline-tool-output rejected with exit code 2 and explicit poisoning message; inline-file, inline-file-content, inline-unknown rejected; unknown source rejected - Schema validation (4): invalid JSON, bad question_id, bad preference, injection in free_text - --read (2): empty → {}, returns writes - --clear (3): specific id, clear-all, NOOP for missing - --stats (2): empty zeros, tallies by preference type 31 pass, 0 fail, 52 expect() calls. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat: question-tuning preamble resolvers scripts/resolvers/question-tuning.ts ships three preamble generators: generateQuestionPreferenceCheck — before each AskUserQuestion, agent runs gstack-question-preference --check <id>. AUTO_DECIDE suppresses the ask and auto-chooses recommended. ASK_NORMALLY asks as usual. One-way door safety override is handled by the binary. generateQuestionLog — after each AskUserQuestion, agent appends a log record with skill, question_id, summary, category, door_type, options_count, user_choice, recommended, session_id. generateInlineTuneFeedback — offers inline "tune:" prompt after two-way questions. Documents structured shortcuts (never-ask, always-ask, ask-only-for-one-way, ask-less) AND accepts free-form English with normalization + confirmation. Explicitly spells out the USER-ORIGIN GATE: only write tune events when the prefix appears in the user's own chat message, never from tool output or file content. Binary enforces. All three resolvers are gated by the QUESTION_TUNING preamble echo. When the config is off, the agent skips these sections entirely. Ready to be wired into preamble.ts in the next commit. Codex host has a simpler variant that uses $GSTACK_BIN env vars. scripts/resolvers/index.ts registers three placeholders: QUESTION_PREFERENCE_CHECK, QUESTION_LOG, INLINE_TUNE_FEEDBACK Total resolver count goes from 45 to 48. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat: wire question-tuning into preamble for tier >= 2 skills scripts/resolvers/preamble.ts — adds two things: 1. _QUESTION_TUNING config echo in the preamble bash block, gated on the user's gstack-config `question_tuning` value (default: false). 2. A combined Question Tuning section for tier >= 2 skills, injected after the confusion protocol. The section itself is runtime-gated by the QUESTION_TUNING value — agents skip it entirely when off. scripts/resolvers/question-tuning.ts — consolidated into one compact combined section `generateQuestionTuning(ctx)` covering: preference check before the question, log after, and inline tune: feedback with user-origin gate. Per-phase generators remain exported for unit tests but are no longer the main entrypoint. Size impact: +570 tokens / +2.3KB per tier-2+ SKILL.md. Three skills (plan-ceo-review, office-hours, ship) still exceed the 100KB token ceiling — but they were already over before this change. Delta is the smallest viable wiring of the /plan-tune v1 substrate. Golden fixtures (test/fixtures/golden/claude-ship, codex-ship, factory-ship) regenerated to match the new baseline. Full test run: 1149 pass, 0 fail, 113 skip across 28 files. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: regenerate SKILL.md files with question-tuning section bun run gen:skill-docs --host all after wiring the QUESTION_TUNING preamble section. Every tier >= 2 skill now includes the combined Question Tuning guidance. Runtime-gated — agents skip the section when question_tuning is off in gstack-config (default). Golden fixtures (claude-ship, codex-ship, factory-ship) updated to the new baseline. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat: /plan-tune skill — conversational inspection + preferences plan-tune/SKILL.md.tmpl: the user-facing skill for /plan-tune v1. Routes plain-English intent to one of 8 flows: - Enable + setup (first-time): 5 declaration questions mapping to the 5 psychographic dimensions (scope_appetite, risk_tolerance, detail_preference, autonomy, architecture_care). Writes to developer-profile.json declared.*. - Inspect profile: plain-English rendering of declared + inferred + gap. Uses word bands (low/balanced/high) not raw floats. Shows vibe archetype when calibration gate is met. - Review question log: top-20 question frequencies with follow/override counts. Highlights override-heavy questions as candidates for never-ask. - Set a preference: normalizes "stop asking me about X" → never-ask, etc. Confirms ambiguous phrasings before writing via gstack-question-preference. - Edit declared profile: interprets free-form ("more boil-the-ocean") and CONFIRMS before mutating declared.* (trust boundary per Codex #15). - Show gap: declared vs inferred diff with plain-English severity bands (close / drift / mismatch). Never auto-updates declared from the gap. - Stats: preference counts + diversity/calibration status. - Enable / disable: gstack-config set question_tuning true|false. Design constraints enforced: - Plain English everywhere. No CLI subcommand syntax required. Shortcuts (`profile`, `vibe`, `stats`, `setup`) exist but optional. - user-origin gate on tune: writes. source: "plan-tune" for user-invoked /plan-tune; source: "inline-user" for inline tune: from other skills. - One-way doors override never-ask (safety, surfaced to user). - No behavior adaptation in v1 — this skill inspects and configures only. Generates plan-tune/SKILL.md at ~11.6k tokens, well under the 100KB ceiling. Generated for all hosts via `bun run gen:skill-docs --host all`. Full free test suite: 1149 pass, 0 fail, 113 skip across 28 files. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test: end-to-end pipeline + preamble injection coverage Added 6 tests to test/plan-tune.test.ts: Preamble injection (3 tests): - tier 2+ includes Question Tuning section with preference check, log, and user-origin gate language ('profile-poisoning defense', 'inline-user') - tier 1 does NOT include the prose section (QUESTION_TUNING bash echo still fires since it's in the bash block all tiers share) - codex host swaps binDir references to $GSTACK_BIN End-to-end pipeline (3 tests) — real binaries working together, not mocks: - Log 5 expand choices → --derive → profile shows scope_appetite > 0.5 (full log → registry lookup → signal map → normalization round-trip) - --write source: inline-tool-output rejected; --read confirms no pref was persisted (the profile-poisoning defense actually works end-to-end) - Migrate a 3-session legacy file; confirm legacy gstack-builder-profile shim still returns SESSION_COUNT: 3, TIER: welcome_back, CROSS_PROJECT: true test/plan-tune.test.ts now has 47 tests total. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test: E2E test for /plan-tune plain-English inspection flow (gate tier) test/skill-e2e-plan-tune.test.ts — verifies /plan-tune correctly routes plain-English intent ("review the questions I've been asked") to the Review question log section without requiring CLI subcommand syntax. Seeds a synthetic question-log.jsonl with 3 entries exercising: - override behavior (user chose expand over recommended selective) - one-way door respect (user followed ship-test-failure-triage recommendation) - two-way override (user skipped recommended changelog polish) Invokes the skill via `claude -p` and asserts: - Agent surfaces >= 2 of 3 logged question_ids in output - Agent notices override/skip behavior from the log - Exit reason is success or error_max_turns (not agent-crash) Gate-tier because the core v1 DX promise is plain-English intent routing. If it requires memorized subcommands or breaks on natural language, that's a regression of the defining feature. Registered in test/helpers/touchfiles.ts with dependencies: - plan-tune/** (skill template + generated md) - scripts/question-registry.ts (required for log lookup) - scripts/psychographic-signals.ts, scripts/one-way-doors.ts (derive path) - bin/gstack-question-log, gstack-question-preference, gstack-developer-profile Skipped when EVALS_ENABLED is not set; runs on `bun run test:evals`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: bump version and changelog (v0.19.0.0) — /plan-tune v1 Ships /plan-tune as observational substrate: typed question registry, dual-track developer profile (declared + inferred), explicit per-question preferences with user-origin gate, inline tune: feedback across every tier >= 2 skill, unified developer-profile.json with migration from builder-profile.jsonl. Scope rolled back from initial CEO EXPANSION plan after outside-voice review (Codex). 6 deferrals tracked as P0 TODOs with explicit acceptance criteria: E1 substrate wiring, E3 narrative/vibe, E4 blind-spot coach, E5 LANDED celebration, E6 auto-adjustment, E7 psychographic auto-decide. See docs/designs/PLAN_TUNING_V0.md for the full design record. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(ci): harden Dockerfile.ci against transient Ubuntu mirror failures The CI image build failed with: E: Failed to fetch http://archive.ubuntu.com/ubuntu/pool/main/... Connection failed [IP: 91.189.92.22 80] ERROR: process "/bin/sh -c apt-get update && apt-get install ..." did not complete successfully: exit code: 100 archive.ubuntu.com periodically returns "connection refused" on individual regional mirrors. Without retry logic a single failed fetch nukes the whole Docker build. Three defenses, layered: 1. /etc/apt/apt.conf.d/80-retries — apt fetches each package up to 5 times with a 30s timeout. Handles per-package flakes. 2. Shell-loop retry around the whole apt-get step (x3, 10s sleep) — handles the case where apt-get update itself can't reach any mirror. 3. --retry 5 --retry-delay 5 --retry-connrefused on all curl fetches (bun install script, GitHub CLI keyring, NodeSource setup script). Applied to every apt-get and curl call in the Dockerfile. No behavior change on happy path — only kicks in when mirrors blip. Fixes the build-image job that was blocking CI on the /plan-tune PR. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: add PLAN_TUNING_V1 + PACING_UPDATES_V0 design docs Captures the V1 design (ELI10 writing + LOC reframe) in docs/designs/PLAN_TUNING_V1.md and the extracted V1.1 pacing-overhaul plan in docs/designs/PACING_UPDATES_V0.md. V1 scope was reduced from the original bundled pacing + writing-style plan after three engineering-review passes revealed structural gaps in the pacing workstream that couldn't be closed via plan-text editing. TODOS.md P0 entry links to V1.1. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat: curated jargon list for V1 writing-style glossing Repo-owned list of ~50 high-frequency technical terms (idempotent, race condition, N+1, backpressure, etc.) that gstack glosses on first use in tier-≥2 skill output. Baked into generated SKILL.md prose at gen-skill-docs time. Terms not on this list are assumed plain-English enough. Contributions via PR. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(preamble): V1 Writing Style section + EXPLAIN_LEVEL echo + migration prompt Adds a new Writing Style section to tier-≥2 preamble output composing with the existing AskUserQuestion Format section. Six rules: jargon glossed on first use per skill invocation (from scripts/jargon-list.json), outcome- framed questions, short sentences, decisions close with user impact, gloss-on-first-use even if user pasted term, user-turn override for "be terse" requests. Baked conditionally (skip if EXPLAIN_LEVEL: terse). Adds EXPLAIN_LEVEL preamble echo using \${binDir} (host-portable matching V0 QUESTION_TUNING pattern). Adds WRITING_STYLE_PENDING echo reading a flag file written by the V0→V1 upgrade migration; on first post-upgrade skill run, the agent fires a one-time AskUserQuestion offering terse mode. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(gstack-config): validate explain_level + document in header Adds explain_level: default|terse to the annotated config header with a one-line description. Whitelists valid values; on set of an unknown value, prints a specific warning ("explain_level '\$VALUE' not recognized. Valid values: default, terse. Using default.") and writes the default value. Matches V1 preamble's EXPLAIN_LEVEL echo expectation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat: V1 upgrade migration — writing-style opt-out prompt New migration script following existing v0.15.2.0.sh / v0.16.2.0.sh pattern. Writes a .writing-style-prompt-pending flag file on first run post-upgrade. The preamble's migration-prompt block reads the flag and fires a one-time AskUserQuestion offering the user a choice between the new default writing style and restoring V0 prose via \`gstack-config set explain_level terse\`. Idempotent via flag files; if the user has already set explain_level explicitly, counts as answered and skips. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat: LOC reframe tooling — throughput comparison + README updater + scc installer Three new scripts: - scripts/garry-output-comparison.ts — enumerates Garry-authored commits in 2013 + 2026 on public repos, extracts ADDED lines from git diff, classifies as logical SLOC via scc --stdin (regex fallback if scc missing). Writes docs/throughput-2013-vs-2026.json with per-language breakdown + explicit caveats (public repos only, commit-style drift, private-work exclusion). - scripts/update-readme-throughput.ts — reads the JSON if present, replaces the README's <!-- GSTACK-THROUGHPUT-PLACEHOLDER --> anchor with the computed multiple (preserving the anchor for future runs). If JSON missing, writes GSTACK-THROUGHPUT-PENDING marker that CI rejects — forcing the build to run before commit. - scripts/setup-scc.sh — standalone OS-detecting installer for scc. Not a package.json dependency (95% of users never run throughput). Brew on macOS, apt on Linux, GitHub releases link on Windows. Two-string anchor pattern (PLACEHOLDER vs PENDING) prevents the pipeline from destroying its own update path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(retro): surface logical SLOC + weighted commits above raw LOC V1 reorders the /retro summary table to lead with features shipped, then commits + weighted commits (commits × files-touched capped at 20), then PRs merged, then logical SLOC added as the primary code-volume metric. Raw LOC stays present but is demoted to context. Rationale inline in the template: ten lines of a good fix is not less shipping than ten thousand lines of scaffold. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(v1): README hero reframe + writing-style + CHANGELOG + version bump to 1.0.0.0 README.md: - Hero removes "600,000+ lines of production code" framing; replaces with the computed 2013-vs-2026 pro-rata multiple (via <!-- GSTACK-THROUGHPUT-PLACEHOLDER --> anchor, filled by the update-readme-throughput build step). - Hiring callout: "ship real products at AI-coding speed" instead of "10K+ LOC/day." - New Writing Style section (~80 words) between Quick start and Install: "v1 prompts = simpler" framing, outcome-language example, terse-mode opt-out, pointer to /plan-tune. CLAUDE.md: one-paragraph Writing style (V1) note under project conventions, linking to preamble resolver + V1 design docs. CHANGELOG.md: V1 entry on top of v0.19.0.0 with user-facing narrative (what changes, how to opt out, for-contributors notes). Mentions scope reduction — pacing overhaul ships in V1.1. CONTRIBUTING.md: one-paragraph note on jargon-list.json maintenance (PR to add/remove terms; regenerate via gen:skill-docs). VERSION + package.json: bump to 1.0.0.0. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: regenerate SKILL.md files + golden fixtures for V1 Mechanical regeneration from the updated templates in prior commits: - Writing Style section now appears in tier-≥2 skill output. - EXPLAIN_LEVEL + WRITING_STYLE_PENDING echoes in preamble bash. - V1 migration-prompt block fires conditionally on first upgrade. - Jargon list inlined into preamble prose at gen time. - Retro template's logical SLOC + weighted commits order applied. Regenerated for all 8 hosts via bun run gen:skill-docs --host all. Golden ship-skill fixtures refreshed from regenerated outputs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test: V1 gate coverage — writing-style resolver + config + jargon + migration + dormancy Six new gate-tier test files: - test/writing-style-resolver.test.ts — asserts Writing Style section is injected into tier-≥2 preamble, all 6 rules present, jargon list inlined, terse-mode gate condition present, Codex output uses \$GSTACK_BIN (not ~/.claude/), tier-1 does NOT get the section, migration-prompt block present. - test/explain-level-config.test.ts — gstack-config set/get round-trip for default + terse, unknown-value warns + defaults to default, header documents the key, round-trip across set→set→get. - test/jargon-list.test.ts — shape + ~50 terms + no duplicates (case-insensitive) + includes canonical high-signal terms. - test/v0-dormancy.test.ts — 5D dimension names + archetype names forbidden in default-mode tier-≥2 SKILL.md output, except for plan-tune and office-hours where they're load-bearing. - test/readme-throughput.test.ts — script replaces anchor with number on happy path, writes PENDING marker when JSON missing, CI gate asserts committed README contains no PENDING string. - test/upgrade-migration-v1.test.ts — fresh run writes pending flag, idempotent after user-answered, pre-existing explain_level counts as answered. All 95 V1 test-expect() calls pass. Full suite: 0 failures. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat: compute real 2013-vs-2026 throughput multiple (130.2×) Ran scripts/garry-output-comparison.ts across all 15 public garrytan/* repos. Aggregated results into docs/throughput-2013-vs-2026.json and ran scripts/update-readme-throughput.ts to replace the README placeholder. 2013 public activity: 2 commits, 2,384 logical lines added across 1 week, in 1 repo (zurb-foundation-wysihtml5 upstream contribution). 2026 public activity: 279 commits, 310,484 logical lines added across 17 active weeks, in 3 repos (gbrain, gstack, resend_robot). Multiples (public repos only, apples-to-apples): - Logical SLOC: 130.2× - Commits per active week: 8.2× - Raw lines added: 134.4× Private work at both eras (2013 Bookface at YC, Posterous-era code, 2026 internal tools) is excluded from this comparison. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat: 207× throughput multiple (with private repos + Bookface) Re-ran scripts/garry-output-comparison.ts across all 41 repos under garrytan/* (15 public + 26 private), including Bookface (YC's internal social network, 2013-era work). 2013 activity: 71 commits, 5,143 logical lines, 4 active repos (bookface, delicounter, tandong, zurb-foundation-wysihtml5) 2026 activity: 350 commits, 1,064,818 logical lines, 15 active repos (gbrain, gstack, gbrowser, tax-app, kumo, tenjin, autoemail, kitsune, easy-chromium-compiles, conductor-playground, garryslist-agent, baku, gstack-website, resend_robot, garryslist-brain) Multiples: - Logical SLOC: 207× (up from 130.2× when including private work) - Raw lines: 223× - Commits/active-week: 3.4× Stopped committing docs/throughput-2013-vs-2026.json — analysis is a local artifact, not repo state. Added docs/throughput-*.json to .gitignore. Full markdown analysis at ~/throughput-analysis-2026-04-18.md (local-only). README multiple is now hardcoded; re-run the script and edit manually when you want to refresh it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: run rate vs year-to-date throughput comparison Two separate numbers in the README hero: - Run rate: ~700× (9,859 logical lines/day in 2026 vs 14/day in 2013) - Year-to-date: 207× (2026 through April 18 already exceeds 2013 full year by 207×) Previous "207× pro-rata" framing mixed full-year 2013 vs partial-year 2026. Run rate is the apples-to-apples normalization; YTD is the "already produced" total. Both are honest; both are compelling; they measure different things. Analysis at ~/throughput-analysis-2026-04-18.md (local-only). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(throughput): script natively computes to-date + run-rate multiples Enhanced scripts/garry-output-comparison.ts so both calculations come out of a single run instead of being reassembled ad-hoc in bash: PerYearResult now includes: - days_elapsed — 365 for past years, day-of-year for current - is_partial — flags the current (in-progress) year - per_day_rate — logical/raw/commits normalized by calendar day - annualized_projection — per_day_rate × 365 Output JSON's `multiples` now has two sibling blocks: - multiples.to_date — raw volume ratios (2026-YTD / 2013-full-year) - multiples.run_rate — per-day pace ratios (apples-to-apples) Back-compat: multiples.logical_lines_added still aliases to_date for older consumers reading the JSON. Updated README hero to cite both (picking up brain/* repo that was missed in the earlier aggregation pass): 2026 run rate: ~880× my 2013 pace (12,382 vs 14 logical lines/day) 2026 YTD: 260× the entire 2013 year Stderr summary now prints both multiples at the end of each run. Full analysis at ~/throughput-analysis-2026-04-18.md (local-only). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: ON_THE_LOC_CONTROVERSY methodology post + README link Long-form response to the "LOC is a meaningless vanity metric" critique. Covers: - The three branches of the LOC critique and which are right - Why logical SLOC (NCLOC) beats raw LOC as the honest measurement - Full method: author-scoped git diff, regex-classified added lines, aggregated across 41 public + private garrytan/* repos - Both calculations: to-date (260x) and run-rate (879x) - Steelman of the critics (greenfield-vs-maintenance, survivorship bias, quality-adjusted productivity, time-to-first-user) - Reproduction instructions Linked from README hero via a blockquote directly below the number. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * exclude: tax-app from throughput analysis (import-dominated history) tax-app's history is one commit of 104K logical lines — an initial import of a codebase, not authored work. Removing it to keep the comparison honest. Changes: - scripts/garry-output-comparison.ts: added EXCLUDED_REPOS constant with tax-app + a one-line rationale. The script now skips excluded repos with a stderr note and deletes any stale output JSON so aggregation loops don't pick up pre-exclusion numbers. - README hero: updated to 810× run rate + 240× YTD (were 880×/260×). Wording updated to "40 public + private repos ... after excluding repos dominated by imported code." - docs/ON_THE_LOC_CONTROVERSY.md: updated all numbers, added an "Exclusions" paragraph explaining tax-app, removed tax-app from the "shipped not WIP" example list. New numbers (2026 through day 108, without tax-app): - To-date: 240× logical SLOC (1,233,062 vs 5,143) - Run rate: 810× per-day pace (11,417 vs 14 logical/day) - Annualized: ~4.2M logical lines projected Future re-runs automatically skip tax-app. Add more exclusions to EXCLUDED_REPOS at the top of the script with a one-line rationale. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix: correct tax-app exclusion rationale tax-app is a demo app I built for an upcoming YC channel video, not an "import-dominated history" as the previous commit claimed. Excluded because it's not production shipping work, not because of an import commit. Updated rationale in scripts/garry-output-comparison.ts's EXCLUDED_REPOS constant, in docs/ON_THE_LOC_CONTROVERSY.md's method section + conclusion, and in the README hero wording ("one demo repo" vs the earlier "repos dominated by imported code"). Numbers unchanged — the exclusion itself is the same, just the reason. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: harden ON_THE_LOC_CONTROVERSY against Cramer + neckbeard critiques Reframes the thesis as "engineers can fly now" (amplification, not replacement) and fortifies the soft spots critics will attack. Added: - Flight-thesis opener: pilot vs walker, leverage not replacement. - Second deflation layer for AI verbosity (on top of NCLOC). Headline moves from 810x to 408x after generous 2x AI-boilerplate cut, with explicit sensitivity analysis showing the number is still large under pessimistic priors (5x → 162x, 10x → 81x, 100x impossible). - Weekly distribution check (kills "you had one burst week" attack). - Revert rate (2.0%) and post-merge fix rate (6.3%) with OSS comparables (K8s/Rails/Django band). Addresses "where are your error rates" directly. - Named production adoption signals (gstack 1000+ installs, gbrain beta, resend_robot paying API) with explicit concession that "shipped != used at scale" for most of the corpus. - Harder steelman: 5 specific concessions with quantified pivot points (e.g., "if 2013 baseline was 3.5x higher, 810x → 228x, still high"). Removed factual error: Posterous acquisition paragraph (Garry had already left Posterous by 2011, so the "Twitter bought our private repos" excuse for the 2013 corpus gap doesn't apply). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: update gstack/gbrain adoption numbers in LOC controversy post gstack: "1,000+ distinct project installations" → "tens of thousands of daily active users" (telemetry-reported, community tier, opt-in). gbrain: "small set of beta testers" → "hundreds of beta testers running it live." Both are the accurate current numbers. The concession paragraph below (about shipped != adopted at scale for the long-tail repos) still reads correctly since it's about the corpus as a whole, not gstack/gbrain specifically. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: reframe reproducibility note as OSS breakout flex "You'd need access to my private repos" → "Bookface and Posthaven are private, but gstack and gbrain are open-sourced with tens of thousands of GitHub stars and tens of thousands of confirmed regular users, among the most-used OSS projects in the world that didn't exist three months ago." Keeps the `gh repo list` command at the end for the actual reproducibility instruction. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Rewrite LOC controversy post - Lead with concession (LOC is garbage, do the math anyway) - Preempt 14 lines/day meme with historical baselines (Brooks, Jones, McConnell) - Remove 'neckbeard' language throughout - Add slop-scan story (Ben Vinegar, 5.24 → 1.96, 62% cut) - David Cramer GUnit joke - Add testing philosophy section (the real unlock) - ASCII weekly distribution chart - gstack telemetry section with real numbers (15K installs, 305K invocations, 95.2% success) - Top skills usage chart - Pick-your-priors paragraph moved earlier (the killer) - Sharper close: run the script, show me your numbers * docs: four precision fixes on LOC controversy post 1. Citation fix. Kernighan didn't say anything about LOC-as-metric (that's the famous "aircraft building by weight" quote, commonly misattributed but actually Bill Gates). Replaced "Kernighan implied it before that" with the real Dijkstra quote ("lines produced" vs "lines spent" from EWD1036, with direct link) + the Gates quote. Verified via web search. 2. Slop-scan direction clarified. "(highest on his benchmark)" was ambiguous — could read as a brag. Now: "Higher score = more slop. He ran it on gstack and we scored 5.24, the worst he'd measured at the time." Then the 62% cut lands as an actual win. 3. Prose/chart skill-usage ordering now matches. Added /plan-eng-review (28,014) to the prose list so it doesn't conflict with the chart below it. 4. Cut the "David — I owe you one / GUnit" insider joke. Most readers won't connect Cramer → Sentry → GUnit naming. Ends the slop-scan paragraph on the stronger line: "Run `bun test` and watch 2,000+ tests pass." Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: tighten four LOC post citations to match primary sources 1. Bill Gates quote: flagged as folklore-grade. Was "Bill Gates put it more memorably" (firm attribution). Now "The old line (widely attributed to Bill Gates, sourcing murky) puts it more memorably." The quote stands; honesty about attribution avoids the same misattribution trap we just fixed for Kernighan. 2. Capers Jones: "15-50 across thousands of projects" → "roughly 16-38 LOC/day across thousands of projects" — matches his actual published measurements (which also report as 325-750 LOC/month). 3. Steve McConnell: "10-50 for finished, tested, delivered code" was folklore. Replaced with his actual project-size-dependent range from Code Complete: "20-125 LOC/day for small projects (10K LOC) down to 1.5-25 for large projects (10M LOC) — it's size-dependent, not a single number." 4. Revert rate comparison: "Kubernetes, Rails, and Django historically run 1.5-3%" was unsourced. Replaced with "mature OSS codebases typically run 1-3%" + "run the same command on whatever you consider the bar and compare." No false specificity about which repos. Net: every quantitative citation in the post now matches primary-source figures or is explicitly flagged as folklore. Neckbeards can verify. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: drop Writing style section from README Was sitting in prime real estate between Quick start and Install — internal implementation detail, not something users need up-front. Existing coverage is enough: - Upgrade migration prompt notifies users on first post-upgrade run - CLAUDE.md has the contributor note - docs/designs/PLAN_TUNING_V1.md has the full design Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: collapse team-mode setup into one paste-and-go command Step 2 was three separate code blocks: setup --team, then team-init, then git add/commit. Mirrors Step 1's style now — one shell one-liner that does all three. Subshell (cd && ./setup --team) keeps the user in their repo pwd so team-init + git commit land in the right place. "Swap required for optional" moved to a one-liner below. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: move full-clone footnote from README to CONTRIBUTING The "Contributing or need full history?" note is for contributors, not for someone following the README install flow. Moved into CONTRIBUTING's Quick start section where it fits next to the existing clone command, with a tip to upgrade an existing shallow clone via \`git fetch --unshallow\`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: root <root@localhost>
54 KiB
TODOS
P0: PACING_UPDATES_V0 — Louise's fatigue root cause (V1.1)
What: Implement the pacing overhaul extracted from PLAN_TUNING_V1. Full design in docs/designs/PACING_UPDATES_V0.md. Requires: session-state model, phase field in question-log schema, registry extension for dynamic findings, pacing as skill-template control flow (not preamble prose), bin/gstack-flip-decision command, migration-prompt budget rule, first-run preamble audit, ranking threshold calibration from real V0 data, one-way-door uncapped rule, concrete verification values.
Why: Louise de Sadeleer's "yes yes yes" during /autoplan was pacing + agency, not (only) jargon density. V1 addresses jargon (ELI10 writing). V1.1 addresses the interruption-volume half. Without this, V1 only gets halfway to the HOLY SHIT outcome.
Pros: End-to-end answer to Louise's feedback. Ships real calibration data from V1 usage. Completes the V0 → V2 pacing arc started in PLAN_TUNING_V0.
Cons: Substantial scope (10 items in docs/designs/PACING_UPDATES_V0.md). Needs its own CEO + Codex + DX + Eng review cycle. Calibration depends on real V0 question-log distribution.
Context: PLAN_TUNING_V1 attempted to bundle pacing. Three eng-review passes + two Codex passes surfaced 10 structural gaps unfixable via plan-text editing. Extracted to V1.1 as a dedicated plan.
Depends on / blocked by: V1 shipping (provides Louise's baseline transcript for calibration).
Plan Tune (v2 deferrals from v0.19.0.0 rollback)
All six items are gated on v1 dogfood results and the acceptance criteria in
docs/designs/PLAN_TUNING_V0.md. They were explicitly deferred after Codex's
outside-voice review drove a scope rollback from the CEO EXPANSION plan. v1
ships the observational substrate only; v2 adds behavior adaptation.
E1 — Substrate wiring (5 skills consume profile)
What: Add {{PROFILE_ADAPTATION:<skill>}} placeholder to ship, review,
office-hours, plan-ceo-review, plan-eng-review SKILL.md.tmpl files. Implement
scripts/resolvers/profile-consumer.ts with a per-skill adaptation registry
(scripts/profile-adaptations/{skill}.ts). Each consumer reads
~/.gstack/developer-profile.json on preamble and adapts skill-specific
defaults (verbosity, mode selection, severity thresholds, pushback intensity).
Why: v1 observational profile writes a file nobody reads. The substrate claim only becomes real when skills actually consume it. Without this, /plan-tune is a fancy config page.
Pros: gstack feels personal. Every skill adapts to the user's steering style instead of defaulting to middle-of-the-road.
Cons: Risk of psychographic drift if profile is noisy. Requires calibrated profile (v1 acceptance criteria: 90+ days stable across 3+ skills).
Context: See docs/designs/PLAN_TUNING_V0.md §Deferred to v2. v1 ships the
signal map + inferred computation; it's displayed in /plan-tune but no skill
reads it yet.
Effort: L (human: ~1 week / CC: ~4h) Priority: P0 Depends on: 2+ weeks of v1 dogfood, profile diversity check passing.
E3 — /plan-tune narrative + /plan-tune vibe
What: Event-anchored narrative ("You accepted 7 scope expansions, overrode test_failure_triage 4 times, called every PR 'boil the lake'") + one-word vibe archetype (Cathedral Builder, Ship-It Pragmatist, Deep Craft, etc). scripts/archetypes.ts is ALREADY SHIPPED in v1 (8 archetypes + Polymath fallback). v2 work is the narrative generator + /plan-tune skill wiring.
Why: Makes profile tangible and shareable. Screenshot-able.
Pros: Killer delight feature. Social surface for gstack. Concrete, specific output anchored in real events (not generic AI slop).
Cons: Requires stable inferred profile — without calibration it produces generic paragraphs. Gen-tests need to validate no-slop.
Context: Archetypes already defined. Just need the /plan-tune narrative subcommand + slop-check test.
Effort: S+ (human: ~1 day / CC: ~1h) Priority: P0 Depends on: Calibrated profile (>= 20 events, 3+ skills, 7+ days span).
E4 — Blind-spot coach
What: Preamble injection that surfaces the OPPOSITE of the user's profile
once per session per tier >= 2 skill. Boil-the-ocean user gets challenged on
scope ("what's the 80% version?"); small-scope user gets challenged on ambition.
scripts/resolvers/blind-spot-coach.ts. Marker file for session dedup. Opt-out
via gstack-config set blind_spot_coach false.
Why: Makes gstack a coach (challenges you) instead of a mirror (reflects you). The killer differentiation vs. a settings menu.
Pros: The feature that makes gstack feel like Garry. Surfaces assumptions the user hasn't challenged.
Cons: Logically conflicts with E1 (which adapts TO profile) and E6 (which flags mismatch). Requires interaction-budget design: global session budget + escalation rules + explicit exclusion from mismatch detection. Risk of feeling like a nag if fires wrong.
Context: v2 must redesign to resolve the E1/E4/E6 composition issue Codex caught. Dogfood required to calibrate frequency.
Effort: M (human: ~3 days / CC: ~2h design + ~1h impl) Priority: P0 Depends on: E1 shipped + interaction-budget design spec.
E5 — LANDED celebration HTML page
What: When a PR authored by the user is newly merged to the base branch, open an animated HTML celebration page in the browser. Confetti + typewriter headline + stats counter. Shows: what we built (PR stats + CHANGELOG entry), road traveled (scope decisions from CEO plan), road not traveled (deferred items), where we're going (next TODOs), who you are as a builder (vibe + narrative + profile delta for this ship). Self-contained HTML (CSS animations only, no JS deps).
CRITICAL REVISION from v0 plan: Passive detection must NOT live in the
preamble (Codex #9). When promoted, moves to explicit /plan-tune show-landed
OR post-ship hook — not passive detection in the hot path.
Why: Biggest personality moment in gstack. The "one-word thing that makes you remember why you built this."
Pros: Screenshot-worthy. Shareable. The kind of dopamine hit that turns power users into evangelists.
Cons: Product theater if the substrate isn't solid. Needs /design-shotgun → /design-html for the visual direction. Requires E2 unified profile for narrative/vibe data.
Context: /land-and-deploy trust/adoption is low, so passive detection is
the right trigger shape. Dedup marker per PR in ~/.gstack/.landed-celebrated-*.
E2E tests for squash/merge-commit/rebase/co-author/fresh-clone/dedup variants.
Effort: M+ (human: ~1 week / CC: ~3h total) Priority: P0 Depends on: E3 narrative/vibe shipped. /design-shotgun run on real PR data to pick a visual direction, then /design-html to finalize.
E6 — Auto-adjustment based on declared ↔ inferred mismatch
What: Currently /plan-tune shows the gap between declared and inferred
(v1 observational). v2 auto-suggests declaration updates when the gap exceeds
a threshold ("Your profile says hands-off but you've overridden 40% of
recommendations — you're actually taste-driven. Update declared autonomy from
0.8 to 0.5?"). Requires explicit user confirmation before any mutation (Codex
trust-boundary #15 already baked into v1).
Why: Profile drifts silently without correction. Self-correcting profile stays honest.
Pros: Profile becomes more accurate over time. User sees the gap and decides.
Cons: Requires stable inferred profile (diversity check). False positives nag the user.
Context: v1 has --check-mismatch that flags > 0.3 gaps but doesn't
suggest fixes. v2 adds the suggestion UX + per-dimension threshold tuning from
real data.
Effort: S (human: ~1 day / CC: ~45min) Priority: P0 Depends on: Calibrated profile + real mismatch data from v1 dogfood.
E7 — Psychographic auto-decide
What: When inferred profile is calibrated AND a question is two-way AND the user's dimensions strongly favor one option, auto-choose without asking (visible annotation: "Auto-decided via profile. Change with /plan-tune."). v1 only auto-decides via EXPLICIT per-question preferences; v2 adds profile-driven auto-decide.
Why: The whole point of the psychographic. Silent, correct defaults based on who the user IS, not just what they've said.
Pros: Friction-free skill invocation for calibrated power users. Over time, gstack feels like it's reading your mind.
Cons: Highest-risk deferral. Wrong auto-decides are costly. Requires very high confidence in the signal map AND calibration gate.
Context: v1 diversity gate is sample_size >= 20 AND skills_covered >= 3 AND question_ids_covered >= 8 AND days_span >= 7. v2 must prove this gate
actually catches noisy profiles before shipping.
Effort: M (human: ~3 days / CC: ~2h) Priority: P0 Depends on: E1 (skills consuming profile) + real observed data showing calibration gate is trustworthy.
Browse
Scope sidebar-agent kill to session PID, not pkill -f sidebar-agent\.ts
What: shutdown() in browse/src/server.ts:1193 uses pkill -f sidebar-agent\.ts to kill the sidebar-agent daemon, which matches every sidebar-agent on the machine, not just the one this server spawned. Replace with PID tracking: store the sidebar-agent PID when cli.ts spawns it (via state file or env), then process.kill(pid, 'SIGTERM') in shutdown().
Why: A user running two Conductor worktrees (or any multi-session setup), each with its own $B connect, closes one browser window ... and the other worktree's sidebar-agent gets killed too. The blast radius was there before, but the v0.18.1.0 disconnect-cleanup fix makes it more reachable: every user-close now runs the full shutdown() path, whereas before user-close bypassed it.
Context: Surfaced by /ship's adversarial review on v0.18.1.0. Pre-existing code, not introduced by the fix. Fix requires propagating the sidebar-agent PID from cli.ts spawn site (~line 885) into the server's state file so shutdown() can target just this session's agent. Related: browse/src/cli.ts spawns with Bun.spawn(...).unref() and already captures agentProc.pid.
Effort: S (human: ~2h / CC: ~15min) Priority: P2 Depends on: None
Sidebar Security
ML Prompt Injection Classifier
What: Add DeBERTa-v3-base-prompt-injection-v2 via @huggingface/transformers v4 (WASM backend) as an ML defense layer for the Chrome sidebar. Reusable browse/src/security.ts module with checkInjection() API. Includes canary tokens, attack logging, shield icon, special telemetry (AskUserQuestion on detection even when telemetry off), and BrowseSafe-bench red team test harness (3,680 adversarial cases from Perplexity).
Why: PR 1 fixes the architecture (command allowlist, XML framing, Opus default). But attackers can still trick Claude into navigating to phishing sites or exfiltrating visible page data via allowed browse commands. The ML classifier catches prompt injection patterns that architectural controls can't see. 94.8% accuracy, 99.6% recall, ~50-100ms inference via WASM. Defense-in-depth.
Context: Full design doc with industry research, open source tool landscape, Codex review findings, and ambitious Bun-native vision (5ms inference via FFI + Apple Accelerate): docs/designs/ML_PROMPT_INJECTION_KILLER.md. CEO plan with scope decisions: ~/.gstack/projects/garrytan-gstack/ceo-plans/2026-03-28-sidebar-prompt-injection-defense.md.
Effort: L (human: ~2 weeks / CC: ~3-4 hours) Priority: P0 Depends on: Sidebar security fix PR (command allowlist + XML framing + arg fix) landing first
Builder Ethos
First-time Search Before Building intro
What: Add a generateSearchIntro() function (like generateLakeIntro()) that introduces the Search Before Building principle on first use, with a link to the blog essay.
Why: Boil the Lake has an intro flow that links to the essay and marks .completeness-intro-seen. Search Before Building should have the same pattern for discoverability.
Context: Blocked on a blog post to link to. When the essay exists, add the intro flow with a .search-intro-seen marker file. Pattern: generateLakeIntro() at gen-skill-docs.ts:176.
Effort: S Priority: P2 Depends on: Blog post about Search Before Building
Chrome DevTools MCP Integration
Real Chrome session access
What: Integrate Chrome DevTools MCP to connect to the user's real Chrome session with real cookies, real state, no Playwright middleman.
Why: Right now, headed mode launches a fresh Chromium profile. Users must log in manually or import cookies. Chrome DevTools MCP connects to the user's actual Chrome ... instant access to every authenticated site. This is the future of browser automation for AI agents.
Context: Google shipped Chrome DevTools MCP in Chrome 146+ (June 2025). It provides screenshots, console messages, performance traces, Lighthouse audits, and full page interaction through the user's real browser. gstack should use it for real-session access while keeping Playwright for headless CI/testing workflows.
Potential new skills:
/debug-browser: JS error tracing with source-mapped stack traces/perf-debug: performance traces, Core Web Vitals, network waterfall
May replace /setup-browser-cookies for most use cases since the user's real cookies are already there.
Effort: L (human: ~2 weeks / CC: ~2 hours) Priority: P0 Depends on: Chrome 146+, DevTools MCP server installed
Browse
Bundle server.ts into compiled binary
What: Eliminate resolveServerScript() fallback chain entirely — bundle server.ts into the compiled browse binary.
Why: The current fallback chain (check adjacent to cli.ts, check global install) is fragile and caused bugs in v0.3.2. A single compiled binary is simpler and more reliable.
Context: Bun's --compile flag can bundle multiple entry points. The server is currently resolved at runtime via file path lookup. Bundling it removes the resolution step entirely.
Effort: M Priority: P2 Depends on: None
Sessions (isolated browser instances)
What: Isolated browser instances with separate cookies/storage/history, addressable by name.
Why: Enables parallel testing of different user roles, A/B test verification, and clean auth state management.
Context: Requires Playwright browser context isolation. Each session gets its own context with independent cookies/localStorage. Prerequisite for video recording (clean context lifecycle) and auth vault.
Effort: L Priority: P3
Video recording
What: Record browser interactions as video (start/stop controls).
Why: Video evidence in QA reports and PR bodies. Currently deferred because recreateContext() destroys page state.
Context: Needs sessions for clean context lifecycle. Playwright supports video recording per context. Also needs WebM → GIF conversion for PR embedding.
Effort: M Priority: P3 Depends on: Sessions
v20 encryption format support
What: AES-256-GCM support for future Chromium cookie DB versions (currently v10).
Why: Future Chromium versions may change encryption format. Proactive support prevents breakage.
Effort: S Priority: P3
State persistence — SHIPPED
What: Save/load cookies + localStorage to JSON files for reproducible test sessions.
$B state save/load ships in v0.12.1.0. V1 saves cookies + URLs only (not localStorage, which breaks on load-before-navigate). Files at .gstack/browse-states/{name}.json with 0o600 permissions. Load replaces session (closes all pages first). Name sanitized to [a-zA-Z0-9_-].
Remaining: V2 localStorage support (needs pre-navigation injection strategy). Completed: v0.12.1.0 (2026-03-26)
Auth vault
What: Encrypted credential storage, referenced by name. LLM never sees passwords.
Why: Security — currently auth credentials flow through the LLM context. Vault keeps secrets out of the AI's view.
Effort: L Priority: P3 Depends on: Sessions, state persistence
Iframe support — SHIPPED
What: frame <sel> and frame main commands for cross-frame interaction.
$B frame ships in v0.12.1.0. Supports CSS selector, @ref, --name, and --url pattern matching. Execution target abstraction (getActiveFrameOrPage()) across all read/write/snapshot commands. Frame context cleared on navigation, tab switch, resume. Detached frame auto-recovery. Page-only operations (goto, screenshot, viewport) throw clear error when in frame context.
Completed: v0.12.1.0 (2026-03-26)
Semantic locators
What: find role/label/text/placeholder/testid with attached actions.
Why: More resilient element selection than CSS selectors or ref numbers.
Effort: M Priority: P4
Device emulation presets
What: set device "iPhone 16 Pro" for mobile/tablet testing.
Why: Responsive layout testing without manual viewport resizing.
Effort: S Priority: P4
Network mocking/routing
What: Intercept, block, and mock network requests.
Why: Test error states, loading states, and offline behavior.
Effort: M Priority: P4
Download handling
What: Click-to-download with path control.
Why: Test file download flows end-to-end.
Effort: S Priority: P4
Content safety
What: --max-output truncation, --allowed-domains filtering.
Why: Prevent context window overflow and restrict navigation to safe domains.
Effort: S Priority: P4
Streaming (WebSocket live preview)
What: WebSocket-based live preview for pair browsing sessions.
Why: Enables real-time collaboration — human watches AI browse.
Effort: L Priority: P4
Headed mode with Chrome extension — SHIPPED
$B connect launches Playwright's bundled Chromium in headed mode with the gstack Chrome extension auto-loaded. $B handoff now produces the same result (extension + side panel). Sidebar chat gated behind --chat flag.
$B watch — SHIPPED
Claude observes user browsing in passive read-only mode with periodic snapshots. $B watch stop exits with summary. Mutation commands blocked during watch.
Sidebar scout / file drop relay — SHIPPED
Sidebar agent writes structured messages to .context/sidebar-inbox/. Workspace agent reads via $B inbox. Message format: {type, timestamp, page, userMessage, sidebarSessionId}.
Multi-agent tab isolation
What: Two Claude sessions connect to the same browser, each operating on different tabs. No cross-contamination.
Why: Enables parallel /qa + /design-review on different tabs in the same browser.
Context: Requires tab ownership model for concurrent headed connections. Playwright may not cleanly support two persistent contexts. Needs investigation.
Effort: L (human: ~2 weeks / CC: ~2 hours) Priority: P3 Depends on: Headed mode (shipped)
Sidebar agent needs Write tool + better error visibility — SHIPPED
What: Two issues with the sidebar agent (sidebar-agent.ts): (1) --allowedTools is hardcoded to Bash,Read,Glob,Grep, missing Write. Claude can't create files (like CSVs) when asked. (2) When Claude errors or returns empty, the sidebar UI shows nothing, just a green dot. No error message, no "I tried but failed", nothing.
Completed: v0.15.4.0 (2026-04-04). Write tool added to allowedTools. 40+ empty catch blocks replaced with [gstack sidebar], [gstack bg], [browse], [sidebar-agent] prefixed console logging across all 4 files (sidepanel.js, background.js, server.ts, sidebar-agent.ts). Error placeholder text now shows in red. Auth token stale-refresh bug fixed.
Sidebar direct API calls (eliminate claude -p startup tax)
What: Each sidebar message spawns a fresh claude -p process (~2-3s cold start overhead). For "click @e24" that's absurd. Direct Anthropic API calls would be sub-second.
Why: The claude -p startup cost is: process spawn (~100ms) + CLI init (~500ms-1s) + API connection (~200ms) + first token. Model routing (Sonnet for actions) helps but doesn't fix the CLI overhead.
Context: server.ts:spawnClaude() builds args and writes to queue file. sidebar-agent.ts:askClaude() spawns claude -p. Replace with direct fetch('https://api.anthropic.com/...') with tool use. Requires ANTHROPIC_API_KEY accessible to the browse server.
Effort: M (human: ~1 week / CC: ~30min) Priority: P2 Depends on: None
Chrome Web Store publishing
What: Publish the gstack browse Chrome extension to Chrome Web Store for easier install.
Why: Currently sideloaded via chrome://extensions. Web Store makes install one-click.
Effort: S Priority: P4 Depends on: Chrome extension proving value via sideloading
Linux cookie decryption — PARTIALLY SHIPPED
What: GNOME Keyring / kwallet / DPAPI support for non-macOS cookie import.
Linux cookie import shipped in v0.11.11.0 (Wave 3). Supports Chrome, Chromium, Brave, Edge on Linux with GNOME Keyring (libsecret) and "peanuts" fallback. Windows DPAPI support remains deferred.
Remaining: Windows cookie decryption (DPAPI). Needs complete rewrite — PR #64 was 1346 lines and stale.
Effort: L (Windows only) Priority: P4 Completed (Linux): v0.11.11.0 (2026-03-23)
Ship
GitLab support for /land-and-deploy
What: Add GitLab MR merge + CI polling support to /land-and-deploy skill. Currently uses gh pr view, gh pr checks, gh pr merge, and gh run list/view in 15+ places — each needs a GitLab conditional path using glab ci status, glab mr merge, etc.
Why: Without this, GitLab users can /ship (create MR) but can't /land-and-deploy (merge + verify). Completes the GitLab story end-to-end.
Context: /retro, /ship, and /document-release now support GitLab via the multi-platform BASE_BRANCH_DETECT resolver. /land-and-deploy has deeper GitHub-specific semantics (merge queues, required checks via gh pr checks, deploy workflow polling) that have different shapes on GitLab. The glab CLI (v1.90.0) supports glab mr merge, glab ci status, glab ci view but with different output formats and no merge queue concept.
Effort: L Priority: P2 Depends on: None (BASE_BRANCH_DETECT multi-platform resolver is already done)
Multi-commit CHANGELOG completeness eval
What: Add a periodic E2E eval that creates a branch with 5+ commits spanning 3+ themes (features, cleanup, infra), runs /ship's Step 5 CHANGELOG generation, and verifies the CHANGELOG mentions all themes.
Why: The bug fixed in v0.11.22 (garrytan/ship-full-commit-coverage) showed that /ship's CHANGELOG generation biased toward recent commits on long branches. The prompt fix adds a cross-check, but no test exercises the multi-commit failure mode. The existing ship-local-workflow E2E only uses a single-commit branch.
Context: Would be a periodic tier test (~$4/run, non-deterministic since it tests LLM instruction-following). Setup: create bare remote, clone, add 5+ commits across different themes on a feature branch, run Step 5 via claude -p, verify CHANGELOG output covers all themes. Pattern: ship-local-workflow in test/skill-e2e-workflow.test.ts.
Effort: M Priority: P3 Depends on: None
Ship log — persistent record of /ship runs
What: Append structured JSON entry to .gstack/ship-log.json at end of every /ship run (version, date, branch, PR URL, review findings, Greptile stats, todos completed, test results).
Why: /retro has no structured data about shipping velocity. Ship log enables: PRs-per-week trending, review finding rates, Greptile signal over time, test suite growth.
Context: /retro already reads greptile-history.md — same pattern. Eval persistence (eval-store.ts) shows the JSON append pattern exists in the codebase. ~15 lines in ship template.
Effort: S Priority: P2 Depends on: None
Visual verification with screenshots in PR body
What: /ship Step 7.5: screenshot key pages after push, embed in PR body.
Why: Visual evidence in PRs. Reviewers see what changed without deploying locally.
Context: Part of Phase 3.6. Needs S3 upload for image hosting.
Effort: M Priority: P2 Depends on: /setup-gstack-upload
Review
Inline PR annotations
What: /ship and /review post inline review comments at specific file:line locations using gh api to create pull request review comments.
Why: Line-level annotations are more actionable than top-level comments. The PR thread becomes a line-by-line conversation between Greptile, Claude, and human reviewers.
Context: GitHub supports inline review comments via gh api repos/$REPO/pulls/$PR/reviews. Pairs naturally with Phase 3.6 visual annotations.
Effort: S Priority: P2 Depends on: None
Greptile training feedback export
What: Aggregate greptile-history.md into machine-readable JSON summary of false positive patterns, exportable to the Greptile team for model improvement.
Why: Closes the feedback loop — Greptile can use FP data to stop making the same mistakes on your codebase.
Context: Was a P3 Future Idea. Upgraded to P2 now that greptile-history.md data infrastructure exists. The signal data is already being collected; this just makes it exportable. ~40 lines.
Effort: S Priority: P2 Depends on: Enough FP data accumulated (10+ entries)
Visual review with annotated screenshots
What: /review Step 4.5: browse PR's preview deploy, annotated screenshots of changed pages, compare against production, check responsive layouts, verify accessibility tree.
Why: Visual diff catches layout regressions that code review misses.
Context: Part of Phase 3.6. Needs S3 upload for image hosting.
Effort: M Priority: P2 Depends on: /setup-gstack-upload
QA
QA trend tracking
What: Compare baseline.json over time, detect regressions across QA runs.
Why: Spot quality trends — is the app getting better or worse?
Context: QA already writes structured reports. This adds cross-run comparison.
Effort: S Priority: P2
CI/CD QA integration
What: /qa as GitHub Action step, fail PR if health score drops.
Why: Automated quality gate in CI. Catch regressions before merge.
Effort: M Priority: P2
Smart default QA tier
What: After a few runs, check index.md for user's usual tier pick, skip the AskUserQuestion.
Why: Reduces friction for repeat users.
Effort: S Priority: P2
Accessibility audit mode
What: --a11y flag for focused accessibility testing.
Why: Dedicated accessibility testing beyond the general QA checklist.
Effort: S Priority: P3
CI/CD generation for non-GitHub providers
What: Extend CI/CD bootstrap to generate GitLab CI (.gitlab-ci.yml), CircleCI (.circleci/config.yml), and Bitrise pipelines.
Why: Not all projects use GitHub Actions. Universal CI/CD bootstrap would make test bootstrap work for everyone.
Context: v1 ships with GitHub Actions only. Detection logic already checks for .gitlab-ci.yml, .circleci/, bitrise.yml and skips with an informational note. Each provider needs ~20 lines of template text in generateTestBootstrap().
Effort: M Priority: P3 Depends on: Test bootstrap (shipped)
Auto-upgrade weak tests (★) to strong tests (★★★)
What: When Step 7 coverage audit identifies existing ★-rated tests (smoke/trivial assertions), generate improved versions testing edge cases and error paths.
Why: Many codebases have tests that technically exist but don't catch real bugs — expect(component).toBeDefined() isn't testing behavior. Upgrading these closes the gap between "has tests" and "has good tests."
Context: Requires the quality scoring rubric from the test coverage audit. Modifying existing test files is riskier than creating new ones — needs careful diffing to ensure the upgraded test still passes. Consider creating a companion test file rather than modifying the original.
Effort: M Priority: P3 Depends on: Test quality scoring (shipped)
Retro
Deployment health tracking (retro + browse)
What: Screenshot production state, check perf metrics (page load times), count console errors across key pages, track trends over retro window.
Why: Retro should include production health alongside code metrics.
Context: Requires browse integration. Screenshots + metrics fed into retro output.
Effort: L Priority: P3 Depends on: Browse sessions
Infrastructure
/setup-gstack-upload skill (S3 bucket)
What: Configure S3 bucket for image hosting. One-time setup for visual PR annotations.
Why: Prerequisite for visual PR annotations in /ship and /review.
Effort: M Priority: P2
gstack-upload helper
What: browse/bin/gstack-upload — upload file to S3, return public URL.
Why: Shared utility for all skills that need to embed images in PRs.
Effort: S Priority: P2 Depends on: /setup-gstack-upload
WebM to GIF conversion
What: ffmpeg-based WebM → GIF conversion for video evidence in PRs.
Why: GitHub PR bodies render GIFs but not WebM. Needed for video recording evidence.
Effort: S Priority: P3 Depends on: Video recording
Extend worktree isolation to Claude E2E tests
What: Add useWorktree?: boolean option to runSkillTest() so any Claude E2E test can opt into worktree mode for full repo context instead of tmpdir fixtures.
Why: Some Claude E2E tests (CSO audit, review-sql-injection) create minimal fake repos but would produce more realistic results with full repo context. The infrastructure exists (describeWithWorktree() in e2e-helpers.ts) — this extends it to the session-runner level.
Context: WorktreeManager shipped in v0.11.12.0. Currently only Gemini/Codex tests use worktrees. Claude tests use planted-bug fixture repos which are correct for their purpose, but new tests that want real repo context can use describeWithWorktree() today. This TODO is about making it even easier via a flag on runSkillTest().
Effort: M (human: ~2 days / CC: ~20 min) Priority: P3 Depends on: Worktree isolation (shipped v0.11.12.0)
E2E model pinning — SHIPPED
What: Pin E2E tests to claude-sonnet-4-6 for cost efficiency, add retry:2 for flaky LLM responses.
Shipped: Default model changed to Sonnet for structure tests (~30), Opus retained for quality tests (~10). --retry 2 added. EVALS_MODEL env var for override. test:e2e:fast tier added. Rate-limit telemetry (first_response_ms, max_inter_turn_ms) and wall_clock_ms tracking added to eval-store.
Eval web dashboard
What: bun run eval:dashboard serves local HTML with charts: cost trending, detection rate, pass/fail history.
Why: Visual charts better for spotting trends than CLI tools.
Context: Reads ~/.gstack-dev/evals/*.json. ~200 lines HTML + chart.js via Bun HTTP server.
Effort: M Priority: P3 Depends on: Eval persistence (shipped in v0.3.6)
CI/CD QA quality gate
What: Run /qa as a GitHub Action step, fail PR if health score drops below threshold.
Why: Automated quality gate catches regressions before merge. Currently QA is manual — CI integration makes it part of the standard workflow.
Context: Requires headless browse binary available in CI. The /qa skill already produces baseline.json with health scores — CI step would compare against the main branch baseline and fail if score drops. Would need ANTHROPIC_API_KEY in CI secrets since /qa uses Claude.
Effort: M Priority: P2 Depends on: None
Cross-platform URL open helper
What: gstack-open-url helper script — detect platform, use open (macOS) or xdg-open (Linux).
Why: The first-time Completeness Principle intro uses macOS open to launch the essay. If gstack ever supports Linux, this silently fails.
Effort: S (human: ~30 min / CC: ~2 min) Priority: P4 Depends on: Nothing
CDP-based DOM mutation detection for ref staleness
What: Use Chrome DevTools Protocol DOM.documentUpdated / MutationObserver events to proactively invalidate stale refs when the DOM changes, without requiring an explicit snapshot call.
Why: Current ref staleness detection (async count() check) only catches stale refs at action time. CDP mutation detection would proactively warn when refs become stale, preventing the 5-second timeout entirely for SPA re-renders.
Context: Parts 1+2 of ref staleness fix (RefEntry metadata + eager validation via count()) are shipped. This is Part 3 — the most ambitious piece. Requires CDP session alongside Playwright, MutationObserver bridge, and careful performance tuning to avoid overhead on every DOM change.
Effort: L Priority: P3 Depends on: Ref staleness Parts 1+2 (shipped)
Office Hours / Design
Design docs → Supabase team store sync
What: Add design docs (*-design-*.md) to the Supabase sync pipeline alongside test plans, retro snapshots, and QA reports.
Why: Cross-team design discovery at scale. Local ~/.gstack/projects/$SLUG/ keyword-grep discovery works for same-machine users now, but Supabase sync makes it work across the whole team. Duplicate ideas surface, everyone sees what's been explored.
Context: /office-hours writes design docs to ~/.gstack/projects/$SLUG/. The team store already syncs test plans, retro snapshots, QA reports. Design docs follow the same pattern — just add a sync adapter.
Effort: S
Priority: P2
Depends on: garrytan/team-supabase-store branch landing on main
/yc-prep skill
What: Skill that helps founders prepare their YC application after /office-hours identifies strong signal. Pulls from the design doc, structures answers to YC app questions, runs a mock interview.
Why: Closes the loop. /office-hours identifies the founder, /yc-prep helps them apply well. The design doc already contains most of the raw material for a YC application.
Effort: M (human: ~2 weeks / CC: ~2 hours) Priority: P2 Depends on: office-hours founder discovery engine shipping first
Design Review
/plan-design-review + /qa-design-review + /design-consultation — SHIPPED
Shipped as v0.5.0 on main. Includes /plan-design-review (report-only design audit), /qa-design-review (audit + fix loop), and /design-consultation (interactive DESIGN.md creation). {{DESIGN_METHODOLOGY}} resolver provides shared 80-item design audit checklist.
Design outside voices in /plan-eng-review
What: Extend the parallel dual-voice pattern (Codex + Claude subagent) to /plan-eng-review's architecture review section.
Why: The design beachhead (v0.11.3.0) proves cross-model consensus works for subjective reviews. Architecture reviews have similar subjectivity in tradeoff decisions.
Context: Depends on learnings from the design beachhead. If the litmus scorecard format proves useful, adapt it for architecture dimensions (coupling, scaling, reversibility).
Effort: S Priority: P3 Depends on: Design outside voices shipped (v0.11.3.0)
Outside voices in /qa visual regression detection
What: Add Codex design voice to /qa for detecting visual regressions during bug-fix verification.
Why: When fixing bugs, the fix can introduce visual regressions that code-level checks miss. Codex could flag "the fix broke the responsive layout" during re-test.
Context: Depends on /qa having design awareness. Currently /qa focuses on functional testing.
Effort: M Priority: P3 Depends on: Design outside voices shipped (v0.11.3.0)
Document-Release
Auto-invoke /document-release from /ship — SHIPPED
Shipped in v0.8.3. Step 8.5 added to /ship — after creating the PR, /ship automatically reads document-release/SKILL.md and executes the doc update workflow. Zero-friction doc updates.
{{DOC_VOICE}} shared resolver
What: Create a placeholder resolver in gen-skill-docs.ts encoding the gstack voice guide (friendly, user-forward, lead with benefits). Inject into /ship Step 5, /document-release Step 5, and reference from CLAUDE.md.
Why: DRY — voice rules currently live inline in 3 places (CLAUDE.md CHANGELOG style section, /ship Step 5, /document-release Step 5). When the voice evolves, all three drift.
Context: Same pattern as {{QA_METHODOLOGY}} — shared block injected into multiple templates to prevent drift. ~20 lines in gen-skill-docs.ts.
Effort: S Priority: P2 Depends on: None
Ship Confidence Dashboard
Smart review relevance detection — PARTIALLY SHIPPED
What: Auto-detect which of the 4 reviews are relevant based on branch changes (skip Design Review if no CSS/view changes, skip Code Review if plan-only).
bin/gstack-diff-scope shipped — categorizes diff into SCOPE_FRONTEND, SCOPE_BACKEND, SCOPE_PROMPTS, SCOPE_TESTS, SCOPE_DOCS, SCOPE_CONFIG. Used by design-review-lite to skip when no frontend files changed. Dashboard integration for conditional row display is a follow-up.
Remaining: Dashboard conditional row display (hide "Design Review: NOT YET RUN" when SCOPE_FRONTEND=false). Extend to Eng Review (skip for docs-only) and CEO Review (skip for config-only).
Effort: S Priority: P3 Depends on: gstack-diff-scope (shipped)
Codex
Codex→Claude reverse buddy check skill
What: A Codex-native skill (.agents/skills/gstack-claude/SKILL.md) that runs claude -p to get an independent second opinion from Claude — the reverse of what /codex does today from Claude Code.
Why: Codex users deserve the same cross-model challenge that Claude users get via /codex. Currently the flow is one-way (Claude→Codex). Codex users have no way to get a Claude second opinion.
Context: The /codex skill template (codex/SKILL.md.tmpl) shows the pattern — it wraps codex exec with JSONL parsing, timeout handling, and structured output. The reverse skill would wrap claude -p with similar infrastructure. Would be generated into .agents/skills/gstack-claude/ by gen-skill-docs --host codex.
Effort: M (human: ~2 weeks / CC: ~30 min) Priority: P1 Depends on: None
Completeness
Completeness metrics dashboard
What: Track how often Claude chooses the complete option vs shortcut across gstack sessions. Aggregate into a dashboard showing completeness trend over time.
Why: Without measurement, we can't know if the Completeness Principle is working. Could surface patterns (e.g., certain skills still bias toward shortcuts).
Context: Would require logging choices (e.g., append to a JSONL file when AskUserQuestion resolves), parsing them, and displaying trends. Similar pattern to eval persistence.
Effort: M (human) / S (CC) Priority: P3 Depends on: Boil the Lake shipped (v0.6.1)
Safety & Observability
On-demand hook skills (/careful, /freeze, /guard) — SHIPPED
What: Three new skills that use Claude Code's session-scoped PreToolUse hooks to add safety guardrails on demand.
Shipped as /careful, /freeze, /guard, and /unfreeze in v0.6.5. Includes hook fire-rate telemetry (pattern name only, no command content) and inline skill activation telemetry.
Skill usage telemetry — SHIPPED
What: Track which skills get invoked, how often, from which repo.
Shipped in v0.6.5. TemplateContext in gen-skill-docs.ts bakes skill name into preamble telemetry line. Analytics CLI (bun run analytics) for querying. /retro integration shows skills-used-this-week.
/investigate scoped debugging enhancements (gated on telemetry)
What: Six enhancements to /investigate auto-freeze, contingent on telemetry showing the freeze hook actually fires in real debugging sessions.
Why: /investigate v0.7.1 auto-freezes edits to the module being debugged. If telemetry shows the hook fires often, these enhancements make the experience smarter. If it never fires, the problem wasn't real and these aren't worth building.
Context: All items are prose additions to investigate/SKILL.md.tmpl. No new scripts.
Items:
- Stack trace auto-detection for freeze directory (parse deepest app frame)
- Freeze boundary widening (ask to widen instead of hard-block when hitting boundary)
- Post-fix auto-unfreeze + full test suite run
- Debug instrumentation cleanup (tag with DEBUG-TEMP, remove before commit)
- Debug session persistence (~/.gstack/investigate-sessions/ — save investigation for reuse)
- Investigation timeline in debug report (hypothesis log with timing)
Effort: M (all 6 combined) Priority: P3 Depends on: Telemetry data showing freeze hook fires in real /investigate sessions
Context Intelligence
Context recovery preamble
What: Add ~10 lines of prose to the preamble telling the agent to re-read gstack artifacts (CEO plans, design reviews, eng reviews, checkpoints) after compaction or context degradation.
Why: gstack skills produce valuable artifacts stored at ~/.gstack/projects/$SLUG/. When Claude's auto-compaction fires, it preserves a generic summary but doesn't know these artifacts exist. The plans and reviews that shaped the current work silently vanish from context, even though they're still on disk. This is the thing nobody else in the Claude Code ecosystem is solving, because nobody else has gstack's artifact architecture.
Context: Inspired by Anthropic's claude-progress.txt pattern for long-running agents. Also informed by claude-mem's "progressive disclosure" approach. See docs/designs/SESSION_INTELLIGENCE.md for the broader vision. CEO plan: ~/.gstack/projects/garrytan-gstack/ceo-plans/2026-03-31-session-intelligence-layer.md.
Effort: S (human: ~30 min / CC: ~5 min)
Priority: P1
Depends on: None
Key files: scripts/resolvers/preamble.ts
Session timeline
What: Append one-line JSONL entry to ~/.gstack/projects/$SLUG/timeline.jsonl after every skill run (timestamp, skill, branch, outcome). /retro renders the timeline.
Why: Makes AI-assisted work history visible. /retro can show "this week: 3 /review, 2 /ship, 1 /investigate." Provides the observability layer for the session intelligence architecture.
Effort: S (human: ~1h / CC: ~5 min)
Priority: P1
Depends on: None
Key files: scripts/resolvers/preamble.ts, retro/SKILL.md.tmpl
Cross-session context injection
What: When a new gstack session starts on a branch with recent checkpoints or plans, the preamble prints a one-line summary: "Last session: implemented JWT auth, 3/5 tasks done." Agent knows where you left off before reading any files.
Why: Claude starts every session fresh. This one-liner orients the agent immediately. Similar to claude-mem's SessionStart hook pattern but simpler and integrated.
Effort: S (human: ~2h / CC: ~10 min) Priority: P2 Depends on: Context recovery preamble
/checkpoint skill
What: Manual skill to snapshot current working state: what's being done and why, files being edited, decisions made (and rationale), what's done vs. remaining, critical types/signatures. Saved to ~/.gstack/projects/$SLUG/checkpoints/<timestamp>.md.
Why: Useful before stepping away from a long session, before known-complex operations that might trigger compaction, for handing off context to a different agent/workspace, or coming back to a project after days away.
Effort: M (human: ~1 week / CC: ~30 min)
Priority: P2
Depends on: Context recovery preamble
Key files: New checkpoint/SKILL.md.tmpl, scripts/gen-skill-docs.ts
Session Intelligence Layer design doc
What: Write docs/designs/SESSION_INTELLIGENCE.md describing the architectural vision: gstack as the persistent brain that survives Claude's ephemeral context. Every skill writes to ~/.gstack/projects/$SLUG/, preamble re-reads, /retro rolls up.
Why: Connects context recovery, health, checkpoint, and timeline features into a coherent architecture. Nobody else in the ecosystem is building this.
Effort: S (human: ~2h / CC: ~15 min) Priority: P1 Depends on: None
Health
/health — Project Health Dashboard
What: Skill that runs type-check, lint, test suite, and dead code scan, then reports a composite 0-10 health score with breakdown by category. Tracks over time in ~/.gstack/health/<project-slug>/ for trend detection. Optionally integrates CodeScene MCP for deeper complexity/cohesion/coupling analysis.
Why: No quick way to get "state of the codebase" before starting work. CodeScene peer-reviewed research shows AI-generated code increases static analysis warnings by 30%, code complexity by 41%, and change failure rates by 30%. Users need guardrails. Like /qa but for code quality rather than browser behavior.
Context: Reads CLAUDE.md for project-specific commands (platform-agnostic principle). Runs checks in parallel. /retro can pull from health history for trend sparklines.
Effort: M (human: ~1 week / CC: ~30 min)
Priority: P1
Depends on: None
Key files: New health/SKILL.md.tmpl, scripts/gen-skill-docs.ts
/health as /ship gate
What: If health score exists and drops below a configurable threshold, /ship warns before creating the PR: "Health dropped from 8/10 to 5/10 this branch — 3 new lint warnings, 1 test failure. Ship anyway?"
Why: Quality gate that prevents shipping degraded code. Configurable threshold so it's not blocking for teams that don't use /health.
Effort: S (human: ~1h / CC: ~5 min) Priority: P2 Depends on: /health skill
Swarm
Swarm primitive — reusable multi-agent dispatch
What: Extract Review Army's dispatch pattern into a reusable resolver (scripts/resolvers/swarm.ts). Wire into /ship for parallel pre-ship checks (type-check + lint + test in parallel sub-agents). Make available to /qa, /investigate, /health.
Why: Review Army proved parallel sub-agents work brilliantly (5 agents = 835K tokens of working memory vs. 167K for one). The pattern is locked inside review-army.ts. Other skills need it too. Claude Code Agent Teams (official, Feb 2026) validates the team-lead-delegates-to-specialists pattern. Gartner: multi-agent inquiries surged 1,445% in one year.
Context: Start with the specific /ship use case. Extract shared parts only after 2+ consumers reveal what config parameters are actually needed. Avoid premature abstraction. Can leverage existing WorktreeManager for isolation.
Effort: L (human: ~2 weeks / CC: ~2 hours)
Priority: P2
Depends on: None
Key files: scripts/resolvers/review-army.ts, new scripts/resolvers/swarm.ts, ship/SKILL.md.tmpl, lib/worktree.ts
Refactoring
/refactor-prep — Pre-Refactor Token Hygiene
What: Skill that detects project language/framework, runs appropriate dead code detection (knip/ts-prune for TS/JS, vulture/autoflake for Python, staticcheck/deadcode for Go, cargo udeps for Rust), strips dead imports/exports/props/console.logs, and commits cleanup separately.
Why: Dirty codebases accelerate context compaction. Dead imports, unused exports, and orphaned code eat tokens that contribute nothing but everything to triggering compaction mid-refactor. Cleaning first buys back 20%+ of context budget. Reports lines removed and estimated token savings.
Effort: M (human: ~1 week / CC: ~30 min)
Priority: P2
Depends on: None
Key files: New refactor-prep/SKILL.md.tmpl, scripts/gen-skill-docs.ts
Factory Droid
Browse MCP server for Factory Droid
What: Expose gstack's browse binary and key workflows as an MCP server that Factory Droid connects to natively. Factory users would run /mcp, add the gstack server, and get browse, QA, and review capabilities as Factory tools.
Why: Factory already supports 40+ MCP servers in its registry. Getting gstack's browse binary listed there is a distribution play. Nobody else has a real compiled browser binary as an MCP tool. This is the thing that makes gstack uniquely valuable on Factory Droid.
Context: Option A (--host factory compatibility shim) ships first in v0.13.4.0. Option B is the follow-up that provides deeper integration. The browse binary is already a stateless CLI, so wrapping it as an MCP server is straightforward (stdin/stdout JSON-RPC). Each browse command becomes an MCP tool.
Effort: L (human: ~1 week / CC: ~5 hours) Priority: P1 Depends on: --host factory (Option A, shipping in v0.13.4.0)
.agent/skills/ dual output for cross-agent compatibility
What: Factory also reads from <repo>/.agent/skills/ as a cross-agent compatibility path. Could output there in addition to .factory/skills/ for broader reach across other agents that use the .agent convention.
Why: Multiple AI agents beyond Factory may adopt the .agent/skills/ convention. Outputting there too would give free compatibility.
Effort: S Priority: P3 Depends on: --host factory
Custom Droid definitions alongside skills
What: Factory has "custom droids" (subagents with tool restrictions, model selection, autonomy levels). Could ship gstack-qa.md droid configs alongside skills that restrict tools to read-only + execute for safety.
Why: Deeper Factory integration. Droid configs give Factory users tighter control over what gstack skills can do.
Effort: M Priority: P3 Depends on: --host factory
GStack Browser
Anti-bot stealth: Playwright CDP patches (rebrowser-style)
What: Write a postinstall script that patches Playwright's CDP layer to suppress Runtime.enable and use addBinding for context ID discovery, same approach as rebrowser-patches. Eliminates the navigator.webdriver, cdc_ markers, and other CDP artifacts that sites like Google use to detect automation.
Why: Our current stealth patches (UA override, navigator.webdriver=false, fake plugins) work on most sites but Google still triggers captchas. The real detection is at the CDP protocol level. rebrowser-patches proved the approach works but their patches target Playwright 1.52.0 and don't apply to our 1.58.2. We need our own patcher using string matching instead of line-number diffs. 6 files, ~200 lines of patches total.
Context: Full analysis of rebrowser-patches source: patches 6 files in playwright-core/lib/server/ (crConnection.js, crDevTools.js, crPage.js, crServiceWorker.js, frames.js, page.js). Key technique: suppress Runtime.enable (the main CDP detection vector), use Runtime.addBinding + CustomEvent trick to discover execution context IDs without it. Our extension communicates via Chrome extension APIs, not CDP Runtime, so it should be unaffected. Write E2E tests that verify: (1) extension still loads and connects, (2) Google.com loads without captcha, (3) sidebar chat still works.
Effort: L (human: ~2 weeks / CC: ~3 hours) Priority: P1 Depends on: None
Chromium fork (long-term alternative to CDP patches)
What: Maintain a Chromium fork where anti-bot stealth, GStack Browser branding, and native sidebar support live in the source code, not as runtime monkey-patches.
Why: The CDP patches are brittle. They break on every Playwright upgrade and target compiled JS with fragile string matching. A proper fork means: (1) stealth is permanent, not patched, (2) branding is native (no plist hacking at launch), (3) native sidebar replaces the extension (Phase 4 of V0 roadmap), (4) custom protocols (gstack://) for internal pages. Companies like Brave, Arc, and Vivaldi maintain Chromium forks with small teams. With CC, the rebase-on-upstream maintenance could be largely automated.
Context: Trigger criteria from V0 design doc: fork when extension side panel becomes the bottleneck, when anti-bot patches need to live deeper than CDP, or when native UI integration (sidebar, status bar) can't be done via extension. The Chromium build takes ~4 hours on a 32-core machine and produces ~50GB of build artifacts. CI would need dedicated build infra. See docs/designs/GSTACK_BROWSER_V0.md Phase 5 for full analysis.
Effort: XL (human: ~1 quarter / CC: ~2-3 weeks of focused work) Priority: P2 Depends on: CDP patches proving the value of anti-bot stealth first
Completed
CI eval pipeline (v0.9.9.0)
- GitHub Actions eval upload on Ubicloud runners ($0.006/run)
- Within-file test concurrency (test() → testConcurrentIfSelected())
- Eval artifact upload + PR comment with pass/fail + cost
- Baseline comparison via artifact download from main
- EVALS_CONCURRENCY=40 for ~6min wall clock (was ~18min) Completed: v0.9.9.0
Deploy pipeline (v0.9.8.0)
- /land-and-deploy — merge PR, wait for CI/deploy, canary verification
- /canary — post-deploy monitoring loop with anomaly detection
- /benchmark — performance regression detection with Core Web Vitals
- /setup-deploy — one-time deploy platform configuration
- /review Performance & Bundle Impact pass
- E2E model pinning (Sonnet default, Opus for quality tests)
- E2E timing telemetry (first_response_ms, max_inter_turn_ms, wall_clock_ms)
- test:e2e:fast tier, --retry 2 on all E2E scripts Completed: v0.9.8.0
Phase 1: Foundations (v0.2.0)
- Rename to gstack
- Restructure to monorepo layout
- Setup script for skill symlinks
- Snapshot command with ref-based element selection
- Snapshot tests Completed: v0.2.0
Phase 2: Enhanced Browser (v0.2.0)
- Annotated screenshots, snapshot diffing, dialog handling, file upload
- Cursor-interactive elements, element state checks
- CircularBuffer, async buffer flush, health check
- Playwright error wrapping, useragent fix
- 148 integration tests Completed: v0.2.0
Phase 3: QA Testing Agent (v0.3.0)
- /qa SKILL.md with 6-phase workflow, 3 modes (full/quick/regression)
- Issue taxonomy, severity classification, exploration checklist
- Report template, health score rubric, framework detection
- wait/console/cookie-import commands, find-browse binary Completed: v0.3.0
Phase 3.5: Browser Cookie Import (v0.3.x)
- cookie-import-browser command (Chromium cookie DB decryption)
- Cookie picker web UI, /setup-browser-cookies skill
- 18 unit tests, browser registry (Comet, Chrome, Arc, Brave, Edge) Completed: v0.3.1
E2E test cost tracking
- Track cumulative API spend, warn if over threshold Completed: v0.3.6
Auto-upgrade mode + smart update check
- Config CLI (
bin/gstack-config), auto-upgrade via~/.gstack/config.yaml, 12h cache TTL, exponential snooze backoff (24h→48h→1wk), "never ask again" option, vendored copy sync on upgrade Completed: v0.3.8