feat: gstack v1 — simpler prompts + real LOC receipts (v1.0.0.0) (#1039)
* docs: add design doc for /plan-tune v1 (observational substrate)

Canonical record of the /plan-tune v1 design: typed question registry, per-question explicit preferences, inline tune: feedback with user-origin gate, dual-track profile (declared + inferred separately), and plain-English inspection skill. Captures every decision with pros/cons, what's deferred to v2 with explicit acceptance criteria, and what was rejected entirely.

Codex review drove a substantial scope rollback from the initial CEO EXPANSION plan. 15+ legitimate findings (substrate claim was false without a typed registry; E4/E6/clamp logical contradiction; profile poisoning attack surface; LANDED preamble side effect; implementation order) shaped the final design.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: typed question registry for /plan-tune v1 foundation

scripts/question-registry.ts declares 53 recurring AskUserQuestion categories across 15 skills (ship, review, office-hours, plan-ceo-review, plan-eng-review, plan-design-review, plan-devex-review, qa, investigate, land-and-deploy, cso, gstack-upgrade, preamble, plan-tune, autoplan). Each entry has: stable kebab-case id, skill owner, category (approval | clarification | routing | cherry-pick | feedback-loop), door_type (one-way | two-way), optional stable option keys, optional psychographic signal_key, and a one-line description.

12 of 53 are one-way doors (destructive ops, architecture/data forks, security/compliance). These are ALWAYS asked regardless of user preference.

Helpers: getQuestion(id), getOneWayDoorIds(), getAllRegisteredIds(), getRegistryStats().

No binary or resolver wiring yet — this is the schema substrate the rest of /plan-tune builds on. Ad-hoc question_ids (not registered) still log but skip psychographic signal attribution. Future /plan-tune skill surfaces frequently-firing ad-hoc ids as candidates for registry promotion.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: registry schema + safety + coverage tests (gate tier)

20 tests validating the question registry:

Schema (7 tests):
- Every entry has required fields
- All ids are kebab-case and start with their skill name
- No duplicate ids
- Categories are from the allowed set
- door_type is one-way | two-way
- Options arrays are well-formed
- Descriptions are short and single-line

Helpers (5 tests):
- getQuestion returns entry for known id, undefined for unknown
- getOneWayDoorIds includes destructive questions, excludes two-way
- getAllRegisteredIds count matches QUESTIONS keys
- getRegistryStats totals are internally consistent

One-way door safety (2 tests):
- Every critical question (test failure, SQL safety, LLM trust boundary, security scan, merge confirm, rollback, fix apply, premise revise, arch finding, privacy gate, user challenge) is declared one-way
- At least 10 one-way doors exist (catches regression if declarations are accidentally dropped)

Registry breadth (3 tests):
- 11 high-volume skills each have >= 1 registered question
- Preamble one-time prompts are registered
- /plan-tune's own questions are registered

Signal map references (1 test):
- signal_key values are typed kebab-case strings

Template coverage (2 tests, informational):
- AskUserQuestion usage across templates is non-trivial (>20)
- Registry spans >= 10 skills

20 pass, 0 fail.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: one-way door classifier (belt-and-suspenders safety fallback)

scripts/one-way-doors.ts — secondary keyword-pattern classifier that catches destructive questions even when the registry doesn't have an entry for them. The registry's door_type field (from scripts/question-registry.ts) is the PRIMARY safety gate. This classifier is the fallback for ad-hoc question_ids that agents generate at runtime.

Classification priority:
1. Registry lookup by question_id → use declared door_type
2. Skill:category fallback (cso:approval, land-and-deploy:approval)
3. Keyword pattern match against question_summary
4. Default: treat as two-way (safer to log the miss than auto-decide unsafely)

Covers 21 destructive patterns across:
- File system (rm -rf, delete, wipe, purge, truncate)
- Database (drop table/database/schema, delete from)
- Git/VCS (force-push, reset --hard, checkout --, branch -D)
- Deploy/infra (kubectl delete, terraform destroy, rollback)
- Credentials (revoke/reset/rotate API key|token|secret|password)
- Architecture (breaking change, schema migration, data model change)

7 new tests in test/plan-tune.test.ts covering: registry-first lookup, unknown-id fallthrough, keyword matching on destructive phrasings including embedded filler words ("rotate the API key"), skill-category fallback, benign questions defaulting to two-way, pattern-list non-empty.

27 pass, 0 fail. 1270 expect() calls.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: psychographic signal map + builder archetypes

scripts/psychographic-signals.ts — hand-crafted {signal_key, user_choice} → {dimension, delta} map. Version 0.1.0. Conservative deltas (±0.03 to ±0.06 per event). Covers 9 signal keys: scope-appetite, architecture-care, code-quality-care, test-discipline, detail-preference, design-care, devex-care, distribution-care, session-mode.

Helpers: applySignal() mutates running totals, newDimensionTotals() creates empty starting state, normalizeToDimensionValue() sigmoid-clamps accumulated delta to [0,1] (0 → 0.5 neutral), validateRegistrySignalKeys() checks that every signal_key in the registry has a SIGNAL_MAP entry.

In v1 the signal map is used ONLY to compute inferred dimension values for /plan-tune inspection output. No skill behavior adapts to these signals until v2.

scripts/archetypes.ts — 8 named archetypes + Polymath fallback:
- Cathedral Builder (boil-the-ocean + architecture-first)
- Ship-It Pragmatist (small scope + fast)
- Deep Craft (detail-verbose + principled)
- Taste Maker (intuitive, overrides recommendations)
- Solo Operator (high-autonomy, delegates)
- Consultant (hands-on, consulted on everything)
- Wedge Hunter (narrow scope aggressively)
- Builder-Coach (balanced steering)
- Polymath (fallback when no archetype matches)

matchArchetype() uses L2 distance scaled by tightness, with a 0.55 threshold below which we return Polymath. v1 ships the model stable; v2 narrative/vibe commands wire it into user-facing output.

14 new tests: signal map consistency vs registry, applySignal behavior for known/unknown keys, normalization bounds, archetype schema validity, name uniqueness, matchArchetype correctness for each reference profile, Polymath fallback for outliers.

41 pass, 0 fail total in test/plan-tune.test.ts.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: bin/gstack-question-log — append validated AskUserQuestion events

Append-only JSONL log at ~/.gstack/projects/{SLUG}/question-log.jsonl.

Schema: {skill, question_id, question_summary, category?, door_type?, options_count?, user_choice, recommended?, followed_recommendation?, session_id?, ts}

Validates:
- skill is kebab-case
- question_id is kebab-case, <= 64 chars
- question_summary non-empty, <= 200 chars, newlines flattened
- category is one of approval/clarification/routing/cherry-pick/feedback-loop
- door_type is one-way or two-way
- options_count is integer in [1, 26]
- user_choice non-empty string, <= 64 chars

Injection defense on question_summary rejects the same patterns as gstack-learnings-log (ignore previous instructions, system:, override:, do not report, etc). followed_recommendation is auto-computed when both user_choice and recommended are present. ts auto-injected as ISO 8601 if missing.

21 tests covering: valid payloads, full field preservation, auto-followed computation, appending, long-summary truncation, newline flattening, invalid JSON, missing fields, bad case, oversized ids, invalid enum values, out-of-range options_count, and 6 injection attack patterns.

21 pass, 0 fail, 43 expect() calls.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: bin/gstack-developer-profile — unified profile with migration

bin/gstack-developer-profile supersedes bin/gstack-builder-profile. The old binary becomes a one-line legacy shim delegating to --read for /office-hours backward compat.

Subcommands:
- --read legacy KEY:VALUE output (tier, session_count, etc)
- --migrate folds ~/.gstack/builder-profile.jsonl into ~/.gstack/developer-profile.json. Atomic (temp + rename), idempotent (no-op when target exists or source absent), archives source as .migrated-YYYY-MM-DD-HHMMSS
- --derive recomputes inferred dimensions from question-log.jsonl using the signal map in scripts/psychographic-signals.ts
- --profile full profile JSON
- --gap declared vs inferred diff JSON
- --trace <dim> event-level trace of what contributed to a dimension
- --check-mismatch flags dimensions where declared and inferred disagree by > 0.3 (requires >= 10 events first)
- --vibe archetype name + description from scripts/archetypes.ts
- --narrative (v2 stub)

Auto-migration on first read: if legacy file exists and new file doesn't, migrate before reading. Creates a neutral (all-0.5) stub if nothing exists.

Unified schema (see docs/designs/PLAN_TUNING_V0.md §Architecture): {identity, declared, inferred: {values, sample_size, diversity}, gap, overrides, sessions, signals_accumulated, schema_version}

25 new tests across subcommand behaviors:
- --read defaults + stub creation
- --migrate: 3 sessions preserved with signal tallies, idempotency, archival
- Tier calculation: welcome_back / regular / inner_circle boundaries
- --derive: neutral-when-empty, upward nudge on 'expand', downward on 'reduce', recomputable (same input → same output), ad-hoc unregistered ids ignored
- --trace: contributing events, empty for untouched dims, error without arg
- --gap: empty when no declared, correctly computed otherwise
- --vibe: returns archetype name + description
- --check-mismatch: threshold behavior, 10+ sample requirement
- Unknown subcommand errors

25 pass, 0 fail, 60 expect() calls.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: bin/gstack-question-preference — explicit preferences + user-origin gate

Subcommands:
- --check <id> → ASK_NORMALLY | AUTO_DECIDE (decides if a registered question should be auto-decided by the agent)
- --write '{…}' → set a preference (requires user-origin source)
- --read → dump preferences JSON
- --clear [id] → clear one or all
- --stats → short counts summary

Preference values: always-ask | never-ask | ask-only-for-one-way. Stored at ~/.gstack/projects/{SLUG}/question-preferences.json.

Safety contract (the core of Codex finding #16, profile-poisoning defense from docs/designs/PLAN_TUNING_V0.md §Security model):

1. One-way doors ALWAYS return ASK_NORMALLY from --check, regardless of user preference. User's never-ask is overridden with a visible safety note so the user knows why their preference didn't suppress the prompt.
2. --write requires an explicit `source` field:
   - Allowed: "plan-tune", "inline-user"
   - REJECTED with exit code 2: "inline-tool-output", "inline-file", "inline-file-content", "inline-unknown"
   Rejection is explicit ("profile poisoning defense") so the caller can log and surface the attempt.
3. free_text on --write is sanitized against injection patterns (ignore previous instructions, override:, system:, etc.) and newline-flattened.

Each --write also appends a preference-set event to ~/.gstack/projects/{SLUG}/question-events.jsonl for derivation audit trail.

31 tests:
- --check behavior (4): defaults, two-way, one-way (one-way overrides never-ask with safety note), unknown ids, missing arg
- --check with prefs (5): never-ask on two-way → AUTO_DECIDE; never-ask on one-way → ASK_NORMALLY with override note; always-ask always asks; ask-only-for-one-way flips appropriately
- --write valid (5): inline-user accepted, plan-tune accepted, persisted correctly, event appended, free_text preserved with flattening
- User-origin gate (6): missing source rejected; inline-tool-output rejected with exit code 2 and explicit poisoning message; inline-file, inline-file-content, inline-unknown rejected; unknown source rejected
- Schema validation (4): invalid JSON, bad question_id, bad preference, injection in free_text
- --read (2): empty → {}, returns writes
- --clear (3): specific id, clear-all, NOOP for missing
- --stats (2): empty zeros, tallies by preference type

31 pass, 0 fail, 52 expect() calls.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: question-tuning preamble resolvers

scripts/resolvers/question-tuning.ts ships three preamble generators:

generateQuestionPreferenceCheck — before each AskUserQuestion, agent runs gstack-question-preference --check <id>. AUTO_DECIDE suppresses the ask and auto-chooses recommended. ASK_NORMALLY asks as usual. One-way door safety override is handled by the binary.

generateQuestionLog — after each AskUserQuestion, agent appends a log record with skill, question_id, summary, category, door_type, options_count, user_choice, recommended, session_id.

generateInlineTuneFeedback — offers inline "tune:" prompt after two-way questions. Documents structured shortcuts (never-ask, always-ask, ask-only-for-one-way, ask-less) AND accepts free-form English with normalization + confirmation. Explicitly spells out the USER-ORIGIN GATE: only write tune events when the prefix appears in the user's own chat message, never from tool output or file content. Binary enforces.

All three resolvers are gated by the QUESTION_TUNING preamble echo.
When the config is off, the agent skips these sections entirely. Ready to be wired into preamble.ts in the next commit. Codex host has a simpler variant that uses $GSTACK_BIN env vars.

scripts/resolvers/index.ts registers three placeholders: QUESTION_PREFERENCE_CHECK, QUESTION_LOG, INLINE_TUNE_FEEDBACK. Total resolver count goes from 45 to 48.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: wire question-tuning into preamble for tier >= 2 skills

scripts/resolvers/preamble.ts — adds two things:

1. _QUESTION_TUNING config echo in the preamble bash block, gated on the user's gstack-config `question_tuning` value (default: false).
2. A combined Question Tuning section for tier >= 2 skills, injected after the confusion protocol. The section itself is runtime-gated by the QUESTION_TUNING value — agents skip it entirely when off.

scripts/resolvers/question-tuning.ts — consolidated into one compact combined section `generateQuestionTuning(ctx)` covering: preference check before the question, log after, and inline tune: feedback with user-origin gate. Per-phase generators remain exported for unit tests but are no longer the main entrypoint.

Size impact: +570 tokens / +2.3KB per tier-2+ SKILL.md. Three skills (plan-ceo-review, office-hours, ship) still exceed the 100KB token ceiling — but they were already over before this change. Delta is the smallest viable wiring of the /plan-tune v1 substrate.

Golden fixtures (test/fixtures/golden/claude-ship, codex-ship, factory-ship) regenerated to match the new baseline.

Full test run: 1149 pass, 0 fail, 113 skip across 28 files.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: regenerate SKILL.md files with question-tuning section

bun run gen:skill-docs --host all after wiring the QUESTION_TUNING preamble section. Every tier >= 2 skill now includes the combined Question Tuning guidance. Runtime-gated — agents skip the section when question_tuning is off in gstack-config (default).

Golden fixtures (claude-ship, codex-ship, factory-ship) updated to the new baseline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: /plan-tune skill — conversational inspection + preferences

plan-tune/SKILL.md.tmpl: the user-facing skill for /plan-tune v1. Routes plain-English intent to one of 8 flows:

- Enable + setup (first-time): 5 declaration questions mapping to the 5 psychographic dimensions (scope_appetite, risk_tolerance, detail_preference, autonomy, architecture_care). Writes to developer-profile.json declared.*.
- Inspect profile: plain-English rendering of declared + inferred + gap. Uses word bands (low/balanced/high) not raw floats. Shows vibe archetype when calibration gate is met.
- Review question log: top-20 question frequencies with follow/override counts. Highlights override-heavy questions as candidates for never-ask.
- Set a preference: normalizes "stop asking me about X" → never-ask, etc. Confirms ambiguous phrasings before writing via gstack-question-preference.
- Edit declared profile: interprets free-form ("more boil-the-ocean") and CONFIRMS before mutating declared.* (trust boundary per Codex #15).
- Show gap: declared vs inferred diff with plain-English severity bands (close / drift / mismatch). Never auto-updates declared from the gap.
- Stats: preference counts + diversity/calibration status.
- Enable / disable: gstack-config set question_tuning true|false.

Design constraints enforced:

- Plain English everywhere. No CLI subcommand syntax required.
Shortcuts (`profile`, `vibe`, `stats`, `setup`) exist but are optional.
- user-origin gate on tune: writes. source: "plan-tune" for user-invoked /plan-tune; source: "inline-user" for inline tune: from other skills.
- One-way doors override never-ask (safety, surfaced to user).
- No behavior adaptation in v1 — this skill inspects and configures only.

Generates plan-tune/SKILL.md at ~11.6k tokens, well under the 100KB ceiling. Generated for all hosts via `bun run gen:skill-docs --host all`.

Full free test suite: 1149 pass, 0 fail, 113 skip across 28 files.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: end-to-end pipeline + preamble injection coverage

Added 6 tests to test/plan-tune.test.ts:

Preamble injection (3 tests):
- tier 2+ includes Question Tuning section with preference check, log, and user-origin gate language ('profile-poisoning defense', 'inline-user')
- tier 1 does NOT include the prose section (QUESTION_TUNING bash echo still fires since it's in the bash block all tiers share)
- codex host swaps binDir references to $GSTACK_BIN

End-to-end pipeline (3 tests) — real binaries working together, not mocks:
- Log 5 expand choices → --derive → profile shows scope_appetite > 0.5 (full log → registry lookup → signal map → normalization round-trip)
- --write source: inline-tool-output rejected; --read confirms no pref was persisted (the profile-poisoning defense actually works end-to-end)
- Migrate a 3-session legacy file; confirm legacy gstack-builder-profile shim still returns SESSION_COUNT: 3, TIER: welcome_back, CROSS_PROJECT: true

test/plan-tune.test.ts now has 47 tests total.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: E2E test for /plan-tune plain-English inspection flow (gate tier)

test/skill-e2e-plan-tune.test.ts — verifies /plan-tune correctly routes plain-English intent ("review the questions I've been asked") to the Review question log section without requiring CLI subcommand syntax.

Seeds a synthetic question-log.jsonl with 3 entries exercising:
- override behavior (user chose expand over recommended selective)
- one-way door respect (user followed ship-test-failure-triage recommendation)
- two-way override (user skipped recommended changelog polish)

Invokes the skill via `claude -p` and asserts:
- Agent surfaces >= 2 of 3 logged question_ids in output
- Agent notices override/skip behavior from the log
- Exit reason is success or error_max_turns (not agent-crash)

Gate-tier because the core v1 DX promise is plain-English intent routing. If it requires memorized subcommands or breaks on natural language, that's a regression of the defining feature.

Registered in test/helpers/touchfiles.ts with dependencies:
- plan-tune/** (skill template + generated md)
- scripts/question-registry.ts (required for log lookup)
- scripts/psychographic-signals.ts, scripts/one-way-doors.ts (derive path)
- bin/gstack-question-log, gstack-question-preference, gstack-developer-profile

Skipped when EVALS_ENABLED is not set; runs on `bun run test:evals`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v0.19.0.0) — /plan-tune v1

Ships /plan-tune as observational substrate: typed question registry, dual-track developer profile (declared + inferred), explicit per-question preferences with user-origin gate, inline tune: feedback across every tier >= 2 skill, unified developer-profile.json with migration from builder-profile.jsonl.
 
Scope rolled back from initial CEO EXPANSION plan after outside-voice review (Codex). 6 deferrals tracked as P0 TODOs with explicit acceptance criteria: E1 substrate wiring, E3 narrative/vibe, E4 blind-spot coach, E5 LANDED celebration, E6 auto-adjustment, E7 psychographic auto-decide. See docs/designs/PLAN_TUNING_V0.md for the full design record.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(ci): harden Dockerfile.ci against transient Ubuntu mirror failures

The CI image build failed with:
E: Failed to fetch http://archive.ubuntu.com/ubuntu/pool/main/... Connection failed [IP: 91.189.92.22 80]
ERROR: process "/bin/sh -c apt-get update && apt-get install ..." did not complete successfully: exit code: 100

archive.ubuntu.com periodically returns "connection refused" on individual regional mirrors. Without retry logic a single failed fetch nukes the whole Docker build.

Three defenses, layered:

1. /etc/apt/apt.conf.d/80-retries — apt fetches each package up to 5 times with a 30s timeout. Handles per-package flakes.
2. Shell-loop retry around the whole apt-get step (x3, 10s sleep) — handles the case where apt-get update itself can't reach any mirror.
3. --retry 5 --retry-delay 5 --retry-connrefused on all curl fetches (bun install script, GitHub CLI keyring, NodeSource setup script).

Applied to every apt-get and curl call in the Dockerfile. No behavior change on happy path — only kicks in when mirrors blip.

Fixes the build-image job that was blocking CI on the /plan-tune PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: add PLAN_TUNING_V1 + PACING_UPDATES_V0 design docs

Captures the V1 design (ELI10 writing + LOC reframe) in docs/designs/PLAN_TUNING_V1.md and the extracted V1.1 pacing-overhaul plan in docs/designs/PACING_UPDATES_V0.md.

V1 scope was reduced from the original bundled pacing + writing-style plan after three engineering-review passes revealed structural gaps in the pacing workstream that couldn't be closed via plan-text editing. TODOS.md P0 entry links to V1.1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: curated jargon list for V1 writing-style glossing

Repo-owned list of ~50 high-frequency technical terms (idempotent, race condition, N+1, backpressure, etc.) that gstack glosses on first use in tier-≥2 skill output. Baked into generated SKILL.md prose at gen-skill-docs time. Terms not on this list are assumed plain-English enough. Contributions via PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(preamble): V1 Writing Style section + EXPLAIN_LEVEL echo + migration prompt

Adds a new Writing Style section to tier-≥2 preamble output composing with the existing AskUserQuestion Format section. Six rules: jargon glossed on first use per skill invocation (from scripts/jargon-list.json), outcome-framed questions, short sentences, decisions close with user impact, gloss-on-first-use even if user pasted term, user-turn override for "be terse" requests. Baked conditionally (skip if EXPLAIN_LEVEL: terse).

Adds EXPLAIN_LEVEL preamble echo using ${binDir} (host-portable matching V0 QUESTION_TUNING pattern).

Adds WRITING_STYLE_PENDING echo reading a flag file written by the V0→V1 upgrade migration; on first post-upgrade skill run, the agent fires a one-time AskUserQuestion offering terse mode.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(gstack-config): validate explain_level + document in header

Adds explain_level: default|terse to the annotated config header with a one-line description. Whitelists valid values; on set of an unknown value, prints a specific warning ("explain_level '$VALUE' not recognized. Valid values: default, terse. Using default.") and writes the default value. Matches V1 preamble's EXPLAIN_LEVEL echo expectation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: V1 upgrade migration — writing-style opt-out prompt

New migration script following existing v0.15.2.0.sh / v0.16.2.0.sh pattern. Writes a .writing-style-prompt-pending flag file on first run post-upgrade. The preamble's migration-prompt block reads the flag and fires a one-time AskUserQuestion offering the user a choice between the new default writing style and restoring V0 prose via `gstack-config set explain_level terse`.

Idempotent via flag files; if the user has already set explain_level explicitly, counts as answered and skips.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: LOC reframe tooling — throughput comparison + README updater + scc installer

Three new scripts:

- scripts/garry-output-comparison.ts — enumerates Garry-authored commits in 2013 + 2026 on public repos, extracts ADDED lines from git diff, classifies as logical SLOC via scc --stdin (regex fallback if scc missing). Writes docs/throughput-2013-vs-2026.json with per-language breakdown + explicit caveats (public repos only, commit-style drift, private-work exclusion).
- scripts/update-readme-throughput.ts — reads the JSON if present, replaces the README's <!-- GSTACK-THROUGHPUT-PLACEHOLDER --> anchor with the computed multiple (preserving the anchor for future runs). If JSON missing, writes GSTACK-THROUGHPUT-PENDING marker that CI rejects — forcing the build to run before commit.
- scripts/setup-scc.sh — standalone OS-detecting installer for scc. Not a package.json dependency (95% of users never run throughput). Brew on macOS, apt on Linux, GitHub releases link on Windows.

Two-string anchor pattern (PLACEHOLDER vs PENDING) prevents the pipeline from destroying its own update path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(retro): surface logical SLOC + weighted commits above raw LOC

V1 reorders the /retro summary table to lead with features shipped, then commits + weighted commits (commits × files-touched capped at 20), then PRs merged, then logical SLOC added as the primary code-volume metric. Raw LOC stays present but is demoted to context. Rationale inline in the template: ten lines of a good fix is not less shipping than ten thousand lines of scaffold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(v1): README hero reframe + writing-style + CHANGELOG + version bump to 1.0.0.0

README.md:
- Hero removes "600,000+ lines of production code" framing; replaces with the computed 2013-vs-2026 pro-rata multiple (via <!-- GSTACK-THROUGHPUT-PLACEHOLDER --> anchor, filled by the update-readme-throughput build step).
- Hiring callout: "ship real products at AI-coding speed" instead of "10K+ LOC/day."
- New Writing Style section (~80 words) between Quick start and Install: "v1 prompts = simpler" framing, outcome-language example, terse-mode opt-out, pointer to /plan-tune.
 
CLAUDE.md: one-paragraph Writing style (V1) note under project conventions, linking to preamble resolver + V1 design docs.

CHANGELOG.md: V1 entry on top of v0.19.0.0 with user-facing narrative (what changes, how to opt out, for-contributors notes). Mentions scope reduction — pacing overhaul ships in V1.1.

CONTRIBUTING.md: one-paragraph note on jargon-list.json maintenance (PR to add/remove terms; regenerate via gen:skill-docs).

VERSION + package.json: bump to 1.0.0.0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: regenerate SKILL.md files + golden fixtures for V1

Mechanical regeneration from the updated templates in prior commits:
- Writing Style section now appears in tier-≥2 skill output.
- EXPLAIN_LEVEL + WRITING_STYLE_PENDING echoes in preamble bash.
- V1 migration-prompt block fires conditionally on first upgrade.
- Jargon list inlined into preamble prose at gen time.
- Retro template's logical SLOC + weighted commits order applied.

Regenerated for all 8 hosts via bun run gen:skill-docs --host all. Golden ship-skill fixtures refreshed from regenerated outputs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: V1 gate coverage — writing-style resolver + config + jargon + migration + dormancy

Six new gate-tier test files:

- test/writing-style-resolver.test.ts — asserts Writing Style section is injected into tier-≥2 preamble, all 6 rules present, jargon list inlined, terse-mode gate condition present, Codex output uses $GSTACK_BIN (not ~/.claude/), tier-1 does NOT get the section, migration-prompt block present.
- test/explain-level-config.test.ts — gstack-config set/get round-trip for default + terse, unknown-value warns + defaults to default, header documents the key, round-trip across set→set→get.
- test/jargon-list.test.ts — shape + ~50 terms + no duplicates (case-insensitive) + includes canonical high-signal terms.
- test/v0-dormancy.test.ts — 5D dimension names + archetype names forbidden in default-mode tier-≥2 SKILL.md output, except for plan-tune and office-hours where they're load-bearing.
- test/readme-throughput.test.ts — script replaces anchor with number on happy path, writes PENDING marker when JSON missing, CI gate asserts committed README contains no PENDING string.
- test/upgrade-migration-v1.test.ts — fresh run writes pending flag, idempotent after user-answered, pre-existing explain_level counts as answered.

All 95 V1 test-expect() calls pass. Full suite: 0 failures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: compute real 2013-vs-2026 throughput multiple (130.2×)

Ran scripts/garry-output-comparison.ts across all 15 public garrytan/* repos. Aggregated results into docs/throughput-2013-vs-2026.json and ran scripts/update-readme-throughput.ts to replace the README placeholder.

2013 public activity: 2 commits, 2,384 logical lines added across 1 week, in 1 repo (zurb-foundation-wysihtml5 upstream contribution).

2026 public activity: 279 commits, 310,484 logical lines added across 17 active weeks, in 3 repos (gbrain, gstack, resend_robot).

Multiples (public repos only, apples-to-apples):
- Logical SLOC: 130.2×
- Commits per active week: 8.2×
- Raw lines added: 134.4×

Private work at both eras (2013 Bookface at YC, Posterous-era code, 2026 internal tools) is excluded from this comparison.
 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: 207× throughput multiple (with private repos + Bookface)

Re-ran scripts/garry-output-comparison.ts across all 41 repos under garrytan/* (15 public + 26 private), including Bookface (YC's internal social network, 2013-era work).

2013 activity: 71 commits, 5,143 logical lines, 4 active repos (bookface, delicounter, tandong, zurb-foundation-wysihtml5)

2026 activity: 350 commits, 1,064,818 logical lines, 15 active repos (gbrain, gstack, gbrowser, tax-app, kumo, tenjin, autoemail, kitsune, easy-chromium-compiles, conductor-playground, garryslist-agent, baku, gstack-website, resend_robot, garryslist-brain)

Multiples:
- Logical SLOC: 207× (up from the public-only 130.2× once private work is included)
- Raw lines: 223×
- Commits/active-week: 3.4×

Stopped committing docs/throughput-2013-vs-2026.json — analysis is a local artifact, not repo state. Added docs/throughput-*.json to .gitignore. Full markdown analysis at ~/throughput-analysis-2026-04-18.md (local-only).

README multiple is now hardcoded; re-run the script and edit manually when you want to refresh it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: run rate vs year-to-date throughput comparison

Two separate numbers in the README hero:
- Run rate: ~700× (9,859 logical lines/day in 2026 vs 14/day in 2013)
- Year-to-date: 207× (2026 through April 18 already exceeds 2013 full year by 207×)

Previous "207× pro-rata" framing mixed full-year 2013 vs partial-year 2026. Run rate is the apples-to-apples normalization; YTD is the "already produced" total. Both are honest; both are compelling; they measure different things.

Analysis at ~/throughput-analysis-2026-04-18.md (local-only).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(throughput): script natively computes to-date + run-rate multiples

Enhanced scripts/garry-output-comparison.ts so both calculations come out of a single run instead of being reassembled ad-hoc in bash.

PerYearResult now includes:
- days_elapsed — 365 for past years, day-of-year for current
- is_partial — flags the current (in-progress) year
- per_day_rate — logical/raw/commits normalized by calendar day
- annualized_projection — per_day_rate × 365

Output JSON's `multiples` now has two sibling blocks:
- multiples.to_date — raw volume ratios (2026-YTD / 2013-full-year)
- multiples.run_rate — per-day pace ratios (apples-to-apples)

Back-compat: multiples.logical_lines_added still aliases to_date for older consumers reading the JSON.

Updated README hero to cite both (picking up brain/* repo that was missed in the earlier aggregation pass):
- 2026 run rate: ~880× my 2013 pace (12,382 vs 14 logical lines/day)
- 2026 YTD: 260× the entire 2013 year

Stderr summary now prints both multiples at the end of each run. Full analysis at ~/throughput-analysis-2026-04-18.md (local-only).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: ON_THE_LOC_CONTROVERSY methodology post + README link

Long-form response to the "LOC is a meaningless vanity metric" critique.
Covers:
- The three branches of the LOC critique and which are right
- Why logical SLOC (NCLOC) beats raw LOC as the honest measurement
- Full method: author-scoped git diff, regex-classified added lines, aggregated across 41 public + private garrytan/* repos
- Both calculations: to-date (260x) and run-rate (879x)
- Steelman of the critics (greenfield-vs-maintenance, survivorship bias, quality-adjusted productivity, time-to-first-user)
- Reproduction instructions

Linked from README hero via a blockquote directly below the number.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* exclude: tax-app from throughput analysis (import-dominated history)

tax-app's history is one commit of 104K logical lines — an initial import of a codebase, not authored work. Removing it to keep the comparison honest.

Changes:
- scripts/garry-output-comparison.ts: added EXCLUDED_REPOS constant with tax-app + a one-line rationale. The script now skips excluded repos with a stderr note and deletes any stale output JSON so aggregation loops don't pick up pre-exclusion numbers.
- README hero: updated to 810× run rate + 240× YTD (were 880×/260×). Wording updated to "40 public + private repos ... after excluding repos dominated by imported code."
- docs/ON_THE_LOC_CONTROVERSY.md: updated all numbers, added an "Exclusions" paragraph explaining tax-app, removed tax-app from the "shipped not WIP" example list.

New numbers (2026 through day 108, without tax-app):
- To-date: 240× logical SLOC (1,233,062 vs 5,143)
- Run rate: 810× per-day pace (11,417 vs 14 logical/day)
- Annualized: ~4.2M logical lines projected

Future re-runs automatically skip tax-app. Add more exclusions to EXCLUDED_REPOS at the top of the script with a one-line rationale.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: correct tax-app exclusion rationale

tax-app is a demo app I built for an upcoming YC channel video, not an "import-dominated history" as the previous commit claimed. Excluded because it's not production shipping work, not because of an import commit.

Updated rationale in scripts/garry-output-comparison.ts's EXCLUDED_REPOS constant, in docs/ON_THE_LOC_CONTROVERSY.md's method section + conclusion, and in the README hero wording ("one demo repo" vs the earlier "repos dominated by imported code").

Numbers unchanged — the exclusion itself is the same, just the reason.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: harden ON_THE_LOC_CONTROVERSY against Cramer + neckbeard critiques

Reframes the thesis as "engineers can fly now" (amplification, not replacement) and fortifies the soft spots critics will attack.

Added:
- Flight-thesis opener: pilot vs walker, leverage not replacement.
- Second deflation layer for AI verbosity (on top of NCLOC). Headline moves from 810x to 408x after generous 2x AI-boilerplate cut, with explicit sensitivity analysis showing the number is still large under pessimistic priors (5x → 162x, 10x → 81x, 100x impossible).
- Weekly distribution check (kills "you had one burst week" attack).
- Revert rate (2.0%) and post-merge fix rate (6.3%) with OSS comparables (K8s/Rails/Django band). Addresses "where are your error rates" directly.
- Named production adoption signals (gstack 1000+ installs, gbrain beta, resend_robot paying API) with explicit concession that "shipped != used at scale" for most of the corpus.
- Harder steelman: 5 specific concessions with quantified pivot points (e.g., "if 2013 baseline was 3.5x higher, 810x → 228x, still high").
 
Removed factual error: Posterous acquisition paragraph (Garry had already left Posterous by 2011, so the "Twitter bought our private repos" excuse for the 2013 corpus gap doesn't apply).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: update gstack/gbrain adoption numbers in LOC controversy post

gstack: "1,000+ distinct project installations" → "tens of thousands of daily active users" (telemetry-reported, community tier, opt-in).

gbrain: "small set of beta testers" → "hundreds of beta testers running it live."

Both are the accurate current numbers. The concession paragraph below (about shipped != adopted at scale for the long-tail repos) still reads correctly since it's about the corpus as a whole, not gstack/gbrain specifically.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: reframe reproducibility note as OSS breakout flex

"You'd need access to my private repos" → "Bookface and Posthaven are private, but gstack and gbrain are open-sourced with tens of thousands of GitHub stars and tens of thousands of confirmed regular users, among the most-used OSS projects in the world that didn't exist three months ago."

Keeps the `gh repo list` command at the end for the actual reproducibility instruction.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Rewrite LOC controversy post

- Lead with concession (LOC is garbage, do the math anyway)
- Preempt 14 lines/day meme with historical baselines (Brooks, Jones, McConnell)
- Remove 'neckbeard' language throughout
- Add slop-scan story (Ben Vinegar, 5.24 → 1.96, 62% cut)
- David Cramer GUnit joke
- Add testing philosophy section (the real unlock)
- ASCII weekly distribution chart
- gstack telemetry section with real numbers (15K installs, 305K invocations, 95.2% success)
- Top skills usage chart
- Pick-your-priors paragraph moved earlier (the killer)
- Sharper close: run the script, show me your numbers

* docs: four precision fixes on LOC controversy post

1. Citation fix. Kernighan didn't say anything about LOC-as-metric (that's the famous "aircraft building by weight" quote, commonly misattributed but actually Bill Gates). Replaced "Kernighan implied it before that" with the real Dijkstra quote ("lines produced" vs "lines spent" from EWD1036, with direct link) + the Gates quote. Verified via web search.
2. Slop-scan direction clarified. "(highest on his benchmark)" was ambiguous — could read as a brag. Now: "Higher score = more slop. He ran it on gstack and we scored 5.24, the worst he'd measured at the time." Then the 62% cut lands as an actual win.
3. Prose/chart skill-usage ordering now matches. Added /plan-eng-review (28,014) to the prose list so it doesn't conflict with the chart below it.
4. Cut the "David — I owe you one / GUnit" insider joke. Most readers won't connect Cramer → Sentry → GUnit naming. Ends the slop-scan paragraph on the stronger line: "Run `bun test` and watch 2,000+ tests pass."

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: tighten four LOC post citations to match primary sources

1. Bill Gates quote: flagged as folklore-grade. Was "Bill Gates put it more memorably" (firm attribution). Now "The old line (widely attributed to Bill Gates, sourcing murky) puts it more memorably." The quote stands; honesty about attribution avoids the same misattribution trap we just fixed for Kernighan.
2. Capers Jones: "15-50 across thousands of projects" → "roughly 16-38 LOC/day across thousands of projects" — matches his actual published measurements (which also report as 325-750 LOC/month).
3. Steve McConnell: "10-50 for finished, tested, delivered code" was folklore. Replaced with his actual project-size-dependent range from Code Complete: "20-125 LOC/day for small projects (10K LOC) down to 1.5-25 for large projects (10M LOC) — it's size-dependent, not a single number."
4. Revert rate comparison: "Kubernetes, Rails, and Django historically run 1.5-3%" was unsourced. Replaced with "mature OSS codebases typically run 1-3%" + "run the same command on whatever you consider the bar and compare." No false specificity about which repos.

Net: every quantitative citation in the post now matches primary-source figures or is explicitly flagged as folklore. Neckbeards can verify.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: drop Writing style section from README

Was sitting in prime real estate between Quick start and Install — internal implementation detail, not something users need up-front.

Existing coverage is enough:
- Upgrade migration prompt notifies users on first post-upgrade run
- CLAUDE.md has the contributor note
- docs/designs/PLAN_TUNING_V1.md has the full design

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: collapse team-mode setup into one paste-and-go command

Step 2 was three separate code blocks: setup --team, then team-init, then git add/commit. Mirrors Step 1's style now — one shell one-liner that does all three. Subshell (cd && ./setup --team) keeps the user in their repo pwd so team-init + git commit land in the right place. "Swap required for optional" moved to a one-liner below.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: move full-clone footnote from README to CONTRIBUTING

The "Contributing or need full history?" note is for contributors, not for someone following the README install flow. Moved into CONTRIBUTING's Quick start section where it fits next to the existing clone command, with a tip to upgrade an existing shallow clone via `git fetch --unshallow`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: root <root@localhost>
# On the LOC controversy

Or: what happened when I mentioned how many lines of code I've been shipping, and what the numbers actually say.

## The critique is right. And it doesn't matter.

LOC is a garbage metric. Every senior engineer knows it. Dijkstra wrote in 1988 that lines of code shouldn't be counted as "lines produced" but as "lines spent" ([*On the cruelty of really teaching computing science*, EWD1036](https://www.cs.utexas.edu/~EWD/transcriptions/EWD10xx/EWD1036.html)). The old line (widely attributed to Bill Gates, sourcing murky) puts it more memorably: measuring programming progress by LOC is like measuring aircraft building progress by weight. If you measure programmer productivity in lines of code, you're measuring the wrong thing. This has been true for 40 years and it's still true.

I posted that in the last 60 days I'd shipped 600,000 lines of production code. The replies came in fast:

- "That's just AI slop."
- "LOC is a meaningless metric. Every senior engineer in the last 40 years said so."
- "Of course you produced 600K lines. You had an AI writing boilerplate."
- "More lines is bad, not good."
- "You're confusing volume with productivity. Classic PM brain."
- "Where are your error rates? Your DAUs? Your revert counts?"
- "This is embarrassing."

Some of those are right. Here's what happens when you take the smart version of the critique seriously and do the math anyway.

## Three branches of the AI coding critique

They get collapsed into one, but they're different arguments.

**Branch 1: LOC doesn't measure quality.** True. Always has been. A 50-line well-factored library beats a 5,000-line bloated one. This was true before AI and it's true now. It was never a killer argument. It was a reminder to think about what you're measuring.

**Branch 2: AI inflates LOC.** True. LLMs generate verbose code by default. More boilerplate. More defensive checks. More comments. More tests. Raw line counts go up even when "real work done" didn't.

**Branch 3: Therefore bragging about LOC is embarrassing.** This is where the argument jumps the track.

Branch 2 is the interesting one. If raw LOC is inflated by some factor, the honest thing is to compute the deflation and report the deflated number. That's what this post does.

## The math

### Raw numbers

I wrote a script ([`scripts/garry-output-comparison.ts`](../scripts/garry-output-comparison.ts)) that enumerates every commit I authored across all 41 repos owned by `garrytan/*` on GitHub — 15 public, 26 private — in 2013 and 2026. For each commit, it counts logical lines added (non-blank, non-comment). The 2013 corpus includes Bookface, the YC-internal social network I built that year.
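 
If you want the mechanism and not just the link, here is a minimal sketch of the counting step. It assumes the regex fallback path (the shipped script prefers scc when it's installed), and none of the helper names below are the script's actual internals:

```ts
// Sketch of the counting step (the regex fallback path; the real script
// shells out to scc when it's installed). Names here are illustrative,
// not the script's internals.
function isLogicalLine(line: string): boolean {
  const trimmed = line.trim();
  if (trimmed.length === 0) return false;            // blank line
  if (/^(\/\/|#|--|;)/.test(trimmed)) return false;  // line comment
  if (/^(\/\*|\*)/.test(trimmed)) return false;      // block-comment body
  return true;
}

// Count added logical lines in one commit's unified `git show` diff.
function countAddedLogicalLines(diff: string): number {
  return diff
    .split("\n")
    .filter((l) => l.startsWith("+") && !l.startsWith("+++")) // additions only
    .map((l) => l.slice(1))
    .filter(isLogicalLine).length;
}
```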
 
One repo excluded from 2026: `tax-app` (demo for a YC video, not production work). Baked into the script's `EXCLUDED_REPOS` constant. Run it yourself.
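 
Roughly what that constant looks like (the field layout here is my sketch; the entry and its rationale are the real ones):

```ts
// Sketch of the exclusion list. Layout is illustrative, not the script's
// exact shape; the repo name and rationale come from this post.
const EXCLUDED_REPOS: Record<string, string> = {
  "tax-app": "demo app for a YC channel video, not production shipping work",
};

const isExcluded = (repo: string): boolean => repo in EXCLUDED_REPOS;
```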
 
2013 was a full year. 2026 is day 108 as of this writing (April 18).

| Metric           | 2013 (full year) | 2026 (108 days) | Multiple |
|------------------|-----------------:|----------------:|---------:|
| Logical SLOC     |            5,143 |       1,233,062 |     240x |
| Logical SLOC/day |               14 |          11,417 |     810x |
| Commits          |               71 |             351 |     4.9x |
| Files touched    |              290 |          13,629 |      47x |
| Active repos     |                4 |              15 |    3.75x |
### "14 lines per day? That's pathetic."
|
||||
|
||||
It was. That's the point.
|
||||
|
||||
In 2013 I was a YC partner, then a cofounder at Posterous shipping code nights and weekends. 14 logical lines per day was my actual part-time output while holding down a real job. Historical research puts professional full-time programmer output in a wide band depending on project size and study: Fred Brooks cited ~10 lines/day for systems programming in *The Mythical Man-Month* (OS/360 observations), Capers Jones measured roughly 16-38 LOC/day across thousands of projects, and Steve McConnell's *Code Complete* reports 20-125 LOC/day for small projects (10K LOC) down to 1.5-25 for large projects (10M LOC) — it's size-dependent, not a single number.
|
||||
|
||||
My 2013 baseline isn't cherry-picked. It's normal for a part-time coder with a day job. If you think the right baseline is 50 (3.5x higher), the 2026 multiple drops from 810x to 228x. Still high.
|
||||
|
||||
### Two deflations
|
||||
|
||||
The standard response to "raw LOC is garbage" is **logical SLOC** (source lines of code, non-comment non-blank). Tools like `cloc` and `scc` have computed this for 20 years. Same code, fluff stripped: no blank lines, no single-line comments, no comment block bodies, no trailing whitespace.
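 
If you've never run one of these tools, here's a toy example. Of the eight lines making up the function below, logical SLOC keeps five: the doc comment, the blank line, and the comment-only line are stripped, and everything that executes stays.

```ts
// Toy example: 8 raw lines below, 5 logical SLOC.
// Stripped: the doc comment, the blank line, the comment-only line.

/** Parse a port from the environment. */
function portFromEnv(fallback: number): number {

  const raw = process.env.PORT;
  // Default when unset or not a number.
  const n = Number(raw);
  return Number.isFinite(n) ? n : fallback;
}
```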
 
But logical SLOC doesn't eliminate AI inflation entirely. AI writes 2-3 defensive null checks where a senior engineer would write zero. AI inlines try/catch around things that don't throw. AI spells out `const result = foo(); return result` instead of `return foo()`.

So let's apply a **second deflation**. Assume AI-generated code is 2x more verbose than senior hand-crafted code at the logical level. That's aggressive — most measurements I've seen put the multiplier at 1.3-1.8x — but it's the upper bound a skeptic would demand.

- My 2026 per-day rate, NCLOC: **11,417**
- With 2x AI-verbosity deflation: **5,708** logical lines per day
- Multiple on daily pace with both deflations: **408x**

Now pick your priors:

- At 5x deflation (unfounded but let's go): **162x**
- At 10x (pathological): **81x**
- At 100x (impossible — that's one line per minute sustained): **8x**
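 
If you want to check the arithmetic or plug in your own prior, it's a three-line calculation. This reproduces the math above, not the analysis script; note it prints 405x for the 2x prior, because the 408x in the text rounds the 2013 baseline to a flat 14/day first:

```ts
// Same arithmetic as above. Rates come from the table; the only knob is
// the deflation prior you believe.
const RATE_2013 = 5_143 / 365;      // ≈ 14.09 logical lines/day
const RATE_2026 = 1_233_062 / 108;  // ≈ 11,417 logical lines/day

const multiple = (deflation: number): number =>
  Math.round(RATE_2026 / deflation / RATE_2013);

for (const prior of [1, 2, 5, 10, 100]) {
  console.log(`${prior}x deflation -> ${multiple(prior)}x`);
}
// 1x -> 810, 2x -> 405 (the text's 408x uses a flat 14/day baseline),
// 5x -> 162, 10x -> 81, 100x -> 8
```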
 
The argument about the size of the coefficient doesn't change the conclusion. The number is large regardless.

### Weekly distribution

"Your per-day number assumes uniform output. Show the distribution. If it's a single burst, your run-rate is bogus."

Fair.

```
Week 1-4  (Jan):  ████████░░░░░░░░░  ~8,800/day
Week 5-8  (Feb):  ████████████░░░░░ ~12,100/day
Week 9-12 (Mar):  ██████████░░░░░░░ ~10,900/day
Week 13-15 (Apr): █████████████░░░░ ~13,200/day
```

It's not a spike. The rate has been approximately consistent and slightly increasing. Run the script yourself.
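 
The bucketing behind that chart is nothing exotic. A sketch, assuming you've already extracted per-commit pairs of date and logical lines added (the input shape and the week-keying here are mine, not the script's exact logic):

```ts
// Sketch of the burst check: bucket per-commit logical-line counts by
// week-of-year and sum per bucket. Input shape is an assumption.
type CommitStat = { date: Date; logicalAdded: number };

function weeklyTotals(commits: CommitStat[]): Map<string, number> {
  const perWeek = new Map<string, number>();
  for (const c of commits) {
    // Approximate week number: days since Jan 1, divided by 7.
    const jan1 = new Date(c.date.getFullYear(), 0, 1);
    const week = Math.floor((c.date.getTime() - jan1.getTime()) / (7 * 86_400_000));
    const key = `${c.date.getFullYear()}-W${String(week + 1).padStart(2, "0")}`;
    perWeek.set(key, (perWeek.get(key) ?? 0) + c.logicalAdded);
  }
  return perWeek;
}
```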
 
## The quality question

This is the most legitimate critique, channeled through the [David Cramer](https://x.com/zeeg) voice: OK, you're pushing more lines. Where are your error rates? Your post-merge reverts? Your bug density? If you're typing at 10x speed but shipping 20x more bugs, you're not leveraged, you're making noise at scale.

Fair. Here's the data:

**Reverts.** `git log --grep="^revert" --grep="^Revert" -i` across the 15 active repos: 7 reverts in 351 commits = **2.0% revert rate**. For context, mature OSS codebases typically run 1-3%. Run the same command on whatever you consider the bar and compare.

**Post-merge fixes.** Commits matching `^fix:` that reference a prior commit on the same branch: 22 of 351 = **6.3%**. Healthy fix cycle. A zero-fix rate would mean I'm not catching my own mistakes.
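 
Both checks are a git one-liner each, so here's a version you can run on your own repo. One caveat: this counts every `^fix:` commit rather than only fixes referencing a prior commit on the same branch, so it's a blunter cut than the 6.3% definition above:

```ts
// Recomputes both rates for whatever repo you run it in. Blunter than
// the post's fix-rate definition (counts all ^fix: commits).
import { execSync } from "node:child_process";

const count = (args: string): number =>
  Number(execSync(`git rev-list --count HEAD ${args}`, { encoding: "utf8" }).trim());

const total = count("");
const reverts = count(`--grep="^revert" --grep="^Revert" -i`);
const fixes = count(`--grep="^fix:"`);

const pct = (n: number) => ((100 * n) / total).toFixed(1) + "%";
console.log(`reverts: ${reverts}/${total} = ${pct(reverts)}`);
console.log(`fixes:   ${fixes}/${total} = ${pct(fixes)}`);
```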
 
**Tests.** This is the thing that actually matters, and it's the thing that changed everything for me. Early in 2026, I was shipping without tests and getting destroyed in bug land. Then I hit 30% test-to-code ratio, then 100% coverage on critical paths, and suddenly I could fly. Tests went from ~100 across all repos in January to **over 2,000 now**. They run in CI. They catch regressions. Every gstack PR has a coverage audit in the PR body.

The real insight: testing at multiple levels is what makes AI-assisted coding actually work. Unit tests, E2E tests, LLM-as-judge evals, smoke tests, slop scans. Without those layers, you're just generating confident garbage at high speed. With them, you have a verification loop that lets the AI iterate until the code is actually correct.

gstack's core real-code feature — the thing that isn't just markdown prompts — is a **Playwright-based CLI browser** I wrote specifically so I could stop manually black-box testing my stuff. `/qa` opens a real browser, navigates your staging URL, and runs automated checks. That's 2,000+ lines of real systems code (server, CDP inspector, snapshot engine, content security, cookie management) that exists because testing is the unlock, not the overhead.
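 
That product is obviously more than a pasteable snippet, but the core loop is the standard Playwright one. A minimal sketch of a scripted check against a staging URL (not gstack's actual /qa implementation; STAGING_URL is whatever you provide):

```ts
// Minimal scripted browser check. Not gstack's /qa; just the shape of
// the loop it automates. STAGING_URL is an assumption you supply.
import { chromium } from "playwright";

const url = process.env.STAGING_URL ?? "http://localhost:3000";

const browser = await chromium.launch();
const page = await browser.newPage();

// Load the page and fail fast on a non-2xx response.
const response = await page.goto(url, { waitUntil: "load" });
if (!response || !response.ok()) throw new Error(`bad status for ${url}`);

// One example check: the page rendered a visible heading.
await page.waitForSelector("h1", { timeout: 5_000 });
console.log(`ok: ${await page.title()}`);

await browser.close();
```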
 
**Slop scan.** A third party — [Ben Vinegar](https://x.com/bentlegen), founding engineer at Sentry — built a tool called [slop-scan](https://github.com/benvinegar/slop-scan) specifically to measure AI code patterns. Deterministic rules, calibrated against mature OSS baselines. Higher score = more slop. He ran it on gstack and we scored 5.24, the worst he'd measured at the time. I took the findings seriously, refactored, and cut the score by 62% in one session. Run `bun test` and watch 2,000+ tests pass.

**Review rigor.** Every gstack branch goes through CEO review, Codex outside-voice review, DX review, and eng review. Often 2-3 passes of each. The `/plan-tune` skill I just shipped had a scope ROLLBACK from the CEO expansion plan because Codex's outside-voice review surfaced 15+ findings my four Claude reviews missed. The review infrastructure catches the slop. It's visible in the repo. Anyone can read it.

## What I'll concede

I'm going to steelman harder than the critics steelmanned themselves:

**Greenfield vs maintenance.** 2026 numbers are dominated by new-project code. Mature-codebase maintenance produces fewer lines per day. If you're asking "can Garry 100x the team maintaining 10 million lines of legacy Java at a bank," my number doesn't prove that. Someone else will have to run their own script on a different context.

**The 2013 baseline has survivorship bias.** My 2013 public activity was low. This analysis includes Bookface (private, 22 active weeks) which was my biggest project that year, so the bias is smaller than it looks. It's not zero. If the true 2013 rate was 50/day instead of 14, the multiple at current pace is 228x instead of 810x. Still high.

**Quality-adjusted productivity isn't fully proven.** I don't have a clean bug-density comparison between 2013-me and 2026-me. What I can say: revert rate is in the normal band, fix rate is healthy, test coverage is real, and the adversarial review process caught 15+ issues on the most recent plan. That's evidence, not proof. A skeptic can discount it.

**"Shipped" means different things across eras.** Some 2013 products shipped and died. Some 2026 products may share that fate. If two years from now 80% of what I shipped this year is dead, the critique "you built a bunch of unused stuff" will have teeth. I accept that reality check.

**Time to first user is the metric that matters, not LOC.** The 60-day cycle from "I wish this existed" to "it exists and someone is using it" is the real shift. LOC is downstream evidence. The right metric is "shipped products per quarter" or "working features per week." Those went up by a similar multiple.

## What those lines became

gstack is not a hypothetical. It's a product with real users:

- **75,000+ GitHub stars** in 5 weeks
- **14,965 unique installations** (opt-in telemetry)
- **305,309 skill invocations** recorded since January 2026
- **~7,000 weekly active users** at peak
- **95.2% success rate** across all skill runs (290,624 successes / 305,309 total)
- **57,650 /qa runs**, **28,014 /plan-eng-review runs**, **24,817 /office-hours sessions**, **18,899 /ship workflows**
- **27,157 sessions used the browser** (real Playwright, not toy)
- Median session duration: **2 minutes**. Average: **6.4 minutes**.

Top skills by usage:

```
/qa               57,650  ████████████████████████████
/plan-eng-review  28,014  ██████████████
/office-hours     24,817  ████████████
/ship             18,899  █████████
/browse           13,675  ██████
/review           13,459  ██████
/plan-ceo-review  12,357  ██████
```

These aren't scaffolds sitting in a drawer. Thousands of developers run these skills every day.

## What this means

I am not saying engineers are going away. Nobody serious thinks that.

I am saying engineers can fly now. One engineer in 2026 has the output of a small team in 2013, working the same hours, at the same day job, with the same brain. The code-generation cost curve collapsed by two orders of magnitude.

The interesting part of the number isn't the volume. It's the rate. And the rate isn't a statement about me. It's a statement about the ground underneath all software engineering.

2013 me shipped about 14 logical lines per day. Normal for a part-time coder with a real job. 2026 me is shipping 11,417 logical lines per day. While still running YC full-time. Same day job. Same free time. Same person.

The delta isn't that I became a better programmer. If anything, my mental model of coding has atrophied. The delta is that AI let me actually ship the things I always wanted to build. Small tools. Personal products. Experiments that used to die in my notebook because the time cost to build them was too high. The gap between "I want this tool" and "this tool exists and I'm using it" collapsed from 3 weeks to 3 hours.

Here's the script: [`scripts/garry-output-comparison.ts`](../scripts/garry-output-comparison.ts). Run it on your own repos. Show me your numbers. The argument isn't about me — it's about whether the ground moved.

I'm betting it did for you too.
|
||||
@@ -0,0 +1,95 @@

# Pacing Updates v0 — Design Doc

**Status:** V1.1 plan (not yet implemented).
**Extracted from:** [PLAN_TUNING_V1.md](./PLAN_TUNING_V1.md) during implementation, when review rigor revealed the pacing workstream had structural gaps unfixable via plan-text editing.
**Authors:** Garry Tan (user), with AI-assisted reviews from Claude Opus 4.7 + OpenAI Codex gpt-5.4.
**Review plan:** CEO + Codex + DX + Eng cycle, same rigor as V1.

## Credit

This plan exists because of **[Louise de Sadeleer](https://x.com/LouiseDSadeleer/status/2045139351227478199)**. Her "yes yes yes" during architecture review wasn't only about jargon (V1 addresses that) — it was pacing and agency. Too many interruptive decisions over too long a review. V1.1 addresses the pacing half.

## Problem

Louise's fatigue reading gstack review output came from two sources:

1. **Jargon density** — technical terms appeared without explanation. *Addressed in V1 (ELI10 writing).*
2. **Interruption volume** — `/autoplan` ran 4 phases (CEO + Design + Eng + DX), each with 5–10 AskUserQuestion prompts. Total ≈ 30–50 prompts over ~45 minutes. Non-technical users check out at ~10–15 interruptions. **This is V1.1.**

Translation alone doesn't fix interruption volume. A translated interruption is still an interruption. The fix needs to change WHEN findings surface, not just HOW they're worded.

## Why it's extracted (structural gaps from V1's third eng review + Codex pass 2)

During V1 planning, a pacing workstream was drafted: rank findings, auto-accept two-way doors, max 3 AskUserQuestion prompts per review phase, a Silent Decisions block for auto-accepted items, and a "flip <id>" command to re-open auto-accepted decisions post-hoc. The third eng-review pass + second Codex pass surfaced 10 gaps that couldn't be closed with plan-text edits:

1. **Session-state model undefined.** Pacing needs per-phase state (which findings surfaced, which were auto-accepted, which the user can flip). V1 has per-skill-invocation state for glossing but no backing store for per-phase pacing memory.
2. **Phase identifier missing from question-log.** Silent Eng #8 wanted to warn when > 3 prompts fire within one phase. V0's `question-log.jsonl` has no `phase` field. V1 claimed "no schema change" — that contradicts the enforcement target.
3. **Question registry ≠ finding registry.** V0's `scripts/question-registry.ts` covers *questions* (registered at skill definition time). Review findings are *dynamic* (discovered at runtime). `door_type: one-way` enforcement via the registry doesn't cover ad-hoc findings. One-way-door safety isn't enforceable for findings the agent generates mid-review.
4. **Pacing as prose can't invert existing control flow.** V1 planned to add a "rank findings, then ask" rule to preamble prose. But existing skill templates like `plan-eng-review/SKILL.md.tmpl` have per-section STOP/AskUserQuestion sequences. A prose rule in the preamble can't reliably override a hardcoded per-section STOP. The behavioral change is sequencing, not prompt wording.
5. **Flip mechanism has no implementation.** "Reply `flip <id>` to change" was prose. No command parser, no state store, no replay behavior. If the conversation compacts and the Silent Decisions block leaves context, the original decision is lost.
6. **Migration prompt is itself an interrupt.** V1's post-upgrade migration prompt (offering to restore V0 prose) counts against the interruption budget V1.1 is trying to reduce. V1.1 must decide: exempt it from the budget, or count it as interrupt 1 of N?
7. **First-run preamble prompts count too.** Lake intro, telemetry, proactive, routing injection — Louise saw all of them on first run. They're interruptions before the first real skill runs. V1.1 must audit which of these are load-bearing for new users vs. deferrable until session N.
8. **Ranking formula not calibrated against real data.** V1 considered `product 0-8` (broken: `{0,1,2,4,8}` distribution), then `sum 0-6` with threshold ≥ 4. But neither was validated against the actual finding distribution. V1.1 should instrument the V0 question-log to measure what real findings look like, then calibrate.
9. **"Every one-way door surfaces" and "max 3 per phase" contradict.** One-way cap = uncapped (safety); two-way cap = 3. But the plan had both rules without explicit precedence. V1.1 must state: one-way doors surface uncapped regardless of phase budget.
10. **Undefined verification values.** The V1 plan had "Silent Decisions block ≥ N entries" with N never defined, and an `active: true` field in the throughput JSON never defined. V1.1 gets concrete values.

## Scope for V1.1

1. **Define session-state model.** Per-skill-invocation vs per-phase vs per-conversation. Backing store: likely a JSON file at `~/.gstack/sessions/<session_id>/pacing-state.json` that records which findings surfaced vs. auto-accepted per phase. Cleanup: same TTL as existing session tracking in preamble. (Candidate shapes are sketched after this list.)

2. **Add `phase` field to question-log.jsonl schema.** Classify each AskUserQuestion by which review phase it came from (CEO / Design / Eng / DX / other). Migration: existing entries default to `"unknown"`. Non-breaking schema extension.

3. **Extend registry coverage for dynamic findings.** Two options, pick during CEO review:
   - (a) Widen `scripts/question-registry.ts` to allow runtime registration (ad-hoc IDs still get logged + classified).
   - (b) Add a secondary runtime classifier `scripts/finding-classifier.ts` that maps finding text → risk tier using pattern matching.

4. **Move pacing from preamble prose into skill-template control flow.** Update each review skill template to: (i) internally complete the phase, (ii) rank findings with the `gstack-pacing-rank` binary, (iii) emit up to 3 AskUserQuestion prompts, (iv) emit a Silent Decisions block with the rest. Not a preamble rule — an explicit sequence in each template.

5. **Flip mechanism implementation.** New binary `bin/gstack-flip-decision`. Command parser accepts `flip <id>` from the user message. Looks up the original decision in pacing-state.json. Re-opens it as an explicit AskUserQuestion. The new choice persists.

6. **Migration-prompt budget decision.** Explicit rule: one-shot migration prompts are exempt from the per-phase interruption budget. Rationale: they fire before review phases start, not during.

7. **First-run preamble audit.** Audit lake intro, telemetry, proactive, routing injection. For each: is this load-bearing for a first-time user, or deferrable? Likely outcome: suppress all but the lake intro until session 2+. Offer the remaining ones via a `/plan-tune first-run` command that users can invoke voluntarily.

8. **Ranking threshold calibration.** Instrument V0's question-log (already running, has history). Measure the actual distribution of `severity × irreversibility × user-decision-matters` across recent CEO + Eng + DX + Design reviews. Pick the threshold based on real data. Target: ~20% of findings surface, ~80% auto-accept.

9. **Explicit rule: one-way doors uncapped.** Hard-coded in skill template prose: "one-way doors surface regardless of phase interruption budget." Two-way findings cap at 3 per phase.

10. **Concrete verification values.** Define `N` for Silent Decisions (e.g., ≥ 5 entries expected for a non-trivial plan), define the throughput JSON schema with concrete field names.
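
To make items 1, 2, and 5 concrete, here is a minimal TypeScript sketch of the state shapes and the flip lookup. Everything here is an assumption to be settled in review: the type names, the exact field set, and `findFlippable` itself are illustrative, not implemented code.

```typescript
// Sketch only: field names are assumptions pending V1.1 CEO/Eng review.
type Phase = "ceo" | "design" | "eng" | "dx" | "unknown";

// question-log.jsonl entry with the proposed `phase` field (item 2).
// Existing V0 entries default to "unknown" on read: a non-breaking extension.
interface QuestionLogEntry {
  ts: string;
  question_id: string;
  user_choice: string;
  recommended: string;
  session_id: string;
  phase?: Phase; // NEW in V1.1
}

// ~/.gstack/sessions/<session_id>/pacing-state.json (item 1).
interface AutoAcceptedDecision {
  id: string;              // the id a user can later `flip <id>`
  summary: string;
  accepted_option: string; // the recommended option that was auto-chosen
}

interface PacingState {
  session_id: string;
  phases: {
    [phase: string]: {
      surfaced: string[];                    // finding ids asked explicitly
      auto_accepted: AutoAcceptedDecision[]; // Silent Decisions entries
    };
  };
}

// `flip <id>` lookup (item 5): find the auto-accepted decision so
// bin/gstack-flip-decision can re-open it as an explicit AskUserQuestion.
function findFlippable(state: PacingState, id: string): AutoAcceptedDecision | undefined {
  for (const phase of Object.values(state.phases)) {
    const hit = phase.auto_accepted.find((d) => d.id === id);
    if (hit) return hit;
  }
  return undefined; // unknown id → the binary's error path
}
```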

## Acceptance criteria for V1.1

- **Interruption count:** Louise (or similar non-technical collaborator) reruns `/autoplan` end-to-end on a plan comparable to V0-baseline. AskUserQuestion count ≤ 50% of V0 baseline. (V1 captures this baseline transcript for V1.1 calibration.)
- **One-way-door coverage:** 100% of safety-critical decisions (`door_type: one-way` OR classifier-flagged dynamic findings) surface individually at full technical detail. Uncapped.
- **Flip round-trip:** User types `flip test-coverage-bookclub-form`. The original auto-accepted decision re-opens as an AskUserQuestion. User's new choice persists to the Silent Decisions block (or is removed if user flips to explicit surfacing).
- **Per-phase observability:** `/plan-tune` can display per-phase AskUserQuestion counts for any session, reading from question-log.jsonl's new `phase` field.
- **First-run reduction:** New users see ≤ 1 meta-prompt (lake intro) before their first real skill runs, vs. V1's 4 (lake + telemetry + proactive + routing).
- **Human rerun:** Louise + Garry independent qualitative reviews, same pattern as V1.

## Dependencies on V1

V1.1 builds on V1's infrastructure:

- `explain_level` config key + preamble echo pattern (A4).
- Jargon list + Writing Style section (V1.1's interruption language should respect ELI10 rules).
- V0 dormancy negative tests (V1.1 won't wake the 5D psychographic machinery either).
- V1's captured Louise transcript (baseline for acceptance criterion calibration).

V1.1 does NOT depend on any V2 items (E1 substrate wiring, narrative/vibe, etc.).

## Review plan

- **Pre-work:** capture real question-log distribution from current V0 data. Use as calibration input for Scope #8.
- **CEO review.** Premise challenge: is pacing the right fix, or should V1.1 consider removing phases entirely? (E.g., collapse CEO + Design + Eng + DX into a single unified review pass.) Scope mode: SELECTIVE EXPANSION likely (pacing is the core, related improvements are cherry-picks).
- **Codex review.** Independent pass on the V1.1 plan. Expect particular scrutiny on the control-flow change (Scope #4) since that's the area V1 struggled with.
- **DX review.** Focus on the flip mechanism's DX — is `flip <id>` discoverable, is the command syntax natural, is the error path clear?
- **Eng review ×N.** Expect multiple passes, same as V1.

## NOT touched in V1.1

V2 items remain deferred:

- Confusion-signal detection
- 5D psychographic-driven skill adaptation (V0 E1)
- /plan-tune narrative + /plan-tune vibe (V0 E3)
- Per-skill or per-topic explain levels
- Team profiles
- AST-based "delivered features" metric

@@ -0,0 +1,405 @@

# Plan Tuning v0 — Design Doc

**Status:** Approved for v1 implementation
**Branch:** garrytan/plan-tune-skill
**Authors:** Garry Tan (user), with AI-assisted reviews from Claude Opus 4.7 + OpenAI Codex gpt-5.4
**Date:** 2026-04-16

## What this document is

A canonical record of what `/plan-tune` v1 is, what it is NOT, what we considered, and why we made each call. Committed to the repo so future contributors (and future Garry) can trace reasoning without archeology. Supersedes the two `~/.gstack/projects/` artifacts (office-hours design doc + CEO plan), which are per-user local records.

## The feature, in one paragraph

gstack's 40+ skills fire AskUserQuestion constantly. Power users answer the same questions the same way repeatedly and have no way to tell gstack "stop asking me this." More fundamentally, gstack has no model of how each user prefers to steer their work — scope-appetite, risk-tolerance, detail-preference, autonomy, architecture-care — so every skill's defaults are middle-of-the-road for everyone. `/plan-tune` v1 builds the schema + observation layer: a typed question registry, per-question explicit preferences, inline "tune:" feedback, and a profile (declared + inferred dimensions) inspectable via plain English. It does not yet adapt skill behavior based on the profile. That comes in v2, after v1 proves the substrate works.

## Why we're building the smaller version

The feature started life as a full adaptive substrate: psychographic dimensions driving auto-decisions, blind-spot coaching, a LANDED celebration HTML page, all bundled. Four rounds of review (office-hours, CEO EXPANSION, DX POLISH, eng review) cleared it. Then an outside voice (Codex) delivered a 20-point critique. The critical findings, in priority order:

1. **"Substrate" was false.** The plan wired 5 skills to read the profile on preamble, but AskUserQuestion is a prompt convention, not middleware. Agents can silently skip the instructions. You cannot reliably build auto-decide on top of an unenforceable convention. Without a typed question registry that every AskUserQuestion routes through, the substrate claim is marketing.
2. **Internal logical contradictions.** E4 (blind-spot) + E6 (mismatch) + the ±0.2 clamp on declared dimensions do not compose. If user self-declaration is ground truth via the clamp, E6's mismatch detection is detecting noise. If behavior can correct the profile, the clamp suppresses the signal E6 needs.
3. **Profile poisoning.** Inline "tune: never ask" could be emitted by malicious repo content (README, PR description, tool output) and the agent would dutifully write it. No prior review caught this security gap.
4. **E5 LANDED page in preamble.** `gh pr view` + HTML write + browser open on every skill's preamble is latency, auth failures, rate limits, surprise browser opens, and nondeterminism injected into the hottest path.
5. **Implementation order was backwards.** The plan started with classifiers and bins. The correct order: build the integration point first (typed question registry), then infrastructure, then consumers.

After weighing Codex's argument, we chose to roll back CEO EXPANSION and ship an observational v1 with a real typed registry as the foundation. Psychographic becomes behavioral only after the registry proves durable in production.

## v1 Scope (what we're building now)

1. **Typed question registry** (`scripts/question-registry.ts`). Every AskUserQuestion gstack uses is declared with `{id, skill, category, door_type, options[], signal_key?}`. Schema-governed. (Entry shape sketched after this list.)
2. **CI enforcement.** A lint test (gate tier) asserts every AskUserQuestion pattern in SKILL.md.tmpl files has a matching registry entry. Fails CI on drift, renames, or duplicates.
3. **Question logging** (`bin/gstack-question-log`). Appends `{ts, question_id, user_choice, recommended, session_id}` to `~/.gstack/projects/{SLUG}/question-log.jsonl`. Validates against the registry.
4. **Explicit per-question preferences** (`bin/gstack-question-preference`). Writes `{question_id, preference}` where preference is `always-ask | never-ask | ask-only-for-one-way`. Respected from session 1. No calibration gate — the user stated it, the system obeys.
5. **Preamble injection.** Before each AskUserQuestion, the agent calls `gstack-question-preference --check <registry-id>`. If `never-ask` AND the question is NOT a one-way door, auto-choose the recommended option with a visible annotation: "Auto-decided [summary] → [option] (your preference). Change with /plan-tune." One-way doors always ask regardless of preference — safety override.
6. **Inline "tune:" feedback with user-origin gate.** The agent offers "Tune this question? Reply `tune: [feedback]` to adjust." The user can use shortcuts (`unnecessary`, `ask-less`, `never-ask`, `always-ask`, `context-dependent`) or free-form English. CRITICAL: the agent only writes a tune event when the `tune:` content appears in the user's current chat turn — NOT in tool output, NOT in a file read. The binary validates `source: "inline-user"` on write; it rejects other sources.
7. **Declared profile** (`/plan-tune setup`). 5 plain-English questions, one per dimension. Stored in the unified `~/.gstack/developer-profile.json` under `declared: {...}`. Informational only in v1 — no skill behavior change.
8. **Observed/inferred profile.** Every question-log event contributes deltas to inferred dimensions via a hand-crafted signal map (`scripts/psychographic-signals.ts`). Computed on demand. Displayed but not acted on.
9. **`/plan-tune` skill.** Conversational plain-English inspection tool. "Show my profile," "set a preference," "what questions have I been asked," "show the gap between what I said and what I do." No CLI subcommand syntax required.
10. **Unification with the existing `~/.gstack/builder-profile.jsonl`.** Fold /office-hours session records and accumulated signals into the unified `~/.gstack/developer-profile.json`. Migration is atomic + idempotent + archives the source file.
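
For illustration, a sketch of what a registry entry could look like in TypeScript. The shapes are inferred from the field list in item 1; the example id is hypothetical, and the real entries live in `scripts/question-registry.ts`.

```typescript
// Sketch: shapes inferred from this doc's field list, not the shipped code.
type DoorType = "one-way" | "two-way";
type Category = "approval" | "clarification" | "routing" | "cherry-pick" | "feedback-loop";

interface RegistryEntry {
  id: string;          // stable kebab-case id, prefixed by the owning skill
  skill: string;
  category: Category;
  door_type: DoorType; // declared here, never inferred from prose (constraint 1 below)
  options?: string[];  // stable option keys, when the question has fixed choices
  signal_key?: string; // psychographic attribution hook (display-only in v1)
}

// Hypothetical entry for a destructive merge confirmation.
const example: RegistryEntry = {
  id: "ship-merge-confirm",
  skill: "ship",
  category: "approval",
  door_type: "one-way", // always asked, regardless of preference
  options: ["merge", "hold"],
};
```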

## Deferred to v2 (not in this PR, but explicit acceptance criteria)

| Item | Why deferred | Acceptance criteria for v2 promotion |
|------|--------------|--------------------------------------|
| E1 Substrate wiring (5 skills read profile and adapt) | Requires v1 registry proving durable. Requires real observed data to calibrate signal deltas. Risk of psychographic drift. | v1 registry stable for 90+ days. Inferred dimensions show clear stability across 3+ skills. User dogfood validates that defaults informed by profile feel right. |
| E3 `/plan-tune narrative` + `/plan-tune vibe` | Event-anchored narrative needs a stable profile. Without v1 data, output will be generic slop. | Profile diversity check passes for 2+ weeks of real usage. Narrative test proves it quotes specific events, not clichés. |
| E4 Blind-spot coach | Logically conflicts with E1/E6 without explicit interaction-budget design. Needs global session budget, escalation rules, exclusion from mismatch detection. | Design spec for interaction budget + escalation. Dogfood confirms challenges feel like coaching, not nagging. |
| E5 LANDED celebration HTML page | Cannot live in preamble (Codex #9, #10). When promoted, moves to explicit command `/plan-tune show-landed` OR post-ship hook — not passive detection in the hot path. | Explicit command or hook design. /design-shotgun → /design-html for the visual direction. Security + privacy review for PR data aggregation. |
| E6 Auto-adjustment based on mismatch | In v1, /plan-tune shows the gap between declared and inferred. In v2, it could suggest declaration updates. Requires the dual-track profile to be stable. | Real mismatch data from v1 shows consistent patterns. Suggestion UX designed separately. |
| Psychographic-driven auto-decide | Zero behavioral change in v1. Only explicit preferences act. | Real usage shows explicit preferences cover most cases. Inferred profile stable enough to trust. |

## Rejected entirely (Codex was right, we're not doing these)

| Item | Why rejected |
|------|--------------|
| Substrate-as-prompt-convention (vs. typed registry) | Codex #1. Agents can silently skip instructions. Building psychographic on top is sand. |
| ±0.2 clamp on declared dimensions | Codex #6. Creates a logical contradiction with E6 mismatch detection. Pick ONE: editable preference OR inferred behavior. Now: both, tracked separately (dual-track profile). |
| One-way door classification by parsing prose summaries | Codex #4. Safety depends on wording. door_type must be declared at the question definition site (registry), not inferred. |
| Single event-schema file mixing declarations + overrides + verdicts + feedback | Codex #5. Incompatible domain objects. Now split into three files: question-log.jsonl, question-preferences.json, question-events.jsonl. |
| TTHW telemetry for /plan-tune onboarding | Codex #14. Contradicts local-first framing. Local logging only. |
| Inline tune: writes without user-origin verification | Codex #16. Profile poisoning attack. Now: the user-origin gate is non-optional. |

## Architecture

```
~/.gstack/
  developer-profile.json      # unified: declared + inferred + sessions (from office-hours)

~/.gstack/projects/{SLUG}/
  question-log.jsonl          # every AskUserQuestion, append-only, registry-validated
  question-preferences.json   # explicit per-question user choices
  question-events.jsonl       # tune: feedback events, user-origin gated
```

**Unified profile schema** (superseding both the v0.16.2.0 builder-profile.jsonl and the proposed developer-profile.json):

```json
{
  "identity": {"email": "..."},
  "declared": {
    "scope_appetite": 0.9,
    "risk_tolerance": 0.7,
    "detail_preference": 0.4,
    "autonomy": 0.5,
    "architecture_care": 0.7
  },
  "inferred": {
    "values": {"scope_appetite": 0.72, "risk_tolerance": 0.58, "...": "..."},
    "sample_size": 47,
    "diversity": {
      "skills_covered": 5,
      "question_ids_covered": 14,
      "days_span": 23
    }
  },
  "gap": {"scope_appetite": 0.18, "...": "..."},
  "sessions": [
    {"date": "...", "mode": "builder", "project_slug": "...", "signals": []}
  ],
  "signals_accumulated": {
    "named_users": 1, "taste": 4, "agency": 3, "...": "..."
  }
}
```


**Diversity check** (Codex #13): `inferred` is considered "enough data" only when `sample_size >= 20 AND skills_covered >= 3 AND question_ids_covered >= 8 AND days_span >= 7`. Below this, `/plan-tune profile` shows "not enough observed data yet" instead of a potentially-misleading inferred value.
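
The gate is mechanical enough to state directly in code. A minimal sketch, with types assumed from the profile schema above:

```typescript
// Transcription of the diversity gate (Codex #13). Names are assumptions.
interface Diversity {
  skills_covered: number;
  question_ids_covered: number;
  days_span: number;
}

function hasEnoughObservedData(sampleSize: number, d: Diversity): boolean {
  return (
    sampleSize >= 20 &&
    d.skills_covered >= 3 &&
    d.question_ids_covered >= 8 &&
    d.days_span >= 7
  );
}
```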

## Data flow (v1)

1. Preamble: check the `question_tuning` config. If off, do nothing.
2. Before each AskUserQuestion (see the sketch after this list):
   - Agent calls `gstack-question-preference --check <registry-id>`
   - If `never-ask` AND question is NOT a one-way door → auto-choose recommended with annotation
   - If `always-ask`, unset, or question IS a one-way door → ask normally
3. After AskUserQuestion:
   - Append a log record to question-log.jsonl (registry-validated, rejects unknown IDs)
4. Offer inline: "Tune this question? Reply `tune: [feedback]` to adjust."
5. If the user's NEXT turn message contains the `tune:` prefix AND the content originated in the user's own message (not tool output):
   - Agent calls `gstack-question-preference --write` with `source: "inline-user"`
   - Binary validates the source field; rejects anything other than `inline-user`
6. Inferred dimensions recomputed on demand by `bin/gstack-developer-profile --derive`. Signal map changes trigger a full recompute from the events history.
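
A sketch of step 2's branch logic. The outcome names mirror the `AUTO_DECIDE` / `ASK_NORMALLY` cases named in the test plan below; the function itself is illustrative, not the shipped resolver.

```typescript
// One-way doors win over any preference; only explicit preferences auto-decide.
type Outcome = "AUTO_DECIDE" | "ASK_NORMALLY";
type Preference = "always-ask" | "never-ask" | "ask-only-for-one-way";

function resolveQuestion(
  doorType: "one-way" | "two-way",
  pref: Preference | undefined,
): Outcome {
  if (doorType === "one-way") return "ASK_NORMALLY"; // safety override
  if (pref === "never-ask" || pref === "ask-only-for-one-way") return "AUTO_DECIDE";
  return "ASK_NORMALLY"; // always-ask, or no preference set
}
```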

## Security model

**Profile poisoning defense** (Codex #16, Decision J below): Inline tune events may be written ONLY when:

- The agent is processing the user's current chat turn
- The `tune:` prefix appears in that user message (not in any tool output, file content, PR description, commit message, etc.)
- The resolver's instructions to the agent explicitly call this out

Binary enforcement: `gstack-question-preference --write` requires a `source: "inline-user"` field on every tune-originated record. Any other source value (e.g., `inline-tool-output`, `inline-file-content`) is rejected with an error. The agent is instructed to never forge the `source` field.
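
A minimal sketch of that binary-side check, assuming a simple event shape; the real validation lives in `bin/gstack-question-preference`:

```typescript
// Reject any tune write whose source is not the literal "inline-user".
interface TuneEvent {
  question_id: string;
  feedback: string;
  source: string; // must be exactly "inline-user"
}

function validateTuneWrite(event: TuneEvent): void {
  if (event.source !== "inline-user") {
    // e.g. "inline-tool-output", "inline-file-content": poisoning attempt or agent error
    throw new Error(`tune write rejected: source "${event.source}" is not "inline-user"`);
  }
}
```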

**Data privacy**:

- All data is local-only under `~/.gstack/`. Nothing leaves without explicit user action.
- `/plan-tune export <path>` writes the profile to a user-specified path (opt-in export).
- `/plan-tune delete` wipes local profile files.
- `gstack-config set telemetry off` prevents any telemetry (this skill never sends profile data regardless).
- Profile files have standard user-home permissions.

**Injection defense** (consistent with existing `bin/gstack-learnings-log` patterns): the `question_summary` and any free-form user feedback fields are sanitized against known prompt-injection patterns ("ignore previous instructions," "system:", etc.).

## 5 Hard Constraints (preserved from office-hours, updated for Codex feedback)

1. **One-way doors are classified deterministically by registry declaration**, NOT by runtime summary parsing. Each registry entry declares `door_type: one-way | two-way`. The keyword pattern fallback (`scripts/one-way-doors.ts`) is a belt-and-suspenders secondary check for edge cases.
2. **Profile dimensions are inspectable AND editable.** `/plan-tune profile` shows declared + inferred + gap. Edits via plain English go to `declared` only. The system tracks `inferred` independently.
3. **Signal map is hand-crafted in TypeScript.** `scripts/psychographic-signals.ts` maps `{question_id, user_choice} → {dimension, delta}`. Not agent-inferred. In v1, consumed only for the `inferred.values` display — not for driving decisions. (Sketched after this list.)
4. **No psychographic-driven auto-decide in v1.** Only explicit per-question preferences act. This sidesteps the "calibration gate can be gamed" critique (Codex #13) entirely — v1 doesn't have a gate to pass.
5. **Per-project preferences beat global preferences.** `~/.gstack/projects/{SLUG}/question-preferences.json` wins over any future global preference file. The global profile (`~/.gstack/developer-profile.json`) is a starting point for diversity across projects.
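
To show the shape constraint 3 fixes, a sketch of a hand-crafted map entry and the event-sourced fold that produces raw totals for `inferred.values`. The specific ids, deltas, and function names are illustrative assumptions; only the `{question_id, user_choice} → {dimension, delta}` shape and the five dimension names come from this doc.

```typescript
// Dimensions from the unified profile schema above.
type Dimension =
  | "scope_appetite" | "risk_tolerance" | "detail_preference"
  | "autonomy" | "architecture_care";

interface Signal { dimension: Dimension; delta: number }

// Hand-crafted, versioned map. Hypothetical entry: on a scope question,
// choosing "expand" nudges scope_appetite up, "reduce" nudges it down.
const SIGNAL_MAP: Record<string, Record<string, Signal>> = {
  "plan-ceo-review-scope-mode": {
    expand: { dimension: "scope_appetite", delta: +0.05 },
    reduce: { dimension: "scope_appetite", delta: -0.05 },
  },
};

// Fold the append-only question log into raw per-dimension totals.
// Because the log is the source of truth, a signal-map change just
// means re-running this fold: no data migration.
function deriveTotals(
  events: { question_id: string; user_choice: string }[],
): Record<Dimension, number> {
  const totals: Record<Dimension, number> = {
    scope_appetite: 0, risk_tolerance: 0, detail_preference: 0,
    autonomy: 0, architecture_care: 0,
  };
  for (const e of events) {
    const signal = SIGNAL_MAP[e.question_id]?.[e.user_choice];
    if (signal) totals[signal.dimension] += signal.delta; // unknown ids contribute nothing
  }
  return totals; // normalized into [0,1] separately before display
}
```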

## Why event-sourced + dual-track

**Why event-sourced for the inferred profile**:

- The signal map can change between gstack versions. Recompute from events, no data migration needed.
- Auditable: `/plan-tune profile --trace autonomy` shows every event that contributed to the value.
- Future-proof: new dimensions can be derived from existing history.

**Why dual-track (declared + inferred, separately)** (Decision B below):

- Resolves the logical contradiction Codex #6 identified.
- `declared` is user sovereignty. The user states who they are. The system obeys for anything user-driven (preferences, declarations, overrides).
- `inferred` is observation. The system tracks behavioral patterns. Displayed but not acted on in v1.
- `gap` is the interesting signal. Large gaps suggest the user's self-description isn't matching their behavior — valuable self-insight, but not auto-corrected.

## Interaction model — plain English everywhere

(From /plan-devex-review, user correction on CLI syntax):

`/plan-tune` (no args) enters conversational mode. No CLI subcommand syntax required.

Menu in plain language:

- "Show me my profile"
- "Review questions I've been asked"
- "Set a preference about a question"
- "Update my profile — I've changed my mind about something"
- "Show me the gap between what I said and what I do"
- "Turn it off"

The user replies conversationally. The agent interprets, confirms the intended change, then writes. For example:

- User: "I'm more of a boil-the-ocean person than 0.5 suggests"
- Agent: "Got it — update `declared.scope_appetite` from 0.5 to 0.8? [Y/n]"
- User: "Yes"
- Agent writes the update

A confirmation step is required for any mutation of `declared` from free-form input (Codex #15 trust boundary).

Power users can type shortcuts (`narrative`, `vibe`, `reset`, `stats`, `enable`, `disable`, `diff`). Neither path is required; both work.

## Files to Create

### Core schema

- `scripts/question-registry.ts` — typed registry. Seeded from audit of all SKILL.md.tmpl AskUserQuestion invocations.
- `scripts/one-way-doors.ts` — secondary keyword fallback. Primary: `door_type` in registry.
- `scripts/psychographic-signals.ts` — hand-crafted signal map for inferred computation.

### Binaries

- `bin/gstack-question-log` — append log record, validate against registry.
- `bin/gstack-question-preference` — read/write/check/clear explicit preferences.
- `bin/gstack-developer-profile` — supersedes `bin/gstack-builder-profile`. Subcommands: `--read` (legacy compat), `--derive`, `--gap`, `--profile`.

### Resolvers

- `scripts/resolvers/question-tuning.ts` — three generators: `generateQuestionPreferenceCheck(ctx)` (pre-question check), `generateQuestionLog(ctx)` (post-question log), `generateInlineTuneFeedback(ctx)` (post-question tune: prompt with user-origin gate instructions).

### Skill

- `plan-tune/SKILL.md.tmpl` — conversational, plain-English inspection and preference tool.

### Tests

- `test/plan-tune.test.ts` — registry completeness, duplicate ID check, preference precedence (never-ask + not-one-way → AUTO_DECIDE; never-ask + one-way → ASK_NORMALLY), user-origin gate (rejects non-inline-user sources), derivation + recompute, unified profile schema, migration regression with 7-session fixture.

## Files to Modify

- `scripts/resolvers/index.ts` — register 3 new resolvers.
- `scripts/resolvers/preamble.ts` — `_QUESTION_TUNING` config read; inject 3 resolvers for tier >= 2.
- `bin/gstack-builder-profile` — legacy shim delegates to `bin/gstack-developer-profile --read`.
- Migration script — folds existing builder-profile.jsonl into unified developer-profile.json. Atomic, idempotent, archives the source as `.migrated-YYYY-MM-DD`.

## NOT touched in v1

Explicitly unchanged — no `{{PROFILE_ADAPTATION}}` placeholders, no behavior change based on the profile:

- `ship/SKILL.md.tmpl`, `review/SKILL.md.tmpl`, `office-hours/SKILL.md.tmpl`, `plan-ceo-review/SKILL.md.tmpl`, `plan-eng-review/SKILL.md.tmpl`

These skills gain preamble injection for logging / preference checking / tune feedback only. No profile-driven defaults. v2 work.

## Decisions log (with pros/cons for each)

### Decision A: Bundle all three (question-log + sensitivity + psychographic) vs. ship smaller wedge — INITIAL ANSWER: BUNDLE; REVISED: REGISTRY-FIRST OBSERVATIONAL

Initial user position (office-hours): "The psychographic IS the differentiation. Ship the whole thing so the feedback loop can actually tune behavior." This drove CEO EXPANSION.

**Pros of bundling:** Ambition. The learning layer is what makes this more than config. Without psychographic, it's a fancy settings menu.

**Cons of bundling (surfaced by Codex):** The substrate didn't exist. Psychographic on top of prompt-convention is sand. E1/E4/E6 compose incoherently. Profile poisoning was unaddressed. E5 in preamble is a hidden hot-path side effect. Implementation order built machinery around an unenforceable convention.

**Revised answer:** Registry-first observational v1 (this doc). Preserves the ambition as a v2 target with explicit acceptance criteria. Ships a defensible foundation. User accepted this after seeing Codex's 20-point critique.

### Decision B: Event-sourced vs. stored dimensions vs. hybrid — ANSWER: EVENT-SOURCED + USER-DECLARED ANCHOR (B+C)

**Approach A (stored dimensions):** Mutate in place. Simple.

- Pros: Smallest data model. Easy to reason about.
- Cons: Lossy. No history. Signal map changes require migration. Profile changes are opaque to the user.

**Approach B (event-sourced):** Store raw events, derive dimensions.

- Pros: Auditable. Recomputable on signal map changes. No data migration ever. Matches existing learnings.jsonl pattern.
- Cons: More complex derivation. Events file grows over time (compaction deferred to v2).

**Approach C (hybrid — user-declared anchor, events refine):** Initial profile is user-stated; events refine within ±0.2.

- Pros: Day-1 value. User sovereignty. Calibration anchor instead of starting from zero.
- Cons: The ±0.2 clamp creates a logical conflict with mismatch detection (Codex #6 caught this).

**Chosen: B+C combined with the ±0.2 CLAMP REMOVED.** Event-sourced underneath, declared profile as a first-class separate field. No clamp. Declared and inferred live as independent values. The gap between them is displayed but not auto-corrected in v1.

### Decision C: One-way door classification — runtime prose parsing vs. registry declaration — ANSWER: REGISTRY DECLARATION (post-Codex)

**Runtime prose parsing (original):** `isOneWayDoor(skill, category, summary)` plus keyword patterns.

- Pros: Minimal friction for skill authors. No schema to maintain.
- Cons (Codex #4): Safety depends on wording. A destructive-op question phrased mildly could be misclassified. Unacceptable for a safety gate.

**Registry declaration (revised):** Every registry entry declares `door_type`.

- Pros: Deterministic. Auditable. CI-enforceable (all questions must declare).
- Cons: Maintenance burden. Every new skill question must classify.

**Chosen: registry declaration as primary, keyword patterns as fallback.** Schema governance is the cost of safety.

### Decision D: Inline tune feedback grammar — structured keywords vs. free-form natural language — ANSWER: STRUCTURED WITH FREE-FORM FALLBACK

**Structured keywords only:** `tune: unnecessary | ask-less | never-ask | always-ask | context-dependent`.

- Pros: Unambiguous. Clean profile data.
- Cons: Users must memorize.

**Free-form only:** Agent interprets whatever the user says.

- Pros: Natural. No syntax to learn.
- Cons: Inconsistent profile data. Hard to debug why a tune didn't take effect.

**Chosen: both.** Shortcuts documented for power users; the agent accepts and normalizes free English. Plain-English interaction is the default; structured keywords are an optional fast path.

### Decision E: CLI subcommand structure for /plan-tune — ANSWER: PLAIN ENGLISH CONVERSATIONAL (no subcommand syntax required)

**`/plan-tune profile`, `/plan-tune profile set autonomy 0.4`, etc. (original):**

- Pros: Fast for power users. Self-documenting via --help.
- Cons: Users must memorize. Every invocation feels like a CLI session, not a conversation.

**Plain-English conversational (revised after user correction):** `/plan-tune` enters a menu. User says what they want in natural language.

- Pros: Zero memorization. Feels like talking to a coach, not a shell.
- Cons: Slower for power users. Requires good agent interpretation.

**Chosen: conversational with optional shortcuts.** Neither path is required. Most users never see the shortcuts. A confirmation step is required before mutating the declared profile (safety against agent misinterpretation — Codex #15 trust boundary).

### Decision F: Landed celebration — passive preamble detection vs. explicit command vs. post-ship hook — ANSWER: DEFERRED TO v2; WHEN PROMOTED, NOT IN PREAMBLE

**Passive detection in preamble (original):** Every skill's preamble runs `gh pr view` to detect recent merges.

- Pros: Works regardless of which skill the user runs. User doesn't need to do anything special.
- Cons (Codex #9): Latency, auth failures, rate limits, surprise browser opens, nondeterminism injected into every skill's preamble. Side effect in the hot path.

**Explicit command (`/plan-tune show-landed`):** User opts in.

- Pros: No hot-path side effects. User controls when to see it.
- Cons: Requires user discovery. The "surprise you when you earned it" magic is lost.

**Post-ship hook (`/ship` triggers detection after PR creation):** Tied to /ship.

- Pros: Natural timing. No preamble cost.
- Cons: /ship isn't always the landing event (manual merges, team members merging, etc.).

**Chosen: DEFERRED entirely.** v2 will design this properly. When promoted, it moves out of the preamble. User accepted Codex's argument that a celebration page in the preamble is a strategic misfit for an already-risky feature.

### Decision G: Calibration gate — 20 events vs. diversity-checked — ANSWER: DIVERSITY-CHECKED

**"20 events" (original):** Simple count.

- Pros: Trivial to implement.
- Cons (Codex #13): Gameable. 20 inline "unnecessary" replies to ONE question should not calibrate five dimensions.

**Diversity check (revised):** `sample_size >= 20 AND skills_covered >= 3 AND question_ids_covered >= 8 AND days_span >= 7`.

- Pros: The profile has actually been exercised across the system before it's trusted.
- Cons: Slightly more complex.

**Chosen: diversity check.** In v1 used only for the "enough data to display" threshold. In v2 it will be the gate for psychographic-driven auto-decide.

### Decision H: Implementation order — classifiers first vs. integration point first — ANSWER: INTEGRATION POINT FIRST (registry + CI lint)

**Classifiers first (original):** Build bin tools, then resolvers, then the skill template.

- Pros: Atomic building blocks. Can unit-test before integration.
- Cons (Codex #19): Builds machinery around an unenforceable convention. If the convention doesn't hold, all the work is wasted.

**Integration point first (revised):** Build the typed registry + CI lint first. Prove the integration works before building infrastructure on top.

- Pros: Foundation is proven. Infrastructure has something durable to rely on.
- Cons: Requires auditing every existing AskUserQuestion in gstack — substantial up-front work.

**Chosen: integration point first.** Codex's argument was decisive. The audit is exactly the point — it forces us to catalog what we actually have before building adaptation on top.

### Decision I: Telemetry for TTHW — opt-in telemetry vs. local-only — ANSWER: LOCAL-ONLY

**Opt-in telemetry (original, suggested in DX review):** Instrument TTHW via a telemetry event.

- Pros: Quantitative measure of onboarding experience across all users.
- Cons (Codex #14): Contradicts the local-first OSS framing. Adds telemetry surface specifically for this skill.

**Local-only (revised):** Logging is local. Respect the existing `telemetry` config; the skill adds no new telemetry channels.

- Pros: Consistent with gstack's local-first ethos.
- Cons: No aggregate view of onboarding time.

**Chosen: local-only.** If we need TTHW data later, we add it as a gstack-wide telemetry event behind the existing opt-in, not a skill-specific one.

### Decision J: Profile poisoning defense — no defense vs. confirmation gate vs. user-origin gate — ANSWER: USER-ORIGIN GATE

**No defense (original — caught by Codex):** Agent writes any tune event it sees.

- Pros: Simplest. No additional trust checks.
- Cons (Codex #16): Malicious repo content, PR descriptions, or tool output can inject `tune: never ask` and poison the profile. This is a real attack surface.

**Confirmation gate:** Every tune write prompts "Confirmed? [Y/n]".

- Pros: Universal defense.
- Cons: Friction on every legitimate use.

**User-origin gate:** Agent only writes tune events when the `tune:` prefix appears in the user's own chat message for the current turn (not tool output, not file content). Binary validates `source: "inline-user"`.

- Pros: Blocks the attack without friction on legitimate use.
- Cons: Relies on the agent correctly identifying the source. Binary-level validation is the enforcement.

**Chosen: user-origin gate.** Matches the threat model (malicious content in automated inputs) without degrading the normal flow.

## Success Criteria

- `bun test` passes, including the new `test/plan-tune.test.ts`.
- Every AskUserQuestion invocation in every SKILL.md.tmpl has a registry entry. CI lint enforces this.
- Migration from `~/.gstack/builder-profile.jsonl` preserves 100% of sessions + signals_accumulated. Regression test with a 7-session fixture.
- One-way door registry declarations: 100% of destructive ops, architecture forks, scope-adds > 1 day CC effort, and security/compliance choices are classified `one-way`.
- User-origin gate test: attempting to write a tune event with `source: "inline-tool-output"` is rejected.
- Dogfood: Garry uses `/plan-tune` for 2+ weeks. Reports back whether:
  - `tune: never-ask` felt natural to type or got ignored
  - Registry maintenance (adding new questions) felt like reasonable discipline or schema bureaucracy
  - Inferred dimensions were stable across sessions or noisy
  - Plain-English interaction felt like a coach or like arguing with a chatbot

## Implementation Order

1. Audit every `AskUserQuestion` invocation in every gstack SKILL.md.tmpl. Build the initial `scripts/question-registry.ts` with IDs, categories, door_types, options. This is the foundation; everything else sits on it.
2. Write the `test/plan-tune.test.ts` registry-completeness test (gate tier). Verify it catches drift — temporarily remove one registry entry, confirm CI fails. (See the sketch after this list.)
3. Seed `scripts/one-way-doors.ts` with the keyword-pattern fallback classifier.
4. Seed `scripts/psychographic-signals.ts` with initial `{question_id, user_choice} → {dimension, delta}` mappings. Numbers are tentative — v1 ships, v2 recalibrates.
5. Seed `scripts/archetypes.ts` with archetype definitions (referenced by the future v2 `/plan-tune vibe`).
6. `bin/gstack-question-log` — validates against the registry, rejects unknown IDs.
7. `bin/gstack-question-preference` — all subcommands + tests.
8. `bin/gstack-developer-profile` — `--read` (legacy), `--derive`, `--gap`, `--profile`.
9. Migration script — builder-profile.jsonl → unified developer-profile.json. Atomic, idempotent, archives the source. Regression test with a fixture.
10. `scripts/resolvers/question-tuning.ts` — three generators (preference check, log, inline tune with user-origin gate instructions).
11. Register the 3 resolvers in `scripts/resolvers/index.ts`.
12. Update `scripts/resolvers/preamble.ts` — `_QUESTION_TUNING` config read; conditionally inject for tier >= 2 skills.
13. `plan-tune/SKILL.md.tmpl` — conversational plain-English skill.
14. `bun run gen:skill-docs` — all SKILL.md files regenerated; verify each stays under the 100KB token ceiling.
15. `bun test` — all 45+ test cases green.
16. Dogfood 2+ weeks. Collect real question-log + preferences data. Measure against the success criteria.
17. `/ship` v1. v2 scope discussion after dogfood.
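
A sketch of step 2's drift check in bun:test style. The template-scanning convention (a `question_id:` marker in SKILL.md.tmpl files) is an assumption; the real test may extract ids differently.

```typescript
import { expect, test } from "bun:test";
import { Glob } from "bun";
import { getAllRegisteredIds } from "../scripts/question-registry";

test("every templated question id has a registry entry (gate tier)", async () => {
  const registered = new Set(getAllRegisteredIds());
  const glob = new Glob("**/SKILL.md.tmpl");
  for await (const file of glob.scan(".")) {
    const text = await Bun.file(file).text();
    // Assumed convention: templates tag questions as `question_id: <kebab-id>`.
    for (const match of text.matchAll(/question_id:\s*([a-z0-9-]+)/g)) {
      expect(registered.has(match[1])).toBe(true); // fails CI on drift or renames
    }
  }
});
```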

## Open Questions (v2 scope decisions, deferred until real data)

1. Exact signal map deltas. v1 ships with initial guesses; v2 recalibrates from observed data.
2. When the gap between `inferred` and `declared` becomes large, do we auto-suggest updating `declared`? Or just display it?
3. When the signal map version changes, do we auto-recompute or prompt the user? Default: auto-recompute with a diff display.
4. Cross-project profile inheritance vs. isolation. v1 is per-project preferences + a global profile; v2 may add explicit cross-project learning opt-ins.
5. Should /plan-tune support a "team profile" mode where a shared developer-profile informs collaboration? v2+.

## Reviews incorporated

- **/office-hours (2026-04-16, 1 session):** Set the 5 hard constraints, chose the event-sourced + user-declared architecture.
- **/plan-ceo-review (2026-04-16, EXPANSION mode):** 6 expansions accepted, later rolled back after Codex review.
- **/plan-devex-review (2026-04-16, POLISH mode):** Plain-English interaction model; this survived to v1.
- **/plan-eng-review (2026-04-16):** Test plan and completeness checks; partially superseded by the registry-first rewrite.
- **/codex (2026-04-16, gpt-5.4 high reasoning):** 20-point critique drove the rollback. 15+ legitimate findings the Claude reviews missed.

## Credits and caveats

This plan was developed through an iterative AI-collaboration loop over ~6 hours of planning. The author (Garry Tan) directed every scope decision; AI voices (Claude Opus 4.7 and OpenAI Codex gpt-5.4) challenged and refined the plan. Without Codex's outside voice, a much larger and less defensible plan would have shipped. The value of cross-model review on high-stakes architectural changes is real and measurable.

@@ -0,0 +1,237 @@

# Plan Tuning v1 — Design Doc

**Status:** Approved for implementation (2026-04-18)
**Branch:** garrytan/plan-tune-skill
**Authors:** Garry Tan (user), with AI-assisted reviews from Claude Opus 4.7 + OpenAI Codex gpt-5.4
**Supersedes scope:** adds a writing-style + LOC-receipts layer on top of [PLAN_TUNING_V0.md](./PLAN_TUNING_V0.md) (observational substrate). V0 remains in place unchanged.
**Related:** [PACING_UPDATES_V0.md](./PACING_UPDATES_V0.md) — extracted pacing overhaul, V1.1 plan.

## What this document is

A canonical record of what /plan-tune v1 is, what it is NOT, what we considered, and why we made each call. Committed to the repo so future contributors (and future Garry) can trace reasoning without archeology. Supersedes any per-user local plan artifacts.

## Credit

This plan exists because of **[Louise de Sadeleer](https://x.com/LouiseDSadeleer/status/2045139351227478199)**, who sat through a complete gstack run as a non-technical user and told us the truth about how it feels. Her specific feedback:

1. "I was getting a bit tired after a while and it felt a little bit rigid." — *pacing/fatigue*
2. "I'm just gonna say yes yes yes" (during architecture review). — *disengagement*
3. "What I find funny is his emphasis on how many lines of code he produces. AI has produced for him of course." — *LOC framing*
4. "As a non-engineer this is a bit complicated to understand." — *jargon density + outcome framing*

V1 addresses #3 and #4 directly: jargon-glossing + outcome-framed writing that reads like a real person wrote it for the reader, plus a defensible LOC reframe. Louise's #1 and #2 (pacing/fatigue) require a separate design round — extracted to [PACING_UPDATES_V0.md](./PACING_UPDATES_V0.md) as the V1.1 plan.

## The feature, in one paragraph

gstack skill output is the product. If the prose doesn't read well for a non-technical founder, they check out of the review and click "yes yes yes." V1 adds a writing-style standard that applies to every tier ≥ 2 skill: jargon glossed on first use (from a curated ~50-term list), questions framed in outcome terms ("what breaks for your users if...") not implementation terms, short sentences, concrete nouns. Power users who want the tighter V0 prose can set `gstack-config set explain_level terse`. Binary switch, no partial modes. Plus: the README's "600,000+ lines of production code" framing — rightly called out as LOC vanity by Louise — gets replaced with a real computed 2013-vs-2026 pro-rata multiple from an `scc`-backed script, with honest caveats about public-vs-private repo visibility.

## Why we're building the smaller version

V1 went through four substantial scope revisions over multiple review passes. The final scope is smaller than any intermediate version because each review pass caught real problems.

**Revision 1 — Four-level experience axis (rejected).** Original proposal: ask users on first run whether they're an experienced dev, an engineer without solo experience, non-technical but shipped on a team, or non-technical entirely. Skills adapt per level. Rejected during the CEO review's premise-challenge step because (a) the onboarding ask adds friction at exactly the moment V1 is trying to reduce it, (b) "what level am I?" is itself a confusing question for the users who most need help, (c) technical expertise isn't one-dimensional (a designer can be level A on CSS, level D on deploy), (d) engineers benefit from the same writing standards non-technical users do.

**Revision 2 — ELI10 by default, terse opt-out (accepted).** Every skill's output defaults to the writing standard. Power users who want V0 prose set `explain_level: terse`. Codex Pass 1 caught critical gaps (static-markdown gating, host-aware paths, README update mechanism) — all three integrated.

**Revision 3 — ELI10 + review-pacing overhaul (proposed, scoped back).** Added a pacing workstream: rank findings, auto-accept two-way doors, max 3 AskUserQuestion prompts per phase, Silent Decisions block with flip command. Intended to address Louise's #1 and #2 directly. Eng review Pass 2 caught scoring-formula and path-consistency bugs. Eng review Pass 3 + Codex Pass 2 surfaced 10+ structural gaps in the pacing workstream that couldn't be fixed via plan-text editing.

**Revision 4 — ELI10 + LOC only (final).** User chose scope reduction: ship V1 with writing style + LOC receipts, defer pacing to V1.1 via [PACING_UPDATES_V0.md](./PACING_UPDATES_V0.md). This is the approved V1 scope.

The through-line: every review pass correctly narrowed the ambition until the remaining scope had no structural gaps. This matches the CEO review skill's SCOPE REDUCTION mode, arrived at late via engineering review rather than early via strategic choice.

## v1 Scope (what we're building now)

1. **Writing Style section in preamble** (`scripts/resolvers/preamble.ts`). Six rules: jargon-gloss on first use per skill invocation, outcome framing, short sentences / concrete nouns / active voice, decisions close with user impact, gloss-on-first-use-unconditional (even if the user pasted the term), user-turn override (user says "be terse" → skip for that response).
2. **Jargon boundary via repo-owned list** (`scripts/jargon-list.json`). ~50 curated high-frequency technical terms. Terms not on the list are assumed plain-English enough. Terms are inlined into generated SKILL.md prose at `gen-skill-docs` time (zero runtime cost). (Inlining sketched after this list.)
3. **Terse opt-out** (`gstack-config set explain_level terse`). Binary: `default` vs `terse`. Terse skips the Writing Style block entirely and uses the V0 prose style.
4. **Host-aware preamble echo.** `_EXPLAIN_LEVEL=$(${binDir}/gstack-config get explain_level 2>/dev/null || echo "default")`. Host-portable via the existing V0 `ctx.paths.binDir` pattern.
5. **gstack-config validation.** Document `explain_level: default|terse` in the header. Whitelist values. Warn on unknown values with a specific message + default to `default`.
6. **LOC reframe in README.** Remove the "600,000+ lines of production code" hero framing. Insert an `<!-- GSTACK-THROUGHPUT-PLACEHOLDER -->` anchor. A build-time script replaces the anchor with the computed multiple + caveat.
7. **`scc`-backed throughput script** (`scripts/garry-output-comparison.ts`). For each of 2013 + 2026, enumerate Garry-authored public commits, extract added lines from `git diff`, classify via `scc --stdin` (or a regex fallback). Output `docs/throughput-2013-vs-2026.json` with a per-language breakdown + caveats.
8. **`scc` as standalone install script** (`scripts/setup-scc.sh`). Not a `package.json` dependency (truly optional — 95% of users never run throughput). OS-detects and runs `brew install scc` / `apt install scc` / prints the GitHub releases link.
9. **README update pipeline** (`scripts/update-readme-throughput.ts`). Reads `docs/throughput-2013-vs-2026.json` if present, replaces the anchor with the computed number. If missing, writes a `GSTACK-THROUGHPUT-PENDING` marker that CI rejects — forcing the contributor to run the script before commit.
10. **/retro adds logical SLOC + weighted commits above raw LOC.** Raw LOC stays for context but is visually demoted.
11. **Upgrade migration** (`gstack-upgrade/migrations/v<VERSION>.sh`). One-time post-upgrade interactive prompt offering to restore V0 prose via `explain_level: terse` for users who prefer it. Flag-file gated.
12. **Documentation.** CLAUDE.md gains a Writing Style section (project convention). CHANGELOG.md gets a V1 entry (user-facing narrative, mentions scope reduction + V1.1 pacing). README.md gets a Writing Style explainer section (~80 words). CONTRIBUTING.md gains a note on jargon-list maintenance (PRs to add/remove terms).
13. **Tests.** 6 new test files + an extension of the existing `gen-skill-docs.test.ts`. All gate tier except the LLM-judge E2E (periodic).
14. **V0 dormancy negative tests.** Assert the 5D dimension names and 8 archetype names don't appear in default-mode skill output. Prevents V0 psychographic machinery from leaking into V1.
15. **V1 and V1.1 design docs.** PLAN_TUNING_V1.md (this file). PACING_UPDATES_V0.md (V1.1 plan, created during V1 implementation from the extracted appendix). TODOS.md P0 entry.
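
A sketch of item 2's gen-time inlining. The file shape (a flat `{ term: gloss }` object) and the render helper are assumptions; only the file path and the gen-time, zero-runtime-cost design come from this doc.

```typescript
// Called from gen-skill-docs while rendering the Writing Style section.
// The output is baked into static SKILL.md Markdown: nothing reads the
// jargon list at skill runtime.
import jargon from "../scripts/jargon-list.json";

function renderJargonBlock(): string {
  const entries = Object.entries(jargon as Record<string, string>);
  const lines = entries.map(([term, gloss]) => `- **${term}**: ${gloss}`);
  return ["Gloss these terms on first use:", ...lines].join("\n");
}
```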
|
||||
|
||||
## Deferred
|
||||
|
||||
**To V1.1 (explicit, with dedicated design doc):**
|
||||
- Review pacing overhaul (ranking, auto-accept, max-3-per-phase, Silent Decisions block, flip mechanism). Reasoning: see [PACING_UPDATES_V0.md](./PACING_UPDATES_V0.md) §"Why it's extracted." Has 10+ structural gaps unfixable via prose-only changes.
|
||||
- Preamble first-run meta-prompt audit (lake intro, telemetry, proactive, routing). Louise saw all of them on first run; they count against fatigue. V1.1 considers suppressing until session N.
|
||||
|
||||
**To V2 (or later):**
|
||||
- Confusion-signal detection from question-log driving on-the-fly translation offers.
|
||||
- 5D psychographic-driven skill adaptation (V0 E1 item).
|
||||
- /plan-tune narrative + /plan-tune vibe (V0 E3 item).
|
||||
- Per-skill or per-topic explain levels.
|
||||
- Team profiles.
|
||||
- AST-based "delivered features" metric.
|
||||
|
||||
## Rejected entirely (considered, not doing)
|
||||
|
||||
- **Four-level declared experience axis (A/B/C/D).** Rejected during CEO review premise-challenge. See "Why we're building the smaller version" above.
|
||||
- **ELI10 as a new resolver file (`scripts/resolvers/eli10-writing.ts`).** Codex Pass 1 caught the conflict with existing "smart 16-year-old" framing in preamble's AskUserQuestion Format section. Fold into existing preamble instead.
|
||||
- **Runtime suppression of the Writing Style block.** Codex Pass 1 caught that `gen-skill-docs` produces static Markdown — runtime `EXPLAIN_LEVEL=terse` can't hide content already baked in. Solution: conditional prose gate (prose convention, same category as V0's `QUESTION_TUNING` gate).
|
||||
- **Middle writing mode between default and terse.** Revision 3 proposed "terse = no glosses but keep outcome framing." Codex Pass 2 caught the contradiction with migration messaging. Binary wins: terse = V0 prose, full stop.
|
||||
- **User-editable jargon list at runtime.** Revision 3 proposed `~/.gstack/jargon-list.json` as user override. Codex Pass 2 caught the contradiction with gen-time inlining. Resolved: repo-owned only, PRs to add/remove, regenerate to take effect.
|
||||
- **`devDependencies.optional` field in package.json.** Not a real npm/bun field. Eng review Pass 2 caught. Standalone install script instead.
|
||||
- **Using the same string as replacement anchor AND CI-reject marker in README.** Eng review Pass 2 / Codex Pass 2 caught that this makes the pipeline destroy its own update path. Two-string solution: `GSTACK-THROUGHPUT-PLACEHOLDER` (anchor, stays across runs) vs `GSTACK-THROUGHPUT-PENDING` (explicit "build didn't run" marker that CI rejects).
|
||||
- **"Every technical term gets a gloss" as acceptance criterion.** Codex Pass 2 caught the contradiction with the curated-list rule. Acceptance rewritten to match rule: "every term on `scripts/jargon-list.json` that appears gets a gloss."
- **Acceptance criterion "≤ 12 AskUserQuestion prompts per /autoplan."** Removed from V1 — that target requires the pacing overhaul now in V1.1.
## Architecture
```
~/.gstack/
  developer-profile.json        # unchanged from V0
  config.yaml                   # + explain_level key (default | terse)

scripts/
  jargon-list.json              # NEW: ~50 repo-owned terms (gen-time inlined)
  garry-output-comparison.ts    # NEW: scc + git per-year, author-scoped
  update-readme-throughput.ts   # NEW: README anchor replacement
  setup-scc.sh                  # NEW: OS-detecting scc installer
  resolvers/preamble.ts         # MODIFIED: Writing Style section + EXPLAIN_LEVEL echo

docs/
  designs/PLAN_TUNING_V1.md     # NEW: this file
  designs/PACING_UPDATES_V0.md  # NEW: V1.1 plan (extracted)
  throughput-2013-vs-2026.json  # NEW: computed, committed

~/.claude/skills/gstack/bin/
  gstack-config                 # MODIFIED: explain_level header + validation

gstack-upgrade/migrations/
  v<VERSION>.sh                 # NEW: V0 → V1 interactive prompt
```
### Data flow
```
User runs tier-≥2 skill
        │
        ▼
Preamble bash (per-invocation):
  _EXPLAIN_LEVEL=$(${binDir}/gstack-config get explain_level 2>/dev/null || echo "default")
  echo "EXPLAIN_LEVEL: $_EXPLAIN_LEVEL"
        │
        ▼
Generated SKILL.md body (static Markdown, baked at gen-skill-docs):
  - AskUserQuestion Format section (existing V0)
  - Writing Style section (NEW, conditional prose gate)
        ├── "Skip if EXPLAIN_LEVEL: terse OR user says 'be terse' this turn"
        ├── 6 writing rules (jargon, outcome, short, impact, first-use, override)
        └── Jargon list inlined from scripts/jargon-list.json
        │
        ▼
Agent applies or skips based on runtime EXPLAIN_LEVEL + user-turn signal
        │
        ▼
V0 QUESTION_TUNING + question-log + preferences unchanged
        │
        ▼
Output to user (gloss-on-first-use, outcome-framed, short sentences; or V0 prose if terse)
```
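
To make the gate concrete: a minimal sketch of what the resolver could emit into each generated SKILL.md, assuming the Writing Style section is plain Markdown appended by `preamble.ts` (the function name and jargon shape here are illustrative, not the shipped resolver API):

```ts
// Illustrative fragment for scripts/resolvers/preamble.ts; names are assumptions.
type JargonEntry = { term: string; gloss: string };

function writingStyleSection(jargon: JargonEntry[]): string {
  return [
    "## Writing Style",
    "",
    // The gate is prose, not a runtime switch: the agent reads the
    // EXPLAIN_LEVEL echoed by the preamble bash and obeys (or doesn't).
    'Skip this section if EXPLAIN_LEVEL: terse OR the user says "be terse" this turn.',
    "",
    "Gloss jargon on first use. Frame outcomes, not mechanisms.",
    "Short sentences. Lead with impact. Explicit user style wins.",
    "",
    "Jargon list (baked in at gen time):",
    ...jargon.map((j) => `- ${j.term}: ${j.gloss}`),
  ].join("\n");
}
```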
### Data flow: throughput script (build-time)
```
bun run build
  │
  ├── gen:skill-docs (regenerates SKILL.md files with jargon list inlined)
  ├── update-readme-throughput (reads JSON if present; replaces anchor OR writes PENDING marker)
  └── other steps (binary compilation, etc.)

Separately, on-demand:

bun run scripts/garry-output-comparison.ts
  │
  ├── scc preflight (if missing → exit with setup-scc.sh hint)
  ├── For 2013 + 2026: enumerate Garry-authored commits in public garrytan/* repos
  ├── For each commit: git diff, extract ADDED lines, classify via scc --stdin
  └── Write docs/throughput-2013-vs-2026.json (per-language + caveats)
```
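
A minimal sketch of the per-commit core. The `--stdin` flag on scc comes from the flow above; the `--format json` flag and the exact git invocations are this sketch's assumptions, not verified against the shipped script:

```ts
// Illustrative core of scripts/garry-output-comparison.ts, not the shipped script.
import { execSync } from "node:child_process";

// Enumerate author-scoped commit hashes for one calendar year.
function commitsFor(year: number, author: string): string[] {
  return execSync(
    `git log --author="${author}" --since=${year}-01-01 --until=${year}-12-31 --format=%H`,
    { encoding: "utf8" },
  ).split("\n").filter(Boolean);
}

// Keep only the ADDED lines of one commit's diff (drop the +++ file headers).
function addedLines(hash: string): string {
  return execSync(`git show --format= --unified=0 ${hash}`, { encoding: "utf8" })
    .split("\n")
    .filter((l) => l.startsWith("+") && !l.startsWith("+++"))
    .map((l) => l.slice(1))
    .join("\n");
}

// Classify the added lines per language via scc's stdin mode.
function classify(added: string): string {
  return execSync("scc --stdin --format json", { input: added, encoding: "utf8" });
}
```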
## Security + privacy
- **No new user data.** V1 extends preamble prose + a config key. No new personal data collected.
- **No runtime file reads of sensitive data.** Jargon list is a repo-committed curated list.
- **Migration script is one-shot.** A flag-file prevents re-fire (pattern sketched after this list).
- **scc runs on public repos only.** No access to private work.
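
The one-shot guard above is a check-then-create flag file. A minimal sketch of the pattern in TypeScript; the shipped `v<VERSION>.sh` does the same in bash, and the flag path is an assumption:

```ts
// One-shot migration guard pattern; flag path is illustrative.
import { existsSync, writeFileSync } from "node:fs";
import { join } from "node:path";
import { homedir } from "node:os";

const flag = join(homedir(), ".gstack", "migration-v1-done");

if (!existsSync(flag)) {
  // ...run the interactive V0 -> V1 prompt exactly once...
  writeFileSync(flag, new Date().toISOString()); // any later run is a no-op
}
```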
## Decisions log (with pros/cons)
### Decision A: Four-level experience axis vs. ELI10 by default — ANSWER: ELI10 BY DEFAULT
**Four-level axis (rejected):** Ask users to self-identify as A/B/C/D on first run. Skills adapt per level.
- Pros: Explicit user sovereignty. Power users get V0 behavior.
- Cons: Adds onboarding friction. Forces users to label themselves. Technical expertise isn't one-dimensional. Engineers benefit from the same writing standards non-technical users do.
**ELI10 by default with terse opt-out (chosen):** Every skill's output defaults to the writing standard. Power users set `explain_level: terse` (validation sketched after this decision).
- Pros: No onboarding question. Good writing benefits everyone. Power users still have an escape hatch.
- Cons: Silently changes V0 behavior on upgrade → requires migration prompt.
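
A minimal sketch of the validation `gstack-config` needs for the new key, assuming the only legal values are `default` and `terse`; the helper name and the fall-back-instead-of-error choice are this sketch's assumptions:

```ts
// Illustrative explain_level validation; names are assumptions.
const EXPLAIN_LEVELS = ["default", "terse"] as const;
type ExplainLevel = (typeof EXPLAIN_LEVELS)[number];

function parseExplainLevel(raw: string | undefined): ExplainLevel {
  // Fall back to "default" rather than erroring so a stale or hand-edited
  // config.yaml can never break the per-invocation preamble echo.
  return (EXPLAIN_LEVELS as readonly string[]).includes(raw ?? "")
    ? (raw as ExplainLevel)
    : "default";
}
```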
### Decision B: New resolver file vs. extend existing preamble — ANSWER: EXTEND EXISTING
**New resolver (rejected):** `scripts/resolvers/eli10-writing.ts` as a separate generator.
- Pros: Modular.
- Cons (Codex #7): Conflicts with existing "smart 16-year-old" framing in preamble's AskUserQuestion Format section. Two sources of truth.
**Extend preamble (chosen):** Writing Style section added to `scripts/resolvers/preamble.ts` directly below AskUserQuestion Format.
- Pros: One source of truth. Composes with existing rules.
- Cons: `preamble.ts` grows.
### Decision C: Runtime suppression vs. conditional prose gate — ANSWER: CONDITIONAL PROSE GATE
**Runtime suppression (rejected):** Preamble read of `explain_level` triggers suppression logic.
- Pros: Simpler mental model.
- Cons (Codex #1): `gen-skill-docs` produces static Markdown. Once baked, content can't be retroactively hidden. Runtime suppression is fictional.
**Conditional prose gate (chosen):** "Skip this block if EXPLAIN_LEVEL: terse OR user says 'be terse' this turn." Prose convention; agent obeys or disobeys at runtime.
- Pros: Testable. Matches V0's `QUESTION_TUNING` pattern. Honest about the mechanism.
- Cons: Depends on agent prose compliance (no hard runtime gate).
### Decision D: Jargon list location — runtime-user-editable vs. repo-owned gen-time — ANSWER: REPO-OWNED GEN-TIME
**User-editable at runtime (rejected):** `~/.gstack/jargon-list.json` overrides `scripts/jargon-list.json`.
- Pros: User can add terms specific to their domain.
- Cons (Codex #4, Pass 2): Gen-time inlining means user edits require regeneration, contradicting the runtime-override premise.
**Repo-owned, gen-time inlined (chosen):** `scripts/jargon-list.json` only. PRs to add/remove. `bun run gen:skill-docs` inlines terms into preamble prose (sketched after this decision).
- Pros: One source of truth. Zero runtime cost. Composable with existing build.
- Cons: Users can't add terms locally. Mitigation: documented in CONTRIBUTING.md; PRs accepted.
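
A minimal sketch of the gen-time inlining, assuming scripts/jargon-list.json is a flat array of term/gloss pairs; this doc doesn't pin the exact shape, so the fields and formatting are illustrative:

```ts
// Illustrative gen-time inlining inside gen-skill-docs; shape is an assumption.
import { readFileSync } from "node:fs";

type JargonEntry = { term: string; gloss: string };

const jargon: JargonEntry[] = JSON.parse(
  readFileSync("scripts/jargon-list.json", "utf8"),
); // e.g. [{ "term": "LOC", "gloss": "lines of code" }]

// Baked into the Writing Style prose at generation time; editing the JSON
// only takes effect after `bun run gen:skill-docs` regenerates the docs.
const inlined = jargon.map(({ term, gloss }) => `- ${term}: ${gloss}`).join("\n");
```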
### Decision E: Pacing overhaul in V1 vs. V1.1 — ANSWER: V1.1 (extracted)
**Pacing in V1 (rejected):** Bundle ranking + auto-accept + Silent Decisions + max-3-per-phase cap + flip mechanism.
- Pros: Addresses Louise's fatigue directly.
- Cons (Eng review Pass 3 + Codex Pass 2): 10+ structural gaps unfixable via plan-text editing. Session-state model undefined. `phase` field missing from question-log. Registry doesn't cover dynamic review findings. Flip mechanism has no implementation. Migration prompt itself is an interrupt. First-run preamble prompts also count. Pacing as prose can't invert existing ask-per-section execution order.
**Extract to V1.1 (chosen):** Ship ELI10 + LOC in V1. Pacing gets its own design round with full review cycle.
- Pros: Ships V1 honestly. Gives V1.1 real baseline data from V1 usage (Louise's V1 transcript). Matches SCOPE REDUCTION mode from CEO review.
- Cons: Louise's fatigue complaint isn't fully addressed until V1.1. Mitigation: V1 still improves her experience via writing quality; V1.1 follows up with pacing.
### Decision F: README update mechanism — single string vs. two-string — ANSWER: TWO-STRING
**Single string (rejected):** `<!-- GSTACK-THROUGHPUT-MULTIPLE: N× -->` as both replacement anchor AND CI-reject marker.
- Pros: Simple.
- Cons (Codex Pass 2): Pipeline breaks on itself — CI rejects commits containing the marker, but the marker IS the anchor.
**Two-string (chosen):** `GSTACK-THROUGHPUT-PLACEHOLDER` (anchor, stable) + `GSTACK-THROUGHPUT-PENDING` (explicit missing-build marker, CI rejects). Mechanism sketched after this decision.
- Pros: Anchor persists; CI catches actual failure state.
- Cons: Two symbols to remember.
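
A minimal sketch of update-readme-throughput.ts under the two-string scheme. The paired-comment anchor format and the JSON's `multiple` field are this sketch's assumptions; only the two marker strings come from the decision above:

```ts
// Illustrative core of update-readme-throughput.ts; anchor format is an assumption.
import { existsSync, readFileSync, writeFileSync } from "node:fs";

const OPEN = "<!-- GSTACK-THROUGHPUT-PLACEHOLDER -->";
const CLOSE = "<!-- /GSTACK-THROUGHPUT-PLACEHOLDER -->";

const readme = readFileSync("README.md", "utf8");
// If the build produced the JSON, inline the real multiple; otherwise write
// the PENDING marker, which CI rejects as an explicit "build didn't run" state.
const value = existsSync("docs/throughput-2013-vs-2026.json")
  ? `${JSON.parse(readFileSync("docs/throughput-2013-vs-2026.json", "utf8")).multiple}x`
  : "GSTACK-THROUGHPUT-PENDING";

// The wrapping comments survive every run, so the anchor never destroys itself.
const pattern = new RegExp(`${OPEN}[\\s\\S]*?${CLOSE}`);
writeFileSync("README.md", readme.replace(pattern, `${OPEN}${value}${CLOSE}`));
```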
## Review record
| Review | Runs | Status | Key findings integrated |
|---|---|---|---|
| CEO Review | 1 | CLEAR (HOLD SCOPE) | Premise pivot: four-level axis → ELI10 by default. Cross-model tensions resolved via explicit user choice. |
| Codex Review | 2 | ISSUES_FOUND + drove scope reduction | Pass 1: 25 findings, 3 critical blockers (static-markdown, host-paths, README mechanism). Pass 2: 20 findings on revised plan, drove V1.1 extraction. |
| Eng Review | 3 | CLEAR (SCOPE_REDUCED) | Pass 1: critical gaps + 3 decisions (all A). Pass 2: scoring-formula bug, path contradiction, fake `devDependencies.optional` field. Pass 3: identified pacing structural gaps, drove extraction. |
| DX Review | 1 | CLEAR (TRIAGE) | 3 critical (docs plan, upgrade migration, hero moment). 9 auto-accepted as Silent DX Decisions. |
Review report persisted in `~/.gstack/` via `gstack-review-log`. Plan file retained with full history at `~/.claude/plans/system-instruction-you-are-working-transient-sunbeam.md`.