From bd44c27a188bed42e59a1e5eb22a07f4a163619f Mon Sep 17 00:00:00 2001
From: Garry Tan
Date: Fri, 17 Apr 2026 13:38:44 +0800
Subject: [PATCH] =?UTF-8?q?chore:=20bump=20version=20and=20changelog=20(v0?=
 =?UTF-8?q?.19.0.0)=20=E2=80=94=20/plan-tune=20v1?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Ships /plan-tune as observational substrate: typed question registry,
dual-track developer profile (declared + inferred), explicit per-question
preferences with user-origin gate, inline tune: feedback across every
tier >= 2 skill, unified developer-profile.json with migration from
builder-profile.jsonl.

Scope rolled back from initial CEO EXPANSION plan after outside-voice
review (Codex). 6 deferrals tracked as P0 TODOs with explicit acceptance
criteria: E1 substrate wiring, E3 narrative/vibe, E4 blind-spot coach,
E5 LANDED celebration, E6 auto-adjustment, E7 psychographic auto-decide.

See docs/designs/PLAN_TUNING_V0.md for the full design record.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 CHANGELOG.md |  20 ++++++
 TODOS.md     | 168 +++++++++++++++++++++++++++++++++++++++++++++++++++
 VERSION      |   2 +-
 3 files changed, 189 insertions(+), 1 deletion(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 75f09431..a216334a 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,5 +1,25 @@
 # Changelog
 
+## [0.19.0.0] - 2026-04-17
+
+### Added
+- **`/plan-tune` skill — gstack can now learn which of its prompts you find valuable vs noisy.** If you keep answering the same AskUserQuestion the same way every time, this is the skill that teaches gstack to stop asking. Say "stop asking me about changelog polish" — gstack writes it down, respects it from that point forward, and one-way doors (destructive ops, architecture forks, security choices) still always ask regardless, because safety wins over preference. Plain English everywhere. No CLI subcommand syntax to memorize.
+- **Dual-track developer profile.** Tell gstack who you are as a builder (5 dimensions: scope appetite, risk tolerance, detail preference, autonomy, architecture care). gstack also silently tracks what your behavior suggests. `/plan-tune` shows both side by side plus the gap, so you can see when your actions don't match your self-description. v1 is observational — no skills change their behavior based on your profile yet. That comes in v2, once the profile has proven itself.
+- **Builder archetypes.** Run `/plan-tune vibe` (v2) or let the skill infer it from your dimensions. Eight named archetypes (Cathedral Builder, Ship-It Pragmatist, Deep Craft, Taste Maker, Solo Operator, Consultant, Wedge Hunter, Builder-Coach) plus a Polymath fallback when your dimensions don't fit a standard pattern. Codebase and model ship now; the user-facing commands are v2.
+- **Inline `tune:` feedback across every gstack skill.** When a skill asks you something, you can reply `tune: never-ask` or `tune: always-ask` or free-form English and gstack normalizes it into a preference. Only runs when you've opted in via `gstack-config set question_tuning true` — zero impact until then.
+- **Profile-poisoning defense.** Inline `tune:` writes only get accepted when the prefix came from your own chat message — never from tool output, file content, PR descriptions, or anywhere else a malicious repo might inject instructions. The binary enforces this with exit code 2 for rejected writes. This was an outside-voice catch from Codex review; it's baked in from day one.
+- **Typed question registry with CI enforcement.** 53 recurring AskUserQuestion categories across 15 skills are now declared in `scripts/question-registry.ts` with stable IDs, categories, door types (one-way vs two-way), and options. A CI test asserts the schema stays valid. Safety-critical questions (destructive ops, architecture forks) are classified `one-way` at the declaration site — never inferred from prose summaries.
+- **Unified developer profile.** The `/office-hours` skill's existing builder-profile.jsonl (sessions, signals, resources, topics) is folded into a single `~/.gstack/developer-profile.json` on first use. Migration is atomic, idempotent, and archives the source file — rerun it safely. Legacy `gstack-builder-profile` is a thin shim that delegates to the new binary.
+
+### For contributors
+- New `docs/designs/PLAN_TUNING_V0.md` captures the full design journey: every decision with pros/cons, what was deferred to v2 with explicit acceptance criteria, what was rejected after Codex review (substrate-as-prompt-convention, ±0.2 clamp, preamble LANDED detection, single event-schema), and how the final shape came together. Read this before working on v2 to understand why the constraints exist.
+- Three new binaries: `bin/gstack-question-log` (validated append to question-log.jsonl), `bin/gstack-question-preference` (explicit preference store with user-origin gate), `bin/gstack-developer-profile` (supersedes gstack-builder-profile; supports --read, --migrate, --derive, --profile, --gap, --trace, --check-mismatch, --vibe).
+- Three new preamble resolvers in `scripts/resolvers/question-tuning.ts`: question preference check (before each AskUserQuestion), question log (after), inline tune feedback with user-origin gate instructions. Consolidated into one compact `generateQuestionTuning` section for tier >= 2 skills to minimize token overhead.
+- Hand-crafted psychographic signal map (`scripts/psychographic-signals.ts`) with version hash so cached profiles recompute automatically when the map changes between gstack versions. 9 signal keys covering scope-appetite, architecture-care, test-discipline, code-quality-care, detail-preference, design-care, devex-care, distribution-care, session-mode.
+- Keyword-fallback one-way-door classifier (`scripts/one-way-doors.ts`) — secondary safety layer for ad-hoc question IDs that don't appear in the registry. Primary safety is the registry declaration.
+- 118 new tests across 4 test files: `test/plan-tune.test.ts` (47 tests — schema, helpers, safety, classifier, signal map, archetypes, preamble injection, end-to-end pipeline), `test/gstack-question-log.test.ts` (21 tests — valid payloads, rejected payloads, injection defense), `test/gstack-question-preference.test.ts` (31 tests — check/write/read/clear/stats + user-origin gate + schema validation), `test/gstack-developer-profile.test.ts` (25 tests — read/migrate/derive/trace/gap/vibe/check-mismatch). Gate-tier E2E test `skill-e2e-plan-tune.test.ts` registered (runs on `bun run test:evals`).
+- Scope rollback driven by outside-voice review. The initial CEO EXPANSION plan bundled psychographic auto-decide + blind-spot coach + LANDED celebration + full substrate wiring. Codex's 20-point critique caught that without a typed question registry, "substrate" was marketing; E1/E4/E6 formed a logical contradiction; profile poisoning was unaddressed; LANDED in the preamble injected side effects into every skill's hot path. Accepted the rollback: v1 ships the schema + observation layer, v2 adds behavior adaptation only after the foundation proves durable. All six expansions are tracked as P0 TODOs with explicit acceptance criteria.
+
 ## [0.18.1.0] - 2026-04-16
 
 ### Fixed
diff --git a/TODOS.md b/TODOS.md
index 7bb06d01..b6b717c1 100644
--- a/TODOS.md
+++ b/TODOS.md
@@ -1,5 +1,173 @@
 # TODOS
 
+## Plan Tune (v2 deferrals from v0.19.0.0 rollback)
+
+All six items are gated on v1 dogfood results and the acceptance criteria in
+`docs/designs/PLAN_TUNING_V0.md`. They were explicitly deferred after Codex's
+outside-voice review drove a scope rollback from the CEO EXPANSION plan. v1
+ships the observational substrate only; v2 adds behavior adaptation.
+
+### E1 — Substrate wiring (5 skills consume profile)
+
+**What:** Add `{{PROFILE_ADAPTATION:}}` placeholder to ship, review,
+office-hours, plan-ceo-review, plan-eng-review SKILL.md.tmpl files. Implement
+`scripts/resolvers/profile-consumer.ts` with a per-skill adaptation registry
+(`scripts/profile-adaptations/{skill}.ts`). Each consumer reads
+`~/.gstack/developer-profile.json` on preamble and adapts skill-specific
+defaults (verbosity, mode selection, severity thresholds, pushback intensity).
+
+**Why:** v1 observational profile writes a file nobody reads. The substrate
+claim only becomes real when skills actually consume it. Without this, /plan-tune
+is a fancy config page.
+
+**Pros:** gstack feels personal. Every skill adapts to the user's steering
+style instead of defaulting to middle-of-the-road.
+
+**Cons:** Risk of psychographic drift if profile is noisy. Requires calibrated
+profile (v1 acceptance criteria: 90+ days stable across 3+ skills).
+
+**Context:** See `docs/designs/PLAN_TUNING_V0.md` §Deferred to v2. v1 ships the
+signal map + inferred computation; it's displayed in /plan-tune but no skill
+reads it yet.
+
+**Effort:** L (human: ~1 week / CC: ~4h)
+**Priority:** P0
+**Depends on:** 2+ weeks of v1 dogfood, profile diversity check passing.
+
+### E3 — `/plan-tune narrative` + `/plan-tune vibe`
+
+**What:** Event-anchored narrative ("You accepted 7 scope expansions, overrode
+test_failure_triage 4 times, called every PR 'boil the lake'") + one-word vibe
+archetype (Cathedral Builder, Ship-It Pragmatist, Deep Craft, etc).
+scripts/archetypes.ts is ALREADY SHIPPED in v1 (8 archetypes + Polymath
+fallback). v2 work is the narrative generator + /plan-tune skill wiring.
+
+**Why:** Makes profile tangible and shareable. Screenshot-able.
+
+**Pros:** Killer delight feature. Social surface for gstack. Concrete, specific
+output anchored in real events (not generic AI slop).
+
+**Cons:** Requires stable inferred profile — without calibration it produces
+generic paragraphs. Gen-tests need to validate no-slop.
+
+**Context:** Archetypes already defined. Just need the /plan-tune narrative
+subcommand + slop-check test.
+
+**Effort:** S+ (human: ~1 day / CC: ~1h)
+**Priority:** P0
+**Depends on:** Calibrated profile (>= 20 events, 3+ skills, 7+ days span).
+
+### E4 — Blind-spot coach
+
+**What:** Preamble injection that surfaces the OPPOSITE of the user's profile
+once per session per tier >= 2 skill. Boil-the-ocean user gets challenged on
+scope ("what's the 80% version?"); small-scope user gets challenged on ambition.
+`scripts/resolvers/blind-spot-coach.ts`. Marker file for session dedup. Opt-out
+via `gstack-config set blind_spot_coach false`.
+
+**Why:** Makes gstack a coach (challenges you) instead of a mirror (reflects
+you). The killer differentiation vs. a settings menu.
+
+**Pros:** The feature that makes gstack feel like Garry. Surfaces assumptions
+the user hasn't challenged.
+
+**Cons:** Logically conflicts with E1 (which adapts TO profile) and E6 (which
+flags mismatch). Requires interaction-budget design: global session budget +
+escalation rules + explicit exclusion from mismatch detection. Risk of feeling
+like a nag if fires wrong.
+
+**Context:** v2 must redesign to resolve the E1/E4/E6 composition issue Codex
+caught. Dogfood required to calibrate frequency.
+
+**Effort:** M (human: ~3 days / CC: ~2h design + ~1h impl)
+**Priority:** P0
+**Depends on:** E1 shipped + interaction-budget design spec.
+
+### E5 — LANDED celebration HTML page
+
+**What:** When a PR authored by the user is newly merged to the base branch,
+open an animated HTML celebration page in the browser. Confetti + typewriter
+headline + stats counter. Shows: what we built (PR stats + CHANGELOG entry),
+road traveled (scope decisions from CEO plan), road not traveled (deferred
+items), where we're going (next TODOs), who you are as a builder (vibe +
+narrative + profile delta for this ship). Self-contained HTML (CSS animations
+only, no JS deps).
+
+**CRITICAL REVISION from v0 plan:** Passive detection must NOT live in the
+preamble (Codex #9). When promoted, moves to explicit `/plan-tune show-landed`
+OR post-ship hook — not passive detection in the hot path.
+
+**Why:** Biggest personality moment in gstack. The "one-word thing that makes
+you remember why you built this."
+
+**Pros:** Screenshot-worthy. Shareable. The kind of dopamine hit that turns
+power users into evangelists.
+
+**Cons:** Product theater if the substrate isn't solid. Needs /design-shotgun
+→ /design-html for the visual direction. Requires E2 unified profile for
+narrative/vibe data.
+
+**Context:** /land-and-deploy trust/adoption is low, so passive detection is
+the right trigger shape. Dedup marker per PR in `~/.gstack/.landed-celebrated-*`.
+E2E tests for squash/merge-commit/rebase/co-author/fresh-clone/dedup variants.
+
+**Effort:** M+ (human: ~1 week / CC: ~3h total)
+**Priority:** P0
+**Depends on:** E3 narrative/vibe shipped. /design-shotgun run on real PR data
+to pick a visual direction, then /design-html to finalize.
+
+### E6 — Auto-adjustment based on declared ↔ inferred mismatch
+
+**What:** Currently `/plan-tune` shows the gap between declared and inferred
+(v1 observational). v2 auto-suggests declaration updates when the gap exceeds
+a threshold ("Your profile says hands-off but you've overridden 40% of
+recommendations — you're actually taste-driven. Update declared autonomy from
+0.8 to 0.5?"). Requires explicit user confirmation before any mutation (Codex
+trust-boundary #15 already baked into v1).
+
+**Why:** Profile drifts silently without correction. Self-correcting profile
+stays honest.
+
+**Pros:** Profile becomes more accurate over time. User sees the gap and
+decides.
+
+**Cons:** Requires stable inferred profile (diversity check). False positives
+nag the user.
+
+**Context:** v1 has `--check-mismatch` that flags > 0.3 gaps but doesn't
+suggest fixes. v2 adds the suggestion UX + per-dimension threshold tuning from
+real data.
+
+**Effort:** S (human: ~1 day / CC: ~45min)
+**Priority:** P0
+**Depends on:** Calibrated profile + real mismatch data from v1 dogfood.
+
+### E7 — Psychographic auto-decide
+
+**What:** When inferred profile is calibrated AND a question is two-way AND
+the user's dimensions strongly favor one option, auto-choose without asking
+(visible annotation: "Auto-decided via profile. Change with /plan-tune."). v1
+only auto-decides via EXPLICIT per-question preferences; v2 adds profile-driven
+auto-decide.
+
+**Why:** The whole point of the psychographic. Silent, correct defaults based
+on who the user IS, not just what they've said.
+
+**Pros:** Friction-free skill invocation for calibrated power users. Over time,
+gstack feels like it's reading your mind.
+
+**Cons:** Highest-risk deferral. Wrong auto-decides are costly. Requires very
+high confidence in the signal map AND calibration gate.
+
+**Context:** v1 diversity gate is `sample_size >= 20 AND skills_covered >= 3
+AND question_ids_covered >= 8 AND days_span >= 7`. v2 must prove this gate
+actually catches noisy profiles before shipping.
+
+**Effort:** M (human: ~3 days / CC: ~2h)
+**Priority:** P0
+**Depends on:** E1 (skills consuming profile) + real observed data showing
+calibration gate is trustworthy.
+
 ## Browse
 
 ### Scope sidebar-agent kill to session PID, not `pkill -f sidebar-agent\.ts`
diff --git a/VERSION b/VERSION
index 72ad141a..b4ee4869 100644
--- a/VERSION
+++ b/VERSION
@@ -1 +1 @@
-0.18.1.0
+0.19.0.0