mirror of https://github.com/garrytan/gstack.git synced 2026-05-01 19:25:10 +02:00

Garry Tan 0a803f9e81 feat: gstack v1 — simpler prompts + real LOC receipts (v1.0.0.0) (#1039)

* docs: add design doc for /plan-tune v1 (observational substrate)

Canonical record of the /plan-tune v1 design: typed question registry,
per-question explicit preferences, inline tune: feedback with user-origin
gate, dual-track profile (declared + inferred separately), and plain-English
inspection skill. Captures every decision with pros/cons, what's deferred to
v2 with explicit acceptance criteria, and what was rejected entirely.

Codex review drove a substantial scope rollback from the initial CEO
EXPANSION plan. 15+ legitimate findings (substrate claim was false without
a typed registry; E4/E6/clamp logical contradiction; profile poisoning
attack surface; LANDED preamble side effect; implementation order) shaped
the final design.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: typed question registry for /plan-tune v1 foundation

scripts/question-registry.ts declares 53 recurring AskUserQuestion categories
across 15 skills (ship, review, office-hours, plan-ceo-review, plan-eng-review,
plan-design-review, plan-devex-review, qa, investigate, land-and-deploy, cso,
gstack-upgrade, preamble, plan-tune, autoplan).

Each entry has: stable kebab-case id, skill owner, category (approval |
clarification | routing | cherry-pick | feedback-loop), door_type (one-way
| two-way), optional stable option keys, optional psychographic signal_key,
and a one-line description.

12 of 53 are one-way doors (destructive ops, architecture/data forks,
security/compliance). These are ALWAYS asked regardless of user preference.

Helpers: getQuestion(id), getOneWayDoorIds(), getAllRegisteredIds(),
getRegistryStats(). No binary or resolver wiring yet — this is the schema
substrate the rest of /plan-tune builds on.

Ad-hoc question_ids (not registered) still log but skip psychographic
signal attribution. Future /plan-tune skill surfaces frequently-firing
ad-hoc ids as candidates for registry promotion.
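
A minimal sketch of the registry shape described above, for orientation
only. Field and helper names follow this message; the two sample entries
(including the id `ship-changelog-polish`, their categories, and their
descriptions) are invented for illustration and are not taken from
scripts/question-registry.ts.

```typescript
// Hypothetical sketch: names follow the commit message, entries are made up.
type Category =
  | "approval" | "clarification" | "routing" | "cherry-pick" | "feedback-loop";
type DoorType = "one-way" | "two-way";

interface QuestionDef {
  id: string;          // stable kebab-case id, prefixed with the skill name
  skill: string;       // owning skill
  category: Category;
  door_type: DoorType;
  options?: string[];  // optional stable option keys
  signal_key?: string; // optional psychographic signal
  description: string; // one line
}

const QUESTIONS = new Map<string, QuestionDef>([
  ["ship-test-failure-triage", {
    id: "ship-test-failure-triage", skill: "ship", category: "approval",
    door_type: "one-way", description: "How to handle failing tests before shipping",
  }],
  ["ship-changelog-polish", {
    id: "ship-changelog-polish", skill: "ship", category: "approval",
    door_type: "two-way", signal_key: "detail-preference",
    description: "Whether to polish the changelog entry",
  }],
]);

function getQuestion(id: string): QuestionDef | undefined {
  return QUESTIONS.get(id);
}

function getOneWayDoorIds(): string[] {
  return Array.from(QUESTIONS.values())
    .filter((q) => q.door_type === "one-way")
    .map((q) => q.id);
}

function getAllRegisteredIds(): string[] {
  return Array.from(QUESTIONS.keys());
}

function getRegistryStats(): { total: number; one_way: number; two_way: number } {
  const one_way = getOneWayDoorIds().length;
  return { total: QUESTIONS.size, one_way, two_way: QUESTIONS.size - one_way };
}
```

An unregistered id simply returns undefined from getQuestion(), which is
what lets ad-hoc ids be logged without signal attribution.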

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: registry schema + safety + coverage tests (gate tier)

20 tests validating the question registry:

Schema (7 tests):
- Every entry has required fields
- All ids are kebab-case and start with their skill name
- No duplicate ids
- Categories are from the allowed set
- door_type is one-way | two-way
- Options arrays are well-formed
- Descriptions are short and single-line

Helpers (5 tests):
- getQuestion returns entry for known id, undefined for unknown
- getOneWayDoorIds includes destructive questions, excludes two-way
- getAllRegisteredIds count matches QUESTIONS keys
- getRegistryStats totals are internally consistent

One-way door safety (2 tests):
- Every critical question (test failure, SQL safety, LLM trust boundary,
  security scan, merge confirm, rollback, fix apply, premise revise,
  arch finding, privacy gate, user challenge) is declared one-way
- At least 10 one-way doors exist (catches regression if declarations
  are accidentally dropped)

Registry breadth (3 tests):
- 11 high-volume skills each have >= 1 registered question
- Preamble one-time prompts are registered
- /plan-tune's own questions are registered

Signal map references (1 test):
- signal_key values are typed kebab-case strings

Template coverage (2 tests, informational):
- AskUserQuestion usage across templates is non-trivial (>20)
- Registry spans >= 10 skills

20 pass, 0 fail.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: one-way door classifier (belt-and-suspenders safety fallback)

scripts/one-way-doors.ts — secondary keyword-pattern classifier that catches
destructive questions even when the registry doesn't have an entry for them.

The registry's door_type field (from scripts/question-registry.ts) is the
PRIMARY safety gate. This classifier is the fallback for ad-hoc question_ids
that agents generate at runtime.

Classification priority:
  1. Registry lookup by question_id → use declared door_type
  2. Skill:category fallback (cso:approval, land-and-deploy:approval)
  3. Keyword pattern match against question_summary
  4. Default: treat as two-way (safer to log the miss than auto-decide unsafely)

Covers 21 destructive patterns across:
  - File system (rm -rf, delete, wipe, purge, truncate)
  - Database (drop table/database/schema, delete from)
  - Git/VCS (force-push, reset --hard, checkout --, branch -D)
  - Deploy/infra (kubectl delete, terraform destroy, rollback)
  - Credentials (revoke/reset/rotate API key|token|secret|password)
  - Architecture (breaking change, schema migration, data model change)
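
The priority order above can be sketched roughly as follows. This is an
illustration under assumptions, not the contents of
scripts/one-way-doors.ts: `classifyDoor`, `AskEvent`, and the five regexes
are hypothetical stand-ins for the 21 real patterns.

```typescript
// Hypothetical sketch of the four-step classification order.
type DoorType = "one-way" | "two-way";

interface AskEvent {
  question_id: string;
  skill: string;
  category?: string;
  question_summary: string;
}

// Step 1 input: declared door types (stand-in for the registry).
const REGISTRY_DOORS = new Map<string, DoorType>([
  ["ship-test-failure-triage", "one-way"],
]);

// Step 2 input: skill:category pairs that are always one-way.
const ONE_WAY_SKILL_CATEGORIES = new Set(["cso:approval", "land-and-deploy:approval"]);

// Step 3 input: a small, assumed subset of destructive patterns.
const DESTRUCTIVE_PATTERNS: RegExp[] = [
  /\brm\s+-rf\b/i,
  /\bdrop\s+(table|database|schema)\b/i,
  /\bforce[- ]push\b/i,
  /\bterraform\s+destroy\b/i,
  // Allows filler words, e.g. "rotate the API key".
  /\b(revoke|reset|rotate)\b.{0,20}\b(api key|token|secret|password)\b/i,
];

function classifyDoor(e: AskEvent): DoorType {
  const declared = REGISTRY_DOORS.get(e.question_id);
  if (declared !== undefined) return declared;                      // 1. registry
  if (e.category !== undefined &&
      ONE_WAY_SKILL_CATEGORIES.has(`${e.skill}:${e.category}`)) {
    return "one-way";                                               // 2. skill:category
  }
  if (DESTRUCTIVE_PATTERNS.some((re) => re.test(e.question_summary))) {
    return "one-way";                                               // 3. keywords
  }
  return "two-way";                                                 // 4. default
}
```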

7 new tests in test/plan-tune.test.ts covering: registry-first lookup,
unknown-id fallthrough, keyword matching on destructive phrasings including
embedded filler words ("rotate the API key"), skill-category fallback,
benign questions defaulting to two-way, pattern-list non-empty.

27 pass, 0 fail. 1270 expect() calls.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: psychographic signal map + builder archetypes

scripts/psychographic-signals.ts — hand-crafted {signal_key, user_choice} →
{dimension, delta} map. Version 0.1.0. Conservative deltas (±0.03 to ±0.06
per event). Covers 9 signal keys: scope-appetite, architecture-care,
code-quality-care, test-discipline, detail-preference, design-care,
devex-care, distribution-care, session-mode.

Helpers: applySignal() mutates running totals, newDimensionTotals() creates
empty starting state, normalizeToDimensionValue() sigmoid-clamps accumulated
delta to [0,1] (0 → 0.5 neutral), validateRegistrySignalKeys() checks that
every signal_key in the registry has a SIGNAL_MAP entry.

In v1 the signal map is used ONLY to compute inferred dimension values for
/plan-tune inspection output. No skill behavior adapts to these signals
until v2.
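
A rough sketch of how the helpers named above could fit together. The
helper names come from this message; the flat map layout, the two sample
entries, their ±0.06 deltas, and the logistic `steepness` parameter are
assumptions made for the example.

```typescript
// Hypothetical sketch: helper names are from the commit message,
// the map contents and the steepness constant are assumed.
type Dimension =
  | "scope_appetite" | "risk_tolerance" | "detail_preference"
  | "autonomy" | "architecture_care";
type DimensionTotals = Record<Dimension, number>;

// Keyed as "signal_key:user_choice" to keep the lookup flat.
const SIGNAL_MAP = new Map<string, { dimension: Dimension; delta: number }>([
  ["scope-appetite:expand", { dimension: "scope_appetite", delta: 0.06 }],
  ["scope-appetite:reduce", { dimension: "scope_appetite", delta: -0.06 }],
]);

function newDimensionTotals(): DimensionTotals {
  return {
    scope_appetite: 0, risk_tolerance: 0, detail_preference: 0,
    autonomy: 0, architecture_care: 0,
  };
}

// Mutates the running totals; returns false for unknown key/choice pairs.
function applySignal(totals: DimensionTotals, signalKey: string, userChoice: string): boolean {
  const hit = SIGNAL_MAP.get(`${signalKey}:${userChoice}`);
  if (hit === undefined) return false;
  totals[hit.dimension] += hit.delta;
  return true;
}

// Logistic squash: an accumulated delta of 0 maps to the neutral 0.5,
// and large positive or negative totals approach 1 or 0.
function normalizeToDimensionValue(total: number, steepness = 4): number {
  return 1 / (1 + Math.exp(-steepness * total));
}
```

Because each event contributes only a small delta, a handful of answers
nudges a dimension away from 0.5 without pinning it to an extreme.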

scripts/archetypes.ts — 8 named archetypes + Polymath fallback:
- Cathedral Builder (boil-the-ocean + architecture-first)
- Ship-It Pragmatist (small scope + fast)
- Deep Craft (detail-verbose + principled)
- Taste Maker (intuitive, overrides recommendations)
- Solo Operator (high-autonomy, delegates)
- Consultant (hands-on, consulted on everything)
- Wedge Hunter (narrow scope aggressively)
- Builder-Coach (balanced steering)
- Polymath (fallback when no archetype matches)

matchArchetype() uses L2 distance scaled by tightness, with a 0.55 threshold
below which we return Polymath. v1 ships the model stable; v2 narrative/vibe
commands wire it into user-facing output.

14 new tests: signal map consistency vs registry, applySignal behavior for
known/unknown keys, normalization bounds, archetype schema validity, name
uniqueness, matchArchetype correctness for each reference profile, Polymath
fallback for outliers.

41 pass, 0 fail total in test/plan-tune.test.ts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: bin/gstack-question-log — append validated AskUserQuestion events

Append-only JSONL log at ~/.gstack/projects/{SLUG}/question-log.jsonl.
Schema: {skill, question_id, question_summary, category?, door_type?,
options_count?, user_choice, recommended?, followed_recommendation?,
session_id?, ts}

Validates:
- skill is kebab-case
- question_id is kebab-case, <= 64 chars
- question_summary non-empty, <= 200 chars, newlines flattened
- category is one of approval/clarification/routing/cherry-pick/feedback-loop
- door_type is one-way or two-way
- options_count is integer in [1, 26]
- user_choice non-empty string, <= 64 chars

Injection defense on question_summary rejects the same patterns as
gstack-learnings-log (ignore previous instructions, system:, override:,
do not report, etc).

followed_recommendation is auto-computed when both user_choice and
recommended are present.

ts auto-injected as ISO 8601 if missing.
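
For illustration, the validation rules above could look roughly like this.
`validateQuestionEvent` and the constant names are hypothetical, the
injection list is abbreviated, and truncating (rather than rejecting)
over-long summaries is inferred from the "long-summary truncation" test.

```typescript
// Hypothetical sketch of the validation rules; not the real binary.
const KEBAB = /^[a-z0-9]+(?:-[a-z0-9]+)*$/;
const CATEGORIES = ["approval", "clarification", "routing", "cherry-pick", "feedback-loop"];
const DOOR_TYPES = ["one-way", "two-way"];
const INJECTION = [/ignore previous instructions/i, /\bsystem:/i, /\boverride:/i, /do not report/i];

interface QuestionEvent {
  skill: string;
  question_id: string;
  question_summary: string;
  user_choice: string;
  category?: string;
  door_type?: string;
  options_count?: number;
  recommended?: string;
  followed_recommendation?: boolean;
  session_id?: string;
  ts?: string;
}

interface ValidationResult { ok: boolean; record?: QuestionEvent; error?: string }

function validateQuestionEvent(input: QuestionEvent): ValidationResult {
  const fail = (error: string): ValidationResult => ({ ok: false, error });
  if (!KEBAB.test(input.skill)) return fail("skill must be kebab-case");
  if (!KEBAB.test(input.question_id) || input.question_id.length > 64) {
    return fail("question_id must be kebab-case and <= 64 chars");
  }
  // Flatten newlines to single spaces, then cap at 200 chars.
  const summary = input.question_summary.replace(/\s*[\r\n]+\s*/g, " ").trim().slice(0, 200);
  if (summary.length === 0) return fail("question_summary is empty");
  if (INJECTION.some((re) => re.test(summary))) return fail("question_summary rejected");
  if (input.category !== undefined && CATEGORIES.indexOf(input.category) === -1) {
    return fail("invalid category");
  }
  if (input.door_type !== undefined && DOOR_TYPES.indexOf(input.door_type) === -1) {
    return fail("invalid door_type");
  }
  const n = input.options_count;
  if (n !== undefined && (!Number.isInteger(n) || n < 1 || n > 26)) {
    return fail("options_count must be an integer in [1, 26]");
  }
  if (input.user_choice.length === 0 || input.user_choice.length > 64) {
    return fail("user_choice must be 1..64 chars");
  }
  const record: QuestionEvent = {
    ...input,
    question_summary: summary,
    ts: input.ts !== undefined ? input.ts : new Date().toISOString(),
  };
  // Auto-compute only when a recommendation was recorded.
  if (input.recommended !== undefined) {
    record.followed_recommendation = input.user_choice === input.recommended;
  }
  return { ok: true, record };
}
```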

21 tests covering: valid payloads, full field preservation, auto-followed
computation, appending, long-summary truncation, newline flattening,
invalid JSON, missing fields, bad case, oversized ids, invalid enum
values, out-of-range options_count, and 6 injection attack patterns.

21 pass, 0 fail, 43 expect() calls.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: bin/gstack-developer-profile — unified profile with migration

bin/gstack-developer-profile supersedes bin/gstack-builder-profile. The old
binary becomes a one-line legacy shim delegating to --read for /office-hours
backward compat.

Subcommands:
  --read              legacy KEY:VALUE output (tier, session_count, etc)
  --migrate           folds ~/.gstack/builder-profile.jsonl into
                      ~/.gstack/developer-profile.json. Atomic (temp + rename),
                      idempotent (no-op when target exists or source absent),
                      archives source as .migrated-YYYY-MM-DD-HHMMSS
  --derive            recomputes inferred dimensions from question-log.jsonl
                      using the signal map in scripts/psychographic-signals.ts
  --profile           full profile JSON
  --gap               declared vs inferred diff JSON
  --trace <dim>       event-level trace of what contributed to a dimension
  --check-mismatch    flags dimensions where declared and inferred disagree by
                      > 0.3 (requires >= 10 events first)
  --vibe              archetype name + description from scripts/archetypes.ts
  --narrative         (v2 stub)

Auto-migration on first read: if legacy file exists and new file doesn't,
migrate before reading. Creates a neutral (all-0.5) stub if nothing exists.

Unified schema (see docs/designs/PLAN_TUNING_V0.md §Architecture):
  {identity, declared, inferred: {values, sample_size, diversity},
   gap, overrides, sessions, signals_accumulated, schema_version}
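
As an illustration of the --check-mismatch rule (declared and inferred
differ by more than 0.3, with at least 10 events), a sketch under
assumptions: `checkMismatch` and its parameter names are hypothetical, and
treating `sample_size` as the event count is a guess about the schema.

```typescript
// Hypothetical sketch of the mismatch rule; not the real binary's code.
type DimensionValues = Record<string, number>;

function checkMismatch(
  declared: DimensionValues,
  inferred: DimensionValues,
  sampleSize: number,
  threshold = 0.3,
  minEvents = 10,
): string[] {
  // Too few events: the inferred values are not trusted yet.
  if (sampleSize < minEvents) return [];
  return Object.keys(declared).filter((dim) => {
    const d = declared[dim];
    const i = inferred[dim];
    return typeof d === "number" && typeof i === "number" && Math.abs(d - i) > threshold;
  });
}
```

With declared scope_appetite at 0.9 and inferred at 0.5, the dimension is
flagged once 10 or more events exist and stays silent before that.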

25 new tests across subcommand behaviors:
- --read defaults + stub creation
- --migrate: 3 sessions preserved with signal tallies, idempotency, archival
- Tier calculation: welcome_back / regular / inner_circle boundaries
- --derive: neutral-when-empty, upward nudge on 'expand', downward on 'reduce',
  recomputable (same input → same output), ad-hoc unregistered ids ignored
- --trace: contributing events, empty for untouched dims, error without arg
- --gap: empty when no declared, correctly computed otherwise
- --vibe: returns archetype name + description
- --check-mismatch: threshold behavior, 10+ sample requirement
- Unknown subcommand errors

25 pass, 0 fail, 60 expect() calls.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: bin/gstack-question-preference — explicit preferences + user-origin gate

Subcommands:
  --check <id>   → ASK_NORMALLY | AUTO_DECIDE  (decides if a registered
                   question should be auto-decided by the agent)
  --write '{…}'  → set a preference (requires user-origin source)
  --read         → dump preferences JSON
  --clear [id]   → clear one or all
  --stats        → short counts summary

Preference values: always-ask | never-ask | ask-only-for-one-way.
Stored at ~/.gstack/projects/{SLUG}/question-preferences.json.

Safety contract (the core of Codex finding #16, profile-poisoning defense
from docs/designs/PLAN_TUNING_V0.md §Security model):

  1. One-way doors ALWAYS return ASK_NORMALLY from --check, regardless of
     user preference. User's never-ask is overridden with a visible safety
     note so the user knows why their preference didn't suppress the prompt.

  2. --write requires an explicit `source` field:
       - Allowed:  "plan-tune", "inline-user"
       - REJECTED with exit code 2: "inline-tool-output", "inline-file",
         "inline-file-content", "inline-unknown"
     Rejection is explicit ("profile poisoning defense") so the caller can
     log and surface the attempt.

  3. free_text on --write is sanitized against injection patterns (ignore
     previous instructions, override:, system:, etc.) and newline-flattened.

Each --write also appends a preference-set event to
~/.gstack/projects/{SLUG}/question-events.jsonl for derivation audit trail.
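
The first two points of the safety contract can be sketched as below. The
function names and note text are hypothetical. Returning exit code 2 for
missing or unknown sources is an assumption; this message only states it
for the four listed inline-* sources.

```typescript
// Hypothetical sketch of --check and the --write source gate.
type Preference = "always-ask" | "never-ask" | "ask-only-for-one-way";
type DoorType = "one-way" | "two-way";
type Decision = "ASK_NORMALLY" | "AUTO_DECIDE";

function checkPreference(door: DoorType, pref?: Preference): { decision: Decision; note?: string } {
  if (door === "one-way") {
    // One-way doors are always asked; explain when never-ask is overridden.
    if (pref === "never-ask") {
      return { decision: "ASK_NORMALLY", note: "one-way door: never-ask overridden for safety" };
    }
    return { decision: "ASK_NORMALLY" };
  }
  if (pref === "never-ask" || pref === "ask-only-for-one-way") {
    return { decision: "AUTO_DECIDE" };
  }
  return { decision: "ASK_NORMALLY" }; // always-ask, or no preference set
}

const ALLOWED_SOURCES = new Set(["plan-tune", "inline-user"]);

function gateWrite(source?: string): { accepted: boolean; exitCode: number; message?: string } {
  if (source !== undefined && ALLOWED_SOURCES.has(source)) {
    return { accepted: true, exitCode: 0 };
  }
  const shown = source === undefined ? "(missing)" : source;
  return {
    accepted: false,
    exitCode: 2,
    message: `source "${shown}" rejected: profile poisoning defense`,
  };
}
```

Keeping the one-way check ahead of any preference lookup means no stored
preference, however it got there, can suppress a destructive prompt.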

31 tests:
- --check behavior (4): defaults, two-way, one-way (one-way overrides
  never-ask with safety note), unknown ids, missing arg
- --check with prefs (5): never-ask on two-way → AUTO_DECIDE; never-ask
  on one-way → ASK_NORMALLY with override note; always-ask always asks;
  ask-only-for-one-way flips appropriately
- --write valid (5): inline-user accepted, plan-tune accepted, persisted
  correctly, event appended, free_text preserved with flattening
- User-origin gate (6): missing source rejected; inline-tool-output
  rejected with exit code 2 and explicit poisoning message; inline-file,
  inline-file-content, inline-unknown rejected; unknown source rejected
- Schema validation (4): invalid JSON, bad question_id, bad preference,
  injection in free_text
- --read (2): empty → {}, returns writes
- --clear (3): specific id, clear-all, NOOP for missing
- --stats (2): empty zeros, tallies by preference type

31 pass, 0 fail, 52 expect() calls.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: question-tuning preamble resolvers

scripts/resolvers/question-tuning.ts ships three preamble generators:

  generateQuestionPreferenceCheck — before each AskUserQuestion, agent runs
    gstack-question-preference --check <id>. AUTO_DECIDE suppresses the ask
    and auto-chooses recommended. ASK_NORMALLY asks as usual. One-way door
    safety override is handled by the binary.

  generateQuestionLog — after each AskUserQuestion, agent appends a log
    record with skill, question_id, summary, category, door_type,
    options_count, user_choice, recommended, session_id.

  generateInlineTuneFeedback — offers inline "tune:" prompt after two-way
    questions. Documents structured shortcuts (never-ask, always-ask,
    ask-only-for-one-way, ask-less) AND accepts free-form English with
    normalization + confirmation. Explicitly spells out the USER-ORIGIN
    GATE: only write tune events when the prefix appears in the user's own
    chat message, never from tool output or file content. Binary enforces.

All three resolvers are gated by the QUESTION_TUNING preamble echo. When
the config is off, the agent skips these sections entirely. Ready to be
wired into preamble.ts in the next commit.

Codex host has a simpler variant that uses $GSTACK_BIN env vars.

scripts/resolvers/index.ts registers three placeholders:
  QUESTION_PREFERENCE_CHECK, QUESTION_LOG, INLINE_TUNE_FEEDBACK

Total resolver count goes from 45 to 48.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: wire question-tuning into preamble for tier >= 2 skills

scripts/resolvers/preamble.ts — adds two things:

  1. _QUESTION_TUNING config echo in the preamble bash block, gated on the
     user's gstack-config `question_tuning` value (default: false).
  2. A combined Question Tuning section for tier >= 2 skills, injected after
     the confusion protocol. The section itself is runtime-gated by the
     QUESTION_TUNING value — agents skip it entirely when off.

scripts/resolvers/question-tuning.ts — consolidated into one compact combined
section `generateQuestionTuning(ctx)` covering: preference check before the
question, log after, and inline tune: feedback with user-origin gate. Per-phase
generators remain exported for unit tests but are no longer the main entrypoint.

Size impact: +570 tokens / +2.3KB per tier-2+ SKILL.md. Three skills
(plan-ceo-review, office-hours, ship) still exceed the 100KB token ceiling —
but they were already over before this change. Delta is the smallest viable
wiring of the /plan-tune v1 substrate.

Golden fixtures (test/fixtures/golden/claude-ship, codex-ship, factory-ship)
regenerated to match the new baseline.

Full test run: 1149 pass, 0 fail, 113 skip across 28 files.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: regenerate SKILL.md files with question-tuning section

bun run gen:skill-docs --host all after wiring the QUESTION_TUNING preamble
section. Every tier >= 2 skill now includes the combined Question Tuning
guidance. Runtime-gated — agents skip the section when question_tuning is
off in gstack-config (default).

Golden fixtures (claude-ship, codex-ship, factory-ship) updated to the new
baseline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: /plan-tune skill — conversational inspection + preferences

plan-tune/SKILL.md.tmpl: the user-facing skill for /plan-tune v1. Routes
plain-English intent to one of 8 flows:

  - Enable + setup (first-time): 5 declaration questions mapping to the
    5 psychographic dimensions (scope_appetite, risk_tolerance,
    detail_preference, autonomy, architecture_care). Writes to
    developer-profile.json declared.*.
  - Inspect profile: plain-English rendering of declared + inferred + gap.
    Uses word bands (low/balanced/high) not raw floats. Shows vibe archetype
    when calibration gate is met.
  - Review question log: top-20 question frequencies with follow/override
    counts. Highlights override-heavy questions as candidates for never-ask.
  - Set a preference: normalizes "stop asking me about X" → never-ask, etc.
    Confirms ambiguous phrasings before writing via gstack-question-preference.
  - Edit declared profile: interprets free-form ("more boil-the-ocean") and
    CONFIRMS before mutating declared.* (trust boundary per Codex #15).
  - Show gap: declared vs inferred diff with plain-English severity bands
    (close / drift / mismatch). Never auto-updates declared from the gap.
  - Stats: preference counts + diversity/calibration status.
  - Enable / disable: gstack-config set question_tuning true|false.

Design constraints enforced:
- Plain English everywhere. No CLI subcommand syntax required. Shortcuts
  (`profile`, `vibe`, `stats`, `setup`) exist but are optional.
- user-origin gate on tune: writes. source: "plan-tune" for user-invoked
  /plan-tune; source: "inline-user" for inline tune: from other skills.
- One-way doors override never-ask (safety, surfaced to user).
- No behavior adaptation in v1 — this skill inspects and configures only.

Generates plan-tune/SKILL.md at ~11.6k tokens, well under the 100KB ceiling.
Generated for all hosts via `bun run gen:skill-docs --host all`.

Full free test suite: 1149 pass, 0 fail, 113 skip across 28 files.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: end-to-end pipeline + preamble injection coverage

Added 6 tests to test/plan-tune.test.ts:

Preamble injection (3 tests):
- tier 2+ includes Question Tuning section with preference check, log,
  and user-origin gate language ('profile-poisoning defense', 'inline-user')
- tier 1 does NOT include the prose section (QUESTION_TUNING bash echo
  still fires since it's in the bash block all tiers share)
- codex host swaps binDir references to $GSTACK_BIN

End-to-end pipeline (3 tests) — real binaries working together, not mocks:
- Log 5 expand choices → --derive → profile shows scope_appetite > 0.5
  (full log → registry lookup → signal map → normalization round-trip)
- --write source: inline-tool-output rejected; --read confirms no pref
  was persisted (the profile-poisoning defense actually works end-to-end)
- Migrate a 3-session legacy file; confirm legacy gstack-builder-profile
  shim still returns SESSION_COUNT: 3, TIER: welcome_back, CROSS_PROJECT: true

test/plan-tune.test.ts now has 47 tests total.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: E2E test for /plan-tune plain-English inspection flow (gate tier)

test/skill-e2e-plan-tune.test.ts — verifies /plan-tune correctly routes
plain-English intent ("review the questions I've been asked") to the
Review question log section without requiring CLI subcommand syntax.

Seeds a synthetic question-log.jsonl with 3 entries exercising:
- override behavior (user chose expand over recommended selective)
- one-way door respect (user followed ship-test-failure-triage recommendation)
- two-way override (user skipped recommended changelog polish)

Invokes the skill via `claude -p` and asserts:
- Agent surfaces >= 2 of 3 logged question_ids in output
- Agent notices override/skip behavior from the log
- Exit reason is success or error_max_turns (not agent-crash)

Gate-tier because the core v1 DX promise is plain-English intent routing.
If it requires memorized subcommands or breaks on natural language, that's
a regression of the defining feature.

Registered in test/helpers/touchfiles.ts with dependencies:
- plan-tune/** (skill template + generated md)
- scripts/question-registry.ts (required for log lookup)
- scripts/psychographic-signals.ts, scripts/one-way-doors.ts (derive path)
- bin/gstack-question-log, gstack-question-preference, gstack-developer-profile

Skipped when EVALS_ENABLED is not set; runs on `bun run test:evals`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v0.19.0.0) — /plan-tune v1

Ships /plan-tune as observational substrate: typed question registry, dual-track
developer profile (declared + inferred), explicit per-question preferences with
user-origin gate, inline tune: feedback across every tier >= 2 skill, unified
developer-profile.json with migration from builder-profile.jsonl.

Scope rolled back from initial CEO EXPANSION plan after outside-voice review
(Codex). 6 deferrals tracked as P0 TODOs with explicit acceptance criteria:
E1 substrate wiring, E3 narrative/vibe, E4 blind-spot coach, E5 LANDED
celebration, E6 auto-adjustment, E7 psychographic auto-decide.

See docs/designs/PLAN_TUNING_V0.md for the full design record.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(ci): harden Dockerfile.ci against transient Ubuntu mirror failures

The CI image build failed with:
  E: Failed to fetch http://archive.ubuntu.com/ubuntu/pool/main/...
     Connection failed [IP: 91.189.92.22 80]
  ERROR: process "/bin/sh -c apt-get update && apt-get install ..."
     did not complete successfully: exit code: 100

archive.ubuntu.com periodically returns "connection refused" on individual
regional mirrors. Without retry logic a single failed fetch nukes the whole
Docker build. Three defenses, layered:

  1. /etc/apt/apt.conf.d/80-retries — apt fetches each package up to 5 times
     with a 30s timeout. Handles per-package flakes.
  2. Shell-loop retry around the whole apt-get step (x3, 10s sleep) — handles
     the case where apt-get update itself can't reach any mirror.
  3. --retry 5 --retry-delay 5 --retry-connrefused on all curl fetches (bun
     install script, GitHub CLI keyring, NodeSource setup script).

Applied to every apt-get and curl call in the Dockerfile. No behavior change
on happy path — only kicks in when mirrors blip. Fixes the build-image job
that was blocking CI on the /plan-tune PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: add PLAN_TUNING_V1 + PACING_UPDATES_V0 design docs

Captures the V1 design (ELI10 writing + LOC reframe) in
docs/designs/PLAN_TUNING_V1.md and the extracted V1.1 pacing-overhaul
plan in docs/designs/PACING_UPDATES_V0.md. V1 scope was reduced from
the original bundled pacing + writing-style plan after three
engineering-review passes revealed structural gaps in the pacing
workstream that couldn't be closed via plan-text editing. TODOS.md
P0 entry links to V1.1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: curated jargon list for V1 writing-style glossing

Repo-owned list of ~50 high-frequency technical terms (idempotent,
race condition, N+1, backpressure, etc.) that gstack glosses on first
use in tier-≥2 skill output. Baked into generated SKILL.md prose at
gen-skill-docs time. Terms not on this list are assumed plain-English
enough. Contributions via PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(preamble): V1 Writing Style section + EXPLAIN_LEVEL echo + migration prompt

Adds a new Writing Style section to tier-≥2 preamble output composing with
the existing AskUserQuestion Format section. Six rules: jargon glossed on
first use per skill invocation (from scripts/jargon-list.json), outcome-
framed questions, short sentences, decisions close with user impact,
gloss-on-first-use even if user pasted term, user-turn override for "be
terse" requests. Baked conditionally (skip if EXPLAIN_LEVEL: terse).

Adds EXPLAIN_LEVEL preamble echo using ${binDir} (host-portable matching
V0 QUESTION_TUNING pattern). Adds WRITING_STYLE_PENDING echo reading a
flag file written by the V0→V1 upgrade migration; on first post-upgrade
skill run, the agent fires a one-time AskUserQuestion offering terse mode.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(gstack-config): validate explain_level + document in header

Adds explain_level: default|terse to the annotated config header with
a one-line description. Whitelists valid values; on set of an unknown
value, prints a specific warning ("explain_level '$VALUE' not
recognized. Valid values: default, terse. Using default.") and writes
the default value. Matches V1 preamble's EXPLAIN_LEVEL echo expectation.
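
A small sketch of the whitelist behavior, for illustration only.
gstack-config itself is not shown in this message, so
`normalizeExplainLevel` is a hypothetical name; the warning text is the
one quoted above.

```typescript
// Hypothetical sketch of explain_level validation.
type ExplainLevel = "default" | "terse";

function normalizeExplainLevel(value: string): { value: ExplainLevel; warning?: string } {
  if (value === "default" || value === "terse") {
    return { value };
  }
  // Unknown values warn and fall back to the default.
  return {
    value: "default",
    warning: `explain_level '${value}' not recognized. Valid values: default, terse. Using default.`,
  };
}
```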

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: V1 upgrade migration — writing-style opt-out prompt

New migration script following existing v0.15.2.0.sh / v0.16.2.0.sh
pattern. Writes a .writing-style-prompt-pending flag file on first run
post-upgrade. The preamble's migration-prompt block reads the flag and
fires a one-time AskUserQuestion offering the user a choice between
the new default writing style and restoring V0 prose via
`gstack-config set explain_level terse`. Idempotent via flag files;
if the user has already set explain_level explicitly, counts as
answered and skips.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: LOC reframe tooling — throughput comparison + README updater + scc installer

Three new scripts:

- scripts/garry-output-comparison.ts — enumerates Garry-authored commits
  in 2013 + 2026 on public repos, extracts ADDED lines from git diff,
  classifies as logical SLOC via scc --stdin (regex fallback if scc
  missing). Writes docs/throughput-2013-vs-2026.json with per-language
  breakdown + explicit caveats (public repos only, commit-style drift,
  private-work exclusion).

- scripts/update-readme-throughput.ts — reads the JSON if present,
  replaces the README's <!-- GSTACK-THROUGHPUT-PLACEHOLDER --> anchor
  with the computed multiple (preserving the anchor for future runs).
  If JSON missing, writes GSTACK-THROUGHPUT-PENDING marker that CI
  rejects — forcing the build to run before commit.

- scripts/setup-scc.sh — standalone OS-detecting installer for scc.
  Not a package.json dependency (95% of users never run throughput).
  Brew on macOS, apt on Linux, GitHub releases link on Windows.

Two-string anchor pattern (PLACEHOLDER vs PENDING) prevents the
pipeline from destroying its own update path.
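
One way the two-string pattern could work, sketched under assumptions:
the value is written as a single whitespace-free token directly after the
anchor comment, so the anchor survives every run and a later run can
overwrite either the PENDING marker or an older number. `renderReadme`
and `passesCiGate` are hypothetical names, not the real script's API.

```typescript
// Hypothetical sketch of the PLACEHOLDER / PENDING anchor pattern.
const PLACEHOLDER = "<!-- GSTACK-THROUGHPUT-PLACEHOLDER -->";
const PENDING = "GSTACK-THROUGHPUT-PENDING";

// Matches the anchor plus whatever token a previous run wrote after it.
const ANCHOR_RE = /<!-- GSTACK-THROUGHPUT-PLACEHOLDER -->\S*/;

function renderReadme(readme: string, multiple: number | null): string {
  const value = multiple === null ? PENDING : `${multiple}×`;
  // Replacer function avoids special handling of "$" in the replacement.
  return readme.replace(ANCHOR_RE, () => `${PLACEHOLDER}${value}`);
}

// CI rejects a README that still carries the PENDING marker.
function passesCiGate(readme: string): boolean {
  return readme.indexOf(PENDING) === -1;
}
```

Because PENDING is a different string from the anchor, the CI gate can
reject an unfilled README without also rejecting the anchor itself.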

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(retro): surface logical SLOC + weighted commits above raw LOC

V1 reorders the /retro summary table to lead with features shipped,
then commits + weighted commits (commits × files-touched capped at 20),
then PRs merged, then logical SLOC added as the primary code-volume
metric. Raw LOC stays present but is demoted to context. Rationale
inline in the template: ten lines of a good fix is not less shipping
than ten thousand lines of scaffold.
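
One possible reading of "commits × files-touched capped at 20", sketched
for illustration: each commit is weighted by the number of files it
touched, with the per-commit weight capped at 20. The function name and
this interpretation are assumptions; the /retro template may define the
metric differently.

```typescript
// Hypothetical sketch: per-commit weight = min(files touched, cap).
function weightedCommits(filesTouchedPerCommit: number[], cap = 20): number {
  return filesTouchedPerCommit.reduce((sum, files) => sum + Math.min(files, cap), 0);
}

// Three commits touching 1, 3, and 250 files weigh 1 + 3 + 20 = 24,
// so one sweeping regeneration commit cannot dominate the total.
const exampleWeight = weightedCommits([1, 3, 250]);
```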

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(v1): README hero reframe + writing-style + CHANGELOG + version bump to 1.0.0.0

README.md:
- Hero removes "600,000+ lines of production code" framing; replaces
  with the computed 2013-vs-2026 pro-rata multiple (via
  <!-- GSTACK-THROUGHPUT-PLACEHOLDER --> anchor, filled by the
  update-readme-throughput build step).
- Hiring callout: "ship real products at AI-coding speed" instead of
  "10K+ LOC/day."
- New Writing Style section (~80 words) between Quick start and
  Install: "v1 prompts = simpler" framing, outcome-language example,
  terse-mode opt-out, pointer to /plan-tune.

CLAUDE.md: one-paragraph Writing style (V1) note under project
conventions, linking to preamble resolver + V1 design docs.

CHANGELOG.md: V1 entry on top of v0.19.0.0 with user-facing narrative
(what changes, how to opt out, for-contributors notes). Mentions
scope reduction — pacing overhaul ships in V1.1.

CONTRIBUTING.md: one-paragraph note on jargon-list.json maintenance
(PR to add/remove terms; regenerate via gen:skill-docs).

VERSION + package.json: bump to 1.0.0.0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: regenerate SKILL.md files + golden fixtures for V1

Mechanical regeneration from the updated templates in prior commits:
- Writing Style section now appears in tier-≥2 skill output.
- EXPLAIN_LEVEL + WRITING_STYLE_PENDING echoes in preamble bash.
- V1 migration-prompt block fires conditionally on first upgrade.
- Jargon list inlined into preamble prose at gen time.
- Retro template's logical SLOC + weighted commits order applied.

Regenerated for all 8 hosts via bun run gen:skill-docs --host all.
Golden ship-skill fixtures refreshed from regenerated outputs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: V1 gate coverage — writing-style resolver + config + jargon + migration + dormancy

Six new gate-tier test files:

- test/writing-style-resolver.test.ts — asserts Writing Style section
  is injected into tier-≥2 preamble, all 6 rules present, jargon list
  inlined, terse-mode gate condition present, Codex output uses
  $GSTACK_BIN (not ~/.claude/), tier-1 does NOT get the section,
  migration-prompt block present.

- test/explain-level-config.test.ts — gstack-config set/get round-trip
  for default + terse, unknown-value warns + defaults to default,
  header documents the key, round-trip across set→set→get.

- test/jargon-list.test.ts — shape + ~50 terms + no duplicates
  (case-insensitive) + includes canonical high-signal terms.

- test/v0-dormancy.test.ts — 5D dimension names + archetype names
  forbidden in default-mode tier-≥2 SKILL.md output, except for
  plan-tune and office-hours where they're load-bearing.

- test/readme-throughput.test.ts — script replaces anchor with number
  on happy path, writes PENDING marker when JSON missing, CI gate
  asserts committed README contains no PENDING string.

- test/upgrade-migration-v1.test.ts — fresh run writes pending flag,
  idempotent after user-answered, pre-existing explain_level counts
  as answered.

All 95 V1 test-expect() calls pass. Full suite: 0 failures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: compute real 2013-vs-2026 throughput multiple (130.2×)

Ran scripts/garry-output-comparison.ts across all 15 public garrytan/*
repos. Aggregated results into docs/throughput-2013-vs-2026.json and
ran scripts/update-readme-throughput.ts to replace the README placeholder.

2013 public activity: 2 commits, 2,384 logical lines added across 1
week, in 1 repo (zurb-foundation-wysihtml5 upstream contribution).

2026 public activity: 279 commits, 310,484 logical lines added across
17 active weeks, in 3 repos (gbrain, gstack, resend_robot).

Multiples (public repos only, apples-to-apples):
- Logical SLOC: 130.2×
- Commits per active week: 8.2×
- Raw lines added: 134.4×

Private work in both eras (2013 Bookface at YC, Posterous-era code,
2026 internal tools) is excluded from this comparison.
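
The logical-SLOC and commits-per-week multiples follow directly from the
figures quoted above; a short worked check (variable names are
illustrative, and the raw-lines multiple is omitted because the raw
totals are not quoted here):

```typescript
// Worked arithmetic from the public-repo figures in this message.
interface YearActivity { commits: number; logicalLines: number; activeWeeks: number }

const y2013: YearActivity = { commits: 2, logicalLines: 2384, activeWeeks: 1 };
const y2026: YearActivity = { commits: 279, logicalLines: 310484, activeWeeks: 17 };

// 310,484 / 2,384 ≈ 130.24
const logicalMultiple = y2026.logicalLines / y2013.logicalLines;

// (279 / 17) / (2 / 1) ≈ 16.41 / 2 ≈ 8.21
const commitsPerWeekMultiple =
  (y2026.commits / y2026.activeWeeks) / (y2013.commits / y2013.activeWeeks);

const logicalLabel = `${logicalMultiple.toFixed(1)}×`;               // "130.2×"
const commitsPerWeekLabel = `${commitsPerWeekMultiple.toFixed(1)}×`; // "8.2×"
```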

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: 207× throughput multiple (with private repos + Bookface)

Re-ran scripts/garry-output-comparison.ts across all 41 repos under
garrytan/* (15 public + 26 private), including Bookface (YC's internal
social network, 2013-era work).

2013 activity: 71 commits, 5,143 logical lines, 4 active repos
  (bookface, delicounter, tandong, zurb-foundation-wysihtml5)
2026 activity: 350 commits, 1,064,818 logical lines, 15 active repos
  (gbrain, gstack, gbrowser, tax-app, kumo, tenjin, autoemail, kitsune,
  easy-chromium-compiles, conductor-playground, garryslist-agent, baku,
  gstack-website, resend_robot, garryslist-brain)

Multiples:
- Logical SLOC: 207× (up from 130.2× when including private work)
- Raw lines: 223×
- Commits/active-week: 3.4×

Stopped committing docs/throughput-2013-vs-2026.json — analysis is a
local artifact, not repo state. Added docs/throughput-*.json to
.gitignore. Full markdown analysis at ~/throughput-analysis-2026-04-18.md
(local-only). README multiple is now hardcoded; re-run the script and
edit manually when you want to refresh it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: run rate vs year-to-date throughput comparison

Two separate numbers in the README hero:
- Run rate: ~700× (9,859 logical lines/day in 2026 vs 14/day in 2013)
- Year-to-date: 207× (2026 through April 18 already exceeds 2013 full
  year by 207×)

Previous "207× pro-rata" framing mixed full-year 2013 vs partial-year
2026. Run rate is the apples-to-apples normalization; YTD is the
"already produced" total. Both are honest; both are compelling; they
measure different things.

Analysis at ~/throughput-analysis-2026-04-18.md (local-only).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(throughput): script natively computes to-date + run-rate multiples

Enhanced scripts/garry-output-comparison.ts so both calculations come
out of a single run instead of being reassembled ad-hoc in bash:

PerYearResult now includes:
- days_elapsed — 365 for past years, day-of-year for current
- is_partial — flags the current (in-progress) year
- per_day_rate — logical/raw/commits normalized by calendar day
- annualized_projection — per_day_rate × 365

Output JSON's `multiples` now has two sibling blocks:
- multiples.to_date — raw volume ratios (2026-YTD / 2013-full-year)
- multiples.run_rate — per-day pace ratios (apples-to-apples)

Back-compat: multiples.logical_lines_added still aliases to_date for
older consumers reading the JSON.

Updated README hero to cite both (picking up brain/* repo that was
missed in the earlier aggregation pass):

  2026 run rate: ~880× my 2013 pace (12,382 vs 14 logical lines/day)
  2026 YTD:      260× the entire 2013 year

Stderr summary now prints both multiples at the end of each run.

Full analysis at ~/throughput-analysis-2026-04-18.md (local-only).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: ON_THE_LOC_CONTROVERSY methodology post + README link

Long-form response to the "LOC is a meaningless vanity metric" critique.
Covers:
- The three branches of the LOC critique and which are right
- Why logical SLOC (NCLOC) beats raw LOC as the honest measurement
- Full method: author-scoped git diff, regex-classified added lines,
  aggregated across 41 public + private garrytan/* repos
- Both calculations: to-date (260x) and run-rate (879x)
- Steelman of the critics (greenfield-vs-maintenance, survivorship bias,
  quality-adjusted productivity, time-to-first-user)
- Reproduction instructions

Linked from README hero via a blockquote directly below the number.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* exclude: tax-app from throughput analysis (import-dominated history)

tax-app's history is one commit of 104K logical lines — an initial
import of a codebase, not authored work. Removing it to keep the
comparison honest.

Changes:
- scripts/garry-output-comparison.ts: added EXCLUDED_REPOS constant
  with tax-app + a one-line rationale. The script now skips excluded
  repos with a stderr note and deletes any stale output JSON so
  aggregation loops don't pick up pre-exclusion numbers.

- README hero: updated to 810× run rate + 240× YTD (were 880×/260×).
  Wording updated to "40 public + private repos ... after excluding
  repos dominated by imported code."

- docs/ON_THE_LOC_CONTROVERSY.md: updated all numbers, added an
  "Exclusions" paragraph explaining tax-app, removed tax-app from
  the "shipped not WIP" example list.

New numbers (2026 through day 108, without tax-app):
  - To-date:  240× logical SLOC (1,233,062 vs 5,143)
  - Run rate: 810× per-day pace (11,417 vs 14 logical/day)
  - Annualized: ~4.2M logical lines projected

Future re-runs automatically skip tax-app. Add more exclusions to
EXCLUDED_REPOS at the top of the script with a one-line rationale.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: correct tax-app exclusion rationale

tax-app is a demo app I built for an upcoming YC channel video,
not an "import-dominated history" as the previous commit claimed.
Excluded because it's not production shipping work, not because
of an import commit.

Updated rationale in scripts/garry-output-comparison.ts's
EXCLUDED_REPOS constant, in docs/ON_THE_LOC_CONTROVERSY.md's
method section + conclusion, and in the README hero wording
("one demo repo" vs the earlier "repos dominated by imported code").

Numbers unchanged — the exclusion itself is the same, just the
reason.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: harden ON_THE_LOC_CONTROVERSY against Cramer + neckbeard critiques

Reframes the thesis as "engineers can fly now" (amplification, not
replacement) and fortifies the soft spots critics will attack.

Added:
- Flight-thesis opener: pilot vs walker, leverage not replacement.
- Second deflation layer for AI verbosity (on top of NCLOC). Headline
  moves from 810x to 408x after generous 2x AI-boilerplate cut, with
  explicit sensitivity analysis showing the number is still large under
  pessimistic priors (5x → 162x, 10x → 81x, 100x impossible).
- Weekly distribution check (kills "you had one burst week" attack).
- Revert rate (2.0%) and post-merge fix rate (6.3%) with OSS
  comparables (K8s/Rails/Django band). Addresses "where are your error
  rates" directly.
- Named production adoption signals (gstack 1000+ installs, gbrain beta,
  resend_robot paying API) with explicit concession that "shipped != used
  at scale" for most of the corpus.
- Harder steelman: 5 specific concessions with quantified pivot points
  (e.g., "if 2013 baseline was 3.5x higher, 810x → 228x, still high").

Removed factual error: Posterous acquisition paragraph (Garry had already
left Posterous by 2011, so the "Twitter bought our private repos" excuse
for the 2013 corpus gap doesn't apply).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: update gstack/gbrain adoption numbers in LOC controversy post

gstack: "1,000+ distinct project installations" → "tens of thousands of
daily active users" (telemetry-reported, community tier, opt-in).
gbrain: "small set of beta testers" → "hundreds of beta testers running
it live."

Both are the accurate current numbers. The concession paragraph below
(about shipped != adopted at scale for the long-tail repos) still reads
correctly since it's about the corpus as a whole, not gstack/gbrain
specifically.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: reframe reproducibility note as OSS breakout flex

"You'd need access to my private repos" → "Bookface and Posthaven are
private, but gstack and gbrain are open-sourced with tens of thousands
of GitHub stars and tens of thousands of confirmed regular users, among
the most-used OSS projects in the world that didn't exist three months
ago."

Keeps the `gh repo list` command at the end for the actual
reproducibility instruction.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Rewrite LOC controversy post

- Lead with concession (LOC is garbage, do the math anyway)
- Preempt 14 lines/day meme with historical baselines (Brooks, Jones, McConnell)
- Remove 'neckbeard' language throughout
- Add slop-scan story (Ben Vinegar, 5.24 → 1.96, 62% cut)
- David Cramer GUnit joke
- Add testing philosophy section (the real unlock)
- ASCII weekly distribution chart
- gstack telemetry section with real numbers (15K installs, 305K invocations, 95.2% success)
- Top skills usage chart
- Pick-your-priors paragraph moved earlier (the killer)
- Sharper close: run the script, show me your numbers

* docs: four precision fixes on LOC controversy post

1. Citation fix. Kernighan didn't say anything about LOC-as-metric
   (that's the famous "aircraft building by weight" quote, commonly
   misattributed but actually Bill Gates). Replaced "Kernighan implied
   it before that" with the real Dijkstra quote ("lines produced" vs
   "lines spent" from EWD1036, with direct link) + the Gates quote.
   Verified via web search.

2. Slop-scan direction clarified. "(highest on his benchmark)" was
   ambiguous — could read as a brag. Now: "Higher score = more slop.
   He ran it on gstack and we scored 5.24, the worst he'd measured
   at the time." Then the 62% cut lands as an actual win.

3. Prose/chart skill-usage ordering now matches. Added /plan-eng-review
   (28,014) to the prose list so it doesn't conflict with the chart
   below it.

4. Cut the "David — I owe you one / GUnit" insider joke. Most readers
   won't connect Cramer → Sentry → GUnit naming. Ends the slop-scan
   paragraph on the stronger line: "Run `bun test` and watch 2,000+
   tests pass."

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: tighten four LOC post citations to match primary sources

1. Bill Gates quote: flagged as folklore-grade. Was "Bill Gates put it
   more memorably" (firm attribution). Now "The old line (widely
   attributed to Bill Gates, sourcing murky) puts it more memorably."
   The quote stands; honesty about attribution avoids the same
   misattribution trap we just fixed for Kernighan.

2. Capers Jones: "15-50 across thousands of projects" → "roughly 16-38
   LOC/day across thousands of projects" — matches his actual published
   measurements (which also report as 325-750 LOC/month).

3. Steve McConnell: "10-50 for finished, tested, delivered code" was
   folklore. Replaced with his actual project-size-dependent range from
   Code Complete: "20-125 LOC/day for small projects (10K LOC) down to
   1.5-25 for large projects (10M LOC) — it's size-dependent, not a
   single number."

4. Revert rate comparison: "Kubernetes, Rails, and Django historically
   run 1.5-3%" was unsourced. Replaced with "mature OSS codebases
   typically run 1-3%" + "run the same command on whatever you consider
   the bar and compare." No false specificity about which repos.

Net: every quantitative citation in the post now matches primary-source
figures or is explicitly flagged as folklore. Neckbeards can verify.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: drop Writing style section from README

Was sitting in prime real estate between Quick start and Install —
internal implementation detail, not something users need up-front.
Existing coverage is enough:
- Upgrade migration prompt notifies users on first post-upgrade run
- CLAUDE.md has the contributor note
- docs/designs/PLAN_TUNING_V1.md has the full design

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: collapse team-mode setup into one paste-and-go command

Step 2 was three separate code blocks: setup --team, then team-init,
then git add/commit. Mirrors Step 1's style now — one shell one-liner
that does all three. Subshell (cd && ./setup --team) keeps the user
in their repo pwd so team-init + git commit land in the right place.

"Swap required for optional" moved to a one-liner below.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: move full-clone footnote from README to CONTRIBUTING

The "Contributing or need full history?" note is for contributors, not
for someone following the README install flow. Moved into CONTRIBUTING's
Quick start section where it fits next to the existing clone command,
with a tip to upgrade an existing shallow clone via
`git fetch --unshallow`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: root <root@localhost>

2026-04-18 15:05:42 +08:00


Plan Tuning v0 — Design Doc

Status: Approved for v1 implementation
Branch: garrytan/plan-tune-skill
Authors: Garry Tan (user), with AI-assisted reviews from Claude Opus 4.7 + OpenAI Codex gpt-5.4
Date: 2026-04-16

What this document is

A canonical record of what /plan-tune v1 is, what it is NOT, what we considered, and why we made each call. Committed to the repo so future contributors (and future Garry) can trace reasoning without archeology. Supersedes the two ~/.gstack/projects/ artifacts (office-hours design doc + CEO plan), which are per-user local records.

The feature, in one paragraph

gstack's 40+ skills fire AskUserQuestion constantly. Power users answer the same questions the same way repeatedly and have no way to tell gstack "stop asking me this." More fundamentally, gstack has no model of how each user prefers to steer their work — scope-appetite, risk-tolerance, detail-preference, autonomy, architecture-care — so every skill's defaults are middle-of-the-road for everyone. /plan-tune v1 builds the schema + observation layer: a typed question registry, per-question explicit preferences, inline "tune:" feedback, and a profile (declared + inferred dimensions) inspectable via plain English. It does not yet adapt skill behavior based on the profile. That comes in v2, after v1 proves the substrate works.

Why we're building the smaller version

The feature started life as a full adaptive substrate: psychographic dimensions driving auto-decisions, blind-spot coaching, LANDED celebration HTML page, all bundled. Four rounds of review (office-hours, CEO EXPANSION, DX POLISH, eng review) cleared it. Then an outside voice (Codex) delivered a 20-point critique. The critical findings, in priority order:

1. "Substrate" was false. The plan wired 5 skills to read the profile on preamble, but AskUserQuestion is a prompt convention, not middleware. Agents can silently skip the instructions. You cannot reliably build auto-decide on top of an unenforceable convention. Without a typed question registry that every AskUserQuestion routes through, the substrate claim is marketing.
2. Internal logical contradictions. E4 (blind-spot) + E6 (mismatch) + ±0.2 clamp on declared dimensions do not compose. If user self-declaration is ground truth via the clamp, E6's mismatch detection is detecting noise. If behavior can correct the profile, the clamp suppresses the signal E6 needs.
3. Profile poisoning. Inline "tune: never ask" could be emitted by malicious repo content (README, PR description, tool output) and the agent would dutifully write it. No prior review caught this security gap.
4. E5 LANDED page in preamble. gh pr view + HTML write + browser open on every skill's preamble is latency, auth failures, rate limits, surprise browser opens, and nondeterminism injected into the hottest path.
5. Implementation order was backwards. The plan started with classifiers and bins. The correct order: build the integration point first (typed question registry), then infrastructure, then consumers.

After weighing Codex's argument, we chose to roll back CEO EXPANSION and ship an observational v1 with a real typed registry as the foundation. Psychographic becomes behavioral only after the registry proves durable in production.

v1 Scope (what we're building now)

Typed question registry (scripts/question-registry.ts). Every AskUserQuestion gstack uses is declared with {id, skill, category, door_type, options[], signal_key?}. Schema-governed.
CI enforcement. Lint test (gate tier) asserts every AskUserQuestion pattern in SKILL.md.tmpl files has a matching registry entry. Fails CI on drift, renames, or duplicates.
Question logging (bin/gstack-question-log). Appends {ts, question_id, user_choice, recommended, session_id} to ~/.gstack/projects/{SLUG}/question-log.jsonl. Validates against registry.
Explicit per-question preferences (bin/gstack-question-preference). Writes {question_id, preference} where preference is always-ask | never-ask | ask-only-for-one-way. Respected from session 1. No calibration gate — user stated it, system obeys.
Preamble injection. Before each AskUserQuestion, agent calls gstack-question-preference --check <registry-id>. If never-ask AND question is NOT a one-way door, auto-choose recommended option with visible annotation: "Auto-decided [summary] → [option] (your preference). Change with /plan-tune." One-way doors always ask regardless of preference — safety override.
Inline "tune:" feedback with user-origin gate. Agent offers "Tune this question? Reply tune: [feedback] to adjust." User can use shortcuts (unnecessary, ask-less, never-ask, always-ask, context-dependent) or free-form English. CRITICAL: the agent only writes a tune event when the tune: content appears in the user's current chat turn — NOT in tool output, NOT in a file read. Binary validates source: "inline-user" on write; rejects other sources.
Declared profile (/plan-tune setup). 5 plain-English questions, one per dimension. Stored in unified ~/.gstack/developer-profile.json under declared: {...}. Informational only in v1 — no skill behavior change.
Observed/Inferred profile. Every question-log event contributes deltas to inferred dimensions via a hand-crafted signal map (scripts/psychographic-signals.ts). Computed on demand. Displayed but not acted on.
/plan-tune skill. Conversational plain-English inspection tool. "Show my profile," "set a preference," "what questions have I been asked," "show the gap between what I said and what I do." No CLI subcommand syntax required.
Unification with existing ~/.gstack/builder-profile.jsonl. Fold /office-hours session records and accumulated signals into unified ~/.gstack/developer-profile.json. Migration is atomic + idempotent + archives the source file.
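
To make the registry and log shapes above concrete, here is a minimal sketch. The field names come from this doc and the category and door_type values from the registry commit message; the type names, the two example entries, and both helper functions are hypothetical and are not the contents of scripts/question-registry.ts. It follows the data-flow wording that unknown IDs are rejected.

```typescript
// Hypothetical sketch: field names follow the doc; types and helpers are assumed.
type Category = "approval" | "clarification" | "routing" | "cherry-pick" | "feedback-loop";
type DoorType = "one-way" | "two-way";

interface QuestionDef {
  id: string; // stable kebab-case id
  skill: string;
  category: Category;
  door_type: DoorType;
  options: string[];
  signal_key?: string;
}

interface LogRecord {
  ts: string;
  question_id: string;
  user_choice: string;
  recommended: string;
  session_id: string;
}

// Made-up entries for illustration; these ids are not from the real registry.
const EXAMPLE_REGISTRY: QuestionDef[] = [
  { id: "example-confirm-merge", skill: "ship", category: "approval", door_type: "two-way", options: ["yes", "no"] },
  { id: "example-drop-data", skill: "ship", category: "approval", door_type: "one-way", options: ["proceed", "abort"] },
];

// Returns ids that appear more than once (a CI lint could fail on any hit).
function findDuplicateIds(defs: QuestionDef[]): string[] {
  const seen = new Set<string>();
  const dups = new Set<string>();
  for (const def of defs) {
    if (seen.has(def.id)) dups.add(def.id);
    seen.add(def.id);
  }
  return Array.from(dups);
}

// Returns an error message for an unregistered id, or null when the record is accepted.
function validateLogRecord(rec: LogRecord, defs: QuestionDef[]): string | null {
  const known = defs.some((def) => def.id === rec.question_id);
  return known ? null : `unknown question_id: ${rec.question_id}`;
}
```

Keeping door_type on the entry itself is what lets a lint or a binary answer "is this a one-way door?" without reading any prose.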

Deferred to v2 (not in this PR, but explicit acceptance criteria)

| Item | Why deferred | Acceptance criteria for v2 promotion |
| --- | --- | --- |
| E1 Substrate wiring (5 skills read profile and adapt) | Requires v1 registry proving durable. Requires real observed data to calibrate signal deltas. Risk of psychographic drift. | v1 registry stable for 90+ days. Inferred dimensions show clear stability across 3+ skills. User dogfood validates that defaults informed by profile feel right. |
| E3 `/plan-tune narrative` + `/plan-tune vibe` | Event-anchored narrative needs stable profile. Without v1 data, output will be generic slop. | Profile diversity check passes for 2+ weeks real usage. Narrative test proves it quotes specific events, not clichés. |
| E4 Blind-spot coach | Logically conflicts with E1/E6 without explicit interaction-budget design. Needs global session budget, escalation rules, exclusion from mismatch detection. | Design spec for interaction budget + escalation. Dogfood confirms challenges feel coaching, not nagging. |
| E5 LANDED celebration HTML page | Cannot live in preamble (Codex #9, #10). When promoted, moves to explicit command `/plan-tune show-landed` OR post-ship hook — not passive detection in the hot path. | Explicit command or hook design. /design-shotgun → /design-html for the visual direction. Security + privacy review for PR data aggregation. |
| E6 Auto-adjustment based on mismatch | In v1, /plan-tune shows the gap between declared and inferred. In v2, it could suggest declaration updates. Requires dual-track profile to be stable. | Real mismatch data from v1 shows consistent patterns. Suggestion UX designed separately. |
| Psychographic-driven auto-decide | Zero behavioral change in v1. Only explicit preferences act. | Real usage shows explicit preferences cover most cases. Inferred profile stable enough to trust. |

Rejected entirely (Codex was right, we're not doing these)

| Item | Why rejected |
| --- | --- |
| Substrate-as-prompt-convention (vs. typed registry) | Codex #1. Agents can silently skip instructions. Building psychographic on top is sand. |
| ±0.2 clamp on declared dimensions | Codex #6. Creates logical contradiction with E6 mismatch detection. Pick ONE: editable preference OR inferred behavior. Now: both, tracked separately (dual-track profile). |
| One-way door classification by parsing prose summaries | Codex #4. Safety depends on wording. door_type must be declared at question definition site (registry), not inferred. |
| Single event-schema file mixing declarations + overrides + verdicts + feedback | Codex #5. Incompatible domain objects. Now split into three files: question-log.jsonl, question-preferences.json, question-events.jsonl. |
| TTHW telemetry for /plan-tune onboarding | Codex #14. Contradicts local-first framing. Local logging only. |
| Inline tune: writes without user-origin verification | Codex #16. Profile poisoning attack. Now: user-origin gate is non-optional. |

Architecture

~/.gstack/
  developer-profile.json            # unified: declared + inferred + sessions (from office-hours)

~/.gstack/projects/{SLUG}/
  question-log.jsonl                # every AskUserQuestion, append-only, registry-validated
  question-preferences.json         # explicit per-question user choices
  question-events.jsonl             # tune: feedback events, user-origin gated

Unified profile schema (superseding both v0.16.2.0 builder-profile.jsonl and the proposed developer-profile.json):

{
  "identity": {"email": "..."},
  "declared": {
    "scope_appetite": 0.9,
    "risk_tolerance": 0.7,
    "detail_preference": 0.4,
    "autonomy": 0.5,
    "architecture_care": 0.7
  },
  "inferred": {
    "values": {"scope_appetite": 0.72, "risk_tolerance": 0.58, "...": "..."},
    "sample_size": 47,
    "diversity": {
      "skills_covered": 5,
      "question_ids_covered": 14,
      "days_span": 23
    }
  },
  "gap": {"scope_appetite": 0.18, "...": "..."},
  "sessions": [
    {"date": "...", "mode": "builder", "project_slug": "...", "signals": []}
  ],
  "signals_accumulated": {
    "named_users": 1, "taste": 4, "agency": 3, "...": "..."
  }
}

Diversity check (Codex #13): inferred is considered "enough data" only when sample_size >= 20 AND skills_covered >= 3 AND question_ids_covered >= 8 AND days_span >= 7. Below this, /plan-tune profile shows "not enough observed data yet" instead of a potentially-misleading inferred value.
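
The diversity check is a plain conjunction of four thresholds, so it can be sketched in a few lines. The thresholds and field names are taken from the paragraph and schema above; the function name and interface names are hypothetical.

```typescript
// Hypothetical sketch of the "enough data" gate; thresholds are the ones stated in the doc.
interface InferredDiversity {
  skills_covered: number;
  question_ids_covered: number;
  days_span: number;
}

interface InferredSummary {
  sample_size: number;
  diversity: InferredDiversity;
}

// True only when all four conditions hold; otherwise the profile view
// would show "not enough observed data yet" instead of inferred values.
function hasEnoughObservedData(inferred: InferredSummary): boolean {
  const d = inferred.diversity;
  return (
    inferred.sample_size >= 20 &&
    d.skills_covered >= 3 &&
    d.question_ids_covered >= 8 &&
    d.days_span >= 7
  );
}
```

The example profile above (47 samples, 5 skills, 14 question IDs, 23 days) clears every threshold; dropping any single field below its minimum fails the check.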

Data flow (v1)

Preamble: check question_tuning config. If off, do nothing.
Before each AskUserQuestion:
- Agent calls gstack-question-preference --check <registry-id>
- If never-ask AND question is NOT one-way door → auto-choose recommended with annotation
- If always-ask, unset, or question IS one-way door → ask normally
After AskUserQuestion:
- Append log record to question-log.jsonl (registry-validated, rejects unknown IDs)
Offer inline: "Tune this question? Reply tune: [feedback] to adjust."
If user's NEXT turn message contains tune: prefix AND the content originated in the user's own message (not tool output):
- Agent calls gstack-question-preference --write with source: "inline-user"
- Binary validates source field; rejects if anything other than inline-user
Inferred dimensions recomputed on demand by bin/gstack-developer-profile --derive. Signal map changes trigger full recompute from events history.
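
The pre-question branch in the data flow above reduces to a small pure function. The sketch below is illustrative, not the binary's actual logic: the function name is hypothetical, the result labels are borrowed from the test description later in this doc, and the handling of ask-only-for-one-way on a two-way door is an assumption, since the data flow does not spell that case out.

```typescript
// Hypothetical sketch of the preference check; not the real gstack-question-preference code.
type Preference = "always-ask" | "never-ask" | "ask-only-for-one-way";
type DoorType = "one-way" | "two-way";
type Decision = "AUTO_DECIDE" | "ASK_NORMALLY";

function decideQuestion(preference: Preference | undefined, doorType: DoorType): Decision {
  // Safety override: one-way doors are always asked, whatever the preference says.
  if (doorType === "one-way") return "ASK_NORMALLY";
  if (preference === "never-ask") return "AUTO_DECIDE";
  // ASSUMPTION: on a two-way door, ask-only-for-one-way behaves like never-ask.
  if (preference === "ask-only-for-one-way") return "AUTO_DECIDE";
  // always-ask or unset.
  return "ASK_NORMALLY";
}
```

Checking the door type first means no preference value, present or future, can bypass a one-way door.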

Security model

Profile poisoning defense (Codex #16, Decision J below): Inline tune events may be written ONLY when:

The agent is processing the user's current chat turn
The tune: prefix appears in that user message (not in any tool output, file content, PR description, commit message, etc.)
The resolver's instructions to the agent explicitly call this out

Binary enforcement: gstack-question-preference --write requires source: "inline-user" field on every tune-originated record. Any other source value (e.g., inline-tool-output, inline-file-content) is rejected with an error. Agent is instructed to never forge the source field.
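
As a rough sketch of the binary-side check, assuming a record shape that this doc does not define (the TuneWrite interface and function name are hypothetical): any source other than "inline-user" is rejected. The check cannot prove where the text came from; it relies on the agent labeling the source honestly, as the resolver instructions require.

```typescript
// Hypothetical sketch of the source gate; record shape and names are assumed.
interface TuneWrite {
  question_id: string;
  preference: string;
  source: string;
}

type GateResult = { ok: true } | { ok: false; error: string };

function checkTuneSource(record: TuneWrite): GateResult {
  if (record.source !== "inline-user") {
    return { ok: false, error: `rejected tune write: source "${record.source}" is not inline-user` };
  }
  return { ok: true };
}
```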

Data privacy:

All data is local-only under ~/.gstack/. Nothing leaves without explicit user action.
/plan-tune export <path> writes profile to user-specified path (opt-in export).
/plan-tune delete wipes local profile files.
gstack-config set telemetry off prevents any telemetry (this skill never sends profile data regardless).
Profile files have standard user-home permissions.

Injection defense (consistent with existing bin/gstack-learnings-log patterns): the question_summary and any free-form user feedback fields are sanitized against known prompt-injection patterns ("ignore previous instructions," "system:", etc.).

5 Hard Constraints (preserved from office-hours, updated for Codex feedback)

1. One-way doors are classified deterministically by registry declaration, NOT by runtime summary parsing. Each registry entry declares door_type: one-way | two-way. Keyword pattern fallback (scripts/one-way-doors.ts) is a belt-and-suspenders secondary check for edge cases.
2. Profile dimensions are inspectable AND editable. /plan-tune profile shows declared + inferred + gap. Edits via plain English go to declared only. System tracks inferred independently.
3. Signal map is hand-crafted in TypeScript. scripts/psychographic-signals.ts maps {question_id, user_choice} → {dimension, delta}. Not agent-inferred. In v1, consumed only for inferred.values display — not for driving decisions.
4. No psychographic-driven auto-decide in v1. Only explicit per-question preferences act. This sidesteps the "calibration gate can be gamed" critique (Codex #13) entirely — v1 doesn't have a gate to pass.
5. Per-project preferences beat global preferences. ~/.gstack/projects/{SLUG}/question-preferences.json wins over any future global preference file. Global profile (~/.gstack/developer-profile.json) is a starting point for diversity across projects.

Why event-sourced + dual-track

Why event-sourced for the inferred profile:

Signal map can change between gstack versions. Recompute from events, no data migration needed.
Auditable: /plan-tune profile --trace autonomy shows every event that contributed to the value.
Future-proof: new dimensions can be derived from existing history.
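
A minimal sketch of what recompute-from-events could look like. The {question_id, user_choice} → {dimension, delta} mapping and the five dimension names are from this doc; the 0.5 starting baseline, the clamp to [0, 1], and every function name are assumptions made for illustration, not the behavior of bin/gstack-developer-profile.

```typescript
// Hypothetical sketch: fold question-log events through a signal map.
type Dimension = "scope_appetite" | "risk_tolerance" | "detail_preference" | "autonomy" | "architecture_care";

interface Signal {
  dimension: Dimension;
  delta: number;
}

// question_id -> user_choice -> signal
type SignalMap = Record<string, Record<string, Signal>>;

interface LogEvent {
  question_id: string;
  user_choice: string;
}

const DIMENSIONS: Dimension[] = ["scope_appetite", "risk_tolerance", "detail_preference", "autonomy", "architecture_care"];

const clamp01 = (x: number): number => Math.min(1, Math.max(0, x));

function lookupSignal(map: SignalMap, event: LogEvent): Signal | undefined {
  const byChoice = map[event.question_id];
  return byChoice ? byChoice[event.user_choice] : undefined;
}

// ASSUMPTION: every dimension starts at a neutral baseline and stays within [0, 1].
function deriveInferred(events: LogEvent[], map: SignalMap, baseline = 0.5): Record<Dimension, number> {
  const values = {} as Record<Dimension, number>;
  for (const dim of DIMENSIONS) values[dim] = baseline;
  for (const event of events) {
    const signal = lookupSignal(map, event);
    if (signal) values[signal.dimension] = clamp01(values[signal.dimension] + signal.delta);
  }
  return values;
}

// The events that contributed to one dimension, in log order (the idea behind --trace).
function traceDimension(events: LogEvent[], map: SignalMap, dim: Dimension): LogEvent[] {
  return events.filter((event) => {
    const signal = lookupSignal(map, event);
    return signal !== undefined && signal.dimension === dim;
  });
}
```

Because the values are a pure function of the event list and the map, shipping a new signal map only requires calling deriveInferred again; nothing stored needs rewriting.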

Why dual-track (declared + inferred, separately) (Decision B below):

Resolves the logical contradiction Codex #6 identified.
declared is user sovereignty. User states who they are. System obeys for anything user-driven (preferences, declarations, overrides).
inferred is observation. System tracks behavioral patterns. Displayed but not acted on in v1.
gap is the interesting signal. Large gaps suggest the user's self-description isn't matching their behavior — valuable self-insight, but not auto-corrected.
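
For illustration, the gap can be computed per dimension from the two tracks. This sketch assumes the gap is the absolute difference rounded to two decimals, which is consistent with the schema example (declared 0.9, inferred 0.72, gap 0.18) but is not confirmed by this doc; the function name is hypothetical.

```typescript
// Hypothetical sketch: per-dimension gap between declared and inferred values.
type Scores = Record<string, number>;

// ASSUMPTION: gap = |declared - inferred|, rounded to 2 decimals.
// Dimensions missing from either track are skipped rather than guessed.
function computeGap(declared: Scores, inferred: Scores): Scores {
  const gap: Scores = {};
  for (const dim of Object.keys(declared)) {
    const observed = inferred[dim];
    if (observed === undefined) continue;
    gap[dim] = Math.round(Math.abs(declared[dim] - observed) * 100) / 100;
  }
  return gap;
}
```

Neither input is mutated, which matches the v1 rule that the gap is displayed but never used to correct declared or inferred.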

Interaction model — plain English everywhere

(From /plan-devex-review, user correction on CLI syntax):

/plan-tune (no args) enters conversational mode. No CLI subcommand syntax required.

Menu in plain language:

"Show me my profile"
"Review questions I've been asked"
"Set a preference about a question"
"Update my profile — I've changed my mind about something"
"Show me the gap between what I said and what I do"
"Turn it off"

User replies conversationally. Agent interprets, confirms the intended change, then writes. For example:

User: "I'm more of a boil-the-ocean person than 0.5 suggests"
Agent: "Got it — update declared.scope_appetite from 0.5 to 0.8? [Y/n]"
User: "Yes"
Agent writes the update

Confirmation step is required for any mutation of declared from free-form input (Codex #15 trust boundary).

Power users can type shortcuts instead (narrative, vibe, reset, stats, enable, disable, diff). Neither style is mandatory; plain English and shortcuts both work.

Files to Create

Core schema

scripts/question-registry.ts — typed registry. Seeded from audit of all SKILL.md.tmpl AskUserQuestion invocations.
scripts/one-way-doors.ts — secondary keyword fallback. Primary: door_type in registry.
scripts/psychographic-signals.ts — hand-crafted signal map for inferred computation.

Binaries

bin/gstack-question-log — append log record, validate against registry.
bin/gstack-question-preference — read/write/check/clear explicit preferences.
bin/gstack-developer-profile — supersedes bin/gstack-builder-profile. Subcommands: --read (legacy compat), --derive, --gap, --profile.

Resolvers

scripts/resolvers/question-tuning.ts — three generators: generateQuestionPreferenceCheck(ctx) (pre-question check), generateQuestionLog(ctx) (post-question log), generateInlineTuneFeedback(ctx) (post-question tune: prompt with user-origin gate instructions).

Skill

plan-tune/SKILL.md.tmpl — conversational, plain-English inspection and preference tool.

Tests

test/plan-tune.test.ts — registry completeness, duplicate ID check, preference precedence (never-ask + not-one-way → AUTO_DECIDE; never-ask + one-way → ASK_NORMALLY), user-origin gate (rejects non-inline-user sources), derivation + recompute, unified profile schema, migration regression with 7-session fixture.

Files to Modify

scripts/resolvers/index.ts — register 3 new resolvers.
scripts/resolvers/preamble.ts — _QUESTION_TUNING config read; inject 3 resolvers for tier >= 2.
bin/gstack-builder-profile — legacy shim delegates to bin/gstack-developer-profile --read.
Migration script — folds existing builder-profile.jsonl into unified developer-profile.json. Atomic, idempotent, archives source as .migrated-YYYY-MM-DD.

NOT touched in v1

Explicitly unchanged — no {{PROFILE_ADAPTATION}} placeholders, no behavior change based on profile:

ship/SKILL.md.tmpl, review/SKILL.md.tmpl, office-hours/SKILL.md.tmpl, plan-ceo-review/SKILL.md.tmpl, plan-eng-review/SKILL.md.tmpl

These skills gain preamble injection for logging / preference checking / tune feedback only. No profile-driven defaults. v2 work.

Decisions log (with pros/cons for each)

Decision A: Bundle all three (question-log + sensitivity + psychographic) vs. ship smaller wedge — INITIAL ANSWER: BUNDLE; REVISED: REGISTRY-FIRST OBSERVATIONAL

Initial user position (office-hours): "The psychographic IS the differentiation. Ship the whole thing so the feedback loop can actually tune behavior." This drove CEO EXPANSION.

Pros of bundling: Ambition. The learning layer is what makes this more than config. Without psychographic, it's a fancy settings menu.

Cons of bundling (surfaced by Codex): The substrate didn't exist. Psychographic on top of prompt-convention is sand. E1/E4/E6 compose incoherently. Profile poisoning was unaddressed. E5 in preamble is a hidden hot-path side effect. Implementation order built machinery around an unenforceable convention.

Revised answer: Registry-first observational v1 (this doc). Preserves the ambition as a v2 target with explicit acceptance criteria. Ships a defensible foundation. User accepted this after seeing Codex's 20-point critique.

Decision B: Event-sourced vs. stored dimensions vs. hybrid — ANSWER: EVENT-SOURCED + USER-DECLARED ANCHOR (B+C)

Approach A (stored dimensions): Mutate in place. Simple.

Pros: Smallest data model. Easy to reason about.
Cons: Lossy. No history. Signal map changes require migration. Profile changes are opaque to the user.

Approach B (event-sourced): Store raw events, derive dimensions.

Pros: Auditable. Recomputable on signal map changes. No data migration ever. Matches existing learnings.jsonl pattern.
Cons: More complex derivation. Events file grows over time (compaction deferred to v2).

Approach C (hybrid — user-declared anchor, events refine): Initial profile is user-stated; events refine within ±0.2.

Pros: Day-1 value. User sovereignty. Calibration anchor instead of starting from zero.
Cons: ±0.2 clamp creates logical conflict with mismatch detection (Codex #6 caught this).

Chosen: B+C combined with ±0.2 CLAMP REMOVED. Event-sourced underneath, declared profile as first-class separate field. No clamp. Declared and inferred live as independent values. Gap between them is displayed but not auto-corrected in v1.

Decision C: One-way door classification — runtime prose parsing vs. registry declaration — ANSWER: REGISTRY DECLARATION (post-Codex)

Runtime prose parsing (original): isOneWayDoor(skill, category, summary) plus keyword patterns.

Pros: Minimal friction for skill authors. No schema to maintain.
Cons (Codex #4): Safety depends on wording. A destructive-op question phrased mildly could be misclassified. Unacceptable for a safety gate.

Registry declaration (revised): Every registry entry declares door_type.

Pros: Deterministic. Auditable. CI-enforceable (all questions must declare).
Cons: Maintenance burden. Every new skill question must classify.

Chosen: registry declaration as primary, keyword patterns as fallback. Schema governance is the cost of safety.
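
The lookup order can be sketched roughly as follows: a registered question's declared `door_type` always wins, and keyword patterns are consulted only for ad-hoc ids the registry does not know. The registry entries, the pattern list, and the `classifyDoor` name are hypothetical; they are not the contents of scripts/question-registry.ts or scripts/one-way-doors.ts.

```typescript
type DoorType = "one-way" | "two-way";

interface RegistryEntry {
  id: string;
  door_type: DoorType;
}

// Hypothetical entries for illustration only.
const REGISTRY = new Map<string, RegistryEntry>([
  ["example-force-push", { id: "example-force-push", door_type: "one-way" }],
  ["example-pick-test-runner", { id: "example-pick-test-runner", door_type: "two-way" }],
]);

// Assumed fallback patterns; the real list is not specified in this doc.
const ONE_WAY_PATTERNS: RegExp[] = [/\bdelete\b/i, /\bdrop\b/i, /\bforce[- ]push\b/i];

interface Classification {
  door_type: DoorType;
  via: "registry" | "keyword-fallback";
}

function classifyDoor(questionId: string, summary: string): Classification {
  // Primary path: the declaration decides, regardless of how the question is worded.
  const entry = REGISTRY.get(questionId);
  if (entry) return { door_type: entry.door_type, via: "registry" };

  // Fallback path for unregistered ids: wording-based, so inherently weaker.
  const risky = ONE_WAY_PATTERNS.some((pattern) => pattern.test(summary));
  return { door_type: risky ? "one-way" : "two-way", via: "keyword-fallback" };
}

// A mildly worded destructive question is still one-way because the registry says so.
const mild = classifyDoor("example-force-push", "Tidy up the branch?");
// An unregistered question falls back to keyword matching.
const adHoc = classifyDoor("ad-hoc-1", "Drop the staging table?");
```

Returning `via` alongside the door type keeps it visible when a safety decision rested on wording rather than on a declaration.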

Decision D: Inline tune feedback grammar — structured keywords vs. free-form natural language — ANSWER: STRUCTURED WITH FREE-FORM FALLBACK

Structured keywords only: tune: unnecessary | ask-less | never-ask | always-ask | context-dependent.

Pros: Unambiguous. Clean profile data.
Cons: Users must memorize the keywords.

Free-form only: Agent interprets whatever user says.

Pros: Natural. No syntax to learn.
Cons: Inconsistent profile data. Hard to debug why a tune didn't take effect.

Chosen: both. Shortcuts documented for power users; agent accepts and normalizes free English. Plain-English interaction is the default; structured keywords are an optional fast-path.
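
One way the two paths could meet is sketched below: a message containing a `tune:` line is checked against the five documented keywords first, and anything else is handed on as free-form text for the agent to interpret. The `parseTune` name, the result shape, and the choice to treat spaces as hyphens are assumptions, not the shipped grammar.

```typescript
const TUNE_KEYWORDS = ["unnecessary", "ask-less", "never-ask", "always-ask", "context-dependent"] as const;
type TuneKeyword = (typeof TUNE_KEYWORDS)[number];

type ParsedTune =
  | { kind: "structured"; keyword: TuneKeyword }
  | { kind: "free-form"; text: string };

// Returns null when the message carries no tune: line at all.
function parseTune(message: string): ParsedTune | null {
  const match = /^\s*tune:(.*)$/im.exec(message);
  if (!match) return null;

  const text = (match[1] ?? "").trim();
  if (text === "") return null;

  // Assumption: "never ask" and "Never-Ask" both normalize to "never-ask".
  const candidate = text.toLowerCase().replace(/\s+/g, "-");
  const keyword = TUNE_KEYWORDS.find((k) => k === candidate);
  return keyword ? { kind: "structured", keyword } : { kind: "free-form", text };
}

const fast = parseTune("tune: never-ask");
const loose = parseTune("Tune: Never Ask");
const plain = parseTune("tune: stop asking me about test frameworks");
const none = parseTune("looks good, ship it");
```

Keeping the free-form branch as a distinct variant means a log can record whether a preference came from an exact keyword or from agent interpretation, which should help when debugging why a tune did not take effect.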

Decision E: CLI subcommand structure for /plan-tune — ANSWER: PLAIN ENGLISH CONVERSATIONAL (no subcommand syntax required)

/plan-tune profile, /plan-tune profile set autonomy 0.4, etc. (original):

Pros: Fast for power users. Self-documenting via --help.
Cons: Users must memorize subcommands. Every invocation feels like a CLI session, not a conversation.

Plain-English conversational (revised after user correction): /plan-tune enters a menu. User says what they want in natural language.

Pros: Zero memorization. Feels like talking to a coach, not a shell.
Cons: Slower for power users. Requires good agent interpretation.

Chosen: conversational with optional shortcuts. Neither path is required. Most users never see the shortcuts. Confirmation step required before mutating declared profile (safety against agent misinterpretation — Codex #15 trust boundary).

Decision F: Landed celebration — passive preamble detection vs. explicit command vs. post-ship hook — ANSWER: DEFERRED TO v2; WHEN PROMOTED, NOT IN PREAMBLE

Passive detection in preamble (original): Every skill's preamble runs gh pr view to detect recent merges.

Pros: Works regardless of which skill the user runs. User doesn't need to do anything special.
Cons (Codex #9): Latency, auth failures, rate limits, surprise browser opens, nondeterminism injected into every skill's preamble. Side effect in hot path.

Explicit command (/plan-tune show-landed): User opts in.

Pros: No hot-path side effects. User controls when to see it.
Cons: Requires user discovery. The "surprise you when you earned it" magic is lost.

Post-ship hook (/ship triggers detection after PR creation): Tied to /ship.

Pros: Natural timing. No preamble cost.
Cons: /ship isn't always the landing event (manual merges, team members merging, etc.).

Chosen: DEFERRED entirely. v2 will design this properly. When promoted, it moves out of preamble. User accepted Codex's argument that a celebration page in the preamble is a strategic misfit for an already-risky feature.

Decision G: Calibration gate — 20 events vs. diversity-checked — ANSWER: DIVERSITY-CHECKED

"20 events" (original): Simple count.

Pros: Trivial to implement.
Cons (Codex #13): Gameable. 20 inline "unnecessary" replies to ONE question should not calibrate five dimensions.

Diversity check (revised): sample_size >= 20 AND skills_covered >= 3 AND question_ids_covered >= 8 AND days_span >= 7.

Pros: Profile has actually been exercised across the system before it's trusted.
Cons: Slightly more complex.

Chosen: diversity check. In v1 used only for "enough data to display" threshold. In v2 will be the gate for psychographic-driven auto-decide.
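
The four-part condition above translates fairly directly into code. In this sketch the event fields and function names are hypothetical, and `days_span` is assumed to mean the time between the earliest and latest event; the doc does not define it more precisely.

```typescript
interface LoggedEvent {
  skill: string;
  question_id: string;
  ts: string; // ISO-8601 timestamp
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

function diversityStats(events: LoggedEvent[]) {
  const times = events.map((e) => Date.parse(e.ts));
  return {
    sample_size: events.length,
    skills_covered: new Set(events.map((e) => e.skill)).size,
    question_ids_covered: new Set(events.map((e) => e.question_id)).size,
    // Assumed definition: elapsed days between first and last event.
    days_span: events.length === 0 ? 0 : (Math.max(...times) - Math.min(...times)) / MS_PER_DAY,
  };
}

function isCalibrated(events: LoggedEvent[]): boolean {
  const s = diversityStats(events);
  return s.sample_size >= 20 && s.skills_covered >= 3 && s.question_ids_covered >= 8 && s.days_span >= 7;
}

// 20 replies to one question on one day: passes the raw count, fails diversity.
const gamed: LoggedEvent[] = Array.from({ length: 20 }, () => ({
  skill: "ship",
  question_id: "q-0",
  ts: "2026-04-01T00:00:00Z",
}));

// 24 events spread over 3 skills, 8 question ids, and 8 calendar days.
const skillNames = ["ship", "review", "qa"];
const diverse: LoggedEvent[] = Array.from({ length: 24 }, (_, i) => ({
  skill: skillNames[i % 3] ?? "ship",
  question_id: `q-${i % 8}`,
  ts: new Date(Date.UTC(2026, 3, 1 + (i % 8))).toISOString(),
}));
```

Here `isCalibrated(gamed)` is false and `isCalibrated(diverse)` is true, which is the distinction the plain 20-event count could not make.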

Decision H: Implementation order — classifiers first vs. integration point first — ANSWER: INTEGRATION POINT FIRST (registry + CI lint)

Classifiers first (original): Build bin tools, then resolvers, then skill template.

Pros: Atomic building blocks. Can unit-test before integration.
Cons (Codex #19): Builds machinery around an unenforceable convention. If the convention doesn't hold, all the work is wasted.

Integration point first (revised): Build typed registry + CI lint first. Prove the integration works before building infrastructure on top.

Pros: Foundation is proven. Infrastructure has something durable to rely on.
Cons: Requires auditing every existing AskUserQuestion in gstack — substantial up-front work.

Chosen: integration point first. Codex's argument was decisive. The audit is exactly the point — it forces us to catalog what we actually have before building adaptation on top.

Decision I: Telemetry for TTHW — opt-in telemetry vs. local-only — ANSWER: LOCAL-ONLY

Opt-in telemetry (original, suggested in DX review): Instrument TTHW via telemetry event.

Pros: Quantitative measure of onboarding experience across all users.
Cons (Codex #14): Contradicts local-first OSS framing. Adds telemetry surface specifically for this skill.

Local-only (revised): Logging is local. Respect existing telemetry config; skill adds no new telemetry channels.

Pros: Consistent with gstack's local-first ethos.
Cons: No aggregate view of onboarding time.

Chosen: local-only. If we need TTHW data later, we add it as a gstack-wide telemetry event behind existing opt-in, not a skill-specific one.

Decision J: Profile poisoning defense — no defense vs. confirmation gate vs. user-origin gate — ANSWER: USER-ORIGIN GATE

No defense (original — caught by Codex): Agent writes any tune event it sees.

Pros: Simplest. No additional trust checks.
Cons (Codex #16): Malicious repo content, PR descriptions, tool output can inject tune: never ask and poison the profile. This is a real attack surface.

Confirmation gate: Every tune write prompts "Confirmed? [Y/n]".

Pros: Universal defense.
Cons: Friction on every legitimate use.

User-origin gate: Agent only writes tune events when the tune: prefix appears in the user's own chat message for the current turn (not tool output, not file content). Binary validates source: "inline-user".

Pros: Blocks the attack without friction on legitimate use.
Cons: Relies on agent correctly identifying source. Binary-level validation is the enforcement.

Chosen: user-origin gate. Matches the threat model (malicious content in automated inputs) without degrading the normal flow.
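
The binary-level half of the gate could be as small as the check below. The request shape and the `checkUserOrigin` name are hypothetical; only the `"inline-user"` and `"inline-tool-output"` source values come from this doc.

```typescript
// Hypothetical write request; field names other than `source` are assumed.
interface TuneWriteRequest {
  question_id: string;
  preference: string;
  source: string;
}

type GateResult = { ok: true } | { ok: false; reason: string };

// Allow-list rather than deny-list: any source other than "inline-user" is rejected,
// including values nobody has thought of yet.
function checkUserOrigin(req: TuneWriteRequest): GateResult {
  if (req.source !== "inline-user") {
    return { ok: false, reason: `rejected: source "${req.source}" is not user-originated` };
  }
  return { ok: true };
}

const fromUser = checkUserOrigin({
  question_id: "example-question",
  preference: "never-ask",
  source: "inline-user",
});
const fromTool = checkUserOrigin({
  question_id: "example-question",
  preference: "never-ask",
  source: "inline-tool-output",
});
```

This check cannot tell whether the agent labeled the source honestly; it only guarantees that an event labeled as anything other than user-originated never reaches the profile.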

Success Criteria

bun test passes including new test/plan-tune.test.ts.
Every AskUserQuestion invocation in every SKILL.md.tmpl has a registry entry. CI lint enforces.
Migration from ~/.gstack/builder-profile.jsonl preserves 100% of sessions + signals_accumulated. Regression test with 7-session fixture.
One-way door coverage: 100% of registry entries for destructive ops, architecture forks, scope-adds > 1 day CC effort, and security/compliance choices declare door_type one-way.
User-origin gate test: attempting to write a tune event with source: "inline-tool-output" is rejected.
Dogfood: Garry uses /plan-tune for 2+ weeks. Reports back whether:
- tune: never-ask felt natural to type or got ignored
- Registry maintenance (adding new questions) felt like reasonable discipline or schema bureaucracy
- Inferred dimensions were stable across sessions or noisy
- Plain-English interaction felt like a coach or like arguing with a chatbot

Implementation Order

Audit every AskUserQuestion invocation in every gstack SKILL.md.tmpl. Build initial scripts/question-registry.ts with IDs, categories, door_types, options. This is the foundation; everything else sits on it.
Write test/plan-tune.test.ts registry-completeness test (gate tier). Verify it catches drift — temporarily remove one registry entry, confirm CI fails.
Seed scripts/one-way-doors.ts with keyword-pattern fallback classifier.
Seed scripts/psychographic-signals.ts with initial {question_id, user_choice} → {dimension, delta} mappings. Numbers are tentative — v1 ships, v2 recalibrates.
Seed scripts/archetypes.ts with archetype definitions (referenced by future v2 /plan-tune vibe).
bin/gstack-question-log — validates against registry, rejects unknown IDs.
bin/gstack-question-preference — all subcommands + tests.
bin/gstack-developer-profile — --read (legacy), --derive, --gap, --profile.
Migration script — builder-profile.jsonl → unified developer-profile.json. Atomic, idempotent, archives source. Regression test with fixture.
scripts/resolvers/question-tuning.ts — three generators (preference check, log, inline tune with user-origin gate instructions).
Register the 3 resolvers in scripts/resolvers/index.ts.
Update scripts/resolvers/preamble.ts — _QUESTION_TUNING config read; conditionally inject for tier >= 2 skills.
plan-tune/SKILL.md.tmpl — conversational plain-English skill.
bun run gen:skill-docs — all SKILL.md files regenerated; verify each stays under 100KB token ceiling.
bun test — all 45+ test cases green.
Dogfood 2+ weeks. Collect real question-log + preferences data. Measure against success criteria.
/ship v1. v2 scope discussion after dogfood.
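
The registry-completeness gate from the first two steps reduces to a set difference once the question ids used by the templates have been collected. How ids are extracted from each SKILL.md.tmpl is not specified here, so this sketch takes them as an input array; the function name and sample ids are hypothetical.

```typescript
// Returns ids used by skill templates that have no registry entry.
// De-duplicated and sorted so a failing CI message is stable across runs.
function findUnregistered(usedIds: string[], registeredIds: string[]): string[] {
  const registered = new Set(registeredIds);
  const missing = new Set(usedIds.filter((id) => !registered.has(id)));
  return Array.from(missing).sort();
}

// Hypothetical ids for illustration.
const registeredIds = ["example-a", "example-b", "example-c"];
const usedInTemplates = ["example-a", "example-b", "example-c", "example-b"];

// Clean state: every used id is registered.
const clean = findUnregistered(usedInTemplates, registeredIds);

// Drift check: temporarily drop one entry and confirm the gate would fail.
const drifted = findUnregistered(
  usedInTemplates,
  registeredIds.filter((id) => id !== "example-b"),
);
```

A test in test/plan-tune.test.ts could then assert that the returned list is empty, and list the missing ids in the failure message when it is not.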

Open Questions (v2 scope decisions, deferred until real data)

Exact signal map deltas. v1 ships with initial guesses; v2 recalibrates from observed data.
When the gap between inferred and declared values becomes large, do we auto-suggest updating the declared profile, or just display the gap?
When a signal map version changes, do we auto-recompute or prompt user? Default: auto-recompute with diff display.
Cross-project profile inheritance vs. isolation. v1 is per-project preferences + global profile; v2 may add explicit cross-project learning opt-ins.
Should /plan-tune support a "team profile" mode where a shared developer-profile informs collaboration? v2+.

Reviews incorporated

/office-hours (2026-04-16, 1 session): Set 5 hard constraints, chose event-sourced + user-declared architecture.
/plan-ceo-review (2026-04-16, EXPANSION mode): 6 expansions accepted, later rolled back after Codex review.
/plan-devex-review (2026-04-16, POLISH mode): Plain-English interaction model; this survived to v1.
/plan-eng-review (2026-04-16): Test plan and completeness checks; partially superseded by registry-first rewrite.
/codex (2026-04-16, gpt-5.4 high reasoning): 20-point critique drove the rollback. 15+ legitimate findings the Claude reviews missed.

Credits and caveats

This plan was developed through an iterative AI-collaboration loop over ~6 hours of planning. The author (Garry Tan) directed every scope decision; AI voices (Claude Opus 4.7 and OpenAI Codex gpt-5.4) challenged and refined the plan. Without Codex's outside voice, a much larger and less-defensible plan would have shipped. The value of cross-model review on high-stakes architectural changes is real and measurable.
