Captures the V1 design (ELI10 writing + LOC reframe) in docs/designs/PLAN_TUNING_V1.md and the extracted V1.1 pacing-overhaul plan in docs/designs/PACING_UPDATES_V0.md. V1 scope was reduced from the original bundled pacing + writing-style plan after three engineering-review passes revealed structural gaps in the pacing workstream that couldn't be closed via plan-text editing. TODOS.md P0 entry links to V1.1. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Pacing Updates v0 — Design Doc
Status: V1.1 plan (not yet implemented). Extracted from: `PLAN_TUNING_V1.md` during implementation, when review rigor revealed the pacing workstream had structural gaps unfixable via plan-text editing. Authors: Garry Tan (user), with AI-assisted reviews from Claude Opus 4.7 + OpenAI Codex gpt-5.4. Review plan: CEO + Codex + DX + Eng cycle, same rigor as V1.
## Credit
This plan exists because of Louise de Sadeleer. Her "yes yes yes" during architecture review wasn't only about jargon (V1 addresses that) — it was pacing and agency. Too many interruptive decisions over too long a review. V1.1 addresses the pacing half.
## Problem
Louise's fatigue reading gstack review output came from two sources:
- Jargon density — technical terms appeared without explanation. Addressed in V1 (ELI10 writing).
- Interruption volume — `/autoplan` ran 4 phases (CEO + Design + Eng + DX), each with 5–10 AskUserQuestion prompts. Total ≈ 30–50 prompts over ~45 minutes. Non-technical users check out at ~10–15 interruptions. This is V1.1's target.
Translation alone doesn't fix interruption volume. A translated interruption is still an interruption. The fix needs to change WHEN findings surface, not just HOW they're worded.
## Why it's extracted (structural gaps from V1's third eng review + Codex pass 2)
During V1 planning, a pacing workstream was drafted: rank findings, auto-accept two-way doors, cap AskUserQuestion prompts at 3 per review phase, emit a Silent Decisions block for auto-accepted items, and provide a `flip <id>` command to re-open auto-accepted decisions post-hoc. The third eng-review pass + second Codex pass surfaced 10 gaps that couldn't be closed with plan-text edits:
- Session-state model undefined. Pacing needs per-phase state (which findings surfaced, which auto-accepted, which user can flip). V1 has per-skill-invocation state for glossing but no backing store for per-phase pacing memory.
- Phase identifier missing from question-log. Silent Eng #8 wanted to warn when > 3 prompts fire within one phase, but V0's `question-log.jsonl` has no `phase` field. V1 claimed "no schema change" — that contradicts the enforcement target.
- Question registry ≠ finding registry. V0's `scripts/question-registry.ts` covers questions registered at skill definition time. Review findings are dynamic (discovered at runtime), so `door_type: one-way` enforcement via the registry doesn't cover ad-hoc findings. One-way-door safety isn't enforceable for findings the agent generates mid-review.
- Pacing as prose can't invert existing control flow. V1 planned to add a "rank findings, then ask" rule to preamble prose. But existing skill templates like `plan-eng-review/SKILL.md.tmpl` have per-section STOP/AskUserQuestion sequences, and a prose rule in the preamble can't reliably override a hardcoded per-section STOP. The behavioral change is sequencing, not prompt wording.
- Flip mechanism has no implementation. "Reply `flip <id>` to change" was prose: no command parser, no state store, no replay behavior. If the conversation compacts and the Silent Decisions block leaves context, the original decision is lost.
- Migration prompt is itself an interrupt. V1's post-upgrade migration prompt (offering to restore V0 prose) counts against the interruption budget V1.1 is trying to reduce. V1.1 must decide: exempt it from the budget, or count it as interrupt 1 of N?
- First-run preamble prompts count too. Lake intro, telemetry, proactive, routing injection — Louise saw all of them on first run. They're interruptions before the first real skill runs. V1.1 must audit which of these are load-bearing for new users vs. deferrable until session N.
- Ranking formula not calibrated against real data. V1 considered `product 0-8` (broken: a `{0, 1, 2, 4, 8}` distribution), then `sum 0-6` with threshold ≥ 4. Neither was validated against the actual finding distribution. V1.1 should instrument the V0 question-log to measure what real findings look like, then calibrate.
- "Every one-way door surfaces" vs. "max 3 per phase" contradict. One-way cap = uncapped (safety); two-way cap = 3. But the plan had both rules without explicit precedence. V1.1 must state: one-way doors surface uncapped regardless of phase budget.
- Undefined verification values. The V1 plan had "Silent Decisions block ≥ N entries" with N never defined, and an `active: true` field in the throughput JSON never defined. V1.1 gets concrete values.
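To make the formula gap concrete, the following sketch enumerates both candidate formulas over three axes each scored 0–2 (following the plan's severity / irreversibility / user-decision-matters wording; the 0–2 per-axis scale is an assumption consistent with the stated ranges):

```typescript
// Hypothetical illustration of the two candidate ranking formulas.
// Axis names come from the plan; the per-axis 0-2 scale is an assumption.
type AxisScore = 0 | 1 | 2;

function productScore(sev: AxisScore, irrev: AxisScore, matters: AxisScore): number {
  return sev * irrev * matters; // any zero axis collapses the whole score to 0
}

function sumScore(sev: AxisScore, irrev: AxisScore, matters: AxisScore): number {
  return sev + irrev + matters; // dense 0-6 range, no single-axis veto
}

// Enumerate all 27 combinations to see each formula's reachable values.
const products = new Set<number>();
const sums = new Set<number>();
for (const s of [0, 1, 2] as AxisScore[])
  for (const i of [0, 1, 2] as AxisScore[])
    for (const m of [0, 1, 2] as AxisScore[]) {
      products.add(productScore(s, i, m));
      sums.add(sumScore(s, i, m));
    }
// products reaches only {0, 1, 2, 4, 8}, and 19 of 27 combos collapse to 0;
// sums reaches every value in 0..6.
```

This is why the product form is called "broken" above: a single zero axis vetoes the other two, leaving almost no usable gradations.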
## Scope for V1.1
1. Define session-state model. Per-skill-invocation vs. per-phase vs. per-conversation. Backing store: likely a JSON file at `~/.gstack/sessions/<session_id>/pacing-state.json` that records which findings surfaced vs. auto-accepted per phase. Cleanup: same TTL as existing session tracking in the preamble.
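A minimal sketch of what pacing-state.json might hold. Only the file path and the surfaced/auto-accepted split come from the plan; every field name here is an assumption to be settled during CEO review:

```typescript
// Hypothetical shape for pacing-state.json. Field names are assumptions.
type Phase = "ceo" | "design" | "eng" | "dx" | "other";

interface PacingDecision {
  id: string;                          // finding id the flip command targets
  phase: Phase;                        // which review phase produced it
  doorType: "one-way" | "two-way";
  disposition: "surfaced" | "auto-accepted" | "flipped";
  summary: string;                     // one-line Silent Decisions entry
}

interface PacingState {
  sessionId: string;
  decisions: PacingDecision[];
}

// Example record an /autoplan run might append (illustrative values):
const example: PacingDecision = {
  id: "test-coverage-bookclub-form",
  phase: "eng",
  doorType: "two-way",
  disposition: "auto-accepted",
  summary: "Defer form tests to follow-up PR",
};
```

Persisting the `summary` line is what lets a flip survive conversation compaction: the state file, not the chat transcript, is the source of truth.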
2. Add `phase` field to the `question-log.jsonl` schema. Classify each AskUserQuestion by which review phase it came from (CEO / Design / Eng / DX / other). Migration: existing entries default to `"unknown"`. Non-breaking schema extension.
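The non-breaking extension can be sketched as an optional field plus a read-side default. Field names other than `phase` are assumptions about the existing log schema:

```typescript
// Sketch of the read-side migration: legacy entries without a phase
// classify as "unknown". Only `phase` is from the plan; other field
// names are assumed.
interface QuestionLogEntry {
  question_id: string;
  asked_at: string;   // ISO timestamp
  phase?: string;     // new: "ceo" | "design" | "eng" | "dx" | "other"
}

function withPhase(entry: QuestionLogEntry): Required<QuestionLogEntry> {
  return { ...entry, phase: entry.phase ?? "unknown" };
}

// A pre-migration entry still parses and gets the default:
const legacy = withPhase({ question_id: "q-42", asked_at: "2025-01-01T00:00:00Z" });
// legacy.phase === "unknown"
```

Because the field is optional on read and always written going forward, old and new log lines can coexist in the same file.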
3. Extend registry coverage for dynamic findings. Two options, pick during CEO review:
   - (a) Widen `scripts/question-registry.ts` to allow runtime registration (ad-hoc IDs still get logged + classified).
   - (b) Add a secondary runtime classifier, `scripts/finding-classifier.ts`, that maps finding text → risk tier using pattern matching.
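Option (b) could look like the following. The pattern list here is purely illustrative; the real one would be derived from the calibration data, not hand-picked:

```typescript
// Sketch of option (b): map finding text to a risk tier via pattern
// matching. These patterns are illustrative assumptions, not a real list.
type RiskTier = "one-way" | "two-way";

const ONE_WAY_PATTERNS: RegExp[] = [
  /\bdrop(s|ped)?\b.*\b(table|column|data)\b/i, // destructive schema changes
  /\bdelete\b/i,                                 // destructive operations
  /\bpublic API\b/i,                             // published contracts
  /\bmigration\b/i,                              // hard-to-reverse data moves
];

function classifyFinding(text: string): RiskTier {
  // Conservative default: anything matching a one-way pattern is one-way.
  return ONE_WAY_PATTERNS.some((p) => p.test(text)) ? "one-way" : "two-way";
}

classifyFinding("Plan drops the users table without backup"); // "one-way"
classifyFinding("Rename helper for clarity");                 // "two-way"
```

The asymmetry is deliberate: false positives (surfacing a reversible finding) cost one interruption, while false negatives (auto-accepting a one-way door) cost safety, so the classifier should err toward "one-way".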
4. Move pacing from preamble prose into skill-template control flow. Update each review skill template to: (i) internally complete the phase, (ii) rank findings with the `gstack-pacing-rank` binary, (iii) emit up to 3 AskUserQuestion prompts, (iv) emit a Silent Decisions block with the rest. Not a preamble rule — an explicit sequence in each template.
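Steps (ii)–(iv) reduce to one selection rule, which also encodes the one-way-door precedence decided below. This is a sketch assuming findings already carry a score and door type; `gstack-pacing-rank` itself is not shown:

```typescript
// Sketch of the per-phase surfacing split. The cap of 3 and the one-way
// exemption come from the plan; the data shape is assumed.
interface RankedFinding {
  id: string;
  doorType: "one-way" | "two-way";
  score: number; // higher = more worth interrupting for
}

function splitForSurfacing(findings: RankedFinding[], cap = 3) {
  // One-way doors always surface, regardless of the phase budget.
  const oneWay = findings.filter((f) => f.doorType === "one-way");
  const twoWay = findings
    .filter((f) => f.doorType === "two-way")
    .sort((a, b) => b.score - a.score);
  return {
    surface: [...oneWay, ...twoWay.slice(0, cap)], // AskUserQuestion prompts
    silent: twoWay.slice(cap),                     // Silent Decisions block
  };
}
```

Because the rule is code invoked from the template's explicit sequence, it cannot be overridden by a stray per-section STOP the way a preamble prose rule can.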
5. Flip mechanism implementation. New binary `bin/gstack-flip-decision`. Command parser accepts `flip <id>` from the user message, looks up the original decision in pacing-state.json, and re-opens it as an explicit AskUserQuestion. The new choice persists.
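The parser half is small; a minimal sketch, assuming ids use the kebab-case shape seen in the acceptance criteria:

```typescript
// Sketch of the flip command parser. The `flip <id>` syntax comes from the
// plan; the id character set is an assumption based on example ids.
function parseFlip(message: string): string | null {
  const m = message.trim().match(/^flip\s+([a-z0-9-]+)$/i);
  return m ? m[1] : null; // null = not a flip command, fall through
}

parseFlip("flip test-coverage-bookclub-form"); // "test-coverage-bookclub-form"
parseFlip("please flip everything");           // null
```

The harder half is the lookup-and-replay path, which is exactly why the session-state store in Scope #1 is a prerequisite: without it there is nothing for `flip <id>` to look up once the Silent Decisions block leaves context.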
6. Migration-prompt budget decision. Explicit rule: one-shot migration prompts are exempt from the per-phase interruption budget. Rationale: they fire before review phases start, not during.
7. First-run preamble audit. Audit the lake intro, telemetry, proactive, and routing-injection prompts. For each: is it load-bearing for a first-time user, or deferrable? Likely outcome: suppress all but the lake intro until session 2+. Offer the remaining ones via a `/plan-tune first-run` command that users can invoke voluntarily.
8. Ranking threshold calibration. Instrument V0's question-log (already running, has history). Measure the actual distribution of `severity × irreversibility × user-decision-matters` across recent CEO + Eng + DX + Design reviews. Pick the threshold based on real data. Target: ~20% of findings surface, ~80% auto-accept.
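Once the log yields a score distribution, the threshold choice is mechanical. A sketch, with a made-up sample standing in for the real measurements:

```typescript
// Sketch of threshold selection against the ~20% surfacing target: pick the
// smallest sum-score threshold whose surface rate is at or below the target.
function pickThreshold(scores: number[], targetRate = 0.2): number {
  for (let t = 0; t <= 6; t++) {
    const rate = scores.filter((s) => s >= t).length / scores.length;
    if (rate <= targetRate) return t;
  }
  return 6; // fall back to the strictest threshold
}

// Illustrative distribution only — NOT real question-log data:
const sample = [0, 1, 1, 2, 2, 2, 3, 3, 4, 5];
pickThreshold(sample); // 4: scores >= 4 are 2 of 10 findings, i.e. 20%
```

Running this over the real per-review distributions would also show whether one global threshold suffices or each phase (CEO vs. Eng vs. DX) needs its own.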
9. Explicit rule: one-way doors uncapped. Hard-coded in skill-template prose: "one-way doors surface regardless of phase interruption budget." Two-way findings cap at 3 per phase.
10. Concrete verification values. Define `N` for Silent Decisions (e.g., ≥ 5 entries expected for a non-trivial plan), and define the throughput JSON schema with concrete field names.
## Acceptance criteria for V1.1
- Interruption count: Louise (or a similar non-technical collaborator) reruns `/autoplan` end-to-end on a plan comparable to the V0 baseline. AskUserQuestion count ≤ 50% of the V0 baseline. (V1 captures this baseline transcript for V1.1 calibration.)
- One-way-door coverage: 100% of safety-critical decisions (`door_type: one-way` OR classifier-flagged dynamic findings) surface individually at full technical detail. Uncapped.
- Flip round-trip: User types `flip test-coverage-bookclub-form`. The original auto-accepted decision re-opens as an AskUserQuestion. The user's new choice persists to the Silent Decisions block (or is removed if the user flips to explicit surfacing).
- Per-phase observability: `/plan-tune` can display per-phase AskUserQuestion counts for any session, reading from question-log.jsonl's new `phase` field.
- First-run reduction: New users see ≤ 1 meta-prompt (lake intro) before their first real skill runs, vs. V1's 4 (lake + telemetry + proactive + routing).
- Human rerun: Louise + Garry independent qualitative reviews, same pattern as V1.
## Dependencies on V1
V1.1 builds on V1's infrastructure:
- `explain_level` config key + preamble echo pattern (A4).
- Jargon list + Writing Style section (V1.1's interruption language should respect ELI10 rules).
- V0 dormancy negative tests (V1.1 won't wake the 5D psychographic machinery either).
- V1's captured Louise transcript (baseline for acceptance criterion calibration).
V1.1 does NOT depend on any V2 items (E1 substrate wiring, narrative/vibe, etc.).
## Review plan
- Pre-work: capture real question-log distribution from current V0 data. Use as calibration input for Scope #8.
- CEO review. Premise challenge: is pacing the right fix, or should V1.1 consider removing phases entirely? (E.g., collapse CEO + Design + Eng + DX into a single unified review pass.) Scope mode: SELECTIVE EXPANSION likely (pacing is the core, related improvements are cherry-picks).
- Codex review. Independent pass on the V1.1 plan. Expect particular scrutiny on the control-flow change (Scope #4) since that's the area V1 struggled with.
- DX review. Focus on the flip mechanism's DX — is `flip <id>` discoverable, is the command syntax natural, is the error path clear?
- Eng review ×N. Expect multiple passes, same as V1.
## NOT touched in V1.1
V2 items remain deferred:
- Confusion-signal detection
- 5D psychographic-driven skill adaptation (V0 E1)
- `/plan-tune narrative` + `/plan-tune vibe` (V0 E3)
- Per-skill or per-topic explain levels
- Team profiles
- AST-based "delivered features" metric