mirror of https://github.com/garrytan/gstack.git synced 2026-06-01 15:51:41 +02:00

Files

T

Garry Tan a6fb31726c v1.48.0.0 feat: AskUserQuestion split rule + runtime AUTO_DECIDE carve-out (#1740 )

* feat(preamble): add "Handling 5+ options — split, never drop" rule

Agents repeatedly hit Conductor's 4-option AskUserQuestion cap and
silently drop one option to fit, shrinking the user's decision space.
This rule names the bug and gives two compliant shapes: batch into
≤4-groups (for coherent alternatives) or split into N sequential
per-option calls (for independent scope items, default).

Inline preamble subsection is ~15 lines (rule + buckets + pointer).
Full reference with worked examples, Hold/dependency semantics, and
final-summary validation lives in docs/askuserquestion-split.md.
The agent loads the docs file on demand when N>4.

Per-option call shape: D<N>.k header, ELI10, Recommendation, kind-note
(no completeness score — decision actions, not coverage), Include /
Defer / Cut / Hold buckets. Hold stops the chain immediately; the
final D<N>.final call validates dependencies and confirms the
assembled scope.

question_ids: <skill>-split-<option-slug> (kebab-case ASCII, ≤64
chars). Also fixes orphan "12. " prefix on the existing CJK rule.

Tier-2+ skills inherit via the existing resolver. SKILL.md regenerated
for all 41 affected skills + 3 golden fixtures. Net diff per SKILL.md:
~34 lines (vs ~110 for the full inline version).

6 tests pin the inline contract (4-option cap, buckets, D-numbering,
docs pointer, runtime AUTO_DECIDE gate reference, orphan 12 regression).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(question-pref): runtime AUTO_DECIDE carve-out for *-split-* ids

Split chains (per-option AskUserQuestion calls emitted by the new
"Handling 5+ options" rule) must never be silently auto-approved
via /plan-tune preferences. The user's option set is sacred.

Layer 1 (mechanism): unique <skill>-split-<option-slug> ids prevent
cross-option preference leakage. Layer 2 (this commit): the runtime
checker `gstack-question-preference --check` detects any id matching
*-split-* and forces ASK_NORMALLY even when never-ask or
ask-only-for-one-way preferences exist for that exact id. An
explanatory note tells the user their preference was bypassed and why.

7 tests pin the carve-out: no-pref baseline, never-ask override,
explanatory note text, ask-only-for-one-way override, always-ask
(no note), non-split id containing "split" word (negative case for
regex specificity), multi-skill split id formats.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(e2e): split-overflow regression for /plan-ceo-review

Periodic-tier E2E test that catches the original failure mode the
user complained about: 5+ options for ONE decision must split into
N sequential AskUserQuestion calls, not drop one to fit Conductor's
4-option cap.

Fixture: 5 independent chat-platform integration candidates
(Slack/Discord/Teams/Telegram/Mattermost), each carrying its own
include/defer/cut decision. Floor = 4 review-phase AUQs (standard
[N-1] tolerance band). Pre-fix "drop to 4 + 1 dropped" fails this
floor.

Wired into test/helpers/touchfiles.ts: tier periodic, depends on
plan-ceo-review/**, the new preamble subsection, the question-pref
binary (for the carve-out), and the runner helper. touchfiles.test.ts
expected count bumped 21 → 22 to account for the new entry.

Cost: ~$0.30/run when EVALS_TIER=periodic. Skips silently otherwise.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: post-merge regen + rebase size-budget baseline to v1.47.0.0

After merging origin/main (v1.45 → v1.47), three things needed cleanup:

1. spec/SKILL.md (main's new skill) regenerated to include our split-vs-drop
   preamble subsection — same mechanical regen as the other 41 tier-2+ skills.
2. Three golden ship fixtures refreshed to capture main's GSTACK_PLAN_MODE
   block + /spec routing entry + jargon-list.json refactor.
3. docs/skills.md — added /spec table row that main's PR (#1698/#1733) shipped
   without. Pre-existing failure on main; this PR catches and fixes.

Also rebased test/skill-size-budget.test.ts from v1.44.1 → v1.47.0.0 baseline.
Main's v1.46 (catalog tokens trim) + v1.47 (/spec skill) pushed the v1.44.1
anchor past the 5% ratchet to ×1.059 — pre-existing failure on main. This
PR captures a fresh parity-baseline-v1.47.0.0.json and re-anchors the test
there. Historical v1.44.1.json and v1.46.0.0.json retained in test/fixtures/
for reference. Our subsection contributes ~0.1% of the post-rebase corpus.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v1.48.0.0)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-26 23:43:07 -07:00

8.5 KiB

Raw Blame History

AskUserQuestion split rule — full reference

Inline summary lives in the canonical preamble (scripts/resolvers/preamble/generate-ask-user-format.ts). That subsection is intentionally compressed because it injects into every tier-2+ skill's SKILL.md. This file is the deep reference the inline guidance points to — load it when N>4 options come up and you need worked examples or the full Hold / dependency / final-summary semantics.

The bug this prevents

Pre-rule failure mode (transcript verbatim from the user complaint that motivated this):

"I'm hitting Conductor's limit of 4 options in the AUQ, so I need to cut one. E4 (the detect-mappings codegen) is the biggest lift and probably beyond scope for v0.42 anyway — users can hand-author their mapping rules for the 9 clusters. I'll drop that and keep E1, E2, E3, and E5..."

"Conductor caps at 4 options. Trimming: E4 (detect-mappings codegen) is the largest-effort item and a natural v0.43+ follow-up — moving it to TODOS.md without asking. Re-firing with 4."

The agent unilaterally cut a real option without user input. The option set is the user's decision space; shrinking it silently is the bug.

Which shape: batched vs. split

Two compliant shapes. Pick by reading the options:

Batched into ≤4-groups — the options are coherent alternatives, one will be picked. Examples: "major / minor / patch / micro" for a version bump, "5 layout variants where the user picks one", "which framework: rspec / minitest / cucumber / none". Batch the top 4 into one AskUserQuestion; surface the 5th as a follow-up if none of the first 4 fit. This is the lower-friction path when applicable.
Split per-option — the options are independent scope items, each carrying its own include/defer/cut decision. Examples: "E1..E6, which do we ship?", "5 candidate integrations for Q3", "8 TODOs surfaced by the audit — which do we land?". Fire N sequential AskUserQuestion calls, one per option.

Default to split per-option when unsure. Batching wrong options together — shoehorning orthogonal scope items into one question — is the same failure mode as dropping.

Split per-option mechanics

Before the chain

Check for dependencies between options. If E3 requires E1, or E5 conflicts with E2, surface that in the per-option ELI10:

"Cutting this orphans E3 — they're linked."

Without dependency surfacing, the chain produces incoherent picked sets (user picks Include for E3 + Cut for E1, ships an unbuildable scope).

D-numbering

Parent decision: D<N> where N is the global question counter.
Each per-option call: D<N>.k for k=1..K children.
Final summary: D<N>.final.
Single-option revise: D<N>.revise-<k>.

Example chain for 5 options at parent D3:

D3.1 → D3.2 → D3.3 → D3.4 → D3.5 → D3.final

Per-option call shape

For each option Eₖ, fire an AskUserQuestion with:

D<N>.k header (e.g. D3.1, D3.2 ... D3.5)
ELI10 of just this option's scope, cost, and any dependency it carries
Recommendation: Include / Defer / Cut, with concrete reason
4 buckets per option:
- A) Include in this scope (recommended/not)
- B) Defer to follow-up (TODOs / next version)
- C) Cut entirely
- D) Hold — stop the chain, discuss before deciding
Note: options differ in kind, not coverage — no completeness score. (Include/Defer/Cut/Hold are decision actions, so the existing format rule applies: omit Completeness: N/10 and use the kind-note instead.)

Hold means stop, not queue

When the user picks Hold on any per-option call, stop the chain immediately. Do not continue asking later options behind the Hold — the user wants to discuss the picked option first. After discussion, the user resumes by saying "continue" or naming the next option to ask about.

Wrong behavior: queue E4 and E5 behind a Hold on E3, then fire them later with stale context. Right behavior: stop, let the user reset the parent decision, resume from where they left off.

Final summary

After the chain resolves (without Hold), fire D<N>.final to confirm and validate the assembled set.

Step 1 — validate dependencies. If the picked set is incoherent (e.g. E3 picked Include but its required E1 was Cut), do NOT silently accept. Re-prompt the conflict as a single AskUserQuestion:

"E3 needs E1 but you cut E1. Revise: A) keep E1 B) cut E3 too C) leave as-is and accept the broken state"

Step 2 — confirm the assembled set. If coherent:

"Here's the assembled set: E1, E2, E5. Ship this scope? A) Ship this scope (recommended) B) Revise one option (you pick which) C) Cut more"

Step 3 — targeted revise. If the user picks B, ask which option to revise, then fire ONE per-option AskUserQuestion at D<N>.revise-<k> to update just that option. Do not re-run the whole chain.

Sizing rules

N ≤ 4: use the normal single AskUserQuestion form. Don't split.
N = 5 or 6: split (or batch if a clean grouping exists).
N > 6: BEFORE the chain, fire a meta-AskUserQuestion at D<N>.0:

"About to ask N per-option questions. Options: A) Proceed with the full split (recommended only if every option is independent) B) Narrow scope first — I'll propose a smaller set C) Batch into groups of 4 instead"

This is itself an AskUserQuestion tool call, not prose — it counts as the first prompt in the chain, not a violation of the "tool not prose" rule.

question_id rules for split chains

Each per-option AskUserQuestion emits a unique question_id of the form <skill>-split-<option-slug> where <option-slug> is the option's key kebab-cased (lowercase, hyphens, ASCII only).

Examples:

plan-ceo-review-split-e4-detect-mappings
ship-split-rspec
plan-eng-review-split-add-coverage-test

Collision handling. If two options would produce the same slug, suffix with -2, -3, etc.

Length. Total length must be ≤64 chars (validated by bin/gstack-question-preference --write). Truncate the option slug if needed, preserving the <skill>-split- prefix.

AUTO_DECIDE behavior with split chains

Two-layer defense.

Layer 1 — mechanism. Each per-option question_id is unique to its option, so preferences set on one option's id cannot leak across the chain. A never-ask on ship-split-rspec does not silently approve ship-split-minitest.

Layer 2 — runtime enforcement. bin/gstack-question-preference --check detects any id matching *-split-* (the canonical slug pattern emitted by split chains) and forces ASK_NORMALLY even when a never-ask or ask-only-for-one-way preference exists for that exact id. The check emits an explanatory note when this override fires:

"split-chain per-option calls always ASK_NORMALLY; your never-ask preference does not apply to options inside a sequential split."

Result. Split-chain per-option calls are NEVER AUTO_DECIDE-eligible. This is a runtime contract, not just collision-resistance by id uniqueness. The user's option set is sacred — restoring user sovereignty over the decision space is the entire point of splitting.

Interaction with per-skill rules

This rule overrides any per-skill "batch decisions" guidance. Per-skill templates that explicitly require one-issue-per-call (e.g. plan-eng-review) are already compatible — they're a stricter special case of this rule.

Worked example: 5 platform integrations

Fixture used by test/skill-e2e-plan-ceo-split-overflow.test.ts. A plan has 5 independent chat-platform candidates:

E1) Slack DM bot (~2 weeks, ~40% of asks)
E2) Discord guild bot (~3 weeks, ~15%)
E3) Microsoft Teams (~4 weeks, ~5%)
E4) Telegram (~1 week, ~8%)
E5) Mattermost (~2 weeks, ~3%)

User wants individual decisions per candidate, not a bundled pick. The agent should:

Recognize this is a 5-option independent-scope decision → split.
Check dependencies (none here — each platform is standalone).
Fire D3.1 through D3.5, one per platform, with Include / Defer / Cut / Hold buckets and an effort+demand-grounded recommendation per option.
After the chain, fire D3.final summarizing the assembled scope (e.g. "Ship E1 + E4 — Slack and Telegram pull most demand for least build cost. Defer the rest. A) Ship / B) Revise / C) Cut more").

Pre-fix failure shape (the bug): agent constructs a single AskUserQuestion with E1..E4 as four options, drops E5 with prose like "E5 is the smallest revenue segment, moving to TODOs". The user never got to weigh in on E5. Floor-of-4 in the E2E test catches this.

8.5 KiB Raw Blame History