Merge remote-tracking branch 'origin/main' into garrytan/workspace-aware-ship

Rebumped v1.8.0.0 -> v1.11.0.0 (one minor bump past main's v1.10.1.0) using
bin/gstack-next-version — the same queue-aware path this branch introduces.
CHANGELOG repositioned so v1.11.0.0 sits above main's new entries
(v1.10.1.0 / v1.10.0.0 / v1.9.0.0).

Conflicts resolved:
- VERSION, package.json: rebumped to v1.11.0.0 (util-picked)
- bin/gstack-config: merged both lists (workspace_root + gbrain keys)
- CHANGELOG.md: hoisted v1.11.0.0 entry above main's new entries

Pre-existing failures in main (4) documented but not fixed in this PR:
1. gstack-brain-sync secret scan > blocks bearer-json (brain-sync tests)
2. no files larger than 2MB (security-bench fixture, already TODO'd)
3. selectTests > skill-specific change (touchfiles scoping)
4. Opus 4.7 overlay pacing directive (expectation stale after v1.10.1.0
   removed the Fan out nudge)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Garry Tan
2026-04-23 21:20:25 -07:00
parent 416a56a5c8
commit a64d70ba35
87 changed files with 14392 additions and 788 deletions
+228 -14
@@ -355,6 +355,234 @@ AI orchestrator (e.g., OpenClaw). In spawned sessions:
- Focus on completing the task and reporting results via prose output.
- End with a completion report: what shipped, decisions made, anything uncertain.
## AskUserQuestion Format
**ALWAYS follow this structure for every AskUserQuestion call. Every element is non-skippable. If you find yourself about to skip any of them, stop and back up.**
### Required shape
Every AskUserQuestion reads like a decision brief, not a bullet list:
```
D<N> — <one-line question title>
ELI10: <plain English a 16-year-old could follow, 2-4 sentences, name the stakes>
Stakes if we pick wrong: <one sentence on what breaks, what user sees, what's lost>
Recommendation: <choice> because <one-line reason>
Completeness: A=X/10, B=Y/10 (or: Note: options differ in kind, not coverage — no completeness score)
Pros / cons:
A) <option label> (recommended)
✅ <pro — concrete, observable, ≥40 chars>
✅ <pro>
❌ <con — honest, ≥40 chars>
B) <option label>
✅ <pro>
❌ <con>
Net: <one-line synthesis of what you're actually trading off>
```
### Element rules
1. **D-numbering.** First question in a skill invocation is `D1`. Increment per
question within the same skill. This is a model-level instruction, not a
runtime counter — you count your own questions. Nested skill invocation
(e.g., `/plan-ceo-review` running `/office-hours` inline) starts its own
D1; label as `D1 (office-hours)` to disambiguate when the user will see
both. Drift is expected over long sessions; minor inconsistency is fine.
2. **Re-ground.** Before ELI10, state the project, current branch (use the
`_BRANCH` value from the preamble, NOT conversation history or gitStatus),
and the current plan/task. 1-2 sentences. Assume the user hasn't looked at
this window in 20 minutes.
3. **ELI10 (ALWAYS).** Explain in plain English a smart 16-year-old could
follow. Concrete examples and analogies, not function names. Say what it
DOES, not what it's called. This is not preamble — the user is about to
make a decision and needs context. Even in terse mode, emit the ELI10.
4. **Stakes if we pick wrong (ALWAYS).** One sentence naming what breaks in
concrete terms (pain avoided / capability unlocked / consequence named).
"Users see a 3-second spinner" beats "performance may degrade." Forces
the trade-off to be real.
5. **Recommendation (ALWAYS).** `Recommendation: <choice> because <one-line
reason>` on its own line. Never omit it. Required for every AskUserQuestion,
   even when neutral-posture (see rule 9). The `(recommended)` label on the
option is REQUIRED — `scripts/resolvers/question-tuning.ts` reads it to
power the AUTO_DECIDE path. Omitting it breaks auto-decide.
6. **Completeness scoring (when meaningful).** When options differ in
coverage (full test coverage vs happy path vs shortcut, complete error
handling vs partial), score each `Completeness: N/10` on its own line.
Calibration: 10 = complete, 7 = happy path only, 3 = shortcut. Flag any
option ≤5 where a higher-completeness option exists. When options differ
in kind (review posture, architectural A-vs-B, cherry-pick Add/Defer/Skip,
two different kinds of systems), SKIP the score and write one line:
`Note: options differ in kind, not coverage — no completeness score.`
Do NOT fabricate filler scores — empty 10/10 on every option is worse
than no score.
7. **Pros / cons block.** Every option gets per-bullet ✅ (pro) and ❌ (con)
markers. Rules:
- **Minimum 2 pros and 1 con per option.** If you can't name a con for
the recommended option, the recommendation is hollow — go find one. If
you can't name a pro for the rejected option, the question isn't real.
   - **Minimum 40 characters per bullet.** `✅ Simple` is not a pro. `✅
Reuses the YAML frontmatter format already in MEMORY.md, zero new
parser` is a pro. Concrete, observable, specific.
- **Hard-stop escape** for genuinely one-sided choices (destructive-action
confirmation, one-way doors): a single bullet `✅ No cons — this is a
hard-stop choice` satisfies the rule. Use sparingly; overuse flips a
decision brief into theater.
8. **Net line (ALWAYS).** Closes the decision with a one-sentence synthesis
of what the user is actually trading off. From the reference screenshot:
*"The new-format case is speculative. The copy-format case is immediate
leverage. Copy now, evolve later if a real pattern emerges."* Not a
summary — a verdict frame.
9. **Neutral-posture handling.** When the skill explicitly says "neutral
recommendation posture" (SELECTIVE EXPANSION cherry-picks, taste calls,
kind-differentiated choices where neither side dominates), the
Recommendation line reads: `Recommendation: <default-choice> — this is a
taste call, no strong preference either way`. The `(recommended)` label
STAYS on the default option (machine-readable hint for AUTO_DECIDE). The
`— this is a taste call` prose is the human-readable neutrality signal.
Both coexist.
10. **Effort both-scales.** When an option involves effort, show both human
and CC scales: `(human: ~2 days / CC: ~15 min)`.
11. **Tool_use, not prose.** A markdown block labeled `Question:` is not a
question — the user never sees it as interactive. If you wrote one in
prose, stop and reissue as an actual AskUserQuestion tool_use. The rich
markdown goes in the question body; the `options` array stays short
labels (A, B, C).
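To make rules 5 and 9 concrete, here is a minimal sketch of how the `(recommended)` label and the taste-call marker could be read out of a question body. The real `scripts/resolvers/question-tuning.ts` may be structured differently — the function and type names below are illustrative assumptions, not its actual API:

```typescript
// Hypothetical sketch only — the real question-tuning.ts may differ.
interface AutoDecideHint {
  defaultOption: string | null; // letter of the option carrying "(recommended)"
  tasteCall: boolean;           // rule 9's human-readable neutrality signal
}

function readAutoDecideHint(questionBody: string): AutoDecideHint {
  // An option line looks like "A) <label> (recommended)".
  const match = questionBody.match(/^([A-Z])\)\s+.*\(recommended\)/m);
  return {
    defaultOption: match ? match[1] : null,
    tasteCall: questionBody.includes("this is a taste call"),
  };
}
```

Note how both signals coexist: a neutral-posture brief still yields a non-null `defaultOption` (so AUTO_DECIDE keeps working) while `tasteCall` flags that the recommendation is soft.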
### Self-check before emitting
Before calling AskUserQuestion, verify:
- [ ] D<N> header present
- [ ] ELI10 paragraph present (stakes line too)
- [ ] Recommendation line present with concrete reason
- [ ] Completeness scored (coverage) OR kind-note present (kind)
- [ ] Every option has ≥2 ✅ and ≥1 ❌, each ≥40 chars (or hard-stop escape)
- [ ] (recommended) label on one option (even for neutral-posture — see rule 9)
- [ ] Net line closes the decision
- [ ] You are calling the tool, not writing prose
If you'd need to read the source to understand your own explanation, it's
too complex — simplify before emitting.
Per-skill instructions may add additional formatting rules on top of this
baseline.
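For reference, a brief that satisfies every element rule above might read like this (the scenario and wording are hypothetical, built from the copy-format example quoted in rules 7 and 8):

```
D1 — Reuse the MEMORY.md frontmatter format or design a new one?
ELI10: We need somewhere to store per-branch notes. We can copy the YAML
header style MEMORY.md already uses, or invent a new layout. Copying means
zero new parsing code; inventing means we could tailor the fields later.
Stakes if we pick wrong: a new format means a second parser to maintain,
and notes written today may need migrating if the format changes.
Recommendation: A because it reuses an existing, already-parsed format.
Note: options differ in kind, not coverage — no completeness score.
Pros / cons:
A) Copy the MEMORY.md frontmatter format (recommended)
✅ Reuses the YAML frontmatter format already in MEMORY.md, zero new parser
✅ Existing tooling that reads MEMORY.md keeps working on the new files
❌ Inherits field names that may not fit branch notes perfectly
B) Design a new format
✅ Fields can be tailored exactly to what branch notes actually need
❌ Requires writing and maintaining a second parser from day one
Net: The new-format case is speculative. The copy-format case is immediate
leverage. Copy now, evolve later if a real pattern emerges.
```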
## GBrain Sync (skill start)
```bash
# gbrain-sync: drain pending writes, pull once per day. Silent no-op when
# the feature isn't initialized or gbrain_sync_mode is "off". See
# docs/gbrain-sync.md.
_GSTACK_HOME="${GSTACK_HOME:-$HOME/.gstack}"
_BRAIN_REMOTE_FILE="$HOME/.gstack-brain-remote.txt"
_BRAIN_SYNC_BIN="$HOME/.claude/skills/gstack/bin/gstack-brain-sync"
_BRAIN_CONFIG_BIN="$HOME/.claude/skills/gstack/bin/gstack-config"
_BRAIN_SYNC_MODE=$("$_BRAIN_CONFIG_BIN" get gbrain_sync_mode 2>/dev/null || echo off)
# New-machine hint: URL file present, local .git missing, sync not yet enabled.
if [ -f "$_BRAIN_REMOTE_FILE" ] && [ ! -d "$_GSTACK_HOME/.git" ] && [ "$_BRAIN_SYNC_MODE" = "off" ]; then
  _BRAIN_NEW_URL=$(head -1 "$_BRAIN_REMOTE_FILE" 2>/dev/null | tr -d '[:space:]')
  if [ -n "$_BRAIN_NEW_URL" ]; then
    echo "BRAIN_SYNC: brain repo detected: $_BRAIN_NEW_URL"
    echo "BRAIN_SYNC: run 'gstack-brain-restore' to pull your cross-machine memory (or 'gstack-config set gbrain_sync_mode off' to dismiss forever)"
  fi
fi
# Active-sync path.
if [ -d "$_GSTACK_HOME/.git" ] && [ "$_BRAIN_SYNC_MODE" != "off" ]; then
  # Once-per-day pull.
  _BRAIN_LAST_PULL_FILE="$_GSTACK_HOME/.brain-last-pull"
  _BRAIN_NOW=$(date +%s)
  _BRAIN_DO_PULL=1
  if [ -f "$_BRAIN_LAST_PULL_FILE" ]; then
    _BRAIN_LAST=$(cat "$_BRAIN_LAST_PULL_FILE" 2>/dev/null || echo 0)
    _BRAIN_AGE=$(( _BRAIN_NOW - _BRAIN_LAST ))
    [ "$_BRAIN_AGE" -lt 86400 ] && _BRAIN_DO_PULL=0
  fi
  if [ "$_BRAIN_DO_PULL" = "1" ]; then
    ( cd "$_GSTACK_HOME" && git fetch origin >/dev/null 2>&1 && git merge --ff-only "origin/$(git rev-parse --abbrev-ref HEAD)" >/dev/null 2>&1 ) || true
    echo "$_BRAIN_NOW" > "$_BRAIN_LAST_PULL_FILE"
  fi
  # Drain pending queue, push.
  "$_BRAIN_SYNC_BIN" --once 2>/dev/null || true
fi
# Status line — always emitted, easy to grep.
if [ -d "$_GSTACK_HOME/.git" ] && [ "$_BRAIN_SYNC_MODE" != "off" ]; then
  _BRAIN_QUEUE_DEPTH=0
  [ -f "$_GSTACK_HOME/.brain-queue.jsonl" ] && _BRAIN_QUEUE_DEPTH=$(wc -l < "$_GSTACK_HOME/.brain-queue.jsonl" | tr -d ' ')
  _BRAIN_LAST_PUSH="never"
  [ -f "$_GSTACK_HOME/.brain-last-push" ] && _BRAIN_LAST_PUSH=$(cat "$_GSTACK_HOME/.brain-last-push" 2>/dev/null || echo never)
  echo "BRAIN_SYNC: mode=$_BRAIN_SYNC_MODE | last_push=$_BRAIN_LAST_PUSH | queue=$_BRAIN_QUEUE_DEPTH"
else
  echo "BRAIN_SYNC: off"
fi
```
**Privacy stop-gate (fires ONCE per machine).**
If the bash output shows `BRAIN_SYNC: off` AND the config value
`gbrain_sync_mode_prompted` is `false` AND gbrain is detected on this host
(either `gbrain doctor --fast --json` succeeds or the `gbrain` binary is in PATH),
fire a one-time privacy gate via AskUserQuestion:
> gstack can publish your session memory (learnings, plans, designs, retros) to a
> private GitHub repo that GBrain indexes across your machines. Higher tiers
> include behavioral data (session timelines, developer profile). How much do you
> want to sync?
Options:
- A) Everything allowlisted (recommended — maximum cross-machine memory)
- B) Only artifacts (plans, designs, retros, learnings) — skip timelines and profile
- C) Decline — keep everything local
After the user answers, run (substituting the chosen value):
```bash
# Chosen mode: full | artifacts-only | off
"$_BRAIN_CONFIG_BIN" set gbrain_sync_mode <choice>
"$_BRAIN_CONFIG_BIN" set gbrain_sync_mode_prompted true
```
If A or B was chosen AND `~/.gstack/.git` doesn't exist, ask a follow-up:
"Set up the GBrain sync repo now? (runs `gstack-brain-init`)"
- A) Yes, run it now
- B) Show me the command, I'll run it myself
Do not block the skill. Emit the question, continue the skill workflow. The
next skill run picks up wherever this left off.
**At skill END (before the telemetry block),** run these bash commands to
catch artifact writes (design docs, plans, retros) that skipped the writer
shims, plus drain any still-pending queue entries:
```bash
"~/.claude/skills/gstack/bin/gstack-brain-sync" --discover-new 2>/dev/null || true
"~/.claude/skills/gstack/bin/gstack-brain-sync" --once 2>/dev/null || true
```
## Model-Specific Behavioral Patch (claude)
The following nudges are tuned for the claude model family. They are
@@ -468,20 +696,6 @@ are shown, synthesize a one-paragraph welcome briefing before proceeding:
"Welcome back to {branch}. Last session: /{skill} ({outcome}). [Checkpoint summary if
available]. [Health score if available]." Keep it to 2-3 sentences.
## AskUserQuestion Format
**ALWAYS follow this structure for every AskUserQuestion call. All four elements are non-skippable. If you find yourself about to skip any of them, stop and back up.**
1. **Re-ground:** State the project, the current branch (use the `_BRANCH` value printed by the preamble — NOT any branch from conversation history or gitStatus), and the current plan/task. (1-2 sentences)
2. **Simplify (ELI10, ALWAYS):** Explain what's happening in plain English a smart 16-year-old could follow. Concrete examples and analogies, not function names or internal jargon. Say what it DOES, not what it's called. State the stakes: what breaks if we pick wrong. This is NOT optional verbosity and it is NOT preamble — the user is about to make a decision and needs context. Even if you'd normally stay terse, emit the ELI10 paragraph. The user will ask for it anyway; do it the first time.
3. **Recommend (ALWAYS):** Every question ends with `RECOMMENDATION: Choose [X] because [one-line reason]` on its own line. Never omit it. Never collapse it into the options list. Required for every AskUserQuestion, regardless of whether the options are coverage-differentiated or different-in-kind.
4. **Score completeness (when meaningful):** When options differ in coverage (e.g. full test coverage vs happy path vs shortcut, complete error handling vs partial), score each with `Completeness: N/10` on its own line. Calibration: 10 = complete (all edge cases, full coverage), 7 = happy path only, 3 = shortcut. Flag any option ≤5 where a higher-completeness option exists. When options differ in kind (picking a review posture, picking an architectural approach, cherry-pick Add/Defer/Skip, choosing between two different kinds of systems), the completeness axis doesn't apply — skip `Completeness: N/10` entirely and write one line: `Note: options differ in kind, not coverage — no completeness score.` Do not fabricate filler scores.
5. **Options:** Lettered options: `A) ... B) ... C) ...` — when an option involves effort, show both scales: `(human: ~X / CC: ~Y)`
Assume the user hasn't looked at this window in 20 minutes and doesn't have the code open. If you'd need to read the source to understand your own explanation, it's too complex.
Per-skill instructions may add additional formatting rules on top of this baseline.
## Writing Style (skip entirely if `EXPLAIN_LEVEL: terse` appears in the preamble echo OR the user's current message explicitly requests terse / no-explanations output)
These rules apply to every AskUserQuestion, every response you write to the user, and every review finding. They compose with the AskUserQuestion Format section above: Format = *how* a question is structured; Writing Style = *the prose quality of the content inside it*.
+228 -14
@@ -344,6 +344,234 @@ AI orchestrator (e.g., OpenClaw). In spawned sessions:
- Focus on completing the task and reporting results via prose output.
- End with a completion report: what shipped, decisions made, anything uncertain.
## AskUserQuestion Format
**ALWAYS follow this structure for every AskUserQuestion call. Every element is non-skippable. If you find yourself about to skip any of them, stop and back up.**
### Required shape
Every AskUserQuestion reads like a decision brief, not a bullet list:
```
D<N> — <one-line question title>
ELI10: <plain English a 16-year-old could follow, 2-4 sentences, name the stakes>
Stakes if we pick wrong: <one sentence on what breaks, what user sees, what's lost>
Recommendation: <choice> because <one-line reason>
Completeness: A=X/10, B=Y/10 (or: Note: options differ in kind, not coverage — no completeness score)
Pros / cons:
A) <option label> (recommended)
✅ <pro — concrete, observable, ≥40 chars>
✅ <pro>
❌ <con — honest, ≥40 chars>
B) <option label>
✅ <pro>
❌ <con>
Net: <one-line synthesis of what you're actually trading off>
```
### Element rules
1. **D-numbering.** First question in a skill invocation is `D1`. Increment per
question within the same skill. This is a model-level instruction, not a
runtime counter — you count your own questions. Nested skill invocation
(e.g., `/plan-ceo-review` running `/office-hours` inline) starts its own
D1; label as `D1 (office-hours)` to disambiguate when the user will see
both. Drift is expected over long sessions; minor inconsistency is fine.
2. **Re-ground.** Before ELI10, state the project, current branch (use the
`_BRANCH` value from the preamble, NOT conversation history or gitStatus),
and the current plan/task. 1-2 sentences. Assume the user hasn't looked at
this window in 20 minutes.
3. **ELI10 (ALWAYS).** Explain in plain English a smart 16-year-old could
follow. Concrete examples and analogies, not function names. Say what it
DOES, not what it's called. This is not preamble — the user is about to
make a decision and needs context. Even in terse mode, emit the ELI10.
4. **Stakes if we pick wrong (ALWAYS).** One sentence naming what breaks in
concrete terms (pain avoided / capability unlocked / consequence named).
"Users see a 3-second spinner" beats "performance may degrade." Forces
the trade-off to be real.
5. **Recommendation (ALWAYS).** `Recommendation: <choice> because <one-line
reason>` on its own line. Never omit it. Required for every AskUserQuestion,
   even when neutral-posture (see rule 9). The `(recommended)` label on the
option is REQUIRED — `scripts/resolvers/question-tuning.ts` reads it to
power the AUTO_DECIDE path. Omitting it breaks auto-decide.
6. **Completeness scoring (when meaningful).** When options differ in
coverage (full test coverage vs happy path vs shortcut, complete error
handling vs partial), score each `Completeness: N/10` on its own line.
Calibration: 10 = complete, 7 = happy path only, 3 = shortcut. Flag any
option ≤5 where a higher-completeness option exists. When options differ
in kind (review posture, architectural A-vs-B, cherry-pick Add/Defer/Skip,
two different kinds of systems), SKIP the score and write one line:
`Note: options differ in kind, not coverage — no completeness score.`
Do NOT fabricate filler scores — empty 10/10 on every option is worse
than no score.
7. **Pros / cons block.** Every option gets per-bullet ✅ (pro) and ❌ (con)
markers. Rules:
- **Minimum 2 pros and 1 con per option.** If you can't name a con for
the recommended option, the recommendation is hollow — go find one. If
you can't name a pro for the rejected option, the question isn't real.
   - **Minimum 40 characters per bullet.** `✅ Simple` is not a pro. `✅
Reuses the YAML frontmatter format already in MEMORY.md, zero new
parser` is a pro. Concrete, observable, specific.
- **Hard-stop escape** for genuinely one-sided choices (destructive-action
confirmation, one-way doors): a single bullet `✅ No cons — this is a
hard-stop choice` satisfies the rule. Use sparingly; overuse flips a
decision brief into theater.
8. **Net line (ALWAYS).** Closes the decision with a one-sentence synthesis
of what the user is actually trading off. From the reference screenshot:
*"The new-format case is speculative. The copy-format case is immediate
leverage. Copy now, evolve later if a real pattern emerges."* Not a
summary — a verdict frame.
9. **Neutral-posture handling.** When the skill explicitly says "neutral
recommendation posture" (SELECTIVE EXPANSION cherry-picks, taste calls,
kind-differentiated choices where neither side dominates), the
Recommendation line reads: `Recommendation: <default-choice> — this is a
taste call, no strong preference either way`. The `(recommended)` label
STAYS on the default option (machine-readable hint for AUTO_DECIDE). The
`— this is a taste call` prose is the human-readable neutrality signal.
Both coexist.
10. **Effort both-scales.** When an option involves effort, show both human
and CC scales: `(human: ~2 days / CC: ~15 min)`.
11. **Tool_use, not prose.** A markdown block labeled `Question:` is not a
question — the user never sees it as interactive. If you wrote one in
prose, stop and reissue as an actual AskUserQuestion tool_use. The rich
markdown goes in the question body; the `options` array stays short
labels (A, B, C).
### Self-check before emitting
Before calling AskUserQuestion, verify:
- [ ] D<N> header present
- [ ] ELI10 paragraph present (stakes line too)
- [ ] Recommendation line present with concrete reason
- [ ] Completeness scored (coverage) OR kind-note present (kind)
- [ ] Every option has ≥2 ✅ and ≥1 ❌, each ≥40 chars (or hard-stop escape)
- [ ] (recommended) label on one option (even for neutral-posture — see rule 9)
- [ ] Net line closes the decision
- [ ] You are calling the tool, not writing prose
If you'd need to read the source to understand your own explanation, it's
too complex — simplify before emitting.
Per-skill instructions may add additional formatting rules on top of this
baseline.
## GBrain Sync (skill start)
```bash
# gbrain-sync: drain pending writes, pull once per day. Silent no-op when
# the feature isn't initialized or gbrain_sync_mode is "off". See
# docs/gbrain-sync.md.
_GSTACK_HOME="${GSTACK_HOME:-$HOME/.gstack}"
_BRAIN_REMOTE_FILE="$HOME/.gstack-brain-remote.txt"
_BRAIN_SYNC_BIN="$GSTACK_BIN/gstack-brain-sync"
_BRAIN_CONFIG_BIN="$GSTACK_BIN/gstack-config"
_BRAIN_SYNC_MODE=$("$_BRAIN_CONFIG_BIN" get gbrain_sync_mode 2>/dev/null || echo off)
# New-machine hint: URL file present, local .git missing, sync not yet enabled.
if [ -f "$_BRAIN_REMOTE_FILE" ] && [ ! -d "$_GSTACK_HOME/.git" ] && [ "$_BRAIN_SYNC_MODE" = "off" ]; then
  _BRAIN_NEW_URL=$(head -1 "$_BRAIN_REMOTE_FILE" 2>/dev/null | tr -d '[:space:]')
  if [ -n "$_BRAIN_NEW_URL" ]; then
    echo "BRAIN_SYNC: brain repo detected: $_BRAIN_NEW_URL"
    echo "BRAIN_SYNC: run 'gstack-brain-restore' to pull your cross-machine memory (or 'gstack-config set gbrain_sync_mode off' to dismiss forever)"
  fi
fi
# Active-sync path.
if [ -d "$_GSTACK_HOME/.git" ] && [ "$_BRAIN_SYNC_MODE" != "off" ]; then
  # Once-per-day pull.
  _BRAIN_LAST_PULL_FILE="$_GSTACK_HOME/.brain-last-pull"
  _BRAIN_NOW=$(date +%s)
  _BRAIN_DO_PULL=1
  if [ -f "$_BRAIN_LAST_PULL_FILE" ]; then
    _BRAIN_LAST=$(cat "$_BRAIN_LAST_PULL_FILE" 2>/dev/null || echo 0)
    _BRAIN_AGE=$(( _BRAIN_NOW - _BRAIN_LAST ))
    [ "$_BRAIN_AGE" -lt 86400 ] && _BRAIN_DO_PULL=0
  fi
  if [ "$_BRAIN_DO_PULL" = "1" ]; then
    ( cd "$_GSTACK_HOME" && git fetch origin >/dev/null 2>&1 && git merge --ff-only "origin/$(git rev-parse --abbrev-ref HEAD)" >/dev/null 2>&1 ) || true
    echo "$_BRAIN_NOW" > "$_BRAIN_LAST_PULL_FILE"
  fi
  # Drain pending queue, push.
  "$_BRAIN_SYNC_BIN" --once 2>/dev/null || true
fi
# Status line — always emitted, easy to grep.
if [ -d "$_GSTACK_HOME/.git" ] && [ "$_BRAIN_SYNC_MODE" != "off" ]; then
  _BRAIN_QUEUE_DEPTH=0
  [ -f "$_GSTACK_HOME/.brain-queue.jsonl" ] && _BRAIN_QUEUE_DEPTH=$(wc -l < "$_GSTACK_HOME/.brain-queue.jsonl" | tr -d ' ')
  _BRAIN_LAST_PUSH="never"
  [ -f "$_GSTACK_HOME/.brain-last-push" ] && _BRAIN_LAST_PUSH=$(cat "$_GSTACK_HOME/.brain-last-push" 2>/dev/null || echo never)
  echo "BRAIN_SYNC: mode=$_BRAIN_SYNC_MODE | last_push=$_BRAIN_LAST_PUSH | queue=$_BRAIN_QUEUE_DEPTH"
else
  echo "BRAIN_SYNC: off"
fi
```
**Privacy stop-gate (fires ONCE per machine).**
If the bash output shows `BRAIN_SYNC: off` AND the config value
`gbrain_sync_mode_prompted` is `false` AND gbrain is detected on this host
(either `gbrain doctor --fast --json` succeeds or the `gbrain` binary is in PATH),
fire a one-time privacy gate via AskUserQuestion:
> gstack can publish your session memory (learnings, plans, designs, retros) to a
> private GitHub repo that GBrain indexes across your machines. Higher tiers
> include behavioral data (session timelines, developer profile). How much do you
> want to sync?
Options:
- A) Everything allowlisted (recommended — maximum cross-machine memory)
- B) Only artifacts (plans, designs, retros, learnings) — skip timelines and profile
- C) Decline — keep everything local
After the user answers, run (substituting the chosen value):
```bash
# Chosen mode: full | artifacts-only | off
"$_BRAIN_CONFIG_BIN" set gbrain_sync_mode <choice>
"$_BRAIN_CONFIG_BIN" set gbrain_sync_mode_prompted true
```
If A or B was chosen AND `~/.gstack/.git` doesn't exist, ask a follow-up:
"Set up the GBrain sync repo now? (runs `gstack-brain-init`)"
- A) Yes, run it now
- B) Show me the command, I'll run it myself
Do not block the skill. Emit the question, continue the skill workflow. The
next skill run picks up wherever this left off.
**At skill END (before the telemetry block),** run these bash commands to
catch artifact writes (design docs, plans, retros) that skipped the writer
shims, plus drain any still-pending queue entries:
```bash
"$GSTACK_BIN/gstack-brain-sync" --discover-new 2>/dev/null || true
"$GSTACK_BIN/gstack-brain-sync" --once 2>/dev/null || true
```
## Model-Specific Behavioral Patch (claude)
The following nudges are tuned for the claude model family. They are
@@ -457,20 +685,6 @@ are shown, synthesize a one-paragraph welcome briefing before proceeding:
"Welcome back to {branch}. Last session: /{skill} ({outcome}). [Checkpoint summary if
available]. [Health score if available]." Keep it to 2-3 sentences.
## AskUserQuestion Format
**ALWAYS follow this structure for every AskUserQuestion call. All four elements are non-skippable. If you find yourself about to skip any of them, stop and back up.**
1. **Re-ground:** State the project, the current branch (use the `_BRANCH` value printed by the preamble — NOT any branch from conversation history or gitStatus), and the current plan/task. (1-2 sentences)
2. **Simplify (ELI10, ALWAYS):** Explain what's happening in plain English a smart 16-year-old could follow. Concrete examples and analogies, not function names or internal jargon. Say what it DOES, not what it's called. State the stakes: what breaks if we pick wrong. This is NOT optional verbosity and it is NOT preamble — the user is about to make a decision and needs context. Even if you'd normally stay terse, emit the ELI10 paragraph. The user will ask for it anyway; do it the first time.
3. **Recommend (ALWAYS):** Every question ends with `RECOMMENDATION: Choose [X] because [one-line reason]` on its own line. Never omit it. Never collapse it into the options list. Required for every AskUserQuestion, regardless of whether the options are coverage-differentiated or different-in-kind.
4. **Score completeness (when meaningful):** When options differ in coverage (e.g. full test coverage vs happy path vs shortcut, complete error handling vs partial), score each with `Completeness: N/10` on its own line. Calibration: 10 = complete (all edge cases, full coverage), 7 = happy path only, 3 = shortcut. Flag any option ≤5 where a higher-completeness option exists. When options differ in kind (picking a review posture, picking an architectural approach, cherry-pick Add/Defer/Skip, choosing between two different kinds of systems), the completeness axis doesn't apply — skip `Completeness: N/10` entirely and write one line: `Note: options differ in kind, not coverage — no completeness score.` Do not fabricate filler scores.
5. **Options:** Lettered options: `A) ... B) ... C) ...` — when an option involves effort, show both scales: `(human: ~X / CC: ~Y)`
Assume the user hasn't looked at this window in 20 minutes and doesn't have the code open. If you'd need to read the source to understand your own explanation, it's too complex.
Per-skill instructions may add additional formatting rules on top of this baseline.
## Writing Style (skip entirely if `EXPLAIN_LEVEL: terse` appears in the preamble echo OR the user's current message explicitly requests terse / no-explanations output)
These rules apply to every AskUserQuestion, every response you write to the user, and every review finding. They compose with the AskUserQuestion Format section above: Format = *how* a question is structured; Writing Style = *the prose quality of the content inside it*.
+228 -14
@@ -346,6 +346,234 @@ AI orchestrator (e.g., OpenClaw). In spawned sessions:
- Focus on completing the task and reporting results via prose output.
- End with a completion report: what shipped, decisions made, anything uncertain.
## AskUserQuestion Format
**ALWAYS follow this structure for every AskUserQuestion call. Every element is non-skippable. If you find yourself about to skip any of them, stop and back up.**
### Required shape
Every AskUserQuestion reads like a decision brief, not a bullet list:
```
D<N> — <one-line question title>
ELI10: <plain English a 16-year-old could follow, 2-4 sentences, name the stakes>
Stakes if we pick wrong: <one sentence on what breaks, what user sees, what's lost>
Recommendation: <choice> because <one-line reason>
Completeness: A=X/10, B=Y/10 (or: Note: options differ in kind, not coverage — no completeness score)
Pros / cons:
A) <option label> (recommended)
✅ <pro — concrete, observable, ≥40 chars>
✅ <pro>
❌ <con — honest, ≥40 chars>
B) <option label>
✅ <pro>
❌ <con>
Net: <one-line synthesis of what you're actually trading off>
```
### Element rules
1. **D-numbering.** First question in a skill invocation is `D1`. Increment per
question within the same skill. This is a model-level instruction, not a
runtime counter — you count your own questions. Nested skill invocation
(e.g., `/plan-ceo-review` running `/office-hours` inline) starts its own
D1; label as `D1 (office-hours)` to disambiguate when the user will see
both. Drift is expected over long sessions; minor inconsistency is fine.
2. **Re-ground.** Before ELI10, state the project, current branch (use the
`_BRANCH` value from the preamble, NOT conversation history or gitStatus),
and the current plan/task. 1-2 sentences. Assume the user hasn't looked at
this window in 20 minutes.
3. **ELI10 (ALWAYS).** Explain in plain English a smart 16-year-old could
follow. Concrete examples and analogies, not function names. Say what it
DOES, not what it's called. This is not preamble — the user is about to
make a decision and needs context. Even in terse mode, emit the ELI10.
4. **Stakes if we pick wrong (ALWAYS).** One sentence naming what breaks in
concrete terms (pain avoided / capability unlocked / consequence named).
"Users see a 3-second spinner" beats "performance may degrade." Forces
the trade-off to be real.
5. **Recommendation (ALWAYS).** `Recommendation: <choice> because <one-line
reason>` on its own line. Never omit it. Required for every AskUserQuestion,
   even when neutral-posture (see rule 9). The `(recommended)` label on the
option is REQUIRED — `scripts/resolvers/question-tuning.ts` reads it to
power the AUTO_DECIDE path. Omitting it breaks auto-decide.
6. **Completeness scoring (when meaningful).** When options differ in
coverage (full test coverage vs happy path vs shortcut, complete error
handling vs partial), score each `Completeness: N/10` on its own line.
Calibration: 10 = complete, 7 = happy path only, 3 = shortcut. Flag any
option ≤5 where a higher-completeness option exists. When options differ
in kind (review posture, architectural A-vs-B, cherry-pick Add/Defer/Skip,
two different kinds of systems), SKIP the score and write one line:
`Note: options differ in kind, not coverage — no completeness score.`
Do NOT fabricate filler scores — empty 10/10 on every option is worse
than no score.
7. **Pros / cons block.** Every option gets per-bullet ✅ (pro) and ❌ (con)
markers. Rules:
- **Minimum 2 pros and 1 con per option.** If you can't name a con for
the recommended option, the recommendation is hollow — go find one. If
you can't name a pro for the rejected option, the question isn't real.
- **Minimum 40 characters per bullet.** `✅ Simple` is not a pro. `✅
Reuses the YAML frontmatter format already in MEMORY.md, zero new
parser` is a pro. Concrete, observable, specific.
- **Hard-stop escape** for genuinely one-sided choices (destructive-action
confirmation, one-way doors): a single bullet `✅ No cons — this is a
hard-stop choice` satisfies the rule. Use sparingly; overuse flips a
decision brief into theater.
8. **Net line (ALWAYS).** Closes the decision with a one-sentence synthesis
of what the user is actually trading off. From the reference screenshot:
*"The new-format case is speculative. The copy-format case is immediate
leverage. Copy now, evolve later if a real pattern emerges."* Not a
summary — a verdict frame.
9. **Neutral-posture handling.** When the skill explicitly says "neutral
recommendation posture" (SELECTIVE EXPANSION cherry-picks, taste calls,
kind-differentiated choices where neither side dominates), the
Recommendation line reads: `Recommendation: <default-choice> — this is a
taste call, no strong preference either way`. The `(recommended)` label
STAYS on the default option (machine-readable hint for AUTO_DECIDE). The
`— this is a taste call` prose is the human-readable neutrality signal.
Both coexist.
10. **Effort both-scales.** When an option involves effort, show both human
and CC scales: `(human: ~2 days / CC: ~15 min)`.
11. **Tool_use, not prose.** A markdown block labeled `Question:` is not a
question — the user never sees it as interactive. If you wrote one in
prose, stop and reissue as an actual AskUserQuestion tool_use. The rich
markdown goes in the question body; the `options` array stays short
labels (A, B, C).
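
Rule 11's split (rich markdown in the question body, short labels in the `options` array) can be sketched as a payload. The field names below are assumptions for illustration, not the actual tool schema:

```typescript
// Sketch of rule 11: the decision brief lives in the body as markdown;
// the options array stays short labels. Field names are illustrative.
const question = {
  body: [
    'D1 — Memory file format',
    'ELI10: We are picking how session notes get stored on disk.',
    'Stakes if we pick wrong: old notes become unreadable on upgrade.',
    'Recommendation: A because it reuses the existing MEMORY.md parser.',
    'Net: copy now, evolve later if a real pattern emerges.',
  ].join('\n'),
  options: [
    { label: 'A', description: 'Copy MEMORY.md format (recommended)' },
    { label: 'B', description: 'Design a new format' },
  ],
};
```

Emitting this as an actual tool_use call, not a prose block labeled `Question:`, is the whole point of rule 11.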
### Self-check before emitting
Before calling AskUserQuestion, verify:
- [ ] D<N> header present
- [ ] ELI10 paragraph present (stakes line too)
- [ ] Recommendation line present with concrete reason
- [ ] Completeness scored (coverage) OR kind-note present (kind)
- [ ] Every option has ≥2 ✅ and ≥1 ❌, each ≥40 chars (or hard-stop escape)
- [ ] (recommended) label on one option (even for neutral-posture — see rule 9)
- [ ] Net line closes the decision
- [ ] You are calling the tool, not writing prose
If you'd need to read the source to understand your own explanation, it's
too complex — simplify before emitting.
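
That self-check can be sketched as a lint pass; the regexes are assumptions tuned to the format above, not an official validator:

```typescript
// Returns the checklist items a drafted question body violates.
// Regexes are illustrative approximations of the format rules.
function selfCheckViolations(body: string): string[] {
  const issues: string[] = [];
  if (!/^D\d+ — /m.test(body)) issues.push('missing D<N> header');
  if (!/^ELI10: /m.test(body)) issues.push('missing ELI10 paragraph');
  if (!/^Recommendation: .+/m.test(body)) issues.push('missing Recommendation line');
  const scored = /Completeness: \d+\/10/.test(body);
  const kindNote = /differ in kind, not coverage/.test(body);
  if (!scored && !kindNote) issues.push('no completeness score and no kind-note');
  if (!/^Net: /m.test(body)) issues.push('missing Net line');
  return issues;
}
```

A real implementation would also check the per-option ✅/❌ bullets; this sketch covers only the body-level lines.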
Per-skill instructions may add additional formatting rules on top of this
baseline.
## GBrain Sync (skill start)
```bash
# gbrain-sync: drain pending writes, pull once per day. Silent no-op when
# the feature isn't initialized or gbrain_sync_mode is "off". See
# docs/gbrain-sync.md.
_GSTACK_HOME="${GSTACK_HOME:-$HOME/.gstack}"
_BRAIN_REMOTE_FILE="$HOME/.gstack-brain-remote.txt"
_BRAIN_SYNC_BIN="$GSTACK_BIN/gstack-brain-sync"
_BRAIN_CONFIG_BIN="$GSTACK_BIN/gstack-config"
_BRAIN_SYNC_MODE=$("$_BRAIN_CONFIG_BIN" get gbrain_sync_mode 2>/dev/null || echo off)
# New-machine hint: URL file present, local .git missing, sync not yet enabled.
if [ -f "$_BRAIN_REMOTE_FILE" ] && [ ! -d "$_GSTACK_HOME/.git" ] && [ "$_BRAIN_SYNC_MODE" = "off" ]; then
_BRAIN_NEW_URL=$(head -1 "$_BRAIN_REMOTE_FILE" 2>/dev/null | tr -d '[:space:]')
if [ -n "$_BRAIN_NEW_URL" ]; then
echo "BRAIN_SYNC: brain repo detected: $_BRAIN_NEW_URL"
echo "BRAIN_SYNC: run 'gstack-brain-restore' to pull your cross-machine memory (or 'gstack-config set gbrain_sync_mode off' to dismiss forever)"
fi
fi
# Active-sync path.
if [ -d "$_GSTACK_HOME/.git" ] && [ "$_BRAIN_SYNC_MODE" != "off" ]; then
# Once-per-day pull.
_BRAIN_LAST_PULL_FILE="$_GSTACK_HOME/.brain-last-pull"
_BRAIN_NOW=$(date +%s)
_BRAIN_DO_PULL=1
if [ -f "$_BRAIN_LAST_PULL_FILE" ]; then
_BRAIN_LAST=$(cat "$_BRAIN_LAST_PULL_FILE" 2>/dev/null || echo 0)
_BRAIN_AGE=$(( _BRAIN_NOW - _BRAIN_LAST ))
[ "$_BRAIN_AGE" -lt 86400 ] && _BRAIN_DO_PULL=0
fi
if [ "$_BRAIN_DO_PULL" = "1" ]; then
( cd "$_GSTACK_HOME" && git fetch origin >/dev/null 2>&1 && git merge --ff-only "origin/$(git rev-parse --abbrev-ref HEAD)" >/dev/null 2>&1 ) || true
echo "$_BRAIN_NOW" > "$_BRAIN_LAST_PULL_FILE"
fi
# Drain pending queue, push.
"$_BRAIN_SYNC_BIN" --once 2>/dev/null || true
fi
# Status line — always emitted, easy to grep.
if [ -d "$_GSTACK_HOME/.git" ] && [ "$_BRAIN_SYNC_MODE" != "off" ]; then
_BRAIN_QUEUE_DEPTH=0
[ -f "$_GSTACK_HOME/.brain-queue.jsonl" ] && _BRAIN_QUEUE_DEPTH=$(wc -l < "$_GSTACK_HOME/.brain-queue.jsonl" | tr -d ' ')
_BRAIN_LAST_PUSH="never"
[ -f "$_GSTACK_HOME/.brain-last-push" ] && _BRAIN_LAST_PUSH=$(cat "$_GSTACK_HOME/.brain-last-push" 2>/dev/null || echo never)
echo "BRAIN_SYNC: mode=$_BRAIN_SYNC_MODE | last_push=$_BRAIN_LAST_PUSH | queue=$_BRAIN_QUEUE_DEPTH"
else
echo "BRAIN_SYNC: off"
fi
```
**Privacy stop-gate (fires ONCE per machine).**
If the bash output shows `BRAIN_SYNC: off` AND the config value
`gbrain_sync_mode_prompted` is `false` AND gbrain is detected on this host
(either `gbrain doctor --fast --json` succeeds or the `gbrain` binary is in PATH),
fire a one-time privacy gate via AskUserQuestion:
> gstack can publish your session memory (learnings, plans, designs, retros) to a
> private GitHub repo that GBrain indexes across your machines. Higher tiers
> include behavioral data (session timelines, developer profile). How much do you
> want to sync?
Options:
- A) Everything allowlisted (recommended — maximum cross-machine memory)
- B) Only artifacts (plans, designs, retros, learnings) — skip timelines and profile
- C) Decline — keep everything local
After the user answers, run (substituting the chosen value):
```bash
# Chosen mode: full | artifacts-only | off
"$_BRAIN_CONFIG_BIN" set gbrain_sync_mode <choice>
"$_BRAIN_CONFIG_BIN" set gbrain_sync_mode_prompted true
```
If A or B was chosen AND `~/.gstack/.git` doesn't exist, ask a follow-up:
"Set up the GBrain sync repo now? (runs `gstack-brain-init`)"
- A) Yes, run it now
- B) Show me the command, I'll run it myself
Do not block the skill. Emit the question, continue the skill workflow. The
next skill run picks up wherever this left off.
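
The fire-once condition can be sketched as a pure predicate. The function and field names are assumptions for this sketch, not real gstack APIs:

```typescript
// True only on the one run where all three conditions line up: sync
// reports off, the user was never prompted, and gbrain exists here.
function shouldFirePrivacyGate(state: {
  brainSyncLine: string;   // status line from the preamble, e.g. "BRAIN_SYNC: off"
  prompted: boolean;       // gbrain_sync_mode_prompted config value
  gbrainDetected: boolean; // gbrain doctor succeeded, or binary found in PATH
}): boolean {
  return (
    state.brainSyncLine.trim() === 'BRAIN_SYNC: off' &&
    !state.prompted &&
    state.gbrainDetected
  );
}
```

Setting `gbrain_sync_mode_prompted` to true after the answer is what makes the gate fire once per machine.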
**At skill END (before the telemetry block),** run these bash commands to
catch artifact writes (design docs, plans, retros) that skipped the writer
shims, plus drain any still-pending queue entries:
```bash
"$GSTACK_BIN/gstack-brain-sync" --discover-new 2>/dev/null || true
"$GSTACK_BIN/gstack-brain-sync" --once 2>/dev/null || true
```
## Model-Specific Behavioral Patch (claude)
The following nudges are tuned for the claude model family. They are
@@ -459,20 +687,6 @@ are shown, synthesize a one-paragraph welcome briefing before proceeding:
"Welcome back to {branch}. Last session: /{skill} ({outcome}). [Checkpoint summary if
available]. [Health score if available]." Keep it to 2-3 sentences.
## AskUserQuestion Format
**ALWAYS follow this structure for every AskUserQuestion call. All four elements are non-skippable. If you find yourself about to skip any of them, stop and back up.**
1. **Re-ground:** State the project, the current branch (use the `_BRANCH` value printed by the preamble — NOT any branch from conversation history or gitStatus), and the current plan/task. (1-2 sentences)
2. **Simplify (ELI10, ALWAYS):** Explain what's happening in plain English a smart 16-year-old could follow. Concrete examples and analogies, not function names or internal jargon. Say what it DOES, not what it's called. State the stakes: what breaks if we pick wrong. This is NOT optional verbosity and it is NOT preamble — the user is about to make a decision and needs context. Even if you'd normally stay terse, emit the ELI10 paragraph. The user will ask for it anyway; do it the first time.
3. **Recommend (ALWAYS):** Every question ends with `RECOMMENDATION: Choose [X] because [one-line reason]` on its own line. Never omit it. Never collapse it into the options list. Required for every AskUserQuestion, regardless of whether the options are coverage-differentiated or different-in-kind.
4. **Score completeness (when meaningful):** When options differ in coverage (e.g. full test coverage vs happy path vs shortcut, complete error handling vs partial), score each with `Completeness: N/10` on its own line. Calibration: 10 = complete (all edge cases, full coverage), 7 = happy path only, 3 = shortcut. Flag any option ≤5 where a higher-completeness option exists. When options differ in kind (picking a review posture, picking an architectural approach, cherry-pick Add/Defer/Skip, choosing between two different kinds of systems), the completeness axis doesn't apply — skip `Completeness: N/10` entirely and write one line: `Note: options differ in kind, not coverage — no completeness score.` Do not fabricate filler scores.
5. **Options:** Lettered options: `A) ... B) ... C) ...` — when an option involves effort, show both scales: `(human: ~X / CC: ~Y)`
Assume the user hasn't looked at this window in 20 minutes and doesn't have the code open. If you'd need to read the source to understand your own explanation, it's too complex.
Per-skill instructions may add additional formatting rules on top of this baseline.
## Writing Style (skip entirely if `EXPLAIN_LEVEL: terse` appears in the preamble echo OR the user's current message explicitly requests terse / no-explanations output)
These rules apply to every AskUserQuestion, every response you write to the user, and every review finding. They compose with the AskUserQuestion Format section above: Format = *how* a question is structured; Writing Style = *the prose quality of the content inside it*.
@@ -0,0 +1,487 @@
/**
* Overlay-efficacy fixture registry.
*
* Each fixture defines a reproducible A/B test for one behavioral nudge
* embedded in a model-overlays/*.md file. The harness at
* test/skill-e2e-overlay-harness.test.ts iterates this registry and runs
* `fixture.trials` A/B trials per fixture, asserting `fixture.pass(arms)`.
*
* Adding a new overlay eval = one entry in this list. The harness handles
* arm wiring, concurrency, artifact storage, rate-limit retries, and the
* cross-harness diagnostic.
*/
import * as fs from 'fs';
import * as path from 'path';
import {
firstTurnParallelism,
type AgentSdkResult,
} from '../helpers/agent-sdk-runner';
const REPO_ROOT = path.resolve(__dirname, '..', '..');
// ---------------------------------------------------------------------------
// Types
// ---------------------------------------------------------------------------
export interface OverlayFixture {
/** Unique, lowercase/digits/dash only. Used in artifact paths. */
id: string;
/** Path to the overlay file, relative to repo root. */
overlayPath: string;
/** API model ID, not the overlay family name. */
model: string;
/** Integer >= 3. Trials per arm. */
trials: number;
/** Max concurrent queries for this fixture's arms. Default 3. */
concurrency?: number;
/** Populate the workspace dir before each trial. */
setupWorkspace: (dir: string) => void;
/** The prompt the model receives. Non-empty. */
userPrompt: string;
/** Per-fixture tool allowlist. Omit to use runner default [Read, Glob, Grep, Bash]. */
allowedTools?: string[];
/** Max turns per trial. Omit to use runner default (5). */
maxTurns?: number;
/**
* Direction of the expected effect. `higher_is_better` = overlay should
* increase the metric (e.g. fanout, files touched for literal scope).
* `lower_is_better` = overlay should decrease it (e.g. Bash count, turn count).
* Used only for cosmetic logging in the test output; `pass` is the actual gate.
*/
direction?: 'higher_is_better' | 'lower_is_better';
/** Compute the per-trial metric from the typed SDK result. */
metric: (r: AgentSdkResult) => number;
/** Acceptance predicate across all arms' per-trial metrics. */
pass: (arms: { overlay: number[]; off: number[] }) => boolean;
}
// ---------------------------------------------------------------------------
// Validation
// ---------------------------------------------------------------------------
export function validateFixtures(fixtures: OverlayFixture[]): void {
const ids = new Set<string>();
for (const f of fixtures) {
if (!f.id || !/^[a-z0-9-]+$/.test(f.id)) {
throw new Error(
`fixture id must be non-empty, lowercase/digits/dash only: ${JSON.stringify(f.id)}`,
);
}
if (ids.has(f.id)) {
throw new Error(`duplicate fixture id: ${f.id}`);
}
ids.add(f.id);
if (!Number.isInteger(f.trials) || f.trials < 3) {
throw new Error(`${f.id}: trials must be an integer >= 3 (got ${f.trials})`);
}
if (
f.concurrency !== undefined &&
(!Number.isInteger(f.concurrency) || f.concurrency < 1)
) {
throw new Error(
`${f.id}: concurrency must be an integer >= 1 (got ${f.concurrency})`,
);
}
if (!f.model) throw new Error(`${f.id}: model must be non-empty`);
if (!f.userPrompt) throw new Error(`${f.id}: userPrompt must be non-empty`);
if (path.isAbsolute(f.overlayPath) || f.overlayPath.includes('..')) {
throw new Error(
`${f.id}: overlayPath must be relative and must not contain '..' (got ${f.overlayPath})`,
);
}
const fullPath = path.resolve(REPO_ROOT, f.overlayPath);
if (!fs.existsSync(fullPath)) {
throw new Error(`${f.id}: overlay file not found at ${f.overlayPath}`);
}
for (const fn of ['setupWorkspace', 'metric', 'pass'] as const) {
if (typeof f[fn] !== 'function') {
throw new Error(`${f.id}: ${fn} must be a function`);
}
}
}
}
// ---------------------------------------------------------------------------
// Metric + predicate helpers
// ---------------------------------------------------------------------------
function mean(xs: number[]): number {
if (xs.length === 0) return 0;
return xs.reduce((a, b) => a + b, 0) / xs.length;
}
/**
* Standard fanout predicate: overlay mean beats off mean by at least 0.5
* parallel tool_use blocks in first turn, AND at least 3 of the overlay
* trials emit >= 2 parallel tool_use blocks.
*
* The combined rule catches both "overlay nudges every trial slightly"
* (mean) and "overlay sometimes triggers real fanout" (floor). A single
* 0.5 lift with every trial still emitting 1 call would be suspicious;
* this predicate rejects it.
*/
export function fanoutPass(arms: { overlay: number[]; off: number[] }): boolean {
const lift = mean(arms.overlay) - mean(arms.off);
const floorHits = arms.overlay.filter((n) => n >= 2).length;
return lift >= 0.5 && floorHits >= 3;
}
/**
* Generic "lower is better" pass predicate: overlay mean should drop the
* metric by at least 20% vs baseline. Used for nudges like "effort-match"
* (fewer turns) and "dedicated tools vs Bash" (fewer Bash calls).
*/
export function lowerIsBetter20Pct(arms: { overlay: number[]; off: number[] }): boolean {
const meanOff = mean(arms.off);
if (meanOff === 0) return mean(arms.overlay) <= meanOff;
return mean(arms.overlay) <= meanOff * 0.8;
}
/**
* Generic "higher is better" pass predicate: overlay mean should lift the
* metric by at least 20% vs baseline. Used for nudges like "literal
* interpretation" (more files touched when scope is ambiguous).
*/
export function higherIsBetter20Pct(arms: { overlay: number[]; off: number[] }): boolean {
const meanOff = mean(arms.off);
const meanOn = mean(arms.overlay);
if (meanOff === 0) return meanOn > 0;
return meanOn >= meanOff * 1.2;
}
// ---------------------------------------------------------------------------
// Metrics
// ---------------------------------------------------------------------------
/**
* Count the total number of Bash tool_use blocks across ALL assistant turns.
* Signal for "dedicated tools over Bash" nudge in claude.md.
*/
export function bashToolCallCount(r: AgentSdkResult): number {
return r.toolCalls.filter((c) => c.tool === 'Bash').length;
}
/**
* Total turns the session used to complete. Signal for "effort-match the
 * step" nudge in opus-4-7.md: trivial prompts should complete quickly.
*/
export function turnsToCompletion(r: AgentSdkResult): number {
return r.turnsUsed;
}
/**
* Count of unique files the model edited or wrote. Signal for "literal
* interpretation" nudge in opus-4-7.md — "fix the tests" with multiple
* failures should touch all of them.
*/
export function uniqueFilesEdited(r: AgentSdkResult): number {
const touched = new Set<string>();
for (const call of r.toolCalls) {
if (call.tool === 'Edit' || call.tool === 'Write' || call.tool === 'MultiEdit') {
const input = call.input as { file_path?: string } | null;
if (input?.file_path) touched.add(input.file_path);
}
}
return touched.size;
}
// ---------------------------------------------------------------------------
// Fixtures
// ---------------------------------------------------------------------------
export const OVERLAY_FIXTURES: OverlayFixture[] = [
{
id: 'opus-4-7-fanout-toy',
overlayPath: 'model-overlays/opus-4-7.md',
model: 'claude-opus-4-7',
trials: 10,
concurrency: 3,
setupWorkspace: (dir) => {
fs.writeFileSync(path.join(dir, 'alpha.txt'), 'Alpha file: used in module A.\n');
fs.writeFileSync(path.join(dir, 'beta.txt'), 'Beta file: used in module B.\n');
fs.writeFileSync(path.join(dir, 'gamma.txt'), 'Gamma file: used in module C.\n');
},
userPrompt:
'Read alpha.txt, beta.txt, and gamma.txt and summarize each in one line.',
metric: (r) => firstTurnParallelism(r.assistantTurns[0]),
pass: fanoutPass,
},
{
id: 'opus-4-7-fanout-realistic',
overlayPath: 'model-overlays/opus-4-7.md',
model: 'claude-opus-4-7',
trials: 10,
concurrency: 3,
setupWorkspace: (dir) => {
fs.writeFileSync(
path.join(dir, 'app.ts'),
"import { config } from './config';\nimport { util } from './src/util';\n\nexport function main() { return config.name + ':' + util(); }\n",
);
fs.writeFileSync(
path.join(dir, 'config.ts'),
"export const config = { name: 'demo', version: 1 };\n",
);
fs.writeFileSync(
path.join(dir, 'README.md'),
'# demo project\n\nA small demo. Entry: `app.ts`. Config: `config.ts`.\n',
);
fs.mkdirSync(path.join(dir, 'src'), { recursive: true });
fs.writeFileSync(
path.join(dir, 'src', 'util.ts'),
"export function util() { return 'util-result'; }\n",
);
},
userPrompt:
'Audit this project: read app.ts, config.ts, and README.md, and glob for ' +
'every .ts file under src/. Summarize what you find in 3 bullet points.',
metric: (r) => firstTurnParallelism(r.assistantTurns[0]),
pass: fanoutPass,
},
// -------------------------------------------------------------------------
// claude.md / "Dedicated tools over Bash"
// -------------------------------------------------------------------------
{
id: 'claude-dedicated-tools-vs-bash',
overlayPath: 'model-overlays/claude.md',
model: 'claude-opus-4-7',
trials: 10,
concurrency: 3,
direction: 'lower_is_better',
// 5 files + summary = needs more than default 5 turns. SDK throws
// instead of returning a result when it hits the cap.
maxTurns: 15,
setupWorkspace: (dir) => {
fs.mkdirSync(path.join(dir, 'src'), { recursive: true });
fs.writeFileSync(path.join(dir, 'src', 'index.ts'), "export const x = 1;\n");
fs.writeFileSync(path.join(dir, 'src', 'util.ts'), "export function util() { return 42; }\n");
fs.writeFileSync(path.join(dir, 'src', 'types.ts'), "export type Foo = { a: number };\n");
fs.writeFileSync(path.join(dir, 'src', 'config.ts'), "export const c = { n: 'demo' };\n");
fs.writeFileSync(path.join(dir, 'src', 'api.ts'), "export async function fetchFoo() { return null; }\n");
},
userPrompt:
"List every TypeScript file under src/ and tell me what each exports. " +
"You may use any tools available.",
// Metric: total Bash tool_use count across the whole session.
// The overlay says "prefer Read/Glob/Grep over cat/find/grep shell."
// A model following that should emit Glob + Read, not Bash ls/find/cat.
metric: bashToolCallCount,
pass: lowerIsBetter20Pct,
},
// -------------------------------------------------------------------------
// opus-4-7.md / "Effort-match the step"
// -------------------------------------------------------------------------
{
id: 'opus-4-7-effort-match-trivial',
overlayPath: 'model-overlays/opus-4-7.md',
model: 'claude-opus-4-7',
trials: 10,
concurrency: 3,
direction: 'lower_is_better',
maxTurns: 8,
setupWorkspace: (dir) => {
fs.writeFileSync(
path.join(dir, 'config.json'),
'{"name": "demo", "version": "1.0.0"}\n',
);
},
userPrompt: "What's the version in config.json?",
// Metric: turns used to reach a result.
// The overlay says "simple lookups don't need deep reasoning."
// A trivial prompt like this should complete in 2 turns
// (Read, respond). Extended deliberation bumps this up.
metric: turnsToCompletion,
pass: lowerIsBetter20Pct,
},
// -------------------------------------------------------------------------
// opus-4-7.md / "Literal interpretation awareness"
// -------------------------------------------------------------------------
{
id: 'opus-4-7-literal-interpretation',
overlayPath: 'model-overlays/opus-4-7.md',
model: 'claude-opus-4-7',
trials: 10,
concurrency: 3,
direction: 'higher_is_better',
// Allow Edit and Write so the model can fix the tests.
allowedTools: ['Read', 'Glob', 'Grep', 'Bash', 'Edit', 'Write'],
maxTurns: 15,
setupWorkspace: (dir) => {
// Three failing test files. The "obvious" one is named first; the
// other two require the model to enumerate to notice them.
fs.writeFileSync(
path.join(dir, 'auth.test.ts'),
"import { test, expect } from 'bun:test';\n" +
"test('auth returns true', () => { expect(1).toBe(2); });\n",
);
fs.writeFileSync(
path.join(dir, 'billing.test.ts'),
"import { test, expect } from 'bun:test';\n" +
"test('billing computes total', () => { expect('ten').toBe('10'); });\n",
);
fs.writeFileSync(
path.join(dir, 'notifications.test.ts'),
"import { test, expect } from 'bun:test';\n" +
"test('notifications fire on save', () => { expect([1,2]).toEqual([1,2,3]); });\n",
);
fs.writeFileSync(
path.join(dir, 'README.md'),
"# demo\n\nSmall project with three failing test files.\n",
);
},
userPrompt: "Fix the failing tests.",
// Metric: number of unique test files the model edited/wrote.
// The overlay says "fix ALL failing tests, not just the obvious one."
// Overlay-ON should touch all 3 test files. Overlay-OFF might stop
// at the first one after making it pass.
metric: uniqueFilesEdited,
pass: higherIsBetter20Pct,
},
// =========================================================================
// Sonnet 4.6 variants of the Opus-4.7 fixtures.
//
// Rationale: /claude.md + /opus-4-7.md overlays measured as no-op or
// counterproductive on Opus 4.7. Before deleting the whole overlay stack,
// check whether weaker Claude models (Sonnet, Haiku) benefit from the same
// nudges. Same overlays, same prompts, same metrics, different model ID.
// Sonnet is ~4x cheaper than Opus so these 5 add ~$3 to a run.
// =========================================================================
{
id: 'opus-4-7-fanout-toy-sonnet',
overlayPath: 'model-overlays/opus-4-7.md',
model: 'claude-sonnet-4-6',
trials: 10,
concurrency: 3,
setupWorkspace: (dir) => {
fs.writeFileSync(path.join(dir, 'alpha.txt'), 'Alpha file: used in module A.\n');
fs.writeFileSync(path.join(dir, 'beta.txt'), 'Beta file: used in module B.\n');
fs.writeFileSync(path.join(dir, 'gamma.txt'), 'Gamma file: used in module C.\n');
},
userPrompt:
'Read alpha.txt, beta.txt, and gamma.txt and summarize each in one line.',
metric: (r) => firstTurnParallelism(r.assistantTurns[0]),
pass: fanoutPass,
},
{
id: 'opus-4-7-fanout-realistic-sonnet',
overlayPath: 'model-overlays/opus-4-7.md',
model: 'claude-sonnet-4-6',
trials: 10,
concurrency: 3,
setupWorkspace: (dir) => {
fs.writeFileSync(
path.join(dir, 'app.ts'),
"import { config } from './config';\nimport { util } from './src/util';\n\nexport function main() { return config.name + ':' + util(); }\n",
);
fs.writeFileSync(
path.join(dir, 'config.ts'),
"export const config = { name: 'demo', version: 1 };\n",
);
fs.writeFileSync(
path.join(dir, 'README.md'),
'# demo project\n\nA small demo. Entry: `app.ts`. Config: `config.ts`.\n',
);
fs.mkdirSync(path.join(dir, 'src'), { recursive: true });
fs.writeFileSync(
path.join(dir, 'src', 'util.ts'),
"export function util() { return 'util-result'; }\n",
);
},
userPrompt:
'Audit this project: read app.ts, config.ts, and README.md, and glob for ' +
'every .ts file under src/. Summarize what you find in 3 bullet points.',
metric: (r) => firstTurnParallelism(r.assistantTurns[0]),
pass: fanoutPass,
},
{
id: 'claude-dedicated-tools-vs-bash-sonnet',
overlayPath: 'model-overlays/claude.md',
model: 'claude-sonnet-4-6',
trials: 10,
concurrency: 3,
direction: 'lower_is_better',
maxTurns: 15,
setupWorkspace: (dir) => {
fs.mkdirSync(path.join(dir, 'src'), { recursive: true });
fs.writeFileSync(path.join(dir, 'src', 'index.ts'), "export const x = 1;\n");
fs.writeFileSync(path.join(dir, 'src', 'util.ts'), "export function util() { return 42; }\n");
fs.writeFileSync(path.join(dir, 'src', 'types.ts'), "export type Foo = { a: number };\n");
fs.writeFileSync(path.join(dir, 'src', 'config.ts'), "export const c = { n: 'demo' };\n");
fs.writeFileSync(path.join(dir, 'src', 'api.ts'), "export async function fetchFoo() { return null; }\n");
},
userPrompt:
"List every TypeScript file under src/ and tell me what each exports. " +
"You may use any tools available.",
metric: bashToolCallCount,
pass: lowerIsBetter20Pct,
},
{
id: 'opus-4-7-effort-match-trivial-sonnet',
overlayPath: 'model-overlays/opus-4-7.md',
model: 'claude-sonnet-4-6',
trials: 10,
concurrency: 3,
direction: 'lower_is_better',
maxTurns: 8,
setupWorkspace: (dir) => {
fs.writeFileSync(
path.join(dir, 'config.json'),
'{"name": "demo", "version": "1.0.0"}\n',
);
},
userPrompt: "What's the version in config.json?",
metric: turnsToCompletion,
pass: lowerIsBetter20Pct,
},
{
id: 'opus-4-7-literal-interpretation-sonnet',
overlayPath: 'model-overlays/opus-4-7.md',
model: 'claude-sonnet-4-6',
trials: 10,
concurrency: 3,
direction: 'higher_is_better',
allowedTools: ['Read', 'Glob', 'Grep', 'Bash', 'Edit', 'Write'],
maxTurns: 15,
setupWorkspace: (dir) => {
fs.writeFileSync(
path.join(dir, 'auth.test.ts'),
"import { test, expect } from 'bun:test';\n" +
"test('auth returns true', () => { expect(1).toBe(2); });\n",
);
fs.writeFileSync(
path.join(dir, 'billing.test.ts'),
"import { test, expect } from 'bun:test';\n" +
"test('billing computes total', () => { expect('ten').toBe('10'); });\n",
);
fs.writeFileSync(
path.join(dir, 'notifications.test.ts'),
"import { test, expect } from 'bun:test';\n" +
"test('notifications fire on save', () => { expect([1,2]).toEqual([1,2,3]); });\n",
);
fs.writeFileSync(
path.join(dir, 'README.md'),
"# demo\n\nSmall project with three failing test files.\n",
);
},
userPrompt: "Fix the failing tests.",
metric: uniqueFilesEdited,
pass: higherIsBetter20Pct,
},
];
// Validate at module load so a broken fixture fails fast at test startup,
// not mid-run after burning API dollars.
validateFixtures(OVERLAY_FIXTURES);