From da75ebaaa02c247aa035db6b860043a8650c13c4 Mon Sep 17 00:00:00 2001
From: Garry Tan
Date: Tue, 21 Apr 2026 23:39:42 -0700
Subject: [PATCH] refactor(opus-4.7): split overlay, align routing, fix trailer fallback

Follow-up to wintermute's initial Opus 4.7 migration commit (addresses
ship-quality review findings before v1.6.1.0 release).

Overlay split (model-overlays/):
- Move 4 Opus-4.7-specific nudges (Fan out, Effort-match, Batch your
  questions, Literal interpretation) from claude.md into new opus-4-7.md
  with {{INHERIT:claude}}
- claude.md now holds only model-agnostic nudges (Todo discipline, Think
  before heavy, Dedicated tools over Bash)
- Prevents Opus-4.7-specific guidance leaking onto Sonnet/Haiku
- Uses existing {{INHERIT:claude}} mechanism at
  scripts/resolvers/model-overlay.ts:28-43

scripts/models.ts:
- Add opus-4-7 to ALL_MODEL_NAMES
- resolveModel: claude-opus-4-7-* variants route to opus-4-7, all other
  claude-* variants continue to route to claude

scripts/resolvers/utility.ts:
- Update coAuthor trailer fallback: Opus 4.6 -> Opus 4.7 (fallback was
  missed in the initial migration commit)

scripts/resolvers/preamble/generate-routing-injection.ts:
- Align policy with new SKILL.md.tmpl: soft "when in doubt, invoke"
  instead of hard "ALWAYS invoke... Do NOT answer directly"
- Replace stale /checkpoint reference with /context-save +
  /context-restore (skills were renamed in v1.0.1.0)
- Expand route coverage to match full skill inventory:
  /plan-devex-review, /qa-only, /devex-review, /land-and-deploy,
  /setup-deploy, /canary, /open-gstack-browser, /setup-browser-cookies,
  /benchmark, /learn, /plan-tune, /health

scripts/resolvers/preamble/generate-voice-directive.ts:
- Voice example closing: "Want me to ship it?" -> "Want me to fix it?"
- Preserves directness while routing through review gates

SKILL.md.tmpl:
- Add routing triggers for skills that were missing from the list:
  /plan-devex-review, /qa-only, /devex-review, /land-and-deploy,
  /setup-deploy, /canary, /open-gstack-browser, /setup-browser-cookies,
  /benchmark, /learn, /plan-tune, /health
- Within Opus 4.7 overlay, added scope boundary to "Literal
  interpretation" nudge ("fix tests that this branch introduced or is
  responsible for")
- Added pacing exception to "Batch your questions" nudge so skills that
  require one-question-at-a-time pacing still win

Follow-up commit will regenerate SKILL.md files + update goldens.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 SKILL.md.tmpl                                      | 12 +++++
 model-overlays/claude.md                           | 24 ---------
 model-overlays/opus-4-7.md                         | 29 +++++++++++
 scripts/models.ts                                  |  2 +
 .../preamble/generate-routing-injection.ts         | 52 +++++++++++++------
 .../preamble/generate-voice-directive.ts           |  2 +-
 scripts/resolvers/utility.ts                       |  2 +-
 7 files changed, 81 insertions(+), 42 deletions(-)
 create mode 100644 model-overlays/opus-4-7.md

diff --git a/SKILL.md.tmpl b/SKILL.md.tmpl
index 6936089a..a248cbfa 100644
--- a/SKILL.md.tmpl
+++ b/SKILL.md.tmpl
@@ -36,12 +36,18 @@ quality gates that produce better results than answering inline.
 - User asks to review architecture, lock in the plan, "does this design make sense" → invoke `/plan-eng-review`
 - User asks about design system, brand, visual identity, "how should this look" → invoke `/design-consultation`
 - User asks to review design of a plan → invoke `/plan-design-review`
+- User asks about developer experience of a plan, API/CLI/SDK design → invoke `/plan-devex-review`
 - User wants all reviews done automatically, "review everything" → invoke `/autoplan`
 - User reports a bug, error, broken behavior, "why is this broken", "this doesn't work", "wtf", "something's wrong" → invoke `/investigate`
 - User asks to test the site, find bugs, QA, "does this work", "check the deploy" → invoke `/qa`
+- User asks to just report bugs without fixing → invoke `/qa-only`
 - User asks to review code, check the diff, pre-landing review, "look at my changes" → invoke `/review`
 - User asks about visual polish, design audit of a live site, "this looks off" → invoke `/design-review`
+- User asks to audit the live developer experience, time-to-hello-world → invoke `/devex-review`
 - User asks to ship, deploy, push, create a PR, "let's land this", "send it" → invoke `/ship`
+- User asks to merge + deploy + verify as one flow → invoke `/land-and-deploy`
+- User asks to configure deployment for the project → invoke `/setup-deploy`
+- User asks to monitor prod after shipping, post-deploy checks → invoke `/canary`
 - User asks to update docs after shipping → invoke `/document-release`
 - User asks for a weekly retro, what did we ship, "how'd we do" → invoke `/retro`
 - User asks for a second opinion, codex review → invoke `/codex`
@@ -52,6 +58,12 @@ quality gates that produce better results than answering inline.
 - User asks to resume, restore, "where was I" → invoke `/context-restore`
 - User asks about security, OWASP, vulnerabilities, "is this secure" → invoke `/cso`
 - User asks to make a PDF, document, publication → invoke `/make-pdf`
+- User asks to launch a real browser for QA, "open the browser" → invoke `/open-gstack-browser`
+- User asks to import cookies for authenticated testing → invoke `/setup-browser-cookies`
+- User asks about page speed, performance regression, benchmarks → invoke `/benchmark`
+- User asks what gstack has learned, "show learnings" → invoke `/learn`
+- User asks to tune question sensitivity, "stop asking me that" → invoke `/plan-tune`
+- User asks for code quality dashboard, "health check" → invoke `/health`

 **When in doubt, invoke the skill.** A false positive (invoking a skill that wasn't
 needed) is cheaper than a false negative (answering ad-hoc when a structured workflow
diff --git a/model-overlays/claude.md b/model-overlays/claude.md
index 7264f8b8..95943af5 100644
--- a/model-overlays/claude.md
+++ b/model-overlays/claude.md
@@ -8,27 +8,3 @@ the user course-correct cheaply instead of mid-flight.

 **Dedicated tools over Bash.** Prefer Read, Edit, Write, Glob, Grep over shell
 equivalents (cat, sed, find, grep). The dedicated tools are cheaper and clearer.
-
-**Fan out explicitly.** Opus 4.7 defaults to sequential work and spawns fewer
-subagents than 4.6. When a task has independent sub-problems (investigating multiple
-files, testing multiple endpoints, auditing multiple components), explicitly parallelize:
-spawn subagents in the same turn, run independent checks concurrently, don't serialize
-work that has no dependencies. If you catch yourself doing A then B then C where none
-depend on each other, stop and do all three at once.
-
-**Effort-match the step.** Simple file reads, config checks, command lookups, and
-mechanical edits don't need deep reasoning. Complete them quickly and move on. Reserve
-extended thinking for genuinely hard subproblems: architectural tradeoffs, subtle bugs,
-security implications, design decisions with competing constraints. Over-thinking
-simple steps wastes tokens and time.
-
-**Batch your questions.** If you need to clarify multiple things before proceeding,
-ask all of them in a single AskUserQuestion turn. Do not drip-feed one question per
-turn. Three questions in one message beats three back-and-forth exchanges.
-
-**Literal interpretation awareness.** Opus 4.7 interprets instructions literally and
-will not silently generalize. When the user says "fix the tests," fix ALL failing tests,
-not just the first one. When the user says "update the docs," update every relevant doc,
-not just the most obvious one. Read the full scope of what was asked and deliver the
-full scope. If the request is ambiguous, ask once (batched with any other questions),
-then execute completely.
diff --git a/model-overlays/opus-4-7.md b/model-overlays/opus-4-7.md
new file mode 100644
index 00000000..164ed6a3
--- /dev/null
+++ b/model-overlays/opus-4-7.md
@@ -0,0 +1,29 @@
+{{INHERIT:claude}}
+
+**Fan out explicitly.** Opus 4.7 defaults to sequential work and spawns fewer
+subagents than 4.6. When a task has independent sub-problems (investigating multiple
+files, testing multiple endpoints, auditing multiple components), explicitly parallelize:
+spawn subagents in the same turn, run independent checks concurrently, don't serialize
+work that has no dependencies. If you catch yourself doing A then B then C where none
+depend on each other, stop and do all three at once.
+
+**Effort-match the step.** Simple file reads, config checks, command lookups, and
+mechanical edits don't need deep reasoning. Complete them quickly and move on. Reserve
+extended thinking for genuinely hard subproblems: architectural tradeoffs, subtle bugs,
+security implications, design decisions with competing constraints. Over-thinking
+simple steps wastes tokens and time.
+
+**Batch your questions.** If you need to clarify multiple things before proceeding,
+ask all of them in a single AskUserQuestion turn. Do not drip-feed one question per
+turn. Three questions in one message beats three back-and-forth exchanges. Exception:
+skill workflows that explicitly require one-question-at-a-time pacing (e.g., plan
+review skills with "STOP. AskUserQuestion once per issue. Do NOT batch.") override this
+nudge. The skill wins on pacing, always.
+
+**Literal interpretation awareness.** Opus 4.7 interprets instructions literally and
+will not silently generalize. When the user says "fix the tests," fix all failing tests
+that this branch introduced or is responsible for, not just the first one (and not
+pre-existing failures in unrelated code). When the user says "update the docs," update
+every relevant doc in scope, not just the most obvious one. Read the full scope of what
+was asked and deliver the full scope. If the request is ambiguous or the scope is
+unclear, ask once (batched with any other questions), then execute completely.
diff --git a/scripts/models.ts b/scripts/models.ts
index b84608f6..b6d1d368 100644
--- a/scripts/models.ts
+++ b/scripts/models.ts
@@ -13,6 +13,7 @@

 export const ALL_MODEL_NAMES = [
   'claude',
+  'opus-4-7',
   'gpt',
   'gpt-5.4',
   'gemini',
@@ -51,6 +52,7 @@ export function resolveModel(input: string): Model | null {
   if (/^gpt-5\.4(-|$)/.test(s)) return 'gpt-5.4';
   if (/^gpt(-|$)/.test(s)) return 'gpt';
   if (/^o[0-9]+(-|$)/.test(s)) return 'o-series';
+  if (/^claude-opus-4-7(-|$)/.test(s)) return 'opus-4-7';
   if (/^claude(-|$)/.test(s)) return 'claude';
   if (/^gemini(-|$)/.test(s)) return 'gemini';

diff --git a/scripts/resolvers/preamble/generate-routing-injection.ts b/scripts/resolvers/preamble/generate-routing-injection.ts
index 1c05c284..0768a307 100644
--- a/scripts/resolvers/preamble/generate-routing-injection.ts
+++ b/scripts/resolvers/preamble/generate-routing-injection.ts
@@ -20,23 +20,44 @@ If A: Append this section to the end of CLAUDE.md:

 ## Skill routing

-When the user's request matches an available skill, ALWAYS invoke it using the Skill
-tool as your FIRST action. Do NOT answer directly, do NOT use other tools first.
-The skill has specialized workflows that produce better results than ad-hoc answers.
+When the user's request matches an available skill, invoke it via the Skill tool. The
+skill has multi-step workflows, checklists, and quality gates that produce better
+results than an ad-hoc answer. When in doubt, invoke the skill. A false positive is
+cheaper than a false negative.

 Key routing rules:
-- Product ideas, "is this worth building", brainstorming → invoke office-hours
-- Bugs, errors, "why is this broken", 500 errors → invoke investigate
-- Ship, deploy, push, create PR → invoke ship
-- QA, test the site, find bugs → invoke qa
-- Code review, check my diff → invoke review
-- Update docs after shipping → invoke document-release
-- Weekly retro → invoke retro
-- Design system, brand → invoke design-consultation
-- Visual audit, design polish → invoke design-review
-- Architecture review → invoke plan-eng-review
-- Save progress, checkpoint, resume → invoke checkpoint
-- Code quality, health check → invoke health
+- Product ideas, "is this worth building", brainstorming → invoke /office-hours
+- Strategy, scope, "think bigger", "what should we build" → invoke /plan-ceo-review
+- Architecture, "does this design make sense" → invoke /plan-eng-review
+- Design system, brand, "how should this look" → invoke /design-consultation
+- Design review of a plan → invoke /plan-design-review
+- Developer experience of a plan → invoke /plan-devex-review
+- "Review everything", full review pipeline → invoke /autoplan
+- Bugs, errors, "why is this broken", "wtf", "this doesn't work" → invoke /investigate
+- Test the site, find bugs, "does this work" → invoke /qa (or /qa-only for report only)
+- Code review, check the diff, "look at my changes" → invoke /review
+- Visual polish, design audit, "this looks off" → invoke /design-review
+- Developer experience audit, try onboarding → invoke /devex-review
+- Ship, deploy, create a PR, "send it" → invoke /ship
+- Merge + deploy + verify → invoke /land-and-deploy
+- Configure deployment → invoke /setup-deploy
+- Post-deploy monitoring → invoke /canary
+- Update docs after shipping → invoke /document-release
+- Weekly retro, "how'd we do" → invoke /retro
+- Second opinion, codex review → invoke /codex
+- Safety mode, careful mode, lock it down → invoke /careful or /guard
+- Restrict edits to a directory → invoke /freeze or /unfreeze
+- Upgrade gstack → invoke /gstack-upgrade
+- Save progress, "save my work" → invoke /context-save
+- Resume, restore, "where was I" → invoke /context-restore
+- Security audit, OWASP, "is this secure" → invoke /cso
+- Make a PDF, document, publication → invoke /make-pdf
+- Launch real browser for QA → invoke /open-gstack-browser
+- Import cookies for authenticated testing → invoke /setup-browser-cookies
+- Performance regression, page speed, benchmarks → invoke /benchmark
+- Review what gstack has learned → invoke /learn
+- Tune question sensitivity → invoke /plan-tune
+- Code quality dashboard → invoke /health
 \`\`\`

 Then commit the change: \`git add CLAUDE.md && git commit -m "chore: add gstack skill routing rules to CLAUDE.md"\`
@@ -46,4 +67,3 @@ Say "No problem. You can add routing rules later by running \`gstack-config set

 This only happens once per project. If \`HAS_ROUTING\` is \`yes\` or \`ROUTING_DECLINED\` is \`true\`, skip this entirely.`;
 }
-
diff --git a/scripts/resolvers/preamble/generate-voice-directive.ts b/scripts/resolvers/preamble/generate-voice-directive.ts
index 539c8d3d..a175c08f 100644
--- a/scripts/resolvers/preamble/generate-voice-directive.ts
+++ b/scripts/resolvers/preamble/generate-voice-directive.ts
@@ -56,7 +56,7 @@ Avoid filler, throat-clearing, generic optimism, founder cosplay, and unsupporte
 - End with what to do. Give the action.

 **Example of the right voice:**
-"auth.ts:47 returns undefined when the session cookie expires. Your users hit a white screen. Fix: add a null check and redirect to /login. Two lines. Want me to ship it?"
+"auth.ts:47 returns undefined when the session cookie expires. Your users hit a white screen. Fix: add a null check and redirect to /login. Two lines. Want me to fix it?"
 Not: "I've identified a potential issue in the authentication flow that may cause problems for some users under certain conditions. Let me explain the approach I'd recommend..."

 **Final test:** does this sound like a real cross-functional builder who wants to help someone make something people want, ship it, and make it actually work?`;
diff --git a/scripts/resolvers/utility.ts b/scripts/resolvers/utility.ts
index 83934b07..3d2e368a 100644
--- a/scripts/resolvers/utility.ts
+++ b/scripts/resolvers/utility.ts
@@ -369,7 +369,7 @@ Minimum 0 per category.
 export function generateCoAuthorTrailer(ctx: TemplateContext): string {
   const { getHostConfig } = require('../../hosts/index');
   const hostConfig = getHostConfig(ctx.host);
-  return hostConfig.coAuthorTrailer || 'Co-Authored-By: Claude Opus 4.6 ';
+  return hostConfig.coAuthorTrailer || 'Co-Authored-By: Claude Opus 4.7 ';
 }

 export function generateChangelogWorkflow(_ctx: TemplateContext): string {
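
Reviewer note on the `resolveModel` change in scripts/models.ts: the new `claude-opus-4-7` rule only takes effect because it sits before the generic `claude` rule, since the first matching regex wins. The sketch below is a trimmed, standalone illustration of that ordering, not the file's actual contents. The three-name `Model` type and the trim + lowercase normalization of `input` are assumptions made so the example runs on its own; the two Claude regexes are copied from the hunk.

```typescript
// Hypothetical, trimmed-down model list: only the names this sketch needs.
type Model = 'claude' | 'opus-4-7' | 'gpt';

// First-match-wins routing. How the real function derives `s` from `input`
// is not shown in the patch; trim + lowercase here is an assumption.
function resolveModel(input: string): Model | null {
  const s = input.trim().toLowerCase();
  if (/^gpt(-|$)/.test(s)) return 'gpt';
  // Specific rule first: "claude-opus-4-7" exactly, or with a "-suffix".
  if (/^claude-opus-4-7(-|$)/.test(s)) return 'opus-4-7';
  // Generic rule second: every other claude-* name lands here.
  if (/^claude(-|$)/.test(s)) return 'claude';
  return null;
}

console.log(resolveModel('claude-opus-4-7'));    // opus-4-7
console.log(resolveModel('claude-opus-4-7-1m')); // opus-4-7
console.log(resolveModel('claude-opus-4-70'));   // claude (the (-|$) guard rejects "4-70")
console.log(resolveModel('claude-sonnet-4-5'));  // claude
console.log(resolveModel('claudette'));          // null
```

If the two Claude rules were swapped, every `claude-opus-4-7-*` input would match the generic rule first and the new overlay would never be selected, so the order in the hunk is load-bearing.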