merge: integrate origin/main (v1.1.2.0) — mode-posture energy

Main shipped v1.1.2.0, restoring mode-posture energy to /plan-ceo-review EXPANSION and /office-hours forcing + builder modes. V1's writing-style rules 2-4 collapsed every outcome into diagnostic-pain framing; models follow concrete examples over abstract taxonomies, so cathedral-mode output was flattening even when the template said "dream big."

Conflicts:
- VERSION / package.json: kept 1.2.0.0 (branch higher than main's 1.1.2.0)
- CHANGELOG: preserved 1.2.0.0 at top, inserted main's 1.1.2.0 below it, and added a short note under 1.2.0.0's Changed section documenting that the mode-posture examples are included here too (via the port)
- scripts/resolvers/preamble.ts: main edited inline writing-style examples in the old monolithic preamble file; my submodule refactor landed the same file as an 80-line composition root. Resolution: kept my submodule structure (dropped main's 800 lines of inline code) and ported main's new rule 2/3/4 examples into scripts/resolvers/preamble/generate-writing-style.ts; same behavior, just in the right place for the submodule shape.

Ship SKILL.md, golden fixtures, office-hours/plan-ceo-review templates, new test/fixtures/mode-posture/** fixtures, new judgePosture helper, and touchfiles entries for three new gate-tier E2E tests (plan-ceo-review-expansion-energy, office-hours-forcing-energy, office-hours-builder-wildness) all auto-merged cleanly.

Regenerated all SKILL.md files and ship goldens. 423 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-26 19:49:57 +02:00 · 2026-04-19 05:57:27 +08:00
parent 9bfbf06f41 8ee16b867b
commit 1a2e92278d
44 changed files with 745 additions and 105 deletions
@@ -25,6 +25,14 @@ export interface OutcomeJudgeResult {
  reasoning: string;
 }

+export interface PostureScore {
+  axis_a: number;       // 1-5 — mode-specific primary rubric axis
+  axis_b: number;       // 1-5 — mode-specific secondary rubric axis
+  reasoning: string;
+}
+
+export type PostureMode = 'expansion' | 'forcing' | 'builder';
+
 /**
 * Call claude-sonnet-4-6 with a prompt, extract JSON response.
 * Retries once on 429 rate limit errors.
@@ -128,3 +136,57 @@ Rules:
 - evidence_quality (1-5): Do detected bugs have screenshots, repro steps, or specific element references?
  5 = excellent evidence for every bug, 1 = no evidence at all`);
 }
+
+/**
+ * Score mode-specific prose posture on two mode-dependent axes (1-5 each).
+ *
+ * Used by mode-posture regression tests to detect whether V1's Writing Style
+ * rules have flattened the distinctive energy of expansion / forcing / builder
+ * modes. See docs/designs/PLAN_TUNING_V1.md and the V1.1 mode-posture fix.
+ *
+ * The generator model is whatever the skill runs with (often Opus for
+ * plan-ceo-review). The judge is always Sonnet via callJudge() for cost.
+ */
+export async function judgePosture(mode: PostureMode, text: string): Promise<PostureScore> {
+  const rubrics: Record<PostureMode, { axis_a: string; axis_b: string; context: string }> = {
+    expansion: {
+      context: 'This text is expansion proposals emitted by /plan-ceo-review in SCOPE EXPANSION or SELECTIVE EXPANSION mode. The skill is supposed to lead with felt-experience vision, then close with concrete effort and impact.',
+      axis_a: 'surface_framing (1-5): Does each proposal lead with felt-experience framing ("imagine", "when the user sees", "the moment X happens", or equivalent) BEFORE closing with concrete metrics? Penalize pure feature bullets ("Add X. Improves Y by Z%").',
+      axis_b: 'decision_preservation (1-5): Does each proposal contain the elements a scope-expansion decision needs — what to build (concrete shape), effort (ideally both human and CC scales), risk or integration note? Penalize pure prose with no actionable content.',
+    },
+    forcing: {
+      context: 'This text is the Q3 Desperate Specificity question emitted by /office-hours startup mode. The skill is supposed to force the founder to name a specific person and consequence, stacking multiple pressures.',
+      axis_a: 'stacking_preserved (1-5): Does the question include at least 3 distinct sub-pressures (e.g., title? promoted? fired? up at night? OR career? day? weekend?) rather than a single neutral ask? Penalize "Who is your target user?" style collapses.',
+      axis_b: 'domain_matched_consequence (1-5): Does the named consequence match the domain context in the input (B2B → career impact, consumer → daily pain, hobby/open-source → weekend project)? Penalize one-size-fits-all B2B career framing for non-B2B ideas.',
+    },
+    builder: {
+      context: 'This text is builder-mode response from /office-hours. The skill is supposed to riff creatively — "what if you also..." adjacent unlocks, cross-domain combinations, the "whoa" moment — not emit a structured product roadmap.',
+      axis_a: 'unexpected_combinations (1-5): Does the output include at least 2 cross-domain or surprising adjacent unlocks ("what if you also...", "pipe it into X", etc.)? Penalize structured feature lists with no creative leaps.',
+      axis_b: 'excitement_over_optimization (1-5): Does the output read as a creative riff (enthusiastic, opinionated, evocative) or as a PRD / product roadmap (structured, metric-driven, conservative)? Penalize PRD-voice language like "improve retention", "enable virality", "consider adding".',
+    },
+  };
+
+  const r = rubrics[mode];
+  return callJudge<PostureScore>(`You are evaluating prose quality for a mode-specific posture regression test.
+
+Context: ${r.context}
+
+Rate the following output on two dimensions (1-5 scale each):
+
+- **axis_a** — ${r.axis_a}
+- **axis_b** — ${r.axis_b}
+
+Scoring guide:
+- 5: Excellent — strong, unambiguous match for the posture
+- 4: Good — matches posture with minor weakness
+- 3: Adequate — partial match, noticeable flatness or structure
+- 2: Poor — posture mostly flattened / collapsed
+- 1: Fail — posture entirely missing, reads as the opposite mode
+
+Respond with ONLY valid JSON in this exact format:
+{"axis_a": N, "axis_b": N, "reasoning": "brief explanation naming specific phrases that drove the score"}
+
+Here is the output to evaluate:
+
+${text}`);
+}
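
A regression test built on judgePosture presumably compares both axes against a floor. The sketch below shows one way such a check could look, together with a defensive range check on the judge's JSON. The PostureScore shape mirrors the interface in this diff; `isValidPostureScore`, `passesPostureGate`, the integer requirement, and the default floor of 4 are hypothetical and not taken from the repository.

```typescript
// Mirrors the PostureScore interface added in this diff.
interface PostureScore {
  axis_a: number;
  axis_b: number;
  reasoning: string;
}

// Hypothetical helper: confirm the judge returned integer scores in the 1-5 range.
function isValidPostureScore(value: unknown): value is PostureScore {
  if (typeof value !== 'object' || value === null) return false;
  const v = value as Record<string, unknown>;
  const inRange = (n: unknown): boolean =>
    typeof n === 'number' && Number.isInteger(n) && n >= 1 && n <= 5;
  return inRange(v.axis_a) && inRange(v.axis_b) && typeof v.reasoning === 'string';
}

// Hypothetical gate: both axes must meet the floor (4 is an assumed default).
function passesPostureGate(score: PostureScore, minScore = 4): boolean {
  return score.axis_a >= minScore && score.axis_b >= minScore;
}

// Example with a hand-written judge response instead of a real model call.
const raw: unknown = JSON.parse(
  '{"axis_a": 5, "axis_b": 3, "reasoning": "vivid opening, thin effort detail"}',
);
if (isValidPostureScore(raw)) {
  console.log(passesPostureGate(raw));    // false: axis_b is below the floor of 4
  console.log(passesPostureGate(raw, 3)); // true: both axes are at least 3
}
```

Requiring both axes to clear the floor, rather than averaging them, keeps a strong score on one axis from hiding a collapse on the other.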
@@ -69,12 +69,15 @@ export const E2E_TOUCHFILES: Record<string, string[]> = {
  'review-army-consensus':        ['review/**', 'scripts/resolvers/review-army.ts'],

  // Office Hours
-  'office-hours-spec-review':  ['office-hours/**', 'scripts/gen-skill-docs.ts'],
+  'office-hours-spec-review':     ['office-hours/**', 'scripts/gen-skill-docs.ts'],
+  'office-hours-forcing-energy':  ['office-hours/**', 'scripts/resolvers/preamble.ts', 'test/fixtures/mode-posture/**', 'test/helpers/llm-judge.ts'],
+  'office-hours-builder-wildness': ['office-hours/**', 'scripts/resolvers/preamble.ts', 'test/fixtures/mode-posture/**', 'test/helpers/llm-judge.ts'],

  // Plan reviews
-  'plan-ceo-review':           ['plan-ceo-review/**'],
-  'plan-ceo-review-selective': ['plan-ceo-review/**'],
-  'plan-ceo-review-benefits':  ['plan-ceo-review/**', 'scripts/gen-skill-docs.ts'],
+  'plan-ceo-review':                  ['plan-ceo-review/**'],
+  'plan-ceo-review-selective':        ['plan-ceo-review/**'],
+  'plan-ceo-review-benefits':         ['plan-ceo-review/**', 'scripts/gen-skill-docs.ts'],
+  'plan-ceo-review-expansion-energy': ['plan-ceo-review/**', 'scripts/resolvers/preamble.ts', 'test/fixtures/mode-posture/**', 'test/helpers/llm-judge.ts'],
  'plan-eng-review':           ['plan-eng-review/**'],
  'plan-eng-review-artifact':  ['plan-eng-review/**'],
  'plan-review-report':        ['plan-eng-review/**', 'scripts/gen-skill-docs.ts'],
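
The touchfiles map reads as "run this test when any of these paths change". A minimal sketch of that selection follows, assuming a simplified matcher that understands only exact paths and a trailing `/**`. The repository's real selector is not shown in this diff, so `matchesPattern`, `selectTests`, and the changed-file paths used in the example are hypothetical.

```typescript
// Subset of the touchfiles entries from this diff, copied for illustration.
const TOUCHFILES: Record<string, string[]> = {
  'office-hours-spec-review': ['office-hours/**', 'scripts/gen-skill-docs.ts'],
  'office-hours-forcing-energy': [
    'office-hours/**',
    'scripts/resolvers/preamble.ts',
    'test/fixtures/mode-posture/**',
    'test/helpers/llm-judge.ts',
  ],
  'plan-ceo-review': ['plan-ceo-review/**'],
};

// Simplified matcher (assumption): exact path, or a directory prefix for "dir/**".
function matchesPattern(file: string, pattern: string): boolean {
  if (pattern.endsWith('/**')) {
    const prefix = pattern.slice(0, -2); // "office-hours/**" -> "office-hours/"
    return file.startsWith(prefix);
  }
  return file === pattern;
}

// Return the names of tests whose touchfiles match at least one changed file.
function selectTests(changed: string[], touchfiles: Record<string, string[]>): string[] {
  return Object.entries(touchfiles)
    .filter(([, patterns]) => changed.some((f) => patterns.some((p) => matchesPattern(f, p))))
    .map(([name]) => name)
    .sort();
}

// Selects only office-hours-forcing-energy.
console.log(selectTests(['test/helpers/llm-judge.ts'], TOUCHFILES));
// Selects office-hours-forcing-energy and office-hours-spec-review.
console.log(selectTests(['office-hours/example.md'], TOUCHFILES));
```

Under this simplified matcher, a change to scripts/resolvers/preamble/generate-writing-style.ts would not select the posture tests, because the entries name scripts/resolvers/preamble.ts rather than a glob over the preamble/ directory. Whether the real selector behaves the same way is not visible here.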
@@ -236,11 +239,14 @@ export const E2E_TIERS: Record<string, 'gate' | 'periodic'> = {

  // Office Hours
  'office-hours-spec-review': 'gate',
+  'office-hours-forcing-energy': 'gate',       // V1.1 mode-posture regression gate (Sonnet generator)
+  'office-hours-builder-wildness': 'gate',     // V1.1 mode-posture regression gate (Sonnet generator)

  // Plan reviews — gate for cheap functional, periodic for Opus quality
  'plan-ceo-review': 'periodic',
  'plan-ceo-review-selective': 'periodic',
  'plan-ceo-review-benefits': 'gate',
+  'plan-ceo-review-expansion-energy': 'gate',  // V1.1 mode-posture regression gate (Opus generator, Sonnet judge)
  'plan-eng-review': 'periodic',
  'plan-eng-review-artifact': 'periodic',
  'plan-eng-coverage-audit': 'gate',
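
The tier map then decides which of the selected tests belong to a given run. As a rough sketch, assuming a run simply keeps the tests whose tier equals the requested one: `testsInTier` is a hypothetical helper, and the map below copies only a few entries from this diff.

```typescript
type Tier = 'gate' | 'periodic';

// Subset of the tier entries from this diff, copied for illustration.
const TIERS: Record<string, Tier> = {
  'office-hours-forcing-energy': 'gate',
  'plan-ceo-review': 'periodic',
  'plan-ceo-review-benefits': 'gate',
  'plan-ceo-review-expansion-energy': 'gate',
};

// Hypothetical helper: keep only the tests registered under the requested tier.
// Tests missing from the map are dropped rather than guessed at.
function testsInTier(names: string[], tiers: Record<string, Tier>, tier: Tier): string[] {
  return names.filter((name) => tiers[name] === tier);
}

const selected = ['plan-ceo-review', 'plan-ceo-review-expansion-energy', 'unregistered-test'];
console.log(testsInTier(selected, TIERS, 'gate'));     // keeps plan-ceo-review-expansion-energy
console.log(testsInTier(selected, TIERS, 'periodic')); // keeps plan-ceo-review
```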