test: add gate-tier mode-posture regression tests

Three gate-tier E2E tests detect when preamble / template changes flatten the distinctive posture of /plan-ceo-review SCOPE EXPANSION or /office-hours (startup Q3, builder mode). The V1 regression that this PR fixes shipped without anyone catching it at ship time; this is the ongoing signal so the same thing doesn't happen again.

Pieces:

- `judgePosture(mode, text)` in `test/helpers/llm-judge.ts`. Sonnet judge with mode-specific dual-axis rubric (expansion: surface_framing + decision_preservation; forcing: stacking_preserved + domain_matched_consequence; builder: unexpected_combinations + excitement_over_optimization). Pass threshold 4/5 on both axes.
- Three fixtures in `test/fixtures/mode-posture/`: deterministic input for expansion proposal generation, Q3 forcing question, and builder adjacent-unlock riffing.
- `plan-ceo-review-expansion-energy` case appended to `test/skill-e2e-plan.test.ts`. Generator: Opus (skill default). Judge: Sonnet.
- New `test/skill-e2e-office-hours.test.ts` with `office-hours-forcing-energy` + `office-hours-builder-wildness` cases. Generator: Sonnet. Judge: Sonnet.
- Touchfile registration in `test/helpers/touchfiles.ts`: all three as `gate` tier in `E2E_TIERS`, triggered by changes to `scripts/resolvers/preamble.ts`, the relevant skill template, the judge helper, or any mode-posture fixture.

Cost: ~$0.50-$1.50 per triggered PR. Sonnet judge is cheap; Opus generator for the plan-ceo-review case dominates.

Known V1.1 tradeoff: judges test prose markers more than deep behavior. V1.2 candidate is a cross-provider (Codex) adversarial judge on the same output to decouple house-style bias.
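
As a rough illustration of the "4/5 on both axes" rule, the check each test performs can be sketched as a single predicate. `passesPostureGate` and `PASS_THRESHOLD` are hypothetical names introduced here for illustration; the commit itself asserts each axis inline with `toBeGreaterThanOrEqual(4)`.

```typescript
// Hypothetical helper, not part of this commit: it mirrors the pair of
// toBeGreaterThanOrEqual(4) assertions that each mode-posture test makes.
interface PostureScore {
  axis_a: number; // 1-5, mode-specific primary rubric axis
  axis_b: number; // 1-5, mode-specific secondary rubric axis
  reasoning: string;
}

const PASS_THRESHOLD = 4; // assumed constant; the tests hard-code 4

function passesPostureGate(
  score: PostureScore,
  threshold: number = PASS_THRESHOLD,
): boolean {
  // Both axes must clear the bar; a high score on one axis
  // cannot compensate for a low score on the other.
  return score.axis_a >= threshold && score.axis_b >= threshold;
}

// Illustrative scores, not real judge output.
const flattened: PostureScore = { axis_a: 5, axis_b: 3, reasoning: 'vivid framing, no effort estimate' };
const healthy: PostureScore = { axis_a: 4, axis_b: 4, reasoning: 'felt-experience lead, concrete close' };

console.log(passesPostureGate(flattened)); // false
console.log(passesPostureGate(healthy));   // true
```

Requiring both axes independently seems to be the point of the dual-axis rubric: output that is evocative but not actionable (or the reverse) still fails the gate.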
2026-05-02 03:35:09 +02:00 · 2026-04-18 23:46:00 +08:00
parent 190bae5e0e
commit a647064734
7 changed files with 370 additions and 4 deletions
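
The touchfile registration described in the commit message can be pictured as a glob-based lookup from changed paths to test names. The sketch below is an assumption about how such a lookup might work, not the logic in `test/helpers/touchfiles.ts` (which this diff does not show): `matchesPattern` and `selectTests` are hypothetical, and the matcher only understands exact paths and a trailing `/**`.

```typescript
// Sketch only. The pattern lists are copied from the E2E_TOUCHFILES entries
// added in this commit; the matching logic is an illustrative assumption.
const TOUCHFILES: Record<string, string[]> = {
  'office-hours-forcing-energy': [
    'office-hours/**',
    'scripts/resolvers/preamble.ts',
    'test/fixtures/mode-posture/**',
    'test/helpers/llm-judge.ts',
  ],
  'plan-ceo-review-expansion-energy': [
    'plan-ceo-review/**',
    'scripts/resolvers/preamble.ts',
    'test/fixtures/mode-posture/**',
    'test/helpers/llm-judge.ts',
  ],
};

// Minimal matcher (hypothetical): exact path, or directory prefix for "dir/**".
function matchesPattern(file: string, pattern: string): boolean {
  if (pattern.endsWith('/**')) {
    // Drop the "**" but keep the trailing slash so "office-hours/" does not
    // match a sibling such as "office-hours-old/x".
    return file.startsWith(pattern.slice(0, -2));
  }
  return file === pattern;
}

// Return the names of tests whose touchfiles overlap the changed paths.
function selectTests(changedFiles: string[]): string[] {
  return Object.keys(TOUCHFILES)
    .filter((name) =>
      changedFiles.some((file) =>
        TOUCHFILES[name].some((pattern) => matchesPattern(file, pattern)),
      ),
    )
    .sort();
}

console.log(selectTests(['scripts/resolvers/preamble.ts']));
console.log(selectTests(['office-hours/SKILL.md']));
console.log(selectTests(['README.md']));
```

Under these assumptions, a preamble change selects both cases, a change under `office-hours/` selects only the office-hours case, and an unrelated file selects nothing.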
@@ -0,0 +1,15 @@
+# Weekend Project: Dependency Graph Visualizer
+
+I want to build a tool that takes a codebase and visualizes its dependency graph — modules, imports, which files depend on which. For fun, for learning. Maybe open-source it.
+
+## What I have so far
+
+- Rough idea: point it at a repo, get an interactive graph
+- Stack I'm leaning toward: TypeScript + D3 or Cytoscape for rendering
+- Potential: could work for JS/TS first, maybe Python later
+
+## What I don't know yet
+
+- How to make the visualization actually useful vs just pretty
+- Whether this should be a CLI, a web tool, or a VS Code extension
+- What would make someone else want to use it
@@ -0,0 +1,23 @@
+# Plan: Team Velocity Dashboard
+
+## Context
+
+We're building a dashboard for engineering managers to track team code velocity — commits per engineer, PR cycle time, review latency, CI pass rate. The data already lives in GitHub; we're just aggregating it for a manager's single-pane view.
+
+## Changes
+
+1. New React component `TeamVelocityDashboard` in `src/dashboard/`
+2. REST API endpoint `GET /api/team/velocity?days=30` returning aggregated metrics
+3. Background job pulling GitHub data every 15 minutes into Postgres
+4. Simple filter UI: team, date range, metric
+
+## Architecture
+
+- Frontend: React + shadcn/ui
+- Backend: Express + PostgreSQL
+- Data source: GitHub REST API (cached 15min)
+
+## Open questions
+
+- Should we support multiple repos per team?
+- Do we show individual engineer names or aggregate only?
@@ -0,0 +1,13 @@
+# Our Idea: AI Tools for Product Managers
+
+We're building AI tools for product managers at mid-market SaaS companies. The product combines a bunch of the things PMs already do — writing PRDs, gathering user feedback, analyzing usage data, drafting roadmaps — and uses LLMs to speed each of them up.
+
+## Who we're targeting
+
+Product managers at SaaS companies with 50-500 engineers. These PMs are stretched thin, juggle a lot of surface area, and would benefit from AI assistance.
+
+## What we've done so far
+
+- Talked to a few PMs we know from prior jobs
+- Built a prototype that summarizes Zoom customer calls into a PRD stub
+- Got on a waitlist of about 40 signups from LinkedIn posts
@@ -25,6 +25,14 @@ export interface OutcomeJudgeResult {
  reasoning: string;
 }

+export interface PostureScore {
+  axis_a: number;       // 1-5 — mode-specific primary rubric axis
+  axis_b: number;       // 1-5 — mode-specific secondary rubric axis
+  reasoning: string;
+}
+
+export type PostureMode = 'expansion' | 'forcing' | 'builder';
+
 /**
 * Call claude-sonnet-4-6 with a prompt, extract JSON response.
 * Retries once on 429 rate limit errors.
@@ -128,3 +136,57 @@ Rules:
 - evidence_quality (1-5): Do detected bugs have screenshots, repro steps, or specific element references?
  5 = excellent evidence for every bug, 1 = no evidence at all`);
 }
+
+/**
+ * Score mode-specific prose posture on two mode-dependent axes (1-5 each).
+ *
+ * Used by mode-posture regression tests to detect whether V1's Writing Style
+ * rules have flattened the distinctive energy of expansion / forcing / builder
+ * modes. See docs/designs/PLAN_TUNING_V1.md and the V1.1 mode-posture fix.
+ *
+ * The generator model is whatever the skill runs with (often Opus for
+ * plan-ceo-review). The judge is always Sonnet via callJudge() for cost.
+ */
+export async function judgePosture(mode: PostureMode, text: string): Promise<PostureScore> {
+  const rubrics: Record<PostureMode, { axis_a: string; axis_b: string; context: string }> = {
+    expansion: {
+      context: 'This text is expansion proposals emitted by /plan-ceo-review in SCOPE EXPANSION or SELECTIVE EXPANSION mode. The skill is supposed to lead with felt-experience vision, then close with concrete effort and impact.',
+      axis_a: 'surface_framing (1-5): Does each proposal lead with felt-experience framing ("imagine", "when the user sees", "the moment X happens", or equivalent) BEFORE closing with concrete metrics? Penalize pure feature bullets ("Add X. Improves Y by Z%").',
+      axis_b: 'decision_preservation (1-5): Does each proposal contain the elements a scope-expansion decision needs — what to build (concrete shape), effort (ideally both human and CC scales), risk or integration note? Penalize pure prose with no actionable content.',
+    },
+    forcing: {
+      context: 'This text is the Q3 Desperate Specificity question emitted by /office-hours startup mode. The skill is supposed to force the founder to name a specific person and consequence, stacking multiple pressures.',
+      axis_a: 'stacking_preserved (1-5): Does the question include at least 3 distinct sub-pressures (e.g., title? promoted? fired? up at night? OR career? day? weekend?) rather than a single neutral ask? Penalize "Who is your target user?" style collapses.',
+      axis_b: 'domain_matched_consequence (1-5): Does the named consequence match the domain context in the input (B2B → career impact, consumer → daily pain, hobby/open-source → weekend project)? Penalize one-size-fits-all B2B career framing for non-B2B ideas.',
+    },
+    builder: {
+      context: 'This text is builder-mode response from /office-hours. The skill is supposed to riff creatively — "what if you also..." adjacent unlocks, cross-domain combinations, the "whoa" moment — not emit a structured product roadmap.',
+      axis_a: 'unexpected_combinations (1-5): Does the output include at least 2 cross-domain or surprising adjacent unlocks ("what if you also...", "pipe it into X", etc.)? Penalize structured feature lists with no creative leaps.',
+      axis_b: 'excitement_over_optimization (1-5): Does the output read as a creative riff (enthusiastic, opinionated, evocative) or as a PRD / product roadmap (structured, metric-driven, conservative)? Penalize PRD-voice language like "improve retention", "enable virality", "consider adding".',
+    },
+  };
+
+  const r = rubrics[mode];
+  return callJudge<PostureScore>(`You are evaluating prose quality for a mode-specific posture regression test.
+
+Context: ${r.context}
+
+Rate the following output on two dimensions (1-5 scale each):
+
+- **axis_a** — ${r.axis_a}
+- **axis_b** — ${r.axis_b}
+
+Scoring guide:
+- 5: Excellent — strong, unambiguous match for the posture
+- 4: Good — matches posture with minor weakness
+- 3: Adequate — partial match, noticeable flatness or structure
+- 2: Poor — posture mostly flattened / collapsed
+- 1: Fail — posture entirely missing, reads as the opposite mode
+
+Respond with ONLY valid JSON in this exact format:
+{"axis_a": N, "axis_b": N, "reasoning": "brief explanation naming specific phrases that drove the score"}
+
+Here is the output to evaluate:
+
+${text}`);
+}
@@ -69,12 +69,15 @@ export const E2E_TOUCHFILES: Record<string, string[]> = {
  'review-army-consensus':        ['review/**', 'scripts/resolvers/review-army.ts'],

  // Office Hours
-  'office-hours-spec-review':  ['office-hours/**', 'scripts/gen-skill-docs.ts'],
+  'office-hours-spec-review':     ['office-hours/**', 'scripts/gen-skill-docs.ts'],
+  'office-hours-forcing-energy':  ['office-hours/**', 'scripts/resolvers/preamble.ts', 'test/fixtures/mode-posture/**', 'test/helpers/llm-judge.ts'],
+  'office-hours-builder-wildness': ['office-hours/**', 'scripts/resolvers/preamble.ts', 'test/fixtures/mode-posture/**', 'test/helpers/llm-judge.ts'],

  // Plan reviews
-  'plan-ceo-review':           ['plan-ceo-review/**'],
-  'plan-ceo-review-selective': ['plan-ceo-review/**'],
-  'plan-ceo-review-benefits':  ['plan-ceo-review/**', 'scripts/gen-skill-docs.ts'],
+  'plan-ceo-review':                  ['plan-ceo-review/**'],
+  'plan-ceo-review-selective':        ['plan-ceo-review/**'],
+  'plan-ceo-review-benefits':         ['plan-ceo-review/**', 'scripts/gen-skill-docs.ts'],
+  'plan-ceo-review-expansion-energy': ['plan-ceo-review/**', 'scripts/resolvers/preamble.ts', 'test/fixtures/mode-posture/**', 'test/helpers/llm-judge.ts'],
  'plan-eng-review':           ['plan-eng-review/**'],
  'plan-eng-review-artifact':  ['plan-eng-review/**'],
  'plan-review-report':        ['plan-eng-review/**', 'scripts/gen-skill-docs.ts'],
@@ -233,11 +236,14 @@ export const E2E_TIERS: Record<string, 'gate' | 'periodic'> = {

  // Office Hours
  'office-hours-spec-review': 'gate',
+  'office-hours-forcing-energy': 'gate',       // V1.1 mode-posture regression gate (Sonnet generator)
+  'office-hours-builder-wildness': 'gate',     // V1.1 mode-posture regression gate (Sonnet generator)

  // Plan reviews — gate for cheap functional, periodic for Opus quality
  'plan-ceo-review': 'periodic',
  'plan-ceo-review-selective': 'periodic',
  'plan-ceo-review-benefits': 'gate',
+  'plan-ceo-review-expansion-energy': 'gate',  // V1.1 mode-posture regression gate (Opus generator, Sonnet judge)
  'plan-eng-review': 'periodic',
  'plan-eng-review-artifact': 'periodic',
  'plan-eng-coverage-audit': 'gate',
@@ -0,0 +1,173 @@
+/**
+ * E2E tests for /office-hours mode-posture regression (V1.1 gate).
+ *
+ * Exercises startup mode Q3 (forcing energy) and builder mode (generative wildness).
+ * Both cases detect whether preamble Writing Style rules have flattened the
+ * skill's distinctive posture at runtime.
+ *
+ * Judge: Sonnet via judgePosture() — cheap per-call.
+ * Generator: whatever the skill runs with (Sonnet for office-hours).
+ */
+
+import { describe, test, expect, beforeAll, afterAll } from 'bun:test';
+import { runSkillTest } from './helpers/session-runner';
+import {
+  ROOT, browseBin, runId, evalsEnabled,
+  describeIfSelected, testConcurrentIfSelected,
+  logCost, recordE2E,
+  createEvalCollector, finalizeEvalCollector,
+} from './helpers/e2e-helpers';
+import { judgePosture } from './helpers/llm-judge';
+import { spawnSync } from 'child_process';
+import * as fs from 'fs';
+import * as path from 'path';
+import * as os from 'os';
+
+const evalCollector = createEvalCollector('e2e-office-hours');
+
+// --- Office Hours forcing-question energy (Q3 Desperate Specificity) ---
+
+describeIfSelected('Office Hours Forcing Energy E2E', ['office-hours-forcing-energy'], () => {
+  let workDir: string;
+
+  beforeAll(() => {
+    workDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-office-hours-forcing-'));
+    const run = (cmd: string, args: string[]) =>
+      spawnSync(cmd, args, { cwd: workDir, stdio: 'pipe', timeout: 5000 });
+
+    run('git', ['init', '-b', 'main']);
+    run('git', ['config', 'user.email', 'test@test.com']);
+    run('git', ['config', 'user.name', 'Test']);
+
+    const pitch = fs.readFileSync(
+      path.join(ROOT, 'test', 'fixtures', 'mode-posture', 'forcing-pitch.md'),
+      'utf-8',
+    );
+    fs.writeFileSync(path.join(workDir, 'pitch.md'), pitch);
+
+    run('git', ['add', '.']);
+    run('git', ['commit', '-m', 'add pitch']);
+
+    fs.mkdirSync(path.join(workDir, 'office-hours'), { recursive: true });
+    fs.copyFileSync(
+      path.join(ROOT, 'office-hours', 'SKILL.md'),
+      path.join(workDir, 'office-hours', 'SKILL.md'),
+    );
+  });
+
+  afterAll(() => {
+    try { fs.rmSync(workDir, { recursive: true, force: true }); } catch {}
+  });
+
+  testConcurrentIfSelected('office-hours-forcing-energy', async () => {
+    const result = await runSkillTest({
+      prompt: `Read office-hours/SKILL.md for the workflow.
+
+Read pitch.md — that's the founder pitch the user is bringing to office hours. Select Startup Mode. Skip any AskUserQuestion — this is non-interactive.
+
+Assume the founder has already answered Q1 (strongest evidence = "got on a waitlist of about 40 signups from LinkedIn posts") and Q2 (status quo = "PMs use Notion docs + lots of Zoom summaries by hand"). Jump directly to Q3 Desperate Specificity.
+
+Write Q3 output — the forcing question you would ask this founder — to ${workDir}/q3.md. Write ONLY the question prose. No conversational wrapper, no meta-commentary, no Q1/Q2 recap.`,
+      workingDirectory: workDir,
+      maxTurns: 8,
+      timeout: 240_000,
+      testName: 'office-hours-forcing-energy',
+      runId,
+      model: 'claude-sonnet-4-6',
+    });
+
+    logCost('/office-hours (FORCING)', result);
+    recordE2E(evalCollector, '/office-hours-forcing-energy', 'Office Hours Forcing Energy E2E', result, {
+      passed: ['success', 'error_max_turns'].includes(result.exitReason),
+    });
+    expect(['success', 'error_max_turns']).toContain(result.exitReason);
+
+    const q3Path = path.join(workDir, 'q3.md');
+    if (!fs.existsSync(q3Path)) {
+      throw new Error('Agent did not emit q3.md — forcing energy eval requires Q3 output');
+    }
+    const q3Text = fs.readFileSync(q3Path, 'utf-8');
+    expect(q3Text.length).toBeGreaterThan(80);
+
+    const scores = await judgePosture('forcing', q3Text);
+    console.log('Forcing energy scores:', JSON.stringify(scores, null, 2));
+    expect(scores.axis_a).toBeGreaterThanOrEqual(4);  // stacking_preserved
+    expect(scores.axis_b).toBeGreaterThanOrEqual(4);  // domain_matched_consequence
+  }, 360_000);
+});
+
+// --- Office Hours builder-mode wildness ---
+
+describeIfSelected('Office Hours Builder Wildness E2E', ['office-hours-builder-wildness'], () => {
+  let workDir: string;
+
+  beforeAll(() => {
+    workDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-office-hours-builder-'));
+    const run = (cmd: string, args: string[]) =>
+      spawnSync(cmd, args, { cwd: workDir, stdio: 'pipe', timeout: 5000 });
+
+    run('git', ['init', '-b', 'main']);
+    run('git', ['config', 'user.email', 'test@test.com']);
+    run('git', ['config', 'user.name', 'Test']);
+
+    const idea = fs.readFileSync(
+      path.join(ROOT, 'test', 'fixtures', 'mode-posture', 'builder-idea.md'),
+      'utf-8',
+    );
+    fs.writeFileSync(path.join(workDir, 'idea.md'), idea);
+
+    run('git', ['add', '.']);
+    run('git', ['commit', '-m', 'add idea']);
+
+    fs.mkdirSync(path.join(workDir, 'office-hours'), { recursive: true });
+    fs.copyFileSync(
+      path.join(ROOT, 'office-hours', 'SKILL.md'),
+      path.join(workDir, 'office-hours', 'SKILL.md'),
+    );
+  });
+
+  afterAll(() => {
+    try { fs.rmSync(workDir, { recursive: true, force: true }); } catch {}
+  });
+
+  testConcurrentIfSelected('office-hours-builder-wildness', async () => {
+    const result = await runSkillTest({
+      prompt: `Read office-hours/SKILL.md for the workflow.
+
+Read idea.md — that's the user's weekend project idea. Select Builder Mode (Phase 2B). Skip any AskUserQuestion — this is non-interactive.
+
+The user has confirmed the basic idea is "TypeScript + D3 web tool, start with JS/TS dependency graphs." They are now asking: "What are three adjacent unlocks I haven't mentioned yet — things that would turn this from a tool I used into something I'd show a friend?"
+
+Write your response — the three adjacent unlocks — to ${workDir}/unlocks.md. Write ONLY the response prose. No meta-commentary, no mode recap. Lead with the fun; let me edit it down later.`,
+      workingDirectory: workDir,
+      maxTurns: 8,
+      timeout: 240_000,
+      testName: 'office-hours-builder-wildness',
+      runId,
+      model: 'claude-sonnet-4-6',
+    });
+
+    logCost('/office-hours (BUILDER)', result);
+    recordE2E(evalCollector, '/office-hours-builder-wildness', 'Office Hours Builder Wildness E2E', result, {
+      passed: ['success', 'error_max_turns'].includes(result.exitReason),
+    });
+    expect(['success', 'error_max_turns']).toContain(result.exitReason);
+
+    const unlocksPath = path.join(workDir, 'unlocks.md');
+    if (!fs.existsSync(unlocksPath)) {
+      throw new Error('Agent did not emit unlocks.md — builder wildness eval requires output');
+    }
+    const unlocksText = fs.readFileSync(unlocksPath, 'utf-8');
+    expect(unlocksText.length).toBeGreaterThan(200);
+
+    const scores = await judgePosture('builder', unlocksText);
+    console.log('Builder wildness scores:', JSON.stringify(scores, null, 2));
+    expect(scores.axis_a).toBeGreaterThanOrEqual(4);  // unexpected_combinations
+    expect(scores.axis_b).toBeGreaterThanOrEqual(4);  // excitement_over_optimization
+  }, 360_000);
+});
+
+// Finalize eval collector for this file
+if (evalsEnabled) {
+  finalizeEvalCollector(evalCollector);
+}
@@ -6,6 +6,7 @@ import {
  copyDirSync, setupBrowseShims, logCost, recordE2E,
  createEvalCollector, finalizeEvalCollector,
 } from './helpers/e2e-helpers';
+import { judgePosture } from './helpers/llm-judge';
 import { spawnSync } from 'child_process';
 import * as fs from 'fs';
 import * as path from 'path';
@@ -183,6 +184,79 @@ Focus on reviewing the plan content: architecture, error handling, security, and
  }, 420_000);
 });

+// --- Plan CEO Review SCOPE EXPANSION energy (V1.1 mode-posture regression gate) ---
+
+describeIfSelected('Plan CEO Review Expansion Energy E2E', ['plan-ceo-review-expansion-energy'], () => {
+  let planDir: string;
+
+  beforeAll(() => {
+    planDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-plan-ceo-exp-'));
+    const run = (cmd: string, args: string[]) =>
+      spawnSync(cmd, args, { cwd: planDir, stdio: 'pipe', timeout: 5000 });
+
+    run('git', ['init', '-b', 'main']);
+    run('git', ['config', 'user.email', 'test@test.com']);
+    run('git', ['config', 'user.name', 'Test']);
+
+    // Use the shared fixture so expansion-energy regressions are reproducible.
+    const fixture = fs.readFileSync(
+      path.join(ROOT, 'test', 'fixtures', 'mode-posture', 'expansion-plan.md'),
+      'utf-8',
+    );
+    fs.writeFileSync(path.join(planDir, 'plan.md'), fixture);
+
+    run('git', ['add', '.']);
+    run('git', ['commit', '-m', 'add plan']);
+
+    fs.mkdirSync(path.join(planDir, 'plan-ceo-review'), { recursive: true });
+    fs.copyFileSync(
+      path.join(ROOT, 'plan-ceo-review', 'SKILL.md'),
+      path.join(planDir, 'plan-ceo-review', 'SKILL.md'),
+    );
+  });
+
+  afterAll(() => {
+    try { fs.rmSync(planDir, { recursive: true, force: true }); } catch {}
+  });
+
+  testConcurrentIfSelected('plan-ceo-review-expansion-energy', async () => {
+    const result = await runSkillTest({
+      prompt: `Read plan-ceo-review/SKILL.md for the review workflow.
+
+Read plan.md — that's the plan to review. This is a standalone plan document, not a codebase — skip any codebase exploration or system audit steps.
+
+Choose SCOPE EXPANSION mode. Skip any AskUserQuestion calls — this is non-interactive. Auto-approve the ideal-architecture approach in 0C-bis. For 0D, run all three analyses (10x check, platonic ideal, delight opportunities), then emit exactly 2 concrete expansion proposals in the opt-in ceremony.
+
+Write your expansion proposals to ${planDir}/proposals.md with ONLY the proposal text — no conversational wrapper, no review summary, no mode analysis. Each proposal separated by "---".`,
+      workingDirectory: planDir,
+      maxTurns: 15,
+      timeout: 360_000,
+      testName: 'plan-ceo-review-expansion-energy',
+      runId,
+      model: 'claude-opus-4-6',
+    });
+
+    logCost('/plan-ceo-review (EXPANSION ENERGY)', result);
+    recordE2E(evalCollector, '/plan-ceo-review-expansion-energy', 'Plan CEO Review Expansion Energy E2E', result, {
+      passed: ['success', 'error_max_turns'].includes(result.exitReason),
+    });
+    expect(['success', 'error_max_turns']).toContain(result.exitReason);
+
+    const proposalsPath = path.join(planDir, 'proposals.md');
+    if (!fs.existsSync(proposalsPath)) {
+      throw new Error('Agent did not emit proposals.md — expansion energy eval requires proposal output');
+    }
+    const proposalText = fs.readFileSync(proposalsPath, 'utf-8');
+    expect(proposalText.length).toBeGreaterThan(200);
+
+    const scores = await judgePosture('expansion', proposalText);
+    console.log('Expansion energy scores:', JSON.stringify(scores, null, 2));
+    // Pass threshold: 4/5 on both axes (good — matches posture with minor weakness).
+    expect(scores.axis_a).toBeGreaterThanOrEqual(4);  // surface_framing
+    expect(scores.axis_b).toBeGreaterThanOrEqual(4);  // decision_preservation
+  }, 600_000);
+});
+
 // --- Plan Eng Review E2E ---

 describeIfSelected('Plan Eng Review E2E', ['plan-eng-review'], () => {