Files
gstack/test/model-overlay-opus-4-7.test.ts
T
Garry Tan d06f08938f test: gate-tier units + periodic Pros/Cons evals for AskUserQuestion format
Part 3 of 4 (plan: ~/.claude/plans/system-instruction-you-are-working-polymorphic-twilight.md).

Gate-tier (E1, free, runs on every `bun test`):

test/preamble-compose.test.ts — pins the composition order
  Asserts AskUserQuestion Format section renders BEFORE Model-Specific
  Behavioral Patch in tier-≥2 preamble output. Covers claude default,
  opus-4-7 overlay, tier 2/3, and codex host. Catches any future edit
  to scripts/resolvers/preamble.ts that silently reverts the order.

test/resolver-ask-user-format.test.ts — pins the Pros/Cons contract
  14 assertions against generateAskUserFormat output: D<N>, ELI10,
  Stakes if we pick wrong:, Recommendation: <choice>, Pros / cons:,
  / markers, min 2 pros + 1 con rules, hard-stop escape exact
  phrase, neutral-posture CT1 rule ((recommended) label preserved for
  AUTO_DECIDE), Completeness coverage-vs-kind, tool_use mandate
  (rule 11), self-check list, D-numbering model-level caveat.

test/model-overlay-opus-4-7.test.ts — pins the pacing directive
  Asserts raw overlay file + resolved overlay output contain "Pace
  questions to the skill" and NOT "Batch your questions". Verifies
  INHERIT:claude chain still works (Todo-list, subordination wrapper),
  Fan out / Effort-match / Literal interpretation nudges preserved.
  Also asserts claude base overlay does NOT carry the Opus-specific
  pacing directive (no cross-contamination).

Periodic-tier (E2, Opus-dependent, ~$1-2/run):

test/skill-e2e-plan-prosons.test.ts — 4 cases extending v1.6.3.0 harness
  1. Format positive — every token present when plan has real tradeoff
  2. Hard-stop NEGATIVE — plan with genuine tradeoff must NOT dodge to
     "No cons — hard-stop choice" escape
  3. Neutral-posture NEGATIVE — plan where one option dominates must emit
     (recommended) label + "because <reason>", must NOT dodge to
     "taste call" / "no preference"
  4. Hard-stop POSITIVE — destructive-action plan may legitimately use
     the hard-stop escape

test/helpers/touchfiles.ts — entries for all new eval cases
  Dependencies: overlay, preamble.ts, generate-ask-user-format.ts, and
  the 4 plan-review templates. Diff-based selection triggers the evals
  whenever those files change. Also added entries for 7 expanded-coverage
  cases (ship, office-hours, investigate, qa, review, design-review,
  document-release) — test cases will land in follow-up PRs per skill.

Follow-ups noted in test file header:
- True multi-turn cadence eval (3 findings → 3 distinct asks) — current
  harness captures one $OUT_FILE per session; multi-turn capture needs
  new harness support.
- Expanded-coverage test cases for the 7 non-plan-review skills.

Verified:
- bun test: 349 pass (30 new + 319 baseline), 1 pre-existing security-bench
  oversize failure on main (unrelated, unchanged).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 16:48:10 -07:00

99 lines
3.7 KiB
TypeScript

/**
* Opus 4.7 model overlay — gate-tier assertions on the pacing directive.
*
* v1.6.4.0 regressed plan-review cadence because the Opus 4.7 overlay
* carried a "Batch your questions" directive that physically rendered
* above the skill-level pacing rule. Opus 4.7 read top-to-bottom,
* absorbed batching as the ambient default, and stopped honoring the
* plan-review STOP directives.
*
* v1.7.0.0 replaces that block with "Pace questions to the skill" —
* one-question-at-a-time is now the default when the skill contains
* STOP directives; batching becomes the explicit exception.
*
* This test asserts:
* - The new "Pace questions" directive is present
* - The old "Batch your questions" directive is gone
* - The AUTO_DECIDE-compatible language survives (subordination, skill wins)
*/
import { describe, test, expect } from 'bun:test';
import * as fs from 'fs';
import * as path from 'path';
import type { TemplateContext } from '../scripts/resolvers/types';
import { HOST_PATHS } from '../scripts/resolvers/types';
import { generateModelOverlay } from '../scripts/resolvers/model-overlay';
function makeCtx(model: string): TemplateContext {
return {
skillName: 'test-skill',
tmplPath: 'test.tmpl',
host: 'claude',
paths: HOST_PATHS.claude,
preambleTier: 2,
model,
};
}
const ROOT = path.resolve(__dirname, '..');
describe('Opus 4.7 overlay — pacing directive', () => {
test('raw opus-4-7.md contains "Pace questions to the skill"', () => {
const raw = fs.readFileSync(
path.join(ROOT, 'model-overlays/opus-4-7.md'),
'utf-8',
);
expect(raw).toContain('Pace questions to the skill');
});
test('raw opus-4-7.md does NOT contain "Batch your questions" directive', () => {
const raw = fs.readFileSync(
path.join(ROOT, 'model-overlays/opus-4-7.md'),
'utf-8',
);
expect(raw).not.toContain('**Batch your questions.**');
});
test('resolved overlay output contains "Pace questions to the skill"', () => {
const out = generateModelOverlay(makeCtx('opus-4-7'));
expect(out).toContain('Pace questions to the skill');
});
test('resolved overlay inherits from claude base (INHERIT:claude)', () => {
const out = generateModelOverlay(makeCtx('opus-4-7'));
// The claude base contributes the subordination wrapper + Todo discipline
expect(out).toContain('Todo-list discipline');
expect(out).toContain('subordinate');
});
test('resolved overlay says skill STOP directives trigger one-per-turn pacing', () => {
const out = generateModelOverlay(makeCtx('opus-4-7'));
expect(out).toMatch(/STOP\. AskUserQuestion/);
expect(out).toMatch(/pace one question per turn|one question per turn/i);
});
test('resolved overlay requires AskUserQuestion as tool_use', () => {
const out = generateModelOverlay(makeCtx('opus-4-7'));
expect(out).toContain('tool_use');
});
test('resolved overlay flags "obvious fix" findings still need user approval', () => {
const out = generateModelOverlay(makeCtx('opus-4-7'));
expect(out).toMatch(/obvious fix/i);
expect(out).toMatch(/user approval/i);
});
test('resolved overlay keeps Fan out / Effort-match / Literal interpretation nudges', () => {
const out = generateModelOverlay(makeCtx('opus-4-7'));
expect(out).toContain('Fan out explicitly');
expect(out).toContain('Effort-match the step');
expect(out).toContain('Literal interpretation awareness');
});
test('claude overlay (no INHERIT chain) does not carry the pacing directive', () => {
// Claude is the default overlay; opus-4-7 inherits FROM claude.
// The pacing directive belongs to opus-4-7 only.
const out = generateModelOverlay(makeCtx('claude'));
expect(out).not.toContain('Pace questions to the skill');
});
});