Files
gstack/test/skill-e2e-design.test.ts
T
Garry Tan cab774cced v1.56.0.0 Token-reduction Phase B + AUQ paranoid safety net (#1849)
* refactor(plan-ceo-review): carve review body into on-demand section

Carve the largest skill (138,838 B) into a skeleton + one on-demand
section, the documented next Phase B target after /ship (v2_PLAN.md:216).

- sections/review-sections.md(.tmpl): the 11-section deep review, codex/
  outside-voice rules, how-to-ask, Required Outputs, registries, Completion
  Summary, Review Log, REVIEW_DASHBOARD, PLAN_FILE_REVIEW_REPORT, Next Steps,
  docs/designs promotion, Formatting Rules, and the Mode Quick Reference.
- sections/manifest.json: passive registry (CM2), one entry.
- SKILL.md.tmpl: {{SECTION_INDEX}} after the system audit, a single
  {{SECTION:review-sections}} STOP-Read after Step 0 mode selection, and a
  Section self-check. All of Step 0 (the scope/mode conversation) stays in
  the always-loaded skeleton; only EXIT_PLAN_MODE_GATE follows the section.

Measured: always-loaded skeleton 138,838 -> 80,731 B (-42%, ~14.4K tokens
off every invocation). Union (skeleton + section) 139,110 B, behavior held.

Boundary honors Codex P1: nothing review-governing (formatting rules, mode
reference, how-to-ask, required outputs) sits in the skeleton below the
STOP. Housekeeping resolvers ride in the section, matching the ship
precedent (adversarial.md carries LEARNINGS_LOG + GBRAIN_SAVE_RESULTS).

Tests (atomic with the carve — skill-docs.yml gates gen:skill-docs
freshness on every push, so source + regen + tests must land together):
- parity-harness: plan-ceo flipped to sectioned, maxSkeletonBytes 90_000
  (measured 80,731 + headroom); content/minBytes run against the union.
- skill-size-budget: plan-ceo-review added to SECTIONS_EXTRACTED.
- section-manifest-consistency: generalized to discover every carved skill,
  vars computed per-skill-case (Codex P2).
- skill-ceo-section-ordering (new, gate): per-PR static guard — STOP after
  Step 0, review body absent from skeleton, report writer in the section,
  nothing review-governing below the STOP.
- skill-e2e-plan-ceo-review-section-loading (new, periodic): refreshes the
  installed skill first (Codex P1), drives full Step 0, asserts the section
  is Read before the report.
- gen-skill-docs + skill-validation: read the skeleton+sections union for
  carved skills so relocated prose still counts.
- touchfiles: plan-ceo-section-loading registered (periodic).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: bump VERSION + CHANGELOG for plan-ceo-review carve (v1.56.0.0)

MINOR: carves the largest skill into skeleton + on-demand section,
dropping plan-ceo-review's always-loaded cost 42% (138,838 -> 80,731 B,
~14.4K tokens off every invocation). User-facing release notes lead with
the measured token win.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(todos): file P3 follow-up — carve the shared {{PREAMBLE}} reference blocks

Surfaced by /plan-eng-review on the plan-ceo-review carve: per-skill section
carves stay modest because the ~40-50KB shared preamble dominates the
always-loaded surface. A single preamble-reference carve would help every
tier->=2 skill at once. Records the why, the cold-vs-hot split to measure,
and the guards it needs. Not implemented this PR.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(auq): Layer 0 — guarantee AUQ format spec is always-loaded

Deterministic, free, per-PR keystone for the token-reduction era. For every
interactive (tier>=2) skill, asserts the full AskUserQuestion decision-brief
format (ELI10/Recommendation/Pros-cons/checks/Net/(recommended)/Stakes/
self-check) lives in the always-loaded SKILL.md skeleton, NOT only in an
on-demand section. Plus a roster guard (a carve can't silently drop the block)
and per-skill rule survival in the skeleton+sections union. 51 cases + a
negative control. Fails the instant a future carve strands AUQ-governing text
where it won't be loaded when a question fires.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(auq): SDK capture engine + verbose-vs-carved no-degradation A/B

Adds the reusable SDK $OUT_FILE capture engine (auq-sdk-capture.ts): drives a
skill to its AUQ and captures the verbatim text the model GENERATES, cleanly
(real-PTY mangles plan-mode AUQs via cursor escapes). Pins the skill to an
absolute path with Read/Write-only tools so the agent can't wander to the
global install. gradeAuqRecommendation normalizes a non-"because" connective
before grading so substantive reasons aren't false-flagged (without touching
the pinned shared judge).

The A/B drives the same prompt through the carved 80KB skeleton and the
pre-carve 137KB monolith and fails if carved scores worse. Result: both 7/7
format, substance 5 — proven no degradation, transcript-verified each side read
its own planted SKILL.md. Periodic tier.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(auq): consistency — same trigger N runs, stable format + substance

Drives the carved /plan-ceo-review AUQ N=3 times and fails if any format
element appears in one run but not another, or substance craters. Targets the
"fine one run, broken the next" failure class a single snapshot can't see.
Result: 3/3 stable, 7/7 + substance 5 every run. Periodic tier.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(auq): behavioral matrix across AUQ-heavy skills

Data-driven test that drives each AUQ-heavy skill (plan-eng/design/devex,
office-hours, cso, spec, design-consultation) to its first AskUserQuestion and
grades it to the plan-ceo bar: 7/7 decision-brief format + recommendation
substance >=4. One case per skill (isolated failures), env-subsettable via
AUQ_MATRIX_ONLY. Browser/design-binary skills are intentionally excluded
(comparison boards, not format-AUQs; Layer 0 covers their spec). All targeted
skills pass 7/7 with substance 4-5. Periodic tier.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(codex): live recommendation-substance grade for /codex

Closes the gap where /codex's synthesis recommendation was only checked
statically (template grep) and via fixtures. Drives the real /codex skill over
a flawed diff and grades the emitted "Recommendation: ... because ..." line
with judgeRecommendation (present/commits/has_because/substance>=4). The named
weak spot holds up: substance 5. Periodic tier.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(auq): deterministic trigger for format-compliance gate

A bare /plan-ceo-review against a repo whose work is already implemented makes
the model improvise an off-script "what should I review?" scope question that
skips the decision-brief format, which the gate test then times out waiting for.
Hand it a concrete plan to review (FORCING_FLOOR_CEO) so it reaches the real
Step 0 mode-selection AUQ that is the intended format check.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(office-hours): carve Phase 5+6 into on-demand section

Third Phase B carve (v2_PLAN.md:216, after ship and plan-ceo-review). Moves
Phase 5 (Design Doc templates) + Phase 6 (tiered relationship handoff) — the
session's output + closing tail, only reached after the conversation and
alternatives are done — into sections/design-and-handoff.md, behind a single
STOP-Read after Phase 4.5. The live conversation (Phases 1-4.5) and the
always-run Important Rules stay in the always-loaded skeleton.

Measured: always-loaded skeleton 118,280 -> 88,975 B (-24.8%). Union preserved.
The carved AUQ is identical to pre-carve (matrix: 7/7 format, substance 5),
and Layer 0 confirms the AUQ format spec stays in the skeleton — the AUQ
paranoid suite de-risked this carve end to end.

Atomic with tests + regen (skill-docs.yml gates gen:skill-docs freshness on
every push, so source + regen + tests land together; --host all regenerates
the inlined non-Claude variants):
- sections/manifest.json: passive registry, one entry.
- parity-harness: office-hours flipped to sectioned, maxSkeletonBytes 96_000
  (measured 88,975 + headroom); content/minBytes run against the union.
- skill-size-budget: office-hours added to SECTIONS_EXTRACTED.
- gen-skill-docs + skill-validation: read the skeleton+sections union for
  office-hours so relocated Phase 5/6 prose still counts.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: bump VERSION + CHANGELOG for office-hours carve + AUQ suite (v1.57.0.0)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(preamble): carve CJK-escaping manual to on-demand doc

The AskUserQuestion format block is inlined into every interactive skill (~33).
It carried the full multi-paragraph non-ASCII/CJK escaping manual inline, but
that rationale only matters when a question contains CJK text and the operative
rule already lives in the always-loaded self-check. Moved the justification to
docs/askuserquestion-cjk.md (read on demand); kept the rule + a pointer.

Corpus: Claude-host SKILL.md total 3,087,499 -> 3,057,975 B (-29,524 B, ~900 B
x ~33 skills). Layer 0 still passes — the core decision-brief format stays
always-loaded; only the rare CJK rationale moved. Atomic with the all-host
regen (skill-docs.yml freshness gate). VERSION + package.json -> 1.58.0.0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(plan-eng-review): carve review body into on-demand section

Fourth Phase B carve (v2_PLAN.md:220). Moves the 4-section review (Architecture,
Code Quality, Tests, Performance), outside voice, required outputs, and review
report — everything after Step 0 scope — into sections/review-sections.md behind
a single STOP-Read. Step 0 (scope challenge) and EXIT_PLAN_MODE_GATE stay in the
always-loaded skeleton.

Measured: skeleton 106,984 -> 54,892 B (-48.7%). Union preserved. Atomic with
tests + all-host regen (freshness gate): parity flipped to sectioned
(maxSkeletonBytes 62K), plan-eng-review added to SECTIONS_EXTRACTED, gen-skill-docs
reads the union for relocated review/TEST_COVERAGE/dashboard prose. Layer 0 green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(plan-design-review): carve review body into on-demand section

Fifth Phase B carve (v2_PLAN.md:220, bundled with plan-eng). Moves the 7 design
passes, required outputs, and review report — everything after Step 0 scope and
the mockup/rating phase — into sections/review-sections.md behind a STOP-Read.
Step 0, Step 0.5 mockups, the rating method, and EXIT_PLAN_MODE_GATE stay in the
always-loaded skeleton.

Measured: skeleton 112,057 -> 76,024 B (-32.2%). Union preserved. Atomic with
tests + all-host regen: parity sectioned (maxSkeletonBytes 82K), added to
SECTIONS_EXTRACTED, gen-skill-docs reads the union. Layer 0 green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(plan-devex-review): carve review body into on-demand section

Sixth Phase B carve. Moves the 8 DX passes, required outputs, and review report
— everything after the Step 0 DX investigation — into sections/review-sections.md
behind a STOP-Read. All of Step 0 (persona, empathy, benchmark, journey trace,
roleplay) + the rating method + EXIT_PLAN_MODE_GATE stay always-loaded.

Measured: skeleton 110,621 -> 69,658 B (-37%). Union preserved. Atomic with
tests + all-host regen: added to SECTIONS_EXTRACTED, gen-skill-docs reads the
union. Layer 0 green. (No parity invariant entry for plan-devex-review.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: bump VERSION + CHANGELOG for plan-* family carves (v1.59.0.0)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test: refresh ship golden baselines + gbrain-detection union after carves

Two follow-ups the carve commits should have carried (caught by the full suite,
missed by targeted subsets):
- ship golden baselines (claude/codex/factory) regenerated: the preamble CJK
  trim (v1.58) changed ship's always-loaded AskUserQuestion block.
- gbrain-detection-override probes the office-hours skeleton+section union:
  GBRAIN_SAVE_RESULTS moved into sections/design-and-handoff.md when office-hours
  was carved, so the detection assertions now check both files.

Full `bun test` green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(auq): grade format-compliance gate from SDK capture, not the TUI

The real-PTY version grepped the stripAnsi'd interactive AUQ picker. Verified
directly that this cannot work: plan-mode AUQs render as a cursor picker whose
cursor-positioning escapes stripAnsi can't flatten — the picker renders fine for
a human (cursorSeen=45) but the flattened text drops ELI10:/(recommended) and
parseNumberedOptions returns 0. The test was grading a lossy projection and
failed by construction.

Rewritten to drive /plan-ceo-review via the SDK $OUT_FILE capture (the agent
writes the verbatim question it would have shown — clean text, no rendering
loss) and grade 7/7 format + kind-note + recommendation substance >=4. Same
property, reliable, environment-independent; shares the engine with the periodic
A/B and matrix evals. Result: 7/7 format, substance 5. Touchfiles key renamed
ask-user-question-format-pty -> auq-format-gate (no longer a PTY test).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test: fix carve-broken CI evals (union reads + section fixtures)

Two CI eval jobs failed on the carved plan-* skills because they read content
that moved into sections/:

- llm-judge (skill-llm-eval): runWorkflowJudge sliced SKILL.md between markers
  like "## Review Sections" / "## CRITICAL RULE" that now live in
  sections/review-sections.md. The markers vanished from the skeleton, so the
  judge scored empty/wrong content. Fix: read the skeleton+sections union.
  Verified: plan-ceo modes / plan-eng sections / plan-design passes all PASS
  (25/25).

- e2e-plan (skill-e2e-plan): setupPlanDir copied only <skill>/SKILL.md into the
  fixture, not sections/. The carved skill's STOP pointed at a section file that
  was absent, so the model improvised a compressed report table instead of the
  canonical "| Review | Trigger | Why | Runs | Status | Findings |". Fix: copy
  sections/ alongside SKILL.md in all 6 setup sites. Verified: report test PASS,
  canonical table emitted.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test: copy carved sections into all e2e fixtures (prevent more carve-blind CI fails)

Proactive sweep beyond the two CI logs: every e2e test that copies a carved
skill's SKILL.md into a temp fixture must also copy its sections/, or the
model hits a STOP pointing at a missing section file and improvises/degrades.

- skill-e2e.test.ts: plan-ceo/plan-eng/plan-design/office-hours copies across
  planDir/reviewDir/ohDir/benefitsDir dests now copy sections/.
- skill-e2e-plan.test.ts: the office-hours copy + the 4-skill codex-offering
  loop now copy sections/.
- skill-e2e-design.test.ts: plan-design-review copy now copies sections/.
- skill-e2e-office-hours.test.ts: both office-hours copies now copy sections/.
- skill-e2e-office-hours-brain-writeback.test.ts: GBRAIN_SAVE_RESULTS moved into
  the section, so check the regenerated skeleton+section UNION for the gbrain put
  block, ship both into the workdir, and restore both (the section regen was also
  leaking into the working tree — finally now restores it).

ship copies (single-file Step-0 slices) and review/retro (not carved) untouched.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test: migrate section-loading E2E to lossless SDK tool-stream detection

The /ship and /plan-ceo-review section-loading tests drove a real PTY and
scraped the ANSI screen buffer for sections/<file>.md paths. That silently
saw nothing in a Conductor PTY (cursor-positioned tool renders and an
unanswered Step 0 question loop both defeat the regex), so both reported
read: [] even when the agent did the work.

They now run the skill through claude -p (the same SDK path the AUQ matrix
uses) and detect section reads from the tool-use stream — Read calls whose
file_path contains sections/<file>.md — with no rendering layer to mangle.
The run is also hermetic: the freshly-generated worktree skeleton + sections
are copied into a throwaway fixture with the absolute path pinned, so the
test validates this branch's carve without mutating the user's ~/.claude
install.

Validated EVALS_TIER=periodic: both pass (plan-ceo Reads review-sections.md;
ship Reads review-army.md + changelog.md), ~6.5 min for both vs ~23 min
combined on the old PTY path where both were failing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: consolidate branch to v1.56.0.0 (single MINOR above main)

The branch bumped VERSION several times during development (1.56 → 1.57 →
1.58 → 1.59), but none of those landed on main (main is at 1.55.1.0). Per
the "never orphan branch-internal versions" discipline, collapse all four
into a single 1.56.0.0 entry — one MINOR release covering the whole branch:
five skills carved (plan-ceo, office-hours, plan-eng, plan-design,
plan-devex), the shared AskUserQuestion preamble CJK trim, and the paranoid
AUQ no-degradation test suite + lossless section-loading tests.

VERSION and package.json set to 1.56.0.0; main's 1.55.1.0 entry preserved
below the consolidated entry. No SKILL.md drift (VERSION is not embedded in
generated bodies).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 11:14:43 -07:00

616 lines
25 KiB
TypeScript

import { describe, test, expect, beforeAll, afterAll } from 'bun:test';
import { runSkillTest } from './helpers/session-runner';
import { callJudge } from './helpers/llm-judge';
import {
ROOT, browseBin, runId, evalsEnabled,
describeIfSelected, testConcurrentIfSelected,
copyDirSync, setupBrowseShims, logCost, recordE2E,
createEvalCollector, finalizeEvalCollector,
} from './helpers/e2e-helpers';
import { spawnSync } from 'child_process';
import * as fs from 'fs';
import * as path from 'path';
import * as os from 'os';
const evalCollector = createEvalCollector('e2e-design');
/**
* LLM judge for DESIGN.md quality — checks font blacklist compliance,
* coherence, specificity, and AI slop avoidance.
*/
async function designQualityJudge(designMd: string): Promise<{ passed: boolean; reasoning: string }> {
return callJudge<{ passed: boolean; reasoning: string }>(`You are evaluating a generated DESIGN.md file for quality.
Evaluate against these criteria — ALL must pass for an overall "passed: true":
1. Does NOT recommend Inter, Roboto, Arial, Helvetica, Open Sans, Lato, Montserrat, or Poppins as primary fonts
2. Aesthetic direction is coherent with color approach (e.g., brutalist aesthetic doesn't pair with expressive color without explanation)
3. Font recommendations include specific font names (not generic like "a sans-serif font")
4. Color palette includes actual hex values, not placeholders like "[hex]"
5. Rationale is provided for major decisions (not just "because it looks good")
6. No AI slop patterns: purple gradients mentioned positively, "3-column feature grid" language, generic marketing speak
7. Product context is reflected in design choices (civic tech → should have appropriate, professional aesthetic)
DESIGN.md content:
\`\`\`
${designMd}
\`\`\`
Return JSON: { "passed": true/false, "reasoning": "one paragraph explaining your evaluation" }`);
}
// --- Design Consultation E2E ---
describeIfSelected('Design Consultation E2E', [
'design-consultation-core',
'design-consultation-existing',
'design-consultation-research',
'design-consultation-preview',
], () => {
let designDir: string;
beforeAll(() => {
designDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-design-consultation-'));
const run = (cmd: string, args: string[]) =>
spawnSync(cmd, args, { cwd: designDir, stdio: 'pipe', timeout: 5000 });
run('git', ['init', '-b', 'main']);
run('git', ['config', 'user.email', 'test@test.com']);
run('git', ['config', 'user.name', 'Test']);
// Create a realistic project context
fs.writeFileSync(path.join(designDir, 'README.md'), `# CivicPulse
A civic tech data platform for government employees to access, visualize, and share public data. Built with Next.js and PostgreSQL.
## Features
- Real-time data dashboards for municipal budgets
- Public records search with faceted filtering
- Data export and sharing tools for inter-department collaboration
`);
fs.writeFileSync(path.join(designDir, 'package.json'), JSON.stringify({
name: 'civicpulse',
version: '0.1.0',
dependencies: { next: '^14.0.0', react: '^18.2.0', 'tailwindcss': '^3.4.0' },
}, null, 2));
run('git', ['add', '.']);
run('git', ['commit', '-m', 'initial project setup']);
// Copy design-consultation skill
fs.mkdirSync(path.join(designDir, 'design-consultation'), { recursive: true });
fs.copyFileSync(
path.join(ROOT, 'design-consultation', 'SKILL.md'),
path.join(designDir, 'design-consultation', 'SKILL.md'),
);
});
afterAll(() => {
try { fs.rmSync(designDir, { recursive: true, force: true }); } catch {}
});
testConcurrentIfSelected('design-consultation-core', async () => {
const result = await runSkillTest({
prompt: `Read design-consultation/SKILL.md for the design consultation workflow.
Skip the preamble bash block, lake intro, telemetry, and contributor mode sections — go straight to the design workflow.
This is a civic tech data platform called CivicPulse for government employees who need to access public data. Read the README.md for details.
Skip research — work from your design knowledge. Skip the font preview page. Skip any AskUserQuestion calls — this is non-interactive. Accept your first design system proposal.
Write DESIGN.md and CLAUDE.md (or update it) in the working directory.`,
workingDirectory: designDir,
maxTurns: 20,
timeout: 360_000,
testName: 'design-consultation-core',
runId,
model: 'claude-opus-4-7',
});
logCost('/design-consultation core', result);
const designPath = path.join(designDir, 'DESIGN.md');
const claudePath = path.join(designDir, 'CLAUDE.md');
const designExists = fs.existsSync(designPath);
const claudeExists = fs.existsSync(claudePath);
let designContent = '';
if (designExists) {
designContent = fs.readFileSync(designPath, 'utf-8');
}
// Structural checks — fuzzy synonym matching to handle agent variation
const sectionSynonyms: Record<string, string[]> = {
'Product Context': ['product', 'context', 'overview', 'about'],
'Aesthetic': ['aesthetic', 'visual direction', 'design direction', 'visual identity'],
'Typography': ['typography', 'type', 'font', 'typeface'],
'Color': ['color', 'colour', 'palette', 'colors'],
'Spacing': ['spacing', 'space', 'whitespace', 'gap'],
'Layout': ['layout', 'grid', 'structure', 'composition'],
'Motion': ['motion', 'animation', 'transition', 'movement'],
};
const missingSections = Object.entries(sectionSynonyms).filter(
([_, synonyms]) => !synonyms.some(s => designContent.toLowerCase().includes(s))
).map(([name]) => name);
// LLM judge for quality
let judgeResult = { passed: false, reasoning: 'judge not run' };
if (designExists && designContent.length > 100) {
try {
judgeResult = await designQualityJudge(designContent);
console.log('Design quality judge:', JSON.stringify(judgeResult, null, 2));
} catch (err) {
console.warn('Judge failed:', err);
judgeResult = { passed: true, reasoning: 'judge error — defaulting to pass' };
}
}
const structuralPass = designExists && claudeExists && missingSections.length === 0;
recordE2E(evalCollector, '/design-consultation core', 'Design Consultation E2E', result, {
passed: structuralPass && judgeResult.passed && ['success', 'error_max_turns'].includes(result.exitReason),
});
expect(['success', 'error_max_turns']).toContain(result.exitReason);
expect(designExists).toBe(true);
if (designExists) {
expect(missingSections).toHaveLength(0);
}
if (claudeExists) {
const claude = fs.readFileSync(claudePath, 'utf-8');
expect(claude.toLowerCase()).toContain('design.md');
}
}, 420_000);
testConcurrentIfSelected('design-consultation-research', async () => {
// Test WebSearch integration — research phase only, no DESIGN.md generation
const researchDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-research-'));
const result = await runSkillTest({
prompt: `You have access to WebSearch. Research civic tech data platform designs.
Do exactly 2 WebSearch queries:
1. 'civic tech government data platform design 2025'
2. 'open data portal UX best practices'
Summarize the key design patterns you found to ${researchDir}/research-notes.md.
Include: color trends, typography patterns, and layout conventions you observed.
Do NOT generate a full DESIGN.md — just research notes.`,
workingDirectory: researchDir,
maxTurns: 8,
timeout: 90_000,
testName: 'design-consultation-research',
runId,
});
logCost('/design-consultation research', result);
const notesPath = path.join(researchDir, 'research-notes.md');
const notesExist = fs.existsSync(notesPath);
const notesContent = notesExist ? fs.readFileSync(notesPath, 'utf-8') : '';
// Check if WebSearch was used
const webSearchCalls = result.toolCalls.filter(tc => tc.tool === 'WebSearch');
if (webSearchCalls.length > 0) {
console.log(`WebSearch used ${webSearchCalls.length} times`);
} else {
console.warn('WebSearch not used — may be unavailable in test env');
}
recordE2E(evalCollector, '/design-consultation research', 'Design Consultation E2E', result, {
passed: notesExist && notesContent.length > 200 && ['success', 'error_max_turns'].includes(result.exitReason),
});
expect(['success', 'error_max_turns']).toContain(result.exitReason);
expect(notesExist).toBe(true);
if (notesExist) {
expect(notesContent.length).toBeGreaterThan(200);
}
try { fs.rmSync(researchDir, { recursive: true, force: true }); } catch {}
}, 120_000);
testConcurrentIfSelected('design-consultation-existing', async () => {
// Pre-create a minimal DESIGN.md (independent of core test)
fs.writeFileSync(path.join(designDir, 'DESIGN.md'), `# Design System — CivicPulse
## Typography
Body: system-ui
`);
const result = await runSkillTest({
prompt: `Read design-consultation/SKILL.md for the design consultation workflow.
There is already a DESIGN.md in this repo. Update it with a complete design system for CivicPulse, a civic tech data platform for government employees.
Skip research. Skip font preview. Skip any AskUserQuestion calls — this is non-interactive.`,
workingDirectory: designDir,
maxTurns: 20,
timeout: 360_000,
testName: 'design-consultation-existing',
runId,
model: 'claude-opus-4-7',
});
logCost('/design-consultation existing', result);
const designPath = path.join(designDir, 'DESIGN.md');
const designExists = fs.existsSync(designPath);
let designContent = '';
if (designExists) {
designContent = fs.readFileSync(designPath, 'utf-8');
}
// Should have more content than the minimal version
const hasColor = designContent.toLowerCase().includes('color');
const hasSpacing = designContent.toLowerCase().includes('spacing');
recordE2E(evalCollector, '/design-consultation existing', 'Design Consultation E2E', result, {
passed: designExists && hasColor && hasSpacing && ['success', 'error_max_turns'].includes(result.exitReason),
});
expect(['success', 'error_max_turns']).toContain(result.exitReason);
expect(designExists).toBe(true);
if (designExists) {
expect(hasColor).toBe(true);
expect(hasSpacing).toBe(true);
}
}, 420_000);
testConcurrentIfSelected('design-consultation-preview', async () => {
// Test preview HTML generation only — no DESIGN.md (covered by core test)
const previewDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-preview-'));
const result = await runSkillTest({
prompt: `Generate a font and color preview page for a civic tech data platform.
The design system uses:
- Primary font: Cabinet Grotesk (headings), Source Sans 3 (body)
- Colors: #1B4D8E (civic blue), #C4501A (alert orange), #2D6A4F (success green)
- Neutral: #F8F7F6 (warm white), #1A1A1A (near black)
Write a single HTML file to ${previewDir}/design-preview.html that shows:
- Font specimens for each font at different sizes
- Color swatches with hex values
- A light/dark toggle
Do NOT write DESIGN.md — only the preview HTML.`,
workingDirectory: previewDir,
maxTurns: 8,
timeout: 90_000,
testName: 'design-consultation-preview',
runId,
});
logCost('/design-consultation preview', result);
const previewPath = path.join(previewDir, 'design-preview.html');
const previewExists = fs.existsSync(previewPath);
let previewContent = '';
if (previewExists) {
previewContent = fs.readFileSync(previewPath, 'utf-8');
}
const hasHtml = previewContent.includes('<html') || previewContent.includes('<!DOCTYPE');
const hasFontRef = previewContent.includes('font-family') || previewContent.includes('fonts.googleapis') || previewContent.includes('fonts.bunny');
recordE2E(evalCollector, '/design-consultation preview', 'Design Consultation E2E', result, {
passed: previewExists && hasHtml && ['success', 'error_max_turns'].includes(result.exitReason),
});
expect(['success', 'error_max_turns']).toContain(result.exitReason);
expect(previewExists).toBe(true);
if (previewExists) {
expect(hasHtml).toBe(true);
expect(hasFontRef).toBe(true);
}
try { fs.rmSync(previewDir, { recursive: true, force: true }); } catch {}
}, 120_000);
});
// --- Plan Design Review E2E (plan-mode) ---
describeIfSelected('Plan Design Review E2E', ['plan-design-review-plan-mode', 'plan-design-review-no-ui-scope'], () => {
/** Create an isolated tmpdir with git repo and plan-design-review skill */
function setupReviewDir(): string {
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-plan-design-'));
const run = (cmd: string, args: string[]) =>
spawnSync(cmd, args, { cwd: dir, stdio: 'pipe', timeout: 5000 });
run('git', ['init', '-b', 'main']);
run('git', ['config', 'user.email', 'test@test.com']);
run('git', ['config', 'user.name', 'Test']);
// Copy plan-design-review skill
fs.mkdirSync(path.join(dir, 'plan-design-review'), { recursive: true });
fs.copyFileSync(
path.join(ROOT, 'plan-design-review', 'SKILL.md'),
path.join(dir, 'plan-design-review', 'SKILL.md'),
);
{ const _sec = path.join(ROOT, 'plan-design-review', 'sections'); if (fs.existsSync(_sec)) fs.cpSync(_sec, path.join(dir, 'plan-design-review', 'sections'), { recursive: true }); }
return dir;
}
testConcurrentIfSelected('plan-design-review-plan-mode', async () => {
const reviewDir = setupReviewDir();
try {
const run = (cmd: string, args: string[]) =>
spawnSync(cmd, args, { cwd: reviewDir, stdio: 'pipe', timeout: 5000 });
// Create a plan file with intentional design gaps
fs.writeFileSync(path.join(reviewDir, 'plan.md'), `# Plan: User Dashboard
## Context
Build a user dashboard that shows account stats, recent activity, and settings.
## Implementation
1. Create a dashboard page at /dashboard
2. Show user stats (posts, followers, engagement rate)
3. Add a recent activity feed
4. Add a settings panel
5. Use a clean, modern UI with cards and icons
6. Add a hero section at the top with a gradient background
## Technical Details
- React components with Tailwind CSS
- API endpoint: GET /api/dashboard
- WebSocket for real-time activity updates
`);
run('git', ['add', '.']);
run('git', ['commit', '-m', 'initial plan']);
const result = await runSkillTest({
prompt: `Read plan-design-review/SKILL.md for the design review workflow.
Review the plan in ./plan.md. This plan has several design gaps — it uses vague language like "clean, modern UI" and "cards and icons", mentions a "hero section with gradient" (AI slop), and doesn't specify empty states, error states, loading states, responsive behavior, or accessibility.
Skip the preamble bash block. Skip any AskUserQuestion calls — this is non-interactive. Rate each design dimension 0-10 and explain what would make it a 10. Then EDIT plan.md to add the missing design decisions (interaction state table, empty states, responsive behavior, etc.).
IMPORTANT: Do NOT try to browse any URLs or use a browse binary. This is a plan review, not a live site audit. Just read the plan file, review it, and edit it to fix the gaps.`,
workingDirectory: reviewDir,
maxTurns: 15,
timeout: 300_000,
testName: 'plan-design-review-plan-mode',
runId,
});
logCost('/plan-design-review plan-mode', result);
// Check that the agent produced design ratings (0-10 scale)
const output = result.output || '';
const hasRatings = /\d+\/10/.test(output);
const hasDesignContent = output.toLowerCase().includes('information architecture') ||
output.toLowerCase().includes('interaction state') ||
output.toLowerCase().includes('ai slop') ||
output.toLowerCase().includes('hierarchy');
// Check that the plan file was edited (the core new behavior)
const planAfter = fs.readFileSync(path.join(reviewDir, 'plan.md'), 'utf-8');
const planOriginal = `# Plan: User Dashboard`;
const planWasEdited = planAfter.length > 300; // Original is ~450 chars, edited should be much longer
const planHasDesignAdditions = planAfter.toLowerCase().includes('empty') ||
planAfter.toLowerCase().includes('loading') ||
planAfter.toLowerCase().includes('error') ||
planAfter.toLowerCase().includes('state') ||
planAfter.toLowerCase().includes('responsive') ||
planAfter.toLowerCase().includes('accessibility');
recordE2E(evalCollector, '/plan-design-review plan-mode', 'Plan Design Review E2E', result, {
passed: hasDesignContent && planWasEdited && ['success', 'error_max_turns'].includes(result.exitReason),
});
expect(['success', 'error_max_turns']).toContain(result.exitReason);
// Agent should produce design-relevant output about the plan
expect(hasDesignContent).toBe(true);
// Agent should have edited the plan file to add missing design decisions
expect(planWasEdited).toBe(true);
expect(planHasDesignAdditions).toBe(true);
} finally {
try { fs.rmSync(reviewDir, { recursive: true, force: true }); } catch {}
}
}, 360_000);
testConcurrentIfSelected('plan-design-review-no-ui-scope', async () => {
const reviewDir = setupReviewDir();
try {
const run = (cmd: string, args: string[]) =>
spawnSync(cmd, args, { cwd: reviewDir, stdio: 'pipe', timeout: 5000 });
// Write a backend-only plan
fs.writeFileSync(path.join(reviewDir, 'backend-plan.md'), `# Plan: Database Migration
## Context
Migrate user records from PostgreSQL to a new schema with better indexing.
## Implementation
1. Create migration to add new columns to users table
2. Backfill data from legacy columns
3. Add database indexes for common query patterns
4. Update ActiveRecord models
5. Run migration in staging first, then production
`);
run('git', ['add', '.']);
run('git', ['commit', '-m', 'initial plan']);
const result = await runSkillTest({
prompt: `Read plan-design-review/SKILL.md for the design review workflow.
Review the plan in ./backend-plan.md. This is a pure backend database migration plan with no UI changes.
Skip the preamble bash block. Skip any AskUserQuestion calls — this is non-interactive. Write your findings directly to stdout.
IMPORTANT: Do NOT try to browse any URLs or use a browse binary. This is a plan review, not a live site audit.`,
workingDirectory: reviewDir,
maxTurns: 10,
timeout: 180_000,
testName: 'plan-design-review-no-ui-scope',
runId,
});
logCost('/plan-design-review no-ui-scope', result);
// Agent should detect no UI scope and exit early
const output = result.output || '';
const detectsNoUI = output.toLowerCase().includes('no ui') ||
output.toLowerCase().includes('no frontend') ||
output.toLowerCase().includes('no design') ||
output.toLowerCase().includes('not applicable') ||
output.toLowerCase().includes('backend');
recordE2E(evalCollector, '/plan-design-review no-ui-scope', 'Plan Design Review E2E', result, {
passed: detectsNoUI && ['success', 'error_max_turns'].includes(result.exitReason),
});
expect(['success', 'error_max_turns']).toContain(result.exitReason);
expect(detectsNoUI).toBe(true);
} finally {
try { fs.rmSync(reviewDir, { recursive: true, force: true }); } catch {}
}
}, 240_000);
});
// --- Design Review E2E (live-site audit + fix) ---
describeIfSelected('Design Review E2E', ['design-review-fix'], () => {
let qaDesignDir: string;
let qaDesignServer: ReturnType<typeof Bun.serve> | null = null;
beforeAll(() => {
qaDesignDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-qa-design-'));
setupBrowseShims(qaDesignDir);
const run = (cmd: string, args: string[]) =>
spawnSync(cmd, args, { cwd: qaDesignDir, stdio: 'pipe', timeout: 5000 });
run('git', ['init', '-b', 'main']);
run('git', ['config', 'user.email', 'test@test.com']);
run('git', ['config', 'user.name', 'Test']);
// Create HTML/CSS with intentional design issues
fs.writeFileSync(path.join(qaDesignDir, 'index.html'), `<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Design Test App</title>
<link rel="stylesheet" href="style.css">
</head>
<body>
<header>
<h1 style="font-size: 48px; color: #333;">Welcome</h1>
<h2 style="font-size: 47px; color: #334;">Subtitle Here</h2>
</header>
<main>
<div class="card" style="padding: 10px; margin: 20px;">
<h3 style="color: blue;">Card Title</h3>
<p style="color: #666; font-size: 14px; line-height: 1.2;">Some content here with tight line height.</p>
</div>
<div class="card" style="padding: 30px; margin: 5px;">
<h3 style="color: green;">Another Card</h3>
<p style="color: #999; font-size: 16px;">Different spacing and colors for no reason.</p>
</div>
<button style="background: red; color: white; padding: 5px 10px; border: none;">Click Me</button>
<button style="background: #007bff; color: white; padding: 12px 24px; border: none; border-radius: 20px;">Also Click</button>
</main>
</body>
</html>`);
fs.writeFileSync(path.join(qaDesignDir, 'style.css'), `body {
font-family: Arial, sans-serif;
margin: 0;
padding: 20px;
}
.card {
border: 1px solid #ddd;
border-radius: 4px;
}
`);
run('git', ['add', '.']);
run('git', ['commit', '-m', 'initial design test page']);
// Start a simple file server for the design test page
qaDesignServer = Bun.serve({
port: 0,
fetch(req) {
const url = new URL(req.url);
const filePath = path.join(qaDesignDir, url.pathname === '/' ? 'index.html' : url.pathname.slice(1));
try {
const content = fs.readFileSync(filePath);
const ext = path.extname(filePath);
const contentType = ext === '.css' ? 'text/css' : ext === '.html' ? 'text/html' : 'text/plain';
return new Response(content, { headers: { 'Content-Type': contentType } });
} catch {
return new Response('Not Found', { status: 404 });
}
},
});
// Copy design-review skill
fs.mkdirSync(path.join(qaDesignDir, 'design-review'), { recursive: true });
fs.copyFileSync(
path.join(ROOT, 'design-review', 'SKILL.md'),
path.join(qaDesignDir, 'design-review', 'SKILL.md'),
);
});
afterAll(() => {
qaDesignServer?.stop();
try { fs.rmSync(qaDesignDir, { recursive: true, force: true }); } catch {}
});
testConcurrentIfSelected('design-review-fix', async () => {
const serverUrl = `http://localhost:${(qaDesignServer as any)?.port}`;
const result = await runSkillTest({
prompt: `IMPORTANT: The browse binary is already assigned below as B. Do NOT search for it or run the SKILL.md setup block — just use $B directly.
B="${browseBin}"
Read design-review/SKILL.md for the design review + fix workflow.
Review the site at ${serverUrl}. Use --quick mode. Skip any AskUserQuestion calls — this is non-interactive. Fix up to 3 issues max. Write your report to ./design-audit.md.`,
workingDirectory: qaDesignDir,
maxTurns: 30,
timeout: 360_000,
testName: 'design-review-fix',
runId,
});
logCost('/design-review fix', result);
const reportPath = path.join(qaDesignDir, 'design-audit.md');
const reportExists = fs.existsSync(reportPath);
// Check if any design fix commits were made
const gitLog = spawnSync('git', ['log', '--oneline'], {
cwd: qaDesignDir, stdio: 'pipe',
});
const commits = gitLog.stdout.toString().trim().split('\n');
const designFixCommits = commits.filter((c: string) => c.includes('style(design)'));
recordE2E(evalCollector, '/design-review fix', 'Design Review E2E', result, {
passed: ['success', 'error_max_turns'].includes(result.exitReason),
});
// Accept error_max_turns — the fix loop is complex
expect(['success', 'error_max_turns']).toContain(result.exitReason);
// Report and commits are best-effort — log what happened
if (reportExists) {
const report = fs.readFileSync(reportPath, 'utf-8');
console.log(`Design audit report: ${report.length} chars`);
} else {
console.warn('No design-audit.md generated');
}
console.log(`Design fix commits: ${designFixCommits.length}`);
}, 420_000);
});
// Module-level afterAll — finalize eval collector after all tests complete
afterAll(async () => {
await finalizeEvalCollector(evalCollector);
});