mirror of
https://github.com/garrytan/gstack.git
synced 2026-06-17 15:20:11 +02:00
14fc0866d9
* docs(todos): P3 content-hash diagram render cache for make-pdf Deferred from the diagram-engine eng review (Codex outside-voice D7): repeat make-pdf runs re-render every fence; cache keyed on fence source + bundle version once multi-diagram docs make it worth building. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * feat(diagram-render): offline mermaid+excalidraw render bundle for browse Single self-contained page (dist/diagram-render.html, 9.2MB, committed per eng-review D2) exposing __renderMermaid / __mermaidToExcalidraw / __excalidrawToSvg / __rasterize / __probeImage through browse load-html + js --out. Render contract per D3: securityLevel strict, per-fence ids, print-css font lock, htmlLabels off (canvas-taint-safe). Deterministic build (same sha twice); drift test pins dist == BUILD_INFO == package.json pins and rebuild-reproducibility when toolchain matches. Spike-proven offline: flowchart + sequence SVG, editable .excalidraw scene, 300dpi PNG. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * feat(diagram-render): __downscaleRaster for print-resolution image normalization Data-URI rasters re-encode in their own format (JPEG stays JPEG at q0.9 — PNG-encoding photos bloats them) at an explicit target pixel width. Used by make-pdf's pre-pass for the 300dpi content-box ceiling (eng-review D4). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * feat(make-pdf): diagram pre-pass — mermaid/excalidraw fences render as vector SVG; local images inline as data URIs ```mermaid / ```excalidraw fences extract to placeholder tokens, render in one diagram-render bundle tab per run (reset contract: bundle page reloads after any render error), and substitute back as accessible <figure> blocks with the raw source preserved in a comment. Render failures produce a loud red diagnostic block, never silent raw code. render=false keeps a fence as code; title="..." becomes the aria-label and caption. 
Local images now actually render: page.setContent loads at about:blank (tab-session.ts:194), so relative paths silently 404'd before. The pre-pass resolves them against the markdown's directory, inlines as data URIs, probes intrinsic dimensions from the bytes (pure-TS PNG/JPEG/GIF/WebP/SVG sniffing), and downscales rasters wider than 2x the content box at 300dpi. Remote URLs warn (offline posture, --allow-network exempts); missing files get a visible placeholder; --strict hard-fails both for CI pipelines. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * test(make-pdf): diagram pre-pass unit suite + e2e render gates 34 unit tests (fence extraction incl. nested/tilde/unclosed/render=false, info-string parsing, slot substitution, diagnostic/figure escaping + SVG script strip, byte-level dimension probing across 5 formats, content-box math, image inlining incl. strict/remote/missing/data-URI paths). E2E gate proves through the compiled binary: both fences render as vector text (id-collision check), raw mermaid ships only via render=false, broken fence yields the diagnostic block, and the relative fixture image rasterizes to colored pixels (CRITICAL regression for the about:blank image fix). --strict exits non-zero on a missing image. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * feat(make-pdf): width directives + conservative auto-landscape via CSS named pages `{width=full|<pct>|<dim>}` and `{page=landscape|portrait}` suffixes translate to data-gstack-* attrs in render() (before the sanitizer, which keeps data- attributes; unrecognized brace groups stay visible text). Default width rule needs no code: intrinsic CSS-px capped at the content box, never upscaled — figure img max-width owns it. 
Auto-landscape promotes a block to `@page wide { size: <pagesize> landscape }` only when aspect >= 1.8 AND intrinsic width > 2.5x the content box (~1600px on letter) AND diagram provenance (rendered fences) or a whole-word alt token (diagram|architecture|flowchart|chart|graph) for plain images. {page=...} forces or vetoes; fence info strings accept page=... too. preferCSSPageSize is passed to Chromium only when a promotion exists, so every other document prints exactly as before. False negatives are cheap; false positives feel broken (eng-review P4, Codex challenge accepted). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * test(make-pdf): width-policy unit suite + landscape e2e gate with negative fixtures 24 unit tests weighted toward the false-positive guards: wide screenshot without an alt hint stays portrait, sub-threshold and tall images stay portrait, deterministic 1560/1561px boundary, whole-word alt matching ('photographic' must not match 'graph'), page=portrait veto beats every heuristic, diagnostic blocks never promote. E2E gate asserts pdfinfo per-page boxes through the compiled binary: exactly 3 of 5 fixture blocks get landscape pages (alt-hinted image, directive-forced image, wide sequence diagram) while the unhinted screenshot and the veto'd diagram stay portrait — plus the --toc combo proving TOC and named-page landscape coexist. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * feat(make-pdf): --to html|docx output formats --to html writes the assembled self-contained document directly (no print round-trip): inline vector diagrams, data-URI images, zero network references, plus an @media screen layer for browser reading. 
--to docx is the content-fidelity export (eng-review P8): html-to-docx@1.8.0 (exact pin; pure JS, bun-compile-verified) maps headings/tables/code/lists; diagrams and SVG images rasterize at 300dpi of the content-box width via the render tab; diagnostic figures convert to plain p/pre so the converter can't silently drop an error. --format keeps its page-size-alias meaning; --to is the output format, and the CLI says so when confused. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * test(make-pdf): format gate — html no-network-refs + docx zip content checks HTML: zero src/href network refs, no script/link tags, inline SVG diagrams, data-URI images, screen layer, diagnostic survives. DOCX: valid OOXML zip (document.xml + Content_Types), >=2 PNG media (diagram raster + fixture image), headings + render=false source + diagnostic text in document.xml, no leaked mermaid source from rendered fences. Plus --to validation UX. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * feat(diagram): /diagram skill — English in, editable diagram triplet out New skill: agent authors mermaid from the user's description and renders the triplet through the offline diagram-render bundle in the browse daemon — .mmd source (the single source of truth), editable .excalidraw (opens at excalidraw.com, round-trips back through re-render), and SVG + PNG. Flowcharts convert to fully editable scenes; other mermaid types render with an explicit upstream-converter limitation note. Never ships an unrendered source file; offline is the contract (no CDN fallback). Inventory rows in AGENTS.md + docs/skills.md; generated SKILL.md + llms.txt via gen:skill-docs. 
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * test(diagram): paid E2E pair — gate triplet contract + periodic authoring judge diagram-triplet (gate, deterministic functional): a fresh claude -p agent following the skill extract must emit a parseable triplet — graph LR/TD in .mmd, excalidraw scene with >3 elements, SVG markup, PNG magic bytes. Verified live: pass, $0.17, 58s. diagram-authoring-quality (periodic, LLM-judged): faithfulness/labels/size rubric with a diagnostic-path cap, floor 6/10. Verified live: pass at exactly 6 with substantive critique. Touchfiles select both on diagram/** and lib/diagram-render/** changes; tier split per E2E_TIERS rules (eng-review D5). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * test(diagram): register /diagram in the skill coverage matrix Gate: triplet contract + structural floor; periodic: authoring-quality judge. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * feat(make-pdf): typography scale-up, zero image truncation, landscape vertical centering Dogfooding round on the repo README surfaced four output-quality bugs: - Type was too small everywhere: body 11→12pt, h1 22→26pt, h2 15→18pt, cover title 32→56pt with poster spacing, cover meta 10→13pt, TOC 11→12pt with tighter leading, code 9.5→10.5pt, tables 10→11pt. - Zero image truncation, ever: the max-width cap was figure-scoped, but markdown images render as <p><img> — a 1850px GitHub screenshot ran off the page edge. Global img { max-width: 100%; height: auto; } cap. - hyphens: auto put real 'dif-\nferent' breaks into the PDF text layer the moment 12pt made lines wrap (combined-gate caught it). Clean copy-paste is the product contract; left-aligned rag doesn't need hyphenation → hyphens: manual. - Promoted landscape blocks now vertically center. 
CSS flex/min-height centering fragments into phantom empty landscape pages in Chromium (bisected: min-height at ANY value; 3 promotions printed 5 pages), so image-policy computes an inline margin-top from each block's known aspect ratio against the landscape content box instead — fragmentation handles margins fine. .page-wide also drops its explicit break-before/ after (the page-name change already breaks on both sides). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * test(make-pdf): pin zero-truncation invariant, typography floor, centering math Global img cap pinned as a regex invariant (the figure-scoped-cap regression class); typography floor (12pt body, 56pt cover, 12pt TOC); .page-wide must NOT carry min-height/flex (the phantom-landscape-page regression class); centering margin math verified both ways (2400×1000 image → 1.38in, 2050×600 viewBox diagram → 1.93in, page-filling directive block → no margin). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * docs: diagram + multi-format documentation across README, make-pdf skill, and how-to guide README gains /make-pdf (Publisher) and /diagram (Diagram Maker) rows in the sprint table. make-pdf's skill doc — the agent-facing contract — gains Core patterns for mermaid/excalidraw fences (title/render=false/page= options), the image policy ({width=}/{page=} directives, zero-truncation, conservative auto-landscape), --to html|docx, and --strict, plus the --to vs --format disambiguation in Common flags. New docs/howto-diagrams-and-formats.md is the user-facing walkthrough: fences, directives, formats, /diagram triplet, the mermaid racetrack trick, troubleshooting. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * test(make-pdf): fill ship-audit coverage gaps — downscale, reset contract, excalidraw fence, WebP Ship coverage audit found 9 gaps (85%); this fills the 2 HIGH + 3 MEDIUM and most LOW. 
diagram-gate fixture gains a 4200px incompressible photo (the only live coverage of __downscaleRaster AND the 64KB chunked jsViaBuffer eval transport — asserted via the downscale stderr warning), an ```excalidraw scene fence rendered through exportToSvg (vector labels + caption in pdftotext, no leaked scene JSON), and the broken fence MOVED BETWEEN the two mermaid fences so the second diagram rendering proves the D6.2 reset contract end-to-end. New coverage-gaps.test.ts (16 tests): mock-tab reset contract (exactly one reload, post-failure fence renders), excalidraw fail-fast diagnostic without a bundle call, rasterize error fallbacks (figure/tag kept, never silent), WebP VP8/VP8L/VP8X byte parsers, landscapeContentBox a4/asymmetric margins, bare-token slot fallback, resolveBundlePath env override + error shape, screenCss media scoping. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(make-pdf): pre-landing review wave — fence fidelity, injection hardening, Windows paths, transport rework Review army (6 specialists + red team) findings, all fixed: - Indented fences replay byte-for-byte and indented diagram fences are NOT extracted (red-team conf-9: the pre-pass reconstructed fences at column 0, splitting any list containing fenced code — every ordinary document). - String.replace $-pattern injection killed at every seam: substituteSlots, mergeStyle, img/src rewrites all use function replacements (a diagram label containing $' duplicated the document tail). - Big-expression transport reworked: browse `eval <file>` (one spawn, any size, Windows-safe) replaces the 64KB chunked window-buffer eval — fixes the per-chunk spawn cost, the char-vs-byte argv units, AND the Windows 32,767-char command-line ceiling in one move. - Staged-bundle trust: content verified by hash even when the file exists, and the rename-failure path re-hashes the survivor (sticky-bit /tmp EPERM would otherwise ride a pre-planted file past the check). 
- Windows drive-letter img srcs (C:/x.png) reach the local-path branch instead of being swallowed as unknown URL schemes. - DOCX rasterize-failure now embeds the decoded source as visible text — returning the figure made diagrams vanish silently (converter drops svg). - Fence source preserved as base64 data-gstack-source attribute (the comment encoding corrupted every '-->' arrow); decodeFigureSource() round-trips. - inlineLocalImages memoizes per path; file:// uses fileURLToPath; preview prints a divergence note for fences/local images; --to docx strips the watermark div and warns about print-only flags; TOC links resolve in html/docx (heading ids assigned); waitForExpression sleeps instead of busy-spinning; escapeHtml/svg-dims deduped to single definitions; typography stragglers (blockquote 12pt, footnotes 10pt, 42em screen measure); bundle BUILD_INFO gains srcSha256 for no-node_modules drift detection; MAX_TARGET_PX shared guard. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * ci: make-pdf gate covers the diagram-render bundle; bundle pinned to LF make-pdf-gate.yml paths gain lib/diagram-render/** and the drift test (a bundle-only PR previously skipped every render gate AND no CI lane ran the drift check at all). .gitattributes pins dist html/json to LF so Windows autocrlf can't break the hash-pinned bundle. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * test(make-pdf)+feat(diagram): review-wave test pins + skill transport hardening Tests: indented-fence byte-for-byte replay + no-extraction-in-lists, drive-letter local-path routing, $-pattern slot immunity, base64 source round-trip ('A --> B' exact), existing-style merge preservation, DOCX rasterize-failure surfaces source, srcSha256 + font-stack drift guards, landscape veto asserted as some-portrait/no-landscape (layout-order-proof), judge rubric cap lowered to 5 so it actually fails, vacuous error-shape test removed honestly, tmpdir cleanup. 
/diagram skill: base64 transport (template literals corrupted backticks/${ in sources), content-addressed staging with hash verification, and --tab-id pinned on every browse call so a concurrent /qa session can't be clobbered. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * feat(make-pdf): out-of-tree image reads warn; --strict makes them fatal (D8.1) Local CLI semantics stay (absolute paths and ../ still inline, like pandoc), but never silently: an agent PDF-ing untrusted markdown can't quietly embed a file from outside the input directory into a shareable document without a visible warning, and --strict pipelines hard-fail. Two unit tests. Also: TODOS.md gains the deferred e2e-harness dedup entry (D8.2). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix: pre-existing test failure in skill-e2e-bws operational-learning Root cause was the fixture, not model behavior: gstack-learnings-log gained an import of lib/jsonl-store.ts in the v1.57.5.0 injection-sanitization wave, but the test copies only bin/ scripts into its sandbox — the inline bun import failed and the script exited 1 before writing, on every run, on main too (reproduced ata5833c41). Fixture now stages lib/jsonl-store.ts beside bin/; verified deterministically (script exits 0, learning written) and via the paid test (1 pass). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(make-pdf): adversarial-review wave — offline posture enforced, symlink-aware confinement, bounded reads Codex adversarial + structured review findings: - Remote images are now BLOCKED with a visible placeholder instead of warn-and-keep — leaving the tag meant Chromium fetched the URL at print time anyway, so the offline posture was a lie (tracking pixels and internal-URL probes ran without --allow-network). - The out-of-tree read check compares REAL paths: a symlink inside the input dir pointing at ~/.ssh/... passed the string-prefix check, including under --strict. 
Ordered after the existence check (realpath of a missing file false-positives on macOS /var → /private/var). - Image reads are bounded BEFORE reading: statSync first, non-regular files (fifo/device/dir) and >64MB files degrade to placeholders instead of hanging or exhausting memory; malformed percent-encoding (foo%zz.png) degrades to missing-image instead of crashing decodeURIComponent. - browse shell-outs get a 120s timeout — a wedged daemon or hostile mermaid source fails the run instead of hanging it. - TOC entries link to the heading's ACTUAL id (pre-id'd raw-HTML headings previously got dead #toc-N links); per-side margins compose into the CSS @page shorthand so a landscape promotion flipping preferCSSPageSize no longer silently reverts --margin-left/right to defaults (Codex P2). - The image memo is a typed object — literal NUL-byte separators had made diagram-prepass.ts register as binary to text tooling. Codex structured review GATE: PASS (no P1). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * chore: bump version and changelog (v1.58.0.0) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * docs: sync make-pdf image-policy docs with final shipped behavior (v1.58.0.0) The docs wave (87594420) predated the final review-wave commits, so two docs drifted from shipped behavior: - make-pdf/SKILL.md.tmpl + generated SKILL.md: remote images are BLOCKED with a visible placeholder (not warned-and-kept); out-of-tree reads (including via symlink) warn and --strict makes them fatal; --strict also covers oversized (>64MB) and non-regular files; troubleshooting entry now names the actual "[remote image blocked]" symptom. - docs/howto-diagrams-and-formats.md: same corrections in the image section, CI section, and troubleshooting. - README.md: docs/howto-diagrams-and-formats.md added to the Docs table (was unreachable from any entry-point doc). 
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * docs: apply Codex doc-review findings for v1.58.0.0 Cross-model doc review (Codex, read-only) checked the v1.58.0.0 docs against the shipped code. Fixes: - howto + make-pdf SKILL: diagram source is preserved base64 in a data-gstack-source attribute, not an HTML comment (-- in mermaid arrows would corrupt a comment); fences must start at column 0; fence options example gains page=portrait; --to html "zero network refs" qualified (--allow-network deliberately keeps remote tags). - /diagram description, README + docs/skills.md rows: the hand-drawn aesthetic belongs to the .excalidraw artifact; rendered SVG/PNG use mermaid's clean neutral theme (lib/diagram-render entry.ts pins theme: "neutral"). - CHANGELOG v1.58.0.0 wording: --strict coverage lists all five fatal classes (missing/remote/out-of-tree/oversized/non-regular); fences are vector SVG in pdf+html, 300dpi PNG in docx; hand-drawn claim scoped to the .excalidraw file. - lib/diagram-render/README: Page API table gains __downscaleRaster. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> --------- Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
394 lines
16 KiB
TypeScript
import { describe, test, expect, beforeAll, afterAll } from 'bun:test';
import { runSkillTest } from './helpers/session-runner';
import {
  ROOT, browseBin, runId, evalsEnabled,
  describeIfSelected, testConcurrentIfSelected,
  copyDirSync, setupBrowseShims, logCost, recordE2E,
  createEvalCollector, finalizeEvalCollector,
} from './helpers/e2e-helpers';
import { startTestServer } from '../browse/test/test-server';
import { spawnSync } from 'child_process';
import * as fs from 'fs';
import * as path from 'path';
import * as os from 'os';

const evalCollector = createEvalCollector('e2e-browse');

let testServer: ReturnType<typeof startTestServer>;
let tmpDir: string;

describeIfSelected('Skill E2E tests', [
  'browse-basic', 'browse-snapshot', 'skillmd-setup-discovery',
  'skillmd-no-local-binary', 'skillmd-outside-git', 'session-awareness',
  'operational-learning',
], () => {
  beforeAll(() => {
    testServer = startTestServer();
    tmpDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-'));
    setupBrowseShims(tmpDir);

    // Pre-warm the browse server so Chromium is already launched for tests.
    // In CI, Chromium can take 10-20s to launch (Docker + --no-sandbox).
    spawnSync(browseBin, ['goto', testServer.url], { cwd: tmpDir, timeout: 30000, stdio: 'pipe' });
  }, 45_000);

  afterAll(() => {
    testServer?.server?.stop();
    try { fs.rmSync(tmpDir, { recursive: true, force: true }); } catch {}
  });

  testConcurrentIfSelected('browse-basic', async () => {
    const result = await runSkillTest({
      prompt: `You have a browse binary at ${browseBin}. Assign it to B variable and run these commands in sequence:
1. $B goto ${testServer.url}
2. $B snapshot -i
3. $B text
4. $B screenshot /tmp/skill-e2e-test.png
Report the results of each command.`,
      workingDirectory: tmpDir,
      maxTurns: 7,
      timeout: 60_000,
      testName: 'browse-basic',
      runId,
    });

    logCost('browse basic', result);
    recordE2E(evalCollector, 'browse basic commands', 'Skill E2E tests', result);
    expect(result.browseErrors).toHaveLength(0);
    expect(result.exitReason).toBe('success');
  }, 90_000);

  testConcurrentIfSelected('browse-snapshot', async () => {
    const result = await runSkillTest({
      prompt: `You have a browse binary at ${browseBin}. Assign it to B variable and run:
1. $B goto ${testServer.url}
2. $B snapshot -i
3. $B snapshot -c
4. $B snapshot -D
5. $B snapshot -i -a -o /tmp/skill-e2e-annotated.png
Report what each command returned.`,
      workingDirectory: tmpDir,
      maxTurns: 9,
      timeout: 60_000,
      testName: 'browse-snapshot',
      runId,
    });

    logCost('browse snapshot', result);
    recordE2E(evalCollector, 'browse snapshot flags', 'Skill E2E tests', result);
    // browseErrors can include false positives from hallucinated paths (e.g. "baltimore" vs "bangalore")
    if (result.browseErrors.length > 0) {
      console.warn('Browse errors (non-fatal):', result.browseErrors);
    }
    expect(result.exitReason).toBe('success');
  }, 90_000);

  testConcurrentIfSelected('skillmd-setup-discovery', async () => {
    const skillMd = fs.readFileSync(path.join(ROOT, 'SKILL.md'), 'utf-8');
    const setupStart = skillMd.indexOf('## SETUP');
    const setupEnd = skillMd.indexOf('## IMPORTANT');
    const setupBlock = skillMd.slice(setupStart, setupEnd);

    // Guard: verify we extracted a valid setup block
    expect(setupBlock).toContain('browse/dist/browse');

    const result = await runSkillTest({
      prompt: `Follow these instructions to find the browse binary and run a basic command.

${setupBlock}

After finding the binary, run: $B goto ${testServer.url}
Then run: $B text
Report whether it worked.`,
      workingDirectory: tmpDir,
      maxTurns: 10,
      timeout: 60_000,
      testName: 'skillmd-setup-discovery',
      runId,
    });

    recordE2E(evalCollector, 'SKILL.md setup block discovery', 'Skill E2E tests', result);
    expect(result.browseErrors).toHaveLength(0);
    expect(result.exitReason).toBe('success');
  }, 90_000);

  testConcurrentIfSelected('skillmd-no-local-binary', async () => {
    // Create a tmpdir with no browse binary — no local .claude/skills/gstack/browse/dist/browse
    const emptyDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-empty-'));

    const skillMd = fs.readFileSync(path.join(ROOT, 'SKILL.md'), 'utf-8');
    const setupStart = skillMd.indexOf('## SETUP');
    const setupEnd = skillMd.indexOf('## IMPORTANT');
    const setupBlock = skillMd.slice(setupStart, setupEnd);

    const result = await runSkillTest({
      prompt: `Follow these instructions exactly. Run the bash code block below and report what it outputs.

${setupBlock}

Report the exact output. Do NOT try to fix or install anything — just report what you see.`,
      workingDirectory: emptyDir,
      maxTurns: 5,
      timeout: 30_000,
      testName: 'skillmd-no-local-binary',
      runId,
    });

    // Setup block should either find the global binary (READY) or show NEEDS_SETUP.
    // On dev machines with gstack installed globally, the fallback path
    // ~/.claude/skills/gstack/browse/dist/browse exists, so we get READY.
    // The important thing is it doesn't crash or give a confusing error.
    const allText = result.output || '';
    recordE2E(evalCollector, 'SKILL.md setup block (no local binary)', 'Skill E2E tests', result);
    expect(allText).toMatch(/READY|NEEDS_SETUP/);
    expect(result.exitReason).toBe('success');

    // Clean up
    try { fs.rmSync(emptyDir, { recursive: true, force: true }); } catch {}
  }, 60_000);

  testConcurrentIfSelected('skillmd-outside-git', async () => {
    // Create a tmpdir outside any git repo
    const nonGitDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-nogit-'));

    const skillMd = fs.readFileSync(path.join(ROOT, 'SKILL.md'), 'utf-8');
    const setupStart = skillMd.indexOf('## SETUP');
    const setupEnd = skillMd.indexOf('## IMPORTANT');
    const setupBlock = skillMd.slice(setupStart, setupEnd);

    const result = await runSkillTest({
      prompt: `Follow these instructions exactly. Run the bash code block below and report what it outputs.

${setupBlock}

Report the exact output — either "READY: <path>" or "NEEDS_SETUP".`,
      workingDirectory: nonGitDir,
      maxTurns: 5,
      timeout: 30_000,
      testName: 'skillmd-outside-git',
      runId,
    });

    // Should either find global binary (READY) or show NEEDS_SETUP — not crash
    const allText = result.output || '';
    recordE2E(evalCollector, 'SKILL.md outside git repo', 'Skill E2E tests', result);
    expect(allText).toMatch(/READY|NEEDS_SETUP/);

    // Clean up
    try { fs.rmSync(nonGitDir, { recursive: true, force: true }); } catch {}
  }, 60_000);

  testConcurrentIfSelected('operational-learning', async () => {
    const opDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-oplearn-'));
    const gstackHome = path.join(opDir, '.gstack-home');

    // Init git repo
    const run = (cmd: string, args: string[]) =>
      spawnSync(cmd, args, { cwd: opDir, stdio: 'pipe', timeout: 5000 });
    run('git', ['init', '-b', 'main']);
    run('git', ['config', 'user.email', 'test@test.com']);
    run('git', ['config', 'user.name', 'Test']);
    fs.writeFileSync(path.join(opDir, 'app.ts'), 'console.log("hello");\n');
    run('git', ['add', '.']);
    run('git', ['commit', '-m', 'initial']);

    // Copy bin scripts + the lib module they import. gstack-learnings-log
    // does `import ... from '$SCRIPT_DIR/../lib/jsonl-store.ts'` (v1.57.5.0
    // injection sanitization) — without lib/ alongside bin/, the script exits
    // 1 before writing anything, failing this test for a fixture reason, not
    // a model-behavior reason (root-caused during the v1.58.0.0 ship; fails
    // identically on main).
    const binDir = path.join(opDir, 'bin');
    fs.mkdirSync(binDir, { recursive: true });
    for (const script of ['gstack-learnings-log', 'gstack-slug']) {
      fs.copyFileSync(path.join(ROOT, 'bin', script), path.join(binDir, script));
      fs.chmodSync(path.join(binDir, script), 0o755);
    }
    const libDir = path.join(opDir, 'lib');
    fs.mkdirSync(libDir, { recursive: true });
    fs.copyFileSync(path.join(ROOT, 'lib', 'jsonl-store.ts'), path.join(libDir, 'jsonl-store.ts'));

    // gstack-learnings-log will create the project dir automatically via gstack-slug

    const result = await runSkillTest({
      prompt: `You just ran \`npm test\` in this project and it failed with this error:

Error: --experimental-vm-modules flag is required for ESM support in this project.
Run: npm test --experimental-vm-modules

Per the Operational Self-Improvement instructions below, log an operational learning about this failure.

## Operational Self-Improvement

Before completing, reflect on this session:
- Did any commands fail unexpectedly?

If yes, log an operational learning for future sessions:

\`\`\`bash
GSTACK_HOME="${gstackHome}" ${binDir}/gstack-learnings-log '{"skill":"qa","type":"operational","key":"SHORT_KEY","insight":"DESCRIPTION","confidence":N,"source":"observed"}'
\`\`\`

Replace SHORT_KEY with a kebab-case key like "esm-vm-modules-flag".
Replace DESCRIPTION with a one-sentence description of what you learned.
Replace N with a confidence score 1-10.

Log the operational learning now. Then say what you logged.`,
      workingDirectory: opDir,
      maxTurns: 5,
      timeout: 30_000,
      testName: 'operational-learning',
      runId,
    });

    logCost('operational learning', result);

    const exitOk = ['success', 'error_max_turns'].includes(result.exitReason);

    // Check if learnings file was created with an operational entry
    // The slug is derived from the git repo (dirname), so search all project dirs
    let hasOperational = false;
    const projectsDir = path.join(gstackHome, 'projects');
    if (fs.existsSync(projectsDir)) {
      for (const slug of fs.readdirSync(projectsDir)) {
        const lPath = path.join(projectsDir, slug, 'learnings.jsonl');
        if (fs.existsSync(lPath)) {
          const jsonl = fs.readFileSync(lPath, 'utf-8').trim();
          if (jsonl) {
            const entries = jsonl.split('\n').map(l => { try { return JSON.parse(l); } catch { return null; } }).filter(Boolean);
            const opEntry = entries.find(e => e.type === 'operational');
            if (opEntry) {
              hasOperational = true;
              console.log(`Operational learning logged: key="${opEntry.key}" insight="${opEntry.insight}" (slug: ${slug})`);
              break;
            }
          }
        }
      }
    }

    recordE2E(evalCollector, 'operational learning', 'Skill E2E tests', result, {
      passed: exitOk && hasOperational,
    });

    expect(exitOk).toBe(true);
    expect(hasOperational).toBe(true);

    // Clean up
    try { fs.rmSync(opDir, { recursive: true, force: true }); } catch {}
  }, 90_000);

  testConcurrentIfSelected('session-awareness', async () => {
    const sessionDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-session-'));

    // Set up a git repo so there's project/branch context to reference
    const run = (cmd: string, args: string[]) =>
      spawnSync(cmd, args, { cwd: sessionDir, stdio: 'pipe', timeout: 5000 });
    run('git', ['init', '-b', 'main']);
    run('git', ['config', 'user.email', 'test@test.com']);
    run('git', ['config', 'user.name', 'Test']);
    fs.writeFileSync(path.join(sessionDir, 'app.rb'), '# my app\n');
    run('git', ['add', '.']);
    run('git', ['commit', '-m', 'init']);
    run('git', ['checkout', '-b', 'feature/add-payments']);
    // Add a remote so the agent can derive a project name
    run('git', ['remote', 'add', 'origin', 'https://github.com/acme/billing-app.git']);

    // Extract AskUserQuestion format instructions from a generated SKILL.md.
    // ROOT/SKILL.md is the browse skill (Tier 1) and does NOT contain the
    // "## AskUserQuestion Format" section — that block is only emitted for
    // Tier 2+ skills by scripts/resolvers/preamble.ts. Use office-hours/SKILL.md
    // (Tier 3) which always has the format guidance baked in. Falls back to
    // the first SKILL.md that contains the header so a future template move
    // doesn't break this test again.
    let skillMdPath = path.join(ROOT, 'office-hours', 'SKILL.md');
    let skillMd = '';
    if (fs.existsSync(skillMdPath)) {
      skillMd = fs.readFileSync(skillMdPath, 'utf-8');
    }
    if (!skillMd.includes('## AskUserQuestion Format')) {
      // Fallback: scan top-level skill dirs for the first match.
      const skillDirs = fs.readdirSync(ROOT, { withFileTypes: true })
        .filter(d => d.isDirectory())
        .map(d => path.join(ROOT, d.name, 'SKILL.md'));
      for (const candidate of skillDirs) {
        if (!fs.existsSync(candidate)) continue;
        const content = fs.readFileSync(candidate, 'utf-8');
        if (content.includes('## AskUserQuestion Format')) {
          skillMd = content;
          skillMdPath = candidate;
          break;
        }
      }
    }
    const aqStart = skillMd.indexOf('## AskUserQuestion Format');
    const aqEnd = skillMd.indexOf('\n## ', aqStart + 1);
    const aqBlock = aqStart >= 0
      ? skillMd.slice(aqStart, aqEnd > 0 ? aqEnd : undefined)
      : '';

    const outputPath = path.join(sessionDir, 'question-output.md');

    const result = await runSkillTest({
      prompt: `You are running a gstack skill. The session preamble detected _SESSIONS=4 (the user has 4 gstack windows open).

${aqBlock}

You are on branch feature/add-payments in the billing-app project. You were reviewing a plan to add Stripe integration.

You've hit a decision point: the plan doesn't specify whether to use Stripe Checkout (hosted) or Stripe Elements (embedded). You need to ask the user which approach to use.

Since this is non-interactive, DO NOT actually call AskUserQuestion. Instead, write the EXACT text you would display to the user (the full AskUserQuestion content) to the file: ${outputPath}

Remember: _SESSIONS=4, so ELI16 mode is active. The user is juggling multiple windows and may not remember what this conversation is about. Re-ground them.`,
      workingDirectory: sessionDir,
      maxTurns: 8,
      timeout: 60_000,
      testName: 'session-awareness',
      runId,
    });

    logCost('session awareness', result);
    recordE2E(evalCollector, 'session awareness ELI16', 'Skill E2E tests', result);

    // Verify the output contains ELI16 re-grounding context
    if (fs.existsSync(outputPath)) {
      const output = fs.readFileSync(outputPath, 'utf-8');
      const lower = output.toLowerCase();
      // Must mention project name
      expect(lower.includes('billing') || lower.includes('acme')).toBe(true);
      // Must mention branch
      expect(lower.includes('payment') || lower.includes('feature')).toBe(true);
      // Must mention what we're working on
      expect(lower.includes('stripe') || lower.includes('checkout') || lower.includes('payment')).toBe(true);
      // Must have a recommendation or structured options
      expect(
        output.includes('RECOMMENDATION') ||
        lower.includes('recommend') ||
        lower.includes('option a') ||
        lower.includes('which do you want') ||
        lower.includes('which approach')
      ).toBe(true);
    } else {
      // Check agent output as fallback
      const output = result.output || '';
      const lowerOut = output.toLowerCase();
      expect(
        output.includes('RECOMMENDATION') ||
        lowerOut.includes('recommend') ||
        lowerOut.includes('option a') ||
        lowerOut.includes('which do you want') ||
        lowerOut.includes('which approach')
      ).toBe(true);
    }

    // Clean up
    try { fs.rmSync(sessionDir, { recursive: true, force: true }); } catch {}
  }, 90_000);
});

// Module-level afterAll — finalize eval collector after all tests complete
afterAll(async () => {
  await finalizeEvalCollector(evalCollector);
});
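Three of the tests above slice the setup block out of SKILL.md with the same `indexOf`/`slice` pair, and the session-awareness test uses a guarded variant that falls back to the end of the file. The following is a minimal sketch of how that extraction could be shared, not code from the repository: `extractSection` and the `sample` string are hypothetical names introduced only for illustration. It assumes the desired behavior is an empty string when the start header is missing and "rest of the document" when the end header is missing. Without a guard, `slice(start, -1)` would silently drop only the final character when the end header is absent.

```typescript
/**
 * Hypothetical helper (not part of the original test file).
 * Returns the text from `startHeader` up to, but not including, `endHeader`.
 * If `endHeader` is omitted, the section ends at the next "\n## " heading.
 * Returns '' when `startHeader` is absent; runs to end-of-document when the
 * end marker is absent.
 */
function extractSection(markdown: string, startHeader: string, endHeader?: string): string {
  const start = markdown.indexOf(startHeader);
  if (start < 0) return '';

  // Search for the end marker only after the start header itself.
  const searchFrom = start + startHeader.length;
  const end = endHeader !== undefined
    ? markdown.indexOf(endHeader, searchFrom)
    : markdown.indexOf('\n## ', searchFrom);

  // indexOf returns -1 when the marker is missing; passing undefined to
  // slice() means "to the end of the string".
  return markdown.slice(start, end >= 0 ? end : undefined);
}

// Illustrative input only; the real SKILL.md content is not reproduced here.
const sample = [
  '# Title',
  '## SETUP',
  'run browse/dist/browse',
  '## IMPORTANT',
  'notes',
].join('\n');

console.log(JSON.stringify(extractSection(sample, '## SETUP', '## IMPORTANT')));
console.log(JSON.stringify(extractSection(sample, '## SETUP')));
console.log(JSON.stringify(extractSection(sample, '## MISSING')));
```

With a helper along these lines, each test could keep its existing `expect(setupBlock).toContain('browse/dist/browse')` guard, so a renamed header surfaces as a fixture failure before any paid agent run starts.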