feat: SKILL.md template system, 3-tier testing, DX tools (v0.3.3) (#41)

* refactor: extract command registry to commands.ts, add SNAPSHOT_FLAGS metadata

  - NEW: browse/src/commands.ts — command sets + COMMAND_DESCRIPTIONS + load-time validation (zero side effects)
  - server.ts imports from commands.ts instead of declaring sets inline
  - snapshot.ts: SNAPSHOT_FLAGS array drives parseSnapshotArgs (metadata-driven, no duplication)
  - All 186 existing tests pass

* feat: SKILL.md template system with auto-generated command references

  - SKILL.md.tmpl + browse/SKILL.md.tmpl with {{COMMAND_REFERENCE}} and {{SNAPSHOT_FLAGS}} placeholders
  - scripts/gen-skill-docs.ts generates SKILL.md from templates (supports --dry-run)
  - Build pipeline runs gen:skill-docs before binary compilation
  - Generated files have AUTO-GENERATED header, committed to git

* test: Tier 1 static validation — 34 tests for SKILL.md command correctness

  - test/helpers/skill-parser.ts: extracts $B commands from code blocks, validates against registry
  - test/skill-parser.test.ts: 13 parser/validator unit tests
  - test/skill-validation.test.ts: 13 tests validating all SKILL.md files + registry consistency
  - test/gen-skill-docs.test.ts: 8 generator tests (categories, sorting, freshness)

* feat: DX tools (skill:check, dev:skill) + Tier 2 E2E test scaffolding

  - scripts/skill-check.ts: health summary for all SKILL.md files (commands, templates, freshness)
  - scripts/dev-skill.ts: watch mode for template development
  - test/helpers/session-runner.ts: Agent SDK wrapper for E2E skill tests
  - test/skill-e2e.test.ts: 2 E2E tests + 3 stubs (auto-skip inside Claude Code sessions)
  - E2E tests must run from plain terminal: SKILL_E2E=1 bun test test/skill-e2e.test.ts

* ci: SKILL.md freshness check on push/PR + TODO updates

  - .github/workflows/skill-docs.yml: fails if generated SKILL.md files are stale
  - TODO.md: add E2E cost tracking and model pinning to future ideas

* fix: restore rich descriptions lost in auto-generation

  - Snapshot flags: add back value hints (-d <N>, -s <sel>, -o <path>)
  - Snapshot flags: restore parenthetical context (@e refs, @c refs, etc.)
  - Commands: is → includes valid states enum
  - Commands: console → notes --errors filter behavior
  - Commands: press → lists common keys (Enter, Tab, Escape)
  - Commands: cookie-import-browser → describes picker UI
  - Commands: dialog-accept → specifies alert/confirm/prompt
  - Tips: restore → arrow (was downgraded to ->)

* test: quality evals for generated SKILL.md descriptions

  Catches the exact regressions we shipped and caught in review:

  - Snapshot flags must include value hints (-d <N>, -s <sel>, -o <path>)
  - is command must list all valid states (visible/hidden/enabled/...)
  - press command must list example keys (Enter, Tab, Escape)
  - console command must describe --errors behavior
  - Snapshot -i must mention @e refs, -C must mention @c refs
  - All descriptions must be >= 8 chars (no empty stubs)
  - Tips section must use → not ->

* feat: LLM-as-judge evals for SKILL.md documentation quality

  4 eval tests using Anthropic API (claude-haiku, ~$0.01-0.03/run):

  - Command reference table: clarity/completeness/actionability >= 4/5
  - Snapshot flags section: same thresholds
  - browse/SKILL.md overall quality
  - Regression: generated version must score >= hand-maintained baseline

  Requires ANTHROPIC_API_KEY. Auto-skips without it.
  Run: bun run test:eval (or ANTHROPIC_API_KEY=sk-... bun test test/skill-llm-eval.test.ts)

* chore: bump version to 0.3.3, update changelog

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: add ARCHITECTURE.md, update CLAUDE.md and CONTRIBUTING.md

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: conductor.json lifecycle hooks + .env propagation across worktrees

  bin/dev-setup now copies .env from main worktree so API keys carry over to Conductor workspaces automatically. conductor.json wires up setup and archive hooks.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: complete CHANGELOG for v0.3.3 (architecture, conductor, .env)

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-06-28 20:50:05 +02:00 · 2026-03-13 21:08:12 -07:00
parent ea0c0dad5e
commit 5205070299
29 changed files with 2479 additions and 135 deletions
@@ -60,22 +60,123 @@ bun run build
 bin/dev-teardown
 ```

-## Running tests
+## Testing & evals
+
+### Setup

 ```bash
-bun test                     # all tests (browse integration + snapshot)
-bun run dev <cmd>            # run CLI in dev mode, e.g. bun run dev goto https://example.com
-bun run build                # compile binary to browse/dist/browse
+# 1. Copy .env.example and add your API key
+cp .env.example .env
+# Edit .env → set ANTHROPIC_API_KEY=sk-ant-...
+
+# 2. Install deps (if you haven't already)
+bun install
 ```

+Bun auto-loads `.env` — no extra config. Conductor workspaces inherit `.env` from the main worktree automatically (see "Conductor workspaces" below).
+
+### Test tiers
+
+| Tier | Command | Cost | What it tests |
+|------|---------|------|---------------|
+| 1 — Static | `bun test` | Free | Command validation, snapshot flags, SKILL.md correctness |
+| 2 — E2E | `bun run test:e2e` | ~$0.50 | Full skill execution via Agent SDK |
+| 3 — LLM eval | `bun run test:eval` | ~$0.03 | Doc quality scoring via LLM-as-judge |
+
+```bash
+bun test                     # Tier 1 only (runs on every commit, <5s)
+bun run test:eval            # Tier 3: LLM-as-judge (needs ANTHROPIC_API_KEY in .env)
+bun run test:e2e             # Tier 2: E2E (needs SKILL_E2E=1, can't run inside Claude Code)
+bun run test:all             # Tier 1 + Tier 2
+```
+
+### Tier 1: Static validation (free)
+
+Runs automatically with `bun test`. No API keys needed.
+
+- **Skill parser tests** (`test/skill-parser.test.ts`) — Extracts every `$B` command from SKILL.md bash code blocks and validates against the command registry in `browse/src/commands.ts`. Catches typos, removed commands, and invalid snapshot flags.
+- **Skill validation tests** (`test/skill-validation.test.ts`) — Validates that SKILL.md files reference only real commands and flags, and that command descriptions meet quality thresholds.
+- **Generator tests** (`test/gen-skill-docs.test.ts`) — Tests the template system: verifies that placeholders resolve correctly, that output includes value hints for flags (e.g. `-d <N>`, not just `-d`), and that key commands have enriched descriptions (e.g. `is` lists valid states, `press` lists key examples).
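As a rough illustration of the parser check described in the Tier 1 list above, the sketch below pulls `$B` invocations out of bash code blocks and reports any that are missing from a registry. `KNOWN_COMMANDS`, the `SkillCommand` shape, and both function names are hypothetical stand-ins chosen for this example; they are not the actual contents of `test/helpers/skill-parser.ts` or `browse/src/commands.ts`.

```typescript
// Hypothetical registry for illustration; the real one lives in browse/src/commands.ts.
const KNOWN_COMMANDS = new Set(["goto", "snapshot", "click", "press", "is"]);

// Three backticks, built at runtime so this sample can sit inside a fenced block.
const FENCE = "`".repeat(3);

interface SkillCommand {
  command: string;
  args: string[];
  line: number; // 1-based line number within the Markdown source
}

/** Collect every "$B <command> ..." invocation found inside bash code fences. */
function extractBrowseCommands(markdown: string): SkillCommand[] {
  const found: SkillCommand[] = [];
  let inBash = false;
  markdown.split("\n").forEach((raw, index) => {
    const line = raw.trim();
    if (line.startsWith(FENCE)) {
      // A fence tagged "bash" opens a block we care about; any other fence closes it.
      inBash = !inBash && line.slice(FENCE.length).trim() === "bash";
      return;
    }
    if (!inBash) return;
    const match = line.match(/^\$B\s+(\S+)(.*)$/);
    if (match) {
      const args = match[2].trim().split(/\s+/).filter(Boolean);
      found.push({ command: match[1], args, line: index + 1 });
    }
  });
  return found;
}

/** Return the invocations whose command name is not in the registry. */
function findUnknownCommands(markdown: string, registry: Set<string>): SkillCommand[] {
  return extractBrowseCommands(markdown).filter((c) => !registry.has(c.command));
}
```

Reporting the line number alongside each unknown command keeps a failure message actionable: a typo such as `$B clik` points straight at the offending line in the template.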
+
+### Tier 2: E2E via Agent SDK (~$0.50/run)
+
+Spawns a real Claude Code session, invokes `/qa` or `/browse`, and scans tool results for errors. This is the closest thing to "does this skill actually work end-to-end?"
+
+```bash
+# Must run from a plain terminal — can't nest inside Claude Code or Conductor
+SKILL_E2E=1 bun test test/skill-e2e.test.ts
+```
+
+- Gated by `SKILL_E2E=1` env var (prevents accidental expensive runs)
+- Auto-skips if it detects it's running inside Claude Code (Agent SDK can't nest)
+- Saves full conversation transcripts on failure for debugging
+- Tests live in `test/skill-e2e.test.ts`, runner logic in `test/helpers/session-runner.ts`
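The gating rules in the list above can be sketched as a small pure function. This is a minimal sketch, not the logic in `test/skill-e2e.test.ts`: the function name, the `GateDecision` shape, and in particular the use of a `CLAUDECODE` environment variable to detect a nested session are assumptions made for illustration.

```typescript
type Env = Record<string, string | undefined>;

interface GateDecision {
  run: boolean;
  reason: string;
}

/** Decide whether the expensive E2E tier should run in this environment. */
function shouldRunE2E(env: Env): GateDecision {
  // Explicit opt-in prevents accidental paid runs.
  if (env["SKILL_E2E"] !== "1") {
    return { run: false, reason: "SKILL_E2E=1 is not set" };
  }
  // Assumption: a nested Claude Code session is detectable via this variable.
  if (env["CLAUDECODE"]) {
    return { run: false, reason: "running inside Claude Code (sessions cannot nest)" };
  }
  return { run: true, reason: "opted in from a plain terminal" };
}
```

In a Bun test file, a decision like this could feed something along the lines of `test.skipIf(!shouldRunE2E(process.env).run)`, so skipped runs show up as skipped rather than silently passing.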
+
+### Tier 3: LLM-as-judge (~$0.03/run)
+
+Uses Claude Haiku to score generated SKILL.md docs on three dimensions:
+
+- **Clarity** — Can an AI agent understand the instructions without ambiguity?
+- **Completeness** — Are all commands, flags, and usage patterns documented?
+- **Actionability** — Can the agent execute tasks using only the information in the doc?
+
+Each dimension is scored 1-5. Threshold: every dimension must score **≥ 4**. There's also a regression test that compares generated docs against the hand-maintained baseline from `origin/main` — generated must score equal or higher.
+
+```bash
+# Needs ANTHROPIC_API_KEY in .env
+bun run test:eval
+```
+
+- Uses `claude-haiku-4-5` for cost efficiency
+- Tests live in `test/skill-llm-eval.test.ts`
+- Calls the Anthropic API directly (not Agent SDK), so it works from anywhere including inside Claude Code
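The pass/fail rules for Tier 3 can be expressed without any API call once the judge has returned its scores. The sketch below assumes the scores arrive as one number per dimension and that the regression test compares dimension by dimension; the source only says generated docs must score "equal or higher", so that comparison rule, like the type and function names, is an assumption.

```typescript
type Dimension = "clarity" | "completeness" | "actionability";
type Scores = Record<Dimension, number>;

const DIMENSIONS: Dimension[] = ["clarity", "completeness", "actionability"];
const MIN_SCORE = 4; // every dimension must score at least this on the 1-5 scale

/** Dimensions that fall below the threshold; empty means the doc passes. */
function failingDimensions(scores: Scores, min: number = MIN_SCORE): Dimension[] {
  return DIMENSIONS.filter((d) => scores[d] < min);
}

/** Dimensions where the generated doc scores lower than the baseline (assumed per-dimension rule). */
function regressedDimensions(generated: Scores, baseline: Scores): Dimension[] {
  return DIMENSIONS.filter((d) => generated[d] < baseline[d]);
}
```

Returning the list of failing dimensions, rather than a bare boolean, lets a test failure say which quality slipped.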
+
+### CI
+
+A GitHub Action (`.github/workflows/skill-docs.yml`) runs `bun run gen:skill-docs --dry-run` on every push and PR. If the generated SKILL.md files differ from what's committed, CI fails. This catches stale docs before they merge.
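Conceptually, the freshness check renders each template and compares the result with the committed file. The sketch below shows that idea in isolation; `renderTemplate` and `isStale` are hypothetical names, and the real `scripts/gen-skill-docs.ts` may resolve placeholders and report differences differently.

```typescript
/** Replace {{NAME}} placeholders, failing loudly on any name without a value. */
function renderTemplate(template: string, values: Record<string, string>): string {
  return template.replace(/\{\{([A-Z_]+)\}\}/g, (whole: string, name: string) => {
    const value: string | undefined = values[name];
    if (value === undefined) {
      throw new Error(`Unresolved placeholder: ${whole}`);
    }
    return value;
  });
}

/** True when the committed file no longer matches what the template would generate. */
function isStale(template: string, values: Record<string, string>, committed: string): boolean {
  return renderTemplate(template, values) !== committed;
}
```

Throwing on an unresolved placeholder means a misspelled `{{COMMAND_REFERENCE}}` fails generation instead of leaking raw braces into the published doc.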
+
 Tests run against the browse binary directly — they don't require dev mode.

+## Editing SKILL.md files
+
+SKILL.md files are **generated** from `.tmpl` templates. Don't edit the `.md` directly — your changes will be overwritten on the next build.
+
+```bash
+# 1. Edit the template
+vim SKILL.md.tmpl              # or browse/SKILL.md.tmpl
+
+# 2. Regenerate
+bun run gen:skill-docs
+
+# 3. Check health
+bun run skill:check
+
+# Or use watch mode — auto-regenerates on save
+bun run dev:skill
+```
+
+To add a browse command, add it to `browse/src/commands.ts`. To add a snapshot flag, add it to `SNAPSHOT_FLAGS` in `browse/src/snapshot.ts`. Then rebuild.
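Adding a flag in one place works because a single metadata array can drive both argument parsing and the generated docs. The sketch below illustrates that pattern only: the `FlagSpec` shape, the `key` names, the description wording, and both functions are hypothetical, and the real `SNAPSHOT_FLAGS` entries in `browse/src/snapshot.ts` may look different.

```typescript
interface FlagSpec {
  short: string; // e.g. "-d"
  key: string; // property name in the parsed result (hypothetical)
  valueHint?: string; // set only for flags that take a value, e.g. "<N>"
  description: string; // placeholder wording, not the real doc text
}

// Illustrative entries only; not the actual SNAPSHOT_FLAGS array.
const EXAMPLE_FLAGS: FlagSpec[] = [
  { short: "-i", key: "interactive", description: "example boolean flag" },
  { short: "-d", key: "depth", valueHint: "<N>", description: "example numeric flag" },
  { short: "-o", key: "output", valueHint: "<path>", description: "example path flag" },
];

/** Parse CLI args using only the metadata: no per-flag branches to keep in sync. */
function parseFlags(args: string[], specs: FlagSpec[]): Record<string, string | boolean> {
  const parsed: Record<string, string | boolean> = {};
  for (let i = 0; i < args.length; i++) {
    const spec = specs.find((s) => s.short === args[i]);
    if (!spec) throw new Error(`Unknown flag: ${args[i]}`);
    if (spec.valueHint) {
      const value: string | undefined = args[i + 1];
      if (value === undefined) throw new Error(`${spec.short} expects ${spec.valueHint}`);
      parsed[spec.key] = value;
      i++; // skip the value we just consumed
    } else {
      parsed[spec.key] = true;
    }
  }
  return parsed;
}

/** Render one doc line per flag from the same metadata, including value hints. */
function renderFlagDocs(specs: FlagSpec[]): string[] {
  return specs.map((s) => `${s.short}${s.valueHint ? " " + s.valueHint : ""}: ${s.description}`);
}
```

Because the doc lines and the parser read the same array, a new entry shows up in both after a rebuild, and the value hint (`-d <N>` rather than a bare `-d`) cannot drift from what the parser accepts.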
+
+## Conductor workspaces
+
+If you're using [Conductor](https://conductor.build) to run multiple Claude Code sessions in parallel, `conductor.json` wires up workspace lifecycle automatically:
+
+| Hook | Script | What it does |
+|------|--------|-------------|
+| `setup` | `bin/dev-setup` | Copies `.env` from main worktree, installs deps, symlinks skills |
+| `archive` | `bin/dev-teardown` | Removes skill symlinks, cleans up `.claude/` directory |
+
+When Conductor creates a new workspace, `bin/dev-setup` runs automatically. It detects the main worktree (via `git worktree list`), copies your `.env` so API keys carry over, and sets up dev mode — no manual steps needed.
+
+**First-time setup:** Put your `ANTHROPIC_API_KEY` in `.env` in the main repo (see `.env.example`). Every Conductor workspace inherits it automatically.
+
 ## Things to know

- **SKILL.md changes are instant.** They're just Markdown. Edit, save, invoke.
+- **SKILL.md files are generated.** Edit the `.tmpl` template, not the `.md`. Run `bun run gen:skill-docs` to regenerate.
 - **Browse source changes need a rebuild.** If you touch `browse/src/*.ts`, run `bun run build`.
 - **Dev mode shadows your global install.** Project-local skills take priority over `~/.claude/skills/gstack`. `bin/dev-teardown` restores the global one.
- **Conductor workspaces are independent.** Each workspace is its own clone. Run `bin/dev-setup` in the one you're working in.
+- **Conductor workspaces are independent.** Each workspace is its own git worktree. `bin/dev-setup` runs automatically via `conductor.json`.
+- **`.env` propagates across worktrees.** Set it once in the main repo and every Conductor workspace gets it.
 - **`.claude/skills/` is gitignored.** The symlinks never get committed.

 ## Testing a branch in another repo