fix: enrich SKILL.md docs to pass LLM evals, upgrade judge to Sonnet 4.6 (#43)

* fix: enrich command descriptions and snapshot flags for LLM eval quality

14 command descriptions enriched with specific arg formats, valid values,
error behavior, and return types. Fixed header usage from <name> <value>
to <name>:<value>. Added cookie usage syntax. Snapshot flags now show
long names, ref numbering, and output format examples.

* refactor: auto-generate server.ts help text from COMMAND_DESCRIPTIONS

Replace hand-maintained help block with generateHelpText() that reads
from COMMAND_DESCRIPTIONS and SNAPSHOT_FLAGS. Eliminates help text
drift from source of truth.

* test: add usage consistency and pipe guard tests

Usage consistency test cross-checks Usage: patterns in implementation
against COMMAND_DESCRIPTIONS using structural skeleton comparison.
Pipe guard test ensures descriptions don't contain | which would break
markdown table rendering.

* chore: upgrade eval judge to Sonnet 4.6, update changelog

Switch LLM-as-judge evals from Haiku to Sonnet 4.6 for more stable,
nuanced scoring. Add changelog entry for all eval improvements.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
Garry Tan
2026-03-13 22:14:14 -07:00
committed by GitHub
parent 5205070299
commit a468374272
10 changed files with 233 additions and 112 deletions
+14 -3
View File
@@ -64,20 +64,31 @@ function generateSnapshotFlags(): string {
for (const flag of SNAPSHOT_FLAGS) {
const label = flag.valueHint ? `${flag.short} ${flag.valueHint}` : flag.short;
lines.push(`${label.padEnd(10)}${flag.description}`);
lines.push(`${label.padEnd(10)}${flag.long.padEnd(24)}${flag.description}`);
}
lines.push('```');
lines.push('');
lines.push('Combine flags: `$B snapshot -i -a -C -o /tmp/annotated.png`');
lines.push('All flags can be combined freely. `-o` only applies when `-a` is also used.');
lines.push('Example: `$B snapshot -i -a -C -o /tmp/annotated.png`');
lines.push('');
lines.push('After snapshot, use @refs everywhere:');
lines.push('**Ref numbering:** @e refs are assigned sequentially (@e1, @e2, ...) in tree order.');
lines.push('@c refs from `-C` are numbered separately (@c1, @c2, ...).');
lines.push('');
lines.push('After snapshot, use @refs as selectors in any command:');
lines.push('```bash');
lines.push('$B click @e3 $B fill @e4 "value" $B hover @e1');
lines.push('$B html @e2 $B css @e5 "color" $B attrs @e6');
lines.push('$B click @c1 # cursor-interactive ref (from -C)');
lines.push('```');
lines.push('');
lines.push('**Output format:** indented accessibility tree with @ref IDs, one element per line.');
lines.push('```');
lines.push(' @e1 [heading] "Welcome" [level=1]');
lines.push(' @e2 [textbox] "Email"');
lines.push(' @e3 [button] "Submit"');
lines.push('```');
lines.push('');
lines.push('Refs are invalidated on navigation — run `snapshot` again after `goto`.');
return lines.join('\n');