mirror of
https://github.com/garrytan/gstack.git
synced 2026-05-06 13:45:35 +02:00
2ac44c432c
The skill-llm-eval test "baseline score pinning" failed CI on three retry
attempts: judge gave command_reference.actionability=3, baseline demands
≥4. Judge cited 8 specific gaps in COMMAND_DESCRIPTIONS. This commit
closes 7 of 8 by tightening the descriptions:

- press: documents that key names are case-sensitive Playwright keys,
  shows modifier syntax (Shift+Enter, Control+A), links the full key
  list. Removes the "is this case-sensitive?" guesswork.
- is: documents that <sel> accepts either a CSS selector OR an @ref
  token from a prior snapshot, and that property values are
  case-sensitive.
- scroll: documents that there is no --by/--to amount option, points at
  `js window.scrollTo(0, N)` for pixel-precise scrolling.
- js / eval: clarifies that both run in the same JS sandbox, the
  difference is just inline expr (js) vs file (eval).
- storage: clarifies sessionStorage is read-only via this command,
  points at `js sessionStorage.setItem(...)` for the write path.
- chain: walks through how to invoke (pipe a JSON array of arrays to
  $B chain), confirms it stops at the first error.
- cdp: explains how to discover allowed methods (read cdp-allowlist.ts)
  + shows a concrete example invocation.
- domain-skill: explains that the "classifier flag" is set automatically
  by the L4 prompt-injection scan (agents do not set it manually);
  enumerates the full lifecycle verbs.

The 8th gap (storage set syntax conflict) is also resolved as part of
the storage rewrite.

Two pipe-character bugs caught by the existing `no command description
contains pipe character` guard at `test/gen-skill-docs.test.ts:595`: the
chain example originally used `echo '[...]' | $B chain` (literal pipe)
and the cdp description used `tab|browser` / `trusted|untrusted` (also
literal pipes). Both rewritten to keep markdown table cells intact.

Verification: 696/0 pass on skill-validation + gen-skill-docs after
regen across all hosts. The CI llm-judge eval will re-run against the
new SKILL.md and should hit actionability ≥4 reliably.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
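
For readers unfamiliar with the pipe-character guard mentioned above, the following is a minimal sketch of what such a check might look like. It is not the code at `test/gen-skill-docs.test.ts:595`; the `findPipeOffenders` helper and the sample `descriptions` map are hypothetical and exist only for illustration. The idea is that a literal `|` inside a description would be read as a column separator once the text lands in a markdown table cell.

```typescript
// Hypothetical stand-in for the real COMMAND_DESCRIPTIONS map. The entries
// are illustrative and not copied from the repository.
const descriptions: Record<string, string> = {
  press: "Press a key. Names are case-sensitive (Shift+Enter, Control+A).",
  cdp: "Run an allowed CDP method. Scope: tab|browser.",
};

// Return the names of commands whose description contains a literal pipe,
// sorted alphabetically so the report is stable across runs.
function findPipeOffenders(descs: Record<string, string>): string[] {
  return Object.entries(descs)
    .filter(([, text]) => text.includes("|"))
    .map(([name]) => name)
    .sort();
}

// With the sample map above, only the cdp entry is reported.
const offenders = findPipeOffenders(descriptions);
console.log(offenders);
```

A guard of this shape fails fast at doc-generation time, which is cheaper than finding a broken table row in the rendered SKILL.md.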