From 2ac44c432c9e7fe83d6820e59cf3971e334f88b8 Mon Sep 17 00:00:00 2001 From: Garry Tan Date: Tue, 28 Apr 2026 02:16:53 -0700 Subject: [PATCH] fix(commands): tighten descriptions for LLM-judge baseline pinning MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The skill-llm-eval test "baseline score pinning" failed CI on three retry attempts: judge gave command_reference.actionability=3, baseline demands ≥4. Judge cited 8 specific gaps in COMMAND_DESCRIPTIONS. This commit closes 7 of 8 by tightening the descriptions: - press: documents that key names are case-sensitive Playwright keys, shows modifier syntax (Shift+Enter, Control+A), links the full key list. Removes the "is this case-sensitive?" guesswork. - is: documents that accepts either a CSS selector OR an @ref token from a prior snapshot, and that property values are case- sensitive. - scroll: documents that there is no --by/--to amount option, points at `js window.scrollTo(0, N)` for pixel-precise scrolling. - js / eval: clarifies that both run in the same JS sandbox, the difference is just inline expr (js) vs file (eval). - storage: clarifies sessionStorage is read-only via this command, points at `js sessionStorage.setItem(...)` for the write path. - chain: walks through how to invoke (pipe a JSON array of arrays to $B chain), confirms it stops at the first error. - cdp: explains how to discover allowed methods (read cdp-allowlist.ts) + shows a concrete example invocation. - domain-skill: explains that the "classifier flag" is set automatically by the L4 prompt-injection scan (agents do not set it manually); enumerates the full lifecycle verbs. The 8th gap (storage set syntax conflict) is also resolved as part of the storage rewrite. Two pipe-character bugs caught by the existing `no command description contains pipe character` guard at `test/gen-skill-docs.test.ts:595`: the chain example originally used `echo '[...]' | $B chain` (literal pipe) and the cdp description used `tab|browser` / `trusted|untrusted` (also literal pipes). Both rewritten to keep markdown table cells intact. Verification: 696/0 pass on skill-validation + gen-skill-docs after regen across all hosts. The CI llm-judge eval will re-run against the new SKILL.md and should hit actionability ≥4 reliably. Co-Authored-By: Claude Opus 4.7 (1M context) --- SKILL.md | 18 +++++++++--------- browse/SKILL.md | 18 +++++++++--------- browse/src/commands.ts | 18 +++++++++--------- 3 files changed, 27 insertions(+), 27 deletions(-) diff --git a/SKILL.md b/SKILL.md index 83e512ea..4269f6f4 100644 --- a/SKILL.md +++ b/SKILL.md @@ -825,8 +825,8 @@ Refs are invalidated on navigation — run `snapshot` again after `goto`. | `fill ` | Fill input | | `header :` | Set custom request header (colon-separated, sensitive values auto-redacted) | | `hover ` | Hover element | -| `press ` | Press key — Enter, Tab, Escape, ArrowUp/Down/Left/Right, Backspace, Delete, Home, End, PageUp, PageDown, or modifiers like Shift+Enter | -| `scroll [sel]` | Scroll element into view, or scroll to page bottom if no selector | +| `press ` | Press a Playwright keyboard key against the focused element. Names are case-sensitive: Enter, Tab, Escape, ArrowUp/Down/Left/Right, Backspace, Delete, Home, End, PageUp, PageDown. Modifiers combine with +: Shift+Enter, Control+A, Meta+K. Single printable chars (a, A, 1) work too. Full key list: https://playwright.dev/docs/api/class-keyboard#keyboard-press | +| `scroll [sel|@ref]` | With a selector, smooth-scrolls the element into view. Without a selector, jumps to page bottom. No --by/--to amount option; for pixel-precise scrolling use `js window.scrollTo(0, N)`. | | `select ` | Select dropdown option by value, label, or visible text | | `style | style --undo [N]` | Modify CSS property on element (with undo support) | | `type ` | Type into focused element | @@ -839,18 +839,18 @@ Refs are invalidated on navigation — run `snapshot` again after `goto`. | Command | Description | |---------|-------------| | `attrs ` | Element attributes as JSON | -| `cdp [json-params]` | Raw CDP method dispatch (deny-default; allowlist in cdp-allowlist.ts). Output through UNTRUSTED envelope when method is data-exfil. | +| `cdp [json-params]` | Raw Chrome DevTools Protocol method dispatch. Deny-default: only methods enumerated in `browse/src/cdp-allowlist.ts` (CDP_ALLOWLIST const) are reachable; any other method 403s. Each allowlist entry declares scope (tab vs browser) and output (trusted vs untrusted) — untrusted methods (data-exfil-shaped, e.g. Network.getResponseBody) get UNTRUSTED-envelope wrapped output. To discover allowed methods: read `browse/src/cdp-allowlist.ts`. Example: `$B cdp Page.getLayoutMetrics`. | | `console [--clear|--errors]` | Console messages (--errors filters to error/warning) | | `cookies` | All cookies as JSON | | `css ` | Computed CSS value | | `dialog [--clear]` | Dialog messages | -| `eval ` | Run JavaScript from file and return result as string (path must be under /tmp or cwd) | +| `eval ` | Run JavaScript from a file in the page context and return result as string. Path must resolve under /tmp or cwd (no traversal). Use eval for multi-line scripts; use js for one-liners. | | `inspect [selector] [--all] [--history]` | Deep CSS inspection via CDP — full rule cascade, box model, computed styles | -| `is ` | State check (visible/hidden/enabled/disabled/checked/editable/focused) | -| `js ` | Run JavaScript expression and return result as string | +| `is ` | State check on element. Valid values: visible, hidden, enabled, disabled, checked, editable, focused (case-sensitive). accepts a CSS selector OR an @ref token from a prior snapshot (e.g. @e3, @c1) — refs are interchangeable with selectors anywhere a selector is expected. | +| `js ` | Run inline JavaScript expression in the page context and return result as string. Same JS sandbox as eval; the only difference is js takes an inline expr while eval reads from a file. | | `network [--clear]` | Network requests | | `perf` | Page load timings | -| `storage [set k v]` | Read all localStorage + sessionStorage as JSON, or set to write localStorage | +| `storage | storage set ` | Read both localStorage and sessionStorage as JSON. With "set ", write to localStorage only (sessionStorage is read-only via this command — set it with `js sessionStorage.setItem(...)`). | | `ux-audit` | Extract page structure for UX behavioral analysis — site ID, nav, headings, text blocks, interactive elements. Returns JSON for agent interpretation. | ### Visual @@ -870,8 +870,8 @@ Refs are invalidated on navigation — run `snapshot` again after `goto`. ### Meta | Command | Description | |---------|-------------| -| `chain` | Run commands from JSON stdin. Format: [["cmd","arg1",...],...] | -| `domain-skill save|list|show|edit|promote-to-global|rollback|rm ` | Per-site notes (host derived from active tab). Quarantined → active after N=3 uses without classifier flag → global by explicit promote. | +| `chain (JSON via stdin)` | Run a sequence of commands from JSON on stdin. One JSON array of arrays, each inner array is [cmd, ...args]. Output is one JSON result per command. Pipe a JSON array (e.g. `[["goto","https://example.com"],["text","h1"]]`) to `$B chain` and it runs the goto then the text command in order. Stops at the first error. | +| `domain-skill save|list|show|edit|promote-to-global|rollback|rm ` | Per-site notes the agent writes for itself. Host is derived from the active tab. Lifecycle: `save` adds a quarantined note → after N=3 successful uses without the prompt-injection classifier flagging it, the note auto-promotes to "active" → `promote-to-global` lifts it to the global tier (machine-wide, all projects). The classifier flag is set automatically by the L4 prompt-injection scan; agents do not set it manually. Use `list` / `show` to inspect, `edit` to revise, `rollback` to demote, `rm` to tombstone. | | `frame ` | Switch to iframe context (or main to return) | | `inbox [--clear]` | List messages from sidebar scout inbox | | `skill list|show|run|test|rm [--arg k=v]... [--timeout=Ns]` | Run a browser-skill: deterministic Playwright script that drives the daemon over loopback HTTP. 3-tier lookup (project > global > bundled). Spawned scripts get a per-spawn scoped token (read+write only) — never the daemon root token. | diff --git a/browse/SKILL.md b/browse/SKILL.md index 4f1232ed..22c27081 100644 --- a/browse/SKILL.md +++ b/browse/SKILL.md @@ -749,8 +749,8 @@ $B prettyscreenshot --cleanup --scroll-to ".pricing" --width 1440 ~/Desktop/hero | `fill ` | Fill input | | `header :` | Set custom request header (colon-separated, sensitive values auto-redacted) | | `hover ` | Hover element | -| `press ` | Press key — Enter, Tab, Escape, ArrowUp/Down/Left/Right, Backspace, Delete, Home, End, PageUp, PageDown, or modifiers like Shift+Enter | -| `scroll [sel]` | Scroll element into view, or scroll to page bottom if no selector | +| `press ` | Press a Playwright keyboard key against the focused element. Names are case-sensitive: Enter, Tab, Escape, ArrowUp/Down/Left/Right, Backspace, Delete, Home, End, PageUp, PageDown. Modifiers combine with +: Shift+Enter, Control+A, Meta+K. Single printable chars (a, A, 1) work too. Full key list: https://playwright.dev/docs/api/class-keyboard#keyboard-press | +| `scroll [sel|@ref]` | With a selector, smooth-scrolls the element into view. Without a selector, jumps to page bottom. No --by/--to amount option; for pixel-precise scrolling use `js window.scrollTo(0, N)`. | | `select ` | Select dropdown option by value, label, or visible text | | `style | style --undo [N]` | Modify CSS property on element (with undo support) | | `type ` | Type into focused element | @@ -763,18 +763,18 @@ $B prettyscreenshot --cleanup --scroll-to ".pricing" --width 1440 ~/Desktop/hero | Command | Description | |---------|-------------| | `attrs ` | Element attributes as JSON | -| `cdp [json-params]` | Raw CDP method dispatch (deny-default; allowlist in cdp-allowlist.ts). Output through UNTRUSTED envelope when method is data-exfil. | +| `cdp [json-params]` | Raw Chrome DevTools Protocol method dispatch. Deny-default: only methods enumerated in `browse/src/cdp-allowlist.ts` (CDP_ALLOWLIST const) are reachable; any other method 403s. Each allowlist entry declares scope (tab vs browser) and output (trusted vs untrusted) — untrusted methods (data-exfil-shaped, e.g. Network.getResponseBody) get UNTRUSTED-envelope wrapped output. To discover allowed methods: read `browse/src/cdp-allowlist.ts`. Example: `$B cdp Page.getLayoutMetrics`. | | `console [--clear|--errors]` | Console messages (--errors filters to error/warning) | | `cookies` | All cookies as JSON | | `css ` | Computed CSS value | | `dialog [--clear]` | Dialog messages | -| `eval ` | Run JavaScript from file and return result as string (path must be under /tmp or cwd) | +| `eval ` | Run JavaScript from a file in the page context and return result as string. Path must resolve under /tmp or cwd (no traversal). Use eval for multi-line scripts; use js for one-liners. | | `inspect [selector] [--all] [--history]` | Deep CSS inspection via CDP — full rule cascade, box model, computed styles | -| `is ` | State check (visible/hidden/enabled/disabled/checked/editable/focused) | -| `js ` | Run JavaScript expression and return result as string | +| `is ` | State check on element. Valid values: visible, hidden, enabled, disabled, checked, editable, focused (case-sensitive). accepts a CSS selector OR an @ref token from a prior snapshot (e.g. @e3, @c1) — refs are interchangeable with selectors anywhere a selector is expected. | +| `js ` | Run inline JavaScript expression in the page context and return result as string. Same JS sandbox as eval; the only difference is js takes an inline expr while eval reads from a file. | | `network [--clear]` | Network requests | | `perf` | Page load timings | -| `storage [set k v]` | Read all localStorage + sessionStorage as JSON, or set to write localStorage | +| `storage | storage set ` | Read both localStorage and sessionStorage as JSON. With "set ", write to localStorage only (sessionStorage is read-only via this command — set it with `js sessionStorage.setItem(...)`). | | `ux-audit` | Extract page structure for UX behavioral analysis — site ID, nav, headings, text blocks, interactive elements. Returns JSON for agent interpretation. | ### Visual @@ -794,8 +794,8 @@ $B prettyscreenshot --cleanup --scroll-to ".pricing" --width 1440 ~/Desktop/hero ### Meta | Command | Description | |---------|-------------| -| `chain` | Run commands from JSON stdin. Format: [["cmd","arg1",...],...] | -| `domain-skill save|list|show|edit|promote-to-global|rollback|rm ` | Per-site notes (host derived from active tab). Quarantined → active after N=3 uses without classifier flag → global by explicit promote. | +| `chain (JSON via stdin)` | Run a sequence of commands from JSON on stdin. One JSON array of arrays, each inner array is [cmd, ...args]. Output is one JSON result per command. Pipe a JSON array (e.g. `[["goto","https://example.com"],["text","h1"]]`) to `$B chain` and it runs the goto then the text command in order. Stops at the first error. | +| `domain-skill save|list|show|edit|promote-to-global|rollback|rm ` | Per-site notes the agent writes for itself. Host is derived from the active tab. Lifecycle: `save` adds a quarantined note → after N=3 successful uses without the prompt-injection classifier flagging it, the note auto-promotes to "active" → `promote-to-global` lifts it to the global tier (machine-wide, all projects). The classifier flag is set automatically by the L4 prompt-injection scan; agents do not set it manually. Use `list` / `show` to inspect, `edit` to revise, `rollback` to demote, `rm` to tombstone. | | `frame ` | Switch to iframe context (or main to return) | | `inbox [--clear]` | List messages from sidebar scout inbox | | `skill list|show|run|test|rm [--arg k=v]... [--timeout=Ns]` | Run a browser-skill: deterministic Playwright script that drives the daemon over loopback HTTP. 3-tier lookup (project > global > bundled). Spawned scripts get a per-spawn scoped token (read+write only) — never the daemon root token. | diff --git a/browse/src/commands.ts b/browse/src/commands.ts index c1668025..493c19ea 100644 --- a/browse/src/commands.ts +++ b/browse/src/commands.ts @@ -104,16 +104,16 @@ export const COMMAND_DESCRIPTIONS: Record' }, - 'eval': { category: 'Inspection', description: 'Run JavaScript from file and return result as string (path must be under /tmp or cwd)', usage: 'eval ' }, + 'js': { category: 'Inspection', description: 'Run inline JavaScript expression in the page context and return result as string. Same JS sandbox as eval; the only difference is js takes an inline expr while eval reads from a file.', usage: 'js ' }, + 'eval': { category: 'Inspection', description: 'Run JavaScript from a file in the page context and return result as string. Path must resolve under /tmp or cwd (no traversal). Use eval for multi-line scripts; use js for one-liners.', usage: 'eval ' }, 'css': { category: 'Inspection', description: 'Computed CSS value', usage: 'css ' }, 'attrs': { category: 'Inspection', description: 'Element attributes as JSON', usage: 'attrs ' }, - 'is': { category: 'Inspection', description: 'State check (visible/hidden/enabled/disabled/checked/editable/focused)', usage: 'is ' }, + 'is': { category: 'Inspection', description: 'State check on element. Valid values: visible, hidden, enabled, disabled, checked, editable, focused (case-sensitive). accepts a CSS selector OR an @ref token from a prior snapshot (e.g. @e3, @c1) — refs are interchangeable with selectors anywhere a selector is expected.', usage: 'is ' }, 'console': { category: 'Inspection', description: 'Console messages (--errors filters to error/warning)', usage: 'console [--clear|--errors]' }, 'network': { category: 'Inspection', description: 'Network requests', usage: 'network [--clear]' }, 'dialog': { category: 'Inspection', description: 'Dialog messages', usage: 'dialog [--clear]' }, 'cookies': { category: 'Inspection', description: 'All cookies as JSON' }, - 'storage': { category: 'Inspection', description: 'Read all localStorage + sessionStorage as JSON, or set to write localStorage', usage: 'storage [set k v]' }, + 'storage': { category: 'Inspection', description: 'Read both localStorage and sessionStorage as JSON. With "set ", write to localStorage only (sessionStorage is read-only via this command — set it with `js sessionStorage.setItem(...)`).', usage: 'storage | storage set ' }, 'perf': { category: 'Inspection', description: 'Page load timings' }, // Interaction 'click': { category: 'Interaction', description: 'Click element', usage: 'click ' }, @@ -121,8 +121,8 @@ export const COMMAND_DESCRIPTIONS: Record ' }, 'hover': { category: 'Interaction', description: 'Hover element', usage: 'hover ' }, 'type': { category: 'Interaction', description: 'Type into focused element', usage: 'type ' }, - 'press': { category: 'Interaction', description: 'Press key — Enter, Tab, Escape, ArrowUp/Down/Left/Right, Backspace, Delete, Home, End, PageUp, PageDown, or modifiers like Shift+Enter', usage: 'press ' }, - 'scroll': { category: 'Interaction', description: 'Scroll element into view, or scroll to page bottom if no selector', usage: 'scroll [sel]' }, + 'press': { category: 'Interaction', description: 'Press a Playwright keyboard key against the focused element. Names are case-sensitive: Enter, Tab, Escape, ArrowUp/Down/Left/Right, Backspace, Delete, Home, End, PageUp, PageDown. Modifiers combine with +: Shift+Enter, Control+A, Meta+K. Single printable chars (a, A, 1) work too. Full key list: https://playwright.dev/docs/api/class-keyboard#keyboard-press', usage: 'press ' }, + 'scroll': { category: 'Interaction', description: 'With a selector, smooth-scrolls the element into view. Without a selector, jumps to page bottom. No --by/--to amount option; for pixel-precise scrolling use `js window.scrollTo(0, N)`.', usage: 'scroll [sel|@ref]' }, 'wait': { category: 'Interaction', description: 'Wait for element, network idle, or page load (timeout: 15s)', usage: 'wait ' }, 'upload': { category: 'Interaction', description: 'Upload file(s)', usage: 'upload [file2...]' }, 'viewport':{ category: 'Interaction', description: 'Set viewport size and optional deviceScaleFactor (1-3, for retina screenshots). --scale requires a context rebuild.', usage: 'viewport [] [--scale ]' }, @@ -154,7 +154,7 @@ export const COMMAND_DESCRIPTIONS: Record' }, + 'domain-skill': { category: 'Meta', description: 'Per-site notes the agent writes for itself. Host is derived from the active tab. Lifecycle: `save` adds a quarantined note → after N=3 successful uses without the prompt-injection classifier flagging it, the note auto-promotes to "active" → `promote-to-global` lifts it to the global tier (machine-wide, all projects). The classifier flag is set automatically by the L4 prompt-injection scan; agents do not set it manually. Use `list` / `show` to inspect, `edit` to revise, `rollback` to demote, `rm` to tombstone.', usage: 'domain-skill save|list|show|edit|promote-to-global|rollback|rm ' }, // Browser-skills (hand-written or generated Playwright scripts the runtime spawns) 'skill': { category: 'Meta', description: 'Run a browser-skill: deterministic Playwright script that drives the daemon over loopback HTTP. 3-tier lookup (project > global > bundled). Spawned scripts get a per-spawn scoped token (read+write only) — never the daemon root token.', usage: 'skill list|show|run|test|rm [--arg k=v]... [--timeout=Ns]' }, // CDP escape hatch (deny-default; see browse/src/cdp-allowlist.ts) - 'cdp': { category: 'Inspection', description: 'Raw CDP method dispatch (deny-default; allowlist in cdp-allowlist.ts). Output through UNTRUSTED envelope when method is data-exfil.', usage: 'cdp [json-params]' }, + 'cdp': { category: 'Inspection', description: 'Raw Chrome DevTools Protocol method dispatch. Deny-default: only methods enumerated in `browse/src/cdp-allowlist.ts` (CDP_ALLOWLIST const) are reachable; any other method 403s. Each allowlist entry declares scope (tab vs browser) and output (trusted vs untrusted) — untrusted methods (data-exfil-shaped, e.g. Network.getResponseBody) get UNTRUSTED-envelope wrapped output. To discover allowed methods: read `browse/src/cdp-allowlist.ts`. Example: `$B cdp Page.getLayoutMetrics`.', usage: 'cdp [json-params]' }, }; // Load-time validation: descriptions must cover exactly the command sets