--- name: scrape version: 1.0.0 description: | Pull data from a web page. First call on a new intent prototypes the flow via $B primitives and returns JSON. Subsequent calls on a matching intent route to a codified browser-skill and return in ~200ms. Read-only — for mutating flows (form fills, clicks, submissions), use /automate. Use when asked to "scrape", "get data from", "pull", "extract from", or "what's on" a page. (gstack) allowed-tools: - Bash - Read - AskUserQuestion triggers: - scrape this page - get data from - pull from - extract from - what is on --- {{PREAMBLE}} # /scrape — pull data from a page One entry point for getting data off the web. Two paths under the hood: 1. **Match path** (~200ms) — if the user's intent matches an existing browser-skill's triggers, run it via `$B skill run ` and emit the JSON. 2. **Prototype path** (~30s) — no matching skill yet, so drive the page with `$B` primitives, return the JSON, and suggest `/skillify` so the next call lands on the match path. Read-only by contract. If the intent implies writing (submitting forms, clicking buttons that mutate state), refuse and route to `/automate`. ## Step 1 — Determine intent The user's request after `/scrape` is the intent. If they did not include one, ask once: > "What do you want to scrape? Describe it in one line, e.g. 'top stories > on Hacker News' or 'product names + prices on example.com/products'." Do not ask multiple clarifying questions up front. Any further questions go in the prototype path where they're cheaper. ## Step 2 — Refuse mutating intents If the intent implies writes — verbs like *submit*, *post*, *send*, *log in*, *click X*, *fill the form*, *delete*, *create*, *order*, *book* — respond: > "/scrape is read-only. For mutating flows, use /automate (browser-skills > Phase 2 P0 in TODOS.md — not yet shipped). Until then, use $B click / > $B fill / $B type directly." Stop. Do not enter the match or prototype path. ## Step 3 — Match phase List existing browser-skills: ```bash $B skill list ``` For each skill, `$B skill show ` exposes the full SKILL.md including `triggers:`, `description:`, and `host:`. Read these and judge whether the user's intent semantically matches one of them. A confident match means **all three** are true: - The intent's domain matches the skill's `host` (or one of its hostnames) - A `triggers:` phrase or the `description:` covers the same data the intent asks for - The intent does not require args the skill does not declare in `args:` If matched, parse any `--arg key=value` from the intent (or pass none for zero-arg skills) and run: ```bash $B skill run [--arg key=value ...] ``` Emit the JSON the skill prints to stdout. Stop. If matching is ambiguous (two skills could plausibly fit), pick the narrower-tier one (project > global > bundled — `$B skill list` shows the tier). If still ambiguous, fall through to the prototype path rather than guess wrong. ## Step 4 — Prototype phase No match. Drive the page using `$B` primitives: 1. `$B goto ` — navigate to the target. The user's intent usually names a host or a URL; use it directly. 2. `$B snapshot --text` (or `$B text`) — get a clean text view of the page to find selectors. 3. `$B html` — pull the raw HTML when you need to parse structured data (lists, tables, repeated rows). 4. `$B links` — when the intent is to gather URLs. 5. Iterate: try a selector, check the output, refine. Emit the result as JSON on stdout (one document, not pretty-printed). Use a stable shape — typically `{ "items": [...], "count": N }` or similar — so downstream consumers can treat it as data. ## Step 5 — Skillify nudge After a successful prototype, append exactly one line: > "Say /skillify to make this a permanent skill (200ms on next call)." That is the entire nudge. Do not nag, do not list pros, do not push. Proactive surfacing is a Phase 3 knob (`gstack-config browser_skillify_prompts`), not this skill's job. ## When the prototype fails If the page loads but data extraction does not yield a sensible JSON shape after 3-4 selector attempts: - Report what you tried, what came back, and what's blocking (lazy-loaded, JS-rendered, paywalled, etc.). - Do NOT write a partial result and call it done. - Do NOT suggest /skillify on a broken prototype. - Ask the user whether they want to (a) try a different selector, (b) switch to a different page, or (c) stop. ## What this skill does NOT do - Mutating actions (use /automate when shipped, or $B primitives directly) - Auth flows / cookie import (use /setup-browser-cookies first) - Multi-page crawls (this is one-shot per call) - Anything that requires the daemon to not be running ## Output discipline The match path returns whatever JSON the matched skill emits. The prototype path returns whatever JSON you construct. In both cases: - One JSON document, on stdout. - Stderr (or chat) is for logs and the skillify nudge. - Do not embed prose around the JSON in the chat reply unless the user asked for an explanation — many `/scrape` callers pipe the output to `jq`. {{LEARNINGS_LOG}}