docs(browser-skills): TODOS Phase 2a + design doc D1-D4 decisions

TODOS.md:
- Narrows existing P1 (was "/scrape and /automate") to "/scrape and
  /skillify" — the /scrape + /skillify wedge ships in this branch.
  Codex finding #6 (synthesis) removed from Cons (resolved by D2);
  finding #7 (Bun runtime) stays as the open carry-over.
- Adds new ## P0 above PACING_UPDATES_V0 for the /automate follow-up.
  Same skillify pattern as /scrape, different trust profile (per-step
  confirmation gate when running non-codified). Reuses /skillify and
  the D3 helper as-is. Effort M.

BROWSER_SKILLS_V1.md:
- Phase table re-organized into 1, 2a, 2b, 3, 4. Phase 1 + Phase 2a
  consolidate into v1.19.0.0 ship (the v1.16.0.0 branch-internal
  bump never landed on main).
- New "Phase 2a" sub-section captures the four decisions locked
  during /plan-eng-review:
    D1 — provenance guard (≤10 turn walk-back, refuse if cold)
    D2 — synthesis input slice (final-attempt $B calls only,
         closes Codex finding #6)
    D3 — atomic write discipline (temp-dir-then-rename via new
         browse/src/browser-skill-write.ts helper)
    D4 — full test scope (5 gate E2E + 1 unit + smoke)
- New "Phase 2b" sketch for /automate: same skillify machinery,
  per-mutating-step confirmation gate, deferred to next branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Garry Tan
2026-04-27 18:34:23 -07:00
parent b5904dc11f
commit 90d7f22b88
2 changed files with 62 additions and 27 deletions
+35 -18
View File
@@ -59,10 +59,11 @@ The plan as approved replaces the existing P1.
| Phase | Branch | Scope |
|-------|--------|-------|
| **1** | `garrytan/browserharness` (this) | SDK, storage, `$B skill list/run/show/test/rm` subcommands, scoped-token model, bundled `hackernews-frontpage` reference. **Shipped.** |
| **2** | new (`browser-skills-scrape-automate`) | `/scrape` and `/automate` skill templates that prototype a flow then offer the skillify approval gate. |
| **1** | `garrytan/browserharness` | SDK, storage, `$B skill list/run/show/test/rm` subcommands, scoped-token model, bundled `hackernews-frontpage` reference. **Shipped (v1.19.0.0, consolidated with Phase 2a).** |
| **2a** | `garrytan/browserharness` (continues) | `/scrape <intent>` (read-only, single entry point with match/prototype paths) + `/skillify` (codifies prototype into permanent skill). Adds `browse/src/browser-skill-write.ts` D3 atomic-write helper. **Shipping v1.19.0.0.** |
| **2b** | new (`browser-skills-automate`) | `/automate` skill template (mutating-flow sibling of `/scrape`). Reuses `/skillify` and the D3 helper. Per-mutating-step confirmation gate when running non-codified. P0 in TODOS. |
| **3** | new (`browser-skills-resolver`) | Resolver injection at session start (per-host browser-skill discovery). Mirrors domain-skill injection. `gstack-config browser_skillify_prompts` knob. |
| **4** | new | Eval test infrastructure (LLM-judge), fixture-staleness detection, periodic re-validation against live pages. |
| **4** | new | Eval test infrastructure (LLM-judge), fixture-staleness detection, periodic re-validation against live pages, OS-level FS sandbox for untrusted spawns. |
---
@@ -205,24 +206,40 @@ The /codex review flagged 8 findings. The plan addresses them as follows:
---
## Phase 2 sketch (for reference)
## Phase 2a — `/scrape` + `/skillify` (shipping v1.19.0.0)
Two skill templates: `/scrape` and `/automate`. Both run a prototype flow via
`$B` primitives, get the user's "looks right" signal, then offer the skillify
approval gate that writes a Phase-1-shaped browser-skill to disk.
Two skill templates plus one helper module. `/scrape <intent>` is the single
entry point for pulling page data; first call on a new intent prototypes via
`$B` primitives and returns JSON, subsequent calls on a matching intent route
to a codified browser-skill in ~200ms. `/skillify` codifies the most recent
successful prototype into a permanent browser-skill on disk. Mutating-flow
sibling `/automate` deferred to Phase 2b (P0 in TODOS).
Open design questions deferred to Phase 2:
### Decisions locked during the v1.19.0.0 plan review (`/plan-eng-review`)
- **Where do user-authored skills live by default — global or per-project?**
Lean global for procedures, with per-project override (mirrors domain-skill
scope). Phase 1 storage helpers already support both lookup paths.
- **How does the agent synthesize the script?** Codex finding #6: the activity
feed is lossy. Options: (a) structured recorder that captures full $B
invocations to a separate buffer; (b) re-prompt the agent to write from
scratch using its own context, with the activity feed as a reference.
- **Bun runtime distribution.** Codex finding #7. Options: (a) ship Bun
binary with each skill; (b) compile each skill to a self-contained binary;
(c) use Node + the existing `cli.ts` pattern.
| ID | Decision | Locked behavior |
|----|----------|-----------------|
| **D1** | `/skillify` provenance guard | Walk back ≤10 agent turns looking for a clearly-bounded `/scrape` invocation (the prototype's intent line + its trailing JSON output). If not found, refuse with: *"No recent /scrape result found in this conversation. Run /scrape <intent> first, then say /skillify."* No silent fallback. |
| **D2** | Synthesis input slice | Template instructs the agent to extract ONLY the final-attempt `$B` calls that produced the JSON the user accepted, plus the user's stated intent string. Drop failed selector attempts, drop unrelated chat, drop earlier-session content. Closes Codex finding #6 by picking option (b) (re-prompt from agent's own context, not a structured recorder). |
| **D3** | Atomic write discipline | `/skillify` writes to `~/.gstack/.tmp/skillify-<spawnId>/`, runs `$B skill test` against the temp dir, and only renames into the final tier path on success + user approval. On test failure or approval rejection: `rm -rf` the temp dir entirely (no tombstone for never-approved skills). New module `browse/src/browser-skill-write.ts` (`stageSkill` / `commitSkill` / `discardStaged`) with `realpath`/`lstat` discipline per Codex finding #5. |
| **D4** | Test scope | 5 gate-tier E2E (scrape match, scrape prototype, skillify happy, skillify provenance refusal, approval-gate reject) + 1 unit test (atomic-write helper failure cleanup) + 1 hand-verified smoke (mutating-intent refusal). Registered in `test/helpers/touchfiles.ts`. |
### Carry-overs
- **Default tier: global.** Lean global for procedures, with per-project
override at `/skillify` time (mirrors domain-skill scope). Phase 1 storage
helpers support both lookup paths.
- **Bun runtime distribution.** Codex finding #7 stays open. Phase 2a assumes
Bun is on PATH (gstack already requires it via `setup:6-15`). Documented
in `/skillify` SKILL.md "Limits". Real fix lands in Phase 4.
## Phase 2b — `/automate` sketch
Mutating-flow sibling of `/scrape`. Same skillify pattern (reuses `/skillify`
and the D3 helper as-is). Difference: per-mutating-step UNTRUSTED-wrapped
summary + `AskUserQuestion` confirmation gate when run non-codified. After
codification, the skill runs unattended (the codified script enumerates exactly
which `$B click`/`fill`/`type` calls run). See P0 entry in `TODOS.md`.
## Phase 3 sketch