diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md
index 07d03bad..2b2ef364 100644
--- a/ARCHITECTURE.md
+++ b/ARCHITECTURE.md
@@ -120,7 +120,7 @@ Refs (`@e1`, `@e2`, `@c1`) are how the agent addresses page elements without wri
 2. Server calls Playwright's page.accessibility.snapshot()
 3. Parser walks the ARIA tree, assigns sequential refs: @e1, @e2, @e3...
 4. For each ref, builds a Playwright Locator: getByRole(role, { name }).nth(index)
-5. Stores Map on the BrowserManager instance
+5. Stores Map on the BrowserManager instance (role + name + Locator)
 6. Returns the annotated tree as plain text
 
 Later:
@@ -142,6 +142,19 @@ Playwright Locators are external to the DOM. They use the accessibility tree (wh
 
 Refs are cleared on navigation (the `framenavigated` event on the main frame). This is correct — after navigation, all locators are stale. The agent must run `snapshot` again to get fresh refs. This is by design: stale refs should fail loudly, not click the wrong element.
 
+### Ref staleness detection
+
+SPAs can mutate the DOM without triggering `framenavigated` (e.g. React router transitions, tab switches, modal opens). This makes refs stale even though the page URL didn't change. To catch this, `resolveRef()` performs an async `count()` check before using any ref:
+
+```
+resolveRef(@e3) → entry = refMap.get("e3")
+                → count = await entry.locator.count()
+                → if count === 0: throw "Ref @e3 is stale — element no longer exists. Run 'snapshot' to get fresh refs."
+                → if count > 0: return { locator }
+```
+
+This fails fast (~5ms overhead) instead of letting Playwright's 30-second action timeout expire on a missing element. The `RefEntry` stores `role` and `name` metadata alongside the Locator so the error message can tell the agent what the element was.
+
 ### Cursor-interactive refs (@c)
 
 The `-C` flag finds elements that are clickable but not in the ARIA tree — things styled with `cursor: pointer`, elements with `onclick` attributes, or custom `tabindex`. These get `@c1`, `@c2` refs in a separate namespace. This catches custom components that frameworks render as `<div>` but are actually buttons.
diff --git a/BROWSER.md b/BROWSER.md
index 8d0c5775..2d828ebe 100644
--- a/BROWSER.md
+++ b/BROWSER.md
@@ -87,6 +87,8 @@ The browser's key innovation is ref-based element selection, built on Playwright
 
 No DOM mutation. No injected scripts. Just Playwright's native accessibility API.
 
+**Ref staleness detection:** SPAs can mutate the DOM without navigation (React router, tab switches, modals). When this happens, refs collected from a previous `snapshot` may point to elements that no longer exist. To handle this, `resolveRef()` runs an async `count()` check before using any ref — if the element count is 0, it throws immediately with a message telling the agent to re-run `snapshot`. This fails fast (~5ms) instead of waiting for Playwright's 30-second action timeout.
+
 **Extended snapshot features:**
 - `--diff` (`-D`): Stores each snapshot as a baseline. On the next `-D` call, returns a unified diff showing what changed. Use this to verify that an action (click, fill, etc.) actually worked.
 - `--annotate` (`-a`): Injects temporary overlay divs at each ref's bounding box, takes a screenshot with ref labels visible, then removes the overlays. Use `-o ` to control the output path.
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 34e502ea..9b590c87 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -131,11 +131,13 @@ When E2E tests run, they produce machine-readable artifacts in `~/.gstack-dev/`:
 **Eval history tools:**
 
 ```bash
-bun run eval:list      # list all eval runs
-bun run eval:compare   # compare two runs (auto-picks most recent)
-bun run eval:summary   # aggregate stats across all runs
+bun run eval:list      # list all eval runs (turns, duration, cost per run)
+bun run eval:compare   # compare two runs — shows per-test deltas + Takeaway commentary
+bun run eval:summary   # aggregate stats + per-test efficiency averages across runs
 ```
 
+**Eval comparison commentary:** `eval:compare` generates natural-language Takeaway sections interpreting what changed between runs — flagging regressions, noting improvements, calling out efficiency gains (fewer turns, faster, cheaper), and producing an overall summary. This is driven by `generateCommentary()` in `eval-store.ts`.
+
 Artifacts are never cleaned up — they accumulate in `~/.gstack-dev/` for post-mortem debugging and trend analysis.
 
 ### Tier 3: LLM-as-judge (~$0.15/run)
diff --git a/README.md b/README.md
index ca2ddb77..ce994a45 100644
--- a/README.md
+++ b/README.md
@@ -614,7 +614,7 @@ Or set `auto_upgrade: true` in `~/.gstack/config.yaml` to upgrade automatically
 
 Paste this into Claude Code:
 
-> Uninstall gstack: remove the skill symlinks by running `for s in browse plan-ceo-review plan-eng-review review ship retro qa qa-only setup-browser-cookies; do rm -f ~/.claude/skills/$s; done` then run `rm -rf ~/.claude/skills/gstack` and remove the gstack section from CLAUDE.md. If this project also has gstack at .claude/skills/gstack, remove it by running `for s in browse plan-ceo-review plan-eng-review review ship retro qa setup-browser-cookies; do rm -f .claude/skills/$s; done && rm -rf .claude/skills/gstack` and remove the gstack section from the project CLAUDE.md too.
+> Uninstall gstack: remove the skill symlinks by running `for s in browse plan-ceo-review plan-eng-review review ship retro qa qa-only setup-browser-cookies; do rm -f ~/.claude/skills/$s; done` then run `rm -rf ~/.claude/skills/gstack` and remove the gstack section from CLAUDE.md. If this project also has gstack at .claude/skills/gstack, remove it by running `for s in browse plan-ceo-review plan-eng-review review ship retro qa qa-only setup-browser-cookies; do rm -f .claude/skills/$s; done && rm -rf .claude/skills/gstack` and remove the gstack section from the project CLAUDE.md too.
 
 ## Development
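
The staleness check described in the ARCHITECTURE.md and BROWSER.md hunks can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `CountableLocator` interface is a hypothetical stand-in for the one Playwright `Locator` method the check needs (`count()`), the `RefEntry` field layout is assumed from the "(role + name + Locator)" note, and the unknown-ref branch and the exact error wording are additions for illustration.

```typescript
// Hypothetical minimal stand-in for a Playwright Locator: only count() is needed.
interface CountableLocator {
  count(): Promise<number>;
}

// Assumed layout of a ref entry, based on the "(role + name + Locator)" note.
interface RefEntry {
  role: string;
  name: string;
  locator: CountableLocator;
}

// Look up a ref, then confirm the element still exists before handing it back.
async function resolveRef(
  refMap: Map<string, RefEntry>,
  ref: string,
): Promise<{ locator: CountableLocator }> {
  // Accept both "@e3" and "e3"; the map is keyed without the "@".
  const key = ref.startsWith("@") ? ref.slice(1) : ref;
  const entry = refMap.get(key);
  if (!entry) {
    // Assumption: unknown refs get the same "run snapshot" hint.
    throw new Error(`Unknown ref @${key}. Run 'snapshot' to get fresh refs.`);
  }

  // One cheap round trip instead of waiting for an action timeout.
  const count = await entry.locator.count();
  if (count === 0) {
    throw new Error(
      `Ref @${key} (${entry.role} "${entry.name}") is stale: element no longer exists. ` +
        `Run 'snapshot' to get fresh refs.`,
    );
  }
  return { locator: entry.locator };
}
```

Because a real Playwright `Locator` exposes `count(): Promise<number>`, it satisfies the structural interface above, so the same function shape would accept real locators without changes.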