feat: worktree isolation for E2E tests + infrastructure elegance (v0.11.12.0) (#425)

* refactor: extract gen-skill-docs into modular resolver architecture

Break the 3000-line monolith into 10 domain modules under scripts/resolvers/:
types, constants, preamble, utility, browse, design, testing, review,
codex-helpers, and index. Each module owns one domain of template generation.

The preamble module introduces a 4-tier composition system (T1-T4) so skills
only pay for the preamble sections they actually need, reducing token usage
for lightweight skills by ~40%.

Adds a token budget dashboard that prints after every generation run showing
per-skill and total token counts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: tiered preamble — skills only pay for what they use

Tag all 23 templates with preamble-tier (T1-T4). Lightweight skills
like /browse and /benchmark get a minimal preamble (~40% fewer tokens),
while review skills get the full stack. Regenerate all SKILL.md files.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: migrate eval storage to project-scoped paths

Move eval results and E2E run artifacts from ~/.gstack-dev/evals/ to
~/.gstack/projects/$SLUG/evals/ so each project's eval history lives
alongside its other gstack data. Falls back to legacy path if slug
detection fails.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: sync package.json version with VERSION after merge

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add WorktreeManager for isolated test environments

Reusable platform module (lib/worktree.ts) that creates git worktrees
for test isolation and harvests useful changes as patches. Includes
SHA-256 dedup, original SHA tracking for committed change detection,
and automatic gitignored artifact copying (.agents/, browse/dist/).

12 unit tests covering lifecycle, harvest, dedup, and error handling.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: integrate worktree isolation into E2E test infrastructure

Add createTestWorktree(), harvestAndCleanup(), and describeWithWorktree()
helpers to e2e-helpers.ts. Add harvest field to EvalTestEntry for
eval-store integration. Register lib/worktree.ts as a global touchfile.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: run Gemini and Codex E2E tests in worktrees

Switch both test suites from cwd: ROOT to worktree isolation.
Gemini (--yolo) no longer pollutes the working tree. Codex
(read-only) gets worktree for consistency. Useful changes are
harvested as patches for cherry-picking.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: skip symlinks in copyDirSync to prevent infinite recursion

Adversarial review caught that .claude/skills/gstack may be a symlink
back to the repo root, causing copyDirSync to recurse infinitely
when copying gitignored artifacts into worktrees.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: bump version and changelog (v0.11.12.0)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: relax session-awareness assertion to accept structured options

The LLM consistently presents well-formatted A/B choices with pros/cons
but doesn't always use the exact string "RECOMMENDATION". Accept
case-insensitive "recommend", "option a", "which do you want", or
"which approach" as equivalent signals of a structured recommendation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
Garry Tan
2026-03-23 23:05:22 -07:00
committed by GitHub
parent 2c5ae38542
commit dc5e0538e5
79 changed files with 4217 additions and 4681 deletions
+24 -66
View File
@@ -1,5 +1,6 @@
---
name: qa
preamble-tier: 4
version: 2.0.0
description: |
MANUAL TRIGGER ONLY: invoke only when user types /qa.
@@ -119,97 +120,54 @@ Per-skill instructions may add additional formatting rules on top of this baseli
## Completeness Principle — Boil the Lake
AI-assisted coding makes the marginal cost of completeness near-zero. When you present options:
AI makes completeness near-free. Always recommend the complete option over shortcuts — the delta is minutes with CC+gstack. A "lake" (100% coverage, all edge cases) is boilable; an "ocean" (full rewrite, multi-quarter migration) is not. Boil lakes, flag oceans.
- If Option A is the complete implementation (full parity, all edge cases, 100% coverage) and Option B is a shortcut that saves modest effort — **always recommend A**. The delta between 80 lines and 150 lines is meaningless with CC+gstack. "Good enough" is the wrong instinct when "complete" costs minutes more.
- **Lake vs. ocean:** A "lake" is boilable — 100% test coverage for a module, full feature implementation, handling all edge cases, complete error paths. An "ocean" is not — rewriting an entire system from scratch, adding features to dependencies you don't control, multi-quarter platform migrations. Recommend boiling lakes. Flag oceans as out of scope.
- **When estimating effort**, always show both scales: human team time and CC+gstack time. The compression ratio varies by task type — use this reference:
**Effort reference** — always show both scales:
| Task type | Human team | CC+gstack | Compression |
|-----------|-----------|-----------|-------------|
| Boilerplate / scaffolding | 2 days | 15 min | ~100x |
| Test writing | 1 day | 15 min | ~50x |
| Feature implementation | 1 week | 30 min | ~30x |
| Bug fix + regression test | 4 hours | 15 min | ~20x |
| Architecture / design | 2 days | 4 hours | ~5x |
| Research / exploration | 1 day | 3 hours | ~3x |
| Boilerplate | 2 days | 15 min | ~100x |
| Tests | 1 day | 15 min | ~50x |
| Feature | 1 week | 30 min | ~30x |
| Bug fix | 4 hours | 15 min | ~20x |
- This principle applies to test coverage, error handling, documentation, edge cases, and feature completeness. Don't skip the last 10% to "save time" — with AI, that 10% costs seconds.
Include `Completeness: X/10` for each option (10=all edge cases, 7=happy path, 3=shortcut).
**Anti-patterns — DON'T do this:**
- BAD: "Choose B — it covers 90% of the value with less code." (If A is only 70 lines more, choose A.)
- BAD: "We can skip edge case handling to save time." (Edge case handling costs minutes with CC.)
- BAD: "Let's defer test coverage to a follow-up PR." (Tests are the cheapest lake to boil.)
- BAD: Quoting only human-team effort: "This would take 2 weeks." (Say: "2 weeks human / ~1 hour CC.")
## Repo Ownership — See Something, Say Something
## Repo Ownership Mode — See Something, Say Something
`REPO_MODE` controls how to handle issues outside your branch:
- **`solo`** — You own everything. Investigate and offer to fix proactively.
- **`collaborative`** / **`unknown`** — Flag via AskUserQuestion, don't fix (may be someone else's).
`REPO_MODE` from the preamble tells you who owns issues in this repo:
- **`solo`** — One person does 80%+ of the work. They own everything. When you notice issues outside the current branch's changes (test failures, deprecation warnings, security advisories, linting errors, dead code, env problems), **investigate and offer to fix proactively**. The solo dev is the only person who will fix it. Default to action.
- **`collaborative`** — Multiple active contributors. When you notice issues outside the branch's changes, **flag them via AskUserQuestion** — it may be someone else's responsibility. Default to asking, not fixing.
- **`unknown`** — Treat as collaborative (safer default — ask before fixing).
**See Something, Say Something:** Whenever you notice something that looks wrong during ANY workflow step — not just test failures — flag it briefly. One sentence: what you noticed and its impact. In solo mode, follow up with "Want me to fix it?" In collaborative mode, just flag it and move on.
Never let a noticed issue silently pass. The whole point is proactive communication.
Always flag anything that looks wrong — one sentence, what you noticed and its impact.
## Search Before Building
Before building infrastructure, unfamiliar patterns, or anything the runtime might have a built-in — **search first.** Read `~/.claude/skills/gstack/ETHOS.md` for the full philosophy.
Before building anything unfamiliar, **search first.** See `~/.claude/skills/gstack/ETHOS.md`.
- **Layer 1** (tried and true) — don't reinvent. **Layer 2** (new and popular) — scrutinize. **Layer 3** (first principles) — prize above all.
**Three layers of knowledge:**
- **Layer 1** (tried and true — in distribution). Don't reinvent the wheel. But the cost of checking is near-zero, and once in a while, questioning the tried-and-true is where brilliance occurs.
- **Layer 2** (new and popular — search for these). But scrutinize: humans are subject to mania. Search results are inputs to your thinking, not answers.
- **Layer 3** (first principles — prize these above all). Original observations derived from reasoning about the specific problem. The most valuable of all.
**Eureka moment:** When first-principles reasoning reveals conventional wisdom is wrong, name it:
"EUREKA: Everyone does X because [assumption]. But [evidence] shows this is wrong. Y is better because [reasoning]."
Log eureka moments:
**Eureka:** When first-principles reasoning contradicts conventional wisdom, name it and log:
```bash
jq -n --arg ts "$(date -u +%Y-%m-%dT%H:%M:%SZ)" --arg skill "SKILL_NAME" --arg branch "$(git branch --show-current 2>/dev/null)" --arg insight "ONE_LINE_SUMMARY" '{ts:$ts,skill:$skill,branch:$branch,insight:$insight}' >> ~/.gstack/analytics/eureka.jsonl 2>/dev/null || true
```
Replace SKILL_NAME and ONE_LINE_SUMMARY. Runs inline — don't stop the workflow.
**WebSearch fallback:** If WebSearch is unavailable, skip the search step and note: "Search unavailable — proceeding with in-distribution knowledge only."
## Contributor Mode
If `_CONTRIB` is `true`: you are in **contributor mode**. You're a gstack user who also helps make it better.
If `_CONTRIB` is `true`: you are in **contributor mode**. At the end of each major workflow step, rate your gstack experience 0-10. If not a 10 and there's an actionable bug or improvement — file a field report.
**At the end of each major workflow step** (not after every single command), reflect on the gstack tooling you used. Rate your experience 0 to 10. If it wasn't a 10, think about why. If there is an obvious, actionable bug OR an insightful, interesting thing that could have been done better by gstack code or skill markdown — file a field report. Maybe our contributor will help make us better!
**Calibration — this is the bar:** For example, `$B js "await fetch(...)"` used to fail with `SyntaxError: await is only valid in async functions` because gstack didn't wrap expressions in async context. Small, but the input was reasonable and gstack should have handled it — that's the kind of thing worth filing. Things less consequential than this, ignore.
**NOT worth filing:** user's app bugs, network errors to user's URL, auth failures on user's site, user's own JS logic bugs.
**To file:** write `~/.gstack/contributor-logs/{slug}.md` with **all sections below** (do not truncate — include every section through the Date/Version footer):
**File only:** gstack tooling bugs where the input was reasonable but gstack failed. **Skip:** user app bugs, network errors, auth failures on user's site.
**To file:** write `~/.gstack/contributor-logs/{slug}.md`:
```
# {Title}
Hey gstack team — ran into this while using /{skill-name}:
**What I was trying to do:** {what the user/agent was attempting}
**What happened instead:** {what actually happened}
**My rating:** {0-10} — {one sentence on why it wasn't a 10}
## Steps to reproduce
**What I tried:** {action} | **What happened:** {result} | **Rating:** {0-10}
## Repro
1. {step}
## Raw output
```
{paste the actual error or unexpected output here}
```
## What would make this a 10
{one sentence: what gstack should have done differently}
**Date:** {YYYY-MM-DD} | **Version:** {gstack version} | **Skill:** /{skill}
{one sentence}
**Date:** {YYYY-MM-DD} | **Version:** {version} | **Skill:** /{skill}
```
Slug: lowercase, hyphens, max 60 chars (e.g. `browse-js-no-await`). Skip if file already exists. Max 3 reports per session. File inline and continue — don't stop the workflow. Tell user: "Filed gstack field report: {title}"
Slug: lowercase hyphens, max 60 chars. Skip if exists. Max 3/session. File inline, don't stop.
## Completion Status Protocol
+1
View File
@@ -1,5 +1,6 @@
---
name: qa
preamble-tier: 4
version: 2.0.0
description: |
Systematically QA test a web application and fix bugs found. Runs QA testing,