diff --git a/.agents/skills/gstack-browse/SKILL.md b/.agents/skills/gstack-browse/SKILL.md
index f41a62ef..d3e17ebd 100644
--- a/.agents/skills/gstack-browse/SKILL.md
+++ b/.agents/skills/gstack-browse/SKILL.md
@@ -372,7 +372,7 @@ The snapshot is your primary tool for understanding and interacting with pages.
-s --selector Scope to CSS selector
-D --diff Unified diff against previous snapshot (first call stores baseline)
-a --annotate Annotated screenshot with red overlay boxes and ref labels
--o --output Output path for annotated screenshot (default: /tmp/browse-annotated.png)
+-o --output Output path for annotated screenshot (default: browse-annotated.png in the platform temp directory)
-C --cursor-interactive Cursor-interactive elements (@c refs — divs with pointer, onclick)
```
diff --git a/.agents/skills/gstack-office-hours/SKILL.md b/.agents/skills/gstack-office-hours/SKILL.md
index 3f9625de..fd8a8887 100644
--- a/.agents/skills/gstack-office-hours/SKILL.md
+++ b/.agents/skills/gstack-office-hours/SKILL.md
@@ -232,6 +232,25 @@ success/error/abort, and `USED_BROWSE` with true/false based on whether `$B` was
If you cannot determine the outcome, use "unknown". This runs in the background and never blocks the user.
+## SETUP (run this check BEFORE any browse command)
+
+```bash
+_ROOT=$(git rev-parse --show-toplevel 2>/dev/null)
+B=""
+[ -n "$_ROOT" ] && [ -x "$_ROOT/.agents/skills/gstack/browse/dist/browse" ] && B="$_ROOT/.agents/skills/gstack/browse/dist/browse"
+[ -z "$B" ] && B=~/.codex/skills/gstack/browse/dist/browse
+if [ -x "$B" ]; then
+  echo "READY: $B"
+else
+  echo "NEEDS_SETUP"
+fi
+```
+
+If `NEEDS_SETUP`:
+1. Tell the user: "gstack browse needs a one-time build (~10 seconds). OK to proceed?" Then STOP and wait.
+2. If `bun` is not installed, install it first: `curl -fsSL https://bun.sh/install | bash`
+3. Run: `cd && ./setup`
+
# YC Office Hours

You are a **YC office hours partner**. Your job is to ensure the problem is understood before solutions are proposed.
You adapt to what the user is building — startup founders get the hard questions, builders get an enthusiastic collaborator. This skill produces design docs, not code. @@ -496,6 +515,66 @@ Present via AskUserQuestion. Do NOT proceed without user approval of the approac --- +## Visual Sketch (UI ideas only) + +If the chosen approach involves user-facing UI (screens, pages, forms, dashboards, +or interactive elements), generate a rough wireframe to help the user visualize it. +If the idea is backend-only, infrastructure, or has no UI component — skip this +section silently. + +**Step 1: Gather design context** + +1. Check if `DESIGN.md` exists in the repo root. If it does, read it for design + system constraints (colors, typography, spacing, component patterns). Use these + constraints in the wireframe. +2. Apply core design principles: + - **Information hierarchy** — what does the user see first, second, third? + - **Interaction states** — loading, empty, error, success, partial + - **Edge case paranoia** — what if the name is 47 chars? Zero results? Network fails? + - **Subtraction default** — "as little design as possible" (Rams). Every element earns its pixels. + - **Design for trust** — every interface element builds or erodes user trust. + +**Step 2: Generate wireframe HTML** + +Generate a single-page HTML file with these constraints: +- **Intentionally rough aesthetic** — use system fonts, thin gray borders, no color, + hand-drawn-style elements. This is a sketch, not a polished mockup. 
+- Self-contained — no external dependencies, no CDN links, inline CSS only +- Show the core interaction flow (1-3 screens/states max) +- Include realistic placeholder content (not "Lorem ipsum" — use content that + matches the actual use case) +- Add HTML comments explaining design decisions + +Write to a temp file: +```bash +SKETCH_FILE="/tmp/gstack-sketch-$(date +%s).html" +``` + +**Step 3: Render and capture** + +```bash +$B goto "file://$SKETCH_FILE" +$B screenshot /tmp/gstack-sketch.png +``` + +If `$B` is not available (browse binary not set up), skip the render step. Tell the +user: "Visual sketch requires the browse binary. Run the setup script to enable it." + +**Step 4: Present and iterate** + +Show the screenshot to the user. Ask: "Does this feel right? Want to iterate on the layout?" + +If they want changes, regenerate the HTML with their feedback and re-render. +If they approve or say "good enough," proceed. + +**Step 5: Include in design doc** + +Reference the wireframe screenshot in the design doc's "Recommended Approach" section. +The screenshot file at `/tmp/gstack-sketch.png` can be referenced by downstream skills +(`/plan-design-review`, `/design-review`) to see what was originally envisioned. + +--- + ## Phase 4.5: Founder Signal Synthesis Before writing the design doc, synthesize the founder signals you observed during the session. These will appear in the design doc ("What I noticed") and in the closing conversation (Phase 6). @@ -632,7 +711,73 @@ Supersedes: {prior filename — omit this line if first design on this branch} {observational, mentor-like reflections referencing specific things the user said during the session. Quote their words back to them — don't characterize their behavior. 2-4 bullets.} ``` -Present the design doc to the user via AskUserQuestion: +--- + +## Spec Review Loop + +Before presenting the document to the user for approval, run an adversarial review. 
+ +**Step 1: Dispatch reviewer subagent** + +Use the Agent tool to dispatch an independent reviewer. The reviewer has fresh context +and cannot see the brainstorming conversation — only the document. This ensures genuine +adversarial independence. + +Prompt the subagent with: +- The file path of the document just written +- "Read this document and review it on 5 dimensions. For each dimension, note PASS or + list specific issues with suggested fixes. At the end, output a quality score (1-10) + across all dimensions." + +**Dimensions:** +1. **Completeness** — Are all requirements addressed? Missing edge cases? +2. **Consistency** — Do parts of the document agree with each other? Contradictions? +3. **Clarity** — Could an engineer implement this without asking questions? Ambiguous language? +4. **Scope** — Does the document creep beyond the original problem? YAGNI violations? +5. **Feasibility** — Can this actually be built with the stated approach? Hidden complexity? + +The subagent should return: +- A quality score (1-10) +- PASS if no issues, or a numbered list of issues with dimension, description, and fix + +**Step 2: Fix and re-dispatch** + +If the reviewer returns issues: +1. Fix each issue in the document on disk (use Edit tool) +2. Re-dispatch the reviewer subagent with the updated document +3. Maximum 3 iterations total + +**Convergence guard:** If the reviewer returns the same issues on consecutive iterations +(the fix didn't resolve them or the reviewer disagrees with the fix), stop the loop +and persist those issues as "Reviewer Concerns" in the document rather than looping +further. + +If the subagent fails, times out, or is unavailable — skip the review loop entirely. +Tell the user: "Spec review unavailable — presenting unreviewed doc." The document is +already written to disk; the review is a quality bonus, not a gate. + +**Step 3: Report and persist metrics** + +After the loop completes (PASS, max iterations, or convergence guard): + +1. 
Tell the user the result — summary by default: + "Your doc survived N rounds of adversarial review. M issues caught and fixed. + Quality score: X/10." + If they ask "what did the reviewer find?", show the full reviewer output. + +2. If issues remain after max iterations or convergence, add a "## Reviewer Concerns" + section to the document listing each unresolved issue. Downstream skills will see this. + +3. Append metrics: +```bash +mkdir -p ~/.gstack/analytics +echo '{"skill":"office-hours","ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'","iterations":ITERATIONS,"issues_found":FOUND,"issues_fixed":FIXED,"remaining":REMAINING,"quality_score":SCORE}' >> ~/.gstack/analytics/spec-review.jsonl 2>/dev/null || true +``` +Replace ITERATIONS, FOUND, FIXED, REMAINING, SCORE with actual values from the review. + +--- + +Present the reviewed design doc to the user via AskUserQuestion: - A) Approve — mark Status: APPROVED and proceed to handoff - B) Revise — specify which sections need changes (loop back to revise those sections) - C) Start over — return to Phase 2 diff --git a/.agents/skills/gstack-plan-ceo-review/SKILL.md b/.agents/skills/gstack-plan-ceo-review/SKILL.md index 1bebfd5a..159a253e 100644 --- a/.agents/skills/gstack-plan-ceo-review/SKILL.md +++ b/.agents/skills/gstack-plan-ceo-review/SKILL.md @@ -338,6 +338,37 @@ DESIGN=$(ls -t ~/.gstack/projects/$SLUG/*-$BRANCH-design-*.md 2>/dev/null | head ``` If a design doc exists (from `/office-hours`), read it. Use it as the source of truth for the problem statement, constraints, and chosen approach. If it has a `Supersedes:` field, note that this is a revised design. +## Prerequisite Skill Offer + +When the design doc check above prints "No design doc found," offer the prerequisite +skill before proceeding. + +Say to the user via AskUserQuestion: + +> "No design doc found for this branch. 
`/office-hours` produces a structured problem +> statement, premise challenge, and explored alternatives — it gives this review much +> sharper input to work with. Takes about 10 minutes. The design doc is per-feature, +> not per-product — it captures the thinking behind this specific change." + +Options: +- A) Run /office-hours first (in another window, then come back) +- B) Skip — proceed with standard review + +If they skip: "No worries — standard review. If you ever want sharper input, try +/office-hours first next time." Then proceed normally. Do not re-offer later in the session. + +**Mid-session detection:** During Step 0A (Premise Challenge), if the user can't +articulate the problem, keeps changing the problem statement, answers with "I'm not +sure," or is clearly exploring rather than reviewing — offer `/office-hours`: + +> "It sounds like you're still figuring out what to build — that's totally fine, but +> that's what /office-hours is designed for. Want to pause this review and run +> /office-hours first? It'll help you nail down the problem and approach, then come +> back here for the strategic review." + +Options: A) Yes, run /office-hours first. B) No, keep going. +If they keep going, proceed normally — no guilt, no re-asking. + When reading TODOS.md, specifically: * Note any TODOs this plan touches, blocks, or unlocks * Check if deferred work from prior reviews relates to this plan @@ -481,6 +512,70 @@ Repo: {owner/repo} Derive the feature slug from the plan being reviewed (e.g., "user-dashboard", "auth-refactor"). Use the date in YYYY-MM-DD format. +After writing the CEO plan, run the spec review loop on it: + +## Spec Review Loop + +Before presenting the document to the user for approval, run an adversarial review. + +**Step 1: Dispatch reviewer subagent** + +Use the Agent tool to dispatch an independent reviewer. The reviewer has fresh context +and cannot see the brainstorming conversation — only the document. 
This ensures genuine +adversarial independence. + +Prompt the subagent with: +- The file path of the document just written +- "Read this document and review it on 5 dimensions. For each dimension, note PASS or + list specific issues with suggested fixes. At the end, output a quality score (1-10) + across all dimensions." + +**Dimensions:** +1. **Completeness** — Are all requirements addressed? Missing edge cases? +2. **Consistency** — Do parts of the document agree with each other? Contradictions? +3. **Clarity** — Could an engineer implement this without asking questions? Ambiguous language? +4. **Scope** — Does the document creep beyond the original problem? YAGNI violations? +5. **Feasibility** — Can this actually be built with the stated approach? Hidden complexity? + +The subagent should return: +- A quality score (1-10) +- PASS if no issues, or a numbered list of issues with dimension, description, and fix + +**Step 2: Fix and re-dispatch** + +If the reviewer returns issues: +1. Fix each issue in the document on disk (use Edit tool) +2. Re-dispatch the reviewer subagent with the updated document +3. Maximum 3 iterations total + +**Convergence guard:** If the reviewer returns the same issues on consecutive iterations +(the fix didn't resolve them or the reviewer disagrees with the fix), stop the loop +and persist those issues as "Reviewer Concerns" in the document rather than looping +further. + +If the subagent fails, times out, or is unavailable — skip the review loop entirely. +Tell the user: "Spec review unavailable — presenting unreviewed doc." The document is +already written to disk; the review is a quality bonus, not a gate. + +**Step 3: Report and persist metrics** + +After the loop completes (PASS, max iterations, or convergence guard): + +1. Tell the user the result — summary by default: + "Your doc survived N rounds of adversarial review. M issues caught and fixed. + Quality score: X/10." 
+ If they ask "what did the reviewer find?", show the full reviewer output. + +2. If issues remain after max iterations or convergence, add a "## Reviewer Concerns" + section to the document listing each unresolved issue. Downstream skills will see this. + +3. Append metrics: +```bash +mkdir -p ~/.gstack/analytics +echo '{"skill":"plan-ceo-review","ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'","iterations":ITERATIONS,"issues_found":FOUND,"issues_fixed":FIXED,"remaining":REMAINING,"quality_score":SCORE}' >> ~/.gstack/analytics/spec-review.jsonl 2>/dev/null || true +``` +Replace ITERATIONS, FOUND, FIXED, REMAINING, SCORE with actual values from the review. + ### 0E. Temporal Interrogation (EXPANSION, SELECTIVE EXPANSION, and HOLD modes) Think ahead to implementation: What decisions will need to be made during implementation that should be resolved NOW in the plan? ``` @@ -912,7 +1007,7 @@ Parse the output. Find the most recent entry for each skill (plan-ceo-review, pl - **Eng Review (required by default):** The only review that gates shipping. Covers architecture, code quality, tests, performance. Can be disabled globally with \`gstack-config set skip_eng_review true\` (the "don't bother me" setting). - **CEO Review (optional):** Use your judgment. Recommend it for big product/business changes, new user-facing features, or scope decisions. Skip for bug fixes, refactors, infra, and cleanup. - **Design Review (optional):** Use your judgment. Recommend it for UI/UX changes. Skip for backend-only, infra, or prompt-only changes. -- **Codex Review (optional):** Independent second opinion from OpenAI Codex CLI. Shows pass/fail gate. Recommend for critical code changes where a second AI perspective adds value. Skip when Codex CLI is not installed. +- **Codex Review (enabled by default when Codex CLI is installed):** Independent review + adversarial challenge from OpenAI Codex CLI. Shows pass/fail gate. 
Runs automatically when enabled — configure with \`gstack-config set codex_reviews enabled|disabled\`. **Verdict logic:** - **CLEARED**: Eng Review has >= 1 entry within 7 days with status "clean" (or \`skip_eng_review\` is \`true\`) diff --git a/.agents/skills/gstack-plan-design-review/SKILL.md b/.agents/skills/gstack-plan-design-review/SKILL.md index db5deffa..618f559c 100644 --- a/.agents/skills/gstack-plan-design-review/SKILL.md +++ b/.agents/skills/gstack-plan-design-review/SKILL.md @@ -542,7 +542,7 @@ Parse the output. Find the most recent entry for each skill (plan-ceo-review, pl - **Eng Review (required by default):** The only review that gates shipping. Covers architecture, code quality, tests, performance. Can be disabled globally with \`gstack-config set skip_eng_review true\` (the "don't bother me" setting). - **CEO Review (optional):** Use your judgment. Recommend it for big product/business changes, new user-facing features, or scope decisions. Skip for bug fixes, refactors, infra, and cleanup. - **Design Review (optional):** Use your judgment. Recommend it for UI/UX changes. Skip for backend-only, infra, or prompt-only changes. -- **Codex Review (optional):** Independent second opinion from OpenAI Codex CLI. Shows pass/fail gate. Recommend for critical code changes where a second AI perspective adds value. Skip when Codex CLI is not installed. +- **Codex Review (enabled by default when Codex CLI is installed):** Independent review + adversarial challenge from OpenAI Codex CLI. Shows pass/fail gate. Runs automatically when enabled — configure with \`gstack-config set codex_reviews enabled|disabled\`. 
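The "find the most recent entry for each skill" step above can be sketched in shell. This is a minimal sketch, not the dashboard's actual implementation: the JSONL shape mirrors the `gstack-review-log` payloads quoted elsewhere in this document, the sample entries are invented, and the field positions assume the fixed key order shown.

```shell
# Sample review-log entries (invented data, gstack-review-log shape).
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
{"skill":"plan-eng-review","timestamp":"2026-03-18T10:00:00Z","status":"issues_found"}
{"skill":"plan-eng-review","timestamp":"2026-03-19T09:00:00Z","status":"clean"}
{"skill":"codex-review","timestamp":"2026-03-19T11:00:00Z","status":"clean","gate":"pass"}
EOF
# ISO-8601 timestamps sort lexicographically, so sorting on the
# timestamp (the 8th double-quote-delimited field) puts newest last;
# awk then keeps the last line seen per skill (the 4th field).
# Prints one line per skill, order unspecified.
sort -t'"' -k8,8 "$LOG" |
  awk -F'"' '{latest[$4] = $0} END {for (s in latest) print latest[s]}'
rm -f "$LOG"
```

The same pipeline works for the verdict logic's 7-day window by filtering on the timestamp field before the sort.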
**Verdict logic:** - **CLEARED**: Eng Review has >= 1 entry within 7 days with status "clean" (or \`skip_eng_review\` is \`true\`) diff --git a/.agents/skills/gstack-plan-eng-review/SKILL.md b/.agents/skills/gstack-plan-eng-review/SKILL.md index 642c0cf0..f350b149 100644 --- a/.agents/skills/gstack-plan-eng-review/SKILL.md +++ b/.agents/skills/gstack-plan-eng-review/SKILL.md @@ -283,6 +283,25 @@ DESIGN=$(ls -t ~/.gstack/projects/$SLUG/*-$BRANCH-design-*.md 2>/dev/null | head ``` If a design doc exists, read it. Use it as the source of truth for the problem statement, constraints, and chosen approach. If it has a `Supersedes:` field, note that this is a revised design — check the prior version for context on what changed and why. +## Prerequisite Skill Offer + +When the design doc check above prints "No design doc found," offer the prerequisite +skill before proceeding. + +Say to the user via AskUserQuestion: + +> "No design doc found for this branch. `/office-hours` produces a structured problem +> statement, premise challenge, and explored alternatives — it gives this review much +> sharper input to work with. Takes about 10 minutes. The design doc is per-feature, +> not per-product — it captures the thinking behind this specific change." + +Options: +- A) Run /office-hours first (in another window, then come back) +- B) Skip — proceed with standard review + +If they skip: "No worries — standard review. If you ever want sharper input, try +/office-hours first next time." Then proceed normally. Do not re-offer later in the session. + ### Step 0: Scope Challenge Before reviewing anything, answer these questions: 1. **What existing code already partially or fully solves each sub-problem?** Can we capture outputs from existing flows rather than building parallel ones? @@ -512,7 +531,7 @@ Parse the output. Find the most recent entry for each skill (plan-ceo-review, pl - **Eng Review (required by default):** The only review that gates shipping. 
Covers architecture, code quality, tests, performance. Can be disabled globally with \`gstack-config set skip_eng_review true\` (the "don't bother me" setting). - **CEO Review (optional):** Use your judgment. Recommend it for big product/business changes, new user-facing features, or scope decisions. Skip for bug fixes, refactors, infra, and cleanup. - **Design Review (optional):** Use your judgment. Recommend it for UI/UX changes. Skip for backend-only, infra, or prompt-only changes. -- **Codex Review (optional):** Independent second opinion from OpenAI Codex CLI. Shows pass/fail gate. Recommend for critical code changes where a second AI perspective adds value. Skip when Codex CLI is not installed. +- **Codex Review (enabled by default when Codex CLI is installed):** Independent review + adversarial challenge from OpenAI Codex CLI. Shows pass/fail gate. Runs automatically when enabled — configure with \`gstack-config set codex_reviews enabled|disabled\`. **Verdict logic:** - **CLEARED**: Eng Review has >= 1 entry within 7 days with status "clean" (or \`skip_eng_review\` is \`true\`) diff --git a/.agents/skills/gstack-review/SKILL.md b/.agents/skills/gstack-review/SKILL.md index 8a1e6fbc..3e26d27f 100644 --- a/.agents/skills/gstack-review/SKILL.md +++ b/.agents/skills/gstack-review/SKILL.md @@ -488,54 +488,7 @@ If no documentation files exist, skip this step silently. --- -## Step 5.7: Codex second opinion (optional) -After completing the review, check if the Codex CLI is available: - -```bash -which codex 2>/dev/null && echo "CODEX_AVAILABLE" || echo "CODEX_NOT_AVAILABLE" -``` - -If Codex is available, use AskUserQuestion: - -``` -Review complete. Want an independent second opinion from Codex (OpenAI)? 
- -A) Run Codex code review — independent diff review with pass/fail gate -B) Run Codex adversarial challenge — try to find ways this code will fail in production -C) Both — review first, then adversarial challenge -D) Skip — no Codex review needed -``` - -If the user chooses A, B, or C: - -**For code review (A or C):** Run `codex review --base ` with a 5-minute timeout. -Present the full output verbatim under a `CODEX SAYS (code review):` header. -Check the output for `[P1]` markers — if found, note `GATE: FAIL`, otherwise `GATE: PASS`. -After presenting, compare Codex's findings with your own review findings from Steps 4-5 -and output a CROSS-MODEL ANALYSIS showing what both found, what only Codex found, -and what only Claude found. - -**For adversarial challenge (B or C):** Run: -```bash -codex exec "Review the changes on this branch against the base branch. Run git diff origin/ to see the diff. Your job is to find ways this code will fail in production. Think like an attacker and a chaos engineer. Find edge cases, race conditions, security holes, failure modes. Be adversarial." -s read-only -``` -Present the full output verbatim under a `CODEX SAYS (adversarial challenge):` header. - -**Only if a code review ran (user chose A or C):** Persist the Codex review result to the review log: -```bash -~/.codex/skills/gstack/bin/gstack-review-log '{"skill":"codex-review","timestamp":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'","status":"STATUS","gate":"GATE"}' -``` - -Substitute: STATUS ("clean" if PASS, "issues_found" if FAIL), GATE ("pass" or "fail"). - -**Do NOT persist a codex-review entry when only the adversarial challenge (B) ran** — -there is no gate verdict to record, and a false entry would make the Review Readiness -Dashboard believe a code review happened when it didn't. - -If Codex is not available, skip this step silently. 
- ---- ## Important Rules diff --git a/.agents/skills/gstack-ship/SKILL.md b/.agents/skills/gstack-ship/SKILL.md index 08e7f063..2b6d9d83 100644 --- a/.agents/skills/gstack-ship/SKILL.md +++ b/.agents/skills/gstack-ship/SKILL.md @@ -309,7 +309,7 @@ Parse the output. Find the most recent entry for each skill (plan-ceo-review, pl - **Eng Review (required by default):** The only review that gates shipping. Covers architecture, code quality, tests, performance. Can be disabled globally with \`gstack-config set skip_eng_review true\` (the "don't bother me" setting). - **CEO Review (optional):** Use your judgment. Recommend it for big product/business changes, new user-facing features, or scope decisions. Skip for bug fixes, refactors, infra, and cleanup. - **Design Review (optional):** Use your judgment. Recommend it for UI/UX changes. Skip for backend-only, infra, or prompt-only changes. -- **Codex Review (optional):** Independent second opinion from OpenAI Codex CLI. Shows pass/fail gate. Recommend for critical code changes where a second AI perspective adds value. Skip when Codex CLI is not installed. +- **Codex Review (enabled by default when Codex CLI is installed):** Independent review + adversarial challenge from OpenAI Codex CLI. Shows pass/fail gate. Runs automatically when enabled — configure with \`gstack-config set codex_reviews enabled|disabled\`. **Verdict logic:** - **CLEARED**: Eng Review has >= 1 entry within 7 days with status "clean" (or \`skip_eng_review\` is \`true\`) @@ -949,43 +949,7 @@ For each classified comment: --- -## Step 3.8: Codex second opinion (optional) -Check if the Codex CLI is available: - -```bash -which codex 2>/dev/null && echo "CODEX_AVAILABLE" || echo "CODEX_NOT_AVAILABLE" -``` - -If Codex is available, use AskUserQuestion: - -``` -Pre-landing review complete. Want an independent Codex (OpenAI) review before shipping? 
- -A) Run Codex code review — independent diff review with pass/fail gate -B) Run Codex adversarial challenge — try to break this code -C) Skip — ship without Codex review -``` - -If the user chooses A or B: - -**For code review (A):** Run `codex review --base ` with a 5-minute timeout. -Present the full output verbatim under a `CODEX SAYS:` header. Check for `[P1]` markers -to determine pass/fail gate. Persist the result: - -```bash -~/.codex/skills/gstack/bin/gstack-review-log '{"skill":"codex-review","timestamp":"TIMESTAMP","status":"STATUS","gate":"GATE"}' -``` - -If GATE is FAIL, use AskUserQuestion: "Codex found critical issues. Ship anyway?" -If the user says no, stop. If yes, continue to Step 4. - -**For adversarial (B):** Run codex exec with the adversarial prompt (see /codex skill). -Present findings. This is informational — does not block shipping. - -If Codex is not available, skip silently. Continue to Step 4. - ---- ## Step 4: Version bump (auto-decide) @@ -1226,7 +1190,7 @@ doc updates — the user runs `/ship` and documentation stays current without a - **Never skip tests.** If tests fail, stop. - **Never skip the pre-landing review.** If checklist.md is unreadable, stop. - **Never force push.** Use regular `git push` only. -- **Never ask for confirmation** except for MINOR/MAJOR version bumps and pre-landing review ASK items (batched into at most one AskUserQuestion). +- **Never ask for trivial confirmations** (e.g., "ready to push?", "create PR?"). DO stop for: version bumps (MINOR/MAJOR), pre-landing review findings (ASK items), Codex critical findings ([P1]), and the one-time Codex adoption prompt. - **Always use the 4-digit version format** from the VERSION file. - **Date format in CHANGELOG:** `YYYY-MM-DD` - **Split commits for bisectability** — each commit = one logical change. 
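The 4-digit version rule above, made concrete. A hedged sketch: `bump_last` is a hypothetical helper, and which segment a given change bumps is decided by /ship's auto-decide step, not by this snippet.

```shell
# Hypothetical helper: increment the last segment of the 4-digit
# VERSION format (e.g. 0.9.4.0 -> 0.9.4.1). MAJOR/MINOR decisions
# still belong to /ship's version-bump step.
bump_last() {
  _old_ifs=$IFS
  IFS=.
  set -- $1
  IFS=$_old_ifs
  echo "$1.$2.$3.$(($4 + 1))"
}

bump_last "0.9.4.0"   # -> 0.9.4.1
```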
diff --git a/.agents/skills/gstack/SKILL.md b/.agents/skills/gstack/SKILL.md
index 9e13bfd2..b24d38e2 100644
--- a/.agents/skills/gstack/SKILL.md
+++ b/.agents/skills/gstack/SKILL.md
@@ -500,7 +500,7 @@ The snapshot is your primary tool for understanding and interacting with pages.
-s --selector Scope to CSS selector
-D --diff Unified diff against previous snapshot (first call stores baseline)
-a --annotate Annotated screenshot with red overlay boxes and ref labels
--o --output Output path for annotated screenshot (default: /tmp/browse-annotated.png)
+-o --output Output path for annotated screenshot (default: browse-annotated.png in the platform temp directory)
-C --cursor-interactive Cursor-interactive elements (@c refs — divs with pointer, onclick)
```
diff --git a/CHANGELOG.md b/CHANGELOG.md
index 74572e22..2fa0a844 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,5 +1,47 @@
# Changelog
+## [0.9.4.0] - 2026-03-20 — Codex Reviews On By Default
+
+### Changed
+
+- **Codex code reviews now run automatically in `/ship` and `/review`.** No more "want a second opinion?" prompt every time — Codex reviews both your code (with a pass/fail gate) and runs an adversarial challenge by default. First-time users get a one-time opt-in prompt; after that, it's hands-free. Configure with `gstack-config set codex_reviews enabled|disabled`.
+- **All Codex operations use maximum reasoning power.** Review, adversarial, and consult modes all use `xhigh` reasoning effort — when an AI is reviewing your code, you want it thinking as hard as possible.
+- **Codex review errors can't corrupt the dashboard.** Auth failures, timeouts, and empty responses are now detected before logging results, so the Review Readiness Dashboard never shows a false "passed" entry. Adversarial stderr is captured separately.
+- **Codex review log includes commit hash.** Staleness detection now works correctly for Codex reviews, matching the same commit-tracking behavior as eng/CEO/design reviews.
+ +### Fixed + +- **Codex-for-Codex recursion prevented.** When gstack runs inside Codex CLI (`.agents/skills/`), the Codex review step is completely stripped — no accidental infinite loops. + +## [0.9.3.0] - 2026-03-20 — Windows Support + +### Fixed + +- **gstack now works on Windows 11.** Setup no longer hangs when verifying Playwright, and the browse server automatically falls back to Node.js to work around a Bun pipe-handling bug on Windows ([bun#4253](https://github.com/oven-sh/bun/issues/4253)). Just make sure Node.js is installed alongside Bun. macOS and Linux are completely unaffected. +- **Path handling works on Windows.** All hardcoded `/tmp` paths and Unix-style path separators now use platform-aware equivalents via a new `platform.ts` module. Path traversal protection works correctly with Windows backslash separators. + +### Added + +- **Bun API polyfill for Node.js.** When the browse server runs under Node.js on Windows, a compatibility layer provides `Bun.serve()`, `Bun.spawn()`, `Bun.spawnSync()`, and `Bun.sleep()` equivalents. Fully tested. +- **Node server build script.** `browse/scripts/build-node-server.sh` transpiles the server for Node.js, stubs `bun:sqlite`, and injects the polyfill — all automated during `bun run build`. + +## [0.9.2.0] - 2026-03-20 — Gemini CLI E2E Tests + +### Added + +- **Gemini CLI is now tested end-to-end.** Two E2E tests verify that gstack skills work when invoked by Google's Gemini CLI (`gemini -p`). The `gemini-discover-skill` test confirms skill discovery from `.agents/skills/`, and `gemini-review-findings` runs a full code review via gstack-review. Both parse Gemini's stream-json NDJSON output and track token usage. +- **Gemini JSONL parser with 10 unit tests.** `parseGeminiJSONL` handles all Gemini event types (init, message, tool_use, tool_result, result) with defensive parsing for malformed input. The parser is a pure function, independently testable without spawning the CLI. 
+- **`bun run test:gemini`** and **`bun run test:gemini:all`** scripts for running Gemini E2E tests independently. Gemini tests are also included in `test:evals` and `test:e2e` aggregate scripts. + +## [0.9.1.0] - 2026-03-20 — Adversarial Spec Review + Skill Chaining + +### Added + +- **Your design docs now get stress-tested before you see them.** When you run `/office-hours`, an independent AI reviewer checks your design doc for completeness, consistency, clarity, scope creep, and feasibility — up to 3 rounds. You get a quality score (1-10) and a summary of what was caught and fixed. The doc you approve has already survived adversarial review. +- **Visual wireframes during brainstorming.** For UI ideas, `/office-hours` now generates a rough HTML wireframe using your project's design system (from DESIGN.md) and screenshots it. You see what you're designing while you're still thinking, not after you've coded it. +- **Skills help each other now.** `/plan-ceo-review` and `/plan-eng-review` detect when you'd benefit from running `/office-hours` first and offer it — one-tap to switch, one-tap to decline. If you seem lost during a CEO review, it'll gently suggest brainstorming first. +- **Spec review metrics.** Every adversarial review logs iterations, issues found/fixed, and quality score to `~/.gstack/analytics/spec-review.jsonl`. Over time, you can see if your design docs are getting better. 
+ ## [0.9.0.1] - 2026-03-19 ### Changed diff --git a/README.md b/README.md index b7ddb7d1..07047797 100644 --- a/README.md +++ b/README.md @@ -42,7 +42,7 @@ Expect first useful run in under 5 minutes on any repo with tests already set up ## Install — takes 30 seconds -**Requirements:** [Claude Code](https://docs.anthropic.com/en/docs/claude-code), [Git](https://git-scm.com/), [Bun](https://bun.sh/) v1.0+ +**Requirements:** [Claude Code](https://docs.anthropic.com/en/docs/claude-code), [Git](https://git-scm.com/), [Bun](https://bun.sh/) v1.0+, [Node.js](https://nodejs.org/) (Windows only) ### Step 1: Install on your machine @@ -238,6 +238,8 @@ Data is stored in [Supabase](https://supabase.com) (open source Firebase alterna **Stale install?** Run `/gstack-upgrade` — or set `auto_upgrade: true` in `~/.gstack/config.yaml` +**Windows users:** gstack works on Windows 11 via Git Bash or WSL. Node.js is required in addition to Bun — Bun has a known bug with Playwright's pipe transport on Windows ([bun#4253](https://github.com/oven-sh/bun/issues/4253)). The browse server automatically falls back to Node.js. Make sure both `bun` and `node` are on your PATH. + **Claude says it can't see the skills?** Make sure your project's `CLAUDE.md` has a gstack section. Add this: ``` diff --git a/SKILL.md b/SKILL.md index 89316789..4baa33f3 100644 --- a/SKILL.md +++ b/SKILL.md @@ -506,7 +506,7 @@ The snapshot is your primary tool for understanding and interacting with pages. 
-s --selector Scope to CSS selector
-D --diff Unified diff against previous snapshot (first call stores baseline)
-a --annotate Annotated screenshot with red overlay boxes and ref labels
--o --output Output path for annotated screenshot (default: /tmp/browse-annotated.png)
+-o --output Output path for annotated screenshot (default: <tmpdir>/browse-annotated.png)
-C --cursor-interactive Cursor-interactive elements (@c refs — divs with pointer, onclick)
```

diff --git a/VERSION b/VERSION
index 15e36e66..3544d2f0 100644
--- a/VERSION
+++ b/VERSION
@@ -1 +1 @@
-0.9.0.1
+0.9.4.0
diff --git a/browse/SKILL.md b/browse/SKILL.md
index db7edbd6..6b370253 100644
--- a/browse/SKILL.md
+++ b/browse/SKILL.md
@@ -378,7 +378,7 @@ The snapshot is your primary tool for understanding and interacting with pages.
-s --selector Scope to CSS selector
-D --diff Unified diff against previous snapshot (first call stores baseline)
-a --annotate Annotated screenshot with red overlay boxes and ref labels
--o --output Output path for annotated screenshot (default: /tmp/browse-annotated.png)
+-o --output Output path for annotated screenshot (default: <tmpdir>/browse-annotated.png)
-C --cursor-interactive Cursor-interactive elements (@c refs — divs with pointer, onclick)
```

diff --git a/browse/scripts/build-node-server.sh b/browse/scripts/build-node-server.sh
new file mode 100755
index 00000000..539e391c
--- /dev/null
+++ b/browse/scripts/build-node-server.sh
@@ -0,0 +1,48 @@
+#!/usr/bin/env bash
+# Build a Node.js-compatible server bundle for Windows.
+#
+# On Windows, Bun can't launch or connect to Playwright's Chromium
+# (oven-sh/bun#4253, #9911). This script produces a server bundle
+# that runs under Node.js with Bun API polyfills.
+
+set -e
+
+GSTACK_DIR="$(cd "$(dirname "$0")/../.." && pwd)"
+SRC_DIR="$GSTACK_DIR/browse/src"
+DIST_DIR="$GSTACK_DIR/browse/dist"
+
+echo "Building Node-compatible server bundle..."
+
+# Step 1: Transpile server.ts to a single .mjs bundle (externalize runtime deps)
+bun build "$SRC_DIR/server.ts" \
+  --target=node \
+  --outfile "$DIST_DIR/server-node.mjs" \
+  --external playwright \
+  --external playwright-core \
+  --external diff \
+  --external "bun:sqlite"
+
+# Step 2: Post-process
+# Replace import.meta.dir with a resolvable reference
+perl -pi -e 's/import\.meta\.dir/__browseNodeSrcDir/g' "$DIST_DIR/server-node.mjs"
+# Stub out bun:sqlite (macOS-only cookie import, not needed on Windows)
+perl -pi -e 's|import { Database } from "bun:sqlite";|const Database = null; // bun:sqlite stubbed on Node|g' "$DIST_DIR/server-node.mjs"
+
+# Step 3: Create the final file with polyfill header injected after the first line
+{
+  head -1 "$DIST_DIR/server-node.mjs"
+  echo '// ── Windows Node.js compatibility (auto-generated) ──'
+  echo 'import { fileURLToPath as _ftp } from "node:url";'
+  echo 'import { dirname as _dn } from "node:path";'
+  echo 'import { createRequire } from "node:module";'
+  echo 'const __browseNodeSrcDir = _dn(_dn(_ftp(import.meta.url))) + "/src";'
+  echo '{ const _r = createRequire(import.meta.url); _r("./bun-polyfill.cjs"); }'
+  echo '// ── end compatibility ──'
+  tail -n +2 "$DIST_DIR/server-node.mjs"
+} > "$DIST_DIR/server-node.tmp.mjs"
+
+mv "$DIST_DIR/server-node.tmp.mjs" "$DIST_DIR/server-node.mjs"
+
+# Step 4: Copy polyfill to dist/
+cp "$SRC_DIR/bun-polyfill.cjs" "$DIST_DIR/bun-polyfill.cjs"
+
+echo "Node server bundle ready: $DIST_DIR/server-node.mjs"
diff --git a/browse/src/bun-polyfill.cjs b/browse/src/bun-polyfill.cjs
new file mode 100644
index 00000000..e0ada11b
--- /dev/null
+++ b/browse/src/bun-polyfill.cjs
@@ -0,0 +1,109 @@
+/**
+ * Bun API polyfill for Node.js — Windows compatibility layer.
+ *
+ * On Windows, Bun can't launch or connect to Playwright's Chromium
+ * (oven-sh/bun#4253, #9911). The browse server falls back to running
+ * under Node.js with this polyfill providing Bun API equivalents.
+ *
+ * Loaded via createRequire from the compatibility header injected into the transpiled server bundle.
+ */ + +'use strict'; + +const http = require('http'); +const { spawnSync, spawn } = require('child_process'); + +globalThis.Bun = { + serve(options) { + const { port, hostname = '127.0.0.1', fetch } = options; + + const server = http.createServer(async (nodeReq, nodeRes) => { + try { + const url = `http://${hostname}:${port}${nodeReq.url}`; + const headers = new Headers(); + for (const [key, val] of Object.entries(nodeReq.headers)) { + if (val) headers.set(key, Array.isArray(val) ? val[0] : val); + } + + let body = null; + if (nodeReq.method !== 'GET' && nodeReq.method !== 'HEAD') { + body = await new Promise((resolve) => { + const chunks = []; + nodeReq.on('data', (chunk) => chunks.push(chunk)); + nodeReq.on('end', () => resolve(Buffer.concat(chunks))); + }); + } + + const webReq = new Request(url, { + method: nodeReq.method, + headers, + body, + }); + + const webRes = await fetch(webReq); + + nodeRes.statusCode = webRes.status; + webRes.headers.forEach((val, key) => { + nodeRes.setHeader(key, val); + }); + + const resBody = await webRes.arrayBuffer(); + nodeRes.end(Buffer.from(resBody)); + } catch (err) { + nodeRes.statusCode = 500; + nodeRes.end(JSON.stringify({ error: err.message })); + } + }); + + server.listen(port, hostname); + + return { + stop() { server.close(); }, + port, + hostname, + }; + }, + + spawnSync(cmd, options = {}) { + const [command, ...args] = cmd; + const result = spawnSync(command, args, { + stdio: [ + options.stdin || 'pipe', + options.stdout === 'pipe' ? 'pipe' : 'ignore', + options.stderr === 'pipe' ? 
'pipe' : 'ignore',
+      ],
+      timeout: options.timeout,
+      env: options.env,
+      cwd: options.cwd,
+    });
+
+    return {
+      exitCode: result.status,
+      stdout: result.stdout || Buffer.from(''),
+      stderr: result.stderr || Buffer.from(''),
+    };
+  },
+
+  spawn(cmd, options = {}) {
+    const [command, ...args] = cmd;
+    const stdio = options.stdio || ['pipe', 'pipe', 'pipe'];
+    const proc = spawn(command, args, {
+      stdio,
+      env: options.env,
+      cwd: options.cwd,
+    });
+
+    return {
+      pid: proc.pid,
+      stdout: proc.stdout,
+      stderr: proc.stderr,
+      stdin: proc.stdin,
+      unref() { proc.unref(); },
+      kill(signal) { proc.kill(signal); },
+    };
+  },
+
+  sleep(ms) {
+    return new Promise((resolve) => setTimeout(resolve, ms));
+  },
+};
diff --git a/browse/src/cli.ts b/browse/src/cli.ts
index 7d6eacdf..830b2e7c 100644
--- a/browse/src/cli.ts
+++ b/browse/src/cli.ts
@@ -14,7 +14,8 @@ import * as path from 'path';
 import { resolveConfig, ensureStateDir, readVersionHash } from './config';
 
 const config = resolveConfig();
-const MAX_START_WAIT = 8000; // 8 seconds to start
+const IS_WINDOWS = process.platform === 'win32';
+const MAX_START_WAIT = IS_WINDOWS ? 15000 : 8000; // Node+Chromium takes longer on Windows
 
 export function resolveServerScript(
   env: Record<string, string | undefined> = process.env,
@@ -26,7 +27,9 @@
 }
 
   // Dev mode: cli.ts runs directly from browse/src
-  if (metaDir.startsWith('/') && !metaDir.includes('$bunfs')) {
+  // On macOS/Linux, import.meta.dir starts with /
+  // On Windows, it starts with a drive letter (e.g., C:\...)
+  if (!metaDir.includes('$bunfs')) {
     const direct = path.resolve(metaDir, 'server.ts');
     if (fs.existsSync(direct)) {
       return direct;
@@ -48,6 +51,31 @@
 
 const SERVER_SCRIPT = resolveServerScript();
 
+/**
+ * On Windows, resolve the Node.js-compatible server bundle.
+ * Falls back to null if not found (server will use Bun instead).
+ */ +export function resolveNodeServerScript( + metaDir: string = import.meta.dir, + execPath: string = process.execPath +): string | null { + // Dev mode + if (!metaDir.includes('$bunfs')) { + const distScript = path.resolve(metaDir, '..', 'dist', 'server-node.mjs'); + if (fs.existsSync(distScript)) return distScript; + } + + // Compiled binary: browse/dist/browse → browse/dist/server-node.mjs + if (execPath) { + const adjacent = path.resolve(path.dirname(execPath), 'server-node.mjs'); + if (fs.existsSync(adjacent)) return adjacent; + } + + return null; +} + +const NODE_SERVER_SCRIPT = IS_WINDOWS ? resolveNodeServerScript() : null; + interface ServerState { pid: number; port: number; @@ -139,8 +167,14 @@ async function startServer(): Promise { // Clean up stale state file try { fs.unlinkSync(config.stateFile); } catch {} - // Start server as detached background process - const proc = Bun.spawn(['bun', 'run', SERVER_SCRIPT], { + // Start server as detached background process. + // On Windows, Bun can't launch/connect to Playwright's Chromium (oven-sh/bun#4253, #9911). + // Fall back to running the server under Node.js with Bun API polyfills. + const useNode = IS_WINDOWS && NODE_SERVER_SCRIPT; + const serverCmd = useNode + ? 
['node', NODE_SERVER_SCRIPT] + : ['bun', 'run', SERVER_SCRIPT]; + const proc = Bun.spawn(serverCmd, { stdio: ['ignore', 'pipe', 'pipe'], env: { ...process.env, BROWSE_STATE_FILE: config.stateFile }, }); diff --git a/browse/src/meta-commands.ts b/browse/src/meta-commands.ts index 049ed69a..f1ebdea8 100644 --- a/browse/src/meta-commands.ts +++ b/browse/src/meta-commands.ts @@ -10,13 +10,14 @@ import { validateNavigationUrl } from './url-validation'; import * as Diff from 'diff'; import * as fs from 'fs'; import * as path from 'path'; +import { TEMP_DIR, isPathWithin } from './platform'; // Security: Path validation to prevent path traversal attacks -const SAFE_DIRECTORIES = ['/tmp', process.cwd()]; +const SAFE_DIRECTORIES = [TEMP_DIR, process.cwd()]; export function validateOutputPath(filePath: string): void { const resolved = path.resolve(filePath); - const isSafe = SAFE_DIRECTORIES.some(dir => resolved === dir || resolved.startsWith(dir + '/')); + const isSafe = SAFE_DIRECTORIES.some(dir => isPathWithin(resolved, dir)); if (!isSafe) { throw new Error(`Path must be within: ${SAFE_DIRECTORIES.join(', ')}`); } @@ -88,7 +89,7 @@ export async function handleMetaCommand( case 'screenshot': { // Parse priority: flags (--viewport, --clip) → selector (@ref, CSS) → output path const page = bm.getPage(); - let outputPath = '/tmp/browse-screenshot.png'; + let outputPath = `${TEMP_DIR}/browse-screenshot.png`; let clipRect: { x: number; y: number; width: number; height: number } | undefined; let targetSelector: string | undefined; let viewportOnly = false; @@ -147,7 +148,7 @@ export async function handleMetaCommand( case 'pdf': { const page = bm.getPage(); - const pdfPath = args[0] || '/tmp/browse-page.pdf'; + const pdfPath = args[0] || `${TEMP_DIR}/browse-page.pdf`; validateOutputPath(pdfPath); await page.pdf({ path: pdfPath, format: 'A4' }); return `PDF saved: ${pdfPath}`; @@ -155,7 +156,7 @@ export async function handleMetaCommand( case 'responsive': { const page = 
bm.getPage(); - const prefix = args[0] || '/tmp/browse-responsive'; + const prefix = args[0] || `${TEMP_DIR}/browse-responsive`; validateOutputPath(prefix); const viewports = [ { name: 'mobile', width: 375, height: 812 }, diff --git a/browse/src/platform.ts b/browse/src/platform.ts new file mode 100644 index 00000000..c022b1d6 --- /dev/null +++ b/browse/src/platform.ts @@ -0,0 +1,17 @@ +/** + * Cross-platform constants for gstack browse. + * + * On macOS/Linux: TEMP_DIR = '/tmp', path.sep = '/' — identical to hardcoded values. + * On Windows: TEMP_DIR = os.tmpdir(), path.sep = '\\' — correct Windows behavior. + */ + +import * as os from 'os'; +import * as path from 'path'; + +export const IS_WINDOWS = process.platform === 'win32'; +export const TEMP_DIR = IS_WINDOWS ? os.tmpdir() : '/tmp'; + +/** Check if resolvedPath is within dir, using platform-aware separators. */ +export function isPathWithin(resolvedPath: string, dir: string): boolean { + return resolvedPath === dir || resolvedPath.startsWith(dir + path.sep); +} diff --git a/browse/src/read-commands.ts b/browse/src/read-commands.ts index e9823325..fad4e78c 100644 --- a/browse/src/read-commands.ts +++ b/browse/src/read-commands.ts @@ -10,6 +10,7 @@ import { consoleBuffer, networkBuffer, dialogBuffer } from './buffers'; import type { Page } from 'playwright'; import * as fs from 'fs'; import * as path from 'path'; +import { TEMP_DIR, isPathWithin } from './platform'; /** Detect await keyword, ignoring comments. Accepted risk: await in string literals triggers wrapping (harmless). 
*/
function hasAwait(code: string): boolean {
@@ -36,12 +37,12 @@ function wrapForEvaluate(code: string): string {
 }
 
 // Security: Path validation to prevent path traversal attacks
-const SAFE_DIRECTORIES = ['/tmp', process.cwd()];
+const SAFE_DIRECTORIES = [TEMP_DIR, process.cwd()];
 
 export function validateReadPath(filePath: string): void {
   if (path.isAbsolute(filePath)) {
     const resolved = path.resolve(filePath);
-    const isSafe = SAFE_DIRECTORIES.some(dir => resolved === dir || resolved.startsWith(dir + '/'));
+    const isSafe = SAFE_DIRECTORIES.some(dir => isPathWithin(resolved, dir));
     if (!isSafe) {
       throw new Error(`Absolute path must be within: ${SAFE_DIRECTORIES.join(', ')}`);
     }
diff --git a/browse/src/snapshot.ts b/browse/src/snapshot.ts
index db1dfc7c..24380bad 100644
--- a/browse/src/snapshot.ts
+++ b/browse/src/snapshot.ts
@@ -20,6 +20,7 @@
 import type { Page, Locator } from 'playwright';
 import type { BrowserManager, RefEntry } from './browser-manager';
 import * as Diff from 'diff';
+import { TEMP_DIR, isPathWithin } from './platform';
 
 // Roles considered "interactive" for the -i flag
 const INTERACTIVE_ROLES = new Set([
@@ -61,7 +62,7 @@ export const SNAPSHOT_FLAGS: Array<{
   { short: '-s', long: '--selector', description: 'Scope to CSS selector', takesValue: true, valueHint: '', optionKey: 'selector' },
   { short: '-D', long: '--diff', description: 'Unified diff against previous snapshot (first call stores baseline)', optionKey: 'diff' },
   { short: '-a', long: '--annotate', description: 'Annotated screenshot with red overlay boxes and ref labels', optionKey: 'annotate' },
-  { short: '-o', long: '--output', description: 'Output path for annotated screenshot (default: /tmp/browse-annotated.png)', takesValue: true, valueHint: '', optionKey: 'outputPath' },
+  { short: '-o', long: '--output', description: `Output path for annotated screenshot (default: ${TEMP_DIR}/browse-annotated.png)`, takesValue: true, valueHint: '', optionKey: 'outputPath' },
   { short: '-C', long: 
'--cursor-interactive', description: 'Cursor-interactive elements (@c refs — divs with pointer, onclick)', optionKey: 'cursorInteractive' },
 ];
@@ -308,11 +309,11 @@ export async function handleSnapshot(
   // ─── Annotated screenshot (-a) ────────────────────────────
   if (opts.annotate) {
-    const screenshotPath = opts.outputPath || '/tmp/browse-annotated.png';
+    const screenshotPath = opts.outputPath || `${TEMP_DIR}/browse-annotated.png`;
     // Validate output path (consistent with screenshot/pdf/responsive)
     const resolvedPath = require('path').resolve(screenshotPath);
-    const safeDirs = ['/tmp', process.cwd()];
-    if (!safeDirs.some((dir: string) => resolvedPath === dir || resolvedPath.startsWith(dir + '/'))) {
+    const safeDirs = [TEMP_DIR, process.cwd()];
+    if (!safeDirs.some((dir: string) => isPathWithin(resolvedPath, dir))) {
       throw new Error(`Path must be within: ${safeDirs.join(', ')}`);
     }
     try {
diff --git a/browse/src/write-commands.ts b/browse/src/write-commands.ts
index 26a46a4b..1bf37eb5 100644
--- a/browse/src/write-commands.ts
+++ b/browse/src/write-commands.ts
@@ -10,6 +10,7 @@
 import { findInstalledBrowsers, importCookies } from './cookie-import-browser';
 import { validateNavigationUrl } from './url-validation';
 import * as fs from 'fs';
 import * as path from 'path';
+import { TEMP_DIR, isPathWithin } from './platform';
 
 export async function handleWriteCommand(
   command: string,
@@ -277,9 +278,9 @@
     if (!filePath) throw new Error('Usage: browse cookie-import <path>');
     // Path validation — prevent reading arbitrary files
     if (path.isAbsolute(filePath)) {
-      const safeDirs = ['/tmp', process.cwd()];
+      const safeDirs = [TEMP_DIR, process.cwd()];
       const resolved = path.resolve(filePath);
-      if (!safeDirs.some(dir => resolved === dir || resolved.startsWith(dir + '/'))) {
+      if (!safeDirs.some(dir => isPathWithin(resolved, dir))) {
         throw new Error(`Path must be within: ${safeDirs.join(', ')}`);
       }
     }
diff --git 
a/browse/test/bun-polyfill.test.ts b/browse/test/bun-polyfill.test.ts new file mode 100644 index 00000000..7ca25dfa --- /dev/null +++ b/browse/test/bun-polyfill.test.ts @@ -0,0 +1,72 @@ +import { describe, test, expect, afterAll } from 'bun:test'; +import * as path from 'path'; + +// Load the polyfill into a fresh object (don't clobber globalThis.Bun) +const polyfillPath = path.resolve(import.meta.dir, '../src/bun-polyfill.cjs'); + +describe('bun-polyfill', () => { + // We test the polyfill by requiring it in a subprocess under Node.js + // since it's designed for Node, not Bun. + + test('Bun.sleep resolves after delay', async () => { + const result = Bun.spawnSync(['node', '-e', ` + require('${polyfillPath}'); + (async () => { + const start = Date.now(); + await Bun.sleep(50); + const elapsed = Date.now() - start; + console.log(elapsed >= 40 ? 'OK' : 'TOO_FAST'); + })(); + `], { stdout: 'pipe', stderr: 'pipe' }); + expect(result.stdout.toString().trim()).toBe('OK'); + expect(result.exitCode).toBe(0); + }); + + test('Bun.spawnSync runs a command and returns stdout', () => { + const result = Bun.spawnSync(['node', '-e', ` + require('${polyfillPath}'); + const r = Bun.spawnSync(['echo', 'hello'], { stdout: 'pipe' }); + console.log(r.stdout.toString().trim()); + console.log('exit:' + r.exitCode); + `], { stdout: 'pipe', stderr: 'pipe' }); + const lines = result.stdout.toString().trim().split('\n'); + expect(lines[0]).toBe('hello'); + expect(lines[1]).toBe('exit:0'); + }); + + test('Bun.spawn launches a process with pid', async () => { + const result = Bun.spawnSync(['node', '-e', ` + require('${polyfillPath}'); + const p = Bun.spawn(['echo', 'test'], { stdio: ['pipe', 'pipe', 'pipe'] }); + console.log(typeof p.pid === 'number' ? 'HAS_PID' : 'NO_PID'); + console.log(typeof p.kill === 'function' ? 'HAS_KILL' : 'NO_KILL'); + console.log(typeof p.unref === 'function' ? 
'HAS_UNREF' : 'NO_UNREF'); + `], { stdout: 'pipe', stderr: 'pipe' }); + const lines = result.stdout.toString().trim().split('\n'); + expect(lines[0]).toBe('HAS_PID'); + expect(lines[1]).toBe('HAS_KILL'); + expect(lines[2]).toBe('HAS_UNREF'); + }); + + test('Bun.serve creates an HTTP server that responds', async () => { + const result = Bun.spawnSync(['node', '-e', ` + require('${polyfillPath}'); + const server = Bun.serve({ + port: 0, // Note: polyfill uses port directly, so we pick one + hostname: '127.0.0.1', + fetch(req) { + return new Response(JSON.stringify({ ok: true }), { + headers: { 'Content-Type': 'application/json' }, + }); + }, + }); + // The polyfill doesn't support port 0, so we test the object shape + console.log(typeof server.stop === 'function' ? 'HAS_STOP' : 'NO_STOP'); + console.log(typeof server.port === 'number' ? 'HAS_PORT' : 'NO_PORT'); + server.stop(); + `], { stdout: 'pipe', stderr: 'pipe' }); + const lines = result.stdout.toString().trim().split('\n'); + expect(lines[0]).toBe('HAS_STOP'); + expect(lines[1]).toBe('HAS_PORT'); + }); +}); diff --git a/browse/test/config.test.ts b/browse/test/config.test.ts index 12892ce4..0cbe47fa 100644 --- a/browse/test/config.test.ts +++ b/browse/test/config.test.ts @@ -197,6 +197,36 @@ describe('resolveServerScript', () => { }); }); +describe('resolveNodeServerScript', () => { + const { resolveNodeServerScript } = require('../src/cli'); + + test('finds server-node.mjs in dist from dev mode', () => { + const srcDir = path.resolve(__dirname, '../src'); + const distFile = path.resolve(srcDir, '..', 'dist', 'server-node.mjs'); + const fs = require('fs'); + // Only test if the file exists (it may not be built yet) + if (fs.existsSync(distFile)) { + const result = resolveNodeServerScript(srcDir, ''); + expect(result).toBe(distFile); + } + }); + + test('returns null when server-node.mjs does not exist', () => { + const result = resolveNodeServerScript('/nonexistent/$bunfs', '/nonexistent/browse'); + 
expect(result).toBeNull(); + }); + + test('finds server-node.mjs adjacent to compiled binary', () => { + const distDir = path.resolve(__dirname, '../dist'); + const distFile = path.join(distDir, 'server-node.mjs'); + const fs = require('fs'); + if (fs.existsSync(distFile)) { + const result = resolveNodeServerScript('/$bunfs/something', path.join(distDir, 'browse')); + expect(result).toBe(distFile); + } + }); +}); + describe('version mismatch detection', () => { test('detects when versions differ', () => { const stateVersion = 'abc123'; diff --git a/browse/test/platform.test.ts b/browse/test/platform.test.ts new file mode 100644 index 00000000..fb6c64b9 --- /dev/null +++ b/browse/test/platform.test.ts @@ -0,0 +1,37 @@ +import { describe, test, expect } from 'bun:test'; +import { TEMP_DIR, isPathWithin, IS_WINDOWS } from '../src/platform'; + +describe('platform constants', () => { + test('TEMP_DIR is /tmp on non-Windows', () => { + if (!IS_WINDOWS) { + expect(TEMP_DIR).toBe('/tmp'); + } + }); + + test('IS_WINDOWS reflects process.platform', () => { + expect(IS_WINDOWS).toBe(process.platform === 'win32'); + }); +}); + +describe('isPathWithin', () => { + test('path inside directory returns true', () => { + expect(isPathWithin('/tmp/foo', '/tmp')).toBe(true); + }); + + test('path outside directory returns false', () => { + expect(isPathWithin('/etc/foo', '/tmp')).toBe(false); + }); + + test('exact match returns true', () => { + expect(isPathWithin('/tmp', '/tmp')).toBe(true); + }); + + test('partial prefix does not match (path traversal)', () => { + // /tmp-evil should NOT match /tmp + expect(isPathWithin('/tmp-evil/foo', '/tmp')).toBe(false); + }); + + test('nested path returns true', () => { + expect(isPathWithin('/tmp/a/b/c', '/tmp')).toBe(true); + }); +}); diff --git a/codex/SKILL.md b/codex/SKILL.md index cf9493f0..71d9bbe9 100644 --- a/codex/SKILL.md +++ b/codex/SKILL.md @@ -314,13 +314,13 @@ TMPERR=$(mktemp /tmp/codex-err-XXXXXX.txt) 2. 
Run the review (5-minute timeout):

```bash
-codex review --base <branch> -c 'model_reasoning_effort="high"' --enable web_search_cached 2>"$TMPERR"
+codex review --base <branch> -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR"
```

Use `timeout: 300000` on the Bash call. If the user provided custom instructions (e.g., `/codex review focus on security`), pass them as the prompt argument:

```bash
-codex review "focus on security" --base <branch> -c 'model_reasoning_effort="high"' --enable web_search_cached 2>"$TMPERR"
+codex review "focus on security" --base <branch> -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR"
```

3. Capture the output. Then parse cost from stderr:

@@ -475,7 +475,7 @@ THE PLAN:

For a **new session:**

```bash
-codex exec "<prompt>" -s read-only -c 'model_reasoning_effort="high"' --enable web_search_cached --json 2>"$TMPERR" | python3 -c "
+codex exec "<prompt>" -s read-only -c 'model_reasoning_effort="xhigh"' --enable web_search_cached --json 2>"$TMPERR" | python3 -c "
import sys, json
for line in sys.stdin:
    line = line.strip()
@@ -508,7 +508,7 @@ for line in sys.stdin:

For a **resumed session** (user chose "Continue"):

```bash
-codex exec resume "<session-id>" -s read-only -c 'model_reasoning_effort="high"' --enable web_search_cached --json 2>"$TMPERR" | python3 -c "
+codex exec resume "<session-id>" -s read-only -c 'model_reasoning_effort="xhigh"' --enable web_search_cached --json 2>"$TMPERR" | python3 -c "
"
```

@@ -544,10 +544,7 @@ Session saved — run /codex again to continue this conversation.

agentic coding model). This means as OpenAI ships newer models, /codex automatically uses them.
If the user wants a specific model, pass `-m` through to codex.

-**Reasoning effort** varies by mode — use the right level for each task:
-- **Review mode:** `high` — thorough but not slow. Diff review benefits from depth but doesn't need maximum compute.
-- **Challenge (adversarial) mode:** `xhigh` — maximum reasoning power.
When trying to break code, you want the model thinking as hard as possible.
-- **Consult mode:** `high` — good balance of depth and speed for conversations.
+**Reasoning effort:** All modes use `xhigh` — maximum reasoning power. When reviewing code, breaking code, or consulting on architecture, you want the model thinking as hard as possible.

**Web search:** All codex commands use `--enable web_search_cached` so Codex can look up docs and APIs during review. This is OpenAI's cached index — fast, no extra cost.
diff --git a/codex/SKILL.md.tmpl b/codex/SKILL.md.tmpl
index f2da49ad..30b603ee 100644
--- a/codex/SKILL.md.tmpl
+++ b/codex/SKILL.md.tmpl
@@ -79,13 +79,13 @@
TMPERR=$(mktemp /tmp/codex-err-XXXXXX.txt)

2. Run the review (5-minute timeout):

```bash
-codex review --base <branch> -c 'model_reasoning_effort="high"' --enable web_search_cached 2>"$TMPERR"
+codex review --base <branch> -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR"
```

Use `timeout: 300000` on the Bash call. If the user provided custom instructions (e.g., `/codex review focus on security`), pass them as the prompt argument:

```bash
-codex review "focus on security" --base <branch> -c 'model_reasoning_effort="high"' --enable web_search_cached 2>"$TMPERR"
+codex review "focus on security" --base <branch> -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR"
```

3. Capture the output.
Then parse cost from stderr:

@@ -240,7 +240,7 @@ THE PLAN:

For a **new session:**

```bash
-codex exec "<prompt>" -s read-only -c 'model_reasoning_effort="high"' --enable web_search_cached --json 2>"$TMPERR" | python3 -c "
+codex exec "<prompt>" -s read-only -c 'model_reasoning_effort="xhigh"' --enable web_search_cached --json 2>"$TMPERR" | python3 -c "
import sys, json
for line in sys.stdin:
    line = line.strip()
@@ -273,7 +273,7 @@ for line in sys.stdin:

For a **resumed session** (user chose "Continue"):

```bash
-codex exec resume "<session-id>" -s read-only -c 'model_reasoning_effort="high"' --enable web_search_cached --json 2>"$TMPERR" | python3 -c "
+codex exec resume "<session-id>" -s read-only -c 'model_reasoning_effort="xhigh"' --enable web_search_cached --json 2>"$TMPERR" | python3 -c "
"
```

@@ -309,10 +309,7 @@ Session saved — run /codex again to continue this conversation.

agentic coding model). This means as OpenAI ships newer models, /codex automatically uses them.
If the user wants a specific model, pass `-m` through to codex.

-**Reasoning effort** varies by mode — use the right level for each task:
-- **Review mode:** `high` — thorough but not slow. Diff review benefits from depth but doesn't need maximum compute.
-- **Challenge (adversarial) mode:** `xhigh` — maximum reasoning power. When trying to break code, you want the model thinking as hard as possible.
-- **Consult mode:** `high` — good balance of depth and speed for conversations.
+**Reasoning effort:** All modes use `xhigh` — maximum reasoning power. When reviewing code, breaking code, or consulting on architecture, you want the model thinking as hard as possible.

**Web search:** All codex commands use `--enable web_search_cached` so Codex can look up docs and APIs during review. This is OpenAI's cached index — fast, no extra cost.
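The review invocation in both files follows the same shape. A guarded sketch, assuming a base branch named `main`, that skips cleanly when the `codex` CLI is absent:

```shell
# Hedged sketch of a review run under the xhigh policy.
# "main" is an assumed base branch; substitute your own.
TMPERR=$(mktemp)
if command -v codex >/dev/null 2>&1; then
  codex review --base main -c 'model_reasoning_effort="xhigh"' \
    --enable web_search_cached 2>"$TMPERR" || true
else
  echo "codex CLI not found; skipping review" >"$TMPERR"
fi
# Stderr is preserved either way, so cost parsing has something to read.
cat "$TMPERR"
```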
diff --git a/office-hours/SKILL.md b/office-hours/SKILL.md index 143da2d1..80418a79 100644 --- a/office-hours/SKILL.md +++ b/office-hours/SKILL.md @@ -241,6 +241,25 @@ success/error/abort, and `USED_BROWSE` with true/false based on whether `$B` was If you cannot determine the outcome, use "unknown". This runs in the background and never blocks the user. +## SETUP (run this check BEFORE any browse command) + +```bash +_ROOT=$(git rev-parse --show-toplevel 2>/dev/null) +B="" +[ -n "$_ROOT" ] && [ -x "$_ROOT/.claude/skills/gstack/browse/dist/browse" ] && B="$_ROOT/.claude/skills/gstack/browse/dist/browse" +[ -z "$B" ] && B=~/.claude/skills/gstack/browse/dist/browse +if [ -x "$B" ]; then + echo "READY: $B" +else + echo "NEEDS_SETUP" +fi +``` + +If `NEEDS_SETUP`: +1. Tell the user: "gstack browse needs a one-time build (~10 seconds). OK to proceed?" Then STOP and wait. +2. Run: `cd && ./setup` +3. If `bun` is not installed: `curl -fsSL https://bun.sh/install | bash` + # YC Office Hours You are a **YC office hours partner**. Your job is to ensure the problem is understood before solutions are proposed. You adapt to what the user is building — startup founders get the hard questions, builders get an enthusiastic collaborator. This skill produces design docs, not code. @@ -505,6 +524,66 @@ Present via AskUserQuestion. Do NOT proceed without user approval of the approac --- +## Visual Sketch (UI ideas only) + +If the chosen approach involves user-facing UI (screens, pages, forms, dashboards, +or interactive elements), generate a rough wireframe to help the user visualize it. +If the idea is backend-only, infrastructure, or has no UI component — skip this +section silently. + +**Step 1: Gather design context** + +1. Check if `DESIGN.md` exists in the repo root. If it does, read it for design + system constraints (colors, typography, spacing, component patterns). Use these + constraints in the wireframe. +2. 
Apply core design principles: + - **Information hierarchy** — what does the user see first, second, third? + - **Interaction states** — loading, empty, error, success, partial + - **Edge case paranoia** — what if the name is 47 chars? Zero results? Network fails? + - **Subtraction default** — "as little design as possible" (Rams). Every element earns its pixels. + - **Design for trust** — every interface element builds or erodes user trust. + +**Step 2: Generate wireframe HTML** + +Generate a single-page HTML file with these constraints: +- **Intentionally rough aesthetic** — use system fonts, thin gray borders, no color, + hand-drawn-style elements. This is a sketch, not a polished mockup. +- Self-contained — no external dependencies, no CDN links, inline CSS only +- Show the core interaction flow (1-3 screens/states max) +- Include realistic placeholder content (not "Lorem ipsum" — use content that + matches the actual use case) +- Add HTML comments explaining design decisions + +Write to a temp file: +```bash +SKETCH_FILE="/tmp/gstack-sketch-$(date +%s).html" +``` + +**Step 3: Render and capture** + +```bash +$B goto "file://$SKETCH_FILE" +$B screenshot /tmp/gstack-sketch.png +``` + +If `$B` is not available (browse binary not set up), skip the render step. Tell the +user: "Visual sketch requires the browse binary. Run the setup script to enable it." + +**Step 4: Present and iterate** + +Show the screenshot to the user. Ask: "Does this feel right? Want to iterate on the layout?" + +If they want changes, regenerate the HTML with their feedback and re-render. +If they approve or say "good enough," proceed. + +**Step 5: Include in design doc** + +Reference the wireframe screenshot in the design doc's "Recommended Approach" section. +The screenshot file at `/tmp/gstack-sketch.png` can be referenced by downstream skills +(`/plan-design-review`, `/design-review`) to see what was originally envisioned. 
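Steps 2 and 3 can be combined into one guarded sequence. This is a sketch: the inline HTML is illustrative only, and `$B` is assumed to have been resolved by the SETUP check above.

```shell
# Write a throwaway wireframe, then render it if the browse binary is available.
SKETCH_FILE="/tmp/gstack-sketch-$(date +%s).html"
cat > "$SKETCH_FILE" <<'HTML'
<!-- Rough sketch: system fonts, thin gray borders, no color -->
<html><body style="font-family: system-ui; max-width: 640px; margin: 2rem auto;">
  <div style="border: 1px solid #999; padding: 12px;">Invoice #1047 — Acme Co.</div>
  <div style="border: 1px solid #ccc; padding: 12px; margin-top: 8px;">
    Empty state: no line items yet
  </div>
</body></html>
HTML

if [ -x "$B" ]; then
  "$B" goto "file://$SKETCH_FILE"
  "$B" screenshot /tmp/gstack-sketch.png
else
  echo "Visual sketch requires the browse binary. Run the setup script to enable it."
fi
```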
+ +--- + ## Phase 4.5: Founder Signal Synthesis Before writing the design doc, synthesize the founder signals you observed during the session. These will appear in the design doc ("What I noticed") and in the closing conversation (Phase 6). @@ -641,7 +720,73 @@ Supersedes: {prior filename — omit this line if first design on this branch} {observational, mentor-like reflections referencing specific things the user said during the session. Quote their words back to them — don't characterize their behavior. 2-4 bullets.} ``` -Present the design doc to the user via AskUserQuestion: +--- + +## Spec Review Loop + +Before presenting the document to the user for approval, run an adversarial review. + +**Step 1: Dispatch reviewer subagent** + +Use the Agent tool to dispatch an independent reviewer. The reviewer has fresh context +and cannot see the brainstorming conversation — only the document. This ensures genuine +adversarial independence. + +Prompt the subagent with: +- The file path of the document just written +- "Read this document and review it on 5 dimensions. For each dimension, note PASS or + list specific issues with suggested fixes. At the end, output a quality score (1-10) + across all dimensions." + +**Dimensions:** +1. **Completeness** — Are all requirements addressed? Missing edge cases? +2. **Consistency** — Do parts of the document agree with each other? Contradictions? +3. **Clarity** — Could an engineer implement this without asking questions? Ambiguous language? +4. **Scope** — Does the document creep beyond the original problem? YAGNI violations? +5. **Feasibility** — Can this actually be built with the stated approach? Hidden complexity? + +The subagent should return: +- A quality score (1-10) +- PASS if no issues, or a numbered list of issues with dimension, description, and fix + +**Step 2: Fix and re-dispatch** + +If the reviewer returns issues: +1. Fix each issue in the document on disk (use Edit tool) +2. 
Re-dispatch the reviewer subagent with the updated document +3. Maximum 3 iterations total + +**Convergence guard:** If the reviewer returns the same issues on consecutive iterations +(the fix didn't resolve them or the reviewer disagrees with the fix), stop the loop +and persist those issues as "Reviewer Concerns" in the document rather than looping +further. + +If the subagent fails, times out, or is unavailable — skip the review loop entirely. +Tell the user: "Spec review unavailable — presenting unreviewed doc." The document is +already written to disk; the review is a quality bonus, not a gate. + +**Step 3: Report and persist metrics** + +After the loop completes (PASS, max iterations, or convergence guard): + +1. Tell the user the result — summary by default: + "Your doc survived N rounds of adversarial review. M issues caught and fixed. + Quality score: X/10." + If they ask "what did the reviewer find?", show the full reviewer output. + +2. If issues remain after max iterations or convergence, add a "## Reviewer Concerns" + section to the document listing each unresolved issue. Downstream skills will see this. + +3. Append metrics: +```bash +mkdir -p ~/.gstack/analytics +echo '{"skill":"office-hours","ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'","iterations":ITERATIONS,"issues_found":FOUND,"issues_fixed":FIXED,"remaining":REMAINING,"quality_score":SCORE}' >> ~/.gstack/analytics/spec-review.jsonl 2>/dev/null || true +``` +Replace ITERATIONS, FOUND, FIXED, REMAINING, SCORE with actual values from the review. 
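
As a concrete illustration of that substitution (all numbers hypothetical), a session that converged after 2 iterations with 5 issues found and fixed would append:

```shell
# Illustrative values only: 2 iterations, 5 issues found, all 5 fixed,
# nothing remaining, quality score 8/10.
mkdir -p ~/.gstack/analytics
echo '{"skill":"office-hours","ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'","iterations":2,"issues_found":5,"issues_fixed":5,"remaining":0,"quality_score":8}' >> ~/.gstack/analytics/spec-review.jsonl 2>/dev/null || true
```

Each run appends one JSONL line, so the file accumulates a history of review outcomes across sessions.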
+ +--- + +Present the reviewed design doc to the user via AskUserQuestion: - A) Approve — mark Status: APPROVED and proceed to handoff - B) Revise — specify which sections need changes (loop back to revise those sections) - C) Start over — return to Phase 2 diff --git a/office-hours/SKILL.md.tmpl b/office-hours/SKILL.md.tmpl index caf91acb..e0ff98a7 100644 --- a/office-hours/SKILL.md.tmpl +++ b/office-hours/SKILL.md.tmpl @@ -23,6 +23,8 @@ allowed-tools: {{PREAMBLE}} +{{BROWSE_SETUP}} + # YC Office Hours You are a **YC office hours partner**. Your job is to ensure the problem is understood before solutions are proposed. You adapt to what the user is building — startup founders get the hard questions, builders get an enthusiastic collaborator. This skill produces design docs, not code. @@ -287,6 +289,10 @@ Present via AskUserQuestion. Do NOT proceed without user approval of the approac --- +{{DESIGN_SKETCH}} + +--- + ## Phase 4.5: Founder Signal Synthesis Before writing the design doc, synthesize the founder signals you observed during the session. These will appear in the design doc ("What I noticed") and in the closing conversation (Phase 6). @@ -423,7 +429,13 @@ Supersedes: {prior filename — omit this line if first design on this branch} {observational, mentor-like reflections referencing specific things the user said during the session. Quote their words back to them — don't characterize their behavior. 
2-4 bullets.} ``` -Present the design doc to the user via AskUserQuestion: +--- + +{{SPEC_REVIEW_LOOP}} + +--- + +Present the reviewed design doc to the user via AskUserQuestion: - A) Approve — mark Status: APPROVED and proceed to handoff - B) Revise — specify which sections need changes (loop back to revise those sections) - C) Start over — return to Phase 2 diff --git a/package.json b/package.json index 2bf4a238..3001c764 100644 --- a/package.json +++ b/package.json @@ -8,17 +8,19 @@ "browse": "./browse/dist/browse" }, "scripts": { - "build": "bun run gen:skill-docs && bun run gen:skill-docs --host codex && bun build --compile browse/src/cli.ts --outfile browse/dist/browse && bun build --compile browse/src/find-browse.ts --outfile browse/dist/find-browse && git rev-parse HEAD > browse/dist/.version && rm -f .*.bun-build || true", + "build": "bun run gen:skill-docs && bun run gen:skill-docs --host codex && bun build --compile browse/src/cli.ts --outfile browse/dist/browse && bun build --compile browse/src/find-browse.ts --outfile browse/dist/find-browse && bash browse/scripts/build-node-server.sh && git rev-parse HEAD > browse/dist/.version && rm -f .*.bun-build || true", "gen:skill-docs": "bun run scripts/gen-skill-docs.ts", "dev": "bun run browse/src/cli.ts", "server": "bun run browse/src/server.ts", - "test": "bun test browse/test/ test/ --ignore test/skill-e2e.test.ts --ignore test/skill-llm-eval.test.ts --ignore test/skill-routing-e2e.test.ts --ignore test/codex-e2e.test.ts", - "test:evals": "EVALS=1 bun test test/skill-llm-eval.test.ts test/skill-e2e.test.ts test/skill-routing-e2e.test.ts test/codex-e2e.test.ts", - "test:evals:all": "EVALS=1 EVALS_ALL=1 bun test test/skill-llm-eval.test.ts test/skill-e2e.test.ts test/skill-routing-e2e.test.ts test/codex-e2e.test.ts", - "test:e2e": "EVALS=1 bun test test/skill-e2e.test.ts test/skill-routing-e2e.test.ts test/codex-e2e.test.ts", - "test:e2e:all": "EVALS=1 EVALS_ALL=1 bun test test/skill-e2e.test.ts 
test/skill-routing-e2e.test.ts test/codex-e2e.test.ts", + "test": "bun test browse/test/ test/ --ignore test/skill-e2e.test.ts --ignore test/skill-llm-eval.test.ts --ignore test/skill-routing-e2e.test.ts --ignore test/codex-e2e.test.ts --ignore test/gemini-e2e.test.ts", + "test:evals": "EVALS=1 bun test test/skill-llm-eval.test.ts test/skill-e2e.test.ts test/skill-routing-e2e.test.ts test/codex-e2e.test.ts test/gemini-e2e.test.ts", + "test:evals:all": "EVALS=1 EVALS_ALL=1 bun test test/skill-llm-eval.test.ts test/skill-e2e.test.ts test/skill-routing-e2e.test.ts test/codex-e2e.test.ts test/gemini-e2e.test.ts", + "test:e2e": "EVALS=1 bun test test/skill-e2e.test.ts test/skill-routing-e2e.test.ts test/codex-e2e.test.ts test/gemini-e2e.test.ts", + "test:e2e:all": "EVALS=1 EVALS_ALL=1 bun test test/skill-e2e.test.ts test/skill-routing-e2e.test.ts test/codex-e2e.test.ts test/gemini-e2e.test.ts", "test:codex": "EVALS=1 bun test test/codex-e2e.test.ts", "test:codex:all": "EVALS=1 EVALS_ALL=1 bun test test/codex-e2e.test.ts", + "test:gemini": "EVALS=1 bun test test/gemini-e2e.test.ts", + "test:gemini:all": "EVALS=1 EVALS_ALL=1 bun test test/gemini-e2e.test.ts", "skill:check": "bun run scripts/skill-check.ts", "dev:skill": "bun run scripts/dev-skill.ts", "start": "bun run browse/src/server.ts", diff --git a/plan-ceo-review/SKILL.md b/plan-ceo-review/SKILL.md index ffbe1561..2f4f59c5 100644 --- a/plan-ceo-review/SKILL.md +++ b/plan-ceo-review/SKILL.md @@ -10,6 +10,7 @@ description: | or "is this ambitious enough". Proactively suggest when the user is questioning scope or ambition of a plan, or when the plan feels like it could be thinking bigger. +benefits-from: [office-hours] allowed-tools: - Read - Grep @@ -345,6 +346,37 @@ DESIGN=$(ls -t ~/.gstack/projects/$SLUG/*-$BRANCH-design-*.md 2>/dev/null | head ``` If a design doc exists (from `/office-hours`), read it. Use it as the source of truth for the problem statement, constraints, and chosen approach. 
If it has a `Supersedes:` field, note that this is a revised design. +## Prerequisite Skill Offer + +When the design doc check above prints "No design doc found," offer the prerequisite +skill before proceeding. + +Say to the user via AskUserQuestion: + +> "No design doc found for this branch. `/office-hours` produces a structured problem +> statement, premise challenge, and explored alternatives — it gives this review much +> sharper input to work with. Takes about 10 minutes. The design doc is per-feature, +> not per-product — it captures the thinking behind this specific change." + +Options: +- A) Run /office-hours first (in another window, then come back) +- B) Skip — proceed with standard review + +If they skip: "No worries — standard review. If you ever want sharper input, try +/office-hours first next time." Then proceed normally. Do not re-offer later in the session. + +**Mid-session detection:** During Step 0A (Premise Challenge), if the user can't +articulate the problem, keeps changing the problem statement, answers with "I'm not +sure," or is clearly exploring rather than reviewing — offer `/office-hours`: + +> "It sounds like you're still figuring out what to build — that's totally fine, but +> that's what /office-hours is designed for. Want to pause this review and run +> /office-hours first? It'll help you nail down the problem and approach, then come +> back here for the strategic review." + +Options: A) Yes, run /office-hours first. B) No, keep going. +If they keep going, proceed normally — no guilt, no re-asking. + When reading TODOS.md, specifically: * Note any TODOs this plan touches, blocks, or unlocks * Check if deferred work from prior reviews relates to this plan @@ -488,6 +520,70 @@ Repo: {owner/repo} Derive the feature slug from the plan being reviewed (e.g., "user-dashboard", "auth-refactor"). Use the date in YYYY-MM-DD format. 
+After writing the CEO plan, run the spec review loop on it: + +## Spec Review Loop + +Before presenting the document to the user for approval, run an adversarial review. + +**Step 1: Dispatch reviewer subagent** + +Use the Agent tool to dispatch an independent reviewer. The reviewer has fresh context +and cannot see the brainstorming conversation — only the document. This ensures genuine +adversarial independence. + +Prompt the subagent with: +- The file path of the document just written +- "Read this document and review it on 5 dimensions. For each dimension, note PASS or + list specific issues with suggested fixes. At the end, output a quality score (1-10) + across all dimensions." + +**Dimensions:** +1. **Completeness** — Are all requirements addressed? Missing edge cases? +2. **Consistency** — Do parts of the document agree with each other? Contradictions? +3. **Clarity** — Could an engineer implement this without asking questions? Ambiguous language? +4. **Scope** — Does the document creep beyond the original problem? YAGNI violations? +5. **Feasibility** — Can this actually be built with the stated approach? Hidden complexity? + +The subagent should return: +- A quality score (1-10) +- PASS if no issues, or a numbered list of issues with dimension, description, and fix + +**Step 2: Fix and re-dispatch** + +If the reviewer returns issues: +1. Fix each issue in the document on disk (use Edit tool) +2. Re-dispatch the reviewer subagent with the updated document +3. Maximum 3 iterations total + +**Convergence guard:** If the reviewer returns the same issues on consecutive iterations +(the fix didn't resolve them or the reviewer disagrees with the fix), stop the loop +and persist those issues as "Reviewer Concerns" in the document rather than looping +further. + +If the subagent fails, times out, or is unavailable — skip the review loop entirely. +Tell the user: "Spec review unavailable — presenting unreviewed doc." 
The document is +already written to disk; the review is a quality bonus, not a gate. + +**Step 3: Report and persist metrics** + +After the loop completes (PASS, max iterations, or convergence guard): + +1. Tell the user the result — summary by default: + "Your doc survived N rounds of adversarial review. M issues caught and fixed. + Quality score: X/10." + If they ask "what did the reviewer find?", show the full reviewer output. + +2. If issues remain after max iterations or convergence, add a "## Reviewer Concerns" + section to the document listing each unresolved issue. Downstream skills will see this. + +3. Append metrics: +```bash +mkdir -p ~/.gstack/analytics +echo '{"skill":"plan-ceo-review","ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'","iterations":ITERATIONS,"issues_found":FOUND,"issues_fixed":FIXED,"remaining":REMAINING,"quality_score":SCORE}' >> ~/.gstack/analytics/spec-review.jsonl 2>/dev/null || true +``` +Replace ITERATIONS, FOUND, FIXED, REMAINING, SCORE with actual values from the review. + ### 0E. Temporal Interrogation (EXPANSION, SELECTIVE EXPANSION, and HOLD modes) Think ahead to implementation: What decisions will need to be made during implementation that should be resolved NOW in the plan? ``` @@ -919,7 +1015,7 @@ Parse the output. Find the most recent entry for each skill (plan-ceo-review, pl - **Eng Review (required by default):** The only review that gates shipping. Covers architecture, code quality, tests, performance. Can be disabled globally with \`gstack-config set skip_eng_review true\` (the "don't bother me" setting). - **CEO Review (optional):** Use your judgment. Recommend it for big product/business changes, new user-facing features, or scope decisions. Skip for bug fixes, refactors, infra, and cleanup. - **Design Review (optional):** Use your judgment. Recommend it for UI/UX changes. Skip for backend-only, infra, or prompt-only changes. -- **Codex Review (optional):** Independent second opinion from OpenAI Codex CLI. 
Shows pass/fail gate. Recommend for critical code changes where a second AI perspective adds value. Skip when Codex CLI is not installed. +- **Codex Review (enabled by default when Codex CLI is installed):** Independent review + adversarial challenge from OpenAI Codex CLI. Shows pass/fail gate. Runs automatically when enabled — configure with \`gstack-config set codex_reviews enabled|disabled\`. **Verdict logic:** - **CLEARED**: Eng Review has >= 1 entry within 7 days with status "clean" (or \`skip_eng_review\` is \`true\`) diff --git a/plan-ceo-review/SKILL.md.tmpl b/plan-ceo-review/SKILL.md.tmpl index 87dec8e7..8dce40eb 100644 --- a/plan-ceo-review/SKILL.md.tmpl +++ b/plan-ceo-review/SKILL.md.tmpl @@ -10,6 +10,7 @@ description: | or "is this ambitious enough". Proactively suggest when the user is questioning scope or ambition of a plan, or when the plan feels like it could be thinking bigger. +benefits-from: [office-hours] allowed-tools: - Read - Grep @@ -110,6 +111,20 @@ DESIGN=$(ls -t ~/.gstack/projects/$SLUG/*-$BRANCH-design-*.md 2>/dev/null | head ``` If a design doc exists (from `/office-hours`), read it. Use it as the source of truth for the problem statement, constraints, and chosen approach. If it has a `Supersedes:` field, note that this is a revised design. +{{BENEFITS_FROM}} + +**Mid-session detection:** During Step 0A (Premise Challenge), if the user can't +articulate the problem, keeps changing the problem statement, answers with "I'm not +sure," or is clearly exploring rather than reviewing — offer `/office-hours`: + +> "It sounds like you're still figuring out what to build — that's totally fine, but +> that's what /office-hours is designed for. Want to pause this review and run +> /office-hours first? It'll help you nail down the problem and approach, then come +> back here for the strategic review." + +Options: A) Yes, run /office-hours first. B) No, keep going. +If they keep going, proceed normally — no guilt, no re-asking. 
+ When reading TODOS.md, specifically: * Note any TODOs this plan touches, blocks, or unlocks * Check if deferred work from prior reviews relates to this plan @@ -253,6 +268,10 @@ Repo: {owner/repo} Derive the feature slug from the plan being reviewed (e.g., "user-dashboard", "auth-refactor"). Use the date in YYYY-MM-DD format. +After writing the CEO plan, run the spec review loop on it: + +{{SPEC_REVIEW_LOOP}} + ### 0E. Temporal Interrogation (EXPANSION, SELECTIVE EXPANSION, and HOLD modes) Think ahead to implementation: What decisions will need to be made during implementation that should be resolved NOW in the plan? ``` diff --git a/plan-design-review/SKILL.md b/plan-design-review/SKILL.md index 2303bd9e..d46fdc2c 100644 --- a/plan-design-review/SKILL.md +++ b/plan-design-review/SKILL.md @@ -550,7 +550,7 @@ Parse the output. Find the most recent entry for each skill (plan-ceo-review, pl - **Eng Review (required by default):** The only review that gates shipping. Covers architecture, code quality, tests, performance. Can be disabled globally with \`gstack-config set skip_eng_review true\` (the "don't bother me" setting). - **CEO Review (optional):** Use your judgment. Recommend it for big product/business changes, new user-facing features, or scope decisions. Skip for bug fixes, refactors, infra, and cleanup. - **Design Review (optional):** Use your judgment. Recommend it for UI/UX changes. Skip for backend-only, infra, or prompt-only changes. -- **Codex Review (optional):** Independent second opinion from OpenAI Codex CLI. Shows pass/fail gate. Recommend for critical code changes where a second AI perspective adds value. Skip when Codex CLI is not installed. +- **Codex Review (enabled by default when Codex CLI is installed):** Independent review + adversarial challenge from OpenAI Codex CLI. Shows pass/fail gate. Runs automatically when enabled — configure with \`gstack-config set codex_reviews enabled|disabled\`. 
**Verdict logic:** - **CLEARED**: Eng Review has >= 1 entry within 7 days with status "clean" (or \`skip_eng_review\` is \`true\`) diff --git a/plan-eng-review/SKILL.md b/plan-eng-review/SKILL.md index 27fabcc9..5881f40b 100644 --- a/plan-eng-review/SKILL.md +++ b/plan-eng-review/SKILL.md @@ -8,6 +8,7 @@ description: | "review the architecture", "engineering review", or "lock in the plan". Proactively suggest when the user has a plan or design doc and is about to start coding — to catch architecture issues before implementation. +benefits-from: [office-hours] allowed-tools: - Read - Write @@ -291,6 +292,25 @@ DESIGN=$(ls -t ~/.gstack/projects/$SLUG/*-$BRANCH-design-*.md 2>/dev/null | head ``` If a design doc exists, read it. Use it as the source of truth for the problem statement, constraints, and chosen approach. If it has a `Supersedes:` field, note that this is a revised design — check the prior version for context on what changed and why. +## Prerequisite Skill Offer + +When the design doc check above prints "No design doc found," offer the prerequisite +skill before proceeding. + +Say to the user via AskUserQuestion: + +> "No design doc found for this branch. `/office-hours` produces a structured problem +> statement, premise challenge, and explored alternatives — it gives this review much +> sharper input to work with. Takes about 10 minutes. The design doc is per-feature, +> not per-product — it captures the thinking behind this specific change." + +Options: +- A) Run /office-hours first (in another window, then come back) +- B) Skip — proceed with standard review + +If they skip: "No worries — standard review. If you ever want sharper input, try +/office-hours first next time." Then proceed normally. Do not re-offer later in the session. + ### Step 0: Scope Challenge Before reviewing anything, answer these questions: 1. 
**What existing code already partially or fully solves each sub-problem?** Can we capture outputs from existing flows rather than building parallel ones? @@ -520,7 +540,7 @@ Parse the output. Find the most recent entry for each skill (plan-ceo-review, pl - **Eng Review (required by default):** The only review that gates shipping. Covers architecture, code quality, tests, performance. Can be disabled globally with \`gstack-config set skip_eng_review true\` (the "don't bother me" setting). - **CEO Review (optional):** Use your judgment. Recommend it for big product/business changes, new user-facing features, or scope decisions. Skip for bug fixes, refactors, infra, and cleanup. - **Design Review (optional):** Use your judgment. Recommend it for UI/UX changes. Skip for backend-only, infra, or prompt-only changes. -- **Codex Review (optional):** Independent second opinion from OpenAI Codex CLI. Shows pass/fail gate. Recommend for critical code changes where a second AI perspective adds value. Skip when Codex CLI is not installed. +- **Codex Review (enabled by default when Codex CLI is installed):** Independent review + adversarial challenge from OpenAI Codex CLI. Shows pass/fail gate. Runs automatically when enabled — configure with \`gstack-config set codex_reviews enabled|disabled\`. **Verdict logic:** - **CLEARED**: Eng Review has >= 1 entry within 7 days with status "clean" (or \`skip_eng_review\` is \`true\`) diff --git a/plan-eng-review/SKILL.md.tmpl b/plan-eng-review/SKILL.md.tmpl index ef21a200..09782a9d 100644 --- a/plan-eng-review/SKILL.md.tmpl +++ b/plan-eng-review/SKILL.md.tmpl @@ -8,6 +8,7 @@ description: | "review the architecture", "engineering review", or "lock in the plan". Proactively suggest when the user has a plan or design doc and is about to start coding — to catch architecture issues before implementation. 
+benefits-from: [office-hours] allowed-tools: - Read - Write @@ -73,6 +74,8 @@ DESIGN=$(ls -t ~/.gstack/projects/$SLUG/*-$BRANCH-design-*.md 2>/dev/null | head ``` If a design doc exists, read it. Use it as the source of truth for the problem statement, constraints, and chosen approach. If it has a `Supersedes:` field, note that this is a revised design — check the prior version for context on what changed and why. +{{BENEFITS_FROM}} + ### Step 0: Scope Challenge Before reviewing anything, answer these questions: 1. **What existing code already partially or fully solves each sub-problem?** Can we capture outputs from existing flows rather than building parallel ones? diff --git a/review/SKILL.md b/review/SKILL.md index 6f4baeac..045231b0 100644 --- a/review/SKILL.md +++ b/review/SKILL.md @@ -497,52 +497,128 @@ If no documentation files exist, skip this step silently. --- -## Step 5.7: Codex second opinion (optional) +## Step 5.7: Codex review -After completing the review, check if the Codex CLI is available: +Check if the Codex CLI is available and read the user's Codex review preference: ```bash which codex 2>/dev/null && echo "CODEX_AVAILABLE" || echo "CODEX_NOT_AVAILABLE" +CODEX_REVIEWS_CFG=$(~/.claude/skills/gstack/bin/gstack-config get codex_reviews 2>/dev/null || true) +echo "CODEX_REVIEWS: ${CODEX_REVIEWS_CFG:-not_set}" ``` -If Codex is available, use AskUserQuestion: +If `CODEX_NOT_AVAILABLE`: skip this step silently. Continue to the next step. + +If `CODEX_REVIEWS` is `disabled`: skip this step silently. Continue to the next step. + +If `CODEX_REVIEWS` is `enabled`: run both code review and adversarial challenge automatically (no prompt). Jump to the "Run Codex" section below. + +If `CODEX_REVIEWS` is `not_set`: use AskUserQuestion to offer the one-time adoption prompt: ``` -Review complete. Want an independent second opinion from Codex (OpenAI)? 
+GStack recommends enabling Codex code reviews — Codex is the super smart quiet engineer friend who will save your butt. -A) Run Codex code review — independent diff review with pass/fail gate -B) Run Codex adversarial challenge — try to find ways this code will fail in production -C) Both — review first, then adversarial challenge -D) Skip — no Codex review needed +A) Enable for all future runs (recommended, default) +B) Try it for now, ask me again later +C) No thanks, don't ask me again ``` -If the user chooses A, B, or C: - -**For code review (A or C):** Run `codex review --base ` with a 5-minute timeout. -Present the full output verbatim under a `CODEX SAYS (code review):` header. -Check the output for `[P1]` markers — if found, note `GATE: FAIL`, otherwise `GATE: PASS`. -After presenting, compare Codex's findings with your own review findings from Steps 4-5 -and output a CROSS-MODEL ANALYSIS showing what both found, what only Codex found, -and what only Claude found. - -**For adversarial challenge (B or C):** Run: +If the user chooses A: persist the setting and run both: ```bash -codex exec "Review the changes on this branch against the base branch. Run git diff origin/ to see the diff. Your job is to find ways this code will fail in production. Think like an attacker and a chaos engineer. Find edge cases, race conditions, security holes, failure modes. Be adversarial." -s read-only +~/.claude/skills/gstack/bin/gstack-config set codex_reviews enabled ``` -Present the full output verbatim under a `CODEX SAYS (adversarial challenge):` header. -**Only if a code review ran (user chose A or C):** Persist the Codex review result to the review log: +If the user chooses B: run both this time but do not persist any setting. 
+ +If the user chooses C: persist the opt-out and skip: ```bash -~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"codex-review","timestamp":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'","status":"STATUS","gate":"GATE"}' +~/.claude/skills/gstack/bin/gstack-config set codex_reviews disabled +``` +Then skip this step. Continue to the next step. + +### Run Codex + +Always run **both** code review and adversarial challenge. Use a 5-minute timeout (`timeout: 300000`) on each Bash call. + +First, create a temp file for stderr capture: +```bash +TMPERR=$(mktemp /tmp/codex-review-XXXXXXXX) +``` + +**Code review:** Run: +```bash +codex review --base -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR" +``` + +After the command completes, read stderr for cost/error info: +```bash +cat "$TMPERR" +``` + +Present the full output verbatim under a `CODEX SAYS (code review):` header: + +``` +CODEX SAYS (code review): +════════════════════════════════════════════════════════════ + +════════════════════════════════════════════════════════════ +GATE: PASS Tokens: N | Est. cost: ~$X.XX +``` + +Check the output for `[P1]` markers. If found: `GATE: FAIL`. If no `[P1]`: `GATE: PASS`. + +**If GATE is FAIL:** use AskUserQuestion: + +``` +Codex found N critical issues in the diff. + +A) Investigate and fix now (recommended) +B) Ship anyway — these issues may cause production problems +``` + +If the user chooses A: read the Codex findings carefully and work to address them. Then re-run `codex review` to verify the gate is now PASS. + +If the user chooses B: continue to the next step. + +### Error handling (code review) + +Before persisting the gate result, check for errors. All errors are non-blocking — Codex is a quality enhancement, not a prerequisite. Check `$TMPERR` output (already read above) for error indicators: + +- **Auth failure:** If stderr contains "auth", "login", "unauthorized", or "API key", tell the user: "Codex authentication failed. 
Run \`codex login\` in your terminal to authenticate via ChatGPT." Do NOT persist a review log entry. Continue to the adversarial step (it will likely fail too, but try anyway). +- **Timeout:** If the Bash call times out (5 min), tell the user: "Codex timed out after 5 minutes. The diff may be too large or the API may be slow." Do NOT persist a review log entry. Skip to cleanup. +- **Empty response:** If codex returned no stdout output, tell the user: "Codex returned no response. Stderr: ." Do NOT persist a review log entry. Skip to cleanup. + +**Only if codex produced a real review (non-empty stdout):** Persist the code review result: +```bash +~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"codex-review","timestamp":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'","status":"STATUS","gate":"GATE","commit":"'"$(git rev-parse --short HEAD)"'"}' ``` Substitute: STATUS ("clean" if PASS, "issues_found" if FAIL), GATE ("pass" or "fail"). -**Do NOT persist a codex-review entry when only the adversarial challenge (B) ran** — -there is no gate verdict to record, and a false entry would make the Review Readiness -Dashboard believe a code review happened when it didn't. +**Adversarial challenge:** Run: +```bash +TMPERR_ADV=$(mktemp /tmp/codex-adv-XXXXXXXX) +codex exec "Review the changes on this branch against the base branch. Run git diff origin/ to see the diff. Your job is to find ways this code will fail in production. Think like an attacker and a chaos engineer. Find edge cases, race conditions, security holes, resource leaks, failure modes, and silent data corruption paths. Be adversarial. Be thorough. No compliments — just the problems." -s read-only -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR_ADV" +``` -If Codex is not available, skip this step silently. +After the command completes, read adversarial stderr: +```bash +cat "$TMPERR_ADV" +``` + +Present the full output verbatim under a `CODEX SAYS (adversarial challenge):` header. 
This is informational — it never blocks shipping. If the adversarial command timed out or returned no output, note this to the user and continue. + +**Cross-model analysis:** After both Codex outputs are presented, compare Codex's findings with your own review findings from the earlier review steps and output: + +``` +CROSS-MODEL ANALYSIS: + Both found: [findings that overlap between Claude and Codex] + Only Codex found: [findings unique to Codex] + Only Claude found: [findings unique to Claude's review] + Agreement rate: X% (N/M total unique findings overlap) +``` + +**Cleanup:** Run `rm -f "$TMPERR" "$TMPERR_ADV"` after processing. --- diff --git a/review/SKILL.md.tmpl b/review/SKILL.md.tmpl index bab95d91..34a25018 100644 --- a/review/SKILL.md.tmpl +++ b/review/SKILL.md.tmpl @@ -231,54 +231,7 @@ If no documentation files exist, skip this step silently. --- -## Step 5.7: Codex second opinion (optional) - -After completing the review, check if the Codex CLI is available: - -```bash -which codex 2>/dev/null && echo "CODEX_AVAILABLE" || echo "CODEX_NOT_AVAILABLE" -``` - -If Codex is available, use AskUserQuestion: - -``` -Review complete. Want an independent second opinion from Codex (OpenAI)? - -A) Run Codex code review — independent diff review with pass/fail gate -B) Run Codex adversarial challenge — try to find ways this code will fail in production -C) Both — review first, then adversarial challenge -D) Skip — no Codex review needed -``` - -If the user chooses A, B, or C: - -**For code review (A or C):** Run `codex review --base ` with a 5-minute timeout. -Present the full output verbatim under a `CODEX SAYS (code review):` header. -Check the output for `[P1]` markers — if found, note `GATE: FAIL`, otherwise `GATE: PASS`. -After presenting, compare Codex's findings with your own review findings from Steps 4-5 -and output a CROSS-MODEL ANALYSIS showing what both found, what only Codex found, -and what only Claude found. 
- -**For adversarial challenge (B or C):** Run: -```bash -codex exec "Review the changes on this branch against the base branch. Run git diff origin/ to see the diff. Your job is to find ways this code will fail in production. Think like an attacker and a chaos engineer. Find edge cases, race conditions, security holes, failure modes. Be adversarial." -s read-only -``` -Present the full output verbatim under a `CODEX SAYS (adversarial challenge):` header. - -**Only if a code review ran (user chose A or C):** Persist the Codex review result to the review log: -```bash -~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"codex-review","timestamp":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'","status":"STATUS","gate":"GATE"}' -``` - -Substitute: STATUS ("clean" if PASS, "issues_found" if FAIL), GATE ("pass" or "fail"). - -**Do NOT persist a codex-review entry when only the adversarial challenge (B) ran** — -there is no gate verdict to record, and a false entry would make the Review Readiness -Dashboard believe a code review happened when it didn't. - -If Codex is not available, skip this step silently. - ---- +{{CODEX_REVIEW_STEP}} ## Important Rules diff --git a/scripts/gen-skill-docs.ts b/scripts/gen-skill-docs.ts index ce33035a..9e612c55 100644 --- a/scripts/gen-skill-docs.ts +++ b/scripts/gen-skill-docs.ts @@ -55,6 +55,7 @@ const HOST_PATHS: Record = { interface TemplateContext { skillName: string; tmplPath: string; + benefitsFrom?: string[]; host: Host; paths: HostPaths; } @@ -1206,7 +1207,7 @@ Parse the output. Find the most recent entry for each skill (plan-ceo-review, pl - **Eng Review (required by default):** The only review that gates shipping. Covers architecture, code quality, tests, performance. Can be disabled globally with \\\`gstack-config set skip_eng_review true\\\` (the "don't bother me" setting). - **CEO Review (optional):** Use your judgment. Recommend it for big product/business changes, new user-facing features, or scope decisions. 
Skip for bug fixes, refactors, infra, and cleanup. - **Design Review (optional):** Use your judgment. Recommend it for UI/UX changes. Skip for backend-only, infra, or prompt-only changes. -- **Codex Review (optional):** Independent second opinion from OpenAI Codex CLI. Shows pass/fail gate. Recommend for critical code changes where a second AI perspective adds value. Skip when Codex CLI is not installed. +- **Codex Review (enabled by default when Codex CLI is installed):** Independent review + adversarial challenge from OpenAI Codex CLI. Shows pass/fail gate. Runs automatically when enabled — configure with \\\`gstack-config set codex_reviews enabled|disabled\\\`. **Verdict logic:** - **CLEARED**: Eng Review has >= 1 entry within 7 days with status "clean" (or \\\`skip_eng_review\\\` is \\\`true\\\`) @@ -1376,6 +1377,289 @@ Only commit if there are changes. Stage all bootstrap files (config, test direct ---`; } +function generateSpecReviewLoop(_ctx: TemplateContext): string { + return `## Spec Review Loop + +Before presenting the document to the user for approval, run an adversarial review. + +**Step 1: Dispatch reviewer subagent** + +Use the Agent tool to dispatch an independent reviewer. The reviewer has fresh context +and cannot see the brainstorming conversation — only the document. This ensures genuine +adversarial independence. + +Prompt the subagent with: +- The file path of the document just written +- "Read this document and review it on 5 dimensions. For each dimension, note PASS or + list specific issues with suggested fixes. At the end, output a quality score (1-10) + across all dimensions." + +**Dimensions:** +1. **Completeness** — Are all requirements addressed? Missing edge cases? +2. **Consistency** — Do parts of the document agree with each other? Contradictions? +3. **Clarity** — Could an engineer implement this without asking questions? Ambiguous language? +4. **Scope** — Does the document creep beyond the original problem? YAGNI violations? +5. 
**Feasibility** — Can this actually be built with the stated approach? Hidden complexity? + +The subagent should return: +- A quality score (1-10) +- PASS if no issues, or a numbered list of issues with dimension, description, and fix + +**Step 2: Fix and re-dispatch** + +If the reviewer returns issues: +1. Fix each issue in the document on disk (use Edit tool) +2. Re-dispatch the reviewer subagent with the updated document +3. Maximum 3 iterations total + +**Convergence guard:** If the reviewer returns the same issues on consecutive iterations +(the fix didn't resolve them or the reviewer disagrees with the fix), stop the loop +and persist those issues as "Reviewer Concerns" in the document rather than looping +further. + +If the subagent fails, times out, or is unavailable — skip the review loop entirely. +Tell the user: "Spec review unavailable — presenting unreviewed doc." The document is +already written to disk; the review is a quality bonus, not a gate. + +**Step 3: Report and persist metrics** + +After the loop completes (PASS, max iterations, or convergence guard): + +1. Tell the user the result — summary by default: + "Your doc survived N rounds of adversarial review. M issues caught and fixed. + Quality score: X/10." + If they ask "what did the reviewer find?", show the full reviewer output. + +2. If issues remain after max iterations or convergence, add a "## Reviewer Concerns" + section to the document listing each unresolved issue. Downstream skills will see this. + +3. 
Append metrics: +\`\`\`bash +mkdir -p ~/.gstack/analytics +echo '{"skill":"${_ctx.skillName}","ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'","iterations":ITERATIONS,"issues_found":FOUND,"issues_fixed":FIXED,"remaining":REMAINING,"quality_score":SCORE}' >> ~/.gstack/analytics/spec-review.jsonl 2>/dev/null || true +\`\`\` +Replace ITERATIONS, FOUND, FIXED, REMAINING, SCORE with actual values from the review.`; +} + +function generateBenefitsFrom(ctx: TemplateContext): string { + if (!ctx.benefitsFrom || ctx.benefitsFrom.length === 0) return ''; + + const skillList = ctx.benefitsFrom.map(s => `\`/${s}\``).join(' or '); + const first = ctx.benefitsFrom[0]; + + return `## Prerequisite Skill Offer + +When the design doc check above prints "No design doc found," offer the prerequisite +skill before proceeding. + +Say to the user via AskUserQuestion: + +> "No design doc found for this branch. ${skillList} produces a structured problem +> statement, premise challenge, and explored alternatives — it gives this review much +> sharper input to work with. Takes about 10 minutes. The design doc is per-feature, +> not per-product — it captures the thinking behind this specific change." + +Options: +- A) Run /${first} first (in another window, then come back) +- B) Skip — proceed with standard review + +If they skip: "No worries — standard review. If you ever want sharper input, try +/${first} first next time." Then proceed normally. Do not re-offer later in the session.`; +} + +function generateDesignSketch(_ctx: TemplateContext): string { + return `## Visual Sketch (UI ideas only) + +If the chosen approach involves user-facing UI (screens, pages, forms, dashboards, +or interactive elements), generate a rough wireframe to help the user visualize it. +If the idea is backend-only, infrastructure, or has no UI component — skip this +section silently. + +**Step 1: Gather design context** + +1. Check if \`DESIGN.md\` exists in the repo root. 
If it does, read it for design + system constraints (colors, typography, spacing, component patterns). Use these + constraints in the wireframe. +2. Apply core design principles: + - **Information hierarchy** — what does the user see first, second, third? + - **Interaction states** — loading, empty, error, success, partial + - **Edge case paranoia** — what if the name is 47 chars? Zero results? Network fails? + - **Subtraction default** — "as little design as possible" (Rams). Every element earns its pixels. + - **Design for trust** — every interface element builds or erodes user trust. + +**Step 2: Generate wireframe HTML** + +Generate a single-page HTML file with these constraints: +- **Intentionally rough aesthetic** — use system fonts, thin gray borders, no color, + hand-drawn-style elements. This is a sketch, not a polished mockup. +- Self-contained — no external dependencies, no CDN links, inline CSS only +- Show the core interaction flow (1-3 screens/states max) +- Include realistic placeholder content (not "Lorem ipsum" — use content that + matches the actual use case) +- Add HTML comments explaining design decisions + +Write to a temp file: +\`\`\`bash +SKETCH_FILE="/tmp/gstack-sketch-$(date +%s).html" +\`\`\` + +**Step 3: Render and capture** + +\`\`\`bash +$B goto "file://$SKETCH_FILE" +$B screenshot /tmp/gstack-sketch.png +\`\`\` + +If \`$B\` is not available (browse binary not set up), skip the render step. Tell the +user: "Visual sketch requires the browse binary. Run the setup script to enable it." + +**Step 4: Present and iterate** + +Show the screenshot to the user. Ask: "Does this feel right? Want to iterate on the layout?" + +If they want changes, regenerate the HTML with their feedback and re-render. +If they approve or say "good enough," proceed. + +**Step 5: Include in design doc** + +Reference the wireframe screenshot in the design doc's "Recommended Approach" section. 
+The screenshot file at \`/tmp/gstack-sketch.png\` can be referenced by downstream skills +(\`/plan-design-review\`, \`/design-review\`) to see what was originally envisioned.`; +} + +function generateCodexReviewStep(ctx: TemplateContext): string { + // Codex host: strip entirely — Codex should never invoke itself + if (ctx.host === 'codex') return ''; + + const isShip = ctx.skillName === 'ship'; + const stepNum = isShip ? '3.8' : '5.7'; + + return `## Step ${stepNum}: Codex review + +Check if the Codex CLI is available and read the user's Codex review preference: + +\`\`\`bash +which codex 2>/dev/null && echo "CODEX_AVAILABLE" || echo "CODEX_NOT_AVAILABLE" +CODEX_REVIEWS_CFG=$(~/.claude/skills/gstack/bin/gstack-config get codex_reviews 2>/dev/null || true) +echo "CODEX_REVIEWS: \${CODEX_REVIEWS_CFG:-not_set}" +\`\`\` + +If \`CODEX_NOT_AVAILABLE\`: skip this step silently. Continue to the next step. + +If \`CODEX_REVIEWS\` is \`disabled\`: skip this step silently. Continue to the next step. + +If \`CODEX_REVIEWS\` is \`enabled\`: run both code review and adversarial challenge automatically (no prompt). Jump to the "Run Codex" section below. + +If \`CODEX_REVIEWS\` is \`not_set\`: use AskUserQuestion to offer the one-time adoption prompt: + +\`\`\` +GStack recommends enabling Codex code reviews — Codex is the super smart quiet engineer friend who will save your butt. + +A) Enable for all future runs (recommended, default) +B) Try it for now, ask me again later +C) No thanks, don't ask me again +\`\`\` + +If the user chooses A: persist the setting and run both: +\`\`\`bash +~/.claude/skills/gstack/bin/gstack-config set codex_reviews enabled +\`\`\` + +If the user chooses B: run both this time but do not persist any setting. + +If the user chooses C: persist the opt-out and skip: +\`\`\`bash +~/.claude/skills/gstack/bin/gstack-config set codex_reviews disabled +\`\`\` +Then skip this step. Continue to the next step. 
+
+### Run Codex
+
+Always run **both** code review and adversarial challenge. Use a 5-minute timeout (\`timeout: 300000\`) on each Bash call.
+
+First, create a temp file for stderr capture:
+\`\`\`bash
+TMPERR=$(mktemp /tmp/codex-review-XXXXXXXX)
+\`\`\`
+
+**Code review:** Run:
+\`\`\`bash
+codex review --base <base-branch> -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR"
+\`\`\`
+
+After the command completes, read stderr for cost/error info:
+\`\`\`bash
+cat "$TMPERR"
+\`\`\`
+
+Present the full output verbatim under a \`CODEX SAYS (code review):\` header:
+
+\`\`\`
+CODEX SAYS (code review):
+════════════════════════════════════════════════════════════
+<full codex review output>
+════════════════════════════════════════════════════════════
+GATE: PASS Tokens: N | Est. cost: ~$X.XX
+\`\`\`
+
+Check the output for \`[P1]\` markers. If found: \`GATE: FAIL\`. If no \`[P1]\`: \`GATE: PASS\`.
+
+**If GATE is FAIL:** use AskUserQuestion:
+
+\`\`\`
+Codex found N critical issues in the diff.
+
+A) Investigate and fix now (recommended)
+B) Ship anyway — these issues may cause production problems
+\`\`\`
+
+If the user chooses A: read the Codex findings carefully and work to address them${isShip ? '. After fixing, re-run tests (Step 3) since code has changed' : ''}. Then re-run \`codex review\` to verify the gate is now PASS.
+
+If the user chooses B: continue to the next step.
+
+### Error handling (code review)
+
+Before persisting the gate result, check for errors. All errors are non-blocking — Codex is a quality enhancement, not a prerequisite. Check \`$TMPERR\` output (already read above) for error indicators:
+
+- **Auth failure:** If stderr contains "auth", "login", "unauthorized", or "API key", tell the user: "Codex authentication failed. Run \\\`codex login\\\` in your terminal to authenticate via ChatGPT." Do NOT persist a review log entry. Continue to the adversarial step (it will likely fail too, but try anyway).
+- **Timeout:** If the Bash call times out (5 min), tell the user: "Codex timed out after 5 minutes. The diff may be too large or the API may be slow." Do NOT persist a review log entry. Skip to cleanup.
+- **Empty response:** If codex returned no stdout output, tell the user: "Codex returned no response. Stderr: <stderr contents>." Do NOT persist a review log entry. Skip to cleanup.
+
+**Only if codex produced a real review (non-empty stdout):** Persist the code review result:
+\`\`\`bash
+~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"codex-review","timestamp":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'","status":"STATUS","gate":"GATE","commit":"'"$(git rev-parse --short HEAD)"'"}'
+\`\`\`
+
+Substitute: STATUS ("clean" if PASS, "issues_found" if FAIL), GATE ("pass" or "fail").
+
+**Adversarial challenge:** Run:
+\`\`\`bash
+TMPERR_ADV=$(mktemp /tmp/codex-adv-XXXXXXXX)
+codex exec "Review the changes on this branch against the base branch. Run git diff origin/<base-branch> to see the diff. Your job is to find ways this code will fail in production. Think like an attacker and a chaos engineer. Find edge cases, race conditions, security holes, resource leaks, failure modes, and silent data corruption paths. Be adversarial. Be thorough. No compliments — just the problems." -s read-only -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR_ADV"
+\`\`\`
+
+After the command completes, read adversarial stderr:
+\`\`\`bash
+cat "$TMPERR_ADV"
+\`\`\`
+
+Present the full output verbatim under a \`CODEX SAYS (adversarial challenge):\` header. This is informational — it never blocks shipping. If the adversarial command timed out or returned no output, note this to the user and continue.
+${!isShip ?
`
+**Cross-model analysis:** After both Codex outputs are presented, compare Codex's findings with your own review findings from the earlier review steps and output:
+
+\`\`\`
+CROSS-MODEL ANALYSIS:
+  Both found: [findings that overlap between Claude and Codex]
+  Only Codex found: [findings unique to Codex]
+  Only Claude found: [findings unique to Claude's review]
+  Agreement rate: X% (N/M total unique findings overlap)
+\`\`\`
+` : ''}
+**Cleanup:** Run \`rm -f "$TMPERR" "$TMPERR_ADV"\` after processing.
+
+---`;
+}
+
 const RESOLVERS: Record<string, (ctx: TemplateContext) => string> = {
   COMMAND_REFERENCE: generateCommandReference,
   SNAPSHOT_FLAGS: generateSnapshotFlags,
@@ -1388,6 +1672,10 @@ const RESOLVERS: Record<string, (ctx: TemplateContext) => string> = {
   REVIEW_DASHBOARD: generateReviewDashboard,
   TEST_BOOTSTRAP: generateTestBootstrap,
   TEST_FAILURE_TRIAGE: generateTestFailureTriage,
+  SPEC_REVIEW_LOOP: generateSpecReviewLoop,
+  DESIGN_SKETCH: generateDesignSketch,
+  BENEFITS_FROM: generateBenefitsFrom,
+  CODEX_REVIEW_STEP: generateCodexReviewStep,
 };

 // ─── Codex Helpers ───────────────────────────────────────────
@@ -1510,7 +1798,14 @@ function processTemplate(tmplPath: string, host: Host = 'claude'): { outputPath:
   // Extract skill name from frontmatter for TemplateContext
   const nameMatch = tmplContent.match(/^name:\s*(.+)$/m);
   const skillName = nameMatch ? nameMatch[1].trim() : path.basename(path.dirname(tmplPath));
-  const ctx: TemplateContext = { skillName, tmplPath, host, paths: HOST_PATHS[host] };
+
+  // Extract benefits-from list from frontmatter (inline YAML: benefits-from: [a, b])
+  const benefitsMatch = tmplContent.match(/^benefits-from:\s*\[([^\]]*)\]/m);
+  const benefitsFrom = benefitsMatch
+    ?
benefitsMatch[1].split(',').map(s => s.trim()).filter(Boolean) + : undefined; + + const ctx: TemplateContext = { skillName, tmplPath, benefitsFrom, host, paths: HOST_PATHS[host] }; // Replace placeholders let content = tmplContent.replace(/\{\{(\w+)\}\}/g, (match, name) => { diff --git a/setup b/setup index cf3e5050..09d2282f 100755 --- a/setup +++ b/setup @@ -12,6 +12,11 @@ GSTACK_DIR="$(cd "$(dirname "$0")" && pwd)" SKILLS_DIR="$(dirname "$GSTACK_DIR")" BROWSE_BIN="$GSTACK_DIR/browse/dist/browse" +IS_WINDOWS=0 +case "$(uname -s)" in + MINGW*|MSYS*|CYGWIN*|Windows_NT) IS_WINDOWS=1 ;; +esac + # ─── Parse --host flag ───────────────────────────────────────── HOST="claude" while [ $# -gt 0 ]; do @@ -44,10 +49,19 @@ elif [ "$HOST" = "codex" ]; then fi ensure_playwright_browser() { - ( - cd "$GSTACK_DIR" - bun --eval 'import { chromium } from "playwright"; const browser = await chromium.launch(); await browser.close();' - ) >/dev/null 2>&1 + if [ "$IS_WINDOWS" -eq 1 ]; then + # On Windows, Bun can't launch Chromium due to broken pipe handling + # (oven-sh/bun#4253). Use Node.js to verify Chromium works instead. + ( + cd "$GSTACK_DIR" + node -e "const { chromium } = require('playwright'); (async () => { const b = await chromium.launch(); await b.close(); })()" 2>/dev/null + ) + else + ( + cd "$GSTACK_DIR" + bun --eval 'import { chromium } from "playwright"; const browser = await chromium.launch(); await browser.close();' + ) >/dev/null 2>&1 + fi } # 1. Build browse binary if needed (smart rebuild: stale sources, package.json, lock) @@ -87,10 +101,32 @@ if ! ensure_playwright_browser; then cd "$GSTACK_DIR" bunx playwright install chromium ) + + if [ "$IS_WINDOWS" -eq 1 ]; then + # On Windows, Node.js launches Chromium (not Bun — see oven-sh/bun#4253). + # Ensure playwright is importable by Node from the gstack directory. + if ! 
command -v node >/dev/null 2>&1; then + echo "gstack setup failed: Node.js is required on Windows (Bun cannot launch Chromium due to a pipe bug)" >&2 + echo " Install Node.js: https://nodejs.org/" >&2 + exit 1 + fi + echo "Windows detected — verifying Node.js can load Playwright..." + ( + cd "$GSTACK_DIR" + # Bun's node_modules already has playwright; verify Node can require it + node -e "require('playwright')" 2>/dev/null || npm install --no-save playwright + ) + fi fi if ! ensure_playwright_browser; then - echo "gstack setup failed: Playwright Chromium could not be launched" >&2 + if [ "$IS_WINDOWS" -eq 1 ]; then + echo "gstack setup failed: Playwright Chromium could not be launched via Node.js" >&2 + echo " This is a known issue with Bun on Windows (oven-sh/bun#4253)." >&2 + echo " Ensure Node.js is installed and 'node -e \"require('playwright')\"' works." >&2 + else + echo "gstack setup failed: Playwright Chromium could not be launched" >&2 + fi exit 1 fi diff --git a/ship/SKILL.md b/ship/SKILL.md index 24941c0a..5da77a27 100644 --- a/ship/SKILL.md +++ b/ship/SKILL.md @@ -319,7 +319,7 @@ Parse the output. Find the most recent entry for each skill (plan-ceo-review, pl - **Eng Review (required by default):** The only review that gates shipping. Covers architecture, code quality, tests, performance. Can be disabled globally with \`gstack-config set skip_eng_review true\` (the "don't bother me" setting). - **CEO Review (optional):** Use your judgment. Recommend it for big product/business changes, new user-facing features, or scope decisions. Skip for bug fixes, refactors, infra, and cleanup. - **Design Review (optional):** Use your judgment. Recommend it for UI/UX changes. Skip for backend-only, infra, or prompt-only changes. -- **Codex Review (optional):** Independent second opinion from OpenAI Codex CLI. Shows pass/fail gate. Recommend for critical code changes where a second AI perspective adds value. Skip when Codex CLI is not installed. 
+- **Codex Review (enabled by default when Codex CLI is installed):** Independent review + adversarial challenge from OpenAI Codex CLI. Shows pass/fail gate. Runs automatically when enabled — configure with \`gstack-config set codex_reviews enabled|disabled\`. **Verdict logic:** - **CLEARED**: Eng Review has >= 1 entry within 7 days with status "clean" (or \`skip_eng_review\` is \`true\`) @@ -959,41 +959,118 @@ For each classified comment: --- -## Step 3.8: Codex second opinion (optional) +## Step 3.8: Codex review -Check if the Codex CLI is available: +Check if the Codex CLI is available and read the user's Codex review preference: ```bash which codex 2>/dev/null && echo "CODEX_AVAILABLE" || echo "CODEX_NOT_AVAILABLE" +CODEX_REVIEWS_CFG=$(~/.claude/skills/gstack/bin/gstack-config get codex_reviews 2>/dev/null || true) +echo "CODEX_REVIEWS: ${CODEX_REVIEWS_CFG:-not_set}" ``` -If Codex is available, use AskUserQuestion: +If `CODEX_NOT_AVAILABLE`: skip this step silently. Continue to the next step. + +If `CODEX_REVIEWS` is `disabled`: skip this step silently. Continue to the next step. + +If `CODEX_REVIEWS` is `enabled`: run both code review and adversarial challenge automatically (no prompt). Jump to the "Run Codex" section below. + +If `CODEX_REVIEWS` is `not_set`: use AskUserQuestion to offer the one-time adoption prompt: ``` -Pre-landing review complete. Want an independent Codex (OpenAI) review before shipping? +GStack recommends enabling Codex code reviews — Codex is the super smart quiet engineer friend who will save your butt. -A) Run Codex code review — independent diff review with pass/fail gate -B) Run Codex adversarial challenge — try to break this code -C) Skip — ship without Codex review +A) Enable for all future runs (recommended, default) +B) Try it for now, ask me again later +C) No thanks, don't ask me again ``` -If the user chooses A or B: - -**For code review (A):** Run `codex review --base ` with a 5-minute timeout. 
-Present the full output verbatim under a `CODEX SAYS:` header. Check for `[P1]` markers
-to determine pass/fail gate. Persist the result:
-
+If the user chooses A: persist the setting and run both:
 ```bash
-~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"codex-review","timestamp":"TIMESTAMP","status":"STATUS","gate":"GATE"}'
+~/.claude/skills/gstack/bin/gstack-config set codex_reviews enabled
 ```

-If GATE is FAIL, use AskUserQuestion: "Codex found critical issues. Ship anyway?"
-If the user says no, stop. If yes, continue to Step 4.
+If the user chooses B: run both this time but do not persist any setting.

-**For adversarial (B):** Run codex exec with the adversarial prompt (see /codex skill).
-Present findings. This is informational — does not block shipping.
+If the user chooses C: persist the opt-out and skip:
+```bash
+~/.claude/skills/gstack/bin/gstack-config set codex_reviews disabled
+```
+Then skip this step. Continue to the next step.

-If Codex is not available, skip silently. Continue to Step 4.
+
+### Run Codex
+
+Always run **both** code review and adversarial challenge. Use a 5-minute timeout (`timeout: 300000`) on each Bash call.
+
+First, create a temp file for stderr capture:
+```bash
+TMPERR=$(mktemp /tmp/codex-review-XXXXXXXX)
+```
+
+**Code review:** Run:
+```bash
+codex review --base <base-branch> -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR"
+```
+
+After the command completes, read stderr for cost/error info:
+```bash
+cat "$TMPERR"
+```
+
+Present the full output verbatim under a `CODEX SAYS (code review):` header:
+
+```
+CODEX SAYS (code review):
+════════════════════════════════════════════════════════════
+<full codex review output>
+════════════════════════════════════════════════════════════
+GATE: PASS Tokens: N | Est. cost: ~$X.XX
+```
+
+Check the output for `[P1]` markers. If found: `GATE: FAIL`. If no `[P1]`: `GATE: PASS`.
+
+**If GATE is FAIL:** use AskUserQuestion:
+
+```
+Codex found N critical issues in the diff.
+
+A) Investigate and fix now (recommended)
+B) Ship anyway — these issues may cause production problems
+```
+
+If the user chooses A: read the Codex findings carefully and work to address them. After fixing, re-run tests (Step 3) since code has changed. Then re-run `codex review` to verify the gate is now PASS.
+
+If the user chooses B: continue to the next step.
+
+### Error handling (code review)
+
+Before persisting the gate result, check for errors. All errors are non-blocking — Codex is a quality enhancement, not a prerequisite. Check `$TMPERR` output (already read above) for error indicators:
+
+- **Auth failure:** If stderr contains "auth", "login", "unauthorized", or "API key", tell the user: "Codex authentication failed. Run \`codex login\` in your terminal to authenticate via ChatGPT." Do NOT persist a review log entry. Continue to the adversarial step (it will likely fail too, but try anyway).
+- **Timeout:** If the Bash call times out (5 min), tell the user: "Codex timed out after 5 minutes. The diff may be too large or the API may be slow." Do NOT persist a review log entry. Skip to cleanup.
+- **Empty response:** If codex returned no stdout output, tell the user: "Codex returned no response. Stderr: <stderr contents>." Do NOT persist a review log entry. Skip to cleanup.
+
+**Only if codex produced a real review (non-empty stdout):** Persist the code review result:
+```bash
+~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"codex-review","timestamp":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'","status":"STATUS","gate":"GATE","commit":"'"$(git rev-parse --short HEAD)"'"}'
+```
+
+Substitute: STATUS ("clean" if PASS, "issues_found" if FAIL), GATE ("pass" or "fail").
+
+**Adversarial challenge:** Run:
+```bash
+TMPERR_ADV=$(mktemp /tmp/codex-adv-XXXXXXXX)
+codex exec "Review the changes on this branch against the base branch. Run git diff origin/<base-branch> to see the diff. Your job is to find ways this code will fail in production. Think like an attacker and a chaos engineer.
Find edge cases, race conditions, security holes, resource leaks, failure modes, and silent data corruption paths. Be adversarial. Be thorough. No compliments — just the problems." -s read-only -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR_ADV" +``` + +After the command completes, read adversarial stderr: +```bash +cat "$TMPERR_ADV" +``` + +Present the full output verbatim under a `CODEX SAYS (adversarial challenge):` header. This is informational — it never blocks shipping. If the adversarial command timed out or returned no output, note this to the user and continue. + +**Cleanup:** Run `rm -f "$TMPERR" "$TMPERR_ADV"` after processing. --- @@ -1236,7 +1313,7 @@ doc updates — the user runs `/ship` and documentation stays current without a - **Never skip tests.** If tests fail, stop. - **Never skip the pre-landing review.** If checklist.md is unreadable, stop. - **Never force push.** Use regular `git push` only. -- **Never ask for confirmation** except for MINOR/MAJOR version bumps and pre-landing review ASK items (batched into at most one AskUserQuestion). +- **Never ask for trivial confirmations** (e.g., "ready to push?", "create PR?"). DO stop for: version bumps (MINOR/MAJOR), pre-landing review findings (ASK items), Codex critical findings ([P1]), and the one-time Codex adoption prompt. - **Always use the 4-digit version format** from the VERSION file. - **Date format in CHANGELOG:** `YYYY-MM-DD` - **Split commits for bisectability** — each commit = one logical change. diff --git a/ship/SKILL.md.tmpl b/ship/SKILL.md.tmpl index d9669bd6..03129aee 100644 --- a/ship/SKILL.md.tmpl +++ b/ship/SKILL.md.tmpl @@ -407,43 +407,7 @@ For each classified comment: --- -## Step 3.8: Codex second opinion (optional) - -Check if the Codex CLI is available: - -```bash -which codex 2>/dev/null && echo "CODEX_AVAILABLE" || echo "CODEX_NOT_AVAILABLE" -``` - -If Codex is available, use AskUserQuestion: - -``` -Pre-landing review complete. 
Want an independent Codex (OpenAI) review before shipping? - -A) Run Codex code review — independent diff review with pass/fail gate -B) Run Codex adversarial challenge — try to break this code -C) Skip — ship without Codex review -``` - -If the user chooses A or B: - -**For code review (A):** Run `codex review --base ` with a 5-minute timeout. -Present the full output verbatim under a `CODEX SAYS:` header. Check for `[P1]` markers -to determine pass/fail gate. Persist the result: - -```bash -~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"codex-review","timestamp":"TIMESTAMP","status":"STATUS","gate":"GATE"}' -``` - -If GATE is FAIL, use AskUserQuestion: "Codex found critical issues. Ship anyway?" -If the user says no, stop. If yes, continue to Step 4. - -**For adversarial (B):** Run codex exec with the adversarial prompt (see /codex skill). -Present findings. This is informational — does not block shipping. - -If Codex is not available, skip silently. Continue to Step 4. - ---- +{{CODEX_REVIEW_STEP}} ## Step 4: Version bump (auto-decide) @@ -684,7 +648,7 @@ doc updates — the user runs `/ship` and documentation stays current without a - **Never skip tests.** If tests fail, stop. - **Never skip the pre-landing review.** If checklist.md is unreadable, stop. - **Never force push.** Use regular `git push` only. -- **Never ask for confirmation** except for MINOR/MAJOR version bumps and pre-landing review ASK items (batched into at most one AskUserQuestion). +- **Never ask for trivial confirmations** (e.g., "ready to push?", "create PR?"). DO stop for: version bumps (MINOR/MAJOR), pre-landing review findings (ASK items), Codex critical findings ([P1]), and the one-time Codex adoption prompt. - **Always use the 4-digit version format** from the VERSION file. - **Date format in CHANGELOG:** `YYYY-MM-DD` - **Split commits for bisectability** — each commit = one logical change. 
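The pass/fail gate that the new Step 3.8 describes in prose — any `[P1]` marker in the Codex output fails the gate — can be sketched as a tiny helper. This is an illustrative function, not part of the gstack codebase (the skill performs the check inline):

```typescript
// Hypothetical helper sketching the [P1] gate check from Step 3.8.
// The actual skill scans the Codex output inline rather than calling a function.
function codexGate(reviewOutput: string): 'PASS' | 'FAIL' {
  // Any [P1] (critical) finding marker fails the gate; P2+ findings do not.
  return /\[P1\]/.test(reviewOutput) ? 'FAIL' : 'PASS';
}

console.log(codexGate('[P2] consider renaming this variable')); // PASS
console.log(codexGate('[P1] unvalidated input reaches SQL query')); // FAIL
```

Because the gate keys off a literal marker rather than parsing structured output, it degrades safely: an empty or malformed review simply yields PASS, which is why the error-handling section refuses to persist a log entry unless Codex produced real stdout.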
diff --git a/test/gemini-e2e.test.ts b/test/gemini-e2e.test.ts new file mode 100644 index 00000000..bd69919f --- /dev/null +++ b/test/gemini-e2e.test.ts @@ -0,0 +1,173 @@ +/** + * Gemini CLI E2E tests — verify skills work when invoked by Gemini CLI. + * + * Spawns `gemini -p` with stream-json output in the repo root (where + * .agents/skills/ already exists), parses JSONL events, and validates + * structured results. Follows the same pattern as codex-e2e.test.ts. + * + * Prerequisites: + * - `gemini` binary installed (npm install -g @google/gemini-cli) + * - Gemini authenticated via ~/.gemini/ config or GEMINI_API_KEY env var + * - EVALS=1 env var set (same gate as Claude E2E tests) + * + * Skips gracefully when prerequisites are not met. + */ + +import { describe, test, expect, afterAll } from 'bun:test'; +import { runGeminiSkill } from './helpers/gemini-session-runner'; +import type { GeminiResult } from './helpers/gemini-session-runner'; +import { EvalCollector } from './helpers/eval-store'; +import { selectTests, detectBaseBranch, getChangedFiles, GLOBAL_TOUCHFILES } from './helpers/touchfiles'; +import * as path from 'path'; + +const ROOT = path.resolve(import.meta.dir, '..'); + +// --- Prerequisites check --- + +const GEMINI_AVAILABLE = (() => { + try { + const result = Bun.spawnSync(['which', 'gemini']); + return result.exitCode === 0; + } catch { return false; } +})(); + +const evalsEnabled = !!process.env.EVALS; + +// Skip all tests if gemini is not available or EVALS is not set. +const SKIP = !GEMINI_AVAILABLE || !evalsEnabled; + +const describeGemini = SKIP ? 
describe.skip : describe;
+
+// Log why we're skipping (helpful for debugging CI)
+if (!evalsEnabled) {
+  // Silent — same as Claude E2E tests, EVALS=1 required
+} else if (!GEMINI_AVAILABLE) {
+  process.stderr.write('\nGemini E2E: SKIPPED — gemini binary not found (install: npm i -g @google/gemini-cli)\n');
+}
+
+// --- Diff-based test selection ---
+
+// Gemini E2E touchfiles — keyed by test name, same pattern as Codex E2E
+const GEMINI_E2E_TOUCHFILES: Record<string, string[]> = {
+  'gemini-discover-skill': ['.agents/skills/**', 'test/helpers/gemini-session-runner.ts'],
+  'gemini-review-findings': ['review/**', '.agents/skills/gstack-review/**', 'test/helpers/gemini-session-runner.ts'],
+};
+
+let selectedTests: string[] | null = null; // null = run all
+
+if (evalsEnabled && !process.env.EVALS_ALL) {
+  const baseBranch = process.env.EVALS_BASE
+    || detectBaseBranch(ROOT)
+    || 'main';
+  const changedFiles = getChangedFiles(baseBranch, ROOT);
+
+  if (changedFiles.length > 0) {
+    const selection = selectTests(changedFiles, GEMINI_E2E_TOUCHFILES, GLOBAL_TOUCHFILES);
+    selectedTests = selection.selected;
+    process.stderr.write(`\nGemini E2E selection (${selection.reason}): ${selection.selected.length}/${Object.keys(GEMINI_E2E_TOUCHFILES).length} tests\n`);
+    if (selection.skipped.length > 0) {
+      process.stderr.write(`  Skipped: ${selection.skipped.join(', ')}\n`);
+    }
+    process.stderr.write('\n');
+  }
+  // If changedFiles is empty (e.g., on main branch), selectedTests stays null -> run all
+}
+
+/** Skip an individual test if not selected by diff-based selection. */
+function testIfSelected(testName: string, fn: () => Promise<void>, timeout: number) {
+  const shouldRun = selectedTests === null || selectedTests.includes(testName);
+  (shouldRun ? test : test.skip)(testName, fn, timeout);
+}
+
+// --- Eval result collector ---
+
+const evalCollector = evalsEnabled && !SKIP ? new EvalCollector('e2e-gemini') : null;
+
+/** DRY helper to record a Gemini E2E test result into the eval collector.
+ */
+function recordGeminiE2E(name: string, result: GeminiResult, passed: boolean) {
+  evalCollector?.addTest({
+    name,
+    suite: 'gemini-e2e',
+    tier: 'e2e',
+    passed,
+    duration_ms: result.durationMs,
+    cost_usd: 0, // Gemini doesn't report cost in USD; tokens are tracked
+    output: result.output?.slice(0, 2000),
+    turns_used: result.toolCalls.length, // approximate: tool calls as turns
+    exit_reason: result.exitCode === 0 ? 'success' : `exit_code_${result.exitCode}`,
+  });
+}
+
+/** Print cost summary after a Gemini E2E test. */
+function logGeminiCost(label: string, result: GeminiResult) {
+  const durationSec = Math.round(result.durationMs / 1000);
+  console.log(`${label}: ${result.tokens} tokens, ${result.toolCalls.length} tool calls, ${durationSec}s`);
+}
+
+// Finalize eval results on exit
+afterAll(async () => {
+  if (evalCollector) {
+    await evalCollector.finalize();
+  }
+});
+
+// --- Tests ---
+
+describeGemini('Gemini E2E', () => {
+
+  testIfSelected('gemini-discover-skill', async () => {
+    // Run Gemini in the repo root where .agents/skills/ exists
+    const result = await runGeminiSkill({
+      prompt: 'List any skills or instructions you have available. Just list the names.',
+      timeoutMs: 60_000,
+      cwd: ROOT,
+    });
+
+    logGeminiCost('gemini-discover-skill', result);
+
+    // Gemini should have produced some output
+    const passed = result.exitCode === 0 && result.output.length > 0;
+    recordGeminiE2E('gemini-discover-skill', result, passed);
+
+    expect(result.exitCode).toBe(0);
+    expect(result.output.length).toBeGreaterThan(0);
+    // The output should reference skills in some form
+    const outputLower = result.output.toLowerCase();
+    expect(
+      outputLower.includes('review') || outputLower.includes('gstack') || outputLower.includes('skill'),
+    ).toBe(true);
+  }, 120_000);
+
+  testIfSelected('gemini-review-findings', async () => {
+    // Run gstack-review skill via Gemini on this repo
+    const result = await runGeminiSkill({
+      prompt: 'Run the gstack-review skill on this repository. Review the current branch diff and report your findings.',
+      timeoutMs: 540_000,
+      cwd: ROOT,
+    });
+
+    logGeminiCost('gemini-review-findings', result);
+
+    // Should produce structured review-like output
+    const output = result.output;
+    const passed = result.exitCode === 0 && output.length > 50;
+    recordGeminiE2E('gemini-review-findings', result, passed);
+
+    expect(result.exitCode).toBe(0);
+    expect(output.length).toBeGreaterThan(50);
+
+    // Review output should contain some review-like content
+    const outputLower = output.toLowerCase();
+    const hasReviewContent =
+      outputLower.includes('finding') ||
+      outputLower.includes('issue') ||
+      outputLower.includes('review') ||
+      outputLower.includes('change') ||
+      outputLower.includes('diff') ||
+      outputLower.includes('clean') ||
+      outputLower.includes('no issues') ||
+      outputLower.includes('p1') ||
+      outputLower.includes('p2');
+    expect(hasReviewContent).toBe(true);
+  }, 600_000);
+});
diff --git a/test/gen-skill-docs.test.ts b/test/gen-skill-docs.test.ts
index 49714f2a..64b39118 100644
--- a/test/gen-skill-docs.test.ts
+++ b/test/gen-skill-docs.test.ts
@@ -416,6 +416,98 @@ describe('REVIEW_DASHBOARD resolver', () => {
   });
 });
 
+// --- {{SPEC_REVIEW_LOOP}} resolver tests ---
+
+describe('SPEC_REVIEW_LOOP resolver', () => {
+  const content = fs.readFileSync(path.join(ROOT, 'office-hours', 'SKILL.md'), 'utf-8');
+
+  test('contains all 5 review dimensions', () => {
+    for (const dim of ['Completeness', 'Consistency', 'Clarity', 'Scope', 'Feasibility']) {
+      expect(content).toContain(dim);
+    }
+  });
+
+  test('references Agent tool for subagent dispatch', () => {
+    expect(content).toMatch(/Agent.*tool/i);
+  });
+
+  test('specifies max 3 iterations', () => {
+    expect(content).toMatch(/3.*iteration|maximum.*3/i);
+  });
+
+  test('includes quality score', () => {
+    expect(content).toContain('quality score');
+  });
+
+  test('includes metrics path', () => {
+    expect(content).toContain('spec-review.jsonl');
+  });
+
+  test('includes convergence guard', () => {
+    expect(content).toMatch(/[Cc]onvergence/);
+  });
+
+  test('includes graceful failure handling', () => {
+    expect(content).toMatch(/skip.*review|unavailable/i);
+  });
+});
+
+// --- {{DESIGN_SKETCH}} resolver tests ---
+
+describe('DESIGN_SKETCH resolver', () => {
+  const content = fs.readFileSync(path.join(ROOT, 'office-hours', 'SKILL.md'), 'utf-8');
+
+  test('references DESIGN.md for design system constraints', () => {
+    expect(content).toContain('DESIGN.md');
+  });
+
+  test('contains wireframe or sketch terminology', () => {
+    expect(content).toMatch(/wireframe|sketch/i);
+  });
+
+  test('references browse binary for rendering', () => {
+    expect(content).toContain('$B goto');
+  });
+
+  test('references screenshot capture', () => {
+    expect(content).toContain('$B screenshot');
+  });
+
+  test('specifies rough aesthetic', () => {
+    expect(content).toMatch(/[Rr]ough|hand-drawn/);
+  });
+
+  test('includes skip conditions', () => {
+    expect(content).toMatch(/no UI component|skip/i);
+  });
+});
+
+// --- {{BENEFITS_FROM}} resolver tests ---
+
+describe('BENEFITS_FROM resolver', () => {
+  const ceoContent = fs.readFileSync(path.join(ROOT, 'plan-ceo-review', 'SKILL.md'), 'utf-8');
+  const engContent = fs.readFileSync(path.join(ROOT, 'plan-eng-review', 'SKILL.md'), 'utf-8');
+
+  test('plan-ceo-review contains prerequisite skill offer', () => {
+    expect(ceoContent).toContain('Prerequisite Skill Offer');
+    expect(ceoContent).toContain('/office-hours');
+  });
+
+  test('plan-eng-review contains prerequisite skill offer', () => {
+    expect(engContent).toContain('Prerequisite Skill Offer');
+    expect(engContent).toContain('/office-hours');
+  });
+
+  test('offer includes graceful decline', () => {
+    expect(ceoContent).toContain('No worries');
+  });
+
+  test('skills without benefits-from do NOT have prerequisite offer', () => {
+    const qaContent = fs.readFileSync(path.join(ROOT, 'qa', 'SKILL.md'), 'utf-8');
+    expect(qaContent).not.toContain('Prerequisite Skill Offer');
+  });
+});
+
 // ─── Codex Generation Tests ─────────────────────────────────
 
 describe('Codex generation (--host codex)', () => {
@@ -492,6 +584,16 @@ describe('Codex generation (--host codex)', () => {
     expect(fs.existsSync(path.join(AGENTS_DIR, 'gstack-codex'))).toBe(false);
   });
 
+  test('Codex review step stripped from Codex-host ship and review', () => {
+    const shipContent = fs.readFileSync(path.join(AGENTS_DIR, 'gstack-ship', 'SKILL.md'), 'utf-8');
+    expect(shipContent).not.toContain('codex review --base');
+    expect(shipContent).not.toContain('Investigate and fix');
+
+    const reviewContent = fs.readFileSync(path.join(AGENTS_DIR, 'gstack-review', 'SKILL.md'), 'utf-8');
+    expect(reviewContent).not.toContain('codex review --base');
+    expect(reviewContent).not.toContain('Investigate and fix');
+  });
+
   test('--host codex --dry-run freshness', () => {
     const result = Bun.spawnSync(['bun', 'run', 'scripts/gen-skill-docs.ts', '--host', 'codex', '--dry-run'], {
       cwd: ROOT,
diff --git a/test/helpers/gemini-session-runner.test.ts b/test/helpers/gemini-session-runner.test.ts
new file mode 100644
index 00000000..1bb9a393
--- /dev/null
+++ b/test/helpers/gemini-session-runner.test.ts
@@ -0,0 +1,104 @@
+import { describe, test, expect } from 'bun:test';
+import { parseGeminiJSONL } from './gemini-session-runner';
+
+// Fixture: actual Gemini CLI stream-json output with tool use
+const FIXTURE_LINES = [
+  '{"type":"init","timestamp":"2026-03-20T15:14:46.455Z","session_id":"test-session-123","model":"auto-gemini-3"}',
+  '{"type":"message","timestamp":"2026-03-20T15:14:46.456Z","role":"user","content":"list the files"}',
+  '{"type":"message","timestamp":"2026-03-20T15:14:49.650Z","role":"assistant","content":"I will list the files.","delta":true}',
+  '{"type":"tool_use","timestamp":"2026-03-20T15:14:49.690Z","tool_name":"run_shell_command","tool_id":"cmd_1","parameters":{"command":"ls"}}',
+  '{"type":"tool_result","timestamp":"2026-03-20T15:14:49.931Z","tool_id":"cmd_1","status":"success","output":"file1.ts\\nfile2.ts"}',
+  '{"type":"message","timestamp":"2026-03-20T15:14:51.945Z","role":"assistant","content":"Here are the files.","delta":true}',
+  '{"type":"result","timestamp":"2026-03-20T15:14:52.030Z","status":"success","stats":{"total_tokens":27147,"input_tokens":26928,"output_tokens":87,"cached":0,"duration_ms":5575,"tool_calls":1}}',
+];
+
+describe('parseGeminiJSONL', () => {
+  test('extracts session ID from init event', () => {
+    const parsed = parseGeminiJSONL(FIXTURE_LINES);
+    expect(parsed.sessionId).toBe('test-session-123');
+  });
+
+  test('concatenates assistant message deltas into output', () => {
+    const parsed = parseGeminiJSONL(FIXTURE_LINES);
+    expect(parsed.output).toBe('I will list the files.Here are the files.');
+  });
+
+  test('ignores user messages', () => {
+    const lines = [
+      '{"type":"message","role":"user","content":"this should be ignored"}',
+      '{"type":"message","role":"assistant","content":"this should be kept","delta":true}',
+    ];
+    const parsed = parseGeminiJSONL(lines);
+    expect(parsed.output).toBe('this should be kept');
+  });
+
+  test('extracts tool names from tool_use events', () => {
+    const parsed = parseGeminiJSONL(FIXTURE_LINES);
+    expect(parsed.toolCalls).toHaveLength(1);
+    expect(parsed.toolCalls[0]).toBe('run_shell_command');
+  });
+
+  test('extracts total tokens from result stats', () => {
+    const parsed = parseGeminiJSONL(FIXTURE_LINES);
+    expect(parsed.tokens).toBe(27147);
+  });
+
+  test('skips malformed lines without throwing', () => {
+    const lines = [
+      '{"type":"init","session_id":"ok"}',
+      'this is not json',
+      '{"type":"message","role":"assistant","content":"hello","delta":true}',
+      '{incomplete json',
+      '{"type":"result","status":"success","stats":{"total_tokens":100}}',
+    ];
+    const parsed = parseGeminiJSONL(lines);
+    expect(parsed.sessionId).toBe('ok');
+    expect(parsed.output).toBe('hello');
+    expect(parsed.tokens).toBe(100);
+  });
+
+  test('skips empty and whitespace-only lines', () => {
+    const lines = [
+      '',
+      ' ',
+      '{"type":"init","session_id":"s1"}',
+      '\t',
+      '{"type":"result","status":"success","stats":{"total_tokens":50}}',
+    ];
+    const parsed = parseGeminiJSONL(lines);
+    expect(parsed.sessionId).toBe('s1');
+    expect(parsed.tokens).toBe(50);
+  });
+
+  test('handles empty input', () => {
+    const parsed = parseGeminiJSONL([]);
+    expect(parsed.output).toBe('');
+    expect(parsed.toolCalls).toHaveLength(0);
+    expect(parsed.tokens).toBe(0);
+    expect(parsed.sessionId).toBeNull();
+  });
+
+  test('handles missing fields gracefully', () => {
+    const lines = [
+      '{"type":"init"}', // no session_id
+      '{"type":"message","role":"assistant"}', // no content
+      '{"type":"tool_use"}', // no tool_name
+      '{"type":"result","status":"success"}', // no stats
+    ];
+    const parsed = parseGeminiJSONL(lines);
+    expect(parsed.sessionId).toBeNull();
+    expect(parsed.output).toBe('');
+    expect(parsed.toolCalls).toHaveLength(0);
+    expect(parsed.tokens).toBe(0);
+  });
+
+  test('handles multiple tool_use events', () => {
+    const lines = [
+      '{"type":"tool_use","tool_name":"run_shell_command","tool_id":"cmd_1","parameters":{"command":"ls"}}',
+      '{"type":"tool_use","tool_name":"read_file","tool_id":"cmd_2","parameters":{"path":"foo.ts"}}',
+      '{"type":"tool_use","tool_name":"run_shell_command","tool_id":"cmd_3","parameters":{"command":"cat bar.ts"}}',
+    ];
+    const parsed = parseGeminiJSONL(lines);
+    expect(parsed.toolCalls).toEqual(['run_shell_command', 'read_file', 'run_shell_command']);
+  });
+});
diff --git a/test/helpers/gemini-session-runner.ts b/test/helpers/gemini-session-runner.ts
new file mode 100644
index 00000000..06393c38
--- /dev/null
+++ b/test/helpers/gemini-session-runner.ts
@@ -0,0 +1,201 @@
+/**
+ * Gemini CLI subprocess runner for skill E2E testing.
+ *
+ * Spawns `gemini -p` as an independent process, parses its stream-json
+ * output, and returns structured results. Follows the same pattern as
+ * codex-session-runner.ts but adapted for the Gemini CLI.
+ *
+ * Key differences from Codex session-runner:
+ * - Uses `gemini -p` instead of `codex exec`
+ * - Output is NDJSON with event types: init, message, tool_use, tool_result, result
+ * - Uses `--output-format stream-json --yolo` instead of `--json -s read-only`
+ * - No temp HOME needed — Gemini discovers skills from `.agents/skills/` in cwd
+ * - Message events are streamed with `delta: true` — must concatenate
+ */
+
+import * as path from 'path';
+
+// --- Interfaces ---
+
+export interface GeminiResult {
+  output: string;           // Full assistant message text (concatenated deltas)
+  toolCalls: string[];      // Tool names from tool_use events
+  tokens: number;           // Total tokens used
+  exitCode: number;         // Process exit code
+  durationMs: number;       // Wall clock time
+  sessionId: string | null; // Session ID from init event
+  rawLines: string[];       // Raw JSONL lines for debugging
+}
+
+// --- JSONL parser ---
+
+export interface ParsedGeminiJSONL {
+  output: string;
+  toolCalls: string[];
+  tokens: number;
+  sessionId: string | null;
+}
+
+/**
+ * Parse an array of JSONL lines from `gemini -p --output-format stream-json`.
+ * Pure function — no I/O, no side effects.
+ *
+ * Handles these Gemini event types:
+ * - init → extract session_id
+ * - message (role=assistant, delta=true) → concatenate content into output
+ * - tool_use → extract tool_name
+ * - tool_result → logged but not extracted
+ * - result → extract token usage from stats
+ */
+export function parseGeminiJSONL(lines: string[]): ParsedGeminiJSONL {
+  const outputParts: string[] = [];
+  const toolCalls: string[] = [];
+  let tokens = 0;
+  let sessionId: string | null = null;
+
+  for (const line of lines) {
+    if (!line.trim()) continue;
+    try {
+      const obj = JSON.parse(line);
+      const t = obj.type || '';
+
+      if (t === 'init') {
+        const sid = obj.session_id || '';
+        if (sid) sessionId = sid;
+      } else if (t === 'message') {
+        if (obj.role === 'assistant' && obj.content) {
+          outputParts.push(obj.content);
+        }
+      } else if (t === 'tool_use') {
+        const name = obj.tool_name || '';
+        if (name) toolCalls.push(name);
+      } else if (t === 'result') {
+        const stats = obj.stats || {};
+        tokens = (stats.total_tokens || 0);
+      }
+    } catch { /* skip malformed lines */ }
+  }
+
+  return {
+    output: outputParts.join(''),
+    toolCalls,
+    tokens,
+    sessionId,
+  };
+}
+
+// --- Main runner ---
+
+/**
+ * Run a prompt via `gemini -p` and return structured results.
+ *
+ * Spawns gemini with stream-json output, parses JSONL events,
+ * and returns a GeminiResult. Skips gracefully if gemini binary is not found.
+ */
+export async function runGeminiSkill(opts: {
+  prompt: string;     // What to ask Gemini
+  timeoutMs?: number; // Default 300000 (5 min)
+  cwd?: string;       // Working directory (where .agents/skills/ lives)
+}): Promise<GeminiResult> {
+  const {
+    prompt,
+    timeoutMs = 300_000,
+    cwd,
+  } = opts;
+
+  const startTime = Date.now();
+
+  // Check if gemini binary exists
+  const whichResult = Bun.spawnSync(['which', 'gemini']);
+  if (whichResult.exitCode !== 0) {
+    return {
+      output: 'SKIP: gemini binary not found',
+      toolCalls: [],
+      tokens: 0,
+      exitCode: -1,
+      durationMs: Date.now() - startTime,
+      sessionId: null,
+      rawLines: [],
+    };
+  }
+
+  // Build gemini command
+  const args = ['-p', prompt, '--output-format', 'stream-json', '--yolo'];
+
+  // Spawn gemini — uses real HOME for auth, cwd for skill discovery
+  const proc = Bun.spawn(['gemini', ...args], {
+    cwd: cwd || process.cwd(),
+    stdout: 'pipe',
+    stderr: 'pipe',
+  });
+
+  // Race against timeout
+  let timedOut = false;
+  const timeoutId = setTimeout(() => {
+    timedOut = true;
+    proc.kill();
+  }, timeoutMs);
+
+  // Stream and collect JSONL from stdout
+  const collectedLines: string[] = [];
+  const stderrPromise = new Response(proc.stderr).text();
+
+  const reader = proc.stdout.getReader();
+  const decoder = new TextDecoder();
+  let buf = '';
+
+  try {
+    while (true) {
+      const { done, value } = await reader.read();
+      if (done) break;
+      buf += decoder.decode(value, { stream: true });
+      const lines = buf.split('\n');
+      buf = lines.pop() || '';
+      for (const line of lines) {
+        if (!line.trim()) continue;
+        collectedLines.push(line);
+
+        // Real-time progress to stderr
+        try {
+          const event = JSON.parse(line);
+          if (event.type === 'tool_use' && event.tool_name) {
+            const elapsed = Math.round((Date.now() - startTime) / 1000);
+            process.stderr.write(`  [gemini ${elapsed}s] tool: ${event.tool_name}\n`);
+          } else if (event.type === 'message' && event.role === 'assistant' && event.content) {
+            const elapsed = Math.round((Date.now() - startTime) / 1000);
+            process.stderr.write(`  [gemini ${elapsed}s] message: ${event.content.slice(0, 100)}\n`);
+          }
+        } catch { /* skip — parseGeminiJSONL will handle it later */ }
+      }
+    }
+  } catch { /* stream read error — fall through to exit code handling */ }
+
+  // Flush remaining buffer
+  if (buf.trim()) {
+    collectedLines.push(buf);
+  }
+
+  const stderr = await stderrPromise;
+  const exitCode = await proc.exited;
+  clearTimeout(timeoutId);
+
+  const durationMs = Date.now() - startTime;
+
+  // Parse all collected JSONL lines
+  const parsed = parseGeminiJSONL(collectedLines);
+
+  // Log stderr if non-empty (may contain auth errors, etc.)
+  if (stderr.trim()) {
+    process.stderr.write(`  [gemini stderr] ${stderr.trim().slice(0, 200)}\n`);
+  }
+
+  return {
+    output: parsed.output,
+    toolCalls: parsed.toolCalls,
+    tokens: parsed.tokens,
+    exitCode: timedOut ? 124 : exitCode,
+    durationMs,
+    sessionId: parsed.sessionId,
+    rawLines: collectedLines,
+  };
+}
diff --git a/test/helpers/touchfiles.ts b/test/helpers/touchfiles.ts
index 2ba0e10b..8bb86f53 100644
--- a/test/helpers/touchfiles.ts
+++ b/test/helpers/touchfiles.ts
@@ -57,9 +57,13 @@ export const E2E_TOUCHFILES: Record<string, string[]> = {
   'review-base-branch': ['review/**'],
   'review-design-lite': ['review/**', 'test/fixtures/review-eval-design-slop.*'],
 
+  // Office Hours
+  'office-hours-spec-review': ['office-hours/**', 'scripts/gen-skill-docs.ts'],
+
   // Plan reviews
   'plan-ceo-review': ['plan-ceo-review/**'],
   'plan-ceo-review-selective': ['plan-ceo-review/**'],
+  'plan-ceo-review-benefits': ['plan-ceo-review/**', 'scripts/gen-skill-docs.ts'],
   'plan-eng-review': ['plan-eng-review/**'],
   'plan-eng-review-artifact': ['plan-eng-review/**'],
 
@@ -80,6 +84,10 @@ export const E2E_TOUCHFILES: Record<string, string[]> = {
   'codex-discover-skill': ['codex/**', '.agents/skills/**', 'test/helpers/codex-session-runner.ts'],
   'codex-review-findings': ['review/**', '.agents/skills/gstack-review/**', 'codex/**', 'test/helpers/codex-session-runner.ts'],
 
+  // Gemini E2E (tests skills via Gemini CLI)
+  'gemini-discover-skill': ['.agents/skills/**', 'test/helpers/gemini-session-runner.ts'],
+  'gemini-review-findings': ['review/**', '.agents/skills/gstack-review/**', 'test/helpers/gemini-session-runner.ts'],
+
   // QA bootstrap
   'qa-bootstrap': ['qa/**', 'browse/src/**', 'ship/**'],
 
@@ -140,6 +148,10 @@ export const LLM_JUDGE_TOUCHFILES: Record<string, string[]> = {
   'design-review/SKILL.md fix loop': ['design-review/SKILL.md', 'design-review/SKILL.md.tmpl'],
   'design-consultation/SKILL.md research': ['design-consultation/SKILL.md', 'design-consultation/SKILL.md.tmpl'],
 
+  // Office Hours
+  'office-hours/SKILL.md spec review': ['office-hours/SKILL.md', 'office-hours/SKILL.md.tmpl', 'scripts/gen-skill-docs.ts'],
+  'office-hours/SKILL.md design sketch': ['office-hours/SKILL.md', 'office-hours/SKILL.md.tmpl', 'scripts/gen-skill-docs.ts'],
+
   // Other skills
   'retro/SKILL.md instructions': ['retro/SKILL.md', 'retro/SKILL.md.tmpl'],
   'qa-only/SKILL.md workflow': ['qa-only/SKILL.md', 'qa-only/SKILL.md.tmpl'],
@@ -152,6 +164,7 @@
 export const GLOBAL_TOUCHFILES = [
   'test/helpers/session-runner.ts',
   'test/helpers/codex-session-runner.ts',
+  'test/helpers/gemini-session-runner.ts',
   'test/helpers/eval-store.ts',
   'test/helpers/llm-judge.ts',
   'scripts/gen-skill-docs.ts',
diff --git a/test/skill-e2e.test.ts b/test/skill-e2e.test.ts
index 96019f70..0b6331f3 100644
--- a/test/skill-e2e.test.ts
+++ b/test/skill-e2e.test.ts
@@ -2911,6 +2911,128 @@ Write the full output (including the GATE verdict) to ${codexDir}/codex-output.m
 }, 360_000);
 });
 
+// --- Office Hours Spec Review E2E ---
+
+describeIfSelected('Office Hours Spec Review E2E', ['office-hours-spec-review'], () => {
+  let ohDir: string;
+
+  beforeAll(() => {
+    ohDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-oh-spec-'));
+    const run = (cmd: string, args: string[]) =>
+      spawnSync(cmd, args, { cwd: ohDir, stdio: 'pipe', timeout: 5000 });
+
+    run('git', ['init', '-b', 'main']);
+    run('git', ['config', 'user.email', 'test@test.com']);
+    run('git', ['config', 'user.name', 'Test']);
+    fs.writeFileSync(path.join(ohDir, 'README.md'), '# Test Project\n');
+    run('git', ['add', '.']);
+    run('git', ['commit', '-m', 'init']);
+
+    // Copy office-hours skill
+    fs.mkdirSync(path.join(ohDir, 'office-hours'), { recursive: true });
+    fs.copyFileSync(
+      path.join(ROOT, 'office-hours', 'SKILL.md'),
+      path.join(ohDir, 'office-hours', 'SKILL.md'),
+    );
+  });
+
+  afterAll(() => {
+    try { fs.rmSync(ohDir, { recursive: true, force: true }); } catch {}
+  });
+
+  test('/office-hours SKILL.md contains spec review loop', async () => {
+    const result = await runSkillTest({
+      prompt: `Read office-hours/SKILL.md. I want to understand the spec review loop.
+
+Summarize what the "Spec Review Loop" section does — specifically:
+1. How many dimensions does the reviewer check?
+2. What tool is used to dispatch the reviewer?
+3. What's the maximum number of iterations?
+4. What metrics are tracked?
+
+Write your summary to ${ohDir}/spec-review-summary.md`,
+      workingDirectory: ohDir,
+      maxTurns: 8,
+      timeout: 120_000,
+      testName: 'office-hours-spec-review',
+      runId,
+    });
+
+    logCost('/office-hours spec review', result);
+    recordE2E('/office-hours-spec-review', 'Office Hours Spec Review E2E', result);
+    expect(result.exitReason).toBe('success');
+
+    const summaryPath = path.join(ohDir, 'spec-review-summary.md');
+    if (fs.existsSync(summaryPath)) {
+      const summary = fs.readFileSync(summaryPath, 'utf-8').toLowerCase();
+      // Verify the agent understood the key concepts
+      expect(summary).toMatch(/5.*dimension|dimension.*5|completeness|consistency|clarity|scope|feasibility/);
+      expect(summary).toMatch(/agent|subagent/);
+      expect(summary).toMatch(/3.*iteration|iteration.*3|maximum.*3/);
+    }
+  }, 180_000);
+});
+
+// --- Plan CEO Review Benefits-From E2E ---
+
+describeIfSelected('Plan CEO Review Benefits-From E2E', ['plan-ceo-review-benefits'], () => {
+  let benefitsDir: string;
+
+  beforeAll(() => {
+    benefitsDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-benefits-'));
+    const run = (cmd: string, args: string[]) =>
+      spawnSync(cmd, args, { cwd: benefitsDir, stdio: 'pipe', timeout: 5000 });
+
+    run('git', ['init', '-b', 'main']);
+    run('git', ['config', 'user.email', 'test@test.com']);
+    run('git', ['config', 'user.name', 'Test']);
+    fs.writeFileSync(path.join(benefitsDir, 'README.md'), '# Test Project\n');
+    run('git', ['add', '.']);
+    run('git', ['commit', '-m', 'init']);
+
+    // Copy plan-ceo-review skill
+    fs.mkdirSync(path.join(benefitsDir, 'plan-ceo-review'), { recursive: true });
+    fs.copyFileSync(
+      path.join(ROOT, 'plan-ceo-review', 'SKILL.md'),
+      path.join(benefitsDir, 'plan-ceo-review', 'SKILL.md'),
+    );
+  });
+
+  afterAll(() => {
+    try { fs.rmSync(benefitsDir, { recursive: true, force: true }); } catch {}
+  });
+
+  test('/plan-ceo-review SKILL.md contains prerequisite skill offer', async () => {
+    const result = await runSkillTest({
+      prompt: `Read plan-ceo-review/SKILL.md. Search for sections about "Prerequisite" or "office-hours" or "design doc found".
+
+Summarize what happens when no design doc is found — specifically:
+1. Is /office-hours offered as a prerequisite?
+2. What options does the user get?
+3. Is there a mid-session detection for when the user seems lost?
+
+Write your summary to ${benefitsDir}/benefits-summary.md`,
+      workingDirectory: benefitsDir,
+      maxTurns: 8,
+      timeout: 120_000,
+      testName: 'plan-ceo-review-benefits',
+      runId,
+    });
+
+    logCost('/plan-ceo-review benefits-from', result);
+    recordE2E('/plan-ceo-review-benefits', 'Plan CEO Review Benefits-From E2E', result);
+    expect(result.exitReason).toBe('success');
+
+    const summaryPath = path.join(benefitsDir, 'benefits-summary.md');
+    if (fs.existsSync(summaryPath)) {
+      const summary = fs.readFileSync(summaryPath, 'utf-8').toLowerCase();
+      // Verify the agent understood the skill chaining
+      expect(summary).toMatch(/office.hours/);
+      expect(summary).toMatch(/design doc|no design/i);
+    }
+  }, 180_000);
+});
+
 // Module-level afterAll — finalize eval collector after all tests complete
 afterAll(async () => {
   if (evalCollector) {
diff --git a/test/skill-validation.test.ts b/test/skill-validation.test.ts
index 6d68f086..b49d9b9f 100644
--- a/test/skill-validation.test.ts
+++ b/test/skill-validation.test.ts
@@ -644,6 +644,59 @@ describe('office-hours skill structure', () => {
   test('contains builder operating principles', () => {
     expect(content).toContain('Delight is the currency');
   });
+
+  // Spec Review Loop (Phase 5.5)
+  test('contains spec review loop', () => {
+    expect(content).toContain('Spec Review Loop');
+  });
+
+  test('contains adversarial review dimensions', () => {
+    for (const dim of ['Completeness', 'Consistency', 'Clarity', 'Scope', 'Feasibility']) {
+      expect(content).toContain(dim);
+    }
+  });
+
+  test('contains subagent dispatch instruction', () => {
+    expect(content).toMatch(/Agent.*tool|subagent/i);
+  });
+
+  test('contains max 3 iterations', () => {
+    expect(content).toMatch(/3.*iteration|maximum.*3/i);
+  });
+
+  test('contains quality score', () => {
+    expect(content).toContain('quality score');
+  });
+
+  test('contains spec review metrics path', () => {
+    expect(content).toContain('spec-review.jsonl');
+  });
+
+  test('contains convergence guard', () => {
+    expect(content).toMatch(/convergence/i);
+  });
+
+  // Visual Sketch (Phase 4.5)
+  test('contains visual sketch section', () => {
+    expect(content).toContain('Visual Sketch');
+  });
+
+  test('contains wireframe generation', () => {
+    expect(content).toMatch(/wireframe|sketch/i);
+  });
+
+  test('contains DESIGN.md awareness', () => {
+    expect(content).toContain('DESIGN.md');
+  });
+
+  test('contains browse rendering', () => {
+    expect(content).toContain('$B goto');
+    expect(content).toContain('$B screenshot');
+  });
+
+  test('contains rough aesthetic instruction', () => {
+    expect(content).toMatch(/rough|hand-drawn/i);
+  });
 });
 
 describe('investigate skill structure', () => {
@@ -856,6 +909,22 @@ describe('CEO review mode validation', () => {
     expect(content).toContain('HOLD SCOPE');
     expect(content).toContain('REDUCTION');
   });
+
+  // Skill chaining (benefits-from)
+  test('contains prerequisite skill offer for office-hours', () => {
+    expect(content).toContain('Prerequisite Skill Offer');
+    expect(content).toContain('/office-hours');
+  });
+
+  test('contains mid-session detection', () => {
+    expect(content).toContain('Mid-session detection');
+    expect(content).toMatch(/still figuring out|seems lost/i);
+  });
+
+  // Spec review on CEO plans
+  test('contains spec review loop for CEO plan documents', () => {
+    expect(content).toContain('Spec Review Loop');
+  });
 });
 
 // --- gstack-slug helper ---
@@ -1187,18 +1256,36 @@ describe('Codex skill', () => {
     expect(content).toContain('mktemp');
   });
 
-  test('codex integration in /review offers second opinion', () => {
+  test('codex integration in /review has config-driven review step', () => {
     const content = fs.readFileSync(path.join(ROOT, 'review', 'SKILL.md'), 'utf-8');
-    expect(content).toContain('Codex second opinion');
+    expect(content).toContain('Codex review');
+    expect(content).toContain('codex_reviews');
     expect(content).toContain('codex review');
     expect(content).toContain('adversarial');
+    expect(content).toContain('xhigh');
+    expect(content).toContain('Investigate and fix');
+    expect(content).toContain('CROSS-MODEL');
   });
 
-  test('codex integration in /ship offers review gate', () => {
+  test('codex integration in /ship has config-driven review step', () => {
     const content = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md'), 'utf-8');
-    expect(content).toContain('Codex');
+    expect(content).toContain('Codex review');
+    expect(content).toContain('codex_reviews');
     expect(content).toContain('codex review');
     expect(content).toContain('codex-review');
+    expect(content).toContain('xhigh');
+    expect(content).toContain('Investigate and fix');
+  });
+
+  test('codex-host ship/review do NOT contain codex review step', () => {
+    const shipContent = fs.readFileSync(path.join(ROOT, '.agents', 'skills', 'gstack-ship', 'SKILL.md'), 'utf-8');
+    expect(shipContent).not.toContain('codex review --base');
+    expect(shipContent).not.toContain('Investigate and fix');
+
+    const reviewContent = fs.readFileSync(path.join(ROOT, '.agents', 'skills', 'gstack-review', 'SKILL.md'), 'utf-8');
+    expect(reviewContent).not.toContain('codex review --base');
+    expect(reviewContent).not.toContain('codex_reviews');
+    expect(reviewContent).not.toContain('Investigate and fix');
   });
 
   test('codex integration in /plan-eng-review offers plan critique', () => {
diff --git a/test/touchfiles.test.ts b/test/touchfiles.test.ts
index d89d533d..11dedb1c 100644
--- a/test/touchfiles.test.ts
+++ b/test/touchfiles.test.ts
@@ -78,8 +78,9 @@ describe('selectTests', () => {
     const result = selectTests(['plan-ceo-review/SKILL.md'], E2E_TOUCHFILES);
     expect(result.selected).toContain('plan-ceo-review');
     expect(result.selected).toContain('plan-ceo-review-selective');
-    expect(result.selected.length).toBe(2);
-    expect(result.skipped.length).toBe(Object.keys(E2E_TOUCHFILES).length - 2);
+    expect(result.selected).toContain('plan-ceo-review-benefits');
+    expect(result.selected.length).toBe(3);
+    expect(result.skipped.length).toBe(Object.keys(E2E_TOUCHFILES).length - 3);
   });
 
   test('global touchfile triggers ALL tests', () => {