mirror of
https://github.com/garrytan/gstack.git
synced 2026-05-02 03:35:09 +02:00
Merge remote-tracking branch 'origin/main' into garrytan/test-fail-ownership
# Conflicts: # scripts/gen-skill-docs.ts
This commit is contained in:
@@ -372,7 +372,7 @@ The snapshot is your primary tool for understanding and interacting with pages.
|
||||
-s <sel> --selector Scope to CSS selector
|
||||
-D --diff Unified diff against previous snapshot (first call stores baseline)
|
||||
-a --annotate Annotated screenshot with red overlay boxes and ref labels
|
||||
-o <path> --output Output path for annotated screenshot (default: /tmp/browse-annotated.png)
|
||||
-o <path> --output Output path for annotated screenshot (default: <temp>/browse-annotated.png)
|
||||
-C --cursor-interactive Cursor-interactive elements (@c refs — divs with pointer, onclick)
|
||||
```
|
||||
|
||||
|
||||
@@ -232,6 +232,25 @@ success/error/abort, and `USED_BROWSE` with true/false based on whether `$B` was
|
||||
If you cannot determine the outcome, use "unknown". This runs in the background and
|
||||
never blocks the user.
|
||||
|
||||
## SETUP (run this check BEFORE any browse command)
|
||||
|
||||
```bash
|
||||
_ROOT=$(git rev-parse --show-toplevel 2>/dev/null)
|
||||
B=""
|
||||
[ -n "$_ROOT" ] && [ -x "$_ROOT/.agents/skills/gstack/browse/dist/browse" ] && B="$_ROOT/.agents/skills/gstack/browse/dist/browse"
|
||||
[ -z "$B" ] && B=~/.codex/skills/gstack/browse/dist/browse
|
||||
if [ -x "$B" ]; then
|
||||
echo "READY: $B"
|
||||
else
|
||||
echo "NEEDS_SETUP"
|
||||
fi
|
||||
```
|
||||
|
||||
If `NEEDS_SETUP`:
|
||||
1. Tell the user: "gstack browse needs a one-time build (~10 seconds). OK to proceed?" Then STOP and wait.
|
||||
2. Run: `cd <SKILL_DIR> && ./setup`
|
||||
3. If `bun` is not installed: `curl -fsSL https://bun.sh/install | bash`
|
||||
|
||||
# YC Office Hours
|
||||
|
||||
You are a **YC office hours partner**. Your job is to ensure the problem is understood before solutions are proposed. You adapt to what the user is building — startup founders get the hard questions, builders get an enthusiastic collaborator. This skill produces design docs, not code.
|
||||
@@ -496,6 +515,66 @@ Present via AskUserQuestion. Do NOT proceed without user approval of the approac
|
||||
|
||||
---
|
||||
|
||||
## Visual Sketch (UI ideas only)
|
||||
|
||||
If the chosen approach involves user-facing UI (screens, pages, forms, dashboards,
|
||||
or interactive elements), generate a rough wireframe to help the user visualize it.
|
||||
If the idea is backend-only, infrastructure, or has no UI component — skip this
|
||||
section silently.
|
||||
|
||||
**Step 1: Gather design context**
|
||||
|
||||
1. Check if `DESIGN.md` exists in the repo root. If it does, read it for design
|
||||
system constraints (colors, typography, spacing, component patterns). Use these
|
||||
constraints in the wireframe.
|
||||
2. Apply core design principles:
|
||||
- **Information hierarchy** — what does the user see first, second, third?
|
||||
- **Interaction states** — loading, empty, error, success, partial
|
||||
- **Edge case paranoia** — what if the name is 47 chars? Zero results? Network fails?
|
||||
- **Subtraction default** — "as little design as possible" (Rams). Every element earns its pixels.
|
||||
- **Design for trust** — every interface element builds or erodes user trust.
|
||||
|
||||
**Step 2: Generate wireframe HTML**
|
||||
|
||||
Generate a single-page HTML file with these constraints:
|
||||
- **Intentionally rough aesthetic** — use system fonts, thin gray borders, no color,
|
||||
hand-drawn-style elements. This is a sketch, not a polished mockup.
|
||||
- Self-contained — no external dependencies, no CDN links, inline CSS only
|
||||
- Show the core interaction flow (1-3 screens/states max)
|
||||
- Include realistic placeholder content (not "Lorem ipsum" — use content that
|
||||
matches the actual use case)
|
||||
- Add HTML comments explaining design decisions
|
||||
|
||||
Write to a temp file:
|
||||
```bash
|
||||
SKETCH_FILE="/tmp/gstack-sketch-$(date +%s).html"
|
||||
```
|
||||
|
||||
**Step 3: Render and capture**
|
||||
|
||||
```bash
|
||||
$B goto "file://$SKETCH_FILE"
|
||||
$B screenshot /tmp/gstack-sketch.png
|
||||
```
|
||||
|
||||
If `$B` is not available (browse binary not set up), skip the render step. Tell the
|
||||
user: "Visual sketch requires the browse binary. Run the setup script to enable it."
|
||||
|
||||
**Step 4: Present and iterate**
|
||||
|
||||
Show the screenshot to the user. Ask: "Does this feel right? Want to iterate on the layout?"
|
||||
|
||||
If they want changes, regenerate the HTML with their feedback and re-render.
|
||||
If they approve or say "good enough," proceed.
|
||||
|
||||
**Step 5: Include in design doc**
|
||||
|
||||
Reference the wireframe screenshot in the design doc's "Recommended Approach" section.
|
||||
The screenshot file at `/tmp/gstack-sketch.png` can be referenced by downstream skills
|
||||
(`/plan-design-review`, `/design-review`) to see what was originally envisioned.
|
||||
|
||||
---
|
||||
|
||||
## Phase 4.5: Founder Signal Synthesis
|
||||
|
||||
Before writing the design doc, synthesize the founder signals you observed during the session. These will appear in the design doc ("What I noticed") and in the closing conversation (Phase 6).
|
||||
@@ -632,7 +711,73 @@ Supersedes: {prior filename — omit this line if first design on this branch}
|
||||
{observational, mentor-like reflections referencing specific things the user said during the session. Quote their words back to them — don't characterize their behavior. 2-4 bullets.}
|
||||
```
|
||||
|
||||
Present the design doc to the user via AskUserQuestion:
|
||||
---
|
||||
|
||||
## Spec Review Loop
|
||||
|
||||
Before presenting the document to the user for approval, run an adversarial review.
|
||||
|
||||
**Step 1: Dispatch reviewer subagent**
|
||||
|
||||
Use the Agent tool to dispatch an independent reviewer. The reviewer has fresh context
|
||||
and cannot see the brainstorming conversation — only the document. This ensures genuine
|
||||
adversarial independence.
|
||||
|
||||
Prompt the subagent with:
|
||||
- The file path of the document just written
|
||||
- "Read this document and review it on 5 dimensions. For each dimension, note PASS or
|
||||
list specific issues with suggested fixes. At the end, output a quality score (1-10)
|
||||
across all dimensions."
|
||||
|
||||
**Dimensions:**
|
||||
1. **Completeness** — Are all requirements addressed? Missing edge cases?
|
||||
2. **Consistency** — Do parts of the document agree with each other? Contradictions?
|
||||
3. **Clarity** — Could an engineer implement this without asking questions? Ambiguous language?
|
||||
4. **Scope** — Does the document creep beyond the original problem? YAGNI violations?
|
||||
5. **Feasibility** — Can this actually be built with the stated approach? Hidden complexity?
|
||||
|
||||
The subagent should return:
|
||||
- A quality score (1-10)
|
||||
- PASS if no issues, or a numbered list of issues with dimension, description, and fix
|
||||
|
||||
**Step 2: Fix and re-dispatch**
|
||||
|
||||
If the reviewer returns issues:
|
||||
1. Fix each issue in the document on disk (use Edit tool)
|
||||
2. Re-dispatch the reviewer subagent with the updated document
|
||||
3. Maximum 3 iterations total
|
||||
|
||||
**Convergence guard:** If the reviewer returns the same issues on consecutive iterations
|
||||
(the fix didn't resolve them or the reviewer disagrees with the fix), stop the loop
|
||||
and persist those issues as "Reviewer Concerns" in the document rather than looping
|
||||
further.
|
||||
|
||||
If the subagent fails, times out, or is unavailable — skip the review loop entirely.
|
||||
Tell the user: "Spec review unavailable — presenting unreviewed doc." The document is
|
||||
already written to disk; the review is a quality bonus, not a gate.
|
||||
|
||||
**Step 3: Report and persist metrics**
|
||||
|
||||
After the loop completes (PASS, max iterations, or convergence guard):
|
||||
|
||||
1. Tell the user the result — summary by default:
|
||||
"Your doc survived N rounds of adversarial review. M issues caught and fixed.
|
||||
Quality score: X/10."
|
||||
If they ask "what did the reviewer find?", show the full reviewer output.
|
||||
|
||||
2. If issues remain after max iterations or convergence, add a "## Reviewer Concerns"
|
||||
section to the document listing each unresolved issue. Downstream skills will see this.
|
||||
|
||||
3. Append metrics:
|
||||
```bash
|
||||
mkdir -p ~/.gstack/analytics
|
||||
echo '{"skill":"office-hours","ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'","iterations":ITERATIONS,"issues_found":FOUND,"issues_fixed":FIXED,"remaining":REMAINING,"quality_score":SCORE}' >> ~/.gstack/analytics/spec-review.jsonl 2>/dev/null || true
|
||||
```
|
||||
Replace ITERATIONS, FOUND, FIXED, REMAINING, SCORE with actual values from the review.
|
||||
|
||||
---
|
||||
|
||||
Present the reviewed design doc to the user via AskUserQuestion:
|
||||
- A) Approve — mark Status: APPROVED and proceed to handoff
|
||||
- B) Revise — specify which sections need changes (loop back to revise those sections)
|
||||
- C) Start over — return to Phase 2
|
||||
|
||||
@@ -338,6 +338,37 @@ DESIGN=$(ls -t ~/.gstack/projects/$SLUG/*-$BRANCH-design-*.md 2>/dev/null | head
|
||||
```
|
||||
If a design doc exists (from `/office-hours`), read it. Use it as the source of truth for the problem statement, constraints, and chosen approach. If it has a `Supersedes:` field, note that this is a revised design.
|
||||
|
||||
## Prerequisite Skill Offer
|
||||
|
||||
When the design doc check above prints "No design doc found," offer the prerequisite
|
||||
skill before proceeding.
|
||||
|
||||
Say to the user via AskUserQuestion:
|
||||
|
||||
> "No design doc found for this branch. `/office-hours` produces a structured problem
|
||||
> statement, premise challenge, and explored alternatives — it gives this review much
|
||||
> sharper input to work with. Takes about 10 minutes. The design doc is per-feature,
|
||||
> not per-product — it captures the thinking behind this specific change."
|
||||
|
||||
Options:
|
||||
- A) Run /office-hours first (in another window, then come back)
|
||||
- B) Skip — proceed with standard review
|
||||
|
||||
If they skip: "No worries — standard review. If you ever want sharper input, try
|
||||
/office-hours first next time." Then proceed normally. Do not re-offer later in the session.
|
||||
|
||||
**Mid-session detection:** During Step 0A (Premise Challenge), if the user can't
|
||||
articulate the problem, keeps changing the problem statement, answers with "I'm not
|
||||
sure," or is clearly exploring rather than reviewing — offer `/office-hours`:
|
||||
|
||||
> "It sounds like you're still figuring out what to build — that's totally fine, but
|
||||
> that's what /office-hours is designed for. Want to pause this review and run
|
||||
> /office-hours first? It'll help you nail down the problem and approach, then come
|
||||
> back here for the strategic review."
|
||||
|
||||
Options: A) Yes, run /office-hours first. B) No, keep going.
|
||||
If they keep going, proceed normally — no guilt, no re-asking.
|
||||
|
||||
When reading TODOS.md, specifically:
|
||||
* Note any TODOs this plan touches, blocks, or unlocks
|
||||
* Check if deferred work from prior reviews relates to this plan
|
||||
@@ -481,6 +512,70 @@ Repo: {owner/repo}
|
||||
|
||||
Derive the feature slug from the plan being reviewed (e.g., "user-dashboard", "auth-refactor"). Use the date in YYYY-MM-DD format.
|
||||
|
||||
After writing the CEO plan, run the spec review loop on it:
|
||||
|
||||
## Spec Review Loop
|
||||
|
||||
Before presenting the document to the user for approval, run an adversarial review.
|
||||
|
||||
**Step 1: Dispatch reviewer subagent**
|
||||
|
||||
Use the Agent tool to dispatch an independent reviewer. The reviewer has fresh context
|
||||
and cannot see the brainstorming conversation — only the document. This ensures genuine
|
||||
adversarial independence.
|
||||
|
||||
Prompt the subagent with:
|
||||
- The file path of the document just written
|
||||
- "Read this document and review it on 5 dimensions. For each dimension, note PASS or
|
||||
list specific issues with suggested fixes. At the end, output a quality score (1-10)
|
||||
across all dimensions."
|
||||
|
||||
**Dimensions:**
|
||||
1. **Completeness** — Are all requirements addressed? Missing edge cases?
|
||||
2. **Consistency** — Do parts of the document agree with each other? Contradictions?
|
||||
3. **Clarity** — Could an engineer implement this without asking questions? Ambiguous language?
|
||||
4. **Scope** — Does the document creep beyond the original problem? YAGNI violations?
|
||||
5. **Feasibility** — Can this actually be built with the stated approach? Hidden complexity?
|
||||
|
||||
The subagent should return:
|
||||
- A quality score (1-10)
|
||||
- PASS if no issues, or a numbered list of issues with dimension, description, and fix
|
||||
|
||||
**Step 2: Fix and re-dispatch**
|
||||
|
||||
If the reviewer returns issues:
|
||||
1. Fix each issue in the document on disk (use Edit tool)
|
||||
2. Re-dispatch the reviewer subagent with the updated document
|
||||
3. Maximum 3 iterations total
|
||||
|
||||
**Convergence guard:** If the reviewer returns the same issues on consecutive iterations
|
||||
(the fix didn't resolve them or the reviewer disagrees with the fix), stop the loop
|
||||
and persist those issues as "Reviewer Concerns" in the document rather than looping
|
||||
further.
|
||||
|
||||
If the subagent fails, times out, or is unavailable — skip the review loop entirely.
|
||||
Tell the user: "Spec review unavailable — presenting unreviewed doc." The document is
|
||||
already written to disk; the review is a quality bonus, not a gate.
|
||||
|
||||
**Step 3: Report and persist metrics**
|
||||
|
||||
After the loop completes (PASS, max iterations, or convergence guard):
|
||||
|
||||
1. Tell the user the result — summary by default:
|
||||
"Your doc survived N rounds of adversarial review. M issues caught and fixed.
|
||||
Quality score: X/10."
|
||||
If they ask "what did the reviewer find?", show the full reviewer output.
|
||||
|
||||
2. If issues remain after max iterations or convergence, add a "## Reviewer Concerns"
|
||||
section to the document listing each unresolved issue. Downstream skills will see this.
|
||||
|
||||
3. Append metrics:
|
||||
```bash
|
||||
mkdir -p ~/.gstack/analytics
|
||||
echo '{"skill":"plan-ceo-review","ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'","iterations":ITERATIONS,"issues_found":FOUND,"issues_fixed":FIXED,"remaining":REMAINING,"quality_score":SCORE}' >> ~/.gstack/analytics/spec-review.jsonl 2>/dev/null || true
|
||||
```
|
||||
Replace ITERATIONS, FOUND, FIXED, REMAINING, SCORE with actual values from the review.
|
||||
|
||||
### 0E. Temporal Interrogation (EXPANSION, SELECTIVE EXPANSION, and HOLD modes)
|
||||
Think ahead to implementation: What decisions will need to be made during implementation that should be resolved NOW in the plan?
|
||||
```
|
||||
@@ -912,7 +1007,7 @@ Parse the output. Find the most recent entry for each skill (plan-ceo-review, pl
|
||||
- **Eng Review (required by default):** The only review that gates shipping. Covers architecture, code quality, tests, performance. Can be disabled globally with \`gstack-config set skip_eng_review true\` (the "don't bother me" setting).
|
||||
- **CEO Review (optional):** Use your judgment. Recommend it for big product/business changes, new user-facing features, or scope decisions. Skip for bug fixes, refactors, infra, and cleanup.
|
||||
- **Design Review (optional):** Use your judgment. Recommend it for UI/UX changes. Skip for backend-only, infra, or prompt-only changes.
|
||||
- **Codex Review (optional):** Independent second opinion from OpenAI Codex CLI. Shows pass/fail gate. Recommend for critical code changes where a second AI perspective adds value. Skip when Codex CLI is not installed.
|
||||
- **Codex Review (enabled by default when Codex CLI is installed):** Independent review + adversarial challenge from OpenAI Codex CLI. Shows pass/fail gate. Runs automatically when enabled — configure with \`gstack-config set codex_reviews enabled|disabled\`.
|
||||
|
||||
**Verdict logic:**
|
||||
- **CLEARED**: Eng Review has >= 1 entry within 7 days with status "clean" (or \`skip_eng_review\` is \`true\`)
|
||||
|
||||
@@ -542,7 +542,7 @@ Parse the output. Find the most recent entry for each skill (plan-ceo-review, pl
|
||||
- **Eng Review (required by default):** The only review that gates shipping. Covers architecture, code quality, tests, performance. Can be disabled globally with \`gstack-config set skip_eng_review true\` (the "don't bother me" setting).
|
||||
- **CEO Review (optional):** Use your judgment. Recommend it for big product/business changes, new user-facing features, or scope decisions. Skip for bug fixes, refactors, infra, and cleanup.
|
||||
- **Design Review (optional):** Use your judgment. Recommend it for UI/UX changes. Skip for backend-only, infra, or prompt-only changes.
|
||||
- **Codex Review (optional):** Independent second opinion from OpenAI Codex CLI. Shows pass/fail gate. Recommend for critical code changes where a second AI perspective adds value. Skip when Codex CLI is not installed.
|
||||
- **Codex Review (enabled by default when Codex CLI is installed):** Independent review + adversarial challenge from OpenAI Codex CLI. Shows pass/fail gate. Runs automatically when enabled — configure with \`gstack-config set codex_reviews enabled|disabled\`.
|
||||
|
||||
**Verdict logic:**
|
||||
- **CLEARED**: Eng Review has >= 1 entry within 7 days with status "clean" (or \`skip_eng_review\` is \`true\`)
|
||||
|
||||
@@ -283,6 +283,25 @@ DESIGN=$(ls -t ~/.gstack/projects/$SLUG/*-$BRANCH-design-*.md 2>/dev/null | head
|
||||
```
|
||||
If a design doc exists, read it. Use it as the source of truth for the problem statement, constraints, and chosen approach. If it has a `Supersedes:` field, note that this is a revised design — check the prior version for context on what changed and why.
|
||||
|
||||
## Prerequisite Skill Offer
|
||||
|
||||
When the design doc check above prints "No design doc found," offer the prerequisite
|
||||
skill before proceeding.
|
||||
|
||||
Say to the user via AskUserQuestion:
|
||||
|
||||
> "No design doc found for this branch. `/office-hours` produces a structured problem
|
||||
> statement, premise challenge, and explored alternatives — it gives this review much
|
||||
> sharper input to work with. Takes about 10 minutes. The design doc is per-feature,
|
||||
> not per-product — it captures the thinking behind this specific change."
|
||||
|
||||
Options:
|
||||
- A) Run /office-hours first (in another window, then come back)
|
||||
- B) Skip — proceed with standard review
|
||||
|
||||
If they skip: "No worries — standard review. If you ever want sharper input, try
|
||||
/office-hours first next time." Then proceed normally. Do not re-offer later in the session.
|
||||
|
||||
### Step 0: Scope Challenge
|
||||
Before reviewing anything, answer these questions:
|
||||
1. **What existing code already partially or fully solves each sub-problem?** Can we capture outputs from existing flows rather than building parallel ones?
|
||||
@@ -512,7 +531,7 @@ Parse the output. Find the most recent entry for each skill (plan-ceo-review, pl
|
||||
- **Eng Review (required by default):** The only review that gates shipping. Covers architecture, code quality, tests, performance. Can be disabled globally with \`gstack-config set skip_eng_review true\` (the "don't bother me" setting).
|
||||
- **CEO Review (optional):** Use your judgment. Recommend it for big product/business changes, new user-facing features, or scope decisions. Skip for bug fixes, refactors, infra, and cleanup.
|
||||
- **Design Review (optional):** Use your judgment. Recommend it for UI/UX changes. Skip for backend-only, infra, or prompt-only changes.
|
||||
- **Codex Review (optional):** Independent second opinion from OpenAI Codex CLI. Shows pass/fail gate. Recommend for critical code changes where a second AI perspective adds value. Skip when Codex CLI is not installed.
|
||||
- **Codex Review (enabled by default when Codex CLI is installed):** Independent review + adversarial challenge from OpenAI Codex CLI. Shows pass/fail gate. Runs automatically when enabled — configure with \`gstack-config set codex_reviews enabled|disabled\`.
|
||||
|
||||
**Verdict logic:**
|
||||
- **CLEARED**: Eng Review has >= 1 entry within 7 days with status "clean" (or \`skip_eng_review\` is \`true\`)
|
||||
|
||||
@@ -488,54 +488,7 @@ If no documentation files exist, skip this step silently.
|
||||
|
||||
---
|
||||
|
||||
## Step 5.7: Codex second opinion (optional)
|
||||
|
||||
After completing the review, check if the Codex CLI is available:
|
||||
|
||||
```bash
|
||||
which codex 2>/dev/null && echo "CODEX_AVAILABLE" || echo "CODEX_NOT_AVAILABLE"
|
||||
```
|
||||
|
||||
If Codex is available, use AskUserQuestion:
|
||||
|
||||
```
|
||||
Review complete. Want an independent second opinion from Codex (OpenAI)?
|
||||
|
||||
A) Run Codex code review — independent diff review with pass/fail gate
|
||||
B) Run Codex adversarial challenge — try to find ways this code will fail in production
|
||||
C) Both — review first, then adversarial challenge
|
||||
D) Skip — no Codex review needed
|
||||
```
|
||||
|
||||
If the user chooses A, B, or C:
|
||||
|
||||
**For code review (A or C):** Run `codex review --base <base>` with a 5-minute timeout.
|
||||
Present the full output verbatim under a `CODEX SAYS (code review):` header.
|
||||
Check the output for `[P1]` markers — if found, note `GATE: FAIL`, otherwise `GATE: PASS`.
|
||||
After presenting, compare Codex's findings with your own review findings from Steps 4-5
|
||||
and output a CROSS-MODEL ANALYSIS showing what both found, what only Codex found,
|
||||
and what only Claude found.
|
||||
|
||||
**For adversarial challenge (B or C):** Run:
|
||||
```bash
|
||||
codex exec "Review the changes on this branch against the base branch. Run git diff origin/<base> to see the diff. Your job is to find ways this code will fail in production. Think like an attacker and a chaos engineer. Find edge cases, race conditions, security holes, failure modes. Be adversarial." -s read-only
|
||||
```
|
||||
Present the full output verbatim under a `CODEX SAYS (adversarial challenge):` header.
|
||||
|
||||
**Only if a code review ran (user chose A or C):** Persist the Codex review result to the review log:
|
||||
```bash
|
||||
~/.codex/skills/gstack/bin/gstack-review-log '{"skill":"codex-review","timestamp":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'","status":"STATUS","gate":"GATE"}'
|
||||
```
|
||||
|
||||
Substitute: STATUS ("clean" if PASS, "issues_found" if FAIL), GATE ("pass" or "fail").
|
||||
|
||||
**Do NOT persist a codex-review entry when only the adversarial challenge (B) ran** —
|
||||
there is no gate verdict to record, and a false entry would make the Review Readiness
|
||||
Dashboard believe a code review happened when it didn't.
|
||||
|
||||
If Codex is not available, skip this step silently.
|
||||
|
||||
---
|
||||
|
||||
## Important Rules
|
||||
|
||||
|
||||
@@ -309,7 +309,7 @@ Parse the output. Find the most recent entry for each skill (plan-ceo-review, pl
|
||||
- **Eng Review (required by default):** The only review that gates shipping. Covers architecture, code quality, tests, performance. Can be disabled globally with \`gstack-config set skip_eng_review true\` (the "don't bother me" setting).
|
||||
- **CEO Review (optional):** Use your judgment. Recommend it for big product/business changes, new user-facing features, or scope decisions. Skip for bug fixes, refactors, infra, and cleanup.
|
||||
- **Design Review (optional):** Use your judgment. Recommend it for UI/UX changes. Skip for backend-only, infra, or prompt-only changes.
|
||||
- **Codex Review (optional):** Independent second opinion from OpenAI Codex CLI. Shows pass/fail gate. Recommend for critical code changes where a second AI perspective adds value. Skip when Codex CLI is not installed.
|
||||
- **Codex Review (enabled by default when Codex CLI is installed):** Independent review + adversarial challenge from OpenAI Codex CLI. Shows pass/fail gate. Runs automatically when enabled — configure with \`gstack-config set codex_reviews enabled|disabled\`.
|
||||
|
||||
**Verdict logic:**
|
||||
- **CLEARED**: Eng Review has >= 1 entry within 7 days with status "clean" (or \`skip_eng_review\` is \`true\`)
|
||||
@@ -949,43 +949,7 @@ For each classified comment:
|
||||
|
||||
---
|
||||
|
||||
## Step 3.8: Codex second opinion (optional)
|
||||
|
||||
Check if the Codex CLI is available:
|
||||
|
||||
```bash
|
||||
which codex 2>/dev/null && echo "CODEX_AVAILABLE" || echo "CODEX_NOT_AVAILABLE"
|
||||
```
|
||||
|
||||
If Codex is available, use AskUserQuestion:
|
||||
|
||||
```
|
||||
Pre-landing review complete. Want an independent Codex (OpenAI) review before shipping?
|
||||
|
||||
A) Run Codex code review — independent diff review with pass/fail gate
|
||||
B) Run Codex adversarial challenge — try to break this code
|
||||
C) Skip — ship without Codex review
|
||||
```
|
||||
|
||||
If the user chooses A or B:
|
||||
|
||||
**For code review (A):** Run `codex review --base <base>` with a 5-minute timeout.
|
||||
Present the full output verbatim under a `CODEX SAYS:` header. Check for `[P1]` markers
|
||||
to determine pass/fail gate. Persist the result:
|
||||
|
||||
```bash
|
||||
~/.codex/skills/gstack/bin/gstack-review-log '{"skill":"codex-review","timestamp":"TIMESTAMP","status":"STATUS","gate":"GATE"}'
|
||||
```
|
||||
|
||||
If GATE is FAIL, use AskUserQuestion: "Codex found critical issues. Ship anyway?"
|
||||
If the user says no, stop. If yes, continue to Step 4.
|
||||
|
||||
**For adversarial (B):** Run codex exec with the adversarial prompt (see /codex skill).
|
||||
Present findings. This is informational — does not block shipping.
|
||||
|
||||
If Codex is not available, skip silently. Continue to Step 4.
|
||||
|
||||
---
|
||||
|
||||
## Step 4: Version bump (auto-decide)
|
||||
|
||||
@@ -1226,7 +1190,7 @@ doc updates — the user runs `/ship` and documentation stays current without a
|
||||
- **Never skip tests.** If tests fail, stop.
|
||||
- **Never skip the pre-landing review.** If checklist.md is unreadable, stop.
|
||||
- **Never force push.** Use regular `git push` only.
|
||||
- **Never ask for confirmation** except for MINOR/MAJOR version bumps and pre-landing review ASK items (batched into at most one AskUserQuestion).
|
||||
- **Never ask for trivial confirmations** (e.g., "ready to push?", "create PR?"). DO stop for: version bumps (MINOR/MAJOR), pre-landing review findings (ASK items), Codex critical findings ([P1]), and the one-time Codex adoption prompt.
|
||||
- **Always use the 4-digit version format** from the VERSION file.
|
||||
- **Date format in CHANGELOG:** `YYYY-MM-DD`
|
||||
- **Split commits for bisectability** — each commit = one logical change.
|
||||
|
||||
@@ -500,7 +500,7 @@ The snapshot is your primary tool for understanding and interacting with pages.
|
||||
-s <sel> --selector Scope to CSS selector
|
||||
-D --diff Unified diff against previous snapshot (first call stores baseline)
|
||||
-a --annotate Annotated screenshot with red overlay boxes and ref labels
|
||||
-o <path> --output Output path for annotated screenshot (default: /tmp/browse-annotated.png)
|
||||
-o <path> --output Output path for annotated screenshot (default: <temp>/browse-annotated.png)
|
||||
-C --cursor-interactive Cursor-interactive elements (@c refs — divs with pointer, onclick)
|
||||
```
|
||||
|
||||
|
||||
@@ -1,5 +1,47 @@
|
||||
# Changelog
|
||||
|
||||
## [0.9.4.0] - 2026-03-20 — Codex Reviews On By Default
|
||||
|
||||
### Changed
|
||||
|
||||
- **Codex code reviews now run automatically in `/ship` and `/review`.** No more "want a second opinion?" prompt every time — Codex reviews both your code (with a pass/fail gate) and runs an adversarial challenge by default. First-time users get a one-time opt-in prompt; after that, it's hands-free. Configure with `gstack-config set codex_reviews enabled|disabled`.
|
||||
- **All Codex operations use maximum reasoning power.** Review, adversarial, and consult modes all use `xhigh` reasoning effort — when an AI is reviewing your code, you want it thinking as hard as possible.
|
||||
- **Codex review errors can't corrupt the dashboard.** Auth failures, timeouts, and empty responses are now detected before logging results, so the Review Readiness Dashboard never shows a false "passed" entry. Adversarial stderr is captured separately.
|
||||
- **Codex review log includes commit hash.** Staleness detection now works correctly for Codex reviews, matching the same commit-tracking behavior as eng/CEO/design reviews.
|
||||
|
||||
### Fixed
|
||||
|
||||
- **Codex-for-Codex recursion prevented.** When gstack runs inside Codex CLI (`.agents/skills/`), the Codex review step is completely stripped — no accidental infinite loops.
|
||||
|
||||
## [0.9.3.0] - 2026-03-20 — Windows Support
|
||||
|
||||
### Fixed
|
||||
|
||||
- **gstack now works on Windows 11.** Setup no longer hangs when verifying Playwright, and the browse server automatically falls back to Node.js to work around a Bun pipe-handling bug on Windows ([bun#4253](https://github.com/oven-sh/bun/issues/4253)). Just make sure Node.js is installed alongside Bun. macOS and Linux are completely unaffected.
|
||||
- **Path handling works on Windows.** All hardcoded `/tmp` paths and Unix-style path separators now use platform-aware equivalents via a new `platform.ts` module. Path traversal protection works correctly with Windows backslash separators.
|
||||
|
||||
### Added
|
||||
|
||||
- **Bun API polyfill for Node.js.** When the browse server runs under Node.js on Windows, a compatibility layer provides `Bun.serve()`, `Bun.spawn()`, `Bun.spawnSync()`, and `Bun.sleep()` equivalents. Fully tested.
|
||||
- **Node server build script.** `browse/scripts/build-node-server.sh` transpiles the server for Node.js, stubs `bun:sqlite`, and injects the polyfill — all automated during `bun run build`.
|
||||
|
||||
## [0.9.2.0] - 2026-03-20 — Gemini CLI E2E Tests
|
||||
|
||||
### Added
|
||||
|
||||
- **Gemini CLI is now tested end-to-end.** Two E2E tests verify that gstack skills work when invoked by Google's Gemini CLI (`gemini -p`). The `gemini-discover-skill` test confirms skill discovery from `.agents/skills/`, and `gemini-review-findings` runs a full code review via gstack-review. Both parse Gemini's stream-json NDJSON output and track token usage.
|
||||
- **Gemini JSONL parser with 10 unit tests.** `parseGeminiJSONL` handles all Gemini event types (init, message, tool_use, tool_result, result) with defensive parsing for malformed input. The parser is a pure function, independently testable without spawning the CLI.
|
||||
- **`bun run test:gemini`** and **`bun run test:gemini:all`** scripts for running Gemini E2E tests independently. Gemini tests are also included in `test:evals` and `test:e2e` aggregate scripts.
|
||||
|
||||
## [0.9.1.0] - 2026-03-20 — Adversarial Spec Review + Skill Chaining
|
||||
|
||||
### Added
|
||||
|
||||
- **Your design docs now get stress-tested before you see them.** When you run `/office-hours`, an independent AI reviewer checks your design doc for completeness, consistency, clarity, scope creep, and feasibility — up to 3 rounds. You get a quality score (1-10) and a summary of what was caught and fixed. The doc you approve has already survived adversarial review.
|
||||
- **Visual wireframes during brainstorming.** For UI ideas, `/office-hours` now generates a rough HTML wireframe using your project's design system (from DESIGN.md) and screenshots it. You see what you're designing while you're still thinking, not after you've coded it.
|
||||
- **Skills help each other now.** `/plan-ceo-review` and `/plan-eng-review` detect when you'd benefit from running `/office-hours` first and offer it — one-tap to switch, one-tap to decline. If you seem lost during a CEO review, it'll gently suggest brainstorming first.
|
||||
- **Spec review metrics.** Every adversarial review logs iterations, issues found/fixed, and quality score to `~/.gstack/analytics/spec-review.jsonl`. Over time, you can see if your design docs are getting better.
|
||||
|
||||
## [0.9.0.1] - 2026-03-19
|
||||
|
||||
### Changed
|
||||
|
||||
@@ -42,7 +42,7 @@ Expect first useful run in under 5 minutes on any repo with tests already set up
|
||||
|
||||
## Install — takes 30 seconds
|
||||
|
||||
**Requirements:** [Claude Code](https://docs.anthropic.com/en/docs/claude-code), [Git](https://git-scm.com/), [Bun](https://bun.sh/) v1.0+
|
||||
**Requirements:** [Claude Code](https://docs.anthropic.com/en/docs/claude-code), [Git](https://git-scm.com/), [Bun](https://bun.sh/) v1.0+, [Node.js](https://nodejs.org/) (Windows only)
|
||||
|
||||
### Step 1: Install on your machine
|
||||
|
||||
@@ -238,6 +238,8 @@ Data is stored in [Supabase](https://supabase.com) (open source Firebase alterna
|
||||
|
||||
**Stale install?** Run `/gstack-upgrade` — or set `auto_upgrade: true` in `~/.gstack/config.yaml`
|
||||
|
||||
**Windows users:** gstack works on Windows 11 via Git Bash or WSL. Node.js is required in addition to Bun — Bun has a known bug with Playwright's pipe transport on Windows ([bun#4253](https://github.com/oven-sh/bun/issues/4253)). The browse server automatically falls back to Node.js. Make sure both `bun` and `node` are on your PATH.
|
||||
|
||||
**Claude says it can't see the skills?** Make sure your project's `CLAUDE.md` has a gstack section. Add this:
|
||||
|
||||
```
|
||||
|
||||
@@ -506,7 +506,7 @@ The snapshot is your primary tool for understanding and interacting with pages.
|
||||
-s <sel> --selector Scope to CSS selector
|
||||
-D --diff Unified diff against previous snapshot (first call stores baseline)
|
||||
-a --annotate Annotated screenshot with red overlay boxes and ref labels
|
||||
-o <path> --output Output path for annotated screenshot (default: /tmp/browse-annotated.png)
|
||||
-o <path> --output Output path for annotated screenshot (default: <temp>/browse-annotated.png)
|
||||
-C --cursor-interactive Cursor-interactive elements (@c refs — divs with pointer, onclick)
|
||||
```
|
||||
|
||||
|
||||
+1
-1
@@ -378,7 +378,7 @@ The snapshot is your primary tool for understanding and interacting with pages.
|
||||
-s <sel> --selector Scope to CSS selector
|
||||
-D --diff Unified diff against previous snapshot (first call stores baseline)
|
||||
-a --annotate Annotated screenshot with red overlay boxes and ref labels
|
||||
-o <path> --output Output path for annotated screenshot (default: /tmp/browse-annotated.png)
|
||||
-o <path> --output Output path for annotated screenshot (default: <temp>/browse-annotated.png)
|
||||
-C --cursor-interactive Cursor-interactive elements (@c refs — divs with pointer, onclick)
|
||||
```
|
||||
|
||||
|
||||
Executable
+48
@@ -0,0 +1,48 @@
|
||||
#!/usr/bin/env bash
|
||||
# Build a Node.js-compatible server bundle for Windows.
|
||||
#
|
||||
# On Windows, Bun can't launch or connect to Playwright's Chromium
|
||||
# (oven-sh/bun#4253, #9911). This script produces a server bundle
|
||||
# that runs under Node.js with Bun API polyfills.
|
||||
|
||||
set -e
|
||||
|
||||
GSTACK_DIR="$(cd "$(dirname "$0")/../.." && pwd)"
|
||||
SRC_DIR="$GSTACK_DIR/browse/src"
|
||||
DIST_DIR="$GSTACK_DIR/browse/dist"
|
||||
|
||||
echo "Building Node-compatible server bundle..."
|
||||
|
||||
# Step 1: Transpile server.ts to a single .mjs bundle (externalize runtime deps)
|
||||
bun build "$SRC_DIR/server.ts" \
|
||||
--target=node \
|
||||
--outfile "$DIST_DIR/server-node.mjs" \
|
||||
--external playwright \
|
||||
--external playwright-core \
|
||||
--external diff \
|
||||
--external "bun:sqlite"
|
||||
|
||||
# Step 2: Post-process
|
||||
# Replace import.meta.dir with a resolvable reference
|
||||
perl -pi -e 's/import\.meta\.dir/__browseNodeSrcDir/g' "$DIST_DIR/server-node.mjs"
|
||||
# Stub out bun:sqlite (macOS-only cookie import, not needed on Windows)
|
||||
perl -pi -e 's|import { Database } from "bun:sqlite";|const Database = null; // bun:sqlite stubbed on Node|g' "$DIST_DIR/server-node.mjs"
|
||||
|
||||
# Step 3: Create the final file with polyfill header injected after the first line
|
||||
{
|
||||
head -1 "$DIST_DIR/server-node.mjs"
|
||||
echo '// ── Windows Node.js compatibility (auto-generated) ──'
|
||||
echo 'import { fileURLToPath as _ftp } from "node:url";'
|
||||
echo 'import { dirname as _dn } from "node:path";'
|
||||
echo 'const __browseNodeSrcDir = _dn(_dn(_ftp(import.meta.url))) + "/src";'
|
||||
echo '{ const _r = createRequire(import.meta.url); _r("./bun-polyfill.cjs"); }'
|
||||
echo '// ── end compatibility ──'
|
||||
tail -n +2 "$DIST_DIR/server-node.mjs"
|
||||
} > "$DIST_DIR/server-node.tmp.mjs"
|
||||
|
||||
mv "$DIST_DIR/server-node.tmp.mjs" "$DIST_DIR/server-node.mjs"
|
||||
|
||||
# Step 4: Copy polyfill to dist/
|
||||
cp "$SRC_DIR/bun-polyfill.cjs" "$DIST_DIR/bun-polyfill.cjs"
|
||||
|
||||
echo "Node server bundle ready: $DIST_DIR/server-node.mjs"
|
||||
@@ -0,0 +1,109 @@
|
||||
/**
|
||||
* Bun API polyfill for Node.js — Windows compatibility layer.
|
||||
*
|
||||
* On Windows, Bun can't launch or connect to Playwright's Chromium
|
||||
* (oven-sh/bun#4253, #9911). The browse server falls back to running
|
||||
* under Node.js with this polyfill providing Bun API equivalents.
|
||||
*
|
||||
* Loaded via --require before the transpiled server bundle.
|
||||
*/
|
||||
|
||||
'use strict';
|
||||
|
||||
const http = require('http');
|
||||
const { spawnSync, spawn } = require('child_process');
|
||||
|
||||
globalThis.Bun = {
|
||||
serve(options) {
|
||||
const { port, hostname = '127.0.0.1', fetch } = options;
|
||||
|
||||
const server = http.createServer(async (nodeReq, nodeRes) => {
|
||||
try {
|
||||
const url = `http://${hostname}:${port}${nodeReq.url}`;
|
||||
const headers = new Headers();
|
||||
for (const [key, val] of Object.entries(nodeReq.headers)) {
|
||||
if (val) headers.set(key, Array.isArray(val) ? val[0] : val);
|
||||
}
|
||||
|
||||
let body = null;
|
||||
if (nodeReq.method !== 'GET' && nodeReq.method !== 'HEAD') {
|
||||
body = await new Promise((resolve) => {
|
||||
const chunks = [];
|
||||
nodeReq.on('data', (chunk) => chunks.push(chunk));
|
||||
nodeReq.on('end', () => resolve(Buffer.concat(chunks)));
|
||||
});
|
||||
}
|
||||
|
||||
const webReq = new Request(url, {
|
||||
method: nodeReq.method,
|
||||
headers,
|
||||
body,
|
||||
});
|
||||
|
||||
const webRes = await fetch(webReq);
|
||||
|
||||
nodeRes.statusCode = webRes.status;
|
||||
webRes.headers.forEach((val, key) => {
|
||||
nodeRes.setHeader(key, val);
|
||||
});
|
||||
|
||||
const resBody = await webRes.arrayBuffer();
|
||||
nodeRes.end(Buffer.from(resBody));
|
||||
} catch (err) {
|
||||
nodeRes.statusCode = 500;
|
||||
nodeRes.end(JSON.stringify({ error: err.message }));
|
||||
}
|
||||
});
|
||||
|
||||
server.listen(port, hostname);
|
||||
|
||||
return {
|
||||
stop() { server.close(); },
|
||||
port,
|
||||
hostname,
|
||||
};
|
||||
},
|
||||
|
||||
spawnSync(cmd, options = {}) {
|
||||
const [command, ...args] = cmd;
|
||||
const result = spawnSync(command, args, {
|
||||
stdio: [
|
||||
options.stdin || 'pipe',
|
||||
options.stdout === 'pipe' ? 'pipe' : 'ignore',
|
||||
options.stderr === 'pipe' ? 'pipe' : 'ignore',
|
||||
],
|
||||
timeout: options.timeout,
|
||||
env: options.env,
|
||||
cwd: options.cwd,
|
||||
});
|
||||
|
||||
return {
|
||||
exitCode: result.status,
|
||||
stdout: result.stdout || Buffer.from(''),
|
||||
stderr: result.stderr || Buffer.from(''),
|
||||
};
|
||||
},
|
||||
|
||||
spawn(cmd, options = {}) {
|
||||
const [command, ...args] = cmd;
|
||||
const stdio = options.stdio || ['pipe', 'pipe', 'pipe'];
|
||||
const proc = spawn(command, args, {
|
||||
stdio,
|
||||
env: options.env,
|
||||
cwd: options.cwd,
|
||||
});
|
||||
|
||||
return {
|
||||
pid: proc.pid,
|
||||
stdout: proc.stdout,
|
||||
stderr: proc.stderr,
|
||||
stdin: proc.stdin,
|
||||
unref() { proc.unref(); },
|
||||
kill(signal) { proc.kill(signal); },
|
||||
};
|
||||
},
|
||||
|
||||
sleep(ms) {
|
||||
return new Promise((resolve) => setTimeout(resolve, ms));
|
||||
},
|
||||
};
|
||||
+38
-4
@@ -14,7 +14,8 @@ import * as path from 'path';
|
||||
import { resolveConfig, ensureStateDir, readVersionHash } from './config';
|
||||
|
||||
const config = resolveConfig();
|
||||
const MAX_START_WAIT = 8000; // 8 seconds to start
|
||||
const IS_WINDOWS = process.platform === 'win32';
|
||||
const MAX_START_WAIT = IS_WINDOWS ? 15000 : 8000; // Node+Chromium takes longer on Windows
|
||||
|
||||
export function resolveServerScript(
|
||||
env: Record<string, string | undefined> = process.env,
|
||||
@@ -26,7 +27,9 @@ export function resolveServerScript(
|
||||
}
|
||||
|
||||
// Dev mode: cli.ts runs directly from browse/src
|
||||
if (metaDir.startsWith('/') && !metaDir.includes('$bunfs')) {
|
||||
// On macOS/Linux, import.meta.dir starts with /
|
||||
// On Windows, it starts with a drive letter (e.g., C:\...)
|
||||
if (!metaDir.includes('$bunfs')) {
|
||||
const direct = path.resolve(metaDir, 'server.ts');
|
||||
if (fs.existsSync(direct)) {
|
||||
return direct;
|
||||
@@ -48,6 +51,31 @@ export function resolveServerScript(
|
||||
|
||||
const SERVER_SCRIPT = resolveServerScript();
|
||||
|
||||
/**
|
||||
* On Windows, resolve the Node.js-compatible server bundle.
|
||||
* Falls back to null if not found (server will use Bun instead).
|
||||
*/
|
||||
export function resolveNodeServerScript(
|
||||
metaDir: string = import.meta.dir,
|
||||
execPath: string = process.execPath
|
||||
): string | null {
|
||||
// Dev mode
|
||||
if (!metaDir.includes('$bunfs')) {
|
||||
const distScript = path.resolve(metaDir, '..', 'dist', 'server-node.mjs');
|
||||
if (fs.existsSync(distScript)) return distScript;
|
||||
}
|
||||
|
||||
// Compiled binary: browse/dist/browse → browse/dist/server-node.mjs
|
||||
if (execPath) {
|
||||
const adjacent = path.resolve(path.dirname(execPath), 'server-node.mjs');
|
||||
if (fs.existsSync(adjacent)) return adjacent;
|
||||
}
|
||||
|
||||
return null;
|
||||
}
|
||||
|
||||
const NODE_SERVER_SCRIPT = IS_WINDOWS ? resolveNodeServerScript() : null;
|
||||
|
||||
interface ServerState {
|
||||
pid: number;
|
||||
port: number;
|
||||
@@ -139,8 +167,14 @@ async function startServer(): Promise<ServerState> {
|
||||
// Clean up stale state file
|
||||
try { fs.unlinkSync(config.stateFile); } catch {}
|
||||
|
||||
// Start server as detached background process
|
||||
const proc = Bun.spawn(['bun', 'run', SERVER_SCRIPT], {
|
||||
// Start server as detached background process.
|
||||
// On Windows, Bun can't launch/connect to Playwright's Chromium (oven-sh/bun#4253, #9911).
|
||||
// Fall back to running the server under Node.js with Bun API polyfills.
|
||||
const useNode = IS_WINDOWS && NODE_SERVER_SCRIPT;
|
||||
const serverCmd = useNode
|
||||
? ['node', NODE_SERVER_SCRIPT]
|
||||
: ['bun', 'run', SERVER_SCRIPT];
|
||||
const proc = Bun.spawn(serverCmd, {
|
||||
stdio: ['ignore', 'pipe', 'pipe'],
|
||||
env: { ...process.env, BROWSE_STATE_FILE: config.stateFile },
|
||||
});
|
||||
|
||||
@@ -10,13 +10,14 @@ import { validateNavigationUrl } from './url-validation';
|
||||
import * as Diff from 'diff';
|
||||
import * as fs from 'fs';
|
||||
import * as path from 'path';
|
||||
import { TEMP_DIR, isPathWithin } from './platform';
|
||||
|
||||
// Security: Path validation to prevent path traversal attacks
|
||||
const SAFE_DIRECTORIES = ['/tmp', process.cwd()];
|
||||
const SAFE_DIRECTORIES = [TEMP_DIR, process.cwd()];
|
||||
|
||||
export function validateOutputPath(filePath: string): void {
|
||||
const resolved = path.resolve(filePath);
|
||||
const isSafe = SAFE_DIRECTORIES.some(dir => resolved === dir || resolved.startsWith(dir + '/'));
|
||||
const isSafe = SAFE_DIRECTORIES.some(dir => isPathWithin(resolved, dir));
|
||||
if (!isSafe) {
|
||||
throw new Error(`Path must be within: ${SAFE_DIRECTORIES.join(', ')}`);
|
||||
}
|
||||
@@ -88,7 +89,7 @@ export async function handleMetaCommand(
|
||||
case 'screenshot': {
|
||||
// Parse priority: flags (--viewport, --clip) → selector (@ref, CSS) → output path
|
||||
const page = bm.getPage();
|
||||
let outputPath = '/tmp/browse-screenshot.png';
|
||||
let outputPath = `${TEMP_DIR}/browse-screenshot.png`;
|
||||
let clipRect: { x: number; y: number; width: number; height: number } | undefined;
|
||||
let targetSelector: string | undefined;
|
||||
let viewportOnly = false;
|
||||
@@ -147,7 +148,7 @@ export async function handleMetaCommand(
|
||||
|
||||
case 'pdf': {
|
||||
const page = bm.getPage();
|
||||
const pdfPath = args[0] || '/tmp/browse-page.pdf';
|
||||
const pdfPath = args[0] || `${TEMP_DIR}/browse-page.pdf`;
|
||||
validateOutputPath(pdfPath);
|
||||
await page.pdf({ path: pdfPath, format: 'A4' });
|
||||
return `PDF saved: ${pdfPath}`;
|
||||
@@ -155,7 +156,7 @@ export async function handleMetaCommand(
|
||||
|
||||
case 'responsive': {
|
||||
const page = bm.getPage();
|
||||
const prefix = args[0] || '/tmp/browse-responsive';
|
||||
const prefix = args[0] || `${TEMP_DIR}/browse-responsive`;
|
||||
validateOutputPath(prefix);
|
||||
const viewports = [
|
||||
{ name: 'mobile', width: 375, height: 812 },
|
||||
|
||||
@@ -0,0 +1,17 @@
|
||||
/**
|
||||
* Cross-platform constants for gstack browse.
|
||||
*
|
||||
* On macOS/Linux: TEMP_DIR = '/tmp', path.sep = '/' — identical to hardcoded values.
|
||||
* On Windows: TEMP_DIR = os.tmpdir(), path.sep = '\\' — correct Windows behavior.
|
||||
*/
|
||||
|
||||
import * as os from 'os';
|
||||
import * as path from 'path';
|
||||
|
||||
export const IS_WINDOWS = process.platform === 'win32';
|
||||
export const TEMP_DIR = IS_WINDOWS ? os.tmpdir() : '/tmp';
|
||||
|
||||
/** Check if resolvedPath is within dir, using platform-aware separators. */
|
||||
export function isPathWithin(resolvedPath: string, dir: string): boolean {
|
||||
return resolvedPath === dir || resolvedPath.startsWith(dir + path.sep);
|
||||
}
|
||||
@@ -10,6 +10,7 @@ import { consoleBuffer, networkBuffer, dialogBuffer } from './buffers';
|
||||
import type { Page } from 'playwright';
|
||||
import * as fs from 'fs';
|
||||
import * as path from 'path';
|
||||
import { TEMP_DIR, isPathWithin } from './platform';
|
||||
|
||||
/** Detect await keyword, ignoring comments. Accepted risk: await in string literals triggers wrapping (harmless). */
|
||||
function hasAwait(code: string): boolean {
|
||||
@@ -36,12 +37,12 @@ function wrapForEvaluate(code: string): string {
|
||||
}
|
||||
|
||||
// Security: Path validation to prevent path traversal attacks
|
||||
const SAFE_DIRECTORIES = ['/tmp', process.cwd()];
|
||||
const SAFE_DIRECTORIES = [TEMP_DIR, process.cwd()];
|
||||
|
||||
export function validateReadPath(filePath: string): void {
|
||||
if (path.isAbsolute(filePath)) {
|
||||
const resolved = path.resolve(filePath);
|
||||
const isSafe = SAFE_DIRECTORIES.some(dir => resolved === dir || resolved.startsWith(dir + '/'));
|
||||
const isSafe = SAFE_DIRECTORIES.some(dir => isPathWithin(resolved, dir));
|
||||
if (!isSafe) {
|
||||
throw new Error(`Absolute path must be within: ${SAFE_DIRECTORIES.join(', ')}`);
|
||||
}
|
||||
|
||||
@@ -20,6 +20,7 @@
|
||||
import type { Page, Locator } from 'playwright';
|
||||
import type { BrowserManager, RefEntry } from './browser-manager';
|
||||
import * as Diff from 'diff';
|
||||
import { TEMP_DIR, isPathWithin } from './platform';
|
||||
|
||||
// Roles considered "interactive" for the -i flag
|
||||
const INTERACTIVE_ROLES = new Set([
|
||||
@@ -61,7 +62,7 @@ export const SNAPSHOT_FLAGS: Array<{
|
||||
{ short: '-s', long: '--selector', description: 'Scope to CSS selector', takesValue: true, valueHint: '<sel>', optionKey: 'selector' },
|
||||
{ short: '-D', long: '--diff', description: 'Unified diff against previous snapshot (first call stores baseline)', optionKey: 'diff' },
|
||||
{ short: '-a', long: '--annotate', description: 'Annotated screenshot with red overlay boxes and ref labels', optionKey: 'annotate' },
|
||||
{ short: '-o', long: '--output', description: 'Output path for annotated screenshot (default: /tmp/browse-annotated.png)', takesValue: true, valueHint: '<path>', optionKey: 'outputPath' },
|
||||
{ short: '-o', long: '--output', description: 'Output path for annotated screenshot (default: <temp>/browse-annotated.png)', takesValue: true, valueHint: '<path>', optionKey: 'outputPath' },
|
||||
{ short: '-C', long: '--cursor-interactive', description: 'Cursor-interactive elements (@c refs — divs with pointer, onclick)', optionKey: 'cursorInteractive' },
|
||||
];
|
||||
|
||||
@@ -308,11 +309,11 @@ export async function handleSnapshot(
|
||||
|
||||
// ─── Annotated screenshot (-a) ────────────────────────────
|
||||
if (opts.annotate) {
|
||||
const screenshotPath = opts.outputPath || '/tmp/browse-annotated.png';
|
||||
const screenshotPath = opts.outputPath || `${TEMP_DIR}/browse-annotated.png`;
|
||||
// Validate output path (consistent with screenshot/pdf/responsive)
|
||||
const resolvedPath = require('path').resolve(screenshotPath);
|
||||
const safeDirs = ['/tmp', process.cwd()];
|
||||
if (!safeDirs.some((dir: string) => resolvedPath === dir || resolvedPath.startsWith(dir + '/'))) {
|
||||
const safeDirs = [TEMP_DIR, process.cwd()];
|
||||
if (!safeDirs.some((dir: string) => isPathWithin(resolvedPath, dir))) {
|
||||
throw new Error(`Path must be within: ${safeDirs.join(', ')}`);
|
||||
}
|
||||
try {
|
||||
|
||||
@@ -10,6 +10,7 @@ import { findInstalledBrowsers, importCookies } from './cookie-import-browser';
|
||||
import { validateNavigationUrl } from './url-validation';
|
||||
import * as fs from 'fs';
|
||||
import * as path from 'path';
|
||||
import { TEMP_DIR, isPathWithin } from './platform';
|
||||
|
||||
export async function handleWriteCommand(
|
||||
command: string,
|
||||
@@ -277,9 +278,9 @@ export async function handleWriteCommand(
|
||||
if (!filePath) throw new Error('Usage: browse cookie-import <json-file>');
|
||||
// Path validation — prevent reading arbitrary files
|
||||
if (path.isAbsolute(filePath)) {
|
||||
const safeDirs = ['/tmp', process.cwd()];
|
||||
const safeDirs = [TEMP_DIR, process.cwd()];
|
||||
const resolved = path.resolve(filePath);
|
||||
if (!safeDirs.some(dir => resolved === dir || resolved.startsWith(dir + '/'))) {
|
||||
if (!safeDirs.some(dir => isPathWithin(resolved, dir))) {
|
||||
throw new Error(`Path must be within: ${safeDirs.join(', ')}`);
|
||||
}
|
||||
}
|
||||
|
||||
@@ -0,0 +1,72 @@
|
||||
import { describe, test, expect, afterAll } from 'bun:test';
|
||||
import * as path from 'path';
|
||||
|
||||
// Load the polyfill into a fresh object (don't clobber globalThis.Bun)
|
||||
const polyfillPath = path.resolve(import.meta.dir, '../src/bun-polyfill.cjs');
|
||||
|
||||
describe('bun-polyfill', () => {
|
||||
// We test the polyfill by requiring it in a subprocess under Node.js
|
||||
// since it's designed for Node, not Bun.
|
||||
|
||||
test('Bun.sleep resolves after delay', async () => {
|
||||
const result = Bun.spawnSync(['node', '-e', `
|
||||
require('${polyfillPath}');
|
||||
(async () => {
|
||||
const start = Date.now();
|
||||
await Bun.sleep(50);
|
||||
const elapsed = Date.now() - start;
|
||||
console.log(elapsed >= 40 ? 'OK' : 'TOO_FAST');
|
||||
})();
|
||||
`], { stdout: 'pipe', stderr: 'pipe' });
|
||||
expect(result.stdout.toString().trim()).toBe('OK');
|
||||
expect(result.exitCode).toBe(0);
|
||||
});
|
||||
|
||||
test('Bun.spawnSync runs a command and returns stdout', () => {
|
||||
const result = Bun.spawnSync(['node', '-e', `
|
||||
require('${polyfillPath}');
|
||||
const r = Bun.spawnSync(['echo', 'hello'], { stdout: 'pipe' });
|
||||
console.log(r.stdout.toString().trim());
|
||||
console.log('exit:' + r.exitCode);
|
||||
`], { stdout: 'pipe', stderr: 'pipe' });
|
||||
const lines = result.stdout.toString().trim().split('\n');
|
||||
expect(lines[0]).toBe('hello');
|
||||
expect(lines[1]).toBe('exit:0');
|
||||
});
|
||||
|
||||
test('Bun.spawn launches a process with pid', async () => {
|
||||
const result = Bun.spawnSync(['node', '-e', `
|
||||
require('${polyfillPath}');
|
||||
const p = Bun.spawn(['echo', 'test'], { stdio: ['pipe', 'pipe', 'pipe'] });
|
||||
console.log(typeof p.pid === 'number' ? 'HAS_PID' : 'NO_PID');
|
||||
console.log(typeof p.kill === 'function' ? 'HAS_KILL' : 'NO_KILL');
|
||||
console.log(typeof p.unref === 'function' ? 'HAS_UNREF' : 'NO_UNREF');
|
||||
`], { stdout: 'pipe', stderr: 'pipe' });
|
||||
const lines = result.stdout.toString().trim().split('\n');
|
||||
expect(lines[0]).toBe('HAS_PID');
|
||||
expect(lines[1]).toBe('HAS_KILL');
|
||||
expect(lines[2]).toBe('HAS_UNREF');
|
||||
});
|
||||
|
||||
test('Bun.serve creates an HTTP server that responds', async () => {
|
||||
const result = Bun.spawnSync(['node', '-e', `
|
||||
require('${polyfillPath}');
|
||||
const server = Bun.serve({
|
||||
port: 0, // Note: polyfill uses port directly, so we pick one
|
||||
hostname: '127.0.0.1',
|
||||
fetch(req) {
|
||||
return new Response(JSON.stringify({ ok: true }), {
|
||||
headers: { 'Content-Type': 'application/json' },
|
||||
});
|
||||
},
|
||||
});
|
||||
// The polyfill doesn't support port 0, so we test the object shape
|
||||
console.log(typeof server.stop === 'function' ? 'HAS_STOP' : 'NO_STOP');
|
||||
console.log(typeof server.port === 'number' ? 'HAS_PORT' : 'NO_PORT');
|
||||
server.stop();
|
||||
`], { stdout: 'pipe', stderr: 'pipe' });
|
||||
const lines = result.stdout.toString().trim().split('\n');
|
||||
expect(lines[0]).toBe('HAS_STOP');
|
||||
expect(lines[1]).toBe('HAS_PORT');
|
||||
});
|
||||
});
|
||||
@@ -197,6 +197,36 @@ describe('resolveServerScript', () => {
|
||||
});
|
||||
});
|
||||
|
||||
describe('resolveNodeServerScript', () => {
|
||||
const { resolveNodeServerScript } = require('../src/cli');
|
||||
|
||||
test('finds server-node.mjs in dist from dev mode', () => {
|
||||
const srcDir = path.resolve(__dirname, '../src');
|
||||
const distFile = path.resolve(srcDir, '..', 'dist', 'server-node.mjs');
|
||||
const fs = require('fs');
|
||||
// Only test if the file exists (it may not be built yet)
|
||||
if (fs.existsSync(distFile)) {
|
||||
const result = resolveNodeServerScript(srcDir, '');
|
||||
expect(result).toBe(distFile);
|
||||
}
|
||||
});
|
||||
|
||||
test('returns null when server-node.mjs does not exist', () => {
|
||||
const result = resolveNodeServerScript('/nonexistent/$bunfs', '/nonexistent/browse');
|
||||
expect(result).toBeNull();
|
||||
});
|
||||
|
||||
test('finds server-node.mjs adjacent to compiled binary', () => {
|
||||
const distDir = path.resolve(__dirname, '../dist');
|
||||
const distFile = path.join(distDir, 'server-node.mjs');
|
||||
const fs = require('fs');
|
||||
if (fs.existsSync(distFile)) {
|
||||
const result = resolveNodeServerScript('/$bunfs/something', path.join(distDir, 'browse'));
|
||||
expect(result).toBe(distFile);
|
||||
}
|
||||
});
|
||||
});
|
||||
|
||||
describe('version mismatch detection', () => {
|
||||
test('detects when versions differ', () => {
|
||||
const stateVersion = 'abc123';
|
||||
|
||||
@@ -0,0 +1,37 @@
|
||||
import { describe, test, expect } from 'bun:test';
|
||||
import { TEMP_DIR, isPathWithin, IS_WINDOWS } from '../src/platform';
|
||||
|
||||
describe('platform constants', () => {
|
||||
test('TEMP_DIR is /tmp on non-Windows', () => {
|
||||
if (!IS_WINDOWS) {
|
||||
expect(TEMP_DIR).toBe('/tmp');
|
||||
}
|
||||
});
|
||||
|
||||
test('IS_WINDOWS reflects process.platform', () => {
|
||||
expect(IS_WINDOWS).toBe(process.platform === 'win32');
|
||||
});
|
||||
});
|
||||
|
||||
describe('isPathWithin', () => {
|
||||
test('path inside directory returns true', () => {
|
||||
expect(isPathWithin('/tmp/foo', '/tmp')).toBe(true);
|
||||
});
|
||||
|
||||
test('path outside directory returns false', () => {
|
||||
expect(isPathWithin('/etc/foo', '/tmp')).toBe(false);
|
||||
});
|
||||
|
||||
test('exact match returns true', () => {
|
||||
expect(isPathWithin('/tmp', '/tmp')).toBe(true);
|
||||
});
|
||||
|
||||
test('partial prefix does not match (path traversal)', () => {
|
||||
// /tmp-evil should NOT match /tmp
|
||||
expect(isPathWithin('/tmp-evil/foo', '/tmp')).toBe(false);
|
||||
});
|
||||
|
||||
test('nested path returns true', () => {
|
||||
expect(isPathWithin('/tmp/a/b/c', '/tmp')).toBe(true);
|
||||
});
|
||||
});
|
||||
+5
-8
@@ -314,13 +314,13 @@ TMPERR=$(mktemp /tmp/codex-err-XXXXXX.txt)
|
||||
|
||||
2. Run the review (5-minute timeout):
|
||||
```bash
|
||||
codex review --base <base> -c 'model_reasoning_effort="high"' --enable web_search_cached 2>"$TMPERR"
|
||||
codex review --base <base> -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR"
|
||||
```
|
||||
|
||||
Use `timeout: 300000` on the Bash call. If the user provided custom instructions
|
||||
(e.g., `/codex review focus on security`), pass them as the prompt argument:
|
||||
```bash
|
||||
codex review "focus on security" --base <base> -c 'model_reasoning_effort="high"' --enable web_search_cached 2>"$TMPERR"
|
||||
codex review "focus on security" --base <base> -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR"
|
||||
```
|
||||
|
||||
3. Capture the output. Then parse cost from stderr:
|
||||
@@ -475,7 +475,7 @@ THE PLAN:
|
||||
|
||||
For a **new session:**
|
||||
```bash
|
||||
codex exec "<prompt>" -s read-only -c 'model_reasoning_effort="high"' --enable web_search_cached --json 2>"$TMPERR" | python3 -c "
|
||||
codex exec "<prompt>" -s read-only -c 'model_reasoning_effort="xhigh"' --enable web_search_cached --json 2>"$TMPERR" | python3 -c "
|
||||
import sys, json
|
||||
for line in sys.stdin:
|
||||
line = line.strip()
|
||||
@@ -508,7 +508,7 @@ for line in sys.stdin:
|
||||
|
||||
For a **resumed session** (user chose "Continue"):
|
||||
```bash
|
||||
codex exec resume <session-id> "<prompt>" -s read-only -c 'model_reasoning_effort="high"' --enable web_search_cached --json 2>"$TMPERR" | python3 -c "
|
||||
codex exec resume <session-id> "<prompt>" -s read-only -c 'model_reasoning_effort="xhigh"' --enable web_search_cached --json 2>"$TMPERR" | python3 -c "
|
||||
<same python streaming parser as above>
|
||||
"
|
||||
```
|
||||
@@ -544,10 +544,7 @@ Session saved — run /codex again to continue this conversation.
|
||||
agentic coding model). This means as OpenAI ships newer models, /codex automatically
|
||||
uses them. If the user wants a specific model, pass `-m` through to codex.
|
||||
|
||||
**Reasoning effort** varies by mode — use the right level for each task:
|
||||
- **Review mode:** `high` — thorough but not slow. Diff review benefits from depth but doesn't need maximum compute.
|
||||
- **Challenge (adversarial) mode:** `xhigh` — maximum reasoning power. When trying to break code, you want the model thinking as hard as possible.
|
||||
- **Consult mode:** `high` — good balance of depth and speed for conversations.
|
||||
**Reasoning effort:** All modes use `xhigh` — maximum reasoning power. When reviewing code, breaking code, or consulting on architecture, you want the model thinking as hard as possible.
|
||||
|
||||
**Web search:** All codex commands use `--enable web_search_cached` so Codex can look up
|
||||
docs and APIs during review. This is OpenAI's cached index — fast, no extra cost.
|
||||
|
||||
+5
-8
@@ -79,13 +79,13 @@ TMPERR=$(mktemp /tmp/codex-err-XXXXXX.txt)
|
||||
|
||||
2. Run the review (5-minute timeout):
|
||||
```bash
|
||||
codex review --base <base> -c 'model_reasoning_effort="high"' --enable web_search_cached 2>"$TMPERR"
|
||||
codex review --base <base> -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR"
|
||||
```
|
||||
|
||||
Use `timeout: 300000` on the Bash call. If the user provided custom instructions
|
||||
(e.g., `/codex review focus on security`), pass them as the prompt argument:
|
||||
```bash
|
||||
codex review "focus on security" --base <base> -c 'model_reasoning_effort="high"' --enable web_search_cached 2>"$TMPERR"
|
||||
codex review "focus on security" --base <base> -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR"
|
||||
```
|
||||
|
||||
3. Capture the output. Then parse cost from stderr:
|
||||
@@ -240,7 +240,7 @@ THE PLAN:
|
||||
|
||||
For a **new session:**
|
||||
```bash
|
||||
codex exec "<prompt>" -s read-only -c 'model_reasoning_effort="high"' --enable web_search_cached --json 2>"$TMPERR" | python3 -c "
|
||||
codex exec "<prompt>" -s read-only -c 'model_reasoning_effort="xhigh"' --enable web_search_cached --json 2>"$TMPERR" | python3 -c "
|
||||
import sys, json
|
||||
for line in sys.stdin:
|
||||
line = line.strip()
|
||||
@@ -273,7 +273,7 @@ for line in sys.stdin:
|
||||
|
||||
For a **resumed session** (user chose "Continue"):
|
||||
```bash
|
||||
codex exec resume <session-id> "<prompt>" -s read-only -c 'model_reasoning_effort="high"' --enable web_search_cached --json 2>"$TMPERR" | python3 -c "
|
||||
codex exec resume <session-id> "<prompt>" -s read-only -c 'model_reasoning_effort="xhigh"' --enable web_search_cached --json 2>"$TMPERR" | python3 -c "
|
||||
<same python streaming parser as above>
|
||||
"
|
||||
```
|
||||
@@ -309,10 +309,7 @@ Session saved — run /codex again to continue this conversation.
|
||||
agentic coding model). This means as OpenAI ships newer models, /codex automatically
|
||||
uses them. If the user wants a specific model, pass `-m` through to codex.
|
||||
|
||||
**Reasoning effort** varies by mode — use the right level for each task:
|
||||
- **Review mode:** `high` — thorough but not slow. Diff review benefits from depth but doesn't need maximum compute.
|
||||
- **Challenge (adversarial) mode:** `xhigh` — maximum reasoning power. When trying to break code, you want the model thinking as hard as possible.
|
||||
- **Consult mode:** `high` — good balance of depth and speed for conversations.
|
||||
**Reasoning effort:** All modes use `xhigh` — maximum reasoning power. When reviewing code, breaking code, or consulting on architecture, you want the model thinking as hard as possible.
|
||||
|
||||
**Web search:** All codex commands use `--enable web_search_cached` so Codex can look up
|
||||
docs and APIs during review. This is OpenAI's cached index — fast, no extra cost.
|
||||
|
||||
+146
-1
@@ -241,6 +241,25 @@ success/error/abort, and `USED_BROWSE` with true/false based on whether `$B` was
|
||||
If you cannot determine the outcome, use "unknown". This runs in the background and
|
||||
never blocks the user.
|
||||
|
||||
## SETUP (run this check BEFORE any browse command)
|
||||
|
||||
```bash
|
||||
_ROOT=$(git rev-parse --show-toplevel 2>/dev/null)
|
||||
B=""
|
||||
[ -n "$_ROOT" ] && [ -x "$_ROOT/.claude/skills/gstack/browse/dist/browse" ] && B="$_ROOT/.claude/skills/gstack/browse/dist/browse"
|
||||
[ -z "$B" ] && B=~/.claude/skills/gstack/browse/dist/browse
|
||||
if [ -x "$B" ]; then
|
||||
echo "READY: $B"
|
||||
else
|
||||
echo "NEEDS_SETUP"
|
||||
fi
|
||||
```
|
||||
|
||||
If `NEEDS_SETUP`:
|
||||
1. Tell the user: "gstack browse needs a one-time build (~10 seconds). OK to proceed?" Then STOP and wait.
|
||||
2. Run: `cd <SKILL_DIR> && ./setup`
|
||||
3. If `bun` is not installed: `curl -fsSL https://bun.sh/install | bash`
|
||||
|
||||
# YC Office Hours
|
||||
|
||||
You are a **YC office hours partner**. Your job is to ensure the problem is understood before solutions are proposed. You adapt to what the user is building — startup founders get the hard questions, builders get an enthusiastic collaborator. This skill produces design docs, not code.
|
||||
@@ -505,6 +524,66 @@ Present via AskUserQuestion. Do NOT proceed without user approval of the approac
|
||||
|
||||
---
|
||||
|
||||
## Visual Sketch (UI ideas only)
|
||||
|
||||
If the chosen approach involves user-facing UI (screens, pages, forms, dashboards,
|
||||
or interactive elements), generate a rough wireframe to help the user visualize it.
|
||||
If the idea is backend-only, infrastructure, or has no UI component — skip this
|
||||
section silently.
|
||||
|
||||
**Step 1: Gather design context**
|
||||
|
||||
1. Check if `DESIGN.md` exists in the repo root. If it does, read it for design
|
||||
system constraints (colors, typography, spacing, component patterns). Use these
|
||||
constraints in the wireframe.
|
||||
2. Apply core design principles:
|
||||
- **Information hierarchy** — what does the user see first, second, third?
|
||||
- **Interaction states** — loading, empty, error, success, partial
|
||||
- **Edge case paranoia** — what if the name is 47 chars? Zero results? Network fails?
|
||||
- **Subtraction default** — "as little design as possible" (Rams). Every element earns its pixels.
|
||||
- **Design for trust** — every interface element builds or erodes user trust.
|
||||
|
||||
**Step 2: Generate wireframe HTML**
|
||||
|
||||
Generate a single-page HTML file with these constraints:
|
||||
- **Intentionally rough aesthetic** — use system fonts, thin gray borders, no color,
|
||||
hand-drawn-style elements. This is a sketch, not a polished mockup.
|
||||
- Self-contained — no external dependencies, no CDN links, inline CSS only
|
||||
- Show the core interaction flow (1-3 screens/states max)
|
||||
- Include realistic placeholder content (not "Lorem ipsum" — use content that
|
||||
matches the actual use case)
|
||||
- Add HTML comments explaining design decisions
|
||||
|
||||
Write to a temp file:
|
||||
```bash
|
||||
SKETCH_FILE="/tmp/gstack-sketch-$(date +%s).html"
|
||||
```
|
||||
|
||||
**Step 3: Render and capture**
|
||||
|
||||
```bash
|
||||
$B goto "file://$SKETCH_FILE"
|
||||
$B screenshot /tmp/gstack-sketch.png
|
||||
```
|
||||
|
||||
If `$B` is not available (browse binary not set up), skip the render step. Tell the
|
||||
user: "Visual sketch requires the browse binary. Run the setup script to enable it."
|
||||
|
||||
**Step 4: Present and iterate**
|
||||
|
||||
Show the screenshot to the user. Ask: "Does this feel right? Want to iterate on the layout?"
|
||||
|
||||
If they want changes, regenerate the HTML with their feedback and re-render.
|
||||
If they approve or say "good enough," proceed.
|
||||
|
||||
**Step 5: Include in design doc**
|
||||
|
||||
Reference the wireframe screenshot in the design doc's "Recommended Approach" section.
|
||||
The screenshot file at `/tmp/gstack-sketch.png` can be referenced by downstream skills
|
||||
(`/plan-design-review`, `/design-review`) to see what was originally envisioned.
|
||||
|
||||
---
|
||||
|
||||
## Phase 4.5: Founder Signal Synthesis
|
||||
|
||||
Before writing the design doc, synthesize the founder signals you observed during the session. These will appear in the design doc ("What I noticed") and in the closing conversation (Phase 6).
|
||||
@@ -641,7 +720,73 @@ Supersedes: {prior filename — omit this line if first design on this branch}
|
||||
{observational, mentor-like reflections referencing specific things the user said during the session. Quote their words back to them — don't characterize their behavior. 2-4 bullets.}
|
||||
```
|
||||
|
||||
Present the design doc to the user via AskUserQuestion:
|
||||
---
|
||||
|
||||
## Spec Review Loop
|
||||
|
||||
Before presenting the document to the user for approval, run an adversarial review.
|
||||
|
||||
**Step 1: Dispatch reviewer subagent**
|
||||
|
||||
Use the Agent tool to dispatch an independent reviewer. The reviewer has fresh context
|
||||
and cannot see the brainstorming conversation — only the document. This ensures genuine
|
||||
adversarial independence.
|
||||
|
||||
Prompt the subagent with:
|
||||
- The file path of the document just written
|
||||
- "Read this document and review it on 5 dimensions. For each dimension, note PASS or
|
||||
list specific issues with suggested fixes. At the end, output a quality score (1-10)
|
||||
across all dimensions."
|
||||
|
||||
**Dimensions:**
|
||||
1. **Completeness** — Are all requirements addressed? Missing edge cases?
|
||||
2. **Consistency** — Do parts of the document agree with each other? Contradictions?
|
||||
3. **Clarity** — Could an engineer implement this without asking questions? Ambiguous language?
|
||||
4. **Scope** — Does the document creep beyond the original problem? YAGNI violations?
|
||||
5. **Feasibility** — Can this actually be built with the stated approach? Hidden complexity?
|
||||
|
||||
The subagent should return:
|
||||
- A quality score (1-10)
|
||||
- PASS if no issues, or a numbered list of issues with dimension, description, and fix
|
||||
|
||||
**Step 2: Fix and re-dispatch**
|
||||
|
||||
If the reviewer returns issues:
|
||||
1. Fix each issue in the document on disk (use Edit tool)
|
||||
2. Re-dispatch the reviewer subagent with the updated document
|
||||
3. Maximum 3 iterations total
|
||||
|
||||
**Convergence guard:** If the reviewer returns the same issues on consecutive iterations
|
||||
(the fix didn't resolve them or the reviewer disagrees with the fix), stop the loop
|
||||
and persist those issues as "Reviewer Concerns" in the document rather than looping
|
||||
further.
|
||||
|
||||
If the subagent fails, times out, or is unavailable — skip the review loop entirely.
|
||||
Tell the user: "Spec review unavailable — presenting unreviewed doc." The document is
|
||||
already written to disk; the review is a quality bonus, not a gate.
|
||||
|
||||
**Step 3: Report and persist metrics**
|
||||
|
||||
After the loop completes (PASS, max iterations, or convergence guard):
|
||||
|
||||
1. Tell the user the result — summary by default:
|
||||
"Your doc survived N rounds of adversarial review. M issues caught and fixed.
|
||||
Quality score: X/10."
|
||||
If they ask "what did the reviewer find?", show the full reviewer output.
|
||||
|
||||
2. If issues remain after max iterations or convergence, add a "## Reviewer Concerns"
|
||||
section to the document listing each unresolved issue. Downstream skills will see this.
|
||||
|
||||
3. Append metrics:
|
||||
```bash
|
||||
mkdir -p ~/.gstack/analytics
|
||||
echo '{"skill":"office-hours","ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'","iterations":ITERATIONS,"issues_found":FOUND,"issues_fixed":FIXED,"remaining":REMAINING,"quality_score":SCORE}' >> ~/.gstack/analytics/spec-review.jsonl 2>/dev/null || true
|
||||
```
|
||||
Replace ITERATIONS, FOUND, FIXED, REMAINING, SCORE with actual values from the review.
|
||||
|
||||
---
|
||||
|
||||
Present the reviewed design doc to the user via AskUserQuestion:
|
||||
- A) Approve — mark Status: APPROVED and proceed to handoff
|
||||
- B) Revise — specify which sections need changes (loop back to revise those sections)
|
||||
- C) Start over — return to Phase 2
|
||||
|
||||
@@ -23,6 +23,8 @@ allowed-tools:
|
||||
|
||||
{{PREAMBLE}}
|
||||
|
||||
{{BROWSE_SETUP}}
|
||||
|
||||
# YC Office Hours
|
||||
|
||||
You are a **YC office hours partner**. Your job is to ensure the problem is understood before solutions are proposed. You adapt to what the user is building — startup founders get the hard questions, builders get an enthusiastic collaborator. This skill produces design docs, not code.
|
||||
@@ -287,6 +289,10 @@ Present via AskUserQuestion. Do NOT proceed without user approval of the approac
|
||||
|
||||
---
|
||||
|
||||
{{DESIGN_SKETCH}}
|
||||
|
||||
---
|
||||
|
||||
## Phase 4.5: Founder Signal Synthesis
|
||||
|
||||
Before writing the design doc, synthesize the founder signals you observed during the session. These will appear in the design doc ("What I noticed") and in the closing conversation (Phase 6).
|
||||
@@ -423,7 +429,13 @@ Supersedes: {prior filename — omit this line if first design on this branch}
|
||||
{observational, mentor-like reflections referencing specific things the user said during the session. Quote their words back to them — don't characterize their behavior. 2-4 bullets.}
|
||||
```
|
||||
|
||||
Present the design doc to the user via AskUserQuestion:
|
||||
---
|
||||
|
||||
{{SPEC_REVIEW_LOOP}}
|
||||
|
||||
---
|
||||
|
||||
Present the reviewed design doc to the user via AskUserQuestion:
|
||||
- A) Approve — mark Status: APPROVED and proceed to handoff
|
||||
- B) Revise — specify which sections need changes (loop back to revise those sections)
|
||||
- C) Start over — return to Phase 2
|
||||
|
||||
+8
-6
@@ -8,17 +8,19 @@
|
||||
"browse": "./browse/dist/browse"
|
||||
},
|
||||
"scripts": {
|
||||
"build": "bun run gen:skill-docs && bun run gen:skill-docs --host codex && bun build --compile browse/src/cli.ts --outfile browse/dist/browse && bun build --compile browse/src/find-browse.ts --outfile browse/dist/find-browse && git rev-parse HEAD > browse/dist/.version && rm -f .*.bun-build || true",
|
||||
"build": "bun run gen:skill-docs && bun run gen:skill-docs --host codex && bun build --compile browse/src/cli.ts --outfile browse/dist/browse && bun build --compile browse/src/find-browse.ts --outfile browse/dist/find-browse && bash browse/scripts/build-node-server.sh && git rev-parse HEAD > browse/dist/.version && rm -f .*.bun-build || true",
|
||||
"gen:skill-docs": "bun run scripts/gen-skill-docs.ts",
|
||||
"dev": "bun run browse/src/cli.ts",
|
||||
"server": "bun run browse/src/server.ts",
|
||||
"test": "bun test browse/test/ test/ --ignore test/skill-e2e.test.ts --ignore test/skill-llm-eval.test.ts --ignore test/skill-routing-e2e.test.ts --ignore test/codex-e2e.test.ts",
|
||||
"test:evals": "EVALS=1 bun test test/skill-llm-eval.test.ts test/skill-e2e.test.ts test/skill-routing-e2e.test.ts test/codex-e2e.test.ts",
|
||||
"test:evals:all": "EVALS=1 EVALS_ALL=1 bun test test/skill-llm-eval.test.ts test/skill-e2e.test.ts test/skill-routing-e2e.test.ts test/codex-e2e.test.ts",
|
||||
"test:e2e": "EVALS=1 bun test test/skill-e2e.test.ts test/skill-routing-e2e.test.ts test/codex-e2e.test.ts",
|
||||
"test:e2e:all": "EVALS=1 EVALS_ALL=1 bun test test/skill-e2e.test.ts test/skill-routing-e2e.test.ts test/codex-e2e.test.ts",
|
||||
"test": "bun test browse/test/ test/ --ignore test/skill-e2e.test.ts --ignore test/skill-llm-eval.test.ts --ignore test/skill-routing-e2e.test.ts --ignore test/codex-e2e.test.ts --ignore test/gemini-e2e.test.ts",
|
||||
"test:evals": "EVALS=1 bun test test/skill-llm-eval.test.ts test/skill-e2e.test.ts test/skill-routing-e2e.test.ts test/codex-e2e.test.ts test/gemini-e2e.test.ts",
|
||||
"test:evals:all": "EVALS=1 EVALS_ALL=1 bun test test/skill-llm-eval.test.ts test/skill-e2e.test.ts test/skill-routing-e2e.test.ts test/codex-e2e.test.ts test/gemini-e2e.test.ts",
|
||||
"test:e2e": "EVALS=1 bun test test/skill-e2e.test.ts test/skill-routing-e2e.test.ts test/codex-e2e.test.ts test/gemini-e2e.test.ts",
|
||||
"test:e2e:all": "EVALS=1 EVALS_ALL=1 bun test test/skill-e2e.test.ts test/skill-routing-e2e.test.ts test/codex-e2e.test.ts test/gemini-e2e.test.ts",
|
||||
"test:codex": "EVALS=1 bun test test/codex-e2e.test.ts",
|
||||
"test:codex:all": "EVALS=1 EVALS_ALL=1 bun test test/codex-e2e.test.ts",
|
||||
"test:gemini": "EVALS=1 bun test test/gemini-e2e.test.ts",
|
||||
"test:gemini:all": "EVALS=1 EVALS_ALL=1 bun test test/gemini-e2e.test.ts",
|
||||
"skill:check": "bun run scripts/skill-check.ts",
|
||||
"dev:skill": "bun run scripts/dev-skill.ts",
|
||||
"start": "bun run browse/src/server.ts",
|
||||
|
||||
@@ -10,6 +10,7 @@ description: |
|
||||
or "is this ambitious enough".
|
||||
Proactively suggest when the user is questioning scope or ambition of a plan,
|
||||
or when the plan feels like it could be thinking bigger.
|
||||
benefits-from: [office-hours]
|
||||
allowed-tools:
|
||||
- Read
|
||||
- Grep
|
||||
@@ -345,6 +346,37 @@ DESIGN=$(ls -t ~/.gstack/projects/$SLUG/*-$BRANCH-design-*.md 2>/dev/null | head
|
||||
```
|
||||
If a design doc exists (from `/office-hours`), read it. Use it as the source of truth for the problem statement, constraints, and chosen approach. If it has a `Supersedes:` field, note that this is a revised design.
|
||||
|
||||
## Prerequisite Skill Offer
|
||||
|
||||
When the design doc check above prints "No design doc found," offer the prerequisite
|
||||
skill before proceeding.
|
||||
|
||||
Say to the user via AskUserQuestion:
|
||||
|
||||
> "No design doc found for this branch. `/office-hours` produces a structured problem
|
||||
> statement, premise challenge, and explored alternatives — it gives this review much
|
||||
> sharper input to work with. Takes about 10 minutes. The design doc is per-feature,
|
||||
> not per-product — it captures the thinking behind this specific change."
|
||||
|
||||
Options:
|
||||
- A) Run /office-hours first (in another window, then come back)
|
||||
- B) Skip — proceed with standard review
|
||||
|
||||
If they skip: "No worries — standard review. If you ever want sharper input, try
|
||||
/office-hours first next time." Then proceed normally. Do not re-offer later in the session.
|
||||
|
||||
**Mid-session detection:** During Step 0A (Premise Challenge), if the user can't
|
||||
articulate the problem, keeps changing the problem statement, answers with "I'm not
|
||||
sure," or is clearly exploring rather than reviewing — offer `/office-hours`:
|
||||
|
||||
> "It sounds like you're still figuring out what to build — that's totally fine, but
|
||||
> that's what /office-hours is designed for. Want to pause this review and run
|
||||
> /office-hours first? It'll help you nail down the problem and approach, then come
|
||||
> back here for the strategic review."
|
||||
|
||||
Options: A) Yes, run /office-hours first. B) No, keep going.
|
||||
If they keep going, proceed normally — no guilt, no re-asking.
|
||||
|
||||
When reading TODOS.md, specifically:
|
||||
* Note any TODOs this plan touches, blocks, or unlocks
|
||||
* Check if deferred work from prior reviews relates to this plan
|
||||
@@ -488,6 +520,70 @@ Repo: {owner/repo}
|
||||
|
||||
Derive the feature slug from the plan being reviewed (e.g., "user-dashboard", "auth-refactor"). Use the date in YYYY-MM-DD format.
|
||||
|
||||
After writing the CEO plan, run the spec review loop on it:
|
||||
|
||||
## Spec Review Loop
|
||||
|
||||
Before presenting the document to the user for approval, run an adversarial review.
|
||||
|
||||
**Step 1: Dispatch reviewer subagent**
|
||||
|
||||
Use the Agent tool to dispatch an independent reviewer. The reviewer has fresh context
|
||||
and cannot see the brainstorming conversation — only the document. This ensures genuine
|
||||
adversarial independence.
|
||||
|
||||
Prompt the subagent with:
|
||||
- The file path of the document just written
|
||||
- "Read this document and review it on 5 dimensions. For each dimension, note PASS or
|
||||
list specific issues with suggested fixes. At the end, output a quality score (1-10)
|
||||
across all dimensions."
|
||||
|
||||
**Dimensions:**
|
||||
1. **Completeness** — Are all requirements addressed? Missing edge cases?
|
||||
2. **Consistency** — Do parts of the document agree with each other? Contradictions?
|
||||
3. **Clarity** — Could an engineer implement this without asking questions? Ambiguous language?
|
||||
4. **Scope** — Does the document creep beyond the original problem? YAGNI violations?
|
||||
5. **Feasibility** — Can this actually be built with the stated approach? Hidden complexity?
|
||||
|
||||
The subagent should return:
|
||||
- A quality score (1-10)
|
||||
- PASS if no issues, or a numbered list of issues with dimension, description, and fix
|
||||
|
||||
**Step 2: Fix and re-dispatch**
|
||||
|
||||
If the reviewer returns issues:
|
||||
1. Fix each issue in the document on disk (use Edit tool)
|
||||
2. Re-dispatch the reviewer subagent with the updated document
|
||||
3. Maximum 3 iterations total
|
||||
|
||||
**Convergence guard:** If the reviewer returns the same issues on consecutive iterations
|
||||
(the fix didn't resolve them or the reviewer disagrees with the fix), stop the loop
|
||||
and persist those issues as "Reviewer Concerns" in the document rather than looping
|
||||
further.
|
||||
|
||||
If the subagent fails, times out, or is unavailable — skip the review loop entirely.
|
||||
Tell the user: "Spec review unavailable — presenting unreviewed doc." The document is
|
||||
already written to disk; the review is a quality bonus, not a gate.
|
||||
|
||||
**Step 3: Report and persist metrics**
|
||||
|
||||
After the loop completes (PASS, max iterations, or convergence guard):
|
||||
|
||||
1. Tell the user the result — summary by default:
|
||||
"Your doc survived N rounds of adversarial review. M issues caught and fixed.
|
||||
Quality score: X/10."
|
||||
If they ask "what did the reviewer find?", show the full reviewer output.
|
||||
|
||||
2. If issues remain after max iterations or convergence, add a "## Reviewer Concerns"
|
||||
section to the document listing each unresolved issue. Downstream skills will see this.
|
||||
|
||||
3. Append metrics:
|
||||
```bash
|
||||
mkdir -p ~/.gstack/analytics
|
||||
echo '{"skill":"plan-ceo-review","ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'","iterations":ITERATIONS,"issues_found":FOUND,"issues_fixed":FIXED,"remaining":REMAINING,"quality_score":SCORE}' >> ~/.gstack/analytics/spec-review.jsonl 2>/dev/null || true
|
||||
```
|
||||
Replace ITERATIONS, FOUND, FIXED, REMAINING, SCORE with actual values from the review.
|
||||
|
||||
### 0E. Temporal Interrogation (EXPANSION, SELECTIVE EXPANSION, and HOLD modes)
|
||||
Think ahead to implementation: What decisions will need to be made during implementation that should be resolved NOW in the plan?
|
||||
```
|
||||
@@ -919,7 +1015,7 @@ Parse the output. Find the most recent entry for each skill (plan-ceo-review, pl
|
||||
- **Eng Review (required by default):** The only review that gates shipping. Covers architecture, code quality, tests, performance. Can be disabled globally with \`gstack-config set skip_eng_review true\` (the "don't bother me" setting).
|
||||
- **CEO Review (optional):** Use your judgment. Recommend it for big product/business changes, new user-facing features, or scope decisions. Skip for bug fixes, refactors, infra, and cleanup.
|
||||
- **Design Review (optional):** Use your judgment. Recommend it for UI/UX changes. Skip for backend-only, infra, or prompt-only changes.
|
||||
- **Codex Review (optional):** Independent second opinion from OpenAI Codex CLI. Shows pass/fail gate. Recommend for critical code changes where a second AI perspective adds value. Skip when Codex CLI is not installed.
|
||||
- **Codex Review (enabled by default when Codex CLI is installed):** Independent review + adversarial challenge from OpenAI Codex CLI. Shows pass/fail gate. Runs automatically when enabled — configure with \`gstack-config set codex_reviews enabled|disabled\`.
|
||||
|
||||
**Verdict logic:**
|
||||
- **CLEARED**: Eng Review has >= 1 entry within 7 days with status "clean" (or \`skip_eng_review\` is \`true\`)
|
||||
|
||||
@@ -10,6 +10,7 @@ description: |
|
||||
or "is this ambitious enough".
|
||||
Proactively suggest when the user is questioning scope or ambition of a plan,
|
||||
or when the plan feels like it could be thinking bigger.
|
||||
benefits-from: [office-hours]
|
||||
allowed-tools:
|
||||
- Read
|
||||
- Grep
|
||||
@@ -110,6 +111,20 @@ DESIGN=$(ls -t ~/.gstack/projects/$SLUG/*-$BRANCH-design-*.md 2>/dev/null | head
|
||||
```
|
||||
If a design doc exists (from `/office-hours`), read it. Use it as the source of truth for the problem statement, constraints, and chosen approach. If it has a `Supersedes:` field, note that this is a revised design.
|
||||
|
||||
{{BENEFITS_FROM}}
|
||||
|
||||
**Mid-session detection:** During Step 0A (Premise Challenge), if the user can't
|
||||
articulate the problem, keeps changing the problem statement, answers with "I'm not
|
||||
sure," or is clearly exploring rather than reviewing — offer `/office-hours`:
|
||||
|
||||
> "It sounds like you're still figuring out what to build — that's totally fine, but
|
||||
> that's what /office-hours is designed for. Want to pause this review and run
|
||||
> /office-hours first? It'll help you nail down the problem and approach, then come
|
||||
> back here for the strategic review."
|
||||
|
||||
Options: A) Yes, run /office-hours first. B) No, keep going.
|
||||
If they keep going, proceed normally — no guilt, no re-asking.
|
||||
|
||||
When reading TODOS.md, specifically:
|
||||
* Note any TODOs this plan touches, blocks, or unlocks
|
||||
* Check if deferred work from prior reviews relates to this plan
|
||||
@@ -253,6 +268,10 @@ Repo: {owner/repo}
|
||||
|
||||
Derive the feature slug from the plan being reviewed (e.g., "user-dashboard", "auth-refactor"). Use the date in YYYY-MM-DD format.
|
||||
|
||||
After writing the CEO plan, run the spec review loop on it:
|
||||
|
||||
{{SPEC_REVIEW_LOOP}}
|
||||
|
||||
### 0E. Temporal Interrogation (EXPANSION, SELECTIVE EXPANSION, and HOLD modes)
|
||||
Think ahead to implementation: What decisions will need to be made during implementation that should be resolved NOW in the plan?
|
||||
```
|
||||
|
||||
@@ -550,7 +550,7 @@ Parse the output. Find the most recent entry for each skill (plan-ceo-review, pl
|
||||
- **Eng Review (required by default):** The only review that gates shipping. Covers architecture, code quality, tests, performance. Can be disabled globally with \`gstack-config set skip_eng_review true\` (the "don't bother me" setting).
|
||||
- **CEO Review (optional):** Use your judgment. Recommend it for big product/business changes, new user-facing features, or scope decisions. Skip for bug fixes, refactors, infra, and cleanup.
|
||||
- **Design Review (optional):** Use your judgment. Recommend it for UI/UX changes. Skip for backend-only, infra, or prompt-only changes.
|
||||
- **Codex Review (optional):** Independent second opinion from OpenAI Codex CLI. Shows pass/fail gate. Recommend for critical code changes where a second AI perspective adds value. Skip when Codex CLI is not installed.
|
||||
- **Codex Review (enabled by default when Codex CLI is installed):** Independent review + adversarial challenge from OpenAI Codex CLI. Shows pass/fail gate. Runs automatically when enabled — configure with \`gstack-config set codex_reviews enabled|disabled\`.
|
||||
|
||||
**Verdict logic:**
|
||||
- **CLEARED**: Eng Review has >= 1 entry within 7 days with status "clean" (or \`skip_eng_review\` is \`true\`)
|
||||
|
||||
@@ -8,6 +8,7 @@ description: |
|
||||
"review the architecture", "engineering review", or "lock in the plan".
|
||||
Proactively suggest when the user has a plan or design doc and is about to
|
||||
start coding — to catch architecture issues before implementation.
|
||||
benefits-from: [office-hours]
|
||||
allowed-tools:
|
||||
- Read
|
||||
- Write
|
||||
@@ -291,6 +292,25 @@ DESIGN=$(ls -t ~/.gstack/projects/$SLUG/*-$BRANCH-design-*.md 2>/dev/null | head
|
||||
```
|
||||
If a design doc exists, read it. Use it as the source of truth for the problem statement, constraints, and chosen approach. If it has a `Supersedes:` field, note that this is a revised design — check the prior version for context on what changed and why.
|
||||
|
||||
## Prerequisite Skill Offer
|
||||
|
||||
When the design doc check above prints "No design doc found," offer the prerequisite
|
||||
skill before proceeding.
|
||||
|
||||
Say to the user via AskUserQuestion:
|
||||
|
||||
> "No design doc found for this branch. `/office-hours` produces a structured problem
|
||||
> statement, premise challenge, and explored alternatives — it gives this review much
|
||||
> sharper input to work with. Takes about 10 minutes. The design doc is per-feature,
|
||||
> not per-product — it captures the thinking behind this specific change."
|
||||
|
||||
Options:
|
||||
- A) Run /office-hours first (in another window, then come back)
|
||||
- B) Skip — proceed with standard review
|
||||
|
||||
If they skip: "No worries — standard review. If you ever want sharper input, try
|
||||
/office-hours first next time." Then proceed normally. Do not re-offer later in the session.
|
||||
|
||||
### Step 0: Scope Challenge
|
||||
Before reviewing anything, answer these questions:
|
||||
1. **What existing code already partially or fully solves each sub-problem?** Can we capture outputs from existing flows rather than building parallel ones?
|
||||
@@ -520,7 +540,7 @@ Parse the output. Find the most recent entry for each skill (plan-ceo-review, pl
|
||||
- **Eng Review (required by default):** The only review that gates shipping. Covers architecture, code quality, tests, performance. Can be disabled globally with \`gstack-config set skip_eng_review true\` (the "don't bother me" setting).
|
||||
- **CEO Review (optional):** Use your judgment. Recommend it for big product/business changes, new user-facing features, or scope decisions. Skip for bug fixes, refactors, infra, and cleanup.
|
||||
- **Design Review (optional):** Use your judgment. Recommend it for UI/UX changes. Skip for backend-only, infra, or prompt-only changes.
|
||||
- **Codex Review (optional):** Independent second opinion from OpenAI Codex CLI. Shows pass/fail gate. Recommend for critical code changes where a second AI perspective adds value. Skip when Codex CLI is not installed.
|
||||
- **Codex Review (enabled by default when Codex CLI is installed):** Independent review + adversarial challenge from OpenAI Codex CLI. Shows pass/fail gate. Runs automatically when enabled — configure with \`gstack-config set codex_reviews enabled|disabled\`.
|
||||
|
||||
**Verdict logic:**
|
||||
- **CLEARED**: Eng Review has >= 1 entry within 7 days with status "clean" (or \`skip_eng_review\` is \`true\`)
|
||||
|
||||
@@ -8,6 +8,7 @@ description: |
|
||||
"review the architecture", "engineering review", or "lock in the plan".
|
||||
Proactively suggest when the user has a plan or design doc and is about to
|
||||
start coding — to catch architecture issues before implementation.
|
||||
benefits-from: [office-hours]
|
||||
allowed-tools:
|
||||
- Read
|
||||
- Write
|
||||
@@ -73,6 +74,8 @@ DESIGN=$(ls -t ~/.gstack/projects/$SLUG/*-$BRANCH-design-*.md 2>/dev/null | head
|
||||
```
|
||||
If a design doc exists, read it. Use it as the source of truth for the problem statement, constraints, and chosen approach. If it has a `Supersedes:` field, note that this is a revised design — check the prior version for context on what changed and why.
|
||||
|
||||
{{BENEFITS_FROM}}
|
||||
|
||||
### Step 0: Scope Challenge
|
||||
Before reviewing anything, answer these questions:
|
||||
1. **What existing code already partially or fully solves each sub-problem?** Can we capture outputs from existing flows rather than building parallel ones?
|
||||
|
||||
+102
-26
@@ -497,52 +497,128 @@ If no documentation files exist, skip this step silently.
|
||||
|
||||
---
|
||||
|
||||
## Step 5.7: Codex second opinion (optional)
|
||||
## Step 5.7: Codex review
|
||||
|
||||
After completing the review, check if the Codex CLI is available:
|
||||
Check if the Codex CLI is available and read the user's Codex review preference:
|
||||
|
||||
```bash
|
||||
which codex 2>/dev/null && echo "CODEX_AVAILABLE" || echo "CODEX_NOT_AVAILABLE"
|
||||
CODEX_REVIEWS_CFG=$(~/.claude/skills/gstack/bin/gstack-config get codex_reviews 2>/dev/null || true)
|
||||
echo "CODEX_REVIEWS: ${CODEX_REVIEWS_CFG:-not_set}"
|
||||
```
|
||||
|
||||
If Codex is available, use AskUserQuestion:
|
||||
If `CODEX_NOT_AVAILABLE`: skip this step silently. Continue to the next step.
|
||||
|
||||
If `CODEX_REVIEWS` is `disabled`: skip this step silently. Continue to the next step.
|
||||
|
||||
If `CODEX_REVIEWS` is `enabled`: run both code review and adversarial challenge automatically (no prompt). Jump to the "Run Codex" section below.
|
||||
|
||||
If `CODEX_REVIEWS` is `not_set`: use AskUserQuestion to offer the one-time adoption prompt:
|
||||
|
||||
```
|
||||
Review complete. Want an independent second opinion from Codex (OpenAI)?
|
||||
GStack recommends enabling Codex code reviews — Codex is the super smart quiet engineer friend who will save your butt.
|
||||
|
||||
A) Run Codex code review — independent diff review with pass/fail gate
|
||||
B) Run Codex adversarial challenge — try to find ways this code will fail in production
|
||||
C) Both — review first, then adversarial challenge
|
||||
D) Skip — no Codex review needed
|
||||
A) Enable for all future runs (recommended, default)
|
||||
B) Try it for now, ask me again later
|
||||
C) No thanks, don't ask me again
|
||||
```
|
||||
|
||||
If the user chooses A, B, or C:
|
||||
|
||||
**For code review (A or C):** Run `codex review --base <base>` with a 5-minute timeout.
|
||||
Present the full output verbatim under a `CODEX SAYS (code review):` header.
|
||||
Check the output for `[P1]` markers — if found, note `GATE: FAIL`, otherwise `GATE: PASS`.
|
||||
After presenting, compare Codex's findings with your own review findings from Steps 4-5
|
||||
and output a CROSS-MODEL ANALYSIS showing what both found, what only Codex found,
|
||||
and what only Claude found.
|
||||
|
||||
**For adversarial challenge (B or C):** Run:
|
||||
If the user chooses A: persist the setting and run both:
|
||||
```bash
|
||||
codex exec "Review the changes on this branch against the base branch. Run git diff origin/<base> to see the diff. Your job is to find ways this code will fail in production. Think like an attacker and a chaos engineer. Find edge cases, race conditions, security holes, failure modes. Be adversarial." -s read-only
|
||||
~/.claude/skills/gstack/bin/gstack-config set codex_reviews enabled
|
||||
```
|
||||
Present the full output verbatim under a `CODEX SAYS (adversarial challenge):` header.
|
||||
|
||||
**Only if a code review ran (user chose A or C):** Persist the Codex review result to the review log:
|
||||
If the user chooses B: run both this time but do not persist any setting.
|
||||
|
||||
If the user chooses C: persist the opt-out and skip:
|
||||
```bash
|
||||
~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"codex-review","timestamp":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'","status":"STATUS","gate":"GATE"}'
|
||||
~/.claude/skills/gstack/bin/gstack-config set codex_reviews disabled
|
||||
```
|
||||
Then skip this step. Continue to the next step.
|
||||
|
||||
### Run Codex
|
||||
|
||||
Always run **both** code review and adversarial challenge. Use a 5-minute timeout (`timeout: 300000`) on each Bash call.
|
||||
|
||||
First, create a temp file for stderr capture:
|
||||
```bash
|
||||
TMPERR=$(mktemp /tmp/codex-review-XXXXXXXX)
|
||||
```
|
||||
|
||||
**Code review:** Run:
|
||||
```bash
|
||||
codex review --base <base> -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR"
|
||||
```
|
||||
|
||||
After the command completes, read stderr for cost/error info:
|
||||
```bash
|
||||
cat "$TMPERR"
|
||||
```
|
||||
|
||||
Present the full output verbatim under a `CODEX SAYS (code review):` header:
|
||||
|
||||
```
|
||||
CODEX SAYS (code review):
|
||||
════════════════════════════════════════════════════════════
|
||||
<full codex output, verbatim — do not truncate or summarize>
|
||||
════════════════════════════════════════════════════════════
|
||||
GATE: PASS Tokens: N | Est. cost: ~$X.XX
|
||||
```
|
||||
|
||||
Check the output for `[P1]` markers. If found: `GATE: FAIL`. If no `[P1]`: `GATE: PASS`.
|
||||
|
||||
**If GATE is FAIL:** use AskUserQuestion:
|
||||
|
||||
```
|
||||
Codex found N critical issues in the diff.
|
||||
|
||||
A) Investigate and fix now (recommended)
|
||||
B) Ship anyway — these issues may cause production problems
|
||||
```
|
||||
|
||||
If the user chooses A: read the Codex findings carefully and work to address them. Then re-run `codex review` to verify the gate is now PASS.
|
||||
|
||||
If the user chooses B: continue to the next step.
|
||||
|
||||
### Error handling (code review)
|
||||
|
||||
Before persisting the gate result, check for errors. All errors are non-blocking — Codex is a quality enhancement, not a prerequisite. Check `$TMPERR` output (already read above) for error indicators:
|
||||
|
||||
- **Auth failure:** If stderr contains "auth", "login", "unauthorized", or "API key", tell the user: "Codex authentication failed. Run \`codex login\` in your terminal to authenticate via ChatGPT." Do NOT persist a review log entry. Continue to the adversarial step (it will likely fail too, but try anyway).
|
||||
- **Timeout:** If the Bash call times out (5 min), tell the user: "Codex timed out after 5 minutes. The diff may be too large or the API may be slow." Do NOT persist a review log entry. Skip to cleanup.
|
||||
- **Empty response:** If codex returned no stdout output, tell the user: "Codex returned no response. Stderr: <paste relevant error>." Do NOT persist a review log entry. Skip to cleanup.
|
||||
|
||||
**Only if codex produced a real review (non-empty stdout):** Persist the code review result:
|
||||
```bash
|
||||
~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"codex-review","timestamp":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'","status":"STATUS","gate":"GATE","commit":"'"$(git rev-parse --short HEAD)"'"}'
|
||||
```
|
||||
|
||||
Substitute: STATUS ("clean" if PASS, "issues_found" if FAIL), GATE ("pass" or "fail").
|
||||
|
||||
**Do NOT persist a codex-review entry when only the adversarial challenge (B) ran** —
|
||||
there is no gate verdict to record, and a false entry would make the Review Readiness
|
||||
Dashboard believe a code review happened when it didn't.
|
||||
**Adversarial challenge:** Run:
|
||||
```bash
|
||||
TMPERR_ADV=$(mktemp /tmp/codex-adv-XXXXXXXX)
|
||||
codex exec "Review the changes on this branch against the base branch. Run git diff origin/<base> to see the diff. Your job is to find ways this code will fail in production. Think like an attacker and a chaos engineer. Find edge cases, race conditions, security holes, resource leaks, failure modes, and silent data corruption paths. Be adversarial. Be thorough. No compliments — just the problems." -s read-only -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR_ADV"
|
||||
```
|
||||
|
||||
If Codex is not available, skip this step silently.
|
||||
After the command completes, read adversarial stderr:
|
||||
```bash
|
||||
cat "$TMPERR_ADV"
|
||||
```
|
||||
|
||||
Present the full output verbatim under a `CODEX SAYS (adversarial challenge):` header. This is informational — it never blocks shipping. If the adversarial command timed out or returned no output, note this to the user and continue.
|
||||
|
||||
**Cross-model analysis:** After both Codex outputs are presented, compare Codex's findings with your own review findings from the earlier review steps and output:
|
||||
|
||||
```
|
||||
CROSS-MODEL ANALYSIS:
|
||||
Both found: [findings that overlap between Claude and Codex]
|
||||
Only Codex found: [findings unique to Codex]
|
||||
Only Claude found: [findings unique to Claude's review]
|
||||
Agreement rate: X% (N/M total unique findings overlap)
|
||||
```
|
||||
|
||||
**Cleanup:** Run `rm -f "$TMPERR" "$TMPERR_ADV"` after processing.
|
||||
|
||||
---
|
||||
|
||||
|
||||
+1
-48
@@ -231,54 +231,7 @@ If no documentation files exist, skip this step silently.
|
||||
|
||||
---
|
||||
|
||||
## Step 5.7: Codex second opinion (optional)
|
||||
|
||||
After completing the review, check if the Codex CLI is available:
|
||||
|
||||
```bash
|
||||
which codex 2>/dev/null && echo "CODEX_AVAILABLE" || echo "CODEX_NOT_AVAILABLE"
|
||||
```
|
||||
|
||||
If Codex is available, use AskUserQuestion:
|
||||
|
||||
```
|
||||
Review complete. Want an independent second opinion from Codex (OpenAI)?
|
||||
|
||||
A) Run Codex code review — independent diff review with pass/fail gate
|
||||
B) Run Codex adversarial challenge — try to find ways this code will fail in production
|
||||
C) Both — review first, then adversarial challenge
|
||||
D) Skip — no Codex review needed
|
||||
```
|
||||
|
||||
If the user chooses A, B, or C:
|
||||
|
||||
**For code review (A or C):** Run `codex review --base <base>` with a 5-minute timeout.
|
||||
Present the full output verbatim under a `CODEX SAYS (code review):` header.
|
||||
Check the output for `[P1]` markers — if found, note `GATE: FAIL`, otherwise `GATE: PASS`.
|
||||
After presenting, compare Codex's findings with your own review findings from Steps 4-5
|
||||
and output a CROSS-MODEL ANALYSIS showing what both found, what only Codex found,
|
||||
and what only Claude found.
|
||||
|
||||
**For adversarial challenge (B or C):** Run:
|
||||
```bash
|
||||
codex exec "Review the changes on this branch against the base branch. Run git diff origin/<base> to see the diff. Your job is to find ways this code will fail in production. Think like an attacker and a chaos engineer. Find edge cases, race conditions, security holes, failure modes. Be adversarial." -s read-only
|
||||
```
|
||||
Present the full output verbatim under a `CODEX SAYS (adversarial challenge):` header.
|
||||
|
||||
**Only if a code review ran (user chose A or C):** Persist the Codex review result to the review log:
|
||||
```bash
|
||||
~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"codex-review","timestamp":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'","status":"STATUS","gate":"GATE"}'
|
||||
```
|
||||
|
||||
Substitute: STATUS ("clean" if PASS, "issues_found" if FAIL), GATE ("pass" or "fail").
|
||||
|
||||
**Do NOT persist a codex-review entry when only the adversarial challenge (B) ran** —
|
||||
there is no gate verdict to record, and a false entry would make the Review Readiness
|
||||
Dashboard believe a code review happened when it didn't.
|
||||
|
||||
If Codex is not available, skip this step silently.
|
||||
|
||||
---
|
||||
{{CODEX_REVIEW_STEP}}
|
||||
|
||||
## Important Rules
|
||||
|
||||
|
||||
+297
-2
@@ -55,6 +55,7 @@ const HOST_PATHS: Record<Host, HostPaths> = {
|
||||
interface TemplateContext {
|
||||
skillName: string;
|
||||
tmplPath: string;
|
||||
benefitsFrom?: string[];
|
||||
host: Host;
|
||||
paths: HostPaths;
|
||||
}
|
||||
@@ -1206,7 +1207,7 @@ Parse the output. Find the most recent entry for each skill (plan-ceo-review, pl
|
||||
- **Eng Review (required by default):** The only review that gates shipping. Covers architecture, code quality, tests, performance. Can be disabled globally with \\\`gstack-config set skip_eng_review true\\\` (the "don't bother me" setting).
|
||||
- **CEO Review (optional):** Use your judgment. Recommend it for big product/business changes, new user-facing features, or scope decisions. Skip for bug fixes, refactors, infra, and cleanup.
|
||||
- **Design Review (optional):** Use your judgment. Recommend it for UI/UX changes. Skip for backend-only, infra, or prompt-only changes.
|
||||
- **Codex Review (optional):** Independent second opinion from OpenAI Codex CLI. Shows pass/fail gate. Recommend for critical code changes where a second AI perspective adds value. Skip when Codex CLI is not installed.
|
||||
- **Codex Review (enabled by default when Codex CLI is installed):** Independent review + adversarial challenge from OpenAI Codex CLI. Shows pass/fail gate. Runs automatically when enabled — configure with \\\`gstack-config set codex_reviews enabled|disabled\\\`.
|
||||
|
||||
**Verdict logic:**
|
||||
- **CLEARED**: Eng Review has >= 1 entry within 7 days with status "clean" (or \\\`skip_eng_review\\\` is \\\`true\\\`)
|
||||
@@ -1376,6 +1377,289 @@ Only commit if there are changes. Stage all bootstrap files (config, test direct
|
||||
---`;
|
||||
}
|
||||
|
||||
function generateSpecReviewLoop(_ctx: TemplateContext): string {
|
||||
return `## Spec Review Loop
|
||||
|
||||
Before presenting the document to the user for approval, run an adversarial review.
|
||||
|
||||
**Step 1: Dispatch reviewer subagent**
|
||||
|
||||
Use the Agent tool to dispatch an independent reviewer. The reviewer has fresh context
|
||||
and cannot see the brainstorming conversation — only the document. This ensures genuine
|
||||
adversarial independence.
|
||||
|
||||
Prompt the subagent with:
|
||||
- The file path of the document just written
|
||||
- "Read this document and review it on 5 dimensions. For each dimension, note PASS or
|
||||
list specific issues with suggested fixes. At the end, output a quality score (1-10)
|
||||
across all dimensions."
|
||||
|
||||
**Dimensions:**
|
||||
1. **Completeness** — Are all requirements addressed? Missing edge cases?
|
||||
2. **Consistency** — Do parts of the document agree with each other? Contradictions?
|
||||
3. **Clarity** — Could an engineer implement this without asking questions? Ambiguous language?
|
||||
4. **Scope** — Does the document creep beyond the original problem? YAGNI violations?
|
||||
5. **Feasibility** — Can this actually be built with the stated approach? Hidden complexity?
|
||||
|
||||
The subagent should return:
|
||||
- A quality score (1-10)
|
||||
- PASS if no issues, or a numbered list of issues with dimension, description, and fix
|
||||
|
||||
**Step 2: Fix and re-dispatch**
|
||||
|
||||
If the reviewer returns issues:
|
||||
1. Fix each issue in the document on disk (use Edit tool)
|
||||
2. Re-dispatch the reviewer subagent with the updated document
|
||||
3. Maximum 3 iterations total
|
||||
|
||||
**Convergence guard:** If the reviewer returns the same issues on consecutive iterations
|
||||
(the fix didn't resolve them or the reviewer disagrees with the fix), stop the loop
|
||||
and persist those issues as "Reviewer Concerns" in the document rather than looping
|
||||
further.
|
||||
|
||||
If the subagent fails, times out, or is unavailable — skip the review loop entirely.
|
||||
Tell the user: "Spec review unavailable — presenting unreviewed doc." The document is
|
||||
already written to disk; the review is a quality bonus, not a gate.
|
||||
|
||||
**Step 3: Report and persist metrics**
|
||||
|
||||
After the loop completes (PASS, max iterations, or convergence guard):
|
||||
|
||||
1. Tell the user the result — summary by default:
|
||||
"Your doc survived N rounds of adversarial review. M issues caught and fixed.
|
||||
Quality score: X/10."
|
||||
If they ask "what did the reviewer find?", show the full reviewer output.
|
||||
|
||||
2. If issues remain after max iterations or convergence, add a "## Reviewer Concerns"
|
||||
section to the document listing each unresolved issue. Downstream skills will see this.
|
||||
|
||||
3. Append metrics:
|
||||
\`\`\`bash
|
||||
mkdir -p ~/.gstack/analytics
|
||||
echo '{"skill":"${_ctx.skillName}","ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'","iterations":ITERATIONS,"issues_found":FOUND,"issues_fixed":FIXED,"remaining":REMAINING,"quality_score":SCORE}' >> ~/.gstack/analytics/spec-review.jsonl 2>/dev/null || true
|
||||
\`\`\`
|
||||
Replace ITERATIONS, FOUND, FIXED, REMAINING, SCORE with actual values from the review.`;
|
||||
}
|
||||
|
||||
function generateBenefitsFrom(ctx: TemplateContext): string {
|
||||
if (!ctx.benefitsFrom || ctx.benefitsFrom.length === 0) return '';
|
||||
|
||||
const skillList = ctx.benefitsFrom.map(s => `\`/${s}\``).join(' or ');
|
||||
const first = ctx.benefitsFrom[0];
|
||||
|
||||
return `## Prerequisite Skill Offer
|
||||
|
||||
When the design doc check above prints "No design doc found," offer the prerequisite
|
||||
skill before proceeding.
|
||||
|
||||
Say to the user via AskUserQuestion:
|
||||
|
||||
> "No design doc found for this branch. ${skillList} produces a structured problem
|
||||
> statement, premise challenge, and explored alternatives — it gives this review much
|
||||
> sharper input to work with. Takes about 10 minutes. The design doc is per-feature,
|
||||
> not per-product — it captures the thinking behind this specific change."
|
||||
|
||||
Options:
|
||||
- A) Run /${first} first (in another window, then come back)
|
||||
- B) Skip — proceed with standard review
|
||||
|
||||
If they skip: "No worries — standard review. If you ever want sharper input, try
|
||||
/${first} first next time." Then proceed normally. Do not re-offer later in the session.`;
|
||||
}
|
||||
|
||||
function generateDesignSketch(_ctx: TemplateContext): string {
|
||||
return `## Visual Sketch (UI ideas only)
|
||||
|
||||
If the chosen approach involves user-facing UI (screens, pages, forms, dashboards,
|
||||
or interactive elements), generate a rough wireframe to help the user visualize it.
|
||||
If the idea is backend-only, infrastructure, or has no UI component — skip this
|
||||
section silently.
|
||||
|
||||
**Step 1: Gather design context**
|
||||
|
||||
1. Check if \`DESIGN.md\` exists in the repo root. If it does, read it for design
|
||||
system constraints (colors, typography, spacing, component patterns). Use these
|
||||
constraints in the wireframe.
|
||||
2. Apply core design principles:
|
||||
- **Information hierarchy** — what does the user see first, second, third?
|
||||
- **Interaction states** — loading, empty, error, success, partial
|
||||
- **Edge case paranoia** — what if the name is 47 chars? Zero results? Network fails?
|
||||
- **Subtraction default** — "as little design as possible" (Rams). Every element earns its pixels.
|
||||
- **Design for trust** — every interface element builds or erodes user trust.
|
||||
|
||||
**Step 2: Generate wireframe HTML**
|
||||
|
||||
Generate a single-page HTML file with these constraints:
|
||||
- **Intentionally rough aesthetic** — use system fonts, thin gray borders, no color,
|
||||
hand-drawn-style elements. This is a sketch, not a polished mockup.
|
||||
- Self-contained — no external dependencies, no CDN links, inline CSS only
|
||||
- Show the core interaction flow (1-3 screens/states max)
|
||||
- Include realistic placeholder content (not "Lorem ipsum" — use content that
|
||||
matches the actual use case)
|
||||
- Add HTML comments explaining design decisions
|
||||
|
||||
Write to a temp file:
|
||||
\`\`\`bash
|
||||
SKETCH_FILE="/tmp/gstack-sketch-$(date +%s).html"
|
||||
\`\`\`
|
||||
|
||||
**Step 3: Render and capture**
|
||||
|
||||
\`\`\`bash
|
||||
$B goto "file://$SKETCH_FILE"
|
||||
$B screenshot /tmp/gstack-sketch.png
|
||||
\`\`\`
|
||||
|
||||
If \`$B\` is not available (browse binary not set up), skip the render step. Tell the
|
||||
user: "Visual sketch requires the browse binary. Run the setup script to enable it."
|
||||
|
||||
**Step 4: Present and iterate**
|
||||
|
||||
Show the screenshot to the user. Ask: "Does this feel right? Want to iterate on the layout?"
|
||||
|
||||
If they want changes, regenerate the HTML with their feedback and re-render.
|
||||
If they approve or say "good enough," proceed.
|
||||
|
||||
**Step 5: Include in design doc**
|
||||
|
||||
Reference the wireframe screenshot in the design doc's "Recommended Approach" section.
|
||||
The screenshot file at \`/tmp/gstack-sketch.png\` can be referenced by downstream skills
|
||||
(\`/plan-design-review\`, \`/design-review\`) to see what was originally envisioned.`;
|
||||
}
|
||||
|
||||
function generateCodexReviewStep(ctx: TemplateContext): string {
|
||||
// Codex host: strip entirely — Codex should never invoke itself
|
||||
if (ctx.host === 'codex') return '';
|
||||
|
||||
const isShip = ctx.skillName === 'ship';
|
||||
const stepNum = isShip ? '3.8' : '5.7';
|
||||
|
||||
return `## Step ${stepNum}: Codex review
|
||||
|
||||
Check if the Codex CLI is available and read the user's Codex review preference:
|
||||
|
||||
\`\`\`bash
|
||||
which codex 2>/dev/null && echo "CODEX_AVAILABLE" || echo "CODEX_NOT_AVAILABLE"
|
||||
CODEX_REVIEWS_CFG=$(~/.claude/skills/gstack/bin/gstack-config get codex_reviews 2>/dev/null || true)
|
||||
echo "CODEX_REVIEWS: \${CODEX_REVIEWS_CFG:-not_set}"
|
||||
\`\`\`
|
||||
|
||||
If \`CODEX_NOT_AVAILABLE\`: skip this step silently. Continue to the next step.
|
||||
|
||||
If \`CODEX_REVIEWS\` is \`disabled\`: skip this step silently. Continue to the next step.
|
||||
|
||||
If \`CODEX_REVIEWS\` is \`enabled\`: run both code review and adversarial challenge automatically (no prompt). Jump to the "Run Codex" section below.
|
||||
|
||||
If \`CODEX_REVIEWS\` is \`not_set\`: use AskUserQuestion to offer the one-time adoption prompt:
|
||||
|
||||
\`\`\`
|
||||
GStack recommends enabling Codex code reviews — Codex is the super smart quiet engineer friend who will save your butt.
|
||||
|
||||
A) Enable for all future runs (recommended, default)
|
||||
B) Try it for now, ask me again later
|
||||
C) No thanks, don't ask me again
|
||||
\`\`\`
|
||||
|
||||
If the user chooses A: persist the setting and run both:
|
||||
\`\`\`bash
|
||||
~/.claude/skills/gstack/bin/gstack-config set codex_reviews enabled
|
||||
\`\`\`
|
||||
|
||||
If the user chooses B: run both this time but do not persist any setting.
|
||||
|
||||
If the user chooses C: persist the opt-out and skip:
|
||||
\`\`\`bash
|
||||
~/.claude/skills/gstack/bin/gstack-config set codex_reviews disabled
|
||||
\`\`\`
|
||||
Then skip this step. Continue to the next step.
|
||||
|
||||
### Run Codex
|
||||
|
||||
Always run **both** code review and adversarial challenge. Use a 5-minute timeout (\`timeout: 300000\`) on each Bash call.
|
||||
|
||||
First, create a temp file for stderr capture:
|
||||
\`\`\`bash
|
||||
TMPERR=$(mktemp /tmp/codex-review-XXXXXXXX)
|
||||
\`\`\`
|
||||
|
||||
**Code review:** Run:
|
||||
\`\`\`bash
|
||||
codex review --base <base> -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR"
|
||||
\`\`\`
|
||||
|
||||
After the command completes, read stderr for cost/error info:
|
||||
\`\`\`bash
|
||||
cat "$TMPERR"
|
||||
\`\`\`
|
||||
|
||||
Present the full output verbatim under a \`CODEX SAYS (code review):\` header:
|
||||
|
||||
\`\`\`
|
||||
CODEX SAYS (code review):
|
||||
════════════════════════════════════════════════════════════
|
||||
<full codex output, verbatim — do not truncate or summarize>
|
||||
════════════════════════════════════════════════════════════
|
||||
GATE: PASS Tokens: N | Est. cost: ~$X.XX
|
||||
\`\`\`
|
||||
|
||||
Check the output for \`[P1]\` markers. If found: \`GATE: FAIL\`. If no \`[P1]\`: \`GATE: PASS\`.
|
||||
|
||||
**If GATE is FAIL:** use AskUserQuestion:
|
||||
|
||||
\`\`\`
|
||||
Codex found N critical issues in the diff.
|
||||
|
||||
A) Investigate and fix now (recommended)
|
||||
B) Ship anyway — these issues may cause production problems
|
||||
\`\`\`
|
||||
|
||||
If the user chooses A: read the Codex findings carefully and work to address them${isShip ? '. After fixing, re-run tests (Step 3) since code has changed' : ''}. Then re-run \`codex review\` to verify the gate is now PASS.
|
||||
|
||||
If the user chooses B: continue to the next step.
|
||||
|
||||
### Error handling (code review)
|
||||
|
||||
Before persisting the gate result, check for errors. All errors are non-blocking — Codex is a quality enhancement, not a prerequisite. Check \`$TMPERR\` output (already read above) for error indicators:
|
||||
|
||||
- **Auth failure:** If stderr contains "auth", "login", "unauthorized", or "API key", tell the user: "Codex authentication failed. Run \\\`codex login\\\` in your terminal to authenticate via ChatGPT." Do NOT persist a review log entry. Continue to the adversarial step (it will likely fail too, but try anyway).
|
||||
- **Timeout:** If the Bash call times out (5 min), tell the user: "Codex timed out after 5 minutes. The diff may be too large or the API may be slow." Do NOT persist a review log entry. Skip to cleanup.
|
||||
- **Empty response:** If codex returned no stdout output, tell the user: "Codex returned no response. Stderr: <paste relevant error>." Do NOT persist a review log entry. Skip to cleanup.
|
||||
|
||||
**Only if codex produced a real review (non-empty stdout):** Persist the code review result:
|
||||
\`\`\`bash
|
||||
~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"codex-review","timestamp":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'","status":"STATUS","gate":"GATE","commit":"'"$(git rev-parse --short HEAD)"'"}'
|
||||
\`\`\`
|
||||
|
||||
Substitute: STATUS ("clean" if PASS, "issues_found" if FAIL), GATE ("pass" or "fail").
|
||||
|
||||
**Adversarial challenge:** Run:
|
||||
\`\`\`bash
|
||||
TMPERR_ADV=$(mktemp /tmp/codex-adv-XXXXXXXX)
|
||||
codex exec "Review the changes on this branch against the base branch. Run git diff origin/<base> to see the diff. Your job is to find ways this code will fail in production. Think like an attacker and a chaos engineer. Find edge cases, race conditions, security holes, resource leaks, failure modes, and silent data corruption paths. Be adversarial. Be thorough. No compliments — just the problems." -s read-only -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR_ADV"
|
||||
\`\`\`
|
||||
|
||||
After the command completes, read adversarial stderr:
|
||||
\`\`\`bash
|
||||
cat "$TMPERR_ADV"
|
||||
\`\`\`
|
||||
|
||||
Present the full output verbatim under a \`CODEX SAYS (adversarial challenge):\` header. This is informational — it never blocks shipping. If the adversarial command timed out or returned no output, note this to the user and continue.
|
||||
${!isShip ? `
|
||||
**Cross-model analysis:** After both Codex outputs are presented, compare Codex's findings with your own review findings from the earlier review steps and output:
|
||||
|
||||
\`\`\`
|
||||
CROSS-MODEL ANALYSIS:
|
||||
Both found: [findings that overlap between Claude and Codex]
|
||||
Only Codex found: [findings unique to Codex]
|
||||
Only Claude found: [findings unique to Claude's review]
|
||||
Agreement rate: X% (N/M total unique findings overlap)
|
||||
\`\`\`
|
||||
` : ''}
|
||||
**Cleanup:** Run \`rm -f "$TMPERR" "$TMPERR_ADV"\` after processing.
|
||||
|
||||
---`;
|
||||
}
|
||||
|
||||
const RESOLVERS: Record<string, (ctx: TemplateContext) => string> = {
|
||||
COMMAND_REFERENCE: generateCommandReference,
|
||||
SNAPSHOT_FLAGS: generateSnapshotFlags,
|
||||
@@ -1388,6 +1672,10 @@ const RESOLVERS: Record<string, (ctx: TemplateContext) => string> = {
|
||||
REVIEW_DASHBOARD: generateReviewDashboard,
|
||||
TEST_BOOTSTRAP: generateTestBootstrap,
|
||||
TEST_FAILURE_TRIAGE: generateTestFailureTriage,
|
||||
SPEC_REVIEW_LOOP: generateSpecReviewLoop,
|
||||
DESIGN_SKETCH: generateDesignSketch,
|
||||
BENEFITS_FROM: generateBenefitsFrom,
|
||||
CODEX_REVIEW_STEP: generateCodexReviewStep,
|
||||
};
|
||||
|
||||
// ─── Codex Helpers ───────────────────────────────────────────
|
||||
@@ -1510,7 +1798,14 @@ function processTemplate(tmplPath: string, host: Host = 'claude'): { outputPath:
|
||||
// Extract skill name from frontmatter for TemplateContext
|
||||
const nameMatch = tmplContent.match(/^name:\s*(.+)$/m);
|
||||
const skillName = nameMatch ? nameMatch[1].trim() : path.basename(path.dirname(tmplPath));
|
||||
const ctx: TemplateContext = { skillName, tmplPath, host, paths: HOST_PATHS[host] };
|
||||
|
||||
// Extract benefits-from list from frontmatter (inline YAML: benefits-from: [a, b])
|
||||
const benefitsMatch = tmplContent.match(/^benefits-from:\s*\[([^\]]*)\]/m);
|
||||
const benefitsFrom = benefitsMatch
|
||||
? benefitsMatch[1].split(',').map(s => s.trim()).filter(Boolean)
|
||||
: undefined;
|
||||
|
||||
const ctx: TemplateContext = { skillName, tmplPath, benefitsFrom, host, paths: HOST_PATHS[host] };
|
||||
|
||||
// Replace placeholders
|
||||
let content = tmplContent.replace(/\{\{(\w+)\}\}/g, (match, name) => {
|
||||
|
||||
@@ -12,6 +12,11 @@ GSTACK_DIR="$(cd "$(dirname "$0")" && pwd)"
|
||||
SKILLS_DIR="$(dirname "$GSTACK_DIR")"
|
||||
BROWSE_BIN="$GSTACK_DIR/browse/dist/browse"
|
||||
|
||||
IS_WINDOWS=0
|
||||
case "$(uname -s)" in
|
||||
MINGW*|MSYS*|CYGWIN*|Windows_NT) IS_WINDOWS=1 ;;
|
||||
esac
|
||||
|
||||
# ─── Parse --host flag ─────────────────────────────────────────
|
||||
HOST="claude"
|
||||
while [ $# -gt 0 ]; do
|
||||
@@ -44,10 +49,19 @@ elif [ "$HOST" = "codex" ]; then
|
||||
fi
|
||||
|
||||
ensure_playwright_browser() {
|
||||
(
|
||||
cd "$GSTACK_DIR"
|
||||
bun --eval 'import { chromium } from "playwright"; const browser = await chromium.launch(); await browser.close();'
|
||||
) >/dev/null 2>&1
|
||||
if [ "$IS_WINDOWS" -eq 1 ]; then
|
||||
# On Windows, Bun can't launch Chromium due to broken pipe handling
|
||||
# (oven-sh/bun#4253). Use Node.js to verify Chromium works instead.
|
||||
(
|
||||
cd "$GSTACK_DIR"
|
||||
node -e "const { chromium } = require('playwright'); (async () => { const b = await chromium.launch(); await b.close(); })()" 2>/dev/null
|
||||
)
|
||||
else
|
||||
(
|
||||
cd "$GSTACK_DIR"
|
||||
bun --eval 'import { chromium } from "playwright"; const browser = await chromium.launch(); await browser.close();'
|
||||
) >/dev/null 2>&1
|
||||
fi
|
||||
}
|
||||
|
||||
# 1. Build browse binary if needed (smart rebuild: stale sources, package.json, lock)
|
||||
@@ -87,10 +101,32 @@ if ! ensure_playwright_browser; then
|
||||
cd "$GSTACK_DIR"
|
||||
bunx playwright install chromium
|
||||
)
|
||||
|
||||
if [ "$IS_WINDOWS" -eq 1 ]; then
|
||||
# On Windows, Node.js launches Chromium (not Bun — see oven-sh/bun#4253).
|
||||
# Ensure playwright is importable by Node from the gstack directory.
|
||||
if ! command -v node >/dev/null 2>&1; then
|
||||
echo "gstack setup failed: Node.js is required on Windows (Bun cannot launch Chromium due to a pipe bug)" >&2
|
||||
echo " Install Node.js: https://nodejs.org/" >&2
|
||||
exit 1
|
||||
fi
|
||||
echo "Windows detected — verifying Node.js can load Playwright..."
|
||||
(
|
||||
cd "$GSTACK_DIR"
|
||||
# Bun's node_modules already has playwright; verify Node can require it
|
||||
node -e "require('playwright')" 2>/dev/null || npm install --no-save playwright
|
||||
)
|
||||
fi
|
||||
fi
|
||||
|
||||
if ! ensure_playwright_browser; then
|
||||
echo "gstack setup failed: Playwright Chromium could not be launched" >&2
|
||||
if [ "$IS_WINDOWS" -eq 1 ]; then
|
||||
echo "gstack setup failed: Playwright Chromium could not be launched via Node.js" >&2
|
||||
echo " This is a known issue with Bun on Windows (oven-sh/bun#4253)." >&2
|
||||
echo " Ensure Node.js is installed and 'node -e \"require('playwright')\"' works." >&2
|
||||
else
|
||||
echo "gstack setup failed: Playwright Chromium could not be launched" >&2
|
||||
fi
|
||||
exit 1
|
||||
fi
|
||||
|
||||
|
||||
+98
-21
@@ -319,7 +319,7 @@ Parse the output. Find the most recent entry for each skill (plan-ceo-review, pl
|
||||
- **Eng Review (required by default):** The only review that gates shipping. Covers architecture, code quality, tests, performance. Can be disabled globally with \`gstack-config set skip_eng_review true\` (the "don't bother me" setting).
|
||||
- **CEO Review (optional):** Use your judgment. Recommend it for big product/business changes, new user-facing features, or scope decisions. Skip for bug fixes, refactors, infra, and cleanup.
|
||||
- **Design Review (optional):** Use your judgment. Recommend it for UI/UX changes. Skip for backend-only, infra, or prompt-only changes.
|
||||
- **Codex Review (optional):** Independent second opinion from OpenAI Codex CLI. Shows pass/fail gate. Recommend for critical code changes where a second AI perspective adds value. Skip when Codex CLI is not installed.
|
||||
- **Codex Review (enabled by default when Codex CLI is installed):** Independent review + adversarial challenge from OpenAI Codex CLI. Shows pass/fail gate. Runs automatically when enabled — configure with \`gstack-config set codex_reviews enabled|disabled\`.
|
||||
|
||||
**Verdict logic:**
|
||||
- **CLEARED**: Eng Review has >= 1 entry within 7 days with status "clean" (or \`skip_eng_review\` is \`true\`)
|
||||
@@ -959,41 +959,118 @@ For each classified comment:
|
||||
|
||||
---
|
||||
|
||||
## Step 3.8: Codex second opinion (optional)
|
||||
## Step 3.8: Codex review
|
||||
|
||||
Check if the Codex CLI is available:
|
||||
Check if the Codex CLI is available and read the user's Codex review preference:
|
||||
|
||||
```bash
|
||||
which codex 2>/dev/null && echo "CODEX_AVAILABLE" || echo "CODEX_NOT_AVAILABLE"
|
||||
CODEX_REVIEWS_CFG=$(~/.claude/skills/gstack/bin/gstack-config get codex_reviews 2>/dev/null || true)
|
||||
echo "CODEX_REVIEWS: ${CODEX_REVIEWS_CFG:-not_set}"
|
||||
```
|
||||
|
||||
If Codex is available, use AskUserQuestion:
|
||||
If `CODEX_NOT_AVAILABLE`: skip this step silently. Continue to the next step.
|
||||
|
||||
If `CODEX_REVIEWS` is `disabled`: skip this step silently. Continue to the next step.
|
||||
|
||||
If `CODEX_REVIEWS` is `enabled`: run both code review and adversarial challenge automatically (no prompt). Jump to the "Run Codex" section below.
|
||||
|
||||
If `CODEX_REVIEWS` is `not_set`: use AskUserQuestion to offer the one-time adoption prompt:
|
||||
|
||||
```
|
||||
Pre-landing review complete. Want an independent Codex (OpenAI) review before shipping?
|
||||
GStack recommends enabling Codex code reviews — Codex is the super smart quiet engineer friend who will save your butt.
|
||||
|
||||
A) Run Codex code review — independent diff review with pass/fail gate
|
||||
B) Run Codex adversarial challenge — try to break this code
|
||||
C) Skip — ship without Codex review
|
||||
A) Enable for all future runs (recommended, default)
|
||||
B) Try it for now, ask me again later
|
||||
C) No thanks, don't ask me again
|
||||
```
|
||||
|
||||
If the user chooses A or B:
|
||||
|
||||
**For code review (A):** Run `codex review --base <base>` with a 5-minute timeout.
|
||||
Present the full output verbatim under a `CODEX SAYS:` header. Check for `[P1]` markers
|
||||
to determine pass/fail gate. Persist the result:
|
||||
|
||||
If the user chooses A: persist the setting and run both:
|
||||
```bash
|
||||
~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"codex-review","timestamp":"TIMESTAMP","status":"STATUS","gate":"GATE"}'
|
||||
~/.claude/skills/gstack/bin/gstack-config set codex_reviews enabled
|
||||
```
|
||||
|
||||
If GATE is FAIL, use AskUserQuestion: "Codex found critical issues. Ship anyway?"
|
||||
If the user says no, stop. If yes, continue to Step 4.
|
||||
If the user chooses B: run both this time but do not persist any setting.
|
||||
|
||||
**For adversarial (B):** Run codex exec with the adversarial prompt (see /codex skill).
|
||||
Present findings. This is informational — does not block shipping.
|
||||
If the user chooses C: persist the opt-out and skip:
|
||||
```bash
|
||||
~/.claude/skills/gstack/bin/gstack-config set codex_reviews disabled
|
||||
```
|
||||
Then skip this step. Continue to the next step.
|
||||
|
||||
If Codex is not available, skip silently. Continue to Step 4.
|
||||
### Run Codex
|
||||
|
||||
Always run **both** code review and adversarial challenge. Use a 5-minute timeout (`timeout: 300000`) on each Bash call.
|
||||
|
||||
First, create a temp file for stderr capture:
|
||||
```bash
|
||||
TMPERR=$(mktemp /tmp/codex-review-XXXXXXXX)
|
||||
```
|
||||
|
||||
**Code review:** Run:
|
||||
```bash
|
||||
codex review --base <base> -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR"
|
||||
```
|
||||
|
||||
After the command completes, read stderr for cost/error info:
|
||||
```bash
|
||||
cat "$TMPERR"
|
||||
```
|
||||
|
||||
Present the full output verbatim under a `CODEX SAYS (code review):` header:
|
||||
|
||||
```
|
||||
CODEX SAYS (code review):
|
||||
════════════════════════════════════════════════════════════
|
||||
<full codex output, verbatim — do not truncate or summarize>
|
||||
════════════════════════════════════════════════════════════
|
||||
GATE: PASS Tokens: N | Est. cost: ~$X.XX
|
||||
```
|
||||
|
||||
Check the output for `[P1]` markers. If found: `GATE: FAIL`. If no `[P1]`: `GATE: PASS`.
|
||||
|
||||
**If GATE is FAIL:** use AskUserQuestion:
|
||||
|
||||
```
|
||||
Codex found N critical issues in the diff.
|
||||
|
||||
A) Investigate and fix now (recommended)
|
||||
B) Ship anyway — these issues may cause production problems
|
||||
```
|
||||
|
||||
If the user chooses A: read the Codex findings carefully and work to address them. After fixing, re-run tests (Step 3) since code has changed. Then re-run `codex review` to verify the gate is now PASS.
|
||||
|
||||
If the user chooses B: continue to the next step.
|
||||
|
||||
### Error handling (code review)
|
||||
|
||||
Before persisting the gate result, check for errors. All errors are non-blocking — Codex is a quality enhancement, not a prerequisite. Check `$TMPERR` output (already read above) for error indicators:
|
||||
|
||||
- **Auth failure:** If stderr contains "auth", "login", "unauthorized", or "API key", tell the user: "Codex authentication failed. Run \`codex login\` in your terminal to authenticate via ChatGPT." Do NOT persist a review log entry. Continue to the adversarial step (it will likely fail too, but try anyway).
|
||||
- **Timeout:** If the Bash call times out (5 min), tell the user: "Codex timed out after 5 minutes. The diff may be too large or the API may be slow." Do NOT persist a review log entry. Skip to cleanup.
|
||||
- **Empty response:** If codex returned no stdout output, tell the user: "Codex returned no response. Stderr: <paste relevant error>." Do NOT persist a review log entry. Skip to cleanup.
|
||||
|
||||
**Only if codex produced a real review (non-empty stdout):** Persist the code review result:
|
||||
```bash
|
||||
~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"codex-review","timestamp":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'","status":"STATUS","gate":"GATE","commit":"'"$(git rev-parse --short HEAD)"'"}'
|
||||
```
|
||||
|
||||
Substitute: STATUS ("clean" if PASS, "issues_found" if FAIL), GATE ("pass" or "fail").
|
||||
|
||||
**Adversarial challenge:** Run:
|
||||
```bash
|
||||
TMPERR_ADV=$(mktemp /tmp/codex-adv-XXXXXXXX)
|
||||
codex exec "Review the changes on this branch against the base branch. Run git diff origin/<base> to see the diff. Your job is to find ways this code will fail in production. Think like an attacker and a chaos engineer. Find edge cases, race conditions, security holes, resource leaks, failure modes, and silent data corruption paths. Be adversarial. Be thorough. No compliments — just the problems." -s read-only -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR_ADV"
|
||||
```
|
||||
|
||||
After the command completes, read adversarial stderr:
|
||||
```bash
|
||||
cat "$TMPERR_ADV"
|
||||
```
|
||||
|
||||
Present the full output verbatim under a `CODEX SAYS (adversarial challenge):` header. This is informational — it never blocks shipping. If the adversarial command timed out or returned no output, note this to the user and continue.
|
||||
|
||||
**Cleanup:** Run `rm -f "$TMPERR" "$TMPERR_ADV"` after processing.
|
||||
|
||||
---
|
||||
|
||||
@@ -1236,7 +1313,7 @@ doc updates — the user runs `/ship` and documentation stays current without a
|
||||
- **Never skip tests.** If tests fail, stop.
|
||||
- **Never skip the pre-landing review.** If checklist.md is unreadable, stop.
|
||||
- **Never force push.** Use regular `git push` only.
|
||||
- **Never ask for confirmation** except for MINOR/MAJOR version bumps and pre-landing review ASK items (batched into at most one AskUserQuestion).
|
||||
- **Never ask for trivial confirmations** (e.g., "ready to push?", "create PR?"). DO stop for: version bumps (MINOR/MAJOR), pre-landing review findings (ASK items), Codex critical findings ([P1]), and the one-time Codex adoption prompt.
|
||||
- **Always use the 4-digit version format** from the VERSION file.
|
||||
- **Date format in CHANGELOG:** `YYYY-MM-DD`
|
||||
- **Split commits for bisectability** — each commit = one logical change.
|
||||
|
||||
+2
-38
@@ -407,43 +407,7 @@ For each classified comment:
|
||||
|
||||
---
|
||||
|
||||
## Step 3.8: Codex second opinion (optional)
|
||||
|
||||
Check if the Codex CLI is available:
|
||||
|
||||
```bash
|
||||
which codex 2>/dev/null && echo "CODEX_AVAILABLE" || echo "CODEX_NOT_AVAILABLE"
|
||||
```
|
||||
|
||||
If Codex is available, use AskUserQuestion:
|
||||
|
||||
```
|
||||
Pre-landing review complete. Want an independent Codex (OpenAI) review before shipping?
|
||||
|
||||
A) Run Codex code review — independent diff review with pass/fail gate
|
||||
B) Run Codex adversarial challenge — try to break this code
|
||||
C) Skip — ship without Codex review
|
||||
```
|
||||
|
||||
If the user chooses A or B:
|
||||
|
||||
**For code review (A):** Run `codex review --base <base>` with a 5-minute timeout.
|
||||
Present the full output verbatim under a `CODEX SAYS:` header. Check for `[P1]` markers
|
||||
to determine pass/fail gate. Persist the result:
|
||||
|
||||
```bash
|
||||
~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"codex-review","timestamp":"TIMESTAMP","status":"STATUS","gate":"GATE"}'
|
||||
```
|
||||
|
||||
If GATE is FAIL, use AskUserQuestion: "Codex found critical issues. Ship anyway?"
|
||||
If the user says no, stop. If yes, continue to Step 4.
|
||||
|
||||
**For adversarial (B):** Run codex exec with the adversarial prompt (see /codex skill).
|
||||
Present findings. This is informational — does not block shipping.
|
||||
|
||||
If Codex is not available, skip silently. Continue to Step 4.
|
||||
|
||||
---
|
||||
{{CODEX_REVIEW_STEP}}
|
||||
|
||||
## Step 4: Version bump (auto-decide)
|
||||
|
||||
@@ -684,7 +648,7 @@ doc updates — the user runs `/ship` and documentation stays current without a
|
||||
- **Never skip tests.** If tests fail, stop.
|
||||
- **Never skip the pre-landing review.** If checklist.md is unreadable, stop.
|
||||
- **Never force push.** Use regular `git push` only.
|
||||
- **Never ask for confirmation** except for MINOR/MAJOR version bumps and pre-landing review ASK items (batched into at most one AskUserQuestion).
|
||||
- **Never ask for trivial confirmations** (e.g., "ready to push?", "create PR?"). DO stop for: version bumps (MINOR/MAJOR), pre-landing review findings (ASK items), Codex critical findings ([P1]), and the one-time Codex adoption prompt.
|
||||
- **Always use the 4-digit version format** from the VERSION file.
|
||||
- **Date format in CHANGELOG:** `YYYY-MM-DD`
|
||||
- **Split commits for bisectability** — each commit = one logical change.
|
||||
|
||||
@@ -0,0 +1,173 @@
|
||||
/**
|
||||
* Gemini CLI E2E tests — verify skills work when invoked by Gemini CLI.
|
||||
*
|
||||
* Spawns `gemini -p` with stream-json output in the repo root (where
|
||||
* .agents/skills/ already exists), parses JSONL events, and validates
|
||||
* structured results. Follows the same pattern as codex-e2e.test.ts.
|
||||
*
|
||||
* Prerequisites:
|
||||
* - `gemini` binary installed (npm install -g @google/gemini-cli)
|
||||
* - Gemini authenticated via ~/.gemini/ config or GEMINI_API_KEY env var
|
||||
* - EVALS=1 env var set (same gate as Claude E2E tests)
|
||||
*
|
||||
* Skips gracefully when prerequisites are not met.
|
||||
*/
|
||||
|
||||
import { describe, test, expect, afterAll } from 'bun:test';
|
||||
import { runGeminiSkill } from './helpers/gemini-session-runner';
|
||||
import type { GeminiResult } from './helpers/gemini-session-runner';
|
||||
import { EvalCollector } from './helpers/eval-store';
|
||||
import { selectTests, detectBaseBranch, getChangedFiles, GLOBAL_TOUCHFILES } from './helpers/touchfiles';
|
||||
import * as path from 'path';
|
||||
|
||||
const ROOT = path.resolve(import.meta.dir, '..');
|
||||
|
||||
// --- Prerequisites check ---
|
||||
|
||||
const GEMINI_AVAILABLE = (() => {
|
||||
try {
|
||||
const result = Bun.spawnSync(['which', 'gemini']);
|
||||
return result.exitCode === 0;
|
||||
} catch { return false; }
|
||||
})();
|
||||
|
||||
const evalsEnabled = !!process.env.EVALS;
|
||||
|
||||
// Skip all tests if gemini is not available or EVALS is not set.
|
||||
const SKIP = !GEMINI_AVAILABLE || !evalsEnabled;
|
||||
|
||||
const describeGemini = SKIP ? describe.skip : describe;
|
||||
|
||||
// Log why we're skipping (helpful for debugging CI)
|
||||
if (!evalsEnabled) {
|
||||
// Silent — same as Claude E2E tests, EVALS=1 required
|
||||
} else if (!GEMINI_AVAILABLE) {
|
||||
process.stderr.write('\nGemini E2E: SKIPPED — gemini binary not found (install: npm i -g @google/gemini-cli)\n');
|
||||
}
|
||||
|
||||
// --- Diff-based test selection ---
|
||||
|
||||
// Gemini E2E touchfiles — keyed by test name, same pattern as Codex E2E
|
||||
const GEMINI_E2E_TOUCHFILES: Record<string, string[]> = {
|
||||
'gemini-discover-skill': ['.agents/skills/**', 'test/helpers/gemini-session-runner.ts'],
|
||||
'gemini-review-findings': ['review/**', '.agents/skills/gstack-review/**', 'test/helpers/gemini-session-runner.ts'],
|
||||
};
|
||||
|
||||
let selectedTests: string[] | null = null; // null = run all
|
||||
|
||||
if (evalsEnabled && !process.env.EVALS_ALL) {
|
||||
const baseBranch = process.env.EVALS_BASE
|
||||
|| detectBaseBranch(ROOT)
|
||||
|| 'main';
|
||||
const changedFiles = getChangedFiles(baseBranch, ROOT);
|
||||
|
||||
if (changedFiles.length > 0) {
|
||||
const selection = selectTests(changedFiles, GEMINI_E2E_TOUCHFILES, GLOBAL_TOUCHFILES);
|
||||
selectedTests = selection.selected;
|
||||
process.stderr.write(`\nGemini E2E selection (${selection.reason}): ${selection.selected.length}/${Object.keys(GEMINI_E2E_TOUCHFILES).length} tests\n`);
|
||||
if (selection.skipped.length > 0) {
|
||||
process.stderr.write(` Skipped: ${selection.skipped.join(', ')}\n`);
|
||||
}
|
||||
process.stderr.write('\n');
|
||||
}
|
||||
// If changedFiles is empty (e.g., on main branch), selectedTests stays null -> run all
|
||||
}
|
||||
|
||||
/** Skip an individual test if not selected by diff-based selection. */
|
||||
function testIfSelected(testName: string, fn: () => Promise<void>, timeout: number) {
|
||||
const shouldRun = selectedTests === null || selectedTests.includes(testName);
|
||||
(shouldRun ? test : test.skip)(testName, fn, timeout);
|
||||
}
|
||||
|
||||
// --- Eval result collector ---
|
||||
|
||||
const evalCollector = evalsEnabled && !SKIP ? new EvalCollector('e2e-gemini') : null;
|
||||
|
||||
/** DRY helper to record a Gemini E2E test result into the eval collector. */
|
||||
function recordGeminiE2E(name: string, result: GeminiResult, passed: boolean) {
|
||||
evalCollector?.addTest({
|
||||
name,
|
||||
suite: 'gemini-e2e',
|
||||
tier: 'e2e',
|
||||
passed,
|
||||
duration_ms: result.durationMs,
|
||||
cost_usd: 0, // Gemini doesn't report cost in USD; tokens are tracked
|
||||
output: result.output?.slice(0, 2000),
|
||||
turns_used: result.toolCalls.length, // approximate: tool calls as turns
|
||||
exit_reason: result.exitCode === 0 ? 'success' : `exit_code_${result.exitCode}`,
|
||||
});
|
||||
}
|
||||
|
||||
/** Print cost summary after a Gemini E2E test. */
|
||||
function logGeminiCost(label: string, result: GeminiResult) {
|
||||
const durationSec = Math.round(result.durationMs / 1000);
|
||||
console.log(`${label}: ${result.tokens} tokens, ${result.toolCalls.length} tool calls, ${durationSec}s`);
|
||||
}
|
||||
|
||||
// Finalize eval results on exit
|
||||
afterAll(async () => {
|
||||
if (evalCollector) {
|
||||
await evalCollector.finalize();
|
||||
}
|
||||
});
|
||||
|
||||
// --- Tests ---
|
||||
|
||||
describeGemini('Gemini E2E', () => {
|
||||
|
||||
testIfSelected('gemini-discover-skill', async () => {
|
||||
// Run Gemini in the repo root where .agents/skills/ exists
|
||||
const result = await runGeminiSkill({
|
||||
prompt: 'List any skills or instructions you have available. Just list the names.',
|
||||
timeoutMs: 60_000,
|
||||
cwd: ROOT,
|
||||
});
|
||||
|
||||
logGeminiCost('gemini-discover-skill', result);
|
||||
|
||||
// Gemini should have produced some output
|
||||
const passed = result.exitCode === 0 && result.output.length > 0;
|
||||
recordGeminiE2E('gemini-discover-skill', result, passed);
|
||||
|
||||
expect(result.exitCode).toBe(0);
|
||||
expect(result.output.length).toBeGreaterThan(0);
|
||||
// The output should reference skills in some form
|
||||
const outputLower = result.output.toLowerCase();
|
||||
expect(
|
||||
outputLower.includes('review') || outputLower.includes('gstack') || outputLower.includes('skill'),
|
||||
).toBe(true);
|
||||
}, 120_000);
|
||||
|
||||
testIfSelected('gemini-review-findings', async () => {
|
||||
// Run gstack-review skill via Gemini on this repo
|
||||
const result = await runGeminiSkill({
|
||||
prompt: 'Run the gstack-review skill on this repository. Review the current branch diff and report your findings.',
|
||||
timeoutMs: 540_000,
|
||||
cwd: ROOT,
|
||||
});
|
||||
|
||||
logGeminiCost('gemini-review-findings', result);
|
||||
|
||||
// Should produce structured review-like output
|
||||
const output = result.output;
|
||||
const passed = result.exitCode === 0 && output.length > 50;
|
||||
recordGeminiE2E('gemini-review-findings', result, passed);
|
||||
|
||||
expect(result.exitCode).toBe(0);
|
||||
expect(output.length).toBeGreaterThan(50);
|
||||
|
||||
// Review output should contain some review-like content
|
||||
const outputLower = output.toLowerCase();
|
||||
const hasReviewContent =
|
||||
outputLower.includes('finding') ||
|
||||
outputLower.includes('issue') ||
|
||||
outputLower.includes('review') ||
|
||||
outputLower.includes('change') ||
|
||||
outputLower.includes('diff') ||
|
||||
outputLower.includes('clean') ||
|
||||
outputLower.includes('no issues') ||
|
||||
outputLower.includes('p1') ||
|
||||
outputLower.includes('p2');
|
||||
expect(hasReviewContent).toBe(true);
|
||||
}, 600_000);
|
||||
});
|
||||
@@ -416,6 +416,98 @@ describe('REVIEW_DASHBOARD resolver', () => {
|
||||
});
|
||||
});
|
||||
|
||||
// --- {{SPEC_REVIEW_LOOP}} resolver tests ---
|
||||
|
||||
describe('SPEC_REVIEW_LOOP resolver', () => {
|
||||
const content = fs.readFileSync(path.join(ROOT, 'office-hours', 'SKILL.md'), 'utf-8');
|
||||
|
||||
test('contains all 5 review dimensions', () => {
|
||||
for (const dim of ['Completeness', 'Consistency', 'Clarity', 'Scope', 'Feasibility']) {
|
||||
expect(content).toContain(dim);
|
||||
}
|
||||
});
|
||||
|
||||
test('references Agent tool for subagent dispatch', () => {
|
||||
expect(content).toMatch(/Agent.*tool/i);
|
||||
});
|
||||
|
||||
test('specifies max 3 iterations', () => {
|
||||
expect(content).toMatch(/3.*iteration|maximum.*3/i);
|
||||
});
|
||||
|
||||
test('includes quality score', () => {
|
||||
expect(content).toContain('quality score');
|
||||
});
|
||||
|
||||
test('includes metrics path', () => {
|
||||
expect(content).toContain('spec-review.jsonl');
|
||||
});
|
||||
|
||||
test('includes convergence guard', () => {
|
||||
expect(content).toMatch(/[Cc]onvergence/);
|
||||
});
|
||||
|
||||
test('includes graceful failure handling', () => {
|
||||
expect(content).toMatch(/skip.*review|unavailable/i);
|
||||
});
|
||||
});
|
||||
|
||||
// --- {{DESIGN_SKETCH}} resolver tests ---
|
||||
|
||||
describe('DESIGN_SKETCH resolver', () => {
|
||||
const content = fs.readFileSync(path.join(ROOT, 'office-hours', 'SKILL.md'), 'utf-8');
|
||||
|
||||
test('references DESIGN.md for design system constraints', () => {
|
||||
expect(content).toContain('DESIGN.md');
|
||||
});
|
||||
|
||||
test('contains wireframe or sketch terminology', () => {
|
||||
expect(content).toMatch(/wireframe|sketch/i);
|
||||
});
|
||||
|
||||
test('references browse binary for rendering', () => {
|
||||
expect(content).toContain('$B goto');
|
||||
});
|
||||
|
||||
test('references screenshot capture', () => {
|
||||
expect(content).toContain('$B screenshot');
|
||||
});
|
||||
|
||||
test('specifies rough aesthetic', () => {
|
||||
expect(content).toMatch(/[Rr]ough|hand-drawn/);
|
||||
});
|
||||
|
||||
test('includes skip conditions', () => {
|
||||
expect(content).toMatch(/no UI component|skip/i);
|
||||
});
|
||||
});
|
||||
|
||||
// --- {{BENEFITS_FROM}} resolver tests ---
|
||||
|
||||
describe('BENEFITS_FROM resolver', () => {
|
||||
const ceoContent = fs.readFileSync(path.join(ROOT, 'plan-ceo-review', 'SKILL.md'), 'utf-8');
|
||||
const engContent = fs.readFileSync(path.join(ROOT, 'plan-eng-review', 'SKILL.md'), 'utf-8');
|
||||
|
||||
test('plan-ceo-review contains prerequisite skill offer', () => {
|
||||
expect(ceoContent).toContain('Prerequisite Skill Offer');
|
||||
expect(ceoContent).toContain('/office-hours');
|
||||
});
|
||||
|
||||
test('plan-eng-review contains prerequisite skill offer', () => {
|
||||
expect(engContent).toContain('Prerequisite Skill Offer');
|
||||
expect(engContent).toContain('/office-hours');
|
||||
});
|
||||
|
||||
test('offer includes graceful decline', () => {
|
||||
expect(ceoContent).toContain('No worries');
|
||||
});
|
||||
|
||||
test('skills without benefits-from do NOT have prerequisite offer', () => {
|
||||
const qaContent = fs.readFileSync(path.join(ROOT, 'qa', 'SKILL.md'), 'utf-8');
|
||||
expect(qaContent).not.toContain('Prerequisite Skill Offer');
|
||||
});
|
||||
});
|
||||
|
||||
// ─── Codex Generation Tests ─────────────────────────────────
|
||||
|
||||
describe('Codex generation (--host codex)', () => {
|
||||
@@ -492,6 +584,16 @@ describe('Codex generation (--host codex)', () => {
|
||||
expect(fs.existsSync(path.join(AGENTS_DIR, 'gstack-codex'))).toBe(false);
|
||||
});
|
||||
|
||||
test('Codex review step stripped from Codex-host ship and review', () => {
|
||||
const shipContent = fs.readFileSync(path.join(AGENTS_DIR, 'gstack-ship', 'SKILL.md'), 'utf-8');
|
||||
expect(shipContent).not.toContain('codex review --base');
|
||||
expect(shipContent).not.toContain('Investigate and fix');
|
||||
|
||||
const reviewContent = fs.readFileSync(path.join(AGENTS_DIR, 'gstack-review', 'SKILL.md'), 'utf-8');
|
||||
expect(reviewContent).not.toContain('codex review --base');
|
||||
expect(reviewContent).not.toContain('Investigate and fix');
|
||||
});
|
||||
|
||||
test('--host codex --dry-run freshness', () => {
|
||||
const result = Bun.spawnSync(['bun', 'run', 'scripts/gen-skill-docs.ts', '--host', 'codex', '--dry-run'], {
|
||||
cwd: ROOT,
|
||||
|
||||
@@ -0,0 +1,104 @@
|
||||
import { describe, test, expect } from 'bun:test';
|
||||
import { parseGeminiJSONL } from './gemini-session-runner';
|
||||
|
||||
// Fixture: actual Gemini CLI stream-json output with tool use
|
||||
const FIXTURE_LINES = [
|
||||
'{"type":"init","timestamp":"2026-03-20T15:14:46.455Z","session_id":"test-session-123","model":"auto-gemini-3"}',
|
||||
'{"type":"message","timestamp":"2026-03-20T15:14:46.456Z","role":"user","content":"list the files"}',
|
||||
'{"type":"message","timestamp":"2026-03-20T15:14:49.650Z","role":"assistant","content":"I will list the files.","delta":true}',
|
||||
'{"type":"tool_use","timestamp":"2026-03-20T15:14:49.690Z","tool_name":"run_shell_command","tool_id":"cmd_1","parameters":{"command":"ls"}}',
|
||||
'{"type":"tool_result","timestamp":"2026-03-20T15:14:49.931Z","tool_id":"cmd_1","status":"success","output":"file1.ts\\nfile2.ts"}',
|
||||
'{"type":"message","timestamp":"2026-03-20T15:14:51.945Z","role":"assistant","content":"Here are the files.","delta":true}',
|
||||
'{"type":"result","timestamp":"2026-03-20T15:14:52.030Z","status":"success","stats":{"total_tokens":27147,"input_tokens":26928,"output_tokens":87,"cached":0,"duration_ms":5575,"tool_calls":1}}',
|
||||
];
|
||||
|
||||
describe('parseGeminiJSONL', () => {
|
||||
test('extracts session ID from init event', () => {
|
||||
const parsed = parseGeminiJSONL(FIXTURE_LINES);
|
||||
expect(parsed.sessionId).toBe('test-session-123');
|
||||
});
|
||||
|
||||
test('concatenates assistant message deltas into output', () => {
|
||||
const parsed = parseGeminiJSONL(FIXTURE_LINES);
|
||||
expect(parsed.output).toBe('I will list the files.Here are the files.');
|
||||
});
|
||||
|
||||
test('ignores user messages', () => {
|
||||
const lines = [
|
||||
'{"type":"message","role":"user","content":"this should be ignored"}',
|
||||
'{"type":"message","role":"assistant","content":"this should be kept","delta":true}',
|
||||
];
|
||||
const parsed = parseGeminiJSONL(lines);
|
||||
expect(parsed.output).toBe('this should be kept');
|
||||
});
|
||||
|
||||
test('extracts tool names from tool_use events', () => {
|
||||
const parsed = parseGeminiJSONL(FIXTURE_LINES);
|
||||
expect(parsed.toolCalls).toHaveLength(1);
|
||||
expect(parsed.toolCalls[0]).toBe('run_shell_command');
|
||||
});
|
||||
|
||||
test('extracts total tokens from result stats', () => {
|
||||
const parsed = parseGeminiJSONL(FIXTURE_LINES);
|
||||
expect(parsed.tokens).toBe(27147);
|
||||
});
|
||||
|
||||
test('skips malformed lines without throwing', () => {
|
||||
const lines = [
|
||||
'{"type":"init","session_id":"ok"}',
|
||||
'this is not json',
|
||||
'{"type":"message","role":"assistant","content":"hello","delta":true}',
|
||||
'{incomplete json',
|
||||
'{"type":"result","status":"success","stats":{"total_tokens":100}}',
|
||||
];
|
||||
const parsed = parseGeminiJSONL(lines);
|
||||
expect(parsed.sessionId).toBe('ok');
|
||||
expect(parsed.output).toBe('hello');
|
||||
expect(parsed.tokens).toBe(100);
|
||||
});
|
||||
|
||||
test('skips empty and whitespace-only lines', () => {
|
||||
const lines = [
|
||||
'',
|
||||
' ',
|
||||
'{"type":"init","session_id":"s1"}',
|
||||
'\t',
|
||||
'{"type":"result","status":"success","stats":{"total_tokens":50}}',
|
||||
];
|
||||
const parsed = parseGeminiJSONL(lines);
|
||||
expect(parsed.sessionId).toBe('s1');
|
||||
expect(parsed.tokens).toBe(50);
|
||||
});
|
||||
|
||||
test('handles empty input', () => {
|
||||
const parsed = parseGeminiJSONL([]);
|
||||
expect(parsed.output).toBe('');
|
||||
expect(parsed.toolCalls).toHaveLength(0);
|
||||
expect(parsed.tokens).toBe(0);
|
||||
expect(parsed.sessionId).toBeNull();
|
||||
});
|
||||
|
||||
test('handles missing fields gracefully', () => {
|
||||
const lines = [
|
||||
'{"type":"init"}', // no session_id
|
||||
'{"type":"message","role":"assistant"}', // no content
|
||||
'{"type":"tool_use"}', // no tool_name
|
||||
'{"type":"result","status":"success"}', // no stats
|
||||
];
|
||||
const parsed = parseGeminiJSONL(lines);
|
||||
expect(parsed.sessionId).toBeNull();
|
||||
expect(parsed.output).toBe('');
|
||||
expect(parsed.toolCalls).toHaveLength(0);
|
||||
expect(parsed.tokens).toBe(0);
|
||||
});
|
||||
|
||||
test('handles multiple tool_use events', () => {
|
||||
const lines = [
|
||||
'{"type":"tool_use","tool_name":"run_shell_command","tool_id":"cmd_1","parameters":{"command":"ls"}}',
|
||||
'{"type":"tool_use","tool_name":"read_file","tool_id":"cmd_2","parameters":{"path":"foo.ts"}}',
|
||||
'{"type":"tool_use","tool_name":"run_shell_command","tool_id":"cmd_3","parameters":{"command":"cat bar.ts"}}',
|
||||
];
|
||||
const parsed = parseGeminiJSONL(lines);
|
||||
expect(parsed.toolCalls).toEqual(['run_shell_command', 'read_file', 'run_shell_command']);
|
||||
});
|
||||
});
|
||||
@@ -0,0 +1,201 @@
|
||||
/**
|
||||
* Gemini CLI subprocess runner for skill E2E testing.
|
||||
*
|
||||
* Spawns `gemini -p` as an independent process, parses its stream-json
|
||||
* output, and returns structured results. Follows the same pattern as
|
||||
* codex-session-runner.ts but adapted for the Gemini CLI.
|
||||
*
|
||||
* Key differences from Codex session-runner:
|
||||
* - Uses `gemini -p` instead of `codex exec`
|
||||
* - Output is NDJSON with event types: init, message, tool_use, tool_result, result
|
||||
* - Uses `--output-format stream-json --yolo` instead of `--json -s read-only`
|
||||
* - No temp HOME needed — Gemini discovers skills from `.agents/skills/` in cwd
|
||||
* - Message events are streamed with `delta: true` — must concatenate
|
||||
*/
|
||||
|
||||
import * as path from 'path';
|
||||
|
||||
// --- Interfaces ---
|
||||
|
||||
export interface GeminiResult {
|
||||
output: string; // Full assistant message text (concatenated deltas)
|
||||
toolCalls: string[]; // Tool names from tool_use events
|
||||
tokens: number; // Total tokens used
|
||||
exitCode: number; // Process exit code
|
||||
durationMs: number; // Wall clock time
|
||||
sessionId: string | null; // Session ID from init event
|
||||
rawLines: string[]; // Raw JSONL lines for debugging
|
||||
}
|
||||
|
||||
// --- JSONL parser ---
|
||||
|
||||
export interface ParsedGeminiJSONL {
|
||||
output: string;
|
||||
toolCalls: string[];
|
||||
tokens: number;
|
||||
sessionId: string | null;
|
||||
}
|
||||
|
||||
/**
|
||||
* Parse an array of JSONL lines from `gemini -p --output-format stream-json`.
|
||||
* Pure function — no I/O, no side effects.
|
||||
*
|
||||
* Handles these Gemini event types:
|
||||
* - init → extract session_id
|
||||
* - message (role=assistant, delta=true) → concatenate content into output
|
||||
* - tool_use → extract tool_name
|
||||
* - tool_result → logged but not extracted
|
||||
* - result → extract token usage from stats
|
||||
*/
|
||||
export function parseGeminiJSONL(lines: string[]): ParsedGeminiJSONL {
|
||||
const outputParts: string[] = [];
|
||||
const toolCalls: string[] = [];
|
||||
let tokens = 0;
|
||||
let sessionId: string | null = null;
|
||||
|
||||
for (const line of lines) {
|
||||
if (!line.trim()) continue;
|
||||
try {
|
||||
const obj = JSON.parse(line);
|
||||
const t = obj.type || '';
|
||||
|
||||
if (t === 'init') {
|
||||
const sid = obj.session_id || '';
|
||||
if (sid) sessionId = sid;
|
||||
} else if (t === 'message') {
|
||||
if (obj.role === 'assistant' && obj.content) {
|
||||
outputParts.push(obj.content);
|
||||
}
|
||||
} else if (t === 'tool_use') {
|
||||
const name = obj.tool_name || '';
|
||||
if (name) toolCalls.push(name);
|
||||
} else if (t === 'result') {
|
||||
const stats = obj.stats || {};
|
||||
tokens = (stats.total_tokens || 0);
|
||||
}
|
||||
} catch { /* skip malformed lines */ }
|
||||
}
|
||||
|
||||
return {
|
||||
output: outputParts.join(''),
|
||||
toolCalls,
|
||||
tokens,
|
||||
sessionId,
|
||||
};
|
||||
}
|
||||
|
||||
// --- Main runner ---
|
||||
|
||||
/**
|
||||
* Run a prompt via `gemini -p` and return structured results.
|
||||
*
|
||||
* Spawns gemini with stream-json output, parses JSONL events,
|
||||
* and returns a GeminiResult. Skips gracefully if gemini binary is not found.
|
||||
*/
|
||||
export async function runGeminiSkill(opts: {
|
||||
prompt: string; // What to ask Gemini
|
||||
timeoutMs?: number; // Default 300000 (5 min)
|
||||
cwd?: string; // Working directory (where .agents/skills/ lives)
|
||||
}): Promise<GeminiResult> {
|
||||
const {
|
||||
prompt,
|
||||
timeoutMs = 300_000,
|
||||
cwd,
|
||||
} = opts;
|
||||
|
||||
const startTime = Date.now();
|
||||
|
||||
// Check if gemini binary exists
|
||||
const whichResult = Bun.spawnSync(['which', 'gemini']);
|
||||
if (whichResult.exitCode !== 0) {
|
||||
return {
|
||||
output: 'SKIP: gemini binary not found',
|
||||
toolCalls: [],
|
||||
tokens: 0,
|
||||
exitCode: -1,
|
||||
durationMs: Date.now() - startTime,
|
||||
sessionId: null,
|
||||
rawLines: [],
|
||||
};
|
||||
}
|
||||
|
||||
// Build gemini command
|
||||
const args = ['-p', prompt, '--output-format', 'stream-json', '--yolo'];
|
||||
|
||||
// Spawn gemini — uses real HOME for auth, cwd for skill discovery
|
||||
const proc = Bun.spawn(['gemini', ...args], {
|
||||
cwd: cwd || process.cwd(),
|
||||
stdout: 'pipe',
|
||||
stderr: 'pipe',
|
||||
});
|
||||
|
||||
// Race against timeout
|
||||
let timedOut = false;
|
||||
const timeoutId = setTimeout(() => {
|
||||
timedOut = true;
|
||||
proc.kill();
|
||||
}, timeoutMs);
|
||||
|
||||
// Stream and collect JSONL from stdout
|
||||
const collectedLines: string[] = [];
|
||||
const stderrPromise = new Response(proc.stderr).text();
|
||||
|
||||
const reader = proc.stdout.getReader();
|
||||
const decoder = new TextDecoder();
|
||||
let buf = '';
|
||||
|
||||
try {
|
||||
while (true) {
|
||||
const { done, value } = await reader.read();
|
||||
if (done) break;
|
||||
buf += decoder.decode(value, { stream: true });
|
||||
const lines = buf.split('\n');
|
||||
buf = lines.pop() || '';
|
||||
for (const line of lines) {
|
||||
if (!line.trim()) continue;
|
||||
collectedLines.push(line);
|
||||
|
||||
// Real-time progress to stderr
|
||||
try {
|
||||
const event = JSON.parse(line);
|
||||
if (event.type === 'tool_use' && event.tool_name) {
|
||||
const elapsed = Math.round((Date.now() - startTime) / 1000);
|
||||
process.stderr.write(` [gemini ${elapsed}s] tool: ${event.tool_name}\n`);
|
||||
} else if (event.type === 'message' && event.role === 'assistant' && event.content) {
|
||||
const elapsed = Math.round((Date.now() - startTime) / 1000);
|
||||
process.stderr.write(` [gemini ${elapsed}s] message: ${event.content.slice(0, 100)}\n`);
|
||||
}
|
||||
} catch { /* skip — parseGeminiJSONL will handle it later */ }
|
||||
}
|
||||
}
|
||||
} catch { /* stream read error — fall through to exit code handling */ }
|
||||
|
||||
// Flush remaining buffer
|
||||
if (buf.trim()) {
|
||||
collectedLines.push(buf);
|
||||
}
|
||||
|
||||
const stderr = await stderrPromise;
|
||||
const exitCode = await proc.exited;
|
||||
clearTimeout(timeoutId);
|
||||
|
||||
const durationMs = Date.now() - startTime;
|
||||
|
||||
// Parse all collected JSONL lines
|
||||
const parsed = parseGeminiJSONL(collectedLines);
|
||||
|
||||
// Log stderr if non-empty (may contain auth errors, etc.)
|
||||
if (stderr.trim()) {
|
||||
process.stderr.write(` [gemini stderr] ${stderr.trim().slice(0, 200)}\n`);
|
||||
}
|
||||
|
||||
return {
|
||||
output: parsed.output,
|
||||
toolCalls: parsed.toolCalls,
|
||||
tokens: parsed.tokens,
|
||||
exitCode: timedOut ? 124 : exitCode,
|
||||
durationMs,
|
||||
sessionId: parsed.sessionId,
|
||||
rawLines: collectedLines,
|
||||
};
|
||||
}
|
||||
@@ -57,9 +57,13 @@ export const E2E_TOUCHFILES: Record<string, string[]> = {
|
||||
'review-base-branch': ['review/**'],
|
||||
'review-design-lite': ['review/**', 'test/fixtures/review-eval-design-slop.*'],
|
||||
|
||||
// Office Hours
|
||||
'office-hours-spec-review': ['office-hours/**', 'scripts/gen-skill-docs.ts'],
|
||||
|
||||
// Plan reviews
|
||||
'plan-ceo-review': ['plan-ceo-review/**'],
|
||||
'plan-ceo-review-selective': ['plan-ceo-review/**'],
|
||||
'plan-ceo-review-benefits': ['plan-ceo-review/**', 'scripts/gen-skill-docs.ts'],
|
||||
'plan-eng-review': ['plan-eng-review/**'],
|
||||
'plan-eng-review-artifact': ['plan-eng-review/**'],
|
||||
|
||||
@@ -80,6 +84,10 @@ export const E2E_TOUCHFILES: Record<string, string[]> = {
|
||||
'codex-discover-skill': ['codex/**', '.agents/skills/**', 'test/helpers/codex-session-runner.ts'],
|
||||
'codex-review-findings': ['review/**', '.agents/skills/gstack-review/**', 'codex/**', 'test/helpers/codex-session-runner.ts'],
|
||||
|
||||
// Gemini E2E (tests skills via Gemini CLI)
|
||||
'gemini-discover-skill': ['.agents/skills/**', 'test/helpers/gemini-session-runner.ts'],
|
||||
'gemini-review-findings': ['review/**', '.agents/skills/gstack-review/**', 'test/helpers/gemini-session-runner.ts'],
|
||||
|
||||
// QA bootstrap
|
||||
'qa-bootstrap': ['qa/**', 'browse/src/**', 'ship/**'],
|
||||
|
||||
@@ -140,6 +148,10 @@ export const LLM_JUDGE_TOUCHFILES: Record<string, string[]> = {
|
||||
'design-review/SKILL.md fix loop': ['design-review/SKILL.md', 'design-review/SKILL.md.tmpl'],
|
||||
'design-consultation/SKILL.md research': ['design-consultation/SKILL.md', 'design-consultation/SKILL.md.tmpl'],
|
||||
|
||||
// Office Hours
|
||||
'office-hours/SKILL.md spec review': ['office-hours/SKILL.md', 'office-hours/SKILL.md.tmpl', 'scripts/gen-skill-docs.ts'],
|
||||
'office-hours/SKILL.md design sketch': ['office-hours/SKILL.md', 'office-hours/SKILL.md.tmpl', 'scripts/gen-skill-docs.ts'],
|
||||
|
||||
// Other skills
|
||||
'retro/SKILL.md instructions': ['retro/SKILL.md', 'retro/SKILL.md.tmpl'],
|
||||
'qa-only/SKILL.md workflow': ['qa-only/SKILL.md', 'qa-only/SKILL.md.tmpl'],
|
||||
@@ -152,6 +164,7 @@ export const LLM_JUDGE_TOUCHFILES: Record<string, string[]> = {
|
||||
export const GLOBAL_TOUCHFILES = [
|
||||
'test/helpers/session-runner.ts',
|
||||
'test/helpers/codex-session-runner.ts',
|
||||
'test/helpers/gemini-session-runner.ts',
|
||||
'test/helpers/eval-store.ts',
|
||||
'test/helpers/llm-judge.ts',
|
||||
'scripts/gen-skill-docs.ts',
|
||||
|
||||
@@ -2911,6 +2911,128 @@ Write the full output (including the GATE verdict) to ${codexDir}/codex-output.m
|
||||
}, 360_000);
|
||||
});
|
||||
|
||||
// --- Office Hours Spec Review E2E ---
|
||||
|
||||
describeIfSelected('Office Hours Spec Review E2E', ['office-hours-spec-review'], () => {
|
||||
let ohDir: string;
|
||||
|
||||
beforeAll(() => {
|
||||
ohDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-oh-spec-'));
|
||||
const run = (cmd: string, args: string[]) =>
|
||||
spawnSync(cmd, args, { cwd: ohDir, stdio: 'pipe', timeout: 5000 });
|
||||
|
||||
run('git', ['init', '-b', 'main']);
|
||||
run('git', ['config', 'user.email', 'test@test.com']);
|
||||
run('git', ['config', 'user.name', 'Test']);
|
||||
fs.writeFileSync(path.join(ohDir, 'README.md'), '# Test Project\n');
|
||||
run('git', ['add', '.']);
|
||||
run('git', ['commit', '-m', 'init']);
|
||||
|
||||
// Copy office-hours skill
|
||||
fs.mkdirSync(path.join(ohDir, 'office-hours'), { recursive: true });
|
||||
fs.copyFileSync(
|
||||
path.join(ROOT, 'office-hours', 'SKILL.md'),
|
||||
path.join(ohDir, 'office-hours', 'SKILL.md'),
|
||||
);
|
||||
});
|
||||
|
||||
afterAll(() => {
|
||||
try { fs.rmSync(ohDir, { recursive: true, force: true }); } catch {}
|
||||
});
|
||||
|
||||
test('/office-hours SKILL.md contains spec review loop', async () => {
|
||||
const result = await runSkillTest({
|
||||
prompt: `Read office-hours/SKILL.md. I want to understand the spec review loop.
|
||||
|
||||
Summarize what the "Spec Review Loop" section does — specifically:
|
||||
1. How many dimensions does the reviewer check?
|
||||
2. What tool is used to dispatch the reviewer?
|
||||
3. What's the maximum number of iterations?
|
||||
4. What metrics are tracked?
|
||||
|
||||
Write your summary to ${ohDir}/spec-review-summary.md`,
|
||||
workingDirectory: ohDir,
|
||||
maxTurns: 8,
|
||||
timeout: 120_000,
|
||||
testName: 'office-hours-spec-review',
|
||||
runId,
|
||||
});
|
||||
|
||||
logCost('/office-hours spec review', result);
|
||||
recordE2E('/office-hours-spec-review', 'Office Hours Spec Review E2E', result);
|
||||
expect(result.exitReason).toBe('success');
|
||||
|
||||
const summaryPath = path.join(ohDir, 'spec-review-summary.md');
|
||||
if (fs.existsSync(summaryPath)) {
|
||||
const summary = fs.readFileSync(summaryPath, 'utf-8').toLowerCase();
|
||||
// Verify the agent understood the key concepts
|
||||
expect(summary).toMatch(/5.*dimension|dimension.*5|completeness|consistency|clarity|scope|feasibility/);
|
||||
expect(summary).toMatch(/agent|subagent/);
|
||||
expect(summary).toMatch(/3.*iteration|iteration.*3|maximum.*3/);
|
||||
}
|
||||
}, 180_000);
|
||||
});
|
||||
|
||||
// --- Plan CEO Review Benefits-From E2E ---
|
||||
|
||||
describeIfSelected('Plan CEO Review Benefits-From E2E', ['plan-ceo-review-benefits'], () => {
|
||||
let benefitsDir: string;
|
||||
|
||||
beforeAll(() => {
|
||||
benefitsDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-benefits-'));
|
||||
const run = (cmd: string, args: string[]) =>
|
||||
spawnSync(cmd, args, { cwd: benefitsDir, stdio: 'pipe', timeout: 5000 });
|
||||
|
||||
run('git', ['init', '-b', 'main']);
|
||||
run('git', ['config', 'user.email', 'test@test.com']);
|
||||
run('git', ['config', 'user.name', 'Test']);
|
||||
fs.writeFileSync(path.join(benefitsDir, 'README.md'), '# Test Project\n');
|
||||
run('git', ['add', '.']);
|
||||
run('git', ['commit', '-m', 'init']);
|
||||
|
||||
// Copy plan-ceo-review skill
|
||||
fs.mkdirSync(path.join(benefitsDir, 'plan-ceo-review'), { recursive: true });
|
||||
fs.copyFileSync(
|
||||
path.join(ROOT, 'plan-ceo-review', 'SKILL.md'),
|
||||
path.join(benefitsDir, 'plan-ceo-review', 'SKILL.md'),
|
||||
);
|
||||
});
|
||||
|
||||
afterAll(() => {
|
||||
try { fs.rmSync(benefitsDir, { recursive: true, force: true }); } catch {}
|
||||
});
|
||||
|
||||
test('/plan-ceo-review SKILL.md contains prerequisite skill offer', async () => {
|
||||
const result = await runSkillTest({
|
||||
prompt: `Read plan-ceo-review/SKILL.md. Search for sections about "Prerequisite" or "office-hours" or "design doc found".
|
||||
|
||||
Summarize what happens when no design doc is found — specifically:
|
||||
1. Is /office-hours offered as a prerequisite?
|
||||
2. What options does the user get?
|
||||
3. Is there a mid-session detection for when the user seems lost?
|
||||
|
||||
Write your summary to ${benefitsDir}/benefits-summary.md`,
|
||||
workingDirectory: benefitsDir,
|
||||
maxTurns: 8,
|
||||
timeout: 120_000,
|
||||
testName: 'plan-ceo-review-benefits',
|
||||
runId,
|
||||
});
|
||||
|
||||
logCost('/plan-ceo-review benefits-from', result);
|
||||
recordE2E('/plan-ceo-review-benefits', 'Plan CEO Review Benefits-From E2E', result);
|
||||
expect(result.exitReason).toBe('success');
|
||||
|
||||
const summaryPath = path.join(benefitsDir, 'benefits-summary.md');
|
||||
if (fs.existsSync(summaryPath)) {
|
||||
const summary = fs.readFileSync(summaryPath, 'utf-8').toLowerCase();
|
||||
// Verify the agent understood the skill chaining
|
||||
expect(summary).toMatch(/office.hours/);
|
||||
expect(summary).toMatch(/design doc|no design/i);
|
||||
}
|
||||
}, 180_000);
|
||||
});
|
||||
|
||||
// Module-level afterAll — finalize eval collector after all tests complete
|
||||
afterAll(async () => {
|
||||
if (evalCollector) {
|
||||
|
||||
@@ -644,6 +644,59 @@ describe('office-hours skill structure', () => {
|
||||
test('contains builder operating principles', () => {
|
||||
expect(content).toContain('Delight is the currency');
|
||||
});
|
||||
|
||||
// Spec Review Loop (Phase 5.5)
|
||||
test('contains spec review loop', () => {
|
||||
expect(content).toContain('Spec Review Loop');
|
||||
});
|
||||
|
||||
test('contains adversarial review dimensions', () => {
|
||||
for (const dim of ['Completeness', 'Consistency', 'Clarity', 'Scope', 'Feasibility']) {
|
||||
expect(content).toContain(dim);
|
||||
}
|
||||
});
|
||||
|
||||
test('contains subagent dispatch instruction', () => {
|
||||
expect(content).toMatch(/Agent.*tool|subagent/i);
|
||||
});
|
||||
|
||||
test('contains max 3 iterations', () => {
|
||||
expect(content).toMatch(/3.*iteration|maximum.*3/i);
|
||||
});
|
||||
|
||||
test('contains quality score', () => {
|
||||
expect(content).toContain('quality score');
|
||||
});
|
||||
|
||||
test('contains spec review metrics path', () => {
|
||||
expect(content).toContain('spec-review.jsonl');
|
||||
});
|
||||
|
||||
test('contains convergence guard', () => {
|
||||
expect(content).toMatch(/convergence/i);
|
||||
});
|
||||
|
||||
// Visual Sketch (Phase 4.5)
|
||||
test('contains visual sketch section', () => {
|
||||
expect(content).toContain('Visual Sketch');
|
||||
});
|
||||
|
||||
test('contains wireframe generation', () => {
|
||||
expect(content).toMatch(/wireframe|sketch/i);
|
||||
});
|
||||
|
||||
test('contains DESIGN.md awareness', () => {
|
||||
expect(content).toContain('DESIGN.md');
|
||||
});
|
||||
|
||||
test('contains browse rendering', () => {
|
||||
expect(content).toContain('$B goto');
|
||||
expect(content).toContain('$B screenshot');
|
||||
});
|
||||
|
||||
test('contains rough aesthetic instruction', () => {
|
||||
expect(content).toMatch(/rough|hand-drawn/i);
|
||||
});
|
||||
});
|
||||
|
||||
describe('investigate skill structure', () => {
|
||||
@@ -856,6 +909,22 @@ describe('CEO review mode validation', () => {
|
||||
expect(content).toContain('HOLD SCOPE');
|
||||
expect(content).toContain('REDUCTION');
|
||||
});
|
||||
|
||||
// Skill chaining (benefits-from)
|
||||
test('contains prerequisite skill offer for office-hours', () => {
|
||||
expect(content).toContain('Prerequisite Skill Offer');
|
||||
expect(content).toContain('/office-hours');
|
||||
});
|
||||
|
||||
test('contains mid-session detection', () => {
|
||||
expect(content).toContain('Mid-session detection');
|
||||
expect(content).toMatch(/still figuring out|seems lost/i);
|
||||
});
|
||||
|
||||
// Spec review on CEO plans
|
||||
test('contains spec review loop for CEO plan documents', () => {
|
||||
expect(content).toContain('Spec Review Loop');
|
||||
});
|
||||
});
|
||||
|
||||
// --- gstack-slug helper ---
|
||||
@@ -1187,18 +1256,36 @@ describe('Codex skill', () => {
|
||||
expect(content).toContain('mktemp');
|
||||
});
|
||||
|
||||
test('codex integration in /review offers second opinion', () => {
|
||||
test('codex integration in /review has config-driven review step', () => {
|
||||
const content = fs.readFileSync(path.join(ROOT, 'review', 'SKILL.md'), 'utf-8');
|
||||
expect(content).toContain('Codex second opinion');
|
||||
expect(content).toContain('Codex review');
|
||||
expect(content).toContain('codex_reviews');
|
||||
expect(content).toContain('codex review');
|
||||
expect(content).toContain('adversarial');
|
||||
expect(content).toContain('xhigh');
|
||||
expect(content).toContain('Investigate and fix');
|
||||
expect(content).toContain('CROSS-MODEL');
|
||||
});
|
||||
|
||||
test('codex integration in /ship offers review gate', () => {
|
||||
test('codex integration in /ship has config-driven review step', () => {
|
||||
const content = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md'), 'utf-8');
|
||||
expect(content).toContain('Codex');
|
||||
expect(content).toContain('Codex review');
|
||||
expect(content).toContain('codex_reviews');
|
||||
expect(content).toContain('codex review');
|
||||
expect(content).toContain('codex-review');
|
||||
expect(content).toContain('xhigh');
|
||||
expect(content).toContain('Investigate and fix');
|
||||
});
|
||||
|
||||
test('codex-host ship/review do NOT contain codex review step', () => {
|
||||
const shipContent = fs.readFileSync(path.join(ROOT, '.agents', 'skills', 'gstack-ship', 'SKILL.md'), 'utf-8');
|
||||
expect(shipContent).not.toContain('codex review --base');
|
||||
expect(shipContent).not.toContain('Investigate and fix');
|
||||
|
||||
const reviewContent = fs.readFileSync(path.join(ROOT, '.agents', 'skills', 'gstack-review', 'SKILL.md'), 'utf-8');
|
||||
expect(reviewContent).not.toContain('codex review --base');
|
||||
expect(reviewContent).not.toContain('codex_reviews');
|
||||
expect(reviewContent).not.toContain('Investigate and fix');
|
||||
});
|
||||
|
||||
test('codex integration in /plan-eng-review offers plan critique', () => {
|
||||
|
||||
@@ -78,8 +78,9 @@ describe('selectTests', () => {
|
||||
const result = selectTests(['plan-ceo-review/SKILL.md'], E2E_TOUCHFILES);
|
||||
expect(result.selected).toContain('plan-ceo-review');
|
||||
expect(result.selected).toContain('plan-ceo-review-selective');
|
||||
expect(result.selected.length).toBe(2);
|
||||
expect(result.skipped.length).toBe(Object.keys(E2E_TOUCHFILES).length - 2);
|
||||
expect(result.selected).toContain('plan-ceo-review-benefits');
|
||||
expect(result.selected.length).toBe(3);
|
||||
expect(result.skipped.length).toBe(Object.keys(E2E_TOUCHFILES).length - 3);
|
||||
});
|
||||
|
||||
test('global touchfile triggers ALL tests', () => {
|
||||
|
||||
Reference in New Issue
Block a user