merge: resolve conflicts with origin/main (v0.9.1.0 → v0.9.1)

Integrated office-hours spec review, visual sketch, skill chaining
(benefits-from), and plan-ceo-review benefits E2E from main with our
deploy skills. Updated touchfiles test for new plan-ceo-review-benefits
entry.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
Garry Tan
2026-03-20 07:28:44 -07:00
16 changed files with 1019 additions and 31 deletions
+146 -1
View File
@@ -218,6 +218,25 @@ success/error/abort, and `USED_BROWSE` with true/false based on whether `$B` was
If you cannot determine the outcome, use "unknown". This runs in the background and
never blocks the user.
## SETUP (run this check BEFORE any browse command)
```bash
_ROOT=$(git rev-parse --show-toplevel 2>/dev/null)
B=""
[ -n "$_ROOT" ] && [ -x "$_ROOT/.agents/skills/gstack/browse/dist/browse" ] && B="$_ROOT/.agents/skills/gstack/browse/dist/browse"
[ -z "$B" ] && B=~/.codex/skills/gstack/browse/dist/browse
if [ -x "$B" ]; then
echo "READY: $B"
else
echo "NEEDS_SETUP"
fi
```
If `NEEDS_SETUP`:
1. Tell the user: "gstack browse needs a one-time build (~10 seconds). OK to proceed?" Then STOP and wait.
2. Run: `cd <SKILL_DIR> && ./setup`
3. If `bun` is not installed: `curl -fsSL https://bun.sh/install | bash`
# YC Office Hours
You are a **YC office hours partner**. Your job is to ensure the problem is understood before solutions are proposed. You adapt to what the user is building — startup founders get the hard questions, builders get an enthusiastic collaborator. This skill produces design docs, not code.
@@ -482,6 +501,66 @@ Present via AskUserQuestion. Do NOT proceed without user approval of the approac
---
## Visual Sketch (UI ideas only)
If the chosen approach involves user-facing UI (screens, pages, forms, dashboards,
or interactive elements), generate a rough wireframe to help the user visualize it.
If the idea is backend-only, infrastructure, or has no UI component — skip this
section silently.
**Step 1: Gather design context**
1. Check if `DESIGN.md` exists in the repo root. If it does, read it for design
system constraints (colors, typography, spacing, component patterns). Use these
constraints in the wireframe.
2. Apply core design principles:
- **Information hierarchy** — what does the user see first, second, third?
- **Interaction states** — loading, empty, error, success, partial
- **Edge case paranoia** — what if the name is 47 chars? Zero results? Network fails?
- **Subtraction default** — "as little design as possible" (Rams). Every element earns its pixels.
- **Design for trust** — every interface element builds or erodes user trust.
**Step 2: Generate wireframe HTML**
Generate a single-page HTML file with these constraints:
- **Intentionally rough aesthetic** — use system fonts, thin gray borders, no color,
hand-drawn-style elements. This is a sketch, not a polished mockup.
- Self-contained — no external dependencies, no CDN links, inline CSS only
- Show the core interaction flow (1-3 screens/states max)
- Include realistic placeholder content (not "Lorem ipsum" — use content that
matches the actual use case)
- Add HTML comments explaining design decisions
Write to a temp file:
```bash
SKETCH_FILE="/tmp/gstack-sketch-$(date +%s).html"
```
**Step 3: Render and capture**
```bash
$B goto "file://$SKETCH_FILE"
$B screenshot /tmp/gstack-sketch.png
```
If `$B` is not available (browse binary not set up), skip the render step. Tell the
user: "Visual sketch requires the browse binary. Run the setup script to enable it."
**Step 4: Present and iterate**
Show the screenshot to the user. Ask: "Does this feel right? Want to iterate on the layout?"
If they want changes, regenerate the HTML with their feedback and re-render.
If they approve or say "good enough," proceed.
**Step 5: Include in design doc**
Reference the wireframe screenshot in the design doc's "Recommended Approach" section.
The screenshot file at `/tmp/gstack-sketch.png` can be referenced by downstream skills
(`/plan-design-review`, `/design-review`) to see what was originally envisioned.
---
## Phase 4.5: Founder Signal Synthesis
Before writing the design doc, synthesize the founder signals you observed during the session. These will appear in the design doc ("What I noticed") and in the closing conversation (Phase 6).
@@ -618,7 +697,73 @@ Supersedes: {prior filename — omit this line if first design on this branch}
{observational, mentor-like reflections referencing specific things the user said during the session. Quote their words back to them — don't characterize their behavior. 2-4 bullets.}
```
Present the design doc to the user via AskUserQuestion:
---
## Spec Review Loop
Before presenting the document to the user for approval, run an adversarial review.
**Step 1: Dispatch reviewer subagent**
Use the Agent tool to dispatch an independent reviewer. The reviewer has fresh context
and cannot see the brainstorming conversation — only the document. This ensures genuine
adversarial independence.
Prompt the subagent with:
- The file path of the document just written
- "Read this document and review it on 5 dimensions. For each dimension, note PASS or
list specific issues with suggested fixes. At the end, output a quality score (1-10)
across all dimensions."
**Dimensions:**
1. **Completeness** — Are all requirements addressed? Missing edge cases?
2. **Consistency** — Do parts of the document agree with each other? Contradictions?
3. **Clarity** — Could an engineer implement this without asking questions? Ambiguous language?
4. **Scope** — Does the document creep beyond the original problem? YAGNI violations?
5. **Feasibility** — Can this actually be built with the stated approach? Hidden complexity?
The subagent should return:
- A quality score (1-10)
- PASS if no issues, or a numbered list of issues with dimension, description, and fix
**Step 2: Fix and re-dispatch**
If the reviewer returns issues:
1. Fix each issue in the document on disk (use Edit tool)
2. Re-dispatch the reviewer subagent with the updated document
3. Maximum 3 iterations total
**Convergence guard:** If the reviewer returns the same issues on consecutive iterations
(the fix didn't resolve them or the reviewer disagrees with the fix), stop the loop
and persist those issues as "Reviewer Concerns" in the document rather than looping
further.
If the subagent fails, times out, or is unavailable — skip the review loop entirely.
Tell the user: "Spec review unavailable — presenting unreviewed doc." The document is
already written to disk; the review is a quality bonus, not a gate.
**Step 3: Report and persist metrics**
After the loop completes (PASS, max iterations, or convergence guard):
1. Tell the user the result — summary by default:
"Your doc survived N rounds of adversarial review. M issues caught and fixed.
Quality score: X/10."
If they ask "what did the reviewer find?", show the full reviewer output.
2. If issues remain after max iterations or convergence, add a "## Reviewer Concerns"
section to the document listing each unresolved issue. Downstream skills will see this.
3. Append metrics:
```bash
mkdir -p ~/.gstack/analytics
echo '{"skill":"office-hours","ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'","iterations":ITERATIONS,"issues_found":FOUND,"issues_fixed":FIXED,"remaining":REMAINING,"quality_score":SCORE}' >> ~/.gstack/analytics/spec-review.jsonl 2>/dev/null || true
```
Replace ITERATIONS, FOUND, FIXED, REMAINING, SCORE with actual values from the review.
---
Present the reviewed design doc to the user via AskUserQuestion:
- A) Approve — mark Status: APPROVED and proceed to handoff
- B) Revise — specify which sections need changes (loop back to revise those sections)
- C) Start over — return to Phase 2
@@ -324,6 +324,37 @@ DESIGN=$(ls -t ~/.gstack/projects/$SLUG/*-$BRANCH-design-*.md 2>/dev/null | head
```
If a design doc exists (from `/office-hours`), read it. Use it as the source of truth for the problem statement, constraints, and chosen approach. If it has a `Supersedes:` field, note that this is a revised design.
## Prerequisite Skill Offer
When the design doc check above prints "No design doc found," offer the prerequisite
skill before proceeding.
Say to the user via AskUserQuestion:
> "No design doc found for this branch. `/office-hours` produces a structured problem
> statement, premise challenge, and explored alternatives — it gives this review much
> sharper input to work with. Takes about 10 minutes. The design doc is per-feature,
> not per-product — it captures the thinking behind this specific change."
Options:
- A) Run /office-hours first (in another window, then come back)
- B) Skip — proceed with standard review
If they skip: "No worries — standard review. If you ever want sharper input, try
/office-hours first next time." Then proceed normally. Do not re-offer later in the session.
**Mid-session detection:** During Step 0A (Premise Challenge), if the user can't
articulate the problem, keeps changing the problem statement, answers with "I'm not
sure," or is clearly exploring rather than reviewing — offer `/office-hours`:
> "It sounds like you're still figuring out what to build — that's totally fine, but
> that's what /office-hours is designed for. Want to pause this review and run
> /office-hours first? It'll help you nail down the problem and approach, then come
> back here for the strategic review."
Options: A) Yes, run /office-hours first. B) No, keep going.
If they keep going, proceed normally — no guilt, no re-asking.
When reading TODOS.md, specifically:
* Note any TODOs this plan touches, blocks, or unlocks
* Check if deferred work from prior reviews relates to this plan
@@ -467,6 +498,70 @@ Repo: {owner/repo}
Derive the feature slug from the plan being reviewed (e.g., "user-dashboard", "auth-refactor"). Use the date in YYYY-MM-DD format.
After writing the CEO plan, run the spec review loop on it:
## Spec Review Loop
Before presenting the document to the user for approval, run an adversarial review.
**Step 1: Dispatch reviewer subagent**
Use the Agent tool to dispatch an independent reviewer. The reviewer has fresh context
and cannot see the brainstorming conversation — only the document. This ensures genuine
adversarial independence.
Prompt the subagent with:
- The file path of the document just written
- "Read this document and review it on 5 dimensions. For each dimension, note PASS or
list specific issues with suggested fixes. At the end, output a quality score (1-10)
across all dimensions."
**Dimensions:**
1. **Completeness** — Are all requirements addressed? Missing edge cases?
2. **Consistency** — Do parts of the document agree with each other? Contradictions?
3. **Clarity** — Could an engineer implement this without asking questions? Ambiguous language?
4. **Scope** — Does the document creep beyond the original problem? YAGNI violations?
5. **Feasibility** — Can this actually be built with the stated approach? Hidden complexity?
The subagent should return:
- A quality score (1-10)
- PASS if no issues, or a numbered list of issues with dimension, description, and fix
**Step 2: Fix and re-dispatch**
If the reviewer returns issues:
1. Fix each issue in the document on disk (use Edit tool)
2. Re-dispatch the reviewer subagent with the updated document
3. Maximum 3 iterations total
**Convergence guard:** If the reviewer returns the same issues on consecutive iterations
(the fix didn't resolve them or the reviewer disagrees with the fix), stop the loop
and persist those issues as "Reviewer Concerns" in the document rather than looping
further.
If the subagent fails, times out, or is unavailable — skip the review loop entirely.
Tell the user: "Spec review unavailable — presenting unreviewed doc." The document is
already written to disk; the review is a quality bonus, not a gate.
**Step 3: Report and persist metrics**
After the loop completes (PASS, max iterations, or convergence guard):
1. Tell the user the result — summary by default:
"Your doc survived N rounds of adversarial review. M issues caught and fixed.
Quality score: X/10."
If they ask "what did the reviewer find?", show the full reviewer output.
2. If issues remain after max iterations or convergence, add a "## Reviewer Concerns"
section to the document listing each unresolved issue. Downstream skills will see this.
3. Append metrics:
```bash
mkdir -p ~/.gstack/analytics
echo '{"skill":"plan-ceo-review","ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'","iterations":ITERATIONS,"issues_found":FOUND,"issues_fixed":FIXED,"remaining":REMAINING,"quality_score":SCORE}' >> ~/.gstack/analytics/spec-review.jsonl 2>/dev/null || true
```
Replace ITERATIONS, FOUND, FIXED, REMAINING, SCORE with actual values from the review.
### 0E. Temporal Interrogation (EXPANSION, SELECTIVE EXPANSION, and HOLD modes)
Think ahead to implementation: What decisions will need to be made during implementation that should be resolved NOW in the plan?
```
@@ -269,6 +269,25 @@ DESIGN=$(ls -t ~/.gstack/projects/$SLUG/*-$BRANCH-design-*.md 2>/dev/null | head
```
If a design doc exists, read it. Use it as the source of truth for the problem statement, constraints, and chosen approach. If it has a `Supersedes:` field, note that this is a revised design — check the prior version for context on what changed and why.
## Prerequisite Skill Offer
When the design doc check above prints "No design doc found," offer the prerequisite
skill before proceeding.
Say to the user via AskUserQuestion:
> "No design doc found for this branch. `/office-hours` produces a structured problem
> statement, premise challenge, and explored alternatives — it gives this review much
> sharper input to work with. Takes about 10 minutes. The design doc is per-feature,
> not per-product — it captures the thinking behind this specific change."
Options:
- A) Run /office-hours first (in another window, then come back)
- B) Skip — proceed with standard review
If they skip: "No worries — standard review. If you ever want sharper input, try
/office-hours first next time." Then proceed normally. Do not re-offer later in the session.
### Step 0: Scope Challenge
Before reviewing anything, answer these questions:
1. **What existing code already partially or fully solves each sub-problem?** Can we capture outputs from existing flows rather than building parallel ones?
+9 -2
View File
@@ -16,6 +16,15 @@
- Incorporated canary monitoring and benchmark patterns from community PR #151 (HMAKT99).
- 3 new skills registered across gen-skill-docs.ts, skill-check.ts, skill-validation, and gen-skill-docs tests.
## [0.9.1.0] - 2026-03-20 — Adversarial Spec Review + Skill Chaining
### Added
- **Your design docs now get stress-tested before you see them.** When you run `/office-hours`, an independent AI reviewer checks your design doc for completeness, consistency, clarity, scope creep, and feasibility — up to 3 rounds. You get a quality score (1-10) and a summary of what was caught and fixed. The doc you approve has already survived adversarial review.
- **Visual wireframes during brainstorming.** For UI ideas, `/office-hours` now generates a rough HTML wireframe using your project's design system (from DESIGN.md) and screenshots it. You see what you're designing while you're still thinking, not after you've coded it.
- **Skills help each other now.** `/plan-ceo-review` and `/plan-eng-review` detect when you'd benefit from running `/office-hours` first and offer it — one-tap to switch, one-tap to decline. If you seem lost during a CEO review, it'll gently suggest brainstorming first.
- **Spec review metrics.** Every adversarial review logs iterations, issues found/fixed, and quality score to `~/.gstack/analytics/spec-review.jsonl`. Over time, you can see if your design docs are getting better.
## [0.9.0.1] - 2026-03-19
### Changed
@@ -87,7 +96,6 @@
- **`/debug` renamed to `/investigate`.** Claude Code has a built-in `/debug` command that shadowed the gstack skill. The systematic root-cause debugging workflow now lives at `/investigate`. (Closes #190)
- **Shell injection surface removed.** All skill templates now use `source <(gstack-slug)` instead of `eval $(gstack-slug)`. Same behavior, no `eval`. (Closes #133)
- **25 new security tests.** URL validation (16 tests) and path traversal validation (14 tests) now have dedicated unit test suites covering scheme blocking, metadata IP blocking, directory escapes, and prefix collision edge cases.
>>>>>>> origin/main
## [0.8.2] - 2026-03-19
@@ -169,7 +177,6 @@ Both modes write a design doc that feeds directly into `/plan-ceo-review` and `/
**`/debug` — find the root cause, not the symptom.**
When something is broken and you don't know why, `/debug` is your systematic debugger. It follows the Iron Law: no fixes without root cause investigation first. Traces data flow, matches against known bug patterns (race conditions, nil propagation, stale cache, config drift), and tests hypotheses one at a time. If 3 fixes fail, it stops and questions the architecture instead of thrashing.
>>>>>>> origin/main
## [0.6.4.1] - 2026-03-18
+146 -1
View File
@@ -227,6 +227,25 @@ success/error/abort, and `USED_BROWSE` with true/false based on whether `$B` was
If you cannot determine the outcome, use "unknown". This runs in the background and
never blocks the user.
## SETUP (run this check BEFORE any browse command)
```bash
_ROOT=$(git rev-parse --show-toplevel 2>/dev/null)
B=""
[ -n "$_ROOT" ] && [ -x "$_ROOT/.claude/skills/gstack/browse/dist/browse" ] && B="$_ROOT/.claude/skills/gstack/browse/dist/browse"
[ -z "$B" ] && B=~/.claude/skills/gstack/browse/dist/browse
if [ -x "$B" ]; then
echo "READY: $B"
else
echo "NEEDS_SETUP"
fi
```
If `NEEDS_SETUP`:
1. Tell the user: "gstack browse needs a one-time build (~10 seconds). OK to proceed?" Then STOP and wait.
2. Run: `cd <SKILL_DIR> && ./setup`
3. If `bun` is not installed: `curl -fsSL https://bun.sh/install | bash`
# YC Office Hours
You are a **YC office hours partner**. Your job is to ensure the problem is understood before solutions are proposed. You adapt to what the user is building — startup founders get the hard questions, builders get an enthusiastic collaborator. This skill produces design docs, not code.
@@ -491,6 +510,66 @@ Present via AskUserQuestion. Do NOT proceed without user approval of the approac
---
## Visual Sketch (UI ideas only)
If the chosen approach involves user-facing UI (screens, pages, forms, dashboards,
or interactive elements), generate a rough wireframe to help the user visualize it.
If the idea is backend-only, infrastructure, or has no UI component — skip this
section silently.
**Step 1: Gather design context**
1. Check if `DESIGN.md` exists in the repo root. If it does, read it for design
system constraints (colors, typography, spacing, component patterns). Use these
constraints in the wireframe.
2. Apply core design principles:
- **Information hierarchy** — what does the user see first, second, third?
- **Interaction states** — loading, empty, error, success, partial
- **Edge case paranoia** — what if the name is 47 chars? Zero results? Network fails?
- **Subtraction default** — "as little design as possible" (Rams). Every element earns its pixels.
- **Design for trust** — every interface element builds or erodes user trust.
**Step 2: Generate wireframe HTML**
Generate a single-page HTML file with these constraints:
- **Intentionally rough aesthetic** — use system fonts, thin gray borders, no color,
hand-drawn-style elements. This is a sketch, not a polished mockup.
- Self-contained — no external dependencies, no CDN links, inline CSS only
- Show the core interaction flow (1-3 screens/states max)
- Include realistic placeholder content (not "Lorem ipsum" — use content that
matches the actual use case)
- Add HTML comments explaining design decisions
Write to a temp file:
```bash
SKETCH_FILE="/tmp/gstack-sketch-$(date +%s).html"
```
**Step 3: Render and capture**
```bash
$B goto "file://$SKETCH_FILE"
$B screenshot /tmp/gstack-sketch.png
```
If `$B` is not available (browse binary not set up), skip the render step. Tell the
user: "Visual sketch requires the browse binary. Run the setup script to enable it."
**Step 4: Present and iterate**
Show the screenshot to the user. Ask: "Does this feel right? Want to iterate on the layout?"
If they want changes, regenerate the HTML with their feedback and re-render.
If they approve or say "good enough," proceed.
**Step 5: Include in design doc**
Reference the wireframe screenshot in the design doc's "Recommended Approach" section.
The screenshot file at `/tmp/gstack-sketch.png` can be referenced by downstream skills
(`/plan-design-review`, `/design-review`) to see what was originally envisioned.
---
## Phase 4.5: Founder Signal Synthesis
Before writing the design doc, synthesize the founder signals you observed during the session. These will appear in the design doc ("What I noticed") and in the closing conversation (Phase 6).
@@ -627,7 +706,73 @@ Supersedes: {prior filename — omit this line if first design on this branch}
{observational, mentor-like reflections referencing specific things the user said during the session. Quote their words back to them — don't characterize their behavior. 2-4 bullets.}
```
Present the design doc to the user via AskUserQuestion:
---
## Spec Review Loop
Before presenting the document to the user for approval, run an adversarial review.
**Step 1: Dispatch reviewer subagent**
Use the Agent tool to dispatch an independent reviewer. The reviewer has fresh context
and cannot see the brainstorming conversation — only the document. This ensures genuine
adversarial independence.
Prompt the subagent with:
- The file path of the document just written
- "Read this document and review it on 5 dimensions. For each dimension, note PASS or
list specific issues with suggested fixes. At the end, output a quality score (1-10)
across all dimensions."
**Dimensions:**
1. **Completeness** — Are all requirements addressed? Missing edge cases?
2. **Consistency** — Do parts of the document agree with each other? Contradictions?
3. **Clarity** — Could an engineer implement this without asking questions? Ambiguous language?
4. **Scope** — Does the document creep beyond the original problem? YAGNI violations?
5. **Feasibility** — Can this actually be built with the stated approach? Hidden complexity?
The subagent should return:
- A quality score (1-10)
- PASS if no issues, or a numbered list of issues with dimension, description, and fix
**Step 2: Fix and re-dispatch**
If the reviewer returns issues:
1. Fix each issue in the document on disk (use Edit tool)
2. Re-dispatch the reviewer subagent with the updated document
3. Maximum 3 iterations total
**Convergence guard:** If the reviewer returns the same issues on consecutive iterations
(the fix didn't resolve them or the reviewer disagrees with the fix), stop the loop
and persist those issues as "Reviewer Concerns" in the document rather than looping
further.
If the subagent fails, times out, or is unavailable — skip the review loop entirely.
Tell the user: "Spec review unavailable — presenting unreviewed doc." The document is
already written to disk; the review is a quality bonus, not a gate.
**Step 3: Report and persist metrics**
After the loop completes (PASS, max iterations, or convergence guard):
1. Tell the user the result — summary by default:
"Your doc survived N rounds of adversarial review. M issues caught and fixed.
Quality score: X/10."
If they ask "what did the reviewer find?", show the full reviewer output.
2. If issues remain after max iterations or convergence, add a "## Reviewer Concerns"
section to the document listing each unresolved issue. Downstream skills will see this.
3. Append metrics:
```bash
mkdir -p ~/.gstack/analytics
echo '{"skill":"office-hours","ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'","iterations":ITERATIONS,"issues_found":FOUND,"issues_fixed":FIXED,"remaining":REMAINING,"quality_score":SCORE}' >> ~/.gstack/analytics/spec-review.jsonl 2>/dev/null || true
```
Replace ITERATIONS, FOUND, FIXED, REMAINING, SCORE with actual values from the review.
---
Present the reviewed design doc to the user via AskUserQuestion:
- A) Approve — mark Status: APPROVED and proceed to handoff
- B) Revise — specify which sections need changes (loop back to revise those sections)
- C) Start over — return to Phase 2
+13 -1
View File
@@ -23,6 +23,8 @@ allowed-tools:
{{PREAMBLE}}
{{BROWSE_SETUP}}
# YC Office Hours
You are a **YC office hours partner**. Your job is to ensure the problem is understood before solutions are proposed. You adapt to what the user is building — startup founders get the hard questions, builders get an enthusiastic collaborator. This skill produces design docs, not code.
@@ -287,6 +289,10 @@ Present via AskUserQuestion. Do NOT proceed without user approval of the approac
---
{{DESIGN_SKETCH}}
---
## Phase 4.5: Founder Signal Synthesis
Before writing the design doc, synthesize the founder signals you observed during the session. These will appear in the design doc ("What I noticed") and in the closing conversation (Phase 6).
@@ -423,7 +429,13 @@ Supersedes: {prior filename — omit this line if first design on this branch}
{observational, mentor-like reflections referencing specific things the user said during the session. Quote their words back to them — don't characterize their behavior. 2-4 bullets.}
```
Present the design doc to the user via AskUserQuestion:
---
{{SPEC_REVIEW_LOOP}}
---
Present the reviewed design doc to the user via AskUserQuestion:
- A) Approve — mark Status: APPROVED and proceed to handoff
- B) Revise — specify which sections need changes (loop back to revise those sections)
- C) Start over — return to Phase 2
+96
View File
@@ -10,6 +10,7 @@ description: |
or "is this ambitious enough".
Proactively suggest when the user is questioning scope or ambition of a plan,
or when the plan feels like it could be thinking bigger.
benefits-from: [office-hours]
allowed-tools:
- Read
- Grep
@@ -331,6 +332,37 @@ DESIGN=$(ls -t ~/.gstack/projects/$SLUG/*-$BRANCH-design-*.md 2>/dev/null | head
```
If a design doc exists (from `/office-hours`), read it. Use it as the source of truth for the problem statement, constraints, and chosen approach. If it has a `Supersedes:` field, note that this is a revised design.
## Prerequisite Skill Offer
When the design doc check above prints "No design doc found," offer the prerequisite
skill before proceeding.
Say to the user via AskUserQuestion:
> "No design doc found for this branch. `/office-hours` produces a structured problem
> statement, premise challenge, and explored alternatives — it gives this review much
> sharper input to work with. Takes about 10 minutes. The design doc is per-feature,
> not per-product — it captures the thinking behind this specific change."
Options:
- A) Run /office-hours first (in another window, then come back)
- B) Skip — proceed with standard review
If they skip: "No worries — standard review. If you ever want sharper input, try
/office-hours first next time." Then proceed normally. Do not re-offer later in the session.
**Mid-session detection:** During Step 0A (Premise Challenge), if the user can't
articulate the problem, keeps changing the problem statement, answers with "I'm not
sure," or is clearly exploring rather than reviewing — offer `/office-hours`:
> "It sounds like you're still figuring out what to build — that's totally fine, but
> that's what /office-hours is designed for. Want to pause this review and run
> /office-hours first? It'll help you nail down the problem and approach, then come
> back here for the strategic review."
Options: A) Yes, run /office-hours first. B) No, keep going.
If they keep going, proceed normally — no guilt, no re-asking.
When reading TODOS.md, specifically:
* Note any TODOs this plan touches, blocks, or unlocks
* Check if deferred work from prior reviews relates to this plan
@@ -474,6 +506,70 @@ Repo: {owner/repo}
Derive the feature slug from the plan being reviewed (e.g., "user-dashboard", "auth-refactor"). Use the date in YYYY-MM-DD format.
After writing the CEO plan, run the spec review loop on it:
## Spec Review Loop
Before presenting the document to the user for approval, run an adversarial review.
**Step 1: Dispatch reviewer subagent**
Use the Agent tool to dispatch an independent reviewer. The reviewer has fresh context
and cannot see the brainstorming conversation — only the document. This ensures genuine
adversarial independence.
Prompt the subagent with:
- The file path of the document just written
- "Read this document and review it on 5 dimensions. For each dimension, note PASS or
list specific issues with suggested fixes. At the end, output a quality score (1-10)
across all dimensions."
**Dimensions:**
1. **Completeness** — Are all requirements addressed? Missing edge cases?
2. **Consistency** — Do parts of the document agree with each other? Contradictions?
3. **Clarity** — Could an engineer implement this without asking questions? Ambiguous language?
4. **Scope** — Does the document creep beyond the original problem? YAGNI violations?
5. **Feasibility** — Can this actually be built with the stated approach? Hidden complexity?
The subagent should return:
- A quality score (1-10)
- PASS if no issues, or a numbered list of issues with dimension, description, and fix
**Step 2: Fix and re-dispatch**
If the reviewer returns issues:
1. Fix each issue in the document on disk (use Edit tool)
2. Re-dispatch the reviewer subagent with the updated document
3. Maximum 3 iterations total
**Convergence guard:** If the reviewer returns the same issues on consecutive iterations
(the fix didn't resolve them or the reviewer disagrees with the fix), stop the loop
and persist those issues as "Reviewer Concerns" in the document rather than looping
further.
If the subagent fails, times out, or is unavailable — skip the review loop entirely.
Tell the user: "Spec review unavailable — presenting unreviewed doc." The document is
already written to disk; the review is a quality bonus, not a gate.
**Step 3: Report and persist metrics**
After the loop completes (PASS, max iterations, or convergence guard):
1. Tell the user the result — summary by default:
"Your doc survived N rounds of adversarial review. M issues caught and fixed.
Quality score: X/10."
If they ask "what did the reviewer find?", show the full reviewer output.
2. If issues remain after max iterations or convergence, add a "## Reviewer Concerns"
section to the document listing each unresolved issue. Downstream skills will see this.
3. Append metrics:
```bash
mkdir -p ~/.gstack/analytics
echo '{"skill":"plan-ceo-review","ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'","iterations":ITERATIONS,"issues_found":FOUND,"issues_fixed":FIXED,"remaining":REMAINING,"quality_score":SCORE}' >> ~/.gstack/analytics/spec-review.jsonl 2>/dev/null || true
```
Replace ITERATIONS, FOUND, FIXED, REMAINING, SCORE with actual values from the review.
### 0E. Temporal Interrogation (EXPANSION, SELECTIVE EXPANSION, and HOLD modes)
Think ahead to implementation: What decisions will need to be made during implementation that should be resolved NOW in the plan?
```
+19
View File
@@ -10,6 +10,7 @@ description: |
or "is this ambitious enough".
Proactively suggest when the user is questioning scope or ambition of a plan,
or when the plan feels like it could be thinking bigger.
benefits-from: [office-hours]
allowed-tools:
- Read
- Grep
@@ -110,6 +111,20 @@ DESIGN=$(ls -t ~/.gstack/projects/$SLUG/*-$BRANCH-design-*.md 2>/dev/null | head
```
If a design doc exists (from `/office-hours`), read it. Use it as the source of truth for the problem statement, constraints, and chosen approach. If it has a `Supersedes:` field, note that this is a revised design.
{{BENEFITS_FROM}}
**Mid-session detection:** During Step 0A (Premise Challenge), if the user can't
articulate the problem, keeps changing the problem statement, answers with "I'm not
sure," or is clearly exploring rather than reviewing — offer `/office-hours`:
> "It sounds like you're still figuring out what to build — that's totally fine, but
> that's what /office-hours is designed for. Want to pause this review and run
> /office-hours first? It'll help you nail down the problem and approach, then come
> back here for the strategic review."
Options: A) Yes, run /office-hours first. B) No, keep going.
If they keep going, proceed normally — no guilt, no re-asking.
When reading TODOS.md, specifically:
* Note any TODOs this plan touches, blocks, or unlocks
* Check if deferred work from prior reviews relates to this plan
@@ -253,6 +268,10 @@ Repo: {owner/repo}
Derive the feature slug from the plan being reviewed (e.g., "user-dashboard", "auth-refactor"). Use the date in YYYY-MM-DD format.
After writing the CEO plan, run the spec review loop on it:
{{SPEC_REVIEW_LOOP}}
### 0E. Temporal Interrogation (EXPANSION, SELECTIVE EXPANSION, and HOLD modes)
Think ahead to implementation: What decisions will need to be made during implementation that should be resolved NOW in the plan?
```
+20
View File
@@ -8,6 +8,7 @@ description: |
"review the architecture", "engineering review", or "lock in the plan".
Proactively suggest when the user has a plan or design doc and is about to
start coding — to catch architecture issues before implementation.
benefits-from: [office-hours]
allowed-tools:
- Read
- Write
@@ -277,6 +278,25 @@ DESIGN=$(ls -t ~/.gstack/projects/$SLUG/*-$BRANCH-design-*.md 2>/dev/null | head
```
If a design doc exists, read it. Use it as the source of truth for the problem statement, constraints, and chosen approach. If it has a `Supersedes:` field, note that this is a revised design — check the prior version for context on what changed and why.
## Prerequisite Skill Offer
When the design doc check above prints "No design doc found," offer the prerequisite
skill before proceeding.
Say to the user via AskUserQuestion:
> "No design doc found for this branch. `/office-hours` produces a structured problem
> statement, premise challenge, and explored alternatives — it gives this review much
> sharper input to work with. Takes about 10 minutes. The design doc is per-feature,
> not per-product — it captures the thinking behind this specific change."
Options:
- A) Run /office-hours first (in another window, then come back)
- B) Skip — proceed with standard review
If they skip: "No worries — standard review. If you ever want sharper input, try
/office-hours first next time." Then proceed normally. Do not re-offer later in the session.
### Step 0: Scope Challenge
Before reviewing anything, answer these questions:
1. **What existing code already partially or fully solves each sub-problem?** Can we capture outputs from existing flows rather than building parallel ones?
+3
View File
@@ -8,6 +8,7 @@ description: |
"review the architecture", "engineering review", or "lock in the plan".
Proactively suggest when the user has a plan or design doc and is about to
start coding — to catch architecture issues before implementation.
benefits-from: [office-hours]
allowed-tools:
- Read
- Write
@@ -73,6 +74,8 @@ DESIGN=$(ls -t ~/.gstack/projects/$SLUG/*-$BRANCH-design-*.md 2>/dev/null | head
```
If a design doc exists, read it. Use it as the source of truth for the problem statement, constraints, and chosen approach. If it has a `Supersedes:` field, note that this is a revised design — check the prior version for context on what changed and why.
{{BENEFITS_FROM}}
### Step 0: Scope Challenge
Before reviewing anything, answer these questions:
1. **What existing code already partially or fully solves each sub-problem?** Can we capture outputs from existing flows rather than building parallel ones?
+162 -1
View File
@@ -55,6 +55,7 @@ const HOST_PATHS: Record<Host, HostPaths> = {
interface TemplateContext {
skillName: string;
tmplPath: string;
benefitsFrom?: string[];
host: Host;
paths: HostPaths;
}
@@ -1405,6 +1406,156 @@ Tell the user: "Deploy configuration saved to CLAUDE.md. Future /land-and-deploy
---`;
}
function generateSpecReviewLoop(_ctx: TemplateContext): string {
return `## Spec Review Loop
Before presenting the document to the user for approval, run an adversarial review.
**Step 1: Dispatch reviewer subagent**
Use the Agent tool to dispatch an independent reviewer. The reviewer has fresh context
and cannot see the brainstorming conversation only the document. This ensures genuine
adversarial independence.
Prompt the subagent with:
- The file path of the document just written
- "Read this document and review it on 5 dimensions. For each dimension, note PASS or
list specific issues with suggested fixes. At the end, output a quality score (1-10)
across all dimensions."
**Dimensions:**
1. **Completeness** Are all requirements addressed? Missing edge cases?
2. **Consistency** Do parts of the document agree with each other? Contradictions?
3. **Clarity** Could an engineer implement this without asking questions? Ambiguous language?
4. **Scope** Does the document creep beyond the original problem? YAGNI violations?
5. **Feasibility** Can this actually be built with the stated approach? Hidden complexity?
The subagent should return:
- A quality score (1-10)
- PASS if no issues, or a numbered list of issues with dimension, description, and fix
**Step 2: Fix and re-dispatch**
If the reviewer returns issues:
1. Fix each issue in the document on disk (use Edit tool)
2. Re-dispatch the reviewer subagent with the updated document
3. Maximum 3 iterations total
**Convergence guard:** If the reviewer returns the same issues on consecutive iterations
(the fix didn't resolve them or the reviewer disagrees with the fix), stop the loop
and persist those issues as "Reviewer Concerns" in the document rather than looping
further.
If the subagent fails, times out, or is unavailable skip the review loop entirely.
Tell the user: "Spec review unavailable — presenting unreviewed doc." The document is
already written to disk; the review is a quality bonus, not a gate.
**Step 3: Report and persist metrics**
After the loop completes (PASS, max iterations, or convergence guard):
1. Tell the user the result summary by default:
"Your doc survived N rounds of adversarial review. M issues caught and fixed.
Quality score: X/10."
If they ask "what did the reviewer find?", show the full reviewer output.
2. If issues remain after max iterations or convergence, add a "## Reviewer Concerns"
section to the document listing each unresolved issue. Downstream skills will see this.
3. Append metrics:
\`\`\`bash
mkdir -p ~/.gstack/analytics
echo '{"skill":"${_ctx.skillName}","ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'","iterations":ITERATIONS,"issues_found":FOUND,"issues_fixed":FIXED,"remaining":REMAINING,"quality_score":SCORE}' >> ~/.gstack/analytics/spec-review.jsonl 2>/dev/null || true
\`\`\`
Replace ITERATIONS, FOUND, FIXED, REMAINING, SCORE with actual values from the review.`;
}
function generateBenefitsFrom(ctx: TemplateContext): string {
if (!ctx.benefitsFrom || ctx.benefitsFrom.length === 0) return '';
const skillList = ctx.benefitsFrom.map(s => `\`/${s}\``).join(' or ');
const first = ctx.benefitsFrom[0];
return `## Prerequisite Skill Offer
When the design doc check above prints "No design doc found," offer the prerequisite
skill before proceeding.
Say to the user via AskUserQuestion:
> "No design doc found for this branch. ${skillList} produces a structured problem
> statement, premise challenge, and explored alternatives it gives this review much
> sharper input to work with. Takes about 10 minutes. The design doc is per-feature,
> not per-product it captures the thinking behind this specific change."
Options:
- A) Run /${first} first (in another window, then come back)
- B) Skip proceed with standard review
If they skip: "No worries standard review. If you ever want sharper input, try
/${first} first next time." Then proceed normally. Do not re-offer later in the session.`;
}
function generateDesignSketch(_ctx: TemplateContext): string {
return `## Visual Sketch (UI ideas only)
If the chosen approach involves user-facing UI (screens, pages, forms, dashboards,
or interactive elements), generate a rough wireframe to help the user visualize it.
If the idea is backend-only, infrastructure, or has no UI component skip this
section silently.
**Step 1: Gather design context**
1. Check if \`DESIGN.md\` exists in the repo root. If it does, read it for design
system constraints (colors, typography, spacing, component patterns). Use these
constraints in the wireframe.
2. Apply core design principles:
- **Information hierarchy** what does the user see first, second, third?
- **Interaction states** loading, empty, error, success, partial
- **Edge case paranoia** what if the name is 47 chars? Zero results? Network fails?
- **Subtraction default** "as little design as possible" (Rams). Every element earns its pixels.
- **Design for trust** every interface element builds or erodes user trust.
**Step 2: Generate wireframe HTML**
Generate a single-page HTML file with these constraints:
- **Intentionally rough aesthetic** use system fonts, thin gray borders, no color,
hand-drawn-style elements. This is a sketch, not a polished mockup.
- Self-contained no external dependencies, no CDN links, inline CSS only
- Show the core interaction flow (1-3 screens/states max)
- Include realistic placeholder content (not "Lorem ipsum" use content that
matches the actual use case)
- Add HTML comments explaining design decisions
Write to a temp file:
\`\`\`bash
SKETCH_FILE="/tmp/gstack-sketch-$(date +%s).html"
\`\`\`
**Step 3: Render and capture**
\`\`\`bash
$B goto "file://$SKETCH_FILE"
$B screenshot /tmp/gstack-sketch.png
\`\`\`
If \`$B\` is not available (browse binary not set up), skip the render step. Tell the
user: "Visual sketch requires the browse binary. Run the setup script to enable it."
**Step 4: Present and iterate**
Show the screenshot to the user. Ask: "Does this feel right? Want to iterate on the layout?"
If they want changes, regenerate the HTML with their feedback and re-render.
If they approve or say "good enough," proceed.
**Step 5: Include in design doc**
Reference the wireframe screenshot in the design doc's "Recommended Approach" section.
The screenshot file at \`/tmp/gstack-sketch.png\` can be referenced by downstream skills
(\`/plan-design-review\`, \`/design-review\`) to see what was originally envisioned.`;
}
const RESOLVERS: Record<string, (ctx: TemplateContext) => string> = {
COMMAND_REFERENCE: generateCommandReference,
SNAPSHOT_FLAGS: generateSnapshotFlags,
@@ -1417,6 +1568,9 @@ const RESOLVERS: Record<string, (ctx: TemplateContext) => string> = {
REVIEW_DASHBOARD: generateReviewDashboard,
TEST_BOOTSTRAP: generateTestBootstrap,
DEPLOY_BOOTSTRAP: generateDeployBootstrap,
SPEC_REVIEW_LOOP: generateSpecReviewLoop,
DESIGN_SKETCH: generateDesignSketch,
BENEFITS_FROM: generateBenefitsFrom,
};
// ─── Codex Helpers ───────────────────────────────────────────
@@ -1539,7 +1693,14 @@ function processTemplate(tmplPath: string, host: Host = 'claude'): { outputPath:
// Extract skill name from frontmatter for TemplateContext
const nameMatch = tmplContent.match(/^name:\s*(.+)$/m);
const skillName = nameMatch ? nameMatch[1].trim() : path.basename(path.dirname(tmplPath));
const ctx: TemplateContext = { skillName, tmplPath, host, paths: HOST_PATHS[host] };
// Extract benefits-from list from frontmatter (inline YAML: benefits-from: [a, b])
const benefitsMatch = tmplContent.match(/^benefits-from:\s*\[([^\]]*)\]/m);
const benefitsFrom = benefitsMatch
? benefitsMatch[1].split(',').map(s => s.trim()).filter(Boolean)
: undefined;
const ctx: TemplateContext = { skillName, tmplPath, benefitsFrom, host, paths: HOST_PATHS[host] };
// Replace placeholders
let content = tmplContent.replace(/\{\{(\w+)\}\}/g, (match, name) => {
+92
View File
@@ -416,6 +416,98 @@ describe('REVIEW_DASHBOARD resolver', () => {
});
});
// --- {{SPEC_REVIEW_LOOP}} resolver tests ---
describe('SPEC_REVIEW_LOOP resolver', () => {
const content = fs.readFileSync(path.join(ROOT, 'office-hours', 'SKILL.md'), 'utf-8');
test('contains all 5 review dimensions', () => {
for (const dim of ['Completeness', 'Consistency', 'Clarity', 'Scope', 'Feasibility']) {
expect(content).toContain(dim);
}
});
test('references Agent tool for subagent dispatch', () => {
expect(content).toMatch(/Agent.*tool/i);
});
test('specifies max 3 iterations', () => {
expect(content).toMatch(/3.*iteration|maximum.*3/i);
});
test('includes quality score', () => {
expect(content).toContain('quality score');
});
test('includes metrics path', () => {
expect(content).toContain('spec-review.jsonl');
});
test('includes convergence guard', () => {
expect(content).toMatch(/[Cc]onvergence/);
});
test('includes graceful failure handling', () => {
expect(content).toMatch(/skip.*review|unavailable/i);
});
});
// --- {{DESIGN_SKETCH}} resolver tests ---
describe('DESIGN_SKETCH resolver', () => {
const content = fs.readFileSync(path.join(ROOT, 'office-hours', 'SKILL.md'), 'utf-8');
test('references DESIGN.md for design system constraints', () => {
expect(content).toContain('DESIGN.md');
});
test('contains wireframe or sketch terminology', () => {
expect(content).toMatch(/wireframe|sketch/i);
});
test('references browse binary for rendering', () => {
expect(content).toContain('$B goto');
});
test('references screenshot capture', () => {
expect(content).toContain('$B screenshot');
});
test('specifies rough aesthetic', () => {
expect(content).toMatch(/[Rr]ough|hand-drawn/);
});
test('includes skip conditions', () => {
expect(content).toMatch(/no UI component|skip/i);
});
});
// --- {{BENEFITS_FROM}} resolver tests ---
describe('BENEFITS_FROM resolver', () => {
const ceoContent = fs.readFileSync(path.join(ROOT, 'plan-ceo-review', 'SKILL.md'), 'utf-8');
const engContent = fs.readFileSync(path.join(ROOT, 'plan-eng-review', 'SKILL.md'), 'utf-8');
test('plan-ceo-review contains prerequisite skill offer', () => {
expect(ceoContent).toContain('Prerequisite Skill Offer');
expect(ceoContent).toContain('/office-hours');
});
test('plan-eng-review contains prerequisite skill offer', () => {
expect(engContent).toContain('Prerequisite Skill Offer');
expect(engContent).toContain('/office-hours');
});
test('offer includes graceful decline', () => {
expect(ceoContent).toContain('No worries');
});
test('skills without benefits-from do NOT have prerequisite offer', () => {
const qaContent = fs.readFileSync(path.join(ROOT, 'qa', 'SKILL.md'), 'utf-8');
expect(qaContent).not.toContain('Prerequisite Skill Offer');
});
});
// ─── Codex Generation Tests ─────────────────────────────────
describe('Codex generation (--host codex)', () => {
+8
View File
@@ -57,9 +57,13 @@ export const E2E_TOUCHFILES: Record<string, string[]> = {
'review-base-branch': ['review/**'],
'review-design-lite': ['review/**', 'test/fixtures/review-eval-design-slop.*'],
// Office Hours
'office-hours-spec-review': ['office-hours/**', 'scripts/gen-skill-docs.ts'],
// Plan reviews
'plan-ceo-review': ['plan-ceo-review/**'],
'plan-ceo-review-selective': ['plan-ceo-review/**'],
'plan-ceo-review-benefits': ['plan-ceo-review/**', 'scripts/gen-skill-docs.ts'],
'plan-eng-review': ['plan-eng-review/**'],
'plan-eng-review-artifact': ['plan-eng-review/**'],
@@ -146,6 +150,10 @@ export const LLM_JUDGE_TOUCHFILES: Record<string, string[]> = {
'design-review/SKILL.md fix loop': ['design-review/SKILL.md', 'design-review/SKILL.md.tmpl'],
'design-consultation/SKILL.md research': ['design-consultation/SKILL.md', 'design-consultation/SKILL.md.tmpl'],
// Office Hours
'office-hours/SKILL.md spec review': ['office-hours/SKILL.md', 'office-hours/SKILL.md.tmpl', 'scripts/gen-skill-docs.ts'],
'office-hours/SKILL.md design sketch': ['office-hours/SKILL.md', 'office-hours/SKILL.md.tmpl', 'scripts/gen-skill-docs.ts'],
// Deploy skills
'land-and-deploy/SKILL.md workflow': ['land-and-deploy/SKILL.md', 'land-and-deploy/SKILL.md.tmpl'],
'canary/SKILL.md monitoring loop': ['canary/SKILL.md', 'canary/SKILL.md.tmpl'],
+119 -23
View File
@@ -2911,6 +2911,125 @@ Write the full output (including the GATE verdict) to ${codexDir}/codex-output.m
}, 360_000);
});
// --- Office Hours Spec Review E2E ---
describeIfSelected('Office Hours Spec Review E2E', ['office-hours-spec-review'], () => {
let ohDir: string;
beforeAll(() => {
ohDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-oh-spec-'));
const run = (cmd: string, args: string[]) =>
spawnSync(cmd, args, { cwd: ohDir, stdio: 'pipe', timeout: 5000 });
run('git', ['init', '-b', 'main']);
run('git', ['config', 'user.email', 'test@test.com']);
run('git', ['config', 'user.name', 'Test']);
fs.writeFileSync(path.join(ohDir, 'README.md'), '# Test Project\n');
run('git', ['add', '.']);
run('git', ['commit', '-m', 'init']);
// Copy office-hours skill
fs.mkdirSync(path.join(ohDir, 'office-hours'), { recursive: true });
fs.copyFileSync(
path.join(ROOT, 'office-hours', 'SKILL.md'),
path.join(ohDir, 'office-hours', 'SKILL.md'),
);
});
afterAll(() => {
try { fs.rmSync(ohDir, { recursive: true, force: true }); } catch {}
});
test('/office-hours SKILL.md contains spec review loop', async () => {
const result = await runSkillTest({
prompt: `Read office-hours/SKILL.md. I want to understand the spec review loop.
Summarize what the "Spec Review Loop" section does specifically:
1. How many dimensions does the reviewer check?
2. What tool is used to dispatch the reviewer?
3. What's the maximum number of iterations?
4. What metrics are tracked?
Write your summary to ${ohDir}/spec-review-summary.md`,
workingDirectory: ohDir,
maxTurns: 8,
timeout: 120_000,
testName: 'office-hours-spec-review',
runId,
});
logCost('/office-hours spec review', result);
recordE2E('/office-hours-spec-review', 'Office Hours Spec Review E2E', result);
expect(result.exitReason).toBe('success');
const summaryPath = path.join(ohDir, 'spec-review-summary.md');
if (fs.existsSync(summaryPath)) {
const summary = fs.readFileSync(summaryPath, 'utf-8').toLowerCase();
expect(summary).toMatch(/5.*dimension|dimension.*5|completeness|consistency|clarity|scope|feasibility/);
expect(summary).toMatch(/agent|subagent/);
expect(summary).toMatch(/3.*iteration|iteration.*3|maximum.*3/);
}
}, 180_000);
});
// --- Plan CEO Review Benefits-From E2E ---
describeIfSelected('Plan CEO Review Benefits-From E2E', ['plan-ceo-review-benefits'], () => {
let benefitsDir: string;
beforeAll(() => {
benefitsDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-benefits-'));
const run = (cmd: string, args: string[]) =>
spawnSync(cmd, args, { cwd: benefitsDir, stdio: 'pipe', timeout: 5000 });
run('git', ['init', '-b', 'main']);
run('git', ['config', 'user.email', 'test@test.com']);
run('git', ['config', 'user.name', 'Test']);
fs.writeFileSync(path.join(benefitsDir, 'README.md'), '# Test Project\n');
run('git', ['add', '.']);
run('git', ['commit', '-m', 'init']);
fs.mkdirSync(path.join(benefitsDir, 'plan-ceo-review'), { recursive: true });
fs.copyFileSync(
path.join(ROOT, 'plan-ceo-review', 'SKILL.md'),
path.join(benefitsDir, 'plan-ceo-review', 'SKILL.md'),
);
});
afterAll(() => {
try { fs.rmSync(benefitsDir, { recursive: true, force: true }); } catch {}
});
test('/plan-ceo-review SKILL.md contains prerequisite skill offer', async () => {
const result = await runSkillTest({
prompt: `Read plan-ceo-review/SKILL.md. Search for sections about "Prerequisite" or "office-hours" or "design doc found".
Summarize what happens when no design doc is found specifically:
1. Is /office-hours offered as a prerequisite?
2. What options does the user get?
3. Is there a mid-session detection for when the user seems lost?
Write your summary to ${benefitsDir}/benefits-summary.md`,
workingDirectory: benefitsDir,
maxTurns: 8,
timeout: 120_000,
testName: 'plan-ceo-review-benefits',
runId,
});
logCost('/plan-ceo-review benefits-from', result);
recordE2E('/plan-ceo-review-benefits', 'Plan CEO Review Benefits-From E2E', result);
expect(result.exitReason).toBe('success');
const summaryPath = path.join(benefitsDir, 'benefits-summary.md');
if (fs.existsSync(summaryPath)) {
const summary = fs.readFileSync(summaryPath, 'utf-8').toLowerCase();
expect(summary).toMatch(/office.hours/);
expect(summary).toMatch(/design doc|no design/i);
}
}, 180_000);
});
// --- Land-and-Deploy / Canary / Benchmark / Setup-Deploy E2E ---
describeIfSelected('Land-and-Deploy skill E2E', ['land-and-deploy-workflow'], () => {
@@ -2918,7 +3037,6 @@ describeIfSelected('Land-and-Deploy skill E2E', ['land-and-deploy-workflow'], ()
beforeAll(() => {
landDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-land-deploy-'));
const run = (cmd: string, args: string[]) =>
spawnSync(cmd, args, { cwd: landDir, stdio: 'pipe', timeout: 5000 });
@@ -2926,19 +3044,16 @@ describeIfSelected('Land-and-Deploy skill E2E', ['land-and-deploy-workflow'], ()
run('git', ['config', 'user.email', 'test@test.com']);
run('git', ['config', 'user.name', 'Test']);
// Create initial app
fs.writeFileSync(path.join(landDir, 'app.ts'), 'export function hello() { return "world"; }\n');
fs.writeFileSync(path.join(landDir, 'fly.toml'), 'app = "test-app"\n\n[http_service]\n internal_port = 3000\n');
run('git', ['add', '.']);
run('git', ['commit', '-m', 'initial']);
// Create feature branch with changes
run('git', ['checkout', '-b', 'feat/add-deploy']);
fs.writeFileSync(path.join(landDir, 'app.ts'), 'export function hello() { return "deployed"; }\n');
run('git', ['add', '.']);
run('git', ['commit', '-m', 'feat: update hello']);
// Copy skill
copyDirSync(path.join(ROOT, 'land-and-deploy'), path.join(landDir, 'land-and-deploy'));
});
@@ -2975,7 +3090,6 @@ Do NOT use AskUserQuestion. Do NOT run gh or fly commands.`,
recordE2E('/land-and-deploy workflow', 'Land-and-Deploy skill E2E', result);
expect(result.exitReason).toBe('success');
// Verify deploy config was written to CLAUDE.md
const claudeMd = path.join(landDir, 'CLAUDE.md');
if (fs.existsSync(claudeMd)) {
const content = fs.readFileSync(claudeMd, 'utf-8');
@@ -2983,7 +3097,6 @@ Do NOT use AskUserQuestion. Do NOT run gh or fly commands.`,
expect(hasFly).toBe(true);
}
// Verify deploy report directory was created
const reportDir = path.join(landDir, '.gstack', 'deploy-reports');
expect(fs.existsSync(reportDir)).toBe(true);
}, 180_000);
@@ -2994,7 +3107,6 @@ describeIfSelected('Canary skill E2E', ['canary-workflow'], () => {
beforeAll(() => {
canaryDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-canary-'));
const run = (cmd: string, args: string[]) =>
spawnSync(cmd, args, { cwd: canaryDir, stdio: 'pipe', timeout: 5000 });
@@ -3006,7 +3118,6 @@ describeIfSelected('Canary skill E2E', ['canary-workflow'], () => {
run('git', ['add', '.']);
run('git', ['commit', '-m', 'initial']);
// Copy skill
copyDirSync(path.join(ROOT, 'canary'), path.join(canaryDir, 'canary'));
});
@@ -3043,10 +3154,7 @@ Just create the directory structure and report files showing the correct schema.
recordE2E('/canary workflow', 'Canary skill E2E', result);
expect(result.exitReason).toBe('success');
// Verify directory structure
expect(fs.existsSync(path.join(canaryDir, '.gstack', 'canary-reports'))).toBe(true);
// Verify baseline or report was created
const reportDir = path.join(canaryDir, '.gstack', 'canary-reports');
const files = fs.readdirSync(reportDir, { recursive: true }) as string[];
expect(files.length).toBeGreaterThan(0);
@@ -3058,7 +3166,6 @@ describeIfSelected('Benchmark skill E2E', ['benchmark-workflow'], () => {
beforeAll(() => {
benchDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-benchmark-'));
const run = (cmd: string, args: string[]) =>
spawnSync(cmd, args, { cwd: benchDir, stdio: 'pipe', timeout: 5000 });
@@ -3070,7 +3177,6 @@ describeIfSelected('Benchmark skill E2E', ['benchmark-workflow'], () => {
run('git', ['add', '.']);
run('git', ['commit', '-m', 'initial']);
// Copy skill
copyDirSync(path.join(ROOT, 'benchmark'), path.join(benchDir, 'benchmark'));
});
@@ -3109,10 +3215,7 @@ Just create the files showing the correct schema and report format.`,
recordE2E('/benchmark workflow', 'Benchmark skill E2E', result);
expect(result.exitReason).toBe('success');
// Verify directory structure
expect(fs.existsSync(path.join(benchDir, '.gstack', 'benchmark-reports'))).toBe(true);
// Verify baseline was created
const baselineDir = path.join(benchDir, '.gstack', 'benchmark-reports', 'baselines');
if (fs.existsSync(baselineDir)) {
const files = fs.readdirSync(baselineDir);
@@ -3126,7 +3229,6 @@ describeIfSelected('Setup-Deploy skill E2E', ['setup-deploy-workflow'], () => {
beforeAll(() => {
setupDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-setup-deploy-'));
const run = (cmd: string, args: string[]) =>
spawnSync(cmd, args, { cwd: setupDir, stdio: 'pipe', timeout: 5000 });
@@ -3134,13 +3236,11 @@ describeIfSelected('Setup-Deploy skill E2E', ['setup-deploy-workflow'], () => {
run('git', ['config', 'user.email', 'test@test.com']);
run('git', ['config', 'user.name', 'Test']);
// Create a project with fly.toml
fs.writeFileSync(path.join(setupDir, 'app.ts'), 'export default { port: 3000 };\n');
fs.writeFileSync(path.join(setupDir, 'fly.toml'), 'app = "my-cool-app"\n\n[http_service]\n internal_port = 3000\n force_https = true\n');
run('git', ['add', '.']);
run('git', ['commit', '-m', 'initial']);
// Copy skill
copyDirSync(path.join(ROOT, 'setup-deploy'), path.join(setupDir, 'setup-deploy'));
});
@@ -3174,16 +3274,12 @@ Just detect the platform and write the config.`,
recordE2E('/setup-deploy workflow', 'Setup-Deploy skill E2E', result);
expect(result.exitReason).toBe('success');
// Verify CLAUDE.md was created with deploy config
const claudeMd = path.join(setupDir, 'CLAUDE.md');
expect(fs.existsSync(claudeMd)).toBe(true);
const content = fs.readFileSync(claudeMd, 'utf-8');
// Should mention Fly.io or fly
expect(content.toLowerCase()).toContain('fly');
// Should mention the app name
expect(content).toContain('my-cool-app');
// Should have the deploy configuration header
expect(content).toContain('Deploy Configuration');
}, 180_000);
});
+69
View File
@@ -652,6 +652,59 @@ describe('office-hours skill structure', () => {
test('contains builder operating principles', () => {
expect(content).toContain('Delight is the currency');
});
// Spec Review Loop (Phase 5.5)
test('contains spec review loop', () => {
expect(content).toContain('Spec Review Loop');
});
test('contains adversarial review dimensions', () => {
for (const dim of ['Completeness', 'Consistency', 'Clarity', 'Scope', 'Feasibility']) {
expect(content).toContain(dim);
}
});
test('contains subagent dispatch instruction', () => {
expect(content).toMatch(/Agent.*tool|subagent/i);
});
test('contains max 3 iterations', () => {
expect(content).toMatch(/3.*iteration|maximum.*3/i);
});
test('contains quality score', () => {
expect(content).toContain('quality score');
});
test('contains spec review metrics path', () => {
expect(content).toContain('spec-review.jsonl');
});
test('contains convergence guard', () => {
expect(content).toMatch(/convergence/i);
});
// Visual Sketch (Phase 4.5)
test('contains visual sketch section', () => {
expect(content).toContain('Visual Sketch');
});
test('contains wireframe generation', () => {
expect(content).toMatch(/wireframe|sketch/i);
});
test('contains DESIGN.md awareness', () => {
expect(content).toContain('DESIGN.md');
});
test('contains browse rendering', () => {
expect(content).toContain('$B goto');
expect(content).toContain('$B screenshot');
});
test('contains rough aesthetic instruction', () => {
expect(content).toMatch(/rough|hand-drawn/i);
});
});
describe('investigate skill structure', () => {
@@ -868,6 +921,22 @@ describe('CEO review mode validation', () => {
expect(content).toContain('HOLD SCOPE');
expect(content).toContain('REDUCTION');
});
// Skill chaining (benefits-from)
test('contains prerequisite skill offer for office-hours', () => {
expect(content).toContain('Prerequisite Skill Offer');
expect(content).toContain('/office-hours');
});
test('contains mid-session detection', () => {
expect(content).toContain('Mid-session detection');
expect(content).toMatch(/still figuring out|seems lost/i);
});
// Spec review on CEO plans
test('contains spec review loop for CEO plan documents', () => {
expect(content).toContain('Spec Review Loop');
});
});
// --- gstack-slug helper ---
+3 -2
View File
@@ -78,8 +78,9 @@ describe('selectTests', () => {
const result = selectTests(['plan-ceo-review/SKILL.md'], E2E_TOUCHFILES);
expect(result.selected).toContain('plan-ceo-review');
expect(result.selected).toContain('plan-ceo-review-selective');
expect(result.selected.length).toBe(2);
expect(result.skipped.length).toBe(Object.keys(E2E_TOUCHFILES).length - 2);
expect(result.selected).toContain('plan-ceo-review-benefits');
expect(result.selected.length).toBe(3);
expect(result.skipped.length).toBe(Object.keys(E2E_TOUCHFILES).length - 3);
});
test('global touchfile triggers ALL tests', () => {