Files
gstack/plan-devex-review/sections/review-sections.md.tmpl
T
Garry Tan cab774cced v1.56.0.0 Token-reduction Phase B + AUQ paranoid safety net (#1849)
* refactor(plan-ceo-review): carve review body into on-demand section

Carve the largest skill (138,838 B) into a skeleton + one on-demand
section, the documented next Phase B target after /ship (v2_PLAN.md:216).

- sections/review-sections.md(.tmpl): the 11-section deep review, codex/
  outside-voice rules, how-to-ask, Required Outputs, registries, Completion
  Summary, Review Log, REVIEW_DASHBOARD, PLAN_FILE_REVIEW_REPORT, Next Steps,
  docs/designs promotion, Formatting Rules, and the Mode Quick Reference.
- sections/manifest.json: passive registry (CM2), one entry.
- SKILL.md.tmpl: {{SECTION_INDEX}} after the system audit, a single
  {{SECTION:review-sections}} STOP-Read after Step 0 mode selection, and a
  Section self-check. All of Step 0 (the scope/mode conversation) stays in
  the always-loaded skeleton; only EXIT_PLAN_MODE_GATE follows the section.

Measured: always-loaded skeleton 138,838 -> 80,731 B (-42%, ~14.4K tokens
off every invocation). Union (skeleton + section) 139,110 B, behavior held.

Boundary honors Codex P1: nothing review-governing (formatting rules, mode
reference, how-to-ask, required outputs) sits in the skeleton below the
STOP. Housekeeping resolvers ride in the section, matching the ship
precedent (adversarial.md carries LEARNINGS_LOG + GBRAIN_SAVE_RESULTS).

Tests (atomic with the carve — skill-docs.yml gates gen:skill-docs
freshness on every push, so source + regen + tests must land together):
- parity-harness: plan-ceo flipped to sectioned, maxSkeletonBytes 90_000
  (measured 80,731 + headroom); content/minBytes run against the union.
- skill-size-budget: plan-ceo-review added to SECTIONS_EXTRACTED.
- section-manifest-consistency: generalized to discover every carved skill,
  vars computed per-skill-case (Codex P2).
- skill-ceo-section-ordering (new, gate): per-PR static guard — STOP after
  Step 0, review body absent from skeleton, report writer in the section,
  nothing review-governing below the STOP.
- skill-e2e-plan-ceo-review-section-loading (new, periodic): refreshes the
  installed skill first (Codex P1), drives full Step 0, asserts the section
  is Read before the report.
- gen-skill-docs + skill-validation: read the skeleton+sections union for
  carved skills so relocated prose still counts.
- touchfiles: plan-ceo-section-loading registered (periodic).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: bump VERSION + CHANGELOG for plan-ceo-review carve (v1.56.0.0)

MINOR: carves the largest skill into skeleton + on-demand section,
dropping plan-ceo-review's always-loaded cost 42% (138,838 -> 80,731 B,
~14.4K tokens off every invocation). User-facing release notes lead with
the measured token win.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(todos): file P3 follow-up — carve the shared {{PREAMBLE}} reference blocks

Surfaced by /plan-eng-review on the plan-ceo-review carve: per-skill section
carves stay modest because the ~40-50KB shared preamble dominates the
always-loaded surface. A single preamble-reference carve would help every
tier->=2 skill at once. Records the why, the cold-vs-hot split to measure,
and the guards it needs. Not implemented this PR.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(auq): Layer 0 — guarantee AUQ format spec is always-loaded

Deterministic, free, per-PR keystone for the token-reduction era. For every
interactive (tier>=2) skill, asserts the full AskUserQuestion decision-brief
format (ELI10/Recommendation/Pros-cons/checks/Net/(recommended)/Stakes/
self-check) lives in the always-loaded SKILL.md skeleton, NOT only in an
on-demand section. Plus a roster guard (a carve can't silently drop the block)
and per-skill rule survival in the skeleton+sections union. 51 cases + a
negative control. Fails the instant a future carve strands AUQ-governing text
where it won't be loaded when a question fires.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(auq): SDK capture engine + verbose-vs-carved no-degradation A/B

Adds the reusable SDK $OUT_FILE capture engine (auq-sdk-capture.ts): drives a
skill to its AUQ and captures the verbatim text the model GENERATES, cleanly
(real-PTY mangles plan-mode AUQs via cursor escapes). Pins the skill to an
absolute path with Read/Write-only tools so the agent can't wander to the
global install. gradeAuqRecommendation normalizes a non-"because" connective
before grading so substantive reasons aren't false-flagged (without touching
the pinned shared judge).

The A/B drives the same prompt through the carved 80KB skeleton and the
pre-carve 137KB monolith and fails if carved scores worse. Result: both 7/7
format, substance 5 — proven no degradation, transcript-verified each side read
its own planted SKILL.md. Periodic tier.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(auq): consistency — same trigger N runs, stable format + substance

Drives the carved /plan-ceo-review AUQ N=3 times and fails if any format
element appears in one run but not another, or substance craters. Targets the
"fine one run, broken the next" failure class a single snapshot can't see.
Result: 3/3 stable, 7/7 + substance 5 every run. Periodic tier.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(auq): behavioral matrix across AUQ-heavy skills

Data-driven test that drives each AUQ-heavy skill (plan-eng/design/devex,
office-hours, cso, spec, design-consultation) to its first AskUserQuestion and
grades it to the plan-ceo bar: 7/7 decision-brief format + recommendation
substance >=4. One case per skill (isolated failures), env-subsettable via
AUQ_MATRIX_ONLY. Browser/design-binary skills are intentionally excluded
(comparison boards, not format-AUQs; Layer 0 covers their spec). All targeted
skills pass 7/7 with substance 4-5. Periodic tier.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(codex): live recommendation-substance grade for /codex

Closes the gap where /codex's synthesis recommendation was only checked
statically (template grep) and via fixtures. Drives the real /codex skill over
a flawed diff and grades the emitted "Recommendation: ... because ..." line
with judgeRecommendation (present/commits/has_because/substance>=4). The named
weak spot holds up: substance 5. Periodic tier.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(auq): deterministic trigger for format-compliance gate

A bare /plan-ceo-review against a repo whose work is already implemented makes
the model improvise an off-script "what should I review?" scope question that
skips the decision-brief format, which the gate test then times out waiting for.
Hand it a concrete plan to review (FORCING_FLOOR_CEO) so it reaches the real
Step 0 mode-selection AUQ that is the intended format check.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(office-hours): carve Phase 5+6 into on-demand section

Third Phase B carve (v2_PLAN.md:216, after ship and plan-ceo-review). Moves
Phase 5 (Design Doc templates) + Phase 6 (tiered relationship handoff) — the
session's output + closing tail, only reached after the conversation and
alternatives are done — into sections/design-and-handoff.md, behind a single
STOP-Read after Phase 4.5. The live conversation (Phases 1-4.5) and the
always-run Important Rules stay in the always-loaded skeleton.

Measured: always-loaded skeleton 118,280 -> 88,975 B (-24.8%). Union preserved.
The carved AUQ is identical to pre-carve (matrix: 7/7 format, substance 5),
and Layer 0 confirms the AUQ format spec stays in the skeleton — the AUQ
paranoid suite de-risked this carve end to end.

Atomic with tests + regen (skill-docs.yml gates gen:skill-docs freshness on
every push, so source + regen + tests land together; --host all regenerates
the inlined non-Claude variants):
- sections/manifest.json: passive registry, one entry.
- parity-harness: office-hours flipped to sectioned, maxSkeletonBytes 96_000
  (measured 88,975 + headroom); content/minBytes run against the union.
- skill-size-budget: office-hours added to SECTIONS_EXTRACTED.
- gen-skill-docs + skill-validation: read the skeleton+sections union for
  office-hours so relocated Phase 5/6 prose still counts.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: bump VERSION + CHANGELOG for office-hours carve + AUQ suite (v1.57.0.0)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(preamble): carve CJK-escaping manual to on-demand doc

The AskUserQuestion format block is inlined into every interactive skill (~33).
It carried the full multi-paragraph non-ASCII/CJK escaping manual inline, but
that rationale only matters when a question contains CJK text and the operative
rule already lives in the always-loaded self-check. Moved the justification to
docs/askuserquestion-cjk.md (read on demand); kept the rule + a pointer.

Corpus: Claude-host SKILL.md total 3,087,499 -> 3,057,975 B (-29,524 B, ~900 B
x ~33 skills). Layer 0 still passes — the core decision-brief format stays
always-loaded; only the rare CJK rationale moved. Atomic with the all-host
regen (skill-docs.yml freshness gate). VERSION + package.json -> 1.58.0.0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(plan-eng-review): carve review body into on-demand section

Fourth Phase B carve (v2_PLAN.md:220). Moves the 4-section review (Architecture,
Code Quality, Tests, Performance), outside voice, required outputs, and review
report — everything after Step 0 scope — into sections/review-sections.md behind
a single STOP-Read. Step 0 (scope challenge) and EXIT_PLAN_MODE_GATE stay in the
always-loaded skeleton.

Measured: skeleton 106,984 -> 54,892 B (-48.7%). Union preserved. Atomic with
tests + all-host regen (freshness gate): parity flipped to sectioned
(maxSkeletonBytes 62K), plan-eng-review added to SECTIONS_EXTRACTED, gen-skill-docs
reads the union for relocated review/TEST_COVERAGE/dashboard prose. Layer 0 green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(plan-design-review): carve review body into on-demand section

Fifth Phase B carve (v2_PLAN.md:220, bundled with plan-eng). Moves the 7 design
passes, required outputs, and review report — everything after Step 0 scope and
the mockup/rating phase — into sections/review-sections.md behind a STOP-Read.
Step 0, Step 0.5 mockups, the rating method, and EXIT_PLAN_MODE_GATE stay in the
always-loaded skeleton.

Measured: skeleton 112,057 -> 76,024 B (-32.2%). Union preserved. Atomic with
tests + all-host regen: parity sectioned (maxSkeletonBytes 82K), added to
SECTIONS_EXTRACTED, gen-skill-docs reads the union. Layer 0 green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(plan-devex-review): carve review body into on-demand section

Sixth Phase B carve. Moves the 8 DX passes, required outputs, and review report
— everything after the Step 0 DX investigation — into sections/review-sections.md
behind a STOP-Read. All of Step 0 (persona, empathy, benchmark, journey trace,
roleplay) + the rating method + EXIT_PLAN_MODE_GATE stay always-loaded.

Measured: skeleton 110,621 -> 69,658 B (-37%). Union preserved. Atomic with
tests + all-host regen: added to SECTIONS_EXTRACTED, gen-skill-docs reads the
union. Layer 0 green. (No parity invariant entry for plan-devex-review.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: bump VERSION + CHANGELOG for plan-* family carves (v1.59.0.0)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test: refresh ship golden baselines + gbrain-detection union after carves

Two follow-ups the carve commits should have carried (caught by the full suite,
missed by targeted subsets):
- ship golden baselines (claude/codex/factory) regenerated: the preamble CJK
  trim (v1.58) changed ship's always-loaded AskUserQuestion block.
- gbrain-detection-override probes the office-hours skeleton+section union:
  GBRAIN_SAVE_RESULTS moved into sections/design-and-handoff.md when office-hours
  was carved, so the detection assertions now check both files.

Full `bun test` green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(auq): grade format-compliance gate from SDK capture, not the TUI

The real-PTY version grepped the stripAnsi'd interactive AUQ picker. Verified
directly that this cannot work: plan-mode AUQs render as a cursor picker whose
cursor-positioning escapes stripAnsi can't flatten — the picker renders fine for
a human (cursorSeen=45) but the flattened text drops ELI10:/(recommended) and
parseNumberedOptions returns 0. The test was grading a lossy projection and
failed by construction.

Rewritten to drive /plan-ceo-review via the SDK $OUT_FILE capture (the agent
writes the verbatim question it would have shown — clean text, no rendering
loss) and grade 7/7 format + kind-note + recommendation substance >=4. Same
property, reliable, environment-independent; shares the engine with the periodic
A/B and matrix evals. Result: 7/7 format, substance 5. Touchfiles key renamed
ask-user-question-format-pty -> auq-format-gate (no longer a PTY test).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test: fix carve-broken CI evals (union reads + section fixtures)

Two CI eval jobs failed on the carved plan-* skills because they read content
that moved into sections/:

- llm-judge (skill-llm-eval): runWorkflowJudge sliced SKILL.md between markers
  like "## Review Sections" / "## CRITICAL RULE" that now live in
  sections/review-sections.md. The markers vanished from the skeleton, so the
  judge scored empty/wrong content. Fix: read the skeleton+sections union.
  Verified: plan-ceo modes / plan-eng sections / plan-design passes all PASS
  (25/25).

- e2e-plan (skill-e2e-plan): setupPlanDir copied only <skill>/SKILL.md into the
  fixture, not sections/. The carved skill's STOP pointed at a section file that
  was absent, so the model improvised a compressed report table instead of the
  canonical "| Review | Trigger | Why | Runs | Status | Findings |". Fix: copy
  sections/ alongside SKILL.md in all 6 setup sites. Verified: report test PASS,
  canonical table emitted.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test: copy carved sections into all e2e fixtures (prevent more carve-blind CI fails)

Proactive sweep beyond the two CI logs: every e2e test that copies a carved
skill's SKILL.md into a temp fixture must also copy its sections/, or the
model hits a STOP pointing at a missing section file and improvises/degrades.

- skill-e2e.test.ts: plan-ceo/plan-eng/plan-design/office-hours copies across
  planDir/reviewDir/ohDir/benefitsDir dests now copy sections/.
- skill-e2e-plan.test.ts: the office-hours copy + the 4-skill codex-offering
  loop now copy sections/.
- skill-e2e-design.test.ts: plan-design-review copy now copies sections/.
- skill-e2e-office-hours.test.ts: both office-hours copies now copy sections/.
- skill-e2e-office-hours-brain-writeback.test.ts: GBRAIN_SAVE_RESULTS moved into
  the section, so check the regenerated skeleton+section UNION for the gbrain put
  block, ship both into the workdir, and restore both (the section regen was also
  leaking into the working tree — finally now restores it).

ship copies (single-file Step-0 slices) and review/retro (not carved) untouched.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test: migrate section-loading E2E to lossless SDK tool-stream detection

The /ship and /plan-ceo-review section-loading tests drove a real PTY and
scraped the ANSI screen buffer for sections/<file>.md paths. That silently
saw nothing in a Conductor PTY (cursor-positioned tool renders and an
unanswered Step 0 question loop both defeat the regex), so both reported
read: [] even when the agent did the work.

They now run the skill through claude -p (the same SDK path the AUQ matrix
uses) and detect section reads from the tool-use stream — Read calls whose
file_path contains sections/<file>.md — with no rendering layer to mangle.
The run is also hermetic: the freshly-generated worktree skeleton + sections
are copied into a throwaway fixture with the absolute path pinned, so the
test validates this branch's carve without mutating the user's ~/.claude
install.

Validated EVALS_TIER=periodic: both pass (plan-ceo Reads review-sections.md;
ship Reads review-army.md + changelog.md), ~6.5 min for both vs ~23 min
combined on the old PTY path where both were failing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: consolidate branch to v1.56.0.0 (single MINOR above main)

The branch bumped VERSION several times during development (1.56 → 1.57 →
1.58 → 1.59), but none of those landed on main (main is at 1.55.1.0). Per
the "never orphan branch-internal versions" discipline, collapse all four
into a single 1.56.0.0 entry — one MINOR release covering the whole branch:
five skills carved (plan-ceo, office-hours, plan-eng, plan-design,
plan-devex), the shared AskUserQuestion preamble CJK trim, and the paranoid
AUQ no-degradation test suite + lossless section-loading tests.

VERSION and package.json set to 1.56.0.0; main's 1.55.1.0 entry preserved
below the consolidated entry. No SKILL.md drift (VERSION is not embedded in
generated bodies).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 11:14:43 -07:00

392 lines
17 KiB
Cheetah

## Review Sections (8 passes, after Step 0 is complete)
**Anti-skip rule:** Never condense, abbreviate, or skip any review pass (1-8) regardless of plan type (strategy, spec, code, infra). Every pass in this skill exists for a reason. "This is a strategy doc so DX passes don't apply" is always wrong — DX gaps are where adoption breaks down. If a pass genuinely has zero findings, say "No issues found" and move on — but you must evaluate it.
{{ANTI_SHORTCUT_CLAUSE}}
{{LEARNINGS_SEARCH}}
### DX Trend Check
Before starting review passes, check for prior DX reviews on this project:
```bash
eval "$(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)"
~/.claude/skills/gstack/bin/gstack-review-read 2>/dev/null | grep plan-devex-review || echo "NO_PRIOR_DX_REVIEWS"
```
If prior reviews exist, display the trend:
```
DX TREND (prior reviews):
Dimension | Prior Score | Notes
Getting Started | 4/10 | from 2026-03-15
...
```
### Pass 1: Getting Started Experience (Zero Friction)
Rate 0-10: Can a developer go from zero to hello world in under 5 minutes?
**Evidence recall:** Reference the competitive benchmark from 0C (target tier), the
magical moment from 0D (delivery vehicle), and any Install/Hello World friction
points from 0F.
Load reference: Read the "## Pass 1" section from `~/.claude/skills/gstack/plan-devex-review/dx-hall-of-fame.md`.
Evaluate:
- **Installation**: One command? One click? No prerequisites?
- **First run**: Does the first command produce visible, meaningful output?
- **Sandbox/Playground**: Can developers try before installing?
- **Free tier**: No credit card, no sales call, no company email?
- **Quick start guide**: Copy-paste complete? Shows real output?
- **Auth/credential bootstrapping**: How many steps between "I want to try" and "it works"?
- **Magical moment delivery**: Is the vehicle chosen in 0D actually in the plan?
- **Competitive gap**: How far is the TTHW from the target tier chosen in 0C?
FIX TO 10: Write the ideal getting started sequence. Specify exact commands,
expected output, and time budget per step. Target: 3 steps or fewer, under the
time chosen in 0C.
Stripe test: Can a [persona from 0A] go from "never heard of this" to "it worked"
in one terminal session without leaving the terminal?
**STOP.** AskUserQuestion once per issue. Recommend + WHY. Reference the persona.
### Pass 2: API/CLI/SDK Design (Usable + Useful)
Rate 0-10: Is the interface intuitive, consistent, and complete?
**Evidence recall:** Does the API surface match [persona from 0A]'s mental model?
A YC founder expects `tool.do(thing)`. A platform engineer expects
`tool.configure(options).execute(thing)`.
Load reference: Read the "## Pass 2" section from `~/.claude/skills/gstack/plan-devex-review/dx-hall-of-fame.md`.
Evaluate:
- **Naming**: Guessable without docs? Consistent grammar?
- **Defaults**: Every parameter has a sensible default? Simplest call gives useful result?
- **Consistency**: Same patterns across the entire API surface?
- **Completeness**: 100% coverage or do devs drop to raw HTTP for edge cases?
- **Discoverability**: Can devs explore from CLI/playground without docs?
- **Reliability/trust**: Latency, retries, rate limits, idempotency, offline behavior?
- **Progressive disclosure**: Simple case is production-ready, complexity revealed gradually?
- **Persona fit**: Does the interface match how [persona] thinks about the problem?
Good API design test: Can a [persona] use this API correctly after seeing one example?
**STOP.** AskUserQuestion once per issue. Recommend + WHY.
### Pass 3: Error Messages & Debugging (Fight Uncertainty)
Rate 0-10: When something goes wrong, does the developer know what happened, why,
and how to fix it?
**Evidence recall:** Reference any error-related friction points from 0F and confusion
points from 0G.
Load reference: Read the "## Pass 3" section from `~/.claude/skills/gstack/plan-devex-review/dx-hall-of-fame.md`.
**Trace 3 specific error paths** from the plan or codebase. For each, evaluate against
the three-tier system from the Hall of Fame:
- **Tier 1 (Elm):** Conversational, first person, exact location, suggested fix
- **Tier 2 (Rust):** Error code links to tutorial, primary + secondary labels, help section
- **Tier 3 (Stripe API):** Structured JSON with type, code, message, param, doc_url
For each error path, show what the developer currently sees vs. what they should see.
Also evaluate:
- **Permission/sandbox/safety model**: What can go wrong? How clear is the blast radius?
- **Debug mode**: Verbose output available?
- **Stack traces**: Useful or internal framework noise?
**STOP.** AskUserQuestion once per issue. Recommend + WHY.
### Pass 4: Documentation & Learning (Findable + Learn by Doing)
Rate 0-10: Can a developer find what they need and learn by doing?
**Evidence recall:** Does the docs architecture match [persona from 0A]'s learning
style? A YC founder needs copy-paste examples front and center. A platform engineer
needs architecture docs and API reference.
Load reference: Read the "## Pass 4" section from `~/.claude/skills/gstack/plan-devex-review/dx-hall-of-fame.md`.
Evaluate:
- **Information architecture**: Find what they need in under 2 minutes?
- **Progressive disclosure**: Beginners see simple, experts find advanced?
- **Code examples**: Copy-paste complete? Work as-is? Real context?
- **Interactive elements**: Playgrounds, sandboxes, "try it" buttons?
- **Versioning**: Docs match the version dev is using?
- **Tutorials vs references**: Both exist?
**STOP.** AskUserQuestion once per issue. Recommend + WHY.
### Pass 5: Upgrade & Migration Path (Credible)
Rate 0-10: Can developers upgrade without fear?
Load reference: Read the "## Pass 5" section from `~/.claude/skills/gstack/plan-devex-review/dx-hall-of-fame.md`.
Evaluate:
- **Backward compatibility**: What breaks? Blast radius limited?
- **Deprecation warnings**: Advance notice? Actionable? ("use newMethod() instead")
- **Migration guides**: Step-by-step for every breaking change?
- **Codemods**: Automated migration scripts?
- **Versioning strategy**: Semantic versioning? Clear policy?
**STOP.** AskUserQuestion once per issue. Recommend + WHY.
### Pass 6: Developer Environment & Tooling (Valuable + Accessible)
Rate 0-10: Does this integrate into developers' existing workflows?
**Evidence recall:** Does local dev setup work for [persona from 0A]'s typical
environment?
Load reference: Read the "## Pass 6" section from `~/.claude/skills/gstack/plan-devex-review/dx-hall-of-fame.md`.
Evaluate:
- **Editor integration**: Language server? Autocomplete? Inline docs?
- **CI/CD**: Works in GitHub Actions, GitLab CI? Non-interactive mode?
- **TypeScript support**: Types included? Good IntelliSense?
- **Testing support**: Easy to mock? Test utilities?
- **Local development**: Hot reload? Watch mode? Fast feedback?
- **Cross-platform**: Mac, Linux, Windows? Docker? ARM/x86?
- **Local env reproducibility**: Works across OS, package managers, containers, proxies?
- **Observability/testability**: Dry-run mode? Verbose output? Sample apps? Fixtures?
**STOP.** AskUserQuestion once per issue. Recommend + WHY.
### Pass 7: Community & Ecosystem (Findable + Desirable)
Rate 0-10: Is there a community, and does the plan invest in ecosystem health?
Load reference: Read the "## Pass 7" section from `~/.claude/skills/gstack/plan-devex-review/dx-hall-of-fame.md`.
Evaluate:
- **Open source**: Code open? Permissive license?
- **Community channels**: Where do devs ask questions? Someone answering?
- **Examples**: Real-world, runnable? Not just hello world?
- **Plugin/extension ecosystem**: Can devs extend it?
- **Contributing guide**: Process clear?
- **Pricing transparency**: No surprise bills?
**STOP.** AskUserQuestion once per issue. Recommend + WHY.
### Pass 8: DX Measurement & Feedback Loops (Implement + Refine)
Rate 0-10: Does the plan include ways to measure and improve DX over time?
Load reference: Read the "## Pass 8" section from `~/.claude/skills/gstack/plan-devex-review/dx-hall-of-fame.md`.
Evaluate:
- **TTHW tracking**: Can you measure getting started time? Is it instrumented?
- **Journey analytics**: Where do devs drop off?
- **Feedback mechanisms**: Bug reports? NPS? Feedback button?
- **Friction audits**: Periodic reviews planned?
- **Boomerang readiness**: Will /devex-review be able to measure reality vs. plan?
**STOP.** AskUserQuestion once per issue. Recommend + WHY.
### Appendix: Claude Code Skill DX Checklist
**Conditional: only run when product type includes "Claude Code skill".**
This is NOT a scored pass. It's a checklist of proven patterns from gstack's own DX.
Load reference: Read the "## Claude Code Skill DX Checklist" section from
`~/.claude/skills/gstack/plan-devex-review/dx-hall-of-fame.md`.
Check each item. For any unchecked item, explain what's missing and suggest the fix.
**STOP.** AskUserQuestion for any item that requires a design decision.
{{CODEX_PLAN_REVIEW}}
When constructing the outside voice prompt, include the Developer Persona from Step 0A
and the Competitive Benchmark from Step 0C. The outside voice should critique the plan
in the context of who is using it and what they're competing against.
## CRITICAL RULE — How to ask questions
Follow the AskUserQuestion format from the Preamble above. Additional rules for
DX reviews:
* **One issue = one AskUserQuestion call.** Never combine multiple issues.
* **Ground every question in evidence.** Reference the persona, competitive benchmark,
empathy narrative, or friction trace. Never ask a question in the abstract.
* **Frame pain from the persona's perspective.** Not "developers would be frustrated"
but "[persona from 0A] would hit this at minute [N] of their getting-started flow
and [specific consequence: abandon, file an issue, hack a workaround]."
* Present 2-3 options. For each: effort to fix, impact on developer adoption.
* **Map to DX First Principles above.** One sentence connecting your recommendation
to a specific principle (e.g., "This violates 'zero friction at T0' because
[persona] needs 3 extra config steps before their first API call").
* **Zero findings:** if a section has zero findings, state "No issues, moving on"
and proceed. Otherwise, use AskUserQuestion for each gap — a gap with an
"obvious fix" is still a gap and still needs user approval before any change
lands in the plan.
* Assume the user hasn't looked at this window in 20 minutes. Re-ground every question.
## Required Outputs
### Developer Persona Card
The persona card from Step 0A. This goes at the top of the plan's DX section.
### Developer Empathy Narrative
The first-person narrative from Step 0B, updated with user corrections.
### Competitive DX Benchmark
The benchmark table from Step 0C, updated with the product's post-review scores.
### Magical Moment Specification
The chosen delivery vehicle from Step 0D with implementation requirements.
### Developer Journey Map
The journey map from Step 0F, updated with all friction point resolutions.
### First-Time Developer Confusion Report
The roleplay report from Step 0G, annotated with which items were addressed.
### "NOT in scope" section
DX improvements considered and explicitly deferred, with one-line rationale each.
### "What already exists" section
Existing docs, examples, error handling, and DX patterns that the plan should reuse.
### TODOS.md updates
After all review passes are complete, present each potential TODO as its own individual
AskUserQuestion. Never batch. For DX debt: missing error messages, unspecified upgrade
paths, documentation gaps, missing SDK languages. Each TODO gets:
* **What:** One-line description
* **Why:** The concrete developer pain it causes
* **Pros:** What you gain (adoption, retention, satisfaction)
* **Cons:** Cost, complexity, or risks
* **Context:** Enough detail for someone to pick this up in 3 months
* **Depends on / blocked by:** Prerequisites
Options: **A)** Add to TODOS.md **B)** Skip **C)** Build it now
### DX Scorecard
```
+====================================================================+
| DX PLAN REVIEW — SCORECARD |
+====================================================================+
| Dimension | Score | Prior | Trend |
|----------------------|--------|--------|--------|
| Getting Started | __/10 | __/10 | __ ↑↓ |
| API/CLI/SDK | __/10 | __/10 | __ ↑↓ |
| Error Messages | __/10 | __/10 | __ ↑↓ |
| Documentation | __/10 | __/10 | __ ↑↓ |
| Upgrade Path | __/10 | __/10 | __ ↑↓ |
| Dev Environment | __/10 | __/10 | __ ↑↓ |
| Community | __/10 | __/10 | __ ↑↓ |
| DX Measurement | __/10 | __/10 | __ ↑↓ |
+--------------------------------------------------------------------+
| TTHW | __ min | __ min | __ ↑↓ |
| Competitive Rank | [Champion/Competitive/Needs Work/Red Flag] |
| Magical Moment | [designed/missing] via [delivery vehicle] |
| Product Type | [type] |
| Mode | [EXPANSION/POLISH/TRIAGE] |
| Overall DX | __/10 | __/10 | __ ↑↓ |
+====================================================================+
| DX PRINCIPLE COVERAGE |
| Zero Friction | [covered/gap] |
| Learn by Doing | [covered/gap] |
| Fight Uncertainty | [covered/gap] |
| Opinionated + Escape Hatches | [covered/gap] |
| Code in Context | [covered/gap] |
| Magical Moments | [covered/gap] |
+====================================================================+
```
If all passes 8+: "DX plan is solid. Developers will have a good experience."
If any below 6: Flag as critical DX debt with specific impact on adoption.
If TTHW > 10 min: Flag as blocking issue.
### DX Implementation Checklist
```
DX IMPLEMENTATION CHECKLIST
============================
[ ] Time to hello world < [target from 0C]
[ ] Installation is one command
[ ] First run produces meaningful output
[ ] Magical moment delivered via [vehicle from 0D]
[ ] Every error message has: problem + cause + fix + docs link
[ ] API/CLI naming is guessable without docs
[ ] Every parameter has a sensible default
[ ] Docs have copy-paste examples that actually work
[ ] Examples show real use cases, not just hello world
[ ] Upgrade path documented with migration guide
[ ] Breaking changes have deprecation warnings + codemods
[ ] TypeScript types included (if applicable)
[ ] Works in CI/CD without special configuration
[ ] Free tier available, no credit card required
[ ] Changelog exists and is maintained
[ ] Search works in documentation
[ ] Community channel exists and is monitored
```
{{TASKS_SECTION_EMIT:devex-review}}
### Unresolved Decisions
If any AskUserQuestion goes unanswered, note here. Never silently default.
{{REVIEW_DASHBOARD}}
{{PLAN_FILE_REVIEW_REPORT}}
{{LEARNINGS_LOG}}
{{GBRAIN_SAVE_RESULTS}}
{{BRAIN_WRITE_BACK}}
{{BRAIN_CACHE_REFRESH}}
## Next Steps — Review Chaining
After displaying the Review Readiness Dashboard, recommend next reviews:
**Recommend /plan-eng-review if eng review is not skipped globally** — DX issues often
have architectural implications. If this DX review found API design problems, error
handling gaps, or CLI ergonomics issues, eng review should validate the fixes.
**Suggest /plan-design-review if user-facing UI exists** — DX review focuses on
developer-facing surfaces; design review covers end-user-facing UI.
**Recommend /devex-review after implementation** — the boomerang. Plan said TTHW would
be [target from 0C]. Did reality match? Run /devex-review on the live product to find
out. This is where the competitive benchmark pays off: you have a concrete target to
measure against.
Use AskUserQuestion with applicable options:
- **A)** Run /plan-eng-review next (required gate)
- **B)** Run /plan-design-review (only if UI scope detected)
- **C)** Ready to implement, run /devex-review after shipping
- **D)** Skip, I'll handle next steps manually
## Mode Quick Reference
```
| DX EXPANSION | DX POLISH | DX TRIAGE
Scope | Push UP (opt-in) | Maintain | Critical only
Posture | Enthusiastic | Rigorous | Surgical
Competitive | Full benchmark | Full benchmark | Skip
Magical | Full design | Verify exists | Skip
Journey | All stages + | All stages | Install + Hello
| best-in-class | | World only
Passes | All 8, expanded | All 8, standard | Pass 1 + 3 only
Outside voice| Recommended | Recommended | Skip
```
## Formatting Rules
* NUMBER issues (1, 2, 3...) and LETTERS for options (A, B, C...).
* Label with NUMBER + LETTER (e.g., "3A", "3B").
* One sentence max per option.
* After each pass, pause and wait for feedback before moving on.
* Rate before and after each pass for scannability.