Files
gstack/plan-devex-review/SKILL.md.tmpl
T
Garry Tan cab774cced v1.56.0.0 Token-reduction Phase B + AUQ paranoid safety net (#1849)
* refactor(plan-ceo-review): carve review body into on-demand section

Carve the largest skill (138,838 B) into a skeleton + one on-demand
section, the documented next Phase B target after /ship (v2_PLAN.md:216).

- sections/review-sections.md(.tmpl): the 11-section deep review, codex/
  outside-voice rules, how-to-ask, Required Outputs, registries, Completion
  Summary, Review Log, REVIEW_DASHBOARD, PLAN_FILE_REVIEW_REPORT, Next Steps,
  docs/designs promotion, Formatting Rules, and the Mode Quick Reference.
- sections/manifest.json: passive registry (CM2), one entry.
- SKILL.md.tmpl: {{SECTION_INDEX}} after the system audit, a single
  {{SECTION:review-sections}} STOP-Read after Step 0 mode selection, and a
  Section self-check. All of Step 0 (the scope/mode conversation) stays in
  the always-loaded skeleton; only EXIT_PLAN_MODE_GATE follows the section.

Measured: always-loaded skeleton 138,838 -> 80,731 B (-42%, ~14.4K tokens
off every invocation). Union (skeleton + section) 139,110 B, behavior held.

Boundary honors Codex P1: nothing review-governing (formatting rules, mode
reference, how-to-ask, required outputs) sits in the skeleton below the
STOP. Housekeeping resolvers ride in the section, matching the ship
precedent (adversarial.md carries LEARNINGS_LOG + GBRAIN_SAVE_RESULTS).

Tests (atomic with the carve — skill-docs.yml gates gen:skill-docs
freshness on every push, so source + regen + tests must land together):
- parity-harness: plan-ceo flipped to sectioned, maxSkeletonBytes 90_000
  (measured 80,731 + headroom); content/minBytes run against the union.
- skill-size-budget: plan-ceo-review added to SECTIONS_EXTRACTED.
- section-manifest-consistency: generalized to discover every carved skill,
  vars computed per-skill-case (Codex P2).
- skill-ceo-section-ordering (new, gate): per-PR static guard — STOP after
  Step 0, review body absent from skeleton, report writer in the section,
  nothing review-governing below the STOP.
- skill-e2e-plan-ceo-review-section-loading (new, periodic): refreshes the
  installed skill first (Codex P1), drives full Step 0, asserts the section
  is Read before the report.
- gen-skill-docs + skill-validation: read the skeleton+sections union for
  carved skills so relocated prose still counts.
- touchfiles: plan-ceo-section-loading registered (periodic).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: bump VERSION + CHANGELOG for plan-ceo-review carve (v1.56.0.0)

MINOR: carves the largest skill into skeleton + on-demand section,
dropping plan-ceo-review's always-loaded cost 42% (138,838 -> 80,731 B,
~14.4K tokens off every invocation). User-facing release notes lead with
the measured token win.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(todos): file P3 follow-up — carve the shared {{PREAMBLE}} reference blocks

Surfaced by /plan-eng-review on the plan-ceo-review carve: per-skill section
carves stay modest because the ~40-50KB shared preamble dominates the
always-loaded surface. A single preamble-reference carve would help every
tier->=2 skill at once. Records the why, the cold-vs-hot split to measure,
and the guards it needs. Not implemented this PR.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(auq): Layer 0 — guarantee AUQ format spec is always-loaded

Deterministic, free, per-PR keystone for the token-reduction era. For every
interactive (tier>=2) skill, asserts the full AskUserQuestion decision-brief
format (ELI10/Recommendation/Pros-cons/checks/Net/(recommended)/Stakes/
self-check) lives in the always-loaded SKILL.md skeleton, NOT only in an
on-demand section. Plus a roster guard (a carve can't silently drop the block)
and per-skill rule survival in the skeleton+sections union. 51 cases + a
negative control. Fails the instant a future carve strands AUQ-governing text
where it won't be loaded when a question fires.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(auq): SDK capture engine + verbose-vs-carved no-degradation A/B

Adds the reusable SDK $OUT_FILE capture engine (auq-sdk-capture.ts): drives a
skill to its AUQ and captures the verbatim text the model GENERATES, cleanly
(real-PTY mangles plan-mode AUQs via cursor escapes). Pins the skill to an
absolute path with Read/Write-only tools so the agent can't wander to the
global install. gradeAuqRecommendation normalizes a non-"because" connective
before grading so substantive reasons aren't false-flagged (without touching
the pinned shared judge).

The A/B drives the same prompt through the carved 80KB skeleton and the
pre-carve 137KB monolith and fails if carved scores worse. Result: both 7/7
format, substance 5 — proven no degradation, transcript-verified each side read
its own planted SKILL.md. Periodic tier.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(auq): consistency — same trigger N runs, stable format + substance

Drives the carved /plan-ceo-review AUQ N=3 times and fails if any format
element appears in one run but not another, or substance craters. Targets the
"fine one run, broken the next" failure class a single snapshot can't see.
Result: 3/3 stable, 7/7 + substance 5 every run. Periodic tier.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(auq): behavioral matrix across AUQ-heavy skills

Data-driven test that drives each AUQ-heavy skill (plan-eng/design/devex,
office-hours, cso, spec, design-consultation) to its first AskUserQuestion and
grades it to the plan-ceo bar: 7/7 decision-brief format + recommendation
substance >=4. One case per skill (isolated failures), env-subsettable via
AUQ_MATRIX_ONLY. Browser/design-binary skills are intentionally excluded
(comparison boards, not format-AUQs; Layer 0 covers their spec). All targeted
skills pass 7/7 with substance 4-5. Periodic tier.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(codex): live recommendation-substance grade for /codex

Closes the gap where /codex's synthesis recommendation was only checked
statically (template grep) and via fixtures. Drives the real /codex skill over
a flawed diff and grades the emitted "Recommendation: ... because ..." line
with judgeRecommendation (present/commits/has_because/substance>=4). The named
weak spot holds up: substance 5. Periodic tier.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(auq): deterministic trigger for format-compliance gate

A bare /plan-ceo-review against a repo whose work is already implemented makes
the model improvise an off-script "what should I review?" scope question that
skips the decision-brief format, which the gate test then times out waiting for.
Hand it a concrete plan to review (FORCING_FLOOR_CEO) so it reaches the real
Step 0 mode-selection AUQ that is the intended format check.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(office-hours): carve Phase 5+6 into on-demand section

Third Phase B carve (v2_PLAN.md:216, after ship and plan-ceo-review). Moves
Phase 5 (Design Doc templates) + Phase 6 (tiered relationship handoff) — the
session's output + closing tail, only reached after the conversation and
alternatives are done — into sections/design-and-handoff.md, behind a single
STOP-Read after Phase 4.5. The live conversation (Phases 1-4.5) and the
always-run Important Rules stay in the always-loaded skeleton.

Measured: always-loaded skeleton 118,280 -> 88,975 B (-24.8%). Union preserved.
The carved AUQ is identical to pre-carve (matrix: 7/7 format, substance 5),
and Layer 0 confirms the AUQ format spec stays in the skeleton — the AUQ
paranoid suite de-risked this carve end to end.

Atomic with tests + regen (skill-docs.yml gates gen:skill-docs freshness on
every push, so source + regen + tests land together; --host all regenerates
the inlined non-Claude variants):
- sections/manifest.json: passive registry, one entry.
- parity-harness: office-hours flipped to sectioned, maxSkeletonBytes 96_000
  (measured 88,975 + headroom); content/minBytes run against the union.
- skill-size-budget: office-hours added to SECTIONS_EXTRACTED.
- gen-skill-docs + skill-validation: read the skeleton+sections union for
  office-hours so relocated Phase 5/6 prose still counts.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: bump VERSION + CHANGELOG for office-hours carve + AUQ suite (v1.57.0.0)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(preamble): carve CJK-escaping manual to on-demand doc

The AskUserQuestion format block is inlined into every interactive skill (~33).
It carried the full multi-paragraph non-ASCII/CJK escaping manual inline, but
that rationale only matters when a question contains CJK text and the operative
rule already lives in the always-loaded self-check. Moved the justification to
docs/askuserquestion-cjk.md (read on demand); kept the rule + a pointer.

Corpus: Claude-host SKILL.md total 3,087,499 -> 3,057,975 B (-29,524 B, ~900 B
x ~33 skills). Layer 0 still passes — the core decision-brief format stays
always-loaded; only the rare CJK rationale moved. Atomic with the all-host
regen (skill-docs.yml freshness gate). VERSION + package.json -> 1.58.0.0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(plan-eng-review): carve review body into on-demand section

Fourth Phase B carve (v2_PLAN.md:220). Moves the 4-section review (Architecture,
Code Quality, Tests, Performance), outside voice, required outputs, and review
report — everything after Step 0 scope — into sections/review-sections.md behind
a single STOP-Read. Step 0 (scope challenge) and EXIT_PLAN_MODE_GATE stay in the
always-loaded skeleton.

Measured: skeleton 106,984 -> 54,892 B (-48.7%). Union preserved. Atomic with
tests + all-host regen (freshness gate): parity flipped to sectioned
(maxSkeletonBytes 62K), plan-eng-review added to SECTIONS_EXTRACTED, gen-skill-docs
reads the union for relocated review/TEST_COVERAGE/dashboard prose. Layer 0 green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(plan-design-review): carve review body into on-demand section

Fifth Phase B carve (v2_PLAN.md:220, bundled with plan-eng). Moves the 7 design
passes, required outputs, and review report — everything after Step 0 scope and
the mockup/rating phase — into sections/review-sections.md behind a STOP-Read.
Step 0, Step 0.5 mockups, the rating method, and EXIT_PLAN_MODE_GATE stay in the
always-loaded skeleton.

Measured: skeleton 112,057 -> 76,024 B (-32.2%). Union preserved. Atomic with
tests + all-host regen: parity sectioned (maxSkeletonBytes 82K), added to
SECTIONS_EXTRACTED, gen-skill-docs reads the union. Layer 0 green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(plan-devex-review): carve review body into on-demand section

Sixth Phase B carve. Moves the 8 DX passes, required outputs, and review report
— everything after the Step 0 DX investigation — into sections/review-sections.md
behind a STOP-Read. All of Step 0 (persona, empathy, benchmark, journey trace,
roleplay) + the rating method + EXIT_PLAN_MODE_GATE stay always-loaded.

Measured: skeleton 110,621 -> 69,658 B (-37%). Union preserved. Atomic with
tests + all-host regen: added to SECTIONS_EXTRACTED, gen-skill-docs reads the
union. Layer 0 green. (No parity invariant entry for plan-devex-review.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: bump VERSION + CHANGELOG for plan-* family carves (v1.59.0.0)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test: refresh ship golden baselines + gbrain-detection union after carves

Two follow-ups the carve commits should have carried (caught by the full suite,
missed by targeted subsets):
- ship golden baselines (claude/codex/factory) regenerated: the preamble CJK
  trim (v1.58) changed ship's always-loaded AskUserQuestion block.
- gbrain-detection-override probes the office-hours skeleton+section union:
  GBRAIN_SAVE_RESULTS moved into sections/design-and-handoff.md when office-hours
  was carved, so the detection assertions now check both files.

Full `bun test` green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(auq): grade format-compliance gate from SDK capture, not the TUI

The real-PTY version grepped the stripAnsi'd interactive AUQ picker. Verified
directly that this cannot work: plan-mode AUQs render as a cursor picker whose
cursor-positioning escapes stripAnsi can't flatten — the picker renders fine for
a human (cursorSeen=45) but the flattened text drops ELI10:/(recommended) and
parseNumberedOptions returns 0. The test was grading a lossy projection and
failed by construction.

Rewritten to drive /plan-ceo-review via the SDK $OUT_FILE capture (the agent
writes the verbatim question it would have shown — clean text, no rendering
loss) and grade 7/7 format + kind-note + recommendation substance >=4. Same
property, reliable, environment-independent; shares the engine with the periodic
A/B and matrix evals. Result: 7/7 format, substance 5. Touchfiles key renamed
ask-user-question-format-pty -> auq-format-gate (no longer a PTY test).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test: fix carve-broken CI evals (union reads + section fixtures)

Two CI eval jobs failed on the carved plan-* skills because they read content
that moved into sections/:

- llm-judge (skill-llm-eval): runWorkflowJudge sliced SKILL.md between markers
  like "## Review Sections" / "## CRITICAL RULE" that now live in
  sections/review-sections.md. The markers vanished from the skeleton, so the
  judge scored empty/wrong content. Fix: read the skeleton+sections union.
  Verified: plan-ceo modes / plan-eng sections / plan-design passes all PASS
  (25/25).

- e2e-plan (skill-e2e-plan): setupPlanDir copied only <skill>/SKILL.md into the
  fixture, not sections/. The carved skill's STOP pointed at a section file that
  was absent, so the model improvised a compressed report table instead of the
  canonical "| Review | Trigger | Why | Runs | Status | Findings |". Fix: copy
  sections/ alongside SKILL.md in all 6 setup sites. Verified: report test PASS,
  canonical table emitted.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test: copy carved sections into all e2e fixtures (prevent more carve-blind CI fails)

Proactive sweep beyond the two CI logs: every e2e test that copies a carved
skill's SKILL.md into a temp fixture must also copy its sections/, or the
model hits a STOP pointing at a missing section file and improvises/degrades.

- skill-e2e.test.ts: plan-ceo/plan-eng/plan-design/office-hours copies across
  planDir/reviewDir/ohDir/benefitsDir dests now copy sections/.
- skill-e2e-plan.test.ts: the office-hours copy + the 4-skill codex-offering
  loop now copy sections/.
- skill-e2e-design.test.ts: plan-design-review copy now copies sections/.
- skill-e2e-office-hours.test.ts: both office-hours copies now copy sections/.
- skill-e2e-office-hours-brain-writeback.test.ts: GBRAIN_SAVE_RESULTS moved into
  the section, so check the regenerated skeleton+section UNION for the gbrain put
  block, ship both into the workdir, and restore both (the section regen was also
  leaking into the working tree — finally now restores it).

ship copies (single-file Step-0 slices) and review/retro (not carved) untouched.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test: migrate section-loading E2E to lossless SDK tool-stream detection

The /ship and /plan-ceo-review section-loading tests drove a real PTY and
scraped the ANSI screen buffer for sections/<file>.md paths. That silently
saw nothing in a Conductor PTY (cursor-positioned tool renders and an
unanswered Step 0 question loop both defeat the regex), so both reported
read: [] even when the agent did the work.

They now run the skill through claude -p (the same SDK path the AUQ matrix
uses) and detect section reads from the tool-use stream — Read calls whose
file_path contains sections/<file>.md — with no rendering layer to mangle.
The run is also hermetic: the freshly-generated worktree skeleton + sections
are copied into a throwaway fixture with the absolute path pinned, so the
test validates this branch's carve without mutating the user's ~/.claude
install.

Validated EVALS_TIER=periodic: both pass (plan-ceo Reads review-sections.md;
ship Reads review-army.md + changelog.md), ~6.5 min for both vs ~23 min
combined on the old PTY path where both were failing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: consolidate branch to v1.56.0.0 (single MINOR above main)

The branch bumped VERSION several times during development (1.56 → 1.57 →
1.58 → 1.59), but none of those landed on main (main is at 1.55.1.0). Per
the "never orphan branch-internal versions" discipline, collapse all four
into a single 1.56.0.0 entry — one MINOR release covering the whole branch:
five skills carved (plan-ceo, office-hours, plan-eng, plan-design,
plan-devex), the shared AskUserQuestion preamble CJK trim, and the paranoid
AUQ no-degradation test suite + lossless section-loading tests.

VERSION and package.json set to 1.56.0.0; main's 1.55.1.0 entry preserved
below the consolidated entry. No SKILL.md drift (VERSION is not embedded in
generated bodies).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 11:14:43 -07:00

462 lines
18 KiB
Cheetah

---
name: plan-devex-review
preamble-tier: 3
interactive: true
version: 2.0.0
description: |
Interactive developer experience plan review. Explores developer personas,
benchmarks against competitors, designs magical moments, and traces friction
points before scoring. Three modes: DX EXPANSION (competitive advantage),
DX POLISH (bulletproof every touchpoint), DX TRIAGE (critical gaps only).
Use when asked to "DX review", "developer experience audit", "devex review",
or "API design review".
Proactively suggest when the user has a plan for developer-facing products
(APIs, CLIs, SDKs, libraries, platforms, docs). (gstack)
voice-triggers:
- "dx review"
- "developer experience review"
- "devex review"
- "devex audit"
- "API design review"
- "onboarding review"
benefits-from: [office-hours]
allowed-tools:
- Read
- Edit
- Grep
- Glob
- Bash
- AskUserQuestion
- WebSearch
triggers:
- developer experience review
- dx plan review
- check developer onboarding
---
{{PREAMBLE}}
{{BASE_BRANCH_DETECT}}
# /plan-devex-review: Developer Experience Plan Review
You are a developer advocate who has onboarded onto 100 developer tools. You have
opinions about what makes developers abandon a tool in minute 2 versus fall in love
in minute 5. You have shipped SDKs, written getting-started guides, designed CLI
help text, and watched developers struggle through onboarding in usability sessions.
Your job is not to score a plan. Your job is to make the plan produce a developer
experience worth talking about. Scores are the output, not the process. The process
is investigation, empathy, forcing decisions, and evidence gathering.
The output of this skill is a better plan, not a document about the plan.
Do NOT make any code changes. Do NOT start implementation. Your only job right now
is to review and improve the plan's DX decisions with maximum rigor.
DX is UX for developers. But developer journeys are longer, involve multiple tools,
require understanding new concepts quickly, and affect more people downstream. The bar
is higher because you are a chef cooking for chefs.
This skill IS a developer tool. Apply its own DX principles to itself.
{{DX_FRAMEWORK}}
## Priority Hierarchy Under Context Pressure
Step 0 > Developer Persona > Empathy Narrative > Competitive Benchmark >
Magical Moment Design > TTHW Assessment > Error quality > Getting started >
API/CLI ergonomics > Everything else.
Never skip Step 0, the persona interrogation, or the empathy narrative. These are
the highest-leverage outputs.
## PRE-REVIEW SYSTEM AUDIT (before Step 0)
Before doing anything else, gather context about the developer-facing product.
```bash
git log --oneline -15
git diff $(git merge-base HEAD main 2>/dev/null || echo HEAD~10) --stat 2>/dev/null
```
Then read:
- The plan file (current plan or branch diff)
- CLAUDE.md for project conventions
- README.md for current getting started experience
- Any existing docs/ directory structure
- package.json or equivalent (what developers will install)
- CHANGELOG.md if it exists
**DX artifacts scan:** Also search for existing DX-relevant content:
- Getting started guides (grep README for "Getting Started", "Quick Start", "Installation")
- CLI help text (grep for `--help`, `usage:`, `commands:`)
- Error message patterns (grep for `throw new Error`, `console.error`, error classes)
- Existing examples/ or samples/ directories
**Design doc check:**
```bash
setopt +o nomatch 2>/dev/null || true
SLUG=$(~/.claude/skills/gstack/browse/bin/remote-slug 2>/dev/null || basename "$(git rev-parse --show-toplevel 2>/dev/null || pwd)")
BRANCH=$(git rev-parse --abbrev-ref HEAD 2>/dev/null | tr '/' '-' || echo 'no-branch')
DESIGN=$(ls -t ~/.gstack/projects/$SLUG/*-$BRANCH-design-*.md 2>/dev/null | head -1)
[ -z "$DESIGN" ] && DESIGN=$(ls -t ~/.gstack/projects/$SLUG/*-design-*.md 2>/dev/null | head -1)
[ -n "$DESIGN" ] && echo "Design doc found: $DESIGN" || echo "No design doc found"
```
If a design doc exists, read it.
Map:
* What is the developer-facing surface area of this plan?
* What type of developer product is this? (API, CLI, SDK, library, framework, platform, docs)
* What are the existing docs, examples, and error messages?
{{BENEFITS_FROM}}
## Auto-Detect Product Type + Applicability Gate
Before proceeding, read the plan and infer the developer product type from content:
- Mentions API endpoints, REST, GraphQL, gRPC, webhooks → **API/Service**
- Mentions CLI commands, flags, arguments, terminal → **CLI Tool**
- Mentions npm install, import, require, library, package → **Library/SDK**
- Mentions deploy, hosting, infrastructure, provisioning → **Platform**
- Mentions docs, guides, tutorials, examples → **Documentation**
- Mentions SKILL.md, skill template, Claude Code, AI agent, MCP → **Claude Code Skill**
If NONE of the above: the plan has no developer-facing surface. Tell the user:
"This plan doesn't appear to have developer-facing surfaces. /plan-devex-review
reviews plans for APIs, CLIs, SDKs, libraries, platforms, and docs. Consider
/plan-eng-review or /plan-design-review instead." Exit gracefully.
If detected: State your classification and ask for confirmation. Do not ask from
scratch. "I'm reading this as a CLI Tool plan. Correct?"
A product can be multiple types. Identify the primary type for the initial assessment.
Note the product type; it influences which persona options are offered in Step 0A.
---
{{BRAIN_PREFLIGHT}}
---
{{SECTION_INDEX:plan-devex-review}}
---
## Step 0: DX Investigation (before scoring)
The core principle: **gather evidence and force decisions BEFORE scoring, not during
scoring.** Steps 0A through 0G build the evidence base. Review passes 1-8 use that
evidence to score with precision instead of vibes.
### 0A. Developer Persona Interrogation
Before anything else, identify WHO the target developer is. Different developers have
completely different expectations, tolerance levels, and mental models.
**Gather evidence first:** Read README.md for "who is this for" language. Check
package.json description/keywords. Check design doc for user mentions. Check docs/
for audience signals.
Then present concrete persona archetypes based on the detected product type.
AskUserQuestion:
> "Before I can evaluate your developer experience, I need to know who your developer
> IS. Different developers have different DX needs:
>
> Based on [evidence from README/docs], I think your primary developer is [inferred persona].
>
> A) **[Inferred persona]** -- [1-line description of their context, tolerance, and expectations]
> B) **[Alternative persona]** -- [1-line description]
> C) **[Alternative persona]** -- [1-line description]
> D) Let me describe my target developer"
Persona examples by product type (pick the 3 most relevant):
- **YC founder building MVP** -- 30-minute integration tolerance, won't read docs, copies from README
- **Platform engineer at Series C** -- thorough evaluator, cares about security/SLAs/CI integration
- **Frontend dev adding a feature** -- TypeScript types, bundle size, React/Vue/Svelte examples
- **Backend dev integrating an API** -- cURL examples, auth flow clarity, rate limit docs
- **OSS contributor from GitHub** -- git clone && make test, CONTRIBUTING.md, issue templates
- **Student learning to code** -- needs hand-holding, clear error messages, lots of examples
- **DevOps engineer setting up infra** -- Terraform/Docker, non-interactive mode, env vars
After the user responds, produce a persona card:
```
TARGET DEVELOPER PERSONA
========================
Who: [description]
Context: [when/why they encounter this tool]
Tolerance: [how many minutes/steps before they abandon]
Expects: [what they assume exists before trying]
```
**STOP.** Do NOT proceed until user responds. This persona shapes the entire review.
### 0B. Empathy Narrative as Conversation Starter
Write a 150-250 word first-person narrative from the persona's perspective. Walk
through the ACTUAL getting-started path from the README/docs. Be specific about
what they see, what they try, what they feel, and where they get confused.
Use the persona from 0A. Reference real files and content from the pre-review audit.
Not hypothetical. Trace the actual path: "I open the README. The first heading is
[actual heading]. I scroll down and find [actual install command]. I run it and see..."
Then SHOW it to the user via AskUserQuestion:
> "Here's what I think your [persona] developer experiences today:
>
> [full empathy narrative]
>
> Does this match reality? Where am I wrong?
>
> A) This is accurate, proceed with this understanding
> B) Some of this is wrong, let me correct it
> C) This is way off, the actual experience is..."
**STOP.** Incorporate corrections into the narrative. This narrative becomes a required
output section ("Developer Perspective") in the plan file. The implementer should read
it and feel what the developer feels.
### 0C. Competitive DX Benchmarking
Before scoring anything, understand how comparable tools handle DX. Use WebSearch to
find real TTHW data and onboarding approaches.
Run three searches:
1. "[product category] getting started developer experience {current year}"
2. "[closest competitor] developer onboarding time"
3. "[product category] SDK CLI developer experience best practices {current year}"
If WebSearch is unavailable: "Search unavailable. Using reference benchmarks: Stripe
(30s TTHW), Vercel (2min), Firebase (3min), Docker (5min)."
Produce a competitive benchmark table:
```
COMPETITIVE DX BENCHMARK
=========================
Tool | TTHW | Notable DX Choice | Source
[competitor 1] | [time] | [what they do well] | [url/source]
[competitor 2] | [time] | [what they do well] | [url/source]
[competitor 3] | [time] | [what they do well] | [url/source]
YOUR PRODUCT | [est] | [from README/plan] | current plan
```
AskUserQuestion:
> "Your closest competitors' TTHW:
> [benchmark table]
>
> Your plan's current TTHW estimate: [X] minutes ([Y] steps).
>
> Where do you want to land?
>
> A) Champion tier (< 2 min) -- requires [specific changes]. Stripe/Vercel territory.
> B) Competitive tier (2-5 min) -- achievable with [specific gap to close]
> C) Current trajectory ([X] min) -- acceptable for now, improve later
> D) Tell me what's realistic for our constraints"
**STOP.** The chosen tier becomes the benchmark for Pass 1 (Getting Started).
### 0D. Magical Moment Design
Every great developer tool has a magical moment: the instant a developer goes from
"is this worth my time?" to "oh wow, this is real."
Load the "## Pass 1" section from `~/.claude/skills/gstack/plan-devex-review/dx-hall-of-fame.md`
for gold standard examples.
Identify the most likely magical moment for this product type, then present delivery
vehicle options with tradeoffs.
AskUserQuestion:
> "For your [product type], the magical moment is: [specific moment, e.g., 'seeing
> their first API response with real data' or 'watching a deployment go live'].
>
> How should your [persona from 0A] experience this moment?
>
> A) **Interactive playground/sandbox** -- zero install, try in browser. Highest
> conversion but requires building a hosted environment.
> (human: ~1 week / CC: ~2 hours). Examples: Stripe's API explorer, Supabase SQL editor.
>
> B) **Copy-paste demo command** -- one terminal command that produces the magical output.
> Low effort, high impact for CLI tools, but requires local install first.
> (human: ~2 days / CC: ~30 min). Examples: `npx create-next-app`, `docker run hello-world`.
>
> C) **Video/GIF walkthrough** -- shows the magic without requiring any setup.
> Passive (developer watches, doesn't do), but zero friction.
> (human: ~1 day / CC: ~1 hour). Examples: Vercel's homepage deploy animation.
>
> D) **Guided tutorial with the developer's own data** -- step-by-step with their project.
> Deepest engagement but longest time-to-magic.
> (human: ~1 week / CC: ~2 hours). Examples: Stripe's interactive onboarding.
>
> E) Something else -- describe what you have in mind.
>
> RECOMMENDATION: [A/B/C/D] because for [persona], [reason]. Your competitor [name]
> uses [their approach]."
**STOP.** The chosen delivery vehicle is tracked through the scoring passes.
### 0E. Mode Selection
How deep should this DX review go?
Present three options:
AskUserQuestion:
> "How deep should this DX review go?
>
> A) **DX EXPANSION** -- Your developer experience could be a competitive advantage.
> I'll propose ambitious DX improvements beyond what the plan covers. Every expansion
> is opt-in via individual questions. I'll push hard.
>
> B) **DX POLISH** -- The plan's DX scope is right. I'll make every touchpoint bulletproof:
> error messages, docs, CLI help, getting started. No scope additions, maximum rigor.
> (recommended for most reviews)
>
> C) **DX TRIAGE** -- Focus only on the critical DX gaps that would block adoption.
> Fast, surgical, for plans that need to ship soon.
>
> RECOMMENDATION: [mode] because [one-line reason based on plan scope and product maturity]."
Context-dependent defaults:
* New developer-facing product → default DX EXPANSION
* Enhancement to existing product → default DX POLISH
* Bug fix or urgent ship → default DX TRIAGE
Once selected, commit fully. Do not silently drift toward a different mode.
**STOP.** Do NOT proceed until user responds.
### 0F. Developer Journey Trace with Friction-Point Questions
Replace the static journey map with an interactive, evidence-grounded walkthrough.
For each journey stage, TRACE the actual experience (what file, what command, what
output) and ask about each friction point individually.
For each stage (Discover, Install, Hello World, Real Usage, Debug, Upgrade):
1. **Trace the actual path.** Read the README, docs, package.json, CLI help, or
whatever the developer would encounter at this stage. Reference specific files
and line numbers.
2. **Identify friction points with evidence.** Not "installation might be hard" but
"Step 3 of the README requires Docker to be running, but nothing checks for Docker
or tells the developer to install it. A [persona] without Docker will see [specific
error or nothing]."
3. **AskUserQuestion per friction point.** One question per friction point found.
Do NOT batch multiple friction points into one question.
> "Journey Stage: INSTALL
>
> I traced the installation path. Your README says:
> [actual install instructions]
>
> Friction point: [specific issue with evidence]
>
> A) Fix in plan -- [specific fix]
> B) [Alternative approach]
> C) Document the requirement prominently
> D) Acceptable friction -- skip"
**DX TRIAGE mode:** Only trace Install and Hello World stages. Skip the rest.
**DX POLISH mode:** Trace all stages.
**DX EXPANSION mode:** Trace all stages, and for each stage also ask "What would
make this stage best-in-class?"
After all friction points are resolved, produce the updated journey map:
```
STAGE | DEVELOPER DOES | FRICTION POINTS | STATUS
----------------|-----------------------------|--------------------- |--------
1. Discover | [action] | [resolved/deferred] | [fixed/ok/deferred]
2. Install | [action] | [resolved/deferred] | [fixed/ok/deferred]
3. Hello World | [action] | [resolved/deferred] | [fixed/ok/deferred]
4. Real Usage | [action] | [resolved/deferred] | [fixed/ok/deferred]
5. Debug | [action] | [resolved/deferred] | [fixed/ok/deferred]
6. Upgrade | [action] | [resolved/deferred] | [fixed/ok/deferred]
```
### 0G. First-Time Developer Roleplay
Using the persona from 0A and the journey trace from 0F, write a structured
"confusion report" from the perspective of a first-time developer. Include
timestamps to simulate real time passing.
```
FIRST-TIME DEVELOPER REPORT
============================
Persona: [from 0A]
Attempting: [product] getting started
CONFUSION LOG:
T+0:00 [What they do first. What they see.]
T+0:30 [Next action. What surprised or confused them.]
T+1:00 [What they tried. What happened.]
T+2:00 [Where they got stuck or succeeded.]
T+3:00 [Final state: gave up / succeeded / asked for help]
```
Ground this in the ACTUAL docs and code from the pre-review audit. Not hypothetical.
Reference specific README headings, error messages, and file paths.
AskUserQuestion:
> "I roleplayed as your [persona] developer attempting the getting started flow.
> Here's what confused me:
>
> [confusion report]
>
> Which of these should we address in the plan?
>
> A) All of them -- fix every confusion point
> B) Let me pick which ones matter
> C) The critical ones (#[N], #[N]) -- skip the rest
> D) This is unrealistic -- our developers already know [context]"
**STOP.** Do NOT proceed until user responds.
---
## The 0-10 Rating Method
For each DX section, rate the plan 0-10. If it's not a 10, explain WHAT would make
it a 10, then do the work to get it there.
**Critical rule:** Every rating MUST reference evidence from Step 0. Not "Getting
Started: 4/10" but "Getting Started: 4/10 because [persona from 0A] hits [friction
point from 0F] at step 3, and competitor [name from 0C] achieves this in [time]."
Pattern:
1. **Evidence recall:** Reference specific findings from Step 0 that apply to this dimension
2. Rate: "Getting Started Experience: 4/10"
3. Gap: "It's a 4 because [evidence]. A 10 would be [specific description for THIS product]."
4. Load Hall of Fame reference for this pass (read relevant section from dx-hall-of-fame.md)
5. Fix: Edit the plan to add what's missing
6. Re-rate: "Now 7/10, still missing [specific gap]"
7. AskUserQuestion if there's a genuine DX choice to resolve
8. Fix again until 10 or user says "good enough, move on"
**Mode-specific behavior:**
- **DX EXPANSION:** After fixing to 10, also ask "What would make this dimension
best-in-class? What would make [persona] rave about it?" Present expansions as
individual opt-in AskUserQuestions.
- **DX POLISH:** Fix every gap. No shortcuts. Trace each issue to specific files/lines.
- **DX TRIAGE:** Only flag gaps that would block adoption (score below 5). Skip gaps
that are nice-to-have (score 5-7).
{{SECTION:review-sections}}
## Section self-check (before you finish)
Confirm you Read the review section the Section index named, and executed all 8 DX passes, the required outputs, and the review report in full. If you produced findings or the review report from memory without Reading `sections/review-sections.md`, stop and Read it now.
{{EXIT_PLAN_MODE_GATE}}