--- name: benchmark-models preamble-tier: 1 version: 1.0.0 description: | Cross-model benchmark for gstack skills. Runs the same prompt through Claude, GPT (via Codex CLI), and Gemini side-by-side — compares latency, tokens, cost, and optionally quality via LLM judge. Answers "which model is actually best for this skill?" with data instead of vibes. Separate from /benchmark, which measures web page performance. Use when: "benchmark models", "compare models", "which model is best for X", "cross-model comparison", "model shootout". (gstack) Voice triggers (speech-to-text aliases): "compare models", "model shootout", "which model is best". triggers: - cross model benchmark - compare claude gpt gemini - benchmark skill across models - which model should I use allowed-tools: - Bash - Read - AskUserQuestion --- ## Preamble (run first) ```bash _UPD=$(~/.claude/skills/gstack/bin/gstack-update-check 2>/dev/null || .claude/skills/gstack/bin/gstack-update-check 2>/dev/null || true) [ -n "$_UPD" ] && echo "$_UPD" || true mkdir -p ~/.gstack/sessions touch ~/.gstack/sessions/"$PPID" _SESSIONS=$(find ~/.gstack/sessions -mmin -120 -type f 2>/dev/null | wc -l | tr -d ' ') find ~/.gstack/sessions -mmin +120 -type f -exec rm {} + 2>/dev/null || true _PROACTIVE=$(~/.claude/skills/gstack/bin/gstack-config get proactive 2>/dev/null || echo "true") _PROACTIVE_PROMPTED=$([ -f ~/.gstack/.proactive-prompted ] && echo "yes" || echo "no") _BRANCH=$(git branch --show-current 2>/dev/null || echo "unknown") echo "BRANCH: $_BRANCH" _SKILL_PREFIX=$(~/.claude/skills/gstack/bin/gstack-config get skill_prefix 2>/dev/null || echo "false") echo "PROACTIVE: $_PROACTIVE" echo "PROACTIVE_PROMPTED: $_PROACTIVE_PROMPTED" echo "SKILL_PREFIX: $_SKILL_PREFIX" source <(~/.claude/skills/gstack/bin/gstack-repo-mode 2>/dev/null) || true REPO_MODE=${REPO_MODE:-unknown} echo "REPO_MODE: $REPO_MODE" _LAKE_SEEN=$([ -f ~/.gstack/.completeness-intro-seen ] && echo "yes" || echo "no") echo "LAKE_INTRO: $_LAKE_SEEN" _TEL=$(~/.claude/skills/gstack/bin/gstack-config get telemetry 2>/dev/null || true) _TEL_PROMPTED=$([ -f ~/.gstack/.telemetry-prompted ] && echo "yes" || echo "no") _TEL_START=$(date +%s) _SESSION_ID="$$-$(date +%s)" echo "TELEMETRY: ${_TEL:-off}" echo "TEL_PROMPTED: $_TEL_PROMPTED" _EXPLAIN_LEVEL=$(~/.claude/skills/gstack/bin/gstack-config get explain_level 2>/dev/null || echo "default") if [ "$_EXPLAIN_LEVEL" != "default" ] && [ "$_EXPLAIN_LEVEL" != "terse" ]; then _EXPLAIN_LEVEL="default"; fi echo "EXPLAIN_LEVEL: $_EXPLAIN_LEVEL" _QUESTION_TUNING=$(~/.claude/skills/gstack/bin/gstack-config get question_tuning 2>/dev/null || echo "false") echo "QUESTION_TUNING: $_QUESTION_TUNING" mkdir -p ~/.gstack/analytics if [ "$_TEL" != "off" ]; then echo '{"skill":"benchmark-models","ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'","repo":"'$(basename "$(git rev-parse --show-toplevel 2>/dev/null)" 2>/dev/null || echo "unknown")'"}' >> ~/.gstack/analytics/skill-usage.jsonl 2>/dev/null || true fi for _PF in $(find ~/.gstack/analytics -maxdepth 1 -name '.pending-*' 2>/dev/null); do if [ -f "$_PF" ]; then if [ "$_TEL" != "off" ] && [ -x "~/.claude/skills/gstack/bin/gstack-telemetry-log" ]; then ~/.claude/skills/gstack/bin/gstack-telemetry-log --event-type skill_run --skill _pending_finalize --outcome unknown --session-id "$_SESSION_ID" 2>/dev/null || true fi rm -f "$_PF" 2>/dev/null || true fi break done eval "$(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)" 2>/dev/null || true _LEARN_FILE="${GSTACK_HOME:-$HOME/.gstack}/projects/${SLUG:-unknown}/learnings.jsonl" if [ -f "$_LEARN_FILE" ]; then _LEARN_COUNT=$(wc -l < "$_LEARN_FILE" 2>/dev/null | tr -d ' ') echo "LEARNINGS: $_LEARN_COUNT entries loaded" if [ "$_LEARN_COUNT" -gt 5 ] 2>/dev/null; then ~/.claude/skills/gstack/bin/gstack-learnings-search --limit 3 2>/dev/null || true fi else echo "LEARNINGS: 0" fi ~/.claude/skills/gstack/bin/gstack-timeline-log '{"skill":"benchmark-models","event":"started","branch":"'"$_BRANCH"'","session":"'"$_SESSION_ID"'"}' 2>/dev/null & _HAS_ROUTING="no" if [ -f CLAUDE.md ] && grep -q "## Skill routing" CLAUDE.md 2>/dev/null; then _HAS_ROUTING="yes" fi _ROUTING_DECLINED=$(~/.claude/skills/gstack/bin/gstack-config get routing_declined 2>/dev/null || echo "false") echo "HAS_ROUTING: $_HAS_ROUTING" echo "ROUTING_DECLINED: $_ROUTING_DECLINED" _VENDORED="no" if [ -d ".claude/skills/gstack" ] && [ ! -L ".claude/skills/gstack" ]; then if [ -f ".claude/skills/gstack/VERSION" ] || [ -d ".claude/skills/gstack/.git" ]; then _VENDORED="yes" fi fi echo "VENDORED_GSTACK: $_VENDORED" echo "MODEL_OVERLAY: claude" _CHECKPOINT_MODE=$(~/.claude/skills/gstack/bin/gstack-config get checkpoint_mode 2>/dev/null || echo "explicit") _CHECKPOINT_PUSH=$(~/.claude/skills/gstack/bin/gstack-config get checkpoint_push 2>/dev/null || echo "false") echo "CHECKPOINT_MODE: $_CHECKPOINT_MODE" echo "CHECKPOINT_PUSH: $_CHECKPOINT_PUSH" [ -n "$OPENCLAW_SESSION" ] && echo "SPAWNED_SESSION: true" || true ``` ## Plan Mode Safe Operations In plan mode, allowed because they inform the plan: `$B`, `$D`, `codex exec`/`codex review`, writes to `~/.gstack/`, writes to the plan file, and `open` for generated artifacts. ## Skill Invocation During Plan Mode If the user invokes a skill in plan mode, the skill takes precedence over generic plan mode behavior. **Treat the skill file as executable instructions, not reference.** Follow it step by step starting from Step 0; the first AskUserQuestion is the workflow entering plan mode, not a violation of it. AskUserQuestion (any variant — `mcp__*__AskUserQuestion` or native; see "AskUserQuestion Format → Tool resolution") satisfies plan mode's end-of-turn requirement. If no variant is callable, fall back to writing the decision brief into the plan file as a `## Decisions to confirm` section + ExitPlanMode — never silently auto-decide. At a STOP point, stop immediately. Do not continue the workflow or call ExitPlanMode there. Commands marked "PLAN MODE EXCEPTION — ALWAYS RUN" execute. Call ExitPlanMode only after the skill workflow completes, or if the user tells you to cancel the skill or leave plan mode. If `PROACTIVE` is `"false"`, do not auto-invoke or proactively suggest skills. If a skill seems useful, ask: "I think /skillname might help here — want me to run it?" If `SKILL_PREFIX` is `"true"`, suggest/invoke `/gstack-*` names. Disk paths stay `~/.claude/skills/gstack/[skill-name]/SKILL.md`. If output shows `UPGRADE_AVAILABLE `: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (auto-upgrade if configured, otherwise AskUserQuestion with 4 options, write snooze state if declined). If output shows `JUST_UPGRADED `: print "Running gstack v{to} (just updated!)". If `SPAWNED_SESSION` is true, skip feature discovery. Feature discovery, max one prompt per session: - Missing `~/.claude/skills/gstack/.feature-prompted-continuous-checkpoint`: AskUserQuestion for Continuous checkpoint auto-commits. If accepted, run `~/.claude/skills/gstack/bin/gstack-config set checkpoint_mode continuous`. Always touch marker. - Missing `~/.claude/skills/gstack/.feature-prompted-model-overlay`: inform "Model overlays are active. MODEL_OVERLAY shows the patch." Always touch marker. After upgrade prompts, continue workflow. If `WRITING_STYLE_PENDING` is `yes`: ask once about writing style: > v1 prompts are simpler: first-use jargon glosses, outcome-framed questions, shorter prose. Keep default or restore terse? Options: - A) Keep the new default (recommended — good writing helps everyone) - B) Restore V0 prose — set `explain_level: terse` If A: leave `explain_level` unset (defaults to `default`). If B: run `~/.claude/skills/gstack/bin/gstack-config set explain_level terse`. Always run (regardless of choice): ```bash rm -f ~/.gstack/.writing-style-prompt-pending touch ~/.gstack/.writing-style-prompted ``` Skip if `WRITING_STYLE_PENDING` is `no`. If `LAKE_INTRO` is `no`: say "gstack follows the **Boil the Lake** principle — do the complete thing when AI makes marginal cost near-zero. Read more: https://garryslist.org/posts/boil-the-ocean" Offer to open: ```bash open https://garryslist.org/posts/boil-the-ocean touch ~/.gstack/.completeness-intro-seen ``` Only run `open` if yes. Always run `touch`. If `TEL_PROMPTED` is `no` AND `LAKE_INTRO` is `yes`: ask telemetry once via AskUserQuestion: > Help gstack get better. Share usage data only: skill, duration, crashes, stable device ID. No code, file paths, or repo names. Options: - A) Help gstack get better! (recommended) - B) No thanks If A: run `~/.claude/skills/gstack/bin/gstack-config set telemetry community` If B: ask follow-up: > Anonymous mode sends only aggregate usage, no unique ID. Options: - A) Sure, anonymous is fine - B) No thanks, fully off If B→A: run `~/.claude/skills/gstack/bin/gstack-config set telemetry anonymous` If B→B: run `~/.claude/skills/gstack/bin/gstack-config set telemetry off` Always run: ```bash touch ~/.gstack/.telemetry-prompted ``` Skip if `TEL_PROMPTED` is `yes`. If `PROACTIVE_PROMPTED` is `no` AND `TEL_PROMPTED` is `yes`: ask once: > Let gstack proactively suggest skills, like /qa for "does this work?" or /investigate for bugs? Options: - A) Keep it on (recommended) - B) Turn it off — I'll type /commands myself If A: run `~/.claude/skills/gstack/bin/gstack-config set proactive true` If B: run `~/.claude/skills/gstack/bin/gstack-config set proactive false` Always run: ```bash touch ~/.gstack/.proactive-prompted ``` Skip if `PROACTIVE_PROMPTED` is `yes`. If `HAS_ROUTING` is `no` AND `ROUTING_DECLINED` is `false` AND `PROACTIVE_PROMPTED` is `yes`: Check if a CLAUDE.md file exists in the project root. If it does not exist, create it. Use AskUserQuestion: > gstack works best when your project's CLAUDE.md includes skill routing rules. Options: - A) Add routing rules to CLAUDE.md (recommended) - B) No thanks, I'll invoke skills manually If A: Append this section to the end of CLAUDE.md: ```markdown ## Skill routing When the user's request matches an available skill, invoke it via the Skill tool. When in doubt, invoke the skill. Key routing rules: - Product ideas/brainstorming → invoke /office-hours - Strategy/scope → invoke /plan-ceo-review - Architecture → invoke /plan-eng-review - Design system/plan review → invoke /design-consultation or /plan-design-review - Full review pipeline → invoke /autoplan - Bugs/errors → invoke /investigate - QA/testing site behavior → invoke /qa or /qa-only - Code review/diff check → invoke /review - Visual polish → invoke /design-review - Ship/deploy/PR → invoke /ship or /land-and-deploy - Save progress → invoke /context-save - Resume context → invoke /context-restore ``` Then commit the change: `git add CLAUDE.md && git commit -m "chore: add gstack skill routing rules to CLAUDE.md"` If B: run `~/.claude/skills/gstack/bin/gstack-config set routing_declined true` and say they can re-enable with `gstack-config set routing_declined false`. This only happens once per project. Skip if `HAS_ROUTING` is `yes` or `ROUTING_DECLINED` is `true`. If `VENDORED_GSTACK` is `yes`, warn once via AskUserQuestion unless `~/.gstack/.vendoring-warned-$SLUG` exists: > This project has gstack vendored in `.claude/skills/gstack/`. Vendoring is deprecated. > Migrate to team mode? Options: - A) Yes, migrate to team mode now - B) No, I'll handle it myself If A: 1. Run `git rm -r .claude/skills/gstack/` 2. Run `echo '.claude/skills/gstack/' >> .gitignore` 3. Run `~/.claude/skills/gstack/bin/gstack-team-init required` (or `optional`) 4. Run `git add .claude/ .gitignore CLAUDE.md && git commit -m "chore: migrate gstack from vendored to team mode"` 5. Tell the user: "Done. Each developer now runs: `cd ~/.claude/skills/gstack && ./setup --team`" If B: say "OK, you're on your own to keep the vendored copy up to date." Always run (regardless of choice): ```bash eval "$(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)" 2>/dev/null || true touch ~/.gstack/.vendoring-warned-${SLUG:-unknown} ``` If marker exists, skip. If `SPAWNED_SESSION` is `"true"`, you are running inside a session spawned by an AI orchestrator (e.g., OpenClaw). In spawned sessions: - Do NOT use AskUserQuestion for interactive prompts. Auto-choose the recommended option. - Do NOT run upgrade checks, telemetry prompts, routing injection, or lake intro. - Focus on completing the task and reporting results via prose output. - End with a completion report: what shipped, decisions made, anything uncertain. ## GBrain Sync (skill start) ```bash _GSTACK_HOME="${GSTACK_HOME:-$HOME/.gstack}" _BRAIN_REMOTE_FILE="$HOME/.gstack-brain-remote.txt" _BRAIN_SYNC_BIN="~/.claude/skills/gstack/bin/gstack-brain-sync" _BRAIN_CONFIG_BIN="~/.claude/skills/gstack/bin/gstack-config" _BRAIN_SYNC_MODE=$("$_BRAIN_CONFIG_BIN" get gbrain_sync_mode 2>/dev/null || echo off) if [ -f "$_BRAIN_REMOTE_FILE" ] && [ ! -d "$_GSTACK_HOME/.git" ] && [ "$_BRAIN_SYNC_MODE" = "off" ]; then _BRAIN_NEW_URL=$(head -1 "$_BRAIN_REMOTE_FILE" 2>/dev/null | tr -d '[:space:]') if [ -n "$_BRAIN_NEW_URL" ]; then echo "BRAIN_SYNC: brain repo detected: $_BRAIN_NEW_URL" echo "BRAIN_SYNC: run 'gstack-brain-restore' to pull your cross-machine memory (or 'gstack-config set gbrain_sync_mode off' to dismiss forever)" fi fi if [ -d "$_GSTACK_HOME/.git" ] && [ "$_BRAIN_SYNC_MODE" != "off" ]; then _BRAIN_LAST_PULL_FILE="$_GSTACK_HOME/.brain-last-pull" _BRAIN_NOW=$(date +%s) _BRAIN_DO_PULL=1 if [ -f "$_BRAIN_LAST_PULL_FILE" ]; then _BRAIN_LAST=$(cat "$_BRAIN_LAST_PULL_FILE" 2>/dev/null || echo 0) _BRAIN_AGE=$(( _BRAIN_NOW - _BRAIN_LAST )) [ "$_BRAIN_AGE" -lt 86400 ] && _BRAIN_DO_PULL=0 fi if [ "$_BRAIN_DO_PULL" = "1" ]; then ( cd "$_GSTACK_HOME" && git fetch origin >/dev/null 2>&1 && git merge --ff-only "origin/$(git rev-parse --abbrev-ref HEAD)" >/dev/null 2>&1 ) || true echo "$_BRAIN_NOW" > "$_BRAIN_LAST_PULL_FILE" fi "$_BRAIN_SYNC_BIN" --once 2>/dev/null || true fi if [ -d "$_GSTACK_HOME/.git" ] && [ "$_BRAIN_SYNC_MODE" != "off" ]; then _BRAIN_QUEUE_DEPTH=0 [ -f "$_GSTACK_HOME/.brain-queue.jsonl" ] && _BRAIN_QUEUE_DEPTH=$(wc -l < "$_GSTACK_HOME/.brain-queue.jsonl" | tr -d ' ') _BRAIN_LAST_PUSH="never" [ -f "$_GSTACK_HOME/.brain-last-push" ] && _BRAIN_LAST_PUSH=$(cat "$_GSTACK_HOME/.brain-last-push" 2>/dev/null || echo never) echo "BRAIN_SYNC: mode=$_BRAIN_SYNC_MODE | last_push=$_BRAIN_LAST_PUSH | queue=$_BRAIN_QUEUE_DEPTH" else echo "BRAIN_SYNC: off" fi ``` Privacy stop-gate: if output shows `BRAIN_SYNC: off`, `gbrain_sync_mode_prompted` is `false`, and gbrain is on PATH or `gbrain doctor --fast --json` works, ask once: > gstack can publish your session memory to a private GitHub repo that GBrain indexes across machines. How much should sync? Options: - A) Everything allowlisted (recommended) - B) Only artifacts - C) Decline, keep everything local After answer: ```bash # Chosen mode: full | artifacts-only | off "$_BRAIN_CONFIG_BIN" set gbrain_sync_mode "$_BRAIN_CONFIG_BIN" set gbrain_sync_mode_prompted true ``` If A/B and `~/.gstack/.git` is missing, ask whether to run `gstack-brain-init`. Do not block the skill. At skill END before telemetry: ```bash "~/.claude/skills/gstack/bin/gstack-brain-sync" --discover-new 2>/dev/null || true "~/.claude/skills/gstack/bin/gstack-brain-sync" --once 2>/dev/null || true ``` ## Model-Specific Behavioral Patch (claude) The following nudges are tuned for the claude model family. They are **subordinate** to skill workflow, STOP points, AskUserQuestion gates, plan-mode safety, and /ship review gates. If a nudge below conflicts with skill instructions, the skill wins. Treat these as preferences, not rules. **Todo-list discipline.** When working through a multi-step plan, mark each task complete individually as you finish it. Do not batch-complete at the end. If a task turns out to be unnecessary, mark it skipped with a one-line reason. **Think before heavy actions.** For complex operations (refactors, migrations, non-trivial new features), briefly state your approach before executing. This lets the user course-correct cheaply instead of mid-flight. **Dedicated tools over Bash.** Prefer Read, Edit, Write, Glob, Grep over shell equivalents (cat, sed, find, grep). The dedicated tools are cheaper and clearer. ## Voice Direct, concrete, builder-to-builder. Name the file, function, command, and user-visible impact. No filler. No em dashes. No AI vocabulary: delve, crucial, robust, comprehensive, nuanced, multifaceted. Never corporate or academic. Short paragraphs. End with what to do. The user has context you do not. Cross-model agreement is a recommendation, not a decision. The user decides. ## Completion Status Protocol When completing a skill workflow, report status using one of: - **DONE** — completed with evidence. - **DONE_WITH_CONCERNS** — completed, but list concerns. - **BLOCKED** — cannot proceed; state blocker and what was tried. - **NEEDS_CONTEXT** — missing info; state exactly what is needed. Escalate after 3 failed attempts, uncertain security-sensitive changes, or scope you cannot verify. Format: `STATUS`, `REASON`, `ATTEMPTED`, `RECOMMENDATION`. ## Operational Self-Improvement Before completing, if you discovered a durable project quirk or command fix that would save 5+ minutes next time, log it: ```bash ~/.claude/skills/gstack/bin/gstack-learnings-log '{"skill":"SKILL_NAME","type":"operational","key":"SHORT_KEY","insight":"DESCRIPTION","confidence":N,"source":"observed"}' ``` Do not log obvious facts or one-time transient errors. ## Telemetry (run last) After workflow completion, log telemetry. Use skill `name:` from frontmatter. OUTCOME is success/error/abort/unknown. **PLAN MODE EXCEPTION — ALWAYS RUN:** This command writes telemetry to `~/.gstack/analytics/`, matching preamble analytics writes. Run this bash: ```bash _TEL_END=$(date +%s) _TEL_DUR=$(( _TEL_END - _TEL_START )) rm -f ~/.gstack/analytics/.pending-"$_SESSION_ID" 2>/dev/null || true # Session timeline: record skill completion (local-only, never sent anywhere) ~/.claude/skills/gstack/bin/gstack-timeline-log '{"skill":"SKILL_NAME","event":"completed","branch":"'$(git branch --show-current 2>/dev/null || echo unknown)'","outcome":"OUTCOME","duration_s":"'"$_TEL_DUR"'","session":"'"$_SESSION_ID"'"}' 2>/dev/null || true # Local analytics (gated on telemetry setting) if [ "$_TEL" != "off" ]; then echo '{"skill":"SKILL_NAME","duration_s":"'"$_TEL_DUR"'","outcome":"OUTCOME","browse":"USED_BROWSE","session":"'"$_SESSION_ID"'","ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'"}' >> ~/.gstack/analytics/skill-usage.jsonl 2>/dev/null || true fi # Remote telemetry (opt-in, requires binary) if [ "$_TEL" != "off" ] && [ -x ~/.claude/skills/gstack/bin/gstack-telemetry-log ]; then ~/.claude/skills/gstack/bin/gstack-telemetry-log \ --skill "SKILL_NAME" --duration "$_TEL_DUR" --outcome "OUTCOME" \ --used-browse "USED_BROWSE" --session-id "$_SESSION_ID" 2>/dev/null & fi ``` Replace `SKILL_NAME`, `OUTCOME`, and `USED_BROWSE` before running. ## Plan Status Footer In plan mode before ExitPlanMode: if the plan file lacks `## GSTACK REVIEW REPORT`, run `~/.claude/skills/gstack/bin/gstack-review-read` and append the standard runs/status/findings table. With `NO_REVIEWS` or empty, append a 5-row placeholder with verdict "NO REVIEWS YET — run `/autoplan`". If a richer report exists, skip. PLAN MODE EXCEPTION — always allowed (it's the plan file). # /benchmark-models — Cross-Model Skill Benchmark You are running the `/benchmark-models` workflow. Wraps the `gstack-model-benchmark` binary with an interactive flow that picks a prompt, confirms providers, previews auth, and runs the benchmark. Different from `/benchmark` — that skill measures web page performance (Core Web Vitals, load times). This skill measures AI model performance on gstack skills or arbitrary prompts. --- ## Step 0: Locate the binary ```bash BIN="$HOME/.claude/skills/gstack/bin/gstack-model-benchmark" [ -x "$BIN" ] || BIN=".claude/skills/gstack/bin/gstack-model-benchmark" [ -x "$BIN" ] || { echo "ERROR: gstack-model-benchmark not found. Run ./setup in the gstack install dir." >&2; exit 1; } echo "BIN: $BIN" ``` If not found, stop and tell the user to reinstall gstack. --- ## Step 1: Choose a prompt Use AskUserQuestion with the preamble format: - **Re-ground:** current project + branch. - **Simplify:** "A cross-model benchmark runs the same prompt through 2-3 AI models and shows you how they compare on speed, cost, and output quality. What prompt should we use?" - **RECOMMENDATION:** A because benchmarking against a real skill exposes tool-use differences, not just raw generation. - **Options:** - A) Benchmark one of my gstack skills (we'll pick which skill next). Completeness: 10/10. - B) Use an inline prompt — type it on the next turn. Completeness: 8/10. - C) Point at a prompt file on disk — specify path on the next turn. Completeness: 8/10. If A: list top-level gstack skills that have SKILL.md files (from `find . -maxdepth 2 -name SKILL.md -not -path './.*'`), ask the user to pick one via a second AskUserQuestion. Use the picked SKILL.md path as the prompt file. If B: ask the user for the inline prompt. Use it verbatim via `--prompt ""`. If C: ask for the path. Verify it exists. Use as positional argument. --- ## Step 2: Choose providers ```bash "$BIN" --prompt "unused, dry-run" --models claude,gpt,gemini --dry-run ``` Show the dry-run output. The "Adapter availability" section tells the user which providers will actually run (OK) vs skip (NOT READY — remediation hint included). If ALL three show NOT READY: stop with a clear message — benchmark can't run without at least one authed provider. Suggest `claude login`, `codex login`, or `gemini login` / `export GOOGLE_API_KEY`. If at least one is OK: AskUserQuestion: - **Simplify:** "Which models should we include? The dry-run above showed which are authed. Unauthed ones will be skipped cleanly — they won't abort the batch." - **RECOMMENDATION:** A (all authed providers) because running as many as possible gives the richest comparison. - **Options:** - A) All authed providers. Completeness: 10/10. - B) Only Claude. Completeness: 6/10 (no cross-model signal — use /ship's review for solo claude benchmarks instead). - C) Pick two — specify on next turn. Completeness: 8/10. --- ## Step 3: Decide on judge ```bash [ -n "$ANTHROPIC_API_KEY" ] || grep -q 'ANTHROPIC' "$HOME/.claude/.credentials.json" 2>/dev/null && echo "JUDGE_AVAILABLE" || echo "JUDGE_UNAVAILABLE" ``` If judge is available, AskUserQuestion: - **Simplify:** "The quality judge scores each model's output on a 0-10 scale using Anthropic's Claude as a tiebreaker. Adds ~$0.05/run. Recommended if you care about output quality, not just latency and cost." - **RECOMMENDATION:** A — the whole point is comparing quality, not just speed. - **Options:** - A) Enable judge (adds ~$0.05). Completeness: 10/10. - B) Skip judge — speed/cost/tokens only. Completeness: 7/10. If judge is NOT available, skip this question and omit the `--judge` flag. --- ## Step 4: Run the benchmark Construct the command from Step 1, 2, 3 decisions: ```bash "$BIN" --models [--judge] --output table ``` Where `` is either `--prompt ""` (Step 1B), a file path (Step 1A or 1C), and `` is the comma-separated list from Step 2. Stream the output as it arrives. This is slow — each provider runs the prompt fully. Expect 30s-5min depending on prompt complexity and whether `--judge` is on. --- ## Step 5: Interpret results After the table prints, summarize for the user: - **Fastest** — provider with lowest latency. - **Cheapest** — provider with lowest cost. - **Highest quality** (if `--judge` ran) — provider with highest score. - **Best overall** — use judgment. If judge ran: quality-weighted. Otherwise: note the tradeoff the user needs to make. If any provider hit an error (auth/timeout/rate_limit), call it out with the remediation path. --- ## Step 6: Offer to save results AskUserQuestion: - **Simplify:** "Save this benchmark as JSON so you can compare future runs against it?" - **RECOMMENDATION:** A — skill performance drifts as providers update their models; a saved baseline catches quality regressions. - **Options:** - A) Save to `~/.gstack/benchmarks/-.json`. Completeness: 10/10. - B) Just print, don't save. Completeness: 5/10 (loses trend data). If A: re-run with `--output json` and tee to the dated file. Print the path so the user can diff future runs against it. --- ## Important Rules - **Never run a real benchmark without Step 2's dry-run first.** Users need to see auth status before spending API calls. - **Never hardcode model names.** Always pass providers from user's Step 2 choice — the binary handles the rest. - **Never auto-include `--judge`.** It adds real cost; user must opt in. - **If zero providers are authed, STOP.** Don't attempt the benchmark — it produces no useful output. - **Cost is visible.** Every run shows per-provider cost in the table. Users should see it before the next run.