Wires the orphaned gstack-model-benchmark binary into a dedicated skill
so users can discover cross-model benchmarking via /benchmark-models or
voice triggers ("compare models", "which model is best").
Deliberately separate from /benchmark (page performance) because the
two surfaces test completely different things — confusing them would
muddy both.
Flow:
1. Pick a prompt (an existing SKILL.md file, inline text, or file path)
2. Confirm providers (dry-run shows auth status per provider)
3. Decide on --judge (adds ~$0.05, scores output quality 0-10)
4. Run the benchmark — table output
5. Interpret results (fastest / cheapest / highest quality)
6. Offer to save to ~/.gstack/benchmarks/<date>.json for trend tracking
Uses gstack-model-benchmark --dry-run as a safety gate — auth status is
visible BEFORE the user spends API calls. If zero providers are authed,
the skill stops cleanly rather than attempting a run that produces no
useful output.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
name: benchmark-models
preamble-tier: 1
version: 1.0.0
description: |
  Cross-model benchmark for gstack skills. Runs the same prompt through Claude,
  GPT (via Codex CLI), and Gemini side-by-side — compares latency, tokens, cost,
  and optionally quality via LLM judge. Answers "which model is actually best
  for this skill?" with data instead of vibes. Separate from /benchmark, which
  measures web page performance. Use when: "benchmark models", "compare models",
  "which model is best for X", "cross-model comparison", "model shootout". (gstack)
voice-triggers:
  - "compare models"
  - "model shootout"
  - "which model is best"
triggers:
  - cross model benchmark
  - compare claude gpt gemini
  - benchmark skill across models
  - which model should I use
allowed-tools:
  - Bash
  - Read
  - AskUserQuestion
---
{{PREAMBLE}}

# /benchmark-models — Cross-Model Skill Benchmark

You are running the `/benchmark-models` workflow. It wraps the `gstack-model-benchmark` binary with an interactive flow that picks a prompt, confirms providers, previews auth, and runs the benchmark.

Different from `/benchmark` — that skill measures web page performance (Core Web Vitals, load times). This skill measures AI model performance on gstack skills or arbitrary prompts.

---
## Step 0: Locate the binary

```bash
BIN="$HOME/.claude/skills/gstack/bin/gstack-model-benchmark"
[ -x "$BIN" ] || BIN=".claude/skills/gstack/bin/gstack-model-benchmark"
[ -x "$BIN" ] || { echo "ERROR: gstack-model-benchmark not found. Run ./setup in the gstack install dir." >&2; exit 1; }
echo "BIN: $BIN"
```

If not found, stop and tell the user to reinstall gstack.

---
## Step 1: Choose a prompt

Use AskUserQuestion with the preamble format:

- **Re-ground:** current project + branch.
- **Simplify:** "A cross-model benchmark runs the same prompt through 2-3 AI models and shows you how they compare on speed, cost, and output quality. What prompt should we use?"
- **RECOMMENDATION:** A because benchmarking against a real skill exposes tool-use differences, not just raw generation.
- **Options:**
  - A) Benchmark one of my gstack skills (we'll pick which skill next). Completeness: 10/10.
  - B) Use an inline prompt — type it on the next turn. Completeness: 8/10.
  - C) Point at a prompt file on disk — specify path on the next turn. Completeness: 8/10.

If A: list top-level gstack skills that have SKILL.md files (from `find . -maxdepth 2 -name SKILL.md -not -path './.*'`), then ask the user to pick one via a second AskUserQuestion. Use the picked SKILL.md path as the prompt file.

If B: ask the user for the inline prompt. Use it verbatim via `--prompt "<text>"`.

If C: ask for the path. Verify it exists. Use it as a positional argument.
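A minimal sketch of the prompt-resolution gate, assuming nothing about the binary beyond the flags above (`PROMPT_FILE` is an illustrative variable name, not part of its interface):

```bash
# Option A: candidate skills for the picker (same find as above).
find . -maxdepth 2 -name SKILL.md -not -path './.*'

# Option C: verify the user-supplied path before passing it through.
PROMPT_FILE="path/from/user"   # hypothetical value collected via AskUserQuestion
[ -f "$PROMPT_FILE" ] || { echo "ERROR: prompt file not found: $PROMPT_FILE" >&2; exit 1; }
```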
---
## Step 2: Choose providers

```bash
"$BIN" --prompt "unused, dry-run" --models claude,gpt,gemini --dry-run
```

Show the dry-run output. The "Adapter availability" section tells the user which providers will actually run (OK) vs skip (NOT READY — remediation hint included).

If ALL three show NOT READY: stop with a clear message — the benchmark can't run without at least one authed provider. Suggest `claude login`, `codex login`, or `gemini login` / `export GOOGLE_API_KEY`. A sketch of this gate follows below.
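A hedged sketch of that gate. It assumes the dry-run prints one `OK` per ready adapter in the "Adapter availability" section, which is the format described above, not a guaranteed contract:

```bash
# Count ready adapters from the dry-run output; matching " OK" is an
# assumption about the "Adapter availability" format.
READY=$("$BIN" --prompt "unused, dry-run" --models claude,gpt,gemini --dry-run | grep -c ' OK')
if [ "$READY" -eq 0 ]; then
  echo "No authed providers. Try: claude login / codex login / gemini login, or export GOOGLE_API_KEY." >&2
  exit 1
fi
```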
If at least one is OK: AskUserQuestion:

- **Simplify:** "Which models should we include? The dry-run above showed which are authed. Unauthed ones will be skipped cleanly — they won't abort the batch."
- **RECOMMENDATION:** A (all authed providers) because running as many as possible gives the richest comparison.
- **Options:**
  - A) All authed providers. Completeness: 10/10.
  - B) Only Claude. Completeness: 6/10 (no cross-model signal — use /ship's review for solo Claude benchmarks instead).
  - C) Pick two — specify on next turn. Completeness: 8/10.
---
## Step 3: Decide on judge

```bash
if [ -n "$ANTHROPIC_API_KEY" ] || grep -q 'ANTHROPIC' "$HOME/.claude/.credentials.json" 2>/dev/null; then
  echo "JUDGE_AVAILABLE"
else
  echo "JUDGE_UNAVAILABLE"
fi
```

If judge is available, AskUserQuestion:

- **Simplify:** "The quality judge scores each model's output on a 0-10 scale using Anthropic's Claude as a tiebreaker. Adds ~$0.05/run. Recommended if you care about output quality, not just latency and cost."
- **RECOMMENDATION:** A — the whole point is comparing quality, not just speed.
- **Options:**
  - A) Enable judge (adds ~$0.05). Completeness: 10/10.
  - B) Skip judge — speed/cost/tokens only. Completeness: 7/10.

If judge is NOT available, skip this question and omit the `--judge` flag.
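To carry the choice into Step 4, something this simple works (`JUDGE_FLAG` and `JUDGE_CHOICE` are illustrative names, not part of the binary's interface):

```bash
# Translate the user's answer into an optional flag for the final command.
JUDGE_FLAG=""
[ "$JUDGE_CHOICE" = "A" ] && JUDGE_FLAG="--judge"   # JUDGE_CHOICE is hypothetical
```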
---
## Step 4: Run the benchmark

Construct the command from the Step 1, 2, and 3 decisions:

```bash
"$BIN" <prompt-spec> --models <picked-models> [--judge] --output table
```

Where `<prompt-spec>` is either `--prompt "<text>"` (Step 1B) or a file path (Step 1A or 1C), and `<picked-models>` is the comma-separated list from Step 2.

Stream the output as it arrives. This is slow — each provider runs the prompt fully. Expect 30s-5min depending on prompt complexity and whether `--judge` is on.
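For example, a run against a skill file with two providers and the judge enabled might look like this (the path and provider picks are illustrative):

```bash
# Illustrative invocation: the path and model picks are examples, not defaults.
"$BIN" ./skills/deploy/SKILL.md --models claude,gemini --judge --output table
```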
---
## Step 5: Interpret results

After the table prints, summarize for the user:

- **Fastest** — provider with lowest latency.
- **Cheapest** — provider with lowest cost.
- **Highest quality** (if `--judge` ran) — provider with highest score.
- **Best overall** — use judgment. If judge ran: quality-weighted. Otherwise: note the tradeoff the user needs to make.

If any provider hit an error (auth/timeout/rate_limit), call it out with the remediation path.
---
## Step 6: Offer to save results

AskUserQuestion:

- **Simplify:** "Save this benchmark as JSON so you can compare future runs against it?"
- **RECOMMENDATION:** A — skill performance drifts as providers update their models; a saved baseline catches quality regressions.
- **Options:**
  - A) Save to `~/.gstack/benchmarks/<date>-<skill-or-prompt-slug>.json`. Completeness: 10/10.
  - B) Just print, don't save. Completeness: 5/10 (loses trend data).

If A: re-run with `--output json` and tee to the dated file. Print the path so the user can diff future runs against it.
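A sketch of that save step, assuming only the `--output json` flag from above (the slug derivation and example arguments are illustrative):

```bash
# Illustrative save step: SLUG and the prompt/model arguments are examples.
SLUG="deploy"
OUT="$HOME/.gstack/benchmarks/$(date +%Y-%m-%d)-$SLUG.json"
mkdir -p "$HOME/.gstack/benchmarks"
"$BIN" ./skills/deploy/SKILL.md --models claude,gemini --output json | tee "$OUT"
echo "Saved baseline: $OUT"
```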
---
## Important Rules

- **Never run a real benchmark without Step 2's dry-run first.** Users need to see auth status before spending API calls.
- **Never hardcode model names.** Always pass providers from the user's Step 2 choice — the binary handles the rest.
- **Never auto-include `--judge`.** It adds real cost; the user must opt in.
- **If zero providers are authed, STOP.** Don't attempt the benchmark — it produces no useful output.
- **Cost is visible.** Every run shows per-provider cost in the table. Users should see it before the next run.