mirror of
https://github.com/garrytan/gstack.git
synced 2026-05-02 11:45:20 +02:00
612c1a24f8
Wires the orphaned gstack-model-benchmark binary into a dedicated skill
so users can discover cross-model benchmarking via /benchmark-models or
voice triggers ("compare models", "which model is best").
Deliberately separate from /benchmark (page performance) because the
two surfaces test completely different things — confusing them would
muddy both.
Flow:
1. Pick a prompt (an existing SKILL.md file, inline text, or file path)
2. Confirm providers (dry-run shows auth status per provider)
3. Decide on --judge (adds ~$0.05, scores output quality 0-10)
4. Run the benchmark — table output
5. Interpret results (fastest / cheapest / highest quality)
6. Offer to save to ~/.gstack/benchmarks/<date>.json for trend tracking
Uses gstack-model-benchmark --dry-run as a safety gate — auth status is
visible BEFORE the user spends API calls. If zero providers are authed,
the skill stops cleanly rather than attempting a run that produces no
useful output.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>