gstack/bin/gstack-model-benchmark
Garry Tan 614354fc41 feat: multi-provider model benchmark (boil the ocean)
Adds the full spec Codex asked for: real provider adapters with auth
detection, normalized RunResult, pricing tables, tool compatibility
maps, parallel execution with error isolation, and table/JSON/markdown
output. Judge stays on Anthropic SDK as the single stable source of
quality scoring, gated behind --judge.
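
benchmark-judge.ts itself is listed below but not reproduced in this
view; a rough sketch of the shape described here, where the model id,
the report fields, and the scoring prompt are all placeholders rather
than the committed code:

  import Anthropic from '@anthropic-ai/sdk';

  export async function judgeEntries(
    report: { results: Array<{ output: string; quality?: number }> },
  ): Promise<void> {
    const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment
    for (const entry of report.results) {
      const msg = await client.messages.create({
        model: 'claude-sonnet-4-5', // placeholder model id
        max_tokens: 8,
        messages: [{
          role: 'user',
          content: `Score this output from 1 to 10. Reply with the number only.\n\n${entry.output}`,
        }],
      });
      const first = msg.content[0];
      const text = first && first.type === 'text' ? first.text : '';
      const score = Number.parseInt(text, 10);
      if (!Number.isNaN(score)) entry.quality = score;
    }
  }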

Codex flagged the original plan as massively under-scoped — the
existing runner is Claude-only and the judge is Anthropic-only. You
can't benchmark GPT or Gemini without real provider infrastructure.
This commit ships it.

New architecture:

  test/helpers/providers/types.ts       ProviderAdapter interface
  test/helpers/providers/claude.ts      wraps `claude -p --output-format json`
  test/helpers/providers/gpt.ts         wraps `codex exec --json`
  test/helpers/providers/gemini.ts      wraps `gemini -p --output-format stream-json --yolo`
  test/helpers/pricing.ts               per-model USD cost tables (quarterly)
  test/helpers/tool-map.ts              which tools each CLI exposes
  test/helpers/benchmark-runner.ts      orchestrator (Promise.allSettled)
  test/helpers/benchmark-judge.ts       Anthropic SDK quality scorer
  bin/gstack-model-benchmark            CLI entry
  test/benchmark-runner.test.ts         9 unit tests (cost math, formatters, tool-map)
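
Neither types.ts nor benchmark-runner.ts is reproduced in this view,
so as a reading aid, here is a rough sketch of the adapter and result
shapes implied by the output columns described below; every field
name here is an assumption, not the committed code:

  type ProviderName = 'claude' | 'gpt' | 'gemini';

  interface RunResult {
    provider: ProviderName;
    model: string;
    latencyMs: number;
    inputTokens: number;
    cachedInputTokens: number; // reported separately from inputTokens (see pricing note)
    outputTokens: number;
    costUsd: number;
    toolCalls: number;
    quality?: number;          // only populated when --judge runs
    error?: { kind: 'auth' | 'timeout' | 'rate_limit' | 'binary_missing'; reason: string };
  }

  interface ProviderAdapter {
    name: ProviderName;
    available(): Promise<boolean>; // binary on PATH + auth detected
    run(prompt: string, opts: { workdir: string; timeoutMs: number }): Promise<RunResult>;
  }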

Per-provider error isolation:
  - auth → record reason, don't abort batch
  - timeout → record reason, don't abort batch
  - rate_limit → record reason, don't abort batch
  - binary_missing → record in available() check, skip if --skip-unavailable
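
One way the orchestrator could apply that isolation with
Promise.allSettled, reusing the hypothetical types sketched above
(emptyResult() and classify() are invented helpers, not names from
the commit):

  async function runAll(
    adapters: ProviderAdapter[],
    prompt: string,
    opts: { workdir: string; timeoutMs: number },
  ): Promise<RunResult[]> {
    const settled = await Promise.allSettled(adapters.map(a => a.run(prompt, opts)));
    // A rejection (auth, timeout, rate limit) never aborts the batch; it
    // becomes an error entry on that provider's row instead.
    return settled.map((s, i) =>
      s.status === 'fulfilled'
        ? s.value
        : { ...emptyResult(adapters[i].name), error: { kind: classify(s.reason), reason: String(s.reason) } },
    );
  }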

Pricing correction: cached input tokens are disjoint from uncached
input tokens (Anthropic and OpenAI report them separately). The
original math subtracted them, producing negative costs. The cost now
bills cached tokens at the 10% discounted rate and adds that on top
of the full uncached input cost.
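
In code terms the corrected math is roughly the following (a sketch
only; field names are placeholders and the real per-model rates live
in pricing.ts):

  // cost = uncached input at the full rate + cached input at 10% of
  // that rate + output at the output rate
  function costUsd(
    usage: { inputTokens: number; cachedInputTokens: number; outputTokens: number },
    rate: { inputPerMTok: number; outputPerMTok: number },
  ): number {
    const uncached = (usage.inputTokens / 1e6) * rate.inputPerMTok;
    const cached = (usage.cachedInputTokens / 1e6) * rate.inputPerMTok * 0.10; // added, never subtracted
    const output = (usage.outputTokens / 1e6) * rate.outputPerMTok;
    return uncached + cached + output;
  }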

CLI:
  gstack-model-benchmark --prompt "..." --models claude,gpt,gemini
  gstack-model-benchmark ./prompt.txt --output json --judge
  gstack-model-benchmark ./prompt.txt --models claude --timeout-ms 60000

Output formats: table (default), json, markdown. Each shows model,
latency, in→out tokens, cost, quality (when --judge used), tool calls,
and any errors.

Known limitations for v1:
- Claude adapter approximates toolCalls as num_turns (stream-json
  would give exact counts; v2 can upgrade).
- Live E2E tests (test/providers.e2e.test.ts) not included — they
  require CI secrets for all three providers. Unit tests cover the
  shape and math.
- Provider CLIs sometimes return non-JSON error text to stdout; the
  parsers fall back to treating raw output as plain text in that case.
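
The fallback in that last point is just a guarded JSON.parse; a
sketch of the shape (the real parsers live inside the provider
adapters, not in this file):

  function parseCliOutput(raw: string): { json?: unknown; text: string } {
    try {
      return { json: JSON.parse(raw), text: raw };
    } catch {
      return { text: raw }; // non-JSON error text from the CLI: keep it verbatim
    }
  }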

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 06:16:42 +08:00

112 lines · 3.8 KiB · TypeScript · Executable File

#!/usr/bin/env bun
/**
 * gstack-model-benchmark — run the same prompt across multiple providers
 * and compare latency, tokens, cost, quality, and tool-call count.
 *
 * Usage:
 *   gstack-model-benchmark <skill-or-prompt-file> [options]
 *
 * Options:
 *   --models claude,gpt,gemini    Comma-separated provider list (default: claude)
 *   --prompt "<text>"             Inline prompt instead of a file
 *   --workdir <path>              Working dir passed to each CLI (default: cwd)
 *   --timeout-ms <n>              Per-provider timeout (default: 300000)
 *   --output table|json|markdown  Output format (default: table)
 *   --skip-unavailable            Skip providers that fail available() check
 *                                 (default: include them with unavailable marker)
 *   --judge                       Run Anthropic SDK judge on outputs for quality score
 *                                 (requires ANTHROPIC_API_KEY; adds ~$0.05 per call)
 *
 * Examples:
 *   gstack-model-benchmark --prompt "Write a haiku about databases" --models claude,gpt
 *   gstack-model-benchmark ./test-prompt.txt --models claude,gpt,gemini --judge
 */
import * as fs from 'fs';
import * as path from 'path';
import { runBenchmark, formatTable, formatJson, formatMarkdown, type BenchmarkInput } from '../test/helpers/benchmark-runner';

type OutputFormat = 'table' | 'json' | 'markdown';

// Return the value of a flag; supports both `--name value` and `--name=value`.
function arg(name: string, def?: string): string | undefined {
  const idx = process.argv.findIndex(a => a === name || a.startsWith(name + '='));
  if (idx < 0) return def;
  const eqIdx = process.argv[idx].indexOf('=');
  if (eqIdx >= 0) return process.argv[idx].slice(eqIdx + 1);
  return process.argv[idx + 1];
}

// Boolean flag: present or not.
function flag(name: string): boolean {
  return process.argv.includes(name);
}
function parseProviders(s: string | undefined): Array<'claude' | 'gpt' | 'gemini'> {
  if (!s) return ['claude'];
  const out: Array<'claude' | 'gpt' | 'gemini'> = [];
  for (const p of s.split(',').map(x => x.trim()).filter(Boolean)) {
    if (p === 'claude' || p === 'gpt' || p === 'gemini') out.push(p);
    else {
      console.error(`WARN: unknown provider '${p}' — skipping. Valid: claude, gpt, gemini.`);
    }
  }
  return out.length ? out : ['claude'];
}

function resolvePrompt(positional: string | undefined): string {
  const inline = arg('--prompt');
  if (inline) return inline;
  if (!positional) {
    console.error('ERROR: specify a prompt via positional path or --prompt "<text>"');
    process.exit(1);
  }
  if (fs.existsSync(positional)) {
    return fs.readFileSync(positional, 'utf-8');
  }
  // Not a file — treat as inline prompt
  return positional;
}
async function main(): Promise<void> {
  // First non-flag argument that is not itself the value of a value-taking
  // flag (e.g. the `claude,gpt` in `--models claude,gpt`).
  const valueFlags = new Set(['--models', '--prompt', '--workdir', '--timeout-ms', '--output']);
  const argv = process.argv.slice(2);
  const positional = argv.find((a, i) => !a.startsWith('--') && !valueFlags.has(argv[i - 1] ?? ''));
  const prompt = resolvePrompt(positional);
  const providers = parseProviders(arg('--models'));
  const workdir = arg('--workdir', process.cwd())!;
  const timeoutMs = parseInt(arg('--timeout-ms', '300000')!, 10);
  const output = (arg('--output', 'table') as OutputFormat);
  const skipUnavailable = flag('--skip-unavailable');
  const doJudge = flag('--judge');

  const input: BenchmarkInput = {
    prompt,
    workdir,
    providers,
    timeoutMs,
    skipUnavailable,
  };
  const report = await runBenchmark(input);

  if (doJudge) {
    try {
      const { judgeEntries } = await import('../test/helpers/benchmark-judge');
      await judgeEntries(report);
    } catch (err) {
      console.error(`WARN: judge unavailable: ${(err as Error).message}`);
    }
  }

  let out: string;
  switch (output) {
    case 'json': out = formatJson(report); break;
    case 'markdown': out = formatMarkdown(report); break;
    case 'table':
    default: out = formatTable(report); break;
  }
  process.stdout.write(out + '\n');
}
main().catch(err => {
  console.error('FATAL:', err);
  process.exit(1);
});