mirror of
https://github.com/garrytan/gstack.git
synced 2026-05-06 05:35:46 +02:00
614354fc41
Adds the full spec Codex asked for: real provider adapters with auth detection, normalized RunResult, pricing tables, tool compatibility maps, parallel execution with error isolation, and table/JSON/markdown output. Judge stays on Anthropic SDK as the single stable source of quality scoring, gated behind --judge. Codex flagged the original plan as massively under-scoped — the existing runner is Claude-only and the judge is Anthropic-only. You can't benchmark GPT or Gemini without real provider infrastructure. This commit ships it. New architecture: test/helpers/providers/types.ts ProviderAdapter interface test/helpers/providers/claude.ts wraps `claude -p --output-format json` test/helpers/providers/gpt.ts wraps `codex exec --json` test/helpers/providers/gemini.ts wraps `gemini -p --output-format stream-json --yolo` test/helpers/pricing.ts per-model USD cost tables (quarterly) test/helpers/tool-map.ts which tools each CLI exposes test/helpers/benchmark-runner.ts orchestrator (Promise.allSettled) test/helpers/benchmark-judge.ts Anthropic SDK quality scorer bin/gstack-model-benchmark CLI entry test/benchmark-runner.test.ts 9 unit tests (cost math, formatters, tool-map) Per-provider error isolation: - auth → record reason, don't abort batch - timeout → record reason, don't abort batch - rate_limit → record reason, don't abort batch - binary_missing → record in available() check, skip if --skip-unavailable Pricing correction: cached input tokens are disjoint from uncached input tokens (Anthropic/OpenAI report them separately). Original math subtracted them, producing negative costs. Now adds cached at the 10% discount alongside the full uncached input cost. CLI: gstack-model-benchmark --prompt "..." --models claude,gpt,gemini gstack-model-benchmark ./prompt.txt --output json --judge gstack-model-benchmark ./prompt.txt --models claude --timeout-ms 60000 Output formats: table (default), json, markdown. Each shows model, latency, in→out tokens, cost, quality (when --judge used), tool calls, and any errors. Known limitations for v1: - Claude adapter approximates toolCalls as num_turns (stream-json would give exact counts; v2 can upgrade). - Live E2E tests (test/providers.e2e.test.ts) not included — they require CI secrets for all three providers. Unit tests cover the shape and math. - Provider CLIs sometimes return non-JSON error text to stdout; the parsers fall back to treating raw output as plain text in that case. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
83 lines
2.3 KiB
TypeScript
83 lines
2.3 KiB
TypeScript
/**
|
|
* Tool compatibility map across provider CLIs.
|
|
*
|
|
* Not all provider CLIs expose equivalent tools. A benchmark that uses Edit, Glob,
|
|
* or Grep won't run cleanly on CLIs that don't have those. The map answers:
|
|
* "which tools does each provider's CLI expose by default?"
|
|
*
|
|
* When a benchmark is scoped to a tool a provider lacks, the harness records
|
|
* `unsupported_tool` in the result and continues with the other providers.
|
|
*
|
|
* Source-of-truth references:
|
|
* - Claude Code: https://code.claude.com/docs/en/tools
|
|
* - Codex CLI: `codex exec --help` tool listing
|
|
* - Gemini CLI: `gemini --help` (limited tool surface as of 2026-04)
|
|
*/
|
|
|
|
export type ToolName =
|
|
| 'Read'
|
|
| 'Write'
|
|
| 'Edit'
|
|
| 'Bash'
|
|
| 'Agent'
|
|
| 'Glob'
|
|
| 'Grep'
|
|
| 'AskUserQuestion'
|
|
| 'WebSearch'
|
|
| 'WebFetch';
|
|
|
|
export const TOOL_COMPATIBILITY: Record<'claude' | 'gpt' | 'gemini', Record<ToolName, boolean>> = {
|
|
claude: {
|
|
Read: true,
|
|
Write: true,
|
|
Edit: true,
|
|
Bash: true,
|
|
Agent: true,
|
|
Glob: true,
|
|
Grep: true,
|
|
AskUserQuestion: true,
|
|
WebSearch: true,
|
|
WebFetch: true,
|
|
},
|
|
gpt: {
|
|
// Codex CLI has a narrower tool surface: it uses shell + apply_patch.
|
|
// Read/Glob/Grep-style operations happen via shell pipelines.
|
|
Read: true,
|
|
Write: false, // apply_patch handles writes; no standalone Write tool
|
|
Edit: false, // apply_patch handles edits; no standalone Edit tool
|
|
Bash: true,
|
|
Agent: false,
|
|
Glob: false,
|
|
Grep: false,
|
|
AskUserQuestion: false,
|
|
WebSearch: true, // --enable web_search_cached
|
|
WebFetch: false,
|
|
},
|
|
gemini: {
|
|
// Gemini CLI (as of 2026-04) has a limited tool surface in --yolo mode.
|
|
// Shell access depends on flags; most agentic tools are not exposed.
|
|
Read: true,
|
|
Write: false,
|
|
Edit: false,
|
|
Bash: false,
|
|
Agent: false,
|
|
Glob: false,
|
|
Grep: false,
|
|
AskUserQuestion: false,
|
|
WebSearch: true,
|
|
WebFetch: false,
|
|
},
|
|
};
|
|
|
|
/**
|
|
* Determine which tools from a required-set are missing for a given provider.
|
|
* Empty array means full compatibility.
|
|
*/
|
|
export function missingTools(
|
|
provider: 'claude' | 'gpt' | 'gemini',
|
|
requiredTools: ToolName[]
|
|
): ToolName[] {
|
|
const map = TOOL_COMPATIBILITY[provider];
|
|
return requiredTools.filter(t => !map[t]);
|
|
}
|