User feedback: "i don't think i would use gstack-publish, i think we
should remove it." Agreed. The CLI + marketplace wiring was an
ambitious but speculative primitive. Zero users, zero validated demand,
and the existing manual `clawhub publish` workflow already covers the
real case (OpenClaw methodology skill publishing).
Deleted:
- bin/gstack-publish (the CLI)
- skills.json (the marketplace manifest)
- test/publish-dry-run.test.ts (13 tests)
- ship/SKILL.md.tmpl Step 19.5 — the methodology-skill publish-on-ship
check. No target to dispatch to anymore.
- README.md Power tools row for gstack-publish
Updated:
- bin/gstack-model-benchmark doc comment: dropped "matches gstack-publish
--dry-run semantics" reference (self-describing flag now)
- CHANGELOG 1.3.0.0 entry:
* Release summary: "three new binaries" → "two new binaries".
Dropped the /ship publish-check narrative.
* Numbers table: "1 of 3 → 3 of 3 wired" → "1 of 2 → 2 of 2 wired".
Deterministic test count: 45 → 32 (removed publish-dry-run's 13).
* Added section: removed gstack-publish CLI bullet + /ship Step 19.5
bullet.
* "What this means for users" closer: replaced the /ship publish
paragraph with the design-taste-engine learning loop, which IS
real, wired, and something users hit every week via /design-shotgun.
* Contributors section: "Four new test files" → "Three new test files"
Retained:
- openclaw/skills/gstack-openclaw-* skill dirs (pre-existed this PR,
still publishable manually via `clawhub publish`, useful standalone
for ClawHub installs)
- CLAUDE.md publishing-native-skills section (same rationale)
Regenerated SKILL.md across all hosts. Ship golden fixtures refreshed
for claude/codex/factory. 455 tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`--models claude,claude,gpt` previously produced a list with a duplicate
entry, meaning the benchmark would run claude twice and bill for two
runs. Surfaced by /review on this branch.
Use a Set internally; return Array.from(seen) to preserve type + order
of first occurrence.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Matches gstack-publish --dry-run semantics. Validates the provider list,
resolves per-adapter auth, echoes the resolved flag values, and exits
without invoking any provider CLI. Zero-cost pre-flight for CI pipelines
and for catching auth drift before starting a paid benchmark run.
Output shape:
== gstack-model-benchmark --dry-run ==
prompt: <truncated>
providers: claude, gpt, gemini
workdir: /tmp/...
timeout_ms: 300000
output: table
judge: off
Adapter availability:
claude: OK
gpt: NOT READY — <reason>
gemini: NOT READY — <reason>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the full spec Codex asked for: real provider adapters with auth
detection, normalized RunResult, pricing tables, tool compatibility
maps, parallel execution with error isolation, and table/JSON/markdown
output. Judge stays on Anthropic SDK as the single stable source of
quality scoring, gated behind --judge.
Codex flagged the original plan as massively under-scoped — the
existing runner is Claude-only and the judge is Anthropic-only. You
can't benchmark GPT or Gemini without real provider infrastructure.
This commit ships it.
New architecture:
test/helpers/providers/types.ts ProviderAdapter interface
test/helpers/providers/claude.ts wraps `claude -p --output-format json`
test/helpers/providers/gpt.ts wraps `codex exec --json`
test/helpers/providers/gemini.ts wraps `gemini -p --output-format stream-json --yolo`
test/helpers/pricing.ts per-model USD cost tables (quarterly)
test/helpers/tool-map.ts which tools each CLI exposes
test/helpers/benchmark-runner.ts orchestrator (Promise.allSettled)
test/helpers/benchmark-judge.ts Anthropic SDK quality scorer
bin/gstack-model-benchmark CLI entry
test/benchmark-runner.test.ts 9 unit tests (cost math, formatters, tool-map)
Per-provider error isolation:
- auth → record reason, don't abort batch
- timeout → record reason, don't abort batch
- rate_limit → record reason, don't abort batch
- binary_missing → record in available() check, skip if --skip-unavailable
Pricing correction: cached input tokens are disjoint from uncached
input tokens (Anthropic/OpenAI report them separately). Original
math subtracted them, producing negative costs. Now adds cached at
the 10% discount alongside the full uncached input cost.
CLI:
gstack-model-benchmark --prompt "..." --models claude,gpt,gemini
gstack-model-benchmark ./prompt.txt --output json --judge
gstack-model-benchmark ./prompt.txt --models claude --timeout-ms 60000
Output formats: table (default), json, markdown. Each shows model,
latency, in→out tokens, cost, quality (when --judge used), tool calls,
and any errors.
Known limitations for v1:
- Claude adapter approximates toolCalls as num_turns (stream-json
would give exact counts; v2 can upgrade).
- Live E2E tests (test/providers.e2e.test.ts) not included — they
require CI secrets for all three providers. Unit tests cover the
shape and math.
- Provider CLIs sometimes return non-JSON error text to stdout; the
parsers fall back to treating raw output as plain text in that case.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>