Big merge. Main shipped three releases while this branch was in flight:
- v0.19.0.0 /plan-tune skill (observational layer; dual-track dev profile)
- v1.0.0.0 V1 prompts (simpler, outcome-framed, jargon-glossed) + LOC receipts
- v1.1.0.0 browse Puppeteer parity (load-html, file://, --selector, --scale)
This branch bumps to v1.2.0.0 (above main's v1.1.0.0) per the
branch-scoped-version rule in CLAUDE.md. My "0.19.0.0" CHANGELOG entry
is renamed to "1.2.0.0" and dated 2026-04-18 to land above main's trail.
Conflicts resolved:
- VERSION / package.json: 1.2.0.0
- CHANGELOG.md: preserved my entry at top (renamed), kept main's 1.1.0.0
/ 1.0.0.0 / 0.19.0.0 / 0.18.4.0 trail below in correct order
- .github/docker/Dockerfile.ci: kept my xz-utils + nodejs.org tarball
fix (a real CI bug fix main didn't have); absorbed main's retry loop
structure for both apt and the tarball curl (shape sketched after this list)
- bin/gstack-config: kept both my checkpoint_mode/push section and
main's explain_level writing-style section
- scripts/resolvers/preamble.ts: kept my submodule refactor as the base
shape of the file; extracted main's new generateWritingStyle and
generateWritingStyleMigration into scripts/resolvers/preamble/
submodules; absorbed main's generateQuestionTuning import
- All generated SKILL.md files: resolved by regen via
bun run gen:skill-docs --host all (per CLAUDE.md: never hand-merge
generated files — resolve templates and regen)
- Ship golden fixtures (claude/codex/factory): refreshed. Tier 2
preamble composition now includes all 8 sections: context recovery,
ask-user-format, writing-style, completeness, confusion, continuous
checkpoint, context health, question tuning.
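For reference, the retry shape absorbed into Dockerfile.ci is roughly the
sketch below (illustrative only: the retry count, sleep, and NODE_VERSION
placeholder are assumptions, not the exact values in the file):

    # apt step with retries
    for i in 1 2 3; do
      apt-get update && apt-get install -y xz-utils && break
      sleep 5
    done
    # nodejs.org tarball step with retries (NODE_VERSION is a placeholder)
    for i in 1 2 3; do
      curl -fsSL "https://nodejs.org/dist/v${NODE_VERSION}/node-v${NODE_VERSION}-linux-x64.tar.xz" \
        -o /tmp/node.tar.xz && break
      sleep 5
    done
    tar -xJf /tmp/node.tar.xz -C /usr/local --strip-components=1  # needs xz-utils for -J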
Main also brought new test files from /plan-tune: skill-e2e-plan-tune,
upgrade-migration-v1, v0-dormancy, writing-style-resolver. All absorbed.
468 tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wires the orphaned gstack-model-benchmark binary into a dedicated skill
so users can discover cross-model benchmarking via /benchmark-models or
voice triggers ("compare models", "which model is best").
Deliberately separate from /benchmark (page performance) because the
two surfaces test completely different things — confusing them would
muddy both.
Flow:
1. Pick a prompt (an existing SKILL.md file, inline text, or file path)
2. Confirm providers (dry-run shows auth status per provider)
3. Decide on --judge (adds ~$0.05, scores output quality 0-10)
4. Run the benchmark — table output
5. Interpret results (fastest / cheapest / highest quality)
6. Offer to save to ~/.gstack/benchmarks/<date>.json for trend tracking
Uses gstack-model-benchmark --dry-run as a safety gate — auth status is
visible BEFORE the user spends API calls. If zero providers are authed,
the skill stops cleanly rather than attempting a run that produces no
useful output.
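For the curious, the gate-then-run shape is roughly (illustrative only;
prompt and provider arguments are elided because they depend on what the
user picked in steps 1-2):

    # safety gate: per-provider auth status, no API spend
    gstack-model-benchmark --dry-run
    # if at least one provider is authed, run for real
    # (prompt/provider args elided; --judge adds the 0-10 quality score, ~$0.05)
    gstack-model-benchmark ... --judge
    # step 6 then offers to save the results to
    # ~/.gstack/benchmarks/<date>.json for trend tracking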
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>