gstack

mirror of https://github.com/garrytan/gstack.git synced 2026-06-18 07:40:09 +02:00

Garry Tan a647064734 test: add gate-tier mode-posture regression tests

Three gate-tier E2E tests detect when preamble / template changes
flatten the distinctive posture of /plan-ceo-review SCOPE EXPANSION or
/office-hours (startup Q3, builder mode). The V1 regression that this
PR fixes shipped without anyone catching it. These tests are the
ongoing signal so the same thing doesn't happen again.

Pieces:
- `judgePosture(mode, text)` in `test/helpers/llm-judge.ts`. Sonnet
  judge with mode-specific dual-axis rubric (expansion: surface_framing
  + decision_preservation; forcing: stacking_preserved +
  domain_matched_consequence; builder: unexpected_combinations +
  excitement_over_optimization). Pass threshold 4/5 on both axes.
- Three fixtures in `test/fixtures/mode-posture/` — deterministic input
  for expansion proposal generation, Q3 forcing question, and builder
  adjacent-unlock riffing.
- `plan-ceo-review-expansion-energy` case appended to
  `test/skill-e2e-plan.test.ts`. Generator: Opus (skill default). Judge:
  Sonnet.
- New `test/skill-e2e-office-hours.test.ts` with
  `office-hours-forcing-energy` + `office-hours-builder-wildness`
  cases. Generator: Sonnet. Judge: Sonnet.
- Touchfile registration in `test/helpers/touchfiles.ts` — all three as
  `gate` tier in `E2E_TIERS`, triggered by changes to
  `scripts/resolvers/preamble.ts`, the relevant skill template, the
  judge helper, or any mode-posture fixture.
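
A minimal sketch of the two mechanisms described in the list above, under stated assumptions. The axis names, the 4/5 threshold, and the listed paths come from this commit message; everything else (`Mode`, `AXES`, `passesPosture`, `TOUCHFILES`, `isTriggered`, and the prefix-matching rule) is hypothetical and is not the actual code in `test/helpers/llm-judge.ts` or `test/helpers/touchfiles.ts`. The skill template path is not given here, so it is left out.

```typescript
// Hypothetical names throughout; only the axis names, threshold, and paths
// are taken from the commit message.
type Mode = "expansion" | "forcing" | "builder";

// Dual-axis rubric: each mode is judged on exactly two axes.
const AXES: Record<Mode, [string, string]> = {
  expansion: ["surface_framing", "decision_preservation"],
  forcing: ["stacking_preserved", "domain_matched_consequence"],
  builder: ["unexpected_combinations", "excitement_over_optimization"],
};

const PASS_THRESHOLD = 4; // out of 5, required on both axes

// Passes only when both axes for the mode are scored at or above the threshold.
// A missing axis score counts as 0, so it fails.
function passesPosture(mode: Mode, scores: Record<string, number>): boolean {
  return AXES[mode].every((axis) => (scores[axis] ?? 0) >= PASS_THRESHOLD);
}

// Assumption: a test is triggered when any changed path equals or starts with
// a registered touchfile path. The real matching rule may differ (e.g. globs).
const TOUCHFILES: string[] = [
  "scripts/resolvers/preamble.ts",
  "test/helpers/llm-judge.ts",
  "test/fixtures/mode-posture/",
];

function isTriggered(changed: string[], touchfiles: string[]): boolean {
  return changed.some((file) =>
    touchfiles.some((touch) => file === touch || file.startsWith(touch)),
  );
}
```

Requiring both axes to clear the bar, rather than averaging them, means a strong score on one axis cannot mask a flattened posture on the other.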

Cost: ~$0.50-$1.50 per triggered PR. Sonnet judge is cheap; Opus
generator for the plan-ceo-review case dominates.

Known V1.1 tradeoff: the judges test prose markers more than deep
behavior. A V1.2 candidate is a cross-provider (Codex) adversarial judge
on the same output, to decouple the verdict from house-style bias.

2026-04-18 23:46:00 +08:00

codex-session-runner.ts

fix: enforce Codex 1024-char description limit + auto-heal stale installs (v0.11.9.0) (#391)

2026-03-23 08:44:08 -07:00

e2e-helpers.ts

feat: remove trigger guard + proactive opt-out prompt (#457)

2026-03-24 18:07:36 -07:00

eval-store.test.ts

feat: QA restructure, browser ref staleness, eval efficiency metrics (v0.4.0) (#83)

2026-03-15 23:55:39 -05:00

eval-store.ts

feat: worktree isolation for E2E tests + infrastructure elegance (v0.11.12.0) (#425)