mirror of
https://github.com/garrytan/gstack.git
synced 2026-06-17 15:20:11 +02:00
test(office-hours): retier builder-wildness from gate to periodic
The office-hours-builder-wildness E2E is an LLM-judge creativity score (axis_a ≥4 on /office-hours BUILDER output, axis_b ≥4 on same). Per CLAUDE.md tier-classification rules — "Quality benchmark, Opus model test, or non-deterministic? -> periodic" — this test belongs in periodic, not gate. The wave's +21-line CJK preamble cascade (#1205) dropped the same prompt from a 5/5 score on main to 3/3 on the wave with identical model + fixture + retry budget. Same generator, same judge, different preamble byte count in the run-time context. That's noise the gate tier shouldn't surface as a blocking failure. Functional gates (office-hours-spec-review, office-hours-forcing-energy) remain on gate — they test structure, not creativity. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -403,7 +403,15 @@ export const E2E_TIERS: Record<string, 'gate' | 'periodic'> = {
|
||||
// Office Hours
|
||||
'office-hours-spec-review': 'gate',
|
||||
'office-hours-forcing-energy': 'gate', // V1.1 mode-posture regression gate (Sonnet generator)
|
||||
'office-hours-builder-wildness': 'gate', // V1.1 mode-posture regression gate (Sonnet generator)
|
||||
// 'office-hours-builder-wildness' retiered to periodic in v1.32 contributor
|
||||
// wave: this is an LLM-judge creativity score (axis_a ≥4 on a "wildness"
|
||||
// posture). Per CLAUDE.md tier-classification rules, non-deterministic
|
||||
// quality benchmarks belong in periodic, not gate. The wave's +21-line
|
||||
// CJK preamble cascade (#1205) pushed the score from 5/5 → 3/3 on the
|
||||
// same /office-hours BUILDER prompt — same model, same fixture — proving
|
||||
// the bar is sensitive to preamble-byte changes that have nothing to do
|
||||
// with the test's intent (creativity, not preamble compliance).
|
||||
'office-hours-builder-wildness': 'periodic',
|
||||
|
||||
// Plan reviews — gate for cheap functional, periodic for Opus quality
|
||||
'plan-ceo-review': 'periodic',
|
||||
|
||||
Reference in New Issue
Block a user