v1.57.2.0 feat: AskUserQuestion prose fallback when the tool fails at runtime (#1908)

* feat(auq): add gstack-session-kind + echo SESSION_KIND in preamble

Classifies the session as spawned | headless | interactive from env markers
(OPENCLAW_SESSION / GSTACK_HEADLESS / CONDUCTOR_* / CLAUDE_CODE_ENTRYPOINT / CI),
defaulting to interactive. Echoed once at skill start alongside BRANCH/REPO_MODE
so the AskUserQuestion-failure fallback can branch without a shell-out at failure
time. Degrade-safe: empty/error => interactive.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(auq): prose fallback when AskUserQuestion fails (interactive sessions)

On a genuine AUQ failure (tool absent, or present-but-erroring like Conductor's
flaky MCP returning '[Tool result missing due to internal error]'): retry once,
then branch on SESSION_KIND — spawned auto-chooses, headless BLOCKs, interactive
renders a prose decision brief the user answers by typing a letter.

The prose fallback MUST surface the triad: a clear ELI10 of the issue, a
per-choice Completeness score, and a recommendation+why (one paragraph per
choice). Carves out the [plan-tune auto-decide] denial as NOT a failure, and
qualifies the former 'tool_use, not prose' assertions so the rule isn't
self-contradicting. Tests pin the triad, the SESSION_KIND branch, the OV2
collision guard, the always-loaded guarantee, and a cross-file invariant on the
auto-decide prefix.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(auq): default GSTACK_HEADLESS=1 in eval/E2E runners

Headless harness runs classify as headless (BLOCK on AUQ failure rather than
emit a prose question no one reads). SDK runner uses ambient mutation, not the
Options.env object, to avoid breaking the SDK auth pipeline. Interactive-path
suites opt out by overriding the env per-run.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(auq): defensive PostToolUse error-fallback hook (OV3:B)

When an AskUserQuestion call returns an error/missing result, this hook injects
additionalContext reminding the model to run the prose fallback for the current
SESSION_KIND. It does not render prose itself — it guarantees the reminder fires
at the moment of failure instead of relying on the model recalling SESSION_KIND.

Inert on success and inert if the platform never invokes PostToolUse on tool
errors (unverified — could not force the Conductor MCP error in a harness; see
the spike doc). The prompt-level fallback covers the case regardless. Decision
logic is unit-tested deterministically; registered in setup beside the existing
AUQ hooks.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(auq): regenerate SKILL.md for all hosts + refresh ship goldens

Regenerated from the resolver changes (gen:skill-docs --host all). Refreshes the
byte-exact ship golden fixtures (claude/codex/factory). Spec prose tightened so
the cross-cutting preamble addition stays under the 5% per-skill parity ceiling
(investigate 4.8%) — guard unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(test): kebab testNames for section-loading E2Es to match TOUCHFILES keys

The two section-loading E2E tests used display-form testNames ('/ship
section-loading', '/plan-ceo-review section-loading') while every other E2E
testName and their E2E_TOUCHFILES keys are kebab. The completeness gate does an
exact `name in E2E_TOUCHFILES` check, so it failed (pre-existing on main); diff-
based selection also couldn't match them. Align to ship-section-loading /
plan-ceo-section-loading.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(test): make external-host freshness checks deterministic

The parameterized host smoke + --host all freshness tests assumed an external
`gen:skill-docs --host all` had run first (it never does in `bun test`), so which
host reported STALE varied by sibling-test timing — flaky. Regenerate the
gitignored external host dirs in a beforeAll so the --dry-run check is
deterministic. It still catches non-deterministic generation (the real bug class
for regenerated outputs); the tracked-claude freshness test runs earlier and is
unaffected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(parity): headroom for AUQ cross-cutting addition on carved document-release

Merging main brought the carve of document-release (smaller skeleton); the AUQ
prose-fallback adds ~2KB to every skill's always-loaded preamble, landing
document-release at ~5.9% over the pre-carve v1.53.0.0 baseline. Add a per-carve
maxSizeRatio override (CARVE_GUARDS single source of truth) and bump only this
skill to 1.08. All other skills keep the strict 1.05 ceiling.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(auq): harden error-fallback hook + harness per adversarial review

Codex pre-landing review found three real issues:
- The PostToolUse fallback hook shared source 'plan-tune-cathedral' with the
  question-log hook (same event+matcher); gstack-settings-hook replaces the entry,
  so it would have clobbered plan-tune capture. Give it its own 'auq-error-fallback'
  source (separate entry, both run); ALREADY_INSTALLED now requires both sources.
- isErrorResponse triggered on any string containing 'internal error'/'is_error',
  so a real answer or a {"is_error": false} payload could fire the fallback after a
  successful question. Narrow it to the missing-result sentinel + boolean is_error.
- The SDK runner mutated process.env.GSTACK_HEADLESS process-wide (leaked headless
  into later tests). Removed; GSTACK_HEADLESS=1 now lives in the eval package.json
  scripts, scoped to the invocation and inherited by the SDK child.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v1.57.2.0)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
Garry Tan
2026-06-07 21:38:21 -07:00
committed by GitHub
parent e722c5bf89
commit 4dfdb7cdc2
72 changed files with 2035 additions and 211 deletions
+68
View File
@@ -1,5 +1,73 @@
# Changelog
## [1.57.2.0] - 2026-06-08
## **When the question picker breaks mid-skill, gstack asks in plain text instead of stalling.**
## **Every skill detects a dead AskUserQuestion and falls back to a full decision brief you answer by typing a letter.**
AskUserQuestion is how every gstack skill asks you to decide. When the host's question
tool fails at runtime, which Conductor's MCP integration currently does intermittently,
skills used to stall or hard-block. Now each skill detects the failure, works out
whether a human is actually present, and if so re-renders the exact same decision as a
text message: a plain-English explanation of the issue, a completeness score on each
choice, and a recommendation with its reason, one paragraph per choice. You answer by
typing a single letter. Headless eval runs still block cleanly (no human to answer);
orchestrator sessions keep auto-choosing. This whole release was built and reviewed
through that fallback, because the Conductor tool was down the entire session.
### The numbers that matter
No production benchmark for a reliability path like this. These are the behavior and
coverage facts, verifiable with `bun test test/gstack-session-kind.test.ts
test/resolver-ask-user-format.test.ts test/auq-error-fallback-hook.test.ts`.
| When AskUserQuestion fails | Before | After |
|---|---|---|
| Interactive session (human present) | stall / hard BLOCK | full prose decision brief, answer by letter |
| Headless eval / CI | BLOCK | BLOCK (unchanged, correct) |
| Orchestrator (OpenClaw) session | undefined | auto-choose recommended (contract kept) |
| Session kinds detected | 0 | 3 (interactive / headless / spawned) |
| New tests guarding the path | 0 | 34 |
The text brief is not a degraded stub. It carries the same three things the picker
shows: a clear explanation of what is being decided, a `Completeness: X/10` on every
choice, and a recommendation with the reason it wins.
### What this means for you
If your host's question tool flakes out, a skill no longer dies on you. You get the
same decision to make, in text, and you reply with a letter. Nothing changes when the
tool works normally. If you run gstack headless, those sessions still block on a needed
question exactly as before, so eval determinism is intact.
### Itemized changes
#### Added
- `gstack-session-kind` classifies each session as interactive, headless, or spawned,
echoed as `SESSION_KIND` at skill start so any skill can branch on it.
- Plain-text fallback for AskUserQuestion: on a tool failure in an interactive session,
the skill renders the full decision brief (issue ELI10 + per-choice completeness +
recommendation) as markdown you answer by typing a letter, then stops and waits.
- A defensive hook that, when an AskUserQuestion call errors, reminds the agent to run
the fallback for the current session kind.
#### Changed
- AskUserQuestion is still sent as a normal tool call; the prose path applies only when
the tool is unavailable or erroring, and never on a `[plan-tune auto-decide]` result.
#### Fixed
- Section-loading tests use the canonical kebab test names, so the test-coverage gate
matches them.
- External-host doc-freshness checks are deterministic, no longer dependent on a prior
full regeneration.
#### For contributors
- The eval/E2E runners set `GSTACK_HEADLESS=1` so headless runs classify correctly;
interactive-path suites opt out per-run.
- Per-skill `maxSizeRatio` override in the carve-guards registry; `document-release`
gets 1.08 headroom for the cross-cutting preamble addition while every other skill
keeps the 1.05 ceiling.
## [1.57.0.0] - 2026-06-07
## **Three more heavyweight skills load lighter, and every carved skill finally has a test that proves it loads.**