mirror of
https://github.com/garrytan/gstack.git
synced 2026-07-01 22:15:43 +02:00
0f165cbed0edb6800a2fc6edda56b1027a11e621
13 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
cab774cced |
v1.56.0.0 Token-reduction Phase B + AUQ paranoid safety net (#1849)
* refactor(plan-ceo-review): carve review body into on-demand section
Carve the largest skill (138,838 B) into a skeleton + one on-demand
section, the documented next Phase B target after /ship (v2_PLAN.md:216).
- sections/review-sections.md(.tmpl): the 11-section deep review, codex/
outside-voice rules, how-to-ask, Required Outputs, registries, Completion
Summary, Review Log, REVIEW_DASHBOARD, PLAN_FILE_REVIEW_REPORT, Next Steps,
docs/designs promotion, Formatting Rules, and the Mode Quick Reference.
- sections/manifest.json: passive registry (CM2), one entry.
- SKILL.md.tmpl: {{SECTION_INDEX}} after the system audit, a single
{{SECTION:review-sections}} STOP-Read after Step 0 mode selection, and a
Section self-check. All of Step 0 (the scope/mode conversation) stays in
the always-loaded skeleton; only EXIT_PLAN_MODE_GATE follows the section.
Measured: always-loaded skeleton 138,838 -> 80,731 B (-42%, ~14.4K tokens
off every invocation). Union (skeleton + section) 139,110 B, behavior held.
Boundary honors Codex P1: nothing review-governing (formatting rules, mode
reference, how-to-ask, required outputs) sits in the skeleton below the
STOP. Housekeeping resolvers ride in the section, matching the ship
precedent (adversarial.md carries LEARNINGS_LOG + GBRAIN_SAVE_RESULTS).
Tests (atomic with the carve — skill-docs.yml gates gen:skill-docs
freshness on every push, so source + regen + tests must land together):
- parity-harness: plan-ceo flipped to sectioned, maxSkeletonBytes 90_000
(measured 80,731 + headroom); content/minBytes run against the union.
- skill-size-budget: plan-ceo-review added to SECTIONS_EXTRACTED.
- section-manifest-consistency: generalized to discover every carved skill,
vars computed per-skill-case (Codex P2).
- skill-ceo-section-ordering (new, gate): per-PR static guard — STOP after
Step 0, review body absent from skeleton, report writer in the section,
nothing review-governing below the STOP.
- skill-e2e-plan-ceo-review-section-loading (new, periodic): refreshes the
installed skill first (Codex P1), drives full Step 0, asserts the section
is Read before the report.
- gen-skill-docs + skill-validation: read the skeleton+sections union for
carved skills so relocated prose still counts.
- touchfiles: plan-ceo-section-loading registered (periodic).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore: bump VERSION + CHANGELOG for plan-ceo-review carve (v1.56.0.0)
MINOR: carves the largest skill into skeleton + on-demand section,
dropping plan-ceo-review's always-loaded cost 42% (138,838 -> 80,731 B,
~14.4K tokens off every invocation). User-facing release notes lead with
the measured token win.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* docs(todos): file P3 follow-up — carve the shared {{PREAMBLE}} reference blocks
Surfaced by /plan-eng-review on the plan-ceo-review carve: per-skill section
carves stay modest because the ~40-50KB shared preamble dominates the
always-loaded surface. A single preamble-reference carve would help every
tier->=2 skill at once. Records the why, the cold-vs-hot split to measure,
and the guards it needs. Not implemented this PR.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* test(auq): Layer 0 — guarantee AUQ format spec is always-loaded
Deterministic, free, per-PR keystone for the token-reduction era. For every
interactive (tier>=2) skill, asserts the full AskUserQuestion decision-brief
format (ELI10/Recommendation/Pros-cons/checks/Net/(recommended)/Stakes/
self-check) lives in the always-loaded SKILL.md skeleton, NOT only in an
on-demand section. Plus a roster guard (a carve can't silently drop the block)
and per-skill rule survival in the skeleton+sections union. 51 cases + a
negative control. Fails the instant a future carve strands AUQ-governing text
where it won't be loaded when a question fires.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* test(auq): SDK capture engine + verbose-vs-carved no-degradation A/B
Adds the reusable SDK $OUT_FILE capture engine (auq-sdk-capture.ts): drives a
skill to its AUQ and captures the verbatim text the model GENERATES, cleanly
(real-PTY mangles plan-mode AUQs via cursor escapes). Pins the skill to an
absolute path with Read/Write-only tools so the agent can't wander to the
global install. gradeAuqRecommendation normalizes a non-"because" connective
before grading so substantive reasons aren't false-flagged (without touching
the pinned shared judge).
The A/B drives the same prompt through the carved 80KB skeleton and the
pre-carve 137KB monolith and fails if carved scores worse. Result: both 7/7
format, substance 5 — proven no degradation, transcript-verified each side read
its own planted SKILL.md. Periodic tier.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* test(auq): consistency — same trigger N runs, stable format + substance
Drives the carved /plan-ceo-review AUQ N=3 times and fails if any format
element appears in one run but not another, or substance craters. Targets the
"fine one run, broken the next" failure class a single snapshot can't see.
Result: 3/3 stable, 7/7 + substance 5 every run. Periodic tier.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* test(auq): behavioral matrix across AUQ-heavy skills
Data-driven test that drives each AUQ-heavy skill (plan-eng/design/devex,
office-hours, cso, spec, design-consultation) to its first AskUserQuestion and
grades it to the plan-ceo bar: 7/7 decision-brief format + recommendation
substance >=4. One case per skill (isolated failures), env-subsettable via
AUQ_MATRIX_ONLY. Browser/design-binary skills are intentionally excluded
(comparison boards, not format-AUQs; Layer 0 covers their spec). All targeted
skills pass 7/7 with substance 4-5. Periodic tier.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* test(codex): live recommendation-substance grade for /codex
Closes the gap where /codex's synthesis recommendation was only checked
statically (template grep) and via fixtures. Drives the real /codex skill over
a flawed diff and grades the emitted "Recommendation: ... because ..." line
with judgeRecommendation (present/commits/has_because/substance>=4). The named
weak spot holds up: substance 5. Periodic tier.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* test(auq): deterministic trigger for format-compliance gate
A bare /plan-ceo-review against a repo whose work is already implemented makes
the model improvise an off-script "what should I review?" scope question that
skips the decision-brief format, which the gate test then times out waiting for.
Hand it a concrete plan to review (FORCING_FLOOR_CEO) so it reaches the real
Step 0 mode-selection AUQ that is the intended format check.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* refactor(office-hours): carve Phase 5+6 into on-demand section
Third Phase B carve (v2_PLAN.md:216, after ship and plan-ceo-review). Moves
Phase 5 (Design Doc templates) + Phase 6 (tiered relationship handoff) — the
session's output + closing tail, only reached after the conversation and
alternatives are done — into sections/design-and-handoff.md, behind a single
STOP-Read after Phase 4.5. The live conversation (Phases 1-4.5) and the
always-run Important Rules stay in the always-loaded skeleton.
Measured: always-loaded skeleton 118,280 -> 88,975 B (-24.8%). Union preserved.
The carved AUQ is identical to pre-carve (matrix: 7/7 format, substance 5),
and Layer 0 confirms the AUQ format spec stays in the skeleton — the AUQ
paranoid suite de-risked this carve end to end.
Atomic with tests + regen (skill-docs.yml gates gen:skill-docs freshness on
every push, so source + regen + tests land together; --host all regenerates
the inlined non-Claude variants):
- sections/manifest.json: passive registry, one entry.
- parity-harness: office-hours flipped to sectioned, maxSkeletonBytes 96_000
(measured 88,975 + headroom); content/minBytes run against the union.
- skill-size-budget: office-hours added to SECTIONS_EXTRACTED.
- gen-skill-docs + skill-validation: read the skeleton+sections union for
office-hours so relocated Phase 5/6 prose still counts.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore: bump VERSION + CHANGELOG for office-hours carve + AUQ suite (v1.57.0.0)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* refactor(preamble): carve CJK-escaping manual to on-demand doc
The AskUserQuestion format block is inlined into every interactive skill (~33).
It carried the full multi-paragraph non-ASCII/CJK escaping manual inline, but
that rationale only matters when a question contains CJK text and the operative
rule already lives in the always-loaded self-check. Moved the justification to
docs/askuserquestion-cjk.md (read on demand); kept the rule + a pointer.
Corpus: Claude-host SKILL.md total 3,087,499 -> 3,057,975 B (-29,524 B, ~900 B
x ~33 skills). Layer 0 still passes — the core decision-brief format stays
always-loaded; only the rare CJK rationale moved. Atomic with the all-host
regen (skill-docs.yml freshness gate). VERSION + package.json -> 1.58.0.0.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* refactor(plan-eng-review): carve review body into on-demand section
Fourth Phase B carve (v2_PLAN.md:220). Moves the 4-section review (Architecture,
Code Quality, Tests, Performance), outside voice, required outputs, and review
report — everything after Step 0 scope — into sections/review-sections.md behind
a single STOP-Read. Step 0 (scope challenge) and EXIT_PLAN_MODE_GATE stay in the
always-loaded skeleton.
Measured: skeleton 106,984 -> 54,892 B (-48.7%). Union preserved. Atomic with
tests + all-host regen (freshness gate): parity flipped to sectioned
(maxSkeletonBytes 62K), plan-eng-review added to SECTIONS_EXTRACTED, gen-skill-docs
reads the union for relocated review/TEST_COVERAGE/dashboard prose. Layer 0 green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* refactor(plan-design-review): carve review body into on-demand section
Fifth Phase B carve (v2_PLAN.md:220, bundled with plan-eng). Moves the 7 design
passes, required outputs, and review report — everything after Step 0 scope and
the mockup/rating phase — into sections/review-sections.md behind a STOP-Read.
Step 0, Step 0.5 mockups, the rating method, and EXIT_PLAN_MODE_GATE stay in the
always-loaded skeleton.
Measured: skeleton 112,057 -> 76,024 B (-32.2%). Union preserved. Atomic with
tests + all-host regen: parity sectioned (maxSkeletonBytes 82K), added to
SECTIONS_EXTRACTED, gen-skill-docs reads the union. Layer 0 green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* refactor(plan-devex-review): carve review body into on-demand section
Sixth Phase B carve. Moves the 8 DX passes, required outputs, and review report
— everything after the Step 0 DX investigation — into sections/review-sections.md
behind a STOP-Read. All of Step 0 (persona, empathy, benchmark, journey trace,
roleplay) + the rating method + EXIT_PLAN_MODE_GATE stay always-loaded.
Measured: skeleton 110,621 -> 69,658 B (-37%). Union preserved. Atomic with
tests + all-host regen: added to SECTIONS_EXTRACTED, gen-skill-docs reads the
union. Layer 0 green. (No parity invariant entry for plan-devex-review.)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore: bump VERSION + CHANGELOG for plan-* family carves (v1.59.0.0)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* test: refresh ship golden baselines + gbrain-detection union after carves
Two follow-ups the carve commits should have carried (caught by the full suite,
missed by targeted subsets):
- ship golden baselines (claude/codex/factory) regenerated: the preamble CJK
trim (v1.58) changed ship's always-loaded AskUserQuestion block.
- gbrain-detection-override probes the office-hours skeleton+section union:
GBRAIN_SAVE_RESULTS moved into sections/design-and-handoff.md when office-hours
was carved, so the detection assertions now check both files.
Full `bun test` green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* test(auq): grade format-compliance gate from SDK capture, not the TUI
The real-PTY version grepped the stripAnsi'd interactive AUQ picker. Verified
directly that this cannot work: plan-mode AUQs render as a cursor picker whose
cursor-positioning escapes stripAnsi can't flatten — the picker renders fine for
a human (cursorSeen=45) but the flattened text drops ELI10:/(recommended) and
parseNumberedOptions returns 0. The test was grading a lossy projection and
failed by construction.
Rewritten to drive /plan-ceo-review via the SDK $OUT_FILE capture (the agent
writes the verbatim question it would have shown — clean text, no rendering
loss) and grade 7/7 format + kind-note + recommendation substance >=4. Same
property, reliable, environment-independent; shares the engine with the periodic
A/B and matrix evals. Result: 7/7 format, substance 5. Touchfiles key renamed
ask-user-question-format-pty -> auq-format-gate (no longer a PTY test).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* test: fix carve-broken CI evals (union reads + section fixtures)
Two CI eval jobs failed on the carved plan-* skills because they read content
that moved into sections/:
- llm-judge (skill-llm-eval): runWorkflowJudge sliced SKILL.md between markers
like "## Review Sections" / "## CRITICAL RULE" that now live in
sections/review-sections.md. The markers vanished from the skeleton, so the
judge scored empty/wrong content. Fix: read the skeleton+sections union.
Verified: plan-ceo modes / plan-eng sections / plan-design passes all PASS
(25/25).
- e2e-plan (skill-e2e-plan): setupPlanDir copied only <skill>/SKILL.md into the
fixture, not sections/. The carved skill's STOP pointed at a section file that
was absent, so the model improvised a compressed report table instead of the
canonical "| Review | Trigger | Why | Runs | Status | Findings |". Fix: copy
sections/ alongside SKILL.md in all 6 setup sites. Verified: report test PASS,
canonical table emitted.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* test: copy carved sections into all e2e fixtures (prevent more carve-blind CI fails)
Proactive sweep beyond the two CI logs: every e2e test that copies a carved
skill's SKILL.md into a temp fixture must also copy its sections/, or the
model hits a STOP pointing at a missing section file and improvises/degrades.
- skill-e2e.test.ts: plan-ceo/plan-eng/plan-design/office-hours copies across
planDir/reviewDir/ohDir/benefitsDir dests now copy sections/.
- skill-e2e-plan.test.ts: the office-hours copy + the 4-skill codex-offering
loop now copy sections/.
- skill-e2e-design.test.ts: plan-design-review copy now copies sections/.
- skill-e2e-office-hours.test.ts: both office-hours copies now copy sections/.
- skill-e2e-office-hours-brain-writeback.test.ts: GBRAIN_SAVE_RESULTS moved into
the section, so check the regenerated skeleton+section UNION for the gbrain put
block, ship both into the workdir, and restore both (the section regen was also
leaking into the working tree — finally now restores it).
ship copies (single-file Step-0 slices) and review/retro (not carved) untouched.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* test: migrate section-loading E2E to lossless SDK tool-stream detection
The /ship and /plan-ceo-review section-loading tests drove a real PTY and
scraped the ANSI screen buffer for sections/<file>.md paths. That silently
saw nothing in a Conductor PTY (cursor-positioned tool renders and an
unanswered Step 0 question loop both defeat the regex), so both reported
read: [] even when the agent did the work.
They now run the skill through claude -p (the same SDK path the AUQ matrix
uses) and detect section reads from the tool-use stream — Read calls whose
file_path contains sections/<file>.md — with no rendering layer to mangle.
The run is also hermetic: the freshly-generated worktree skeleton + sections
are copied into a throwaway fixture with the absolute path pinned, so the
test validates this branch's carve without mutating the user's ~/.claude
install.
Validated EVALS_TIER=periodic: both pass (plan-ceo Reads review-sections.md;
ship Reads review-army.md + changelog.md), ~6.5 min for both vs ~23 min
combined on the old PTY path where both were failing.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore: consolidate branch to v1.56.0.0 (single MINOR above main)
The branch bumped VERSION several times during development (1.56 → 1.57 →
1.58 → 1.59), but none of those landed on main (main is at 1.55.1.0). Per
the "never orphan branch-internal versions" discipline, collapse all four
into a single 1.56.0.0 entry — one MINOR release covering the whole branch:
five skills carved (plan-ceo, office-hours, plan-eng, plan-design,
plan-devex), the shared AskUserQuestion preamble CJK trim, and the paranoid
AUQ no-degradation test suite + lossless section-loading tests.
VERSION and package.json set to 1.56.0.0; main's 1.55.1.0 entry preserved
below the consolidated entry. No SKILL.md drift (VERSION is not embedded in
generated bodies).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
||
|
|
070722ace3 |
v1.52.1.0 feat: brain-aware planning — 5 skills read structured gbrain context before asking (#1742)
* feat(brain): brain-cache-spec.ts — single source of truth for cache layer Foundation for the brain-aware planning skills work (v1.48 plan / D2). One TS const file consolidates BRAIN_CACHE_ENTITIES (8 entities × TTL + budget + invalidation rules), SKILL_DIGEST_SUBSETS (per-skill which files to load), SALIENCE_DEFAULT_ALLOWLIST (D9 privacy gate), SKILL_CALIBRATION_WEIGHTS (Phase 2 E5), and policy / identity / schema constants. Drift between docs and runtime becomes impossible by construction: resolver, cache CLI, and test/skill-preflight-budget.test.ts all import from the same module. test/brain-cache-spec.test.ts: 19 invariant assertions (subset/entity consistency, per-skill achievability, allowlist sanity, transport defaults, user-slug fallback chain, lock timeout, retention policy). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(brain): gstack-core@1.0.0 schema pack (T1 / Phase 0) Defines 8 typed page kinds for the brain entity model: gstack/user-profile, gstack/product, gstack/goal, gstack/developer-persona, gstack/brand, gstack/competitive-intel, gstack/skill-run, gstack/take Each declares frontmatter shape (typed fields with required/optional flags), retention policy (immutable / archive-after-90d / never-archive), and emits_links graph for mcp__gbrain__schema_graph rendering. getSchemaPackMutationPayload() returns JSON in the shape accepted by mcp__gbrain__schema_apply_mutations. Idempotent registration: gbrain skips when pack+version already installed. test/gstack-schema-pack.test.ts: 16 invariants on pack shape, retention policies, link verb consistency, JSON serializability. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(brain): gstack-brain-cache CLI (T2a) — core subcommands bin/gstack-brain-cache: TS CLI with five subcommands: get <entity-name> [--project <slug>] refresh [--full] [--entity X] [--project <slug>] invalidate <entity-name> [--project <slug>] digest <entity-slug> meta [--project <slug>] Cache layout per Phase 0.5 design: ~/.gstack/brain-cache/ ← cross-project (user-profile) ~/.gstack/projects/<slug>/brain-cache/ ← per-project (everything else) Per-entity TTL drives staleness; per-entity byte budgets enforce compression at write time. Atomic writes via tmp+rename. Stale-but-usable fallback when brain unreachable (returns cached digest with diagnostic prefix instead of failing). Schema-version mismatch + endpoint switch both trigger full rebuild for the affected scope (D4 A4). Fetch+compress paths wired for the 7 entities (user-profile, product, goals, developer-persona, brand, competitive-intel, recent-decisions, salience) via gbrain CLI shell-out — works for local PGLite and local-stdio MCP, transparent over the existing spawnGbrain helper. Concurrent-refresh dedup (D3 / T15) is a follow-up commit. Salience allowlist gate (D9 / T17) is a follow-up commit. Bootstrap + lifecycle subcommands (T2b / T18) are follow-up commits. test/brain-cache-roundtrip.test.ts: 11 tests covering path resolution, meta lifecycle, endpoint detection, schema mismatch behavior, and the four cache states (warm / cold-refreshed / stale-fallback / missing). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(brain): concurrent-refresh lockfile dedup (T15 / D3) When autoplan dispatches 4 planning skills back-to-back and they all hit a cold-miss on the same digest, only ONE actually fetches from the brain. The rest dedup via the project-scoped lockfile at ~/.gstack/projects/<slug>/brain-cache/.refresh.lock. Reuses the 5-min stale-takeover convention from /sync-gbrain. Lock is taken over when: - File is older than CACHE_REFRESH_LOCK_TIMEOUT_MS - PID is on the same host and dead (process.kill(pid, 0) fails) - Lock file is corrupt (defensive) withRefreshLock(projectSlug, fn) returns either the callback's value or the literal 'dedup'. The CLI emits exit code 3 + diagnostic stderr on dedup, so callers can choose to wait + retry (resolver does this) or fall through to stale-but-usable behavior. test/cache-concurrent-refresh.test.ts: 7 tests covering acquire/release, stale-takeover, dead-PID takeover, corrupt-lock recovery, error-path release, and cross-project lock location. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(brain): salience privacy allowlist gate (T17 / D9) D9 cross-model finding from codex outside voice: salience-sourced digests can include emotionally-weighted personal pages (family, therapy, reflection). Pulling those into a coding-review prompt leaks sensitive context into work-flow reasoning. fetchSalience now strips entries whose slugs don't match an allowlist prefix BEFORE writing to the cache file. Default allowlist is SALIENCE_DEFAULT_ALLOWLIST = ['projects/', 'concepts/', 'gstack/']. User can extend via: gstack-config set salience_allowlist 'projects/,gstack/,concepts/,custom/' or override with GSTACK_SALIENCE_ALLOWLIST env var. Digest still records the strip count for transparency. Empty result emits 'all N entries stripped' note rather than silent absence. test/salience-allowlist.test.ts: 9 tests covering default permits, default blocks, empty allowlist, env override, whitespace trimming, and the invariant that defaults contain nothing sensitive (personal, family, therapy, reflection, private, medical, health). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(brain): bootstrap + list + purge subcommands (T2b / T18) T2b — bootstrap synthesizes draft entity content from CLAUDE.md + README + recent learnings.jsonl and emits as JSON for the caller. Skill template is responsible for the AUQ-confirm-before-write flow (D10 T4 extraction- review requirement). Cli stays pure (no AUQ logic); agent owns user interaction. T18 — list/purge subcommands close the lifecycle loop: list [--project <slug>] — enumerate gstack-owned pages in brain (probe all 8 gstack/* page types) purge <slug> — delete one gstack page, refuses non-gstack/ slugs (defensive) list defaults to all-projects (cross-project user-profile included). With --project, filters to per-project pages plus the cross-project user-profile. --json flag emits machine-readable output for the agent. Retention sweep + audit subcommand are deferred to a follow-up commit (they need the lifecycle scheduling design, not just CLI plumbing). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(brain): brain-aware planning resolvers + 3 new placeholders (T4) scripts/resolvers/gbrain.ts adds: - generateBrainPreflight(ctx) — emits per-skill ## Brain Context block + bash that loads digests via gstack-brain-cache get (one call per digest). Per-skill subset comes from SKILL_DIGEST_SUBSETS (single source). - generateBrainCacheRefresh(ctx) — at-skill-end background refresh hook; non-blocking; warms cache for next run. - generateBrainWriteBack(ctx) — Phase 2 / E5 calibration write-back with per-skill weight. Gated on personal trust policy + the BRAIN_CALIBRATION_WRITEBACK flag. Includes invalidation bash that busts affected digests after the write. scripts/resolvers/index.ts registers three new placeholders: {{BRAIN_PREFLIGHT}}, {{BRAIN_CACHE_REFRESH}}, {{BRAIN_WRITE_BACK}} All three resolvers return empty string for skills not in SKILL_DIGEST_SUBSETS (defensive — skill template authors can drop the placeholders into non-preflight skills with zero effect). D9 privacy is mentioned in the rendered preflight prose so the agent knows to expect filtered salience. D11 codex tension: write-back gates on brain_trust_policy@<hash> being personal — shared brains skip write-back to avoid polluting team calibration profile. test/brain-preflight.test.ts: 19 tests covering subset rendering, non-preflight skill gating, cross-project vs per-project --project flag emission, weight injection per skill, BRAIN_CALIBRATION_WRITEBACK flag mention, and registration in RESOLVERS map. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(brain): gstack-config brain integration helpers (T5+T10+T16) Extends bin/gstack-config to support the brain-aware planning layer: KEY VALIDATION (T5): Plain alphanumeric/underscore now extended to allow @<hex-hash> suffix. Required for per-endpoint namespaced keys (brain_trust_policy@<sha8>, user_slug_at_<sha8>). Keys without the suffix still validate as before. VALUE WHITELISTING (D4 / D11): brain_trust_policy@* values gated to personal | shared | unset. Unknown values warn + default to unset (defense against typos). NEW DEFAULTS (lookup_default): brain_trust_policy@* -> unset salience_allowlist -> '' (resolver uses SALIENCE_DEFAULT_ALLOWLIST) user_slug_at_* -> '' (resolve-user-slug fills + persists on demand) NEW SUBCOMMANDS: endpoint-hash — print sha8 of active gbrain MCP URL from ~/.claude.json. Collision check escalates to sha16 when a prior endpoint stored at the same sha8 would conflict (T10 defensive default). resolve-user-slug — walks D4 A3 identity chain: 1. mcp__gbrain__whoami.client_name 2. $USER env var 3. sha8(git config user.email) 4. anonymous-<sha8(hostname)> Persists result on first call so subsequent calls are stable across sessions. test/user-slug-fallback.test.ts: 14 tests covering endpoint-hash output shape, fallback chain ordering, persistence, brain_trust_policy namespace value validation + per-endpoint isolation, and key validator extension for @-suffixed keys. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(brain): wire 5 planning skill templates with BRAIN_* placeholders (T6) Adds three placeholders to each of the 5 planning SKILL.md.tmpl files: {{BRAIN_PREFLIGHT}} — top of skill body, before first interactive section. Loads the per-skill digest subset (5 files for office-hours, 2 for plan-eng- review, etc.) into the prompt context before any AskUserQuestion fires. {{BRAIN_WRITE_BACK}} — end of skill, before refresh hook. Phase 2 calibration write path; gated on personal policy + BRAIN_CALIBRATION_WRITEBACK flag. {{BRAIN_CACHE_REFRESH}} — end of skill, after write-back. Non-blocking background refresh so next invocation gets warm cache. Files touched (templates + regenerated SKILL.md): office-hours/SKILL.md.tmpl plan-ceo-review/SKILL.md.tmpl plan-eng-review/SKILL.md.tmpl plan-design-review/SKILL.md.tmpl plan-devex-review/SKILL.md.tmpl (matching .md files regenerated via bun run gen:skill-docs) All 5 generated SKILL.md files now contain the rendered ## Brain Context (preflight) section + write-back guidance + background-refresh hook. The resolver renders only for skills in SKILL_DIGEST_SUBSETS — these 5 + an empty string for any other skill that drops in the placeholders. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(brain): setup-gbrain trust-policy step + sync-gbrain flags (T5b / T13+T5c) T5b — setup-gbrain Step 9.5: Inserts the brain trust policy AskUserQuestion before the verdict block. Detects active endpoint hash via gstack-config endpoint-hash. Branches per transport: * Local (sha == "local"): auto-set personal, one-line notice * Remote-MCP, unset: AskUserQuestion (personal vs shared) * Already-set: skip, just print current policy Personal default flips artifacts_sync_mode=full when still off. T13+T5c — sync-gbrain: Adds two flag short-circuits: --refresh-cache : route to gstack-brain-cache refresh --project <slug>; skip code + memory + brain-sync stages. Replaces the planned /brain-refresh-context skill per D1 fold (one fewer always-loaded skill in catalog). --audit : emit gstack-owned page summary + sensitive-content leak check via gstack-brain-cache list. Read-only. Step 1 trust policy gate: fires the same AskUserQuestion as setup-gbrain Step 9.5 when policy is unset for a remote endpoint. Local engines auto-set personal silently. Idempotent for already-set policies. Both templates re-rendered via bun run gen:skill-docs. Trust policy question wording centralized in setup-gbrain Step 9.5; sync-gbrain Step 1 references it to avoid prompt drift. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(brain): schema migration + fence-block fallback + preflight budget (T19+T21) 3 new gate-tier test files closing the most important coverage gaps in the brain-aware planning layer: test/schema-version-migration.test.ts (D4 A4): - Cache file with mismatched schema_version triggers wipe-and-rebuild - Matching version + fresh TTL stays warm-hit (no unnecessary rebuild) - Rebuild wipes ALL files in scope, not just the one being read test/takes-fence-fallback.test.ts: - Every preflight skill mentions both takes_add (preferred) and put_page fence-block (fallback for pre-T8 gbrain versions) - All 5 skills gate on BRAIN_CALIBRATION_WRITEBACK flag + personal trust policy - Per-skill weight matches SKILL_CALIBRATION_WEIGHTS (E5) - Write-back emits the kind=bet frontmatter shape and invalidates affected cache digests test/skill-preflight-budget.test.ts (T21 / D7): - Per-skill BRAIN_* instruction bytes stay under 3x the runtime digest budget (resolver bloat catch) - Autoplan total instruction bytes stay under 75 KB (3x of 25 KB runtime cap) - Non-preflight skills emit zero brain bytes - Per-skill subset references are present in the preflight bash Note on the 3x multiplier: SKILL_PREFLIGHT_BUDGET_BYTES governs runtime digest data (enforced by cache CLI truncateToBudget). Instruction text emitted by the resolver gets a separate 3x headroom — anything beyond that signals the instructions themselves are bloated and need a trim. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(todos): brain-aware planning follow-ups (T11) Adds five deferred items from the v1.48.0.0 brain-aware planning plan: - P2: /gstack-reflect nightly synthesis skill (E2, deferred D4) - P3: cross-machine brain-cache sync (E3, deferred D5) - P3: /gstack-onboarding dedicated skill (E4, deferred D6) - P2: upstream gbrain takes_add + takes_resolve MCP ops (T8 wrap-up) - P3: background-refresh hook supervision (codex outside-voice T3) Each entry follows the TODOS.md format: What / Why / Pros / Cons / Context / Effort / Depends on. Each cross-references the v1.48.0.0 review decision (D-numbers from /plan-ceo-review and /plan-eng-review) that deferred it. The plan itself is at ~/.claude/plans/hm-interesting-well-why-dapper-eagle.md and is NOT a TODO entry (it's a one-shot design doc, not ongoing work). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(brain): bump schema-migration test timeout to 60s Rebuild path fans out to 7 per-project entity refreshes, each shelling gbrain with 10s internal timeout. Worst case ~70s. Default bun test 5s was timing out on slow brain unreachable cases. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: bump version and changelog (v1.50.0.0) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(test): tighten put_page regression pin to CLI subcommand The test asserted no substring 'put_page' anywhere in the resolver, but the BRAIN_WRITE_BACK resolver legitimately references the MCP op `mcp__gbrain__put_page` as the fallback path for calibration takes when gbrain v0.42+'s `takes_add` op isn't available. The check conflated the deprecated `gbrain put_page` CLI subcommand (renamed in v0.18+ to `gbrain put`) with the still-valid MCP op of the same name. Narrow the assertion to `gbrain put_page` (with the space) so the fallback prose stays legal while the CLI rename regression stays caught. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(brain): gstack-config gbrain-refresh subcommand Adds a new subcommand that re-detects gbrain installation state and persists the result to ~/.gstack/gbrain-detection.json. The detection file is consumed by gen-skill-docs --respect-detection (next commit) to decide whether to render the GBRAIN_CONTEXT_LOAD and GBRAIN_SAVE_RESULTS resolver blocks in user-local SKILL.md generation. Reuses the existing bin/gstack-gbrain-detect helper for the actual probe; this subcommand just persists + summarizes. Users run it after installing or uninstalling gbrain so their locally generated SKILL.md files match their installation state. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(brain): gen-skill-docs respects gbrain-detection override Adds --respect-detection flag (and bun run gen:skill-docs:user script). When the flag is set, gen-skill-docs reads ~/.gstack/gbrain-detection.json and filters GBRAIN_CONTEXT_LOAD + GBRAIN_SAVE_RESULTS out of each host's suppressedResolvers when gbrain_local_status is "ok". When absent or gbrain isn't detected, suppression behaves as before. The default `bun run gen:skill-docs` (CI canonical) ignores the detection file so the committed SKILL.md stays reproducible regardless of any developer's local gbrain installation state. Use gen:skill-docs:user for user-local installs (./setup invokes it). No host config files modified — the static suppressedResolvers stay correct for the no-gbrain case; the override happens at gen-time. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(brain): setup runs gbrain detection + conditional SKILL.md regen At the end of install, ./setup now: 1. Runs bin/gstack-gbrain-detect, persists the result to ~/.gstack/gbrain-detection.json 2. If gbrain_local_status == "ok", regenerates Claude-host SKILL.md via `bun run gen:skill-docs:user --host claude` so the user's local install picks up the compressed brain-aware blocks 3. If gbrain isn't detected, leaves the canonical no-gbrain SKILL.md files in place (zero token overhead) and surfaces the gstack-config gbrain-refresh path for users who install gbrain later Together with the prior two commits, this completes the setup-time conditional un-suppression: brain-aware blocks render iff the user has gbrain installed, regardless of which CLI host they're on. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor(brain): compress GBRAIN_* resolvers, move template prose to docs/ generateGBrainContextLoad: 80 -> 115 tokens with explicit skip-header. generateGBrainSaveResults: 500-700 -> 161 tokens per skill with the skill metadata extracted into a typed skillSaveMap (slugPrefix + title + tag). Verbose prose (heredoc body, entity-stub instructions, throttle handling, backlink protocol) moved into a new doc: docs/gbrain-write-surfaces.md (Sections: §Context Load, §Save Template). The agent reads the doc on-demand only when actually saving — one Read call, cached by Claude's context. Net per-planning-skill overhead under un-suppression drops from ~1000 tokens (naive un-suppression) to ~275 tokens (compressed). Combined with the setup-time detection from prior commits, users WITHOUT gbrain pay zero overhead (block suppressed at gen-time) and users WITH gbrain pay ~275 tokens. The /investigate special-case (data-research routing in CONTEXT_LOAD) stays inline since it's skill-specific. docs/gbrain-write-surfaces.md also serves as the manual-probe reference for humans verifying live persistence + a topology summary covering trust-policy + .gbrain-source reads-only semantics. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(brain): wire SAVE_RESULTS for plan-design-review + plan-devex-review Adds {{GBRAIN_SAVE_RESULTS}} placeholder to the two planning skills that were missing it, immediately before {{BRAIN_WRITE_BACK}} (mirrors plan-eng-review:324 + office-hours:650). The corresponding skillSaveMap entries (design-reviews/<feature-slug> + devex-reviews/<feature-slug>) landed with the resolver compression in the prior commit. Regenerated SKILL.md reflects the new placeholder position. The default no-gbrain generation (CI canonical) still suppresses the block — zero diff in the rendered output for non-gbrain users. All five planning skills now write a retrievable review page to gbrain when gbrain is detected at setup time, instead of three of five. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(brain): resolver compression + detection-override regression pins test/resolvers-gbrain-save-results.test.ts (140 LOC, 10 tests): - Per-skill assertions for all 5 planning skills: emits gbrain put + correct slug prefix + tag + title. - Skip-header present so agent can short-circuit when gbrain isn't on PATH. - Compression pin: each per-skill block stays under 750 chars (~190 tokens) — guards against a future "let me add one more line" refactor silently re-inflating toward the ~1000-token naive un-suppression baseline. - Generic fallback for unmapped skill names still works. - /investigate gets the data-research routing suffix; non-investigate skills do not. - generateGBrainContextLoad stays under 500 chars (~125 tokens). test/gbrain-detection-override.test.ts (120 LOC, 4 tests): - End-to-end through gen-skill-docs subprocess against an isolated temp GSTACK_HOME. Asserts: * detected:true un-suppresses GBRAIN_* → SKILL.md gains the block * detected:false (status != "ok") suppresses → no block * no detection file suppresses → no block (graceful default) * no --respect-detection flag IGNORES the detection file → no block (CI canonical path stays reproducible) Each detection-override test restores the canonical SKILL.md in a finally block so the working tree stays clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(brain): fake-CLI agent-obedience E2E for /office-hours writeback test/skill-e2e-office-hours-brain-writeback.test.ts (~210 LOC, periodic-tier, ~$0.50-1/run): Drives /office-hours via runSkillTest against a deterministic fixture brief (pixel.fund founder pitch). The workdir has: - A regenerated office-hours/SKILL.md with the compressed brain blocks (generated via gen-skill-docs --respect-detection against a temp GSTACK_HOME, then restored to canonical post-snapshot) - A fake gbrain shell script on PATH that uses printf %q quoting to preserve --content "$(cat <<'EOF' ... EOF)" heredoc payloads intact (naive `echo "$@"` would lose argv boundaries) - The docs/gbrain-write-surfaces.md the resolver points to Asserts: - gbrain-calls.log contains `gbrain put office-hours/pixel-fund` - Payload file at gbrain-payloads/office-hours/pixel-fund.md exists with valid YAML frontmatter (title: + tags: + design-doc tag) - At least one gbrain put entities/<name> call (entity stub enrichment is best-effort, soft warning if absent) Covers agent obedience to the SAVE_RESULTS instruction. Out of scope: gbrain CLI persistence contract (T11 covers that with real PGLite). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(brain): real PGLite round-trip E2E (matched-pair persistence) test/skill-e2e-gbrain-roundtrip-local.test.ts (~145 LOC, periodic-tier, ~$0.001/run on Voyage): Real gbrain CLI round-trip against an isolated temp HOME: 1. gbrain init --pglite --embedding-model voyage:voyage-code-3 2. gbrain put office-hours/<unique-slug> --content <markdown> 3. gbrain get <slug> 4. Assert every body line survives + title + tags + non-empty This is the matched-pair check for the v1.50.0.0 question "is the data we hope to save actually being saved?" — proves the gbrain CLI persistence contract gstack relies on, against a real engine. Does NOT involve the agent — pure CLI integration test. The agent obedience side is covered by the fake-CLI E2E in the prior commit. Skips cleanly when VOYAGE_API_KEY is unset OR gbrain CLI is missing from PATH, so CI without secrets degrades gracefully. Remote/Supabase routing is gbrain's contract — the same CLI shape works against every engine. gstack stops at local round-trip coverage to avoid re-testing gbrain's MCP client implementation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(brain): touchfiles + TODOS + CHANGELOG for v1.50.0.0 test/helpers/touchfiles.ts: register the two new E2Es in E2E_TOUCHFILES + E2E_TIERS (both periodic): - office-hours-brain-writeback: triggered by resolver / gen-pipeline / detection helper / refresh subcommand / office-hours template / docs / fixture / test file changes - gbrain-roundtrip-local: triggered by resolver / test file changes TODOS.md: append two P2 follow-ups carried over from the v1.50 plan: - Re-verify calibration takes when gbrain v0.42+ ships takes_add and BRAIN_CALIBRATION_WRITEBACK flips TRUE - Extend brain-writeback E2E to the other 4 planning skills (extract makeFakeGbrain to test/helpers/fake-gbrain.ts when second consumer arrives) CHANGELOG.md v1.50.0.0: add a "Save-results path: works under any CLI when gbrain is on PATH" section that documents the headline: - Conditional inclusion at setup-time (zero overhead for non-gbrain users, ~250 tokens with gbrain) - Wiring symmetry fix (5 of 5 planning skills now write a page) - Token cost table comparing detection states - Test coverage map (resolver unit + override mechanism + fake-CLI agent obedience + real PGLite round-trip) - Why remote routing isn't tested here (gbrain's contract) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(brain): tighten prompt + relax slug assertion in writeback E2E Two fixes: 1. Prompt: "Slug it 'pixel-fund'" was ambiguous — agent could read it as "use pixel-fund as the FULL slug" instead of "substitute pixel-fund for <feature-slug>". Replaced with explicit guidance: "The feature-slug value to substitute into the SAVE_RESULTS template's <feature-slug> placeholder is exactly 'pixel-fund' (no path prefix — the template already provides the prefix). Apply the SAVE_RESULTS template literally." Also added "Do NOT explore gbrain --help" to short-circuit the discovery loop the agent fell into. 2. Slug assertion: was a strict /gbrain put .*office-hours\/pixel-fund/ regex. This conflated two concerns — agent obedience (does the agent actually invoke gbrain put?) vs resolver output shape (does the template emit the right prefix?). The latter is already pinned by test/resolvers-gbrain-save-results.test.ts at the resolver level (free, hermetic). The E2E now asserts /gbrain put .*pixel-fund/ (slug contains pixel-fund somewhere) plus a recursive payload-file search that accepts either office-hours/pixel-fund.md (template- faithful) or pixel-fund.md (agent dropped prefix). The YAML frontmatter + tag assertions on the payload remain strict — those are the real agent-obedience contract. 3. Entity-stub regex: was looking for entities/<name>; agent variability uses entity/<name>, people/<name>, companies/<name>. Loosened to match entit(y|ies) only. The soft-warning path stays (no hard fail) because entity extraction is best-effort prose, not a CLI contract. Verified passing locally: 7 expect() calls, 268s, ~$0.50. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: bump version to 1.51.1.0 main advanced to 1.51.0.0 while this branch was in development. Bump to 1.51.1.0 (PATCH above main) so the branch lands cleanly above the current main version per the monotonic-ordered-release invariant. Renames the branch-internal [1.50.0.0] CHANGELOG entry to [1.51.1.0] — 1.50.0.0 never landed on main (main skipped to 1.51.0.0), so this consolidates the branch's brain-aware planning + save-results work under a single shipping version with no orphaned entry. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
f58977041c |
v1.39.1.0 feat: EXIT PLAN MODE GATE for plan-mode review skills (#1512)
* feat: EXIT PLAN MODE GATE for plan-mode review skills Add a terminal BLOCKING checklist that verifies the plan file ends with `## GSTACK REVIEW REPORT` before ExitPlanMode is called. Lives at EOF of all four plan-* review skills (eng/ceo/design/devex) and inside codex Step 2A. Tones down the preamble's "Plan Status Footer" to a neutral forward reference so review-report rules don't bleed into operational skills (/ship /qa /review). Single source of truth: `generateExitPlanModeGate` in scripts/resolvers/review.ts, registered as EXIT_PLAN_MODE_GATE in scripts/resolvers/index.ts. New test in test/gen-skill-docs.test.ts strips fenced code blocks before matching `## ` headings and asserts the gate is the terminal heading in all four plan-* review SKILL.md files. Codex's SKILL.md uses toContain (mid-file by design — Step 2B/2C are not plan-touching modes). Decisions locked via /plan-eng-review + /codex outside-voice: - D1=A: 4 plan-* reviews + codex (autoplan, office-hours deferred) - D2=B → D4=A: tone preamble down to neutral forward reference - D3=A: add automated test in test/gen-skill-docs.test.ts - D5=B: keep codex gate inside Step 2A (mid-file acceptable per gate self-gating) Codex pre-merge findings folded in: line numbers obsolete (use EOF), test regex must strip fences, fresh skill list (not stale REVIEW_SKILLS constant), gate check 4 short-circuits when no plan file in context. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * chore: bump version and changelog (v1.39.1.0) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix: package.json build script uses subshells, not brace groups The three `{ git rev-parse HEAD 2>/dev/null || true; } > path/.version` brace groups in the build script regressed when v1.38.0.0 merged into this branch (resolved with --ours during conflict). Bun on Windows can't parse brace groups in this position; the v1.38.0.0 invariant requires `(...)` subshells. Windows CI test `package.json build scripts — POSIX shell compat` caught it. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> |
||
|
|
ea51b45e08 |
v1.38.1.0 fix wave: surrogate-safe page captures (#1440), Implementation Tasks across review skills (#1454), root-level artifact patterns (#1452) (#1504)
* fix(browse): sanitize lone Unicode surrogates at commandResult chokepoint + /batch envelope (#1440) Page captures with mixed-script Unicode round-trip cleanly to the Claude API. Two new utilities in browse/src/sanitize.ts: stripLoneSurrogates for raw UTF-16 strings, stripLoneSurrogateEscapes for \uXXXX JSON escape text. sanitizeBody picks the right pass based on cr.json. buildCommandResponse is extracted from handleCommand (now exported) and applies sanitization before new Response(). /batch was bypassing this chokepoint via direct JSON.stringify, so it sanitizes each cr.result before pushing AND wraps the envelope with stripLoneSurrogateEscapes. Defense in depth wraps at getCleanText, getCleanTextWithStripping, html, accessibility, and snapshot.ts return points so downstream consumers (datamarking, envelope wrapping) see sanitized text before the response is built. 25 new unit tests across sanitize.test.ts and build-command-response.test.ts. content-security.test.ts updated to accept either pre- or post-sanitize form of the snapshot scoped branch (source-level regression check). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat: bug fix wave v1.36.0.0 — Implementation Tasks, allowlist patterns, surrogate-safe page captures (#1440 #1452 #1454) Three filed issues land together: #1440 — Page captures from real-world HTML hit 'API Error 400: no low surrogate in string'. Sanitizers + buildCommandResponse extraction shipped in the prior commit; this commit adds the migration script that patches existing brain-allowlist/privacy-map/gitattributes installs and the supporting tests. #1452 — Federation sync was silently skipping root-level design and test-plan docs. bin/gstack-artifacts-init adds two patterns to all three managed blocks (.brain-allowlist, .brain-privacy-map.json, .gitattributes). Idempotent migration v1.36.0.0.sh repairs existing installs in place via jq (preserves JSON validity) — no commit + push from the migration. #1454 — All four review skills (CEO/design/eng/DX) emit an Implementation Tasks markdown section AND write a jq-built JSONL artifact per phase. /autoplan reads all four files, scopes by current branch + 5-commit window, dedupes on exact (component, sorted(files), title), and renders an aggregated list in the Final Approval Gate. New tests: - browse/test/sanitize.test.ts (18 cases) - browse/test/build-command-response.test.ts (7 cases) - test/artifacts-init-migration.test.ts (7 cases) VERSION → 1.36.0.0. Skips the v1.34.x slot taken by 'gstack consumable as submodule' and the v1.35.0.0 slot taken by /document-generate. #1428 was shipped separately by v1.34.2.0 with a different approach; follow-up #1503 filed for the bare-path filesystem boundary concern surfaced during our analysis. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * chore: bump to v1.38.1.0 VERSION + package.json + CHANGELOG header + migration filename + test reference all consistently at v1.38.1.0. Migration renamed: gstack-upgrade/migrations/v1.38.0.0.sh -> v1.38.1.0.sh. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> |
||
|
|
5d4fe7df07 |
v1.31.0.0 fix: delete AskUserQuestion fallback (root cause of forever war) + harness primitives (#1390)
* test: add multi-finding batching regression test (periodic tier) Adds a periodic-tier E2E that catches the May 2026 transcript bug shape the existing single-finding gate-tier floor test cannot detect: a model that fires one AskUserQuestion and then batches the remaining findings into a single "## Decisions to confirm" plan write + ExitPlanMode. Why a separate test from skill-e2e-plan-eng-finding-floor: the gate-tier floor (runPlanSkillFloorCheck) exits on the first AUQ render and returns success, so a once-then-batch model would pass it trivially. This test uses runPlanSkillCounting at periodic tier with N-AUQ tracking and asserts >= 3 distinct review-phase AUQs on a 4-finding seeded plan. - test/fixtures/forcing-finding-seeds.ts: FORCING_BATCHING_ENG fixture (4 distinct non-trivial findings spread across Architecture, Code Quality, Tests, Performance — mirrors the D1-D4 transcript shape) - test/skill-e2e-plan-eng-multi-finding-batching.test.ts: new test - test/helpers/touchfiles.ts: registered in BOTH E2E_TOUCHFILES and E2E_TIERS (touchfiles.test.ts asserts exact equality) Test will fail on baseline today because today's model uses the preamble fallback to batch findings; passes after the architectural fix lands in a follow-up commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test: expand plan-mode pass envelopes to accept BLOCKED path Three existing plan-mode regression tests previously codified the preamble fallback as a valid PASS path under --disallowedTools AskUserQuestion: outcome=plan_ready was accepted only when the model wrote a "## Decisions to confirm" section. The forever-war fix deletes that fallback, so this assertion would fail post-deletion. Expanded envelope accepts EITHER: - 'plan_ready' WITH (## Decisions section [legacy] OR BLOCKED string visible in TTY [post-fix]) - 'exited' WITH BLOCKED string visible in TTY [post-fix] The legacy ## Decisions branch stays in the envelope so these tests keep passing on today's code (where the fallback still exists) and on tomorrow's code (where the model reports BLOCKED instead). Once the deletion has been on main long enough that the cache flushes, the legacy branch can be removed in a follow-up. Failure signals (regression we DO want to catch) unchanged: auto_decided / silent_write / timeout / exited-without-BLOCKED / plan_ready-without-(decisions OR BLOCKED). - test/skill-e2e-plan-ceo-plan-mode.test.ts (test 2 only) - test/skill-e2e-autoplan-auto-mode.test.ts - test/skill-e2e-plan-design-plan-mode.test.ts Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix: delete AskUserQuestion fallback (root cause of forever war) The /plan-eng-review skill failed to fire AskUserQuestion on a real plan review and surfaced 4 calibration decisions via prose instead. Investigation traced this to a "fallback when neither variant is callable" clause in the preamble that the model rationalizes around as a general escape hatch from "fanning out round-trip AUQs," even when an AUQ variant IS callable. Codex review confirmed the fallback exists in 8 inline sites with 2 surviving escape hatches the original narrowing missed (a "genuinely trivial" exception duplicated across all 4 plan-* templates, and a "outside plan mode, output as prose and stop" branch in the preamble itself). Net deletion in skill text. Closes both branches of the deleted fallback (plan-file write AND prose-and-stop) and the trivial-fix exception with a single hard rule: If no AskUserQuestion variant appears in your tool list, this skill is BLOCKED. Stop, report `BLOCKED — AskUserQuestion unavailable`, and wait for the user. Honest about being a model directive, not a runtime guard — none of the PTY harness helpers enforce BLOCKED today. The architectural improvement is that the model has fewer alternatives to obey it against. Runtime enforcement is a follow-up TODO. Sources changed: - scripts/resolvers/preamble/generate-ask-user-format.ts: delete both fallback branches; replace with 1-line BLOCKED rule - scripts/resolvers/preamble/generate-completion-status.ts: delete fallback in generatePlanModeInfo - plan-eng-review/SKILL.md.tmpl: delete fallback at Step 0 + Sections 1-4 (5 instances) + delete trivial-fix exception - office-hours/SKILL.md.tmpl: delete fallback in approach-selection - plan-ceo-review/SKILL.md.tmpl: delete trivial-fix exception - plan-design-review/SKILL.md.tmpl: delete trivial-fix exception - plan-devex-review/SKILL.md.tmpl: delete trivial-fix exception Generated SKILL.md regen lands in a follow-up commit per the bisect convention (template changes separate from regenerated output). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: regenerate SKILL.md after fallback deletion Regenerates all 47 generated SKILL.md files (default + 7 host adapters) after the template/resolver edits in the prior commit. Pure mechanical output of `bun run gen:skill-docs`; no hand-edits. Verifies fallback deletion landed across the entire skill surface: - zero hits for "Decisions to confirm" in canonical SKILL.md / .tmpl - zero hits for "no AskUserQuestion variant is callable" - zero hits for "genuinely trivial" - BLOCKED rule present in 42 generated SKILL.md (every Tier-2+ skill) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(harness): detect prose-rendered AskUserQuestion in plan mode When --disallowedTools AskUserQuestion is set and no MCP variant is callable, the model surfaces decisions as visible prose options ("A) ... B) ... C) ..." or "1. ... 2. ... 3. ...") rather than via the native numbered-prompt UI. isNumberedOptionListVisible doesn't catch these because the ❯ cursor sits on the empty input prompt rather than on option 1, so runPlanSkillObservation and runPlanSkillFloorCheck would time out at 5-10 minutes per test even though the model was correctly waiting for user input. This was exposed by the v1.28 fallback deletion: pre-deletion the model used the preamble fallback to silently auto-resolve to plan_ready in this scenario. Post-deletion the model correctly surfaces the question and waits, but the harness couldn't tell. isProseAUQVisible matches: - 2+ distinct lettered options at line starts (A/B/C/D form) - 3+ distinct numbered options at line starts WITHOUT a `❯ 1.` cursor (so it doesn't double-fire on native numbered prompts) Wired into: - classifyVisible (used by runPlanSkillObservation) → returns outcome='asked' instead of timeout - runPlanSkillFloorCheck → counts as auq_observed (floor met) 8 new unit tests in claude-pty-runner.unit.test.ts cover the lettered shape, numbered shape, threshold edges, native-cursor exclusion, and mid-prose false-positive guard. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(harness): LLM judge for waiting-vs-working PTY state + snapshot logs Regex detectors (isNumberedOptionListVisible, isProseAUQVisible) are fast and free, but PTY rendering quirks fragment prose AUQ option lists across logical lines that no regex can reliably reassemble. When detection misses, polling loops time out at the full budget even though the model is correctly waiting for user input. Adds judgePtyState — a Haiku-graded trichotomy classifier: - waiting: agent surfaced a question/options, sitting at input prompt - working: spinner / tool calls / generation in progress - hung: stopped without surfacing anything (rare crash signal) Wired as a fallback into the polling loops of runPlanSkillObservation and runPlanSkillFloorCheck: after 60s with no regex hit, snapshot the TTY every 30s and call the judge. On 'waiting' verdict, return outcome=asked / auq_observed early. On 'working' or 'hung', enrich the eventual timeout summary with the verdict so failures are diagnosable. Implementation: - Spawns `claude -p --model claude-haiku-4-5 --max-turns 1` synchronously with prompt piped via stdin (subscription auth, no API key env required) - In-process cache keyed by SHA-1 of normalized last-4KB so identical spinner-frame snapshots don't re-charge - Best-effort JSONL log to ~/.gstack/analytics/pty-judge.jsonl with timestamp, testName, state, reasoning, hash, judge wall time - 30s timeout per call; returns state='unknown' with diagnostic on any failure mode (timeout, malformed JSON, missing claude binary) Snapshot logging: when GSTACK_PTY_LOG=1 is set, dump last 4KB of visible TTY at every judge tick to ~/.gstack/analytics/pty-snapshots/<test>- <elapsed>ms.txt — postmortem trail for debugging flakes. Cost: ~$0.0005 per call; ~10 calls per 5-min test budget; ~$0.005 per test added in worst case (only when regex detectors miss). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test: accept prose-AUQ visible as third valid surface in plan-mode envelopes The first re-run after wiring the LLM judge revealed that the model also emits a third surface I hadn't anticipated: a properly-formatted question with options ("Pick A, B, or C in your reply") rendered as prose AND followed by ExitPlanMode (outcome=plan_ready). The migrated tests only accepted (## Decisions section) OR (BLOCKED string) — neither matched this case, so the test failed even though the user clearly saw the question. Three valid surfaces now: 1. `## Decisions to confirm` section in plan file (legacy fallback path, still valid through migration window) 2. `BLOCKED — AskUserQuestion` string in TTY (post-v1.28 BLOCKED rule) 3. Numbered/lettered options visible in TTY as prose (post-v1.28 prose rendering — uses the existing isProseAUQVisible detector) Also fixes assertReportAtBottomIfPlanWritten to be tolerant of: - Missing files (path detected from TTY but file not persisted) — was throwing ENOENT on plan_design_plan_mode and plan_ceo_plan_mode test 1 - 'asked' outcome (smoke test exited at first AUQ before the model reached the report-writing step) — was throwing on the 1 fail in the plan-eng-plan-mode --disallowedTools test Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test: drop GSTACK REVIEW REPORT contract from --disallowedTools migrations The plan-ceo / plan-design --disallowedTools migrated tests called assertReportAtBottomIfPlanWritten as the final assertion, but that contract is for full multi-section review completions. Under --disallowedTools AskUserQuestion the model can't run the full review (no AUQ tools to ask findings questions through), so it exits at Step 0 with either prose-AUQ rendering or the legacy decisions fallback. A plan file written in that mode WON'T have a GSTACK REVIEW REPORT section — the workflow never reached the report-writing step. The contract is still enforced by the periodic finding-count tests (skill-e2e-plan-{ceo,eng,design,devex}-finding-count.test.ts), which DO run the full review end-to-end and assert report-at-bottom there. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(harness): high-water-mark prose-AUQ tracking across polling iterations The autoplan E2E surfaces a brief prose-AUQ window (model emits options, waits ~30s for non-existent test responder, then resumes thinking) that the existing polling loop misses: by judge-tick time the buffer has moved into spinner state, so the LLM judge correctly reports 'working' and the loop times out at 5min. Adds two flags tracked across polling iterations: - proseAUQEverObserved: set true the first tick isProseAUQVisible returns true on the recent buffer - waitingEverObserved: set true on the first LLM judge 'waiting' verdict At timeout, if either flag is set, return outcome='asked' with a summary explaining the historical signal. The model DID surface the question — we just missed the live-state window. Snapshot logged with tag='prose-auq-surfaced' when GSTACK_PTY_LOG=1 for postmortem trace. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test: migrate plan-eng-plan-mode test 2 envelope to match other plan-mode tests The plan-ceo, plan-design, and autoplan plan-mode tests under --disallowedTools all moved to the same surface-visibility envelope (decisions section OR BLOCKED string OR prose-AUQ visible) and dropped the GSTACK REVIEW REPORT contract because the workflow can't complete without AUQ tools. plan-eng-plan-mode test 2 had been left on the old envelope and was the last failing test. This commit migrates it to match. Also lifts 'exited' out of the failure list and into a guarded path (acceptable when surface-visible). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(harness): isProseAUQVisible — gate numbered path on tail, not full buffer The numbered-options branch of isProseAUQVisible deferred to isNumberedOptionListVisible whenever a `❯ 1.` cursor was visible in the full buffer. But the boot trust dialog (`❯ 1. Yes, trust`) lives in scrollback for the entire run, so this gate suppressed prose-numbered detection for any session that had the trust prompt at startup — i.e., every E2E run after the first user-trust acceptance. Fix: check only the last 4KB tail. Native-UI deferral applies when the cursor list is CURRENTLY rendered, not historically present in scrollback. Adds a regression test that puts the trust dialog in early scrollback + 5KB filler + a current prose-AUQ render, asserts true. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(harness): isProseAUQVisible — lower numbered threshold to 2 (matches lettered) The 4KB tail window often contains only options 2-4 of a 4-option numbered prose AUQ because the model emits the question header + option 1 several KB earlier in the buffer. The threshold of 3 distinct numbered markers caused the detector to miss real prose AUQs whenever option 1 had scrolled out. Threshold 2 matches the lettered branch and is still tightly gated by: - Line-start anchoring (no false positives on inline `1.` references) - No-cursor gate (defers to native UI when ❯ 1. is currently rendered) - The 4KB tail window itself (prose-AUQ rendering happens at the end of the model's response, so options are clustered in the tail) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test: expose high-water-mark flags through PlanSkillObservation The 2KB obs.evidence window often misses the prose-AUQ moment because ExitPlanMode UI ("Ready to execute" + numbered approve/reject prompt) pushes the model's earlier option list out of the tail by the time outcome=plan_ready fires. Tests checking "did the user see a question" need to consult historical state, not just the truncated final tail. Adds two optional fields to PlanSkillObservation: - proseAUQEverObserved: true if isProseAUQVisible was true at any tick - waitingEverObserved: true if the LLM judge ever returned 'waiting' The 4 plan-mode --disallowedTools tests now check these flags as part of the surfaceVisible computation: isProseAUQVisible(obs.evidence) || obs.proseAUQEverObserved === true blockedVisible || proseAUQVisible || obs.waitingEverObserved === true This catches the autoplan / plan-ceo / plan-eng case where the model surfaces options briefly, fails to get a response, then keeps thinking — eventually emitting ExitPlanMode and pushing options out of evidence. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(plan-ceo): bump --disallowedTools test timeout to 10 min Last 5 runs showed the model under --disallowedTools spending the full 5-min budget in 'high effort thinking' before surfacing options. The LLM judge correctly reports state=working at every 30s tick, so the high-water-mark fallback never fires. 10-min budget gives the model 20 judge windows to eventually surface the question. Outer bun timeout bumped accordingly to 660s (inner +60s). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(plan-ceo): pre-prime --disallowedTools test with concrete plan content Root cause of the persistent timeout: under --disallowedTools, the model can't fire the AUQ tool to ask "what should I review?" — it has to prose-render that question. Prose-rendering a 4-option choice requires the model to first enumerate every option, which spent the full 5min budget in 'high effort thinking' (8 consecutive 'state=working' verdicts from the LLM judge). Fix: pass initialPlanContent (already supported by runPlanSkillObservation) with a CEO-review-shaped seed plan (vague success metric, missing premise, scope creep smell). The model now has concrete material to critique on entry, bypasses the scope-deliberation loop, and moves directly to surfacing Step 0 / Section 1 findings — the actual behavior we want to regression-test. Reverted timeout from 600_000 back to 300_000 since the 5-min budget is plenty when the model has a real plan to work with. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test: delete --disallowedTools AskUserQuestion-blocked test variants These tests simulated a fictional environment that doesn't exist in production. Real Conductor sessions launch claude with `--disallowedTools AskUserQuestion` AND register `mcp__conductor__AskUserQuestion` — the model has the MCP variant. But the tests passed `--disallowedTools` without standing up any MCP server, so they tested "model behavior with NO AUQ available," which no real user state produces. Combined with bare `/plan-ceo-review` invocation (no follow-up content), this forced the model into a 5+ minute deliberation loop trying to prose-render a question with options it had to first invent. The result was persistent flakes that consumed nine paid E2E runs trying to fix "the model takes too long" — but the actual problem was the test configuration, not the model. Removals: - test/skill-e2e-autoplan-auto-mode.test.ts (deleted; the entire file was a single AUQ-blocked test) - test/skill-e2e-plan-ceo-plan-mode.test.ts test 2 (the migrated --disallowedTools test); test 1 (baseline plan-mode smoke) stays - test/skill-e2e-plan-design-plan-mode.test.ts test 2 (same shape); test 1 stays - test/skill-e2e-plan-eng-plan-mode.test.ts test 2 (same shape); test 1 (baseline) and test 3 (STOP-gate with seeded plan, different contract) stay - test/helpers/touchfiles.ts: autoplan-auto-mode entry removed - test/touchfiles.test.ts: assertion count + commentary updated Coverage retained: test 1 of each plan-mode file already verifies the model fires AUQ; the periodic finding-count tests verify per-finding AUQ cadence end-to-end. The harness improvements landed during this debugging cycle (isProseAUQVisible regex, LLM judge, snapshot logging, high-water-mark tracking, ENOENT-tolerant assertReportAtBottomIfPlanWritten) all stay — they're useful for the remaining plan-mode tests that can also encounter prose rendering and slow-thinking phases. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: bump version and changelog (v1.31.0.0) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
00f966b3ec |
v1.30.0.0 fix wave: 21 community PRs + Windows CI extension + codex flag-semantics smoke (#1391)
* fix(codex): use resume-compatible flags * fix: V-001 security vulnerability Automated security fix generated by Orbis Security AI * docs: align prompt-injection thresholds to security.ts (v1.6.4.0 catch-up) CLAUDE.md:290 and ARCHITECTURE.md:159 were missed when WARN was bumped 0.60 → 0.75 in |
||
|
|
7b4738bca0 |
v1.27.1.0 fix: anti-shortcut clause + gate-tier AskUserQuestion floor tests for all plan-* skills (#1354)
* feat(test/helpers): runPlanSkillFloorCheck — minimal AskUserQuestion-floor observer Adds a focused PTY observer that exits at the first non-permission numbered-option render. Catches the May 2026 transcript-bug class (model wrote plan + ExitPlanMode without firing any AUQ) without needing to fingerprint or navigate past the AUQ. Why separate from runPlanSkillCounting: plan-mode AUQs render every option on a single logical line via cursor-positioning escapes that stripAnsi can't simulate, so parseNumberedOptions returns < 2 options and never records a fingerprint. Counting tests work on 25-min budgets because eventually one frame parses cleanly; gate-tier floor tests need to exit early on the first observation. Trades fingerprint precision for early-exit reliability. Also drops COMPLETION_SUMMARY_RE check from this helper — it matches "GSTACK REVIEW REPORT" anywhere in the buffer including when the agent does recon by reading existing plan files. plan_ready (claude's actual "Ready to execute" confirmation) is the reliable terminal signal for "agent finished without asking." Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(resolvers): generateAntiShortcutClause shared resolver Adds {{ANTI_SHORTCUT_CLAUSE}} placeholder backed by a single resolver function in scripts/resolvers/review.ts. Plan-* review skills can now include the clause via one placeholder line in their .tmpl rather than cloning the paragraph four times. Future tightening edits one resolver, all four skills update on next gen-skill-docs. Wired into the existing RESOLVERS map alongside generateReviewDashboard and generatePlanFileReviewReport — no gen-skill-docs.ts change needed because the generator already does generic placeholder substitution against that map. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(plan-*-review): anti-shortcut clause in all four review skills Inserts {{ANTI_SHORTCUT_CLAUSE}} placeholder immediately after the **Anti-skip rule:** paragraph in plan-{eng,ceo,design,devex}-review SKILL.md.tmpl. The four templates use different surrounding section headers (eng "Review Sections (after scope is agreed)" vs ceo/design/devex variants), so anchoring on the paragraph rather than the heading works across all four. Closes the May 2026 transcript-bug loophole: existing STOP gates name forbidden actions only AFTER a per-section finding is identified. The anti-shortcut clause adds the pre-emptive rule — "the plan file is the OUTPUT of the interactive review, not a substitute for it" — covering the case the transcript exhibited (skip per-section walk, dump every finding into one plan write, call ExitPlanMode). Regenerated SKILL.md for all hosts via bun run gen:skill-docs --host all. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test: gate-tier AskUserQuestion floor tests for all plan-* review skills Adds 4 finding-floor tests (one per plan-* skill) that catch the May 2026 transcript-bug class — model wrote a plan and called ExitPlanMode without firing any review-phase AskUserQuestion. Asserts via runPlanSkillFloorCheck that ANY non-permission AUQ render fires before the agent reaches plan_ready. Verified: - Eng floor: passed in 59s - CEO floor: passed in 197s - Design floor: passed - Devex floor: passed - Total ~$2-6 per CI run; only triggers on diff against the 4 plan-* templates, the shared resolver review.ts, the seeds fixture, or the PTY runner helper. Fixtures live in test/fixtures/forcing-finding-seeds.ts, one constant per skill. Each seed is engineered to force at least one obvious finding under that skill's review focus (architectural smell for eng, scope-creep for ceo, UI-slop for design, painful onboarding for devex). Touchfiles wiring: - E2E_TOUCHFILES: 4 plan-*-finding-floor entries with deps on the matching skill template, the shared resolver, the seeds fixture, and the PTY runner helper - E2E_TIERS: all 4 entries marked 'gate' - touchfiles.test.ts: count assertion bumped 21→22 with explicit plan-ceo-finding-floor containment check Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: bump version and changelog (v1.27.1.0) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
9e244c0bed |
v1.11.1.0 fix: plan-mode handshake + canUseTool test harness (#1182)
* feat: plan-mode handshake for interactive review skills Add a preamble-level STOP-Ask handshake that fires when the user invokes any of the 4 interactive review skills (plan-ceo-review, plan-eng-review, plan-design-review, plan-devex-review) while their Claude Code session is in plan mode. Without this gate, plan mode's "this supercedes any other instructions" system-reminder outranked the skills' interactive STOP gates and the skills silently wrote plan files without any per-finding AskUserQuestion. The handshake offers 2 options (exit-and-rerun, cancel) — the original third "stay and batch" option was dropped after two independent reviewers flagged it as a silent bypass of the skills' anti-skip rule. Architecture decisions (CEO+Eng review): - Preamble-level resolver, not per-template injection (Codex finding #2) - Position 1 in preamble composition: after bash block (_SESSION_ID live), before onboarding AskUserQuestion gates (so fresh-install users see the handshake first, not drowned in telemetry/proactive/routing prompts) - Generator-only `interactive: true` frontmatter flag, following the `preamble-tier` precedent (no host-config frontmatter allowlist edits) - Host-scoped to Claude via `ctx.host === 'claude'` check inside the resolver (simpler than `suppressedResolvers` which only gates `{{}}` placeholders) - One-way-door classification in scripts/question-registry.ts for all 4 skills so question-tuning `never-ask` preferences can't suppress the gate - Synchronous telemetry write to ~/.gstack/analytics/skill-usage.jsonl on handshake fire (captures A-exit and C-cancel outcomes that terminate the skill before end-of-run telemetry runs) Also adds an explicit STOP block to plan-ceo-review Step 0C-bis so the approach-selection question can't silently skip to mode selection. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat: extend agent-sdk-runner with canUseTool for AskUserQuestion interception Test harness at test/helpers/agent-sdk-runner.ts gains an optional `canUseTool` callback parameter. When a test supplies it, the harness flips `permissionMode` from `bypassPermissions` (overlay-harness default) to `default` so the SDK actually invokes the callback on every tool use, and auto-adds `AskUserQuestion` to `allowedTools` so Claude can fire it at all. Exports a `passThroughNonAskUserQuestion` helper so tests that only want to intercept AskUserQuestion can auto-allow every other tool with one line: `return passThroughNonAskUserQuestion(toolName, input)`. This is the foundation for D14 — every future interactive-skill E2E test can now assert on AskUserQuestion shape and routing. Previous E2E tests at `test/skill-e2e.test.ts` explicitly instructed the model to skip AskUserQuestion ("non-interactive run") which meant no test could actually verify the question content or routing. 6 new unit tests in test/agent-sdk-runner.test.ts cover: - permissionMode flips to 'default' when canUseTool supplied - permissionMode stays 'bypassPermissions' when canUseTool absent - canUseTool callback reaches the SDK options - AskUserQuestion auto-added to allowedTools when canUseTool supplied - AskUserQuestion NOT added when canUseTool absent - passThroughNonAskUserQuestion helper returns allow+updatedInput Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * test: plan-mode handshake E2E coverage and unit assertions Adds 6 E2E test files and 8 new unit assertions to verify the plan-mode handshake works end-to-end and stays correct under regeneration. E2E tests (gate-tier, paid, EVALS=1 EVALS_TIER=gate): - test/skill-e2e-plan-ceo-plan-mode.test.ts — handshake fires before any Write/Edit when plan-mode distinctive phrase is present; 2-option shape (Exit/Cancel); option A routes to ExitPlanMode cleanly - test/skill-e2e-plan-eng-plan-mode.test.ts — same contract for plan-eng - test/skill-e2e-plan-design-plan-mode.test.ts — same contract for plan-design; exercises C-cancel branch instead of A-exit - test/skill-e2e-plan-devex-plan-mode.test.ts — same contract for plan-devex - test/skill-e2e-plan-mode-no-op.test.ts — negative regression: handshake must NOT fire when distinctive phrase is absent; skill proceeds normally through Step 0 (REGRESSION RULE guardrail against breaking existing interactive-review sessions) - test/e2e-harness-audit.test.ts — free unit test asserting every `interactive: true` skill has at least one canUseTool-using test file (prevents future drift where a skill opts in without coverage) Shared helper test/helpers/plan-mode-handshake-helpers.ts centralizes the canUseTool interceptor + distinctive-phrase injection so the 4 sibling E2E tests are thin wiring (~20 LOC each) and can't drift out of sync. Unit assertions added to test/gen-skill-docs.test.ts: - handshake section present in all 4 Claude-generated SKILL.md files - handshake section absent from non-interactive Claude skills (ship, review, qa, office-hours, codex, retro, cso) - handshake section absent from non-Claude host outputs (.agents, etc.) - 0C-bis STOP block present in plan-ceo-review/SKILL.md at correct position (between the "Present these approach options" line and "### 0D-prelude" header) - handshake resolver wired BEFORE generateUpgradeCheck in preamble composition order 6 new gate-tier entries added to test/helpers/touchfiles.ts so any change to the handshake resolver, preamble composition, skill templates, question registry, one-way-door classifier, or agent-sdk-runner fires the relevant E2E tests. test/touchfiles.test.ts updated for the new selection count (plan-ceo-review/** now triggers 15 tests, up from 8). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * chore(v1.11.1.0): VERSION bump + CHANGELOG entry + TODOS follow-ups Bumps from main's v1.11.0.0 to v1.11.1.0 (PATCH — bug-fix release, no new user-facing artifacts). CHANGELOG entry covers the plan-mode handshake, agent-sdk-runner canUseTool extension, and the 2 follow-up TODOs. CHANGELOG order: v1.11.1.0 (this) → v1.11.0.0 (workspace-aware ship, merged from main) → v1.10.1.0 (overlay efficacy harness). No duplicate headers. Syncs package.json version to match VERSION per the Step 12 idempotency invariant (both files must agree or /ship halts). TODOS.md: - Preserves the Testing/security-bench-haiku-responses P1 added on main - Adds P1 "Structural STOP-Ask forcing function" — broader class of the bug this release fixes - Adds P2 "Apply interactive: true to non-review skills (office-hours, codex, investigate, qa, retro, cso)" Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> |
||
|
|
a81be53621 |
v1.10.0.0: fix AskUserQuestion cadence + Pros/Cons format upgrade (#1178)
* fix(preamble): reorder AskUserQuestion Format above model overlay + rewrite Opus 4.7 pacing directive
Root cause of plan-review regression (v1.6.4.0): model overlays rendered
ABOVE the pacing rule in every SKILL.md, so Opus 4.7 read "Batch your
questions" first and absorbed it as the ambient default. The overlay's
claimed subordination ("skill wins on pacing, always") didn't stick —
literal-interpretation mode reads physical order, not claimed hierarchy.
Part 1 of 4 (plan: ~/.claude/plans/system-instruction-you-are-working-polymorphic-twilight.md):
scripts/resolvers/preamble.ts
- Move generateAskUserFormat above generateModelOverlay in section array
- Comment explains why — prevents future refactors from silently reverting
model-overlays/opus-4-7.md
- Replace "Batch your questions" block with "Pace questions to the skill"
- New wording makes one-question-per-turn the default when the skill
contains STOP directives; batching becomes the explicit exception
Regenerated 30 SKILL.md files via bun run gen:skill-docs.
Verified:
- With --model opus-4-7: Format renders at line 359, Model-Specific
Patch at 373, "Pace questions" at 419 (Format comes first, overlay
second, pacing directive intact).
- bun test passes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(plan-reviews): tighten STOP/escape-hatch directives across 4 templates
Part 2 of 4 (plan: ~/.claude/plans/system-instruction-you-are-working-polymorphic-twilight.md).
Codex caught that v1.6.3.0's reasoning collapsed on Opus 4.7: the old
escape-hatch wording ("If no issues or fix is obvious, state what
you'll do and move on — don't waste a question") let the literal
interpreter classify every finding as having an "obvious fix" and skip
AskUserQuestion entirely. Reviews became reports.
Per-template hardening (16 sites total, verified by rg):
plan-ceo-review/SKILL.md.tmpl (13 sites):
- 12 inline STOP directives: replace the full escape-hatch clause with
"zero findings → say so and proceed; findings → MUST call AskUserQuestion
as a tool_use, including for obvious fixes."
- 1 Escape hatch bullet in CRITICAL RULE section: tightened.
plan-eng-review, plan-design-review, plan-devex-review (1 site each):
- Each template's Escape hatch bullet tightened to match the new CEO wording,
adapted for each review's domain (issue/gap, decision/design/DX alternatives).
After regeneration: rg "don't waste a question" returns 0 across all
*SKILL.md.tmpl and *SKILL.md files. "zero findings, state" wording
present 16 times (matches prior count of escape-hatch sites).
bun test passes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(preamble): upgrade AskUserQuestion format to Pros/Cons decision brief
Part 4 of 4 (plan: ~/.claude/plans/system-instruction-you-are-working-polymorphic-twilight.md).
Every AskUserQuestion now renders as a decision brief, not a bullet list:
D-numbered header, ELI10, Stakes-if-we-pick-wrong, Recommendation, Pros/Cons
with ✅/❌ markers per option, closing Net: tradeoff synthesis.
scripts/resolvers/preamble/generate-ask-user-format.ts
- Full rewrite. Preserves prior rules (Re-ground, ELI10, Recommend,
Completeness, Options) and adds:
- D-numbering per skill invocation (model-level, not runtime state)
- Stakes line (pain avoided / capability unlocked / consequence named)
- Pros/Cons block with min 2 ✅ + 1 ❌ per option, min 40 chars/bullet
- Hard-stop escape: "✅ No cons — this is a hard-stop choice" for
genuine one-sided choices (destructive-action confirmations)
- Neutral-posture handling (CT1-compliant): (recommended) label
STAYS on default option to preserve AUTO_DECIDE contract; neutrality
expressed as prose in Recommendation line only
- Net line closes the decision with a one-sentence tradeoff frame
- Rule 11: tool_use mandate (prose "Question:" blocks don't count)
- Self-check list before emitting
test/skill-validation.test.ts
- Update format assertions to check for new Pros/Cons tokens
(Pros / cons:, Recommendation: <choice>, Net:, ELI10, Stakes if we
pick wrong:, ✅, ❌) across all tier-2+ skills
- Old "RECOMMENDATION: Choose" expectation removed (the new format uses
mixed-case "Recommendation:" with no literal "Choose")
test/skill-e2e-plan-format.test.ts
- Add v1.7.0.0 format token regexes (PROS_CONS_HEADER_RE, PRO_BULLET_RE,
CON_BULLET_RE, NET_LINE_RE, D_NUMBER_RE, STAKES_RE)
- Existing RECOMMENDATION_RE loosened to accept mixed-case "Recommendation:"
(canonical v1.7.0.0 form) alongside all-caps (legacy). Tests are
additive — the strict new-format gate is the upcoming cadence eval.
Regenerated 30 SKILL.md files via bun run gen:skill-docs.
Verified:
- bun test: 319 pass (1 pre-existing security-bench fixture oversize
failure on main, unrelated — confirmed via git stash test on main HEAD)
- New format tokens render in all tier-2+ skills (plan-ceo-review,
plan-eng-review, ship, office-hours verified)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test: gate-tier units + periodic Pros/Cons evals for AskUserQuestion format
Part 3 of 4 (plan: ~/.claude/plans/system-instruction-you-are-working-polymorphic-twilight.md).
Gate-tier (E1, free, runs on every `bun test`):
test/preamble-compose.test.ts — pins the composition order
Asserts AskUserQuestion Format section renders BEFORE Model-Specific
Behavioral Patch in tier-≥2 preamble output. Covers claude default,
opus-4-7 overlay, tier 2/3, and codex host. Catches any future edit
to scripts/resolvers/preamble.ts that silently reverts the order.
test/resolver-ask-user-format.test.ts — pins the Pros/Cons contract
14 assertions against generateAskUserFormat output: D<N>, ELI10,
Stakes if we pick wrong:, Recommendation: <choice>, Pros / cons:,
✅/❌ markers, min 2 pros + 1 con rules, hard-stop escape exact
phrase, neutral-posture CT1 rule ((recommended) label preserved for
AUTO_DECIDE), Completeness coverage-vs-kind, tool_use mandate
(rule 11), self-check list, D-numbering model-level caveat.
test/model-overlay-opus-4-7.test.ts — pins the pacing directive
Asserts raw overlay file + resolved overlay output contain "Pace
questions to the skill" and NOT "Batch your questions". Verifies
INHERIT:claude chain still works (Todo-list, subordination wrapper),
Fan out / Effort-match / Literal interpretation nudges preserved.
Also asserts claude base overlay does NOT carry the Opus-specific
pacing directive (no cross-contamination).
Periodic-tier (E2, Opus-dependent, ~$1-2/run):
test/skill-e2e-plan-prosons.test.ts — 4 cases extending v1.6.3.0 harness
1. Format positive — every token present when plan has real tradeoff
2. Hard-stop NEGATIVE — plan with genuine tradeoff must NOT dodge to
"No cons — hard-stop choice" escape
3. Neutral-posture NEGATIVE — plan where one option dominates must emit
(recommended) label + "because <reason>", must NOT dodge to
"taste call" / "no preference"
4. Hard-stop POSITIVE — destructive-action plan may legitimately use
the hard-stop escape
test/helpers/touchfiles.ts — entries for all new eval cases
Dependencies: overlay, preamble.ts, generate-ask-user-format.ts, and
the 4 plan-review templates. Diff-based selection triggers the evals
whenever those files change. Also added entries for 7 expanded-coverage
cases (ship, office-hours, investigate, qa, review, design-review,
document-release) — test cases will land in follow-up PRs per skill.
Follow-ups noted in test file header:
- True multi-turn cadence eval (3 findings → 3 distinct asks) — current
harness captures one $OUT_FILE per session; multi-turn capture needs
new harness support.
- Expanded-coverage test cases for the 7 non-plan-review skills.
Verified:
- bun test: 349 pass (30 new + 319 baseline), 1 pre-existing security-bench
oversize failure on main (unrelated, unchanged).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test: regenerate golden fixtures + update ELI10 phrase check for v1.7.0.0
Pros/Cons format rewrite (
|
||
|
|
b805aa0113 |
feat: Confusion Protocol, Hermes + GBrain hosts, brain-first resolver (v0.18.0.0) (#1005)
* feat: add Confusion Protocol to preamble resolver Injects a high-stakes ambiguity gate at preamble tier >= 2 so all workflow skills get it. Fires when Claude encounters architectural decisions, data model changes, destructive operations, or contradictory requirements. Does NOT fire on routine coding. Addresses Karpathy failure mode #1 (wrong assumptions) with an inline STOP gate instead of relying on workflow skill invocation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: add Hermes and GBrain host configs Hermes: tool rewrites for terminal/read_file/patch/delegate_task, paths to ~/.hermes/skills/gstack, AGENTS.md config file. GBrain: coding skills become brain-aware when GBrain mod is installed. Same tool rewrites as OpenClaw (agents spawn Claude Code via ACP). GBRAIN_CONTEXT_LOAD and GBRAIN_SAVE_RESULTS NOT suppressed on gbrain host, enabling brain-first lookup and save-to-brain behavior. Both registered in hosts/index.ts with setup script redirect messages. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: GBrain resolver — brain-first lookup and save-to-brain New scripts/resolvers/gbrain.ts with two resolver functions: - GBRAIN_CONTEXT_LOAD: search brain for context before skill starts - GBRAIN_SAVE_RESULTS: save skill output to brain after completion Placeholders added to 4 thinking skill templates (office-hours, investigate, plan-ceo-review, retro). Resolves to empty string on all hosts except gbrain via suppressedResolvers. GBRAIN suppression added to all 9 non-gbrain host configs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: wire slop:diff into /review as advisory diagnostic Adds Step 3.5 to the review template: runs bun run slop:diff against the base branch to catch AI code quality issues (empty catches, redundant return await, overcomplicated abstractions). Advisory only, never blocking. Skips silently if slop-scan is not installed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: add Karpathy compatibility note to README Positions gstack as the workflow enforcement layer for Karpathy-style CLAUDE.md rules (17K stars). Links to forrestchang/andrej-karpathy-skills. Maps each Karpathy failure mode to the gstack skill that addresses it. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: improve native OpenClaw thinking skills office-hours: add design doc path visibility message after writing ceo-review: add HARD GATE reminder at review section transitions retro: add non-git context support (check memory for meeting notes) Mirrors template improvements to hand-crafted native skills. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: update tests and golden fixtures for new hosts - Host count: 8 → 10 (hermes, gbrain) - OpenClaw adapter test: expects undefined (dead code removed) - Golden ship fixtures: updated with Confusion Protocol + vendoring Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: regenerate all SKILL.md files Regenerated from templates after Confusion Protocol, GBrain resolver placeholders, slop:diff in review, HARD GATE reminders, investigation learnings, design doc visibility, and retro non-git context changes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: update project documentation for v0.18.0.0 - CHANGELOG: add v0.18.0.0 entry (Confusion Protocol, Hermes, GBrain, slop in review, Karpathy note, skill improvements) - CLAUDE.md: add hermes.ts and gbrain.ts to hosts listing - README.md: update agent count 8→10, add Hermes + GBrain to table - VERSION: bump to 0.18.0.0 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: sync package.json version to 0.18.0.0 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: extract Step 0 from review SKILL.md in E2E test The review-base-branch E2E test was copying the full 1493-line review/SKILL.md into the test fixture. The agent spent 8+ turns reading it in chunks, leaving only 7 turns for actual work, causing error_max_turns on every attempt. Now extracts only Step 0 (base branch detection, ~50 lines) which is all the test actually needs. Follows the CLAUDE.md rule: "NEVER copy a full SKILL.md file into an E2E test fixture." Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: update GBrain and Hermes host configs for v0.10.0 integration GBrain: add 'triggers' to keepFields so generated skills pass checkResolvable() validation. Add version compat comment. Hermes: un-suppress GBRAIN_CONTEXT_LOAD and GBRAIN_SAVE_RESULTS. The resolvers handle GBrain-not-installed gracefully, so Hermes agents with GBrain as a mod get brain features automatically. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: GBrain resolver DX improvements and preamble health check Resolver changes: - gbrain query → gbrain search (fast keyword search, not expensive hybrid) - Add keyword extraction guidance for agents - Show explicit gbrain put_page syntax with --title, --tags, heredoc - Add entity enrichment with false-positive filter - Name throttle error patterns (exit code 1, stderr keywords) - Add data-research routing for investigate skill - Expand skillSaveMap from 4 to 8 entries - Add brain operation telemetry summary Preamble changes: - Add gbrain doctor --fast --json health check for gbrain/hermes hosts - Parse check failures/warnings count - Show failing check details when score < 50 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: preserve keepFields in allowlist frontmatter mode The allowlist mode hard-coded name + description reconstruction but never iterated keepFields for additional fields. Adding 'triggers' to keepFields was a no-op because the field was silently stripped. Now iterates keepFields and preserves any field beyond name/description from the source template frontmatter, including YAML arrays. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: add triggers to all 38 skill templates Multi-word, skill-specific trigger keywords for GBrain's RESOLVER.md router. Each skill gets 3-6 triggers derived from its "Use when asked to..." description text. Avoids single generic words that would collide across skills (e.g., "debug this" not "debug"). These are distinct from voice-triggers (speech-to-text aliases) and serve GBrain's checkResolvable() validation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: regenerate all SKILL.md files and update golden fixtures Regenerated from updated templates (triggers, brain placeholders, resolver DX improvements, preamble health check). Golden fixtures updated to match. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: settings-hook remove exits 1 when nothing to remove gstack-settings-hook remove was exiting 0 when settings.json didn't exist, causing gstack-uninstall to report "SessionStart hook" as removed on clean systems where nothing was installed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: update project documentation for GBrain v0.10.0 integration ARCHITECTURE.md: added GBRAIN_CONTEXT_LOAD and GBRAIN_SAVE_RESULTS to resolver table. CHANGELOG.md: expanded v0.18.0.0 entry with GBrain v0.10.0 integration details (triggers, expanded brain-awareness, DX improvements, Hermes brain support), updated date. CLAUDE.md: added gbrain to resolvers/ directory comment. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: routing E2E stops writing to user's ~/.claude/skills/ installSkills() was copying SKILL.md files to both project-level (.claude/skills/ in tmpDir) and user-level (~/.claude/skills/). Writing to the user's real install fails when symlinks point to different worktrees or dangling targets (ENOENT on copyFileSync). Now installs to project-level only. The test already sets cwd to the tmpDir, so project-level discovery works. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: scale Gemini E2E back to smoke test Gemini CLI gets lost in worktrees on complex tasks (review times out at 600s, discover-skill hits exit 124). Nobody uses Gemini for gstack skill execution. Replace the two failing tests (gemini-discover-skill and gemini-review-findings) with a single smoke test that verifies Gemini can start and read the README. 90s timeout, no skill invocation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> |
||
|
|
31943b2f02 |
feat: anti-skip rule for all review skills (v0.15.6.1) (#804)
* feat: anti-skip rule for all review skills Review skills sometimes skip sections when reviewing strategy or spec plans. This adds an explicit anti-skip rule to CEO (1-11), eng (1-4), design (1-7), and DX (1-8) review skills. Also fixes CEO header from "10 sections" to "11 sections" to match actual count. * chore: bump version and changelog (v0.15.6.1) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
447851452a |
feat: interactive /plan-devex-review + plan mode skill fix (v0.15.5.0) (#796)
* fix: skill invocation during plan mode takes precedence over generic plan mode Adds a "Skill Invocation During Plan Mode" section to the preamble resolver so all generated SKILL.md files include it. Fixes a bug where Claude treats loaded skill content as reference material instead of executable instructions, and keeps trying to ExitPlanMode instead of following the skill workflow step by step. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: interactive /plan-devex-review with persona, benchmarks, and forcing questions Complete rewrite of the DX review skill to match CEO/eng review depth. New flow: investigate (persona, empathy, competitors, magical moment, journey tracing) then force decisions, then score with evidence. Three modes: DX EXPANSION, DX POLISH, DX TRIAGE. 20-45 interactive STOP points vs 10-12 before. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: autoplan DX POLISH mode + review log schema for new devex fields Adds mode selection, persona, competitive, and magical moment override rules to autoplan Phase 3.5. Documents new review log fields (mode, persona, competitive_tier) in the plan-file-review-report schema. Syncs package.json version to VERSION. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: update project documentation for v0.15.5.0 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> |
||
|
|
be96ff5ce7 |
feat: /plan-devex-review + /devex-review — DX review skills (v0.15.3.0) (#784)
* feat: add DX framework resolver for shared principles and scoring rubric
New {{DX_FRAMEWORK}} resolver provides compact (~150 lines) shared content
for /plan-devex-review and /devex-review: Addy Osmani's 8 DX principles,
7 characteristics table, 10 cognitive patterns, scoring rubric, and TTHW
benchmarks. Hall of Fame examples loaded on-demand per pass to avoid bloat.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: add DX Review row to review dashboard
Adds plan-devex-review and devex-review schema entries to the review
dashboard resolver and placeholder table in the preamble. All existing
SKILL.md files regenerated to include the new DX Review row.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: /plan-devex-review skill — DX plan review with Osmani framework
Plan-stage developer experience review. Rates 8 DX dimensions 0-10:
getting started, API/CLI/SDK design, error messages, docs, upgrade path,
dev environment, community, and DX measurement. Includes developer empathy
simulation, auto-detect product type with applicability gate, DX scorecard
with trend tracking, and a conditional Claude Code Skill DX checklist.
Hall of Fame examples loaded on-demand per pass from dx-hall-of-fame.md.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: /devex-review skill — live DX audit with browse
Live-system developer experience audit using browse tool. Tests all 8
dimensions aligned with /plan-devex-review for boomerang comparison
(plan said 3 min TTHW, reality says 8). Each dimension marked TESTED,
INFERRED, or N/A with evidence. Scope-aware: declares what browse can
and cannot test, falls back to file artifacts for untestable dimensions.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* chore: bump version and changelog (v0.15.3.0)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|