merge: integrate origin/main (v0.18.1.0) into open-agents-learnings

Main moved forward 6 commits while this branch was local. Integrated
both sides preserving all functionality:

From main (v0.16.4.0 → v0.18.1.0):
- v0.17.0.0 — UX behavioral foundations + ux-audit (generateUXPrinciples,
  {{UX_PRINCIPLES}} placeholder, triggers frontmatter on skills)
- v0.18.0.0 — Confusion Protocol, Hermes + GBrain hosts, brain-first
  resolver (generateBrainHealthInstruction, generateConfusionProtocol,
  generateGBrainContextLoad, generateGBrainSaveResults, hosts/gbrain.ts,
  hosts/hermes.ts, scripts/resolvers/gbrain.ts, GBrain bash health check)
- v0.18.0.1 — ngrok Windows build fix
- 0cc830b6 — tilde-in-assignment permission fix
- cc42f14a — gstack compact design doc (tabled)
- 822e843a — headed browser auto-shutdown + disconnect cleanup (v0.18.1.0)

Integration approach: keep this branch's preamble.ts submodule refactor
as the structure of record. Extracted main's two new generators into
their own submodules:
- scripts/resolvers/preamble/generate-brain-health-instruction.ts
- scripts/resolvers/preamble/generate-confusion-protocol.ts

Updated scripts/resolvers/preamble/generate-preamble-bash.ts to absorb
main's GBrain health check (host-conditional on gbrain/hermes).

scripts/resolvers/index.ts now imports BOTH:
- This branch's adds: MODEL_OVERLAY, TASTE_PROFILE, BIN_DIR resolvers
- Main's adds: UX_PRINCIPLES, GBRAIN_CONTEXT_LOAD, GBRAIN_SAVE_RESULTS
  resolvers

scripts/resolvers/design.ts keeps both generateTasteProfile (this
branch) and generateUXPrinciples (main). Sibling exports, no overlap.

scripts/gen-skill-docs.ts keeps both this branch's --model flag wiring
and main's edits.

Templates auto-merged where possible. The 35 generated SKILL.md /
golden conflicts auto-resolved via `bun run gen:skill-docs --host all`
followed by re-snapshotting the ship goldens for claude/codex/factory.

Verification:
- bun run gen:skill-docs --host all completes cleanly
- bun test: 1 pre-existing failure (gstack-community-dashboard Supabase
  network test, 235s timeout). NOT related to merge — unchanged Supabase
  test infra times out without live network. Flagged in PR body.

Token-ceiling warnings on plan-ceo-review (29K), office-hours (26K),
and ship (34K). These existed on origin/main before the merge — the
preamble grew substantially from main's GBrain + UX additions plus this
branch's continuous-checkpoint, context-health, model-overlay, taste-profile,
and feature-discovery additions. Worth a follow-up reduction pass but
doesn't block this merge.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Garry Tan
2026-04-17 13:58:15 +08:00
129 changed files with 3314 additions and 154 deletions
+22
View File
@@ -18,6 +18,11 @@ allowed-tools:
- Agent
- AskUserQuestion
- WebSearch
triggers:
- ship it
- create a pr
- push to main
- deploy this
---
<!-- AUTO-GENERATED from SKILL.md.tmpl — do not edit directly -->
<!-- Regenerate: bun run gen:skill-docs -->
@@ -434,6 +439,19 @@ AI makes completeness near-free. Always recommend the complete option over short
Include `Completeness: X/10` for each option (10=all edge cases, 7=happy path, 3=shortcut).
## Confusion Protocol
When you encounter high-stakes ambiguity during coding:
- Two plausible architectures or data models for the same requirement
- A request that contradicts existing patterns and you're unsure which to follow
- A destructive operation where the scope is unclear
- Missing context that would change your approach significantly
STOP. Name the ambiguity in one sentence. Present 2-3 options with tradeoffs.
Ask the user. Do not guess on architectural or data model decisions.
This does NOT apply to routine coding, small features, or obvious changes.
## Continuous Checkpoint Mode
If `CHECKPOINT_MODE` is `"continuous"` (from preamble output): auto-commit work as
@@ -707,6 +725,8 @@ branch name wherever the instructions say "the base branch" or `<default>`.
---
# Ship: Fully Automated Ship Workflow
You are running the `/ship` workflow. This is a **non-interactive, fully automated** workflow. Do NOT ask for confirmation at any step. The user said `/ship` which means DO IT. Run straight through and output the PR URL at the end.
@@ -2282,6 +2302,8 @@ staleness detection: if those files are later deleted, the learning can be flagg
**Only log genuine discoveries.** Don't log obvious things. Don't log things the user
already knows. A good test: would this insight save time in a future session? If yes, log it.
## Step 4: Version bump (auto-decide)
**Idempotency check:** Before bumping, compare VERSION against the base branch.
+17
View File
@@ -428,6 +428,19 @@ AI makes completeness near-free. Always recommend the complete option over short
Include `Completeness: X/10` for each option (10=all edge cases, 7=happy path, 3=shortcut).
## Confusion Protocol
When you encounter high-stakes ambiguity during coding:
- Two plausible architectures or data models for the same requirement
- A request that contradicts existing patterns and you're unsure which to follow
- A destructive operation where the scope is unclear
- Missing context that would change your approach significantly
STOP. Name the ambiguity in one sentence. Present 2-3 options with tradeoffs.
Ask the user. Do not guess on architectural or data model decisions.
This does NOT apply to routine coding, small features, or obvious changes.
## Continuous Checkpoint Mode
If `CHECKPOINT_MODE` is `"continuous"` (from preamble output): auto-commit work as
@@ -701,6 +714,8 @@ branch name wherever the instructions say "the base branch" or `<default>`.
---
# Ship: Fully Automated Ship Workflow
You are running the `/ship` workflow. This is a **non-interactive, fully automated** workflow. Do NOT ask for confirmation at any step. The user said `/ship` which means DO IT. Run straight through and output the PR URL at the end.
@@ -1902,6 +1917,8 @@ staleness detection: if those files are later deleted, the learning can be flagg
**Only log genuine discoveries.** Don't log obvious things. Don't log things the user
already knows. A good test: would this insight save time in a future session? If yes, log it.
## Step 4: Version bump (auto-decide)
**Idempotency check:** Before bumping, compare VERSION against the base branch.
+17
View File
@@ -430,6 +430,19 @@ AI makes completeness near-free. Always recommend the complete option over short
Include `Completeness: X/10` for each option (10=all edge cases, 7=happy path, 3=shortcut).
## Confusion Protocol
When you encounter high-stakes ambiguity during coding:
- Two plausible architectures or data models for the same requirement
- A request that contradicts existing patterns and you're unsure which to follow
- A destructive operation where the scope is unclear
- Missing context that would change your approach significantly
STOP. Name the ambiguity in one sentence. Present 2-3 options with tradeoffs.
Ask the user. Do not guess on architectural or data model decisions.
This does NOT apply to routine coding, small features, or obvious changes.
## Continuous Checkpoint Mode
If `CHECKPOINT_MODE` is `"continuous"` (from preamble output): auto-commit work as
@@ -703,6 +716,8 @@ branch name wherever the instructions say "the base branch" or `<default>`.
---
# Ship: Fully Automated Ship Workflow
You are running the `/ship` workflow. This is a **non-interactive, fully automated** workflow. Do NOT ask for confirmation at any step. The user said `/ship` which means DO IT. Run straight through and output the PR URL at the end.
@@ -2278,6 +2293,8 @@ staleness detection: if those files are later deleted, the learning can be flagg
**Only log genuine discoveries.** Don't log obvious things. Don't log things the user
already knows. A good test: would this insight save time in a future session? If yes, log it.
## Step 4: Version bump (auto-decide)
**Idempotency check:** Before bumping, compare VERSION against the base branch.
+20 -60
View File
@@ -1,9 +1,10 @@
/**
* Gemini CLI E2E tests verify skills work when invoked by Gemini CLI.
* Gemini CLI E2E smoke test verify Gemini CLI can start and discover skills.
*
* Spawns `gemini -p` with stream-json output in the repo root (where
* .agents/skills/ already exists), parses JSONL events, and validates
* structured results. Follows the same pattern as codex-e2e.test.ts.
* This is a lightweight smoke test, not a full integration test. Gemini CLI
* gets lost in worktrees and times out on complex tasks. The smoke test
* validates that the skill files are structured correctly for Gemini's
* .agents/skills/ discovery mechanism.
*
* Prerequisites:
* - `gemini` binary installed (npm install -g @google/gemini-cli)
@@ -48,10 +49,9 @@ if (!evalsEnabled) {
// --- Diff-based test selection ---
// Gemini E2E touchfiles — keyed by test name, same pattern as Codex E2E
// Gemini E2E touchfiles — keyed by test name
const GEMINI_E2E_TOUCHFILES: Record<string, string[]> = {
'gemini-discover-skill': ['.agents/skills/**', 'test/helpers/gemini-session-runner.ts'],
'gemini-review-findings': ['review/**', '.agents/skills/gstack-review/**', 'test/helpers/gemini-session-runner.ts'],
'gemini-smoke': ['.agents/skills/**', 'test/helpers/gemini-session-runner.ts'],
};
let selectedTests: string[] | null = null; // null = run all
@@ -71,7 +71,6 @@ if (evalsEnabled && !process.env.EVALS_ALL) {
}
process.stderr.write('\n');
}
// If changedFiles is empty (e.g., on main branch), selectedTests stays null -> run all
}
/** Skip an individual test if not selected by diff-based selection. */
@@ -84,7 +83,6 @@ function testIfSelected(testName: string, fn: () => Promise<void>, timeout: numb
const evalCollector = evalsEnabled && !SKIP ? new EvalCollector('e2e-gemini') : null;
/** DRY helper to record a Gemini E2E test result into the eval collector. */
function recordGeminiE2E(name: string, result: GeminiResult, passed: boolean) {
evalCollector?.addTest({
name,
@@ -92,14 +90,13 @@ function recordGeminiE2E(name: string, result: GeminiResult, passed: boolean) {
tier: 'e2e',
passed,
duration_ms: result.durationMs,
cost_usd: 0, // Gemini doesn't report cost in USD; tokens are tracked
cost_usd: 0,
output: result.output?.slice(0, 2000),
turns_used: result.toolCalls.length, // approximate: tool calls as turns
turns_used: result.toolCalls.length,
exit_reason: result.exitCode === 0 ? 'success' : `exit_code_${result.exitCode}`,
});
}
/** Print cost summary after a Gemini E2E test. */
function logGeminiCost(label: string, result: GeminiResult) {
const durationSec = Math.round(result.durationMs / 1000);
console.log(`${label}: ${result.tokens} tokens, ${result.toolCalls.length} tool calls, ${durationSec}s`);
@@ -125,59 +122,22 @@ describeGemini('Gemini E2E', () => {
harvestAndCleanup('gemini');
});
testIfSelected('gemini-discover-skill', async () => {
// Run Gemini in an isolated worktree (has .agents/skills/ copied from ROOT)
testIfSelected('gemini-smoke', async () => {
// Smoke test: can Gemini start, read the repo, and produce output?
// Uses a simple prompt that doesn't require skill invocation or complex navigation.
const result = await runGeminiSkill({
prompt: 'List any skills or instructions you have available. Just list the names.',
timeoutMs: 60_000,
prompt: 'What is this project? Answer in one sentence based on the README.',
timeoutMs: 90_000,
cwd: testWorktree,
});
logGeminiCost('gemini-discover-skill', result);
logGeminiCost('gemini-smoke', result);
// Gemini should have produced some output
const passed = result.exitCode === 0 && result.output.length > 0;
recordGeminiE2E('gemini-discover-skill', result, passed);
// Pass if Gemini produced any meaningful output (even with non-zero exit from timeout)
const hasOutput = result.output.length > 10;
const passed = hasOutput;
recordGeminiE2E('gemini-smoke', result, passed);
expect(result.exitCode).toBe(0);
expect(result.output.length).toBeGreaterThan(0);
// The output should reference skills in some form
const outputLower = result.output.toLowerCase();
expect(
outputLower.includes('review') || outputLower.includes('gstack') || outputLower.includes('skill'),
).toBe(true);
expect(result.output.length, 'Gemini should produce output').toBeGreaterThan(10);
}, 120_000);
testIfSelected('gemini-review-findings', async () => {
// Run gstack-review skill via Gemini on worktree (isolated from main working tree)
const result = await runGeminiSkill({
prompt: 'Run the gstack-review skill on this repository. Review the current branch diff and report your findings.',
timeoutMs: 540_000,
cwd: testWorktree,
});
logGeminiCost('gemini-review-findings', result);
// Should produce structured review-like output
const output = result.output;
const passed = result.exitCode === 0 && output.length > 50;
recordGeminiE2E('gemini-review-findings', result, passed);
expect(result.exitCode).toBe(0);
expect(output.length).toBeGreaterThan(50);
// Review output should contain some review-like content
const outputLower = output.toLowerCase();
const hasReviewContent =
outputLower.includes('finding') ||
outputLower.includes('issue') ||
outputLower.includes('review') ||
outputLower.includes('change') ||
outputLower.includes('diff') ||
outputLower.includes('clean') ||
outputLower.includes('no issues') ||
outputLower.includes('p1') ||
outputLower.includes('p2');
expect(hasReviewContent).toBe(true);
}, 600_000);
});
+3 -5
View File
@@ -122,9 +122,8 @@ export const E2E_TOUCHFILES: Record<string, string[]> = {
'codex-discover-skill': ['codex/**', '.agents/skills/**', 'test/helpers/codex-session-runner.ts', 'lib/worktree.ts'],
'codex-review-findings': ['review/**', '.agents/skills/gstack-review/**', 'codex/**', 'test/helpers/codex-session-runner.ts', 'lib/worktree.ts'],
// Gemini E2E (tests skills via Gemini CLI + worktree)
'gemini-discover-skill': ['.agents/skills/**', 'test/helpers/gemini-session-runner.ts', 'lib/worktree.ts'],
'gemini-review-findings': ['review/**', '.agents/skills/gstack-review/**', 'test/helpers/gemini-session-runner.ts', 'lib/worktree.ts'],
// Gemini E2E — smoke test only (Gemini gets lost in worktrees on complex tasks)
'gemini-smoke': ['.agents/skills/**', 'test/helpers/gemini-session-runner.ts', 'lib/worktree.ts'],
// Coverage audit (shared fixture) + triage + gates
@@ -284,8 +283,7 @@ export const E2E_TIERS: Record<string, 'gate' | 'periodic'> = {
// Multi-AI — periodic (require external CLIs)
'codex-discover-skill': 'periodic',
'codex-review-findings': 'periodic',
'gemini-discover-skill': 'periodic',
'gemini-review-findings': 'periodic',
'gemini-smoke': 'periodic',
// Design — gate for cheap functional, periodic for Opus/quality
'design-consultation-core': 'periodic',
+4 -5
View File
@@ -30,8 +30,8 @@ const ROOT = path.resolve(import.meta.dir, '..');
// ─── hosts/index.ts ─────────────────────────────────────────
describe('hosts/index.ts', () => {
test('ALL_HOST_CONFIGS has 8 hosts', () => {
expect(ALL_HOST_CONFIGS.length).toBe(8);
test('ALL_HOST_CONFIGS has 10 hosts', () => {
expect(ALL_HOST_CONFIGS.length).toBe(10);
});
test('ALL_HOST_NAMES matches config names', () => {
@@ -479,9 +479,8 @@ describe('host config correctness', () => {
expect(openclaw.pathRewrites.some(r => r.from === 'CLAUDE.md' && r.to === 'AGENTS.md')).toBe(true);
});
test('openclaw has adapter path', () => {
expect(openclaw.adapter).toBeDefined();
expect(openclaw.adapter).toContain('openclaw-adapter');
test('openclaw has no adapter (dead code removed)', () => {
expect(openclaw.adapter).toBeUndefined();
});
test('openclaw has no staticFiles (SOUL.md removed)', () => {
+11 -1
View File
@@ -1,9 +1,19 @@
import { describe, test, expect, beforeEach, afterEach } from 'bun:test';
import { describe, test as _bunTest, expect, beforeEach, afterEach } from 'bun:test';
import { execSync } from 'child_process';
import * as fs from 'fs';
import * as path from 'path';
import * as os from 'os';
// Every test in this file shells out to gstack-config + gstack-relink (bash scripts
// invoking subprocess work). Under parallel bun test load, subprocess spawn contends
// with other suites and each test can drift ~200ms past the 5s default. Bump to 15s.
// Object.assign preserves test.only / test.skip / test.each / test.todo sub-APIs.
const test = Object.assign(
((name: any, fn: any, timeout?: number) =>
_bunTest(name, fn, timeout ?? 15_000)) as typeof _bunTest,
_bunTest,
);
const ROOT = path.resolve(import.meta.dir, '..');
const BIN = path.join(ROOT, 'bin');
+10 -7
View File
@@ -286,18 +286,21 @@ describeIfSelected('Base branch detection', ['review-base-branch', 'ship-base-br
run('git', ['add', 'app.rb'], dir);
run('git', ['commit', '-m', 'feat: add hello method'], dir);
// Copy review skill files
fs.copyFileSync(path.join(ROOT, 'review', 'SKILL.md'), path.join(dir, 'review-SKILL.md'));
fs.copyFileSync(path.join(ROOT, 'review', 'checklist.md'), path.join(dir, 'review-checklist.md'));
fs.copyFileSync(path.join(ROOT, 'review', 'greptile-triage.md'), path.join(dir, 'review-greptile-triage.md'));
// Extract only Step 0 (base branch detection) + minimal review instructions
// Full SKILL.md is ~1500 lines — copying it causes the agent to spend all turns reading
const full = fs.readFileSync(path.join(ROOT, 'review', 'SKILL.md'), 'utf-8');
const step0Start = full.indexOf('## Step 0: Detect platform and base branch');
const step1Start = full.indexOf('## Step 1: Check branch');
const step1End = full.indexOf('---', step1Start + 10);
const extracted = full.slice(step0Start, step1End > step1Start ? step1End : step1Start + 500);
fs.writeFileSync(path.join(dir, 'review-SKILL.md'), extracted);
const result = await runSkillTest({
prompt: `You are in a git repo on a feature branch with changes.
Read review-SKILL.md for the review workflow instructions.
Also read review-checklist.md and apply it.
Read review-SKILL.md for the base branch detection instructions.
IMPORTANT: Follow Step 0 to detect the base branch. Since there is no remote, gh commands will fail fall back to main.
Then run the review against the detected base branch.
Then run git diff against the detected base branch and write a brief review.
Write your findings to ${dir}/review-output.md`,
workingDirectory: dir,
maxTurns: 15,
+7 -16
View File
@@ -60,10 +60,9 @@ if (evalsEnabled && process.env.EVALS_TIER) {
// --- Helper functions ---
/** Copy all SKILL.md files for auto-discovery.
* Install to BOTH project-level (.claude/skills/) AND user-level (~/.claude/skills/)
* because Claude Code discovers skills from both locations. In CI containers,
* $HOME may differ from the working directory, so we need both paths to ensure
* the Skill tool appears in Claude's available tools list. */
* Installs to project-level (.claude/skills/) only. Writing to the user's
* ~/.claude/skills/ is unsafe: it may contain symlinks from the real gstack
* install that point to different worktrees or dangling targets. */
function installSkills(tmpDir: string) {
const skillDirs = [
'', // root gstack SKILL.md
@@ -73,24 +72,16 @@ function installSkills(tmpDir: string) {
'gstack-upgrade', 'humanizer',
];
// Install to both project-level and user-level skill directories
const homeDir = process.env.HOME || os.homedir();
const installTargets = [
path.join(tmpDir, '.claude', 'skills'), // project-level
path.join(homeDir, '.claude', 'skills'), // user-level (~/.claude/skills/)
];
const targetBase = path.join(tmpDir, '.claude', 'skills');
for (const skill of skillDirs) {
const srcPath = path.join(ROOT, skill, 'SKILL.md');
if (!fs.existsSync(srcPath)) continue;
const skillName = skill || 'gstack';
for (const targetBase of installTargets) {
const destDir = path.join(targetBase, skillName);
fs.mkdirSync(destDir, { recursive: true });
fs.copyFileSync(srcPath, path.join(destDir, 'SKILL.md'));
}
const destDir = path.join(targetBase, skillName);
fs.mkdirSync(destDir, { recursive: true });
fs.copyFileSync(srcPath, path.join(destDir, 'SKILL.md'));
}
// Write a CLAUDE.md with explicit routing instructions.
+1
View File
@@ -143,6 +143,7 @@ describe('Command registry consistency', () => {
const validKeys = new Set([
'interactive', 'compact', 'depth', 'selector',
'diff', 'annotate', 'outputPath', 'cursorInteractive',
'heatmap',
]);
for (const flag of SNAPSHOT_FLAGS) {
expect(validKeys.has(flag.optionKey)).toBe(true);
+2 -2
View File
@@ -85,11 +85,11 @@ describe('gstack-settings-hook', () => {
expect(settings.hooks).toBeUndefined();
});
test('remove is safe when settings.json does not exist', () => {
test('remove exits 1 when settings.json does not exist', () => {
const result = run(`${SETTINGS_HOOK} remove /path/to/gstack-session-update`, {
env: { GSTACK_SETTINGS_FILE: settingsFile },
});
expect(result.exitCode).toBe(0);
expect(result.exitCode).toBe(1);
});
test('remove preserves other hooks', () => {