mirror of
https://github.com/garrytan/gstack.git
synced 2026-05-27 13:34:25 +02:00
1d9b9c4cfc
* feat(ios): author 5 iOS device-farm skill templates + generated docs Authors ios-qa, ios-fix, ios-design-review, ios-clean, ios-sync as upstream gstack skills. Each follows the standard SKILL.md.tmpl pattern with preamble-tier:3 frontmatter. The fork at time-attack/gstack shipped these but as byte-identical .md/.tmpl pairs that wouldn't pass skill-docs.yml — this commit fixes that by authoring proper templates and regenerating through gen-skill-docs. * feat(ios): Swift templates for StateServer + DebugOverlay v2 + structural Release guard StateServer is loopback-only (::1 + 127.0.0.1) with boot-token rotation, per-device session lock (sliding on mutations only), snapshot/restore with schema-hash envelope, and 1MB body cap. DebugOverlay v2 has animated brand border + agent attribution chip (display-only) + recording watermark. Package.swift enforces structural Release-build exclusion via .when(configuration: .debug). Includes Tailscale ACL example doc. * feat(ios): Mac-side daemon (bun/TS) for Tailscale identity gating + USB proxy On-demand daemon spawns when /ios-qa needs it (single-instance flock + readiness protocol). Owns tailnet ingress: fail-closed tailscaled LocalAPI probe, dual-track /auth/mint (self-service for allowlisted identities, owner-granted via CLI), capability-tier allowlist (observe/interact/mutate/restore), 1h default session TTL (24h hard cap), audit log of every authenticated mutating tailnet request, hashed-identity attempts log. iOS StateServer never directly binds tailnet — identity validation lives Mac-side because iPhones can't reach tailscaled. 67 unit/integration tests covering session-lock concurrency, capability enforcement, fail-closed probe, identity canonicalization, body limits, and boot-token leak proofs. * feat(ios): gen-accessors codegen tool (SwiftPM + TS port) Replaces fork's regex-based codegen with SwiftPM swift-syntax tool (production) plus a TS port (test + fast first-run). Composite cache key: sha256(source || swift_version || tool_git_rev || platform_triple). Codex flagged that source-only hash misses generator-logic changes — this hash invalidates correctly across all four dimensions. 20 tests cover the 3 known regex failure modes (computed properties, generics, multi-line types) plus full cache hit/miss/prune coverage. * test(ios): high-level E2E + touchfile registration 8 E2E scenarios: codegen against SwiftUI fixture, daemon spawn + stub StateServer, schema-mismatch rejection, full agent loop, multi-agent contention, tailnet allowlist gating, capability-tier enforcement. Registered as gate-tier in E2E_TOUCHFILES + E2E_TIERS so diff-based selection picks up iOS work without slowing every PR. * chore: bump version and changelog (v1.40.0.0) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * test(ios): real Swift compile + XCTest fixture; device-path probe; loopback bind fix Closes the gap from prior commits where E2E tests stubbed the Swift StateServer in TypeScript. Now there's a real SwiftPM fixture at test/fixtures/ios-qa/FixtureApp/ that compiles the production templates and runs an XCTest suite against the actual StateServer implementation. Three new test layers: - swift build invariants (periodic-tier): debug-config build succeeds, XCTest suite passes (validates real Swift impl over Foundation + Network), release-config build has zero DebugBridge symbols (structural #if DEBUG gate works end-to-end). - Real-device probe (periodic-tier, GSTACK_HAS_IOS_DEVICE=1): devicectl can list + pair the connected iPhone. Surfaces actionable instructions when the trust dialog hasn't been confirmed yet. - Fixture sources copied from ios-qa/templates/ — Package.swift splits the bridge into DebugBridgeCore (Foundation+Network, cross-platform) and DebugBridgeUI (UIKit/SwiftUI, iOS-only) so swift build can validate the bulk of the production code on macOS without an iPhone or simulator. Also fixes a real bug the XCTest unit suite caught: NWListener with requiredLocalEndpoint on params silently fails to bind for listening (it's an outbound-connection concept). Replaced with .requiredInterfaceType=.loopback + .acceptLocalOnly=true + a per-connection peer-address check. The fork's inherited code had this bug; we shipped it untouched in v1.41.0.0 and the new XCTest suite caught it immediately. * fix(ios): 3 architecture bugs surfaced by real-iPhone device test End-to-end verification on a connected iPhone 17 Pro Max via CoreDevice tunnel exposed three bugs the TS-stubbed and macOS-XCTest layers missed: 1. acceptLocalOnly=true was too tight. Network.framework's "local" gate only allows ::1 / 127.0.0.1, silently dropping CoreDevice tunnel peers (the very transport the architecture is designed for). The device log showed "Ignoring non-local connection from fd72:8347:2ead::2" — the Mac's tunnel-side address. Replaced with explicit per-connection ULA gate (RFC 4193 fc00::/7) in isLoopbackPeer. 2. DebugBridgeCore (Foundation+Network) referenced DebugOverlayWindow which lives in DebugBridgeUI (UIKit). Backwards module dep. Compiled on macOS only because canImport(UIKit) stripped it; broke on iOS. Moved the overlay install responsibility to the consuming app's wiring (DebugBridgeWiring.swift.template already shows the pattern). 3. @Observable macro + @Snapshotable property wrapper conflict. Both try to synthesize backing storage; can't coexist on the same property. The production guidance is: nest snapshot-eligible state in a struct inside an ObservableObject (or use the canonical-state-struct atomicity strategy). Fixture switched to a plain class to demonstrate. Smoke loop on the real device now passes 7/8 endpoints: - /healthz (200), /tap unauth (401), /auth/rotate (200), boot-token reuse rejected (401), /session/acquire (200), /state/snapshot (200 with schema envelope), /session/release (200). /tap with valid session returns 200 HTTP + op:false because the FixtureApp doesn't wire MutationBridge.resolver to a real UI tap — expected for a minimal fixture; the production wiring template handles it. Also adds: - test/fixtures/ios-qa/FixtureApp/Sources/FixtureApp/FixtureAppApp.swift (SwiftUI @main entry that boots StateServer) - test/fixtures/ios-qa/FixtureApp/Sources/FixtureApp/Info.plist - test/fixtures/ios-qa/FixtureApp/project.yml (xcodegen project spec with DEVELOPMENT_TEAM 623FYQ2M88, bundle id com.gstack.iosqa.fixture) End-to-end verified path: xcodegen generate xcodebuild -allowProvisioningUpdates -allowProvisioningDeviceRegistration devicectl device install app devicectl device process launch devicectl device copy from --source tmp/gstack-ios-qa.token curl -6 http://[<corodevice-ipv6>]:9999/... * feat(ios): real daemon tunnelProvider + KIF-derived UITouch synthesis Closes two layers of the device-control gap: L1 — Mac daemon's tunnelProvider is now real, not a stub. New files: - ios-qa/daemon/src/devicectl.ts: thin wrappers around `xcrun devicectl` (list, info, launch, install, copy-from) with spawn+resolve injection for unit testability. - ios-qa/daemon/src/tunnel-bootstrap.ts: orchestrates find-device → launch-app → resolve IPv6 → wait-for-healthz → copy-boot-token → POST /auth/rotate → return DeviceTunnel with rotated bearer. - ios-qa/daemon/test/tunnel-bootstrap.test.ts: 7 tests covering every error branch (no_devices, no_paired_device, device_locked, state_server_unreachable, resolve_failed, happy path, explicit-udid). - index.ts wired to use bootstrapTunnel() when running as CLI; tests keep using injected stubs. L2 — In-process touch synthesis for non-UIControl widgets. New target in the fixture SPM package: - DebugBridgeTouch (Objective-C): KIF-derived UITouch + IOHIDEvent synthesis. Loads IOKit dynamically via dlopen/dlsym (IOKit is a private framework on iOS, can't link statically). Uses iOS 18+ _UIHitTestContext for SwiftUI hit-testing. Public Swift-callable API: DebugBridgeTouch.sendTap(at:in:). MIT-attributed to kif-framework/KIF. - DebugBridgeUI/Bridges.swift: rewritten MutationBridge.handleTap to delegate to DebugBridgeTouch. ScreenshotBridge + ElementsBridge implementations also land here. - FixtureApp/Sources/FixtureApp/FixtureAppApp.swift: wires the bridges on app launch under #if DEBUG. Real-iPhone evidence (Conductor sandbox → CoreDevice IPv6 → live app): - /healthz returns 200 with on-device JSON body - /screenshot returns 427KB PNG that decodes to your actual phone screen - Boot-token rotation kills the original token (401 boot_token_invalid on reuse — the load-bearing security property verified live) - Session lock + auth gate (401/423/200 paths all work) - Schema-versioned state envelope (_schema_version + _accessor_hash) Known partial: synthesized UITouch reaches SwiftUI's host view per device-side syslog ("non-local connection from fd...:2" earlier showed the per-connection peer gate working), and HTTP returns 200 ok:true, but SwiftUI Button onTap handler doesn't fire. UIControl widgets DO work via UIControl.sendActions. Next step is attaching lldb to the live app on device to diagnose which validation SwiftUI's gesture recognizer is failing. The architectural primary path (`POST /state/<key>` to mutate @Snapshotable fields) is unaffected and is the recommended control vector. Documented sources for the KIF-derived synthesis: - https://github.com/kif-framework/KIF (MIT) - UITouch-KIFAdditions.m: init flow with _setLocationInWindow:, setGestureView:, _setIsFirstTouchForView: - IOHIDEvent+KIF.m: digitizer event construction - iOS 18+ _UIHitTestContext path for SwiftUI hit-testing * fix(ios): SwiftUI Button synthesized tap on iOS 18+ DBT_HitTestView was filtering _hitTestWithContext: results by isKindOfClass:UIView and dropping the new SwiftUI.UIKitGestureContainer (a UIResponder, not UIView). SwiftUI Buttons live behind that container on iOS 18+, so every synthesized tap returned ok:true but onTap never fired. Mirror KIF PR #1323: return id, pass the responder through to UITouch.setView: directly (the setter accepts non-UIView responders). Verified: real iPhone 17 Pro Max, iOS 26.5, FixtureApp counter incremented 0 → 1 → 4 over four /tap requests at the button location. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(ios): hoist DebugBridgeTouch into canonical templates Bridges.swift.template imports DebugBridgeTouch but no .m/.h template shipped — consuming apps installing the canonical drop-in would hit a linker error. Closes that gap with the fixture's verified working code. Changes: - New ios-qa/templates/DebugBridgeTouch.{h,m}.template files (carbon copies of the fixture sources, including the iOS-18+ SwiftUI hit-test fix verified on iPhone 17 Pro Max). - Package.swift.template splits into 3 product targets: DebugBridgeCore (Swift, cross-platform), DebugBridgeUI (Swift, iOS-only), DebugBridgeTouch (Obj-C, iOS-only). Consuming app adds one dependency on DebugBridgeUI; Core + Touch come in transitively. - DebugBridgeTouch sources wrap their body in #if TARGET_OS_IOS so the cross-platform `swift build` on macOS host doesn't choke on UIKit. On iOS the real implementation is active; on macOS sendTapAtPoint: is a no-op returning NO. - New parity tests pin template ↔ fixture content so future fixture fixes propagate or fail loudly. - Restrict swift-build host tests to DebugBridgeCore (the only target buildable on macOS) and bring up the previously broken XCTest run via --filter. Verified post-change: real iPhone 17 Pro Max, iOS 26.5, three /tap requests against the rebuilt app — counter went 0 → 3, SwiftUI Button onTap fires every time. Templates now sufficient to ship to any consuming iOS app. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(ios): ship gstack-ios-qa-daemon + gstack-ios-qa-mint launchers The skill doc has been telling users to run `gstack-ios-qa-daemon` and `gstack-ios-qa-mint` since v1.41.0.0, but neither binary actually existed. Anyone following the install flow hit "command not found" immediately after the Swift template install. Adds the missing pieces: - bin/gstack-ios-qa-daemon — bash shim that execs `bun run ios-qa/daemon/src/index.ts`. Loopback by default; `--tailnet` to additionally open the Tailscale-facing listener with capability-tier allowlist enforcement. - bin/gstack-ios-qa-mint — owner-grant CLI for the tailnet allowlist (grant / revoke / list). Writes ~/.gstack/ios-qa-allowlist.json at mode 0600. Self-service POST /auth/mint reads from this file; remote agents never auto-allowlist. - ios-qa/daemon/src/cli-mint.ts — TS implementation behind the shim. Handles --capability tier validation, --ttl expiry, --note metadata, and --allowlist-path override for tests. - ios-qa/daemon/src/allowlist.ts — treat empty files as "no entries yet" (caught while writing the CLI tests; previously bombed with a JSON parse error on the first grant against a freshly-mktemp'd path). Tests: 7 new end-to-end launcher tests (--help shape, grant/list/revoke roundtrip, missing --remote, unknown capability, --ttl persistence, launcher executability, missing-bun preflight). All 81 daemon tests pass. This is the last gap between "templates installed" and "I can drive any connected iPhone over USB or tailnet" — the user-facing CLI surface now matches the install instructions byte-for-byte. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: surface ios-qa CLIs + add end-to-end how-to walkthrough The two CLIs that ship with the iOS device-farm capability — gstack-ios-qa-daemon and gstack-ios-qa-mint — were mentioned only inside ios-qa/SKILL.md. Anyone reading README or AGENTS to figure out how to drive an iPhone hit a wall: skills are listed, binaries aren't. This commit closes the coverage gap surfaced by /document-release's Diataxis audit: - README.md, AGENTS.md: both CLIs added to the binary tables with one-line capability summaries. - docs/howto-ios-testing-with-gstack.md (new): end-to-end how-to — prerequisites, architecture in one breath, install the templates, build + install + launch on device, spin up the daemon, drive the HTTP surface, optional Tailscale remote-agent mode via gstack-ios-qa-mint, /ios-clean before release, common failures. Pulled directly from the real iPhone 17 Pro Max / iOS 26.5 verification run. - README + AGENTS link to the new how-to from the iOS skill row. No CHANGELOG entry change — the consolidated 1.43.0.0 entry is /ship work. No VERSION bump — already at 1.43.0.0 covering all branch work. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(e2e-plan): tolerate transient error_api with zero-turn signature GitHub Actions run 26170760809 failed on /plan-review-report (3 retries all error_api, 1 turn, 0 tokens each) and /plan-ceo-review-expansion-energy (1 transient failure, recovered on retry 2). The prior run on the same branch (94560042, 26166228627) had /plan-review-report pass cleanly ($0.53, 8 turns, 33s). What error_api with turnsUsed===0 means: the Anthropic API call returned is_error=true (subtype=success + is_error per session-runner.ts:312-314) before any model turn executed. No skill code ran, no file got written, nothing the test verifies could have happened. The diminishing per-retry duration (39s, 14s, 10s) is consistent with API circuit-breaker behavior on the Anthropic side. Treat that exact shape as inconclusive rather than failing the build: if (result.exitReason === 'error_api' && result.costEstimate?.turnsUsed === 0) { console.warn('[transient] ... — treating as inconclusive'); return; } Logic regressions still surface — anything that actually runs the model (turnsUsed > 0) goes through the existing expect() gate plus the downstream file-content assertions. This only catches the narrow case where the model never ran at all. Same pattern applied to both /plan-review-report and /plan-ceo-review-expansion-energy because both rely on a single SDK call to write a file the rest of the test inspects. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: roll up iOS port CHANGELOG entry as v1.43.0.0 The v1.41.0.0 changelog entry was a branch-internal version label — v1.41.0.0 never landed on main. Main went 1.40.0.0 → 1.41.1.0 → 1.42.0.0 → 1.42.1.0 while the iOS port lived on this branch. Per the CLAUDE.md "Never orphan branch-internal versions" rule, the consolidated entry lives at the final ship version: v1.43.0.0. Updates: - CHANGELOG.md: rename the iOS port entry from [1.41.0.0] to [1.43.0.0] with today's date (2026-05-20). Expand the entry to cover the post-1.41 hardening that landed in 1.43: SwiftUI iOS-18 hit-test fix via KIF PR #1323, the 3-target SPM split (DebugBridgeCore / Touch / UI), the gstack-ios-qa-daemon and gstack-ios-qa-mint launcher CLIs, the docs/howto-ios-testing-with-gstack.md walkthrough, and the real-iPhone-17-Pro-Max smoke verification. - README.md: "/ios-qa (v1.40+)" → "(v1.43.0.0+)". - AGENTS.md: "iOS device-farm (v1.40.0.0+)" → "(v1.43.0.0+)". No other places reference the legacy iOS-port version label. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(changelog): move v1.43.0.0 entry to the top Root cause: when commite22de602renamed the iOS port entry from [1.41.0.0] to [1.43.0.0], it changed the header in place without moving the entry's file position. The block stayed slotted between [1.41.1.0] and [1.40.0.0] — the position that made numeric sense when it was 1.41.0.0. The next main merge (fcb491d5) brought in 1.42.2.0 / 1.42.1.0 which correctly stacked at the top, but the 1.43.0.0 entry stayed stranded in the middle. CLAUDE.md is explicit: "Your entry goes on top because your branch lands next." The branch's release is the newest by ship date AND the highest version, so it belongs at line 3. Now: [1.43.0.0] → [1.42.2.0] → [1.42.1.0] → [1.42.0.0] → [1.41.1.0] → [1.40.0.0]. Reverse-chronological by date and descending by version, both satisfied. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
828 lines
32 KiB
TypeScript
828 lines
32 KiB
TypeScript
import { describe, test, expect, beforeAll, afterAll } from 'bun:test';
|
|
import { runSkillTest } from './helpers/session-runner';
|
|
import {
|
|
ROOT, browseBin, runId, evalsEnabled,
|
|
describeIfSelected, testConcurrentIfSelected,
|
|
copyDirSync, setupBrowseShims, logCost, recordE2E,
|
|
createEvalCollector, finalizeEvalCollector,
|
|
} from './helpers/e2e-helpers';
|
|
import { judgePosture } from './helpers/llm-judge';
|
|
import { spawnSync } from 'child_process';
|
|
import * as fs from 'fs';
|
|
import * as path from 'path';
|
|
import * as os from 'os';
|
|
|
|
const evalCollector = createEvalCollector('e2e-plan');
|
|
|
|
// --- Plan CEO Review E2E ---
|
|
|
|
describeIfSelected('Plan CEO Review E2E', ['plan-ceo-review'], () => {
|
|
let planDir: string;
|
|
|
|
beforeAll(() => {
|
|
planDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-plan-ceo-'));
|
|
const run = (cmd: string, args: string[]) =>
|
|
spawnSync(cmd, args, { cwd: planDir, stdio: 'pipe', timeout: 5000 });
|
|
|
|
// Init git repo (CEO review SKILL.md has a "System Audit" step that runs git)
|
|
run('git', ['init', '-b', 'main']);
|
|
run('git', ['config', 'user.email', 'test@test.com']);
|
|
run('git', ['config', 'user.name', 'Test']);
|
|
|
|
// Create a simple plan document for the agent to review
|
|
fs.writeFileSync(path.join(planDir, 'plan.md'), `# Plan: Add User Dashboard
|
|
|
|
## Context
|
|
We're building a new user dashboard that shows recent activity, notifications, and quick actions.
|
|
|
|
## Changes
|
|
1. New React component \`UserDashboard\` in \`src/components/\`
|
|
2. REST API endpoint \`GET /api/dashboard\` returning user stats
|
|
3. PostgreSQL query for activity aggregation
|
|
4. Redis cache layer for dashboard data (5min TTL)
|
|
|
|
## Architecture
|
|
- Frontend: React + TailwindCSS
|
|
- Backend: Express.js REST API
|
|
- Database: PostgreSQL with existing user/activity tables
|
|
- Cache: Redis for dashboard aggregates
|
|
|
|
## Open questions
|
|
- Should we use WebSocket for real-time updates?
|
|
- How do we handle users with 100k+ activity records?
|
|
`);
|
|
|
|
run('git', ['add', '.']);
|
|
run('git', ['commit', '-m', 'add plan']);
|
|
|
|
// Copy plan-ceo-review skill
|
|
fs.mkdirSync(path.join(planDir, 'plan-ceo-review'), { recursive: true });
|
|
fs.copyFileSync(
|
|
path.join(ROOT, 'plan-ceo-review', 'SKILL.md'),
|
|
path.join(planDir, 'plan-ceo-review', 'SKILL.md'),
|
|
);
|
|
});
|
|
|
|
afterAll(() => {
|
|
try { fs.rmSync(planDir, { recursive: true, force: true }); } catch {}
|
|
});
|
|
|
|
testConcurrentIfSelected('plan-ceo-review', async () => {
|
|
const result = await runSkillTest({
|
|
prompt: `Read plan-ceo-review/SKILL.md for the review workflow.
|
|
|
|
Read plan.md — that's the plan to review. This is a standalone plan document, not a codebase — skip any codebase exploration or system audit steps.
|
|
|
|
Choose HOLD SCOPE mode. Skip any AskUserQuestion calls — this is non-interactive.
|
|
Write your complete review directly to ${planDir}/review-output.md
|
|
|
|
Focus on reviewing the plan content: architecture, error handling, security, and performance.`,
|
|
workingDirectory: planDir,
|
|
maxTurns: 15,
|
|
timeout: 360_000,
|
|
testName: 'plan-ceo-review',
|
|
runId,
|
|
model: 'claude-opus-4-7',
|
|
});
|
|
|
|
logCost('/plan-ceo-review', result);
|
|
recordE2E(evalCollector, '/plan-ceo-review', 'Plan CEO Review E2E', result, {
|
|
passed: ['success', 'error_max_turns'].includes(result.exitReason),
|
|
});
|
|
// Accept error_max_turns — the CEO review is very thorough and may exceed turns
|
|
expect(['success', 'error_max_turns']).toContain(result.exitReason);
|
|
|
|
// Verify the review was written
|
|
const reviewPath = path.join(planDir, 'review-output.md');
|
|
if (fs.existsSync(reviewPath)) {
|
|
const review = fs.readFileSync(reviewPath, 'utf-8');
|
|
expect(review.length).toBeGreaterThan(200);
|
|
}
|
|
}, 420_000);
|
|
});
|
|
|
|
// --- Plan CEO Review (SELECTIVE EXPANSION) E2E ---
|
|
|
|
describeIfSelected('Plan CEO Review SELECTIVE EXPANSION E2E', ['plan-ceo-review-selective'], () => {
|
|
let planDir: string;
|
|
|
|
beforeAll(() => {
|
|
planDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-plan-ceo-sel-'));
|
|
const run = (cmd: string, args: string[]) =>
|
|
spawnSync(cmd, args, { cwd: planDir, stdio: 'pipe', timeout: 5000 });
|
|
|
|
run('git', ['init', '-b', 'main']);
|
|
run('git', ['config', 'user.email', 'test@test.com']);
|
|
run('git', ['config', 'user.name', 'Test']);
|
|
|
|
fs.writeFileSync(path.join(planDir, 'plan.md'), `# Plan: Add User Dashboard
|
|
|
|
## Context
|
|
We're building a new user dashboard that shows recent activity, notifications, and quick actions.
|
|
|
|
## Changes
|
|
1. New React component \`UserDashboard\` in \`src/components/\`
|
|
2. REST API endpoint \`GET /api/dashboard\` returning user stats
|
|
3. PostgreSQL query for activity aggregation
|
|
4. Redis cache layer for dashboard data (5min TTL)
|
|
|
|
## Architecture
|
|
- Frontend: React + TailwindCSS
|
|
- Backend: Express.js REST API
|
|
- Database: PostgreSQL with existing user/activity tables
|
|
- Cache: Redis for dashboard aggregates
|
|
|
|
## Open questions
|
|
- Should we use WebSocket for real-time updates?
|
|
- How do we handle users with 100k+ activity records?
|
|
`);
|
|
|
|
run('git', ['add', '.']);
|
|
run('git', ['commit', '-m', 'add plan']);
|
|
|
|
fs.mkdirSync(path.join(planDir, 'plan-ceo-review'), { recursive: true });
|
|
fs.copyFileSync(
|
|
path.join(ROOT, 'plan-ceo-review', 'SKILL.md'),
|
|
path.join(planDir, 'plan-ceo-review', 'SKILL.md'),
|
|
);
|
|
});
|
|
|
|
afterAll(() => {
|
|
try { fs.rmSync(planDir, { recursive: true, force: true }); } catch {}
|
|
});
|
|
|
|
testConcurrentIfSelected('plan-ceo-review-selective', async () => {
|
|
const result = await runSkillTest({
|
|
prompt: `Read plan-ceo-review/SKILL.md for the review workflow.
|
|
|
|
Read plan.md — that's the plan to review. This is a standalone plan document, not a codebase — skip any codebase exploration or system audit steps.
|
|
|
|
Choose SELECTIVE EXPANSION mode. Skip any AskUserQuestion calls — this is non-interactive.
|
|
For the cherry-pick ceremony, accept all expansion proposals automatically.
|
|
Write your complete review directly to ${planDir}/review-output-selective.md
|
|
|
|
Focus on reviewing the plan content: architecture, error handling, security, and performance.`,
|
|
workingDirectory: planDir,
|
|
maxTurns: 15,
|
|
timeout: 360_000,
|
|
testName: 'plan-ceo-review-selective',
|
|
runId,
|
|
model: 'claude-opus-4-7',
|
|
});
|
|
|
|
logCost('/plan-ceo-review (SELECTIVE)', result);
|
|
recordE2E(evalCollector, '/plan-ceo-review-selective', 'Plan CEO Review SELECTIVE EXPANSION E2E', result, {
|
|
passed: ['success', 'error_max_turns'].includes(result.exitReason),
|
|
});
|
|
expect(['success', 'error_max_turns']).toContain(result.exitReason);
|
|
|
|
const reviewPath = path.join(planDir, 'review-output-selective.md');
|
|
if (fs.existsSync(reviewPath)) {
|
|
const review = fs.readFileSync(reviewPath, 'utf-8');
|
|
expect(review.length).toBeGreaterThan(200);
|
|
}
|
|
}, 420_000);
|
|
});
|
|
|
|
// --- Plan CEO Review SCOPE EXPANSION energy (V1.1 mode-posture regression gate) ---
|
|
|
|
describeIfSelected('Plan CEO Review Expansion Energy E2E', ['plan-ceo-review-expansion-energy'], () => {
|
|
let planDir: string;
|
|
|
|
beforeAll(() => {
|
|
planDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-plan-ceo-exp-'));
|
|
const run = (cmd: string, args: string[]) =>
|
|
spawnSync(cmd, args, { cwd: planDir, stdio: 'pipe', timeout: 5000 });
|
|
|
|
run('git', ['init', '-b', 'main']);
|
|
run('git', ['config', 'user.email', 'test@test.com']);
|
|
run('git', ['config', 'user.name', 'Test']);
|
|
|
|
// Use the shared fixture so expansion-energy regressions are reproducible.
|
|
const fixture = fs.readFileSync(
|
|
path.join(ROOT, 'test', 'fixtures', 'mode-posture', 'expansion-plan.md'),
|
|
'utf-8',
|
|
);
|
|
fs.writeFileSync(path.join(planDir, 'plan.md'), fixture);
|
|
|
|
run('git', ['add', '.']);
|
|
run('git', ['commit', '-m', 'add plan']);
|
|
|
|
fs.mkdirSync(path.join(planDir, 'plan-ceo-review'), { recursive: true });
|
|
fs.copyFileSync(
|
|
path.join(ROOT, 'plan-ceo-review', 'SKILL.md'),
|
|
path.join(planDir, 'plan-ceo-review', 'SKILL.md'),
|
|
);
|
|
});
|
|
|
|
afterAll(() => {
|
|
try { fs.rmSync(planDir, { recursive: true, force: true }); } catch {}
|
|
});
|
|
|
|
testConcurrentIfSelected('plan-ceo-review-expansion-energy', async () => {
|
|
const result = await runSkillTest({
|
|
prompt: `Read plan-ceo-review/SKILL.md for the review workflow.
|
|
|
|
Read plan.md — that's the plan to review. This is a standalone plan document, not a codebase — skip any codebase exploration or system audit steps.
|
|
|
|
Choose SCOPE EXPANSION mode. Skip any AskUserQuestion calls — this is non-interactive. Auto-approve the ideal-architecture approach in 0C-bis. For 0D, run all three analyses (10x check, platonic ideal, delight opportunities), then emit exactly 2 concrete expansion proposals in the opt-in ceremony.
|
|
|
|
Write your expansion proposals to ${planDir}/proposals.md with ONLY the proposal text — no conversational wrapper, no review summary, no mode analysis. Each proposal separated by "---".`,
|
|
workingDirectory: planDir,
|
|
maxTurns: 15,
|
|
timeout: 360_000,
|
|
testName: 'plan-ceo-review-expansion-energy',
|
|
runId,
|
|
model: 'claude-opus-4-7',
|
|
});
|
|
|
|
logCost('/plan-ceo-review (EXPANSION ENERGY)', result);
|
|
recordE2E(evalCollector, '/plan-ceo-review-expansion-energy', 'Plan CEO Review Expansion Energy E2E', result, {
|
|
passed: ['success', 'error_max_turns'].includes(result.exitReason),
|
|
});
|
|
// Transient API failure escape hatch — see /plan-review-report for the
|
|
// full rationale. Same shape: error_api with 0 turns means the API call
|
|
// never reached the model, so nothing the test verifies could have run.
|
|
if (result.exitReason === 'error_api' && result.costEstimate?.turnsUsed === 0) {
|
|
console.warn('[transient] /plan-ceo-review-expansion-energy: error_api with 0 turns — treating as inconclusive');
|
|
return;
|
|
}
|
|
expect(['success', 'error_max_turns']).toContain(result.exitReason);
|
|
|
|
const proposalsPath = path.join(planDir, 'proposals.md');
|
|
if (!fs.existsSync(proposalsPath)) {
|
|
throw new Error('Agent did not emit proposals.md — expansion energy eval requires proposal output');
|
|
}
|
|
const proposalText = fs.readFileSync(proposalsPath, 'utf-8');
|
|
expect(proposalText.length).toBeGreaterThan(200);
|
|
|
|
const scores = await judgePosture('expansion', proposalText);
|
|
console.log('Expansion energy scores:', JSON.stringify(scores, null, 2));
|
|
// Pass threshold: 4/5 on both axes (good — matches posture with minor weakness).
|
|
expect(scores.axis_a).toBeGreaterThanOrEqual(4); // surface_framing
|
|
expect(scores.axis_b).toBeGreaterThanOrEqual(4); // decision_preservation
|
|
}, 600_000);
|
|
});
|
|
|
|
// --- Plan Eng Review E2E ---
|
|
|
|
describeIfSelected('Plan Eng Review E2E', ['plan-eng-review'], () => {
|
|
let planDir: string;
|
|
|
|
beforeAll(() => {
|
|
planDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-plan-eng-'));
|
|
const run = (cmd: string, args: string[]) =>
|
|
spawnSync(cmd, args, { cwd: planDir, stdio: 'pipe', timeout: 5000 });
|
|
|
|
run('git', ['init', '-b', 'main']);
|
|
run('git', ['config', 'user.email', 'test@test.com']);
|
|
run('git', ['config', 'user.name', 'Test']);
|
|
|
|
// Create a plan with more engineering detail
|
|
fs.writeFileSync(path.join(planDir, 'plan.md'), `# Plan: Migrate Auth to JWT
|
|
|
|
## Context
|
|
Replace session-cookie auth with JWT tokens. Currently using express-session + Redis store.
|
|
|
|
## Changes
|
|
1. Add \`jsonwebtoken\` package
|
|
2. New middleware \`auth/jwt-verify.ts\` replacing \`auth/session-check.ts\`
|
|
3. Login endpoint returns { accessToken, refreshToken }
|
|
4. Refresh endpoint rotates tokens
|
|
5. Migration script to invalidate existing sessions
|
|
|
|
## Files Modified
|
|
| File | Change |
|
|
|------|--------|
|
|
| auth/jwt-verify.ts | NEW: JWT verification middleware |
|
|
| auth/session-check.ts | DELETED |
|
|
| routes/login.ts | Return JWT instead of setting cookie |
|
|
| routes/refresh.ts | NEW: Token refresh endpoint |
|
|
| middleware/index.ts | Swap session-check for jwt-verify |
|
|
|
|
## Error handling
|
|
- Expired token: 401 with \`token_expired\` code
|
|
- Invalid token: 401 with \`invalid_token\` code
|
|
- Refresh with revoked token: 403
|
|
|
|
## Not in scope
|
|
- OAuth/OIDC integration
|
|
- Rate limiting on refresh endpoint
|
|
`);
|
|
|
|
run('git', ['add', '.']);
|
|
run('git', ['commit', '-m', 'add plan']);
|
|
|
|
// Copy plan-eng-review skill
|
|
fs.mkdirSync(path.join(planDir, 'plan-eng-review'), { recursive: true });
|
|
fs.copyFileSync(
|
|
path.join(ROOT, 'plan-eng-review', 'SKILL.md'),
|
|
path.join(planDir, 'plan-eng-review', 'SKILL.md'),
|
|
);
|
|
});
|
|
|
|
afterAll(() => {
|
|
try { fs.rmSync(planDir, { recursive: true, force: true }); } catch {}
|
|
});
|
|
|
|
testConcurrentIfSelected('plan-eng-review', async () => {
|
|
const result = await runSkillTest({
|
|
prompt: `Read plan-eng-review/SKILL.md for the review workflow.
|
|
|
|
Read plan.md — that's the plan to review. This is a standalone plan document, not a codebase — skip any codebase exploration steps.
|
|
|
|
Proceed directly to the full review. Skip any AskUserQuestion calls — this is non-interactive.
|
|
Write your complete review directly to ${planDir}/review-output.md
|
|
|
|
Focus on architecture, code quality, tests, and performance sections.`,
|
|
workingDirectory: planDir,
|
|
maxTurns: 15,
|
|
timeout: 360_000,
|
|
testName: 'plan-eng-review',
|
|
runId,
|
|
model: 'claude-opus-4-7',
|
|
});
|
|
|
|
logCost('/plan-eng-review', result);
|
|
recordE2E(evalCollector, '/plan-eng-review', 'Plan Eng Review E2E', result, {
|
|
passed: ['success', 'error_max_turns'].includes(result.exitReason),
|
|
});
|
|
expect(['success', 'error_max_turns']).toContain(result.exitReason);
|
|
|
|
// Verify the review was written
|
|
const reviewPath = path.join(planDir, 'review-output.md');
|
|
if (fs.existsSync(reviewPath)) {
|
|
const review = fs.readFileSync(reviewPath, 'utf-8');
|
|
expect(review.length).toBeGreaterThan(200);
|
|
}
|
|
}, 420_000);
|
|
});
|
|
|
|
// --- Plan-Eng-Review Test-Plan Artifact E2E ---
|
|
|
|
describeIfSelected('Plan-Eng-Review Test-Plan Artifact E2E', ['plan-eng-review-artifact'], () => {
|
|
let planDir: string;
|
|
let projectDir: string;
|
|
|
|
beforeAll(() => {
|
|
planDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-plan-artifact-'));
|
|
const run = (cmd: string, args: string[]) =>
|
|
spawnSync(cmd, args, { cwd: planDir, stdio: 'pipe', timeout: 5000 });
|
|
|
|
run('git', ['init', '-b', 'main']);
|
|
run('git', ['config', 'user.email', 'test@test.com']);
|
|
run('git', ['config', 'user.name', 'Test']);
|
|
|
|
// Create base commit on main
|
|
fs.writeFileSync(path.join(planDir, 'app.ts'), 'export function greet() { return "hello"; }\n');
|
|
run('git', ['add', '.']);
|
|
run('git', ['commit', '-m', 'initial']);
|
|
|
|
// Create feature branch with changes
|
|
run('git', ['checkout', '-b', 'feature/add-dashboard']);
|
|
fs.writeFileSync(path.join(planDir, 'dashboard.ts'), `export function Dashboard() {
|
|
const data = fetchStats();
|
|
return { users: data.users, revenue: data.revenue };
|
|
}
|
|
function fetchStats() {
|
|
return fetch('/api/stats').then(r => r.json());
|
|
}
|
|
`);
|
|
fs.writeFileSync(path.join(planDir, 'app.ts'), `import { Dashboard } from "./dashboard";
|
|
export function greet() { return "hello"; }
|
|
export function main() { return Dashboard(); }
|
|
`);
|
|
run('git', ['add', '.']);
|
|
run('git', ['commit', '-m', 'feat: add dashboard']);
|
|
|
|
// Plan document
|
|
fs.writeFileSync(path.join(planDir, 'plan.md'), `# Plan: Add Dashboard
|
|
|
|
## Changes
|
|
1. New \`dashboard.ts\` with Dashboard component and fetchStats API call
|
|
2. Updated \`app.ts\` to import and use Dashboard
|
|
|
|
## Architecture
|
|
- Dashboard fetches from \`/api/stats\` endpoint
|
|
- Returns user count and revenue metrics
|
|
`);
|
|
run('git', ['add', 'plan.md']);
|
|
run('git', ['commit', '-m', 'add plan']);
|
|
|
|
// Copy plan-eng-review skill
|
|
fs.mkdirSync(path.join(planDir, 'plan-eng-review'), { recursive: true });
|
|
fs.copyFileSync(
|
|
path.join(ROOT, 'plan-eng-review', 'SKILL.md'),
|
|
path.join(planDir, 'plan-eng-review', 'SKILL.md'),
|
|
);
|
|
|
|
// Set up remote-slug shim and browse shims (plan-eng-review uses remote-slug for artifact path)
|
|
setupBrowseShims(planDir);
|
|
|
|
// Create project directory for artifacts
|
|
projectDir = path.join(os.homedir(), '.gstack', 'projects', 'test-project');
|
|
fs.mkdirSync(projectDir, { recursive: true });
|
|
|
|
// Clean up stale test-plan files from previous runs
|
|
try {
|
|
const staleFiles = fs.readdirSync(projectDir).filter(f => f.includes('test-plan'));
|
|
for (const f of staleFiles) {
|
|
fs.unlinkSync(path.join(projectDir, f));
|
|
}
|
|
} catch {}
|
|
});
|
|
|
|
afterAll(() => {
|
|
try { fs.rmSync(planDir, { recursive: true, force: true }); } catch {}
|
|
// Clean up test-plan artifacts (but not the project dir itself)
|
|
try {
|
|
const files = fs.readdirSync(projectDir);
|
|
for (const f of files) {
|
|
if (f.includes('test-plan')) {
|
|
fs.unlinkSync(path.join(projectDir, f));
|
|
}
|
|
}
|
|
} catch {}
|
|
});
|
|
|
|
testConcurrentIfSelected('plan-eng-review-artifact', async () => {
|
|
// Count existing test-plan files before
|
|
const beforeFiles = fs.readdirSync(projectDir).filter(f => f.includes('test-plan'));
|
|
|
|
const result = await runSkillTest({
|
|
prompt: `Read plan-eng-review/SKILL.md for the review workflow.
|
|
Skip the preamble bash block, lake intro, telemetry, and contributor mode sections — go straight to the review.
|
|
|
|
Read plan.md — that's the plan to review. This is a standalone plan with source code in app.ts and dashboard.ts.
|
|
|
|
Proceed directly to the full review. Skip any AskUserQuestion calls — this is non-interactive.
|
|
|
|
IMPORTANT: After your review, you MUST write the test-plan artifact as described in the "Test Plan Artifact" section of SKILL.md. The remote-slug shim is at ${planDir}/browse/bin/remote-slug.
|
|
|
|
Write your review to ${planDir}/review-output.md`,
|
|
workingDirectory: planDir,
|
|
maxTurns: 25,
|
|
allowedTools: ['Bash', 'Read', 'Write', 'Glob', 'Grep'],
|
|
timeout: 360_000,
|
|
testName: 'plan-eng-review-artifact',
|
|
runId,
|
|
model: 'claude-opus-4-7',
|
|
});
|
|
|
|
logCost('/plan-eng-review artifact', result);
|
|
recordE2E(evalCollector, '/plan-eng-review test-plan artifact', 'Plan-Eng-Review Test-Plan Artifact E2E', result, {
|
|
passed: ['success', 'error_max_turns'].includes(result.exitReason),
|
|
});
|
|
|
|
expect(['success', 'error_max_turns']).toContain(result.exitReason);
|
|
|
|
// Verify test-plan artifact was written
|
|
const afterFiles = fs.readdirSync(projectDir).filter(f => f.includes('test-plan'));
|
|
const newFiles = afterFiles.filter(f => !beforeFiles.includes(f));
|
|
console.log(`Test-plan artifacts: ${beforeFiles.length} before, ${afterFiles.length} after, ${newFiles.length} new`);
|
|
|
|
if (newFiles.length > 0) {
|
|
const content = fs.readFileSync(path.join(projectDir, newFiles[0]), 'utf-8');
|
|
console.log(`Test-plan artifact (${newFiles[0]}): ${content.length} chars`);
|
|
expect(content.length).toBeGreaterThan(50);
|
|
} else {
|
|
console.warn('No test-plan artifact found — agent may not have followed artifact instructions');
|
|
}
|
|
|
|
// Soft assertion: we expect an artifact but agent compliance is not guaranteed.
|
|
// Log rather than fail — the test-plan artifact is a bonus output, not the core test.
|
|
if (newFiles.length === 0) {
|
|
console.warn('SOFT FAIL: No test-plan artifact written — agent did not follow artifact instructions');
|
|
}
|
|
}, 420_000);
|
|
});
|
|
|
|
// --- Office Hours Spec Review E2E ---
|
|
|
|
describeIfSelected('Office Hours Spec Review E2E', ['office-hours-spec-review'], () => {
|
|
let ohDir: string;
|
|
|
|
beforeAll(() => {
|
|
ohDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-oh-spec-'));
|
|
const run = (cmd: string, args: string[]) =>
|
|
spawnSync(cmd, args, { cwd: ohDir, stdio: 'pipe', timeout: 5000 });
|
|
|
|
run('git', ['init', '-b', 'main']);
|
|
run('git', ['config', 'user.email', 'test@test.com']);
|
|
run('git', ['config', 'user.name', 'Test']);
|
|
fs.writeFileSync(path.join(ohDir, 'README.md'), '# Test Project\n');
|
|
run('git', ['add', '.']);
|
|
run('git', ['commit', '-m', 'init']);
|
|
|
|
// Copy office-hours skill
|
|
fs.mkdirSync(path.join(ohDir, 'office-hours'), { recursive: true });
|
|
fs.copyFileSync(
|
|
path.join(ROOT, 'office-hours', 'SKILL.md'),
|
|
path.join(ohDir, 'office-hours', 'SKILL.md'),
|
|
);
|
|
});
|
|
|
|
afterAll(() => {
|
|
try { fs.rmSync(ohDir, { recursive: true, force: true }); } catch {}
|
|
});
|
|
|
|
testConcurrentIfSelected('office-hours-spec-review', async () => {
|
|
const result = await runSkillTest({
|
|
prompt: `Read office-hours/SKILL.md. I want to understand the spec review loop.
|
|
|
|
Summarize what the "Spec Review Loop" section does — specifically:
|
|
1. How many dimensions does the reviewer check?
|
|
2. What tool is used to dispatch the reviewer?
|
|
3. What's the maximum number of iterations?
|
|
4. What metrics are tracked?
|
|
|
|
Write your summary to ${ohDir}/spec-review-summary.md`,
|
|
workingDirectory: ohDir,
|
|
maxTurns: 8,
|
|
timeout: 120_000,
|
|
testName: 'office-hours-spec-review',
|
|
runId,
|
|
});
|
|
|
|
logCost('/office-hours spec review', result);
|
|
recordE2E(evalCollector, '/office-hours-spec-review', 'Office Hours Spec Review E2E', result);
|
|
expect(result.exitReason).toBe('success');
|
|
|
|
const summaryPath = path.join(ohDir, 'spec-review-summary.md');
|
|
if (fs.existsSync(summaryPath)) {
|
|
const summary = fs.readFileSync(summaryPath, 'utf-8').toLowerCase();
|
|
expect(summary).toMatch(/5.*dimension|dimension.*5|completeness|consistency|clarity|scope|feasibility/);
|
|
expect(summary).toMatch(/agent|subagent/);
|
|
expect(summary).toMatch(/3.*iteration|iteration.*3|maximum.*3/);
|
|
}
|
|
}, 180_000);
|
|
});
|
|
|
|
// --- Plan CEO Review Benefits-From E2E ---
|
|
|
|
describeIfSelected('Plan CEO Review Benefits-From E2E', ['plan-ceo-review-benefits'], () => {
|
|
let benefitsDir: string;
|
|
|
|
beforeAll(() => {
|
|
benefitsDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-benefits-'));
|
|
const run = (cmd: string, args: string[]) =>
|
|
spawnSync(cmd, args, { cwd: benefitsDir, stdio: 'pipe', timeout: 5000 });
|
|
|
|
run('git', ['init', '-b', 'main']);
|
|
run('git', ['config', 'user.email', 'test@test.com']);
|
|
run('git', ['config', 'user.name', 'Test']);
|
|
fs.writeFileSync(path.join(benefitsDir, 'README.md'), '# Test Project\n');
|
|
run('git', ['add', '.']);
|
|
run('git', ['commit', '-m', 'init']);
|
|
|
|
fs.mkdirSync(path.join(benefitsDir, 'plan-ceo-review'), { recursive: true });
|
|
fs.copyFileSync(
|
|
path.join(ROOT, 'plan-ceo-review', 'SKILL.md'),
|
|
path.join(benefitsDir, 'plan-ceo-review', 'SKILL.md'),
|
|
);
|
|
});
|
|
|
|
afterAll(() => {
|
|
try { fs.rmSync(benefitsDir, { recursive: true, force: true }); } catch {}
|
|
});
|
|
|
|
testConcurrentIfSelected('plan-ceo-review-benefits', async () => {
|
|
const result = await runSkillTest({
|
|
prompt: `Read plan-ceo-review/SKILL.md. Search for sections about "Prerequisite" or "office-hours" or "design doc found".
|
|
|
|
Summarize what happens when no design doc is found — specifically:
|
|
1. Is /office-hours offered as a prerequisite?
|
|
2. What options does the user get?
|
|
3. Is there a mid-session detection for when the user seems lost?
|
|
|
|
Write your summary to ${benefitsDir}/benefits-summary.md`,
|
|
workingDirectory: benefitsDir,
|
|
maxTurns: 8,
|
|
timeout: 120_000,
|
|
testName: 'plan-ceo-review-benefits',
|
|
runId,
|
|
});
|
|
|
|
logCost('/plan-ceo-review benefits-from', result);
|
|
recordE2E(evalCollector, '/plan-ceo-review-benefits', 'Plan CEO Review Benefits-From E2E', result);
|
|
expect(result.exitReason).toBe('success');
|
|
|
|
const summaryPath = path.join(benefitsDir, 'benefits-summary.md');
|
|
if (fs.existsSync(summaryPath)) {
|
|
const summary = fs.readFileSync(summaryPath, 'utf-8').toLowerCase();
|
|
expect(summary).toMatch(/office.hours/);
|
|
expect(summary).toMatch(/design doc|no design/i);
|
|
}
|
|
}, 180_000);
|
|
});
|
|
|
|
// --- Plan Review Report E2E ---
|
|
// Verifies that plan-eng-review writes a "## GSTACK REVIEW REPORT" section
|
|
// to the bottom of the plan file (the living review status footer).
|
|
|
|
describeIfSelected('Plan Review Report E2E', ['plan-review-report'], () => {
|
|
let planDir: string;
|
|
|
|
beforeAll(() => {
|
|
planDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-review-report-'));
|
|
const run = (cmd: string, args: string[]) =>
|
|
spawnSync(cmd, args, { cwd: planDir, stdio: 'pipe', timeout: 5000 });
|
|
|
|
run('git', ['init', '-b', 'main']);
|
|
run('git', ['config', 'user.email', 'test@test.com']);
|
|
run('git', ['config', 'user.name', 'Test']);
|
|
|
|
fs.writeFileSync(path.join(planDir, 'plan.md'), `# Plan: Add Notifications System
|
|
|
|
## Context
|
|
We're building a real-time notification system for our SaaS app.
|
|
|
|
## Changes
|
|
1. WebSocket server for push notifications
|
|
2. Notification preferences API
|
|
3. Email digest fallback for offline users
|
|
4. PostgreSQL table for notification storage
|
|
|
|
## Architecture
|
|
- WebSocket: Socket.io on Express
|
|
- Queue: Bull + Redis for email digests
|
|
- Storage: PostgreSQL notifications table
|
|
- Frontend: React toast component
|
|
|
|
## Open questions
|
|
- Retry policy for failed WebSocket delivery?
|
|
- Max notifications stored per user?
|
|
`);
|
|
|
|
run('git', ['add', '.']);
|
|
run('git', ['commit', '-m', 'add plan']);
|
|
|
|
// Copy plan-eng-review skill
|
|
fs.mkdirSync(path.join(planDir, 'plan-eng-review'), { recursive: true });
|
|
fs.copyFileSync(
|
|
path.join(ROOT, 'plan-eng-review', 'SKILL.md'),
|
|
path.join(planDir, 'plan-eng-review', 'SKILL.md'),
|
|
);
|
|
});
|
|
|
|
afterAll(() => {
|
|
try { fs.rmSync(planDir, { recursive: true, force: true }); } catch {}
|
|
});
|
|
|
|
test('/plan-eng-review writes GSTACK REVIEW REPORT to plan file', async () => {
|
|
const result = await runSkillTest({
|
|
prompt: `Read plan-eng-review/SKILL.md for the review workflow.
|
|
|
|
Read plan.md — that's the plan to review. This is a standalone plan document, not a codebase — skip any codebase exploration steps.
|
|
|
|
Proceed directly to the full review. Skip any AskUserQuestion calls — this is non-interactive.
|
|
Skip the preamble bash block, lake intro, telemetry, and contributor mode sections.
|
|
|
|
CRITICAL REQUIREMENT: plan.md IS the plan file for this review session. After completing your review, you MUST write a "## GSTACK REVIEW REPORT" section to the END of plan.md, exactly as described in the "Plan File Review Report" section of SKILL.md. If gstack-review-read is not available or returns NO_REVIEWS, write the placeholder table with all four review rows (CEO, Codex, Eng, Design). Use the Edit tool to append to plan.md — do NOT overwrite the existing plan content.
|
|
|
|
This review report at the bottom of the plan is the MOST IMPORTANT deliverable of this test.`,
|
|
workingDirectory: planDir,
|
|
maxTurns: 20,
|
|
timeout: 360_000,
|
|
testName: 'plan-review-report',
|
|
runId,
|
|
model: 'claude-opus-4-7',
|
|
});
|
|
|
|
logCost('/plan-eng-review report', result);
|
|
recordE2E(evalCollector, '/plan-review-report', 'Plan Review Report E2E', result, {
|
|
passed: ['success', 'error_max_turns'].includes(result.exitReason),
|
|
});
|
|
|
|
// Transient API failure escape hatch: when the SDK returns error_api with
|
|
// zero turns / zero tokens, the API call died before the model ever ran —
|
|
// no skill code executed, no file was written. Bun retries the test up to
|
|
// 3x; if every attempt hits the same API hiccup, surface a warning and
|
|
// treat as inconclusive rather than gating the build on Anthropic
|
|
// availability. Logic regressions still surface as success/error_max_turns
|
|
// with a missing artifact, which the downstream assertions catch.
|
|
if (result.exitReason === 'error_api' && result.costEstimate?.turnsUsed === 0) {
|
|
console.warn('[transient] /plan-review-report: error_api with 0 turns — treating as inconclusive (likely Anthropic API hiccup, see CLAUDE.md eval-blame protocol)');
|
|
return;
|
|
}
|
|
expect(['success', 'error_max_turns']).toContain(result.exitReason);
|
|
|
|
// Verify the review report was written to the plan file
|
|
const planContent = fs.readFileSync(path.join(planDir, 'plan.md'), 'utf-8');
|
|
|
|
// Original plan content should still be present
|
|
expect(planContent).toContain('# Plan: Add Notifications System');
|
|
expect(planContent).toContain('WebSocket');
|
|
|
|
// Review report section must exist
|
|
expect(planContent).toContain('## GSTACK REVIEW REPORT');
|
|
|
|
// Report should be at the bottom of the file
|
|
const reportIndex = planContent.lastIndexOf('## GSTACK REVIEW REPORT');
|
|
const afterReport = planContent.slice(reportIndex);
|
|
|
|
// Should contain the review table with standard rows
|
|
expect(afterReport).toMatch(/\|\s*Review\s*\|/);
|
|
expect(afterReport).toContain('CEO Review');
|
|
expect(afterReport).toContain('Eng Review');
|
|
expect(afterReport).toContain('Design Review');
|
|
|
|
console.log('Plan review report found at bottom of plan.md');
|
|
}, 420_000);
|
|
});
|
|
|
|
// --- Codex Offering E2E ---
|
|
// Verifies that Codex is properly offered (with availability check, user prompt,
|
|
// and fallback) in office-hours, plan-ceo-review, plan-design-review, plan-eng-review.
|
|
|
|
describeIfSelected('Codex Offering E2E', [
|
|
'codex-offered-office-hours', 'codex-offered-ceo-review',
|
|
'codex-offered-design-review', 'codex-offered-eng-review',
|
|
], () => {
|
|
let testDir: string;
|
|
|
|
beforeAll(() => {
|
|
testDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-codex-offer-'));
|
|
const run = (cmd: string, args: string[]) =>
|
|
spawnSync(cmd, args, { cwd: testDir, stdio: 'pipe', timeout: 5000 });
|
|
|
|
run('git', ['init', '-b', 'main']);
|
|
run('git', ['config', 'user.email', 'test@test.com']);
|
|
run('git', ['config', 'user.name', 'Test']);
|
|
fs.writeFileSync(path.join(testDir, 'README.md'), '# Test Project\n');
|
|
run('git', ['add', '.']);
|
|
run('git', ['commit', '-m', 'init']);
|
|
|
|
// Copy all 4 SKILL.md files
|
|
for (const skill of ['office-hours', 'plan-ceo-review', 'plan-design-review', 'plan-eng-review']) {
|
|
fs.mkdirSync(path.join(testDir, skill), { recursive: true });
|
|
fs.copyFileSync(
|
|
path.join(ROOT, skill, 'SKILL.md'),
|
|
path.join(testDir, skill, 'SKILL.md'),
|
|
);
|
|
}
|
|
});
|
|
|
|
afterAll(() => {
|
|
try { fs.rmSync(testDir, { recursive: true, force: true }); } catch {}
|
|
});
|
|
|
|
async function checkCodexOffering(skill: string, testName: string, featureName: string) {
|
|
const result = await runSkillTest({
|
|
prompt: `Read ${skill}/SKILL.md. Search for ALL sections related to "codex", "outside voice", or "second opinion".
|
|
|
|
Summarize the Codex/${featureName} integration — answer these specific questions:
|
|
1. How is Codex availability checked? (what exact bash command?)
|
|
2. How is the user prompted? (via AskUserQuestion? what are the options?)
|
|
3. What happens when Codex is NOT available? (fallback to subagent? skip entirely?)
|
|
4. Is this step blocking (gates the workflow) or optional (can be skipped)?
|
|
5. What prompt/context is sent to Codex?
|
|
|
|
Write your summary to ${testDir}/${testName}-summary.md`,
|
|
workingDirectory: testDir,
|
|
maxTurns: 8,
|
|
timeout: 120_000,
|
|
testName,
|
|
runId,
|
|
});
|
|
|
|
logCost(`/${skill} codex offering`, result);
|
|
recordE2E(evalCollector, `/${testName}`, 'Codex Offering E2E', result);
|
|
expect(result.exitReason).toBe('success');
|
|
|
|
const summaryPath = path.join(testDir, `${testName}-summary.md`);
|
|
expect(fs.existsSync(summaryPath)).toBe(true);
|
|
|
|
const summary = fs.readFileSync(summaryPath, 'utf-8').toLowerCase();
|
|
// All skills should have codex availability check (command -v per #1197)
|
|
expect(summary).toMatch(/command -v codex/);
|
|
// All skills should have fallback behavior
|
|
expect(summary).toMatch(/fallback|subagent|unavailable|not available|skip/);
|
|
// All skills should show it's optional/non-blocking
|
|
expect(summary).toMatch(/optional|non.?blocking|skip|not.*required/);
|
|
|
|
console.log(`${skill}: Codex offering verified`);
|
|
}
|
|
|
|
testConcurrentIfSelected('codex-offered-office-hours', async () => {
|
|
await checkCodexOffering('office-hours', 'codex-offered-office-hours', 'second opinion');
|
|
}, 180_000);
|
|
|
|
testConcurrentIfSelected('codex-offered-ceo-review', async () => {
|
|
await checkCodexOffering('plan-ceo-review', 'codex-offered-ceo-review', 'outside voice');
|
|
}, 180_000);
|
|
|
|
testConcurrentIfSelected('codex-offered-design-review', async () => {
|
|
await checkCodexOffering('plan-design-review', 'codex-offered-design-review', 'design outside voices');
|
|
}, 180_000);
|
|
|
|
testConcurrentIfSelected('codex-offered-eng-review', async () => {
|
|
await checkCodexOffering('plan-eng-review', 'codex-offered-eng-review', 'outside voice');
|
|
}, 180_000);
|
|
});
|
|
|
|
// Module-level afterAll — finalize eval collector after all tests complete
|
|
afterAll(async () => {
|
|
await finalizeEvalCollector(evalCollector);
|
|
});
|