mirror of
https://github.com/garrytan/gstack.git
synced 2026-06-22 17:49:57 +02:00
9fd03fae9e
* fix(gbrain): stop forcing GBRAIN_PREPARE on transaction-mode poolers (#1965) buildGbrainEnv auto-set GBRAIN_PREPARE=true whenever DATABASE_URL targeted port 6543, and the /sync-gbrain capability check exported it for the rest of the skill run. Both had the semantics inverted: gbrain auto-disables prepared statements on transaction-mode poolers because they break every write there ("prepared statement does not exist"); GBRAIN_PREPARE=true is gbrain's documented override for SESSION-mode poolers on 6543, not a requirement for transaction mode. The #1435 search symptom the auto-set worked around was fixed gbrain-side. Remove both force-sets. A caller-set GBRAIN_PREPARE (either value) still passes through untouched, preserving the session-mode-on-6543 escape hatch. isTransactionModePooler stays exported. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(gbrain): classify probe timeout as its own status; sync proceeds instead of skipping (#1964) The 5s engine probe misclassified healthy-but-slow engines (cold Supabase pooler connections measured at 6.9-10.7s) as broken-config, so /sync-gbrain silently skipped code+memory and told the user their config was malformed. - New "timeout" status: probe killed at the deadline with no recognized stderr pattern. Default deadline is now 15s, overridable via GSTACK_GBRAIN_PROBE_TIMEOUT_MS (tests set 300ms against a fake that sleeps 2s). - Sync stages PROCEED on timeout with a stderr warning naming the env knob; a genuinely-dead engine surfaces its real error at the first operation instead of a false config diagnosis. - Consistency everywhere "ok" gated behavior: gstack-gbrain-detect --is-ok exits 0 on timeout, and gen-skill-docs' detection gate accepts it, so a slow engine no longer silently suppresses brain-aware features. 
- Status cache: key now includes the effective probe timeout (raising it invalidates a cached timeout) and GBRAIN_HOME; config detection honors GBRAIN_HOME so relocated-home users stop being misclassified as missing-config. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(bins): cygpath-normalize SCRIPT_DIR for bun imports; surface learnings-log errors (#1950) Under Windows git-bash, pwd yields a POSIX path (/c/Users/...) that Bun on Windows cannot resolve as an ES module specifier. gstack-learnings-log interpolates SCRIPT_DIR into a bun -e import, so every invocation died with "Cannot find module" — and 2>/dev/null swallowed the error, silently dropping every AI-logged learning for Windows users. - 3-line cygpath -m guard in gstack-learnings-log and gstack-question-log (which gains the same import shape in the next commit). Matches the duplicated IS_WINDOWS convention in setup; no shared shell lib exists. - learnings-log adopts question-log's set +e / TMPERR capture pattern wholesale: validation errors now print to stderr. The old `if [ $? -ne 0 ]` check was dead code under set -euo pipefail — the script exited at the failing assignment before reaching it. - New test/bin-windows-bun-import-paths.test.ts: static invariant (any bash bin interpolating $SCRIPT_DIR into a bun -e import must carry the guard) + behavioral end-to-end run invoked via `bash <bin>` — added to the windows-free-tests workflow list so the conversion is proven on the only platform where the bug exists. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(question-log): dedupe INJECTION_PATTERNS via lib/jsonl-store (#1934) bin/gstack-question-log carried a local copy of the injection-pattern list, so pattern fixes to lib/jsonl-store.ts never propagated — including the /override[:\s]/i false-positive fix arriving via community PR #1940. Import the shared hasInjection instead (enabled by the previous commit's cygpath guard). 
question-log also gets the lib's stricter superset (human:, disregard, from-now-on, approve-all patterns). Tests pin the contract in a #1940-order-independent way: an "Override: ignore all previous instructions" header is rejected, "prose overrides the deterministic table" is accepted, and a static invariant keeps local INJECTION_PATTERNS duplicates out of the bin. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(security): community-pulse + both dashboards never report fake zeros (#1947) The security-signaling surface failed open at three layers — every failure mode read as a reassuring "0 attacks" / "0 installs": - community-pulse edge function: supabase-js returns {data,error} without throwing, and all five queries discarded `error` — a DB outage produced real-looking zeros via the SUCCESS path, and the catch (also returning zeros with HTTP 200) was unreachable for query failures. Every query now destructures and throws; the catch serves the stale cache (marked "stale": true) when one exists, else 503 {"error":"pulse_unavailable"}. Success responses carry "status":"ok" so clients can distinguish authoritative data from legacy backends. NOTE: the edge function deploys out-of-band (supabase functions deploy community-pulse). - gstack-security-dashboard: captures the HTTP status; non-200 / network failure / error body / missing section → "unknown — backend error"; jq missing → "unknown — install jq" (the lossy grep fallback broke on nested arrays and under-reported attacks as zero — removed); a 200 without the new marker shows figures with an "unverified (legacy backend)" note. Also fixes a latent display bug: the TOTAL grep matched the digit 7 inside "attacks_last_7_days" and misreported every count. - gstack-community-dashboard: same class — curl || echo "{}" plus grep || echo "0" printed "Weekly active installs: 0" on any failure. Now "unknown — backend error (HTTP N)". 
test/security-dashboard-fallback.test.ts pins the matrix (200+marker, 200-legacy, 503, network failure) x (jq present, jq absent) for both bins: "unknown" states never render as 0. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(telemetry): redact error_message spans before they leave the machine (#1947) error_message was uploaded with only quote/newline escaping — stack traces and failed-API errors can embed credentials, private paths, and hostnames, and the sync path strips only _repo_slug/_branch. New lib/redact-engine.ts export redactFindingSpans(): replaces EVERY finding's span with <REDACTED-{id}> regardless of tier (applyRedactions is the interactive PII-only path and exits nonzero on credential findings, so it can't serve machine egress). Returns null when a span can't be located — callers drop the whole payload rather than risk a leak. gstack-telemetry-log pipes error_message through it at LOG time, so the local JSONL at rest is clean too; surrounding text survives for crash triage. FAIL CLOSED: bun missing, engine error, or non-JSON-string output all null the field. Tests pin: embedded ghp_ token → <REDACTED-github.pat> with context intact; redactor unavailable → null; raw bytes on disk never contain the token. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(redact): prepush guard fails closed on git failure; /ship owns hook install (#1946) Two gaps closed: 1. Fail closed. The git() helper returned "" on ANY non-zero exit or maxBuffer overflow (status null), addedLinesFor produced an empty string, and the push sailed through unscanned — fail-open on exactly the oversized-diff case where a large secret-bearing blob is most likely. The diff call now uses a strict variant that throws; main blocks with a clear message naming the GSTACK_REDACT_PREPUSH=skip escape valve. Probe calls (symbolic-ref, rev-parse, merge-base) keep the permissive helper — their failures are normal control flow. 2. Install path. 
The hook was installed by nothing ("opt-in, installed by nothing" was the issue's words). ./setup runs in the gstack checkout — the wrong repo for a per-project hook — so it gets a one-line hint only. /ship owns per-repo install: config redact_prepush_hook=true + hook missing → silent install (consent already given); config unset + no ~/.gstack/.redact-prepush-prompted marker → one-time machine-wide AskUserQuestion offer, answer persisted. ship/SKILL.md regenerated in this same commit (check-freshness bisect discipline). Tests: unscannable diff (bogus SHAs) → exit 1 + valve named; empty-but- successful diff → exit 0; static asserts pin setup as hint-only and the ship template as the installer surface. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * feat(redact): six new credential patterns — GitLab, HuggingFace, npm, DigitalOcean, Bearer, GCP SA (#1946) Coverage gaps from the #1946 security review, including token types for tooling gstack itself drives (glab): HIGH (block): gitlab.token (glpat-/glptt-/gldt-), huggingface.token (hf_), npm.token (npm_), digitalocean.token (dop_v1_), gcp.service_account (the JSON-escaped "private_key" form that dodges pem.private_key's literal-block match when minified, confirmed by "private_key_id" proximity). MEDIUM (warn): auth.bearer — the most FP-prone shape in the set (docs are full of "Authorization: Bearer <token>"), so it requires header-context proximity and the same entropy>=3.0 + placeholder validator recipe as env.kv. "Bearer YOUR_TOKEN_HERE" never fires; calibration over coverage, per the cries-wolf principle. All shapes are linear-time; test/redact-pattern-lint.test.ts covers them automatically. Engine tests add positive + placeholder-negative cases per pattern. 
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * test: coverage-audit additions for the fix wave Ship Step 7 gap-fill (all passing, 248 tests across the touched suites): memory + dream stage probe-timeout proceeds, gbrain-detect override paths, stale-flag passthrough, 200-body-missing-.security fail-closed case, telemetry redaction edges, and credential-pattern edge cases. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix: pre-landing review fixes Review army findings (1 critical, auto-fixed with regression tests): - CRITICAL (security specialist, verified live): redactFindingSpans spliced only the regex capture span, and pem.private_key / gcp.service_account capture just the BEGIN-header — the key body survived "redaction" and shipped via telemetry. Marker-only patterns now drop the whole payload (null, fail closed). Overlapping spans (Bearer+JWT on the same bytes) are coalesced before splicing so stale offsets can't leave partial secret bytes behind. - gitStrict: drop the dead `|| r.status === null` disjunct (null !== 0 already covers it); add the signal-kill/null-status regression test the docstring promised. - security-dashboard human mode flags stale snapshots ("figures may be out of date") instead of presenting frozen counts as current. - community-dashboard marker check uses jq when available — the grep-only variant misclassified whitespaced/reserialized bodies as legacy. - telemetry fail-closed test now shadows bun with a failing stub (deterministic on any host layout); stale "five status cases" describe title renamed. 
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix: adversarial review fixes (Claude + Codex cross-model passes) Both adversarial passes ran against the wave; every FIXABLE finding landed with a regression test: - probeTimeoutMs clamps to >=1ms: a fractional override floored to 0, and execFileSync treats timeout:0 as NO timeout — the probe that exists to bound hangs could hang forever (found by both models independently). - /ship silent hook install now requires the hooks dir to live inside .git: with core.hooksPath (husky's COMMITTED .husky/), the chaining installer would have renamed the team's committed pre-push and written a machine-local wrapper into the working tree (found by both models). - gstack-config gbrain-refresh accepts the "timeout" status — the last consumer still gating on literal "ok" (Codex); gstack-gbrain-detect's config-derived fields honor GBRAIN_HOME so the detection JSON can't report status ok alongside config_exists false (Codex). - prepush: a remote sha absent locally (shallow clone / stale fetch) falls back to the merge-base/empty-tree range — scans MORE, never blocks a legitimate push into training users toward --no-verify. - dashboards: curl's own 000 no longer doubles to "HTTP 000000"; the community dashboard flags stale snapshots like the security one; array sections parse via jq (the sed/grep loops truncated at the first ']'); the no-jq marker grep tolerates whitespace. - telemetry: multi-line redactor output nulls the field instead of corrupting the JSONL record; setup's hint fires only when the config key is genuinely unset (an explicit false is a recorded decline); the /ship prompt marker honors GSTACK_HOME. Kept as designed (cross-model tension noted): Bearer stays MEDIUM in the prepush gate — a HIGH Bearer would block every docs example; the entropy validator can't eliminate that FP class, and MEDIUM warns visibly. 
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * chore: bump version and changelog (v1.57.11.0) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * docs: P1 TODO — eval harness live progress + incremental persistence Root-caused during this ship: a killed eval run was indistinguishable from a healthy one for hours (per-file output buffering across mega test files, no incremental eval-store writes, no honest liveness signal). Full context and starting points in the entry. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * test: fix operational-learning E2E fixture — copy lib/jsonl-store.ts Pre-existing breakage, proven on main: gstack-learnings-log has imported lib/jsonl-store.ts (shared injection patterns) since v1.57.5.0 / #1910, but the fixture copies only the bin scripts — the bin exits 1 before writing anything, on main silently (stderr swallowed) and on this branch loudly (the #1950 error-surfacing made the four-day-old failure visible). A real install always ships bin/ and lib/ together; the fixture now does too. Verified: the fixture-shaped invocation writes the learning (exit 0) with lib present, exits 1 on both main and this branch without it. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(ios-qa): isolate E2E tests under --concurrent (3 real races) The ios-qa E2E file failed intermittently under `bun test --concurrent` (the eval harness default). Three distinct shared-state races, all fixed: 1. Shared pidfile: a module-level `workDir` reassigned in beforeEach was clobbered by parallel tests, so concurrent daemons collided on the same pidfile and the loser returned `already_running`. Each test now gets its own dir via makeWorkDir(). 2. process.env path globals: tests set GSTACK_IOS_AUDIT_PATH / _ATTEMPTS_PATH / _ALLOWLIST_PATH on the shared process env; concurrent tests stomped each other's audit/attempts destinations. 
Threaded auditPath/attemptsPath/allowlistPath through DaemonOptions (and mintForCaller) as explicit args — env is no longer load-bearing. 3. afterEach cleanup race: the per-test cleanup drained a shared dir array, so the first test to finish deleted still-running tests' workDirs mid-assertion. Moved to afterAll (cleans once, after all settle). Verified: 5/5 clean full-suite runs at --max-concurrency 15 (was intermittent); daemon unit suite 91/91; daemon source compiles. The paths default to the env-derived locations when options are omitted, so the production CLI path is unchanged. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * test(pty): pin spawned claude to EVALS model chain (default claude-sonnet-4-6) launchClaudePty spawned the interactive `claude` TUI with no --model flag, so the child inherited the operator's ~/.claude/settings.json model. On a slow-thinking model that meant 5+ min of extended thinking on empty plan-mode context, timing out the plan-mode smoke tests regardless of contention. Pin the model via opts.model ?? EVALS_MODEL ?? 'claude-sonnet-4-6' — byte-identical to session-runner.ts:144, so PTY and `claude -p` evals always agree. Pushed before extraArgs (last flag wins, so a per-test --model still overrides). Placement leaves the spawn region byte-stable for a clean merge with the in-flight hermetic-env branch. Plumbed model through the three plan-skill wrappers. Static-grep tripwires guard the pin, its fallback chain, the before-extraArgs ordering, and all three wrapper forwards. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(pty): detect markdown bold-bullet prose AUQs (fixes office-hours smoke) office-hours auto-mode renders its mode question as `- **Building a startup**` markdown bullets (office-hours/SKILL.md.tmpl:102) with no letter/number marker. 
isProseAUQVisible only matched `A)`-style lettered or `1.`-style numbered options, so the question went undetected: the model surfaced it at ~2m19s (well under the 300s budget) but the harness kept scoring the run "working" off the spinner glyphs and timed out — a false timeout on a question that was already on screen. Add Pattern 3: when an interrogative line ('?') is present AND 3+ bold-bullet markers (`- **`) appear in the 4KB tail, classify as a prose AUQ. Bold is the discriminator vs incidental prose bullets; the line anchor is dropped (stripAnsi can collapse option lines) and the existing `❯ 1.` cursor gate still defers to a live native list. Wires through the existing classifyVisible 'asked' path and the timeout high-water-mark, so office-hours now classifies 'asked' instead of 'timeout'. Five unit cases: the office-hours render passes; no-'?', <3-bullet, plain-bullet, and native-cursor cases stay false. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(pty): detect stripAnsi-collapsed prose AUQs + judge spinner-precedence The plan-eng/plan-design plan-mode + finding-floor smokes timed out even when the skill HAD rendered a complete prose AskUserQuestion and was waiting: the PTY strips cursor-positioning escapes, collapsing the option newlines/spaces so "A) ..." arrives as "A(recommended)" / "-B:" and "Reply with A, B, or C" as "ReplywithA,B,orC". Every line-anchored detector (Patterns 1-3) returns false on those bytes, so proseAUQEverObserved never latched and the run timed out on a question that was already on screen. Add Pattern 4/5: a two-signal collapsed-form detector — a reply/recommendation marker (space-insensitive "reply with [A-D]", "Recommendation:", or "(recommended)") AND 2+ distinct A-D letters each punctuated by ) : or (. The conjunction is what separates a real AUQ from incidental report prose; verified true on the verbatim failing-run buffers where Patterns 1-3 return false. 
Also fix the Haiku judge spinner bias: of 614 verdicts, 569 were 'working' and 95 of those noted a question was visible — Claude Code keeps the spinner animating at an idle prose decision, so the judge coin-flipped. Add a precedence override: when an option list AND a Recommendation/Reply instruction are both visible, classify WAITING even with spinner glyphs. Kept the strict dual-signal gate (never option-list-alone) so auto-decide-preserved doesn't flip. 5 unit tests pin the two-signal contract (2 true on real collapsed bytes, 3 false guards). 90 -> 95 pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(plan-review): ask-first scope gate for plan-eng + plan-design review On an empty/cold invocation, plan-eng-review and plan-design-review would dive straight into repo exploration (plan-eng) or a 7-pass mockup+audit (plan-design) and only ask the user much later, if at all. plan-ceo-review already asks first via an unconditional Step-0 gate and behaves well; these two did not. Add a hard-STOP scope gate as the FIRST operational instruction in each skill (above the design-doc check / pre-review audit / mockup defaults it explicitly overrides): the first tool call must be AskUserQuestion confirming the review target, before any git/Read/Grep/Glob/Bash or mockup generation. Under --disallowedTools the options render as plain column-0 lettered prose with a Recommendation + "Reply with A, B, or C" line so the answer is detectable. This is correct cold-start UX (confirm what to review before grinding a full review on nothing) and it is the product half of the plan-mode smoke fix; the harness collapsed-form detector is the deterministic half that catches the ask however it renders. Templates + regenerated SKILL.md (default variant). 
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(tiers): reclassify stochastic plan-eng/plan-design ask-first smokes as periodic plan-eng-review and plan-design-review run a long explore/audit before their first AskUserQuestion, so whether the plan-mode + finding-floor smokes reach a terminal outcome within the 300s/600s budget depends on stochastic ask-first compliance (measured ~50-67%/run even with the hardened gate). Per the "non-deterministic -> periodic" tiering rule, move the four affected smokes (plan-eng/plan-design review-plan-mode + finding-floor) to periodic. The deterministic harness fix (collapsed-form detector + judge precedence) and the ask-first gate lift these from always-failing to mostly-passing and are the real product+harness improvements; periodic monitoring tracks the rate weekly without blocking PRs on an LLM coin-flip. plan-ceo/plan-devex ask-first reliably and stay gate-tier. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * ci(evals): gate the deterministic PTY plan-mode smokes in CI The real-PTY plan-mode smokes never ran in CI — the gate was local-only. Add an e2e-pty-plan-smoke matrix suite running the two deterministically-reliable ones (office-hours-auto-mode, plan-mode-no-op) so a regression there blocks PRs. The stochastic plan-eng/plan-design ask-first smokes stay periodic (touchfiles E2E_TIERS) and are not CI-gated. A fresh CI container has no ~/.claude.json, so the spawned interactive `claude` would wedge on the onboarding + API-key-approval dialog. Add a scoped seed step (hasCompletedOnboarding + key approval, its own ANTHROPIC_API_KEY env) before the run — mirrors what the hermetic E2E child env seeds. Per-suite timeout override (35 min) via matrix.suite.timeout so the PTY suite has headroom for --retry 2 without bumping the other 12 suites. Report runner count 12 -> 13. Validate via workflow_dispatch before relying on the gate (PTY-in-CI is new). 
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * ci(evals): install gstack skill registry for the PTY smoke suite The first dry-run of e2e-pty-plan-smoke failed: the spawned interactive `claude` printed "Unknown command: /plan-ceo-review". .claude/skills is gitignored, so a fresh CI checkout has no gstack skill registry and the TUI can't resolve /office-hours or /plan-ceo-review. Add a Register step (scoped to the suite, after Seed, before Run) that mirrors setup's --no-prefix user-scoped registry minimally: $HOME/.claude/skills/gstack -> repo (resolves the preambles' absolute ~/.claude/skills/gstack/bin/* and <skill>/sections/* paths) + per-skill SKILL.md/sections symlinks for the two skills these tests invoke. HOME is /github/home in this container and the runner adds no HOME/CLAUDE_CONFIG_DIR override (no hermetic mode), so $HOME is the right anchor — the Seed step already proved claude reads it. No ./setup (binary build + Chromium + fonts + /dev/tty prompt); SKILL.md + bin/ + sections/ are committed. Self-validating: fails the step loudly on a dangling symlink or missing `name:` frontmatter, so a moved target surfaces here instead of as a silent 35-min "Unknown command" timeout. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore: bump version and changelog (v1.58.4.0) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
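The probe-timeout clamp described in the adversarial-review notes above (a fractional override flooring to 0, which `execFileSync` treats as "no timeout") can be sketched roughly as follows. This is a minimal illustration, not gstack's actual code: the function name `resolveProbeTimeoutMs`, the constant name, and the choice to fall back to the default on non-numeric or non-positive input are all assumptions; only the 15s default and the >=1ms clamp come from the commit message.

```typescript
// Hypothetical helper; name and fallback policy are illustrative assumptions.
const DEFAULT_PROBE_TIMEOUT_MS = 15_000; // 15s default, per the commit message

function resolveProbeTimeoutMs(raw: string | undefined): number {
  // Unset or empty override: use the default.
  if (raw === undefined || raw.trim() === '') return DEFAULT_PROBE_TIMEOUT_MS;
  const parsed = Number(raw);
  // Assumption: garbage or non-positive overrides fall back to the default.
  if (!Number.isFinite(parsed) || parsed <= 0) return DEFAULT_PROBE_TIMEOUT_MS;
  // Floor to whole milliseconds, then clamp to >= 1 so a fractional value
  // such as "0.5" can never become 0 (which would disable the timeout).
  return Math.max(1, Math.floor(parsed));
}
```

A caller would presumably pass the value of `GSTACK_GBRAIN_PROBE_TIMEOUT_MS` from the environment and hand the result to the `timeout` option of the probe's child-process call.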
504 lines
20 KiB
TypeScript
// High-level E2E for /ios-qa skill flow.
//
// Two scenarios:
//   1. NO_DEVICE (gate-tier compatible): runs the gen-accessors codegen
//      against a SwiftUI fixture, verifies output is correct, no daemon
//      hardware required. Catches regressions in source-read + codegen +
//      cache + render paths without an iPhone.
//   2. WITH_DEVICE (periodic-tier, requires GSTACK_HAS_IOS_DEVICE=1): full
//      daemon + tailnet + USB tunnel loop. Skipped in CI.
//
// Note: The detailed daemon HTTP unit/integration tests live next to the
// daemon source (ios-qa/daemon/test/*). This file tests the agent-flow
// boundary: what the /ios-qa skill orchestrates end-to-end.

import { describe, test, expect, afterAll } from 'bun:test';
import { createServer, type Server, type IncomingMessage } from 'http';
import { mkdtempSync, rmSync, mkdirSync, writeFileSync, existsSync, readFileSync } from 'fs';
import { tmpdir } from 'os';
import { join } from 'path';
import { startDaemon, type RunningDaemon } from '../ios-qa/daemon/src/index';
import type { DeviceTunnel } from '../ios-qa/daemon/src/proxy';
import { grantIdentity } from '../ios-qa/daemon/src/allowlist';
import { generate } from '../ios-qa/scripts/gen-accessors';

const HAS_DEVICE = process.env.GSTACK_HAS_IOS_DEVICE === '1';

const DEVICE_TOKEN = 'rotated-mock-bearer-token';

// Per-test isolation under `bun test --concurrent`: a single module-level
// `workDir` reassigned in beforeEach is clobbered by parallel tests, so they
// collide on the same daemon pidfile (`already_running`) and stomp each
// other's GSTACK_IOS_* env paths. Each test calls makeWorkDir() for its own
// dir instead; afterAll removes every created dir once all tests have settled.
const createdWorkDirs: string[] = [];
function makeWorkDir(): string {
  const dir = mkdtempSync(join(tmpdir(), 'ios-e2e-'));
  createdWorkDirs.push(dir);
  return dir;
}

// Clean up ONCE after all tests, not per-test. Under `bun test --concurrent`
// an afterEach that drains the shared array would delete still-running tests'
// workDirs the moment ANY test finishes, vanishing their audit/attempts files
// mid-assertion. afterAll runs after every concurrent test has settled.
afterAll(() => {
  for (const dir of createdWorkDirs) {
    rmSync(dir, { recursive: true, force: true });
  }
  createdWorkDirs.length = 0;
});

interface StubState {
  loggedIn: boolean;
  username: string;
  rawTaps: Array<{ x: number; y: number }>;
}

// Build a stub StateServer that mimics the iOS app's HTTP surface end-to-end:
// /auth/rotate, session lock, snapshot, restore, tap. Used for both NO_DEVICE
// and as the development harness for WITH_DEVICE.
function startStubStateServer(initial: StubState): Promise<{ server: Server; port: number; state: StubState }> {
  const state = { ...initial };
  let activeSession: string | null = null;

  return new Promise((resolve) => {
    const server = createServer((req, res) => {
      const chunks: Buffer[] = [];
      req.on('data', (c) => chunks.push(c));
      req.on('end', () => {
        const body = Buffer.concat(chunks).toString('utf-8');
        const auth = req.headers['authorization'];
        const url = req.url ?? '/';

        // /healthz public on loopback (the stub mimics that)
        if (req.method === 'GET' && url === '/healthz') {
          return respond(res, 200, { version: '1.0.0' });
        }

        // /auth/rotate: validates boot token (we accept any here for the stub)
        if (req.method === 'POST' && url === '/auth/rotate') {
          return respond(res, 200, { ok: true });
        }

        // Everything else requires our rotated token
        if (auth !== `Bearer ${DEVICE_TOKEN}`) {
          return respond(res, 401, { error: 'unauthorized' });
        }

        // Session ops
        if (req.method === 'POST' && url === '/session/acquire') {
          if (activeSession) return respond(res, 423, { error: 'device_locked' });
          activeSession = 'stub-session-' + Math.random().toString(16).slice(2, 8);
          return respond(res, 200, { session_id: activeSession, ttl_seconds: 300 });
        }
        if (req.method === 'POST' && url === '/session/release') {
          activeSession = null;
          return respond(res, 200, { ok: true });
        }

        // Snapshot
        if (req.method === 'GET' && url === '/state/snapshot') {
          return respond(res, 200, {
            _schema_version: 1,
            _app_build_id: 'stub-1.0',
            _accessor_hash: 'stub-hash',
            keys: {
              loggedIn: state.loggedIn,
              username: state.username,
            },
          });
        }

        // Mutations require session
        const sessionHeader = req.headers['x-session-id'];
        const sessionOk = !!sessionHeader && sessionHeader === activeSession;
        const isMutation = req.method === 'POST' && (
          url === '/tap' || url === '/swipe' || url === '/type' ||
          (url.startsWith('/state/') && !url.endsWith('/snapshot'))
        );

        if (isMutation && !sessionOk) {
          return respond(res, 409, { error: 'session_required' });
        }

        if (req.method === 'POST' && url === '/tap') {
          const payload = JSON.parse(body || '{}');
          state.rawTaps.push({ x: payload.x ?? 0, y: payload.y ?? 0 });
          return respond(res, 200, { op: 'tap', ok: true });
        }

        if (req.method === 'POST' && url === '/state/restore') {
          const payload = JSON.parse(body || '{}');
          if (payload._accessor_hash && payload._accessor_hash !== 'stub-hash') {
            return respond(res, 409, { error: 'schema_mismatch' });
          }
          if (payload.keys?.loggedIn !== undefined) state.loggedIn = payload.keys.loggedIn;
          if (payload.keys?.username !== undefined) state.username = payload.keys.username;
          return respond(res, 200, { ok: true });
        }

        respond(res, 404, { error: 'not_found' });
      });
    });
    server.listen(0, '127.0.0.1', () => {
      const addr = server.address();
      const port = typeof addr === 'object' && addr ? addr.port : 0;
      resolve({ server, port, state });
    });
  });
}

function respond(res: import('http').ServerResponse, status: number, body: unknown): void {
  const payload = JSON.stringify(body);
  res.writeHead(status, { 'content-type': 'application/json', 'content-length': Buffer.byteLength(payload) });
  res.end(payload);
}

async function fetchJson(method: string, url: string, init: { headers?: Record<string, string>; body?: string } = {}): Promise<{ status: number; body: unknown }> {
  const res = await fetch(url, { method, headers: init.headers, body: init.body });
  const text = await res.text();
  let body: unknown;
  try { body = JSON.parse(text); } catch { body = text; }
  return { status: res.status, body };
}

describe('ios-qa E2E (no-device path)', () => {
test('NO_DEVICE: codegen runs against a SwiftUI fixture and emits valid accessors', () => {
|
|
const workDir = makeWorkDir();
|
|
const srcDir = join(workDir, 'app-src');
|
|
mkdirSync(srcDir);
|
|
writeFileSync(join(srcDir, 'AppState.swift'), `
|
|
@Observable
|
|
class AppState {
|
|
@Snapshotable var isLoggedIn: Bool = false
|
|
@Snapshotable var username: String = ""
|
|
@Snapshotable var counter: Int = 0
|
|
var ephemeralCache: [String: Any] = [:]
|
|
}
|
|
`);
|
|
const cacheRoot = join(workDir, 'cache');
|
|
const result = generate({
|
|
inputDir: srcDir,
|
|
cacheRoot,
|
|
swiftVersion: '6.0.0',
|
|
toolGitRev: 'e2e-test',
|
|
platformTriple: 'darwin-arm64',
|
|
});
|
|
expect(result.cacheHit).toBe(false);
|
|
expect(result.specs).toHaveLength(1);
|
|
expect(result.specs[0]!.fields.map(f => f.name).sort()).toEqual(['counter', 'isLoggedIn', 'username']);
|
|
const generatedSwift = readFileSync(result.outputPath, 'utf-8');
|
|
expect(generatedSwift).toContain('public enum AppStateAccessor');
|
|
expect(generatedSwift).toContain('key: "isLoggedIn"');
|
|
expect(generatedSwift).toContain('key: "counter"');
|
|
expect(generatedSwift).not.toContain('key: "ephemeralCache"'); // not marked @Snapshotable
|
|
expect(generatedSwift).toContain('#if DEBUG');
|
|
});
|
|
|
|
  test('NO_DEVICE: cache hit on rerun', () => {
    const workDir = makeWorkDir();
    const srcDir = join(workDir, 'app-src');
    mkdirSync(srcDir);
    writeFileSync(join(srcDir, 'AppState.swift'), '@Observable class A { @Snapshotable var x: Int = 0 }');
    const cacheRoot = join(workDir, 'cache');
    const r1 = generate({ inputDir: srcDir, cacheRoot, swiftVersion: '6', toolGitRev: 't', platformTriple: 'p' });
    const r2 = generate({ inputDir: srcDir, cacheRoot, swiftVersion: '6', toolGitRev: 't', platformTriple: 'p' });
    expect(r1.cacheHit).toBe(false);
    expect(r2.cacheHit).toBe(true);
  });

  test('NO_DEVICE: schema mismatch returns 409 on restore', async () => {
    const workDir = makeWorkDir();
    const stub = await startStubStateServer({ loggedIn: false, username: '', rawTaps: [] });
    try {
      const tunnel: DeviceTunnel = {
        udid: 'NO-DEVICE-UDID',
        ipv6Addr: '127.0.0.1',
        port: stub.port,
        bootTokenRotated: DEVICE_TOKEN,
      };
      const daemon = await startDaemon({
        loopbackPort: 0,
        tailnetEnabled: false,
        pidfilePath: join(workDir, 'daemon.pid'),
        tunnelProvider: async () => tunnel,
      });
      if ('error' in daemon) throw new Error(daemon.error);
      try {
        // Acquire session first
        const acqR = await fetchJson('POST', `http://127.0.0.1:${daemon.loopbackPort}/session/acquire`);
        expect(acqR.status).toBe(200);
        const sessionId = (acqR.body as { session_id: string }).session_id;

        // Restore with wrong schema hash
        const restoreR = await fetchJson('POST', `http://127.0.0.1:${daemon.loopbackPort}/state/restore`, {
          headers: { 'content-type': 'application/json', 'x-session-id': sessionId },
          body: JSON.stringify({
            _schema_version: 1,
            _accessor_hash: 'wrong-hash-xxxxxxxxxxxxx',
            keys: { loggedIn: true },
          }),
        });
        expect(restoreR.status).toBe(409);
        expect((restoreR.body as { error: string }).error).toBe('schema_mismatch');
      } finally {
        await daemon.close();
      }
    } finally {
      stub.server.close();
    }
  });
});

describe('ios-qa E2E (agent-flow simulation)', () => {
  test('SCENARIO: acquire → snapshot → restore → tap → release', async () => {
    const workDir = makeWorkDir();
    const initial: StubState = { loggedIn: false, username: '', rawTaps: [] };
    const stub = await startStubStateServer(initial);
    try {
      const tunnel: DeviceTunnel = {
        udid: 'AGENT-UDID',
        ipv6Addr: '127.0.0.1',
        port: stub.port,
        bootTokenRotated: DEVICE_TOKEN,
      };
      const daemon = await startDaemon({
        loopbackPort: 0,
        tailnetEnabled: false,
        pidfilePath: join(workDir, 'daemon.pid'),
        tunnelProvider: async () => tunnel,
      });
      if ('error' in daemon) throw new Error(daemon.error);
      const base = `http://127.0.0.1:${daemon.loopbackPort}`;
      try {
        // 1. Acquire session
        const acq = await fetchJson('POST', `${base}/session/acquire`);
        expect(acq.status).toBe(200);
        const sessionId = (acq.body as { session_id: string }).session_id;

        // 2. Snapshot initial state
        const snap = await fetchJson('GET', `${base}/state/snapshot`);
        expect(snap.status).toBe(200);
        expect((snap.body as { keys: { loggedIn: boolean } }).keys.loggedIn).toBe(false);

        // 3. Restore: flip logged-in to true via the correct schema hash
        const restore = await fetchJson('POST', `${base}/state/restore`, {
          headers: { 'content-type': 'application/json', 'x-session-id': sessionId },
          body: JSON.stringify({
            _schema_version: 1,
            _accessor_hash: 'stub-hash',
            keys: { loggedIn: true, username: 'agent@e2e' },
          }),
        });
        expect(restore.status).toBe(200);

        // 4. Verify state changed
        const snap2 = await fetchJson('GET', `${base}/state/snapshot`);
        expect((snap2.body as { keys: { loggedIn: boolean; username: string } }).keys).toEqual({
          loggedIn: true,
          username: 'agent@e2e',
        });

        // 5. Tap (with session-id)
        const tap = await fetchJson('POST', `${base}/tap`, {
          headers: { 'content-type': 'application/json', 'x-session-id': sessionId },
          body: JSON.stringify({ x: 100, y: 200 }),
        });
        expect(tap.status).toBe(200);
        expect(stub.state.rawTaps).toEqual([{ x: 100, y: 200 }]);

        // 6. Release
        const rel = await fetchJson('POST', `${base}/session/release`);
        expect(rel.status).toBe(200);
      } finally {
        await daemon.close();
      }
    } finally {
      stub.server.close();
    }
  });

  test('SCENARIO: contention — second session-acquire returns 423 while first holds', async () => {
    const workDir = makeWorkDir();
    const stub = await startStubStateServer({ loggedIn: false, username: '', rawTaps: [] });
    try {
      const tunnel: DeviceTunnel = {
        udid: 'CONTENTION-UDID',
        ipv6Addr: '127.0.0.1',
        port: stub.port,
        bootTokenRotated: DEVICE_TOKEN,
      };
      const daemon = await startDaemon({
        loopbackPort: 0,
        tailnetEnabled: false,
        pidfilePath: join(workDir, 'daemon.pid'),
        tunnelProvider: async () => tunnel,
      });
      if ('error' in daemon) throw new Error(daemon.error);
      const base = `http://127.0.0.1:${daemon.loopbackPort}`;
      try {
        const a = await fetchJson('POST', `${base}/session/acquire`);
        expect(a.status).toBe(200);
        const b = await fetchJson('POST', `${base}/session/acquire`);
        expect(b.status).toBe(423);
      } finally {
        await daemon.close();
      }
    } finally {
      stub.server.close();
    }
  });

  test('SCENARIO: tailnet allowlist gate + mint + audit log', async () => {
    const workDir = makeWorkDir();
    const stub = await startStubStateServer({ loggedIn: false, username: '', rawTaps: [] });
    try {
      const allowPath = join(workDir, 'allowlist.json');
      const auditPath = join(workDir, 'audit.jsonl');
      const attemptsPath = join(workDir, 'attempts.jsonl');
      // Pass paths as daemon OPTIONS, not process.env — env is process-global
      // and races across concurrent tests (the cause of the original
      // intermittent failures). GSTACK_IOS_TAILNET_BIND is read from env but
      // is the same constant for every tailnet test, so it can't diverge.
      process.env.GSTACK_IOS_TAILNET_BIND = '127.0.0.1';

      const tunnel: DeviceTunnel = {
        udid: 'TAILNET-UDID',
        ipv6Addr: '127.0.0.1',
        port: stub.port,
        bootTokenRotated: DEVICE_TOKEN,
      };
      const daemon = await startDaemon({
        loopbackPort: 0,
        tailnetEnabled: true,
        allowlistPath: allowPath,
        auditPath,
        attemptsPath,
        pidfilePath: join(workDir, 'daemon.pid'),
        tunnelProvider: async () => tunnel,
        probeImpl: async () => ({ ok: true, ownIdentity: 'mac@e2e' }),
        whoIsImpl: async () => ({ identity: 'agent@e2e', raw: {} }),
      });
      if ('error' in daemon) throw new Error(daemon.error);
      const tailnetBase = `http://127.0.0.1:${daemon.tailnetPort}`;
      try {
        // 1. Mint denied for un-allowlisted identity
        const denied = await fetchJson('POST', `${tailnetBase}/auth/mint`, {
          headers: { 'content-type': 'application/json' },
          body: JSON.stringify({ capability: 'interact' }),
        });
        expect(denied.status).toBe(403);

        // 2. Owner grants — then mint succeeds
        await grantIdentity({ identity: 'agent@e2e', capability: 'mutate', path: allowPath });
        const minted = await fetchJson('POST', `${tailnetBase}/auth/mint`, {
          headers: { 'content-type': 'application/json' },
          body: JSON.stringify({ capability: 'interact' }),
        });
        expect(minted.status).toBe(200);
        const sessionToken = (minted.body as { session_token: string }).session_token;

        // 3. Acquire a session with the minted token, then tap (with X-Session-Id)
        const acqR = await fetchJson('POST', `${tailnetBase}/session/acquire`, {
          headers: { 'authorization': `Bearer ${sessionToken}` },
        });
        expect(acqR.status).toBe(200);
        const sessionId = (acqR.body as { session_id: string }).session_id;

        const tapR = await fetchJson('POST', `${tailnetBase}/tap`, {
          headers: { 'authorization': `Bearer ${sessionToken}`, 'content-type': 'application/json', 'x-session-id': sessionId },
          body: JSON.stringify({ x: 50, y: 60 }),
        });
        expect(tapR.status).toBe(200);

        // 4. Audit log must have an entry for /tap.
        // Short fixed delay to give the daemon time to write the audit row.
        await new Promise(r => setTimeout(r, 80));
        expect(existsSync(auditPath)).toBe(true);
        const rows = readFileSync(auditPath, 'utf-8').trim().split('\n').filter(Boolean).map(l => JSON.parse(l));
        const tapRow = rows.find(r => r.endpoint === 'POST /tap');
        expect(tapRow).toBeDefined();
        expect(tapRow.identity).toBe('agent@e2e');
        expect(tapRow.capability).toBe('mutate');
        expect(tapRow.device_udid).toBe('TAILNET-UDID');

        // 5. Attempts log must have the denied-mint entry, with HASHED identity (no raw leak)
        expect(existsSync(attemptsPath)).toBe(true);
        const attempts = readFileSync(attemptsPath, 'utf-8');
        expect(attempts).not.toContain('agent@e2e');
        expect(attempts).toMatch(/"reason":"identity_not_allowed"/);
      } finally {
        await daemon.close();
        delete process.env.GSTACK_IOS_TAILNET_BIND;
      }
    } finally {
      stub.server.close();
    }
  });

  test('SCENARIO: capability-tier enforcement — observe token cannot /tap', async () => {
    const workDir = makeWorkDir();
    const stub = await startStubStateServer({ loggedIn: false, username: '', rawTaps: [] });
    try {
      const allowPath = join(workDir, 'allowlist.json');
      // Paths via daemon options, not process.env (concurrency-safe).
      const tunnel: DeviceTunnel = {
        udid: 'CAP-UDID', ipv6Addr: '127.0.0.1', port: stub.port, bootTokenRotated: DEVICE_TOKEN,
      };
      const daemon = await startDaemon({
        loopbackPort: 0,
        tailnetEnabled: true,
        allowlistPath: allowPath,
        auditPath: join(workDir, 'audit.jsonl'),
        attemptsPath: join(workDir, 'attempts.jsonl'),
        pidfilePath: join(workDir, 'daemon.pid'),
        tunnelProvider: async () => tunnel,
        probeImpl: async () => ({ ok: true, ownIdentity: 'mac@e2e' }),
        whoIsImpl: async () => ({ identity: 'readonly@e2e', raw: {} }),
      });
      if ('error' in daemon) throw new Error(daemon.error);
      const base = `http://127.0.0.1:${daemon.tailnetPort}`;
      try {
        await grantIdentity({ identity: 'readonly@e2e', capability: 'observe', path: allowPath });
        const minted = await fetchJson('POST', `${base}/auth/mint`, {
          headers: { 'content-type': 'application/json' },
          body: JSON.stringify({ capability: 'observe' }),
        });
        // Fail here, not at the later requests, if the mint itself was rejected.
        expect(minted.status).toBe(200);
        const token = (minted.body as { session_token: string }).session_token;

        // /screenshot (observe) → passes the capability gate
        const ss = await fetchJson('GET', `${base}/screenshot`, {
          headers: { 'authorization': `Bearer ${token}` },
        });
        // The stub StateServer doesn't implement /screenshot, so the proxy
        // returns 404. That's fine: what we're testing is the daemon's
        // capability gate, and observe is sufficient for /screenshot there.
        expect([200, 404]).toContain(ss.status);

        // /tap (interact) → 403 capability_insufficient
        const tap = await fetchJson('POST', `${base}/tap`, {
          headers: { 'authorization': `Bearer ${token}`, 'content-type': 'application/json', 'x-session-id': 'x' },
          body: JSON.stringify({ x: 1, y: 1 }),
        });
        expect(tap.status).toBe(403);
        expect((tap.body as { error: string }).error).toBe('capability_insufficient');
      } finally {
        await daemon.close();
      }
    } finally {
      stub.server.close();
    }
  });
});

// ───────── WITH_DEVICE — manual smoke tests (skipped in CI) ─────────

(HAS_DEVICE ? describe : describe.skip)('ios-qa E2E (with device)', () => {
  test('WITH_DEVICE: full agent loop against a real iPhone', () => {
    const workDir = makeWorkDir();
    // Stub — real implementation requires `devicectl` + an attached iPhone.
    // Documented in ios-qa/SKILL.md.tmpl under "Manual smoke test".
    expect(HAS_DEVICE).toBe(true);
  });
});