feat(browse): terminal-agent watchdog with PID liveness + crash-loop guard

terminal-agent can die independently of the server: SIGKILL from the OS
OOM killer, an uncaught exception under PTY churn, an external `pkill` from
a sibling debugging session. Pre-v1.44 the sidebar would observe the broken
connection and stay broken until the user reloaded it. Now a 60s ticker
checks the recorded agent PID and, if that process is dead, respawns the
agent via the shared spawnTerminalAgent helper.

Identity-based liveness (T4 from the eng review):
  * Uses readAgentRecord + isProcessAlive (signal 0 probe), not a name match.
  * Slow-but-alive agents intentionally fall through — respawning around a
    living agent would create split-brain (two agents writing the port
    file, tokens diverging between them, mystery upgrade 401s).
  * Pairs with the v1.44 generation counter in /internal/* loopback calls:
    if a stale agent does come back to life mid-cycle, its X-Browse-Gen
    no longer matches and the parent's calls 409 cleanly.
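
A minimal sketch of the signal-0 probe described above, assuming a Node- or
Bun-compatible runtime. `probePid` is a hypothetical stand-in; the real
isProcessAlive lives in error-handling.ts and may treat edge cases
differently.

```typescript
import process from "node:process";

// Hypothetical stand-in for isProcessAlive; not the committed helper.
function probePid(pid: number): boolean {
  // Reject 0 and negatives: kill(0, ...) and kill(-n, ...) address process
  // groups, not a single process.
  if (!Number.isInteger(pid) || pid <= 0) return false;
  try {
    process.kill(pid, 0); // signal 0 runs the existence check, delivers nothing
    return true;
  } catch (err) {
    // EPERM: the PID exists but belongs to another user, so it is alive.
    // ESRCH (or anything else): treat the process as dead.
    return (err as { code?: string }).code === "EPERM";
  }
}

// The current process is alive by definition.
console.log(probePid(process.pid));
```

The probe only answers "does this PID exist", which is why a slow agent
still counts as alive and is left alone.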

Crash-loop guard:
  * 3 respawn attempts inside a rolling 60s window → stop trying. A daemon
    up for a week with one crash a day shouldn't trip the guard.
  * On trip: one-line error to console (`respawn guard tripped`) and the
    watchdog goes dormant. Manual restart via the sidebar Restart button
    is the explicit signal to re-arm (added in Commit 2 of the larger PR).
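
The guard above can be sketched as a small class with an injected clock.
`RespawnGuard`, `tryAcquire` and `rearm` are hypothetical names chosen for
illustration; the committed code keeps this state in closure variables
inside server.ts.

```typescript
// Hypothetical sketch of a rolling-window crash-loop guard.
class RespawnGuard {
  private history: number[] = [];
  private tripped = false;
  private readonly maxAttempts: number;
  private readonly windowMs: number;

  constructor(maxAttempts = 3, windowMs = 60_000) {
    this.maxAttempts = maxAttempts;
    this.windowMs = windowMs;
  }

  // Returns true and records the attempt if a respawn may proceed at `now`.
  tryAcquire(now: number): boolean {
    if (this.tripped) return false; // dormant until rearm()
    // Prune attempts that have aged out of the rolling window.
    for (;;) {
      const oldest = this.history[0];
      if (oldest === undefined || now - oldest <= this.windowMs) break;
      this.history.shift();
    }
    if (this.history.length >= this.maxAttempts) {
      this.tripped = true;
      return false;
    }
    this.history.push(now);
    return true;
  }

  // Manual restart: clear the history and wake the guard.
  rearm(): void {
    this.history = [];
    this.tripped = false;
  }

  get isTripped(): boolean {
    return this.tripped;
  }
}

const guard = new RespawnGuard();
for (const t of [0, 10_000, 20_000, 30_000]) {
  console.log(t, guard.tryAcquire(t), guard.isTripped);
}
```

Because attempts are only recorded on ticks, this sketch needs three ticks
to land within one window before a fourth is refused. With a 60s window it
therefore trips only when the tick is 20s or shorter; at a 60s tick the
history never exceeds two entries. Whether that matches the intended
behavior is worth confirming against the real watchdog.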

Shared spawn path (refactor):
  * New spawnTerminalAgent(opts) in terminal-agent-control.ts handles:
    prior-PID cleanup → spawn → record stash. Both the CLI cold-start path
    in cli.ts and the new server.ts watchdog route through it. Removes the
    copy-paste between them; future env wiring lands in one place.
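
The cleanup → spawn → record sequence might look roughly like the sketch
below. Every name here (`SpawnDeps`, `spawnAgentSketch`, the injected
callbacks) is hypothetical; the real spawnTerminalAgent takes an options
object and does its own process and file handling. Dependencies are
injected so the ordering can be exercised without spawning anything.

```typescript
interface AgentRecord {
  pid: number;
}

// Hypothetical seams standing in for record I/O and process control.
interface SpawnDeps {
  readRecord(): AgentRecord | null;
  kill(pid: number): void;
  spawn(): number | null; // null models "script not found on disk"
  writeRecord(rec: AgentRecord): void;
}

function spawnAgentSketch(deps: SpawnDeps): number | null {
  // 1. Prior-PID cleanup: never leave an older agent running beside the new one.
  const prior = deps.readRecord();
  if (prior) deps.kill(prior.pid);
  // 2. Spawn the new agent.
  const pid = deps.spawn();
  if (pid === null) return null;
  // 3. Stash the record so later liveness checks probe the right PID.
  deps.writeRecord({ pid });
  return pid;
}

// In-memory fakes that log the order of calls.
const calls: string[] = [];
let stored: AgentRecord | null = { pid: 111 };
const fakeDeps: SpawnDeps = {
  readRecord: () => stored,
  kill: (pid) => { calls.push(`kill:${pid}`); },
  spawn: () => { calls.push("spawn"); return 222; },
  writeRecord: (rec) => { calls.push(`record:${rec.pid}`); stored = rec; },
};

const newPid = spawnAgentSketch(fakeDeps);
console.log(newPid, calls.join(" -> "));
```

Routing both callers through one function of this shape is what keeps the
cleanup step from being skipped on either path.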

Gated on cfg.ownsTerminalAgent — embedders that pre-launch their own PTY
server (gbrowser phoenix overlay) still own the full lifecycle.

GSTACK_AGENT_WATCHDOG_TICK_MS env knob shortens the 60s tick so e2e tests
need not wait 60s per assertion.
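
The server.ts hunk passes the env value straight to parseInt, which yields
NaN for a non-numeric string. A defensive variant could fall back to the
default instead. `readTickMs` is a hypothetical helper, not part of this
commit.

```typescript
const DEFAULT_TICK_MS = 60_000;

// Hypothetical helper: parse the tick override, falling back to the default
// for missing, non-numeric, zero, or negative values.
function readTickMs(env: Record<string, string | undefined>): number {
  const raw = env["GSTACK_AGENT_WATCHDOG_TICK_MS"];
  const parsed = raw === undefined ? Number.NaN : Number.parseInt(raw, 10);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : DEFAULT_TICK_MS;
}

console.log(readTickMs({ GSTACK_AGENT_WATCHDOG_TICK_MS: "250" }));
console.log(readTickMs({ GSTACK_AGENT_WATCHDOG_TICK_MS: "fast" }));
```

An e2e run would then set something like GSTACK_AGENT_WATCHDOG_TICK_MS=250
in the server's environment before killing the agent.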

Tests:
  * browse/test/terminal-agent-watchdog.test.ts — 7 static-grep tripwires
    for the load-bearing invariants (ownsTerminalAgent gate, PID-based
    liveness, crash-loop guard with window pruning, shutdown cleanup,
    CLI cold-start uses the same helper, env knob exists).
  * Live process-kill tests belong in the e2e tier; cheaper invariants
    here catch refactor regressions in ~1ms each.
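
A static-grep tripwire, in the sense used above, is a regex assertion
against source text rather than against runtime behavior. The sketch below
assumes that shape; `Tripwire` and `missingInvariants` are hypothetical
names, and an inline excerpt stands in for reading server.ts from disk. The
actual test file may be organized differently.

```typescript
interface Tripwire {
  label: string;
  pattern: RegExp;
}

// Hypothetical helper: labels of invariants whose pattern is absent.
function missingInvariants(source: string, wires: Tripwire[]): string[] {
  return wires.filter((w) => !w.pattern.test(source)).map((w) => w.label);
}

// Inline excerpt standing in for the contents of server.ts.
const serverExcerpt = `
if (ownsTerminalAgent) {
  agentWatchdogInterval = setInterval(() => {
    if (record && isProcessAlive(record.pid)) return;
    respawnHistory.shift();
  }, AGENT_WATCHDOG_TICK_MS);
}
if (agentWatchdogInterval) clearInterval(agentWatchdogInterval);
`;

const wires: Tripwire[] = [
  { label: "ownsTerminalAgent gate", pattern: /if \(ownsTerminalAgent\)/ },
  { label: "PID-based liveness", pattern: /isProcessAlive\(record\.pid\)/ },
  { label: "window pruning", pattern: /respawnHistory\.shift\(\)/ },
  { label: "shutdown cleanup", pattern: /clearInterval\(agentWatchdogInterval\)/ },
];

const missing = missingInvariants(serverExcerpt, wires);
if (missing.length > 0) {
  throw new Error(`missing invariants: ${missing.join(", ")}`);
}
```

Checks like these cannot prove the watchdog works, only that a refactor has
not silently deleted the lines the design depends on.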

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Garry Tan
2026-05-23 23:11:54 -07:00
parent ad669b238a
commit f42d7bac6d
4 changed files with 247 additions and 33 deletions
+81 -1
@@ -43,7 +43,8 @@ import { inspectElement, modifyStyle, resetModifications, getModificationHistory
// Bun.spawn used instead of child_process.spawn (compiled bun binaries
// fail posix_spawn on all executables including /bin/bash)
import { safeUnlink, safeUnlinkQuiet, safeKill } from './error-handling';
import { readAgentRecord, killAgentByRecord, clearAgentRecord, agentRecordPath } from './terminal-agent-control';
import { readAgentRecord, killAgentByRecord, clearAgentRecord, agentRecordPath, spawnTerminalAgent } from './terminal-agent-control';
import { isProcessAlive } from './error-handling';
import { sanitizeBody, stripLoneSurrogateEscapes } from './sanitize';
import { startSocksBridge, testUpstream, type BridgeHandle } from './socks-bridge';
import { parseProxyConfig, toUpstreamConfig, ProxyConfigError } from './proxy-config';
@@ -1306,6 +1307,84 @@ export function buildFetchHandler(cfg: ServerConfig): ServerHandle {
// premise even under malformed cfg.
const ownsTerminalAgent = cfg.ownsTerminalAgent === false ? false : true;
// ─── Terminal-Agent Watchdog (v1.44+) ─────────────────────────────
//
// The terminal-agent process can die independently of the server: SIGKILL
// from the OS OOM killer, an uncaught exception under load, an external
// `pkill` from a sibling debugging session. Pre-v1.44 the sidebar would
// see the broken connection and stay broken until the user reloaded.
// Now: 60s ticker checks the recorded agent PID, respawns via the shared
// spawnTerminalAgent helper if dead.
//
// Identity-based — uses readAgentRecord + isProcessAlive, NOT a process
// name probe. Critical: prevents respawning around a slow-but-alive agent
// (which would create split-brain — two agents writing the port file,
// tokens diverging between them, mystery PTY upgrade failures).
//
// Crash-loop guard: 3 respawn attempts inside 60s → stop trying and emit
// a one-line error. Manual `forceRestart` from the sidebar clears the
// history (the user is the explicit signal to retry).
//
// Only active when ownsTerminalAgent === true. Embedders that pre-launch
// their own PTY server (gbrowser phoenix overlay) must not be auto-respawned
// by us — their lifecycle is their concern.
let agentWatchdogInterval: ReturnType<typeof setInterval> | null = null;
const respawnHistory: number[] = [];
const AGENT_WATCHDOG_TICK_MS = parseInt(
process.env.GSTACK_AGENT_WATCHDOG_TICK_MS || '60000',
10,
);
const RESPAWN_GUARD_WINDOW_MS = 60_000;
const RESPAWN_GUARD_MAX = 3;
let agentRespawnGuardTripped = false;
if (ownsTerminalAgent) {
agentWatchdogInterval = setInterval(() => {
if (isShuttingDown) return;
if (agentRespawnGuardTripped) return;
const stateDir = path.dirname(cfg.config.stateFile);
const record = readAgentRecord(stateDir);
// If the record exists and the PID is alive, the agent is healthy
// (or at least still answering signal 0). Slow-but-alive agents
// intentionally fall through here — split-brain is worse than
// unresponsiveness, and slow recovery is handled by the user via
// restart.
if (record && isProcessAlive(record.pid)) return;
// Either no record (never spawned, or cleaned up after crash) or
// PID is dead. Try to respawn.
const now = Date.now();
while (respawnHistory.length && now - respawnHistory[0] > RESPAWN_GUARD_WINDOW_MS) {
respawnHistory.shift();
}
if (respawnHistory.length >= RESPAWN_GUARD_MAX) {
agentRespawnGuardTripped = true;
console.error(
`[browse] terminal-agent respawn guard tripped (${RESPAWN_GUARD_MAX} crashes in ${RESPAWN_GUARD_WINDOW_MS / 1000}s) — manual restart required`,
);
return;
}
respawnHistory.push(now);
try {
const pid = spawnTerminalAgent({
stateFile: cfg.config.stateFile,
serverPort: cfg.browsePort,
cwd: cfg.config.projectDir,
});
if (pid) {
console.log(`[browse] terminal-agent respawned by watchdog (PID: ${pid})`);
} else {
console.warn('[browse] terminal-agent respawn skipped — script not found on disk');
}
} catch (err: any) {
console.warn('[browse] terminal-agent respawn failed:', err?.message || err);
}
}, AGENT_WATCHDOG_TICK_MS);
// Detach the watchdog timer from the event loop's ref count so a
// healthy idle process can still exit cleanly if everything else is
// also unref'd. Bun's setInterval returns a Timer with unref().
(agentWatchdogInterval as any)?.unref?.();
}
// Factory-scoped validateAuth. Closes over cfg.authToken so every internal
// auth check sees the same token the routes receive. Module-level
// validateAuth was deleted in v1.35.0.0.
@@ -1345,6 +1424,7 @@ export function buildFetchHandler(cfg: ServerConfig): ServerHandle {
if (cfgBrowserManager.isWatching()) cfgBrowserManager.stopWatch();
clearInterval(flushInterval);
clearInterval(idleCheckInterval);
if (agentWatchdogInterval) clearInterval(agentWatchdogInterval);
await flushBuffers();
await cfgBrowserManager.close();