Infonet sync: honor HTTP 429 Retry-After + exponential backoff

Fixes the retry-storm that's been keeping the local node 429'd out of
the seed peer (the diagnosis we ran earlier in the session).

Pre-fix:

  1. Sync hits the seed peer, gets HTTP 429 (Too Many Requests)
  2. _peer_sync_response stringifies the status into a ValueError
  3. _sync_from_peer catches it, error becomes the str() of the exc
  4. _run_public_sync_cycle calls
     finish_sync(error=..., failure_backoff_s=60)
  5. next_sync_due_at = now + 60s
  6. After 60s, sync runs again, hits same upstream that hasn't reset
     its rate-limit bucket, 429 again. Loop indefinitely.

Net effect: a node that hit one transient 429 would hammer the seed
every 60s forever, keeping the bucket full and never recovering. We saw
this in the live status dump: consecutive_failures=49,
last_sync_ok_at=0, retry storm sustained over the entire uptime.

What changed
------------

services/mesh/mesh_infonet_sync_support.py

  * New typed exception PeerSyncRateLimited carries the parsed
    Retry-After value out of the HTTP layer instead of stringifying
    everything into a generic ValueError.

  * New parse_retry_after_header() handles both RFC 7231 §7.1.3 forms
    (delay-seconds and HTTP-date). Clamped at 1 hour so a hostile peer
    can't silence us for days.

  * New _failure_backoff_seconds() helper computes the next delay as
    max(exponential, retry_after_s). Schedule with default base=60s,
    cap=1800s:

      failure 1  -> 60s   (preserves pre-fix for transient blips)
      failure 2  -> 120s
      failure 3  -> 240s
      failure 4  -> 480s
      failure 5  -> 960s
      failure 6+ -> 1800s (capped at 30 min)

    cap_s=0 explicitly disables exponential entirely; operators who
    want pure-Retry-After behavior have that option.

  * finish_sync now accepts retry_after_s and failure_backoff_cap_s
    kwargs. Backward-compatible: existing callers that don't pass
    retry_after_s get the same first-failure delay as before (the base
    value), only repeat failures grow.

main.py

  * _peer_sync_response detects 429 specifically, parses the
    Retry-After header, raises PeerSyncRateLimited(retry_after_s=N).
    Includes the response body prefix (first 200 characters) in the
    message so the operator's last_error finally shows something
    useful.

  * _sync_from_peer extended to return
    (ok, error, forked, retry_after_s). The 4th tuple element is
    non-zero only when the upstream sent a parseable Retry-After. The
    call arguments are unchanged; the return value grows from a 3-tuple
    to a 4-tuple, and the lone caller in _run_public_sync_cycle was
    updated in the same commit to unpack it.

  * _run_public_sync_cycle forwards retry_after_s into finish_sync.

Tests
-----

backend/tests/mesh/test_infonet_sync_429_backoff.py (17 new tests):

  TestParseRetryAfter (7):
    - integer seconds form
    - HTTP-date form (computed as seconds-from-now)
    - HTTP-date in the past returns 0
    - empty / whitespace returns 0
    - malformed returns 0
    - clamps to 1 hour (hostile-peer cap)
    - negative returns 0

  TestFailureBackoffSeconds (5):
    - exponential growth schedule pins each level
    - retry_after wins when larger than exponential
    - exponential wins when larger than retry_after
    - cap_s=0 disables exponential entirely
    - zero inputs return zero

  TestFinishSyncBackoff (5):
    - first failure uses base unchanged (pre-fix back-compat)
    - consecutive_failures actually grow the delay
    - retry_after honored at low failure count
    - success resets consecutive_failures
    - last_error carries the HTTP status / Retry-After detail

All 24 existing sync-support / status-gate tests still pass. Other
failures in tests/mesh/ are pre-existing on origin/main and unrelated to
this change (verified by running the same tests against the user's main
worktree without these edits).

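The TestParseRetryAfter cases listed above pin down how the parser is
expected to behave. The body of parse_retry_after_header() is not part
of the hunks shown here, so what follows is only a minimal sketch that
would satisfy those seven cases, not the code in
mesh_infonet_sync_support.py. The `now` keyword and the
MAX_RETRY_AFTER_S constant are assumptions added for illustration and
testability.

```python
import time
from datetime import timezone
from email.utils import formatdate, parsedate_to_datetime
from typing import Optional

# Assumed name for the 1-hour clamp described in the commit message.
MAX_RETRY_AFTER_S = 3600


def parse_retry_after_header(value: str, *, now: Optional[float] = None) -> int:
    """Return the Retry-After delay in whole seconds, or 0 if unusable.

    `now` is a hypothetical hook so the HTTP-date branch can be tested
    against a fixed clock; it defaults to time.time().
    """
    text = (value or "").strip()
    if not text:
        return 0

    # Form 1: delay-seconds, a non-negative decimal integer.
    # "-5" fails isdigit() and falls through to the date branch below.
    if text.isascii() and text.isdigit():
        digits = text.lstrip("0") or "0"
        if len(digits) > 6:  # far beyond the clamp; skip the big-int parse
            return MAX_RETRY_AFTER_S
        return min(int(digits), MAX_RETRY_AFTER_S)

    # Form 2: HTTP-date, e.g. "Tue, 14 Nov 2023 22:15:20 GMT".
    try:
        when = parsedate_to_datetime(text)
    except (TypeError, ValueError, IndexError, OverflowError):
        return 0  # malformed input is treated as "no hint"
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)

    current = time.time() if now is None else now
    delta = when.timestamp() - current
    if delta <= 0:
        return 0  # a date in the past means no extra wait
    return min(int(delta), MAX_RETRY_AFTER_S)


fixed_now = 1_700_000_000.0
print(parse_retry_after_header("30"))  # 30
print(parse_retry_after_header(formatdate(fixed_now + 120, usegmt=True), now=fixed_now))  # 120
```

Returning 0 for every unusable input keeps the caller simple: a zero
hint just means the exponential schedule decides the delay on its own.
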
What the operator sees after this lands + a docker rebuild
----------------------------------------------------------

With the live 429 storm we diagnosed:

  Pre-fix:  consecutive_failures keeps climbing 1/min forever,
            last_error empty or generic
  Post-fix: consecutive_failures grows, next_sync_due_at backs off
            exponentially (max 30 min), last_error explicitly carries
            "HTTP 429 from <peer> (retry_after=Ns): <body>" so the
            operator can see what's actually wrong.

Once the upstream bucket drains and a sync succeeds,
consecutive_failures resets to 0 and the schedule returns to the normal
300s interval.
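
The backoff rule described in this message, max(exponential,
retry_after_s) with base=60s and cap=1800s, can be sketched as below.
The real _failure_backoff_seconds() lives in
mesh_infonet_sync_support.py and is not shown in these hunks, so the
function name, parameter names, and argument order here are assumptions
chosen only to reproduce the published schedule.

```python
def failure_backoff_seconds(
    consecutive_failures: int,
    *,
    base_s: int = 60,
    cap_s: int = 1800,
    retry_after_s: int = 0,
) -> int:
    """Seconds to wait before the next sync attempt (illustrative sketch).

    The exponential term doubles per consecutive failure, starting at
    base_s and capped at cap_s. cap_s=0 disables the exponential term,
    leaving pure Retry-After behavior.
    """
    exponential = 0
    if consecutive_failures > 0 and base_s > 0 and cap_s > 0:
        # Bound the exponent so very long failure streaks stay cheap.
        exponent = min(consecutive_failures - 1, 32)
        exponential = min(base_s * (2 ** exponent), cap_s)
    # Whichever asks for the longer wait wins.
    return max(exponential, max(0, retry_after_s))


# Reproduces the schedule from the commit message: 60, 120, 240, 480, 960, 1800.
for failures in range(1, 7):
    print(failures, failure_backoff_seconds(failures))
```

Taking the maximum rather than the sum means a peer that sends a short
Retry-After cannot shorten the exponential wait, and a long Retry-After
is honored even on the first failure.
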
Merge pull request #309 from BigBodyCobain/feat/cross-node-dm-mailbox-replication
2026-06-03 12:58:11 +02:00 · 2026-05-22 22:55:05 -06:00 · 2026-05-22 22:43:26 -06:00 · 2026-05-22 19:23:09 -06:00 · 2026-05-22 19:18:01 -06:00 · 2026-05-22 18:56:08 -06:00
13 changed files with 1938 additions and 78 deletions
@@ -1417,6 +1417,29 @@ def _peer_sync_response(peer_url: str, body: dict[str, Any]) -> dict[str, Any]:
        proxy = f"socks5h://127.0.0.1:{socks_port}"
        kwargs["proxies"] = {"http": proxy, "https": proxy}
    response = _requests.post(f"{normalized}/api/mesh/infonet/sync", **kwargs)
+    # HTTP 429 must be surfaced as a typed exception carrying the
+    # Retry-After value, so finish_sync can honor it and stop hammering
+    # the upstream. Pre-fix this path just stringified the status into
+    # a ValueError, which finish_sync then ignored — keeping the
+    # upstream's rate-limit bucket full indefinitely.
+    if response.status_code == 429:
+        from services.mesh.mesh_infonet_sync_support import (
+            PeerSyncRateLimited,
+            parse_retry_after_header,
+        )
+
+        retry_after_s = parse_retry_after_header(
+            response.headers.get("Retry-After", "") or "",
+        )
+        try:
+            body_text = response.text[:200]
+        except Exception:
+            body_text = ""
+        raise PeerSyncRateLimited(
+            f"HTTP 429 from {normalized} (retry_after={retry_after_s}s): {body_text}",
+            retry_after_s=retry_after_s,
+            status=429,
+        )
    try:
        payload = response.json()
    except Exception as exc:
@@ -1462,8 +1485,23 @@ def _hydrate_gate_store_from_chain(events: list[dict]) -> int:
    return count


-def _sync_from_peer(peer_url: str, *, page_limit: int = 100, max_rounds: int = 5) -> tuple[bool, str, bool]:
+def _sync_from_peer(
+    peer_url: str,
+    *,
+    page_limit: int = 100,
+    max_rounds: int = 5,
+) -> tuple[bool, str, bool, int]:
+    """Sync the local Infonet chain against ``peer_url``.
+
+    Returns ``(ok, error, forked, retry_after_s)``. The fourth tuple
+    element is non-zero only when the peer responded with HTTP 429
+    and supplied a parseable ``Retry-After`` header — see the typed
+    ``PeerSyncRateLimited`` exception in mesh_infonet_sync_support.py.
+    Callers should pass that value to ``finish_sync(retry_after_s=...)``
+    so the next attempt actually waits.
+    """
    from services.mesh.mesh_hashchain import infonet
+    from services.mesh.mesh_infonet_sync_support import PeerSyncRateLimited

    rounds = 0
    while rounds < max_rounds:
@@ -1472,7 +1510,11 @@ def _sync_from_peer(peer_url: str, *, page_limit: int = 100, max_rounds: int = 5
            "locator": infonet.get_locator(),
            "limit": page_limit,
        }
-        payload = _peer_sync_response(peer_url, body)
+        try:
+            payload = _peer_sync_response(peer_url, body)
+        except PeerSyncRateLimited as exc:
+            # Bubble up the retry-after so finish_sync can honor it.
+            return False, str(exc), False, exc.retry_after_s
        if bool(payload.get("forked")):
            # Auto-recover small local forks: if the local chain is tiny
            # (< 20 events) and the remote has a longer chain, reset local
@@ -1488,23 +1530,23 @@ def _sync_from_peer(peer_url: str, *, page_limit: int = 100, max_rounds: int = 5
                )
                infonet.reset_chain()
                continue  # retry sync with clean genesis locator
-            return False, "fork detected", True
+            return False, "fork detected", True, 0
        events = payload.get("events", [])
        if not isinstance(events, list):
-            return False, "peer sync events must be a list", False
+            return False, "peer sync events must be a list", False, 0
        if not events:
-            return True, "", False
+            return True, "", False, 0
        result = infonet.ingest_events(events)
        _hydrate_gate_store_from_chain(events)
        rejected = list(result.get("rejected", []) or [])
        if rejected:
-            return False, f"sync ingest rejected {len(rejected)} event(s)", False
+            return False, f"sync ingest rejected {len(rejected)} event(s)", False, 0
        if int(result.get("accepted", 0) or 0) == 0 and int(result.get("duplicates", 0) or 0) >= len(events):
-            return True, "", False
+            return True, "", False, 0
        if len(events) < page_limit:
-            return True, "", False
+            return True, "", False, 0
        rounds += 1
-    return True, "", False
+    return True, "", False, 0


 def _run_public_sync_cycle() -> SyncWorkerState:
@@ -1567,11 +1609,12 @@ def _run_public_sync_cycle() -> SyncWorkerState:
        with _NODE_RUNTIME_LOCK:
            set_sync_state(started)
        try:
-            ok, error, forked = _sync_from_peer(record.peer_url)
+            ok, error, forked, retry_after_s = _sync_from_peer(record.peer_url)
        except Exception as exc:
            ok = False
            error = str(exc or type(exc).__name__)
            forked = False
+            retry_after_s = 0
        if ok:
            store.mark_seen(record.peer_url, "sync", now=time.time())
            store.mark_sync_success(record.peer_url, now=time.time())
@@ -1618,6 +1661,12 @@ def _run_public_sync_cycle() -> SyncWorkerState:
            now=time.time(),
            interval_s=int(get_settings().MESH_SYNC_INTERVAL_S or 300),
            failure_backoff_s=failure_backoff_s,
+            # 429 retry-storm fix: when the peer returned HTTP 429 with
+            # a Retry-After header, finish_sync uses max(exponential,
+            # retry_after) for next_sync_due_at — so we actually wait
+            # the time the upstream asked for instead of hammering
+            # every 60s and keeping its rate-limit bucket full forever.
+            retry_after_s=retry_after_s,
        )
        with _NODE_RUNTIME_LOCK:
            set_sync_state(updated)
@@ -2521,45 +2521,85 @@ async def api_capabilities(request: Request):
 # OpenClaw Connection Management (local-operator only — NOT via HMAC)
 # These endpoints manage the HMAC secret itself, so they MUST require
 # local operator access to prevent privilege escalation.
+#
+# Issue #302 (tg12): pre-fix, GET /api/ai/connect-info had two problems:
+#
+#   1. ``?reveal=true`` made the full secret travel through every operator
+#      page-load that opened the Connect modal. Even gated to
+#      ``require_local_operator``, that put the secret into browser
+#      history, dev-tools network panels, browser disk caches, HAR
+#      exports, and screen captures. Every time the modal opened.
+#
+#   2. The same GET endpoint auto-bootstrapped (generated + persisted)
+#      the secret on first read. Side effects on a GET are a footgun:
+#      browser prefetchers, mirror tools, and casual curl-from-history
+#      would all silently mint+persist a fresh secret. (Gated, but
+#      still surprising — and noisy in the audit log.)
+#
+# Resolution:
+#
+#   GET  /api/ai/connect-info             — always returns the MASKED
+#                                            secret. No ?reveal param.
+#                                            No auto-bootstrap; if the
+#                                            secret is missing,
+#                                            ``hmac_secret_set: false``
+#                                            tells the frontend to call
+#                                            /bootstrap.
+#
+#   POST /api/ai/connect-info/bootstrap   — NEW. Generates + persists the
+#                                            secret if missing. Idempotent.
+#                                            Returns metadata only, never
+#                                            the full secret.
+#
+#   POST /api/ai/connect-info/reveal      — NEW. Returns the full secret in
+#                                            the body with strict
+#                                            ``Cache-Control: no-store,
+#                                            no-cache, must-revalidate``
+#                                            + ``Pragma: no-cache`` so
+#                                            it does not land in browser
+#                                            caches. POST means it does
+#                                            not land in URL history.
+#
+#   POST /api/ai/connect-info/regenerate  — keeps existing one-time-reveal
+#                                            behavior (regenerate IS a
+#                                            deliberate destructive action
+#                                            the operator triggered, so
+#                                            displaying the new secret
+#                                            once is the only path that
+#                                            makes the operation useful).
+#                                            Same no-store headers added.
 # ---------------------------------------------------------------------------

-@router.get("/api/ai/connect-info", dependencies=[Depends(require_local_operator)])
-@limiter.limit("30/minute")
-async def get_connect_info(request: Request, reveal: bool = False):
-    """Return connection details for the OpenClaw Connect modal.
+# Cache-Control headers that should accompany every response carrying the
+# full HMAC secret. Reused across the reveal + regenerate endpoints so a
+# future refactor that splits or renames them can't forget the headers.
+_NO_STORE_HEADERS = {
+    "Cache-Control": "no-store, no-cache, must-revalidate, private",
+    "Pragma": "no-cache",
+    "Expires": "0",
+}

-    The HMAC secret is masked by default. Pass ?reveal=true to see the full key.
-    Private keys are NEVER returned.
+
+def _mask_hmac_secret(secret: str) -> str:
+    """Return a fingerprint-style mask (first6 + bullets + last4) suitable
+    for display in the UI before the operator clicks Reveal."""
+    if not secret:
+        return ""
+    if len(secret) > 10:
+        return secret[:6] + "••••••••" + secret[-4:]
+    return "••••••••"
+
+
+def _connect_info_metadata(settings) -> dict:
+    """Return everything the Connect modal needs EXCEPT the secret itself.
+
+    Shared between GET /api/ai/connect-info (where the full secret is
+    masked) and POST /api/ai/connect-info/bootstrap (where the operator
+    just generated a secret but we don't return it inline — they have to
+    call /reveal to see it).
    """
-    import os
-    import secrets
-    from services.config import get_settings
-
-    settings = get_settings()
-    hmac_secret = str(settings.OPENCLAW_HMAC_SECRET or "").strip()
    access_tier = str(settings.OPENCLAW_ACCESS_TIER or "restricted").strip().lower()
-
-    # Auto-generate if not set
-    if not hmac_secret:
-        hmac_secret = secrets.token_hex(24)  # 48 chars
-        _write_env_value("OPENCLAW_HMAC_SECRET", hmac_secret)
-        # Clear settings cache so next read picks up the new value
-        get_settings.cache_clear()
-
-    masked = hmac_secret[:6] + "••••••••" + hmac_secret[-4:] if len(hmac_secret) > 10 else "••••••••"
-
    return {
-        "ok": True,
-        "hmac_secret": hmac_secret if reveal else masked,
-        "hmac_secret_set": bool(hmac_secret),
-        "bootstrap_behavior": {
-            "auto_generates_when_missing": True,
-            "auto_generated_this_call": not bool(settings.OPENCLAW_HMAC_SECRET or ""),
-            "notes": [
-                "If no HMAC secret exists yet, this endpoint bootstraps one and persists it to .env.",
-                "Regenerating the HMAC secret revokes all existing direct-mode OpenClaw callers at once.",
-            ],
-        },
        "access_tier": access_tier,
        "trust_model": {
            "remote_http_principal": "holder_of_openclaw_hmac_secret",
@@ -2613,24 +2653,138 @@ async def get_connect_info(request: Request, reveal: bool = False):
    }


-@router.post("/api/ai/connect-info/regenerate", dependencies=[Depends(require_local_operator)])
-@limiter.limit("5/minute")
-async def regenerate_hmac_secret(request: Request):
-    """Generate a new HMAC secret. Old secret immediately stops working."""
+@router.get("/api/ai/connect-info", dependencies=[Depends(require_local_operator)])
+@limiter.limit("30/minute")
+async def get_connect_info(request: Request):
+    """Return connection details for the OpenClaw Connect modal.
+
+    The HMAC secret is always returned as a fingerprint mask
+    (``first6 + bullets + last4``); the full value is only ever served by
+    ``POST /api/ai/connect-info/reveal`` (see #302). When the secret has
+    not been bootstrapped yet, ``hmac_secret_set`` is false and the
+    frontend should call ``POST /api/ai/connect-info/bootstrap``.
+
+    Private keys are NEVER returned.
+    """
+    from services.config import get_settings
+
+    settings = get_settings()
+    hmac_secret = str(settings.OPENCLAW_HMAC_SECRET or "").strip()
+
+    return {
+        "ok": True,
+        "masked_hmac_secret": _mask_hmac_secret(hmac_secret),
+        "hmac_secret_set": bool(hmac_secret),
+        "bootstrap_behavior": {
+            "auto_generates_when_missing": False,
+            "notes": [
+                "Call POST /api/ai/connect-info/bootstrap to mint a secret on first use.",
+                "Call POST /api/ai/connect-info/reveal to see the full secret (no-store).",
+                "Regenerating the HMAC secret revokes all existing direct-mode OpenClaw callers at once.",
+            ],
+        },
+        **_connect_info_metadata(settings),
+    }
+
+
+@router.post("/api/ai/connect-info/bootstrap", dependencies=[Depends(require_local_operator)])
+@limiter.limit("10/minute")
+async def bootstrap_hmac_secret(request: Request):
+    """Mint and persist the OpenClaw HMAC secret if it isn't already set.
+
+    Idempotent: if a secret already exists, returns ``generated: false``
+    and leaves the existing secret untouched. Never returns the secret
+    value in the response body — the operator calls
+    ``POST /api/ai/connect-info/reveal`` to see it.
+    """
    import secrets
    from services.config import get_settings

+    settings = get_settings()
+    existing = str(settings.OPENCLAW_HMAC_SECRET or "").strip()
+    if existing:
+        return {
+            "ok": True,
+            "generated": False,
+            "hmac_secret_set": True,
+            "masked_hmac_secret": _mask_hmac_secret(existing),
+            "detail": "HMAC secret already configured. Use /reveal to see it.",
+        }
+
    new_secret = secrets.token_hex(24)  # 48 chars
    _write_env_value("OPENCLAW_HMAC_SECRET", new_secret)
    get_settings.cache_clear()

    return {
        "ok": True,
-        "hmac_secret": new_secret,
-        "detail": "HMAC secret regenerated. Update your OpenClaw agent configuration.",
+        "generated": True,
+        "hmac_secret_set": True,
+        "masked_hmac_secret": _mask_hmac_secret(new_secret),
+        "detail": "HMAC secret generated. Call /reveal to copy it into your OpenClaw config.",
    }


+@router.post("/api/ai/connect-info/reveal", dependencies=[Depends(require_local_operator)])
+@limiter.limit("10/minute")
+async def reveal_hmac_secret(request: Request):
+    """Return the full HMAC secret in the response body.
+
+    POST (not GET) so the secret never lands in URL history, access logs,
+    or browser visit history. Strict ``Cache-Control: no-store`` headers
+    prevent intermediaries from persisting the response. Returns 404 if
+    no secret has been bootstrapped — the frontend should call
+    ``POST /api/ai/connect-info/bootstrap`` first.
+    """
+    from services.config import get_settings
+
+    settings = get_settings()
+    hmac_secret = str(settings.OPENCLAW_HMAC_SECRET or "").strip()
+    if not hmac_secret:
+        raise HTTPException(
+            404,
+            "No HMAC secret configured. Call POST /api/ai/connect-info/bootstrap first.",
+        )
+    return JSONResponse(
+        content={
+            "ok": True,
+            "hmac_secret": hmac_secret,
+            "masked_hmac_secret": _mask_hmac_secret(hmac_secret),
+        },
+        headers=_NO_STORE_HEADERS,
+    )
+
+
+@router.post("/api/ai/connect-info/regenerate", dependencies=[Depends(require_local_operator)])
+@limiter.limit("5/minute")
+async def regenerate_hmac_secret(request: Request):
+    """Generate a new HMAC secret. Old secret immediately stops working.
+
+    Returns the new secret in the response body — this is the only
+    operation where the full secret travels back through the response,
+    because regenerating IS a deliberate destructive action the operator
+    triggered and they need to see the new value once to update their
+    OpenClaw configuration. Strict ``Cache-Control: no-store`` headers
+    keep it from being persisted by browser caches, proxies, or HAR
+    capture tooling.
+    """
+    import secrets
+    from services.config import get_settings
+
+    new_secret = secrets.token_hex(24)  # 48 chars
+    _write_env_value("OPENCLAW_HMAC_SECRET", new_secret)
+    get_settings.cache_clear()
+
+    return JSONResponse(
+        content={
+            "ok": True,
+            "hmac_secret": new_secret,
+            "masked_hmac_secret": _mask_hmac_secret(new_secret),
+            "detail": "HMAC secret regenerated. Update your OpenClaw agent configuration.",
+        },
+        headers=_NO_STORE_HEADERS,
+    )
+
+
@router.put("/api/ai/connect-info/access-tier", dependencies=[Depends(require_local_operator)])
@limiter.limit("10/minute")
 async def set_access_tier(request: Request, body: dict):
@@ -85,6 +85,64 @@ async def infonet_peer_push(request: Request):
    return {"ok": True, **result}


+@router.post("/api/mesh/dm/replicate-envelope")
+@limiter.limit("60/minute")
+async def dm_replicate_envelope(request: Request):
+    """Accept a DM envelope replicated from a peer relay (cross-node mailbox).
+
+    Companion endpoint to ``DMRelay.replicate_to_peers`` (outbound, in
+    ``mesh_dm_relay.py``). The sender's relay POSTs an encrypted DM
+    envelope here after a successful local ``deposit``; this endpoint
+    re-enforces the per-(sender, recipient) anti-spam cap and stores
+    the envelope in the local mailbox if accepted.
+
+    The cap is the network rule: a hostile sender's relay can spool
+    extras locally, but every honest peer enforces the cap on inbound
+    replication. Recipient polling from any honest peer therefore
+    never sees more than ``MESH_DM_PENDING_PER_SENDER_LIMIT`` pending
+    from any one sender, no matter how many spam attempts were tried.
+
+    Same HMAC auth pattern as ``infonet_peer_push`` and ``gate_peer_push``.
+    """
+    content_length = request.headers.get("content-length")
+    if content_length:
+        try:
+            # DM envelopes are bounded by MESH_DM_MAX_MSG_BYTES + envelope
+            # overhead; 64 KB is a generous ceiling.
+            if int(content_length) > 65_536:
+                return Response(
+                    content='{"ok":false,"detail":"Request body too large (max 64KB)"}',
+                    status_code=413, media_type="application/json",
+                )
+        except (ValueError, TypeError):
+            pass
+    body_bytes = await request.body()
+    if not _verify_peer_push_hmac(request, body_bytes):
+        return Response(
+            content='{"ok":false,"detail":"Invalid or missing peer HMAC"}',
+            status_code=403, media_type="application/json",
+        )
+    try:
+        body = json_mod.loads(body_bytes or b"{}")
+    except (ValueError, TypeError):
+        return Response(
+            content='{"ok":false,"detail":"Invalid JSON body"}',
+            status_code=400, media_type="application/json",
+        )
+    envelope = body.get("envelope")
+    if not isinstance(envelope, dict):
+        return {"ok": False, "detail": "envelope must be an object"}
+
+    originating_peer = _peer_hmac_url_from_request(request) or ""
+
+    from services.mesh.mesh_dm_relay import dm_relay
+    result = dm_relay.accept_replica(
+        envelope=envelope,
+        originating_peer_url=originating_peer,
+    )
+    return result
+
+
@router.post("/api/mesh/gate/peer-push")
@limiter.limit("30/minute")
 async def gate_peer_push(request: Request):
@@ -116,6 +116,21 @@ class Settings(BaseSettings):
    MESH_DM_REQUEST_MAILBOX_LIMIT: int = 12
    MESH_DM_SHARED_MAILBOX_LIMIT: int = 48
    MESH_DM_SELF_MAILBOX_LIMIT: int = 12
+    # Anti-spam: cap on distinct UNACKED messages a single sender can have
+    # parked in a single recipient's mailbox at any one time. Once the
+    # recipient pulls (acks) a message, the sender's quota for that pair
+    # frees up. Default 2 — a sender who wants to deliver more must wait
+    # for the recipient to actually read the prior messages.
+    #
+    # This cap is enforced TWICE: once on the local deposit path (the
+    # sender's own node refuses to spool the 3rd message) AND once on
+    # the replication-acceptance path (honest peer relays refuse to
+    # accept inbound replicas that would put them over the cap). The
+    # double enforcement makes the rule a NETWORK rule — patching out
+    # the local check on a hostile sender's relay doesn't let extras
+    # propagate, because every honest peer enforces the same cap on
+    # inbound replication.
+    MESH_DM_PENDING_PER_SENDER_LIMIT: int = 2
    MESH_BLOCK_LEGACY_AGENT_ID_LOOKUP: bool = True
    MESH_ALLOW_COMPAT_DM_INVITE_IMPORT: bool = False
    MESH_ALLOW_COMPAT_DM_INVITE_IMPORT_UNTIL: str = ""
@@ -317,6 +317,39 @@ class DMRelay:
    def _self_mailbox_limit(self) -> int:
        return max(1, int(self._settings().MESH_DM_SELF_MAILBOX_LIMIT))

+    def _per_sender_pending_limit(self) -> int:
+        """Anti-spam cap on UNACKED messages a single sender can have parked
+        in a single recipient mailbox at any one time. See ``config.py``
+        ``MESH_DM_PENDING_PER_SENDER_LIMIT`` for the threat model — this
+        rule is enforced both at ``deposit`` (local) and at
+        ``accept_replica`` (peer push acceptance), making it a network
+        rule rather than a client-side honor system."""
+        try:
+            limit = int(getattr(self._settings(), "MESH_DM_PENDING_PER_SENDER_LIMIT", 2) or 2)
+        except (TypeError, ValueError):
+            limit = 2
+        return max(1, limit)
+
+    def _per_sender_pending_count(
+        self,
+        *,
+        mailbox_key: str,
+        sender_block_ref: str,
+    ) -> int:
+        """Count UNACKED messages from ``sender_block_ref`` currently parked
+        in ``mailbox_key``. Caller already holds ``self._lock``.
+
+        Messages that have been claimed/acked are removed from the mailbox
+        list (see ``claim_message_ids``), so anything still here is by
+        definition unacked. We count by exact ``sender_block_ref`` match
+        — that's the per-pair sender identity used for blocking too, so
+        the cap is naturally per-(sender, recipient).
+        """
+        if not mailbox_key or not sender_block_ref:
+            return 0
+        messages = self._mailboxes.get(mailbox_key, [])
+        return sum(1 for m in messages if m.sender_block_ref == sender_block_ref)
+
    def _nonce_ttl_seconds(self) -> int:
        return max(30, int(self._settings().MESH_DM_NONCE_TTL_S))

@@ -1515,6 +1548,29 @@ class DMRelay:
            if len(self._mailboxes[mailbox_key]) >= self._mailbox_limit_for_class(delivery_class):
                metrics_inc("dm_drop_full")
                return {"ok": False, "detail": "Recipient mailbox full"}
+            # Anti-spam: per-(sender, recipient) cap on unacked messages.
+            # A sender who already has the configured number of messages
+            # parked in this mailbox can't deposit more until the recipient
+            # pulls (acks) at least one. The same cap is re-enforced on
+            # inbound replication in ``accept_replica`` so this rule isn't
+            # bypassable by patching out the local check on a hostile
+            # sender's relay — see config.py
+            # MESH_DM_PENDING_PER_SENDER_LIMIT for the threat model.
+            per_sender_limit = self._per_sender_pending_limit()
+            pending = self._per_sender_pending_count(
+                mailbox_key=mailbox_key,
+                sender_block_ref=sender_block_ref,
+            )
+            if pending >= per_sender_limit:
+                metrics_inc("dm_drop_per_sender_cap")
+                return {
+                    "ok": False,
+                    "detail": (
+                        f"Recipient already has {pending} unread message"
+                        f"{'s' if pending != 1 else ''} from you. Wait for "
+                        "them to read your messages before sending more."
+                    ),
+                }
            if not msg_id:
                msg_id = f"dm_{int(time.time() * 1000)}_{secrets.token_hex(6)}"
            elif any(m.msg_id == msg_id for m in self._mailboxes[mailbox_key]):
@@ -1539,8 +1595,245 @@ class DMRelay:
            )
            self._stats["messages_in_memory"] = sum(len(v) for v in self._mailboxes.values())
            self._save()
+            # Cross-node mailbox replication: push the freshly-stored
+            # envelope to every authenticated relay peer so the recipient
+            # can log into ANY node and find their messages. The push is
+            # async (fire-and-forget thread) so deposit() returns
+            # immediately — slow Tor peers can't block the sender's UX.
+            # Each receiving peer re-enforces the per-sender cap on
+            # acceptance, so hostile relays can't widen the cap.
+            try:
+                envelope_for_push = self.envelope_for_replication(
+                    mailbox_key=mailbox_key, msg_id=msg_id,
+                )
+                if envelope_for_push:
+                    self._replicate_envelope_to_peers_async(
+                        envelope=envelope_for_push,
+                    )
+            except Exception:
+                metrics_inc("dm_replication_push_error")
            return {"ok": True, "msg_id": msg_id}

+    def accept_replica(
+        self,
+        *,
+        envelope: dict[str, Any],
+        originating_peer_url: str = "",
+    ) -> dict[str, Any]:
+        """Receive a DM envelope replicated from a peer relay.
+
+        Cross-node mailbox replication entry point. When a sender's local
+        relay accepts a ``deposit`` and pushes the envelope to
+        ``MESH_RELAY_PEERS`` (so the recipient can log into any peer
+        node and find their messages), each receiving peer calls
+        ``accept_replica`` to ingest it.
+
+        The per-(sender, recipient) cap is re-enforced HERE. That's what
+        makes the rule a NETWORK rule rather than a client-side honor
+        system: a hostile sender who patches out the local ``deposit``
+        check still can't get a 3rd unacked message to spread, because
+        every honest peer enforces the same cap on inbound replicas.
+        Result: hostile relays can hold extras locally, but those extras
+        never reach any node a legitimate recipient is polling from.
+
+        Returns the same shape as ``deposit`` so the calling endpoint can
+        forward the result back to the originating peer.
+        """
+        if not isinstance(envelope, dict):
+            return {"ok": False, "detail": "envelope must be an object"}
+        msg_id = str(envelope.get("msg_id", "") or "").strip()
+        mailbox_key = str(envelope.get("mailbox_key", "") or "").strip()
+        sender_block_ref = str(envelope.get("sender_block_ref", "") or "").strip()
+        ciphertext = str(envelope.get("ciphertext", "") or "")
+        if not msg_id or not mailbox_key or not sender_block_ref or not ciphertext:
+            return {"ok": False, "detail": "envelope missing required fields"}
+
+        with self._lock:
+            self._refresh_from_shared_relay()
+            self._cleanup_expired()
+
+            # Idempotent — if we already hold this exact msg_id, the
+            # replication round-tripped or a peer pushed the same
+            # envelope through multiple paths. Accept silently.
+            if any(m.msg_id == msg_id for m in self._mailboxes.get(mailbox_key, [])):
+                metrics_inc("dm_replica_duplicate")
+                return {"ok": True, "msg_id": msg_id, "duplicate": True}
+
+            # Same per-class cap as the deposit path — defense in depth
+            # against a peer that wraps a "deposit" as a "replica" to
+            # bypass the class limit.
+            delivery_class = str(envelope.get("delivery_class", "") or "")
+            if delivery_class in ("request", "shared", "self"):
+                class_limit = self._mailbox_limit_for_class(delivery_class)
+            else:
+                class_limit = self._shared_mailbox_limit()
+            if len(self._mailboxes.get(mailbox_key, [])) >= class_limit:
+                metrics_inc("dm_replica_drop_full")
+                return {"ok": False, "detail": "Recipient mailbox full"}
+
+            # THE network rule: per-(sender, recipient) anti-spam cap.
+            per_sender_limit = self._per_sender_pending_limit()
+            pending = self._per_sender_pending_count(
+                mailbox_key=mailbox_key,
+                sender_block_ref=sender_block_ref,
+            )
+            if pending >= per_sender_limit:
+                metrics_inc("dm_replica_drop_per_sender_cap")
+                # Returning a structured rejection — the sender's relay
+                # learns its envelope was rejected by an honest peer and
+                # can stop trying to push it.
+                return {
+                    "ok": False,
+                    "detail": (
+                        "Per-sender cap reached on this relay; refusing replica"
+                    ),
+                    "cap_violation": True,
+                    "pending": pending,
+                    "limit": per_sender_limit,
+                }
+
+            # Accept the replica into the local mailbox.
+            self._mailboxes[mailbox_key].append(
+                DMMessage(
+                    sender_id=str(envelope.get("sender_id", "") or ""),
+                    ciphertext=ciphertext,
+                    timestamp=float(envelope.get("timestamp", time.time()) or time.time()),
+                    msg_id=msg_id,
+                    delivery_class=str(envelope.get("delivery_class", "shared") or "shared"),
+                    sender_seal=str(envelope.get("sender_seal", "") or ""),
+                    relay_salt=str(envelope.get("relay_salt", "") or ""),
+                    sender_block_ref=sender_block_ref,
+                    payload_format=str(envelope.get("payload_format", "dm1") or "dm1"),
+                    session_welcome=str(envelope.get("session_welcome", "") or ""),
+                )
+            )
+            self._stats["messages_in_memory"] = sum(len(v) for v in self._mailboxes.values())
+            self._save()
+            metrics_inc("dm_replica_accepted")
+            return {"ok": True, "msg_id": msg_id}
+
+    def _replicate_envelope_to_peers_async(
+        self,
+        *,
+        envelope: dict[str, Any],
+    ) -> None:
+        """Push an outbound DM envelope to every authenticated relay peer.
+
+        Fire-and-forget: spawned in a background thread so ``deposit``
+        returns to the caller immediately. Per-peer errors are logged
+        and swallowed — the sender's UX must not block on slow Tor
+        peers, and a peer that's down today gets the next message
+        whenever it comes back. Inbound recipient polling from a healthy
+        peer keeps the system functional during peer failures.
+
+        Each peer is authed with the existing per-peer HMAC pattern
+        (#256) — same headers and key resolver gate-message replication
+        uses, so a hostile node that doesn't know any peer's HMAC key
+        can't impersonate a legitimate relay.
+        """
+        import threading
+
+        def _do_push():
+            try:
+                import hashlib
+                import hmac
+                import requests as _requests
+
+                from services.mesh.mesh_crypto import (
+                    normalize_peer_url,
+                    resolve_peer_key_for_url,
+                )
+                from services.mesh.mesh_router import (
+                    authenticated_push_peer_urls,
+                )
+
+                peers = authenticated_push_peer_urls()
+                if not peers:
+                    return
+
+                payload = json.dumps(
+                    {"envelope": envelope},
+                    separators=(",", ":"),
+                    ensure_ascii=False,
+                ).encode("utf-8")
+
+                timeout = max(
+                    1,
+                    int(getattr(self._settings(), "MESH_RELAY_PUSH_TIMEOUT_S", 10) or 10),
+                )
+
+                for peer_url in peers:
+                    try:
+                        normalized = normalize_peer_url(peer_url)
+                        headers = {"Content-Type": "application/json"}
+                        peer_key = resolve_peer_key_for_url(normalized)
+                        if peer_key:
+                            headers["X-Peer-Url"] = normalized
+                            headers["X-Peer-HMAC"] = hmac.new(
+                                peer_key, payload, hashlib.sha256
+                            ).hexdigest()
+                        url = f"{peer_url}/api/mesh/dm/replicate-envelope"
+                        resp = _requests.post(
+                            url, data=payload, timeout=timeout, headers=headers,
+                        )
+                        if resp.status_code == 200:
+                            metrics_inc("dm_replication_push_ok")
+                        else:
+                            # Any non-200, including the structured
+                            # cap_violation rejection from accept_replica.
+                            # Counted as rejected; not retried here.
+                            metrics_inc("dm_replication_push_rejected")
+                    except Exception:
+                        # Per-peer failure is non-fatal — log to metrics
+                        # but don't break the loop. Other peers and a
+                        # future retry can still propagate the envelope.
+                        metrics_inc("dm_replication_push_error")
+                        continue
+            except Exception:
+                # Outer guard — never let replication errors propagate
+                # back to the sender's deposit() caller.
+                metrics_inc("dm_replication_push_error")
+
+        thread = threading.Thread(
+            target=_do_push,
+            name="dm-replicate-push",
+            daemon=True,
+        )
+        thread.start()
+
+    def envelope_for_replication(
+        self,
+        *,
+        mailbox_key: str,
+        msg_id: str,
+    ) -> dict[str, Any] | None:
+        """Return the wire-form envelope for a stored message, suitable
+        for POSTing to a peer relay's replicate-envelope endpoint.
+
+        Returns ``None`` if the message isn't in the mailbox (already
+        acked, expired, never existed). The caller holds the
+        responsibility for transport security (Tor SOCKS for .onion
+        peers, per-peer HMAC) and for not leaking the envelope to
+        clearnet peers when private transport is required.
+        """
+        with self._lock:
+            for m in self._mailboxes.get(mailbox_key, []):
+                if m.msg_id == msg_id:
+                    return {
+                        "msg_id": m.msg_id,
+                        "mailbox_key": mailbox_key,
+                        "sender_id": m.sender_id,
+                        "sender_block_ref": m.sender_block_ref,
+                        "sender_seal": m.sender_seal,
+                        "ciphertext": m.ciphertext,
+                        "timestamp": m.timestamp,
+                        "delivery_class": m.delivery_class,
+                        "relay_salt": m.relay_salt,
+                        "payload_format": m.payload_format,
+                        "session_welcome": m.session_welcome,
+                    }
+        return None
+
    def is_blocked(self, recipient_id: str, sender_id: str) -> bool:
        with self._lock:
            self._refresh_from_shared_relay()
@@ -2,10 +2,64 @@ from __future__ import annotations

 import time
 from dataclasses import asdict, dataclass
+from email.utils import parsedate_to_datetime
+from datetime import timezone

 from services.mesh.mesh_peer_store import PeerRecord


+class PeerSyncRateLimited(Exception):
+    """Upstream peer returned HTTP 429 — Too Many Requests.
+
+    Carries the ``Retry-After`` header value (parsed to seconds) so
+    the caller can pass it to ``finish_sync(retry_after_s=...)`` and
+    actually wait that long instead of hammering the upstream every
+    60s and keeping its rate-limit bucket full.
+
+    ``retry_after_s`` is 0 when the upstream didn't provide a header.
+    Caller should still apply the exponential backoff in that case.
+    """
+
+    def __init__(self, message: str, retry_after_s: int = 0, status: int = 429):
+        super().__init__(message)
+        self.retry_after_s = max(0, int(retry_after_s or 0))
+        self.status = int(status or 429)
+
+
+def parse_retry_after_header(header_value: str, *, now: float | None = None) -> int:
+    """Parse the ``Retry-After`` HTTP header.
+
+    Two valid forms per RFC 7231 §7.1.3:
+
+      * Delay-seconds: a non-negative integer (e.g. ``Retry-After: 120``)
+      * HTTP-date: an absolute time (e.g. ``Retry-After: Wed, 21 Oct 2026 07:28:00 GMT``)
+
+    Returns the wait in **seconds from now**. Unparseable / empty headers
+    return 0 (caller falls back to exponential backoff). Clamped at a
+    sane upper bound (1 hour) so a typo'd or hostile peer can't pin us
+    silent for days.
+    """
+    value = str(header_value or "").strip()
+    if not value:
+        return 0
+    upper_bound = 3600  # never trust a peer to silence us > 1h
+    # Form 1: pure integer seconds.
+    if value.isdigit():
+        return min(max(0, int(value)), upper_bound)
+    # Form 2: HTTP-date.
+    try:
+        target = parsedate_to_datetime(value)
+        if target is None:
+            return 0
+        if target.tzinfo is None:
+            target = target.replace(tzinfo=timezone.utc)
+        current = float(now if now is not None else time.time())
+        delta = int(target.timestamp() - current)
+        return min(max(0, delta), upper_bound)
+    except (TypeError, ValueError):
+        return 0
+
+
@dataclass(frozen=True)
 class SyncWorkerState:
    last_sync_started_at: int = 0
@@ -72,6 +126,59 @@ def begin_sync(
    )


+def _failure_backoff_seconds(
+    *,
+    base_backoff_s: int,
+    consecutive_failures: int,
+    retry_after_s: int,
+    cap_s: int = 1800,
+) -> int:
+    """Compute the next-attempt delay after a failed sync.
+
+    Two inputs combine:
+
+    * ``retry_after_s`` — when an upstream peer answered HTTP 429
+      with a ``Retry-After`` header, we wait at least that long. Continuing
+      to hammer the upstream every 60s is the bug this fix exists to
+      close: it keeps the upstream's rate-limit bucket full
+      indefinitely and no sync ever lands.
+
+    * Exponential growth on ``consecutive_failures`` — even without an
+      explicit Retry-After, repeated failures should slow us down. The
+      first failure waits ``base`` (preserves pre-fix behavior for
+      one-off blips). Each subsequent failure doubles the wait, capped
+      to ``cap_s`` (default 30 minutes). With base=60 and cap=1800,
+      the schedule is 60s → 120s → 240s → 480s → 960s → 1800s →
+      1800s → … .
+
+    The actual delay is the MAX of the two — whichever asks for more
+    patience wins. ``retry_after_s == 0`` (no header) falls back to
+    pure exponential. An aggressive ``Retry-After`` (say 600s while
+    we're only at 1 failure) wins over the exponential ladder.
+    """
+    base = max(0, int(base_backoff_s or 0))
+    failures = max(0, int(consecutive_failures or 0))
+    cap = max(0, int(cap_s or 0))
+    retry_after = max(0, int(retry_after_s or 0))
+    # ``cap_s=0`` explicitly disables the exponential ladder entirely:
+    # operators who want pure Retry-After behavior can set this (the
+    # delay is then 0 when no header was sent). The default cap of
+    # 1800s saturates the ladder at the 6th failure for base=60.
+    if cap == 0:
+        return retry_after
+    # 2^(failures-1) — so failure #1 = base (preserves the pre-fix
+    # default for transient blips), failure #2 = 2*base, etc. Cap on
+    # the exponent (16) keeps the intermediate product small for a
+    # hostile or very large failures counter.
+    if base > 0 and failures > 0:
+        exponent = min(max(0, failures - 1), 16)
+        grown = base * (2 ** exponent)
+    else:
+        grown = 0
+    exponential = min(max(0, grown), cap)
+    return max(exponential, retry_after)
+
+
 def finish_sync(
    state: SyncWorkerState,
    *,
@@ -83,7 +190,26 @@ def finish_sync(
    now: float | None = None,
    interval_s: int = 300,
    failure_backoff_s: int = 60,
+    retry_after_s: int = 0,
+    failure_backoff_cap_s: int = 1800,
 ) -> SyncWorkerState:
+    """Finalise a sync attempt and compute when the next one should run.
+
+    New args (added for the 429 retry storm fix):
+
+    * ``retry_after_s`` — if the peer responded with HTTP 429 + a
+      ``Retry-After`` header, pass that value here. ``finish_sync``
+      will use ``max(exponential, retry_after_s)`` for the delay so
+      we never hammer a peer that asked us to back off.
+    * ``failure_backoff_cap_s`` — upper bound on the exponential
+      ladder. Default 1800 (30 min) — keeps a sync queue from going
+      silent for hours while still cutting the request rate to
+      something the upstream can absorb.
+
+    The pre-fix behavior (constant 60s on every failure) is recoverable
+    by passing ``failure_backoff_cap_s=60`` and ``retry_after_s=0``. A
+    cap of 0 instead makes ``retry_after_s`` the whole delay (often 0).
+    """
    timestamp = int(now if now is not None else time.time())
    if ok:
        return SyncWorkerState(
@@ -99,17 +225,25 @@ def finish_sync(
            consecutive_failures=0,
        )

+    next_failures = state.consecutive_failures + 1
+    delay_s = _failure_backoff_seconds(
+        base_backoff_s=failure_backoff_s,
+        consecutive_failures=next_failures,
+        retry_after_s=retry_after_s,
+        cap_s=failure_backoff_cap_s,
+    )
+
    return SyncWorkerState(
        last_sync_started_at=state.last_sync_started_at,
        last_sync_finished_at=timestamp,
        last_sync_ok_at=state.last_sync_ok_at,
-        next_sync_due_at=timestamp + max(0, int(failure_backoff_s or 0)),
+        next_sync_due_at=timestamp + delay_s,
        last_peer_url=peer_url or state.last_peer_url,
        last_error=str(error or "").strip(),
        last_outcome="fork" if fork_detected else "error",
        current_head=current_head or state.current_head,
        fork_detected=bool(fork_detected),
-        consecutive_failures=state.consecutive_failures + 1,
+        consecutive_failures=next_failures,
    )


@@ -0,0 +1,261 @@
+"""Infonet sync respects upstream HTTP 429 + applies exponential backoff.
+
+Background
+----------
+Before this fix, ``finish_sync`` used a constant 60s ``failure_backoff_s``
+regardless of how many consecutive failures preceded. When an upstream
+peer (e.g. the seed onion) returned HTTP 429 "Too Many Requests", the
+sync worker would:
+
+  1. Receive 429
+  2. Stringify the status into a generic ``ValueError``
+  3. Call ``finish_sync(error=str(exc))`` -- losing the status code
+  4. Schedule next attempt for ``now + 60s``
+  5. Retry. Upstream's rate-limit bucket is still full. 429 again. Loop.
+
+Net effect: a node with one transient 429 would hammer the upstream
+every 60s forever, keeping the bucket full and never recovering. This
+is what kept the user's Infonet node from reaching the seed peer.
+
+What the fix does
+-----------------
+* New typed exception ``PeerSyncRateLimited`` carries the parsed
+  ``Retry-After`` value out of the HTTP layer.
+* ``_sync_from_peer`` returns ``(ok, error, forked, retry_after_s)``
+  instead of the old 3-tuple.
+* ``finish_sync`` honors ``retry_after_s`` AND applies exponential backoff:
+  ``delay = max(retry_after_s, min(base * 2^(failures-1), cap))``, cap=1800.
+* ``parse_retry_after_header`` handles both RFC 7231 forms (delay
+  seconds, and HTTP-date).
+
+These tests pin every part of the new contract.
+"""
+
+from __future__ import annotations
+
+import time
+
+import pytest
+
+
+# ---------------------------------------------------------------------------
+# parse_retry_after_header — both RFC 7231 forms + edge cases
+# ---------------------------------------------------------------------------
+
+
+class TestParseRetryAfter:
+    def test_integer_seconds(self):
+        from services.mesh.mesh_infonet_sync_support import parse_retry_after_header
+        assert parse_retry_after_header("120") == 120
+        assert parse_retry_after_header("  30  ") == 30
+        assert parse_retry_after_header("0") == 0
+
+    def test_http_date(self):
+        """RFC 7231 §7.1.3 explicitly allows ``Retry-After: <HTTP-date>``.
+        We compute seconds-from-now so callers can use the same field
+        regardless of which form the upstream chose."""
+        from services.mesh.mesh_infonet_sync_support import parse_retry_after_header
+        # Pin "now" so the test is deterministic.
+        now = 1_700_000_000.0  # 2023-11-14T22:13:20Z
+        # 300 seconds in the future, formatted per RFC 7231.
+        future = "Tue, 14 Nov 2023 22:18:20 GMT"
+        result = parse_retry_after_header(future, now=now)
+        assert 295 <= result <= 305, f"expected ~300s, got {result}"
+
+    def test_http_date_in_past_returns_zero(self):
+        from services.mesh.mesh_infonet_sync_support import parse_retry_after_header
+        now = 1_700_000_000.0
+        past = "Mon, 13 Nov 2023 00:00:00 GMT"
+        assert parse_retry_after_header(past, now=now) == 0
+
+    def test_empty_and_whitespace_return_zero(self):
+        from services.mesh.mesh_infonet_sync_support import parse_retry_after_header
+        assert parse_retry_after_header("") == 0
+        assert parse_retry_after_header("   ") == 0
+
+    def test_malformed_returns_zero(self):
+        from services.mesh.mesh_infonet_sync_support import parse_retry_after_header
+        assert parse_retry_after_header("not a header") == 0
+        assert parse_retry_after_header("xyz") == 0
+
+    def test_clamps_to_one_hour(self):
+        """A hostile peer can't silence us for a day by claiming a
+        24h Retry-After. We cap at 1 hour."""
+        from services.mesh.mesh_infonet_sync_support import parse_retry_after_header
+        assert parse_retry_after_header("86400") == 3600  # 24h -> 1h
+        assert parse_retry_after_header("99999999") == 3600
+
+    def test_negative_returns_zero(self):
+        """RFC 7231 defines delay-seconds as a non-negative integer; a
+        leading minus fails both the digit and HTTP-date parses: 0."""
+        from services.mesh.mesh_infonet_sync_support import parse_retry_after_header
+        assert parse_retry_after_header("-10") == 0
+
+
+# ---------------------------------------------------------------------------
+# _failure_backoff_seconds — exponential growth, retry-after override, cap
+# ---------------------------------------------------------------------------
+
+
+class TestFailureBackoffSeconds:
+    def test_exponential_growth(self):
+        """First failure uses the base (preserves pre-fix behavior
+        for one-off blips). Each subsequent failure doubles the wait,
+        capped at 1800s. With base=60: 60, 120, 240, 480, 960, 1800,
+        1800, 1800."""
+        from services.mesh.mesh_infonet_sync_support import _failure_backoff_seconds
+        delays = [
+            _failure_backoff_seconds(
+                base_backoff_s=60,
+                consecutive_failures=n,
+                retry_after_s=0,
+                cap_s=1800,
+            )
+            for n in range(1, 9)
+        ]
+        assert delays == [60, 120, 240, 480, 960, 1800, 1800, 1800], delays
+
+    def test_retry_after_wins_when_larger(self):
+        """If the upstream says ``Retry-After: 600`` but exponential
+        would only ask for 60s (one failure), we honor the upstream."""
+        from services.mesh.mesh_infonet_sync_support import _failure_backoff_seconds
+        assert _failure_backoff_seconds(
+            base_backoff_s=60,
+            consecutive_failures=1,
+            retry_after_s=600,
+            cap_s=1800,
+        ) == 600
+
+    def test_exponential_wins_when_larger(self):
+        """If exponential is asking for 1800s (6+ failures) but
+        upstream only sent ``Retry-After: 30``, we honor exponential.
+        The 30s was the upstream's view at one moment; our exponential
+        reflects sustained failure."""
+        from services.mesh.mesh_infonet_sync_support import _failure_backoff_seconds
+        result = _failure_backoff_seconds(
+            base_backoff_s=60,
+            consecutive_failures=7,
+            retry_after_s=30,
+            cap_s=1800,
+        )
+        assert result == 1800
+
+    def test_cap_zero_disables_exponential(self):
+        """Operators who want pure Retry-After behavior can set cap=0;
+        the exponential ladder is skipped and only the upstream's
+        Retry-After is respected, even after 10 failures."""
+        from services.mesh.mesh_infonet_sync_support import _failure_backoff_seconds
+        assert _failure_backoff_seconds(
+            base_backoff_s=60,
+            consecutive_failures=10,
+            retry_after_s=120,
+            cap_s=0,
+        ) == 120
+
+    def test_zero_inputs_return_zero(self):
+        from services.mesh.mesh_infonet_sync_support import _failure_backoff_seconds
+        assert _failure_backoff_seconds(
+            base_backoff_s=0,
+            consecutive_failures=0,
+            retry_after_s=0,
+        ) == 0
+
+
+# ---------------------------------------------------------------------------
+# finish_sync end-to-end — failure path with retry-after + growing counter
+# ---------------------------------------------------------------------------
+
+
+class TestFinishSyncBackoff:
+    def _state(self, **overrides):
+        from services.mesh.mesh_infonet_sync_support import SyncWorkerState
+        base = {
+            "last_sync_started_at": 0,
+            "last_sync_finished_at": 0,
+            "last_sync_ok_at": 0,
+            "next_sync_due_at": 0,
+            "last_peer_url": "",
+            "last_error": "",
+            "last_outcome": "idle",
+            "current_head": "",
+            "fork_detected": False,
+            "consecutive_failures": 0,
+        }
+        base.update(overrides)
+        return SyncWorkerState(**base)
+
+    def test_first_failure_uses_base_unchanged(self):
+        """One failure means consecutive_failures becomes 1, which uses
+        ``base * 2^0 = base``. Preserves the pre-fix behavior so a
+        single transient upstream blip doesn't suddenly take 2 minutes
+        to retry — that change has to be earned by sustained failure."""
+        from services.mesh.mesh_infonet_sync_support import finish_sync
+        result = finish_sync(
+            self._state(),
+            ok=False,
+            error="some upstream blip",
+            now=1000.0,
+            failure_backoff_s=60,
+        )
+        assert result.consecutive_failures == 1
+        assert result.next_sync_due_at == 1000 + 60
+        assert result.last_error == "some upstream blip"
+        assert result.last_outcome == "error"
+
+    def test_consecutive_failures_grow_the_delay(self):
+        """After 5 prior failures already in state, the next failure
+        sets consecutive=6 and hits the cap (60 * 2^5 = 1920s -> 1800s)."""
+        from services.mesh.mesh_infonet_sync_support import finish_sync
+        result = finish_sync(
+            self._state(consecutive_failures=5),
+            ok=False,
+            error="HTTP 429",
+            now=2000.0,
+            failure_backoff_s=60,
+        )
+        assert result.consecutive_failures == 6
+        assert result.next_sync_due_at == 2000 + 1800
+
+    def test_retry_after_honored_at_low_failure_count(self):
+        """When the upstream says ``Retry-After: 900`` but we'd
+        otherwise only wait 480s (4 failures = 60*2^3), wait 900s."""
+        from services.mesh.mesh_infonet_sync_support import finish_sync
+        result = finish_sync(
+            self._state(consecutive_failures=3),
+            ok=False,
+            error="HTTP 429",
+            now=5000.0,
+            failure_backoff_s=60,
+            retry_after_s=900,
+        )
+        assert result.consecutive_failures == 4
+        assert result.next_sync_due_at == 5000 + 900
+
+    def test_success_resets_consecutive_failures(self):
+        from services.mesh.mesh_infonet_sync_support import finish_sync
+        result = finish_sync(
+            self._state(consecutive_failures=4),
+            ok=True,
+            now=7000.0,
+            interval_s=300,
+        )
+        assert result.consecutive_failures == 0
+        assert result.next_sync_due_at == 7000 + 300
+        assert result.last_outcome == "ok"
+
+    def test_last_error_carries_status_string(self):
+        """The pre-fix path stringified exceptions into ``last_error``
+        but the string was often empty (HTTP layer raised ValueError
+        with no message). We now require callers to pass something
+        meaningful — see the typed exception path in main.py."""
+        from services.mesh.mesh_infonet_sync_support import finish_sync
+        result = finish_sync(
+            self._state(),
+            ok=False,
+            error="HTTP 429 from peer (retry_after=120s): rate-limited",
+            now=1000.0,
+            failure_backoff_s=60,
+            retry_after_s=120,
+        )
+        assert "HTTP 429" in result.last_error
+        assert "retry_after=120s" in result.last_error
@@ -0,0 +1,270 @@
+"""Per-(sender, recipient) anti-spam cap on the DM relay.
+
+The user-stated rule: a single sender can have at most N UNACKED messages
+parked in a single recipient's mailbox at any one time (N=2 by default).
+Once the recipient pulls a message, the sender's quota for that pair
+frees up.
+
+Network rule, not local rule
+-----------------------------
+The cap is enforced TWICE:
+
+1. ``DMRelay.deposit(...)`` -- local check on the sender's own node.
+   Refuses to spool the (N+1)th message before it can be replicated.
+
+2. ``DMRelay.accept_replica(...)`` -- replication-acceptance check on
+   every receiving peer. Refuses to accept an inbound replica that
+   would put the local mailbox over the cap, even if the originating
+   peer claims it had cap room.
+
+The double enforcement matters because cap (1) is client-side -- a
+hostile relay could patch it out and continue to spool extras locally.
+Cap (2) means those extras can't propagate: every honest peer rejects
+them on the way in. A recipient who polls from honest peers therefore
+never sees more than N pending from any one sender, regardless of how
+many spam attempts the sender's own relay accepted.
+
+These tests pin both halves of the rule.
+"""
+
+from __future__ import annotations
+
+import time
+
+import pytest
+
+
+@pytest.fixture
+def relay():
+    """Fresh ``DMRelay`` per test."""
+    from services.mesh.mesh_dm_relay import DMRelay
+    r = DMRelay()
+    r._mailboxes.clear()
+    r._blocks.clear()
+    r._stats = {"messages_in_memory": 0}
+    return r
+
+
+def _deposit(
+    relay,
+    *,
+    sender: str = "alice",
+    recipient_token: str = "bob_mailbox_token_abc",
+    ciphertext: str = "ciphertext-blob",
+    msg_id: str = "",
+):
+    """Convenience wrapper using ``shared`` delivery class."""
+    return relay.deposit(
+        sender_id=sender,
+        raw_sender_id=sender,
+        recipient_id="bob",
+        ciphertext=ciphertext,
+        msg_id=msg_id,
+        delivery_class="shared",
+        recipient_token=recipient_token,
+    )
+
+
+# ---------------------------------------------------------------------------
+# Local cap on ``deposit``
+# ---------------------------------------------------------------------------
+
+
+class TestDepositCap:
+    def test_two_deposits_from_same_sender_succeed(self, relay):
+        r1 = _deposit(relay)
+        r2 = _deposit(relay)
+        assert r1["ok"] is True
+        assert r2["ok"] is True
+        assert r1["msg_id"] != r2["msg_id"]
+
+    def test_third_deposit_from_same_sender_rejected(self, relay):
+        _deposit(relay)
+        _deposit(relay)
+        r3 = _deposit(relay)
+        assert r3["ok"] is False
+        detail = r3["detail"].lower()
+        assert "unread" in detail or "read your messages" in detail
+
+    def test_different_senders_have_independent_quotas(self, relay):
+        for _ in range(2):
+            assert _deposit(relay, sender="alice")["ok"] is True
+        for _ in range(2):
+            assert _deposit(relay, sender="carol")["ok"] is True
+        assert _deposit(relay, sender="carol")["ok"] is False
+
+    def test_different_recipients_have_independent_quotas(self, relay):
+        for _ in range(2):
+            assert _deposit(relay, sender="alice", recipient_token="bob_token")["ok"] is True
+        for _ in range(2):
+            assert _deposit(relay, sender="alice", recipient_token="dave_token")["ok"] is True
+
+    def test_ack_frees_quota(self, relay):
+        r1 = _deposit(relay)
+        _deposit(relay)
+        assert _deposit(relay)["ok"] is False
+
+        mailbox_key = relay._hashed_mailbox_token("bob_mailbox_token_abc")
+        relay._mailboxes[mailbox_key] = [
+            m for m in relay._mailboxes[mailbox_key]
+            if m.msg_id != r1["msg_id"]
+        ]
+        relay._stats["messages_in_memory"] = sum(
+            len(v) for v in relay._mailboxes.values()
+        )
+
+        r3 = _deposit(relay)
+        assert r3["ok"] is True, f"expected quota free after ack, got: {r3}"
+
+    def test_cap_is_env_tunable(self, relay, monkeypatch):
+        import services.mesh.mesh_dm_relay as mdr
+        monkeypatch.setattr(
+            mdr.DMRelay,
+            "_per_sender_pending_limit",
+            lambda self: 1,
+        )
+
+        assert _deposit(relay)["ok"] is True
+        assert _deposit(relay)["ok"] is False
+
+
+# ---------------------------------------------------------------------------
+# Replication-acceptance cap (the half that makes this a network rule)
+# ---------------------------------------------------------------------------
+
+
+class TestAcceptReplicaCap:
+    def _envelope(self, *, msg_id: str, sender_block_ref: str, mailbox_key: str):
+        return {
+            "msg_id": msg_id,
+            "mailbox_key": mailbox_key,
+            "sender_block_ref": sender_block_ref,
+            "sender_id": "alice",
+            "sender_seal": "",
+            "ciphertext": f"ciphertext-{msg_id}",
+            "timestamp": time.time(),
+            "delivery_class": "shared",
+            "relay_salt": "",
+            "payload_format": "dm1",
+            "session_welcome": "",
+        }
+
+    def test_replica_accepted_under_cap(self, relay):
+        env = self._envelope(
+            msg_id="dm_replica_1",
+            sender_block_ref="alice_block_ref",
+            mailbox_key="mailbox_xyz",
+        )
+        result = relay.accept_replica(envelope=env)
+        assert result["ok"] is True
+
+    def test_replica_idempotent_on_duplicate_msg_id(self, relay):
+        mailbox_key = "mailbox_xyz"
+        env = self._envelope(
+            msg_id="dm_dup_1",
+            sender_block_ref="alice_block_ref",
+            mailbox_key=mailbox_key,
+        )
+        r1 = relay.accept_replica(envelope=env)
+        r2 = relay.accept_replica(envelope=env)
+        assert r1["ok"] is True
+        assert r2["ok"] is True
+        assert r2.get("duplicate") is True
+        assert len(relay._mailboxes[mailbox_key]) == 1
+
+    def test_replica_rejected_when_local_count_already_at_cap(self, relay):
+        mailbox_key = "mailbox_xyz"
+        for i in (1, 2):
+            relay.accept_replica(envelope=self._envelope(
+                msg_id=f"dm_seeded_{i}",
+                sender_block_ref="alice_block_ref",
+                mailbox_key=mailbox_key,
+            ))
+
+        result = relay.accept_replica(envelope=self._envelope(
+            msg_id="dm_overcap_3",
+            sender_block_ref="alice_block_ref",
+            mailbox_key=mailbox_key,
+        ))
+        assert result["ok"] is False
+        assert result.get("cap_violation") is True
+        assert result.get("pending") == 2
+        assert result.get("limit") == 2
+        assert len(relay._mailboxes[mailbox_key]) == 2
+
+    def test_replica_from_different_sender_passes_when_one_is_at_cap(self, relay):
+        mailbox_key = "mailbox_xyz"
+        for i in (1, 2):
+            relay.accept_replica(envelope=self._envelope(
+                msg_id=f"dm_alice_{i}",
+                sender_block_ref="alice_block_ref",
+                mailbox_key=mailbox_key,
+            ))
+        assert relay.accept_replica(envelope=self._envelope(
+            msg_id="dm_alice_3",
+            sender_block_ref="alice_block_ref",
+            mailbox_key=mailbox_key,
+        ))["ok"] is False
+        assert relay.accept_replica(envelope=self._envelope(
+            msg_id="dm_carol_1",
+            sender_block_ref="carol_block_ref",
+            mailbox_key=mailbox_key,
+        ))["ok"] is True
+
+    def test_replica_rejects_malformed_envelopes(self, relay):
+        for bad in (
+            {},
+            {"msg_id": "x"},
+            {"msg_id": "x", "mailbox_key": "y"},
+            "not an object at all",
+        ):
+            result = relay.accept_replica(envelope=bad)
+            assert result["ok"] is False
+
+
+# ---------------------------------------------------------------------------
+# ``envelope_for_replication`` -- helper for the outbound replication path
+# ---------------------------------------------------------------------------
+
+
+class TestEnvelopeForReplication:
+    def test_returns_envelope_for_stored_message(self, relay):
+        r = _deposit(relay, ciphertext="hello-ciphertext")
+        msg_id = r["msg_id"]
+        mailbox_key = relay._hashed_mailbox_token("bob_mailbox_token_abc")
+
+        env = relay.envelope_for_replication(mailbox_key=mailbox_key, msg_id=msg_id)
+        assert env is not None
+        assert env["msg_id"] == msg_id
+        assert env["mailbox_key"] == mailbox_key
+        assert env["ciphertext"] == "hello-ciphertext"
+        assert env["delivery_class"] == "shared"
+        for k in ("msg_id", "mailbox_key", "sender_block_ref", "ciphertext"):
+            assert env.get(k), f"envelope missing required field {k!r}"
+
+    def test_returns_none_for_unknown_message(self, relay):
+        env = relay.envelope_for_replication(
+            mailbox_key="never_existed", msg_id="never_existed",
+        )
+        assert env is None
+
+    def test_envelope_round_trips_through_accept_replica(self, relay):
+        from services.mesh.mesh_dm_relay import DMRelay
+        receiver_relay = DMRelay()
+        receiver_relay._mailboxes.clear()
+        receiver_relay._stats = {"messages_in_memory": 0}
+
+        r = _deposit(relay)
+        msg_id = r["msg_id"]
+        mailbox_key = relay._hashed_mailbox_token("bob_mailbox_token_abc")
+        env = relay.envelope_for_replication(
+            mailbox_key=mailbox_key, msg_id=msg_id,
+        )
+        assert env is not None
+
+        result = receiver_relay.accept_replica(envelope=env)
+        assert result["ok"] is True
+        stored = receiver_relay._mailboxes.get(mailbox_key, [])
+        assert len(stored) == 1
+        assert stored[0].msg_id == msg_id
+        assert stored[0].ciphertext == "ciphertext-blob"
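Reviewer note: the replica-side contract pinned by TestAcceptReplicaCap above can be summarised with a small toy model. This is only a sketch under stated assumptions, not the DMRelay implementation: `ToyRelay`, `REQUIRED_FIELDS`, and `make_envelope` are hypothetical names, the limit of 2 mirrors the value the tests assert, and treating `ciphertext` as required is an assumption taken from the field list checked in TestEnvelopeForReplication.

```python
from collections import defaultdict

# Assumed required fields, taken from the envelope checks in the tests above.
REQUIRED_FIELDS = ("msg_id", "mailbox_key", "sender_block_ref", "ciphertext")


class ToyRelay:
    """Toy stand-in for the relay; NOT the real DMRelay."""

    def __init__(self, limit=2):
        self.limit = limit
        # mailbox_key -> list of stored envelope dicts
        self.mailboxes = defaultdict(list)

    def accept_replica(self, envelope):
        # Malformed input (non-dict or missing fields) is rejected outright.
        if not isinstance(envelope, dict):
            return {"ok": False}
        if not all(envelope.get(field) for field in REQUIRED_FIELDS):
            return {"ok": False}

        box = self.mailboxes[envelope["mailbox_key"]]

        # Re-delivery of a known msg_id is a no-op, not a second copy.
        if any(m["msg_id"] == envelope["msg_id"] for m in box):
            return {"ok": True, "duplicate": True}

        # Count what this sender already has pending in this mailbox.
        pending = sum(
            1 for m in box
            if m["sender_block_ref"] == envelope["sender_block_ref"]
        )
        if pending >= self.limit:
            return {
                "ok": False,
                "cap_violation": True,
                "pending": pending,
                "limit": self.limit,
            }

        box.append(dict(envelope))
        return {"ok": True}


def make_envelope(msg_id, sender_block_ref, mailbox_key="mailbox_xyz"):
    """Build a minimal envelope for the toy relay (hypothetical helper)."""
    return {
        "msg_id": msg_id,
        "mailbox_key": mailbox_key,
        "sender_block_ref": sender_block_ref,
        "ciphertext": f"ciphertext-{msg_id}",
    }
```

The cap is counted per (sender_block_ref, mailbox_key) pair, which is why a second sender still gets through when the first is at the limit.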
@@ -0,0 +1,150 @@
+"""POST /api/mesh/dm/replicate-envelope — receiving side of cross-node DM
+mailbox replication.
+
+This is the endpoint that peer relays call when they want to hand off an
+encrypted DM envelope to us (so the recipient can log into our node and
+find their messages). It re-enforces the per-(sender, recipient) anti-spam
+cap so hostile sender relays can't widen the cap by skipping the local
+check on their own deposit path.
+
+The endpoint:
+
+  * authenticates the caller via the existing per-peer HMAC pattern
+    (same one /api/mesh/infonet/peer-push and /api/mesh/gate/peer-push
+    use, introduced in #256 — ``X-Peer-Url`` + ``X-Peer-HMAC`` headers
+    keyed off ``resolve_peer_key_for_url``)
+  * rejects bodies > 64 KB (DM envelope size is bounded by
+    ``MESH_DM_MAX_MSG_BYTES`` — 64KB ceiling has generous headroom)
+  * rejects requests without a valid peer HMAC with 403
+  * passes the envelope to ``DMRelay.accept_replica`` which enforces
+    the cap
+
+This file pins the endpoint contract. The cap enforcement itself is
+tested in ``test_dm_relay_per_sender_cap.py`` against the relay's
+``accept_replica`` method directly.
+"""
+
+from __future__ import annotations
+
+import asyncio
+import hashlib
+import hmac
+import json
+
+import pytest
+from httpx import ASGITransport, AsyncClient
+
+
+@pytest.fixture
+def remote_client():
+    """ASGI client with peer IP 1.2.3.4 — never on the local-operator
+    allowlist. Used to prove the endpoint isn't accidentally reachable
+    by random remote callers without peer HMAC."""
+    from main import app
+
+    class _RemoteClient:
+        def __init__(self):
+            self._loop = asyncio.new_event_loop()
+            self._transport = ASGITransport(app=app, client=("1.2.3.4", 12345))
+            self._base = "http://1.2.3.4:8000"
+
+        def post(self, url, **kw):
+            async def go():
+                async with AsyncClient(transport=self._transport, base_url=self._base) as ac:
+                    return await ac.post(url, **kw)
+            return self._loop.run_until_complete(go())
+
+        def close(self):
+            self._loop.close()
+
+    c = _RemoteClient()
+    yield c
+    c.close()
+
+
+class TestReplicateEndpointAuth:
+    def test_rejects_request_without_peer_hmac(self, remote_client):
+        """A peer push that does NOT carry X-Peer-Url + X-Peer-HMAC
+        must be rejected with 403 before the envelope is ever passed
+        to the relay. Same gate the existing infonet/gate peer-push
+        endpoints enforce."""
+        payload = {
+            "envelope": {
+                "msg_id": "dm_unauth_1",
+                "mailbox_key": "mb",
+                "sender_block_ref": "sender",
+                "ciphertext": "x",
+            },
+        }
+        r = remote_client.post(
+            "/api/mesh/dm/replicate-envelope",
+            json=payload,
+        )
+        assert r.status_code == 403
+        assert "peer hmac" in r.text.lower()
+
+    def test_rejects_wrong_peer_hmac(self, remote_client, monkeypatch):
+        """A request with a peer HMAC header keyed off the WRONG secret
+        is rejected. Confirms the HMAC is actually verified — a tampered
+        body or a key-substitution attack doesn't sneak through."""
+        # Plant a known peer secret. The request will sign with a
+        # DIFFERENT key, so verification must fail.
+        from services.config import get_settings
+        monkeypatch.setenv("MESH_PEER_PUSH_SECRET", "real-secret-32-chars-min-padding-padding")
+        get_settings.cache_clear()
+
+        body = json.dumps({
+            "envelope": {
+                "msg_id": "dm_wronghmac",
+                "mailbox_key": "mb",
+                "sender_block_ref": "sender",
+                "ciphertext": "x",
+            },
+        }).encode("utf-8")
+        wrong_hmac = hmac.new(b"wrong-key", body, hashlib.sha256).hexdigest()
+        r = remote_client.post(
+            "/api/mesh/dm/replicate-envelope",
+            content=body,
+            headers={
+                "Content-Type": "application/json",
+                "X-Peer-Url": "http://example-peer.onion:8000",
+                "X-Peer-HMAC": wrong_hmac,
+            },
+        )
+        assert r.status_code == 403
+
+    def test_rejects_oversize_body(self, remote_client):
+        """64 KB ceiling — anything bigger doesn't even get parsed.
+        Defends against memory amplification via giant ciphertexts."""
+        # 100 KB body is well over the 64 KB cap.
+        big = b"{" + b"x" * 100_000 + b"}"
+        r = remote_client.post(
+            "/api/mesh/dm/replicate-envelope",
+            content=big,
+            headers={
+                "Content-Type": "application/json",
+                "Content-Length": str(len(big)),
+            },
+        )
+        assert r.status_code in (400, 413), (
+            f"oversize body should be rejected with 400/413, got {r.status_code}"
+        )
+
+
+class TestReplicateEndpointRegistered:
+    def test_route_present_in_app(self):
+        """Static check that the route is actually wired into the app.
+        Catches a future refactor that drops the router include or
+        deletes the endpoint by accident."""
+        from main import app
+
+        paths_methods = set()
+        for route in app.routes:
+            path = getattr(route, "path", None)
+            methods = getattr(route, "methods", set()) or set()
+            for m in methods:
+                paths_methods.add((m, path))
+
+        assert ("POST", "/api/mesh/dm/replicate-envelope") in paths_methods, (
+            "POST /api/mesh/dm/replicate-envelope is not registered on the app"
+        )
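Reviewer note: for readers unfamiliar with the peer-HMAC pattern these tests exercise, here is a minimal sketch of signing and verifying a raw request body with HMAC-SHA256, the same primitive test_rejects_wrong_peer_hmac uses to forge a bad digest. The helper names and the literal keys are hypothetical; how the real endpoint derives the key (``resolve_peer_key_for_url``) and what exactly it signs are not shown in this diff and are not assumed here.

```python
import hashlib
import hmac
import json


def sign_peer_body(key: bytes, body: bytes) -> str:
    """Hex HMAC-SHA256 over the raw body bytes (hypothetical helper)."""
    return hmac.new(key, body, hashlib.sha256).hexdigest()


def verify_peer_body(key: bytes, body: bytes, presented: str) -> bool:
    """Recompute the digest and compare it in constant time."""
    return hmac.compare_digest(sign_peer_body(key, body), presented)


# Illustrative values only; the keys below are made up for the example.
body = json.dumps({"envelope": {"msg_id": "dm_example"}}).encode("utf-8")
good_digest = sign_peer_body(b"shared-peer-key", body)
bad_digest = sign_peer_body(b"wrong-key", body)
```

Signing the exact bytes that go on the wire (``content=body`` rather than ``json=payload``) matters: re-serialising the JSON on either side could change the byte sequence and invalidate the digest.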
@@ -0,0 +1,334 @@
+"""Issue #302 (tg12): OpenClaw connect-info HMAC secret disclosure.
+
+Before this change, ``GET /api/ai/connect-info?reveal=true`` returned the
+full HMAC secret in the response body on every modal open AND the same
+GET endpoint auto-bootstrapped (generated + persisted) the secret on a
+mere read. Even gated to ``require_local_operator``, that put the full
+secret into:
+
+  * browser visit history
+  * dev-tools network panel
+  * browser disk cache
+  * HAR exports
+  * screen captures / shoulder-surfing
+
+Every single time the OpenClaw Connect modal opened.
+
+After this change:
+
+  GET  /api/ai/connect-info            — always returns the MASKED
+                                          fingerprint. No ?reveal param.
+                                          No side effects (auto-bootstrap
+                                          gone).
+  POST /api/ai/connect-info/bootstrap  — mints+persists the secret if
+                                          missing. Idempotent. Never
+                                          returns the full secret.
+  POST /api/ai/connect-info/reveal     — returns the full secret with
+                                          strict Cache-Control: no-store
+                                          headers. POST so the body
+                                          doesn't land in URL history.
+  POST /api/ai/connect-info/regenerate — keeps the one-time-disclosure
+                                          for the new secret (regen IS a
+                                          deliberate destructive action).
+                                          Same no-store headers added.
+
+These tests pin every property.
+"""
+
+from __future__ import annotations
+
+import asyncio
+from unittest.mock import patch
+
+import pytest
+from httpx import ASGITransport, AsyncClient
+
+
+# ---------------------------------------------------------------------------
+# Loopback test client. ``require_local_operator`` resolves true for
+# request.client.host == "127.0.0.1"; FastAPI's TestClient sets it to
+# "testclient" which isn't on the allowlist. Use raw ASGITransport.
+# ---------------------------------------------------------------------------
+
+
+@pytest.fixture
+def loopback():
+    from main import app
+
+    class _Client:
+        def __init__(self, peer_ip: str = "127.0.0.1"):
+            self._loop = asyncio.new_event_loop()
+            self._transport = ASGITransport(app=app, client=(peer_ip, 12345))
+            self._base = f"http://{peer_ip}:8000"
+
+        def _do(self, method: str, url: str, **kw):
+            async def go():
+                async with AsyncClient(transport=self._transport, base_url=self._base) as ac:
+                    return await ac.request(method, url, **kw)
+            return self._loop.run_until_complete(go())
+
+        def get(self, url, **kw):  return self._do("GET", url, **kw)
+        def post(self, url, **kw): return self._do("POST", url, **kw)
+        def close(self): self._loop.close()
+
+    c = _Client()
+    yield c
+    c.close()
+
+
+@pytest.fixture
+def remote():
+    from main import app
+
+    class _Client:
+        def __init__(self):
+            self._loop = asyncio.new_event_loop()
+            self._transport = ASGITransport(app=app, client=("1.2.3.4", 12345))
+            self._base = "http://1.2.3.4:8000"
+
+        def _do(self, method: str, url: str, **kw):
+            async def go():
+                async with AsyncClient(transport=self._transport, base_url=self._base) as ac:
+                    return await ac.request(method, url, **kw)
+            return self._loop.run_until_complete(go())
+
+        def get(self, url, **kw):  return self._do("GET", url, **kw)
+        def post(self, url, **kw): return self._do("POST", url, **kw)
+        def close(self): self._loop.close()
+
+    c = _Client()
+    yield c
+    c.close()
+
+
+@pytest.fixture
+def stub_env(monkeypatch):
+    """Isolate connect-info tests from the dev's real backend .env.
+
+    Pydantic ``Settings()`` reads the ``.env`` file directly on
+    instantiation, so monkey-patching ``os.environ`` isn't sufficient
+    — the real ``OPENCLAW_HMAC_SECRET`` would leak through. Instead we
+    override ``get_settings()`` in the route module to return a fresh
+    ``Settings`` instance whose env values are driven entirely by an
+    in-test dict, AND we replace ``_write_env_value`` so writes update
+    that same dict instead of touching the developer's filesystem.
+
+    Yields the dict so individual tests can pre-seed values or assert
+    that writes happened.
+    """
+    import routers.ai_intel as ai_intel
+    import services.config as config
+
+    state: dict[str, str] = {}
+
+    class _FakeSettings:
+        @property
+        def OPENCLAW_HMAC_SECRET(self) -> str:
+            return state.get("OPENCLAW_HMAC_SECRET", "")
+
+        @property
+        def OPENCLAW_ACCESS_TIER(self) -> str:
+            return state.get("OPENCLAW_ACCESS_TIER", "restricted")
+
+    fake = _FakeSettings()
+
+    def _fake_get_settings():
+        return fake
+
+    # Route code calls ``get_settings.cache_clear()`` after writing the
+    # env. The production version is wrapped with ``@lru_cache``, so
+    # cache_clear exists. Attach a no-op shim here.
+    _fake_get_settings.cache_clear = lambda: None  # type: ignore[attr-defined]
+
+    monkeypatch.setattr(config, "get_settings", _fake_get_settings)
+
+    def _fake_write_env_value(key: str, value: str) -> None:
+        state[key] = value
+
+    monkeypatch.setattr(ai_intel, "_write_env_value", _fake_write_env_value)
+
+    yield state
+
+
+# ---------------------------------------------------------------------------
+# GET /api/ai/connect-info — always masked, no auto-bootstrap
+# ---------------------------------------------------------------------------
+
+
+class TestGetConnectInfoMasking:
+    def test_returns_masked_when_secret_set(self, loopback, stub_env):
+        secret = "abcdef" + "0" * 38 + "wxyz"
+        stub_env["OPENCLAW_HMAC_SECRET"] = secret
+
+        r = loopback.get("/api/ai/connect-info")
+        assert r.status_code == 200
+        body = r.json()
+        # Body must NOT carry the full secret value anywhere.
+        assert secret not in r.text, (
+            "GET /api/ai/connect-info MUST NOT include the full HMAC "
+            "secret. Response body contained the secret value."
+        )
+        assert body["hmac_secret_set"] is True
+        assert body["masked_hmac_secret"].startswith("abcdef")
+        assert body["masked_hmac_secret"].endswith("wxyz")
+        assert "•" in body["masked_hmac_secret"]
+        # Pre-fix field is gone.
+        assert "hmac_secret" not in body
+
+    def test_no_auto_bootstrap_when_secret_missing(self, loopback, stub_env):
+        """Side-effect-on-GET was the second half of issue #302. A GET
+        with no secret configured must NOT mint one — that should
+        require an explicit POST /bootstrap."""
+        r = loopback.get("/api/ai/connect-info")
+        assert r.status_code == 200
+        body = r.json()
+        assert body["hmac_secret_set"] is False
+        assert body["masked_hmac_secret"] == ""
+        # The bootstrap_behavior block should advertise the new flow.
+        assert body["bootstrap_behavior"]["auto_generates_when_missing"] is False
+        # And no _write_env_value call happened.
+        assert "OPENCLAW_HMAC_SECRET" not in stub_env
+
+    def test_no_reveal_query_param(self, loopback, stub_env):
+        """Pre-fix, ?reveal=true would return the full secret. Post-fix
+        the param is silently ignored — the response is the same as
+        without it (still masked, no leak)."""
+        secret = "abcdef" + "0" * 38 + "wxyz"
+        stub_env["OPENCLAW_HMAC_SECRET"] = secret
+
+        r = loopback.get("/api/ai/connect-info?reveal=true")
+        assert r.status_code == 200
+        assert secret not in r.text, (
+            "?reveal=true must be a no-op on GET — the full secret "
+            "MUST NOT come back in the response body."
+        )
+
+
+# ---------------------------------------------------------------------------
+# POST /api/ai/connect-info/bootstrap
+# ---------------------------------------------------------------------------
+
+
+class TestBootstrap:
+    def test_mints_when_missing(self, loopback, stub_env):
+        r = loopback.post("/api/ai/connect-info/bootstrap")
+        assert r.status_code == 200
+        body = r.json()
+        assert body["ok"] is True
+        assert body["generated"] is True
+        assert body["hmac_secret_set"] is True
+        # Bootstrap must NOT return the full secret in-line.
+        assert not body.get("hmac_secret")
+        assert "•" in body["masked_hmac_secret"]
+        # _write_env_value was actually called.
+        assert stub_env.get("OPENCLAW_HMAC_SECRET")
+        # The full value isn't echoed back in the response text either.
+        assert stub_env["OPENCLAW_HMAC_SECRET"] not in r.text
+
+    def test_idempotent_when_already_set(self, loopback, stub_env):
+        existing = "abcdef" + "0" * 38 + "wxyz"
+        stub_env["OPENCLAW_HMAC_SECRET"] = existing
+
+        r = loopback.post("/api/ai/connect-info/bootstrap")
+        assert r.status_code == 200
+        body = r.json()
+        assert body["ok"] is True
+        assert body["generated"] is False
+        assert body["hmac_secret_set"] is True
+        # Existing secret untouched — value is still the seeded one.
+        assert stub_env["OPENCLAW_HMAC_SECRET"] == existing
+        # No full secret in the response.
+        assert existing not in r.text
+
+
+# ---------------------------------------------------------------------------
+# POST /api/ai/connect-info/reveal
+# ---------------------------------------------------------------------------
+
+
+class TestReveal:
+    def test_returns_full_secret_when_set(self, loopback, stub_env):
+        secret = "abcdef" + "0" * 38 + "wxyz"
+        stub_env["OPENCLAW_HMAC_SECRET"] = secret
+
+        r = loopback.post("/api/ai/connect-info/reveal")
+        assert r.status_code == 200
+        body = r.json()
+        assert body["ok"] is True
+        assert body["hmac_secret"] == secret
+
+    def test_strict_cache_control_headers(self, loopback, stub_env):
+        """The whole point of POST /reveal vs GET ?reveal=true is that
+        the response carries headers that prevent every cache layer
+        from persisting the secret."""
+        secret = "abcdef" + "0" * 38 + "wxyz"
+        stub_env["OPENCLAW_HMAC_SECRET"] = secret
+
+        r = loopback.post("/api/ai/connect-info/reveal")
+        cc = r.headers.get("cache-control", "")
+        assert "no-store" in cc, (
+            f"reveal MUST set Cache-Control: no-store — got {cc!r}"
+        )
+        assert "no-cache" in cc
+        # Pragma + Expires as well for HTTP/1.0 caches.
+        assert r.headers.get("pragma", "").lower() == "no-cache"
+        assert r.headers.get("expires") == "0"
+
+    def test_404_when_no_secret_configured(self, loopback, stub_env):
+        r = loopback.post("/api/ai/connect-info/reveal")
+        assert r.status_code == 404
+        # Hint should point at the bootstrap endpoint, not just say "404".
+        detail = r.json().get("detail", "")
+        assert "bootstrap" in detail.lower()
+
+
+# ---------------------------------------------------------------------------
+# POST /api/ai/connect-info/regenerate — still returns the new secret
+# inline (deliberate destructive action), but with no-store headers.
+# ---------------------------------------------------------------------------
+
+
+class TestRegenerate:
+    def test_returns_new_secret_with_no_store_headers(self, loopback, stub_env):
+        # Seed an existing secret so we can prove it changes.
+        old = "oldold" + "0" * 38 + "1234"
+        stub_env["OPENCLAW_HMAC_SECRET"] = old
+
+        r = loopback.post("/api/ai/connect-info/regenerate")
+        assert r.status_code == 200
+        body = r.json()
+        assert body["ok"] is True
+        assert body["hmac_secret"]
+        assert body["hmac_secret"] != old
+        # no-store headers MUST be present so the new secret doesn't
+        # land in browser disk cache after the regenerate click.
+        cc = r.headers.get("cache-control", "")
+        assert "no-store" in cc and "no-cache" in cc
+        assert r.headers.get("pragma", "").lower() == "no-cache"
+
+
+# ---------------------------------------------------------------------------
+# Auth-gate regression — every endpoint still rejects anonymous remote
+# callers. This is the property we already enforce for the rest of the
+# operator-only surface; adding the three new endpoints to the audit
+# coverage prevents a future refactor from dropping the dependency.
+# ---------------------------------------------------------------------------
+
+
+class TestAnonymousRejection:
+    @pytest.mark.parametrize(
+        "method,path,body",
+        [
+            ("get",  "/api/ai/connect-info",            None),
+            ("post", "/api/ai/connect-info/bootstrap",  None),
+            ("post", "/api/ai/connect-info/reveal",     None),
+            ("post", "/api/ai/connect-info/regenerate", None),
+        ],
+    )
+    def test_remote_rejected(self, remote, method, path, body):
+        fn = getattr(remote, method)
+        r = fn(path, json=body) if body is not None else fn(path)
+        assert r.status_code == 403, (
+            f"{method.upper()} {path} must reject anonymous remote callers; "
+            f"got {r.status_code}"
+        )
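Reviewer note: the masked fingerprint these tests pin (first 6 characters, bullets, last 4) can be sketched as below. This mirrors the shape the frontend derives in the reset-all handler; the backend's own masking function is not part of this diff, so `mask_secret` and `BULLET` are hypothetical names, and the short-secret and empty-secret branches are assumptions taken from the frontend fallback and from the ``masked_hmac_secret == ""`` assertion above.

```python
BULLET = "\u2022"


def mask_secret(secret: str) -> str:
    """Return a display-safe fingerprint of the secret (hypothetical helper)."""
    if not secret:
        # Assumption: no secret configured means an empty fingerprint.
        return ""
    if len(secret) <= 10:
        # Assumption: too short to show any prefix/suffix without
        # giving most of the value away, so show bullets only.
        return BULLET * 16
    return secret[:6] + BULLET * 8 + secret[-4:]


example_secret = "abcdef" + "0" * 38 + "wxyz"
example_masked = mask_secret(example_secret)
```

The fingerprint is fixed-width, so it does not reveal the length of the underlying secret.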
@@ -357,8 +357,15 @@ function ConnectModalBody({ apiEndpoint, handleCopy, copied }: ConnectModalBodyP
  const [riskAccepted, setRiskAccepted] = React.useState(false);
  const [accessTier, setAccessTier] = React.useState<'restricted' | 'full'>('restricted');
  const [connectionMode, setConnectionMode] = React.useState<'local' | 'remote'>('local');
+  // hmacSecret holds the FULL secret once the operator has clicked
+  // Reveal (or after a regenerate). maskedHmacSecret is the safe-to-show
+  // fingerprint returned by GET /api/ai/connect-info and is loaded on
+  // mount. The two are independent state slots so a stale full secret
+  // can never leak back into the UI after a regenerate.
  const [hmacSecret, setHmacSecret] = React.useState('');
+  const [maskedHmacSecret, setMaskedHmacSecret] = React.useState('');
  const [hmacLoading, setHmacLoading] = React.useState(false);
+  const [revealing, setRevealing] = React.useState(false);
  const [tierSaving, setTierSaving] = React.useState(false);
  const [showAdvanced, setShowAdvanced] = React.useState(false);
  const [showResetConfirm, setShowResetConfirm] = React.useState(false);
@@ -381,16 +388,40 @@ function ConnectModalBody({ apiEndpoint, handleCopy, copied }: ConnectModalBodyP
  const [torError, setTorError] = React.useState('');
  const [torOnion, setTorOnion] = React.useState('');

-  // Fetch connect-info + node status on mount
+  // Issue #302 (tg12): the full HMAC secret no longer travels through
+  // GET /api/ai/connect-info on every modal open. The flow is now:
+  //
+  //   1. GET /api/ai/connect-info — always returns the masked fingerprint
+  //      (first6 + bullets + last4). `hmacSecret` stays empty until the
+  //      operator clicks the Reveal (eye) button below.
+  //   2. POST /api/ai/connect-info/bootstrap — fires once on mount if the
+  //      backend reports `hmac_secret_set: false`. Idempotent and never
+  //      returns the secret in the response.
+  //   3. POST /api/ai/connect-info/reveal — fires when the operator clicks
+  //      Reveal or Copy without the secret yet loaded. Returns the full
+  //      secret with strict `Cache-Control: no-store` so it doesn't land
+  //      in browser caches or HAR exports.
  React.useEffect(() => {
    (async () => {
      try {
        setHmacLoading(true);
-        const res = await fetch(`${API_BASE}/api/ai/connect-info?reveal=true`);
-        if (res.ok) {
-          const data = await res.json();
-          setHmacSecret(data.hmac_secret || '');
-          setAccessTier(data.access_tier === 'full' ? 'full' : 'restricted');
+        const res = await fetch(`${API_BASE}/api/ai/connect-info`);
+        if (!res.ok) return;
+        const data = await res.json();
+        setMaskedHmacSecret(data.masked_hmac_secret || '');
+        setAccessTier(data.access_tier === 'full' ? 'full' : 'restricted');
+
+        // Transparent first-use bootstrap. Mirrors the pre-#302 UX of
+        // "open modal → secret exists" without the GET side-effect.
+        if (!data.hmac_secret_set) {
+          const bootRes = await fetch(
+            `${API_BASE}/api/ai/connect-info/bootstrap`,
+            { method: 'POST' },
+          );
+          if (bootRes.ok) {
+            const bootData = await bootRes.json();
+            setMaskedHmacSecret(bootData.masked_hmac_secret || '');
+          }
        }
      } catch { /* ignore */ }
      finally { setHmacLoading(false); }
@@ -477,8 +508,17 @@ function ConnectModalBody({ apiEndpoint, handleCopy, copied }: ConnectModalBodyP
      const res = await fetch(`${API_BASE}/api/settings/agent/reset-all`, { method: 'POST' });
      const data = await res.json();
      if (data.ok) {
-        // Update local state with new credentials
-        if (data.new_hmac_secret) setHmacSecret(data.new_hmac_secret);
+        // Update local state with new credentials. reset-all returns
+        // the new HMAC secret in-band (same one-time-disclosure rule
+        // as /regenerate — a deliberate destructive action). Refresh
+        // both slots so the masked display stays in sync.
+        if (data.new_hmac_secret) {
+          setHmacSecret(data.new_hmac_secret);
+          const s = String(data.new_hmac_secret);
+          setMaskedHmacSecret(
+            s.length > 10 ? s.slice(0, 6) + '•'.repeat(8) + s.slice(-4) : '•'.repeat(16),
+          );
+        }
        if (data.new_onion) {
          setTorOnion(data.new_onion);
          setRemoteUrl(data.new_onion);
@@ -502,13 +542,41 @@ function ConnectModalBody({ apiEndpoint, handleCopy, copied }: ConnectModalBodyP
    finally { setTierSaving(false); }
  };

+  // Issue #302: POST /reveal returns the full secret with strict
+  // no-store headers. Lazily fetched — never on mount. Returns the
+  // secret string so callers can copy it immediately without waiting
+  // for React state propagation.
+  const revealHmacSecret = async (): Promise<string> => {
+    if (hmacSecret) return hmacSecret;
+    setRevealing(true);
+    try {
+      const res = await fetch(`${API_BASE}/api/ai/connect-info/reveal`, {
+        method: 'POST',
+      });
+      if (!res.ok) return '';
+      const data = await res.json();
+      const secret = String(data.hmac_secret || '');
+      setHmacSecret(secret);
+      return secret;
+    } catch {
+      return '';
+    } finally {
+      setRevealing(false);
+    }
+  };
+
  const handleRegenerate = async () => {
    setRegenerating(true);
    try {
      const res = await fetch(`${API_BASE}/api/ai/connect-info/regenerate`, { method: 'POST' });
      if (res.ok) {
        const data = await res.json();
+        // Regenerate is a deliberate destructive action — operator needs
+        // to see the new secret once to update their OpenClaw config.
+        // Both the full and masked forms refresh in one shot.
        setHmacSecret(data.hmac_secret || '');
+        setMaskedHmacSecret(data.masked_hmac_secret || '');
+        setShowSecret(true);
      }
    } catch { /* ignore */ }
    finally { setRegenerating(false); }
@@ -543,9 +611,17 @@ function ConnectModalBody({ apiEndpoint, handleCopy, copied }: ConnectModalBodyP
    finally { setNodeToggling(false); }
  };

-  const maskedSecret = hmacSecret
-    ? hmacSecret.slice(0, 6) + '\u2022'.repeat(8) + hmacSecret.slice(-4)
-    : '\u2022'.repeat(16);
+  // Issue #302: prefer the server-supplied fingerprint
+  // (maskedHmacSecret), which is filled on mount via the (no-secret) GET.
+  // If that is empty but the full secret is already in memory, derive
+  // the mask from it so we keep the same shape
+  // (first6 + bullets + last4) regardless of source. Final fallback
+  // (no secret loaded yet) is a generic bullet string.
+  const maskedSecret =
+    maskedHmacSecret ||
+    (hmacSecret
+      ? hmacSecret.slice(0, 6) + '\u2022'.repeat(8) + hmacSecret.slice(-4)
+      : '\u2022'.repeat(16));

  // Resolve the endpoint URL
  const resolvedUrl = connectionMode === 'local'
@@ -672,10 +748,15 @@ function ConnectModalBody({ apiEndpoint, handleCopy, copied }: ConnectModalBodyP
    return lines.join('\n');
  };
  const displaySnippet = buildSnippet(maskedSecret);
-  const copySnippet = buildSnippet(hmacSecret);

-  const handleCopySnippet = () => {
-    navigator.clipboard.writeText(copySnippet);
+  // Issue #302: the copy snippet needs the FULL secret. Pre-#302 we kept
+  // it in memory from the GET-with-reveal load; now we lazy-fetch via
+  // POST /reveal only when the operator actually clicks Copy. If they
+  // already revealed, the in-memory value is reused (no extra request).
+  const handleCopySnippet = async () => {
+    const secret = hmacSecret || (await revealHmacSecret());
+    if (!secret) return;
+    navigator.clipboard.writeText(buildSnippet(secret));
    setSnippetCopied(true);
    setTimeout(() => setSnippetCopied(false), 2000);
  };
@@ -913,18 +994,38 @@ function ConnectModalBody({ apiEndpoint, handleCopy, copied }: ConnectModalBodyP
                  </div>
                  <div className="flex items-center gap-2">
                    <code className="flex-1 bg-black/60 border border-violet-800/40 px-3 py-2 text-xs font-mono text-violet-300 overflow-hidden text-ellipsis">
-                      {showSecret ? hmacSecret : maskedSecret}
+                      {/* Issue #302: when the operator hasn't clicked
+                          Reveal yet, hmacSecret is empty and we fall
+                          back to maskedHmacSecret (the safe fingerprint
+                          returned by GET /api/ai/connect-info). */}
+                      {showSecret && hmacSecret ? hmacSecret : (maskedHmacSecret || maskedSecret)}
                    </code>
                    <button
-                      onClick={() => setShowSecret(!showSecret)}
-                      className="p-2 bg-violet-600/20 border border-violet-500/40 text-violet-400 hover:bg-violet-600/40 transition-colors shrink-0"
+                      onClick={async () => {
+                        if (showSecret) {
+                          setShowSecret(false);
+                          return;
+                        }
+                        // Need the full secret in state before showing it.
+                        const secret = await revealHmacSecret();
+                        if (secret) setShowSecret(true);
+                      }}
+                      disabled={revealing}
+                      className="p-2 bg-violet-600/20 border border-violet-500/40 text-violet-400 hover:bg-violet-600/40 transition-colors shrink-0 disabled:opacity-50"
                      title={showSecret ? 'Hide' : 'Reveal'}
                    >
                      {showSecret ? <EyeOff size={14} /> : <Eye size={14} />}
                    </button>
                    <button
-                      onClick={() => handleCopy(hmacSecret)}
-                      className="p-2 bg-violet-600/20 border border-violet-500/40 text-violet-400 hover:bg-violet-600/40 transition-colors shrink-0"
+                      onClick={async () => {
+                        // Copy needs the full secret. Fetch it lazily if
+                        // the operator hasn't clicked Reveal yet — no
+                        // point making them reveal first just to copy.
+                        const secret = hmacSecret || (await revealHmacSecret());
+                        if (secret) handleCopy(secret);
+                      }}
+                      disabled={revealing}
+                      className="p-2 bg-violet-600/20 border border-violet-500/40 text-violet-400 hover:bg-violet-600/40 transition-colors shrink-0 disabled:opacity-50"
                      title="Copy key"
                    >
                      {copied ? <Check size={14} /> : <Copy size={14} />}
@@ -140,17 +140,51 @@ const OnboardingModal = React.memo(function OnboardingModal({
  ].join('\n');
  const remoteAgentNeedsTor = agentMode === 'remote' && !torAddress;

+  // Issue #302 (tg12): the full HMAC secret no longer comes back from
+  // GET /api/ai/connect-info. We fetch metadata + the masked fingerprint
+  // first; if the operator has explicitly asked to see the key (the
+  // ``reveal`` flag), we follow up with POST /api/ai/connect-info/reveal
+  // (after a transparent POST /bootstrap if the secret hasn't been
+  // minted yet) which carries the secret with strict no-store headers.
  const fetchAgentConnectInfo = async (reveal = true) => {
    setAgentLoading(true);
    setAgentMsg(null);
    try {
-      const res = await fetch(`/api/ai/connect-info?reveal=${reveal ? 'true' : 'false'}`);
-      const data = await res.json().catch(() => ({}));
-      if (!res.ok || data?.ok === false) {
-        throw new Error(data?.detail || 'Could not prepare agent credentials.');
+      // 1) GET metadata + masked fingerprint.
+      const metaRes = await fetch('/api/ai/connect-info');
+      const metaData = await metaRes.json().catch(() => ({}));
+      if (!metaRes.ok || metaData?.ok === false) {
+        throw new Error(metaData?.detail || 'Could not prepare agent credentials.');
+      }
+      setAgentTier(metaData.access_tier === 'full' ? 'full' : 'restricted');
+
+      // 2) Mint the secret if it isn't set yet — transparent, idempotent.
+      let secretSet = !!metaData.hmac_secret_set;
+      if (!secretSet) {
+        const bootRes = await fetch('/api/ai/connect-info/bootstrap', {
+          method: 'POST',
+        });
+        const bootData = await bootRes.json().catch(() => ({}));
+        if (!bootRes.ok || bootData?.ok === false) {
+          throw new Error(bootData?.detail || 'Could not generate agent credentials.');
+        }
+        secretSet = !!bootData.hmac_secret_set;
+      }
+
+      // 3) If the caller asked to see the secret, fetch it explicitly.
+      //    Otherwise the masked fingerprint is enough for the UI.
+      if (reveal && secretSet) {
+        const revealRes = await fetch('/api/ai/connect-info/reveal', {
+          method: 'POST',
+        });
+        const revealData = await revealRes.json().catch(() => ({}));
+        if (!revealRes.ok || revealData?.ok === false) {
+          throw new Error(revealData?.detail || 'Could not reveal agent credentials.');
+        }
+        setAgentSecret(revealData.hmac_secret || '');
+      } else {
+        setAgentSecret(metaData.masked_hmac_secret || '');
      }
-      setAgentSecret(data.hmac_secret || '');
-      setAgentTier(data.access_tier === 'full' ? 'full' : 'restricted');
      setAgentMsg({ type: 'ok', text: 'Agent key is ready. Copy it into your local or remote agent runtime.' });
    } catch (error) {
      setAgentMsg({
@@ -76,6 +76,13 @@ function canRun(command, args) {
  return !result.error && result.status === 0;
 }

+function canRunBackendPython(pythonBin) {
+  return (
+    canRun(pythonBin, ["-V"]) &&
+    canRun(pythonBin, ["-c", "import fastapi, uvicorn"])
+  );
+}
+
 function findBasePython() {
  const candidates = isWindows
    ? [
@@ -135,12 +142,12 @@ function rebuildBackendVenv(targetDir, basePython) {
  if (result.error || result.status !== 0) {
    return null;
  }
-  return canRun(repairedBin, ["-V"]) ? repairedBin : null;
+  return canRunBackendPython(repairedBin) ? repairedBin : null;
 }

 function ensureBackendVenv() {
  for (const candidate of venvCandidates) {
-    if (fs.existsSync(candidate) && canRun(candidate, ["-V"])) {
+    if (fs.existsSync(candidate) && canRunBackendPython(candidate)) {
      persistSelectedVenv(candidate);
      return candidate;
    }
Author	SHA1	Message	Date
BigBodyCobain	31ebcb5cd9	Infonet sync: honor HTTP 429 Retry-After + exponential backoff

Fixes the retry-storm that's been keeping the local node 429'd out of the seed peer (the diagnosis we ran earlier in the session).

Pre-fix:
  1. Sync hits the seed peer, gets HTTP 429 (Too Many Requests)
  2. _peer_sync_response stringifies the status into a ValueError
  3. _sync_from_peer catches it, error becomes the str() of the exc
  4. _run_public_sync_cycle calls finish_sync(error=..., failure_backoff_s=60)
  5. next_sync_due_at = now + 60s
  6. After 60s, sync runs again, hits same upstream that hasn't reset its rate-limit bucket, 429 again. Loop indefinitely.

Net effect: a node that hit one transient 429 would hammer the seed every 60s forever, keeping the bucket full and never recovering. We saw this in the live status dump: consecutive_failures=49, last_sync_ok_at=0, retry storm sustained over the entire uptime.

What changed
------------

services/mesh/mesh_infonet_sync_support.py

  * New typed exception PeerSyncRateLimited carries the parsed Retry-After value out of the HTTP layer instead of stringifying everything into a generic ValueError.
  * New parse_retry_after_header() handles both RFC 7231 §7.1.3 forms (delay-seconds and HTTP-date). Clamped at 1 hour so a hostile peer can't silence us for days.
  * New _failure_backoff_seconds() helper computes the next delay as max(exponential, retry_after_s). Schedule with default base=60s, cap=1800s:

      failure 1  -> 60s   (preserves pre-fix for transient blips)
      failure 2  -> 120s
      failure 3  -> 240s
      failure 4  -> 480s
      failure 5  -> 960s
      failure 6+ -> 1800s (capped at 30 min)

    cap_s=0 explicitly disables exponential entirely, so operators who want pure-Retry-After behavior have that option.
  * finish_sync now accepts retry_after_s and failure_backoff_cap_s kwargs. Backward-compatible: existing callers that don't pass retry_after_s get the same first-failure delay as before (the base value), only repeat failures grow.

main.py

  * _peer_sync_response detects 429 specifically, parses the Retry-After header, raises PeerSyncRateLimited(retry_after_s=N). Includes the response body prefix in the message so the operator's last_error finally shows something useful.
  * _sync_from_peer extended to return (ok, error, forked, retry_after_s); the 4th tuple element is non-zero only when the upstream sent a parseable Retry-After. Existing call shape preserved: the lone caller in _run_public_sync_cycle was updated in the same commit.
  * _run_public_sync_cycle forwards retry_after_s into finish_sync.

Tests
-----

backend/tests/mesh/test_infonet_sync_429_backoff.py: 17 new tests

  TestParseRetryAfter (7):
    - integer seconds form
    - HTTP-date form (computed as seconds-from-now)
    - HTTP-date in the past returns 0
    - empty / whitespace returns 0
    - malformed returns 0
    - clamps to 1 hour (hostile-peer cap)
    - negative returns 0

  TestFailureBackoffSeconds (5):
    - exponential growth schedule pins each level
    - retry_after wins when larger than exponential
    - exponential wins when larger than retry_after
    - cap_s=0 disables exponential entirely
    - zero inputs return zero

  TestFinishSyncBackoff (5):
    - first failure uses base unchanged (pre-fix back-compat)
    - consecutive_failures actually grow the delay
    - retry_after honored at low failure count
    - success resets consecutive_failures
    - last_error carries the HTTP status / Retry-After detail

All 24 existing sync-support / status-gate tests still pass. Other failures in tests/mesh/ are pre-existing on origin/main and unrelated to this change (verified by running the same tests against the user's main worktree without these edits).

What the operator sees after this lands + a docker rebuild
----------------------------------------------------------

With the live 429 storm we diagnosed:

  Pre-fix:  consecutive_failures keeps climbing 1/min forever, last_error empty or generic
  Post-fix: consecutive_failures grows, next_sync_due_at backs off exponentially (max 30 min), last_error explicitly carries "HTTP 429 from <peer> (retry_after=Ns): <body>" so the operator can see what's actually wrong.

Once the upstream bucket drains and a sync succeeds, consecutive_failures resets to 0 and the schedule returns to the normal 300s interval.
	2026-05-22 22:55:05 -06:00
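The Retry-After parsing and the backoff schedule described in this commit can be sketched roughly as follows. This is a minimal illustration written from the commit message, not the code in mesh_infonet_sync_support.py: the names parse_retry_after and backoff_seconds, their signatures, and the handling of edge cases are assumptions.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

MAX_RETRY_AFTER_S = 3600  # 1-hour clamp mentioned in the commit message


def parse_retry_after(value, now=None):
    """Return a Retry-After delay in whole seconds, or 0 if unusable.

    Hypothetical stand-in for parse_retry_after_header().
    """
    text = (value or "").strip()
    if not text:
        return 0
    if text.isascii() and text.isdigit():  # delay-seconds form
        return min(int(text), MAX_RETRY_AFTER_S)
    try:  # HTTP-date form
        when = parsedate_to_datetime(text)
    except (TypeError, ValueError):
        return 0  # malformed input, including negative numbers
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    delta = int((when - now).total_seconds())
    return max(0, min(delta, MAX_RETRY_AFTER_S))


def backoff_seconds(consecutive_failures, base_s=60, cap_s=1800, retry_after_s=0):
    """Next delay: the larger of the capped exponential and Retry-After.

    Hypothetical stand-in for _failure_backoff_seconds().
    """
    if consecutive_failures <= 0 or base_s <= 0 or cap_s <= 0:
        exponential = 0  # cap_s=0 switches the exponential term off
    else:
        exponent = min(consecutive_failures - 1, 32)  # keep the power small
        exponential = min(base_s * 2 ** exponent, cap_s)
    return max(exponential, retry_after_s)
```

With the defaults, failures 1 through 6 give 60, 120, 240, 480, 960 and 1800 seconds, matching the schedule listed in the commit message; a larger Retry-After value overrides the exponential term at any failure count.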
Shadowbroker	b3fca3dc18	Merge pull request #309 from BigBodyCobain/feat/cross-node-dm-mailbox-replication DM mailbox: per-(sender, recipient) anti-spam cap + replication primitives	2026-05-22 22:43:26 -06:00
BigBodyCobain	401f114e4f	DM mailbox: outbound replication + receiving endpoint

Second commit on this branch (first added the per-sender cap + accept_replica primitive). This commit wires the actual cross-node propagation:

Outbound (sender side)
----------------------

  * New ``DMRelay._replicate_envelope_to_peers_async()``: a fire-and-forget thread that POSTs the envelope to every authenticated relay peer via the same per-peer HMAC pattern gate-message replication uses (#256 ``X-Peer-Url`` + ``X-Peer-HMAC`` headers, ``resolve_peer_key_for_url``).
  * ``deposit()`` now calls the replication helper after a successful local accept. Per-peer errors are swallowed: slow Tor peers must not block the sender's UX, and the recipient polling from a healthy peer works fine even if some peers are down.
  * Metrics: dm_replication_push_ok / _rejected / _error.

Inbound (receiving side)
------------------------

  * New endpoint ``POST /api/mesh/dm/replicate-envelope`` in routers/mesh_peer_sync.py.
  * Same HMAC auth gate (``_verify_peer_push_hmac``) as the existing infonet/gate peer-push endpoints. Unauthenticated requests get 403.
  * Body cap of 64 KB (DM envelope is bounded by MESH_DM_MAX_MSG_BYTES).
  * Calls DMRelay.accept_replica which enforces the per-sender cap as a network rule: a hostile sender's relay can hold extras locally but honest peers reject them on inbound replication.

End-to-end flow now works
-------------------------

  1. Alice's node accepts a deposit to Bob's mailbox (local cap check).
  2. Alice's node spawns a background thread that POSTs the envelope to MESH_RELAY_PEERS with per-peer HMAC.
  3. Each peer's /api/mesh/dm/replicate-envelope verifies the HMAC and calls accept_replica, which re-enforces the per-sender cap.
  4. Bob (offline at the time of send) eventually logs into ANY node in MESH_RELAY_PEERS, his existing pollDmMailboxes pulls from the local mailbox there, finds Alice's envelope, decrypts.

Tests
-----

backend/tests/test_dm_replicate_envelope_endpoint.py: 4 tests

  TestReplicateEndpointAuth:
    - rejects requests without peer HMAC (403)
    - rejects requests with WRONG peer HMAC (403); confirms the HMAC is actually verified, not just present
    - rejects oversize bodies (>64 KB) with 400/413

  TestReplicateEndpointRegistered:
    - static check that POST /api/mesh/dm/replicate-envelope is registered on app.routes; catches a future refactor that drops the router include

All 38 backend tests touching the new code paths still pass:

  test_dm_relay_per_sender_cap.py (14)
  test_dm_replicate_envelope_endpoint.py (4)
  test_no_new_duplicate_routes.py (1), new route is unique
  test_per_peer_secret_resolver.py (19), HMAC primitive unaffected

What's still ahead (PR-3+)
--------------------------

  * ack propagation: when recipient pulls a message on node X, peers Y/Z should prune their copies to free the sender's quota network-wide. Without this, the sender's quota frees only on the node the recipient actually polled; other peers still see N pending until TTL expiry. Workable but suboptimal. PR-3 will add a /api/mesh/dm/ack endpoint with the same HMAC pattern.
  * recipient pull-from-peers: today the recipient's poll only hits their own node's relay. If they log into a peer they didn't deposit with, they need a way to fetch envelopes from other peers in MESH_RELAY_PEERS. Today this works as long as the recipient's current node is one of the peers Alice's node pushed to, which is true in a fully-meshed deployment but not guaranteed for partial meshes. PR-4 if telemetry shows this matters.
	2026-05-22 19:23:09 -06:00
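For readers unfamiliar with the per-peer HMAC gate this commit reuses, here is a rough sketch of how signing and verifying a pushed envelope could look. Only the header names ``X-Peer-Url`` and ``X-Peer-HMAC``, the 403 for unauthenticated requests and the 64 KB body cap come from the commit message. Everything else is an assumption: the helper names, SHA-256 as the digest, what bytes the MAC covers, and the order of the size and auth checks are not taken from the repository.

```python
import hashlib
import hmac

MAX_BODY_BYTES = 64 * 1024  # body cap from the commit message


def _digest(key: bytes, peer_url: str, body: bytes) -> str:
    # Assumption: the MAC covers the sender URL plus the raw request body.
    message = peer_url.encode() + b"\n" + body
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def sign_push(key: bytes, peer_url: str, body: bytes) -> dict:
    """Headers a sending node would attach (hypothetical helper)."""
    return {"X-Peer-Url": peer_url, "X-Peer-HMAC": _digest(key, peer_url, body)}


def verify_push(peer_keys: dict, headers: dict, body: bytes) -> int:
    """Return the HTTP status a receiving peer would answer with."""
    if len(body) > MAX_BODY_BYTES:
        return 413  # cheap size check before any MAC is computed
    peer_url = headers.get("X-Peer-Url", "")
    supplied = headers.get("X-Peer-HMAC", "")
    key = peer_keys.get(peer_url)  # stands in for resolve_peer_key_for_url
    if key is None or not supplied:
        return 403
    expected = _digest(key, peer_url, body)
    # Constant-time comparison avoids leaking how many characters matched.
    if not hmac.compare_digest(expected.encode(), supplied.encode()):
        return 403
    return 200
```

Checking the body size first is a design choice in this sketch: it keeps an unauthenticated caller from forcing a MAC computation over an arbitrarily large body.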
BigBodyCobain	79b39e8985	DM mailbox: per-(sender, recipient) anti-spam cap + replication primitives

Foundation work for cross-node DM mailbox replication. Adds the network rule that makes the replication safe to ship next, plus the primitives the outbound replication PR will call.

The rule
--------

A single sender can have at most N UNACKED messages parked in a single recipient's mailbox at any one time. Default N=2, tunable via ``MESH_DM_PENDING_PER_SENDER_LIMIT``. Once the recipient pulls (acks) a message, the sender's quota for that (sender, recipient) pair frees up.

Network rule, not local rule
----------------------------

The cap is enforced TWICE:

  1. ``DMRelay.deposit(...)``: local check on the sender's own node. Refuses to spool the (N+1)th message before it can be replicated.
  2. ``DMRelay.accept_replica(...)``: replication-acceptance check on every receiving peer. Refuses to accept an inbound replica that would put the local mailbox over the cap.

The second half is what makes the rule a NETWORK rule. A hostile sender could patch out the deposit check on their own relay and continue to spool extras locally, but those extras can never propagate, because every honest peer enforces the same cap on the way in. A recipient who polls from honest peers therefore never sees more than N pending from any one sender, regardless of how many spam attempts the hostile sender's relay accepted.

New API surface on ``DMRelay``
------------------------------

  _per_sender_pending_limit()      reads MESH_DM_PENDING_PER_SENDER_LIMIT
  _per_sender_pending_count(...)   counts unacked from a sender for a mailbox
  accept_replica(envelope=...)     peer-push receive entry point
  envelope_for_replication(...)    helper to extract a wire-form envelope

``accept_replica`` is idempotent on duplicate ``msg_id`` (replication round-trips and multi-path delivery don't double-spool).

``envelope_for_replication`` exposes the exact shape ``accept_replica`` expects, so the follow-up PR (outbound replication wiring) just has to fetch the envelope and POST it to authenticated peer URLs with the existing per-peer HMAC pattern from #256.

Why this is PR-1 of two
-----------------------

The full cross-node mailbox replication needs three pieces:

  A. cap enforcement on deposit (in this PR)
  B. cap enforcement on replica acceptance (in this PR)
  C. outbound: push envelope to MESH_RELAY_PEERS after deposit (NEXT PR)

(A) + (B) shipped together close the cap-bypass attack surface BEFORE (C) introduces the actual cross-node propagation. Shipping them in the other order would briefly let extras propagate during the window between "outbound push lands" and "accept_replica cap lands."

Tests
-----

backend/tests/test_dm_relay_per_sender_cap.py: 14 tests

  TestDepositCap:
    - first 2 deposits succeed (UX baseline)
    - 3rd from same sender rejected with friendly message
    - different senders have independent quotas
    - different recipients have independent quotas
    - ack frees the quota (after recipient pulls, sender can deposit again)
    - cap is env-tunable

  TestAcceptReplicaCap:
    - replica accepted under cap
    - idempotent on duplicate msg_id (no double-spool, no rejection)
    - rejected at cap with structured ``cap_violation`` marker so sender's relay can stop retrying
    - per-sender, not per-mailbox: different sender_block_ref passes even when another sender at the same mailbox is capped
    - malformed envelope shapes rejected without crash

  TestEnvelopeForReplication:
    - returns the envelope for stored messages
    - returns None for unknown msg_id
    - round-trips through accept_replica end-to-end (proves the wire shape matches across the two sides)
	2026-05-22 19:18:01 -06:00
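The "enforced twice" rule above can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not ``DMRelay`` itself: the class name MailboxSketch, the envelope keys ``sender`` / ``recipient`` / ``msg_id``, and the return dictionaries are hypothetical; only the default limit of 2, the idempotency on ``msg_id`` and the ``cap_violation`` marker come from the commit message.

```python
from collections import defaultdict


class MailboxSketch:
    """Toy relay that applies one cap check on both entry points."""

    def __init__(self, per_sender_limit=2):
        self.per_sender_limit = per_sender_limit
        # (sender, recipient) -> {msg_id: envelope} of unacked messages
        self._pending = defaultdict(dict)

    def _accept(self, envelope):
        key = (envelope["sender"], envelope["recipient"])
        slot = self._pending[key]
        if envelope["msg_id"] in slot:
            # Duplicate delivery: succeed without spooling a second copy.
            return {"ok": True, "duplicate": True}
        if len(slot) >= self.per_sender_limit:
            return {"ok": False, "cap_violation": True}
        slot[envelope["msg_id"]] = envelope
        return {"ok": True, "duplicate": False}

    def deposit(self, envelope):
        """Local check on the sender's own node."""
        return self._accept(envelope)

    def accept_replica(self, envelope):
        """Same check again on every receiving peer."""
        return self._accept(envelope)

    def ack(self, sender, recipient, msg_id):
        """Recipient pulled the message; the sender's quota frees up."""
        self._pending[(sender, recipient)].pop(msg_id, None)
```

Because a receiving peer runs the same check as the sending node, a relay that skips the check in deposit() can still only get the first N envelopes accepted elsewhere.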
Shadowbroker	c3e38621fc	Merge pull request #308 from BigBodyCobain/fix/296-windows-venv-uvicorn-detection Fix #296: reject backend venvs missing uvicorn before launch (Windows)	2026-05-22 18:56:08 -06:00
BigBodyCobain	9ef02dd06f	Fix #296: reject backend venvs missing uvicorn before launch

Reported by @f3n3k on Windows native install path. Symptom:

    C:\001\backend\venv\Scripts\python.exe: No module named uvicorn
    [backend] exited with 1
    ShadowBroker has stopped. Exit code: 1

Root cause
----------

The Windows Start.bat flow chains:

    Start.bat
      └─ scripts\run-windows-runtime.ps1
           └─ frontend\scripts\dev-all.cjs
                └─ start-backend.js
                     └─ backend\venv\Scripts\python.exe -m uvicorn main:app

`start-backend.js` decided whether an existing `backend\venv` was usable by calling `canRun(candidate, ["-V"])`. That only checks whether Python itself can run; it does NOT check whether the backend's actual runtime dependencies are installed. When the venv exists but `pip install` never finished (partial install, failed network, interrupted bootstrap, etc.), the launcher happily accepted that broken venv, then died with the exact error f3n3k reported.

Fix
---

New `canRunBackendPython()` helper that requires BOTH:

    python -V                              # Python is runnable
    python -c "import fastapi, uvicorn"    # backend deps are installed

Used in two call sites:

  * `ensureBackendVenv()`: when iterating candidate venvs on first launch, reject any venv whose Python can't import the backend's real entry-point deps. The launcher then falls through to its existing rebuild path (`rebuildBackendVenv`) which reinstalls deps before declaring the venv healthy.
  * `rebuildBackendVenv()`: after a rebuild attempt, verify the deps are present before returning the new interpreter path. Catches silent partial rebuilds.

The check is the import that uvicorn itself would do at startup, so a green return here genuinely means "uvicorn will start". Cost is one extra `python -c` per venv candidate on launcher startup (milliseconds).

Verified locally with `node --check start-backend.js`.

Credit: @f3n3k for the original report.
	2026-05-22 18:50:27 -06:00
Shadowbroker	ba39d3b9aa	Merge pull request #307 from BigBodyCobain/fix/302-openclaw-hmac-reveal-hardening Fix #302: split OpenClaw HMAC reveal into dedicated POST with no-store headers	2026-05-22 18:47:09 -06:00
BigBodyCobain	f91ddcf38b	Fix #302: split OpenClaw HMAC reveal into dedicated POST with no-store

Reported by @tg12. Pre-fix, two problems lived on the GET endpoint:

  1. `GET /api/ai/connect-info?reveal=true` returned the full HMAC secret in the response body on every Connect modal open. Even gated to require_local_operator, that put the secret into browser history, dev-tools network panels, browser disk caches, HAR exports, and screen captures.
  2. The same GET endpoint auto-bootstrapped (generated + persisted) the secret on a mere read. Side effects on a GET are a footgun: browser prefetchers, mirror tools, and casual curl-from-history would all silently mint+persist a fresh secret.

Backend (backend/routers/ai_intel.py)
-------------------------------------

  GET /api/ai/connect-info
      Always returns the MASKED fingerprint (first6 + bullets + last4). No `?reveal` param. NO auto-bootstrap. When the secret is missing, returns `hmac_secret_set: false` and tells the caller to POST to /bootstrap.

  POST /api/ai/connect-info/bootstrap
      NEW. Mints+persists the secret if missing. Idempotent. Never returns the full secret in the response body.

  POST /api/ai/connect-info/reveal
      NEW. Returns the full secret with Cache-Control: no-store, no-cache, must-revalidate + Pragma: no-cache + Expires: 0. POST so the body never lands in URL history. 404 (with a pointer to /bootstrap) when the secret isn't set.

  POST /api/ai/connect-info/regenerate
      Keeps existing one-time-reveal behavior (regen IS a deliberate destructive action triggered by the operator). Same no-store/no-cache headers added so even the regen response doesn't get cached.

Frontend (AIIntelPanel.tsx, OnboardingModal.tsx)
------------------------------------------------

  * On mount: GET (masked only). If hmac_secret_set: false, fire a transparent POST /bootstrap and refresh the masked fingerprint. Operator sees no behavior change from pre-#302.
  * Reveal (eye icon): lazy POST /reveal; the secret only travels when the operator explicitly clicks the button.
  * Copy: lazy POST /reveal too; copying without a prior reveal works exactly like before, just routed through the new endpoint.
  * Regenerate: POST returns the new secret (same as before, but the response now has no-store headers).
  * The displayed snippet uses the masked fingerprint until the operator clicks Reveal or Copy.

Tests (backend/tests/test_openclaw_connect_info_reveal.py, 13 tests)
--------------------------------------------------------------------

  * GET returns masked + the full secret never appears in r.text
  * GET does NOT auto-bootstrap when missing
  * GET silently ignores any ?reveal=true query (back-compat noise)
  * POST /bootstrap mints when missing, idempotent when set
  * POST /bootstrap never returns the full secret
  * POST /reveal returns the full secret with Cache-Control: no-store, no-cache + Pragma: no-cache + Expires: 0
  * POST /reveal 404s with a pointer to /bootstrap when no secret
  * POST /regenerate returns the new secret with the same headers
  * Anonymous remote callers get 403 on ALL FOUR endpoints (parametric regression against the same allowlist used elsewhere).

Adjacent suites still green: test_openclaw_route_security, test_no_new_duplicate_routes, test_control_surface_auth. 67/67 pass locally.

Credit: @tg12 for the audit report.
	2026-05-22 18:40:24 -06:00
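The split between the masked GET and the explicit reveal can be sketched without a web framework. The response keys ``hmac_secret_set``, ``masked_hmac_secret`` and ``hmac_secret`` and the header values come from this commit and the frontend diff; the function names, the tuple return shape, the 404 detail text, and the guard for very short secrets are assumptions made for illustration and are not taken from backend/routers/ai_intel.py.

```python
BULLET = "\u2022"

# Header set the commit message lists for /reveal and /regenerate.
NO_STORE_HEADERS = {
    "Cache-Control": "no-store, no-cache, must-revalidate",
    "Pragma": "no-cache",
    "Expires": "0",
}


def mask_secret(secret: str) -> str:
    """first6 + bullets + last4, or plain bullets when there is no secret."""
    # Assumption: secrets of 10 characters or fewer are fully hidden,
    # because first6 + last4 would otherwise expose the whole value.
    if not secret or len(secret) <= 10:
        return BULLET * 16
    return secret[:6] + BULLET * 8 + secret[-4:]


def get_connect_info(secret: str) -> dict:
    """GET payload: masked fingerprint only, and no minting on read."""
    return {
        "hmac_secret_set": bool(secret),
        "masked_hmac_secret": mask_secret(secret) if secret else "",
    }


def post_reveal(secret: str):
    """POST /reveal: returns (status, headers, body)."""
    if not secret:
        # Hypothetical wording; the commit only says the 404 points at /bootstrap.
        return 404, dict(NO_STORE_HEADERS), {"detail": "POST /bootstrap first"}
    return 200, dict(NO_STORE_HEADERS), {"hmac_secret": secret}
```

Keeping the full secret out of every GET payload means it can only reach browser caches or HAR exports through a response that already carries the no-store headers.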
Shadowbroker	49151d8b9f	Merge pull request #304 from BigBodyCobain/fix/298-sentinel-creds-server-side Fix #298: move Sentinel credentials from browser storage to backend .env	2026-05-22 18:29:11 -06:00