mirror of
https://github.com/BigBodyCobain/Shadowbroker.git
synced 2026-05-28 10:01:31 +02:00
31ebcb5cd9
Fixes the retry-storm that's been keeping the local node 429'd out of
the seed peer (the diagnosis we ran earlier in the session). Pre-fix:
1. Sync hits the seed peer, gets HTTP 429 (Too Many Requests)
2. _peer_sync_response stringifies the status into a ValueError
3. _sync_from_peer catches it, error becomes the str() of the exc
4. _run_public_sync_cycle calls finish_sync(error=..., failure_backoff_s=60)
5. next_sync_due_at = now + 60s
6. After 60s, sync runs again, hits same upstream that hasn't reset
its rate-limit bucket, 429 again. Loop indefinitely.
Net effect: a node that hit one transient 429 would hammer the seed
every 60s forever, keeping the bucket full and never recovering. We
saw this in the live status dump: consecutive_failures=49,
last_sync_ok_at=0, retry storm sustained over the entire uptime.
What changed
------------
services/mesh/mesh_infonet_sync_support.py
* New typed exception PeerSyncRateLimited carries the parsed
Retry-After value out of the HTTP layer instead of stringifying
everything into a generic ValueError.
* New parse_retry_after_header() handles both RFC 7231 §7.1.3
forms (delay-seconds and HTTP-date). Clamped at 1 hour so a
hostile peer can't silence us for days.
* New _failure_backoff_seconds() helper computes the next delay
as max(exponential, retry_after_s). Schedule with default
base=60s, cap=1800s:
failure 1 -> 60s (preserves pre-fix for transient blips)
failure 2 -> 120s
failure 3 -> 240s
failure 4 -> 480s
failure 5 -> 960s
failure 6+ -> 1800s (capped at 30 min)
cap_s=0 explicitly disables exponential entirely — operators
who want pure-Retry-After behavior have that option.
* finish_sync now accepts retry_after_s and failure_backoff_cap_s
kwargs. Backward-compatible: existing callers that don't pass
retry_after_s get the same first-failure delay as before (the
base value), only repeat failures grow.
main.py
* _peer_sync_response detects 429 specifically, parses the
Retry-After header, raises PeerSyncRateLimited(retry_after_s=N).
Includes the response body prefix in the message so the
operator's last_error finally shows something useful.
* _sync_from_peer extended to return (ok, error, forked,
retry_after_s) — the 4th tuple element is non-zero only when
the upstream sent a parseable Retry-After. Existing call shape
preserved: the lone caller in _run_public_sync_cycle was
updated in the same commit.
* _run_public_sync_cycle forwards retry_after_s into finish_sync.
Tests
-----
backend/tests/mesh/test_infonet_sync_429_backoff.py — 17 new tests:
TestParseRetryAfter (7):
- integer seconds form
- HTTP-date form (computed as seconds-from-now)
- HTTP-date in the past returns 0
- empty / whitespace returns 0
- malformed returns 0
- clamps to 1 hour (hostile-peer cap)
- negative returns 0
TestFailureBackoffSeconds (5):
- exponential growth schedule pins each level
- retry_after wins when larger than exponential
- exponential wins when larger than retry_after
- cap_s=0 disables exponential entirely
- zero inputs return zero
TestFinishSyncBackoff (5):
- first failure uses base unchanged (pre-fix back-compat)
- consecutive_failures actually grow the delay
- retry_after honored at low failure count
- success resets consecutive_failures
- last_error carries the HTTP status / Retry-After detail
All 24 existing sync-support / status-gate tests still pass. Other
failures in tests/mesh/ are pre-existing on origin/main and unrelated
to this change (verified by running the same tests against the
user's main worktree without these edits).
What the operator sees after this lands + a docker rebuild
----------------------------------------------------------
With the live 429 storm we diagnosed:
Pre-fix: consecutive_failures keeps climbing 1/min forever,
last_error empty or generic
Post-fix: consecutive_failures grows, next_sync_due_at backs off
exponentially (max 30 min), last_error explicitly carries
"HTTP 429 from <peer> (retry_after=Ns): <body>" so the
operator can see what's actually wrong. Once the upstream
bucket drains and a sync succeeds, consecutive_failures
resets to 0 and the schedule returns to the normal 300s
interval.