Files
Shadowbroker/docs/production-hardening.md
T
2026-06-06 20:23:11 -06:00

2.6 KiB
Raw Blame History

Production hardening checklist

Use this before merging PRs that touch the data path, fetchers, live-data APIs, or anything that runs unattended for more than an hour (Docker, VPS self-host).

Adapt as needed — not every item applies to UI-only or docs-only PRs.

Config and exposure

  • Do new or changed config flags default to the safe value (loopback bind, features off until opt-in)?
  • Is any wider exposure (LAN bind, clearnet upstreams, admin without key) gated behind an explicit env opt-in?

Live-data API

  • When an endpoint's payload shape or sources change, does its serializer match siblings (default=str, OPT_NON_STR_KEYS via _live_data_json_bytes in routers/data.py)?
  • Is each route path defined exactly once? Grep the path — duplicate main.py + router copies drift.
  • Do ETag prefixes distinguish response variants (full vs fast vs slow, initial vs full, bbox suffix)?

Fetcher pools and timeouts

  • Do future.result(timeout=...) sites cancel queued work on timeout (or document why running threads are idempotent)?
  • Do *_CONCURRENCY knobs agree with the executor pool size they run on?
  • Does retry/backoff match intent — transient network/5xx retried; HTTP 4xx from raise_for_status not retried (services/fetchers/retry.py)?
  • Are outbound HTTP calls timeout-bounded (timeout= on requests.*, explicit timeout on fetch_with_curl, Playwright set_default_*_timeout)?

Secrets and observability

  • Are secrets read from env only, never logged by value; missing keys logged by variable name?
  • Do record_success / record_failure reflect what actually happened?

Tests

  • Do regression tests assert properties (serialization survives non-JSON-native values, slow pool cannot starve fast tier under load), not only wiring (which executor a label uses)?

Spot-checked heavy paths (2026-06)

Path Timeout posture
services/geopolitics.py (GDELT) fetch_with_curl(..., timeout=10/15) per export file
services/fetchers/flights.py requests / fetch_with_curl with 1030s
services/fetchers/earth_observation.py fetch_with_curl / `session.get
services/liveuamap_scraper.py page.goto(..., timeout=60s) + context default timeouts

Re-audit when adding a new fetcher or changing scheduler cadence.

  • #375 — dev bind, store lock, slow executor
  • #239 — duplicate route CI guard (test_no_new_duplicate_routes.py)