Files
Shadowbroker/backend/services/sentinel_search.py
T
Shadowbroker 76750caa92 Round 7a: per-operator outbound attribution + GDELT GCS-direct fix (#292)
== Per-install operator handle for every third-party API call ==

Before this PR, every Shadowbroker install identified itself to
Wikipedia, Wikidata, Nominatim, GDELT, OpenMHz, Broadcastify,
weather.gov, NUFORC, Sentinel/Planetary Computer, TinyGS / CelesTrak,
Shodan, Finnhub, and others with a single project-wide User-Agent
("Shadowbroker/1.0" or "ShadowBroker-OSINT/1.0"). From the upstream's
perspective every install in the world looked like one giant scraper.
If one install misbehaved, the upstream's only recourse was to block
"Shadowbroker" as a whole.

PR #284 inadvertently doubled down on this in the frontend by
introducing a shared `WIKIMEDIA_API_USER_AGENT` constant. This PR
retrofits both backends to per-operator attribution.

  New setting: OPERATOR_HANDLE (env var / settings UI / auto-gen)
  New helper:  network_utils.outbound_user_agent("purpose")

The handle is auto-generated as "operator-XXXXXX" on first call (the
"shadow-" prefix from earlier drafts was deliberately dropped — too
suspicious-looking for abuse-detection systems). Operators can
override via OPERATOR_HANDLE; the value is sanitized to lowercase
alphanumeric+dash+underscore and capped at 48 chars. Persisted to
backend/data/operator_handle.json so it survives container restarts.

Retrofitted call sites (every previously-MONSTER User-Agent):
  - services/region_dossier.py (Wikipedia + Wikidata + Nominatim)
  - services/geocode.py         (Nominatim)
  - services/sentinel_search.py (Microsoft Planetary Computer)
  - services/feed_ingester.py   (operator-curated RSS feeds)
  - services/fetchers/earth_observation.py (weather.gov, NUFORC)
  - services/fetchers/infrastructure.py
  - services/fetchers/aircraft_database.py
  - services/fetchers/route_database.py
  - services/fetchers/trains.py
  - services/fetchers/meshtastic_map.py
  - services/shodan_connector.py
  - services/unusual_whales_connector.py (Finnhub)
  - services/tinygs_fetcher.py            (CelesTrak + TinyGS)
  - services/sar/sar_products_client.py
  - services/geopolitics.py               (GDELT)
  - services/radio_intercept.py           (Broadcastify + OpenMHz)
  - routers/cctv.py + main.py             (CCTV proxy)
  - routers/ai_intel.py
  - scripts/convert_power_plants.py       (release-time data refresh)

Spoofed browser UAs removed (issues #289 / #290 / #291 — tg12 audit):
  - cloudscraper-based Chrome impersonation against api.openmhz.com
    -> replaced with honest requests + per-install UA
  - Mozilla/5.0 spoofed UA on Broadcastify scrape
    -> replaced with honest UA
  - Mozilla/5.0 + fake first-party Referer on OpenMHz audio relay
    -> replaced with honest UA
  - cloudscraper dependency dropped from pyproject.toml + uv.lock

Frontend retrofit:
  - new GET /api/settings/operator-handle endpoint (local-operator
    gated) returns the install's handle
  - frontend/src/lib/wikimediaClient.ts fetches the handle once on
    first use, caches it for page lifetime, embeds it in the
    Api-User-Agent for every Wikipedia / Wikidata browser-direct call

== GDELT GCS-direct fix ==

GDELT's data.gdeltproject.org is a CNAME to a Google Cloud Storage
bucket. GCS responds with the wildcard *.storage.googleapis.com cert
which legitimately does NOT cover the GDELT custom domain, so Python's
TLS verification correctly refuses the connection. Some networks
happen to route through a path where this works; many (notably Docker
Desktop's outbound NAT on local installs) do not. Verified on the
maintainer's local install: GDELT was unreachable; 1610 geopolitical
events / 48 export files were dropping silently.

Fix: services/geopolitics._gcs_direct_gdelt_url() rewrites any
data.gdeltproject.org URL to its GCS-direct equivalent
(storage.googleapis.com/data.gdeltproject.org/...) where the standard
GCS cert is genuinely valid. api.gdeltproject.org and every other host
are left untouched.

Confirmed live: backend log goes from
  GDELT lastupdate failed: 500
to
  Downloading 48 GDELT export files...
  Downloaded 48/48 GDELT exports
  GDELT parsed: 1610 conflict locations from 48 files

== Tests ==

  backend/tests/test_per_operator_outbound_attribution.py (12 tests)
  backend/tests/test_gdelt_gcs_direct_rewrite.py          (6 tests)
  backend/tests/test_region_dossier_wikimedia_ua.py       (updated to
    pin the helper + per-operator handle, not the old constant)
  frontend/src/__tests__/utils/wikimediaClient.test.ts    (rewritten
    to mock /api/settings/operator-handle and assert per-operator UA)

Local: backend 114/114 security+audit+round7a suite green;
       frontend 718/718 vitest suite green.

Credit: tg12 (external security audit, issues #289/#290/#291
relating to spoofed UAs); BigBodyCobain (operator-prefix call,
GDELT cloud-vs-local diagnosis).
2026-05-21 15:11:28 -06:00

115 lines
4.3 KiB
Python

"""
Sentinel-2 satellite imagery search via Microsoft Planetary Computer STAC API.
Free, keyless search for metadata + thumbnails. Used in the right-click dossier.
We use the raw STAC HTTP API with explicit timeouts so the right-click dossier
cannot hang behind a slow client library call.
"""
import logging
import requests
from datetime import datetime, timedelta
from cachetools import TTLCache
from services.network_utils import outbound_user_agent
logger = logging.getLogger(__name__)
# Cache by rounded lat/lon (0.02° grid ~= 2km), TTL 1 hour
_sentinel_cache = TTLCache(maxsize=200, ttl=3600)
def _planetary_user_agent() -> str:
# Round 7a: per-install handle so Microsoft Planetary Computer can
# attribute requests to the specific operator rather than treating
# the whole Shadowbroker user base as one entity.
return outbound_user_agent("sentinel2-planetary-computer")
def _esri_imagery_fallback(lat: float, lng: float) -> dict:
lat_span = 0.18
lng_span = 0.24
bbox = f"{lng - lng_span},{lat - lat_span},{lng + lng_span},{lat + lat_span}"
fullres = (
"https://server.arcgisonline.com/ArcGIS/rest/services/World_Imagery/MapServer/"
f"export?bbox={bbox}&bboxSR=4326&imageSR=4326&size=1600,900&format=png32&f=image"
)
thumbnail = (
"https://server.arcgisonline.com/ArcGIS/rest/services/World_Imagery/MapServer/"
f"export?bbox={bbox}&bboxSR=4326&imageSR=4326&size=640,360&format=png32&f=image"
)
return {
"found": True,
"scene_id": None,
"datetime": None,
"cloud_cover": None,
"thumbnail_url": thumbnail,
"fullres_url": fullres,
"bbox": [lng - lng_span, lat - lat_span, lng + lng_span, lat + lat_span],
"platform": "Esri World Imagery",
"fallback": True,
"message": "Planetary Computer unavailable; using Esri World Imagery fallback",
}
def search_sentinel2_scene(lat: float, lng: float) -> dict:
"""Search for the latest Sentinel-2 L2A scene covering a point."""
cache_key = f"{round(lat, 2)}_{round(lng, 2)}"
if cache_key in _sentinel_cache:
return _sentinel_cache[cache_key]
try:
end = datetime.utcnow()
start = end - timedelta(days=30)
search_payload = {
"collections": ["sentinel-2-l2a"],
"intersects": {"type": "Point", "coordinates": [lng, lat]},
"datetime": f"{start.isoformat()}Z/{end.isoformat()}Z",
"sortby": [{"field": "datetime", "direction": "desc"}],
"limit": 3,
"query": {"eo:cloud_cover": {"lt": 30}},
}
search_res = requests.post(
"https://planetarycomputer.microsoft.com/api/stac/v1/search",
json=search_payload,
timeout=8,
headers={"User-Agent": _planetary_user_agent()},
)
search_res.raise_for_status()
data = search_res.json()
features = data.get("features", [])
if not features:
result = _esri_imagery_fallback(lat, lng)
_sentinel_cache[cache_key] = result
return result
item = features[0]
assets = item.get("assets", {}) or {}
rendered = assets.get("rendered_preview") or {}
thumbnail = assets.get("thumbnail") or {}
# Full-res image URL — what opens when user clicks
fullres_url = rendered.get("href") or thumbnail.get("href")
# Thumbnail URL — what shows in the popup card
thumb_url = thumbnail.get("href") or rendered.get("href")
result = {
"found": True,
"scene_id": item.get("id"),
"datetime": item.get("properties", {}).get("datetime"),
"cloud_cover": item.get("properties", {}).get("eo:cloud_cover"),
"thumbnail_url": thumb_url,
"fullres_url": fullres_url,
"bbox": list(item.get("bbox", [])) if item.get("bbox") else None,
"platform": item.get("properties", {}).get("platform", "Sentinel-2"),
}
_sentinel_cache[cache_key] = result
return result
except (requests.RequestException, ConnectionError, TimeoutError, ValueError) as e:
logger.error(f"Sentinel-2 search failed for ({lat}, {lng}): {e}")
result = _esri_imagery_fallback(lat, lng)
result["error"] = str(e)
_sentinel_cache[cache_key] = result
return result