mirror of
https://github.com/BigBodyCobain/Shadowbroker.git
synced 2026-06-02 12:31:40 +02:00
76750caa92
== Per-install operator handle for every third-party API call ==
Before this PR, every Shadowbroker install identified itself to
Wikipedia, Wikidata, Nominatim, GDELT, OpenMHz, Broadcastify,
weather.gov, NUFORC, Sentinel/Planetary Computer, TinyGS / CelesTrak,
Shodan, Finnhub, and others with a single project-wide User-Agent
("Shadowbroker/1.0" or "ShadowBroker-OSINT/1.0"). From the upstream's
perspective every install in the world looked like one giant scraper.
If one install misbehaved, the upstream's only recourse was to block
"Shadowbroker" as a whole.
PR #284 inadvertently doubled down on this in the frontend by
introducing a shared `WIKIMEDIA_API_USER_AGENT` constant. This PR
retrofits both the backend and the frontend to per-operator attribution.
New setting: OPERATOR_HANDLE (env var / settings UI / auto-gen)
New helper: network_utils.outbound_user_agent("purpose")
The handle is auto-generated as "operator-XXXXXX" on first call (the
"shadow-" prefix from earlier drafts was deliberately dropped — too
suspicious-looking for abuse-detection systems). Operators can
override via OPERATOR_HANDLE; the value is sanitized to lowercase
alphanumeric+dash+underscore and capped at 48 chars. Persisted to
backend/data/operator_handle.json so it survives container restarts.
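A minimal sketch of the handle rules described above, not the shipped code. The function names, the "handle" JSON key, the choice of six hex characters for XXXXXX, and the override-before-file precedence are all assumptions made for illustration; only the character set, the 48-char cap, and the "operator-" prefix come from this message.

```python
from __future__ import annotations

import json
import re
import secrets
from pathlib import Path

MAX_HANDLE_LEN = 48  # cap stated in the commit message


def sanitize_handle(raw: str) -> str:
    """Lowercase, keep only [a-z0-9_-], and cap at 48 characters."""
    cleaned = re.sub(r"[^a-z0-9_-]", "", raw.strip().lower())
    return cleaned[:MAX_HANDLE_LEN]


def generate_handle() -> str:
    """Assumption: XXXXXX is six random hex characters."""
    return f"operator-{secrets.token_hex(3)}"


def load_or_create_handle(path: Path, override: str | None = None) -> str:
    """Resolve the handle: override first, then persisted file, then auto-gen.

    The precedence and the {"handle": ...} file layout are hypothetical.
    """
    if override:
        cleaned = sanitize_handle(override)
        if cleaned:
            return cleaned
    if path.exists():
        try:
            stored = sanitize_handle(json.loads(path.read_text()).get("handle", ""))
            if stored:
                return stored
        except (json.JSONDecodeError, OSError, AttributeError):
            pass  # unreadable or malformed file: fall through and regenerate
    handle = generate_handle()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps({"handle": handle}))
    return handle
```

Calling load_or_create_handle(Path("backend/data/operator_handle.json")) twice would return the same value on the second call, which is the property that lets the handle survive a container restart.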
Retrofitted call sites (every one that previously sent the shared project-wide User-Agent):
- services/region_dossier.py (Wikipedia + Wikidata + Nominatim)
- services/geocode.py (Nominatim)
- services/sentinel_search.py (Microsoft Planetary Computer)
- services/feed_ingester.py (operator-curated RSS feeds)
- services/fetchers/earth_observation.py (weather.gov, NUFORC)
- services/fetchers/infrastructure.py
- services/fetchers/aircraft_database.py
- services/fetchers/route_database.py
- services/fetchers/trains.py
- services/fetchers/meshtastic_map.py
- services/shodan_connector.py
- services/unusual_whales_connector.py (Finnhub)
- services/tinygs_fetcher.py (CelesTrak + TinyGS)
- services/sar/sar_products_client.py
- services/geopolitics.py (GDELT)
- services/radio_intercept.py (Broadcastify + OpenMHz)
- routers/cctv.py + main.py (CCTV proxy)
- routers/ai_intel.py
- scripts/convert_power_plants.py (release-time data refresh)
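For illustration, a retrofitted call site might build its headers roughly like this. The sketch is a stand-in under assumptions: the real helper is network_utils.outbound_user_agent("purpose"), whose exact output format is not shown in this message, so the function names and the UA layout below are hypothetical. The layout was chosen only to satisfy what the updated tests pin (project name, an "operator:" marker, and a github.com /issues contact path).

```python
from __future__ import annotations

PROJECT_URL = "https://github.com/BigBodyCobain/Shadowbroker"


def outbound_user_agent_sketch(handle: str, purpose: str) -> str:
    """Hypothetical stand-in for network_utils.outbound_user_agent(purpose).

    The real helper resolves the handle itself; here it is passed in so the
    example stays self-contained.
    """
    return (
        f"ShadowBroker/1.0 (operator:{handle}; purpose:{purpose}; "
        f"+{PROJECT_URL}/issues)"
    )


def build_headers(handle: str, purpose: str) -> dict[str, str]:
    """Send the same honest identifier in both Wikimedia-style headers."""
    ua = outbound_user_agent_sketch(handle, purpose)
    return {"User-Agent": ua, "Api-User-Agent": ua}


headers = build_headers("operator-a1b2c3", "wikidata")
```

With this shape an upstream that needs to throttle one misbehaving install can match on the operator segment instead of blocking the project name outright.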
Spoofed browser UAs removed (issues #289 / #290 / #291 — tg12 audit):
- cloudscraper-based Chrome impersonation against api.openmhz.com
-> replaced with honest requests + per-install UA
- Mozilla/5.0 spoofed UA on Broadcastify scrape
-> replaced with honest UA
- Mozilla/5.0 + fake first-party Referer on OpenMHz audio relay
-> replaced with honest UA
- cloudscraper dependency dropped from pyproject.toml + uv.lock
Frontend retrofit:
- new GET /api/settings/operator-handle endpoint (local-operator
gated) returns the install's handle
- frontend/src/lib/wikimediaClient.ts fetches the handle once on
first use, caches it for page lifetime, embeds it in the
Api-User-Agent for every Wikipedia / Wikidata browser-direct call
== GDELT GCS-direct fix ==
GDELT's data.gdeltproject.org is a CNAME to a Google Cloud Storage
bucket. GCS responds with the wildcard *.storage.googleapis.com cert
which legitimately does NOT cover the GDELT custom domain, so Python's
TLS verification correctly refuses the connection. Some networks
happen to route through a path where this works; many (notably Docker
Desktop's outbound NAT on local installs) do not. Verified on the
maintainer's local install: GDELT was unreachable; 1610 geopolitical
events / 48 export files were dropping silently.
Fix: services/geopolitics._gcs_direct_gdelt_url() rewrites any
data.gdeltproject.org URL to its GCS-direct equivalent
(storage.googleapis.com/data.gdeltproject.org/...) where the standard
GCS cert is genuinely valid. api.gdeltproject.org and every other host
are left untouched.
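The rewrite can be sketched as below. This is an illustrative version, not the body of services/geopolitics._gcs_direct_gdelt_url(); the function name is hypothetical, and forcing the https scheme on rewritten URLs is an assumption.

```python
from urllib.parse import urlsplit, urlunsplit

GDELT_DATA_HOST = "data.gdeltproject.org"
GCS_HOST = "storage.googleapis.com"


def gcs_direct_gdelt_url_sketch(url: str) -> str:
    """Rewrite data.gdeltproject.org URLs to their GCS-direct form.

    Any other host (including api.gdeltproject.org) is returned unchanged.
    """
    parts = urlsplit(url)
    if (parts.hostname or "").lower() != GDELT_DATA_HOST:
        return url
    # The bucket name equals the original host, so it becomes the first
    # path segment under storage.googleapis.com.
    new_path = f"/{GDELT_DATA_HOST}{parts.path}"
    # Assumption: always use https for the rewritten URL.
    return urlunsplit(("https", GCS_HOST, new_path, parts.query, parts.fragment))


rewritten = gcs_direct_gdelt_url_sketch(
    "http://data.gdeltproject.org/gdeltv2/lastupdate.txt"
)
untouched = gcs_direct_gdelt_url_sketch(
    "https://api.gdeltproject.org/api/v2/doc/doc?query=test"
)
```

Matching on the parsed hostname rather than a substring keeps the rewrite from touching URLs that merely mention the GDELT domain in a path or query string.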
Confirmed live: backend log goes from
GDELT lastupdate failed: 500
to
Downloading 48 GDELT export files...
Downloaded 48/48 GDELT exports
GDELT parsed: 1610 conflict locations from 48 files
== Tests ==
backend/tests/test_per_operator_outbound_attribution.py (12 tests)
backend/tests/test_gdelt_gcs_direct_rewrite.py (6 tests)
backend/tests/test_region_dossier_wikimedia_ua.py (updated to
pin the helper + per-operator handle, not the old constant)
frontend/src/__tests__/utils/wikimediaClient.test.ts (rewritten
to mock /api/settings/operator-handle and assert per-operator UA)
Local: backend 114/114 security+audit+round7a suite green;
frontend 718/718 vitest suite green.
Credit: tg12 (external security audit, issues #289/#290/#291
relating to spoofed UAs); BigBodyCobain (operator-prefix call,
GDELT cloud-vs-local diagnosis).
102 lines
3.9 KiB
Python
"""Issues #218 / #219 (tg12): outbound Wikipedia + Wikidata calls must
identify ShadowBroker via the Wikimedia-recommended User-Agent /
Api-User-Agent headers.

Before this fix, ``backend/services/region_dossier.py`` called
``fetch_with_curl(url)`` with no explicit headers, falling back to the
generic project default UA. That sent a too-anonymous identifier to
Wikimedia. Per Wikimedia's policy
(https://foundation.wikimedia.org/wiki/Policy:Wikimedia_Foundation_User-Agent_Policy)
the API caller should send a stable, contactable identifier so Wikimedia
operators can rate-limit or reach the project.

This test does NOT make network calls. It patches ``fetch_with_curl``
and asserts the headers that get passed through.
"""
from __future__ import annotations

from unittest.mock import MagicMock, patch

import pytest


def _fake_resp(payload: dict, status: int = 200) -> MagicMock:
    r = MagicMock()
    r.status_code = status
    r.json.return_value = payload
    return r


def test_wikidata_call_passes_wikimedia_request_headers():
    from services import region_dossier

    calls = []

    def fake_fetch(url, **kwargs):
        calls.append(kwargs.get("headers"))
        return _fake_resp({"results": {"bindings": []}})

    with patch.object(region_dossier, "fetch_with_curl", side_effect=fake_fetch):
        region_dossier._fetch_wikidata_leader("Testlandia")

    assert calls, "fetch_with_curl was not called"
    headers = calls[0] or {}
    assert "User-Agent" in headers
    assert "Api-User-Agent" in headers
    # Stable identifier should mention the project + a contact path.
    assert "Shadowbroker" in headers["Api-User-Agent"] or "ShadowBroker" in headers["Api-User-Agent"]
    assert "github.com" in headers["Api-User-Agent"].lower()


def test_wikipedia_summary_call_passes_wikimedia_request_headers():
    from services import region_dossier

    calls = []

    def fake_fetch(url, **kwargs):
        calls.append((url, kwargs.get("headers")))
        return _fake_resp(
            {
                "type": "standard",
                "description": "test desc",
                "extract": "test extract",
                "thumbnail": {"source": ""},
            }
        )

    with patch.object(region_dossier, "fetch_with_curl", side_effect=fake_fetch):
        region_dossier._fetch_local_wiki_summary("Paris", "France")

    # At least one Wikipedia REST call was issued.
    wikipedia_calls = [c for c in calls if "wikipedia.org" in c[0]]
    assert wikipedia_calls, "no Wikipedia call was issued"
    for url, headers in wikipedia_calls:
        headers = headers or {}
        assert "User-Agent" in headers, f"missing User-Agent on {url}"
        assert "Api-User-Agent" in headers, f"missing Api-User-Agent on {url}"
        assert "github.com" in headers["Api-User-Agent"].lower()


def test_wikimedia_headers_helper_is_stable():
    """Regression guard: if someone removes the contact path or the
    per-operator handle from the Wikimedia headers, we want a loud
    test failure, not a silent ToS drift.

    Round 7a: the original ``_WIKIMEDIA_REQUEST_HEADERS`` constant was
    replaced with the ``_wikimedia_request_headers()`` function so the
    per-install operator handle is embedded at call time. This test
    pins both the project identifier AND the contact path AND the
    per-operator format.
    """
    from services.region_dossier import _wikimedia_request_headers

    headers = _wikimedia_request_headers()
    aua = headers.get("Api-User-Agent", "")
    ua = headers.get("User-Agent", "")
    for h, label in ((ua, "User-Agent"), (aua, "Api-User-Agent")):
        assert "Shadowbroker" in h or "ShadowBroker" in h, f"{label} missing project id"
        assert "github.com" in h.lower(), f"{label} missing contact URL"
        assert "issues" in h.lower(), f"{label} missing /issues contact path"
        # Round 7a: must include the per-operator handle.
        assert "operator:" in h, f"{label} missing per-operator handle: {h!r}"