mirror of
https://github.com/BigBodyCobain/Shadowbroker.git
synced 2026-06-06 14:23:56 +02:00
Round 7a: per-operator outbound attribution + GDELT GCS-direct fix
== Per-install operator handle for every third-party API call ==
Before this PR, every Shadowbroker install identified itself to
Wikipedia, Wikidata, Nominatim, GDELT, OpenMHz, Broadcastify,
weather.gov, NUFORC, Sentinel/Planetary Computer, TinyGS / CelesTrak,
Shodan, Finnhub, and others with a single project-wide User-Agent
("Shadowbroker/1.0" or "ShadowBroker-OSINT/1.0"). From the upstream's
perspective every install in the world looked like one giant scraper.
If one install misbehaved, the upstream's only recourse was to block
"Shadowbroker" as a whole.
PR #284 inadvertently doubled down on this in the frontend by
introducing a shared `WIKIMEDIA_API_USER_AGENT` constant. This PR
retrofits both the backend and the frontend to per-operator attribution.
New setting: OPERATOR_HANDLE (env var / settings UI / auto-gen)
New helper: network_utils.outbound_user_agent("purpose")
The handle is auto-generated as "operator-XXXXXX" on first call (the
"shadow-" prefix from earlier drafts was deliberately dropped — too
suspicious-looking for abuse-detection systems). Operators can
override via OPERATOR_HANDLE; the value is sanitized to lowercase
alphanumeric+dash+underscore and capped at 48 chars. Persisted to
backend/data/operator_handle.json so it survives container restarts.
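A minimal sketch of the handle rules described above, assuming a helper layout that is not taken from the real `network_utils` module. The names `normalize_handle_sketch`, `load_or_create_handle_sketch`, and `user_agent_sketch`, the placeholder version `0.0`, and the exact header layout are hypothetical. Only the `operator-XXXXXX` shape, the 48-character cap, the allowed character set, the `{"handle": ...}` file shape, and the `operator:` / `purpose:` markers come from this commit and its tests.

```python
import json
import re
import secrets
from pathlib import Path

MAX_HANDLE_LEN = 48  # cap stated in the commit message
VERSION = "0.0"  # placeholder, NOT the project's real version string
CONTACT_URL = "https://github.com/BigBodyCobain/Shadowbroker/issues"


def normalize_handle_sketch(raw: str) -> str:
    """Lowercase, keep [a-z0-9_-], collapse dash runs, trim dashes, cap length."""
    cleaned = re.sub(r"[^a-z0-9_-]+", "-", raw.strip().lower())
    cleaned = re.sub(r"-{2,}", "-", cleaned).strip("-")
    cleaned = cleaned[:MAX_HANDLE_LEN].strip("-")
    return cleaned or "anonymous"


def load_or_create_handle_sketch(handle_file: Path, override: str = "") -> str:
    """Resolve the handle: explicit override, then disk, then auto-generate."""
    if override.strip():
        return normalize_handle_sketch(override)
    if handle_file.exists():
        try:
            stored = json.loads(handle_file.read_text(encoding="utf-8")).get("handle", "")
        except (OSError, ValueError, AttributeError):
            stored = ""  # unreadable or malformed file: fall through and regenerate
        if stored:
            return normalize_handle_sketch(str(stored))
    handle = f"operator-{secrets.token_hex(3)}"  # 3 random bytes -> 6 hex characters
    handle_file.parent.mkdir(parents=True, exist_ok=True)
    handle_file.write_text(json.dumps({"handle": handle}), encoding="utf-8")
    return handle


def user_agent_sketch(handle: str, purpose: str = "") -> str:
    """Build a User-Agent carrying the project id, a contact URL, and the handle."""
    fields = [f"+{CONTACT_URL}", f"operator: {handle}"]
    if purpose:
        fields.append(f"purpose: {purpose}")
    return f"Shadowbroker/{VERSION} ({'; '.join(fields)})"
```

In this sketch the override is never written to disk, which mirrors the tested behavior that setting OPERATOR_HANDLE changes the resolved handle without touching the persisted file.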
Retrofitted call sites (every previously-MONSTER User-Agent):
- services/region_dossier.py (Wikipedia + Wikidata + Nominatim)
- services/geocode.py (Nominatim)
- services/sentinel_search.py (Microsoft Planetary Computer)
- services/feed_ingester.py (operator-curated RSS feeds)
- services/fetchers/earth_observation.py (weather.gov, NUFORC)
- services/fetchers/infrastructure.py
- services/fetchers/aircraft_database.py
- services/fetchers/route_database.py
- services/fetchers/trains.py
- services/fetchers/meshtastic_map.py
- services/shodan_connector.py
- services/unusual_whales_connector.py (Finnhub)
- services/tinygs_fetcher.py (CelesTrak + TinyGS)
- services/sar/sar_products_client.py
- services/geopolitics.py (GDELT)
- services/radio_intercept.py (Broadcastify + OpenMHz)
- routers/cctv.py + main.py (CCTV proxy)
- routers/ai_intel.py
- scripts/convert_power_plants.py (release-time data refresh)
Spoofed browser UAs removed (issues #289 / #290 / #291 — tg12 audit):
- cloudscraper-based Chrome impersonation against api.openmhz.com
-> replaced with honest requests + per-install UA
- Mozilla/5.0 spoofed UA on Broadcastify scrape
-> replaced with honest UA
- Mozilla/5.0 + fake first-party Referer on OpenMHz audio relay
-> replaced with honest UA
- cloudscraper dependency dropped from pyproject.toml + uv.lock
Frontend retrofit:
- new GET /api/settings/operator-handle endpoint (local-operator
gated) returns the install's handle
- frontend/src/lib/wikimediaClient.ts fetches the handle once on
first use, caches it for page lifetime, embeds it in the
Api-User-Agent for every Wikipedia / Wikidata browser-direct call
== GDELT GCS-direct fix ==
GDELT's data.gdeltproject.org is a CNAME to a Google Cloud Storage
bucket. GCS responds with the wildcard *.storage.googleapis.com cert
which legitimately does NOT cover the GDELT custom domain, so Python's
TLS verification correctly refuses the connection. Some networks
happen to route through a path where this works; many (notably Docker
Desktop's outbound NAT on local installs) do not. Verified on the
maintainer's local install: GDELT was unreachable; 1610 geopolitical
events / 48 export files were dropping silently.
Fix: services/geopolitics._gcs_direct_gdelt_url() rewrites any
data.gdeltproject.org URL to its GCS-direct equivalent
(storage.googleapis.com/data.gdeltproject.org/...) where the standard
GCS cert is genuinely valid. api.gdeltproject.org and every other host
are left untouched.
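The rewrite rule above can be sketched as follows. This is an illustration only: `gcs_direct_url_sketch` is a hypothetical name, and it parses the URL with `urllib.parse`, which may differ from how the real `_gcs_direct_gdelt_url` matches the host. The hostnames and the expected outputs are taken from this commit's tests.

```python
from urllib.parse import urlsplit, urlunsplit

GDELT_DATA_HOST = "data.gdeltproject.org"
GCS_HOST = "storage.googleapis.com"


def gcs_direct_url_sketch(url: str) -> str:
    """Rewrite data.gdeltproject.org URLs to the GCS-direct form.

    Any other host (including api.gdeltproject.org) is returned unchanged.
    """
    parts = urlsplit(url)
    # Compare the parsed hostname exactly, so a substring in the query
    # string or a different subdomain never triggers a rewrite.
    if parts.hostname != GDELT_DATA_HOST:
        return url
    # The bucket name becomes the first path segment; scheme, query and
    # fragment are carried over untouched.
    rewritten = parts._replace(netloc=GCS_HOST, path=f"/{GDELT_DATA_HOST}{parts.path}")
    return urlunsplit(rewritten)
```

Matching on the parsed hostname rather than on a raw substring is what keeps URLs such as `https://example.com/?ref=data.gdeltproject.org/x` untouched.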
Confirmed live: backend log goes from
GDELT lastupdate failed: 500
to
Downloading 48 GDELT export files...
Downloaded 48/48 GDELT exports
GDELT parsed: 1610 conflict locations from 48 files
== Tests ==
backend/tests/test_per_operator_outbound_attribution.py (12 tests)
backend/tests/test_gdelt_gcs_direct_rewrite.py (6 tests)
backend/tests/test_region_dossier_wikimedia_ua.py (updated to
pin the helper + per-operator handle, not the old constant)
frontend/src/__tests__/utils/wikimediaClient.test.ts (rewritten
to mock /api/settings/operator-handle and assert per-operator UA)
Local: backend 114/114 security+audit+round7a suite green;
frontend 718/718 vitest suite green.
Credit: tg12 (external security audit, issues #289/#290/#291
relating to spoofed UAs); BigBodyCobain (operator-prefix call,
GDELT cloud-vs-local diagnosis).
@@ -0,0 +1,83 @@
"""GDELT's ``data.gdeltproject.org`` is a CNAME to a Google Cloud Storage
bucket. GCS responds with the wildcard ``*.storage.googleapis.com``
certificate, which legitimately does NOT cover the GDELT custom
domain, so Python's TLS verification refuses the connection. Some
networks happen to route through a path where this works; many
(notably Docker Desktop's outbound NAT on local installs) do not.

The fix in ``services.geopolitics._gcs_direct_gdelt_url`` rewrites any
URL pointing at ``data.gdeltproject.org`` to its GCS-direct equivalent
(``storage.googleapis.com/data.gdeltproject.org/...``), where the
standard GCS certificate is genuinely valid. ``api.gdeltproject.org``
and every other host are left untouched.

These tests pin that behavior so a future refactor that drops the
helper or accidentally rewrites the wrong host gets a loud failure.
"""
from __future__ import annotations

import pytest


def test_rewrites_data_gdeltproject_https():
    from services.geopolitics import _gcs_direct_gdelt_url

    assert _gcs_direct_gdelt_url(
        "https://data.gdeltproject.org/gdeltv2/lastupdate.txt"
    ) == "https://storage.googleapis.com/data.gdeltproject.org/gdeltv2/lastupdate.txt"


def test_rewrites_data_gdeltproject_http():
    """GDELT's lastupdate.txt sometimes lists URLs with http:// — we
    rewrite those too (the downstream call upgrades them to https)."""
    from services.geopolitics import _gcs_direct_gdelt_url

    assert _gcs_direct_gdelt_url(
        "http://data.gdeltproject.org/gdeltv2/20260301120000.export.CSV.zip"
    ) == "http://storage.googleapis.com/data.gdeltproject.org/gdeltv2/20260301120000.export.CSV.zip"


def test_rewrites_preserve_query_string_and_path():
    from services.geopolitics import _gcs_direct_gdelt_url

    url = "https://data.gdeltproject.org/some/deep/path?a=1&b=2&c=hello%20world"
    rewritten = _gcs_direct_gdelt_url(url)
    assert rewritten == (
        "https://storage.googleapis.com/data.gdeltproject.org"
        "/some/deep/path?a=1&b=2&c=hello%20world"
    )


def test_does_not_touch_api_gdeltproject_org():
    """The API host is NOT a CNAME to GCS; rewriting it would break the
    actual GDELT API endpoint."""
    from services.geopolitics import _gcs_direct_gdelt_url

    url = "https://api.gdeltproject.org/api/v2/doc/doc?query=carrier"
    assert _gcs_direct_gdelt_url(url) == url


def test_does_not_touch_other_hosts():
    from services.geopolitics import _gcs_direct_gdelt_url

    for url in (
        "https://en.wikipedia.org/wiki/Boeing_747",
        "https://query.wikidata.org/sparql",
        "https://storage.googleapis.com/already-correct/path",
        "https://nominatim.openstreetmap.org/search",
    ):
        assert _gcs_direct_gdelt_url(url) == url


def test_does_not_partially_match_strings():
    """``data.gdeltproject.org`` is matched exactly; URLs that merely
    contain that substring elsewhere (in a query parameter, for example)
    are left alone. Otherwise we'd rewrite something like
    ``https://example.com/?ref=data.gdeltproject.org/x`` which is wrong."""
    from services.geopolitics import _gcs_direct_gdelt_url

    # The match requires ``://`` immediately before the host, so a host
    # like ``example-data.gdeltproject.org`` would also be left alone
    # (treated as a different host, which is correct).
    url = "https://example-data.gdeltproject.org/path"
    assert _gcs_direct_gdelt_url(url) == url
@@ -0,0 +1,277 @@
"""Round 7a: per-install operator handle threads through every outbound
third-party API call.

Background: before this change every Shadowbroker install identified
itself to Wikipedia, Wikidata, Nominatim, GDELT, OpenMHz, Broadcastify,
weather.gov, NUFORC, etc. with a single project-wide ``Shadowbroker``
User-Agent. From the upstream's perspective, every install in the world
looked like one giant scraper. If one install misbehaved, the upstream's
only recourse was to block ``Shadowbroker`` as a whole, taking out every
other install.

Fix: each install gets a stable pseudonymous handle (auto-generated like
``operator-7f3a92`` or operator-overridden via ``OPERATOR_HANDLE``) that
gets embedded in the User-Agent for every outbound call. Upstreams can
now rate-limit / contact the specific operator instead of the project.

These tests pin:

1. The handle is auto-generated on first call if no override exists.
2. The handle survives process restart (persisted to disk).
3. ``OPERATOR_HANDLE`` env var override wins over the auto-gen handle.
4. The handle is sanitized (whitespace, special chars, length).
5. Every previously-MONSTER-UA call site now sends the per-operator UA.
"""
from __future__ import annotations

import json
import os
from pathlib import Path
from unittest.mock import patch

import pytest


@pytest.fixture
def isolated_handle(tmp_path, monkeypatch):
    """Redirect the persistence path to tmp and reset caches between tests."""
    from services import network_utils

    handle_file = tmp_path / "operator_handle.json"
    monkeypatch.setattr(network_utils, "_OPERATOR_HANDLE_FILE", handle_file)
    network_utils._reset_operator_handle_cache_for_tests()
    monkeypatch.delenv("OPERATOR_HANDLE", raising=False)

    # Reset Settings cache so OPERATOR_HANDLE env changes are picked up.
    from services.config import get_settings
    get_settings.cache_clear()

    yield network_utils

    network_utils._reset_operator_handle_cache_for_tests()
    get_settings.cache_clear()


# ---------------------------------------------------------------------------
# Core handle generation / persistence / override
# ---------------------------------------------------------------------------


class TestOperatorHandleGeneration:
    def test_auto_generates_on_first_call(self, isolated_handle):
        h = isolated_handle.get_operator_handle()
        # Prefix is "operator-" (deliberately neutral; "shadow-" looked
        # exactly like a pattern abuse-detection systems would auto-block).
        assert h.startswith("operator-")
        assert len(h) == len("operator-") + 6
        # Hex suffix.
        suffix = h.split("-", 1)[1]
        int(suffix, 16)  # raises if not hex

    def test_persists_to_disk_so_handle_survives_restart(self, isolated_handle):
        first = isolated_handle.get_operator_handle()
        # Simulate process restart: clear in-memory cache, then ask again.
        isolated_handle._reset_operator_handle_cache_for_tests()
        second = isolated_handle.get_operator_handle()
        assert second == first
        # The file actually exists.
        assert isolated_handle._OPERATOR_HANDLE_FILE.exists()
        body = json.loads(isolated_handle._OPERATOR_HANDLE_FILE.read_text())
        assert body["handle"] == first

    def test_env_override_wins_over_auto_generated(self, isolated_handle, monkeypatch):
        # First call without env var auto-generates.
        auto = isolated_handle.get_operator_handle()
        assert auto.startswith("operator-")
        # Setting env var changes the resolved handle without touching the disk file.
        monkeypatch.setenv("OPERATOR_HANDLE", "alice")
        from services.config import get_settings
        get_settings.cache_clear()
        isolated_handle._reset_operator_handle_cache_for_tests()
        assert isolated_handle.get_operator_handle() == "alice"

    def test_handle_is_sanitized(self, isolated_handle, monkeypatch):
        from services.config import get_settings

        # Sanitization tests run against the normalizer directly so the
        # empty-string case can be asserted independently of the env-var
        # resolution path (where empty means "use auto-gen", not "use
        # 'anonymous'").
        from services.network_utils import _normalize_handle

        cases = [
            ("Alice Smith", "alice-smith"),
            ("user@example.com", "user-example-com"),
            (" whitespace ", "whitespace"),
            ("UPPER-CASE", "upper-case"),
            ("multiple---dashes", "multiple-dashes"),
            ("/leading/slash", "leading-slash"),
            ("trailing-", "trailing"),
            ("", "anonymous"),
        ]
        for raw, expected in cases:
            got = _normalize_handle(raw)
            assert got == expected, f"{raw!r} -> {got!r}, expected {expected!r}"
            assert got == got.lower()
            for ch in got:
                assert ch.isalnum() or ch in "-_", f"unsafe char {ch!r} in {got!r}"
            assert "--" not in got

    def test_handle_is_length_capped(self, isolated_handle, monkeypatch):
        from services.config import get_settings

        monkeypatch.setenv("OPERATOR_HANDLE", "x" * 1000)
        get_settings.cache_clear()
        isolated_handle._reset_operator_handle_cache_for_tests()
        got = isolated_handle.get_operator_handle()
        assert len(got) <= 48


# ---------------------------------------------------------------------------
# outbound_user_agent() builds the right header
# ---------------------------------------------------------------------------


class TestOutboundUserAgentString:
    def test_includes_operator_handle(self, isolated_handle):
        ua = isolated_handle.outbound_user_agent()
        handle = isolated_handle.get_operator_handle()
        assert f"operator: {handle}" in ua

    def test_includes_purpose_when_provided(self, isolated_handle):
        ua = isolated_handle.outbound_user_agent("wikipedia")
        assert "purpose: wikipedia" in ua

    def test_includes_contact_path(self, isolated_handle):
        ua = isolated_handle.outbound_user_agent()
        assert "github.com" in ua.lower()
        assert "shadowbroker" in ua.lower()

    def test_version_prefix(self, isolated_handle):
        ua = isolated_handle.outbound_user_agent()
        assert ua.startswith("Shadowbroker/")


# ---------------------------------------------------------------------------
# Wikipedia / Wikidata — retroactive fix for PR #284's MONSTER pattern
# ---------------------------------------------------------------------------


class TestWikimediaCallsAreNowPerOperator:
    def test_wikidata_call_uses_per_operator_ua(self, isolated_handle, monkeypatch):
        from services import region_dossier

        captured = []

        class _FakeResp:
            status_code = 200
            def json(self):
                return {"results": {"bindings": []}}

        def fake_fetch(url, **kwargs):
            captured.append(kwargs.get("headers") or {})
            return _FakeResp()

        monkeypatch.setattr(region_dossier, "fetch_with_curl", fake_fetch)
        region_dossier._fetch_wikidata_leader("Testlandia")

        assert captured, "Wikidata fetcher was not called"
        headers = captured[0]
        assert "User-Agent" in headers
        assert "Api-User-Agent" in headers
        handle = isolated_handle.get_operator_handle()
        for header_value in (headers["User-Agent"], headers["Api-User-Agent"]):
            assert f"operator: {handle}" in header_value, (
                f"Wikimedia UA must include the per-operator handle; got {header_value!r}"
            )

    def test_wikipedia_summary_uses_per_operator_ua(self, isolated_handle, monkeypatch):
        from services import region_dossier

        captured = []

        class _FakeResp:
            status_code = 200
            def json(self):
                return {
                    "type": "standard",
                    "description": "x",
                    "extract": "y",
                    "thumbnail": {"source": ""},
                }

        def fake_fetch(url, **kwargs):
            captured.append((url, kwargs.get("headers") or {}))
            return _FakeResp()

        monkeypatch.setattr(region_dossier, "fetch_with_curl", fake_fetch)
        region_dossier._fetch_local_wiki_summary("Paris", "France")

        wikipedia_hits = [c for c in captured if "wikipedia.org" in c[0]]
        assert wikipedia_hits, "Wikipedia summary fetch was not called"
        for _url, headers in wikipedia_hits:
            handle = isolated_handle.get_operator_handle()
            assert f"operator: {handle}" in headers.get("User-Agent", "")


# ---------------------------------------------------------------------------
# Generic round-7a regression guard
# ---------------------------------------------------------------------------


class TestNoMonsterUserAgentRemains:
    """The audit's underlying concern was that every Shadowbroker install
    looked like one entity. This test scans the codebase for the OLD
    aggregate identifier patterns and fails if a new one sneaks back in.

    We allow the strings to appear in:
    - comments (audit prose, change-log notes)
    - tests
    - .env.example (documentation)
    The test only fails if the string lives in actual outbound-request
    HEADER values without going through the per-operator helper.
    """

    BANNED_LITERALS = (
        "ShadowBroker-OSINT/1.0",
        "ShadowBroker-OSINT/0.9",
        "ShadowBroker-FeedIngester/1.0",
        "ShadowBroker/0.9.79 local Shodan connector",
        "ShadowBroker/0.9.79 Finnhub connector",
        "Mozilla/5.0 (compatible; ShadowBroker CCTV proxy)",
    )

    def test_no_banned_aggregate_user_agent_strings(self):
        from pathlib import Path

        backend_root = Path(__file__).parent.parent
        offenders = []
        for py in backend_root.rglob("*.py"):
            # Skip test files and any audit-context comments.
            rel = py.relative_to(backend_root).as_posix()
            if rel.startswith("tests/"):
                continue
            text = py.read_text(encoding="utf-8", errors="ignore")
            # Look only for the literal as part of a string in a User-Agent
            # context: cheap heuristic via "User-Agent" + literal coexisting
            # in the same file. A literal in a comment block won't trigger
            # because the same line won't have User-Agent surrounding it.
            for banned in self.BANNED_LITERALS:
                if banned in text:
                    # Walk lines to ensure it's a real header value.
                    for i, line in enumerate(text.splitlines(), 1):
                        if banned in line:
                            # Comments / docstrings are allowed — only fail
                            # if the line looks like a header assignment.
                            stripped = line.strip()
                            if stripped.startswith("#"):
                                continue
                            if '"User-Agent"' in line or "'User-Agent'" in line:
                                offenders.append(f"{rel}:{i}: {stripped[:120]}")
        assert not offenders, (
            "Round 7a regression: the following lines reintroduced an "
            "aggregate Shadowbroker User-Agent. Use "
            "outbound_user_agent('purpose') instead so the per-install "
            "operator handle is embedded.\n"
            + "\n".join(offenders)
        )
@@ -77,15 +77,25 @@ def test_wikipedia_summary_call_passes_wikimedia_request_headers():
     assert "github.com" in headers["Api-User-Agent"].lower()
 
 
-def test_wikimedia_headers_constant_is_stable():
-    """Regression guard: if someone removes the contact path from the
-    Api-User-Agent we want a loud test failure, not a silent ToS drift.
-    """
-    from services.region_dossier import _WIKIMEDIA_REQUEST_HEADERS
+def test_wikimedia_headers_helper_is_stable():
+    """Regression guard: if someone removes the contact path or the
+    per-operator handle from the Wikimedia headers, we want a loud
+    test failure, not a silent ToS drift.
 
-    aua = _WIKIMEDIA_REQUEST_HEADERS.get("Api-User-Agent", "")
-    assert "Shadowbroker" in aua or "ShadowBroker" in aua
-    assert "github.com" in aua.lower()
-    # Must include a path Wikimedia operators can use to contact us
-    # (we use /issues against the public repo).
-    assert "issues" in aua.lower()
+    Round 7a: the original ``_WIKIMEDIA_REQUEST_HEADERS`` constant was
+    replaced with the ``_wikimedia_request_headers()`` function so the
+    per-install operator handle is embedded at call time. This test
+    pins both the project identifier AND the contact path AND the
+    per-operator format.
+    """
+    from services.region_dossier import _wikimedia_request_headers
+
+    headers = _wikimedia_request_headers()
+    aua = headers.get("Api-User-Agent", "")
+    ua = headers.get("User-Agent", "")
+    for h, label in ((ua, "User-Agent"), (aua, "Api-User-Agent")):
+        assert "Shadowbroker" in h or "ShadowBroker" in h, f"{label} missing project id"
+        assert "github.com" in h.lower(), f"{label} missing contact URL"
+        assert "issues" in h.lower(), f"{label} missing /issues contact path"
+        # Round 7a: must include the per-operator handle.
+        assert "operator:" in h, f"{label} missing per-operator handle: {h!r}"