mirror of
https://github.com/BigBodyCobain/Shadowbroker.git
synced 2026-05-28 10:01:31 +02:00
76750caa92
== Per-install operator handle for every third-party API call ==
Before this PR, every Shadowbroker install identified itself to
Wikipedia, Wikidata, Nominatim, GDELT, OpenMHz, Broadcastify,
weather.gov, NUFORC, Sentinel/Planetary Computer, TinyGS / CelesTrak,
Shodan, Finnhub, and others with a single project-wide User-Agent
("Shadowbroker/1.0" or "ShadowBroker-OSINT/1.0"). From the upstream's
perspective every install in the world looked like one giant scraper.
If one install misbehaved, the upstream's only recourse was to block
"Shadowbroker" as a whole.
PR #284 inadvertently doubled down on this in the frontend by
introducing a shared `WIKIMEDIA_API_USER_AGENT` constant. This PR
retrofits both the backend and the frontend to per-operator attribution.
New setting: OPERATOR_HANDLE (env var / settings UI / auto-gen)
New helper: network_utils.outbound_user_agent("purpose")
The handle is auto-generated as "operator-XXXXXX" on first call (the
"shadow-" prefix from earlier drafts was deliberately dropped — too
suspicious-looking for abuse-detection systems). Operators can
override via OPERATOR_HANDLE; the value is sanitized to lowercase
alphanumeric+dash+underscore and capped at 48 chars. Persisted to
backend/data/operator_handle.json so it survives container restarts.
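A minimal sketch of the handle rules described above (sanitize, cap at
48 chars, auto-generate, persist). The function names, the JSON key
"handle", and the hex alphabet for the XXXXXX suffix are assumptions
for illustration, not the actual network_utils implementation:

```python
import json
import re
import secrets
from pathlib import Path
from typing import Optional

MAX_HANDLE_LEN = 48  # cap stated in this commit message


def sanitize_handle(raw: str) -> str:
    """Lowercase, keep only [a-z0-9_-], and cap the length."""
    cleaned = re.sub(r"[^a-z0-9_-]", "", raw.strip().lower())
    return cleaned[:MAX_HANDLE_LEN]


def load_or_create_handle(path: Path, override: Optional[str] = None) -> str:
    """Resolve the handle: override first, then the stored file, then auto-gen."""
    if override:
        handle = sanitize_handle(override)
        if handle:
            return handle
    if path.exists():
        try:
            # Assumed file layout: {"handle": "<value>"}
            stored = sanitize_handle(str(json.loads(path.read_text()).get("handle", "")))
            if stored:
                return stored
        except (ValueError, OSError, AttributeError):
            pass  # unreadable or malformed file: fall through and regenerate
    # Assumed suffix: 6 hex characters
    handle = f"operator-{secrets.token_hex(3)}"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps({"handle": handle}))
    return handle
```

Calling load_or_create_handle(Path("backend/data/operator_handle.json"))
twice returns the same value, which is the property that lets the handle
survive container restarts when the data directory is a mounted volume.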
Retrofitted call sites (every one that previously sent the project-wide User-Agent):
- services/region_dossier.py (Wikipedia + Wikidata + Nominatim)
- services/geocode.py (Nominatim)
- services/sentinel_search.py (Microsoft Planetary Computer)
- services/feed_ingester.py (operator-curated RSS feeds)
- services/fetchers/earth_observation.py (weather.gov, NUFORC)
- services/fetchers/infrastructure.py
- services/fetchers/aircraft_database.py
- services/fetchers/route_database.py
- services/fetchers/trains.py
- services/fetchers/meshtastic_map.py
- services/shodan_connector.py
- services/unusual_whales_connector.py (Finnhub)
- services/tinygs_fetcher.py (CelesTrak + TinyGS)
- services/sar/sar_products_client.py
- services/geopolitics.py (GDELT)
- services/radio_intercept.py (Broadcastify + OpenMHz)
- routers/cctv.py + main.py (CCTV proxy)
- routers/ai_intel.py
- scripts/convert_power_plants.py (release-time data refresh)
Spoofed browser UAs removed (issues #289 / #290 / #291 — tg12 audit):
- cloudscraper-based Chrome impersonation against api.openmhz.com
  -> replaced with honest requests + per-install UA
- Mozilla/5.0 spoofed UA on Broadcastify scrape
  -> replaced with honest UA
- Mozilla/5.0 + fake first-party Referer on OpenMHz audio relay
  -> replaced with honest UA
- cloudscraper dependency dropped from pyproject.toml + uv.lock
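The "honest UA, degrade instead of spoof" pattern behind these removals
can be sketched as follows. This is an illustration only: it uses the
standard library instead of requests so it runs on its own, and
fetch_systems_or_empty plus its opener parameter are hypothetical names,
not functions from this repository:

```python
import json
import urllib.error
import urllib.request


def fetch_systems_or_empty(url, user_agent, opener=urllib.request.urlopen):
    """Fetch a JSON directory with an honest User-Agent; return [] on any failure."""
    req = urllib.request.Request(
        url,
        headers={"User-Agent": user_agent, "Accept": "application/json"},
    )
    try:
        with opener(req, timeout=15) as resp:
            data = json.load(resp)
    except (urllib.error.URLError, TimeoutError, ValueError):
        # HTTPError (e.g. a 403 from an anti-bot layer) is a URLError subclass.
        # The feature degrades to an empty result; there is no browser fallback.
        return []
    return data.get("systems", []) if isinstance(data, dict) else []
```

The opener argument exists only so the function can be exercised without
network access; a caller would normally leave it at its default.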
Frontend retrofit:
- new GET /api/settings/operator-handle endpoint (local-operator
  gated) returns the install's handle
- frontend/src/lib/wikimediaClient.ts fetches the handle once on
  first use, caches it for page lifetime, embeds it in the
  Api-User-Agent for every Wikipedia / Wikidata browser-direct call
== GDELT GCS-direct fix ==
GDELT's data.gdeltproject.org is a CNAME to a Google Cloud Storage
bucket. GCS responds with the wildcard *.storage.googleapis.com cert
which legitimately does NOT cover the GDELT custom domain, so Python's
TLS verification correctly refuses the connection. Some networks
happen to route through a path where this works; many (notably Docker
Desktop's outbound NAT on local installs) do not. Verified on the
maintainer's local install: GDELT was unreachable; 1610 geopolitical
events / 48 export files were dropping silently.
Fix: services/geopolitics._gcs_direct_gdelt_url() rewrites any
data.gdeltproject.org URL to its GCS-direct equivalent
(storage.googleapis.com/data.gdeltproject.org/...) where the standard
GCS cert is genuinely valid. api.gdeltproject.org and every other host
are left untouched.
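A rough sketch of the rewrite described above. The real helper is
services/geopolitics._gcs_direct_gdelt_url(); the version below is an
independent illustration, and forcing the https scheme on rewritten
URLs is an assumption:

```python
from urllib.parse import urlsplit, urlunsplit

GDELT_DATA_HOST = "data.gdeltproject.org"
GCS_HOST = "storage.googleapis.com"


def gcs_direct_gdelt_url(url: str) -> str:
    """Rewrite data.gdeltproject.org URLs to path-style GCS URLs; leave others alone."""
    parts = urlsplit(url)
    if (parts.hostname or "").lower() != GDELT_DATA_HOST:
        return url  # api.gdeltproject.org and every other host pass through
    # Path-style GCS access: storage.googleapis.com/<bucket>/<object>
    path = f"/{GDELT_DATA_HOST}{parts.path}"
    # Assumption: always use https for the rewritten URL.
    return urlunsplit(("https", GCS_HOST, path, parts.query, parts.fragment))


print(gcs_direct_gdelt_url("http://data.gdeltproject.org/gdeltv2/lastupdate.txt"))
# -> https://storage.googleapis.com/data.gdeltproject.org/gdeltv2/lastupdate.txt
```

Matching on the parsed hostname rather than a substring keeps a URL
that merely mentions data.gdeltproject.org in its path or query from
being rewritten.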
Confirmed live: backend log goes from
GDELT lastupdate failed: 500
to
Downloading 48 GDELT export files...
Downloaded 48/48 GDELT exports
GDELT parsed: 1610 conflict locations from 48 files
== Tests ==
backend/tests/test_per_operator_outbound_attribution.py (12 tests)
backend/tests/test_gdelt_gcs_direct_rewrite.py (6 tests)
backend/tests/test_region_dossier_wikimedia_ua.py (updated to
pin the helper + per-operator handle, not the old constant)
frontend/src/__tests__/utils/wikimediaClient.test.ts (rewritten
to mock /api/settings/operator-handle and assert per-operator UA)
Local: backend 114/114 security+audit+round7a suite green;
frontend 718/718 vitest suite green.
Credit: tg12 (external security audit, issues #289/#290/#291
relating to spoofed UAs); BigBodyCobain (operator-prefix call,
GDELT cloud-vs-local diagnosis).
358 lines
12 KiB
Python
import requests
from bs4 import BeautifulSoup
import logging
from cachetools import cached, TTLCache
import reverse_geocoder as rg
from urllib.parse import urlparse

from services.network_utils import outbound_user_agent

logger = logging.getLogger(__name__)

_OPENMHZ_AUDIO_HOSTS = {"media.openmhz.com", "media2.openmhz.com", "media3.openmhz.com"}


# Round 7a / Issues #289, #290, #291 (tg12 audit):
# We previously sent a spoofed Chrome User-Agent and (for OpenMHz) used
# cloudscraper to bypass anti-bot challenges. Both are dishonest and ToS-
# unfriendly. We now send the per-install Shadowbroker UA — the upstream
# can identify us, rate-limit us per install, and contact us if needed.
#
# If the upstream actively blocks our honest UA, the feature degrades
# gracefully (returns an empty list / cached results) rather than
# escalating to deception.


def _broadcastify_user_agent() -> str:
    return outbound_user_agent("broadcastify")


def _openmhz_user_agent() -> str:
    return outbound_user_agent("openmhz")

# Cache the top feeds for 5 minutes so we don't hammer Broadcastify
radio_cache = TTLCache(maxsize=1, ttl=300)


@cached(radio_cache)
def get_top_broadcastify_feeds():
    """
    Scrapes the Broadcastify Top 50 live audio feeds public dashboard.
    Returns a list of dictionaries containing feed metadata and direct stream URLs.
    """
    logger.info("Scraping Broadcastify Top Feeds (Cache Miss)")
    headers = {
        # Issue #289 (tg12) + Round 7a: identify ourselves honestly as a
        # per-install Shadowbroker scraper. Broadcastify can rate-limit
        # us per install or block us; either way we stop pretending to be
        # a browser. If they block, the panel degrades gracefully.
        "User-Agent": _broadcastify_user_agent(),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
    }

    try:
        res = requests.get("https://www.broadcastify.com/listen/top", headers=headers, timeout=10)
        if res.status_code != 200:
            logger.error(f"Broadcastify Scrape Failed: HTTP {res.status_code}")
            return []

        soup = BeautifulSoup(res.text, "html.parser")

        table = soup.find("table", {"class": "btable"})
        if not table:
            logger.error("Could not find feeds table on Broadcastify.")
            return []

        feeds = []
        rows = table.find_all("tr")[1:]  # Skip header row

        for row in rows:
            cols = row.find_all("td")
            if len(cols) >= 5:
                # Top layout: [Listeners, Feed ID (hidden), Location, Feed Name, Category, Genre]
                listeners_str = cols[0].text.strip().replace(",", "")
                listeners = int(listeners_str) if listeners_str.isdigit() else 0

                link_tag = cols[2].find("a")
                if not link_tag:
                    continue

                href = link_tag.get("href", "")
                feed_id = href.split("/")[-1] if "/listen/feed/" in href else None

                if not feed_id:
                    continue

                location = cols[1].text.strip()
                name = cols[2].text.strip()
                category = cols[3].text.strip()

                feeds.append(
                    {
                        "id": feed_id,
                        "listeners": listeners,
                        "location": location,
                        "name": name,
                        "category": category,
                        "stream_url": f"https://broadcastify.cdnstream1.com/{feed_id}",
                    }
                )

        logger.info(f"Successfully scraped {len(feeds)} top feeds from Broadcastify.")
        return feeds

    except (requests.RequestException, ConnectionError, TimeoutError, ValueError, KeyError) as e:
        logger.error(f"Broadcastify Scrape Exception: {e}")
        return []


# Cache OpenMHZ systems mapping so we don't have to fetch all 450+ every time
openmhz_systems_cache = TTLCache(maxsize=1, ttl=3600)


@cached(openmhz_systems_cache)
def get_openmhz_systems():
    """Fetches the full directory of OpenMHZ systems.

    Issue #290 (tg12) + Round 7a: replaced cloudscraper-based Chrome
    impersonation with an honest per-install Shadowbroker User-Agent.
    If OpenMHz's Cloudflare layer blocks honest traffic, we accept
    that degradation (return empty list) rather than spoof a browser.
    """
    logger.info("Fetching OpenMHZ Systems (Cache Miss)")
    try:
        res = requests.get(
            "https://api.openmhz.com/systems",
            timeout=15,
            headers={"User-Agent": _openmhz_user_agent(), "Accept": "application/json"},
        )
        if res.status_code == 200:
            data = res.json()
            return data.get("systems", []) if isinstance(data, dict) else []
        if res.status_code in (403, 503):
            logger.warning(
                "OpenMHZ returned %s for systems directory — Cloudflare may "
                "be blocking our honest UA. Feature degrades to empty result.",
                res.status_code,
            )
        return []
    except (requests.RequestException, ConnectionError, TimeoutError, ValueError, KeyError) as e:
        logger.error(f"OpenMHZ Systems Fetch Exception: {e}")
        return []


# Cache specific city calls briefly (15-30s) to limit our polling rate
openmhz_calls_cache = TTLCache(maxsize=100, ttl=20)


@cached(openmhz_calls_cache)
def get_recent_openmhz_calls(sys_name: str):
    """Fetches the actual audio burst .m4a URLs for a specific system (e.g., 'wmata').

    Issue #290 (tg12) + Round 7a: same honest-UA model as
    ``get_openmhz_systems``.
    """
    logger.info(f"Fetching OpenMHZ calls for {sys_name} (Cache Miss)")
    try:
        url = f"https://api.openmhz.com/{sys_name}/calls"
        res = requests.get(
            url,
            timeout=15,
            headers={"User-Agent": _openmhz_user_agent(), "Accept": "application/json"},
        )
        if res.status_code == 200:
            data = res.json()
            return data.get("calls", []) if isinstance(data, dict) else []
        return []
    except (requests.RequestException, ConnectionError, TimeoutError, ValueError, KeyError) as e:
        logger.error(f"OpenMHZ Calls Fetch Exception ({sys_name}): {e}")
        return []


_OPENMHZ_MAX_REDIRECTS = 5


def openmhz_audio_response(target_url: str):
    """Fetch an OpenMHz audio object through the backend with browser-safe headers.

    Redirects are followed manually so each hop's host can be re-validated
    against ``_OPENMHZ_AUDIO_HOSTS``. Without this, the upstream could
    302-redirect to an internal address (e.g. ``http://127.0.0.1:8000/...``
    or an RFC1918 range), and the backend would dutifully fetch and stream
    that response back to the browser — a classic open-redirect-to-SSRF
    chain. Same-host redirects (CDN edge selection) still work normally.
    """
    from fastapi import HTTPException
    from fastapi.responses import StreamingResponse
    from urllib.parse import urljoin

    parsed = urlparse(str(target_url or ""))
    host = (parsed.hostname or "").lower()
    if parsed.scheme != "https" or host not in _OPENMHZ_AUDIO_HOSTS:
        raise HTTPException(status_code=400, detail="Unsupported OpenMHz audio URL")

    current_url = target_url
    hops = 0
    try:
        while True:
            upstream = requests.get(
                current_url,
                stream=True,
                timeout=(5, 20),
                allow_redirects=False,
                headers={
                    # Issue #291 (tg12) + Round 7a: drop spoofed Mozilla
                    # UA and the fake first-party Referer. Identify as
                    # the per-install Shadowbroker proxy honestly.
                    "User-Agent": _openmhz_user_agent(),
                    "Accept": "audio/mpeg,audio/*,*/*;q=0.8",
                },
            )
            if upstream.is_redirect or upstream.status_code in (301, 302, 303, 307, 308):
                location = upstream.headers.get("Location", "")
                upstream.close()
                if hops >= _OPENMHZ_MAX_REDIRECTS or not location:
                    raise HTTPException(status_code=502, detail="OpenMHz redirect rejected")
                next_url = urljoin(current_url, location)
                next_parsed = urlparse(next_url)
                next_host = (next_parsed.hostname or "").lower()
                # Re-validate the next hop against the same allowlist used for
                # the original URL. Cross-host redirects to disallowed hosts
                # are rejected silently; the browser audio element handles
                # the resulting 502 gracefully and moves on.
                if next_parsed.scheme != "https" or next_host not in _OPENMHZ_AUDIO_HOSTS:
                    raise HTTPException(status_code=502, detail="OpenMHz redirect rejected")
                current_url = next_url
                hops += 1
                continue
            break
    except requests.RequestException as exc:
        raise HTTPException(status_code=502, detail="OpenMHz audio fetch failed") from exc

    if upstream.status_code >= 400:
        upstream.close()
        raise HTTPException(status_code=upstream.status_code, detail="OpenMHz audio unavailable")

    def chunks():
        try:
            for chunk in upstream.iter_content(chunk_size=64 * 1024):
                if chunk:
                    yield chunk
        finally:
            upstream.close()

    return StreamingResponse(
        chunks(),
        media_type="audio/mpeg",
        headers={
            "Cache-Control": "public, max-age=300",
            "Accept-Ranges": "bytes",
        },
    )


US_STATES = {
    "Alabama": "AL",
    "Alaska": "AK",
    "Arizona": "AZ",
    "Arkansas": "AR",
    "California": "CA",
    "Colorado": "CO",
    "Connecticut": "CT",
    "Delaware": "DE",
    "Florida": "FL",
    "Georgia": "GA",
    "Hawaii": "HI",
    "Idaho": "ID",
    "Illinois": "IL",
    "Indiana": "IN",
    "Iowa": "IA",
    "Kansas": "KS",
    "Kentucky": "KY",
    "Louisiana": "LA",
    "Maine": "ME",
    "Maryland": "MD",
    "Massachusetts": "MA",
    "Michigan": "MI",
    "Minnesota": "MN",
    "Mississippi": "MS",
    "Missouri": "MO",
    "Montana": "MT",
    "Nebraska": "NE",
    "Nevada": "NV",
    "New Hampshire": "NH",
    "New Jersey": "NJ",
    "New Mexico": "NM",
    "New York": "NY",
    "North Carolina": "NC",
    "North Dakota": "ND",
    "Ohio": "OH",
    "Oklahoma": "OK",
    "Oregon": "OR",
    "Pennsylvania": "PA",
    "Rhode Island": "RI",
    "South Carolina": "SC",
    "South Dakota": "SD",
    "Tennessee": "TN",
    "Texas": "TX",
    "Utah": "UT",
    "Vermont": "VT",
    "Virginia": "VA",
    "Washington": "WA",
    "West Virginia": "WV",
    "Wisconsin": "WI",
    "Wyoming": "WY",
    "Washington, D.C.": "DC",
    "District of Columbia": "DC",
}

import math


def haversine_distance(lat1, lon1, lat2, lon2):
    R = 3958.8  # Earth radius in miles
    dLat = math.radians(lat2 - lat1)
    dLon = math.radians(lon2 - lon1)
    a = math.sin(dLat / 2) * math.sin(dLat / 2) + math.cos(math.radians(lat1)) * math.cos(
        math.radians(lat2)
    ) * math.sin(dLon / 2) * math.sin(dLon / 2)
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    return R * c


def find_nearest_openmhz_systems_list(lat: float, lng: float, limit: int = 5):
    """
    Finds the strictly nearest OpenMHZ systems by distance.
    """
    systems = get_openmhz_systems()
    if not systems:
        return []

    # Calculate distance for all systems that provide coordinates
    valid_systems = []
    for s in systems:
        s_lat = s.get("lat")
        s_lng = s.get("lng")
        if s_lat is not None and s_lng is not None:
            dist = haversine_distance(lat, lng, float(s_lat), float(s_lng))
            s["distance_miles"] = dist
            valid_systems.append(s)

    if not valid_systems:
        return []

    # Sort strictly by distance
    valid_systems.sort(key=lambda x: x["distance_miles"])
    return valid_systems[:limit]


def find_nearest_openmhz_system(lat: float, lng: float):
    """
    Returns the single closest OpenMHZ system by distance.
    """
    nearest = find_nearest_openmhz_systems_list(lat, lng, limit=1)
    if nearest:
        return nearest[0]
    return None