mirror of
https://github.com/BigBodyCobain/Shadowbroker.git
synced 2026-06-03 21:08:13 +02:00
76750caa92
== Per-install operator handle for every third-party API call ==
Before this PR, every Shadowbroker install identified itself to
Wikipedia, Wikidata, Nominatim, GDELT, OpenMHz, Broadcastify,
weather.gov, NUFORC, Sentinel/Planetary Computer, TinyGS / CelesTrak,
Shodan, Finnhub, and others with a single project-wide User-Agent
("Shadowbroker/1.0" or "ShadowBroker-OSINT/1.0"). From the upstream's
perspective every install in the world looked like one giant scraper.
If one install misbehaved, the upstream's only recourse was to block
"Shadowbroker" as a whole.
PR #284 inadvertently doubled down on this in the frontend by
introducing a shared `WIKIMEDIA_API_USER_AGENT` constant. This PR
retrofits both the backend and the frontend to per-operator attribution.
New setting: OPERATOR_HANDLE (env var / settings UI / auto-gen)
New helper: network_utils.outbound_user_agent("purpose")
The handle is auto-generated as "operator-XXXXXX" on first call (the
"shadow-" prefix from earlier drafts was deliberately dropped — too
suspicious-looking for abuse-detection systems). Operators can
override via OPERATOR_HANDLE; the value is sanitized to lowercase
alphanumeric+dash+underscore and capped at 48 chars. Persisted to
backend/data/operator_handle.json so it survives container restarts.
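
The handle rules above can be sketched as follows. This is an illustration
only, not the backend code (which is Python and is not shown here). The
function names are hypothetical, and two details are assumptions: that the
"XXXXXX" suffix is six lowercase hex characters, and that disallowed
characters are dropped rather than replaced.

```typescript
// Hypothetical names throughout; this only illustrates the rules stated above.
const MAX_HANDLE_LENGTH = 48;

// Lowercase the input, keep only a-z, 0-9, dash and underscore, cap at 48 chars.
// Assumption: disallowed characters are dropped rather than substituted.
function sanitizeOperatorHandle(raw: string): string {
  return raw
    .toLowerCase()
    .replace(/[^a-z0-9_-]/g, '')
    .slice(0, MAX_HANDLE_LENGTH);
}

// Assumption: the auto-generated suffix is six lowercase hex characters.
function generateOperatorHandle(random: () => number = Math.random): string {
  let suffix = '';
  for (let i = 0; i < 6; i++) {
    suffix += Math.floor(random() * 16).toString(16);
  }
  return `operator-${suffix}`;
}

// An operator override wins if anything survives sanitization;
// otherwise fall back to an auto-generated handle.
function resolveOperatorHandle(override?: string): string {
  const cleaned = sanitizeOperatorHandle(override ?? '');
  return cleaned || generateOperatorHandle();
}
```

Under these assumptions, resolveOperatorHandle('Field_Team-7') returns
'field_team-7', and an override made up entirely of disallowed characters
falls through to a generated "operator-" handle.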
Retrofitted call sites (every User-Agent that was previously project-wide):
- services/region_dossier.py (Wikipedia + Wikidata + Nominatim)
- services/geocode.py (Nominatim)
- services/sentinel_search.py (Microsoft Planetary Computer)
- services/feed_ingester.py (operator-curated RSS feeds)
- services/fetchers/earth_observation.py (weather.gov, NUFORC)
- services/fetchers/infrastructure.py
- services/fetchers/aircraft_database.py
- services/fetchers/route_database.py
- services/fetchers/trains.py
- services/fetchers/meshtastic_map.py
- services/shodan_connector.py
- services/unusual_whales_connector.py (Finnhub)
- services/tinygs_fetcher.py (CelesTrak + TinyGS)
- services/sar/sar_products_client.py
- services/geopolitics.py (GDELT)
- services/radio_intercept.py (Broadcastify + OpenMHz)
- routers/cctv.py + main.py (CCTV proxy)
- routers/ai_intel.py
- scripts/convert_power_plants.py (release-time data refresh)
Spoofed browser UAs removed (issues #289 / #290 / #291 — tg12 audit):
- cloudscraper-based Chrome impersonation against api.openmhz.com
-> replaced with honest requests + per-install UA
- Mozilla/5.0 spoofed UA on Broadcastify scrape
-> replaced with honest UA
- Mozilla/5.0 + fake first-party Referer on OpenMHz audio relay
-> replaced with honest UA
- cloudscraper dependency dropped from pyproject.toml + uv.lock
Frontend retrofit:
- new GET /api/settings/operator-handle endpoint (local-operator
gated) returns the install's handle
- frontend/src/lib/wikimediaClient.ts fetches the handle once on
first use, caches it for page lifetime, embeds it in the
Api-User-Agent for every Wikipedia / Wikidata browser-direct call
== GDELT GCS-direct fix ==
GDELT's data.gdeltproject.org is a CNAME to a Google Cloud Storage
bucket. GCS responds with the wildcard *.storage.googleapis.com cert
which legitimately does NOT cover the GDELT custom domain, so Python's
TLS verification correctly refuses the connection. Some networks
happen to route through a path where this works; many (notably Docker
Desktop's outbound NAT on local installs) do not. Verified on the
maintainer's local install: GDELT was unreachable; 1610 geopolitical
events / 48 export files were dropping silently.
Fix: services/geopolitics._gcs_direct_gdelt_url() rewrites any
data.gdeltproject.org URL to its GCS-direct equivalent
(storage.googleapis.com/data.gdeltproject.org/...) where the standard
GCS cert is genuinely valid. api.gdeltproject.org and every other host
are left untouched.
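
A minimal sketch of the rewrite, in TypeScript for illustration (the real
helper is the Python function named above, and its exact behavior is not
shown here). The function name gcsDirectGdeltUrl is hypothetical. The sketch
assumes the scheme is preserved and only the host and path change.

```typescript
const GDELT_DATA_HOST = 'data.gdeltproject.org';
const GCS_HOST = 'storage.googleapis.com';

// Rewrite https://data.gdeltproject.org/<path> to
// https://storage.googleapis.com/data.gdeltproject.org/<path>.
// Every other host (including api.gdeltproject.org) is returned unchanged.
function gcsDirectGdeltUrl(url: string): string {
  let parsed: URL;
  try {
    parsed = new URL(url);
  } catch {
    return url; // Not an absolute URL: leave it untouched.
  }
  // URL normalizes the hostname to lowercase, so a plain comparison is enough.
  if (parsed.hostname !== GDELT_DATA_HOST) return url;
  parsed.hostname = GCS_HOST;
  // The bucket name becomes the first path segment on the GCS host.
  parsed.pathname = `/${GDELT_DATA_HOST}${parsed.pathname}`;
  return parsed.toString();
}
```

With this sketch, a URL on api.gdeltproject.org comes back byte-for-byte
identical, because the function returns the original string whenever the
host does not match.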
Confirmed live: backend log goes from
GDELT lastupdate failed: 500
to
Downloading 48 GDELT export files...
Downloaded 48/48 GDELT exports
GDELT parsed: 1610 conflict locations from 48 files
== Tests ==
backend/tests/test_per_operator_outbound_attribution.py (12 tests)
backend/tests/test_gdelt_gcs_direct_rewrite.py (6 tests)
backend/tests/test_region_dossier_wikimedia_ua.py (updated to
pin the helper + per-operator handle, not the old constant)
frontend/src/__tests__/utils/wikimediaClient.test.ts (rewritten
to mock /api/settings/operator-handle and assert per-operator UA)
Local: backend 114/114 security+audit+round7a suite green;
frontend 718/718 vitest suite green.
Credit: tg12 (external security audit, issues #289/#290/#291
relating to spoofed UAs); BigBodyCobain (operator-prefix call,
GDELT cloud-vs-local diagnosis).
211 lines
8.0 KiB
TypeScript
/**
 * wikimediaClient: single fetch surface for Wikipedia / Wikidata.
 *
 * Issues #218, #219, #220 (tg12 external audit) + Round 7a:
 *
 * Wikimedia's User-Agent policy asks API clients to identify themselves
 * via `Api-User-Agent` when calling from browser JavaScript (because the
 * browser does not let JS set `User-Agent` directly). Three independent
 * components used to issue anonymous browser fetches against Wikipedia /
 * Wikidata:
 *
 *   - useRegionDossier (Wikidata SPARQL + Wikipedia REST summary)
 *   - WikiImage (Wikipedia REST summary)
 *   - NewsFeed (Wikipedia REST summary)
 *
 * PR #284 collapsed them into this shared helper with one stable
 * `Api-User-Agent`. That fixed compliance but introduced a new problem:
 * the `Api-User-Agent` was project-wide, so from Wikimedia's perspective
 * every Shadowbroker install looked like one giant scraper. If one
 * install misbehaved, Wikimedia's only recourse was to block the project
 * as a whole.
 *
 * Round 7a fixes that. The frontend fetches the per-install operator
 * handle from `GET /api/settings/operator-handle` once on first use and
 * embeds it in the `Api-User-Agent`. Wikimedia can now rate-limit /
 * contact the specific install instead of the project. The handle is
 * auto-generated on the backend (`operator-XXXXXX`) or operator-chosen via
 * the `OPERATOR_HANDLE` setting.
 *
 * UX impact: zero. Same thumbnails, same summaries, same load behavior.
 * The only observable change is the value of the outgoing
 * `Api-User-Agent` header.
 */

// Module-level cache shared by WikiImage, NewsFeed, and useRegionDossier.
// Keyed by Wikipedia article title (NOT slug; we keep the human-readable
// form so debugging the cache is easier). Values track in-flight state
// so concurrent callers for the same title share one network request.
export interface WikipediaSummary {
  title: string;
  description: string;
  extract: string;
  thumbnail: string;
  type: string; // 'standard' | 'disambiguation' | etc.
}

interface CacheEntry {
  summary: WikipediaSummary | null;
  inflight: Promise<WikipediaSummary | null> | null;
  loaded: boolean;
}

const _summaryCache: Map<string, CacheEntry> = new Map();
const SUMMARY_CACHE_MAX = 512;

// Drop the oldest-inserted entry once the cache exceeds the cap. A Map
// iterates in insertion order, so the first key is the oldest one.
function evictIfOverCap() {
  if (_summaryCache.size <= SUMMARY_CACHE_MAX) return;
  const oldest = _summaryCache.keys().next().value;
  if (oldest) _summaryCache.delete(oldest);
}

// ─── Per-operator handle (Round 7a) ────────────────────────────────────────

// Fetched once from the backend on first need and cached for the page
// lifetime. The handle is NOT a secret (Wikimedia will see it on every
// Wikipedia / Wikidata request we make), but caching it locally avoids a
// round-trip on every Wikipedia fetch and lets the offline / no-backend
// case still produce a stable UA (the fallback handle).
let _handlePromise: Promise<string> | null = null;
let _cachedHandle: string | null = null;

const FALLBACK_HANDLE = 'operator-offline';
const HANDLE_ENDPOINT = '/api/settings/operator-handle';

async function fetchOperatorHandle(): Promise<string> {
  try {
    const res = await fetch(HANDLE_ENDPOINT, {
      // Use the standard relative-path proxy so the Next.js admin-key
      // injection (same-origin) flows naturally for legitimate browser
      // sessions. A cross-origin scanner will be blocked by the proxy
      // before this even leaves their browser.
      credentials: 'same-origin',
    });
    if (!res.ok) return FALLBACK_HANDLE;
    const data = await res.json();
    const h = (data && typeof data.handle === 'string' && data.handle.trim()) || '';
    return h || FALLBACK_HANDLE;
  } catch {
    return FALLBACK_HANDLE;
  }
}

async function getOperatorHandle(): Promise<string> {
  if (_cachedHandle) return _cachedHandle;
  if (!_handlePromise) {
    _handlePromise = fetchOperatorHandle().then((h) => {
      _cachedHandle = h;
      return h;
    });
  }
  return _handlePromise;
}

/** Build the Wikimedia Api-User-Agent for this install.
 *
 * Includes the per-install operator handle so Wikimedia can rate-limit /
 * contact the specific operator instead of the project as a whole.
 * Exported for tests; production callers should let
 * `fetchWikipediaSummary` / `fetchWikidataSparql` build it implicitly.
 */
export async function buildWikimediaUserAgent(purpose: string): Promise<string> {
  const handle = await getOperatorHandle();
  const safePurpose = (purpose || '').replace(/[^a-zA-Z0-9_-]/g, '-').toLowerCase();
  return (
    `Shadowbroker/1.0 (operator: ${handle}; purpose: ${safePurpose}; ` +
    '+https://github.com/BigBodyCobain/Shadowbroker; report issues at /issues)'
  );
}

// ─── Wikipedia summary fetch ───────────────────────────────────────────────

/** Fetch a Wikipedia article summary (titles, NOT URLs).
 *
 * Empty / invalid input resolves to `null`. Network errors and disambig
 * pages also resolve to `null` so callers can render a fallback without
 * a try/catch. Per the audit's "fail forward, not loud" rule.
 */
export async function fetchWikipediaSummary(
  title: string,
): Promise<WikipediaSummary | null> {
  const trimmed = (title || '').trim();
  if (!trimmed) return null;

  const cached = _summaryCache.get(trimmed);
  if (cached?.loaded) return cached.summary;
  if (cached?.inflight) return cached.inflight;

  const slug = encodeURIComponent(trimmed.replace(/ /g, '_'));
  const url = `https://en.wikipedia.org/api/rest_v1/page/summary/${slug}`;

  const promise = (async (): Promise<WikipediaSummary | null> => {
    try {
      const ua = await buildWikimediaUserAgent('wikipedia-summary');
      const r = await fetch(url, { headers: { 'Api-User-Agent': ua } });
      if (!r.ok) return null;
      const d = await r.json();
      if (d?.type === 'disambiguation') return null;
      return {
        title: trimmed,
        description: d?.description || '',
        extract: d?.extract || '',
        thumbnail: d?.thumbnail?.source || d?.originalimage?.source || '',
        type: d?.type || 'standard',
      };
    } catch {
      return null;
    }
  })().then((summary) => {
    _summaryCache.set(trimmed, { summary, inflight: null, loaded: true });
    evictIfOverCap();
    return summary;
  });

  _summaryCache.set(trimmed, { summary: null, inflight: promise, loaded: false });
  evictIfOverCap();
  return promise;
}

// ─── Wikidata SPARQL ───────────────────────────────────────────────────────

/** Fetch a Wikidata SPARQL query result.
 *
 * Returns the parsed JSON `results.bindings` array on success; `null`
 * (not throwing) on any failure so callers can render fallbacks
 * silently. Per-install operator handle threaded through `Api-User-Agent`
 * (Round 7a).
 */
export async function fetchWikidataSparql<T = Record<string, { value: string }>>(
  sparql: string,
): Promise<T[] | null> {
  const trimmed = (sparql || '').trim();
  if (!trimmed) return null;
  const url = `https://query.wikidata.org/sparql?query=${encodeURIComponent(
    trimmed,
  )}&format=json`;
  try {
    const ua = await buildWikimediaUserAgent('wikidata-sparql');
    const res = await fetch(url, {
      headers: {
        'Api-User-Agent': ua,
        Accept: 'application/sparql-results+json',
      },
    });
    if (!res.ok) return null;
    const json = await res.json();
    const bindings = json?.results?.bindings;
    return Array.isArray(bindings) ? (bindings as T[]) : null;
  } catch {
    return null;
  }
}

// ─── Test helpers ──────────────────────────────────────────────────────────

/** Internal: clear the shared cache + the handle cache. Exposed for tests only. */
export function _resetWikimediaClientCacheForTests() {
  _summaryCache.clear();
  _handlePromise = null;
  _cachedHandle = null;
}