Round 7a: per-operator outbound attribution + GDELT GCS-direct fix (#292)

== Per-install operator handle for every third-party API call ==

Before this PR, every Shadowbroker install identified itself to
Wikipedia, Wikidata, Nominatim, GDELT, OpenMHz, Broadcastify,
weather.gov, NUFORC, Sentinel/Planetary Computer, TinyGS / CelesTrak,
Shodan, Finnhub, and others with a single project-wide User-Agent
("Shadowbroker/1.0" or "ShadowBroker-OSINT/1.0"). From the upstream's
perspective every install in the world looked like one giant scraper.
If one install misbehaved, the upstream's only recourse was to block
"Shadowbroker" as a whole.

PR #284 inadvertently doubled down on this in the frontend by
introducing a shared `WIKIMEDIA_API_USER_AGENT` constant. This PR
retrofits both backends to per-operator attribution.

  New setting: OPERATOR_HANDLE (env var / settings UI / auto-gen)
  New helper:  network_utils.outbound_user_agent("purpose")

The handle is auto-generated as "operator-XXXXXX" on first call (the
"shadow-" prefix from earlier drafts was deliberately dropped — too
suspicious-looking for abuse-detection systems). Operators can
override via OPERATOR_HANDLE; the value is sanitized to lowercase
alphanumeric+dash+underscore and capped at 48 chars. Persisted to
backend/data/operator_handle.json so it survives container restarts.

Retrofitted call sites (every previously-MONSTER User-Agent):
  - services/region_dossier.py (Wikipedia + Wikidata + Nominatim)
  - services/geocode.py         (Nominatim)
  - services/sentinel_search.py (Microsoft Planetary Computer)
  - services/feed_ingester.py   (operator-curated RSS feeds)
  - services/fetchers/earth_observation.py (weather.gov, NUFORC)
  - services/fetchers/infrastructure.py
  - services/fetchers/aircraft_database.py
  - services/fetchers/route_database.py
  - services/fetchers/trains.py
  - services/fetchers/meshtastic_map.py
  - services/shodan_connector.py
  - services/unusual_whales_connector.py (Finnhub)
  - services/tinygs_fetcher.py            (CelesTrak + TinyGS)
  - services/sar/sar_products_client.py
  - services/geopolitics.py               (GDELT)
  - services/radio_intercept.py           (Broadcastify + OpenMHz)
  - routers/cctv.py + main.py             (CCTV proxy)
  - routers/ai_intel.py
  - scripts/convert_power_plants.py       (release-time data refresh)

Spoofed browser UAs removed (issues #289 / #290 / #291 — tg12 audit):
  - cloudscraper-based Chrome impersonation against api.openmhz.com
    -> replaced with honest requests + per-install UA
  - Mozilla/5.0 spoofed UA on Broadcastify scrape
    -> replaced with honest UA
  - Mozilla/5.0 + fake first-party Referer on OpenMHz audio relay
    -> replaced with honest UA
  - cloudscraper dependency dropped from pyproject.toml + uv.lock

Frontend retrofit:
  - new GET /api/settings/operator-handle endpoint (local-operator
    gated) returns the install's handle
  - frontend/src/lib/wikimediaClient.ts fetches the handle once on
    first use, caches it for page lifetime, embeds it in the
    Api-User-Agent for every Wikipedia / Wikidata browser-direct call

== GDELT GCS-direct fix ==

GDELT's data.gdeltproject.org is a CNAME to a Google Cloud Storage
bucket. GCS responds with the wildcard *.storage.googleapis.com cert
which legitimately does NOT cover the GDELT custom domain, so Python's
TLS verification correctly refuses the connection. Some networks
happen to route through a path where this works; many (notably Docker
Desktop's outbound NAT on local installs) do not. Verified on the
maintainer's local install: GDELT was unreachable; 1610 geopolitical
events / 48 export files were dropping silently.

Fix: services/geopolitics._gcs_direct_gdelt_url() rewrites any
data.gdeltproject.org URL to its GCS-direct equivalent
(storage.googleapis.com/data.gdeltproject.org/...) where the standard
GCS cert is genuinely valid. api.gdeltproject.org and every other host
are left untouched.

Confirmed live: backend log goes from
  GDELT lastupdate failed: 500
to
  Downloading 48 GDELT export files...
  Downloaded 48/48 GDELT exports
  GDELT parsed: 1610 conflict locations from 48 files

== Tests ==

  backend/tests/test_per_operator_outbound_attribution.py (12 tests)
  backend/tests/test_gdelt_gcs_direct_rewrite.py          (6 tests)
  backend/tests/test_region_dossier_wikimedia_ua.py       (updated to
    pin the helper + per-operator handle, not the old constant)
  frontend/src/__tests__/utils/wikimediaClient.test.ts    (rewritten
    to mock /api/settings/operator-handle and assert per-operator UA)

Local: backend 114/114 security+audit+round7a suite green;
       frontend 718/718 vitest suite green.

Credit: tg12 (external security audit, issues #289/#290/#291
relating to spoofed UAs); BigBodyCobain (operator-prefix call,
GDELT cloud-vs-local diagnosis).
This commit is contained in:
Shadowbroker
2026-05-21 15:11:28 -06:00
committed by GitHub
parent c3ef9f4b9e
commit 76750caa92
31 changed files with 1120 additions and 237 deletions
@@ -1,16 +1,21 @@
/**
* Issues #218 / #219 / #220 (tg12 external audit):
* Issues #218 / #219 / #220 (tg12 external audit) + Round 7a:
*
* Every browser-direct call to Wikipedia or Wikidata must send the
* `Api-User-Agent` header that Wikimedia's UA policy asks for. These
* tests pin that requirement on the shared `lib/wikimediaClient`
* `Api-User-Agent` header that Wikimedia's UA policy asks for, AND must
* embed the per-install operator handle so Wikimedia can rate-limit /
* contact the specific operator instead of treating "Shadowbroker" as
* one giant entity.
*
* These tests pin both requirements on the shared `lib/wikimediaClient`
* helper that WikiImage, NewsFeed, and useRegionDossier all route
* through, so a future refactor that drops the header gets a loud
* test failure rather than a silent ToS regression.
* through. A future refactor that drops either the header OR the
* per-operator handle gets a loud test failure rather than a silent
* ToS / privacy regression.
*/
import { afterEach, beforeEach, describe, expect, it, vi } from 'vitest';
import {
WIKIMEDIA_API_USER_AGENT,
buildWikimediaUserAgent,
fetchWikipediaSummary,
fetchWikidataSparql,
_resetWikimediaClientCacheForTests,
@@ -18,6 +23,18 @@ import {
const originalFetch = globalThis.fetch;
// Helper: stub fetch so calls to /api/settings/operator-handle return a
// known handle, and everything else proxies to whatever the test set up.
function withHandle(handle: string, otherFetch: typeof globalThis.fetch) {
return vi.fn(async (input: any, init?: RequestInit) => {
const url = String(input);
if (url.endsWith('/api/settings/operator-handle')) {
return new Response(JSON.stringify({ handle }), { status: 200 });
}
return otherFetch(input, init);
});
}
describe('lib/wikimediaClient', () => {
beforeEach(() => {
_resetWikimediaClientCacheForTests();
@@ -28,16 +45,35 @@ describe('lib/wikimediaClient', () => {
vi.restoreAllMocks();
});
it('exposes a stable Api-User-Agent identifier with a contact path', () => {
expect(WIKIMEDIA_API_USER_AGENT).toContain('Shadowbroker');
expect(WIKIMEDIA_API_USER_AGENT.toLowerCase()).toContain('github.com');
expect(WIKIMEDIA_API_USER_AGENT.toLowerCase()).toContain('issues');
it('builds a stable per-operator Api-User-Agent with contact path', async () => {
globalThis.fetch = withHandle(
'operator-abc123',
vi.fn(async () => new Response('{}', { status: 200 })) as any,
) as any;
const ua = await buildWikimediaUserAgent('wikipedia-summary');
expect(ua).toContain('Shadowbroker');
expect(ua.toLowerCase()).toContain('github.com');
expect(ua.toLowerCase()).toContain('issues');
expect(ua).toContain('operator: operator-abc123');
expect(ua).toContain('purpose: wikipedia-summary');
});
it('sends Api-User-Agent on Wikipedia summary fetch', async () => {
const calls: Array<{ url: string; init?: RequestInit }> = [];
globalThis.fetch = vi.fn(async (url: any, init?: RequestInit) => {
calls.push({ url: String(url), init });
it('falls back to "operator-offline" when handle endpoint is unreachable', async () => {
globalThis.fetch = vi.fn(async (input: any) => {
const url = String(input);
if (url.endsWith('/api/settings/operator-handle')) {
return new Response('forbidden', { status: 403 });
}
return new Response('{}', { status: 200 });
}) as any;
const ua = await buildWikimediaUserAgent('test');
expect(ua).toContain('operator: operator-offline');
});
it('sends per-operator Api-User-Agent on Wikipedia summary fetch', async () => {
const wikiCalls: Array<{ url: string; init?: RequestInit }> = [];
const baseFetch = vi.fn(async (url: any, init?: RequestInit) => {
wikiCalls.push({ url: String(url), init });
return new Response(
JSON.stringify({
type: 'standard',
@@ -48,44 +84,71 @@ describe('lib/wikimediaClient', () => {
}),
{ status: 200 },
);
}) as any;
});
globalThis.fetch = withHandle('operator-test01', baseFetch as any) as any;
const summary = await fetchWikipediaSummary('Boeing 747');
expect(summary?.thumbnail).toBe('https://example.org/thumb.jpg');
expect(calls).toHaveLength(1);
const headers = (calls[0].init?.headers || {}) as Record<string, string>;
expect(headers['Api-User-Agent']).toBe(WIKIMEDIA_API_USER_AGENT);
// wikiCalls only captures calls to non-handle URLs.
expect(wikiCalls).toHaveLength(1);
const headers = (wikiCalls[0].init?.headers || {}) as Record<string, string>;
expect(headers['Api-User-Agent']).toContain('operator: operator-test01');
expect(headers['Api-User-Agent']).toContain('purpose: wikipedia-summary');
});
it('sends Api-User-Agent on Wikidata SPARQL fetch', async () => {
it('sends per-operator Api-User-Agent on Wikidata SPARQL fetch', async () => {
const calls: Array<{ url: string; init?: RequestInit }> = [];
globalThis.fetch = vi.fn(async (url: any, init?: RequestInit) => {
const baseFetch = vi.fn(async (url: any, init?: RequestInit) => {
calls.push({ url: String(url), init });
return new Response(
JSON.stringify({
results: {
bindings: [
{
leaderLabel: { value: 'Test Leader' },
govTypeLabel: { value: 'Test Government' },
},
],
},
results: { bindings: [{ leaderLabel: { value: 'Test Leader' } }] },
}),
{ status: 200 },
);
});
globalThis.fetch = withHandle('operator-sparql', baseFetch as any) as any;
const bindings = await fetchWikidataSparql('SELECT * WHERE { ?s ?p ?o }');
expect(bindings).toHaveLength(1);
const headers = (calls[0].init?.headers || {}) as Record<string, string>;
expect(headers['Api-User-Agent']).toContain('operator: operator-sparql');
expect(headers['Api-User-Agent']).toContain('purpose: wikidata-sparql');
expect(headers['Accept']).toBe('application/sparql-results+json');
});
it('handle endpoint is queried only ONCE across many wiki fetches', async () => {
let handleCalls = 0;
let wikiCalls = 0;
globalThis.fetch = vi.fn(async (input: any) => {
const url = String(input);
if (url.endsWith('/api/settings/operator-handle')) {
handleCalls++;
return new Response(JSON.stringify({ handle: 'operator-cache' }), { status: 200 });
}
wikiCalls++;
return new Response(
JSON.stringify({
type: 'standard',
title: 'X',
description: '',
extract: '',
thumbnail: { source: 'https://example.org/x.jpg' },
}),
{ status: 200 },
);
}) as any;
const bindings = await fetchWikidataSparql('SELECT * WHERE { ?s ?p ?o }');
expect(bindings).toHaveLength(1);
const headers = (calls[0].init?.headers || {}) as Record<string, string>;
expect(headers['Api-User-Agent']).toBe(WIKIMEDIA_API_USER_AGENT);
expect(headers['Accept']).toBe('application/sparql-results+json');
await fetchWikipediaSummary('Eiffel Tower');
await fetchWikipediaSummary('Mount Fuji');
await fetchWikipediaSummary('Statue of Liberty');
expect(handleCalls).toBe(1);
expect(wikiCalls).toBe(3);
});
it('shares cache across consecutive callers for the same Wikipedia title', async () => {
let fetchCount = 0;
globalThis.fetch = vi.fn(async () => {
const baseFetch = vi.fn(async () => {
fetchCount++;
return new Response(
JSON.stringify({
@@ -97,7 +160,8 @@ describe('lib/wikimediaClient', () => {
}),
{ status: 200 },
);
}) as any;
});
globalThis.fetch = withHandle('operator-cache', baseFetch as any) as any;
const a = await fetchWikipediaSummary('Eiffel Tower');
const b = await fetchWikipediaSummary('Eiffel Tower');
@@ -107,7 +171,7 @@ describe('lib/wikimediaClient', () => {
it('deduplicates concurrent in-flight requests for the same title', async () => {
let fetchCount = 0;
globalThis.fetch = vi.fn(async () => {
const baseFetch = vi.fn(async () => {
fetchCount++;
await new Promise((r) => setTimeout(r, 5));
return new Response(
@@ -120,7 +184,8 @@ describe('lib/wikimediaClient', () => {
}),
{ status: 200 },
);
}) as any;
});
globalThis.fetch = withHandle('operator-cache', baseFetch as any) as any;
const [a, b, c] = await Promise.all([
fetchWikipediaSummary('Mount Fuji'),
@@ -134,28 +199,37 @@ describe('lib/wikimediaClient', () => {
});
it('returns null on disambiguation pages without throwing', async () => {
globalThis.fetch = vi.fn(async () =>
new Response(JSON.stringify({ type: 'disambiguation' }), { status: 200 }),
globalThis.fetch = withHandle(
'operator-cache',
vi.fn(async () =>
new Response(JSON.stringify({ type: 'disambiguation' }), { status: 200 }),
) as any,
) as any;
const summary = await fetchWikipediaSummary('Mercury');
expect(summary).toBeNull();
});
it('returns null on HTTP error without throwing', async () => {
globalThis.fetch = vi.fn(async () => new Response('not found', { status: 404 })) as any;
globalThis.fetch = withHandle(
'operator-cache',
vi.fn(async () => new Response('not found', { status: 404 })) as any,
) as any;
const summary = await fetchWikipediaSummary('Nonexistent Article 12345');
expect(summary).toBeNull();
});
it('returns null on network error without throwing', async () => {
globalThis.fetch = vi.fn(async () => {
throw new Error('network down');
}) as any;
globalThis.fetch = withHandle(
'operator-cache',
vi.fn(async () => {
throw new Error('network down');
}) as any,
) as any;
const summary = await fetchWikipediaSummary('Anything');
expect(summary).toBeNull();
});
it('returns null on empty input', async () => {
it('returns null on empty input without fetching anything', async () => {
globalThis.fetch = vi.fn(async () => new Response('{}', { status: 200 })) as any;
expect(await fetchWikipediaSummary('')).toBeNull();
expect(await fetchWikipediaSummary(' ')).toBeNull();
+103 -50
View File
@@ -1,51 +1,37 @@
/**
* wikimediaClient — single fetch surface for Wikipedia / Wikidata.
*
* Issues #218, #219, #220 (tg12 external audit):
* Issues #218, #219, #220 (tg12 external audit) + Round 7a:
*
* Wikimedia's User-Agent policy asks API clients to identify themselves
* via `Api-User-Agent` when calling from browser JavaScript (because the
* browser does not let JS set `User-Agent` directly). Before this
* module existed, three independent components issued anonymous browser
* fetches against Wikipedia / Wikidata:
* browser does not let JS set `User-Agent` directly). Three independent
* components used to issue anonymous browser fetches against Wikipedia /
* Wikidata:
*
* - useRegionDossier (Wikidata SPARQL + Wikipedia REST summary)
* - WikiImage (Wikipedia REST summary)
* - NewsFeed (Wikipedia REST summary)
*
* Each component shipped its own copy-pasted fetch + module-local cache.
* Provider-policy compliance was missing in all three places.
* PR #284 collapsed them into this shared helper with one stable
* `Api-User-Agent`. That fixed compliance but introduced a new problem:
* the `Api-User-Agent` was project-wide, so from Wikimedia's perspective
* every Shadowbroker install looked like one giant scraper. If one
* install misbehaved, Wikimedia's only recourse was to block the project
* as a whole.
*
* This module centralizes:
* Round 7a fixes that. The frontend fetches the per-install operator
* handle from `GET /api/settings/operator-handle` once on first use and
* embeds it in the `Api-User-Agent`. Wikimedia can now rate-limit /
* contact the specific install instead of the project. The handle is
* auto-generated on the backend (`shadow-XXXXXX`) or operator-chosen via
* the `OPERATOR_HANDLE` setting.
*
* 1. The `Api-User-Agent` header on every request.
* 2. A single LRU cache for Wikipedia summary lookups (keyed by article
* title). Multiple components asking for the same article share
* one in-flight request and one cache slot.
* 3. One predictable kill switch — if Wikimedia ever asks us to back
* off, we change `WIKIMEDIA_API_USER_AGENT` here and the whole
* frontend updates.
*
* This does NOT change end-user UX:
*
* - WikiImage still shows the same thumbnails.
* - NewsFeed still shows aircraft thumbnails.
* - useRegionDossier still returns the same place summary + leader.
*
* What changes:
*
* - Wikimedia can identify our traffic from any other anonymous
* browser visitor pool.
* - Provider-policy fixes happen here once, not in three places.
* UX impact: zero. Same thumbnails, same summaries, same load behavior.
* The only observable change is the value of the outgoing
* `Api-User-Agent` header.
*/
// Stable identifier per Wikimedia UA policy. Includes a contact path so
// Wikimedia's operators can reach the project if they need to rate-limit
// or coordinate. Bump the version when the contact path changes.
export const WIKIMEDIA_API_USER_AGENT =
'Shadowbroker/1.0 (+https://github.com/BigBodyCobain/Shadowbroker; ' +
'report issues at /issues)';
// Module-level cache shared by WikiImage, NewsFeed, and useRegionDossier.
// Keyed by Wikipedia article title (NOT slug — we keep the human-readable
// form so debugging the cache is easier). Values track in-flight state
@@ -73,6 +59,66 @@ function evictIfOverCap() {
if (oldest) _summaryCache.delete(oldest);
}
// ─── Per-operator handle (Round 7a) ────────────────────────────────────────
// Fetched once from the backend on first need and cached for the page
// lifetime. The handle is NOT a secret — Wikimedia will see it on every
// Wikipedia / Wikidata request we make — but caching it locally avoids a
// round-trip on every Wikipedia fetch and lets the offline / no-backend
// case still produce a stable UA (the fallback handle).
let _handlePromise: Promise<string> | null = null;
let _cachedHandle: string | null = null;
const FALLBACK_HANDLE = 'operator-offline';
const HANDLE_ENDPOINT = '/api/settings/operator-handle';
async function fetchOperatorHandle(): Promise<string> {
try {
const res = await fetch(HANDLE_ENDPOINT, {
// Use the standard relative-path proxy so the Next.js admin-key
// injection (same-origin) flows naturally for legitimate browser
// sessions. A cross-origin scanner will be blocked by the proxy
// before this even leaves their browser.
credentials: 'same-origin',
});
if (!res.ok) return FALLBACK_HANDLE;
const data = await res.json();
const h = (data && typeof data.handle === 'string' && data.handle.trim()) || '';
return h || FALLBACK_HANDLE;
} catch {
return FALLBACK_HANDLE;
}
}
async function getOperatorHandle(): Promise<string> {
if (_cachedHandle) return _cachedHandle;
if (!_handlePromise) {
_handlePromise = fetchOperatorHandle().then((h) => {
_cachedHandle = h;
return h;
});
}
return _handlePromise;
}
/** Build the Wikimedia Api-User-Agent for this install.
*
* Includes the per-install operator handle so Wikimedia can rate-limit /
* contact the specific operator instead of the project as a whole.
* Exported for tests; production callers should let
* `fetchWikipediaSummary` / `fetchWikidataSparql` build it implicitly.
*/
export async function buildWikimediaUserAgent(purpose: string): Promise<string> {
const handle = await getOperatorHandle();
const safePurpose = (purpose || '').replace(/[^a-zA-Z0-9_-]/g, '-').toLowerCase();
return (
`Shadowbroker/1.0 (operator: ${handle}; purpose: ${safePurpose}; ` +
'+https://github.com/BigBodyCobain/Shadowbroker; report issues at /issues)'
);
}
// ─── Wikipedia summary fetch ───────────────────────────────────────────────
/** Fetch a Wikipedia article summary (titles, NOT URLs).
*
* Empty / invalid input resolves to `null`. Network errors and disambig
@@ -92,40 +138,42 @@ export async function fetchWikipediaSummary(
const slug = encodeURIComponent(trimmed.replace(/ /g, '_'));
const url = `https://en.wikipedia.org/api/rest_v1/page/summary/${slug}`;
const promise = fetch(url, {
headers: { 'Api-User-Agent': WIKIMEDIA_API_USER_AGENT },
})
.then(async (r) => {
const promise = (async (): Promise<WikipediaSummary | null> => {
try {
const ua = await buildWikimediaUserAgent('wikipedia-summary');
const r = await fetch(url, { headers: { 'Api-User-Agent': ua } });
if (!r.ok) return null;
const d = await r.json();
if (d?.type === 'disambiguation') return null;
const summary: WikipediaSummary = {
return {
title: trimmed,
description: d?.description || '',
extract: d?.extract || '',
thumbnail: d?.thumbnail?.source || d?.originalimage?.source || '',
type: d?.type || 'standard',
};
return summary;
})
.catch(() => null)
.then((summary) => {
_summaryCache.set(trimmed, { summary, inflight: null, loaded: true });
evictIfOverCap();
return summary;
});
} catch {
return null;
}
})().then((summary) => {
_summaryCache.set(trimmed, { summary, inflight: null, loaded: true });
evictIfOverCap();
return summary;
});
_summaryCache.set(trimmed, { summary: null, inflight: promise, loaded: false });
evictIfOverCap();
return promise;
}
// ─── Wikidata SPARQL ───────────────────────────────────────────────────────
/** Fetch a Wikidata SPARQL query result.
*
* Returns the parsed JSON `results.bindings` array on success; `null`
* (not throwing) on any failure so callers can render fallbacks
* silently. Kept as a thin wrapper so the audit-required UA header is
* applied in exactly one place.
* silently. Per-install operator handle threaded through `Api-User-Agent`
* (Round 7a).
*/
export async function fetchWikidataSparql<T = Record<string, { value: string }>>(
sparql: string,
@@ -136,9 +184,10 @@ export async function fetchWikidataSparql<T = Record<string, { value: string }>>
trimmed,
)}&format=json`;
try {
const ua = await buildWikimediaUserAgent('wikidata-sparql');
const res = await fetch(url, {
headers: {
'Api-User-Agent': WIKIMEDIA_API_USER_AGENT,
'Api-User-Agent': ua,
Accept: 'application/sparql-results+json',
},
});
@@ -151,7 +200,11 @@ export async function fetchWikidataSparql<T = Record<string, { value: string }>>
}
}
/** Internal: clear the shared cache. Exposed for tests only. */
// ─── Test helpers ──────────────────────────────────────────────────────────
/** Internal: clear the shared cache + the handle cache. Exposed for tests only. */
export function _resetWikimediaClientCacheForTests() {
_summaryCache.clear();
_handlePromise = null;
_cachedHandle = null;
}