mirror of
https://github.com/BigBodyCobain/Shadowbroker.git
synced 2026-06-03 21:08:13 +02:00
ebbf42fb3c
== Per-install operator handle for every third-party API call ==
Before this PR, every Shadowbroker install identified itself to
Wikipedia, Wikidata, Nominatim, GDELT, OpenMHz, Broadcastify,
weather.gov, NUFORC, Sentinel/Planetary Computer, TinyGS / CelesTrak,
Shodan, Finnhub, and others with a single project-wide User-Agent
("Shadowbroker/1.0" or "ShadowBroker-OSINT/1.0"). From the upstream's
perspective every install in the world looked like one giant scraper.
If one install misbehaved, the upstream's only recourse was to block
"Shadowbroker" as a whole.
PR #284 inadvertently doubled down on this in the frontend by
introducing a shared `WIKIMEDIA_API_USER_AGENT` constant. This PR
retrofits both the backend and the frontend to per-operator attribution.
New setting: OPERATOR_HANDLE (env var / settings UI / auto-gen)
New helper: network_utils.outbound_user_agent("purpose")
The handle is auto-generated as "operator-XXXXXX" on first call (the
"shadow-" prefix from earlier drafts was deliberately dropped — too
suspicious-looking for abuse-detection systems). Operators can
override via OPERATOR_HANDLE; the value is sanitized to lowercase
alphanumeric+dash+underscore and capped at 48 chars. Persisted to
backend/data/operator_handle.json so it survives container restarts.
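The sanitization and auto-generation rules above can be sketched as follows. This is an illustration only, written in TypeScript to match the test file in this commit; the real logic lives in the Python backend. The function names are hypothetical, and the choices to drop (rather than replace) disallowed characters and to use a random hex suffix are assumptions, not confirmed by the commit.

```typescript
// Illustrative sketch only; names and details below are assumptions.
const MAX_HANDLE_LENGTH = 48; // cap stated in the commit message

// Lowercase the input and keep only a-z, 0-9, dash, and underscore.
// Assumption: disallowed characters are dropped rather than replaced.
function sanitizeOperatorHandle(raw: string): string {
  return raw
    .toLowerCase()
    .replace(/[^a-z0-9_-]/g, '')
    .slice(0, MAX_HANDLE_LENGTH);
}

// Generate a default "operator-XXXXXX" handle.
// Assumption: the six-character suffix is random lowercase hex.
function generateOperatorHandle(): string {
  let suffix = '';
  for (let i = 0; i < 6; i++) {
    suffix += Math.floor(Math.random() * 16).toString(16);
  }
  return 'operator-' + suffix;
}

// Prefer a non-empty sanitized override; otherwise generate a handle.
function resolveOperatorHandle(override: string | undefined): string {
  const cleaned = sanitizeOperatorHandle(override === undefined ? '' : override);
  return cleaned.length > 0 ? cleaned : generateOperatorHandle();
}
```

Persisting the resolved handle (the commit stores it in backend/data/operator_handle.json) is left out of the sketch; without it, a generated handle would change on every restart.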
Retrofitted call sites (every one that previously sent the shared
project-wide User-Agent):
- services/region_dossier.py (Wikipedia + Wikidata + Nominatim)
- services/geocode.py (Nominatim)
- services/sentinel_search.py (Microsoft Planetary Computer)
- services/feed_ingester.py (operator-curated RSS feeds)
- services/fetchers/earth_observation.py (weather.gov, NUFORC)
- services/fetchers/infrastructure.py
- services/fetchers/aircraft_database.py
- services/fetchers/route_database.py
- services/fetchers/trains.py
- services/fetchers/meshtastic_map.py
- services/shodan_connector.py
- services/unusual_whales_connector.py (Finnhub)
- services/tinygs_fetcher.py (CelesTrak + TinyGS)
- services/sar/sar_products_client.py
- services/geopolitics.py (GDELT)
- services/radio_intercept.py (Broadcastify + OpenMHz)
- routers/cctv.py + main.py (CCTV proxy)
- routers/ai_intel.py
- scripts/convert_power_plants.py (release-time data refresh)
Spoofed browser UAs removed (issues #289 / #290 / #291 — tg12 audit):
- cloudscraper-based Chrome impersonation against api.openmhz.com
  -> replaced with honest requests + per-install UA
- Mozilla/5.0 spoofed UA on Broadcastify scrape
  -> replaced with honest UA
- Mozilla/5.0 + fake first-party Referer on OpenMHz audio relay
  -> replaced with honest UA
- cloudscraper dependency dropped from pyproject.toml + uv.lock
Frontend retrofit:
- new GET /api/settings/operator-handle endpoint (local-operator
  gated) returns the install's handle
- frontend/src/lib/wikimediaClient.ts fetches the handle once on
  first use, caches it for page lifetime, embeds it in the
  Api-User-Agent for every Wikipedia / Wikidata browser-direct call
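The "fetch once, cache for page lifetime" behaviour can be sketched as a memoized promise. This is a minimal sketch, not the code in wikimediaClient.ts: `makeCachedHandleGetter` is a hypothetical name, the fetcher is injected so the example stays self-contained, and caching the "operator-offline" fallback after a failed lookup is an assumption.

```typescript
// Illustrative sketch only; not the actual wikimediaClient.ts code.
const OFFLINE_HANDLE = 'operator-offline';

// Wrap a handle fetcher so it runs at most once. Storing the promise
// (rather than the resolved value) means concurrent callers share the
// same in-flight request instead of each starting their own.
function makeCachedHandleGetter(
  fetchHandle: () => Promise<string>,
): () => Promise<string> {
  let cached: Promise<string> | null = null;
  return () => {
    if (cached === null) {
      // Assumption: a failed lookup resolves to the offline fallback
      // and that fallback is cached like any other result.
      cached = fetchHandle().catch(() => OFFLINE_HANDLE);
    }
    return cached;
  };
}
```

In a browser, the injected fetcher would call GET /api/settings/operator-handle and read the `handle` field from the JSON body, which is the response shape the tests in this commit mock.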
== GDELT GCS-direct fix ==
GDELT's data.gdeltproject.org is a CNAME to a Google Cloud Storage
bucket. GCS responds with the wildcard *.storage.googleapis.com cert
which legitimately does NOT cover the GDELT custom domain, so Python's
TLS verification correctly refuses the connection. Some networks
happen to route through a path where this works; many (notably Docker
Desktop's outbound NAT on local installs) do not. Verified on the
maintainer's local install: GDELT was unreachable; 1610 geopolitical
events / 48 export files were dropping silently.
Fix: services/geopolitics._gcs_direct_gdelt_url() rewrites any
data.gdeltproject.org URL to its GCS-direct equivalent
(storage.googleapis.com/data.gdeltproject.org/...) where the standard
GCS cert is genuinely valid. api.gdeltproject.org and every other host
are left untouched.
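The rewrite rule can be sketched as below. This is an illustration in TypeScript rather than the Python helper itself; the function name is hypothetical, and forcing the rewritten URL to https is an assumption based on the TLS motivation described above.

```typescript
// Illustrative sketch only; the real helper is
// services/geopolitics._gcs_direct_gdelt_url() in the Python backend.
const GDELT_DATA_HOST = 'data.gdeltproject.org';
const GCS_HOST = 'storage.googleapis.com';

function gcsDirectGdeltUrl(url: string): string {
  // Split into scheme, host, and everything after the host.
  const match = /^(https?):\/\/([^\/?#]+)(.*)$/i.exec(url);
  if (match === null) {
    return url; // not an http(s) URL: leave untouched
  }
  const host = (match[2] || '').toLowerCase();
  const rest = match[3] || '';
  if (host !== GDELT_DATA_HOST) {
    return url; // api.gdeltproject.org and all other hosts are unchanged
  }
  // Address the bucket by path so the standard GCS certificate applies.
  // Assumption: the rewritten URL always uses https.
  return 'https://' + GCS_HOST + '/' + GDELT_DATA_HOST + rest;
}
```

The host is compared exactly, so a URL that merely contains "data.gdeltproject.org" somewhere in its path is not rewritten.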
Confirmed live: backend log goes from
    GDELT lastupdate failed: 500
to
    Downloading 48 GDELT export files...
    Downloaded 48/48 GDELT exports
    GDELT parsed: 1610 conflict locations from 48 files
== Tests ==
backend/tests/test_per_operator_outbound_attribution.py (12 tests)
backend/tests/test_gdelt_gcs_direct_rewrite.py (6 tests)
backend/tests/test_region_dossier_wikimedia_ua.py (updated to
pin the helper + per-operator handle, not the old constant)
frontend/src/__tests__/utils/wikimediaClient.test.ts (rewritten
to mock /api/settings/operator-handle and assert per-operator UA)
Local: backend 114/114 security+audit+round7a suite green;
frontend 718/718 vitest suite green.
Credit: tg12 (external security audit, issues #289/#290/#291
relating to spoofed UAs); BigBodyCobain (operator-prefix call,
GDELT cloud-vs-local diagnosis).
239 lines
8.5 KiB
TypeScript
/**
 * Issues #218 / #219 / #220 (tg12 external audit) + Round 7a:
 *
 * Every browser-direct call to Wikipedia or Wikidata must send the
 * `Api-User-Agent` header that Wikimedia's UA policy asks for, AND must
 * embed the per-install operator handle so Wikimedia can rate-limit /
 * contact the specific operator instead of treating "Shadowbroker" as
 * one giant entity.
 *
 * These tests pin both requirements on the shared `lib/wikimediaClient`
 * helper that WikiImage, NewsFeed, and useRegionDossier all route
 * through. A future refactor that drops either the header OR the
 * per-operator handle gets a loud test failure rather than a silent
 * ToS / privacy regression.
 */
import { afterEach, beforeEach, describe, expect, it, vi } from 'vitest';
import {
  buildWikimediaUserAgent,
  fetchWikipediaSummary,
  fetchWikidataSparql,
  _resetWikimediaClientCacheForTests,
} from '@/lib/wikimediaClient';

const originalFetch = globalThis.fetch;

// Helper: stub fetch so calls to /api/settings/operator-handle return a
// known handle, and everything else proxies to whatever the test set up.
function withHandle(handle: string, otherFetch: typeof globalThis.fetch) {
  return vi.fn(async (input: any, init?: RequestInit) => {
    const url = String(input);
    if (url.endsWith('/api/settings/operator-handle')) {
      return new Response(JSON.stringify({ handle }), { status: 200 });
    }
    return otherFetch(input, init);
  });
}

describe('lib/wikimediaClient', () => {
  beforeEach(() => {
    _resetWikimediaClientCacheForTests();
  });

  afterEach(() => {
    globalThis.fetch = originalFetch;
    vi.restoreAllMocks();
  });

  it('builds a stable per-operator Api-User-Agent with contact path', async () => {
    globalThis.fetch = withHandle(
      'operator-abc123',
      vi.fn(async () => new Response('{}', { status: 200 })) as any,
    ) as any;
    const ua = await buildWikimediaUserAgent('wikipedia-summary');
    expect(ua).toContain('Shadowbroker');
    expect(ua.toLowerCase()).toContain('github.com');
    expect(ua.toLowerCase()).toContain('issues');
    expect(ua).toContain('operator: operator-abc123');
    expect(ua).toContain('purpose: wikipedia-summary');
  });

  it('falls back to "operator-offline" when handle endpoint is unreachable', async () => {
    globalThis.fetch = vi.fn(async (input: any) => {
      const url = String(input);
      if (url.endsWith('/api/settings/operator-handle')) {
        return new Response('forbidden', { status: 403 });
      }
      return new Response('{}', { status: 200 });
    }) as any;
    const ua = await buildWikimediaUserAgent('test');
    expect(ua).toContain('operator: operator-offline');
  });

  it('sends per-operator Api-User-Agent on Wikipedia summary fetch', async () => {
    const wikiCalls: Array<{ url: string; init?: RequestInit }> = [];
    const baseFetch = vi.fn(async (url: any, init?: RequestInit) => {
      wikiCalls.push({ url: String(url), init });
      return new Response(
        JSON.stringify({
          type: 'standard',
          title: 'Boeing 747',
          description: 'aircraft',
          extract: 'long extract',
          thumbnail: { source: 'https://example.org/thumb.jpg' },
        }),
        { status: 200 },
      );
    });
    globalThis.fetch = withHandle('operator-test01', baseFetch as any) as any;

    const summary = await fetchWikipediaSummary('Boeing 747');
    expect(summary?.thumbnail).toBe('https://example.org/thumb.jpg');
    // wikiCalls only captures calls to non-handle URLs.
    expect(wikiCalls).toHaveLength(1);
    const headers = (wikiCalls[0].init?.headers || {}) as Record<string, string>;
    expect(headers['Api-User-Agent']).toContain('operator: operator-test01');
    expect(headers['Api-User-Agent']).toContain('purpose: wikipedia-summary');
  });

  it('sends per-operator Api-User-Agent on Wikidata SPARQL fetch', async () => {
    const calls: Array<{ url: string; init?: RequestInit }> = [];
    const baseFetch = vi.fn(async (url: any, init?: RequestInit) => {
      calls.push({ url: String(url), init });
      return new Response(
        JSON.stringify({
          results: { bindings: [{ leaderLabel: { value: 'Test Leader' } }] },
        }),
        { status: 200 },
      );
    });
    globalThis.fetch = withHandle('operator-sparql', baseFetch as any) as any;

    const bindings = await fetchWikidataSparql('SELECT * WHERE { ?s ?p ?o }');
    expect(bindings).toHaveLength(1);
    const headers = (calls[0].init?.headers || {}) as Record<string, string>;
    expect(headers['Api-User-Agent']).toContain('operator: operator-sparql');
    expect(headers['Api-User-Agent']).toContain('purpose: wikidata-sparql');
    expect(headers['Accept']).toBe('application/sparql-results+json');
  });

  it('handle endpoint is queried only ONCE across many wiki fetches', async () => {
    let handleCalls = 0;
    let wikiCalls = 0;
    globalThis.fetch = vi.fn(async (input: any) => {
      const url = String(input);
      if (url.endsWith('/api/settings/operator-handle')) {
        handleCalls++;
        return new Response(JSON.stringify({ handle: 'operator-cache' }), { status: 200 });
      }
      wikiCalls++;
      return new Response(
        JSON.stringify({
          type: 'standard',
          title: 'X',
          description: '',
          extract: '',
          thumbnail: { source: 'https://example.org/x.jpg' },
        }),
        { status: 200 },
      );
    }) as any;

    await fetchWikipediaSummary('Eiffel Tower');
    await fetchWikipediaSummary('Mount Fuji');
    await fetchWikipediaSummary('Statue of Liberty');
    expect(handleCalls).toBe(1);
    expect(wikiCalls).toBe(3);
  });

  it('shares cache across consecutive callers for the same Wikipedia title', async () => {
    let fetchCount = 0;
    const baseFetch = vi.fn(async () => {
      fetchCount++;
      return new Response(
        JSON.stringify({
          type: 'standard',
          title: 'Eiffel Tower',
          description: 'iron lattice tower',
          extract: '...',
          thumbnail: { source: 'https://example.org/eiffel.jpg' },
        }),
        { status: 200 },
      );
    });
    globalThis.fetch = withHandle('operator-cache', baseFetch as any) as any;

    const a = await fetchWikipediaSummary('Eiffel Tower');
    const b = await fetchWikipediaSummary('Eiffel Tower');
    expect(fetchCount).toBe(1);
    expect(a?.thumbnail).toBe(b?.thumbnail);
  });

  it('deduplicates concurrent in-flight requests for the same title', async () => {
    let fetchCount = 0;
    const baseFetch = vi.fn(async () => {
      fetchCount++;
      await new Promise((r) => setTimeout(r, 5));
      return new Response(
        JSON.stringify({
          type: 'standard',
          title: 'Mount Fuji',
          description: 'stratovolcano',
          extract: '...',
          thumbnail: { source: 'https://example.org/fuji.jpg' },
        }),
        { status: 200 },
      );
    });
    globalThis.fetch = withHandle('operator-cache', baseFetch as any) as any;

    const [a, b, c] = await Promise.all([
      fetchWikipediaSummary('Mount Fuji'),
      fetchWikipediaSummary('Mount Fuji'),
      fetchWikipediaSummary('Mount Fuji'),
    ]);
    expect(fetchCount).toBe(1);
    expect(a?.thumbnail).toBe('https://example.org/fuji.jpg');
    expect(b).toEqual(a);
    expect(c).toEqual(a);
  });

  it('returns null on disambiguation pages without throwing', async () => {
    globalThis.fetch = withHandle(
      'operator-cache',
      vi.fn(async () =>
        new Response(JSON.stringify({ type: 'disambiguation' }), { status: 200 }),
      ) as any,
    ) as any;
    const summary = await fetchWikipediaSummary('Mercury');
    expect(summary).toBeNull();
  });

  it('returns null on HTTP error without throwing', async () => {
    globalThis.fetch = withHandle(
      'operator-cache',
      vi.fn(async () => new Response('not found', { status: 404 })) as any,
    ) as any;
    const summary = await fetchWikipediaSummary('Nonexistent Article 12345');
    expect(summary).toBeNull();
  });

  it('returns null on network error without throwing', async () => {
    globalThis.fetch = withHandle(
      'operator-cache',
      vi.fn(async () => {
        throw new Error('network down');
      }) as any,
    ) as any;
    const summary = await fetchWikipediaSummary('Anything');
    expect(summary).toBeNull();
  });

  it('returns null on empty input without fetching anything', async () => {
    globalThis.fetch = vi.fn(async () => new Response('{}', { status: 200 })) as any;
    expect(await fetchWikipediaSummary('')).toBeNull();
    expect(await fetchWikipediaSummary(' ')).toBeNull();
    expect(globalThis.fetch).not.toHaveBeenCalled();
  });
});