Corpus images were gitignored (local-only). The negatives were reviewed and cleared for publishing, so the labeled set is now committed (regular git, 65 MB across 25 files) -- making the removal regression set reproducible and CI-able.

Corpus:
- Track data/synthid_corpus/images/ (pos 9, neg 15, cleaned 1); keep only the synthetic refs/ calibration fills gitignored.
- Reconcile manifest.csv to the on-disk files: 117 -> 25 rows (92 dangling rows for removed images pruned; dedup left one cleaned output, f6dd47a5).
- Rewrite the corpus README layout/policy (images committed; review every image for private content before adding -- public repo, permanent history).

Test fixtures:
- Remove data/samples/not-ai-1/2/3 (personal iPhone photos, incl. GPS EXIF).
- Add the clean_photo conftest fixture serving a verified-negative image from the corpus neg/ set; repoint the three "non-AI / clean photo" tests onto it (skips if the corpus is absent).

Metadata-source coverage (close the last sub-variant gaps):
- c2pa digitalSourceType: algorithmicMedia (procedural, not flagged AI) and compositeWithTrainedAlgorithmicMedia (AI + SynthID proxy).
- exif_generator: EXIF Artist and ImageDescription fields (Software/Make/XMP CreatorTool were already covered).

All 8 metadata-source kinds are now tested at both the unit and identify() level. 313 tests pass. CLAUDE.md updated (corpus tracked, clean_photo fixture).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

SynthID reference corpus
A locally-collected, labeled image corpus for SynthID work. Two downstream uses:
- Per-resolution spectral codebook for an experimental SynthID detector (carrier frequencies are resolution-dependent, so labels must record the exact native resolution).
- Removal regression set — verify that our pipeline turns a SynthID-positive image into a negative one.
There is no reliable local detector of the SynthID pixel watermark (Google's
decoder is proprietary). The ground-truth label therefore comes from an
external oracle, recorded per image in verified_via (see below).
Layout
    data/synthid_corpus/
      README.md      # this protocol (committed)
      manifest.csv   # labels + provenance (committed; one row per tracked image)
      images/        # the labeled corpus (committed)
        pos/         # SynthID present
        neg/         # SynthID absent (incl. reviewed real photos)
        cleaned/     # our pipeline output from a pos image
        refs/        # synthetic black/white calibration fills (gitignored, regenerable)

The labeled images are committed so the corpus is reproducible and the removal
regression set runs in CI. manifest.csv is kept in sync with the files on
disk (one row per tracked image; dangling rows are pruned when files are
removed). Before adding any image, confirm it carries no private or
identifiable content you would not publish -- this is a public repo and git
history is permanent. The synthetic refs/ fills stay gitignored (regenerable,
not part of the labeled set).
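
The sync rule above can be checked mechanically. The following is a minimal sketch, not the repo's actual tooling: it assumes manifest.csv has a column named `path` holding each image's path relative to the corpus root (a hypothetical column name; adapt it to the real header) and that refs/ should be skipped when looking for untracked files.

```python
import csv
from pathlib import Path


def reconcile(manifest, root, path_column="path", ignore_dirs=("refs",)):
    """Return (kept_rows, dangling_rows, untracked_files).

    `path_column` and `ignore_dirs` are assumptions for illustration only.
    """
    root = Path(root)
    with Path(manifest).open(newline="") as fh:
        rows = list(csv.DictReader(fh))

    kept, dangling = [], []
    for row in rows:
        # A row is dangling when its file no longer exists on disk.
        (kept if (root / row[path_column]).is_file() else dangling).append(row)

    tracked = {row[path_column] for row in kept}
    untracked = []
    for p in sorted((root / "images").rglob("*")):
        rel = p.relative_to(root)
        if p.is_file() and not set(rel.parts) & set(ignore_dirs):
            if rel.as_posix() not in tracked:
                untracked.append(rel.as_posix())
    return kept, dangling, untracked
```

A non-empty `dangling` list means rows should be pruned; a non-empty `untracked` list means files were added without going through the ingest script.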
Verification levels (verified_via)
Ground-truth quality, strongest first:
- gemini-app -- checked via the Gemini app "Verify with SynthID" feature. Gold standard for the pixel watermark (Google models).
- openai-verify -- checked via openai.com/verify (gold standard for OpenAI ChatGPT/Codex/API images).
- synthid-portal -- checked via Google's SynthID Detector portal.
- c2pa-metadata -- issuer-only proxy (Google/OpenAI C2PA manifest present). Weaker: the C2PA can be stripped while the pixel watermark remains.
- third-party -- label asserted by an external dataset, not independently verified.
- none -- unverified.
Prefer gemini-app for any image that will train the codebook or gate a test.
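
One way to enforce that preference in code is to rank the levels and gate on a cutoff. This is a sketch under assumptions: it treats the list above as a strict total order (a simplification, since gemini-app and openai-verify are each the gold standard for a different vendor), and the default cutoff is a hypothetical choice, not project policy.

```python
# Verification levels, strongest first, in the order given in this README.
VERIFICATION_ORDER = [
    "gemini-app",
    "openai-verify",
    "synthid-portal",
    "c2pa-metadata",
    "third-party",
    "none",
]


def verification_rank(verified_via: str) -> int:
    """Return 0 for the strongest level; unknown values rank with "none"."""
    try:
        return VERIFICATION_ORDER.index(verified_via)
    except ValueError:
        return VERIFICATION_ORDER.index("none")


def strong_enough(verified_via: str, weakest_allowed: str = "synthid-portal") -> bool:
    """True if the label is at least as strong as `weakest_allowed`.

    The default cutoff is illustrative only.
    """
    return verification_rank(verified_via) <= verification_rank(weakest_allowed)
```

For example, `strong_enough("c2pa-metadata")` is false under the default cutoff, which matches the note that a metadata-only proxy is weaker than an oracle check.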
What to collect
For the codebook (per target resolution, e.g. 1024x1024, 1024x1536, 1536x2816):
- 30-50+ SynthID-positive outputs per resolution (more is better; ~150-200 per resolution materially improves carrier discovery).
- At each target resolution, also a batch of pure-black (#000000) and pure-white (#FFFFFF) fills generated by the SynthID model — these isolate the content-independent carrier (the watermark is most of the signal there).
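
To see which resolutions still fall short of the 30-image floor, the manifest rows can be counted per resolution. A rough sketch, assuming each row is a dict with `label`, `width`, and `height` keys (hypothetical names; the real manifest header may differ):

```python
from collections import Counter


def positives_per_resolution(rows, min_count=30):
    """Count pos rows per (width, height) and report resolutions below `min_count`.

    The "label", "width", and "height" keys are assumed for illustration.
    """
    counts = Counter(
        (int(r["width"]), int(r["height"])) for r in rows if r["label"] == "pos"
    )
    short = {res: n for res, n in counts.items() if n < min_count}
    return counts, short
```

Because carrier frequencies are resolution-dependent, counting by exact native resolution (rather than by aspect ratio or megapixels) is the relevant grouping here.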
For the regression set:
- A handful of pos images, their cleaned counterparts (run through our pipeline), and the cleaned re-verified via gemini-app (should read negative).
- neg controls: non-AI photos and outputs from non-SynthID models (SD, Midjourney, Firefly) verified negative.
The corpus is committed to a public repo: review every image before adding it and keep out anything private or identifiable you would not publish.
Ingesting
Use scripts/synthid_corpus.py — it copies a file in, records its sha256,
resolution, format, and C2PA issuer (via our own detector), and appends a row
to manifest.csv:
    uv run python scripts/synthid_corpus.py ingest path/to/*.png \
        --label pos --source "Gemini app" --model gemini-3-pro \
        --verified-via gemini-app --notes "1024x1024 batch"

    uv run python scripts/synthid_corpus.py status   # counts by label / resolution / verification
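
For orientation, the kind of per-file record the ingest step produces can be approximated with the standard library alone. This is a sketch, not the code in scripts/synthid_corpus.py: it handles PNG only, reads the dimensions straight from the IHDR chunk, and the function name `describe_png` is hypothetical.

```python
import hashlib
import struct
from pathlib import Path

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def describe_png(path) -> dict:
    """Return sha256, pixel dimensions, and format for a PNG file (stdlib only)."""
    data = Path(path).read_bytes()
    # A PNG starts with an 8-byte signature, then a chunk: 4-byte length, 4-byte type.
    if data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError(f"{path} is not a PNG with a leading IHDR chunk")
    # IHDR data begins at byte 16: big-endian 4-byte width, then 4-byte height.
    width, height = struct.unpack(">II", data[16:24])
    return {
        "sha256": hashlib.sha256(data).hexdigest(),
        "width": width,
        "height": height,
        "format": "PNG",
    }
```

Hashing the raw bytes (not decoded pixels) means two files with identical pixels but different metadata get different hashes, which matters when the metadata itself is what is being studied.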
Autonomous collection via Chrome MCP
Generation can be driven through the browser (the account must be logged in):
- Gemini (gemini.google.com): type "Create an image: <prompt>", wait, hover the result, click the download icon (top-right). Single, reliable click. Outputs carry Google C2PA + SynthID. Occasionally the composer stalls in a "generating" state -> start a New chat to reset.
- ChatGPT (chatgpt.com): the UI download is flaky (the fullscreen viewer races and can grab the previous image; the share-modal path works but is multi-step). The reliable path is an in-page fetch of the rendered image, which preserves the original bytes (C2PA intact, unlike a canvas re-encode):

      // run in the ChatGPT tab via the browser MCP javascript tool
      (async () => {
        // keep only large images, which skips avatars and icons
        const imgs = [...document.querySelectorAll('img')].filter(i => i.naturalWidth >= 400);
        const img = imgs[imgs.length - 1]; // newest large image
        if (!img) return 'no large image found';
        // fetch the original bytes rather than re-encoding through a canvas
        const b = await (await fetch(img.currentSrc || img.src)).blob();
        const a = document.createElement('a');
        a.href = URL.createObjectURL(b);
        a.download = 'dl.png';
        document.body.appendChild(a);
        a.click();
        a.remove();
        return 'size=' + b.size; // do NOT return the src (privacy guard blocks query strings)
      })()

  Gotcha: confirm the returned size differs from the previous image before ingesting -- if the new image has not finished rendering, the script grabs the prior one (the corpus dedups by sha256, but the notes would mislabel it). ChatGPT also shows an A/B "which is better?" picker; click Skip first.

Originals, not previews. Some platforms render a low-res preview in the chat
(Grok serves a ~20KB 1024px JPEG/PNG; the in-page <img> fetch grabs that, not
the original). Previews are re-encoded and stripped of metadata, so a "clean"
preview is not proof the original is clean. Always pull the original via the
platform's native Download / lightbox button and sanity-check the file size (a
20KB "1024x1024" is a preview). ChatGPT's in-chat <img> is the full-res
oaiusercontent original, so fetch+blob is fine there; Grok needs the lightbox
Download; Leonardo serves the original JPEG in-chat (download button matches).
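
The file-size sanity check can be expressed as a bytes-per-pixel heuristic. A minimal sketch: the threshold of 0.1 bytes per pixel is an illustrative assumption, not a measured value, and should be tuned against real originals and previews from each platform.

```python
def looks_like_preview(size_bytes: int, width: int, height: int,
                       min_bytes_per_pixel: float = 0.1) -> bool:
    """Flag files that are implausibly small for their pixel dimensions.

    `min_bytes_per_pixel` is a hypothetical cutoff chosen for illustration.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    return size_bytes / (width * height) < min_bytes_per_pixel
```

A 20 KB file at 1024x1024 works out to roughly 0.02 bytes per pixel and is flagged, while a file of a megabyte or more at the same resolution is not. A flag is only a prompt to re-download the original, not proof either way.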
Per-platform watermark map (observed, May 2026)
What each platform actually embeds, verified by byte-scan (and Gemini-app oracle
where noted). The detector's coverage is complementary: metadata catches C2PA /
IPTC; exif_generator catches EXIF Make/Software + XMP CreatorTool;
invisible_watermark.py (imwatermark) catches the open SD/SDXL/FLUX DWT-DCT
watermark on pristine files; the visible detector catches the Gemini-family
sparkle; the SynthID pixel itself has no local detector (oracle only).
| Platform | C2PA issuer | SynthID pixel | IPTC "Made with AI" | Visible sparkle | imwatermark | Corpus label |
|---|---|---|---|---|---|---|
| Gemini app | Google | yes | - | yes | - | pos |
| ChatGPT / gpt-image | OpenAI | yes | - | - | - | pos |
| Microsoft Designer | OpenAI + Microsoft | yes (via OpenAI) | - | - | - | pos |
| Bing Image Creator | Microsoft (MAI-Image) | no | - | - | - | pos (C2PA "Microsoft", not OpenAI) |
| Google AI Studio (Nano Banana) | none | yes (oracle-confirmed) | - | yes | - | pos (metadata blind spot) |
| Stability AI (Brand Studio) | Stability AI Ltd | no | - | - | no | pos (C2PA only) |
| Ideogram | none | no | - | - | no | pos (EXIF Make="Ideogram AI" only) |
| Meta AI | none | no | yes | - | - | neg (for SynthID) |
| Leonardo.ai | none | no | no | - | no | neg |
| Recraft | none (export strips) | no | no | - | no | neg (re-encoded export, no signal) |
| Krea (FLUX 2 host) | none | no | no | - | no | neg (host omits the imwatermark encoder) |
| Grok (xAI) | none (non-adopter) | no | no | - | no | neg (captured: clean low-res preview) |
Key takeaways:
- The same model differs by surface: Gemini app wraps C2PA, AI Studio (API/playground) emits none -- only the pixel + sparkle survive.
- Microsoft Designer's DALL-E backend inherits OpenAI's C2PA+SynthID (issuer "OpenAI, Microsoft"); Bing now runs Microsoft's own MAI-Image and signs C2PA as "Microsoft" (not OpenAI/DALL-E).
- Meta uses the IPTC digitalSourceType marker, not C2PA or SynthID.
- The open imwatermark fires only on pristine output from a pipeline that runs the encoder (diffusers default, official BFL) -- not from re-hosts (Krea, Stability hosted SDXL) or re-encoded design exports (Recraft, Canva). Ideogram's only signal is the EXIF Make tag.
- Bing and Grok web UIs are uncooperative for autonomous capture (no document_idle for screenshots; blob downloads intermittently no-op; low-res in-chat previews). Use their native download button manually if a full-res sample is needed.
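
The map above can also be kept as data so that tests can ask which local detectors are expected to fire for a given platform. A sketch under assumptions: only four rows are transcribed, a "-" cell is stored as `None` (not recorded) rather than `False`, and the dictionary layout and function name are hypothetical. The detector names "metadata" and "visible" follow the coverage description in this section.

```python
# A few rows transcribed from the table; None means the cell was "-" (not recorded).
PLATFORM_SIGNALS = {
    "ChatGPT / gpt-image": {"c2pa": "OpenAI", "synthid": True, "iptc": None, "sparkle": None},
    "Google AI Studio (Nano Banana)": {"c2pa": None, "synthid": True, "iptc": None, "sparkle": True},
    "Meta AI": {"c2pa": None, "synthid": False, "iptc": True, "sparkle": None},
    "Leonardo.ai": {"c2pa": None, "synthid": False, "iptc": False, "sparkle": None},
}


def local_detectors(platform: str) -> list[str]:
    """List the local detectors expected to fire for a platform's pristine output."""
    signals = PLATFORM_SIGNALS[platform]
    hits = []
    if signals["c2pa"] or signals["iptc"]:
        hits.append("metadata")  # C2PA / IPTC are caught by the metadata detector
    if signals["sparkle"]:
        hits.append("visible")   # Gemini-family sparkle
    # The SynthID pixel watermark has no local detector, so it never adds a hit.
    return hits
```

An empty result for a platform whose `synthid` value is true would mark an oracle-only case; in the transcribed rows, AI Studio avoids that only because the visible sparkle is present.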