Grow the SynthID corpus to 109 originals (91 iPhone-photo negatives, 2 positives) and document what was learned studying 8 platforms:

- README: per-platform watermark map (C2PA issuer / SynthID pixel / IPTC / visible sparkle per platform) and an "originals, not previews" note (re-encoded previews strip metadata, so a clean preview is not proof).
- CLAUDE.md: surface-dependent blind spot -- the same Google model wraps C2PA in the Gemini app but emits the SynthID pixel watermark + sparkle with no C2PA/IPTC via the API/playground (AI Studio, Nano Banana), so synthid_source returns None despite SynthID being present; only the pixel oracle or the visible-sparkle detector catches those.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SynthID reference corpus
A locally-collected, labeled image corpus for SynthID work. Two downstream uses:
- Per-resolution spectral codebook for an experimental SynthID detector (carrier frequencies are resolution-dependent, so labels must record the exact native resolution).
- Removal regression set — verify that our pipeline turns a SynthID-positive image into a negative one.
There is no reliable local detector of the SynthID pixel watermark (Google's
decoder is proprietary). The ground-truth label therefore comes from an
external oracle, recorded per image in verified_via (see below).
Layout
```
data/synthid_corpus/
  README.md        # this protocol (committed)
  manifest.csv     # labels + provenance (committed, reviewable)
  images/          # the actual files (gitignored, local-only)
    pos/           # SynthID present
    neg/           # SynthID absent
    cleaned/       # our pipeline output from a pos image
```
Images are gitignored on purpose: the corpus is large, may contain personal or
licensed content, and SynthID-positive outputs are best kept local. The
manifest.csv (sha256 + labels + extracted metadata) is the durable artifact.
Verification levels (verified_via)
Ground-truth quality, strongest first:
- `gemini-app`: checked via the Gemini app "Verify with SynthID" feature. Gold standard for the pixel watermark (Google models).
- `openai-verify`: checked via openai.com/verify (gold standard for OpenAI ChatGPT/Codex/API images).
- `synthid-portal`: checked via Google's SynthID Detector portal.
- `c2pa-metadata`: issuer-only proxy (Google/OpenAI C2PA manifest present). Weaker: the C2PA can be stripped while the pixel watermark remains.
- `third-party`: label asserted by an external dataset, not independently verified.
- `none`: unverified.
Prefer gemini-app for any image that will train the codebook or gate a test.
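As a rough illustration of how the ordering above could gate a test or a codebook build, the sketch below ranks the `verified_via` values in the order listed. The helper names `rank` and `at_least` are hypothetical and are not part of scripts/synthid_corpus.py.

```python
# Hypothetical helpers: order verified_via values, strongest first,
# following the list in this README.
LEVELS = [
    "gemini-app",
    "openai-verify",
    "synthid-portal",
    "c2pa-metadata",
    "third-party",
    "none",
]


def rank(verified_via: str) -> int:
    """Return the position in LEVELS (lower is stronger).

    Unknown values are treated like "none" so a typo never
    counts as a strong verification.
    """
    try:
        return LEVELS.index(verified_via)
    except ValueError:
        return LEVELS.index("none")


def at_least(verified_via: str, minimum: str) -> bool:
    """True if verified_via is at least as strong as minimum."""
    return rank(verified_via) <= rank(minimum)
```

A test gate might then require `at_least(row_level, "synthid-portal")`, which admits the three oracle-backed levels and rejects the metadata proxy and anything weaker.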
What to collect
For the codebook (per target resolution, e.g. 1024x1024, 1024x1536, 1536x2816):
- 30-50+ SynthID-positive outputs per resolution (more is better; ~150-200 per resolution materially improves carrier discovery).
- At each target resolution, also a batch of pure-black (#000000) and pure-white (#FFFFFF) fills generated by the SynthID model — these isolate the content-independent carrier (the watermark is most of the signal there).
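To track progress against the per-resolution target above, something like the following could be run over the manifest rows. It is only a sketch: the column names `label`, `width`, and `height` are assumptions about manifest.csv, and 30 is simply the lower bound of the 30-50+ range.

```python
from collections import Counter

MIN_POSITIVES = 30  # lower bound of the 30-50+ target; adjust as needed


def positives_per_resolution(rows):
    """Count rows labeled "pos" per (width, height).

    rows is an iterable of dicts, e.g. from csv.DictReader.
    Column names are assumed, not taken from the real manifest.
    """
    counts = Counter()
    for row in rows:
        if row["label"] == "pos":
            counts[(int(row["width"]), int(row["height"]))] += 1
    return counts


def shortfall(rows, targets, minimum=MIN_POSITIVES):
    """Map each under-collected target resolution to the number still missing."""
    counts = positives_per_resolution(rows)
    return {
        res: minimum - counts[res]
        for res in targets
        if counts[res] < minimum
    }
```

Keying on the exact (width, height) pair mirrors the requirement that labels record the native resolution, since carriers are resolution-dependent.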
For the regression set:
- A handful of
posimages, theircleanedcounterparts (run through our pipeline), and the cleaned re-verified viagemini-app(should read negative). negcontrols: non-AI photos and outputs from non-SynthID models (SD, Midjourney, Firefly) verified negative.
Avoid personal or identifiable content; the corpus stays local.
Ingesting
Use scripts/synthid_corpus.py — it copies a file in, records its sha256,
resolution, format, and C2PA issuer (via our own detector), and appends a row
to manifest.csv:
```bash
uv run python scripts/synthid_corpus.py ingest path/to/*.png \
  --label pos --source "Gemini app" --model gemini-3-pro \
  --verified-via gemini-app --notes "1024x1024 batch"

uv run python scripts/synthid_corpus.py status  # counts by label / resolution / verification
```
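For readers who want a feel for what ingestion records, here is a minimal stdlib-only sketch of the sha256 and resolution steps for PNG input. It is not the code in scripts/synthid_corpus.py; the function names and the returned keys are hypothetical, and the C2PA issuer lookup is omitted because it depends on the project's own detector.

```python
import hashlib
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def sha256_hex(data: bytes) -> str:
    """Hex digest used as the dedup key."""
    return hashlib.sha256(data).hexdigest()


def png_size(data: bytes):
    """Return (width, height) read from the PNG IHDR chunk, or None.

    Layout: 8-byte signature, 4-byte chunk length, 4-byte chunk type
    ("IHDR"), then big-endian 32-bit width and height.
    """
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        return None
    width, height = struct.unpack(">II", data[16:24])
    return width, height


def describe(data: bytes) -> dict:
    """Build the hash and resolution fields of a hypothetical manifest row."""
    size = png_size(data)
    return {
        "sha256": sha256_hex(data),
        "width": size[0] if size else None,
        "height": size[1] if size else None,
    }
```

Reading the dimensions from the header rather than decoding the image keeps the sketch dependency-free; a real ingester would also need to handle JPEG and other formats.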
Autonomous collection via Chrome MCP
Generation can be driven through the browser (the account must be logged in):
- Gemini (`gemini.google.com`): type `Create an image: <prompt>`, wait, hover the result, click the download icon (top-right). Single, reliable click. Outputs carry Google C2PA + SynthID. Occasionally the composer stalls in a "generating" state -> start a New chat to reset.
- ChatGPT (`chatgpt.com`): the UI download is flaky (the fullscreen viewer races and can grab the previous image; the share-modal path works but is multi-step). Reliable path is an in-page fetch of the rendered image, which preserves the original bytes (C2PA intact, unlike a canvas re-encode):

  ```js
  // run in the ChatGPT tab via the browser MCP javascript tool
  (async () => {
    const imgs = [...document.querySelectorAll('img')].filter(i => i.naturalWidth >= 400);
    const img = imgs[imgs.length - 1]; // newest large image
    const b = await (await fetch(img.currentSrc || img.src)).blob();
    const a = document.createElement('a');
    a.href = URL.createObjectURL(b);
    a.download = 'dl.png';
    document.body.appendChild(a);
    a.click();
    a.remove();
    return 'size=' + b.size; // do NOT return the src (privacy guard blocks query strings)
  })()
  ```

  Gotcha: confirm the returned `size` differs from the previous image before ingesting -- if the new image has not finished rendering, the script grabs the prior one (the corpus dedups by sha256, but the notes would mislabel it). ChatGPT also shows an A/B "which is better?" picker; click Skip first.
Originals, not previews. Some platforms render a low-res preview in the chat
(Grok serves a ~20KB 1024px JPEG/PNG; the in-page <img> fetch grabs that, not
the original). Previews are re-encoded and strip metadata, so a "clean"
preview is not proof the original is clean. Always pull the original via the
platform's native Download / lightbox button and sanity-check the file size (a
20KB "1024x1024" is a preview). ChatGPT's in-chat <img> is the full-res
oaiusercontent original, so fetch+blob is fine there; Grok needs the lightbox
Download; Leonardo serves the original JPEG in-chat (download button matches).
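The file-size sanity check can be automated with a crude bytes-per-pixel heuristic, sketched below. The 0.05 threshold is an assumption chosen only so that the 20KB 1024x1024 example above is flagged; it has not been tuned against the corpus, and heavily compressed originals could trip it.

```python
def bytes_per_pixel(size_bytes: int, width: int, height: int) -> float:
    """File size divided by pixel count."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    return size_bytes / (width * height)


def looks_like_preview(size_bytes: int, width: int, height: int,
                       min_bpp: float = 0.05) -> bool:
    """Flag files that are suspiciously small for their resolution.

    min_bpp is an assumed threshold, not a measured one. A 20KB file at
    1024x1024 is about 0.02 bytes per pixel and is flagged.
    """
    return bytes_per_pixel(size_bytes, width, height) < min_bpp
```

A flagged file is a prompt to re-download through the platform's native button, not a verdict: the heuristic cannot tell a preview from a genuinely small original.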
Per-platform watermark map (observed, May 2026)
What each platform actually embeds, verified by byte-scan (and Gemini-app oracle where noted). The detector's coverage is complementary: metadata catches C2PA / IPTC; the visible detector catches the Gemini-family sparkle; the SynthID pixel itself has no local detector (oracle only).
| Platform | C2PA issuer | SynthID pixel | IPTC "Made with AI" | Visible sparkle | Corpus label |
|---|---|---|---|---|---|
| Gemini app | Google | yes | - | yes | pos |
| ChatGPT / gpt-image | OpenAI | yes | - | - | pos |
| Microsoft Designer | OpenAI + Microsoft | yes (via OpenAI) | - | - | pos |
| Google AI Studio (Nano Banana) | none | yes (oracle-confirmed) | - | yes | pos (metadata blind spot) |
| Meta AI | none | no | yes | - | neg (for SynthID) |
| Leonardo.ai | none | no | no | - | neg |
| Grok (xAI) | none (non-adopter) | no | no | - | neg (captured: clean low-res preview) |
| Bing Image Creator (DALL-E) | (expect OpenAI, like Designer) | - | - | - | not captured (UI uncooperative) |
Key takeaways:
- The same model differs by surface: Gemini app wraps C2PA, AI Studio (API/playground) emits none -- only the pixel + sparkle survive.
- Microsoft Designer's DALL-E backend inherits OpenAI's C2PA+SynthID (issuer "OpenAI, Microsoft").
- Meta uses the IPTC `digitalSourceType` marker, not C2PA or SynthID.
- Bing and Grok web UIs are uncooperative for autonomous capture (no document_idle for screenshots; blob downloads intermittently no-op; low-res in-chat previews). Use their native download button manually if a full-res sample is needed.
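The coverage split described in this section (metadata catches C2PA / IPTC, the visible detector catches the sparkle, the pixel watermark is oracle-only) can be written down as a small model. This is a sketch for reasoning about the table, not the project's detector; `Signals`, `local_detectors`, and `is_metadata_blind_spot` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Signals:
    """What a platform embeds, one field per table column."""
    c2pa_issuer: Optional[str]  # e.g. "OpenAI", or None
    iptc_ai: bool               # IPTC "Made with AI" marker
    sparkle: bool               # visible Gemini-family sparkle
    synthid_pixel: bool         # known only via an external oracle


def local_detectors(s: Signals) -> List[str]:
    """Which local detectors would fire; the pixel watermark never does."""
    hits = []
    if s.c2pa_issuer or s.iptc_ai:
        hits.append("metadata")
    if s.sparkle:
        hits.append("visible")
    return hits


def is_metadata_blind_spot(s: Signals) -> bool:
    """SynthID pixel present, but no C2PA or IPTC metadata to reveal it."""
    return s.synthid_pixel and not (s.c2pa_issuer or s.iptc_ai)
```

With the AI Studio row, `Signals(None, False, True, True)`, only the visible detector fires and the blind-spot check is true, which matches the "metadata blind spot" label in the table.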