# SynthID reference corpus

A locally collected, labeled image corpus for SynthID work. Two downstream uses:

1. **Per-resolution spectral codebook** for an experimental SynthID detector
   (carrier frequencies are resolution-dependent, so labels must record the
   exact native resolution).
2. **Removal regression set** -- verify that our pipeline turns a SynthID-positive
   image into a negative one.

There is no reliable local detector of the SynthID pixel watermark (Google's
decoder is proprietary). The ground-truth label therefore comes from an
external oracle, recorded per image in `verified_via` (see below).

## Layout

```
data/synthid_corpus/
  README.md      # this protocol (committed)
  manifest.csv   # labels + provenance (committed; one row per tracked image)
  images/        # the labeled corpus (committed)
    pos/         # SynthID present
    neg/         # SynthID absent (incl. reviewed real photos)
    cleaned/     # our pipeline output from a pos image
    refs/        # synthetic black/white calibration fills (gitignored, regenerable)
```

The labeled images are committed so the corpus is reproducible and the removal
regression set runs in CI. `manifest.csv` is kept in sync with the files on
disk (one row per tracked image; dangling rows are pruned when files are
removed). Before adding any image, confirm it carries no private or
identifiable content you would not publish -- this is a public repo and git
history is permanent. The synthetic `refs/` fills stay gitignored (regenerable,
not part of the labeled set).

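The sync rule above (one manifest row per tracked image, no dangling rows) can be checked mechanically. The following is a minimal sketch, not the logic of `scripts/synthid_corpus.py`. It assumes the manifest has a column holding each image's path relative to the corpus root; the column name `path` and the helper `reconcile` are hypothetical, and the real schema may differ.

```python
import csv
from pathlib import Path


def reconcile(corpus_root, path_column="path"):
    """Return (dangling_rows, untracked_files) for a corpus directory.

    Assumption: manifest.csv has a `path_column` giving each image's path
    relative to `corpus_root` (e.g. "images/pos/a.png").
    """
    root = Path(corpus_root)
    with open(root / "manifest.csv", newline="") as fh:
        listed = {row[path_column] for row in csv.DictReader(fh)}

    images = root / "images"
    on_disk = {
        p.relative_to(root).as_posix()
        for p in images.rglob("*")
        # Skip any refs/ directory: calibration fills are not labeled images.
        if p.is_file() and "refs" not in p.relative_to(images).parts
    }
    # Rows without a file are dangling; files without a row are untracked.
    return sorted(listed - on_disk), sorted(on_disk - listed)
```

Both returned lists being empty means the manifest and the tracked images agree; a check like this could gate CI before the regression set runs.
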
## Verification levels (`verified_via`)

Ground-truth quality, strongest first:

- `gemini-app` -- checked via the Gemini app "Verify with SynthID" feature. Gold standard for the pixel watermark (Google models).
- `openai-verify` -- checked via openai.com/verify (gold standard for OpenAI ChatGPT/Codex/API images).
- `synthid-portal` -- checked via Google's SynthID Detector portal.
- `c2pa-metadata` -- issuer-only proxy (Google/OpenAI C2PA manifest present). Weaker: the C2PA manifest can be stripped while the pixel watermark remains.
- `third-party` -- label asserted by an external dataset, not independently verified.
- `none` -- unverified.

Prefer `gemini-app` for any image that will train the codebook or gate a test.

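The ordering above can be encoded so that scripts and tests can filter on label quality. This is a minimal sketch: the names `VERIFICATION_LEVELS`, `meets`, and `gate_quality` are hypothetical. It treats the list order as a strict ranking, which is a simplification, since `openai-verify` is the gold standard for OpenAI images.

```python
# Strongest first, in the order listed above.
VERIFICATION_LEVELS = (
    "gemini-app",
    "openai-verify",
    "synthid-portal",
    "c2pa-metadata",
    "third-party",
    "none",
)
_RANK = {level: i for i, level in enumerate(VERIFICATION_LEVELS)}


def meets(verified_via, minimum="gemini-app"):
    """True if `verified_via` is at least as strong as `minimum`.

    Raises KeyError for a level that is not in VERIFICATION_LEVELS.
    """
    return _RANK[verified_via] <= _RANK[minimum]


def gate_quality(rows, minimum="gemini-app"):
    """Keep only manifest rows (dicts) whose label is strong enough."""
    return [row for row in rows if meets(row["verified_via"], minimum)]
```

With the default `minimum`, only `gemini-app` rows pass, matching the preference stated above for codebook training and test gating.
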
## What to collect

For the **codebook** (per target resolution, e.g. 1024x1024, 1024x1536, 1536x2816):

- 30-50+ SynthID-positive outputs per resolution (more is better; ~150-200 per
  resolution materially improves carrier discovery).
- At each target resolution, also a batch of **pure-black (#000000)** and
  **pure-white (#FFFFFF)** fills generated by the SynthID model -- these isolate
  the content-independent carrier (the watermark is most of the signal there).

For the **regression set**:

- A handful of `pos` images plus their `cleaned` counterparts (the same images
  run through our pipeline), with each cleaned output re-verified via
  `gemini-app` (it should read negative).
- `neg` controls: non-AI photos and outputs from non-SynthID models (SD,
  Midjourney, Firefly), each verified negative.

The corpus is committed to a public repo: review every image before adding it
and keep out anything private or identifiable you would not publish.

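To illustrate why flat fills are useful: on a pure-black or pure-white image, whatever deviates from the fill value is watermark plus per-image noise, and averaging over a batch suppresses the noise. The sketch below is a toy, pure-Python version on nested lists, not the project's codebook code. It assumes the carrier is additive and roughly the same across fills of one resolution; `mean_residual` is a hypothetical helper.

```python
def mean_residual(fills, fill_value):
    """Average (pixel - fill_value) across same-sized grayscale fills.

    `fills` is a list of images, each a list of rows of numbers.
    Assumption: the watermark adds roughly the same pattern to every fill,
    so the per-pixel mean approximates that pattern.
    """
    if not fills:
        raise ValueError("need at least one fill")
    height, width = len(fills[0]), len(fills[0][0])
    for img in fills:
        if len(img) != height or any(len(row) != width for row in img):
            raise ValueError("all fills must share one resolution")
    n = len(fills)
    return [
        [sum(img[y][x] - fill_value for img in fills) / n for x in range(width)]
        for y in range(height)
    ]
```

For black fills `fill_value` is 0 and for white fills 255. Mixing resolutions is rejected because the carrier frequencies are resolution-dependent.
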
## Ingesting

Use `scripts/synthid_corpus.py` -- it copies a file in, records its sha256,
resolution, format, and C2PA issuer (via our own detector), and appends a row
to `manifest.csv`:

```bash
uv run python scripts/synthid_corpus.py ingest path/to/*.png \
  --label pos --source "Gemini app" --model gemini-3-pro \
  --verified-via gemini-app --notes "1024x1024 batch"

uv run python scripts/synthid_corpus.py status   # counts by label / resolution / verification
```

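The sha256 recorded at ingest is also what lets the corpus reject duplicate files. A minimal sketch of that idea follows. It is an illustration only, not the code in `scripts/synthid_corpus.py`; `sha256_of` and `is_duplicate` are hypothetical names.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hex sha256 of a file, read in chunks so large images stay cheap."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_duplicate(path, known_hashes):
    """True if the file's bytes are already present in the corpus.

    `known_hashes` is the set of sha256 values already in the manifest.
    """
    return sha256_of(path) in known_hashes
```

Hashing the raw bytes means a re-encoded copy of the same picture counts as a different file, which is the desired behavior here because re-encoding can change metadata and the watermark.
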
## Autonomous collection via Chrome MCP

Generation can be driven through the browser (the account must be logged in):

- **Gemini** (`gemini.google.com`): type `Create an image: <prompt>`, wait, hover the
  result, click the download icon (top-right). Single, reliable click. Outputs
  carry Google C2PA + SynthID. Occasionally the composer stalls in a
  "generating" state -> start a New chat to reset.
- **ChatGPT** (`chatgpt.com`): the UI download is flaky (the fullscreen viewer
  races and can grab the previous image; the share-modal path works but is
  multi-step). The reliable path is an in-page fetch of the rendered image, which
  preserves the original bytes (C2PA intact, unlike a canvas re-encode):

  ```js
  // run in the ChatGPT tab via the browser MCP javascript tool
  (async () => {
    const imgs = [...document.querySelectorAll('img')].filter(i => i.naturalWidth >= 400);
    const img = imgs[imgs.length - 1]; // newest large image
    if (!img) return 'no large image found'; // nothing rendered yet
    const b = await (await fetch(img.currentSrc || img.src)).blob();
    const a = document.createElement('a');
    a.href = URL.createObjectURL(b); a.download = 'dl.png';
    document.body.appendChild(a); a.click(); a.remove();
    return 'size=' + b.size; // do NOT return the src (privacy guard blocks query strings)
  })()
  ```

  Gotcha: confirm the returned `size` differs from the previous image's before
  ingesting -- if the new image has not finished rendering, the script grabs the
  prior one (the corpus dedups by sha256, but the notes would mislabel it).
  ChatGPT also shows an A/B "which is better?" picker; click Skip first.

**Originals, not previews.** Some platforms render a low-res preview in the chat
(Grok serves a ~20KB 1024px JPEG/PNG; the in-page `<img>` fetch grabs *that*, not
the original). Previews are re-encoded and **strip metadata**, so a "clean"
preview is not proof the original is clean. Always pull the original via the
platform's native Download / lightbox button and sanity-check the file size (a
20KB "1024x1024" is a preview). ChatGPT's in-chat `<img>` *is* the full-res
oaiusercontent original, so fetch+blob is fine there; Grok needs the lightbox
Download; Leonardo serves the original JPEG in-chat (download button matches).

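The file-size sanity check can be expressed as a bytes-per-pixel heuristic. This is a rough sketch: `looks_like_preview` is a hypothetical helper, and the 0.1 bytes-per-pixel threshold is an illustrative assumption, not a measured cutoff. The caller supplies the size and dimensions.

```python
def looks_like_preview(size_bytes, width, height, min_bytes_per_pixel=0.1):
    """Flag files that are suspiciously small for their resolution.

    Assumption: `min_bytes_per_pixel` is an illustrative threshold; tune it
    per platform and format before relying on it.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    return size_bytes / (width * height) < min_bytes_per_pixel
```

A 20KB file at 1024x1024 works out to about 0.02 bytes per pixel, well under the threshold, so it is flagged. A pass does not prove the file is an original, only that its size is not obviously wrong.
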
## Per-platform watermark map (observed, May 2026)

What each platform actually embeds, verified by byte-scan (and the Gemini-app
oracle where noted). The detector's checks are complementary: the metadata check
catches C2PA / IPTC; `exif_generator` catches EXIF `Make`/`Software` + XMP
`CreatorTool`; `invisible_watermark.py` (imwatermark) catches the open
SD/SDXL/FLUX DWT-DCT watermark on pristine files; the visible detector catches
the Gemini-family sparkle; the SynthID *pixel* watermark itself has no local
detector (oracle only).

| Platform | C2PA issuer | SynthID pixel | IPTC "Made with AI" | Visible sparkle | imwatermark | Corpus label |
|---|---|---|---|---|---|---|
| Gemini app | Google | yes | - | yes | - | pos |
| ChatGPT / gpt-image | OpenAI | yes | - | - | - | pos |
| Microsoft Designer | OpenAI + Microsoft | yes (via OpenAI) | - | - | - | pos |
| Bing Image Creator | Microsoft (MAI-Image) | no | - | - | - | pos (C2PA "Microsoft", not OpenAI) |
| Google AI Studio (Nano Banana) | **none** | yes (oracle-confirmed) | - | yes | - | pos (metadata blind spot) |
| Stability AI (Brand Studio) | Stability AI Ltd | no | - | - | no | pos (C2PA only) |
| Ideogram | none | no | - | - | no | pos (EXIF `Make="Ideogram AI"` only) |
| Meta AI | none | no | **yes** | - | - | neg (for SynthID) |
| Leonardo.ai | none | no | no | - | no | neg |
| Recraft | none (export strips) | no | no | - | no | neg (re-encoded export, no signal) |
| Krea (FLUX 2 host) | none | no | no | - | no | neg (host omits the imwatermark encoder) |
| Grok (xAI) | none (non-adopter) | no | no | - | no | neg (captured: clean low-res preview) |

Key takeaways:
- The same model differs by *surface*: Gemini app wraps C2PA, AI Studio (API/playground) emits none -- only the pixel + sparkle survive.
- Microsoft Designer's DALL-E backend inherits OpenAI's C2PA+SynthID (issuer "OpenAI, Microsoft"); Bing now runs Microsoft's own **MAI-Image** and signs C2PA as "Microsoft" (not OpenAI/DALL-E).
- Meta uses the IPTC `digitalSourceType` marker, not C2PA or SynthID.
- The open imwatermark fires only on *pristine* output from a pipeline that runs the encoder (diffusers default, official BFL) -- not from re-hosts (Krea, Stability hosted SDXL) or re-encoded design exports (Recraft, Canva). Ideogram's only signal is the EXIF `Make` tag.
- Bing and Grok web UIs are uncooperative for autonomous capture (no `document_idle` for screenshots; blob downloads intermittently no-op; low-res in-chat previews). Use their native download button manually if a full-res sample is needed.