mirror of
https://github.com/wiltodelta/remove-ai-watermarks.git
synced 2026-05-26 22:22:24 +02:00
af787fd8d6
Grow the SynthID corpus to 109 originals (91 iPhone-photo negatives, 2 positives) and document what was learned studying 8 platforms: - README: per-platform watermark map (C2PA issuer / SynthID pixel / IPTC / visible sparkle per platform) and an "originals, not previews" note (re-encoded previews strip metadata, so a clean preview is not proof). - CLAUDE.md: surface-dependent blind spot -- the same Google model wraps C2PA in the Gemini app but emits the SynthID pixel watermark + sparkle with no C2PA/IPTC via the API/playground (AI Studio, Nano Banana), so synthid_source returns None despite SynthID being present; only the pixel oracle or the visible-sparkle detector catches those. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
7.1 KiB
Remove-AI-Watermarks
You are a principal Python engineer maintaining a CLI tool and library for removing visible and invisible AI watermarks from images.
How to run
- `uv run remove-ai-watermarks all <image.png> -o <output.png>`
- `uv run remove-ai-watermarks metadata <image.png> --check`: inspect AI metadata (C2PA, EXIF, PNG chunks)
- `uv run remove-ai-watermarks metadata <image.png> --remove -o <out.png>`: strip all AI metadata
Test and lint
- `bash maintain.sh`: uv-outdated, uv-secure, ruff check/fix, ruff format, pyright, pytest -n auto
- `maintain.sh` does not currently finish green (pre-existing, not per-change): `uv-secure` aborts on a fixable transitive `idna` vuln, and strict pyright carries debt in `remove_ai_metadata/cli.py` (untyped piexif/PIL/click/rich). To gate a change, run `uv run ruff check`, `uv run pyright <changed files>`, `uv run pytest` directly.
- Run `uv run` from the repo root; from another cwd it falls back to a bare env without numpy/cv2/torch.
- Metadata/C2PA tests assert against real committed fixtures in `data/samples/` (`chatgpt-*.png` = OpenAI C2PA, `firefly-1.png` = Adobe, `not-ai-*` = clean); synthetic byte blobs cover the JPEG/ISOBMFF format paths.
- SynthID reference corpus: `scripts/synthid_corpus.py` ingests labeled images into `data/synthid_corpus/` (`manifest.csv` tracked, `images/` gitignored); see its README for the collection protocol and verification oracles.
Configuration
- GPU/ML modules (invisible_engine, ctrlregen, watermark_remover) are optional; guard imports with `is_available()` checks
- Tests for ML modules are limited to availability checks (require multi-GB downloads)
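The guard pattern above can be sketched as follows. This is a minimal illustration, not the project's actual `is_available()`: the helper signature, the module names passed to it, and `pick_pipeline()` are all assumptions made for the example.

```python
import importlib.util


def is_available(*modules: str) -> bool:
    """Return True only if every named top-level module can be located.

    Hypothetical helper: find_spec() looks the module up without importing it,
    so the check stays cheap even for heavy packages such as torch.
    """
    return all(importlib.util.find_spec(name) is not None for name in modules)


def pick_pipeline() -> str:
    """Hypothetical caller: fall back to a light path when ML extras are missing."""
    if is_available("torch", "diffusers"):
        # Heavy imports would be deferred to this branch in real code.
        return "invisible"
    return "metadata-only"
```

Keeping the heavy `import` statements inside the guarded branch means the CLI and the metadata paths still load on machines without the ML extras, which is also what lets the test suite restrict itself to availability checks.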
Key modules
- `noai/c2pa.py`: PNG chunk parser; use `extract_c2pa_chunk(path)` to get raw caBX payload, `has_c2pa_metadata(path)` to detect. Do not reimplement chunk parsing. `extract_c2pa_info(path)` sets `synthid_watermark`/`synthid_vendors` when the manifest is signed by a SynthID-using vendor.
- `noai/constants.py`: PNG_SIGNATURE, C2PA_CHUNK_TYPE, C2PA_SIGNATURES, C2PA_ISSUERS, and `SYNTHID_C2PA_ISSUERS` (issuers that pair SynthID with C2PA: Google, OpenAI). Add a new issuer here, not inline.
- `metadata.py`: `synthid_source(path)` returns the vendor name(s) if the C2PA manifest implies a SynthID pixel watermark, else None. Format-agnostic: PNG via the caBX parser, JPEG/WebP/AVIF/HEIF/JXL via a binary scan (C2PA marker + SynthID issuer + AI-source marker). `get_ai_metadata` surfaces the verdict, and `metadata --check` prints it as a callout.
- `face_protector.py`: YOLO detect + soft-blend pattern; mirror this for any "protect region during diffusion" features
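For orientation, the three-signal binary scan described for `synthid_source` might look roughly like the sketch below. The marker byte strings and the function name `scan_for_synthid_vendors` are illustrative assumptions; the real values live in `noai/constants.py` and the real logic in `metadata.py`.

```python
from __future__ import annotations

# Illustrative markers only (assumptions); the project keeps its real ones
# in noai/constants.py.
C2PA_MARKER = b"c2pa"
SYNTHID_ISSUERS = {"Google": b"Google", "OpenAI": b"OpenAI"}
AI_SOURCE_MARKER = b"trainedAlgorithmicMedia"


def scan_for_synthid_vendors(data: bytes) -> list[str] | None:
    """Return vendor names only when all three signals co-occur, else None."""
    # Signals 1 and 3: a C2PA manifest and an AI-source declaration must both exist.
    if C2PA_MARKER not in data or AI_SOURCE_MARKER not in data:
        return None
    # Signal 2: at least one issuer known to pair SynthID with C2PA.
    vendors = [name for name, needle in SYNTHID_ISSUERS.items() if needle in data]
    return vendors or None
```

Requiring all three signals is what keeps a plain substring scan from flagging, say, a clean photo whose EXIF merely mentions a vendor name. It also shows why the verdict disappears once the manifest is stripped: there is nothing left to scan for.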
Known limitations
- `invisible` pipeline downscales to model-native resolution (1024 px for SDXL) before diffusion. Degrades fine text in infographics. Tracked; fix is tile-based diffusion.
- Pyright first run is slow (2-3 min) due to ML deps (torch/diffusers/transformers stubs); full-project `uv run pyright` can stall for many minutes, so scope it to changed files.
- `ultralytics` monkey-patches `PIL.Image.open` and tries to autoload `pi_heif`. When `pi_heif` is missing, opening files raises `ModuleNotFoundError`, not `UnidentifiedImageError`. Code that opens user-supplied or unknown-format files should `except Exception`, not just `OSError`/`UnidentifiedImageError`.
- Metadata detection for AVIF/HEIF/JPEG-XL relies on a binary scan for `C2PA_UUID` + `IPTC_AI_MARKERS`. C2PA removal in those containers is implemented via `noai/isobmff.py` (top-level `uuid`/`jumb` box stripper, no re-encoding). EXIF/XMP boxes inside those containers are not yet scrubbed.
- SynthID detection is metadata-only. There is no reliable local detector of the SynthID pixel watermark: Google's decoder is proprietary, with no public spec or API (only a waitlisted portal). We detect SynthID by its C2PA companion (`synthid_source`/`SYNTHID_C2PA_ISSUERS`), which is reliable while the manifest is intact but says nothing once C2PA is stripped. Surface-dependent blind spot (verified 2026-05-24): the same Google model emits different metadata per surface -- the Gemini app wraps outputs in Google C2PA, but the API/playground (AI Studio, Nano Banana / gemini-2.5-flash-image) emits the SynthID pixel watermark (confirmed via the Gemini-app oracle) + the visible sparkle but no C2PA/IPTC at all, so `synthid_source` returns None despite SynthID being present. Only the pixel oracle or the visible-sparkle detector catches those. (Meta AI is another surface mismatch: it writes the IPTC `digitalSourceType=trainedAlgorithmicMedia` marker, not C2PA and not SynthID.) Google→SynthID is long-standing; OpenAI→SynthID is confirmed by OpenAI's Help Center (ChatGPT/Codex/API "include both C2PA metadata and SynthID watermarks", updated 2026-05-21) but time-gated (pre-rollout OpenAI images carry C2PA without SynthID), so the OpenAI verdict is hedged "likely". Oracles: Gemini app "Verify with SynthID" (Google), openai.com/verify (OpenAI). The spectral phase-coherence approach from `github.com/aloshdenny/reverse-SynthID` was evaluated (May 2026) and does not work for real-content detection: on its own shipped codebook + validation set, watermarked and cleaned images were indistinguishable (conf within noise, cleaned often higher); it only fires on pure-black 1024x1024 reference images at exact resolution (the controlled case it was calibrated on). The README's "90% / conf=0.91" reproduces only in that lab condition. Do not build a production detector on it; if revisited, it is experimental/diagnostic only and needs a per-resolution, per-model reference corpus.

  A from-scratch gpt-image pilot (2026-05-24) confirmed this independently: 5 independent solid-black gpt-image outputs share a near-identical fixed signature (pairwise residual correlation 0.92, avg-template retains 97% energy), so the watermark/carrier IS strongly present and consistent on flat content, but the carrier frequencies extracted from it do NOT discriminate real content (carrier-to-random ratio: cleaned 1.86 > watermarked 1.53; a non-gpt-image image scored highest at 3.67). The signature drowns in content texture. Net: a perfectly consistent solid-color signature still yields no real-content pixel detector with magnitude/carrier methods.
- SynthID v2 vs default pipeline: the SDXL-based default profile (since May 2026) defeats SynthID v2. Verified end-to-end (May 2026): local SDXL run on a Gemini 3 Pro output, checked via the Gemini app's "Verify with SynthID" feature, returned "no SynthID watermark detected". Also confirmed against OpenAI's SynthID (2026-05-23): a fresh ChatGPT/gpt-image output read "SynthID detected" on openai.com/verify before the local SDXL run and "SynthID not detected" after (corpus regression chain: pos `4ef377bd` -> cleaned `47188e88`). The same configuration is used in raiw-app production (`fal-ai/fast-sdxl` at native ~1024 px, strength 0.05, steps 50). SD-1.5 dreamshaper at 768 px was previously the default and does NOT defeat v2, as verified empirically against the same feature (strength 0.04, 0.10, and elastic warp α∈{5,8} all flagged positive). That SD-1.5 path was removed; only `default` (SDXL) and `ctrlregen` profiles remain.
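To make "top-level `uuid`/`jumb` box stripper, no re-encoding" concrete, here is a minimal sketch of walking ISOBMFF top-level boxes and copying through everything except the dropped types. It is not the code in `noai/isobmff.py`; the function name and the `drop_types` parameter are assumptions, and a real stripper would presumably match the specific C2PA UUID rather than drop every `uuid` box.

```python
import struct


def strip_top_level_boxes(
    data: bytes, drop_types: frozenset = frozenset({b"uuid", b"jumb"})
) -> bytes:
    """Copy top-level ISOBMFF boxes verbatim, skipping those in drop_types."""
    out = bytearray()
    pos = 0
    while pos < len(data):
        if pos + 8 > len(data):
            raise ValueError("truncated box header")
        # Box header: 32-bit big-endian size, then a 4-byte type code.
        size, box_type = struct.unpack_from(">I4s", data, pos)
        header = 8
        if size == 1:
            # size == 1 means a 64-bit "largesize" follows the type field.
            if pos + 16 > len(data):
                raise ValueError("truncated largesize header")
            (size,) = struct.unpack_from(">Q", data, pos + 8)
            header = 16
        elif size == 0:
            # size == 0 means the box runs to the end of the file.
            size = len(data) - pos
        if size < header or pos + size > len(data):
            raise ValueError("invalid box size")
        if box_type not in drop_types:
            out += data[pos:pos + size]  # bytes are copied, never re-encoded
        pos += size
    return bytes(out)
```

One caveat the sketch does not handle: removing a box that sits before the media data shifts every later byte, so containers that store absolute file offsets may need those offsets patched afterwards.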