mirror of https://github.com/wiltodelta/remove-ai-watermarks.git synced 2026-06-05 10:38:00 +02:00

Victor Kuznetsov ef6fdaeeec Detect text at native resolution (capped), fixing small-text recall on large images (#27)

The text-protection detector scaled every image to a fixed 736 px long side, so
small text on large canvases (e.g. ~16 px on 2048) was downscaled below the
detector and missed -> deformed by the SDXL pass (issue #14). Detect at the
native long side capped at 1536, never upscaled (_detection_input_size, a pure
unit-tested helper). Detection is script-agnostic (DB segments regions, not
characters), so this is language-agnostic: a new benchmark
(scripts/text_detection_benchmark.py) measures recall across Latin/Cyrillic/CJK/
Hangul/Arabic/digits x sizes x canvas -> overall hit-rate 0.91 -> 1.00, worst
cell (2048/16 px) 0.06 -> 1.00. Docs updated.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-05-29 12:28:30 -07:00


Remove-AI-Watermarks

You are a principal Python engineer maintaining a CLI tool and library for removing visible and invisible AI watermarks from images.

How to run

uv run remove-ai-watermarks all <image.png> -o <output.png>
uv run remove-ai-watermarks visible <image.png> -o <out.png> — visible-mark removal, CPU, no GPU. --mark auto (default) routes between the Gemini sparkle and the Doubao "豆包AI生成" text strip by detector confidence; --mark gemini / --mark doubao force one.
uv run remove-ai-watermarks erase <image.png> --region x,y,w,h -o <out.png> — universal region eraser (any logo/object, any position). --backend cv2 (default, no deps) or --backend lama (big-LaMa via onnxruntime, extra lama); --region is repeatable.
uv run remove-ai-watermarks identify <image> — provenance verdict (platform + watermark inventory + confidence); --json for machine output, --no-visible to skip the cv2 sparkle detector
uv run remove-ai-watermarks metadata <image.png> --check — inspect AI metadata (C2PA, EXIF, PNG chunks)
uv run remove-ai-watermarks metadata <image.png> --remove -o <out.png> — strip all AI metadata

Test and lint

CI (.github/workflows/test.yml): runs on push to main + every PR. A lint job (ubuntu: ruff check + ruff format --check) plus a test matrix (ubuntu/macos/windows x py3.10/3.12) that does uv sync --frozen --extra dev then pytest. The matrix installs only core + dev (no gpu extra), so the GPU/model-running tests skip there and it exercises the metadata/identify/visible/cv2-eraser surface on all three OSes. Keep uv.lock valid (don't break --frozen) when editing pyproject.toml. publish.yml stays release-only.
bash maintain.sh — uv-outdated, uv-secure, ruff check/fix, ruff format, pyright, pytest -n auto
Strict pyright is clean across src/ (0 errors). The cv2/torch/diffusers boundary files (gemini_engine, region_eraser, doubao_engine, face_protector, humanizer, invisible_engine, noai/watermark_remover, and the whole noai/ctrlregen/ subpackage) carry a documented per-file # pyright: relax pragma (or, for ctrlregen, a tool.pyright.executionEnvironments entry) that turns off only the unknown-type / untyped-third-party rules — those libs ship no usable types, so strict typing there fights the ecosystem. Pure-logic files stay fully strict; typings/piexif/__init__.pyi is a local stub so metadata.py/extractor.py resolve piexif. Public ndarray-returning signatures on the relaxed engines are still annotated NDArray[Any] so strict consumers (cli.py) stay clean. When touching a relaxed file, prefer fixing real issues over widening the pragma; keep the pragma scoped to genuinely-untyped boundaries. (uv-secure is clean since idna was bumped 3.11 -> 3.16, fixing GHSA-65pc-fj4g-8rjx.)
Full-project uv run pyright (no path) OOMs/crashes node on this ML-heavy repo (emits a libnode stack frame, no summary) — a known environment limit, not a code error. Gate with uv run --extra dev --extra gpu pyright src/ (completes, authoritative) or scope to changed files; also run uv run ruff check and uv run pytest directly.
Run uv run from the repo root — from another cwd it falls back to a bare env without numpy/cv2/torch.
To add a dev tool (pytest/ruff/pyright) into the env, use uv sync --frozen --extra dev --extra gpu, never uv pip install — uv pip install re-resolves and rewrites uv.lock, which silently bumped transformers to a build incompatible with the pinned diffusers (cannot import name 'Qwen3VLForConditionalGeneration') and broke every identify/metadata import. Recovery: git checkout uv.lock && uv sync --frozen --extra gpu --extra dev. The gpu extra holds diffusers/transformers/torch, so a bare uv sync (no extras) removes them and noai/__init__ (eager pipeline import) then fails. maintain.sh's uv sync --all-extras also pulls the heavy trustmark/lama wheels (pytorch-lightning, onnxruntime) — fine on a good connection, but on flaky DNS sync only --extra gpu --extra dev and run the lint/test steps by hand.
Metadata/C2PA tests assert against real committed fixtures in data/samples/ (chatgpt-*.png = OpenAI C2PA, firefly-1.png = Adobe, mj-* = Midjourney IPTC, doubao-1.png = ByteDance Doubao with the China TC260 <TC260:AIGC> XMP label and a visible "豆包AI生成" text mark bottom-right; grok-1.jpg = xAI Grok with its EXIF-only Signature: blob + UUID Artist and no C2PA/SynthID/IPTC); synthetic byte blobs cover the JPEG/ISOBMFF format paths. The "non-AI / clean photo" control is no longer in data/samples/ -- the clean_photo conftest fixture serves a verified-negative image from the corpus neg/ set (skips if the corpus is absent).
SynthID reference corpus: scripts/synthid_corpus.py ingests labeled images into data/synthid_corpus/. The labeled images/ (pos/ neg/ cleaned/) are committed (public repo -- review every image for private content before adding; manifest.csv is kept in sync with the files on disk, one row per tracked image); only the synthetic refs/ calibration fills are gitignored. See its README for the collection protocol and verification oracles.

Configuration

GPU/ML modules (invisible_engine, ctrlregen, watermark_remover) are optional — guard imports with is_available() checks
Optional detection extras: detect (imwatermark — open SD/SDXL/FLUX watermark) and trustmark (Adobe TrustMark decoder; pulls torch + downloads weights). Both are guarded by is_available() and skipped by identify when absent.
Tests for the model-running paths are limited to availability checks, because actually running them needs multi-GB downloads. The pure helpers inside ML-adjacent modules, however, are unit-tested without any download and must stay that way: _target_size (native-vs-downscale, test_invisible_engine.py), the MPS->CPU fallback control flow via mocked pipelines (test_img2img_runner.py, 100% coverage), and the tiling math tile_positions/make_blend_weight/resize_center_crop (test_tiling.py; pytest.importorskip("torch") since tiling.py imports torch at module top). Don't skip these as "ML, needs a model": only run_tiled, remove_watermark, and the diffusion bodies actually need one.
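The is_available() guard described in this section can be sketched as follows. This is a minimal illustration, not the repo's implementation: the module name passed in and the remove_invisible entry point are hypothetical, and the real modules may check more than importability.

```python
import importlib.util


def is_available(module_name: str = "torch") -> bool:
    """Return True when the optional dependency can be located.

    find_spec only locates the module without importing it, so the check
    stays cheap even for heavy packages.
    """
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        return False


def remove_invisible(path: str) -> str:
    """Hypothetical entry point that defers the heavy import until needed."""
    if not is_available("torch"):
        return f"skipped {path}: optional extra not installed"
    import torch  # imported lazily, only after the guard passes

    return f"would process {path} with torch {torch.__version__}"
```

Keeping the import inside the function body means a bare environment can still import the module and report "unavailable" instead of failing at import time.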

Key modules

noai/c2pa.py — PNG chunk parser; use extract_c2pa_chunk(path) to get raw caBX payload, has_c2pa_metadata(path) to detect. Do not reimplement chunk parsing. extract_c2pa_info(path) sets synthid_watermark/synthid_vendors when the manifest is signed by a SynthID-using vendor, and soft_binding/soft_binding_vendors when a c2pa.soft-binding alg names a forensic-watermark vendor (soft_binding_vendors_in(buffer) is the shared byte-scan, used by both the PNG parser and the non-PNG binary path).
noai/constants.py — PNG_SIGNATURE, C2PA_CHUNK_TYPE, C2PA_SIGNATURES, C2PA_ISSUERS, SYNTHID_C2PA_ISSUERS (issuers that pair SynthID with C2PA: Google, OpenAI), and C2PA_SOFT_BINDINGS (soft-binding alg prefix → forensic-watermark vendor: Adobe TrustMark, Digimarc, Imatag, Steg.AI, Microsoft, ...). Add a new issuer/binding here, not inline.
metadata.py — scan_head(path, size=1MB) is the shared input for every C2PA/AIGC/IPTC byte scan: first size bytes plus, for ISOBMFF, the late provenance-box payloads from isobmff.scan_c2pa_region (catches a manifest after a large mdat); behavior-neutral (f.read(size)) for non-ISOBMFF. Use it instead of open().read(1MB) for any new marker scan. synthid_source(path) returns the vendor name(s) if the C2PA manifest implies a SynthID pixel watermark, else None. Format-agnostic: PNG via the caBX parser, JPEG/WebP/AVIF/HEIF/JXL via a binary scan (C2PA marker + SynthID issuer + AI-source marker). get_ai_metadata surfaces the verdict, and metadata --check prints it as a callout. Both get_ai_metadata and has_ai_metadata guard the PIL open with except Exception (HEIC/unknown formats raise non-OSError) and fall through to the binary scan. xai_signature(path) detects xAI/Grok's EXIF-only scheme (ImageDescription = Signature: <base64> + UUID Artist); it feeds has_ai_metadata, get_ai_metadata (key xai_signature), and identify. iptc_ai_system(path) detects the IPTC Photo Metadata 2025.1 AI-disclosure XMP properties (IPTC_AI_FIELD_MARKERS = AISystemUsed/AISystemVersionUsed/AIPromptInformation/AIPromptWriterName) and returns the AISystemUsed generator name (or "fields present"). remove_ai_metadata routes ISOBMFF video (.mp4/.mov/.m4v) through the same isobmff.strip_c2pa_boxes as AVIF/HEIF (MP4 is ISOBMFF), and _scrub_ai_exif removes the xAI signature + AI-generator EXIF tags on JPEG output.
identify.py — identify(path) aggregates every locally-readable signal (C2PA issuer→platform, C2PA soft-binding forensic-watermark vendor, IPTC "Made with AI" + IPTC 2025.1 AISystemUsed, embedded SD/ComfyUI params, SynthID proxy, xAI/Grok EXIF signature via metadata.xai_signature, the China TC260 AIGC label via metadata.aigc_label, the HuggingFace hf-job-id job marker via metadata.huggingface_job, visible Gemini sparkle, open invisible watermark, Adobe TrustMark via trustmark_detector) into one ProvenanceReport. is_ai_generated is True or None (never asserted False — stripped metadata is not proof of clean origin). The hf_job and visible-sparkle signals are medium confidence: each lifts an otherwise-Unknown verdict to a tentative AI (hf_only / visible_only, parallel branches) but is excluded from the high-confidence ai_from_metadata set, so neither overrides a hard metadata signal. Visible-sparkle is promoted only at confidence ≥ _SPARKLE_THRESHOLD (0.5; corpus-tuned to separate Gemini sparkles ≥0.56 from non-sparkle ≤0.49). The cv2 dependency lives in gemini_engine.detect_sparkle_confidence, not here. C2PA platform attribution is device-token-first, issuer-scan fallback (_device_platform scans manifest bytes for _DEVICE_C2PA_PLATFORM tokens, then _attribute_platform/_ISSUER_PLATFORM). Why, verified on real signed files 2026-05-26: the old issuer-only byte-scan matched ANY issuer substring anywhere, so multi-entity manifests mis-attributed -- Leica→"Truepic" (a signing authority in the trust chain), Nikon→"Adobe Firefly" (XMP-toolkit "Adobe" + the sample's "Adobe_MAX" name), Pixel→"Google (Gemini)" ("Google LLC" cert org), Truepic→"Google". A distinctive device token wins instead. 
Token distinctiveness is load-bearing: bare b"Truepic" mis-fires (it appears in unrelated trust chains -- it mis-attributed the OpenAI chatgpt-1.png fixture), so the token is the specific b"Truepic_Lens" from the Lens SDK claim generator; likewise b"Pixel Camera" (cert CN) not bare b"Pixel". _DEVICE_C2PA_PLATFORM lists ONLY tokens verified against a real C2PA file: Leica (lc_c2pa/Leica Camera), Nikon (NIKON), Pixel (Pixel Camera -- from a real Pixel 10 Pro file attached to c2pa-rs issue #1609/#1554), Sony (sony.sig/sony.cert -- Sony's own C2PA assertion namespace, verified on a real Sony PXW-Z300 file; NOT bare "Sony" which is a common EXIF Make), Truepic (Truepic_Lens). Canon/Samsung/Bria have no public direct-download C2PA sample (checked exhaustively: GitHub issue/PR attachments, contentcredentials gallery, HF datasets -- all upload-to-verify or token-gated; Canon's only public file was a self-signed hobbyist CR3, not factory), so they stay unmapped until a real file is captured (same fixture discipline as Grok/Doubao). The Sony sample is video (MP4) -- our ISOBMFF C2PA path detects it; Sony Alpha stills likely share the sony.* namespace but are not separately verified. Camera C2PA marks capture authenticity, not AI (Pixel carries computationalCapture, not trainedAlgorithmicMedia), so these never set is_ai -- that stays driven by digital-source-type. c2pa.cbor_text_after (now public) is best-effort for the generator detail string only and can be None when the manifest keys it claim_generator_info (Pixel). Issuer→generator mapping is is_ai-gated (_attribute_platform(issuers, is_ai=c2pa_is_ai)): a specific AI-generator platform is named only when the digital-source-type is trainedAlgorithmicMedia; on a non-AI source an issuer substring is treated as incidental (an "Adobe XMP" toolkit string in an unmapped Canon/Sony capture would otherwise mislabel it "Adobe Firefly"), so it degrades to the neutral "C2PA signer: X" label. 
Real Firefly/OpenAI/Google output carries the AI source-type, so it is unaffected (verified: chatgpt-1.png→OpenAI, firefly-1.png→Adobe Firefly still attribute). _attribute_platform defaults is_ai=True so the mapping stays unit-testable in isolation. Add device tokens to _DEVICE_C2PA_PLATFORM, generator/issuer platforms to _ISSUER_PLATFORM, not inline. For non-PNG containers (JPEG/WebP/AVIF/HEIF/JXL) the caBX parser returns nothing, so issuer (_issuers_in) and generator (_ai_tools_in, reusing C2PA_AI_TOOLS) are recovered by binary-scanning the first MB. EXIF Software / Make / Artist / ImageDescription and XMP CreatorTool generator tags are read by metadata.exif_generator (PIL+piexif for any format PIL opens incl. AVIF, plus a container-agnostic XMP raw-byte scan that also covers HEIF/JXL), matched against AI_GENERATOR_TOKENS so ordinary editors (plain "Adobe Photoshop") and real-camera Make ("Apple"/"Canon") are not flagged. Ideogram tags its output with EXIF Make="Ideogram AI" (verified on a real download 2026-05-24) — that's why Make is read. Integrity-clash detection (_integrity_clashes, surfaced as ProvenanceReport.integrity_clashes, printed in red by identify and serialized to --json): contradictions between independent generator stamps are a laundering/spoofing tell. Two rules: (1) two or more distinct AI-origin vendors named by independent signals (e.g. C2PA OpenAI + EXIF Make="Ideogram AI"), and (2) a camera-capture C2PA device (_DEVICE_C2PA_PLATFORM) coexisting with any AI-generation marker. Vendor normalization is _vendor_of over _AI_VENDOR_TOKENS (so a C2PA "Google (Gemini)" issuer and a SynthID-Google proxy agree, while different vendors clash). High-precision by design: only hard generator stamps feed it (C2PA-issuer when source is AI, SynthID, EXIF/XMP generator, IPTC AISystemUsed, xAI, AIGC); the fuzzy visible sparkle and the open invisible watermark are excluded (the latter can be a by-product of our own SDXL removal pass). 
The c2pa vendor is classified from the issuer attribution / generator, NOT the resolved platform (a camera label like "Google Pixel" would mis-normalize to "Google"). All real single-origin fixtures (chatgpt/firefly/doubao/grok/mj) verified to produce zero clashes (false-positive guard in test_identify.py::TestRealSamplesHaveNoClash).
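The two integrity-clash rules above can be illustrated with a small sketch. All names here (VENDOR_TOKENS, vendor_of, integrity_clashes) are hypothetical stand-ins for _AI_VENDOR_TOKENS, _vendor_of, and _integrity_clashes, and the token table is an assumed subset, not the real one.

```python
from __future__ import annotations

# Hypothetical vendor table; the real one lives in _AI_VENDOR_TOKENS.
VENDOR_TOKENS = {
    "openai": "OpenAI",
    "google": "Google",
    "ideogram": "Ideogram",
    "adobe": "Adobe",
}


def vendor_of(label: str) -> str | None:
    """Normalize a free-form generator label to a vendor name, or None."""
    lowered = label.lower()
    for token, vendor in VENDOR_TOKENS.items():
        if token in lowered:
            return vendor
    return None


def integrity_clashes(ai_stamps: dict[str, str], camera_device: str | None) -> list[str]:
    """Apply both rules to hard generator stamps (signal name -> label)."""
    clashes: list[str] = []
    # Rule 1: two or more distinct AI vendors named by independent signals.
    vendors = sorted({v for v in map(vendor_of, ai_stamps.values()) if v})
    if len(vendors) >= 2:
        clashes.append("conflicting AI vendors: " + ", ".join(vendors))
    # Rule 2: a camera-capture device coexisting with any AI marker.
    if camera_device and ai_stamps:
        clashes.append(f"camera capture ({camera_device}) alongside AI markers")
    return clashes
```

Because labels are normalized before comparison, a "Google (Gemini)" issuer and a SynthID "Google" proxy collapse to one vendor and do not clash, while OpenAI plus Ideogram does.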
gemini_engine.py — visible Gemini-sparkle remover/detector (cv2/numpy, no GPU). detect_sparkle_confidence(path) is the file-level entry point used by identify.py.
doubao_engine.py — visible Doubao "豆包AI生成" remover/detector (cv2/numpy, no GPU). DoubaoEngine.locate anchors a bottom-right box by geometry (mark scales with image WIDTH, fractions in module constants; no bundled template), extract_mask pulls the light low-saturation glyphs with a polarity-aware white top-hat (brighter-than-blurred-local-bg, so white-paper documents are left untouched instead of smeared), detect thresholds glyph coverage (DETECT_MIN_COVERAGE 0.16 separates real marks ≥0.20 from corner noise, which stays ≤0.06 on large images but can spike to ~0.15 on tiny ones), remove_watermark inpaints (cv2 Telea/NS) and bails when coverage > MAX_INPAINT_COVERAGE 0.50 (dense-text background → would smear). Wired into visible --mark via cli._run_doubao_if_selected. Logo is near-white (~253), not the gray some third-party tools assume. Best on photo/illustration backgrounds; high-contrast edges leave faint residue (cv2-inpaint limit). Clean per-pixel reverse-alpha (Gemini-style) needs a black-background capture (alpha = capture/255), not more content images -- content-image distillation was tried and fails; see "Doubao clean-reverse-alpha distillation" below.
region_eraser.py — universal region eraser (erase CLI). erase(image, boxes=|mask=, backend=): boxes_to_mask → cv2.inpaint (cv2 backend, default, no deps) or big-LaMa via onnxruntime (lama backend, extra lama, Carve/LaMa-ONNX Apache-2.0 model downloaded on first use, never bundled). erase_lama crops a padded region around the mask, runs LaMa at its fixed 512² input, pastes only masked pixels back (untouched areas stay pixel-exact). Lazy _get_lama_session singleton; lama_available() guards the optional import. LaMa-ONNX costs ~3.5-4 GB peak RAM and ~5-6 s/call on CPU (FFC working set, not arena — enable_cpu_mem_arena=False does not help), so it does NOT fit a minimal droplet; the cv2 backend (tens of MB, ~30 ms) does. LaMa quality at low RAM = serverless/GPU, mirroring how raiw.cc offloads SDXL to fal.
invisible_watermark.py — detect_invisible_watermark(path) decodes the OPEN DWT-DCT watermarks (public decoder, no key) embedded by Stable Diffusion / SDXL / FLUX via the imwatermark library. Known fixed patterns (verified against upstream source) live in _BITS_48 (SDXL 48-bit, FLUX.2 48-bit) and _SD1_STRING ("StableDiffusionV1", SD 1.x/2.x). Optional dep (extra detect); returns None when absent. The detect extra pulls torch transitively (invisible-watermark declares torch a hard dep, and WatermarkDecoder eagerly imports rivaGan -> torch at import time), so detection needs torch present even though dwtDct runs CPU-only on cv2/numpy/pywavelets — no GPU and no separate gpu extra required. Unlike SynthID this is locally detectable, but the watermark is fragile (does not survive JPEG re-encode/resize — verified gone after JPEG q90), so it confirms origin only on pristine files. Add new known patterns here. The file carries a top-of-module pyright pragma because imwatermark/cv2 ship no type stubs.
trustmark_detector.py — detect_trustmark(path) decodes the OPEN, keyless Adobe TrustMark watermark (the soft binding behind Adobe Durable Content Credentials, alg com.adobe.trustmark.P) via the optional trustmark package (extra trustmark; pulls torch, downloads model weights on first use). Mirrors invisible_watermark.py (lazy singleton, top-of-module pyright pragma, returns None when absent). It detects provenance, not AI origin as such (TrustMark also marks human-authored content), so identify lists it as a watermark without setting is_ai_generated. Other soft-binding vendors (Digimarc/Imatag/Steg.AI/...) have no public decoder — they are only named via the C2PA_SOFT_BINDINGS scan, not decoded.
text_protector.py — text-region protection for the invisible SDXL img2img pass (issue #21: CJK/small text deforms at watermark-removal strengths). is_available() gates on cv2.dnn.TextDetectionModel_DB; TextProtector.detect_text_boxes(bgr) runs the PP-OCRv3 DB ONNX detector (~2.4 MB, Apache-2.0, opencv_zoo, returns rotated quad polygons) — downloaded+cached to ~/.cache/remove-ai-watermarks on first use via atomic temp-rename, never bundled, no torch (cv2.dnn only). Detection is script-agnostic (DB segments text regions, not characters), so Latin / Cyrillic / CJK / Hangul / Arabic / digits all detect identically — language was never the recall lever, resolution was. _detection_input_size(h, w) (pure, unit-tested) detects at the native long side capped at _DET_MAX_LONG_SIDE (1536), never upscaled: the old fixed 736 downscaled large canvases so small text fell below the detector and was missed (issue #14, e.g. ~16 px text on a 2048 image). scripts/text_detection_benchmark.py measures recall across scripts × sizes × canvas: the cap fix lifts overall hit-rate 0.91 → 1.00 (worst cell 2048/16 px: 0.06 → 1.00) at ~100 ms CPU. Very large canvases with tiny text may still need tiling (documented limit, not built). build_change_map(boxes, h, w, preserve=0.9, feather=15) paints a Differential-Diffusion change map. Polarity (verified empirically): white(1.0)=PRESERVE original pixels, black(0.0)=MAX change; map is black bg + preserve inside text polygons, Gaussian-feathered edges, clipped to [0,1]. preserve stays below a hard 1.0 freeze by default so text still scrubs lightly (SynthID survives cropping). Wired into watermark_remover._run_differential via the community pipeline_stable_diffusion_xl_differential_img2img (loaded with custom_revision="0.38.0" — HF resolves the PyPI version string, not the v0.38.0 git tag); gated to the SDXL DEFAULT_MODEL_ID only (_can_protect_text), falls back to plain img2img otherwise. 
Autonomous by default (protect_text=True in invisible_engine/watermark_remover, mirroring protect_faces): the detector runs per image and _run_differential falls back to plain img2img when no boxes are found, so text-free inputs pay only the cheap cv2 detection (no differential-pipeline load). CLI exposes a single off-switch --no-protect-text on invisible/all (passed as protect_text=not no_protect_text); the unavailable-model case logs at debug, not warning, since it is now the default path. The diff pipeline upcasts the VAE to fp32 internally, so do not add upcast_vae()/enable_attention_slicing (both produced NaN/black on fp16 MPS). build_change_map is unit-tested without any model download (tests/test_text_protector.py).
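The sizing rule behind _detection_input_size (native long side, capped at 1536, never upscaled) can be sketched as below. The function name, the (height, width) return order, and the rounding are assumptions for illustration; the real helper may also align sizes to the detector's stride.

```python
DET_MAX_LONG_SIDE = 1536  # cap named in the text


def detection_input_size(h: int, w: int, max_long_side: int = DET_MAX_LONG_SIDE) -> tuple[int, int]:
    """Return (height, width) to detect at: native size, capped, never upscaled."""
    if h <= 0 or w <= 0:
        raise ValueError("image dimensions must be positive")
    # scale is at most 1.0, so small images keep their native size.
    scale = min(1.0, max_long_side / max(h, w))
    return max(1, round(h * scale)), max(1, round(w * scale))


def text_height_at_detector(text_px: float, canvas_long_side: int, detector_long_side: int) -> float:
    """Height of a glyph after the canvas is resized for detection."""
    return text_px * detector_long_side / canvas_long_side
```

For the worst benchmark cell, 16 px text on a 2048 px canvas shrinks to 5.75 px under the old fixed 736 px input but stays at 12 px under the 1536 px cap, which is the arithmetic behind the recall gain.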
face_protector.py — YOLO detect + soft-blend pattern; mirror this for any "protect region during diffusion" features
image_io.py — Unicode-safe cv2 IO (issue #17). imread(path, flags=None) / imwrite(path, img) wrap np.fromfile+cv2.imdecode / cv2.imencode+tofile so non-ASCII paths work on Windows -- bare cv2.imread/cv2.imwrite use the platform ANSI code-page API there and fail (empty decode + can't open/read file) on Chinese/Cyrillic/accented filenames. imread keeps cv2.imread semantics (defaults to IMREAD_COLOR, returns None on missing/empty/undecodable). Every cv2 file read/write in the package routes through here; do not call cv2.imread/cv2.imwrite directly. macOS/Linux already accept UTF-8 paths, so it is behavior-neutral there (the bug only reproduces on Windows). cv2/numpy are imported lazily inside the functions, so the module is cheap to import in a bare env.
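The Unicode-safe IO pattern described for image_io.py can be sketched like this. It is a simplified illustration of the np.fromfile + cv2.imdecode / cv2.imencode + tofile approach, assuming opencv-python and numpy are installed at call time; the real wrappers may handle more edge cases.

```python
from __future__ import annotations

import os
from typing import Any


def imread(path: str | os.PathLike[str], flags: int | None = None) -> Any:
    """Decode an image from any path; None if missing, empty, or undecodable."""
    import cv2  # lazy import keeps the module cheap in a bare env
    import numpy as np

    if flags is None:
        flags = cv2.IMREAD_COLOR
    try:
        # Python opens the file itself, so the path never reaches cv2's
        # ANSI code-page file API on Windows.
        buf = np.fromfile(os.fspath(path), dtype=np.uint8)
    except OSError:
        return None
    if buf.size == 0:
        return None
    return cv2.imdecode(buf, flags)  # None when the bytes are not an image


def imwrite(path: str | os.PathLike[str], img: Any) -> bool:
    """Encode by file extension, then let numpy write the bytes."""
    import cv2

    ext = os.path.splitext(os.fspath(path))[1]
    if not ext:
        return False
    ok, encoded = cv2.imencode(ext, img)
    if not ok:
        return False
    encoded.tofile(os.fspath(path))
    return True
```

The design choice is that cv2 only ever sees in-memory buffers, so filename encoding is handled entirely by Python's own file layer.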

Doubao clean-reverse-alpha distillation (re-investigated 2026-05-29)

Conclusion: pure reverse-alpha distilled from content images does NOT work, and the blocker is the WRONG kind of data, not too little of it. The earlier framing ("need ~5-8 PRISTINE same-resolution originals") is obsolete -- data/spaces/originals/ holds plenty. Curate them with DoubaoEngine.detect + an NCC filter against a clean glyph template, keeping only marks at offset ≈ (0,0): that yields e.g. 15 pixel-aligned 2048² marks (sub-pixel drift, not the ±50 px the old lossy/mixed-res scrapes had), plus 1086x1448 / 1792x2400 clusters. With those, LaMa-clean O + weighted-LS (and per-pixel I-on-O regression) for α (+ logo colour) was tried end-to-end and still leaves a persistent ghost outline.

Diagnosed why, empirically (cached stacks, /tmp/doubao_distill): (1) the mark is a clean white overlay with no dark halo -- over glyph pixels ~54% are brighter than the clean bg, only ~4% darker -- so the white-logo model I=(1-α)O+α·255 is correct; (2) but content backgrounds are almost never dark under the mark (median darkest available bg over glyph pixels = 58/255; only ~13% of mark pixels are ever observed on a bg < 40), so on bright backgrounds the equation is ill-conditioned and α is unidentifiable; (3) LaMa's O is a plausible hallucination, not the true pre-mark background, which compounds the error, and per-pixel regression on ~15 obs overfits into colour noise.
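The ill-conditioning in point (2) follows directly from inverting the white-logo model I = (1-α)O + α·255 for α. The scalar sketch below, with hypothetical function names, shows how the same pixel error grows as the background brightens.

```python
def blend(alpha: float, background: float, logo: float = 255.0) -> float:
    """White-logo overlay model: I = (1 - alpha) * O + alpha * logo."""
    return (1.0 - alpha) * background + alpha * logo


def estimate_alpha(observed: float, background: float, logo: float = 255.0) -> float:
    """Invert the model for alpha; undefined when the background equals the logo."""
    denom = logo - background
    if denom <= 0:
        raise ValueError("background must be darker than the logo colour")
    return (observed - background) / denom


def alpha_error(noise: float, background: float, logo: float = 255.0) -> float:
    """Absolute alpha error caused by `noise` grey levels in the observed pixel."""
    return abs(noise) / (logo - background)


if __name__ == "__main__":
    # Same 2-level pixel error, increasingly bright backgrounds.
    for bg in (0.0, 58.0, 200.0, 240.0):
        print(f"bg={bg:5.0f}  alpha error={alpha_error(2.0, bg):.4f}")
```

An error in the background estimate O is amplified by a factor proportional to the same 1/(255 - O), which is why a hallucinated LaMa background hurts most exactly where the content is bright.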

Why Gemini's engine is clean (verified in GeminiWatermarkTool src/core/watermark_engine.cpp): its alpha map is the watermark stamped on a PURE-BLACK background, where watermarked = α·255 + (1-α)·0 = α·255, so alpha = capture/255 exactly -- no estimation. (gemini_bg_*.png is literally the sparkle in grey on black.) So the real Doubao unlock is the same controlled capture, not more content images. Black/white/gray seeds exist (data/doubao_capture/seeds/seed_*_1x1_2048x2048.png); a capture run (feed a black seed through doubao.com edit mode, download the original) was requested from the #13 reporter 2026-05-29. With ~2-3 black captures we get α = capture/255 for free, Gemini-quality.
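The black-capture route can be sketched per pixel as follows. Flat lists of grey levels stand in for image arrays, the function names are hypothetical, and the max_alpha clamp is an assumption added to avoid dividing by zero on fully opaque pixels; a real implementation would vectorize this with numpy.

```python
def alpha_from_black_capture(capture: list[float]) -> list[float]:
    """On a pure-black background, watermarked = alpha * 255, so alpha = capture / 255."""
    return [min(max(v / 255.0, 0.0), 1.0) for v in capture]


def reverse_alpha(
    watermarked: list[float],
    alpha: list[float],
    logo: float = 255.0,
    max_alpha: float = 0.99,
) -> list[float]:
    """Solve I = (1 - alpha) * O + alpha * logo for the original pixel O."""
    restored = []
    for pixel, a in zip(watermarked, alpha):
        a = min(a, max_alpha)  # assumed clamp: keeps (1 - a) away from zero
        original = (pixel - a * logo) / (1.0 - a)
        restored.append(min(max(original, 0.0), 255.0))
    return restored
```

With a known alpha map there is nothing left to estimate: the same map inverts the overlay on any background, bright or dark.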

Until black captures arrive, the shipped direction is precise canonical glyph mask + inpaint (cv2 default, lama optional), NOT reverse-alpha. The consensus glyph silhouette across the aligned marks distills cleanly (proto: a tight "豆包AI生成" strip, width ≈ 0.156 × image-width) and is good both as an exact inpaint mask and as an NCC localiser -- the latter also fixes the #23 detector false-positives (match the real glyph shape, not any bright low-saturation corner). Do not retry content-image reverse-alpha: it is data-limited by physics (no dark-background observations), not by effort.

Watermarking landscape (research 2026-05-24)

Who embeds what, and whether it is locally detectable (so we know which gaps are fillable). See identify.py for what we read.

Locally detectable (open decoder, no key/API): Stable Diffusion / SDXL / FLUX via imwatermark DWT-DCT (now covered by invisible_watermark.py). FLUX uses the same library (black-forest-labs/flux2 src/flux2/watermark.py, 48-bit 0b001010101111111010000111100111001111010100101110); SDXL is the diffusers WATERMARK_MESSAGE (0b101100111110110010010000011110111011000110011110). Caveat: fragile to re-encoding.
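Matching a decoded bit string against the two 48-bit constants quoted above might look like the sketch below. The names are hypothetical (the real table is _BITS_48), and MSB-first ordering plus zero-padding to 48 bits are assumptions; exact matching is used here, with no bit-error tolerance.

```python
from __future__ import annotations

# Constants copied from the text above.
SDXL_MESSAGE = 0b101100111110110010010000011110111011000110011110
FLUX2_MESSAGE = 0b001010101111111010000111100111001111010100101110

KNOWN_48BIT = {"SDXL": SDXL_MESSAGE, "FLUX.2": FLUX2_MESSAGE}


def to_bits(value: int, width: int = 48) -> list[int]:
    """Expand an integer into `width` bits, most significant bit first."""
    return [(value >> (width - 1 - i)) & 1 for i in range(width)]


def match_48bit(decoded_bits: list[int]) -> str | None:
    """Return the generator name for an exact 48-bit match, else None."""
    if len(decoded_bits) != 48:
        return None
    for name, message in KNOWN_48BIT.items():
        if decoded_bits == to_bits(message):
            return name
    return None
```

Padding matters for the FLUX.2 constant, whose leading zeros would be lost if the bits were taken from bin() without a fixed width.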
C2PA / IPTC (covered by the issuer/marker scan): OpenAI, Google, Adobe Firefly, Microsoft (Designer + Bing Image Creator — collected 2026-05-24; Bing now runs Microsoft's own MAI-Image model, signs C2PA as "Microsoft", NOT OpenAI/DALL-E), and Stability AI (collected from Brand Studio / DreamStudio successor; signs C2PA as "Stability AI Ltd", no SynthID, no imwatermark on its current Stable Image model — issuer added to C2PA_ISSUERS). Still unsampled: Canva (its downloads are re-encoded design exports that strip C2PA, so a Canva "positive" is inconclusive — skipped), Getty, Shutterstock. Midjourney embeds NO C2PA and no invisible watermark (our mj-* sample carried only the IPTC tag).
EXIF/XMP generator tag (caught by exif_generator): Ideogram writes EXIF Make="Ideogram AI" (collected 2026-05-24 — no C2PA, no SynthID, no imwatermark; the Make tag is the only signal).
xAI / Grok — its own EXIF signature scheme, NOT C2PA (DETECTED by metadata.xai_signature, built 2026-05-26). Grok JPEG downloads (Aurora model) carry no C2PA, no XMP, no SynthID, no IPTC — only EXIF Artist = a UUID and EXIF ImageDescription = Signature: <base64> (a crypto signature, unverifiable locally without xAI's public key). This empirically kills the earlier unverified "xAI signs C2PA as xAI" lead — xAI is not even a C2PA member. exif_generator misses it (neither field holds an AI_GENERATOR_TOKENS token), so a dedicated detector xai_signature(path) matches the pair (ImageDescription ~ ^Signature: [A-Za-z0-9+/=]{64,} AND UUID Artist); wired into has_ai_metadata, get_ai_metadata (key xai_signature), and identify (signal xai_signature, platform "xAI (Grok / Aurora)"). Format confirmed stable across n=3 genuine generations: exactly three EXIF tags (Artist, ExifOffset, ImageDescription), Signature: prefix constant, base64 payload 300-1004 chars. Two capture facts: (a) the Artist UUID equals the public image id in the asset URL (https://imagine-public.x.ai/imagine-public/images/<uuid>.jpg), so it is NOT a private per-user secret — only the Signature blob is; (b) the Grok web-UI image is a re-encoded WebP with no signature — the EXIF survives only in the original JPEG (download button or that public tokenless URL), which is why screenshots / re-encodes are metadata-stripped. A real fixture data/samples/grok-1.jpg plus synthetic JPEG fixtures (fake UUID + fake Signature: blob) cover the detector; never add a real Grok image carrying private content (the repo is public). Stripped on removal too: remove_ai_metadata now calls _scrub_ai_exif on the JPEG EXIF, which deletes the xAI Signature+UUID-Artist pair and any Software/Make/Artist/ImageDescription tag holding an AI_GENERATOR_TOKENS token (so Ideogram's Make="Ideogram AI" is scrubbed too), while keeping genuine camera/editor EXIF. 
The shared _is_xai_signature_pair helper (module-level compiled regexes) is the single source of truth for the pattern, used by both xai_signature and _scrub_ai_exif. (AVIF/HEIF/JXL still strip only C2PA boxes via isobmff, not EXIF — unchanged.)
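The pair check described above can be sketched with two compiled regexes. The Signature pattern is taken from the text; the UUID pattern and the function name are assumptions for illustration, and reading the two EXIF fields from a file is left out.

```python
from __future__ import annotations

import re

# Pattern quoted in the text: a Signature prefix plus a long base64 payload.
SIGNATURE_RE = re.compile(r"^Signature: [A-Za-z0-9+/=]{64,}")
# Assumed canonical 8-4-4-4-12 hex UUID form for the Artist field.
UUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$",
    re.IGNORECASE,
)


def is_xai_signature_pair(image_description: str | None, artist: str | None) -> bool:
    """True only when BOTH fields match; either one alone is not a signal."""
    if not image_description or not artist:
        return False
    return bool(
        SIGNATURE_RE.match(image_description.strip())
        and UUID_RE.match(artist.strip())
    )
```

Requiring both fields keeps ordinary photos with a plain Artist name, or a description that merely starts with "Signature:", from being flagged.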
China TC260 AIGC label (caught by AIGC_MARKERS / metadata.aigc_label, surfaced by identify as the aigc signal): China-served generators embed an XMP <TC260:AIGC>{"Label":"1","ContentProducer":...} block — China's mandatory AI-content labeling (TC260 namespace tc260.org.cn/ns/AIGC). Doubao (ByteDance) uses it (verified on the real #13 sample 2026-05-25; ContentProducer 001191110102MACQD9K64010000, no C2PA/SynthID/imwatermark — the XMP block is the only signal; GitHub attachment upload did NOT strip it). The same standard is mandatory for Jimeng/Kling/Qwen/Ernie etc., so the one marker covers the whole China-AIGC-labeled ecosystem. aigc_label reads two serializations through a shared _parse helper: the HTML-entity-encoded XMP <TC260:AIGC> block (container-agnostic raw-byte scan, any JSON object accepted) and a raw-JSON PNG AIGC tEXt chunk — Doubao also writes the label this way, with no namespaced marker at all (confirmed on the corpus 2026-05-28, ContentProducer="doubao"). The PNG-chunk path is gated on at least one TC260 field (_TC260_FIELDS) so a generic AIGC key cannot false-positive. In identify, aigc fires on the parsed label or the AIGC_MARKERS byte scan (the latter preserves the laundering-tell case where the JSON payload is truncated).
HuggingFace-hosted job (caught by metadata.huggingface_job, surfaced by identify as the hf_job signal, MEDIUM confidence): HuggingFace Jobs / Spaces stamp generated PNGs with an hf-job-id tEXt chunk holding the job UUID (3 on the corpus 2026-05-28, no other signal). It marks the hosting job, not a model — most commonly diffusion output — so it lifts an Unknown verdict to a tentative AI via hf_only (parallel to the visible sparkle) but never overrides a hard metadata signal; _HF_JOB_CAVEAT states the limit (job, not model; not proof of AI pixels). Stripped on removal (the PNG save whitelist keeps only STANDARD_METADATA_KEYS, so hf-job-id and the AIGC chunk are both dropped). The exact writer is not authoritatively documented (HF Jobs are generic GPU jobs), hence medium not high.
No detectable signal on download (correctly reported unknown): Recraft (PNG export is a re-encoded design export — strips everything), Krea hosting FLUX 2 (no imwatermark despite FLUX — the host omits the encoder, same as Stability's hosted SDXL), and Midjourney (embeds nothing). Lesson: the imwatermark detector only fires on pristine output from a pipeline that runs the encoder (diffusers default, official BFL), not from re-hosts (Krea/Stability) or re-encoded exports (Recraft/Canva).
Invisible but NOT locally detectable (proprietary, API/oracle only — same wall as SynthID): Amazon Titan Image Generator + Nova Canvas (Bedrock DetectGeneratedContent API), Kakao (new SynthID image adopter, May 2026), NVIDIA Cosmos (SynthID video). No local detector possible; treat like SynthID.
C2PA 2.4 "Durable Content Credentials" (April 2026; verified against the spec) raise the bar for metadata stripping. 2.4 defines soft bindings (an invisible watermark or a content fingerprint) plus a server-side manifest repository and a new c2pa.repository-receipt assertion. Per the spec: "if a C2PA manifest is removed from an asset, but a copy of that manifest remains in a provenance store elsewhere, the manifest and asset may be matched using available soft bindings." So our local metadata --remove deletes the embedded manifest, but a fingerprint/watermark soft binding can still re-link the image to its manifest in a repository server-side. Stripping the file is becoming necessary-but-not-sufficient against durable provenance. (Our parsers target the stable embedded-manifest format documented in C2PA 2.1 §11; that format is unchanged in 2.4 -- the new pieces are repository/soft-binding infra, not the on-file box layout, so no parser change is implied.) Spec: https://spec.c2pa.org/specifications/specifications/2.4/specs/C2PA_Specification.html We now READ the soft-binding alg (C2PA_SOFT_BINDINGS / soft_binding_vendors_in) to name the forensic-watermark vendor, and locally DECODE the one open scheme, Adobe TrustMark (trustmark_detector); the rest (Digimarc/Imatag/Steg.AI/...) stay name-only (proprietary decoders).
Built 2026-05-26 (this batch): soft-binding alg vendor detection; IPTC Photo Metadata 2025.1 AI-disclosure fields (AISystemUsed etc.); video C2PA metadata detect + strip for MP4/MOV/M4V (free — isobmff.py is format-agnostic, MP4 is ISOBMFF); Adobe TrustMark open decoder. NOT done (out of cheap reach, per the feasibility review): visible video-logo removal (needs a video frame pipeline) and audio (SynthID/ElevenLabs/Resemble/Suno all oracle-only or unmarked). Box detection window — now handled (v0.6.8): detection no longer relies on a fixed first-MB read. metadata.scan_head(path, size) reads the first size bytes and, for ISOBMFF, appends the payloads of late provenance boxes found by isobmff.scan_c2pa_region (a file-seeking top-level box walker that skips past mdat by size without reading it), so a C2PA/AIGC/IPTC manifest placed AFTER a large mdat in a streaming/non-faststart MP4 is now caught. Every C2PA/marker byte scan (has_ai_metadata, aigc_label, iptc_ai_system, synthid_source, exif_generator XMP, get_ai_metadata soft-binding, and identify) goes through scan_head; it is behavior-neutral for non-ISOBMFF inputs (exactly f.read(size)). Meta-box XMP removal — now handled (v0.6.9): an AI-label XMP packet stored as a meta-box mime item (HEIF/AVIF; out of reach of the top-level box stripper) is blanked in place by isobmff.blank_ai_xmp_packets — it locates the packet by its <?xpacket begin … end?> delimiters and, if it carries an AI marker (_AI_LABEL_MARKERS), overwrites it with spaces of the SAME length, so box sizes / iloc offsets stay valid and the coded image is untouched (selective: plain non-AI XMP is left alone, mirroring the top-level uuid logic). Wired into remove_ai_metadata's ISOBMFF branch after strip_c2pa_boxes. The remaining gap is an Exif meta-box item (rare; the AI labels are XMP) — still needs iinf/iloc surgery or exiftool.
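A minimal sketch of the top-level box walk described above, assuming standard ISOBMFF box headers (32-bit size + 4-byte type, size 1 meaning a 64-bit largesize follows, size 0 meaning "to end of file"). The names walk_top_level_boxes and late_provenance_payloads are hypothetical and are not the signatures of isobmff.scan_c2pa_region or metadata.scan_head; the point is only that a large mdat is skipped by seeking, never read.

```python
import io
import struct
from typing import BinaryIO, Iterator, List, Tuple


def walk_top_level_boxes(f: BinaryIO, file_size: int) -> Iterator[Tuple[bytes, int, int]]:
    """Yield (box_type, payload_offset, payload_size) for each top-level box."""
    pos = 0
    while pos + 8 <= file_size:
        f.seek(pos)
        header = f.read(8)
        if len(header) < 8:
            break
        size, box_type = struct.unpack(">I4s", header)
        header_len = 8
        if size == 1:  # 64-bit largesize follows the type field
            large = f.read(8)
            if len(large) < 8:
                break
            size = struct.unpack(">Q", large)[0]
            header_len = 16
        elif size == 0:  # box runs to the end of the file
            size = file_size - pos
        if size < header_len or pos + size > file_size:
            break  # malformed box: stop rather than guess
        yield box_type, pos + header_len, size - header_len
        pos += size  # jump over the payload (e.g. mdat) without reading it


def late_provenance_payloads(f: BinaryIO, file_size: int,
                             wanted: Tuple[bytes, ...] = (b"uuid", b"jumb")) -> List[bytes]:
    """Read only the payloads of the wanted box types (hypothetical helper)."""
    payloads = []
    for box_type, offset, length in walk_top_level_boxes(f, file_size):
        if box_type in wanted:
            f.seek(offset)
            payloads.append(f.read(length))
    return payloads


def _box(box_type: bytes, payload: bytes) -> bytes:
    return struct.pack(">I4s", 8 + len(payload), box_type) + payload


# Synthetic non-faststart layout: the provenance box sits AFTER mdat.
demo = _box(b"ftyp", b"isom") + _box(b"mdat", b"\x00" * 1000) + _box(b"uuid", b"c2pa-manifest")
demo_types = [t for t, _, _ in walk_top_level_boxes(io.BytesIO(demo), len(demo))]
demo_payloads = late_provenance_payloads(io.BytesIO(demo), len(demo))
```

In this sketch the late payloads would be appended to the first `size` bytes of the file, which is how a manifest placed after a large mdat becomes visible to a plain byte scan.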
Regulatory driver (context, not a code change): AI-content labeling mandates are expanding, which pushes more generators toward exactly the C2PA + watermark signals we read. The full per-jurisdiction table lives in README "## Legal" -- keep it there, not duplicated here. Newly added + primary-source verified 2026-05-26: EU AI Act Article 50 machine-readable marking applicable 2026-08-02 (verified against the article text); South Korea AI Framework Act Art. 31(3) in force since 2026-01-22 (verified via Kim & Chang + FPF/Korea Times; Enforcement Decree accepts an invisible-watermark label); California AB 853 (amends the CA AI Transparency Act) latent-disclosure duty operative 2026-08-02, requiring a disclosure "permanent or extraordinarily difficult to remove" (verified against the leginfo bill text -- this is the exact disclosure our tool strips); India IT Amendment Rules 2026 in force 2026-02-20 (verified via Chambers), which require a prominent label + a permanent provenance ID on all synthetic media AND expressly prohibit removing/suppressing the label or metadata -- the first major all-content removal ban outside China. Removal liability (README "## Legal" disclaimer): the tool is lawful general-purpose software; liability sits with the remover and is intent-gated -- downstream acts (fraud/deception/IP), plus US DMCA 17 USC 1202 (removing copyright-management info to conceal infringement), plus the removal-as-such bans in China + India. When extending the README table, verify each date/article against the statute/bill text before committing, not against search summaries.

Known limitations

invisible pipeline processes at native resolution by default (max_resolution=0), matching the hosted raiw.cc backend (fal fast-sdxl, no pre-downscale). The old forced downscale-to-1024 -> upscale-back round-trip was the main quality loss (issue #10) and is gone; at strength ~0.05 SDXL img2img does not need the ~1024 downscale. --max-resolution N re-introduces an opt-in long-side cap purely to bound GPU/MPS memory on very large inputs (it reintroduces the lossy round-trip). For huge images that OOM at native, tile-based diffusion is still the proper long-term fix. Concrete MPS data point (verified 2026-05-25 on a 1254x1254 gpt-image SDXL run, fp32, 20 GB MPS ceiling): native res OOMs at the UNet step (peak ~17 GiB), not only the VAE decode, and the auto-fallback in img2img_runner reloads on CPU and finishes (slow, ~13 min) -- the output is still weight-identical and defeats SynthID, so "looks hung/crashed" on Mac is usually this CPU fallback, not a pipeline error. Adding enable_vae_tiling() alone does NOT prevent it (the peak is the UNet, not the VAE). The fast Mac workarounds are fp16 on MPS (roughly halves memory) or --max-resolution to cap the long side; neither is wired as the default. The native-vs-downscale decision lives in the pure helper invisible_engine._target_size(w, h, max_resolution) (returns None for native, a clamped target tuple otherwise) so it is unit-tested (tests/test_invisible_engine.py::TestTargetSize, the #10/#15 regression guard) without loading the model -- keep that logic in the helper, don't re-inline it.
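The native-vs-downscale decision above can be sketched as a pure function. This is an illustration under stated assumptions, not the body of invisible_engine._target_size: the name target_size_sketch is hypothetical, and the rounding and the "never upscale" rule are assumptions of this sketch.

```python
from typing import Optional, Tuple


def target_size_sketch(w: int, h: int, max_resolution: int) -> Optional[Tuple[int, int]]:
    """Return None to process at native size, else a long-side-capped (w, h)."""
    long_side = max(w, h)
    # max_resolution=0 means "no cap"; an image already under the cap is
    # also left at native size (assumption: the helper never upscales).
    if max_resolution <= 0 or long_side <= max_resolution:
        return None
    scale = max_resolution / long_side
    # Keep the aspect ratio; never let a side collapse to 0.
    return max(1, round(w * scale)), max(1, round(h * scale))
```

Because the function touches no model or image data, a regression test can pin the native default (`max_resolution=0`) and the opt-in cap without loading SDXL.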
Pyright first run is slow (2-3 min) due to ML deps (torch/diffusers/transformers stubs); full-project uv run pyright can stall for many minutes — scope it to changed files.
ultralytics monkey-patches PIL.Image.open and tries to autoload pi_heif. When pi_heif is missing, opening files raises ModuleNotFoundError, not UnidentifiedImageError. Code that opens user-supplied or unknown-format files should except Exception, not just OSError/UnidentifiedImageError.
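A minimal sketch of the defensive open described above. The helper name try_open_image and the injectable `opener` parameter are hypothetical (the parameter exists only so the sketch can be exercised without Pillow installed); by default it falls back to PIL.Image.open.

```python
from typing import Any, Callable, Optional


def try_open_image(path: str, opener: Optional[Callable[[str], Any]] = None) -> Optional[Any]:
    """Open an image of unknown format; return None instead of raising."""
    if opener is None:
        from PIL import Image  # imported lazily; requires Pillow
        opener = Image.open
    try:
        return opener(path)
    except Exception:
        # Deliberately broad: with ultralytics' patched Image.open a missing
        # pi_heif surfaces as ModuleNotFoundError, which is neither OSError
        # nor UnidentifiedImageError.
        return None


def _patched_open_without_pi_heif(path: str) -> Any:
    # Stand-in for the failure mode described above.
    raise ModuleNotFoundError("No module named 'pi_heif'")


demo_result = try_open_image("unknown.bin", opener=_patched_open_without_pi_heif)
```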
rich console.print parses [word] as a style tag and silently drops unknown ones. A literal bracketed token in a print string disappears: pip install 'remove-ai-watermarks[gpu]' rendered as ...remove-ai-watermarks' (the [gpu] extra eaten), which sent users a broken install command (surfaced via #19). Escape the literal bracket as \[gpu] (in a normal Python string that is "\\[gpu]") in any rich string carrying user-facing brackets. Regression-guarded by tests/test_cli.py::TestGpuHintMarkup.
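A small stdlib-only sketch of the escaping rule, for illustration. The helper name escape_rich_brackets is hypothetical and its pattern is a simplification of what rich treats as a tag; rich also ships rich.markup.escape, which is the safer choice when the dependency is available.

```python
import re

# A "[" directly followed by a character that can start a rich tag.
_TAG_START = re.compile(r"\[(?=[a-zA-Z#/@])")


def escape_rich_brackets(text: str) -> str:
    """Backslash-escape tag-like '[' so rich prints it literally."""
    return _TAG_START.sub(lambda m: "\\[", text)


hint = escape_rich_brackets("pip install 'remove-ai-watermarks[gpu]'")
```

Escaping at the point where the string is built keeps the literal `[gpu]` intact no matter which console call eventually prints it.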
Metadata detection for AVIF/HEIF/JPEG-XL relies on a binary scan for C2PA_UUID + IPTC_AI_MARKERS, plus EXIF Software / XMP CreatorTool generator tags via metadata.exif_generator (validated with synthesized AVIF/JPEG fixtures + an XMP raw-scan fixture). C2PA removal in those containers is implemented via noai/isobmff.py (top-level uuid / jumb box stripper, no re-encoding), which now also drops a top-level XMP uuid box that carries an AI label (matched by AI-marker content, not by the XMP UUID, so byte-order-robust) and covers MP4/MOV/M4V/M4A by content sniff. Non-ISOBMFF audio/video removal is via ffmpeg (_FFMPEG_STRIP_EXTS -> _strip_with_ffmpeg): WebM/Matroska (EBML), MP3 (ID3), WAV/FLAC/OGG (RIFF/Vorbis) are stripped losslessly with ffmpeg -map_metadata -1 -map_chapters -1 -c copy (codec data untouched). Requires ffmpeg on PATH; raises RuntimeError if absent or if ffmpeg can't parse the file. Verified end-to-end (a real ffmpeg-made WAV/MP3 with a title=Suno AI tag -> tag gone, audio bytes preserved). Meta-box XMP now handled (isobmff.blank_ai_xmp_packets, v0.6.9): an AI-label XMP packet stored as a meta-box mime item (AVIF/HEIF) is blanked in place (overwritten with spaces of the same length, so iloc offsets and the coded image stay valid). Still NOT built: an Exif item inside the meta box (rare -- AI labels are XMP) needs full iinf/iloc surgery (offset rewrite) with corruption risk -- exiftool (R/W/C for HEIC/AVIF EXIF+XMP, verified on exiftool.org 2026-05-27) would do it but is a non-installed binary dep, so it stays a documented gap. Audio watermark DETECTION (Resemble PerTh) was evaluated and NOT built (2026-05-26): resemble-perth's PerthImplicitWatermarker.get_watermark() returns a raw bit-array with no presence/confidence flag (clean audio decodes to arbitrary bits too), so reliably distinguishing watermarked-from-clean needs either Resemble's fixed payload or a confidence API -- neither is public, and there's no real Resemble sample to calibrate against. 
Same wall-class as the SynthID pixel detector: the decode exists, reliable presence-detection does not. (perth's top-level PerthImplicitWatermarker is also gated to None unless librosa is importable.)
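The same-length blanking of a meta-box XMP packet described above can be sketched as follows. This is an assumption-laden illustration, not isobmff.blank_ai_xmp_packets itself: the function name, the regex, and the single-entry marker tuple (standing in for _AI_LABEL_MARKERS) are hypothetical.

```python
import re

# Illustrative stand-in for the real marker list.
AI_MARKERS_SKETCH = (b"trainedAlgorithmicMedia",)

# One XMP packet, from its "begin" processing instruction to the end of "end".
_XPACKET = re.compile(rb"<\?xpacket begin=.*?<\?xpacket end=.*?\?>", re.DOTALL)


def blank_ai_xmp_sketch(data: bytes, markers=AI_MARKERS_SKETCH) -> bytes:
    """Overwrite AI-labelled XMP packets with spaces of the same length."""
    buf = bytearray(data)
    for m in _XPACKET.finditer(data):
        packet = m.group(0)
        if any(marker in packet for marker in markers):
            # Same length in, same length out: every byte offset after the
            # packet (box sizes, iloc extents) stays valid.
            buf[m.start():m.end()] = b" " * len(packet)
    return bytes(buf)


ai_packet = b'<?xpacket begin="" id="x"?><x>trainedAlgorithmicMedia</x><?xpacket end="w"?>'
plain_packet = b'<?xpacket begin="" id="x"?><x>holiday photo</x><?xpacket end="w"?>'
demo_in = b"HEAD" + ai_packet + b"MID" + plain_packet + b"TAIL"
demo_out = blank_ai_xmp_sketch(demo_in)
```

The plain packet survives untouched, which mirrors the selective behavior noted above for non-AI XMP.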
SynthID detection is metadata-only. There is no reliable local detector of the SynthID pixel watermark — Google's decoder is proprietary, no public spec or API (only a waitlisted portal). Authoritative confirmation: Google DeepMind's own paper "SynthID-Image: Image watermarking at internet scale" (Gowal et al., arXiv:2510.09263) states the verification service is restricted to "trusted testers" and does not release detector weights or a reproducible algorithm — so a local pixel detector is infeasible by design, not just unbuilt. https://arxiv.org/abs/2510.09263 We detect SynthID by its C2PA companion (synthid_source / SYNTHID_C2PA_ISSUERS), which is reliable while the manifest is intact but says nothing once C2PA is stripped. Surface-dependent blind spot (verified 2026-05-24): the same Google model emits different metadata per surface -- the Gemini app wraps outputs in Google C2PA, but the API/playground (AI Studio, Nano Banana / gemini-2.5-flash-image) emits the SynthID pixel watermark (confirmed via the Gemini-app oracle) + the visible sparkle but no C2PA/IPTC at all, so synthid_source returns None despite SynthID being present. Only the pixel oracle or the visible-sparkle detector catches those. (Meta AI is another surface mismatch: it writes the IPTC digitalSourceType=trainedAlgorithmicMedia marker, not C2PA and not SynthID.) Google→SynthID is long-standing; OpenAI→SynthID is confirmed by OpenAI's Help Center (ChatGPT/Codex/API "include both C2PA metadata and SynthID watermarks", updated 2026-05-21) but time-gated (pre-rollout OpenAI images carry C2PA without SynthID), so the OpenAI verdict is hedged "likely". Oracles: Gemini app "Verify with SynthID" (Google), openai.com/verify (OpenAI). 
The spectral phase-coherence approach from github.com/aloshdenny/reverse-SynthID was evaluated (May 2026) and does not work for real-content detection: on its own shipped codebook + validation set, watermarked and cleaned images were indistinguishable (conf within noise, cleaned often higher); it only fires on pure-black 1024x1024 reference images at exact resolution (the controlled case it was calibrated on). The README's "90% / conf=0.91" reproduces only in that lab condition. Do not build a production detector on it; if revisited, it is experimental/diagnostic only and needs a per-resolution, per-model reference corpus. A from-scratch gpt-image pilot (2026-05-24) confirmed this independently: 5 independent solid-black gpt-image outputs share a near-identical fixed signature (pairwise residual correlation 0.92, avg-template retains 97% energy), so the watermark/carrier IS strongly present and consistent on flat content — but the carrier frequencies extracted from it do NOT discriminate real content (carrier-to-random ratio: cleaned 1.86 > watermarked 1.53; a non-gpt-image image scored highest at 3.67). The signature drowns in content texture. Net: a perfectly consistent solid-color signature still yields no real-content pixel detector with magnitude/carrier methods. A corpus discrimination test (2026-05-24, scripts/synthid_pixel_probe.py, raw zero-mean residual NCC) independently re-confirms this: at matched resolution, SynthID positives do NOT cluster apart from negatives (within-Gemini 0.07; at 1024 px pos-vs-neg >= pos-vs-pos). The only high correlations were near-duplicate content (5 ChatGPT renders of one prompt at ~0.92, while a distinct ChatGPT image scored ~0 against them) — content, not a carrier. The probe is solid-fills-only and EXPERIMENTAL/DIAGNOSTIC; do not use it on real content. 
Correction (deeper re-examination 2026-05-25): the carrier IS real on solid fills — the earlier "no carrier" was a method artifact of using spatial / FFT-magnitude NCC, which can't see it. The carrier is a fixed phase at specific low frequencies, so the right metric is per-bin phase coherence. On 8 white gemini-2.5-flash-image fills (generated via the reverse-SynthID trick: identity-edit prompt "Recreate this image exactly as it is" on a synthetic pure-white PNG — this bypasses the recitation block that rejects text prompts for pure colors), phase coherence at the white carriers (0,±7..±12,±20..±23) = 0.86 vs 0.31 random; single-image leave-one-out phase-match +0.83 vs real photos -0.24. (Black 2.5-flash fills clip to std≈0 — SynthID can't push values below 0, so no carrier in black; the repo's dark carriers come from nano-banana-pro.) But it does not generalize: (a) carriers are model-version + resolution + color specific — the repo's v4 codebook (built for gemini-3.1-flash-image-preview + nano-banana-pro-preview) scores ~0.527 on my 2.5-flash white fills, indistinguishable from negatives (~0.50), i.e. carriers shift across model versions and need a per-model codebook; (b) on real content (30 2.5-flash images) the carrier collapses — set phase coherence at carriers 0.37 ≈ random 0.42, and the repo's v4 detector gives content 0.518 ≈ negatives 0.504 (no separation; a faint +0.24 single-image lean is likely a brightness confound). Net: the spectral/phase approach is a real controlled-fill characterizer, NOT an arbitrary-real-content detector, and is brittle to model version. Metadata proxy + visible sparkle + online oracles remain the ceiling for real content.
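For reference, one common way to compute a per-bin phase coherence is the mean resultant length of the unit phasors across images. The sketch below assumes that definition; the exact formula used by the probe is not reproduced here, and the function name is hypothetical. Input is the complex FFT value of one frequency bin taken from each image in the set.

```python
import cmath
from typing import Sequence


def phase_coherence(bin_values: Sequence[complex]) -> float:
    """Mean resultant length of the phases at one FFT bin across images.

    1.0 means every image has the same phase at this bin (a fixed carrier);
    values near 0 mean the phases are scattered.
    """
    # Normalize to unit magnitude so only the phase contributes.
    units = [z / abs(z) for z in bin_values if abs(z) > 0]
    if not units:
        return 0.0
    return abs(sum(units)) / len(units)


# Same phase, very different magnitudes: still fully coherent.
aligned = [cmath.rect(r, 0.7) for r in (1.0, 5.0, 0.2)]
# Opposite phases cancel out.
opposed = [cmath.rect(1.0, 0.0), cmath.rect(1.0, cmath.pi)]
```

Discarding magnitude is what separates this from spatial or FFT-magnitude NCC: a weak carrier with a fixed phase scores high even when image content dominates the energy at that bin.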
External AI-vs-real classifier models are out of scope (decided 2026-05-24). Generic HuggingFace detectors (Organika/sdxl-detector Swin Transformer, umm-maybe/AI-image-detector, and fine-tunes) exist and report ~0.98 on their own SDXL-vs-real validation sets, but they are per-generator and the model cards themselves note degraded accuracy off-distribution; they are untested on gpt-image / Gemini Nano Banana (the metadata-stripped surfaces we care about), and our own light SDXL pass would likely defeat them the same way it defeats SynthID. Detection here stays local + signal-based (metadata + visible sparkle); do not add a bundled classifier dependency.
SynthID v2 vs default pipeline: the SDXL-based default profile (since May 2026) defeats SynthID v2. Verified end-to-end (May 2026): local SDXL run on a Gemini 3 Pro output, checked via the Gemini app's "Verify with SynthID" feature, returned "no SynthID watermark detected". Also confirmed against OpenAI's SynthID (2026-05-23): a fresh ChatGPT/gpt-image output read "SynthID detected" on openai.com/verify before the local SDXL run and "SynthID not detected" after (corpus regression chain: pos 4ef377bd -> cleaned 47188e88). The same configuration is used in raiw-app production (fal-ai/fast-sdxl/image-to-image, strength 0.05, steps 50, guidance 7.5, no pre-downscale). fal's own llms.txt for fast-sdxl names the base checkpoint as stabilityai/stable-diffusion-xl-base-1.0 (verified 2026-05-25) -- the exact checkpoint the local CLI defaults to (DEFAULT_MODEL_ID). So the local invisible default is weight-for-weight identical to prod; "fast-sdxl" is fal's optimized serving, not different weights. After the native-resolution fix the local pipeline matches prod on weights + strength + steps + guidance + resolution. SD-1.5 dreamshaper at 768 px was previously the default and does NOT defeat v2 — verified empirically against the same feature (strength 0.04, 0.10, and elastic warp α∈{5,8} all flagged positive). That SD-1.5 path was removed; only default (SDXL) and ctrlregen profiles remain. Scope of the claim: defeating the SynthID verifier is NOT the same as forensic invisibility. "Removing the Watermark Is Not Enough: Forensic Stealth in Generative-AI Watermark Removal" (arXiv:2605.09203, 2026-05) shows that six removal attacks across four families (UnMarker, CtrlRegen+, WatermarkAttacker, etc.) all leave forensic traces: independent detectors flag removal-processed images vs genuinely-clean ones at >98% TPR at 1% FPR. So our SDXL pass makes the oracle read "SynthID not detected," but the output can still be classifiable as "an image that went through a removal pipeline." 
Do not over-claim "indistinguishable from a real photo." https://arxiv.org/abs/2605.09203
