From 9feea4ac1e7129e337ccd67cc3a1394b0ea401cc Mon Sep 17 00:00:00 2001 From: Victor Kuznetsov Date: Thu, 11 Jun 2026 15:50:03 -0700 Subject: [PATCH] Slim CLAUDE.md: move module internals, limitations, landscape research to docs Co-Authored-By: Claude Fable 5 --- .claude/settings.json | 14 -- CLAUDE.md | 82 ++++------- README.md | 2 +- docs/known-limitations.md | 124 ++++++++++++++++ docs/module-internals.md | 210 +++++++++++++++++++++++++++ docs/release-and-distribution.md | 25 ++++ docs/research-doubao-distillation.md | 21 +++ docs/watermarking-landscape.md | 40 +++++ 8 files changed, 452 insertions(+), 66 deletions(-) delete mode 100644 .claude/settings.json create mode 100644 docs/known-limitations.md create mode 100644 docs/module-internals.md create mode 100644 docs/release-and-distribution.md create mode 100644 docs/research-doubao-distillation.md create mode 100644 docs/watermarking-landscape.md diff --git a/.claude/settings.json b/.claude/settings.json deleted file mode 100644 index 1fb8139..0000000 --- a/.claude/settings.json +++ /dev/null @@ -1,14 +0,0 @@ -{ - "permissions": { - "allow": [ - "WebSearch", - "WebFetch" - ] - }, - "enabledPlugins": { - "pyright-lsp@claude-plugins-official": true, - "context7@claude-plugins-official": true, - "code-simplifier@claude-plugins-official": true, - "claude-md-management@claude-plugins-official": true - } -} diff --git a/CLAUDE.md b/CLAUDE.md index 2d23f46..b392bac 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -15,7 +15,8 @@ You are a **principal Python engineer** maintaining a CLI tool and library for r ## Test and lint -- **CI** (`.github/workflows/test.yml`): runs on push to `main` + every PR. A `lint` job (ubuntu: `ruff check` + `ruff format --check`) plus a `test` matrix (ubuntu/macos/windows x py3.10/3.12) that does `uv sync --frozen --extra dev` then `pytest`. The matrix installs only core + dev (no `gpu` extra), so the GPU/model-running tests skip there and it exercises the metadata/identify/visible/cv2-eraser surface on all three OSes. Keep `uv.lock` valid (don't break `--frozen`) when editing `pyproject.toml`. `publish.yml` stays release-only and now verifies the release tag matches the `pyproject.toml` version (fails the build on a mismatch) before building, then uploads via `uv publish` (PyPI trusted publishing over OIDC, no token — replaced the `pypa/gh-action-pypi-publish` action so the upload no longer depends on that action's bundled twine accepting the Metadata-Version; the `id-token: write` permission + `pypi` environment + workflow filename are unchanged, so PyPI's trusted-publisher entry still matches). **Release flow:** bump the version in `pyproject.toml` + `src/remove_ai_watermarks/__init__.py` + `uv.lock` (the project's own `[[package]]` entry — find it with `grep -n 'name = "remove-ai-watermarks"' uv.lock`, the `version =` line right below it, ~line 2246), commit `chore(release): vX.Y.Z`, `git tag -a vX.Y.Z -m vX.Y.Z` (annotated — `git tag` without `-m` errors here), push `main` + the tag, then `gh release create vX.Y.Z` — **PyPI publish triggers on the GitHub Release `published` event, NOT on the tag push**, so the tag alone does not publish. **After the PyPI sdist is live, bump the Homebrew formula** in the separate public tap repo `wiltodelta/homebrew-tap` (`Formula/remove-ai-watermarks.rb`): update `url` to the new sdist URL (from `https://pypi.org/pypi/remove-ai-watermarks//json`, the `sdist` entry's `url`) and `sha256` to its hash, commit + push there — otherwise `brew install wiltodelta/tap/remove-ai-watermarks` keeps installing the old version. The formula is a core-only venv that pip-installs the sdist (no vendored resources, so pip pulls the binary numpy/opencv wheels per platform at install time); only those two lines change per release. **This is now AUTOMATED:** the main repo's `.github/workflows/distribute.yml` fires on the GitHub Release `published` event, waits for the sdist to appear on PyPI (poll loop, the Release event races publish.yml's upload), rewrites the formula's `url`+`sha256`, and pushes to the tap using the `HOMEBREW_TAP_TOKEN` repo secret (a fine-grained PAT with Contents:write on `homebrew-tap`). The SAME workflow also factory-rebuilds the HF Space (`HfApi.restart_space(..., factory_reboot=True)`, `HF_TOKEN` secret) so the Space reinstalls the new sdist (it pins `remove-ai-watermarks>=...` and only re-resolves on a rebuild). The manual Homebrew steps above are the fallback / what the workflow automates — a normal release needs no Homebrew or HF action. **Other distribution channels:** (1) **conda-forge** — recipe source of truth committed at `packaging/conda/recipe.yaml` (v1 `recipe.yaml`, noarch core-only: pillow/piexif/numpy/py-opencv/click/python-dotenv); the initial submission is `conda-forge/staged-recipes` PR #33674 (went green only after **`pip_check: false`** in the python test — rattler-build's `pip check` defaults to ON and fails on the ancient conda-forge `piexif py_2` build's stale metadata with "piexif 1.1.3 is not supported on this platform", though the package installs/imports/works; keep it disabled). Once that merges and the `remove-ai-watermarks-feedstock` exists, the `regro-cf-autotick-bot` auto-opens a version-bump PR on the feedstock when each new PyPI sdist is detected — just review + merge it (hand-edit only if run-deps changed; keep `packaging/conda/recipe.yaml` in sync as the reference copy). (2) **ComfyUI Registry** — the node package is a SEPARATE repo `wiltodelta/ComfyUI-remove-ai-watermarks` with its OWN `pyproject.toml` `version` (independent of the library version). Publish a new node version by bumping that `version` in the node repo's `pyproject.toml` and pushing to `main` — the node repo's `.github/workflows/publish.yml` (`Comfy-Org/publish-node-action@main`, triggered on a push that touches `pyproject.toml`, secret `COMFY_REGISTRY_TOKEN`) **auto-publishes** it; `comfy node publish --token ` is the manual/local fallback. It is NOT auto-published on a library release (the node has its own version), so only bump it when the node code or its `remove-ai-watermarks>=` dependency floor changes. **Sdist must exclude `data/`** (`[tool.hatch.build.targets.sdist] exclude = ["/data"]`): hatchling's default sdist bundles all VCS-tracked files, so the committed `data/` test corpora (the multi-hundred-MB synthid_corpus images + the visible-mark captures) pushed the **0.8.0** sdist past PyPI's per-project file-size limit (400 "File too large") — the wheel uploaded but the sdist was rejected, so 0.8.0 shipped wheel-only and 0.8.1 carried the fix. The wheel only ships `src/` (via `[tool.hatch.build.targets.wheel] packages`), so it was never affected. **A failed PyPI upload of one artifact still leaves the other live and you cannot re-upload the same version** — fix the build and cut the next patch. **Build backend is unpinned `hatchling`** (`[build-system] requires`) since 2026-06-09. History: it was pinned `<1.31` because hatchling 1.30.0 made Metadata-Version 2.5 (PEP 794) the default and the twine bundled in `pypa/gh-action-pypi-publish@release/v1` rejected it (`"'2.5' is not a valid Metadata-Version"`), which **failed the v0.8.3 PyPI upload on 2026-06-01**; hatchling 1.30.1 reverted the default to 2.4. After the workflow moved to `uv publish` (whose uploader accepts 2.5) the pin was belt-and-suspenders only, and once v0.9.0 + v0.10.0 both published wheel+sdist through that path (verified on PyPI) it was dropped. If a future hatchling flips the default to 2.5 again and some consumer chokes, re-pin with a dated comment. +- **CI** (`.github/workflows/test.yml`): runs on push to `main` + every PR. A `lint` job (ubuntu: `ruff check` + `ruff format --check`) plus a `test` matrix (ubuntu/macos/windows x py3.10/3.12) that does `uv sync --frozen --extra dev` then `pytest`. The matrix installs only core + dev (no `gpu` extra), so the GPU/model-running tests skip there and it exercises the metadata/identify/visible/cv2-eraser surface on all three OSes. Keep `uv.lock` valid (don't break `--frozen`) when editing `pyproject.toml`. +- **Release flow + distribution channels** (PyPI publish via `publish.yml`/`uv publish`, the automated Homebrew-tap + HF-Space bumps in `distribute.yml`, conda-forge, ComfyUI Registry, the sdist `data/` exclusion, hatchling pin history): see `docs/release-and-distribution.md` before cutting a release. - `bash maintain.sh` — uv-outdated, uv-secure, ruff check/fix, ruff format, pyright (scoped `src/`, see the OOM note below), pytest -n auto. The helper tools live in the `dev` extra (`pytest-xdist`, plus `uv-outdated`/`uv-secure` marker-gated to py3.12+ so the py3.10 resolution stays solvable) — a bare env without `--extra dev` does not have them. - **Strict pyright is clean across `src/` (0 errors).** The cv2/torch/diffusers boundary files (`gemini_engine`, `region_eraser`, `doubao_engine`, `humanizer`, `invisible_engine`, `noai/watermark_remover`) carry a documented per-file `# pyright:` relax pragma that turns off only the unknown-type / untyped-third-party rules — those libs ship no usable types, so strict typing there fights the ecosystem. Pure-logic files stay fully strict; `typings/piexif/__init__.pyi` is a local stub so `metadata.py`/`extractor.py` resolve piexif. Public ndarray-returning signatures on the relaxed engines are still annotated `NDArray[Any]` so strict consumers (`cli.py`) stay clean. When touching a relaxed file, prefer fixing real issues over widening the pragma; keep the pragma scoped to genuinely-untyped boundaries. (`uv-secure` is clean since idna was bumped 3.11 -> 3.16, fixing GHSA-65pc-fj4g-8rjx, and aiohttp 3.13.5 -> 3.14.0 via `uv lock --upgrade-package aiohttp`, fixing GHSA-hg6j-4rv6-33pg + GHSA-jg22-mg44-37j8. (The old basicsr Dependabot alert (GHSA-86w8-vhw6-q9qq) is resolved by removal: the experimental `restore` extra was retired and basicsr is no longer anywhere in the dependency tree.) The torch Dependabot alert **GHSA-rrmf-rvhw-rf47** (`torch.jit.script` memory corruption, vulnerable `<= 2.12.0`) is **dismissed as `not_used`** (2026-06-10): torch is a transitive dep of the optional `gpu` extra only, the codebase never calls `torch.jit` (grep-verified), and **no patched torch version exists** (`first_patched_version` is null), so it cannot be closed by an upgrade — do not re-triage it. - **Full-project `uv run pyright` (no path) OOMs/crashes node on this ML-heavy repo** (emits a `libnode` stack frame, no summary) — a known environment limit, not a code error. Gate with `uv run --extra dev --extra gpu pyright src/` (completes, authoritative) or scope to changed files; also run `uv run ruff check` and `uv run pytest` directly. @@ -34,63 +35,42 @@ You are a **principal Python engineer** maintaining a CLI tool and library for r ## Key modules -- `noai/c2pa.py` — PNG chunk parser; use `extract_c2pa_chunk(path)` to get raw caBX payload, `has_c2pa_metadata(path)` to detect. Do not reimplement chunk parsing. `extract_c2pa_info(path)` sets `synthid_watermark`/`synthid_vendors` when the manifest is signed by a SynthID-using vendor, and `soft_binding`/`soft_binding_vendors` when a `c2pa.soft-binding` `alg` names a forensic-watermark vendor (`soft_binding_vendors_in(buffer)` is the shared byte-scan, used by both the PNG parser and the non-PNG binary path). PNG/caBX chunk reads are clamped to the remaining file size (`safe_length = min(length, remaining)`; skipped chunks use seek) so a malformed huge `length` cannot drive a multi-GB allocation (shared safety discipline matching `isobmff.scan_c2pa_region`). -- `noai/constants.py` — PNG_SIGNATURE, C2PA_CHUNK_TYPE, C2PA_SIGNATURES, and `C2PA_AI_VENDORS` — the single `C2paAiVendor` registry of C2PA-signing vendors (issuer byte, resolved org name, the `identify` platform label, and a `synthid` flag), from which `C2PA_ISSUERS`, `SYNTHID_C2PA_ISSUERS` (issuers that pair SynthID with C2PA: Google, OpenAI), and `identify._ISSUER_PLATFORM` are all **derived** — plus `C2PA_SOFT_BINDINGS` (soft-binding `alg` prefix → forensic-watermark vendor: Adobe TrustMark, Digimarc, Imatag, Steg.AI, Microsoft, ...). Add a new C2PA vendor as one `C2PA_AI_VENDORS` entry (never edit the derived dicts), a new soft-binding to `C2PA_SOFT_BINDINGS`; not inline. -- `metadata.py` — `scan_head(path, size=1MB)` is the shared input for every C2PA/AIGC/IPTC byte scan: first `size` bytes plus the payloads of any provenance metadata found beyond that window — for ISOBMFF, the late provenance boxes from `isobmff.scan_c2pa_region` (catches a manifest after a large `mdat`); for **PNG**, the late `tEXt`/`iTXt`/`zTXt`/`eXIf`/`iCCP` chunks from `_png_late_metadata` (catches an XMP/EXIF packet appended after a large `IDAT`, e.g. a TC260 AIGC label at ~2.7 MB). Behavior-neutral (`f.read(size)`) for non-ISOBMFF inputs and for any file that fits within `size`. Use it instead of `open().read(1MB)` for any new marker scan. **Memoized per (path, size, mtime)** (added 2026-06-09, `_scan_head_cached` lru_cache, `maxsize=8`): one `identify`/`get_ai_metadata` call fans out to ~8 byte-scan detectors that each re-read the same file head, so the cache turns those into a single read; the mtime key invalidates on change, a stat failure falls back to an uncached read. `synthid_source(path)` returns the vendor name(s) if the C2PA manifest implies a SynthID pixel watermark, else None. Format-agnostic: PNG via the caBX parser, JPEG/WebP/AVIF/HEIF/JXL via a binary scan (C2PA marker + SynthID issuer + AI-source marker). `get_ai_metadata` surfaces the verdict, and `metadata --check` prints it as a callout. Both `get_ai_metadata` and `has_ai_metadata` guard the PIL open with `except Exception` (HEIC/unknown formats raise non-OSError) and fall through to the binary scan. `xai_signature(path)` detects xAI/Grok's EXIF-only scheme (`ImageDescription` = `Signature: ` + UUID `Artist`); it feeds `has_ai_metadata`, `get_ai_metadata` (key `xai_signature`), and `identify`. `iptc_ai_system(path)` detects the IPTC Photo Metadata 2025.1 AI-disclosure XMP properties (`IPTC_AI_FIELD_MARKERS` = `AISystemUsed`/`AISystemVersionUsed`/`AIPromptInformation`/`AIPromptWriterName`) and returns the `AISystemUsed` generator name (or `"fields present"`). `remove_ai_metadata` routes **ISOBMFF video** (`.mp4`/`.mov`/`.m4v`) through the same `isobmff.strip_c2pa_boxes` as AVIF/HEIF (MP4 is ISOBMFF), and `_scrub_ai_exif` removes the xAI signature + AI-generator EXIF tags on JPEG output. `strip_c2pa_boxes` is **fail-safe** on a malformed box: it returns the original bytes unchanged with a logged warning instead of truncating the tail to EOF (detection-only `scan_c2pa_region` still stops at a malformed box). `_png_late_metadata` clamps each late-chunk read to the remaining file size (`safe_length = min(length, remaining)`) so a malformed `length` cannot drive a multi-GB allocation, AND advances the cursor by `safe_length` (not the raw `length`) so an inflated length cannot jump past EOF and abort the scan, silently skipping a genuine AI-label chunk after it. -- `identify.py` — the OpenAI rollout caveat is keyed on `_vendor_of(synthid) == "OpenAI"` (not a raw substring over the issuer + verdict blob). `identify(path)` aggregates every locally-readable signal (C2PA issuer→platform, C2PA soft-binding forensic-watermark vendor, **C2PA cloud-manifest reference** via `metadata.c2pa_cloud_manifest` — signal `c2pa_cloud`, **medium**, provenance-only (does NOT set `is_ai`, excluded from `ai_from_metadata` + clash vendors): a C2PA 2.4 Durable-Content-Credentials case where the embedded manifest is stripped but an XMP `dcterms:provenance` pointer to the vendor's cloud manifest store (`_C2PA_MANIFEST_REPOSITORIES`, today `cai-manifests.adobe.com` → "Adobe Content Authenticity") survives, so the credentials stay recoverable server-side; only emitted when no embedded manifest already attributed the file — surfaced on 2 corpus PNGs 2026-06-10 that read fully `unknown` before, IPTC "Made with AI" + IPTC 2025.1 `AISystemUsed`, embedded SD/ComfyUI params, SynthID proxy, xAI/Grok EXIF signature via `metadata.xai_signature`, the China TC260 AIGC label via `metadata.aigc_label`, the HuggingFace `hf-job-id` job marker via `metadata.huggingface_job`, the Samsung Galaxy AI editing marker via `metadata.samsung_genai`, the visible marks — Gemini sparkle plus the ByteDance Doubao 豆包AI生成 / Jimeng 即梦AI / Samsung Galaxy AI "Contenuti generati dall'AI" text marks via the `watermark_registry` — open invisible watermark, Adobe TrustMark via `trustmark_detector`) into one `ProvenanceReport`. `is_ai_generated` is True or None (never asserted False — stripped metadata is not proof of clean origin). The `hf_job`, visible-mark, and Samsung `samsung_genai` signals are **medium** confidence: each lifts an otherwise-Unknown verdict to a tentative AI (`hf_only` / `visible_only` / `samsung_only`, parallel branches; `visible_only` fires on any `visible_*` signal) but is excluded from the high-confidence `ai_from_metadata` set, so none overrides a hard metadata signal. **Visible-mark detection** (`check_visible`, signals `visible_sparkle` / `visible_doubao` / `visible_jimeng` / `visible_samsung`): the Gemini sparkle keeps its own file-level path (`_visible_sparkle` → `gemini_engine.detect_sparkle_confidence`, promoted only at confidence ≥ `_SPARKLE_THRESHOLD` 0.5; corpus-tuned to separate Gemini sparkles ≥0.56 from non-sparkle ≤0.49), while Doubao/Jimeng/Samsung reuse the registry detectors (`_visible_text_marks` → `watermark_registry`, iterating `_VISIBLE_MARK_PLATFORM`), each gated by its own engine NCC threshold via `MarkDetection.detected` (Doubao 0.4, Jimeng 0.45, Samsung 0.4). Doubao/Jimeng are normally also caught by the TC260 AIGC metadata label and Samsung by its C2PA + `genAIType` marker, so the visible path is their stripped-metadata fallback. Visible marks set `platform` only when no harder signal already did, and (like the sparkle) are excluded from integrity-clash vendor claims. The cv2 dependency lives in the engines, not here. **`import identify` is deliberately light** (~21 MB; ~36 MB with cv2 loaded by a visible-mark run, ~106 MB for a full `check_visible` run): it imports only the pure `noai.c2pa`/`noai.constants` submodules, and `noai/__init__` is lazy (see "Test and lint"), so torch/diffusers are NOT pulled at import even in a full `gpu`/`detect` install — fits a 512 MB host. The heavy paths are opt-in: `check_invisible=True` needs the `detect`/`trustmark` extras (each pulls **torch**; TrustMark also **downloads weights**), so on a core-only deploy leave `check_invisible` off (it is a no-op there anyway). Before the lazy `__init__`, the mere presence of torch in the env inflated `import identify` to ~420 MB. **C2PA platform attribution is device-token-first, issuer-scan fallback** (`_device_platform` scans manifest bytes for `_DEVICE_C2PA_PLATFORM` tokens, then `_attribute_platform`/`_ISSUER_PLATFORM`). **Why, verified on real signed files 2026-05-26:** the old issuer-only byte-scan matched ANY issuer substring anywhere, so multi-entity manifests mis-attributed -- Leica→"Truepic" (a signing authority in the trust chain), Nikon→"Adobe Firefly" (XMP-toolkit "Adobe" + the sample's "Adobe_MAX" name), Pixel→"Google (Gemini)" ("Google LLC" cert org), Truepic→"Google". A distinctive device token wins instead. **Token distinctiveness is load-bearing:** bare `b"Truepic"` mis-fires (it appears in unrelated trust chains -- it mis-attributed the OpenAI `chatgpt-1.png` fixture), so the token is the specific `b"Truepic_Lens"` from the Lens SDK claim generator; likewise `b"Pixel Camera"` (cert CN) not bare `b"Pixel"`. `_DEVICE_C2PA_PLATFORM` lists ONLY tokens **verified against a real C2PA file**: Leica (`lc_c2pa`/`Leica Camera`), Nikon (`NIKON`), Pixel (`Pixel Camera` -- from a real Pixel 10 Pro file attached to c2pa-rs issue #1609/#1554), Sony (`sony.sig`/`sony.cert` -- Sony's own C2PA assertion namespace, verified on a real Sony PXW-Z300 file; NOT bare "Sony" which is a common EXIF Make), Truepic (`Truepic_Lens`). Canon/Bria have **no public direct-download C2PA sample** (checked exhaustively: GitHub issue/PR attachments, contentcredentials gallery, HF datasets -- all upload-to-verify or token-gated; Canon's only public file was a self-signed hobbyist CR3, not factory), so they stay unmapped until a real file is captured (same fixture discipline as Grok/Doubao). The Sony sample is video (MP4) -- our ISOBMFF C2PA path detects it; Sony Alpha stills likely share the `sony.*` namespace but are not separately verified. **Samsung Galaxy + ASUS Gallery live in a separate `_SIGNER_C2PA_PLATFORM` (scanned after `_device_platform`, before the issuer fallback), NOT in `_DEVICE_C2PA_PLATFORM`** — verified on real signed files 2026-05-29. Reason: a Galaxy phone stamps BOTH its device cert AND a `trainedAlgorithmicMedia`/genAIType AI marker on a Generative-Edit image, so treating it as a "genuine camera capture" would false-fire integrity-clash rule 2 on every Galaxy AI edit. The signer tokens (`b"Samsung Galaxy"` cert org — distinct from the EXIF `SM-xxxx` model string on ordinary Samsung photos; `b"com.asus.gallery"` claim generator) only resolve the platform label; the AI verdict still comes from the source-type / genAIType. ASUS Gallery is a C2PA-signed edit with no AI marker, so it attributes the platform without asserting `is_ai`. **Samsung's `genAIType` (in the proprietary `PhotoEditor_Re_Edit_Data` JSON) is an undocumented Galaxy-AI editing marker** (`metadata.samsung_genai`, gated on the `PhotoEditor_Re_Edit_Data` container; non-zero value = AI tool used, values {1,5} observed): medium-confidence because the field has no public spec (verified 2026-05-29: absent from C2PA spec + Samsung docs), but it co-occurred with `trainedAlgorithmicMedia` in 3/3 verified files that record a source-type and was the SOLE AI marker on a Galaxy S24 file that omits the source type. Camera C2PA marks capture authenticity, not AI (Pixel carries `computationalCapture`, not `trainedAlgorithmicMedia`), so these never set `is_ai` -- that stays driven by digital-source-type. `c2pa.cbor_text_after` (now public) is best-effort for the `generator` detail string only and can be None when the manifest keys it `claim_generator_info` (Pixel). **Issuer→generator mapping is `is_ai`-gated** (`_attribute_platform(issuers, is_ai=c2pa_is_ai)`): a specific AI-generator platform is named only when the digital-source-type is `trainedAlgorithmicMedia`; on a non-AI source an issuer substring is treated as incidental (an "Adobe XMP" toolkit string in an *unmapped* Canon/Sony capture would otherwise mislabel it "Adobe Firefly"), so it degrades to the neutral "C2PA signer: X" label. Real Firefly/OpenAI/Google output carries the AI source-type, so it is unaffected (verified: chatgpt-1.png→OpenAI, firefly-1.png→Adobe Firefly still attribute). `_attribute_platform` defaults `is_ai=True` so the mapping stays unit-testable in isolation. Add capture-camera tokens to `_DEVICE_C2PA_PLATFORM`, editing-app/AI-device signer tokens to `_SIGNER_C2PA_PLATFORM`, generator/issuer platforms to the `C2PA_AI_VENDORS` registry in `constants.py` (which derives `_ISSUER_PLATFORM`), not inline. For non-PNG containers (JPEG/WebP/AVIF/HEIF/JXL) the caBX parser returns nothing, so issuer (`_issuers_in`) and generator (`_ai_tools_in`, reusing `C2PA_AI_TOOLS`) are recovered by binary-scanning the first MB. EXIF `Software` / `Make` / `Artist` / `ImageDescription` and XMP `CreatorTool` generator tags are read by `metadata.exif_generator` (PIL+piexif for any format PIL opens incl. AVIF, plus a container-agnostic XMP raw-byte scan that also covers HEIF/JXL), matched against `AI_GENERATOR_TOKENS` so ordinary editors (plain "Adobe Photoshop") and real-camera `Make` ("Apple"/"Canon") are not flagged. **Ideogram tags its output with EXIF `Make="Ideogram AI"`** (verified on a real download 2026-05-24) — that's why `Make` is read. **Integrity-clash detection** (`_integrity_clashes`, surfaced as `ProvenanceReport.integrity_clashes`, printed in red by `identify` and serialized to `--json`): contradictions between independent generator stamps are a laundering/spoofing tell. Two rules: (1) two or more distinct AI-origin vendors named by **independent** signals (e.g. C2PA OpenAI + EXIF `Make="Ideogram AI"`), and (2) a camera-capture C2PA device (`_DEVICE_C2PA_PLATFORM`) coexisting with an AI-generation marker **from a source INDEPENDENT of the camera's own manifest**. **Rule 2's independence gate (added 2026-06-11):** a device that both captures and runs on-device generative AI (Google Pixel Magic Editor / Pixel Studio) records the capture AND the AI edit in ONE C2PA manifest — so the AI vendor is named only from that same manifest (`c2pa` issuer + `synthid` proxy, both `c2pa_manifest` source) — a legitimate edit chain, NOT a clash. Rule 2 therefore fires only when some `ai_vendor_claims` family has a source `!= "c2pa_manifest"` (EXIF/XMP generator, IPTC, TC260 AIGC, a second manifest naming AI on a camera capture — the real laundering tell). This killed a false-positive class on the corpus: 2 real Pixel generative-edit PNGs (`computationalCapture` + `trainedAlgorithmicMedia` + "Applied imperceptible SynthID watermark" in one Google manifest) read as camera-vs-AI clashes before the gate. Pure cameras (Leica/Sony/Nikon/Truepic) that do NOT generate AI still clash on any within-manifest AI marker only if it is independent — they never legitimately carry one, so the gate is behavior-neutral for them while fixing Pixel (regression-guarded by `test_identify.py::TestIntegrityClashesHelper::{test_pixel_generative_edit_same_manifest_no_clash,test_camera_plus_independent_ai_marker_still_clashes}` + `TestIntegrityClashEndToEnd::test_pixel_generative_edit_no_clash`). **Independence is source-grouped (`_CLASH_SOURCE`, added 2026-06-02):** the C2PA issuer attribution (`c2pa`) and the SynthID proxy (`synthid`) are NOT independent — the proxy is inferred from the *same* manifest — so they share one source and two vendors named within a single manifest do not clash. This killed a false-positive class found on the spaces corpus: legitimate multi-actor manifests where a product wraps another vendor's engine (Microsoft Designer on OpenAI → `OpenAI, Microsoft`; Microsoft on Google → `Microsoft, Google LLC, Google C2PA Core Generator Library`) or an edit chain re-signs (Adobe over a Gemini original → Adobe c2pa + Google synthid) — 19 such files across the 2026-06-01/02 batches read as clashes before the fix. Rule 1 still fires when a manifest vendor disagrees with a genuinely independent stamp (EXIF/XMP generator, IPTC `AISystemUsed`, AIGC, xAI); each non-`c2pa`/`synthid` family is its own source (`test_identify.py::TestIntegrityClashes::{test_multi_actor_manifest_no_clash,test_manifest_vendor_vs_independent_signal_clashes}`). Vendor normalization is `_vendor_of` over `_AI_VENDOR_TOKENS` (so a C2PA "Google (Gemini)" issuer and a SynthID-Google proxy agree, while different vendors clash). **High-precision by design:** only hard generator stamps feed it (C2PA-issuer when source is AI, SynthID, EXIF/XMP generator, IPTC `AISystemUsed`, xAI, AIGC); the fuzzy visible sparkle and the open invisible watermark are **excluded** (the latter can be a by-product of our own SDXL removal pass). The c2pa vendor is classified from the issuer attribution / generator, NOT the resolved `platform` (a camera label like "Google Pixel" would mis-normalize to "Google"). All real single-origin fixtures (chatgpt/firefly/doubao/grok/mj) verified to produce **zero** clashes (false-positive guard in `test_identify.py::TestRealSamplesHaveNoClash`). -- `watermark_registry.py` — **single catalog of known visible watermarks**, the unified "find known marks in their usual places, recognize, remove" entry. **Reverse-alpha based by policy**: a mark is listed only once a real alpha map has been captured for it, and removal inverts that map (`original = (wm - a*logo)/(1-a)`) — Gemini recovers cleanly with no inpaint (its sparkle alpha comes from a pure-black capture, so it is near-exact), while **Doubao, Jimeng, and Samsung all add an always-on THIN residual inpaint** over the glyph footprint (their text marks re-rasterize + jitter a few px per image, so a single capture cannot pixel-cancel them; the inpaint blends into the reverse-alpha-recovered pixels). Arbitrary-region inpainting still lives in `region_eraser`/`erase`. Each `KnownMark` ties a key to {usual `location`, `in_auto` flag, `recovery` (="reverse-alpha"), a `detect` adapter → uniform `MarkDetection`, a `remove` adapter}. Entries today: `gemini` (bottom-right sparkle), `doubao` (bottom-right "豆包AI生成"), `jimeng` (bottom-right "★ 即梦AI"), and `samsung` (bottom-**LEFT** "✦ Contenuti generati dall'AI", Samsung Galaxy AI, Italian locale). `detect_marks` scans all; `best_auto_mark` picks the highest-confidence detection. **Cross-engine confidences aren't directly comparable**, so the gemini adapter applies the corpus-validated 0.5 sparkle threshold (`_GEMINI_AUTO_MIN_CONF`) for its `detected` flag — otherwise the gemini engine's loose internal threshold weakly fires (~0.36) on the Doubao text and hijacks `auto`. The shape-keyed Doubao/Jimeng/Samsung NCC detectors don't cross-fire (jimeng scores ~0.22 on the Doubao strip, well under its 0.45 threshold; Samsung is bottom-left so it shares no corner with the others, and scored 0.0 on Doubao/Jimeng captures and they 0.0 on a real Samsung photo), so `auto` picks the right one. `cli.cmd_visible` is registry-driven: `--mark auto` → `best_auto_mark`, `--mark ` → that mark; `--mark` choices come from `mark_keys()`. **`cli._remove_visible_auto` is the shared visible-removal helper used by `cmd_all`/`cmd_batch` too** (they no longer hardcode `GeminiEngine`), so `all`/`batch` remove Doubao/Jimeng/Samsung text marks, not just the Gemini sparkle (regression-guarded by `test_all_visible_step_uses_registry`). The three text-mark adapters were consolidated 2026-06-09: a single `_text_mark(key, label, location)` builds the registry row from one parameterized `_text_mark_detect`/`_text_mark_remove` pair (reverse-alpha only when detected/forced AND `reverse_alpha_available`, else skipped — no inpaint); the gemini adapters stay bespoke. Add a new visible mark = one `_text_mark(...)` row + its `TextMarkConfig` (with a captured alpha map); do not re-add per-mark `if` branches or copy-paste adapters. **Alpha-on-save policy (issue #30):** `cli._write_bgr_with_alpha` rejoins the input's alpha plane **unchanged** — it must NOT zero alpha in the watermark bbox. Reverse-alpha (and `erase` inpaint) recover real pixels there, so zeroing alpha punched a transparent hole that renders as a solid **white box** on any non-transparent viewer (Gemini app exports are opaque RGBA, so every user hit it; regression-guarded by `test_visible_keeps_alpha_opaque_in_watermark_region`). The registry `remove()` still returns its region (used for `inpaint_residual` positioning), but the CLI no longer uses it to clear alpha. -- `gemini_engine.py` — visible Gemini-sparkle remover/detector (cv2/numpy, no GPU). `detect_sparkle_confidence(path)` is the file-level entry point used by `identify.py`. The public entry points normalize a grayscale (2D) or RGBA (4-channel) input to BGR up front so a non-BGR image does not crash the cv2 pipeline. **Detection localization (issue #36):** `detect_watermark`'s global multi-scale NCC search applies a size weight (`(scale/96)**0.5`) that suppresses tiny-patch false positives but can let a larger, mediocre match (e.g. a bright collar in a portrait) outrank a small, near-perfect sparkle in the corner — so a faint sparkle on a busy background scored below threshold and read as clean (the regression osachub reported from widening the search window 256px->512px between v0.7.2 and v0.8.8). `_corner_promote` adds a bottom-right-corner raw-NCC pass on top of the global search: a match with raw NCC >= `_CORNER_PROMOTE_NCC` 0.85 that beats the global pick overrides it (it only ever replaces a lower-fidelity pick, so it cannot weaken an existing detection), rescuing the buried sparkle without reverting the wider window. The corner side is **relative-clamped** (`_CORNER_PROMOTE_FRAC` 0.20 of the short side, clamped to `[_CORNER_PROMOTE_MIN` 96, `_CORNER_PROMOTE_MAX` 384`]`): a fixed 256px is a true corner on a large image but covers ~70% of a small portrait, where a real photo raw-matches the star at ~0.81 (relative tightening drops that worst case to ~0.69, while the upper clamp stops the corner ballooning on huge images where a real photo reached ~0.83 at 512px). The 0.85 gate sits midway between the worst real-photo corner match (~0.78 across native + downscaled negatives) and a genuine faint sparkle (~0.93), so promotion adds true detections with zero corpus false positives (Gemini's sparkle sits ~60-160px from the corner at fixed margins, covered by the [96, 384] band at every measured size). Regression-guarded by `test_gemini_engine.py::TestCornerPromotion`. **Square-image residual misses are NOT fixable by lowering the detector threshold (measured + REJECTED 2026-06-11):** osachub (#36 follow-up) reported the corner-promote still misses Gemini sparkles on Google **square (1:1)** outputs. Reproduced on the spaces corpus: of 330 square Google-C2PA images, 140 score below the identify 0.5 threshold, and visual review confirmed a real class -- faint white sparkles on dark/textured/colored backgrounds (raw NCC 0.46-0.73, below the 0.85 promote gate) landing at fusion conf 0.41-0.47. A margin-gated promote (promote when raw NCC >= 0.50 AND `_core_ring_margin` >= 40) rescued 32/33 confirmed misses at an apparent 0 FP, but that 0 was a **measurement artifact** -- the negative set was the margin<40 misses, which a margin>=40 gate excludes by construction. On an honest 518-image non-Google pool the same gate fired on **~174 (≈33%)**, visually content (screenshots, Chinese "AI生成" Doubao/Jimeng text marks, logos, bright textures), not sparkles. Adding an achromatic-core constraint (`chroma <= 15`) did not separate them either (kept 15/33 POS, 41 NEG still firing). Root cause is the documented contrast-invariant-NCC wall: a faint sparkle on a busy background is indistinguishable from a bright/ornate content corner at the (shape-NCC, brightness-margin, core-chroma) feature level. **Conclusion: keep the 0.85 corner gate; do NOT add a margin/chroma-gated lower promote.** The cost (mislabel ~8-33% of non-Gemini content as Gemini) outweighs the benefit -- the visible sparkle is a medium-confidence stripped-metadata fallback, and intact Gemini is caught by C2PA in `identify` regardless. Remaining square misses are an accepted known limitation; a real fix would need a sparkle-specific discriminator (template match on a background-subtracted image, or a hard fixed-margin position prior), which is open research, not a threshold tweak. **Removal is reverse-alpha with an over-subtraction guard** (`remove_watermark` → `_reverse_alpha_blend`, else `_inpaint_footprint`): the sparkle alpha is computed (`alpha = max(R,G,B)/255`) from the bundled sparkle-on-black captures `assets/gemini_bg_{96,48}.png` (the capture max is ~130, NOT 255 — the sparkle is a ~51%-opaque white overlay, so `alpha` maxes at ~0.51, which is CORRECT for the capture, not under-exposed). The alpha is near-exact only when the real mark's effective opacity matches the capture, which holds on bright/flat backgrounds — re-verified clean on `demo_banana_before.png` 2026-05-31. **Issue #30 (dark-background black pit):** on a dark/textured background (e.g. grass, ~73) the real sparkle's effective opacity is LOWER than the captured 0.51, so the fixed-alpha reverse blend OVER-subtracts (`watermarked - a*logo` goes negative) and drives the footprint to black — the white sparkle becomes a black diamond. `remove_watermark` now detects this via `_reverse_alpha_oversubtracts` (fraction of footprint pixels with `alpha >= _FOOTPRINT_ALPHA` 0.1 whose numerator < 0 exceeds `_OVERSUB_FOOTPRINT_FRAC` 0.05) and **inpaints the footprint** (`_inpaint_footprint`, cv2 NS over the dilated alpha mask) from the surrounding pixels instead. **Behavior-neutral on the working case:** a bright background over-subtracts at ~0% so reverse-alpha is used and the output is byte-identical to before (verified: demo_banana 0.0 frac vs issue-#30 grass 0.61 frac; regression-guarded by `test_gemini_engine.py::TestOverSubtractionGuard`, which composites the sparkle at a reduced effective alpha to reproduce the mismatch). **Under-subtraction (the symmetric case, fixed 2026-06-03):** some real Gemini sparkles are rendered MORE opaque than the captured ~0.51, so the fixed-alpha reverse blend UNDER-subtracts and leaves a bright sparkle residual the detector still fires on (measured on the spaces corpus: a visible-removal audit through the registry path left a detectable sparkle on a meaningful fraction of marks, all under-removals, NOT a background-brightness class — failures and successes had the same input confidence and the same background-luma distribution; the discriminator was the removal delta itself). `remove_watermark` now estimates a per-image alpha gain (`_estimate_alpha_gain`: effective sparkle opacity at the bright core vs the local background ring, `a_eff/a_cap`, clamped `[1.0, _ALPHA_GAIN_MAX` 1.94`]`) and scales the alpha to match before the over-sub/blend branch. The gain cleanly separates on the corpus (under-removed marks ~1.47, cleanly-removed ~1.00), and a deadband (`_ALPHA_GAIN_DEADBAND` 1.05) keeps a matching sparkle **byte-identical** to the pre-fix output, so the fix is purely additive (0 regressions on the audit set; the over-sub guard still runs on the scaled alpha as the safety net for an over-shooting estimate). Regression-guarded by `test_gemini_engine.py::TestUnderSubtractionGain` (composites a more-opaque-than-capture sparkle; **asserts on footprint pixels, NOT the detector** — the detector's NCC is degenerate on a flat synthetic background, so a re-detect conf is meaningless there; the real corpus removal drops the detector from ~0.80 to ~0.27). **False-positive gate (added 2026-06-03):** `detect_watermark`'s shape-only NCC (`spatial*0.5 + gradient*0.3 + var*0.2`) fires on ornate/flat content (text strips, banners, hatching) that coincidentally matches the diamond shape — a real Gemini sparkle is a bright WHITE overlay, so its core sits above the local background, but the NCC is contrast-invariant and cannot see that. The fusion now **demotes** (caps confidence to 0.30) any match that is BOTH low-confidence (`< _SPARKLE_FP_CONF` 0.65) AND has a low core-ring brightness margin (`_core_ring_margin < _SPARKLE_FP_MARGIN` 5). Real sparkles escape via EITHER high confidence (white-bg sparkles score ≥0.79 despite a low margin — the NCC shape match is strong) OR high margin (dark/mid backgrounds, incl. the #36 faint-corner case, lift well clear), so BOTH must fail to demote. The gate is **monotonic** (only ever removes detections, never adds), so it cannot regress the verified-negative corpus (already 0 FPs). On the spaces corpus it demoted 16/495 flagged sparkles (13 carried no AI metadata = content FPs; the 3 AI-meta were visually FPs / a near-invisible white-on-white sparkle whose AI verdict is held by metadata anyway), and dropped the removal-audit failures 20→15 (post-removal flat footprints the NCC re-fired on). `_core_ring_margin` and `_estimate_alpha_gain` share the `_core_and_bg` helper (core 75th-pct brightness vs background-ring median). Regression-guarded by `test_gemini_engine.py::TestSparkleFalsePositiveGate`. **Self-verify repair (added 2026-06-04):** the gain estimate corrects most under-subtractions, but a tail of strong sparkles still survived reverse-alpha (position jitter, or a gain the `[1.0, 1.94]` clamp could not fully reach). After the reverse blend, `remove_watermark` re-detects via `_verify_and_repair`; when a sparkle at or above `_VERIFY_FALLBACK_CONF` 0.5 (the registry's real fail line) remains, it inpaints the footprint and **keeps that only when it lowers the re-detect confidence** — purely additive (the common clean removal re-detects below 0.5 and is returned untouched, so it can never regress). On the spaces corpus this rescued **4 of the 15 remaining gemini removal-audit failures** (15→11, doubao/jimeng still 0), verified through the registry/CLI path. Costs one extra `detect_watermark` per removal (two when the fallback fires). Regression-guarded by `test_gemini_engine.py::TestVerifyAndRepair` (stubs `detect_watermark` to drive the keep-best control flow, since the NCC is degenerate on flat synthetics). **An offset+scale alignment search was prototyped on the remaining 11 fails and REJECTED (2026-06-04):** an audit "ceiling" test suggested it could rescue 4 more (e.g. a5a9 0.577→0.417), but direct inspection showed those were NCC-gaming, not removal — the lower-scoring placement left the sparkle as bright or BRIGHTER (a5a9: first-pass slot 99.5th-pct ~76 at background level, the "aligned win" slot ~164), it just reshaped the residual so the contrast-invariant shape-NCC scored lower. A slot-brightness sanity gate rejected every one, so alignment contributed 0 genuine rescues and was removed (the footprint inpaint stays because it physically reconstructs the slot from its darker surroundings, so its rescues are real). **Lesson: the visible-audit pass/fail metric (re-detect conf < 0.5) is gameable by reshaping the residual — optimizing it directly finds NCC-gaming placements, not clean removals; gate any removal candidate on a physical brightness check, not the detector alone.** The 11 survivors are near-white ill-conditioning (reverse-alpha divides by `1-a`≈0.02) or detector false positives (before≈after≈0.51) that no reverse-alpha placement fixes. The registry's optional `inpaint_residual` (edge cleanup) is a no-op on a clean reverse-alpha removal (and on the same corpus it lowered the re-detect conf on 3 marks, raised it on 10, no-op on 466 — net-neutral on pass/fail, so the self-verify repair, not it, drives the removal tail); an earlier "Gemini smears" read was a misjudged soft-fur original, not an artifact. **The bg assets are now rebuilt from OUR OWN controlled captures** (`data/gemini_capture/captures/`, committed) by `scripts/visible_alpha_solve.py gemini`, which locates the 96px sparkle on the black capture and crops it to the two logo sizes; our capture matched the previously third-party-sourced `gemini_bg_96.png` to **NCC 0.9998**, validating the asset and making it reproducible. Gemini's multi-size fixed-slot model is genuinely different from the Doubao/Jimeng text-strip engines (so it stays a separate engine, not part of the shared-base refactor). -- `_text_mark_engine.py` — **shared base for the three reverse-alpha text-mark engines (Doubao/Jimeng/Samsung), extracted 2026-06-09** (they were ~90% byte-identical clones). `TextMarkEngine(config: TextMarkConfig)` owns the whole `locate → extract_mask → detect → _fixed/_aligned_alpha_map → _apply_reverse_alpha → remove_watermark_reverse_alpha` pipeline (+ the asset-keyed `load_alpha_template`/`glyph_silhouette`/`template_match_score` caches). Each engine module is now a thin subclass: it supplies only its `TextMarkConfig` (the tuned constants, the bundled asset, and the bounded structural deltas — `corner` br/bl, `margin_floor` 4/2, `morph_open_size` 5/3, `min_gw` 8/16) plus the test-facing module shims (`_alpha_template`/`_glyph_silhouette`/`_template_match_score` + the constants). Behavior is byte-exact vs the old per-engine code (the three engine test suites pass unchanged). Gemini stays a SEPARATE engine (its multi-size fixed-slot sparkle model is genuinely different). Add a new text mark = a new `TextMarkConfig` + a thin subclass + one registry `_text_mark(...)` row. The engine bullets below describe each mark's calibration history; the LOGIC lives here. -- `doubao_engine.py` — **a thin `_text_mark_engine.TextMarkEngine` subclass (config only) since 2026-06-09.** visible Doubao "豆包AI生成" remover/detector (cv2/numpy, no GPU). `DoubaoEngine.locate` anchors a bottom-right box by **geometry** (mark scales with image WIDTH), `extract_mask` pulls the light, low-chroma glyphs (the detection candidate) using a per-pixel channel-spread proxy `sat = roi.max(axis=2) - roi.min(axis=2)` (no HSV conversion). `detect` is **shape-consistent**: it matches the bundled alpha glyph silhouette (`assets/doubao_alpha.png`) against the candidate via zero-mean normalized correlation (`_template_match_score`, cv2 `TM_CCOEFF_NORMED`), gated at `DETECT_NCC_THRESHOLD` 0.4 over a small `DETECT_MIN_COVERAGE` floor. Keying on glyph SHAPE (not coverage heuristics) fixed #23 (corpus FP 7/1243). **Removal = reverse-alpha + thin residual inpaint** (`remove_watermark_reverse_alpha`): `original = (wm - a*logo)/(1-a)` from the bundled alpha map + `_ALPHA_LOGO_BGR` (pure white) + `_ALPHA_*_FRAC` geometry, then a deliberately THIN inpaint (`_RESIDUAL_*`, `INPAINT_NS`) over the glyph footprint clears leftover edges without smearing. **Alpha is rebuilt by `scripts/visible_alpha_solve.py` (the careful gray-self solve: cubic background fit, mean over channels, full halo, unblurred), same recipe as Jimeng** — the captures are committed in `data/doubao_capture/captures/`. **Removal aligns ALWAYS** (no `_ALPHA_NATIVE_BAND` fast-path): it tries fixed geometry AND `_aligned_alpha_map`'s `TM_CCOEFF_NORMED` scale+position search and keeps the lower-residual one — the mark is re-rasterized and a few px off per image, so fixed geometry alone leaves a visible outline even at 2048. **The locate box (`WM_*`) is generous (0.22 wide, margins 0.004) and reaches close to the corner** — a tight box (the old 0.185 / margin 0.012) let a corner-ward shift fall OUTSIDE the alignment search, so the align missed and a readable outline survived; regression-guarded by `test_recovers_shifted_mark_on_texture` (composes the alpha shifted on a known texture; old box ~29 vs new ~1 mean residual). **Issue #13 follow-up defect (found 2026-05-31): the SHIPPED Doubao removal left a clearly READABLE "豆包AI生成" outline on the real `doubao-1.png` sample, while `detect` returned conf 0.0 (it is fooled by a thin outline) so `test_reverse_alpha_removes_mark` passed and the old "56/56 clean" claim was detector-measured, not visual.** Root cause: bad alpha (under-estimated, max ~0.65) + fixed-no-inpaint + tight box; the careful rebuild + always-align + thin inpaint + wide box takes it from a readable outline to faint texture-level traces (parity with Jimeng — a single capture cannot pixel-cancel a per-image re-rasterized mark). **Lesson: a detector-only removal test is insufficient; assert visual residual (the textured-shift test).** **`extract_mask` guards a degenerate ROI (`bh < 16 or bw < 16` -> empty mask, skips cv2):** the always-align removal scores each placement with a residual `detect(out)`, and on an extremely wide/short image (e.g. 2048x1, `test_wide_short_does_not_raise`) that fed cv2's GaussianBlur a ~1-px-tall ROI and **faulted natively on Windows py3.12 (access violation, non-deterministic — one CI cell went red while a re-run passed)**; the old at-native path never ran `detect` on degenerate sizes. Real images always clear the guard (the `WM_*` box floors are `max(16, …)` height / `max(40, …)` width), so it only short-circuits slivers. `reverse_alpha_available` is just "asset present"; the registry gates removal on `detect`. The shipped third-party `_refs/zhengsuanfa_doubao_alpha_120x20.png` is NOT a usable alpha (verified 2026-05-29). Arbitrary-region inpainting is `region_eraser`/`erase`. -- `jimeng_engine.py` — **a thin `TextMarkEngine` subclass (config only) since 2026-06-09.** visible Jimeng / Dreamina "★ 即梦AI" remover/detector (cv2/numpy, no GPU), built 2026-05-30 from issue #13's solid captures (@powersee). Shares the base with `doubao_engine`: `locate` anchors a bottom-right box by **geometry** (scales with WIDTH), `extract_mask` pulls the light low-chroma glyphs (white top-hat + grayish + min-luma), `detect` matches the bundled "即梦AI" glyph silhouette (`assets/jimeng_alpha.png`) via `TM_CCOEFF_NORMED` over a coverage floor. Threshold `DETECT_NCC_THRESHOLD` **0.45** cleanly separates real Jimeng marks (>=0.81) from the Doubao strip (0.21) and other AI output (0.0), so the two ByteDance marks don't cross-fire in `--mark auto`. **Logo is pure white (255,255,255)** (`_ALPHA_LOGO_BGR`; the white capture + an L-pair-solve confirm ~254.6); compositing is **sRGB, not linear** (a linear-light solve tripled the cross-residual). **Alpha rebuilt by `scripts/visible_alpha_solve.py` from the GRAY capture** (`data/jimeng_capture/captures/`, the solid captures now committed): `a = (I - B)/(255 - B)`, B a per-capture **cubic** background fit over the non-glyph pixels, **averaged over channels, full halo extent (down to a~0.02), unblurred**. Gray (bg ~132) is the deliberate choice over black: it is the best proxy for real content (the mark sits on bright photo areas, not on black), and the careful build drops the gray self-residual to ~1.3. **The mask quality, not the method, was the earlier limit** — a max-channel / quadratic-bg / blurred / halo-truncated build (and a black-dominated LS) left a visible outline (lesson from issue #13: when reverse-alpha leaves a ghost, suspect the captured alpha map before adding heuristics or switching method). Geometry emitted by the solver at `_ALPHA_NATIVE_WIDTH` 2048: `_ALPHA_WIDTH_FRAC` 0.202, `_ALPHA_HEIGHT_FRAC` 0.058, margins ~0.029. **Removal = reverse-alpha + a deliberately THIN residual inpaint** (`remove_watermark_reverse_alpha`, `_RESIDUAL_DILATE` 5 over the `_RESIDUAL_ALPHA_FLOOR` 0.05 footprint, `_RESIDUAL_INPAINT_RADIUS` 2, `INPAINT_NS`): a single 2048 alpha cannot pixel-cancel the mark re-rasterized at another resolution (alpha maps from independent captures correlate 0.998, not 1.0; off-native reverse-alpha alone only halves the mark), so a tight inpaint clears the residual edges WITHOUT the texture/edge smear a wide full-footprint pass caused. **Placement ALWAYS tries fixed geometry AND `_aligned_alpha_map`'s NCC scale+position search, keeping the lower-residual** — the mark re-rasterizes + jitters a few px per image even at the captured width, so fixed geometry alone misses (there is no `_ALPHA_NATIVE_BAND` fast-path; the scale search `_ALPHA_ALIGN_SEARCH` is fine-stepped, and the `WM_*` locate box is generous so a corner-ward shift stays inside the search — the same widen that fixed Doubao). Verified clean on the solid captures (native 2048; faint self-residual ~1.3 visible only on a dead-flat field, hidden by real texture) and a real 1440-wide Jimeng download (off-native, table edge preserved). `reverse_alpha_available` is just "asset present"; the registry gates on `detect`. **No committed real sample** (the real content download stays gitignored; only the solid calibration captures are committed) — `tests/test_jimeng_engine.py` synthesizes a mark from the bundled alpha asset, and `test_recovers_shifted_mark_on_texture` guards the align-on-shift path that the Doubao defect exposed. Jimeng images are independently caught by the China TC260 AIGC label in `metadata`/`identify`, so this engine is the visible-mark *removal* path, not a new `identify` signal. -- `samsung_engine.py` — **a thin `TextMarkEngine` subclass (config only) since 2026-06-09.** visible Samsung Galaxy AI "✦ Contenuti generati dall'AI" remover/detector (cv2/numpy, no GPU), built 2026-06-05 from issue #37's flat captures (@f-liva). Shares the base but anchored **bottom-LEFT** (Doubao/Jimeng are bottom-right): `locate` anchors a bottom-left box by **geometry** (scales with WIDTH), `extract_mask` pulls the light low-chroma glyphs (white top-hat + grayish + min-luma — `LOGO_MIN_LUMA` is lowered to **110** because the mark is faint, peak alpha ~0.38, so on a mid/dark background its glyph luma is lower than Jimeng's), `detect` matches the bundled glyph silhouette (`assets/samsung_alpha.png`) via `TM_CCOEFF_NORMED` over a coverage floor. Threshold `DETECT_NCC_THRESHOLD` **0.40** (real marks ~0.79 on a real photo, ~0.57/0.71 on the black/gray captures; 0.0 on Doubao/Jimeng captures, and Doubao/Jimeng score 0.0 on a real Samsung photo — no cross-fire, also because the corner differs). **Logo is pure white (255,255,255)** (`_ALPHA_LOGO_BGR`; white capture confirms). **Alpha solved by `scripts/visible_alpha_solve.py samsung` from the GRAY capture** (`data/samsung_capture/captures/`, the flat black/gray/white captures committed; the solver gained a `corner="bl"` mode + left-margin logging for this), same careful recipe as Jimeng (cubic background, mean-channel, full halo, unblurred). Geometry emitted at `_ALPHA_NATIVE_WIDTH` **1086** (the flat-edit capture width): `_ALPHA_WIDTH_FRAC` 0.3195, `_ALPHA_HEIGHT_FRAC` 0.0378, `_ALPHA_MARGIN_LEFT_FRAC` 0.0110, `_ALPHA_MARGIN_BOTTOM_FRAC` 0.0064. **Removal = reverse-alpha + a deliberately THIN residual inpaint** (`remove_watermark_reverse_alpha`, same `_RESIDUAL_*` recipe as Jimeng) with **always-try fixed AND `_aligned_alpha_map` NCC scale+position search, keep the lower-residual** (`_ALPHA_ALIGN_SEARCH` widened to (0.85, 1.18, 23) because the flat captures are far off the real-photo width). **Resolution caveat:** the flat captures arrived at 1086 wide while real photos are ~2958 wide (the mark scales with width, so the captured glyph ~334px is ~2.7x smaller than the ~903px real-photo glyph); width-scale + NCC-align still removes it cleanly (verified on a real 2958-wide @f-liva photo: re-detect 0.79→0.00, no readable text or outline on the recovered wooden table — checked **visually**, not just by the detector, per the Gemini self-verify lesson), but a flat capture at the real photo resolution would make the alpha pixel-sharp instead of upscaled (open quality upgrade, noted in `data/samsung_capture/README.md`). **The mark is locale-specific** (text differs per language); this build is the Italian "Contenuti generati dall'AI" variant — other locales need their own captured template. `reverse_alpha_available` is just "asset present"; the registry gates on `detect`. **No committed real sample** (the real photo stays gitignored; only the flat calibration captures are committed) — `tests/test_samsung_engine.py` synthesizes a mark from the bundled alpha asset (bottom-left geometry), with `test_recovers_shifted_mark_on_texture` guarding the align-on-shift path. Samsung Galaxy AI edits are independently caught by C2PA + the `genAIType` marker in `metadata`/`identify`, so this engine is the visible-mark *removal* path; it also feeds `identify` as the medium-confidence `visible_samsung` signal via the registry (the stripped-metadata fallback). -- `region_eraser.py` — universal region eraser (`erase` CLI). `erase(image, boxes=|mask=, backend=)` accepts grayscale (2D) and RGBA (4-channel) inputs on **both** backends (`erase_cv2` and `erase_lama` each split off any alpha plane and re-attach it unchanged, and promote grayscale to BGR for processing — LaMa would otherwise crash on grayscale and drop alpha on BGRA): `boxes_to_mask` → `cv2.inpaint` (`cv2` backend, default, no deps) or big-LaMa via onnxruntime (`lama` backend, extra `lama`, `Carve/LaMa-ONNX` Apache-2.0 model downloaded on first use, never bundled). `erase_lama` crops a padded region around the mask, runs LaMa at its fixed 512² input, pastes only masked pixels back (untouched areas stay pixel-exact). Lazy `_get_lama_session` singleton; `lama_available()` guards the optional import. **LaMa-ONNX costs ~3.5-4 GB peak RAM and ~5-6 s/call on CPU** (FFC working set, not arena — `enable_cpu_mem_arena=False` does not help), so it does NOT fit a minimal droplet; the cv2 backend (tens of MB, ~30 ms) does. LaMa quality at low RAM = serverless/GPU, mirroring how raiw.cc offloads SDXL to fal. -- `invisible_watermark.py` — `detect_invisible_watermark(path)` decodes the OPEN DWT-DCT watermarks (public decoder, no key) embedded by Stable Diffusion / SDXL / FLUX via the `imwatermark` library. Known fixed patterns (verified against upstream source) live in `_BITS_48` (SDXL 48-bit, FLUX.2 48-bit) and `_SD1_STRING` ("StableDiffusionV1", SD 1.x/2.x). Optional dep (extra `detect`); returns None when absent. The `detect` extra pulls **torch** transitively (invisible-watermark declares torch a hard dep, and `WatermarkDecoder` eagerly imports `rivaGan` -> `torch` at import time), so detection needs torch present even though dwtDct runs CPU-only on cv2/numpy/pywavelets — no GPU and no separate `gpu` extra required. **Unlike SynthID this is locally detectable**, but the watermark is fragile (does not survive JPEG re-encode/resize — verified gone after JPEG q90), so it confirms origin only on pristine files. Add new known patterns here. The file carries a top-of-module pyright pragma because imwatermark/cv2 ship no type stubs. -- `trustmark_detector.py` — `detect_trustmark(path)` decodes the OPEN, keyless **Adobe TrustMark** watermark (the soft binding behind Adobe Durable Content Credentials, `alg` `com.adobe.trustmark.P`) via the optional `trustmark` package (extra `trustmark`; pulls torch, downloads model weights on first use). Mirrors `invisible_watermark.py` (lazy singleton guarded by a double-checked `threading.Lock` so concurrent callers do not double-download the weights, top-of-module pyright pragma, returns None when absent). It detects *provenance*, not AI origin as such (TrustMark also marks human-authored content), so `identify` lists it as a watermark without setting `is_ai_generated`. Other soft-binding vendors (Digimarc/Imatag/Steg.AI/...) have no public decoder — they are only *named* via the `C2PA_SOFT_BINDINGS` scan, not decoded. **False-positive gate (added 2026-05-29):** TrustMark's `wm_present` is a BCH error-correction validity flag that spuriously validates on a content-correlated fraction of un-watermarked images — AI-generated textures trip it far more than camera photos (verified 2026-05-29 on real files: it fires on Gemini/OpenAI/Doubao output that *cannot* carry Adobe's watermark, with a random-bytes decoded secret, while signal-free camera photos did not trip it). A genuine TrustMark is a *durable* soft binding engineered to survive re-encoding, so `detect_trustmark` re-decodes after a mild JPEG round-trip (`_survives_reencode`, `_REENCODE_QUALITY` 95) and requires the same schema both times; every observed false positive collapsed (none survived even q95), so the gate is the durability property the watermark guarantees. The second decode runs only on the rare initial hit, so the cost is negligible. Do NOT remove the gate to "catch more" — a lone TrustMark hit without it is almost always content noise. -- `noai/watermark_remover.py` — the `WatermarkRemover` class has two diffusion pipelines, selected by the explicit `pipeline` ctor arg (NOT inferred from `model_id` -- both use the same SDXL base, `DEFAULT_MODEL_ID`). **`sdxl`** (renamed from `default` 2026-06-09; `default` kept as a back-compat alias via `normalize_profile`) runs plain SDXL img2img (`_run_img2img`); it is the lighter opt-down alternative (no ControlNet weights). **`controlnet`** (**the DEFAULT pipeline since 2026-06-09** for `invisible`/`all`/`batch` and both engine ctors; `_run_controlnet`, `_load_controlnet_pipeline`) runs `StableDiffusionXLControlNetImg2ImgPipeline` with the SDXL-native canny ControlNet `xinsir/controlnet-canny-sdxl-1.0` (`watermark_profiles.CONTROLNET_CANNY_MODEL`): the control image is `cv2.Canny(gray, 100, 200)` stacked to 3 channels (`_CANNY_LOW`/`_CANNY_HIGH`, prompt `_CONTROLNET_PROMPT` / `_CONTROLNET_NEGATIVE`). **Removal comes from the img2img regeneration (`strength`); the ControlNet only PRESERVES text and face STRUCTURE via the edge map.** No original pixels are copied or frozen, BUT **validation 2026-06-04 disproved the old "so SynthID does not survive" claim: SynthID CAN survive controlnet on photoreal/high-detail content.** At the shared low removal strength the canny edge-conditioning keeps the regeneration so close to the original that the pixel perturbation that destroys SynthID does not happen (oracle-confirmed: an OpenAI bracelet photo + a 9-face grid read **SynthID-detected** after controlnet at strength 0.10/0.15, but **SynthID-not-detected** after the `default` pipeline at the SAME strength + resolution -- only the pipeline differed). **But the reverse also holds: a flat-graphic logo/poster SURVIVED `default` while clearing controlnet** -- removal at the low strength is content×pipeline dependent and neither pipeline is universally safe; the real lever is a higher strength. See the controlnet Known-limitations bullet for the full table + root cause. Canny holds face STRUCTURE but NOT identity (the regenerated face drifts in likeness -- canny carries edges, not identity). The drifted cleaned face is the LEAST-AI state we can reach without re-introducing SynthID; the library does NOT ship a face-restore extra. Every restore approach we evaluated (GFPGAN-on-cleaned, PhotoMaker-V2 txt2img, InstantID txt2img, InstantID img2img-on-cleaned at three parameter sweeps, 2026-06-04 - 2026-06-08 Modal cert sweeps) regenerated the face from an ArcFace embedding via SDXL diffusion -- which makes the output face look MORE AI-generated, not less. Empirical conclusion in `docs/synthid-robust-identity-research-2026-06-08.md` "Empirical follow-up". For production face preservation, ship the cleaned image as-is. `controlnet_conditioning_scale` (ctor arg, default 1.0) is the structure-preservation knob. Same dtype rule as `default` (fp32 on cpu/mps, fp16 only on cuda/xpu; the fp16-fixed SDXL VAE `_SDXL_FP16_VAE_ID` is swapped in on fp16 GPUs -- issue #29) and the same MPS->CPU fallback (reload on cpu/fp32, drop a non-cpu generator, retry once). -- **`auto_config.py` + the content-detection layer were REMOVED 2026-06-09.** History: `auto_config.plan()` was a content-adaptive planner that detected faces/text/edges (bundled OpenCV YuNet + PP-OCRv3 DBNet models) to route the pipeline and toggle the adaptive polish. Once `controlnet` became the default-and-only auto pipeline (it no longer downgrades a structure-less image to `sdxl`) and the adaptive polish was confirmed to **self-gate by detail level** (`humanizer.adaptive_polish` no-ops when the cleaned image already meets the input's Laplacian variance, so it does real work only on over-smoothed photo/face texture and ~nothing on text/flat), the detection no longer changed any behavior — it only annotated a `reason` string. So the whole layer was deleted: `auto_config.py`, `tests/test_auto_config.py`, and the two detection assets (`assets/face_detection_yunet_2023mar.onnx`, `assets/text_detection_ppocrv3_2023may.onnx`, ~2.6 MB). **`--auto` is now a DEPRECATED no-op** (`cli._resolve_auto_polish`): controlnet is already the default pipeline AND the adaptive polish is ON by default, so `--auto` has nothing left to do — it only prints a deprecation warning and passes `adaptive_polish` through unchanged (an explicit `--no-adaptive-polish` still wins). (Originally it re-enabled the polish; once the polish default flipped to ON the same day, the parameter-source branch became dead and was dropped.) The **adaptive polish itself lives on** in `humanizer.adaptive_polish` (CLI `--adaptive-polish/--no-adaptive-polish`, **ON by default since 2026-06-09** — it self-gates to a no-op where there is no detail deficit, so default-on is safe; uses the full-res original as the detail reference) — see the `humanizer` test note. `batch` resolves the polish once before the loop (one warning) and caches the invisible engine per pipeline (`ctx.obj["_inv_engines"]`). -- `upscaler.py` — optional Real-ESRGAN pre-diffusion super-resolution for small inputs (spandrel boundary, top-of-file pyright pragma). `is_available()` gates on spandrel+torch (via `importlib.util.find_spec`); `upscale(bgr, device=None)` loads a lazily-built spandrel `ImageModelDescriptor` singleton (double-checked lock) and upscales by the model's native factor (x2), with a non-CPU→CPU device fallback mirroring the diffusion engine's MPS→CPU retry. Weights (`RealESRGAN_x2plus.pth`, BSD-3-Clause) download on first use to the `torch.hub` checkpoints cache; never bundled. Used only when UPscaling to the `min_resolution` floor (a `max_resolution` downscale always uses Lanczos). The wiring is `InvisibleEngine._esrgan_upscale(pil, target)` — Real-ESRGAN at native factor, then a Lanczos resize to the exact target, falling back to a plain Lanczos resize if the extra is absent or the model errors (so an optional upscaler can never break removal). The default `--upscaler` is `lanczos` (cv2, no deps). **ESRGAN is a generic photo/texture GAN with no face/glyph prior**, so it best fits photo/texture content and can degrade faces (glassy/asymmetric eyes -- the diffusion pass regenerates faces so the full-pipeline final recovers) and thin/small text (the GAN invents wrong strokes, and low-strength diffusion will not fix it). Verified 2026-06-04: isolated upscale lap-var ~5x Lanczos on faces+textures but glassy eyes; end-to-end `invisible` final lap-var 1634 vs Lanczos 663 with natural faces (diffusion cleaned the artifact). Kept a **manual opt-in knob** (the auto plan never selects it) with `lanczos` the default; not content-gated by design (use Lanczos for text-heavy inputs). spandrel is MIT and pulls no basicsr. Unit-tested without the model: `tests/test_upscaler.py` (availability guard + the not-installed RuntimeError) and `tests/test_invisible_engine.py::TestEsrganUpscale` (the three `_esrgan_upscale` branches via a monkeypatched `upscaler`). -- `image_io.py` — Unicode-safe cv2 IO (issue #17). `imread(path, flags=None)` / `imwrite(path, img)` wrap `np.fromfile`+`cv2.imdecode` / `cv2.imencode`+`tofile` so non-ASCII paths work on Windows -- bare `cv2.imread`/`cv2.imwrite` use the platform ANSI code-page API there and fail (empty decode + `can't open/read file`) on Chinese/Cyrillic/accented filenames. `imread` keeps `cv2.imread` semantics (defaults to `IMREAD_COLOR`, returns `None` on missing/empty/undecodable). **Every cv2 file read/write in the package routes through here; do not call `cv2.imread`/`cv2.imwrite` directly.** `imwrite` returns `False` on an unwritable path (`OSError` caught) instead of raising, matching `cv2.imwrite` semantics. macOS/Linux already accept UTF-8 paths, so it is behavior-neutral there (the bug only reproduces on Windows). **`to_bgr(image)` (added 2026-06-09)** is the shared channel normalizer: promotes 2D grayscale / (h,w,1) / 4-channel BGRA to 3-channel BGR (a 3-channel input is returned unchanged, no copy). Use it instead of inlining the `cvtColor(GRAY2BGR/BGRA2BGR)` branch — the gemini engine and the `TextMarkEngine` base both route through it so a grayscale/BGRA input (a real Gemini-app export is opaque RGBA) does not crash the `axis=2` channel reductions. cv2/numpy are imported lazily inside the functions, so the module is cheap to import in a bare env. +Compact map. The full per-module detail (design decisions, tuned thresholds, calibration history, incident records, and the regression-guard map) lives in `docs/module-internals.md` — **read the relevant section there before changing any module below.** -### Doubao clean-reverse-alpha distillation (re-investigated 2026-05-29) +- `noai/c2pa.py` — PNG caBX chunk parser (`extract_c2pa_chunk` / `has_c2pa_metadata` / `extract_c2pa_info`). Do not reimplement chunk parsing; chunk reads are clamped to the remaining file size by design. +- `noai/constants.py` — the single `C2PA_AI_VENDORS` registry (+ `C2PA_SOFT_BINDINGS`) from which `C2PA_ISSUERS` / `SYNTHID_C2PA_ISSUERS` / `identify._ISSUER_PLATFORM` are all derived. Add a new vendor as one registry entry; never edit the derived dicts and never add inline. +- `metadata.py` — `scan_head(path)` is the shared (memoized) input for every C2PA/AIGC/IPTC byte scan; use it instead of `open().read(1MB)` for any new marker scan. Also home to `synthid_source`, `xai_signature`, `iptc_ai_system`, `aigc_label`, `huggingface_job`, `samsung_genai`, and `remove_ai_metadata` (fail-safe `strip_c2pa_boxes`). +- `identify.py` — aggregates every locally-readable signal into one `ProvenanceReport`; `is_ai_generated` is True or None, never asserted False. `import identify` is deliberately light (lazy `noai/__init__`, fits a 512 MB host) — keep heavy imports out. Add capture-camera tokens to `_DEVICE_C2PA_PLATFORM` only when verified against a real C2PA file; editing-app/AI-device signer tokens go to `_SIGNER_C2PA_PLATFORM`; generator/issuer platforms to `C2PA_AI_VENDORS` in `constants.py`. Integrity-clash detection is high-precision by design (only hard generator stamps feed it, source-grouped independence). +- `watermark_registry.py` — the single catalog of known visible watermarks (gemini / doubao / jimeng / samsung), reverse-alpha based by policy. Add a new visible text mark = one `_text_mark(...)` row + a `TextMarkConfig` with a captured alpha map; do not re-add per-mark `if` branches. `cli._write_bgr_with_alpha` must NOT zero alpha in the watermark bbox (issue #30 white-box regression). +- `gemini_engine.py` — visible Gemini-sparkle remover/detector (cv2/numpy, no GPU): corner-promote, over/under-subtraction guards, false-positive gate, self-verify repair. Keep the 0.85 corner-promote NCC gate; a margin/chroma-gated lower promote was measured and REJECTED 2026-06-11 (~33% FP on non-Google content). Gate any removal candidate on a physical brightness check, not the detector alone. +- `_text_mark_engine.py` — shared base for the three reverse-alpha text-mark engines (extracted 2026-06-09); the per-engine modules are config-only subclasses. New text mark = a `TextMarkConfig` + a thin subclass + one registry row. Gemini stays a separate engine (different model). +- `doubao_engine.py` / `jimeng_engine.py` / `samsung_engine.py` — thin `TextMarkEngine` subclasses: Doubao "豆包AI生成" (bottom-right), Jimeng "★ 即梦AI" (bottom-right), Samsung Galaxy AI "✦ Contenuti generati dall'AI" (bottom-LEFT, locale-specific — Italian variant calibrated). Removal = reverse-alpha (always-align) + thin residual inpaint. A detector-only removal test is insufficient — assert visual residual (the textured-shift tests). +- `region_eraser.py` — universal region eraser (`erase` CLI): cv2 backend default (no deps), optional big-LaMa via onnxruntime (~3.5-4 GB peak RAM, ~5-6 s/call CPU — does not fit a minimal droplet). +- `invisible_watermark.py` — decodes the OPEN DWT-DCT watermarks (SD / SDXL / FLUX) via `imwatermark` (extra `detect`, pulls torch). Fragile: does not survive JPEG re-encode/resize, so it confirms origin only on pristine files. +- `trustmark_detector.py` — Adobe TrustMark open decoder (extra `trustmark`). Do NOT remove the JPEG re-encode false-positive gate — a lone TrustMark hit without it is almost always content noise. +- `noai/watermark_remover.py` — `WatermarkRemover` with two diffusion pipelines selected by the explicit `pipeline` ctor arg, never inferred from `model_id`: `sdxl` (plain SDXL img2img) and `controlnet` (SDXL + canny ControlNet, **the DEFAULT since 2026-06-09**). Removal comes from the img2img `strength`; ControlNet only preserves text/face STRUCTURE — SynthID CAN survive controlnet on photoreal content at low strength. No face-restore extra ships, by validated decision (every restore approach looked MORE AI-generated). +- `auto_config.py` + the content-detection layer were REMOVED 2026-06-09; `--auto` is a deprecated no-op (controlnet is the default pipeline and the adaptive polish is ON by default and self-gates to a no-op where there is no detail deficit). +- `upscaler.py` — optional Real-ESRGAN pre-diffusion super-resolution for small inputs (extra `esrgan`, spandrel only). Manual opt-in; the default `--upscaler` stays `lanczos` and the engine always falls back to Lanczos on absence/error. ESRGAN can degrade faces and thin text. +- `image_io.py` — Unicode-safe cv2 IO (issue #17). Every cv2 file read/write in the package routes through `imread`/`imwrite`; do not call `cv2.imread`/`cv2.imwrite` directly. `to_bgr(image)` is the shared channel normalizer — use it instead of inlining `cvtColor` branches. -**RESOLVED 2026-05-29: black+gray Doubao captures were obtained and a reverse-alpha is built** (`doubao_engine.remove_watermark_reverse_alpha`, `assets/doubao_alpha.png`; see the `doubao_engine.py` bullet above). The captures (`data/doubao_capture/captures/`, now committed) confirmed the alpha-composite model: on black `captured = a*logo`, logo pure white. **UPDATE 2026-05-31 (issue #13 follow-up): the first build was NOT "exact"** — it left a readable "豆包AI生成" outline on the real sample (the detector was fooled, conf 0.0). The alpha is now rebuilt by `scripts/visible_alpha_solve.py` (the careful gray-self solve shared with Jimeng), removal always-aligns + thin-inpaints, and the locate box was widened; see the `doubao_engine.py` bullet. The notes below (the failed content-image distillation) are retained as the record of why controlled captures were necessary. +For the Doubao alpha-distillation history (why content-image reverse-alpha distillation fails by physics and controlled captures were required), see `docs/research-doubao-distillation.md`. -**Conclusion (historical): pure reverse-alpha distilled from content images does NOT work, and the blocker is the WRONG kind of data, not too little of it.** The earlier framing ("need ~5-8 PRISTINE same-resolution originals") is obsolete -- `data/spaces/originals/` holds plenty. Curate them with `DoubaoEngine.detect` + an NCC filter against a clean glyph template, keeping only marks at offset ≈ (0,0): that yields e.g. **15 pixel-aligned 2048² marks** (sub-pixel drift, not the ±50 px the old lossy/mixed-res scrapes had), plus 1086x1448 / 1792x2400 clusters. With those, LaMa-clean `O` + weighted-LS (and per-pixel I-on-O regression) for `α` (+ logo colour) was tried end-to-end and **still leaves a persistent ghost outline.** +## Watermarking landscape -Diagnosed why, empirically (cached stacks, `/tmp/doubao_distill`): (1) the mark is a clean white overlay with **no dark halo** -- over glyph pixels ~54% are brighter than the clean bg, only ~4% darker -- so the white-logo model `I=(1-α)O+α·255` is correct; (2) but content backgrounds are almost never dark *under* the mark (median darkest available bg over glyph pixels = **58/255**; only ~13% of mark pixels are ever observed on a bg < 40), so on bright backgrounds the equation is ill-conditioned and `α` is unidentifiable; (3) LaMa's `O` is a plausible **hallucination**, not the true pre-mark background, which compounds the error, and per-pixel regression on ~15 obs overfits into colour noise. - -**Why Gemini's engine is clean (verified in GeminiWatermarkTool `src/core/watermark_engine.cpp`): its alpha map is the watermark stamped on a PURE-BLACK background**, where `watermarked = α·255 + (1-α)·0 = α·255`, so `alpha = capture/255` exactly -- no estimation. (`gemini_bg_*.png` is literally the sparkle in grey on black.) So the real Doubao unlock is the same controlled capture, **not more content images**. Black/white/gray seeds exist (`data/doubao_capture/seeds/seed_*_1x1_2048x2048.png`); a capture run (feed a black seed through doubao.com edit mode, download the *original*) was requested from the #13 reporter 2026-05-29. With ~2-3 black captures we get `α = capture/255` for free, Gemini-quality. - -**Until black captures arrive, the shipped direction is precise canonical glyph mask + inpaint (cv2 default, lama optional), NOT reverse-alpha.** The consensus glyph silhouette across the aligned marks distills cleanly (proto: a tight "豆包AI生成" strip, width ≈ 0.156 × image-width) and is good both as an exact inpaint mask and as an NCC localiser -- the latter also fixes the #23 detector false-positives (match the real glyph shape, not any bright low-saturation corner). Do **not** retry content-image reverse-alpha: it is data-limited by physics (no dark-background observations), not by effort. - -## Watermarking landscape (research 2026-05-24) - -Who embeds what, and whether it is locally detectable (so we know which gaps are fillable). See `identify.py` for what we read. -- **Locally detectable (open decoder, no key/API):** Stable Diffusion / SDXL / FLUX via `imwatermark` DWT-DCT (now covered by `invisible_watermark.py`). FLUX uses the same library (`black-forest-labs/flux2` `src/flux2/watermark.py`, 48-bit `0b001010101111111010000111100111001111010100101110`); SDXL is the diffusers `WATERMARK_MESSAGE` (`0b101100111110110010010000011110111011000110011110`). Caveat: fragile to re-encoding. -- **C2PA / IPTC (covered by the issuer/marker scan):** OpenAI, Google, Adobe Firefly, Microsoft (Designer + **Bing Image Creator** — collected 2026-05-24; Bing now runs Microsoft's own **MAI-Image** model, signs C2PA as "Microsoft", NOT OpenAI/DALL-E), and **Stability AI** (collected from Brand Studio / DreamStudio successor; signs C2PA as "Stability AI Ltd", no SynthID, no imwatermark on its current Stable Image model — issuer added to `C2PA_ISSUERS`). Still unsampled: Canva (its downloads are re-encoded design *exports* that strip C2PA, so a Canva "positive" is inconclusive — skipped), Getty, Shutterstock. Midjourney embeds NO C2PA and no invisible watermark (our `mj-*` sample carried only the IPTC tag). **Samsung Galaxy AI** (Generative Edit / Sketch to Image / Portrait Studio on Galaxy S23 FE / S24 / S25, One UI 7+) signs C2PA as "Samsung Galaxy" with the standard `trainedAlgorithmicMedia` source type AND a proprietary `genAIType` marker; verified on real signed files 2026-05-29 (the standard scan catches the source type; `genAIType` additionally catches a Galaxy S24 file that omits it). It ALSO burns a **visible** localized wordmark into the pixels — a sparkle + "generated with AI" string in the bottom-LEFT corner (issue #37; the Italian "✦ Contenuti generati dall'AI" variant is calibrated) — removed by `samsung_engine.py` / `visible --mark samsung` (reverse-alpha, see the engine bullet); detection feeds `identify` as the medium `visible_samsung` signal. The string is locale-specific, so each locale needs its own captured alpha template. **ASUS Gallery** also signs edited photos as C2PA (`com.asus.gallery`) but with no AI source type — a signer, not an AI marker. **Black Forest Labs (FLUX)** API output signs C2PA: `claim_generator_info "Black Forest Labs API"` + a `c2pa.ai_generated_content` assertion + `trainedAlgorithmicMedia` (issuer `b"Black Forest Labs"` added to `C2PA_ISSUERS`, platform "Black Forest Labs (FLUX)"). **ByteDance Volcano Engine (Volcengine)** — the cloud behind Doubao / Jimeng — signs its AI image output with a cert from `certificate_center@volcengine.com` + `trainedAlgorithmicMedia` (issuer `b"volcengine"` → "ByteDance (Volcano Engine)", platform "ByteDance (Doubao / Jimeng / Volcano Engine)"); note this is the C2PA-signed surface, distinct from the XMP/PNG TC260 `AIGC` label Doubao also uses. All three verified on real signed files 2026-05-29. -- **EXIF/XMP generator tag (caught by `exif_generator`):** **Ideogram** writes EXIF `Make="Ideogram AI"` (collected 2026-05-24 — no C2PA, no SynthID, no imwatermark; the Make tag is the only signal). -- **xAI / Grok — its own EXIF signature scheme, NOT C2PA (DETECTED by `metadata.xai_signature`, built 2026-05-26).** Grok JPEG downloads (Aurora model) carry **no C2PA, no XMP, no SynthID, no IPTC** — only EXIF `Artist` = a UUID and EXIF `ImageDescription` = `Signature: ` (a crypto signature, unverifiable locally without xAI's public key). This empirically kills the earlier unverified "xAI signs C2PA as xAI" lead — xAI is not even a C2PA member. `exif_generator` misses it (neither field holds an `AI_GENERATOR_TOKENS` token), so a dedicated detector `xai_signature(path)` matches the pair (`ImageDescription ~ ^Signature: [A-Za-z0-9+/=]{64,}` AND UUID `Artist`); wired into `has_ai_metadata`, `get_ai_metadata` (key `xai_signature`), and `identify` (signal `xai_signature`, platform "xAI (Grok / Aurora)"). **Format confirmed stable across n=3 genuine generations:** exactly three EXIF tags (`Artist`, `ExifOffset`, `ImageDescription`), `Signature:` prefix constant, base64 payload 300-1004 chars. Two capture facts: (a) the `Artist` UUID **equals the public image id** in the asset URL (`https://imagine-public.x.ai/imagine-public/images/.jpg`), so it is NOT a private per-user secret — only the `Signature` blob is; (b) the Grok web-UI image is a re-encoded **WebP with no signature** — the EXIF survives only in the *original* JPEG (download button or that public tokenless URL), which is why screenshots / re-encodes are metadata-stripped. A real fixture `data/samples/grok-1.jpg` plus **synthetic** JPEG fixtures (fake UUID + fake `Signature:` blob) cover the detector; never add a real Grok image carrying private content (the repo is public). **Stripped on removal too:** `remove_ai_metadata` now calls `_scrub_ai_exif` on the JPEG EXIF, which deletes the xAI Signature+UUID-Artist pair **and** any `Software`/`Make`/`Artist`/`ImageDescription` tag holding an `AI_GENERATOR_TOKENS` token (so Ideogram's `Make="Ideogram AI"` is scrubbed too), while keeping genuine camera/editor EXIF. The shared `_is_xai_signature_pair` helper (module-level compiled regexes) is the single source of truth for the pattern, used by both `xai_signature` and `_scrub_ai_exif`. (AVIF/HEIF/JXL still strip only C2PA boxes via `isobmff`, not EXIF — unchanged.) -- **China TC260 AIGC label (caught by `AIGC_MARKERS` / `metadata.aigc_label`, surfaced by `identify` as the `aigc` signal):** China-served generators embed an XMP `{"Label":"1","ContentProducer":...}` block — China's mandatory AI-content labeling (TC260 namespace `tc260.org.cn/ns/AIGC`). **Doubao** (ByteDance) uses it (verified on the real #13 sample 2026-05-25; `ContentProducer` `001191110102MACQD9K64010000`, no C2PA/SynthID/imwatermark — the XMP block is the only signal; GitHub attachment upload did NOT strip it). The same standard is mandatory for Jimeng/Kling/Qwen/Ernie etc., so the one marker covers the whole China-AIGC-labeled ecosystem. `aigc_label` reads **four serializations** through a shared `_parse` helper: the HTML-entity-encoded XMP `TC260:AIGC` block in **either RDF form** — the nested element `{...}` (Doubao) or the attribute `TC260:AIGC="{...}"` (**PicWish**, `ContentProducer="picwish"`, verified on the corpus 2026-05-30) — via a container-agnostic raw-byte scan (any JSON object accepted), a raw-JSON PNG `AIGC` tEXt chunk (Doubao also writes the label this way, no namespaced marker at all — confirmed on the corpus 2026-05-28, `ContentProducer="doubao"`), a bare raw-JSON `{"AIGC":{...}}` object embedded in **JPEG EXIF (UserComment)** by some China-served generators, brace-matched from the scan head with `json.JSONDecoder().raw_decode` (no namespaced marker, no PNG chunk — confirmed on the corpus 2026-05-30, `ContentProducer="001191440300708461136T1308L"`), **and** a bare `AIGC{...}` blob (the label glued straight to its JSON, no `"AIGC":` key wrapper) embedded in a **JPEG APP segment near the JFIF header** — confirmed on the corpus 2026-06-10 (`ContentProducer="00119144030008867405X210002"`; 3 files read `unknown` before this form was added). The two raw-JSON forms are scanned in one loop (`'"AIGC"'` then `AIGC{`) that **falls through on a non-TC260 / undecodable hit instead of returning** — a quoted `"AIGC"` can appear later in an XMP packet while the real label is a bare `AIGC{...}` earlier in the file, so an unconditional early return on the quoted form would shadow the bare form (the exact bug behind the 06-10 misses). All three generic forms (the PNG chunk, the bare `{"AIGC":...}` object, and the bare `AIGC{...}` blob) are gated on at least one TC260 field (`_TC260_FIELDS`) so a generic `AIGC` key cannot false-positive; the namespaced XMP element is unambiguous and needs no gate. In `identify`, `aigc` fires on the parsed label **or** the `AIGC_MARKERS` byte scan (the latter preserves the laundering-tell case where the JSON payload is truncated). -- **HuggingFace-hosted job (caught by `metadata.huggingface_job`, surfaced by `identify` as the `hf_job` signal, MEDIUM confidence):** HuggingFace Jobs / Spaces stamp generated PNGs with an `hf-job-id` tEXt chunk holding the job UUID (3 on the corpus 2026-05-28, no other signal). It marks the *hosting job*, not a model — most commonly diffusion output — so it lifts an Unknown verdict to a tentative AI via `hf_only` (parallel to the visible sparkle) but never overrides a hard metadata signal; `_HF_JOB_CAVEAT` states the limit (job, not model; not proof of AI pixels). Stripped on removal (the PNG save whitelist keeps only `STANDARD_METADATA_KEYS`, so `hf-job-id` and the `AIGC` chunk are both dropped). The exact writer is not authoritatively documented (HF Jobs are generic GPU jobs), hence medium not high. -- **No detectable signal on download (correctly reported `unknown`):** **Recraft** (PNG export is a re-encoded design export — strips everything), **Krea hosting FLUX 2** (no imwatermark despite FLUX — the host omits the encoder, same as Stability's hosted SDXL), and Midjourney (embeds nothing). Lesson: the imwatermark detector only fires on *pristine* output from a pipeline that runs the encoder (diffusers default, official BFL), not from re-hosts (Krea/Stability) or re-encoded exports (Recraft/Canva). -- **Invisible but NOT locally detectable (proprietary, API/oracle only — same wall as SynthID):** Amazon Titan Image Generator + Nova Canvas (Bedrock `DetectGeneratedContent` API), Kakao (new SynthID image adopter, May 2026), NVIDIA Cosmos (SynthID video). No local detector possible; treat like SynthID. -- **C2PA 2.4 "Durable Content Credentials" (April 2026; verified against the spec) raise the bar for metadata stripping.** 2.4 defines soft bindings (an invisible watermark or a content fingerprint) plus a server-side manifest repository and a new `c2pa.repository-receipt` assertion. Per the spec: "if a C2PA manifest is removed from an asset, but a copy of that manifest remains in a provenance store elsewhere, the manifest and asset may be matched using available soft bindings." So our local `metadata --remove` deletes the *embedded* manifest, but a fingerprint/watermark soft binding can still re-link the image to its manifest in a repository server-side. Stripping the file is becoming necessary-but-not-sufficient against durable provenance. (Our parsers target the stable embedded-manifest format documented in C2PA 2.1 §11; that format is unchanged in 2.4 -- the new pieces are repository/soft-binding infra, not the on-file box layout, so no parser change is implied.) Spec: https://spec.c2pa.org/specifications/specifications/2.4/specs/C2PA_Specification.html We now READ the soft-binding `alg` (`C2PA_SOFT_BINDINGS` / `soft_binding_vendors_in`) to name the forensic-watermark vendor, and locally DECODE the one open scheme, Adobe TrustMark (`trustmark_detector`); the rest (Digimarc/Imatag/Steg.AI/...) stay name-only (proprietary decoders). -- **Built 2026-05-26 (this batch):** soft-binding `alg` vendor detection; IPTC Photo Metadata 2025.1 AI-disclosure fields (`AISystemUsed` etc.); **video C2PA metadata** detect + strip for MP4/MOV/M4V (free — `isobmff.py` is format-agnostic, MP4 is ISOBMFF); Adobe TrustMark open decoder. NOT done (out of cheap reach, per the feasibility review): visible video-logo removal (needs a video frame pipeline) and audio (SynthID/ElevenLabs/Resemble/Suno all oracle-only or unmarked). **Box detection window — now handled (v0.6.8):** detection no longer relies on a fixed first-MB read. `metadata.scan_head(path, size)` reads the first `size` bytes and, for ISOBMFF, appends the payloads of late provenance boxes found by `isobmff.scan_c2pa_region` (a file-seeking top-level box walker that skips past `mdat` by size without reading it), so a C2PA/AIGC/IPTC manifest placed AFTER a large `mdat` in a streaming/non-faststart MP4 is now caught. Every C2PA/marker byte scan (`has_ai_metadata`, `aigc_label`, `iptc_ai_system`, `synthid_source`, `exif_generator` XMP, `get_ai_metadata` soft-binding, and `identify`) goes through `scan_head`; it is behavior-neutral for non-ISOBMFF inputs (exactly `f.read(size)`). **Meta-box XMP removal — now handled (v0.6.9):** an AI-label XMP packet stored as a meta-box `mime` item (HEIF/AVIF; out of reach of the top-level box stripper) is blanked in place by `isobmff.blank_ai_xmp_packets` — it locates the packet by its `` delimiters and, if it carries an AI marker (`_AI_LABEL_MARKERS`), overwrites it with spaces of the SAME length, so box sizes / `iloc` offsets stay valid and the coded image is untouched (selective: plain non-AI XMP is left alone, mirroring the top-level uuid logic). Wired into `remove_ai_metadata`'s ISOBMFF branch after `strip_c2pa_boxes`. The remaining gap is an `Exif` meta-box *item* (rare; the AI labels are XMP) — still needs `iinf`/`iloc` surgery or exiftool. -- **Regulatory driver (context, not a code change):** AI-content labeling mandates are expanding, which pushes more generators toward exactly the C2PA + watermark signals we read. The full per-jurisdiction table lives in README "## Legal" -- keep it there, not duplicated here. Newly added + primary-source verified 2026-05-26: **EU AI Act Article 50** machine-readable marking applicable **2026-08-02** (verified against the article text); **South Korea AI Framework Act Art. 31(3)** in force since **22 January 2026** (verified via Kim & Chang + FPF/Korea Times; Enforcement Decree accepts an invisible-watermark label); **California AB 853** (amends the CA AI Transparency Act) latent-disclosure duty operative **2026-08-02**, requiring a disclosure "permanent or extraordinarily difficult to remove" (verified against the leginfo bill text -- this is the exact disclosure our tool strips); **India IT Amendment Rules 2026** in force **2026-02-20** (verified via Chambers), which prominently-label + permanent-provenance-id all synthetic media AND **expressly prohibit removing/suppressing the label or metadata** -- the first major all-content removal ban outside China. **Removal liability (README "## Legal" disclaimer):** the tool is lawful general-purpose software; liability sits with the remover and is intent-gated -- downstream acts (fraud/deception/IP), plus US DMCA 17 USC 1202 (removing copyright-management info to conceal infringement), plus the removal-as-such bans in China + India. When extending the README table, verify each date/article against the statute/bill text before committing, not against search summaries. +Who embeds what (C2PA / IPTC / EXIF / TC260 AIGC / xAI signature / open and proprietary invisible watermarks), whether each is locally detectable, the C2PA 2.4 durable-credentials implications, and the regulatory driver table live in `docs/watermarking-landscape.md` (research 2026-05-24, updated through 2026-06-10). Read it before adding a new `identify` signal, vendor token, or metadata marker. See `identify.py` for what we read today. ## Known limitations -- `invisible` pipeline processes at **native resolution for inputs whose long side is >= 1024px**, and **auto-upscales smaller inputs UP to a 1024px floor** (`min_resolution=1024`, the default; `--min-resolution 0` disables) before diffusion -- SDXL img2img distorts badly on a tiny latent (a 381x512 portrait wrecks at native, the #36 follow-up), and the output is restored to the original input size so the floor is a transparent quality boost (it adds time/memory on small inputs). The floor upscale uses Lanczos by default; **`--upscaler esrgan`** (opt-in, the `esrgan` extra) runs Real-ESRGAN first for better detail before the Lanczos resize to the exact target (`upscaler.py` / `InvisibleEngine._esrgan_upscale`, falls back to Lanczos if the extra is absent). `max_resolution=0` (default) means no downscale cap, matching the hosted raiw.cc backend (fal fast-sdxl, no pre-downscale). The old forced downscale-to-1024 -> upscale-back round-trip for LARGE images was the main quality loss (issue #10) and is gone; at strength ~0.05 SDXL img2img does not need a downscale. **Final `--unsharp` post-filter (`humanizer.unsharp_mask`, opt-in, default 0):** applied LAST (after the face-restore pass, else it would be smoothed over) to counter the soft/over-smoothed look diffusion + restoration leave (an AI tell); ~0.5-0.8 safe, higher risks halos. Pairs with `--humanize` (grain adds sensor-noise texture, unsharp adds crispness). `--max-resolution N` re-introduces an opt-in long-side cap purely to bound GPU/MPS memory on very large inputs (it reintroduces the lossy round-trip). For huge images that OOM at native, tile-based diffusion is still the proper long-term fix. **Concrete MPS data points (the OOM is memory-tier-dependent, NOT a hard MPS limit):** on a ~24 GB unified-memory machine (verified 2026-05-25, 1254x1254 gpt-image SDXL, fp32) native res OOMs at the *UNet* step (peak ~17 GiB), not only the VAE decode, and the auto-fallback in `img2img_runner` reloads on CPU and finishes (slow, ~13 min) -- the output is still weight-identical and defeats SynthID, so "looks hung/crashed" on Mac is usually this CPU fallback, not a pipeline error. On a **32 GB** unified-memory machine the same default SDXL pass runs entirely on MPS with **no CPU fallback** (verified 2026-05-31, 1122x1402 gpt-image, `all`/default, ~155 s end-to-end), so 32 GB clears the native-res UNet peak that 24 GB could not. Adding `enable_vae_tiling()` alone does NOT prevent the 24 GB OOM (the peak is the UNet, not the VAE). The fast Mac workarounds for memory-constrained machines are fp16 on MPS (roughly halves memory) or `--max-resolution` to cap the long side; neither is wired as the default. The `controlnet` pipeline adds the canny ControlNet weights on top of SDXL, so its peak is a bit higher than the plain `default` pass; the same MPS->CPU fallback covers an OOM. The native-vs-cap-vs-floor decision lives in the pure helper `invisible_engine._target_size(w, h, max_resolution, min_resolution)` (returns `None` for native, a target tuple for a downscale cap OR an upscale floor; cap takes precedence, the floor is skipped on a min>max misconfig) so it is unit-tested (`tests/test_invisible_engine.py::TestTargetSize`, the #10/#15/#36 regression guard) without loading the model -- keep that logic in the helper, don't re-inline it. -- **fp16 VAE black-output fix (issue #29, 2026-05-30):** on a **CUDA/XPU fp16** backend the stock SDXL VAE overflows to NaN and the *plain* img2img path decodes to an **all-black** image (reproduced on the raiw.cc result: a 1086x1448 input -> a uniformly black 4.6 KB PNG, mean 0). `watermark_remover._load_pipeline` / `_load_controlnet_pipeline` swap in the fp16-fixed SDXL VAE (`madebyollin/sdxl-vae-fp16-fix` = `_SDXL_FP16_VAE_ID`) when `_needs_fp16_vae_fix(model_id, DEFAULT_MODEL_ID, is_fp16)` is true -- only the default SDXL checkpoint on fp16. **cpu/mps run fp32** (the stock VAE is fine there, which is why the bug never reproduces on Mac). A custom non-SDXL `model_id` keeps its own VAE (the fp16-fix VAE is SDXL-architecture-specific). The decision is a pure helper, unit-tested without a download (`tests/test_platform.py::TestFp16VaeFix`); the actual black->clean recovery needs a CUDA GPU. **Confirmed on real CUDA hardware 2026-06-03:** running `all` on a 1086x1448 OpenAI gpt-image (the #29 repro size) at fp16 produced a normal (non-black) output, so the fp16-fix VAE swap resolves the all-black decode. (It was not reproducible on this MPS machine, which runs fp32, so the verification had to happen on an NVIDIA box.) **Follow-up safety net (issue #41, 2026-06-04):** the swap is gated to `model_id == DEFAULT_MODEL_ID`, so a custom model, a stale pre-fix install, or a fal/custom loader can still hit the black decode -- a new reporter did (gpt-image 1448x1086, the #29 size, with the exact `image_processor.py:142 invalid value encountered in cast` warning the NaN->0 cast emits). `remove_watermark` now adds a model-agnostic backstop: after generation, if the run was fp16 AND the output is degenerate (`_is_degenerate_image`: mean and std both below `_DEGENERATE_THRESHOLD` 1.0 -- a uniform all-black/NaN frame; the variance guard spares a legitimately dark-but-textured photo), it rebuilds the pipeline in fp32 on the SAME device and re-runs once. fp32 is the verified-clean path, so the user never gets a black image regardless of model_id/version. Mirrors the existing MPS->CPU fallback's self-mutation pattern (reset `torch_dtype` + clear `_pipeline`/`_controlnet_pipeline`); `batch` inherits it through `remove_watermark`, and once one image trips it the rest of the batch stays on the safe fp32. The detector is a pure helper, unit-tested without a model (`tests/test_platform.py::TestDegenerateOutputGuard`); the full fp16->detect->fp32-retry chain was verified e2e on this MPS machine by forcing fp16 with the swap disabled (first pass black, guard fired, retry produced a normal image). CAVEAT: the fp32 retry uses ~2x memory, so on a VRAM-constrained GPU it can OOM (a visible error, still better than a silent black frame; the MPS->CPU fallback covers that path). The reporter's "CPU also black" symptom is NOT reproducible here -- fp32 (cpu/mps) decodes clean -- so it points at an old version or a non-fp32 run, pending their version + command. +Compact list. Full measurements, incident history, and oracle-validation runs live in `docs/known-limitations.md` — **read the relevant section there before changing the diffusion pipelines, strength defaults, resolution handling, or metadata coverage.** + +- `invisible` processes at native resolution for inputs >= 1024px long side and auto-upscales smaller inputs to a 1024px floor (`--min-resolution 0` disables; `--max-resolution N` is an opt-in cap to bound GPU/MPS memory). MPS OOM is memory-tier dependent, not a hard limit: ~24 GB unified memory falls back to CPU (slow but weight-identical output), 32 GB runs native on MPS. The native-vs-cap-vs-floor decision lives in the pure helper `invisible_engine._target_size` — keep the logic there, unit-tested without the model. +- fp16 VAE black-output (issues #29/#41): the fp16-fixed SDXL VAE (`madebyollin/sdxl-vae-fp16-fix`) is swapped in for the default SDXL checkpoint on cuda/xpu fp16, plus a model-agnostic backstop that detects a degenerate (all-black) fp16 output and re-runs once in fp32. cpu/mps run fp32 and never reproduce the bug. - Pyright first run is slow (2-3 min) due to ML deps (torch/diffusers/transformers stubs); full-project `uv run pyright` can stall for many minutes — scope it to changed files. - A third-party PIL plugin autoload (e.g. an HEIF/AVIF plugin) can raise a non-OSError (`ModuleNotFoundError`), not `UnidentifiedImageError`, when opening a file. Code that opens user-supplied or unknown-format files should `except Exception`, not just `OSError`/`UnidentifiedImageError`. -- **rich was dropped (CLI + scripts print plain text via `click.echo`).** `cli.py` renders through small `_Console`/`_Table`/`_Progress` shims; the analysis scripts (`scripts/synthid_corpus.py`, `synthid_pixel_probe.py`, `text_detection_benchmark.py`, `corpus_gap_scan.py`) import `Console`/`Table` from the shared `scripts/_plain_console.py` shim (markup like `[bold]`/`[/]` is stripped, tables render aligned). Consequences: (1) `rich` is NOT a dependency, so anything that imports it breaks a clean `uv sync --frozen` (CI installs core+dev only) — this exact gap red-failed CI after the refactor when those 4 scripts still imported rich; if you add a script, use the `_plain_console` shim, not rich. (2) The old `[gpu]`-bracket-eaten bug (#19) is gone — plain `click.echo` prints `pip install 'remove-ai-watermarks[gpu]'` verbatim, no escaping needed (regression-guarded by `tests/test_cli.py::TestGpuHintMarkup`). (3) No Unicode glyphs / colors / progress bars in CLI output by design. -- Metadata detection for AVIF/HEIF/JPEG-XL relies on a binary scan for `C2PA_UUID` + `IPTC_AI_MARKERS`, plus EXIF `Software` / XMP `CreatorTool` generator tags via `metadata.exif_generator` (validated with synthesized AVIF/JPEG fixtures + an XMP raw-scan fixture). C2PA removal in those containers is implemented via `noai/isobmff.py` (top-level ``uuid`` / ``jumb`` box stripper, no re-encoding), which now also drops a top-level XMP ``uuid`` box that carries an AI label (matched by AI-marker content, not by the XMP UUID, so byte-order-robust) and covers MP4/MOV/M4V/M4A by content sniff. **Non-ISOBMFF audio/video removal is via ffmpeg** (`_FFMPEG_STRIP_EXTS` -> `_strip_with_ffmpeg`): WebM/Matroska (EBML), MP3 (ID3), WAV/FLAC/OGG (RIFF/Vorbis) are stripped losslessly with `ffmpeg -map_metadata -1 -map_chapters -1 -c copy` (codec data untouched). Requires ffmpeg on PATH; raises `RuntimeError` if absent or if ffmpeg can't parse the file. Verified end-to-end (a real ffmpeg-made WAV/MP3 with a `title=Suno AI` tag -> tag gone, audio bytes preserved). **Meta-box XMP now handled (`isobmff.blank_ai_xmp_packets`, v0.6.9):** an AI-label XMP packet stored as a meta-box `mime` item (AVIF/HEIF) is blanked in place (overwritten with spaces of the same length, so `iloc` offsets and the coded image stay valid). **Still NOT built:** an `Exif` *item* inside the `meta` box (rare -- AI labels are XMP) needs full `iinf`/`iloc` surgery (offset rewrite) with corruption risk -- exiftool (R/W/C for HEIC/AVIF EXIF+XMP, verified on exiftool.org 2026-05-27) would do it but is a non-installed binary dep, so it stays a documented gap. **Audio watermark DETECTION (Resemble PerTh) was evaluated and NOT built (2026-05-26):** `resemble-perth`'s `PerthImplicitWatermarker.get_watermark()` returns a raw bit-array with **no presence/confidence flag** (clean audio decodes to arbitrary bits too), so reliably distinguishing watermarked-from-clean needs either Resemble's fixed payload or a confidence API -- neither is public, and there's no real Resemble sample to calibrate against. Same wall-class as the SynthID pixel detector: the decode exists, reliable presence-detection does not. (perth's top-level `PerthImplicitWatermarker` is also gated to None unless `librosa` is importable.) +- rich was dropped: the CLI + analysis scripts print plain text (`click.echo` / the `scripts/_plain_console.py` shim). `rich` is NOT a dependency — importing it breaks the core+dev CI sync; new scripts must use the shim. No Unicode glyphs / colors / progress bars in CLI output by design. +- AVIF/HEIF/JPEG-XL metadata detection is a binary scan; C2PA removal in those containers (and MP4/MOV/M4V) is `noai/isobmff.py`; non-ISOBMFF audio/video (WebM/MP3/WAV/FLAC/OGG) strips losslessly via ffmpeg on PATH. Still NOT built: an `Exif` meta-box *item* (needs `iinf`/`iloc` surgery) and Resemble PerTh audio detection (no presence/confidence flag exists). - **SynthID technical reference: `docs/synthid.md`** — primary-source-cited doc covering mechanism (post-hoc encoder/decoder pair, 136-bit payload at 512x512, pixel-space, model weights NOT modified), robustness numbers (arXiv:2510.09263: ~99.98% TPR@0.1%FPR across 30 transforms including JPEG/crop/resize/color/noise), removal attacks and forensic detectability (arXiv:2605.09203: all 6 attacks detectable at >98% TPR@1%FPR), detectability limits (no public decoder, metadata-proxy only), oracle scope, and adoption landscape. Read that doc first before adding notes here. - -- **SynthID detection is metadata-only.** There is no reliable *local* detector of the SynthID *pixel* watermark — Google's decoder is proprietary, no public spec or API (only a waitlisted portal). Authoritative confirmation: Google DeepMind's own paper "SynthID-Image: Image watermarking at internet scale" (Gowal et al., arXiv:2510.09263) states the verification service is restricted to "trusted testers" and does not release detector weights or a reproducible algorithm — so a local pixel detector is infeasible by design, not just unbuilt. https://arxiv.org/abs/2510.09263 We detect SynthID by its C2PA companion (`synthid_source` / `SYNTHID_C2PA_ISSUERS`), which is reliable while the manifest is intact but says nothing once C2PA is stripped. **Surface-dependent blind spot (verified 2026-05-24):** the same Google model emits different metadata per surface -- the Gemini *app* wraps outputs in Google C2PA, but the *API/playground* (AI Studio, Nano Banana / gemini-2.5-flash-image) emits the SynthID *pixel* watermark (confirmed via the Gemini-app oracle) + the visible sparkle but **no C2PA/IPTC at all**, so `synthid_source` returns None despite SynthID being present. Only the pixel oracle or the visible-sparkle detector catches those. (Meta AI is another surface mismatch: it writes the IPTC `digitalSourceType=trainedAlgorithmicMedia` marker, not C2PA and not SynthID.) Google→SynthID is long-standing; OpenAI→SynthID is confirmed by OpenAI's Help Center (ChatGPT/Codex/API "include both C2PA metadata and SynthID watermarks", updated 2026-05-21) but time-gated (pre-rollout OpenAI images carry C2PA without SynthID), so the OpenAI verdict is hedged "likely". Oracles: Gemini app "Verify with SynthID" (Google), openai.com/verify (OpenAI). **Each vendor's oracle detects only its OWN content (verified on the page 2026-05-31):** `openai.com/research/verify` states verbatim "OpenAI generation signals will only be detected if the image was generated with our tools" and "Content could also still be AI-generated by another company's model, which the tool currently does not detect" -- SynthID is shared tech but the verifier is keyed to its own vendor's payload, so a Google-SynthID image reads clean on OpenAI's verifier and vice-versa. **This explains the recurring "oracle says clean but `identify` still flags SynthID" report (#14):** the oracle reads the *pixel* watermark (gone after our SDXL pass), while `identify` reads the *C2PA-metadata proxy* (still present if the manifest survived). Different signals, not a contradiction -- strip the metadata too (`metadata --remove` / `all`) and the proxy goes quiet, but a quiet proxy is not proof the pixel watermark is gone. **SynthID is durable to JPEG re-encode by design, so a GitHub-recompressed issue attachment is still a valid SynthID test subject** (verified 2026-06-01 on issue #14's pic3: the GitHub-served JPEG survived re-encoding and openai.com/verify still detected SynthID). Do NOT dismiss issue-attachment JPEGs as "not faithful originals" when reproducing a SynthID-survival report: the recompression strips the **C2PA metadata** (so `identify` reads Unknown on the attachment) but NOT the **pixel watermark** that openai.com/verify reads. A true byte-original only matters for the metadata/C2PA path, not for the pixel-SynthID-removal test. (Contrast the open imwatermark above, which IS fragile to JPEG.) The spectral phase-coherence approach from `github.com/aloshdenny/reverse-SynthID` was evaluated (May 2026) and **does not work for real-content detection**: on its own shipped codebook + validation set, watermarked and cleaned images were indistinguishable (conf within noise, cleaned often higher); it only fires on pure-black 1024x1024 reference images at exact resolution (the controlled case it was calibrated on). The README's "90% / conf=0.91" reproduces only in that lab condition. Do not build a production detector on it; if revisited, it is experimental/diagnostic only and needs a per-resolution, per-model reference corpus. A from-scratch gpt-image pilot (2026-05-24) confirmed this independently: 5 independent solid-black gpt-image outputs share a near-identical fixed signature (pairwise residual correlation **0.92**, avg-template retains 97% energy), so the watermark/carrier IS strongly present and consistent on flat content — but the carrier frequencies extracted from it do NOT discriminate real content (carrier-to-random ratio: cleaned 1.86 > watermarked 1.53; a non-gpt-image image scored highest at 3.67). The signature drowns in content texture. Net: a perfectly consistent solid-color signature still yields no real-content pixel detector with magnitude/carrier methods. A corpus discrimination test (2026-05-24, `scripts/synthid_pixel_probe.py`, raw zero-mean residual NCC) independently re-confirms this: at matched resolution, SynthID positives do NOT cluster apart from negatives (within-Gemini 0.07; at 1024 px pos-vs-neg >= pos-vs-pos). The only high correlations were near-duplicate *content* (5 ChatGPT renders of one prompt at ~0.92, while a distinct ChatGPT image scored ~0 against them) — content, not a carrier. The probe is solid-fills-only and EXPERIMENTAL/DIAGNOSTIC; do not use it on real content. **Correction (deeper re-examination 2026-05-25):** the carrier IS real on solid fills — the earlier "no carrier" was a *method* artifact of using spatial / FFT-magnitude NCC, which can't see it. The carrier is a fixed *phase* at specific low frequencies, so the right metric is **per-bin phase coherence**. On 8 white `gemini-2.5-flash-image` fills (generated via the reverse-SynthID trick: identity-edit prompt "Recreate this image exactly as it is" on a synthetic pure-white PNG — this bypasses the recitation block that rejects text prompts for pure colors), phase coherence at the white carriers `(0,±7..±12,±20..±23)` = **0.86** vs **0.31** random; single-image leave-one-out phase-match **+0.83** vs real photos **-0.24**. (Black `2.5-flash` fills clip to std≈0 — SynthID can't push values below 0, so no carrier in black; the repo's dark carriers come from nano-banana-pro.) **But it does not generalize:** (a) carriers are model-version + resolution + color specific — the repo's v4 codebook (built for `gemini-3.1-flash-image-preview` + `nano-banana-pro-preview`) scores ~0.527 on my 2.5-flash white fills, indistinguishable from negatives (~0.50), i.e. carriers shift across model versions and need a per-model codebook; (b) on real content (30 `2.5-flash` images) the carrier collapses — set phase coherence at carriers 0.37 ≈ random 0.42, and the repo's v4 detector gives content 0.518 ≈ negatives 0.504 (no separation; a faint +0.24 single-image lean is likely a brightness confound). Net: the spectral/phase approach is a real *controlled-fill* characterizer, NOT an arbitrary-real-content detector, and is brittle to model version. Metadata proxy + visible sparkle + online oracles remain the ceiling for real content. -- **External AI-vs-real classifier models are out of scope (decided 2026-05-24).** Generic HuggingFace detectors (`Organika/sdxl-detector` Swin Transformer, `umm-maybe/AI-image-detector`, and fine-tunes) exist and report ~0.98 on their *own* SDXL-vs-real validation sets, but they are per-generator and the model cards themselves note degraded accuracy off-distribution; they are untested on gpt-image / Gemini Nano Banana (the metadata-stripped surfaces we care about), and our own light SDXL pass would likely defeat them the same way it defeats SynthID. Detection here stays local + signal-based (metadata + visible sparkle); do not add a bundled classifier dependency. -- **DEFAULT STRENGTH IS VENDOR-ADAPTIVE, ONE LADDER FOR BOTH PIPELINES (raised + unified 2026-06-09; vendor-adaptive since 2026-06-01, SUPERSEDES every fixed-default claim in this bullet and the next).** `resolve_strength(strength, vendor)` + `vendor_for_strength(path)` (`watermark_profiles.py`) read the C2PA issuer (`metadata.synthid_source`) on the ORIGINAL input and pick `OPENAI_STRENGTH` **0.20** / `GEMINI_STRENGTH` **0.30** / `UNKNOWN_STRENGTH` **0.30** when `--strength` is unset; explicit `--strength` always wins. **The SAME ladder applies to BOTH pipelines** (`sdxl` and `controlnet`) -- these are the 2026-06-04 Modal-cert controlnet floors. **Why one ladder (NOT a per-pipeline split):** the cert was run on controlnet and does NOT transfer to `sdxl` by symmetry (opposite hard cases -- controlnet leaves SynthID on photoreal, `sdxl` on flat graphics), BUT on its OWN hard case (flat fills) `sdxl` is the WEAKER remover (plain img2img barely perturbs a flat region at low strength), so it needs AT LEAST controlnet's strength -- hence the certified floor is the right floor for `sdxl` too. It is a MARGIN argument for `sdxl`, not a fresh certification (no local SynthID detector to self-verify); raise `--strength` if an oracle still reads a flat `sdxl` output. The higher strength costs little quality because `controlnet` is now the default pipeline AND the only `--auto` pick, so `sdxl` is reached only via an explicit `--pipeline sdxl` (a deliberate opt-down for inputs without faces/text), where over-regeneration has nothing to damage. (A short-lived per-pipeline split ladder -- `sdxl` 0.15/0.20 vs controlnet 0.20/0.30 -- existed on 2026-06-09 before being unified the same day; the `resolve_strength` `pipeline` param and the `CONTROLNET_*_STRENGTH` constants were removed.) The CLI detects the vendor from the pristine source (before the visible pass / metadata-strip removes C2PA from the temp file) and passes it to display calls so display and execution agree; `cmd_invisible`/`cmd_all`/`batch` thread `vendor`. **This replaces the single 0.30 default AND the prior "do NOT build a vendor-adaptive default" policy** -- both came from the now-debunked region-rescrub-contaminated study (the per-region re-scrub that contaminated those numbers was removed in the controlnet refactor). Basis: the oracle-verified June 2026 controlled study (clean v0.8.6, protect OFF): OpenAI clears at 0.05 across 1024-1600 (n=4, resolution-independent); Google needs 0.15 on the capped-1536 path (n=4). `docs/synthid.md` §2.2 (data) + §5.2 (the adaptive default) are authoritative. **CAVEAT (oracle pass 2026-06-04): the OpenAI 0.10 default is content-dependent, NOT universal -- a flat-graphic OpenAI logo/poster still read SynthID-detected after `default` at 0.10, and photoreal images after controlnet at 0.10/0.15 (low-change regions under-perturbed). Removal at 0.10/0.15 is content×pipeline dependent (see the controlnet Known-limitations bullet); the lever is a higher strength, oracle-revalidated per content type. Do NOT assume the vendor-adaptive default clears every image.** CAVEAT: Google's 0.15 was validated only on `--max-resolution 1536`; native large Gemini (2816) was not locally measurable (OOM on M-series) and is pending GPU validation on raiw.cc -- if it survives 0.15 native, raise `--strength`. **Everything below in this bullet about a fixed 0.10/0.30 default is HISTORICAL; trust the vendor-adaptive constants + docs/synthid.md.** -- **SynthID removal: strength + oracle scope.** Default strength is vendor-adaptive (see the bullet above); `docs/synthid.md` §2.2 is authoritative for the numbers. **Oracle scope (load-bearing):** the Gemini app "Verify with SynthID" is the ONLY valid SynthID oracle (detects Google's mark on any image); `openai.com/verify` is scoped to OpenAI provenance (its own C2PA), NOT a SynthID oracle -- a negative there is meaningless for SynthID. There is no local SynthID detector, so the tool cannot self-check; if the oracle still reads SynthID, raise `--strength` to the lowest value that verifies clean. Only the `sdxl` (plain SDXL img2img; `default` is a back-compat alias) and `controlnet` (SDXL + canny ControlNet) profiles exist; the local `invisible` default is weight-for-weight identical to raiw.cc prod (`fal-ai/fast-sdxl` = `stabilityai/stable-diffusion-xl-base-1.0`, runtime-downloaded, not bundled). **Forensic-stealth caveat** (arXiv:2605.09203): defeating the SynthID verifier is NOT forensic invisibility -- independent detectors flag *removal-processed* images vs genuinely-clean ones at >98% TPR@1%FPR, so do not over-claim "indistinguishable from a real photo". -- **`controlnet` pipeline (text/face STRUCTURE preservation, THE DEFAULT since 2026-06-09; `--pipeline default` opts down to plain SDXL).** SDXL + the canny ControlNet `xinsir/controlnet-canny-sdxl-1.0` via `StableDiffusionXLControlNetImg2ImgPipeline` (`watermark_remover._run_controlnet` / `_load_controlnet_pipeline`). **Removal still comes from the img2img regeneration (`strength`); the ControlNet only PRESERVES text and face STRUCTURE by conditioning on the canny edge map** (`cv2.Canny(gray, 100, 200)`, 3-channel). Canny preserves edges, NOT face identity (a regenerated face drifts in likeness). The drifted cleaned face is the LEAST-AI state we can reach without re-introducing SynthID; **the library does NOT ship a face-restore extra** (every approach evaluated 2026-06-04 - 2026-06-08 -- GFPGAN-on-cleaned, PhotoMaker-V2, InstantID txt2img, InstantID img2img-on-cleaned at three parameter sweeps -- regenerated the face via SDXL and made it look MORE AI-generated). Full empirical conclusion in `docs/synthid-robust-identity-research-2026-06-08.md` "Empirical follow-up". For production face preservation, ship the cleaned image as-is. No original pixels are copied or frozen, **BUT removal at the low vendor-adaptive strength is CONTENT × PIPELINE dependent and NEITHER pipeline clears all content -- oracle-validated against the OpenAI verifier 2026-06-04 (8 images, strength 0.10/0.15, `--max-resolution 1536`).** The survivors FLIP by content type: **photoreal** (a 9-face grid, a bracelet product photo) SURVIVES controlnet but CLEARS `default` (controlnet's dense edge map keeps the regen too close to the original, so the SynthID-destroying perturbation never happens; plain img2img perturbs photoreal texture enough); **flat graphic** (a logo/poster with large flat color fills) SURVIVES `default` but CLEARS controlnet (at low strength img2img barely changes flat fills so SynthID persists there, while controlnet repaints them more freely); a flat **text** card cleared under both. **Root cause is insufficient STRENGTH, not the pipeline: at 0.10 the low-change regions -- dense-edge photoreal under controlnet, large flat fills under `default` -- are not perturbed enough to destroy SynthID. The vendor-adaptive 0.10 from the June study is NOT universally sufficient (that study's content happened to clear at 0.10).** The robust fix is a HIGHER strength, oracle-revalidated per content type (controlnet can be cranked harder without losing structure; a lower `controlnet_conditioning_scale` also frees the regen on photoreal). So at today's default strength **both pipelines AND `--auto` can LEAVE SynthID on some content** -- a removal-priority caller (raiw.cc) MUST oracle-validate strength across content types before adopting, not pick a pipeline and assume removal. **Follow-up same day: re-running the two photoreal survivors through controlnet at an explicit `--strength 0.15` cleared BOTH on the oracle -- BUT one of them (the bracelet) had SURVIVED the SAME 0.15 controlnet config in the first pass (only the random, unset seed differed). So removal near the threshold is SEED-NON-DETERMINISTIC: the same image+pipeline+strength+resolution can pass or fail run-to-run (img2img uses `seed=None`/random unless `--seed` is passed, and there is no local SynthID detector to self-verify). 0.15 is the borderline, NOT a robust floor -- pick a strength with MARGIN (controlnet ~>= 0.20) rather than exactly on it; the content×pipeline table's 0.15 data point is near-threshold noise. A confirming run at `--strength 0.20` controlnet cleared BOTH photoreal survivors on the oracle (ladder: 0.10 grid detected → 0.15 borderline/non-deterministic → 0.20 both clean), so **0.20 is the recommended robust controlnet floor for OpenAI photoreal** (one margin run, not an N-run repeatability proof -- a service should add margin or verify repeatability since there is no local SynthID detector to self-check). **Engineering follow-up DONE 2026-06-09 (three coupled changes):** (1) **strength raised + unified** -- `resolve_strength(strength, vendor)` now applies ONE vendor-adaptive ladder (the certified controlnet floors 0.20/0.30/0.30) to BOTH pipelines; see the DEFAULT STRENGTH bullet above for why one ladder covers `sdxl`. (2) **`controlnet` is now the DEFAULT pipeline** (CLI `--pipeline` default = `controlnet` + both engine ctors). Rationale: with the certified higher ladder it clears BOTH content classes that flipped in the content-x-pipeline table (photoreal AND flat graphic), whereas plain SDXL left SynthID on flat graphics -- so controlnet is the more removal-robust default. Cost: every non-`--auto` run now downloads the canny ControlNet weights + a higher memory peak (MPS->CPU fallback covers OOM). (3) **the plain-SDXL profile was renamed `default` -> `sdxl`** (`watermark_profiles.SDXL_PROFILE`/`normalize_profile`); `default` stays as a back-compat CLI/ctor alias (the `--pipeline` Choice accepts `sdxl`/`controlnet`/`default`, a click callback `_normalize_pipeline` maps `default`->`sdxl` AND warns that `default` is deprecated). (4) **the content-detection layer + `--auto` planner were removed and `--auto` was retired to a deprecated alias for `--adaptive-polish`** -- see the dedicated `auto_config.py`-removal bullet above (controlnet is the default pipeline and the polish self-gates, so detection changed nothing). raiw.cc still needs its own per-vendor/content calibration on the GPU worker for native resolution. The Gemini-native resolution caveat stands: controlnet 0.30 is certified only <=1536.** **CERTIFIED 2026-06-04 via the isolated `raiw-controlnet-cert` Modal app (`raiw-app/modal_cert.py`), restore OFF, ≤1536, each vendor on its own oracle: controlnet floors are OpenAI 0.20 (2 photoreal × 3 seeds = 6/6 clean; the 0.15-flipper is seed-robust at 0.20) and Gemini 0.30 (0.20 detected → 0.30 clean on 2/2 seeds). OpenAI 0.20 transfers to prod (resolution-independent); Gemini 0.30 holds only ≤1536 — Gemini is resolution-sensitive and raiw.cc runs NATIVE (`max_resolution=0`), so cap Gemini ≤1536 + use 0.30, or native-calibrate (~0.35+). Prod recipe: controlnet + per-vendor floor in `resolve_strength` (not the default ladder) + FIXED seed (kills the non-determinism). **No face-restore in the library:** every approach evaluated (GFPGAN-on-cleaned, PhotoMaker-V2, InstantID txt2img, InstantID img2img-on-cleaned, 2026-06-04 - 2026-06-08 cert sweeps) regenerated the face via SDXL diffusion -- the output face inherited SDXL "clean skin" gloss and lost original identity precision, looking MORE AI-generated than the cleaned image, not less. The drifted face from controlnet 0.20 is the least-AI state we can reach; for a paid service that's the prod output. See `docs/synthid-robust-identity-research-2026-06-08.md` "Empirical follow-up".** See `docs/synthid.md` §5.5 + `docs/controlnet-removal-pipeline-research.md` (certified floors table).** **Lesson: visual-quality + face-recovery validation does NOT prove watermark removal -- only the SynthID oracle does, across MULTIPLE content types; never infer removal from sharpness/identity, and never conclude from a partial result (the photoreal-only data first read as "controlnet shields, default removes" -- the flat-graphic result reversed it).** `controlnet_conditioning_scale` (CLI `--controlnet-scale`, default 1.0) is the structure-preservation knob (higher = closer to the original structure); fp32 on cpu/mps, fp16-fixed VAE on cuda/xpu. The `controlnet` profile is threaded explicitly (`WatermarkRemover(pipeline=...)` / `InvisibleEngine(pipeline=...)`), NOT inferred from `model_id`. This productionizes the `scripts/controlnet_sweep.py` prototype; see `docs/controlnet-removal-pipeline-research.md`. **Forensic-stealth caveat still applies** (arXiv:2605.09203): defeating the SynthID verifier is not forensic invisibility -- a "this image went through a removal pipeline" classifier can still flag the output. +- **SynthID detection is metadata-only.** No local pixel detector is possible by design (Google's decoder is proprietary, trusted-testers only); we read the C2PA companion proxy, which goes quiet once metadata is stripped — a quiet proxy is not proof the pixel watermark is gone. The Gemini app "Verify with SynthID" is the ONLY valid SynthID oracle; `openai.com/verify` is scoped to OpenAI provenance and each vendor's oracle detects only its own content. SynthID survives JPEG re-encode, so GitHub issue attachments remain valid pixel-watermark test subjects. Every spectral/phase detection approach evaluated (reverse-SynthID, our own probes) works only on controlled solid fills, never on real content. +- **External AI-vs-real classifier models are out of scope** (decided 2026-05-24): per-generator, degrade off-distribution, and our own light SDXL pass would likely defeat them. Detection stays local + signal-based. +- **Default strength is VENDOR-ADAPTIVE, one ladder for BOTH pipelines** (since 2026-06-09): `resolve_strength(strength, vendor)` picks OpenAI **0.20** / Gemini **0.30** / unknown **0.30** when `--strength` is unset; explicit `--strength` always wins. Removal at low strength is content x pipeline dependent, and near-threshold removal is SEED-NON-DETERMINISTIC — pick a strength with margin and oracle-revalidate per content type. Certified controlnet floors (Modal cert 2026-06-04): OpenAI 0.20 (resolution-independent), Gemini 0.30 (only <= 1536px; native large Gemini needs ~0.35+ or a cap). +- **`controlnet` is the default pipeline**; `--pipeline sdxl` is the lighter opt-down. Neither pipeline clears all content at low strength (photoreal survives controlnet, flat graphics survive sdxl — the lever is higher strength). A removal-priority caller MUST oracle-validate strength across content types; prod recipe: controlnet + per-vendor floor + FIXED seed. Forensic-stealth caveat (arXiv:2605.09203): defeating the SynthID verifier is NOT forensic invisibility — removal-processed images are flaggable at >98% TPR@1%FPR. diff --git a/README.md b/README.md index f248c87..535d9ad 100644 --- a/README.md +++ b/README.md @@ -456,7 +456,7 @@ pip install certifi Tracked but not yet implemented: - **SynthID-Image v2 automated regression test**. The default SDXL profile defeats v2 per manual checks against the [Gemini app](https://support.google.com/gemini/answer/16722517)'s "Verify with SynthID" feature on a Gemini 3 Pro output (May 2026). An automated end-to-end test would need either programmatic access to the [SynthID Detector portal](https://blog.google/innovation-and-ai/products/google-synthid-ai-content-detector/) (waitlist for media professionals and researchers) or an offline surrogate detector. The spectral phase-coherence surrogate from [reverse-SynthID](https://github.com/aloshdenny/reverse-SynthID) was evaluated and does not separate watermarked from cleaned real-content images (it only fires on controlled solid-color references at exact resolution), so it is not a usable oracle. Open. -- **Local SynthID *pixel* detector**. Not feasible today: Google's decoder is proprietary, and magnitude/carrier spectral methods do not separate real content (confirmed by three independent evaluations, including a from-scratch gpt-image pilot; see CLAUDE.md). Blocked on either (a) a programmatic generation path (OpenAI / Gemini API) to build a per-(model, resolution) labeled corpus at scale, or (b) a raw watermarked-output dataset. If data arrives, the next approach to try is a learned classifier on diverse content rather than a fixed carrier codebook. +- **Local SynthID *pixel* detector**. Not feasible today: Google's decoder is proprietary, and magnitude/carrier spectral methods do not separate real content (confirmed by three independent evaluations, including a from-scratch gpt-image pilot; see docs/known-limitations.md). Blocked on either (a) a programmatic generation path (OpenAI / Gemini API) to build a per-(model, resolution) labeled corpus at scale, or (b) a raw watermarked-output dataset. If data arrives, the next approach to try is a learned classifier on diverse content rather than a fixed carrier codebook. - **Grow the SynthID reference corpus** (`data/synthid_corpus/`) with oracle-labeled samples per model and resolution (Gemini app for Google, openai.com/verify for OpenAI). Prerequisite for any pixel-detector attempt and for an automated removal-regression set. - **Real non-PNG C2PA fixtures**. SynthID-source detection for JPEG / WebP / AVIF is currently covered only by synthetic byte blobs; replace with real vendor-emitted files to ground the binary-scan path. - **Maintenance debt**. Strict pyright is now clean across `src/` (0 errors): pure-logic files are fully typed, the cv2 / torch / diffusers boundary files carry a documented per-file relax pragma, and a local `typings/piexif` stub covers piexif. Remaining: full-project `pyright` (no path) still OOMs node on this ML-heavy repo, so it must be scoped to `src/`; narrowing the boundary pragmas back toward full strict (as upstream stubs improve) is the long tail. (`uv-secure` is already clean since `idna` was bumped to 3.16.) diff --git a/docs/known-limitations.md b/docs/known-limitations.md new file mode 100644 index 0000000..0703278 --- /dev/null +++ b/docs/known-limitations.md @@ -0,0 +1,124 @@ +# Known limitations: full detail + +> Relocated verbatim from `CLAUDE.md` on 2026-06-11 to keep the always-loaded +> context small. Long single-line entries were reformatted into paragraphs; +> no content was changed or summarized. + +Full detail behind the compact Known-limitations list in `CLAUDE.md`: +measurements, incident history, oracle runs, and the reasoning behind each +decision. Read the relevant section here before changing the diffusion +pipelines, strength defaults, or metadata coverage. + +## Invisible-pipeline resolution handling (native / 1024 floor / `--max-resolution`; MPS memory tiers) + +`invisible` pipeline processes at **native resolution for inputs whose long side is >= 1024px**, and **auto-upscales smaller inputs UP to a 1024px floor** (`min_resolution=1024`, the default; `--min-resolution 0` disables) before diffusion -- SDXL img2img distorts badly on a tiny latent (a 381x512 portrait wrecks at native, the #36 follow-up), and the output is restored to the original input size so the floor is a transparent quality boost (it adds time/memory on small inputs). The floor upscale uses Lanczos by default; **`--upscaler esrgan`** (opt-in, the `esrgan` extra) runs Real-ESRGAN first for better detail before the Lanczos resize to the exact target (`upscaler.py` / `InvisibleEngine._esrgan_upscale`, falls back to Lanczos if the extra is absent). `max_resolution=0` (default) means no downscale cap, matching the hosted raiw.cc backend (fal fast-sdxl, no pre-downscale). The old forced downscale-to-1024 -> upscale-back round-trip for LARGE images was the main quality loss (issue #10) and is gone; at strength ~0.05 SDXL img2img does not need a downscale. + +**Final `--unsharp` post-filter (`humanizer.unsharp_mask`, opt-in, default 0):** applied LAST (after the face-restore pass, else it would be smoothed over) to counter the soft/over-smoothed look diffusion + restoration leave (an AI tell); ~0.5-0.8 safe, higher risks halos. Pairs with `--humanize` (grain adds sensor-noise texture, unsharp adds crispness). `--max-resolution N` re-introduces an opt-in long-side cap purely to bound GPU/MPS memory on very large inputs (it reintroduces the lossy round-trip). For huge images that OOM at native, tile-based diffusion is still the proper long-term fix. + +**Concrete MPS data points (the OOM is memory-tier-dependent, NOT a hard MPS limit):** on a ~24 GB unified-memory machine (verified 2026-05-25, 1254x1254 gpt-image SDXL, fp32) native res OOMs at the *UNet* step (peak ~17 GiB), not only the VAE decode, and the auto-fallback in `img2img_runner` reloads on CPU and finishes (slow, ~13 min) -- the output is still weight-identical and defeats SynthID, so "looks hung/crashed" on Mac is usually this CPU fallback, not a pipeline error. On a **32 GB** unified-memory machine the same default SDXL pass runs entirely on MPS with **no CPU fallback** (verified 2026-05-31, 1122x1402 gpt-image, `all`/default, ~155 s end-to-end), so 32 GB clears the native-res UNet peak that 24 GB could not. Adding `enable_vae_tiling()` alone does NOT prevent the 24 GB OOM (the peak is the UNet, not the VAE). The fast Mac workarounds for memory-constrained machines are fp16 on MPS (roughly halves memory) or `--max-resolution` to cap the long side; neither is wired as the default. The `controlnet` pipeline adds the canny ControlNet weights on top of SDXL, so its peak is a bit higher than the plain `default` pass; the same MPS->CPU fallback covers an OOM. The native-vs-cap-vs-floor decision lives in the pure helper `invisible_engine._target_size(w, h, max_resolution, min_resolution)` (returns `None` for native, a target tuple for a downscale cap OR an upscale floor; cap takes precedence, the floor is skipped on a min>max misconfig) so it is unit-tested (`tests/test_invisible_engine.py::TestTargetSize`, the #10/#15/#36 regression guard) without loading the model -- keep that logic in the helper, don't re-inline it. + +## fp16 VAE black-output fix (issue #29) + degenerate-output fp32 backstop (issue #41) + +**fp16 VAE black-output fix (issue #29, 2026-05-30):** on a **CUDA/XPU fp16** backend the stock SDXL VAE overflows to NaN and the *plain* img2img path decodes to an **all-black** image (reproduced on the raiw.cc result: a 1086x1448 input -> a uniformly black 4.6 KB PNG, mean 0). `watermark_remover._load_pipeline` / `_load_controlnet_pipeline` swap in the fp16-fixed SDXL VAE (`madebyollin/sdxl-vae-fp16-fix` = `_SDXL_FP16_VAE_ID`) when `_needs_fp16_vae_fix(model_id, DEFAULT_MODEL_ID, is_fp16)` is true -- only the default SDXL checkpoint on fp16. + +**cpu/mps run fp32** (the stock VAE is fine there, which is why the bug never reproduces on Mac). A custom non-SDXL `model_id` keeps its own VAE (the fp16-fix VAE is SDXL-architecture-specific). The decision is a pure helper, unit-tested without a download (`tests/test_platform.py::TestFp16VaeFix`); the actual black->clean recovery needs a CUDA GPU. + +**Confirmed on real CUDA hardware 2026-06-03:** running `all` on a 1086x1448 OpenAI gpt-image (the #29 repro size) at fp16 produced a normal (non-black) output, so the fp16-fix VAE swap resolves the all-black decode. (It was not reproducible on this MPS machine, which runs fp32, so the verification had to happen on an NVIDIA box.) + +**Follow-up safety net (issue #41, 2026-06-04):** the swap is gated to `model_id == DEFAULT_MODEL_ID`, so a custom model, a stale pre-fix install, or a fal/custom loader can still hit the black decode -- a new reporter did (gpt-image 1448x1086, the #29 size, with the exact `image_processor.py:142 invalid value encountered in cast` warning the NaN->0 cast emits). `remove_watermark` now adds a model-agnostic backstop: after generation, if the run was fp16 AND the output is degenerate (`_is_degenerate_image`: mean and std both below `_DEGENERATE_THRESHOLD` 1.0 -- a uniform all-black/NaN frame; the variance guard spares a legitimately dark-but-textured photo), it rebuilds the pipeline in fp32 on the SAME device and re-runs once. fp32 is the verified-clean path, so the user never gets a black image regardless of model_id/version. Mirrors the existing MPS->CPU fallback's self-mutation pattern (reset `torch_dtype` + clear `_pipeline`/`_controlnet_pipeline`); `batch` inherits it through `remove_watermark`, and once one image trips it the rest of the batch stays on the safe fp32. The detector is a pure helper, unit-tested without a model (`tests/test_platform.py::TestDegenerateOutputGuard`); the full fp16->detect->fp32-retry chain was verified e2e on this MPS machine by forcing fp16 with the swap disabled (first pass black, guard fired, retry produced a normal image). CAVEAT: the fp32 retry uses ~2x memory, so on a VRAM-constrained GPU it can OOM (a visible error, still better than a silent black frame; the MPS->CPU fallback covers that path). The reporter's "CPU also black" symptom is NOT reproducible here -- fp32 (cpu/mps) decodes clean -- so it points at an old version or a non-fp32 run, pending their version + command. + +## rich was dropped (plain-text CLI and scripts) + +**rich was dropped (CLI + scripts print plain text via `click.echo`).** + +`cli.py` renders through small `_Console`/`_Table`/`_Progress` shims; the analysis scripts (`scripts/synthid_corpus.py`, `synthid_pixel_probe.py`, `text_detection_benchmark.py`, `corpus_gap_scan.py`) import `Console`/`Table` from the shared `scripts/_plain_console.py` shim (markup like `[bold]`/`[/]` is stripped, tables render aligned). Consequences: (1) `rich` is NOT a dependency, so anything that imports it breaks a clean `uv sync --frozen` (CI installs core+dev only) — this exact gap red-failed CI after the refactor when those 4 scripts still imported rich; if you add a script, use the `_plain_console` shim, not rich. (2) The old `[gpu]`-bracket-eaten bug (#19) is gone — plain `click.echo` prints `pip install 'remove-ai-watermarks[gpu]'` verbatim, no escaping needed (regression-guarded by `tests/test_cli.py::TestGpuHintMarkup`). (3) No Unicode glyphs / colors / progress bars in CLI output by design. + +## AVIF/HEIF/JPEG-XL metadata, ISOBMFF/ffmpeg removal, audio watermark detection + +Metadata detection for AVIF/HEIF/JPEG-XL relies on a binary scan for `C2PA_UUID` + `IPTC_AI_MARKERS`, plus EXIF `Software` / XMP `CreatorTool` generator tags via `metadata.exif_generator` (validated with synthesized AVIF/JPEG fixtures + an XMP raw-scan fixture). C2PA removal in those containers is implemented via `noai/isobmff.py` (top-level ``uuid`` / ``jumb`` box stripper, no re-encoding), which now also drops a top-level XMP ``uuid`` box that carries an AI label (matched by AI-marker content, not by the XMP UUID, so byte-order-robust) and covers MP4/MOV/M4V/M4A by content sniff. + +**Non-ISOBMFF audio/video removal is via ffmpeg** (`_FFMPEG_STRIP_EXTS` -> `_strip_with_ffmpeg`): WebM/Matroska (EBML), MP3 (ID3), WAV/FLAC/OGG (RIFF/Vorbis) are stripped losslessly with `ffmpeg -map_metadata -1 -map_chapters -1 -c copy` (codec data untouched). Requires ffmpeg on PATH; raises `RuntimeError` if absent or if ffmpeg can't parse the file. Verified end-to-end (a real ffmpeg-made WAV/MP3 with a `title=Suno AI` tag -> tag gone, audio bytes preserved). + +**Meta-box XMP now handled (`isobmff.blank_ai_xmp_packets`, v0.6.9):** an AI-label XMP packet stored as a meta-box `mime` item (AVIF/HEIF) is blanked in place (overwritten with spaces of the same length, so `iloc` offsets and the coded image stay valid). + +**Still NOT built:** an `Exif` *item* inside the `meta` box (rare -- AI labels are XMP) needs full `iinf`/`iloc` surgery (offset rewrite) with corruption risk -- exiftool (R/W/C for HEIC/AVIF EXIF+XMP, verified on exiftool.org 2026-05-27) would do it but is a non-installed binary dep, so it stays a documented gap. + +**Audio watermark DETECTION (Resemble PerTh) was evaluated and NOT built (2026-05-26):** `resemble-perth`'s `PerthImplicitWatermarker.get_watermark()` returns a raw bit-array with **no presence/confidence flag** (clean audio decodes to arbitrary bits too), so reliably distinguishing watermarked-from-clean needs either Resemble's fixed payload or a confidence API -- neither is public, and there's no real Resemble sample to calibrate against. Same wall-class as the SynthID pixel detector: the decode exists, reliable presence-detection does not. (perth's top-level `PerthImplicitWatermarker` is also gated to None unless `librosa` is importable.) + +## SynthID detection is metadata-only (no local pixel detector) + +**SynthID detection is metadata-only.** + +There is no reliable *local* detector of the SynthID *pixel* watermark — Google's decoder is proprietary, no public spec or API (only a waitlisted portal). Authoritative confirmation: Google DeepMind's own paper "SynthID-Image: Image watermarking at internet scale" (Gowal et al., arXiv:2510.09263) states the verification service is restricted to "trusted testers" and does not release detector weights or a reproducible algorithm — so a local pixel detector is infeasible by design, not just unbuilt. https://arxiv.org/abs/2510.09263 We detect SynthID by its C2PA companion (`synthid_source` / `SYNTHID_C2PA_ISSUERS`), which is reliable while the manifest is intact but says nothing once C2PA is stripped. + +**Surface-dependent blind spot (verified 2026-05-24):** the same Google model emits different metadata per surface -- the Gemini *app* wraps outputs in Google C2PA, but the *API/playground* (AI Studio, Nano Banana / gemini-2.5-flash-image) emits the SynthID *pixel* watermark (confirmed via the Gemini-app oracle) + the visible sparkle but **no C2PA/IPTC at all**, so `synthid_source` returns None despite SynthID being present. Only the pixel oracle or the visible-sparkle detector catches those. (Meta AI is another surface mismatch: it writes the IPTC `digitalSourceType=trainedAlgorithmicMedia` marker, not C2PA and not SynthID.) Google→SynthID is long-standing; OpenAI→SynthID is confirmed by OpenAI's Help Center (ChatGPT/Codex/API "include both C2PA metadata and SynthID watermarks", updated 2026-05-21) but time-gated (pre-rollout OpenAI images carry C2PA without SynthID), so the OpenAI verdict is hedged "likely". Oracles: Gemini app "Verify with SynthID" (Google), openai.com/verify (OpenAI). + +**Each vendor's oracle detects only its OWN content (verified on the page 2026-05-31):** `openai.com/research/verify` states verbatim "OpenAI generation signals will only be detected if the image was generated with our tools" and "Content could also still be AI-generated by another company's model, which the tool currently does not detect" -- SynthID is shared tech but the verifier is keyed to its own vendor's payload, so a Google-SynthID image reads clean on OpenAI's verifier and vice-versa. + +**This explains the recurring "oracle says clean but `identify` still flags SynthID" report (#14):** the oracle reads the *pixel* watermark (gone after our SDXL pass), while `identify` reads the *C2PA-metadata proxy* (still present if the manifest survived). Different signals, not a contradiction -- strip the metadata too (`metadata --remove` / `all`) and the proxy goes quiet, but a quiet proxy is not proof the pixel watermark is gone. + +**SynthID is durable to JPEG re-encode by design, so a GitHub-recompressed issue attachment is still a valid SynthID test subject** (verified 2026-06-01 on issue #14's pic3: the GitHub-served JPEG survived re-encoding and openai.com/verify still detected SynthID). Do NOT dismiss issue-attachment JPEGs as "not faithful originals" when reproducing a SynthID-survival report: the recompression strips the **C2PA metadata** (so `identify` reads Unknown on the attachment) but NOT the **pixel watermark** that openai.com/verify reads. A true byte-original only matters for the metadata/C2PA path, not for the pixel-SynthID-removal test. (Contrast the open imwatermark above, which IS fragile to JPEG.) The spectral phase-coherence approach from `github.com/aloshdenny/reverse-SynthID` was evaluated (May 2026) and **does not work for real-content detection**: on its own shipped codebook + validation set, watermarked and cleaned images were indistinguishable (conf within noise, cleaned often higher); it only fires on pure-black 1024x1024 reference images at exact resolution (the controlled case it was calibrated on). The README's "90% / conf=0.91" reproduces only in that lab condition. Do not build a production detector on it; if revisited, it is experimental/diagnostic only and needs a per-resolution, per-model reference corpus. A from-scratch gpt-image pilot (2026-05-24) confirmed this independently: 5 independent solid-black gpt-image outputs share a near-identical fixed signature (pairwise residual correlation **0.92**, avg-template retains 97% energy), so the watermark/carrier IS strongly present and consistent on flat content — but the carrier frequencies extracted from it do NOT discriminate real content (carrier-to-random ratio: cleaned 1.86 > watermarked 1.53; a non-gpt-image image scored highest at 3.67). The signature drowns in content texture. Net: a perfectly consistent solid-color signature still yields no real-content pixel detector with magnitude/carrier methods. A corpus discrimination test (2026-05-24, `scripts/synthid_pixel_probe.py`, raw zero-mean residual NCC) independently re-confirms this: at matched resolution, SynthID positives do NOT cluster apart from negatives (within-Gemini 0.07; at 1024 px pos-vs-neg >= pos-vs-pos). The only high correlations were near-duplicate *content* (5 ChatGPT renders of one prompt at ~0.92, while a distinct ChatGPT image scored ~0 against them) — content, not a carrier. The probe is solid-fills-only and EXPERIMENTAL/DIAGNOSTIC; do not use it on real content. + +**Correction (deeper re-examination 2026-05-25):** the carrier IS real on solid fills — the earlier "no carrier" was a *method* artifact of using spatial / FFT-magnitude NCC, which can't see it. The carrier is a fixed *phase* at specific low frequencies, so the right metric is **per-bin phase coherence**. On 8 white `gemini-2.5-flash-image` fills (generated via the reverse-SynthID trick: identity-edit prompt "Recreate this image exactly as it is" on a synthetic pure-white PNG — this bypasses the recitation block that rejects text prompts for pure colors), phase coherence at the white carriers `(0,±7..±12,±20..±23)` = **0.86** vs **0.31** random; single-image leave-one-out phase-match **+0.83** vs real photos **-0.24**. (Black `2.5-flash` fills clip to std≈0 — SynthID can't push values below 0, so no carrier in black; the repo's dark carriers come from nano-banana-pro.) + +**But it does not generalize:** (a) carriers are model-version + resolution + color specific — the repo's v4 codebook (built for `gemini-3.1-flash-image-preview` + `nano-banana-pro-preview`) scores ~0.527 on my 2.5-flash white fills, indistinguishable from negatives (~0.50), i.e. carriers shift across model versions and need a per-model codebook; (b) on real content (30 `2.5-flash` images) the carrier collapses — set phase coherence at carriers 0.37 ≈ random 0.42, and the repo's v4 detector gives content 0.518 ≈ negatives 0.504 (no separation; a faint +0.24 single-image lean is likely a brightness confound). Net: the spectral/phase approach is a real *controlled-fill* characterizer, NOT an arbitrary-real-content detector, and is brittle to model version. Metadata proxy + visible sparkle + online oracles remain the ceiling for real content. + +## External AI-vs-real classifier models are out of scope + +**External AI-vs-real classifier models are out of scope (decided 2026-05-24).** + +Generic HuggingFace detectors (`Organika/sdxl-detector` Swin Transformer, `umm-maybe/AI-image-detector`, and fine-tunes) exist and report ~0.98 on their *own* SDXL-vs-real validation sets, but they are per-generator and the model cards themselves note degraded accuracy off-distribution; they are untested on gpt-image / Gemini Nano Banana (the metadata-stripped surfaces we care about), and our own light SDXL pass would likely defeat them the same way it defeats SynthID. Detection here stays local + signal-based (metadata + visible sparkle); do not add a bundled classifier dependency. + +## Default strength is vendor-adaptive, one ladder for both pipelines + +**DEFAULT STRENGTH IS VENDOR-ADAPTIVE, ONE LADDER FOR BOTH PIPELINES (raised + unified 2026-06-09; vendor-adaptive since 2026-06-01, SUPERSEDES every fixed-default claim in this bullet and the next).** + +`resolve_strength(strength, vendor)` + `vendor_for_strength(path)` (`watermark_profiles.py`) read the C2PA issuer (`metadata.synthid_source`) on the ORIGINAL input and pick `OPENAI_STRENGTH` **0.20** / `GEMINI_STRENGTH` **0.30** / `UNKNOWN_STRENGTH` **0.30** when `--strength` is unset; explicit `--strength` always wins. + +**The SAME ladder applies to BOTH pipelines** (`sdxl` and `controlnet`) -- these are the 2026-06-04 Modal-cert controlnet floors. + +**Why one ladder (NOT a per-pipeline split):** the cert was run on controlnet and does NOT transfer to `sdxl` by symmetry (opposite hard cases -- controlnet leaves SynthID on photoreal, `sdxl` on flat graphics), BUT on its OWN hard case (flat fills) `sdxl` is the WEAKER remover (plain img2img barely perturbs a flat region at low strength), so it needs AT LEAST controlnet's strength -- hence the certified floor is the right floor for `sdxl` too. It is a MARGIN argument for `sdxl`, not a fresh certification (no local SynthID detector to self-verify); raise `--strength` if an oracle still reads a flat `sdxl` output. The higher strength costs little quality because `controlnet` is now the default pipeline AND the only `--auto` pick, so `sdxl` is reached only via an explicit `--pipeline sdxl` (a deliberate opt-down for inputs without faces/text), where over-regeneration has nothing to damage. (A short-lived per-pipeline split ladder -- `sdxl` 0.15/0.20 vs controlnet 0.20/0.30 -- existed on 2026-06-09 before being unified the same day; the `resolve_strength` `pipeline` param and the `CONTROLNET_*_STRENGTH` constants were removed.) The CLI detects the vendor from the pristine source (before the visible pass / metadata-strip removes C2PA from the temp file) and passes it to display calls so display and execution agree; `cmd_invisible`/`cmd_all`/`batch` thread `vendor`. + +**This replaces the single 0.30 default AND the prior "do NOT build a vendor-adaptive default" policy** -- both came from the now-debunked region-rescrub-contaminated study (the per-region re-scrub that contaminated those numbers was removed in the controlnet refactor). Basis: the oracle-verified June 2026 controlled study (clean v0.8.6, protect OFF): OpenAI clears at 0.05 across 1024-1600 (n=4, resolution-independent); Google needs 0.15 on the capped-1536 path (n=4). `docs/synthid.md` §2.2 (data) + §5.2 (the adaptive default) are authoritative. + +**CAVEAT (oracle pass 2026-06-04): the OpenAI 0.10 default is content-dependent, NOT universal -- a flat-graphic OpenAI logo/poster still read SynthID-detected after `default` at 0.10, and photoreal images after controlnet at 0.10/0.15 (low-change regions under-perturbed). Removal at 0.10/0.15 is content×pipeline dependent (see the controlnet Known-limitations bullet); the lever is a higher strength, oracle-revalidated per content type. Do NOT assume the vendor-adaptive default clears every image.** + +CAVEAT: Google's 0.15 was validated only on `--max-resolution 1536`; native large Gemini (2816) was not locally measurable (OOM on M-series) and is pending GPU validation on raiw.cc -- if it survives 0.15 native, raise `--strength`. + +**Everything below in this bullet about a fixed 0.10/0.30 default is HISTORICAL; trust the vendor-adaptive constants + docs/synthid.md.** + +## SynthID removal: strength + oracle scope + +**SynthID removal: strength + oracle scope.** + +Default strength is vendor-adaptive (see the bullet above); `docs/synthid.md` §2.2 is authoritative for the numbers. + +**Oracle scope (load-bearing):** the Gemini app "Verify with SynthID" is the ONLY valid SynthID oracle (detects Google's mark on any image); `openai.com/verify` is scoped to OpenAI provenance (its own C2PA), NOT a SynthID oracle -- a negative there is meaningless for SynthID. There is no local SynthID detector, so the tool cannot self-check; if the oracle still reads SynthID, raise `--strength` to the lowest value that verifies clean. Only the `sdxl` (plain SDXL img2img; `default` is a back-compat alias) and `controlnet` (SDXL + canny ControlNet) profiles exist; the local `invisible` default is weight-for-weight identical to raiw.cc prod (`fal-ai/fast-sdxl` = `stabilityai/stable-diffusion-xl-base-1.0`, runtime-downloaded, not bundled). + +**Forensic-stealth caveat** (arXiv:2605.09203): defeating the SynthID verifier is NOT forensic invisibility -- independent detectors flag *removal-processed* images vs genuinely-clean ones at >98% TPR@1%FPR, so do not over-claim "indistinguishable from a real photo". + +## `controlnet` pipeline: content x pipeline removal, certified floors, no face-restore + +**`controlnet` pipeline (text/face STRUCTURE preservation, THE DEFAULT since 2026-06-09; `--pipeline default` opts down to plain SDXL).** + +SDXL + the canny ControlNet `xinsir/controlnet-canny-sdxl-1.0` via `StableDiffusionXLControlNetImg2ImgPipeline` (`watermark_remover._run_controlnet` / `_load_controlnet_pipeline`). + +**Removal still comes from the img2img regeneration (`strength`); the ControlNet only PRESERVES text and face STRUCTURE by conditioning on the canny edge map** (`cv2.Canny(gray, 100, 200)`, 3-channel). Canny preserves edges, NOT face identity (a regenerated face drifts in likeness). The drifted cleaned face is the LEAST-AI state we can reach without re-introducing SynthID; **the library does NOT ship a face-restore extra** (every approach evaluated 2026-06-04 - 2026-06-08 -- GFPGAN-on-cleaned, PhotoMaker-V2, InstantID txt2img, InstantID img2img-on-cleaned at three parameter sweeps -- regenerated the face via SDXL and made it look MORE AI-generated). Full empirical conclusion in `docs/synthid-robust-identity-research-2026-06-08.md` "Empirical follow-up". For production face preservation, ship the cleaned image as-is. No original pixels are copied or frozen, **BUT removal at the low vendor-adaptive strength is CONTENT × PIPELINE dependent and NEITHER pipeline clears all content -- oracle-validated against the OpenAI verifier 2026-06-04 (8 images, strength 0.10/0.15, `--max-resolution 1536`).** + +The survivors FLIP by content type: **photoreal** (a 9-face grid, a bracelet product photo) SURVIVES controlnet but CLEARS `default` (controlnet's dense edge map keeps the regen too close to the original, so the SynthID-destroying perturbation never happens; plain img2img perturbs photoreal texture enough); **flat graphic** (a logo/poster with large flat color fills) SURVIVES `default` but CLEARS controlnet (at low strength img2img barely changes flat fills so SynthID persists there, while controlnet repaints them more freely); a flat **text** card cleared under both. + +**Root cause is insufficient STRENGTH, not the pipeline: at 0.10 the low-change regions -- dense-edge photoreal under controlnet, large flat fills under `default` -- are not perturbed enough to destroy SynthID. The vendor-adaptive 0.10 from the June study is NOT universally sufficient (that study's content happened to clear at 0.10).** + +The robust fix is a HIGHER strength, oracle-revalidated per content type (controlnet can be cranked harder without losing structure; a lower `controlnet_conditioning_scale` also frees the regen on photoreal). So at today's default strength **both pipelines AND `--auto` can LEAVE SynthID on some content** -- a removal-priority caller (raiw.cc) MUST oracle-validate strength across content types before adopting, not pick a pipeline and assume removal. + +**Follow-up same day: re-running the two photoreal survivors through controlnet at an explicit `--strength 0.15` cleared BOTH on the oracle -- BUT one of them (the bracelet) had SURVIVED the SAME 0.15 controlnet config in the first pass (only the random, unset seed differed). So removal near the threshold is SEED-NON-DETERMINISTIC: the same image+pipeline+strength+resolution can pass or fail run-to-run (img2img uses `seed=None`/random unless `--seed` is passed, and there is no local SynthID detector to self-verify). 0.15 is the borderline, NOT a robust floor -- pick a strength with MARGIN (controlnet ~>= 0.20) rather than exactly on it; the content×pipeline table's 0.15 data point is near-threshold noise. A confirming run at `--strength 0.20` controlnet cleared BOTH photoreal survivors on the oracle (ladder: 0.10 grid detected → 0.15 borderline/non-deterministic → 0.20 both clean), so **0.20 is the recommended robust controlnet floor for OpenAI photoreal** (one margin run, not an N-run repeatability proof -- a service should add margin or verify repeatability since there is no local SynthID detector to self-check). + +**Engineering follow-up DONE 2026-06-09 (three coupled changes):** (1) **strength raised + unified** -- `resolve_strength(strength, vendor)` now applies ONE vendor-adaptive ladder (the certified controlnet floors 0.20/0.30/0.30) to BOTH pipelines; see the DEFAULT STRENGTH bullet above for why one ladder covers `sdxl`. (2) **`controlnet` is now the DEFAULT pipeline** (CLI `--pipeline` default = `controlnet` + both engine ctors). Rationale: with the certified higher ladder it clears BOTH content classes that flipped in the content-x-pipeline table (photoreal AND flat graphic), whereas plain SDXL left SynthID on flat graphics -- so controlnet is the more removal-robust default. Cost: every non-`--auto` run now downloads the canny ControlNet weights + a higher memory peak (MPS->CPU fallback covers OOM). (3) **the plain-SDXL profile was renamed `default` -> `sdxl`** (`watermark_profiles.SDXL_PROFILE`/`normalize_profile`); `default` stays as a back-compat CLI/ctor alias (the `--pipeline` Choice accepts `sdxl`/`controlnet`/`default`, a click callback `_normalize_pipeline` maps `default`->`sdxl` AND warns that `default` is deprecated). (4) **the content-detection layer + `--auto` planner were removed and `--auto` was retired to a deprecated alias for `--adaptive-polish`** -- see the dedicated `auto_config.py`-removal bullet above (controlnet is the default pipeline and the polish self-gates, so detection changed nothing). raiw.cc still needs its own per-vendor/content calibration on the GPU worker for native resolution. The Gemini-native resolution caveat stands: controlnet 0.30 is certified only <=1536.** **CERTIFIED 2026-06-04 via the isolated `raiw-controlnet-cert` Modal app (`raiw-app/modal_cert.py`), restore OFF, ≤1536, each vendor on its own oracle: controlnet floors are OpenAI 0.20 (2 photoreal × 3 seeds = 6/6 clean; the 0.15-flipper is seed-robust at 0.20) and Gemini 0.30 (0.20 detected → 0.30 clean on 2/2 seeds). OpenAI 0.20 transfers to prod (resolution-independent); Gemini 0.30 holds only ≤1536 — Gemini is resolution-sensitive and raiw.cc runs NATIVE (`max_resolution=0`), so cap Gemini ≤1536 + use 0.30, or native-calibrate (~0.35+). Prod recipe: controlnet + per-vendor floor in `resolve_strength` (not the default ladder) + FIXED seed (kills the non-determinism). + +**No face-restore in the library:** every approach evaluated (GFPGAN-on-cleaned, PhotoMaker-V2, InstantID txt2img, InstantID img2img-on-cleaned, 2026-06-04 - 2026-06-08 cert sweeps) regenerated the face via SDXL diffusion -- the output face inherited SDXL "clean skin" gloss and lost original identity precision, looking MORE AI-generated than the cleaned image, not less. The drifted face from controlnet 0.20 is the least-AI state we can reach; for a paid service that's the prod output. See `docs/synthid-robust-identity-research-2026-06-08.md` "Empirical follow-up".** + +See `docs/synthid.md` §5.5 + `docs/controlnet-removal-pipeline-research.md` (certified floors table).** **Lesson: visual-quality + face-recovery validation does NOT prove watermark removal -- only the SynthID oracle does, across MULTIPLE content types; never infer removal from sharpness/identity, and never conclude from a partial result (the photoreal-only data first read as "controlnet shields, default removes" -- the flat-graphic result reversed it).** + +`controlnet_conditioning_scale` (CLI `--controlnet-scale`, default 1.0) is the structure-preservation knob (higher = closer to the original structure); fp32 on cpu/mps, fp16-fixed VAE on cuda/xpu. The `controlnet` profile is threaded explicitly (`WatermarkRemover(pipeline=...)` / `InvisibleEngine(pipeline=...)`), NOT inferred from `model_id`. This productionizes the `scripts/controlnet_sweep.py` prototype; see `docs/controlnet-removal-pipeline-research.md`. + +**Forensic-stealth caveat still applies** (arXiv:2605.09203): defeating the SynthID verifier is not forensic invisibility -- a "this image went through a removal pipeline" classifier can still flag the output. diff --git a/docs/module-internals.md b/docs/module-internals.md new file mode 100644 index 0000000..f86e3e9 --- /dev/null +++ b/docs/module-internals.md @@ -0,0 +1,210 @@ +# Module internals + +> Relocated verbatim from `CLAUDE.md` on 2026-06-11 to keep the always-loaded +> context small. Long single-line entries were reformatted into paragraphs; +> no content was changed or summarized. + +Full per-module detail: design decisions, tuned thresholds, calibration +history, incident records, and the regression-guard map. The compact module +list lives in `CLAUDE.md`; read the relevant section here before changing a +module. + +## `noai/c2pa.py` + +`noai/c2pa.py` — PNG chunk parser; use `extract_c2pa_chunk(path)` to get raw caBX payload, `has_c2pa_metadata(path)` to detect. Do not reimplement chunk parsing. `extract_c2pa_info(path)` sets `synthid_watermark`/`synthid_vendors` when the manifest is signed by a SynthID-using vendor, and `soft_binding`/`soft_binding_vendors` when a `c2pa.soft-binding` `alg` names a forensic-watermark vendor (`soft_binding_vendors_in(buffer)` is the shared byte-scan, used by both the PNG parser and the non-PNG binary path). PNG/caBX chunk reads are clamped to the remaining file size (`safe_length = min(length, remaining)`; skipped chunks use seek) so a malformed huge `length` cannot drive a multi-GB allocation (shared safety discipline matching `isobmff.scan_c2pa_region`). + +## `noai/constants.py` + +`noai/constants.py` — PNG_SIGNATURE, C2PA_CHUNK_TYPE, C2PA_SIGNATURES, and `C2PA_AI_VENDORS` — the single `C2paAiVendor` registry of C2PA-signing vendors (issuer byte, resolved org name, the `identify` platform label, and a `synthid` flag), from which `C2PA_ISSUERS`, `SYNTHID_C2PA_ISSUERS` (issuers that pair SynthID with C2PA: Google, OpenAI), and `identify._ISSUER_PLATFORM` are all **derived** — plus `C2PA_SOFT_BINDINGS` (soft-binding `alg` prefix → forensic-watermark vendor: Adobe TrustMark, Digimarc, Imatag, Steg.AI, Microsoft, ...). Add a new C2PA vendor as one `C2PA_AI_VENDORS` entry (never edit the derived dicts), a new soft-binding to `C2PA_SOFT_BINDINGS`; not inline. + +## `metadata.py` + +`metadata.py` — `scan_head(path, size=1MB)` is the shared input for every C2PA/AIGC/IPTC byte scan: first `size` bytes plus the payloads of any provenance metadata found beyond that window — for ISOBMFF, the late provenance boxes from `isobmff.scan_c2pa_region` (catches a manifest after a large `mdat`); for **PNG**, the late `tEXt`/`iTXt`/`zTXt`/`eXIf`/`iCCP` chunks from `_png_late_metadata` (catches an XMP/EXIF packet appended after a large `IDAT`, e.g. a TC260 AIGC label at ~2.7 MB). Behavior-neutral (`f.read(size)`) for non-ISOBMFF inputs and for any file that fits within `size`. Use it instead of `open().read(1MB)` for any new marker scan. + +**Memoized per (path, size, mtime)** (added 2026-06-09, `_scan_head_cached` lru_cache, `maxsize=8`): one `identify`/`get_ai_metadata` call fans out to ~8 byte-scan detectors that each re-read the same file head, so the cache turns those into a single read; the mtime key invalidates on change, a stat failure falls back to an uncached read. `synthid_source(path)` returns the vendor name(s) if the C2PA manifest implies a SynthID pixel watermark, else None. Format-agnostic: PNG via the caBX parser, JPEG/WebP/AVIF/HEIF/JXL via a binary scan (C2PA marker + SynthID issuer + AI-source marker). `get_ai_metadata` surfaces the verdict, and `metadata --check` prints it as a callout. Both `get_ai_metadata` and `has_ai_metadata` guard the PIL open with `except Exception` (HEIC/unknown formats raise non-OSError) and fall through to the binary scan. `xai_signature(path)` detects xAI/Grok's EXIF-only scheme (`ImageDescription` = `Signature: ` + UUID `Artist`); it feeds `has_ai_metadata`, `get_ai_metadata` (key `xai_signature`), and `identify`. `iptc_ai_system(path)` detects the IPTC Photo Metadata 2025.1 AI-disclosure XMP properties (`IPTC_AI_FIELD_MARKERS` = `AISystemUsed`/`AISystemVersionUsed`/`AIPromptInformation`/`AIPromptWriterName`) and returns the `AISystemUsed` generator name (or `"fields present"`). `remove_ai_metadata` routes **ISOBMFF video** (`.mp4`/`.mov`/`.m4v`) through the same `isobmff.strip_c2pa_boxes` as AVIF/HEIF (MP4 is ISOBMFF), and `_scrub_ai_exif` removes the xAI signature + AI-generator EXIF tags on JPEG output. `strip_c2pa_boxes` is **fail-safe** on a malformed box: it returns the original bytes unchanged with a logged warning instead of truncating the tail to EOF (detection-only `scan_c2pa_region` still stops at a malformed box). `_png_late_metadata` clamps each late-chunk read to the remaining file size (`safe_length = min(length, remaining)`) so a malformed `length` cannot drive a multi-GB allocation, AND advances the cursor by `safe_length` (not the raw `length`) so an inflated length cannot jump past EOF and abort the scan, silently skipping a genuine AI-label chunk after it. + +## `identify.py` + +`identify.py` — the OpenAI rollout caveat is keyed on `_vendor_of(synthid) == "OpenAI"` (not a raw substring over the issuer + verdict blob). `identify(path)` aggregates every locally-readable signal (C2PA issuer→platform, C2PA soft-binding forensic-watermark vendor, **C2PA cloud-manifest reference** via `metadata.c2pa_cloud_manifest` — signal `c2pa_cloud`, **medium**, provenance-only (does NOT set `is_ai`, excluded from `ai_from_metadata` + clash vendors): a C2PA 2.4 Durable-Content-Credentials case where the embedded manifest is stripped but an XMP `dcterms:provenance` pointer to the vendor's cloud manifest store (`_C2PA_MANIFEST_REPOSITORIES`, today `cai-manifests.adobe.com` → "Adobe Content Authenticity") survives, so the credentials stay recoverable server-side; only emitted when no embedded manifest already attributed the file — surfaced on 2 corpus PNGs 2026-06-10 that read fully `unknown` before, IPTC "Made with AI" + IPTC 2025.1 `AISystemUsed`, embedded SD/ComfyUI params, SynthID proxy, xAI/Grok EXIF signature via `metadata.xai_signature`, the China TC260 AIGC label via `metadata.aigc_label`, the HuggingFace `hf-job-id` job marker via `metadata.huggingface_job`, the Samsung Galaxy AI editing marker via `metadata.samsung_genai`, the visible marks — Gemini sparkle plus the ByteDance Doubao 豆包AI生成 / Jimeng 即梦AI / Samsung Galaxy AI "Contenuti generati dall'AI" text marks via the `watermark_registry` — open invisible watermark, Adobe TrustMark via `trustmark_detector`) into one `ProvenanceReport`. `is_ai_generated` is True or None (never asserted False — stripped metadata is not proof of clean origin). The `hf_job`, visible-mark, and Samsung `samsung_genai` signals are **medium** confidence: each lifts an otherwise-Unknown verdict to a tentative AI (`hf_only` / `visible_only` / `samsung_only`, parallel branches; `visible_only` fires on any `visible_*` signal) but is excluded from the high-confidence `ai_from_metadata` set, so none overrides a hard metadata signal. + +**Visible-mark detection** (`check_visible`, signals `visible_sparkle` / `visible_doubao` / `visible_jimeng` / `visible_samsung`): the Gemini sparkle keeps its own file-level path (`_visible_sparkle` → `gemini_engine.detect_sparkle_confidence`, promoted only at confidence ≥ `_SPARKLE_THRESHOLD` 0.5; corpus-tuned to separate Gemini sparkles ≥0.56 from non-sparkle ≤0.49), while Doubao/Jimeng/Samsung reuse the registry detectors (`_visible_text_marks` → `watermark_registry`, iterating `_VISIBLE_MARK_PLATFORM`), each gated by its own engine NCC threshold via `MarkDetection.detected` (Doubao 0.4, Jimeng 0.45, Samsung 0.4). Doubao/Jimeng are normally also caught by the TC260 AIGC metadata label and Samsung by its C2PA + `genAIType` marker, so the visible path is their stripped-metadata fallback. Visible marks set `platform` only when no harder signal already did, and (like the sparkle) are excluded from integrity-clash vendor claims. The cv2 dependency lives in the engines, not here. + +**`import identify` is deliberately light** (~21 MB; ~36 MB with cv2 loaded by a visible-mark run, ~106 MB for a full `check_visible` run): it imports only the pure `noai.c2pa`/`noai.constants` submodules, and `noai/__init__` is lazy (see "Test and lint"), so torch/diffusers are NOT pulled at import even in a full `gpu`/`detect` install — fits a 512 MB host. The heavy paths are opt-in: `check_invisible=True` needs the `detect`/`trustmark` extras (each pulls **torch**; TrustMark also **downloads weights**), so on a core-only deploy leave `check_invisible` off (it is a no-op there anyway). Before the lazy `__init__`, the mere presence of torch in the env inflated `import identify` to ~420 MB. + +**C2PA platform attribution is device-token-first, issuer-scan fallback** (`_device_platform` scans manifest bytes for `_DEVICE_C2PA_PLATFORM` tokens, then `_attribute_platform`/`_ISSUER_PLATFORM`). + +**Why, verified on real signed files 2026-05-26:** the old issuer-only byte-scan matched ANY issuer substring anywhere, so multi-entity manifests mis-attributed -- Leica→"Truepic" (a signing authority in the trust chain), Nikon→"Adobe Firefly" (XMP-toolkit "Adobe" + the sample's "Adobe_MAX" name), Pixel→"Google (Gemini)" ("Google LLC" cert org), Truepic→"Google". A distinctive device token wins instead. + +**Token distinctiveness is load-bearing:** bare `b"Truepic"` mis-fires (it appears in unrelated trust chains -- it mis-attributed the OpenAI `chatgpt-1.png` fixture), so the token is the specific `b"Truepic_Lens"` from the Lens SDK claim generator; likewise `b"Pixel Camera"` (cert CN) not bare `b"Pixel"`. `_DEVICE_C2PA_PLATFORM` lists ONLY tokens **verified against a real C2PA file**: Leica (`lc_c2pa`/`Leica Camera`), Nikon (`NIKON`), Pixel (`Pixel Camera` -- from a real Pixel 10 Pro file attached to c2pa-rs issue #1609/#1554), Sony (`sony.sig`/`sony.cert` -- Sony's own C2PA assertion namespace, verified on a real Sony PXW-Z300 file; NOT bare "Sony" which is a common EXIF Make), Truepic (`Truepic_Lens`). Canon/Bria have **no public direct-download C2PA sample** (checked exhaustively: GitHub issue/PR attachments, contentcredentials gallery, HF datasets -- all upload-to-verify or token-gated; Canon's only public file was a self-signed hobbyist CR3, not factory), so they stay unmapped until a real file is captured (same fixture discipline as Grok/Doubao). The Sony sample is video (MP4) -- our ISOBMFF C2PA path detects it; Sony Alpha stills likely share the `sony.*` namespace but are not separately verified. + +**Samsung Galaxy + ASUS Gallery live in a separate `_SIGNER_C2PA_PLATFORM` (scanned after `_device_platform`, before the issuer fallback), NOT in `_DEVICE_C2PA_PLATFORM`** — verified on real signed files 2026-05-29. Reason: a Galaxy phone stamps BOTH its device cert AND a `trainedAlgorithmicMedia`/genAIType AI marker on a Generative-Edit image, so treating it as a "genuine camera capture" would false-fire integrity-clash rule 2 on every Galaxy AI edit. The signer tokens (`b"Samsung Galaxy"` cert org — distinct from the EXIF `SM-xxxx` model string on ordinary Samsung photos; `b"com.asus.gallery"` claim generator) only resolve the platform label; the AI verdict still comes from the source-type / genAIType. ASUS Gallery is a C2PA-signed edit with no AI marker, so it attributes the platform without asserting `is_ai`. + +**Samsung's `genAIType` (in the proprietary `PhotoEditor_Re_Edit_Data` JSON) is an undocumented Galaxy-AI editing marker** (`metadata.samsung_genai`, gated on the `PhotoEditor_Re_Edit_Data` container; non-zero value = AI tool used, values {1,5} observed): medium-confidence because the field has no public spec (verified 2026-05-29: absent from C2PA spec + Samsung docs), but it co-occurred with `trainedAlgorithmicMedia` in 3/3 verified files that record a source-type and was the SOLE AI marker on a Galaxy S24 file that omits the source type. Camera C2PA marks capture authenticity, not AI (Pixel carries `computationalCapture`, not `trainedAlgorithmicMedia`), so these never set `is_ai` -- that stays driven by digital-source-type. `c2pa.cbor_text_after` (now public) is best-effort for the `generator` detail string only and can be None when the manifest keys it `claim_generator_info` (Pixel). + +**Issuer→generator mapping is `is_ai`-gated** (`_attribute_platform(issuers, is_ai=c2pa_is_ai)`): a specific AI-generator platform is named only when the digital-source-type is `trainedAlgorithmicMedia`; on a non-AI source an issuer substring is treated as incidental (an "Adobe XMP" toolkit string in an *unmapped* Canon/Sony capture would otherwise mislabel it "Adobe Firefly"), so it degrades to the neutral "C2PA signer: X" label. Real Firefly/OpenAI/Google output carries the AI source-type, so it is unaffected (verified: chatgpt-1.png→OpenAI, firefly-1.png→Adobe Firefly still attribute). `_attribute_platform` defaults `is_ai=True` so the mapping stays unit-testable in isolation. Add capture-camera tokens to `_DEVICE_C2PA_PLATFORM`, editing-app/AI-device signer tokens to `_SIGNER_C2PA_PLATFORM`, generator/issuer platforms to the `C2PA_AI_VENDORS` registry in `constants.py` (which derives `_ISSUER_PLATFORM`), not inline. For non-PNG containers (JPEG/WebP/AVIF/HEIF/JXL) the caBX parser returns nothing, so issuer (`_issuers_in`) and generator (`_ai_tools_in`, reusing `C2PA_AI_TOOLS`) are recovered by binary-scanning the first MB. EXIF `Software` / `Make` / `Artist` / `ImageDescription` and XMP `CreatorTool` generator tags are read by `metadata.exif_generator` (PIL+piexif for any format PIL opens incl. AVIF, plus a container-agnostic XMP raw-byte scan that also covers HEIF/JXL), matched against `AI_GENERATOR_TOKENS` so ordinary editors (plain "Adobe Photoshop") and real-camera `Make` ("Apple"/"Canon") are not flagged. + +**Ideogram tags its output with EXIF `Make="Ideogram AI"`** (verified on a real download 2026-05-24) — that's why `Make` is read. + +**Integrity-clash detection** (`_integrity_clashes`, surfaced as `ProvenanceReport.integrity_clashes`, printed in red by `identify` and serialized to `--json`): contradictions between independent generator stamps are a laundering/spoofing tell. Two rules: (1) two or more distinct AI-origin vendors named by **independent** signals (e.g. C2PA OpenAI + EXIF `Make="Ideogram AI"`), and (2) a camera-capture C2PA device (`_DEVICE_C2PA_PLATFORM`) coexisting with an AI-generation marker **from a source INDEPENDENT of the camera's own manifest**. + +**Rule 2's independence gate (added 2026-06-11):** a device that both captures and runs on-device generative AI (Google Pixel Magic Editor / Pixel Studio) records the capture AND the AI edit in ONE C2PA manifest — so the AI vendor is named only from that same manifest (`c2pa` issuer + `synthid` proxy, both `c2pa_manifest` source) — a legitimate edit chain, NOT a clash. Rule 2 therefore fires only when some `ai_vendor_claims` family has a source `!= "c2pa_manifest"` (EXIF/XMP generator, IPTC, TC260 AIGC, a second manifest naming AI on a camera capture — the real laundering tell). This killed a false-positive class on the corpus: 2 real Pixel generative-edit PNGs (`computationalCapture` + `trainedAlgorithmicMedia` + "Applied imperceptible SynthID watermark" in one Google manifest) read as camera-vs-AI clashes before the gate. Pure cameras (Leica/Sony/Nikon/Truepic) that do NOT generate AI still clash on any within-manifest AI marker only if it is independent — they never legitimately carry one, so the gate is behavior-neutral for them while fixing Pixel (regression-guarded by `test_identify.py::TestIntegrityClashesHelper::{test_pixel_generative_edit_same_manifest_no_clash,test_camera_plus_independent_ai_marker_still_clashes}` + `TestIntegrityClashEndToEnd::test_pixel_generative_edit_no_clash`). + +**Independence is source-grouped (`_CLASH_SOURCE`, added 2026-06-02):** the C2PA issuer attribution (`c2pa`) and the SynthID proxy (`synthid`) are NOT independent — the proxy is inferred from the *same* manifest — so they share one source and two vendors named within a single manifest do not clash. This killed a false-positive class found on the spaces corpus: legitimate multi-actor manifests where a product wraps another vendor's engine (Microsoft Designer on OpenAI → `OpenAI, Microsoft`; Microsoft on Google → `Microsoft, Google LLC, Google C2PA Core Generator Library`) or an edit chain re-signs (Adobe over a Gemini original → Adobe c2pa + Google synthid) — 19 such files across the 2026-06-01/02 batches read as clashes before the fix. Rule 1 still fires when a manifest vendor disagrees with a genuinely independent stamp (EXIF/XMP generator, IPTC `AISystemUsed`, AIGC, xAI); each non-`c2pa`/`synthid` family is its own source (`test_identify.py::TestIntegrityClashes::{test_multi_actor_manifest_no_clash,test_manifest_vendor_vs_independent_signal_clashes}`). Vendor normalization is `_vendor_of` over `_AI_VENDOR_TOKENS` (so a C2PA "Google (Gemini)" issuer and a SynthID-Google proxy agree, while different vendors clash). + +**High-precision by design:** only hard generator stamps feed it (C2PA-issuer when source is AI, SynthID, EXIF/XMP generator, IPTC `AISystemUsed`, xAI, AIGC); the fuzzy visible sparkle and the open invisible watermark are **excluded** (the latter can be a by-product of our own SDXL removal pass). The c2pa vendor is classified from the issuer attribution / generator, NOT the resolved `platform` (a camera label like "Google Pixel" would mis-normalize to "Google"). All real single-origin fixtures (chatgpt/firefly/doubao/grok/mj) verified to produce **zero** clashes (false-positive guard in `test_identify.py::TestRealSamplesHaveNoClash`). + +## `watermark_registry.py` + +`watermark_registry.py` — **single catalog of known visible watermarks**, the unified "find known marks in their usual places, recognize, remove" entry. + +**Reverse-alpha based by policy**: a mark is listed only once a real alpha map has been captured for it, and removal inverts that map (`original = (wm - a*logo)/(1-a)`) — Gemini recovers cleanly with no inpaint (its sparkle alpha comes from a pure-black capture, so it is near-exact), while **Doubao, Jimeng, and Samsung all add an always-on THIN residual inpaint** over the glyph footprint (their text marks re-rasterize + jitter a few px per image, so a single capture cannot pixel-cancel them; the inpaint blends into the reverse-alpha-recovered pixels). Arbitrary-region inpainting still lives in `region_eraser`/`erase`. Each `KnownMark` ties a key to {usual `location`, `in_auto` flag, `recovery` (="reverse-alpha"), a `detect` adapter → uniform `MarkDetection`, a `remove` adapter}. Entries today: `gemini` (bottom-right sparkle), `doubao` (bottom-right "豆包AI生成"), `jimeng` (bottom-right "★ 即梦AI"), and `samsung` (bottom-**LEFT** "✦ Contenuti generati dall'AI", Samsung Galaxy AI, Italian locale). `detect_marks` scans all; `best_auto_mark` picks the highest-confidence detection. + +**Cross-engine confidences aren't directly comparable**, so the gemini adapter applies the corpus-validated 0.5 sparkle threshold (`_GEMINI_AUTO_MIN_CONF`) for its `detected` flag — otherwise the gemini engine's loose internal threshold weakly fires (~0.36) on the Doubao text and hijacks `auto`. The shape-keyed Doubao/Jimeng/Samsung NCC detectors don't cross-fire (jimeng scores ~0.22 on the Doubao strip, well under its 0.45 threshold; Samsung is bottom-left so it shares no corner with the others, and scored 0.0 on Doubao/Jimeng captures and they 0.0 on a real Samsung photo), so `auto` picks the right one. `cli.cmd_visible` is registry-driven: `--mark auto` → `best_auto_mark`, `--mark ` → that mark; `--mark` choices come from `mark_keys()`. + +**`cli._remove_visible_auto` is the shared visible-removal helper used by `cmd_all`/`cmd_batch` too** (they no longer hardcode `GeminiEngine`), so `all`/`batch` remove Doubao/Jimeng/Samsung text marks, not just the Gemini sparkle (regression-guarded by `test_all_visible_step_uses_registry`). The three text-mark adapters were consolidated 2026-06-09: a single `_text_mark(key, label, location)` builds the registry row from one parameterized `_text_mark_detect`/`_text_mark_remove` pair (reverse-alpha only when detected/forced AND `reverse_alpha_available`, else skipped — no inpaint); the gemini adapters stay bespoke. Add a new visible mark = one `_text_mark(...)` row + its `TextMarkConfig` (with a captured alpha map); do not re-add per-mark `if` branches or copy-paste adapters. + +**Alpha-on-save policy (issue #30):** `cli._write_bgr_with_alpha` rejoins the input's alpha plane **unchanged** — it must NOT zero alpha in the watermark bbox. Reverse-alpha (and `erase` inpaint) recover real pixels there, so zeroing alpha punched a transparent hole that renders as a solid **white box** on any non-transparent viewer (Gemini app exports are opaque RGBA, so every user hit it; regression-guarded by `test_visible_keeps_alpha_opaque_in_watermark_region`). The registry `remove()` still returns its region (used for `inpaint_residual` positioning), but the CLI no longer uses it to clear alpha. + +## `gemini_engine.py` + +`gemini_engine.py` — visible Gemini-sparkle remover/detector (cv2/numpy, no GPU). `detect_sparkle_confidence(path)` is the file-level entry point used by `identify.py`. The public entry points normalize a grayscale (2D) or RGBA (4-channel) input to BGR up front so a non-BGR image does not crash the cv2 pipeline. + +**Detection localization (issue #36):** `detect_watermark`'s global multi-scale NCC search applies a size weight (`(scale/96)**0.5`) that suppresses tiny-patch false positives but can let a larger, mediocre match (e.g. a bright collar in a portrait) outrank a small, near-perfect sparkle in the corner — so a faint sparkle on a busy background scored below threshold and read as clean (the regression osachub reported from widening the search window 256px->512px between v0.7.2 and v0.8.8). `_corner_promote` adds a bottom-right-corner raw-NCC pass on top of the global search: a match with raw NCC >= `_CORNER_PROMOTE_NCC` 0.85 that beats the global pick overrides it (it only ever replaces a lower-fidelity pick, so it cannot weaken an existing detection), rescuing the buried sparkle without reverting the wider window. The corner side is **relative-clamped** (`_CORNER_PROMOTE_FRAC` 0.20 of the short side, clamped to `[_CORNER_PROMOTE_MIN` 96, `_CORNER_PROMOTE_MAX` 384`]`): a fixed 256px is a true corner on a large image but covers ~70% of a small portrait, where a real photo raw-matches the star at ~0.81 (relative tightening drops that worst case to ~0.69, while the upper clamp stops the corner ballooning on huge images where a real photo reached ~0.83 at 512px). The 0.85 gate sits midway between the worst real-photo corner match (~0.78 across native + downscaled negatives) and a genuine faint sparkle (~0.93), so promotion adds true detections with zero corpus false positives (Gemini's sparkle sits ~60-160px from the corner at fixed margins, covered by the [96, 384] band at every measured size). Regression-guarded by `test_gemini_engine.py::TestCornerPromotion`. + +**Square-image residual misses are NOT fixable by lowering the detector threshold (measured + REJECTED 2026-06-11):** osachub (#36 follow-up) reported the corner-promote still misses Gemini sparkles on Google **square (1:1)** outputs. Reproduced on the spaces corpus: of 330 square Google-C2PA images, 140 score below the identify 0.5 threshold, and visual review confirmed a real class -- faint white sparkles on dark/textured/colored backgrounds (raw NCC 0.46-0.73, below the 0.85 promote gate) landing at fusion conf 0.41-0.47. A margin-gated promote (promote when raw NCC >= 0.50 AND `_core_ring_margin` >= 40) rescued 32/33 confirmed misses at an apparent 0 FP, but that 0 was a **measurement artifact** -- the negative set was the margin<40 misses, which a margin>=40 gate excludes by construction. On an honest 518-image non-Google pool the same gate fired on **~174 (≈33%)**, visually content (screenshots, Chinese "AI生成" Doubao/Jimeng text marks, logos, bright textures), not sparkles. Adding an achromatic-core constraint (`chroma <= 15`) did not separate them either (kept 15/33 POS, 41 NEG still firing). Root cause is the documented contrast-invariant-NCC wall: a faint sparkle on a busy background is indistinguishable from a bright/ornate content corner at the (shape-NCC, brightness-margin, core-chroma) feature level. + +**Conclusion: keep the 0.85 corner gate; do NOT add a margin/chroma-gated lower promote.** + +The cost (mislabel ~8-33% of non-Gemini content as Gemini) outweighs the benefit -- the visible sparkle is a medium-confidence stripped-metadata fallback, and intact Gemini is caught by C2PA in `identify` regardless. Remaining square misses are an accepted known limitation; a real fix would need a sparkle-specific discriminator (template match on a background-subtracted image, or a hard fixed-margin position prior), which is open research, not a threshold tweak. + +**Removal is reverse-alpha with an over-subtraction guard** (`remove_watermark` → `_reverse_alpha_blend`, else `_inpaint_footprint`): the sparkle alpha is computed (`alpha = max(R,G,B)/255`) from the bundled sparkle-on-black captures `assets/gemini_bg_{96,48}.png` (the capture max is ~130, NOT 255 — the sparkle is a ~51%-opaque white overlay, so `alpha` maxes at ~0.51, which is CORRECT for the capture, not under-exposed). The alpha is near-exact only when the real mark's effective opacity matches the capture, which holds on bright/flat backgrounds — re-verified clean on `demo_banana_before.png` 2026-05-31. + +**Issue #30 (dark-background black pit):** on a dark/textured background (e.g. grass, ~73) the real sparkle's effective opacity is LOWER than the captured 0.51, so the fixed-alpha reverse blend OVER-subtracts (`watermarked - a*logo` goes negative) and drives the footprint to black — the white sparkle becomes a black diamond. `remove_watermark` now detects this via `_reverse_alpha_oversubtracts` (fraction of footprint pixels with `alpha >= _FOOTPRINT_ALPHA` 0.1 whose numerator < 0 exceeds `_OVERSUB_FOOTPRINT_FRAC` 0.05) and **inpaints the footprint** (`_inpaint_footprint`, cv2 NS over the dilated alpha mask) from the surrounding pixels instead. + +**Behavior-neutral on the working case:** a bright background over-subtracts at ~0% so reverse-alpha is used and the output is byte-identical to before (verified: demo_banana 0.0 frac vs issue-#30 grass 0.61 frac; regression-guarded by `test_gemini_engine.py::TestOverSubtractionGuard`, which composites the sparkle at a reduced effective alpha to reproduce the mismatch). + +**Under-subtraction (the symmetric case, fixed 2026-06-03):** some real Gemini sparkles are rendered MORE opaque than the captured ~0.51, so the fixed-alpha reverse blend UNDER-subtracts and leaves a bright sparkle residual the detector still fires on (measured on the spaces corpus: a visible-removal audit through the registry path left a detectable sparkle on a meaningful fraction of marks, all under-removals, NOT a background-brightness class — failures and successes had the same input confidence and the same background-luma distribution; the discriminator was the removal delta itself). `remove_watermark` now estimates a per-image alpha gain (`_estimate_alpha_gain`: effective sparkle opacity at the bright core vs the local background ring, `a_eff/a_cap`, clamped `[1.0, _ALPHA_GAIN_MAX` 1.94`]`) and scales the alpha to match before the over-sub/blend branch. The gain cleanly separates on the corpus (under-removed marks ~1.47, cleanly-removed ~1.00), and a deadband (`_ALPHA_GAIN_DEADBAND` 1.05) keeps a matching sparkle **byte-identical** to the pre-fix output, so the fix is purely additive (0 regressions on the audit set; the over-sub guard still runs on the scaled alpha as the safety net for an over-shooting estimate). Regression-guarded by `test_gemini_engine.py::TestUnderSubtractionGain` (composites a more-opaque-than-capture sparkle; **asserts on footprint pixels, NOT the detector** — the detector's NCC is degenerate on a flat synthetic background, so a re-detect conf is meaningless there; the real corpus removal drops the detector from ~0.80 to ~0.27). + +**False-positive gate (added 2026-06-03):** `detect_watermark`'s shape-only NCC (`spatial*0.5 + gradient*0.3 + var*0.2`) fires on ornate/flat content (text strips, banners, hatching) that coincidentally matches the diamond shape — a real Gemini sparkle is a bright WHITE overlay, so its core sits above the local background, but the NCC is contrast-invariant and cannot see that. The fusion now **demotes** (caps confidence to 0.30) any match that is BOTH low-confidence (`< _SPARKLE_FP_CONF` 0.65) AND has a low core-ring brightness margin (`_core_ring_margin < _SPARKLE_FP_MARGIN` 5). Real sparkles escape via EITHER high confidence (white-bg sparkles score ≥0.79 despite a low margin — the NCC shape match is strong) OR high margin (dark/mid backgrounds, incl. the #36 faint-corner case, lift well clear), so BOTH must fail to demote. The gate is **monotonic** (only ever removes detections, never adds), so it cannot regress the verified-negative corpus (already 0 FPs). On the spaces corpus it demoted 16/495 flagged sparkles (13 carried no AI metadata = content FPs; the 3 AI-meta were visually FPs / a near-invisible white-on-white sparkle whose AI verdict is held by metadata anyway), and dropped the removal-audit failures 20→15 (post-removal flat footprints the NCC re-fired on). `_core_ring_margin` and `_estimate_alpha_gain` share the `_core_and_bg` helper (core 75th-pct brightness vs background-ring median). Regression-guarded by `test_gemini_engine.py::TestSparkleFalsePositiveGate`. + +**Self-verify repair (added 2026-06-04):** the gain estimate corrects most under-subtractions, but a tail of strong sparkles still survived reverse-alpha (position jitter, or a gain the `[1.0, 1.94]` clamp could not fully reach). After the reverse blend, `remove_watermark` re-detects via `_verify_and_repair`; when a sparkle at or above `_VERIFY_FALLBACK_CONF` 0.5 (the registry's real fail line) remains, it inpaints the footprint and **keeps that only when it lowers the re-detect confidence** — purely additive (the common clean removal re-detects below 0.5 and is returned untouched, so it can never regress). On the spaces corpus this rescued **4 of the 15 remaining gemini removal-audit failures** (15→11, doubao/jimeng still 0), verified through the registry/CLI path. Costs one extra `detect_watermark` per removal (two when the fallback fires). Regression-guarded by `test_gemini_engine.py::TestVerifyAndRepair` (stubs `detect_watermark` to drive the keep-best control flow, since the NCC is degenerate on flat synthetics). + +**An offset+scale alignment search was prototyped on the remaining 11 fails and REJECTED (2026-06-04):** an audit "ceiling" test suggested it could rescue 4 more (e.g. a5a9 0.577→0.417), but direct inspection showed those were NCC-gaming, not removal — the lower-scoring placement left the sparkle as bright or BRIGHTER (a5a9: first-pass slot 99.5th-pct ~76 at background level, the "aligned win" slot ~164), it just reshaped the residual so the contrast-invariant shape-NCC scored lower. A slot-brightness sanity gate rejected every one, so alignment contributed 0 genuine rescues and was removed (the footprint inpaint stays because it physically reconstructs the slot from its darker surroundings, so its rescues are real). + +**Lesson: the visible-audit pass/fail metric (re-detect conf < 0.5) is gameable by reshaping the residual — optimizing it directly finds NCC-gaming placements, not clean removals; gate any removal candidate on a physical brightness check, not the detector alone.** + +The 11 survivors are near-white ill-conditioning (reverse-alpha divides by `1-a`≈0.02) or detector false positives (before≈after≈0.51) that no reverse-alpha placement fixes. The registry's optional `inpaint_residual` (edge cleanup) is a no-op on a clean reverse-alpha removal (and on the same corpus it lowered the re-detect conf on 3 marks, raised it on 10, no-op on 466 — net-neutral on pass/fail, so the self-verify repair, not it, drives the removal tail); an earlier "Gemini smears" read was a misjudged soft-fur original, not an artifact. + +**The bg assets are now rebuilt from OUR OWN controlled captures** (`data/gemini_capture/captures/`, committed) by `scripts/visible_alpha_solve.py gemini`, which locates the 96px sparkle on the black capture and crops it to the two logo sizes; our capture matched the previously third-party-sourced `gemini_bg_96.png` to **NCC 0.9998**, validating the asset and making it reproducible. Gemini's multi-size fixed-slot model is genuinely different from the Doubao/Jimeng text-strip engines (so it stays a separate engine, not part of the shared-base refactor). + +## `_text_mark_engine.py` + +`_text_mark_engine.py` — **shared base for the three reverse-alpha text-mark engines (Doubao/Jimeng/Samsung), extracted 2026-06-09** (they were ~90% byte-identical clones). `TextMarkEngine(config: TextMarkConfig)` owns the whole `locate → extract_mask → detect → _fixed/_aligned_alpha_map → _apply_reverse_alpha → remove_watermark_reverse_alpha` pipeline (+ the asset-keyed `load_alpha_template`/`glyph_silhouette`/`template_match_score` caches). Each engine module is now a thin subclass: it supplies only its `TextMarkConfig` (the tuned constants, the bundled asset, and the bounded structural deltas — `corner` br/bl, `margin_floor` 4/2, `morph_open_size` 5/3, `min_gw` 8/16) plus the test-facing module shims (`_alpha_template`/`_glyph_silhouette`/`_template_match_score` + the constants). Behavior is byte-exact vs the old per-engine code (the three engine test suites pass unchanged). Gemini stays a SEPARATE engine (its multi-size fixed-slot sparkle model is genuinely different). Add a new text mark = a new `TextMarkConfig` + a thin subclass + one registry `_text_mark(...)` row. The engine bullets below describe each mark's calibration history; the LOGIC lives here. + +## `doubao_engine.py` + +`doubao_engine.py` — **a thin `_text_mark_engine.TextMarkEngine` subclass (config only) since 2026-06-09.** visible Doubao "豆包AI生成" remover/detector (cv2/numpy, no GPU). `DoubaoEngine.locate` anchors a bottom-right box by **geometry** (mark scales with image WIDTH), `extract_mask` pulls the light, low-chroma glyphs (the detection candidate) using a per-pixel channel-spread proxy `sat = roi.max(axis=2) - roi.min(axis=2)` (no HSV conversion). `detect` is **shape-consistent**: it matches the bundled alpha glyph silhouette (`assets/doubao_alpha.png`) against the candidate via zero-mean normalized correlation (`_template_match_score`, cv2 `TM_CCOEFF_NORMED`), gated at `DETECT_NCC_THRESHOLD` 0.4 over a small `DETECT_MIN_COVERAGE` floor. Keying on glyph SHAPE (not coverage heuristics) fixed #23 (corpus FP 7/1243). + +**Removal = reverse-alpha + thin residual inpaint** (`remove_watermark_reverse_alpha`): `original = (wm - a*logo)/(1-a)` from the bundled alpha map + `_ALPHA_LOGO_BGR` (pure white) + `_ALPHA_*_FRAC` geometry, then a deliberately THIN inpaint (`_RESIDUAL_*`, `INPAINT_NS`) over the glyph footprint clears leftover edges without smearing. + +**Alpha is rebuilt by `scripts/visible_alpha_solve.py` (the careful gray-self solve: cubic background fit, mean over channels, full halo, unblurred), same recipe as Jimeng** — the captures are committed in `data/doubao_capture/captures/`. + +**Removal aligns ALWAYS** (no `_ALPHA_NATIVE_BAND` fast-path): it tries fixed geometry AND `_aligned_alpha_map`'s `TM_CCOEFF_NORMED` scale+position search and keeps the lower-residual one — the mark is re-rasterized and a few px off per image, so fixed geometry alone leaves a visible outline even at 2048. + +**The locate box (`WM_*`) is generous (0.22 wide, margins 0.004) and reaches close to the corner** — a tight box (the old 0.185 / margin 0.012) let a corner-ward shift fall OUTSIDE the alignment search, so the align missed and a readable outline survived; regression-guarded by `test_recovers_shifted_mark_on_texture` (composes the alpha shifted on a known texture; old box ~29 vs new ~1 mean residual). + +**Issue #13 follow-up defect (found 2026-05-31): the SHIPPED Doubao removal left a clearly READABLE "豆包AI生成" outline on the real `doubao-1.png` sample, while `detect` returned conf 0.0 (it is fooled by a thin outline) so `test_reverse_alpha_removes_mark` passed and the old "56/56 clean" claim was detector-measured, not visual.** + +Root cause: bad alpha (under-estimated, max ~0.65) + fixed-no-inpaint + tight box; the careful rebuild + always-align + thin inpaint + wide box takes it from a readable outline to faint texture-level traces (parity with Jimeng — a single capture cannot pixel-cancel a per-image re-rasterized mark). + +**Lesson: a detector-only removal test is insufficient; assert visual residual (the textured-shift test).** **`extract_mask` guards a degenerate ROI (`bh < 16 or bw < 16` -> empty mask, skips cv2):** the always-align removal scores each placement with a residual `detect(out)`, and on an extremely wide/short image (e.g. 2048x1, `test_wide_short_does_not_raise`) that fed cv2's GaussianBlur a ~1-px-tall ROI and **faulted natively on Windows py3.12 (access violation, non-deterministic — one CI cell went red while a re-run passed)**; the old at-native path never ran `detect` on degenerate sizes. Real images always clear the guard (the `WM_*` box floors are `max(16, …)` height / `max(40, …)` width), so it only short-circuits slivers. `reverse_alpha_available` is just "asset present"; the registry gates removal on `detect`. The shipped third-party `_refs/zhengsuanfa_doubao_alpha_120x20.png` is NOT a usable alpha (verified 2026-05-29). Arbitrary-region inpainting is `region_eraser`/`erase`. + +## `jimeng_engine.py` + +`jimeng_engine.py` — **a thin `TextMarkEngine` subclass (config only) since 2026-06-09.** visible Jimeng / Dreamina "★ 即梦AI" remover/detector (cv2/numpy, no GPU), built 2026-05-30 from issue #13's solid captures (@powersee). Shares the base with `doubao_engine`: `locate` anchors a bottom-right box by **geometry** (scales with WIDTH), `extract_mask` pulls the light low-chroma glyphs (white top-hat + grayish + min-luma), `detect` matches the bundled "即梦AI" glyph silhouette (`assets/jimeng_alpha.png`) via `TM_CCOEFF_NORMED` over a coverage floor. Threshold `DETECT_NCC_THRESHOLD` **0.45** cleanly separates real Jimeng marks (>=0.81) from the Doubao strip (0.21) and other AI output (0.0), so the two ByteDance marks don't cross-fire in `--mark auto`. + +**Logo is pure white (255,255,255)** (`_ALPHA_LOGO_BGR`; the white capture + an L-pair-solve confirm ~254.6); compositing is **sRGB, not linear** (a linear-light solve tripled the cross-residual). + +**Alpha rebuilt by `scripts/visible_alpha_solve.py` from the GRAY capture** (`data/jimeng_capture/captures/`, the solid captures now committed): `a = (I - B)/(255 - B)`, B a per-capture **cubic** background fit over the non-glyph pixels, **averaged over channels, full halo extent (down to a~0.02), unblurred**. Gray (bg ~132) is the deliberate choice over black: it is the best proxy for real content (the mark sits on bright photo areas, not on black), and the careful build drops the gray self-residual to ~1.3. + +**The mask quality, not the method, was the earlier limit** — a max-channel / quadratic-bg / blurred / halo-truncated build (and a black-dominated LS) left a visible outline (lesson from issue #13: when reverse-alpha leaves a ghost, suspect the captured alpha map before adding heuristics or switching method). Geometry emitted by the solver at `_ALPHA_NATIVE_WIDTH` 2048: `_ALPHA_WIDTH_FRAC` 0.202, `_ALPHA_HEIGHT_FRAC` 0.058, margins ~0.029. + +**Removal = reverse-alpha + a deliberately THIN residual inpaint** (`remove_watermark_reverse_alpha`, `_RESIDUAL_DILATE` 5 over the `_RESIDUAL_ALPHA_FLOOR` 0.05 footprint, `_RESIDUAL_INPAINT_RADIUS` 2, `INPAINT_NS`): a single 2048 alpha cannot pixel-cancel the mark re-rasterized at another resolution (alpha maps from independent captures correlate 0.998, not 1.0; off-native reverse-alpha alone only halves the mark), so a tight inpaint clears the residual edges WITHOUT the texture/edge smear a wide full-footprint pass caused. + +**Placement ALWAYS tries fixed geometry AND `_aligned_alpha_map`'s NCC scale+position search, keeping the lower-residual** — the mark re-rasterizes + jitters a few px per image even at the captured width, so fixed geometry alone misses (there is no `_ALPHA_NATIVE_BAND` fast-path; the scale search `_ALPHA_ALIGN_SEARCH` is fine-stepped, and the `WM_*` locate box is generous so a corner-ward shift stays inside the search — the same widen that fixed Doubao). Verified clean on the solid captures (native 2048; faint self-residual ~1.3 visible only on a dead-flat field, hidden by real texture) and a real 1440-wide Jimeng download (off-native, table edge preserved). `reverse_alpha_available` is just "asset present"; the registry gates on `detect`. + +**No committed real sample** (the real content download stays gitignored; only the solid calibration captures are committed) — `tests/test_jimeng_engine.py` synthesizes a mark from the bundled alpha asset, and `test_recovers_shifted_mark_on_texture` guards the align-on-shift path that the Doubao defect exposed. Jimeng images are independently caught by the China TC260 AIGC label in `metadata`/`identify`, so this engine is the visible-mark *removal* path, not a new `identify` signal. + +## `samsung_engine.py` + +`samsung_engine.py` — **a thin `TextMarkEngine` subclass (config only) since 2026-06-09.** visible Samsung Galaxy AI "✦ Contenuti generati dall'AI" remover/detector (cv2/numpy, no GPU), built 2026-06-05 from issue #37's flat captures (@f-liva). Shares the base but anchored **bottom-LEFT** (Doubao/Jimeng are bottom-right): `locate` anchors a bottom-left box by **geometry** (scales with WIDTH), `extract_mask` pulls the light low-chroma glyphs (white top-hat + grayish + min-luma — `LOGO_MIN_LUMA` is lowered to **110** because the mark is faint, peak alpha ~0.38, so on a mid/dark background its glyph luma is lower than Jimeng's), `detect` matches the bundled glyph silhouette (`assets/samsung_alpha.png`) via `TM_CCOEFF_NORMED` over a coverage floor. Threshold `DETECT_NCC_THRESHOLD` **0.40** (real marks ~0.79 on a real photo, ~0.57/0.71 on the black/gray captures; 0.0 on Doubao/Jimeng captures, and Doubao/Jimeng score 0.0 on a real Samsung photo — no cross-fire, also because the corner differs). + +**Logo is pure white (255,255,255)** (`_ALPHA_LOGO_BGR`; white capture confirms). + +**Alpha solved by `scripts/visible_alpha_solve.py samsung` from the GRAY capture** (`data/samsung_capture/captures/`, the flat black/gray/white captures committed; the solver gained a `corner="bl"` mode + left-margin logging for this), same careful recipe as Jimeng (cubic background, mean-channel, full halo, unblurred). Geometry emitted at `_ALPHA_NATIVE_WIDTH` **1086** (the flat-edit capture width): `_ALPHA_WIDTH_FRAC` 0.3195, `_ALPHA_HEIGHT_FRAC` 0.0378, `_ALPHA_MARGIN_LEFT_FRAC` 0.0110, `_ALPHA_MARGIN_BOTTOM_FRAC` 0.0064. + +**Removal = reverse-alpha + a deliberately THIN residual inpaint** (`remove_watermark_reverse_alpha`, same `_RESIDUAL_*` recipe as Jimeng) with **always-try fixed AND `_aligned_alpha_map` NCC scale+position search, keep the lower-residual** (`_ALPHA_ALIGN_SEARCH` widened to (0.85, 1.18, 23) because the flat captures are far off the real-photo width). + +**Resolution caveat:** the flat captures arrived at 1086 wide while real photos are ~2958 wide (the mark scales with width, so the captured glyph ~334px is ~2.7x smaller than the ~903px real-photo glyph); width-scale + NCC-align still removes it cleanly (verified on a real 2958-wide @f-liva photo: re-detect 0.79→0.00, no readable text or outline on the recovered wooden table — checked **visually**, not just by the detector, per the Gemini self-verify lesson), but a flat capture at the real photo resolution would make the alpha pixel-sharp instead of upscaled (open quality upgrade, noted in `data/samsung_capture/README.md`). + +**The mark is locale-specific** (text differs per language); this build is the Italian "Contenuti generati dall'AI" variant — other locales need their own captured template. `reverse_alpha_available` is just "asset present"; the registry gates on `detect`. + +**No committed real sample** (the real photo stays gitignored; only the flat calibration captures are committed) — `tests/test_samsung_engine.py` synthesizes a mark from the bundled alpha asset (bottom-left geometry), with `test_recovers_shifted_mark_on_texture` guarding the align-on-shift path. Samsung Galaxy AI edits are independently caught by C2PA + the `genAIType` marker in `metadata`/`identify`, so this engine is the visible-mark *removal* path; it also feeds `identify` as the medium-confidence `visible_samsung` signal via the registry (the stripped-metadata fallback). + +## `region_eraser.py` + +`region_eraser.py` — universal region eraser (`erase` CLI). `erase(image, boxes=|mask=, backend=)` accepts grayscale (2D) and RGBA (4-channel) inputs on **both** backends (`erase_cv2` and `erase_lama` each split off any alpha plane and re-attach it unchanged, and promote grayscale to BGR for processing — LaMa would otherwise crash on grayscale and drop alpha on BGRA): `boxes_to_mask` → `cv2.inpaint` (`cv2` backend, default, no deps) or big-LaMa via onnxruntime (`lama` backend, extra `lama`, `Carve/LaMa-ONNX` Apache-2.0 model downloaded on first use, never bundled). `erase_lama` crops a padded region around the mask, runs LaMa at its fixed 512² input, pastes only masked pixels back (untouched areas stay pixel-exact). Lazy `_get_lama_session` singleton; `lama_available()` guards the optional import. + +**LaMa-ONNX costs ~3.5-4 GB peak RAM and ~5-6 s/call on CPU** (FFC working set, not arena — `enable_cpu_mem_arena=False` does not help), so it does NOT fit a minimal droplet; the cv2 backend (tens of MB, ~30 ms) does. LaMa quality at low RAM = serverless/GPU, mirroring how raiw.cc offloads SDXL to fal. + +## `invisible_watermark.py` + +`invisible_watermark.py` — `detect_invisible_watermark(path)` decodes the OPEN DWT-DCT watermarks (public decoder, no key) embedded by Stable Diffusion / SDXL / FLUX via the `imwatermark` library. Known fixed patterns (verified against upstream source) live in `_BITS_48` (SDXL 48-bit, FLUX.2 48-bit) and `_SD1_STRING` ("StableDiffusionV1", SD 1.x/2.x). Optional dep (extra `detect`); returns None when absent. The `detect` extra pulls **torch** transitively (invisible-watermark declares torch a hard dep, and `WatermarkDecoder` eagerly imports `rivaGan` -> `torch` at import time), so detection needs torch present even though dwtDct runs CPU-only on cv2/numpy/pywavelets — no GPU and no separate `gpu` extra required. + +**Unlike SynthID this is locally detectable**, but the watermark is fragile (does not survive JPEG re-encode/resize — verified gone after JPEG q90), so it confirms origin only on pristine files. Add new known patterns here. The file carries a top-of-module pyright pragma because imwatermark/cv2 ship no type stubs. + +## `trustmark_detector.py` + +`trustmark_detector.py` — `detect_trustmark(path)` decodes the OPEN, keyless **Adobe TrustMark** watermark (the soft binding behind Adobe Durable Content Credentials, `alg` `com.adobe.trustmark.P`) via the optional `trustmark` package (extra `trustmark`; pulls torch, downloads model weights on first use). Mirrors `invisible_watermark.py` (lazy singleton guarded by a double-checked `threading.Lock` so concurrent callers do not double-download the weights, top-of-module pyright pragma, returns None when absent). It detects *provenance*, not AI origin as such (TrustMark also marks human-authored content), so `identify` lists it as a watermark without setting `is_ai_generated`. Other soft-binding vendors (Digimarc/Imatag/Steg.AI/...) have no public decoder — they are only *named* via the `C2PA_SOFT_BINDINGS` scan, not decoded. + +**False-positive gate (added 2026-05-29):** TrustMark's `wm_present` is a BCH error-correction validity flag that spuriously validates on a content-correlated fraction of un-watermarked images — AI-generated textures trip it far more than camera photos (verified 2026-05-29 on real files: it fires on Gemini/OpenAI/Doubao output that *cannot* carry Adobe's watermark, with a random-bytes decoded secret, while signal-free camera photos did not trip it). A genuine TrustMark is a *durable* soft binding engineered to survive re-encoding, so `detect_trustmark` re-decodes after a mild JPEG round-trip (`_survives_reencode`, `_REENCODE_QUALITY` 95) and requires the same schema both times; every observed false positive collapsed (none survived even q95), so the gate is the durability property the watermark guarantees. The second decode runs only on the rare initial hit, so the cost is negligible. Do NOT remove the gate to "catch more" — a lone TrustMark hit without it is almost always content noise. + +## `noai/watermark_remover.py` + +`noai/watermark_remover.py` — the `WatermarkRemover` class has two diffusion pipelines, selected by the explicit `pipeline` ctor arg (NOT inferred from `model_id` -- both use the same SDXL base, `DEFAULT_MODEL_ID`). + +**`sdxl`** (renamed from `default` 2026-06-09; `default` kept as a back-compat alias via `normalize_profile`) runs plain SDXL img2img (`_run_img2img`); it is the lighter opt-down alternative (no ControlNet weights). + +**`controlnet`** (**the DEFAULT pipeline since 2026-06-09** for `invisible`/`all`/`batch` and both engine ctors; `_run_controlnet`, `_load_controlnet_pipeline`) runs `StableDiffusionXLControlNetImg2ImgPipeline` with the SDXL-native canny ControlNet `xinsir/controlnet-canny-sdxl-1.0` (`watermark_profiles.CONTROLNET_CANNY_MODEL`): the control image is `cv2.Canny(gray, 100, 200)` stacked to 3 channels (`_CANNY_LOW`/`_CANNY_HIGH`, prompt `_CONTROLNET_PROMPT` / `_CONTROLNET_NEGATIVE`). + +**Removal comes from the img2img regeneration (`strength`); the ControlNet only PRESERVES text and face STRUCTURE via the edge map.** + +No original pixels are copied or frozen, BUT **validation 2026-06-04 disproved the old "so SynthID does not survive" claim: SynthID CAN survive controlnet on photoreal/high-detail content.** + +At the shared low removal strength the canny edge-conditioning keeps the regeneration so close to the original that the pixel perturbation that destroys SynthID does not happen (oracle-confirmed: an OpenAI bracelet photo + a 9-face grid read **SynthID-detected** after controlnet at strength 0.10/0.15, but **SynthID-not-detected** after the `default` pipeline at the SAME strength + resolution -- only the pipeline differed). + +**But the reverse also holds: a flat-graphic logo/poster SURVIVED `default` while clearing controlnet** -- removal at the low strength is content×pipeline dependent and neither pipeline is universally safe; the real lever is a higher strength. See the controlnet Known-limitations bullet for the full table + root cause. Canny holds face STRUCTURE but NOT identity (the regenerated face drifts in likeness -- canny carries edges, not identity). The drifted cleaned face is the LEAST-AI state we can reach without re-introducing SynthID; the library does NOT ship a face-restore extra. Every restore approach we evaluated (GFPGAN-on-cleaned, PhotoMaker-V2 txt2img, InstantID txt2img, InstantID img2img-on-cleaned at three parameter sweeps, 2026-06-04 - 2026-06-08 Modal cert sweeps) regenerated the face from an ArcFace embedding via SDXL diffusion -- which makes the output face look MORE AI-generated, not less. Empirical conclusion in `docs/synthid-robust-identity-research-2026-06-08.md` "Empirical follow-up". For production face preservation, ship the cleaned image as-is. `controlnet_conditioning_scale` (ctor arg, default 1.0) is the structure-preservation knob. Same dtype rule as `default` (fp32 on cpu/mps, fp16 only on cuda/xpu; the fp16-fixed SDXL VAE `_SDXL_FP16_VAE_ID` is swapped in on fp16 GPUs -- issue #29) and the same MPS->CPU fallback (reload on cpu/fp32, drop a non-cpu generator, retry once). + +## `auto_config.py` (REMOVED 2026-06-09) + +**`auto_config.py` + the content-detection layer were REMOVED 2026-06-09.** + +History: `auto_config.plan()` was a content-adaptive planner that detected faces/text/edges (bundled OpenCV YuNet + PP-OCRv3 DBNet models) to route the pipeline and toggle the adaptive polish. Once `controlnet` became the default-and-only auto pipeline (it no longer downgrades a structure-less image to `sdxl`) and the adaptive polish was confirmed to **self-gate by detail level** (`humanizer.adaptive_polish` no-ops when the cleaned image already meets the input's Laplacian variance, so it does real work only on over-smoothed photo/face texture and ~nothing on text/flat), the detection no longer changed any behavior — it only annotated a `reason` string. So the whole layer was deleted: `auto_config.py`, `tests/test_auto_config.py`, and the two detection assets (`assets/face_detection_yunet_2023mar.onnx`, `assets/text_detection_ppocrv3_2023may.onnx`, ~2.6 MB). + +**`--auto` is now a DEPRECATED no-op** (`cli._resolve_auto_polish`): controlnet is already the default pipeline AND the adaptive polish is ON by default, so `--auto` has nothing left to do — it only prints a deprecation warning and passes `adaptive_polish` through unchanged (an explicit `--no-adaptive-polish` still wins). (Originally it re-enabled the polish; once the polish default flipped to ON the same day, the parameter-source branch became dead and was dropped.) The **adaptive polish itself lives on** in `humanizer.adaptive_polish` (CLI `--adaptive-polish/--no-adaptive-polish`, **ON by default since 2026-06-09** — it self-gates to a no-op where there is no detail deficit, so default-on is safe; uses the full-res original as the detail reference) — see the `humanizer` test note. `batch` resolves the polish once before the loop (one warning) and caches the invisible engine per pipeline (`ctx.obj["_inv_engines"]`). + +## `upscaler.py` + +`upscaler.py` — optional Real-ESRGAN pre-diffusion super-resolution for small inputs (spandrel boundary, top-of-file pyright pragma). `is_available()` gates on spandrel+torch (via `importlib.util.find_spec`); `upscale(bgr, device=None)` loads a lazily-built spandrel `ImageModelDescriptor` singleton (double-checked lock) and upscales by the model's native factor (x2), with a non-CPU→CPU device fallback mirroring the diffusion engine's MPS→CPU retry. Weights (`RealESRGAN_x2plus.pth`, BSD-3-Clause) download on first use to the `torch.hub` checkpoints cache; never bundled. Used only when UPscaling to the `min_resolution` floor (a `max_resolution` downscale always uses Lanczos). The wiring is `InvisibleEngine._esrgan_upscale(pil, target)` — Real-ESRGAN at native factor, then a Lanczos resize to the exact target, falling back to a plain Lanczos resize if the extra is absent or the model errors (so an optional upscaler can never break removal). The default `--upscaler` is `lanczos` (cv2, no deps). + +**ESRGAN is a generic photo/texture GAN with no face/glyph prior**, so it best fits photo/texture content and can degrade faces (glassy/asymmetric eyes -- the diffusion pass regenerates faces so the full-pipeline final recovers) and thin/small text (the GAN invents wrong strokes, and low-strength diffusion will not fix it). Verified 2026-06-04: isolated upscale lap-var ~5x Lanczos on faces+textures but glassy eyes; end-to-end `invisible` final lap-var 1634 vs Lanczos 663 with natural faces (diffusion cleaned the artifact). Kept a **manual opt-in knob** (the auto plan never selects it) with `lanczos` the default; not content-gated by design (use Lanczos for text-heavy inputs). spandrel is MIT and pulls no basicsr. Unit-tested without the model: `tests/test_upscaler.py` (availability guard + the not-installed RuntimeError) and `tests/test_invisible_engine.py::TestEsrganUpscale` (the three `_esrgan_upscale` branches via a monkeypatched `upscaler`). + +## `image_io.py` + +`image_io.py` — Unicode-safe cv2 IO (issue #17). `imread(path, flags=None)` / `imwrite(path, img)` wrap `np.fromfile`+`cv2.imdecode` / `cv2.imencode`+`tofile` so non-ASCII paths work on Windows -- bare `cv2.imread`/`cv2.imwrite` use the platform ANSI code-page API there and fail (empty decode + `can't open/read file`) on Chinese/Cyrillic/accented filenames. `imread` keeps `cv2.imread` semantics (defaults to `IMREAD_COLOR`, returns `None` on missing/empty/undecodable). + +**Every cv2 file read/write in the package routes through here; do not call `cv2.imread`/`cv2.imwrite` directly.** + +`imwrite` returns `False` on an unwritable path (`OSError` caught) instead of raising, matching `cv2.imwrite` semantics. macOS/Linux already accept UTF-8 paths, so it is behavior-neutral there (the bug only reproduces on Windows). + +**`to_bgr(image)` (added 2026-06-09)** is the shared channel normalizer: promotes 2D grayscale / (h,w,1) / 4-channel BGRA to 3-channel BGR (a 3-channel input is returned unchanged, no copy). Use it instead of inlining the `cvtColor(GRAY2BGR/BGRA2BGR)` branch — the gemini engine and the `TextMarkEngine` base both route through it so a grayscale/BGRA input (a real Gemini-app export is opaque RGBA) does not crash the `axis=2` channel reductions. cv2/numpy are imported lazily inside the functions, so the module is cheap to import in a bare env. diff --git a/docs/release-and-distribution.md b/docs/release-and-distribution.md new file mode 100644 index 0000000..387b48c --- /dev/null +++ b/docs/release-and-distribution.md @@ -0,0 +1,25 @@ +# Release and distribution + +> Relocated verbatim from `CLAUDE.md` on 2026-06-11 to keep the always-loaded +> context small. Long single-line entries were reformatted into paragraphs; +> no content was changed or summarized. + +Release flow and every distribution channel (PyPI, Homebrew tap, conda-forge, +ComfyUI Registry, HF Space), plus sdist/build-backend history. The CI summary +stays in `CLAUDE.md`; read this before cutting a release. + +`publish.yml` stays release-only and now verifies the release tag matches the `pyproject.toml` version (fails the build on a mismatch) before building, then uploads via `uv publish` (PyPI trusted publishing over OIDC, no token — replaced the `pypa/gh-action-pypi-publish` action so the upload no longer depends on that action's bundled twine accepting the Metadata-Version; the `id-token: write` permission + `pypi` environment + workflow filename are unchanged, so PyPI's trusted-publisher entry still matches). + +**Release flow:** bump the version in `pyproject.toml` + `src/remove_ai_watermarks/__init__.py` + `uv.lock` (the project's own `[[package]]` entry — find it with `grep -n 'name = "remove-ai-watermarks"' uv.lock`, the `version =` line right below it, ~line 2246), commit `chore(release): vX.Y.Z`, `git tag -a vX.Y.Z -m vX.Y.Z` (annotated — `git tag` without `-m` errors here), push `main` + the tag, then `gh release create vX.Y.Z` — **PyPI publish triggers on the GitHub Release `published` event, NOT on the tag push**, so the tag alone does not publish. + +**After the PyPI sdist is live, bump the Homebrew formula** in the separate public tap repo `wiltodelta/homebrew-tap` (`Formula/remove-ai-watermarks.rb`): update `url` to the new sdist URL (from `https://pypi.org/pypi/remove-ai-watermarks//json`, the `sdist` entry's `url`) and `sha256` to its hash, commit + push there — otherwise `brew install wiltodelta/tap/remove-ai-watermarks` keeps installing the old version. The formula is a core-only venv that pip-installs the sdist (no vendored resources, so pip pulls the binary numpy/opencv wheels per platform at install time); only those two lines change per release. + +**This is now AUTOMATED:** the main repo's `.github/workflows/distribute.yml` fires on the GitHub Release `published` event, waits for the sdist to appear on PyPI (poll loop, the Release event races publish.yml's upload), rewrites the formula's `url`+`sha256`, and pushes to the tap using the `HOMEBREW_TAP_TOKEN` repo secret (a fine-grained PAT with Contents:write on `homebrew-tap`). The SAME workflow also factory-rebuilds the HF Space (`HfApi.restart_space(..., factory_reboot=True)`, `HF_TOKEN` secret) so the Space reinstalls the new sdist (it pins `remove-ai-watermarks>=...` and only re-resolves on a rebuild). The manual Homebrew steps above are the fallback / what the workflow automates — a normal release needs no Homebrew or HF action. + +**Other distribution channels:** (1) **conda-forge** — recipe source of truth committed at `packaging/conda/recipe.yaml` (v1 `recipe.yaml`, noarch core-only: pillow/piexif/numpy/py-opencv/click/python-dotenv); the initial submission is `conda-forge/staged-recipes` PR #33674 (went green only after **`pip_check: false`** in the python test — rattler-build's `pip check` defaults to ON and fails on the ancient conda-forge `piexif py_2` build's stale metadata with "piexif 1.1.3 is not supported on this platform", though the package installs/imports/works; keep it disabled). Once that merges and the `remove-ai-watermarks-feedstock` exists, the `regro-cf-autotick-bot` auto-opens a version-bump PR on the feedstock when each new PyPI sdist is detected — just review + merge it (hand-edit only if run-deps changed; keep `packaging/conda/recipe.yaml` in sync as the reference copy). (2) **ComfyUI Registry** — the node package is a SEPARATE repo `wiltodelta/ComfyUI-remove-ai-watermarks` with its OWN `pyproject.toml` `version` (independent of the library version). Publish a new node version by bumping that `version` in the node repo's `pyproject.toml` and pushing to `main` — the node repo's `.github/workflows/publish.yml` (`Comfy-Org/publish-node-action@main`, triggered on a push that touches `pyproject.toml`, secret `COMFY_REGISTRY_TOKEN`) **auto-publishes** it; `comfy node publish --token ` is the manual/local fallback. It is NOT auto-published on a library release (the node has its own version), so only bump it when the node code or its `remove-ai-watermarks>=` dependency floor changes. + +**Sdist must exclude `data/`** (`[tool.hatch.build.targets.sdist] exclude = ["/data"]`): hatchling's default sdist bundles all VCS-tracked files, so the committed `data/` test corpora (the multi-hundred-MB synthid_corpus images + the visible-mark captures) pushed the **0.8.0** sdist past PyPI's per-project file-size limit (400 "File too large") — the wheel uploaded but the sdist was rejected, so 0.8.0 shipped wheel-only and 0.8.1 carried the fix. The wheel only ships `src/` (via `[tool.hatch.build.targets.wheel] packages`), so it was never affected. + +**A failed PyPI upload of one artifact still leaves the other live and you cannot re-upload the same version** — fix the build and cut the next patch. + +**Build backend is unpinned `hatchling`** (`[build-system] requires`) since 2026-06-09. History: it was pinned `<1.31` because hatchling 1.30.0 made Metadata-Version 2.5 (PEP 794) the default and the twine bundled in `pypa/gh-action-pypi-publish@release/v1` rejected it (`"'2.5' is not a valid Metadata-Version"`), which **failed the v0.8.3 PyPI upload on 2026-06-01**; hatchling 1.30.1 reverted the default to 2.4. After the workflow moved to `uv publish` (whose uploader accepts 2.5) the pin was belt-and-suspenders only, and once v0.9.0 + v0.10.0 both published wheel+sdist through that path (verified on PyPI) it was dropped. If a future hatchling flips the default to 2.5 again and some consumer chokes, re-pin with a dated comment. diff --git a/docs/research-doubao-distillation.md b/docs/research-doubao-distillation.md new file mode 100644 index 0000000..925072b --- /dev/null +++ b/docs/research-doubao-distillation.md @@ -0,0 +1,21 @@ +# Doubao clean-reverse-alpha distillation (re-investigated 2026-05-29) + +> Relocated verbatim from `CLAUDE.md` on 2026-06-11 to keep the always-loaded +> context small. Long single-line entries were reformatted into paragraphs; +> no content was changed or summarized. + +**RESOLVED 2026-05-29: black+gray Doubao captures were obtained and a reverse-alpha is built** (`doubao_engine.remove_watermark_reverse_alpha`, `assets/doubao_alpha.png`; see the `doubao_engine.py` section in `docs/module-internals.md`). The captures (`data/doubao_capture/captures/`, now committed) confirmed the alpha-composite model: on black `captured = a*logo`, logo pure white. + +**UPDATE 2026-05-31 (issue #13 follow-up): the first build was NOT "exact"** — it left a readable "豆包AI生成" outline on the real sample (the detector was fooled, conf 0.0). The alpha is now rebuilt by `scripts/visible_alpha_solve.py` (the careful gray-self solve shared with Jimeng), removal always-aligns + thin-inpaints, and the locate box was widened; see the `doubao_engine.py` section in `docs/module-internals.md`. The notes below (the failed content-image distillation) are retained as the record of why controlled captures were necessary. + +**Conclusion (historical): pure reverse-alpha distilled from content images does NOT work, and the blocker is the WRONG kind of data, not too little of it.** + +The earlier framing ("need ~5-8 PRISTINE same-resolution originals") is obsolete -- `data/spaces/originals/` holds plenty. Curate them with `DoubaoEngine.detect` + an NCC filter against a clean glyph template, keeping only marks at offset ≈ (0,0): that yields e.g. **15 pixel-aligned 2048² marks** (sub-pixel drift, not the ±50 px the old lossy/mixed-res scrapes had), plus 1086x1448 / 1792x2400 clusters. With those, LaMa-clean `O` + weighted-LS (and per-pixel I-on-O regression) for `α` (+ logo colour) was tried end-to-end and **still leaves a persistent ghost outline.** + +Diagnosed why, empirically (cached stacks, `/tmp/doubao_distill`): (1) the mark is a clean white overlay with **no dark halo** -- over glyph pixels ~54% are brighter than the clean bg, only ~4% darker -- so the white-logo model `I=(1-α)O+α·255` is correct; (2) but content backgrounds are almost never dark *under* the mark (median darkest available bg over glyph pixels = **58/255**; only ~13% of mark pixels are ever observed on a bg < 40), so on bright backgrounds the equation is ill-conditioned and `α` is unidentifiable; (3) LaMa's `O` is a plausible **hallucination**, not the true pre-mark background, which compounds the error, and per-pixel regression on ~15 obs overfits into colour noise. + +**Why Gemini's engine is clean (verified in GeminiWatermarkTool `src/core/watermark_engine.cpp`): its alpha map is the watermark stamped on a PURE-BLACK background**, where `watermarked = α·255 + (1-α)·0 = α·255`, so `alpha = capture/255` exactly -- no estimation. (`gemini_bg_*.png` is literally the sparkle in grey on black.) So the real Doubao unlock is the same controlled capture, **not more content images**. Black/white/gray seeds exist (`data/doubao_capture/seeds/seed_*_1x1_2048x2048.png`); a capture run (feed a black seed through doubao.com edit mode, download the *original*) was requested from the #13 reporter 2026-05-29. With ~2-3 black captures we get `α = capture/255` for free, Gemini-quality. + +**Until black captures arrive, the shipped direction is precise canonical glyph mask + inpaint (cv2 default, lama optional), NOT reverse-alpha.** + +The consensus glyph silhouette across the aligned marks distills cleanly (proto: a tight "豆包AI生成" strip, width ≈ 0.156 × image-width) and is good both as an exact inpaint mask and as an NCC localiser -- the latter also fixes the #23 detector false-positives (match the real glyph shape, not any bright low-saturation corner). Do **not** retry content-image reverse-alpha: it is data-limited by physics (no dark-background observations), not by effort. diff --git a/docs/watermarking-landscape.md b/docs/watermarking-landscape.md new file mode 100644 index 0000000..acdcb2e --- /dev/null +++ b/docs/watermarking-landscape.md @@ -0,0 +1,40 @@ +# Watermarking landscape (research 2026-05-24) + +> Relocated verbatim from `CLAUDE.md` on 2026-06-11 to keep the always-loaded +> context small. Long single-line entries were reformatted into paragraphs; +> no content was changed or summarized. + +Who embeds what, and whether it is locally detectable (so we know which gaps are fillable). See `identify.py` for what we read. +- **Locally detectable (open decoder, no key/API):** Stable Diffusion / SDXL / FLUX via `imwatermark` DWT-DCT (now covered by `invisible_watermark.py`). FLUX uses the same library (`black-forest-labs/flux2` `src/flux2/watermark.py`, 48-bit `0b001010101111111010000111100111001111010100101110`); SDXL is the diffusers `WATERMARK_MESSAGE` (`0b101100111110110010010000011110111011000110011110`). Caveat: fragile to re-encoding. +- **C2PA / IPTC (covered by the issuer/marker scan):** OpenAI, Google, Adobe Firefly, Microsoft (Designer + **Bing Image Creator** — collected 2026-05-24; Bing now runs Microsoft's own **MAI-Image** model, signs C2PA as "Microsoft", NOT OpenAI/DALL-E), and **Stability AI** (collected from Brand Studio / DreamStudio successor; signs C2PA as "Stability AI Ltd", no SynthID, no imwatermark on its current Stable Image model — issuer added to `C2PA_ISSUERS`). Still unsampled: Canva (its downloads are re-encoded design *exports* that strip C2PA, so a Canva "positive" is inconclusive — skipped), Getty, Shutterstock. Midjourney embeds NO C2PA and no invisible watermark (our `mj-*` sample carried only the IPTC tag). + +**Samsung Galaxy AI** (Generative Edit / Sketch to Image / Portrait Studio on Galaxy S23 FE / S24 / S25, One UI 7+) signs C2PA as "Samsung Galaxy" with the standard `trainedAlgorithmicMedia` source type AND a proprietary `genAIType` marker; verified on real signed files 2026-05-29 (the standard scan catches the source type; `genAIType` additionally catches a Galaxy S24 file that omits it). It ALSO burns a **visible** localized wordmark into the pixels — a sparkle + "generated with AI" string in the bottom-LEFT corner (issue #37; the Italian "✦ Contenuti generati dall'AI" variant is calibrated) — removed by `samsung_engine.py` / `visible --mark samsung` (reverse-alpha, see the engine bullet); detection feeds `identify` as the medium `visible_samsung` signal. The string is locale-specific, so each locale needs its own captured alpha template. + +**ASUS Gallery** also signs edited photos as C2PA (`com.asus.gallery`) but with no AI source type — a signer, not an AI marker. + +**Black Forest Labs (FLUX)** API output signs C2PA: `claim_generator_info "Black Forest Labs API"` + a `c2pa.ai_generated_content` assertion + `trainedAlgorithmicMedia` (issuer `b"Black Forest Labs"` added to `C2PA_ISSUERS`, platform "Black Forest Labs (FLUX)"). + +**ByteDance Volcano Engine (Volcengine)** — the cloud behind Doubao / Jimeng — signs its AI image output with a cert from `certificate_center@volcengine.com` + `trainedAlgorithmicMedia` (issuer `b"volcengine"` → "ByteDance (Volcano Engine)", platform "ByteDance (Doubao / Jimeng / Volcano Engine)"); note this is the C2PA-signed surface, distinct from the XMP/PNG TC260 `AIGC` label Doubao also uses. All three verified on real signed files 2026-05-29. +- **EXIF/XMP generator tag (caught by `exif_generator`):** **Ideogram** writes EXIF `Make="Ideogram AI"` (collected 2026-05-24 — no C2PA, no SynthID, no imwatermark; the Make tag is the only signal). +- **xAI / Grok — its own EXIF signature scheme, NOT C2PA (DETECTED by `metadata.xai_signature`, built 2026-05-26).** + +Grok JPEG downloads (Aurora model) carry **no C2PA, no XMP, no SynthID, no IPTC** — only EXIF `Artist` = a UUID and EXIF `ImageDescription` = `Signature: ` (a crypto signature, unverifiable locally without xAI's public key). This empirically kills the earlier unverified "xAI signs C2PA as xAI" lead — xAI is not even a C2PA member. `exif_generator` misses it (neither field holds an `AI_GENERATOR_TOKENS` token), so a dedicated detector `xai_signature(path)` matches the pair (`ImageDescription ~ ^Signature: [A-Za-z0-9+/=]{64,}` AND UUID `Artist`); wired into `has_ai_metadata`, `get_ai_metadata` (key `xai_signature`), and `identify` (signal `xai_signature`, platform "xAI (Grok / Aurora)"). + +**Format confirmed stable across n=3 genuine generations:** exactly three EXIF tags (`Artist`, `ExifOffset`, `ImageDescription`), `Signature:` prefix constant, base64 payload 300-1004 chars. Two capture facts: (a) the `Artist` UUID **equals the public image id** in the asset URL (`https://imagine-public.x.ai/imagine-public/images/.jpg`), so it is NOT a private per-user secret — only the `Signature` blob is; (b) the Grok web-UI image is a re-encoded **WebP with no signature** — the EXIF survives only in the *original* JPEG (download button or that public tokenless URL), which is why screenshots / re-encodes are metadata-stripped. A real fixture `data/samples/grok-1.jpg` plus **synthetic** JPEG fixtures (fake UUID + fake `Signature:` blob) cover the detector; never add a real Grok image carrying private content (the repo is public). + +**Stripped on removal too:** `remove_ai_metadata` now calls `_scrub_ai_exif` on the JPEG EXIF, which deletes the xAI Signature+UUID-Artist pair **and** any `Software`/`Make`/`Artist`/`ImageDescription` tag holding an `AI_GENERATOR_TOKENS` token (so Ideogram's `Make="Ideogram AI"` is scrubbed too), while keeping genuine camera/editor EXIF. The shared `_is_xai_signature_pair` helper (module-level compiled regexes) is the single source of truth for the pattern, used by both `xai_signature` and `_scrub_ai_exif`. (AVIF/HEIF/JXL still strip only C2PA boxes via `isobmff`, not EXIF — unchanged.) +- **China TC260 AIGC label (caught by `AIGC_MARKERS` / `metadata.aigc_label`, surfaced by `identify` as the `aigc` signal):** China-served generators embed an XMP `{"Label":"1","ContentProducer":...}` block — China's mandatory AI-content labeling (TC260 namespace `tc260.org.cn/ns/AIGC`). + +**Doubao** (ByteDance) uses it (verified on the real #13 sample 2026-05-25; `ContentProducer` `001191110102MACQD9K64010000`, no C2PA/SynthID/imwatermark — the XMP block is the only signal; GitHub attachment upload did NOT strip it). The same standard is mandatory for Jimeng/Kling/Qwen/Ernie etc., so the one marker covers the whole China-AIGC-labeled ecosystem. `aigc_label` reads **four serializations** through a shared `_parse` helper: the HTML-entity-encoded XMP `TC260:AIGC` block in **either RDF form** — the nested element `{...}` (Doubao) or the attribute `TC260:AIGC="{...}"` (**PicWish**, `ContentProducer="picwish"`, verified on the corpus 2026-05-30) — via a container-agnostic raw-byte scan (any JSON object accepted), a raw-JSON PNG `AIGC` tEXt chunk (Doubao also writes the label this way, no namespaced marker at all — confirmed on the corpus 2026-05-28, `ContentProducer="doubao"`), a bare raw-JSON `{"AIGC":{...}}` object embedded in **JPEG EXIF (UserComment)** by some China-served generators, brace-matched from the scan head with `json.JSONDecoder().raw_decode` (no namespaced marker, no PNG chunk — confirmed on the corpus 2026-05-30, `ContentProducer="001191440300708461136T1308L"`), **and** a bare `AIGC{...}` blob (the label glued straight to its JSON, no `"AIGC":` key wrapper) embedded in a **JPEG APP segment near the JFIF header** — confirmed on the corpus 2026-06-10 (`ContentProducer="00119144030008867405X210002"`; 3 files read `unknown` before this form was added). The two raw-JSON forms are scanned in one loop (`'"AIGC"'` then `AIGC{`) that **falls through on a non-TC260 / undecodable hit instead of returning** — a quoted `"AIGC"` can appear later in an XMP packet while the real label is a bare `AIGC{...}` earlier in the file, so an unconditional early return on the quoted form would shadow the bare form (the exact bug behind the 06-10 misses). All three generic forms (the PNG chunk, the bare `{"AIGC":...}` object, and the bare `AIGC{...}` blob) are gated on at least one TC260 field (`_TC260_FIELDS`) so a generic `AIGC` key cannot false-positive; the namespaced XMP element is unambiguous and needs no gate. In `identify`, `aigc` fires on the parsed label **or** the `AIGC_MARKERS` byte scan (the latter preserves the laundering-tell case where the JSON payload is truncated). +- **HuggingFace-hosted job (caught by `metadata.huggingface_job`, surfaced by `identify` as the `hf_job` signal, MEDIUM confidence):** HuggingFace Jobs / Spaces stamp generated PNGs with an `hf-job-id` tEXt chunk holding the job UUID (3 on the corpus 2026-05-28, no other signal). It marks the *hosting job*, not a model — most commonly diffusion output — so it lifts an Unknown verdict to a tentative AI via `hf_only` (parallel to the visible sparkle) but never overrides a hard metadata signal; `_HF_JOB_CAVEAT` states the limit (job, not model; not proof of AI pixels). Stripped on removal (the PNG save whitelist keeps only `STANDARD_METADATA_KEYS`, so `hf-job-id` and the `AIGC` chunk are both dropped). The exact writer is not authoritatively documented (HF Jobs are generic GPU jobs), hence medium not high. +- **No detectable signal on download (correctly reported `unknown`):** **Recraft** (PNG export is a re-encoded design export — strips everything), **Krea hosting FLUX 2** (no imwatermark despite FLUX — the host omits the encoder, same as Stability's hosted SDXL), and Midjourney (embeds nothing). Lesson: the imwatermark detector only fires on *pristine* output from a pipeline that runs the encoder (diffusers default, official BFL), not from re-hosts (Krea/Stability) or re-encoded exports (Recraft/Canva). +- **Invisible but NOT locally detectable (proprietary, API/oracle only — same wall as SynthID):** Amazon Titan Image Generator + Nova Canvas (Bedrock `DetectGeneratedContent` API), Kakao (new SynthID image adopter, May 2026), NVIDIA Cosmos (SynthID video). No local detector possible; treat like SynthID. +- **C2PA 2.4 "Durable Content Credentials" (April 2026; verified against the spec) raise the bar for metadata stripping.** 2.4 defines soft bindings (an invisible watermark or a content fingerprint) plus a server-side manifest repository and a new `c2pa.repository-receipt` assertion. Per the spec: "if a C2PA manifest is removed from an asset, but a copy of that manifest remains in a provenance store elsewhere, the manifest and asset may be matched using available soft bindings." So our local `metadata --remove` deletes the *embedded* manifest, but a fingerprint/watermark soft binding can still re-link the image to its manifest in a repository server-side. Stripping the file is becoming necessary-but-not-sufficient against durable provenance. (Our parsers target the stable embedded-manifest format documented in C2PA 2.1 §11; that format is unchanged in 2.4 -- the new pieces are repository/soft-binding infra, not the on-file box layout, so no parser change is implied.) Spec: https://spec.c2pa.org/specifications/specifications/2.4/specs/C2PA_Specification.html We now READ the soft-binding `alg` (`C2PA_SOFT_BINDINGS` / `soft_binding_vendors_in`) to name the forensic-watermark vendor, and locally DECODE the one open scheme, Adobe TrustMark (`trustmark_detector`); the rest (Digimarc/Imatag/Steg.AI/...) stay name-only (proprietary decoders). +- **Built 2026-05-26 (this batch):** soft-binding `alg` vendor detection; IPTC Photo Metadata 2025.1 AI-disclosure fields (`AISystemUsed` etc.); **video C2PA metadata** detect + strip for MP4/MOV/M4V (free — `isobmff.py` is format-agnostic, MP4 is ISOBMFF); Adobe TrustMark open decoder. NOT done (out of cheap reach, per the feasibility review): visible video-logo removal (needs a video frame pipeline) and audio (SynthID/ElevenLabs/Resemble/Suno all oracle-only or unmarked). + +**Box detection window — now handled (v0.6.8):** detection no longer relies on a fixed first-MB read. `metadata.scan_head(path, size)` reads the first `size` bytes and, for ISOBMFF, appends the payloads of late provenance boxes found by `isobmff.scan_c2pa_region` (a file-seeking top-level box walker that skips past `mdat` by size without reading it), so a C2PA/AIGC/IPTC manifest placed AFTER a large `mdat` in a streaming/non-faststart MP4 is now caught. Every C2PA/marker byte scan (`has_ai_metadata`, `aigc_label`, `iptc_ai_system`, `synthid_source`, `exif_generator` XMP, `get_ai_metadata` soft-binding, and `identify`) goes through `scan_head`; it is behavior-neutral for non-ISOBMFF inputs (exactly `f.read(size)`). + +**Meta-box XMP removal — now handled (v0.6.9):** an AI-label XMP packet stored as a meta-box `mime` item (HEIF/AVIF; out of reach of the top-level box stripper) is blanked in place by `isobmff.blank_ai_xmp_packets` — it locates the packet by its `` delimiters and, if it carries an AI marker (`_AI_LABEL_MARKERS`), overwrites it with spaces of the SAME length, so box sizes / `iloc` offsets stay valid and the coded image is untouched (selective: plain non-AI XMP is left alone, mirroring the top-level uuid logic). Wired into `remove_ai_metadata`'s ISOBMFF branch after `strip_c2pa_boxes`. The remaining gap is an `Exif` meta-box *item* (rare; the AI labels are XMP) — still needs `iinf`/`iloc` surgery or exiftool. +- **Regulatory driver (context, not a code change):** AI-content labeling mandates are expanding, which pushes more generators toward exactly the C2PA + watermark signals we read. The full per-jurisdiction table lives in README "## Legal" -- keep it there, not duplicated here. Newly added + primary-source verified 2026-05-26: **EU AI Act Article 50** machine-readable marking applicable **2026-08-02** (verified against the article text); **South Korea AI Framework Act Art. 31(3)** in force since **22 January 2026** (verified via Kim & Chang + FPF/Korea Times; Enforcement Decree accepts an invisible-watermark label); **California AB 853** (amends the CA AI Transparency Act) latent-disclosure duty operative **2026-08-02**, requiring a disclosure "permanent or extraordinarily difficult to remove" (verified against the leginfo bill text -- this is the exact disclosure our tool strips); **India IT Amendment Rules 2026** in force **2026-02-20** (verified via Chambers), which prominently-label + permanent-provenance-id all synthetic media AND **expressly prohibit removing/suppressing the label or metadata** -- the first major all-content removal ban outside China. + +**Removal liability (README "## Legal" disclaimer):** the tool is lawful general-purpose software; liability sits with the remover and is intent-gated -- downstream acts (fraud/deception/IP), plus US DMCA 17 USC 1202 (removing copyright-management info to conceal infringement), plus the removal-as-such bans in China + India. When extending the README table, verify each date/article against the statute/bill text before committing, not against search summaries.