Files
remove-ai-watermarks/docs/module-internals.md
T
Victor Kuznetsov 373b910a60 docs: fix the qwen-vs-controlnet face comparison to oracle-confirmed scrub floors
The face fidelity numbers cited an equal-strength compare (both 0.15), but Qwen at
0.15 does NOT clear Gemini SynthID -- so that output is un-scrubbed and the compare
is invalid. Per the methodology rule (compare fidelity only between outputs where
SynthID is removed in BOTH), restate faces at each pipeline's scrub floor
(controlnet 0.15 / Qwen 0.30): ArcFace identity 0.546 vs 0.331, lapvar 0.62 vs 0.40,
face LPIPS 0.09 vs 0.19 -- controlnet still wins faces, conclusion unchanged. Drop
the "equal strength" framing in CLAUDE.md / module-internals / known-limitations.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 14:33:11 -07:00

225 lines
67 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Module internals
> Relocated verbatim from `CLAUDE.md` on 2026-06-11 to keep the always-loaded
> context small. Long single-line entries were reformatted into paragraphs;
> no content was changed or summarized.
Full per-module detail: design decisions, tuned thresholds, calibration
history, incident records, and the regression-guard map. The compact module
list lives in `CLAUDE.md`; read the relevant section here before changing a
module.
## `noai/c2pa.py`
`noai/c2pa.py` — C2PA reading, **official c2pa-python `Reader` first, hand-rolled parser as fallback** (migrated 2026-06-18; the official lib is a core dep, MIT/Apache, spec-tracking). `read_manifest_store_json(path)` runs `Reader.try_create` with a default `Context` (NO trust enforcement — we report what is in the file, we do not gate on cert trust) and returns the **whole** manifest-store JSON (every manifest plus ingredient manifests); it is memoized per (path, mtime) (`lru_cache(maxsize=8)`) because one `identify`/`get_ai_metadata` call invokes the structured parser ~3x on the same file. `extract_c2pa_info(path)` builds its dict from that store JSON (`_info_from_store_json`: structured `claim_generator` from the active manifest's `claim_generator` / `claim_generator_info[].name`, `timestamp` from `signature_info.time`) and falls back to the legacy caBX parser (`_extract_c2pa_info_png`) when the reader is unavailable (broken/absent wheel, `reader_available()` False) or finds no parseable manifest (synthetic/partial test blobs, the inject round-trip's re-stitched chunk). **Both paths share `_populate_registry_fields(buf, info)`** — the issuer / AI-tool / action / source-type / SynthID / soft-binding registry byte-scan applied to the store JSON (reader path) or the raw caBX bytes (fallback) — so the return-dict shape is identical and the registry stays the single source of truth. Whole-store scanning is load-bearing: a ChatGPT *edit* of a Sora generation keeps `trainedAlgorithmicMedia` + issuer "OpenAI" on the **parent/ingredient** manifest, not the active "opened" one (the active manifest's `signature_info.issuer` is "OpenAI", `common_name` "Truepic Lens CLI in Sora", so the issuer field now reads "OpenAI, Truepic" — first-match-wins platform attribution still resolves OpenAI). `extract_c2pa_info` now also serves non-PNG containers (JPEG/AVIF/MP4) structurally via the reader; the consumers (`identify`, `synthid_source`, `get_ai_metadata`) already merge `info OR byte-scan`, so this strictly upgrades the non-PNG path with no double-counting. `synthid_watermark`/`synthid_vendors` is set when the manifest is signed by a SynthID-using vendor on AI content; `soft_binding`/`soft_binding_vendors` when a `c2pa.soft-binding` `alg` names a forensic-watermark vendor (`soft_binding_vendors_in(buffer)` is the shared byte-scan, used by both paths and the non-PNG binary path). `extract_c2pa_chunk` / `inject_c2pa_chunk` / `has_c2pa_metadata` stay the PNG caBX byte tools (raw-chunk extraction for `extractor.py`, test injection, fallback detection). PNG/caBX chunk reads are clamped to the remaining file size (`safe_length = min(length, remaining)`; skipped chunks use seek) so a malformed huge `length` cannot drive a multi-GB allocation (shared safety discipline matching `isobmff.scan_c2pa_region`). Regression-guarded by `tests/test_noai.py::TestC2PARealSamples::{test_extract_info_uses_reader_store,test_fallback_to_png_parser_when_reader_unavailable}`.
## `noai/constants.py`
`noai/constants.py` — PNG_SIGNATURE, C2PA_CHUNK_TYPE, C2PA_SIGNATURES, and `C2PA_AI_VENDORS` — the single `C2paAiVendor` registry of C2PA-signing vendors (issuer byte, resolved org name, the `identify` platform label, and a `synthid` flag), from which `C2PA_ISSUERS`, `SYNTHID_C2PA_ISSUERS` (issuers that pair SynthID with C2PA: Google, OpenAI), and `identify._ISSUER_PLATFORM` are all **derived** — plus `C2PA_SOFT_BINDINGS` (soft-binding `alg` prefix → forensic-watermark vendor: Adobe TrustMark, Digimarc, Imatag, Steg.AI, Microsoft, ...). Add a new C2PA vendor as one `C2PA_AI_VENDORS` entry (never edit the derived dicts), a new soft-binding to `C2PA_SOFT_BINDINGS`; not inline.
## `metadata.py`
`metadata.py``scan_head(path, size=1MB)` is the shared input for every C2PA/AIGC/IPTC byte scan: first `size` bytes plus the payloads of any provenance metadata found beyond that window — for ISOBMFF, the late provenance boxes from `isobmff.scan_c2pa_region` (catches a manifest after a large `mdat`); for **PNG**, the late `tEXt`/`iTXt`/`zTXt`/`eXIf`/`iCCP` chunks from `_png_late_metadata` (catches an XMP/EXIF packet appended after a large `IDAT`, e.g. a TC260 AIGC label at ~2.7 MB). Behavior-neutral (`f.read(size)`) for non-ISOBMFF inputs and for any file that fits within `size`. Use it instead of `open().read(1MB)` for any new marker scan.
**Memoized per (path, size, mtime)** (added 2026-06-09, `_scan_head_cached` lru_cache, `maxsize=8`): one `identify`/`get_ai_metadata` call fans out to ~8 byte-scan detectors that each re-read the same file head, so the cache turns those into a single read; the mtime key invalidates on change, a stat failure falls back to an uncached read. `synthid_source(path)` returns the vendor name(s) if the C2PA manifest implies a SynthID pixel watermark, else None. Format-agnostic: PNG via the caBX parser, JPEG/WebP/AVIF/HEIF/JXL via a binary scan (C2PA marker + SynthID issuer + AI-source marker). `get_ai_metadata` surfaces the verdict, and `metadata --check` prints it as a callout. Both `get_ai_metadata` and `has_ai_metadata` guard the PIL open with `except Exception` (HEIC/unknown formats raise non-OSError) and fall through to the binary scan. `xai_signature(path)` detects xAI/Grok's EXIF-only scheme (`ImageDescription` = `Signature: <base64>` + UUID `Artist`); it feeds `has_ai_metadata`, `get_ai_metadata` (key `xai_signature`), and `identify`. `iptc_ai_system(path)` detects the IPTC Photo Metadata 2025.1 AI-disclosure XMP properties (`IPTC_AI_FIELD_MARKERS` = `AISystemUsed`/`AISystemVersionUsed`/`AIPromptInformation`/`AIPromptWriterName`) and returns the `AISystemUsed` generator name (or `"fields present"`). `remove_ai_metadata` routes **ISOBMFF video** (`.mp4`/`.mov`/`.m4v`) through the same `isobmff.strip_c2pa_boxes` as AVIF/HEIF (MP4 is ISOBMFF), and `_scrub_ai_exif` removes the xAI signature + AI-generator EXIF tags on JPEG output. `strip_c2pa_boxes` is **fail-safe** on a malformed box: it returns the original bytes unchanged with a logged warning instead of truncating the tail to EOF (detection-only `scan_c2pa_region` still stops at a malformed box). `_png_late_metadata` clamps each late-chunk read to the remaining file size (`safe_length = min(length, remaining)`) so a malformed `length` cannot drive a multi-GB allocation, AND advances the cursor by `safe_length` (not the raw `length`) so an inflated length cannot jump past EOF and abort the scan, silently skipping a genuine AI-label chunk after it.
## `identify.py`
`identify.py` — the OpenAI rollout caveat is keyed on `_vendor_of(synthid) == "OpenAI"` (not a raw substring over the issuer + verdict blob). `identify(path)` aggregates every locally-readable signal (C2PA issuer→platform, C2PA soft-binding forensic-watermark vendor, **C2PA cloud-manifest reference** via `metadata.c2pa_cloud_manifest` — signal `c2pa_cloud`, **medium**, provenance-only (does NOT set `is_ai`, excluded from `ai_from_metadata` + clash vendors): a C2PA 2.4 Durable-Content-Credentials case where the embedded manifest is stripped but an XMP `dcterms:provenance` pointer to the vendor's cloud manifest store (`_C2PA_MANIFEST_REPOSITORIES`, today `cai-manifests.adobe.com` → "Adobe Content Authenticity") survives, so the credentials stay recoverable server-side; only emitted when no embedded manifest already attributed the file — surfaced on 2 corpus PNGs 2026-06-10 that read fully `unknown` before, IPTC "Made with AI" + IPTC 2025.1 `AISystemUsed`, embedded SD/ComfyUI params, SynthID proxy, xAI/Grok EXIF signature via `metadata.xai_signature`, the China TC260 AIGC label via `metadata.aigc_label`, the HuggingFace `hf-job-id` job marker via `metadata.huggingface_job`, the Samsung Galaxy AI editing marker via `metadata.samsung_genai`, the visible marks — Gemini sparkle plus the ByteDance Doubao 豆包AI生成 / Jimeng 即梦AI / Samsung Galaxy AI "Contenuti generati dall'AI" text marks via the `watermark_registry` — open invisible watermark, Adobe TrustMark via `trustmark_detector`) into one `ProvenanceReport`. `is_ai_generated` is True or None (never asserted False — stripped metadata is not proof of clean origin). The `hf_job`, visible-mark, and Samsung `samsung_genai` signals are **medium** confidence: each lifts an otherwise-Unknown verdict to a tentative AI (`hf_only` / `visible_only` / `samsung_only`, parallel branches; `visible_only` fires on any `visible_*` signal) but is excluded from the high-confidence `ai_from_metadata` set, so none overrides a hard metadata signal.
**Visible-mark detection** (`check_visible`, signals `visible_sparkle` / `visible_doubao` / `visible_jimeng` / `visible_samsung`): the Gemini sparkle keeps its own file-level path (`_visible_sparkle``gemini_engine.detect_sparkle_confidence`, promoted only at confidence ≥ `_SPARKLE_THRESHOLD` 0.5; corpus-tuned to separate Gemini sparkles ≥0.56 from non-sparkle ≤0.49), while Doubao/Jimeng/Samsung reuse the registry detectors (`_visible_text_marks``watermark_registry`, iterating `_VISIBLE_MARK_PLATFORM`), each gated by its own engine NCC threshold via `MarkDetection.detected` (Doubao 0.4, Jimeng 0.45, Samsung 0.4). Doubao/Jimeng are normally also caught by the TC260 AIGC metadata label and Samsung by its C2PA + `genAIType` marker, so the visible path is their stripped-metadata fallback. Visible marks set `platform` only when no harder signal already did, and (like the sparkle) are excluded from integrity-clash vendor claims. The cv2 dependency lives in the engines, not here.
**`import identify` is deliberately light** (~26 MB; ~36 MB with cv2 loaded by a visible-mark run, ~106 MB for a full `check_visible` run): it imports the `noai.c2pa`/`noai.constants` submodules, and `noai/__init__` is lazy (see "Test and lint"), so torch/diffusers are NOT pulled at import even in a full `gpu`/`detect` install — fits a 512 MB host. `noai.c2pa` does eagerly import the **c2pa-python** binary (Rust + cryptography, ~+5 MB RSS, no torch) for the primary `Reader` path — light enough to stay on the dependency-light host; a broken/absent wheel degrades to the byte-scan parser (`reader_available()` False). The heavy paths are opt-in: `check_invisible=True` needs the `detect`/`trustmark` extras (each pulls **torch**; TrustMark also **downloads weights**), so on a core-only deploy leave `check_invisible` off (it is a no-op there anyway). Before the lazy `__init__`, the mere presence of torch in the env inflated `import identify` to ~420 MB.
**C2PA platform attribution is device-token-first, issuer-scan fallback** (`_device_platform` scans manifest bytes for `_DEVICE_C2PA_PLATFORM` tokens, then `_attribute_platform`/`_ISSUER_PLATFORM`).
**Why, verified on real signed files 2026-05-26:** the old issuer-only byte-scan matched ANY issuer substring anywhere, so multi-entity manifests mis-attributed -- Leica→"Truepic" (a signing authority in the trust chain), Nikon→"Adobe Firefly" (XMP-toolkit "Adobe" + the sample's "Adobe_MAX" name), Pixel→"Google (Gemini)" ("Google LLC" cert org), Truepic→"Google". A distinctive device token wins instead.
**Token distinctiveness is load-bearing:** bare `b"Truepic"` mis-fires (it appears in unrelated trust chains -- it mis-attributed the OpenAI `chatgpt-1.png` fixture), so the token is the specific `b"Truepic_Lens"` from the Lens SDK claim generator; likewise `b"Pixel Camera"` (cert CN) not bare `b"Pixel"`. `_DEVICE_C2PA_PLATFORM` lists ONLY tokens **verified against a real C2PA file**: Leica (`lc_c2pa`/`Leica Camera`), Nikon (`NIKON`), Pixel (`Pixel Camera` -- from a real Pixel 10 Pro file attached to c2pa-rs issue #1609/#1554), Sony (`sony.sig`/`sony.cert` -- Sony's own C2PA assertion namespace, verified on a real Sony PXW-Z300 file; NOT bare "Sony" which is a common EXIF Make), Truepic (`Truepic_Lens`). Canon/Bria have **no public direct-download C2PA sample** (checked exhaustively: GitHub issue/PR attachments, contentcredentials gallery, HF datasets -- all upload-to-verify or token-gated; Canon's only public file was a self-signed hobbyist CR3, not factory), so they stay unmapped until a real file is captured (same fixture discipline as Grok/Doubao). The Sony sample is video (MP4) -- our ISOBMFF C2PA path detects it; Sony Alpha stills likely share the `sony.*` namespace but are not separately verified.
**Samsung Galaxy + ASUS Gallery live in a separate `_SIGNER_C2PA_PLATFORM` (scanned after `_device_platform`, before the issuer fallback), NOT in `_DEVICE_C2PA_PLATFORM`** — verified on real signed files 2026-05-29. Reason: a Galaxy phone stamps BOTH its device cert AND a `trainedAlgorithmicMedia`/genAIType AI marker on a Generative-Edit image, so treating it as a "genuine camera capture" would false-fire integrity-clash rule 2 on every Galaxy AI edit. The signer tokens (`b"Samsung Galaxy"` cert org — distinct from the EXIF `SM-xxxx` model string on ordinary Samsung photos; `b"com.asus.gallery"` claim generator) only resolve the platform label; the AI verdict still comes from the source-type / genAIType. ASUS Gallery is a C2PA-signed edit with no AI marker, so it attributes the platform without asserting `is_ai`.
**Samsung's `genAIType` (in the proprietary `PhotoEditor_Re_Edit_Data` JSON) is an undocumented Galaxy-AI editing marker** (`metadata.samsung_genai`, gated on the `PhotoEditor_Re_Edit_Data` container; non-zero value = AI tool used, values {1,5} observed): medium-confidence because the field has no public spec (verified 2026-05-29: absent from C2PA spec + Samsung docs), but it co-occurred with `trainedAlgorithmicMedia` in 3/3 verified files that record a source-type and was the SOLE AI marker on a Galaxy S24 file that omits the source type. Camera C2PA marks capture authenticity, not AI (Pixel carries `computationalCapture`, not `trainedAlgorithmicMedia`), so these never set `is_ai` -- that stays driven by digital-source-type. `c2pa.cbor_text_after` (now public) is best-effort for the `generator` detail string only and can be None when the manifest keys it `claim_generator_info` (Pixel).
**Issuer→generator mapping is `is_ai`-gated** (`_attribute_platform(issuers, is_ai=c2pa_is_ai)`): a specific AI-generator platform is named only when the digital-source-type is `trainedAlgorithmicMedia`; on a non-AI source an issuer substring is treated as incidental (an "Adobe XMP" toolkit string in an *unmapped* Canon/Sony capture would otherwise mislabel it "Adobe Firefly"), so it degrades to the neutral "C2PA signer: X" label. Real Firefly/OpenAI/Google output carries the AI source-type, so it is unaffected (verified: chatgpt-1.png→OpenAI, firefly-1.png→Adobe Firefly still attribute). `_attribute_platform` defaults `is_ai=True` so the mapping stays unit-testable in isolation. Add capture-camera tokens to `_DEVICE_C2PA_PLATFORM`, editing-app/AI-device signer tokens to `_SIGNER_C2PA_PLATFORM`, generator/issuer platforms to the `C2PA_AI_VENDORS` registry in `constants.py` (which derives `_ISSUER_PLATFORM`), not inline. For non-PNG containers (JPEG/WebP/AVIF/HEIF/JXL) the caBX parser returns nothing, so issuer (`_issuers_in`) and generator (`_ai_tools_in`, reusing `C2PA_AI_TOOLS`) are recovered by binary-scanning the first MB. EXIF `Software` / `Make` / `Artist` / `ImageDescription` and XMP `CreatorTool` generator tags are read by `metadata.exif_generator` (PIL+piexif for any format PIL opens incl. AVIF, plus a container-agnostic XMP raw-byte scan that also covers HEIF/JXL), matched against `AI_GENERATOR_TOKENS` so ordinary editors (plain "Adobe Photoshop") and real-camera `Make` ("Apple"/"Canon") are not flagged.
**Ideogram tags its output with EXIF `Make="Ideogram AI"`** (verified on a real download 2026-05-24) — that's why `Make` is read.
**Integrity-clash detection** (`_integrity_clashes`, surfaced as `ProvenanceReport.integrity_clashes`, printed in red by `identify` and serialized to `--json`): contradictions between independent generator stamps are a laundering/spoofing tell. Two rules: (1) two or more distinct AI-origin vendors named by **independent** signals (e.g. C2PA OpenAI + EXIF `Make="Ideogram AI"`), and (2) a camera-capture C2PA device (`_DEVICE_C2PA_PLATFORM`) coexisting with an AI-generation marker **from a source INDEPENDENT of the camera's own manifest**.
**Rule 2's independence gate (added 2026-06-11):** a device that both captures and runs on-device generative AI (Google Pixel Magic Editor / Pixel Studio) records the capture AND the AI edit in ONE C2PA manifest — so the AI vendor is named only from that same manifest (`c2pa` issuer + `synthid` proxy, both `c2pa_manifest` source) — a legitimate edit chain, NOT a clash. Rule 2 therefore fires only when some `ai_vendor_claims` family has a source `!= "c2pa_manifest"` (EXIF/XMP generator, IPTC, TC260 AIGC, a second manifest naming AI on a camera capture — the real laundering tell). This killed a false-positive class on the corpus: 2 real Pixel generative-edit PNGs (`computationalCapture` + `trainedAlgorithmicMedia` + "Applied imperceptible SynthID watermark" in one Google manifest) read as camera-vs-AI clashes before the gate. Pure cameras (Leica/Sony/Nikon/Truepic) that do NOT generate AI still clash on any within-manifest AI marker only if it is independent — they never legitimately carry one, so the gate is behavior-neutral for them while fixing Pixel (regression-guarded by `test_identify.py::TestIntegrityClashesHelper::{test_pixel_generative_edit_same_manifest_no_clash,test_camera_plus_independent_ai_marker_still_clashes}` + `TestIntegrityClashEndToEnd::test_pixel_generative_edit_no_clash`).
**Independence is source-grouped (`_CLASH_SOURCE`, added 2026-06-02):** the C2PA issuer attribution (`c2pa`) and the SynthID proxy (`synthid`) are NOT independent — the proxy is inferred from the *same* manifest — so they share one source and two vendors named within a single manifest do not clash. This killed a false-positive class found on the spaces corpus: legitimate multi-actor manifests where a product wraps another vendor's engine (Microsoft Designer on OpenAI → `OpenAI, Microsoft`; Microsoft on Google → `Microsoft, Google LLC, Google C2PA Core Generator Library`) or an edit chain re-signs (Adobe over a Gemini original → Adobe c2pa + Google synthid) — 19 such files across the 2026-06-01/02 batches read as clashes before the fix. Rule 1 still fires when a manifest vendor disagrees with a genuinely independent stamp (EXIF/XMP generator, IPTC `AISystemUsed`, AIGC, xAI); each non-`c2pa`/`synthid` family is its own source (`test_identify.py::TestIntegrityClashes::{test_multi_actor_manifest_no_clash,test_manifest_vendor_vs_independent_signal_clashes}`). Vendor normalization is `_vendor_of` over `_AI_VENDOR_TOKENS` (so a C2PA "Google (Gemini)" issuer and a SynthID-Google proxy agree, while different vendors clash).
**High-precision by design:** only hard generator stamps feed it (C2PA-issuer when source is AI, SynthID, EXIF/XMP generator, IPTC `AISystemUsed`, xAI, AIGC); the fuzzy visible sparkle and the open invisible watermark are **excluded** (the latter can be a by-product of our own SDXL removal pass). The c2pa vendor is classified from the issuer attribution / generator, NOT the resolved `platform` (a camera label like "Google Pixel" would mis-normalize to "Google"). All real single-origin fixtures (chatgpt/firefly/doubao/grok/mj) verified to produce **zero** clashes (false-positive guard in `test_identify.py::TestRealSamplesHaveNoClash`).
## `watermark_registry.py`
`watermark_registry.py`**single catalog of known visible watermarks**, the unified "find known marks in their usual places, recognize, remove" entry.
**Reverse-alpha based by policy**: a mark is listed only once a real alpha map has been captured for it, and removal inverts that map (`original = (wm - a*logo)/(1-a)`) — Gemini recovers cleanly with no inpaint (its sparkle alpha comes from a pure-black capture, so it is near-exact), while **Doubao, Jimeng, and Samsung all add an always-on THIN residual inpaint** over the glyph footprint (their text marks re-rasterize + jitter a few px per image, so a single capture cannot pixel-cancel them; the inpaint blends into the reverse-alpha-recovered pixels). Arbitrary-region inpainting still lives in `region_eraser`/`erase`. Each `KnownMark` ties a key to {usual `location`, `in_auto` flag, `recovery` (="reverse-alpha"), a `detect` adapter → uniform `MarkDetection`, a `remove` adapter}. Entries today: `gemini` (bottom-right sparkle), `doubao` (bottom-right "豆包AI生成"), `jimeng` (bottom-right "★ 即梦AI"), and `samsung` (bottom-**LEFT** "✦ Contenuti generati dall'AI", Samsung Galaxy AI, Italian locale). `detect_marks` scans all; `best_auto_mark` picks the highest-confidence detection.
**Cross-engine confidences aren't directly comparable**, so the gemini adapter applies the corpus-validated 0.5 sparkle threshold (`_GEMINI_AUTO_MIN_CONF`) for its `detected` flag — otherwise the gemini engine's loose internal threshold weakly fires (~0.36) on the Doubao text and hijacks `auto`. The shape-keyed Doubao/Jimeng/Samsung NCC detectors don't cross-fire (jimeng scores ~0.22 on the Doubao strip, well under its 0.45 threshold; Samsung is bottom-left so it shares no corner with the others, and scored 0.0 on Doubao/Jimeng captures and they 0.0 on a real Samsung photo), so `auto` picks the right one. `cli.cmd_visible` is registry-driven: `--mark auto``best_auto_mark`, `--mark <key>` → that mark; `--mark` choices come from `mark_keys()`.
**`cli._remove_visible_auto` is the shared visible-removal helper used by `cmd_all`/`cmd_batch` too** (they no longer hardcode `GeminiEngine`), so `all`/`batch` remove Doubao/Jimeng/Samsung text marks, not just the Gemini sparkle (regression-guarded by `test_all_visible_step_uses_registry`). The three text-mark adapters were consolidated 2026-06-09: a single `_text_mark(key, label, location)` builds the registry row from one parameterized `_text_mark_detect`/`_text_mark_remove` pair (reverse-alpha only when detected/forced AND `reverse_alpha_available`, else skipped — no inpaint); the gemini adapters stay bespoke. Add a new visible mark = one `_text_mark(...)` row + its `TextMarkConfig` (with a captured alpha map); do not re-add per-mark `if` branches or copy-paste adapters.
**Alpha-on-save policy (issue #30):** `cli._write_bgr_with_alpha` rejoins the input's alpha plane **unchanged** — it must NOT zero alpha in the watermark bbox. Reverse-alpha (and `erase` inpaint) recover real pixels there, so zeroing alpha punched a transparent hole that renders as a solid **white box** on any non-transparent viewer (Gemini app exports are opaque RGBA, so every user hit it; regression-guarded by `test_visible_keeps_alpha_opaque_in_watermark_region`). The registry `remove()` still returns its region (used for `inpaint_residual` positioning), but the CLI no longer uses it to clear alpha.
## `gemini_engine.py`
`gemini_engine.py` — visible Gemini-sparkle remover/detector (cv2/numpy, no GPU). `detect_sparkle_confidence(path)` is the file-level entry point used by `identify.py`. The public entry points normalize a grayscale (2D) or RGBA (4-channel) input to BGR up front so a non-BGR image does not crash the cv2 pipeline.
**Detection localization (issue #36):** `detect_watermark`'s global multi-scale NCC search applies a size weight (`(scale/96)**0.5`) that suppresses tiny-patch false positives but can let a larger, mediocre match (e.g. a bright collar in a portrait) outrank a small, near-perfect sparkle in the corner — so a faint sparkle on a busy background scored below threshold and read as clean (the regression osachub reported from widening the search window 256px->512px between v0.7.2 and v0.8.8). `_corner_promote` adds a bottom-right-corner raw-NCC pass on top of the global search: a match with raw NCC >= `_CORNER_PROMOTE_NCC` 0.85 that beats the global pick overrides it (it only ever replaces a lower-fidelity pick, so it cannot weaken an existing detection), rescuing the buried sparkle without reverting the wider window. The corner side is **relative-clamped** (`_CORNER_PROMOTE_FRAC` 0.20 of the short side, clamped to `[_CORNER_PROMOTE_MIN` 96, `_CORNER_PROMOTE_MAX` 384`]`): a fixed 256px is a true corner on a large image but covers ~70% of a small portrait, where a real photo raw-matches the star at ~0.81 (relative tightening drops that worst case to ~0.69, while the upper clamp stops the corner ballooning on huge images where a real photo reached ~0.83 at 512px). The 0.85 gate sits midway between the worst real-photo corner match (~0.78 across native + downscaled negatives) and a genuine faint sparkle (~0.93), so promotion adds true detections with zero corpus false positives (Gemini's sparkle sits ~60-160px from the corner at fixed margins, covered by the [96, 384] band at every measured size). Regression-guarded by `test_gemini_engine.py::TestCornerPromotion`.
**Top-K fusion selection (osachub follow-up 2026-06-12):** `_corner_promote`'s 0.85 raw-NCC gate still missed a class the 256->512 widening exposed — a genuine MID-scale sparkle whose raw NCC sits *below* 0.85 but is buried by a LARGER, low-fidelity decoy that wins the size weight. The reporter's image (a scale-48 sparkle on light bedding) measured spatial 0.775 / grad 0.960 / fusion 0.676 at the true sparkle, but the size-weighted argmax instead locked onto a decoy at spatial 0.628 / grad 0.036 (fusion 0.325) — so `identify` read `unknown` on v0.8-0.11 where v0.7.2 (256px window) had caught it at 0.676. Fix: `detect_watermark` now keeps the **top-`_SELECT_TOPK` (3)** size-weighted candidates (NMS-deduped by location) plus the corner-promote candidate, scores EACH by the full fusion (spatial+gradient+variance) via the extracted `_grad_var_scores` helper, and selects the highest — the gradient term (the discriminator a contrast-invariant spatial NCC lacks) lifts the true sparkle over the decoy. Critically, selection ranks by the SIZE-WEIGHTED score, NOT raw NCC: a raw-NCC argmax (tried first) re-admitted the exact tiny-patch (scale 16-18) false positives the size weight exists to suppress — it flagged 14/65 doubao + 4/11 jimeng visible-corpus images (non-Gemini content) as Gemini sparkles. Top-K keeps tiny-patch suppression intact: a coincidental 16px match never ranks in the size-weighted top-K, so widening selection added **zero** flips on the doubao/jimeng corpora and left the 495-image Gemini set unchanged (479 detected, both before and after) while recovering the reporter's image. Regression-guarded by `test_gemini_engine.py::TestCornerPromotion::test_low_gradient_decoy_loses_to_high_gradient_corner_sparkle` (mirrors the real spatial/grad signature via a monkeypatched scan) and `test_size_weighted_search_alone_traps_on_the_decoy`.
**Square-image residual misses are NOT fixable by lowering the detector threshold (measured + REJECTED 2026-06-11):** osachub (#36 follow-up) reported the corner-promote still misses Gemini sparkles on Google **square (1:1)** outputs. Reproduced on the spaces corpus: of 330 square Google-C2PA images, 140 score below the identify 0.5 threshold, and visual review confirmed a real class -- faint white sparkles on dark/textured/colored backgrounds (raw NCC 0.46-0.73, below the 0.85 promote gate) landing at fusion conf 0.41-0.47. A margin-gated promote (promote when raw NCC >= 0.50 AND `_core_ring_margin` >= 40) rescued 32/33 confirmed misses at an apparent 0 FP, but that 0 was a **measurement artifact** -- the negative set was the margin<40 misses, which a margin>=40 gate excludes by construction. On an honest 518-image non-Google pool the same gate fired on **~174 (≈33%)**, visually content (screenshots, Chinese "AI生成" Doubao/Jimeng text marks, logos, bright textures), not sparkles. Adding an achromatic-core constraint (`chroma <= 15`) did not separate them either (kept 15/33 POS, 41 NEG still firing). Root cause is the documented contrast-invariant-NCC wall: a faint sparkle on a busy background is indistinguishable from a bright/ornate content corner at the (shape-NCC, brightness-margin, core-chroma) feature level.
**Conclusion: keep the 0.85 corner gate; do NOT add a margin/chroma-gated lower promote.**
The cost (mislabel ~8-33% of non-Gemini content as Gemini) outweighs the benefit -- the visible sparkle is a medium-confidence stripped-metadata fallback, and intact Gemini is caught by C2PA in `identify` regardless. Remaining square misses are an accepted known limitation; a real fix would need a sparkle-specific discriminator (template match on a background-subtracted image, or a hard fixed-margin position prior), which is open research, not a threshold tweak.
**Removal is reverse-alpha with an over-subtraction guard** (`remove_watermark``_reverse_alpha_blend`, else `_inpaint_footprint`): the sparkle alpha is computed (`alpha = max(R,G,B)/255`) from the bundled sparkle-on-black captures `assets/gemini_bg_{96,48}.png` (the capture max is ~130, NOT 255 — the sparkle is a ~51%-opaque white overlay, so `alpha` maxes at ~0.51, which is CORRECT for the capture, not under-exposed). The alpha is near-exact only when the real mark's effective opacity matches the capture, which holds on bright/flat backgrounds — re-verified clean on `demo_banana_before.png` 2026-05-31.
**Issue #30 (dark-background black pit):** on a dark/textured background (e.g. grass, ~73) the real sparkle's effective opacity is LOWER than the captured 0.51, so the fixed-alpha reverse blend OVER-subtracts (`watermarked - a*logo` goes negative) and drives the footprint to black — the white sparkle becomes a black diamond. `remove_watermark` now detects this via `_reverse_alpha_oversubtracts` (fraction of footprint pixels with `alpha >= _FOOTPRINT_ALPHA` 0.1 whose numerator < 0 exceeds `_OVERSUB_FOOTPRINT_FRAC` 0.05) and **inpaints the footprint** (`_inpaint_footprint`, cv2 NS over the dilated alpha mask) from the surrounding pixels instead.
**Behavior-neutral on the working case:** a bright background over-subtracts at ~0% so reverse-alpha is used and the output is byte-identical to before (verified: demo_banana 0.0 frac vs issue-#30 grass 0.61 frac; regression-guarded by `test_gemini_engine.py::TestOverSubtractionGuard`, which composites the sparkle at a reduced effective alpha to reproduce the mismatch).
**Under-subtraction (the symmetric case, fixed 2026-06-03):** some real Gemini sparkles are rendered MORE opaque than the captured ~0.51, so the fixed-alpha reverse blend UNDER-subtracts and leaves a bright sparkle residual the detector still fires on (measured on the spaces corpus: a visible-removal audit through the registry path left a detectable sparkle on a meaningful fraction of marks, all under-removals, NOT a background-brightness class — failures and successes had the same input confidence and the same background-luma distribution; the discriminator was the removal delta itself). `remove_watermark` now estimates a per-image alpha gain (`_estimate_alpha_gain`: effective sparkle opacity at the bright core vs the local background ring, `a_eff/a_cap`, clamped `[1.0, _ALPHA_GAIN_MAX` 1.94`]`) and scales the alpha to match before the over-sub/blend branch. The gain cleanly separates on the corpus (under-removed marks ~1.47, cleanly-removed ~1.00), and a deadband (`_ALPHA_GAIN_DEADBAND` 1.05) keeps a matching sparkle **byte-identical** to the pre-fix output, so the fix is purely additive (0 regressions on the audit set; the over-sub guard still runs on the scaled alpha as the safety net for an over-shooting estimate). Regression-guarded by `test_gemini_engine.py::TestUnderSubtractionGain` (composites a more-opaque-than-capture sparkle; **asserts on footprint pixels, NOT the detector** — the detector's NCC is degenerate on a flat synthetic background, so a re-detect conf is meaningless there; the real corpus removal drops the detector from ~0.80 to ~0.27).
**False-positive gate (added 2026-06-03):** `detect_watermark`'s shape-only NCC (`spatial*0.5 + gradient*0.3 + var*0.2`) fires on ornate/flat content (text strips, banners, hatching) that coincidentally matches the diamond shape — a real Gemini sparkle is a bright WHITE overlay, so its core sits above the local background, but the NCC is contrast-invariant and cannot see that. The fusion now **demotes** (caps confidence to 0.30) any match that is BOTH low-confidence (`< _SPARKLE_FP_CONF` 0.65) AND has a low core-ring brightness margin (`_core_ring_margin < _SPARKLE_FP_MARGIN` 5). Real sparkles escape via EITHER high confidence (white-bg sparkles score ≥0.79 despite a low margin — the NCC shape match is strong) OR high margin (dark/mid backgrounds, incl. the #36 faint-corner case, lift well clear), so BOTH must fail to demote. The gate is **monotonic** (only ever removes detections, never adds), so it cannot regress the verified-negative corpus (already 0 FPs). On the spaces corpus it demoted 16/495 flagged sparkles (13 carried no AI metadata = content FPs; the 3 AI-meta were visually FPs / a near-invisible white-on-white sparkle whose AI verdict is held by metadata anyway), and dropped the removal-audit failures 20→15 (post-removal flat footprints the NCC re-fired on). `_core_ring_margin` and `_estimate_alpha_gain` share the `_core_and_bg` helper (core 75th-pct brightness vs background-ring median). Regression-guarded by `test_gemini_engine.py::TestSparkleFalsePositiveGate`.
**Self-verify repair (added 2026-06-04):** the gain estimate corrects most under-subtractions, but a tail of strong sparkles still survived reverse-alpha (position jitter, or a gain the `[1.0, 1.94]` clamp could not fully reach). After the reverse blend, `remove_watermark` re-detects via `_verify_and_repair`; when a sparkle at or above `_VERIFY_FALLBACK_CONF` 0.5 (the registry's real fail line) remains, it inpaints the footprint and **keeps that only when it lowers the re-detect confidence** — purely additive (the common clean removal re-detects below 0.5 and is returned untouched, so it can never regress). On the spaces corpus this rescued **4 of the 15 remaining gemini removal-audit failures** (15→11, doubao/jimeng still 0), verified through the registry/CLI path. Costs one extra `detect_watermark` per removal (two when the fallback fires). Regression-guarded by `test_gemini_engine.py::TestVerifyAndRepair` (stubs `detect_watermark` to drive the keep-best control flow, since the NCC is degenerate on flat synthetics).
**An offset+scale alignment search was prototyped on the remaining 11 fails and REJECTED (2026-06-04):** an audit "ceiling" test suggested it could rescue 4 more (e.g. a5a9 0.577→0.417), but direct inspection showed those were NCC-gaming, not removal — the lower-scoring placement left the sparkle as bright or BRIGHTER (a5a9: first-pass slot 99.5th-pct ~76 at background level, the "aligned win" slot ~164), it just reshaped the residual so the contrast-invariant shape-NCC scored lower. A slot-brightness sanity gate rejected every one, so alignment contributed 0 genuine rescues and was removed (the footprint inpaint stays because it physically reconstructs the slot from its darker surroundings, so its rescues are real).
**Lesson: the visible-audit pass/fail metric (re-detect conf < 0.5) is gameable by reshaping the residual — optimizing it directly finds NCC-gaming placements, not clean removals; gate any removal candidate on a physical brightness check, not the detector alone.**
The 11 survivors are near-white ill-conditioning (reverse-alpha divides by `1-a`≈0.02) or detector false positives (before≈after≈0.51) that no reverse-alpha placement fixes. The registry's optional `inpaint_residual` (edge cleanup) is a no-op on a clean reverse-alpha removal (and on the same corpus it lowered the re-detect conf on 3 marks, raised it on 10, no-op on 466 — net-neutral on pass/fail, so the self-verify repair, not it, drives the removal tail); an earlier "Gemini smears" read was a misjudged soft-fur original, not an artifact.
**The bg assets are now rebuilt from OUR OWN controlled captures** (`data/gemini_capture/captures/`, committed) by `scripts/visible_alpha_solve.py gemini`, which locates the 96px sparkle on the black capture and crops it to the two logo sizes; our capture matched the previously third-party-sourced `gemini_bg_96.png` to **NCC 0.9998**, validating the asset and making it reproducible. Gemini's multi-size fixed-slot model is genuinely different from the Doubao/Jimeng text-strip engines (so it stays a separate engine, not part of the shared-base refactor).
## `_text_mark_engine.py`
`_text_mark_engine.py`**shared base for the three reverse-alpha text-mark engines (Doubao/Jimeng/Samsung), extracted 2026-06-09** (they were ~90% byte-identical clones). `TextMarkEngine(config: TextMarkConfig)` owns the whole `locate → extract_mask → detect → _fixed/_aligned_alpha_map → _apply_reverse_alpha → remove_watermark_reverse_alpha` pipeline (+ the asset-keyed `load_alpha_template`/`glyph_silhouette`/`template_match_score` caches). Each engine module is now a thin subclass: it supplies only its `TextMarkConfig` (the tuned constants, the bundled asset, and the bounded structural deltas — `corner` br/bl, `margin_floor` 4/2, `morph_open_size` 5/3, `min_gw` 8/16) plus the test-facing module shims (`_alpha_template`/`_glyph_silhouette`/`_template_match_score` + the constants). Behavior is byte-exact vs the old per-engine code (the three engine test suites pass unchanged). Gemini stays a SEPARATE engine (its multi-size fixed-slot sparkle model is genuinely different). Add a new text mark = a new `TextMarkConfig` + a thin subclass + one registry `_text_mark(...)` row. The engine bullets below describe each mark's calibration history; the LOGIC lives here.
**`_apply_reverse_alpha` runs on the glyph crop only:** the blend is a no-op outside the glyph `region` (x, y, w, h) (`(wm - 0)/(1 - 0) == wm`, and a uint8→float32→uint8 round-trip is exact). It copies the frame through and computes the reverse-alpha math on the `region` crop only — byte-identical to the old full-frame pass (verified: Doubao 130 + Jimeng 22 placements, 0 mismatches) but O(glyph) not O(image). The full-frame pass cost ~275 ms on a 12 MP frame for a glyph that is <0.1% of it, once per candidate placement (fixed + aligned ≈ 2×/removal); the crop drops that to ~2 ms. Mirror of the Gemini `_core_and_bg` crop.
**`_fixed/_aligned_alpha_map` and `extract_mask` return footprint-sized arrays, not full frames (memory):** the alpha-map helpers return the glyph-sized alpha **block** (`(gh, gw)` float32) plus its placement `(ax, ay, gw, gh)`, and `extract_mask` returns the box-sized glyph mask (`(loc.h, loc.w)` uint8) — both used to allocate a full `(h, w)` array that is read only inside the small glyph/box. A full-frame float32 alpha map is ~48 MB on a 12 MP frame and two were held at once during removal (fixed + aligned ≈ 96 MB of mostly-zeros); the box mask was a ~12 MB uint8 allocation rebuilt per text-mark `detect` on the memory-tight `identify` path. `_apply_reverse_alpha` consumes the block directly; the residual inpaint embeds it into one full-frame uint8 mask only at `cv2.inpaint` time (which needs a full-frame mask). Byte-identical to the old full-frame path — the block equals the old map's `[ay:ay+gh, ax:ax+gw]` slice and the box equals the old mask cropped to `loc.bbox` (regression-guarded by `tests/test_text_mark_memory.py`, which reconstructs the old full-frame path inline and asserts equality, so the proof survives a cv2/asset bump). `remove_watermark_reverse_alpha` tracks the winning `region` alongside `best_amap` to place that mask.
## `doubao_engine.py`
`doubao_engine.py`**a thin `_text_mark_engine.TextMarkEngine` subclass (config only) since 2026-06-09.** visible Doubao "豆包AI生成" remover/detector (cv2/numpy, no GPU). `DoubaoEngine.locate` anchors a bottom-right box by **geometry** (mark scales with image WIDTH), `extract_mask` pulls the light, low-chroma glyphs (the detection candidate) using a per-pixel channel-spread proxy `sat = roi.max(axis=2) - roi.min(axis=2)` (no HSV conversion). `detect` is **shape-consistent**: it matches the bundled alpha glyph silhouette (`assets/doubao_alpha.png`) against the candidate via zero-mean normalized correlation (`_template_match_score`, cv2 `TM_CCOEFF_NORMED`), gated at `DETECT_NCC_THRESHOLD` 0.4 over a small `DETECT_MIN_COVERAGE` floor. Keying on glyph SHAPE (not coverage heuristics) fixed #23 (corpus FP 7/1243).
**Removal = reverse-alpha + thin residual inpaint** (`remove_watermark_reverse_alpha`): `original = (wm - a*logo)/(1-a)` from the bundled alpha map + `_ALPHA_LOGO_BGR` (pure white) + `_ALPHA_*_FRAC` geometry, then a deliberately THIN inpaint (`_RESIDUAL_*`, `INPAINT_NS`) over the glyph footprint clears leftover edges without smearing.
**Alpha is rebuilt by `scripts/visible_alpha_solve.py` (the careful gray-self solve: cubic background fit, mean over channels, full halo, unblurred), same recipe as Jimeng** — the captures are committed in `data/doubao_capture/captures/`.
**Removal aligns ALWAYS** (no `_ALPHA_NATIVE_BAND` fast-path): it tries fixed geometry AND `_aligned_alpha_map`'s `TM_CCOEFF_NORMED` scale+position search and keeps the lower-residual one — the mark is re-rasterized and a few px off per image, so fixed geometry alone leaves a visible outline even at 2048.
**The locate box (`WM_*`) is generous (0.22 wide, margins 0.004) and reaches close to the corner** — a tight box (the old 0.185 / margin 0.012) let a corner-ward shift fall OUTSIDE the alignment search, so the align missed and a readable outline survived; regression-guarded by `test_recovers_shifted_mark_on_texture` (composes the alpha shifted on a known texture; old box ~29 vs new ~1 mean residual).
**Issue #13 follow-up defect (found 2026-05-31): the SHIPPED Doubao removal left a clearly READABLE "豆包AI生成" outline on the real `doubao-1.png` sample, while `detect` returned conf 0.0 (it is fooled by a thin outline) so `test_reverse_alpha_removes_mark` passed and the old "56/56 clean" claim was detector-measured, not visual.**
Root cause: bad alpha (under-estimated, max ~0.65) + fixed-no-inpaint + tight box; the careful rebuild + always-align + thin inpaint + wide box takes it from a readable outline to faint texture-level traces (parity with Jimeng — a single capture cannot pixel-cancel a per-image re-rasterized mark).
**Lesson: a detector-only removal test is insufficient; assert visual residual (the textured-shift test).** **`extract_mask` guards a degenerate ROI (`bh < 16 or bw < 16` -> empty mask, skips cv2):** the always-align removal scores each placement with a residual `detect(out)`, and on an extremely wide/short image (e.g. 2048x1, `test_wide_short_does_not_raise`) that fed cv2's GaussianBlur a ~1-px-tall ROI and **faulted natively on Windows py3.12 (access violation, non-deterministic — one CI cell went red while a re-run passed)**; the old at-native path never ran `detect` on degenerate sizes. Real images always clear the guard (the `WM_*` box floors are `max(16, …)` height / `max(40, …)` width), so it only short-circuits slivers. `reverse_alpha_available` is just "asset present"; the registry gates removal on `detect`. The shipped third-party `_refs/zhengsuanfa_doubao_alpha_120x20.png` is NOT a usable alpha (verified 2026-05-29). Arbitrary-region inpainting is `region_eraser`/`erase`.
## `jimeng_engine.py`
`jimeng_engine.py`**a thin `TextMarkEngine` subclass (config only) since 2026-06-09.** visible Jimeng / Dreamina "★ 即梦AI" remover/detector (cv2/numpy, no GPU), built 2026-05-30 from issue #13's solid captures (@powersee). Shares the base with `doubao_engine`: `locate` anchors a bottom-right box by **geometry** (scales with WIDTH), `extract_mask` pulls the light low-chroma glyphs (white top-hat + grayish + min-luma), `detect` matches the bundled "即梦AI" glyph silhouette (`assets/jimeng_alpha.png`) via `TM_CCOEFF_NORMED` over a coverage floor. Threshold `DETECT_NCC_THRESHOLD` **0.45** cleanly separates real Jimeng marks (>=0.81) from the Doubao strip (0.21) and other AI output (0.0), so the two ByteDance marks don't cross-fire in `--mark auto`.
**Logo is pure white (255,255,255)** (`_ALPHA_LOGO_BGR`; the white capture + an L-pair-solve confirm ~254.6); compositing is **sRGB, not linear** (a linear-light solve tripled the cross-residual).
**Alpha rebuilt by `scripts/visible_alpha_solve.py` from the GRAY capture** (`data/jimeng_capture/captures/`, the solid captures now committed): `a = (I - B)/(255 - B)`, B a per-capture **cubic** background fit over the non-glyph pixels, **averaged over channels, full halo extent (down to a~0.02), unblurred**. Gray (bg ~132) is the deliberate choice over black: it is the best proxy for real content (the mark sits on bright photo areas, not on black), and the careful build drops the gray self-residual to ~1.3.
**The mask quality, not the method, was the earlier limit** — a max-channel / quadratic-bg / blurred / halo-truncated build (and a black-dominated LS) left a visible outline (lesson from issue #13: when reverse-alpha leaves a ghost, suspect the captured alpha map before adding heuristics or switching method). Geometry emitted by the solver at `_ALPHA_NATIVE_WIDTH` 2048: `_ALPHA_WIDTH_FRAC` 0.202, `_ALPHA_HEIGHT_FRAC` 0.058, margins ~0.029.
**Removal = reverse-alpha + a deliberately THIN residual inpaint** (`remove_watermark_reverse_alpha`, `_RESIDUAL_DILATE` 5 over the `_RESIDUAL_ALPHA_FLOOR` 0.05 footprint, `_RESIDUAL_INPAINT_RADIUS` 2, `INPAINT_NS`): a single 2048 alpha cannot pixel-cancel the mark re-rasterized at another resolution (alpha maps from independent captures correlate 0.998, not 1.0; off-native reverse-alpha alone only halves the mark), so a tight inpaint clears the residual edges WITHOUT the texture/edge smear a wide full-footprint pass caused.
**Placement ALWAYS tries fixed geometry AND `_aligned_alpha_map`'s NCC scale+position search, keeping the lower-residual** — the mark re-rasterizes + jitters a few px per image even at the captured width, so fixed geometry alone misses (there is no `_ALPHA_NATIVE_BAND` fast-path; the scale search `_ALPHA_ALIGN_SEARCH` is fine-stepped, and the `WM_*` locate box is generous so a corner-ward shift stays inside the search — the same widen that fixed Doubao). Verified clean on the solid captures (native 2048; faint self-residual ~1.3 visible only on a dead-flat field, hidden by real texture) and a real 1440-wide Jimeng download (off-native, table edge preserved). `reverse_alpha_available` is just "asset present"; the registry gates on `detect`.
**No committed real sample** (the real content download stays gitignored; only the solid calibration captures are committed) — `tests/test_jimeng_engine.py` synthesizes a mark from the bundled alpha asset, and `test_recovers_shifted_mark_on_texture` guards the align-on-shift path that the Doubao defect exposed. Jimeng images are independently caught by the China TC260 AIGC label in `metadata`/`identify`, so this engine is the visible-mark *removal* path, not a new `identify` signal.
## `samsung_engine.py`
`samsung_engine.py`**a thin `TextMarkEngine` subclass (config only) since 2026-06-09.** visible Samsung Galaxy AI "✦ Contenuti generati dall'AI" remover/detector (cv2/numpy, no GPU), built 2026-06-05 from issue #37's flat captures (@f-liva). Shares the base but anchored **bottom-LEFT** (Doubao/Jimeng are bottom-right): `locate` anchors a bottom-left box by **geometry** (scales with WIDTH), `extract_mask` pulls the light low-chroma glyphs (white top-hat + grayish + min-luma — `LOGO_MIN_LUMA` is lowered to **110** because the mark is faint, peak alpha ~0.38, so on a mid/dark background its glyph luma is lower than Jimeng's), `detect` matches the bundled glyph silhouette (`assets/samsung_alpha.png`) via `TM_CCOEFF_NORMED` over a coverage floor. Threshold `DETECT_NCC_THRESHOLD` **0.40** (real marks ~0.79 on a real photo, ~0.57/0.71 on the black/gray captures; 0.0 on Doubao/Jimeng captures, and Doubao/Jimeng score 0.0 on a real Samsung photo — no cross-fire, also because the corner differs).
**Logo is pure white (255,255,255)** (`_ALPHA_LOGO_BGR`; white capture confirms).
**Alpha solved by `scripts/visible_alpha_solve.py samsung` from the GRAY capture** (`data/samsung_capture/captures/`, the flat black/gray/white captures committed; the solver gained a `corner="bl"` mode + left-margin logging for this), same careful recipe as Jimeng (cubic background, mean-channel, full halo, unblurred). Geometry emitted at `_ALPHA_NATIVE_WIDTH` **1086** (the flat-edit capture width): `_ALPHA_WIDTH_FRAC` 0.3195, `_ALPHA_HEIGHT_FRAC` 0.0378, `_ALPHA_MARGIN_LEFT_FRAC` 0.0110, `_ALPHA_MARGIN_BOTTOM_FRAC` 0.0064.
**Removal = reverse-alpha + a deliberately THIN residual inpaint** (`remove_watermark_reverse_alpha`, same `_RESIDUAL_*` recipe as Jimeng) with **always-try fixed AND `_aligned_alpha_map` NCC scale+position search, keep the lower-residual** (`_ALPHA_ALIGN_SEARCH` widened to (0.85, 1.18, 23) because the flat captures are far off the real-photo width).
**Resolution caveat:** the flat captures arrived at 1086 wide while real photos are ~2958 wide (the mark scales with width, so the captured glyph ~334px is ~2.7x smaller than the ~903px real-photo glyph); width-scale + NCC-align still removes it cleanly (verified on a real 2958-wide @f-liva photo: re-detect 0.79→0.00, no readable text or outline on the recovered wooden table — checked **visually**, not just by the detector, per the Gemini self-verify lesson), but a flat capture at the real photo resolution would make the alpha pixel-sharp instead of upscaled (open quality upgrade, noted in `data/samsung_capture/README.md`).
**The mark is locale-specific** (text differs per language); this build is the Italian "Contenuti generati dall'AI" variant — other locales need their own captured template. `reverse_alpha_available` is just "asset present"; the registry gates on `detect`.
**No committed real sample** (the real photo stays gitignored; only the flat calibration captures are committed) — `tests/test_samsung_engine.py` synthesizes a mark from the bundled alpha asset (bottom-left geometry), with `test_recovers_shifted_mark_on_texture` guarding the align-on-shift path. Samsung Galaxy AI edits are independently caught by C2PA + the `genAIType` marker in `metadata`/`identify`, so this engine is the visible-mark *removal* path; it also feeds `identify` as the medium-confidence `visible_samsung` signal via the registry (the stripped-metadata fallback).
## `region_eraser.py`
`region_eraser.py` — universal region eraser (`erase` CLI). `erase(image, boxes=|mask=, backend=)` accepts grayscale (2D) and RGBA (4-channel) inputs on **both** backends (`erase_cv2` and `erase_lama` each split off any alpha plane and re-attach it unchanged, and promote grayscale to BGR for processing — LaMa would otherwise crash on grayscale and drop alpha on BGRA): `boxes_to_mask``cv2.inpaint` (`cv2` backend, default, no deps) or big-LaMa via onnxruntime (`lama` backend, extra `lama`, `Carve/LaMa-ONNX` Apache-2.0 model downloaded on first use, never bundled). `erase_lama` crops a padded region around the mask, runs LaMa at its fixed 512² input, pastes only masked pixels back (untouched areas stay pixel-exact). Lazy `_get_lama_session` singleton; `lama_available()` guards the optional import.
**LaMa-ONNX costs ~3.5-4 GB peak RAM and ~5-6 s/call on CPU** (FFC working set, not arena — `enable_cpu_mem_arena=False` does not help), so it does NOT fit a minimal droplet; the cv2 backend (tens of MB, ~30 ms) does. LaMa quality at low RAM = serverless/GPU, mirroring how raiw.cc offloads SDXL to fal.
## `invisible_watermark.py`
`invisible_watermark.py``detect_invisible_watermark(path)` decodes the OPEN DWT-DCT watermarks (public decoder, no key) embedded by Stable Diffusion / SDXL / FLUX via the `imwatermark` library. Known fixed patterns (verified against upstream source) live in `_BITS_48` (SDXL 48-bit, FLUX.2 48-bit) and `_SD1_STRING` ("StableDiffusionV1", SD 1.x/2.x). Optional dep (extra `detect`); returns None when absent. The `detect` extra pulls **torch** transitively (invisible-watermark declares torch a hard dep, and `WatermarkDecoder` eagerly imports `rivaGan` -> `torch` at import time), so detection needs torch present even though dwtDct runs CPU-only on cv2/numpy/pywavelets — no GPU and no separate `gpu` extra required.
**Unlike SynthID this is locally detectable**, but the watermark is fragile (does not survive JPEG re-encode/resize — verified gone after JPEG q90), so it confirms origin only on pristine files. Add new known patterns here. The file carries a top-of-module pyright pragma because imwatermark/cv2 ship no type stubs.
## `trustmark_detector.py`
`trustmark_detector.py``detect_trustmark(path)` decodes the OPEN, keyless **Adobe TrustMark** watermark (the soft binding behind Adobe Durable Content Credentials, `alg` `com.adobe.trustmark.P`) via the optional `trustmark` package (extra `trustmark`; pulls torch, downloads model weights on first use). Mirrors `invisible_watermark.py` (lazy singleton guarded by a double-checked `threading.Lock` so concurrent callers do not double-download the weights, top-of-module pyright pragma, returns None when absent). It detects *provenance*, not AI origin as such (TrustMark also marks human-authored content), so `identify` lists it as a watermark without setting `is_ai_generated`. Other soft-binding vendors (Digimarc/Imatag/Steg.AI/...) have no public decoder — they are only *named* via the `C2PA_SOFT_BINDINGS` scan, not decoded.
**False-positive gate (added 2026-05-29):** TrustMark's `wm_present` is a BCH error-correction validity flag that spuriously validates on a content-correlated fraction of un-watermarked images — AI-generated textures trip it far more than camera photos (verified 2026-05-29 on real files: it fires on Gemini/OpenAI/Doubao output that *cannot* carry Adobe's watermark, with a random-bytes decoded secret, while signal-free camera photos did not trip it). A genuine TrustMark is a *durable* soft binding engineered to survive re-encoding, so `detect_trustmark` re-decodes after a mild JPEG round-trip (`_survives_reencode`, `_REENCODE_QUALITY` 95) and requires the same schema both times; every observed false positive collapsed (none survived even q95), so the gate is the durability property the watermark guarantees. The second decode runs only on the rare initial hit, so the cost is negligible. Do NOT remove the gate to "catch more" — a lone TrustMark hit without it is almost always content noise.
## `noai/watermark_remover.py`
`noai/watermark_remover.py` — the `WatermarkRemover` class has three diffusion pipelines, selected by the explicit `pipeline` ctor arg (NOT inferred from `model_id`). `sdxl`/`controlnet` share the SDXL base (`DEFAULT_MODEL_ID`); `qwen` is its own base (`QWEN_MODEL_ID`).
**`sdxl`** (renamed from `default` 2026-06-09; `default` kept as a back-compat alias via `normalize_profile`) runs plain SDXL img2img (`_run_img2img`); it is the lighter opt-down alternative (no ControlNet weights).
**`qwen`** (`_run_qwen`, `_load_qwen_pipeline`) runs `QwenImageImg2ImgPipeline` on `Qwen/Qwen-Image` (20B MMDiT, Apache-2.0 code AND weights). The scrub still comes from the img2img `strength`; Qwen's value is **text preservation** (incl. CJK and small text). **Metric-measured nuance (2026-06-19, `scripts/fidelity_metrics.py`, do NOT trust the eyeball here — it misled). Compare ONLY at each pipeline's oracle-confirmed scrub floor (outputs where SynthID is removed in BOTH — an equal-strength compare is invalid where it leaves one un-scrubbed; Qwen at 0.15 does not clear Gemini): Qwen wins TEXT (lower OCR CER across EN/RU/ZH, perfect Chinese) but controlnet wins FACES (higher Laplacian-variance retention and lower LPIPS — Qwen smooths faces MORE; ArcFace identity favors controlnet 0.546 vs 0.331 at the Gemini floors).** So Qwen is the better text-preserving remover, NOT a universal fidelity win — controlnet's canny edge map holds face skin detail better. Specifics: bf16 on CUDA (fp16 risks overflow on the 20B MMDiT — see the dtype branch in `__init__`); loads `QWEN_MODEL_ID` unless `--model` is overridden; the call shape lives in the pure module helper `_build_qwen_kwargs` (unit-tested without torch in `tests/test_platform.py::TestQwenKwargs`), which uses Qwen's `true_cfg_scale` (NOT SDXL's `guidance_scale` — the CLI `--guidance-scale` maps onto it; ~4.0 is typical, the SDXL default 7.5 is high for Qwen) and an explicit `negative_prompt` (`_QWEN_PROMPT`/`_QWEN_NEGATIVE`). It is CUDA/cloud-class (the 20B does not fit MPS), so `_run_qwen` has NO MPS->CPU fallback — an error propagates. `_load_qwen_pipeline` raises a clear ImportError if the installed diffusers lacks `QwenImageImg2ImgPipeline`. **Prototype oracle floors (Modal A100-80GB, single seed, 2026-06-19, PENDING seed-repeat cert): OpenAI clears at strength ~0.10, Gemini at ~0.30 (0.20 still detected) — both controls were SynthID-positive. The Gemini floor (0.30) is HIGHER than the certified controlnet Gemini floor (0.15), and `resolve_strength` is shared/pipeline-independent, so pass an explicit `--strength` for Gemini content on `qwen` until a Qwen-specific ladder is certified.** Fidelity vs controlnet was measured (`scripts/fidelity_metrics.py`), NOT eyeballed — see the metric nuance above: Qwen wins text, controlnet wins faces.
**`controlnet`** (**the DEFAULT pipeline since 2026-06-09** for `invisible`/`all`/`batch` and both engine ctors; `_run_controlnet`, `_load_controlnet_pipeline`) runs `StableDiffusionXLControlNetImg2ImgPipeline` with the SDXL-native canny ControlNet `xinsir/controlnet-canny-sdxl-1.0` (`watermark_profiles.CONTROLNET_CANNY_MODEL`): the control image is `cv2.Canny(gray, 100, 200)` stacked to 3 channels (`_CANNY_LOW`/`_CANNY_HIGH`, prompt `_CONTROLNET_PROMPT` / `_CONTROLNET_NEGATIVE`).
**Removal comes from the img2img regeneration (`strength`); the ControlNet only PRESERVES text and face STRUCTURE via the edge map.**
No original pixels are copied or frozen, BUT **validation 2026-06-04 disproved the old "so SynthID does not survive" claim: SynthID CAN survive controlnet on photoreal/high-detail content.**
At the shared low removal strength the canny edge-conditioning keeps the regeneration so close to the original that the pixel perturbation that destroys SynthID does not happen (oracle-confirmed: an OpenAI bracelet photo + a 9-face grid read **SynthID-detected** after controlnet at strength 0.10/0.15, but **SynthID-not-detected** after the `default` pipeline at the SAME strength + resolution -- only the pipeline differed).
**But the reverse also holds: a flat-graphic logo/poster SURVIVED `default` while clearing controlnet** -- removal at the low strength is content×pipeline dependent and neither pipeline is universally safe; the real lever is a higher strength. See the controlnet Known-limitations bullet for the full table + root cause. Canny holds face STRUCTURE but NOT identity (the regenerated face drifts in likeness -- canny carries edges, not identity). The drifted cleaned face is the LEAST-AI state we can reach without re-introducing SynthID; the library does NOT ship a face-restore extra. Every restore approach we evaluated (GFPGAN-on-cleaned, PhotoMaker-V2 txt2img, InstantID txt2img, InstantID img2img-on-cleaned at three parameter sweeps, 2026-06-04 - 2026-06-08 Modal cert sweeps) regenerated the face from an ArcFace embedding via SDXL diffusion -- which makes the output face look MORE AI-generated, not less. Empirical conclusion in `docs/synthid-robust-identity-research-2026-06-08.md` "Empirical follow-up". For production face preservation, ship the cleaned image as-is. `controlnet_conditioning_scale` (ctor arg, default 1.0) is the structure-preservation knob. Same dtype rule as `default` (fp32 on cpu/mps, fp16 only on cuda/xpu; the fp16-fixed SDXL VAE `_SDXL_FP16_VAE_ID` is swapped in on fp16 GPUs -- issue #29) and the same MPS->CPU fallback (reload on cpu/fp32, drop a non-cpu generator, retry once).
**Tiled diffusion (`tile`/`tile_size`/`tile_overlap` ctor-path args, CLI `--tile`, issue #10):** for large inputs that OOM at native resolution, `remove_watermark` can process the diffusion pass in overlapping sliding-window tiles instead of one forward pass — the lossless alternative to a `--max-resolution` downscale. The single-image generation closure was refactored into `_generate_one(img)` (dispatches controlnet/img2img, generator shared so the seed advances deterministically across tiles), and `_generate()` routes it through `noai.tiling.run_tiled` when `tile` is set AND `max(init_image.size) > tile_size` (a sub-tile image runs one pass unchanged). The ControlNet canny edge map is rebuilt per tile inside `_generate_one`, so structure preservation is tile-local. See `noai/tiling.py` below and the tiled-diffusion subsection in `docs/known-limitations.md` for the geometry, the partition-of-unity blend, and the quality caveat.
## `noai/tiling.py`
Pure sliding-window tiling for the diffusion path (no torch import; numpy/PIL only). `plan_tiles(w, h, tile_size, overlap)` returns a row-major grid of uniform-size `Tile` boxes — every tile is exactly `tile_size` (the SDXL training size), with the last tile on each axis pulled back flush to the far edge (`_axis_positions` clamps a pathological `overlap >= tile` to `tile - 1` so the step stays >= 1). `feather_weights(w, h, overlap)` is a separable linear taper (1 in the interior, ramping toward each edge) floored at `_WEIGHT_EPS` so it is **strictly positive everywhere** — that makes the normalised `accum / weight_sum` blend a partition of unity, so identical/unchanged tiles reconstruct the input exactly (the seam-free guarantee). `run_tiled(generate_tile, image, tile_size, overlap, set_progress)` is the orchestration loop: crop each planned tile, call `generate_tile` (one diffusion pass on a single PIL tile — injected, so this stays decoupled from the pipeline), resize a latent-grid-rounded result back to the exact tile size, and feather-accumulate. All three are unit-tested without the model (`tests/test_tiling.py`: axis math, grid coverage, taper shape/symmetry/positivity, identity reconstruction, per-tile call count, and the resize-back path). New blend tuning belongs in these pure helpers, not inlined into the runner.
## `auto_config.py` (REMOVED 2026-06-09)
**`auto_config.py` + the content-detection layer were REMOVED 2026-06-09.**
History: `auto_config.plan()` was a content-adaptive planner that detected faces/text/edges (bundled OpenCV YuNet + PP-OCRv3 DBNet models) to route the pipeline and toggle the adaptive polish. Once `controlnet` became the default-and-only auto pipeline (it no longer downgrades a structure-less image to `sdxl`) and the adaptive polish was confirmed to **self-gate by detail level** (`humanizer.adaptive_polish` no-ops when the cleaned image already meets the input's Laplacian variance, so it does real work only on over-smoothed photo/face texture and ~nothing on text/flat), the detection no longer changed any behavior — it only annotated a `reason` string. So the whole layer was deleted: `auto_config.py`, `tests/test_auto_config.py`, and the two detection assets (`assets/face_detection_yunet_2023mar.onnx`, `assets/text_detection_ppocrv3_2023may.onnx`, ~2.6 MB).
**`--auto` is now a DEPRECATED no-op** (`cli._resolve_auto_polish`): controlnet is already the default pipeline AND the adaptive polish is ON by default, so `--auto` has nothing left to do — it only prints a deprecation warning and passes `adaptive_polish` through unchanged (an explicit `--no-adaptive-polish` still wins). (Originally it re-enabled the polish; once the polish default flipped to ON the same day, the parameter-source branch became dead and was dropped.) The **adaptive polish itself lives on** in `humanizer.adaptive_polish` (CLI `--adaptive-polish/--no-adaptive-polish`, **ON by default since 2026-06-09** — it self-gates to a no-op where there is no detail deficit, so default-on is safe; uses the full-res original as the detail reference) — see the `humanizer` test note. `batch` resolves the polish once before the loop (one warning) and caches the invisible engine per pipeline (`ctx.obj["_inv_engines"]`).
## `upscaler.py`
`upscaler.py` — optional Real-ESRGAN pre-diffusion super-resolution for small inputs (spandrel boundary, top-of-file pyright pragma). `is_available()` gates on spandrel+torch (via `importlib.util.find_spec`); `upscale(bgr, device=None)` loads a lazily-built spandrel `ImageModelDescriptor` singleton (double-checked lock) and upscales by the model's native factor (x2), with a non-CPU→CPU device fallback mirroring the diffusion engine's MPS→CPU retry. Weights (`RealESRGAN_x2plus.pth`, BSD-3-Clause) download on first use to the `torch.hub` checkpoints cache; never bundled. Used only when UPscaling to the `min_resolution` floor (a `max_resolution` downscale always uses Lanczos). The wiring is `InvisibleEngine._esrgan_upscale(pil, target)` — Real-ESRGAN at native factor, then a Lanczos resize to the exact target, falling back to a plain Lanczos resize if the extra is absent or the model errors (so an optional upscaler can never break removal). The default `--upscaler` is `lanczos` (cv2, no deps).
**ESRGAN is a generic photo/texture GAN with no face/glyph prior**, so it best fits photo/texture content and can degrade faces (glassy/asymmetric eyes -- the diffusion pass regenerates faces so the full-pipeline final recovers) and thin/small text (the GAN invents wrong strokes, and low-strength diffusion will not fix it). Verified 2026-06-04: isolated upscale lap-var ~5x Lanczos on faces+textures but glassy eyes; end-to-end `invisible` final lap-var 1634 vs Lanczos 663 with natural faces (diffusion cleaned the artifact). Kept a **manual opt-in knob** (the auto plan never selects it) with `lanczos` the default; not content-gated by design (use Lanczos for text-heavy inputs). spandrel is MIT and pulls no basicsr. Unit-tested without the model: `tests/test_upscaler.py` (availability guard + the not-installed RuntimeError) and `tests/test_invisible_engine.py::TestEsrganUpscale` (the three `_esrgan_upscale` branches via a monkeypatched `upscaler`).
## `image_io.py`
`image_io.py` — Unicode-safe cv2 IO (issue #17). `imread(path, flags=None)` / `imwrite(path, img)` wrap `np.fromfile`+`cv2.imdecode` / `cv2.imencode`+`tofile` so non-ASCII paths work on Windows -- bare `cv2.imread`/`cv2.imwrite` use the platform ANSI code-page API there and fail (empty decode + `can't open/read file`) on Chinese/Cyrillic/accented filenames. `imread` keeps `cv2.imread` semantics (defaults to `IMREAD_COLOR`, returns `None` on missing/empty/undecodable).
**Every cv2 file read/write in the package routes through here; do not call `cv2.imread`/`cv2.imwrite` directly.**
`imwrite` returns `False` on an unwritable path (`OSError` caught) instead of raising, matching `cv2.imwrite` semantics. macOS/Linux already accept UTF-8 paths, so it is behavior-neutral there (the bug only reproduces on Windows).
**`to_bgr(image)` (added 2026-06-09)** is the shared channel normalizer: promotes 2D grayscale / (h,w,1) / 4-channel BGRA to 3-channel BGR (a 3-channel input is returned unchanged, no copy). Use it instead of inlining the `cvtColor(GRAY2BGR/BGRA2BGR)` branch — the gemini engine and the `TextMarkEngine` base both route through it so a grayscale/BGRA input (a real Gemini-app export is opaque RGBA) does not crash the `axis=2` channel reductions. cv2/numpy are imported lazily inside the functions, so the module is cheap to import in a bare env.