fix(visible): inpaint mid-tone Gemini sparkle instead of a dark diamond

The free `visible` path over-subtracted a faint Gemini sparkle on a mid-tone background into a darker-than-background brown diamond instead of removing it (2026-06-18 prod NPS report, "the watermark was not removed, just its color changed"). The existing over-subtraction guard only tripped when reverse-alpha drove a footprint pixel fully negative (the issue #30 dark-background black-pit case); on a mid-tone background the over-subtraction darkens the core well below the background without any pixel crossing zero, so the gate missed it and shipped the dark mark. Add a second over-subtraction signal to `_reverse_alpha_oversubtracts`: predict the reverse-alpha output at the bright core, (core - a*logo)/(1-a), and route to the footprint inpaint when it lands more than `_OVERSUB_DARK_MARGIN` (25) gray levels below the local background ring. Calibrated wide: clean removals predict within ~12 of background (demo_banana ~-1), the prod regression ~-40, the issue #30 dark case ~-82. Corpus-validated on the 479 detected Gemini images: 10 switch reverse-alpha to inpaint, all of them dark-diamond cases that improve or match; the other 469 stay byte-identical. demo_banana stays on the reverse-alpha path (byte-identical). Also crop both reverse-alpha helpers to the region they actually touch, a pure O(image) -> O(mark) win that is byte-identical to the full-frame math (a uint8<->float32 round-trip is exact): - `GeminiEngine._core_and_bg` converts only the footprint+ring crop to gray, not the whole frame (~70 ms -> 0.1 ms on a 12 MP image; it runs for both the alpha-gain estimate and the new gate). Verified identical across 479 images; detector confidence unchanged. - `TextMarkEngine._apply_reverse_alpha` computes the blend on the glyph crop only (`amap` is zero outside it, so the math is a no-op there): ~275 ms -> ~2 ms per placement on a 12 MP frame, up to 2 placements per removal. Verified identical across 142 Doubao/Jimeng placements. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-05 07:57:50 +02:00 · 2026-06-18 17:19:41 -07:00
parent 09fdb4544a
commit 41f67973ce
4 changed files with 96 additions and 18 deletions
@@ -11,7 +11,7 @@ module.

 ## `noai/c2pa.py`

-`noai/c2pa.py` — PNG chunk parser; use `extract_c2pa_chunk(path)` to get raw caBX payload, `has_c2pa_metadata(path)` to detect. Do not reimplement chunk parsing. `extract_c2pa_info(path)` sets `synthid_watermark`/`synthid_vendors` when the manifest is signed by a SynthID-using vendor, and `soft_binding`/`soft_binding_vendors` when a `c2pa.soft-binding` `alg` names a forensic-watermark vendor (`soft_binding_vendors_in(buffer)` is the shared byte-scan, used by both the PNG parser and the non-PNG binary path). PNG/caBX chunk reads are clamped to the remaining file size (`safe_length = min(length, remaining)`; skipped chunks use seek) so a malformed huge `length` cannot drive a multi-GB allocation (shared safety discipline matching `isobmff.scan_c2pa_region`).
+`noai/c2pa.py` — C2PA reading, **official c2pa-python `Reader` first, hand-rolled parser as fallback** (migrated 2026-06-18; the official lib is a core dep, MIT/Apache, spec-tracking). `read_manifest_store_json(path)` runs `Reader.try_create` with a default `Context` (NO trust enforcement — we report what is in the file, we do not gate on cert trust) and returns the **whole** manifest-store JSON (every manifest plus ingredient manifests); it is memoized per (path, mtime) (`lru_cache(maxsize=8)`) because one `identify`/`get_ai_metadata` call invokes the structured parser ~3x on the same file. `extract_c2pa_info(path)` builds its dict from that store JSON (`_info_from_store_json`: structured `claim_generator` from the active manifest's `claim_generator` / `claim_generator_info[].name`, `timestamp` from `signature_info.time`) and falls back to the legacy caBX parser (`_extract_c2pa_info_png`) when the reader is unavailable (broken/absent wheel, `reader_available()` False) or finds no parseable manifest (synthetic/partial test blobs, the inject round-trip's re-stitched chunk). **Both paths share `_populate_registry_fields(buf, info)`** — the issuer / AI-tool / action / source-type / SynthID / soft-binding registry byte-scan applied to the store JSON (reader path) or the raw caBX bytes (fallback) — so the return-dict shape is identical and the registry stays the single source of truth. Whole-store scanning is load-bearing: a ChatGPT *edit* of a Sora generation keeps `trainedAlgorithmicMedia` + issuer "OpenAI" on the **parent/ingredient** manifest, not the active "opened" one (the active manifest's `signature_info.issuer` is "OpenAI", `common_name` "Truepic Lens CLI in Sora", so the issuer field now reads "OpenAI, Truepic" — first-match-wins platform attribution still resolves OpenAI). `extract_c2pa_info` now also serves non-PNG containers (JPEG/AVIF/MP4) structurally via the reader; the consumers (`identify`, `synthid_source`, `get_ai_metadata`) already merge `info OR byte-scan`, so this strictly upgrades the non-PNG path with no double-counting. `synthid_watermark`/`synthid_vendors` is set when the manifest is signed by a SynthID-using vendor on AI content; `soft_binding`/`soft_binding_vendors` when a `c2pa.soft-binding` `alg` names a forensic-watermark vendor (`soft_binding_vendors_in(buffer)` is the shared byte-scan, used by both paths and the non-PNG binary path). `extract_c2pa_chunk` / `inject_c2pa_chunk` / `has_c2pa_metadata` stay the PNG caBX byte tools (raw-chunk extraction for `extractor.py`, test injection, fallback detection). PNG/caBX chunk reads are clamped to the remaining file size (`safe_length = min(length, remaining)`; skipped chunks use seek) so a malformed huge `length` cannot drive a multi-GB allocation (shared safety discipline matching `isobmff.scan_c2pa_region`). Regression-guarded by `tests/test_noai.py::TestC2PARealSamples::{test_extract_info_uses_reader_store,test_fallback_to_png_parser_when_reader_unavailable}`.

 ## `noai/constants.py`

@@ -29,7 +29,7 @@ module.

 **Visible-mark detection** (`check_visible`, signals `visible_sparkle` / `visible_doubao` / `visible_jimeng` / `visible_samsung`): the Gemini sparkle keeps its own file-level path (`_visible_sparkle` → `gemini_engine.detect_sparkle_confidence`, promoted only at confidence ≥ `_SPARKLE_THRESHOLD` 0.5; corpus-tuned to separate Gemini sparkles ≥0.56 from non-sparkle ≤0.49), while Doubao/Jimeng/Samsung reuse the registry detectors (`_visible_text_marks` → `watermark_registry`, iterating `_VISIBLE_MARK_PLATFORM`), each gated by its own engine NCC threshold via `MarkDetection.detected` (Doubao 0.4, Jimeng 0.45, Samsung 0.4). Doubao/Jimeng are normally also caught by the TC260 AIGC metadata label and Samsung by its C2PA + `genAIType` marker, so the visible path is their stripped-metadata fallback. Visible marks set `platform` only when no harder signal already did, and (like the sparkle) are excluded from integrity-clash vendor claims. The cv2 dependency lives in the engines, not here.

-**`import identify` is deliberately light** (~21 MB; ~36 MB with cv2 loaded by a visible-mark run, ~106 MB for a full `check_visible` run): it imports only the pure `noai.c2pa`/`noai.constants` submodules, and `noai/__init__` is lazy (see "Test and lint"), so torch/diffusers are NOT pulled at import even in a full `gpu`/`detect` install — fits a 512 MB host. The heavy paths are opt-in: `check_invisible=True` needs the `detect`/`trustmark` extras (each pulls **torch**; TrustMark also **downloads weights**), so on a core-only deploy leave `check_invisible` off (it is a no-op there anyway). Before the lazy `__init__`, the mere presence of torch in the env inflated `import identify` to ~420 MB.
+**`import identify` is deliberately light** (~26 MB; ~36 MB with cv2 loaded by a visible-mark run, ~106 MB for a full `check_visible` run): it imports the `noai.c2pa`/`noai.constants` submodules, and `noai/__init__` is lazy (see "Test and lint"), so torch/diffusers are NOT pulled at import even in a full `gpu`/`detect` install — fits a 512 MB host. `noai.c2pa` does eagerly import the **c2pa-python** binary (Rust + cryptography, ~+5 MB RSS, no torch) for the primary `Reader` path — light enough to stay on the dependency-light host; a broken/absent wheel degrades to the byte-scan parser (`reader_available()` False). The heavy paths are opt-in: `check_invisible=True` needs the `detect`/`trustmark` extras (each pulls **torch**; TrustMark also **downloads weights**), so on a core-only deploy leave `check_invisible` off (it is a no-op there anyway). Before the lazy `__init__`, the mere presence of torch in the env inflated `import identify` to ~420 MB.

 **C2PA platform attribution is device-token-first, issuer-scan fallback** (`_device_platform` scans manifest bytes for `_DEVICE_C2PA_PLATFORM` tokens, then `_attribute_platform`/`_ISSUER_PLATFORM`).

@@ -103,6 +103,8 @@ The 11 survivors are near-white ill-conditioning (reverse-alpha divides by `1-a`

 `_text_mark_engine.py` — **shared base for the three reverse-alpha text-mark engines (Doubao/Jimeng/Samsung), extracted 2026-06-09** (they were ~90% byte-identical clones). `TextMarkEngine(config: TextMarkConfig)` owns the whole `locate → extract_mask → detect → _fixed/_aligned_alpha_map → _apply_reverse_alpha → remove_watermark_reverse_alpha` pipeline (+ the asset-keyed `load_alpha_template`/`glyph_silhouette`/`template_match_score` caches). Each engine module is now a thin subclass: it supplies only its `TextMarkConfig` (the tuned constants, the bundled asset, and the bounded structural deltas — `corner` br/bl, `margin_floor` 4/2, `morph_open_size` 5/3, `min_gw` 8/16) plus the test-facing module shims (`_alpha_template`/`_glyph_silhouette`/`_template_match_score` + the constants). Behavior is byte-exact vs the old per-engine code (the three engine test suites pass unchanged). Gemini stays a SEPARATE engine (its multi-size fixed-slot sparkle model is genuinely different). Add a new text mark = a new `TextMarkConfig` + a thin subclass + one registry `_text_mark(...)` row. The engine bullets below describe each mark's calibration history; the LOGIC lives here.

+**`_apply_reverse_alpha` runs on the glyph crop only:** `amap` is zero outside the glyph `region` (x, y, w, h), so the blend is a no-op there (`(wm - 0)/(1 - 0) == wm`, and a uint8→float32→uint8 round-trip is exact). It copies the frame through and computes the reverse-alpha math on the `region` crop only — byte-identical to the old full-frame pass (verified: Doubao 130 + Jimeng 22 placements, 0 mismatches) but O(glyph) not O(image). The full-frame pass cost ~275 ms on a 12 MP frame for a glyph that is <0.1% of it, once per candidate placement (fixed + aligned ≈ 2×/removal); the crop drops that to ~2 ms. Mirror of the Gemini `_core_and_bg` crop. `remove_watermark_reverse_alpha` passes the `region` each `_fixed/_aligned_alpha_map` returns.
+
 ## `doubao_engine.py`

 `doubao_engine.py` — **a thin `_text_mark_engine.TextMarkEngine` subclass (config only) since 2026-06-09.** visible Doubao "豆包AI生成" remover/detector (cv2/numpy, no GPU). `DoubaoEngine.locate` anchors a bottom-right box by **geometry** (mark scales with image WIDTH), `extract_mask` pulls the light, low-chroma glyphs (the detection candidate) using a per-pixel channel-spread proxy `sat = roi.max(axis=2) - roi.min(axis=2)` (no HSV conversion). `detect` is **shape-consistent**: it matches the bundled alpha glyph silhouette (`assets/doubao_alpha.png`) against the candidate via zero-mean normalized correlation (`_template_match_score`, cv2 `TM_CCOEFF_NORMED`), gated at `DETECT_NCC_THRESHOLD` 0.4 over a small `DETECT_MIN_COVERAGE` floor. Keying on glyph SHAPE (not coverage heuristics) fixed #23 (corpus FP 7/1243).