Merge branch 'claude/modest-carson-d72243': corpus-mining provenance + removal fixes

Retained-corpus mining (2026-06-20) fixes, all gate-green:
- C2PA vendor coverage (Volcano Engine CJK legal name, ElevenLabs; TikTok/PixelBin vetted out)
- identify AI-generated vs AI-enhanced (ai_source_kind) + shared GEMINI_SPARKLE_TRUST_CONF (detect/remove threshold unify)
- text-mark over-subtraction guard (Doubao/Jimeng/Samsung)
- region-targeted regeneration for AI-enhanced composites (feather_region_composite + remove_watermark(region=))

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

# Conflicts:
#	CLAUDE.md
Victor Kuznetsov
2026-06-20 15:39:29 -07:00
13 changed files with 594 additions and 21 deletions
+4 -4
@@ -53,16 +53,16 @@ Compact map. The full per-module detail (design decisions, tuned thresholds, cal
- `noai/c2pa.py` — C2PA reading. `extract_c2pa_info(path)` uses the official **c2pa-python `Reader`** first (core dep, any container; `read_manifest_store_json` returns the WHOLE store JSON — active + ingredient manifests — so an AI marker on a parent manifest is seen), and falls back to the hand-rolled caBX/CBOR parser (`has_c2pa_metadata` / `extract_c2pa_chunk` / `_extract_c2pa_info_png`) for synthetic/partial blobs the validator rejects or a broken/absent wheel. The registry scan (issuer / source-type / SynthID / soft-binding) is shared by both paths via `_populate_registry_fields`, so the return-dict shape is identical. Do not reimplement chunk parsing; chunk reads are clamped to the remaining file size by design. `extract_c2pa_chunk`/`inject_c2pa_chunk` stay PNG-only (raw caBX bytes, test/extractor use).
- `noai/constants.py` — the single `C2PA_AI_VENDORS` registry (+ `C2PA_SOFT_BINDINGS`) from which `C2PA_ISSUERS` / `SYNTHID_C2PA_ISSUERS` / `identify._ISSUER_PLATFORM` are all derived. Add a new vendor as one registry entry; never edit the derived dicts and never add inline.
- `metadata.py` — `scan_head(path)` is the shared (memoized) input for every C2PA/AIGC/IPTC byte scan; use it instead of `open().read(1MB)` for any new marker scan. Also home to `synthid_source`, `xai_signature`, `iptc_ai_system`, `aigc_label`, `huggingface_job`, `samsung_genai`, and `remove_ai_metadata` (fail-safe `strip_c2pa_boxes`).
- `identify.py` — aggregates every locally-readable signal into one `ProvenanceReport`; `is_ai_generated` is True or None, never asserted False. `import identify` is deliberately light (lazy `noai/__init__`, fits a 512 MB host) — keep heavy imports out. Add capture-camera tokens to `_DEVICE_C2PA_PLATFORM` only when verified against a real C2PA file; editing-app/AI-device signer tokens go to `_SIGNER_C2PA_PLATFORM`; generator/issuer platforms to `C2PA_AI_VENDORS` in `constants.py`. Integrity-clash detection is high-precision by design (only hard generator stamps feed it, source-grouped independence).
- `identify.py` — aggregates every locally-readable signal into one `ProvenanceReport`; `is_ai_generated` is True or None, never asserted False. `ProvenanceReport.ai_source_kind` exposes the C2PA digital-source-type split — `"generated"` (trainedAlgorithmicMedia, fully AI) vs `"enhanced"` (compositeWithTrainedAlgorithmicMedia, a real photo with an AI-composited region), else None — so a caller branches full-frame scrub vs region-targeted clean (see `noai/tiling.feather_region_composite` + `WatermarkRemover.remove_watermark(region=...)`). The sparkle provenance threshold is the SHARED `watermark_registry.GEMINI_SPARKLE_TRUST_CONF` (imported, not a private copy) so the provenance "is there a sparkle" verdict and the removal "take the sparkle" decision can never drift. `import identify` is deliberately light (lazy `noai/__init__`, fits a 512 MB host) — keep heavy imports out (the `watermark_registry` constant import stays light: engines are lazy there). Add capture-camera tokens to `_DEVICE_C2PA_PLATFORM` only when verified against a real C2PA file; editing-app/AI-device signer tokens go to `_SIGNER_C2PA_PLATFORM`; generator/issuer platforms to `C2PA_AI_VENDORS` in `constants.py`. Integrity-clash detection is high-precision by design (only hard generator stamps feed it, source-grouped independence).
- `watermark_registry.py` — the single catalog of known visible watermarks (gemini / doubao / jimeng / samsung), reverse-alpha based by policy. Add a new visible text mark = one `_text_mark(...)` row + a `TextMarkConfig` with a captured alpha map; do not re-add per-mark `if` branches. `cli._write_bgr_with_alpha` must NOT zero alpha in the watermark bbox (issue #30 white-box regression).
- `gemini_engine.py` — visible Gemini-sparkle remover/detector (cv2/numpy, no GPU): top-K size-weighted fusion candidate selection (`_SELECT_TOPK`), corner-promote, over/under-subtraction guards, false-positive gate, self-verify repair. Detection scores the top-K size-weighted matches by full fusion (spatial+gradient+variance) and keeps the highest — NOT the raw-NCC argmax, which re-admits the tiny-patch FPs the size weight suppresses (the osachub 2026-06-12 sub-0.85 corner-sparkle regression; see `docs/module-internals.md`). Keep the 0.85 corner-promote NCC gate; a margin/chroma-gated lower promote was measured and REJECTED 2026-06-11 (~33% FP on non-Google content). Gate any removal candidate on a physical brightness check, not the detector alone.
- `_text_mark_engine.py` — shared base for the three reverse-alpha text-mark engines (extracted 2026-06-09); the per-engine modules are config-only subclasses. New text mark = a `TextMarkConfig` + a thin subclass + one registry row. Gemini stays a separate engine (different model).
- `doubao_engine.py` / `jimeng_engine.py` / `samsung_engine.py` — thin `TextMarkEngine` subclasses: Doubao "豆包AI生成" (bottom-right), Jimeng "★ 即梦AI" (bottom-right), Samsung Galaxy AI "✦ Contenuti generati dall'AI" (bottom-LEFT, locale-specific — Italian variant calibrated). Removal = reverse-alpha (always-align) + thin residual inpaint. A detector-only removal test is insufficient — assert visual residual (the textured-shift tests).
- `doubao_engine.py` / `jimeng_engine.py` / `samsung_engine.py` — thin `TextMarkEngine` subclasses: Doubao "豆包AI生成" (bottom-right), Jimeng "★ 即梦AI" (bottom-right), Samsung Galaxy AI "✦ Contenuti generati dall'AI" (bottom-LEFT, locale-specific — Italian variant calibrated). Removal = reverse-alpha (always-align) + thin residual inpaint, **with an over-subtraction guard ported from `gemini_engine` (2026-06-20)**: `_reverse_alpha_oversubtracts` predicts the reverse-alpha output PER PIXEL over the glyph body from the INPUT, and when the recovered body lands more than `_OVERSUB_DARK_MARGIN` (25) gray levels below the local ring it abandons the reverse-alpha pixels and inpaints the footprint from the original surroundings (`_inpaint_footprint`) — fixing the dark-pit ghost on dark/mid-tone backgrounds (roadmap P0#8). Predicting per-pixel from the input (not the produced output) keeps a clean full-strength mark byte-identical (no false trip). A detector-only removal test is insufficient — assert visual residual (the textured-shift tests + `tests/test_text_mark_oversubtraction.py`).
- `region_eraser.py` — universal region eraser (`erase` CLI): cv2 backend default (no deps), optional big-LaMa via onnxruntime (~3.5-4 GB peak RAM, ~5-6 s/call CPU — does not fit a minimal droplet).
- `invisible_watermark.py` — decodes the OPEN DWT-DCT watermarks (SD / SDXL / FLUX) via `imwatermark` (extra `detect`, pulls torch). Fragile two ways: (1) does not survive JPEG re-encode/resize; (2) **carrier-fragile on a broad class of pristine images** -- a clean encode->decode round-trip recovers 48/48 on chatgpt/firefly/random but FAILS (28-39/48, below the `_MATCH_48`=44 gate) on the FLUX fox, doubao, a flat FLUX generation, AND a clean synthetic flat fill with no watermark. The failure does NOT track texture; it goes with a degenerate **all-ones decode that is a CARRIER ARTIFACT, not a watermark** (synthetic clean image reproduces it). So `detect_invisible_watermark` is **positive-only**: trust a hit; a `None` is inconclusive unless a same-carrier positive-control embed first recovers >=44. Verified 2026-06-19; full caveat in `docs/watermarking-landscape.md`.
- `trustmark_detector.py` — Adobe TrustMark open decoder (extra `trustmark`). Do NOT remove the JPEG re-encode false-positive gate — a lone TrustMark hit without it is almost always content noise.
- `noai/watermark_remover.py` — `WatermarkRemover` with three diffusion pipelines selected by the explicit `pipeline` ctor arg, never inferred from `model_id`: `sdxl` (plain SDXL img2img), `controlnet` (SDXL + canny ControlNet, **the DEFAULT since 2026-06-09**), and `qwen` (Qwen-Image 20B MMDiT img2img, Apache-2.0, CUDA/cloud-class — best **text** preservation (incl. CJK); `_load_qwen_pipeline`/`_run_qwen`, bf16, no MPS fallback; call shape in the pure `_build_qwen_kwargs` using `true_cfg_scale`). Removal comes from the img2img `strength`; ControlNet only preserves text/face STRUCTURE — SynthID CAN survive controlnet on photoreal content at low strength. Qwen CERTIFIED oracle floors (2026-06-20): OpenAI **0.10** (seed-robust, clean on seeds 0-4), Gemini **0.25** (seed 0 verified, pin a seed — Gemini oracle rate-limits volume; higher than the controlnet Gemini floor 0.15, so pass explicit `--strength 0.25` for Gemini on `qwen` until a Qwen ladder is wired). Fidelity measured by `scripts/fidelity_metrics.py` (OCR-CER / ArcFace / LPIPS / Laplacian-var, NOT eyeball), compared ONLY at each pipeline's oracle-confirmed scrub floor (where SynthID is removed in BOTH — equal-strength is invalid where it leaves one un-scrubbed): Qwen wins TEXT (incl. CJK), controlnet wins FACES (Qwen smooths faces more) — Qwen is the text-preserving remover, not a universal fidelity win. No face-restore extra ships, by validated decision (every restore approach looked MORE AI-generated).
- `noai/tiling.py` — sliding-window tiled diffusion for large inputs (CLI `--tile`). `WatermarkRemover.remove_watermark` branches to `run_tiled` when `tile` is set AND the long side exceeds `tile_size`, refactoring the single-pass `_generate` into a per-tile `_generate_one` (the ControlNet edge map is rebuilt per tile inside it). Pure helpers `plan_tiles` (uniform-size tiles, last one flush to the edge) and `feather_weights` (strictly-positive separable taper -> partition-of-unity blend) are unit-tested without the model. New tile-blend tuning goes in those pure helpers; do not inline blend math into the runner.
- `noai/watermark_remover.py` — `WatermarkRemover` with three diffusion pipelines selected by the explicit `pipeline` ctor arg, never inferred from `model_id`: `sdxl` (plain SDXL img2img), `controlnet` (SDXL + canny ControlNet, **the DEFAULT since 2026-06-09**), and `qwen` (Qwen-Image 20B MMDiT img2img, Apache-2.0, CUDA/cloud-class — best **text** preservation (incl. CJK); `_load_qwen_pipeline`/`_run_qwen`, bf16, no MPS fallback; call shape in the pure `_build_qwen_kwargs` using `true_cfg_scale`). Removal comes from the img2img `strength`; ControlNet only preserves text/face STRUCTURE — SynthID CAN survive controlnet on photoreal content at low strength. Qwen CERTIFIED oracle floors (2026-06-20): OpenAI **0.10** (seed-robust, clean on seeds 0-4), Gemini **0.25** (seed 0 verified, pin a seed — Gemini oracle rate-limits volume; higher than the controlnet Gemini floor 0.15, so pass explicit `--strength 0.25` for Gemini on `qwen` until a Qwen ladder is wired). Fidelity measured by `scripts/fidelity_metrics.py` (OCR-CER / ArcFace / LPIPS / Laplacian-var, NOT eyeball), compared ONLY at each pipeline's oracle-confirmed scrub floor (where SynthID is removed in BOTH — equal-strength is invalid where it leaves one un-scrubbed): Qwen wins TEXT (incl. CJK), controlnet wins FACES (Qwen smooths faces more) — Qwen is the text-preserving remover, not a universal fidelity win. No face-restore extra ships, by validated decision (every restore approach looked MORE AI-generated). `remove_watermark(region=(x,y,w,h), region_feather=...)` runs the regeneration but feather-composites only the AI box back over the original (via `noai/tiling.feather_region_composite`), preserving the real photo elsewhere — the **AI-enhanced composite** path (`identify` `ai_source_kind == "enhanced"`); the box is supplied by the caller (a C2PA composite manifest carries no reliable machine-readable region, so we do not fabricate one).
- `noai/tiling.py` — sliding-window tiled diffusion for large inputs (CLI `--tile`). `WatermarkRemover.remove_watermark` branches to `run_tiled` when `tile` is set AND the long side exceeds `tile_size`, refactoring the single-pass `_generate` into a per-tile `_generate_one` (the ControlNet edge map is rebuilt per tile inside it). Pure helpers `plan_tiles` (uniform-size tiles, last one flush to the edge) and `feather_weights` (strictly-positive separable taper -> partition-of-unity blend) are unit-tested without the model. Also home to `feather_region_composite(base, regenerated, box, *, feather)` — the pure region-targeted compositor for **AI-enhanced composites** (`ai_source_kind == "enhanced"`): blends the regenerated AI box back over the original with a feathered seam, leaving the real photo OUTSIDE the box pixel-exact. It backs `WatermarkRemover.remove_watermark(region=...)` (regenerate ONLY the AI region, not the whole frame); the no-model lossless region path stays `region_eraser.erase`. New tile/region-blend tuning goes in these pure helpers; do not inline blend math into the runner.
- `auto_config.py` + the content-detection layer were REMOVED 2026-06-09; `--auto` is a deprecated no-op (controlnet is the default pipeline and the adaptive polish is ON by default and self-gates to a no-op where there is no detail deficit).
- `upscaler.py` — optional Real-ESRGAN pre-diffusion super-resolution for small inputs (extra `esrgan`, spandrel only). Manual opt-in; the default `--upscaler` stays `lanczos` and the engine always falls back to Lanczos on absence/error. ESRGAN can degrade faces and thin text.
- `image_io.py` — Unicode-safe cv2 IO (issue #17). Every cv2 file read/write in the package routes through `imread`/`imwrite`; do not call `cv2.imread`/`cv2.imwrite` directly. `to_bgr(image)` is the shared channel normalizer — use it instead of inlining `cvtColor` branches.
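
The `ai_source_kind` branch described in the `identify.py` and `noai/watermark_remover.py` entries above can be sketched as a small dispatch. This is a minimal illustration only: `ScrubPlan`, `plan_scrub`, and the mode strings are hypothetical names, not part of the package, and the handling of an `"enhanced"` file with no caller-supplied box is an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Box = Tuple[int, int, int, int]  # (x, y, w, h), the shape remove_watermark(region=...) takes


@dataclass(frozen=True)
class ScrubPlan:
    """Hypothetical description of which removal path a caller would take."""
    mode: str
    region: Optional[Box] = None


def plan_scrub(ai_source_kind: Optional[str], region: Optional[Box] = None) -> ScrubPlan:
    """Map ProvenanceReport.ai_source_kind to a removal strategy (illustrative)."""
    if ai_source_kind == "enhanced":
        # A real photo with an AI-composited region: regenerate only the box.
        # The box is caller-supplied; the manifest carries no reliable region.
        if region is None:
            return ScrubPlan("needs_region")  # assumption: ask the caller for a box
        return ScrubPlan("region", region)
    if ai_source_kind == "generated":
        # Fully synthetic: scrub the whole frame.
        return ScrubPlan("full_frame")
    # None: the AI verdict (if any) did not come from the C2PA source type.
    return ScrubPlan("undetermined")


print(plan_scrub("enhanced", (40, 60, 256, 256)))
print(plan_scrub("generated"))
```

A caller would pass `plan.region` on to `WatermarkRemover.remove_watermark(region=...)` only in the `"region"` case.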
+8 -2
@@ -15,7 +15,7 @@ module.
## `noai/constants.py`
`noai/constants.py` — PNG_SIGNATURE, C2PA_CHUNK_TYPE, C2PA_SIGNATURES, and `C2PA_AI_VENDORS` — the single `C2paAiVendor` registry of C2PA-signing vendors (issuer byte, resolved org name, the `identify` platform label, and a `synthid` flag), from which `C2PA_ISSUERS`, `SYNTHID_C2PA_ISSUERS` (issuers that pair SynthID with C2PA: Google, OpenAI), and `identify._ISSUER_PLATFORM` are all **derived** — plus `C2PA_SOFT_BINDINGS` (soft-binding `alg` prefix → forensic-watermark vendor: Adobe TrustMark, Digimarc, Imatag, Steg.AI, Microsoft, ...). Add a new C2PA vendor as one `C2PA_AI_VENDORS` entry (never edit the derived dicts), a new soft-binding to `C2PA_SOFT_BINDINGS`; not inline.
`noai/constants.py` — PNG_SIGNATURE, C2PA_CHUNK_TYPE, C2PA_SIGNATURES, and `C2PA_AI_VENDORS` — the single `C2paAiVendor` registry of C2PA-signing vendors (issuer byte, resolved org name, the `identify` platform label, and a `synthid` flag), from which `C2PA_ISSUERS`, `SYNTHID_C2PA_ISSUERS` (issuers that pair SynthID with C2PA: Google, OpenAI), and `identify._ISSUER_PLATFORM` are all **derived** — plus `C2PA_SOFT_BINDINGS` (soft-binding `alg` prefix → forensic-watermark vendor: Adobe TrustMark, Digimarc, Imatag, Steg.AI, Microsoft, ...). Add a new C2PA vendor as one `C2PA_AI_VENDORS` entry (never edit the derived dicts), a new soft-binding to `C2PA_SOFT_BINDINGS`; not inline. A vendor that signs under multiple legal names needs one entry PER distinctive issuer byte string: e.g. ByteDance's Volcano Engine is registered both as latin `volcengine` AND the Chinese legal entity `北京火山引擎科技有限公司` (UTF-8; the latin needle misses the Chinese-named certs entirely) — both normalize to the same "ByteDance" needle/platform. ElevenLabs ("Eleven Labs Inc.", pure generative-AI) is registered as a generator. Deliberately EXCLUDED (mined-corpus candidates 2026-06-20, documented in the file): TikTok Inc. (a content-provenance / AI-labeling signer on uploads, not a generator) and PixelBin.io / "Fynd" (an image transform / CDN signer) — registering either as a generator would mis-label human uploads as AI; the `is_ai` verdict keys off the digitalSourceType, which is already honored.
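
The "one entry per distinctive issuer byte string" rule can be sketched as follows. This is an illustration under assumptions, not the real registry: `VendorEntry`, `VENDORS`, `ISSUERS`, and `match_vendor` are hypothetical stand-ins for `C2paAiVendor` / `C2PA_AI_VENDORS` and the derived dicts, the field layout is guessed from the description above, the ElevenLabs org/platform labels are assumed, and matching is shown as a plain case-sensitive substring test.

```python
from typing import NamedTuple, Optional


class VendorEntry(NamedTuple):
    """Hypothetical stand-in for a C2paAiVendor row."""
    issuer: bytes    # distinctive byte string looked for in the manifest
    org: str         # resolved organisation name
    platform: str    # label reported by identify
    synthid: bool    # True when the issuer pairs SynthID with C2PA


# One row PER distinctive issuer byte string, even when the vendor is the same.
VENDORS = (
    VendorEntry(b"volcengine", "ByteDance", "ByteDance", False),
    VendorEntry("北京火山引擎科技有限公司".encode("utf-8"), "ByteDance", "ByteDance", False),
    VendorEntry(b"Eleven Labs Inc.", "ElevenLabs", "ElevenLabs", False),
)

# Derived view: built from the registry, never edited by hand.
ISSUERS = {v.issuer: v.org for v in VENDORS}


def match_vendor(blob: bytes) -> Optional[str]:
    """Return the org of the first registry needle found in the blob, else None."""
    for entry in VENDORS:
        if entry.issuer in blob:
            return entry.org
    return None


print(match_vendor("O=北京火山引擎科技有限公司".encode("utf-8")))
print(match_vendor(b"O=TikTok Inc."))
```

The Chinese-named certificate resolves to the same vendor as the latin needle, while an excluded signer such as TikTok matches nothing and is left to the digitalSourceType check.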
## `metadata.py`
@@ -27,7 +27,9 @@ module.
`identify.py` — the OpenAI rollout caveat is keyed on `_vendor_of(synthid) == "OpenAI"` (not a raw substring over the issuer + verdict blob). `identify(path)` aggregates every locally-readable signal (C2PA issuer→platform, C2PA soft-binding forensic-watermark vendor, **C2PA cloud-manifest reference** via `metadata.c2pa_cloud_manifest` — signal `c2pa_cloud`, **medium**, provenance-only (does NOT set `is_ai`, excluded from `ai_from_metadata` + clash vendors): a C2PA 2.4 Durable-Content-Credentials case where the embedded manifest is stripped but an XMP `dcterms:provenance` pointer to the vendor's cloud manifest store (`_C2PA_MANIFEST_REPOSITORIES`, today `cai-manifests.adobe.com` → "Adobe Content Authenticity") survives, so the credentials stay recoverable server-side; only emitted when no embedded manifest already attributed the file — surfaced on 2 corpus PNGs 2026-06-10 that read fully `unknown` before, IPTC "Made with AI" + IPTC 2025.1 `AISystemUsed`, embedded SD/ComfyUI params, SynthID proxy, xAI/Grok EXIF signature via `metadata.xai_signature`, the China TC260 AIGC label via `metadata.aigc_label`, the HuggingFace `hf-job-id` job marker via `metadata.huggingface_job`, the Samsung Galaxy AI editing marker via `metadata.samsung_genai`, the visible marks — Gemini sparkle plus the ByteDance Doubao 豆包AI生成 / Jimeng 即梦AI / Samsung Galaxy AI "Contenuti generati dall'AI" text marks via the `watermark_registry` — open invisible watermark, Adobe TrustMark via `trustmark_detector`) into one `ProvenanceReport`. `is_ai_generated` is True or None (never asserted False — stripped metadata is not proof of clean origin). 
The `hf_job`, visible-mark, and Samsung `samsung_genai` signals are **medium** confidence: each lifts an otherwise-Unknown verdict to a tentative AI (`hf_only` / `visible_only` / `samsung_only`, parallel branches; `visible_only` fires on any `visible_*` signal) but is excluded from the high-confidence `ai_from_metadata` set, so none overrides a hard metadata signal.
**Visible-mark detection** (`check_visible`, signals `visible_sparkle` / `visible_doubao` / `visible_jimeng` / `visible_samsung`): the Gemini sparkle keeps its own file-level path (`_visible_sparkle` → `gemini_engine.detect_sparkle_confidence`, promoted only at confidence ≥ `_SPARKLE_THRESHOLD` 0.5; corpus-tuned to separate Gemini sparkles ≥0.56 from non-sparkle ≤0.49), while Doubao/Jimeng/Samsung reuse the registry detectors (`_visible_text_marks` → `watermark_registry`, iterating `_VISIBLE_MARK_PLATFORM`), each gated by its own engine NCC threshold via `MarkDetection.detected` (Doubao 0.4, Jimeng 0.45, Samsung 0.4). Doubao/Jimeng are normally also caught by the TC260 AIGC metadata label and Samsung by its C2PA + `genAIType` marker, so the visible path is their stripped-metadata fallback. Visible marks set `platform` only when no harder signal already did, and (like the sparkle) are excluded from integrity-clash vendor claims. The cv2 dependency lives in the engines, not here.
**AI-generated vs AI-enhanced** (`ProvenanceReport.ai_source_kind`, roadmap item): the C2PA digital-source-type is split into `"generated"` (trainedAlgorithmicMedia, fully synthetic) vs `"enhanced"` (compositeWithTrainedAlgorithmicMedia, a real photo with an AI-composited region) — the two byte strings are unambiguous (`compositeWithTrainedAlgorithmicMedia` capitalizes the inner "Trained", so a lowercase `trainedAlgorithmicMedia` match is standalone full generation; full generation wins when both appear). `ai_source_kind` is set only when the AI verdict actually came from the C2PA source type (a non-C2PA AI signal — IPTC/AIGC/local gen/xAI — leaves it None). It lets a caller branch a full-frame scrub (`generated`) from a region-targeted clean that preserves the real photo (`enhanced`; see `noai/tiling.feather_region_composite`). The CLI verdict line reads "AI-generated (fully synthetic)" vs "AI-enhanced (real content with an AI-composited region)".
**Visible-mark detection** (`check_visible`, signals `visible_sparkle` / `visible_doubao` / `visible_jimeng` / `visible_samsung`): the Gemini sparkle keeps its own file-level path (`_visible_sparkle` → `gemini_engine.detect_sparkle_confidence`, promoted only at confidence ≥ `_SPARKLE_THRESHOLD`, which is the SHARED `watermark_registry.GEMINI_SPARKLE_TRUST_CONF` (0.5) — imported, not a private copy, so the provenance detect threshold and the removal `best_auto_mark` / `_gemini_detect` arbitration gate can never drift (the detect-vs-remove desync from roadmap P0#7; regression-guarded by `tests/test_identify.py::TestSparkleDetectRemoveAlignment`, which composites the real demo sparkle at borderline opacities and asserts identify and `best_auto_mark` AGREE on either side of the line). Lowering the gate to recover faint sub-0.5 sparkles was evaluated 2026-06-20 and REJECTED: a real Doubao text mark scores ~0.40-0.42 as a gemini match with a HIGHER core-ring brightness margin than a genuine faint sparkle, so neither confidence nor the brightness gate separates them in the [0.35, 0.5) band — lowering trades a rare miss for false-positive removals on clean images. Corpus-tuned to separate Gemini sparkles ≥0.56 from non-sparkle ≤0.49), while Doubao/Jimeng/Samsung reuse the registry detectors (`_visible_text_marks` → `watermark_registry`, iterating `_VISIBLE_MARK_PLATFORM`), each gated by its own engine NCC threshold via `MarkDetection.detected` (Doubao 0.4, Jimeng 0.45, Samsung 0.4). Doubao/Jimeng are normally also caught by the TC260 AIGC metadata label and Samsung by its C2PA + `genAIType` marker, so the visible path is their stripped-metadata fallback. Visible marks set `platform` only when no harder signal already did, and (like the sparkle) are excluded from integrity-clash vendor claims. The cv2 dependency lives in the engines, not here.
**`import identify` is deliberately light** (~26 MB; ~36 MB with cv2 loaded by a visible-mark run, ~106 MB for a full `check_visible` run): it imports the `noai.c2pa`/`noai.constants` submodules, and `noai/__init__` is lazy (see "Test and lint"), so torch/diffusers are NOT pulled at import even in a full `gpu`/`detect` install — fits a 512 MB host. `noai.c2pa` does eagerly import the **c2pa-python** binary (Rust + cryptography, ~+5 MB RSS, no torch) for the primary `Reader` path — light enough to stay on the dependency-light host; a broken/absent wheel degrades to the byte-scan parser (`reader_available()` False). The heavy paths are opt-in: `check_invisible=True` needs the `detect`/`trustmark` extras (each pulls **torch**; TrustMark also **downloads weights**), so on a core-only deploy leave `check_invisible` off (it is a no-op there anyway). Before the lazy `__init__`, the mere presence of torch in the env inflated `import identify` to ~420 MB.
@@ -105,6 +107,8 @@ The 11 survivors are near-white ill-conditioning (reverse-alpha divides by `1-a`
**`_apply_reverse_alpha` runs on the glyph crop only:** the blend is a no-op outside the glyph `region` (x, y, w, h) (`(wm - 0)/(1 - 0) == wm`, and a uint8→float32→uint8 round-trip is exact). It copies the frame through and computes the reverse-alpha math on the `region` crop only — byte-identical to the old full-frame pass (verified: Doubao 130 + Jimeng 22 placements, 0 mismatches) but O(glyph) not O(image). The full-frame pass cost ~275 ms on a 12 MP frame for a glyph that is <0.1% of it, once per candidate placement (fixed + aligned ≈ 2×/removal); the crop drops that to ~2 ms. Mirror of the Gemini `_core_and_bg` crop.
**Over-subtraction guard (`_reverse_alpha_oversubtracts` → `_inpaint_footprint`, ported from `gemini_engine` 2026-06-20, roadmap P0#8):** on a dark or mid-tone background the captured alpha can over-estimate THIS image's mark opacity, and reverse-alpha leaves a darker-than-background glyph ghost (a "dark pit") instead of recovering the true pixels — the sparkle-only fix (commit 41f6797) left the text marks unhandled. After `remove_watermark_reverse_alpha` selects the winning placement, the guard PREDICTS the reverse-alpha output PER PIXEL over the glyph body from the INPUT (`(obs - a*logo)/(1-a)`, exactly the remover's math) and, when the predicted body lands more than `_OVERSUB_DARK_MARGIN` (25) gray levels below the local background ring, abandons the reverse-alpha output for the footprint and inpaints it from the ORIGINAL surroundings (`_inpaint_footprint`, a wider dilate/radius than the thin residual pass). Predicting per-pixel from the INPUT (not the produced output, which depends on which placement the remover picked) is what keeps a cleanly captured full-strength mark byte-identical — it predicts back to the background everywhere, so the guard never trips on it (verified across Doubao/Jimeng/Samsung on white/mid/dark/midgray backgrounds). A faint mark predicts a body far below the ring and diverts to the inpaint. Regression-guarded by `tests/test_text_mark_oversubtraction.py` (predicate True on faint / False on clean, end-to-end no-dark-pit acceptance, clean-mark byte-identity, textured-background recovery). A flat synthetic background cannot exhibit the residual-inpaint failure (inpaint-from-flat is perfect regardless), so the value shows on textured/real content where the footprint inpaint samples un-darkened original pixels instead of the darkened reverse-alpha halo.
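
A worked numeric sketch of the prediction the guard makes, on a flat synthetic patch. The formula and the 25-level margin and 0.15 body floor come from the text above; the helper names (`predicted_body`, `oversubtracts`), the scalar `logo` value, and the test opacities are illustrative assumptions, and the real method also samples a background ring rather than taking it as an argument.

```python
import numpy as np

DARK_MARGIN = 25.0        # mirrors the documented _OVERSUB_DARK_MARGIN
BODY_ALPHA_FLOOR = 0.15   # mirrors the documented _OVERSUB_BODY_ALPHA_FLOOR


def predicted_body(obs: np.ndarray, alpha: np.ndarray, logo: float) -> np.ndarray:
    """Per-pixel reverse-alpha prediction: (obs - a*logo) / (1 - a)."""
    a = np.clip(alpha, 0.0, 0.99)
    return (obs - a * logo) / (1.0 - a)


def oversubtracts(obs: np.ndarray, alpha: np.ndarray, logo: float, ring_bg: float) -> bool:
    """True when the predicted glyph body lands far below the background ring."""
    body = alpha >= BODY_ALPHA_FLOOR
    core = float(np.median(predicted_body(obs, alpha, logo)[body]))
    return core < ring_bg - DARK_MARGIN


bg, logo = 100.0, 255.0
captured = np.full((8, 8), 0.5, dtype=np.float32)   # alpha the engine captured

# Mark blended at the captured opacity: prediction returns to the background.
clean_obs = np.full((8, 8), (1 - 0.5) * bg + 0.5 * logo, dtype=np.float32)
# Mark blended fainter (0.3) than the capture: prediction falls well below it.
faint_obs = np.full((8, 8), (1 - 0.3) * bg + 0.3 * logo, dtype=np.float32)

print(oversubtracts(clean_obs, captured, logo, bg))
print(oversubtracts(faint_obs, captured, logo, bg))
```

With these numbers the clean mark predicts (177.5 - 127.5) / 0.5 = 100, equal to the background, while the faint mark predicts (146.5 - 127.5) / 0.5 = 38, which is 62 levels below it and trips the 25-level margin.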
**`_fixed/_aligned_alpha_map` and `extract_mask` return footprint-sized arrays, not full frames (memory):** the alpha-map helpers return the glyph-sized alpha **block** (`(gh, gw)` float32) plus its placement `(ax, ay, gw, gh)`, and `extract_mask` returns the box-sized glyph mask (`(loc.h, loc.w)` uint8) — both used to allocate a full `(h, w)` array that is read only inside the small glyph/box. A full-frame float32 alpha map is ~48 MB on a 12 MP frame and two were held at once during removal (fixed + aligned ≈ 96 MB of mostly-zeros); the box mask was a ~12 MB uint8 allocation rebuilt per text-mark `detect` on the memory-tight `identify` path. `_apply_reverse_alpha` consumes the block directly; the residual inpaint embeds it into one full-frame uint8 mask only at `cv2.inpaint` time (which needs a full-frame mask). Byte-identical to the old full-frame path — the block equals the old map's `[ay:ay+gh, ax:ax+gw]` slice and the box equals the old mask cropped to `loc.bbox` (regression-guarded by `tests/test_text_mark_memory.py`, which reconstructs the old full-frame path inline and asserts equality, so the proof survives a cv2/asset bump). `remove_watermark_reverse_alpha` tracks the winning `region` alongside `best_amap` to place that mask.
## `doubao_engine.py`
@@ -199,6 +203,8 @@ At the shared low removal strength the canny edge-conditioning keeps the regener
Pure sliding-window tiling for the diffusion path (no torch import; numpy/PIL only). `plan_tiles(w, h, tile_size, overlap)` returns a row-major grid of uniform-size `Tile` boxes — every tile is exactly `tile_size` (the SDXL training size), with the last tile on each axis pulled back flush to the far edge (`_axis_positions` clamps a pathological `overlap >= tile` to `tile - 1` so the step stays >= 1). `feather_weights(w, h, overlap)` is a separable linear taper (1 in the interior, ramping toward each edge) floored at `_WEIGHT_EPS` so it is **strictly positive everywhere** — that makes the normalised `accum / weight_sum` blend a partition of unity, so identical/unchanged tiles reconstruct the input exactly (the seam-free guarantee). `run_tiled(generate_tile, image, tile_size, overlap, set_progress)` is the orchestration loop: crop each planned tile, call `generate_tile` (one diffusion pass on a single PIL tile — injected, so this stays decoupled from the pipeline), resize a latent-grid-rounded result back to the exact tile size, and feather-accumulate. All three are unit-tested without the model (`tests/test_tiling.py`: axis math, grid coverage, taper shape/symmetry/positivity, identity reconstruction, per-tile call count, and the resize-back path). New blend tuning belongs in these pure helpers, not inlined into the runner.
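
The partition-of-unity claim can be checked with a small numpy sketch. It is an illustration under assumptions: the function bodies, the value of `WEIGHT_EPS`, and the exact ramp shape are invented here and need not match `noai/tiling.py`; the input is assumed to be at least one tile in each dimension, and the "generation" step is the identity.

```python
import numpy as np

WEIGHT_EPS = 1e-3  # assumed floor; the real _WEIGHT_EPS value is not shown here


def axis_positions(length: int, tile: int, overlap: int) -> list:
    """Tile origins along one axis; the last tile is pulled flush to the far edge."""
    if length <= tile:
        return [0]
    step = tile - min(overlap, tile - 1)  # clamp so the step stays >= 1
    positions = list(range(0, length - tile, step))
    positions.append(length - tile)
    return positions


def taper_1d(n: int, overlap: int) -> np.ndarray:
    """Linear ramp toward each edge, floored so every weight is strictly positive."""
    w = np.ones(n, dtype=np.float64)
    ramp = min(overlap, n // 2)
    for i in range(ramp):
        v = max(i / ramp, WEIGHT_EPS)
        w[i] = v
        w[n - 1 - i] = v
    return w


def feather_weights(w: int, h: int, overlap: int) -> np.ndarray:
    """Separable 2-D taper of shape (h, w)."""
    return np.outer(taper_1d(h, overlap), taper_1d(w, overlap))


def blend_unchanged(image: np.ndarray, tile: int, overlap: int) -> np.ndarray:
    """Feather-accumulate tiles that were passed through untouched."""
    h, w = image.shape
    accum = np.zeros((h, w))
    weight_sum = np.zeros((h, w))
    weights = feather_weights(tile, tile, overlap)
    for y in axis_positions(h, tile, overlap):
        for x in axis_positions(w, tile, overlap):
            crop = image[y:y + tile, x:x + tile]  # identity stands in for diffusion
            accum[y:y + tile, x:x + tile] += crop * weights
            weight_sum[y:y + tile, x:x + tile] += weights
    return accum / weight_sum  # safe: every weight is > 0


image = np.random.default_rng(0).random((10, 13))
print(np.allclose(blend_unchanged(image, tile=4, overlap=2), image))
```

Because the weights never reach zero, `weight_sum` is positive at every pixel, so the normalised blend of identical tiles is a weighted average of equal values and returns the input.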
`feather_region_composite(base, regenerated, box, *, feather)` is the pure region-targeted compositor for **AI-enhanced composites** (roadmap P1#8; `identify` `ai_source_kind == "enhanced"`, digitalSourceType `compositeWithTrainedAlgorithmicMedia`). It blends `regenerated` over `base` inside `box = (x, y, w, h)` with a separable linear taper of `feather` px at the box edges (the taper anchors to ~0 at the boundary, so unlike `feather_weights` it is NOT floored — the result equals `base` EXACTLY outside the box), preserving dtype and supporting HxW or HxWxC. It backs `WatermarkRemover.remove_watermark(region=..., region_feather=...)`: the remover regenerates the frame (or tiles), then composites only the AI box back over the original input, so the real photo outside the box stays pixel-exact and only the AI region is scrubbed. The box is caller-supplied (a C2PA composite manifest carries no reliable machine-readable region); the no-model lossless region path remains `region_eraser.erase`. Unit-tested in `tests/test_tiling.py::TestFeatherRegionComposite` (outside-box exactness, interior == regenerated, hard-paste at feather 0, monotonic seam ramp, dtype/grayscale/clamp/empty-box/shape-mismatch).
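
A minimal sketch of a region compositor with the properties listed above (exact outside the box, un-floored taper that reaches 0 at the box boundary, hard paste at feather 0, dtype preserved). `edge_ramp` and `composite_region_sketch` are hypothetical names and this is not the code in `noai/tiling.py`; the clamping and rounding choices are assumptions.

```python
import numpy as np


def edge_ramp(n: int, feather: int) -> np.ndarray:
    """1-D blend weight: 0 at the box edge, rising linearly to 1 over `feather` px."""
    if feather <= 0:
        return np.ones(n, dtype=np.float32)  # hard paste
    idx = np.arange(n, dtype=np.float32)
    dist = np.minimum(idx, n - 1 - idx)      # distance to the nearest box edge
    return np.clip(dist / feather, 0.0, 1.0)


def composite_region_sketch(base, regenerated, box, feather=0):
    """Blend `regenerated` over `base` inside box=(x, y, w, h) only."""
    if base.shape != regenerated.shape:
        raise ValueError("base and regenerated must share a shape")
    h, w = base.shape[:2]
    x, y, bw, bh = box
    x0, y0 = max(0, x), max(0, y)            # clamp the box to the frame
    x1, y1 = min(w, x + bw), min(h, y + bh)
    out = base.copy()                         # pixels outside the box are never written
    if x1 <= x0 or y1 <= y0:
        return out                            # empty box: nothing to composite
    alpha = np.outer(edge_ramp(y1 - y0, feather), edge_ramp(x1 - x0, feather))
    if base.ndim == 3:
        alpha = alpha[..., None]              # broadcast over channels
    b = base[y0:y1, x0:x1].astype(np.float32)
    r = regenerated[y0:y1, x0:x1].astype(np.float32)
    mixed = b * (1.0 - alpha) + r * alpha
    if np.issubdtype(base.dtype, np.integer):
        mixed = np.rint(mixed)                # round before casting back to ints
    out[y0:y1, x0:x1] = mixed.astype(base.dtype)
    return out


base = np.zeros((20, 20, 3), dtype=np.uint8)
regen = np.full_like(base, 200)
out = composite_region_sketch(base, regen, (5, 5, 10, 10), feather=3)
print(out[10, 10], out[5, 5], out[0, 0])
```

Copying `base` and writing only the box slice is one simple way to guarantee exactness outside the box regardless of floating-point rounding inside it.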
## `auto_config.py` (REMOVED 2026-06-09)
**`auto_config.py` + the content-detection layer were REMOVED 2026-06-09.**
@@ -37,6 +37,28 @@ if TYPE_CHECKING:
logger = logging.getLogger(__name__)
# Reverse-alpha over-subtraction guard (ported from gemini_engine, 2026-06-20).
# The reverse-alpha blend ``(wm - a*logo)/(1-a)`` over-subtracts when the captured
# alpha over-estimates THIS image's mark opacity: on a dark or mid-tone background
# it drives the glyph footprint into a visibly DARKER-than-background ghost (a
# "dark pit") instead of recovering the true pixels. The retained-corpus mining
# (2026-06-20) showed the sparkle-only fix (commit 41f6797) left this unhandled
# for the Doubao/Jimeng text marks. Mirror the sparkle gate: when the recovered
# glyph body lands more than this many gray levels below the local background
# ring, abandon the reverse-alpha output for the footprint and inpaint it from
# the surroundings instead. Calibrated to the same 25-level margin the sparkle
# gate uses -- clean text-mark removals recover within ~10 of the ring, the dark
# pit lands tens of levels below.
_OVERSUB_DARK_MARGIN = 25.0
# Glyph-body / background-ring sampling for the guard. The ring is a pad around
# the glyph box (excluding the box); the body is the bright-core glyph pixels.
_OVERSUB_RING_PAD_FRAC = 0.6  # ring pad as a fraction of the glyph-box height
_OVERSUB_BODY_ALPHA_FLOOR = 0.15  # alpha above which a block pixel counts as glyph body
# Footprint inpaint when the guard trips: dilate the glyph mask wider than the
# thin residual pass so the whole darkened ghost is reconstructed, not just its edge.
_OVERSUB_INPAINT_DILATE = 9
_OVERSUB_INPAINT_RADIUS = 4
@dataclass(frozen=True)
class TextMarkConfig:
@@ -335,6 +357,74 @@ class TextMarkEngine:
out[y1:y2, x1:x2] = np.clip((roi - a3 * logo) / np.clip(1.0 - a3, 0.25, 1.0), 0, 255).astype(np.uint8)
return out
def _reverse_alpha_oversubtracts(
self, image: NDArray[Any], amap: NDArray[Any], region: tuple[int, int, int, int]
) -> bool:
"""True when reverse-alpha would darken the glyph footprint into a dark pit.
Ported from ``gemini_engine._reverse_alpha_oversubtracts`` (2026-06-20):
PREDICT the reverse-alpha output at the bright glyph core directly from the
INPUT and the captured alpha, ``(core_obs - a*logo)/(1-a)``, and trip when it
lands more than ``_OVERSUB_DARK_MARGIN`` gray levels below the local
background ring. Predicting from the input (not the produced output) keeps the
gate independent of which placement the reverse-alpha picked, so a clean
full-strength mark (whose strokes predict back to the background) never trips,
while a mark fainter than the capture (over-subtracted into a ghost) does.
"""
ax, ay, gw, gh = region
ih, iw = image.shape[:2]
if gw < 4 or gh < 4:
return False
if float(amap.max()) < 0.2: # too faint a capture to over-subtract meaningfully
return False
body_box = amap >= _OVERSUB_BODY_ALPHA_FLOOR # glyph strokes
if not bool(body_box.any()):
return False
pad = max(4, int(gh * _OVERSUB_RING_PAD_FRAC))
ry1, ry2 = max(0, ay - pad), min(ih, ay + gh + pad)
rx1, rx2 = max(0, ax - pad), min(iw, ax + gw + pad)
ring = image[ry1:ry2, rx1:rx2].astype(np.float32).mean(axis=2)
fy1, fy2, fx1, fx2 = ay - ry1, ay - ry1 + gh, ax - rx1, ax - rx1 + gw
ring_mask = np.ones(ring.shape, dtype=bool)
ring_mask[fy1:fy2, fx1:fx2] = False
if int(ring_mask.sum()) < 10:
return False
# Predict the reverse-alpha output PER PIXEL over the glyph body, using the
# same (obs - a*logo)/(1-a) form the remover applies (the remover also floors
# the denominator at 0.25; here alpha is only capped at 0.99). A cleanly
# captured mark predicts back to the true background everywhere (no trip),
# while a mark fainter than the capture predicts a body far below the local
# ring. The per-pixel alpha (not a single peak value) keeps the prediction
# faithful across the glyph's anti-aliased alpha gradient.
obs = ring[fy1:fy2, fx1:fx2]
a = np.clip(amap, 0.0, 0.99)
logo = float(np.mean(self.config.alpha_logo_bgr))
predicted = (obs - a * logo) / (1.0 - a)
predicted_core = float(np.median(predicted[body_box]))
bg = float(np.median(ring[ring_mask]))
oversub = predicted_core < bg - _OVERSUB_DARK_MARGIN
if oversub:
logger.debug(
"%s reverse-alpha over-subtracts: predicted core=%.1f bg=%.1f (margin %.0f) -> footprint inpaint",
self.config.name,
predicted_core,
bg,
_OVERSUB_DARK_MARGIN,
)
return oversub
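The prediction above can be checked by hand on a single pixel. The sketch below is illustrative only: it assumes a white logo (255) and a captured alpha of 0.5, and the helper names `predicted_core` and `oversubtracts` are hypothetical, not part of the engine. It composites the mark at `captured_alpha * gain` and then inverts it with the full captured alpha, which is the mismatch the guard is meant to catch.

```python
DARK_MARGIN = 25.0  # same value as _OVERSUB_DARK_MARGIN


def predicted_core(bg: float, captured_alpha: float, gain: float, logo: float = 255.0) -> float:
    """Predict the reverse-alpha output for one glyph pixel (hypothetical helper).

    The mark sits in the image at ``captured_alpha * gain`` (its true opacity),
    but is inverted with the full ``captured_alpha``.
    """
    true_alpha = captured_alpha * gain
    observed = bg * (1.0 - true_alpha) + logo * true_alpha  # forward alpha blend
    a = min(max(captured_alpha, 0.0), 0.99)  # same cap as the guard
    return (observed - a * logo) / (1.0 - a)  # reverse-alpha inversion


def oversubtracts(bg: float, captured_alpha: float, gain: float) -> bool:
    """Trip when the prediction lands more than DARK_MARGIN below the background."""
    return predicted_core(bg, captured_alpha, gain) < bg - DARK_MARGIN


clean = predicted_core(120.0, 0.5, 1.0)   # full-strength mark
faint = predicted_core(120.0, 0.5, 0.45)  # mark fainter than the capture
```

With these assumed numbers the full-strength mark predicts back to the 120 background, while the faint mark predicts roughly 46, far below the `120 - 25 = 95` trip line, so only the faint case diverts to the footprint inpaint.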
def _inpaint_footprint(
self, image: NDArray[Any], amap: NDArray[Any], region: tuple[int, int, int, int]
) -> NDArray[Any]:
"""Reconstruct the glyph footprint from its surroundings (used when
reverse-alpha would over-subtract into a dark pit). Inpaints the ORIGINAL
image over a dilated glyph mask, so the result never contains the darkened
reverse-alpha pixels."""
ax, ay, gw, gh = region
mask = np.zeros(image.shape[:2], np.uint8)
mask[ay : ay + gh, ax : ax + gw] = (amap > self.config.residual_alpha_floor).astype(np.uint8) * 255
mask = cv2.dilate(mask, np.ones((_OVERSUB_INPAINT_DILATE, _OVERSUB_INPAINT_DILATE), np.uint8))
return cv2.inpaint(image, mask, _OVERSUB_INPAINT_RADIUS, cv2.INPAINT_NS)
def remove_watermark_reverse_alpha(self, image: NDArray[Any], *, residual_inpaint: bool = True) -> NDArray[Any]:
"""Recover the original pixels by inverting the alpha blend, then clear the
residual outline with a thin inpaint over the glyph footprint.
@@ -370,6 +460,13 @@ class TextMarkEngine:
best_residual, best_out, best_amap, best_region = residual, out, amap, region
if best_out is None or best_amap is None or best_region is None: # pragma: no cover - maps is non-empty
return image.copy()
# Over-subtraction guard: on a dark/mid-tone background the captured alpha can
# over-estimate the mark's opacity and reverse-alpha leaves a darker-than-
# background ghost. When the recovered glyph body sits far below the local
# ring, reconstruct the footprint from its surroundings instead of shipping the
# dark pit (the thin residual inpaint cannot fix a footprint-wide darkening).
if self._reverse_alpha_oversubtracts(image, best_amap, best_region):
return self._inpaint_footprint(image, best_amap, best_region)
if residual_inpaint:
# Embed the glyph-sized alpha block into a full-frame uint8 mask only for
# the inpaint (cv2.inpaint needs a mask matching best_out). One uint8
+7
View File
@@ -881,6 +881,13 @@ def cmd_identify(ctx: click.Context, source: Path, no_visible: bool, as_json: bo
_banner()
verdict = {True: "AI-generated", False: "not AI", None: "unknown"}[report.is_ai_generated]
# Sharpen the True verdict when the C2PA source type says the image is a real
# photo with an AI-composited region rather than a full AI generation, so the
# caller (and the user) can tell "scrub the whole frame" from "scrub the AI region".
if report.is_ai_generated and report.ai_source_kind == "enhanced":
verdict = "AI-enhanced (real content with an AI-composited region)"
elif report.is_ai_generated and report.ai_source_kind == "generated":
verdict = "AI-generated (fully synthetic)"
console.print(f"\n Verdict: {verdict} (confidence: {report.confidence})")
console.print(f" Platform: {report.platform or 'undetermined'}")
+32 -8
View File
@@ -42,6 +42,7 @@ from remove_ai_watermarks.metadata import (
)
from remove_ai_watermarks.noai.c2pa import cbor_text_after, extract_c2pa_info, soft_binding_vendors_in
from remove_ai_watermarks.noai.constants import C2PA_AI_TOOLS, C2PA_AI_VENDORS, C2PA_ISSUERS
from remove_ai_watermarks.watermark_registry import GEMINI_SPARKLE_TRUST_CONF
if TYPE_CHECKING:
from pathlib import Path
@@ -57,11 +58,14 @@ log = logging.getLogger(__name__)
_SCAN_BYTES = 1024 * 1024
# Visible-sparkle confidence above which the signal is trusted as provenance.
# Stricter than the removal default (0.25): on the corpus, Gemini-family
# sparkles score >= 0.56 while non-sparkle images top out at 0.49, so 0.5
# cleanly separates them and avoids false positives when sparkle is the only
# signal (e.g. an OpenAI image scored 0.37 -- below threshold, correctly dropped).
_SPARKLE_THRESHOLD = 0.5
# Shared with the removal arbitration (watermark_registry.GEMINI_SPARKLE_TRUST_CONF)
# so the provenance "is there a sparkle" verdict and the removal "take the sparkle"
# decision can never drift apart -- the detect-vs-remove desync the retained-corpus
# mining surfaced (2026-06-20). On the corpus Gemini-family sparkles score >= 0.56
# while non-sparkle images top out at 0.49, so 0.5 cleanly separates them and avoids
# false positives when the sparkle is the only signal (e.g. an OpenAI image scored
# 0.37 -- below threshold, correctly dropped).
_SPARKLE_THRESHOLD = GEMINI_SPARKLE_TRUST_CONF
# Issuer (C2PA signer) -> human-readable generating platform, derived from the
# single C2PA_AI_VENDORS registry. Ordered: when a manifest names several issuers
@@ -132,6 +136,14 @@ class ProvenanceReport:
is_ai_generated: bool | None # True = AI signal found; False is never asserted; None = unknown
platform: str | None
confidence: str # "high" | "medium" | "none"
# Coarse AI-origin kind from the C2PA digital-source-type, so a caller can
# branch on full generation vs an AI-touched real photo:
# "generated" -- digitalSourceType trainedAlgorithmicMedia (fully AI).
# "enhanced" -- compositeWithTrainedAlgorithmicMedia (real content with an
# AI-composited region; scrub the AI region, keep the photo).
# None -- no C2PA AI source-type (verdict, if AI, came from another
# signal: IPTC, AIGC, local gen params, xAI, ...).
ai_source_kind: str | None = None
watermarks: list[str] = field(default_factory=list[str])
signals: list[Signal] = field(default_factory=list["Signal"])
caveats: list[str] = field(default_factory=list[str])
@@ -484,9 +496,18 @@ def identify(image_path: Path, *, check_visible: bool = True, check_invisible: b
# ── C2PA Content Credentials ────────────────────────────────────
has_c2pa = bool(info) or c2pa_marker_in(head)
issuers = [info["issuer"]] if info.get("issuer") else _issuers_in(head)
c2pa_is_ai = "trainedAlgorithmicMedia" in info.get("source_type", "") or any(
m in head for m in (b"trainedAlgorithmicMedia", b"compositeWithTrainedAlgorithmicMedia")
)
# Full AI generation (trainedAlgorithmicMedia) vs an AI-enhanced real photo
# (compositeWithTrainedAlgorithmicMedia). The structured kind is parsed once in
# noai.c2pa._populate_registry_fields (covers PNG + any container the c2pa-python
# reader handles); fall back to a raw head scan for the non-PNG raw-blob path
# where extract_c2pa_info returns {}. Full generation wins when both appear.
c2pa_source_kind = info.get("ai_source_kind")
if c2pa_source_kind is None:
if b"trainedAlgorithmicMedia" in head:
c2pa_source_kind = "generated"
elif b"compositeWithTrainedAlgorithmicMedia" in head:
c2pa_source_kind = "enhanced"
c2pa_is_ai = c2pa_source_kind is not None
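The raw-scan fallback relies on the two needles being distinguishable as plain byte substrings. A minimal standalone sketch of that classification follows; the function name `source_kind_from_bytes` is hypothetical and the real code reads the structured `info` dict first.

```python
from typing import Optional


def source_kind_from_bytes(head: bytes) -> Optional[str]:
    """Classify a raw manifest head as generated / enhanced / None (sketch).

    The composite needle spells "Trained" with a capital T, so the lowercase
    needle below never matches inside it; a lowercase hit is therefore a
    standalone full-generation marker and is checked first.
    """
    if b"trainedAlgorithmicMedia" in head:
        return "generated"
    if b"compositeWithTrainedAlgorithmicMedia" in head:
        return "enhanced"
    return None


composite_only = source_kind_from_bytes(b"Adobe ... compositeWithTrainedAlgorithmicMedia")
edit_chain = source_kind_from_bytes(
    b"trainedAlgorithmicMedia ... compositeWithTrainedAlgorithmicMedia"
)
```

Here `composite_only` comes out as `"enhanced"`, and `edit_chain`, which carries both markers, comes out as `"generated"`, matching the rule that full generation wins.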
# Generator string (for the signal detail): structured for PNG, CBOR-scanned
# for other containers. Best-effort -- some manifests key it as
# `claim_generator_info` (Pixel), so this can be None even when a device is
@@ -734,6 +755,9 @@ def identify(image_path: Path, *, check_visible: bool = True, check_invisible: b
is_ai_generated=is_ai,
platform=platform,
confidence=confidence,
# Only meaningful when the AI verdict actually came from the C2PA source
# type; a non-C2PA AI signal (IPTC/AIGC/local gen) leaves it None.
ai_source_kind=c2pa_source_kind if (is_ai and has_c2pa) else None,
watermarks=watermarks,
signals=signals,
caveats=caveats,
+8
View File
@@ -363,14 +363,22 @@ def _populate_registry_fields(buf: bytes, c2pa_info: dict[str, Any]) -> bool:
# Digital source type (matched anywhere in the store, including ingredient
# manifests -- a ChatGPT edit of a Sora generation carries the AI marker on
# the parent, not the active manifest).
# ``ai_source_kind`` is the structured generated-vs-enhanced split the caller
# branches on (full-frame scrub vs region-targeted clean); ``source_type`` is the
# human-readable form. The two byte strings are unambiguous:
# "compositeWithTrainedAlgorithmicMedia" capitalizes the inner "Trained", so a
# lowercase "trainedAlgorithmicMedia" match is standalone full generation, which
# wins when both appear (an edit chain).
ai_source = False
if b"trainedAlgorithmicMedia" in buf:
c2pa_info["source_type"] = "trainedAlgorithmicMedia (AI-generated)"
c2pa_info["ai_source_kind"] = "generated"
ai_source = True
elif b"algorithmicMedia" in buf:
c2pa_info["source_type"] = "algorithmicMedia"
elif b"compositeWithTrainedAlgorithmicMedia" in buf:
c2pa_info["source_type"] = "compositeWithTrainedAlgorithmicMedia (AI-enhanced)"
c2pa_info["ai_source_kind"] = "enhanced"
ai_source = True
# SynthID pixel-watermark proxy: a C2PA manifest from a SynthID-using
@@ -122,6 +122,20 @@ C2PA_AI_VENDORS: tuple[C2paAiVendor, ...] = (
C2paAiVendor(
b"volcengine", "ByteDance (Volcano Engine)", "ByteDance (Doubao / Jimeng / Volcano Engine)", "ByteDance"
),
# Some Volcano Engine certs name the signer with the Chinese legal entity
# "北京火山引擎科技有限公司" (Beijing Volcano Engine Technology Co., Ltd.) rather
# than the latin "volcengine" -- the latin needle misses it entirely, so real
# ByteDance output was un-attributed in production traffic. The issuer is the
# UTF-8 of the Chinese name (it appears UTF-8-encoded in the manifest-store
# JSON and the raw caBX bytes alike); it normalizes to the same "ByteDance"
# needle and platform as the volcengine row, so the two collapse together for
# clash detection. Verified against the mined retained corpus, 2026-06-20.
C2paAiVendor(
"北京火山引擎科技有限公司".encode(),
"ByteDance (Volcano Engine)",
"ByteDance (Doubao / Jimeng / Volcano Engine)",
"ByteDance",
),
# ByteDance's international brand (BytePlus / Seedream / Seededit) signs its
# cert as "Byteplus Pte. Ltd." -- the bare ``volcengine`` needle misses it, so
# real BytePlus AI output was mis-attributed (an incidental "Adobe XMP" string
@@ -136,11 +150,29 @@ C2PA_AI_VENDORS: tuple[C2paAiVendor, ...] = (
# source read AI but no platform was attributed. Verified on real signed files
# in production traffic, 2026-06-19. Canva does not use SynthID.
C2paAiVendor(b"Canva", "Canva", "Canva (Magic Media)", "Canva"),
# ElevenLabs is a pure generative-AI company (AI voice / audio, and image /
# video via its API); it signs output as "Eleven Labs Inc.", so the C2PA
# manifest alone marks AI generation. Verified against the mined retained
# corpus, 2026-06-20. ElevenLabs does not use SynthID.
C2paAiVendor(b"Eleven Labs", "ElevenLabs", "ElevenLabs", "ElevenLabs"),
# Truepic is a C2PA signing authority, not an AI generator: no platform label,
# never asserts is_ai (the verdict comes from the digital-source-type).
C2paAiVendor(b"Truepic", "Truepic", None, None),
)
# Deliberately NOT registered as AI-generation vendors (mined-corpus candidates
# evaluated 2026-06-20):
# - TikTok Inc.: signs C2PA as a content-provenance / AI-labeling authority on
# uploads, not as an image generator. The is_ai verdict keys off the
# digitalSourceType (trainedAlgorithmicMedia), which is already honored; a
# bare TikTok signer marks distribution provenance, not generation, so adding
# it as a generator needle would mis-label human uploads as AI.
# - PixelBin.io (issuer "Fynd"): an image transformation / optimization / CDN
# service. Its C2PA stamps a transform/upload step, not a generation event.
# Both are excluded to avoid false-positive AI attribution; re-evaluate only
# against a real signed file whose manifest carries a trainedAlgorithmicMedia
# digital-source type produced by the vendor itself.
# Derived view -- add a vendor to C2PA_AI_VENDORS above, not here.
# C2PA issuer signature -> resolved org name, for the manifest byte-scan.
C2PA_ISSUERS: dict[bytes, str] = {v.issuer: v.org for v in C2PA_AI_VENDORS}
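As a rough illustration of why one registry row per signer spelling is enough, the sketch below rebuilds a tiny registry and its derived issuer map. The `Vendor` tuple and the `orgs_in` helper are hypothetical stand-ins: only the `issuer` and `org` field names appear in the source, and the real `C2paAiVendor` carries more fields.

```python
from typing import List, NamedTuple, Optional


class Vendor(NamedTuple):
    issuer: bytes            # byte needle searched for in the manifest
    org: str                 # resolved organisation name
    platform: Optional[str]  # hypothetical field name for the platform label


VENDORS = (
    Vendor(b"volcengine", "ByteDance (Volcano Engine)", "ByteDance (Doubao / Jimeng / Volcano Engine)"),
    # Same org under the Chinese legal name, stored as its UTF-8 bytes.
    Vendor("北京火山引擎科技有限公司".encode(), "ByteDance (Volcano Engine)",
           "ByteDance (Doubao / Jimeng / Volcano Engine)"),
    Vendor(b"Eleven Labs", "ElevenLabs", "ElevenLabs"),
    Vendor(b"Truepic", "Truepic", None),
)

# Derived view: never edited by hand, always rebuilt from the registry.
ISSUERS = {v.issuer: v.org for v in VENDORS}


def orgs_in(buf: bytes) -> List[str]:
    """Return each distinct org whose issuer needle occurs in ``buf``."""
    found: List[str] = []
    for needle, org in ISSUERS.items():
        if needle in buf and org not in found:
            found.append(org)
    return found
```

Because both ByteDance rows resolve to the same org string, a manifest that names the signer in either spelling (or both) yields a single attribution.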
+53
View File
@@ -100,6 +100,59 @@ def feather_weights(width: int, height: int, overlap: int) -> NDArray[Any]:
return weights
def feather_region_composite(
base: NDArray[Any],
regenerated: NDArray[Any],
box: tuple[int, int, int, int],
*,
feather: int = 64,
) -> NDArray[Any]:
"""Composite ``regenerated`` over ``base`` inside ``box`` only, feathering the seam.
For AI-ENHANCED composites (digitalSourceType ``compositeWithTrainedAlgorithmicMedia``):
the diffusion remover regenerates the whole frame, but only the AI-composited
REGION should change -- the rest is a real photo that must be preserved. This
blends the regenerated pixels in over ``box = (x, y, w, h)`` with a separable
linear taper of ``feather`` px at the box edges, so the result equals ``base``
EXACTLY outside the box and ramps smoothly (no hard seam) at the boundary.
Pure and model-free (unit-tested): ``base`` and ``regenerated`` must be the same
shape (H x W, or H x W x C). The output preserves ``base``'s dtype. ``feather`` is
clamped to half the box on each axis, so a small region still tapers symmetrically;
``feather=0`` is a hard-edged paste.
"""
import numpy as np
if base.shape != regenerated.shape:
raise ValueError(f"shape mismatch: base {base.shape} vs regenerated {regenerated.shape}")
h, w = base.shape[:2]
x, y, bw, bh = box
x0, y0 = max(0, x), max(0, y)
x1, y1 = min(w, x + bw), min(h, y + bh)
out = base.copy()
if x1 <= x0 or y1 <= y0:
return out # empty / off-image box -> nothing regenerated
def taper(n: int) -> NDArray[Any]:
win = np.ones(n, dtype=np.float32)
f = min(max(feather, 0), n // 2)
if f > 0:
ramp = (np.arange(f, dtype=np.float32) + 1.0) / (f + 1.0) # strictly inside (0, 1); smallest, 1/(f+1), at the box edge
win[:f] = ramp
win[n - f :] = ramp[::-1]
return win
rh, rw = y1 - y0, x1 - x0
wmap = np.outer(taper(rh), taper(rw)) # ~0 at the box edge, 1 in the interior
if base.ndim == 3:
wmap = wmap[:, :, None]
roi_base = base[y0:y1, x0:x1].astype(np.float32)
roi_gen = regenerated[y0:y1, x0:x1].astype(np.float32)
blended = roi_base * (1.0 - wmap) + roi_gen * wmap
out[y0:y1, x0:x1] = np.clip(blended, 0, 255).astype(base.dtype)
return out
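To make the taper concrete, here is a one-dimensional sketch of the same window in plain Python. It is a simplified illustration rather than the function above: `taper` mirrors the nested helper, while `blend_row` is a hypothetical name for blending a single row.

```python
from typing import List


def taper(n: int, feather: int) -> List[float]:
    """1-D blend weights for a span of n pixels with a linear ramp at both ends."""
    f = min(max(feather, 0), n // 2)  # clamp the ramp to half the span
    win = [1.0] * n
    for i in range(f):
        w = (i + 1) / (f + 1)  # 1/(f+1) at the edge, rising toward 1
        win[i] = w
        win[n - 1 - i] = w
    return win


def blend_row(base: List[float], regenerated: List[float], feather: int) -> List[float]:
    """Blend one row: weight 1 keeps the regenerated pixel, weight 0 keeps the base."""
    weights = taper(len(base), feather)
    return [b * (1.0 - w) + g * w for b, g, w in zip(base, regenerated, weights)]


row = blend_row([0.0] * 8, [100.0] * 8, feather=2)
```

For a span of 8 with `feather=2` the weights are 1/3, 2/3, 1, 1, 1, 1, 2/3, 1/3. The 2-D map in the function is the outer product of two such windows, so a corner pixel gets the product of its row and column weights.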
def run_tiled(
generate_tile: Callable[[PILImage.Image], PILImage.Image],
image: PILImage.Image,
@@ -566,6 +566,8 @@ class WatermarkRemover:
tile: bool = False,
tile_size: int = 1024,
tile_overlap: int = 128,
region: tuple[int, int, int, int] | None = None,
region_feather: int = 64,
) -> Path:
"""Remove watermark from an image using regeneration attack.
@@ -589,6 +591,15 @@ class WatermarkRemover:
tile_size: Tile dimension in px (default 1024, SDXL's training size).
tile_overlap: Overlap between adjacent tiles in px (default 128), feather-
blended so there is no visible seam.
region: Restrict the regeneration to the AI-composited box ``(x, y, w, h)``
and feather-composite it back over the ORIGINAL pixels everywhere else.
For AI-ENHANCED composites (digitalSourceType
``compositeWithTrainedAlgorithmicMedia``, surfaced as
``identify.ProvenanceReport.ai_source_kind == "enhanced"``): the real
photo outside the box is preserved exactly, only the AI region is
scrubbed. The box is supplied by the caller (a C2PA composite manifest
does not carry a reliable machine-readable region). None -> whole frame.
region_feather: Seam taper in px for ``region`` compositing (default 64).
Returns:
Path to the cleaned image.
@@ -660,6 +671,22 @@ class WatermarkRemover:
self._controlnet_pipeline = None
cleaned_image = _generate()
# Region-targeted regeneration for AI-enhanced composites: keep the real photo
# outside the AI box pixel-exact, blend only the regenerated AI region back in.
if region is not None:
import numpy as np
from remove_ai_watermarks.noai.tiling import feather_region_composite
gen = cleaned_image.convert("RGB")
if gen.size != init_image.size: # a downscaled/tiled pass can resize
gen = gen.resize(init_image.size)
cleaned_image = gen
base_rgb = np.asarray(init_image) # original RGB, untouched outside the box
merged = feather_region_composite(base_rgb, np.asarray(gen), region, feather=region_feather)
cleaned_image = Image.fromarray(merged)
self._set_progress(f"Region-targeted regeneration: AI box {region}, real photo preserved")
self._set_progress(f"Regeneration complete · Output: {w}x{h}px {cleaned_image.mode}")
output_path.parent.mkdir(parents=True, exist_ok=True)
@@ -877,12 +904,17 @@ def remove_watermark(
model_id: str | None = None,
device: str | None = None,
hf_token: str | None = None,
region: tuple[int, int, int, int] | None = None,
) -> Path:
"""Convenience function to remove watermark from an image.
``strength=None`` lets the profile pick its vendor-adaptive default
(0.20 OpenAI / 0.30 Google / 0.30 unknown, from the C2PA SynthID proxy on the
input; same ladder for the controlnet and sdxl pipelines). Pass a value to override.
``region=(x, y, w, h)`` restricts the regeneration to that box and preserves the
real photo elsewhere -- for AI-enhanced composites (see
``WatermarkRemover.remove_watermark``).
"""
from remove_ai_watermarks.noai.watermark_profiles import vendor_for_strength
@@ -892,4 +924,5 @@ def remove_watermark(
output_path=output_path,
strength=strength,
vendor=vendor_for_strength(image_path),
region=region,
)
+21 -7
View File
@@ -90,13 +90,27 @@ class KnownMark:
return self._remove(image, inpaint_method, inpaint, inpaint_strength, force)
# Gemini-sparkle confidence above which the registry treats it as a confident
# detection for arbitration. Matches identify's corpus-validated sparkle
# threshold (0.5): the gemini engine's own detect flag uses a looser internal
# threshold and weakly fires (~0.36) on unrelated bottom-right text (e.g. the
# Doubao mark), which would otherwise let it hijack `--mark auto`. 0.5 gives 0
# false positives on the corpus.
_GEMINI_AUTO_MIN_CONF = 0.5
# Single source of truth for the Gemini-sparkle "trust this as a real mark"
# confidence, shared by BOTH the removal arbitration here (`best_auto_mark` /
# `_gemini_detect`) and the provenance detector in `identify` (which imports it
# as its sparkle threshold). Defining it once removes the detect-vs-remove
# threshold drift the retained-corpus mining surfaced (2026-06-20): identify
# would report a sparkle while removal declined it, or vice versa, whenever the
# two independently-maintained 0.5 constants fell out of step. Now they cannot.
#
# Value 0.5 is corpus-validated: the gemini engine's own `detected` flag uses a
# looser internal threshold (0.35) and weakly fires (~0.36-0.42) on unrelated
# bottom-right text -- a real Doubao mark scores ~0.40-0.42 as a gemini match,
# and its core-ring brightness margin is HIGHER than a genuine faint sparkle's,
# so neither confidence nor the brightness gate separates them in the [0.35, 0.5)
# band. Lowering this gate to recover faint sparkles was evaluated against that
# band (2026-06-20) and REJECTED: it cannot be done without re-admitting the
# Doubao-text / content false positives, trading a rare miss for false-positive
# removals on clean images. The band below the gate is therefore intentionally
# left to the higher-strength / metadata paths. 0.5 gives 0 false positives on
# the corpus.
GEMINI_SPARKLE_TRUST_CONF = 0.5
_GEMINI_AUTO_MIN_CONF = GEMINI_SPARKLE_TRUST_CONF
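The effect of sharing one constant can be sketched in a few lines. Everything below is illustrative: the band names and the functions `sparkle_band`, `identify_fires` and `removal_takes` are hypothetical, and only the 0.5 and 0.35 values come from the comment above.

```python
TRUST_CONF = 0.5      # mirrors GEMINI_SPARKLE_TRUST_CONF
ENGINE_DETECT = 0.35  # the gemini engine's looser internal threshold


def sparkle_band(conf: float) -> str:
    """Name the confidence band a sparkle score falls into (hypothetical labels)."""
    if conf >= TRUST_CONF:
        return "trusted"    # both identify and removal act on it
    if conf >= ENGINE_DETECT:
        return "ambiguous"  # engine fires, but Doubao text scores here too
    return "none"


def identify_fires(conf: float) -> bool:
    return conf >= TRUST_CONF


def removal_takes(conf: float) -> bool:
    return conf >= TRUST_CONF  # same constant, so the two cannot drift apart


scores = [0.56, 0.49, 0.41, 0.37, 0.2]
agree = all(identify_fires(c) == removal_takes(c) for c in scores)
```

A score such as 0.41 lands in the ambiguous band and is declined by both sides, which is the intended trade-off: a rare missed faint sparkle rather than a false removal on a Doubao-marked image.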
# ── Engine adapters (lazy singletons; engines are cv2-only, no model load) ──
+109
View File
@@ -24,6 +24,7 @@ from remove_ai_watermarks.identify import (
_vendor_of,
identify,
)
from remove_ai_watermarks.watermark_registry import GEMINI_SPARKLE_TRUST_CONF
# Where the lazy import inside identify._visible_sparkle resolves the detector.
_SPARKLE_TARGET = "remove_ai_watermarks.gemini_engine.detect_sparkle_confidence"
@@ -140,6 +141,23 @@ class TestIdentifyNonPng:
assert r.is_ai_generated is True
assert "ByteDance" in (r.platform or "")
def test_bytedance_chinese_legal_name_attributed(self, tmp_path: Path):
# Some Volcano Engine certs name the signer with the Chinese legal entity
# rather than the latin "volcengine"; the latin needle misses it, so the
# Chinese-name registry entry is what attributes real ByteDance output.
blob = "北京火山引擎科技有限公司".encode() + b" ... trainedAlgorithmicMedia"
path = self._c2pa_jpeg(tmp_path, blob)
r = identify(path, check_visible=False, check_invisible=False)
assert r.is_ai_generated is True
assert "ByteDance" in (r.platform or "")
def test_elevenlabs_attributed(self, tmp_path: Path):
path = self._c2pa_jpeg(tmp_path, b"Eleven Labs Inc. ... trainedAlgorithmicMedia")
r = identify(path, check_visible=False, check_invisible=False)
assert r.is_ai_generated is True
assert r.platform == "ElevenLabs"
assert not any("SynthID" in w for w in r.watermarks) # ElevenLabs does not use SynthID
def test_stability_ai_issuer_attributed_no_synthid(self, tmp_path: Path):
path = self._c2pa_jpeg(tmp_path, b"Stability AI ... trainedAlgorithmicMedia")
r = identify(path, check_visible=False)
@@ -148,6 +166,21 @@ class TestIdentifyNonPng:
assert "Stability AI" in r.platform
assert not any("SynthID" in w for w in r.watermarks) # Stability does not use SynthID
def test_trained_source_is_generated_kind(self, tmp_path: Path):
path = self._c2pa_jpeg(tmp_path, b"OpenAI ... trainedAlgorithmicMedia")
r = identify(path, check_visible=False, check_invisible=False)
assert r.is_ai_generated is True
assert r.ai_source_kind == "generated"
def test_composite_source_is_enhanced_kind(self, tmp_path: Path):
# compositeWithTrainedAlgorithmicMedia: a real photo with an AI-composited
# region. Still AI (is_ai True), but the kind must read "enhanced" so a
# caller can do region-targeted cleaning instead of a full-frame regen.
path = self._c2pa_jpeg(tmp_path, b"Adobe ... compositeWithTrainedAlgorithmicMedia")
r = identify(path, check_visible=False, check_invisible=False)
assert r.is_ai_generated is True
assert r.ai_source_kind == "enhanced"
def test_c2pa_without_ai_marker_is_unknown(self, tmp_path: Path):
# Adobe signs C2PA on plain Photoshop edits too. Without an AI digital-
# source marker, the honest verdict is unknown -- the C2PA watermark is
@@ -202,6 +235,16 @@ class TestIdentifySamsungGalaxy:
assert r.platform == "ASUS Gallery (C2PA signer)"
assert any("C2PA" in w for w in r.watermarks)
def test_galaxy_capture_without_ai_marker_is_not_ai(self, tmp_path: Path):
# A genuine Galaxy phone capture carries Samsung Galaxy C2PA provenance but
# NO AI source-type / genAIType. It must stay is_ai=None -- the device cert
# is authenticity provenance of a real photo, not an AI-generation signal.
path = self._jpeg(tmp_path, "s25_capture.jpg", b"Samsung Galaxy Galaxy S25 c2pa-rs no ai marker")
r = identify(path, check_visible=False, check_invisible=False)
assert r.is_ai_generated is None
assert r.platform == "Samsung Galaxy (C2PA)"
assert any("C2PA" in w for w in r.watermarks)
# ── End-to-end verdicts on real fixtures ────────────────────────────
@@ -277,6 +320,12 @@ class TestIdentifyLocalParams:
assert "parameters" in signal.detail
assert signal.confidence == "high"
def test_local_gen_params_have_no_c2pa_source_kind(self, tmp_png_with_ai_metadata: Path):
# AI verdict from local SD params (not C2PA) -> ai_source_kind stays None.
r = identify(tmp_png_with_ai_metadata, check_visible=False)
assert r.is_ai_generated is True
assert r.ai_source_kind is None
def test_clean_png_is_unknown(self, tmp_clean_png: Path):
r = identify(tmp_clean_png, check_visible=False)
assert r.is_ai_generated is None
@@ -399,6 +448,66 @@ class TestIdentifyVisibleSparkle:
assert r.confidence == "high"
REPO_ROOT = Path(__file__).resolve().parent.parent
_DEMO_BEFORE = REPO_ROOT / "demo_banana_before.png"
_DEMO_AFTER = REPO_ROOT / "demo_banana_after.png"
@pytest.mark.skipif(not (_DEMO_BEFORE.exists() and _DEMO_AFTER.exists()), reason="demo banana pair not present")
class TestSparkleDetectRemoveAlignment:
"""Detect (identify) and remove (registry.best_auto_mark) must agree on the
same image. This guards against the retained-corpus desync in which identify
reported a sparkle that the removal arbitration declined (or vice versa). Both
gate on the single shared GEMINI_SPARKLE_TRUST_CONF, so a sparkle over the line
is taken by BOTH and one under it is declined by BOTH. The fixtures composite the
real captured sparkle (before-minus-after) back at reduced opacity to land on
either side.
"""
@staticmethod
def _faint_sparkle(tmp_path: Path, opacity: float) -> Path:
import numpy as np
from remove_ai_watermarks import image_io
before = image_io.imread(_DEMO_BEFORE).astype("float32")
after = image_io.imread(_DEMO_AFTER).astype("float32")
faint = np.clip(after + opacity * (before - after), 0, 255).astype("uint8")
out = tmp_path / f"sparkle_{int(opacity * 100)}.png"
image_io.imwrite(out, faint)
return out
def _detect_remove(self, path: Path) -> tuple[bool, bool, float]:
from remove_ai_watermarks import image_io, watermark_registry
from remove_ai_watermarks.gemini_engine import detect_sparkle_confidence
conf = detect_sparkle_confidence(path) or 0.0
identify_fires = conf >= GEMINI_SPARKLE_TRUST_CONF
best = watermark_registry.best_auto_mark(image_io.imread(path))
remove_takes_gemini = best is not None and best.key == "gemini"
return identify_fires, remove_takes_gemini, conf
def test_above_threshold_both_fire(self, tmp_path: Path):
path = self._faint_sparkle(tmp_path, 0.7) # ~0.55 conf, just over the line
identify_fires, remove_takes, conf = self._detect_remove(path)
assert conf >= GEMINI_SPARKLE_TRUST_CONF
assert identify_fires, f"identify declined a sparkle above threshold (conf={conf:.3f})"
assert remove_takes, f"removal declined a sparkle above threshold (conf={conf:.3f})"
def test_below_threshold_both_decline(self, tmp_path: Path):
path = self._faint_sparkle(tmp_path, 0.5) # ~0.37 conf, just under the line
identify_fires, remove_takes, conf = self._detect_remove(path)
assert conf < GEMINI_SPARKLE_TRUST_CONF
assert not identify_fires, f"identify fired below threshold (conf={conf:.3f})"
assert not remove_takes, f"removal fired below threshold (conf={conf:.3f})"
def test_full_strength_both_fire(self):
# The shipped demo sparkle at full strength: unambiguous agreement.
identify_fires, remove_takes, conf = self._detect_remove(_DEMO_BEFORE)
assert conf >= GEMINI_SPARKLE_TRUST_CONF
assert identify_fires
assert remove_takes
class TestIdentifyImportIsLight:
"""`import identify` must stay torch-free (lazy noai/__init__): the package
is deployed on a 512 MB host where eagerly pulling torch/diffusers OOMs."""
+120
View File
@@ -0,0 +1,120 @@
"""Reverse-alpha over-subtraction guard for the visible text-mark engines.
Ported from the Gemini sparkle fix (commit 41f6797) to Doubao/Jimeng/Samsung
(retained-corpus mining 2026-06-20, roadmap P0#8): on a dark or mid-tone
background the captured alpha can over-estimate THIS image's mark opacity, and
reverse-alpha leaves a darker-than-background glyph ghost (a "dark pit") instead
of recovering the true pixels. The guard predicts the reverse-alpha output per
pixel and, when the glyph body lands far below the local ring, reconstructs the
footprint from the original surroundings instead of shipping the pit.
These tests assert on the visual residual (pixel levels vs the local background),
not just on a detector re-fire: a dark pit can clear the NCC detector while still
looking wrong.
"""
from __future__ import annotations
import numpy as np
import pytest
from remove_ai_watermarks import image_io
from remove_ai_watermarks._text_mark_engine import _OVERSUB_DARK_MARGIN
from remove_ai_watermarks.doubao_engine import DoubaoEngine
from remove_ai_watermarks.jimeng_engine import JimengEngine
from remove_ai_watermarks.samsung_engine import SamsungEngine
_ENGINES = [DoubaoEngine, JimengEngine, SamsungEngine]
def _compose(engine, bg: float, opacity_gain: float, w: int = 1024, h: int = 1024):
"""Composite the engine's captured mark onto a flat ``bg`` at ``opacity_gain``.
``opacity_gain < 1`` makes the mark FAINTER than the capture, so reverse-alpha
at the full captured alpha over-subtracts into a dark pit -- the case the guard
must catch. Returns ``(watermarked_uint8, alpha_block, region)`` where the block
and region are exactly what the engine's reverse-alpha receives.
"""
img = np.full((h, w, 3), float(bg), np.float32)
block, (ax, ay, gw, gh) = engine._fixed_alpha_map(img)
a = np.clip(block * opacity_gain, 0.0, 0.99)[:, :, None]
logo = np.array(engine.config.alpha_logo_bgr, np.float32)
img[ay : ay + gh, ax : ax + gw] = img[ay : ay + gh, ax : ax + gw] * (1 - a) + logo * a
return np.clip(img, 0, 255).astype(np.uint8), block, (ax, ay, gw, gh)
def _body_vs_ring(out, region, block) -> tuple[float, float]:
"""Median luma of the glyph body vs the local background ring in ``out``."""
ax, ay, gw, gh = region
g = out.astype(np.float32).mean(axis=2)
body = block >= 0.15
pad = max(4, int(gh * 0.6))
ry1, ry2 = max(0, ay - pad), min(g.shape[0], ay + gh + pad)
rx1, rx2 = max(0, ax - pad), min(g.shape[1], ax + gw + pad)
ring = g[ry1:ry2, rx1:rx2]
fy1, fy2, fx1, fx2 = ay - ry1, ay - ry1 + gh, ax - rx1, ax - rx1 + gw
ring_mask = np.ones(ring.shape, dtype=bool)
ring_mask[fy1:fy2, fx1:fx2] = False
core = float(np.median(g[ay : ay + gh, ax : ax + gw][body]))
return core, float(np.median(ring[ring_mask]))
@pytest.mark.parametrize("Engine", _ENGINES, ids=lambda e: e.__name__)
class TestOversubtractionGuard:
@pytest.mark.parametrize(("bg", "gain"), [(120, 0.45), (150, 0.4), (90, 0.5)])
def test_guard_trips_on_faint_mark(self, Engine, bg, gain):
eng = Engine()
wm, block, region = _compose(eng, bg, gain)
assert eng._reverse_alpha_oversubtracts(image_io.to_bgr(wm), block, region)
@pytest.mark.parametrize("bg", [255, 200, 128, 60])
def test_guard_skips_clean_full_strength_mark(self, Engine, bg):
# A cleanly captured (gain 1.0) mark predicts back to the background, so the
# guard must NOT trip -- no regression of the common clean-removal path.
eng = Engine()
wm, block, region = _compose(eng, bg, 1.0)
assert not eng._reverse_alpha_oversubtracts(image_io.to_bgr(wm), block, region)
@pytest.mark.parametrize(("bg", "gain"), [(120, 0.45), (150, 0.4)])
def test_faint_removal_leaves_no_dark_pit(self, Engine, bg, gain):
# End-to-end acceptance (roadmap P0#8): after removal the glyph body sits no
# more than _OVERSUB_DARK_MARGIN below the local background.
eng = Engine()
wm, block, region = _compose(eng, bg, gain)
out = eng.remove_watermark_reverse_alpha(wm)
core, ring_bg = _body_vs_ring(out, region, block)
assert core >= ring_bg - _OVERSUB_DARK_MARGIN, f"dark pit: body {core:.0f} vs ring {ring_bg:.0f}"
def test_clean_mark_removal_unchanged_by_guard(self, Engine, monkeypatch):
# On a clean mark the guard must be a no-op: forcing it off yields the same
# output (the guard only ever diverts the over-subtraction case).
eng = Engine()
wm, _block, _region = _compose(eng, 200, 1.0)
guarded = eng.remove_watermark_reverse_alpha(wm)
monkeypatch.setattr(type(eng), "_reverse_alpha_oversubtracts", lambda self, *a, **k: False)
unguarded = eng.remove_watermark_reverse_alpha(wm)
assert np.array_equal(guarded, unguarded)
@pytest.mark.parametrize("Engine", _ENGINES, ids=lambda e: e.__name__)
def test_guard_recovers_pit_on_textured_background(Engine):
"""The guard's footprint inpaint reconstructs from the ORIGINAL surroundings,
so a faint mark over-subtracted on a textured background recovers to roughly the
local content level rather than a glyph-shaped dark ghost."""
eng = Engine()
w = h = 1024
yy, xx = np.mgrid[0:h, 0:w].astype(np.float32)
base = 120 + 35 * np.sin(xx / 80.0) + 25 * np.cos(yy / 60.0)
bg_img = np.clip(np.stack([base, base * 0.95, base * 1.05], axis=-1), 0, 255)
block, (ax, ay, gw, gh) = eng._fixed_alpha_map(bg_img)
a = np.clip(block * 0.45, 0.0, 0.99)[:, :, None]
logo = np.array(eng.config.alpha_logo_bgr, np.float32)
bg_img[ay : ay + gh, ax : ax + gw] = bg_img[ay : ay + gh, ax : ax + gw] * (1 - a) + logo * a
wm = np.clip(bg_img, 0, 255).astype(np.uint8)
out = eng.remove_watermark_reverse_alpha(wm).astype(np.float32)
# Compare the recovered glyph body to the clean texture under the mark.
clean = np.clip(np.stack([base, base * 0.95, base * 1.05], axis=-1), 0, 255)
body = block >= 0.15
region_out = out[ay : ay + gh, ax : ax + gw].mean(axis=2)
region_clean = clean[ay : ay + gh, ax : ax + gw].mean(axis=2)
err = float(np.abs(region_out[body] - region_clean[body]).mean())
assert err < 25.0, f"glyph body not recovered (mean abs err {err:.1f})"
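Reviewer note: the over-subtraction these tests target can be reproduced with single-pixel arithmetic. The sketch below is not the engine's code. It assumes a white logo (255), a fixed alpha of 0.5, and a hypothetical `oversubtracts` check with an assumed margin standing in for `_OVERSUB_DARK_MARGIN`. It shows that inverting the blend at full alpha recovers a clean mark exactly, while a mark applied at gain 0.45 is driven far below the background, which is the dark pit the guard is meant to catch.

```python
def blend(bg: float, logo: float, alpha: float) -> float:
    """Forward alpha blend of a logo pixel over a background pixel."""
    return bg * (1.0 - alpha) + logo * alpha


def reverse_alpha(observed: float, logo: float, alpha: float) -> float:
    """Invert the blend, assuming the mark was applied at exactly `alpha`."""
    return (observed - logo * alpha) / (1.0 - alpha)


def oversubtracts(recovered: float, ring_bg: float, dark_margin: float) -> bool:
    """Hypothetical guard: trip when the recovered pixel is a dark pit."""
    return recovered < ring_bg - dark_margin


# Illustrative values only; none of these come from the engine config.
BG, LOGO, ALPHA = 120.0, 255.0, 0.5
DARK_MARGIN = 12.0  # assumed stand-in for _OVERSUB_DARK_MARGIN

# Mark captured at full strength: the inversion lands back on the background.
clean = reverse_alpha(blend(BG, LOGO, ALPHA), LOGO, ALPHA)

# Mark applied at gain 0.45 but inverted at full alpha: too much logo is
# subtracted, so the recovered pixel ends up much darker than the background.
faint = reverse_alpha(blend(BG, LOGO, ALPHA * 0.45), LOGO, ALPHA)

print(f"clean -> {clean:.2f}, faint -> {faint:.2f}")
print("guard trips on faint:", oversubtracts(faint, BG, DARK_MARGIN))
print("guard trips on clean:", oversubtracts(clean, BG, DARK_MARGIN))
```

The real tests compare medians over the glyph body and a surrounding ring rather than single pixels, but the sign of the error is the same.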
+70
@@ -15,6 +15,7 @@ from PIL import Image
from remove_ai_watermarks.noai.tiling import (
    Tile,
    _axis_positions,
    feather_region_composite,
    feather_weights,
    plan_tiles,
    run_tiled,
@@ -138,3 +139,72 @@ class TestRunTiled:
        image = Image.new("RGB", (1500, 1100), (200, 100, 50))
        out = run_tiled(generate, image, tile_size=1024, overlap=128)
        assert out.size == (1500, 1100)


class TestFeatherRegionComposite:
    """Region-targeted compositing for AI-enhanced composites: only the AI box is
    regenerated, the real photo outside it stays pixel-exact (roadmap P1#8)."""

    @staticmethod
    def _frames(h=200, w=300):
        base = np.full((h, w, 3), 80, np.uint8)
        regenerated = np.full((h, w, 3), 200, np.uint8)
        return base, regenerated

    def test_outside_box_is_pixel_exact(self):
        base, regen = self._frames()
        out = feather_region_composite(base, regen, (100, 60, 80, 50), feather=8)
        # Far corners are well outside the box -> identical to base.
        assert np.array_equal(out[:50, :80], base[:50, :80])
        assert np.array_equal(out[150:, 220:], base[150:, 220:])

    def test_interior_equals_regenerated(self):
        base, regen = self._frames()
        out = feather_region_composite(base, regen, (100, 60, 80, 50), feather=8)
        # Deep interior of the box (past the feather ramp) is fully regenerated.
        assert np.array_equal(out[80:90, 130:150], regen[80:90, 130:150])

    def test_hard_paste_when_no_feather(self):
        base, regen = self._frames()
        out = feather_region_composite(base, regen, (100, 60, 80, 50), feather=0)
        assert np.array_equal(out[60:110, 100:180], regen[60:110, 100:180])
        assert np.array_equal(out[:60], base[:60])

    def test_seam_is_monotonic_ramp(self):
        base, regen = self._frames()
        out = feather_region_composite(base, regen, (100, 60, 80, 50), feather=10).astype(np.float32)
        # Along a horizontal line crossing the left edge, values rise from base(80)
        # toward regenerated(200) monotonically through the feather band.
        row = out[85, 100:115, 0]
        assert row[0] < row[-1]
        assert np.all(np.diff(row) >= -1e-3)

    def test_dtype_preserved(self):
        base, regen = self._frames()
        out = feather_region_composite(base, regen, (50, 50, 40, 40), feather=4)
        assert out.dtype == base.dtype

    def test_grayscale_2d_supported(self):
        base = np.full((100, 120), 30, np.uint8)
        regen = np.full((100, 120), 220, np.uint8)
        out = feather_region_composite(base, regen, (40, 30, 30, 30), feather=4)
        assert out.shape == base.shape
        assert np.array_equal(out[:30], base[:30])

    def test_empty_or_offimage_box_returns_base(self):
        base, regen = self._frames()
        assert np.array_equal(feather_region_composite(base, regen, (0, 0, 0, 0)), base)
        assert np.array_equal(feather_region_composite(base, regen, (500, 500, 40, 40)), base)

    def test_box_clamped_to_image_bounds(self):
        base, regen = self._frames()
        # Box overhangs the bottom-right; only the in-image part is composited.
        out = feather_region_composite(base, regen, (280, 180, 60, 60), feather=0)
        assert np.array_equal(out[180:, 280:], regen[180:, 280:])
        assert out.shape == base.shape

    def test_shape_mismatch_raises(self):
        base, _ = self._frames(200, 300)
        bad = np.full((100, 100, 3), 200, np.uint8)
        with pytest.raises(ValueError, match="shape mismatch"):
            feather_region_composite(base, bad, (10, 10, 20, 20))
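Reviewer note: the contract these tests pin down (pixel-exact outside the box, fully regenerated deep inside, a monotonic ramp across the seam, clamping, a hard paste at `feather=0`) can be illustrated with a small pure-Python sketch. This is not the implementation of `feather_region_composite`. The function name `feather_composite_sketch`, the linear ramp formula, and the choice to feather clamped edges as well are all assumptions made for illustration, and it works on 2-D lists rather than NumPy arrays.

```python
def feather_composite_sketch(base, regen, box, feather=0):
    """Blend `regen` into `base` inside box = (x, y, w, h).

    Hypothetical stand-in operating on 2-D lists of numbers (grayscale).
    """
    h, w = len(base), len(base[0])
    if len(regen) != h or any(len(row) != w for row in regen):
        raise ValueError("shape mismatch")
    x, y, bw, bh = box
    # Clamp the box to the image so an overhanging box only touches valid pixels.
    x0, y0 = max(0, x), max(0, y)
    x1, y1 = min(w, x + bw), min(h, y + bh)
    out = [row[:] for row in base]  # everything outside the box stays untouched
    if x1 <= x0 or y1 <= y0:
        return out  # empty or fully off-image box: plain copy of base
    for r in range(y0, y1):
        for c in range(x0, x1):
            # Distance (in pixels) from this pixel to the nearest box edge.
            d = min(c - x0, x1 - 1 - c, r - y0, y1 - 1 - r)
            # Assumed linear ramp: weight grows from the edge and saturates at 1.
            wgt = 1.0 if feather <= 0 else min(1.0, (d + 1) / (feather + 1))
            out[r][c] = base[r][c] * (1.0 - wgt) + regen[r][c] * wgt
    return out


base = [[80] * 30 for _ in range(20)]
regen = [[200] * 30 for _ in range(20)]
out = feather_composite_sketch(base, regen, (10, 6, 12, 8), feather=3)

# One pixel left of the box, then the ramp across the left edge on row 10.
print(out[10][9:14])  # [80, 110.0, 140.0, 170.0, 200.0]
```

With `feather=0` the weight is 1 everywhere inside the box, which matches the hard-paste case the tests exercise.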