Merge branch 'claude/modest-carson-d72243': corpus-mining provenance + removal fixes

Retained-corpus mining (2026-06-20) fixes, all gate-green:

- C2PA vendor coverage (Volcano Engine CJK legal name, ElevenLabs; TikTok/PixelBin vetted out)
- identify AI-generated vs AI-enhanced (ai_source_kind) + shared GEMINI_SPARKLE_TRUST_CONF (detect/remove threshold unify)
- text-mark over-subtraction guard (Doubao/Jimeng/Samsung)
- region-targeted regeneration for AI-enhanced composites (feather_region_composite + remove_watermark(region=))

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

# Conflicts:
#	CLAUDE.md
2026-07-04 23:47:49 +02:00 · 2026-06-20 15:39:29 -07:00
parent e29d6624b9 737305858d
commit 0d9033d63a
13 changed files with 594 additions and 21 deletions
@@ -53,16 +53,16 @@ Compact map. The full per-module detail (design decisions, tuned thresholds, cal
 - `noai/c2pa.py` — C2PA reading. `extract_c2pa_info(path)` uses the official **c2pa-python `Reader`** first (core dep, any container; `read_manifest_store_json` returns the WHOLE store JSON — active + ingredient manifests — so an AI marker on a parent manifest is seen), and falls back to the hand-rolled caBX/CBOR parser (`has_c2pa_metadata` / `extract_c2pa_chunk` / `_extract_c2pa_info_png`) for synthetic/partial blobs the validator rejects or a broken/absent wheel. The registry scan (issuer / source-type / SynthID / soft-binding) is shared by both paths via `_populate_registry_fields`, so the return-dict shape is identical. Do not reimplement chunk parsing; chunk reads are clamped to the remaining file size by design. `extract_c2pa_chunk`/`inject_c2pa_chunk` stay PNG-only (raw caBX bytes, test/extractor use).
 - `noai/constants.py` — the single `C2PA_AI_VENDORS` registry (+ `C2PA_SOFT_BINDINGS`) from which `C2PA_ISSUERS` / `SYNTHID_C2PA_ISSUERS` / `identify._ISSUER_PLATFORM` are all derived. Add a new vendor as one registry entry; never edit the derived dicts and never add inline.
 - `metadata.py` — `scan_head(path)` is the shared (memoized) input for every C2PA/AIGC/IPTC byte scan; use it instead of `open().read(1MB)` for any new marker scan. Also home to `synthid_source`, `xai_signature`, `iptc_ai_system`, `aigc_label`, `huggingface_job`, `samsung_genai`, and `remove_ai_metadata` (fail-safe `strip_c2pa_boxes`).
- `identify.py` — aggregates every locally-readable signal into one `ProvenanceReport`; `is_ai_generated` is True or None, never asserted False. `import identify` is deliberately light (lazy `noai/__init__`, fits a 512 MB host) — keep heavy imports out. Add capture-camera tokens to `_DEVICE_C2PA_PLATFORM` only when verified against a real C2PA file; editing-app/AI-device signer tokens go to `_SIGNER_C2PA_PLATFORM`; generator/issuer platforms to `C2PA_AI_VENDORS` in `constants.py`. Integrity-clash detection is high-precision by design (only hard generator stamps feed it, source-grouped independence).
+- `identify.py` — aggregates every locally-readable signal into one `ProvenanceReport`; `is_ai_generated` is True or None, never asserted False. `ProvenanceReport.ai_source_kind` exposes the C2PA digital-source-type split — `"generated"` (trainedAlgorithmicMedia, fully AI) vs `"enhanced"` (compositeWithTrainedAlgorithmicMedia, a real photo with an AI-composited region), else None — so a caller branches full-frame scrub vs region-targeted clean (see `noai/tiling.feather_region_composite` + `WatermarkRemover.remove_watermark(region=...)`). The sparkle provenance threshold is the SHARED `watermark_registry.GEMINI_SPARKLE_TRUST_CONF` (imported, not a private copy) so the provenance "is there a sparkle" verdict and the removal "take the sparkle" decision can never drift. `import identify` is deliberately light (lazy `noai/__init__`, fits a 512 MB host) — keep heavy imports out (the `watermark_registry` constant import stays light: engines are lazy there). Add capture-camera tokens to `_DEVICE_C2PA_PLATFORM` only when verified against a real C2PA file; editing-app/AI-device signer tokens go to `_SIGNER_C2PA_PLATFORM`; generator/issuer platforms to `C2PA_AI_VENDORS` in `constants.py`. Integrity-clash detection is high-precision by design (only hard generator stamps feed it, source-grouped independence).
 - `watermark_registry.py` — the single catalog of known visible watermarks (gemini / doubao / jimeng / samsung), reverse-alpha based by policy. Add a new visible text mark = one `_text_mark(...)` row + a `TextMarkConfig` with a captured alpha map; do not re-add per-mark `if` branches. `cli._write_bgr_with_alpha` must NOT zero alpha in the watermark bbox (issue #30 white-box regression).
 - `gemini_engine.py` — visible Gemini-sparkle remover/detector (cv2/numpy, no GPU): top-K size-weighted fusion candidate selection (`_SELECT_TOPK`), corner-promote, over/under-subtraction guards, false-positive gate, self-verify repair. Detection scores the top-K size-weighted matches by full fusion (spatial+gradient+variance) and keeps the highest — NOT the raw-NCC argmax, which re-admits the tiny-patch FPs the size weight suppresses (the osachub 2026-06-12 sub-0.85 corner-sparkle regression; see `docs/module-internals.md`). Keep the 0.85 corner-promote NCC gate; a margin/chroma-gated lower promote was measured and REJECTED 2026-06-11 (~33% FP on non-Google content). Gate any removal candidate on a physical brightness check, not the detector alone.
 - `_text_mark_engine.py` — shared base for the three reverse-alpha text-mark engines (extracted 2026-06-09); the per-engine modules are config-only subclasses. New text mark = a `TextMarkConfig` + a thin subclass + one registry row. Gemini stays a separate engine (different model).
- `doubao_engine.py` / `jimeng_engine.py` / `samsung_engine.py` — thin `TextMarkEngine` subclasses: Doubao "豆包AI生成" (bottom-right), Jimeng "★ 即梦AI" (bottom-right), Samsung Galaxy AI "✦ Contenuti generati dall'AI" (bottom-LEFT, locale-specific — Italian variant calibrated). Removal = reverse-alpha (always-align) + thin residual inpaint. A detector-only removal test is insufficient — assert visual residual (the textured-shift tests).
+- `doubao_engine.py` / `jimeng_engine.py` / `samsung_engine.py` — thin `TextMarkEngine` subclasses: Doubao "豆包AI生成" (bottom-right), Jimeng "★ 即梦AI" (bottom-right), Samsung Galaxy AI "✦ Contenuti generati dall'AI" (bottom-LEFT, locale-specific — Italian variant calibrated). Removal = reverse-alpha (always-align) + thin residual inpaint, **with an over-subtraction guard ported from `gemini_engine` (2026-06-20)**: `_reverse_alpha_oversubtracts` predicts the reverse-alpha output PER PIXEL over the glyph body from the INPUT, and when the recovered body lands more than `_OVERSUB_DARK_MARGIN` (25) gray levels below the local ring it abandons the reverse-alpha pixels and inpaints the footprint from the original surroundings (`_inpaint_footprint`) — fixing the dark-pit ghost on dark/mid-tone backgrounds (roadmap P0#8). Predicting per-pixel from the input (not the produced output) keeps a clean full-strength mark byte-identical (no false trip). A detector-only removal test is insufficient — assert visual residual (the textured-shift tests + `tests/test_text_mark_oversubtraction.py`).
 - `region_eraser.py` — universal region eraser (`erase` CLI): cv2 backend default (no deps), optional big-LaMa via onnxruntime (~3.5-4 GB peak RAM, ~5-6 s/call CPU — does not fit a minimal droplet).
 - `invisible_watermark.py` — decodes the OPEN DWT-DCT watermarks (SD / SDXL / FLUX) via `imwatermark` (extra `detect`, pulls torch). Fragile two ways: (1) does not survive JPEG re-encode/resize; (2) **carrier-fragile on a broad class of pristine images** -- a clean encode->decode round-trip recovers 48/48 on chatgpt/firefly/random but FAILS (28-39/48, below the `_MATCH_48`=44 gate) on the FLUX fox, doubao, a flat FLUX generation, AND a clean synthetic flat fill with no watermark. The failure does NOT track texture; it goes with a degenerate **all-ones decode that is a CARRIER ARTIFACT, not a watermark** (synthetic clean image reproduces it). So `detect_invisible_watermark` is **positive-only**: trust a hit; a `None` is inconclusive unless a same-carrier positive-control embed first recovers >=44. Verified 2026-06-19; full caveat in `docs/watermarking-landscape.md`.
 - `trustmark_detector.py` — Adobe TrustMark open decoder (extra `trustmark`). Do NOT remove the JPEG re-encode false-positive gate — a lone TrustMark hit without it is almost always content noise.
- `noai/watermark_remover.py` — `WatermarkRemover` with three diffusion pipelines selected by the explicit `pipeline` ctor arg, never inferred from `model_id`: `sdxl` (plain SDXL img2img), `controlnet` (SDXL + canny ControlNet, **the DEFAULT since 2026-06-09**), and `qwen` (Qwen-Image 20B MMDiT img2img, Apache-2.0, CUDA/cloud-class — best **text** preservation (incl. CJK); `_load_qwen_pipeline`/`_run_qwen`, bf16, no MPS fallback; call shape in the pure `_build_qwen_kwargs` using `true_cfg_scale`). Removal comes from the img2img `strength`; ControlNet only preserves text/face STRUCTURE — SynthID CAN survive controlnet on photoreal content at low strength. Qwen CERTIFIED oracle floors (2026-06-20): OpenAI **0.10** (seed-robust, clean on seeds 0-4), Gemini **0.25** (seed 0 verified, pin a seed — Gemini oracle rate-limits volume; higher than the controlnet Gemini floor 0.15, so pass explicit `--strength 0.25` for Gemini on `qwen` until a Qwen ladder is wired). Fidelity measured by `scripts/fidelity_metrics.py` (OCR-CER / ArcFace / LPIPS / Laplacian-var, NOT eyeball), compared ONLY at each pipeline's oracle-confirmed scrub floor (where SynthID is removed in BOTH — equal-strength is invalid where it leaves one un-scrubbed): Qwen wins TEXT (incl. CJK), controlnet wins FACES (Qwen smooths faces more) — Qwen is the text-preserving remover, not a universal fidelity win. No face-restore extra ships, by validated decision (every restore approach looked MORE AI-generated).
- `noai/tiling.py` — sliding-window tiled diffusion for large inputs (CLI `--tile`). `WatermarkRemover.remove_watermark` branches to `run_tiled` when `tile` is set AND the long side exceeds `tile_size`, refactoring the single-pass `_generate` into a per-tile `_generate_one` (the ControlNet edge map is rebuilt per tile inside it). Pure helpers `plan_tiles` (uniform-size tiles, last one flush to the edge) and `feather_weights` (strictly-positive separable taper -> partition-of-unity blend) are unit-tested without the model. New tile-blend tuning goes in those pure helpers; do not inline blend math into the runner.
+- `noai/watermark_remover.py` — `WatermarkRemover` with three diffusion pipelines selected by the explicit `pipeline` ctor arg, never inferred from `model_id`: `sdxl` (plain SDXL img2img), `controlnet` (SDXL + canny ControlNet, **the DEFAULT since 2026-06-09**), and `qwen` (Qwen-Image 20B MMDiT img2img, Apache-2.0, CUDA/cloud-class — best **text** preservation (incl. CJK); `_load_qwen_pipeline`/`_run_qwen`, bf16, no MPS fallback; call shape in the pure `_build_qwen_kwargs` using `true_cfg_scale`). Removal comes from the img2img `strength`; ControlNet only preserves text/face STRUCTURE — SynthID CAN survive controlnet on photoreal content at low strength. Qwen CERTIFIED oracle floors (2026-06-20): OpenAI **0.10** (seed-robust, clean on seeds 0-4), Gemini **0.25** (seed 0 verified, pin a seed — Gemini oracle rate-limits volume; higher than the controlnet Gemini floor 0.15, so pass explicit `--strength 0.25` for Gemini on `qwen` until a Qwen ladder is wired). Fidelity measured by `scripts/fidelity_metrics.py` (OCR-CER / ArcFace / LPIPS / Laplacian-var, NOT eyeball), compared ONLY at each pipeline's oracle-confirmed scrub floor (where SynthID is removed in BOTH — equal-strength is invalid where it leaves one un-scrubbed): Qwen wins TEXT (incl. CJK), controlnet wins FACES (Qwen smooths faces more) — Qwen is the text-preserving remover, not a universal fidelity win. No face-restore extra ships, by validated decision (every restore approach looked MORE AI-generated). `remove_watermark(region=(x,y,w,h), region_feather=...)` runs the regeneration but feather-composites only the AI box back over the original (via `noai/tiling.feather_region_composite`), preserving the real photo elsewhere — the **AI-enhanced composite** path (`identify` `ai_source_kind == "enhanced"`); the box is supplied by the caller (a C2PA composite manifest carries no reliable machine-readable region, so we do not fabricate one).
+- `noai/tiling.py` — sliding-window tiled diffusion for large inputs (CLI `--tile`). `WatermarkRemover.remove_watermark` branches to `run_tiled` when `tile` is set AND the long side exceeds `tile_size`, refactoring the single-pass `_generate` into a per-tile `_generate_one` (the ControlNet edge map is rebuilt per tile inside it). Pure helpers `plan_tiles` (uniform-size tiles, last one flush to the edge) and `feather_weights` (strictly-positive separable taper -> partition-of-unity blend) are unit-tested without the model. Also home to `feather_region_composite(base, regenerated, box, *, feather)` — the pure region-targeted compositor for **AI-enhanced composites** (`ai_source_kind == "enhanced"`): blends the regenerated AI box back over the original with a feathered seam, leaving the real photo OUTSIDE the box pixel-exact. It backs `WatermarkRemover.remove_watermark(region=...)` (regenerate ONLY the AI region, not the whole frame); the no-model lossless region path stays `region_eraser.erase`. New tile/region-blend tuning goes in these pure helpers; do not inline blend math into the runner.
 - `auto_config.py` + the content-detection layer were REMOVED 2026-06-09; `--auto` is a deprecated no-op (controlnet is the default pipeline and the adaptive polish is ON by default and self-gates to a no-op where there is no detail deficit).
 - `upscaler.py` — optional Real-ESRGAN pre-diffusion super-resolution for small inputs (extra `esrgan`, spandrel only). Manual opt-in; the default `--upscaler` stays `lanczos` and the engine always falls back to Lanczos on absence/error. ESRGAN can degrade faces and thin text.
 - `image_io.py` — Unicode-safe cv2 IO (issue #17). Every cv2 file read/write in the package routes through `imread`/`imwrite`; do not call `cv2.imread`/`cv2.imwrite` directly. `to_bgr(image)` is the shared channel normalizer — use it instead of inlining `cvtColor` branches.
@@ -15,7 +15,7 @@ module.

 ## `noai/constants.py`

-`noai/constants.py` — PNG_SIGNATURE, C2PA_CHUNK_TYPE, C2PA_SIGNATURES, and `C2PA_AI_VENDORS` — the single `C2paAiVendor` registry of C2PA-signing vendors (issuer byte, resolved org name, the `identify` platform label, and a `synthid` flag), from which `C2PA_ISSUERS`, `SYNTHID_C2PA_ISSUERS` (issuers that pair SynthID with C2PA: Google, OpenAI), and `identify._ISSUER_PLATFORM` are all **derived** — plus `C2PA_SOFT_BINDINGS` (soft-binding `alg` prefix → forensic-watermark vendor: Adobe TrustMark, Digimarc, Imatag, Steg.AI, Microsoft, ...). Add a new C2PA vendor as one `C2PA_AI_VENDORS` entry (never edit the derived dicts), a new soft-binding to `C2PA_SOFT_BINDINGS`; not inline.
+`noai/constants.py` — PNG_SIGNATURE, C2PA_CHUNK_TYPE, C2PA_SIGNATURES, and `C2PA_AI_VENDORS` — the single `C2paAiVendor` registry of C2PA-signing vendors (issuer byte, resolved org name, the `identify` platform label, and a `synthid` flag), from which `C2PA_ISSUERS`, `SYNTHID_C2PA_ISSUERS` (issuers that pair SynthID with C2PA: Google, OpenAI), and `identify._ISSUER_PLATFORM` are all **derived** — plus `C2PA_SOFT_BINDINGS` (soft-binding `alg` prefix → forensic-watermark vendor: Adobe TrustMark, Digimarc, Imatag, Steg.AI, Microsoft, ...). Add a new C2PA vendor as one `C2PA_AI_VENDORS` entry (never edit the derived dicts), a new soft-binding to `C2PA_SOFT_BINDINGS`; not inline. A vendor that signs under multiple legal names needs one entry PER distinctive issuer byte string: e.g. ByteDance's Volcano Engine is registered both as latin `volcengine` AND the Chinese legal entity `北京火山引擎科技有限公司` (UTF-8; the latin needle misses the Chinese-named certs entirely) — both normalize to the same "ByteDance" needle/platform. ElevenLabs ("Eleven Labs Inc.", pure generative-AI) is registered as a generator. Deliberately EXCLUDED (mined-corpus candidates 2026-06-20, documented in the file): TikTok Inc. (a content-provenance / AI-labeling signer on uploads, not a generator) and PixelBin.io / "Fynd" (an image transform / CDN signer) — registering either as a generator would mis-label human uploads as AI; the `is_ai` verdict keys off the digitalSourceType, which is already honored.
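
A minimal sketch of the "one entry per distinctive issuer byte string" rule described above. The `Vendor` dataclass, its field names, and `platform_of` are hypothetical stand-ins for illustration (the real `C2paAiVendor` layout and scan logic are not shown in this diff); the only point is that the latin and the UTF-8 Chinese needle are separate rows that resolve to the same platform, and that an unregistered signer resolves to nothing.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Vendor:
    """Hypothetical stand-in for one C2paAiVendor registry row."""
    needle: bytes        # distinctive issuer byte string searched in the cert
    platform: str        # label that identify would report
    synthid: bool = False


# One row PER distinctive issuer byte string, even for the same vendor.
VENDORS = (
    Vendor(b"volcengine", "ByteDance"),
    Vendor("北京火山引擎科技有限公司".encode("utf-8"), "ByteDance"),
    Vendor(b"Eleven Labs Inc.", "ElevenLabs"),
    # TikTok Inc. and PixelBin.io are deliberately NOT registered as generators.
)

# Derived view: never edited by hand, always rebuilt from the registry.
ISSUER_PLATFORM = {v.needle: v.platform for v in VENDORS}


def platform_of(cert_bytes: bytes) -> Optional[str]:
    """Return the platform for the first registered needle found, else None."""
    for needle, platform in ISSUER_PLATFORM.items():
        if needle in cert_bytes:
            return platform
    return None
```

With only the latin row, a certificate carrying the Chinese legal name would fall through to `None`; the second row is what closes that gap.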

 ## `metadata.py`

@@ -27,7 +27,9 @@ module.

 `identify.py` — the OpenAI rollout caveat is keyed on `_vendor_of(synthid) == "OpenAI"` (not a raw substring over the issuer + verdict blob). `identify(path)` aggregates every locally-readable signal (C2PA issuer→platform, C2PA soft-binding forensic-watermark vendor, **C2PA cloud-manifest reference** via `metadata.c2pa_cloud_manifest` — signal `c2pa_cloud`, **medium**, provenance-only (does NOT set `is_ai`, excluded from `ai_from_metadata` + clash vendors): a C2PA 2.4 Durable-Content-Credentials case where the embedded manifest is stripped but an XMP `dcterms:provenance` pointer to the vendor's cloud manifest store (`_C2PA_MANIFEST_REPOSITORIES`, today `cai-manifests.adobe.com` → "Adobe Content Authenticity") survives, so the credentials stay recoverable server-side; only emitted when no embedded manifest already attributed the file — surfaced on 2 corpus PNGs 2026-06-10 that read fully `unknown` before, IPTC "Made with AI" + IPTC 2025.1 `AISystemUsed`, embedded SD/ComfyUI params, SynthID proxy, xAI/Grok EXIF signature via `metadata.xai_signature`, the China TC260 AIGC label via `metadata.aigc_label`, the HuggingFace `hf-job-id` job marker via `metadata.huggingface_job`, the Samsung Galaxy AI editing marker via `metadata.samsung_genai`, the visible marks — Gemini sparkle plus the ByteDance Doubao 豆包AI生成 / Jimeng 即梦AI / Samsung Galaxy AI "Contenuti generati dall'AI" text marks via the `watermark_registry` — open invisible watermark, Adobe TrustMark via `trustmark_detector`) into one `ProvenanceReport`. `is_ai_generated` is True or None (never asserted False — stripped metadata is not proof of clean origin). 
The `hf_job`, visible-mark, and Samsung `samsung_genai` signals are **medium** confidence: each lifts an otherwise-Unknown verdict to a tentative AI (`hf_only` / `visible_only` / `samsung_only`, parallel branches; `visible_only` fires on any `visible_*` signal) but is excluded from the high-confidence `ai_from_metadata` set, so none overrides a hard metadata signal.

-**Visible-mark detection** (`check_visible`, signals `visible_sparkle` / `visible_doubao` / `visible_jimeng` / `visible_samsung`): the Gemini sparkle keeps its own file-level path (`_visible_sparkle` → `gemini_engine.detect_sparkle_confidence`, promoted only at confidence ≥ `_SPARKLE_THRESHOLD` 0.5; corpus-tuned to separate Gemini sparkles ≥0.56 from non-sparkle ≤0.49), while Doubao/Jimeng/Samsung reuse the registry detectors (`_visible_text_marks` → `watermark_registry`, iterating `_VISIBLE_MARK_PLATFORM`), each gated by its own engine NCC threshold via `MarkDetection.detected` (Doubao 0.4, Jimeng 0.45, Samsung 0.4). Doubao/Jimeng are normally also caught by the TC260 AIGC metadata label and Samsung by its C2PA + `genAIType` marker, so the visible path is their stripped-metadata fallback. Visible marks set `platform` only when no harder signal already did, and (like the sparkle) are excluded from integrity-clash vendor claims. The cv2 dependency lives in the engines, not here.
+**AI-generated vs AI-enhanced** (`ProvenanceReport.ai_source_kind`, roadmap item): the C2PA digital-source-type is split into `"generated"` (trainedAlgorithmicMedia, fully synthetic) vs `"enhanced"` (compositeWithTrainedAlgorithmicMedia, a real photo with an AI-composited region) — the two byte strings are unambiguous (`compositeWithTrainedAlgorithmicMedia` capitalizes the inner "Trained", so a lowercase `trainedAlgorithmicMedia` match is standalone full generation; full generation wins when both appear). `ai_source_kind` is set only when the AI verdict actually came from the C2PA source type (a non-C2PA AI signal — IPTC/AIGC/local gen/xAI — leaves it None). It lets a caller branch a full-frame scrub (`generated`) from a region-targeted clean that preserves the real photo (`enhanced`; see `noai/tiling.feather_region_composite`). The CLI verdict line reads "AI-generated (fully synthetic)" vs "AI-enhanced (real content with an AI-composited region)".
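
The byte-string split above can be sketched as follows. This is an illustration under stated assumptions, not the `identify` implementation: `source_kind` is a hypothetical helper name, and it scans a raw manifest blob directly, whereas the real code only sets `ai_source_kind` when the AI verdict came from the C2PA source type.

```python
from typing import Optional

GENERATED = b"trainedAlgorithmicMedia"                # lowercase leading "t"
ENHANCED = b"compositeWithTrainedAlgorithmicMedia"    # inner "Trained" is capitalized


def source_kind(manifest_blob: bytes) -> Optional[str]:
    """Classify a C2PA digital-source-type blob as generated / enhanced / None.

    The lowercase needle is NOT a substring of the composite needle (case
    differs at the inner "T"), so the two matches cannot shadow each other.
    Full generation is checked first so it wins when both appear.
    """
    if GENERATED in manifest_blob:
        return "generated"
    if ENHANCED in manifest_blob:
        return "enhanced"
    return None
```

A caller would then pick a full-frame scrub for `"generated"` and a region-targeted clean for `"enhanced"`, and leave the choice open on `None`.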
+
+**Visible-mark detection** (`check_visible`, signals `visible_sparkle` / `visible_doubao` / `visible_jimeng` / `visible_samsung`): the Gemini sparkle keeps its own file-level path (`_visible_sparkle` → `gemini_engine.detect_sparkle_confidence`, promoted only at confidence ≥ `_SPARKLE_THRESHOLD`, which is the SHARED `watermark_registry.GEMINI_SPARKLE_TRUST_CONF` (0.5) — imported, not a private copy, so the provenance detect threshold and the removal `best_auto_mark` / `_gemini_detect` arbitration gate can never drift (the detect-vs-remove desync from roadmap P0#7; regression-guarded by `tests/test_identify.py::TestSparkleDetectRemoveAlignment`, which composites the real demo sparkle at borderline opacities and asserts identify and `best_auto_mark` AGREE on either side of the line). Lowering the gate to recover faint sub-0.5 sparkles was evaluated 2026-06-20 and REJECTED: a real Doubao text mark scores ~0.40-0.42 as a gemini match with a HIGHER core-ring brightness margin than a genuine faint sparkle, so neither confidence nor the brightness gate separates them in the [0.35, 0.5) band — lowering trades a rare miss for false-positive removals on clean images. Corpus-tuned to separate Gemini sparkles ≥0.56 from non-sparkle ≤0.49), while Doubao/Jimeng/Samsung reuse the registry detectors (`_visible_text_marks` → `watermark_registry`, iterating `_VISIBLE_MARK_PLATFORM`), each gated by its own engine NCC threshold via `MarkDetection.detected` (Doubao 0.4, Jimeng 0.45, Samsung 0.4). Doubao/Jimeng are normally also caught by the TC260 AIGC metadata label and Samsung by its C2PA + `genAIType` marker, so the visible path is their stripped-metadata fallback. Visible marks set `platform` only when no harder signal already did, and (like the sparkle) are excluded from integrity-clash vendor claims. The cv2 dependency lives in the engines, not here.

 **`import identify` is deliberately light** (~26 MB; ~36 MB with cv2 loaded by a visible-mark run, ~106 MB for a full `check_visible` run): it imports the `noai.c2pa`/`noai.constants` submodules, and `noai/__init__` is lazy (see "Test and lint"), so torch/diffusers are NOT pulled at import even in a full `gpu`/`detect` install — fits a 512 MB host. `noai.c2pa` does eagerly import the **c2pa-python** binary (Rust + cryptography, ~+5 MB RSS, no torch) for the primary `Reader` path — light enough to stay on the dependency-light host; a broken/absent wheel degrades to the byte-scan parser (`reader_available()` False). The heavy paths are opt-in: `check_invisible=True` needs the `detect`/`trustmark` extras (each pulls **torch**; TrustMark also **downloads weights**), so on a core-only deploy leave `check_invisible` off (it is a no-op there anyway). Before the lazy `__init__`, the mere presence of torch in the env inflated `import identify` to ~420 MB.

@@ -105,6 +107,8 @@ The 11 survivors are near-white ill-conditioning (reverse-alpha divides by `1-a`

 **`_apply_reverse_alpha` runs on the glyph crop only:** the blend is a no-op outside the glyph `region` (x, y, w, h) (`(wm - 0)/(1 - 0) == wm`, and a uint8→float32→uint8 round-trip is exact). It copies the frame through and computes the reverse-alpha math on the `region` crop only — byte-identical to the old full-frame pass (verified: Doubao 130 + Jimeng 22 placements, 0 mismatches) but O(glyph) not O(image). The full-frame pass cost ~275 ms on a 12 MP frame for a glyph that is <0.1% of it, once per candidate placement (fixed + aligned ≈ 2×/removal); the crop drops that to ~2 ms. Mirror of the Gemini `_core_and_bg` crop.

+**Over-subtraction guard (`_reverse_alpha_oversubtracts` → `_inpaint_footprint`, ported from `gemini_engine` 2026-06-20, roadmap P0#8):** on a dark or mid-tone background the captured alpha can over-estimate THIS image's mark opacity, and reverse-alpha leaves a darker-than-background glyph ghost (a "dark pit") instead of recovering the true pixels — the sparkle-only fix (commit 41f6797) left the text marks unhandled. After `remove_watermark_reverse_alpha` selects the winning placement, the guard PREDICTS the reverse-alpha output PER PIXEL over the glyph body from the INPUT (`(obs - a*logo)/(1-a)`, exactly the remover's math) and, when the predicted body lands more than `_OVERSUB_DARK_MARGIN` (25) gray levels below the local background ring, abandons the reverse-alpha output for the footprint and inpaints it from the ORIGINAL surroundings (`_inpaint_footprint`, a wider dilate/radius than the thin residual pass). Predicting per-pixel from the INPUT (not the produced output, which depends on which placement the remover picked) is what keeps a cleanly captured full-strength mark byte-identical — it predicts back to the background everywhere, so the guard never trips on it (verified across Doubao/Jimeng/Samsung on white/mid/dark/midgray backgrounds). A faint mark predicts a body far below the ring and diverts to the inpaint. Regression-guarded by `tests/test_text_mark_oversubtraction.py` (predicate True on faint / False on clean, end-to-end no-dark-pit acceptance, clean-mark byte-identity, textured-background recovery). A flat synthetic background cannot exhibit the residual-inpaint failure (inpaint-from-flat is perfect regardless), so the value shows on textured/real content where the footprint inpaint samples un-darkened original pixels instead of the darkened reverse-alpha halo.
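
The arithmetic behind the guard, as a dependency-free sketch. Assumptions, all for illustration only: the logo is pure white (255), body and ring levels are aggregated with a plain mean (the engine's actual aggregation is not shown in this diff), and pixels are passed as flat lists instead of image arrays. The `0.25` denominator clamp and the two constants are taken from the code in this commit.

```python
DARK_MARGIN = 25.0        # mirrors _OVERSUB_DARK_MARGIN
BODY_ALPHA_FLOOR = 0.15   # mirrors _OVERSUB_BODY_ALPHA_FLOOR


def composite(bg, a, logo=255.0):
    """Forward blend: what a mark of opacity `a` looks like over `bg`."""
    return a * logo + (1.0 - a) * bg


def predict_pixel(obs, a, logo=255.0):
    """Reverse-alpha prediction for one pixel, clamped like the remover."""
    denom = min(max(1.0 - a, 0.25), 1.0)
    return min(max((obs - a * logo) / denom, 0.0), 255.0)


def oversubtracts(body_obs, body_alpha, ring):
    """True when the predicted glyph body lands > DARK_MARGIN below the ring.

    body_obs / body_alpha: observed gray level and captured alpha per glyph pixel.
    ring: gray levels of the background ring around the glyph box.
    """
    predicted = [
        predict_pixel(o, a)
        for o, a in zip(body_obs, body_alpha)
        if a >= BODY_ALPHA_FLOOR
    ]
    if not predicted or len(ring) < 10:
        return False
    ring_level = sum(ring) / len(ring)
    body_level = sum(predicted) / len(predicted)
    return ring_level - body_level > DARK_MARGIN
```

On a mid-gray background of 100 with a captured alpha of 0.5, a mark composited at the full 0.5 predicts back to 100 (no trip), while the same glyph composited at only 0.2 predicts a body near 7, far below the ring, and diverts to the footprint inpaint.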
+
 **`_fixed/_aligned_alpha_map` and `extract_mask` return footprint-sized arrays, not full frames (memory):** the alpha-map helpers return the glyph-sized alpha **block** (`(gh, gw)` float32) plus its placement `(ax, ay, gw, gh)`, and `extract_mask` returns the box-sized glyph mask (`(loc.h, loc.w)` uint8) — both used to allocate a full `(h, w)` array that is read only inside the small glyph/box. A full-frame float32 alpha map is ~48 MB on a 12 MP frame and two were held at once during removal (fixed + aligned ≈ 96 MB of mostly-zeros); the box mask was a ~12 MB uint8 allocation rebuilt per text-mark `detect` on the memory-tight `identify` path. `_apply_reverse_alpha` consumes the block directly; the residual inpaint embeds it into one full-frame uint8 mask only at `cv2.inpaint` time (which needs a full-frame mask). Byte-identical to the old full-frame path — the block equals the old map's `[ay:ay+gh, ax:ax+gw]` slice and the box equals the old mask cropped to `loc.bbox` (regression-guarded by `tests/test_text_mark_memory.py`, which reconstructs the old full-frame path inline and asserts equality, so the proof survives a cv2/asset bump). `remove_watermark_reverse_alpha` tracks the winning `region` alongside `best_amap` to place that mask.

 ## `doubao_engine.py`
@@ -199,6 +203,8 @@ At the shared low removal strength the canny edge-conditioning keeps the regener

 Pure sliding-window tiling for the diffusion path (no torch import; numpy/PIL only). `plan_tiles(w, h, tile_size, overlap)` returns a row-major grid of uniform-size `Tile` boxes — every tile is exactly `tile_size` (the SDXL training size), with the last tile on each axis pulled back flush to the far edge (`_axis_positions` clamps a pathological `overlap >= tile` to `tile - 1` so the step stays >= 1). `feather_weights(w, h, overlap)` is a separable linear taper (1 in the interior, ramping toward each edge) floored at `_WEIGHT_EPS` so it is **strictly positive everywhere** — that makes the normalised `accum / weight_sum` blend a partition of unity, so identical/unchanged tiles reconstruct the input exactly (the seam-free guarantee). `run_tiled(generate_tile, image, tile_size, overlap, set_progress)` is the orchestration loop: crop each planned tile, call `generate_tile` (one diffusion pass on a single PIL tile — injected, so this stays decoupled from the pipeline), resize a latent-grid-rounded result back to the exact tile size, and feather-accumulate. All three are unit-tested without the model (`tests/test_tiling.py`: axis math, grid coverage, taper shape/symmetry/positivity, identity reconstruction, per-tile call count, and the resize-back path). New blend tuning belongs in these pure helpers, not inlined into the runner.
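
The per-axis placement rule can be sketched like this. `axis_positions` here is a hypothetical re-derivation from the description above, not the module's `_axis_positions`; in particular the single-tile behaviour for an axis no longer than one tile is an assumption.

```python
def axis_positions(length, tile, overlap):
    """Start offsets of uniform-size tiles along one axis.

    Every tile is exactly `tile` long; the last one is pulled back so it ends
    flush with the far edge. A pathological overlap >= tile is clamped to
    tile - 1 so the step stays >= 1.
    """
    if length <= tile:
        return [0]  # assumption: one tile covers a short axis
    overlap = min(overlap, tile - 1)
    step = tile - overlap
    positions = list(range(0, length - tile, step))
    positions.append(length - tile)  # final tile flush to the edge
    return positions
```

For a 2500 px axis with 1024 px tiles and 128 px overlap this yields three tiles, the last starting at 1476 so that it ends exactly at 2500.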

+`feather_region_composite(base, regenerated, box, *, feather)` is the pure region-targeted compositor for **AI-enhanced composites** (roadmap P1#8; `identify` `ai_source_kind == "enhanced"`, digitalSourceType `compositeWithTrainedAlgorithmicMedia`). It blends `regenerated` over `base` inside `box = (x, y, w, h)` with a separable linear taper of `feather` px at the box edges (the taper anchors to ~0 at the boundary, so unlike `feather_weights` it is NOT floored — the result equals `base` EXACTLY outside the box), preserving dtype and supporting HxW or HxWxC. It backs `WatermarkRemover.remove_watermark(region=..., region_feather=...)`: the remover regenerates the frame (or tiles), then composites only the AI box back over the original input, so the real photo outside the box stays pixel-exact and only the AI region is scrubbed. The box is caller-supplied (a C2PA composite manifest carries no reliable machine-readable region); the no-model lossless region path remains `region_eraser.erase`. Unit-tested in `tests/test_tiling.py::TestFeatherRegionComposite` (outside-box exactness, interior == regenerated, hard-paste at feather 0, monotonic seam ramp, dtype/grayscale/clamp/empty-box/shape-mismatch).
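
A dependency-free sketch of the compositing idea, on nested lists of gray levels. It is not the `feather_region_composite` implementation: the function names are hypothetical, the half-pixel offset in the taper is an assumption chosen so the edge weight is small but non-zero, and dtype handling, channels, and box clamping are omitted.

```python
def axis_taper(n, feather):
    """Linear ramp from ~0 at each box edge up to 1 after `feather` px."""
    if feather <= 0:
        return [1.0] * n  # feather 0 degenerates to a hard paste
    return [min(1.0, (min(i, n - 1 - i) + 0.5) / feather) for i in range(n)]


def region_composite(base, regenerated, box, feather):
    """Blend `regenerated` over `base` inside box = (x, y, w, h) only.

    Pixels outside the box are copied from `base` untouched, so the real photo
    stays exact there; inside, the separable weight wy[j] * wx[i] ramps the
    regenerated content in from the seam.
    """
    x, y, w, h = box
    wx, wy = axis_taper(w, feather), axis_taper(h, feather)
    out = [row[:] for row in base]
    for j in range(h):
        for i in range(w):
            k = wy[j] * wx[i]
            out[y + j][x + i] = (1.0 - k) * base[y + j][x + i] + k * regenerated[y + j][x + i]
    return out
```

Because the weight is never floored outside the box (it is simply not applied there), no normalisation pass is needed, which is the difference from the partition-of-unity blend used for tiles.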
+
 ## `auto_config.py` (REMOVED 2026-06-09)

 **`auto_config.py` + the content-detection layer were REMOVED 2026-06-09.**
@@ -37,6 +37,28 @@ if TYPE_CHECKING:

 logger = logging.getLogger(__name__)

+# Reverse-alpha over-subtraction guard (ported from gemini_engine, 2026-06-20).
+# The reverse-alpha blend ``(wm - a*logo)/(1-a)`` over-subtracts when the captured
+# alpha over-estimates THIS image's mark opacity: on a dark or mid-tone background
+# it drives the glyph footprint into a visibly DARKER-than-background ghost (a
+# "dark pit") instead of recovering the true pixels. The retained-corpus mining
+# (2026-06-20) showed the sparkle-only fix (commit 41f6797) left this unhandled
+# for the Doubao/Jimeng text marks. Mirror the sparkle gate: when the recovered
+# glyph body lands more than this many gray levels below the local background
+# ring, abandon the reverse-alpha output for the footprint and inpaint it from
+# the surroundings instead. Calibrated to the same 25-level margin the sparkle
+# gate uses -- clean text-mark removals recover within ~10 of the ring, the dark
+# pit lands tens of levels below.
+_OVERSUB_DARK_MARGIN = 25.0
+# Glyph-body / background-ring sampling for the guard. The ring is a pad around
+# the glyph box (excluding the box); the body is the bright-core glyph pixels.
+_OVERSUB_RING_PAD_FRAC = 0.6  # ring pad as a fraction of the glyph-box height
+_OVERSUB_BODY_ALPHA_FLOOR = 0.15  # alpha above which a block pixel counts as glyph body
+# Footprint inpaint when the guard trips: dilate the glyph mask wider than the
+# thin residual pass so the whole darkened ghost is reconstructed, not just its edge.
+_OVERSUB_INPAINT_DILATE = 9
+_OVERSUB_INPAINT_RADIUS = 4
+

 @dataclass(frozen=True)
 class TextMarkConfig:
@@ -335,6 +357,74 @@ class TextMarkEngine:
        out[y1:y2, x1:x2] = np.clip((roi - a3 * logo) / np.clip(1.0 - a3, 0.25, 1.0), 0, 255).astype(np.uint8)
        return out

+    def _reverse_alpha_oversubtracts(
+        self, image: NDArray[Any], amap: NDArray[Any], region: tuple[int, int, int, int]
+    ) -> bool:
+        """True when reverse-alpha would darken the glyph footprint into a dark pit.
+
+        Ported from ``gemini_engine._reverse_alpha_oversubtracts`` (2026-06-20):
+        PREDICT the reverse-alpha output at the bright glyph core directly from the
+        INPUT and the captured alpha, ``(core_obs - a*logo)/(1-a)``, and trip when it
+        lands more than ``_OVERSUB_DARK_MARGIN`` gray levels below the local
+        background ring. Predicting from the input (not the produced output) keeps the
+        gate independent of which placement the reverse-alpha picked, so a clean
+        full-strength mark (whose strokes predict back to the background) never trips,
+        while a mark fainter than the capture (over-subtracted into a ghost) does.
+        """
+        ax, ay, gw, gh = region
+        ih, iw = image.shape[:2]
+        if gw < 4 or gh < 4:
+            return False
+        if float(amap.max()) < 0.2:  # too faint a capture to over-subtract meaningfully
+            return False
+        body_box = amap >= _OVERSUB_BODY_ALPHA_FLOOR  # glyph strokes
+        if not bool(body_box.any()):
+            return False
+        pad = max(4, int(gh * _OVERSUB_RING_PAD_FRAC))
+        ry1, ry2 = max(0, ay - pad), min(ih, ay + gh + pad)
+        rx1, rx2 = max(0, ax - pad), min(iw, ax + gw + pad)
+        ring = image[ry1:ry2, rx1:rx2].astype(np.float32).mean(axis=2)
+        fy1, fy2, fx1, fx2 = ay - ry1, ay - ry1 + gh, ax - rx1, ax - rx1 + gw
+        ring_mask = np.ones(ring.shape, dtype=bool)
+        ring_mask[fy1:fy2, fx1:fx2] = False
+        if int(ring_mask.sum()) < 10:
+            return False
+        # Predict the reverse-alpha output PER PIXEL over the glyph body, on luma,
+        # with the same (obs - a*logo)/(1-a) form the remover applies (alpha is capped
+        # at 0.99 here; the remover instead floors 1-a at 0.25), so a cleanly captured
+        # mark predicts back to the true background (no trip), while a mark fainter
+        # than the capture predicts a body far below the local ring. Per-pixel alpha
+        # (not a single peak value) stays faithful across the anti-aliased gradient.
+        obs = ring[fy1:fy2, fx1:fx2]
+        a = np.clip(amap, 0.0, 0.99)
+        logo = float(np.mean(self.config.alpha_logo_bgr))
+        predicted = (obs - a * logo) / (1.0 - a)
+        predicted_core = float(np.median(predicted[body_box]))
+        bg = float(np.median(ring[ring_mask]))
+        oversub = predicted_core < bg - _OVERSUB_DARK_MARGIN
+        if oversub:
+            logger.debug(
+                "%s reverse-alpha over-subtracts: predicted core=%.1f bg=%.1f (margin %.0f) -> footprint inpaint",
+                self.config.name,
+                predicted_core,
+                bg,
+                _OVERSUB_DARK_MARGIN,
+            )
+        return oversub
+
+    def _inpaint_footprint(
+        self, image: NDArray[Any], amap: NDArray[Any], region: tuple[int, int, int, int]
+    ) -> NDArray[Any]:
+        """Reconstruct the glyph footprint from its surroundings (used when
+        reverse-alpha would over-subtract into a dark pit). Inpaints the ORIGINAL
+        image over a dilated glyph mask, so the result never contains the darkened
+        reverse-alpha pixels."""
+        ax, ay, gw, gh = region
+        mask = np.zeros(image.shape[:2], np.uint8)
+        mask[ay : ay + gh, ax : ax + gw] = (amap > self.config.residual_alpha_floor).astype(np.uint8) * 255
+        mask = cv2.dilate(mask, np.ones((_OVERSUB_INPAINT_DILATE, _OVERSUB_INPAINT_DILATE), np.uint8))
+        return cv2.inpaint(image, mask, _OVERSUB_INPAINT_RADIUS, cv2.INPAINT_NS)
+
    def remove_watermark_reverse_alpha(self, image: NDArray[Any], *, residual_inpaint: bool = True) -> NDArray[Any]:
        """Recover the original pixels by inverting the alpha blend, then clear the
        residual outline with a thin inpaint over the glyph footprint.
@@ -370,6 +460,13 @@ class TextMarkEngine:
                best_residual, best_out, best_amap, best_region = residual, out, amap, region
        if best_out is None or best_amap is None or best_region is None:  # pragma: no cover - maps is non-empty
            return image.copy()
+        # Over-subtraction guard: on a dark/mid-tone background the captured alpha can
+        # over-estimate the mark's opacity and reverse-alpha leaves a darker-than-
+        # background ghost. When the recovered glyph body sits far below the local
+        # ring, reconstruct the footprint from its surroundings instead of shipping the
+        # dark pit (the thin residual inpaint cannot fix a footprint-wide darkening).
+        if self._reverse_alpha_oversubtracts(image, best_amap, best_region):
+            return self._inpaint_footprint(image, best_amap, best_region)
        if residual_inpaint:
            # Embed the glyph-sized alpha block into a full-frame uint8 mask only for
            # the inpaint (cv2.inpaint needs a mask matching best_out). One uint8
@@ -881,6 +881,13 @@ def cmd_identify(ctx: click.Context, source: Path, no_visible: bool, as_json: bo

    _banner()
    verdict = {True: "AI-generated", False: "not AI", None: "unknown"}[report.is_ai_generated]
+    # Sharpen the True verdict when the C2PA source type says the image is a real
+    # photo with an AI-composited region rather than a full AI generation, so the
+    # caller (and the user) can tell "scrub the whole frame" from "scrub the AI region".
+    if report.is_ai_generated and report.ai_source_kind == "enhanced":
+        verdict = "AI-enhanced (real content with an AI-composited region)"
+    elif report.is_ai_generated and report.ai_source_kind == "generated":
+        verdict = "AI-generated (fully synthetic)"
    console.print(f"\n  Verdict: {verdict}  (confidence: {report.confidence})")
    console.print(f"  Platform: {report.platform or 'undetermined'}")

@@ -42,6 +42,7 @@ from remove_ai_watermarks.metadata import (
 )
 from remove_ai_watermarks.noai.c2pa import cbor_text_after, extract_c2pa_info, soft_binding_vendors_in
 from remove_ai_watermarks.noai.constants import C2PA_AI_TOOLS, C2PA_AI_VENDORS, C2PA_ISSUERS
+from remove_ai_watermarks.watermark_registry import GEMINI_SPARKLE_TRUST_CONF

 if TYPE_CHECKING:
    from pathlib import Path
@@ -57,11 +58,14 @@ log = logging.getLogger(__name__)
 _SCAN_BYTES = 1024 * 1024

 # Visible-sparkle confidence above which the signal is trusted as provenance.
-# Stricter than the removal default (0.25): on the corpus, Gemini-family
-# sparkles score >= 0.56 while non-sparkle images top out at 0.49, so 0.5
-# cleanly separates them and avoids false positives when sparkle is the only
-# signal (e.g. an OpenAI image scored 0.37 -- below threshold, correctly dropped).
-_SPARKLE_THRESHOLD = 0.5
+# Shared with the removal arbitration (watermark_registry.GEMINI_SPARKLE_TRUST_CONF)
+# so the provenance "is there a sparkle" verdict and the removal "take the sparkle"
+# decision can never drift apart -- the detect-vs-remove desync the retained-corpus
+# mining surfaced (2026-06-20). On the corpus Gemini-family sparkles score >= 0.56
+# while non-sparkle images top out at 0.49, so 0.5 cleanly separates them and avoids
+# false positives when the sparkle is the only signal (e.g. an OpenAI image scored
+# 0.37 -- below threshold, correctly dropped).
+_SPARKLE_THRESHOLD = GEMINI_SPARKLE_TRUST_CONF

 # Issuer (C2PA signer) -> human-readable generating platform, derived from the
 # single C2PA_AI_VENDORS registry. Ordered: when a manifest names several issuers
@@ -132,6 +136,14 @@ class ProvenanceReport:
    is_ai_generated: bool | None  # True = AI; None = unknown; False is never asserted
    platform: str | None
    confidence: str  # "high" | "medium" | "none"
+    # Coarse AI-origin kind from the C2PA digital-source-type, so a caller can
+    # branch on full generation vs an AI-touched real photo:
+    #   "generated" -- digitalSourceType trainedAlgorithmicMedia (fully AI).
+    #   "enhanced"  -- compositeWithTrainedAlgorithmicMedia (real content with an
+    #                  AI-composited region; scrub the AI region, keep the photo).
+    #   None        -- no C2PA AI source-type (verdict, if AI, came from another
+    #                  signal: IPTC, AIGC, local gen params, xAI, ...).
+    ai_source_kind: str | None = None
    watermarks: list[str] = field(default_factory=list[str])
    signals: list[Signal] = field(default_factory=list["Signal"])
    caveats: list[str] = field(default_factory=list[str])
@@ -484,9 +496,18 @@ def identify(image_path: Path, *, check_visible: bool = True, check_invisible: b
    # ── C2PA Content Credentials ────────────────────────────────────
    has_c2pa = bool(info) or c2pa_marker_in(head)
    issuers = [info["issuer"]] if info.get("issuer") else _issuers_in(head)
-    c2pa_is_ai = "trainedAlgorithmicMedia" in info.get("source_type", "") or any(
-        m in head for m in (b"trainedAlgorithmicMedia", b"compositeWithTrainedAlgorithmicMedia")
-    )
+    # Full AI generation (trainedAlgorithmicMedia) vs an AI-enhanced real photo
+    # (compositeWithTrainedAlgorithmicMedia). The structured kind is parsed once in
+    # noai.c2pa._populate_registry_fields (covers PNG + any container the c2pa-python
+    # reader handles); fall back to a raw head scan for the non-PNG raw-blob path
+    # where extract_c2pa_info returns {}. Full generation wins when both appear.
+    c2pa_source_kind = info.get("ai_source_kind")
+    if c2pa_source_kind is None:
+        if b"trainedAlgorithmicMedia" in head:
+            c2pa_source_kind = "generated"
+        elif b"compositeWithTrainedAlgorithmicMedia" in head:
+            c2pa_source_kind = "enhanced"
+    c2pa_is_ai = c2pa_source_kind is not None
    # Generator string (for the signal detail): structured for PNG, CBOR-scanned
    # for other containers. Best-effort -- some manifests key it as
    # `claim_generator_info` (Pixel), so this can be None even when a device is
@@ -734,6 +755,9 @@ def identify(image_path: Path, *, check_visible: bool = True, check_invisible: b
        is_ai_generated=is_ai,
        platform=platform,
        confidence=confidence,
+        # Only meaningful when the AI verdict actually came from the C2PA source
+        # type; a non-C2PA AI signal (IPTC/AIGC/local gen) leaves it None.
+        ai_source_kind=c2pa_source_kind if (is_ai and has_c2pa) else None,
        watermarks=watermarks,
        signals=signals,
        caveats=caveats,
@@ -363,14 +363,22 @@ def _populate_registry_fields(buf: bytes, c2pa_info: dict[str, Any]) -> bool:
    # Digital source type (matched anywhere in the store, including ingredient
    # manifests -- a ChatGPT edit of a Sora generation carries the AI marker on
    # the parent, not the active manifest).
+    # ``ai_source_kind`` is the structured generated-vs-enhanced split the caller
+    # branches on (full-frame scrub vs region-targeted clean); ``source_type`` is the
+    # human-readable form. The two byte strings are unambiguous:
+    # "compositeWithTrainedAlgorithmicMedia" capitalizes the inner "Trained", so a
+    # lowercase "trainedAlgorithmicMedia" match is standalone full generation, which
+    # wins when both appear (an edit chain).
    ai_source = False
    if b"trainedAlgorithmicMedia" in buf:
        c2pa_info["source_type"] = "trainedAlgorithmicMedia (AI-generated)"
+        c2pa_info["ai_source_kind"] = "generated"
        ai_source = True
    elif b"algorithmicMedia" in buf:
        c2pa_info["source_type"] = "algorithmicMedia"
    elif b"compositeWithTrainedAlgorithmicMedia" in buf:
        c2pa_info["source_type"] = "compositeWithTrainedAlgorithmicMedia (AI-enhanced)"
+        c2pa_info["ai_source_kind"] = "enhanced"
        ai_source = True

    # SynthID pixel-watermark proxy: a C2PA manifest from a SynthID-using
@@ -122,6 +122,20 @@ C2PA_AI_VENDORS: tuple[C2paAiVendor, ...] = (
    C2paAiVendor(
        b"volcengine", "ByteDance (Volcano Engine)", "ByteDance (Doubao / Jimeng / Volcano Engine)", "ByteDance"
    ),
+    # Some Volcano Engine certs name the signer with the Chinese legal entity
+    # "北京火山引擎科技有限公司" (Beijing Volcano Engine Technology Co., Ltd.) rather
+    # than the latin "volcengine" -- the latin needle misses it entirely, so real
+    # ByteDance output was un-attributed in production traffic. The issuer is the
+    # UTF-8 of the Chinese name (it appears UTF-8-encoded in the manifest-store
+    # JSON and the raw caBX bytes alike); it normalizes to the same "ByteDance"
+    # needle and platform as the volcengine row, so the two collapse together for
+    # clash detection. Verified against the mined retained corpus, 2026-06-20.
+    C2paAiVendor(
+        "北京火山引擎科技有限公司".encode(),
+        "ByteDance (Volcano Engine)",
+        "ByteDance (Doubao / Jimeng / Volcano Engine)",
+        "ByteDance",
+    ),
    # ByteDance's international brand (BytePlus / Seedream / Seededit) signs its
    # cert as "Byteplus Pte. Ltd." -- the bare ``volcengine`` needle misses it, so
    # real BytePlus AI output was mis-attributed (an incidental "Adobe XMP" string
@@ -136,11 +150,29 @@ C2PA_AI_VENDORS: tuple[C2paAiVendor, ...] = (
    # source read AI but no platform was attributed. Verified on real signed files
    # in production traffic, 2026-06-19. Canva does not use SynthID.
    C2paAiVendor(b"Canva", "Canva", "Canva (Magic Media)", "Canva"),
+    # ElevenLabs is a pure generative-AI company (AI voice / audio, and image /
+    # video via its API); it signs output as "Eleven Labs Inc.", so the C2PA
+    # manifest alone marks AI generation. Verified against the mined retained
+    # corpus, 2026-06-20. ElevenLabs does not use SynthID.
+    C2paAiVendor(b"Eleven Labs", "ElevenLabs", "ElevenLabs", "ElevenLabs"),
    # Truepic is a C2PA signing authority, not an AI generator: no platform label,
    # never asserts is_ai (the verdict comes from the digital-source-type).
    C2paAiVendor(b"Truepic", "Truepic", None, None),
 )

+# Deliberately NOT registered as AI-generation vendors (mined-corpus candidates
+# evaluated 2026-06-20):
+#   - TikTok Inc.: signs C2PA as a content-provenance / AI-labeling authority on
+#     uploads, not as an image generator. The is_ai verdict keys off the
+#     digitalSourceType (trainedAlgorithmicMedia), which is already honored; a
+#     bare TikTok signer marks distribution provenance, not generation, so adding
+#     it as a generator needle would mis-label human uploads as AI.
+#   - PixelBin.io (issuer "Fynd"): an image transformation / optimization / CDN
+#     service. Its C2PA stamps a transform/upload step, not a generation event.
+#   Both are excluded to avoid false-positive AI attribution; re-evaluate only
+#   against a real signed file whose manifest carries a trainedAlgorithmicMedia
+#   digital-source type produced by the vendor itself.
+
 # Derived view -- add a vendor to C2PA_AI_VENDORS above, not here.
 # C2PA issuer signature -> resolved org name, for the manifest byte-scan.
 C2PA_ISSUERS: dict[bytes, str] = {v.issuer: v.org for v in C2PA_AI_VENDORS}
@@ -100,6 +100,59 @@ def feather_weights(width: int, height: int, overlap: int) -> NDArray[Any]:
    return weights


+def feather_region_composite(
+    base: NDArray[Any],
+    regenerated: NDArray[Any],
+    box: tuple[int, int, int, int],
+    *,
+    feather: int = 64,
+) -> NDArray[Any]:
+    """Composite ``regenerated`` over ``base`` inside ``box`` only, feathering the seam.
+
+    For AI-ENHANCED composites (digitalSourceType ``compositeWithTrainedAlgorithmicMedia``):
+    the diffusion remover regenerates the whole frame, but only the AI-composited
+    REGION should change -- the rest is a real photo that must be preserved. This
+    blends the regenerated pixels in over ``box = (x, y, w, h)`` with a separable
+    linear taper of ``feather`` px at the box edges, so the result equals ``base``
+    EXACTLY outside the box and ramps smoothly (no hard seam) at the boundary.
+
+    Pure and model-free (unit-tested): ``base`` and ``regenerated`` must be the same
+    shape (H x W, or H x W x C). The output preserves ``base``'s dtype. ``feather`` is
+    clamped to half the box on each axis, so a small region still tapers symmetrically;
+    ``feather=0`` is a hard-edged paste.
+    """
+    import numpy as np
+
+    if base.shape != regenerated.shape:
+        raise ValueError(f"shape mismatch: base {base.shape} vs regenerated {regenerated.shape}")
+    h, w = base.shape[:2]
+    x, y, bw, bh = box
+    x0, y0 = max(0, x), max(0, y)
+    x1, y1 = min(w, x + bw), min(h, y + bh)
+    out = base.copy()
+    if x1 <= x0 or y1 <= y0:
+        return out  # empty / off-image box -> nothing regenerated
+
+    def taper(n: int) -> NDArray[Any]:
+        win = np.ones(n, dtype=np.float32)
+        f = min(max(feather, 0), n // 2)
+        if f > 0:
+            ramp = (np.arange(f, dtype=np.float32) + 1.0) / (f + 1.0)  # in (0, 1); smallest, 1/(f+1), at the edge
+            win[:f] = ramp
+            win[n - f :] = ramp[::-1]
+        return win
+
+    rh, rw = y1 - y0, x1 - x0
+    wmap = np.outer(taper(rh), taper(rw))  # lowest at the box edge, 1 in the interior
+    if base.ndim == 3:
+        wmap = wmap[:, :, None]
+    roi_base = base[y0:y1, x0:x1].astype(np.float32)
+    roi_gen = regenerated[y0:y1, x0:x1].astype(np.float32)
+    blended = roi_base * (1.0 - wmap) + roi_gen * wmap
+    out[y0:y1, x0:x1] = np.clip(blended, 0, 255).astype(base.dtype)
+    return out
+
+
 def run_tiled(
    generate_tile: Callable[[PILImage.Image], PILImage.Image],
    image: PILImage.Image,
@@ -566,6 +566,8 @@ class WatermarkRemover:
        tile: bool = False,
        tile_size: int = 1024,
        tile_overlap: int = 128,
+        region: tuple[int, int, int, int] | None = None,
+        region_feather: int = 64,
    ) -> Path:
        """Remove watermark from an image using regeneration attack.

@@ -589,6 +591,15 @@ class WatermarkRemover:
            tile_size: Tile dimension in px (default 1024, SDXL's training size).
            tile_overlap: Overlap between adjacent tiles in px (default 128), feather-
                blended so there is no visible seam.
+            region: Restrict the regeneration to the AI-composited box ``(x, y, w, h)``
+                and feather-composite it back over the ORIGINAL pixels everywhere else.
+                For AI-ENHANCED composites (digitalSourceType
+                ``compositeWithTrainedAlgorithmicMedia``, surfaced as
+                ``identify.ProvenanceReport.ai_source_kind == "enhanced"``): the real
+                photo outside the box is preserved exactly, only the AI region is
+                scrubbed. The box is supplied by the caller (a C2PA composite manifest
+                does not carry a reliable machine-readable region). None -> whole frame.
+            region_feather: Seam taper in px for ``region`` compositing (default 64).

        Returns:
            Path to the cleaned image.
@@ -660,6 +671,22 @@ class WatermarkRemover:
            self._controlnet_pipeline = None
            cleaned_image = _generate()

+        # Region-targeted regeneration for AI-enhanced composites: keep the real photo
+        # outside the AI box pixel-exact, blend only the regenerated AI region back in.
+        if region is not None:
+            import numpy as np
+
+            from remove_ai_watermarks.noai.tiling import feather_region_composite
+
+            gen = cleaned_image.convert("RGB")
+            if gen.size != init_image.size:  # a downscaled/tiled pass can resize
+                gen = gen.resize(init_image.size)
+            cleaned_image = gen
+            base_rgb = np.asarray(init_image)  # original RGB, untouched outside the box
+            merged = feather_region_composite(base_rgb, np.asarray(gen), region, feather=region_feather)
+            cleaned_image = Image.fromarray(merged)
+            self._set_progress(f"Region-targeted regeneration: AI box {region}, real photo preserved")
+
        self._set_progress(f"Regeneration complete · Output: {w}x{h}px {cleaned_image.mode}")

        output_path.parent.mkdir(parents=True, exist_ok=True)
@@ -877,12 +904,17 @@ def remove_watermark(
    model_id: str | None = None,
    device: str | None = None,
    hf_token: str | None = None,
+    region: tuple[int, int, int, int] | None = None,
 ) -> Path:
    """Convenience function to remove watermark from an image.

    ``strength=None`` lets the profile pick its vendor-adaptive default
    (0.20 OpenAI / 0.30 Google / 0.30 unknown, from the C2PA SynthID proxy on the
    input; same ladder for the controlnet and sdxl pipelines). Pass a value to override.
+
+    ``region=(x, y, w, h)`` restricts the regeneration to that box and preserves the
+    real photo elsewhere -- for AI-enhanced composites (see
+    ``WatermarkRemover.remove_watermark``).
    """
    from remove_ai_watermarks.noai.watermark_profiles import vendor_for_strength

@@ -892,4 +924,5 @@ def remove_watermark(
        output_path=output_path,
        strength=strength,
        vendor=vendor_for_strength(image_path),
+        region=region,
    )
@@ -90,13 +90,27 @@ class KnownMark:
        return self._remove(image, inpaint_method, inpaint, inpaint_strength, force)


-# Gemini-sparkle confidence above which the registry treats it as a confident
-# detection for arbitration. Matches identify's corpus-validated sparkle
-# threshold (0.5): the gemini engine's own detect flag uses a looser internal
-# threshold and weakly fires (~0.36) on unrelated bottom-right text (e.g. the
-# Doubao mark), which would otherwise let it hijack `--mark auto`. 0.5 gives 0
-# false positives on the corpus.
-_GEMINI_AUTO_MIN_CONF = 0.5
+# Single source of truth for the Gemini-sparkle "trust this as a real mark"
+# confidence, shared by BOTH the removal arbitration here (`best_auto_mark` /
+# `_gemini_detect`) and the provenance detector in `identify` (which imports it
+# as its sparkle threshold). Defining it once removes the detect-vs-remove
+# threshold drift the retained-corpus mining surfaced (2026-06-20): identify
+# would report a sparkle while removal declined it, or vice versa, whenever the
+# two independently-maintained 0.5 constants fell out of step. Now they cannot.
+#
+# Value 0.5 is corpus-validated: the gemini engine's own `detected` flag uses a
+# looser internal threshold (0.35) and weakly fires (~0.36-0.42) on unrelated
+# bottom-right text -- a real Doubao mark scores ~0.40-0.42 as a gemini match,
+# and its core-ring brightness margin is HIGHER than a genuine faint sparkle's,
+# so neither confidence nor the brightness gate separates them in the [0.35, 0.5)
+# band. Lowering this gate to recover faint sparkles was evaluated against that
+# band (2026-06-20) and REJECTED: it cannot be done without re-admitting the
+# Doubao-text / content false positives, trading a rare miss for false-positive
+# removals on clean images. The band below the gate is therefore intentionally
+# left to the higher-strength / metadata paths. 0.5 gives 0 false positives on
+# the corpus.
+GEMINI_SPARKLE_TRUST_CONF = 0.5
+_GEMINI_AUTO_MIN_CONF = GEMINI_SPARKLE_TRUST_CONF

 # ── Engine adapters (lazy singletons; engines are cv2-only, no model load) ──

@@ -24,6 +24,7 @@ from remove_ai_watermarks.identify import (
    _vendor_of,
    identify,
 )
+from remove_ai_watermarks.watermark_registry import GEMINI_SPARKLE_TRUST_CONF

 # Where the lazy import inside identify._visible_sparkle resolves the detector.
 _SPARKLE_TARGET = "remove_ai_watermarks.gemini_engine.detect_sparkle_confidence"
@@ -140,6 +141,23 @@ class TestIdentifyNonPng:
        assert r.is_ai_generated is True
        assert "ByteDance" in (r.platform or "")

+    def test_bytedance_chinese_legal_name_attributed(self, tmp_path: Path):
+        # Some Volcano Engine certs name the signer with the Chinese legal entity
+        # rather than the latin "volcengine"; the latin needle misses it, so the
+        # Chinese-name registry entry is what attributes real ByteDance output.
+        blob = "北京火山引擎科技有限公司".encode() + b" ... trainedAlgorithmicMedia"
+        path = self._c2pa_jpeg(tmp_path, blob)
+        r = identify(path, check_visible=False, check_invisible=False)
+        assert r.is_ai_generated is True
+        assert "ByteDance" in (r.platform or "")
+
+    def test_elevenlabs_attributed(self, tmp_path: Path):
+        path = self._c2pa_jpeg(tmp_path, b"Eleven Labs Inc. ... trainedAlgorithmicMedia")
+        r = identify(path, check_visible=False, check_invisible=False)
+        assert r.is_ai_generated is True
+        assert r.platform == "ElevenLabs"
+        assert not any("SynthID" in w for w in r.watermarks)  # ElevenLabs does not use SynthID
+
    def test_stability_ai_issuer_attributed_no_synthid(self, tmp_path: Path):
        path = self._c2pa_jpeg(tmp_path, b"Stability AI ... trainedAlgorithmicMedia")
        r = identify(path, check_visible=False)
@@ -148,6 +166,21 @@ class TestIdentifyNonPng:
        assert "Stability AI" in r.platform
        assert not any("SynthID" in w for w in r.watermarks)  # Stability does not use SynthID

+    def test_trained_source_is_generated_kind(self, tmp_path: Path):
+        path = self._c2pa_jpeg(tmp_path, b"OpenAI ... trainedAlgorithmicMedia")
+        r = identify(path, check_visible=False, check_invisible=False)
+        assert r.is_ai_generated is True
+        assert r.ai_source_kind == "generated"
+
+    def test_composite_source_is_enhanced_kind(self, tmp_path: Path):
+        # compositeWithTrainedAlgorithmicMedia: a real photo with an AI-composited
+        # region. Still AI (is_ai True), but the kind must read "enhanced" so a
+        # caller can do region-targeted cleaning instead of a full-frame regen.
+        path = self._c2pa_jpeg(tmp_path, b"Adobe ... compositeWithTrainedAlgorithmicMedia")
+        r = identify(path, check_visible=False, check_invisible=False)
+        assert r.is_ai_generated is True
+        assert r.ai_source_kind == "enhanced"
+
    def test_c2pa_without_ai_marker_is_unknown(self, tmp_path: Path):
        # Adobe signs C2PA on plain Photoshop edits too. Without an AI digital-
        # source marker, the honest verdict is unknown -- the C2PA watermark is
@@ -202,6 +235,16 @@ class TestIdentifySamsungGalaxy:
        assert r.platform == "ASUS Gallery (C2PA signer)"
        assert any("C2PA" in w for w in r.watermarks)

+    def test_galaxy_capture_without_ai_marker_is_unknown(self, tmp_path: Path):
+        # A genuine Galaxy phone capture carries Samsung Galaxy C2PA provenance but
+        # NO AI source-type / genAIType. It must stay is_ai=None -- the device cert
+        # is authenticity provenance of a real photo, not an AI-generation signal.
+        path = self._jpeg(tmp_path, "s25_capture.jpg", b"Samsung Galaxy Galaxy S25 c2pa-rs no ai marker")
+        r = identify(path, check_visible=False, check_invisible=False)
+        assert r.is_ai_generated is None
+        assert r.platform == "Samsung Galaxy (C2PA)"
+        assert any("C2PA" in w for w in r.watermarks)
+

 # ── End-to-end verdicts on real fixtures ────────────────────────────

@@ -277,6 +320,12 @@ class TestIdentifyLocalParams:
        assert "parameters" in signal.detail
        assert signal.confidence == "high"

+    def test_local_gen_params_have_no_c2pa_source_kind(self, tmp_png_with_ai_metadata: Path):
+        # AI verdict from local SD params (not C2PA) -> ai_source_kind stays None.
+        r = identify(tmp_png_with_ai_metadata, check_visible=False)
+        assert r.is_ai_generated is True
+        assert r.ai_source_kind is None
+
    def test_clean_png_is_unknown(self, tmp_clean_png: Path):
        r = identify(tmp_clean_png, check_visible=False)
        assert r.is_ai_generated is None
@@ -399,6 +448,66 @@ class TestIdentifyVisibleSparkle:
        assert r.confidence == "high"


+REPO_ROOT = Path(__file__).resolve().parent.parent
+_DEMO_BEFORE = REPO_ROOT / "demo_banana_before.png"
+_DEMO_AFTER = REPO_ROOT / "demo_banana_after.png"
+
+
+@pytest.mark.skipif(not (_DEMO_BEFORE.exists() and _DEMO_AFTER.exists()), reason="demo banana pair not present")
+class TestSparkleDetectRemoveAlignment:
+    """Detect (identify) and remove (registry.best_auto_mark) must agree on the
+    same image. Guards against the retained-corpus desync where identify reported a
+    sparkle the removal arbitration declined (or vice versa). Both gate on the single
+    shared GEMINI_SPARKLE_TRUST_CONF, so a sparkle just over the line is taken by BOTH
+    and one just under is declined by BOTH. Fixtures composite the real captured
+    sparkle (before-minus-after) back at reduced opacity to land on either side.
+    """
+
+    @staticmethod
+    def _faint_sparkle(tmp_path: Path, opacity: float) -> Path:
+        import numpy as np
+
+        from remove_ai_watermarks import image_io
+
+        before = image_io.imread(_DEMO_BEFORE).astype("float32")
+        after = image_io.imread(_DEMO_AFTER).astype("float32")
+        faint = np.clip(after + opacity * (before - after), 0, 255).astype("uint8")
+        out = tmp_path / f"sparkle_{int(opacity * 100)}.png"
+        image_io.imwrite(out, faint)
+        return out
+
+    def _detect_remove(self, path: Path) -> tuple[bool, bool, float]:
+        from remove_ai_watermarks import image_io, watermark_registry
+        from remove_ai_watermarks.gemini_engine import detect_sparkle_confidence
+
+        conf = detect_sparkle_confidence(path) or 0.0
+        identify_fires = conf >= GEMINI_SPARKLE_TRUST_CONF
+        best = watermark_registry.best_auto_mark(image_io.imread(path))
+        remove_takes_gemini = best is not None and best.key == "gemini"
+        return identify_fires, remove_takes_gemini, conf
+
+    def test_above_threshold_both_fire(self, tmp_path: Path):
+        path = self._faint_sparkle(tmp_path, 0.7)  # ~0.55 conf, just over the line
+        identify_fires, remove_takes, conf = self._detect_remove(path)
+        assert conf >= GEMINI_SPARKLE_TRUST_CONF
+        assert identify_fires, f"identify declined a sparkle above threshold (conf={conf:.3f})"
+        assert remove_takes, f"removal declined a sparkle above threshold (conf={conf:.3f})"
+
+    def test_below_threshold_both_decline(self, tmp_path: Path):
+        path = self._faint_sparkle(tmp_path, 0.5)  # ~0.37 conf, just under the line
+        identify_fires, remove_takes, conf = self._detect_remove(path)
+        assert conf < GEMINI_SPARKLE_TRUST_CONF
+        assert not identify_fires, f"identify fired below threshold (conf={conf:.3f})"
+        assert not remove_takes, f"removal fired below threshold (conf={conf:.3f})"
+
+    def test_full_strength_both_fire(self):
+        # The shipped demo sparkle at full strength: unambiguous agreement.
+        identify_fires, remove_takes, conf = self._detect_remove(_DEMO_BEFORE)
+        assert conf >= GEMINI_SPARKLE_TRUST_CONF
+        assert identify_fires
+        assert remove_takes
+
+
 class TestIdentifyImportIsLight:
    """`import identify` must stay torch-free (lazy noai/__init__): the package
    is deployed on a 512 MB host where eagerly pulling torch/diffusers OOMs."""
@@ -0,0 +1,120 @@
+"""Reverse-alpha over-subtraction guard for the visible text-mark engines.
+
+Ported from the Gemini sparkle fix (commit 41f6797) to Doubao/Jimeng/Samsung
+(retained-corpus mining 2026-06-20, roadmap P0#8): on a dark or mid-tone
+background the captured alpha can over-estimate THIS image's mark opacity, and
+reverse-alpha leaves a darker-than-background glyph ghost (a "dark pit") instead
+of recovering the true pixels. The guard predicts the reverse-alpha output per
+pixel and, when the glyph body lands far below the local ring, reconstructs the
+footprint from the original surroundings instead of shipping the pit.
+
+These assert visual residual (pixel levels vs the local background), not just a
+detector re-fire -- a dark pit can clear the NCC detector while still looking wrong.
+"""
+
+from __future__ import annotations
+
+import numpy as np
+import pytest
+
+from remove_ai_watermarks import image_io
+from remove_ai_watermarks._text_mark_engine import _OVERSUB_DARK_MARGIN
+from remove_ai_watermarks.doubao_engine import DoubaoEngine
+from remove_ai_watermarks.jimeng_engine import JimengEngine
+from remove_ai_watermarks.samsung_engine import SamsungEngine
+
+_ENGINES = [DoubaoEngine, JimengEngine, SamsungEngine]
+
+
+def _compose(engine, bg: float, opacity_gain: float, w: int = 1024, h: int = 1024):
+    """Composite the engine's captured mark onto a flat ``bg`` at ``opacity_gain``.
+
+    ``opacity_gain < 1`` makes the mark FAINTER than the capture, so reverse-alpha
+    at the full captured alpha over-subtracts into a dark pit -- the case the guard
+    must catch. Returns ``(watermarked_uint8, alpha_block, region)`` where the block
+    and region are exactly what the engine's reverse-alpha receives.
+    """
+    img = np.full((h, w, 3), float(bg), np.float32)
+    block, (ax, ay, gw, gh) = engine._fixed_alpha_map(img)
+    a = np.clip(block * opacity_gain, 0.0, 0.99)[:, :, None]
+    logo = np.array(engine.config.alpha_logo_bgr, np.float32)
+    img[ay : ay + gh, ax : ax + gw] = img[ay : ay + gh, ax : ax + gw] * (1 - a) + logo * a
+    return np.clip(img, 0, 255).astype(np.uint8), block, (ax, ay, gw, gh)
+
+
+def _body_vs_ring(out, region, block) -> tuple[float, float]:
+    """Median luma of the glyph body vs the local background ring in ``out``."""
+    ax, ay, gw, gh = region
+    g = out.astype(np.float32).mean(axis=2)
+    body = block >= 0.15
+    pad = max(4, int(gh * 0.6))
+    ry1, ry2 = max(0, ay - pad), min(g.shape[0], ay + gh + pad)
+    rx1, rx2 = max(0, ax - pad), min(g.shape[1], ax + gw + pad)
+    ring = g[ry1:ry2, rx1:rx2]
+    fy1, fy2, fx1, fx2 = ay - ry1, ay - ry1 + gh, ax - rx1, ax - rx1 + gw
+    ring_mask = np.ones(ring.shape, dtype=bool)
+    ring_mask[fy1:fy2, fx1:fx2] = False
+    core = float(np.median(g[ay : ay + gh, ax : ax + gw][body]))
+    return core, float(np.median(ring[ring_mask]))
+
+
+@pytest.mark.parametrize("Engine", _ENGINES, ids=lambda e: e.__name__)
+class TestOversubtractionGuard:
+    @pytest.mark.parametrize(("bg", "gain"), [(120, 0.45), (150, 0.4), (90, 0.5)])
+    def test_guard_trips_on_faint_mark(self, Engine, bg, gain):
+        eng = Engine()
+        wm, block, region = _compose(eng, bg, gain)
+        assert eng._reverse_alpha_oversubtracts(image_io.to_bgr(wm), block, region)
+
+    @pytest.mark.parametrize("bg", [255, 200, 128, 60])
+    def test_guard_skips_clean_full_strength_mark(self, Engine, bg):
+        # A cleanly captured (gain 1.0) mark predicts back to the background, so the
+        # guard must NOT trip -- no regression of the common clean-removal path.
+        eng = Engine()
+        wm, block, region = _compose(eng, bg, 1.0)
+        assert not eng._reverse_alpha_oversubtracts(image_io.to_bgr(wm), block, region)
+
+    @pytest.mark.parametrize(("bg", "gain"), [(120, 0.45), (150, 0.4)])
+    def test_faint_removal_leaves_no_dark_pit(self, Engine, bg, gain):
+        # End-to-end acceptance (roadmap P0#8): after removal the glyph body sits no
+        # more than _OVERSUB_DARK_MARGIN below the local background ring.
+        eng = Engine()
+        wm, block, region = _compose(eng, bg, gain)
+        out = eng.remove_watermark_reverse_alpha(wm)
+        core, ring_bg = _body_vs_ring(out, region, block)
+        assert core >= ring_bg - _OVERSUB_DARK_MARGIN, f"dark pit: body {core:.0f} vs ring {ring_bg:.0f}"
+
+    def test_clean_mark_removal_unchanged_by_guard(self, Engine, monkeypatch):
+        # On a clean mark the guard must be a no-op: forcing it off yields the same
+        # output (the guard only ever diverts the over-subtraction case).
+        eng = Engine()
+        wm, _block, _region = _compose(eng, 200, 1.0)
+        guarded = eng.remove_watermark_reverse_alpha(wm)
+        monkeypatch.setattr(type(eng), "_reverse_alpha_oversubtracts", lambda self, *a, **k: False)
+        unguarded = eng.remove_watermark_reverse_alpha(wm)
+        assert np.array_equal(guarded, unguarded)
+
+
+@pytest.mark.parametrize("Engine", _ENGINES, ids=lambda e: e.__name__)
+def test_guard_recovers_pit_on_textured_background(Engine):
+    """The guard's footprint inpaint reconstructs from the ORIGINAL surroundings,
+    so a faint mark over-subtracted on a textured background recovers to roughly the
+    local content level rather than a glyph-shaped dark ghost."""
+    eng = Engine()
+    w = h = 1024
+    yy, xx = np.mgrid[0:h, 0:w].astype(np.float32)
+    base = 120 + 35 * np.sin(xx / 80.0) + 25 * np.cos(yy / 60.0)
+    bg_img = np.clip(np.stack([base, base * 0.95, base * 1.05], axis=-1), 0, 255)
+    block, (ax, ay, gw, gh) = eng._fixed_alpha_map(bg_img)
+    a = np.clip(block * 0.45, 0.0, 0.99)[:, :, None]
+    logo = np.array(eng.config.alpha_logo_bgr, np.float32)
+    bg_img[ay : ay + gh, ax : ax + gw] = bg_img[ay : ay + gh, ax : ax + gw] * (1 - a) + logo * a
+    wm = np.clip(bg_img, 0, 255).astype(np.uint8)
+    out = eng.remove_watermark_reverse_alpha(wm).astype(np.float32)
+    # Compare the recovered glyph body to the clean texture under the mark.
+    clean = np.clip(np.stack([base, base * 0.95, base * 1.05], axis=-1), 0, 255)
+    body = block >= 0.15
+    region_out = out[ay : ay + gh, ax : ax + gw].mean(axis=2)
+    region_clean = clean[ay : ay + gh, ax : ax + gw].mean(axis=2)
+    err = float(np.abs(region_out[body] - region_clean[body]).mean())
+    assert err < 25.0, f"glyph body not recovered (mean abs err {err:.1f})"
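The over-subtraction case these tests exercise can be worked through with scalar arithmetic. The sketch below is illustrative only: the white logo level (255), the captured alpha (0.5), and the `dark_margin` of 20 are assumed values, and `composite`, `reverse_alpha`, and `oversubtracts` are hypothetical helpers, not the engine's `_reverse_alpha_oversubtracts` or the real `_OVERSUB_DARK_MARGIN`.

```python
def composite(bg, logo, alpha):
    """Forward alpha blend: what a visible mark does to the background."""
    return bg * (1.0 - alpha) + logo * alpha


def reverse_alpha(wm, logo, alpha):
    """Invert the blend, assuming ``alpha`` is the opacity that was applied."""
    return (wm - logo * alpha) / (1.0 - alpha)


def oversubtracts(wm, logo, alpha, ring_level, dark_margin=20.0):
    """Hypothetical guard: predict the reverse-alpha output and flag it when it
    lands more than ``dark_margin`` below the surrounding background level."""
    return reverse_alpha(wm, logo, alpha) < ring_level - dark_margin


# Assumed, illustrative values (not taken from the engines' configs).
bg, logo, captured_alpha = 120.0, 255.0, 0.5

faint = composite(bg, logo, captured_alpha * 0.45)  # mark fainter than the capture
clean = composite(bg, logo, captured_alpha)         # mark matches the capture

faint_recovered = reverse_alpha(faint, logo, captured_alpha)  # about 45.75: a dark pit
clean_recovered = reverse_alpha(clean, logo, captured_alpha)  # about 120.0: the background
```

Under these assumptions, subtracting the full captured alpha from a mark applied at 45% strength drives the pixel from 120 down to roughly 46, which is the glyph-shaped ghost the guard is meant to divert to a footprint reconstruction.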
@@ -15,6 +15,7 @@ from PIL import Image
 from remove_ai_watermarks.noai.tiling import (
    Tile,
    _axis_positions,
+    feather_region_composite,
    feather_weights,
    plan_tiles,
    run_tiled,
@@ -138,3 +139,72 @@ class TestRunTiled:
        image = Image.new("RGB", (1500, 1100), (200, 100, 50))
        out = run_tiled(generate, image, tile_size=1024, overlap=128)
        assert out.size == (1500, 1100)
+
+
+class TestFeatherRegionComposite:
+    """Region-targeted compositing for AI-enhanced composites: only the AI box is
+    regenerated; the real photo outside it stays pixel-exact (roadmap P1#8)."""
+
+    @staticmethod
+    def _frames(h=200, w=300):
+        base = np.full((h, w, 3), 80, np.uint8)
+        regenerated = np.full((h, w, 3), 200, np.uint8)
+        return base, regenerated
+
+    def test_outside_box_is_pixel_exact(self):
+        base, regen = self._frames()
+        out = feather_region_composite(base, regen, (100, 60, 80, 50), feather=8)
+        # Far corners are well outside the box -> identical to base.
+        assert np.array_equal(out[:50, :80], base[:50, :80])
+        assert np.array_equal(out[150:, 220:], base[150:, 220:])
+
+    def test_interior_equals_regenerated(self):
+        base, regen = self._frames()
+        out = feather_region_composite(base, regen, (100, 60, 80, 50), feather=8)
+        # Deep interior of the box (past the feather ramp) is fully regenerated.
+        assert np.array_equal(out[80:90, 130:150], regen[80:90, 130:150])
+
+    def test_hard_paste_when_no_feather(self):
+        base, regen = self._frames()
+        out = feather_region_composite(base, regen, (100, 60, 80, 50), feather=0)
+        assert np.array_equal(out[60:110, 100:180], regen[60:110, 100:180])
+        assert np.array_equal(out[:60], base[:60])
+
+    def test_seam_is_monotonic_ramp(self):
+        base, regen = self._frames()
+        out = feather_region_composite(base, regen, (100, 60, 80, 50), feather=10).astype(np.float32)
+        # Along a horizontal line crossing the left edge, values rise from base(80)
+        # toward regenerated(200) monotonically through the feather band.
+        row = out[85, 100:115, 0]
+        assert row[0] < row[-1]
+        assert np.all(np.diff(row) >= -1e-3)
+
+    def test_dtype_preserved(self):
+        base, regen = self._frames()
+        out = feather_region_composite(base, regen, (50, 50, 40, 40), feather=4)
+        assert out.dtype == base.dtype
+
+    def test_grayscale_2d_supported(self):
+        base = np.full((100, 120), 30, np.uint8)
+        regen = np.full((100, 120), 220, np.uint8)
+        out = feather_region_composite(base, regen, (40, 30, 30, 30), feather=4)
+        assert out.shape == base.shape
+        assert np.array_equal(out[:30], base[:30])
+
+    def test_empty_or_offimage_box_returns_base(self):
+        base, regen = self._frames()
+        assert np.array_equal(feather_region_composite(base, regen, (0, 0, 0, 0)), base)
+        assert np.array_equal(feather_region_composite(base, regen, (500, 500, 40, 40)), base)
+
+    def test_box_clamped_to_image_bounds(self):
+        base, regen = self._frames()
+        # Box overhangs the bottom-right; only the in-image part is composited.
+        out = feather_region_composite(base, regen, (280, 180, 60, 60), feather=0)
+        assert np.array_equal(out[180:, 280:], regen[180:, 280:])
+        assert out.shape == base.shape
+
+    def test_shape_mismatch_raises(self):
+        base, _ = self._frames(200, 300)
+        bad = np.full((100, 100, 3), 200, np.uint8)
+        with pytest.raises(ValueError, match="shape mismatch"):
+            feather_region_composite(base, bad, (10, 10, 20, 20))
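The properties asserted above (pixel-exact outside the box, fully regenerated deep inside, a monotonic seam, clamping, and a shape check) can be satisfied by a simple inward linear ramp. The following is a minimal sketch under that assumption; `feather_composite_sketch` is a hypothetical stand-in and may differ from how `feather_region_composite` in `noai/tiling.py` actually builds its weights.

```python
import numpy as np


def feather_composite_sketch(base, regenerated, box, feather=8):
    """Blend ``regenerated`` into ``base`` inside ``box`` = (x, y, w, h).

    Assumption: the blend weight ramps linearly from the box edge inward over
    ``feather`` pixels, so nothing outside the box is ever touched.
    """
    if base.shape != regenerated.shape:
        raise ValueError(f"shape mismatch: {base.shape} vs {regenerated.shape}")
    x, y, w, h = box
    height, width = base.shape[:2]
    # Clamp the box to the image; an empty intersection leaves base unchanged.
    x1, y1 = max(0, x), max(0, y)
    x2, y2 = min(width, x + w), min(height, y + h)
    if x2 <= x1 or y2 <= y1:
        return base.copy()

    # Distance (counted from 1 at the edge pixel) to the nearest box edge, per axis.
    ys = np.arange(y1, y2, dtype=np.float32)
    xs = np.arange(x1, x2, dtype=np.float32)
    dy = np.minimum(ys - y1 + 1, y2 - ys)
    dx = np.minimum(xs - x1 + 1, x2 - xs)
    if feather > 0:
        wy = np.clip(dy / feather, 0.0, 1.0)
        wx = np.clip(dx / feather, 0.0, 1.0)
    else:
        wy, wx = np.ones_like(dy), np.ones_like(dx)  # hard paste
    weight = np.minimum(wy[:, None], wx[None, :])
    if base.ndim == 3:
        weight = weight[..., None]  # broadcast over channels

    out = base.copy()
    region = (slice(y1, y2), slice(x1, x2))
    blended = (base[region].astype(np.float32) * (1.0 - weight)
               + regenerated[region].astype(np.float32) * weight)
    if np.issubdtype(base.dtype, np.integer):
        info = np.iinfo(base.dtype)
        blended = np.clip(np.rint(blended), info.min, info.max)
    out[region] = blended.astype(base.dtype)
    return out


base = np.full((200, 300, 3), 80, np.uint8)
regen = np.full((200, 300, 3), 200, np.uint8)
out = feather_composite_sketch(base, regen, (100, 60, 80, 50), feather=8)
```

Writing only into the clamped slice of a copy of `base` is what makes the outside-the-box guarantee hold by construction rather than by numerical accident.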