Commit Graph

13 Commits

Author SHA1 Message Date
Victor Kuznetsov 439eeadc07 refactor(face-restore): wipe GFPGAN path, --restore-faces is PhotoMaker-only
The GFPGAN `restore` extra and its `face_restore.py` module are gone. They were
oracle-confirmed to re-introduce SynthID by blending watermarked original face
pixels at fidelity weight 0.5 (clean A/B: gemini_3 controlnet 0.20 detected WITH
GFPGAN, clean WITHOUT). Keeping them as the default restore method was a footgun
for the removal pipeline. PhotoMaker-V2 (added in the previous commit) is the
single shipped restore path now -- identity-as-embedding, SynthID-safe by
construction.

Removed:
- src/remove_ai_watermarks/face_restore.py + tests/test_face_restore.py
- pyproject.toml `restore` extra (gfpgan/facexlib/basicsr + scipy/numba pins)
- pyproject.toml `[tool.uv.extra-build-dependencies] basicsr = [...]` build pin
- CLI: `--restore-faces-method` and `--restore-faces-weight` (no method choice
  to make, no GFPGAN weight knob to expose)
- InvisibleEngine._restore_faces method (only _restore_faces_photomaker remains)
- All restore-faces-method / restore-faces-weight threading through cmd_*
  signatures and _process_batch_image

Kept:
- `--restore-faces / --no-restore-faces`: now binds to PhotoMaker-V2.
- All adopted oracle findings about GFPGAN re-introducing SynthID (kept in the
  research docs as historical context that explains why the path was removed).

Docs updated: CLAUDE.md (restore extras bullet collapsed to photomaker, removed
face_restore Key-modules bullet, several inline GFPGAN refs scrubbed), README.md
(face-identity callout + install section now point to the photomaker extra),
docs/synthid.md 5.5 (net recipe), docs/controlnet-removal-pipeline-research.md
(recommendations).

ruff + strict pyright (src/) clean; 578 tests pass (the 9 GFPGAN tests are gone,
the 9 PhotoMaker tests stay green).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 15:35:37 -07:00
Victor Kuznetsov f8f247308b docs(identity): smoke test confirms OpenCLIP embedding is invariant to SynthID-magnitude noise
Empirical confirmation of the load-bearing assumption in the PhotoMaker-V2 path: the
identity embedding cannot transport an invisible pixel watermark.

Tested OpenCLIP-ViT-H/14 (laion2B-s32B-b79K, the same encoder PhotoMaker-V2
fine-tunes) on 31 face crops from the gemini_3/gemini_4/openai_3 grid. Cosine
similarity between embed(orig) and embed(perturbed):

- synthid_proxy (±2 LSB low-frequency noise, the regime SynthID actually lives in):
  mean 0.9977, min 0.9937. The embedding moves by ~0.002, more than an order of
  magnitude less than under JPEG90 (mean 0.928), which SynthID survives at >=99% TPR
  by design.
- noise3 / jpeg70 / blur1: 0.89-0.95, all clearly above the SynthID floor.
- self check: 1.0000 (pipeline sane).

So the embedder discards exactly the dimensions SynthID hides in. PhotoMaker-V2
conditioned on a watermarked face will see the same identity vector as a clean
face of that person, so the generated face inherits identity, not the watermark.
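
A minimal sketch of the shape of such a smoke test, assuming nothing about the real harness. `toy_embed` is a hypothetical stand-in (block means), NOT OpenCLIP, and `perturb_lsb` adds per-pixel ±2 noise rather than the low-frequency proxy used above; only the cosine-similarity comparison itself mirrors the measurement.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def toy_embed(pixels, block=8):
    """Hypothetical embedder: block means, so small per-pixel noise averages out."""
    return [sum(pixels[i:i + block]) / block for i in range(0, len(pixels), block)]


def perturb_lsb(pixels, amplitude=2, seed=0):
    """Add uniform integer noise in [-amplitude, +amplitude], clipped to 0..255."""
    rng = random.Random(seed)
    return [min(255, max(0, p + rng.randint(-amplitude, amplitude))) for p in pixels]


rng = random.Random(42)
original = [rng.randint(0, 255) for _ in range(1024)]  # synthetic "face crop"
perturbed = perturb_lsb(original)

sim_self = cosine_similarity(toy_embed(original), toy_embed(original))
sim_perturbed = cosine_similarity(toy_embed(original), toy_embed(perturbed))
print(f"self check: {sim_self:.4f}, perturbed: {sim_perturbed:.4f}")
```

Swapping `toy_embed` for the real image encoder and `original` for actual face crops would give the comparison reported in the list above; the toy numbers say nothing about SynthID itself.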

This unblocks step 2 of the research plan: prototype PhotoMaker-V2 in the
controlnet pipeline. The previously logged ad-hoc "cos(orig, SDXL-cleaned)"
numbers (0.56-0.93) measured diffusion drift, not watermark invariance, and are
not relevant to the hypothesis.

Docs only.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 15:05:15 -07:00
Victor Kuznetsov 310ce912ba docs: SynthID-robust identity research — PhotoMaker-V2 is the only commercial-safe SDXL stack
After GFPGAN restore was oracle-confirmed to RE-INTRODUCE SynthID (it is a fidelity-
restoration net conditioned on the watermarked input), the only identity path that
will not transport the watermark is identity-by-EMBEDDING: a semantic vector that
conditions a fresh generation. That requires a face-recognition / ArcFace-class or
CLIP-image embedder.

Verified the license stack of every credible 2025-2026 SDXL identity adapter by
fetching primary sources directly (HuggingFace model cards, insightface.ai):

- IP-Adapter FaceID family, InstantID, PuLID, Arc2Face -> all blocked. Each
  depends at runtime on InsightFace's antelopev2/buffalo_l ArcFace packs, and
  insightface.ai explicitly states "Code is MIT licensed; models require separate
  commercial licensing." IP-Adapter FaceID's own model card flags itself non-
  commercial for the same reason.
- PhotoMaker-V2 is the single commercial-safe end-to-end stack today: Apache-2.0
  adapter weights with identity encoded as a fine-tuned OpenCLIP-ViT-H/14 (the
  model card's exact phrase: "id_encoder includes finetuned OpenCLIP-ViT-H-14
  and a few fuse layers"). No InsightFace.

Mechanistic argument that an identity embedding cannot transport SynthID: the
embedder is trained to be invariant to low-amplitude pixel changes (JPEG, resize,
brightness, noise), which is exactly the regime SynthID hides in by design. So
the embedding extracted from a watermarked face should be ~identical to the
embedding from the cleaned face, and the embedding cannot carry the watermark
into a freshly generated face. Flagged explicitly as not-yet-measured -- the
first integration step is a cosine-similarity smoke test (no codegen) before
investing in a PhotoMaker prototype.

Process note: the deep-research harness was run but its verifier subagents failed
to call StructuredOutput (same harness bug as a prior session), so its synthesis
was unusable; the license claims here are direct quotes from the primary
sources, fetched and verified, not from the workflow synthesis.

Docs only.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 14:58:11 -07:00
Victor Kuznetsov be14eca207 docs: certified controlnet strength floors from the Modal GPU oracle sweep
Ran the isolated raiw-controlnet-cert Modal app (raiw-app/modal_cert.py) over a
strength x seed grid, restore OFF, --max-resolution 1536, each vendor checked on its
OWN oracle (OpenAI -> openai.com/verify, Gemini -> the Gemini app). Certified
controlnet SynthID-removal floors:

- OpenAI 0.20: 2 photoreal images (9-face grid + bracelet) x seed {1,2,3} = 6/6 clean;
  the bracelet that flipped at 0.15 is seed-robust at 0.20. Transfers to prod (OpenAI
  removal is resolution-independent).
- Gemini 0.30: 0.20 detected -> 0.30 clean on 2/2 seeds (hardest face). Holds only at
  <= 1536; Gemini is resolution-sensitive and raiw.cc runs NATIVE, so cap Gemini
  <= 1536 + use 0.30, or native-calibrate (~0.35+).

Prod recipe recorded: controlnet + a controlnet-specific per-vendor schedule in
resolve_strength (OpenAI 0.20 / Gemini 0.30, NOT the default 0.10/0.15 ladder) +
FIXED prod seed (kills the near-threshold non-determinism) + restore reworked/off.
Added to docs/controlnet-removal-pipeline-research.md (certified floors table),
docs/synthid.md 5.5, and the CLAUDE.md controlnet bullet. Docs only.
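
The certification rule implied above (a strength counts only if every seed comes back clean) can be sketched as follows. This is an illustration, not the Modal app's code: `certified_floor` and `is_clean` are hypothetical names, and the verdict table is made-up example data standing in for manually recorded oracle results.

```python
def certified_floor(strengths, seeds, is_clean):
    """Return the lowest strength that is clean for EVERY seed, or None.

    is_clean(strength, seed) -> bool is assumed to look up a recorded
    oracle verdict; a single clean seed never certifies a strength.
    """
    for strength in sorted(strengths):
        if all(is_clean(strength, seed) for seed in seeds):
            return strength
    return None


# Illustrative verdicts only (True = oracle reports SynthID NOT detected).
verdicts = {
    (0.15, 1): True, (0.15, 2): False, (0.15, 3): True,   # flips by seed
    (0.20, 1): True, (0.20, 2): True, (0.20, 3): True,    # seed-robust
}

floor = certified_floor(
    strengths=[0.15, 0.20],
    seeds=[1, 2, 3],
    is_clean=lambda strength, seed: verdicts[(strength, seed)],
)
print(floor)
```

A stricter variant could additionally require every higher strength in the grid to be clean as well; the sketch leaves that out for brevity.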

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 12:44:56 -07:00
Victor Kuznetsov d38b9a6122 docs: correct controlnet/restore SynthID-removal claims from the 2026-06-04 oracle pass
Oracle validation (openai.com/verify + the Gemini app) overturned three claims that
were on main, and consolidates the controlnet findings into one authoritative place.

- controlnet does NOT reliably remove SynthID at the low vendor-adaptive strength:
  removal is content x pipeline dependent and the survivors FLIP by content type
  (photoreal survives controlnet / clears default; flat graphic survives default /
  clears controlnet; flat text clears both). Root cause is insufficient strength,
  not the pipeline; controlnet needs a higher, per-vendor floor than default.
- removal near the threshold is SEED-non-deterministic (same image+pipeline+strength
  can pass or fail run-to-run); a single clean run does not certify a strength.
- `--restore-faces` RE-INTRODUCES SynthID: GFPGAN runs on the ORIGINAL watermarked
  face at weight 0.5 and composites it back over the cleaned result (clean A/B:
  a Gemini face stayed detected through controlnet 0.15/0.20/0.25 WITH restore,
  cleared at 0.20 with --no-restore-faces). The old "GFPGAN scrubs SynthID" claim
  was wrong.

Corrected in CLAUDE.md (watermark_remover controlnet bullet, controlnet
Known-limitations bullet, face_restore bullet, vendor-adaptive strength bullet) and
docs/synthid.md (5.1 controlnet/face-identity, 5.2 strength floors, new 5.5 oracle
validation log). docs/controlnet-removal-pipeline-research.md gains an authoritative
"Oracle validation 2026-06-04" section that the others point to as the single source.

Docs only; no code change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 12:22:43 -07:00
Victor Kuznetsov 8523f48fb6 data(corpus): archive June 2026 SynthID strength-study subjects
Back docs/synthid.md section 2.2 with the actual test set: the per-image
oracle-verified subjects were only in a local working dir, while the doc claimed
they were recorded in data/synthid_corpus/. Ingest the key pos+cleaned pairs so
the claim holds.

- pos: openai_1/2/3 originals (gpt-image, openai-verify) + gemini_1/2/3/4
  originals (Gemini app, gemini-app); all probe as C2PA-SynthID present.
- cleaned: OpenAI at strength 0.05 (for openai_2, only the s010 output was
  captured) + Gemini at 0.15 --max-resolution 1536; oracle: SynthID NOT detected.
  Metadata was stripped, so the cleaned rows carry no C2PA.
- Excluded the third-party issue #14 image (pic3): oracle-verified but not
  committed to the public corpus.
- docs/synthid.md 2.2: state OpenAI n=4 = 3 archived + 1 external-only.
- CLAUDE.md: drop the drift-prone "~65 MB" corpus size from the sdist note.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-03 17:09:58 -07:00
Victor Kuznetsov 5ec8269949 chore: mark controlnet pipeline + GFPGAN restore-faces as experimental
Both content-preservation features are now flagged EXPERIMENTAL and opt-in.
--pipeline controlnet was already opt-in (default=default); --restore-faces
flips from on-by-default to OFF by default, matching the repo's prior pattern
for experimental preservation passes (the removed protect_text/protect_faces).

- cli.py: --restore-faces/--no-restore-faces default False; EXPERIMENTAL in the
  --restore-faces / --controlnet-scale / --pipeline help; batch default False.
- invisible_engine.py: remove_watermark restore_faces default False + docstring.
- CLAUDE.md / README.md / docs/synthid.md: label both experimental/opt-in.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-03 16:59:28 -07:00
Victor Kuznetsov 411ef16ec3 feat: GFPGAN face-identity restoration post-pass
Add an optional, commercial-safe face-restoration post-pass that recovers the
face identity that drifts during the diffusion removal pass (canny holds
structure, not likeness) while still scrubbing the pixel watermark in the face
regions.

- face_restore.py: GFPGANer singleton (CPU unless CUDA), the basicsr
  torchvision.transforms.functional_tensor shim, and the pure feather
  _composite_faces helper (unit-tested without the model). GFPGAN
  re-synthesizes each face from a StyleGAN2 prior, so composited face pixels
  are GAN-generated (no watermark, no pixel-copy) -- oracle-clean at weight 0.5
  with identity preserved.
- InvisibleEngine.remove_watermark: restore_faces / restore_faces_weight,
  best-effort, auto-skips when the extra is absent or no face is detected.
- CLI --restore-faces/--no-restore-faces + --restore-faces-weight on
  invisible/all/batch (on by default).
- restore extra (gfpgan/facexlib/basicsr), numpy<2-pinned (scipy<1.18,
  numba<0.60) and kept out of `all`; basicsr needs Python <3.13 + setuptools<69
  to build, so pin .python-version 3.12.

Commercial-safe: GFPGAN Apache-2.0, RetinaFace MIT. The CodeFormer alternative
is non-commercial and is not shipped. The earlier IP-Adapter FaceID layer was
removed (footgun: needs high strength, corrupts faces at the low removal
strength).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-03 16:59:28 -07:00
Victor Kuznetsov d90d5d886a feat: controlnet pipeline for text/face-structure preservation
Add `--pipeline controlnet` (SDXL base + xinsir canny ControlNet via
StableDiffusionXLControlNetImg2ImgPipeline): the canny edge map conditions the
img2img regeneration so text and face STRUCTURE stay sharp, while the watermark
is still removed by the regeneration (`strength`) -- no original pixels are
copied or frozen, so SynthID does not survive. Oracle-verified clean on OpenAI
with better text/structure fidelity than plain img2img at equal strength.
`--controlnet-scale` tunes structure preservation; fp32 on mps/cpu (fp16-fixed
VAE on cuda/xpu). Shares the img2img runner (live progress + MPS->CPU fallback)
and the fp16-VAE-fix / device-move helpers with the default pipeline.

Remove the superseded subsystems -- ctrlregen (SD1.5 clean-noise),
text-protection (differential / region-hires) and face-protection: they either
destroyed real content or shielded the watermark by re-using original pixels.
controlnet replaces them by regenerating everything under edge conditioning.

Canny preserves face structure but not identity; face IDENTITY is a separate
face-restoration post-pass (CodeFormer/GFPGAN), researched + prototyped but not
yet shipped. An IP-Adapter FaceID attempt was built and removed (footgun: needs
high strength, corrupts faces at removal strength).

Docs: docs/controlnet-removal-pipeline-research.md, scripts/controlnet_sweep.py.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-03 16:59:28 -07:00
Victor Kuznetsov 96038f960f feat(invisible): vendor-adaptive default strength (OpenAI 0.10 / Google 0.15)
The default img2img strength is now chosen from the detected SynthID vendor
(C2PA issuer) instead of a single fixed 0.30: OpenAI gpt-image -> 0.10, Google
Gemini -> 0.15, unknown source -> 0.15. Explicit --strength always wins.

Basis: an oracle-verified June 2026 controlled study (clean v0.8.6, text/face
protection OFF, per-image openai.com/verify or Gemini-app verdict). OpenAI's
SynthID clears at 0.05 across 1024-1600 px (n=4, resolution-independent);
Google's is ~3x more robust and needs 0.15 on the capped-1536 path (n=4). The
dominant factor is the VENDOR, not resolution. The earlier single 0.30 default
and the "resolution dependence" lore came from contaminated tests run with the
protect-text bug ON (issue #14) -- re-running those same 1600x1600 images clean
removes SynthID at 0.05.

`vendor_for_strength(path)` reads metadata.synthid_source on the ORIGINAL input
and is threaded through cli (invisible/all/batch) -> invisible_engine ->
watermark_remover -> resolve_strength(strength, profile, vendor), so display and
execution use the same vendor (the engine sees a temp path whose C2PA the visible
pass already stripped, so detection must happen in the CLI on the pristine
source). Caveat: Google's 0.15 was validated only on --max-resolution 1536;
native 2816 Gemini was not locally measurable (OOM on Apple Silicon) and is
pending GPU validation on raiw.cc.
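
A simplified sketch of the vendor-adaptive default described above. The real `resolve_strength` takes `(strength, profile, vendor)`; this version drops `profile`, and the vendor key strings are assumptions made for illustration.

```python
# Defaults from this commit message; the key strings are hypothetical.
DEFAULT_STRENGTH_BY_VENDOR = {"openai": 0.10, "google": 0.15}
UNKNOWN_VENDOR_STRENGTH = 0.15


def resolve_strength(strength=None, vendor=None):
    """Pick the img2img strength: an explicit value wins, else a per-vendor default."""
    if strength is not None:
        return strength  # explicit --strength always wins
    return DEFAULT_STRENGTH_BY_VENDOR.get(vendor, UNKNOWN_VENDOR_STRENGTH)


print(resolve_strength(vendor="openai"))
print(resolve_strength(vendor=None))
print(resolve_strength(strength=0.30, vendor="openai"))
```

The vendor has to be detected on the pristine input and passed down, because by the time the engine runs the C2PA metadata may already be stripped.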

Docs: docs/synthid.md sections 2.2/4.4/5.2 corrected (the contaminated
resolution-dependence findings replaced with the clean oracle-verified table);
README and CLAUDE.md updated; CLI --strength help reflects the adaptive default.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-01 19:29:47 -07:00
Victor Kuznetsov 4b0b370ac0 fix(invisible): disable protect-text/protect-faces by default; add docs/synthid.md
Both text and face protection were shielding SynthID from removal. The
text-protection high-res re-scrub regenerates pixels at an upscaled resolution
where the per-region pass may not be strong enough to re-destroy the SynthID
payload, allowing it to survive in text areas. Face protection has an even more
direct mechanism: it pastes back the original (pre-diffusion, watermarked) face
pixels after the global pass, guaranteeing SynthID survives in face regions
regardless of strength.

Both --protect-text and --protect-faces are now off by default and opt-in.
Rename from --no-protect-text / --no-protect-faces to --protect-text /
--protect-faces. Extract shared click.option decorators to module-level
constants (_protect_text_option, _protect_faces_option) to eliminate
copy-paste between cmd_invisible and cmd_all.

Add docs/synthid.md: primary-source-cited technical reference for SynthID-Image
covering mechanism (post-hoc encoder/decoder, 136-bit payload, pixel-space, no
model-weight modification), robustness numbers (arXiv:2510.09263: ~99.98% TPR
at 0.1% FPR across 30 transforms), removal attacks and forensic detectability
(arXiv:2605.09203: all 6 attacks detectable >98% TPR@1%FPR), detectability
limits, oracle scope, adoption landscape, and practical implications including
the protect-text/faces SynthID-preservation finding.

Verified June 2026 on gpt-image 1600x1600 via openai.com/verify: with
--protect-text SynthID detected; without, SynthID removed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-01 10:28:34 -07:00
Victor Kuznetsov 4b4049a6f1 docs(text-protection): update stale strength note (~0.05 -> ~0.30 SynthID threshold)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-31 17:53:48 -07:00
Victor Kuznetsov 58bdf51c59 Visible-watermark registry: reverse-alpha-only Doubao + Gemini, exact native recovery (#28)
* fix(trustmark): gate detection on re-encode durability to kill false positives

TrustMark's wm_present flag is a BCH validity check that spuriously
validates on a content-correlated fraction of un-watermarked images
(AI textures trip it more than camera photos). On a 1343-image set all
20 raw detections were false: several were on Gemini/OpenAI/Doubao output,
which cannot carry Adobe's watermark, and the decoded secrets were random bytes.

A genuine TrustMark is a durable soft binding that survives re-encoding,
so detect_trustmark now re-decodes after a mild JPEG round-trip and
requires the same schema both times. Every observed false positive
collapsed under this gate; the second decode runs only on the rare hit.
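
The gate can be sketched generically as below. This is not the `detect_trustmark` implementation: `decode` and `reencode` are injected hypothetical callables (a decoder returning a schema or `None`, and a mild JPEG round-trip), and the toy "images" are plain dicts.

```python
def detect_durable(image, decode, reencode):
    """Accept a detection only if it survives a re-encode with the same schema.

    decode(image) -> schema or None; reencode(image) -> re-encoded image.
    The second decode runs only when the first one hits.
    """
    first = decode(image)
    if first is None:
        return None
    second = decode(reencode(image))
    return first if second == first else None


# Toy model: a durable mark survives re-encoding, a spurious hit does not.
def toy_decode(image):
    return image.get("schema")


def toy_reencode(image):
    return dict(image) if image.get("durable") else {}


genuine = {"schema": "BCH_5", "durable": True}
spurious = {"schema": "BCH_5", "durable": False}

print(detect_durable(genuine, toy_decode, toy_reencode))
print(detect_durable(spurious, toy_decode, toy_reencode))
```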

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(identify): Samsung Galaxy AI, FLUX, ByteDance C2PA; fix C2PA substring FP

Detection extensions verified on real signed files (2026-05-29):

- Samsung Galaxy AI: signer attribution via a new _SIGNER_C2PA_PLATFORM
  (Samsung Galaxy / ASUS Gallery) kept separate from the capture-camera
  _DEVICE_C2PA_PLATFORM so a Galaxy AI edit (device cert + AI source type)
  does not trip the camera-vs-AI integrity clash. Plus metadata.samsung_genai:
  the proprietary genAIType marker in PhotoEditor_Re_Edit_Data, a medium-
  confidence AI-editing signal (samsung_only branch).
- Black Forest Labs (FLUX) and ByteDance Volcano Engine (Doubao/Jimeng)
  added as C2PA issuers + issuer->platform mappings.
- fix: C2PA presence required only the bare 4-byte 'c2pa' substring, which
  false-positives on compressed pixel data (a recompressed PNG IDAT re-flagged
  C2PA after its manifest was correctly stripped). New c2pa_marker_in() requires
  the JUMBF wrapper (jumb+c2pa) or the C2PA uuid box; applied in identify +
  metadata. Verified: all 535 real C2PA files carry jumb.
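
A rough sketch of the stricter presence check from the last bullet, under stated assumptions: plain byte co-occurrence stands in for "JUMBF wrapper", and the C2PA uuid value is passed in by the caller rather than hard-coded here. The actual `c2pa_marker_in()` may parse more carefully.

```python
from typing import Optional


def c2pa_marker_in(data: bytes, c2pa_uuid: Optional[bytes] = None) -> bool:
    """Return True only for a JUMBF-wrapped C2PA marker or the C2PA uuid box.

    A bare b"c2pa" substring is NOT enough: compressed pixel data can
    contain those four bytes by chance.
    """
    if b"jumb" in data and b"c2pa" in data:
        return True
    return c2pa_uuid is not None and c2pa_uuid in data


# Bare substring inside what could be compressed pixel data: rejected.
print(c2pa_marker_in(b"\x78\x9c...c2pa...\x00\x01"))
# JUMBF superbox type plus the c2pa label: accepted.
print(c2pa_marker_in(b"\x00\x00\x00\x20jumb\x00\x00\x00\x18jumdc2pa\x00"))
```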

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(doubao): gate detection on text structure to cut ~95% of false positives (#23)

Coverage alone over-fired: any textured bottom-right corner cleared the
threshold, so the detector false-positived on ~28% of arbitrary images.
The real '豆包AI生成' mark is six glyphs in one row, so detect now also
requires the text-structure signature (_glyph_structure): many connected
components, no single dominant blob, concentration in a thin horizontal
band. False positives dropped 343 -> 17 across the corpus while keeping
real-mark recall and the doubao-1.png sample. Also accept a no-op force
kwarg for remover-interface symmetry.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(samsung): add Samsung Galaxy AI visible-badge remover

New samsung_engine.py removes the bottom-left sparkle + localized
'AI-generated content' badge that Galaxy AI tools stamp. Mirrors the
Doubao locate->mask->inpaint pattern but bottom-left, with a dual-polarity
top-hat mask (the badge is light-on-dark or dark-on-light). Detection gates
on a band + left-anchor signature (the Doubao CJK-component gate does not
transfer: Latin badge letters connect into few blobs). Explicit-only --
tuned on few real badges with a ~4% FP floor, so it is not used in auto.
Synthetic byte-blob fixtures (real badges are user content, not shipped).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(visible): unified known-watermark registry + LaMa inpaint backend

watermark_registry.py is a single catalog of known visible marks, each
tying {usual location, in_auto flag, recovery strategy, detect adapter,
remove adapter}: gemini (reverse-alpha, exact), doubao, samsung. cmd_visible
is now registry-driven (best_auto_mark for --mark auto; mark_keys() feeds the
CLI choices) -- the per-mark _run_doubao/_run_samsung helper branches are gone.

Cross-engine confidences are not comparable, so the gemini adapter applies the
corpus-validated 0.5 sparkle threshold for auto arbitration (its engine flag is
loose and weakly fired ~0.36 on Doubao text, hijacking auto).

--backend auto|cv2|lama chooses background reconstruction for the mask-based
marks; auto = LaMa when onnxruntime is present, else cv2. For LaMa the mask is
the FILLED glyph bounding box (sparse glyph masks leave anti-aliased edges
behind). cv2 stays the zero-dependency fallback.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs: watermark registry, Samsung/FLUX/ByteDance detection, LaMa backend, trustmark gate

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(doubao): exact reverse-alpha removal from captured alpha map

The Doubao '豆包AI生成' mark is a fixed semi-transparent white overlay, so
given its alpha map the original pixels are recovered exactly:
original = (wm - a*logo)/(1-a) -- no inpaint hallucination.
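
The formula inverts the standard alpha blend wm = a*logo + (1-a)*original. A per-channel sketch with scalar floats (function names are illustrative, not the module's API):

```python
def blend(original, logo, alpha):
    """Forward model: composite the logo over the original at opacity alpha."""
    return alpha * logo + (1.0 - alpha) * original


def reverse_alpha(wm, logo, alpha):
    """Invert the blend: original = (wm - a*logo) / (1 - a)."""
    if alpha >= 1.0:
        raise ValueError("alpha == 1 hides the original completely")
    return (wm - alpha * logo) / (1.0 - alpha)


original = 120.0
watermarked = blend(original, logo=255.0, alpha=0.65)
recovered = reverse_alpha(watermarked, logo=255.0, alpha=0.65)
print(watermarked, recovered)
```

Where alpha is 0 the pixel is returned unchanged, so applying the inverse over the whole footprint only touches pixels the mark actually covers.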

The alpha map + logo colour were solved from real black+gray Doubao captures
on a controlled background: on black, captured = a*logo, and the black/gray pair
solves a per pixel without assuming the logo colour (a_max~0.65, logo near-white).
The white capture cross-validates (the mark vanishes to a flat fill). Bundled as
assets/doubao_alpha.png + geometry constants.
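
One way the black/gray pair could yield alpha, sketched under two assumptions that the message does not state: the gray background level is known, and blending is linear (no gamma). On black, c_black = a*logo; on gray g, c_gray = a*logo + (1-a)*g; subtracting gives a = 1 - (c_gray - c_black)/g. Names are hypothetical.

```python
def solve_alpha(captured_black, captured_gray, gray_level):
    """Per-pixel alpha from two captures; the logo colour cancels out."""
    return 1.0 - (captured_gray - captured_black) / gray_level


def solve_logo(captured_black, alpha):
    """Logo colour from the black capture, where captured = a*logo."""
    return captured_black / alpha if alpha > 0 else None


# Synthetic check: forward-simulate one pixel, then solve it back.
true_alpha, true_logo, gray = 0.65, 250.0, 128.0
c_black = true_alpha * true_logo
c_gray = true_alpha * true_logo + (1.0 - true_alpha) * gray

alpha = solve_alpha(c_black, c_gray, gray)
logo = solve_logo(c_black, alpha)
print(alpha, logo)
```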

remove_watermark_reverse_alpha applies it scaled to image width; exact at the
captured width, so the registry routes doubao through it only when
reverse_alpha_available (width within the calibrated band) and the mark is
detected, falling back to mask inpaint (cv2/LaMa) otherwise. A light residual
inpaint cleans the sub-pixel rescaling error. Add captures at more resolutions
to widen exact coverage.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(visible): reverse-alpha only -- drop inpaint removal + heuristic detection

Per the principle that we only remove/detect what we can do exactly, the
visible-mark path is now reverse-alpha only:

- Doubao detect is reverse-alpha-consistent: match the bundled alpha glyph
  silhouette against the corner via TM_CCOEFF_NORMED (DETECT_NCC_THRESHOLD 0.4)
  -- keys on the '豆包AI生成' SHAPE, not coverage/structure heuristics. FP
  7/1243 (0.6%). Removes the cv2 inpaint path + the _glyph_structure gate.
- Registry is reverse-alpha only: dropped the cv2/LaMa backend (_glyph_remove,
  _lama_box_inpaint, default_backend, --backend) and the Samsung entry. Doubao
  outside the alpha resolution band is skipped, never inpainted.
- Removed samsung_engine.py + tests + --mark samsung (no alpha map captured;
  Samsung C2PA/genAIType metadata detection in identify is unaffected).
- The universal erase --region (cv2/LaMa) is unchanged -- arbitrary-region
  inpainting stays a user-directed tool, separate from the known-mark registry.
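
For reference, the score TM_CCOEFF_NORMED computes at a single position is a zero-mean normalized cross-correlation. The sketch below evaluates it for one pre-cropped patch in pure Python; the real detector presumably searches positions with OpenCV, and everything except the 0.4 threshold is illustrative.

```python
import math

DETECT_NCC_THRESHOLD = 0.4  # value quoted in the bullet above


def ncc(patch, template):
    """Zero-mean normalized cross-correlation of two equal-length flat patches."""
    n = len(template)
    mean_p = sum(patch) / n
    mean_t = sum(template) / n
    dp = [p - mean_p for p in patch]
    dt = [t - mean_t for t in template]
    denom = math.sqrt(sum(x * x for x in dp) * sum(y * y for y in dt))
    if denom == 0:
        return 0.0  # a flat patch carries no shape information
    return sum(x * y for x, y in zip(dp, dt)) / denom


def detect(patch, template, threshold=DETECT_NCC_THRESHOLD):
    return ncc(patch, template) >= threshold


template = [0, 0, 1, 1, 0, 1, 0, 0]            # toy glyph silhouette
marked = [50 + 100 * t for t in template]      # same shape, other brightness
flat = [7] * 8                                 # clean, textureless corner

print(detect(marked, template), detect(flat, template))
```

Because the score is invariant to brightness offset and contrast scale, it keys on the silhouette shape rather than on how much of the corner is covered.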

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(doubao): NCC sub-pixel alignment -> reverse-alpha at any resolution

A pure width-scale of the captured alpha map is only sub-pixel-accurate at the
captured width and leaves a faint ghost elsewhere. remove_watermark_reverse_alpha
now registers the alpha glyph to the actual mark via a TM_CCOEFF_NORMED
scale+position search (_aligned_alpha_map) before inverting the blend, so the
single 2048 capture works at any resolution -- verified clean on the 1773x2364
(3:4) corpus size, the biggest coverage gap (23 files).

reverse_alpha_available is now just 'asset present' (no width band); the registry
still gates removal on detect so a clean corner is never touched. Drops the
_ALPHA_WIDTH_TOLERANCE gate.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(doubao): keep native recovery exact -- fixed geometry at captured width

Integer-pixel NCC alignment landed ~1px off at the captured width, degrading the
otherwise-exact native reverse-alpha (synthetic recovery error 0.94 -> 1.39).
remove_watermark_reverse_alpha now uses exact width-relative geometry within
_ALPHA_NATIVE_BAND of the captured width and the NCC search only off it -- best
of both: native back to 0.94, other resolutions still aligned.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(doubao): harden alignment -- try fixed+aligned, keep least residual (56/56)

On a faint/busy-background mark the NCC alignment peak can wander a few px off
the true mark and leave a residual (2/56 real corpus files). Off the captured
width, remove_watermark_reverse_alpha now builds BOTH the fixed-geometry and the
NCC-aligned alpha map, applies each, and keeps whichever leaves the least
residual mark (re-detect confidence on the bare reverse-alpha) -- geometry wins
on faint marks, alignment on clear ones, no magic threshold. Real-file round-trip
now removes 56/56 detected Doubao clean across every corpus resolution (was 54).
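
The selection step reduces to "apply every candidate, score the leftovers, keep the minimum". A toy sketch, where `remove` and `residual` are hypothetical callables standing in for the reverse-alpha application and the re-detect confidence:

```python
def keep_least_residual(image, candidates, remove, residual):
    """Apply each candidate alpha map and keep the output with the least residual.

    candidates: dict name -> alpha map; remove(image, alpha_map) -> output;
    residual(output) -> score, lower is cleaner. No threshold is involved.
    """
    best = None
    for name, alpha_map in candidates.items():
        output = remove(image, alpha_map)
        score = residual(output)
        if best is None or score < best[2]:
            best = (name, output, score)
    return best[0], best[1]


# Toy scalars: the "aligned" map lands closer to the true background here.
true_background = 9.3
name, output = keep_least_residual(
    image=10.0,
    candidates={"fixed": 0.5, "aligned": 0.7},
    remove=lambda img, a: img - a,
    residual=lambda out: abs(out - true_background),
)
print(name, output)
```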

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* perf(doubao): skip residual inpaint at native width for exact recovery

At the captured width the fixed-geometry reverse-alpha is pixel-exact, so
inpainting over it only replaced exactly-recovered interior pixels with a
cv2 hallucination. That measured worse on a textured background: native error
against the true background is 1.6 for reverse-alpha only vs 2.6 with the old
always-on full-footprint inpaint. Native now returns the bare recovery untouched;
off-native, where NCC alignment is only sub-pixel-approximate, the footprint
inpaint stays to clean the seam. Real round-trip still 56/56 across all
corpus resolutions; negatives 0/60, Gemini unaffected.

Add test_native_returns_exact_reverse_alpha_no_inpaint as the regression
guard. Sync CLAUDE.md + README (the table cell and prose described the
pre-NCC "skipped off native / cv2-LaMa" behavior, now stale). Gitignore the
session scheduled_tasks.lock, and add the text-protection research note.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 19:49:09 -07:00