mirror of
https://github.com/wiltodelta/remove-ai-watermarks.git
synced 2026-06-13 05:57:46 +02:00
feat: controlnet pipeline for text/face-structure preservation
Add `--pipeline controlnet` (SDXL base + xinsir canny ControlNet via StableDiffusionXLControlNetImg2ImgPipeline): the canny edge map conditions the img2img regeneration so text and face STRUCTURE stay sharp, while the watermark is still removed by the regeneration (`strength`) -- no original pixels are copied or frozen, so SynthID does not survive. Oracle-verified clean on OpenAI with better text/structure fidelity than plain img2img at equal strength. `--controlnet-scale` tunes structure preservation; fp32 on mps/cpu (fp16-fixed VAE on cuda/xpu).

Shares the img2img runner (live progress + MPS->CPU fallback) and the fp16-VAE-fix / device-move helpers with the default pipeline.

Remove the superseded subsystems -- ctrlregen (SD1.5 clean-noise), text-protection (differential / region-hires) and face-protection: they either destroyed real content or shielded the watermark by re-using original pixels. controlnet replaces them by regenerating everything under edge conditioning.

Canny preserves face structure but not identity; face IDENTITY is a separate face-restoration post-pass (CodeFormer/GFPGAN), researched + prototyped but not yet shipped. An IP-Adapter FaceID attempt was built and removed (footgun: needs high strength, corrupts faces at removal strength).

Docs: docs/controlnet-removal-pipeline-research.md, scripts/controlnet_sweep.py.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@@ -23,7 +23,7 @@ If this tool saves you time, consider [sponsoring its development](https://githu
- **AI metadata stripping** — EXIF, PNG text chunks, C2PA provenance manifests (PNG / JPEG / AVIF / HEIF / JPEG-XL, **MP4 / MOV / M4V / M4A** at the container level, and **WebM / MP3 / WAV / FLAC / OGG** losslessly via ffmpeg), XMP DigitalSourceType
- **"Made with AI" label removal** — removes the AI-disclosure metadata that platforms read to apply automatic labels (useful for clearing a false-positive label from a human-edited photograph)
- **Analog Humanizer** — optional film grain and chromatic aberration post-processing
- **Smart Face Protection** — automatic extraction and blending of human faces to prevent AI distortion
- **Text and face preservation** — optional `--pipeline controlnet` adds a canny ControlNet that keeps text and face structure sharp through the removal pass (without copying original pixels, so SynthID is still removed). Note: canny preserves face *structure*, not *identity* (the regenerated face drifts in likeness); preserving identity is a separate face-restoration post-pass, researched but not yet shipped
- **Batch processing** — process entire directories
- **Detection** — three-stage NCC watermark detection with confidence scoring
- **Provenance detection (`identify`)** — aggregate C2PA issuer, the C2PA soft-binding forensic-watermark vendor (Adobe TrustMark, Digimarc, Imatag, ...), IPTC "Made with AI" plus the IPTC 2025.1 `AISystemUsed` field, embedded SD/ComfyUI params, EXIF/XMP generator tags, the xAI/Grok EXIF signature, the China TC260 AIGC label (XMP, PNG chunk, or EXIF), the HuggingFace `hf-job-id` job marker, the SynthID metadata proxy, the visible marks (Gemini sparkle plus the Doubao "豆包AI生成" / Jimeng "即梦AI" text marks), the open SD/SDXL/FLUX invisible watermark, and (with the `trustmark` extra) the open Adobe TrustMark watermark into one origin-platform + watermark-inventory verdict (`--json` for machine output)
@@ -117,9 +117,9 @@ image → encode to latent space (VAE) at native resolution

> **Default strength is vendor-adaptive (no flag needed).** The tool reads the C2PA issuer to detect which vendor's SynthID is present and picks the strength that clears it with the least quality loss: **OpenAI gpt-image → `0.10`**, **Google Gemini → `0.15`**, **unknown source → `0.15`**. An oracle-verified June 2026 study (clean pipeline, per-image openai.com/verify or Gemini app) found OpenAI's watermark clears at `0.05` across `1024`-`1600` px (resolution-independent) while Google's is ~3x more robust and needs `0.15`. The dominant factor is the vendor, not resolution. There is no local SynthID detector, so if the oracle still reads SynthID, raise `--strength`; if you care more about preserving fine text, lower it. (Caveat: Google's `0.15` was validated on the capped `--max-resolution 1536` path; a very large native Gemini image may need more.)
>
> **Text and face protection are OFF by default.** The high-resolution text re-scrub can shield SynthID in text regions, leaving the watermark intact there even after the global pass clears it everywhere else (verified June 2026: same image, with `--protect-text` → SynthID detected; without → SynthID removed). Both features are opt-in with `--protect-text` / `--protect-faces` and considered **experimental**. If you enable them, verify the result with the oracle.
> **`--pipeline controlnet` preserves text and face structure.** It runs the same SDXL img2img scrub but adds a canny ControlNet that conditions the regeneration on the image's edge map, so text and structure stay sharp at the strengths that remove SynthID. The watermark removal still comes from the img2img regeneration (`--strength`); the ControlNet only preserves structure — no original pixels are copied or frozen, so SynthID does not survive. `--controlnet-scale` tunes the preservation strength (higher = closer to the original structure). Runs fp32 on mps/cpu (fp16 only on cuda/xpu, where the fp16-fixed SDXL VAE is loaded automatically).
>
> **`--pipeline ctrlregen` is experimental and not recommended.** On paper CtrlRegen ([ICLR 2025](https://github.com/yepengliu/CtrlRegen)) regenerates from near-clean Gaussian noise to defeat robust watermarks, but in testing on real images it **destroys content** — smooth and background regions fill with hallucinated micro-text — and it is heavy (several GB of extra models, minutes per image). It has no usable middle setting (too low removes nothing, high enough to remove wrecks the image), so the shippable path is the default SDXL pipeline at the vendor-adaptive strength. CtrlRegen stays available for experimentation only.
> **Face identity is not preserved yet.** Canny preserves where a face is, but not who it is — the regenerated face drifts in likeness. (An IP-Adapter FaceID approach was tried and removed: it needs high denoise strength and corrupts faces at the low strength used for removal.) The validated direction is a separate face-restoration post-pass (CodeFormer/GFPGAN at a low fidelity weight, run after the removal pass — it re-synthesizes each face from a codebook, so it scrubs the watermark while holding identity) — researched and prototyped (see `docs/controlnet-removal-pipeline-research.md`) but not yet shipped.
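
The vendor-adaptive default described in the first callout amounts to a small lookup keyed on the C2PA issuer. A minimal sketch, assuming the issuer is available as a plain string; `pick_strength`, the table names, and the substring matching are hypothetical illustrations and not the tool's actual code. Only the three strength values come from the text above.

```python
# Strength values quoted in the callout; everything else here is illustrative.
VENDOR_STRENGTH = {"openai": 0.10, "google": 0.15}
UNKNOWN_STRENGTH = 0.15


def pick_strength(c2pa_issuer, override=None):
    """Return the img2img strength for an image.

    `override` models an explicit --strength flag and always wins.
    `c2pa_issuer` is the issuer string read from the manifest, or None.
    """
    if override is not None:
        return override
    issuer = (c2pa_issuer or "").lower()
    for vendor, strength in VENDOR_STRENGTH.items():
        # Hypothetical matching rule: vendor name appears in the issuer string.
        if vendor in issuer:
            return strength
    return UNKNOWN_STRENGTH
```

An unknown or missing issuer falls back to the more conservative Google value, which matches the "unknown source" default quoted above.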

SDXL has been the default since May 2026: it empirically defeats SynthID v2 on Gemini 3 Pro outputs, where the older SD-1.5 pipeline at 768 px did not. The SD-1.5 path was removed once it was verified not to handle v2. Note the scope: this defeats the SynthID *verifier*, which is not the same as being forensically indistinguishable from a real photo. Recent work ([arXiv:2605.09203](https://arxiv.org/abs/2605.09203)) shows watermark-removal pipelines leave detectable traces, so a separate "this image was processed" classifier can still flag the output.

@@ -127,12 +127,10 @@ SDXL is the default since May 2026: empirically defeats SynthID v2 on Gemini 3 P

> **Technical deep-dive:** see [`docs/synthid.md`](docs/synthid.md) for a primary-source-cited breakdown of how SynthID works mechanically (post-hoc encoder/decoder, 136-bit payload, pixel-space embedding), what it empirically survives (JPEG, crop, resize: ~99.98% TPR at 0.1% FPR from arXiv:2510.09263), what removes it, and the forensic-stealth tradeoff (all known removal attacks are detectable at >98% TPR@1%FPR per arXiv:2605.09203).

**Face Protection** (experimental, opt-in `--protect-faces`): before diffusion, YOLO detects people in the image and extracts them; after diffusion the original faces are blended back. Off by default — enable only when face fidelity matters more than SynthID removal completeness.
**Text and face preservation** (opt-in `--pipeline controlnet`): adds a canny ControlNet so text and face *structure* stay sharp through the removal pass, without copying or freezing any original pixels (so SynthID is still removed). Tune the preservation strength with `--controlnet-scale`. Canny preserves structure but not face *identity* (preserving identity is a future face-restoration post-pass, not yet shipped — see the callout above).

**Analog Humanizer**: optional film grain and chromatic aberration injection that mimics a photo of a screen, raising the bar for AI-generated image classifiers. (It frustrates generic classifiers but does not guarantee forensic invisibility — see the [arXiv:2605.09203](https://arxiv.org/abs/2605.09203) note above.)

**Text Protection** (experimental, opt-in `--protect-text`): re-scrubs detected text blocks at high resolution after the global pass to keep small glyphs crisp. **Off by default** because the high-resolution re-scrub can preserve SynthID in text regions even after the global pass removes it elsewhere. Enable only when text fidelity matters more than watermark removal completeness, and verify the oracle result. SDXL pipeline only.

### Stripping C2PA, EXIF, and "Made with AI" metadata

AI tools embed generation metadata that social platforms use to show "Made with AI" labels:
@@ -274,8 +272,8 @@ remove-ai-watermarks invisible image.png -o clean.png --humanize 4.0
# Runs at native resolution by default. On a very large image that OOMs the
# GPU/MPS, cap the long side: --max-resolution 2048
# Strength is vendor-adaptive by default (OpenAI 0.10 / Google 0.15); override
# with --strength. Text/face protection is opt-in (--protect-text /
# --protect-faces, experimental: they can shield SynthID).
# with --strength. To preserve text/face structure, use --pipeline controlnet
# (SDXL + canny ControlNet); tune preservation with --controlnet-scale. Add

# Check / strip AI metadata (C2PA, EXIF, "Made with AI" labels)
# --check also flags SynthID-bearing sources: a C2PA manifest signed by
@@ -341,7 +339,7 @@ pip install certifi

- [noai-watermark](https://github.com/mertizci/noai-watermark) by mertizci — invisible watermark removal engine
- [GeminiWatermarkTool](https://github.com/allenk/GeminiWatermarkTool) by Allen Kuo (MIT) — visible watermark removal algorithm
- [CtrlRegen](https://github.com/yepengliu/CtrlRegen) by Liu et al. (ICLR 2025) — controllable regeneration pipeline
- [controlnet-canny-sdxl-1.0](https://huggingface.co/xinsir/controlnet-canny-sdxl-1.0) by xinsir — SDXL canny ControlNet used by the `controlnet` pipeline to preserve text/face structure
- NeuralBleach (MIT) — analog humanizer technique

## Roadmap

@@ -0,0 +1,539 @@
# ControlNet-as-removal-pipeline research: can structure-conditioned regeneration scrub SynthID and keep text?

Date: 2026-06-02. Source: a manual primary-source pass (WebSearch + WebFetch over the
watermark-removal-attack and SDXL-ControlNet literature). Prompted by issue #35
(@newideas99 / Jacob): "as we use SDXL even at low strength that kills small text ... Do you
think ControlNet could be added to preserve and still remove the watermark?" Clarified scope:
Jacob means **replacing the removal pipeline itself** with a ControlNet-conditioned
regeneration (structure held by the control signal), NOT a separate text-protection add-on.

A deep-research workflow run was attempted first (`wf_3244411d-ffd`) and failed at the harness
level (97 agents completed without emitting StructuredOutput; ~4.3 M tokens, no report). This
note is the hand-run replacement.

## The question, precisely

Can a single full-image ControlNet-conditioned diffusion pass **replace** plain SDXL base 1.0
img2img as the watermark remover, so that one structure-guided regeneration removes the
invisible robust pixel watermark (Google SynthID) **everywhere** while keeping fine detail and
small/CJK **text** legible across the whole image? The hard constraint is unchanged from
`text-protection-research.md`: the watermark must be scrubbed everywhere including inside text,
so any path that freezes or composites original text pixels is disqualified.

## Executive summary

The idea is **already academically validated as a watermark remover and is literally what we
already ship** — CtrlRegen (ICLR 2025) is a canny-ControlNet + DINOv2-semantic pipeline that
regenerates from clean noise. But the make-or-break gap is exact: **none of the
watermark-removal papers validate TEXT or fine-detail preservation at all** — CtrlRegen reports
only FID/PSNR/quality-model scores and explicitly contains no text, fine-detail, or
hallucination analysis. Our shipped ctrlregen's empirical failure ("destroys real content,
hallucinates micro-text in smooth regions") is precisely this unstudied failure mode of the
published method, most likely driven by our **512 px tiling** (text occupies too few pixels per
tile to regenerate legibly; edge-free smooth regions get DINOv2-semantic hallucination). The
constructive path is NOT to keep fixing the SD1.5 CtrlRegen, but to port the structure-control
idea onto an **SDXL-native** ControlNet (xinsir tile-sdxl / ControlNet-Union-SDXL) as a control
add to our existing SDXL base 1.0 img2img, run it at **1024+, not 512 tiles**, and empirically
sweep the (denoise strength x conditioning scale x resolution) cube against the SynthID oracle
AND a text-legibility check. The central tension may be fundamental and must be measured: the
conditioning strong enough to keep text legible may suppress regeneration enough to let SynthID
survive; the regeneration strong enough to scrub may deform text regardless of edges.

## Findings (with confidence and sources)

### Finding 1 — confidence: high

**Claim.** "ControlNet as the removal pipeline" is exactly CtrlRegen (ICLR 2025), and our
shipped `ctrlregen` profile is a faithful implementation of it. Its **spatial control is canny
edges** extracted from the watermarked image; its **semantic control is DINOv2-giant** via a
trainable projection + decoupled cross-attention. Clean-noise (full-strength) regeneration
scrubs the watermark from both pixel and latent space while the two control nets hold structure.

**Evidence.** CtrlRegen: spatial control "conditioned on Canny edge images extracted from the
watermarked image," integrated into the U-Net decoder blocks via a ControlNet structure;
semantic control on "DINOv2-giant" embeddings. Removal is strong: TPR@1%FPR driven from 1.00 ->
0.01 (StegaStamp) and 0.99 -> 0.12 (TreeRing). This matches our `ctrlregen/engine.py` exactly
(canny detector + `facebook/dinov2-giant` + spatial ControlNet from `yepengliu/ctrlregen`).

**Sources.** https://arxiv.org/html/2410.05470v1 · https://github.com/yepengliu/CtrlRegen ·
https://openreview.net/forum?id=mDKxlfraAn

### Finding 2 — confidence: high

**Claim.** Regeneration provably removes any bounded-perturbation pixel watermark **given enough
noise** — the operative constraint is the amount of regeneration, which is the same knob that
trades against fidelity.

**Evidence.** Zhao et al., "Invisible Image Watermarks Are Provably Removable Using Generative
AI" (NeurIPS 2024): a noise-then-reconstruct regeneration attack "guarantees the removal of any
invisible watermark" that perturbs the image within a bounded L2 distance. The guarantee is a
function of injected noise magnitude — low noise preserves detail but leaves the watermark; high
noise scrubs but discards original signal. This is the knob ControlNet conditioning is meant to
make survivable (push regeneration high while the control signal holds composition).

**Sources.** https://arxiv.org/abs/2306.01953 · https://github.com/XuandongZhao/WatermarkAttacker

### Finding 3 — confidence: high

**Claim.** The make-or-break gap: **no watermark-removal paper validates text or fine-detail
preservation.** CtrlRegen's "high perceptual quality" is FID/PSNR/quality-model only and
explicitly omits text, fine-detail, and hallucination analysis. So the literature does NOT
support the specific claim Jacob needs (text survives); it is simply unmeasured.

**Evidence.** CtrlRegen reports CLIP-FID, PSNR, Q-Align, LIQE; the fetched analysis confirms
"the paper contains no discussion of text preservation, fine-detail retention, or hallucination
artifacts," and "explicitly avoids discussing failure modes." Pixel metrics like PSNR are
acknowledged not to reflect perception, and text legibility is a different axis than FID.

**Sources.** https://arxiv.org/html/2410.05470v1

### Finding 4 — confidence: medium-high

**Claim.** Resolution is the prime suspect for our shipped ctrlregen's content destruction. We
tile to **512 px** and run full clean-noise per tile; at 512 px text occupies too few pixels per
tile to regenerate legibly, and smooth edge-free regions (no canny signal) are filled by the
DINOv2 semantic prior, which hallucinates texture/micro-text. The paper omits resolution
entirely, so this is an implementation regime it never characterized.

**Evidence.** Our `ctrlregen/engine.py`: `PROCESS_SIZE = 512`, `TILE_SIZE = 512`, full strength
on each tile. This mirrors the `_run_region_hires` insight (text needs MORE pixels under
regeneration so strokes exceed the VAE's ~8 px latent floor), but ctrlregen runs the regeneration
at LOW res, the opposite. CtrlRegen's paper gives no resolution/tiling spec to contradict this.

**Sources.** internal (`src/remove_ai_watermarks/noai/ctrlregen/engine.py`); resolution-omission
confirmed against https://arxiv.org/html/2410.05470v1

### Finding 5 — confidence: high

**Claim.** SDXL-native ControlNets exist, so the removal-pipeline upgrade need NOT be the SD1.5
re-architecture our current ctrlregen is. xinsir `controlnet-tile-sdxl-1.0` and
`controlnet-union-sdxl-1.0` (ControlNet++) run on SDXL base 1.0. The tile model has a `tile_var`
image-variation mode purpose-built to regenerate detail while preserving structure, at
`controlnet_conditioning_scale = 1.0`, optimal 1024 px. This is a drop-in control add to our
existing SDXL img2img.

**Evidence.** xinsir tile-sdxl model card: use cases = deblur/detail-repaint, **image variation
(preserving structure)**, super-resolution; `controlnet_conditioning_scale = 1.0`, ~30 steps,
optimal 1024x1024, works with `madebyollin/sdxl-vae-fp16-fix` (the same VAE our fp16 path
already swaps in). ControlNet-Union-SDXL / ControlNet++ merges 10+ control types (canny, HED,
tile, depth, lineart) into one SDXL model.

**Sources.** https://huggingface.co/xinsir/controlnet-tile-sdxl-1.0 ·
https://huggingface.co/xinsir/controlnet-union-sdxl-1.0 · https://github.com/xinsir6/ControlNetPlus

### Finding 6 — confidence: high

**Claim.** The community tile-ControlNet upscale workflow runs at **LOW denoise (0.3-0.4)** —
the wrong regime for watermark removal. It preserves detail precisely by regenerating little, so
a naive tile-upscale preserves text AND preserves the watermark. The open empirical question is
whether at `conditioning_scale ~1.0` you can push denoise high enough to scrub SynthID while the
tile conditioning still holds text — the exact cell to test.

**Evidence.** Stable-Diffusion-Art ControlNet-tile upscale: denoise "typically 0.3, max ~0.4 to
avoid artifacts"; some users push 0.6 with ControlNet strength 0.5. Our own data: SynthID
survives below the removal-strength threshold (current Gemini needs notably higher denoise than
the tile-upscale regime). So the detail-preserving regime and the watermark-scrubbing regime are
on opposite ends of the denoise axis; ControlNet conditioning is the bet that they can meet.

**Sources.** https://stable-diffusion-art.com/controlnet-upscale/ ·
internal (`docs/synthid.md` strength data)

### Finding 7 — confidence: high

**Claim.** Forensic-stealth caveat: diffusion-based regeneration is among the MOST detectable
removal families. Even a ControlNet-regeneration that fools the SynthID oracle leaves forensic
traces flagging the output as "removal-processed" at >98% TPR@1%FPR. This bounds the claim (do
not over-promise "indistinguishable from an original") but does not block the use case — the
SynthID oracle still reads negative.

**Evidence.** "Removing the Watermark Is Not Enough: Forensic Stealth in Generative-AI Watermark
Removal" (arXiv:2605.09203, Goonatilake & Ateniese, GMU): across six removal attacks including
diffusion-based regeneration, independent forensic detectors separate removal-processed from
clean content at >98% TPR under a 1% FPR budget.

**Sources.** https://arxiv.org/html/2605.09203v1
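
The `TPR@1%FPR` figure quoted in Findings 1 and 7 is the detection rate at a threshold that flags at most 1% of clean images. A minimal sketch of that metric, assuming higher scores mean "more likely processed"; the function name and the toy scores are illustrative and are not data from either paper.

```python
def tpr_at_fpr(positive_scores, negative_scores, max_fpr=0.01):
    """True-positive rate at the loosest threshold whose FPR stays within budget."""
    if not positive_scores or not negative_scores:
        raise ValueError("both score lists must be non-empty")
    ranked = sorted(negative_scores, reverse=True)
    # Number of clean samples we may flag without exceeding the FPR budget.
    allowed = int(max_fpr * len(ranked))
    if allowed >= len(ranked):
        return 1.0  # budget lets us flag everything
    # Flag only scores strictly above this cutoff, so at most `allowed`
    # clean samples can be flagged even when scores tie.
    cutoff = ranked[allowed]
    flagged = sum(1 for score in positive_scores if score > cutoff)
    return flagged / len(positive_scores)


# Toy scores (made up): ten clean samples, four processed samples.
clean = list(range(10))
processed = [9, 9, 5, 10]
rate = tpr_at_fpr(processed, clean, max_fpr=0.1)
```

Read this way, ">98% TPR@1%FPR" means the forensic detector catches almost every processed image while mislabeling about one clean image in a hundred.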

### Finding 8 — confidence: low (watch, do not build on yet)

**Claim.** Partial/semantic-guided regeneration is an active sub-direction that explicitly targets
the removal-vs-fidelity tradeoff, but the specific fidelity-on-text claims were not verifiable
from the source in this pass.

**Evidence.** "Removing Watermarks with Partial Regeneration using Semantic Information"
(arXiv:2505.08234) proposes focusing regeneration on watermarked regions with semantic (VLM)
conditioning to preserve untouched areas; the PDF body did not render cleanly enough to confirm
its quantitative text/detail results. Treat as a pointer, not evidence.

**Sources.** https://arxiv.org/pdf/2505.08234

## Recommendation / decision

**ControlNet-as-removal-pipeline is worth prototyping — but not by fixing the SD1.5 ctrlregen.**
Port the structure-control idea onto an SDXL-native ControlNet as a control add to the existing
SDXL base 1.0 img2img, run it at full resolution (1024+, NOT 512 tiles), and treat the
text-vs-scrub tension as an empirical question to measure, not assume.

**Prototype (runs locally on 32 GB MPS — no dedicated GPU required):**

Compute is NOT the bottleneck. On a 32 GB Apple-silicon machine (M5 here) native SDXL already
runs entirely on MPS with no CPU fallback (~155 s at 1122x1402, verified — see `synthid.md` /
CLAUDE.md). The prototype runs at **1024** (fewer pixels than that) with SDXL base + an SDXL
ControlNet + activations in **fp32** (MPS fp16 decodes to all-black NaN — issue #29 — confirmed
on run 1 below; fp32 is the required default on mps/cpu) — fits the 32 GB budget with vae-tiling +
attention-slicing; ~1-2 min/image, so a coarse sweep is a sub-hour background run. A dedicated GPU
is needed ONLY for the separate native-large-Gemini (2816 px) case, which OOMs even without a
ControlNet (that stays a raiw.cc GPU task). The genuine external dependency is NOT compute but the
**manual SynthID oracle**: there is no local SynthID detector, so removal is verified by hand in
the Gemini app ("Verify with SynthID") per image, regardless of where the diffusion runs.

Runner: **`scripts/controlnet_sweep.py`** (built 2026-06-02) implements exactly this sweep —
SDXL base 1.0 + an SDXL-native ControlNet img2img, one output per (control x strength x scale)
cell, plus a `sweep_index.csv` with empty `synthid_oracle` / `text_legible` columns to fill by
hand. It uses the dedicated single-type xinsir models (`controlnet-canny-sdxl-1.0`,
`controlnet-tile-sdxl-1.0`) rather than the Union model to keep the diffusers API path robust.

    uv run python scripts/controlnet_sweep.py watermarked.png -o sweep_out

1. SDXL base 1.0 img2img + `xinsir/controlnet-canny-sdxl-1.0` / `controlnet-tile-sdxl-1.0`
   (sweep both `tile` and `canny` control), full image at 1024, `sdxl-vae-fp16-fix`.
2. Sweep the cube on fresh Gemini + gpt-image inputs that contain small/CJK text:
   - denoise strength {0.15, 0.3, 0.5, 0.7, 1.0}
   - `controlnet_conditioning_scale` {0.5, 0.8, 1.0}
   - control type {tile, canny}
3. Per cell, measure BOTH axes:
   - **removal**: Gemini app "Verify with SynthID" oracle (the only valid SynthID oracle; for
     gpt-image also openai.com/verify for provenance) — must read clean.
   - **text**: OCR round-trip / visual legibility of the small text.
   - secondary: SSIM/FID vs original for global fidelity.
4. Find the Pareto cell where the oracle is clean AND text stays legible.
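
The sweep cube in step 2 and the hand-filled index can be sketched as follows. This is a minimal illustration, not the contents of `scripts/controlnet_sweep.py`: the grid values and the `synthid_oracle` / `text_legible` column names come from this note, while the other column names, the output filename pattern, and the helper names are assumptions.

```python
import csv
import io
from itertools import product

# Grid values from step 2 above.
CONTROLS = ("tile", "canny")
STRENGTHS = (0.15, 0.3, 0.5, 0.7, 1.0)
SCALES = (0.5, 0.8, 1.0)


def build_sweep_rows():
    """One row per (control x strength x scale) cell, verdict columns left empty."""
    rows = []
    for control, strength, scale in product(CONTROLS, STRENGTHS, SCALES):
        rows.append({
            "control": control,
            "strength": strength,
            "scale": scale,
            # Hypothetical filename pattern for the cell's output image.
            "output": f"{control}_s{strength:.2f}_c{scale:.2f}.png",
            "synthid_oracle": "",  # filled by hand after the oracle check
            "text_legible": "",    # filled by hand after visual inspection
        })
    return rows


def write_index(rows, handle):
    """Write the rows as CSV to any text file-like object."""
    writer = csv.DictWriter(handle, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)


rows = build_sweep_rows()
buffer = io.StringIO()  # stands in for open("sweep_index.csv", "w", newline="")
write_index(rows, buffer)
```

The full cube is 2 x 5 x 3 = 30 cells, so at the ~1-2 min/image quoted above the whole grid fits in the sub-hour budget.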

**The honest fork the prototype resolves:**
- If such a cell exists -> the answer to Jacob is YES, ship an SDXL-native ControlNet removal
  profile (replacing the SD1.5 ctrlregen) tuned to that cell.
- If no cell clears both (the tension is fundamental: scrub-strength always deforms text, or
  text-preserving conditioning always spares the watermark) -> the canny/tile-ControlNet middle
  path is dead for text, and the standing answer reverts to `text-protection-research.md`: a full
  **glyph-conditioned re-render** (EasyText / TextSR on a FLUX-DiT base) is required, which is a
  base-model migration, not a control add.

**Do not:** keep tuning the 512 px SD1.5 ctrlregen for text (wrong resolution, wrong base model);
run tile-ControlNet at the community 0.3-0.4 upscale denoise and expect watermark removal (that
regime preserves the watermark); over-claim forensic invisibility (Finding 7).

## Prototype run 1 — 2026-06-02 (text axis measured; watermark axis pending the oracle)

First sweep on a real, SynthID-positive, text-dense input: the corpus tokyo-street-night
gpt-image (`88e61a38-chatgpt_tokyo.png`, 1023x1537 -> 680x1024, dense small CJK + Latin neon
signage; SynthID + C2PA confirmed, so its valid oracle is openai.com/verify). Grid: control
{canny, tile} x strength {0.3, 0.5, 0.7, 1.0} x `conditioning_scale` 1.0, fp32 on MPS. Outputs +
`sweep_index.csv` (text verdicts filled by visual inspection; `synthid_oracle` left for the
manual run) are under `/tmp/cnsweep/` (not committed — derived regenerations of corpus content).
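
The 1023x1537 -> 680x1024 resize above is consistent with fitting the long side to 1024 and snapping each side down to a multiple of 8 (SDXL's VAE downsamples by a factor of 8). A minimal sketch under that assumption; the function name and the exact rounding rule are guesses that happen to reproduce the quoted size, not the script's confirmed behavior.

```python
def fit_long_side(width, height, target=1024, multiple=8):
    """Scale so the long side equals `target`, then snap both sides down to `multiple`."""
    long_side = max(width, height)
    # Integer arithmetic avoids float round-off on the exact long side.
    new_w = width * target // long_side
    new_h = height * target // long_side

    def snap(value):
        # Round down to the nearest multiple, but never below one block.
        return max(multiple, value // multiple * multiple)

    return snap(new_w), snap(new_h)


size = fit_long_side(1023, 1537)
```

Note that this helper also upscales inputs whose long side is below the target; a real pipeline may prefer to leave small images at native size.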

**Measured — PSNR vs input (proxy for how much was regenerated):**
- canny: 0.3 -> 16.91, 0.5 -> 15.91, 0.7 -> 14.82, 1.0 -> 13.22 (monotonic drop = progressively
  more regeneration as strength rises; canny only pins edges, so flat regions change).
- tile: 0.3 -> 17.89, 0.5 -> 17.84, 0.7 -> 17.83, 1.0 -> 17.74 (**flat and high — near-identity
  even at strength 1.0**; tile@scale1.0 pins the whole image to the input and barely regenerates).
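
For reference, PSNR is `10 * log10(peak^2 / MSE)`, so a lower value means the output moved further from the input. A minimal pure-Python sketch on flat sequences of 8-bit pixel values; this note does not say how the sweep script computes PSNR, so the function below is an illustration of the metric rather than the script's code.

```python
import math


def psnr(reference, candidate, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(candidate) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    # Mean squared error over all pixel values.
    mse = sum((a - b) ** 2 for a, b in zip(reference, candidate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak * peak / mse)
```

On a real image the sequences would be the flattened channel values of the input and the regenerated output at the same resolution.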

**Measured — text legibility (visual, focused on SMALL text; large high-contrast glyphs survive
everything because canny/tile hold their edges):**
- canny: legible at 0.3, softening at 0.5 (partial), garbling at 0.7, hallucinated pseudo-glyphs
  at 1.0 ("NEC" -> "NWENES"). Same plain-img2img small-text deformation, only big text protected.
- tile: near-identity through 0.7, only tiny alterations at 1.0 — small text preserved throughout.

**Reading (the make-or-break tension, now visible in the data):**
- **tile@scale1.0 does not actually regenerate** (flat PSNR), so it preserves all text but almost
  certainly leaves the watermark intact — it is a near-identity pass, exactly the community
  "tile-upscale preserves detail by not regenerating" regime (Finding 6), confirmed.
- **canny@scale1.0 regenerates progressively** (PSNR drops) and so could scrub — but small text
  breaks at exactly the strength where scrubbing would start to bite. canny saves big edges, not
  sub-stroke small text.
- Net on the text axis: neither cell at scale 1.0 cleanly gives "high regeneration + legible small
  text." This is the literature prior (Findings 3, 6) reproduced empirically. Lowering
  `conditioning_scale` to force small-text regeneration is the same tradeoff knob, not an escape.

**Still pending (the decisive half, cannot be done locally):** run the 8 cells through the SynthID
oracle and fill `synthid_oracle`. The most informative cells: canny 1.0 (text dead — does it at
least scrub? if not, the canny path is dead outright), canny 0.5 (text partial — does it scrub?),
tile 1.0 (text perfect — predicted to still read present). If no cell is `oracle=clean` AND
`text=yes`, the fork resolves to the glyph-re-render path (`text-protection-research.md`).
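
Once both columns are filled, the decision rule above is a simple filter. A minimal sketch, assuming the verdicts are recorded as the strings `clean` and `yes` used in this note; `pick_cell` and the tie-break on lowest strength are hypothetical, and the rows below are made-up placeholders, not oracle results (run 1's oracle column is still empty).

```python
def pick_cell(rows):
    """Return the passing cell with the lowest strength, or None if no cell passes."""
    passing = [
        row for row in rows
        if row["synthid_oracle"] == "clean" and row["text_legible"] == "yes"
    ]
    if not passing:
        return None  # the fork resolves to the glyph-re-render path
    # Assumed tie-break: prefer the least regeneration that still scrubs.
    return min(passing, key=lambda row: float(row["strength"]))


# Placeholder verdicts for illustration only.
example = [
    {"control": "demo", "strength": "0.3", "synthid_oracle": "present", "text_legible": "yes"},
    {"control": "demo", "strength": "0.5", "synthid_oracle": "clean", "text_legible": "yes"},
    {"control": "demo", "strength": "0.7", "synthid_oracle": "clean", "text_legible": "no"},
]
best = pick_cell(example)
```

The rows have the same shape as what `csv.DictReader` yields from `sweep_index.csv`, so the filter can run directly on the filled-in index.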

**Incidental bug caught:** the first run used fp16 on MPS (the script's original default) and
produced **all-black** outputs across every cell (2 KB PNGs, PSNR 9.22 flat) — the issue #29
fp16-VAE-NaN failure, and the fp16-fix VAE did not save it on MPS. Fixed `scripts/controlnet_sweep.py`
to default fp32 on mps/cpu (fp16 only on cuda/xpu), matching the production pipeline.
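
The resulting precision rule is small enough to state as code. A minimal sketch, assuming the device arrives as a PyTorch-style string such as `cuda:0`; the function name is hypothetical and it returns dtype names as strings so the example runs without torch installed (a real implementation would presumably return `torch.float16` / `torch.float32`).

```python
def pick_dtype(device):
    """fp16 only on cuda/xpu; fp32 everywhere else, including mps and cpu."""
    # Strip an optional device index such as ":0".
    kind = device.split(":")[0].lower()
    return "float16" if kind in {"cuda", "xpu"} else "float32"
```

Defaulting unknown devices to fp32 is the conservative choice here: it costs memory and speed but avoids the all-black NaN decode described above.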

## Tuning ControlNet for text preservation across image types (research 2026-06-03)

Goal: how to configure the canny-ControlNet path to best preserve text (and faces) on diverse
images. Primary sources: diffusers ControlNet doc, the ControlNet paper (arXiv:2302.05543),
xinsir model cards, practitioner guides. The **critical reframe**: almost all community ControlNet
advice optimizes a txt2img *generation* tradeoff (control vs creative freedom). OUR context is
img2img *watermark removal*, where the objective is the opposite -- maximum faithful preservation
while regenerating just enough to scrub. So several common recommendations INVERT here.

**Removal is `strength`; everything below is preservation and does not change removal efficacy**
(only the watermark-shielding risk -- see the caveat). Set `strength` by the oracle/vendor need;
tune these to keep text/faces intact at that strength.

Knobs, ranked by impact for text:

1. **Canny edge density (the per-image lever, currently hardcoded `_CANNY_LOW=100`/`_CANNY_HIGH=200`).**
   Lower thresholds capture more/finer edges; higher thresholds keep only major outlines (diffusers
   doc + practitioner guides; ControlNet paper uses 100/200 as the default). Small-text strokes and
   fine facial features fall below the default 100/200 and are missed. **For dense small text
   (infographics, signage) lower the thresholds (~50/120, even 30/100 for facial likeness per
   practitioner tests); for high-contrast large text 100/200 already suffices.** Denser canny is
   still a BINARY thresholded edge map, so it does not carry the low-amplitude SynthID pixel pattern
   -- it passes more shape, not the watermark (still oracle-verify). This is the single highest-value
   unexplored lever and should become a CLI knob.

2. **`controlnet_conditioning_scale` -> keep at 1.0 (max structure hold).** Community defaults to 0.5
   for creative balance; we want maximum preservation, so 1.0 (xinsir canny/tile cards also recommend
   1.0). We measured text on a clean high-contrast image surviving across strength 0.1-0.5 at scale
   1.0 (PSNR ~26 flat), so scale 1.0 is the right default; only lower it if a specific image needs
   more regeneration to scrub (raises shielding risk the other way).

3. **`control_guidance_start=0.0`, `control_guidance_end=1.0` (full window) -- KEEP, do not shorten.**
   The common "end=0.5: establish structure early then let the model render detail freely" is a
   creative-generation recipe; for text it is HARMFUL -- the late free steps re-render and deform the
   glyphs. We want the edge control active through ALL denoise steps so text stays pinned. (Our
   pipeline already uses the 0->1 default; the point is to NOT adopt the shorten-the-window advice.)

4. **Control type, per image type:**
   - **Text / graphics / high-contrast -> canny** (the literature's reliable choice for defined edges
     and text; what we ship).
   - **Faces / smooth tonal content -> soft-edge / HED is a candidate worth testing.** Canny's hard
     binary threshold fractures smooth skin gradients; HED/soft-edge gives gradual edges that may hold
     faces better. UNVERIFIED for removal (softer edges may carry slightly more original signal ->
     oracle-check). A face-heavy image is the test (gemini group photos).
   - **tile -> NOT for removal.** It is near-identity (detail-enhancement at low denoise); it shields
|
||||
the watermark (measured flat PSNR ~17.8 across strength on the tokyo sweep). Do not use it as the
|
||||
removal control.
|
||||
|
||||
5. **Resolution** -- higher long-side = strokes span more VAE latent cells = less softening, while
|
||||
still fully regenerating. Already a knob (`--max-resolution`); for tiny text prefer native/large.
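
To make the resolution argument in item 5 concrete, the arithmetic below estimates how many VAE latent cells a text stroke spans after the image is resized to a given long side. It is a back-of-the-envelope sketch: it assumes the SDXL VAE's 8x spatial downsampling, and the 2816 px source and 6 px stroke width are illustrative inputs, not measurements.

```python
VAE_DOWNSAMPLE = 8  # assumption: one SDXL latent cell covers an 8x8 pixel block


def stroke_latent_cells(stroke_px: float, native_long_side: int, target_long_side: int) -> float:
    """Width of a stroke, in latent cells, after resizing to target_long_side."""
    scale = target_long_side / native_long_side
    return stroke_px * scale / VAE_DOWNSAMPLE


for target in (1024, 1536, 2816):
    cells = stroke_latent_cells(stroke_px=6, native_long_side=2816, target_long_side=target)
    print(f"long side {target}: stroke spans {cells:.2f} latent cells")
```

A stroke narrower than one latent cell shares that cell with its background, which is one plausible reason tiny text softens at a low `--max-resolution`.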
|
||||
|
||||
**Multi-ControlNet (canny + soft-edge), list scales e.g. `[1.0, 0.8]`** (diffusers MultiControlNet):
|
||||
could hold text edges AND face geometry at once, but doubles ControlNet memory/latency and raises the
|
||||
shielding risk; defer to a v2 after the single-canny path is dialed in.
|
||||
|
||||
**Image-type playbook (proposed, to validate with the oracle):**
|
||||
- Clean high-contrast text (openai_1-style): canny 100/200, scale 1.0, full window -- already optimal.
|
||||
- Dense small text / infographics (big_pic3, neon signage): canny **lower thresholds (~50/120)**,
|
||||
scale 1.0, full window, larger resolution.
|
||||
- Faces / portraits: try **soft-edge/HED** control, scale 1.0; or multi-ControlNet canny+softedge.
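
The playbook above could be encoded as a small preset table. This is a sketch only: `ControlPreset`, `PLAYBOOK`, `preset_for` and the image-type keys are hypothetical names, the values simply mirror the proposed (unvalidated) settings, and nothing here is wired into the shipped pipeline.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ControlPreset:
    """Proposed (unvalidated) ControlNet settings for one image type."""

    control: str                 # "canny" or "softedge"
    canny_low: Optional[int]     # None when the control is not canny
    canny_high: Optional[int]
    scale: float = 1.0           # controlnet_conditioning_scale
    guidance_start: float = 0.0  # control_guidance_start
    guidance_end: float = 1.0    # control_guidance_end (full window)


# Keys are hypothetical labels; values mirror the playbook bullets.
PLAYBOOK = {
    "clean_text": ControlPreset("canny", 100, 200),
    "dense_text": ControlPreset("canny", 50, 120),
    "faces": ControlPreset("softedge", None, None),
}


def preset_for(image_type: str) -> ControlPreset:
    """Look up a preset, falling back to the shipped canny 100/200 default."""
    return PLAYBOOK.get(image_type, PLAYBOOK["clean_text"])


print(preset_for("dense_text"))
```

Every preset keeps scale 1.0 and the full 0.0-1.0 guidance window; only the control type and canny thresholds vary per image type, and each row still needs an oracle check.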
|
||||
|
||||
**Hard caveat:** every change that increases preservation (higher scale, denser canny, fuller window,
|
||||
softer edges) marginally REDUCES effective regeneration and so raises the chance the watermark
|
||||
survives -- exactly the shielding failure mode. There is no local SynthID detector, so each tuning
|
||||
change must be re-confirmed on the oracle. These are img2img-context recommendations derived from
|
||||
generation-context sources plus our own measurements; treat the playbook as hypotheses to verify, not
|
||||
settled defaults.
|
||||
|
||||
**Sources.** https://huggingface.co/docs/diffusers/en/using-diffusers/controlnet ·
|
||||
https://arxiv.org/pdf/2302.05543 · https://huggingface.co/xinsir/controlnet-canny-sdxl-1.0 ·
|
||||
https://huggingface.co/xinsir/controlnet-tile-sdxl-1.0 · https://blog.cephalon.ai/canny-and-softedge/
|
||||
|
||||
## FaceID research: identity-preserving face conditioning (research 2026-06-03)
|
||||
|
||||
Motivation: canny alone preserves face STRUCTURE/position better than plain SDXL but does NOT hold
|
||||
IDENTITY -- verified on a real Gemini group photo (gemini_3, s015): faces drift in expression and
|
||||
likeness (the smile/mouth and eyes change), they are "a similar person," not the same one. Canny
|
||||
carries edges, not identity, so the regenerated face is identity-drifted. To hold identity WITHOUT
|
||||
copying original pixels (the hard constraint -- copied pixels carry SynthID), the conditioning must
|
||||
be an identity EMBEDDING, not pixels. Primary sources: diffusers IP-Adapter doc, InstantID
|
||||
(arXiv:2401.07519), IP-Adapter (arXiv:2308.06721), practitioner comparisons.
|
||||
|
||||
### Findings
|
||||
|
||||
**1. IP-Adapter FaceID conditions on an ArcFace identity VECTOR, not pixels (confidence: high).**
|
||||
FaceID extracts `insightface` ArcFace `normed_embedding` (a ~512-d identity vector) via
|
||||
`FaceAnalysis`, and passes it as `ip_adapter_image_embeds` -- NOT a CLIP image embedding, NOT the
|
||||
original pixels. So it is constraint-compatible: the watermark (a pixel-amplitude pattern) is not in
|
||||
the identity vector, and the img2img still regenerates the pixels (removal via `strength` unchanged).
|
||||
It loads on any SDXL via `load_ip_adapter` (~100 MB), is fast/low-VRAM, but identity fidelity on SDXL
|
||||
is ~5-10% lower than the SD1.5 line / dedicated methods.
|
||||
|
||||
**2. Multiple distinct faces ARE handled, via regional attention masks (confidence: high -- THE key
|
||||
unlock).** This is the make-or-break for group photos (our hardest case). diffusers supports a LIST
|
||||
of IP-Adapter face images each with its own binary region mask: `IPAdapterMaskProcessor` builds the
|
||||
masks, `set_ip_adapter_scale([[s1, s2, ...]])`, and `cross_attention_kwargs={"ip_adapter_masks":
|
||||
masks}`. So you detect each face, extract its own ArcFace embedding, assign it a region mask, and one
|
||||
pass preserves N different identities simultaneously. (InstantID, by contrast, is single-subject --
|
||||
it averages embeddings for multiple refs, which is wrong for distinct people -- so for group photos
|
||||
**IP-Adapter FaceID + masks beats InstantID**.)
|
||||
|
||||
**3. IP-Adapter + ControlNet + img2img compose (confidence: high).** The doc shows IP-Adapter +
|
||||
ControlNet (depth) in one pipeline and IP-Adapter + img2img (`strength`). Our target stack is the
|
||||
union: `StableDiffusionXLControlNetImg2ImgPipeline` (canny = structure) + `load_ip_adapter` (FaceID =
|
||||
identity) + `strength` (removal). `set_ip_adapter_scale` (1.0 = image-only, 0.5 = balanced) is the
|
||||
identity-hold knob. API friction to verify in implementation: that `ip_adapter_masks` via
|
||||
`cross_attention_kwargs` works on the *ControlNet img2img* pipeline (the masking is an attention-
|
||||
processor feature, so it should be pipeline-agnostic, but confirm).
|
||||
|
||||
**4. InstantID / PuLID positioning (confidence: medium).** InstantID does not train the UNet so it
|
||||
composes with canny/depth ControlNets, and gives better single-face fidelity than FaceID -- but it is
|
||||
single-subject (needs its own landmark ControlNet + dedicated weights). PuLID has the best identity
|
||||
fidelity but is heaviest and Flux-leaning. For our multi-face, constraint-bound, SDXL-canny case,
|
||||
IP-Adapter FaceID + masks is the right first build; InstantID/PuLID are single-portrait upgrades.
|
||||
|
||||
### Architecture (proposed)
|
||||
|
||||
```
|
||||
detect faces (insightface) -> per face: ArcFace embed + region mask
|
||||
one img2img pass:
|
||||
image=init, control_image=canny(init), # structure (existing)
|
||||
ip_adapter_image_embeds=[face_embeds], # identity per face
|
||||
cross_attention_kwargs={"ip_adapter_masks": face_masks}, # each face -> its region
|
||||
controlnet_conditioning_scale=1.0, set_ip_adapter_scale(~0.6),
|
||||
strength=vendor-adaptive # removal (unchanged)
|
||||
```
|
||||
Pixels are regenerated (SynthID removed by `strength`), structure held by canny, each face's identity
|
||||
held by its masked ArcFace vector -- no original pixel copied.
|
||||
|
||||
### Risks / honest costs
|
||||
|
||||
- **Shielding risk (same wall):** FaceID conditioning, like canny, reduces effective regeneration ->
|
||||
higher `set_ip_adapter_scale` raises the chance SynthID survives in the face region (echo of why the
|
||||
old region-hires failed). MUST oracle-verify removal at the chosen FaceID scale; keep `strength` at
|
||||
the vendor threshold.
|
||||
- **New heavy dependency:** `insightface` + `onnxruntime` + the `buffalo_l` model (~300 MB, downloaded
|
||||
on first use). Detection + embedding is CPU/ONNX, separate from the diffusion.
|
||||
- **Detection floor:** insightface needs faces large enough (det_size ~640); tiny faces in a dense
|
||||
group may not be detected -> not preserved (falls back to canny-only for those).
|
||||
- **Identity ceiling:** SDXL FaceID is ~5-10% off true identity -- a meaningful boost over canny-only
|
||||
drift, NOT a perfect face swap. Set expectations; PuLID/InstantID are the higher-fidelity (heavier)
|
||||
paths if needed.
|
||||
- **Value scales with strength:** at low strength (OpenAI 0.10) faces barely drift, so FaceID is
|
||||
marginal; at the higher strength a hard vendor (Google 0.30) needs, FaceID earns its keep.
|
||||
|
||||
### Build plan (staged)
|
||||
|
||||
- v1: optional `--face-id` flag on `--pipeline controlnet`. Detect faces; if any, run the masked
|
||||
FaceID pass (works for 1 or N faces -- masks generalize). If none detected, fall through to plain
|
||||
canny. Oracle-verify SynthID removal is preserved at the default FaceID scale on a face image.
|
||||
- v2 (if identity still short): InstantID for single-portrait, or PuLID, as a higher-fidelity opt-in.
|
||||
|
||||
**Sources.** https://huggingface.co/docs/diffusers/main/en/using-diffusers/ip_adapter ·
|
||||
https://huggingface.co/h94/IP-Adapter-FaceID · https://arxiv.org/pdf/2401.07519 (InstantID) ·
|
||||
https://instantid.github.io/ · https://arxiv.org/abs/2308.06721 (IP-Adapter)
|
||||
|
||||
### FaceID prototype run 1 -- 2026-06-03 (NEGATIVE on dense small-face groups)
|
||||
|
||||
Built and shipped the masked multi-face FaceID layer (`--face-id`, `face_id.py`, `faceid` extra).
|
||||
First real run on the gemini_3 group photo (Google, s015, scale 0.6, native 2816 via cap 1536):
|
||||
insightface detected **17 faces**, the masked multi-face pass composed and ran end-to-end (non-black
|
||||
output), so the API is correct. **At s015 the result is a clear FAILURE: every face corrupted --
|
||||
melted/discolored/psychedelic, materially WORSE than canny-only.**
|
||||
|
||||
**ROOT CAUSE FOUND (confirmed by ablation, not speculation) -- it is STRENGTH, not scale/masks/faces.**
|
||||
Investigated the real data: masks are fine (max overlap depth 2, 33% coverage, only 0.2% of pixels
|
||||
double-covered -- NOT an overlap problem), embeddings are fine (`normed_embedding` norm 1.000), the
|
||||
FaceID LoRA is not required for SDXL (h94 model card), and faces span 34-181 px (7 medium + 10 tiny).
|
||||
None of those is the cause. The decisive test: the SAME image + FaceID at **strength 0.5** produces
|
||||
**clean, coherent faces across the whole group** (no psychedelic artifacts). So FaceID needs
|
||||
substantial regeneration: the h94 usage is full generation (txt2img, 30 steps); at our removal
|
||||
strength (0.10-0.15 = ~7 effective steps) the strong identity cross-attention cannot reconcile with
|
||||
a latent that is ~85% the untouched original, so it smears identity-colored noise onto the faces.
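
The "~7 effective steps" figure follows from how diffusers img2img pipelines truncate the schedule: as far as I can tell from their timestep logic, only about `int(num_inference_steps * strength)` denoising steps actually run. The sketch below reproduces that arithmetic; the function name is hypothetical, and the 50-step schedule is an assumption chosen because it yields 7 steps at strength 0.15, not a value stated in these notes.

```python
def effective_steps(num_inference_steps: int, strength: float) -> int:
    """Denoising steps an img2img pass actually runs at a given strength."""
    return min(int(num_inference_steps * strength), num_inference_steps)


for strength in (0.10, 0.15, 0.50, 1.00):
    print(f"strength {strength:.2f}: {effective_steps(50, strength)} of 50 steps")
```

At removal strength the identity cross-attention gets only a handful of steps to act, versus the full schedule the FaceID weights were demonstrated with.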
|
||||
|
||||
**This is a FUNDAMENTAL tension, not a tuning bug:** watermark removal wants LOW strength (minimal
|
||||
degradation, just enough to scrub), FaceID wants HIGH strength (regenerate the face to impose
|
||||
identity). They are opposed. At strength 0.5 FaceID works AND removes the watermark, but the whole
|
||||
image regenerates much more (canny still holds text/edge structure, but texture/detail drifts well
|
||||
beyond the 0.15 "minimal degradation" target). So `--face-id` is a HIGH-STRENGTH option: it trades
|
||||
whole-image fidelity for face identity, and is a footgun at the low default strength (guaranteed
|
||||
garbage). Required follow-up code guard: when `--face-id` is set, floor `strength` at ~0.5 (or refuse
|
||||
+ warn) -- never run FaceID at the vendor-adaptive removal strength. Open question: whether
|
||||
high-strength FaceID's whole-image drift is acceptable for face-centric images, or whether identity
|
||||
preservation at LOW strength needs a different mechanism entirely (FaceID structurally cannot do it). (Infra lesson: the `faceid` extra must
|
||||
stay numpy<2.0 -- pin `onnx<1.18` + `scipy<1.18`; pinning numpy UP, as the first build did, leaves a
|
||||
numpy-1.26 env with a numpy-2-only scipy that crashes the diffusers import via `np.long`.)
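
The strength guard proposed above (floor `strength` when `--face-id` is set, or refuse and warn) could look roughly like this. It is a hedged sketch: `resolve_strength`, `FACE_ID_MIN_STRENGTH` and the `strict` switch are hypothetical names, and the 0.5 floor is the approximate value from the ablation, not a tuned constant.

```python
import warnings

FACE_ID_MIN_STRENGTH = 0.5  # approximate floor from the ablation (assumption)


def resolve_strength(requested: float, face_id: bool, *, strict: bool = False) -> float:
    """Return the strength to run with, guarding FaceID against low strength.

    With strict=True the call refuses instead of silently raising the strength.
    """
    if not face_id or requested >= FACE_ID_MIN_STRENGTH:
        return requested
    message = (
        f"--face-id needs strength >= {FACE_ID_MIN_STRENGTH}; "
        f"got {requested}, which corrupts faces"
    )
    if strict:
        raise ValueError(message)
    warnings.warn(f"{message}; raising strength to {FACE_ID_MIN_STRENGTH}", stacklevel=2)
    return FACE_ID_MIN_STRENGTH
```

Flooring keeps the command usable at the cost of more whole-image drift, while `strict=True` makes that tradeoff an explicit user decision.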
|
||||
|
||||
## Face preservation, done properly (research 2026-06-03, after the FaceID failure)
|
||||
|
||||
The FaceID run failed and I wrongly concluded "faces can't be preserved." Re-research corrected the
|
||||
understanding. The hard constraint is unchanged: to remove the watermark FROM a face the face MUST
|
||||
be regenerated (freezing it leaves SynthID), so the goal is identity-preserving REGENERATION of the
|
||||
face, at minimal overall image degradation. Three things I got wrong and the corrected picture:
|
||||
|
||||
**What I got wrong:** (1) I applied FaceID at GLOBAL high strength -- the literature is clear the
|
||||
architecture must be REGION-ADAPTIVE (face region handled separately, background stays low-strength);
|
||||
(2) I used IP-Adapter FaceID, the WEAKEST identity tool -- InstantID hits 82-86% face-recognition
similarity, whereas FaceID, although it also conditions on an ArcFace embedding (Finding 1 above),
holds identity less tightly; (3) I missed the entire
|
||||
face-restoration class (CodeFormer / GFPGAN), which is purpose-built for "regenerate a face, keep
|
||||
identity."
|
||||
|
||||
**The most promising mechanism -- CodeFormer face-restoration post-pass (confidence: high on the
|
||||
mechanism, unverified on our watermark).** CodeFormer is a VQ-VAE: a frozen discrete CODEBOOK of HQ
|
||||
facial priors + a Transformer that predicts code *tokens* from the input, and a frozen decoder that
|
||||
regenerates the face FROM THE CODEBOOK ENTRIES -- "does not depend on feature fusion with low-quality
|
||||
cues." So the output face pixels come from a finite learned codebook, NOT from the input pixels:
|
||||
**the SynthID pixel-amplitude pattern physically cannot survive a codebook re-synthesis** -- a
|
||||
stronger scrub than low-strength img2img (which keeps ~85% of the latent). Fidelity knob `w` in
|
||||
[0,1]: higher w preserves identity but fuses MORE low-quality (input) cues (more watermark risk),
|
||||
lower w leans on the codebook (cleaner scrub, identity drift) -- the same scrub-vs-fidelity tension,
|
||||
settled per-image by the oracle; there is likely a `w` that holds identity AND clears the oracle.
|
||||
|
||||
**Constraint-compatible architecture:** run the normal canny low-strength controlnet removal globally
|
||||
(minimal degradation everywhere), then detect+align each face, run CodeFormer on the **ORIGINAL** face
|
||||
crop (to capture true identity AND re-synthesize from the codebook = scrub), and composite the
|
||||
CodeFormer output (codebook-generated, not original pixels -> no copy, no watermark) into the cleaned
|
||||
image. Decouples whole-image minimal-degradation from face identity -- no high GLOBAL strength needed.
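
The final compositing step can be illustrated with a feathered alpha blend. This is a dependency-free sketch on nested lists of grayscale floats; `feather_alpha` and `composite` are hypothetical helpers, and a real implementation would presumably work on image arrays with a detected and aligned face box rather than a fixed offset.

```python
def feather_alpha(height, width, feather):
    """Alpha mask: 1.0 in the interior, ramping linearly to 0.0 at the border."""
    alpha = []
    for y in range(height):
        row = []
        for x in range(width):
            # Distance (in pixels) from this pixel to the nearest patch edge.
            d = min(x, y, width - 1 - x, height - 1 - y)
            row.append(min(1.0, d / feather) if feather > 0 else 1.0)
        alpha.append(row)
    return alpha


def composite(cleaned, restored_face, top, left, feather):
    """Blend a restored face patch into the cleaned image at (top, left)."""
    out = [row[:] for row in cleaned]  # copy so the input is not mutated
    height, width = len(restored_face), len(restored_face[0])
    alpha = feather_alpha(height, width, feather)
    for y in range(height):
        for x in range(width):
            a = alpha[y][x]
            out[top + y][left + x] = a * restored_face[y][x] + (1 - a) * out[top + y][left + x]
    return out


cleaned = [[0.0] * 8 for _ in range(8)]  # stands in for the controlnet output
face = [[1.0] * 6 for _ in range(6)]     # stands in for a restored face crop
blended = composite(cleaned, face, top=1, left=1, feather=2)
print(blended[3])
```

Both inputs to the blend are regenerated pixels (the controlnet output and the restored face), so the composite introduces no original watermarked pixels.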
|
||||
|
||||
**Honest costs/caveats:** (a) **License -- CodeFormer is NTU S-Lab 1.0 (non-commercial/research)**, so
|
||||
it cannot be bundled in this MIT tool for general use; the license-clean alternative is **GFPGAN
|
||||
(Apache-2.0)**, slightly lower quality. (b) Deps (basicsr/facexlib) are heavy and numpy-version-finicky
|
||||
(same class of conflict as insightface). (c) CodeFormer is a *restoration* model -- it can subtly
|
||||
alter expression/asymmetry; identity is held but not pixel-identical. (d) **The watermark-scrub is
|
||||
mechanistically strong but UNVERIFIED -- must oracle-check.** InstantID + region-adaptive strength is
|
||||
the alternative if the restoration route disappoints, but it is more complex (differential strength).
|
||||
Prototype plan: validate CodeFormer on a real face in a THROWAWAY env (identity held? oracle clean?)
|
||||
before any project-env integration or the license/GFPGAN decision.
|
||||
|
||||
### CodeFormer prototype -- VALIDATED end-to-end 2026-06-03 (oracle-confirmed)
|
||||
|
||||
Prototyped the CodeFormer face-restoration post-pass (codeformer-pip in a throwaway venv, forced CPU
|
||||
-- the pip wrapper has an MPS device-mismatch bug) on the gemini_3 group photo (18 faces). Pipeline:
|
||||
`all --pipeline controlnet --strength 0.15` (sparkle + SynthID removed from the whole image, minimal
|
||||
degradation) -> CodeFormer on the ORIGINAL faces -> feather-composite the CodeFormer faces into the
|
||||
all-cleaned image. Oracle results (Gemini app "Verify with SynthID"), isolating each part:
|
||||
- pure controlnet-0.15 background (no faces): **clean** -> the background scrub works at 0.15 (no
|
||||
ControlNet-shielding problem for Google on this image).
|
||||
- composite with CodeFormer faces at **w0.7**: **SynthID DETECTED** -> high fidelity fuses too much of
|
||||
the original face signal (the watermark) through.
|
||||
- composite at **w0.5**: **clean**. composite at **w0.3**: **clean**.
|
||||
So the scrub-vs-fidelity threshold is between 0.5 and 0.7; **w=0.5 is the sweet spot** (highest
|
||||
fidelity / best identity that still clears the oracle). Identity at w0.3-0.7 all looks like the same
|
||||
person (the face is large enough), so the lower w costs little.
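
Reading the sweet spot off the oracle verdicts amounts to picking the highest `w` that still reads clean. The helper below is a minimal sketch of that selection; `best_clean_weight` is a hypothetical name, and the verdict table is just the three results recorded above for this one image.

```python
from __future__ import annotations


def best_clean_weight(oracle_clean: dict[float, bool]) -> float | None:
    """Highest fidelity weight w whose composite read clean on the oracle."""
    clean = [w for w, is_clean in oracle_clean.items() if is_clean]
    return max(clean) if clean else None


# Verdicts for the gemini_3 group photo (True = oracle clean).
results = {0.7: False, 0.5: True, 0.3: True}
print(best_clean_weight(results))
```

Because the verdicts come from a manual oracle, the table has to be filled in by hand per image; the function only formalizes the "highest clean fidelity" choice.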
|
||||
|
||||
**This VALIDATES the corrected face-preservation approach** (and refutes my earlier "faces can't be
|
||||
preserved" / FaceID conclusion): controlnet low-strength background scrub + CodeFormer-codebook face
|
||||
re-synthesis at w~0.5 + feather composite = oracle-clean SynthID removal everywhere (background AND
|
||||
faces), identity preserved, minimal overall degradation, zero original-pixel copying (CodeFormer faces
|
||||
are codebook-generated). CodeFormer's discrete-codebook re-synthesis DOES scrub the pixel watermark,
|
||||
but only when w is low enough that the decoder leans on the codebook rather than fusing the input
|
||||
(watermark-carrying) features -- exactly the predicted fidelity-vs-scrub tension, with an empirical
|
||||
clean threshold at w<=0.5.
|
||||
|
||||
**Production TODO (not built -- still a throwaway prototype):** (1) license -- CodeFormer is NTU S-Lab
|
||||
(non-commercial); decide CodeFormer-as-user-installed-extra vs GFPGAN (Apache-2.0, re-verify it scrubs
|
||||
at its fidelity setting); (2) wire a `--restore-faces` post-pass (detect -> restore w~0.5 -> feather
|
||||
composite) onto the controlnet pipeline; (3) handle the MPS device bug (force CPU for the face model
|
||||
or fix); (4) re-verify the w threshold on more images / vendors (w=0.5 confirmed on one Gemini group
|
||||
photo only).
|
||||
|
||||
**Sources.** https://arxiv.org/abs/2206.11253 (CodeFormer) · https://github.com/sczhou/CodeFormer ·
|
||||
https://arxiv.org/pdf/2401.07519 (InstantID) ·
|
||||
https://openaccess.thecvf.com/content/WACV2024/papers/Suin_Diffuse_and_Restore... (region-adaptive) ·
|
||||
https://arxiv.org/pdf/2504.12809 (saliency-aware watermark removal)
|
||||
|
||||
## Provenance
|
||||
|
||||
Hand-run primary-source pass, 2026-06-02. Sources fetched and quoted above; the central
|
||||
make-or-break claim (structure-conditioned high-strength regeneration scrubs the watermark while
|
||||
keeping text) is **unverified and explicitly flagged as the thing the local prototype must
|
||||
measure** (against the manual Gemini SynthID oracle) — the literature supports removal (Findings 1, 2) and supports structure-preserving
|
||||
regeneration (Finding 5) but never jointly validated text (Finding 3). No code change implied
|
||||
until the prototype validates a Pareto cell on the SynthID oracle.
|
||||
+34
-36
@@ -196,12 +196,11 @@ with SynthID" for Google). The test set and per-image results are recorded in
|
||||
**CORRECTION (supersedes the earlier "resolution dependence" claim).** A prior
|
||||
version of this doc and CLAUDE.md stated that strength 0.30 failed to remove
|
||||
SynthID on 1600x1600 gpt-image and that removal was resolution-dependent. That
|
||||
was an **artifact of the text-protection bug** (issue #14): those tests ran a
|
||||
build where `protect_text` was ON by default, and the high-resolution text
|
||||
re-scrub re-introduced SynthID in the dense-text regions of the infographic
|
||||
images tested. Re-running the *same* 1600x1600 image on clean v0.8.6 (protect
|
||||
OFF) removes SynthID at **0.05**. The "large images resist removal" conclusion
|
||||
was false; the resistance was the protect-text shielding, now fixed (v0.8.5).
|
||||
was a **measurement artifact of a since-removed per-region re-scrub step** (issue
|
||||
#14): on the dense-text infographics tested, that step could reconstitute SynthID
|
||||
in text regions. Re-running the *same* 1600x1600 image on the clean current
|
||||
pipeline removes SynthID at **0.05**. The "large images resist removal" conclusion
|
||||
was false; the resistance was that region-rescrub shielding, since removed.
|
||||
|
||||
**Open / not locally testable:**
|
||||
|
||||
@@ -362,38 +361,37 @@ empirically from oracle tests:
|
||||
|
||||
Google has hardened SynthID relative to OpenAI's (vendor gap measured at ~3x
|
||||
strength), but the year-over-year "0.05 -> 0.10 -> 0.30" progression above
|
||||
conflates a real hardening trend with the now-debunked protect-text artifact;
|
||||
conflates a real hardening trend with the now-debunked region-rescrub artifact;
|
||||
treat only the section 2.2 controlled numbers as authoritative.
|
||||
|
||||
---
|
||||
|
||||
## 5. Practical implications for this tool
|
||||
|
||||
### 5.1 Text and face protection: OFF by default
|
||||
### 5.1 Preserving content means regenerating it, never copying it
|
||||
|
||||
**Text protection (`--protect-text`) can preserve SynthID in text regions.**
|
||||
Verified June 2026 on gpt-image at 1600x1600: same image processed twice --
|
||||
with `--protect-text` the oracle detected SynthID; without it, SynthID was
|
||||
removed. The mechanism: the global img2img pass clears SynthID everywhere, but
|
||||
the text-protection high-resolution re-scrub regenerates those pixels from an
|
||||
upscaled crop. At the effective resolution of the upscaled crop, the per-region
|
||||
pass may be insufficient to re-destroy the payload, reconstituting SynthID in
|
||||
text regions.
|
||||
**Core rule:** SynthID is a pixel-amplitude pattern, so any approach that FREEZES
|
||||
or RESTORES original pixels in a region re-introduces the watermark there. Early
|
||||
region-based text/face "protection" (since removed) proved this: restoring the
|
||||
original face pixels guaranteed SynthID survived in faces, and even a per-region
|
||||
high-resolution re-scrub from an upscaled crop could be insufficient to destroy
|
||||
the payload, reconstituting SynthID in text. The lesson held and shaped the
|
||||
current design: **content is preserved by REGENERATING it under structural
|
||||
conditioning, never by copying original pixels.**
|
||||
|
||||
**Face protection (`--protect-faces`) has an even more direct preservation
|
||||
mechanism.** The pipeline extracts face regions from the ORIGINAL (watermarked)
|
||||
image BEFORE the diffusion pass, runs the global pass (which removes SynthID
|
||||
everywhere), then blends the original face pixels BACK onto the result
|
||||
(`invisible_engine.py`: `original_faces = protector.extract_faces(cv_img)`
|
||||
before `remove_watermark`, then `protector.restore_faces(out_cv, original_faces)`
|
||||
after). Those restored pixels are the original watermarked pixels -- SynthID is
|
||||
guaranteed to survive in face regions, not just possibly. The text-protection
|
||||
case is at least re-generating (uncertain); face protection is literally
|
||||
restoring the original SynthID-bearing pixels.
|
||||
|
||||
Both `--protect-text` and `--protect-faces` are therefore **experimental and
|
||||
OFF by default**. Enable only when text/face fidelity matters more than
|
||||
watermark removal completeness, and always verify the result with the oracle.
|
||||
- **Text + structure:** `--pipeline controlnet` (SDXL img2img + a canny ControlNet)
|
||||
conditions the regeneration on the edge map, so text and structure stay sharp
|
||||
while every pixel is still regenerated -- SynthID is removed everywhere. Verified
|
||||
better than plain img2img at the same strength (text stays legible where plain
|
||||
garbles it), and the controlnet background scrub reads clean on the oracle.
|
||||
- **Face identity:** canny holds face *structure* but not *identity*. The validated
|
||||
approach (researched + prototyped 2026-06-03, not yet shipped) is a face-restoration
|
||||
post-pass: CodeFormer (GFPGAN is the untested alternative) RE-SYNTHESIZES each face from a discrete codebook
|
||||
(codebook pixels, not original -> scrubs SynthID) at a low fidelity weight
|
||||
(`w~0.5`), composited into the cleaned image. Oracle-confirmed clean in face
|
||||
regions with identity preserved. (An IP-Adapter FaceID approach was tried and
|
||||
REMOVED -- it needs high denoise strength and corrupts faces at removal strength;
|
||||
see `docs/controlnet-removal-pipeline-research.md`.)
|
||||
|
||||
### 5.2 Strength setting
|
||||
|
||||
@@ -431,13 +429,13 @@ value that reads clean on the oracle.
|
||||
verifier is not the same as being forensically indistinguishable from clean
|
||||
content (arXiv:2605.09203).
|
||||
|
||||
### 5.4 ctrlregen and img2img: the tradeoff
|
||||
### 5.4 Strength vs forensic detectability: the tradeoff
|
||||
|
||||
Both the paper and our testing confirm: higher img2img strength removes the
|
||||
watermark but introduces detectable regeneration artifacts. The Goonatilake &
|
||||
Ateniese paper shows CtrlRegen+ (the most powerful remover) is simultaneously
|
||||
the most forensically detectable (AUROC 0.9999). The tradeoff is unavoidable
|
||||
with current diffusion-based approaches.
|
||||
Higher img2img strength removes the watermark but introduces detectable
|
||||
regeneration artifacts. The Goonatilake & Ateniese paper shows the strongest
|
||||
diffusion-based removers are simultaneously the most forensically detectable
|
||||
(AUROC up to 0.9999). The tradeoff is unavoidable with current diffusion-based
|
||||
approaches: defeating the vendor's verifier is not the same as being clean.
|
||||
|
||||
---
|
||||
|
||||
|
||||
@@ -49,10 +49,7 @@ gpu = [
|
||||
"transformers>=5,<6",
|
||||
"tokenizers>=0.22,<0.23",
|
||||
"accelerate>=0.25.0",
|
||||
"controlnet-aux>=0.0.9",
|
||||
"safetensors",
|
||||
"ultralytics>=8.0.0",
|
||||
"color-matcher>=0.5.0",
|
||||
]
|
||||
# Open invisible-watermark (imwatermark) decoder for detecting the DWT-DCT
|
||||
# watermarks embedded by Stable Diffusion / SDXL / FLUX. Optional because it
|
||||
@@ -150,7 +147,6 @@ ignore = [
|
||||
"tests/*.py" = ["ANN", "S101", "S105", "S106", "S108"]
|
||||
"src/remove_ai_watermarks/noai/watermark_remover.py" = ["S603", "S606", "S607", "T201"] # subprocess calls for auto-install/CUDA fix
|
||||
"src/remove_ai_watermarks/noai/c2pa.py" = ["S110"] # try-except-pass for corrupt file handling
|
||||
"src/remove_ai_watermarks/noai/ctrlregen/engine.py" = ["S101", "S603"] # assert for loaded state, subprocess for auto-install
|
||||
|
||||
[tool.ruff.format]
|
||||
quote-style = "double"
|
||||
@@ -172,33 +168,3 @@ reportUnknownMemberType = false
|
||||
reportUnknownArgumentType = false
|
||||
reportUnknownVariableType = false
|
||||
reportMissingTypeArgument = false
|
||||
|
||||
# CtrlRegen is a torch/diffusers/controlnet-aux boundary: those libs ship no
|
||||
# usable types, so strict pyright cannot know the tensor element types. Relax the
|
||||
# unknown-type rules for this subpackage only (mirrors the per-file pragmas used
|
||||
# for the cv2 engines); the rest of the codebase stays strict.
|
||||
[[tool.pyright.executionEnvironments]]
|
||||
root = "src/remove_ai_watermarks/noai/ctrlregen"
|
||||
reportUnknownMemberType = false
|
||||
reportUnknownArgumentType = false
|
||||
reportUnknownVariableType = false
|
||||
reportUnknownParameterType = false
|
||||
reportMissingTypeArgument = false
|
||||
reportMissingTypeStubs = false
|
||||
reportMissingImports = false
|
||||
reportArgumentType = false
|
||||
reportAssignmentType = false
|
||||
reportReturnType = false
|
||||
reportCallIssue = false
|
||||
reportIndexIssue = false
|
||||
reportOperatorIssue = false
|
||||
reportOptionalMemberAccess = false
|
||||
reportOptionalCall = false
|
||||
reportOptionalSubscript = false
|
||||
reportOptionalOperand = false
|
||||
reportAttributeAccessIssue = false
|
||||
reportPrivateImportUsage = false
|
||||
reportPrivateUsage = false
|
||||
reportInvalidTypeForm = false
|
||||
reportConstantRedefinition = false
|
||||
reportUnnecessaryComparison = false
|
||||
|
||||
@@ -0,0 +1,348 @@
|
||||
"""ControlNet-as-removal-pipeline prototype sweep (issue #35 / Jacob).
|
||||
|
||||
Research prototype, NOT a shipped pipeline. It tests whether a full-image
|
||||
SDXL-native ControlNet-conditioned img2img can REPLACE plain SDXL img2img as the
|
||||
watermark remover: a single structure-guided regeneration that scrubs an invisible
|
||||
robust watermark (SynthID) everywhere while keeping fine detail and small/CJK text
|
||||
legible. See docs/controlnet-removal-pipeline-research.md for the full rationale.
|
||||
|
||||
The make-or-break tension (from the watermark-removal-attack literature): the
|
||||
denoise strength high enough to scrub the watermark deforms text, while the
|
||||
conditioning strong enough to keep text may spare the watermark. There is no local
|
||||
SynthID detector, so this script CANNOT decide removal on its own -- it produces
|
||||
one output per (control, strength, conditioning-scale) cell plus an index, and YOU
|
||||
verify each output by hand in the Gemini app ("Verify with SynthID") and judge text
|
||||
legibility visually. Fill the verdict columns in the emitted index, then read off
|
||||
the Pareto cell (oracle clean AND text legible).
|
||||
|
||||
Pipeline: stabilityai/stable-diffusion-xl-base-1.0 +
|
||||
- canny: xinsir/controlnet-canny-sdxl-1.0 (control = cv2.Canny(gray, 100, 200))
|
||||
- tile: xinsir/controlnet-tile-sdxl-1.0 (control = the resized original, no preproc)
|
||||
StableDiffusionXLControlNetImg2ImgPipeline (image=init, control_image=control).
|
||||
|
||||
Needs the gpu extra (torch + diffusers) and cv2. Runs locally on 32 GB MPS in
|
||||
fp32 (MPS fp16 decodes to all-black NaN -- issue #29 -- so fp32 is the default on
|
||||
mps/cpu, fp16 only on cuda/xpu); a dedicated GPU is not required for 1024 px. Run:
|
||||
|
||||
uv run python scripts/controlnet_sweep.py path/to/watermarked.png -o sweep_out
|
||||
uv run python scripts/controlnet_sweep.py img.png --control canny tile \\
|
||||
--strength 0.3 0.5 0.7 1.0 --scale 0.6 1.0 --size 1024
|
||||
"""
|
||||
|
||||
from __future__ import annotations

# torch/diffusers/cv2 ship no usable types; relax the unknown-type + private-import
# rules for this boundary script (mirrors scripts/visible_alpha_solve.py and the
# cv2/torch engine modules). Pure-logic helpers here stay correct regardless.
# pyright: reportUnknownMemberType=false, reportUnknownArgumentType=false, reportUnknownVariableType=false, reportUnknownParameterType=false, reportMissingTypeArgument=false, reportMissingTypeStubs=false, reportMissingImports=false, reportArgumentType=false, reportAssignmentType=false, reportReturnType=false, reportCallIssue=false, reportIndexIssue=false, reportOperatorIssue=false, reportPrivateImportUsage=false
import argparse
import contextlib
import csv
import importlib.util
import logging
import sys
from pathlib import Path
from typing import TYPE_CHECKING, Any

sys.path.insert(0, str(Path(__file__).parent.parent / "src"))

from _plain_console import Console, Table

if TYPE_CHECKING:
    from PIL import Image

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger(__name__)
console = Console()

BASE_MODEL = "stabilityai/stable-diffusion-xl-base-1.0"
FP16_VAE = "madebyollin/sdxl-vae-fp16-fix"
CONTROLNETS = {
    "canny": "xinsir/controlnet-canny-sdxl-1.0",
    "tile": "xinsir/controlnet-tile-sdxl-1.0",
}

# A neutral quality prompt: the goal is faithful regeneration, not creative edits.
PROMPT = "best quality, high quality, sharp, detailed, photographic"
NEGATIVE_PROMPT = "blurry, lowres, deformed, distorted text, garbled text, watermark, jpeg artifacts"
|
||||
|
||||
|
||||
def pick_device(requested: str) -> str:
    """Resolve the inference device without the CUDA-reinstaller side effect.

    Deliberately does NOT call the package ``get_device`` (which can trigger a
    torch-CUDA reinstall+restart). A research script should never do that.
    """
    import torch

    if requested != "auto":
        return requested
    if torch.cuda.is_available():
        return "cuda"
    if hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
        return "mps"
    return "cpu"
|
||||
|
||||
|
||||
def resolve_dtype(device: str, requested: str) -> Any:
    """fp16 only on cuda/xpu; fp32 on cpu AND mps, unless overridden.

    MPS fp16 produces all-black NaN output here (the SDXL UNet/VAE overflows on
    the Metal backend -- issue #29; even the fp16-fix VAE does not save it), so the
    production pipeline runs fp32 on MPS and so do we. fp32 SDXL + an SDXL ControlNet
    at 1024 fits 32 GB unified memory with vae-tiling + attention-slicing.
    """
    import torch

    if requested == "fp16":
        return torch.float16
    if requested == "fp32":
        return torch.float32
    return torch.float16 if device in {"cuda", "xpu"} else torch.float32
|
||||
|
||||
|
||||
def fit_size(image: Image.Image, long_side: int) -> Image.Image:
    """Resize so the long side is ``long_side``, each dim a multiple of 8 (SDXL)."""
    from PIL import Image as PILImage

    w, h = image.size
    scale = long_side / max(w, h)
    nw = max(8, round(w * scale) // 8 * 8)
    nh = max(8, round(h * scale) // 8 * 8)
    if (nw, nh) == (w, h):
        return image
    return image.resize((nw, nh), PILImage.Resampling.LANCZOS)
|
||||
|
||||
|
||||
def make_control_image(init: Image.Image, control: str) -> Image.Image:
    """Build the ControlNet conditioning image for the given control type.

    canny: cv2.Canny(gray, 100, 200) -> 3-channel edge map (xinsir canny recipe).
    tile: the init image itself, no preprocessing (xinsir tile recipe).
    """
    import cv2
    import numpy as np
    from PIL import Image as PILImage

    if control == "tile":
        return init
    rgb = np.array(init.convert("RGB"))
    gray = cv2.cvtColor(rgb, cv2.COLOR_RGB2GRAY)
    edges = cv2.Canny(gray, 100, 200)
    edges_rgb = np.stack([edges, edges, edges], axis=-1)
    return PILImage.fromarray(edges_rgb)
|
||||
|
||||
|
||||
def psnr(a: Image.Image, b: Image.Image) -> float:
    """Coarse global fidelity proxy vs the original (NOT a text or watermark metric)."""
    import numpy as np

    x = np.asarray(a.convert("RGB"), dtype=np.float64)
    y = np.asarray(b.convert("RGB").resize(a.size), dtype=np.float64)
    mse = float(np.mean((x - y) ** 2))
    if mse == 0.0:
        return 99.0
    return float(10.0 * np.log10((255.0**2) / mse))
|
||||
|
||||
|
||||
def load_pipeline(control: str, device: str, dtype: Any) -> Any:
    """Load SDXL base + the chosen SDXL ControlNet as an img2img pipeline."""
    import torch
    from diffusers import (
        AutoencoderKL,
        ControlNetModel,
        StableDiffusionXLControlNetImg2ImgPipeline,
    )

    console.print(f"Loading {CONTROLNETS[control]} ({control}) ...")
    controlnet = ControlNetModel.from_pretrained(CONTROLNETS[control], torch_dtype=dtype)
    load_kwargs: dict[str, Any] = {"controlnet": controlnet, "torch_dtype": dtype}
    if dtype == torch.float16:
        # The stock SDXL VAE decodes to NaN/black in fp16; the fp16-fix VAE is the
        # same swap the production pipeline uses (_SDXL_FP16_VAE_ID).
        load_kwargs["vae"] = AutoencoderKL.from_pretrained(FP16_VAE, torch_dtype=dtype)
    pipe = StableDiffusionXLControlNetImg2ImgPipeline.from_pretrained(BASE_MODEL, **load_kwargs)
    pipe = pipe.to(device)
    pipe.set_progress_bar_config(disable=True)
    if device != "cpu":
        # Keep the 1024 px + extra-ControlNet peak inside 32 GB unified memory.
        with contextlib.suppress(Exception):
            pipe.enable_vae_tiling()
        with contextlib.suppress(Exception):
            pipe.enable_attention_slicing()
    return pipe
|
||||
|
||||
|
||||
def run_cell(
    pipe: Any,
    init: Image.Image,
    control_image: Image.Image,
    strength: float,
    scale: float,
    steps: int,
    guidance: float,
    seed: int,
) -> Image.Image:
    """Run one ControlNet img2img cell and return the regenerated image.

    The generator is created on CPU intentionally: a CPU generator is portable
    across the mps/cuda/cpu backends (diffusers rejects a device-mismatched one),
    matching the production runner's fallback behavior.
    """
    import torch

    generator = torch.Generator(device="cpu").manual_seed(seed)
    result = pipe(
        prompt=PROMPT,
        negative_prompt=NEGATIVE_PROMPT,
        image=init,
        control_image=control_image,
        controlnet_conditioning_scale=float(scale),
        strength=float(strength),
        num_inference_steps=steps,
        guidance_scale=guidance,
        generator=generator,
    )
    return result.images[0]
|
||||
|
||||
|
||||
def parse_args() -> argparse.Namespace:
    p = argparse.ArgumentParser(description="ControlNet-as-removal-pipeline prototype sweep.")
    p.add_argument("image", type=Path, help="Watermarked input image.")
    p.add_argument("-o", "--out", type=Path, default=Path("controlnet_sweep_out"), help="Output directory.")
    p.add_argument("--control", nargs="+", choices=list(CONTROLNETS), default=["canny", "tile"])
    p.add_argument("--strength", nargs="+", type=float, default=[0.3, 0.5, 0.7, 1.0])
    p.add_argument("--scale", nargs="+", type=float, default=[0.6, 1.0], help="controlnet_conditioning_scale values.")
    p.add_argument("--size", type=int, default=1024, help="Long-side resolution (multiple of 8).")
    p.add_argument("--steps", type=int, default=30)
    p.add_argument("--guidance", type=float, default=7.5)
    p.add_argument("--seed", type=int, default=0)
    p.add_argument("--device", default="auto", choices=["auto", "mps", "cuda", "cpu"])
    p.add_argument("--dtype", default="auto", choices=["auto", "fp16", "fp32"])
    return p.parse_args()
|
||||
|
||||
|
||||
def main() -> int:
    args = parse_args()
    if not args.image.exists():
        log.error("Input image not found: %s", args.image)
        return 1

    try:
        from PIL import Image as PILImage
    except ImportError:
        log.error("Pillow is required. Install the gpu extra: uv sync --extra gpu --extra dev")
        return 1
    if importlib.util.find_spec("diffusers") is None or importlib.util.find_spec("torch") is None:
        log.error("diffusers/torch are required. Install: uv sync --extra gpu --extra dev")
        return 1

    device = pick_device(args.device)
    dtype = resolve_dtype(device, args.dtype)
    console.print(f"Device: {device} | dtype: {str(dtype).split('.')[-1]}")

    init_full = PILImage.open(args.image).convert("RGB")
    init = fit_size(init_full, args.size)
    console.print(f"Input: {args.image.name} {init_full.size[0]}x{init_full.size[1]} -> {init.size[0]}x{init.size[1]}")

    args.out.mkdir(parents=True, exist_ok=True)
    stem = args.image.stem
    init_path = args.out / f"{stem}__INPUT.png"
    init.save(init_path)

    rows: list[dict[str, Any]] = []
    table = Table(title="ControlNet sweep")
    for col in ("control", "strength", "scale", "psnr_vs_input", "file"):
        table.add_column(col)

    # Group by control so SDXL + the ControlNet load once per control type.
    for control in args.control:
        pipe = load_pipeline(control, device, dtype)
        control_image = make_control_image(init, control)
        if control == "canny":
            control_image.save(args.out / f"{stem}__canny_edges.png")
        for strength in args.strength:
            for scale in args.scale:
                tag = f"{control}_s{strength:g}_c{scale:g}"
                console.print(f"Running {tag} ...")
                try:
                    out = run_cell(
                        pipe,
                        init,
                        control_image,
                        strength,
                        scale,
                        args.steps,
                        args.guidance,
                        args.seed,
                    )
                except Exception as exc:
                    log.warning("Cell %s failed: %s", tag, exc)
                    continue
                fname = f"{stem}__{tag}.png"
                out.save(args.out / fname)
                quality = psnr(init, out)
                rows.append(
                    {
                        "control": control,
                        "strength": strength,
                        "scale": scale,
                        "psnr_vs_input": round(quality, 2),
                        "file": fname,
                        "synthid_oracle": "",  # fill: clean / present
                        "text_legible": "",  # fill: yes / no / partial
                    }
                )
                table.add_row(control, f"{strength:g}", f"{scale:g}", f"{quality:.2f}", fname)
        del pipe
        _free_memory(device)

    _write_index(args.out, stem, rows, init_path.name)
    console.print(table)
    console.print(f"\nWrote {len(rows)} cells to {args.out}/")
    console.print(f"Next: open {args.out}/sweep_index.csv, run each PNG through the Gemini SynthID oracle,")
    console.print("fill synthid_oracle (clean/present) + text_legible (yes/no/partial), find the Pareto cell.")
    return 0
|
||||
|
||||
|
||||
def _free_memory(device: str) -> None:
    import gc

    gc.collect()
    with contextlib.suppress(Exception):
        import torch

        if device == "cuda":
            torch.cuda.empty_cache()
        elif device == "mps" and hasattr(torch, "mps"):
            torch.mps.empty_cache()
|
||||
|
||||
|
||||
def _write_index(out: Path, stem: str, rows: list[dict[str, Any]], input_name: str) -> None:
    """Write the CSV index (with empty verdict columns) and a README."""
    fields = ["control", "strength", "scale", "psnr_vs_input", "file", "synthid_oracle", "text_legible"]
    with (out / "sweep_index.csv").open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fields)
        writer.writeheader()
        writer.writerows(rows)
    readme = (
        f"# ControlNet sweep for {stem}\n\n"
        f"Input (resized): {input_name}\n\n"
        "Each row in sweep_index.csv is one (control, strength, scale) cell. PSNR vs the\n"
        "resized input is a COARSE global-fidelity proxy only -- it does NOT measure text\n"
        "legibility or watermark presence. Decide those two by hand:\n\n"
        "1. synthid_oracle: open the PNG in the Gemini app, 'Verify with SynthID'. Mark\n"
        "   'clean' if no SynthID is detected, 'present' if it still is. (No local detector\n"
        "   exists; this manual check is the only valid SynthID oracle.)\n"
        "2. text_legible: eyeball the small/CJK text. Mark yes / partial / no.\n\n"
        "The Pareto cell is the one where synthid_oracle=clean AND text_legible=yes at the\n"
        "lowest strength. If no cell satisfies both, the canny/tile-ControlNet middle path\n"
        "is dead for text and a glyph re-render is required (see\n"
        "docs/text-protection-research.md). Record the outcome in\n"
        "docs/controlnet-removal-pipeline-research.md.\n"
    )
    (out / "sweep_README.md").write_text(readme, encoding="utf-8")
|
||||
|
||||
|
||||
if __name__ == "__main__":
    raise SystemExit(main())
|
||||
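The sweep README defines the Pareto cell as the lowest-strength row with `synthid_oracle=clean` AND `text_legible=yes`. Once the verdict columns are filled in by hand, that read-off could be sketched roughly as below. This is only an illustration: `pick_pareto_cell` is a hypothetical helper that is not part of the script, the sample rows are invented, and the PSNR tie-break is an assumption rather than anything the script specifies.

```python
import csv
import io


def pick_pareto_cell(rows):
    """Return the lowest-strength row that is oracle-clean and text-legible, or None."""
    ok = [
        r
        for r in rows
        if r["synthid_oracle"].strip().lower() == "clean"
        and r["text_legible"].strip().lower() == "yes"
    ]
    if not ok:
        return None
    # Lowest strength first; ties go to the higher PSNR (assumed tie-break).
    return min(ok, key=lambda r: (float(r["strength"]), -float(r["psnr_vs_input"])))


# Invented sample of a hand-filled sweep_index.csv (same columns as _write_index emits).
SAMPLE = """control,strength,scale,psnr_vs_input,file,synthid_oracle,text_legible
canny,0.3,1,31.20,img__canny_s0.3_c1.png,present,yes
canny,0.5,1,28.75,img__canny_s0.5_c1.png,clean,yes
canny,0.7,1,25.10,img__canny_s0.7_c1.png,clean,partial
tile,0.5,0.6,29.90,img__tile_s0.5_c0.6.png,present,yes
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
best = pick_pareto_cell(rows)
print(best["file"] if best else "no cell satisfies both")
```

If the helper returns `None`, that corresponds to the README's "middle path is dead for text" outcome.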
@@ -1,150 +0,0 @@
|
||||
"""Multilingual recall benchmark for the text-protection detector.
|
||||
|
||||
Measures the core lever of text protection (`text_protector.TextProtector`): if
|
||||
the PP-OCRv3 DB detector misses a text region, that text is NOT preserved during
|
||||
the SDXL watermark-removal pass and gets deformed (issue #14). This renders short
|
||||
text in several scripts at several font sizes on two canvas sizes, runs detection,
|
||||
and reports the fraction of each known text bbox the detector covers.
|
||||
|
||||
Findings (2026-05-29):
|
||||
- Detection is script-agnostic: DB segments text *regions*, not characters, so
|
||||
Latin / Cyrillic / CJK / Hangul / Arabic / digits score identically. Language
|
||||
was never the lever.
|
||||
- The only lever is resolution. A fixed small detector input downscaled large
|
||||
canvases so far that small text was missed. Detecting at the native long side
|
||||
(capped, see ``text_protector._DET_MAX_LONG_SIDE``) lifts overall hit-rate
|
||||
from 0.91 to 1.00 and the worst cell (~16 px text on a 2048 canvas) from
|
||||
0.06 to 1.00.
|
||||
|
||||
This needs the detector model (downloaded on first use) and a font that covers
|
||||
all the scripts (macOS "Arial Unicode"; on Linux install a Noto super-font).
|
||||
No GPU. Run:
|
||||
|
||||
uv run python scripts/text_detection_benchmark.py
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import logging
|
||||
import sys
|
||||
from collections import defaultdict
|
||||
from pathlib import Path
|
||||
from typing import Any
|
||||
|
||||
import cv2
|
||||
import numpy as np
|
||||
from _plain_console import Console
|
||||
from PIL import Image, ImageDraw, ImageFont
|
||||
|
||||
from remove_ai_watermarks import text_protector as tp
|
||||
|
||||
log = logging.getLogger(__name__)
|
||||
console = Console()
|
||||
|
||||
# A single font covering every tested script isolates "language" from "font".
|
||||
_FONT_CANDIDATES = [
|
||||
"/System/Library/Fonts/Supplemental/Arial Unicode.ttf",
|
||||
"/Library/Fonts/Arial Unicode.ttf",
|
||||
"/usr/share/fonts/truetype/noto/NotoSansCJK-Regular.ttc",
|
||||
"/usr/share/fonts/truetype/unifont/unifont.ttf",
|
||||
]
|
||||
SCRIPTS = {
|
||||
"Latin": "Generated by AI",
|
||||
"Cyrillic": "Сгенерировано ИИ",
|
||||
"CJK": "豆包AI生成内容",
|
||||
"Hangul": "AI로 생성됨",
|
||||
"Arabic": "أنشئ بالذكاء",
|
||||
"Digits": "0123456789",
|
||||
}
|
||||
FONT_SIZES = [16, 24, 32, 48, 64]
|
||||
CANVASES = [1024, 2048]
|
||||
PLACEMENTS = [(0.08, 0.15), (0.30, 0.55), (0.10, 0.82)]
|
||||
BG_COLORS = [(35, 40, 60), (210, 205, 200)]
|
||||
|
||||
|
||||
def _find_font() -> str:
|
||||
for path in _FONT_CANDIDATES:
|
||||
if Path(path).exists():
|
||||
return path
|
||||
raise SystemExit(
|
||||
"No multi-script font found. Install one (macOS ships 'Arial Unicode'; "
|
||||
"on Linux: a Noto CJK/super font) and add its path to _FONT_CANDIDATES."
|
||||
)
|
||||
|
||||
|
||||
def _render(
|
||||
font_path: str,
|
||||
canvas: int,
|
||||
text: str,
|
||||
font_size: int,
|
||||
place: tuple[float, float],
|
||||
bg: tuple[int, int, int],
|
||||
) -> tuple[Image.Image, tuple[int, int, int, int]]:
|
||||
img = Image.new("RGB", (canvas, canvas), bg)
|
||||
draw = ImageDraw.Draw(img)
|
||||
font = ImageFont.truetype(font_path, font_size)
|
||||
x, y = int(place[0] * canvas), int(place[1] * canvas)
|
||||
fg = (245, 245, 245) if sum(bg) < 360 else (20, 20, 20)
|
||||
draw.text((x, y), text, font=font, fill=fg)
|
||||
return img, draw.textbbox((x, y), text, font=font)
|
||||
|
||||
|
||||
def _coverage(boxes: list[Any], bbox: tuple[int, int, int, int], h: int, w: int) -> float | None:
|
||||
gt = np.zeros((h, w), np.uint8)
|
||||
cv2.rectangle(gt, (bbox[0], bbox[1]), (bbox[2], bbox[3]), 1, -1)
|
||||
area = int(gt.sum())
|
||||
if area == 0:
|
||||
return None
|
||||
det = np.zeros((h, w), np.uint8)
|
||||
if boxes:
|
||||
cv2.fillPoly(det, [np.asarray(b, np.int32) for b in boxes], 1)
|
||||
return int((gt & det).sum()) / area
|
||||
|
||||
|
||||
def _hitrate(values: list[float], thr: float = 0.5) -> float:
|
||||
return sum(c >= thr for c in values) / len(values) if values else float("nan")
|
||||
|
||||
|
||||
def main() -> int:
|
||||
logging.basicConfig(level=logging.WARNING)
|
||||
if not tp.is_available():
|
||||
raise SystemExit("text detector unavailable (need opencv with cv2.dnn.TextDetectionModel_DB)")
|
||||
font_path = _find_font()
|
||||
detector = tp.TextProtector()
|
||||
|
||||
by_script_size: dict[tuple[str, int], list[float]] = defaultdict(list)
|
||||
by_size_canvas: dict[tuple[int, int], list[float]] = defaultdict(list)
|
||||
by_script: dict[str, list[float]] = defaultdict(list)
|
||||
|
||||
for canvas in CANVASES:
|
||||
for script, text in SCRIPTS.items():
|
||||
for font_size in FONT_SIZES:
|
||||
for idx, place in enumerate(PLACEMENTS):
|
||||
img, bbox = _render(font_path, canvas, text, font_size, place, BG_COLORS[idx % len(BG_COLORS)])
|
||||
bgr = cv2.cvtColor(np.array(img), cv2.COLOR_RGB2BGR)
|
||||
cov = _coverage(detector.detect_text_boxes(bgr), bbox, canvas, canvas)
|
||||
if cov is None:
|
||||
continue
|
||||
by_script_size[(script, font_size)].append(cov)
|
||||
by_size_canvas[(font_size, canvas)].append(cov)
|
||||
by_script[script].append(cov)
|
||||
|
||||
console.print("=== hit-rate (coverage>=0.5) by script x font-size ===")
|
||||
console.print("script".ljust(10) + "".join(f"{fs:>7}" for fs in FONT_SIZES))
|
||||
for script in SCRIPTS:
|
||||
console.print(
|
||||
script.ljust(10) + "".join(f"{_hitrate(by_script_size[(script, fs)]):>7.2f}" for fs in FONT_SIZES)
|
||||
)
|
||||
|
||||
console.print("\n=== hit-rate by font-size x canvas (the downscale effect) ===")
|
||||
console.print("size".ljust(8) + "".join(f"{c:>8}" for c in CANVASES))
|
||||
for fs in FONT_SIZES:
|
||||
console.print(str(fs).ljust(8) + "".join(f"{_hitrate(by_size_canvas[(fs, c)]):>8.2f}" for c in CANVASES))
|
||||
|
||||
overall = _hitrate([c for vals in by_script.values() for c in vals])
|
||||
console.print(f"\nOVERALL hit-rate: {overall:.2f} (detector max long side = {tp._DET_MAX_LONG_SIDE})")
|
||||
return 0
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
sys.exit(main())
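The benchmark's two metrics, per-box coverage and the hit-rate at a 0.5 threshold, can be illustrated without OpenCV. The sketch below is a simplified stand-in, not the removed implementation: it assumes axis-aligned detection boxes with half-open pixel ranges, whereas `_coverage` rasterizes arbitrary polygons with `cv2.fillPoly`. The names `coverage` and `hitrate` are illustrative.

```python
def coverage(det_boxes, gt_box):
    """Fraction of the ground-truth box's pixels covered by any detected box.

    Boxes are (x1, y1, x2, y2) with half-open ranges (a simplifying assumption).
    """
    gx1, gy1, gx2, gy2 = gt_box
    area = (gx2 - gx1) * (gy2 - gy1)
    if area <= 0:
        return None
    covered = set()
    for x1, y1, x2, y2 in det_boxes:
        # Clip each detection to the ground-truth box, then collect its pixels;
        # the set makes overlapping detections count once.
        for y in range(max(y1, gy1), min(y2, gy2)):
            for x in range(max(x1, gx1), min(x2, gx2)):
                covered.add((x, y))
    return len(covered) / area


def hitrate(values, thr=0.5):
    """Share of coverage values at or above the threshold (NaN for an empty list)."""
    return sum(v >= thr for v in values) / len(values) if values else float("nan")


gt = (0, 0, 10, 10)
covs = [
    coverage([(0, 0, 10, 6)], gt),                 # one box covering 60 of 100 pixels
    coverage([(0, 0, 5, 10), (3, 0, 8, 10)], gt),  # two overlapping boxes, union is 80 pixels
    coverage([], gt),                              # nothing detected
]
print(covs, hitrate(covs))
```

Counting the union of detections rather than summing per-box areas keeps overlapping boxes from inflating the coverage above 1.0.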
|
||||
@@ -136,23 +136,14 @@ def _validate_image(path: Path) -> Path:

 _ALPHA_FORMATS = {".png", ".webp"}

-# Shared option decorators for commands that run the invisible-watermark pipeline.
-# Both cmd_invisible and cmd_all expose these flags; defining them once avoids
+# Shared option decorator for commands that run the invisible-watermark pipeline.
+# Both cmd_invisible and cmd_all expose this flag; defining it once avoids
 # copy-paste drift.
-_protect_text_option = click.option(
-    "--protect-text",
-    is_flag=True,
-    default=False,
-    help=(
-        "Enable text region protection (experimental: re-scrubs text blocks at high resolution). "
-        "May prevent SynthID removal in text areas -- verify with oracle before relying on it."
-    ),
-)
-_protect_faces_option = click.option(
-    "--protect-faces",
-    is_flag=True,
-    default=False,
-    help="Enable face protection (experimental: YOLO detect + blend original faces back).",
+_controlnet_scale_option = click.option(
+    "--controlnet-scale",
+    type=float,
+    default=1.0,
+    help="ControlNet conditioning scale (structure/text preservation strength), controlnet pipeline only.",
 )
|
||||
|
||||
|
||||
@@ -453,14 +444,15 @@ def cmd_erase(
     type=float,
     default=None,
     help="Denoising strength (0.0-1.0). Default: vendor-adaptive (OpenAI 0.10 / Google 0.15 / "
-    "unknown 0.15, from the C2PA issuer); ctrlregen uses 1.0.",
+    "unknown 0.15, from the C2PA issuer).",
 )
 @click.option("--steps", type=int, default=50, help="Number of denoising steps. Default: 50.")
 @click.option(
     "--pipeline",
-    type=click.Choice(["default", "ctrlregen"]),
+    type=click.Choice(["default", "controlnet"]),
     default="default",
-    help="Pipeline profile (default=SDXL; ctrlregen=CtrlRegen, EXPERIMENTAL/destructive at clean-noise).",
+    help="Pipeline profile (default=SDXL img2img; controlnet=SDXL + canny ControlNet that preserves "
+    "text/faces via edge conditioning while removing SynthID).",
 )
 @click.option(
     "--device",
|
||||
@@ -479,8 +471,7 @@ def cmd_erase(
     default=0,
     help="Cap long side (px) before diffusion; 0 = native (best quality, like raiw.cc). Raise only on GPU/MPS OOM.",
 )
-@_protect_text_option
-@_protect_faces_option
+@_controlnet_scale_option
 @click.pass_context
 def cmd_invisible(
     ctx: click.Context,
|
||||
@@ -494,8 +485,7 @@ def cmd_invisible(
     hf_token: str | None,
     humanize: float,
     max_resolution: int,
-    protect_text: bool,
-    protect_faces: bool,
+    controlnet_scale: float,
 ) -> None:
     """Remove invisible AI watermarks (SynthID, StableSignature, TreeRing).
|
||||
|
||||
@@ -526,6 +516,7 @@ def cmd_invisible(
         pipeline=pipeline,
         hf_token=hf_token,
         progress_callback=progress_cb,
+        controlnet_conditioning_scale=controlnet_scale,
     )

     # Detect the SynthID vendor from the ORIGINAL (before processing strips C2PA) so the
|
||||
@@ -533,7 +524,7 @@ def cmd_invisible(
     vendor = vendor_for_strength(source)
     console.print(f" Input: {source.name}")
     console.print(f" Pipeline: {pipeline}")
-    console.print(f" Strength: {resolve_strength(strength, pipeline, vendor)} Steps: {steps}")
+    console.print(f" Strength: {resolve_strength(strength, vendor)} Steps: {steps}")

     t0 = time.monotonic()
     result_path = engine.remove_watermark(
|
||||
@@ -544,8 +535,6 @@ def cmd_invisible(
         guidance_scale=None,
         seed=seed,
         humanize=humanize,
-        protect_text=protect_text,
-        protect_faces=protect_faces,
         max_resolution=max_resolution,
         vendor=vendor,
     )
|
||||
@@ -694,15 +683,15 @@ def cmd_identify(ctx: click.Context, source: Path, no_visible: bool, as_json: bo
     "--strength",
     type=float,
     default=None,
-    help="Invisible watermark denoising strength. Default: vendor-adaptive "
-    "(OpenAI 0.10 / Google 0.15 / unknown 0.15); ctrlregen uses 1.0.",
+    help="Invisible watermark denoising strength. Default: vendor-adaptive (OpenAI 0.10 / Google 0.15 / unknown 0.15).",
 )
 @click.option("--steps", type=int, default=50, help="Number of denoising steps for invisible removal.")
 @click.option(
     "--pipeline",
-    type=click.Choice(["default", "ctrlregen"]),
+    type=click.Choice(["default", "controlnet"]),
     default="default",
-    help="Pipeline profile (default=SDXL; ctrlregen=CtrlRegen, EXPERIMENTAL/destructive at clean-noise).",
+    help="Pipeline profile (default=SDXL img2img; controlnet=SDXL + canny ControlNet that preserves "
+    "text/faces via edge conditioning while removing SynthID).",
 )
 @click.option("--model", type=str, default=None, help="HuggingFace model ID for invisible removal.")
 @click.option(
|
||||
@@ -722,8 +711,7 @@ def cmd_identify(ctx: click.Context, source: Path, no_visible: bool, as_json: bo
     default=0,
     help="Cap long side (px) before diffusion; 0 = native (best quality, like raiw.cc). Raise only on GPU/MPS OOM.",
 )
-@_protect_text_option
-@_protect_faces_option
+@_controlnet_scale_option
 @click.pass_context
 def cmd_all(
     ctx: click.Context,
|
||||
@@ -740,8 +728,7 @@ def cmd_all(
     hf_token: str | None,
     humanize: float,
     max_resolution: int,
-    protect_text: bool,
-    protect_faces: bool,
+    controlnet_scale: float,
 ) -> None:
     """Remove ALL watermarks: visible + invisible + metadata.
|
||||
|
||||
@@ -822,13 +809,14 @@ def cmd_all(
         pipeline=pipeline,
         hf_token=hf_token,
         progress_callback=progress_cb,
+        controlnet_conditioning_scale=controlnet_scale,
     )

     # Detect the vendor from the pristine ORIGINAL (`source`); `tmp_path` has
     # already lost its C2PA to the visible-removal pass, so reading it would
     # always resolve to the unknown-vendor default.
     vendor = vendor_for_strength(source)
-    console.print(f" Strength: {resolve_strength(strength, pipeline, vendor)} Steps: {steps}")
+    console.print(f" Strength: {resolve_strength(strength, vendor)} Steps: {steps}")
     inv_engine.remove_watermark(
         image_path=tmp_path,
         output_path=tmp_path,
|
||||
@@ -836,8 +824,6 @@ def cmd_all(
         num_inference_steps=steps,
         seed=seed,
         humanize=humanize,
-        protect_text=protect_text,
-        protect_faces=protect_faces,
         max_resolution=max_resolution,
         vendor=vendor,
     )
|
||||
@@ -990,9 +976,10 @@ def _process_batch_image(
 )
 @click.option(
     "--pipeline",
-    type=click.Choice(["default", "ctrlregen"]),
+    type=click.Choice(["default", "controlnet"]),
     default="default",
-    help="Pipeline profile (default=SDXL; ctrlregen=CtrlRegen, EXPERIMENTAL/destructive at clean-noise).",
+    help="Pipeline profile (default=SDXL img2img; controlnet=SDXL + canny ControlNet that preserves "
+    "text/faces via edge conditioning while removing SynthID).",
 )
 @click.option(
     "--device",
|
||||
|
||||
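The hunks above drop the `pipeline` argument from `resolve_strength`, leaving the default strength a function of the explicit `--strength` value and the detected vendor. The real function body is not shown in this diff, so the following is only a guess at its shape: the defaults are copied from the `--strength` help text, while the name `resolve_strength_sketch`, the vendor keys `"openai"` / `"google"`, and the range check are all assumptions.

```python
from typing import Optional

# Defaults taken from the --strength help text; the vendor keys are assumed labels.
_VENDOR_STRENGTH = {"openai": 0.10, "google": 0.15}
_UNKNOWN_STRENGTH = 0.15


def resolve_strength_sketch(strength: Optional[float], vendor: Optional[str]) -> float:
    """Explicit strength wins; otherwise fall back to the vendor-adaptive default."""
    if strength is not None:
        # Range check is an assumption based on the "(0.0-1.0)" help text.
        if not 0.0 <= strength <= 1.0:
            raise ValueError("strength must be within 0.0-1.0")
        return strength
    return _VENDOR_STRENGTH.get((vendor or "").lower(), _UNKNOWN_STRENGTH)


print(resolve_strength_sketch(None, "openai"))  # vendor default
print(resolve_strength_sketch(None, None))      # unknown-vendor default
print(resolve_strength_sketch(0.4, "openai"))   # explicit value overrides the vendor
```

With the `ctrlregen` special case (fixed strength 1.0) gone, nothing in this lookup depends on the selected pipeline, which matches the two-argument call sites in the diff.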
@@ -1,150 +0,0 @@
|
||||
"""YOLO-based face detection and soft-blend restoration for diffusion pipelines."""
|
||||
|
||||
# cv2/numpy/ultralytics boundary: these libs ship no usable element types; relax
|
||||
# the unknown-type rules for this file only.
|
||||
# pyright: reportUnknownMemberType=false, reportUnknownArgumentType=false, reportUnknownVariableType=false, reportUnknownParameterType=false, reportMissingTypeArgument=false, reportMissingTypeStubs=false, reportMissingImports=false, reportArgumentType=false, reportAssignmentType=false, reportReturnType=false, reportCallIssue=false, reportIndexIssue=false, reportOperatorIssue=false, reportOptionalMemberAccess=false, reportOptionalCall=false, reportOptionalSubscript=false, reportOptionalOperand=false, reportAttributeAccessIssue=false, reportPrivateImportUsage=false, reportPrivateUsage=false, reportInvalidTypeForm=false, reportConstantRedefinition=false, reportUnnecessaryComparison=false, reportPossiblyUnboundVariable=false
|
||||
import logging
|
||||
from pathlib import Path
|
||||
|
||||
import cv2
|
||||
import numpy as np
|
||||
|
||||
try:
|
||||
from ultralytics import YOLO
|
||||
|
||||
HAS_YOLO = True
|
||||
except ImportError:
|
||||
HAS_YOLO = False
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class FaceProtector:
|
||||
"""
|
||||
Detects faces in an image and provides methods to seamlessly paste them back
|
||||
onto an upscaled/processed image to preserve facial details that may have
|
||||
been destroyed by latent diffusion or other algorithms.
|
||||
"""
|
||||
|
||||
def __init__(self, use_yolo: bool = True, model_name: str = "yolov8n.pt") -> None:
|
||||
self.use_yolo = use_yolo and HAS_YOLO
|
||||
self.detector = None
|
||||
self.haar_cascade = None
|
||||
|
||||
if self.use_yolo:
|
||||
# Fix SSL certificate issues on macOS (fresh Python installs)
|
||||
self._fix_ssl_certs()
|
||||
logger.info("Loading YOLO model '%s' for face protection...", model_name)
|
||||
self.detector = YOLO(model_name)
|
||||
else:
|
||||
if use_yolo and not HAS_YOLO:
|
||||
logger.warning(
|
||||
"ultralytics YOLO is not installed. Falling back to OpenCV Haar "
|
||||
"Cascades. Install ultralytics with `pip install ultralytics` "
|
||||
"for better face detection."
|
||||
)
|
||||
logger.info("Loading OpenCV Haar Cascade for face protection...")
|
||||
cascade_path = Path(cv2.__file__).parent / "data" / "haarcascade_frontalface_default.xml"
|
||||
if not cascade_path.exists():
|
||||
cascade_path = "haarcascade_frontalface_default.xml"
|
||||
self.haar_cascade = cv2.CascadeClassifier(str(cascade_path))
|
||||
|
||||
def detect_face_bboxes(self, image: np.ndarray) -> list[tuple[int, int, int, int]]:
|
||||
"""
|
||||
Detect faces and return bounding boxes as (x1, y1, x2, y2).
|
||||
"""
|
||||
if self.use_yolo and self.detector is not None:
|
||||
# For standard YOLOv8n, 'person' is class 0. We'll use person bounding boxes
|
||||
# as a proxy for faces/people to protect them. If using a specific face model, adjust classes.
|
||||
results = self.detector(image, verbose=False, classes=[0])
|
||||
bboxes = []
|
||||
for r in results:
|
||||
boxes = r.boxes
|
||||
for box in boxes:
|
||||
x1, y1, x2, y2 = box.xyxy[0].cpu().numpy()
|
||||
bboxes.append((int(x1), int(y1), int(x2), int(y2)))
|
||||
-            return bboxes
-
-        gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
-        faces = self.haar_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5, minSize=(30, 30))
-        bboxes = []
-        for x, y, w, h in faces:
-            # Add a 20% margin around the haar cascade face box
-            margin_x = int(w * 0.2)
-            margin_y = int(h * 0.2)
-            x1 = max(0, x - margin_x)
-            y1 = max(0, y - int(margin_y * 1.5))  # more margin on top for hair
-            x2 = min(image.shape[1], x + w + margin_x)
-            y2 = min(image.shape[0], y + h + margin_y)
-            bboxes.append((x1, y1, x2, y2))
-        return bboxes
-
-    @staticmethod
-    def _fix_ssl_certs() -> None:
-        """Set SSL_CERT_FILE from certifi if not already set (macOS fix)."""
-        import os
-
-        if os.environ.get("SSL_CERT_FILE"):
-            return
-        try:
-            import certifi
-
-            os.environ["SSL_CERT_FILE"] = certifi.where()
-        except ImportError:
-            pass
-
-    def extract_faces(self, image: np.ndarray) -> list[tuple[tuple[int, int, int, int], np.ndarray]]:
-        """
-        Extract faces from the image.
-        Returns a list of (bbox, face_crop) tuples.
-        """
-        bboxes = self.detect_face_bboxes(image)
-        faces = []
-        for bbox in bboxes:
-            x1, y1, x2, y2 = bbox
-            faces.append((bbox, image[y1:y2, x1:x2].copy()))
-        return faces
-
-    def restore_faces(
-        self, processed_image: np.ndarray, original_faces: list[tuple[tuple[int, int, int, int], np.ndarray]]
-    ) -> np.ndarray:
-        """
-        Paste original faces back onto the processed image using seamless cloning
-        or soft blending so the edges don't show.
-        """
-        if not original_faces:
-            return processed_image
-
-        result = processed_image.copy()
-        for (x1, y1, x2, y2), face_crop in original_faces:
-            h, w = face_crop.shape[:2]
-
-            # If the processed image was resized, we'd need to resize face_crop, but
-            # pipeline ensures the output from InvisibleEngine is the same size or we resize it back before this.
-            if result.shape[:2] != processed_image.shape[:2]:
-                continue  # Safety bypass
-
-            try:
-                # Create a soft alpha mask for the face crop to smoothly blend it
-                mask = np.zeros((h, w), dtype=np.float32)
-
-                # Inner ellipse is pure white
-                cv2.ellipse(mask, (w // 2, h // 2), (int(w * 0.4), int(h * 0.4)), 0, 0, 360, 1.0, -1)
-
-                # Blur the mask heavily for soft edges
-                blur_size = max(w, h) // 4
-                if blur_size % 2 == 0:
-                    blur_size += 1
-                mask = cv2.GaussianBlur(mask, (blur_size, blur_size), 0)
-                mask = cv2.merge([mask, mask, mask])
-
-                # Blend
-                target_roi = result[y1:y2, x1:x2].astype(np.float32)
-                src_roi = face_crop.astype(np.float32)
-
-                blended = src_roi * mask + target_roi * (1.0 - mask)
-                result[y1:y2, x1:x2] = blended.astype(np.uint8)
-            except Exception as e:
-                logger.warning("Failed to restore face at %d,%d to %d,%d: %s", x1, y1, x2, y2, e)
-
-        return result
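The Haar-cascade fallback in the removed face protector pads each detection by 20% of its size, gives the top edge 1.5x that margin for hair, and clamps the result to the image bounds. A minimal standalone sketch of that arithmetic follows; `expand_bbox` and its keyword parameters are illustrative names, not part of the removed module.

```python
def expand_bbox(x, y, w, h, img_w, img_h, margin=0.2, top_factor=1.5):
    """Pad an (x, y, w, h) detection and clamp it to the image bounds.

    Hypothetical helper: mirrors the margin logic of the removed Haar path,
    returning corner coordinates (x1, y1, x2, y2).
    """
    margin_x = int(w * margin)
    margin_y = int(h * margin)
    x1 = max(0, x - margin_x)
    y1 = max(0, y - int(margin_y * top_factor))  # extra headroom above the face
    x2 = min(img_w, x + w + margin_x)
    y2 = min(img_h, y + h + margin_y)
    return x1, y1, x2, y2


# A box well inside the frame grows on every side.
inner = expand_bbox(100, 100, 50, 50, img_w=640, img_h=480)

# A box near the border is clamped instead of going out of range.
clamped = expand_bbox(5, 5, 50, 50, img_w=60, img_h=60)
```

Clamping matters because the corners are later used as NumPy slice bounds: a negative `y1` would silently index from the bottom of the image rather than raise.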
@@ -7,9 +7,9 @@ This module requires the 'gpu' extra dependencies:
     uv pip install 'remove-ai-watermarks[gpu]'
 """
 
-# cv2/torch boundary: this engine wraps cv2 (resize/imwrite/cvtColor), the YOLO
-# face protector, and the humanizer, none of which carry usable element types;
-# relax the unknown-type rules for this file only.
+# cv2/torch boundary: this engine wraps cv2 (resize/imwrite/cvtColor) and the
+# humanizer, none of which carry usable element types; relax the unknown-type
+# rules for this file only.
 # pyright: reportUnknownMemberType=false, reportUnknownArgumentType=false, reportUnknownVariableType=false, reportUnknownParameterType=false, reportMissingTypeArgument=false, reportMissingTypeStubs=false, reportMissingImports=false, reportArgumentType=false, reportAssignmentType=false, reportReturnType=false, reportCallIssue=false, reportIndexIssue=false, reportOperatorIssue=false, reportOptionalMemberAccess=false, reportOptionalCall=false, reportOptionalSubscript=false, reportOptionalOperand=false, reportAttributeAccessIssue=false, reportPrivateImportUsage=false, reportPrivateUsage=false, reportInvalidTypeForm=false, reportConstantRedefinition=false, reportUnnecessaryComparison=false
 from __future__ import annotations
 
@@ -70,12 +70,9 @@ class InvisibleEngine:
     to break watermark patterns, and reconstructs via reverse diffusion.
     """
 
-    # SDXL base is the default since May 2026; the current Google SynthID is
-    # removed at strength ~0.30 / steps=50 / native res (oracle-verified, n=3 fresh
-    # Gemini -- 0.10/0.15/0.2 still detected). See CLAUDE.md "Known limitations" for
-    # the strength study and the regression evidence ruling out SD-1.5 pipelines.
+    # SDXL base is the default since May 2026; the vendor-adaptive strength
+    # removes the current SynthID (see watermark_profiles + docs/synthid.md).
     DEFAULT_MODEL_ID = "stabilityai/stable-diffusion-xl-base-1.0"
-    CTRLREGEN_MODEL_ID = "yepengliu/ctrlregen"
 
     def __init__(
         self,
@@ -84,31 +81,33 @@ class InvisibleEngine:
         pipeline: str = "default",
         hf_token: str | None = None,
         progress_callback: Callable[[str], None] | None = None,
+        controlnet_conditioning_scale: float = 1.0,
     ) -> None:
         """Initialize the invisible watermark removal engine.
 
         Args:
-            model_id: HuggingFace model ID. None = use default for pipeline.
+            model_id: HuggingFace model ID. None = use the SDXL base default.
             device: Device for inference (auto/cpu/mps/cuda/xpu). None = auto.
-            pipeline: Pipeline profile. "default" (SDXL base, defeats SynthID
-                v2) or "ctrlregen" (CtrlRegen).
+            pipeline: Pipeline profile. "default" (plain SDXL img2img) or
+                "controlnet" (SDXL + canny ControlNet that preserves text/face
+                structure via edge conditioning while removing SynthID).
             hf_token: HuggingFace API token.
             progress_callback: Optional callback for progress messages.
+            controlnet_conditioning_scale: ControlNet structure-preservation
+                strength (controlnet pipeline only).
         """
 
         from remove_ai_watermarks.noai.watermark_remover import WatermarkRemover
 
-        effective_model = model_id
-        if pipeline == "ctrlregen" and model_id is None:
-            effective_model = self.CTRLREGEN_MODEL_ID
-        elif model_id is None:
-            effective_model = self.DEFAULT_MODEL_ID
+        effective_model = model_id or self.DEFAULT_MODEL_ID
 
         self._remover = WatermarkRemover(
             model_id=effective_model,
             device=device,
             progress_callback=progress_callback,
             hf_token=hf_token,
+            pipeline=pipeline,
+            controlnet_conditioning_scale=controlnet_conditioning_scale,
         )
         self._progress_callback = progress_callback
 
@@ -125,8 +124,6 @@ class InvisibleEngine:
         guidance_scale: float | None = None,
         seed: int | None = None,
         humanize: float = 0.0,
-        protect_faces: bool = False,
-        protect_text: bool = False,
         max_resolution: int = 0,
         vendor: str | None = None,
     ) -> Path:
@@ -135,16 +132,12 @@ class InvisibleEngine:
         Args:
             image_path: Path to the watermarked image.
             output_path: Output path (None = overwrite source).
-            strength: Denoising strength (0.0-1.0). None -> profile default
-                (0.10 for SDXL, 1.0 clean-noise for ctrlregen).
+            strength: Denoising strength (0.0-1.0). None -> the vendor-adaptive
+                default.
             steps: Number of denoising steps.
             guidance_scale: Classifier-free guidance scale.
             seed: Random seed for reproducibility.
             humanize: Intensity of Analog Humanizer film grain (0 = off).
-            protect_faces: Boolean to extract and restore faces intact.
-            protect_text: Detect text regions and preserve them via Differential
-                Diffusion when any are found, so glyphs (incl. CJK) survive the
-                removal pass. On by default; the detector decides per image.
             max_resolution: Cap the long side (px) before diffusion. 0 (default)
                 = native resolution, no pre-downscale -- matches the hosted
                 raiw.cc backend. Set a positive value only to bound GPU/MPS
@@ -189,27 +182,6 @@ class InvisibleEngine:
             image_path = _tmp_path
 
         try:
-            # Optional: Face protection (Phase 1 - Extraction)
-            original_faces = []
-            if protect_faces:
-                try:
-                    import cv2
-
-                    from remove_ai_watermarks.face_protector import FaceProtector
-
-                    if self._progress_callback:
-                        self._progress_callback("Detecting and extracting faces (protect-faces)...")
-                    # Convert PIL to CV2 BGR
-                    import numpy as np
-
-                    cv_img = cv2.cvtColor(np.array(image), cv2.COLOR_RGB2BGR)
-                    protector = FaceProtector(use_yolo=True)
-                    original_faces = protector.extract_faces(cv_img)
-                    if self._progress_callback:
-                        self._progress_callback(f"Extracted {len(original_faces)} face(s) for protection.")
-                except Exception as e:
-                    logger.error("Failed to extract faces: %s", e)
-
             out_path = self._remover.remove_watermark(
                 image_path=image_path,
                 output_path=output_path,
@@ -217,14 +189,12 @@ class InvisibleEngine:
                 num_inference_steps=num_inference_steps,
                 guidance_scale=guidance_scale,
                 seed=seed,
-                protect_text=protect_text,
                 vendor=vendor,
             )
 
-            # Optional: Face restoration & Humanizer (Phase 2 - Post-processing)
-            if protect_faces or humanize > 0.0:
+            # Post-processing: optional Humanizer, then restore original resolution.
+            if humanize > 0.0:
                 import cv2
-                import numpy as np
 
                 from remove_ai_watermarks import image_io
 
@@ -232,20 +202,11 @@ class InvisibleEngine:
                 if out_cv is None:
                     return out_path
 
-                if protect_faces and original_faces:
-                    if self._progress_callback:
-                        self._progress_callback("Restoring protected faces with soft blending...")
-                    from remove_ai_watermarks.face_protector import FaceProtector
+                if self._progress_callback:
+                    self._progress_callback(f"Applying Analog Humanizer (grain: {humanize})...")
+                from remove_ai_watermarks.humanizer import apply_analog_humanizer
 
-                    protector = FaceProtector(use_yolo=True)
-                    out_cv = protector.restore_faces(out_cv, original_faces)
-
-                if humanize > 0.0:
-                    if self._progress_callback:
-                        self._progress_callback(f"Applying Analog Humanizer (grain: {humanize})...")
-                    from remove_ai_watermarks.humanizer import apply_analog_humanizer
-
-                    out_cv = apply_analog_humanizer(out_cv, grain_intensity=humanize, chromatic_shift=1)
+                out_cv = apply_analog_humanizer(out_cv, grain_intensity=humanize, chromatic_shift=1)
 
                 # Restore original resolution
                 if (out_cv.shape[1], out_cv.shape[0]) != orig_size:
@@ -259,7 +220,7 @@ class InvisibleEngine:
                 image_io.imwrite(out_path, out_cv)
 
             else:
-                # Even if no protect_faces or humanize, we must restore original size if needed
+                # No humanize: still restore the original size if it was capped.
                 import cv2
 
                 from remove_ai_watermarks import image_io
@@ -255,10 +255,9 @@ def has_ai_metadata(image_path: Path) -> bool:
     """
     from PIL import Image
 
-    # PIL may not handle AVIF/HEIF/JPEG-XL without the optional plugins
-    # (ultralytics also monkey-patches Image.open in a way that can raise
-    # ModuleNotFoundError when pi_heif autoload fails), so any open failure
-    # falls through to the binary scan.
+    # PIL may not handle AVIF/HEIF/JPEG-XL without the optional plugins, and a
+    # third-party plugin autoload can raise a non-OSError (e.g. ModuleNotFoundError),
+    # so any open failure falls through to the binary scan.
     try:
         with Image.open(image_path) as img:
             for key in img.info:
@@ -655,9 +654,9 @@ def get_ai_metadata(image_path: Path) -> dict[str, str]:
 
     result: dict[str, str] = {}
 
-    # PIL may not open AVIF/HEIF/JPEG-XL without optional plugins (and
-    # ultralytics' Image.open patch can raise ModuleNotFoundError); fall through
-    # to the C2PA/binary path on any open failure. See CLAUDE.md.
+    # PIL may not open AVIF/HEIF/JPEG-XL without optional plugins (and a
+    # third-party plugin autoload can raise a non-OSError); fall through to the
+    # C2PA/binary path on any open failure. See CLAUDE.md.
     try:
         with Image.open(image_path) as img:
             for key, value in img.info.items():
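Both comment changes above describe the same control flow: try to read metadata through PIL, and treat any exception from the open call as a reason to fall back to a raw byte scan. A rough sketch of that pattern under stated assumptions: `read_info` stands in for the PIL-based reader, and the `b"c2pa"` marker and key check are placeholders, not the project's real detection rules.

```python
from pathlib import Path
from typing import Callable, Mapping

# Placeholder marker for illustration only; the real scan rules are not shown in this diff.
MARKER = b"c2pa"


def has_marker(path: Path, read_info: Callable[[Path], Mapping[str, object]]) -> bool:
    """Return True if the structured reader or the raw byte scan finds the marker."""
    try:
        info = read_info(path)
        if any("c2pa" in key.lower() for key in info):
            return True
    except Exception:
        # Deliberately broad: a plugin autoload may raise something other than OSError
        # (for example ModuleNotFoundError), and the fallback must still run.
        pass
    # Fallback path that needs no image decoder at all.
    return MARKER in path.read_bytes()
```

Catching only `OSError` here would let a `ModuleNotFoundError` from a plugin escape and skip the binary scan, which is the failure mode the rewritten comments call out.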
@@ -1,18 +0,0 @@
-"""CtrlRegen watermark removal via controllable regeneration.
-
-Implements the pipeline from "Image Watermarks Are Removable Using
-Controllable Regeneration from Clean Noise" (ICLR 2025) by Liu et al.
-
-This sub-package uses a ControlNet for spatial guidance (canny edges)
-and a DINOv2-based IP Adapter for semantic guidance to regenerate
-watermarked images from partially noised latents.
-
-Attribution:
-    Based on https://github.com/yepengliu/CtrlRegen .
-"""
-
-from __future__ import annotations
-
-from remove_ai_watermarks.noai.ctrlregen.engine import CtrlRegenEngine, is_ctrlregen_available
-
-__all__ = ["CtrlRegenEngine", "is_ctrlregen_available"]
@@ -1,40 +0,0 @@
-"""Color matching post-processing for CtrlRegen output.
-
-After diffusion-based regeneration, the output image may have slight
-color shifts. This module uses histogram-based color transfer to
-align the regenerated image's color distribution back to the original.
-
-Attribution:
-    Adapted from https://github.com/yepengliu/CtrlRegen .
-"""
-
-from __future__ import annotations
-
-import numpy as np
-from color_matcher import ColorMatcher
-from color_matcher.normalizer import Normalizer
-from PIL import Image
-
-
-def color_match(reference: Image.Image, source: Image.Image) -> Image.Image:
-    """Transfer the color distribution of *reference* onto *source*.
-
-    Uses a two-pass histogram matching approach (``hm-mkl-hm``) that
-    preserves fine-grained color relationships while correcting global
-    shifts introduced by the regeneration pipeline.
-
-    Args:
-        reference: The original (watermarked) image whose colors should
-            be preserved.
-        source: The regenerated image whose colors will be adjusted.
-
-    Returns:
-        A new PIL Image with the structure of *source* but the color
-        palette of *reference*.
-    """
-    cm = ColorMatcher()
-    ref_np = Normalizer(np.asarray(reference)).type_norm()
-    src_np = Normalizer(np.asarray(source)).type_norm()
-    result = cm.transfer(src=src_np, ref=ref_np, method="hm-mkl-hm")
-    result = Normalizer(result).uint8_norm()
-    return Image.fromarray(result)
@@ -1,365 +0,0 @@
-"""CtrlRegen engine — orchestrates the full watermark removal pipeline.
-
-Loads the base SD 1.5 model with a ControlNet (spatial control from
-canny edges) and a DINOv2-based IP Adapter (semantic control), then
-runs controllable regeneration with optional color matching.
-
-Attribution:
-    Based on https://github.com/yepengliu/CtrlRegen .
-"""
-
-from __future__ import annotations
-
-import contextlib
-import logging
-import os
-import sys
-import time
-from typing import TYPE_CHECKING, Any
-
-if TYPE_CHECKING:
-    from collections.abc import Callable
-
-import torch
-from PIL import Image
-
-from remove_ai_watermarks.noai.progress import make_pipeline_progress
-
-logger = logging.getLogger(__name__)
-
-# ---------------------------------------------------------------------------
-# Availability checks — these imports are optional.
-# ---------------------------------------------------------------------------
-
-_HAS_CONTROLNET_AUX = False
-_HAS_COLOR_MATCHER = False
-_HAS_DIFFUSERS = False
-
-try:
-    from diffusers import AutoencoderKL, ControlNetModel, UniPCMultistepScheduler
-
-    from remove_ai_watermarks.noai.ctrlregen.pipeline import CustomCtrlRegenPipeline
-
-    _HAS_DIFFUSERS = True
-except ImportError:
-    AutoencoderKL = None  # type: ignore[assignment,misc]
-    ControlNetModel = None  # type: ignore[assignment,misc]
-    UniPCMultistepScheduler = None  # type: ignore[assignment,misc]
-    CustomCtrlRegenPipeline = None  # type: ignore[assignment,misc]
-
-try:
-    from controlnet_aux import CannyDetector
-
-    _HAS_CONTROLNET_AUX = True
-except ImportError:
-    CannyDetector = None  # type: ignore[assignment,misc]
-
-try:
-    from remove_ai_watermarks.noai.ctrlregen.color import color_match
-
-    _HAS_COLOR_MATCHER = True
-except ImportError:
-    color_match = None  # type: ignore[assignment]
-
-CTRLREGEN_HF_REPO = "yepengliu/ctrlregen"
-SPATIAL_SUBFOLDER = "spatialnet_ckp/spatial_control_ckp_14000"
-SEMANTIC_SUBFOLDER = "semanticnet_ckp/models"
-SEMANTIC_WEIGHT_NAME = "semantic_control_ckp_435000.bin"
-
-DEFAULT_BASE_MODEL = "SG161222/Realistic_Vision_V4.0_noVAE"
-CUSTOM_VAE_ID = "stabilityai/sd-vae-ft-mse"
-
-PROCESS_SIZE = 512
-DEFAULT_GUIDANCE_SCALE = 2.0
-QUALITY_PROMPT = "best quality, high quality"
-NEGATIVE_PROMPT = "monochrome, lowres, bad anatomy, worst quality, low quality"
-
-CANNY_LOW_THRESHOLD = 100
-CANNY_HIGH_THRESHOLD = 150
-
-TILE_SIZE = 512
-TILE_OVERLAP = 192
-
-
-def is_ctrlregen_available() -> bool:
-    """Return True when all CtrlRegen-specific dependencies are installed."""
-    return _HAS_DIFFUSERS and _HAS_CONTROLNET_AUX and _HAS_COLOR_MATCHER
-
-
-class CtrlRegenEngine:
-    """End-to-end CtrlRegen watermark removal engine.
-
-    Handles model loading, canny edge extraction, controlled denoising,
-    and color-matched post-processing in a single ``run()`` call.
-    """
-
-    def __init__(
-        self,
-        base_model_id: str | None = None,
-        device: str = "cpu",
-        torch_dtype: torch.dtype | None = None,
-        hf_token: str | None = None,
-        progress_callback: Callable[[str], None] | None = None,
-    ) -> None:
-        if not is_ctrlregen_available():
-            missing: list[str] = []
-            if not _HAS_DIFFUSERS:
-                missing.extend(["diffusers", "transformers", "accelerate"])
-            if not _HAS_CONTROLNET_AUX:
-                missing.append("controlnet-aux")
-            if not _HAS_COLOR_MATCHER:
-                missing.append("color-matcher")
-            logger.info("Auto-installing missing dependencies: %s", missing)
-            import subprocess
-
-            try:
-                subprocess.check_call(
-                    [sys.executable, "-m", "pip", "install", *missing],
-                    stdout=subprocess.DEVNULL,
-                    stderr=subprocess.DEVNULL,
-                )
-            except (subprocess.CalledProcessError, FileNotFoundError) as exc:
-                raise ImportError(
-                    "Failed to auto-install missing dependencies: "
-                    + ", ".join(missing)
-                    + ". Try manually: pip install --force-reinstall noai-watermark"
-                ) from exc
-
-        self.base_model_id = base_model_id or DEFAULT_BASE_MODEL
-        self.device = device
-        self.torch_dtype = torch_dtype or (torch.float32 if device in ("cpu", "mps") else torch.float16)
-        self.hf_token: str | None = hf_token or os.environ.get("HF_TOKEN")
-        self._progress_callback = progress_callback
-        self._pipeline: CustomCtrlRegenPipeline | None = None  # type: ignore[assignment]
-        self._canny_detector: CannyDetector | None = None  # type: ignore[assignment]
-
-    def _set_progress(self, message: str) -> None:
-        if self._progress_callback is None:
-            return
-        with contextlib.suppress(Exception):
-            self._progress_callback(message)
-
-    # ------------------------------------------------------------------
-    # Loading
-    # ------------------------------------------------------------------
-
-    def load(self) -> None:
-        """Download and assemble the full CtrlRegen pipeline."""
-        if self._pipeline is not None:
-            return
-
-        token_kwargs: dict[str, Any] = {}
-        if self.hf_token:
-            token_kwargs["token"] = self.hf_token
-
-        self._set_progress(f"Loading CtrlRegen spatial ControlNet from {CTRLREGEN_HF_REPO}...")
-        logger.info("Loading ControlNet from %s/%s", CTRLREGEN_HF_REPO, SPATIAL_SUBFOLDER)
-        controlnet = [
-            ControlNetModel.from_pretrained(
-                CTRLREGEN_HF_REPO,
-                subfolder=SPATIAL_SUBFOLDER,
-                torch_dtype=self.torch_dtype,
-                **token_kwargs,
-            )
-        ]
-
-        self._set_progress(f"Loading SD base model ({self.base_model_id}) for CtrlRegen pipeline...")
-        logger.info("Loading base pipeline from %s", self.base_model_id)
-        pipe = CustomCtrlRegenPipeline.from_pretrained(
-            self.base_model_id,
-            controlnet=controlnet,
-            torch_dtype=self.torch_dtype,
-            safety_checker=None,
-            requires_safety_checker=False,
-            **token_kwargs,
-        )
-
-        self._set_progress(f"Loading CtrlRegen semantic IP-Adapter + DINOv2 from {CTRLREGEN_HF_REPO}...")
-        logger.info("Loading IP-Adapter from %s/%s", CTRLREGEN_HF_REPO, SEMANTIC_SUBFOLDER)
-        pipe.load_ctrlregen_ip_adapter(
-            CTRLREGEN_HF_REPO,
-            subfolder=SEMANTIC_SUBFOLDER,
-            weight_name=SEMANTIC_WEIGHT_NAME,
-            **token_kwargs,
-        )
-
-        from transformers import AutoImageProcessor, AutoModel
-
-        pipe.image_encoder = AutoModel.from_pretrained("facebook/dinov2-giant").to(self.device, dtype=self.torch_dtype)
-        pipe.feature_extractor = AutoImageProcessor.from_pretrained("facebook/dinov2-giant")
-
-        self._set_progress(f"Loading custom VAE ({CUSTOM_VAE_ID})...")
-        logger.info("Loading VAE from %s", CUSTOM_VAE_ID)
-        pipe.vae = AutoencoderKL.from_pretrained(
-            CUSTOM_VAE_ID,
-            torch_dtype=self.torch_dtype,
-            **token_kwargs,
-        )
-
-        self._set_progress("Configuring UniPC scheduler...")
-        pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
-        pipe.set_ip_adapter_scale(1.0)
-
-        self._set_progress(f"Moving CtrlRegen pipeline to {self.device}...")
-        pipe = pipe.to(self.device)
-
-        if hasattr(pipe, "enable_xformers_memory_efficient_attention"):
-            with contextlib.suppress(Exception):
-                pipe.enable_xformers_memory_efficient_attention()
-
-        self._pipeline = pipe
-        self._canny_detector = CannyDetector()
-        self._set_progress("CtrlRegen pipeline ready.")
-        logger.info("CtrlRegen pipeline loaded on %s", self.device)
-
-    # ------------------------------------------------------------------
-    # Inference — public entry point
-    # ------------------------------------------------------------------
-
-    def run(
-        self,
-        image: Image.Image,
-        strength: float = 0.5,
-        num_inference_steps: int = 50,
-        guidance_scale: float = DEFAULT_GUIDANCE_SCALE,
-        seed: int | None = None,
-    ) -> Image.Image:
-        """Run CtrlRegen watermark removal on a single image.
-
-        Images that fit within ``TILE_SIZE`` (512) are processed as a
-        single pass. Larger images are split into overlapping tiles.
-        """
-        self.load()
-        assert self._pipeline is not None
-        assert self._canny_detector is not None
-
-        orig_w, orig_h = image.size
-        orig_image = image
-        t0 = time.monotonic()
-
-        needs_tiling = orig_w > TILE_SIZE or orig_h > TILE_SIZE
-
-        if needs_tiling:
-            from remove_ai_watermarks.noai.ctrlregen.tiling import resize_center_crop, run_tiled
-
-            aligned_w = orig_w // 8 * 8
-            aligned_h = orig_h // 8 * 8
-            if aligned_w != orig_w or aligned_h != orig_h:
-                image = image.resize((aligned_w, aligned_h), Image.LANCZOS)
-            regen_image = run_tiled(
-                pipeline=self._pipeline,
-                canny_detector=self._canny_detector,
-                image=image,
-                strength=strength,
-                num_inference_steps=num_inference_steps,
-                guidance_scale=guidance_scale,
-                seed=seed,
-                tile_size=TILE_SIZE,
-                tile_overlap=TILE_OVERLAP,
-                quality_prompt=QUALITY_PROMPT,
-                negative_prompt=NEGATIVE_PROMPT,
-                canny_low=CANNY_LOW_THRESHOLD,
-                canny_high=CANNY_HIGH_THRESHOLD,
-                device=self.device,
-                set_progress=self._set_progress,
-                ip_adapter_image=orig_image,
-            )
-        else:
-            from remove_ai_watermarks.noai.ctrlregen.tiling import resize_center_crop
-
-            proc_image = resize_center_crop(image, PROCESS_SIZE)
-            self._set_progress(f"Preprocessed {orig_w}x{orig_h}px → {proc_image.size[0]}x{proc_image.size[1]}px")
-            regen_image = self._run_single(
-                proc_image,
-                strength,
-                num_inference_steps,
-                guidance_scale,
-                seed,
-            )
-
-        if regen_image.size != (orig_w, orig_h):
-            self._set_progress(f"Resizing {regen_image.size[0]}x{regen_image.size[1]}px → {orig_w}x{orig_h}px...")
-            regen_image = regen_image.resize((orig_w, orig_h), Image.LANCZOS)
-
-        self._set_progress(f"Applying color matching at {orig_w}x{orig_h}px...")
-        output = color_match(reference=orig_image, source=regen_image)
-
-        self._set_progress(f"✓ CtrlRegen done · {orig_w}x{orig_h}px · {time.monotonic() - t0:.0f}s total")
-        return output
-
-    # ------------------------------------------------------------------
-    # Single-image path (image <= 512x512)
-    # ------------------------------------------------------------------
-
-    def _run_single(
-        self,
-        image: Image.Image,
-        strength: float,
-        num_inference_steps: int,
-        guidance_scale: float,
-        seed: int | None,
-    ) -> Image.Image:
-        """Process a single 512x512 image through the CtrlRegen pipeline."""
-        w, h = image.size
-        effective_steps = max(1, int(num_inference_steps * strength))
-
-        self._set_progress(
-            f"Extracting canny edges ({w}x{h}px, thresholds {CANNY_LOW_THRESHOLD}/{CANNY_HIGH_THRESHOLD})..."
-        )
-        control_image = self._canny_detector(
-            image,
-            low_threshold=CANNY_LOW_THRESHOLD,
-            high_threshold=CANNY_HIGH_THRESHOLD,
-        )
-
-        generator = torch.manual_seed(seed if seed is not None else 0)
-
-        self._set_progress(
-            f"Config: strength={strength}, steps={num_inference_steps} "
-            f"(~{effective_steps} effective), guidance={guidance_scale}"
-        )
-
-        step_cb, first_step, pipeline_done, start_updater = make_pipeline_progress(
-            effective_steps,
-            self.device,
-            self._set_progress,
-            label="CtrlRegen denoising",
-        )
-        start_updater()
-
-        try:
-            result = self._pipeline(
-                prompt=QUALITY_PROMPT,
-                negative_prompt=NEGATIVE_PROMPT,
-                image=[image],
-                control_image=[control_image],
-                controlnet_conditioning_scale=1.0,
-                ip_adapter_image=[image],
-                strength=strength,
-                num_inference_steps=num_inference_steps,
-                guidance_scale=guidance_scale,
-                generator=generator,
-                control_guidance_start=0.0,
-                control_guidance_end=1.0,
-                callback=step_cb,
-                callback_steps=1,
-            )
-        except TypeError:
-            first_step.set()
-            result = self._pipeline(
-                prompt=QUALITY_PROMPT,
-                negative_prompt=NEGATIVE_PROMPT,
-                image=[image],
-                control_image=[control_image],
-                controlnet_conditioning_scale=1.0,
-                ip_adapter_image=[image],
-                strength=strength,
-                num_inference_steps=num_inference_steps,
-                guidance_scale=guidance_scale,
-                generator=generator,
-            )
-        finally:
-            first_step.set()
-            pipeline_done.set()
-
-        return result.images[0]
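Three small pieces of arithmetic in the removed engine decide how an image is processed: whether it is tiled (either side larger than `TILE_SIZE`), the dimensions it is snapped to before tiling (rounded down to a multiple of 8), and the step count `_run_single` hands to its progress reporter (`max(1, int(num_inference_steps * strength))`). The sketch below restates them as standalone functions; the function names are hypothetical and chosen only for illustration.

```python
TILE_SIZE = 512  # same value as the constant in the removed engine


def needs_tiling(width: int, height: int, tile_size: int = TILE_SIZE) -> bool:
    """An image is tiled as soon as either side exceeds the tile size."""
    return width > tile_size or height > tile_size


def align_down(value: int, multiple: int = 8) -> int:
    """Round down to the nearest multiple, as done for width and height before tiling."""
    return value // multiple * multiple


def effective_steps(num_inference_steps: int, strength: float) -> int:
    """Step count used for progress reporting: steps scaled by strength, never below 1."""
    return max(1, int(num_inference_steps * strength))


# A 1023x768 input is tiled and first snapped to 1016x768.
tiled = needs_tiling(1023, 768)
snapped = (align_down(1023), align_down(768))

# With the engine defaults (50 steps, strength 0.5) the reporter expects 25 steps.
default_steps = effective_steps(50, 0.5)
```

The floor of 1 keeps the progress reporter from being sized to zero steps when a very low strength is requested.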
@@ -1,149 +0,0 @@
-"""Custom IP-Adapter mixin using DINOv2 as the image encoder.
-
-The standard diffusers ``IPAdapterMixin`` uses a CLIP image encoder.
-CtrlRegen replaces it with ``facebook/dinov2-giant`` for richer
-semantic features. This mixin provides ``load_ctrlregen_ip_adapter``
-which handles the custom weight format and encoder swap.
-
-Attribution:
-    Adapted from https://github.com/yepengliu/CtrlRegen .
-"""
-
-from __future__ import annotations
-
-import logging
-from typing import Any
-
-import torch
-from diffusers.models.modeling_utils import _LOW_CPU_MEM_USAGE_DEFAULT
-from diffusers.utils import (
-    _get_model_file,
-    is_accelerate_available,
-    is_torch_version,
-)
-from diffusers.utils import (
-    logging as diffusers_logging,
-)
-from huggingface_hub.utils import validate_hf_hub_args
-from safetensors import safe_open
-from transformers import AutoImageProcessor, AutoModel
-
-logger = logging.getLogger(__name__)
-_diffusers_logger = diffusers_logging.get_logger(__name__)
-
-DINOV2_MODEL_ID = "facebook/dinov2-giant"
-
-
-class CustomIPAdapterMixin:
-    """Mixin that adds ``load_ctrlregen_ip_adapter`` to a diffusers pipeline."""
-
-    @validate_hf_hub_args
-    def load_ctrlregen_ip_adapter(
-        self,
-        pretrained_model_name_or_path_or_dict: str | list[str] | dict[str, torch.Tensor],
-        subfolder: str | list[str],
-        weight_name: str | list[str],
-        image_encoder_folder: str | None = "image_encoder",
-        **kwargs: Any,
-    ) -> None:
-        """Load CtrlRegen IP-Adapter weights and DINOv2 image encoder.
-
-        Parameters mirror ``IPAdapterMixin.load_ip_adapter`` but the
-        image encoder is always ``facebook/dinov2-giant`` regardless of
-        the ``image_encoder_folder`` value in the checkpoint.
-        """
-        if not isinstance(weight_name, list):
-            weight_name = [weight_name]
-        if not isinstance(pretrained_model_name_or_path_or_dict, list):
-            pretrained_model_name_or_path_or_dict = [pretrained_model_name_or_path_or_dict]
-        if len(pretrained_model_name_or_path_or_dict) == 1:
-            pretrained_model_name_or_path_or_dict = pretrained_model_name_or_path_or_dict * len(weight_name)
-        if not isinstance(subfolder, list):
-            subfolder = [subfolder]
-        if len(subfolder) == 1:
-            subfolder = subfolder * len(weight_name)
-
-        if len(weight_name) != len(pretrained_model_name_or_path_or_dict):
-            raise ValueError("`weight_name` and `pretrained_model_name_or_path_or_dict` must have the same length.")
-        if len(weight_name) != len(subfolder):
-            raise ValueError("`weight_name` and `subfolder` must have the same length.")
-
-        cache_dir = kwargs.pop("cache_dir", None)
-        force_download = kwargs.pop("force_download", False)
-        kwargs.pop("resume_download", None)
-        proxies = kwargs.pop("proxies", None)
-        local_files_only = kwargs.pop("local_files_only", None)
-        token = kwargs.pop("token", None)
-        revision = kwargs.pop("revision", None)
-        low_cpu_mem_usage = kwargs.pop("low_cpu_mem_usage", _LOW_CPU_MEM_USAGE_DEFAULT)
-
-        if low_cpu_mem_usage and not is_accelerate_available():
-            low_cpu_mem_usage = False
-            _diffusers_logger.warning(
-                "Cannot initialize model with low cpu memory usage because "
-                "`accelerate` was not found. Defaulting to "
-                "`low_cpu_mem_usage=False`."
-            )
-
-        if low_cpu_mem_usage is True and not is_torch_version(">=", "1.9.0"):
-            raise NotImplementedError("Low memory initialization requires torch >= 1.9.0.")
-
-        user_agent = {
-            "file_type": "attn_procs_weights",
-            "framework": "pytorch",
-        }
-
-        state_dicts: list[dict] = []
-        for path_or_dict, wn, sf in zip(pretrained_model_name_or_path_or_dict, weight_name, subfolder, strict=False):
-            if not isinstance(path_or_dict, dict):
-                model_file = _get_model_file(
-                    path_or_dict,
-                    weights_name=wn,
-                    cache_dir=cache_dir,
-                    force_download=force_download,
-                    proxies=proxies,
-                    local_files_only=local_files_only,
-                    token=token,
-                    revision=revision,
-                    subfolder=sf,
-                    user_agent=user_agent,
-                )
-                if wn.endswith(".safetensors"):
-                    state_dict: dict = {"image_proj": {}, "ip_adapter": {}}
-                    with safe_open(model_file, framework="pt", device="cpu") as f:
-                        for key in f.keys():  # noqa: SIM118
-                            if key.startswith("image_proj."):
-                                state_dict["image_proj"][key.replace("image_proj.", "")] = f.get_tensor(key)
-                            elif key.startswith("ip_adapter."):
-                                state_dict["ip_adapter"][key.replace("ip_adapter.", "")] = f.get_tensor(key)
-                else:
-                    state_dict = torch.load(model_file, map_location="cpu")
-            else:
-                state_dict = path_or_dict
-
-            keys = list(state_dict.keys())
-            if keys != ["image_proj", "ip_adapter"]:
-                raise ValueError("Required keys (`image_proj` and `ip_adapter`) missing from the state dict.")
-            state_dicts.append(state_dict)
-
-        # Always use DINOv2-giant as the image encoder.
-        has_encoder_attr = hasattr(self, "image_encoder") and getattr(self, "image_encoder", None) is None
-        if has_encoder_attr and image_encoder_folder is not None:
-            logger.info("Loading DINOv2-giant image encoder for CtrlRegen")
-            enc_dtype = getattr(self, "dtype", torch.float32)  # type: ignore[attr-defined]
-            image_encoder = AutoModel.from_pretrained(DINOV2_MODEL_ID).to(
-                self.device,
-                dtype=enc_dtype,  # type: ignore[attr-defined]
-            )
-            self.register_modules(image_encoder=image_encoder)  # type: ignore[attr-defined]
-
-        if hasattr(self, "feature_extractor") and getattr(self, "feature_extractor", None) is None:
-            feature_extractor = AutoImageProcessor.from_pretrained(DINOV2_MODEL_ID)
-            self.register_modules(feature_extractor=feature_extractor)  # type: ignore[attr-defined]
-
-        unet = (
-            getattr(self, self.unet_name)  # type: ignore[attr-defined]
-            if not hasattr(self, "unet")
-            else self.unet  # type: ignore[attr-defined]
-        )
-        unet._load_ip_adapter_weights(state_dicts, low_cpu_mem_usage=low_cpu_mem_usage)
@@ -1,35 +0,0 @@
|
||||
"""Custom Stable Diffusion ControlNet Img2Img pipeline for CtrlRegen.
|
||||
|
||||
Extends ``StableDiffusionControlNetImg2ImgPipeline`` with the
|
||||
``load_ctrlregen_ip_adapter`` method (via ``CustomIPAdapterMixin``)
|
||||
that swaps in DINOv2-giant as the image encoder and loads the
|
||||
CtrlRegen semantic-control adapter weights.
|
||||
|
||||
No ``encode_image`` override is needed — the CtrlRegen checkpoint
|
||||
creates an ``IPAdapterPlusImageProjection`` which tells diffusers to
|
||||
call ``encode_image`` with ``output_hidden_states=True``. The
|
||||
default implementation then uses ``hidden_states[-2]`` from DINOv2,
|
||||
which is exactly what the projection was trained on.
|
||||
|
||||
Attribution:
|
||||
Adapted from https://github.com/yepengliu/CtrlRegen .
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from diffusers import StableDiffusionControlNetImg2ImgPipeline
|
||||
|
||||
from remove_ai_watermarks.noai.ctrlregen.ip_adapter import CustomIPAdapterMixin
|
||||
|
||||
|
||||
class CustomCtrlRegenPipeline(
|
||||
StableDiffusionControlNetImg2ImgPipeline,
|
||||
CustomIPAdapterMixin,
|
||||
):
|
||||
"""SD ControlNet Img2Img pipeline with DINOv2 IP-Adapter support.
|
||||
|
||||
MRO mirrors the original CtrlRegen repository: the base diffusers
|
||||
pipeline comes first so all standard methods are resolved from it,
|
||||
while ``CustomIPAdapterMixin`` only adds the
|
||||
``load_ctrlregen_ip_adapter`` method.
|
||||
"""
|
||||
@@ -1,179 +0,0 @@
|
||||
"""Tile-based processing for large images in the CtrlRegen pipeline.
|
||||
|
||||
Extracted from ``ctrlregen.engine`` to keep the engine focused on
|
||||
single-image inference and model orchestration.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import math
|
||||
import time
|
||||
from typing import TYPE_CHECKING, Any
|
||||
|
||||
if TYPE_CHECKING:
|
||||
from collections.abc import Callable
|
||||
|
||||
import numpy as np
|
||||
import torch
|
||||
from PIL import Image
|
||||
|
||||
|
||||
def tile_positions(total: int, tile: int, overlap: int) -> list[int]:
|
||||
"""Compute evenly-spaced tile start positions covering *total* pixels."""
|
||||
if not (0 <= overlap < tile):
|
||||
raise ValueError(f"overlap must satisfy 0 <= overlap < tile (got overlap={overlap}, tile={tile})")
|
||||
if total <= tile:
|
||||
return [0]
|
||||
n = max(2, math.ceil((total - overlap) / (tile - overlap)))
|
||||
stride = (total - tile) / (n - 1)
|
||||
return [round(i * stride) for i in range(n)]
|
||||
|
||||
|
||||
def make_blend_weight(h: int, w: int, overlap: int) -> np.ndarray:
|
||||
"""2-D weight mask: 1.0 in center, cosine ramp in overlap margins."""
|
||||
wy = np.ones(h, dtype=np.float64)
|
||||
wx = np.ones(w, dtype=np.float64)
|
||||
if overlap > 0:
|
||||
ramp = 0.5 - 0.5 * np.cos(np.linspace(0, np.pi, overlap))
|
||||
wy[:overlap] = np.minimum(wy[:overlap], ramp)
|
||||
wy[-overlap:] = np.minimum(wy[-overlap:], ramp[::-1])
|
||||
wx[:overlap] = np.minimum(wx[:overlap], ramp)
|
||||
wx[-overlap:] = np.minimum(wx[-overlap:], ramp[::-1])
|
||||
return np.outer(wy, wx)
|
||||
|
||||
|
||||
def resize_center_crop(image: Image.Image, size: int = 512) -> Image.Image:
|
||||
"""Resize shortest edge to *size*, then center-crop to a square.
|
||||
|
||||
Matches the ``transforms.Resize(512) + CenterCrop(512)`` pipeline
|
||||
used in the original CtrlRegen repository.
|
||||
"""
|
||||
w, h = image.size
|
||||
short = min(w, h)
|
||||
scale = size / short
|
||||
new_w, new_h = round(w * scale), round(h * scale)
|
||||
image = image.resize((new_w, new_h), Image.BILINEAR)
|
||||
left = (new_w - size) // 2
|
||||
top = (new_h - size) // 2
|
||||
return image.crop((left, top, left + size, top + size))
|
||||
|
||||
|
||||
def run_tiled(
|
||||
pipeline: Any,
|
||||
canny_detector: Any,
|
||||
image: Image.Image,
|
||||
strength: float,
|
||||
num_inference_steps: int,
|
||||
guidance_scale: float,
|
||||
seed: int | None,
|
||||
*,
|
||||
tile_size: int,
|
||||
tile_overlap: int,
|
||||
quality_prompt: str,
|
||||
negative_prompt: str,
|
||||
canny_low: int,
|
||||
canny_high: int,
|
||||
device: str,
|
||||
set_progress: Callable[[str], None],
|
||||
ip_adapter_image: Image.Image | None = None,
|
||||
) -> Image.Image:
|
||||
"""Split a large image into overlapping tiles, process each, blend."""
|
||||
w, h = image.size
|
||||
xs = tile_positions(w, tile_size, tile_overlap)
|
||||
ys = tile_positions(h, tile_size, tile_overlap)
|
||||
n_tiles = len(xs) * len(ys)
|
||||
grid = f"{len(xs)}x{len(ys)}"
|
||||
effective_steps = max(1, int(num_inference_steps * strength))
|
||||
|
||||
set_progress(f"Tiling {w}x{h}px → {n_tiles} tiles ({grid} grid, {tile_size}px, overlap {tile_overlap}px)")
|
||||
|
||||
canvas = np.zeros((h, w, 3), dtype=np.float64)
|
||||
weight_sum = np.zeros((h, w), dtype=np.float64)
|
||||
blend_w = make_blend_weight(tile_size, tile_size, tile_overlap)
|
||||
|
||||
t0 = time.monotonic()
|
||||
bar_len = 20
|
||||
tile_idx = 0
|
||||
|
||||
for ty in ys:
|
||||
for tx in xs:
|
||||
tile_idx += 1
|
||||
prefix = f"[Tile {tile_idx}/{n_tiles}]"
|
||||
|
||||
tile = image.crop((tx, ty, tx + tile_size, ty + tile_size))
|
||||
|
||||
set_progress(f"{prefix} Extracting canny edges...")
|
||||
control = canny_detector(
|
||||
tile,
|
||||
low_threshold=canny_low,
|
||||
high_threshold=canny_high,
|
||||
)
|
||||
|
||||
gen = None
|
||||
if seed is not None:
|
||||
gen = torch.Generator(device=device).manual_seed(seed + tile_idx)
|
||||
|
||||
tile_t0 = time.monotonic()
|
||||
|
||||
def _make_cb(
|
||||
_prefix: str = prefix,
|
||||
_t0: float = tile_t0,
|
||||
_es: int = effective_steps,
|
||||
) -> Callable:
|
||||
def _cb(step: int, timestep: int, latents: Any) -> None:
|
||||
elapsed = time.monotonic() - _t0
|
||||
cur = step + 1
|
||||
per = elapsed / max(1, cur)
|
||||
rem = per * max(0, _es - cur)
|
||||
filled = int(bar_len * cur / max(1, _es))
|
||||
bar = "█" * filled + "░" * (bar_len - filled)
|
||||
set_progress(f"{_prefix} [{bar}] {cur}/{_es} | {elapsed:.0f}s, ~{rem:.0f}s left")
|
||||
|
||||
return _cb
|
||||
|
||||
sem_image = ip_adapter_image if ip_adapter_image is not None else tile
|
||||
|
||||
try:
|
||||
result = pipeline(
|
||||
prompt=quality_prompt,
|
||||
negative_prompt=negative_prompt,
|
||||
image=[tile],
|
||||
control_image=[control],
|
||||
controlnet_conditioning_scale=1.0,
|
||||
ip_adapter_image=[sem_image],
|
||||
strength=strength,
|
||||
num_inference_steps=num_inference_steps,
|
||||
guidance_scale=guidance_scale,
|
||||
generator=gen,
|
||||
control_guidance_start=0.0,
|
||||
control_guidance_end=1.0,
|
||||
callback=_make_cb(),
|
||||
callback_steps=1,
|
||||
)
|
||||
except TypeError:
|
||||
result = pipeline(
|
||||
prompt=quality_prompt,
|
||||
negative_prompt=negative_prompt,
|
||||
image=[tile],
|
||||
control_image=[control],
|
||||
controlnet_conditioning_scale=1.0,
|
||||
ip_adapter_image=[sem_image],
|
||||
strength=strength,
|
||||
num_inference_steps=num_inference_steps,
|
||||
guidance_scale=guidance_scale,
|
||||
generator=gen,
|
||||
)
|
||||
|
||||
proc_arr = np.array(result.images[0], dtype=np.float64)
|
||||
th, tw = proc_arr.shape[:2]
|
||||
mask = blend_w[:th, :tw]
|
||||
canvas[ty : ty + th, tx : tx + tw] += proc_arr * mask[..., None]
|
||||
weight_sum[ty : ty + th, tx : tx + tw] += mask
|
||||
|
||||
tile_time = time.monotonic() - tile_t0
|
||||
total_elapsed = time.monotonic() - t0
|
||||
set_progress(f"{prefix} Done ({tile_time:.0f}s) · Total: {total_elapsed:.0f}s")
|
||||
|
||||
set_progress(f"Blending {n_tiles} tiles → {w}x{h}px...")
|
||||
canvas /= np.maximum(weight_sum[..., None], 1e-8)
|
||||
return Image.fromarray(np.clip(canvas, 0, 255).astype(np.uint8))
|
||||
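The tiling arithmetic in the removed module above can be sketched in isolation. This is a minimal, dependency-free illustration, not the project's code: `tile_positions` follows the rule shown in the diff, while `ramp_weights` and `blend_1d` are hypothetical 1-D stand-ins for `make_blend_weight` and the canvas / weight-sum accumulation in `run_tiled`, and they assume the image is at least one tile wide.

```python
from __future__ import annotations

import math


def tile_positions(total: int, tile: int, overlap: int) -> list[int]:
    """Evenly spaced tile starts covering `total` pixels (rule taken from the diff)."""
    if not (0 <= overlap < tile):
        raise ValueError("overlap must satisfy 0 <= overlap < tile")
    if total <= tile:
        return [0]
    n = max(2, math.ceil((total - overlap) / (tile - overlap)))
    stride = (total - tile) / (n - 1)
    return [round(i * stride) for i in range(n)]


def ramp_weights(length: int, overlap: int) -> list[float]:
    """Hypothetical 1-D analogue of make_blend_weight: cosine ramp at both ends."""
    weights = [1.0] * length
    for i in range(overlap):
        # Same sample points as np.linspace(0, pi, overlap).
        t = i / (overlap - 1) if overlap > 1 else 0.0
        ramp = 0.5 - 0.5 * math.cos(math.pi * t)
        weights[i] = min(weights[i], ramp)
        weights[length - 1 - i] = min(weights[length - 1 - i], ramp)
    return weights


def blend_1d(total: int, tile: int, overlap: int, value: float = 100.0) -> list[float]:
    """Accumulate weighted tiles of a constant signal, then normalise by the weight sum."""
    canvas = [0.0] * total
    weight_sum = [0.0] * total
    weights = ramp_weights(tile, overlap)
    for start in tile_positions(total, tile, overlap):
        for i in range(min(tile, total - start)):
            canvas[start + i] += value * weights[i]
            weight_sum[start + i] += weights[i]
    return [c / max(s, 1e-8) for c, s in zip(canvas, weight_sum)]


positions = tile_positions(1000, 512, 64)
blended = blend_1d(1000, 512, 64)
```

In this sketch every interior pixel blends back to the constant input. The first ramp sample is 0, so the two outermost pixels receive zero total weight and come out as 0. Whether that edge effect was visible in the removed code's output is not something the diff shows.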
@@ -29,8 +29,14 @@ def run_img2img(
|
||||
generator: Any,
|
||||
device: str,
|
||||
set_progress: Callable[[str], None],
|
||||
extra_kwargs: dict[str, Any] | None = None,
|
||||
) -> Image.Image:
|
||||
"""Execute img2img with live progress and return the generated image."""
|
||||
"""Execute img2img with live progress and return the generated image.
|
||||
|
||||
``extra_kwargs`` is merged over the default pipeline arguments and overrides them
|
||||
(e.g. the ControlNet ``control_image`` / ``controlnet_conditioning_scale`` and a
|
||||
non-empty prompt), so a ControlNet pass reuses the same progress and fallback machinery.
|
||||
"""
|
||||
effective_steps = max(1, int(num_inference_steps * strength))
|
||||
|
||||
step_cb, first_step, done_ev, start_updater = make_pipeline_progress(
|
||||
@@ -42,26 +48,14 @@ def run_img2img(
|
||||
|
||||
try:
|
||||
result = _call_pipeline(
|
||||
pipeline,
|
||||
image,
|
||||
strength,
|
||||
num_inference_steps,
|
||||
guidance_scale,
|
||||
generator,
|
||||
step_cb,
|
||||
pipeline, image, strength, num_inference_steps, guidance_scale, generator, step_cb, extra_kwargs
|
||||
)
|
||||
done_ev.set()
|
||||
return result.images[0]
|
||||
except TypeError:
|
||||
first_step.set()
|
||||
result = _call_pipeline(
|
||||
pipeline,
|
||||
image,
|
||||
strength,
|
||||
num_inference_steps,
|
||||
guidance_scale,
|
||||
generator,
|
||||
None,
|
||||
pipeline, image, strength, num_inference_steps, guidance_scale, generator, None, extra_kwargs
|
||||
)
|
||||
done_ev.set()
|
||||
return result.images[0]
|
||||
@@ -81,11 +75,13 @@ def run_img2img_with_mps_fallback(
|
||||
set_progress: Callable[[str], None],
|
||||
*,
|
||||
reload_on_cpu: Callable[[], Any],
|
||||
extra_kwargs: dict[str, Any] | None = None,
|
||||
) -> tuple[Image.Image, str]:
|
||||
"""Run img2img; on MPS error, fall back to CPU.
|
||||
|
||||
Returns:
|
||||
(result_image, final_device) — device may change to ``"cpu"`` on fallback.
|
||||
``extra_kwargs`` overlays extra pipeline arguments (used by the ControlNet
|
||||
path). Returns ``(result_image, final_device)`` — device may change to
|
||||
``"cpu"`` on fallback.
|
||||
"""
|
||||
pipeline = load_pipeline()
|
||||
|
||||
@@ -99,6 +95,7 @@ def run_img2img_with_mps_fallback(
|
||||
generator,
|
||||
device,
|
||||
set_progress,
|
||||
extra_kwargs,
|
||||
)
|
||||
return img, device
|
||||
except RuntimeError as error:
|
||||
@@ -108,104 +105,7 @@ def run_img2img_with_mps_fallback(
|
||||
_try_clear_mps_cache()
|
||||
pipeline = reload_on_cpu()
|
||||
img = run_img2img(
|
||||
pipeline,
|
||||
image,
|
||||
strength,
|
||||
num_inference_steps,
|
||||
guidance_scale,
|
||||
None,
|
||||
"cpu",
|
||||
set_progress,
|
||||
)
|
||||
return img, "cpu"
|
||||
raise
|
||||
|
||||
|
||||
def run_differential(
|
||||
pipeline: Any,
|
||||
image: Image.Image,
|
||||
change_map: Any,
|
||||
strength: float,
|
||||
num_inference_steps: int,
|
||||
guidance_scale: float,
|
||||
generator: Any,
|
||||
device: str,
|
||||
set_progress: Callable[[str], None],
|
||||
) -> Image.Image:
|
||||
"""Run the SDXL Differential-Diffusion pipeline and return the image.
|
||||
|
||||
Unlike standard img2img, the differential pipeline needs pre-processed image
|
||||
tensors plus a per-pixel change map (HxW float32 in [0, 1]); white preserves
|
||||
the original pixels, black regenerates them. Runs without a step callback --
|
||||
the community pipeline's callback signature differs across diffusers
|
||||
versions, and a protect-text pass is short.
|
||||
"""
|
||||
import torch
|
||||
|
||||
image_tensor = pipeline.image_processor.preprocess(image).to(device)
|
||||
map_tensor = torch.from_numpy(change_map)[None].to(device) # pyright: ignore[reportPrivateImportUsage, reportUnknownMemberType]
|
||||
set_progress(f"Running protected regeneration ({device}, strength={strength})...")
|
||||
result = pipeline(
|
||||
prompt="",
|
||||
image=image_tensor,
|
||||
original_image=image_tensor,
|
||||
map=map_tensor,
|
||||
strength=strength,
|
||||
num_inference_steps=num_inference_steps,
|
||||
guidance_scale=guidance_scale,
|
||||
generator=generator,
|
||||
)
|
||||
return result.images[0]
|
||||
|
||||
|
||||
def run_differential_with_mps_fallback(
|
||||
load_pipeline: Callable[[], Any],
|
||||
image: Image.Image,
|
||||
change_map: Any,
|
||||
strength: float,
|
||||
num_inference_steps: int,
|
||||
guidance_scale: float,
|
||||
generator: Any,
|
||||
device: str,
|
||||
set_progress: Callable[[str], None],
|
||||
*,
|
||||
reload_on_cpu: Callable[[], Any],
|
||||
) -> tuple[Image.Image, str]:
|
||||
"""Run differential img2img; on MPS error, fall back to CPU.
|
||||
|
||||
Returns:
|
||||
(result_image, final_device) -- device may change to ``"cpu"`` on fallback.
|
||||
"""
|
||||
pipeline = load_pipeline()
|
||||
try:
|
||||
img = run_differential(
|
||||
pipeline,
|
||||
image,
|
||||
change_map,
|
||||
strength,
|
||||
num_inference_steps,
|
||||
guidance_scale,
|
||||
generator,
|
||||
device,
|
||||
set_progress,
|
||||
)
|
||||
return img, device
|
||||
except RuntimeError as error:
|
||||
if device == "mps" and is_mps_error(error):
|
||||
logger.warning("MPS error detected: %s. Falling back to CPU.", error)
|
||||
set_progress("MPS error! Clearing cache and retrying on CPU...")
|
||||
_try_clear_mps_cache()
|
||||
pipeline = reload_on_cpu()
|
||||
img = run_differential(
|
||||
pipeline,
|
||||
image,
|
||||
change_map,
|
||||
strength,
|
||||
num_inference_steps,
|
||||
guidance_scale,
|
||||
None,
|
||||
"cpu",
|
||||
set_progress,
|
||||
pipeline, image, strength, num_inference_steps, guidance_scale, None, "cpu", set_progress, extra_kwargs
|
||||
)
|
||||
return img, "cpu"
|
||||
raise
|
||||
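The MPS-to-CPU fallback used by `run_img2img_with_mps_fallback` reduces to a small retry pattern. The sketch below is an illustration under stated assumptions: `run_with_cpu_fallback`, `fake_run` and the string-matching predicate are hypothetical names, and the real function additionally clears the MPS cache, reloads the pipeline on CPU and passes `None` as the generator on the retry.

```python
from __future__ import annotations

from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_cpu_fallback(
    run: Callable[[str], T],
    device: str,
    is_mps_error: Callable[[RuntimeError], bool],
) -> tuple[T, str]:
    """Run on `device`; retry once on CPU only for an MPS-specific RuntimeError."""
    try:
        return run(device), device
    except RuntimeError as error:
        if device == "mps" and is_mps_error(error):
            # The caller learns the final device so later steps can follow it.
            return run("cpu"), "cpu"
        raise  # Any other failure propagates unchanged.


def fake_run(device: str) -> str:
    """Stand-in for a pipeline call that fails on MPS."""
    if device == "mps":
        raise RuntimeError("MPS backend out of memory")
    return f"image-from-{device}"


def looks_like_mps_error(error: RuntimeError) -> bool:
    """Hypothetical predicate; the project uses its own is_mps_error."""
    return "MPS" in str(error)


result, final_device = run_with_cpu_fallback(fake_run, "mps", looks_like_mps_error)
```

Returning the final device alongside the result mirrors the `(result_image, final_device)` tuple in the diff, which lets the caller notice that execution moved to CPU.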
@@ -219,6 +119,7 @@ def _call_pipeline(
|
||||
guidance_scale: float,
|
||||
generator: Any,
|
||||
step_callback: Any,
|
||||
extra_kwargs: dict[str, Any] | None = None,
|
||||
) -> Any:
|
||||
kwargs: dict[str, Any] = {
|
||||
"prompt": "",
|
||||
@@ -228,6 +129,8 @@ def _call_pipeline(
|
||||
"guidance_scale": guidance_scale,
|
||||
"generator": generator,
|
||||
}
|
||||
if extra_kwargs:
|
||||
kwargs.update(extra_kwargs)
|
||||
if step_callback is not None:
|
||||
kwargs["callback"] = step_callback
|
||||
kwargs["callback_steps"] = 1
|
||||
|
||||
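The `extra_kwargs` overlay in `_call_pipeline` is a plain dictionary merge, which the following sketch isolates. It is a minimal illustration, not the project's function: `build_call_kwargs` is a hypothetical name, the `image`, `strength` and `num_inference_steps` keys are assumed because the diff elides the middle of the dictionary, and the placeholder strings stand in for real images.

```python
from __future__ import annotations

from typing import Any


def build_call_kwargs(
    image: Any,
    strength: float,
    num_inference_steps: int,
    guidance_scale: float,
    generator: Any,
    step_callback: Any = None,
    extra_kwargs: dict[str, Any] | None = None,
) -> dict[str, Any]:
    """Assemble pipeline arguments; extras override defaults, callback is added last."""
    kwargs: dict[str, Any] = {
        "prompt": "",
        "image": image,  # assumed key, elided in the diff
        "strength": strength,  # assumed key, elided in the diff
        "num_inference_steps": num_inference_steps,  # assumed key, elided in the diff
        "guidance_scale": guidance_scale,
        "generator": generator,
    }
    if extra_kwargs:
        kwargs.update(extra_kwargs)
    if step_callback is not None:
        kwargs["callback"] = step_callback
        kwargs["callback_steps"] = 1
    return kwargs


# Extras a ControlNet pass might supply (values are placeholders).
controlnet_extras = {
    "prompt": "best quality, high quality, sharp, detailed, photographic",
    "control_image": "<canny edge map>",
    "controlnet_conditioning_scale": 1.0,
}

plain = build_call_kwargs("<image>", 0.15, 50, 7.5, None)
controlled = build_call_kwargs("<image>", 0.15, 50, 7.5, None, extra_kwargs=controlnet_extras)
```

Because the merge happens before the callback keys are set, a ControlNet pass can replace the empty prompt without disturbing the progress callback wiring.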
@@ -1,4 +1,4 @@
|
||||
"""Watermark removal model profiles, the default strength, and profile detection.
|
||||
"""Watermark removal model profiles and the default strength.
|
||||
|
||||
Pure configuration and lookup functions with no ML dependencies.
|
||||
"""
|
||||
@@ -11,13 +11,18 @@ if TYPE_CHECKING:
|
||||
from pathlib import Path
|
||||
|
||||
DEFAULT_MODEL_ID = "stabilityai/stable-diffusion-xl-base-1.0"
|
||||
CTRLREGEN_MODEL_ID = "yepengliu/ctrlregen"
|
||||
|
||||
# The SDXL-native canny ControlNet used by the ``controlnet`` pipeline. The
|
||||
# ControlNet is an add-on to the SDXL base checkpoint (DEFAULT_MODEL_ID), not a
|
||||
# separate base model, so both the ``default`` and ``controlnet`` profiles load
|
||||
# the same base weights and share the same vendor-adaptive strength.
|
||||
CONTROLNET_CANNY_MODEL = "xinsir/controlnet-canny-sdxl-1.0"
|
||||
|
||||
# Vendor-adaptive default denoising strength for the SDXL img2img scrub, overridable
|
||||
# from the CLI (`--strength`). The right strength depends on which vendor's SynthID is
|
||||
# present, detected from the C2PA issuer (metadata.synthid_source). Oracle-verified
|
||||
# controlled study (2026-06-01, clean v0.8.6 with protect_text/faces OFF, per-image
|
||||
# openai.com/verify or Gemini-app verdict; see docs/synthid.md section 2.2):
|
||||
# controlled study (2026-06-01, clean v0.8.6, per-image openai.com/verify or Gemini-app
|
||||
# verdict; see docs/synthid.md section 2.2):
|
||||
# - OpenAI gpt-image: removed at 0.05 across 1024-1600 (n=4), resolution-independent.
|
||||
# OPENAI_STRENGTH 0.10 = the 0.05 floor plus a 2x margin (keeps quality high).
|
||||
# - Google Gemini: removed at 0.15 on the capped-1536 path (n=4); 0.05/0.10 do NOT
|
||||
@@ -29,8 +34,8 @@ CTRLREGEN_MODEL_ID = "yepengliu/ctrlregen"
|
||||
# - Unknown vendor (metadata stripped, or non-OpenAI/Google C2PA): UNKNOWN_STRENGTH
|
||||
# 0.15, the safe middle that clears both vendors at the tested resolutions.
|
||||
# The dominant factor is VENDOR, not resolution: Google's SynthID is ~3x more robust
|
||||
# than OpenAI's. The earlier single 0.30 default (and the "resolution dependence" lore)
|
||||
# came from contaminated tests run with protect_text ON -- see docs/synthid.md 2.2.
|
||||
# than OpenAI's. The ``controlnet`` pipeline shares these strengths (same SDXL base; the
|
||||
# canny ControlNet only preserves structure, the strength still drives removal).
|
||||
OPENAI_STRENGTH = 0.10
|
||||
GEMINI_STRENGTH = 0.15
|
||||
UNKNOWN_STRENGTH = 0.15
|
||||
@@ -41,45 +46,21 @@ DEFAULT_STRENGTH = UNKNOWN_STRENGTH
|
||||
# Detected-vendor -> default strength. Vendor strings come from `vendor_for_strength`.
|
||||
_VENDOR_STRENGTH = {"openai": OPENAI_STRENGTH, "google": GEMINI_STRENGTH}
|
||||
|
||||
# CtrlRegen removes watermarks by regenerating from (near) clean Gaussian noise,
|
||||
# NOT by the light-touch partial-noise img2img the SDXL default uses. The research
|
||||
# is explicit (CtrlRegen, ICLR 2025, arXiv:2410.05470): partial-noise regeneration
|
||||
# "struggles with high-perturbation watermarks" because a small noise step "retains"
|
||||
# watermark information that diffuses back into the output; the fix is to start from
|
||||
# clean noise. With the StableDiffusionControlNetImg2ImgPipeline that maps to a high
|
||||
# strength (~1.0 = full noise at the first timestep, structure held by the canny
|
||||
# ControlNet + DINOv2 IP-Adapter, not by the watermarked latent). So the ctrlregen
|
||||
# profile must NOT inherit the SDXL default (`DEFAULT_STRENGTH`, a partial-noise
|
||||
# value) -- at that low strength it loads ControlNet + DINOv2-giant and then barely
|
||||
# changes the image (a no-op for removal). Tunable via
|
||||
# `--strength`; lower it to trade removal strength for fidelity (the CtrlRegen+ regime).
|
||||
#
|
||||
# EXPERIMENTAL -- NOT recommended for production. The same GPU study that set the 0.3
|
||||
# SDXL threshold tested ctrlregen at its clean-noise strength and found it DESTROYS
|
||||
# images: smooth/background regions fill with hallucinated micro-text garbage, and it
|
||||
# is heavy (~8.5 min / ~$0.30 vs ~25 s / ~$0.02 for SDXL on a large image). The pipeline
|
||||
# is effectively binary -- low strength = no-op, high strength = destroys -- with no
|
||||
# usable middle, so the literature's "clean-noise is the lever" (arXiv:2410.05470) did
|
||||
# NOT survive empirical testing on real content. SDXL img2img at ~0.3 is the shippable
|
||||
# path; ctrlregen stays opt-in and flagged experimental.
|
||||
CTRLREGEN_DEFAULT_STRENGTH = 1.0
|
||||
|
||||
def resolve_strength(strength: float | None, vendor: str | None = None) -> float:
|
||||
"""Resolve the denoising strength, applying the vendor default when unset.
|
||||
|
||||
def resolve_strength(strength: float | None, profile: str, vendor: str | None = None) -> float:
|
||||
"""Resolve the denoising strength, applying the profile/vendor default when unset.
|
||||
|
||||
``None`` means "the user did not pass ``--strength``". ``ctrlregen`` resolves to
|
||||
``CTRLREGEN_DEFAULT_STRENGTH`` (clean-noise regeneration). The SDXL default profile
|
||||
resolves **vendor-adaptively**: ``vendor`` (``"openai"`` / ``"google"`` / None, from
|
||||
``None`` means "the user did not pass ``--strength``", which resolves
|
||||
**vendor-adaptively**: ``vendor`` (``"openai"`` / ``"google"`` / None, from
|
||||
``vendor_for_strength``) selects ``OPENAI_STRENGTH`` / ``GEMINI_STRENGTH`` /
|
||||
``UNKNOWN_STRENGTH``. An explicit value always wins (including ``0.0`` -- the check is
|
||||
``is None``, not falsiness). Shared by the CLI (for display) and the engine (for
|
||||
``UNKNOWN_STRENGTH``. An explicit value always wins (including ``0.0`` -- the check
|
||||
is ``is None``, not falsiness). The ``default`` and ``controlnet`` profiles share
|
||||
the same SDXL base (the ControlNet only preserves structure), so the default does
|
||||
NOT depend on the profile. Shared by the CLI (for display) and the engine (for
|
||||
execution) so the two never disagree -- both must pass the SAME ``vendor``.
|
||||
"""
|
||||
if strength is not None:
|
||||
return strength
|
||||
if profile == "ctrlregen":
|
||||
return CTRLREGEN_DEFAULT_STRENGTH
|
||||
return _VENDOR_STRENGTH.get(vendor or "", UNKNOWN_STRENGTH)
|
||||
|
||||
|
||||
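The vendor-adaptive default can be exercised on its own. The sketch below restates the new two-argument `resolve_strength` with the constants from the diff, and adds a hypothetical `effective_steps` helper that mirrors the `max(1, int(num_inference_steps * strength))` expression in `run_img2img`. The 50-step count is an illustrative value, not a documented default.

```python
from __future__ import annotations

OPENAI_STRENGTH = 0.10
GEMINI_STRENGTH = 0.15
UNKNOWN_STRENGTH = 0.15
_VENDOR_STRENGTH = {"openai": OPENAI_STRENGTH, "google": GEMINI_STRENGTH}


def resolve_strength(strength: float | None, vendor: str | None = None) -> float:
    """An explicit strength always wins; otherwise fall back to the vendor default."""
    if strength is not None:  # `is None`, not falsiness, so 0.0 is respected
        return strength
    return _VENDOR_STRENGTH.get(vendor or "", UNKNOWN_STRENGTH)


def effective_steps(num_inference_steps: int, strength: float) -> int:
    """Hypothetical helper: denoising steps actually run by img2img at this strength."""
    return max(1, int(num_inference_steps * strength))


openai_default = resolve_strength(None, "openai")
google_default = resolve_strength(None, "google")
unknown_default = resolve_strength(None, None)
explicit_zero = resolve_strength(0.0, "google")
steps_openai = effective_steps(50, openai_default)
steps_google = effective_steps(50, google_default)
```

Dropping the `profile` argument matches the docstring's point that `default` and `controlnet` share the same SDXL base, so the profile no longer influences the default.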
@@ -107,17 +88,13 @@ def vendor_for_strength(image_path: Path) -> Literal["openai", "google"] | None:
|
||||
|
||||
|
||||
def get_model_id_for_profile(profile: str) -> str:
|
||||
"""Map CLI model profile names to concrete Hugging Face model IDs."""
|
||||
"""Map CLI model profile names to concrete Hugging Face model IDs.
|
||||
|
||||
Both ``default`` and ``controlnet`` use the SDXL base checkpoint -- the canny
|
||||
ControlNet (``CONTROLNET_CANNY_MODEL``) is an add-on loaded on top of it, not a
|
||||
separate base model.
|
||||
"""
|
||||
normalized = profile.strip().lower()
|
||||
if normalized == "default":
|
||||
if normalized in ("default", "controlnet"):
|
||||
return DEFAULT_MODEL_ID
|
||||
if normalized == "ctrlregen":
|
||||
return CTRLREGEN_MODEL_ID
|
||||
raise ValueError(f"Unknown model profile '{profile}'. Use one of: default, ctrlregen.")
|
||||
|
||||
|
||||
def detect_model_profile(model_id: str) -> str:
|
||||
"""Infer model profile from model identifier."""
|
||||
if "ctrlregen" in model_id.lower():
|
||||
return "ctrlregen"
|
||||
return "default"
|
||||
raise ValueError(f"Unknown model profile '{profile}'. Use one of: default, controlnet.")
|
||||
|
||||
@@ -1,13 +1,13 @@
|
||||
"""Watermark removal using diffusion model regeneration attack.
|
||||
|
||||
Based on the paper "Image Watermarks Are Removable Using Controllable
|
||||
Regeneration from Clean Noise" (ICLR 2025).
|
||||
|
||||
This module implements a simple regeneration attack that:
|
||||
1. Encodes the watermarked image to latent space
|
||||
2. Adds noise via forward diffusion process
|
||||
3. Denoises via reverse diffusion process
|
||||
4. Decodes back to pixel space
|
||||
Two pipelines:
|
||||
1. ``default`` -- plain SDXL img2img. Partial-noise regeneration scrubs the
|
||||
invisible watermark; ``strength`` controls how much is regenerated.
|
||||
2. ``controlnet`` -- SDXL img2img with a canny ControlNet. The watermark REMOVAL
|
||||
still comes from the img2img regeneration (``strength``); the ControlNet only
|
||||
PRESERVES structure (text/faces) by conditioning on the edge map. No original
|
||||
pixels are ever copied or frozen, so SynthID does not survive.
|
||||
``controlnet_conditioning_scale`` is the preservation knob.
|
||||
"""
|
||||
|
||||
# torch/diffusers/cv2 boundary: these libs ship no usable types for the tensor and
|
||||
@@ -29,10 +29,9 @@ if TYPE_CHECKING:
|
||||
from PIL import Image
|
||||
|
||||
from remove_ai_watermarks.noai.watermark_profiles import (
|
||||
CTRLREGEN_MODEL_ID,
|
||||
CONTROLNET_CANNY_MODEL,
|
||||
DEFAULT_MODEL_ID,
|
||||
DEFAULT_STRENGTH,
|
||||
detect_model_profile,
|
||||
resolve_strength,
|
||||
)
|
||||
|
||||
@@ -273,21 +272,14 @@ def _make_seed_generator(device: str, seed: int) -> Any:
|
||||
return torch.Generator().manual_seed(seed) # type: ignore
|
||||
|
||||
|
||||
def _generator_device(generator: Any) -> str:
|
||||
"""Best-effort device type of a ``torch.Generator`` (e.g. ``"cpu"``, ``"mps"``)."""
|
||||
device = getattr(generator, "device", None)
|
||||
return getattr(device, "type", str(device)) if device is not None else "cpu"
|
||||
# Canny edge thresholds for the ControlNet control image (xinsir canny recipe:
|
||||
# cv2.Canny(gray, 100, 200) -> a 3-channel edge map).
|
||||
_CANNY_LOW = 100
|
||||
_CANNY_HIGH = 200
|
||||
|
||||
|
||||
# Keep legacy name available for backwards compatibility
|
||||
_detect_model_profile_from_id = detect_model_profile
|
||||
|
||||
# SDXL Differential-Diffusion community pipeline, pinned to the installed
|
||||
# diffusers version so the fetched pipeline code matches the library (see #21).
|
||||
# Diffusers' dynamic-module loader resolves ``custom_revision`` against the
|
||||
# package version string (``0.38.0``), NOT the GitHub git tag (``v0.38.0``).
|
||||
_DIFF_PIPELINE_NAME = "pipeline_stable_diffusion_xl_differential_img2img"
|
||||
_DIFF_PIPELINE_REVISION = "0.38.0"
|
||||
# A neutral quality prompt: the goal is faithful regeneration, not creative edits.
|
||||
_CONTROLNET_PROMPT = "best quality, high quality, sharp, detailed, photographic"
|
||||
_CONTROLNET_NEGATIVE = "blurry, lowres, deformed, distorted text, garbled text, watermark, jpeg artifacts"
|
||||
|
||||
|
||||
class WatermarkRemover:
|
||||
@@ -299,8 +291,8 @@ class WatermarkRemover:
|
||||
"""
|
||||
|
||||
DEFAULT_MODEL_ID = DEFAULT_MODEL_ID
|
||||
CTRLREGEN_MODEL_ID = CTRLREGEN_MODEL_ID
|
||||
DEFAULT_STRENGTH = DEFAULT_STRENGTH
|
||||
CONTROLNET_CANNY_MODEL = CONTROLNET_CANNY_MODEL
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
@@ -309,9 +301,14 @@ class WatermarkRemover:
|
||||
torch_dtype: Any = None,
|
||||
progress_callback: Callable[[str], None] | None = None,
|
||||
hf_token: str | None = None,
|
||||
pipeline: str = "default",
|
||||
controlnet_conditioning_scale: float = 1.0,
|
||||
) -> None:
|
||||
self.model_id = model_id or self.DEFAULT_MODEL_ID
|
||||
self.model_profile = detect_model_profile(self.model_id)
|
||||
# The pipeline profile is threaded explicitly (not inferred from model_id):
|
||||
# both "default" and "controlnet" use the same SDXL base checkpoint.
|
||||
self.model_profile = pipeline
|
||||
self.controlnet_conditioning_scale = controlnet_conditioning_scale
|
||||
|
||||
if not is_watermark_removal_available():
|
||||
_ensure_watermark_deps()
|
||||
@@ -329,8 +326,7 @@ class WatermarkRemover:
|
||||
self.torch_dtype = torch_dtype
|
||||
|
||||
self._pipeline: AutoImg2ImgPipeline | None = None
|
||||
self._diff_pipeline: Any = None
|
||||
self._ctrlregen_engine: Any = None
|
||||
self._controlnet_pipeline: Any = None
|
||||
self._progress_callback = progress_callback
|
||||
self.hf_token: str | None = hf_token or os.environ.get("HF_TOKEN")
|
||||
|
||||
@@ -345,44 +341,59 @@ class WatermarkRemover:
|
||||
|
||||
def preload(self) -> None:
|
||||
"""Eagerly load the pipeline so download progress bars are visible."""
|
||||
if self.model_profile == "ctrlregen":
|
||||
self._run_ctrlregen_preload()
|
||||
if self.model_profile == "controlnet":
|
||||
self._load_controlnet_pipeline()
|
||||
else:
|
||||
self._load_pipeline()
|
||||
|
||||
def _run_ctrlregen_preload(self) -> None:
|
||||
"""Ensure the CtrlRegen engine and all its models are loaded."""
|
||||
from remove_ai_watermarks.noai.ctrlregen import is_ctrlregen_available
|
||||
|
||||
if not is_ctrlregen_available():
|
||||
missing_pkgs = ["controlnet-aux", "color-matcher", "safetensors"]
|
||||
logger.info("Auto-installing missing CtrlRegen dependencies: %s", missing_pkgs)
|
||||
if not _auto_install(missing_pkgs):
|
||||
raise ImportError(
|
||||
f"Failed to auto-install missing dependencies: {', '.join(missing_pkgs)}. "
|
||||
"Try manually: pip install --force-reinstall noai-watermark"
|
||||
)
|
||||
if self._ctrlregen_engine is None:
|
||||
self._ctrlregen_engine = self._make_ctrlregen_engine()
|
||||
self._ctrlregen_engine.load()
|
||||
|
||||
def _make_ctrlregen_engine(self) -> Any:
|
||||
"""Create a new CtrlRegenEngine with current settings."""
|
||||
from remove_ai_watermarks.noai.ctrlregen import CtrlRegenEngine
|
||||
|
||||
base_model = self.model_id if self.model_id != self.CTRLREGEN_MODEL_ID else None
|
||||
return CtrlRegenEngine(
|
||||
base_model_id=base_model,
|
||||
device=self.device,
|
||||
torch_dtype=self.torch_dtype,
|
||||
hf_token=self.hf_token,
|
||||
progress_callback=self._progress_callback,
|
||||
)
|
||||
|
||||
# ── Pipeline loading ─────────────────────────────────────────────
|
||||
|
||||
def _maybe_add_fp16_vae(self, load_kwargs: dict[str, Any]) -> None:
|
||||
"""Swap in the fp16-fixed SDXL VAE for the default checkpoint on an fp16 GPU.
|
||||
|
||||
The stock SDXL VAE overflows to NaN in fp16 and decodes to an all-black
|
||||
image (issue #29). Shared by both pipeline loaders; a no-op on fp32 (cpu/mps)
|
||||
or a non-SDXL checkpoint.
|
||||
"""
|
||||
if _needs_fp16_vae_fix(self.model_id, self.DEFAULT_MODEL_ID, self.torch_dtype == torch.float16):
|
||||
from diffusers import AutoencoderKL
|
||||
|
||||
self._set_progress("Loading fp16-fixed SDXL VAE (avoids black output)...")
|
||||
load_kwargs["vae"] = AutoencoderKL.from_pretrained(_SDXL_FP16_VAE_ID, torch_dtype=torch.float16)
|
||||
|
||||
def _move_to_device_and_optimize(self, pipeline: Any) -> Any:
|
||||
"""Move a freshly loaded pipeline to ``self.device`` and enable memory optimizations.
|
||||
|
||||
Shared by both loaders. On a CUDA move failure (missing CUDA torch build),
|
||||
trigger the torch-CUDA reinstall+restart. Returns the moved pipeline.
|
||||
"""
|
||||
self._set_progress(f"Moving model to device: {self.device}")
|
||||
try:
|
||||
pipeline = pipeline.to(self.device)
|
||||
except (RuntimeError, AssertionError) as exc:
|
||||
if self.device == "cuda" and not os.environ.get(_CUDA_FIX_ENV_KEY):
|
||||
self._set_progress("CUDA failed. Reinstalling torch with CUDA support...")
|
||||
_reinstall_torch_cuda_and_restart()
|
||||
raise RuntimeError(
|
||||
f"Failed to move model to {self.device} ({exc}). "
|
||||
"Install CUDA-enabled PyTorch manually:\n"
|
||||
f" pip install torch --index-url {_detect_cuda_index_url()}"
|
||||
) from exc
|
||||
|
||||
if hasattr(pipeline, "enable_xformers_memory_efficient_attention"):
|
||||
with contextlib.suppress(Exception):
|
||||
self._set_progress("Enabling memory optimizations...")
|
||||
pipeline.enable_xformers_memory_efficient_attention()
|
||||
|
||||
# MPS runs in fp32, so slice attention to reduce peak memory on Apple Silicon
|
||||
if self.device == "mps" and hasattr(pipeline, "enable_attention_slicing"):
|
||||
with contextlib.suppress(Exception):
|
||||
pipeline.enable_attention_slicing("max")
|
||||
|
||||
return pipeline
|
||||
|
||||
def _load_pipeline(self) -> AutoImg2ImgPipeline:
|
||||
"""Load the diffusion pipeline lazily."""
|
||||
"""Load the plain SDXL img2img pipeline lazily."""
|
||||
if self._pipeline is None:
|
||||
logger.info("Loading model %s on %s...", self.model_id, self.device)
|
||||
self._set_progress(f"Loading model weights: {self.model_id}")
|
||||
@@ -394,48 +405,47 @@ class WatermarkRemover:
|
||||
}
|
||||
if self.hf_token:
|
||||
load_kwargs["token"] = self.hf_token
|
||||
self._maybe_add_fp16_vae(load_kwargs)
|
||||
|
||||
# Avoid the SDXL fp16 NaN/all-black decode (issue #29) by loading the
|
||||
# fp16-fixed VAE for the default SDXL checkpoint on a fp16 GPU.
|
||||
if _needs_fp16_vae_fix(self.model_id, self.DEFAULT_MODEL_ID, self.torch_dtype == torch.float16):
|
||||
from diffusers import AutoencoderKL
|
||||
|
||||
self._set_progress("Loading fp16-fixed SDXL VAE (avoids black output)...")
|
||||
load_kwargs["vae"] = AutoencoderKL.from_pretrained(_SDXL_FP16_VAE_ID, torch_dtype=torch.float16)
|
||||
|
||||
self._pipeline = AutoImg2ImgPipeline.from_pretrained( # type: ignore
|
||||
self.model_id,
|
||||
**load_kwargs,
|
||||
)
|
||||
|
||||
self._set_progress(f"Moving model to device: {self.device}")
|
||||
try:
|
||||
self._pipeline = self._pipeline.to(self.device) # type: ignore
|
||||
except (RuntimeError, AssertionError) as exc:
|
||||
if self.device == "cuda" and not os.environ.get(_CUDA_FIX_ENV_KEY):
|
||||
self._set_progress("CUDA failed. Reinstalling torch with CUDA support...")
|
||||
_reinstall_torch_cuda_and_restart()
|
||||
raise RuntimeError(
|
||||
f"Failed to move model to {self.device} ({exc}). "
|
||||
"Install CUDA-enabled PyTorch manually:\n"
|
||||
f" pip install torch --index-url {_detect_cuda_index_url()}"
|
||||
) from exc
|
||||
|
||||
if hasattr(self._pipeline, "enable_xformers_memory_efficient_attention"):
|
||||
with contextlib.suppress(Exception):
|
||||
self._set_progress("Enabling memory optimizations...")
|
||||
self._pipeline.enable_xformers_memory_efficient_attention() # type: ignore
|
||||
|
||||
# Mac Float32 memory slicing
|
||||
if self.device == "mps" and hasattr(self._pipeline, "enable_attention_slicing"):
|
||||
with contextlib.suppress(Exception):
|
||||
self._pipeline.enable_attention_slicing("max")
|
||||
pipeline = AutoImg2ImgPipeline.from_pretrained(self.model_id, **load_kwargs) # type: ignore
|
||||
self._pipeline = self._move_to_device_and_optimize(pipeline)
|
||||
|
||||
logger.info("Model loaded successfully")
|
||||
self._set_progress("Model initialized. Preparing input image...")
|
||||
|
||||
return self._pipeline # type: ignore
|
||||
|
||||
def _load_controlnet_pipeline(self) -> Any:
|
||||
"""Load the SDXL + canny-ControlNet img2img pipeline lazily.
|
||||
|
||||
Mirrors ``_load_pipeline`` (same fp16-fix VAE, device move, attention
|
||||
slicing via the shared helpers) but loads the canny ControlNet on top of
|
||||
the SDXL base. The ControlNet only preserves structure via the edge map;
|
||||
removal still comes from the img2img regeneration (``strength``).
|
||||
"""
|
||||
if self._controlnet_pipeline is None:
|
||||
from diffusers import ControlNetModel, StableDiffusionXLControlNetImg2ImgPipeline
|
||||
|
||||
logger.info("Loading SDXL + ControlNet (%s) on %s...", CONTROLNET_CANNY_MODEL, self.device)
|
||||
self._set_progress(f"Loading ControlNet: {CONTROLNET_CANNY_MODEL}")
|
||||
controlnet = ControlNetModel.from_pretrained(CONTROLNET_CANNY_MODEL, torch_dtype=self.torch_dtype)
|
||||
|
||||
load_kwargs: dict[str, Any] = {"controlnet": controlnet, "torch_dtype": self.torch_dtype}
|
||||
if self.hf_token:
|
||||
load_kwargs["token"] = self.hf_token
|
||||
self._maybe_add_fp16_vae(load_kwargs)
|
||||
|
||||
self._set_progress(f"Loading model weights: {self.model_id}")
|
||||
pipeline = StableDiffusionXLControlNetImg2ImgPipeline.from_pretrained(self.model_id, **load_kwargs)
|
||||
pipeline = self._move_to_device_and_optimize(pipeline)
|
||||
with contextlib.suppress(Exception):
|
||||
pipeline.set_progress_bar_config(disable=True)
|
||||
|
||||
logger.info("ControlNet model loaded successfully")
|
||||
self._controlnet_pipeline = pipeline
|
||||
|
||||
return self._controlnet_pipeline
|
||||
|
||||
# ── Core removal ─────────────────────────────────────────────────
|
||||
|
||||
def remove_watermark(
|
||||
@@ -446,7 +456,6 @@ class WatermarkRemover:
|
||||
num_inference_steps: int = 50,
|
||||
guidance_scale: float | None = None,
|
||||
seed: int | None = None,
|
||||
protect_text: bool = True,
|
||||
vendor: str | None = None,
|
||||
) -> Path:
|
||||
"""Remove watermark from an image using regeneration attack.
|
||||
@@ -459,10 +468,6 @@ class WatermarkRemover:
|
||||
num_inference_steps: Number of denoising steps.
|
||||
guidance_scale: Classifier-free guidance scale.
|
||||
seed: Random seed for reproducibility.
|
||||
protect_text: Detect text regions and preserve them via Differential
|
||||
Diffusion when any are found (SDXL default profile only). On by
|
||||
default; the detector decides per image, and text-free inputs run
|
||||
the standard pass at no extra cost.
|
||||
vendor: SynthID vendor (``"openai"`` / ``"google"`` / None) used to pick the
|
||||
default strength when ``strength`` is None. Detect it from the ORIGINAL
|
||||
input with ``watermark_profiles.vendor_for_strength`` before processing
|
||||
@@ -482,13 +487,13 @@ class WatermarkRemover:
|
||||
if output_path is None:
|
||||
output_path = image_path
|
||||
|
||||
strength = resolve_strength(strength, self.model_profile, vendor)
|
||||
strength = resolve_strength(strength, vendor)
|
||||
|
||||
if not 0.0 <= strength <= 1.0:
|
||||
raise ValueError(f"Strength must be between 0.0 and 1.0, got {strength}")
|
||||
|
||||
if guidance_scale is None:
|
||||
guidance_scale = 2.0 if self.model_profile == "ctrlregen" else 7.5
|
||||
guidance_scale = 7.5
|
||||
|
||||
self._set_progress("Loading and preprocessing input image...")
|
||||
init_image = Image.open(image_path).convert("RGB")
|
||||
@@ -508,16 +513,8 @@ class WatermarkRemover:
|
||||
|
||||
_total_start = time.monotonic()
|
||||
|
||||
if self.model_profile == "ctrlregen":
|
||||
cleaned_image = self._run_ctrlregen(
|
||||
init_image,
|
||||
strength,
|
||||
num_inference_steps,
|
||||
guidance_scale,
|
||||
generator,
|
||||
)
|
||||
elif protect_text and self._can_protect_text():
|
||||
cleaned_image = self._run_region_hires(
|
||||
if self.model_profile == "controlnet":
|
||||
cleaned_image = self._run_controlnet(
|
||||
init_image,
|
||||
strength,
|
||||
num_inference_steps,
|
||||
@@ -525,12 +522,6 @@ class WatermarkRemover:
|
||||
generator,
|
||||
)
|
||||
else:
|
||||
if protect_text:
|
||||
logger.debug(
|
||||
"Text protection unavailable "
|
||||
"(needs the SDXL default model and the cv2 text detector); "
|
||||
"running standard img2img."
|
||||
)
|
||||
cleaned_image = self._run_img2img(
|
||||
init_image,
|
||||
strength,
|
||||
@@ -613,148 +604,25 @@ class WatermarkRemover:
|
||||
self._pipeline = None
|
||||
return self._load_pipeline()
|
||||
|
||||
# ── Text-protected differential runner ───────────────────────────
|
||||
# ── ControlNet runner ────────────────────────────────────────────
|
||||
|
||||
def _can_protect_text(self) -> bool:
|
||||
"""True when text protection can run: SDXL default model + cv2 detector."""
|
||||
from remove_ai_watermarks import text_protector
|
||||
def _build_canny_control_image(self, init_image: Image.Image) -> Image.Image:
|
||||
"""Build the canny ControlNet conditioning image (xinsir recipe).
|
||||
|
||||
return self.model_id == self.DEFAULT_MODEL_ID and text_protector.is_available()
|
||||
|
||||
def _load_differential_pipeline(self) -> Any:
|
||||
"""Load the SDXL Differential-Diffusion community pipeline lazily."""
|
||||
if self._diff_pipeline is None:
|
||||
from diffusers import DiffusionPipeline
|
||||
|
||||
self._set_progress("Loading Differential-Diffusion pipeline (protect-text)...")
|
||||
use_fp16 = self.device in {"mps", "cuda", "xpu"}
|
||||
load_kwargs: dict[str, Any] = {
|
||||
"custom_pipeline": _DIFF_PIPELINE_NAME,
|
||||
"custom_revision": _DIFF_PIPELINE_REVISION,
|
||||
"torch_dtype": torch.float16 if use_fp16 else torch.float32, # type: ignore[attr-defined]
|
||||
"use_safetensors": True,
|
||||
}
|
||||
if use_fp16:
|
||||
load_kwargs["variant"] = "fp16"
|
||||
if self.hf_token:
|
||||
load_kwargs["token"] = self.hf_token
|
||||
|
||||
pipeline = DiffusionPipeline.from_pretrained(self.model_id, **load_kwargs).to(self.device)
|
||||
# The differential pipeline upcasts the SDXL VAE to fp32 internally
|
||||
# (the fp16 VAE decodes to NaN/black otherwise), so we add no extra
|
||||
# VAE handling here. Attention slicing is also left off on MPS: it
|
||||
# produced NaN latents with this pipeline, and the protect-text pass
|
||||
# is short enough not to need it.
|
||||
with contextlib.suppress(Exception):
|
||||
pipeline.set_progress_bar_config(disable=True)
|
||||
self._diff_pipeline = pipeline
|
||||
return self._diff_pipeline
|
||||
|
||||
def _reload_differential_on_cpu(self) -> Any:
|
||||
"""Reload the differential pipeline on CPU after an MPS failure."""
|
||||
self.device = "cpu"
|
||||
self.torch_dtype = torch.float32 # type: ignore[assignment]
|
||||
self._diff_pipeline = None
|
||||
return self._load_differential_pipeline()
|
||||
|
||||
# Region high-res text scrub: defaults tuned so each text block is upscaled
|
||||
# enough that strokes exceed the VAE's ~8px latent cell, capped so a single
|
||||
# region never blows past the GPU/MPS memory budget.
|
||||
_REGION_HIRES_SCALE = 3.0
|
||||
_REGION_MAX_MEGAPIXELS = 1.3
|
||||
|
||||
def _run_region_hires(
|
||||
self,
|
||||
init_image: Image.Image,
|
||||
strength: float,
|
||||
num_inference_steps: int,
|
||||
guidance_scale: float,
|
||||
generator: Any,
|
||||
) -> Image.Image:
|
||||
"""Scrub the whole image, then RE-scrub each detected text block at high
|
||||
resolution and composite it back.
|
||||
|
||||
Unlike the Differential-Diffusion path (which freezes text in latent space
|
||||
and so leaves the watermark intact there), every pixel here is regenerated
|
||||
-- the watermark is removed everywhere. Small text survives because each
|
||||
text block is upscaled before its img2img pass, so strokes span more than
|
||||
one VAE latent cell (the ~8px floor that softens text at native scale);
|
||||
the scrubbed crop is downscaled and feather-composited back. Falls back to
|
||||
the plain global scrub when no text is detected.
|
||||
cv2.Canny on the RGB->gray array, stacked to 3 channels, wrapped as a PIL
|
||||
image. The edge map only PRESERVES structure; it never copies pixels.
|
||||
``init_image`` is already RGB (``remove_watermark`` converts on load).
|
||||
"""
|
||||
import math
|
||||
|
||||
import cv2
|
||||
import numpy as np
|
||||
|
||||
from remove_ai_watermarks import text_protector
|
||||
rgb = np.array(init_image)
|
||||
gray = cv2.cvtColor(rgb, cv2.COLOR_RGB2GRAY)
|
||||
edges = cv2.Canny(gray, _CANNY_LOW, _CANNY_HIGH)
|
||||
edges_rgb = np.stack([edges, edges, edges], axis=-1)
|
||||
return Image.fromarray(edges_rgb)
|
||||
|
||||
base = self._run_img2img(init_image, strength, num_inference_steps, guidance_scale, generator)
|
||||
|
||||
# The base pass may have fallen back from MPS to CPU (it flips
|
||||
# self.device). The generator was built for the original device, and
|
||||
# diffusers rejects a device-mismatched generator ("Expected a 'cpu'
|
||||
# device generator but found 'mps'"), so drop it for the per-region
|
||||
# passes -- they then seed from the global RNG, which is fine here.
|
||||
if generator is not None and self.device == "cpu" and _generator_device(generator) != "cpu":
|
||||
generator = None
|
||||
|
||||
bgr = cv2.cvtColor(np.array(init_image), cv2.COLOR_RGB2BGR)
|
||||
try:
|
||||
boxes = text_protector.TextProtector().detect_text_boxes(bgr)
|
||||
except Exception as exc:
|
||||
logger.warning("Text detection failed (%s); keeping the global scrub.", exc)
|
||||
return base
|
||||
if not boxes:
|
||||
self._set_progress("No text detected; global scrub only.")
|
||||
return base
|
||||
|
||||
width, height = init_image.size
|
||||
regions = text_protector.merge_text_regions(boxes, height, width)
|
||||
orig_bgr = cv2.cvtColor(np.array(init_image), cv2.COLOR_RGB2BGR)
|
||||
out_bgr = cv2.cvtColor(np.array(base), cv2.COLOR_RGB2BGR)
|
||||
budget = self._REGION_MAX_MEGAPIXELS * 1_000_000
|
||||
|
||||
done = 0
|
||||
for x, y, w, h in regions:
|
||||
area = max(1, w * h)
|
||||
# INTEGER scale so the upscale -> scrub -> downscale round-trip is an
|
||||
# exact dimensional inverse (a fractional factor truncates and shifts
|
||||
# the composited text ~1-2px, which is invisible but tanks alignment).
|
||||
scale = int(min(self._REGION_HIRES_SCALE, math.sqrt(budget / area)))
|
||||
if scale < 2:
|
||||
# Region too large to even double within the budget: upscaling
|
||||
# buys nothing here; the global scrub covers it (documented limit
|
||||
# for very large text areas -- tiling is the future fix).
|
||||
continue
|
||||
crop = orig_bgr[y : y + h, x : x + w]
|
||||
up = cv2.resize(crop, (w * scale, h * scale), interpolation=cv2.INTER_LANCZOS4)
|
||||
up_pil = Image.fromarray(cv2.cvtColor(up, cv2.COLOR_BGR2RGB))
|
||||
scrubbed = self._run_img2img(up_pil, strength, num_inference_steps, guidance_scale, generator)
|
||||
down = cv2.resize(cv2.cvtColor(np.array(scrubbed), cv2.COLOR_RGB2BGR), (w, h), interpolation=cv2.INTER_AREA)
|
||||
# The up -> scrub -> down round-trip can offset the re-rendered text by
|
||||
# a pixel or two (the diffusion pipeline rounds dims to a multiple of
|
||||
# 8, so the inverse resize is not perfectly centered). Phase-correlate
|
||||
# the patch back to the original crop and translate it so the glyphs
|
||||
# land exactly where they were -- otherwise a sub-pixel shift garbles
|
||||
# the composite even though the text is crisp.
|
||||
cg = cv2.cvtColor(crop, cv2.COLOR_BGR2GRAY).astype(np.float32)
|
||||
dg = cv2.cvtColor(down, cv2.COLOR_BGR2GRAY).astype(np.float32)
|
||||
(sx, sy), resp = cv2.phaseCorrelate(cg, dg)
|
||||
# Only correct for the real 1-2px round-trip shift. On a near-flat /
|
||||
# low-contrast crop phaseCorrelate returns a spurious large offset at
|
||||
# a tiny response (e.g. (19,19) at resp ~0.005); warping by that
|
||||
# garbles the composite -- the exact failure this was meant to
|
||||
# prevent. Gate on both a confident response and a plausible offset.
|
||||
if resp > 0.3 and abs(sx) < 4 and abs(sy) < 4 and (abs(sx) > 0.1 or abs(sy) > 0.1):
|
||||
m = np.float32([[1, 0, -sx], [0, 1, -sy]])
|
||||
down = cv2.warpAffine(down, m, (w, h), flags=cv2.INTER_LINEAR, borderMode=cv2.BORDER_REPLICATE)
|
||||
out_bgr = text_protector.feather_paste(out_bgr, down, x, y)
|
||||
done += 1
|
||||
self._set_progress(f"Re-scrubbed {done}/{len(regions)} text region(s) at high resolution.")
|
||||
return Image.fromarray(cv2.cvtColor(out_bgr, cv2.COLOR_BGR2RGB))
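The integer-scale rule in the removed region pass is easier to follow with numbers. The sketch below reuses the two constants and the scale expression from the code above; the function name `region_scale` and the sample region sizes are illustrative assumptions, not part of the project.

```python
import math

# Constants copied from the removed code above.
REGION_HIRES_SCALE = 3.0
REGION_MAX_MEGAPIXELS = 1.3


def region_scale(w: int, h: int) -> int:
    """Integer upscale factor for a w x h text region (hypothetical helper name)."""
    budget = REGION_MAX_MEGAPIXELS * 1_000_000
    area = max(1, w * h)
    # int() truncates, so the factor is always a whole number and the
    # upscaled crop never exceeds the megapixel budget.
    return int(min(REGION_HIRES_SCALE, math.sqrt(budget / area)))


small = region_scale(400, 100)    # capped by REGION_HIRES_SCALE
medium = region_scale(700, 400)   # limited by the memory budget
large = region_scale(1000, 600)   # below 2, so the loop above would skip it
```

A region whose factor comes out below 2 is left to the global pass, which matches the `if scale < 2: continue` branch.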
|
||||
|
||||
def _run_differential(
|
||||
def _run_controlnet(
|
||||
self,
|
||||
init_image: Image.Image,
|
||||
strength: float,
|
||||
@@ -762,105 +630,50 @@ class WatermarkRemover:
|
||||
guidance_scale: float,
|
||||
generator: Any,
|
||||
) -> Image.Image:
|
||||
"""Run differential img2img that preserves detected text regions."""
|
||||
import cv2
|
||||
import numpy as np
|
||||
"""Run the SDXL + canny-ControlNet img2img pass.
|
||||
|
||||
from remove_ai_watermarks import text_protector
|
||||
Removal still comes from the img2img regeneration (``strength``); the canny
|
||||
ControlNet only PRESERVES text and face STRUCTURE via the edge map. No
|
||||
original pixels are copied/frozen, so SynthID does not survive (canny holds
|
||||
structure, not face identity). ``controlnet_conditioning_scale`` is the
|
||||
structure-preservation knob. Shares the img2img runner (live progress +
|
||||
MPS->CPU fallback) with ``_run_img2img``; the only delta is the extra
|
||||
ControlNet kwargs (canny control image + conditioning scale + a non-empty
|
||||
prompt) overlaid via ``extra_kwargs``.
|
||||
"""
|
||||
from remove_ai_watermarks.noai.img2img_runner import run_img2img_with_mps_fallback
|
||||
|
||||
self._set_progress("Detecting text regions to protect (protect-text)...")
|
||||
bgr = cv2.cvtColor(np.array(init_image), cv2.COLOR_RGB2BGR)
|
||||
try:
|
||||
boxes = text_protector.TextProtector().detect_text_boxes(bgr)
|
||||
except Exception as exc:
|
||||
logger.warning("Text detection failed (%s); running standard img2img.", exc)
|
||||
return self._run_img2img(init_image, strength, num_inference_steps, guidance_scale, generator)
|
||||
|
||||
if not boxes:
|
||||
self._set_progress("No text detected; running standard img2img.")
|
||||
return self._run_img2img(init_image, strength, num_inference_steps, guidance_scale, generator)
|
||||
|
||||
width, height = init_image.size
|
||||
change_map = text_protector.build_change_map(boxes, height, width)
|
||||
self._set_progress(f"Protecting {len(boxes)} text region(s) via Differential Diffusion...")
|
||||
|
||||
from remove_ai_watermarks.noai.img2img_runner import run_differential_with_mps_fallback
|
||||
|
||||
result_image, final_device = run_differential_with_mps_fallback(
|
||||
load_pipeline=self._load_differential_pipeline,
|
||||
extra_kwargs = {
|
||||
"prompt": _CONTROLNET_PROMPT,
|
||||
"negative_prompt": _CONTROLNET_NEGATIVE,
|
||||
"control_image": self._build_canny_control_image(init_image),
|
||||
"controlnet_conditioning_scale": float(self.controlnet_conditioning_scale),
|
||||
}
|
||||
result_image, final_device = run_img2img_with_mps_fallback(
|
||||
load_pipeline=self._load_controlnet_pipeline,
|
||||
image=init_image,
|
||||
change_map=change_map,
|
||||
strength=strength,
|
||||
num_inference_steps=num_inference_steps,
|
||||
guidance_scale=guidance_scale,
|
||||
generator=generator,
|
||||
device=self.device,
|
||||
set_progress=self._set_progress,
|
||||
reload_on_cpu=self._reload_differential_on_cpu,
|
||||
reload_on_cpu=self._reload_controlnet_on_cpu,
|
||||
extra_kwargs=extra_kwargs,
|
||||
)
|
||||
|
||||
if final_device != self.device:
|
||||
self.device = final_device
|
||||
self.torch_dtype = torch.float32 # type: ignore[assignment]
|
||||
|
||||
return result_image
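The `extra_kwargs` overlay that the `_run_controlnet` docstring describes can be pictured with a minimal sketch. The helper `build_call_kwargs` and every value below are hypothetical placeholders; the real merge happens inside `run_img2img_with_mps_fallback`, whose body this diff does not show.

```python
from __future__ import annotations

from typing import Any


def build_call_kwargs(base: dict[str, Any], extra_kwargs: dict[str, Any] | None = None) -> dict[str, Any]:
    """Hypothetical helper: overlay pipeline-specific kwargs on the shared img2img kwargs."""
    merged = dict(base)  # copy so the shared arguments are never mutated
    if extra_kwargs:
        merged.update(extra_kwargs)  # pipeline-specific keys win on conflict
    return merged


# Shared img2img arguments (placeholder values, not the project's defaults).
base = {"prompt": "", "strength": 0.15, "num_inference_steps": 50, "guidance_scale": 7.5}

# ControlNet delta: control image, conditioning scale, and a non-empty prompt.
extra = {
    "prompt": "high quality photo",
    "control_image": "<canny edge map>",
    "controlnet_conditioning_scale": 0.5,
}

call_kwargs = build_call_kwargs(base, extra)
```

Keeping the delta in one dictionary is what lets both pipelines share a single runner: the plain img2img path simply passes no overlay.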
|
||||
|
||||
# ── CtrlRegen runner ─────────────────────────────────────────────
|
||||
|
||||
def _run_ctrlregen(
|
||||
self,
|
||||
init_image: Image.Image,
|
||||
strength: float,
|
||||
num_inference_steps: int,
|
||||
guidance_scale: float,
|
||||
generator: Any,
|
||||
) -> Image.Image:
|
||||
"""Run CtrlRegen pipeline with MPS fallback."""
|
||||
from remove_ai_watermarks.noai.ctrlregen import is_ctrlregen_available
|
||||
from remove_ai_watermarks.noai.progress import is_mps_error
|
||||
|
||||
if not is_ctrlregen_available():
|
||||
missing_pkgs = ["controlnet-aux", "color-matcher", "safetensors"]
|
||||
logger.info("Auto-installing missing CtrlRegen dependencies: %s", missing_pkgs)
|
||||
if not _auto_install(missing_pkgs):
|
||||
raise ImportError(
|
||||
f"Failed to auto-install missing dependencies: {', '.join(missing_pkgs)}. "
|
||||
"Try manually: pip install --force-reinstall noai-watermark"
|
||||
)
|
||||
|
||||
if self._ctrlregen_engine is None:
|
||||
self._ctrlregen_engine = self._make_ctrlregen_engine()
|
||||
|
||||
seed = None
|
||||
if generator is not None and hasattr(generator, "initial_seed"):
|
||||
seed = generator.initial_seed()
|
||||
|
||||
try:
|
||||
return self._ctrlregen_engine.run(
|
||||
image=init_image,
|
||||
strength=strength,
|
||||
num_inference_steps=num_inference_steps,
|
||||
guidance_scale=guidance_scale,
|
||||
seed=seed,
|
||||
)
|
||||
except RuntimeError as error:
|
||||
if self.device == "mps" and is_mps_error(error):
|
||||
logger.warning("MPS out of memory during CtrlRegen. Falling back to CPU.")
|
||||
self._set_progress("MPS out of memory! Retrying CtrlRegen on CPU...")
|
||||
with contextlib.suppress(Exception):
|
||||
if _HAS_TORCH and hasattr(torch, "mps"):
|
||||
torch.mps.empty_cache() # type: ignore[attr-defined]
|
||||
|
||||
self.device = "cpu"
|
||||
self.torch_dtype = torch.float32 # type: ignore[assignment]
|
||||
self._ctrlregen_engine = self._make_ctrlregen_engine()
|
||||
|
||||
return self._ctrlregen_engine.run(
|
||||
image=init_image,
|
||||
strength=strength,
|
||||
num_inference_steps=num_inference_steps,
|
||||
guidance_scale=guidance_scale,
|
||||
seed=seed,
|
||||
)
|
||||
raise
|
||||
def _reload_controlnet_on_cpu(self) -> Any:
|
||||
"""Reload the controlnet pipeline on CPU after an MPS failure."""
|
||||
self.device = "cpu"
|
||||
self.torch_dtype = torch.float32 # type: ignore[assignment]
|
||||
self._controlnet_pipeline = None
|
||||
return self._load_controlnet_pipeline()
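The reload callback above is one half of a retry pattern: run on the requested device, and on an MPS failure rebuild the pipeline on CPU and run once more. The sketch below shows that control flow under stated assumptions. `run_with_cpu_fallback`, the `is_device_error` predicate, and the stand-in pipelines are all hypothetical; the project's actual runner is `run_img2img_with_mps_fallback`, which this diff only calls.

```python
from __future__ import annotations

from typing import Any, Callable


def run_with_cpu_fallback(
    load_pipeline: Callable[[], Any],
    reload_on_cpu: Callable[[], Any],
    device: str,
    is_device_error: Callable[[RuntimeError], bool],
    **call_kwargs: Any,
) -> tuple[Any, str]:
    """Hypothetical sketch: run once on `device`, retry once on CPU after an MPS error."""
    try:
        return load_pipeline()(**call_kwargs), device
    except RuntimeError as error:
        if device != "mps" or not is_device_error(error):
            raise  # not an MPS failure: surface it unchanged
    # Rebuild the pipeline on CPU (the callback resets the cached instance) and retry.
    return reload_on_cpu()(**call_kwargs), "cpu"


# Stand-in pipelines so the sketch runs without torch or diffusers.
def failing_mps_pipeline(**kwargs: Any) -> str:
    raise RuntimeError("MPS backend out of memory")


def cpu_pipeline(**kwargs: Any) -> str:
    return f"ok strength={kwargs['strength']}"


result, final_device = run_with_cpu_fallback(
    load_pipeline=lambda: failing_mps_pipeline,
    reload_on_cpu=lambda: cpu_pipeline,
    device="mps",
    is_device_error=lambda err: "MPS" in str(err),
    strength=0.15,
)
```

Returning the final device alongside the result is what allows the caller to update `self.device` and `self.torch_dtype` after a fallback, as `_run_controlnet` does.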
|
||||
|
||||
# ── Batch ────────────────────────────────────────────────────────
|
||||
|
||||
@@ -909,9 +722,9 @@ def remove_watermark(
|
||||
) -> Path:
|
||||
"""Convenience function to remove watermark from an image.
|
||||
|
||||
``strength=None`` lets the profile pick its default: vendor-adaptive for SDXL
|
||||
``strength=None`` lets the profile pick its vendor-adaptive SDXL default
|
||||
(0.10 OpenAI / 0.15 Google / 0.15 unknown, from the C2PA SynthID proxy on the
|
||||
input), clean-noise 1.0 for ctrlregen. Pass a value to override.
|
||||
input). Pass a value to override.
|
||||
"""
|
||||
from remove_ai_watermarks.noai.watermark_profiles import vendor_for_strength
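The vendor-adaptive default quoted in the docstring (0.10 OpenAI / 0.15 Google / 0.15 unknown) amounts to a small lookup. The following is a sketch only: `resolve_strength_sketch` and the table names are hypothetical, and it folds in the range check that `remove_watermark` performs separately after calling the real `resolve_strength`.

```python
from __future__ import annotations

# Defaults quoted in the docstring above; the names are hypothetical.
_DEFAULT_STRENGTH = {"openai": 0.10, "google": 0.15}
_UNKNOWN_VENDOR_STRENGTH = 0.15


def resolve_strength_sketch(strength: float | None, vendor: str | None) -> float:
    """Return the explicit strength, or the vendor-adaptive default when it is None."""
    if strength is None:
        strength = _DEFAULT_STRENGTH.get(vendor or "", _UNKNOWN_VENDOR_STRENGTH)
    if not 0.0 <= strength <= 1.0:
        raise ValueError(f"Strength must be between 0.0 and 1.0, got {strength}")
    return strength
```

An explicit value always wins, so `resolve_strength_sketch(0.3, "openai")` keeps 0.3 and only a `None` strength consults the vendor.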
|
||||
|
||||
|
||||
@@ -1,271 +0,0 @@
|
||||
"""Text-region protection for diffusion-based watermark removal.
|
||||
|
||||
SDXL img2img (the ``invisible`` pipeline) regenerates every pixel, so small text
|
||||
and CJK glyphs get deformed at the strengths that defeat SynthID (issue #21).
|
||||
This module detects text regions and builds a per-pixel "change map" for
|
||||
Differential Diffusion: the background is regenerated normally while text
|
||||
regions are largely preserved, so glyphs survive the watermark-removal pass.
|
||||
|
||||
Detection uses only OpenCV's DNN module (no torch): the PP-OCRv3 text detector
|
||||
is a ~2.4 MB ONNX model (Apache-2.0, from opencv_zoo) that is CJK-native and
|
||||
returns rotated quadrilaterals. The model is downloaded and cached on first use;
|
||||
it is never bundled in this repo.
|
||||
|
||||
Change-map polarity (verified empirically against the differential pipeline):
|
||||
white (1.0) = PRESERVE the original pixels, black (0.0) = MAXIMUM change. So the
|
||||
map is black everywhere except the text polygons, which are painted toward
|
||||
white. ``preserve`` stays below a hard 1.0 freeze by default: SynthID is
|
||||
designed to survive cropping, so totally freezing text pixels would leave the
|
||||
watermark intact there. A high-but-partial preserve still scrubs lightly.
|
||||
"""
|
||||
|
||||
# cv2 ships no type stubs; mirror the pragma used by the other cv2-using modules.
|
||||
# pyright: reportMissingTypeStubs=false, reportUnknownMemberType=false, reportUnknownVariableType=false, reportUnknownArgumentType=false, reportCallIssue=false, reportArgumentType=false, reportReturnType=false
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import logging
|
||||
import os
|
||||
import tempfile
|
||||
import urllib.request
|
||||
from pathlib import Path
|
||||
from typing import TYPE_CHECKING, Any
|
||||
|
||||
if TYPE_CHECKING:
|
||||
from numpy.typing import NDArray
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
# PP-OCRv3 Chinese text detector (DB head), opencv_zoo, Apache-2.0.
|
||||
_MODEL_URL = (
|
||||
"https://github.com/opencv/opencv_zoo/raw/main/models/text_detection_ppocr/text_detection_cn_ppocrv3_2023may.onnx"
|
||||
)
|
||||
_MODEL_FILENAME = "text_detection_cn_ppocrv3_2023may.onnx"
|
||||
|
||||
# DB detector input: detection runs at the image's NATIVE long side, capped at
|
||||
# this value (rounded to a multiple of 32), never upscaled. A fixed small input
|
||||
# (the old 736) downscaled large images so far that small text fell below the
|
||||
# detector's resolution and was missed -- the cause of the "small text still
|
||||
# distorts" reports (issue #14). Detection is script-agnostic (DB segments text
|
||||
# *regions*, not characters), so this recall fix applies to every language; the
|
||||
# only lever that mattered was resolution. 1536 recovers full recall down to
|
||||
# ~12 px text on a 2048 canvas at ~100 ms on CPU (a fixed 736 missed it); going
|
||||
# higher buys no measured recall at 2x+ the cost. Benchmarked in
|
||||
# scripts/text_detection_benchmark.py. Very large canvases with tiny text may
|
||||
# still need tiling -- a documented limit, not yet built.
|
||||
_DET_MAX_LONG_SIDE = 1536
|
||||
# ImageNet mean (x255) and 1/255 scale -- the normalization PP-OCRv3 expects.
|
||||
_DET_MEAN = (0.485 * 255, 0.456 * 255, 0.406 * 255)
|
||||
_DET_SCALE = 1 / 255.0
|
||||
|
||||
|
||||
def is_available() -> bool:
|
||||
"""True when OpenCV's DNN text-detection model is importable."""
|
||||
try:
|
||||
import cv2
|
||||
|
||||
return hasattr(cv2.dnn, "TextDetectionModel_DB")
|
||||
except ImportError:
|
||||
return False
|
||||
|
||||
|
||||
def _cache_dir() -> Path:
|
||||
"""Local cache directory for the detector model (created on demand)."""
|
||||
cache = Path.home() / ".cache" / "remove-ai-watermarks"
|
||||
cache.mkdir(parents=True, exist_ok=True)
|
||||
return cache
|
||||
|
||||
|
||||
def _model_path() -> Path:
|
||||
"""Return the cached detector path, downloading it on first use."""
|
||||
target = _cache_dir() / _MODEL_FILENAME
|
||||
if target.exists() and target.stat().st_size > 0:
|
||||
return target
|
||||
logger.info("Downloading PP-OCRv3 text detector (~2.4 MB) to %s", target)
|
||||
# Download to a temp file in the same dir, then atomically rename so a
|
||||
# partial download never leaves a corrupt model cached.
|
||||
fd, tmp_name = tempfile.mkstemp(dir=str(target.parent), suffix=".onnx.part")
|
||||
tmp_path = Path(tmp_name)
|
||||
try:
|
||||
os.close(fd)
|
||||
with urllib.request.urlopen(_MODEL_URL) as resp: # noqa: S310 (trusted GitHub URL)
|
||||
tmp_path.write_bytes(resp.read())
|
||||
tmp_path.replace(target)
|
||||
finally:
|
||||
if tmp_path.exists():
|
||||
tmp_path.unlink()
|
||||
return target
|
||||
|
||||
|
||||
def _detection_input_size(height: int, width: int) -> tuple[int, int]:
|
||||
"""DB-detector input ``(in_w, in_h)`` for an image of the given size.
|
||||
|
||||
Detect at the native long side, capped at ``_DET_MAX_LONG_SIDE`` and never
|
||||
upscaled, each side rounded down to a multiple of 32 (the DB head requires
|
||||
/32 dims), floored at 32. Pure function so the resolution contract (the
|
||||
issue #14 small-text recall fix) is unit-testable without the model.
|
||||
"""
|
||||
long_side = max(height, width)
|
||||
scale = min(_DET_MAX_LONG_SIDE, long_side) / long_side
|
||||
in_w = max((round(width * scale) // 32) * 32, 32)
|
||||
in_h = max((round(height * scale) // 32) * 32, 32)
|
||||
return in_w, in_h
|
||||
|
||||
|
||||
def build_change_map(
|
||||
boxes: list[NDArray[Any]],
|
||||
height: int,
|
||||
width: int,
|
||||
preserve: float = 0.9,
|
||||
feather: int = 15,
|
||||
) -> NDArray[Any]:
|
||||
"""Build a Differential-Diffusion change map from text polygons.
|
||||
|
||||
Args:
|
||||
boxes: Text-region polygons as arrays of (x, y) vertices.
|
||||
height: Output map height in pixels.
|
||||
width: Output map width in pixels.
|
||||
preserve: Map value painted inside text polygons (0..1). White (1.0)
|
||||
fully preserves the original pixels; the default 0.9 preserves
|
||||
strongly while still letting a light scrub through.
|
||||
feather: Gaussian-blur kernel size for soft polygon edges (forced odd).
|
||||
|
||||
Returns:
|
||||
Float32 HxW array in [0, 1]: ~0 in the background (full change),
|
||||
``preserve`` inside text regions, blended at the edges.
|
||||
"""
|
||||
import cv2
|
||||
import numpy as np
|
||||
|
||||
change_map = np.zeros((height, width), np.float32)
|
||||
if boxes:
|
||||
polys = [np.asarray(b, np.int32) for b in boxes]
|
||||
cv2.fillPoly(change_map, polys, float(preserve))
|
||||
if feather > 0:
|
||||
if feather % 2 == 0:
|
||||
feather += 1
|
||||
change_map = cv2.GaussianBlur(change_map, (feather, feather), 0)
|
||||
# GaussianBlur can overshoot the painted value by a float epsilon; keep
|
||||
# the contract that the map stays a valid [0, 1] change map.
|
||||
np.clip(change_map, 0.0, 1.0, out=change_map)
|
||||
return change_map
|
||||
|
||||
|
||||
def merge_text_regions(
|
||||
boxes: list[NDArray[Any]],
|
||||
height: int,
|
||||
width: int,
|
||||
dilate_frac: float = 0.012,
|
||||
pad_frac: float = 0.02,
|
||||
max_regions: int = 8,
|
||||
) -> list[tuple[int, int, int, int]]:
|
||||
"""Group detected text polygons into a few padded axis-aligned rectangles.
|
||||
|
||||
The DB detector returns one box per word/line; the region-high-res text scrub
|
||||
runs a separate diffusion pass per region, so we coalesce nearby boxes into a
|
||||
handful of *local* blocks (a light dilation merges within a paragraph but not
|
||||
across the whole image, so each block stays small enough to upscale within a
|
||||
memory budget). Returns ``(x, y, w, h)`` rects, largest-area first, clipped to
|
||||
the image and capped at ``max_regions``.
|
||||
"""
|
||||
import cv2
|
||||
import numpy as np
|
||||
|
||||
mask = np.zeros((height, width), np.uint8)
|
||||
if not boxes:
|
||||
return []
|
||||
cv2.fillPoly(mask, [np.asarray(b, np.int32) for b in boxes], 1)
|
||||
k = max(1, int(min(height, width) * dilate_frac))
|
||||
mask = cv2.dilate(mask, cv2.getStructuringElement(cv2.MORPH_RECT, (k, k)))
|
||||
n, _labels, stats, _c = cv2.connectedComponentsWithStats(mask, 8)
|
||||
pad = int(min(height, width) * pad_frac)
|
||||
rects: list[tuple[int, int, int, int]] = []
|
||||
for i in range(1, n):
|
||||
x, y, w, h = (
|
||||
int(stats[i, cv2.CC_STAT_LEFT]),
|
||||
int(stats[i, cv2.CC_STAT_TOP]),
|
||||
int(stats[i, cv2.CC_STAT_WIDTH]),
|
||||
int(stats[i, cv2.CC_STAT_HEIGHT]),
|
||||
)
|
||||
x0, y0 = max(0, x - pad), max(0, y - pad)
|
||||
x1, y1 = min(width, x + w + pad), min(height, y + h + pad)
|
||||
rects.append((x0, y0, x1 - x0, y1 - y0))
|
||||
rects.sort(key=lambda r: -(r[2] * r[3]))
|
||||
return rects[:max_regions]
|
||||
|
||||
|
||||
def feather_paste(
|
||||
base: NDArray[Any],
|
||||
patch: NDArray[Any],
|
||||
x: int,
|
||||
y: int,
|
||||
feather: int = 8,
|
||||
) -> NDArray[Any]:
|
||||
"""Alpha-composite ``patch`` into ``base`` at ``(x, y)`` with a feathered edge.
|
||||
|
||||
    Used to drop a separately re-scrubbed (high-resolution) text region back into
    the globally-scrubbed image without a visible seam. Returns a new array;
    ``base`` is not modified. ``patch`` is clipped to ``base`` bounds.
    """
    import numpy as np

    out = base.copy()
    bh, bw = base.shape[:2]
    ph, pw = patch.shape[:2]
    x0, y0 = max(0, x), max(0, y)
    x1, y1 = min(bw, x + pw), min(bh, y + ph)
    if x1 <= x0 or y1 <= y0:
        return out
    patch_roi = patch[y0 - y : y1 - y, x0 - x : x1 - x].astype(np.float32)
    base_roi = out[y0:y1, x0:x1].astype(np.float32)
    rh, rw = base_roi.shape[:2]
    alpha = np.ones((rh, rw), np.float32)
    f = max(0, min(feather, rh // 2, rw // 2))
    if f > 0:
        ramp = np.linspace(0.0, 1.0, f, dtype=np.float32)
        alpha[:f, :] *= ramp[:, None]
        alpha[rh - f :, :] *= ramp[::-1, None]
        alpha[:, :f] *= ramp[None, :]
        alpha[:, rw - f :] *= ramp[None, ::-1]
    a3 = alpha[:, :, None]
    out[y0:y1, x0:x1] = (patch_roi * a3 + base_roi * (1.0 - a3)).astype(base.dtype)
    return out


class TextProtector:
    """Detect text regions with PP-OCRv3 for diffusion change-map protection."""

    def __init__(
        self,
        binary_threshold: float = 0.3,
        polygon_threshold: float = 0.5,
        max_candidates: int = 200,
        unclip_ratio: float = 2.0,
    ) -> None:
        import cv2

        self._detector = cv2.dnn.TextDetectionModel_DB(str(_model_path()))
        self._detector.setBinaryThreshold(binary_threshold)
        self._detector.setPolygonThreshold(polygon_threshold)
        self._detector.setMaxCandidates(max_candidates)
        self._detector.setUnclipRatio(unclip_ratio)

    def detect_text_boxes(self, bgr_image: NDArray[Any]) -> list[NDArray[Any]]:
        """Detect text regions, returning a list of rotated quad polygons.

        Args:
            bgr_image: Image as an HxWx3 BGR uint8 array (OpenCV convention).

        Returns:
            One array of four (x, y) vertices per detected text region.
        """
        height, width = bgr_image.shape[:2]
        in_w, in_h = _detection_input_size(height, width)
        self._detector.setInputParams(
            scale=_DET_SCALE,
            size=(in_w, in_h),
            mean=_DET_MEAN,
            swapRB=True,
        )
        boxes, _confidences = self._detector.detect(bgr_image)
        return list(boxes)
@@ -1,63 +0,0 @@
import numpy as np

from remove_ai_watermarks.face_protector import FaceProtector


def test_face_protector_initialization():
    # Will fallback to Haar cascade if ultralytics is missing
    fp = FaceProtector(use_yolo=False)
    assert fp.use_yolo is False
    assert fp.haar_cascade is not None


def test_face_protector_lifecycle():
    fp = FaceProtector(use_yolo=False)

    # Create dummy black image
    img = np.zeros((200, 200, 3), dtype=np.uint8)

    # Since it's a black image, haar cascade should find 0 faces
    faces = fp.extract_faces(img)
    assert isinstance(faces, list)
    assert len(faces) == 0

    # Restoring 0 faces should result in strictly equal image
    restored = fp.restore_faces(img, faces)
    assert np.array_equal(img, restored)


def test_face_protector_restore_bypass_on_size_mismatch():
    fp = FaceProtector(use_yolo=False)
    img_small = np.zeros((100, 100, 3), dtype=np.uint8)

    # Manually mock a face that is OUT OF BOUNDS for img_small
    mock_bbox = (80, 80, 130, 130)
    mock_crop = np.ones((50, 50, 3), dtype=np.uint8) * 255
    mock_faces = [(mock_bbox, mock_crop)]

    # Attempt to restore onto an image too small for this box
    restored = fp.restore_faces(img_small, mock_faces)

    # Should safely skip restoring and not crash
    assert np.array_equal(restored, img_small)


def test_face_protector_restore_blending():
    fp = FaceProtector(use_yolo=False)
    # Background is black
    img_target = np.zeros((100, 100, 3), dtype=np.uint8)

    # Face crop is white
    mock_bbox = (25, 25, 75, 75)
    mock_crop = np.ones((50, 50, 3), dtype=np.uint8) * 255
    mock_faces = [(mock_bbox, mock_crop)]

    restored = fp.restore_faces(img_target, mock_faces)

    # The center of the face should be perfectly white (255)
    assert restored[50, 50, 0] >= 254
    # The corner of the target should remain perfectly black (0)
    assert restored[0, 0, 0] == 0
    # We should have a blending gradient between them due to the gaussian blur mask
    # For example, around (30, 30) or similar
    assert 0 <= restored[28, 28, 0] <= 255
@@ -15,7 +15,6 @@ import pytest

from remove_ai_watermarks.noai import img2img_runner
from remove_ai_watermarks.noai.img2img_runner import (
    run_differential_with_mps_fallback,
    run_img2img,
    run_img2img_with_mps_fallback,
)
@@ -110,80 +109,6 @@ class TestMpsFallback:
        reload_on_cpu.assert_not_called()


class TestDifferentialMpsFallback:
    """The protect-text (Differential Diffusion) path shares the MPS->CPU
    fallback contract; mock ``run_differential`` so no torch/model is needed."""

    def test_mps_error_reloads_on_cpu_and_retries(self, monkeypatch: pytest.MonkeyPatch):
        sentinel = object()
        inner = Mock(side_effect=[RuntimeError(_MPS_OOM), sentinel])
        monkeypatch.setattr(img2img_runner, "run_differential", inner)
        reload_on_cpu = Mock(return_value="cpu_pipe")

        img, device = run_differential_with_mps_fallback(
            load_pipeline=Mock(return_value="gpu_pipe"),
            image=object(),
            change_map=object(),
            strength=0.05,
            num_inference_steps=50,
            guidance_scale=7.5,
            generator="gen",
            device="mps",
            set_progress=lambda _m: None,
            reload_on_cpu=reload_on_cpu,
        )

        assert (img, device) == (sentinel, "cpu")
        reload_on_cpu.assert_called_once()
        assert inner.call_count == 2
        # Retry uses the reloaded CPU pipeline, device "cpu", and drops the MPS
        # generator (generator=None) for deterministic CPU execution.
        retry_args = inner.call_args_list[1].args
        assert retry_args[0] == "cpu_pipe"
        assert retry_args[6] is None  # generator
        assert retry_args[7] == "cpu"  # device

    def test_happy_path_returns_original_device_without_reload(self, monkeypatch: pytest.MonkeyPatch):
        sentinel = object()
        monkeypatch.setattr(img2img_runner, "run_differential", Mock(return_value=sentinel))
        reload_on_cpu = Mock()

        img, device = run_differential_with_mps_fallback(
            load_pipeline=Mock(return_value="gpu_pipe"),
            image=object(),
            change_map=object(),
            strength=0.05,
            num_inference_steps=50,
            guidance_scale=7.5,
            generator="gen",
            device="mps",
            set_progress=lambda _m: None,
            reload_on_cpu=reload_on_cpu,
        )

        assert (img, device) == (sentinel, "mps")
        reload_on_cpu.assert_not_called()

    def test_non_mps_runtime_error_propagates(self, monkeypatch: pytest.MonkeyPatch):
        monkeypatch.setattr(img2img_runner, "run_differential", Mock(side_effect=RuntimeError("CUDA out of memory")))
        reload_on_cpu = Mock()

        with pytest.raises(RuntimeError, match="CUDA"):
            run_differential_with_mps_fallback(
                load_pipeline=Mock(return_value="gpu_pipe"),
                image=object(),
                change_map=object(),
                strength=0.05,
                num_inference_steps=50,
                guidance_scale=7.5,
                generator="gen",
                device="mps",
                set_progress=lambda _m: None,
                reload_on_cpu=reload_on_cpu,
            )
        reload_on_cpu.assert_not_called()


class TestRunImg2Img:
    def test_returns_first_image_from_pipeline_result(self):
        sentinel = object()
@@ -30,9 +30,6 @@ class TestInvisibleEngineInit:
        # SDXL base became the default in May 2026 (defeats SynthID v2).
        assert InvisibleEngine.DEFAULT_MODEL_ID == "stabilityai/stable-diffusion-xl-base-1.0"

    def test_ctrlregen_model_id(self):
        assert InvisibleEngine.CTRLREGEN_MODEL_ID == "yepengliu/ctrlregen"


class TestTargetSize:
    """Regression guard for the native-resolution decision (issues #10 / #15).
+19
-32
@@ -14,12 +14,10 @@ import pytest
from remove_ai_watermarks.noai.progress import is_mps_error
from remove_ai_watermarks.noai.utils import get_image_format, is_supported_format
from remove_ai_watermarks.noai.watermark_profiles import (
    CTRLREGEN_DEFAULT_STRENGTH,
    DEFAULT_STRENGTH,
    GEMINI_STRENGTH,
    OPENAI_STRENGTH,
    UNKNOWN_STRENGTH,
    detect_model_profile,
    get_model_id_for_profile,
    resolve_strength,
)
@@ -114,54 +112,43 @@ class TestModelProfiles:
    def test_default_profile(self):
        assert get_model_id_for_profile("default") == "stabilityai/stable-diffusion-xl-base-1.0"

    def test_ctrlregen_profile(self):
        assert get_model_id_for_profile("ctrlregen") == "yepengliu/ctrlregen"
    def test_controlnet_profile(self):
        # controlnet shares the SDXL base checkpoint (the ControlNet is an add-on).
        assert get_model_id_for_profile("controlnet") == "stabilityai/stable-diffusion-xl-base-1.0"

    def test_unknown_profile_raises(self):
        with pytest.raises(ValueError, match="Unknown model profile"):
            get_model_id_for_profile("nonexistent")

    def test_detect_default(self):
        assert detect_model_profile("stabilityai/stable-diffusion-xl-base-1.0") == "default"

    def test_detect_ctrlregen(self):
        assert detect_model_profile("yepengliu/ctrlregen") == "ctrlregen"


class TestResolveStrength:
    """resolve_strength applies the profile/vendor default only when strength is unset."""
    """resolve_strength applies the vendor default only when strength is unset."""

    def test_none_default_profile_is_vendor_adaptive(self):
        # No vendor -> unknown default; OpenAI lower, Google == unknown.
        assert resolve_strength(None, "default") == UNKNOWN_STRENGTH
        assert resolve_strength(None, "default", "openai") == OPENAI_STRENGTH
        assert resolve_strength(None, "default", "google") == GEMINI_STRENGTH
        assert resolve_strength(None, "default", None) == UNKNOWN_STRENGTH
    def test_none_is_vendor_adaptive(self):
        # No vendor -> unknown default; OpenAI lower, Google == unknown. The default
        # is vendor-adaptive and does NOT depend on the pipeline profile (default and
        # controlnet share the same SDXL base).
        assert resolve_strength(None) == UNKNOWN_STRENGTH
        assert resolve_strength(None, "openai") == OPENAI_STRENGTH
        assert resolve_strength(None, "google") == GEMINI_STRENGTH
        assert resolve_strength(None, None) == UNKNOWN_STRENGTH
        # An unrecognized vendor string falls through to the unknown default.
        assert resolve_strength(None, "default", "adobe") == UNKNOWN_STRENGTH
        assert resolve_strength(None, "adobe") == UNKNOWN_STRENGTH

    def test_default_strength_alias_is_unknown_vendor_value(self):
        assert DEFAULT_STRENGTH == UNKNOWN_STRENGTH
        assert OPENAI_STRENGTH < UNKNOWN_STRENGTH

    def test_none_ctrlregen_uses_clean_noise_default(self):
        # ctrlregen must NOT inherit the SDXL vendor defaults (that makes it a no-op);
        # clean-noise regeneration is the lever against robust marks. Vendor is ignored.
        assert resolve_strength(None, "ctrlregen") == CTRLREGEN_DEFAULT_STRENGTH
        assert resolve_strength(None, "ctrlregen", "openai") == CTRLREGEN_DEFAULT_STRENGTH
        assert CTRLREGEN_DEFAULT_STRENGTH > DEFAULT_STRENGTH

    def test_explicit_value_overrides_profile_and_vendor(self):
        assert resolve_strength(0.3, "default") == 0.3
        assert resolve_strength(0.3, "default", "openai") == 0.3
        assert resolve_strength(0.3, "ctrlregen") == 0.3
    def test_explicit_value_overrides_vendor(self):
        assert resolve_strength(0.3) == 0.3
        assert resolve_strength(0.3, "openai") == 0.3

    def test_explicit_zero_is_respected_not_treated_as_unset(self):
        # 0.0 is falsy but explicit -- must not fall through to the profile default
        # 0.0 is falsy but explicit -- must not fall through to the vendor default
        # (the old `strength or DEFAULT` bug would have). Range validation lives in
        # remove_watermark, not here.
        assert resolve_strength(0.0, "ctrlregen") == 0.0
        assert resolve_strength(0.0, "default", "google") == 0.0
        assert resolve_strength(0.0) == 0.0
        assert resolve_strength(0.0, "google") == 0.0


class TestVendorForStrength:
@@ -1,170 +0,0 @@
"""Unit tests for the text-protection change-map helper (no model download).

``build_change_map`` is the pure cv2/numpy part of ``text_protector``: it turns
detected text polygons into a Differential-Diffusion change map. The polarity is
load-bearing and was verified empirically (white = preserve, black = change), so
a regression here would either freeze the whole image or fail to protect text.
The PP-OCRv3 detector itself needs a model download and is not exercised here.
"""

from __future__ import annotations

import numpy as np

from remove_ai_watermarks.text_protector import (
    _DET_MAX_LONG_SIDE,
    _detection_input_size,
    build_change_map,
    feather_paste,
    merge_text_regions,
)


def _quad(x0, y0, x1, y1):
    """An axis-aligned 4-vertex polygon as the detector returns."""
    return np.array([[x0, y0], [x1, y0], [x1, y1], [x0, y1]], np.int32)


class TestMergeTextRegions:
    def test_empty(self):
        assert merge_text_regions([], 256, 256) == []

    def test_far_apart_boxes_stay_separate(self):
        boxes = [_quad(10, 10, 60, 30), _quad(10, 200, 60, 220)]
        regions = merge_text_regions(boxes, 256, 256, dilate_frac=0.005, pad_frac=0.0)
        assert len(regions) == 2

    def test_close_boxes_merge(self):
        # two boxes on the same line, a few px apart -> one block
        boxes = [_quad(10, 10, 60, 30), _quad(64, 10, 110, 30)]
        # dilate_frac sized to close the few-px inter-word gap on one line
        regions = merge_text_regions(boxes, 256, 256, dilate_frac=0.03)
        assert len(regions) == 1

    def test_rects_in_bounds_and_padded(self):
        boxes = [_quad(100, 100, 150, 130)]
        (x, y, w, h) = merge_text_regions(boxes, 256, 256, pad_frac=0.05)[0]
        assert x >= 0
        assert y >= 0
        assert x + w <= 256
        assert y + h <= 256
        assert w > 50  # padded beyond the raw 50px box

    def test_caps_region_count(self):
        boxes = [_quad(20 * i, 0, 20 * i + 8, 8) for i in range(20)]
        regions = merge_text_regions(boxes, 64, 512, dilate_frac=0.002, pad_frac=0.0, max_regions=5)
        assert len(regions) <= 5


class TestFeatherPaste:
    def test_patch_lands_at_location_center(self):
        base = np.zeros((100, 100, 3), np.uint8)
        patch = np.full((40, 40, 3), 200, np.uint8)
        out = feather_paste(base, patch, 30, 30, feather=6)
        # center of the pasted region is (near) the patch value
        assert out[50, 50, 0] >= 190
        # far corner untouched
        assert out[2, 2, 0] == 0

    def test_does_not_mutate_base(self):
        base = np.zeros((50, 50, 3), np.uint8)
        feather_paste(base, np.full((20, 20, 3), 255, np.uint8), 10, 10)
        assert base.sum() == 0

    def test_shape_preserved(self):
        base = np.zeros((50, 60, 3), np.uint8)
        out = feather_paste(base, np.full((10, 10, 3), 100, np.uint8), 5, 5)
        assert out.shape == base.shape

    def test_partial_out_of_bounds_no_crash(self):
        base = np.zeros((40, 40, 3), np.uint8)
        out = feather_paste(base, np.full((30, 30, 3), 150, np.uint8), 25, 25, feather=4)
        assert out.shape == (40, 40, 3)


class TestDetectionInputSize:
    """Resolution contract for the DB detector input (issue #14 recall fix).

    A fixed small input (the old 736) downscaled large canvases so far that small
    text fell below the detector's resolution and was missed. Detection now runs
    at the native long side, capped and never upscaled.
    """

    def test_large_canvas_not_downscaled_to_old_736(self):
        # The #14 regression: a 2048 canvas must detect well above the old 736
        # so ~12-16 px text survives. Capped at the max long side.
        in_w, in_h = _detection_input_size(2048, 2048)
        assert in_w == _DET_MAX_LONG_SIDE
        assert in_h == _DET_MAX_LONG_SIDE
        assert in_w > 736  # the old fixed input that missed small text

    def test_native_resolution_not_upscaled(self):
        # A 1024 canvas detects at native 1024 (not upscaled to the cap, not
        # downscaled to the old 736).
        assert _detection_input_size(1024, 1024) == (1024, 1024)

    def test_small_image_is_native(self):
        assert _detection_input_size(512, 512) == (512, 512)

    def test_dims_are_multiples_of_32(self):
        for h, w in [(2048, 1024), (1234, 567), (4096, 4096), (1000, 1000)]:
            in_w, in_h = _detection_input_size(h, w)
            assert in_w % 32 == 0
            assert in_h % 32 == 0

    def test_aspect_ratio_preserved_when_capped(self):
        # Portrait 2048x1024: long side capped to the max, short side scaled by
        # the same factor (so the 2:1 aspect is roughly kept).
        in_w, in_h = _detection_input_size(2048, 1024)
        assert in_h == _DET_MAX_LONG_SIDE
        assert abs((in_w / in_h) - 0.5) < 0.05

    def test_floor_at_32(self):
        in_w, in_h = _detection_input_size(10, 5)
        assert in_w >= 32
        assert in_h >= 32


class TestBuildChangeMap:
    def test_no_boxes_is_all_change(self):
        m = build_change_map([], 32, 48)
        assert m.shape == (32, 48)
        assert m.dtype == np.float32
        assert float(m.max()) == 0.0

    def test_text_region_is_preserved_background_is_change(self):
        # A 20x20 box centered in a 64x64 map, no feather for a crisp check.
        box = np.array([[22, 22], [42, 22], [42, 42], [22, 42]])
        m = build_change_map([box], 64, 64, preserve=0.9, feather=0)
        # Inside the polygon: painted to preserve value.
        assert m[32, 32] == np.float32(0.9)
        # Far background: untouched -> full change (0.0).
        assert m[2, 2] == 0.0
        # Polarity: text preserved more than background.
        assert m[32, 32] > m[2, 2]

    def test_preserve_value_is_respected(self):
        box = np.array([[10, 10], [30, 10], [30, 30], [10, 30]])
        m = build_change_map([box], 40, 40, preserve=0.5, feather=0)
        assert m[20, 20] == np.float32(0.5)

    def test_feather_creates_soft_edge_gradient(self):
        box = np.array([[20, 20], [44, 20], [44, 44], [20, 44]])
        m = build_change_map([box], 64, 64, preserve=1.0, feather=15)
        center = m[32, 32]
        # An edge pixel just outside the polygon should be partially blended:
        # strictly between full-change (0) and the preserved center.
        edge = m[32, 47]
        assert 0.0 < edge < center
        assert center <= 1.0

    def test_even_feather_does_not_crash(self):
        box = np.array([[10, 10], [30, 10], [30, 30], [10, 30]])
        m = build_change_map([box], 40, 40, feather=14)
        assert m.shape == (40, 40)

    def test_values_stay_in_unit_range(self):
        box = np.array([[5, 5], [35, 5], [35, 35], [5, 35]])
        m = build_change_map([box], 40, 40, preserve=1.0, feather=9)
        assert float(m.min()) >= 0.0
        assert float(m.max()) <= 1.0
@@ -1,86 +0,0 @@
"""Unit tests for the pure tiling helpers (no GPU/model required).

``tiling.py`` imports torch at module top, so skip cleanly when torch is
absent. The helpers themselves are pure numpy/PIL/math -- they decide how a
large image is split into overlapping tiles and blended back, so a regression
here would seam or crop the CtrlRegen output wrongly.
"""

from __future__ import annotations

import numpy as np
import pytest

pytest.importorskip("torch")

from PIL import Image

from remove_ai_watermarks.noai.ctrlregen.tiling import (
    make_blend_weight,
    resize_center_crop,
    tile_positions,
)


class TestTilePositions:
    def test_image_smaller_than_tile_single_position(self):
        assert tile_positions(500, 512, 64) == [0]

    def test_image_equal_to_tile_single_position(self):
        assert tile_positions(512, 512, 64) == [0]

    def test_first_is_zero_last_is_total_minus_tile(self):
        # The tiles must fully cover the span: first starts at 0, last ends at
        # the far edge (start == total - tile), or the image's edge is missed.
        pos = tile_positions(2000, 512, 64)
        assert pos[0] == 0
        assert pos[-1] == 2000 - 512

    def test_overlap_positions_are_monotonic_and_exact(self):
        assert tile_positions(1000, 512, 64) == [0, 244, 488]

    def test_zero_overlap_tiles_are_contiguous(self):
        # 1024 wide, 512 tile, no overlap -> two tiles butting at 512.
        assert tile_positions(1024, 512, 0) == [0, 512]

    def test_overlap_equal_to_tile_raises(self):
        # overlap == tile makes the stride denominator (tile - overlap) zero;
        # reject up front instead of dividing by zero.
        with pytest.raises(ValueError, match="overlap"):
            tile_positions(2000, 512, 512)

    def test_overlap_greater_than_tile_raises(self):
        with pytest.raises(ValueError, match="overlap"):
            tile_positions(2000, 512, 600)


class TestMakeBlendWeight:
    def test_zero_overlap_is_all_ones(self):
        w = make_blend_weight(8, 8, 0)
        assert w.shape == (8, 8)
        assert w.dtype == np.float64
        assert np.all(w == 1.0)

    def test_overlap_ramps_corners_to_zero_center_to_one(self):
        w = make_blend_weight(16, 16, 4)
        assert w[0, 0] == 0.0  # cosine ramp starts at 0
        assert w[8, 8] == 1.0  # center is unweighted
        assert w.max() == 1.0
        assert w.min() == 0.0

    def test_weight_is_point_symmetric(self):
        # Symmetric ramps on both edges -> mask equals its 180-degree rotation,
        # so opposite tile seams blend identically.
        w = make_blend_weight(16, 16, 4)
        assert np.allclose(w, w[::-1, ::-1])


class TestResizeCenterCrop:
    @pytest.mark.parametrize(("width", "height"), [(400, 800), (800, 400), (300, 300), (1000, 1001)])
    def test_output_is_always_square_of_requested_size(self, width: int, height: int):
        out = resize_center_crop(Image.new("RGB", (width, height)), 256)
        assert out.size == (256, 256)

    def test_default_size_is_512(self):
        out = resize_center_crop(Image.new("RGB", (640, 480)))
        assert out.size == (512, 512)