feat: controlnet pipeline for text/face-structure preservation

Add `--pipeline controlnet` (SDXL base + xinsir canny ControlNet via
StableDiffusionXLControlNetImg2ImgPipeline): the canny edge map conditions the
img2img regeneration so text and face STRUCTURE stay sharp, while the watermark
is still removed by the regeneration (`strength`) -- no original pixels are
copied or frozen, so SynthID does not survive. Oracle-verified clean on OpenAI
with better text/structure fidelity than plain img2img at equal strength.
`--controlnet-scale` tunes structure preservation; fp32 on mps/cpu (fp16-fixed
VAE on cuda/xpu). Shares the img2img runner (live progress + MPS->CPU fallback)
and the fp16-VAE-fix / device-move helpers with the default pipeline.

Remove the superseded subsystems -- ctrlregen (SD1.5 clean-noise),
text-protection (differential / region-hires) and face-protection: they either
destroyed real content or shielded the watermark by re-using original pixels.
controlnet replaces them by regenerating everything under edge conditioning.

Canny preserves face structure but not identity; face IDENTITY is a separate
face-restoration post-pass (CodeFormer/GFPGAN), researched + prototyped but not
yet shipped. An IP-Adapter FaceID attempt was built and removed (footgun: needs
high strength, corrupts faces at removal strength).

Docs: docs/controlnet-removal-pipeline-research.md, scripts/controlnet_sweep.py.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Victor Kuznetsov
2026-06-03 15:55:00 -07:00
parent 175609b60a
commit d90d5d886a
28 changed files with 1239 additions and 3541 deletions
+10 -12
File diff suppressed because one or more lines are too long
+7 -9
@@ -23,7 +23,7 @@ If this tool saves you time, consider [sponsoring its development](https://githu
- **AI metadata stripping** — EXIF, PNG text chunks, C2PA provenance manifests (PNG / JPEG / AVIF / HEIF / JPEG-XL, **MP4 / MOV / M4V / M4A** at the container level, and **WebM / MP3 / WAV / FLAC / OGG** losslessly via ffmpeg), XMP DigitalSourceType
- **"Made with AI" label removal** — removes the AI-disclosure metadata that platforms read to apply automatic labels (useful for clearing a false-positive label from a human-edited photograph)
- **Analog Humanizer** — optional film grain and chromatic aberration post-processing
-- **Smart Face Protection** -- automatic extraction and blending of human faces to prevent AI distortion
+- **Text and face preservation** -- optional `--pipeline controlnet` adds a canny ControlNet that keeps text and face structure sharp through the removal pass (without copying original pixels, so SynthID is still removed). Note: canny preserves face *structure*, not *identity* (the regenerated face drifts in likeness); preserving identity is a separate face-restoration post-pass, researched but not yet shipped
- **Batch processing** — process entire directories
- **Detection** — three-stage NCC watermark detection with confidence scoring
- **Provenance detection (`identify`)** — aggregate C2PA issuer, the C2PA soft-binding forensic-watermark vendor (Adobe TrustMark, Digimarc, Imatag, ...), IPTC "Made with AI" plus the IPTC 2025.1 `AISystemUsed` field, embedded SD/ComfyUI params, EXIF/XMP generator tags, the xAI/Grok EXIF signature, the China TC260 AIGC label (XMP, PNG chunk, or EXIF), the HuggingFace `hf-job-id` job marker, the SynthID metadata proxy, the visible marks (Gemini sparkle plus the Doubao "豆包AI生成" / Jimeng "即梦AI" text marks), the open SD/SDXL/FLUX invisible watermark, and (with the `trustmark` extra) the open Adobe TrustMark watermark into one origin-platform + watermark-inventory verdict (`--json` for machine output)
@@ -117,9 +117,9 @@ image → encode to latent space (VAE) at native resolution
> **Default strength is vendor-adaptive (no flag needed).** The tool reads the C2PA issuer to detect which vendor's SynthID is present and picks the strength that clears it with the least quality loss: **OpenAI gpt-image → `0.10`**, **Google Gemini → `0.15`**, **unknown source → `0.15`**. An oracle-verified June 2026 study (clean pipeline, per-image openai.com/verify or Gemini app) found OpenAI's watermark clears at `0.05` across `1024`-`1600` px (resolution-independent) while Google's is ~3x more robust and needs `0.15`. The dominant factor is the vendor, not resolution. There is no local SynthID detector, so if the oracle still reads SynthID, raise `--strength`; if you care more about preserving fine text, lower it. (Caveat: Google's `0.15` was validated on the capped `--max-resolution 1536` path; a very large native Gemini image may need more.)
>
-> **Text and face protection are OFF by default.** The high-resolution text re-scrub can shield SynthID in text regions, leaving the watermark intact there even after the global pass clears it everywhere else (verified June 2026: same image, with `--protect-text` → SynthID detected; without → SynthID removed). Both features are opt-in with `--protect-text` / `--protect-faces` and considered **experimental**. If you enable them, verify the result with the oracle.
+> **`--pipeline controlnet` preserves text and face structure.** It runs the same SDXL img2img scrub but adds a canny ControlNet that conditions the regeneration on the image's edge map, so text and structure stay sharp at the strengths that remove SynthID. The watermark removal still comes from the img2img regeneration (`--strength`); the ControlNet only preserves structure -- no original pixels are copied or frozen, so SynthID does not survive. `--controlnet-scale` tunes the preservation strength (higher = closer to the original structure). Runs fp32 on mps/cpu (fp16 only on cuda/xpu, where the fp16-fixed SDXL VAE is loaded automatically).
>
-> **`--pipeline ctrlregen` is experimental and not recommended.** On paper CtrlRegen ([ICLR 2025](https://github.com/yepengliu/CtrlRegen)) regenerates from near-clean Gaussian noise to defeat robust watermarks, but in testing on real images it **destroys content** -- smooth and background regions fill with hallucinated micro-text -- and it is heavy (several GB of extra models, minutes per image). It has no usable middle setting (too low removes nothing, high enough to remove wrecks the image), so the shippable path is the default SDXL pipeline at the vendor-adaptive strength. CtrlRegen stays available for experimentation only.
+> **Face identity is not preserved yet.** Canny preserves where a face is, but not who it is -- the regenerated face drifts in likeness. (An IP-Adapter FaceID approach was tried and removed: it needs high denoise strength and corrupts faces at the low strength used for removal.) The validated direction is a separate face-restoration post-pass (CodeFormer/GFPGAN at a low fidelity weight, run after the removal pass -- it re-synthesizes each face from a codebook, so it scrubs the watermark while holding identity) -- researched and prototyped (see `docs/controlnet-removal-pipeline-research.md`) but not yet shipped.
SDXL is the default since May 2026: empirically defeats SynthID v2 on Gemini 3 Pro outputs, where the older SD-1.5 pipeline at 768 px did not. The SD-1.5 path was removed once it was verified not to handle v2. Note the scope: this defeats the SynthID *verifier*, which is not the same as being forensically indistinguishable from a real photo. Recent work ([arXiv:2605.09203](https://arxiv.org/abs/2605.09203)) shows watermark-removal pipelines leave detectable traces, so a separate "this image was processed" classifier can still flag the output.
@@ -127,12 +127,10 @@ SDXL is the default since May 2026: empirically defeats SynthID v2 on Gemini 3 P
> **Technical deep-dive:** see [`docs/synthid.md`](docs/synthid.md) for a primary-source-cited breakdown of how SynthID works mechanically (post-hoc encoder/decoder, 136-bit payload, pixel-space embedding), what it empirically survives (JPEG, crop, resize: ~99.98% TPR at 0.1% FPR from arXiv:2510.09263), what removes it, and the forensic-stealth tradeoff (all known removal attacks are detectable at >98% TPR@1%FPR per arXiv:2605.09203).
-**Face Protection** (experimental, opt-in `--protect-faces`): before diffusion, YOLO detects people in the image and extracts them; after diffusion the original faces are blended back. Off by default -- enable only when face fidelity matters more than SynthID removal completeness.
+**Text and face preservation** (opt-in `--pipeline controlnet`): adds a canny ControlNet so text and face *structure* stay sharp through the removal pass, without copying or freezing any original pixels (so SynthID is still removed). Tune the preservation strength with `--controlnet-scale`. Canny preserves structure but not face *identity* (preserving identity is a future face-restoration post-pass, not yet shipped -- see the callout above).
**Analog Humanizer**: optional film grain and chromatic aberration injection that mimics a photo of a screen, raising the bar for AI-generated image classifiers. (It frustrates generic classifiers but does not guarantee forensic invisibility — see the [arXiv:2605.09203](https://arxiv.org/abs/2605.09203) note above.)
-**Text Protection** (experimental, opt-in `--protect-text`): re-scrubs detected text blocks at high resolution after the global pass to keep small glyphs crisp. **Off by default** because the high-resolution re-scrub can preserve SynthID in text regions even after the global pass removes it elsewhere. Enable only when text fidelity matters more than watermark removal completeness, and verify the oracle result. SDXL pipeline only.
### Stripping C2PA, EXIF, and "Made with AI" metadata
AI tools embed generation metadata that social platforms use to show "Made with AI" labels:
@@ -274,8 +272,8 @@ remove-ai-watermarks invisible image.png -o clean.png --humanize 4.0
# Runs at native resolution by default. On a very large image that OOMs the
# GPU/MPS, cap the long side: --max-resolution 2048
# Strength is vendor-adaptive by default (OpenAI 0.10 / Google 0.15); override
-# with --strength. Text/face protection is opt-in (--protect-text /
-# --protect-faces, experimental: they can shield SynthID).
+# with --strength. To preserve text/face structure, use --pipeline controlnet
+# (SDXL + canny ControlNet); tune preservation with --controlnet-scale. Add
# Check / strip AI metadata (C2PA, EXIF, "Made with AI" labels)
# --check also flags SynthID-bearing sources: a C2PA manifest signed by
@@ -341,7 +339,7 @@ pip install certifi
- [noai-watermark](https://github.com/mertizci/noai-watermark) by mertizci — invisible watermark removal engine
- [GeminiWatermarkTool](https://github.com/allenk/GeminiWatermarkTool) by Allen Kuo (MIT) — visible watermark removal algorithm
-- [CtrlRegen](https://github.com/yepengliu/CtrlRegen) by Liu et al. (ICLR 2025) -- controllable regeneration pipeline
+- [controlnet-canny-sdxl-1.0](https://huggingface.co/xinsir/controlnet-canny-sdxl-1.0) by xinsir -- SDXL canny ControlNet used by the `controlnet` pipeline to preserve text/face structure
- NeuralBleach (MIT) — analog humanizer technique
## Roadmap
@@ -0,0 +1,539 @@
# ControlNet-as-removal-pipeline research: can structure-conditioned regeneration scrub SynthID and keep text?
Date: 2026-06-02. Source: a manual primary-source pass (WebSearch + WebFetch over the
watermark-removal-attack and SDXL-ControlNet literature). Prompted by issue #35
(@newideas99 / Jacob): "as we use SDXL even at low strength that kills small text ... Do you
think ControlNet could be added to preserve and still remove the watermark?" Clarified scope:
Jacob means **replacing the removal pipeline itself** with a ControlNet-conditioned
regeneration (structure held by the control signal), NOT a separate text-protection add-on.
A deep-research workflow run was attempted first (`wf_3244411d-ffd`) and failed at the harness
level (97 agents completed without emitting StructuredOutput; ~4.3 M tokens, no report). This
note is the hand-run replacement.
## The question, precisely
Can a single full-image ControlNet-conditioned diffusion pass **replace** plain SDXL base 1.0
img2img as the watermark remover, so that one structure-guided regeneration removes the
invisible robust pixel watermark (Google SynthID) **everywhere** while keeping fine detail and
small/CJK **text** legible across the whole image? The hard constraint is unchanged from
`text-protection-research.md`: the watermark must be scrubbed everywhere including inside text,
so any path that freezes or composites original text pixels is disqualified.
## Executive summary
The idea is **already academically validated as a watermark remover and is literally what we
already ship** — CtrlRegen (ICLR 2025) is a canny-ControlNet + DINOv2-semantic pipeline that
regenerates from clean noise. But the make-or-break gap is exact: **none of the
watermark-removal papers validate TEXT or fine-detail preservation at all** — CtrlRegen reports
only FID/PSNR/quality-model scores and explicitly contains no text, fine-detail, or
hallucination analysis. Our shipped ctrlregen's empirical failure ("destroys real content,
hallucinates micro-text in smooth regions") is precisely this unstudied failure mode of the
published method, most likely driven by our **512 px tiling** (text occupies too few pixels per
tile to regenerate legibly; edge-free smooth regions get DINOv2-semantic hallucination). The
constructive path is NOT to keep fixing the SD1.5 CtrlRegen, but to port the structure-control
idea onto an **SDXL-native** ControlNet (xinsir tile-sdxl / ControlNet-Union-SDXL) as a control
add to our existing SDXL base 1.0 img2img, run it at **1024+, not 512 tiles**, and empirically
sweep the (denoise strength x conditioning scale x resolution) cube against the SynthID oracle
AND a text-legibility check. The central tension may be fundamental and must be measured: the
conditioning strong enough to keep text legible may suppress regeneration enough to let SynthID
survive; the regeneration strong enough to scrub may deform text regardless of edges.
## Findings (with confidence and sources)
### Finding 1 — confidence: high
**Claim.** "ControlNet as the removal pipeline" is exactly CtrlRegen (ICLR 2025), and our
shipped `ctrlregen` profile is a faithful implementation of it. Its **spatial control is canny
edges** extracted from the watermarked image; its **semantic control is DINOv2-giant** via a
trainable projection + decoupled cross-attention. Clean-noise (full-strength) regeneration
scrubs the watermark from both pixel and latent space while the two control nets hold structure.
**Evidence.** CtrlRegen: spatial control "conditioned on Canny edge images extracted from the
watermarked image," integrated into the U-Net decoder blocks via a ControlNet structure;
semantic control on "DINOv2-giant" embeddings. Removal is strong: TPR@1%FPR driven from 1.00 ->
0.01 (StegaStamp) and 0.99 -> 0.12 (TreeRing). This matches our `ctrlregen/engine.py` exactly
(canny detector + `facebook/dinov2-giant` + spatial ControlNet from `yepengliu/ctrlregen`).
**Sources.** https://arxiv.org/html/2410.05470v1 · https://github.com/yepengliu/CtrlRegen ·
https://openreview.net/forum?id=mDKxlfraAn
### Finding 2 — confidence: high
**Claim.** Regeneration provably removes any bounded-perturbation pixel watermark **given enough
noise** — the operative constraint is the amount of regeneration, which is the same knob that
trades against fidelity.
**Evidence.** Zhao et al., "Invisible Image Watermarks Are Provably Removable Using Generative
AI" (NeurIPS 2024): a noise-then-reconstruct regeneration attack "guarantees the removal of any
invisible watermark" that perturbs the image within a bounded L2 distance. The guarantee is a
function of injected noise magnitude — low noise preserves detail but leaves the watermark; high
noise scrubs but discards original signal. This is the knob ControlNet conditioning is meant to
make survivable (push regeneration high while the control signal holds composition).
**Sources.** https://arxiv.org/abs/2306.01953 · https://github.com/XuandongZhao/WatermarkAttacker
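The "amount of regeneration" knob can be made concrete. A minimal sketch, assuming the img2img convention used by diffusers, where the input is noised to an intermediate timestep and only the last `int(num_inference_steps * strength)` steps of the schedule are denoised. `denoise_steps` is a hypothetical helper for illustration, not a function in this repo, and the 30-step default is an assumption:

```python
def denoise_steps(strength: float, num_inference_steps: int = 30) -> int:
    """Number of denoising steps an img2img pass actually runs.

    Assumes the diffusers img2img convention: strength 1.0 starts from
    (near) pure noise and runs every step; strength 0.0 runs none.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0, 1]")
    return min(int(num_inference_steps * strength), num_inference_steps)


# Low strength leaves most of the original signal (and possibly the watermark);
# high strength discards it.
for s in (0.15, 0.3, 0.5, 1.0):
    print(f"strength={s:<4} -> {denoise_steps(s)} of 30 steps")
```

Under this convention a low strength touches only the tail of the schedule, which is why it preserves detail and, below some threshold, the watermark as well.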
### Finding 3 — confidence: high
**Claim.** The make-or-break gap: **no watermark-removal paper validates text or fine-detail
preservation.** CtrlRegen's "high perceptual quality" is FID/PSNR/quality-model only and
explicitly omits text, fine-detail, and hallucination analysis. So the literature does NOT
support the specific claim Jacob needs (text survives); it is simply unmeasured.
**Evidence.** CtrlRegen reports CLIP-FID, PSNR, Q-Align, LIQE; the fetched analysis confirms
"the paper contains no discussion of text preservation, fine-detail retention, or hallucination
artifacts," and "explicitly avoids discussing failure modes." Pixel metrics like PSNR are
acknowledged not to reflect perception, and text legibility is a different axis than FID.
**Sources.** https://arxiv.org/html/2410.05470v1
### Finding 4 — confidence: medium-high
**Claim.** Resolution is the prime suspect for our shipped ctrlregen's content destruction. We
tile to **512 px** and run full clean-noise per tile; at 512 px text occupies too few pixels per
tile to regenerate legibly, and smooth edge-free regions (no canny signal) are filled by the
DINOv2 semantic prior, which hallucinates texture/micro-text. The paper omits resolution
entirely, so this is an implementation regime it never characterized.
**Evidence.** Our `ctrlregen/engine.py`: `PROCESS_SIZE = 512`, `TILE_SIZE = 512`, full strength
on each tile. This mirrors the `_run_region_hires` insight (text needs MORE pixels under
regeneration so strokes exceed the VAE's ~8 px latent floor), but ctrlregen runs the regeneration
at LOW res, the opposite. CtrlRegen's paper gives no resolution/tiling spec to contradict this.
**Sources.** internal (`src/remove_ai_watermarks/noai/ctrlregen/engine.py`); resolution-omission
confirmed against https://arxiv.org/html/2410.05470v1
### Finding 5 — confidence: high
**Claim.** SDXL-native ControlNets exist, so the removal-pipeline upgrade need NOT be the SD1.5
re-architecture our current ctrlregen is. xinsir `controlnet-tile-sdxl-1.0` and
`controlnet-union-sdxl-1.0` (ControlNet++) run on SDXL base 1.0. The tile model has a `tile_var`
image-variation mode purpose-built to regenerate detail while preserving structure, at
`controlnet_conditioning_scale = 1.0`, optimal 1024 px. This is a drop-in control add to our
existing SDXL img2img.
**Evidence.** xinsir tile-sdxl model card: use cases = deblur/detail-repaint, **image variation
(preserving structure)**, super-resolution; `controlnet_conditioning_scale = 1.0`, ~30 steps,
optimal 1024x1024, works with `madebyollin/sdxl-vae-fp16-fix` (the same VAE our fp16 path
already swaps in). ControlNet-Union-SDXL / ControlNet++ merges 10+ control types (canny, HED,
tile, depth, lineart) into one SDXL model.
**Sources.** https://huggingface.co/xinsir/controlnet-tile-sdxl-1.0 ·
https://huggingface.co/xinsir/controlnet-union-sdxl-1.0 · https://github.com/xinsir6/ControlNetPlus
### Finding 6 — confidence: high
**Claim.** The community tile-ControlNet upscale workflow runs at **LOW denoise (0.3-0.4)**,
the wrong regime for watermark removal. It preserves detail precisely by regenerating little, so
a naive tile-upscale preserves text AND preserves the watermark. The open empirical question is
whether at `conditioning_scale ~1.0` you can push denoise high enough to scrub SynthID while the
tile conditioning still holds text — the exact cell to test.
**Evidence.** Stable-Diffusion-Art ControlNet-tile upscale: denoise "typically 0.3, max ~0.4 to
avoid artifacts"; some users push 0.6 with ControlNet strength 0.5. Our own data: SynthID
survives below the removal-strength threshold (current Gemini needs notably higher denoise than
the tile-upscale regime). So the detail-preserving regime and the watermark-scrubbing regime are
on opposite ends of the denoise axis; ControlNet conditioning is the bet that they can meet.
**Sources.** https://stable-diffusion-art.com/controlnet-upscale/ ·
internal (`docs/synthid.md` strength data)
### Finding 7 — confidence: high
**Claim.** Forensic-stealth caveat: diffusion-based regeneration is among the MOST detectable
removal families. Even a ControlNet-regeneration that fools the SynthID oracle leaves forensic
traces flagging the output as "removal-processed" at >98% TPR@1%FPR. This bounds the claim (do
not over-promise "indistinguishable from an original") but does not block the use case — the
SynthID oracle still reads negative.
**Evidence.** "Removing the Watermark Is Not Enough: Forensic Stealth in Generative-AI Watermark
Removal" (arXiv:2605.09203, Goonatilake & Ateniese, GMU): across six removal attacks including
diffusion-based regeneration, independent forensic detectors separate removal-processed from
clean content at >98% TPR under a 1% FPR budget.
**Sources.** https://arxiv.org/html/2605.09203v1
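The `TPR@1%FPR` figure quoted here and in Finding 1 fixes the detector threshold so that at most 1% of clean images are flagged, then reports the fraction of processed images that are caught. A minimal sketch of that metric on made-up scores (the helper name and the toy data are assumptions for illustration, not taken from either paper):

```python
import math


def tpr_at_fpr(pos_scores, neg_scores, max_fpr=0.01):
    """True-positive rate at the loosest threshold whose FPR stays <= max_fpr.

    A sample is flagged when its score is strictly above the threshold.
    """
    neg = sorted(neg_scores, reverse=True)
    # How many negatives we are allowed to flag by mistake.
    allowed = math.floor(max_fpr * len(neg) + 1e-9)
    # Taking the (allowed + 1)-th highest negative score as the threshold
    # guarantees at most `allowed` negatives sit strictly above it.
    threshold = neg[allowed] if allowed < len(neg) else -math.inf
    flagged = sum(score > threshold for score in pos_scores)
    return flagged / len(pos_scores)


# Toy data only: 100 "clean" scores and four "removal-processed" scores.
clean = [i / 100 for i in range(100)]
processed = [0.5, 0.995, 1.2, 2.0]
print(tpr_at_fpr(processed, clean))
```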
### Finding 8 — confidence: low (watch, do not build on yet)
**Claim.** Partial/semantic-guided regeneration is an active sub-direction that explicitly targets
the removal-vs-fidelity tradeoff, but the specific fidelity-on-text claims were not verifiable
from the source in this pass.
**Evidence.** "Removing Watermarks with Partial Regeneration using Semantic Information"
(arXiv:2505.08234) proposes focusing regeneration on watermarked regions with semantic (VLM)
conditioning to preserve untouched areas; the PDF body did not render cleanly enough to confirm
its quantitative text/detail results. Treat as a pointer, not evidence.
**Sources.** https://arxiv.org/pdf/2505.08234
## Recommendation / decision
**ControlNet-as-removal-pipeline is worth prototyping — but not by fixing the SD1.5 ctrlregen.**
Port the structure-control idea onto an SDXL-native ControlNet as a control add to the existing
SDXL base 1.0 img2img, run it at full resolution (1024+, NOT 512 tiles), and treat the
text-vs-scrub tension as an empirical question to measure, not assume.
**Prototype (runs locally on 32 GB MPS — no dedicated GPU required):**
Compute is NOT the bottleneck. On a 32 GB Apple-silicon machine (M5 here) native SDXL already
runs entirely on MPS with no CPU fallback (~155 s at 1122x1402, verified — see `synthid.md` /
CLAUDE.md). The prototype runs at **1024** (fewer pixels than that) with SDXL base + an SDXL
ControlNet + activations in **fp32** (MPS fp16 decodes to all-black NaN — issue #29 — confirmed
on run 1 below; fp32 is the required default on mps/cpu) — fits the 32 GB budget with vae-tiling +
attention-slicing; ~1-2 min/image, so a coarse sweep is a sub-hour background run. A dedicated GPU
is needed ONLY for the separate native-large-Gemini (2816 px) case, which OOMs even without a
ControlNet (that stays a raiw.cc
GPU task). The genuine external dependency is NOT compute but the **manual SynthID oracle**:
there is no local SynthID detector, so removal is verified by hand in the Gemini app
("Verify with SynthID") per image, regardless of where the diffusion runs.
Runner: **`scripts/controlnet_sweep.py`** (built 2026-06-02) implements exactly this sweep —
SDXL base 1.0 + an SDXL-native ControlNet img2img, one output per (control x strength x scale)
cell, plus a `sweep_index.csv` with empty `synthid_oracle` / `text_legible` columns to fill by
hand. It uses the dedicated single-type xinsir models (`controlnet-canny-sdxl-1.0`,
`controlnet-tile-sdxl-1.0`) rather than the Union model to keep the diffusers API path robust.
    uv run python scripts/controlnet_sweep.py watermarked.png -o sweep_out
1. SDXL base 1.0 img2img + `xinsir/controlnet-canny-sdxl-1.0` / `controlnet-tile-sdxl-1.0`
   (sweep both `tile` and `canny` control), full image at 1024, `sdxl-vae-fp16-fix`.
2. Sweep the cube on fresh Gemini + gpt-image inputs that contain small/CJK text:
   - denoise strength {0.15, 0.3, 0.5, 0.7, 1.0}
   - `controlnet_conditioning_scale` {0.5, 0.8, 1.0}
   - control type {tile, canny}
3. Per cell, measure BOTH axes:
   - **removal**: Gemini app "Verify with SynthID" oracle (the only valid SynthID oracle; for
     gpt-image also openai.com/verify for provenance) -- must read clean.
   - **text**: OCR round-trip / visual legibility of the small text.
   - secondary: SSIM/FID vs original for global fidelity.
4. Find the Pareto cell where the oracle is clean AND text stays legible.
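The cube in step 2 can be enumerated mechanically. A sketch of how the cells and the hand-filled index could be laid out, under stated assumptions: the output file naming, the column order, and both helper names are hypothetical, and the real `scripts/controlnet_sweep.py` may differ.

```python
import csv
import itertools

CONTROLS = ("tile", "canny")
STRENGTHS = (0.15, 0.3, 0.5, 0.7, 1.0)
SCALES = (0.5, 0.8, 1.0)


def build_cells():
    """One row per (control x strength x scale) cell of the sweep cube."""
    return [
        {
            "control": control,
            "strength": strength,
            "scale": scale,
            # Hypothetical naming scheme for the per-cell output image.
            "output": f"{control}_s{strength:.2f}_c{scale:.2f}.png",
            # Left empty on purpose: both verdicts are filled in by hand.
            "synthid_oracle": "",
            "text_legible": "",
        }
        for control, strength, scale in itertools.product(CONTROLS, STRENGTHS, SCALES)
    ]


def write_index(cells, path):
    """Write the cells to a CSV index with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=list(cells[0]))
        writer.writeheader()
        writer.writerows(cells)


cells = build_cells()
print(len(cells), "cells, first:", cells[0]["output"])
```

Keeping the two verdict columns empty in the generated file makes it obvious which cells still need a manual oracle check.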
**The honest fork the prototype resolves:**
- If such a cell exists -> the answer to Jacob is YES, ship an SDXL-native ControlNet removal
profile (replacing the SD1.5 ctrlregen) tuned to that cell.
- If no cell clears both (the tension is fundamental: scrub-strength always deforms text, or
text-preserving conditioning always spares the watermark) -> the canny/tile-ControlNet middle
path is dead for text, and the standing answer reverts to `text-protection-research.md`: a full
**glyph-conditioned re-render** (EasyText / TextSR on a FLUX-DiT base) is required, which is a
base-model migration, not a control add.
**Do not:** keep tuning the 512 px SD1.5 ctrlregen for text (wrong resolution, wrong base model);
run tile-ControlNet at the community 0.3-0.4 upscale denoise and expect watermark removal (that
regime preserves the watermark); over-claim forensic invisibility (Finding 7).
## Prototype run 1 — 2026-06-02 (text axis measured; watermark axis pending the oracle)
First sweep on a real, SynthID-positive, text-dense input: the corpus tokyo-street-night
gpt-image (`88e61a38-chatgpt_tokyo.png`, 1023x1537 -> 680x1024, dense small CJK + Latin neon
signage; SynthID + C2PA confirmed, so its valid oracle is openai.com/verify). Grid: control
{canny, tile} x strength {0.3, 0.5, 0.7, 1.0} x `conditioning_scale` 1.0, fp32 on MPS. Outputs +
`sweep_index.csv` (text verdicts filled by visual inspection; `synthid_oracle` left for the
manual run) are under `/tmp/cnsweep/` (not committed — derived regenerations of corpus content).
**Measured — PSNR vs input (proxy for how much was regenerated):**
- canny: 0.3 -> 16.91, 0.5 -> 15.91, 0.7 -> 14.82, 1.0 -> 13.22 (monotonic drop = progressively
more regeneration as strength rises; canny only pins edges, so flat regions change).
- tile: 0.3 -> 17.89, 0.5 -> 17.84, 0.7 -> 17.83, 1.0 -> 17.74 (**flat and high — near-identity
even at strength 1.0**; tile@scale1.0 pins the whole image to the input and barely regenerates).
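For reference, the PSNR proxy is the standard peak signal-to-noise ratio between the input and the regenerated output. A dependency-free sketch over flat sequences of 8-bit pixel values (an illustration only; the sweep script presumably works on image arrays, and `psnr` here is a hypothetical helper):

```python
import math


def psnr(a, b, max_value=255.0):
    """PSNR in dB between two equal-length sequences of pixel values."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    if mse == 0:
        return math.inf  # identical inputs: nothing was regenerated
    return 10.0 * math.log10(max_value ** 2 / mse)


original = [10, 200, 35, 90]
regenerated = [12, 190, 40, 90]
print(f"{psnr(original, regenerated):.2f} dB")
```

A lower value means the output moved further from the input, which is why a PSNR that falls as strength rises indicates progressively more regeneration.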
**Measured — text legibility (visual, focused on SMALL text; large high-contrast glyphs survive
everything because canny/tile hold their edges):**
- canny: legible at 0.3, softening at 0.5 (partial), garbling at 0.7, hallucinated pseudo-glyphs
at 1.0 ("NEC" -> "NWENES"). Same plain-img2img small-text deformation, only big text protected.
- tile: near-identity through 0.7, only tiny alterations at 1.0 — small text preserved throughout.
**Reading (the make-or-break tension, now visible in the data):**
- **tile@scale1.0 does not actually regenerate** (flat PSNR), so it preserves all text but almost
certainly leaves the watermark intact — it is a near-identity pass, exactly the community
"tile-upscale preserves detail by not regenerating" regime (Finding 6), confirmed.
- **canny@scale1.0 regenerates progressively** (PSNR drops) and so could scrub — but small text
breaks at exactly the strength where scrubbing would start to bite. canny saves big edges, not
sub-stroke small text.
- Net on the text axis: neither cell at scale 1.0 cleanly gives "high regeneration + legible small
text." This is the literature prior (Findings 3, 6) reproduced empirically. Lowering
`conditioning_scale` to force small-text regeneration is the same tradeoff knob, not an escape.
**Still pending (the decisive half, cannot be done locally):** run the 8 cells through the SynthID
oracle and fill `synthid_oracle`. The most informative cells: canny 1.0 (text dead — does it at
least scrub? if not, the canny path is dead outright), canny 0.5 (text partial — does it scrub?),
tile 1.0 (text perfect — predicted to still read present). If no cell is `oracle=clean` AND
`text=yes`, the fork resolves to the glyph-re-render path (`text-protection-research.md`).
**Incidental bug caught:** the first run used fp16 on MPS (the script's original default) and
produced **all-black** outputs across every cell (2 KB PNGs, PSNR 9.22 flat) — the issue #29
fp16-VAE-NaN failure, and the fp16-fix VAE did not save it on MPS. Fixed `scripts/controlnet_sweep.py`
to default fp32 on mps/cpu (fp16 only on cuda/xpu), matching the production pipeline.
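The device rule from this fix is small enough to state as code. A sketch with a hypothetical helper (`pick_dtype` is not the script's actual function); it returns the dtype name rather than a torch dtype so the example runs without torch installed:

```python
def pick_dtype(device: str) -> str:
    """Name of the dtype to load the pipeline with on a given device.

    fp16 only where it is known to be safe (cuda/xpu); fp32 on mps/cpu,
    where fp16 decoded to all-black NaN output in run 1.
    """
    kind = device.split(":", 1)[0].lower()
    return "float16" if kind in {"cuda", "xpu"} else "float32"


for dev in ("mps", "cpu", "cuda:0", "xpu"):
    print(dev, "->", pick_dtype(dev))
```

In real code the returned name could be resolved with `getattr(torch, name)` before loading the pipeline.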
## Tuning ControlNet for text preservation across image types (research 2026-06-03)
Goal: how to configure the canny-ControlNet path to best preserve text (and faces) on diverse
images. Primary sources: diffusers ControlNet doc, the ControlNet paper (arXiv:2302.05543),
xinsir model cards, practitioner guides. The **critical reframe**: almost all community ControlNet
advice optimizes a txt2img *generation* tradeoff (control vs creative freedom). OUR context is
img2img *watermark removal*, where the objective is the opposite -- maximum faithful preservation
while regenerating just enough to scrub. So several common recommendations INVERT here.
**Removal is `strength`; everything below is preservation and does not change removal efficacy**
(only the watermark-shielding risk -- see the caveat). Set `strength` by the oracle/vendor need;
tune these to keep text/faces intact at that strength.
Knobs, ranked by impact for text:
1. **Canny edge density (the per-image lever, currently hardcoded `_CANNY_LOW=100`/`_CANNY_HIGH=200`).**
Lower thresholds capture more/finer edges; higher thresholds keep only major outlines (diffusers
doc + practitioner guides; ControlNet paper uses 100/200 as the default). Small-text strokes and
fine facial features fall below the default 100/200 and are missed. **For dense small text
(infographics, signage) lower the thresholds (~50/120, even 30/100 for facial likeness per
practitioner tests); for high-contrast large text 100/200 already suffices.** Denser canny is
still a BINARY thresholded edge map, so it does not carry the low-amplitude SynthID pixel pattern
-- it passes more shape, not the watermark (still oracle-verify). This is the single highest-value
unexplored lever and should become a CLI knob.
2. **`controlnet_conditioning_scale` -> keep at 1.0 (max structure hold).** Community defaults to 0.5
for creative balance; we want maximum preservation, so 1.0 (xinsir canny/tile cards also recommend
1.0). We measured text on a clean high-contrast image surviving across strength 0.1-0.5 at scale
1.0 (PSNR ~26 flat), so scale 1.0 is the right default; only lower it if a specific image needs
more regeneration to scrub (lowering it trades shielding risk for text-deformation risk).
3. **`control_guidance_start=0.0`, `control_guidance_end=1.0` (full window) -- KEEP, do not shorten.**
The common "end=0.5: establish structure early then let the model render detail freely" is a
creative-generation recipe; for text it is HARMFUL -- the late free steps re-render and deform the
glyphs. We want the edge control active through ALL denoise steps so text stays pinned. (Our
pipeline already uses the 0->1 default; the point is to NOT adopt the shorten-the-window advice.)
4. **Control type, per image type:**
   - **Text / graphics / high-contrast -> canny** (the literature's reliable choice for defined edges
     and text; what we ship).
   - **Faces / smooth tonal content -> soft-edge / HED is a candidate worth testing.** Canny's hard
     binary threshold fractures smooth skin gradients; HED/soft-edge gives gradual edges that may hold
     faces better. UNVERIFIED for removal (softer edges may carry slightly more original signal ->
     oracle-check). A face-heavy image is the test (gemini group photos).
   - **tile -> NOT for removal.** It is near-identity (detail-enhancement at low denoise); it shields
     the watermark (measured flat PSNR ~17.8 across strength on the tokyo sweep). Do not use it as the
     removal control.
5. **Resolution** -- higher long-side = strokes span more VAE latent cells = less softening, while
still fully regenerating. Already a knob (`--max-resolution`); for tiny text prefer native/large.
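The resolution lever in item 5 can be put in numbers. A rough sketch, assuming the SDXL VAE's 8x spatial downscale (the "~8 px latent floor" noted in Finding 4); the stroke width, the native size, and the helper name are hypothetical:

```python
VAE_DOWNSCALE = 8  # assumption: one latent cell per 8x8 pixel block (SDXL VAE)


def stroke_in_latent_cells(stroke_px, native_long_side, processed_long_side):
    """Width of a glyph stroke, in latent cells, after resizing for processing."""
    scale = processed_long_side / native_long_side
    return stroke_px * scale / VAE_DOWNSCALE


# A hypothetical 6 px stroke in a 2048 px image, processed at several long sides.
for long_side in (1024, 1536, 2048):
    cells = stroke_in_latent_cells(6, 2048, long_side)
    print(f"long side {long_side}: stroke spans {cells:.2f} latent cells")
```

Strokes that span well under one latent cell are the ones most likely to soften or deform, which is the argument for preferring native or large resolution on tiny text.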
**Multi-ControlNet (canny + soft-edge), list scales e.g. `[1.0, 0.8]`** (diffusers MultiControlNet):
could hold text edges AND face geometry at once, but doubles ControlNet memory/latency and raises the
shielding risk; defer to a v2 after the single-canny path is dialed in.
**Image-type playbook (proposed, to validate with the oracle):**
- Clean high-contrast text (openai_1-style): canny 100/200, scale 1.0, full window -- already optimal.
- Dense small text / infographics (big_pic3, neon signage): canny **lower thresholds (~50/120)**,
scale 1.0, full window, larger resolution.
- Faces / portraits: try **soft-edge/HED** control, scale 1.0; or multi-ControlNet canny+softedge.
**Hard caveat:** every change that increases preservation (higher scale, denser canny, fuller window,
softer edges) marginally REDUCES effective regeneration and so raises the chance the watermark
survives -- exactly the shielding failure mode. There is no local SynthID detector, so each tuning
change must be re-confirmed on the oracle. These are img2img-context recommendations derived from
generation-context sources plus our own measurements; treat the playbook as hypotheses to verify, not
settled defaults.
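One way to keep the proposed playbook honest is to hold it as data, so each row can be changed once the oracle has spoken. The sketch below is illustrative only: `ControlPreset`, `PLAYBOOK` and `preset_for` are hypothetical names, the values are copied from the bullets above, and the full guidance window for faces is an assumption (the playbook does not state it).

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ControlPreset:
    control: str                                # ControlNet type
    canny_thresholds: Optional[Tuple[int, int]]  # None when the control is not canny
    scale: float                                # controlnet_conditioning_scale
    guidance_window: Tuple[float, float]        # (control_guidance_start, control_guidance_end)


# Hypotheses to verify on the oracle, not settled defaults.
PLAYBOOK = {
    "clean_text": ControlPreset("canny", (100, 200), 1.0, (0.0, 1.0)),
    "dense_text": ControlPreset("canny", (50, 120), 1.0, (0.0, 1.0)),
    "faces": ControlPreset("softedge", None, 1.0, (0.0, 1.0)),  # window assumed
}


def preset_for(image_type: str) -> ControlPreset:
    """Return the proposed preset, falling back to the shipped canny 100/200 setup."""
    return PLAYBOOK.get(image_type, PLAYBOOK["clean_text"])
```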
**Sources.** https://huggingface.co/docs/diffusers/en/using-diffusers/controlnet ·
https://arxiv.org/pdf/2302.05543 · https://huggingface.co/xinsir/controlnet-canny-sdxl-1.0 ·
https://huggingface.co/xinsir/controlnet-tile-sdxl-1.0 · https://blog.cephalon.ai/canny-and-softedge/
## FaceID research: identity-preserving face conditioning (research 2026-06-03)
Motivation: canny alone preserves face STRUCTURE/position better than plain SDXL but does NOT hold
IDENTITY -- verified on a real Gemini group photo (gemini_3, s015): faces drift in expression and
likeness (the smile/mouth and eyes change), they are "a similar person," not the same one. Canny
carries edges, not identity, so the regenerated face is identity-drifted. To hold identity WITHOUT
copying original pixels (the hard constraint -- copied pixels carry SynthID), the conditioning must
be an identity EMBEDDING, not pixels. Primary sources: diffusers IP-Adapter doc, InstantID
(arXiv:2401.07519), IP-Adapter (arXiv:2308.06721), practitioner comparisons.
### Findings
**1. IP-Adapter FaceID conditions on an ArcFace identity VECTOR, not pixels (confidence: high).**
FaceID extracts `insightface` ArcFace `normed_embedding` (a 512-d identity vector) via
`FaceAnalysis`, and passes it as `ip_adapter_image_embeds` -- NOT a CLIP image embedding, NOT the
original pixels. So it is constraint-compatible: the watermark (a pixel-amplitude pattern) is not in
the identity vector, and the img2img still regenerates the pixels (removal via `strength` unchanged).
It loads on any SDXL via `load_ip_adapter` (~100 MB), is fast/low-VRAM, but identity fidelity on SDXL
is ~5-10% lower than the SD1.5 line / dedicated methods.
**2. Multiple distinct faces ARE handled, via regional attention masks (confidence: high -- THE key
unlock).** This is the make-or-break for group photos (our hardest case). diffusers supports a LIST
of IP-Adapter face images each with its own binary region mask: `IPAdapterMaskProcessor` builds the
masks, `set_ip_adapter_scale([[s1, s2, ...]])`, and `cross_attention_kwargs={"ip_adapter_masks":
masks}`. So you detect each face, extract its own ArcFace embedding, assign it a region mask, and one
pass preserves N different identities simultaneously. (InstantID, by contrast, is single-subject --
it averages embeddings for multiple refs, which is wrong for distinct people -- so for group photos
**IP-Adapter FaceID + masks beats InstantID**.)
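A minimal sketch of the per-face region masks from Finding 2, in plain Python so it runs without any ML dependency. In the real pipeline the masks would go through diffusers' `IPAdapterMaskProcessor`; here `box_masks` and `overlap_stats` are hypothetical helpers, and the boxes are made-up coordinates. The stats function illustrates the kind of overlap sanity check one might run before blaming the masks for a bad result.

```python
def box_masks(boxes, width, height):
    """One binary mask (rows of 0/1) per face box given as (x0, y0, x1, y1), x1/y1 exclusive."""
    return [
        [[1 if x0 <= x < x1 and y0 <= y < y1 else 0 for x in range(width)]
         for y in range(height)]
        for (x0, y0, x1, y1) in boxes
    ]


def overlap_stats(masks):
    """Return (max overlap depth, covered fraction, double-covered fraction)."""
    height, width = len(masks[0]), len(masks[0][0])
    depth = [sum(m[y][x] for m in masks) for y in range(height) for x in range(width)]
    total = width * height
    covered = sum(1 for d in depth if d >= 1)
    doubled = sum(1 for d in depth if d >= 2)
    return max(depth), covered / total, doubled / total


if __name__ == "__main__":
    # Two hypothetical face boxes on a 10x10 canvas that share a single pixel.
    masks = box_masks([(0, 0, 4, 4), (3, 3, 7, 7)], width=10, height=10)
    print(overlap_stats(masks))
```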
**3. IP-Adapter + ControlNet + img2img compose (confidence: high).** The doc shows IP-Adapter +
ControlNet (depth) in one pipeline and IP-Adapter + img2img (`strength`). Our target stack is the
union: `StableDiffusionXLControlNetImg2ImgPipeline` (canny = structure) + `load_ip_adapter` (FaceID =
identity) + `strength` (removal). `set_ip_adapter_scale` (1.0 = image-only, 0.5 = balanced) is the
identity-hold knob. API friction to verify in implementation: that `ip_adapter_masks` via
`cross_attention_kwargs` works on the *ControlNet img2img* pipeline (the masking is an attention-
processor feature, so it should be pipeline-agnostic, but confirm).
**4. InstantID / PuLID positioning (confidence: medium).** InstantID does not train the UNet so it
composes with canny/depth ControlNets, and gives better single-face fidelity than FaceID -- but it is
single-subject (needs its own landmark ControlNet + dedicated weights). PuLID has the best identity
fidelity but is heaviest and Flux-leaning. For our multi-face, constraint-bound, SDXL-canny case,
IP-Adapter FaceID + masks is the right first build; InstantID/PuLID are single-portrait upgrades.
### Architecture (proposed)
```
detect faces (insightface) -> per face: ArcFace embed + region mask
one img2img pass:
    image=init, control_image=canny(init),                     # structure (existing)
    ip_adapter_image_embeds=[face_embeds],                     # identity per face
    cross_attention_kwargs={"ip_adapter_masks": face_masks},   # each face -> its region
    controlnet_conditioning_scale=1.0, set_ip_adapter_scale(~0.6),
    strength=vendor-adaptive                                   # removal (unchanged)
```
Pixels are regenerated (SynthID removed by `strength`), structure held by canny, each face's identity
held by its masked ArcFace vector -- no original pixel copied.
### Risks / honest costs
- **Shielding risk (same wall):** FaceID conditioning, like canny, reduces effective regeneration ->
higher `set_ip_adapter_scale` raises the chance SynthID survives in the face region (echo of why the
old region-hires failed). MUST oracle-verify removal at the chosen FaceID scale; keep `strength` at
the vendor threshold.
- **New heavy dependency:** `insightface` + `onnxruntime` + the `buffalo_l` model (~300 MB, downloaded
on first use). Detection + embedding is CPU/ONNX, separate from the diffusion.
- **Detection floor:** insightface needs faces large enough (det_size ~640); tiny faces in a dense
group may not be detected -> not preserved (falls back to canny-only for those).
- **Identity ceiling:** SDXL FaceID is ~5-10% off true identity -- a meaningful boost over canny-only
drift, NOT a perfect face swap. Set expectations; PuLID/InstantID are the higher-fidelity (heavier)
paths if needed.
- **Value scales with strength:** at low strength (OpenAI 0.10) faces barely drift, so FaceID is
marginal; at the higher strength a hard vendor (Google 0.30) needs, FaceID earns its keep.
### Build plan (staged)
- v1: optional `--face-id` flag on `--pipeline controlnet`. Detect faces; if any, run the masked
FaceID pass (works for 1 or N faces -- masks generalize). If none detected, fall through to plain
canny. Oracle-verify SynthID removal is preserved at the default FaceID scale on a face image.
- v2 (if identity still short): InstantID for single-portrait, or PuLID, as a higher-fidelity opt-in.
**Sources.** https://huggingface.co/docs/diffusers/main/en/using-diffusers/ip_adapter ·
https://huggingface.co/h94/IP-Adapter-FaceID · https://arxiv.org/pdf/2401.07519 (InstantID) ·
https://instantid.github.io/ · https://arxiv.org/abs/2308.06721 (IP-Adapter)
### FaceID prototype run 1 -- 2026-06-03 (NEGATIVE on dense small-face groups)
Built and shipped the masked multi-face FaceID layer (`--face-id`, `face_id.py`, `faceid` extra).
First real run on the gemini_3 group photo (Google, s015, scale 0.6, native 2816 via cap 1536):
insightface detected **17 faces**, the masked multi-face pass composed and ran end-to-end (non-black
output), so the API is correct. **At s015 the result is a clear FAILURE: every face corrupted --
melted/discolored/psychedelic, materially WORSE than canny-only.**
**ROOT CAUSE FOUND (confirmed by ablation, not speculation) -- it is STRENGTH, not scale/masks/faces.**
Investigated the real data: masks are fine (max overlap depth 2, 33% coverage, only 0.2% of pixels
double-covered -- NOT an overlap problem), embeddings are fine (`normed_embedding` norm 1.000), the
FaceID LoRA is not required for SDXL (h94 model card), and faces span 34-181 px (7 medium + 10 tiny).
None of those is the cause. The decisive test: the SAME image + FaceID at **strength 0.5** produces
**clean, coherent faces across the whole group** (no psychedelic artifacts). So FaceID needs
substantial regeneration: the h94 usage is full generation (txt2img, 30 steps); at our removal
strength (0.10-0.15 = ~7 effective steps) the strong identity cross-attention cannot reconcile with
a latent that is ~85% the untouched original, so it smears identity-colored noise onto the faces.
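The step arithmetic behind this can be sketched as below. It assumes the usual diffusers img2img truncation, where only the last `int(num_inference_steps * strength)` scheduler steps are run. The 50-step schedule is an assumption (the step count is not stated here); under it, strength 0.15 gives the ~7 effective steps quoted above.

```python
def effective_steps(num_inference_steps: int, strength: float) -> int:
    """Denoising steps an img2img pass actually runs (diffusers-style truncation, assumed)."""
    return min(int(num_inference_steps * strength), num_inference_steps)


if __name__ == "__main__":
    # Assumed 50 scheduled steps; compare removal strengths with the 0.5 ablation.
    for strength in (0.10, 0.15, 0.50):
        print(strength, effective_steps(50, strength))
```

With so few steps at removal strength, the identity cross-attention has very little room to act, which is consistent with the ablation result.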
**This is a FUNDAMENTAL tension, not a tuning bug:** watermark removal wants LOW strength (minimal
degradation, just enough to scrub), FaceID wants HIGH strength (regenerate the face to impose
identity). They are opposed. At strength 0.5 FaceID works AND removes the watermark, but the whole
image regenerates much more (canny still holds text/edge structure, but texture/detail drifts well
beyond the 0.15 "minimal degradation" target). So `--face-id` is a HIGH-STRENGTH option: it trades
whole-image fidelity for face identity, and is a footgun at the low default strength (guaranteed
garbage). Required follow-up code guard: when `--face-id` is set, floor `strength` at ~0.5 (or refuse
+ warn) -- never run FaceID at the vendor-adaptive removal strength. Open question: whether
high-strength FaceID's whole-image drift is acceptable for face-centric images, or whether identity
preservation at LOW strength needs a different mechanism entirely (FaceID structurally cannot do it).

(Infra lesson: the `faceid` extra must stay numpy<2.0 -- pin `onnx<1.18` + `scipy<1.18`; pinning numpy
UP, as the first build did, leaves a numpy-1.26 env with a numpy-2-only scipy that crashes the
diffusers import via `np.long`.)
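The follow-up guard called for above (floor `strength` at ~0.5, or refuse and warn, whenever `--face-id` is set) could look roughly like this. It is a sketch, not the shipped code: `resolve_strength`, `FACE_ID_MIN_STRENGTH` and the `refuse` switch are hypothetical names.

```python
import warnings

FACE_ID_MIN_STRENGTH = 0.5  # strength at which the ablation produced clean faces


def resolve_strength(requested: float, face_id: bool, refuse: bool = False) -> float:
    """Return the strength to run with, never letting FaceID run below its floor."""
    if not face_id or requested >= FACE_ID_MIN_STRENGTH:
        return requested
    message = (
        f"--face-id needs strength >= {FACE_ID_MIN_STRENGTH}; "
        f"requested {requested} would corrupt faces"
    )
    if refuse:
        raise ValueError(message)
    warnings.warn(message + f"; raising to {FACE_ID_MIN_STRENGTH}", stacklevel=2)
    return FACE_ID_MIN_STRENGTH
```

Flooring keeps the run usable but silently changes whole-image fidelity, so the refuse path may be the safer default; that choice is left open here.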
## Face preservation, done properly (research 2026-06-03, after the FaceID failure)
The FaceID run failed and I wrongly concluded "faces can't be preserved." Re-research corrected the
understanding. The hard constraint is unchanged: to remove the watermark FROM a face the face MUST
be regenerated (freezing it leaves SynthID), so the goal is identity-preserving REGENERATION of the
face, at minimal overall image degradation. Three things I got wrong and the corrected picture:
**What I got wrong:** (1) I applied FaceID at GLOBAL high strength -- the literature is clear the
architecture must be REGION-ADAPTIVE (face region handled separately, background stays low-strength);
(2) I used IP-Adapter FaceID, the WEAKEST identity tool -- InstantID hits 82-86% face-recognition
similarity, well above FaceID's identity hold (both condition on an ArcFace embedding, see Finding 1
above, so the gap is not the encoder); (3) I missed the entire
face-restoration class (CodeFormer / GFPGAN), which is purpose-built for "regenerate a face, keep
identity."
**The most promising mechanism -- CodeFormer face-restoration post-pass (confidence: high on the
mechanism, unverified on our watermark).** CodeFormer is a VQ-VAE: a frozen discrete CODEBOOK of HQ
facial priors + a Transformer that predicts code *tokens* from the input, and a frozen decoder that
regenerates the face FROM THE CODEBOOK ENTRIES -- "does not depend on feature fusion with low-quality
cues." So the output face pixels come from a finite learned codebook, NOT from the input pixels:
**the SynthID pixel-amplitude pattern physically cannot survive a codebook re-synthesis** -- a
stronger scrub than low-strength img2img (which keeps ~85% of the latent). Fidelity knob `w` in
[0,1]: higher w preserves identity but fuses MORE low-quality (input) cues (more watermark risk),
lower w leans on the codebook (cleaner scrub, identity drift) -- the same scrub-vs-fidelity tension,
settled per-image by the oracle; there is likely a `w` that holds identity AND clears the oracle.
**Constraint-compatible architecture:** run the normal canny low-strength controlnet removal globally
(minimal degradation everywhere), then detect+align each face, run CodeFormer on the **ORIGINAL** face
crop (to capture true identity AND re-synthesize from the codebook = scrub), and composite the
CodeFormer output (codebook-generated, not original pixels -> no copy, no watermark) into the cleaned
image. Decouples whole-image minimal-degradation from face identity -- no high GLOBAL strength needed.
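The compositing step can be illustrated with a small feathered blend. This is a sketch on plain grayscale grids (lists of rows), assuming a linear alpha ramp; `feather_alpha` and `composite` are hypothetical helpers, not the prototype's code. Note that `base` stands for the already-cleaned image and `face` for the restored (codebook-generated) crop, so no original pixels enter the blend.

```python
def feather_alpha(width, height, feather):
    """Alpha that ramps linearly from the crop border up to 1.0 at `feather` px inside."""
    alpha = []
    for y in range(height):
        row = []
        for x in range(width):
            edge = min(x, y, width - 1 - x, height - 1 - y)  # distance to nearest border
            row.append(min(1.0, (edge + 1) / (feather + 1)))
        alpha.append(row)
    return alpha


def composite(base, face, x0, y0, feather=2):
    """Blend the restored `face` crop into a copy of the cleaned `base` at (x0, y0)."""
    out = [row[:] for row in base]
    alpha = feather_alpha(len(face[0]), len(face), feather)
    for dy, face_row in enumerate(face):
        for dx, value in enumerate(face_row):
            a = alpha[dy][dx]
            out[y0 + dy][x0 + dx] = a * value + (1 - a) * base[y0 + dy][x0 + dx]
    return out
```

The feather only softens the seam; the interior of each face region is taken entirely from the restored crop.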
**Honest costs/caveats:** (a) **License -- CodeFormer is NTU S-Lab 1.0 (non-commercial/research)**, so
it cannot be bundled in this MIT tool for general use; the license-clean alternative is **GFPGAN
(Apache-2.0)**, slightly lower quality. (b) Deps (basicsr/facexlib) are heavy and numpy-version-finicky
(same class of conflict as insightface). (c) CodeFormer is a *restoration* model -- it can subtly
alter expression/asymmetry; identity is held but not pixel-identical. (d) **The watermark-scrub is
mechanistically strong but UNVERIFIED -- must oracle-check.** InstantID + region-adaptive strength is
the alternative if the restoration route disappoints, but it is more complex (differential strength).
Prototype plan: validate CodeFormer on a real face in a THROWAWAY env (identity held? oracle clean?)
before any project-env integration or the license/GFPGAN decision.
### CodeFormer prototype -- VALIDATED end-to-end 2026-06-03 (oracle-confirmed)
Prototyped the CodeFormer face-restoration post-pass (codeformer-pip in a throwaway venv, forced CPU
-- the pip wrapper has an MPS device-mismatch bug) on the gemini_3 group photo (18 faces). Pipeline:
`all --pipeline controlnet --strength 0.15` (sparkle + SynthID removed from the whole image, minimal
degradation) -> CodeFormer on the ORIGINAL faces -> feather-composite the CodeFormer faces into the
all-cleaned image. Oracle results (Gemini app "Verify with SynthID"), isolating each part:
- pure controlnet-0.15 background (no faces): **clean** -> the background scrub works at 0.15 (no
ControlNet-shielding problem for Google on this image).
- composite with CodeFormer faces at **w0.7**: **SynthID DETECTED** -> high fidelity fuses too much of
the original face signal (the watermark) through.
- composite at **w0.5**: **clean**. composite at **w0.3**: **clean**.
So the scrub-vs-fidelity threshold lies somewhere between 0.5 and 0.7; **w=0.5 is the sweet spot**
(the highest tested fidelity that still clears the oracle). Identity at w0.3-0.7 all looks like the
same person (the face is large enough), so the lower w costs little.
**This VALIDATES the corrected face-preservation approach** (and refutes my earlier "faces can't be
preserved" / FaceID conclusion): controlnet low-strength background scrub + CodeFormer-codebook face
re-synthesis at w~0.5 + feather composite = oracle-clean SynthID removal everywhere (background AND
faces), identity preserved, minimal overall degradation, zero original-pixel copying (CodeFormer faces
are codebook-generated). CodeFormer's discrete-codebook re-synthesis DOES scrub the pixel watermark,
but only when w is low enough that the decoder leans on the codebook rather than fusing the input
(watermark-carrying) features -- exactly the predicted fidelity-vs-scrub tension, with w<=0.5 reading
clean and w=0.7 not on this one image.
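Reading the sweet spot off the manual oracle verdicts can be sketched as follows. The verdict table holds the three runs recorded above; `highest_clean_w` is a hypothetical helper that picks the largest clean `w` lying below every detected one.

```python
# w -> did the oracle detect SynthID? (the three runs recorded above)
ORACLE_RUNS = {0.3: False, 0.5: False, 0.7: True}


def highest_clean_w(runs):
    """Largest w that read clean and sits below every w that was detected, else None."""
    dirty = [w for w, detected in runs.items() if detected]
    ceiling = min(dirty) if dirty else float("inf")
    clean = [w for w, detected in runs.items() if not detected and w < ceiling]
    return max(clean) if clean else None


if __name__ == "__main__":
    print(highest_clean_w(ORACLE_RUNS))
```

Because the oracle is manual, the table only brackets the threshold; more `w` values between 0.5 and 0.7 would be needed to narrow it.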
**Production TODO (not built -- still a throwaway prototype):** (1) license -- CodeFormer is NTU S-Lab
(non-commercial); decide CodeFormer-as-user-installed-extra vs GFPGAN (Apache-2.0, re-verify it scrubs
at its fidelity setting); (2) wire a `--restore-faces` post-pass (detect -> restore w~0.5 -> feather
composite) onto the controlnet pipeline; (3) handle the MPS device bug (force CPU for the face model
or fix); (4) re-verify the w threshold on more images / vendors (w=0.5 confirmed on one Gemini group
photo only).
**Sources.** https://arxiv.org/abs/2206.11253 (CodeFormer) · https://github.com/sczhou/CodeFormer ·
https://arxiv.org/pdf/2401.07519 (InstantID) ·
https://openaccess.thecvf.com/content/WACV2024/papers/Suin_Diffuse_and_Restore... (region-adaptive) ·
https://arxiv.org/pdf/2504.12809 (saliency-aware watermark removal)
## Provenance
Hand-run primary-source pass, 2026-06-02. Sources fetched and quoted above; the central
make-or-break claim (structure-conditioned high-strength regeneration scrubs the watermark while
keeping text) is **unverified and explicitly flagged as the thing the local prototype must
measure** (against the manual Gemini SynthID oracle) -- the literature supports removal
(Findings 1, 2) and supports structure-preserving regeneration (Finding 5) but never jointly
validated text (Finding 3). No code change implied
until the prototype validates a Pareto cell on the SynthID oracle.
+34 -36
View File
@@ -196,12 +196,11 @@ with SynthID" for Google). The test set and per-image results are recorded in
**CORRECTION (supersedes the earlier "resolution dependence" claim).** A prior
version of this doc and CLAUDE.md stated that strength 0.30 failed to remove
SynthID on 1600x1600 gpt-image and that removal was resolution-dependent. That
was an **artifact of the text-protection bug** (issue #14): those tests ran a
build where `protect_text` was ON by default, and the high-resolution text
re-scrub re-introduced SynthID in the dense-text regions of the infographic
images tested. Re-running the *same* 1600x1600 image on clean v0.8.6 (protect
OFF) removes SynthID at **0.05**. The "large images resist removal" conclusion
was false; the resistance was the protect-text shielding, now fixed (v0.8.5).
was a **measurement artifact of a since-removed per-region re-scrub step** (issue
#14): on the dense-text infographics tested, that step could reconstitute SynthID
in text regions. Re-running the *same* 1600x1600 image on the clean current
pipeline removes SynthID at **0.05**. The "large images resist removal" conclusion
was false; the resistance was that region-rescrub shielding, since removed.
**Open / not locally testable:**
@@ -362,38 +361,37 @@ empirically from oracle tests:
Google has hardened SynthID relative to OpenAI's (vendor gap measured at ~3x
strength), but the year-over-year "0.05 -> 0.10 -> 0.30" progression above
conflates a real hardening trend with the now-debunked protect-text artifact;
conflates a real hardening trend with the now-debunked region-rescrub artifact;
treat only the section 2.2 controlled numbers as authoritative.
---
## 5. Practical implications for this tool
### 5.1 Text and face protection: OFF by default
### 5.1 Preserving content means regenerating it, never copying it
**Text protection (`--protect-text`) can preserve SynthID in text regions.**
Verified June 2026 on gpt-image at 1600x1600: same image processed twice --
with `--protect-text` the oracle detected SynthID; without it, SynthID was
removed. The mechanism: the global img2img pass clears SynthID everywhere, but
the text-protection high-resolution re-scrub regenerates those pixels from an
upscaled crop. At the effective resolution of the upscaled crop, the per-region
pass may be insufficient to re-destroy the payload, reconstituting SynthID in
text regions.
**Core rule:** SynthID is a pixel-amplitude pattern, so any approach that FREEZES
or RESTORES original pixels in a region re-introduces the watermark there. Early
region-based text/face "protection" (since removed) proved this: restoring the
original face pixels guaranteed SynthID survived in faces, and even a per-region
high-resolution re-scrub from an upscaled crop could be insufficient to destroy
the payload, reconstituting SynthID in text. The lesson held and shaped the
current design: **content is preserved by REGENERATING it under structural
conditioning, never by copying original pixels.**
**Face protection (`--protect-faces`) has an even more direct preservation
mechanism.** The pipeline extracts face regions from the ORIGINAL (watermarked)
image BEFORE the diffusion pass, runs the global pass (which removes SynthID
everywhere), then blends the original face pixels BACK onto the result
(`invisible_engine.py`: `original_faces = protector.extract_faces(cv_img)`
before `remove_watermark`, then `protector.restore_faces(out_cv, original_faces)`
after). Those restored pixels are the original watermarked pixels -- SynthID is
guaranteed to survive in face regions, not just possibly. The text-protection
case is at least re-generating (uncertain); face protection is literally
restoring the original SynthID-bearing pixels.
Both `--protect-text` and `--protect-faces` are therefore **experimental and
OFF by default**. Enable only when text/face fidelity matters more than
watermark removal completeness, and always verify the result with the oracle.
- **Text + structure:** `--pipeline controlnet` (SDXL img2img + a canny ControlNet)
conditions the regeneration on the edge map, so text and structure stay sharp
while every pixel is still regenerated -- SynthID is removed everywhere. Verified
better than plain img2img at the same strength (text stays legible where plain
garbles it), and the controlnet background scrub reads clean on the oracle.
- **Face identity:** canny holds face *structure* but not *identity*. The validated
approach (researched + prototyped 2026-06-03, not yet shipped) is a face-restoration
post-pass: CodeFormer/GFPGAN RE-SYNTHESIZES each face from a discrete codebook
(codebook pixels, not original -> scrubs SynthID) at a low fidelity weight
(`w~0.5`), composited into the cleaned image. Oracle-confirmed clean in face
regions with identity preserved. (An IP-Adapter FaceID approach was tried and
REMOVED -- it needs high denoise strength and corrupts faces at removal strength;
see `docs/controlnet-removal-pipeline-research.md`.)
### 5.2 Strength setting
@@ -431,13 +429,13 @@ value that reads clean on the oracle.
verifier is not the same as being forensically indistinguishable from clean
content (arXiv:2605.09203).
### 5.4 ctrlregen and img2img: the tradeoff
### 5.4 Strength vs forensic detectability: the tradeoff
Both the paper and our testing confirm: higher img2img strength removes the
watermark but introduces detectable regeneration artifacts. The Goonatilake &
Ateniese paper shows CtrlRegen+ (the most powerful remover) is simultaneously
the most forensically detectable (AUROC 0.9999). The tradeoff is unavoidable
with current diffusion-based approaches.
Higher img2img strength removes the watermark but introduces detectable
regeneration artifacts. The Goonatilake & Ateniese paper shows the strongest
diffusion-based removers are simultaneously the most forensically detectable
(AUROC up to 0.9999). The tradeoff is unavoidable with current diffusion-based
approaches: defeating the vendor's verifier is not the same as being clean.
---
-34
View File
@@ -49,10 +49,7 @@ gpu = [
"transformers>=5,<6",
"tokenizers>=0.22,<0.23",
"accelerate>=0.25.0",
"controlnet-aux>=0.0.9",
"safetensors",
"ultralytics>=8.0.0",
"color-matcher>=0.5.0",
]
# Open invisible-watermark (imwatermark) decoder for detecting the DWT-DCT
# watermarks embedded by Stable Diffusion / SDXL / FLUX. Optional because it
@@ -150,7 +147,6 @@ ignore = [
"tests/*.py" = ["ANN", "S101", "S105", "S106", "S108"]
"src/remove_ai_watermarks/noai/watermark_remover.py" = ["S603", "S606", "S607", "T201"] # subprocess calls for auto-install/CUDA fix
"src/remove_ai_watermarks/noai/c2pa.py" = ["S110"] # try-except-pass for corrupt file handling
"src/remove_ai_watermarks/noai/ctrlregen/engine.py" = ["S101", "S603"] # assert for loaded state, subprocess for auto-install
[tool.ruff.format]
quote-style = "double"
@@ -172,33 +168,3 @@ reportUnknownMemberType = false
reportUnknownArgumentType = false
reportUnknownVariableType = false
reportMissingTypeArgument = false
# CtrlRegen is a torch/diffusers/controlnet-aux boundary: those libs ship no
# usable types, so strict pyright cannot know the tensor element types. Relax the
# unknown-type rules for this subpackage only (mirrors the per-file pragmas used
# for the cv2 engines); the rest of the codebase stays strict.
[[tool.pyright.executionEnvironments]]
root = "src/remove_ai_watermarks/noai/ctrlregen"
reportUnknownMemberType = false
reportUnknownArgumentType = false
reportUnknownVariableType = false
reportUnknownParameterType = false
reportMissingTypeArgument = false
reportMissingTypeStubs = false
reportMissingImports = false
reportArgumentType = false
reportAssignmentType = false
reportReturnType = false
reportCallIssue = false
reportIndexIssue = false
reportOperatorIssue = false
reportOptionalMemberAccess = false
reportOptionalCall = false
reportOptionalSubscript = false
reportOptionalOperand = false
reportAttributeAccessIssue = false
reportPrivateImportUsage = false
reportPrivateUsage = false
reportInvalidTypeForm = false
reportConstantRedefinition = false
reportUnnecessaryComparison = false
+348
View File
@@ -0,0 +1,348 @@
"""ControlNet-as-removal-pipeline prototype sweep (issue #35 / Jacob).
Research prototype, NOT a shipped pipeline. It tests whether a full-image
SDXL-native ControlNet-conditioned img2img can REPLACE plain SDXL img2img as the
watermark remover: a single structure-guided regeneration that scrubs an invisible
robust watermark (SynthID) everywhere while keeping fine detail and small/CJK text
legible. See docs/controlnet-removal-pipeline-research.md for the full rationale.
The make-or-break tension (from the watermark-removal-attack literature): the
denoise strength high enough to scrub the watermark deforms text, while the
conditioning strong enough to keep text may spare the watermark. There is no local
SynthID detector, so this script CANNOT decide removal on its own -- it produces
one output per (control, strength, conditioning-scale) cell plus an index, and YOU
verify each output by hand in the Gemini app ("Verify with SynthID") and judge text
legibility visually. Fill the verdict columns in the emitted index, then read off
the Pareto cell (oracle clean AND text legible).
Pipeline: stabilityai/stable-diffusion-xl-base-1.0 +
- canny: xinsir/controlnet-canny-sdxl-1.0 (control = cv2.Canny(gray, 100, 200))
- tile: xinsir/controlnet-tile-sdxl-1.0 (control = the resized original, no preproc)
StableDiffusionXLControlNetImg2ImgPipeline (image=init, control_image=control).
Needs the gpu extra (torch + diffusers) and cv2. Runs locally on 32 GB MPS in
fp32 (MPS fp16 decodes to all-black NaN -- issue #29 -- so fp32 is the default on
mps/cpu, fp16 only on cuda/xpu); a dedicated GPU is not required for 1024 px. Run:
uv run python scripts/controlnet_sweep.py path/to/watermarked.png -o sweep_out
uv run python scripts/controlnet_sweep.py img.png --control canny tile \\
--strength 0.3 0.5 0.7 1.0 --scale 0.6 1.0 --size 1024
"""
from __future__ import annotations
# torch/diffusers/cv2 ship no usable types; relax the unknown-type + private-import
# rules for this boundary script (mirrors scripts/visible_alpha_solve.py and the
# cv2/torch engine modules). Pure-logic helpers here stay correct regardless.
# pyright: reportUnknownMemberType=false, reportUnknownArgumentType=false, reportUnknownVariableType=false, reportUnknownParameterType=false, reportMissingTypeArgument=false, reportMissingTypeStubs=false, reportMissingImports=false, reportArgumentType=false, reportAssignmentType=false, reportReturnType=false, reportCallIssue=false, reportIndexIssue=false, reportOperatorIssue=false, reportPrivateImportUsage=false
import argparse
import contextlib
import csv
import importlib.util
import logging
import sys
from pathlib import Path
from typing import TYPE_CHECKING, Any
sys.path.insert(0, str(Path(__file__).parent.parent / "src"))
from _plain_console import Console, Table
if TYPE_CHECKING:
    from PIL import Image
logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger(__name__)
console = Console()
BASE_MODEL = "stabilityai/stable-diffusion-xl-base-1.0"
FP16_VAE = "madebyollin/sdxl-vae-fp16-fix"
CONTROLNETS = {
    "canny": "xinsir/controlnet-canny-sdxl-1.0",
    "tile": "xinsir/controlnet-tile-sdxl-1.0",
}
# A neutral quality prompt: the goal is faithful regeneration, not creative edits.
PROMPT = "best quality, high quality, sharp, detailed, photographic"
NEGATIVE_PROMPT = "blurry, lowres, deformed, distorted text, garbled text, watermark, jpeg artifacts"
def pick_device(requested: str) -> str:
    """Resolve the inference device without the CUDA-reinstaller side effect.

    Deliberately does NOT call the package ``get_device`` (which can trigger a
    torch-CUDA reinstall+restart). A research script should never do that.
    """
    import torch

    if requested != "auto":
        return requested
    if torch.cuda.is_available():
        return "cuda"
    if hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
        return "mps"
    return "cpu"

def resolve_dtype(device: str, requested: str) -> Any:
    """fp16 only on cuda/xpu; fp32 on cpu AND mps, unless overridden.

    MPS fp16 produces all-black NaN output here (the SDXL UNet/VAE overflows on
    the Metal backend -- issue #29; even the fp16-fix VAE does not save it), so the
    production pipeline runs fp32 on MPS and so do we. fp32 SDXL + an SDXL ControlNet
    at 1024 fits 32 GB unified memory with vae-tiling + attention-slicing.
    """
    import torch

    if requested == "fp16":
        return torch.float16
    if requested == "fp32":
        return torch.float32
    return torch.float16 if device in {"cuda", "xpu"} else torch.float32

def fit_size(image: Image.Image, long_side: int) -> Image.Image:
    """Resize so the long side is ``long_side``, each dim a multiple of 8 (SDXL)."""
    from PIL import Image as PILImage

    w, h = image.size
    scale = long_side / max(w, h)
    nw = max(8, round(w * scale) // 8 * 8)
    nh = max(8, round(h * scale) // 8 * 8)
    if (nw, nh) == (w, h):
        return image
    return image.resize((nw, nh), PILImage.Resampling.LANCZOS)

def make_control_image(init: Image.Image, control: str) -> Image.Image:
    """Build the ControlNet conditioning image for the given control type.

    canny: cv2.Canny(gray, 100, 200) -> 3-channel edge map (xinsir canny recipe).
    tile:  the init image itself, no preprocessing (xinsir tile recipe).
    """
    import cv2
    import numpy as np
    from PIL import Image as PILImage

    if control == "tile":
        return init
    rgb = np.array(init.convert("RGB"))
    gray = cv2.cvtColor(rgb, cv2.COLOR_RGB2GRAY)
    edges = cv2.Canny(gray, 100, 200)
    edges_rgb = np.stack([edges, edges, edges], axis=-1)
    return PILImage.fromarray(edges_rgb)

def psnr(a: Image.Image, b: Image.Image) -> float:
    """Coarse global fidelity proxy vs the original (NOT a text or watermark metric)."""
    import numpy as np

    x = np.asarray(a.convert("RGB"), dtype=np.float64)
    y = np.asarray(b.convert("RGB").resize(a.size), dtype=np.float64)
    mse = float(np.mean((x - y) ** 2))
    if mse == 0.0:
        return 99.0
    return float(10.0 * np.log10((255.0**2) / mse))

def load_pipeline(control: str, device: str, dtype: Any) -> Any:
    """Load SDXL base + the chosen SDXL ControlNet as an img2img pipeline."""
    import torch
    from diffusers import (
        AutoencoderKL,
        ControlNetModel,
        StableDiffusionXLControlNetImg2ImgPipeline,
    )

    console.print(f"Loading {CONTROLNETS[control]} ({control}) ...")
    controlnet = ControlNetModel.from_pretrained(CONTROLNETS[control], torch_dtype=dtype)
    load_kwargs: dict[str, Any] = {"controlnet": controlnet, "torch_dtype": dtype}
    if dtype == torch.float16:
        # The stock SDXL VAE decodes to NaN/black in fp16; the fp16-fix VAE is the
        # same swap the production pipeline uses (_SDXL_FP16_VAE_ID).
        load_kwargs["vae"] = AutoencoderKL.from_pretrained(FP16_VAE, torch_dtype=dtype)
    pipe = StableDiffusionXLControlNetImg2ImgPipeline.from_pretrained(BASE_MODEL, **load_kwargs)
    pipe = pipe.to(device)
    pipe.set_progress_bar_config(disable=True)
    if device != "cpu":
        # Keep the 1024 px + extra-ControlNet peak inside 32 GB unified memory.
        with contextlib.suppress(Exception):
            pipe.enable_vae_tiling()
        with contextlib.suppress(Exception):
            pipe.enable_attention_slicing()
    return pipe

def run_cell(
    pipe: Any,
    init: Image.Image,
    control_image: Image.Image,
    strength: float,
    scale: float,
    steps: int,
    guidance: float,
    seed: int,
) -> Image.Image:
    """Run one ControlNet img2img cell and return the regenerated image.

    The generator is created on CPU intentionally: a CPU generator is portable
    across the mps/cuda/cpu backends (diffusers rejects a device-mismatched one),
    matching the production runner's fallback behavior.
    """
    import torch

    generator = torch.Generator(device="cpu").manual_seed(seed)
    result = pipe(
        prompt=PROMPT,
        negative_prompt=NEGATIVE_PROMPT,
        image=init,
        control_image=control_image,
        controlnet_conditioning_scale=float(scale),
        strength=float(strength),
        num_inference_steps=steps,
        guidance_scale=guidance,
        generator=generator,
    )
    return result.images[0]

def parse_args() -> argparse.Namespace:
    p = argparse.ArgumentParser(description="ControlNet-as-removal-pipeline prototype sweep.")
    p.add_argument("image", type=Path, help="Watermarked input image.")
    p.add_argument("-o", "--out", type=Path, default=Path("controlnet_sweep_out"), help="Output directory.")
    p.add_argument("--control", nargs="+", choices=list(CONTROLNETS), default=["canny", "tile"])
    p.add_argument("--strength", nargs="+", type=float, default=[0.3, 0.5, 0.7, 1.0])
    p.add_argument("--scale", nargs="+", type=float, default=[0.6, 1.0], help="controlnet_conditioning_scale values.")
    p.add_argument("--size", type=int, default=1024, help="Long-side resolution (multiple of 8).")
    p.add_argument("--steps", type=int, default=30)
    p.add_argument("--guidance", type=float, default=7.5)
    p.add_argument("--seed", type=int, default=0)
    p.add_argument("--device", default="auto", choices=["auto", "mps", "cuda", "cpu"])
    p.add_argument("--dtype", default="auto", choices=["auto", "fp16", "fp32"])
    return p.parse_args()
def main() -> int:
    args = parse_args()
    if not args.image.exists():
        log.error("Input image not found: %s", args.image)
        return 1
    try:
        from PIL import Image as PILImage
    except ImportError:
        log.error("Pillow is required. Install the gpu extra: uv sync --extra gpu --extra dev")
        return 1
    if importlib.util.find_spec("diffusers") is None or importlib.util.find_spec("torch") is None:
        log.error("diffusers/torch are required. Install: uv sync --extra gpu --extra dev")
        return 1

    device = pick_device(args.device)
    dtype = resolve_dtype(device, args.dtype)
    console.print(f"Device: {device} | dtype: {str(dtype).split('.')[-1]}")

    init_full = PILImage.open(args.image).convert("RGB")
    init = fit_size(init_full, args.size)
    console.print(f"Input: {args.image.name} {init_full.size[0]}x{init_full.size[1]} -> {init.size[0]}x{init.size[1]}")

    args.out.mkdir(parents=True, exist_ok=True)
    stem = args.image.stem
    init_path = args.out / f"{stem}__INPUT.png"
    init.save(init_path)

    rows: list[dict[str, Any]] = []
    table = Table(title="ControlNet sweep")
    for col in ("control", "strength", "scale", "psnr_vs_input", "file"):
        table.add_column(col)

    # Group by control so SDXL + the ControlNet load once per control type.
    for control in args.control:
        pipe = load_pipeline(control, device, dtype)
        control_image = make_control_image(init, control)
        if control == "canny":
            control_image.save(args.out / f"{stem}__canny_edges.png")
        for strength in args.strength:
            for scale in args.scale:
                tag = f"{control}_s{strength:g}_c{scale:g}"
                console.print(f"Running {tag} ...")
                try:
                    out = run_cell(
                        pipe,
                        init,
                        control_image,
                        strength,
                        scale,
                        args.steps,
                        args.guidance,
                        args.seed,
                    )
                except Exception as exc:
                    log.warning("Cell %s failed: %s", tag, exc)
                    continue
                fname = f"{stem}__{tag}.png"
                out.save(args.out / fname)
                quality = psnr(init, out)
                rows.append(
                    {
                        "control": control,
                        "strength": strength,
                        "scale": scale,
                        "psnr_vs_input": round(quality, 2),
                        "file": fname,
                        "synthid_oracle": "",  # fill: clean / present
                        "text_legible": "",  # fill: yes / no / partial
                    }
                )
                table.add_row(control, f"{strength:g}", f"{scale:g}", f"{quality:.2f}", fname)
        del pipe
        _free_memory(device)

    _write_index(args.out, stem, rows, init_path.name)
    console.print(table)
    console.print(f"\nWrote {len(rows)} cells to {args.out}/")
    console.print(f"Next: open {args.out}/sweep_index.csv, run each PNG through the Gemini SynthID oracle,")
    console.print("fill synthid_oracle (clean/present) + text_legible (yes/no/partial), find the Pareto cell.")
    return 0
def _free_memory(device: str) -> None:
    import gc

    gc.collect()
    with contextlib.suppress(Exception):
        import torch

        if device == "cuda":
            torch.cuda.empty_cache()
        elif device == "mps" and hasattr(torch, "mps"):
            torch.mps.empty_cache()
def _write_index(out: Path, stem: str, rows: list[dict[str, Any]], input_name: str) -> None:
    """Write the CSV index (with empty verdict columns) and a README."""
    fields = ["control", "strength", "scale", "psnr_vs_input", "file", "synthid_oracle", "text_legible"]
    with (out / "sweep_index.csv").open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fields)
        writer.writeheader()
        writer.writerows(rows)
    readme = (
        f"# ControlNet sweep for {stem}\n\n"
        f"Input (resized): {input_name}\n\n"
        "Each row in sweep_index.csv is one (control, strength, scale) cell. PSNR vs the\n"
        "resized input is a COARSE global-fidelity proxy only -- it does NOT measure text\n"
        "legibility or watermark presence. Decide those two by hand:\n\n"
        "1. synthid_oracle: open the PNG in the Gemini app, 'Verify with SynthID'. Mark\n"
        " 'clean' if no SynthID is detected, 'present' if it still is. (No local detector\n"
        " exists; this manual check is the only valid SynthID oracle.)\n"
        "2. text_legible: eyeball the small/CJK text. Mark yes / partial / no.\n\n"
        "The Pareto cell is the one where synthid_oracle=clean AND text_legible=yes at the\n"
        "lowest strength. If no cell satisfies both, the canny/tile-ControlNet middle path\n"
        "is dead for text and a glyph re-render is required (see\n"
        "docs/text-protection-research.md). Record the outcome in\n"
        "docs/controlnet-removal-pipeline-research.md.\n"
    )
    (out / "sweep_README.md").write_text(readme, encoding="utf-8")
if __name__ == "__main__":
    raise SystemExit(main())
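
Review note: the sweep README defines the Pareto cell as the lowest-strength row with `synthid_oracle=clean` and `text_legible=yes`. Once the two verdict columns are filled in by hand, that selection could be automated roughly as below. This is a sketch only: `pick_pareto_cell` is a hypothetical helper that is not part of this commit, the PSNR tie-break is an assumption the README does not state, and the sample rows are made-up values standing in for a real `sweep_index.csv`.

```python
import csv
import io


def pick_pareto_cell(rows):
    """Return the lowest-strength row marked clean + legible, or None if no row qualifies."""
    ok = [
        r
        for r in rows
        if r["synthid_oracle"].strip().lower() == "clean"
        and r["text_legible"].strip().lower() == "yes"
    ]
    if not ok:
        return None
    # Assumption: among rows with equal strength, prefer the higher PSNR.
    return min(ok, key=lambda r: (float(r["strength"]), -float(r["psnr_vs_input"])))


# Made-up sample standing in for a filled-in sweep_index.csv.
sample = """control,strength,scale,psnr_vs_input,file,synthid_oracle,text_legible
canny,0.3,1,30.0,a.png,present,yes
canny,0.5,1,27.0,b.png,clean,yes
canny,0.7,1,24.0,c.png,clean,partial
tile,0.5,0.6,26.0,d.png,clean,yes
"""

rows = list(csv.DictReader(io.StringIO(sample)))
best = pick_pareto_cell(rows)
print(best["file"] if best else "no cell satisfies both")
```

For a real run, `rows` would come from `csv.DictReader` over the `sweep_index.csv` file in the output directory. A `None` result corresponds to the README's "middle path is dead for text" outcome.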
-150
@@ -1,150 +0,0 @@
"""Multilingual recall benchmark for the text-protection detector.
Measures the core lever of text protection (`text_protector.TextProtector`): if
the PP-OCRv3 DB detector misses a text region, that text is NOT preserved during
the SDXL watermark-removal pass and gets deformed (issue #14). This renders short
text in several scripts at several font sizes on two canvas sizes, runs detection,
and reports the fraction of each known text bbox the detector covers.
Findings (2026-05-29):
- Detection is script-agnostic: DB segments text *regions*, not characters, so
Latin / Cyrillic / CJK / Hangul / Arabic / digits score identically. Language
was never the lever.
- The only lever is resolution. A fixed small detector input downscaled large
canvases so far that small text was missed. Detecting at the native long side
(capped, see ``text_protector._DET_MAX_LONG_SIDE``) lifts overall hit-rate
from 0.91 to 1.00 and the worst cell (~16 px text on a 2048 canvas) from
0.06 to 1.00.
This needs the detector model (downloaded on first use) and a font that covers
all the scripts (macOS "Arial Unicode"; on Linux install a Noto super-font).
No GPU. Run:
uv run python scripts/text_detection_benchmark.py
"""
from __future__ import annotations
import logging
import sys
from collections import defaultdict
from pathlib import Path
from typing import Any
import cv2
import numpy as np
from _plain_console import Console
from PIL import Image, ImageDraw, ImageFont
from remove_ai_watermarks import text_protector as tp
log = logging.getLogger(__name__)
console = Console()
# A single font covering every tested script isolates "language" from "font".
_FONT_CANDIDATES = [
"/System/Library/Fonts/Supplemental/Arial Unicode.ttf",
"/Library/Fonts/Arial Unicode.ttf",
"/usr/share/fonts/truetype/noto/NotoSansCJK-Regular.ttc",
"/usr/share/fonts/truetype/unifont/unifont.ttf",
]
SCRIPTS = {
"Latin": "Generated by AI",
"Cyrillic": "Сгенерировано ИИ",
"CJK": "豆包AI生成内容",
"Hangul": "AI로 생성됨",
"Arabic": "أنشئ بالذكاء",
"Digits": "0123456789",
}
FONT_SIZES = [16, 24, 32, 48, 64]
CANVASES = [1024, 2048]
PLACEMENTS = [(0.08, 0.15), (0.30, 0.55), (0.10, 0.82)]
BG_COLORS = [(35, 40, 60), (210, 205, 200)]
def _find_font() -> str:
for path in _FONT_CANDIDATES:
if Path(path).exists():
return path
raise SystemExit(
"No multi-script font found. Install one (macOS ships 'Arial Unicode'; "
"on Linux: a Noto CJK/super font) and add its path to _FONT_CANDIDATES."
)
def _render(
font_path: str,
canvas: int,
text: str,
font_size: int,
place: tuple[float, float],
bg: tuple[int, int, int],
) -> tuple[Image.Image, tuple[int, int, int, int]]:
img = Image.new("RGB", (canvas, canvas), bg)
draw = ImageDraw.Draw(img)
font = ImageFont.truetype(font_path, font_size)
x, y = int(place[0] * canvas), int(place[1] * canvas)
fg = (245, 245, 245) if sum(bg) < 360 else (20, 20, 20)
draw.text((x, y), text, font=font, fill=fg)
return img, draw.textbbox((x, y), text, font=font)
def _coverage(boxes: list[Any], bbox: tuple[int, int, int, int], h: int, w: int) -> float | None:
gt = np.zeros((h, w), np.uint8)
cv2.rectangle(gt, (bbox[0], bbox[1]), (bbox[2], bbox[3]), 1, -1)
area = int(gt.sum())
if area == 0:
return None
det = np.zeros((h, w), np.uint8)
if boxes:
cv2.fillPoly(det, [np.asarray(b, np.int32) for b in boxes], 1)
return int((gt & det).sum()) / area
def _hitrate(values: list[float], thr: float = 0.5) -> float:
return sum(c >= thr for c in values) / len(values) if values else float("nan")
def main() -> int:
logging.basicConfig(level=logging.WARNING)
if not tp.is_available():
raise SystemExit("text detector unavailable (need opencv with cv2.dnn.TextDetectionModel_DB)")
font_path = _find_font()
detector = tp.TextProtector()
by_script_size: dict[tuple[str, int], list[float]] = defaultdict(list)
by_size_canvas: dict[tuple[int, int], list[float]] = defaultdict(list)
by_script: dict[str, list[float]] = defaultdict(list)
for canvas in CANVASES:
for script, text in SCRIPTS.items():
for font_size in FONT_SIZES:
for idx, place in enumerate(PLACEMENTS):
img, bbox = _render(font_path, canvas, text, font_size, place, BG_COLORS[idx % len(BG_COLORS)])
bgr = cv2.cvtColor(np.array(img), cv2.COLOR_RGB2BGR)
cov = _coverage(detector.detect_text_boxes(bgr), bbox, canvas, canvas)
if cov is None:
continue
by_script_size[(script, font_size)].append(cov)
by_size_canvas[(font_size, canvas)].append(cov)
by_script[script].append(cov)
console.print("=== hit-rate (coverage>=0.5) by script x font-size ===")
console.print("script".ljust(10) + "".join(f"{fs:>7}" for fs in FONT_SIZES))
for script in SCRIPTS:
console.print(
script.ljust(10) + "".join(f"{_hitrate(by_script_size[(script, fs)]):>7.2f}" for fs in FONT_SIZES)
)
console.print("\n=== hit-rate by font-size x canvas (the downscale effect) ===")
console.print("size".ljust(8) + "".join(f"{c:>8}" for c in CANVASES))
for fs in FONT_SIZES:
console.print(str(fs).ljust(8) + "".join(f"{_hitrate(by_size_canvas[(fs, c)]):>8.2f}" for c in CANVASES))
overall = _hitrate([c for vals in by_script.values() for c in vals])
console.print(f"\nOVERALL hit-rate: {overall:.2f} (detector max long side = {tp._DET_MAX_LONG_SIDE})")
return 0
if __name__ == "__main__":
sys.exit(main())
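
Review note: for readers of the removed benchmark, the metric it reports is "fraction of the ground-truth text box covered by the union of detected boxes", thresholded at 0.5 to give a hit-rate. A dependency-free sketch of that arithmetic follows. It simplifies the original in two labeled ways: boxes are axis-aligned `(x1, y1, x2, y2)` tuples with half-open pixel ranges, whereas the script rasterizes arbitrary polygons with OpenCV. The function names `coverage` and `hit_rate` are illustrative, not the script's own.

```python
def coverage(det_boxes, gt_box):
    """Fraction of gt_box pixels covered by the union of det_boxes (half-open ranges)."""
    gx1, gy1, gx2, gy2 = gt_box
    area = max(0, gx2 - gx1) * max(0, gy2 - gy1)
    if area == 0:
        return None  # degenerate ground-truth box, skipped like the original
    covered = set()
    for x1, y1, x2, y2 in det_boxes:
        # Clip each detection to the ground-truth box, then collect its pixels.
        for y in range(max(y1, gy1), min(y2, gy2)):
            for x in range(max(x1, gx1), min(x2, gx2)):
                covered.add((x, y))
    return len(covered) / area


def hit_rate(values, thr=0.5):
    """Share of coverage values at or above the threshold."""
    return sum(v >= thr for v in values) / len(values) if values else float("nan")


gt = (0, 0, 10, 4)  # 40-pixel ground-truth text box
covs = [
    coverage([(0, 0, 10, 4)], gt),                # detector covers everything
    coverage([(0, 0, 3, 4)], gt),                 # 12 of 40 pixels
    coverage([], gt),                             # detector missed the text
    coverage([(0, 0, 6, 4), (4, 0, 8, 4)], gt),   # overlapping boxes, union is 32 pixels
]
print(covs, hit_rate(covs))
```

Using the union (a set of pixels) rather than summing per-box areas keeps overlapping detections from counting twice, which is the same effect the original gets by filling one shared mask.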
+26 -39
@@ -136,23 +136,14 @@ def _validate_image(path: Path) -> Path:
_ALPHA_FORMATS = {".png", ".webp"}
# Shared option decorators for commands that run the invisible-watermark pipeline.
# Both cmd_invisible and cmd_all expose these flags; defining them once avoids
# Shared option decorator for commands that run the invisible-watermark pipeline.
# Both cmd_invisible and cmd_all expose this flag; defining it once avoids
# copy-paste drift.
_protect_text_option = click.option(
"--protect-text",
is_flag=True,
default=False,
help=(
"Enable text region protection (experimental: re-scrubs text blocks at high resolution). "
"May prevent SynthID removal in text areas -- verify with oracle before relying on it."
),
)
_protect_faces_option = click.option(
"--protect-faces",
is_flag=True,
default=False,
help="Enable face protection (experimental: YOLO detect + blend original faces back).",
_controlnet_scale_option = click.option(
"--controlnet-scale",
type=float,
default=1.0,
help="ControlNet conditioning scale (structure/text preservation strength), controlnet pipeline only.",
)
@@ -453,14 +444,15 @@ def cmd_erase(
type=float,
default=None,
help="Denoising strength (0.0-1.0). Default: vendor-adaptive (OpenAI 0.10 / Google 0.15 / "
"unknown 0.15, from the C2PA issuer); ctrlregen uses 1.0.",
"unknown 0.15, from the C2PA issuer).",
)
@click.option("--steps", type=int, default=50, help="Number of denoising steps. Default: 50.")
@click.option(
"--pipeline",
type=click.Choice(["default", "ctrlregen"]),
type=click.Choice(["default", "controlnet"]),
default="default",
help="Pipeline profile (default=SDXL; ctrlregen=CtrlRegen, EXPERIMENTAL/destructive at clean-noise).",
help="Pipeline profile (default=SDXL img2img; controlnet=SDXL + canny ControlNet that preserves "
"text/faces via edge conditioning while removing SynthID).",
)
@click.option(
"--device",
@@ -479,8 +471,7 @@ def cmd_erase(
default=0,
help="Cap long side (px) before diffusion; 0 = native (best quality, like raiw.cc). Raise only on GPU/MPS OOM.",
)
@_protect_text_option
@_protect_faces_option
@_controlnet_scale_option
@click.pass_context
def cmd_invisible(
ctx: click.Context,
@@ -494,8 +485,7 @@ def cmd_invisible(
hf_token: str | None,
humanize: float,
max_resolution: int,
protect_text: bool,
protect_faces: bool,
controlnet_scale: float,
) -> None:
"""Remove invisible AI watermarks (SynthID, StableSignature, TreeRing).
@@ -526,6 +516,7 @@ def cmd_invisible(
pipeline=pipeline,
hf_token=hf_token,
progress_callback=progress_cb,
controlnet_conditioning_scale=controlnet_scale,
)
# Detect the SynthID vendor from the ORIGINAL (before processing strips C2PA) so the
@@ -533,7 +524,7 @@ def cmd_invisible(
vendor = vendor_for_strength(source)
console.print(f" Input: {source.name}")
console.print(f" Pipeline: {pipeline}")
console.print(f" Strength: {resolve_strength(strength, pipeline, vendor)} Steps: {steps}")
console.print(f" Strength: {resolve_strength(strength, vendor)} Steps: {steps}")
t0 = time.monotonic()
result_path = engine.remove_watermark(
@@ -544,8 +535,6 @@ def cmd_invisible(
guidance_scale=None,
seed=seed,
humanize=humanize,
protect_text=protect_text,
protect_faces=protect_faces,
max_resolution=max_resolution,
vendor=vendor,
)
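
Review note: after this change `resolve_strength` takes `(strength, vendor)` only, since the ctrlregen special case is gone. Its body is not shown in this diff, so the following is only a guess at the behavior implied by the `--strength` help text (explicit value wins; otherwise OpenAI 0.10, Google 0.15, unknown 0.15). The name `sketch_resolve_strength`, the lowercase vendor keys, and the range check are all assumptions for illustration.

```python
# Assumed defaults, copied from the --strength help text in this diff.
# The real table lives elsewhere in the package and may differ.
_VENDOR_STRENGTH = {"openai": 0.10, "google": 0.15}
_UNKNOWN_VENDOR_STRENGTH = 0.15


def sketch_resolve_strength(strength, vendor):
    """Return the explicit strength if given, else a vendor-adaptive default."""
    if strength is not None:
        if not 0.0 <= strength <= 1.0:
            raise ValueError("strength must be within 0.0-1.0")
        return strength
    # Vendor key format ("openai"/"google") is an assumption of this sketch.
    return _VENDOR_STRENGTH.get((vendor or "").lower(), _UNKNOWN_VENDOR_STRENGTH)


print(sketch_resolve_strength(None, "OpenAI"))
print(sketch_resolve_strength(0.4, "google"))
```

Dropping the `pipeline` argument means the controlnet profile inherits the same vendor-adaptive defaults as plain img2img unless `--strength` is passed explicitly.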
@@ -694,15 +683,15 @@ def cmd_identify(ctx: click.Context, source: Path, no_visible: bool, as_json: bo
"--strength",
type=float,
default=None,
help="Invisible watermark denoising strength. Default: vendor-adaptive "
"(OpenAI 0.10 / Google 0.15 / unknown 0.15); ctrlregen uses 1.0.",
help="Invisible watermark denoising strength. Default: vendor-adaptive (OpenAI 0.10 / Google 0.15 / unknown 0.15).",
)
@click.option("--steps", type=int, default=50, help="Number of denoising steps for invisible removal.")
@click.option(
"--pipeline",
type=click.Choice(["default", "ctrlregen"]),
type=click.Choice(["default", "controlnet"]),
default="default",
help="Pipeline profile (default=SDXL; ctrlregen=CtrlRegen, EXPERIMENTAL/destructive at clean-noise).",
help="Pipeline profile (default=SDXL img2img; controlnet=SDXL + canny ControlNet that preserves "
"text/faces via edge conditioning while removing SynthID).",
)
@click.option("--model", type=str, default=None, help="HuggingFace model ID for invisible removal.")
@click.option(
@@ -722,8 +711,7 @@ def cmd_identify(ctx: click.Context, source: Path, no_visible: bool, as_json: bo
default=0,
help="Cap long side (px) before diffusion; 0 = native (best quality, like raiw.cc). Raise only on GPU/MPS OOM.",
)
@_protect_text_option
@_protect_faces_option
@_controlnet_scale_option
@click.pass_context
def cmd_all(
ctx: click.Context,
@@ -740,8 +728,7 @@ def cmd_all(
hf_token: str | None,
humanize: float,
max_resolution: int,
protect_text: bool,
protect_faces: bool,
controlnet_scale: float,
) -> None:
"""Remove ALL watermarks: visible + invisible + metadata.
@@ -822,13 +809,14 @@ def cmd_all(
pipeline=pipeline,
hf_token=hf_token,
progress_callback=progress_cb,
controlnet_conditioning_scale=controlnet_scale,
)
# Detect the vendor from the pristine ORIGINAL (`source`); `tmp_path` has
# already lost its C2PA to the visible-removal pass, so reading it would
# always resolve to the unknown-vendor default.
vendor = vendor_for_strength(source)
console.print(f" Strength: {resolve_strength(strength, pipeline, vendor)} Steps: {steps}")
console.print(f" Strength: {resolve_strength(strength, vendor)} Steps: {steps}")
inv_engine.remove_watermark(
image_path=tmp_path,
output_path=tmp_path,
@@ -836,8 +824,6 @@ def cmd_all(
num_inference_steps=steps,
seed=seed,
humanize=humanize,
protect_text=protect_text,
protect_faces=protect_faces,
max_resolution=max_resolution,
vendor=vendor,
)
@@ -990,9 +976,10 @@ def _process_batch_image(
)
@click.option(
"--pipeline",
type=click.Choice(["default", "ctrlregen"]),
type=click.Choice(["default", "controlnet"]),
default="default",
help="Pipeline profile (default=SDXL; ctrlregen=CtrlRegen, EXPERIMENTAL/destructive at clean-noise).",
help="Pipeline profile (default=SDXL img2img; controlnet=SDXL + canny ControlNet that preserves "
"text/faces via edge conditioning while removing SynthID).",
)
@click.option(
"--device",
-150
@@ -1,150 +0,0 @@
"""YOLO-based face detection and soft-blend restoration for diffusion pipelines."""
# cv2/numpy/ultralytics boundary: these libs ship no usable element types; relax
# the unknown-type rules for this file only.
# pyright: reportUnknownMemberType=false, reportUnknownArgumentType=false, reportUnknownVariableType=false, reportUnknownParameterType=false, reportMissingTypeArgument=false, reportMissingTypeStubs=false, reportMissingImports=false, reportArgumentType=false, reportAssignmentType=false, reportReturnType=false, reportCallIssue=false, reportIndexIssue=false, reportOperatorIssue=false, reportOptionalMemberAccess=false, reportOptionalCall=false, reportOptionalSubscript=false, reportOptionalOperand=false, reportAttributeAccessIssue=false, reportPrivateImportUsage=false, reportPrivateUsage=false, reportInvalidTypeForm=false, reportConstantRedefinition=false, reportUnnecessaryComparison=false, reportPossiblyUnboundVariable=false
import logging
from pathlib import Path
import cv2
import numpy as np
try:
from ultralytics import YOLO
HAS_YOLO = True
except ImportError:
HAS_YOLO = False
logger = logging.getLogger(__name__)
class FaceProtector:
"""
Detects faces in an image and provides methods to seamlessly paste them back
onto an upscaled/processed image to preserve facial details that may have
been destroyed by latent diffusion or other algorithms.
"""
def __init__(self, use_yolo: bool = True, model_name: str = "yolov8n.pt") -> None:
self.use_yolo = use_yolo and HAS_YOLO
self.detector = None
self.haar_cascade = None
if self.use_yolo:
# Fix SSL certificate issues on macOS (fresh Python installs)
self._fix_ssl_certs()
logger.info("Loading YOLO model '%s' for face protection...", model_name)
self.detector = YOLO(model_name)
else:
if use_yolo and not HAS_YOLO:
logger.warning(
"ultralytics YOLO is not installed. Falling back to OpenCV Haar "
"Cascades. Install ultralytics with `pip install ultralytics` "
"for better face detection."
)
logger.info("Loading OpenCV Haar Cascade for face protection...")
cascade_path = Path(cv2.__file__).parent / "data" / "haarcascade_frontalface_default.xml"
if not cascade_path.exists():
cascade_path = "haarcascade_frontalface_default.xml"
self.haar_cascade = cv2.CascadeClassifier(str(cascade_path))
def detect_face_bboxes(self, image: np.ndarray) -> list[tuple[int, int, int, int]]:
"""
Detect faces and return bounding boxes as (x1, y1, x2, y2).
"""
if self.use_yolo and self.detector is not None:
# For standard YOLOv8n, 'person' is class 0. We'll use person bounding boxes
# as a proxy for faces/people to protect them. If using a specific face model, adjust classes.
results = self.detector(image, verbose=False, classes=[0])
bboxes = []
for r in results:
boxes = r.boxes
for box in boxes:
x1, y1, x2, y2 = box.xyxy[0].cpu().numpy()
bboxes.append((int(x1), int(y1), int(x2), int(y2)))
return bboxes
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
faces = self.haar_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5, minSize=(30, 30))
bboxes = []
for x, y, w, h in faces:
# Add a 20% margin around the haar cascade face box
margin_x = int(w * 0.2)
margin_y = int(h * 0.2)
x1 = max(0, x - margin_x)
y1 = max(0, y - int(margin_y * 1.5)) # more margin on top for hair
x2 = min(image.shape[1], x + w + margin_x)
y2 = min(image.shape[0], y + h + margin_y)
bboxes.append((x1, y1, x2, y2))
return bboxes
@staticmethod
def _fix_ssl_certs() -> None:
"""Set SSL_CERT_FILE from certifi if not already set (macOS fix)."""
import os
if os.environ.get("SSL_CERT_FILE"):
return
try:
import certifi
os.environ["SSL_CERT_FILE"] = certifi.where()
except ImportError:
pass
def extract_faces(self, image: np.ndarray) -> list[tuple[tuple[int, int, int, int], np.ndarray]]:
"""
Extract faces from the image.
Returns a list of (bbox, face_crop) tuples.
"""
bboxes = self.detect_face_bboxes(image)
faces = []
for bbox in bboxes:
x1, y1, x2, y2 = bbox
faces.append((bbox, image[y1:y2, x1:x2].copy()))
return faces
def restore_faces(
self, processed_image: np.ndarray, original_faces: list[tuple[tuple[int, int, int, int], np.ndarray]]
) -> np.ndarray:
"""
Paste original faces back onto the processed image using seamless cloning
or soft blending so the edges don't show.
"""
if not original_faces:
return processed_image
result = processed_image.copy()
for (x1, y1, x2, y2), face_crop in original_faces:
h, w = face_crop.shape[:2]
# If the processed image was resized, we'd need to resize face_crop, but
# pipeline ensures the output from InvisibleEngine is the same size or we resize it back before this.
if result.shape[:2] != processed_image.shape[:2]:
continue # Safety bypass
try:
# Create a soft alpha mask for the face crop to smoothly blend it
mask = np.zeros((h, w), dtype=np.float32)
# Inner ellipse is pure white
cv2.ellipse(mask, (w // 2, h // 2), (int(w * 0.4), int(h * 0.4)), 0, 0, 360, 1.0, -1)
# Blur the mask heavily for soft edges
blur_size = max(w, h) // 4
if blur_size % 2 == 0:
blur_size += 1
mask = cv2.GaussianBlur(mask, (blur_size, blur_size), 0)
mask = cv2.merge([mask, mask, mask])
# Blend
target_roi = result[y1:y2, x1:x2].astype(np.float32)
src_roi = face_crop.astype(np.float32)
blended = src_roi * mask + target_roi * (1.0 - mask)
result[y1:y2, x1:x2] = blended.astype(np.uint8)
except Exception as e:
logger.warning("Failed to restore face at %d,%d to %d,%d: %s", x1, y1, x2, y2, e)
return result
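
Review note: the commit message says the removed protectors "shielded the watermark by re-using original pixels". The blend in `restore_faces` makes that concrete: wherever the mask is 1.0 the output is the original pixel unchanged, so any signal embedded there is carried through the removal pass. A one-dimensional sketch of the same arithmetic, with made-up pixel values and a hypothetical `soft_blend` helper:

```python
def soft_blend(original, regenerated, mask):
    """Per-pixel blend: original * m + regenerated * (1 - m)."""
    return [o * m + r * (1.0 - m) for o, r, m in zip(original, regenerated, mask)]


original = [100.0, 100.0, 100.0, 100.0, 100.0]   # pixels from the watermarked input
regenerated = [90.0, 90.0, 90.0, 90.0, 90.0]     # pixels after the diffusion pass
mask = [0.0, 0.5, 1.0, 0.5, 0.0]                 # 1.0 at the face centre, feathered edges

out = soft_blend(original, regenerated, mask)
print(out)  # [90.0, 95.0, 100.0, 95.0, 90.0]
```

The centre value equals the original exactly and the feathered band still mixes in half of it. The controlnet pipeline avoids this by regenerating every pixel and using the edge map only as conditioning.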
+24 -63
@@ -7,9 +7,9 @@ This module requires the 'gpu' extra dependencies:
uv pip install 'remove-ai-watermarks[gpu]'
"""
# cv2/torch boundary: this engine wraps cv2 (resize/imwrite/cvtColor), the YOLO
# face protector, and the humanizer, none of which carry usable element types;
# relax the unknown-type rules for this file only.
# cv2/torch boundary: this engine wraps cv2 (resize/imwrite/cvtColor) and the
# humanizer, none of which carry usable element types; relax the unknown-type
# rules for this file only.
# pyright: reportUnknownMemberType=false, reportUnknownArgumentType=false, reportUnknownVariableType=false, reportUnknownParameterType=false, reportMissingTypeArgument=false, reportMissingTypeStubs=false, reportMissingImports=false, reportArgumentType=false, reportAssignmentType=false, reportReturnType=false, reportCallIssue=false, reportIndexIssue=false, reportOperatorIssue=false, reportOptionalMemberAccess=false, reportOptionalCall=false, reportOptionalSubscript=false, reportOptionalOperand=false, reportAttributeAccessIssue=false, reportPrivateImportUsage=false, reportPrivateUsage=false, reportInvalidTypeForm=false, reportConstantRedefinition=false, reportUnnecessaryComparison=false
from __future__ import annotations
@@ -70,12 +70,9 @@ class InvisibleEngine:
to break watermark patterns, and reconstructs via reverse diffusion.
"""
# SDXL base is the default since May 2026; the current Google SynthID is
# removed at strength ~0.30 / steps=50 / native res (oracle-verified, n=3 fresh
# Gemini -- 0.10/0.15/0.2 still detected). See CLAUDE.md "Known limitations" for
# the strength study and the regression evidence ruling out SD-1.5 pipelines.
# SDXL base is the default since May 2026; the vendor-adaptive strength
# removes the current SynthID (see watermark_profiles + docs/synthid.md).
DEFAULT_MODEL_ID = "stabilityai/stable-diffusion-xl-base-1.0"
CTRLREGEN_MODEL_ID = "yepengliu/ctrlregen"
def __init__(
self,
@@ -84,31 +81,33 @@ class InvisibleEngine:
pipeline: str = "default",
hf_token: str | None = None,
progress_callback: Callable[[str], None] | None = None,
controlnet_conditioning_scale: float = 1.0,
) -> None:
"""Initialize the invisible watermark removal engine.
Args:
model_id: HuggingFace model ID. None = use default for pipeline.
model_id: HuggingFace model ID. None = use the SDXL base default.
device: Device for inference (auto/cpu/mps/cuda/xpu). None = auto.
pipeline: Pipeline profile. "default" (SDXL base, defeats SynthID
v2) or "ctrlregen" (CtrlRegen).
pipeline: Pipeline profile. "default" (plain SDXL img2img) or
"controlnet" (SDXL + canny ControlNet that preserves text/face
structure via edge conditioning while removing SynthID).
hf_token: HuggingFace API token.
progress_callback: Optional callback for progress messages.
controlnet_conditioning_scale: ControlNet structure-preservation
strength (controlnet pipeline only).
"""
from remove_ai_watermarks.noai.watermark_remover import WatermarkRemover
effective_model = model_id
if pipeline == "ctrlregen" and model_id is None:
effective_model = self.CTRLREGEN_MODEL_ID
elif model_id is None:
effective_model = self.DEFAULT_MODEL_ID
effective_model = model_id or self.DEFAULT_MODEL_ID
self._remover = WatermarkRemover(
model_id=effective_model,
device=device,
progress_callback=progress_callback,
hf_token=hf_token,
pipeline=pipeline,
controlnet_conditioning_scale=controlnet_conditioning_scale,
)
self._progress_callback = progress_callback
@@ -125,8 +124,6 @@ class InvisibleEngine:
guidance_scale: float | None = None,
seed: int | None = None,
humanize: float = 0.0,
protect_faces: bool = False,
protect_text: bool = False,
max_resolution: int = 0,
vendor: str | None = None,
) -> Path:
@@ -135,16 +132,12 @@ class InvisibleEngine:
Args:
image_path: Path to the watermarked image.
output_path: Output path (None = overwrite source).
strength: Denoising strength (0.0-1.0). None -> profile default
(0.10 for SDXL, 1.0 clean-noise for ctrlregen).
strength: Denoising strength (0.0-1.0). None -> the vendor-adaptive
default.
steps: Number of denoising steps.
guidance_scale: Classifier-free guidance scale.
seed: Random seed for reproducibility.
humanize: Intensity of Analog Humanizer film grain (0 = off).
protect_faces: Boolean to extract and restore faces intact.
protect_text: Detect text regions and preserve them via Differential
Diffusion when any are found, so glyphs (incl. CJK) survive the
removal pass. On by default; the detector decides per image.
max_resolution: Cap the long side (px) before diffusion. 0 (default)
= native resolution, no pre-downscale -- matches the hosted
raiw.cc backend. Set a positive value only to bound GPU/MPS
@@ -189,27 +182,6 @@ class InvisibleEngine:
image_path = _tmp_path
try:
# Optional: Face protection (Phase 1 - Extraction)
original_faces = []
if protect_faces:
try:
import cv2
from remove_ai_watermarks.face_protector import FaceProtector
if self._progress_callback:
self._progress_callback("Detecting and extracting faces (protect-faces)...")
# Convert PIL to CV2 BGR
import numpy as np
cv_img = cv2.cvtColor(np.array(image), cv2.COLOR_RGB2BGR)
protector = FaceProtector(use_yolo=True)
original_faces = protector.extract_faces(cv_img)
if self._progress_callback:
self._progress_callback(f"Extracted {len(original_faces)} face(s) for protection.")
except Exception as e:
logger.error("Failed to extract faces: %s", e)
out_path = self._remover.remove_watermark(
image_path=image_path,
output_path=output_path,
@@ -217,14 +189,12 @@ class InvisibleEngine:
num_inference_steps=num_inference_steps,
guidance_scale=guidance_scale,
seed=seed,
protect_text=protect_text,
vendor=vendor,
)
# Optional: Face restoration & Humanizer (Phase 2 - Post-processing)
if protect_faces or humanize > 0.0:
# Post-processing: optional Humanizer, then restore original resolution.
if humanize > 0.0:
import cv2
import numpy as np
from remove_ai_watermarks import image_io
@@ -232,20 +202,11 @@ class InvisibleEngine:
if out_cv is None:
return out_path
if protect_faces and original_faces:
if self._progress_callback:
self._progress_callback("Restoring protected faces with soft blending...")
from remove_ai_watermarks.face_protector import FaceProtector
if self._progress_callback:
self._progress_callback(f"Applying Analog Humanizer (grain: {humanize})...")
from remove_ai_watermarks.humanizer import apply_analog_humanizer
protector = FaceProtector(use_yolo=True)
out_cv = protector.restore_faces(out_cv, original_faces)
if humanize > 0.0:
if self._progress_callback:
self._progress_callback(f"Applying Analog Humanizer (grain: {humanize})...")
from remove_ai_watermarks.humanizer import apply_analog_humanizer
out_cv = apply_analog_humanizer(out_cv, grain_intensity=humanize, chromatic_shift=1)
out_cv = apply_analog_humanizer(out_cv, grain_intensity=humanize, chromatic_shift=1)
# Restore original resolution
if (out_cv.shape[1], out_cv.shape[0]) != orig_size:
@@ -259,7 +220,7 @@ class InvisibleEngine:
image_io.imwrite(out_path, out_cv)
else:
# Even if no protect_faces or humanize, we must restore original size if needed
# No humanize: still restore the original size if it was capped.
import cv2
from remove_ai_watermarks import image_io
+6 -7
@@ -255,10 +255,9 @@ def has_ai_metadata(image_path: Path) -> bool:
"""
from PIL import Image
# PIL may not handle AVIF/HEIF/JPEG-XL without the optional plugins
# (ultralytics also monkey-patches Image.open in a way that can raise
# ModuleNotFoundError when pi_heif autoload fails), so any open failure
# falls through to the binary scan.
# PIL may not handle AVIF/HEIF/JPEG-XL without the optional plugins, and a
# third-party plugin autoload can raise a non-OSError (e.g. ModuleNotFoundError),
# so any open failure falls through to the binary scan.
try:
with Image.open(image_path) as img:
for key in img.info:
@@ -655,9 +654,9 @@ def get_ai_metadata(image_path: Path) -> dict[str, str]:
result: dict[str, str] = {}
# PIL may not open AVIF/HEIF/JPEG-XL without optional plugins (and
# ultralytics' Image.open patch can raise ModuleNotFoundError); fall through
# to the C2PA/binary path on any open failure. See CLAUDE.md.
# PIL may not open AVIF/HEIF/JPEG-XL without optional plugins (and a
# third-party plugin autoload can raise a non-OSError); fall through to the
# C2PA/binary path on any open failure. See CLAUDE.md.
try:
with Image.open(image_path) as img:
for key, value in img.info.items():
@@ -1,18 +0,0 @@
"""CtrlRegen watermark removal via controllable regeneration.
Implements the pipeline from "Image Watermarks Are Removable Using
Controllable Regeneration from Clean Noise" (ICLR 2025) by Liu et al.
This sub-package uses a ControlNet for spatial guidance (canny edges)
and a DINOv2-based IP Adapter for semantic guidance to regenerate
watermarked images from partially noised latents.
Attribution:
Based on https://github.com/yepengliu/CtrlRegen .
"""
from __future__ import annotations
from remove_ai_watermarks.noai.ctrlregen.engine import CtrlRegenEngine, is_ctrlregen_available
__all__ = ["CtrlRegenEngine", "is_ctrlregen_available"]
@@ -1,40 +0,0 @@
"""Color matching post-processing for CtrlRegen output.
After diffusion-based regeneration, the output image may have slight
color shifts. This module uses histogram-based color transfer to
align the regenerated image's color distribution back to the original.
Attribution:
Adapted from https://github.com/yepengliu/CtrlRegen .
"""
from __future__ import annotations
import numpy as np
from color_matcher import ColorMatcher
from color_matcher.normalizer import Normalizer
from PIL import Image
def color_match(reference: Image.Image, source: Image.Image) -> Image.Image:
"""Transfer the color distribution of *reference* onto *source*.
Uses a two-pass histogram matching approach (``hm-mkl-hm``) that
preserves fine-grained color relationships while correcting global
shifts introduced by the regeneration pipeline.
Args:
reference: The original (watermarked) image whose colors should
be preserved.
source: The regenerated image whose colors will be adjusted.
Returns:
A new PIL Image with the structure of *source* but the color
palette of *reference*.
"""
cm = ColorMatcher()
ref_np = Normalizer(np.asarray(reference)).type_norm()
src_np = Normalizer(np.asarray(source)).type_norm()
result = cm.transfer(src=src_np, ref=ref_np, method="hm-mkl-hm")
result = Normalizer(result).uint8_norm()
return Image.fromarray(result)
@@ -1,365 +0,0 @@
"""CtrlRegen engine — orchestrates the full watermark removal pipeline.
Loads the base SD 1.5 model with a ControlNet (spatial control from
canny edges) and a DINOv2-based IP Adapter (semantic control), then
runs controllable regeneration with optional color matching.
Attribution:
Based on https://github.com/yepengliu/CtrlRegen .
"""
from __future__ import annotations
import contextlib
import logging
import os
import sys
import time
from typing import TYPE_CHECKING, Any
if TYPE_CHECKING:
from collections.abc import Callable
import torch
from PIL import Image
from remove_ai_watermarks.noai.progress import make_pipeline_progress
logger = logging.getLogger(__name__)
# ---------------------------------------------------------------------------
# Availability checks — these imports are optional.
# ---------------------------------------------------------------------------
_HAS_CONTROLNET_AUX = False
_HAS_COLOR_MATCHER = False
_HAS_DIFFUSERS = False
try:
from diffusers import AutoencoderKL, ControlNetModel, UniPCMultistepScheduler
from remove_ai_watermarks.noai.ctrlregen.pipeline import CustomCtrlRegenPipeline
_HAS_DIFFUSERS = True
except ImportError:
AutoencoderKL = None # type: ignore[assignment,misc]
ControlNetModel = None # type: ignore[assignment,misc]
UniPCMultistepScheduler = None # type: ignore[assignment,misc]
CustomCtrlRegenPipeline = None # type: ignore[assignment,misc]
try:
from controlnet_aux import CannyDetector
_HAS_CONTROLNET_AUX = True
except ImportError:
CannyDetector = None # type: ignore[assignment,misc]
try:
from remove_ai_watermarks.noai.ctrlregen.color import color_match
_HAS_COLOR_MATCHER = True
except ImportError:
color_match = None # type: ignore[assignment]
CTRLREGEN_HF_REPO = "yepengliu/ctrlregen"
SPATIAL_SUBFOLDER = "spatialnet_ckp/spatial_control_ckp_14000"
SEMANTIC_SUBFOLDER = "semanticnet_ckp/models"
SEMANTIC_WEIGHT_NAME = "semantic_control_ckp_435000.bin"
DEFAULT_BASE_MODEL = "SG161222/Realistic_Vision_V4.0_noVAE"
CUSTOM_VAE_ID = "stabilityai/sd-vae-ft-mse"
PROCESS_SIZE = 512
DEFAULT_GUIDANCE_SCALE = 2.0
QUALITY_PROMPT = "best quality, high quality"
NEGATIVE_PROMPT = "monochrome, lowres, bad anatomy, worst quality, low quality"
CANNY_LOW_THRESHOLD = 100
CANNY_HIGH_THRESHOLD = 150
TILE_SIZE = 512
TILE_OVERLAP = 192
def is_ctrlregen_available() -> bool:
"""Return True when all CtrlRegen-specific dependencies are installed."""
return _HAS_DIFFUSERS and _HAS_CONTROLNET_AUX and _HAS_COLOR_MATCHER
class CtrlRegenEngine:
"""End-to-end CtrlRegen watermark removal engine.
Handles model loading, canny edge extraction, controlled denoising,
and color-matched post-processing in a single ``run()`` call.
"""
def __init__(
self,
base_model_id: str | None = None,
device: str = "cpu",
torch_dtype: torch.dtype | None = None,
hf_token: str | None = None,
progress_callback: Callable[[str], None] | None = None,
) -> None:
if not is_ctrlregen_available():
missing: list[str] = []
if not _HAS_DIFFUSERS:
missing.extend(["diffusers", "transformers", "accelerate"])
if not _HAS_CONTROLNET_AUX:
missing.append("controlnet-aux")
if not _HAS_COLOR_MATCHER:
missing.append("color-matcher")
logger.info("Auto-installing missing dependencies: %s", missing)
import subprocess
try:
subprocess.check_call(
[sys.executable, "-m", "pip", "install", *missing],
stdout=subprocess.DEVNULL,
stderr=subprocess.DEVNULL,
)
except (subprocess.CalledProcessError, FileNotFoundError) as exc:
raise ImportError(
"Failed to auto-install missing dependencies: "
+ ", ".join(missing)
+ ". Try manually: pip install --force-reinstall noai-watermark"
) from exc
self.base_model_id = base_model_id or DEFAULT_BASE_MODEL
self.device = device
self.torch_dtype = torch_dtype or (torch.float32 if device in ("cpu", "mps") else torch.float16)
self.hf_token: str | None = hf_token or os.environ.get("HF_TOKEN")
self._progress_callback = progress_callback
self._pipeline: CustomCtrlRegenPipeline | None = None # type: ignore[assignment]
self._canny_detector: CannyDetector | None = None # type: ignore[assignment]
def _set_progress(self, message: str) -> None:
if self._progress_callback is None:
return
with contextlib.suppress(Exception):
self._progress_callback(message)
# ------------------------------------------------------------------
# Loading
# ------------------------------------------------------------------
def load(self) -> None:
"""Download and assemble the full CtrlRegen pipeline."""
if self._pipeline is not None:
return
token_kwargs: dict[str, Any] = {}
if self.hf_token:
token_kwargs["token"] = self.hf_token
self._set_progress(f"Loading CtrlRegen spatial ControlNet from {CTRLREGEN_HF_REPO}...")
logger.info("Loading ControlNet from %s/%s", CTRLREGEN_HF_REPO, SPATIAL_SUBFOLDER)
controlnet = [
ControlNetModel.from_pretrained(
CTRLREGEN_HF_REPO,
subfolder=SPATIAL_SUBFOLDER,
torch_dtype=self.torch_dtype,
**token_kwargs,
)
]
self._set_progress(f"Loading SD base model ({self.base_model_id}) for CtrlRegen pipeline...")
logger.info("Loading base pipeline from %s", self.base_model_id)
pipe = CustomCtrlRegenPipeline.from_pretrained(
self.base_model_id,
controlnet=controlnet,
torch_dtype=self.torch_dtype,
safety_checker=None,
requires_safety_checker=False,
**token_kwargs,
)
self._set_progress(f"Loading CtrlRegen semantic IP-Adapter + DINOv2 from {CTRLREGEN_HF_REPO}...")
logger.info("Loading IP-Adapter from %s/%s", CTRLREGEN_HF_REPO, SEMANTIC_SUBFOLDER)
pipe.load_ctrlregen_ip_adapter(
CTRLREGEN_HF_REPO,
subfolder=SEMANTIC_SUBFOLDER,
weight_name=SEMANTIC_WEIGHT_NAME,
**token_kwargs,
)
from transformers import AutoImageProcessor, AutoModel
pipe.image_encoder = AutoModel.from_pretrained("facebook/dinov2-giant").to(self.device, dtype=self.torch_dtype)
pipe.feature_extractor = AutoImageProcessor.from_pretrained("facebook/dinov2-giant")
self._set_progress(f"Loading custom VAE ({CUSTOM_VAE_ID})...")
logger.info("Loading VAE from %s", CUSTOM_VAE_ID)
pipe.vae = AutoencoderKL.from_pretrained(
CUSTOM_VAE_ID,
torch_dtype=self.torch_dtype,
**token_kwargs,
)
self._set_progress("Configuring UniPC scheduler...")
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
pipe.set_ip_adapter_scale(1.0)
self._set_progress(f"Moving CtrlRegen pipeline to {self.device}...")
pipe = pipe.to(self.device)
if hasattr(pipe, "enable_xformers_memory_efficient_attention"):
with contextlib.suppress(Exception):
pipe.enable_xformers_memory_efficient_attention()
self._pipeline = pipe
self._canny_detector = CannyDetector()
self._set_progress("CtrlRegen pipeline ready.")
logger.info("CtrlRegen pipeline loaded on %s", self.device)
# ------------------------------------------------------------------
# Inference — public entry point
# ------------------------------------------------------------------
def run(
self,
image: Image.Image,
strength: float = 0.5,
num_inference_steps: int = 50,
guidance_scale: float = DEFAULT_GUIDANCE_SCALE,
seed: int | None = None,
) -> Image.Image:
"""Run CtrlRegen watermark removal on a single image.
Images that fit within ``TILE_SIZE`` (512) are processed as a
single pass. Larger images are split into overlapping tiles.
"""
self.load()
assert self._pipeline is not None
assert self._canny_detector is not None
orig_w, orig_h = image.size
orig_image = image
t0 = time.monotonic()
needs_tiling = orig_w > TILE_SIZE or orig_h > TILE_SIZE
if needs_tiling:
from remove_ai_watermarks.noai.ctrlregen.tiling import resize_center_crop, run_tiled
aligned_w = orig_w // 8 * 8
aligned_h = orig_h // 8 * 8
if aligned_w != orig_w or aligned_h != orig_h:
image = image.resize((aligned_w, aligned_h), Image.LANCZOS)
regen_image = run_tiled(
pipeline=self._pipeline,
canny_detector=self._canny_detector,
image=image,
strength=strength,
num_inference_steps=num_inference_steps,
guidance_scale=guidance_scale,
seed=seed,
tile_size=TILE_SIZE,
tile_overlap=TILE_OVERLAP,
quality_prompt=QUALITY_PROMPT,
negative_prompt=NEGATIVE_PROMPT,
canny_low=CANNY_LOW_THRESHOLD,
canny_high=CANNY_HIGH_THRESHOLD,
device=self.device,
set_progress=self._set_progress,
ip_adapter_image=orig_image,
)
else:
from remove_ai_watermarks.noai.ctrlregen.tiling import resize_center_crop
proc_image = resize_center_crop(image, PROCESS_SIZE)
self._set_progress(f"Preprocessed {orig_w}x{orig_h}px → {proc_image.size[0]}x{proc_image.size[1]}px")
regen_image = self._run_single(
proc_image,
strength,
num_inference_steps,
guidance_scale,
seed,
)
if regen_image.size != (orig_w, orig_h):
self._set_progress(f"Resizing {regen_image.size[0]}x{regen_image.size[1]}px → {orig_w}x{orig_h}px...")
regen_image = regen_image.resize((orig_w, orig_h), Image.LANCZOS)
self._set_progress(f"Applying color matching at {orig_w}x{orig_h}px...")
output = color_match(reference=orig_image, source=regen_image)
self._set_progress(f"✓ CtrlRegen done · {orig_w}x{orig_h}px · {time.monotonic() - t0:.0f}s total")
return output
# ------------------------------------------------------------------
# Single-image path (image <= 512x512)
# ------------------------------------------------------------------
def _run_single(
self,
image: Image.Image,
strength: float,
num_inference_steps: int,
guidance_scale: float,
seed: int | None,
) -> Image.Image:
"""Process a single 512x512 image through the CtrlRegen pipeline."""
w, h = image.size
effective_steps = max(1, int(num_inference_steps * strength))
self._set_progress(
f"Extracting canny edges ({w}x{h}px, thresholds {CANNY_LOW_THRESHOLD}/{CANNY_HIGH_THRESHOLD})..."
)
control_image = self._canny_detector(
image,
low_threshold=CANNY_LOW_THRESHOLD,
high_threshold=CANNY_HIGH_THRESHOLD,
)
generator = torch.manual_seed(seed if seed is not None else 0)
self._set_progress(
f"Config: strength={strength}, steps={num_inference_steps} "
f"(~{effective_steps} effective), guidance={guidance_scale}"
)
step_cb, first_step, pipeline_done, start_updater = make_pipeline_progress(
effective_steps,
self.device,
self._set_progress,
label="CtrlRegen denoising",
)
start_updater()
try:
result = self._pipeline(
prompt=QUALITY_PROMPT,
negative_prompt=NEGATIVE_PROMPT,
image=[image],
control_image=[control_image],
controlnet_conditioning_scale=1.0,
ip_adapter_image=[image],
strength=strength,
num_inference_steps=num_inference_steps,
guidance_scale=guidance_scale,
generator=generator,
control_guidance_start=0.0,
control_guidance_end=1.0,
callback=step_cb,
callback_steps=1,
)
except TypeError:
first_step.set()
result = self._pipeline(
prompt=QUALITY_PROMPT,
negative_prompt=NEGATIVE_PROMPT,
image=[image],
control_image=[control_image],
controlnet_conditioning_scale=1.0,
ip_adapter_image=[image],
strength=strength,
num_inference_steps=num_inference_steps,
guidance_scale=guidance_scale,
generator=generator,
)
finally:
first_step.set()
pipeline_done.set()
return result.images[0]
@@ -1,149 +0,0 @@
"""Custom IP-Adapter mixin using DINOv2 as the image encoder.
The standard diffusers ``IPAdapterMixin`` uses a CLIP image encoder.
CtrlRegen replaces it with ``facebook/dinov2-giant`` for richer
semantic features. This mixin provides ``load_ctrlregen_ip_adapter``
which handles the custom weight format and encoder swap.
Attribution:
Adapted from https://github.com/yepengliu/CtrlRegen .
"""
from __future__ import annotations
import logging
from typing import Any
import torch
from diffusers.models.modeling_utils import _LOW_CPU_MEM_USAGE_DEFAULT
from diffusers.utils import (
_get_model_file,
is_accelerate_available,
is_torch_version,
)
from diffusers.utils import (
logging as diffusers_logging,
)
from huggingface_hub.utils import validate_hf_hub_args
from safetensors import safe_open
from transformers import AutoImageProcessor, AutoModel
logger = logging.getLogger(__name__)
_diffusers_logger = diffusers_logging.get_logger(__name__)
DINOV2_MODEL_ID = "facebook/dinov2-giant"
class CustomIPAdapterMixin:
"""Mixin that adds ``load_ctrlregen_ip_adapter`` to a diffusers pipeline."""
@validate_hf_hub_args
def load_ctrlregen_ip_adapter(
self,
pretrained_model_name_or_path_or_dict: str | list[str] | dict[str, torch.Tensor],
subfolder: str | list[str],
weight_name: str | list[str],
image_encoder_folder: str | None = "image_encoder",
**kwargs: Any,
) -> None:
"""Load CtrlRegen IP-Adapter weights and DINOv2 image encoder.
Parameters mirror ``IPAdapterMixin.load_ip_adapter`` but the
image encoder is always ``facebook/dinov2-giant`` regardless of
the ``image_encoder_folder`` value in the checkpoint.
"""
if not isinstance(weight_name, list):
weight_name = [weight_name]
if not isinstance(pretrained_model_name_or_path_or_dict, list):
pretrained_model_name_or_path_or_dict = [pretrained_model_name_or_path_or_dict]
if len(pretrained_model_name_or_path_or_dict) == 1:
pretrained_model_name_or_path_or_dict = pretrained_model_name_or_path_or_dict * len(weight_name)
if not isinstance(subfolder, list):
subfolder = [subfolder]
if len(subfolder) == 1:
subfolder = subfolder * len(weight_name)
if len(weight_name) != len(pretrained_model_name_or_path_or_dict):
raise ValueError("`weight_name` and `pretrained_model_name_or_path_or_dict` must have the same length.")
if len(weight_name) != len(subfolder):
raise ValueError("`weight_name` and `subfolder` must have the same length.")
cache_dir = kwargs.pop("cache_dir", None)
force_download = kwargs.pop("force_download", False)
kwargs.pop("resume_download", None)
proxies = kwargs.pop("proxies", None)
local_files_only = kwargs.pop("local_files_only", None)
token = kwargs.pop("token", None)
revision = kwargs.pop("revision", None)
low_cpu_mem_usage = kwargs.pop("low_cpu_mem_usage", _LOW_CPU_MEM_USAGE_DEFAULT)
if low_cpu_mem_usage and not is_accelerate_available():
low_cpu_mem_usage = False
_diffusers_logger.warning(
"Cannot initialize model with low cpu memory usage because "
"`accelerate` was not found. Defaulting to "
"`low_cpu_mem_usage=False`."
)
if low_cpu_mem_usage is True and not is_torch_version(">=", "1.9.0"):
raise NotImplementedError("Low memory initialization requires torch >= 1.9.0.")
user_agent = {
"file_type": "attn_procs_weights",
"framework": "pytorch",
}
state_dicts: list[dict] = []
for path_or_dict, wn, sf in zip(pretrained_model_name_or_path_or_dict, weight_name, subfolder, strict=False):
if not isinstance(path_or_dict, dict):
model_file = _get_model_file(
path_or_dict,
weights_name=wn,
cache_dir=cache_dir,
force_download=force_download,
proxies=proxies,
local_files_only=local_files_only,
token=token,
revision=revision,
subfolder=sf,
user_agent=user_agent,
)
if wn.endswith(".safetensors"):
state_dict: dict = {"image_proj": {}, "ip_adapter": {}}
with safe_open(model_file, framework="pt", device="cpu") as f:
for key in f.keys(): # noqa: SIM118
if key.startswith("image_proj."):
state_dict["image_proj"][key.replace("image_proj.", "")] = f.get_tensor(key)
elif key.startswith("ip_adapter."):
state_dict["ip_adapter"][key.replace("ip_adapter.", "")] = f.get_tensor(key)
else:
state_dict = torch.load(model_file, map_location="cpu")
else:
state_dict = path_or_dict
keys = list(state_dict.keys())
if keys != ["image_proj", "ip_adapter"]:
raise ValueError("Required keys (`image_proj` and `ip_adapter`) missing from the state dict.")
state_dicts.append(state_dict)
# Always use DINOv2-giant as the image encoder.
has_encoder_attr = hasattr(self, "image_encoder") and getattr(self, "image_encoder", None) is None
if has_encoder_attr and image_encoder_folder is not None:
logger.info("Loading DINOv2-giant image encoder for CtrlRegen")
enc_dtype = getattr(self, "dtype", torch.float32) # type: ignore[attr-defined]
image_encoder = AutoModel.from_pretrained(DINOV2_MODEL_ID).to(
self.device,
dtype=enc_dtype, # type: ignore[attr-defined]
)
self.register_modules(image_encoder=image_encoder) # type: ignore[attr-defined]
if hasattr(self, "feature_extractor") and getattr(self, "feature_extractor", None) is None:
feature_extractor = AutoImageProcessor.from_pretrained(DINOV2_MODEL_ID)
self.register_modules(feature_extractor=feature_extractor) # type: ignore[attr-defined]
unet = (
getattr(self, self.unet_name) # type: ignore[attr-defined]
if not hasattr(self, "unet")
else self.unet # type: ignore[attr-defined]
)
unet._load_ip_adapter_weights(state_dicts, low_cpu_mem_usage=low_cpu_mem_usage)
@@ -1,35 +0,0 @@
"""Custom Stable Diffusion ControlNet Img2Img pipeline for CtrlRegen.
Extends ``StableDiffusionControlNetImg2ImgPipeline`` with the
``load_ctrlregen_ip_adapter`` method (via ``CustomIPAdapterMixin``)
that swaps in DINOv2-giant as the image encoder and loads the
CtrlRegen semantic-control adapter weights.
No ``encode_image`` override is needed — the CtrlRegen checkpoint
creates an ``IPAdapterPlusImageProjection`` which tells diffusers to
call ``encode_image`` with ``output_hidden_states=True``. The
default implementation then uses ``hidden_states[-2]`` from DINOv2,
which is exactly what the projection was trained on.
Attribution:
Adapted from https://github.com/yepengliu/CtrlRegen .
"""
from __future__ import annotations
from diffusers import StableDiffusionControlNetImg2ImgPipeline
from remove_ai_watermarks.noai.ctrlregen.ip_adapter import CustomIPAdapterMixin
class CustomCtrlRegenPipeline(
StableDiffusionControlNetImg2ImgPipeline,
CustomIPAdapterMixin,
):
"""SD ControlNet Img2Img pipeline with DINOv2 IP-Adapter support.
MRO mirrors the original CtrlRegen repository: the base diffusers
pipeline comes first so all standard methods are resolved from it,
while ``CustomIPAdapterMixin`` only adds the
``load_ctrlregen_ip_adapter`` method.
"""
@@ -1,179 +0,0 @@
"""Tile-based processing for large images in the CtrlRegen pipeline.
Extracted from ``ctrlregen.engine`` to keep the engine focused on
single-image inference and model orchestration.
"""
from __future__ import annotations
import math
import time
from typing import TYPE_CHECKING, Any
if TYPE_CHECKING:
from collections.abc import Callable
import numpy as np
import torch
from PIL import Image
def tile_positions(total: int, tile: int, overlap: int) -> list[int]:
"""Compute evenly-spaced tile start positions covering *total* pixels."""
if not (0 <= overlap < tile):
raise ValueError(f"overlap must satisfy 0 <= overlap < tile (got overlap={overlap}, tile={tile})")
if total <= tile:
return [0]
n = max(2, math.ceil((total - overlap) / (tile - overlap)))
stride = (total - tile) / (n - 1)
return [round(i * stride) for i in range(n)]
def make_blend_weight(h: int, w: int, overlap: int) -> np.ndarray:
"""2-D weight mask: 1.0 in center, cosine ramp in overlap margins."""
wy = np.ones(h, dtype=np.float64)
wx = np.ones(w, dtype=np.float64)
if overlap > 0:
ramp = 0.5 - 0.5 * np.cos(np.linspace(0, np.pi, overlap))
wy[:overlap] = np.minimum(wy[:overlap], ramp)
wy[-overlap:] = np.minimum(wy[-overlap:], ramp[::-1])
wx[:overlap] = np.minimum(wx[:overlap], ramp)
wx[-overlap:] = np.minimum(wx[-overlap:], ramp[::-1])
return np.outer(wy, wx)
def resize_center_crop(image: Image.Image, size: int = 512) -> Image.Image:
"""Resize shortest edge to *size*, then center-crop to a square.
Matches the ``transforms.Resize(512) + CenterCrop(512)`` pipeline
used in the original CtrlRegen repository.
"""
w, h = image.size
short = min(w, h)
scale = size / short
new_w, new_h = round(w * scale), round(h * scale)
image = image.resize((new_w, new_h), Image.BILINEAR)
left = (new_w - size) // 2
top = (new_h - size) // 2
return image.crop((left, top, left + size, top + size))
def run_tiled(
pipeline: Any,
canny_detector: Any,
image: Image.Image,
strength: float,
num_inference_steps: int,
guidance_scale: float,
seed: int | None,
*,
tile_size: int,
tile_overlap: int,
quality_prompt: str,
negative_prompt: str,
canny_low: int,
canny_high: int,
device: str,
set_progress: Callable[[str], None],
ip_adapter_image: Image.Image | None = None,
) -> Image.Image:
"""Split a large image into overlapping tiles, process each, blend."""
w, h = image.size
xs = tile_positions(w, tile_size, tile_overlap)
ys = tile_positions(h, tile_size, tile_overlap)
n_tiles = len(xs) * len(ys)
grid = f"{len(xs)}x{len(ys)}"
effective_steps = max(1, int(num_inference_steps * strength))
set_progress(f"Tiling {w}x{h}px → {n_tiles} tiles ({grid} grid, {tile_size}px, overlap {tile_overlap}px)")
canvas = np.zeros((h, w, 3), dtype=np.float64)
weight_sum = np.zeros((h, w), dtype=np.float64)
blend_w = make_blend_weight(tile_size, tile_size, tile_overlap)
t0 = time.monotonic()
bar_len = 20
tile_idx = 0
for ty in ys:
for tx in xs:
tile_idx += 1
prefix = f"[Tile {tile_idx}/{n_tiles}]"
tile = image.crop((tx, ty, tx + tile_size, ty + tile_size))
set_progress(f"{prefix} Extracting canny edges...")
control = canny_detector(
tile,
low_threshold=canny_low,
high_threshold=canny_high,
)
gen = None
if seed is not None:
gen = torch.Generator(device=device).manual_seed(seed + tile_idx)
tile_t0 = time.monotonic()
def _make_cb(
_prefix: str = prefix,
_t0: float = tile_t0,
_es: int = effective_steps,
) -> Callable:
def _cb(step: int, timestep: int, latents: Any) -> None:
elapsed = time.monotonic() - _t0
cur = step + 1
per = elapsed / max(1, cur)
rem = per * max(0, _es - cur)
filled = int(bar_len * cur / max(1, _es))
bar = "█" * filled + "░" * (bar_len - filled)
set_progress(f"{_prefix} [{bar}] {cur}/{_es} | {elapsed:.0f}s, ~{rem:.0f}s left")
return _cb
sem_image = ip_adapter_image if ip_adapter_image is not None else tile
try:
result = pipeline(
prompt=quality_prompt,
negative_prompt=negative_prompt,
image=[tile],
control_image=[control],
controlnet_conditioning_scale=1.0,
ip_adapter_image=[sem_image],
strength=strength,
num_inference_steps=num_inference_steps,
guidance_scale=guidance_scale,
generator=gen,
control_guidance_start=0.0,
control_guidance_end=1.0,
callback=_make_cb(),
callback_steps=1,
)
except TypeError:
result = pipeline(
prompt=quality_prompt,
negative_prompt=negative_prompt,
image=[tile],
control_image=[control],
controlnet_conditioning_scale=1.0,
ip_adapter_image=[sem_image],
strength=strength,
num_inference_steps=num_inference_steps,
guidance_scale=guidance_scale,
generator=gen,
)
proc_arr = np.array(result.images[0], dtype=np.float64)
th, tw = proc_arr.shape[:2]
mask = blend_w[:th, :tw]
canvas[ty : ty + th, tx : tx + tw] += proc_arr * mask[..., None]
weight_sum[ty : ty + th, tx : tx + tw] += mask
tile_time = time.monotonic() - tile_t0
total_elapsed = time.monotonic() - t0
set_progress(f"{prefix} Done ({tile_time:.0f}s) · Total: {total_elapsed:.0f}s")
set_progress(f"Blending {n_tiles} tiles → {w}x{h}px...")
canvas /= np.maximum(weight_sum[..., None], 1e-8)
return Image.fromarray(np.clip(canvas, 0, 255).astype(np.uint8))
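Review note: the tiling arithmetic in the removed module above can be checked in isolation. The following is a minimal, dependency-free sketch. `tile_positions` is reproduced from the deleted code; `cosine_ramp` is a hypothetical 1-D stand-in for the margin ramp inside `make_blend_weight` (the original builds the 2-D mask with NumPy via `np.linspace` and `np.outer`).

```python
import math


def tile_positions(total: int, tile: int, overlap: int) -> list[int]:
    """Evenly spaced tile start offsets covering `total` pixels (as in the removed module)."""
    if not (0 <= overlap < tile):
        raise ValueError(f"overlap must satisfy 0 <= overlap < tile (got overlap={overlap}, tile={tile})")
    if total <= tile:
        return [0]
    # Smallest tile count whose combined span (minus overlaps) covers the image.
    n = max(2, math.ceil((total - overlap) / (tile - overlap)))
    # Spread the starts evenly so the last tile ends exactly at `total`.
    stride = (total - tile) / (n - 1)
    return [round(i * stride) for i in range(n)]


def cosine_ramp(overlap: int) -> list[float]:
    """Hypothetical helper: 0 -> 1 half-cosine ramp over `overlap` samples.

    Mirrors 0.5 - 0.5 * cos(linspace(0, pi, overlap)) without NumPy.
    """
    if overlap <= 0:
        return []
    if overlap == 1:
        return [0.0]
    return [0.5 - 0.5 * math.cos(math.pi * i / (overlap - 1)) for i in range(overlap)]


positions = tile_positions(1024, 512, 192)
ramp = cosine_ramp(5)
```

With the removed defaults (`TILE_SIZE = 512`, `TILE_OVERLAP = 192`), a 1024 px edge yields three tiles starting at 0, 256 and 512. The ramp begins at exactly 0, so the outermost samples of each tile carry no weight; the original guards the final division with `np.maximum(weight_sum, 1e-8)`.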
+18 -115
@@ -29,8 +29,14 @@ def run_img2img(
generator: Any,
device: str,
set_progress: Callable[[str], None],
extra_kwargs: dict[str, Any] | None = None,
) -> Image.Image:
"""Execute img2img with live progress and return the generated image."""
"""Execute img2img with live progress and return the generated image.
``extra_kwargs`` overlays additional pipeline arguments (e.g. the ControlNet
``control_image`` / ``controlnet_conditioning_scale`` and a non-empty prompt),
so a ControlNet img2img pass reuses the same progress + fallback machinery.
"""
effective_steps = max(1, int(num_inference_steps * strength))
step_cb, first_step, done_ev, start_updater = make_pipeline_progress(
@@ -42,26 +48,14 @@ def run_img2img(
try:
result = _call_pipeline(
pipeline,
image,
strength,
num_inference_steps,
guidance_scale,
generator,
step_cb,
pipeline, image, strength, num_inference_steps, guidance_scale, generator, step_cb, extra_kwargs
)
done_ev.set()
return result.images[0]
except TypeError:
first_step.set()
result = _call_pipeline(
pipeline,
image,
strength,
num_inference_steps,
guidance_scale,
generator,
None,
pipeline, image, strength, num_inference_steps, guidance_scale, generator, None, extra_kwargs
)
done_ev.set()
return result.images[0]
@@ -81,11 +75,13 @@ def run_img2img_with_mps_fallback(
set_progress: Callable[[str], None],
*,
reload_on_cpu: Callable[[], Any],
extra_kwargs: dict[str, Any] | None = None,
) -> tuple[Image.Image, str]:
"""Run img2img; on MPS error, fall back to CPU.
Returns:
(result_image, final_device) — device may change to ``"cpu"`` on fallback.
``extra_kwargs`` overlays extra pipeline arguments (used by the ControlNet
path). Returns ``(result_image, final_device)`` — device may change to
``"cpu"`` on fallback.
"""
pipeline = load_pipeline()
@@ -99,6 +95,7 @@ def run_img2img_with_mps_fallback(
generator,
device,
set_progress,
extra_kwargs,
)
return img, device
except RuntimeError as error:
@@ -108,104 +105,7 @@ def run_img2img_with_mps_fallback(
_try_clear_mps_cache()
pipeline = reload_on_cpu()
img = run_img2img(
pipeline,
image,
strength,
num_inference_steps,
guidance_scale,
None,
"cpu",
set_progress,
)
return img, "cpu"
raise
def run_differential(
pipeline: Any,
image: Image.Image,
change_map: Any,
strength: float,
num_inference_steps: int,
guidance_scale: float,
generator: Any,
device: str,
set_progress: Callable[[str], None],
) -> Image.Image:
"""Run the SDXL Differential-Diffusion pipeline and return the image.
Unlike standard img2img, the differential pipeline needs pre-processed image
tensors plus a per-pixel change map (HxW float32 in [0, 1]); white preserves
the original pixels, black regenerates them. Runs without a step callback --
the community pipeline's callback signature differs across diffusers
versions, and a protect-text pass is short.
"""
import torch
image_tensor = pipeline.image_processor.preprocess(image).to(device)
map_tensor = torch.from_numpy(change_map)[None].to(device) # pyright: ignore[reportPrivateImportUsage, reportUnknownMemberType]
set_progress(f"Running protected regeneration ({device}, strength={strength})...")
result = pipeline(
prompt="",
image=image_tensor,
original_image=image_tensor,
map=map_tensor,
strength=strength,
num_inference_steps=num_inference_steps,
guidance_scale=guidance_scale,
generator=generator,
)
return result.images[0]
def run_differential_with_mps_fallback(
load_pipeline: Callable[[], Any],
image: Image.Image,
change_map: Any,
strength: float,
num_inference_steps: int,
guidance_scale: float,
generator: Any,
device: str,
set_progress: Callable[[str], None],
*,
reload_on_cpu: Callable[[], Any],
) -> tuple[Image.Image, str]:
"""Run differential img2img; on MPS error, fall back to CPU.
Returns:
(result_image, final_device) -- device may change to ``"cpu"`` on fallback.
"""
pipeline = load_pipeline()
try:
img = run_differential(
pipeline,
image,
change_map,
strength,
num_inference_steps,
guidance_scale,
generator,
device,
set_progress,
)
return img, device
except RuntimeError as error:
if device == "mps" and is_mps_error(error):
logger.warning("MPS error detected: %s. Falling back to CPU.", error)
set_progress("MPS error! Clearing cache and retrying on CPU...")
_try_clear_mps_cache()
pipeline = reload_on_cpu()
img = run_differential(
pipeline,
image,
change_map,
strength,
num_inference_steps,
guidance_scale,
None,
"cpu",
set_progress,
pipeline, image, strength, num_inference_steps, guidance_scale, None, "cpu", set_progress, extra_kwargs
)
return img, "cpu"
raise
@@ -219,6 +119,7 @@ def _call_pipeline(
guidance_scale: float,
generator: Any,
step_callback: Any,
extra_kwargs: dict[str, Any] | None = None,
) -> Any:
kwargs: dict[str, Any] = {
"prompt": "",
@@ -228,6 +129,8 @@ def _call_pipeline(
"guidance_scale": guidance_scale,
"generator": generator,
}
if extra_kwargs:
kwargs.update(extra_kwargs)
if step_callback is not None:
kwargs["callback"] = step_callback
kwargs["callback_steps"] = 1
@@ -1,4 +1,4 @@
"""Watermark removal model profiles, the default strength, and profile detection.
"""Watermark removal model profiles and the default strength.
Pure configuration and lookup functions with no ML dependencies.
"""
@@ -11,13 +11,18 @@ if TYPE_CHECKING:
from pathlib import Path
DEFAULT_MODEL_ID = "stabilityai/stable-diffusion-xl-base-1.0"
CTRLREGEN_MODEL_ID = "yepengliu/ctrlregen"
# The SDXL-native canny ControlNet used by the ``controlnet`` pipeline. The
# ControlNet is an add-on to the SDXL base checkpoint (DEFAULT_MODEL_ID), not a
# separate base model, so both the ``default`` and ``controlnet`` profiles load
# the same base weights and share the same vendor-adaptive strength.
CONTROLNET_CANNY_MODEL = "xinsir/controlnet-canny-sdxl-1.0"
# Vendor-adaptive default denoising strength for the SDXL img2img scrub, overridable
# from the CLI (`--strength`). The right strength depends on which vendor's SynthID is
# present, detected from the C2PA issuer (metadata.synthid_source). Oracle-verified
# controlled study (2026-06-01, clean v0.8.6 with protect_text/faces OFF, per-image
# openai.com/verify or Gemini-app verdict; see docs/synthid.md section 2.2):
# controlled study (2026-06-01, clean v0.8.6, per-image openai.com/verify or Gemini-app
# verdict; see docs/synthid.md section 2.2):
# - OpenAI gpt-image: removed at 0.05 across 1024-1600 (n=4), resolution-independent.
# OPENAI_STRENGTH 0.10 = the 0.05 floor plus a 2x margin (keeps quality high).
# - Google Gemini: removed at 0.15 on the capped-1536 path (n=4); 0.05/0.10 do NOT
@@ -29,8 +34,8 @@ CTRLREGEN_MODEL_ID = "yepengliu/ctrlregen"
# - Unknown vendor (metadata stripped, or non-OpenAI/Google C2PA): UNKNOWN_STRENGTH
# 0.15, the safe middle that clears both vendors at the tested resolutions.
# The dominant factor is VENDOR, not resolution: Google's SynthID is ~3x more robust
# than OpenAI's. The earlier single 0.30 default (and the "resolution dependence" lore)
# came from contaminated tests run with protect_text ON -- see docs/synthid.md 2.2.
# than OpenAI's. The ``controlnet`` pipeline shares these strengths (same SDXL base; the
# canny ControlNet only preserves structure, the strength still drives removal).
OPENAI_STRENGTH = 0.10
GEMINI_STRENGTH = 0.15
UNKNOWN_STRENGTH = 0.15
@@ -41,45 +46,21 @@ DEFAULT_STRENGTH = UNKNOWN_STRENGTH
# Detected-vendor -> default strength. Vendor strings come from `vendor_for_strength`.
_VENDOR_STRENGTH = {"openai": OPENAI_STRENGTH, "google": GEMINI_STRENGTH}
# CtrlRegen removes watermarks by regenerating from (near) clean Gaussian noise,
# NOT by the light-touch partial-noise img2img the SDXL default uses. The research
# is explicit (CtrlRegen, ICLR 2025, arXiv:2410.05470): partial-noise regeneration
# "struggles with high-perturbation watermarks" because a small noise step "retains"
# watermark information that diffuses back into the output; the fix is to start from
# clean noise. With the StableDiffusionControlNetImg2ImgPipeline that maps to a high
# strength (~1.0 = full noise at the first timestep, structure held by the canny
# ControlNet + DINOv2 IP-Adapter, not by the watermarked latent). So the ctrlregen
# profile must NOT inherit the SDXL default (`DEFAULT_STRENGTH`, a partial-noise
# value) -- at that low strength it loads ControlNet + DINOv2-giant and then barely
# changes the image (a no-op for removal). Tunable via
# `--strength`; lower it to trade removal strength for fidelity (the CtrlRegen+ regime).
#
# EXPERIMENTAL -- NOT recommended for production. The same GPU study that set the 0.3
# SDXL threshold tested ctrlregen at its clean-noise strength and found it DESTROYS
# images: smooth/background regions fill with hallucinated micro-text garbage, and it
# is heavy (~8.5 min / ~$0.30 vs ~25 s / ~$0.02 for SDXL on a large image). The pipeline
# is effectively binary -- low strength = no-op, high strength = destroys -- with no
# usable middle, so the literature's "clean-noise is the lever" (arXiv:2410.05470) did
# NOT survive empirical testing on real content. SDXL img2img at ~0.3 is the shippable
# path; ctrlregen stays opt-in and flagged experimental.
CTRLREGEN_DEFAULT_STRENGTH = 1.0
def resolve_strength(strength: float | None, vendor: str | None = None) -> float:
"""Resolve the denoising strength, applying the vendor default when unset.
def resolve_strength(strength: float | None, profile: str, vendor: str | None = None) -> float:
"""Resolve the denoising strength, applying the profile/vendor default when unset.
``None`` means "the user did not pass ``--strength``". ``ctrlregen`` resolves to
``CTRLREGEN_DEFAULT_STRENGTH`` (clean-noise regeneration). The SDXL default profile
resolves **vendor-adaptively**: ``vendor`` (``"openai"`` / ``"google"`` / None, from
``None`` means "the user did not pass ``--strength``", which resolves
**vendor-adaptively**: ``vendor`` (``"openai"`` / ``"google"`` / None, from
``vendor_for_strength``) selects ``OPENAI_STRENGTH`` / ``GEMINI_STRENGTH`` /
``UNKNOWN_STRENGTH``. An explicit value always wins (including ``0.0`` -- the check is
``is None``, not falsiness). Shared by the CLI (for display) and the engine (for
``UNKNOWN_STRENGTH``. An explicit value always wins (including ``0.0`` -- the check
is ``is None``, not falsiness). The ``default`` and ``controlnet`` profiles share
the same SDXL base (the ControlNet only preserves structure), so the default does
NOT depend on the profile. Shared by the CLI (for display) and the engine (for
execution) so the two never disagree -- both must pass the SAME ``vendor``.
"""
if strength is not None:
return strength
if profile == "ctrlregen":
return CTRLREGEN_DEFAULT_STRENGTH
return _VENDOR_STRENGTH.get(vendor or "", UNKNOWN_STRENGTH)
@@ -107,17 +88,13 @@ def vendor_for_strength(image_path: Path) -> Literal["openai", "google"] | None:
def get_model_id_for_profile(profile: str) -> str:
"""Map CLI model profile names to concrete Hugging Face model IDs."""
"""Map CLI model profile names to concrete Hugging Face model IDs.
Both ``default`` and ``controlnet`` use the SDXL base checkpoint -- the canny
ControlNet (``CONTROLNET_CANNY_MODEL``) is an add-on loaded on top of it, not a
separate base model.
"""
normalized = profile.strip().lower()
if normalized == "default":
if normalized in ("default", "controlnet"):
return DEFAULT_MODEL_ID
if normalized == "ctrlregen":
return CTRLREGEN_MODEL_ID
raise ValueError(f"Unknown model profile '{profile}'. Use one of: default, ctrlregen.")
def detect_model_profile(model_id: str) -> str:
"""Infer model profile from model identifier."""
if "ctrlregen" in model_id.lower():
return "ctrlregen"
return "default"
raise ValueError(f"Unknown model profile '{profile}'. Use one of: default, controlnet.")
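A minimal, self-contained sketch of the strength resolution as it stands after this change. It re-declares the constants from the hunk above locally instead of importing the package, so it only illustrates the contract (an explicit value always wins, including `0.0`; otherwise the vendor picks the default) and is not the module itself.

```python
from __future__ import annotations

# Values copied from the hunk above; re-declared here so the sketch runs alone.
OPENAI_STRENGTH = 0.10
GEMINI_STRENGTH = 0.15
UNKNOWN_STRENGTH = 0.15
_VENDOR_STRENGTH = {"openai": OPENAI_STRENGTH, "google": GEMINI_STRENGTH}


def resolve_strength(strength: float | None, vendor: str | None = None) -> float:
    """Return the explicit strength if given, else the vendor default."""
    # `is not None` (not truthiness) so an explicit 0.0 is honored.
    if strength is not None:
        return strength
    # Unknown or missing vendors fall back to UNKNOWN_STRENGTH.
    return _VENDOR_STRENGTH.get(vendor or "", UNKNOWN_STRENGTH)


if __name__ == "__main__":
    for explicit, vendor in [(None, "openai"), (None, "google"), (None, None), (0.0, "google")]:
        print(explicit, vendor, "->", resolve_strength(explicit, vendor))
```

Because the profile no longer takes part in the decision, the CLI and the engine only need to agree on `vendor` to resolve the same value.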
+152 -339
@@ -1,13 +1,13 @@
"""Watermark removal using diffusion model regeneration attack.
Based on the paper "Image Watermarks Are Removable Using Controllable
Regeneration from Clean Noise" (ICLR 2025).
This module implements a simple regeneration attack that:
1. Encodes the watermarked image to latent space
2. Adds noise via forward diffusion process
3. Denoises via reverse diffusion process
4. Decodes back to pixel space
Two pipelines:
1. ``default`` -- plain SDXL img2img. Partial-noise regeneration scrubs the
invisible watermark; ``strength`` controls how much is regenerated.
2. ``controlnet`` -- SDXL img2img with a canny ControlNet. The watermark REMOVAL
still comes from the img2img regeneration (``strength``); the ControlNet only
PRESERVES structure (text/faces) by conditioning on the edge map. No original
pixels are ever copied or frozen, so SynthID does not survive.
``controlnet_conditioning_scale`` is the preservation knob.
"""
# torch/diffusers/cv2 boundary: these libs ship no usable types for the tensor and
@@ -29,10 +29,9 @@ if TYPE_CHECKING:
from PIL import Image
from remove_ai_watermarks.noai.watermark_profiles import (
CTRLREGEN_MODEL_ID,
CONTROLNET_CANNY_MODEL,
DEFAULT_MODEL_ID,
DEFAULT_STRENGTH,
detect_model_profile,
resolve_strength,
)
@@ -273,21 +272,14 @@ def _make_seed_generator(device: str, seed: int) -> Any:
return torch.Generator().manual_seed(seed) # type: ignore
def _generator_device(generator: Any) -> str:
"""Best-effort device type of a ``torch.Generator`` (e.g. ``"cpu"``, ``"mps"``)."""
device = getattr(generator, "device", None)
return getattr(device, "type", str(device)) if device is not None else "cpu"
# Canny edge thresholds for the ControlNet control image (xinsir canny recipe:
# cv2.Canny(gray, 100, 200) -> a 3-channel edge map).
_CANNY_LOW = 100
_CANNY_HIGH = 200
# Keep legacy name available for backwards compatibility
_detect_model_profile_from_id = detect_model_profile
# SDXL Differential-Diffusion community pipeline, pinned to the installed
# diffusers version so the fetched pipeline code matches the library (see #21).
# Diffusers' dynamic-module loader resolves ``custom_revision`` against the
# package version string (``0.38.0``), NOT the GitHub git tag (``v0.38.0``).
_DIFF_PIPELINE_NAME = "pipeline_stable_diffusion_xl_differential_img2img"
_DIFF_PIPELINE_REVISION = "0.38.0"
# A neutral quality prompt: the goal is faithful regeneration, not creative edits.
_CONTROLNET_PROMPT = "best quality, high quality, sharp, detailed, photographic"
_CONTROLNET_NEGATIVE = "blurry, lowres, deformed, distorted text, garbled text, watermark, jpeg artifacts"
class WatermarkRemover:
@@ -299,8 +291,8 @@ class WatermarkRemover:
"""
DEFAULT_MODEL_ID = DEFAULT_MODEL_ID
CTRLREGEN_MODEL_ID = CTRLREGEN_MODEL_ID
DEFAULT_STRENGTH = DEFAULT_STRENGTH
CONTROLNET_CANNY_MODEL = CONTROLNET_CANNY_MODEL
def __init__(
self,
@@ -309,9 +301,14 @@ class WatermarkRemover:
torch_dtype: Any = None,
progress_callback: Callable[[str], None] | None = None,
hf_token: str | None = None,
pipeline: str = "default",
controlnet_conditioning_scale: float = 1.0,
) -> None:
self.model_id = model_id or self.DEFAULT_MODEL_ID
self.model_profile = detect_model_profile(self.model_id)
# The pipeline profile is threaded explicitly (not inferred from model_id):
# both "default" and "controlnet" use the same SDXL base checkpoint.
self.model_profile = pipeline
self.controlnet_conditioning_scale = controlnet_conditioning_scale
if not is_watermark_removal_available():
_ensure_watermark_deps()
@@ -329,8 +326,7 @@ class WatermarkRemover:
self.torch_dtype = torch_dtype
self._pipeline: AutoImg2ImgPipeline | None = None
self._diff_pipeline: Any = None
self._ctrlregen_engine: Any = None
self._controlnet_pipeline: Any = None
self._progress_callback = progress_callback
self.hf_token: str | None = hf_token or os.environ.get("HF_TOKEN")
@@ -345,44 +341,59 @@ class WatermarkRemover:
def preload(self) -> None:
"""Eagerly load the pipeline so download progress bars are visible."""
if self.model_profile == "ctrlregen":
self._run_ctrlregen_preload()
if self.model_profile == "controlnet":
self._load_controlnet_pipeline()
else:
self._load_pipeline()
def _run_ctrlregen_preload(self) -> None:
"""Ensure the CtrlRegen engine and all its models are loaded."""
from remove_ai_watermarks.noai.ctrlregen import is_ctrlregen_available
if not is_ctrlregen_available():
missing_pkgs = ["controlnet-aux", "color-matcher", "safetensors"]
logger.info("Auto-installing missing CtrlRegen dependencies: %s", missing_pkgs)
if not _auto_install(missing_pkgs):
raise ImportError(
f"Failed to auto-install missing dependencies: {', '.join(missing_pkgs)}. "
"Try manually: pip install --force-reinstall noai-watermark"
)
if self._ctrlregen_engine is None:
self._ctrlregen_engine = self._make_ctrlregen_engine()
self._ctrlregen_engine.load()
def _make_ctrlregen_engine(self) -> Any:
"""Create a new CtrlRegenEngine with current settings."""
from remove_ai_watermarks.noai.ctrlregen import CtrlRegenEngine
base_model = self.model_id if self.model_id != self.CTRLREGEN_MODEL_ID else None
return CtrlRegenEngine(
base_model_id=base_model,
device=self.device,
torch_dtype=self.torch_dtype,
hf_token=self.hf_token,
progress_callback=self._progress_callback,
)
# ── Pipeline loading ─────────────────────────────────────────────
def _maybe_add_fp16_vae(self, load_kwargs: dict[str, Any]) -> None:
"""Swap in the fp16-fixed SDXL VAE for the default checkpoint on a fp16 GPU.
The stock SDXL VAE overflows to NaN in fp16 and decodes to an all-black
image (issue #29). Shared by both pipeline loaders; a no-op on fp32 (cpu/mps)
or a non-SDXL checkpoint.
"""
if _needs_fp16_vae_fix(self.model_id, self.DEFAULT_MODEL_ID, self.torch_dtype == torch.float16):
from diffusers import AutoencoderKL
self._set_progress("Loading fp16-fixed SDXL VAE (avoids black output)...")
load_kwargs["vae"] = AutoencoderKL.from_pretrained(_SDXL_FP16_VAE_ID, torch_dtype=torch.float16)
def _move_to_device_and_optimize(self, pipeline: Any) -> Any:
"""Move a freshly loaded pipeline to ``self.device`` and enable memory optimizations.
Shared by both loaders. On a CUDA move failure (typically a torch build without
CUDA support), trigger the torch-CUDA reinstall and restart; otherwise raise a
RuntimeError with manual install instructions. Returns the moved pipeline.
"""
self._set_progress(f"Moving model to device: {self.device}")
try:
pipeline = pipeline.to(self.device)
except (RuntimeError, AssertionError) as exc:
if self.device == "cuda" and not os.environ.get(_CUDA_FIX_ENV_KEY):
self._set_progress("CUDA failed. Reinstalling torch with CUDA support...")
_reinstall_torch_cuda_and_restart()
raise RuntimeError(
f"Failed to move model to {self.device} ({exc}). "
"Install CUDA-enabled PyTorch manually:\n"
f" pip install torch --index-url {_detect_cuda_index_url()}"
) from exc
if hasattr(pipeline, "enable_xformers_memory_efficient_attention"):
with contextlib.suppress(Exception):
self._set_progress("Enabling memory optimizations...")
pipeline.enable_xformers_memory_efficient_attention()
# Attention slicing lowers peak memory for fp32 on Apple MPS
if self.device == "mps" and hasattr(pipeline, "enable_attention_slicing"):
with contextlib.suppress(Exception):
pipeline.enable_attention_slicing("max")
return pipeline
def _load_pipeline(self) -> AutoImg2ImgPipeline:
"""Load the diffusion pipeline lazily."""
"""Load the plain SDXL img2img pipeline lazily."""
if self._pipeline is None:
logger.info("Loading model %s on %s...", self.model_id, self.device)
self._set_progress(f"Loading model weights: {self.model_id}")
@@ -394,48 +405,47 @@ class WatermarkRemover:
}
if self.hf_token:
load_kwargs["token"] = self.hf_token
self._maybe_add_fp16_vae(load_kwargs)
# Avoid the SDXL fp16 NaN/all-black decode (issue #29) by loading the
# fp16-fixed VAE for the default SDXL checkpoint on a fp16 GPU.
if _needs_fp16_vae_fix(self.model_id, self.DEFAULT_MODEL_ID, self.torch_dtype == torch.float16):
from diffusers import AutoencoderKL
self._set_progress("Loading fp16-fixed SDXL VAE (avoids black output)...")
load_kwargs["vae"] = AutoencoderKL.from_pretrained(_SDXL_FP16_VAE_ID, torch_dtype=torch.float16)
self._pipeline = AutoImg2ImgPipeline.from_pretrained( # type: ignore
self.model_id,
**load_kwargs,
)
self._set_progress(f"Moving model to device: {self.device}")
try:
self._pipeline = self._pipeline.to(self.device) # type: ignore
except (RuntimeError, AssertionError) as exc:
if self.device == "cuda" and not os.environ.get(_CUDA_FIX_ENV_KEY):
self._set_progress("CUDA failed. Reinstalling torch with CUDA support...")
_reinstall_torch_cuda_and_restart()
raise RuntimeError(
f"Failed to move model to {self.device} ({exc}). "
"Install CUDA-enabled PyTorch manually:\n"
f" pip install torch --index-url {_detect_cuda_index_url()}"
) from exc
if hasattr(self._pipeline, "enable_xformers_memory_efficient_attention"):
with contextlib.suppress(Exception):
self._set_progress("Enabling memory optimizations...")
self._pipeline.enable_xformers_memory_efficient_attention() # type: ignore
# Mac Float32 memory slicing
if self.device == "mps" and hasattr(self._pipeline, "enable_attention_slicing"):
with contextlib.suppress(Exception):
self._pipeline.enable_attention_slicing("max")
pipeline = AutoImg2ImgPipeline.from_pretrained(self.model_id, **load_kwargs) # type: ignore
self._pipeline = self._move_to_device_and_optimize(pipeline)
logger.info("Model loaded successfully")
self._set_progress("Model initialized. Preparing input image...")
return self._pipeline # type: ignore
def _load_controlnet_pipeline(self) -> Any:
"""Load the SDXL + canny-ControlNet img2img pipeline lazily.
Mirrors ``_load_pipeline`` (same fp16-fix VAE, device move, attention
slicing via the shared helpers) but loads the canny ControlNet on top of
the SDXL base. The ControlNet only preserves structure via the edge map;
removal still comes from the img2img regeneration (``strength``).
"""
if self._controlnet_pipeline is None:
from diffusers import ControlNetModel, StableDiffusionXLControlNetImg2ImgPipeline
logger.info("Loading SDXL + ControlNet (%s) on %s...", CONTROLNET_CANNY_MODEL, self.device)
self._set_progress(f"Loading ControlNet: {CONTROLNET_CANNY_MODEL}")
controlnet = ControlNetModel.from_pretrained(CONTROLNET_CANNY_MODEL, torch_dtype=self.torch_dtype)
load_kwargs: dict[str, Any] = {"controlnet": controlnet, "torch_dtype": self.torch_dtype}
if self.hf_token:
load_kwargs["token"] = self.hf_token
self._maybe_add_fp16_vae(load_kwargs)
self._set_progress(f"Loading model weights: {self.model_id}")
pipeline = StableDiffusionXLControlNetImg2ImgPipeline.from_pretrained(self.model_id, **load_kwargs)
pipeline = self._move_to_device_and_optimize(pipeline)
with contextlib.suppress(Exception):
pipeline.set_progress_bar_config(disable=True)
logger.info("ControlNet model loaded successfully")
self._controlnet_pipeline = pipeline
return self._controlnet_pipeline
# ── Core removal ─────────────────────────────────────────────────
def remove_watermark(
@@ -446,7 +456,6 @@ class WatermarkRemover:
num_inference_steps: int = 50,
guidance_scale: float | None = None,
seed: int | None = None,
protect_text: bool = True,
vendor: str | None = None,
) -> Path:
"""Remove watermark from an image using regeneration attack.
@@ -459,10 +468,6 @@ class WatermarkRemover:
num_inference_steps: Number of denoising steps.
guidance_scale: Classifier-free guidance scale.
seed: Random seed for reproducibility.
protect_text: Detect text regions and preserve them via Differential
Diffusion when any are found (SDXL default profile only). On by
default; the detector decides per image, and text-free inputs run
the standard pass at no extra cost.
vendor: SynthID vendor (``"openai"`` / ``"google"`` / None) used to pick the
default strength when ``strength`` is None. Detect it from the ORIGINAL
input with ``watermark_profiles.vendor_for_strength`` before processing
@@ -482,13 +487,13 @@ class WatermarkRemover:
if output_path is None:
output_path = image_path
strength = resolve_strength(strength, self.model_profile, vendor)
strength = resolve_strength(strength, vendor)
if not 0.0 <= strength <= 1.0:
raise ValueError(f"Strength must be between 0.0 and 1.0, got {strength}")
if guidance_scale is None:
guidance_scale = 2.0 if self.model_profile == "ctrlregen" else 7.5
guidance_scale = 7.5
self._set_progress("Loading and preprocessing input image...")
init_image = Image.open(image_path).convert("RGB")
@@ -508,16 +513,8 @@ class WatermarkRemover:
_total_start = time.monotonic()
if self.model_profile == "ctrlregen":
cleaned_image = self._run_ctrlregen(
init_image,
strength,
num_inference_steps,
guidance_scale,
generator,
)
elif protect_text and self._can_protect_text():
cleaned_image = self._run_region_hires(
if self.model_profile == "controlnet":
cleaned_image = self._run_controlnet(
init_image,
strength,
num_inference_steps,
@@ -525,12 +522,6 @@ class WatermarkRemover:
generator,
)
else:
if protect_text:
logger.debug(
"Text protection unavailable "
"(needs the SDXL default model and the cv2 text detector); "
"running standard img2img."
)
cleaned_image = self._run_img2img(
init_image,
strength,
@@ -613,148 +604,25 @@ class WatermarkRemover:
self._pipeline = None
return self._load_pipeline()
# ── Text-protected differential runner ───────────────────────────
# ── ControlNet runner ────────────────────────────────────────────
def _can_protect_text(self) -> bool:
"""True when text protection can run: SDXL default model + cv2 detector."""
from remove_ai_watermarks import text_protector
def _build_canny_control_image(self, init_image: Image.Image) -> Image.Image:
"""Build the canny ControlNet conditioning image (xinsir recipe).
return self.model_id == self.DEFAULT_MODEL_ID and text_protector.is_available()
def _load_differential_pipeline(self) -> Any:
"""Load the SDXL Differential-Diffusion community pipeline lazily."""
if self._diff_pipeline is None:
from diffusers import DiffusionPipeline
self._set_progress("Loading Differential-Diffusion pipeline (protect-text)...")
use_fp16 = self.device in {"mps", "cuda", "xpu"}
load_kwargs: dict[str, Any] = {
"custom_pipeline": _DIFF_PIPELINE_NAME,
"custom_revision": _DIFF_PIPELINE_REVISION,
"torch_dtype": torch.float16 if use_fp16 else torch.float32, # type: ignore[attr-defined]
"use_safetensors": True,
}
if use_fp16:
load_kwargs["variant"] = "fp16"
if self.hf_token:
load_kwargs["token"] = self.hf_token
pipeline = DiffusionPipeline.from_pretrained(self.model_id, **load_kwargs).to(self.device)
# The differential pipeline upcasts the SDXL VAE to fp32 internally
# (the fp16 VAE decodes to NaN/black otherwise), so we add no extra
# VAE handling here. Attention slicing is also left off on MPS: it
# produced NaN latents with this pipeline, and the protect-text pass
# is short enough not to need it.
with contextlib.suppress(Exception):
pipeline.set_progress_bar_config(disable=True)
self._diff_pipeline = pipeline
return self._diff_pipeline
def _reload_differential_on_cpu(self) -> Any:
"""Reload the differential pipeline on CPU after an MPS failure."""
self.device = "cpu"
self.torch_dtype = torch.float32 # type: ignore[assignment]
self._diff_pipeline = None
return self._load_differential_pipeline()
# Region high-res text scrub: defaults tuned so each text block is upscaled
# enough that strokes exceed the VAE's ~8px latent cell, capped so a single
# region never blows past the GPU/MPS memory budget.
_REGION_HIRES_SCALE = 3.0
_REGION_MAX_MEGAPIXELS = 1.3
def _run_region_hires(
self,
init_image: Image.Image,
strength: float,
num_inference_steps: int,
guidance_scale: float,
generator: Any,
) -> Image.Image:
"""Scrub the whole image, then RE-scrub each detected text block at high
resolution and composite it back.
Unlike the Differential-Diffusion path (which freezes text in latent space
and so leaves the watermark intact there), every pixel here is regenerated
-- the watermark is removed everywhere. Small text survives because each
text block is upscaled before its img2img pass, so strokes span more than
one VAE latent cell (the ~8px floor that softens text at native scale);
the scrubbed crop is downscaled and feather-composited back. Falls back to
the plain global scrub when no text is detected.
cv2.Canny on the RGB->gray array, stacked to 3 channels, wrapped as a PIL
image. The edge map only PRESERVES structure; it never copies pixels.
``init_image`` is already RGB (``remove_watermark`` converts on load).
"""
import math
import cv2
import numpy as np
from remove_ai_watermarks import text_protector
rgb = np.array(init_image)
gray = cv2.cvtColor(rgb, cv2.COLOR_RGB2GRAY)
edges = cv2.Canny(gray, _CANNY_LOW, _CANNY_HIGH)
edges_rgb = np.stack([edges, edges, edges], axis=-1)
return Image.fromarray(edges_rgb)
base = self._run_img2img(init_image, strength, num_inference_steps, guidance_scale, generator)
# The base pass may have fallen back from MPS to CPU (it flips
# self.device). The generator was built for the original device, and
# diffusers rejects a device-mismatched generator ("Expected a 'cpu'
# device generator but found 'mps'"), so drop it for the per-region
# passes -- they then seed from the global RNG, which is fine here.
if generator is not None and self.device == "cpu" and _generator_device(generator) != "cpu":
generator = None
bgr = cv2.cvtColor(np.array(init_image), cv2.COLOR_RGB2BGR)
try:
boxes = text_protector.TextProtector().detect_text_boxes(bgr)
except Exception as exc:
logger.warning("Text detection failed (%s); keeping the global scrub.", exc)
return base
if not boxes:
self._set_progress("No text detected; global scrub only.")
return base
width, height = init_image.size
regions = text_protector.merge_text_regions(boxes, height, width)
orig_bgr = cv2.cvtColor(np.array(init_image), cv2.COLOR_RGB2BGR)
out_bgr = cv2.cvtColor(np.array(base), cv2.COLOR_RGB2BGR)
budget = self._REGION_MAX_MEGAPIXELS * 1_000_000
done = 0
for x, y, w, h in regions:
area = max(1, w * h)
# INTEGER scale so the upscale -> scrub -> downscale round-trip is an
# exact dimensional inverse (a fractional factor truncates and shifts
# the composited text ~1-2px, which is invisible but tanks alignment).
scale = int(min(self._REGION_HIRES_SCALE, math.sqrt(budget / area)))
if scale < 2:
# Region too large to even double within the budget: upscaling
# buys nothing here; the global scrub covers it (documented limit
# for very large text areas -- tiling is the future fix).
continue
crop = orig_bgr[y : y + h, x : x + w]
up = cv2.resize(crop, (w * scale, h * scale), interpolation=cv2.INTER_LANCZOS4)
up_pil = Image.fromarray(cv2.cvtColor(up, cv2.COLOR_BGR2RGB))
scrubbed = self._run_img2img(up_pil, strength, num_inference_steps, guidance_scale, generator)
down = cv2.resize(cv2.cvtColor(np.array(scrubbed), cv2.COLOR_RGB2BGR), (w, h), interpolation=cv2.INTER_AREA)
# The up -> scrub -> down round-trip can offset the re-rendered text by
# a pixel or two (the diffusion pipeline rounds dims to a multiple of
# 8, so the inverse resize is not perfectly centered). Phase-correlate
# the patch back to the original crop and translate it so the glyphs
# land exactly where they were -- otherwise a sub-pixel shift garbles
# the composite even though the text is crisp.
cg = cv2.cvtColor(crop, cv2.COLOR_BGR2GRAY).astype(np.float32)
dg = cv2.cvtColor(down, cv2.COLOR_BGR2GRAY).astype(np.float32)
(sx, sy), resp = cv2.phaseCorrelate(cg, dg)
# Only correct for the real 1-2px round-trip shift. On a near-flat /
# low-contrast crop phaseCorrelate returns a spurious large offset at
# a tiny response (e.g. (19,19) at resp ~0.005); warping by that
# garbles the composite -- the exact failure this was meant to
# prevent. Gate on both a confident response and a plausible offset.
if resp > 0.3 and abs(sx) < 4 and abs(sy) < 4 and (abs(sx) > 0.1 or abs(sy) > 0.1):
m = np.float32([[1, 0, -sx], [0, 1, -sy]])
down = cv2.warpAffine(down, m, (w, h), flags=cv2.INTER_LINEAR, borderMode=cv2.BORDER_REPLICATE)
out_bgr = text_protector.feather_paste(out_bgr, down, x, y)
done += 1
self._set_progress(f"Re-scrubbed {done}/{len(regions)} text region(s) at high resolution.")
return Image.fromarray(cv2.cvtColor(out_bgr, cv2.COLOR_BGR2RGB))
def _run_differential(
def _run_controlnet(
self,
init_image: Image.Image,
strength: float,
@@ -762,105 +630,50 @@ class WatermarkRemover:
guidance_scale: float,
generator: Any,
) -> Image.Image:
"""Run differential img2img that preserves detected text regions."""
import cv2
import numpy as np
"""Run the SDXL + canny-ControlNet img2img pass.
from remove_ai_watermarks import text_protector
Removal still comes from the img2img regeneration (``strength``); the canny
ControlNet only PRESERVES text and face STRUCTURE via the edge map (structure,
not face identity). No original pixels are copied or frozen, so SynthID does
not survive. ``controlnet_conditioning_scale`` is the structure-preservation
knob. Shares the img2img runner (live progress + MPS->CPU fallback) with
``_run_img2img``; the only difference is the extra ControlNet kwargs (canny
control image, conditioning scale, and a non-empty prompt) overlaid via
``extra_kwargs``.
"""
from remove_ai_watermarks.noai.img2img_runner import run_img2img_with_mps_fallback
self._set_progress("Detecting text regions to protect (protect-text)...")
bgr = cv2.cvtColor(np.array(init_image), cv2.COLOR_RGB2BGR)
try:
boxes = text_protector.TextProtector().detect_text_boxes(bgr)
except Exception as exc:
logger.warning("Text detection failed (%s); running standard img2img.", exc)
return self._run_img2img(init_image, strength, num_inference_steps, guidance_scale, generator)
if not boxes:
self._set_progress("No text detected; running standard img2img.")
return self._run_img2img(init_image, strength, num_inference_steps, guidance_scale, generator)
width, height = init_image.size
change_map = text_protector.build_change_map(boxes, height, width)
self._set_progress(f"Protecting {len(boxes)} text region(s) via Differential Diffusion...")
from remove_ai_watermarks.noai.img2img_runner import run_differential_with_mps_fallback
result_image, final_device = run_differential_with_mps_fallback(
load_pipeline=self._load_differential_pipeline,
extra_kwargs = {
"prompt": _CONTROLNET_PROMPT,
"negative_prompt": _CONTROLNET_NEGATIVE,
"control_image": self._build_canny_control_image(init_image),
"controlnet_conditioning_scale": float(self.controlnet_conditioning_scale),
}
result_image, final_device = run_img2img_with_mps_fallback(
load_pipeline=self._load_controlnet_pipeline,
image=init_image,
change_map=change_map,
strength=strength,
num_inference_steps=num_inference_steps,
guidance_scale=guidance_scale,
generator=generator,
device=self.device,
set_progress=self._set_progress,
reload_on_cpu=self._reload_differential_on_cpu,
reload_on_cpu=self._reload_controlnet_on_cpu,
extra_kwargs=extra_kwargs,
)
if final_device != self.device:
self.device = final_device
self.torch_dtype = torch.float32 # type: ignore[assignment]
return result_image
# ── CtrlRegen runner ─────────────────────────────────────────────
def _run_ctrlregen(
self,
init_image: Image.Image,
strength: float,
num_inference_steps: int,
guidance_scale: float,
generator: Any,
) -> Image.Image:
"""Run CtrlRegen pipeline with MPS fallback."""
from remove_ai_watermarks.noai.ctrlregen import is_ctrlregen_available
from remove_ai_watermarks.noai.progress import is_mps_error
if not is_ctrlregen_available():
missing_pkgs = ["controlnet-aux", "color-matcher", "safetensors"]
logger.info("Auto-installing missing CtrlRegen dependencies: %s", missing_pkgs)
if not _auto_install(missing_pkgs):
raise ImportError(
f"Failed to auto-install missing dependencies: {', '.join(missing_pkgs)}. "
"Try manually: pip install --force-reinstall noai-watermark"
)
if self._ctrlregen_engine is None:
self._ctrlregen_engine = self._make_ctrlregen_engine()
seed = None
if generator is not None and hasattr(generator, "initial_seed"):
seed = generator.initial_seed()
try:
return self._ctrlregen_engine.run(
image=init_image,
strength=strength,
num_inference_steps=num_inference_steps,
guidance_scale=guidance_scale,
seed=seed,
)
except RuntimeError as error:
if self.device == "mps" and is_mps_error(error):
logger.warning("MPS out of memory during CtrlRegen. Falling back to CPU.")
self._set_progress("MPS out of memory! Retrying CtrlRegen on CPU...")
with contextlib.suppress(Exception):
if _HAS_TORCH and hasattr(torch, "mps"):
torch.mps.empty_cache() # type: ignore[attr-defined]
self.device = "cpu"
self.torch_dtype = torch.float32 # type: ignore[assignment]
self._ctrlregen_engine = self._make_ctrlregen_engine()
return self._ctrlregen_engine.run(
image=init_image,
strength=strength,
num_inference_steps=num_inference_steps,
guidance_scale=guidance_scale,
seed=seed,
)
raise
def _reload_controlnet_on_cpu(self) -> Any:
"""Reload the controlnet pipeline on CPU after an MPS failure."""
self.device = "cpu"
self.torch_dtype = torch.float32 # type: ignore[assignment]
self._controlnet_pipeline = None
return self._load_controlnet_pipeline()
# ── Batch ────────────────────────────────────────────────────────
@@ -909,9 +722,9 @@ def remove_watermark(
) -> Path:
"""Convenience function to remove watermark from an image.
``strength=None`` lets the profile pick its default: vendor-adaptive for SDXL
``strength=None`` selects the vendor-adaptive SDXL default
(0.10 OpenAI / 0.15 Google / 0.15 unknown, from the C2PA SynthID proxy on the
input), clean-noise 1.0 for ctrlregen. Pass a value to override.
input). Pass a value to override.
"""
from remove_ai_watermarks.noai.watermark_profiles import vendor_for_strength
-271
@@ -1,271 +0,0 @@
"""Text-region protection for diffusion-based watermark removal.
SDXL img2img (the ``invisible`` pipeline) regenerates every pixel, so small text
and CJK glyphs get deformed at the strengths that defeat SynthID (issue #21).
This module detects text regions and builds a per-pixel "change map" for
Differential Diffusion: the background is regenerated normally while text
regions are largely preserved, so glyphs survive the watermark-removal pass.
Detection uses only OpenCV's DNN module (no torch): the PP-OCRv3 text detector
is a ~2.4 MB ONNX model (Apache-2.0, from opencv_zoo) that is CJK-native and
returns rotated quadrilaterals. The model is downloaded and cached on first use;
it is never bundled in this repo.
Change-map polarity (verified empirically against the differential pipeline):
white (1.0) = PRESERVE the original pixels, black (0.0) = MAXIMUM change. So the
map is black everywhere except the text polygons, which are painted toward
white. ``preserve`` stays below a hard 1.0 freeze by default: SynthID is
designed to survive cropping, so totally freezing text pixels would leave the
watermark intact there. A high-but-partial preserve still scrubs lightly.
"""
# cv2 ships no type stubs; mirror the pragma used by the other cv2-using modules.
# pyright: reportMissingTypeStubs=false, reportUnknownMemberType=false, reportUnknownVariableType=false, reportUnknownArgumentType=false, reportCallIssue=false, reportArgumentType=false, reportReturnType=false
from __future__ import annotations
import logging
import os
import tempfile
import urllib.request
from pathlib import Path
from typing import TYPE_CHECKING, Any
if TYPE_CHECKING:
from numpy.typing import NDArray
logger = logging.getLogger(__name__)
# PP-OCRv3 Chinese text detector (DB head), opencv_zoo, Apache-2.0.
_MODEL_URL = (
"https://github.com/opencv/opencv_zoo/raw/main/models/text_detection_ppocr/text_detection_cn_ppocrv3_2023may.onnx"
)
_MODEL_FILENAME = "text_detection_cn_ppocrv3_2023may.onnx"
# DB detector input: the image is detected at its NATIVE long side, capped at
# this value (rounded to a multiple of 32), never upscaled. A fixed small input
# (the old 736) downscaled large images so far that small text fell below the
# detector's resolution and was missed -- the cause of the "small text still
# distorts" reports (issue #14). Detection is script-agnostic (DB segments text
# *regions*, not characters), so this recall fix applies to every language; the
# only lever that mattered was resolution. 1536 recovers full recall down to
# ~12 px text on a 2048 canvas at ~100 ms on CPU (a fixed 736 missed it); going
# higher buys no measured recall at 2x+ the cost. Benchmarked in
# scripts/text_detection_benchmark.py. Very large canvases with tiny text may
# still need tiling -- a documented limit, not yet built.
_DET_MAX_LONG_SIDE = 1536
# ImageNet mean (x255) and 1/255 scale -- the normalization PP-OCRv3 expects.
_DET_MEAN = (0.485 * 255, 0.456 * 255, 0.406 * 255)
_DET_SCALE = 1 / 255.0
def is_available() -> bool:
"""True when OpenCV's DNN text-detection model is importable."""
try:
import cv2
return hasattr(cv2.dnn, "TextDetectionModel_DB")
except ImportError:
return False
def _cache_dir() -> Path:
"""Local cache directory for the detector model (created on demand)."""
cache = Path.home() / ".cache" / "remove-ai-watermarks"
cache.mkdir(parents=True, exist_ok=True)
return cache
def _model_path() -> Path:
"""Return the cached detector path, downloading it on first use."""
target = _cache_dir() / _MODEL_FILENAME
if target.exists() and target.stat().st_size > 0:
return target
logger.info("Downloading PP-OCRv3 text detector (~2.4 MB) to %s", target)
# Download to a temp file in the same dir, then atomically rename so a
# partial download never leaves a corrupt model cached.
fd, tmp_name = tempfile.mkstemp(dir=str(target.parent), suffix=".onnx.part")
tmp_path = Path(tmp_name)
try:
os.close(fd)
with urllib.request.urlopen(_MODEL_URL) as resp: # noqa: S310 (trusted GitHub URL)
tmp_path.write_bytes(resp.read())
tmp_path.replace(target)
finally:
if tmp_path.exists():
tmp_path.unlink()
return target
def _detection_input_size(height: int, width: int) -> tuple[int, int]:
"""DB-detector input ``(in_w, in_h)`` for an image of the given size.
Detect at the native long side, capped at ``_DET_MAX_LONG_SIDE`` and never
upscaled, each side rounded down to a multiple of 32 (the DB head requires
/32 dims), floored at 32. Pure function so the resolution contract (the
issue #14 small-text recall fix) is unit-testable without the model.
"""
long_side = max(height, width)
scale = min(_DET_MAX_LONG_SIDE, long_side) / long_side
in_w = max((round(width * scale) // 32) * 32, 32)
in_h = max((round(height * scale) // 32) * 32, 32)
return in_w, in_h
def build_change_map(
boxes: list[NDArray[Any]],
height: int,
width: int,
preserve: float = 0.9,
feather: int = 15,
) -> NDArray[Any]:
"""Build a Differential-Diffusion change map from text polygons.
Args:
boxes: Text-region polygons as arrays of (x, y) vertices.
height: Output map height in pixels.
width: Output map width in pixels.
preserve: Map value painted inside text polygons (0..1). White (1.0)
fully preserves the original pixels; the default 0.9 preserves
strongly while still letting a light scrub through.
feather: Gaussian-blur kernel size for soft polygon edges (forced odd).
Returns:
Float32 HxW array in [0, 1]: ~0 in the background (full change),
``preserve`` inside text regions, blended at the edges.
"""
import cv2
import numpy as np
change_map = np.zeros((height, width), np.float32)
if boxes:
polys = [np.asarray(b, np.int32) for b in boxes]
cv2.fillPoly(change_map, polys, float(preserve))
if feather > 0:
if feather % 2 == 0:
feather += 1
change_map = cv2.GaussianBlur(change_map, (feather, feather), 0)
# GaussianBlur can overshoot the painted value by a float epsilon; keep
# the contract that the map stays a valid [0, 1] change map.
np.clip(change_map, 0.0, 1.0, out=change_map)
return change_map
def merge_text_regions(
boxes: list[NDArray[Any]],
height: int,
width: int,
dilate_frac: float = 0.012,
pad_frac: float = 0.02,
max_regions: int = 8,
) -> list[tuple[int, int, int, int]]:
"""Group detected text polygons into a few padded axis-aligned rectangles.
The DB detector returns one box per word/line; the region-high-res text scrub
runs a separate diffusion pass per region, so we coalesce nearby boxes into a
handful of *local* blocks (a light dilation merges within a paragraph but not
across the whole image, so each block stays small enough to upscale within a
memory budget). Returns ``(x, y, w, h)`` rects, largest-area first, clipped to
the image and capped at ``max_regions``.
"""
import cv2
import numpy as np
mask = np.zeros((height, width), np.uint8)
if not boxes:
return []
cv2.fillPoly(mask, [np.asarray(b, np.int32) for b in boxes], 1)
k = max(1, int(min(height, width) * dilate_frac))
mask = cv2.dilate(mask, cv2.getStructuringElement(cv2.MORPH_RECT, (k, k)))
n, _labels, stats, _c = cv2.connectedComponentsWithStats(mask, 8)
pad = int(min(height, width) * pad_frac)
rects: list[tuple[int, int, int, int]] = []
for i in range(1, n):
x, y, w, h = (
int(stats[i, cv2.CC_STAT_LEFT]),
int(stats[i, cv2.CC_STAT_TOP]),
int(stats[i, cv2.CC_STAT_WIDTH]),
int(stats[i, cv2.CC_STAT_HEIGHT]),
)
x0, y0 = max(0, x - pad), max(0, y - pad)
x1, y1 = min(width, x + w + pad), min(height, y + h + pad)
rects.append((x0, y0, x1 - x0, y1 - y0))
rects.sort(key=lambda r: -(r[2] * r[3]))
return rects[:max_regions]
def feather_paste(
base: NDArray[Any],
patch: NDArray[Any],
x: int,
y: int,
feather: int = 8,
) -> NDArray[Any]:
"""Alpha-composite ``patch`` into ``base`` at ``(x, y)`` with a feathered edge.
Used to drop a separately re-scrubbed (high-resolution) text region back into
the globally-scrubbed image without a visible seam. Returns a new array;
``base`` is not modified. ``patch`` is clipped to ``base`` bounds.
"""
import numpy as np
out = base.copy()
bh, bw = base.shape[:2]
ph, pw = patch.shape[:2]
x0, y0 = max(0, x), max(0, y)
x1, y1 = min(bw, x + pw), min(bh, y + ph)
if x1 <= x0 or y1 <= y0:
return out
patch_roi = patch[y0 - y : y1 - y, x0 - x : x1 - x].astype(np.float32)
base_roi = out[y0:y1, x0:x1].astype(np.float32)
rh, rw = base_roi.shape[:2]
alpha = np.ones((rh, rw), np.float32)
f = max(0, min(feather, rh // 2, rw // 2))
if f > 0:
ramp = np.linspace(0.0, 1.0, f, dtype=np.float32)
alpha[:f, :] *= ramp[:, None]
alpha[rh - f :, :] *= ramp[::-1, None]
alpha[:, :f] *= ramp[None, :]
alpha[:, rw - f :] *= ramp[None, ::-1]
a3 = alpha[:, :, None]
out[y0:y1, x0:x1] = (patch_roi * a3 + base_roi * (1.0 - a3)).astype(base.dtype)
return out
class TextProtector:
"""Detect text regions with PP-OCRv3 for diffusion change-map protection."""
def __init__(
self,
binary_threshold: float = 0.3,
polygon_threshold: float = 0.5,
max_candidates: int = 200,
unclip_ratio: float = 2.0,
) -> None:
import cv2
self._detector = cv2.dnn.TextDetectionModel_DB(str(_model_path()))
self._detector.setBinaryThreshold(binary_threshold)
self._detector.setPolygonThreshold(polygon_threshold)
self._detector.setMaxCandidates(max_candidates)
self._detector.setUnclipRatio(unclip_ratio)
def detect_text_boxes(self, bgr_image: NDArray[Any]) -> list[NDArray[Any]]:
"""Detect text regions, returning a list of rotated quad polygons.
Args:
bgr_image: Image as an HxWx3 BGR uint8 array (OpenCV convention).
Returns:
One array of four (x, y) vertices per detected text region.
"""
height, width = bgr_image.shape[:2]
in_w, in_h = _detection_input_size(height, width)
self._detector.setInputParams(
scale=_DET_SCALE,
size=(in_w, in_h),
mean=_DET_MEAN,
swapRB=True,
)
boxes, _confidences = self._detector.detect(bgr_image)
return list(boxes)
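The resolution contract in `_detection_input_size` can be checked by hand without the model. Below is a minimal sketch that restates that function with the cap inlined as a local constant; the names `DET_MAX_LONG_SIDE` and `detection_input_size` (no leading underscore) exist only for this illustration.

```python
DET_MAX_LONG_SIDE = 1536  # same value as _DET_MAX_LONG_SIDE in the removed module


def detection_input_size(height: int, width: int) -> tuple[int, int]:
    """Return (in_w, in_h): native long side capped at 1536, floored to /32."""
    long_side = max(height, width)
    # scale is always <= 1, so the image is never upscaled.
    scale = min(DET_MAX_LONG_SIDE, long_side) / long_side
    in_w = max((round(width * scale) // 32) * 32, 32)
    in_h = max((round(height * scale) // 32) * 32, 32)
    return in_w, in_h


# 2048x2048: scale = 0.75, both sides land exactly on the cap.
print(detection_input_size(2048, 2048))  # (1536, 1536)
# 1234 high x 567 wide: below the cap, so only the /32 floor applies.
print(detection_input_size(1234, 567))   # (544, 1216)
# Tiny input: the /32 floor would give 0, so the 32 px minimum applies.
print(detection_input_size(10, 5))       # (32, 32)
```

Note the argument order is `(height, width)` while the return order is `(in_w, in_h)`, matching what `setInputParams(size=...)` is given in `detect_text_boxes`.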
-63
@@ -1,63 +0,0 @@
import numpy as np
from remove_ai_watermarks.face_protector import FaceProtector
def test_face_protector_initialization():
# Falls back to the Haar cascade when ultralytics is missing
fp = FaceProtector(use_yolo=False)
assert fp.use_yolo is False
assert fp.haar_cascade is not None
def test_face_protector_lifecycle():
fp = FaceProtector(use_yolo=False)
# Create dummy black image
img = np.zeros((200, 200, 3), dtype=np.uint8)
# Since it's a black image, haar cascade should find 0 faces
faces = fp.extract_faces(img)
assert isinstance(faces, list)
assert len(faces) == 0
# Restoring 0 faces should return an identical image
restored = fp.restore_faces(img, faces)
assert np.array_equal(img, restored)
def test_face_protector_restore_bypass_on_size_mismatch():
fp = FaceProtector(use_yolo=False)
img_small = np.zeros((100, 100, 3), dtype=np.uint8)
# Manually mock a face that is OUT OF BOUNDS for img_small
mock_bbox = (80, 80, 130, 130)
mock_crop = np.ones((50, 50, 3), dtype=np.uint8) * 255
mock_faces = [(mock_bbox, mock_crop)]
# Attempt to restore onto an image too small for this box
restored = fp.restore_faces(img_small, mock_faces)
# Should safely skip restoring and not crash
assert np.array_equal(restored, img_small)
def test_face_protector_restore_blending():
fp = FaceProtector(use_yolo=False)
# Background is black
img_target = np.zeros((100, 100, 3), dtype=np.uint8)
# Face crop is white
mock_bbox = (25, 25, 75, 75)
mock_crop = np.ones((50, 50, 3), dtype=np.uint8) * 255
mock_faces = [(mock_bbox, mock_crop)]
restored = fp.restore_faces(img_target, mock_faces)
# The center of the face should be perfectly white (255)
assert restored[50, 50, 0] >= 254
# The corner of the target should remain perfectly black (0)
assert restored[0, 0, 0] == 0
# Near the crop edge, e.g. (28, 28), the Gaussian-blurred mask should blend the
# two values. Note the bound below holds for any uint8, so it is a smoke check only.
assert 0 <= restored[28, 28, 0] <= 255
-75
@@ -15,7 +15,6 @@ import pytest
 from remove_ai_watermarks.noai import img2img_runner
 from remove_ai_watermarks.noai.img2img_runner import (
-    run_differential_with_mps_fallback,
     run_img2img,
     run_img2img_with_mps_fallback,
 )
@@ -110,80 +109,6 @@ class TestMpsFallback:
         reload_on_cpu.assert_not_called()
-class TestDifferentialMpsFallback:
-    """The protect-text (Differential Diffusion) path shares the MPS->CPU
-    fallback contract; mock ``run_differential`` so no torch/model is needed."""
-    def test_mps_error_reloads_on_cpu_and_retries(self, monkeypatch: pytest.MonkeyPatch):
-        sentinel = object()
-        inner = Mock(side_effect=[RuntimeError(_MPS_OOM), sentinel])
-        monkeypatch.setattr(img2img_runner, "run_differential", inner)
-        reload_on_cpu = Mock(return_value="cpu_pipe")
-        img, device = run_differential_with_mps_fallback(
-            load_pipeline=Mock(return_value="gpu_pipe"),
-            image=object(),
-            change_map=object(),
-            strength=0.05,
-            num_inference_steps=50,
-            guidance_scale=7.5,
-            generator="gen",
-            device="mps",
-            set_progress=lambda _m: None,
-            reload_on_cpu=reload_on_cpu,
-        )
-        assert (img, device) == (sentinel, "cpu")
-        reload_on_cpu.assert_called_once()
-        assert inner.call_count == 2
-        # Retry uses the reloaded CPU pipeline, device "cpu", and drops the MPS
-        # generator (generator=None) for deterministic CPU execution.
-        retry_args = inner.call_args_list[1].args
-        assert retry_args[0] == "cpu_pipe"
-        assert retry_args[6] is None  # generator
-        assert retry_args[7] == "cpu"  # device
-    def test_happy_path_returns_original_device_without_reload(self, monkeypatch: pytest.MonkeyPatch):
-        sentinel = object()
-        monkeypatch.setattr(img2img_runner, "run_differential", Mock(return_value=sentinel))
-        reload_on_cpu = Mock()
-        img, device = run_differential_with_mps_fallback(
-            load_pipeline=Mock(return_value="gpu_pipe"),
-            image=object(),
-            change_map=object(),
-            strength=0.05,
-            num_inference_steps=50,
-            guidance_scale=7.5,
-            generator="gen",
-            device="mps",
-            set_progress=lambda _m: None,
-            reload_on_cpu=reload_on_cpu,
-        )
-        assert (img, device) == (sentinel, "mps")
-        reload_on_cpu.assert_not_called()
-    def test_non_mps_runtime_error_propagates(self, monkeypatch: pytest.MonkeyPatch):
-        monkeypatch.setattr(img2img_runner, "run_differential", Mock(side_effect=RuntimeError("CUDA out of memory")))
-        reload_on_cpu = Mock()
-        with pytest.raises(RuntimeError, match="CUDA"):
-            run_differential_with_mps_fallback(
-                load_pipeline=Mock(return_value="gpu_pipe"),
-                image=object(),
-                change_map=object(),
-                strength=0.05,
-                num_inference_steps=50,
-                guidance_scale=7.5,
-                generator="gen",
-                device="mps",
-                set_progress=lambda _m: None,
-                reload_on_cpu=reload_on_cpu,
-            )
-        reload_on_cpu.assert_not_called()
class TestRunImg2Img:
def test_returns_first_image_from_pipeline_result(self):
sentinel = object()
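The MPS->CPU fallback contract that these tests pin down (retry once on CPU, drop the generator, let non-MPS errors propagate) can be sketched in a few lines. This is an illustration under assumptions, not the project's runner: `run_with_mps_fallback` and `looks_like_mps_error` are hypothetical names, the real `is_mps_error` predicate is not shown in this diff, and the real runners take more arguments.

```python
from typing import Any, Callable


def looks_like_mps_error(exc: BaseException) -> bool:
    # Hypothetical stand-in for the project's is_mps_error (not shown here).
    return "mps" in str(exc).lower()


def run_with_mps_fallback(
    run: Callable[[Any, Any, str], Any],
    load_pipeline: Callable[[], Any],
    reload_on_cpu: Callable[[], Any],
    device: str,
    generator: Any,
) -> tuple[Any, str]:
    """Run once on `device`; on an MPS failure, reload on CPU and retry once."""
    pipe = load_pipeline()
    try:
        return run(pipe, generator, device), device
    except RuntimeError as exc:
        if device != "mps" or not looks_like_mps_error(exc):
            raise  # e.g. a CUDA OOM is not handled by this fallback
        cpu_pipe = reload_on_cpu()
        # The MPS-bound generator is dropped for the CPU retry.
        return run(cpu_pipe, None, "cpu"), "cpu"


calls: list[tuple[Any, Any, str]] = []


def flaky_run(pipe: Any, generator: Any, device: str) -> str:
    """Stub runner that fails on MPS and succeeds on CPU."""
    calls.append((pipe, generator, device))
    if device == "mps":
        raise RuntimeError("MPS backend out of memory")
    return "image"


result = run_with_mps_fallback(flaky_run, lambda: "gpu_pipe", lambda: "cpu_pipe", "mps", "gen")
print(result)  # ('image', 'cpu')
```

Returning the device alongside the image lets the caller report which backend actually produced the output, which is what the `(img, device)` assertions in the tests check.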
-3
@@ -30,9 +30,6 @@ class TestInvisibleEngineInit:
         # SDXL base became the default in May 2026 (defeats SynthID v2).
         assert InvisibleEngine.DEFAULT_MODEL_ID == "stabilityai/stable-diffusion-xl-base-1.0"
-    def test_ctrlregen_model_id(self):
-        assert InvisibleEngine.CTRLREGEN_MODEL_ID == "yepengliu/ctrlregen"
 class TestTargetSize:
     """Regression guard for the native-resolution decision (issues #10 / #15).
+19 -32
@@ -14,12 +14,10 @@ import pytest
 from remove_ai_watermarks.noai.progress import is_mps_error
 from remove_ai_watermarks.noai.utils import get_image_format, is_supported_format
 from remove_ai_watermarks.noai.watermark_profiles import (
-    CTRLREGEN_DEFAULT_STRENGTH,
     DEFAULT_STRENGTH,
     GEMINI_STRENGTH,
     OPENAI_STRENGTH,
     UNKNOWN_STRENGTH,
-    detect_model_profile,
     get_model_id_for_profile,
     resolve_strength,
 )
@@ -114,54 +112,43 @@ class TestModelProfiles:
     def test_default_profile(self):
         assert get_model_id_for_profile("default") == "stabilityai/stable-diffusion-xl-base-1.0"
-    def test_ctrlregen_profile(self):
-        assert get_model_id_for_profile("ctrlregen") == "yepengliu/ctrlregen"
+    def test_controlnet_profile(self):
+        # controlnet shares the SDXL base checkpoint (the ControlNet is an add-on).
+        assert get_model_id_for_profile("controlnet") == "stabilityai/stable-diffusion-xl-base-1.0"
     def test_unknown_profile_raises(self):
         with pytest.raises(ValueError, match="Unknown model profile"):
             get_model_id_for_profile("nonexistent")
-    def test_detect_default(self):
-        assert detect_model_profile("stabilityai/stable-diffusion-xl-base-1.0") == "default"
-    def test_detect_ctrlregen(self):
-        assert detect_model_profile("yepengliu/ctrlregen") == "ctrlregen"
 class TestResolveStrength:
-    """resolve_strength applies the profile/vendor default only when strength is unset."""
+    """resolve_strength applies the vendor default only when strength is unset."""
-    def test_none_default_profile_is_vendor_adaptive(self):
-        # No vendor -> unknown default; OpenAI lower, Google == unknown.
-        assert resolve_strength(None, "default") == UNKNOWN_STRENGTH
-        assert resolve_strength(None, "default", "openai") == OPENAI_STRENGTH
-        assert resolve_strength(None, "default", "google") == GEMINI_STRENGTH
-        assert resolve_strength(None, "default", None) == UNKNOWN_STRENGTH
+    def test_none_is_vendor_adaptive(self):
+        # No vendor -> unknown default; OpenAI lower, Google == unknown. The default
+        # is vendor-adaptive and does NOT depend on the pipeline profile (default and
+        # controlnet share the same SDXL base).
+        assert resolve_strength(None) == UNKNOWN_STRENGTH
+        assert resolve_strength(None, "openai") == OPENAI_STRENGTH
+        assert resolve_strength(None, "google") == GEMINI_STRENGTH
+        assert resolve_strength(None, None) == UNKNOWN_STRENGTH
         # An unrecognized vendor string falls through to the unknown default.
-        assert resolve_strength(None, "default", "adobe") == UNKNOWN_STRENGTH
+        assert resolve_strength(None, "adobe") == UNKNOWN_STRENGTH
     def test_default_strength_alias_is_unknown_vendor_value(self):
         assert DEFAULT_STRENGTH == UNKNOWN_STRENGTH
         assert OPENAI_STRENGTH < UNKNOWN_STRENGTH
-    def test_none_ctrlregen_uses_clean_noise_default(self):
-        # ctrlregen must NOT inherit the SDXL vendor defaults (that makes it a no-op);
-        # clean-noise regeneration is the lever against robust marks. Vendor is ignored.
-        assert resolve_strength(None, "ctrlregen") == CTRLREGEN_DEFAULT_STRENGTH
-        assert resolve_strength(None, "ctrlregen", "openai") == CTRLREGEN_DEFAULT_STRENGTH
-        assert CTRLREGEN_DEFAULT_STRENGTH > DEFAULT_STRENGTH
-    def test_explicit_value_overrides_profile_and_vendor(self):
-        assert resolve_strength(0.3, "default") == 0.3
-        assert resolve_strength(0.3, "default", "openai") == 0.3
-        assert resolve_strength(0.3, "ctrlregen") == 0.3
+    def test_explicit_value_overrides_vendor(self):
+        assert resolve_strength(0.3) == 0.3
+        assert resolve_strength(0.3, "openai") == 0.3
     def test_explicit_zero_is_respected_not_treated_as_unset(self):
-        # 0.0 is falsy but explicit -- must not fall through to the profile default
+        # 0.0 is falsy but explicit -- must not fall through to the vendor default
         # (the old `strength or DEFAULT` bug would have). Range validation lives in
         # remove_watermark, not here.
-        assert resolve_strength(0.0, "ctrlregen") == 0.0
-        assert resolve_strength(0.0, "default", "google") == 0.0
+        assert resolve_strength(0.0) == 0.0
+        assert resolve_strength(0.0, "google") == 0.0
class TestVendorForStrength:
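The behaviour that `TestResolveStrength` describes after this change (vendor-adaptive default, explicit `0.0` respected) can be sketched as follows. The numeric values are placeholders: the real constants live in `watermark_profiles` and are not shown in this diff; only the ordering (OpenAI below unknown, Google equal to unknown) comes from the tests. The function name `resolve_strength_sketch` is likewise only for illustration.

```python
from __future__ import annotations

# Placeholder values, NOT the project's real constants.
UNKNOWN_STRENGTH = 0.10
OPENAI_STRENGTH = 0.05
GEMINI_STRENGTH = UNKNOWN_STRENGTH

_VENDOR_DEFAULTS = {"openai": OPENAI_STRENGTH, "google": GEMINI_STRENGTH}


def resolve_strength_sketch(strength: float | None, vendor: str | None = None) -> float:
    """Return the explicit strength if given, else the vendor default."""
    if strength is not None:  # `is not None`, so an explicit 0.0 is kept
        return strength
    # Unknown or missing vendors fall through to the unknown default.
    return _VENDOR_DEFAULTS.get(vendor, UNKNOWN_STRENGTH)


print(resolve_strength_sketch(0.0, "google"))   # 0.0
print(resolve_strength_sketch(None, "openai"))  # 0.05
print(0.0 or UNKNOWN_STRENGTH)                  # 0.1, the old `strength or DEFAULT` bug
```

The last line shows why the `is not None` check matters: with `or`, a caller asking for strength `0.0` would silently get the default instead.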
-170
@@ -1,170 +0,0 @@
"""Unit tests for the text-protection change-map helper (no model download).
``build_change_map`` is the pure cv2/numpy part of ``text_protector``: it turns
detected text polygons into a Differential-Diffusion change map. The polarity is
load-bearing and was verified empirically (white = preserve, black = change), so
a regression here would either freeze the whole image or fail to protect text.
The PP-OCRv3 detector itself needs a model download and is not exercised here.
"""
from __future__ import annotations
import numpy as np
from remove_ai_watermarks.text_protector import (
_DET_MAX_LONG_SIDE,
_detection_input_size,
build_change_map,
feather_paste,
merge_text_regions,
)
def _quad(x0, y0, x1, y1):
"""An axis-aligned 4-vertex polygon as the detector returns."""
return np.array([[x0, y0], [x1, y0], [x1, y1], [x0, y1]], np.int32)
class TestMergeTextRegions:
def test_empty(self):
assert merge_text_regions([], 256, 256) == []
def test_far_apart_boxes_stay_separate(self):
boxes = [_quad(10, 10, 60, 30), _quad(10, 200, 60, 220)]
regions = merge_text_regions(boxes, 256, 256, dilate_frac=0.005, pad_frac=0.0)
assert len(regions) == 2
def test_close_boxes_merge(self):
# two boxes on the same line, a few px apart -> one block
boxes = [_quad(10, 10, 60, 30), _quad(64, 10, 110, 30)]
# dilate_frac sized to close the few-px inter-word gap on one line
regions = merge_text_regions(boxes, 256, 256, dilate_frac=0.03)
assert len(regions) == 1
def test_rects_in_bounds_and_padded(self):
boxes = [_quad(100, 100, 150, 130)]
(x, y, w, h) = merge_text_regions(boxes, 256, 256, pad_frac=0.05)[0]
assert x >= 0
assert y >= 0
assert x + w <= 256
assert y + h <= 256
assert w > 50 # padded beyond the raw 50px box
def test_caps_region_count(self):
boxes = [_quad(20 * i, 0, 20 * i + 8, 8) for i in range(20)]
regions = merge_text_regions(boxes, 64, 512, dilate_frac=0.002, pad_frac=0.0, max_regions=5)
assert len(regions) <= 5
class TestFeatherPaste:
def test_patch_lands_at_location_center(self):
base = np.zeros((100, 100, 3), np.uint8)
patch = np.full((40, 40, 3), 200, np.uint8)
out = feather_paste(base, patch, 30, 30, feather=6)
# center of the pasted region is (near) the patch value
assert out[50, 50, 0] >= 190
# far corner untouched
assert out[2, 2, 0] == 0
def test_does_not_mutate_base(self):
base = np.zeros((50, 50, 3), np.uint8)
feather_paste(base, np.full((20, 20, 3), 255, np.uint8), 10, 10)
assert base.sum() == 0
def test_shape_preserved(self):
base = np.zeros((50, 60, 3), np.uint8)
out = feather_paste(base, np.full((10, 10, 3), 100, np.uint8), 5, 5)
assert out.shape == base.shape
def test_partial_out_of_bounds_no_crash(self):
base = np.zeros((40, 40, 3), np.uint8)
out = feather_paste(base, np.full((30, 30, 3), 150, np.uint8), 25, 25, feather=4)
assert out.shape == (40, 40, 3)
class TestDetectionInputSize:
"""Resolution contract for the DB detector input (issue #14 recall fix).
A fixed small input (the old 736) downscaled large canvases so far that small
text fell below the detector's resolution and was missed. Detection now runs
at the native long side, capped and never upscaled.
"""
def test_large_canvas_not_downscaled_to_old_736(self):
# The #14 regression: a 2048 canvas must detect well above the old 736
# so ~12-16 px text survives. Capped at the max long side.
in_w, in_h = _detection_input_size(2048, 2048)
assert in_w == _DET_MAX_LONG_SIDE
assert in_h == _DET_MAX_LONG_SIDE
assert in_w > 736 # the old fixed input that missed small text
def test_native_resolution_not_upscaled(self):
# A 1024 canvas detects at native 1024 (not upscaled to the cap, not
# downscaled to the old 736).
assert _detection_input_size(1024, 1024) == (1024, 1024)
def test_small_image_is_native(self):
assert _detection_input_size(512, 512) == (512, 512)
def test_dims_are_multiples_of_32(self):
for h, w in [(2048, 1024), (1234, 567), (4096, 4096), (1000, 1000)]:
in_w, in_h = _detection_input_size(h, w)
assert in_w % 32 == 0
assert in_h % 32 == 0
def test_aspect_ratio_preserved_when_capped(self):
# Portrait 2048x1024: long side capped to the max, short side scaled by
# the same factor (so the 2:1 aspect is roughly kept).
in_w, in_h = _detection_input_size(2048, 1024)
assert in_h == _DET_MAX_LONG_SIDE
assert abs((in_w / in_h) - 0.5) < 0.05
def test_floor_at_32(self):
in_w, in_h = _detection_input_size(10, 5)
assert in_w >= 32
assert in_h >= 32
class TestBuildChangeMap:
def test_no_boxes_is_all_change(self):
m = build_change_map([], 32, 48)
assert m.shape == (32, 48)
assert m.dtype == np.float32
assert float(m.max()) == 0.0
def test_text_region_is_preserved_background_is_change(self):
# A 20x20 box centered in a 64x64 map, no feather for a crisp check.
box = np.array([[22, 22], [42, 22], [42, 42], [22, 42]])
m = build_change_map([box], 64, 64, preserve=0.9, feather=0)
# Inside the polygon: painted to preserve value.
assert m[32, 32] == np.float32(0.9)
# Far background: untouched -> full change (0.0).
assert m[2, 2] == 0.0
# Polarity: text preserved more than background.
assert m[32, 32] > m[2, 2]
def test_preserve_value_is_respected(self):
box = np.array([[10, 10], [30, 10], [30, 30], [10, 30]])
m = build_change_map([box], 40, 40, preserve=0.5, feather=0)
assert m[20, 20] == np.float32(0.5)
def test_feather_creates_soft_edge_gradient(self):
box = np.array([[20, 20], [44, 20], [44, 44], [20, 44]])
m = build_change_map([box], 64, 64, preserve=1.0, feather=15)
center = m[32, 32]
# An edge pixel just outside the polygon should be partially blended:
# strictly between full-change (0) and the preserved center.
edge = m[32, 47]
assert 0.0 < edge < center
assert center <= 1.0
def test_even_feather_does_not_crash(self):
box = np.array([[10, 10], [30, 10], [30, 30], [10, 30]])
m = build_change_map([box], 40, 40, feather=14)
assert m.shape == (40, 40)
def test_values_stay_in_unit_range(self):
box = np.array([[5, 5], [35, 5], [35, 35], [5, 35]])
m = build_change_map([box], 40, 40, preserve=1.0, feather=9)
assert float(m.min()) >= 0.0
assert float(m.max()) <= 1.0
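The edge ramp that `feather_paste` builds is easier to see in one dimension. The sketch below reproduces the same arithmetic in plain Python (no numpy) for a single row; `edge_alpha` is a hypothetical helper name used only here.

```python
def edge_alpha(n: int, feather: int) -> list[float]:
    """1-D version of the feather_paste alpha: linear ramps at both ends."""
    # Same clamp as feather_paste: the ramp never exceeds half the span.
    f = max(0, min(feather, n // 2))
    alpha = [1.0] * n
    for i in range(f):
        # Matches np.linspace(0.0, 1.0, f)[i]; a single-step ramp is just 0.0.
        ramp = i / (f - 1) if f > 1 else 0.0
        alpha[i] *= ramp          # leading edge fades in
        alpha[n - 1 - i] *= ramp  # trailing edge fades out
    return alpha


print(edge_alpha(8, 3))  # [0.0, 0.5, 1.0, 1.0, 1.0, 1.0, 0.5, 0.0]
```

Two consequences follow from this shape: the outermost pixel of the patch gets alpha 0 and is therefore taken entirely from the base image, and in the 2-D case the row and column ramps multiply, so corners fade faster than edges.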
-86
@@ -1,86 +0,0 @@
"""Unit tests for the pure tiling helpers (no GPU/model required).
``tiling.py`` imports torch at module top, so skip cleanly when torch is
absent. The helpers themselves are pure numpy/PIL/math -- they decide how a
large image is split into overlapping tiles and blended back, so a regression
here would seam or crop the CtrlRegen output wrongly.
"""
from __future__ import annotations
import numpy as np
import pytest
pytest.importorskip("torch")
from PIL import Image
from remove_ai_watermarks.noai.ctrlregen.tiling import (
make_blend_weight,
resize_center_crop,
tile_positions,
)
class TestTilePositions:
def test_image_smaller_than_tile_single_position(self):
assert tile_positions(500, 512, 64) == [0]
def test_image_equal_to_tile_single_position(self):
assert tile_positions(512, 512, 64) == [0]
def test_first_is_zero_last_is_total_minus_tile(self):
# The tiles must fully cover the span: first starts at 0, last ends at
# the far edge (start == total - tile), or the image's edge is missed.
pos = tile_positions(2000, 512, 64)
assert pos[0] == 0
assert pos[-1] == 2000 - 512
def test_overlap_positions_are_monotonic_and_exact(self):
assert tile_positions(1000, 512, 64) == [0, 244, 488]
def test_zero_overlap_tiles_are_contiguous(self):
# 1024 wide, 512 tile, no overlap -> two tiles butting at 512.
assert tile_positions(1024, 512, 0) == [0, 512]
def test_overlap_equal_to_tile_raises(self):
# overlap == tile makes the stride denominator (tile - overlap) zero;
# reject up front instead of dividing by zero.
with pytest.raises(ValueError, match="overlap"):
tile_positions(2000, 512, 512)
def test_overlap_greater_than_tile_raises(self):
with pytest.raises(ValueError, match="overlap"):
tile_positions(2000, 512, 600)
class TestMakeBlendWeight:
def test_zero_overlap_is_all_ones(self):
w = make_blend_weight(8, 8, 0)
assert w.shape == (8, 8)
assert w.dtype == np.float64
assert np.all(w == 1.0)
def test_overlap_ramps_corners_to_zero_center_to_one(self):
w = make_blend_weight(16, 16, 4)
assert w[0, 0] == 0.0 # cosine ramp starts at 0
assert w[8, 8] == 1.0 # center is unweighted
assert w.max() == 1.0
assert w.min() == 0.0
def test_weight_is_point_symmetric(self):
# Symmetric ramps on both edges -> mask equals its 180-degree rotation,
# so opposite tile seams blend identically.
w = make_blend_weight(16, 16, 4)
assert np.allclose(w, w[::-1, ::-1])
class TestResizeCenterCrop:
@pytest.mark.parametrize(("width", "height"), [(400, 800), (800, 400), (300, 300), (1000, 1001)])
def test_output_is_always_square_of_requested_size(self, width: int, height: int):
out = resize_center_crop(Image.new("RGB", (width, height)), 256)
assert out.size == (256, 256)
def test_default_size_is_512(self):
out = resize_center_crop(Image.new("RGB", (640, 480)))
assert out.size == (512, 512)
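The removed `tile_positions` implementation is not part of this diff, but the tests above constrain it tightly. One function that satisfies every `TestTilePositions` case is sketched below; it is inferred from the tests, not copied from the project, and the name `tile_positions_sketch` is hypothetical.

```python
import math


def tile_positions_sketch(total: int, tile: int, overlap: int) -> list[int]:
    """Start offsets of tiles of size `tile` covering a span of length `total`."""
    if overlap >= tile:
        # tile - overlap is the stride; zero or negative would never advance.
        raise ValueError("overlap must be smaller than tile")
    if total <= tile:
        return [0]  # a single tile covers the whole span
    span = total - tile      # start offset of the last tile
    stride = tile - overlap  # maximum step that keeps the requested overlap
    n = math.ceil(span / stride) + 1
    # Spread the n starts evenly so the last tile ends exactly at the far edge.
    return [round(i * span / (n - 1)) for i in range(n)]


print(tile_positions_sketch(1000, 512, 64))  # [0, 244, 488]
print(tile_positions_sketch(1024, 512, 0))   # [0, 512]
print(tile_positions_sketch(2000, 512, 64))  # [0, 372, 744, 1116, 1488]
```

Spreading the starts evenly, rather than stepping by a fixed stride and clamping the last tile, keeps the overlap roughly uniform across seams.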
Generated
+29 -1051
File diff suppressed because it is too large