Empirical conclusion from the 2026-06-04 - 2026-06-08 cert sweeps: every shipped face-restore method (GFPGAN-on-cleaned, PhotoMaker-V2, InstantID txt2img, InstantID img2img-on-cleaned at three parameter settings) regenerates the face from an ArcFace embedding via SDXL diffusion. Output face pixels are diffusion-fresh, which makes the regenerated face look MORE AI-generated than the cleaned image (gloss, symmetric pores, SDXL "clean skin" aesthetic) regardless of license. The cleaned image from the main controlnet 0.20 removal pass is the LEAST-AI state we can reach without re-introducing SynthID; any restore on top trades original-look for embedding-driven regeneration. The fundamental issue is structural: ArcFace encodes "general look" at 512 dimensions, SDXL decodes that into pixels with the inherent SDXL aesthetic. Stronger identity push (higher strength + IP-Adapter scale) makes the face closer to the embedding but more AI-looking; weaker push leaves identity to drift further. No parameter setting recovers original identity AND looks less AI than cleaned. Production conclusion: do not ship `--restore-faces` in any monetized deployment. The extras (`instantid`, `photomaker`) stay in the library for research / personal use where users explicitly want regeneration. 
Documented at every entry point:
- CLAUDE.md: new "Face restore trade-off" bullet + every restore mention rewritten to "REGENERATES, does NOT recover"; controlnet bullet updated
- README.md: feature bullet + callout + secondary mention all updated
- docs/synthid-robust-identity-research-2026-06-08.md: appended "Empirical follow-up" section documenting the InstantID sweep phases (Phase 1 txt2img v1/v2/v3, Phase 2 img2img defaults + stronger params)
- docs/controlnet-removal-pipeline-research.md: updated restore-faces bullet to reflect the empirical conclusion
- CLI help: `_restore_faces_options` docstring + `--restore-faces` / `--restore-faces-method` help text all updated

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ControlNet-as-removal-pipeline research: can structure-conditioned regeneration scrub SynthID and keep text?
Date: 2026-06-02. Source: a manual primary-source pass (WebSearch + WebFetch over the watermark-removal-attack and SDXL-ControlNet literature). Prompted by issue #35 (@newideas99 / Jacob): "as we use SDXL even at low strength that kills small text ... Do you think ControlNet could be added to preserve and still remove the watermark?" Clarified scope: Jacob means replacing the removal pipeline itself with a ControlNet-conditioned regeneration (structure held by the control signal), NOT a separate text-protection add-on.
A deep-research workflow run was attempted first (wf_3244411d-ffd) and failed at the harness
level (97 agents completed without emitting StructuredOutput; ~4.3 M tokens, no report). This
note is the hand-run replacement.
The question, precisely
Can a single full-image ControlNet-conditioned diffusion pass replace plain SDXL base 1.0
img2img as the watermark remover, so that one structure-guided regeneration removes the
invisible robust pixel watermark (Google SynthID) everywhere while keeping fine detail and
small/CJK text legible across the whole image? The hard constraint is unchanged from
text-protection-research.md: the watermark must be scrubbed everywhere including inside text,
so any path that freezes or composites original text pixels is disqualified.
Executive summary
The idea is already academically validated as a watermark remover and is literally what we already ship — CtrlRegen (ICLR 2025) is a canny-ControlNet + DINOv2-semantic pipeline that regenerates from clean noise. But the make-or-break gap is exact: none of the watermark-removal papers validate TEXT or fine-detail preservation at all — CtrlRegen reports only FID/PSNR/quality-model scores and explicitly contains no text, fine-detail, or hallucination analysis. Our shipped ctrlregen's empirical failure ("destroys real content, hallucinates micro-text in smooth regions") is precisely this unstudied failure mode of the published method, most likely driven by our 512 px tiling (text occupies too few pixels per tile to regenerate legibly; edge-free smooth regions get DINOv2-semantic hallucination). The constructive path is NOT to keep fixing the SD1.5 CtrlRegen, but to port the structure-control idea onto an SDXL-native ControlNet (xinsir tile-sdxl / ControlNet-Union-SDXL) as a control add to our existing SDXL base 1.0 img2img, run it at 1024+, not 512 tiles, and empirically sweep the (denoise strength x conditioning scale x resolution) cube against the SynthID oracle AND a text-legibility check. The central tension may be fundamental and must be measured: the conditioning strong enough to keep text legible may suppress regeneration enough to let SynthID survive; the regeneration strong enough to scrub may deform text regardless of edges.
Oracle validation 2026-06-04 — measured answer (AUTHORITATIVE; supersedes pre-oracle "scrubs SynthID" claims below)
The central tension the summary predicted is now MEASURED against the live oracles
(OpenAI openai.com/verify for OpenAI content; the Gemini app "Verify with SynthID"
for Google content — each detects only its own vendor's payload). Verdict: at the
shipped low vendor-adaptive strength, controlnet is NOT a drop-in SynthID remover.
It preserves structure so well that the watermark survives on exactly the photoreal
content it protects. Controlnet is the text/structure PRESERVATION pipeline; removal
is set by STRENGTH, separately calibrated, not by the pipeline choice.
This section is the single consolidated reference for the controlnet pipeline's
removal behavior. (Mirrored briefly in docs/synthid.md §5.5 and the CLAUDE.md
controlnet / face_restore bullets, which point here.)
What we measured (real gpt-image + Gemini originals)
1. Content × pipeline — neither pipeline clears all content at low strength.
OpenAI set, strength = vendor-adaptive (0.10 OpenAI / 0.15 no-C2PA), --max-resolution 1536:
| content | controlnet (--auto) | plain default |
|---|---|---|
| flat text card | clean | clean |
| flat graphic (logo/poster, large flat fills) | clean | SynthID detected |
| photoreal (9-face grid) | SynthID detected | clean |
| photoreal (bracelet product photo) | SynthID detected | clean |
Mechanism: controlnet's dense edge map keeps the regeneration very close to the original on photoreal, so the SynthID-destroying perturbation never happens; but it freely repaints large flat fills. Plain img2img at low strength perturbs photoreal texture enough yet barely touches flat fills. So the survivors FLIP by content type — pipeline choice alone does not guarantee removal.
2. Seed non-determinism near threshold. img2img uses a random seed unless --seed
is passed, and there is no local SynthID detector to self-verify. The bracelet survived
controlnet @0.15 in one run and CLEARED @0.15 in another (same pipeline+strength+res).
So a single clean run does NOT establish a strength as safe — characterizing a reliable
floor needs a seed-repeatability sweep (N runs, varied seed), not one pass.
3. Per-vendor controlnet strength ladder (photoreal, oracle):
- OpenAI: 0.10 detected → 0.15 borderline/non-deterministic → 0.20 cleared (with margin).
- Gemini (harder vendor; default 0.15 vs OpenAI 0.10): most cleared at 0.15–0.25.
- Resolution is NOT the lever: SynthID is robust to downscaling, and the study's trend says LOWER processing res needs LESS strength, so 1024 was never the wall. A Gemini face that resisted 0.15/0.20/0.25 was blocked by face-restore (#4), not strength.
4. --restore-faces RE-INTRODUCES SynthID (was the "stubborn face" mystery). A
Gemini face image stayed SynthID-detected through controlnet 0.15/0.20/0.25 WITH restore,
but CLEARED at 0.20 with --no-restore-faces (clean single-variable A/B). GFPGAN runs on
the ORIGINAL watermarked face at fidelity weight 0.5, blends ~half its pixels with the
StyleGAN2 prior, and composites that back OVER the diffusion-cleaned face → the watermark
returns in the face region. Content-dependent (smaller faces can clear with restore). So
raising strength cannot fix it — the face is re-pasted from the original after diffusion.
This also corrects the prior "GFPGAN scrubs SynthID / oracle-confirmed clean" claim (it was
checked on one lucky image).
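The seed-repeatability requirement from item 2 can be made mechanical. The following is a minimal sketch, assuming oracle verdicts are recorded by hand as a mapping from (strength, seed) to "clean" or "detected"; the function name, the data shape, and the example seeds are illustrative and not part of the shipped code.

```python
from collections import defaultdict


def certified_floor(verdicts, min_seeds=2):
    """Return the lowest strength that is clean on every tested seed.

    `verdicts` maps (strength, seed) -> "clean" | "detected", filled in by
    hand from the vendor oracle. A strength qualifies only if it was tested
    on at least `min_seeds` seeds and every higher tested strength is also
    fully clean, so a single lucky run cannot certify a floor.
    """
    by_strength = defaultdict(list)
    for (strength, _seed), verdict in verdicts.items():
        by_strength[strength].append(verdict == "clean")

    floor = None
    for strength in sorted(by_strength, reverse=True):
        runs = by_strength[strength]
        if len(runs) >= min_seeds and all(runs):
            floor = strength
        else:
            break  # first strength that is untested enough or not fully clean
    return floor


# Illustrative data shaped like the OpenAI bracelet ladder (seed ids assumed).
bracelet = {
    (0.10, 1): "detected",
    (0.15, 1): "detected", (0.15, 2): "clean",
    (0.20, 1): "clean", (0.20, 2): "clean", (0.20, 3): "clean",
}
print(certified_floor(bracelet))
```

The walk goes from the highest strength downward and stops at the first strength that flips, which is why the mixed 0.15 row never qualifies even though one of its runs was clean.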
Certified controlnet strength floors (Modal GPU sweep + oracle, 2026-06-04)
Run via the isolated raiw-controlnet-cert Modal app (raiw-app/modal_cert.py):
controlnet, restore_faces OFF (it re-introduces SynthID), --max-resolution 1536,
each image checked on ITS OWN vendor oracle (OpenAI -> openai.com/verify, Gemini -> the
Gemini app; the two payloads are vendor-specific and never cross-checked):
| vendor | floor | evidence (oracle, restore OFF, <= 1536) |
|---|---|---|
| OpenAI | 0.20 | 2 photoreal images (9-face grid + bracelet) x seed {1,2,3} = 6/6 clean; the bracelet that FLIPPED at 0.15 is seed-robust at 0.20 |
| Gemini | 0.30 | hardest face (gemini_3): 0.20 detected -> 0.30 clean on 2/2 seeds; Gemini is the harder vendor (default 0.15 vs OpenAI 0.10) AND resolution-sensitive |
- OpenAI 0.20 transfers to prod as-is (OpenAI removal is resolution-independent: the study clears it at 0.05 across 1024-1600).
- Gemini 0.30 is the floor at <= 1536 only. Gemini is resolution-sensitive (study: native 2816 likely needs >= 0.30 even on `default`), and raiw.cc runs NATIVE (`max_resolution=0` in `modal_app.py`). So either CAP Gemini to <= 1536 in raiw.cc and use 0.30, or run a native-resolution Gemini cert and expect a higher floor (~0.35+).
Recommendations for a removal pipeline (raiw.cc)
- Treat controlnet as PRESERVATION, not removal. Choose it for text/structure content, `default` for photoreal; removal efficacy comes from STRENGTH in both.
- Give controlnet a higher, per-vendor strength than `default` (today both share `resolve_strength` 0.10/0.15, tuned for plain img2img). Certified controlnet floors: OpenAI 0.20, Gemini 0.30 (see table above); add a controlnet-specific per-vendor schedule to `resolve_strength` rather than reusing the `default` ladder.
- Fix the seed in prod. The non-determinism is purely `seed=None` (random); a fixed `--seed` makes every run reproduce the certified-clean result, so you ship a deterministic, re-certifiable config (and the seed sweep collapses to one config).
- `--restore-faces` is OFF in prod and stays opt-in. Two methods ship (`instantid` default, `photomaker`), both NON-COMMERCIAL. They REGENERATE the face from an ArcFace embedding via SDXL diffusion, making the output face look more AI-generated than the cleaned image (gloss, symmetric pores, SDXL "clean skin" aesthetic). For production face preservation the cleaned image from controlnet 0.20 is the LEAST-AI state we can reach; any restore on top trades original-look for embedding-driven regeneration. Empirical sweep summary: GFPGAN-on-cleaned polished without identity recovery; PhotoMaker-V2 produced a different person; InstantID txt2img produced studio-portrait patchwork on group photos; InstantID img2img-on-cleaned with three parameter settings integrated scene context cleanly but never recovered original identity precisely. Every setting traded one problem for another. See `docs/synthid-robust-identity-research-2026-06-08.md` "Empirical follow-up" for the full sweep.
- No local SynthID detector exists → the service can't self-verify; bake in strength margin and periodic oracle spot-checks.
- Lesson: visual-quality / face-identity recovery does NOT prove removal — only the oracle does, across MULTIPLE content types; never conclude from a partial result (the photoreal-only data first read as "controlnet shields, default removes"; the flat-graphic result reversed it; the face mystery was restore, not strength).
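The per-vendor, per-pipeline schedule recommended above could look roughly like this. It is a sketch under stated assumptions: the real `resolve_strength` signature is not shown in this note, so `resolve_strength_sketch`, `STRENGTH_FLOORS`, and `PROD_SEED` are hypothetical names, and only the numeric values come from the tables above.

```python
# Hypothetical schedule table; only the numbers are taken from this note.
STRENGTH_FLOORS = {
    # plain img2img ladder shared today (0.10 OpenAI / 0.15 otherwise)
    "default": {"openai": 0.10, "gemini": 0.15},
    # certified controlnet floors (restore OFF, <= 1536 long side)
    "controlnet": {"openai": 0.20, "gemini": 0.30},
}

# Any fixed integer works; the point is that prod never runs with seed=None.
PROD_SEED = 1


def resolve_strength_sketch(pipeline: str, vendor: str) -> float:
    """Look up the strength for a (pipeline, vendor) pair, failing loudly if uncertified."""
    try:
        return STRENGTH_FLOORS[pipeline][vendor]
    except KeyError:
        raise ValueError(f"no certified strength for {pipeline!r}/{vendor!r}") from None


print(resolve_strength_sketch("controlnet", "gemini"))
```

Raising on an unknown pair, instead of silently falling back to the `default` ladder, keeps an uncertified combination from shipping by accident. The Gemini controlnet value holds only at <= 1536, so a native-resolution deployment would need its own entry.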
Findings (with confidence and sources)
Finding 1 — confidence: high
Claim. "ControlNet as the removal pipeline" is exactly CtrlRegen (ICLR 2025), and our
shipped ctrlregen profile is a faithful implementation of it. Its spatial control is canny
edges extracted from the watermarked image; its semantic control is DINOv2-giant via a
trainable projection + decoupled cross-attention. Clean-noise (full-strength) regeneration
scrubs the watermark from both pixel and latent space while the two control nets hold structure.
Evidence. CtrlRegen: spatial control "conditioned on Canny edge images extracted from the
watermarked image," integrated into the U-Net decoder blocks via a ControlNet structure;
semantic control on "DINOv2-giant" embeddings. Removal is strong: TPR@1%FPR driven from 1.00 ->
0.01 (StegaStamp) and 0.99 -> 0.12 (TreeRing). This matches our ctrlregen/engine.py exactly
(canny detector + facebook/dinov2-giant + spatial ControlNet from yepengliu/ctrlregen).
Sources. https://arxiv.org/html/2410.05470v1 · https://github.com/yepengliu/CtrlRegen · https://openreview.net/forum?id=mDKxlfraAn
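For readers unfamiliar with the TPR@1%FPR metric quoted in the evidence, the following sketch shows one common way to compute it from raw detector scores. The function name and the toy scores are assumptions for illustration; the papers may use a different tie-breaking or interpolation rule.

```python
def tpr_at_fpr(clean_scores, watermarked_scores, fpr=0.01):
    """True-positive rate at a fixed false-positive budget.

    Scores are detector outputs where higher means "watermark present".
    The threshold is set on the clean (negative) scores so that at most
    `fpr` of them exceed it; TPR is the share of watermarked scores above it.
    """
    if not clean_scores or not watermarked_scores:
        raise ValueError("both score lists must be non-empty")
    ranked = sorted(clean_scores, reverse=True)
    allowed_false_positives = int(fpr * len(ranked) + 1e-9)
    if allowed_false_positives >= len(ranked):
        threshold = float("-inf")
    else:
        threshold = ranked[allowed_false_positives]
    hits = sum(score > threshold for score in watermarked_scores)
    return hits / len(watermarked_scores)


# Toy scores: 100 clean images, 100 watermarked images before and after an attack.
clean = list(range(100))
before_attack = [150] * 100
after_attack = [150] + [10] * 99
print(tpr_at_fpr(clean, before_attack), tpr_at_fpr(clean, after_attack))
```

A removal attack is successful under this metric when the post-attack TPR collapses toward the false-positive budget itself, which is what the 1.00 -> 0.01 StegaStamp figure expresses.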
Finding 2 — confidence: high
Claim. Regeneration provably removes any bounded-perturbation pixel watermark given enough noise — the operative constraint is the amount of regeneration, which is the same knob that trades against fidelity.
Evidence. Zhao et al., "Invisible Image Watermarks Are Provably Removable Using Generative AI" (NeurIPS 2024): a noise-then-reconstruct regeneration attack "guarantees the removal of any invisible watermark" that perturbs the image within a bounded L2 distance. The guarantee is a function of injected noise magnitude — low noise preserves detail but leaves the watermark; high noise scrubs but discards original signal. This is the knob ControlNet conditioning is meant to make survivable (push regeneration high while the control signal holds composition).
Sources. https://arxiv.org/abs/2306.01953 · https://github.com/XuandongZhao/WatermarkAttacker
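Why the guarantee depends on noise magnitude can be illustrated with a standard fact about Gaussians. This is an illustration, not the paper's exact theorem: if the same isotropic Gaussian noise is added to a clean image and to its watermarked twin, the total-variation distance between the two noised distributions depends only on the ratio of the watermark's L2 norm to the noise scale, and no later processing step can increase it. The function name and the example numbers are assumptions.

```python
import math


def max_detection_advantage(delta_l2: float, sigma: float) -> float:
    """Total-variation distance between N(x, s^2 I) and N(x + d, s^2 I).

    For two Gaussians with the same isotropic covariance this equals
    2 * Phi(||d|| / (2 s)) - 1, which simplifies to erf(||d|| / (2 * sqrt(2) * s)).
    It upper-bounds how much better than chance any detector can separate
    the two noised inputs, whatever reconstruction follows.
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return math.erf(delta_l2 / (2.0 * math.sqrt(2.0) * sigma))


# Same watermark budget, increasing noise: the detectable gap shrinks.
for sigma in (0.5, 1.0, 4.0):
    print(sigma, round(max_detection_advantage(2.0, sigma), 4))
```

The same noise scale also erases legitimate fine detail of comparable amplitude, which is the fidelity side of the knob described in the claim.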
Finding 3 — confidence: high
Claim. The make-or-break gap: no watermark-removal paper validates text or fine-detail preservation. CtrlRegen's "high perceptual quality" is FID/PSNR/quality-model only and explicitly omits text, fine-detail, and hallucination analysis. So the literature does NOT support the specific claim Jacob needs (text survives), it is simply unmeasured.
Evidence. CtrlRegen reports CLIP-FID, PSNR, Q-Align, LIQE; the fetched analysis confirms "the paper contains no discussion of text preservation, fine-detail retention, or hallucination artifacts," and "explicitly avoids discussing failure modes." Pixel metrics like PSNR are acknowledged not to reflect perception, and text legibility is a different axis than FID.
Sources. https://arxiv.org/html/2410.05470v1
Finding 4 — confidence: medium-high
Claim. Resolution is the prime suspect for our shipped ctrlregen's content destruction. We tile to 512 px and run full clean-noise per tile; at 512 px text occupies too few pixels per tile to regenerate legibly, and smooth edge-free regions (no canny signal) are filled by the DINOv2 semantic prior, which hallucinates texture/micro-text. The paper omits resolution entirely, so this is an implementation regime it never characterized.
Evidence. Our ctrlregen/engine.py: PROCESS_SIZE = 512, TILE_SIZE = 512, full strength
on each tile. This mirrors the _run_region_hires insight (text needs MORE pixels under
regeneration so strokes exceed the VAE's ~8 px latent floor), but ctrlregen runs the regeneration
at LOW res, the opposite. CtrlRegen's paper gives no resolution/tiling spec to contradict this.
Sources. internal (src/remove_ai_watermarks/noai/ctrlregen/engine.py); resolution-omission
confirmed against https://arxiv.org/html/2410.05470v1
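The resolution argument is simple arithmetic and can be checked per image. The sketch below assumes the image is resized so its long side equals the processing size (whether ctrlregen resizes or only tiles is not stated here) and uses the 8x spatial downscale of the SD1.5/SDXL VAE; the 24 px glyph height is a made-up example, not a measurement.

```python
VAE_DOWNSCALE = 8  # SD1.5 and SDXL VAEs map each 8x8 pixel block to one latent cell


def latent_cells_per_glyph(glyph_px: float, source_long_side: int, process_long_side: int) -> float:
    """How many latent cells a glyph spans after resizing to the processing size."""
    scale = process_long_side / source_long_side
    return glyph_px * scale / VAE_DOWNSCALE


# Hypothetical 24 px glyph in a 1537 px long-side source.
for process in (512, 1024, 1537):
    print(process, round(latent_cells_per_glyph(24, 1537, process), 2))
```

Under these assumptions the glyph covers about one latent cell at 512, two at 1024, and three at native size, which is the sense in which low-resolution regeneration leaves text too few cells to be redrawn legibly.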
Finding 5 — confidence: high
Claim. SDXL-native ControlNets exist, so the removal-pipeline upgrade need NOT be the SD1.5
re-architecture our current ctrlregen is. xinsir controlnet-tile-sdxl-1.0 and
controlnet-union-sdxl-1.0 (ControlNet++) run on SDXL base 1.0. The tile model has a tile_var
image-variation mode purpose-built to regenerate detail while preserving structure, at
controlnet_conditioning_scale = 1.0, optimal 1024 px. This is a drop-in control add to our
existing SDXL img2img.
Evidence. xinsir tile-sdxl model card: use cases = deblur/detail-repaint, image variation
(preserving structure), super-resolution; controlnet_conditioning_scale = 1.0, ~30 steps,
optimal 1024x1024, works with madebyollin/sdxl-vae-fp16-fix (the same VAE our fp16 path
already swaps in). ControlNet-Union-SDXL / ControlNet++ merges 10+ control types (canny, HED,
tile, depth, lineart) into one SDXL model.
Sources. https://huggingface.co/xinsir/controlnet-tile-sdxl-1.0 · https://huggingface.co/xinsir/controlnet-union-sdxl-1.0 · https://github.com/xinsir6/ControlNetPlus
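As a rough picture of what "a control add to our existing SDXL img2img" means in diffusers terms, here is a minimal sketch. It is not the project's pipeline code: the helper names are hypothetical, the canny model is used as the example control, the empty prompt and 30 steps are assumptions, and the dtype rule follows the fp32-on-mps/cpu fix described elsewhere in this note. Heavy imports are deferred so the module loads without a GPU stack installed.

```python
def pick_dtype_name(device: str) -> str:
    """fp16 only on cuda/xpu; fp32 on mps/cpu, where fp16 decodes to NaN."""
    return "float16" if device in ("cuda", "xpu") else "float32"


def build_controlnet_img2img(device: str = "cuda"):
    """Load SDXL base 1.0 plus an SDXL-native ControlNet (downloads weights on first call)."""
    import torch
    from diffusers import (
        AutoencoderKL,
        ControlNetModel,
        StableDiffusionXLControlNetImg2ImgPipeline,
    )

    dtype = getattr(torch, pick_dtype_name(device))
    controlnet = ControlNetModel.from_pretrained(
        "xinsir/controlnet-canny-sdxl-1.0", torch_dtype=dtype
    )
    vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=dtype)
    pipe = StableDiffusionXLControlNetImg2ImgPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0",
        controlnet=controlnet,
        vae=vae,
        torch_dtype=dtype,
    )
    pipe.enable_vae_tiling()
    pipe.enable_attention_slicing()
    return pipe.to(device)


def regenerate(pipe, image, control_image, strength: float, seed: int, scale: float = 1.0):
    """One structure-conditioned img2img pass with a fixed seed."""
    import torch

    generator = torch.Generator(device="cpu").manual_seed(seed)
    result = pipe(
        prompt="",
        image=image,
        control_image=control_image,
        strength=strength,
        controlnet_conditioning_scale=scale,
        num_inference_steps=30,
        generator=generator,
    )
    return result.images[0]
```

The only difference from a plain SDXL img2img call is the extra `controlnet` module at load time and the `control_image` / `controlnet_conditioning_scale` arguments at call time, which is why this counts as an add-on and not a base-model migration.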
Finding 6 — confidence: high
Claim. The community tile-ControlNet upscale workflow runs at LOW denoise (0.3-0.4) —
the wrong regime for watermark removal. It preserves detail precisely by regenerating little, so
a naive tile-upscale preserves text AND preserves the watermark. The open empirical question is
whether at conditioning_scale ~1.0 you can push denoise high enough to scrub SynthID while the
tile conditioning still holds text — the exact cell to test.
Evidence. Stable-Diffusion-Art ControlNet-tile upscale: denoise "typically 0.3, max ~0.4 to avoid artifacts"; some users push 0.6 with ControlNet strength 0.5. Our own data: SynthID survives below the removal-strength threshold (current Gemini needs notably higher denoise than the tile-upscale regime). So the detail-preserving regime and the watermark-scrubbing regime are on opposite ends of the denoise axis; ControlNet conditioning is the bet that they can meet.
Sources. https://stable-diffusion-art.com/controlnet-upscale/ ·
internal (docs/synthid.md strength data)
Finding 7 — confidence: high
Claim. Forensic-stealth caveat: diffusion-based regeneration is among the MOST detectable removal families. Even a ControlNet-regeneration that fools the SynthID oracle leaves forensic traces flagging the output as "removal-processed" at >98% TPR@1%FPR. This bounds the claim (do not over-promise "indistinguishable from an original") but does not block the use case — the SynthID oracle still reads negative.
Evidence. "Removing the Watermark Is Not Enough: Forensic Stealth in Generative-AI Watermark Removal" (arXiv:2605.09203, Goonatilake & Ateniese, GMU): across six removal attacks including diffusion-based regeneration, independent forensic detectors separate removal-processed from clean content at >98% TPR under a 1% FPR budget.
Sources. https://arxiv.org/html/2605.09203v1
Finding 8 — confidence: low (watch, do not build on yet)
Claim. Partial/semantic-guided regeneration is an active sub-direction that explicitly targets the removal-vs-fidelity tradeoff, but the specific fidelity-on-text claims were not verifiable from the source in this pass.
Evidence. "Removing Watermarks with Partial Regeneration using Semantic Information" (arXiv:2505.08234) proposes focusing regeneration on watermarked regions with semantic (VLM) conditioning to preserve untouched areas; the PDF body did not render cleanly enough to confirm its quantitative text/detail results. Treat as a pointer, not evidence.
Sources. https://arxiv.org/pdf/2505.08234
Recommendation / decision
ControlNet-as-removal-pipeline is worth prototyping — but not by fixing the SD1.5 ctrlregen. Port the structure-control idea onto an SDXL-native ControlNet as a control add to the existing SDXL base 1.0 img2img, run it at full resolution (1024+, NOT 512 tiles), and treat the text-vs-scrub tension as an empirical question to measure, not assume.
Prototype (runs locally on 32 GB MPS — no dedicated GPU required):
Compute is NOT the bottleneck. On a 32 GB Apple-silicon machine (M5 here) native SDXL already
runs entirely on MPS with no CPU fallback (~155 s at 1122x1402, verified — see synthid.md /
CLAUDE.md). The prototype runs at 1024 (fewer pixels than that) with SDXL base + an SDXL
ControlNet + activations in fp32 (MPS fp16 decodes to all-black NaN — issue #29 — confirmed
on run 1 below; fp32 is the required default on mps/cpu) — fits the 32 GB budget with vae-tiling +
attention-slicing; ~1-2 min/image, so a coarse sweep is a sub-hour background run. A dedicated GPU
is needed ONLY for the separate
native-large-Gemini (2816 px) case, which OOMs even without a ControlNet (that stays a raiw.cc
GPU task). The genuine external dependency is NOT compute but the manual SynthID oracle:
there is no local SynthID detector, so removal is verified by hand in the Gemini app
("Verify with SynthID") per image, regardless of where the diffusion runs.
Runner: scripts/controlnet_sweep.py (built 2026-06-02) implements exactly this sweep —
SDXL base 1.0 + an SDXL-native ControlNet img2img, one output per (control x strength x scale)
cell, plus a sweep_index.csv with empty synthid_oracle / text_legible columns to fill by
hand. It uses the dedicated single-type xinsir models (controlnet-canny-sdxl-1.0,
controlnet-tile-sdxl-1.0) rather than the Union model to keep the diffusers API path robust.
uv run python scripts/controlnet_sweep.py watermarked.png -o sweep_out
- SDXL base 1.0 img2img + `xinsir/controlnet-canny-sdxl-1.0` / `controlnet-tile-sdxl-1.0` (sweep both `tile` and `canny` control), full image at 1024, `sdxl-vae-fp16-fix`.
- Sweep the cube on fresh Gemini + gpt-image inputs that contain small/CJK text:
  - denoise strength {0.15, 0.3, 0.5, 0.7, 1.0}
  - `controlnet_conditioning_scale` {0.5, 0.8, 1.0}
  - control type {tile, canny}
- Per cell, measure BOTH axes:
  - removal: Gemini app "Verify with SynthID" oracle (the only valid SynthID oracle; for gpt-image also openai.com/verify for provenance); it must read clean.
  - text: OCR round-trip / visual legibility of the small text.
  - secondary: SSIM/FID vs original for global fidelity.
- Find the Pareto cell where the oracle is clean AND text stays legible.
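The cube and its hand-filled index can be sketched in a few lines. This is not the code of scripts/controlnet_sweep.py: the `synthid_oracle` and `text_legible` column names come from this note, while the other columns, the output file naming, and the "clean" / "yes" verdict strings are assumptions.

```python
import csv
import io
import itertools

STRENGTHS = (0.15, 0.3, 0.5, 0.7, 1.0)
SCALES = (0.5, 0.8, 1.0)
CONTROLS = ("tile", "canny")
FIELDS = ("control", "strength", "scale", "output", "synthid_oracle", "text_legible")


def sweep_cells():
    """Yield one index row per (control, strength, scale) cell, verdicts left empty."""
    for control, strength, scale in itertools.product(CONTROLS, STRENGTHS, SCALES):
        yield {
            "control": control,
            "strength": strength,
            "scale": scale,
            "output": f"{control}_s{strength}_c{scale}.png",  # hypothetical naming
            "synthid_oracle": "",  # filled by hand from the vendor oracle
            "text_legible": "",    # filled by hand or from an OCR round-trip
        }


def write_index(rows, handle):
    """Write the sweep index as CSV to any text file handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


def pareto_cells(rows):
    """Cells where the oracle reads clean AND text is legible, lowest strength first."""
    winners = [r for r in rows if r["synthid_oracle"] == "clean" and r["text_legible"] == "yes"]
    return sorted(winners, key=lambda r: float(r["strength"]))


rows = list(sweep_cells())
buffer = io.StringIO()
write_index(rows, buffer)
print(len(rows), "cells;", len(pareto_cells(rows)), "qualifying before the oracle pass")
```

Sorting the qualifying cells by strength puts the least destructive setting first, which is the natural candidate to re-certify across seeds.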
The honest fork the prototype resolves:
- If such a cell exists -> the answer to Jacob is YES, ship an SDXL-native ControlNet removal profile (replacing the SD1.5 ctrlregen) tuned to that cell.
- If no cell clears both (the tension is fundamental: scrub-strength always deforms text, or
text-preserving conditioning always spares the watermark) -> the canny/tile-ControlNet middle
path is dead for text, and the standing answer reverts to
text-protection-research.md: a full glyph-conditioned re-render (EasyText / TextSR on a FLUX-DiT base) is required, which is a base-model migration, not a control add.
Do not: keep tuning the 512 px SD1.5 ctrlregen for text (wrong resolution, wrong base model); run tile-ControlNet at the community 0.3-0.4 upscale denoise and expect watermark removal (that regime preserves the watermark); over-claim forensic invisibility (Finding 7).
Prototype run 1 — 2026-06-02 (text axis measured; watermark axis pending the oracle)
First sweep on a real, SynthID-positive, text-dense input: the corpus tokyo-street-night
gpt-image (88e61a38-chatgpt_tokyo.png, 1023x1537 -> 680x1024, dense small CJK + Latin neon
signage; SynthID + C2PA confirmed, so its valid oracle is openai.com/verify). Grid: control
{canny, tile} x strength {0.3, 0.5, 0.7, 1.0} x conditioning_scale 1.0, fp32 on MPS. Outputs +
sweep_index.csv (text verdicts filled by visual inspection; synthid_oracle left for the
manual run) are under /tmp/cnsweep/ (not committed — derived regenerations of corpus content).
Measured — PSNR vs input (proxy for how much was regenerated):
- canny: 0.3 -> 16.91, 0.5 -> 15.91, 0.7 -> 14.82, 1.0 -> 13.22 (monotonic drop = progressively more regeneration as strength rises; canny only pins edges, so flat regions change).
- tile: 0.3 -> 17.89, 0.5 -> 17.84, 0.7 -> 17.83, 1.0 -> 17.74 (flat and high — near-identity even at strength 1.0; tile@scale1.0 pins the whole image to the input and barely regenerates).
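For reference, the PSNR proxy used here is the standard definition over 8-bit pixel values. A minimal dependency-free sketch follows; the sweep script may compute it differently (for example with NumPy), and the function name is an assumption.

```python
import math


def psnr(reference, candidate, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(candidate) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, candidate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak * peak / mse)


# Every pixel off by one level gives 20 * log10(255) dB.
print(round(psnr([10, 20, 30], [11, 21, 31]), 2))
```

With Pillow, two same-sized images can be flattened for this function via `list(img.convert("RGB").tobytes())`. Lower PSNR against the input means more of the image was regenerated, which is how the canny and tile rows above are read.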
Measured — text legibility (visual, focused on SMALL text; large high-contrast glyphs survive everything because canny/tile hold their edges):
- canny: legible at 0.3, softening at 0.5 (partial), garbling at 0.7, hallucinated pseudo-glyphs at 1.0 ("NEC" -> "NWENES"). Same plain-img2img small-text deformation, only big text protected.
- tile: near-identity through 0.7, only tiny alterations at 1.0 — small text preserved throughout.
Reading (the make-or-break tension, now visible in the data):
- tile@scale1.0 does not actually regenerate (flat PSNR), so it preserves all text but almost certainly leaves the watermark intact — it is a near-identity pass, exactly the community "tile-upscale preserves detail by not regenerating" regime (Finding 6), confirmed.
- canny@scale1.0 regenerates progressively (PSNR drops) and so could scrub — but small text breaks at exactly the strength where scrubbing would start to bite. canny saves big edges, not sub-stroke small text.
- Net on the text axis: neither cell at scale 1.0 cleanly gives "high regeneration + legible small
text." This is the literature prior (Findings 3, 6) reproduced empirically. Lowering
`conditioning_scale` to force small-text regeneration is the same tradeoff knob, not an escape.
Still pending (the decisive half, cannot be done locally): run the 8 cells through the SynthID
oracle and fill synthid_oracle. The most informative cells: canny 1.0 (text dead — does it at
least scrub? if not, the canny path is dead outright), canny 0.5 (text partial — does it scrub?),
tile 1.0 (text perfect — predicted to still read present). If no cell is oracle=clean AND
text=yes, the fork resolves to the glyph-re-render path (text-protection-research.md).
Incidental bug caught: the first run used fp16 on MPS (the script's original default) and
produced all-black outputs across every cell (2 KB PNGs, PSNR 9.22 flat) — the issue #29
fp16-VAE-NaN failure, and the fp16-fix VAE did not save it on MPS. Fixed scripts/controlnet_sweep.py
to default fp32 on mps/cpu (fp16 only on cuda/xpu), matching the production pipeline.
Tuning ControlNet for text preservation across image types (research 2026-06-03)
Goal: how to configure the canny-ControlNet path to best preserve text (and faces) on diverse images. Primary sources: diffusers ControlNet doc, the ControlNet paper (arXiv:2302.05543), xinsir model cards, practitioner guides. The critical reframe: almost all community ControlNet advice optimizes a txt2img generation tradeoff (control vs creative freedom). OUR context is img2img watermark removal, where the objective is the opposite -- maximum faithful preservation while regenerating just enough to scrub. So several common recommendations INVERT here.
Removal is strength; everything below is preservation and does not change removal efficacy
(only the watermark-shielding risk -- see the caveat). Set strength by the oracle/vendor need;
tune these to keep text/faces intact at that strength.
Knobs, ranked by impact for text:
- Canny edge density (the per-image lever, currently hardcoded `_CANNY_LOW=100` / `_CANNY_HIGH=200`). Lower thresholds capture more/finer edges; higher thresholds keep only major outlines (diffusers doc + practitioner guides; ControlNet paper uses 100/200 as the default). Small-text strokes and fine facial features fall below the default 100/200 and are missed. For dense small text (infographics, signage) lower the thresholds (~50/120, even 30/100 for facial likeness per practitioner tests); for high-contrast large text 100/200 already suffices. Denser canny is still a BINARY thresholded edge map, so it does not carry the low-amplitude SynthID pixel pattern -- it passes more shape, not the watermark (still oracle-verify). This is the single highest-value unexplored lever and should become a CLI knob.
- `controlnet_conditioning_scale` -> keep at 1.0 (max structure hold). Community defaults to 0.5 for creative balance; we want maximum preservation, so 1.0 (xinsir canny/tile cards also recommend 1.0). We measured text on a clean high-contrast image surviving across strength 0.1-0.5 at scale 1.0 (PSNR ~26 flat), so scale 1.0 is the right default; only lower it if a specific image needs more regeneration to scrub (raises shielding risk the other way).
- `control_guidance_start=0.0`, `control_guidance_end=1.0` (full window) -- KEEP, do not shorten. The common "end=0.5: establish structure early then let the model render detail freely" is a creative-generation recipe; for text it is HARMFUL -- the late free steps re-render and deform the glyphs. We want the edge control active through ALL denoise steps so text stays pinned. (Our pipeline already uses the 0->1 default; the point is to NOT adopt the shorten-the-window advice.)
- Control type, per image type:
  - Text / graphics / high-contrast -> canny (the literature's reliable choice for defined edges and text; what we ship).
  - Faces / smooth tonal content -> soft-edge / HED is a candidate worth testing. Canny's hard binary threshold fractures smooth skin gradients; HED/soft-edge gives gradual edges that may hold faces better. UNVERIFIED for removal (softer edges may carry slightly more original signal -> oracle-check). A face-heavy image is the test (gemini group photos).
  - tile -> NOT for removal. It is near-identity (detail-enhancement at low denoise); it shields the watermark (measured flat PSNR ~17.8 across strength on the tokyo sweep). Do not use it as the removal control.
- Resolution -- higher long-side = strokes span more VAE latent cells = less softening, while still fully regenerating. Already a knob (`--max-resolution`); for tiny text prefer native/large.
Multi-ControlNet (canny + soft-edge), list scales e.g. [1.0, 0.8] (diffusers MultiControlNet):
could hold text edges AND face geometry at once, but doubles ControlNet memory/latency and raises the
shielding risk; defer to a v2 after the single-canny path is dialed in.
Image-type playbook (proposed, to validate with the oracle):
- Clean high-contrast text (openai_1-style): canny 100/200, scale 1.0, full window -- already optimal.
- Dense small text / infographics (big_pic3, neon signage): canny lower thresholds (~50/120), scale 1.0, full window, larger resolution.
- Faces / portraits: try soft-edge/HED control, scale 1.0; or multi-ControlNet canny+softedge.
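Turning the hardcoded canny thresholds into a per-image-type knob could be sketched as follows. The preset names and the helper are hypothetical; the threshold pairs are the ones discussed under "Canny edge density", and the `facial_likeness` entry is the practitioner value for canny, not the soft-edge/HED route the playbook proposes for portraits. OpenCV, NumPy, and Pillow are imported lazily and assumed to be installed.

```python
# Hypothetical presets: (low, high) hysteresis thresholds for cv2.Canny.
CANNY_PRESETS = {
    "clean_text": (100, 200),        # current hardcoded default
    "dense_small_text": (50, 120),   # infographics, neon signage
    "facial_likeness": (30, 100),    # practitioner value, untested on the oracle
}


def canny_control_image(pil_image, image_type: str = "clean_text"):
    """Build a 3-channel binary edge map to pass as the ControlNet control image."""
    import cv2
    import numpy as np
    from PIL import Image

    low, high = CANNY_PRESETS[image_type]
    gray = cv2.cvtColor(np.array(pil_image.convert("RGB")), cv2.COLOR_RGB2GRAY)
    edges = cv2.Canny(gray, low, high)  # uint8 array of 0 / 255
    return Image.fromarray(np.stack([edges] * 3, axis=-1))
```

Because the edge map is thresholded to 0/255, lowering the thresholds adds more strokes to the control signal without passing pixel amplitudes through, but each preset still needs its own oracle check.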
Hard caveat: every change that increases preservation (higher scale, denser canny, fuller window, softer edges) marginally REDUCES effective regeneration and so raises the chance the watermark survives -- exactly the shielding failure mode. There is no local SynthID detector, so each tuning change must be re-confirmed on the oracle. These are img2img-context recommendations derived from generation-context sources plus our own measurements; treat the playbook as hypotheses to verify, not settled defaults.
Sources. https://huggingface.co/docs/diffusers/en/using-diffusers/controlnet · https://arxiv.org/pdf/2302.05543 · https://huggingface.co/xinsir/controlnet-canny-sdxl-1.0 · https://huggingface.co/xinsir/controlnet-tile-sdxl-1.0 · https://blog.cephalon.ai/canny-and-softedge/
FaceID research: identity-preserving face conditioning (research 2026-06-03)
Motivation: canny alone preserves face STRUCTURE/position better than plain SDXL but does NOT hold IDENTITY -- verified on a real Gemini group photo (gemini_3, s015): faces drift in expression and likeness (the smile/mouth and eyes change), they are "a similar person," not the same one. Canny carries edges, not identity, so the regenerated face is identity-drifted. To hold identity WITHOUT copying original pixels (the hard constraint -- copied pixels carry SynthID), the conditioning must be an identity EMBEDDING, not pixels. Primary sources: diffusers IP-Adapter doc, InstantID (arXiv:2401.07519), IP-Adapter (arXiv:2308.06721), practitioner comparisons.
Findings
1. IP-Adapter FaceID conditions on an ArcFace identity VECTOR, not pixels (confidence: high).
FaceID extracts insightface ArcFace normed_embedding (a ~512-d identity vector) via
FaceAnalysis, and passes it as ip_adapter_image_embeds -- NOT a CLIP image embedding, NOT the
original pixels. So it is constraint-compatible: the watermark (a pixel-amplitude pattern) is not in
the identity vector, and the img2img still regenerates the pixels (removal via strength unchanged).
It loads on any SDXL via load_ip_adapter (~100 MB), is fast/low-VRAM, but identity fidelity on SDXL
is ~5-10% lower than the SD1.5 line / dedicated methods.
2. Multiple distinct faces ARE handled, via regional attention masks (confidence: high -- THE key
unlock). This is the make-or-break for group photos (our hardest case). diffusers supports a LIST
of IP-Adapter face images each with its own binary region mask: IPAdapterMaskProcessor builds the
masks, set_ip_adapter_scale([[s1, s2, ...]]), and cross_attention_kwargs={"ip_adapter_masks": masks}. So you detect each face, extract its own ArcFace embedding, assign it a region mask, and one
pass preserves N different identities simultaneously. (InstantID, by contrast, is single-subject --
it averages embeddings for multiple refs, which is wrong for distinct people -- so for group photos
IP-Adapter FaceID + masks beats InstantID.)
3. IP-Adapter + ControlNet + img2img compose (confidence: high). The doc shows IP-Adapter +
ControlNet (depth) in one pipeline and IP-Adapter + img2img (strength). Our target stack is the
union: StableDiffusionXLControlNetImg2ImgPipeline (canny = structure) + load_ip_adapter (FaceID =
identity) + strength (removal). set_ip_adapter_scale (1.0 = image-only, 0.5 = balanced) is the
identity-hold knob. API friction to verify in implementation: that ip_adapter_masks via
cross_attention_kwargs works on the ControlNet img2img pipeline (the masking is an attention-
processor feature, so it should be pipeline-agnostic, but confirm).
4. InstantID / PuLID positioning (confidence: medium). InstantID does not train the UNet so it composes with canny/depth ControlNets, and gives better single-face fidelity than FaceID -- but it is single-subject (needs its own landmark ControlNet + dedicated weights). PuLID has the best identity fidelity but is heaviest and Flux-leaning. For our multi-face, constraint-bound, SDXL-canny case, IP-Adapter FaceID + masks is the right first build; InstantID/PuLID are single-portrait upgrades.
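The per-face region masks from Finding 2 can be illustrated without any diffusion dependency. The sketch below is plain Python, not the diffusers IPAdapterMaskProcessor (which produces tensors); the box coordinates, the canvas size, and the helper names box_to_mask / overlap_stats are hypothetical. It turns one bounding box per detected face into one binary mask and reports how much the masks overlap, a cheap sanity check when faces sit close together in a group photo.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2); x2 and y2 are exclusive
Mask = List[List[int]]           # row-major binary mask


def box_to_mask(box: Box, width: int, height: int) -> Mask:
    """Return a height x width mask that is 1 inside the (clipped) box, else 0."""
    x1, y1, x2, y2 = box
    x1, x2 = max(0, x1), min(width, x2)
    y1, y2 = max(0, y1), min(height, y2)
    return [
        [1 if (x1 <= x < x2 and y1 <= y < y2) else 0 for x in range(width)]
        for y in range(height)
    ]


def overlap_stats(masks: List[Mask]) -> Dict[str, float]:
    """Summarize mask overlap: max depth, covered share, double-covered share."""
    height, width = len(masks[0]), len(masks[0][0])
    # Per-pixel count of how many face masks claim that pixel.
    depth = [sum(m[y][x] for m in masks) for y in range(height) for x in range(width)]
    total = len(depth)
    return {
        "max_depth": max(depth),
        "coverage": sum(1 for d in depth if d >= 1) / total,
        "double_covered": sum(1 for d in depth if d >= 2) / total,
    }


# Two hypothetical face boxes on a 10x10 canvas, overlapping in a 2x2 patch.
boxes = [(0, 0, 4, 4), (2, 2, 6, 6)]
masks = [box_to_mask(b, 10, 10) for b in boxes]
stats = overlap_stats(masks)
print(stats)
```

In a real run the boxes would come from the face detector and the masks would be resized to the generation resolution before being handed to the pipeline.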
Architecture (proposed)
    detect faces (insightface) -> per face: ArcFace embed + region mask
    one img2img pass:
        image=init, control_image=canny(init),                     # structure (existing)
        ip_adapter_image_embeds=[face_embeds],                     # identity per face
        cross_attention_kwargs={"ip_adapter_masks": face_masks},   # each face -> its region
        controlnet_conditioning_scale=1.0, set_ip_adapter_scale(~0.6),
        strength=vendor-adaptive                                   # removal (unchanged)
Pixels are regenerated (SynthID removed by strength), structure held by canny, each face's identity
held by its masked ArcFace vector -- no original pixel copied.
Risks / honest costs
- Shielding risk (same wall): FaceID conditioning, like canny, reduces effective regeneration -> higher set_ip_adapter_scale raises the chance SynthID survives in the face region (echo of why the old region-hires failed). MUST oracle-verify removal at the chosen FaceID scale; keep strength at the vendor threshold.
- New heavy dependency: insightface + onnxruntime + the buffalo_l model (~300 MB, downloaded on first use). Detection + embedding is CPU/ONNX, separate from the diffusion.
- Detection floor: insightface needs faces large enough (det_size ~640); tiny faces in a dense group may not be detected -> not preserved (falls back to canny-only for those).
- Identity ceiling: SDXL FaceID is ~5-10% off true identity -- a meaningful boost over canny-only drift, NOT a perfect face swap. Set expectations; PuLID/InstantID are the higher-fidelity (heavier) paths if needed.
- Value scales with strength: at low strength (OpenAI 0.10) faces barely drift, so FaceID is marginal; at the higher strength a hard vendor (Google 0.30) needs, FaceID earns its keep.
Build plan (staged)
- v1: optional --face-id flag on --pipeline controlnet. Detect faces; if any, run the masked FaceID pass (works for 1 or N faces -- masks generalize). If none detected, fall through to plain canny. Oracle-verify SynthID removal is preserved at the default FaceID scale on a face image.
- v2 (if identity still short): InstantID for single-portrait, or PuLID, as a higher-fidelity opt-in.
Sources. https://huggingface.co/docs/diffusers/main/en/using-diffusers/ip_adapter · https://huggingface.co/h94/IP-Adapter-FaceID · https://arxiv.org/pdf/2401.07519 (InstantID) · https://instantid.github.io/ · https://arxiv.org/abs/2308.06721 (IP-Adapter)
FaceID prototype run 1 -- 2026-06-03 (NEGATIVE on dense small-face groups)
Built and shipped the masked multi-face FaceID layer (--face-id, face_id.py, faceid extra).
First real run on the gemini_3 group photo (Google, s015, scale 0.6, native 2816 via cap 1536):
insightface detected 17 faces, the masked multi-face pass composed and ran end-to-end (non-black
output), so the API is correct. At s015 the result is a clear FAILURE: every face corrupted --
melted/discolored/psychedelic, materially WORSE than canny-only.
ROOT CAUSE FOUND (confirmed by ablation, not speculation) -- it is STRENGTH, not scale/masks/faces.
Investigated the real data: masks are fine (max overlap depth 2, 33% coverage, only 0.2% of pixels
double-covered -- NOT an overlap problem), embeddings are fine (normed_embedding norm 1.000), the
FaceID LoRA is not required for SDXL (h94 model card), and faces span 34-181 px (7 medium + 10 tiny).
None of those is the cause. The decisive test: the SAME image + FaceID at strength 0.5 produces
clean, coherent faces across the whole group (no psychedelic artifacts). So FaceID needs
substantial regeneration: the h94 usage is full generation (txt2img, 30 steps); at our removal
strength (0.10-0.15 = ~7 effective steps) the strong identity cross-attention cannot reconcile with
a latent that is ~85% the untouched original, so it smears identity-colored noise onto the faces.
This is a FUNDAMENTAL tension, not a tuning bug: watermark removal wants LOW strength (minimal
degradation, just enough to scrub), FaceID wants HIGH strength (regenerate the face to impose
identity). They are opposed. At strength 0.5 FaceID works AND removes the watermark, but the whole
image regenerates much more (canny still holds text/edge structure, but texture/detail drifts well
beyond the 0.15 "minimal degradation" target). So --face-id is a HIGH-STRENGTH option: it trades
whole-image fidelity for face identity, and is a footgun at the low default strength (guaranteed
garbage). Required follow-up code guard: when --face-id is set, floor strength at ~0.5 (or refuse
and warn) -- never run FaceID at the vendor-adaptive removal strength. Open question: whether
high-strength FaceID's whole-image drift is acceptable for face-centric images, or whether identity
preservation at LOW strength needs a different mechanism entirely (FaceID structurally cannot do it).
(Infra lesson: the faceid extra must stay numpy<2.0 -- pin onnx<1.18 + scipy<1.18; pinning numpy UP, as the first build did, leaves a numpy-1.26 env with a numpy-2-only scipy that crashes the diffusers import via np.long.)
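The step arithmetic and the proposed guard can be sketched together. This is a minimal illustration, not shipped code: effective_steps mirrors how diffusers img2img pipelines derive the number of denoising steps from strength (to my understanding, int(steps * strength) capped at steps), the 50-step schedule is an assumption chosen because it reproduces the ~7-step figure, and FACE_ID_MIN_STRENGTH / resolve_strength are hypothetical names for the ~0.5 floor described above.

```python
import warnings

FACE_ID_MIN_STRENGTH = 0.5  # assumption: the ~0.5 floor suggested by the ablation


def effective_steps(num_inference_steps: int, strength: float) -> int:
    """Denoising steps img2img actually runs at a given strength."""
    return min(int(num_inference_steps * strength), num_inference_steps)


def resolve_strength(requested: float, face_id: bool, strict: bool = False) -> float:
    """Floor the strength when FaceID is on; refuse instead if strict=True."""
    if not face_id or requested >= FACE_ID_MIN_STRENGTH:
        return requested
    message = (
        f"--face-id needs strength >= {FACE_ID_MIN_STRENGTH}; "
        f"got {requested} (removal-level strength corrupts faces)"
    )
    if strict:
        raise ValueError(message)
    warnings.warn(message)
    return FACE_ID_MIN_STRENGTH


# At removal strength only a handful of steps run; at 0.5 half the schedule does.
print(effective_steps(50, 0.15))  # 7
print(effective_steps(50, 0.5))   # 25
print(resolve_strength(0.15, face_id=False))  # 0.15, untouched without FaceID
```

Whether to floor silently, warn, or refuse is a product decision; the strict switch just keeps both options visible.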
Face preservation, done properly (research 2026-06-03, after the FaceID failure)
The FaceID run failed and I wrongly concluded "faces can't be preserved." Re-research corrected the understanding. The hard constraint is unchanged: to remove the watermark FROM a face the face MUST be regenerated (freezing it leaves SynthID), so the goal is identity-preserving REGENERATION of the face, at minimal overall image degradation. Three things I got wrong and the corrected picture:
What I got wrong: (1) I applied FaceID at GLOBAL high strength -- the literature is clear the architecture must be REGION-ADAPTIVE (face region handled separately, background stays low-strength); (2) I used IP-Adapter FaceID, the WEAKEST identity tool -- both it and InstantID start from an ArcFace embedding (Finding 1: not a CLIP image embedding), but InstantID adds its own landmark ControlNet and dedicated weights and hits 82-86% face-recognition similarity, a much stronger identity hold than FaceID's adapter-only signal; (3) I missed the entire face-restoration class (CodeFormer / GFPGAN), which is purpose-built for "regenerate a face, keep identity."
The most promising mechanism -- CodeFormer face-restoration post-pass (confidence: high on the
mechanism, unverified on our watermark). CodeFormer is a VQ-VAE: a frozen discrete CODEBOOK of HQ
facial priors + a Transformer that predicts code tokens from the input, and a frozen decoder that
regenerates the face FROM THE CODEBOOK ENTRIES -- "does not depend on feature fusion with low-quality
cues." So the output face pixels come from a finite learned codebook, NOT from the input pixels:
the SynthID pixel-amplitude pattern physically cannot survive a codebook re-synthesis -- a
stronger scrub than low-strength img2img (which keeps ~85% of the latent). Fidelity knob w in
[0,1]: higher w preserves identity but fuses MORE low-quality (input) cues (more watermark risk),
lower w leans on the codebook (cleaner scrub, identity drift) -- the same scrub-vs-fidelity tension,
settled per-image by the oracle; there is likely a w that holds identity AND clears the oracle.
Constraint-compatible architecture: run the normal canny low-strength controlnet removal globally (minimal degradation everywhere), then detect+align each face, run CodeFormer on the ORIGINAL face crop (to capture true identity AND re-synthesize from the codebook = scrub), and composite the CodeFormer output (codebook-generated, not original pixels -> no copy, no watermark) into the cleaned image. Decouples whole-image minimal-degradation from face identity -- no high GLOBAL strength needed.
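The final composite step can be sketched as a feathered alpha blend. The notes do not specify how the feathering is done, so the linear edge ramp here is an assumption, as are the helper names feather_alpha / composite and the toy grayscale images; a real implementation would work on aligned RGB crops with an array library.

```python
from typing import List

Image = List[List[float]]  # grayscale, row-major, values in [0, 1]


def feather_alpha(width: int, height: int, feather: int) -> Image:
    """Alpha that is 1.0 in the crop interior and ramps linearly down toward its edges."""
    alpha = []
    for y in range(height):
        row = []
        for x in range(width):
            # Pixels to the nearest crop edge, counting the edge pixel itself as 1.
            edge_dist = min(x, y, width - 1 - x, height - 1 - y) + 1
            row.append(min(1.0, edge_dist / (feather + 1)))
        alpha.append(row)
    return alpha


def composite(base: Image, face: Image, top: int, left: int, feather: int) -> Image:
    """Blend `face` into a copy of `base` at (top, left) with a feathered alpha."""
    out = [row[:] for row in base]  # leave the cleaned image itself untouched
    height, width = len(face), len(face[0])
    alpha = feather_alpha(width, height, feather)
    for y in range(height):
        for x in range(width):
            a = alpha[y][x]
            out[top + y][left + x] = a * face[y][x] + (1.0 - a) * base[top + y][left + x]
    return out


# Toy data: a black 8x8 "cleaned image" and a white 6x6 "restored face".
base = [[0.0] * 8 for _ in range(8)]
face = [[1.0] * 6 for _ in range(6)]
result = composite(base, face, top=1, left=1, feather=2)
print([round(v, 2) for v in result[3]])
```

The ramp keeps the restored face at full weight in its interior while the border fades into the cleaned background, which avoids a visible seam around each crop.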
Honest costs/caveats: (a) License -- CodeFormer is NTU S-Lab 1.0 (non-commercial/research), so it cannot be bundled in this MIT tool for general use; the license-clean alternative is GFPGAN (Apache-2.0), slightly lower quality. (b) Deps (basicsr/facexlib) are heavy and numpy-version-finicky (same class of conflict as insightface). (c) CodeFormer is a restoration model -- it can subtly alter expression/asymmetry; identity is held but not pixel-identical. (d) The watermark-scrub is mechanistically strong but UNVERIFIED -- must oracle-check. InstantID + region-adaptive strength is the alternative if the restoration route disappoints, but it is more complex (differential strength). Prototype plan: validate CodeFormer on a real face in a THROWAWAY env (identity held? oracle clean?) before any project-env integration or the license/GFPGAN decision.
CodeFormer prototype -- VALIDATED end-to-end 2026-06-03 (oracle-confirmed)
Prototyped the CodeFormer face-restoration post-pass (codeformer-pip in a throwaway venv, forced CPU
-- the pip wrapper has an MPS device-mismatch bug) on the gemini_3 group photo (18 faces). Pipeline:
all --pipeline controlnet --strength 0.15 (sparkle + SynthID removed from the whole image, minimal
degradation) -> CodeFormer on the ORIGINAL faces -> feather-composite the CodeFormer faces into the
all-cleaned image. Oracle results (Gemini app "Verify with SynthID"), isolating each part:
- pure controlnet-0.15 background (no faces): clean -> the background scrub works at 0.15 (no ControlNet-shielding problem for Google on this image).
- composite with CodeFormer faces at w0.7: SynthID DETECTED -> high fidelity fuses too much of the original face signal (the watermark) through.
- composite at w0.5: clean. composite at w0.3: clean. So the scrub-vs-fidelity threshold is between 0.5 and 0.7; w=0.5 is the sweet spot (highest fidelity / best identity that still clears the oracle). Identity at w0.3-0.7 all looks like the same person (the face is large enough), so the lower w costs little.
This VALIDATES the corrected face-preservation approach (and refutes my earlier "faces can't be preserved" / FaceID conclusion): controlnet low-strength background scrub + CodeFormer-codebook face re-synthesis at w~0.5 + feather composite = oracle-clean SynthID removal everywhere (background AND faces), identity preserved, minimal overall degradation, zero original-pixel copying (CodeFormer faces are codebook-generated). CodeFormer's discrete-codebook re-synthesis DOES scrub the pixel watermark, but only when w is low enough that the decoder leans on the codebook rather than fusing the input (watermark-carrying) features -- exactly the predicted fidelity-vs-scrub tension, with an empirical clean threshold at w<=0.5.
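Picking the sweet spot from an oracle sweep is mechanical once the results are recorded. The sketch below assumes detection is monotone in w (once a w is detected, every higher w is treated as unsafe), which matches the three points measured here but is not proven; highest_clean_w is a hypothetical helper, and the dictionary simply encodes the results above (True = oracle clean).

```python
from typing import Dict, Optional


def highest_clean_w(oracle: Dict[float, bool]) -> Optional[float]:
    """Largest tested w such that it and every smaller tested w came back clean."""
    best = None
    for w in sorted(oracle):
        if not oracle[w]:
            break  # first detection: stop, higher w is assumed unsafe
        best = w
    return best


# Oracle results from the gemini_3 composite: w0.3 clean, w0.5 clean, w0.7 detected.
results = {0.3: True, 0.5: True, 0.7: False}
sweet_spot = highest_clean_w(results)
print(sweet_spot)  # 0.5
```

With only three sample points on one image, the true threshold could sit anywhere between 0.5 and 0.7; a finer sweep on more images would narrow it.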
Production TODO (not built -- still a throwaway prototype): (1) license -- CodeFormer is NTU S-Lab
(non-commercial); decide CodeFormer-as-user-installed-extra vs GFPGAN (Apache-2.0, re-verify it scrubs
at its fidelity setting); (2) wire a --restore-faces post-pass (detect -> restore w~0.5 -> feather
composite) onto the controlnet pipeline; (3) handle the MPS device bug (force CPU for the face model
or fix); (4) re-verify the w threshold on more images / vendors (w=0.5 confirmed on one Gemini group
photo only).
Sources. https://arxiv.org/abs/2206.11253 (CodeFormer) · https://github.com/sczhou/CodeFormer · https://arxiv.org/pdf/2401.07519 (InstantID) · https://openaccess.thecvf.com/content/WACV2024/papers/Suin_Diffuse_and_Restore... (region-adaptive) · https://arxiv.org/pdf/2504.12809 (saliency-aware watermark removal)
Provenance
Hand-run primary-source pass, 2026-06-02. Sources fetched and quoted above; the central make-or-break claim (structure-conditioned high-strength regeneration scrubs the watermark while keeping text) is unverified and explicitly flagged as the thing the local prototype must measure (against the manual Gemini SynthID oracle) -- the literature supports removal (Findings 1, 2) and supports structure-preserving regeneration (Finding 5) but never jointly validated text (Finding 3). No code change implied until the prototype validates a Pareto cell on the SynthID oracle.