mirror of
https://github.com/wiltodelta/remove-ai-watermarks.git
synced 2026-07-04 23:47:49 +02:00
docs: capture the Qwen-improvement research (ship vs improve)
Cited deep-research report (22 sources, 3-vote adversarial verification, 5 refuted) behind the "ship qwen as-is or improve first?" decision. Verdict: shippable now as an opt-in text lane; strongest improvement lead is adding a Qwen-Image ControlNet (InstantX / DiffSynth, Apache-2.0, diffusers QwenImageControlNetPipeline) for face/ skin structure; Z-Image-Turbo (6B, Apache-2.0) is the best cheaper text-preserving substitute. No improvement has measured face-fidelity at our scrub floors yet -- validate with scripts/fidelity_metrics.py first. Linked from known-limitations. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
@@ -146,4 +146,6 @@ The scrub still comes from the img2img `strength` (same lever as SDXL); the call
|
||||
|
||||
**Conclusion: Qwen is the better TEXT-preserving remover (substantial Latin/mixed text), NOT a universal fidelity win — controlnet's canny edge map holds face skin detail better, so the path is a content-routed lane (text→qwen, faces→controlnet), not a blanket migration.** Caveat: `resolve_strength` is shared and pipeline-independent, so the Gemini default (0.15) UNDER-scrubs Gemini on `qwen` (floor 0.25) — pass `--strength 0.25` for Gemini on `qwen` until a Qwen ladder is wired. Flat-graphic content was not in the sample.
|
||||
|
||||
**Improving Qwen (ship vs improve):** the cited research on fixing the face-smoothing while keeping the text win (Qwen-Image ControlNet for structure conditioning, Qwen-Image-Edit, Z-Image-Turbo as a cheaper text-preserving substitute, non-regenerative detail restoration) lives in `docs/qwen-improvement-research.md` -- read it before extending the `qwen` pipeline. Verdict: shippable now as an opt-in text lane; the strongest improvement lead is adding a Qwen-Image ControlNet, but no improvement has measured face-fidelity at our floors yet (validate with `scripts/fidelity_metrics.py` first).
|
||||
|
||||
**Seed as a quality lever (measured, openai_1 at 0.10, seeds 0-4):** the seed barely moves whole-image fidelity (img LPIPS 0.062-0.065, SSIM 0.855-0.857, PSNR 28.5-28.7 — flat) but does shift TEXT legibility (OCR CER 0.241-0.290, ~17% spread) -- the seed changes WHICH details get regenerated, not the overall level. So a per-image best-of-N-seed selection is a WEAK, text-only lever (pick the lowest-CER seed that still scrubs; fidelity selection needs no oracle). Not worth the N× cost for general use -- pin one decent seed in prod; reserve best-of-N for text-heavy premium cases.
|
||||
|
||||
@@ -0,0 +1,153 @@
|
||||
# Qwen-Image improvement research (2026-06-20)
|
||||
|
||||
Cited research behind the decision **"ship the `qwen` pipeline as-is, or improve it
|
||||
first?"** Produced by the multi-source deep-research harness (5 search angles, 22
|
||||
sources fetched, 85 claims extracted, 25 verified by a 3-vote adversarial check, 20
|
||||
confirmed / 5 killed, 104 agent calls). Findings carry their confidence and vote.
|
||||
|
||||
## Context
|
||||
|
||||
The `qwen` pipeline runs base Qwen-Image (20B MMDiT, Apache-2.0) as a low-strength
|
||||
img2img scrub (removal comes from the denoising `strength`). Certified oracle scrub
|
||||
floors: OpenAI 0.10 (seed-robust), Gemini 0.25 (pinned seed). Measured against the
|
||||
SDXL + canny-ControlNet pipeline (`scripts/fidelity_metrics.py`): Qwen preserves
|
||||
**text** markedly better (incl. CJK and Cyrillic, lower OCR CER) but preserves
|
||||
**faces** worse, smoothing skin (Laplacian-variance retention 0.40 vs 0.62, face
|
||||
LPIPS 0.17 vs 0.09, ArcFace identity 0.38 vs 0.55 at the scrub floors). The goal of
|
||||
the research: keep Qwen's text advantage while fixing the face-smoothing, and judge
|
||||
production-readiness.
|
||||
|
||||
## Verdict
|
||||
|
||||
Base Qwen-Image is **shippable now as an opt-in text-content lane** (Apache-2.0 on
|
||||
code and weights, scrub lever confirmed), but it is not a universal upgrade (it loses
|
||||
faces). The strongest verified improvement path is to **add structure conditioning**
|
||||
(a Qwen-Image ControlNet) to the existing base pass, the direct analog of the SDXL +
|
||||
canny conditioning that wins on faces. Separately, **Z-Image / Z-Image-Turbo** (6B,
|
||||
Apache-2.0) is the best-verified lighter alternative to evaluate before committing to
|
||||
the 20B cost. None of the improvements has measured face-fidelity numbers at our
|
||||
scrub floors yet, so each must be validated with `scripts/fidelity_metrics.py` plus
|
||||
the oracle before shipping.
|
||||
|
||||
## Findings
|
||||
|
||||
1. **[high, 3-0] A permissively-licensed Qwen-Image ControlNet exists today and is
|
||||
CUDA/diffusers-runnable.** InstantX Qwen-Image-ControlNet-Union supports
|
||||
canny/soft-edge/depth/pose; DiffSynth-Studio maintains blockwise Canny/Depth/Inpaint
|
||||
plus an In-Context-Control-Union; diffusers exposes `QwenImageControlNetPipeline`
|
||||
and `QwenImageMultiControlNetModel` with `controlnet_conditioning_scale` (default
|
||||
1.0) and `control_guidance_start`/`end`. This is the direct analog of the certified
|
||||
SDXL+canny structure conditioning that wins on faces. Caveat: canny/depth preserve
|
||||
geometric structure, not face identity per se, and none is a **tile**-ControlNet
|
||||
(the variant most tied to fine-detail/skin retention in the SDXL world).
|
||||
Sources: InstantX/Qwen-Image-ControlNet-Union, InstantX/Qwen-Image-ControlNet-Inpainting,
|
||||
DiffSynth-Studio Qwen-Image docs, diffusers qwenimage pipeline docs.
|
||||
|
||||
2. **[high, 3-0] The scrub mechanism is preserved, and the license is clean.**
|
||||
`QwenImageImg2ImgPipeline.strength` (default 0.6, range 0-1; DiffSynth names it
|
||||
`denoising_strength`) keeps the partial-regeneration scrub the project relies on,
|
||||
lower values staying closer to the input. Qwen-Image and Qwen-Image-Edit-2509 are
|
||||
Apache-2.0 on both code and weights.
|
||||
|
||||
3. **[medium, mixed 2-1 / 3-0] Qwen-Image-Edit improves identity consistency, but that
|
||||
is not proof it fixes our metric.** The instruction-edit pipeline (2511 better than
|
||||
2509) improves identity/character consistency, but only for identity *through edits*
|
||||
of an input portrait, which is not the same as measured face-skin Laplacian/LPIPS
|
||||
fidelity at a low scrub strength. Architecture: 20B base + Qwen2.5-VL (semantic
|
||||
control) + VAE Encoder (appearance control). Several stronger edit-model face claims
|
||||
were refuted (see below).
|
||||
|
||||
4. **[high, 3-0] Z-Image / Z-Image-Turbo is the best-verified lighter alternative.** A
|
||||
6B model (~1/3 of Qwen-Image's 20B), Apache-2.0 on code and weights, strong bilingual
|
||||
(Chinese + English) native text rendering, with an official diffusers
|
||||
`ZImageImg2ImgPipeline` exposing the same 0-1 denoising-strength scrub lever; Turbo
|
||||
runs at ~8 steps (guidance_scale=0.0) vs ~40. A material cost/footprint reduction vs
|
||||
20B/A100-80GB (but see caveat 4 on the refuted consumer-GPU claim).
|
||||
|
||||
5. **[high, 3-0] EliGen-V2 is NOT relevant** to the face-smoothing problem. It is an
|
||||
entity-level/regional control model (LoRA + regional attention placing entities via
|
||||
text + mask maps, plus entity-level inpainting); it provides no
|
||||
ControlNet/canny/depth/tile structure conditioning or face-skin-detail retention.
|
||||
|
||||
6. **[medium, 2-1] flymy-ai/qwen-image-realism-lora** is Apache-2.0 (code+weights) on
|
||||
base Qwen-Image, so it is permissively usable with the existing base img2img pass,
|
||||
but it is NOT verified to specifically fix the face/skin-smoothing failure mode.
|
||||
|
||||
## Caveats
|
||||
|
||||
1. The research did NOT surface verified evidence for two things specifically asked:
|
||||
(a) a Qwen-Image **tile**-ControlNet (the variant most tied to fine-detail/skin
|
||||
retention; only canny/soft-edge/depth/pose/inpaint were confirmed), and (b) any
|
||||
**non-regenerative detail-restoration** technique (high-frequency residual transfer,
|
||||
guided filtering) that recovers smoothed faces without re-introducing the watermark.
|
||||
Research angle 4 produced zero surviving claims, so it is unanswered.
|
||||
2. No claim provides measured face-fidelity numbers (ArcFace/LPIPS/Laplacian) for ANY
|
||||
recommended intervention at the project's scrub floors. All fidelity evidence is the
|
||||
project's own internal measurement. The improvements are mechanistically sound but
|
||||
unproven for this exact metric, so validate with `scripts/fidelity_metrics.py`
|
||||
before shipping.
|
||||
3. Several vendor model cards are marketing-register primary sources (Qwen blog,
|
||||
Z-Image card). Load-bearing facts (license, params, API levers) are independently
|
||||
corroborated, but comparative quality framings are author glosses.
|
||||
4. Z-Image's "sub-second" figure is H800-specific and author-benchmarked; consumer-GPU
|
||||
third-party benchmarks are still limited (seconds, not sub-second, though within the
|
||||
<16GB envelope).
|
||||
5. Time-sensitivity: Qwen-Image-Edit-2511 and Z-Image are late-2025/2026 releases; the
|
||||
diffusers pipelines cited are on the main/dev branch, so confirm released-version
|
||||
availability before pinning.
|
||||
6. Five claims were refuted (below), clustering on over-strong edit-model face-fidelity
|
||||
and one over-strong Z-Image cost claim.
|
||||
|
||||
## Open questions
|
||||
|
||||
- Does a Qwen-Image **tile-ControlNet** (or equivalent high-resolution detail
|
||||
conditioning) exist under a permissive license?
|
||||
- What **non-regenerative detail-restoration** method recovers smoothed faces WITHOUT
|
||||
re-introducing SynthID? Note: residual transfer from the ORIGINAL risks copying back
|
||||
watermark-carrying high frequencies, so it must be verified against the SynthID oracle.
|
||||
- Does adding Qwen-Image-ControlNet (canny/depth) at the certified floors (OpenAI 0.10,
|
||||
Gemini 0.25) actually raise face Laplacian/LPIPS toward the SDXL+ControlNet numbers
|
||||
(0.62 / 0.09) WITHOUT re-introducing SynthID, or does the structure constraint
|
||||
preserve the watermark the way ControlNet can on photoreal content (the existing
|
||||
"SynthID CAN survive controlnet at low strength" caveat)?
|
||||
- Head-to-head: does Z-Image-Turbo at its scrub floor match Qwen's text advantage
|
||||
(CJK+Cyrillic CER) while not worsening faces, and what are Z-Image's own SynthID
|
||||
scrub floors and seed-robustness (none exist yet)?
|
||||
|
||||
## Refuted claims (do NOT rely on these)
|
||||
|
||||
- [0-3] "Qwen-Image-Edit-2511 specifically targets/mitigates image drift, the same
|
||||
failure mode as face-detail loss in a low-strength scrub." (qwen.ai/blog, 2511)
|
||||
- [0-3] "Qwen-Image-Edit-2509 explicitly improves facial identity preservation and
|
||||
supports portrait styles and pose transformations." (HF Qwen-Image-Edit-2509)
|
||||
- [0-3] "Qwen-Image-Edit-2509 has native built-in ControlNet support (depth/edge/
|
||||
keypoint)." (HF Qwen-Image-Edit-2509)
|
||||
- [1-2] "flymy realism LoRA specifically targets facial and skin detail, the exact
|
||||
failure mode." (HF flymy-ai/qwen-image-realism-lora)
|
||||
- [0-3] "Z-Image-Turbo runs on consumer 16GB-VRAM hardware, far below the A100-80GB of
|
||||
Qwen-Image 20B, materially lowering per-image cost." (HF Tongyi-MAI/Z-Image-Turbo)
|
||||
|
||||
## Sources
|
||||
|
||||
1. https://qwen.ai/blog?id=qwen-image-edit-2511
|
||||
2. https://qwenlm.github.io/blog/qwen-image-edit/
|
||||
3. https://docs.comfy.org/tutorials/image/qwen/qwen-image-edit
|
||||
4. https://github.com/FurkanGozukara/Stable-Diffusion/wiki/Qwen-Image-Edit-2511-Free-and-Open-Source-Crushes-Qwen-Image-Edit-2509-and-Challenges-Nano-Banana-Pro
|
||||
5. https://myaiforce.com/qie-2511/
|
||||
6. https://huggingface.co/Qwen/Qwen-Image-Edit-2509
|
||||
7. https://huggingface.co/InstantX/Qwen-Image-ControlNet-Union
|
||||
8. https://huggingface.co/InstantX/Qwen-Image-ControlNet-Inpainting
|
||||
9. https://huggingface.co/DiffSynth-Studio/Qwen-Image-EliGen-V2
|
||||
10. https://github.com/modelscope/DiffSynth-Studio/blob/main/docs/en/Model_Details/Qwen-Image.md
|
||||
11. https://blog.comfy.org/p/day-1-support-of-qwen-image-instantx
|
||||
12. https://learn.thinkdiffusion.com/how-to-use-qwen-image-with-instantx-union-controlnet-in-comfyui-guide-workflow/
|
||||
13. https://huggingface.co/flymy-ai/qwen-image-realism-lora
|
||||
14. https://huggingface.co/lightx2v/Qwen-Image-Lightning/discussions/4
|
||||
15. https://huggingface.co/docs/diffusers/main/en/api/pipelines/qwenimage
|
||||
16. https://www.diyphotography.net/skin-retouching-technique-frequency-separation/
|
||||
17. https://link.springer.com/content/pdf/10.1007/978-3-642-15549-9_1.pdf
|
||||
18. https://github.com/ShieldMnt/invisible-watermark/wiki/Frequency-Methods
|
||||
19. https://huggingface.co/Tongyi-MAI/Z-Image-Turbo
|
||||
20. https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/z_image/pipeline_z_image_img2img.py
|
||||
21. https://arxiv.org/pdf/2511.22699
|
||||
22. https://github.com/ModelTC/LightX2V-Qwen-Image-Lightning
|
||||
Reference in New Issue
Block a user