docs: capture the Qwen-improvement research (ship vs improve)

Cited deep-research report (22 sources, 3-vote adversarial verification, 5 refuted)
behind the "ship qwen as-is or improve first?" decision. Verdict: shippable now as
an opt-in text lane; strongest improvement lead is adding a Qwen-Image ControlNet
(InstantX / DiffSynth, Apache-2.0, diffusers QwenImageControlNetPipeline) for face/
skin structure; Z-Image-Turbo (6B, Apache-2.0) is the best cheaper text-preserving
substitute. No improvement has measured face-fidelity at our scrub floors yet --
validate with scripts/fidelity_metrics.py first. Linked from known-limitations.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Victor Kuznetsov
2026-06-20 15:58:46 -07:00
parent 0d9033d63a
commit 8f64869bfc
2 changed files with 155 additions and 0 deletions
@@ -146,4 +146,6 @@ The scrub still comes from the img2img `strength` (same lever as SDXL); the call
**Conclusion: Qwen is the better TEXT-preserving remover (substantial Latin/mixed text), NOT a universal fidelity win — controlnet's canny edge map holds face skin detail better, so the path is a content-routed lane (text→qwen, faces→controlnet), not a blanket migration.** Caveat: `resolve_strength` is shared and pipeline-independent, so the Gemini default (0.15) UNDER-scrubs Gemini on `qwen` (floor 0.25) — pass `--strength 0.25` for Gemini on `qwen` until a Qwen ladder is wired. Flat-graphic content was not in the sample.
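A minimal sketch of the per-pipeline floor clamp that the caveat above asks for, under stated assumptions: `SCRUB_FLOORS` and `clamp_to_floor` are hypothetical names, not the project's `resolve_strength` (whose signature is not shown here). The floor values are the `qwen` figures quoted in this document (OpenAI 0.10, Gemini 0.25).

```python
# Hypothetical per-pipeline scrub floors. Only the `qwen` values quoted in
# this document are filled in; other pipelines are deliberately left out.
SCRUB_FLOORS = {
    ("qwen", "openai"): 0.10,
    ("qwen", "gemini"): 0.25,
}


def clamp_to_floor(pipeline: str, source: str, requested: float) -> float:
    """Raise `requested` to the certified floor for (pipeline, source), if one is known."""
    if not 0.0 <= requested <= 1.0:
        raise ValueError(f"strength must be in [0, 1], got {requested}")
    floor = SCRUB_FLOORS.get((pipeline, source))
    if floor is None:
        # No certified floor recorded: pass the requested value through unchanged.
        return requested
    return max(requested, floor)


# The shared Gemini default of 0.15 is lifted to the qwen floor.
print(clamp_to_floor("qwen", "gemini", 0.15))
```

Until something like this is wired in, passing `--strength 0.25` by hand has the same effect for Gemini on `qwen`.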
**Improving Qwen (ship vs improve):** the cited research on fixing the face-smoothing while keeping the text win (Qwen-Image ControlNet for structure conditioning, Qwen-Image-Edit, Z-Image-Turbo as a cheaper text-preserving substitute, non-regenerative detail restoration) lives in `docs/qwen-improvement-research.md` -- read it before extending the `qwen` pipeline. Verdict: shippable now as an opt-in text lane; the strongest improvement lead is adding a Qwen-Image ControlNet, but no improvement has measured face-fidelity at our floors yet (validate with `scripts/fidelity_metrics.py` first).
**Seed as a quality lever (measured, openai_1 at 0.10, seeds 0-4):** the seed barely moves whole-image fidelity (img LPIPS 0.062-0.065, SSIM 0.855-0.857, PSNR 28.5-28.7 — flat) but does shift TEXT legibility (OCR CER 0.241-0.290, ~17% spread) -- the seed changes WHICH details get regenerated, not the overall level. So a per-image best-of-N-seed selection is a WEAK, text-only lever (pick the lowest-CER seed that still scrubs; fidelity selection needs no oracle). Not worth the N× cost for general use -- pin one decent seed in prod; reserve best-of-N for text-heavy premium cases.
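The best-of-N selection described above can be sketched as follows. This is an illustration only: `SeedResult` and `pick_best_seed` are hypothetical names, the per-seed CER and oracle verdict are assumed to have been computed elsewhere, and the numbers in the usage lines are made up to fall inside the reported 0.241-0.290 range.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class SeedResult:
    seed: int
    cer: float      # OCR character error rate of the scrubbed output (lower is better)
    scrubbed: bool  # oracle verdict: True if the watermark is no longer detected


def pick_best_seed(results: Sequence[SeedResult]) -> Optional[SeedResult]:
    """Return the lowest-CER result that still scrubs, or None if no seed scrubs."""
    clean = [r for r in results if r.scrubbed]
    if not clean:
        return None
    # Ties on CER are broken by the smaller seed so the choice is deterministic.
    return min(clean, key=lambda r: (r.cer, r.seed))


# Illustrative values only, not measurements.
candidates = [
    SeedResult(seed=0, cer=0.290, scrubbed=True),
    SeedResult(seed=1, cer=0.241, scrubbed=False),  # best CER, but fails the oracle
    SeedResult(seed=2, cer=0.255, scrubbed=True),
]
best = pick_best_seed(candidates)
print(best.seed if best else "no seed scrubbed")
```

The oracle check comes first on purpose: a seed that reads better but fails to scrub is never eligible.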
@@ -0,0 +1,153 @@
# Qwen-Image improvement research (2026-06-20)
Cited research behind the decision **"ship the `qwen` pipeline as-is, or improve it
first?"** Produced by the multi-source deep-research harness (5 search angles, 22
sources fetched, 85 claims extracted, 25 verified by a 3-vote adversarial check, 20
confirmed / 5 killed, 104 agent calls). Findings carry their confidence and vote.
## Context
The `qwen` pipeline runs base Qwen-Image (20B MMDiT, Apache-2.0) as a low-strength
img2img scrub (removal comes from the denoising `strength`). Certified oracle scrub
floors: OpenAI 0.10 (seed-robust), Gemini 0.25 (pinned seed). Measured against the
SDXL + canny-ControlNet pipeline (`scripts/fidelity_metrics.py`): Qwen preserves
**text** markedly better (incl. CJK and Cyrillic, lower OCR CER) but preserves
**faces** worse, smoothing skin (Qwen vs SDXL + ControlNet: Laplacian-variance
retention 0.40 vs 0.62, face LPIPS 0.17 vs 0.09, ArcFace identity 0.38 vs 0.55, all
at the scrub floors). The goal of
the research: keep Qwen's text advantage while fixing the face-smoothing, and judge
production-readiness.
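For readers unfamiliar with the skin-smoothing metric, here is one plausible reading of "Laplacian-variance retention": the variance of a discrete Laplacian on the scrubbed face crop divided by the same quantity on the original crop. This is a sketch under that assumption. The exact definition in `scripts/fidelity_metrics.py` is not reproduced here, and the function names and the synthetic crop are illustrative.

```python
import numpy as np


def laplacian_variance(gray: np.ndarray) -> float:
    """Variance of the 4-neighbour discrete Laplacian over the image interior."""
    g = gray.astype(np.float64)
    lap = (
        g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:]
        - 4.0 * g[1:-1, 1:-1]
    )
    return float(lap.var())


def laplacian_retention(original: np.ndarray, scrubbed: np.ndarray) -> float:
    """Ratio of scrubbed to original Laplacian variance (1.0 = detail fully kept)."""
    base = laplacian_variance(original)
    if base == 0.0:
        raise ValueError("original crop has no high-frequency content")
    return laplacian_variance(scrubbed) / base


# Synthetic stand-in for a grayscale face crop (assumption: real use would
# pass aligned crops of the original and the scrubbed image).
rng = np.random.default_rng(0)
crop = rng.random((64, 64))

# 3x3 box blur of the interior as a crude stand-in for skin smoothing.
smoothed = crop.copy()
smoothed[1:-1, 1:-1] = sum(
    crop[1 + dy : 63 + dy, 1 + dx : 63 + dx]
    for dy in (-1, 0, 1)
    for dx in (-1, 0, 1)
) / 9.0

print(f"retention after blur: {laplacian_retention(crop, smoothed):.2f}")
```

A value below 1.0 means high-frequency detail was lost, which is the direction of the 0.40 vs 0.62 gap reported above.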
## Verdict
Base Qwen-Image is **shippable now as an opt-in text-content lane** (Apache-2.0 on
code and weights, scrub lever confirmed), but it is not a universal upgrade (it loses
faces). The strongest verified improvement path is to **add structure conditioning**
(a Qwen-Image ControlNet) to the existing base pass, the direct analog of the SDXL +
canny conditioning that wins on faces. Separately, **Z-Image / Z-Image-Turbo** (6B,
Apache-2.0) is the best-verified lighter alternative to evaluate before committing to
the 20B cost. None of the improvements has measured face-fidelity numbers at our
scrub floors yet, so each must be validated with `scripts/fidelity_metrics.py` plus
the oracle before shipping.
## Findings
1. **[high, 3-0] A permissively-licensed Qwen-Image ControlNet exists today and is
CUDA/diffusers-runnable.** InstantX Qwen-Image-ControlNet-Union supports
canny/soft-edge/depth/pose; DiffSynth-Studio maintains blockwise Canny/Depth/Inpaint
plus an In-Context-Control-Union; diffusers exposes `QwenImageControlNetPipeline`
and `QwenImageMultiControlNetModel` with `controlnet_conditioning_scale` (default
1.0) and `control_guidance_start`/`end`. This is the direct analog of the certified
SDXL+canny structure conditioning that wins on faces. Caveat: canny/depth preserve
geometric structure, not face identity per se, and none is a **tile**-ControlNet
(the variant most tied to fine-detail/skin retention in the SDXL world).
Sources: InstantX/Qwen-Image-ControlNet-Union, InstantX/Qwen-Image-ControlNet-Inpainting,
DiffSynth-Studio Qwen-Image docs, diffusers qwenimage pipeline docs.
2. **[high, 3-0] The scrub mechanism is preserved, and the license is clean.**
`QwenImageImg2ImgPipeline.strength` (default 0.6, range 0-1; DiffSynth names it
`denoising_strength`) keeps the partial-regeneration scrub the project relies on,
lower values staying closer to the input. Qwen-Image and Qwen-Image-Edit-2509 are
Apache-2.0 on both code and weights.
3. **[medium, mixed 2-1 / 3-0] Qwen-Image-Edit improves identity consistency, but that
is not proof it fixes our metric.** The instruction-edit pipeline (2511 better than
2509) improves identity/character consistency, but only for identity *through edits*
of an input portrait, which is not the same as measured face-skin Laplacian/LPIPS
fidelity at a low scrub strength. Architecture: 20B base + Qwen2.5-VL (semantic
control) + VAE Encoder (appearance control). Several stronger edit-model face claims
were refuted (see below).
4. **[high, 3-0] Z-Image / Z-Image-Turbo is the best-verified lighter alternative.** A
6B model (~1/3 of Qwen-Image's 20B), Apache-2.0 on code and weights, strong bilingual
(Chinese + English) native text rendering, with an official diffusers
`ZImageImg2ImgPipeline` exposing the same 0-1 denoising-strength scrub lever; Turbo
runs at ~8 steps (guidance_scale=0.0) vs ~40. A material cost/footprint reduction vs
20B/A100-80GB (but see caveat 4 and the refuted consumer-GPU claim under "Refuted claims").
5. **[high, 3-0] EliGen-V2 is NOT relevant** to the face-smoothing problem. It is an
entity-level/regional control model (LoRA + regional attention placing entities via
text + mask maps, plus entity-level inpainting); it provides no
ControlNet/canny/depth/tile structure conditioning or face-skin-detail retention.
6. **[medium, 2-1] flymy-ai/qwen-image-realism-lora** is Apache-2.0 (code+weights) on
base Qwen-Image, so it is permissively usable with the existing base img2img pass,
but it is NOT verified to specifically fix the face/skin-smoothing failure mode.
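To make the `strength` lever in findings 2 and 4 concrete: under the usual diffusers img2img convention, `strength` decides how many of the scheduled denoising steps actually run. The sketch below mirrors that convention. That `QwenImageImg2ImgPipeline` and `ZImageImg2ImgPipeline` follow it exactly is an assumption, and the 40-step schedule and the helper name `img2img_steps` are illustrative.

```python
from typing import Tuple


def img2img_steps(num_inference_steps: int, strength: float) -> Tuple[int, int]:
    """Return (skipped_steps, executed_steps) for a given img2img strength.

    Assumes the common diffusers img2img convention: the first
    `num_inference_steps - num_inference_steps * strength` steps are skipped
    and the input image is noised to the matching timestep instead.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError(f"strength must be in [0, 1], got {strength}")
    init_timestep = min(num_inference_steps * strength, num_inference_steps)
    t_start = int(max(num_inference_steps - init_timestep, 0))
    return t_start, num_inference_steps - t_start


# Certified qwen floors on an illustrative 40-step schedule.
for label, strength in (("openai", 0.10), ("gemini", 0.25)):
    skipped, executed = img2img_steps(40, strength)
    print(f"{label}: strength={strength} skips {skipped} steps, runs {executed}")
```

On a short schedule such as Turbo's ~8 steps, low strengths map to very few executed steps, so scrub floors measured on one model or schedule should not be assumed to transfer to another.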
## Caveats
1. The research did NOT surface verified evidence for two things specifically asked:
(a) a Qwen-Image **tile**-ControlNet (the variant most tied to fine-detail/skin
retention; only canny/soft-edge/depth/pose/inpaint were confirmed), and (b) any
**non-regenerative detail-restoration** technique (high-frequency residual transfer,
guided filtering) that recovers smoothed faces without re-introducing the watermark.
Research angle 4 produced zero surviving claims, so it is unanswered.
2. No claim provides measured face-fidelity numbers (ArcFace/LPIPS/Laplacian) for ANY
recommended intervention at the project's scrub floors. All fidelity evidence is the
project's own internal measurement. The improvements are mechanistically sound but
unproven for this exact metric, so validate with `scripts/fidelity_metrics.py`
before shipping.
3. Several vendor model cards are marketing-register primary sources (Qwen blog,
Z-Image card). Load-bearing facts (license, params, API levers) are independently
corroborated, but comparative quality framings are author glosses.
4. Z-Image's "sub-second" figure is H800-specific and author-benchmarked; consumer-GPU
third-party benchmarks are still limited (seconds, not sub-second, though within the
<16GB envelope).
5. Time-sensitivity: Qwen-Image-Edit-2511 and Z-Image are late-2025/2026 releases; the
diffusers pipelines cited are on the main/dev branch, so confirm released-version
availability before pinning.
6. Five claims were refuted (below), clustering on over-strong edit-model face-fidelity
and one over-strong Z-Image cost claim.
## Open questions
- Does a Qwen-Image **tile-ControlNet** (or equivalent high-resolution detail
conditioning) exist under a permissive license?
- What **non-regenerative detail-restoration** method recovers smoothed faces WITHOUT
re-introducing SynthID? Note: residual transfer from the ORIGINAL risks copying back
watermark-carrying high frequencies, so it must be verified against the SynthID oracle.
- Does adding Qwen-Image-ControlNet (canny/depth) at the certified floors (OpenAI 0.10,
Gemini 0.25) actually raise face Laplacian/LPIPS toward the SDXL+ControlNet numbers
(0.62 / 0.09) WITHOUT re-introducing SynthID, or does the structure constraint
preserve the watermark the way ControlNet can on photoreal content (the existing
"SynthID CAN survive controlnet at low strength" caveat)?
- Head-to-head: does Z-Image-Turbo at its scrub floor match Qwen's text advantage
(CJK+Cyrillic CER) while not worsening faces, and what are Z-Image's own SynthID
scrub floors and seed-robustness (none exist yet)?
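One way to turn the third open question into a pass/fail check is to measure how much of the Qwen to SDXL+ControlNet face gap a candidate closes, and to refuse anything the oracle still flags. This is a sketch, not the project's acceptance rule: `FaceMetrics`, `gap_closed`, `ship_gate` and the 0.5 threshold are hypothetical, while the two reference points are the measured numbers quoted in Context.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class FaceMetrics:
    laplacian_retention: float  # higher is better
    face_lpips: float           # lower is better


# Reference points quoted in the Context section (at the scrub floors).
QWEN_BASE = FaceMetrics(laplacian_retention=0.40, face_lpips=0.17)
SDXL_CONTROLNET = FaceMetrics(laplacian_retention=0.62, face_lpips=0.09)


def gap_closed(candidate: FaceMetrics) -> Dict[str, float]:
    """Fraction of the baseline-to-target gap closed per metric (1.0 = matches target)."""
    return {
        "laplacian_retention": (
            (candidate.laplacian_retention - QWEN_BASE.laplacian_retention)
            / (SDXL_CONTROLNET.laplacian_retention - QWEN_BASE.laplacian_retention)
        ),
        "face_lpips": (
            (QWEN_BASE.face_lpips - candidate.face_lpips)
            / (QWEN_BASE.face_lpips - SDXL_CONTROLNET.face_lpips)
        ),
    }


def ship_gate(candidate: FaceMetrics, oracle_clean: bool, min_gap_closed: float = 0.5) -> bool:
    """Pass only if the oracle reports no watermark AND every metric closes enough of the gap.

    `min_gap_closed` is a hypothetical threshold chosen for illustration.
    """
    if not oracle_clean:
        return False
    return all(v >= min_gap_closed for v in gap_closed(candidate).values())


# Hypothetical candidate numbers, not measurements.
candidate = FaceMetrics(laplacian_retention=0.55, face_lpips=0.12)
print(gap_closed(candidate), ship_gate(candidate, oracle_clean=True))
```

Checking the oracle before fidelity matches the concern raised above: a structure constraint that improves faces but preserves SynthID is a failure, not a trade-off.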
## Refuted claims (do NOT rely on these)
- [0-3] "Qwen-Image-Edit-2511 specifically targets/mitigates image drift, the same
failure mode as face-detail loss in a low-strength scrub." (qwen.ai/blog, 2511)
- [0-3] "Qwen-Image-Edit-2509 explicitly improves facial identity preservation and
supports portrait styles and pose transformations." (HF Qwen-Image-Edit-2509)
- [0-3] "Qwen-Image-Edit-2509 has native built-in ControlNet support (depth/edge/
keypoint)." (HF Qwen-Image-Edit-2509)
- [1-2] "flymy realism LoRA specifically targets facial and skin detail, the exact
failure mode." (HF flymy-ai/qwen-image-realism-lora)
- [0-3] "Z-Image-Turbo runs on consumer 16GB-VRAM hardware, far below the A100-80GB of
Qwen-Image 20B, materially lowering per-image cost." (HF Tongyi-MAI/Z-Image-Turbo)
## Sources
1. https://qwen.ai/blog?id=qwen-image-edit-2511
2. https://qwenlm.github.io/blog/qwen-image-edit/
3. https://docs.comfy.org/tutorials/image/qwen/qwen-image-edit
4. https://github.com/FurkanGozukara/Stable-Diffusion/wiki/Qwen-Image-Edit-2511-Free-and-Open-Source-Crushes-Qwen-Image-Edit-2509-and-Challenges-Nano-Banana-Pro
5. https://myaiforce.com/qie-2511/
6. https://huggingface.co/Qwen/Qwen-Image-Edit-2509
7. https://huggingface.co/InstantX/Qwen-Image-ControlNet-Union
8. https://huggingface.co/InstantX/Qwen-Image-ControlNet-Inpainting
9. https://huggingface.co/DiffSynth-Studio/Qwen-Image-EliGen-V2
10. https://github.com/modelscope/DiffSynth-Studio/blob/main/docs/en/Model_Details/Qwen-Image.md
11. https://blog.comfy.org/p/day-1-support-of-qwen-image-instantx
12. https://learn.thinkdiffusion.com/how-to-use-qwen-image-with-instantx-union-controlnet-in-comfyui-guide-workflow/
13. https://huggingface.co/flymy-ai/qwen-image-realism-lora
14. https://huggingface.co/lightx2v/Qwen-Image-Lightning/discussions/4
15. https://huggingface.co/docs/diffusers/main/en/api/pipelines/qwenimage
16. https://www.diyphotography.net/skin-retouching-technique-frequency-separation/
17. https://link.springer.com/content/pdf/10.1007/978-3-642-15549-9_1.pdf
18. https://github.com/ShieldMnt/invisible-watermark/wiki/Frequency-Methods
19. https://huggingface.co/Tongyi-MAI/Z-Image-Turbo
20. https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/z_image/pipeline_z_image_img2img.py
21. https://arxiv.org/pdf/2511.22699
22. https://github.com/ModelTC/LightX2V-Qwen-Image-Lightning