test(eval): vision-transcribed ground truth for qwen_in + clean text-CER numbers

data/qwen_in/ground_truth.json is transcribed by vision (PaddleOCR mangled the stylized Cyrillic), so the text metric scores variants against an accurate reference instead of noisy OCR-vs-OCR.

Text CER re-measured with this ground truth (controlnet vs qwen, lower is better) confirms qwen wins text across EN/RU/ZH:

- openai_1: 0.385 vs 0.241
- openai_2: 0.341 vs 0.290
- gemini_1 (ZH): 0.037 vs 0.000 (perfect Chinese even at the higher 0.30 strength)

Faces still favor controlnet. Refresh the numbers in docs/known-limitations.md to this cleaner methodology.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-04 23:47:49 +02:00 · 2026-06-20 14:26:23 -07:00
parent e29c156279
commit 2d5b26ed18
2 changed files with 6 additions and 1 deletion
@@ -0,0 +1,5 @@
+{
+  "openai_1_original.png": "This is a longer sample text in English.\nTypography can flow smoothly from large to medium size.\nSmaller lines help demonstrate hierarchy, rhythm, and clarity.\nEven the finest text should remain clean and readable.\n这是一段较长的中文示例文本。\n排版可以从大字号逐渐过渡到中字号。\n更小的文字能够展示层次、节奏与清晰度。\n即使是最小的一行，也应该保持清楚易读。\nЭто более длинный пример текста на русском языке.\nТипографика может плавно переходить от крупного размера к среднему.\nБолее мелкие строки показывают иерархию, ритм и ясность.\nДаже самый маленький текст должен оставаться чистым и читаемым.",
+  "openai_2_original.png": "raiw.cc\nRaiw.cc – The Platform for the Future\nRaiw.cc is a modern project built for those who value quality, innovation, and reliability.\nSecurity & Reliability\nWe use advanced technologies to protect your data and ensure stable service performance.\nInnovation\nWe continuously explore and implement better solutions to help you stay one step ahead.\nCommunity\nRaiw.cc is an active community of like-minded people who share knowledge, experience, and ideas.\nProgress Together\nWhether you're a developer, creator, or dreamer — raiw.cc is your partner in achieving your goals.\nOur mission is to create opportunities and inspire people to reach new heights.\nRaiw.cc\nVisit raiw.cc today\nExplore more possibilities and start your journey to the future!\nGet Started\nSecure\nInnovative\nCommunity\nForward-Thinking\nINNOVATE · CONNECT · CREATE · GROW",
+  "gemini_1_original.png": "每天都是一个新的机会。\n用微笑开始它。\n世界也会向你微笑。"
+}
@@ -140,4 +140,4 @@ The scrub still comes from the img2img `strength` (same lever as SDXL); the call

 **Prototype oracle floors (Modal A100-80GB, single seed, 2026-06-19 — PENDING seed-repeat cert):** on native-resolution OpenAI and Gemini cert inputs (both controls SynthID-POSITIVE), OpenAI cleared at strength **0.10** and Gemini at **0.30** (0.20 still detected).

-**Fidelity vs controlnet was MEASURED, not eyeballed (`scripts/fidelity_metrics.py`; an initial eyeball read was wrong and overturned by the metrics):** at EQUAL strength, Qwen beats controlnet on **text** (OpenAI typography 0.10: OCR CER 0.25 vs 0.37; higher SSIM/PSNR) but controlnet beats Qwen on **faces** (gemini_3, 18 faces, 0.15 each: Laplacian-variance retention 0.62 vs 0.41, face LPIPS 0.09 vs 0.13 — Qwen smooths faces MORE; ArcFace identity ~tied at 0.546/0.543). At each pipeline's OWN scrub floor the face gap widens (Qwen's Gemini floor 0.30 is 2x controlnet's 0.15). **Conclusion: Qwen is the better TEXT-preserving remover, NOT a universal fidelity win — controlnet's canny edge map holds face skin detail better.** Two caveats before relying on it: (1) near-floor scrub is SEED-NON-DETERMINISTIC (the general known-limitation above), so these single-seed floors are NOT certified — run a seed-repeat sweep before trusting them; (2) `resolve_strength` is shared and pipeline-independent, so the Gemini default (0.15, the certified controlnet floor) UNDER-scrubs Gemini on `qwen` (whose floor is ~0.30) — **pass an explicit `--strength` for Gemini content on `qwen`** until a Qwen-specific ladder is certified. Flat-graphic content was not in the prototype sample.
+**Fidelity vs controlnet was MEASURED, not eyeballed (`scripts/fidelity_metrics.py`, text scored against a vision-transcribed ground truth in `data/qwen_in/ground_truth.json` + PaddleOCR on the variants; an initial eyeball read was wrong and overturned by the metrics):** Qwen beats controlnet on **text** across EN/RU/ZH -- OCR CER, controlnet vs Qwen: openai_1 (EN+RU+ZH, both 0.10) 0.385 vs 0.241, openai_2 (EN, both 0.10) 0.341 vs 0.290, gemini_1 (ZH, controlnet 0.15 / Qwen 0.30) 0.037 vs **0.000** (Qwen renders the Chinese perfectly even at the higher strength). But controlnet beats Qwen on **faces** (gemini_3, 18 faces, 0.15 each: Laplacian-variance retention 0.62 vs 0.41, face LPIPS 0.09 vs 0.13 — Qwen smooths faces MORE; ArcFace identity ~tied at 0.546/0.543). At each pipeline's OWN scrub floor the face gap widens (Qwen's Gemini floor 0.30 is 2x controlnet's 0.15). **Conclusion: Qwen is the better TEXT-preserving remover, NOT a universal fidelity win — controlnet's canny edge map holds face skin detail better.** Two caveats before relying on it: (1) near-floor scrub is SEED-NON-DETERMINISTIC (the general known-limitation above), so these single-seed floors are NOT certified — run a seed-repeat sweep before trusting them; (2) `resolve_strength` is shared and pipeline-independent, so the Gemini default (0.15, the certified controlnet floor) UNDER-scrubs Gemini on `qwen` (whose floor is ~0.30) — **pass an explicit `--strength` for Gemini content on `qwen`** until a Qwen-specific ladder is certified. Flat-graphic content was not in the prototype sample.
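
For readers following the text-CER numbers above, here is a minimal sketch of how a character error rate can be computed against a ground-truth transcription. It is not the code in `scripts/fidelity_metrics.py`: the function names `normalize`, `levenshtein`, and `cer`, and the NFKC plus whitespace-collapsing normalization, are assumptions made for illustration only.

```python
import unicodedata


def normalize(text: str) -> str:
    """NFKC-normalize and collapse every whitespace run (incl. newlines) to one space.

    Assumption: the real metric may normalize differently (or not at all).
    """
    return " ".join(unicodedata.normalize("NFKC", text).split())


def levenshtein(ref: str, hyp: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(
                min(
                    prev[j] + 1,             # delete a reference character
                    curr[j - 1] + 1,         # insert a hypothesis character
                    prev[j - 1] + (r != h),  # substitute, or match at cost 0
                )
            )
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the normalized reference length."""
    ref = normalize(reference)
    hyp = normalize(hypothesis)
    if not ref:
        raise ValueError("reference text is empty after normalization")
    return levenshtein(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Reference string taken from the gemini_1 entry of ground_truth.json.
    truth = "每天都是一个新的机会。\n用微笑开始它。\n世界也会向你微笑。"
    # An identical OCR read scores 0.0, matching the 0.000 reported for Qwen.
    print(cer(truth, truth))
```

Under these assumptions, a variant would be scored as `cer(ground_truth["gemini_1_original.png"], ocr_text)`, where `ocr_text` is a hypothetical string holding the OCR output for that variant. A CER of 0.000 means the normalized strings match exactly, and the value can exceed 1.0 when the OCR output is much longer than the reference.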