test(eval): pin the qwen_in pipeline-fidelity eval set + PaddleOCR ground-truth flow

- data/qwen_in/: a stable, committed set of 4 AI-generated images (OpenAI +
  Google, carrying SynthID/C2PA -- same class as data/samples fixtures) used to
  compare the controlnet/sdxl/qwen pipelines for fidelity. Two multi-script text
  images (incl. RU/CJK), one EN poster, one face grid. README documents the set
  and the ground-truth workflow. data/ is sdist-excluded so the wheel is unaffected.
- scripts/fidelity_metrics.py: switch text OCR from EasyOCR to PaddleOCR
  (PP-OCRv6, higher accuracy esp. CJK, single multilingual stack); split into
  `ocr` (seed a {basename: text} ground truth) and `compare` (--ground-truth for
  a clean CER vs the hand-verified reference instead of noisy OCR-vs-OCR). Spatial
  IoU-NMS keeps the best-scoring read per line so wrong-script models don't inject
  garbage over Cyrillic/CJK.
- Oracle methodology: validate the OpenAI arm FIRST (openai.com/verify is more
  accessible and the strongest Playwright/Chrome-MCP automation candidate; the
  Gemini app is more manual). Recorded in CLAUDE.md + docs/synthid.md.

Ground-truth JSON (data/qwen_in/ground_truth.json) lands in a follow-up once
hand-verified.
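
The spatial IoU-NMS step described above can be sketched roughly as follows.
This is a minimal illustration, not the script's code: the `Detection` type, the
helper names and the sample boxes are hypothetical, while the 0.3 IoU threshold
and the 20 px row bucket mirror the values used in scripts/fidelity_metrics.py.

```python
from typing import NamedTuple

Box = tuple[float, float, float, float]  # axis-aligned (x1, y1, x2, y2)


class Detection(NamedTuple):
    """One OCR read: confidence score, bounding box, recognized text (hypothetical type)."""

    score: float
    box: Box
    text: str


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    if inter <= 0:
        return 0.0
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def best_read_per_line(dets: list[Detection], iou_thresh: float = 0.3) -> list[str]:
    """Greedy NMS: visit reads by descending score, drop any that overlap a kept read."""
    kept: list[Detection] = []
    for det in sorted(dets, key=lambda d: d.score, reverse=True):
        if all(iou(det.box, k.box) <= iou_thresh for k in kept):
            kept.append(det)
    # Coarse reading order: 20 px row buckets top-to-bottom, then left-to-right.
    kept.sort(key=lambda d: (round(d.box[1] / 20.0), d.box[0]))
    return [d.text for d in kept]


# Hypothetical detections: two language models read the same Cyrillic line.
dets = [
    Detection(0.62, (10, 10, 200, 40), "HPNBET"),  # wrong-script read, lower score
    Detection(0.97, (11, 9, 201, 41), "ПРИВЕТ"),   # right-script read, same location
    Detection(0.95, (10, 60, 120, 90), "Hello"),   # separate line further down
]
lines = best_read_per_line(dets)
```

Here the two overlapping reads compete for one physical line, and only the
higher-scoring one survives into the output.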

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Victor Kuznetsov
2026-06-20 14:17:04 -07:00
parent a2c33af284
commit e29c156279
8 changed files with 194 additions and 73 deletions
+1 -1
@@ -84,7 +84,7 @@ Compact list. Full measurements, incident history, and oracle-validation runs li
- rich was dropped: the CLI + analysis scripts print plain text (`click.echo` / the `scripts/_plain_console.py` shim). `rich` is NOT a dependency — importing it breaks the core+dev CI sync; new scripts must use the shim. No Unicode glyphs / colors / progress bars in CLI output by design.
- AVIF/HEIF/JPEG-XL metadata detection is a binary scan; C2PA removal in those containers (and MP4/MOV/M4V) is `noai/isobmff.py`; non-ISOBMFF audio/video (WebM/MP3/WAV/FLAC/OGG) strips losslessly via ffmpeg on PATH. An AI-generator token in an `Exif` meta-box *item* (bytes in `mdat`/`idat`) is now blanked **in place** by `isobmff.blank_ai_exif_tokens` (same-length space overwrite, piexif-validated so a coincidental II/MM run in pixels is ignored — no `iinf`/`iloc` surgery, mirrors `blank_ai_xmp_packets`); it scrubs the AI-token value only, leaving camera/editor EXIF intact. Still NOT built: Resemble PerTh audio detection (no presence/confidence flag exists).
- **SynthID technical reference: `docs/synthid.md`** — primary-source-cited doc covering mechanism (post-hoc encoder/decoder pair, 136-bit payload at 512x512, pixel-space, model weights NOT modified), robustness numbers (arXiv:2510.09263: ~99.98% TPR@0.1%FPR across 30 transforms including JPEG/crop/resize/color/noise), removal attacks and forensic detectability (arXiv:2605.09203: all 6 attacks detectable at >98% TPR@1%FPR), detectability limits (no public decoder, metadata-proxy only), oracle scope, and adoption landscape. Read that doc first before adding notes here.
- **SynthID detection is metadata-only.** No local pixel detector is possible by design (Google's decoder is proprietary, trusted-testers only); we read the C2PA companion proxy, which goes quiet once metadata is stripped — a quiet proxy is not proof the pixel watermark is gone. The Gemini app "Verify with SynthID" is the ONLY valid SynthID oracle; `openai.com/verify` is scoped to OpenAI provenance and each vendor's oracle detects only its own content. SynthID survives JPEG re-encode, so GitHub issue attachments remain valid pixel-watermark test subjects. Every spectral/phase detection approach evaluated (reverse-SynthID, our own probes) works only on controlled solid fills, never on real content.
- **SynthID detection is metadata-only.** No local pixel detector is possible by design (Google's decoder is proprietary, trusted-testers only); we read the C2PA companion proxy, which goes quiet once metadata is stripped -- a quiet proxy is not proof the pixel watermark is gone. Each vendor has its OWN oracle and it detects only that vendor's content: the Gemini app "Verify with SynthID" for Google, `openai.com/verify` for OpenAI. **Validate the OpenAI arm FIRST**: `openai.com/verify` is more accessible (fewer per-check restrictions) and the strongest automation candidate (Playwright / Chrome MCP); the Gemini flow is more manual. This is an ordering/throughput choice, not a substitution (see `docs/synthid.md`). SynthID survives JPEG re-encode, so GitHub issue attachments remain valid pixel-watermark test subjects. Every spectral/phase detection approach evaluated (reverse-SynthID, our own probes) works only on controlled solid fills, never on real content.
- **External AI-vs-real classifier models are out of scope** (decided 2026-05-24): per-generator, degrade off-distribution, and our own light SDXL pass would likely defeat them. Detection stays local + signal-based.
- **Default strength is VENDOR-ADAPTIVE, one ladder for BOTH pipelines** (since 2026-06-09): `resolve_strength(strength, vendor)` picks OpenAI **0.20** / Gemini **0.30** / unknown **0.30** when `--strength` is unset; explicit `--strength` always wins. Removal at low strength is content x pipeline dependent, and near-threshold removal is SEED-NON-DETERMINISTIC — pick a strength with margin and oracle-revalidate per content type. Certified controlnet floors (Modal cert 2026-06-04): OpenAI 0.20 (resolution-independent), Gemini 0.30 (only <= 1536px; native large Gemini needs ~0.35+ or a cap).
- **`controlnet` is the default pipeline**; `--pipeline sdxl` is the lighter opt-down. Neither pipeline clears all content at low strength (photoreal survives controlnet, flat graphics survive sdxl — the lever is higher strength). A removal-priority caller MUST oracle-validate strength across content types; prod recipe: controlnet + per-vendor floor + FIXED seed. Forensic-stealth caveat (arXiv:2605.09203): defeating the SynthID verifier is NOT forensic invisibility — removal-processed images are flaggable at >98% TPR@1%FPR.
+35
@@ -0,0 +1,35 @@
# qwen_in — pipeline-fidelity eval set
A small, **stable** set of AI-generated images used to compare the diffusion
removal pipelines (`controlnet` / `sdxl` / `qwen`) for fidelity with
`scripts/fidelity_metrics.py`. Fixing the set in the repo keeps comparisons
reproducible across runs and pipelines.
All four are AI-generated test content (they carry SynthID + C2PA from their
generator — verify with `remove-ai-watermarks identify`), same class as the
`data/samples/` fixtures. No real-person photos.
| file | vendor (SynthID) | content | exercises |
|---|---|---|---|
| `openai_1_original.png` | OpenAI | typography sheet (EN + RU + ZH) | text (multi-script) |
| `openai_2_original.png` | OpenAI | Raiw.cc poster | text (EN, small) |
| `gemini_1_original.png` | Google | landscape + Chinese sign | text (CJK) |
| `gemini_3_original.png` | Google | 3x3 portrait grid | faces (identity / skin texture) |
## Text ground truth
`ground_truth.json` (`{basename: text}`) is the **hand-verified** OCR of the
text-bearing originals, seeded by `fidelity_metrics.py ocr` and corrected by
hand (PaddleOCR mis-reads stylized Cyrillic in particular). It is the reference
for the text CER metric — much cleaner than OCR-vs-OCR. Regenerate the seed with:

    uv run scripts/fidelity_metrics.py ocr data/qwen_in/openai_1_original.png \
        data/qwen_in/openai_2_original.png data/qwen_in/gemini_1_original.png \
        --langs en,ru,ch --out data/qwen_in/ground_truth.json
    # then re-verify by hand before trusting it.
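
As a rough illustration of how the hand-verified file feeds the CER metric, the
sketch below loads a `{basename: text}` mapping and scores a variant's OCR
against it. It is a sketch under stated assumptions, not the script itself: the
JSON content and the variant read are made up, a small pure-Python edit distance
stands in for `rapidfuzz`, and dividing by the longer string's length is this
sketch's choice of normalization. The NFC + whitespace-stripping step mirrors
the script's `_norm`.

```python
import json
import unicodedata


def norm(text: str) -> str:
    """NFC-normalize and drop all whitespace, so line segmentation does not affect CER."""
    return "".join(unicodedata.normalize("NFC", text).split())


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert / delete / substitute, cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate in [0, 1]: edit distance over the longer normalized length."""
    r, h = norm(ref), norm(hyp)
    if not r and not h:
        return 0.0
    return levenshtein(r, h) / max(len(r), len(h))


# Hypothetical ground-truth content; the real file is hand-verified, not invented.
ground_truth = json.loads('{"openai_2_original.png": "Hello\\nWorld"}')
reference = ground_truth["openai_2_original.png"]

# Hypothetical OCR read of a cleaned variant: one character misread, different spacing.
variant_read = "Hello Wor1d"
score = cer(reference, variant_read)
```

Because whitespace is dropped before scoring, a variant whose lines are split
differently from the reference is not penalized for segmentation alone.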
## Compare
    uv run scripts/fidelity_metrics.py compare \
        --original data/qwen_in/gemini_3_original.png \
        --variant controlnet=<out>.png --variant qwen=<out>.png --ocr-langs ""
Binary file not shown. Size: 8.2 MiB
Binary file not shown. Size: 9.3 MiB
Binary file not shown. Size: 2.3 MiB
Binary file not shown. Size: 1.4 MiB

+9
@@ -181,6 +181,15 @@ pos original plus its minimum-clearing cleaned output (manifest `verified_via` =
was oracle-verified but is not committed (third-party content stays out of the
public corpus).
**Oracle validation order: start with OpenAI.** When validating removal across
vendors, run the OpenAI arm first. `openai.com/verify` is more accessible than the
Gemini app -- fewer per-check restrictions, so it gives the fastest signal and is
the strongest candidate for automation (Playwright / Chrome MCP driving
`openai.com/verify`); the Gemini "Verify with SynthID" flow is more manual. This is
an ordering/throughput choice, not a substitution: each oracle only reads its own
vendor's SynthID (`openai.com/verify` is OpenAI-scoped), so Google content still
needs the Gemini app.
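
A validation run that follows this ordering could be planned along these lines.
This is only a sketch: the vendor keys, the `plan_validation` helper and the
oracle labels are hypothetical names for illustration, and nothing here drives
either oracle.

```python
from collections import defaultdict

# Each vendor's content is checked only by that vendor's own oracle.
ORACLE_BY_VENDOR = {
    "openai": "openai.com/verify",
    "google": "Gemini app: Verify with SynthID",
}

# Ordering/throughput choice: OpenAI arm first, Google arm afterwards (not skipped).
VALIDATION_ORDER = ("openai", "google")


def plan_validation(images: list[tuple[str, str]]) -> list[tuple[str, list[str]]]:
    """Group (path, vendor) pairs by oracle and return the groups in validation order."""
    by_vendor: dict[str, list[str]] = defaultdict(list)
    for path, vendor in images:
        if vendor not in ORACLE_BY_VENDOR:
            raise ValueError(f"no oracle known for vendor {vendor!r}")
        by_vendor[vendor].append(path)
    return [(ORACLE_BY_VENDOR[v], by_vendor[v]) for v in VALIDATION_ORDER if by_vendor[v]]


plan = plan_validation(
    [
        ("gemini_1_original.png", "google"),
        ("openai_1_original.png", "openai"),
        ("openai_2_original.png", "openai"),
    ]
)
```

The Google images stay in the plan; they are simply scheduled after the OpenAI
arm has given its faster signal.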
| Vendor | Images | Resolution(s) | Pipeline | Removed at |
|--------|--------|---------------|----------|------------|
| OpenAI (gpt-image) | n=4 (3 archived + 1 external-only) | 1024x1536 .. 1600x1600 | native | **0.05** |
+149 -72
@@ -9,7 +9,8 @@
# "rapidfuzz",
# "torch",
# "lpips",
# "easyocr",
# "paddleocr",
# "paddlepaddle",
# "insightface",
# "onnxruntime",
# ]
@@ -22,29 +23,38 @@ preserved -- so "closer to the original" is the right axis here (between two
equally-scrubbed outputs, the one that deviates less from the original wins).
It is a standalone eval tool, NOT part of the package: PEP 723 inline deps let
``uv run`` build a throwaway env so the heavy models (EasyOCR, insightface,
``uv run`` build a throwaway env so the heavy models (PaddleOCR, insightface,
LPIPS) never touch uv.lock or the shipped library. Metrics self-gate: face
metrics run only where faces are detected, text metrics only where text is.
Four metric groups (all reference = original):
1. Text -- EasyOCR character error rate (CER) of each variant vs the original's
OCR string. Lower = text better preserved. OCR is noisy, so treat it
as a RELATIVE comparison (every variant scored against the same ref).
2. Face identity -- insightface (buffalo_l) ArcFace cosine similarity, original
face vs the geometrically-matched variant face. Higher = identity kept.
3. Face texture -- LPIPS + Laplacian-variance ratio (variant/original) on face
crops. Catches "plastication" (lost high-frequency skin detail):
lapvar ratio < 1 = smoother than the original.
4. Whole image -- LPIPS / SSIM / PSNR vs the original (context: background too).
Two subcommands:
ocr -- OCR images (PaddleOCR PP-OCRv6) into a JSON {basename: text} file.
Run this on the ORIGINALS, hand-verify/correct the file, and it
becomes the ground truth for ``compare --ground-truth`` -- the clean
way to score text, since OCR-vs-OCR is doubly noisy (errors on both
images + reading-order differences inflate CER even on identical text).
compare -- Score each VARIANT against the ORIGINAL across four groups:
1. Text -- character error rate (CER) of the variant's OCR vs the
verified ground truth (or the original's OCR if no --ground-truth).
2. Face identity -- insightface (buffalo_l) ArcFace cosine similarity.
3. Face texture -- LPIPS + Laplacian-variance ratio on face crops
(catches "plastication": ratio < 1 = smoother than the original).
4. Whole image -- LPIPS / SSIM / PSNR vs the original.
Usage:
uv run scripts/fidelity_metrics.py --original O.png \
--variant controlnet=C.png --variant qwen=Q.png --ocr-langs en,ru,ch_sim
uv run scripts/fidelity_metrics.py ocr O1.png O2.png --langs en,ru,ch --out gt.json
# (edit gt.json by hand to fix any OCR slips, then:)
uv run scripts/fidelity_metrics.py compare --original O1.png \
--variant controlnet=C.png --variant qwen=Q.png \
--ocr-langs en,ru,ch --ground-truth gt.json
"""
from __future__ import annotations
import logging
import json
import unicodedata
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any
@@ -54,8 +64,6 @@ import cv2
import numpy as np
from _plain_console import Console, Table
logging.basicConfig(level=logging.WARNING, format="%(message)s")
log = logging.getLogger(__name__)
console = Console()
@@ -76,45 +84,90 @@ def _match_size(variant: np.ndarray, ref: np.ndarray) -> np.ndarray:
return variant
# ── text: OCR CER ────────────────────────────────────────────────────
# EasyOCR rejects some language combos in one Reader, so group into compatible
# readers and union the detections. Cyrillic and Chinese cannot share a reader.
_OCR_GROUPS = {
"en": ["en"],
"ru": ["ru", "en"],
"ch_sim": ["ch_sim", "en"],
}
def _norm(text: str) -> str:
"""Normalize for CER: NFC + drop ALL whitespace (segmentation-order agnostic)."""
return "".join(unicodedata.normalize("NFC", text).split())
def _ocr_string(readers: list, bgr: np.ndarray) -> str:
"""Union all readers' detections into one position-sorted, whitespace-free string."""
rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)
dets: list[tuple[float, float, str]] = []
for reader in readers:
for box, text, conf in reader.readtext(rgb):
if conf < 0.3 or not text.strip():
continue
ys = [p[1] for p in box]
xs = [p[0] for p in box]
dets.append((min(ys), min(xs), text.strip()))
# Sort top-to-bottom, then left-to-right (coarse reading order).
dets.sort(key=lambda d: (round(d[0] / 20.0), d[1]))
return "".join(t for _, _, t in dets).replace(" ", "")
# ── text: PaddleOCR (PP-OCRv6) ───────────────────────────────────────
# Our lang codes -> PaddleOCR lang. The 'ch' model also reads Latin; 'ru' reads
# Cyrillic + Latin. Multiple langs in one image -> run each model, union detections.
_PADDLE_LANG = {"en": "en", "ru": "ru", "ch": "ch", "ch_sim": "ch", "latin": "latin"}
_paddle_cache: dict[str, Any] = {}
def _build_ocr_readers(langs: list[str]) -> list:
import easyocr
def _paddle(lang: str) -> Any:
if lang not in _paddle_cache:
from paddleocr import PaddleOCR
seen: set[tuple[str, ...]] = set()
readers = []
_paddle_cache[lang] = PaddleOCR(
lang=lang,
use_doc_orientation_classify=False,
use_doc_unwarping=False,
use_textline_orientation=False,
)
return _paddle_cache[lang]
def _box_xyxy(box: Any) -> tuple[float, float, float, float]:
"""Axis-aligned (x1, y1, x2, y2) of a PaddleOCR rec box ([x1,y1,x2,y2]) or poly (4x2)."""
arr = np.asarray(box, dtype=np.float32).reshape(-1)
if arr.size == 4:
return float(arr[0]), float(arr[1]), float(arr[2]), float(arr[3])
pts = arr.reshape(-1, 2)
return float(pts[:, 0].min()), float(pts[:, 1].min()), float(pts[:, 0].max()), float(pts[:, 1].max())
def _iou(a: tuple[float, float, float, float], b: tuple[float, float, float, float]) -> float:
ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
iw, ih = max(0.0, ix2 - ix1), max(0.0, iy2 - iy1)
inter = iw * ih
if inter <= 0:
return 0.0
area_a = (a[2] - a[0]) * (a[3] - a[1])
area_b = (b[2] - b[0]) * (b[3] - b[1])
return inter / (area_a + area_b - inter + 1e-9)
def _ocr_lines(bgr: np.ndarray, langs: list[str], min_score: float = 0.5) -> list[str]:
"""Detected text lines in reading order, unioned across lang models with spatial NMS.
Several language models over one image re-detect the same lines -- and crucially the
WRONG-script models read e.g. Cyrillic as confident Latin gibberish. So instead of a
naive union, keep the HIGHEST-score detection per physical location (greedy IoU NMS):
the model that actually fits a line wins it (the 'ru' model takes the Cyrillic, 'ch'
the CJK, 'en' the Latin), and the cross-script garbage is dropped.
"""
raw: list[tuple[float, tuple[float, float, float, float], str]] = []
for lang in langs:
group = tuple(_OCR_GROUPS.get(lang, [lang]))
if group in seen:
plang = _PADDLE_LANG.get(lang, lang)
for page in _paddle(plang).predict(bgr):
texts = page.get("rec_texts", [])
scores = page.get("rec_scores", [])
boxes = page.get("rec_boxes", None)
if boxes is None or len(boxes) == 0:
boxes = page.get("rec_polys", [])
for text, score, box in zip(texts, scores, boxes, strict=False):
if score < min_score or not text.strip():
continue
raw.append((float(score), _box_xyxy(box), text.strip()))
raw.sort(key=lambda d: d[0], reverse=True)
kept: list[tuple[tuple[float, float, float, float], str]] = []
for _score, box, text in raw:
if any(_iou(box, kbox) > 0.3 for kbox, _ in kept):
continue
seen.add(group)
readers.append(easyocr.Reader(list(group), gpu=False, verbose=False))
return readers
kept.append((box, text))
kept.sort(key=lambda d: (round(d[0][1] / 20.0), d[0][0])) # reading order: y then x
return [t for _, t in kept]
def _cer(ref: str, hyp: str) -> float:
from rapidfuzz.distance import Levenshtein
return Levenshtein.normalized_distance(_norm(ref), _norm(hyp))
# ── face: detection + ArcFace + texture ──────────────────────────────
@@ -183,12 +236,10 @@ def _ssim_psnr(a_bgr: np.ndarray, b_bgr: np.ndarray) -> tuple[float, float]:
a = cv2.cvtColor(a_bgr, cv2.COLOR_BGR2GRAY)
b = cv2.cvtColor(b_bgr, cv2.COLOR_BGR2GRAY)
ssim = float(structural_similarity(a, b))
psnr = float(peak_signal_noise_ratio(a, b))
return ssim, psnr
return float(structural_similarity(a, b)), float(peak_signal_noise_ratio(a, b))
# ── main ─────────────────────────────────────────────────────────────
# ── reporting ────────────────────────────────────────────────────────
def _mean(xs: list[float]) -> float | None:
@@ -199,18 +250,42 @@ def _fmt(v: float | None, nd: int = 3) -> str:
return "-" if v is None else f"{v:.{nd}f}"
@click.command()
# ── CLI ──────────────────────────────────────────────────────────────
@click.group()
def cli() -> None:
"""Objective fidelity metrics for watermark-removal outputs."""
@cli.command("ocr")
@click.argument("images", nargs=-1, required=True, type=click.Path(exists=True))
@click.option("--langs", default="en", help="Comma list of OCR langs (en,ru,ch).")
@click.option("--out", type=click.Path(), default=None, help="Write {basename: text} JSON here (for ground truth).")
def ocr_cmd(images: tuple[str, ...], langs: str, out: str | None) -> None:
"""OCR images into a ground-truth seed -- hand-verify the result before using it."""
lang_list = [x.strip() for x in langs.split(",") if x.strip()]
result: dict[str, str] = {}
for path in images:
lines = _ocr_lines(_load_bgr(path), lang_list)
text = "\n".join(lines)
result[Path(path).name] = text
console.print(f"\n=== {Path(path).name} ===")
console.print(text or "(no text detected)")
if out:
Path(out).write_text(json.dumps(result, ensure_ascii=False, indent=2), encoding="utf-8")
console.print(f"\n Wrote {out} -- verify/correct it by hand, then pass it to `compare --ground-truth`.")
@cli.command("compare")
@click.option("--original", required=True, type=click.Path(exists=True), help="Reference (unprocessed) image.")
@click.option(
"--variant",
"variants",
multiple=True,
required=True,
help="LABEL=PATH of a cleaned output (repeatable).",
"--variant", "variants", multiple=True, required=True, help="LABEL=PATH of a cleaned output (repeatable)."
)
@click.option("--ocr-langs", default="en", help="Comma list of EasyOCR langs (en,ru,ch_sim). Empty = skip text.")
@click.option("--ocr-langs", default="en", help="Comma list of OCR langs (en,ru,ch). Empty = skip text.")
@click.option("--ground-truth", type=click.Path(exists=True), default=None, help="Verified {basename: text} JSON.")
@click.option("--no-faces", is_flag=True, help="Skip face metrics.")
def main(original: str, variants: tuple[str, ...], ocr_langs: str, no_faces: bool) -> None:
def compare(original: str, variants: tuple[str, ...], ocr_langs: str, ground_truth: str | None, no_faces: bool) -> None:
"""Score each VARIANT against ORIGINAL across the four fidelity groups."""
ref = _load_bgr(original)
parsed: list[tuple[str, np.ndarray]] = []
@@ -226,17 +301,19 @@ def main(original: str, variants: tuple[str, ...], ocr_langs: str, no_faces: boo
# ── text ──
ocr_cer: dict[str, float | None] = {label: None for label, _ in parsed}
if langs:
console.print(f" OCR ({','.join(langs)})...")
from rapidfuzz.distance import Levenshtein
readers = _build_ocr_readers(langs)
ref_text = _ocr_string(readers, ref)
if ref_text:
for label, img in parsed:
hyp = _ocr_string(readers, img)
ocr_cer[label] = Levenshtein.normalized_distance(ref_text, hyp)
ref_text: str | None = None
if ground_truth:
gt = json.loads(Path(ground_truth).read_text(encoding="utf-8"))
ref_text = gt.get(Path(original).name)
if ref_text is None:
console.print(f" (no ground-truth entry for {Path(original).name}; skipping text)")
else:
console.print(" (no text detected in the original; skipping text metric)")
console.print(f" OCR original ({','.join(langs)})...")
ref_text = "\n".join(_ocr_lines(ref, langs))
if ref_text:
console.print(f" OCR variants ({','.join(langs)})...")
for label, img in parsed:
ocr_cer[label] = _cer(ref_text, "\n".join(_ocr_lines(img, langs)))
# ── faces ──
face_stats: dict[str, FaceStats] = {label: FaceStats() for label, _ in parsed}
@@ -300,4 +377,4 @@ def main(original: str, variants: tuple[str, ...], ocr_langs: str, no_faces: boo
if __name__ == "__main__":
main()
cli()