diff --git a/CLAUDE.md b/CLAUDE.md
index d0a430e..5e0b375 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -84,7 +84,7 @@ Compact list. Full measurements, incident history, and oracle-validation runs li
 - rich was dropped: the CLI + analysis scripts print plain text (`click.echo` / the `scripts/_plain_console.py` shim). `rich` is NOT a dependency — importing it breaks the core+dev CI sync; new scripts must use the shim. No Unicode glyphs / colors / progress bars in CLI output by design.
 - AVIF/HEIF/JPEG-XL metadata detection is a binary scan; C2PA removal in those containers (and MP4/MOV/M4V) is `noai/isobmff.py`; non-ISOBMFF audio/video (WebM/MP3/WAV/FLAC/OGG) strips losslessly via ffmpeg on PATH. An AI-generator token in an `Exif` meta-box *item* (bytes in `mdat`/`idat`) is now blanked **in place** by `isobmff.blank_ai_exif_tokens` (same-length space overwrite, piexif-validated so a coincidental II/MM run in pixels is ignored — no `iinf`/`iloc` surgery, mirrors `blank_ai_xmp_packets`); it scrubs the AI-token value only, leaving camera/editor EXIF intact. Still NOT built: Resemble PerTh audio detection (no presence/confidence flag exists).
 - **SynthID technical reference: `docs/synthid.md`** — primary-source-cited doc covering mechanism (post-hoc encoder/decoder pair, 136-bit payload at 512x512, pixel-space, model weights NOT modified), robustness numbers (arXiv:2510.09263: ~99.98% TPR@0.1%FPR across 30 transforms including JPEG/crop/resize/color/noise), removal attacks and forensic detectability (arXiv:2605.09203: all 6 attacks detectable at >98% TPR@1%FPR), detectability limits (no public decoder, metadata-proxy only), oracle scope, and adoption landscape. Read that doc first before adding notes here.
-- **SynthID detection is metadata-only.** No local pixel detector is possible by design (Google's decoder is proprietary, trusted-testers only); we read the C2PA companion proxy, which goes quiet once metadata is stripped — a quiet proxy is not proof the pixel watermark is gone. The Gemini app "Verify with SynthID" is the ONLY valid SynthID oracle; `openai.com/verify` is scoped to OpenAI provenance and each vendor's oracle detects only its own content. SynthID survives JPEG re-encode, so GitHub issue attachments remain valid pixel-watermark test subjects. Every spectral/phase detection approach evaluated (reverse-SynthID, our own probes) works only on controlled solid fills, never on real content.
+- **SynthID detection is metadata-only.** No local pixel detector is possible by design (Google's decoder is proprietary, trusted-testers only); we read the C2PA companion proxy, which goes quiet once metadata is stripped — a quiet proxy is not proof the pixel watermark is gone. Each vendor has its OWN oracle and it detects only that vendor's content: the Gemini app "Verify with SynthID" for Google, `openai.com/verify` for OpenAI. **Validate the OpenAI arm FIRST** — `openai.com/verify` is more accessible (fewer per-check restrictions) and the strongest automation candidate (Playwright / Chrome MCP); the Gemini flow is more manual. Ordering/throughput choice, not a substitution (see `docs/synthid.md`). SynthID survives JPEG re-encode, so GitHub issue attachments remain valid pixel-watermark test subjects. Every spectral/phase detection approach evaluated (reverse-SynthID, our own probes) works only on controlled solid fills, never on real content.
 - **External AI-vs-real classifier models are out of scope** (decided 2026-05-24): per-generator, degrade off-distribution, and our own light SDXL pass would likely defeat them. Detection stays local + signal-based.
 - **Default strength is VENDOR-ADAPTIVE, one ladder for BOTH pipelines** (since 2026-06-09): `resolve_strength(strength, vendor)` picks OpenAI **0.20** / Gemini **0.30** / unknown **0.30** when `--strength` is unset; explicit `--strength` always wins. Removal at low strength is content x pipeline dependent, and near-threshold removal is SEED-NON-DETERMINISTIC — pick a strength with margin and oracle-revalidate per content type. Certified controlnet floors (Modal cert 2026-06-04): OpenAI 0.20 (resolution-independent), Gemini 0.30 (only <= 1536px; native large Gemini needs ~0.35+ or a cap).
 - **`controlnet` is the default pipeline**; `--pipeline sdxl` is the lighter opt-down. Neither pipeline clears all content at low strength (photoreal survives controlnet, flat graphics survive sdxl — the lever is higher strength). A removal-priority caller MUST oracle-validate strength across content types; prod recipe: controlnet + per-vendor floor + FIXED seed. Forensic-stealth caveat (arXiv:2605.09203): defeating the SynthID verifier is NOT forensic invisibility — removal-processed images are flaggable at >98% TPR@1%FPR.
diff --git a/data/qwen_in/README.md b/data/qwen_in/README.md
new file mode 100644
index 0000000..6fcc603
--- /dev/null
+++ b/data/qwen_in/README.md
@@ -0,0 +1,35 @@
+# qwen_in — pipeline-fidelity eval set
+
+A small, **stable** set of AI-generated images used to compare the diffusion
+removal pipelines (`controlnet` / `sdxl` / `qwen`) for fidelity with
+`scripts/fidelity_metrics.py`. Fixing the set in the repo keeps comparisons
+reproducible across runs and pipelines.
+
+All four are AI-generated test content (they carry SynthID + C2PA from their
+generator — verify with `remove-ai-watermarks identify`), same class as the
+`data/samples/` fixtures. No real-person photos.
+
+| file | vendor (SynthID) | content | exercises |
+|---|---|---|---|
+| `openai_1_original.png` | OpenAI | typography sheet (EN + RU + ZH) | text (multi-script) |
+| `openai_2_original.png` | OpenAI | Raiw.cc poster | text (EN, small) |
+| `gemini_1_original.png` | Google | landscape + Chinese sign | text (CJK) |
+| `gemini_3_original.png` | Google | 3x3 portrait grid | faces (identity / skin texture) |
+
+## Text ground truth
+
+`ground_truth.json` (`{basename: text}`) is the **hand-verified** OCR of the
+text-bearing originals, seeded by `fidelity_metrics.py ocr` and corrected by
+hand (PaddleOCR mis-reads stylized Cyrillic in particular). It is the reference
+for the text CER metric — much cleaner than OCR-vs-OCR. Regenerate the seed with:
+
+    uv run scripts/fidelity_metrics.py ocr data/qwen_in/openai_1_original.png \
+        data/qwen_in/openai_2_original.png data/qwen_in/gemini_1_original.png \
+        --langs en,ru,ch --out data/qwen_in/ground_truth.json
+    # then re-verify by hand before trusting it.
+
+## Compare
+
+    uv run scripts/fidelity_metrics.py compare \
+        --original data/qwen_in/gemini_3_original.png \
+        --variant controlnet=.png --variant qwen=.png --ocr-langs ""
diff --git a/data/qwen_in/gemini_1_original.png b/data/qwen_in/gemini_1_original.png
new file mode 100644
index 0000000..229f943
Binary files /dev/null and b/data/qwen_in/gemini_1_original.png differ
diff --git a/data/qwen_in/gemini_3_original.png b/data/qwen_in/gemini_3_original.png
new file mode 100644
index 0000000..7ffcbf4
Binary files /dev/null and b/data/qwen_in/gemini_3_original.png differ
diff --git a/data/qwen_in/openai_1_original.png b/data/qwen_in/openai_1_original.png
new file mode 100644
index 0000000..c1325f7
Binary files /dev/null and b/data/qwen_in/openai_1_original.png differ
diff --git a/data/qwen_in/openai_2_original.png b/data/qwen_in/openai_2_original.png
new file mode 100644
index 0000000..e3b01fb
Binary files /dev/null and b/data/qwen_in/openai_2_original.png differ
diff --git a/docs/synthid.md b/docs/synthid.md
index 6a61345..05981ef 100644
--- a/docs/synthid.md
+++ b/docs/synthid.md
@@ -181,6 +181,15 @@ pos original plus its minimum-clearing cleaned output (manifest `verified_via` =
 was oracle-verified but is not committed (third-party content stays out of the
 public corpus).
 
+**Oracle validation order: start with OpenAI.** When validating removal across
+vendors, run the OpenAI arm first. `openai.com/verify` is more accessible than the
+Gemini app -- fewer per-check restrictions, so it gives the fastest signal and is
+the strongest candidate for automation (Playwright / Chrome MCP driving
+`openai.com/verify`); the Gemini "Verify with SynthID" flow is more manual. This is
+an ordering/throughput choice, not a substitution: each oracle only reads its own
+vendor's SynthID (`openai.com/verify` is OpenAI-scoped), so Google content still
+needs the Gemini app.
+
 | Vendor | Images | Resolution(s) | Pipeline | Removed at |
 |--------|--------|---------------|----------|------------|
 | OpenAI (gpt-image) | n=4 (3 archived + 1 external-only) | 1024x1536 .. 1600x1600 | native | **0.05** |
diff --git a/scripts/fidelity_metrics.py b/scripts/fidelity_metrics.py
index 6e057bc..932cb03 100644
--- a/scripts/fidelity_metrics.py
+++ b/scripts/fidelity_metrics.py
@@ -9,7 +9,8 @@
 #     "rapidfuzz",
 #     "torch",
 #     "lpips",
-#     "easyocr",
+#     "paddleocr",
+#     "paddlepaddle",
 #     "insightface",
 #     "onnxruntime",
 # ]
@@ -22,29 +23,38 @@ preserved -- so "closer to the original" is the right axis here (between two
 equally-scrubbed outputs, the one that deviates less from the original wins).
 
 It is a standalone eval tool, NOT part of the package: PEP 723 inline deps let
-``uv run`` build a throwaway env so the heavy models (EasyOCR, insightface,
+``uv run`` build a throwaway env so the heavy models (PaddleOCR, insightface,
 LPIPS) never touch uv.lock or the shipped library. Metrics self-gate: face
 metrics run only where faces are detected, text metrics only where text is.
 
-Four metric groups (all reference = original):
-  1. Text -- EasyOCR character error rate (CER) of each variant vs the original's
-     OCR string. Lower = text better preserved. OCR is noisy, so treat it
-     as a RELATIVE comparison (every variant scored against the same ref).
-  2. Face identity -- insightface (buffalo_l) ArcFace cosine similarity, original
-     face vs the geometrically-matched variant face. Higher = identity kept.
-  3. Face texture -- LPIPS + Laplacian-variance ratio (variant/original) on face
-     crops. Catches "plastication" (lost high-frequency skin detail):
-     lapvar ratio < 1 = smoother than the original.
-  4. Whole image -- LPIPS / SSIM / PSNR vs the original (context: background too).
+Two subcommands:
+
+  ocr -- OCR images (PaddleOCR PP-OCRv6) into a JSON {basename: text} file.
+         Run this on the ORIGINALS, hand-verify/correct the file, and it
+         becomes the ground truth for ``compare --ground-truth`` -- the clean
+         way to score text, since OCR-vs-OCR is doubly noisy (errors on both
+         images + reading-order differences inflate CER even on identical text).
+
+  compare -- Score each VARIANT against the ORIGINAL across four groups:
+      1. Text -- character error rate (CER) of the variant's OCR vs the
+         verified ground truth (or the original's OCR if no --ground-truth).
+      2. Face identity -- insightface (buffalo_l) ArcFace cosine similarity.
+      3. Face texture -- LPIPS + Laplacian-variance ratio on face crops
+         (catches "plastication": ratio < 1 = smoother than the original).
+      4. Whole image -- LPIPS / SSIM / PSNR vs the original.
 
 Usage:
-  uv run scripts/fidelity_metrics.py --original O.png \
-    --variant controlnet=C.png --variant qwen=Q.png --ocr-langs en,ru,ch_sim
+  uv run scripts/fidelity_metrics.py ocr O1.png O2.png --langs en,ru,ch --out gt.json
+  # (edit gt.json by hand to fix any OCR slips, then:)
+  uv run scripts/fidelity_metrics.py compare --original O1.png \
+    --variant controlnet=C.png --variant qwen=Q.png \
+    --ocr-langs en,ru,ch --ground-truth gt.json
 """
 
 from __future__ import annotations
 
-import logging
+import json
+import unicodedata
 from dataclasses import dataclass, field
 from pathlib import Path
 from typing import Any
@@ -54,8 +64,6 @@ import cv2
 import numpy as np
 from _plain_console import Console, Table
 
-logging.basicConfig(level=logging.WARNING, format="%(message)s")
-log = logging.getLogger(__name__)
 console = Console()
 
 
@@ -76,45 +84,90 @@ def _match_size(variant: np.ndarray, ref: np.ndarray) -> np.ndarray:
     return variant
 
 
-# ── text: OCR CER ────────────────────────────────────────────────────
-
-# EasyOCR rejects some language combos in one Reader, so group into compatible
-# readers and union the detections. Cyrillic and Chinese cannot share a reader.
-_OCR_GROUPS = {
-    "en": ["en"],
-    "ru": ["ru", "en"],
-    "ch_sim": ["ch_sim", "en"],
-}
+def _norm(text: str) -> str:
+    """Normalize for CER: NFC + drop ALL whitespace (segmentation-order agnostic)."""
+    return "".join(unicodedata.normalize("NFC", text).split())
 
 
-def _ocr_string(readers: list, bgr: np.ndarray) -> str:
-    """Union all readers' detections into one position-sorted, whitespace-free string."""
-    rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)
-    dets: list[tuple[float, float, str]] = []
-    for reader in readers:
-        for box, text, conf in reader.readtext(rgb):
-            if conf < 0.3 or not text.strip():
-                continue
-            ys = [p[1] for p in box]
-            xs = [p[0] for p in box]
-            dets.append((min(ys), min(xs), text.strip()))
-    # Sort top-to-bottom, then left-to-right (coarse reading order).
-    dets.sort(key=lambda d: (round(d[0] / 20.0), d[1]))
-    return "".join(t for _, _, t in dets).replace(" ", "")
+# ── text: PaddleOCR (PP-OCRv6) ───────────────────────────────────────
+
+# Our lang codes -> PaddleOCR lang. The 'ch' model also reads Latin; 'ru' reads
+# Cyrillic + Latin. Multiple langs in one image -> run each model, union detections.
+_PADDLE_LANG = {"en": "en", "ru": "ru", "ch": "ch", "ch_sim": "ch", "latin": "latin"}
+_paddle_cache: dict[str, Any] = {}
 
 
-def _build_ocr_readers(langs: list[str]) -> list:
-    import easyocr
+def _paddle(lang: str) -> Any:
+    if lang not in _paddle_cache:
+        from paddleocr import PaddleOCR
 
-    seen: set[tuple[str, ...]] = set()
-    readers = []
+        _paddle_cache[lang] = PaddleOCR(
+            lang=lang,
+            use_doc_orientation_classify=False,
+            use_doc_unwarping=False,
+            use_textline_orientation=False,
+        )
+    return _paddle_cache[lang]
+
+
+def _box_xyxy(box: Any) -> tuple[float, float, float, float]:
+    """Axis-aligned (x1, y1, x2, y2) of a PaddleOCR rec box ([x1,y1,x2,y2]) or poly (4x2)."""
+    arr = np.asarray(box, dtype=np.float32).reshape(-1)
+    if arr.size == 4:
+        return float(arr[0]), float(arr[1]), float(arr[2]), float(arr[3])
+    pts = arr.reshape(-1, 2)
+    return float(pts[:, 0].min()), float(pts[:, 1].min()), float(pts[:, 0].max()), float(pts[:, 1].max())
+
+
+def _iou(a: tuple[float, float, float, float], b: tuple[float, float, float, float]) -> float:
+    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
+    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
+    iw, ih = max(0.0, ix2 - ix1), max(0.0, iy2 - iy1)
+    inter = iw * ih
+    if inter <= 0:
+        return 0.0
+    area_a = (a[2] - a[0]) * (a[3] - a[1])
+    area_b = (b[2] - b[0]) * (b[3] - b[1])
+    return inter / (area_a + area_b - inter + 1e-9)
+
+
+def _ocr_lines(bgr: np.ndarray, langs: list[str], min_score: float = 0.5) -> list[str]:
+    """Detected text lines in reading order, unioned across lang models with spatial NMS.
+
+    Several language models over one image re-detect the same lines -- and crucially the
+    WRONG-script models read e.g. Cyrillic as confident Latin gibberish. So instead of a
+    naive union, keep the HIGHEST-score detection per physical location (greedy IoU NMS):
+    the model that actually fits a line wins it (the 'ru' model takes the Cyrillic, 'ch'
+    the CJK, 'en' the Latin), and the cross-script garbage is dropped.
+    """
+    raw: list[tuple[float, tuple[float, float, float, float], str]] = []
     for lang in langs:
-        group = tuple(_OCR_GROUPS.get(lang, [lang]))
-        if group in seen:
+        plang = _PADDLE_LANG.get(lang, lang)
+        for page in _paddle(plang).predict(bgr):
+            texts = page.get("rec_texts", [])
+            scores = page.get("rec_scores", [])
+            boxes = page.get("rec_boxes", None)
+            if boxes is None or len(boxes) == 0:
+                boxes = page.get("rec_polys", [])
+            for text, score, box in zip(texts, scores, boxes, strict=False):
+                if score < min_score or not text.strip():
+                    continue
+                raw.append((float(score), _box_xyxy(box), text.strip()))
+
+    raw.sort(key=lambda d: d[0], reverse=True)
+    kept: list[tuple[tuple[float, float, float, float], str]] = []
+    for _score, box, text in raw:
+        if any(_iou(box, kbox) > 0.3 for kbox, _ in kept):
             continue
-        seen.add(group)
-        readers.append(easyocr.Reader(list(group), gpu=False, verbose=False))
-    return readers
+        kept.append((box, text))
+    kept.sort(key=lambda d: (round(d[0][1] / 20.0), d[0][0]))  # reading order: y then x
+    return [t for _, t in kept]
+
+
+def _cer(ref: str, hyp: str) -> float:
+    from rapidfuzz.distance import Levenshtein
+
+    return Levenshtein.normalized_distance(_norm(ref), _norm(hyp))
 
 
 # ── face: detection + ArcFace + texture ──────────────────────────────
@@ -183,12 +236,10 @@ def _ssim_psnr(a_bgr: np.ndarray, b_bgr: np.ndarray) -> tuple[float, float]:
 
     a = cv2.cvtColor(a_bgr, cv2.COLOR_BGR2GRAY)
     b = cv2.cvtColor(b_bgr, cv2.COLOR_BGR2GRAY)
-    ssim = float(structural_similarity(a, b))
-    psnr = float(peak_signal_noise_ratio(a, b))
-    return ssim, psnr
+    return float(structural_similarity(a, b)), float(peak_signal_noise_ratio(a, b))
 
 
-# ── main ─────────────────────────────────────────────────────────────
+# ── reporting ────────────────────────────────────────────────────────
 
 
 def _mean(xs: list[float]) -> float | None:
@@ -199,18 +250,42 @@ def _fmt(v: float | None, nd: int = 3) -> str:
     return "-" if v is None else f"{v:.{nd}f}"
 
 
-@click.command()
+# ── CLI ──────────────────────────────────────────────────────────────
+
+
+@click.group()
+def cli() -> None:
+    """Objective fidelity metrics for watermark-removal outputs."""
+
+
+@cli.command("ocr")
+@click.argument("images", nargs=-1, required=True, type=click.Path(exists=True))
+@click.option("--langs", default="en", help="Comma list of OCR langs (en,ru,ch).")
+@click.option("--out", type=click.Path(), default=None, help="Write {basename: text} JSON here (for ground truth).")
+def ocr_cmd(images: tuple[str, ...], langs: str, out: str | None) -> None:
+    """OCR images into a ground-truth seed -- hand-verify the result before using it."""
+    lang_list = [x.strip() for x in langs.split(",") if x.strip()]
+    result: dict[str, str] = {}
+    for path in images:
+        lines = _ocr_lines(_load_bgr(path), lang_list)
+        text = "\n".join(lines)
+        result[Path(path).name] = text
+        console.print(f"\n=== {Path(path).name} ===")
+        console.print(text or "(no text detected)")
+    if out:
+        Path(out).write_text(json.dumps(result, ensure_ascii=False, indent=2), encoding="utf-8")
+        console.print(f"\n Wrote {out} -- verify/correct it by hand, then pass it to `compare --ground-truth`.")
+
+
+@cli.command("compare")
 @click.option("--original", required=True, type=click.Path(exists=True), help="Reference (unprocessed) image.")
 @click.option(
-    "--variant",
-    "variants",
-    multiple=True,
-    required=True,
-    help="LABEL=PATH of a cleaned output (repeatable).",
+    "--variant", "variants", multiple=True, required=True, help="LABEL=PATH of a cleaned output (repeatable)."
 )
-@click.option("--ocr-langs", default="en", help="Comma list of EasyOCR langs (en,ru,ch_sim). Empty = skip text.")
+@click.option("--ocr-langs", default="en", help="Comma list of OCR langs (en,ru,ch). Empty = skip text.")
+@click.option("--ground-truth", type=click.Path(exists=True), default=None, help="Verified {basename: text} JSON.")
 @click.option("--no-faces", is_flag=True, help="Skip face metrics.")
-def main(original: str, variants: tuple[str, ...], ocr_langs: str, no_faces: bool) -> None:
+def compare(original: str, variants: tuple[str, ...], ocr_langs: str, ground_truth: str | None, no_faces: bool) -> None:
     """Score each VARIANT against ORIGINAL across the four fidelity groups."""
     ref = _load_bgr(original)
     parsed: list[tuple[str, np.ndarray]] = []
@@ -226,17 +301,19 @@ def main(original: str, variants: tuple[str, ...], ocr_langs: str, no_faces: boo
     # ── text ──
     ocr_cer: dict[str, float | None] = {label: None for label, _ in parsed}
     if langs:
-        console.print(f" OCR ({','.join(langs)})...")
-        from rapidfuzz.distance import Levenshtein
-
-        readers = _build_ocr_readers(langs)
-        ref_text = _ocr_string(readers, ref)
-        if ref_text:
-            for label, img in parsed:
-                hyp = _ocr_string(readers, img)
-                ocr_cer[label] = Levenshtein.normalized_distance(ref_text, hyp)
+        ref_text: str | None = None
+        if ground_truth:
+            gt = json.loads(Path(ground_truth).read_text(encoding="utf-8"))
+            ref_text = gt.get(Path(original).name)
+            if ref_text is None:
+                console.print(f" (no ground-truth entry for {Path(original).name}; skipping text)")
         else:
-            console.print(" (no text detected in the original; skipping text metric)")
+            console.print(f" OCR original ({','.join(langs)})...")
+            ref_text = "\n".join(_ocr_lines(ref, langs))
+        if ref_text:
+            console.print(f" OCR variants ({','.join(langs)})...")
+            for label, img in parsed:
+                ocr_cer[label] = _cer(ref_text, "\n".join(_ocr_lines(img, langs)))
 
     # ── faces ──
     face_stats: dict[str, FaceStats] = {label: FaceStats() for label, _ in parsed}
@@ -300,4 +377,4 @@ def main(original: str, variants: tuple[str, ...], ocr_langs: str, no_faces: boo
 
 
 if __name__ == "__main__":
-    main()
+    cli()
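
Reviewer note, not part of the patch: the cross-model merge described in the `_ocr_lines` docstring (keep the highest-score detection per physical location, then sort into reading order) can be tried without PaddleOCR. The sketch below is a minimal stand-alone version under stated assumptions: `iou`, `greedy_nms` and the `detections` list are hypothetical names and toy data made up for illustration, and only the 0.3 IoU threshold and the 20-pixel row bucket are taken from the diff.

```python
from __future__ import annotations

Box = tuple[float, float, float, float]  # (x1, y1, x2, y2), axis-aligned


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    if inter <= 0:
        return 0.0
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def greedy_nms(dets: list[tuple[float, Box, str]], thresh: float = 0.3) -> list[str]:
    """Keep the best-scoring text per location, then return texts in reading order."""
    kept: list[tuple[Box, str]] = []
    # Visit detections from most to least confident; a detection survives only
    # if it does not overlap an already-kept (more confident) box.
    for _score, box, text in sorted(dets, key=lambda d: d[0], reverse=True):
        if all(iou(box, kbox) <= thresh for kbox, _ in kept):
            kept.append((box, text))
    # Coarse reading order: 20-pixel row buckets (y), then x within a row.
    kept.sort(key=lambda d: (round(d[0][1] / 20.0), d[0][0]))
    return [text for _, text in kept]


# Toy data (hypothetical): one Cyrillic line seen by two models, plus a Latin line.
detections = [
    (0.62, (10.0, 10.0, 200.0, 40.0), "nPNBET"),  # wrong-script reading, lower score
    (0.97, (12.0, 11.0, 198.0, 41.0), "ПРИВЕТ"),  # model that fits the script
    (0.91, (10.0, 80.0, 150.0, 110.0), "Hello"),
]
print(greedy_nms(detections))
```

With this toy input the two overlapping boxes collapse to the higher-scoring Cyrillic reading and the Latin line is kept. Whether real PaddleOCR scores separate right-script from wrong-script readings this cleanly is something the eval set would have to confirm.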