mirror of
https://github.com/wiltodelta/remove-ai-watermarks.git
synced 2026-06-05 18:46:34 +02:00
25a1acc53b
A corpus audit surfaced China TC260 AIGC-labeled images that `identify`
missed. Three detection gaps in `aigc_label`, all fixed:
- raw-JSON `{"AIGC":{...}}` in JPEG EXIF (UserComment): brace-matched from
the scan head with `json.raw_decode`, gated on a TC260 field like the
PNG-chunk path. (Doubao-class output via that export surface.)
- XMP attribute form `TC260:AIGC="{...}"` (PicWish): folded into the
element regex as a second alternation.
- TC260 XMP packet appended after a large `IDAT`, past the 1 MB scan
window: `scan_head` now appends late PNG metadata chunks via
`_png_late_metadata`, mirroring the existing ISOBMFF late-box scan.
Adds `scripts/corpus_gap_scan.py`: runs `identify` over a corpus, writes
the per-file report CSV, and flags `unknown` files that carry a known
marker in their metadata region (the audit that found these gaps).
Scanning only the metadata region — not the whole file — avoids the
random short-token collisions inside compressed PNG/JPEG streams.
On the local corpus this lifts 3 files from `unknown` to AI (China AIGC)
and leaves zero false gap candidates. Synthetic piexif/PngInfo fixtures
cover all three forms.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>