20 Commits

Author SHA1 Message Date
Victor Kuznetsov 25a1acc53b Detect TC260 AIGC label in JPEG EXIF and late/attribute PNG XMP
A corpus audit surfaced China TC260 AIGC-labeled images that `identify`
missed. Three detection gaps in `aigc_label`, all fixed:

- raw-JSON `{"AIGC":{...}}` in JPEG EXIF (UserComment): brace-matched from
  the scan head with `json.raw_decode`, gated on a TC260 field like the
  PNG-chunk path. (Doubao-class output via that export surface.)
- XMP attribute form `TC260:AIGC="{...}"` (PicWish): folded into the
  element regex as a second alternation.
- TC260 XMP packet appended after a large `IDAT`, past the 1 MB scan
  window: `scan_head` now appends late PNG metadata chunks via
  `_png_late_metadata`, mirroring the existing ISOBMFF late-box scan.

Adds `scripts/corpus_gap_scan.py`: runs `identify` over a corpus, writes
the per-file report CSV, and flags `unknown` files that carry a known
marker in their metadata region (the audit that found these gaps).
Scanning only the metadata region — not the whole file — avoids the
random short-token collisions inside compressed PNG/JPEG streams.

On the local corpus this lifts 3 files from `unknown` to AI (China AIGC)
and leaves zero false gap candidates. Synthetic piexif/PngInfo fixtures
cover all three forms.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 11:44:53 -07:00
Victor Kuznetsov 58bdf51c59 Visible-watermark registry: reverse-alpha-only Doubao + Gemini, exact native recovery (#28)
* fix(trustmark): gate detection on re-encode durability to kill false positives

TrustMark's wm_present flag is a BCH validity check that spuriously
validates on a content-correlated fraction of un-watermarked images
(AI textures trip it more than camera photos). On a 1343-image set all
20 raw detections were false, several on Gemini/OpenAI/Doubao output that
cannot carry Adobe's watermark, with random-bytes secrets.

A genuine TrustMark is a durable soft binding that survives re-encoding,
so detect_trustmark now re-decodes after a mild JPEG round-trip and
requires the same schema both times. Every observed false positive
collapsed under this gate; the second decode runs only on the rare hit.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(identify): Samsung Galaxy AI, FLUX, ByteDance C2PA; fix C2PA substring FP

Detection extensions verified on real signed files (2026-05-29):

- Samsung Galaxy AI: signer attribution via a new _SIGNER_C2PA_PLATFORM
  (Samsung Galaxy / ASUS Gallery) kept separate from the capture-camera
  _DEVICE_C2PA_PLATFORM so a Galaxy AI edit (device cert + AI source type)
  does not trip the camera-vs-AI integrity clash. Plus metadata.samsung_genai:
  the proprietary genAIType marker in PhotoEditor_Re_Edit_Data, a medium-
  confidence AI-editing signal (samsung_only branch).
- Black Forest Labs (FLUX) and ByteDance Volcano Engine (Doubao/Jimeng)
  added as C2PA issuers + issuer->platform mappings.
- fix: C2PA presence required only the bare 4-byte 'c2pa' substring, which
  false-positives on compressed pixel data (a recompressed PNG IDAT re-flagged
  C2PA after its manifest was correctly stripped). New c2pa_marker_in() requires
  the JUMBF wrapper (jumb+c2pa) or the C2PA uuid box; applied in identify +
  metadata. Verified: all 535 real C2PA files carry jumb.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(doubao): gate detection on text structure to cut ~95% of false positives (#23)

Coverage alone over-fired: any textured bottom-right corner cleared the
threshold, so the detector false-positived on ~28% of arbitrary images.
The real '豆包AI生成' mark is six glyphs in one row, so detect now also
requires the text-structure signature (_glyph_structure): many connected
components, no single dominant blob, concentration in a thin horizontal
band. False positives dropped 343 -> 17 across the corpus while keeping
real-mark recall and the doubao-1.png sample. Also accept a no-op force
kwarg for remover-interface symmetry.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(samsung): add Samsung Galaxy AI visible-badge remover

New samsung_engine.py removes the bottom-left sparkle + localized
'AI-generated content' badge that Galaxy AI tools stamp. Mirrors the
Doubao locate->mask->inpaint pattern but bottom-left, with a dual-polarity
top-hat mask (the badge is light-on-dark or dark-on-light). Detection gates
on a band + left-anchor signature (the Doubao CJK-component gate does not
transfer: Latin badge letters connect into few blobs). Explicit-only --
tuned on few real badges with a ~4% FP floor, so it is not used in auto.
Synthetic byte-blob fixtures (real badges are user content, not shipped).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(visible): unified known-watermark registry + LaMa inpaint backend

watermark_registry.py is a single catalog of known visible marks, each
tying {usual location, in_auto flag, recovery strategy, detect adapter,
remove adapter}: gemini (reverse-alpha, exact), doubao, samsung. cmd_visible
is now registry-driven (best_auto_mark for --mark auto; mark_keys() feeds the
CLI choices) -- the per-mark _run_doubao/_run_samsung helper branches are gone.

Cross-engine confidences are not comparable, so the gemini adapter applies the
corpus-validated 0.5 sparkle threshold for auto arbitration (its engine flag is
loose and weakly fired ~0.36 on Doubao text, hijacking auto).

--backend auto|cv2|lama chooses background reconstruction for the mask-based
marks; auto = LaMa when onnxruntime is present, else cv2. For LaMa the mask is
the FILLED glyph bounding box (sparse glyph masks leave anti-aliased edges
behind). cv2 stays the zero-dependency fallback.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs: watermark registry, Samsung/FLUX/ByteDance detection, LaMa backend, trustmark gate

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(doubao): exact reverse-alpha removal from captured alpha map

The Doubao '豆包AI生成' mark is a fixed semi-transparent white overlay, so
given its alpha map the original pixels are recovered exactly:
original = (wm - a*logo)/(1-a) -- no inpaint hallucination.

The alpha map + logo colour were solved from real black+gray Doubao captures
on a controlled background: on black captured = a*logo, and the black/gray pair
solves a per-pixel without assuming the logo colour (a_max~0.65, logo near-white);
the white capture cross-validates (mark vanishes to a flat fill). Bundled as
assets/doubao_alpha.png + geometry constants.

remove_watermark_reverse_alpha applies it scaled to image width; exact at the
captured width, so the registry routes doubao through it only when
reverse_alpha_available (width within the calibrated band) and the mark is
detected, falling back to mask inpaint (cv2/LaMa) otherwise. A light residual
inpaint cleans the sub-pixel rescaling error. Add captures at more resolutions
to widen exact coverage.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(visible): reverse-alpha only -- drop inpaint removal + heuristic detection

Per the principle that we only remove/detect what we can do exactly, the
visible-mark path is now reverse-alpha only:

- Doubao detect is reverse-alpha-consistent: match the bundled alpha glyph
  silhouette against the corner via TM_CCOEFF_NORMED (DETECT_NCC_THRESHOLD 0.4)
  -- keys on the '豆包AI生成' SHAPE, not coverage/structure heuristics. FP
  7/1243 (0.6%). Removes the cv2 inpaint path + the _glyph_structure gate.
- Registry is reverse-alpha only: dropped the cv2/LaMa backend (_glyph_remove,
  _lama_box_inpaint, default_backend, --backend) and the Samsung entry. Doubao
  outside the alpha resolution band is skipped, never inpainted.
- Removed samsung_engine.py + tests + --mark samsung (no alpha map captured;
  Samsung C2PA/genAIType metadata detection in identify is unaffected).
- The universal erase --region (cv2/LaMa) is unchanged -- arbitrary-region
  inpainting stays a user-directed tool, separate from the known-mark registry.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(doubao): NCC sub-pixel alignment -> reverse-alpha at any resolution

A pure width-scale of the captured alpha map is only sub-pixel-accurate at the
captured width and leaves a faint ghost elsewhere. remove_watermark_reverse_alpha
now registers the alpha glyph to the actual mark via a TM_CCOEFF_NORMED
scale+position search (_aligned_alpha_map) before inverting the blend, so the
single 2048 capture works at any resolution -- verified clean on the 1773x2364
(3:4) corpus size, the biggest coverage gap (23 files).

reverse_alpha_available is now just 'asset present' (no width band); the registry
still gates removal on detect so a clean corner is never touched. Drops the
_ALPHA_WIDTH_TOLERANCE gate.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(doubao): keep native recovery exact -- fixed geometry at captured width

Integer-pixel NCC alignment landed ~1px off at the captured width, degrading the
otherwise-exact native reverse-alpha (synthetic recovery error 0.94 -> 1.39).
remove_watermark_reverse_alpha now uses exact width-relative geometry within
_ALPHA_NATIVE_BAND of the captured width and the NCC search only off it -- best
of both: native back to 0.94, other resolutions still aligned.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(doubao): harden alignment -- try fixed+aligned, keep least residual (56/56)

On a faint/busy-background mark the NCC alignment peak can wander a few px off
the true mark and leave a residual (2/56 real corpus files). Off the captured
width, remove_watermark_reverse_alpha now builds BOTH the fixed-geometry and the
NCC-aligned alpha map, applies each, and keeps whichever leaves the least
residual mark (re-detect confidence on the bare reverse-alpha) -- geometry wins
on faint marks, alignment on clear ones, no magic threshold. Real-file round-trip
now removes 56/56 detected Doubao clean across every corpus resolution (was 54).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* perf(doubao): skip residual inpaint at native width for exact recovery

At the captured width the fixed-geometry reverse-alpha is pixel-exact, so
inpainting over it only replaced exactly-recovered interior pixels with a
cv2 hallucination -- measured worse on a textured background (native error
vs true bg 1.6 reverse-alpha-only vs 2.6 with the old always-on
full-footprint inpaint). Native now returns the bare recovery untouched;
off-native, where NCC alignment is only sub-pixel-approximate, the footprint
inpaint stays to clean the seam. Real round-trip still 56/56 across all
corpus resolutions; negatives 0/60, Gemini unaffected.

Add test_native_returns_exact_reverse_alpha_no_inpaint as the regression
guard. Sync CLAUDE.md + README (the table cell and prose described the
pre-NCC "skipped off native / cv2-LaMa" behavior, now stale). Gitignore the
session scheduled_tasks.lock, and add the text-protection research note.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 19:49:09 -07:00
Victor Kuznetsov 9aaa53fe32 fix(metadata): preserve upload format and quality on strip
remove_ai_metadata now writes JPEG at quality 95 with 4:4:4 (no chroma
subsampling) instead of the lossy PIL defaults (q75, 4:2:0), and preserves
WebP losslessly instead of silently rewriting it as PNG.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 18:46:26 -07:00
Victor Kuznetsov 223cbcf171 feat(metadata): detect China TC260 AIGC PNG chunk and HuggingFace hf-job-id
aigc_label now reads the TC260 label from a raw-JSON `AIGC` PNG tEXt chunk
(as Doubao/ByteDance write it, with no namespaced XMP marker) in addition to
the `<TC260:AIGC>` XMP block, via a shared _parse helper gated on a TC260 field
so a generic AIGC key cannot false-positive. New huggingface_job() reads the
hf-job-id PNG chunk; identify surfaces it as a medium-confidence hf_job signal
(parallel to the visible sparkle, never overriding a hard metadata verdict).
Both wired into has_ai_metadata/get_ai_metadata; the PNG save whitelist already
strips them on removal. Found by auditing 646 corpus originals: 28 AIGC and 3
hf-job files the library previously reported as Unknown.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 12:40:17 -07:00
test-user 5bfed00553 feat(metadata): blank AI-label XMP inside the HEIF/AVIF meta box (v0.6.9)
HEIF/AVIF store XMP as a meta-box `mime` item whose bytes live in mdat/idat, out
of reach of the top-level uuid/jumb box stripper. An AI-label XMP packet there
(TC260 AIGC, IPTC "Made with AI", IPTC 2025.1) was therefore left in place.

isobmff.blank_ai_xmp_packets locates each XMP packet by its <?xpacket begin ...
end?> delimiters and, if it carries an AI marker (_AI_LABEL_MARKERS), overwrites
it with spaces of the SAME length. Equal length means no box size or iloc offset
shifts -- the coded image stays bit-for-bit intact, the item stays structurally
valid, only the AI label content is destroyed. Plain (non-AI) XMP is left alone,
mirroring the top-level XMP-uuid content match. Wired into remove_ai_metadata's
ISOBMFF branch after strip_c2pa_boxes.

Chosen over exiftool (a non-bundled binary dep) to stay pure-Python and
droplet-compatible; over full iinf/iloc surgery to avoid offset-rewrite
corruption risk. The AI labels we target are all XMP, so this closes the
practical gap. An Exif *item* inside the meta box (rare) still needs iinf/iloc
surgery or exiftool -- documented.

4 new tests (TestMetaBoxXmpBlanking): AI packet blanked (same length, marker
gone, surrounding image bytes intact), plain XMP preserved, no-packet no-op, and
end-to-end remove_ai_metadata on a .heic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 18:15:48 -07:00
test-user 31f0a82906 feat(metadata): detect C2PA/AIGC/IPTC manifests after a large mdat in MP4 (v0.6.8)
Provenance detection no longer relies on a fixed first-MB read. In a streaming /
non-faststart MP4 the C2PA manifest sits AFTER a multi-megabyte mdat, beyond the
1 MB scan window, so it was missed.

- isobmff.scan_c2pa_region(path): a file-seeking top-level box walker that
  returns the payloads of uuid/jumb (provenance) boxes, seeking past mdat by
  size without reading it -- works on multi-GB files. Returns b"" for
  non-ISOBMFF or on read error. Mirrors the box-size encoding of the existing
  in-memory _iter_top_level_boxes (largesize / size==0).
- metadata.scan_head(path, size): the shared input for every C2PA/AIGC/IPTC
  byte scan -- first __TEXT	__DATA	__OBJC	others	dec	hex bytes plus, for ISOBMFF, the late provenance-box
  payloads. Behavior-neutral (f.read(size)) for non-ISOBMFF inputs.
- Routed all six metadata scan sites (has_ai_metadata, aigc_label,
  iptc_ai_system, synthid_source, exif_generator XMP, get_ai_metadata
  soft-binding) and identify's head read through scan_head.

6 new tests: late box found by scan_c2pa_region / scan_head, the fixed window
provably misses it, non-ISOBMFF -> b"", front-placed (faststart) regression.

The remaining gap stays documented: EXIF/XMP stored as items inside the meta
box (AVIF/HEIF stills) still needs meta-box surgery or exiftool.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 13:42:29 -07:00
test-user f9cf14c372 feat(metadata): strip container metadata from WebM/MP3/WAV/FLAC/OGG via ffmpeg (v0.6.4)
remove_ai_metadata now handles non-ISOBMFF audio/video (which the box walker
can't reach) by shelling out to ffmpeg with a lossless stream copy
(`-map_metadata -1 -map_chapters -1 -c copy`): codec data is untouched, only
container tags/chapters (ID3 / RIFF / Vorbis comments / EBML tags) are dropped.
Requires ffmpeg on PATH; raises a clear RuntimeError if absent or if ffmpeg
can't parse the input (instead of crashing in the image path).

Verified end-to-end: a real ffmpeg-made WAV/MP3 with a "Suno AI" title tag ->
tag gone, audio bytes preserved.

NOT built (evaluated, deliberate): Resemble PerTh audio *detection* --
`get_watermark()` returns a raw bit array with no presence/confidence flag, so
reliably telling watermarked from clean needs Resemble's fixed payload or a
confidence API (neither public; no real sample to calibrate). Same wall as the
SynthID pixel detector. AVIF/HEIF meta-box EXIF/XMP stripping also stays a gap
(needs exiftool, a non-installed binary). Both documented in CLAUDE.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 21:39:42 -07:00
test-user 2676325184 feat(c2pa): expand soft-binding vendor map with registry-verified algs
Adds Trufo, Overlai, MarkAny, Mentaport, LumaTrace, VerdaAI, ContentLens, ISCC
(io.iscc content code), and Adobe ICN fingerprint to C2PA_SOFT_BINDINGS, and
notes AIWatermark wraps Meta PixelSeal. All `alg` prefixes verified against the
official c2pa-org/softbinding-algorithm-list registry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 18:00:16 -07:00
test-user c196a16900 feat: detect soft-binding vendors, IPTC 2025.1, video/audio C2PA, TrustMark (v0.6.0)
Broadens metadata provenance coverage at the detection and container-strip level.

Detection:
- C2PA soft-binding `alg` -> forensic-watermark vendor (Adobe TrustMark,
  Digimarc, Imatag, Steg.AI, Microsoft, ...) via C2PA_SOFT_BINDINGS +
  soft_binding_vendors_in(); names the watermark vendor even when the watermark
  itself can't be decoded.
- IPTC Photo Metadata 2025.1 AI-disclosure XMP fields (AISystemUsed etc.) via
  iptc_ai_system() + IPTC_AI_FIELD_MARKERS.
- Adobe TrustMark open keyless decoder (trustmark_detector.py, optional extra
  `trustmark`) -- the watermark behind Adobe Durable Content Credentials.
  Detects provenance, not AI origin, so it does not assert is_ai.

Removal / containers:
- isobmff.strip_c2pa_boxes now also drops a top-level XMP uuid box that carries
  an AI label (matched by AI-marker content, byte-order-robust; plain XMP kept).
- remove_ai_metadata routes MP4/MOV/M4V/M4A (and any ftyp-sniffed ISOBMFF)
  through the box stripper; raises a clear error for non-ISOBMFF audio/video
  (WebM/MP3/WAV) instead of crashing in the image path.

Tests: soft-binding scan, IPTC element/attribute/presence, MP4 + M4A detect/
strip, ISOBMFF XMP surgical strip, content-sniff, unsupported-container guard,
TrustMark absent-safety + identify integration. ruff clean; pyright clean on
all new modules.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 17:56:48 -07:00
test-user ba94de8275 feat: strip AI-provenance EXIF tags on removal (v0.5.6)
remove_ai_metadata now scrubs AI tags from the JPEG EXIF instead of passing
the block through wholesale. Closes the v0.5.5 follow-up: the xAI/Grok
Signature + UUID-Artist pair was detected but not removed.

- metadata._scrub_ai_exif(): deletes the xAI signature pair and any
  Software/Make/Artist/ImageDescription tag carrying an AI_GENERATOR_TOKENS
  token (so Ideogram's Make="Ideogram AI" is scrubbed too), keeping genuine
  camera/editor EXIF intact.
- Shared _is_xai_signature_pair / _exif_text helpers (module-level compiled
  regexes) are now the single source of truth, used by both xai_signature
  and _scrub_ai_exif.
- Tests: Grok signature stripped on JPEG output, Ideogram Make stripped,
  real-camera Make ("Apple") preserved. 325 passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 14:26:20 -07:00
test-user 74618b91a7 feat: detect xAI/Grok EXIF signature; refresh watermarking landscape (v0.5.5)
xAI Grok (Aurora) images carry no C2PA/SynthID/IPTC -- their only provenance
signal is an EXIF pair: ImageDescription "Signature: <base64>" + a UUID Artist.
Verified stable across 3 genuine generations (a real download previously read
as unknown / "no AI metadata").

- metadata.xai_signature(): matches the Signature blob + UUID Artist pair;
  wired into has_ai_metadata, get_ai_metadata, and identify (platform
  "xAI (Grok / Aurora)").
- data/samples/grok-1.jpg: real Grok fixture (neutral content; the Artist UUID
  is the public image id, not PII).
- Tests: synthetic-fixture unit tests, real-sample assertion, identify
  integration (322 passing).

Docs (research refresh, May 2026):
- C2PA 2.4 Durable Content Credentials (soft-binding re-discovery after the
  embedded manifest is stripped).
- New AI-labeling laws, primary-source verified: EU AI Act Art 50 (2026-08-02),
  South Korea AI Framework Act Art 31(3), California AB 853.
- Hedge removal claims: defeating the SynthID verifier is not forensic
  invisibility (arXiv:2605.09203); cite SynthID-Image (arXiv:2510.09263).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 14:14:35 -07:00
test-user 03fb460f77 Track the labeled SynthID corpus; complete metadata-source test coverage
Corpus images were gitignored (local-only). The negatives were reviewed and
cleared for publishing, so the labeled set is now committed (regular git, 65 MB
across 25 files) -- making the removal regression set reproducible and CI-able.

Corpus:
- Track data/synthid_corpus/images/ (pos 9, neg 15, cleaned 1); keep only the
  synthetic refs/ calibration fills gitignored.
- Reconcile manifest.csv to the on-disk files: 117 -> 25 rows (92 dangling rows
  for removed images pruned; dedup left one cleaned output, f6dd47a5).
- Rewrite the corpus README layout/policy (images committed; review every image
  for private content before adding -- public repo, permanent history).

Test fixtures:
- Remove data/samples/not-ai-1/2/3 (personal iPhone photos, incl. GPS EXIF).
- Add the clean_photo conftest fixture serving a verified-negative image from
  the corpus neg/ set; repoint the three "non-AI / clean photo" tests onto it
  (skips if the corpus is absent).

Metadata-source coverage (close the last sub-variant gaps):
- c2pa digitalSourceType: algorithmicMedia (procedural, not flagged AI) and
  compositeWithTrainedAlgorithmicMedia (AI + SynthID proxy).
- exif_generator: EXIF Artist and ImageDescription fields (Software/Make/XMP
  CreatorTool were already covered).

All 8 metadata-source kinds are now tested at both the unit and identify()
level. 313 tests pass. CLAUDE.md updated (corpus tracked, clean_photo fixture).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 14:46:47 -07:00
test-user e27f24f520 test(samples): commit real Doubao fixture + AIGC real-sample test
data/samples/doubao-1.png is the real #13 sample: carries the China TC260
<TC260:AIGC> XMP label and a visible '豆包AI生成' text mark (bottom-right).
Grounds the AIGC detection on a real file (alongside the synthetic tests)
and serves as the fixture for visible-watermark removal work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 12:37:15 -07:00
test-user c7f0d71f90 feat(identify): detect China TC260 AIGC label (Doubao et al.)
China-served generators embed an XMP <TC260:AIGC>{"Label":"1",...} block
(China's mandatory AI-content labeling, TC260 standard). Doubao (ByteDance)
uses it -- verified on the real #13 sample. It's none of C2PA / SynthID /
imwatermark / IPTC, so identify() previously returned unknown.

- metadata: AIGC_MARKERS + aigc_label() (json-decodes the HTML-entity-encoded
  block); has_ai_metadata + get_ai_metadata now surface it.
- identify: new 'aigc' signal -> is_ai True, platform 'China AIGC-labeled
  generator (TC260; e.g. Doubao)', carries the ContentProducer code.
- Container-agnostic raw-byte scan, so it covers the whole China-AIGC ecosystem
  (Jimeng/Kling/Qwen/Ernie share the standard).
- Tests: synthetic TC260 block (metadata + identify). Docs updated.

Addresses #13.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 12:29:51 -07:00
test-user ede35a3db5 feat(metadata): read EXIF Make tag; collect Ideogram/Recraft/Krea-FLUX
Collected live samples from three popular generators we lacked:

- Ideogram tags its downloads with EXIF Make="Ideogram AI" (no C2PA, no
  SynthID, no imwatermark) -- the Make tag is its only signal. exif_generator
  only read Software/Artist/ImageDescription, so it missed this; now reads
  Make too. Real cameras put "Apple"/"Canon" in Make (no AI token), so this
  stays low-false-positive. 4 originals ingested.
- Recraft (PNG export) and Krea hosting FLUX 2: downloads carry NO detectable
  signal -- no C2PA/EXIF/IPTC, and notably no imwatermark despite Krea running
  FLUX. identify correctly reports 'unknown'. Both ingested as neg fixtures.

Lesson recorded in CLAUDE.md: the imwatermark detector fires only on pristine
output from a pipeline that runs the encoder (diffusers default, official BFL),
not from re-hosts (Krea/Stability) or re-encoded exports (Recraft/Canva).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 18:38:56 -07:00
test-user ad3b8ee248 feat(identify): read EXIF Software / XMP CreatorTool generator tags
Closes the documented gap where EXIF/XMP fields inside AVIF/HEIF/JXL went
unparsed. metadata.exif_generator extracts the EXIF Software/Artist tag
(via PIL+piexif, which opens AVIF natively) and the XMP CreatorTool (via a
container-agnostic raw-byte scan that also covers HEIF/JXL that PIL can't
open), and matches against AI_GENERATOR_TOKENS so only generator names
(Firefly, DALL-E, Midjourney, ComfyUI, ...) fire -- a plain 'Adobe
Photoshop' or 'GIMP' tag is not flagged.

identify() surfaces it as a high-confidence signal and uses it for
platform attribution when no C2PA names a platform, so an AVIF/HEIF whose
only AI signal is an EXIF/XMP generator tag is now caught.

Validated with synthesized fixtures (the 'no positive fixtures' blocker
was self-imposed): real AVIF and JPEG written with EXIF Software via PIL,
plus an XMP CreatorTool raw-scan fixture. Zero false positives across the
109-image corpus (real iPhone photos carry no AI generator token).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 17:56:39 -07:00
test-user f36320ff39 fix(metadata): guard get_ai_metadata PIL open against non-OSError
get_ai_metadata opened the file with PIL unguarded, so a HEIC (or any
format PIL can't open without optional plugins) raised
UnidentifiedImageError instead of falling through to the binary scan --
unlike has_ai_metadata, which already guards. Wrap the open in
except Exception and continue to the C2PA/IPTC path. Regression test
feeds an unopenable .heic shell and asserts no raise.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 16:19:15 -07:00
test-user f07ce10c72 feat(metadata): SynthID-source detection, C2PA parser consolidation, corpus + tests
Detect SynthID-bearing images via their C2PA companion: a manifest signed by a
SynthID-using vendor (Google/OpenAI) on AI-generated content implies an
invisible SynthID pixel watermark. Verified end-to-end against the vendor
oracles (openai.com/verify, Gemini "Verify with SynthID").

- metadata: synthid_source() + synthid_watermark verdict in get_ai_metadata,
  surfaced as a `metadata --check` callout. Format-agnostic (PNG caBX parser +
  JPEG/WebP/AVIF/HEIF/JXL binary scan).
- constants: SYNTHID_C2PA_ISSUERS {Google, OpenAI}; +opened/placed actions.
- c2pa: single CBOR-aware parser (_cbor_text_after) replaces glitchy regex
  (fixes fGPT-4o claim_generator); removed duplicate _scan_png_c2pa_chunk from
  metadata; shared synthid_verdict / synthid_vendors_in helpers.
- corpus: scripts/synthid_corpus.py ingest tool + data/synthid_corpus/
  (manifest tracked, images gitignored) for a labeled reference set.
- tests: +38 across C2PA parser internals, extract/inject round-trip, ISOBMFF
  container stripping, all IPTC AI markers, and invisible watermark strength
  tiers (SynthID/StableSignature/TreeRing/StegaStamp/RingID/RivaGAN/...).

Pixel-level SynthID detection remains out of reach locally (Google's decoder is
proprietary); a from-scratch spectral pilot confirmed it does not separate real
content. See CLAUDE.md for the full evaluation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 11:32:46 -07:00
test-user f2fc5e09ab feat: SDXL default; AVIF/HEIF/JPEG-XL C2PA stripping
SD-1.5 dreamshaper at 768 px did not defeat SynthID v2 on Gemini 3 Pro
outputs (verified May 2026 via Gemini app's "Verify with SynthID"). Switch
the default invisible engine to SDXL at 1024 px, matching the raiw-app
production config (strength 0.05, steps 50). Drop the SD-1.5 pipeline.

Metadata layer: add C2PA UUID and IPTC AI marker byte-scan detection
across all formats, plus an ISOBMFF box walker (noai/isobmff.py) that
strips top-level C2PA uuid and JUMBF jumb boxes from AVIF/HEIF/JPEG-XL
containers without re-encoding.

README gets a Legal table and a Threat-model section about SynthID v2's
136-bit payload. CLAUDE.md tracks the SD-1.5 regression as historical
context.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 12:54:37 -07:00
test-user e5d8970add Add project files, tests, and documentation for GitHub release
- CLI with visible, invisible, all, metadata, and batch commands
- Gemini watermark removal via reverse alpha blending
- Invisible watermark removal via diffusion regeneration (SynthID, TreeRing)
- AI metadata stripping (EXIF, PNG text, C2PA)
- Face protection (YOLO/Haar) and analog humanizer
- 137 tests covering all CLI modes and core engines
- Ruff and Pyright clean
2026-03-25 11:15:05 -07:00