Commit Graph

40 Commits

Author SHA1 Message Date
Victor Kuznetsov 223cbcf171 feat(metadata): detect China TC260 AIGC PNG chunk and HuggingFace hf-job-id
aigc_label now reads the TC260 label from a raw-JSON `AIGC` PNG tEXt chunk
(as Doubao/ByteDance write it, with no namespaced XMP marker) in addition to
the `<TC260:AIGC>` XMP block, via a shared _parse helper gated on a TC260 field
so a generic AIGC key cannot false-positive. New huggingface_job() reads the
hf-job-id PNG chunk; identify surfaces it as a medium-confidence hf_job signal
(parallel to the visible sparkle, never overriding a hard metadata verdict).
Both wired into has_ai_metadata/get_ai_metadata; the PNG save whitelist already
strips them on removal. Found by auditing 646 corpus originals: 28 AIGC and 3
hf-job files the library previously reported as Unknown.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 12:40:17 -07:00
Victor Kuznetsov a0bf62e601 feat(invisible): preserve text/CJK via Differential Diffusion (--protect-text) (v0.6.10)
SDXL img2img regenerates every pixel, so small text and CJK glyphs deform
at the strengths that defeat SynthID (issue #21). With --protect-text a
CJK-native PP-OCRv3 detector (2.4 MB ONNX, cv2.dnn, no torch, cached on
first use) locates text regions and the pass switches to the SDXL
Differential-Diffusion community pipeline: a per-pixel change map keeps
text regions largely intact while the background is regenerated to strip
the watermark. Gated to the SDXL default model; falls back to plain
img2img with a warning when unavailable.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 11:59:15 -07:00
test-user 5bfed00553 feat(metadata): blank AI-label XMP inside the HEIF/AVIF meta box (v0.6.9)
HEIF/AVIF store XMP as a meta-box `mime` item whose bytes live in mdat/idat, out
of reach of the top-level uuid/jumb box stripper. An AI-label XMP packet there
(TC260 AIGC, IPTC "Made with AI", IPTC 2025.1) was therefore left in place.

isobmff.blank_ai_xmp_packets locates each XMP packet by its <?xpacket begin ...
end?> delimiters and, if it carries an AI marker (_AI_LABEL_MARKERS), overwrites
it with spaces of the SAME length. Equal length means no box size or iloc offset
shifts -- the coded image stays bit-for-bit intact, the item stays structurally
valid, only the AI label content is destroyed. Plain (non-AI) XMP is left alone,
mirroring the top-level XMP-uuid content match. Wired into remove_ai_metadata's
ISOBMFF branch after strip_c2pa_boxes.

Chosen over exiftool (a non-bundled binary dep) to stay pure-Python and
droplet-compatible; over full iinf/iloc surgery to avoid offset-rewrite
corruption risk. The AI labels we target are all XMP, so this closes the
practical gap. An Exif *item* inside the meta box (rare) still needs iinf/iloc
surgery or exiftool -- documented.

4 new tests (TestMetaBoxXmpBlanking): AI packet blanked (same length, marker
gone, surrounding image bytes intact), plain XMP preserved, no-packet no-op, and
end-to-end remove_ai_metadata on a .heic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 18:15:48 -07:00
test-user 31f0a82906 feat(metadata): detect C2PA/AIGC/IPTC manifests after a large mdat in MP4 (v0.6.8)
Provenance detection no longer relies on a fixed first-MB read. In a streaming /
non-faststart MP4 the C2PA manifest sits AFTER a multi-megabyte mdat, beyond the
1 MB scan window, so it was missed.

- isobmff.scan_c2pa_region(path): a file-seeking top-level box walker that
  returns the payloads of uuid/jumb (provenance) boxes, seeking past mdat by
  size without reading it -- works on multi-GB files. Returns b"" for
  non-ISOBMFF or on read error. Mirrors the box-size encoding of the existing
  in-memory _iter_top_level_boxes (largesize / size==0).
- metadata.scan_head(path, size): the shared input for every C2PA/AIGC/IPTC
  byte scan -- first __TEXT	__DATA	__OBJC	others	dec	hex bytes plus, for ISOBMFF, the late provenance-box
  payloads. Behavior-neutral (f.read(size)) for non-ISOBMFF inputs.
- Routed all six metadata scan sites (has_ai_metadata, aigc_label,
  iptc_ai_system, synthid_source, exif_generator XMP, get_ai_metadata
  soft-binding) and identify's head read through scan_head.

6 new tests: late box found by scan_c2pa_region / scan_head, the fixed window
provably misses it, non-ISOBMFF -> b"", front-placed (faststart) regression.

The remaining gap stays documented: EXIF/XMP stored as items inside the meta
box (AVIF/HEIF stills) still needs meta-box surgery or exiftool.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 13:42:29 -07:00
test-user 18160fe269 feat(identify): integrity-clash detection for contradictory provenance (v0.6.7)
Surface contradictions between independent provenance signals instead of
collapsing to a single verdict -- a strong tell of spoofed, transplanted, or
laundered metadata. Inspired by arXiv:2603.02378.

Two rules in the new _integrity_clashes helper:
- Conflicting AI-origin attributions: two or more distinct AI vendors named by
  independent generator stamps (e.g. a C2PA OpenAI manifest on an image whose
  EXIF says Make="Ideogram AI").
- Camera + AI: a camera-capture C2PA device (Pixel/Leica/Sony/Nikon/Truepic)
  coexisting with an AI-generation marker -- a genuine capture is not AI.

High-precision by design: only hard generator stamps feed it (C2PA issuer when
the source is AI, SynthID proxy, EXIF/XMP generator, IPTC AISystemUsed, xAI,
AIGC). The fuzzy visible sparkle and the open invisible watermark are excluded
-- the latter can be a by-product of our own SDXL removal pass. Vendor
normalization (_vendor_of over _AI_VENDOR_TOKENS) keeps consistent signals from
clashing (C2PA "Google (Gemini)" + SynthID-Google agree); the C2PA vendor is
read from the issuer attribution, not the resolved platform, so a camera label
like "Google Pixel" cannot mis-normalize to an AI vendor.

Surfaced as ProvenanceReport.integrity_clashes (red in the table view, included
in --json). 19 new tests; all real single-origin fixtures (chatgpt/firefly/
doubao/grok/mj) verified to produce zero clashes (false-positive guard).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 13:27:25 -07:00
test-user 7b47fa9f6a fix(io): Unicode-safe cv2 image IO + un-eat the [gpu] install hint (v0.6.6)
Two CLI/IO robustness bugs surfaced by issues #17 and #19.

#17 -- non-ASCII image paths (Chinese/Cyrillic/accented) failed on Windows:
cv2.imread/imwrite use the platform ANSI code-page API, so the decode came back
empty with a "can't open/read file" warning. New image_io.imread/imwrite route
through np.fromfile+cv2.imdecode / cv2.imencode+tofile (Unicode-safe, byte-
identical output, cv2.imread None-semantics preserved); all 8 cv2 read/write
call sites now go through it. Behavior-neutral on macOS/Linux (already accept
UTF-8 paths), so the fix is correct-by-construction for the Windows-only bug.

#19 (incidental) -- rich parsed the "[gpu]" in the GPU-extra install hint as a
style tag and dropped it, so the printed command was the un-installable
"pip install 'remove-ai-watermarks'". Escaped as \[gpu] at both call sites.

Tests: test_image_io.py (non-ASCII round-trip, alpha, missing/empty/garbage
semantics); test_cli.py::TestGpuHintMarkup (install hint keeps the extra).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 11:52:48 -07:00
test-user e1c99b5937 fix(identify): gate C2PA issuer->generator attribution on AI source type (v0.6.5)
Prevents an unmapped C2PA device whose manifest incidentally contains a mapped
issuer substring (e.g. the "Adobe XMP" toolkit string in a Canon/Sony camera
capture) from being mislabeled as that AI generator ("Adobe Firefly").
_attribute_platform now names a specific AI-generator platform only when the
digital-source-type is trainedAlgorithmicMedia; otherwise it degrades to the
neutral "C2PA signer: X" label. Real Firefly/OpenAI/Google output carries the
AI source-type and is unaffected (verified: chatgpt-1.png->OpenAI,
firefly-1.png->Adobe Firefly still attribute). Closes the only real downside of
leaving Canon/Samsung/Bria device signers unmapped: detection and removal were
already unaffected; now the platform label degrades gracefully too.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 10:29:12 -07:00
test-user f9cf14c372 feat(metadata): strip container metadata from WebM/MP3/WAV/FLAC/OGG via ffmpeg (v0.6.4)
remove_ai_metadata now handles non-ISOBMFF audio/video (which the box walker
can't reach) by shelling out to ffmpeg with a lossless stream copy
(`-map_metadata -1 -map_chapters -1 -c copy`): codec data is untouched, only
container tags/chapters (ID3 / RIFF / Vorbis comments / EBML tags) are dropped.
Requires ffmpeg on PATH; raises a clear RuntimeError if absent or if ffmpeg
can't parse the input (instead of crashing in the image path).

Verified end-to-end: a real ffmpeg-made WAV/MP3 with a "Suno AI" title tag ->
tag gone, audio bytes preserved.

NOT built (evaluated, deliberate): Resemble PerTh audio *detection* --
`get_watermark()` returns a raw bit array with no presence/confidence flag, so
reliably telling watermarked from clean needs Resemble's fixed payload or a
confidence API (neither public; no real sample to calibrate). Same wall as the
SynthID pixel detector. AVIF/HEIF meta-box EXIF/XMP stripping also stays a gap
(needs exiftool, a non-installed binary). Both documented in CLAUDE.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 21:39:42 -07:00
test-user bc3228d387 feat(visible): Doubao text-mark removal + universal region eraser
Add deterministic, CPU-only removal of the visible Doubao "豆包AI生成" mark and
a position-agnostic region eraser for any other visible watermark/logo.

- doubao_engine.py: locate (geometry, scales with width) + polarity-aware
  white-top-hat glyph mask + cv2 inpaint; coverage-gated detection and a
  dense-text safety guard. No GPU, ~30ms.
- region_eraser.py + `erase` command: inpaint arbitrary --region box(es).
  Default cv2 backend (no deps); optional big-LaMa via onnxruntime (`lama`
  extra, Carve/LaMa-ONNX, model downloaded on first use, never bundled).
- cli `visible --mark auto|gemini|doubao`: auto routes by detector confidence.
- tests for both engines; seed previously-unseeded CLI image fixtures to stop
  the Doubao detector flaking on random corners.
- .gitignore: doubao_capture/{seeds,captures} scratch (alpha-map calibration).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 21:31:51 -07:00
test-user 9f93d9c0c5 feat(identify): add Sony C2PA device attribution, verified (v0.6.3)
Adds Sony to _DEVICE_C2PA_PLATFORM, matching Sony's own `sony.sig` / `sony.cert`
C2PA assertion namespace (NOT bare "Sony", which is a common EXIF Make). Verified
against a real Sony-signed file (Sony PXW-Z300, signer "Sony Corporation") found
in the Security4Media/c2pa-video-player repo. The sample is video (MP4) -- our
ISOBMFF C2PA path detects it; Sony Alpha stills likely share the namespace.

Verified device set is now Leica, Nikon, Google Pixel, Sony, Truepic. Canon /
Samsung / Bria still have no public direct-download C2PA sample to verify.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 21:13:49 -07:00
test-user 64be9598f2 fix(identify): device-token-first C2PA attribution; add verified Pixel (v0.6.2)
Replaces the claim-generator-string match with a distinctive device-token scan
of the manifest bytes (_device_platform / _DEVICE_C2PA_PLATFORM), which is more
robust: it catches devices where the generator name lives under a non-standard
CBOR key (Pixel uses `claim_generator_info`, so it has no `claim_generator`).

- Adds Google Pixel, verified against a real Pixel 10 Pro C2PA file (attached to
  c2pa-rs issue #1609/#1554): cert CN "Pixel Camera", digitalSourceType
  `computationalCapture` -> capture authenticity, not AI (is_ai stays None).
- Token distinctiveness is load-bearing: bare "Truepic" matched the OpenAI
  chatgpt-1.png fixture (Truepic is a trust-chain signing authority), so the
  token is the specific "Truepic_Lens"; "Pixel Camera" (cert CN) not "Pixel".
- Verified Leica/Nikon/Truepic/Pixel attribute correctly and OpenAI/Adobe/MJ
  do not regress. Sony/Canon/Samsung/Bria stay unmapped: no public direct-
  download C2PA sample exists to verify their in-manifest string.
- Regression tests: device token beats incidental issuer mentions (Leica,
  Pixel-vs-Google).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 20:43:40 -07:00
test-user dda2ee7fbb fix(identify): attribute C2PA by claim_generator, not incidental issuer tokens (v0.6.1)
Verified on real signed files that the issuer byte-scan mis-attributes
multi-entity manifests: Leica read as "Truepic" (timestamp authority in the
chain), Nikon as "Adobe Firefly" (XMP-toolkit "Adobe" + the sample's
"Adobe_MAX" name), Truepic as "Google". Platform attribution now prefers the
claim generator (what produced the asset) and falls back to the issuer scan.

- New _CLAIM_GENERATOR_PLATFORM map + _platform_from_generator; claim generator
  read for non-PNG via the now-public c2pa.cbor_text_after.
- Device tokens listed only where verified against a real C2PA file (Leica
  lc_c2pa, Nikon, Truepic Lens); Pixel/Samsung/Sony/Canon/Bria deferred until a
  real sample confirms the in-manifest string. Camera C2PA marks capture
  authenticity, so these never set is_ai.
- cbor_text_after made public (was _cbor_text_after); call sites + tests updated.
- Regression test: claim_generator beats incidental Adobe/Google/Truepic tokens.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 20:10:07 -07:00
test-user 2676325184 feat(c2pa): expand soft-binding vendor map with registry-verified algs
Adds Trufo, Overlai, MarkAny, Mentaport, LumaTrace, VerdaAI, ContentLens, ISCC
(io.iscc content code), and Adobe ICN fingerprint to C2PA_SOFT_BINDINGS, and
notes AIWatermark wraps Meta PixelSeal. All `alg` prefixes verified against the
official c2pa-org/softbinding-algorithm-list registry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 18:00:16 -07:00
test-user c196a16900 feat: detect soft-binding vendors, IPTC 2025.1, video/audio C2PA, TrustMark (v0.6.0)
Broadens metadata provenance coverage at the detection and container-strip level.

Detection:
- C2PA soft-binding `alg` -> forensic-watermark vendor (Adobe TrustMark,
  Digimarc, Imatag, Steg.AI, Microsoft, ...) via C2PA_SOFT_BINDINGS +
  soft_binding_vendors_in(); names the watermark vendor even when the watermark
  itself can't be decoded.
- IPTC Photo Metadata 2025.1 AI-disclosure XMP fields (AISystemUsed etc.) via
  iptc_ai_system() + IPTC_AI_FIELD_MARKERS.
- Adobe TrustMark open keyless decoder (trustmark_detector.py, optional extra
  `trustmark`) -- the watermark behind Adobe Durable Content Credentials.
  Detects provenance, not AI origin, so it does not assert is_ai.

Removal / containers:
- isobmff.strip_c2pa_boxes now also drops a top-level XMP uuid box that carries
  an AI label (matched by AI-marker content, byte-order-robust; plain XMP kept).
- remove_ai_metadata routes MP4/MOV/M4V/M4A (and any ftyp-sniffed ISOBMFF)
  through the box stripper; raises a clear error for non-ISOBMFF audio/video
  (WebM/MP3/WAV) instead of crashing in the image path.

Tests: soft-binding scan, IPTC element/attribute/presence, MP4 + M4A detect/
strip, ISOBMFF XMP surgical strip, content-sniff, unsupported-container guard,
TrustMark absent-safety + identify integration. ruff clean; pyright clean on
all new modules.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 17:56:48 -07:00
test-user ba94de8275 feat: strip AI-provenance EXIF tags on removal (v0.5.6)
remove_ai_metadata now scrubs AI tags from the JPEG EXIF instead of passing
the block through wholesale. Closes the v0.5.5 follow-up: the xAI/Grok
Signature + UUID-Artist pair was detected but not removed.

- metadata._scrub_ai_exif(): deletes the xAI signature pair and any
  Software/Make/Artist/ImageDescription tag carrying an AI_GENERATOR_TOKENS
  token (so Ideogram's Make="Ideogram AI" is scrubbed too), keeping genuine
  camera/editor EXIF intact.
- Shared _is_xai_signature_pair / _exif_text helpers (module-level compiled
  regexes) are now the single source of truth, used by both xai_signature
  and _scrub_ai_exif.
- Tests: Grok signature stripped on JPEG output, Ideogram Make stripped,
  real-camera Make ("Apple") preserved. 325 passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 14:26:20 -07:00
test-user 74618b91a7 feat: detect xAI/Grok EXIF signature; refresh watermarking landscape (v0.5.5)
xAI Grok (Aurora) images carry no C2PA/SynthID/IPTC -- their only provenance
signal is an EXIF pair: ImageDescription "Signature: <base64>" + a UUID Artist.
Verified stable across 3 genuine generations (a real download previously read
as unknown / "no AI metadata").

- metadata.xai_signature(): matches the Signature blob + UUID Artist pair;
  wired into has_ai_metadata, get_ai_metadata, and identify (platform
  "xAI (Grok / Aurora)").
- data/samples/grok-1.jpg: real Grok fixture (neutral content; the Artist UUID
  is the public image id, not PII).
- Tests: synthetic-fixture unit tests, real-sample assertion, identify
  integration (322 passing).

Docs (research refresh, May 2026):
- C2PA 2.4 Durable Content Credentials (soft-binding re-discovery after the
  embedded manifest is stripped).
- New AI-labeling laws, primary-source verified: EU AI Act Art 50 (2026-08-02),
  South Korea AI Framework Act Art 31(3), California AB 853.
- Hedge removal claims: defeating the SynthID verifier is not forensic
  invisibility (arXiv:2605.09203); cite SynthID-Image (arXiv:2510.09263).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 14:14:35 -07:00
test-user 03fb460f77 Track the labeled SynthID corpus; complete metadata-source test coverage
Corpus images were gitignored (local-only). The negatives were reviewed and
cleared for publishing, so the labeled set is now committed (regular git, 65 MB
across 25 files) -- making the removal regression set reproducible and CI-able.

Corpus:
- Track data/synthid_corpus/images/ (pos 9, neg 15, cleaned 1); keep only the
  synthetic refs/ calibration fills gitignored.
- Reconcile manifest.csv to the on-disk files: 117 -> 25 rows (92 dangling rows
  for removed images pruned; dedup left one cleaned output, f6dd47a5).
- Rewrite the corpus README layout/policy (images committed; review every image
  for private content before adding -- public repo, permanent history).

Test fixtures:
- Remove data/samples/not-ai-1/2/3 (personal iPhone photos, incl. GPS EXIF).
- Add the clean_photo conftest fixture serving a verified-negative image from
  the corpus neg/ set; repoint the three "non-AI / clean photo" tests onto it
  (skips if the corpus is absent).

Metadata-source coverage (close the last sub-variant gaps):
- c2pa digitalSourceType: algorithmicMedia (procedural, not flagged AI) and
  compositeWithTrainedAlgorithmicMedia (AI + SynthID proxy).
- exif_generator: EXIF Artist and ImageDescription fields (Software/Make/XMP
  CreatorTool were already covered).

All 8 metadata-source kinds are now tested at both the unit and identify()
level. 313 tests pass. CLAUDE.md updated (corpus tracked, clean_photo fixture).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 14:46:47 -07:00
test-user 3ebdee57b8 Test the untested pure logic: MPS fallback, tiling, isobmff/c2pa edges
Coverage audit (pytest --cov) found real, non-model logic at 0%/low cover.
Add unit tests that need no model download:

- img2img_runner.py 0% -> 100%: the MPS->CPU fallback orchestration, mocked
  via injected load_pipeline/reload_on_cpu callables. Guards the production
  behavior hit this session (native-res SDXL OOMs on MPS, must retry on CPU;
  non-MPS errors must propagate; "mps"-worded error on a cpu device must not
  reload).
- ctrlregen/tiling.py 0% -> 40%: the pure tile math (tile_positions,
  make_blend_weight, resize_center_crop) that decides how large images are
  split and blended. (run_tiled stays model-bound, untested.)
- isobmff.py 93% -> 100%: size==0 (box-to-EOF) and truncated 64-bit largesize
  parsing branches for AVIF/HEIF/JXL C2PA stripping.
- c2pa.py: non-PNG-signed .png reads as clean (has_c2pa_metadata /
  extract_c2pa_chunk) instead of mis-parsing.

309 tests pass (+23). Document in CLAUDE.md that these pure helpers are
unit-tested without downloads so future sessions don't skip them as "ML".
No src/ change, no release.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 14:21:32 -07:00
test-user d24d8a4b14 Extract _target_size helper + regression-test native resolution (v0.5.4)
The native-vs-downscale decision in InvisibleEngine.remove_watermark (the
issue #10/#15 fix: max_resolution=0 must not pre-downscale, since any
downscale both loses quality and lets SynthID survive) had no test. Extract
it into a pure helper invisible_engine._target_size(w, h, max_resolution)
and cover it with tests/test_invisible_engine.py::TestTargetSize so a
re-introduced forced downscale fails CI instead of silently regressing #15.

Also:
- Clamp the short side to >=1 in _target_size: extreme aspect ratios (e.g.
  5000x3 with --max-resolution 1024) truncated it to 0 and crashed
  image.resize(). Pre-existing in the inline math; fixed now that it is a
  named, tested function.
- Consolidate the two duplicated temp-file save blocks into one
  unconditional save (behavior unchanged: the EXIF-transposed image is
  still always persisted before WatermarkRemover reloads it by path), and
  drop the now-redundant `_tmp_path is not None` guard in finally.
- Bump version 0.5.3 -> 0.5.4 (pyproject, __init__, uv.lock); document the
  helper as the regression guard in CLAUDE.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 14:09:33 -07:00
test-user e27f24f520 test(samples): commit real Doubao fixture + AIGC real-sample test
data/samples/doubao-1.png is the real #13 sample: carries the China TC260
<TC260:AIGC> XMP label and a visible '豆包AI生成' text mark (bottom-right).
Grounds the AIGC detection on a real file (alongside the synthetic tests)
and serves as the fixture for visible-watermark removal work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 12:37:15 -07:00
test-user c7f0d71f90 feat(identify): detect China TC260 AIGC label (Doubao et al.)
China-served generators embed an XMP <TC260:AIGC>{"Label":"1",...} block
(China's mandatory AI-content labeling, TC260 standard). Doubao (ByteDance)
uses it -- verified on the real #13 sample. It's none of C2PA / SynthID /
imwatermark / IPTC, so identify() previously returned unknown.

- metadata: AIGC_MARKERS + aigc_label() (json-decodes the HTML-entity-encoded
  block); has_ai_metadata + get_ai_metadata now surface it.
- identify: new 'aigc' signal -> is_ai True, platform 'China AIGC-labeled
  generator (TC260; e.g. Doubao)', carries the ContentProducer code.
- Container-agnostic raw-byte scan, so it covers the whole China-AIGC ecosystem
  (Jimeng/Kling/Qwen/Ernie share the standard).
- Tests: synthetic TC260 block (metadata + identify). Docs updated.

Addresses #13.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 12:29:51 -07:00
Rex Lorenzo d091b9f822 fix(cli): preserve alpha channel in visible-watermark pipeline
`cv2.imread(..., IMREAD_COLOR)` was silently stripping the alpha channel
on RGBA inputs, and `cv2.imwrite` then wrote opaque 3-channel PNGs — so
images with transparent backgrounds came back with an opaque-black (or
white) background and the sparkle area baked in as a solid blob.

Read the source with `IMREAD_UNCHANGED`, keep the alpha plane out of the
detection/inpaint path (those still operate on BGR), and rejoin alpha at
save time. The detected watermark bbox is also zeroed in the alpha plane
so the sparkle region becomes transparent rather than an opaque artifact.

Applies to `visible`, `all`, and `batch` modes. RGB-only inputs and JPEG
outputs are unaffected.
2026-05-25 09:18:39 -07:00
test-user ede35a3db5 feat(metadata): read EXIF Make tag; collect Ideogram/Recraft/Krea-FLUX
Collected live samples from three popular generators we lacked:

- Ideogram tags its downloads with EXIF Make="Ideogram AI" (no C2PA, no
  SynthID, no imwatermark) -- the Make tag is its only signal. exif_generator
  only read Software/Artist/ImageDescription, so it missed this; now reads
  Make too. Real cameras put "Apple"/"Canon" in Make (no AI token), so this
  stays low-false-positive. 4 originals ingested.
- Recraft (PNG export) and Krea hosting FLUX 2: downloads carry NO detectable
  signal -- no C2PA/EXIF/IPTC, and notably no imwatermark despite Krea running
  FLUX. identify correctly reports 'unknown'. Both ingested as neg fixtures.

Lesson recorded in CLAUDE.md: the imwatermark detector fires only on pristine
output from a pipeline that runs the encoder (diffusers default, official BFL),
not from re-hosts (Krea/Stability) or re-encoded exports (Recraft/Canva).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 18:38:56 -07:00
test-user ad3b8ee248 feat(identify): read EXIF Software / XMP CreatorTool generator tags
Closes the documented gap where EXIF/XMP fields inside AVIF/HEIF/JXL went
unparsed. metadata.exif_generator extracts the EXIF Software/Artist tag
(via PIL+piexif, which opens AVIF natively) and the XMP CreatorTool (via a
container-agnostic raw-byte scan that also covers HEIF/JXL that PIL can't
open), and matches against AI_GENERATOR_TOKENS so only generator names
(Firefly, DALL-E, Midjourney, ComfyUI, ...) fire -- a plain 'Adobe
Photoshop' or 'GIMP' tag is not flagged.

identify() surfaces it as a high-confidence signal and uses it for
platform attribution when no C2PA names a platform, so an AVIF/HEIF whose
only AI signal is an EXIF/XMP generator tag is now caught.

Validated with synthesized fixtures (the 'no positive fixtures' blocker
was self-imposed): real AVIF and JPEG written with EXIF Software via PIL,
plus an XMP CreatorTool raw-scan fixture. Zero false positives across the
109-image corpus (real iPhone photos carry no AI generator token).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 17:56:39 -07:00
test-user 3a1c5427c8 feat(c2pa): recognize Stability AI issuer; fix Microsoft platform label
Collected live C2PA positives from Bing Image Creator and Stability Brand
Studio (DreamStudio successor) and learned two things our scan got wrong:

- Bing now runs Microsoft's own MAI-Image model, not DALL-E, and signs
  C2PA as 'Microsoft'. The scan caught it, but the platform label claimed
  'Microsoft Designer (DALL-E / OpenAI backend)'. Relabeled model-neutral:
  'Microsoft (Bing Image Creator / Designer)'.
- Stability signs C2PA as 'Stability AI' (cert 'Stability AI Ltd'), which
  was not in C2PA_ISSUERS, so it read as 'unknown signer'. Added the issuer
  and a platform mapping. Stability uses no SynthID and (on its current
  Stable Image model) no imwatermark watermark -- verified, both negative.

Both ingested as SynthID-negative corpus fixtures (they are AI but not
SynthID) for issuer-coverage. Canva skipped: its downloads are re-encoded
design exports that strip C2PA, so a Canva sample would be inconclusive.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 17:12:42 -07:00
test-user 27ad5b7645 feat(identify): detect open SD/SDXL/FLUX invisible watermark
Research found one locally-fillable detection gap: Stable Diffusion, SDXL,
and FLUX all embed an open DWT-DCT watermark via the invisible-watermark
(imwatermark) library -- a PUBLIC decoder, no secret key, unlike SynthID.
New invisible_watermark.py decodes the known fixed patterns (verified
against upstream source: diffusers SDXL WATERMARK_MESSAGE, FLUX.2
src/flux2/watermark.py, and the 'StableDiffusionV1' default string) and
identify() reports the scheme as a high-confidence signal.

Verified locally end-to-end: embedding SDXL's exact 48-bit message and
decoding it back recovers 48/48 bits; a clean image and our own fal-SDXL
outputs decode to ~21/48 (no match). Caveat baked into the report: the
watermark is fragile -- gone after JPEG q90 -- so it confirms origin only
on pristine files; absence is never proof.

imwatermark is an optional dep (extra 'detect'; pulls non-headless opencv),
so the import is guarded and the signal is skipped when absent. CLI
--no-visible now means metadata-only (skips both pixel-domain detectors).

Also records the broader watermarking landscape in CLAUDE.md: which
services are locally detectable (SD/SDXL/FLUX), C2PA-covered (Bing/Canva/
Getty/Shutterstock unsampled), or proprietary-only like SynthID (Amazon
Titan/Nova, Kakao). Midjourney embeds neither C2PA nor an invisible mark.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 16:53:59 -07:00
test-user 7dcc922617 feat(probe): solid-fill SynthID carrier probe; corpus reconfirms no pixel detector
scripts/synthid_pixel_probe.py is an experimental/diagnostic tool for the
one pixel-domain question that isn't a dead-end: on solid-color fills the
zero-mean residual IS essentially the watermark carrier. Two modes:
'consistency' (mean pairwise NCC of carriers across fills vs random
baseline) and 'removal' (does the pipeline drop the carrier toward
baseline?). Logic validated synthetically (injected carrier correlates,
random noise doesn't, simulated removal collapses it) -- no real fills or
GPU needed.

Running its metric on the corpus independently re-confirms the documented
dead-end for real content: at matched resolution SynthID positives do not
cluster apart from negatives (within-Gemini 0.07; at 1024 px pos-vs-neg
>= pos-vs-pos). An apparent 0.62 among 1254px ChatGPT positives turned out
to be near-duplicate content (5 renders of one prompt at ~0.92; a distinct
ChatGPT image scored ~0 against them), not a shared carrier. The probe is
solid-fills-only; do not use on real content.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 16:35:39 -07:00
test-user 1a9f3e4fe5 test(identify): cover provenance branches, CLI, sparkle helper
Adds 20 tests around the new provenance path:

- identify(): local SD/ComfyUI params -> local-pipeline attribution;
  visible-sparkle gating at the 0.5 threshold (mocked detector: above,
  below, unavailable, opt-out); metadata verdict not downgraded by a
  sparkle hit; OpenAI/SynthID caveats + dedup; ProvenanceReport is
  JSON-serializable (the CLI --json path); and the honest edge where a
  C2PA manifest without an AI source marker stays 'unknown'.
- CLI 'identify': help, clean PNG, AI PNG platform, valid --json,
  missing file.
- gemini_engine.detect_sparkle_confidence: float in range for a real
  image, None for an unreadable file.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 16:27:00 -07:00
test-user fa104bcade feat(identify): provenance command (platform + watermark inventory)
New 'identify' command and identify.py module: upload an image, get one
ProvenanceReport answering where it was made and what watermarks it
carries. Aggregates every locally-readable signal:

- C2PA Content Credentials -> generating platform (issuer + generator).
- IPTC digitalSourceType 'Made with AI' (Meta and others).
- Embedded SD/ComfyUI generation parameters (local pipelines).
- SynthID metadata proxy (Google / OpenAI C2PA companion).
- Visible Gemini sparkle (cv2 fallback for the stripped-metadata case),
  promoted only at confidence >= 0.5 (corpus-tuned: Gemini sparkles
  score >= 0.56, non-sparkle <= 0.49).

is_ai_generated is True or None, never asserted False -- stripped
metadata leaves no local proof of a clean origin, so absence of signals
is reported as 'unknown' with an explicit caveat. The SynthID *pixel*
watermark remains locally undecodable; the report says so.

Non-PNG containers (JPEG/WebP/AVIF/HEIF/JXL) get the same issuer +
generator attribution via a binary scan (the caBX parser is PNG-only).
The cv2 dependency is isolated in gemini_engine.detect_sparkle_confidence
so identify.py stays type-clean. CLI supports --json and --no-visible.

Validated against the 109-image corpus: 14/14 positives flagged AI,
93/94 negatives clean (the one 'neg' flagged is a Meta image that
genuinely carries the IPTC tag -- correct), zero true errors.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 16:19:26 -07:00
test-user f36320ff39 fix(metadata): guard get_ai_metadata PIL open against non-OSError
get_ai_metadata opened the file with PIL unguarded, so a HEIC (or any
format PIL can't open without optional plugins) raised
UnidentifiedImageError instead of falling through to the binary scan --
unlike has_ai_metadata, which already guards. Wrap the open in
except Exception and continue to the C2PA/IPTC path. Regression test
feeds an unopenable .heic shell and asserts no raise.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 16:19:15 -07:00
test-user 6cef1d59f0 fix(c2pa): drop non-printable claim_generator garbage
On some manifests (observed: Microsoft Designer) the first CBOR "name"
key precedes a binary hash field, not the generator string, so
_cbor_text_after returns control-char garbage. Guard with isprintable()
to drop it; issuer detection (byte-search) and the SynthID verdict are
unaffected. Adds TestParseChunkGuards covering kept-vs-dropped cases.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 15:55:07 -07:00
test-user f07ce10c72 feat(metadata): SynthID-source detection, C2PA parser consolidation, corpus + tests
Detect SynthID-bearing images via their C2PA companion: a manifest signed by a
SynthID-using vendor (Google/OpenAI) on AI-generated content implies an
invisible SynthID pixel watermark. Verified end-to-end against the vendor
oracles (openai.com/verify, Gemini "Verify with SynthID").

- metadata: synthid_source() + synthid_watermark verdict in get_ai_metadata,
  surfaced as a `metadata --check` callout. Format-agnostic (PNG caBX parser +
  JPEG/WebP/AVIF/HEIF/JXL binary scan).
- constants: SYNTHID_C2PA_ISSUERS {Google, OpenAI}; +opened/placed actions.
- c2pa: single CBOR-aware parser (_cbor_text_after) replaces glitchy regex
  (fixes fGPT-4o claim_generator); removed duplicate _scan_png_c2pa_chunk from
  metadata; shared synthid_verdict / synthid_vendors_in helpers.
- corpus: scripts/synthid_corpus.py ingest tool + data/synthid_corpus/
  (manifest tracked, images gitignored) for a labeled reference set.
- tests: +38 across C2PA parser internals, extract/inject round-trip, ISOBMFF
  container stripping, all IPTC AI markers, and invisible watermark strength
  tiers (SynthID/StableSignature/TreeRing/StegaStamp/RingID/RivaGAN/...).

Pixel-level SynthID detection remains out of reach locally (Google's decoder is
proprietary); a from-scratch spectral pilot confirmed it does not separate real
content. See CLAUDE.md for the full evaluation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 11:32:46 -07:00
test-user f2fc5e09ab feat: SDXL default; AVIF/HEIF/JPEG-XL C2PA stripping
SD-1.5 dreamshaper at 768 px did not defeat SynthID v2 on Gemini 3 Pro
outputs (verified May 2026 via Gemini app's "Verify with SynthID"). Switch
the default invisible engine to SDXL at 1024 px, matching the raiw-app
production config (strength 0.05, steps 50). Drop the SD-1.5 pipeline.

Metadata layer: add C2PA UUID and IPTC AI marker byte-scan detection
across all formats, plus an ISOBMFF box walker (noai/isobmff.py) that
strips top-level C2PA uuid and JUMBF jumb boxes from AVIF/HEIF/JPEG-XL
containers without re-encoding.

README gets a Legal table and a Threat-model section about SynthID v2's
136-bit payload. CLAUDE.md tracks the SD-1.5 regression as historical
context.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 12:54:37 -07:00
test-user 7eb32fedee refactor: enforce strict linting and type checking across codebase
- Expand ruff rules (B, S, SIM, RET, COM, C4, G, PT, PIE, T20, DTZ, ICN, TCH, RUF, ANN)
- Switch pyright to strict mode with relaxed test environment
- Replace try-except-pass with contextlib.suppress throughout
- Move type-only imports into TYPE_CHECKING blocks
- Replace ambiguous Unicode chars (en dash, multiplication sign, Greek alpha) with ASCII
- Move color-matcher from base deps to [gpu], remove unused requests dep
- Add pyright to dev deps, update dependabot to uv ecosystem
- Fix hardcoded version in test_version, unused unpacked vars in tests
- Update maintain.sh, CLAUDE.md, .gitignore, .claude/settings.json
- Remove obsolete .agents/rules/project.md
- Upgrade all dependencies (Pygments vulnerability fix)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 11:42:42 -07:00
test-user b41c8e5aba v0.3.1: Fix opencv conflict, graceful GPU fallback, correct docs
- Remove opencv-python from [gpu] extra (conflicts with headless in base deps)
- Add graceful fallback in 'invisible' and 'all' commands when GPU deps missing
- Cache InvisibleEngine in batch mode (avoid reloading model per image)
- Fix --humanize help text (was '0.0-1.0', actual range is 0-6.0+)
- Fix stale docstring referencing non-existent [invisible] extra
- Add [gpu] extra install instructions to README
- Fix broken NeuralBleach placeholder URL in Credits
2026-03-26 10:50:26 -07:00
test-user 2bdc4bceff Bump version to 0.3.0 2026-03-25 17:27:39 -07:00
test-user 507757738e v0.2.2: Unify quality defaults, improve README
- Unify 'all' defaults to match 'invisible' (strength=0.02, steps=100)
- Reorder CLI docs: 'all' command first, individual commands second
- HuggingFace token is now documented as optional
- Remove 'additional setup' label from invisible section
2026-03-25 12:28:02 -07:00
test-user 2152ebcd32 v0.2.1: Code review fixes, platform-neutral docs
- Fix f-string logging → %-style (face_protector, invisible_engine)
- Fix logger name: hardcoded string → __name__
- Add module docstrings to humanizer.py, face_protector.py
- Break long warning string into multiple lines (PEP 8)
- Make docs platform-neutral (macOS/Linux/Windows)
- Rename 'optional' → 'additional setup' in README
2026-03-25 12:19:29 -07:00
test-user cace97b04e Bump version to 0.2.0
Changes since 0.1.0:
- Fix phantom model param bug in invisible/all commands
- Fix macOS SSL certificate issue for YOLO downloads
- Use temp file in 'all' pipeline to hide intermediate output
- Add legal disclaimer and fix license attribution
- Add troubleshooting and upgrade docs to README
- Expand test suite to 137 tests covering all CLI modes
- Clean up dependencies and pyright config
2026-03-25 12:03:44 -07:00
test-user e5d8970add Add project files, tests, and documentation for GitHub release
- CLI with visible, invisible, all, metadata, and batch commands
- Gemini watermark removal via reverse alpha blending
- Invisible watermark removal via diffusion regeneration (SynthID, TreeRing)
- AI metadata stripping (EXIF, PNG text, C2PA)
- Face protection (YOLO/Haar) and analog humanizer
- 137 tests covering all CLI modes and core engines
- Ruff and Pyright clean
2026-03-25 11:15:05 -07:00