feat(metadata): detect China TC260 AIGC PNG chunk and HuggingFace hf-job-id

`aigc_label` now reads the TC260 label from a raw-JSON `AIGC` PNG tEXt chunk
(as Doubao/ByteDance write it, with no namespaced XMP marker) in addition to
the `<TC260:AIGC>` XMP block. Both forms go through a shared `_parse` helper;
the chunk form is accepted only if it carries a known TC260 field, so a
generic `AIGC` key cannot false-positive. The new `huggingface_job()` reads the
`hf-job-id` PNG chunk; `identify` surfaces it as a medium-confidence `hf_job`
signal (like the visible sparkle, it never overrides a high-confidence
metadata verdict). Both are wired into `has_ai_metadata`/`get_ai_metadata`; the
PNG save whitelist already strips them on removal. Found by auditing 646 corpus
originals, which turned up 28 AIGC and 3 hf-job files that the library
previously reported as Unknown.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Victor Kuznetsov
2026-05-28 12:40:17 -07:00
parent 0eec3001bb
commit 223cbcf171
6 changed files with 280 additions and 16 deletions
+4 -2
File diff suppressed because one or more lines are too long
+3 -3
@@ -25,7 +25,7 @@ If this tool saves you time, consider [sponsoring its development](https://githu
 - **Smart Face Protection** — automatic extraction and blending of human faces to prevent AI distortion
 - **Batch processing** — process entire directories
 - **Detection** — three-stage NCC watermark detection with confidence scoring
-- **Provenance detection (`identify`)** — aggregate C2PA issuer, the C2PA soft-binding forensic-watermark vendor (Adobe TrustMark, Digimarc, Imatag, ...), IPTC "Made with AI" plus the IPTC 2025.1 `AISystemUsed` field, embedded SD/ComfyUI params, EXIF/XMP generator tags, the xAI/Grok EXIF signature, the SynthID metadata proxy, the visible sparkle, the open SD/SDXL/FLUX invisible watermark, and (with the `trustmark` extra) the open Adobe TrustMark watermark into one origin-platform + watermark-inventory verdict (`--json` for machine output)
+- **Provenance detection (`identify`)** — aggregate C2PA issuer, the C2PA soft-binding forensic-watermark vendor (Adobe TrustMark, Digimarc, Imatag, ...), IPTC "Made with AI" plus the IPTC 2025.1 `AISystemUsed` field, embedded SD/ComfyUI params, EXIF/XMP generator tags, the xAI/Grok EXIF signature, the China TC260 AIGC label (XMP or PNG chunk), the HuggingFace `hf-job-id` job marker, the SynthID metadata proxy, the visible sparkle, the open SD/SDXL/FLUX invisible watermark, and (with the `trustmark` extra) the open Adobe TrustMark watermark into one origin-platform + watermark-inventory verdict (`--json` for machine output)
 ## Examples
@@ -48,13 +48,13 @@ If this tool saves you time, consider [sponsoring its development](https://githu
 | **xAI Grok (Aurora)** | — | — | ✅ EXIF signature scheme (no C2PA): `Signature:` blob + UUID `Artist` | Detected (`identify`); metadata strip |
 | **Midjourney** | — | — | ✅ EXIF + XMP (prompt, model, seed) | Metadata strip |
 | **Meta AI** | — | — | ✅ IPTC "Made with AI" (digitalSourceType) | Metadata strip (removes the label) |
-| **Doubao** (ByteDance) / China AIGC generators | ✅ "豆包AI生成" text strip (bottom-right) | — | ✅ TC260 `<TC260:AIGC>` XMP label (China's mandatory AI labeling) | Locate + mask + inpaint (cv2, CPU) + metadata strip |
+| **Doubao** (ByteDance) / China AIGC generators | ✅ "豆包AI生成" text strip (bottom-right) | — | ✅ TC260 AIGC label — `<TC260:AIGC>` XMP **or** `AIGC` PNG chunk (China's mandatory AI labeling) | Locate + mask + inpaint (cv2, CPU) + metadata strip |
 | **StableSignature** (Meta) | — | ✅ In-model watermark | — | Diffusion regeneration |
 | **TreeRing** | — | ✅ Latent space watermark | — | Diffusion regeneration |
 > Visible overlays are used by Google Gemini / Nano Banana (sparkle logo) and by Doubao / China AIGC generators (the mandated "...AI生成" corner text). Both are removed deterministically on CPU. Other services rely on invisible watermarks and/or metadata; our diffusion-based regeneration works against any invisible watermark in pixel or frequency domain. For a visible mark from any other source (any position, any colour), use the universal `erase --region` command.
-> **Detection:** `remove-ai-watermarks identify <image>` reports the origin platform and watermark inventory for all the signals above — C2PA issuer, the C2PA soft-binding forensic-watermark vendor (TrustMark / Digimarc / Imatag / ...), IPTC "Made with AI" plus the IPTC 2025.1 `AISystemUsed` field, the China TC260 AIGC label, embedded generation params, EXIF/XMP generator tags, the xAI/Grok EXIF signature, the SynthID metadata proxy, the visible sparkle, and (with the `[detect]` / `[trustmark]` extras) the open SD/SDXL/FLUX and Adobe TrustMark invisible watermarks. SynthID and the proprietary soft-binding watermarks (Digimarc etc.) have no local decoder, so they are reported by metadata proxy / vendor name only.
+> **Detection:** `remove-ai-watermarks identify <image>` reports the origin platform and watermark inventory for all the signals above — C2PA issuer, the C2PA soft-binding forensic-watermark vendor (TrustMark / Digimarc / Imatag / ...), IPTC "Made with AI" plus the IPTC 2025.1 `AISystemUsed` field, the China TC260 AIGC label (XMP or PNG chunk), the HuggingFace `hf-job-id` job marker, embedded generation params, EXIF/XMP generator tags, the xAI/Grok EXIF signature, the SynthID metadata proxy, the visible sparkle, and (with the `[detect]` / `[trustmark]` extras) the open SD/SDXL/FLUX and Adobe TrustMark invisible watermarks. SynthID and the proprietary soft-binding watermarks (Digimarc etc.) have no local decoder, so they are reported by metadata proxy / vendor name only.
 ## How it works
+27 -3
@@ -31,6 +31,7 @@ from remove_ai_watermarks.metadata import (
     aigc_label,
     exif_generator,
     get_ai_metadata,
+    huggingface_job,
     iptc_ai_system,
     scan_head,
     xai_signature,
@@ -89,6 +90,11 @@ _INVISIBLE_WM_CAVEAT = (
"The open invisible watermark is fragile: it does not survive JPEG re-encoding "
"or resizing, so it confirms origin only on a pristine (un-re-encoded) file."
)
+_HF_JOB_CAVEAT = (
+    "The hf-job-id tag marks a HuggingFace-hosted job (commonly diffusion "
+    "generation) but names neither the model nor the content type, so it is a "
+    "medium-confidence signal, not proof the pixels are AI-generated."
+)
 @dataclass
@@ -423,9 +429,14 @@ def identify(image_path: Path, *, check_visible: bool = True, check_invisible: b
ai_vendor_claims["iptc_ai_system"] = v
     # ── China TC260 AIGC label (Doubao and other China-served gens) ──
-    aigc = any(m in head for m in AIGC_MARKERS)
+    # Fire on either the namespaced byte marker (``TC260:AIGC`` / the TC260 ns
+    # URL, present in XMP and as a laundering tell even when the JSON payload is
+    # truncated) OR the parsed label, which additionally catches the raw-JSON
+    # PNG ``AIGC`` tEXt chunk that carries no namespaced marker at all.
+    aigc_data = aigc_label(image_path)
+    aigc = aigc_data is not None or any(m in head for m in AIGC_MARKERS)
     if aigc:
-        producer = (aigc_label(image_path) or {}).get("ContentProducer", "")
+        producer = (aigc_data or {}).get("ContentProducer", "")
         signals.append(Signal("aigc", f"TC260 AIGC label{f' (producer {producer})' if producer else ''}", "high"))
         watermarks.append("China AIGC label (TC260 standard)")
         if platform is None:
@@ -461,6 +472,18 @@ def identify(image_path: Path, *, check_visible: bool = True, check_invisible: b
platform = "xAI (Grok / Aurora)"
ai_vendor_claims["xai"] = "xAI"
+    # ── HuggingFace-hosted job marker (hf-job-id PNG text chunk) ─────
+    # Marks the hosting job, not a model -- medium confidence (commonly diffusion
+    # output). Like the visible sparkle, it lifts an otherwise-Unknown verdict to
+    # a tentative AI, but never overrides a high-confidence metadata signal.
+    hf_job = huggingface_job(image_path)
+    if hf_job:
+        signals.append(Signal("hf_job", f"HuggingFace job {hf_job}", "medium"))
+        watermarks.append("HuggingFace-hosted job (hf-job-id)")
+        caveats.append(_HF_JOB_CAVEAT)
+        if platform is None:
+            platform = "HuggingFace-hosted job (model not identified)"
     # ── Open invisible watermark (SD / SDXL / FLUX, dwtDct) ──────────
     # Public decoder, no key -- a definitive embedded signal on pristine files.
     if check_invisible and (scheme := _invisible_watermark(image_path)) is not None:
@@ -503,11 +526,12 @@ def identify(image_path: Path, *, check_visible: bool = True, check_invisible: b
platform = "Google Gemini family (visible sparkle detected)"
     visible_only = any(s.name == "visible_sparkle" for s in signals) and not ai_from_metadata
+    hf_only = bool(hf_job) and not ai_from_metadata
     if ai_from_metadata:
         is_ai: bool | None = True
         confidence = "high"
-    elif visible_only:
+    elif visible_only or hf_only:
         is_ai = True
         confidence = "medium"
     else:
+92 -8
@@ -108,6 +108,27 @@ AIGC_MARKERS: tuple[bytes, ...] = (
b"TC260:AIGC",
)
+# TC260 AIGC-label JSON fields (the standard's labeling object). Doubao writes
+# the same object as a PNG ``tEXt`` chunk keyed ``AIGC`` (raw JSON, not XMP), so
+# a JSON object carrying at least one of these is accepted as a valid TC260
+# label even when the namespaced XMP element is absent.
+_TC260_FIELDS: frozenset[str] = frozenset(
+    {
+        "Label",
+        "ContentProducer",
+        "ProduceID",
+        "ContentPropagator",
+        "PropagateID",
+        "ReservedCode1",
+        "ReservedCode2",
+    }
+)
+# HuggingFace-hosted GPU jobs (Jobs / Spaces) stamp generated PNGs with this
+# ``tEXt`` chunk key holding the job UUID. It marks the hosting job, not a
+# specific model -- a medium-confidence AI signal (commonly diffusion output).
+_HF_JOB_KEY: str = "hf-job-id"
 STANDARD_METADATA_KEYS: frozenset[str] = frozenset(
     [
         "Author",
@@ -202,31 +223,90 @@ def has_ai_metadata(image_path: Path) -> bool:
     # IPTC 2025.1 AI-disclosure XMP properties (their presence flags AI content).
     if any(marker in data for marker in IPTC_AI_FIELD_MARKERS):
         return True
+    # China TC260 AIGC label as a PNG text chunk (the byte scan above catches
+    # only the XMP form; the raw-JSON tEXt chunk needs the PIL-based parse).
+    if aigc_label(image_path):
+        return True
+    # HuggingFace-hosted job marker (hf-job-id PNG text chunk).
+    if huggingface_job(image_path):
+        return True
     # xAI / Grok: no C2PA/IPTC/XMP -- only the EXIF Signature + UUID-Artist pair.
     return xai_signature(image_path)
 def aigc_label(image_path: Path) -> dict[str, str] | None:
-    """Parse a China TC260 ``<TC260:AIGC>`` AI-labeling block, if present.
+    """Parse a China TC260 AI-labeling block, if present.
+    Two serializations are recognized:
+    - a PNG ``tEXt``/``iTXt`` chunk keyed ``AIGC`` carrying the raw JSON object
+      (as written by Doubao / ByteDance), read via PIL; and
+    - an XMP ``<TC260:AIGC>{...}</TC260:AIGC>`` block (HTML-entity encoded text),
+      found by a container-agnostic raw-byte scan (PNG/JPEG/WebP alike).
     Returns the decoded JSON (e.g. ``{"Label": "1", "ContentProducer": ...}``)
-    or None. The block is XMP text (HTML-entity encoded), so it is found by a
-    container-agnostic raw-byte scan and works for PNG/JPEG/WebP alike.
+    or None. The PNG-chunk key ``AIGC`` is generic, so a JSON object there is
+    accepted only if it carries at least one known TC260 field (``_TC260_FIELDS``);
+    the namespaced XMP element is unambiguous, so any JSON object is accepted.
     """
     import html
     import json
     import re
+    from typing import cast
+    def _parse(text: str, *, require_tc260_field: bool) -> dict[str, str] | None:
+        try:
+            parsed = json.loads(text)
+        except ValueError:
+            return None
+        if not isinstance(parsed, dict):
+            return None
+        fields = {str(k): str(v) for k, v in cast("dict[object, object]", parsed).items()}
+        if require_tc260_field and not (_TC260_FIELDS & fields.keys()):
+            return None
+        return fields
+    # PNG tEXt chunk keyed "AIGC" with raw JSON (Doubao and other China gens).
+    # The key is generic, so require a TC260 field to avoid a false positive.
+    try:
+        from PIL import Image
+        with Image.open(image_path) as img:
+            value = img.info.get("AIGC")
+    except Exception as exc:
+        logger.debug("PIL could not open %s for AIGC chunk scan: %s", image_path, exc)
+        value = None
+    if isinstance(value, str) and (result := _parse(value, require_tc260_field=True)):
+        return result
+    # XMP <TC260:AIGC>{...}</TC260:AIGC> block (namespaced element, unambiguous).
     data = scan_head(image_path)
     match = re.search(rb"<TC260:AIGC>(.*?)</TC260:AIGC>", data, re.DOTALL)
     if not match:
         return None
-    raw = html.unescape(match.group(1).decode("utf-8", "replace"))
+    return _parse(html.unescape(match.group(1).decode("utf-8", "replace")), require_tc260_field=False)
+def huggingface_job(image_path: Path) -> str | None:
+    """Return the HuggingFace job id if the image carries an ``hf-job-id`` PNG
+    text chunk, else None.
+    HuggingFace-hosted GPU jobs (Jobs / Spaces) stamp generated PNGs with an
+    ``hf-job-id`` ``tEXt`` chunk holding the job's UUID. It identifies the
+    *hosting job*, not a specific model, and is most commonly seen on diffusion-
+    generation output -- a medium-confidence AI signal, not proof of AI pixels
+    on its own.
+    """
     try:
-        parsed = json.loads(raw)
-    except ValueError:
+        from PIL import Image
+        with Image.open(image_path) as img:
+            value = img.info.get(_HF_JOB_KEY)
+    except Exception as exc:
+        logger.debug("PIL could not open %s for hf-job-id scan: %s", image_path, exc)
         return None
-    return {str(k): str(v) for k, v in parsed.items()} if isinstance(parsed, dict) else None
+    if isinstance(value, str) and value.strip():
+        return value.strip()
+    return None
 def iptc_ai_system(image_path: Path) -> str | None:
@@ -500,6 +580,10 @@ def get_ai_metadata(image_path: Path) -> dict[str, str]:
     # IPTC 2025.1 AI-disclosure XMP fields (Iptc4xmpExt:AISystemUsed etc.).
     if system := iptc_ai_system(image_path):
         result.setdefault("ai_system", f"IPTC 2025.1 AI disclosure ({system})")
+    # HuggingFace-hosted job marker (hf-job-id PNG text chunk).
+    if job := huggingface_job(image_path):
+        result.setdefault("huggingface_job", f"HuggingFace-hosted job ({job})")
     return result
+72
@@ -201,6 +201,78 @@ class TestIdentifyLocalParams:
         assert r.signals == []
+# ── China TC260 AIGC label as a PNG text chunk (Doubao) ─────────────
+class TestIdentifyAigcPngChunk:
+    """The raw-JSON ``AIGC`` PNG chunk (no namespaced XMP marker) is a high-
+    confidence AI verdict, same as the XMP form."""
+    def _aigc_chunk_png(self, tmp_path: Path) -> Path:
+        from PIL import Image
+        from PIL.PngImagePlugin import PngInfo
+        p = tmp_path / "doubao_chunk.png"
+        pnginfo = PngInfo()
+        pnginfo.add_text("AIGC", json.dumps({"Label": "1", "ContentProducer": "doubao"}))
+        Image.new("RGB", (32, 32)).save(p, pnginfo=pnginfo)
+        return p
+    def test_png_chunk_detected_high(self, tmp_path: Path):
+        r = identify(self._aigc_chunk_png(tmp_path), check_visible=False)
+        assert r.is_ai_generated is True
+        assert r.confidence == "high"
+        assert r.platform is not None
+        assert "AIGC" in r.platform
+        signal = next(s for s in r.signals if s.name == "aigc")
+        assert "doubao" in signal.detail
+# ── HuggingFace-hosted job marker (medium confidence) ───────────────
+class TestIdentifyHuggingFaceJob:
+    """The hf-job-id chunk lifts an otherwise-Unknown verdict to a tentative
+    (medium) AI, never overriding a high-confidence metadata signal."""
+    def _hf_png(self, tmp_path: Path) -> Path:
+        from PIL import Image
+        from PIL.PngImagePlugin import PngInfo
+        p = tmp_path / "hfjob.png"
+        pnginfo = PngInfo()
+        pnginfo.add_text("hf-job-id", "ec8380a6-2091-423a-b835-209420f99ee1")
+        Image.new("RGB", (32, 32)).save(p, pnginfo=pnginfo)
+        return p
+    def test_hf_job_promotes_to_medium(self, tmp_path: Path):
+        r = identify(self._hf_png(tmp_path), check_visible=False)
+        assert r.is_ai_generated is True
+        assert r.confidence == "medium"
+        assert r.platform is not None
+        assert "HuggingFace" in r.platform
+        signal = next(s for s in r.signals if s.name == "hf_job")
+        assert signal.confidence == "medium"
+    def test_hf_job_caveat_present(self, tmp_path: Path):
+        r = identify(self._hf_png(tmp_path), check_visible=False)
+        assert any("hf-job-id" in c for c in r.caveats)
+    def test_metadata_keeps_high_even_with_hf_job(self, tmp_png_with_ai_metadata: Path):
+        # A high-confidence metadata verdict is not downgraded by an hf-job hit.
+        from PIL import Image
+        from PIL.PngImagePlugin import PngInfo
+        img = Image.open(tmp_png_with_ai_metadata)
+        pnginfo = PngInfo()
+        for k, v in img.text.items():
+            pnginfo.add_text(k, v)
+        pnginfo.add_text("hf-job-id", "ec8380a6-2091-423a-b835-209420f99ee1")
+        img.save(tmp_png_with_ai_metadata, pnginfo=pnginfo)
+        r = identify(tmp_png_with_ai_metadata, check_visible=False)
+        assert r.confidence == "high"
 # ── Visible-sparkle fallback (mocked detector) ──────────────────────
+82
@@ -554,6 +554,88 @@ class TestAIGCLabel:
         assert "aigc_label" in meta
         assert "TC260" in meta["aigc_label"]
+    def _aigc_chunk_png(self, tmp_path: Path, producer: str = "doubao") -> Path:
+        """Doubao writes the TC260 object as a PNG ``tEXt`` chunk keyed ``AIGC``
+        with raw JSON (no XMP, no namespaced marker)."""
+        import json
+        p = tmp_path / "doubao_chunk.png"
+        pnginfo = PngInfo()
+        pnginfo.add_text(
+            "AIGC",
+            json.dumps({"Label": "1", "ContentProducer": producer, "ProduceID": "abc123"}),
+        )
+        Image.new("RGB", (32, 32)).save(p, pnginfo=pnginfo)
+        return p
+    def test_parses_png_text_chunk_form(self, tmp_path: Path):
+        from remove_ai_watermarks.metadata import aigc_label
+        info = aigc_label(self._aigc_chunk_png(tmp_path))
+        assert info is not None
+        assert info["Label"] == "1"
+        assert info["ContentProducer"] == "doubao"
+    def test_png_chunk_without_tc260_field_ignored(self, tmp_path: Path):
+        """A generic ``AIGC`` chunk with no TC260 field must not false-positive."""
+        import json
+        from remove_ai_watermarks.metadata import aigc_label
+        p = tmp_path / "unrelated.png"
+        pnginfo = PngInfo()
+        pnginfo.add_text("AIGC", json.dumps({"unrelated": "value"}))
+        Image.new("RGB", (32, 32)).save(p, pnginfo=pnginfo)
+        assert aigc_label(p) is None
+    def test_has_ai_metadata_detects_png_chunk_form(self, tmp_path: Path):
+        assert has_ai_metadata(self._aigc_chunk_png(tmp_path))
+    def test_remove_strips_png_chunk_form(self, tmp_path: Path):
+        from remove_ai_watermarks.metadata import aigc_label, remove_ai_metadata
+        out = tmp_path / "clean.png"
+        remove_ai_metadata(self._aigc_chunk_png(tmp_path), out)
+        assert aigc_label(out) is None
+        assert not has_ai_metadata(out)
+class TestHuggingFaceJob:
+    """HuggingFace-hosted job marker (``hf-job-id`` PNG text chunk)."""
+    def _hf_png(self, tmp_path: Path, job_id: str = "ec8380a6-2091-423a-b835-209420f99ee1") -> Path:
+        p = tmp_path / "hfjob.png"
+        pnginfo = PngInfo()
+        pnginfo.add_text("hf-job-id", job_id)
+        Image.new("RGB", (32, 32)).save(p, pnginfo=pnginfo)
+        return p
+    def test_returns_job_id(self, tmp_path: Path):
+        from remove_ai_watermarks.metadata import huggingface_job
+        assert huggingface_job(self._hf_png(tmp_path)) == "ec8380a6-2091-423a-b835-209420f99ee1"
+    def test_none_when_absent(self, tmp_clean_png):
+        from remove_ai_watermarks.metadata import huggingface_job
+        assert huggingface_job(tmp_clean_png) is None
+    def test_has_ai_metadata_detects_hf_job(self, tmp_path: Path):
+        assert has_ai_metadata(self._hf_png(tmp_path))
+    def test_get_ai_metadata_surfaces_hf_job(self, tmp_path: Path):
+        meta = get_ai_metadata(self._hf_png(tmp_path))
+        assert "huggingface_job" in meta
+        assert "ec8380a6" in meta["huggingface_job"]
+    def test_remove_strips_hf_job(self, tmp_path: Path):
+        from remove_ai_watermarks.metadata import huggingface_job, remove_ai_metadata
+        out = tmp_path / "clean.png"
+        remove_ai_metadata(self._hf_png(tmp_path), out)
+        assert huggingface_job(out) is None
+        assert not has_ai_metadata(out)
 @pytest.mark.skipif(not (SAMPLES_DIR / "doubao-1.png").exists(), reason="doubao sample not present")
 class TestAIGCRealSample: