🗡️ OmniSafeBench-MM: A Unified Benchmark and Toolbox for Multimodal Jailbreak Attack–Defense Evaluation 🛡️

📰 News 📰

[2025/12/06] 🚀 Initial release of OmniSafeBench-MM with 13 attack methods and 15 defense strategies 🚀

🗡️ What is OmniSafeBench-MM 🛡️

OmniSafeBench-MM is a unified benchmark and open-source toolbox for evaluating multimodal jailbreak attacks and defenses in Large Vision–Language Models (MLLMs). It integrates a large-scale dataset spanning 8–9 major risk domains and 50 fine-grained categories, supports three real-world prompt types (consultative, imperative, declarative), and implements 13 representative attack methods and 15 defense strategies in a modular pipeline. Beyond traditional ASR, it introduces a three-dimensional evaluation protocol measuring harmfulness, intent alignment, and response detail, enabling fine-grained safety–utility analysis. Tested across 18 open-source and closed-source MLLMs, OmniSafeBench-MM provides a comprehensive, reproducible, and extensible platform for benchmarking multimodal safety.

🌐 Overview 🌐

Overview of OmniSafeBench-MM. The benchmark unifies multi-modal jailbreak attack–defense evaluation, 13 attack and 15 defense methods, and a three-dimensional scoring protocol measuring harmfulness, alignment, and detail.

🗡️ Integrated Attack Methods

Name	Title	Venue	Paper	Code
FigStep / FigStep-Pro	FigStep: Jailbreaking Large Vision-Language Models via Typographic Visual Prompts	AAAI 2025	https://arxiv.org/abs/2311.05608	https://github.com/ThuCCSLab/FigStep
QR-Attack (MM-SafetyBench)	MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models	ECCV 2024	https://arxiv.org/abs/2311.17600	https://github.com/isXinLiu/MM-SafetyBench
MML	Jailbreak Large Vision-Language Models Through Multi-Modal Linkage	ACL 2025	https://aclanthology.org/2025.acl-long.74/	https://github.com/wangyu-ovo/MML
CS-DJ	Distraction is All You Need for Multimodal Large Language Model Jailbreaking	CVPR 2025	https://arxiv.org/abs/2502.10794	https://github.com/TeamPigeonLab/CS-DJ
SI-Attack	Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency	ICCV 2025	https://arxiv.org/abs/2501.04931	https://github.com/zhaoshiji123/SI-Attack
JOOD	Playing the Fool: Jailbreaking LLMs and Multimodal LLMs with Out-of-Distribution Strategy	CVPR 2025	https://arxiv.org/abs/2503.20823	https://github.com/naver-ai/JOOD
HIMRD	Heuristic-Induced Multimodal Risk Distribution (HIMRD) Jailbreak Attack	ICCV 2025	https://arxiv.org/abs/2412.05934	https://github.com/MaTengSYSU/HIMRD-jailbreak
HADES	Images are Achilles’ Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking MLLMs	ECCV 2024	https://arxiv.org/abs/2403.09792	https://github.com/AoiDragon/HADES
BAP	Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt (BAP)	TIFS 2025	https://arxiv.org/abs/2406.04031	https://github.com/NY1024/BAP-Jailbreak-Vision-Language-Models-via-Bi-Modal-Adversarial-Prompt
visual_adv	Visual Adversarial Examples Jailbreak Aligned Large Language Models	AAAI 2024	https://ojs.aaai.org/index.php/AAAI/article/view/30150	https://github.com/Unispac/Visual-Adversarial-Examples-Jailbreak-Large-Language-Models
VisCRA	VisCRA: A Visual Chain Reasoning Attack for Jailbreaking Multimodal Large Language Models	EMNLP 2025	https://arxiv.org/abs/2505.19684	https://github.com/DyMessi/VisCRA
UMK	White-box Multimodal Jailbreaks Against Large Vision-Language Models (Universal Master Key)	ACMMM 2024	https://arxiv.org/abs/2405.17894	https://github.com/roywang021/UMK
PBI-Attack	Prior-Guided Bimodal Interactive Black-Box Jailbreak Attack for Toxicity Maximization	EMNLP 2025	https://aclanthology.org/2025.emnlp-main.32.pdf	https://github.com/Rosy0912/PBI-Attack
ImgJP / DeltaJP	Jailbreaking Attack against Multimodal Large Language Models	arXiv 2024	https://arxiv.org/abs/2402.02309	https://github.com/abc03570128/Jailbreaking-Attack-against-Multimodal-Large-Language-Model
JPS	JPS: Jailbreak Multimodal Large Language Models with Collaborative Visual Perturbation and Textual Steering	ACMMM 2025	https://arxiv.org/abs/2508.05087	https://github.com/thu-coai/JPS

🛡️Integrated Defense Methods

Name	Title	Venue	Paper	Code
JailGuard	JailGuard: A Universal Detection Framework for Prompt-based Attacks on LLM Systems	TOSEM2025	link	link
MLLM-Protector	MLLM-Protector: Ensuring MLLM's Safety without Hurting Performance	EMNLP2024	link	link
ECSO	Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation	ECCV2024	link	link
ShieldLM	ShieldLM: Empowering LLMs as Aligned, Customizable and Explainable Safety Detectors	EMNLP2024	link	link
AdaShield	AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting	ECCV2024	link	link
Uniguard	UNIGUARD: Towards Universal Safety Guardrails for Jailbreak Attacks on Multimodal Large Language Models	ECCV2024	link	link
DPS	Defending LVLMs Against Vision Attacks Through Partial-Perception Supervision	ICML2025	link	link
CIDER	Cross-modality Information Check for Detecting Jailbreaking in Multimodal Large Language Models	EMNLP2024	link	link
GuardReasoner-VL	GuardReasoner-VL: Safeguarding VLMs via Reinforced Reasoning	ICML2025	link	link
Llama-Guard-4	Llama Guard 4	Model Card	link	link
QGuard	QGuard: Question-based Zero-shot Guard for Multi-modal LLM Safety	ArXiv	link	link
LlavaGuard	LlavaGuard: An Open VLM-based Framework for Safeguarding Vision Datasets and Models	ICML2025	link	link
Llama-Guard-3	Llama Guard 3	Model Card	link	link
HiddenDetect	HiddenDetect: Detecting Jailbreak Attacks against Multimodal Large Language Models via Monitoring Hidden States	ACL2025	link	link
CoCA	CoCA: Regaining Safety-awareness of Multimodal Large Language Models with Constitutional Calibration	COLM2024	link	-
VLGuard	Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models	ICML2024	link	link

More methods are coming soon!!

12 KiB Raw Blame History Unescape Escape