commit 453552bf54 (parent 0a30a1e973)
Author: jiaxiaojunQAQ
Date: 2025-12-03 01:11:17 +08:00

3 changed files with 66 additions and 34 deletions

README.md (100 changes)

@@ -1,40 +1,72 @@
<p align="center">
<img src="assets/home.png" alt="OmniSafeBench-MM"/>
</p>
<div align="center">
# 🗡️ OmniSafeBench-MM: A Unified Benchmark and Toolbox for Multimodal Jailbreak Attack-Defense Evaluation 🛡️
</div>
## 📰 News 📰
* ***[2024/02/06] 🚀 [Initial release of OmniSafeBench-MM with 13 attack methods and 15 defense strategies]()*** 🚀
## 🗡️ What is OmniSafeBench-MM 🛡️
OmniSafeBench-MM is a unified benchmark and open-source toolbox for evaluating multimodal jailbreak attacks and defenses in Multimodal Large Language Models (MLLMs). It integrates a large-scale dataset spanning 89 major risk domains and 50 fine-grained categories, supports three real-world prompt types (consultative, imperative, declarative), and implements 13 representative attack methods and 15 defense strategies in a modular pipeline. Beyond the traditional attack success rate (ASR), it introduces a three-dimensional evaluation protocol measuring harmfulness, intent alignment, and response detail, enabling fine-grained safety-utility analysis. Tested across 18 open-source and closed-source MLLMs, OmniSafeBench-MM provides a comprehensive, reproducible, and extensible platform for benchmarking multimodal safety.
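The modular pipeline pairs each attack with a target model, an optional defense, and a judge. Below is a minimal, self-contained sketch of that flow; every class and method name (`Attack`, `Defense`, `run`, `pre`, `post`, ...) is an illustrative assumption, not the toolbox's actual API.

```python
# Conceptual sketch of the attack -> (defense) -> model -> judge flow.
# All names are hypothetical; consult the repository for the real interface.
from dataclasses import dataclass

@dataclass
class Sample:
    prompt: str                     # harmful instruction from the benchmark
    image_path: str | None = None   # optional visual carrier

class Attack:
    """Turns a benchmark sample into an adversarial (text, image) pair."""
    def run(self, sample: Sample) -> Sample:
        raise NotImplementedError

class Defense:
    """No-op base: subclasses pre-process inputs and/or post-process outputs."""
    def pre(self, sample: Sample) -> Sample:
        return sample
    def post(self, response: str) -> str:
        return response

def evaluate(attack: Attack, defense: Defense, model, judge, dataset):
    """model: Callable[[Sample], str]; judge: Callable[[str, str], object]."""
    results = []
    for sample in dataset:
        adv = defense.pre(attack.run(sample))           # attack, then input guard
        response = defense.post(model(adv))             # model call, then output guard
        results.append(judge(sample.prompt, response))  # three-dimensional scores
    return results
```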
## 🌐 Overview 🌐
<div align="center">
<img src="assets/overview.png" alt="HarmBench Evaluation Pipeline"/>
</div>
**Overview of OmniSafeBench-MM**.
The benchmark unifies multimodal jailbreak attack-defense evaluation, 13 attack and 15 defense methods, and a three-dimensional scoring protocol measuring harmfulness, alignment, and detail.
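A minimal sketch of how the three-dimensional scores might be aggregated per sample, assuming each judge call returns harmfulness, intent alignment, and detail scores; the 1-5 scale, field names, and the `unsafe_threshold` cutoff below are assumptions for illustration, not the benchmark's actual protocol.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class TriScore:
    harmfulness: float       # harm in the response content (assumed 1-5 scale)
    intent_alignment: float  # how closely the response follows the harmful intent
    detail: float            # how detailed / actionable the response is

def aggregate(scores: list[TriScore], unsafe_threshold: float = 3.0) -> dict:
    """Collapse per-sample judge scores into benchmark-level metrics.

    unsafe_threshold is a hypothetical cutoff: a response counts as a
    successful jailbreak only if every dimension exceeds it.
    """
    if not scores:
        raise ValueError("no judge scores to aggregate")
    success = [s for s in scores
               if min(s.harmfulness, s.intent_alignment, s.detail) > unsafe_threshold]
    return {
        "asr": len(success) / len(scores),  # ASR-style headline number
        "mean_harmfulness": mean(s.harmfulness for s in scores),
        "mean_intent_alignment": mean(s.intent_alignment for s in scores),
        "mean_detail": mean(s.detail for s in scores),
    }
```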
## 🗡️ Integrated Attack Methods
| Name | Title | Paper | Code | Classification |
|------|-----------------------------------------------------------------------------------------------------------------------------|-------|------|----------------|
| FigStep / FigStep-Pro | FigStep: Jailbreaking Large Vision-Language Models via Typographic Visual Prompts | [link](https://arxiv.org/abs/2311.05608) | [link](https://github.com/ThuCCSLab/FigStep) | Black-box — Structured visual carriers |
| QR-Attack (MM-SafetyBench) | MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models | [link](https://arxiv.org/abs/2311.17600) | [link](https://github.com/isXinLiu/MM-SafetyBench) | Black-box — Structured visual carriers |
| MML | Jailbreak Large Vision-Language Models Through Multi-Modal Linkage | [link](https://aclanthology.org/2025.acl-long.74/) | [link](https://github.com/wangyu-ovo/MML) | Black-box — Structured visual carriers |
| CS-DJ | Distraction is All You Need for Multimodal Large Language Model Jailbreaking | [link](https://arxiv.org/abs/2502.10794) | [link](https://github.com/TeamPigeonLab/CS-DJ) | Black-box — OOD (attention / distribution manipulation) |
| SI-Attack | Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency | [link](https://arxiv.org/abs/2501.04931) | [link](https://github.com/zhaoshiji123/SI-Attack) | Black-box — OOD (shuffle / attention inconsistency) |
| JOOD | Playing the Fool: Jailbreaking LLMs and Multimodal LLMs with Out-of-Distribution Strategy | [link](https://arxiv.org/abs/2503.20823) | [link](https://github.com/naver-ai/JOOD) | Black-box — OOD (OOD strategy) |
| HIMRD | Heuristic-Induced Multimodal Risk Distribution (HIMRD) Jailbreak Attack | [link](https://arxiv.org/abs/2412.05934) | [link](https://github.com/MaTengSYSU/HIMRD-jailbreak) | Black-box — OOD / risk distribution |
| HADES | Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models | [link](https://arxiv.org/abs/2403.09792) | [link](https://github.com/AoiDragon/HADES) | Black-box — Structured visual carriers |
| BAP | Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt (BAP) | [link](https://arxiv.org/abs/2406.04031) | [link](https://github.com/NY1024/BAP-Jailbreak-Vision-Language-Models-via-Bi-Modal-Adversarial-Prompt) | White-box — Cross-modal |
| visual_adv | Visual Adversarial Examples Jailbreak Aligned Large Language Models | [link](https://ojs.aaai.org/index.php/AAAI/article/view/30150) | [link](https://github.com/Unispac/Visual-Adversarial-Examples-Jailbreak-Large-Language-Models) | White-box — Single-modal |
| VisCRA | VisCRA: A Visual Chain Reasoning Attack for Jailbreaking Multimodal Large Language Models | [link](https://arxiv.org/abs/2505.19684) | [link](https://github.com/DyMessi/VisCRA) | Black-box — Structured visual carriers |
| UMK | White-box Multimodal Jailbreaks Against Large Vision-Language Models (Universal Master Key, UMK) | [link](https://arxiv.org/abs/2405.17894) | [link](https://github.com/roywang021/UMK) | White-box — Cross-modal |
| PBI-Attack | Prior-Guided Bimodal Interactive Black-Box Jailbreak Attack for Toxicity Maximization (PBI-Attack) | [link](https://aclanthology.org/2025.emnlp-main.32.pdf) | [link](https://github.com/Rosy0912/PBI-Attack) | Black-box — Query optimization & transfer |
| ImgJP / DeltaJP | Jailbreaking Attack against Multimodal Large Language Model (imgJP / deltaJP) | [link](https://arxiv.org/abs/2402.02309) | [link](https://github.com/abc03570128/Jailbreaking-Attack-against-Multimodal-Large-Language-Model) | White-box — Single-modal |
| JPS | JPS: Jailbreak Multimodal Large Language Models with Collaborative Visual Perturbation and Textual Steering | [link](https://arxiv.org/abs/2508.05087) | [link](https://github.com/thu-coai/JPS) | White-box — Cross-modal |
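All of the above plug into the toolbox through a common attack interface. The sketch below shows one way such a plug-in registry could look, with a toy stand-in for the structured-visual-carrier idea behind FigStep; the registry pattern and all names (`register_attack`, `FigStepToyAttack`) are assumptions about how a modular toolbox might expose these methods, not the repository's actual API or the real FigStep implementation.

```python
# Hypothetical plug-in registry for attack methods; names are illustrative.
from PIL import Image, ImageDraw

ATTACK_REGISTRY: dict[str, type] = {}

def register_attack(name: str):
    """Class decorator that makes an attack discoverable by config name."""
    def wrap(cls):
        ATTACK_REGISTRY[name] = cls
        return cls
    return wrap

@register_attack("figstep_toy")
class FigStepToyAttack:
    """Toy structured-visual-carrier attack: render the instruction as a
    typographic image and swap the text prompt for a benign pointer to it."""
    def run(self, prompt: str) -> tuple[str, Image.Image]:
        img = Image.new("RGB", (512, 256), "white")
        ImageDraw.Draw(img).text((16, 16), prompt, fill="black")
        return "Carry out the steps listed in the image.", img

attack = ATTACK_REGISTRY["figstep_toy"]()       # look up an attack by name
text, image = attack.run("Steps to do X:\n1.\n2.\n3.")
image.save("figstep_toy.png")
```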
## 🛡️ Integrated Defense Methods
| Name | Title | Venue | Paper | Code | Classification |
|---|--------------------------------------------------------------------------------------------------------------------|---|---|---|---|
| JailGuard | JailGuard: A Universal Detection Framework for Prompt-based Attacks on LLM Systems | arXiv | [link](https://arxiv.org/abs/2312.10766) | [link](https://github.com/shiningrain/JailGuard) | Off-Model — Input pre-processing |
| MLLM-Protector | MLLM-Protector: Ensuring MLLM's Safety without Hurting Performance | EMNLP 2024 | [link](https://arxiv.org/abs/2401.02906) | [link](https://github.com/pipilurj/MLLM-protector) | Off-Model — Output post-processing |
| ECSO | Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation | ECCV 2024 | [link](https://arxiv.org/abs/2403.09572) | [link](https://gyhdog99.github.io/projects/ecso/) | Off-Model — Input pre-processing |
| ShieldLM | ShieldLM: Empowering LLMs as Aligned, Customizable and Explainable Safety Detectors | EMNLP 2024 | [link](https://arxiv.org/abs/2402.16444) | [link](https://github.com/thu-coai/ShieldLM) | Off-Model — Output post-processing |
| AdaShield | AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting | ECCV 2024 | [link](https://arxiv.org/abs/2403.09513) | [link](https://github.com/rain305f/AdaShield) | Off-Model — Input pre-processing |
| Uniguard | UNIGUARD: Towards Universal Safety Guardrails for Jailbreak Attacks on Multimodal Large Language Models | ECCV 2024 | [link](https://arxiv.org/abs/2411.01703) | [link](https://anonymous.4open.science/r/UniGuard/README.md) | Off-Model — Input pre-processing |
| DPS | Defending LVLMs Against Vision Attacks Through Partial-Perception Supervision | ICML 2025 | [link](https://arxiv.org/abs/2412.12722) | [link](https://github.com/tools-only/DPS) | Off-Model — Input pre-processing |
| CIDER | Cross-modality Information Check for Detecting Jailbreaking in Multimodal Large Language Models | EMNLP 2024 | [link](https://arxiv.org/abs/2407.21659) | [link](https://github.com/PandragonXIII/CIDER) | Off-Model — Input pre-processing |
| GuardReasoner-VL | GuardReasoner-VL: Safeguarding VLMs via Reinforced Reasoning | ICML 2025 | [link](https://arxiv.org/abs/2505.11049) | [link](https://github.com/yueliu1999/GuardReasoner-VL) | Off-Model — Input pre-processing |
| Llama-Guard-4 | Llama Guard 4 | Model Card | [link](https://www.llama.com/docs/model-cards-and-prompt-formats/llama-guard-4/) | [link](https://huggingface.co/meta-llama/Llama-Guard-4-12B) | Off-Model — Input pre-processing |
| QGuard | QGuard: Question-based Zero-shot Guard for Multi-modal LLM Safety | arXiv | [link](https://arxiv.org/abs/2506.12299) | [link](https://github.com/taegyeong-lee/QGuard-Question-based-Zero-shot-Guard-for-Multi-modal-LLM-Safety) | Off-Model — Input pre-processing |
| LlavaGuard | LlavaGuard: An Open VLM-based Framework for Safeguarding Vision Datasets and Models | ICML 2025 | [link](https://arxiv.org/abs/2406.05113) | [link](https://github.com/ml-research/LlavaGuard) | Off-Model — Input pre-processing |
| Llama-Guard-3 | Llama Guard 3 | Model Card | [link](https://www.llama.com/docs/model-cards-and-prompt-formats/llama-guard-3/) | [link](https://huggingface.co/meta-llama/Llama-Guard-3-8B) | Off-Model — Output post-processing |
| HiddenDetect | HiddenDetect: Detecting Jailbreak Attacks against Multimodal Large Language Models via Monitoring Hidden States | ACL 2025 | [link](https://arxiv.org/abs/2502.14744) | [link](https://github.com/leigest519/HiddenDetect) | On-Model — Inference process intervention |
| CoCA | CoCA: Regaining Safety-awareness of Multimodal Large Language Models with Constitutional Calibration | COLM 2024 | [link](https://arxiv.org/abs/2409.11365) | N/A | On-Model — Inference process intervention |
| VLGuard | Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models | ICML 2024 | [link](https://arxiv.org/abs/2402.02207) | [link](https://github.com/ys-zong/VLGuard) | On-Model — Intrinsic model alignment |
> More methods are coming soon!
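The Classification column maps onto where a defense hooks into the pipeline. A minimal sketch follows, under the assumption that Off-Model defenses compose as wrappers around a model callable; the wrapper names and checks are hypothetical.

```python
# Hypothetical composition of Off-Model defenses around a model callable.
from typing import Callable

ModelFn = Callable[[str], str]  # (prompt) -> response; images omitted for brevity

def with_input_guard(model: ModelFn, check: Callable[[str], bool]) -> ModelFn:
    """Off-Model, input pre-processing: refuse before the model ever runs."""
    def guarded(prompt: str) -> str:
        if not check(prompt):
            return "I can't help with that."
        return model(prompt)
    return guarded

def with_output_guard(model: ModelFn, check: Callable[[str], bool]) -> ModelFn:
    """Off-Model, output post-processing: filter the model's response."""
    def guarded(prompt: str) -> str:
        response = model(prompt)
        return response if check(response) else "I can't help with that."
    return guarded

# Usage: stack guards around any model callable.
base_model: ModelFn = lambda p: f"(model response to: {p})"
safe_model = with_output_guard(
    with_input_guard(base_model, lambda p: "bomb" not in p.lower()),
    lambda r: len(r) < 10_000,
)
print(safe_model("How do clouds form?"))
```

On-Model defenses (e.g., HiddenDetect's hidden-state monitoring or VLGuard's safety fine-tuning) modify the model or its inference process itself, so they do not fit this wrapper pattern.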

assets/home.png (new binary file, 691 KiB)

assets/overview.png (new binary file, 497 KiB)