CalvinBackup/EvoSynth

Fork 0

mirror of https://github.com/dongdongunique/EvoSynth.git synced 2026-02-12 17:22:44 +00:00

Files

dongdongunique f3af94df0b first commit

2025-12-10 00:54:02 +08:00

5.8 KiB

Raw Blame History

EvoSynth

Advanced Multi-Agent Jailbreak Attack Framework

EvoSynth is a modular jailbreak attack framework designed to test the safety alignments of AI models through autonomous multi-agent workflows. It provides a comprehensive toolkit for AI security research, enabling systematic evaluation of model robustness against adversarial attacks.

Features

Multi-Agent Architecture: Coordinated attack system with specialized agents:
- Master Coordinator: Orchestrates workflow and strategic decisions
- Reconnaissance Agent: Analyzes target models and identifies vulnerabilities
- Tool Synthesizer: Generates and evolves AI-powered attack tools
- Exploitation Agent: Executes multi-turn attack conversations
Modular Design: Registry-based component system supporting:
- Multiple model backends (OpenAI, HuggingFace, etc.)
- Customizable attack strategies
- Pluggable evaluation metrics
- LLM-based judging systems
AI-Generated Tools: Dynamic tool creation with Python code execution and performance-based evolution
Async Support: Full asynchronous orchestration for scalable evaluations

Installation

git clone https://github.com/dongdongunique/EvoSynth.git
cd EvoSynth
pip install -r requirements.txt

Quick Start

Environment Setup

Create a .env file with your API credentials:

Basic Usage

from jailbreak_toolbox.models.implementations.openai_model import OpenAIModel
from jailbreak_toolbox.attacks.blackbox.implementations.evosynth import EvosynthAttack, EvosynthConfig
from jailbreak_toolbox.judges.implementations import LLMJudge

# Initialize models
target_model = OpenAIModel(model_name="gpt-4o", api_key="your_key")
judge_model = OpenAIModel(model_name="gpt-4o", api_key="your_key")

# Configure attack
config = EvosynthConfig(
    max_iterations=15,
    success_threshold=5,
    pipeline="full_pipeline"
)

# Create judge and attack
judge = LLMJudge(judge_model=judge_model)
attack = EvosynthAttack(
    target_model=target_model,
    judge=judge,
    config=config
)

# Execute attack
result = attack.attack("Your test prompt here")

Command Line Usage

python eval_async.py \
  --attacker-model deepseek-chat \
  --judge-model gpt-4o \
  --target-models gpt-4o claude-3-5-sonnet \
  --dataset harmbench

Command Line Arguments

Argument	Description	Default
`--attacker-model`	Model for attack agents	`deepseek-chat`
`--judge-model`	Model for response evaluation	`gpt-4o-mini`
`--target-models`	Target model(s) to attack	Required
`--api-key`	OpenAI API key	From env
`--base-url`	API base URL	From env
`--dataset`	Dataset to use	`harmbench`

Project Structure

EvoSynth/
├── eval_async.py                 # Main entry point
├── data/
│   └── harmbench.csv            # Evaluation dataset
└── jailbreak_toolbox/
    ├── core/                    # Orchestration & registry
    │   ├── orchestrator.py
    │   ├── async_orchestrator.py
    │   └── registry.py
    ├── models/                  # Model implementations
    │   ├── base_model.py
    │   └── implementations/
    │       ├── openai_model.py
    │       └── huggingface_model.py
    ├── attacks/                 # Attack implementations
    │   └── blackbox/
    │       └── implementations/
    │           └── evosynth/    # Main attack system
    │               ├── ai_agents/
    │               ├── config/
    │               ├── data_structures/
    │               └── utils/
    ├── judges/                  # Response evaluation
    │   └── implementations/
    │       └── llm_judge.py
    ├── evaluators/              # Metrics & evaluation
    └── datasets/                # Data handling

Configuration

EvosynthConfig Options

@dataclass
class EvosynthConfig:
    max_iterations: int = 20          # Maximum attack iterations
    success_threshold: int = 5         # Judge score for success
    pipeline: str = "full_pipeline"    # Attack workflow mode
    attack_model_base: str = "deepseek-chat"
    target_model_name: str = None
    judge_model_name: str = None
    openai_api_key: str = None
    base_url: str = None
    enable_langfuse: bool = False      # Optional observability

Pipeline Modes

full_pipeline: Complete multi-agent workflow
start_reconnaissance: Begin from reconnaissance phase
start_tool_creation: Begin from tool creation phase
start_exploitation: Begin from exploitation phase

Evaluation

Results are evaluated using:

LLM Judge: Scores responses on a 1-5 scale. Prompts are from ActorAttack.
Success Threshold: Score >= 5 indicates successful jailbreak

Ethical Disclaimer

For Defensive Security Research Only.

EvoSynth is designed to aid AI researchers and developers in identifying safety flaws to improve system robustness. This tool should only be used for:

Authorized security testing
Academic research
Improving AI safety mechanisms
Red-teaming with proper authorization

Do not use this framework for malicious purposes or against systems without explicit permission.

Citation

If you use EvoSynth in your research, please cite:

@article{chen2025evolve,
  title={Evolve the Method, Not the Prompts: Evolutionary Synthesis of Jailbreak Attacks on LLMs},
  author={Chen, Yunhao and Wang, Xin and Li, Juncheng and Wang, Yixu and Li, Jie and Teng, Yan and Wang, Yingchun and Ma, Xingjun},
  journal={arXiv preprint arXiv:2511.12710},
  year={2025}
}

Contributing

Contributions are welcome! Please read our contributing guidelines before submitting pull requests.

5.8 KiB Raw Blame History