mirror of https://github.com/AI-secure/UDora.git synced 2026-02-13 05:13:11 +00:00

Files

javyduck 0d259fc3aa init_v1

2025-06-24 01:50:12 +00:00

4.2 KiB

Raw Permalink Blame History

UDora Architecture Documentation

This document describes the modular architecture of UDora after refactoring for better maintainability and extensibility.

Overview

The UDora codebase has been decomposed into specialized modules, each handling a specific aspect of the attack algorithm. This modular design makes the code more:

Readable: Each module has a clear, focused responsibility
Maintainable: Changes to one component don't affect others
Extensible: New features can be added by extending specific modules
Testable: Individual components can be tested in isolation

Module Structure

1. `attack.py` - Core Attack Algorithm

Responsibilities:

Main UDora class implementation
Attack orchestration and optimization loop
Configuration and result data classes
Attack buffer management
High-level coordination between modules

Key Classes:

UDora: Main attack class
UDoraConfig: Configuration parameters
UDoraResult: Attack results
AttackBuffer: Candidate management

2. `scheduling.py` - Weighted Interval Scheduling

Responsibilities:

Weighted interval scheduling algorithm (the core of UDora's positioning strategy)
Interval filtering based on optimization modes
Final token sequence construction
Dynamic programming for optimal target placement

Key Functions:

weighted_interval_scheduling(): Core DP algorithm
filter_intervals_by_sequential_mode(): Mode-specific filtering
build_final_token_sequence(): Token sequence construction

3. `datasets.py` - Dataset-Specific Logic

Responsibilities:

Success condition evaluation for different datasets
Dataset-specific formatting and validation
Extensible framework for adding new datasets

Key Functions:

check_success_condition(): Main success evaluation
Dataset-specific checkers: _check_webshop_success(), _check_injecagent_success(), _check_agentharm_success()
validate_dataset_name(): Dataset validation

4. `text_processing.py` - Target Positioning

Responsibilities:

Target interval identification and scoring
Text analysis for optimal insertion positions
Probability-based scoring of potential targets
Debug utilities for interval analysis

Key Functions:

build_target_intervals(): Find all possible target positions
_compute_interval_score(): Score target quality
count_matched_locations(): Success threshold analysis
format_interval_debug_info(): Debug output formatting

5. `loss.py` - Specialized Loss Functions

UDora loss computation combining probability and reward components
Cross-entropy loss with positional weighting
Mellowmax loss application for smoother optimization
Consecutive token matching rewards

Key Functions:

compute_udora_loss(): Main UDora loss with exponential weighting
compute_cross_entropy_loss(): Standard cross-entropy with position weighting
apply_mellowmax_loss(): Mellowmax alternative to cross-entropy

6. `readable.py` - Readable Adversarial String Optimization

Responsibilities:

Generate natural language adversarial strings instead of random tokens
Apply semantic guidance to gradient-based optimization
Evaluate readability and naturalness of adversarial strings
Context-aware token selection for fluent adversarial prompts

Key Classes:

ReadableOptimizer: Main class for readable optimization
- Vocabulary categorization (functional words, content words, etc.)
- Context-aware gradient modification
- Fluency-based token bonuses
- ASCII/special character penalties

Key Functions:

apply_readable_guidance(): Modify gradients to encourage natural language
generate_readable_initialization(): Create natural language starting points
evaluate_readability(): Assess naturalness and coherence of text
create_readable_optimizer(): Factory function for easy instantiation

Beta Feature: Enable with config.readable = True to generate adversarial strings that appear more natural and less suspicious to human reviewers.

6. `utils.py` - General Utilities

Responsibilities:

General utility functions
Text processing helpers
Model interface utilities
Common constants and helpers

4.2 KiB Raw Permalink Blame History

UDora Architecture Documentation

Overview

Module Structure

1. attack.py - Core Attack Algorithm

2. scheduling.py - Weighted Interval Scheduling

3. datasets.py - Dataset-Specific Logic

4. text_processing.py - Target Positioning

5. loss.py - Specialized Loss Functions

6. readable.py - Readable Adversarial String Optimization

6. utils.py - General Utilities

4.2 KiB

Raw Permalink Blame History

1. `attack.py` - Core Attack Algorithm

2. `scheduling.py` - Weighted Interval Scheduling

3. `datasets.py` - Dataset-Specific Logic

4. `text_processing.py` - Target Positioning

5. `loss.py` - Specialized Loss Functions

6. `readable.py` - Readable Adversarial String Optimization

6. `utils.py` - General Utilities