mirror of
https://github.com/romovpa/claudini.git
| name | full_name | reference | paper_url | code |
|---|---|---|---|---|
| PRS | Random Search | andriushchenko2024jailbreaking | https://arxiv.org/abs/2404.02151 | https://github.com/tml-epfl/llm-adaptive-attacks |
PRS — Random Search
Paper: Andriushchenko et al., "Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks" (ICLR 2025) \cite{andriushchenko2024jailbreaking}
Links: arXiv (https://arxiv.org/abs/2404.02151) | Code (https://github.com/tml-epfl/llm-adaptive-attacks)
Algorithm
Zeroth-order token optimization via batched contiguous block mutation with a coarse-to-fine schedule. The implementation is paper-faithful: it follows the official schedule_n_to_change_fixed from utils.py in the reference code.
Per step:
- Compute block size n_change from the coarse-to-fine schedule (large early, small late)
- For each of B candidates, pick a random start position and replace a contiguous block of n_change tokens with random tokens from the allowed vocabulary
- Single batched forward pass to evaluate all B candidates
- Keep the best candidate if it improves the current best loss
This is a gradient-free method — no backward passes are used.
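The per-step loop above can be sketched in plain Python. This is a minimal illustration, not the repository's code: `prs_step`, `toy_batched_loss`, and all argument names are hypothetical, the toy loss (Hamming distance to a fixed target) stands in for the batched model forward pass, and the sketch assumes the mutated block always fits inside the suffix.

```python
import random
from typing import Callable, List, Sequence, Tuple


def prs_step(
    best_tokens: List[int],
    best_loss: float,
    n_change: int,
    num_candidates: int,
    allowed_tokens: Sequence[int],
    batched_loss: Callable[[List[List[int]]], List[float]],
    rng: random.Random,
) -> Tuple[List[int], float]:
    """One random-search step: mutate, score in one batch, keep the best if it improves."""
    # Assumption: clamp the block so it always fits inside the suffix.
    n_change = max(1, min(n_change, len(best_tokens)))
    candidates = []
    for _ in range(num_candidates):
        # Random start position for the contiguous block.
        start = rng.randrange(len(best_tokens) - n_change + 1)
        cand = list(best_tokens)
        for pos in range(start, start + n_change):
            cand[pos] = rng.choice(allowed_tokens)
        candidates.append(cand)

    # Stand-in for the single batched forward pass over all candidates.
    losses = batched_loss(candidates)
    idx = min(range(len(candidates)), key=lambda i: losses[i])

    # Greedy acceptance: only replace the incumbent on strict improvement.
    if losses[idx] < best_loss:
        return candidates[idx], losses[idx]
    return best_tokens, best_loss


def toy_batched_loss(batch: List[List[int]]) -> List[float]:
    """Toy objective (hypothetical): Hamming distance to a fixed target sequence."""
    target = [3, 1, 4, 1, 5, 9, 2, 6]
    return [float(sum(a != b for a, b in zip(cand, target))) for cand in batch]


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = [0] * 8
    loss = toy_batched_loss([tokens])[0]
    for step in range(200):
        tokens, loss = prs_step(tokens, loss, 2, 16, range(10), toy_batched_loss, rng)
    print(tokens, loss)
```

In a real run, `batched_loss` would wrap the model forward pass and `n_change` would come from the coarse-to-fine schedule; no gradients are needed at any point.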
Coarse-to-Fine Schedule
Block size decays with step count (halving at each threshold):
| Steps | Block size (schedule="fixed") |
|---|---|
| 1–10 | max_tokens_change |
| 11–25 | max_tokens_change // 2 |
| 26–50 | max_tokens_change // 4 |
| 51–100 | max_tokens_change // 8 |
| 101–500 | max_tokens_change // 16 |
| 501+ | max_tokens_change // 32 |
All values are clamped to a minimum of 1. With schedule="none", block size stays constant at max_tokens_change.
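As a sketch, the table can be written as a small lookup function. The name `block_size` and its signature are hypothetical; the thresholds are read directly from the table, and steps are assumed to be 1-indexed as shown there.

```python
def block_size(step: int, max_tokens_change: int, schedule: str = "fixed") -> int:
    """Block size for a 1-indexed step, following the table above."""
    if schedule == "none":
        # Constant block size for the whole run.
        return max_tokens_change
    if schedule != "fixed":
        raise ValueError(f"unknown schedule: {schedule!r}")

    # (last step of the range, divisor applied to max_tokens_change)
    thresholds = [(10, 1), (25, 2), (50, 4), (100, 8), (500, 16)]
    divisor = 32  # applies to step 501 and later
    for last_step, d in thresholds:
        if step <= last_step:
            divisor = d
            break

    # Integer halving can reach 0, so clamp to a minimum of 1.
    return max(max_tokens_change // divisor, 1)


if __name__ == "__main__":
    # With the default max_tokens_change=4 the block shrinks to 1 by step 26.
    for step in (1, 11, 26, 51, 501):
        print(step, block_size(step, 4))
```

With the default `max_tokens_change` of 4, this gives 4 for steps 1–10, 2 for steps 11–25, and 1 from step 26 onward, so the later thresholds only matter for larger initial block sizes.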
Key Hyperparameters
- num_candidates (B): candidates per step (default: 128)
- max_tokens_change: initial block size for coarse-to-fine schedule (default: 4)
- schedule: "fixed" (coarse-to-fine decay, default) or "none" (constant block size)
- position_mode: "random" (default, random start for contiguous block) or "round_robin" (spread candidates across positions round-robin, single-token mutation; ignores max_tokens_change and schedule)
- patience: steps without improvement before restart (default: 25)
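To illustrate how the two position_mode values differ, here is a small sketch that only computes which positions each candidate would mutate. The function `mutation_spans` is hypothetical, and two details are assumptions not stated above: round-robin starts from position 0 on every step, and random blocks are placed so they fit entirely inside the suffix.

```python
import random
from typing import List, Tuple


def mutation_spans(
    num_candidates: int,
    suffix_len: int,
    n_change: int,
    position_mode: str,
    rng: random.Random,
) -> List[Tuple[int, int]]:
    """Return one (start, length) span per candidate."""
    if position_mode == "round_robin":
        # Single-token mutation; candidate i is assigned position i mod suffix_len.
        # n_change is ignored in this mode.
        return [(i % suffix_len, 1) for i in range(num_candidates)]
    if position_mode == "random":
        # Contiguous block of n_change tokens at a random start (assumed to fit).
        n = max(1, min(n_change, suffix_len))
        return [(rng.randrange(suffix_len - n + 1), n) for _ in range(num_candidates)]
    raise ValueError(f"unknown position_mode: {position_mode!r}")


if __name__ == "__main__":
    rng = random.Random(0)
    print(mutation_spans(6, 4, 3, "round_robin", rng))
    # [(0, 1), (1, 1), (2, 1), (3, 1), (0, 1), (1, 1)]
    print(mutation_spans(3, 20, 4, "random", rng))
```

Round-robin guarantees that every position is tried once B reaches the suffix length, while random mode concentrates on larger jumps early in the schedule.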
Differences from original paper
- Full-sequence CE loss instead of first-token NLL. The original paper optimizes only the log-probability of the first target token (e.g., "Sure"), a jailbreak-specific heuristic. Our default uses full-sequence cross-entropy, which is appropriate for random token targets. The paper-faithful first-token NLL variant is available as prs_safety.
- Patience-based restarts: if loss doesn't improve for 25 steps, reinitialize the suffix and reset the coarse-to-fine schedule. The original paper has a similar restart mechanism (25 iterations of no progress → restart).
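The difference between the two objectives can be made concrete with a small, dependency-free sketch. The function names are hypothetical, `logits` is a list of per-position score vectors over the vocabulary, and averaging over target positions is an assumption about the reduction (a sum would rank candidates identically for a fixed target length).

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def full_sequence_ce(logits: Sequence[Sequence[float]], target: Sequence[int]) -> float:
    """Cross-entropy averaged over every target position (assumed mean reduction)."""
    nll = [-log_softmax(row)[tok] for row, tok in zip(logits, target)]
    return sum(nll) / len(nll)


def first_token_nll(logits: Sequence[Sequence[float]], target: Sequence[int]) -> float:
    """Negative log-probability of the first target token only."""
    return -log_softmax(logits[0])[target[0]]


if __name__ == "__main__":
    # The model is confident about the first target token but not the second.
    logits = [[10.0, 0.0], [0.0, 0.0]]
    target = [0, 1]
    print(first_token_nll(logits, target))   # close to 0
    print(full_sequence_ce(logits, target))  # still penalizes the second token
```

In this example the first-token objective is already near zero, while the full-sequence loss still reflects the uncertain second token, which is why the latter is the better fit when every token of a random target matters.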