Files
claudini/claudini/methods/original/prs
Peter Romov 5b6058b3c4 Initial commit
Co-Authored-By: Alexander Panfilov <sasha_pusha@mail.de>
Co-Authored-By: Claude <noreply@anthropic.com>
2026-03-25 02:09:26 +00:00
..
2026-03-25 02:09:26 +00:00
2026-03-25 02:09:26 +00:00
2026-03-25 02:09:26 +00:00

name, full_name, reference, paper_url, code
name full_name reference paper_url code
PRS Random Search andriushchenko2024jailbreaking https://arxiv.org/abs/2404.02151 https://github.com/tml-epfl/llm-adaptive-attacks

PRS — Random Search

Paper: Andriushchenko et al., "Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks" (ICLR 2025) Links: arXiv | Code \cite{andriushchenko2024jailbreaking}

Algorithm

Zeroth-order token optimization via batched contiguous block mutation with a coarse-to-fine schedule. Paper-faithful implementation following the official schedule_n_to_change_fixed from utils.py in the reference code.

Per step:

  1. Compute block size n_change from the coarse-to-fine schedule (large early, small late)
  2. For each of B candidates, pick a random start position and replace a contiguous block of n_change tokens with random tokens from the allowed vocabulary
  3. Single batched forward pass to evaluate all B candidates
  4. Keep the best candidate if it improves the current best loss

This is a gradient-free method — no backward passes are used.

Coarse-to-Fine Schedule

Block size decays with step count (halving at each threshold):

Steps Block size (schedule="fixed")
110 max_tokens_change
1125 max_tokens_change // 2
2650 max_tokens_change // 4
51100 max_tokens_change // 8
101500 max_tokens_change // 16
501+ max_tokens_change // 32

All values are clamped to a minimum of 1. With schedule="none", block size stays constant at max_tokens_change.

Key Hyperparameters

  • num_candidates (B): candidates per step (default: 128)
  • max_tokens_change: initial block size for coarse-to-fine schedule (default: 4)
  • schedule: "fixed" (coarse-to-fine decay, default) or "none" (constant block size)
  • position_mode: "random" (default, random start for contiguous block) or "round_robin" (spread candidates across positions round-robin, single-token mutation — ignores max_tokens_change and schedule)
  • patience: steps without improvement before restart (default: 25)

Differences from original paper

  • Full-sequence CE loss instead of first-token NLL. The original paper optimizes only the log-probability of the first target token (e.g., "Sure") — a jailbreak-specific heuristic. Our default uses full-sequence cross-entropy, which is appropriate for random token targets. The paper-faithful first-token NLL variant is available as prs_safety.
  • Patience-based restarts: if loss doesn't improve for 25 steps, reinitialize the suffix and reset the coarse-to-fine schedule. The original paper has a similar restart mechanism (25 iterations of no progress → restart).