mirror of https://github.com/romovpa/claudini.git synced 2026-07-23 23:30:50 +02:00


Peter Romov 5b6058b3c4 Initial commit

Co-Authored-By: Alexander Panfilov <sasha_pusha@mail.de>
Co-Authored-By: Claude <noreply@anthropic.com>

2026-03-25 02:09:26 +00:00



name	full_name	reference	paper_url	code
PRS	Random Search	andriushchenko2024jailbreaking	https://arxiv.org/abs/2404.02151	https://github.com/tml-epfl/llm-adaptive-attacks

PRS — Random Search

Paper: Andriushchenko et al., "Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks" (ICLR 2025) \cite{andriushchenko2024jailbreaking}

Links: arXiv (https://arxiv.org/abs/2404.02151) | Code (https://github.com/tml-epfl/llm-adaptive-attacks)

Algorithm

Zeroth-order token optimization via batched contiguous-block mutation with a coarse-to-fine schedule. The block-size schedule is paper-faithful: it follows the official schedule_n_to_change_fixed function from utils.py in the reference code.

Per step:

1. Compute the block size n_change from the coarse-to-fine schedule (large early, small late).
2. For each of B candidates, pick a random start position and replace a contiguous block of n_change tokens with random tokens from the allowed vocabulary.
3. Run a single batched forward pass to evaluate all B candidates.
4. Keep the best candidate if it improves on the current best loss.

This is a gradient-free method — no backward passes are used.
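The per-step loop above can be sketched in a few lines of plain Python. This is only an illustration, not the code in optimizer.py: the names prs_step, loss_fn and toy_loss are hypothetical, the batched forward pass is replaced by a toy loss that counts mismatches against a hidden target, and the block is assumed to always fit inside the sequence.

```python
import random

def prs_step(best_tokens, best_loss, n_change, num_candidates, vocab, loss_fn, rng):
    """One random-search step: mutate, evaluate as a batch, keep the best if it improves."""
    length = len(best_tokens)
    n_change = max(1, min(n_change, length))  # assumption: block never exceeds the sequence
    candidates = []
    for _ in range(num_candidates):
        start = rng.randrange(length - n_change + 1)  # random start of the contiguous block
        cand = list(best_tokens)
        for pos in range(start, start + n_change):
            cand[pos] = rng.choice(vocab)  # random token from the allowed vocabulary
        candidates.append(cand)
    losses = loss_fn(candidates)  # stands in for the single batched forward pass
    idx = min(range(num_candidates), key=losses.__getitem__)
    if losses[idx] < best_loss:  # accept only a strict improvement
        return candidates[idx], losses[idx]
    return best_tokens, best_loss

# Toy usage: the "loss" is the number of positions that differ from a hidden target.
target = [3, 1, 4, 1, 5, 9, 2, 6]

def toy_loss(batch):
    return [sum(a != b for a, b in zip(cand, target)) for cand in batch]

rng = random.Random(0)
vocab = list(range(10))
tokens = [0] * len(target)
initial_loss = toy_loss([tokens])[0]
loss = initial_loss
for step in range(1, 201):
    tokens, loss = prs_step(tokens, loss, n_change=2, num_candidates=16,
                            vocab=vocab, loss_fn=toy_loss, rng=rng)
```

Because a candidate is accepted only when it strictly lowers the loss, the tracked best loss never increases from one step to the next.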

Coarse-to-Fine Schedule

Block size decays with step count (halving at each threshold):

Steps	Block size (`schedule="fixed"`)
1–10	`max_tokens_change`
11–25	`max_tokens_change // 2`
26–50	`max_tokens_change // 4`
51–100	`max_tokens_change // 8`
101–500	`max_tokens_change // 16`
501+	`max_tokens_change // 32`

All values are clamped to a minimum of 1. With schedule="none", block size stays constant at max_tokens_change.
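As a minimal sketch, the table can be written as a small lookup function. The name block_size is hypothetical, and steps are assumed to be 1-indexed as in the table; the thresholds, divisors and the clamp to 1 are taken directly from the rows above.

```python
def block_size(step, max_tokens_change, schedule="fixed"):
    """Block size for a given 1-indexed step, following the table above."""
    if schedule == "none":
        return max(max_tokens_change, 1)  # constant block size
    if schedule != "fixed":
        raise ValueError(f"unknown schedule: {schedule!r}")
    # (last step of the range, divisor applied to max_tokens_change)
    thresholds = [(10, 1), (25, 2), (50, 4), (100, 8), (500, 16)]
    divisor = 32  # steps 501 and later
    for last_step, d in thresholds:
        if step <= last_step:
            divisor = d
            break
    return max(max_tokens_change // divisor, 1)  # clamp to a minimum of 1

# With the default max_tokens_change=4 the block shrinks to a single token by step 26.
sizes_default = [block_size(s, 4) for s in (1, 11, 26, 51, 101, 501)]
# A larger starting block shows every halving.
sizes_large = [block_size(s, 64) for s in (10, 25, 50, 100, 500, 501)]
```

With the default of 4, integer division reaches zero from step 51 onward, so the clamp is what keeps the block at one token for the rest of the run.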

Key Hyperparameters

num_candidates (B): candidates per step (default: 128)
max_tokens_change: initial block size for coarse-to-fine schedule (default: 4)
schedule: "fixed" (coarse-to-fine decay, default) or "none" (constant block size)
position_mode: "random" (default, random start for contiguous block) or "round_robin" (spread candidates across positions round-robin, single-token mutation — ignores max_tokens_change and schedule)
patience: steps without improvement before restart (default: 25)
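The two position_mode settings differ only in which token span each candidate mutates. The sketch below is illustrative: mutation_spans is a hypothetical helper, and the exact way the implementation spreads candidates across positions is not stated here, so assigning candidate i to position i mod suffix_len is an assumption.

```python
import random

def mutation_spans(position_mode, num_candidates, suffix_len, n_change, rng):
    """Return one (start, length) span per candidate."""
    if position_mode == "random":
        n = max(1, min(n_change, suffix_len))
        # Each candidate gets an independent random start for its contiguous block.
        return [(rng.randrange(suffix_len - n + 1), n) for _ in range(num_candidates)]
    if position_mode == "round_robin":
        # Single-token mutation; n_change (and therefore the schedule) is ignored.
        # Assumption: candidate i mutates position i mod suffix_len.
        return [(i % suffix_len, 1) for i in range(num_candidates)]
    raise ValueError(f"unknown position_mode: {position_mode!r}")

rng = random.Random(0)
rr_spans = mutation_spans("round_robin", num_candidates=6, suffix_len=4, n_change=3, rng=rng)
random_spans = mutation_spans("random", num_candidates=6, suffix_len=4, n_change=3, rng=rng)
```

In round-robin mode every position is covered roughly equally often within a step, whereas random mode may sample the same start several times.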

Differences from original paper

Full-sequence CE loss instead of first-token NLL. The original paper optimizes only the log-probability of the first target token (e.g., "Sure") — a jailbreak-specific heuristic. Our default uses full-sequence cross-entropy, which is appropriate for random token targets. The paper-faithful first-token NLL variant is available as prs_safety.
Patience-based restarts: if loss doesn't improve for 25 steps, reinitialize the suffix and reset the coarse-to-fine schedule. The original paper has a similar restart mechanism (25 iterations of no progress → restart).
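Both points above can be illustrated with a short, dependency-free sketch. All names here (token_nll, full_sequence_ce, first_token_nll, PatienceTracker) are hypothetical and not taken from optimizer.py. Two assumptions are made: the full-sequence loss is averaged over target positions, and the tracked best loss is reset when a restart fires so the fresh suffix is judged on its own progress.

```python
import math

def token_nll(logits, target_id):
    """Negative log-probability of target_id under softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_id]

def full_sequence_ce(logits_per_pos, target_ids):
    """Mean NLL over all target positions (assumption: mean rather than sum)."""
    total = sum(token_nll(l, t) for l, t in zip(logits_per_pos, target_ids))
    return total / len(target_ids)

def first_token_nll(logits_per_pos, target_ids):
    """NLL of the first target token only, as in the original paper's objective."""
    return token_nll(logits_per_pos[0], target_ids[0])

class PatienceTracker:
    """Signals a restart after `patience` consecutive steps without improvement."""

    def __init__(self, patience=25):
        self.patience = patience
        self.best = math.inf
        self.stale = 0
        self.run_step = 0  # steps since the last restart; would drive the schedule

    def update(self, loss):
        """Record one step's loss and return True when a restart is due."""
        self.run_step += 1
        if loss < self.best:
            self.best, self.stale = loss, 0
        else:
            self.stale += 1
        if self.stale >= self.patience:
            # Assumption: the restarted run begins with a clean slate.
            self.best, self.stale, self.run_step = math.inf, 0, 0
            return True
        return False

logits = [[2.0, 0.0], [0.0, 0.0]]  # two target positions, vocabulary of size 2
targets = [0, 1]
ce_full = full_sequence_ce(logits, targets)
nll_first = first_token_nll(logits, targets)

tracker = PatienceTracker(patience=3)
restarts = [tracker.update(1.0) for _ in range(4)]
```

In the toy example the first token is already likely while the second is not, so the first-token objective reports a much lower loss than the full-sequence one.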
