# ⛓️💥 Claudini ⛓️💥
**Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs**
We show that an *[autoresearch](https://github.com/karpathy/autoresearch)*-style pipeline powered by Claude Code discovers novel white-box adversarial attack *algorithms* that **significantly outperform** all existing [methods](claudini/methods/original/README.md) in jailbreaking and prompt injection evaluations.
This official code repository contains a demo autoresearch pipeline, the Claude-discovered methods from the paper, baseline implementations, and the evaluation benchmark. Read our [paper](https://arxiv.org/abs/2603.24511) and consider [citing us](#citation) if you find this useful.
## Setup
```bash
git clone https://github.com/romovpa/claudini.git
cd claudini
uv sync
```
Requires Python 3.12+ and [uv](https://docs.astral.sh/uv/).
## Discover Your Own SOTA Attack
To run autoresearch, open [Claude Code](https://docs.anthropic.com/en/docs/claude-code) and start the `/claudini` skill in a loop:
```bash
claude
> /loop /claudini my_run break Qwen2.5-7B on random strings under 1e15 FLOPs
```
On each iteration, Claude studies the existing methods and results, designs a new optimizer, benchmarks it, and commits its changes, maintaining an agent log across iterations. The run code (`my_run` above) isolates the method chain, git branch, and log for that run.
See the full [skill prompt](.claude/skills/claudini/SKILL.md) for details.
Use `tmux` or `screen` so sessions survive disconnection. Track progress via `git log`.
## Evaluate
All experiments are run via the `claudini.run_bench` CLI:
```bash
uv run -m claudini.run_bench --help
```
It takes a preset name (from [`configs/`](configs/)) or a path to a YAML file.
Config settings can be overridden with CLI options. For example, to evaluate the GCG and ACG methods on the random-targets track with a custom FLOPs budget:
```bash
uv run -m claudini.run_bench random_valid --method gcg,acg --max-flops 1e15
```
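The `--max-flops` budget caps the total compute the optimizer may spend. A common accounting (an assumption for illustration, not necessarily the repo's exact formula) charges roughly 2·N FLOPs per token for a forward pass through an N-parameter model, with the backward pass costing about twice the forward again:

```python
def forward_flops(n_params: int, n_tokens: int, backward: bool = False) -> int:
    """Estimate transformer FLOPs: ~2 * params * tokens for a forward pass;
    a backward pass costs roughly twice the forward, so ~3x total."""
    fwd = 2 * n_params * n_tokens
    return 3 * fwd if backward else fwd

# Under a 1e15 FLOPs budget, a gradient step (forward + backward) over a
# 64-token sequence of a 7B-parameter model allows roughly this many steps:
steps = int(1e15 // forward_flops(7_000_000_000, 64, backward=True))
```

Under these assumptions a 1e15 budget buys a few hundred gradient steps on a 7B model, which is why candidate re-scoring (forward-only) dominates the budget in GCG-style methods.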
Results are saved to `results////sample__seed_.json`. Samples with existing results are skipped automatically.
Precomputed results from the paper are available as a [GitHub release](https://github.com/romovpa/claudini/releases). Download and unzip `claudini-results.zip` into the repo root.
## Attack Methods
We consider white-box GCG-style attacks that search directly over the model's vocabulary using gradients. Each method ([`TokenOptimizer`](claudini/base.py#L429)) optimizes a short discrete token *suffix* that, when appended to an input prompt, causes the model to produce a desired target sequence.
- **Baselines** (existing methods): [`claudini/methods/original/`](claudini/methods/original/)
- **Claude-designed methods** (each run code produces a separate chain):
- Generalizable attacks (random targets): [`claudini/methods/claude_random/`](claudini/methods/claude_random/)
- Attacks on a safeguard model: [`claudini/methods/claude_safeguard/`](claudini/methods/claude_safeguard/)
See [`CLAUDE.md`](CLAUDE.md) for how to implement a new method.
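The GCG-style search described above, gradient-guided candidate selection followed by exact re-scoring, can be illustrated with a toy objective. This is a simplified sketch, not the repo's `TokenOptimizer` implementation: the real loss would be the target sequence's negative log-likelihood under the model, with position-dependent gradients from a backward pass.

```python
import numpy as np

def toy_gcg(steps=30, k=8, seed=0):
    """Toy greedy coordinate gradient search. The loss here is the squared
    distance between the mean suffix embedding and a target vector, standing
    in for the model's loss on the target sequence."""
    rng = np.random.default_rng(seed)
    V, d, L = 50, 8, 6                    # vocab size, embed dim, suffix length
    E = rng.normal(size=(V, d))           # toy embedding table
    target = rng.normal(size=d)

    def loss(suffix):
        return float(np.sum((E[suffix].mean(axis=0) - target) ** 2))

    suffix = rng.integers(0, V, size=L)
    init = best = loss(suffix)
    for _ in range(steps):
        # Gradient of the loss w.r.t. the one-hot token choices; for this
        # toy loss it is the same for every position: (2/L) * E @ residual.
        resid = E[suffix].mean(axis=0) - target
        scores = (2.0 / L) * (E @ resid)  # one score per vocabulary token
        # GCG step: propose the k tokens with the most negative gradient,
        # re-score each single-token swap exactly, and keep the best one.
        for i in range(L):
            for v in np.argsort(scores)[:k]:
                cand = suffix.copy()
                cand[i] = v
                c = loss(cand)
                if c < best:
                    best, suffix = c, cand
    return init, best
```

The propose-then-re-score structure (cheap gradient ranking of candidates, exact evaluation of the shortlist) is the part shared by all the GCG-style methods in this repo.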
**Leaderboard.** Run `uv run -m claudini.leaderboard results/` to generate per-track, per-model leaderboards ranking all methods by average loss. Results are saved to `results/loss_leaderboard//.json`.
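Conceptually, the leaderboard is a mean-loss aggregation per method. A minimal sketch, where the `method` and `loss` field names are hypothetical and not necessarily the repo's result schema:

```python
from collections import defaultdict
from statistics import mean

def rank_methods(records):
    """Rank attack methods by average loss (lower is better)."""
    by_method = defaultdict(list)
    for r in records:
        by_method[r["method"]].append(r["loss"])
    return sorted(((m, mean(ls)) for m, ls in by_method.items()),
                  key=lambda t: t[1])

runs = [{"method": "gcg", "loss": 0.9}, {"method": "acg", "loss": 0.4},
        {"method": "gcg", "loss": 0.7}, {"method": "acg", "loss": 0.6}]
# acg (avg 0.5) ranks above gcg (avg 0.8)
```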
## Citation
```bibtex
@article{panfilov2026claudini,
  title         = {Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs},
  author        = {Alexander Panfilov and Peter Romov and Igor Shilov and Yves-Alexandre de Montjoye and Jonas Geiping and Maksym Andriushchenko},
  journal       = {arXiv preprint},
  eprint        = {2603.24511},
  archivePrefix = {arXiv},
  year          = {2026},
  url           = {https://arxiv.org/abs/2603.24511},
}
```