# ⛓️💥 Claudini ⛓️💥
**Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs**
We show that an *[autoresearch](https://github.com/karpathy/autoresearch)*-style pipeline powered by Claude Code discovers novel white-box adversarial attack *algorithms* that **significantly outperform** all existing [methods](claudini/methods/original/README.md) in jailbreaking and prompt injection evaluations.
This official code repository contains a demo autoresearch pipeline, the Claude-discovered methods from the paper, baseline implementations, and the evaluation benchmark. Read our [paper](https://arxiv.org/abs/2603.24511) and consider [citing us](#citation) if you find this useful.
## Setup
```bash
git clone https://github.com/romovpa/claudini.git
cd claudini
uv sync
```
Requires Python 3.12+ and [uv](https://docs.astral.sh/uv/).
## Discover Your Own SOTA Attack
To run autoresearch, open [Claude Code](https://docs.anthropic.com/en/docs/claude-code) and start the `/claudini` skill in a loop:
```bash
claude
> /loop /claudini my_run break Qwen2.5-7B on random strings under 1e15 FLOPs
```
On each iteration, Claude studies the existing methods and results, designs a new optimizer, benchmarks it, and commits its changes, maintaining an agent log across iterations. The run code (`my_run` above) isolates the method chain, git branch, and log for that run.
See the full [skill prompt](.claude/skills/claudini/SKILL.md) for details.
Use `tmux` or `screen` so sessions survive disconnection. Track progress via `git log`.
## Evaluate
All experiments are run via the `claudini.run_bench` CLI:
```bash
uv run -m claudini.run_bench --help
```
It takes a preset name (from [`configs/`](configs/)) or a path to a YAML file.
Config settings can be overridden with CLI options. For example, to evaluate the GCG and ACG methods on the random-targets track with a custom FLOPs budget:
```bash
uv run -m claudini.run_bench random_valid --method gcg,acg --max-flops 1e15
```
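The `--max-flops` budget caps the total compute the optimizer may spend. A common accounting (an assumption for illustration, not necessarily the repo's exact formula) charges roughly 2·N FLOPs per token for a forward pass through an N-parameter model, with the backward pass costing about twice the forward again:

```python
def forward_flops(n_params: int, n_tokens: int, backward: bool = False) -> int:
    """Estimate transformer FLOPs: ~2 * params * tokens for a forward pass;
    a backward pass costs roughly twice the forward, so ~3x total."""
    fwd = 2 * n_params * n_tokens
    return 3 * fwd if backward else fwd

# Under a 1e15 FLOPs budget, a gradient step (forward + backward) over a
# 64-token sequence of a 7B-parameter model allows roughly this many steps:
steps = int(1e15 // forward_flops(7_000_000_000, 64, backward=True))
```

Under these assumptions a 1e15 budget buys a few hundred gradient steps on a 7B model, which is why candidate re-scoring (forward-only) dominates the budget in GCG-style methods.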
Results are saved to `results////sample__seed_.json`. Samples with existing results are skipped automatically.
Precomputed results from the paper are available as a [GitHub release](https://github.com/romovpa/claudini/releases). Download and unzip `claudini-results.zip` into the repo root.
## Attack Methods
We consider white-box GCG-style attacks that search directly over the model's vocabulary using gradients. Each method ([`TokenOptimizer`](claudini/base.py#L429)) optimizes a short discrete token *suffix* that, when appended to an input prompt, causes the model to produce a desired target sequence.
- **Baselines** (existing methods): [`claudini/methods/original/`](claudini/methods/original/)
- **Claude-designed methods** (each run code produces a separate chain):
- Generalizable attacks (random targets): [`claudini/methods/claude_random/`](claudini/methods/claude_random/)
- Attacks on a safeguard model: [`claudini/methods/claude_safeguard/`](claudini/methods/claude_safeguard/)
See [`CLAUDE.md`](CLAUDE.md) for how to implement a new method.
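The GCG-style search described above, gradient-guided candidate selection followed by exact re-scoring, can be illustrated with a toy objective. This is a simplified sketch, not the repo's `TokenOptimizer` implementation: the real loss would be the target sequence's negative log-likelihood under the model, with position-dependent gradients from a backward pass.

```python
import numpy as np

def toy_gcg(steps=30, k=8, seed=0):
    """Toy greedy coordinate gradient search. The loss here is the squared
    distance between the mean suffix embedding and a target vector, standing
    in for the model's loss on the target sequence."""
    rng = np.random.default_rng(seed)
    V, d, L = 50, 8, 6                    # vocab size, embed dim, suffix length
    E = rng.normal(size=(V, d))           # toy embedding table
    target = rng.normal(size=d)

    def loss(suffix):
        return float(np.sum((E[suffix].mean(axis=0) - target) ** 2))

    suffix = rng.integers(0, V, size=L)
    init = best = loss(suffix)
    for _ in range(steps):
        # Gradient of the loss w.r.t. the one-hot token choices; for this
        # toy loss it is the same for every position: (2/L) * E @ residual.
        resid = E[suffix].mean(axis=0) - target
        scores = (2.0 / L) * (E @ resid)  # one score per vocabulary token
        # GCG step: propose the k tokens with the most negative gradient,
        # re-score each single-token swap exactly, and keep the best one.
        for i in range(L):
            for v in np.argsort(scores)[:k]:
                cand = suffix.copy()
                cand[i] = v
                c = loss(cand)
                if c < best:
                    best, suffix = c, cand
    return init, best
```

The propose-then-re-score structure (cheap gradient ranking of candidates, exact evaluation of the shortlist) is the part shared by all the GCG-style methods in this repo.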
**Leaderboard.** Run `uv run -m claudini.leaderboard results/` to generate per-track, per-model leaderboards ranking all methods by average loss. Results are saved to `results/loss_leaderboard//.json`.
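Conceptually, the leaderboard is a mean-loss aggregation per method. A minimal sketch, where the `method` and `loss` field names are hypothetical and not necessarily the repo's result schema:

```python
from collections import defaultdict
from statistics import mean

def rank_methods(records):
    """Rank attack methods by average loss (lower is better)."""
    by_method = defaultdict(list)
    for r in records:
        by_method[r["method"]].append(r["loss"])
    return sorted(((m, mean(ls)) for m, ls in by_method.items()),
                  key=lambda t: t[1])

runs = [{"method": "gcg", "loss": 0.9}, {"method": "acg", "loss": 0.4},
        {"method": "gcg", "loss": 0.7}, {"method": "acg", "loss": 0.6}]
# acg (avg 0.5) ranks above gcg (avg 0.8)
```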
## Citation
```bibtex
@article{panfilov2026claudini,
  title         = {Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs},
  author        = {Alexander Panfilov and Peter Romov and Igor Shilov and Yves-Alexandre de Montjoye and Jonas Geiping and Maksym Andriushchenko},
  journal       = {arXiv preprint},
  eprint        = {2603.24511},
  archivePrefix = {arXiv},
  year          = {2026},
  url           = {https://arxiv.org/abs/2603.24511},
}
```