⛓️‍💥 Claudini ⛓️‍💥

Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs


[Figure: Pareto front with evolution annotations]

We show that an autoresearch-style pipeline powered by Claude Code discovers novel white-box adversarial attack algorithms that significantly outperform all existing methods in jailbreaking and prompt injection evaluations.

This official code repository contains a demo autoresearch pipeline, the Claude-discovered methods from the paper, baseline implementations, and the evaluation benchmark. Read our paper and consider citing us if you find this useful.

Setup

git clone https://github.com/romovpa/claudini.git
cd claudini
uv sync

Requires Python 3.12+ and uv.

Discover Your Own SOTA Attack

[Figure: Autoresearch loop — seeding, analysis-experiment cycle, evaluation]

To run autoresearch, open Claude Code and start the /claudini skill in a loop:

claude
> /loop /claudini my_run break Qwen2.5-7B on random strings under 1e15 FLOPs

On each iteration, Claude studies existing methods and results, designs a new optimizer, benchmarks it, and commits, maintaining an agent log across iterations. The run code (my_run above) isolates the method chain, git branch, and log.

See the full skill prompt for details.

Use tmux or screen so sessions survive disconnection. Track progress via git log.

Evaluate

All experiments are run via the claudini.run_bench CLI:

uv run -m claudini.run_bench --help

It takes a preset name (from configs/) or a path to a YAML file.

Config settings can be overridden with CLI options. For example, to evaluate methods on the random targets track, override the FLOPs budget:

uv run -m claudini.run_bench random_valid --method gcg,acg --max-flops 1e15

Results are saved to results/<method>/<preset>/<model>/sample_<S>_seed_<N>.json. Existing results are auto-skipped.
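Downloaded or freshly generated result files can be aggregated with a short script along these lines (a minimal sketch: the directory layout follows the pattern above, but the JSON schema is an assumption — each file is assumed to carry a top-level `success` boolean, so adjust the field name to the actual format):

```python
import json
from collections import defaultdict
from pathlib import Path

def collect_results(root="results"):
    """Group result files by (method, preset, model), following the
    results/<method>/<preset>/<model>/sample_<S>_seed_<N>.json layout."""
    grouped = defaultdict(list)
    for path in Path(root).glob("*/*/*/sample_*_seed_*.json"):
        method, preset, model = path.parts[-4:-1]
        with open(path) as f:
            grouped[(method, preset, model)].append(json.load(f))
    return grouped

def success_rate(records, key="success"):
    """Fraction of records whose (assumed) `success` field is truthy."""
    if not records:
        return 0.0
    return sum(bool(r.get(key)) for r in records) / len(records)
```

Because existing results are auto-skipped, a script like this can safely run alongside an in-progress benchmark sweep.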

Precomputed results from the paper are available as a GitHub release. Download and unzip claudini-results.zip into the repo root.

Attack Methods

We consider white-box GCG-style attacks that search directly over the model's vocabulary using gradients. Each method (TokenOptimizer) optimizes a short discrete token suffix that, when appended to an input prompt, causes the model to produce a desired target sequence.
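The search loop behind such attacks can be illustrated on a toy problem (a minimal sketch, not the repository's implementation: real GCG-style methods rank candidate token swaps using gradients with respect to one-hot token embeddings, while this stand-in scores every swap by exhaustively evaluating a toy loss):

```python
import random

def greedy_coordinate_search(loss, vocab, suffix_len=5, sweeps=10, seed=0):
    """Toy GCG-style loop: sweep over suffix positions and keep the
    vocabulary token that most reduces the loss of the full suffix.
    `loss` maps a token tuple to a float to be minimized."""
    rng = random.Random(seed)
    suffix = [rng.choice(vocab) for _ in range(suffix_len)]
    best = loss(tuple(suffix))
    for _ in range(sweeps):
        for pos in range(suffix_len):          # coordinate to mutate
            for tok in vocab:                  # try every replacement
                cand = suffix[:pos] + [tok] + suffix[pos + 1:]
                val = loss(tuple(cand))
                if val < best:
                    best, suffix = val, cand
    return suffix, best

# Stand-in objective: Hamming distance to a hidden "target" suffix.
target = (3, 1, 4, 1, 5)
toy_loss = lambda s: sum(a != b for a, b in zip(s, target))
suffix, best = greedy_coordinate_search(toy_loss, vocab=list(range(8)))
```

On this separable toy loss a single sweep already recovers the target; the actual benchmark instead measures whether the optimized suffix elicits the target sequence from the model within the FLOPs budget.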

See CLAUDE.md for how to implement a new method.
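The exact interface is documented in CLAUDE.md; as a rough, hypothetical sketch of the shape such a method takes (the method signatures and the RandomSwap example below are assumptions for illustration, not the repository's actual API):

```python
from abc import ABC, abstractmethod

class TokenOptimizer(ABC):
    """Hypothetical base-class shape: optimize a discrete token suffix
    that steers the model toward a desired target sequence."""

    @abstractmethod
    def step(self, suffix, grad):
        """Return an updated suffix given per-position gradients over
        the vocabulary (lower = predicted loss decrease)."""

class SteepestSwap(TokenOptimizer):
    """Trivial example method: swap the single position/token pair with
    the most negative gradient entry."""

    def step(self, suffix, grad):
        # grad[pos][tok]: predicted loss change from placing tok at pos
        pos = min(range(len(suffix)), key=lambda p: min(grad[p]))
        tok = min(range(len(grad[pos])), key=lambda t: grad[pos][t])
        return suffix[:pos] + [tok] + suffix[pos + 1:]
```

A real method would plug into the benchmark's FLOPs accounting and candidate-evaluation machinery described in CLAUDE.md.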

Citation

@article{panfilov2026claudini,
  title = {Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs},
  author = {Alexander Panfilov and Peter Romov and Igor Shilov and Yves-Alexandre de Montjoye and Jonas Geiping and Maksym Andriushchenko},
  journal = {arXiv preprint},
  eprint = {2603.24511},
  archivePrefix = {arXiv},
  year = {2026},
  url = {https://arxiv.org/abs/2603.24511},
}
License: Apache-2.0