⛓️💥 Claudini ⛓️💥
Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs
We show that an autoresearch-style pipeline powered by Claude Code discovers novel white-box adversarial attack algorithms that significantly outperform all existing methods in jailbreaking and prompt injection evaluations.
This official code repository contains a demo autoresearch pipeline, the Claude-discovered methods from the paper, baseline implementations, and the evaluation benchmark. Read our paper and consider citing us if you find this useful.
Setup
git clone https://github.com/romovpa/claudini.git
cd claudini
uv sync
Requires Python 3.12+ and uv.
Discover Your Own SOTA Attack
To run autoresearch, open Claude Code and start the /claudini skill in a loop:
claude
> /loop /claudini my_run break Qwen2.5-7B on random strings under 1e15 FLOPs
On each iteration, Claude studies existing methods and results, designs a new optimizer, benchmarks it, and commits, maintaining an agent log across iterations. The run code (my_run above) isolates the method chain, git branch, and log.
See the full skill prompt for details.
Use tmux or screen so sessions survive disconnection. Track progress via git log.
Evaluate
All experiments are run via the claudini.run_bench CLI:
uv run -m claudini.run_bench --help
It takes a preset name (from configs/) or a path to a YAML file.
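A preset is just a YAML file. A minimal hypothetical one might look like the following; the field names simply mirror the CLI flags and are assumptions, not the actual schema (see configs/ for the real presets):

```yaml
# Hypothetical preset: field names mirror the CLI flags and are
# not guaranteed to match the real schema used in configs/
method: gcg
model: Qwen2.5-7B
max_flops: 1.0e15
seed: 0
```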
Config settings can be overridden with CLI options. For example, to evaluate methods on the random targets track with a custom FLOPs budget:
uv run -m claudini.run_bench random_valid --method gcg,acg --max-flops 1e15
Results are saved to results/<method>/<preset>/<model>/sample_<S>_seed_<N>.json. Existing results are auto-skipped.
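For quick analysis, the results tree can be swept with a few lines of Python. This sketch assumes only the directory layout above; the per-file JSON schema isn't specified here, so files are loaded as-is (the helper name collect_results is ours, not part of the repo):

```python
import json
from collections import defaultdict
from pathlib import Path

def collect_results(root="results"):
    """Group result files by (method, preset, model), following the
    results/<method>/<preset>/<model>/sample_<S>_seed_<N>.json layout."""
    grouped = defaultdict(list)
    for path in Path(root).glob("*/*/*/sample_*_seed_*.json"):
        method, preset, model = path.parts[-4:-1]
        with open(path) as f:
            grouped[(method, preset, model)].append(json.load(f))
    return dict(grouped)
```

collect_results() returns a dict keyed by (method, preset, model), with each value the list of parsed result files for that combination.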
Precomputed results from the paper are available as a GitHub release. Download and unzip claudini-results.zip into the repo root.
Attack Methods
We consider white-box GCG-style attacks that search directly over the model's vocabulary using gradients. Each method (TokenOptimizer) optimizes a short discrete token suffix that, when appended to an input prompt, causes the model to produce a desired target sequence.
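To make the optimizer idea concrete, here is a deliberately tiny sketch of the search loop. It is not the repo's TokenOptimizer interface: real GCG ranks token substitutions by gradients of the target-sequence loss with respect to token embeddings, while this toy swaps in uniform random substitutions over a stand-in vocabulary of integer token IDs:

```python
import random
from dataclasses import dataclass, field

@dataclass
class GreedyRandomSearch:
    """Toy GCG-flavored loop: mutate one suffix token per step and keep
    the candidate only if its loss improves. In the real attack the loss
    is the negative log-likelihood of the target sequence; here it is
    any callable mapping a token-ID list to a number."""
    vocab_size: int
    suffix_len: int = 8
    steps: int = 200
    rng: random.Random = field(default_factory=lambda: random.Random(0))

    def optimize(self, loss_fn):
        suffix = [self.rng.randrange(self.vocab_size) for _ in range(self.suffix_len)]
        best = loss_fn(suffix)
        for _ in range(self.steps):
            cand = list(suffix)
            cand[self.rng.randrange(self.suffix_len)] = self.rng.randrange(self.vocab_size)
            if (l := loss_fn(cand)) < best:
                suffix, best = cand, l
        return suffix, best
```

The greedy accept-if-better rule is the simplest possible search; the methods in this repo differ mainly in how candidate substitutions are proposed and batched.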
- Baselines (existing methods): claudini/methods/original/
- Claude-designed methods (each run code produces a separate chain):
  - Generalizable attacks (random targets): claudini/methods/claude_random/
  - Attacks on a safeguard model: claudini/methods/claude_safeguard/
See CLAUDE.md for how to implement a new method.
Citation
@article{panfilov2026claudini,
  title = {Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs},
  author = {Alexander Panfilov and Peter Romov and Igor Shilov and Yves-Alexandre de Montjoye and Jonas Geiping and Maksym Andriushchenko},
  journal = {arXiv preprint},
  eprint = {2603.24511},
  archivePrefix = {arXiv},
  year = {2026},
  url = {https://arxiv.org/abs/2603.24511},
}

