⛓️💥 Claudini ⛓️💥
Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs
We show that an autoresearch-style pipeline powered by Claude Code discovers novel white-box adversarial attack algorithms that significantly outperform all existing methods in jailbreaking and prompt injection evaluations.
This official code repository contains a demo autoresearch pipeline, the Claude-discovered methods from the paper, baseline implementations, and the evaluation benchmark. Read our paper and consider citing us if you find this useful.
Setup
```bash
git clone https://github.com/romovpa/claudini.git
cd claudini
uv sync
```
Requires Python 3.12+ and uv.
Evaluate
All experiments are run through the `claudini.run_bench` CLI:

```bash
uv run -m claudini.run_bench --help
```
It takes a preset name (from configs/) or a path to a YAML file.
Config settings can be overridden with CLI options. For example, to evaluate methods on the random-targets track with an overridden FLOPs budget:

```bash
uv run -m claudini.run_bench random_valid --method gcg,acg --max-flops 1e15
```
Results are saved to results/<method>/<preset>/<model>/sample_<S>_seed_<N>.json. Existing results are auto-skipped.
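With that layout, runs can be aggregated with a short script. A minimal sketch: the directory structure matches the pattern above, but the `success` field is a hypothetical example of a per-run JSON key, not the actual result schema.

```python
import json
from collections import defaultdict
from pathlib import Path

def summarize(results_dir="results"):
    """Count runs (and hypothetical 'success' flags) per method.

    Walks results/<method>/<preset>/<model>/sample_<S>_seed_<N>.json;
    the 'success' key is an assumption, not the repo's actual schema.
    """
    counts = defaultdict(lambda: {"runs": 0, "successes": 0})
    for path in Path(results_dir).glob("*/*/*/sample_*_seed_*.json"):
        method = path.parts[-4]  # results/<method>/<preset>/<model>/file
        data = json.loads(path.read_text())
        counts[method]["runs"] += 1
        counts[method]["successes"] += int(bool(data.get("success")))
    return dict(counts)
```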
Run Autoresearch
Install Claude Code, then start the loop:
```
claude
> /loop 10m $(cat PROMPT.txt)
```
Or with the ralph-loop plugin:
```
claude
> /ralph-loop:ralph-loop $(cat PROMPT.txt)
```
The loop reads PROMPT.txt, studies existing methods and results, designs a new optimizer, implements it, runs the benchmark, and commits — all autonomously. Use tmux or screen so sessions survive disconnection. Track progress via git log.
Attack Methods
We consider white-box GCG-style attacks that search directly over the model's vocabulary using gradients. Each method (TokenOptimizer) optimizes a short discrete token suffix that, when appended to an input prompt, causes the model to produce a desired target sequence.
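The core loop can be illustrated with a toy greedy coordinate search. This is a didactic sketch, not code from this repository: real GCG-style attacks rank candidate token swaps by gradients taken through the model's embedding matrix, whereas this version tries every token exhaustively against a stand-in loss to keep the example self-contained.

```python
import numpy as np

def greedy_coordinate_descent(E, target, suffix_len=4, steps=5, rng=None):
    """Toy coordinate search over a discrete vocabulary.

    E:      (V, d) token embedding table, standing in for a model's vocab.
    target: (d,) vector the mean suffix embedding should approach,
            standing in for 'make the model emit the target sequence'.
    """
    rng = np.random.default_rng(rng)
    V = E.shape[0]
    suffix = rng.integers(0, V, size=suffix_len)  # random initial suffix

    def loss(s):
        return float(np.linalg.norm(E[s].mean(axis=0) - target))

    for _ in range(steps):
        for pos in range(suffix_len):      # optimize one position at a time
            best_tok, best_loss = suffix[pos], loss(suffix)
            for tok in range(V):           # real GCG shortlists via gradients
                cand = suffix.copy()
                cand[pos] = tok
                cand_loss = loss(cand)
                if cand_loss < best_loss:
                    best_tok, best_loss = tok, cand_loss
            suffix[pos] = best_tok         # keep the best single-token swap
    return suffix, loss(suffix)
```

Because each swap is only accepted when it strictly lowers the loss, the loss is non-increasing across sweeps, mirroring the monotone-descent behavior of the gradient-guided methods in this repo.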
- Baselines (existing methods): `claudini/methods/original/`
- Claude-discovered methods:
  - Generalizable attacks (random targets): `claudini/methods/claude_random/`
  - Attacks on a safeguard model: `claudini/methods/claude_safeguard/`
See CLAUDE.md for how to implement a new method.
Citation
```bibtex
@article{panfilov2026claudini,
  title={Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs},
  author={Panfilov, Alexander and Romov, Peter and Shilov, Igor and de Montjoye, Yves-Alexandre and Geiping, Jonas and Andriushchenko, Maksym},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2026}
}
```

