
⛓️‍💥 Claudini ⛓️‍💥

Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs


Claude autoresearch vs Optuna hyperparameter search: best train and validation loss over trials

We show that an autoresearch-style pipeline powered by Claude Code discovers novel white-box adversarial attack algorithms that significantly outperform all existing methods in jailbreaking and prompt injection evaluations.

This official code repository contains a demo autoresearch pipeline, the Claude-discovered methods from the paper, baseline implementations, and the evaluation benchmark. Read our paper and consider citing us if you find this useful.

Setup

git clone https://github.com/romovpa/claudini.git
cd claudini
uv sync

Requires Python 3.12+ and uv.

Evaluate

All experiments are run via the claudini.run_bench CLI:

uv run -m claudini.run_bench --help

It takes a preset name (from configs/) or a path to a YAML file.
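A preset is just a YAML file. As a rough illustration of the shape such a file might take (every key below is hypothetical; consult the actual files in configs/ for the real schema):

```yaml
# Hypothetical preset — field names are illustrative only,
# not the repo's actual config schema.
methods: [gcg, acg]
model: some-target-model
max_flops: 1.0e15
seeds: [0, 1, 2]
```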

Config settings can be overridden with CLI options. For example, to evaluate GCG and ACG on the random targets track with a custom FLOPs budget:

uv run -m claudini.run_bench random_valid --method gcg,acg --max-flops 1e15

Results are saved to results/<method>/<preset>/<model>/sample_<S>_seed_<N>.json. Existing results are skipped automatically.
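Because each run lands in its own per-sample JSON file, aggregating results is a short script. A minimal sketch, assuming only the directory layout above (the contents of each JSON file are not assumed here; adapt the reader to the real result schema):

```python
# Sketch: walk results/<method>/<preset>/<model>/sample_<S>_seed_<N>.json
# and report how many finished samples each method has.
import json
import re
from pathlib import Path

def summarize(results_dir="results"):
    pattern = re.compile(r"sample_(\d+)_seed_(\d+)\.json")
    per_method = {}
    for path in Path(results_dir).glob("*/*/*/*.json"):
        if not pattern.fullmatch(path.name):
            continue
        method = path.parts[-4]  # results/<method>/<preset>/<model>/<file>
        per_method.setdefault(method, []).append(json.loads(path.read_text()))
    return {method: len(runs) for method, runs in per_method.items()}
```

Pointing summarize() at a results directory returns a {method: sample_count} mapping, which is enough to spot incomplete sweeps before comparing methods.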

Run Autoresearch

Autoresearch loop: seeding, analysis-experiment cycle, evaluation

Install Claude Code, then start the loop:

claude
> /loop 10m $(cat PROMPT.txt)

Or with the ralph-loop plugin:

claude
> /ralph-loop:ralph-loop $(cat PROMPT.txt)

The loop reads PROMPT.txt, studies existing methods and results, designs a new optimizer, implements it, runs the benchmark, and commits — all autonomously. Use tmux or screen so sessions survive disconnection. Track progress via git log.

Attack Methods

We consider white-box GCG-style attacks that search directly over the model's vocabulary using gradients. Each method (TokenOptimizer) optimizes a short discrete token suffix that, when appended to an input prompt, causes the model to produce a desired target sequence.
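The search pattern these methods share can be illustrated with a toy greedy coordinate descent in the spirit of GCG. This is not the paper's method or the repo's TokenOptimizer interface: the "model" is a made-up linear objective over an embedding table, and all sizes are arbitrary. It only shows the loop structure — use the gradient through each position's one-hot token choice to propose candidate substitutions, then keep the single swap that most reduces the loss:

```python
# Toy GCG-style greedy coordinate search (illustrative only).
# Objective: choose L discrete token ids so the mean of their
# embeddings lands close to a fixed target vector.
import numpy as np

rng = np.random.default_rng(0)
V, D, L = 50, 8, 4                      # vocab size, embed dim, suffix length
E = rng.normal(size=(V, D))             # token embedding table
target = rng.normal(size=D)             # direction the toy "model" should hit

def loss(suffix_ids):
    # Squared distance between the mean suffix embedding and the target.
    return float(np.sum((E[suffix_ids].mean(axis=0) - target) ** 2))

def grad_wrt_onehots(suffix_ids):
    # d loss / d (one-hot token choice) at every position: push the
    # residual back through the embedding table. Shape (L, V); rows are
    # identical here because the mean treats positions symmetrically.
    resid = E[suffix_ids].mean(axis=0) - target          # (D,)
    return np.tile((2.0 / L) * resid @ E.T, (L, 1))      # (L, V)

suffix = rng.integers(0, V, size=L)
init_loss = loss(suffix)
k = 4                                    # candidate substitutions per position
for step in range(50):
    g = grad_wrt_onehots(suffix)
    cands = np.argsort(g, axis=1)[:, :k]  # most negative gradient first
    best = (loss(suffix), None, None)
    for pos in range(L):
        for tok in cands[pos]:
            trial = suffix.copy()
            trial[pos] = tok
            l = loss(trial)
            if l < best[0]:
                best = (l, pos, tok)
    if best[1] is None:                  # no candidate improves: converged
        break
    suffix[best[1]] = best[2]

print(init_loss, loss(suffix))
```

Real GCG-style attacks replace the toy objective with the target-sequence loss of an actual LLM and evaluate candidate swaps in batched forward passes, but the propose-from-gradient / accept-best-swap loop is the same.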

See CLAUDE.md for how to implement a new method.

Citation

@article{panfilov2026claudini,
  title = {Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs},
  author = {Alexander Panfilov and Peter Romov and Igor Shilov and Yves-Alexandre de Montjoye and Jonas Geiping and Maksym Andriushchenko},
  journal = {arXiv preprint},
  eprint = {2603.24511},
  archivePrefix = {arXiv},
  year = {2026},
  url = {https://arxiv.org/abs/2603.24511},
}
License: Apache-2.0