Peter Romov 5b6058b3c4 Initial commit
Co-Authored-By: Alexander Panfilov <sasha_pusha@mail.de>
Co-Authored-By: Claude <noreply@anthropic.com>
2026-03-25 02:09:26 +00:00

⛓️‍💥 Claudini ⛓️‍💥

Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs


Figure: Claude autoresearch vs Optuna hyperparameter search: best train and validation loss over trials.

We show that an autoresearch-style pipeline powered by Claude Code discovers novel white-box adversarial attack algorithms that significantly outperform all existing methods in jailbreaking and prompt injection evaluations.

This official code repository contains a demo autoresearch pipeline, the Claude-discovered methods from the paper, baseline implementations, and the evaluation benchmark. Read our paper and consider citing us if you find this useful.

Setup

git clone https://github.com/romovpa/claudini.git
cd claudini
uv sync

Requires Python 3.12+ and uv.

Evaluate

All experiments are run via the claudini.run_bench CLI:

uv run -m claudini.run_bench --help

It takes a preset name (from configs/) or a path to a YAML file.
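A preset is a YAML file; the sketch below shows roughly what an override file might contain. The keys here are illustrative guesses mirroring the CLI flags in this README, not the repo's actual schema — check configs/ for real examples.

```yaml
# Hypothetical preset override -- see configs/ for the actual schema.
method: [gcg, acg]   # methods to evaluate
max_flops: 1.0e15    # compute budget per attack
seed: 0
```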

Config settings can be overridden with CLI options. For example, to evaluate methods on the random targets track while overriding the FLOPs budget:

uv run -m claudini.run_bench random_valid --method gcg,acg --max-flops 1e15

Results are saved to results/<method>/<preset>/<model>/sample_<S>_seed_<N>.json. Existing results are auto-skipped.
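The result path encodes the run's metadata. A small helper — not part of the repo, just a sketch based on the path pattern above — can recover it for post-hoc analysis:

```python
import re
from pathlib import PurePosixPath

# Pattern from the README: results/<method>/<preset>/<model>/sample_<S>_seed_<N>.json
_RESULT_RE = re.compile(r"sample_(?P<sample>\d+)_seed_(?P<seed>\d+)\.json$")

def parse_result_path(path: str) -> dict:
    """Split a results path into its method/preset/model/sample/seed parts."""
    p = PurePosixPath(path)
    m = _RESULT_RE.match(p.name)
    if m is None:
        raise ValueError(f"not a result file: {path}")
    return {
        "method": p.parent.parent.parent.name,
        "preset": p.parent.parent.name,
        "model": p.parent.name,
        "sample": int(m["sample"]),
        "seed": int(m["seed"]),
    }
```

For example, parse_result_path("results/gcg/random_valid/llama3/sample_3_seed_0.json") yields method "gcg", preset "random_valid", model "llama3", sample 3, seed 0.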

Run Autoresearch

Figure: Autoresearch loop: seeding, analysis-experiment cycle, evaluation.

Install Claude Code, then start the loop:

claude
> /loop 10m $(cat PROMPT.txt)

Or with the ralph-loop plugin:

claude
> /ralph-loop:ralph-loop $(cat PROMPT.txt)

The loop reads PROMPT.txt, studies existing methods and results, designs a new optimizer, implements it, runs the benchmark, and commits — all autonomously. Use tmux or screen so sessions survive disconnection. Track progress via git log.

Attack Methods

We consider white-box GCG-style attacks that search directly over the model's vocabulary using gradients. Each method (TokenOptimizer) optimizes a short discrete token suffix that, when appended to an input prompt, causes the model to produce a desired target sequence.
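As a toy illustration of this discrete search loop — not the repo's implementation, and using exhaustive per-position scoring where real GCG-style methods use gradients with respect to one-hot token embeddings to propose candidates:

```python
import random

def coordinate_descent_attack(loss_fn, vocab_size, suffix_len, sweeps=2, seed=0):
    """Greedily optimize a discrete token suffix, one position at a time.

    loss_fn maps a list of token ids to a scalar; lower is better.
    Real GCG-style methods rank candidate substitutions with embedding
    gradients instead of scoring every vocabulary token (toy only).
    """
    rng = random.Random(seed)
    suffix = [rng.randrange(vocab_size) for _ in range(suffix_len)]
    best = loss_fn(suffix)
    for _ in range(sweeps):
        for pos in range(suffix_len):
            for tok in range(vocab_size):  # exhaustive scan over the vocabulary
                cand = suffix.copy()
                cand[pos] = tok
                cand_loss = loss_fn(cand)
                if cand_loss < best:
                    best, suffix = cand_loss, cand
    return suffix, best

# Toy objective: match a fixed "target" suffix, standing in for the real
# objective of forcing the model to emit a target sequence.
target = [3, 1, 4, 1, 5]
hamming = lambda s: sum(a != b for a, b in zip(s, target))
suffix, loss = coordinate_descent_attack(hamming, vocab_size=8, suffix_len=5)
```

Since the toy loss is separable per position, a single coordinate sweep already drives it to zero; the gradient-guided candidate selection in real methods exists precisely because LLM losses are not separable and the vocabulary is far too large to scan.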

See CLAUDE.md for how to implement a new method.
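The sketch below is purely a guess at what such an interface might look like (class and method names are hypothetical); CLAUDE.md defines the actual contract:

```python
import random
from abc import ABC, abstractmethod

class TokenOptimizer(ABC):
    """Hypothetical sketch of a method interface; CLAUDE.md is authoritative.

    A method receives a loss over candidate suffixes and returns the best
    suffix found within its budget.
    """

    @abstractmethod
    def optimize(self, loss_fn, init_suffix: list[int]) -> list[int]:
        ...

class RandomSearch(TokenOptimizer):
    """Trivial baseline: mutate one random position per step, keep improvements."""

    def __init__(self, vocab_size: int, steps: int = 100, seed: int = 0):
        self.rng = random.Random(seed)
        self.vocab_size, self.steps = vocab_size, steps

    def optimize(self, loss_fn, init_suffix):
        best, best_loss = list(init_suffix), loss_fn(init_suffix)
        for _ in range(self.steps):
            cand = best.copy()
            cand[self.rng.randrange(len(cand))] = self.rng.randrange(self.vocab_size)
            cand_loss = loss_fn(cand)
            if cand_loss < best_loss:
                best, best_loss = cand, cand_loss
        return best
```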

Citation

@article{panfilov2026claudini,
  title={Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs},
  author={Panfilov, Alexander and Romov, Peter and Shilov, Igor and de Montjoye, Yves-Alexandre and Geiping, Jonas and Andriushchenko, Maksym},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2026}
}
License: Apache-2.0