⛓️‍💥 Claudini ⛓️‍💥

Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs


[Figure: Pareto front with evolution annotations]

We show that an autoresearch-style pipeline powered by Claude Code discovers novel white-box adversarial attack algorithms that significantly outperform all existing methods in jailbreaking and prompt injection evaluations.

This official code repository contains a demo autoresearch pipeline, the Claude-discovered methods from the paper, baseline implementations, and the evaluation benchmark. Read our paper and consider citing us if you find this useful.

Setup

git clone https://github.com/romovpa/claudini.git
cd claudini
uv sync

Requires Python 3.12+ and uv.

Discover Your Own SOTA Attack

[Figure: Autoresearch loop — seeding, analysis-experiment cycle, evaluation]

To run autoresearch, open Claude Code and start the /claudini skill in a loop:

claude
> /loop /claudini my_run break Qwen2.5-7B on random strings under 1e15 FLOPs

On each iteration, Claude studies existing methods and results, designs a new optimizer, benchmarks it, and commits, maintaining an agent log across iterations. The run code (my_run above) isolates the method chain, git branch, and log.

See the full skill prompt for details.

Use tmux or screen so sessions survive disconnection. Track progress via git log.

Evaluate

All experiments are run via the claudini.run_bench CLI:

uv run -m claudini.run_bench --help

It takes a preset name (from configs/) or a path to a YAML file.

Config settings can be overridden with CLI options. For example, to evaluate methods on the random targets track, override the FLOPs budget:

uv run -m claudini.run_bench random_valid --method gcg,acg --max-flops 1e15

Results are saved to results/<method>/<preset>/<model>/sample_<S>_seed_<N>.json. Existing results are auto-skipped.
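Downloaded or freshly generated result files can be aggregated with a short script along these lines (a minimal sketch: the directory layout follows the pattern above, but the JSON schema is an assumption — each file is assumed to carry a top-level `success` boolean, so adjust the field name to the actual format):

```python
import json
from collections import defaultdict
from pathlib import Path

def collect_results(root="results"):
    """Group result files by (method, preset, model), following the
    results/<method>/<preset>/<model>/sample_<S>_seed_<N>.json layout."""
    grouped = defaultdict(list)
    for path in Path(root).glob("*/*/*/sample_*_seed_*.json"):
        method, preset, model = path.parts[-4:-1]
        with open(path) as f:
            grouped[(method, preset, model)].append(json.load(f))
    return grouped

def success_rate(records, key="success"):
    """Fraction of records whose (assumed) `success` field is truthy."""
    if not records:
        return 0.0
    return sum(bool(r.get(key)) for r in records) / len(records)
```

Because existing results are auto-skipped, a script like this can safely run alongside an in-progress benchmark sweep.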

Precomputed results from the paper are available as a GitHub release. Download and unzip claudini-results.zip into the repo root.

Attack Methods

We consider white-box GCG-style attacks that search directly over the model's vocabulary using gradients. Each method (TokenOptimizer) optimizes a short discrete token suffix that, when appended to an input prompt, causes the model to produce a desired target sequence.
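The search loop behind such attacks can be illustrated on a toy problem (a minimal sketch, not the repository's implementation: real GCG-style methods rank candidate token swaps using gradients with respect to one-hot token embeddings, while this stand-in scores every swap by exhaustively evaluating a toy loss):

```python
import random

def greedy_coordinate_search(loss, vocab, suffix_len=5, sweeps=10, seed=0):
    """Toy GCG-style loop: sweep over suffix positions and keep the
    vocabulary token that most reduces the loss of the full suffix.
    `loss` maps a token tuple to a float to be minimized."""
    rng = random.Random(seed)
    suffix = [rng.choice(vocab) for _ in range(suffix_len)]
    best = loss(tuple(suffix))
    for _ in range(sweeps):
        for pos in range(suffix_len):          # coordinate to mutate
            for tok in vocab:                  # try every replacement
                cand = suffix[:pos] + [tok] + suffix[pos + 1:]
                val = loss(tuple(cand))
                if val < best:
                    best, suffix = val, cand
    return suffix, best

# Stand-in objective: Hamming distance to a hidden "target" suffix.
target = (3, 1, 4, 1, 5)
toy_loss = lambda s: sum(a != b for a, b in zip(s, target))
suffix, best = greedy_coordinate_search(toy_loss, vocab=list(range(8)))
```

On this separable toy loss a single sweep already recovers the target; the actual benchmark instead measures whether the optimized suffix elicits the target sequence from the model within the FLOPs budget.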

See CLAUDE.md for how to implement a new method.
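The exact interface is documented in CLAUDE.md; as a rough, hypothetical sketch of the shape such a method takes (the method signatures and the RandomSwap example below are assumptions for illustration, not the repository's actual API):

```python
from abc import ABC, abstractmethod

class TokenOptimizer(ABC):
    """Hypothetical base-class shape: optimize a discrete token suffix
    that steers the model toward a desired target sequence."""

    @abstractmethod
    def step(self, suffix, grad):
        """Return an updated suffix given per-position gradients over
        the vocabulary (lower = predicted loss decrease)."""

class SteepestSwap(TokenOptimizer):
    """Trivial example method: swap the single position/token pair with
    the most negative gradient entry."""

    def step(self, suffix, grad):
        # grad[pos][tok]: predicted loss change from placing tok at pos
        pos = min(range(len(suffix)), key=lambda p: min(grad[p]))
        tok = min(range(len(grad[pos])), key=lambda t: grad[pos][t])
        return suffix[:pos] + [tok] + suffix[pos + 1:]
```

A real method would plug into the benchmark's FLOPs accounting and candidate-evaluation machinery described in CLAUDE.md.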

Citation

@article{panfilov2026claudini,
  title = {Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs},
  author = {Alexander Panfilov and Peter Romov and Igor Shilov and Yves-Alexandre de Montjoye and Jonas Geiping and Maksym Andriushchenko},
  journal = {arXiv preprint},
  eprint = {2603.24511},
  archivePrefix = {arXiv},
  year = {2026},
  url = {https://arxiv.org/abs/2603.24511},
}
License: Apache-2.0