
⛓️‍💥 Claudini ⛓️‍💥

Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs


Claude autoresearch vs Optuna hyperparameter search: best train and validation loss over trials

We show that an autoresearch-style pipeline powered by Claude Code discovers novel white-box adversarial attack algorithms that significantly outperform all existing methods in jailbreaking and prompt injection evaluations.

This official code repository contains a demo autoresearch pipeline, the Claude-discovered methods from the paper, baseline implementations, and the evaluation benchmark. Read our paper and consider citing us if you find this useful.

Setup

git clone https://github.com/romovpa/claudini.git
cd claudini
uv sync

Requires Python 3.12+ and uv.

Evaluate

All experiments are run via the claudini.run_bench CLI:

uv run -m claudini.run_bench --help

It takes a preset name (from configs/) or a path to a YAML file.
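A preset is just a YAML file. As a rough illustration of the shape such a file might take (every key below is hypothetical; consult the actual files in configs/ for the real schema):

```yaml
# Hypothetical preset — field names are illustrative only,
# not the repo's actual config schema.
methods: [gcg, acg]
model: some-target-model
max_flops: 1.0e15
seeds: [0, 1, 2]
```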

Config settings can be overridden with CLI options. For example, to evaluate GCG and ACG on the random targets track with a custom FLOPs budget:

uv run -m claudini.run_bench random_valid --method gcg,acg --max-flops 1e15

Results are saved to results/<method>/<preset>/<model>/sample_<S>_seed_<N>.json. Existing results are skipped automatically.
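Because each run lands in its own per-sample JSON file, aggregating results is a short script. A minimal sketch, assuming only the directory layout above (the contents of each JSON file are not assumed here; adapt the reader to the real result schema):

```python
# Sketch: walk results/<method>/<preset>/<model>/sample_<S>_seed_<N>.json
# and report how many finished samples each method has.
import json
import re
from pathlib import Path

def summarize(results_dir="results"):
    pattern = re.compile(r"sample_(\d+)_seed_(\d+)\.json")
    per_method = {}
    for path in Path(results_dir).glob("*/*/*/*.json"):
        if not pattern.fullmatch(path.name):
            continue
        method = path.parts[-4]  # results/<method>/<preset>/<model>/<file>
        per_method.setdefault(method, []).append(json.loads(path.read_text()))
    return {method: len(runs) for method, runs in per_method.items()}
```

Pointing summarize() at a results directory returns a {method: sample_count} mapping, which is enough to spot incomplete sweeps before comparing methods.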

Run Autoresearch

Autoresearch loop: seeding, analysis-experiment cycle, evaluation

Install Claude Code, then start the loop:

claude
> /loop 10m $(cat PROMPT.txt)

Or with the ralph-loop plugin:

claude
> /ralph-loop:ralph-loop $(cat PROMPT.txt)

The loop reads PROMPT.txt, studies existing methods and results, designs a new optimizer, implements it, runs the benchmark, and commits — all autonomously. Use tmux or screen so sessions survive disconnection. Track progress via git log.

Attack Methods

We consider white-box GCG-style attacks that search directly over the model's vocabulary using gradients. Each method (TokenOptimizer) optimizes a short discrete token suffix that, when appended to an input prompt, causes the model to produce a desired target sequence.
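The search pattern these methods share can be illustrated with a toy greedy coordinate descent in the spirit of GCG. This is not the paper's method or the repo's TokenOptimizer interface: the "model" is a made-up linear objective over an embedding table, and all sizes are arbitrary. It only shows the loop structure — use the gradient through each position's one-hot token choice to propose candidate substitutions, then keep the single swap that most reduces the loss:

```python
# Toy GCG-style greedy coordinate search (illustrative only).
# Objective: choose L discrete token ids so the mean of their
# embeddings lands close to a fixed target vector.
import numpy as np

rng = np.random.default_rng(0)
V, D, L = 50, 8, 4                      # vocab size, embed dim, suffix length
E = rng.normal(size=(V, D))             # token embedding table
target = rng.normal(size=D)             # direction the toy "model" should hit

def loss(suffix_ids):
    # Squared distance between the mean suffix embedding and the target.
    return float(np.sum((E[suffix_ids].mean(axis=0) - target) ** 2))

def grad_wrt_onehots(suffix_ids):
    # d loss / d (one-hot token choice) at every position: push the
    # residual back through the embedding table. Shape (L, V); rows are
    # identical here because the mean treats positions symmetrically.
    resid = E[suffix_ids].mean(axis=0) - target          # (D,)
    return np.tile((2.0 / L) * resid @ E.T, (L, 1))      # (L, V)

suffix = rng.integers(0, V, size=L)
init_loss = loss(suffix)
k = 4                                    # candidate substitutions per position
for step in range(50):
    g = grad_wrt_onehots(suffix)
    cands = np.argsort(g, axis=1)[:, :k]  # most negative gradient first
    best = (loss(suffix), None, None)
    for pos in range(L):
        for tok in cands[pos]:
            trial = suffix.copy()
            trial[pos] = tok
            l = loss(trial)
            if l < best[0]:
                best = (l, pos, tok)
    if best[1] is None:                  # no candidate improves: converged
        break
    suffix[best[1]] = best[2]

print(init_loss, loss(suffix))
```

Real GCG-style attacks replace the toy objective with the target-sequence loss of an actual LLM and evaluate candidate swaps in batched forward passes, but the propose-from-gradient / accept-best-swap loop is the same.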

See CLAUDE.md for how to implement a new method.

Citation

@article{panfilov2026claudini,
  title = {Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs},
  author = {Alexander Panfilov and Peter Romov and Igor Shilov and Yves-Alexandre de Montjoye and Jonas Geiping and Maksym Andriushchenko},
  journal = {arXiv preprint},
  eprint = {2603.24511},
  archivePrefix = {arXiv},
  year = {2026},
  url = {https://arxiv.org/abs/2603.24511},
}
License: Apache-2.0