Peter Romov 5b6058b3c4 Initial commit
Co-Authored-By: Alexander Panfilov <sasha_pusha@mail.de>
Co-Authored-By: Claude <noreply@anthropic.com>
2026-03-25 02:09:26 +00:00

⛓️‍💥 Claudini ⛓️‍💥

Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs


Figure: Claude autoresearch vs Optuna hyperparameter search: best train and validation loss over trials.

We show that an autoresearch-style pipeline powered by Claude Code discovers novel white-box adversarial attack algorithms that significantly outperform all existing methods in jailbreaking and prompt injection evaluations.

This official code repository contains a demo autoresearch pipeline, the Claude-discovered methods from the paper, baseline implementations, and the evaluation benchmark. Read our paper and consider citing us if you find this useful.

Setup

git clone https://github.com/romovpa/claudini.git
cd claudini
uv sync

Requires Python 3.12+ and uv.

Evaluate

All experiments are run via the claudini.run_bench CLI:

uv run -m claudini.run_bench --help

It takes a preset name (from configs/) or a path to a YAML file.
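A preset is a YAML file; the sketch below shows roughly what an override file might contain. The keys here are illustrative guesses mirroring the CLI flags in this README, not the repo's actual schema — check configs/ for real examples.

```yaml
# Hypothetical preset override -- see configs/ for the actual schema.
method: [gcg, acg]   # methods to evaluate
max_flops: 1.0e15    # compute budget per attack
seed: 0
```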

Config settings can be overridden with CLI options. For example, to evaluate methods on the random targets track while overriding the FLOPs budget:

uv run -m claudini.run_bench random_valid --method gcg,acg --max-flops 1e15

Results are saved to results/<method>/<preset>/<model>/sample_<S>_seed_<N>.json. Existing results are auto-skipped.
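The result path encodes the run's metadata. A small helper — not part of the repo, just a sketch based on the path pattern above — can recover it for post-hoc analysis:

```python
import re
from pathlib import PurePosixPath

# Pattern from the README: results/<method>/<preset>/<model>/sample_<S>_seed_<N>.json
_RESULT_RE = re.compile(r"sample_(?P<sample>\d+)_seed_(?P<seed>\d+)\.json$")

def parse_result_path(path: str) -> dict:
    """Split a results path into its method/preset/model/sample/seed parts."""
    p = PurePosixPath(path)
    m = _RESULT_RE.match(p.name)
    if m is None:
        raise ValueError(f"not a result file: {path}")
    return {
        "method": p.parent.parent.parent.name,
        "preset": p.parent.parent.name,
        "model": p.parent.name,
        "sample": int(m["sample"]),
        "seed": int(m["seed"]),
    }
```

For example, parse_result_path("results/gcg/random_valid/llama3/sample_3_seed_0.json") yields method "gcg", preset "random_valid", model "llama3", sample 3, seed 0.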

Run Autoresearch

Figure: Autoresearch loop: seeding, analysis-experiment cycle, evaluation.

Install Claude Code, then start the loop:

claude
> /loop 10m $(cat PROMPT.txt)

Or with the ralph-loop plugin:

claude
> /ralph-loop:ralph-loop $(cat PROMPT.txt)

The loop reads PROMPT.txt, studies existing methods and results, designs a new optimizer, implements it, runs the benchmark, and commits — all autonomously. Use tmux or screen so sessions survive disconnection. Track progress via git log.

Attack Methods

We consider white-box GCG-style attacks that search directly over the model's vocabulary using gradients. Each method (TokenOptimizer) optimizes a short discrete token suffix that, when appended to an input prompt, causes the model to produce a desired target sequence.
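As a toy illustration of this discrete search loop — not the repo's implementation, and using exhaustive per-position scoring where real GCG-style methods use gradients with respect to one-hot token embeddings to propose candidates:

```python
import random

def coordinate_descent_attack(loss_fn, vocab_size, suffix_len, sweeps=2, seed=0):
    """Greedily optimize a discrete token suffix, one position at a time.

    loss_fn maps a list of token ids to a scalar; lower is better.
    Real GCG-style methods rank candidate substitutions with embedding
    gradients instead of scoring every vocabulary token (toy only).
    """
    rng = random.Random(seed)
    suffix = [rng.randrange(vocab_size) for _ in range(suffix_len)]
    best = loss_fn(suffix)
    for _ in range(sweeps):
        for pos in range(suffix_len):
            for tok in range(vocab_size):  # exhaustive scan over the vocabulary
                cand = suffix.copy()
                cand[pos] = tok
                cand_loss = loss_fn(cand)
                if cand_loss < best:
                    best, suffix = cand_loss, cand
    return suffix, best

# Toy objective: match a fixed "target" suffix, standing in for the real
# objective of forcing the model to emit a target sequence.
target = [3, 1, 4, 1, 5]
hamming = lambda s: sum(a != b for a, b in zip(s, target))
suffix, loss = coordinate_descent_attack(hamming, vocab_size=8, suffix_len=5)
```

Since the toy loss is separable per position, a single coordinate sweep already drives it to zero; the gradient-guided candidate selection in real methods exists precisely because LLM losses are not separable and the vocabulary is far too large to scan.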

See CLAUDE.md for how to implement a new method.
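The sketch below is purely a guess at what such an interface might look like (class and method names are hypothetical); CLAUDE.md defines the actual contract:

```python
import random
from abc import ABC, abstractmethod

class TokenOptimizer(ABC):
    """Hypothetical sketch of a method interface; CLAUDE.md is authoritative.

    A method receives a loss over candidate suffixes and returns the best
    suffix found within its budget.
    """

    @abstractmethod
    def optimize(self, loss_fn, init_suffix: list[int]) -> list[int]:
        ...

class RandomSearch(TokenOptimizer):
    """Trivial baseline: mutate one random position per step, keep improvements."""

    def __init__(self, vocab_size: int, steps: int = 100, seed: int = 0):
        self.rng = random.Random(seed)
        self.vocab_size, self.steps = vocab_size, steps

    def optimize(self, loss_fn, init_suffix):
        best, best_loss = list(init_suffix), loss_fn(init_suffix)
        for _ in range(self.steps):
            cand = best.copy()
            cand[self.rng.randrange(len(cand))] = self.rng.randrange(self.vocab_size)
            cand_loss = loss_fn(cand)
            if cand_loss < best_loss:
                best, best_loss = cand, cand_loss
        return best
```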

Citation

@article{panfilov2026claudini,
  title={Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs},
  author={Panfilov, Alexander and Romov, Peter and Shilov, Igor and de Montjoye, Yves-Alexandre and Geiping, Jonas and Andriushchenko, Maksym},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2026}
}
License: Apache-2.0