- **PEFT adapter merging.** `model_loader.py` auto-detects PEFT adapters (e.g. `facebook/Meta-SecAlign-8B`), merges on CPU in bf16, and caches the merged model to disk. No config flags needed.
- **Configurable quantization.** `quantization:` field in YAML or `--quantization` on CLI, accepting `nf4`, `fp4`, or `int8`. Replaces the old `load_in_4bit` boolean.
- **Multi-GPU sharding.** `device_map:` in configs or `--device-map` on CLI. Config value is now correctly read from YAML presets (was previously ignored).
- **CLI overrides.** New `--model`, `--device-map`, `--quantization` flags to override preset values from the command line.
- **SecAlign injection presets.** Configs for prompt injection on Meta-SecAlign-8B and 70B (default + Optuna-tuned), using new `AlpacaInjectionSource`, which generates 3-role prompts from AlpacaFarm data with trusted/untrusted separation.
- **Fixes.** `BenchmarkRunner.summarize()` crash when all runs are skipped. System prompt suppression now works correctly (`""` suppresses model defaults, `None` omits the turn).

Co-authored-by: Peter Romov <peter@romov.com>
Co-authored-by: Alexander Panfilov <39771221+kotekjedi@users.noreply.github.com>
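As a rough illustration of the `quantization:` values listed above, the sketch below maps each accepted string to keyword arguments named after the `transformers.BitsAndBytesConfig` parameters (`load_in_4bit`, `load_in_8bit`, `bnb_4bit_quant_type`). The helper name `quantization_kwargs` is hypothetical, and the source does not confirm that the loader is backed by bitsandbytes, so treat this as one plausible reading rather than the repository's implementation.

```python
from typing import Any, Dict, Optional

# The three values the changelog says the `quantization:` field accepts.
VALID_QUANTIZATION = ("nf4", "fp4", "int8")


def quantization_kwargs(quantization: Optional[str]) -> Dict[str, Any]:
    """Hypothetical helper: translate a quantization string into
    BitsAndBytesConfig-style keyword arguments (assumed backend)."""
    if quantization is None:
        # No quantization requested: load the model in its native dtype.
        return {}
    if quantization not in VALID_QUANTIZATION:
        raise ValueError(
            f"quantization must be one of {VALID_QUANTIZATION}, got {quantization!r}"
        )
    if quantization == "int8":
        return {"load_in_8bit": True}
    # "nf4" and "fp4" are both 4-bit modes that differ only in quant type.
    return {"load_in_4bit": True, "bnb_4bit_quant_type": quantization}
```

A single string field can express all three modes, which a lone `load_in_4bit` boolean could not.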
Original Methods
Reimplementations of published token-optimization attacks, adapted to the TokenOptimizer interface.
Type: D = discrete token search, C = continuous relaxation, F = gradient-free.
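To make the gradient-free (F) category concrete, here is a minimal toy random search over token ids. The class `ToyRandomSearch`, its `step()` method, and the toy loss are all hypothetical and are not the repository's `TokenOptimizer` interface, whose actual signature is not shown here. A real attack would score candidates with a language-model loss instead of a fixed target sequence.

```python
import random
from typing import Callable, List, Tuple


class ToyRandomSearch:
    """Hypothetical gradient-free (type F) optimizer over a token-id suffix."""

    def __init__(self, loss_fn: Callable[[List[int]], float],
                 vocab_size: int, suffix_len: int, seed: int = 0) -> None:
        self.loss_fn = loss_fn
        self.vocab_size = vocab_size
        self.rng = random.Random(seed)
        # Start from a random suffix of token ids.
        self.tokens = [self.rng.randrange(vocab_size) for _ in range(suffix_len)]
        self.best_loss = loss_fn(self.tokens)

    def step(self) -> float:
        """Resample one position; keep the change only if the loss does not increase."""
        candidate = list(self.tokens)
        pos = self.rng.randrange(len(candidate))
        candidate[pos] = self.rng.randrange(self.vocab_size)
        loss = self.loss_fn(candidate)
        if loss <= self.best_loss:
            self.tokens, self.best_loss = candidate, loss
        return self.best_loss


def run_toy_search(steps: int = 200) -> Tuple[List[int], List[float]]:
    target = [3, 1, 4, 1, 5]

    # Toy loss: number of positions that differ from a fixed target sequence.
    def loss_fn(tokens: List[int]) -> float:
        return float(sum(a != b for a, b in zip(tokens, target)))

    opt = ToyRandomSearch(loss_fn, vocab_size=8, suffix_len=len(target))
    history = [opt.step() for _ in range(steps)]
    return opt.tokens, history
```

Only loss evaluations are used, never gradients. Discrete (D) methods such as GCG instead use gradients to rank candidate token swaps, and continuous (C) methods optimize a relaxed representation before projecting back to tokens.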
Methods
| Package | Methods | Full Name | Type | Year | Paper | Official Implementation |
|---|---|---|---|---|---|---|
| uat | uat | Universal Adversarial Triggers | D | 2019 | wallace2019universal | Eric-Wallace/universal-triggers |
| autoprompt | autoprompt | AutoPrompt | D | 2020 | shin2020autoprompt | ucinlp/autoprompt |
| gbda | gbda | Gradient-Based Distributional Attack | C | 2021 | guo2021gradient | facebookresearch/text-adversarial-attack |
| arca | arca | Autoregressive Randomized Coordinate Ascent | D | 2023 | jones2023arca | ejones313/auditing-llms |
| gcg | gcg | Greedy Coordinate Gradient | D | 2023 | zou2023universal | llm-attacks/llm-attacks |
| lls | lls | Lapid-Langberg-Sipper genetic algorithm | F | 2023 | lapid2024open | |
| pez | pez | Prompts made EaZy (Hard Prompts Made Easy) | C | 2023 | wen2023hard | YuxinWenRick/hard-prompts-made-easy |
| acg | acg | Accelerated Coordinate Gradient | D | 2024 | liu2024making | |
| adc | adc | Adaptive Dense-to-sparse Constrained optimization | C | 2024 | hu2024efficient | hukkai/adc_llm_attack |
| attn_gcg | attngcg | Attention-enhanced GCG | D | 2024 | wang2024attngcg | UCSC-VLAA/AttnGCG-attack |
| beast | beast | Beam Search-based Adversarial Attack | D | 2024 | sadasivan2024beast | vinusankars/BEAST |
| bon | bon | Best-of-N | F | 2024 | hughes2024bon | jplhughes/bon-jailbreaking |
| cold_attack | cold_attack | COLD-Attack (Langevin Dynamics in Logit Space) | C | 2024 | guo2024cold | Yu-Fangxu/COLD-Attack |
| degcg | degcg | Iterative Decoupled GCG | D | 2024 | liu2024advancing | Waffle-Liu/DeGCG |
| faster_gcg | faster_gcg | Faster GCG | D | 2024 | li2024faster | |
| gcg_pp | gcg_pp | GCG++ | D | 2024 | sitawarin2024pal | chawins/pal |
| i_gcg | i_gcg_lsgm, i_gcg_lila, i_gcg | Improved GCG (LSGM / LILA / Combined) | D | 2024 | li2024improved | qizhangli/Gradient-based-Jailbreak-Attacks |
| mac | mac | Momentum-Accelerated GCG | D | 2024 | zhang2024boosting | weizeming/momentum-attack-llm |
| magic | magic | Model Attack Gradient Index GCG | D | 2024 | li2024exploiting | jiah-li/magic |
| mc_gcg | mc_gcg | MC-GCG (Progressive Multi-Coordinate Merging) | D | 2024 | jia2025improved | jiaxiaojunQAQ/I-GCG |
| pgd | pgd, pgd_vanilla | Projected Gradient Descent | C | 2024 | geisler2024pgd | sigeisler/reinforce-attacks-llms |
| probe_sampling | probe_sampling | Probe Sampling | D | 2024 | zhao2024accelerating | zhaoyiran924/Probe-Sampling |
| prs | prs | Random Search | F | 2024 | andriushchenko2024jailbreaking | tml-epfl/llm-adaptive-attacks |
| reg_relax | reg_relax | Regularized Relaxation | C | 2024 | chacko2024adversarial | sj21j/Regularized_Relaxation |
| egd | egd | Exponentiated Gradient Descent | C | 2025 | biswas2025adversarial | sbamit/Exponentiated-Gradient-Descent-LLM-Attack |
| mask_gcg | mask_gcg | Mask-GCG (Learnable Token Masks on GCG) | D | 2025 | mu2025maskgcg | Junjie-Mu/Mask-GCG |
| reinforce | reinforce_gcg, reinforce_pgd | REINFORCE Adversarial Attacks | D/C | 2025 | geisler2025reinforce | sigeisler/reinforce-attacks-llms |
| slot_gcg | slot_gcg | Slot GCG | D | 2025 | jeong2025slotgcg | youai058/SlotGCG |
| sm_gcg | sm_gcg | Spatial Momentum GCG | D | 2025 | gu2025smgcg | |
| tgcg | tgcg | Temperature-annealed GCG | D | 2025 | tan2025resurgence | |
| rails | rails | RAILS (Random Iterative Local Search) | F | 2026 | nurlanov2026jailbreaking | |
| tao | tao | TAO-Attack (Direction-Priority Token Optimization) | D | 2026 | xu2026tao | ZevineXu/TAO-Attack |