Implement a YAML-based rule system for defining attack patterns and success
conditions, inspired by Promptmap's 50+ YAML rule definitions.
Features:
- AttackRule model with name, type, severity, prompt, pass/fail conditions
- RuleLoader for parsing YAML files with validation
- Support for recursive directory loading and filtering by type/severity
- Template variable substitution in prompts
- Dataset integration for converting rules to ProbeDataset format
- YAMLRulesDatasetLoader for loading rules from multiple directories
Tested with 47 unit tests covering models, loader, and dataset integration.
Successfully loads 69 rules from promptmap research directory.
Add LLM-based refusal classifier inspired by Promptmap's dual-LLM
architecture. The controller LLM evaluates whether an attack succeeded
by analyzing the target's response against pass/fail conditions.
- Create LLMRefusalClassifier plugin integrating with existing system
- Support OpenAI and Anthropic providers with lazy initialization
- Add configurable system prompts and pass/fail conditions
- Include 20 unit tests for comprehensive coverage