Implement hybrid refusal classifier combining multiple detection methods:
- Add confidence scoring to refusal detection (HybridResult)
- Implement weighted voting with configurable thresholds
- Support require_unanimous mode for strict classification
- Add factory function create_hybrid_classifier for common setup
- Include 32 unit tests with table-driven test patterns
Create unified provider abstraction layer for direct LLM integrations beyond
HTTP specs, inspired by FuzzyAI's comprehensive provider system.
- Add BaseLLMProvider abstract class with standard interface (generate, chat,
sync_generate, sync_chat methods)
- Implement OpenAIProvider supporting chat completions API
- Implement AnthropicProvider supporting messages API
- Create provider factory for instantiation by name (create_provider,
get_provider_class)
- Add 60 unit tests covering all provider implementations
Implement FuzzNode and FuzzChain classes for multi-step attack chains
with pipe operator syntax, inspired by FuzzyAI architecture.
- FuzzNode: Single LLM call with {var} template substitution
- FuzzChain: Sequential execution passing output as input
- Pipe operator (|) for composing nodes into chains
- LLMProvider protocol for provider abstraction
- 22 unit tests covering composition and execution
Implement a YAML-based rule system for defining attack patterns and success
conditions, inspired by Promptmap's 50+ YAML rule definitions.
Features:
- AttackRule model with name, type, severity, prompt, pass/fail conditions
- RuleLoader for parsing YAML files with validation
- Support for recursive directory loading and filtering by type/severity
- Template variable substitution in prompts
- Dataset integration for converting rules to ProbeDataset format
- YAMLRulesDatasetLoader for loading rules from multiple directories
Tested with 47 unit tests covering models, loader, and dataset integration.
Successfully loads 69 rules from promptmap research directory.
Add LLM-based refusal classifier inspired by Promptmap's dual-LLM
architecture. The controller LLM evaluates whether an attack succeeded
by analyzing the target's response against pass/fail conditions.
- Create LLMRefusalClassifier plugin integrating with existing system
- Support OpenAI and Anthropic providers with lazy initialization
- Add configurable system prompts and pass/fail conditions
- Include 20 unit tests for comprehensive coverage