mirror of
https://github.com/msoedov/agentic_security.git
synced 2026-06-24 06:09:55 +02:00
32f103acbc
Add LLM-based refusal classifier inspired by Promptmap's dual-LLM architecture. The controller LLM evaluates whether an attack succeeded by analyzing the target's response against pass/fail conditions.
- Create LLMRefusalClassifier plugin integrating with existing system
- Support OpenAI and Anthropic providers with lazy initialization
- Add configurable system prompts and pass/fail conditions
- Include 20 unit tests for comprehensive coverage
24 lines
1.1 KiB
Plaintext
## Codebase Patterns
- Use async-first patterns for all I/O operations (httpx, asyncio)
- Plugin architecture with abstract base classes for extensibility
- Use table-driven tests with inline_snapshot where possible
- Tests are organized in tests/unit/, tests/integration/, tests/system/
- Use absolute imports, avoid relative imports except within same module
- Minimize docstrings, use quick returns, avoid abstractions without proven need
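
A minimal sketch of how the async-first and plugin patterns above might combine, using only the standard library. `ProbePlugin`, `EchoPlugin`, `UpperPlugin`, and `run_all` are hypothetical names chosen for illustration and are not taken from the repository; `asyncio.sleep(0)` stands in for a real network call such as one made with httpx. Calling `asyncio.run(run_all([EchoPlugin(), UpperPlugin()], "hi"))` would return one result per plugin, keyed by plugin name.

```python
import asyncio
from abc import ABC, abstractmethod


class ProbePlugin(ABC):
    """Hypothetical plugin interface; not a class from the repository."""

    name: str = "base"

    @abstractmethod
    async def run(self, prompt: str) -> str:
        """Perform the plugin's I/O and return its result."""


class EchoPlugin(ProbePlugin):
    name = "echo"

    async def run(self, prompt: str) -> str:
        # asyncio.sleep(0) is a placeholder for a real awaitable I/O call.
        await asyncio.sleep(0)
        return f"echo:{prompt}"


class UpperPlugin(ProbePlugin):
    name = "upper"

    async def run(self, prompt: str) -> str:
        await asyncio.sleep(0)
        return prompt.upper()


async def run_all(plugins: list[ProbePlugin], prompt: str) -> dict[str, str]:
    # Quick return: nothing to schedule.
    if not plugins:
        return {}
    # gather() runs the coroutines concurrently and preserves input order.
    results = await asyncio.gather(*(p.run(prompt) for p in plugins))
    return {p.name: r for p, r in zip(plugins, results)}
```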

---

## 2026-01-28 - Initial Setup
- Created PRD with 5 user stories based on research from promptfoo, Promptmap, and FuzzyAI
- Key improvements identified:
  1. Dual-LLM evaluation (from Promptmap)
  2. YAML rule system (from Promptmap)
  3. Composable fuzzing chains (from FuzzyAI)
  4. Unified provider abstraction (from FuzzyAI)
  5. Hybrid refusal detection (combining approaches)
- **Learnings for future iterations:**
  - Existing refusal detection is in agentic_security/probe_actor/refusal.py, built on the RefusalClassifierPlugin system
  - Attack data modules are in agentic_security/probe_data/modules/
  - Security utilities are in agentic_security/core/security.py
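
One way the hybrid refusal detection and dual-LLM evaluation ideas listed above could fit together, sketched under stated assumptions: a cheap marker check runs first, and a controller "judge" is consulted only when the markers find nothing. All names here (`REFUSAL_MARKERS`, `Judge`, `marker_refusal`, `is_refusal`, `stub_judge`) are hypothetical and do not reflect the actual `RefusalClassifierPlugin` API. The judge is a plain synchronous callable for brevity; a real controller LLM call would more likely be async, in line with the codebase patterns. For example, `is_refusal("p", "No.", judge=stub_judge)` defers to the stub judge because no marker matches.

```python
from typing import Callable, Optional

# Illustrative phrases only; a real marker list would be longer.
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "as an ai")

# Hypothetical signature: (attack_prompt, target_response) -> True when the
# controller decides the target refused.
Judge = Callable[[str, str], bool]


def marker_refusal(response: str) -> bool:
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def is_refusal(prompt: str, response: str, judge: Optional[Judge] = None) -> bool:
    # Cheap check first: a marker hit needs no LLM call.
    if marker_refusal(response):
        return True
    # Without a controller, the marker verdict is final.
    if judge is None:
        return False
    return judge(prompt, response)


def stub_judge(prompt: str, response: str) -> bool:
    # Stand-in for a controller LLM call: treats very short replies as refusals.
    return len(response.split()) < 3
```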
---