mirror of
https://github.com/Shiva108/ai-llm-red-team-handbook.git
synced 2026-05-26 17:47:53 +02:00
612 lines
17 KiB
Markdown
612 lines
17 KiB
Markdown
<!--
|
|
Chapter: [X]
|
|
Title: [Chapter Title]
|
|
Category: [Foundations/Technical Deep-Dives/Attack Techniques/Defense & Operations]
|
|
Difficulty: [Beginner/Intermediate/Advanced]
|
|
Estimated Time: [X minutes read time]
|
|
Hands-on: [Yes/No - includes executable code]
|
|
Prerequisites: Chapters [list prerequisite chapters if any]
|
|
Related: Chapters [list related chapters]
|
|
-->
|
|
|
|
# Chapter [X]: [Chapter Title]
|
|
|
|

|
|
|
|
_[Write a compelling 2-3 sentence abstract that: (1) describes what this chapter covers, (2) explains why it matters for AI red teaming, and (3) includes specific techniques/concepts covered. Example: "This chapter provides comprehensive coverage of [topic], including [technique 1], [technique 2], [technique 3], detection methods, defense strategies, and critical ethical considerations." Be specific and engaging.]_
|
|
|
|
## [X].1 Introduction
|
|
|
|
[Opening hook - explain the attack/topic and why it matters in the context of AI red teaming. Include a compelling narrative or real-world context.]
|
|
|
|
### Why This Matters
|
|
|
|
[Explain the significance combining impact points and real-world examples. Include:]
|
|
|
|
- Critical impact on red teaming/security
|
|
- Real-world incidents with scale/outcomes (include dollar amounts, breach scope, or impact metrics)
|
|
- Prevalence and trends in the threat landscape
|
|
- Unique challenges this technique presents
|
|
|
|
### Key Concepts
|
|
|
|
- **Concept 1:** Clear definition and relevance to red teaming
|
|
- **Concept 2:** Clear definition and relevance to red teaming
|
|
- **Concept 3:** Clear definition and relevance to red teaming
|
|
|
|
### Theoretical Foundation
|
|
|
|
#### Why This Works (Model Behavior)
|
|
|
|
[Explain what properties of transformer architecture, training methodology, or input processing enable this attack/technique. Address:]
|
|
|
|
- **Architectural Factor:** [What transformer component is exploited: attention, tokenization, embedding space, residual stream?]
|
|
- **Training Artifact:** [What aspect of pretraining, fine-tuning, or RLHF creates this vulnerability?]
|
|
- **Input Processing:** [How does the model's handling of tokens/context enable this?]
|
|
|
|
#### Foundational Research
|
|
|
|
| Paper | Key Finding | Relevance |
|
|
| ------------------------------------ | ---------------------- | ----------------------------- |
|
|
| [Author et al., Year] "[Title]"(url) | [One-sentence finding] | [How it informs this chapter] |
|
|
| [Author et al., Year] "[Title]"(url) | [One-sentence finding] | [How it informs this chapter] |
|
|
|
|
#### What This Reveals About LLMs
|
|
|
|
[2-3 sentences on broader implications for understanding model behavior]
|
|
|
|
#### Chapter Scope
|
|
|
|
We'll cover [list the major sections/topics], including practical code examples, detection methods, defense strategies, real-world case studies, and ethical considerations for authorized security testing.
|
|
|
|
---
|
|
|
|
## [X].2 [Main Topic Section 1]
|
|
|
|
[Opening paragraph: Define the topic/attack technique and explain why it's important and effective]
|
|
|
|
### How [Topic] Works
|
|
|
|
[Provide a step-by-step breakdown or ASCII diagram showing the flow]
|
|
|
|
```text
|
|
[Attack Flow or Process Diagram]
|
|
Step 1 → Step 2 → Step 3 → Impact
|
|
|
|
Example:
|
|
Attacker → [Action] → System Processes → [Result] → Victim Impacted
|
|
```
|
|
|
|
### Mechanistic Explanation
|
|
|
|
At the token/embedding level, this technique exploits:
|
|
|
|
1. **Tokenization:** [How BPE/tokenization affects this technique]
|
|
2. **Attention Dynamics:** [What happens in attention layers]
|
|
3. **Hidden State Manipulation:** [How the residual stream is affected]
|
|
|
|
### Research Basis
|
|
|
|
- **Introduced by:** [Citation with link]
|
|
- **Validated by:** [Follow-up citation]
|
|
- **Open Questions:** [What remains unknown]
|
|
|
|
### [X].2.1 [Subtopic 1]
|
|
|
|
[Detailed content about the subtopic. Use clear, professional language.]
|
|
|
|
#### Attack Variations
|
|
|
|
1. **Variation 1 Name:** Description and use case
|
|
2. **Variation 2 Name:** Description and use case
|
|
|
|
#### Practical Example: [Descriptive Name]
|
|
|
|
#### What This Code Does
|
|
|
|
[Clear description of what the code demonstrates, its purpose, and what attackers would use it for i.e. why and how it works]
|
|
|
|
#### Key Components
|
|
|
|
1. **Component 1:** Purpose and function
|
|
2. **Component 2:** Purpose and function
|
|
3. **Component 3:** Purpose and function
|
|
|
|
```python
|
|
#!/usr/bin/env python3
|
|
"""
|
|
[Script Title]
|
|
[Brief description of what this script does]
|
|
|
|
Requirements:
|
|
pip install [dependency1] [dependency2]
|
|
|
|
Usage:
|
|
python script_name.py
|
|
"""
|
|
|
|
import [required_modules]
|
|
from typing import [type_hints]
|
|
|
|
class ExampleClass:
|
|
"""[Class description]"""
|
|
|
|
def __init__(self, param1: str, param2: str = "default"):
|
|
"""
|
|
Initialize [class name]
|
|
|
|
Args:
|
|
param1: Description
|
|
param2: Description (default: "default")
|
|
"""
|
|
self.param1 = param1
|
|
self.param2 = param2
|
|
|
|
def main_method(self, input_data: str) -> dict:
|
|
"""
|
|
[Method description]
|
|
|
|
How This Works:
|
|
1. Step 1 explanation
|
|
2. Step 2 explanation
|
|
3. Step 3 explanation
|
|
|
|
Args:
|
|
input_data: Description
|
|
|
|
Returns:
|
|
Dictionary containing results
|
|
"""
|
|
# Implementation
|
|
result = {"status": "success", "data": input_data}
|
|
return result
|
|
|
|
def demonstrate_attack(self):
|
|
"""
|
|
Demonstrate [attack/technique] in action
|
|
|
|
Shows how attackers use this technique to achieve [goal]
|
|
"""
|
|
print("="*70)
|
|
print(" [DEMONSTRATION TITLE] ".center(70, "="))
|
|
print("="*70)
|
|
print("\n⚠️ WARNING: FOR EDUCATIONAL PURPOSES ONLY ⚠️\n")
|
|
|
|
# Demo implementation
|
|
print("[*] Step 1: [Description]")
|
|
print("[*] Step 2: [Description]")
|
|
print("\n" + "="*70)
|
|
|
|
# ============================================================================
|
|
# DEMO USAGE
|
|
# ============================================================================
|
|
|
|
if __name__ == "__main__":
|
|
print("[Script Name] - For educational/authorized testing only\n")
|
|
|
|
# DEMO MODE - Simulated execution
|
|
print("[DEMO MODE] Simulating [attack/technique]\n")
|
|
|
|
example = ExampleClass.__new__(ExampleClass)
|
|
example.demonstrate_attack()
|
|
|
|
print("\n[REAL USAGE - AUTHORIZED TESTING ONLY]:")
|
|
print("# example = ExampleClass(param1='value')")
|
|
print("# result = example.main_method('test_data')")
|
|
print("# print(result)")
|
|
|
|
```
|
|
|
|
## Attack Execution
|
|
|
|
```python
|
|
# Basic usage
|
|
example = ExampleClass(param1="value")
|
|
result = example.main_method(input_data)
|
|
```
|
|
|
|
## Success Metrics
|
|
|
|
- **Metric 1:** Expected measurement/outcome
|
|
- **Metric 2:** Expected measurement/outcome
|
|
- **Metric 3:** Expected measurement/outcome
|
|
|
|
## Why This Code Works
|
|
|
|
This implementation succeeds because:
|
|
|
|
1. **Effectiveness:** [Why it's effective against the target]
|
|
2. **Defense Failures:** [Why current defenses don't stop it]
|
|
3. **Model Behavior Exploited:** [Specific vulnerability]
|
|
4. **Research Basis:** [Paper documenting this behavior]
|
|
5. **Transferability:** [Does this work across models? Why/why not?]
|
|
|
|
## Key Takeaways
|
|
|
|
1. **Takeaway 1:** Specific insight about the technique
|
|
2. **Takeaway 2:** Specific insight about detection/defense
|
|
3. **Takeaway 3:** Specific insight about real-world application
|
|
|
|
## [X].3 [Detection and Mitigation]
|
|
|
|
### [X].3.1 Detection Methods
|
|
|
|
#### Detection Strategies
|
|
|
|
#### Detection Method 1: [Name]
|
|
|
|
- **What:** Clear description of detection approach
|
|
- **How:** Implementation details and tools
|
|
- **Effectiveness:** Rating and limitations
|
|
- **False Positive Rate:** Expected rate and mitigation
|
|
|
|
#### Detection Method 2: [Name]
|
|
|
|
- **What:** Clear description of detection approach
|
|
- **How:** Implementation details and tools
|
|
- **Effectiveness:** Rating and limitations
|
|
- **False Positive Rate:** Expected rate and mitigation
|
|
|
|
#### Detection Indicators
|
|
|
|
- **Indicator 1:** What to look for and significance
|
|
- **Indicator 2:** What to look for and significance
|
|
- **Indicator 3:** What to look for and significance
|
|
|
|
#### Detection Rationale
|
|
|
|
Why this detection method works:
|
|
|
|
- **Signal Exploited:** [What model behavior indicates attack]
|
|
- **Interpretability Basis:** [Reference to mechanistic interpretability research]
|
|
- **Limitations:** [What the detection cannot see and why]
|
|
|
|
#### Practical Detection Example
|
|
|
|
```python
|
|
#!/usr/bin/env python3
|
|
"""
|
|
Detection Script for [Attack Type]
|
|
Monitors for [specific indicators]
|
|
|
|
Usage:
|
|
python detect_[attack].py --log-file /path/to/logs
|
|
"""
|
|
|
|
import re
|
|
from typing import List, Dict
|
|
|
|
class AttackDetector:
|
|
"""Detect [attack type] in system logs/data"""
|
|
|
|
def __init__(self):
|
|
# Detection patterns
|
|
self.patterns = [
|
|
r"[pattern1]",
|
|
r"[pattern2]",
|
|
]
|
|
|
|
def analyze(self, log_entry: str) -> Dict:
|
|
"""
|
|
Analyze log entry for attack indicators
|
|
|
|
Returns:
|
|
Detection result with confidence score
|
|
"""
|
|
for pattern in self.patterns:
|
|
if re.search(pattern, log_entry):
|
|
return {
|
|
"detected": True,
|
|
"confidence": 0.8,
|
|
"pattern": pattern
|
|
}
|
|
|
|
return {"detected": False}
|
|
|
|
# Demo usage
|
|
if __name__ == "__main__":
|
|
detector = AttackDetector()
|
|
|
|
# Test cases
|
|
test_logs = [
|
|
"Normal activity",
|
|
"Suspicious pattern [example]"
|
|
]
|
|
|
|
for log in test_logs:
|
|
result = detector.analyze(log)
|
|
print(f"Log: {log} | Detected: {result['detected']}")
|
|
```
|
|
|
|
### [X].3.2 Mitigation and Defenses
|
|
|
|
#### Defense-in-Depth Approach
|
|
|
|
```text
|
|
Layer 1: [Prevention] → [Specific defense mechanism]
|
|
Layer 2: [Detection] → [Specific detection method]
|
|
Layer 3: [Response] → [Specific response procedure]
|
|
Layer 4: [Recovery] → [Specific recovery process]
|
|
```
|
|
|
|
#### Defense Strategy 1: [Name]
|
|
|
|
- **What:** Clear description of the defense mechanism
|
|
- **How:** Implementation details and configuration
|
|
- **Effectiveness:** Rating against different attack variants
|
|
- **Limitations:** Known weaknesses or bypass methods
|
|
- **Implementation Complexity:** Low/Medium/High
|
|
|
|
#### Implementation Example
|
|
|
|
```python
|
|
# Code showing how to implement this defense
|
|
class DefenseMechanism:
|
|
"""Implement [defense name]"""
|
|
|
|
def __init__(self, config: dict):
|
|
self.config = config
|
|
|
|
def validate_input(self, user_input: str) -> bool:
|
|
"""
|
|
Validate input against attack patterns
|
|
|
|
Returns:
|
|
True if input is safe, False otherwise
|
|
"""
|
|
# Validation logic
|
|
return True
|
|
```
|
|
|
|
## Defense Strategy 2 & 3: [Names]
|
|
|
|
[Follow the same pattern as Defense Strategy 1 above for additional defenses]
|
|
|
|
## Best Practices
|
|
|
|
1. **Practice 1:** Description and rationale
|
|
2. **Practice 2:** Description and rationale
|
|
3. **Practice 3:** Description and rationale
|
|
|
|
## Configuration Recommendations
|
|
|
|
```yaml
|
|
# Example security configuration
|
|
security_settings:
|
|
defense_1:
|
|
enabled: true
|
|
sensitivity: high
|
|
|
|
defense_2:
|
|
enabled: true
|
|
threshold: 0.8
|
|
```
|
|
|
|
## Defense Mechanism Analysis
|
|
|
|
Why this defense works (or fails):
|
|
|
|
- **Training Dynamics:** [How this affects model learning]
|
|
- **Alignment Research:** [Relevant RLHF/DPO/Constitutional AI papers]
|
|
- **Known Bypasses:** [Research documenting defense limitations]
|
|
|
|
---
|
|
|
|
## [X].4 [Advanced Techniques or Attack Patterns]
|
|
|
|
### Advanced Technique 1: [Name]
|
|
|
|
[Description of advanced technique]
|
|
|
|
### Advanced Technique 2: [Name]
|
|
|
|
[Description of advanced technique]
|
|
|
|
### Combining Techniques
|
|
|
|
[Explain how techniques can be chained or combined for greater impact]
|
|
|
|
### Technique Interaction Analysis
|
|
|
|
Why combining techniques amplifies effectiveness:
|
|
|
|
- **Technique A + B:** [Mechanistic explanation of synergy]
|
|
- **Research Support:** [Papers on attack composition]
|
|
|
|
### Theoretical Limits
|
|
|
|
- What would make this technique stop working?
|
|
- What architectural changes would mitigate this?
|
|
|
|
---
|
|
|
|
## [X].5 Research Landscape
|
|
|
|
### Seminal Papers
|
|
|
|
| Paper | Year | Venue | Contribution |
|
|
| ------- | ------ | ------- | ------------------ |
|
|
| [Title] | [Year] | [Venue] | [Key contribution] |
|
|
| [Title] | [Year] | [Venue] | [Key contribution] |
|
|
| [Title] | [Year] | [Venue] | [Key contribution] |
|
|
|
|
### Evolution of Understanding
|
|
|
|
[Timeline or narrative showing how research understanding developed]
|
|
|
|
### Current Research Gaps
|
|
|
|
1. [Open question with relevance to practitioners]
|
|
2. [Open question with relevance to practitioners]
|
|
3. [Open question with relevance to practitioners]
|
|
|
|
### Recommended Reading
|
|
|
|
### For Practitioners (by time available)
|
|
|
|
- **5 minutes:** [Paper/Blog] - Quick overview of [key concept]
|
|
- **30 minutes:** [Paper] - Practical understanding of [technique/defense]
|
|
- **Deep dive:** [Paper] - Comprehensive research on [theoretical foundation]
|
|
|
|
### By Focus Area
|
|
|
|
- **Attack Techniques:** [Paper 1] - Best for understanding [aspect]
|
|
- **Defense Mechanisms:** [Paper 2] - Best for understanding [aspect]
|
|
- **Theoretical Foundation:** [Paper 3] - Best for understanding [aspect]
|
|
|
|
---
|
|
|
|
## [X].6 [Case Studies / Real-World Examples]
|
|
|
|
### Case Study 1: [Name/Description]
|
|
|
|
#### Incident Overview (Case Study 1)
|
|
|
|
- **When:** Date/timeframe
|
|
- **Target:** Organization/system type
|
|
- **Impact:** Financial/data/reputation damage
|
|
- **Attack Vector:** How the attack was executed
|
|
|
|
#### Attack Timeline
|
|
|
|
1. **Initial Access:** How attackers gained entry
|
|
2. **Exploitation:** Techniques used
|
|
3. **Impact:** What damage occurred
|
|
4. **Discovery:** How it was detected
|
|
5. **Response:** What was done to mitigate
|
|
|
|
#### Lessons Learned (Case Study 1)
|
|
|
|
- Lesson 1: Specific takeaway
|
|
- Lesson 2: Specific takeaway
|
|
- Lesson 3: Specific takeaway
|
|
|
|
### Case Study 2: [Name/Description]
|
|
|
|
#### Incident Overview (Case Study 2)
|
|
|
|
- **When:** Date/timeframe
|
|
- **Target:** Organization/system type
|
|
- **Impact:** Financial/data/reputation damage
|
|
- **Attack Vector:** How the attack was executed
|
|
|
|
#### Key Details
|
|
|
|
[Narrative description of what happened and why it matters]
|
|
|
|
#### Lessons Learned (Case Study 2)
|
|
|
|
- Lesson 1: Specific takeaway
|
|
- Lesson 2: Specific takeaway
|
|
|
|
---
|
|
|
|
## [X].7 Conclusion
|
|
|
|
### Chapter Takeaways
|
|
|
|
1. **[Topic] is Critical:** Because [specific reason with data/examples]
|
|
2. **Detection is Challenging:** Due to [specific technical reasons]
|
|
3. **Defense Requires Layers:** No single solution is sufficient
|
|
4. **Ethical Testing is Essential:** For improving security posture
|
|
|
|
### Recommendations for Red Teamers
|
|
|
|
- **Recommendation 1:** Specific actionable advice
|
|
- **Recommendation 2:** Specific actionable advice
|
|
- **Recommendation 3:** Specific actionable advice
|
|
|
|
### Recommendations for Defenders
|
|
|
|
- **Defense Action 1:** Specific actionable advice
|
|
- **Defense Action 2:** Specific actionable advice
|
|
- **Defense Action 3:** Specific actionable advice
|
|
|
|
### Future Considerations
|
|
|
|
[Discuss emerging trends, evolving attack techniques, or upcoming defenses related to this topic]
|
|
|
|
### Next Steps
|
|
|
|
- Chapter [X+1]: [Related topic to explore next]
|
|
- Chapter [Y]: [Additional related chapter]
|
|
- Practice: Set up lab environment and test these techniques (Chapter 7)
|
|
|
|
---
|
|
|
|
## Quick Reference
|
|
|
|
### Attack Vector Summary
|
|
|
|
[1-2 sentence description of the attack technique]
|
|
|
|
### Key Detection Indicators
|
|
|
|
- [Indicator 1]
|
|
- [Indicator 2]
|
|
- [Indicator 3]
|
|
|
|
### Primary Mitigation
|
|
|
|
- [Defense 1]: [Brief description]
|
|
- [Defense 2]: [Brief description]
|
|
|
|
**Severity:** [Low/Medium/High/Critical]
|
|
**Ease of Exploit:** [Low/Medium/High]
|
|
**Common Targets:** [System types most vulnerable]
|
|
|
|
---
|
|
|
|
## Appendix A: Pre-Engagement Checklist
|
|
|
|
### [Chapter-Specific Pre-Engagement Items]
|
|
|
|
- [ ] [Specific preparation item 1]
|
|
- [ ] [Specific preparation item 2]
|
|
- [ ] [Specific preparation item 3]
|
|
|
|
## Appendix B: Post-Engagement Checklist
|
|
|
|
### [Chapter-Specific Post-Engagement Items]
|
|
|
|
- [ ] [Specific cleanup item 1]
|
|
- [ ] [Specific cleanup item 2]
|
|
- [ ] [Specific cleanup item 3]
|
|
|
|
---
|
|
|
|
<!--
|
|
TEMPLATE USAGE NOTES:
|
|
1. Replace all [X] with actual chapter number
|
|
2. Replace all [placeholders] with specific content
|
|
3. Remove sections that don't apply to your chapter
|
|
4. Add additional sections as needed for your topic
|
|
5. Include at least one practical code demonstration
|
|
6. Provide real-world examples or case studies
|
|
7. Update checklists with chapter-specific items
|
|
8. Use appropriate alert types (NOTE/TIP/IMPORTANT/WARNING/CAUTION)
|
|
|
|
ALERT USAGE EXAMPLES:
|
|
> [!NOTE] - Additional context, background information, helpful explanations
|
|
> [!TIP] - Best practices, optimization suggestions, efficiency improvements
|
|
> [!IMPORTANT] - Essential requirements, critical steps, must-know information
|
|
> [!WARNING] - Breaking changes, compatibility issues, potential problems
|
|
> [!CAUTION] - Legal/ethical warnings, high-risk actions, serious consequences
|
|
|
|
RESEARCH INTEGRATION REQUIREMENTS:
|
|
9. Every technique must include "Why This Works" mechanistic explanation
|
|
10. Minimum 3 academic citations per chapter (foundational + validation + recent)
|
|
11. Connect detection/defense methods to interpretability research
|
|
12. Include "Research Landscape" section with seminal papers
|
|
13. Flag techniques lacking research basis as "Empirically Observed (Unverified)"
|
|
14. Prefer peer-reviewed papers; mark preprints as [Preprint]
|
|
15. Include arXiv/DOI links for all citations
|
|
|
|
VISUAL ELEMENTS TO CONSIDER:
|
|
- ASCII diagrams for attack flows
|
|
- Comparison tables (Traditional vs AI-Powered)
|
|
- Code examples with explanatory comments
|
|
- Before/after examples
|
|
- Timeline diagrams for case studies
|
|
-->
|