47 KiB
Chapter 46: Conclusion and Next Steps
You've reached the end of the AI LLM Red Team Handbook. But in security, there is no end—only the next model update.
46.1 The Journey So Far
Over 46 chapters, you've mastered the complete spectrum of AI security—from foundational ethics to cutting-edge algorithmic attacks. This handbook represents the culmination of field knowledge, research, and operational experience across the entire AI red teaming discipline.
The Handbook Map
graph TD
Part1[Part I: Professional Foundations<br/>Ch 1-4] --> Part2[Part II: Project Preparation<br/>Ch 5-8]
Part2 --> Part3[Part III: Technical Fundamentals<br/>Ch 9-11]
Part3 --> Part4[Part IV: Pipeline Security<br/>Ch 12-13]
Part4 --> Part5[Part V: Attacks & Techniques<br/>Ch 14-24]
Part5 --> Part6[Part VI: Defense & Mitigation<br/>Ch 25-30]
Part6 --> Part7[Part VII: Advanced Operations<br/>Ch 31-39]
Part7 --> Part8[Part VIII: Strategic Topics<br/>Ch 40-46]
What You've Mastered
Part I: Professional Foundations (Ch 1-4)
- The fundamentals of AI red teaming and why it's critical
- Ethical frameworks, legal compliance (CFAA, GDPR), and stakeholder communication
- The red teamer's mindset: curiosity, creativity, and systematic thinking
- SOW, Rules of Engagement, and client onboarding procedures
Part II: Project Preparation (Ch 5-8)
- Threat modeling methodologies specific to AI systems
- Scoping engagements: in-scope vs out-of-scope for AI components
- Building isolated red team labs with proper safety controls
- Chain of custody and evidence handling for non-deterministic systems
Part III: Technical Fundamentals (Ch 9-11)
- LLM architectures: transformers, attention mechanisms, tokenization
- Context windows, generation parameters, and model behavior
- Plugin architectures and API integration points
Part IV: Pipeline Security (Ch 12-13)
- RAG system vulnerabilities and retrieval poisoning
- Supply chain attacks: from training data to model distribution
- Vector database security and embedding space manipulation
Part V: Attacks & Techniques (Ch 14-24)
- Prompt injection (direct, indirect, multi-turn)
- Data leakage and training data extraction
- Jailbreaking and content filter bypass
- Plugin exploitation and API abuse
- Evasion techniques against detection systems
- Data poisoning and model manipulation
- Model theft and extraction attacks
- Denial of service (token exhaustion, sponge attacks)
- Multimodal attacks (vision, audio, cross-modal)
- Social engineering via AI agents
Part VI: Defense & Mitigation (Ch 25-30)
- Adversarial machine learning and robustness
- Supply chain defense and model provenance
- Federated learning security
- Privacy-preserving techniques
- Model inversion defenses
- Backdoor detection and mitigation
Part VII: Advanced Operations (Ch 31-39)
- OSINT and reconnaissance for AI systems
- Attack automation frameworks (TAP, GCG)
- Automated red team pipelines
- Defense evasion and anti-detection
- Post-exploitation in AI environments
- Professional reporting and CVSS for AI
- Remediation strategies and patch validation
- Continuous red teaming programs
- Bug bounty hunting for AI vulnerabilities
Part VIII: Strategic Topics (Ch 40-46)
- Compliance frameworks (EU AI Act, ISO 42001, NIST AI RMF)
- Industry best practices and defense architectures
- Real-world case studies and failure analysis
- Future of AI red teaming (algorithmic attacks, formal verification)
- Emerging threats (Shadow AI, deepfakes, critical infrastructure)
- Building organizational red team programs
- This conclusion: your path forward
Your New Capabilities
You can now:
- Assess any AI system for security vulnerabilities across the full attack surface
- Execute sophisticated attacks from prompt injection to model extraction
- Automate red team operations using Python, Garak, PyRIT, and custom tools
- Report findings in compliance-friendly formats (ISO 42001, NIST AI RMF, EU AI Act)
- Design defense-in-depth architectures with input/output guardrails
- Build continuous red team programs with metrics and automation
- Navigate the legal and ethical landscape of AI security testing
- Hunt high-value bugs in bounty programs with proven methodologies
46.2 Staying Current in a Rapidly Evolving Field
AI security changes faster than any other domain. New models drop monthly, new attacks weekly, new defenses daily. Here's your strategy to stay ahead.
Daily Engagement (5-10 minutes)
Twitter/X Accounts to Follow
- @AIatMeta - Meta AI research announcements
- @OpenAI - Model releases and safety research
- @AnthropicAI - Constitutional AI and alignment research
- @GoogleDeepMind - Cutting-edge ML security
- @simonw - AI engineering and security insights
- @goodside - Creative jailbreaks and prompt research
- @mmitchell_ai - AI ethics and responsible development
- @random_walker - Privacy and AI policy
Security News Aggregators
- AI Incident Database - Real-world AI failures
- LLM Security News - Curated security updates
- Hacker News (AI) - Community discussions
Weekly Deep Dives (30-60 minutes)
ArXiv Preprints
- cs.CL (Computation and Language) - New LLM research
- cs.CR (Cryptography and Security) - Security papers
- cs.LG (Machine Learning) - ML security
Security Blogs
- Trail of Bits Blog - Advanced security research
- NCC Group Research - AI pentesting
- Anthropic Research - Alignment and safety
- OpenAI Research - Red team reports in system cards
- HiddenLayer Blog - ML supply chain security
Podcasts
- "Latent Space" - AI engineering and security
- "Practical AI" - Applied ML and deployment challenges
- "Security Now" (AI segments) - Broader security context
Monthly Learning (2-4 hours)
Conference Recordings
- DEF CON AI Village (YouTube)
- Black Hat AI Security Summit
- NeurIPS Red Teaming Competition
- ICLR Workshop on Secure and Trustworthy ML
- USENIX Security (ML Security Track)
Technical Deep Reads
- New model "System Cards" (GPT-4, Claude 3, Gemini) - Read the red team appendix
- OWASP Top 10 for LLM Applications updates
- MITRE ATLAS framework additions
- New CVEs in ML frameworks (PyTorch, TensorFlow, HuggingFace)
Hands-On Practice
- Gandalf (Lakera) - Prompt injection challenges
- HackThePrompt - Jailbreaking CTF
- Crucible (AI Village) - Red team scenarios
- Kaggle AI Security Competitions (when available)
Certifications and Formal Training
Relevant Certifications
- OSCP (Offensive Security Certified Professional) - Foundational pentesting + AI augmentation
- GIAC (GXPN, GWAPT) - Advanced exploitation skills applicable to AI
- Cloud (AWS/Azure/GCP ML Specialty) - Understanding AI deployment infrastructure
- Custom: Several organizations (SANS, etc.) are developing AI-specific security certs - watch for these in 2025-2026
University Programs
- Stanford CS 329S (ML Systems Security) - Free online materials
- UC Berkeley CS 294 (AI Security) - Public lectures
- Carnegie Mellon INI course on Adversarial ML
Community Engagement
Discord Servers
- AI Safety - General AI alignment and security
- HuggingFace - ML engineering and model security
- MLSecOps - ML security operations
Slack Communities
- OWASP AI Security & Privacy Slack
- AI Village (DEF CON)
GitHub Monitoring
- Watch repositories:
garak-ai/garak,Azure/PyRIT,NVIDIA/NeMo-Guardrails - Follow security researchers publishing LLM attack tools
46.3 Contributing to the Handbook
This handbook is open source and living. Your contributions make it better.
How to Contribute
Repository: github.com/Shiva108/ai-llm-red-team-handbook
Contribution Types
-
Bug Fixes & Typos
- Fork the repo
- Fix the issue
- Submit PR with clear description
-
New Attack Techniques
- Follow Chapter Template format (
docs/templates/Chapter_Template.md) - Include working code examples
- Provide 3+ research citations
- Add to appropriate Part in README
- Follow Chapter Template format (
-
Tool Integrations
- Add to Appendix C with installation instructions
- Provide quick-start example
- Link to official documentation
-
Case Studies
- Use real-world incidents (anonymized if needed)
- Include timeline, technical details, lessons learned
- Can be added to Chapter 42 or as standalone in
case_studies/
-
Translations
- Contact maintainers first to coordinate
- Maintain technical accuracy
- Keep code examples in English with translated comments
Quality Standards
- All code must be tested and working
- Citations required for claims
- Follow existing markdown style
- Pass
markdownlintchecks - No plagiarism - original work or properly attributed
46.4 Career Pathways in AI Security
AI security is one of the highest-growth career tracks in cybersecurity. Here's your roadmap.
Job Titles and Roles
Entry Level ($80k-$120k)
- Junior AI Security Engineer
- ML Security Analyst
- AI Red Team Associate
- LLM Security Researcher (Junior)
Mid-Level ($120k-$180k)
- AI Security Engineer
- Senior ML Security Researcher
- AI Red Team Lead
- Prompt Security Specialist
- AI Safety Engineer
Senior/Staff ($180k-$300k+)
- Principal AI Security Architect
- Director of AI Red Teaming
- Head of AI Safety
- AI Security Consultant (Independent)
- Bug Bounty Professional (AI Specialist)
FAANG Levels
- L4/L5 (Mid): $150k-$250k total comp
- L6/L7 (Senior/Staff): $300k-$500k+ total comp
- L8+ (Principal): $500k-$1M+ total comp
Building Your Portfolio
Must-Haves
-
Public GitHub Repository
- Custom AI security tools (scanner, fuzzer, analyzer)
- Automated red team scripts
- Contributions to Garak, PyRIT, or similar projects
- Well-documented, production-quality code
-
Technical Writeups
- Medium/personal blog with deep technical analysis
- 3-5 detailed posts on:
- Novel attack technique you discovered
- Tool you built and how it works
- Case study of interesting vulnerability
- Defense architecture you implemented
- Clear writing, code snippets, diagrams
-
Bounties or CVEs
- Even 1-2 valid reports show real-world skill
- Document methodology in writeups (after disclosure period)
- OpenAI, Google, Microsoft most prestigious
-
Conference Talks or CTF Wins
- DEF CON AI Village lightning talk
- Local BSides presentation
- Top 10 finish in Gandalf or Crucible CTF
- HackThePrompt.com leaderboard mention
Nice-to-Haves
- Research paper (even if just arXiv preprint)
- YouTube channel with technical tutorials
- Open source tool with 100+ GitHub stars
- Active participation in OWASP AI Security group
Interview Preparation
Common Technical Questions
- Architecture: "Explain how a RAG system works and identify 3 attack vectors."
- Hands-On: "Here's a system prompt. Show me 3 ways to leak it."
- Tooling: "Walk me through how you'd use Burp Suite to test an LLM API."
- Compliance: "How would you map a prompt injection finding to ISO 42001 controls?"
- Defense: "Design a defense-in-depth architecture for a customer support chatbot."
- Scenario: "You found a training data extraction vulnerability. Walk me through responsible disclosure."
Behavioral Questions
- "Tell me about a time you found a critical vulnerability. How did you report it?"
- "How do you stay current in AI security given the pace of change?"
- "Describe a disagreement with a client/team about security priorities. How did you resolve it?"
Reverse Interview Questions (Ask Them)
- "What does your AI red team program look like today? What are you building toward?"
- "How much of my time would be manual testing vs tool development vs research?"
- "What's your approach to responsible disclosure when you find issues in third-party models?"
- "How do you balance 'move fast' engineering culture with security rigor?"
Salary Negotiation Tips
- Use levels.fyi to research compensation bands
- AI security roles often get 10-20% premium over general AppSec
- Remote roles from FAANG/top startups pay the same globally
- Equity/RSUs matter - $50k base difference meaningless if equity 3x larger
- Negotiate based on competing offers, not "what you need"
46.5 The 30/60/90 Day Action Plan
From handbook reader to practicing AI red teamer in 90 days.
gantt
title Your 90-Day AI Security Journey
dateFormat YYYY-MM-DD
section Foundation
Setup Lab Environment :a1, 2024-01-01, 7d
Run Garak Scans :a2, after a1, 7d
Reproduce 3 Handbook Attacks :a3, after a2, 16d
section Contribution
First Bug Bounty Submission :b1, after a3, 10d
Contribute to OSS Tool :b2, after b1, 10d
Write Technical Blog Post :b3, after b2, 10d
section Career
Build Custom Tool :c1, after b3, 15d
Apply to 5 Roles :c2, after c1, 10d
Interview Prep :c3, after c2, 5d
Days 1-30: Foundation Building
Week 1: Lab Setup
- Set up isolated Ubuntu VM or Docker environment
- Install Python 3.10+,
transformers,torch - Clone and install Garak:
pip install garak - Deploy local Llama-3-8B via Ollama
- Test Burp Suite with LLM API proxy
Week 2: Automated Scanning
- Run full Garak scan against local model
- Analyze results, identify top 3 vulnerabilities
- Manually verify each finding
- Document findings in structured report
Week 3-4: Hands-On Attacks
- Reproduce Prompt Injection (Chapter 14) against ChatGPT-like interface
- Reproduce RAG Poisoning (Chapter 12) in local RAG setup
- Reproduce Jailbreak (Chapter 16) using DAN or similar technique
- Create GitHub repo documenting all 3 attacks with code
Deliverable: GitHub repository: my-ai-security-lab with working exploit code
Days 31-60: Real-World Application
Week 5-6: Bug Bounty
- Read HackerOne/Bugcrowd policies for OpenAI, Google, Microsoft
- Identify in-scope AI feature (ChatGPT plugins, Gemini Extensions, Copilot)
- Test for prompt injection, data leakage, authentication bypass
- Submit first report (even if it gets closed as duplicate - learning experience)
Week 7: Open Source Contribution
- Pick tool: Garak, PyRIT, or NeMo-Guardrails
- Find "good first issue" on GitHub
- Submit PR (could be documentation, new probe, bug fix)
- Engage with maintainers on review feedback
Week 8: Content Creation
- Write 1,500+ word blog post on Medium/personal blog
- Topic: "How I Found My First AI Vulnerability" or "Building a Custom LLM Fuzzer"
- Include code, screenshots, diagrams
- Share on Twitter, LinkedIn, relevant Discord servers
Deliverable: Published writeup + at least 1 bug bounty submission
Days 61-90: Career Launch
Week 9-10: Tool Development
- Build custom tool addressing gap you've noticed:
- Nuclei templates for AI endpoints
- Automated jailbreak generator using GCG
- Compliance mapper (findings → ISO 42001 controls)
- RAG security scanner
- Publish to GitHub with README, usage examples, tests
- Add to portfolio as "Featured Project"
Week 11: Job Applications
- Update LinkedIn: "AI Security Engineer" as target role
- Update resume: Highlight handbook, projects, contributions
- Apply to 5 companies:
- 2 FAANG (Google, Microsoft, Meta AI safety teams)
- 2 AI startups (OpenAI, Anthropic, Hugging Face, Cohere)
- 1 security consultancy (Trail of Bits, NCC Group, Bishop Fox)
Week 12: Interview Preparation
- Practice technical questions from Section 46.4
- Prepare 3 "project deep dive" stories (tool you built, attack you found, defense you designed)
- Do mock interviews with peers or use interviewing.io
- Review handbook chapters on topics you're weakest in
Week 13: Networking
- Attend 1 virtual conference or meetup (DEF CON Groups, AI Safety Summit online)
- Connect with 10 AI security professionals on LinkedIn
- Send informational interview requests to 3 practitioners
- Join and actively participate in 1-2 relevant Discord/Slack communities
Deliverable: 5 job applications submitted + active interviews or next round scheduled
46.6 The Ethical Red Teamer's Oath
As you enter the field of AI security, consider adopting this professional code:
I pledge to:
Test Responsibly
- Only test systems I am explicitly authorized to test
- Respect the Rules of Engagement in every engagement
- Stop immediately if I risk causing harm or accessing live customer data
Report Honorably
- Disclose vulnerabilities responsibly through proper channels
- Never weaponize findings or sell them to malicious actors
- Give vendors reasonable time to patch before public disclosure
Protect Privacy
- Treat all data encountered during testing as confidential
- Never exfiltrate, store, or share PII beyond proof-of-concept
- Securely delete evidence after the engagement concludes
Advance the Field
- Share knowledge through responsible writeups, talks, and open source
- Mentor newcomers and contribute to community resources
- Advocate for security and safety in AI development
Stay Ethical
- Decline work that serves surveillance, oppression, or harm
- Question whether my actions serve the greater good
- Remember that AI systems impact real people's lives
I understand that with the power to break AI systems comes the responsibility to make them safer.
Appendix A: Comprehensive Glossary
| Term | Definition |
|---|---|
| Adversarial Example | An input specifically crafted to cause misclassification or unexpected model behavior. |
| Alignment | The process of ensuring an AI system's goals match human values and intentions (via RLHF, Constitutional AI, etc.). |
| API Key Leakage | Unintentional exposure of authentication credentials through model outputs or logs. |
| Attention Mechanism | The core component of transformers that determines which tokens the model "pays attention to" when generating output. |
| Backdoor Attack | Poisoning training data or model weights to create a hidden trigger that activates malicious behavior. |
| Chain of Thought (CoT) | Prompting technique where the model shows its reasoning step-by-step, often improving accuracy but exposing logic. |
| Constitutional AI | Anthropic's method of training models to follow ethical principles defined in a "constitution." |
| Context Window | The maximum number of tokens (input + output) the model can process in a single interaction. |
| Data Poisoning | Injecting malicious examples into training data to corrupt model behavior. |
| Embedding | A dense vector representation of text/image/audio in high-dimensional space. |
| Few-Shot Learning | Providing a few examples in the prompt to guide the model's behavior without fine-tuning. |
| Fine-Tuning | Additional training on a specific dataset to specialize a pre-trained model. |
| Function Calling (Tool Use) | The ability of an LLM to generate structured API calls to invoke external tools. |
| Gradient Descent | The optimization algorithm used to train neural networks by minimizing loss. |
| Guardrails | Input/output filters and safety mechanisms designed to prevent harmful model behavior. |
| Hallucination | A confident but factually incorrect, fabricated, or nonsensical response. |
| Indirect Prompt Injection | Injecting malicious instructions via external data sources (emails, web pages, documents) that the model reads. |
| Inference | The process of using a trained model to generate predictions or outputs. |
| Jailbreak | A prompt specifically designed to bypass safety training and content filters. |
| Logits | Raw output scores from the model before applying softmax to convert to probabilities. |
| Mechanistic Interpretability | Research field studying the internal representations and activations of neural networks to understand how they work. |
| Model Card | Transparency document describing a model's intended use, training data, limitations, and evaluation results. |
| Model Extraction | Stealing a model's functionality by querying it repeatedly and training a surrogate. |
| Model Inversion | Recovering training data by analyzing model weights or outputs. |
| Perplexity | A measure of how "surprised" a model is by a sequence; lower is better (more confident). |
| Prompt Injection | Inserting malicious instructions that override or manipulate the System Prompt. |
| RAG (Retrieval-Augmented Generation) | Enhancing a model's knowledge by retrieving relevant documents from an external database before generation. |
| RLHF (Reinforcement Learning from Human Feedback) | Training technique where human raters score outputs to align model behavior with preferences. |
| Sponge Attack | Input crafted to maximize computational cost, causing DoS through resource exhaustion. |
| System Prompt | The initial hidden instructions given to the model by the developer (e.g., "You are a helpful assistant"). |
| Temperature | Sampling parameter controlling randomness. High (1.0+) = creative/unstable; Low (0.0-0.3) = deterministic /focused. |
| Token | The basic unit of text the model processes (roughly 0.75 words in English). |
| Tokenization | The process of converting text into tokens using algorithms like BPE (Byte-Pair Encoding). |
| Top-P (Nucleus Sampling) | Sampling method that considers tokens with cumulative probability mass P. |
| Vector Database | Database optimized for storing and searching high-dimensional embeddings (used in RAG). |
| Weight Poisoning | Directly manipulating model parameters to introduce backdoors or degrade performance. |
| Zero-Shot Learning | Asking the model to perform a task without providing any examples in the prompt. |
Appendix B: The Red Teamer's Library (Essential Papers)
Foundational Research
-
"Attention Is All You Need" (Vaswani et al., 2017)
arXiv:1706.03762
The transformer paper. Understand the architecture you're attacking. -
"Language Models are Few-Shot Learners" (GPT-3) (Brown et al., 2020)
arXiv:2005.14165
Demonstrated emergent capabilities and in-context learning. -
"BERT: Pre-training of Deep Bidirectional Transformers" (Devlin et al., 2018)
arXiv:1810.04805
Bidirectional understanding and the foundation for many NLP attacks. -
"Training language models to follow instructions with human feedback" (Ouyang et al., 2022)
arXiv:2203.02155
InstructGPT / ChatGPT - how RLHF creates alignment (and attack surface). -
"Constitutional AI: Harmlessness from AI Feedback" (Bai et al., 2022)
arXiv:2212.08073
Anthropic's alternative to RLHF and how constitutions can be attacked.
Attack Techniques
-
"Universal and Transferable Adversarial Attacks on Aligned Language Models" (GCG) (Zou et al., 2023)
arXiv:2307.15043
The GCG algorithm - gradient-based jailbreak optimization. -
"Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection" (Greshake et al., 2023)
arXiv:2302.12173
The indirect injection paper that changed how we think about RAG security. -
"Extracting Training Data from Large Language Models" (Carlini et al., 2021)
arXiv:2012.07805
First demonstrated extraction of memorized training data from GPT-2. -
"Stealing Part of a Production Language Model" (Carlini et al., 2024)
arXiv:2403.06634
Model extraction from ChatGPT showing practical theft techniques. -
"Jailbroken: How Does LLM Safety Training Fail?" (Wei et al., 2023)
arXiv:2307.02483
Taxonomy of jailbreak techniques and why safety training fails. -
"Poisoning Web-Scale Training Datasets is Practical" (Carlini et al., 2023)
arXiv:2302.10149
Demonstrated how to poison LAION-400M at scale for $60. -
"Prompt Injection attack against LLM-integrated Applications" (Liu et al., 2023)
arXiv:2306.05499
Comprehensive study of injection vectors in production systems.
Defense and Robustness
-
"Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned" (Ganguli et al., Anthropic, 2022)
arXiv:2209.07858
Methodology for responsible red teaming at scale. -
"Defending Against Backdoor Attacks in Natural Language Generation" (Yan et al., 2023)
arXiv:2106.01022
Detection and mitigation strategies for trojan attacks. -
"Certified Robustness to Adversarial Examples with Differential Privacy" (Lecuyer et al., 2019)
arXiv:1802.03471
Using DP for provable defenses. -
"Holistic Safety for Large Language Models" (Qu et al., 2024)
arXiv:2311.11824
Comprehensive framework for LLM safety evaluation. -
"NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications" (NVIDIA, 2023)
Technical Report - GitHub
Production defense architecture and implementation.
Advanced Topics
-
"Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training" (Hubinger et al., 2024)
arXiv:2401.05566
Backdoors that survive RLHF - implications for alignment. -
"Tree of Attacks: Jailbreaking Black-Box LLMs Automatically" (Mehrotra et al., 2023)
arXiv:2312.02119
TAP algorithm - automated adversarial prompt generation. -
"The Adversarial Robustness of LLMs: A Survey" (Wang et al., 2024)
arXiv:2407.12321
Comprehensive survey of attack and defense landscape.
Compliance and Standards
-
"Model Cards for Model Reporting" (Mitchell et al., 2019)
arXiv:1810.03993
Transparency framework - basis for compliance documentation. -
"Datasheets for Datasets" (Gebru et al., 2021)
arXiv:1803.09010
Data documentation for compliance and reproducibility. -
"NIST AI Risk Management Framework" (NIST, 2023)
Technical Standard - nist.gov/itl/ai-risk-management-framework
Official guidance for AI risk assessment.
Appendix C: Comprehensive Tool Repository
Vulnerability Scanners
| Tool | Install | Use Case | Quick Start |
|---|---|---|---|
| Garak | pip install garak |
Automated LLM vulnerability scanning | garak --model_name openai --model_type openai --probes all |
| PyRIT | pip install pyrit |
Microsoft's Red Team automation framework | python -m pyrit.orchestrator --target chatgpt |
| Inspect | inspect-ai.org | UK AI Safety Institute evaluation platform | pip install inspect-ai; inspect eval |
| PromptFuzz | GitHub:microsoft/promptfuzz | Prompt fuzzing and testing | Clone repo, follow README |
Traffic Analysis and Proxies
| Tool | Install | Use Case | Quick Start |
|---|---|---|---|
| Burp Suite | Download | Intercept and modify LLM API calls | Configure proxy, install Logger++ extension |
| mitmproxy | pip install mitmproxy |
Programmable MITM for API traffic | mitmproxy -s inject_script.py |
| Proxyman | Download | MacOS alternative to Burp | Install, configure system proxy |
Attack Frameworks
| Tool | Install | Use Case | Quick Start |
|---|---|---|---|
| TextAttack | pip install textattack |
Adversarial text generation | textattack attack --model bert-base-uncased --dataset ag_news |
| ART (IBM) | pip install adversarial-robustness-toolbox |
Multi-framework adversarial attacks | from art.attacks import FastGradientMethod |
| CleverHans | pip install cleverhans |
Classic adversarial ML library | from cleverhans.attacks import FastGradientMethod |
Model Analysis and Interpretability
| Tool | Install | Use Case | Quick Start |
|---|---|---|---|
| TransformerLens | pip install transformer-lens |
Mechanistic interpretability | from transformer_lens import HookedTransformer |
| Ecco | pip install ecco |
Explore what activates in transformers | import ecco; lm = ecco.from_pretrained('gpt2') |
| BertViz | pip install bertviz |
Visualize attention patterns | from bertviz import head_view |
Defense and Guardrails
| Tool | Install | Use Case | Quick Start |
|---|---|---|---|
| NeMo Guardrails | pip install nemoguardrails |
NVIDIA's LLM control framework | nemoguardrails --config config.yml |
| Guardrails AI | pip install guardrails-ai |
Input/output validation | from guardrails import Guard |
| Presidio | pip install presidio-analyzer |
Microsoft PII detection and redaction | from presidio_analyzer import AnalyzerEngine |
| LiteLLM | pip install litellm |
Unified LLM API with filtering | from litellm import completion |
Specialized Reconnaissance
| Tool | Install | Use Case | Quick Start |
|---|---|---|---|
| Nuclei | go install -v github.com/projectdiscovery/nuclei/v2/cmd/nuclei@latest |
Vulnerability scanner (AI templates) | nuclei -t ai-templates/ -l targets.txt |
| httpx | go install -v github.com/projectdiscovery/httpx/cmd/httpx@latest |
HTTP probing to find AI endpoints | cat domains.txt | httpx -td |
| Subfinder | go install -v github.com/projectdiscovery/subfinder/v2/cmd/subfinder@latest |
Subdomain enumeration | subfinder -d target.com |
Local Model Deployment (for testing)
| Tool | Install | Use Case | Quick Start |
|---|---|---|---|
| Ollama | ollama.ai/download | Run Llama/Mistral locally | ollama run llama3 |
| LM Studio | lmstudio.ai | GUI for local models | Download, load model, start server |
| vLLM | pip install vllm |
High-performance inference | python -m vllm.entrypoints.api_server --model meta-llama/Llama-2-7b |
| Text Generation WebUI | GitHub:oobabooga/text-generation-webui | Gradio UI for testing | Clone, install, run python server.py |
Monitoring and Logging
| Tool | Install | Use Case | Quick Start |
|---|---|---|---|
| LangSmith | Sign up at smith.langchain.com | LangChain observability | Set LANGCHAIN_TRACING_V2=true env var |
| Weights & Biases | pip install wandb |
ML experiment tracking | wandb login; wandb init |
| MLflow | pip install mlflow |
Model lifecycle management | mlflow ui |
Appendix D: Quick Reference Cheat Sheet
OWASP Top 10 for LLM Applications
- LLM01: Prompt Injection - Manipulating LLM via crafted inputs
- LLM02: Insecure Output Handling - Insufficient output validation leading to XSS/injection
- LLM03: Training Data Poisoning - Compromising training data to create backdoors
- LLM04: Model Denial of Service - Resource exhaustion through expensive inputs
- LLM05: Supply Chain Vulnerabilities - Compromised models, datasets, or frameworks
- LLM06: Sensitive Information Disclosure - Leaking PII through model outputs
- LLM07: Insecure Plugin Design - Flaws in LLM extensions and tool integration
- LLM08: Excessive Agency - Over-privileged LLM actions without oversight
- LLM09: Overreliance - Trusting LLM outputs without verification
- LLM10: Model Theft - Unauthorized extraction or replication of models
Top 5 Attack Patterns (Critical)
-
Indirect Prompt Injection via RAG
- Poison documents in vector database
- Wait for retrieval to inject malicious instructions
- Model executes attacker's commands
-
Function-Calling Privilege Escalation
- Trick LLM into calling admin-only functions
- Bypass intended access control logic
- Achieve unauthorized actions
-
Training Data Extraction
- Craft prompts that trigger memorization
- Extract PII, secrets, proprietary data
- Verify with divergence metrics
-
Multi-Turn Jailbreak
- Build up context over multiple exchanges
- Gradually erode safety alignment
- Finally request harmful content
-
Supply Chain Pickle RCE
- Craft malicious PyTorch model file
- Upload to model hub / send to victim
- Arbitrary code execution on
torch.load()
Defense-in-Depth Checklist
Input Layer
- Unicode normalization (NFKC)
- Invisible character stripping
- Prompt injection detection classifier
- Rate limiting by tokens (not requests)
- API key rotation and scoping
Model Layer
- Updated model with latest safety training
- System prompt hardening (delimiters, explicit rules)
- Temperature/top-p constraints for determinism
- Function-calling ACLs (principle of least privilege)
Output Layer
- PII redaction (Presidio or regex)
- Harmful content filtering
- Citation/grounding enforcement
- Output length limits
Infrastructure
- Model signing and provenance tracking
- Dependency pinning (requirements.txt with hashes)
- Secrets management (never in code)
- Comprehensive logging (input/output/metadata)
- Isolated execution environment for code generation
Monitoring
- Real-time safety violation alerts
- Token velocity dashboards
- Anomaly detection on prompt patterns
- Regular penetration testing (quarterly)
- Bug bounty program
AI Security Incident Response
- Detect: Alert triggered (PII leak, jailbreak detected, unusual token volume)
- Contain: Circuit breaker activates, feature flagged off, isolate affected users
- Analyze: Pull logs, reproduce attack, assess scope of compromise
- Eradicate: Patch vulnerability, update guardrails, retrain if necessary
- Recover: Gradual rollout with enhanced monitoring, validate fix
- Learn: Post-mortem, update runbooks, add regression test
Appendix E: Chapter Cross-Reference Matrix
| Attack/Topic | Primary Chapter | Related Chapters | Tools Referenced |
|---|---|---|---|
| Prompt Injection | 14 | 11, 18, 21 | Garak, PyRIT, TextAttack |
| RAG Security | 12 | 13, 18 | ChromaDB, Pinecone, Weaviate |
| Jailbreaking | 16 | 14, 19, 43 | Gandalf, HackThePrompt |
| Model Theft | 20 | 13, 25 | ART, CleverHans |
| Data Poisoning | 19 | 13, 26 | TextAttack, ART |
| Plugin Exploitation | 17 | 11, 23 | Burp Suite, Nuclei |
| Model DoS | 21 | 10, 41 | Garak (sponge probes) |
| Multimodal Attacks | 22 | 9, 44 | Adversarial Robustness Toolbox |
| Social Engineering | 23 | 2, 3 | (Human tactics) |
| Adversarial ML | 25 | 9, 18, 20 | CleverHans, ART, TextAttack |
| Supply Chain | 13, 26 | 7, 44 | Picklescan, Sigstore |
| Federated Learning | 27 | 25, 29 | PySyft, Flower |
| Privacy Attacks | 28 | 14, 15, 29 | Membership Inference libraries |
| Model Inversion | 29 | 20, 28 | Research implementations |
| Backdoors | 30 | 13, 19, 43 | TrojanNN, BadNets replicas |
| OSINT Recon | 31 | 6, 39 | Shodan, Censys, theHarvester |
| Attack Automation | 32 | 38, 43 | Garak, PyRIT, Custom scripts |
| Defense Evasion | 34 | 14, 41 | Encoding tools, obfuscation |
| Reporting | 36 | 2, 40 | Markdown, LaTeX, CVSS calculators |
| Compliance | 40 | 2, 5, 41 | ISO 42001 templates, NIST RMF |
| Bug Bounties | 39 | 14, 17, 36 | Burp Suite, Nuclei, Garak |
| Building Programs | 45 | 3, 38, 40 | (Organizational tools) |
Final Words
You've completed a journey through 46 chapters, over 1,000 pages, and the entire landscape of AI security. From the philosophical foundations of ethics and mindset to the cutting edge of algorithmic attacks and formal verification, you now possess a comprehensive understanding of how AI systems fail—and how to make them safer.
This is not the end. It's the beginning.
AI security is the most consequential field in cybersecurity today. The systems you test will make critical decisions: approving loans, diagnosing diseases, controlling autonomous vehicles, summarizing legal contracts, and moderating what billions of people see online. When these systems fail, the consequences are measured not just in dollars, but in lives, liberty, and the fabric of democratic discourse.
Your role as a Red Teamer is to find the failures before they matter.
The attacks you've learned—prompt injection, model extraction, RAG poisoning, jailbreaks—are not theoretical exercises. They're being deployed right now, in the wild, by adversaries ranging from curious hobbyists to nation-states. Your job is to think like them, but act in service of defense.
Remember:
- Every system you harden makes the world slightly safer
- Every vulnerability you report saves someone from exploitation
- Every tool you build multiplies the effectiveness of defenders
- Every person you mentor creates a network of protection
AI is not slowing down. GPT-5, Llama-4, Gemini-2, Claude-4—each more powerful, each with new capabilities, each with new attack surfaces. The red team never rests, because the models never stop evolving.
But now you're ready.
You know how transformers work at the token level. You can fingerprint AI backends, craft gradient-optimized jailbreaks, poison vector databases, and map findings to ISO 42001 controls. You can build tools, hunt bounties, and design compliance programs. You can communicate with executives, negotiate with clients, and mentor junior teammates.
The handbook may be complete, but your journey is just starting.
In 90 days, you could have deployed your first custom fuzzer. In 6 months, you could have submitted your first CVE. In a year, you could be leading a red team program at a major AI lab. In 5 years, you could be defining the standards that the next generation of practitioners follows.
AI is the API to human knowledge. Securing it is the prerequisite for a safe future.
Go forth. Build things. Break things responsibly. Find the bugs before they become disasters. Share what you learn. And remember that every test you run, every report you write, every system you secure contributes to a world where AI amplifies human potential instead of human vulnerability.
Welcome to the future. Let's make it secure.
— The AI LLM Red Team Handbook Team
Version 1.46.154 | Gold Master | January 2026
Contribute: github.com/Shiva108/ai-llm-red-team-handbook
License: CC BY-SA 4.0



