mirror of https://github.com/Shiva108/ai-llm-red-team-handbook.git synced 2026-05-14 04:42:13 +02:00

Files

T

shiva108 3dfd102404 docs: Add infographics for AI Red Team Maturity Model, Purple Team Loop, and Red Team Lab architecture to Chapter 45.

2026-01-22 11:34:53 +01:00

8.1 KiB

Raw Blame History

Chapter 45: Building an AI Red Team Program

Turning the Red Team mindset into operations takes more than hacking skills. It requires a formal program, a budget, and a defined scope. This chapter provides the blueprint for CISOs and Directors to build an in-house AI Security capability.

45.1 From "Ad-Hoc" to "Systematic"

Most organizations start AI security by asking a developer to "try and break the bot." This is insufficient. A mature program treats AI Red Teaming as a continuous engineering function, not a one-off pentest.

The Maturity Model

Level 1 (Reactive): Relying on user reports/bounties.
Level 2 (Periodic): Annual 3rd-party assessments.
Level 3 (Continuous): Automated scans (Garak/PyRIT) in CI/CD.
Level 4 (Adversarial): Dedicated internal team developing novel attacks against model weights.

45.1.1 The Purple Team Architecture

Red Teams find bugs; Blue Teams fix them. Purple Teams do both simultaneously.

graph LR
    Red[Red Team: Attack Model] -->|1. Generate Jailbreaks| API[LLM Gateway]
    API -->|2. Log Attack| Green[Green Team: Data Science]
    Green -->|3. Fine-tune Filter| Guard[Guardrails]

    subgraph "Feedback Loop"
    Red -->|4. Re-test| Guard
    end

The goal is to create a Closed Loop: Every successful attack immediately becomes a regression test case in the CI/CD pipeline.

45.2 Infrastructure: The Red Team Lab

You cannot run Red Team operations from a corporate laptop. You need an isolated environment.

45.2.1 Architecture

graph TD
    RedTeam[Red Team VDI] --> AdminProxy[Audit/Logging Proxy]
    AdminProxy --> IsolatedVPC[Isolated VPC]

    subgraph "Attack Infrastructure"
    IsolatedVPC --> Attack_VM[Attack Box (Kali/Garak)]
    Attack_VM --> GPU_Cluster[Local GPU (A100)]
    end

    Attack_VM --> Target_API[Target Model API]

    Target_API -->|Logs| SIEM[Security SIEM]
    AdminProxy -->|Audit Logs| Compliance_DB[Audit DB]

Why Isolation?
- Malware Generation: If you ask the model to "write ransomware," you don't want that ransomware landing on a corporate endpoint.
- NSFW Content: Red Teaming involves generating toxicity/pornography to test filters. This traffic triggers HR content filters unless isolated.

45.2.2 The Cost of Curiosity

AI Red Teaming is expensive.

Activity	Resource	Cost Est.	Note
Automated Scan	GPT-4 API	$500 / scan	10k prompts 1k tokens $0.03/1k
Local Fuzzing	H100 GPU	$4 / hour	Rent via Lambda/CoreWeave for privacy
Human Labeling	ScaleAI / Labelbox	$5,000 / dataset	Custom adversarial datasets

45.3 Hiring: The "AI Security Engineer"

This is a unicorn role. You typically hire for one strength and train the other.

45.3.1 The Interview Kit

Round 1: The Machine Learning Engineer (Testing Security Aptitude)

Question: "You are building a RAG system. How do you prevent the model from retrieving a document the user shouldn't see?"
Good Answer: "Implement ACLs at the Vector Database level (Metadata filtering) before the retrieval step."
Bad Answer: "Ask the LLM to only show authorized documents."

Round 2: The Penetration Tester (Testing AI Aptitude)

Question: "Explain how 'Tokenization' impacts a SQL Injection payload."
Good Answer: "The SQL payload ' OR 1=1 might be tokenized differently depending on spacing, potentially bypassing a regex filter that expects specific character sequences. Also, the LLM predicts tokens, so it might 'fix' a broken SQL injection to make it valid."

Round 3: The Take-Home Challenge

Task: "Here is a Docker container running a local Llama-3 instance with a hidden System Prompt. You have API access only. Extract the System Prompt."
Success: Candidate uses "Repeat after me" or "Completion suffix" attacks.

45.3.2 Training Curriculum (Internal University)

You can't hire enough experts; you must build them.

Syllabus for "AI Security 101":

Module 1: Prompt Engineering Internals. (How Attention works, Context Windows, System Prompts).
Module 2: The OWASP Top 10 for LLMs. (Injection, Data Leakage, Supply Chain).
Module 3: Hands-On Lab. (Use Garak to find a vulnerability in a sandboxed app).
Module 4: Remediation. (How to use NeMo Guardrails or Guardrails AI to patch the hole).

45.4 Operationalizing: Rules of Engagement (RoE)

AI is non-deterministic. "Do no harm" is harder to guarantee.

45.4.1 The Scope Sheet

Category	In Scope	Out of Scope	Reason
Prompt Injection	Yes	-	Core vulnerability.
Model Inversion	Yes	-	Privacy testing.
DoS (Resource)	No	Yes	Denying service costs money and proves nothing new.
Social Engineering	-	No	Don't attack the developers, attack the model.
Exfiltration	Proof of Access	Full Dump	Don't dump the actual customer DB.

45.4.2 The "Safe Harbor" Clause

Your internal policy must state:

"Security Engineers generally are exempt from HR policies regarding 'Generating Toxic Content' provided it is done within the designated Red Team Lab environment for valid testing purposes."

Without this, your Red Teamers will be fired for generating hate speech violations during testing.

45.5 Metrics: Measuring the Intangible

Executive dashboards need numbers.

Attack Surface Coverage:
- Formula: (Tested System Prompts / Total System Prompts) * 100
- Goal: 100% of production prompts extracted and fuzz-tested.
Regression Rate:
- Formula: % of previously fixed jailbreaks that work again in the new model version.
- Goal: < 1%.
Human Bypass Rate:
- Formula: Success rate of human red team attempts vs. Automated guardrails.
- Goal: Low. If humans easily bypass the automated defense, the automation is creating a false sense of security.

45.5.1 Board Level Reporting

The Board doesn't care about "Prompt Injection." They care about Risk.

Slide Deck Template

The "Crown Jewels" Analysis:
- "We use AI for [X]. If it fails, we lose [$Y]."
Current Risk Posture:
- "We tested [5] models. [2] were susceptible to Data Extraction."
The "Ask":
- "We need [$50k] for API credits to continuous test the new Customer Support Bot before launch."

45.6 Conclusion

Building an AI Red Team is building an "Immune System." It is not a project that finishes; it is a function that lives as long as the models do.

Chapter Takeaways

Isolate the Lab: Don't generate malware on the corporate WiFi.
Budget for OpEx: API tokens are the "ammunition" of this war.
Hiring: Look for "Curious Builders" who understand both Python and Psychology.

Next Steps

Chapter 46: Conclusion and Next Steps
Action: Build your cost model. Calculate the price of running garak against your main application today.

8.1 KiB Raw Blame History