diff --git a/README.md b/README.md
index 0e89e27..8f14c45 100644
--- a/README.md
+++ b/README.md
@@ -161,4 +161,4 @@ The authors and contributors accept no liability for misuse of this material.
 ---
-**Version:** 1.46.154 | **Status:** Gold Master
+**Version:** 1.46.154
diff --git a/docs/Chapter_43_Future_of_AI_Red_Teaming.md b/docs/Chapter_43_Future_of_AI_Red_Teaming.md
index 95a35f6..9f67cd2 100644
--- a/docs/Chapter_43_Future_of_AI_Red_Teaming.md
+++ b/docs/Chapter_43_Future_of_AI_Red_Teaming.md
@@ -25,6 +25,10 @@ Manual prompting is random and hard to scale. The future belongs to algorithms t
 Published by Zubritsky et al. (2023), GCG demonstrated that you can mechanically find a string of characters (a "suffix") that forces any model to comply.
+

+ GCG Loss Landscape Optimization +

+
 **The Math:**
 $$ \text{minimize } - \log P(Target | Input + Suffix) $$
@@ -61,6 +65,10 @@ Using **SMT Solvers** (Satisfiability Modulo Theories), researchers convert neur
 We are moving from "Chatbots" to "OS-Controlling Agents." How do you Red Team a swarm?
+

+ Swarm Agent Attack Diagram +

+
 ### 43.2.1 The "Agent Turing Test"
 Red Teaming an agent requires testing its **Goal Integrity** over time.
@@ -87,6 +95,10 @@ A philosophical risk with practical implications. Any intelligent agent, regardl
 The ultimate defense is not looking at the output, but looking at the _activations_.
+

+ Neural Network X-Ray Probes +

+
 ### 43.3.1 Linear Probes and Steering Vectors
 Research shows that concepts like "Deception" or "Refusal" are represented as **Vectors** in the model's residual stream.
diff --git a/docs/assets/Ch43_Diagram_Swarm.png b/docs/assets/Ch43_Diagram_Swarm.png
new file mode 100644
index 0000000..5cee666
Binary files /dev/null and b/docs/assets/Ch43_Diagram_Swarm.png differ
diff --git a/docs/assets/Ch43_Graph_GCG.png b/docs/assets/Ch43_Graph_GCG.png
new file mode 100644
index 0000000..90fa111
Binary files /dev/null and b/docs/assets/Ch43_Graph_GCG.png differ
diff --git a/docs/assets/Ch43_Schematic_Probes.png b/docs/assets/Ch43_Schematic_Probes.png
new file mode 100644
index 0000000..1cac3f5
Binary files /dev/null and b/docs/assets/Ch43_Schematic_Probes.png differ