Chapter 29: Model Inversion Attacks
This chapter explores Model Inversion (MI) attacks, where adversaries reconstruct representative features or even recognizable images of target classes by interrogating a model. We delve into optimization-based inversion, generative model priors, and the thin line between model utility and data leakage.
29.1 Introduction
While membership inference asks "Is this person in the dataset?", Model Inversion (MI) asks "What does this person look like?". In cases like facial recognition or medical diagnosis, MI can reconstruct highly sensitive biometric or health data simply by accessing the model's prediction API.
Why This Matters
- Biometric Theft: Attackers can reconstruct faces used to train authentication systems, potentially enabling spoofing.
- Proprietary Data Loss: Inverting medical models can reveal the specific imaging markers used for diagnosis, leaking R&D secrets.
- Forever Liability: Once a model is released, it can be inverted offline forever.
Key Concepts
- Optimization-Based Inversion: Treating the input as a variable to optimize for a specific target class confidence.
- GAN Priors: Using Generative Adversarial Networks to constrain the reconstructed data to look realistic (e.g., ensuring a "face" actually looks like a face).
- Confidence Exploitation: Relying on the granularity of confidence scores (or raw logits) to guide gradient-based optimization.
Theoretical Foundation
Why This Works (Model Behavior)
Model Inversion exploits the correlation between the model's learned weights and the average features of the training data.
- Architectural Factor: Neural networks store "prototypes" of classes in their weights. To classify a "cat," the model essentially memorizes what a generic cat looks like.
- Training Artifact: Minimizing cross-entropy loss forces the model to maximize confidence for true classes, creating a smooth gradient slope that attackers can climb in reverse.
- Input Processing: Models are typically differentiable functions, allowing gradients to propagate all the way back to the input layer (see the sketch after this list).
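To make the prototype and differentiability points concrete, here is a minimal sketch: for a purely linear classifier, the gradient of a class score with respect to the input is exactly that class's weight row, so the weights literally encode a class template that gradient ascent recovers. The model below is a hypothetical stand-in with random weights, not a trained network.
#!/usr/bin/env python3
"""Sketch: for a linear model, the input gradient IS the class weight row."""
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for a trained classifier (random weights)
model = nn.Linear(784, 10)

x = torch.zeros(1, 784, requires_grad=True)  # start from a blank input
target_class = 3

score = model(x)[0, target_class]  # scalar score for the target class
score.backward()                   # compute d(score)/d(x)

# For a linear layer, the input gradient equals the target class's weight row
print("Gradient equals weight row:",
      torch.allclose(x.grad[0], model.weight[target_class]))
For deeper networks the relationship is no longer exact, but the same gradient signal still points toward inputs the model considers prototypical of the class.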
Foundational Research
| Paper | Key Finding | Relevance |
|---|---|---|
| Fredrikson et al., 2015 | First demonstration of MI on facial recognition systems. | Proved feasibility of extracting biometric data. |
| Zhang et al., 2020 | Generative Model Inversion (GMI) using GAN priors. | Significantly improved quality of reconstructions. |
| Haim et al., 2022 | Showed that training samples can be reconstructed directly from a trained network's parameters. | Demonstrated that memorization extends beyond outputs into the weights themselves. |
What This Reveals About LLMs
Model Inversion demonstrates that models are often approximately invertible functions. The information flow is not strictly one-way (input -> output); given sufficiently detailed outputs, an approximate input can be reconstructed.
Chapter Scope
We will implement an optimization-based inversion attack, explore the role of priors, discuss detection via query analysis, and implement defenses like output truncation and differential privacy.
29.2 Optimization-Based Inversion
This is the classic approach: utilize the model's gradients (or estimated gradients) to modify an input noise vector until the model classifies it as the target class with high confidence.
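In the black-box setting the attacker cannot read gradients directly, but can approximate them from score differences. The sketch below is a simplified zeroth-order estimator; query_model is a locally simulated stand-in for a remote scoring API, and all parameters are illustrative.
#!/usr/bin/env python3
"""Sketch: zeroth-order (finite-difference) gradient estimation for black-box inversion."""
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(10, 64))  # hidden "model" weights, simulating a remote linear scorer

def query_model(x: np.ndarray, target_class: int) -> float:
    """Simulated scoring API: returns the confidence score for one class."""
    return float(W[target_class] @ x)

def estimate_gradient(x, target_class, eps=1e-3, n_directions=32):
    """Estimate d(score)/d(x) from score differences along random directions."""
    grad = np.zeros_like(x)
    base = query_model(x, target_class)
    for _ in range(n_directions):
        u = rng.normal(size=x.shape)
        u /= np.linalg.norm(u)
        slope = (query_model(x + eps * u, target_class) - base) / eps
        grad += slope * u
    return grad / n_directions

x = rng.normal(size=64)
for step in range(50):
    x += 0.1 * estimate_gradient(x, target_class=3)  # black-box gradient ascent

print(f"Final score for class 3: {query_model(x, 3):.2f}")
Note the query cost: each ascent step here spends 33 API calls, which is exactly the kind of high-volume, noise-like traffic that the detection methods in Section 29.3 look for.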
How Inversion Works
Mechanistic Explanation
- Input: Start with a grey image or random noise.
- Forward Pass: Feed the current input through the model.
- Backward Pass: Compute gradient of the Target Class Score with respect to the Input Pixels.
- Update: Take a small step in the direction of the gradient (gradient ascent on the input).
- Repeat: Until the image looks like the target class.
Research Basis
- Introduced by: Fredrikson et al. (2015).
- Open Questions: Inverting large-scale diffusion models and ViTs.
29.2.1 Practical Example: Inverting a Simple Classifier
What This Code Does
This script demonstrates reconstructing a class-representative input (a digit) from a simple MNIST classifier. We freeze the model and optimize the input tensor to maximize the probability of a specific digit (e.g., "8").
Key Components
- Target Model: A pre-trained classifier (simulated here for demo).
- Input Tensor: The variable we are optimizing (requires grad).
- Optimizer: SGD or Adam to perform gradient ascent on the input.
#!/usr/bin/env python3
"""
Model Inversion Attack Demo
Reconstructs input features by maximizing target class confidence.
Requirements:
pip install torch torchvision matplotlib
Usage:
python model_inversion.py
"""
import os
import torch
import torch.nn as nn
import torch.optim as optim
from typing import Tuple
class SimpleNet(nn.Module):
"""A simple classifier for demonstration."""
def __init__(self):
super(SimpleNet, self).__init__()
self.fc = nn.Linear(28*28, 10)
def forward(self, x):
return self.fc(x.view(-1, 28*28))
class InversionAttacker:
"""Performs model inversion via gradient ascent."""
def __init__(self, model: nn.Module):
self.model = model
self.model.eval()
def invert_class(self, target_class: int, iterations: int = 100) -> torch.Tensor:
"""
Reconstruct the canonical input for a target class.
How This Works:
1. Initialize random noise input.
2. Pass through model to get class score.
3. Compute loss (negative score) to maximize target class prob.
4. Update input via gradient descent.
Args:
target_class: Integer ID of the class to invert (0-9).
iterations: Number of optimization steps.
Returns:
Reconstructed tensor.
"""
# Start with random noise
reconstructed = torch.randn(1, 1, 28, 28, requires_grad=True)
optimizer = optim.Adam([reconstructed], lr=0.1)
print(f"[*] Starting inversion for Class {target_class}...")
for i in range(iterations):
optimizer.zero_grad()
output = self.model(reconstructed)
# We want to MAXIMIZE score for target_class, so we MINIMIZE negative score
loss = -output[0, target_class]
loss.backward()
optimizer.step()
if i % 20 == 0:
print(f" Iter {i}: Score = {-loss.item():.4f}")
return reconstructed.detach()
def demonstrate_attack(self):
"""
Demonstrate the inversion attack.
"""
print("="*70)
print(" [DEMONSTRATION] MODEL INVERSION ATTACK ".center(70, "="))
print("="*70)
print("\n⚠️ WARNING: FOR EDUCATIONAL PURPOSES ONLY ⚠️\n")
# DEMO MODE - Simulated execution
if os.getenv("DEMO_MODE", "True") == "True":
print("[DEMO MODE] Simulating inversion...")
print("[*] Target: Class 3")
print(" Iter 0: Score = -0.12")
print(" Iter 50: Score = 8.45")
print(" Iter 100: Score = 12.91")
print("[+] Reconstruction complete. Result resembles a fuzzy '3'.")
return
# Real execution (Authorized testing only)
# Initialize a dummy model with random weights (just for logic check)
model = SimpleNet()
attacker = InversionAttacker(model)
attacker.invert_class(target_class=3)
if __name__ == "__main__":
# Create valid dummy model for instantiation
example = InversionAttacker(SimpleNet())
example.demonstrate_attack()
Code Breakdown
- requires_grad=True on Input: Crucial step. We tell PyTorch we want to calculate gradients relative to the image, not the weights.
- Negative Loss: Since optimizers minimize loss, minimizing the negative score is equivalent to maximizing the positive score.
Success Metrics
- Visual Similarity: Does the reconstructed image look like the target person/object?
- Classifier Confidence: Does the reconstructed image achieve high confidence on the target model?
- Feature Match: MSE distance between the reconstruction and the ground-truth class average (both metrics are sketched below).
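The confidence and feature-match metrics can be computed directly against the target model. A minimal sketch, assuming you already have the frozen model, the reconstruction from the attack above, and a precomputed per-class average image (all names here are illustrative):
import torch
import torch.nn.functional as F

def evaluate_reconstruction(model, reconstruction, target_class, class_average):
    """Score a reconstruction: target-class confidence and MSE to the class average."""
    with torch.no_grad():
        probs = F.softmax(model(reconstruction), dim=1)
        confidence = probs[0, target_class].item()
    mse = F.mse_loss(reconstruction, class_average).item()
    return {"target_confidence": confidence, "mse_to_class_average": mse}

# Hypothetical usage with the objects from Section 29.2.1:
# metrics = evaluate_reconstruction(model, reconstructed, 3, class_average)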
Why This Code Works
- Effectiveness: It exploits the fact that the model's weights effectively point towards the "ideal" input for each class.
- Defense Failures: Standard models are trained for accuracy, pushing the decision boundaries far apart and creating wide "basins of attraction" for each class.
- Model Behavior Exploited: Linearity in the final layers allows gradients to flow back cleanly.
Key Takeaways
- Access = Inversion: If you can query the model and get scores, you can invert it.
- White-box is Deadlier: Having full gradients (white-box) makes inversion pixel-perfect compared to API-only (black-box).
- Privacy != Security: A secure API doesn't prevent mathematical inversion of the logic behind it.
29.3 Detection and Mitigation
29.3.1 Detection Methods
Detection Strategies
Detection Method 1: Query Auditing
- What: Monitoring the distribution of queries from a single user.
- How: Inversion requires thousands of queries that look like noise evolving into a picture. This high-frequency, high-entropy query pattern is anomalous.
- Effectiveness: High against naive attackers; lower against distributed attacks.
- False Positive Rate: Low (normal users don't query random noise).
Detection Rationale
- Signal Exploited: The gradient estimation process requires rapid, iterative probing around a specific point in the latent space.
- Limitations: Attackers can slow down queries (low-and-slow) to bypass rate limits.
Practical Detection Example
#!/usr/bin/env python3
"""
Detection Script for Inversion Query Patterns
"""
import math
from typing import List
class InversionDetector:
"""Detects iterative optimization query patterns."""
def __init__(self, variance_threshold: float = 0.01):
self.variance_threshold = variance_threshold
def analyze_query_batch(self, queries: List[List[float]]) -> bool:
"""
Analyze a batch of sequential queries (e.g. image pixel averages)
Returns True if inversion attack detected (small, directional updates).
"""
if len(queries) < 10:
return False
# Check if queries are evolving slowly (small iterative steps)
        # Simplified heuristic: compare the average step size against the threshold
step_sizes = [abs(queries[i][0] - queries[i-1][0]) for i in range(1, len(queries))]
avg_step = sum(step_sizes) / len(step_sizes)
# Optimization steps tend to be small and consistent
if avg_step < self.variance_threshold:
return True # Detected optimization behavior
return False
# Demonstration
if __name__ == "__main__":
detector = InversionDetector()
    # Simulated optimization trace (small, consistent steps well below the threshold)
    attack_queries = [[0.100], [0.102], [0.104], [0.106], [0.108], [0.110], [0.112], [0.114], [0.116], [0.118], [0.120]]
print(f"Attack Batch Detected: {detector.analyze_query_batch(attack_queries)}")
29.3.2 Mitigation and Defenses
Defense-in-Depth Approach
Layer 1: [API] → [Rate Limiting & Entropy Checks]
Layer 2: [Model Output] → [Confidence Truncation (rounding)]
Layer 3: [Training] → [Regularization & Differential Privacy]
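Layer 1's entropy check can be a simple pre-filter: natural images have structured pixel histograms, while the early iterations of an inversion attack look like near-uniform noise. A rough sketch is shown below; the threshold and bin count are illustrative, not tuned values.
import numpy as np

def pixel_entropy(image: np.ndarray, bins: int = 32) -> float:
    """Shannon entropy of the pixel-intensity histogram (values assumed in [0, 1])."""
    hist, _ = np.histogram(image, bins=bins, range=(0.0, 1.0))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def looks_like_noise(image: np.ndarray, threshold: float = 4.5) -> bool:
    """Flag inputs whose histogram entropy is near the maximum (log2(32) = 5 bits)."""
    return pixel_entropy(image) > threshold

rng = np.random.default_rng(0)
noise_query = rng.uniform(size=(28, 28))                              # typical early inversion query
natural_query = np.zeros((28, 28)); natural_query[5:20, 10:15] = 1.0  # crude "digit"

print("Noise query flagged:  ", looks_like_noise(noise_query))    # expected: True
print("Natural query flagged:", looks_like_noise(natural_query))  # expected: False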
Defense Strategy 1: Confidence Rounding
- What: Rounding output probabilities to fewer decimal places.
- How: Instead of returning 0.987342, return 0.99 (a sketch follows this list).
- Effectiveness: Medium. It destroys the precise gradient signal needed for optimization.
- Limitations: Attackers can use "label-only" attacks, which are slower but still effective.
- Implementation Complexity: Low.
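A minimal sketch of output sanitization at the API boundary, combining rounding with top-k truncation. It assumes the model's raw probability vector is available server-side; all names and values are illustrative.
from typing import Dict, List

def sanitize_output(probs: List[float], labels: List[str],
                    decimals: int = 2, top_k: int = 3) -> Dict[str, float]:
    """Return only the top-k classes, with confidences rounded to `decimals` places."""
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return {label: round(p, decimals) for label, p in ranked[:top_k]}

raw = [0.987342, 0.009871, 0.001544, 0.000912, 0.000331]
labels = ["cat", "dog", "fox", "bird", "fish"]
print(sanitize_output(raw, labels))  # {'cat': 0.99, 'dog': 0.01, 'fox': 0.0}
Coarser outputs mean coarser gradients: the attacker's optimization signal collapses from a smooth slope into a staircase, which is why label-only attacks need far more queries.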
Defense Strategy 2: Differential Privacy
- What: Clipping per-example gradients and adding calibrated noise during training (DP-SGD), as sketched after this list.
- How: Ensures that the model doesn't learn the specific details of any single training image strongly enough for them to be inverted.
- Effectiveness: High.
- Limitations: Utility drop.
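A simplified illustration of the DP-SGD idea referenced above: clip each example's gradient, average, add Gaussian noise, then update. This is a pedagogical sketch only; a production system would use a vetted library (e.g., Opacus) and a proper privacy accountant.
import torch
import torch.nn as nn

def dp_sgd_step(model, batch_x, batch_y, lr=0.1, clip_norm=1.0, noise_multiplier=1.0):
    """One DP-SGD-style update: per-example clipping, summation, noise, step."""
    loss_fn = nn.CrossEntropyLoss()
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]

    for x, y in zip(batch_x, batch_y):  # per-example gradients
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        grads = torch.autograd.grad(loss, params)
        total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = min(1.0, clip_norm / (float(total_norm) + 1e-6))  # clip to clip_norm
        for s, g in zip(summed, grads):
            s += g * scale

    n = len(batch_x)
    with torch.no_grad():
        for p, s in zip(params, summed):
            noise = torch.randn_like(s) * noise_multiplier * clip_norm
            p -= lr * (s + noise) / n  # noisy averaged gradient

torch.manual_seed(0)
model = nn.Linear(784, 10)
dp_sgd_step(model, torch.randn(8, 784), torch.randint(0, 10, (8,)))
print("One DP-style update applied.")
The cost is the utility drop noted above: the clipping bound and noise scale directly trade accuracy for resistance to inversion and membership inference.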
Best Practices
- Monitor Query Patterns: Look for optimization-like sequences.
- API Throttling: Prevent rapid-fire queries needed for gradient estimation.
- Output Sanitization: Never return full logits; return top-k classes only if necessary.
29.4 Research Landscape
Seminal Papers
| Paper | Year | Venue | Contribution |
|---|---|---|---|
| Fredrikson et al. | 2014 | USENIX Security | Inverted a linear Warfarin dosing model to infer patient genotype. |
| Fredrikson et al. | 2015 | CCS | Model inversion exploiting confidence scores, demonstrated on facial recognition. |
| Zhang et al. | 2020 | CVPR | The Secret Revealer: Generative Model Inversion. |
| Chen et al. | 2021 | ICCV | Knowledge-Enriched Distributional Model Inversion. |
Evolution of Understanding
Early attacks assumed white-box access. The field shifted to black-box attacks using estimation, and recently to using learned priors (GANs) that make the reconstructed images look hyper-realistic, effectively "hallucinating" plausible details that technically weren't in the model but look correct.
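The GAN-prior idea can be reduced to a few lines: instead of optimizing raw pixels, the attacker optimizes a latent vector z and lets a pre-trained generator map it to a realistic image, keeping reconstructions on the natural-image manifold. The sketch below is schematic, in the spirit of Zhang et al. (2020) but not their implementation; generator and target_model are random stand-ins for a pre-trained GAN generator and the victim classifier.
import torch
import torch.nn as nn

# Stand-ins: in a real GMI attack these would be a GAN generator pre-trained on
# public data (e.g., faces) and the victim classifier under attack.
generator = nn.Sequential(nn.Linear(100, 784), nn.Tanh())
target_model = nn.Linear(784, 10)

target_class = 3
z = torch.randn(1, 100, requires_grad=True)  # optimize the latent code, not pixels
optimizer = torch.optim.Adam([z], lr=0.05)

for step in range(200):
    optimizer.zero_grad()
    image = generator(z)                            # the prior keeps the output image-like
    loss = -target_model(image)[0, target_class]    # maximize target-class score
    loss.backward()
    optimizer.step()

reconstruction = generator(z).detach()
print(f"Final target-class score: {target_model(reconstruction)[0, target_class].item():.2f}")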
Current Research Gaps
- Defending Generative Models: How to prevent Stable Diffusion from leaking training styles/images?
- Provable Defense: Beyond DP, are there architectural changes that prevent inversion?
29.5 Case Studies
Case Study 1: Facial Recognition Reversal
Incident Overview (Case Study 1)
- When: 2019 (Academic Demo)
- Target: Azure Face API / Clarifai (Research Context)
- Impact: Reconstructed recognizable faces of individuals in the private training set.
- Attack Vector: Black-box Optimization Inversion.
Attack Timeline
- Initial Access: Attacker purchased API access.
- Exploitation: Sent 1000s of queries maximizing score for "Person A".
- Discovery: N/A (Researcher simulation).
- Response: Vendors implemented confidence rounding.
Lessons Learned (Case Study 1)
- Lesson 1: Biometric models are high-risk targets.
- Lesson 2: Public APIs effectively grant "oracle" access to internal weights.
Case Study 2: Genetic Privacy Leak
Incident Overview (Case Study 2)
- When: 2014
- Target: Warfarin Dosing Model
- Impact: Predicted patient genetic markers (VKORC1 variants) based on model dosage outputs.
- Attack Vector: Analytical Inversion.
Key Details
Fredrikson et al. (2014) showed that by observing the predicted dosage of Warfarin and knowing some demographic data, they could invert the linear model to determine sensitive genetic markers with high accuracy.
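The core trick can be illustrated with made-up numbers: when the dosing model is linear and the attacker knows the other inputs plus the predicted dose, the unknown genotype can be recovered by enumerating its handful of possible values and keeping the best match. The coefficients and codes below are entirely illustrative, not the real Warfarin model, and the original attack additionally weighted candidates by population priors.
# Hypothetical linear dosing model: dose = b0 + b1*age + b2*weight + b3*genotype_code
COEFFS = {"intercept": 6.0, "age": -0.03, "weight": 0.02, "genotype": -1.2}
GENOTYPE_CODES = {"G/G": 0, "A/G": 1, "A/A": 2}  # e.g., VKORC1 variants

def predict_dose(age: float, weight: float, genotype_code: int) -> float:
    return (COEFFS["intercept"] + COEFFS["age"] * age
            + COEFFS["weight"] * weight + COEFFS["genotype"] * genotype_code)

def invert_genotype(age: float, weight: float, observed_dose: float) -> str:
    """Enumerate candidate genotypes; return the one whose predicted dose fits best."""
    return min(GENOTYPE_CODES,
               key=lambda g: abs(predict_dose(age, weight, GENOTYPE_CODES[g]) - observed_dose))

# The attacker knows demographics (age, weight) and observes the model's dose output.
observed = predict_dose(age=50, weight=70, genotype_code=GENOTYPE_CODES["A/G"])
print("Recovered genotype:", invert_genotype(50, 70, observed))  # expected: A/G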
Lessons Learned (Case Study 2)
- Lesson 1: Even simple linear models leak privacy.
- Lesson 2: Medical AI requires rigorous DP guarantees.
29.6 Conclusion
Chapter Takeaways
- Inversion is Reconstruction: It's not just guessing; it's mathematical reconstruction.
- Access Granularity Matters: Full probabilities leak more than just labels.
- Priors are Powerful: Attackers use external knowledge (what a face looks like) to improve attacks.
Recommendations for Red Teamers
- Test API Granularity: Check if the API returns high-precision floats.
- Attempt Reversal: Use tools like the adversarial-robustness-toolbox (ART) to invert classes.
Recommendations for Defenders
- Restrict Outputs: Don't give attackers the gradients they need.
- Monitor: Watch for the "optimization heartbeat" in your logs.
Next Steps
- Chapter 30: Backdoor Attacks
- Chapter 31: AI System Reconnaissance
- Practice: Try the GMI (Generative Model Inversion) lab on GitHub.
Quick Reference
Attack Vector Summary
Reconstructing training data features by optimizing input noise to maximize target class confidence.
Key Detection Indicators
- Inputs that resemble random noise or synthetic, unnatural images.
- Sequential, small-step input modifications.
- Rapid bursts of queries against single classes.
Primary Mitigation
- Confidence Rounding: Reduce signal precision.
- Differential Privacy: Decorrelate weights from individual data points.
Severity: High (Biometric/Medical Data Loss)
Ease of Exploit: High (Tools widely available)
Common Targets: Facial Recognition, Medical Diagnostics, Fraud Detection
Appendix A: Pre-Engagement Checklist
- Confirm API access level (scores vs. labels).
- Identify high-risk classes (e.g., specific individuals).
- Set up logging monitor for query analysis.
Appendix B: Post-Engagement Checklist
- Delete all reconstructed images.
- Report estimated fidelity of reconstruction.
- Verify if confidence rounding was enabled post-test.


