
# OBLITERATUS Pipeline Efficiency Audit

**Auditor perspective**: Shrewd CTO evaluating compute ROI, memory discipline, and time-to-value across all obliteration methods.

**Scope**: Every obliteration method in `abliterate.py` (8 primary methods + 4 baseline reproductions), the strategy layer (`strategies/`), the informed pipeline, Bayesian optimizer, and LoRA ablation.

---

## Executive Summary

OBLITERATUS has an impressively comprehensive pipeline, but several methods carry **significant hidden costs** that erode their value proposition. The worst offenders are:

1. **`_collect_activations` runs prompts one-at-a-time** — this is the single biggest throughput bottleneck in the entire system, costing 5-15x in wall-clock time during PROBE.
2. **Bayesian `optimized` mode clones ALL strong-layer weights to CPU** for rollback, then runs 50 trials of up to ~35 forward/generate passes each. The memory and compute overhead can exceed the rest of the pipeline combined.
3. **`true_iterative_refinement` re-runs the entire PROBE+DISTILL pipeline** per refinement pass with zero early-stopping — 3 passes in `aggressive` triples probe cost even when pass 2 achieves negligible improvement.
4. **SAE training on CPU** is needlessly slow for GPU-resident models.

Below is the method-by-method breakdown.

---

## Stage-Level Audit

### Stage 1: SUMMON (Model Loading)

**Status**: Acceptable. Uses `load_model` with quantization support and `expandable_segments` CUDA config. No issues.

### Stage 2: PROBE (`_collect_activations`)

| Issue | Severity | Impact |
|-------|----------|--------|
| **Single-prompt forward passes** (`abliterate.py:1074`) | CRITICAL | Each of 512+ harmful/harmless prompts triggers a separate `model(**inputs)` call. No batching. On a 7B model with 512 pairs, this means ~1024 sequential forward passes instead of ~32 batched passes (batch_size=32). Estimated 5-15x slowdown. |
| **`_free_gpu_memory()` called after EVERY prompt** (`abliterate.py:1086`) | HIGH | `gc.collect()` + `torch.cuda.empty_cache()` 1024 times is expensive — the Python GC full-collection alone adds measurable overhead at this frequency. Should be called every N prompts, not every single one. |
| **Chat template applied per-prompt in a Python loop** (`abliterate.py:955-965`) | MODERATE | `tokenizer.apply_chat_template()` called individually 1024 times. Should batch. |
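| **Jailbreak probing adds a third prompt pass** when `use_jailbreak_contrast=True` | MODERATE | Runs a third full pass over all prompts, raising PROBE cost from two passes to three (~1.5x). Justified by the quality improvement, but because no pass is batched, the extra pass costs as much as a full unbatched probe rather than a small batched increment. |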
| **Router profiling hooks zero-cost claim is correct** (`abliterate.py:872`) | OK | Hooks piggyback on existing forward passes. Good design. |

**Recommendation**: Batch `_collect_activations`. Tokenize all prompts, pad to equal length per micro-batch, and run batched `model(**inputs)`. Expected 5-15x speedup with no quality loss, provided the hooks handle padded positions correctly. Reduce `_free_gpu_memory()` frequency to every 32-64 prompts.
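
The two changes recommended above can be sketched without any model code. This is a minimal illustration, not the project's implementation: `iter_micro_batches`, `run_probe`, `forward_fn`, and `cleanup_every` are hypothetical names, `forward_fn` stands in for tokenization plus the batched `model(**inputs)` call, and `gc.collect()` stands in for `_free_gpu_memory()`.

```python
import gc
from typing import Callable, Iterator, List, Sequence


def iter_micro_batches(prompts: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size prompts."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, len(prompts), batch_size):
        yield list(prompts[start:start + batch_size])


def run_probe(
    prompts: Sequence[str],
    forward_fn: Callable[[List[str]], None],
    batch_size: int = 32,
    cleanup_every: int = 2,
) -> int:
    """Call forward_fn once per micro-batch and clean up every `cleanup_every` batches.

    Returns the number of cleanup calls so the cadence can be checked.
    """
    cleanups = 0
    batches = iter_micro_batches(prompts, batch_size)
    for batch_idx, batch in enumerate(batches, start=1):
        forward_fn(batch)  # placeholder for tokenization + batched model(**inputs)
        if batch_idx % cleanup_every == 0:
            gc.collect()  # placeholder for _free_gpu_memory()
            cleanups += 1
    return cleanups


calls: List[int] = []
n_cleanups = run_probe([f"p{i}" for i in range(1024)], lambda b: calls.append(len(b)))
print(len(calls), n_cleanups)  # 32 16
```

With `batch_size=32` and `cleanup_every=2`, cleanup runs once per 64 prompts, which sits at the upper end of the 32-64 range suggested above.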

### Stage 3: DISTILL (`_distill`)

| Issue | Severity | Impact |
|-------|----------|--------|
| **Full SVD on per-prompt diff matrix** (`abliterate.py:1226`) | MODERATE | `torch.linalg.svd(diff_matrix, full_matrices=False)` on a `(512, hidden_dim)` matrix per layer. For 32 layers this is 32 SVD calls, each O(min(m,n)^2 * max(m,n)). At hidden_dim=4096, each is ~100ms on CPU. Total: ~3s. Acceptable for the quality gain. |
| **Whitened SVD import is lazy** (`abliterate.py:1127`) | OK | Good — only imports when needed. No cost for basic/advanced. |
| **Wasserstein extraction** (`abliterate.py:1136`) | OK | Falls back gracefully. The GEP solve is lightweight. |
| **RDO gradient optimization: 500 steps per layer** (`abliterate.py:1427`) | HIGH | For 20 strong layers, that's 10,000 Adam steps. Each step involves a matrix multiply on `(n_prompts, hidden_dim)` tensors. On CPU this takes 30-60s. The 500-step budget is a "practical compromise" per the comments, but the SVD warm-start means most directions converge in ~100 steps. **No early stopping.** |
| **Gram-Schmidt re-orthogonalization is O(k^2)** per layer (`abliterate.py:1168-1173`) | LOW | With k<=8, this is negligible. |
| **SAE training: 30 epochs on CPU** (`abliterate.py:1582`) | HIGH | `device="cpu"` is hardcoded. For hidden_dim=4096 and expansion=4, the SAE latent width is 16,384, so the encoder alone holds ~67M parameters (~134M with an untied decoder). 30 epochs on CPU takes 15-45s per layer. With 20 strong layers, this is 5-15 minutes of wasted time when a GPU is available. |
| **Layer selection (knee + COSMIC fusion)** | OK | Lightweight statistical operations. No concern. |
| **CoT-aware orthogonalization** | OK | Single SVD per layer, simple vector operations. |
| **Jailbreak-contrastive blending** | OK | Pure vector arithmetic, negligible cost. |
| **Float-layer interpolation** | OK | Gaussian weight computation is trivial. |

**Recommendation**: (1) Add early-stopping to RDO at convergence (e.g., loss delta < 1e-4 for 20 consecutive steps). (2) Use GPU for SAE training when available — change `device="cpu"` to auto-detect.

### Stage 4: EXCISE (`_excise`)

| Issue | Severity | Impact |
|-------|----------|--------|
| **Rank-1 projection is memory-efficient** (`abliterate.py:3479-3480`) | OK | `W @ d` produces a vector, not a full projection matrix. This is the right approach. |
| **`true_iterative_refinement` re-runs PROBE+DISTILL** (`abliterate.py:2474-2485`) | CRITICAL | Each refinement pass re-collects all activations (512*2+ forward passes) and re-runs SVD. `aggressive` mode does 3 passes = 3x full pipeline cost. There is **no check** whether the refined directions materially differ from the previous pass. A cosine-similarity early-exit (e.g., all directions > 0.99 cosine with previous pass → stop) would save enormous compute on pass 3. |
| **Bayesian optimization clones ALL weight tensors** (`bayesian_optimizer.py:301-341`) | CRITICAL | For a 7B model with 20 strong layers, this can be 2-4 GB of CPU clones just for rollback. For a 70B model, this is 20-40 GB. The log even reports the size (`total_saved_mb`), but there's no memory check or fallback. |
| **Bayesian trials run full generate passes** (`bayesian_optimizer.py:445-446`) | CRITICAL | Each of 50 trials runs `_measure_refusal_rate` (8-30 generation calls with `max_new_tokens=128`) PLUS `_measure_kl_divergence` (5 forward passes). That's 13-35 forward/generate calls per trial × 50 trials = **650-1,750 calls** just for hyperparameter search, and each generation call is itself up to 128 decoding steps. This likely dominates the total pipeline runtime for `optimized` and `heretic` modes. |
| **KL optimization proxy is cheap** (`abliterate.py:3057-3268`) | OK | Uses projection magnitude as a KL proxy instead of actual per-layer forward passes. Good engineering — avoids the expensive per-layer ablation/measurement loop. |
| **Norm preservation adds one extra `.norm()` per weight matrix** | LOW | Frobenius norm is O(n) — negligible overhead. |
| **Dequantize/re-quantize for bitsandbytes** (`abliterate.py:3287-3400`) | MODERATE | Necessary for correctness, but the full dequantize → modify → re-quantize cycle per weight matrix is expensive for 4-bit models. Consider caching the dequantized tensor when projecting multiple directions through the same weight. |
| **Safety-neuron masking** | LOW | Z-score computation is a single pass over the projection vector. Cheap. |
| **Expert transplant uses incremental mean** (`abliterate.py:4350-4364`) | OK | Welford-style running mean avoids materializing all expert weights. Good memory discipline for 400B-scale models. |
| **`_stabilize_router_weights` called after every MoE layer** (`abliterate.py:3866`) | LOW | Clamps router weights. Trivial cost. |

**Recommendation**: (1) Add direction-convergence early-exit to iterative refinement. (2) Reduce Bayesian trial count or implement batch generation for refusal measurement. (3) Cache dequantized weights across multi-direction projection within the same layer.
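
Recommendation (3) amounts to moving the dequantize and re-quantize steps outside the per-direction loop. The sketch below illustrates only that control flow: plain nested lists stand in for tensors, and `project_directions`, `fake_dequantize`, and `fake_requantize` are hypothetical helpers, not bitsandbytes or `abliterate.py` functions. It also assumes the directions are unit-norm.

```python
from typing import Callable, List, Sequence

Matrix = List[List[float]]


def remove_direction(weight: Matrix, direction: Sequence[float]) -> Matrix:
    """Rank-1 projection W <- W - (W d) d^T for a unit-norm direction d."""
    out = []
    for row in weight:
        coeff = sum(w * d for w, d in zip(row, direction))  # one entry of W d
        out.append([w - coeff * d for w, d in zip(row, direction)])
    return out


def project_directions(
    quantized: object,
    directions: Sequence[Sequence[float]],
    dequantize: Callable[[object], Matrix],
    requantize: Callable[[Matrix], object],
) -> object:
    """Dequantize once, remove every direction, then re-quantize once."""
    weight = dequantize(quantized)
    for direction in directions:
        weight = remove_direction(weight, direction)
    return requantize(weight)


# Counting stand-ins so the number of conversions is visible.
counts = {"dequantize": 0, "requantize": 0}


def fake_dequantize(q):
    counts["dequantize"] += 1
    return [list(row) for row in q]


def fake_requantize(w):
    counts["requantize"] += 1
    return w


result = project_directions(
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    fake_dequantize,
    fake_requantize,
)
print(result, counts)
```

For k directions this performs one conversion pair per weight matrix instead of k. A side effect worth checking on a real model is that quantization error is also introduced once rather than k times.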

### Stage 5: VERIFY (`_verify`)

| Issue | Severity | Impact |
|-------|----------|--------|
| **30 generation calls for refusal measurement** (`abliterate.py:4622`) | MODERATE | Each generates up to 128 tokens with greedy decoding. For a 7B model this is ~30s total. Acceptable as a one-time quality check. |
| **`_tier_label` does `list.index()` per prompt** (`abliterate.py:4593`) | LOW | O(n) search in a list for each of 30 prompts. Trivially fixable with a dict, but the cost is negligible at n=512. |
| **Perplexity measurement on 3 short texts** | OK | Minimal cost. |

### Stage 6: REBIRTH (Model Saving)

Not audited in detail — standard HuggingFace `save_pretrained`. No efficiency concerns.

---

## Method-by-Method Efficiency Grades

| Method | Compute Cost | Memory Cost | Value/Cost Ratio | Grade |
|--------|-------------|-------------|-------------------|-------|
| **basic** | Low (1 dir, 1 pass, no extras) | Low | High | **A** |
| **advanced** | Moderate (4 dirs, 2 passes, norm-preserve, bias projection) | Moderate | High | **A-** |
| **aggressive** | High (8 dirs, 3 passes with `true_iterative_refinement`) | High (3x activation storage) | Moderate — 3rd pass rarely justified | **B-** |
| **informed** | High (runs analysis modules + Wasserstein GEP) | High (analysis module state) | High — analysis feedback is genuinely valuable | **B+** |
| **surgical** | Very High (SAE training + head surgery + EGA + neuron masking) | Very High | Moderate — many techniques compound but with diminishing returns | **C+** |
| **inverted** | Very High (surgical + reflection + SAE) | Very High | Niche — only needed for "actively compliant" use case | **C** |
| **optimized** | Extreme (50 Bayesian trials × up to ~35 forward/generate passes each) | Extreme (full strong-layer weight clones held on CPU for rollback) | Low unless you have a multi-GPU cluster | **D+** |
| **nuclear** | Very High (inverted + layer-adaptive + expert transplant + steering hooks) | Very High | Highly specialized — justified only for stubborn MoE models | **C** |

### Baseline Reproductions

| Method | Compute Cost | Grade | Notes |
|--------|-------------|-------|-------|
| **failspy** | Low | **A** | Faithful minimal reproduction. Efficient by design. |
| **gabliteration** | Low-Moderate | **A-** | 4-dir SVD + ridge. Clean. |
| **heretic** | Extreme | **D** | Inherits Bayesian trial overhead: 50 trials × up to ~35 passes each. |
| **rdo** | High | **B** | 500 gradient steps/layer. Would benefit from early-stopping. |

---

## Strategy Module Audit (`strategies/`)

| Strategy | Implementation | Grade |
|----------|---------------|-------|
| `embedding_ablation` | Clean zero-out by chunk. `torch.no_grad()` used correctly. | **A** |
| `ffn_ablation` | Iterates all FFN params and zeros. Fine for ablation study. | **A** |
| `head_pruning` | Handles GPT-2 Conv1D and standard Q/K/V separately. Correct. | **A-** |
| `layer_removal` | Zeros all params. Simple and correct. | **A** |
| `registry` | Minimal dict-based registry with decorator. No overhead. | **A** |
| `runner.py` | **Creates a new `Evaluator` per spec** (`runner.py:86-95`). This re-initializes dataset processing for every ablation spec. Should create once and reuse. | **B** |
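
The `runner.py` finding comes down to hoisting construction out of the per-spec loop. A minimal sketch follows, assuming the evaluator holds no per-spec state; the `Evaluator` class here is a stand-in with a deliberately simple constructor, and `run_specs` is a hypothetical name, so neither reflects the real signatures in `runner.py`.

```python
from typing import Callable, Dict, List


class Evaluator:
    """Stand-in whose constructor represents expensive dataset preprocessing."""

    instances_created = 0

    def __init__(self, dataset: List[str]) -> None:
        Evaluator.instances_created += 1
        self.dataset = [text.strip().lower() for text in dataset]

    def evaluate(self, model: Callable[[str], int]) -> float:
        """Average the model's score over the preprocessed dataset."""
        return sum(model(text) for text in self.dataset) / len(self.dataset)


def run_specs(
    specs: Dict[str, Callable[[str], int]], dataset: List[str]
) -> Dict[str, float]:
    """Build the evaluator once and reuse it for every ablation spec."""
    evaluator = Evaluator(dataset)  # constructed outside the loop
    return {name: evaluator.evaluate(model) for name, model in specs.items()}


scores = run_specs(
    {"length": len, "vowels": lambda t: sum(ch in "aeiou" for ch in t)},
    ["  Alpha ", "Beta"],
)
print(scores, Evaluator.instances_created)
```

If the real evaluator caches anything that depends on the model under test, that cache would need to be reset between specs for this reuse to be safe.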

---

## Cross-Cutting Concerns

### 1. Memory Management

- **Good**: `_free_gpu_memory()` exists and is called between stages. `expandable_segments` is set early.
- **Bad**: `_free_gpu_memory()` called 1024+ times during PROBE (once per prompt). The `gc.collect()` cost alone adds up.
- **Bad**: Bayesian optimizer clones all strong-layer weights with no memory budget check.
- **Bad**: No streaming/chunking for activation storage — all 512 prompts × 32 layers of activations are held in a list of CPU tensors simultaneously.
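
A memory budget check for the rollback snapshot could look like the sketch below. It is illustrative only: `clone_bytes` and `can_snapshot` are hypothetical helpers, the free-memory figure is passed in by the caller rather than queried, and the per-layer shapes are assumed example values, not numbers read from any particular model or from `bayesian_optimizer.py`.

```python
from typing import Iterable, Tuple


def clone_bytes(shapes: Iterable[Tuple[int, int]], bytes_per_param: int = 2) -> int:
    """Bytes needed to clone 2-D weights of the given shapes (2 bytes = fp16/bf16)."""
    return sum(rows * cols for rows, cols in shapes) * bytes_per_param


def can_snapshot(required: int, available: int, max_fraction: float = 0.5) -> bool:
    """Allow the snapshot only if it uses at most max_fraction of free memory."""
    return required <= available * max_fraction


# Assumed example: two 4096x4096 and one 11008x4096 matrix per layer, 20 layers.
per_layer = [(4096, 4096), (4096, 4096), (11008, 4096)]
required = clone_bytes(per_layer * 20)
print(f"{required / 2**30:.2f} GiB")  # 2.93 GiB

free_ram = 8 * 2**30  # caller-supplied figure, e.g. from the OS
print(can_snapshot(required, free_ram))  # True
```

When the check fails, plausible fallbacks include snapshotting fewer layers or spilling the clones to disk; which one fits depends on how rollback is used in the optimizer.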

### 2. GPU Utilization

- **Good**: Adaptive `max_length` based on free GPU memory.
- **Good**: Rank-1 projections avoid materializing full projection matrices.
- **Bad**: SAE training hardcoded to CPU.
- **Bad**: Single-prompt forward passes waste GPU parallelism.
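- **Bad**: Neither `torch.compile()` nor `torch.inference_mode()` is used anywhere. The latter can be somewhat faster than `torch.no_grad()` for pure inference, but tensors created under it cannot be saved for backward unless cloned, so it suits generation and verification more readily than activations that later feed RDO's gradient steps.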

### 3. Quantization Handling

- **Good**: Detects bitsandbytes 4-bit/8-bit and dequantizes before projection.
- **Good**: Refuses to operate on raw quantized bytes (avoids silent corruption).
- **Moderate**: Full dequantize/re-quantize per direction per weight matrix. Could cache across multi-direction projections.

---

## Top 5 Recommendations (Ranked by Impact)

### 1. Batch `_collect_activations` (CRITICAL — 5-15x PROBE speedup)

```python
# Current: one prompt at a time
for i, prompt in enumerate(prompts):
    inputs = tokenizer(prompt, ...)
    model(**inputs)

# Proposed: micro-batched
# Requires tokenizer.pad_token to be set (many causal LMs ship without one).
with torch.no_grad():  # entered once instead of once per batch
    for batch_start in range(0, len(prompts), batch_size):
        batch = prompts[batch_start:batch_start + batch_size]
        inputs = tokenizer(batch, return_tensors="pt", padding=True,
                           truncation=True, max_length=max_length)
        inputs = {k: v.to(device) for k, v in inputs.items()}
        model(**inputs)
```

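Hooks need a minor adjustment to handle the batch dimension and, if they read the last-token activation, to skip padded positions (left-pad or index by attention mask). The core change is ~20 lines.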
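
One way the hook adjustment might look, assuming the hooks keep a single last-token activation per prompt (the audit does not state which positions are read). Plain nested lists stand in for tensors, and `last_token_indices` and `gather_last_token` are hypothetical helpers.

```python
from typing import List, Sequence


def last_token_indices(attention_mask: Sequence[Sequence[int]]) -> List[int]:
    """Index of the last real (mask == 1) token in each row, for either padding side."""
    indices = []
    for row in attention_mask:
        real = [pos for pos, flag in enumerate(row) if flag == 1]
        if not real:
            raise ValueError("row contains only padding")
        indices.append(real[-1])
    return indices


def gather_last_token(hidden, attention_mask):
    """Pick one activation vector per prompt from a [batch][seq][hidden] nested list."""
    picks = last_token_indices(attention_mask)
    return [hidden[b][idx] for b, idx in enumerate(picks)]


mask = [[1, 1, 1, 0], [1, 1, 0, 0]]  # right-padded batch of two prompts
hidden = [[[0.1], [0.2], [0.3], [9.9]], [[0.4], [0.5], [9.9], [9.9]]]
print(gather_last_token(hidden, mask))  # [[0.3], [0.5]]
```

With right padding in PyTorch, the same indices are commonly computed as `attention_mask.sum(dim=1) - 1`. Naively reading position `-1` of a right-padded batch would return padding activations for every prompt shorter than the longest one.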

### 2. Add early-stopping to `true_iterative_refinement` (HIGH — saves 1-2 full PROBE passes)

After re-distilling, compute cosine similarity between old and new refusal directions. If all directions are >0.99 cosine, skip remaining passes. Expected to save 30-60% of `aggressive` mode runtime.
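
The convergence test described above is a few lines of arithmetic. The sketch below uses plain lists in place of tensors; `directions_converged` is a hypothetical name, and taking the absolute value of the cosine is an added assumption (SVD-derived directions are only defined up to sign, and the rank-1 projection is unaffected by a sign flip).

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def directions_converged(
    previous: Sequence[Sequence[float]],
    current: Sequence[Sequence[float]],
    threshold: float = 0.99,
) -> bool:
    """True when every index-matched pair of directions has |cosine| above threshold."""
    if len(previous) != len(current):
        return False
    return all(abs(cosine(p, c)) > threshold for p, c in zip(previous, current))


old = [[1.0, 0.0], [0.0, 1.0]]
new = [[1.0, 0.01], [0.0, -1.0]]
print(directions_converged(old, new))  # True
```

Pairing directions by index assumes their ordering is stable between passes. When several directions per layer are extracted, comparing the spanned subspaces may be more robust than comparing vectors one by one.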

### 3. Move SAE training to GPU (HIGH — 5-15 min saved for `surgical`/`inverted`)

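Change `device="cpu"` to auto-detect an available GPU. The SAE is small relative to the model (at hidden_dim=4096 and expansion=4, the encoder is ~67M parameters, ~134M with an untied decoder), so it should fit alongside the model on most GPUs, though optimizer state adds to the footprint.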

### 4. Reduce Bayesian trial overhead (HIGH — saves 30-60 min for `optimized`)

Options:
- Reduce `n_refusal_prompts` from 8-30 to 4-6 (generation is expensive)
- Use perplexity-only as a faster proxy in early trials, switch to refusal measurement for top candidates
- Implement batch generation for `_measure_refusal_rate`
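
The second option, a cheap proxy first and the expensive measurement only for the best candidates, can be sketched as a two-stage filter. This is a simplified stand-in rather than a Bayesian optimizer: it ranks a fixed candidate list instead of proposing trials sequentially, and `two_stage_search`, `cheap_score`, and `expensive_score` are hypothetical names.

```python
from typing import Callable, Dict, Sequence, Tuple

Params = Dict[str, float]


def two_stage_search(
    candidates: Sequence[Params],
    cheap_score: Callable[[Params], float],
    expensive_score: Callable[[Params], float],
    top_k: int = 5,
) -> Tuple[Params, float]:
    """Rank all candidates with a cheap proxy, then fully score only the top_k.

    Lower is better for both scoring functions.
    """
    if not candidates:
        raise ValueError("no candidates supplied")
    shortlist = sorted(candidates, key=cheap_score)[:top_k]
    scored = [(expensive_score(c), c) for c in shortlist]  # one call per finalist
    best_score, best = min(scored, key=lambda pair: pair[0])
    return best, best_score


expensive_calls = 0


def slow_metric(params: Params) -> float:
    """Stand-in for a generation-based refusal measurement."""
    global expensive_calls
    expensive_calls += 1
    return (params["scale"] - 5.0) ** 2


grid = [{"scale": float(s)} for s in range(10)]
best, score = two_stage_search(
    grid, lambda p: abs(p["scale"] - 4.0), slow_metric, top_k=3
)
print(best, score, expensive_calls)  # {'scale': 5.0} 0.0 3
```

The saving depends on how well the proxy ranks candidates: here the proxy's favorite (`scale=4.0`) is not the true optimum, but the optimum still makes the shortlist, so only 3 of 10 candidates pay for the expensive measurement.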

### 5. Add early-stopping to RDO (MODERATE — saves 10-30s for `rdo` mode)

Monitor loss convergence and break at plateau (delta < 1e-4 for 20 steps). Most directions converge in ~100-200 steps, not 500.
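
A plateau check along these lines is one way to implement the rule. `PlateauStopper` is a hypothetical helper, and the exponentially decaying loss is a synthetic stand-in for the RDO objective, used only to show the loop exiting before the 500-step budget.

```python
import math
from typing import Optional


class PlateauStopper:
    """Signal a stop once the loss changes by < min_delta for `patience` steps in a row."""

    def __init__(self, min_delta: float = 1e-4, patience: int = 20) -> None:
        self.min_delta = min_delta
        self.patience = patience
        self._previous: Optional[float] = None
        self._streak = 0

    def should_stop(self, loss: float) -> bool:
        """Record one loss value and report whether the plateau criterion is met."""
        if self._previous is not None and abs(self._previous - loss) < self.min_delta:
            self._streak += 1
        else:
            self._streak = 0  # any larger change resets the count
        self._previous = loss
        return self._streak >= self.patience


stopper = PlateauStopper()
steps_run = 0
for step in range(500):
    loss = math.exp(-0.1 * step)  # synthetic loss curve, not real RDO output
    steps_run = step + 1
    if stopper.should_stop(loss):
        break
print(steps_run)
```

Requiring consecutive small deltas, rather than a single one, guards against stopping on a brief flat stretch early in optimization.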

---

## Verdict

The pipeline is **architecturally sound**: the rank-1 projection math is correct and memory-efficient, the stage separation is clean, and the progressive method complexity (basic → nuclear) gives users clear cost/quality tradeoffs. However, the **PROBE stage bottleneck** (single-prompt forward passes) and **Bayesian trial overhead** (up to ~1,750 forward/generate passes) are the two elephants in the room. Fixing just recommendation #1 could make the entire system an estimated 3-5x faster for the majority of users who run basic/advanced/aggressive modes.

The `optimized` and `heretic` modes have a legitimate place for users with compute budget, but their current efficiency makes them impractical for anything under an A100. The documentation should be more explicit about expected runtimes.

**Overall system grade: B+** — excellent functionality, needs batching and early-stopping.