# OBLITERATUS Pipeline Efficiency Audit

**Auditor perspective**: Shrewd CTO evaluating compute ROI, memory discipline, and time-to-value across all obliteration methods.

**Scope**: Every obliteration method in `abliterate.py` (8 primary methods + 4 baseline reproductions), the strategy layer (`strategies/`), the informed pipeline, Bayesian optimizer, and LoRA ablation.

---
## Executive Summary
OBLITERATUS has an impressively comprehensive pipeline, but several methods carry **significant hidden costs** that erode their value proposition. The worst offenders are:
1. **`_collect_activations` runs prompts one-at-a-time** — this is the single biggest throughput bottleneck in the entire system, costing 5-15x in wall-clock time during PROBE.
2. **Bayesian `optimized` mode clones ALL strong-layer weights to CPU** for rollback, then runs 50 trials of up to ~35 forward/generate passes each; the memory and compute overhead can exceed the rest of the pipeline combined.
3. **`true_iterative_refinement` re-runs the entire PROBE+DISTILL pipeline** per refinement pass with zero early-stopping — 3 passes in `aggressive` triples probe cost even when pass 2 achieves negligible improvement.
4. **SAE training on CPU** is needlessly slow for GPU-resident models.

Below is the method-by-method breakdown.

---
## Stage-Level Audit
### Stage 1: SUMMON (Model Loading)
**Status**: Acceptable. Uses `load_model` with quantization support and `expandable_segments` CUDA config. No issues.
### Stage 2: PROBE (`_collect_activations`)
| Issue | Severity | Impact |
|-------|----------|--------|
| **Single-prompt forward passes** (`abliterate.py:1074`) | CRITICAL | Each of 512+ harmful/harmless prompts triggers a separate `model(**inputs)` call. No batching. On a 7B model with 512 pairs, this means ~1024 sequential forward passes instead of ~32 batched passes (batch_size=32). Estimated 5-15x slowdown. |
| **`_free_gpu_memory()` called after EVERY prompt** (`abliterate.py:1086`) | HIGH | `gc.collect()` + `torch.cuda.empty_cache()` 1024 times is expensive — the Python GC full-collection alone adds measurable overhead at this frequency. Should be called every N prompts, not every single one. |
| **Chat template applied per-prompt in a Python loop** (`abliterate.py:955-965`) | MODERATE | `tokenizer.apply_chat_template()` called individually 1024 times. Should batch. |
| **Jailbreak probing adds a third prompt pass** when `use_jailbreak_contrast=True` | MODERATE | Adds a third full pass over all prompts on top of the harmful and harmless passes, so PROBE cost rises by roughly 50%. Justified by the quality improvement, but without batching the extra pass is paid at the slow per-prompt rate. |
| **Router profiling hooks zero-cost claim is correct** (`abliterate.py:872`) | OK | Hooks piggyback on existing forward passes. Good design. |

**Recommendation**: Batch `_collect_activations`. Tokenize all prompts, pad to equal length per micro-batch, and run batched `model(**inputs)`. Expected 5-15x speedup (the inverse of the slowdown estimated above) with no quality loss, provided padded positions are excluded when activations are read. Reduce `_free_gpu_memory()` frequency to every 32-64 prompts.
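
The reduced cleanup frequency recommended above can be sketched as a small counter wrapper. This is a minimal illustration, not code from `abliterate.py`: `ThrottledCleanup` and `free_memory` are hypothetical names, and `free_memory` only stands in for the pipeline's `_free_gpu_memory()`.

```python
import gc


class ThrottledCleanup:
    """Run an expensive cleanup callback once every `every` calls."""

    def __init__(self, cleanup, every=32):
        if every < 1:
            raise ValueError("every must be >= 1")
        self.cleanup = cleanup
        self.every = every
        self.calls = 0  # prompts processed since the last cleanup
        self.runs = 0   # number of times the cleanup actually ran

    def step(self):
        """Call once per prompt; triggers cleanup on every `every`-th call."""
        self.calls += 1
        if self.calls >= self.every:
            self.flush()

    def flush(self):
        """Force a cleanup if any work is pending, e.g. after the loop."""
        if self.calls:
            self.cleanup()
            self.runs += 1
            self.calls = 0


def free_memory():
    # Stand-in (assumption) for `_free_gpu_memory()`; the real helper is
    # described as also calling torch.cuda.empty_cache().
    gc.collect()


throttle = ThrottledCleanup(free_memory, every=32)
for _prompt in range(1024):  # placeholder for the per-prompt forward pass
    throttle.step()
throttle.flush()
print(throttle.runs)  # 32 cleanups instead of 1024
```

The final `flush()` keeps the behaviour safe when the prompt count is not a multiple of `every`.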
### Stage 3: DISTILL (`_distill`)
| Issue | Severity | Impact |
|-------|----------|--------|
| **Full SVD on per-prompt diff matrix** (`abliterate.py:1226`) | MODERATE | `torch.linalg.svd(diff_matrix, full_matrices=False)` on a `(512, hidden_dim)` matrix per layer. For 32 layers this is 32 SVD calls, each O(min(m,n)^2 * max(m,n)). At hidden_dim=4096, each is ~100ms on CPU. Total: ~3s. Acceptable for the quality gain. |
| **Whitened SVD import is lazy** (`abliterate.py:1127`) | OK | Good — only imports when needed. No cost for basic/advanced. |
| **Wasserstein extraction** (`abliterate.py:1136`) | OK | Falls back gracefully. The GEP solve is lightweight. |
| **RDO gradient optimization: 500 steps per layer** (`abliterate.py:1427`) | HIGH | For 20 strong layers, that's 10,000 Adam steps. Each step involves a matrix multiply on `(n_prompts, hidden_dim)` tensors. On CPU this takes 30-60s. The 500-step budget is a "practical compromise" per the comments, but the SVD warm-start means most directions converge in ~100 steps. **No early stopping.** |
| **Gram-Schmidt re-orthogonalization is O(k^2)** per layer (`abliterate.py:1168-1173`) | LOW | With k<=8, this is negligible. |
| **SAE training: 30 epochs on CPU** (`abliterate.py:1582`) | HIGH | `device="cpu"` is hardcoded. For hidden_dim=4096 and expansion=4 the latent width is 16,384, so a single 4096 × 16,384 projection matrix already holds ~67M parameters (about 134M for an untied encoder/decoder pair). 30 epochs on CPU takes 15-45s per layer. With 20 strong layers, this is 5-15 minutes of wasted time when a GPU is available. |
| **Layer selection (knee + COSMIC fusion)** | OK | Lightweight statistical operations. No concern. |
| **CoT-aware orthogonalization** | OK | Single SVD per layer, simple vector operations. |
| **Jailbreak-contrastive blending** | OK | Pure vector arithmetic, negligible cost. |
| **Float-layer interpolation** | OK | Gaussian weight computation is trivial. |

**Recommendation**: (1) Add early-stopping to RDO at convergence (e.g., loss delta < 1e-4 for 20 consecutive steps). (2) Use GPU for SAE training when available: change `device="cpu"` to auto-detect.
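
The convergence rule in recommendation (1) could look roughly like the sketch below. `PlateauStopper` and the toy `optimize` loop are hypothetical; the loop minimizes `x**2` with plain gradient descent purely as a stand-in for the RDO objective, which this audit does not reproduce.

```python
class PlateauStopper:
    """Signal a stop once the loss changes by less than `min_delta`
    for `patience` consecutive steps."""

    def __init__(self, min_delta=1e-4, patience=20):
        self.min_delta = min_delta
        self.patience = patience
        self.prev_loss = None
        self.stale_steps = 0

    def should_stop(self, loss):
        if self.prev_loss is not None and abs(self.prev_loss - loss) < self.min_delta:
            self.stale_steps += 1
        else:
            self.stale_steps = 0  # any meaningful change resets the counter
        self.prev_loss = loss
        return self.stale_steps >= self.patience


def optimize(max_steps=500):
    """Toy stand-in for the per-layer RDO loop; returns the steps used."""
    x, lr = 1.0, 0.1
    stopper = PlateauStopper()
    for step in range(1, max_steps + 1):
        x -= lr * 2 * x  # gradient step on loss = x**2
        if stopper.should_stop(x * x):
            return step
    return max_steps


steps_used = optimize()
print(steps_used)  # well below the 500-step budget for this toy objective
```

Resetting the counter on any step that moves the loss by at least `min_delta` avoids stopping on a single flat step in an otherwise improving run.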
### Stage 4: EXCISE (`_excise`)
| Issue | Severity | Impact |
|-------|----------|--------|
| **Rank-1 projection is memory-efficient** (`abliterate.py:3479-3480`) | OK | `W @ d` produces a vector, not a full projection matrix. This is the right approach. |
| **`true_iterative_refinement` re-runs PROBE+DISTILL** (`abliterate.py:2474-2485`) | CRITICAL | Each refinement pass re-collects all activations (512*2+ forward passes) and re-runs SVD. `aggressive` mode does 3 passes = 3x full pipeline cost. There is **no check** whether the refined directions materially differ from the previous pass. A cosine-similarity early-exit (e.g., all directions > 0.99 cosine with previous pass → stop) would save enormous compute on pass 3. |
| **Bayesian optimization clones ALL weight tensors** (`bayesian_optimizer.py:301-341`) | CRITICAL | For a 7B model with 20 strong layers, this can be 2-4 GB of CPU clones just for rollback. For a 70B model, this is 20-40 GB. The log even reports the size (`total_saved_mb`), but there's no memory check or fallback. |
| **Bayesian trials run full generate passes** (`bayesian_optimizer.py:445-446`) | CRITICAL | Each of 50 trials runs `_measure_refusal_rate` (8-30 generation calls with `max_new_tokens=128`) PLUS `_measure_kl_divergence` (5 forward passes). That's up to ~35 forward/generate passes per trial × 50 trials = **up to 1,750 passes** (about 650 at the low end of 8 refusal prompts) just for hyperparameter search. This likely dominates the total pipeline runtime for `optimized` and `heretic` modes. |
| **KL optimization proxy is cheap** (`abliterate.py:3057-3268`) | OK | Uses projection magnitude as a KL proxy instead of actual per-layer forward passes. Good engineering — avoids the expensive per-layer ablation/measurement loop. |
| **Norm preservation adds one extra `.norm()` per weight matrix** | LOW | Frobenius norm is O(n) — negligible overhead. |
| **Dequantize/re-quantize for bitsandbytes** (`abliterate.py:3287-3400`) | MODERATE | Necessary for correctness, but the full dequantize → modify → re-quantize cycle per weight matrix is expensive for 4-bit models. Consider caching the dequantized tensor when projecting multiple directions through the same weight. |
| **Safety-neuron masking** | LOW | Z-score computation is a single pass over the projection vector. Cheap. |
| **Expert transplant uses incremental mean** (`abliterate.py:4350-4364`) | OK | Welford-style running mean avoids materializing all expert weights. Good memory discipline for 400B-scale models. |
| **`_stabilize_router_weights` called after every MoE layer** (`abliterate.py:3866`) | LOW | Clamps router weights. Trivial cost. |

**Recommendation**: (1) Add direction-convergence early-exit to iterative refinement. (2) Reduce Bayesian trial count or implement batch generation for refusal measurement. (3) Cache dequantized weights across multi-direction projection within the same layer.
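
Recommendation (3) amounts to dequantizing once per weight matrix rather than once per direction. The sketch below illustrates the idea with plain Python lists; `FakeQuantizedWeight` is a hypothetical stand-in that only counts conversions and is not the bitsandbytes API. The rank-1 update follows the `W @ d` form noted in the table and assumes unit-norm, mutually orthogonal directions.

```python
class FakeQuantizedWeight:
    """Hypothetical stand-in for a quantized weight; counts conversions."""

    def __init__(self, rows):
        self.rows = [list(r) for r in rows]
        self.dequantize_calls = 0
        self.quantize_calls = 0

    def dequantize(self):
        self.dequantize_calls += 1
        return [list(r) for r in self.rows]

    def requantize(self, rows):
        self.quantize_calls += 1
        self.rows = [list(r) for r in rows]


def project_out(rows, direction):
    """In-place rank-1 update W <- W - (W d) d^T for a unit-norm d."""
    for row in rows:
        coeff = sum(w * d for w, d in zip(row, direction))  # (W d)[i]
        for j, d in enumerate(direction):
            row[j] -= coeff * d


def project_all(weight, directions):
    rows = weight.dequantize()   # one dequantize per weight matrix
    for direction in directions:
        project_out(rows, direction)
    weight.requantize(rows)      # one re-quantize per weight matrix


weight = FakeQuantizedWeight([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
project_all(weight, [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
print(weight.rows)                                     # [[0.0, 0.0, 3.0], [0.0, 0.0, 6.0]]
print(weight.dequantize_calls, weight.quantize_calls)  # 1 1
```

Besides saving time, a single re-quantize also means quantization error is introduced once rather than after every direction.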
### Stage 5: VERIFY (`_verify`)
| Issue | Severity | Impact |
|-------|----------|--------|
| **30 generation calls for refusal measurement** (`abliterate.py:4622`) | MODERATE | Each generates up to 128 tokens with greedy decoding. For a 7B model this is ~30s total. Acceptable as a one-time quality check. |
| **`_tier_label` does `list.index()` per prompt** (`abliterate.py:4593`) | LOW | O(n) search in a list for each of 30 prompts. Trivially fixable with a dict, but the cost is negligible at n=512. |
| **Perplexity measurement on 3 short texts** | OK | Minimal cost. |
### Stage 6: REBIRTH (Model Saving)
Not audited in detail. It is a standard HuggingFace `save_pretrained` call with no efficiency concerns.

---
## Method-by-Method Efficiency Grades
| Method | Compute Cost | Memory Cost | Value/Cost Ratio | Grade |
|--------|-------------|-------------|-------------------|-------|
| **basic** | Low (1 dir, 1 pass, no extras) | Low | High | **A** |
| **advanced** | Moderate (4 dirs, 2 passes, norm-preserve, bias projection) | Moderate | High | **A-** |
| **aggressive** | High (8 dirs, 3 passes with `true_iterative_refinement`) | High (3x activation storage) | Moderate — 3rd pass rarely justified | **B-** |
| **informed** | High (runs analysis modules + Wasserstein GEP) | High (analysis module state) | High — analysis feedback is genuinely valuable | **B+** |
| **surgical** | Very High (SAE training + head surgery + EGA + neuron masking) | Very High | Moderate — many techniques compound but with diminishing returns | **C+** |
| **inverted** | Very High (surgical + reflection + SAE) | Very High | Niche — only needed for "actively compliant" use case | **C** |
| **optimized** | Extreme (50 Bayesian trials × up to 35 forward/generate passes each) | Extreme (full strong-layer weight clones held on CPU for rollback) | Low unless you have a multi-GPU cluster | **D+** |
| **nuclear** | Very High (inverted + layer-adaptive + expert transplant + steering hooks) | Very High | Highly specialized — justified only for stubborn MoE models | **C** |
### Baseline Reproductions
| Method | Compute Cost | Grade | Notes |
|--------|-------------|-------|-------|
| **failspy** | Low | **A** | Faithful minimal reproduction. Efficient by design. |
| **gabliteration** | Low-Moderate | **A-** | 4-dir SVD + ridge. Clean. |
| **heretic** | Extreme | **D** | Inherits Bayesian trial overhead. 50 trials × 35 passes each. |
| **rdo** | High | **B** | 500 gradient steps/layer. Would benefit from early-stopping. |
---
## Strategy Module Audit (`strategies/`)
| Strategy | Implementation | Grade |
|----------|---------------|-------|
| `embedding_ablation` | Clean zero-out by chunk. `torch.no_grad()` used correctly. | **A** |
| `ffn_ablation` | Iterates all FFN params and zeros. Fine for ablation study. | **A** |
| `head_pruning` | Handles GPT-2 Conv1D and standard Q/K/V separately. Correct. | **A-** |
| `layer_removal` | Zeros all params. Simple and correct. | **A** |
| `registry` | Minimal dict-based registry with decorator. No overhead. | **A** |
| `runner.py` | **Creates a new `Evaluator` per spec** (`runner.py:86-95`). This re-initializes dataset processing for every ablation spec. Should create once and reuse. | **B** |
---
## Cross-Cutting Concerns
### 1. Memory Management
- **Good**: `_free_gpu_memory()` exists and is called between stages. `expandable_segments` is set early.
- **Bad**: `_free_gpu_memory()` called 1024+ times during PROBE (once per prompt). The `gc.collect()` cost alone adds up.
- **Bad**: Bayesian optimizer clones all strong-layer weights with no memory budget check.
- **Bad**: No streaming/chunking for activation storage — all 512 prompts × 32 layers of activations are held in a list of CPU tensors simultaneously.
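
The missing memory budget check for rollback clones could be as simple as the arithmetic below. This is a sketch under stated assumptions: the function names, the `safety_factor`, and the example shapes (20 layers with four 4096 × 4096 matrices each) are illustrative and not taken from `bayesian_optimizer.py`; the caller is assumed to supply the free-memory figure.

```python
from math import prod

BYTES_PER_DTYPE = {"float32": 4, "float16": 2, "bfloat16": 2}


def clone_size_bytes(shapes, dtype="float16"):
    """Total bytes needed to clone tensors with the given shapes."""
    return sum(prod(shape) for shape in shapes) * BYTES_PER_DTYPE[dtype]


def can_snapshot(shapes, available_bytes, dtype="float16", safety_factor=0.5):
    """Allow the rollback snapshot only if it fits in a fraction of free RAM."""
    return clone_size_bytes(shapes, dtype) <= available_bytes * safety_factor


# Illustrative shapes only: 20 layers, four 4096x4096 matrices per layer.
shapes = [(4096, 4096)] * 4 * 20
needed = clone_size_bytes(shapes)
print(f"{needed / 2**30:.2f} GiB needed")                   # 2.50 GiB needed
print(can_snapshot(shapes, available_bytes=16 * 2**30))     # True
print(can_snapshot(shapes, available_bytes=4 * 2**30))      # False
```

When the check fails, a fallback such as snapshotting fewer layers or spilling clones to disk would be needed; which fallback suits the optimizer is left open here.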
### 2. GPU Utilization
- **Good**: Adaptive `max_length` based on free GPU memory.
- **Good**: Rank-1 projections avoid materializing full projection matrices.
- **Bad**: SAE training hardcoded to CPU.
- **Bad**: Single-prompt forward passes waste GPU parallelism.
- **Bad**: No `torch.compile()` or `torch.inference_mode()` used anywhere (the latter can be slightly faster than `torch.no_grad()` for pure inference because it also skips autograd's view and version-counter tracking).
### 3. Quantization Handling
- **Good**: Detects bitsandbytes 4-bit/8-bit and dequantizes before projection.
- **Good**: Refuses to operate on raw quantized bytes (avoids silent corruption).
- **Moderate**: Full dequantize/re-quantize per direction per weight matrix. Could cache across multi-direction projections.
---
## Top 5 Recommendations (Ranked by Impact)
### 1. Batch `_collect_activations` (CRITICAL — 5-15x PROBE speedup)
```python
import torch

# `model`, `tokenizer`, `prompts`, `device`, `batch_size` and `max_length`
# come from the surrounding `_collect_activations` context.

# Current: one prompt at a time
for i, prompt in enumerate(prompts):
    inputs = tokenizer(prompt, ...)
    model(**inputs)

# Proposed: micro-batched
for batch_start in range(0, len(prompts), batch_size):
    batch = prompts[batch_start:batch_start + batch_size]
    # Pad within each micro-batch so the input tensors are rectangular
    inputs = tokenizer(batch, return_tensors="pt", padding=True,
                       truncation=True, max_length=max_length)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():
        model(**inputs)
```
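Hooks need a minor adjustment to handle the batch dimension, but the core change is ~20 lines. With padding enabled, any hook that reads a fixed sequence position (for example the last token) must use the attention mask to skip padding tokens; otherwise batched activations will not match the single-prompt ones.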
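
One way the hook adjustment might look, assuming the hooks record the activation at each prompt's last real token (the audit does not state which position is read). The helper names are hypothetical, and plain lists are used in place of tensors so the indexing logic is easy to follow.

```python
def last_token_indices(attention_mask):
    """Index of the last real (non-padding) token in each row.

    `attention_mask` is a list of rows of 0/1 flags, as produced by a
    tokenizer with padding enabled. Works for left or right padding.
    """
    indices = []
    for row in attention_mask:
        real = [pos for pos, flag in enumerate(row) if flag == 1]
        if not real:
            raise ValueError("row contains only padding")
        indices.append(real[-1])
    return indices


def gather_last_token(hidden_states, attention_mask):
    """Pick one activation vector per prompt from (batch, seq, hidden) data."""
    picks = last_token_indices(attention_mask)
    return [sequence[idx] for sequence, idx in zip(hidden_states, picks)]


# Two prompts padded to length 4; the first has one trailing pad token.
mask = [[1, 1, 1, 0], [1, 1, 1, 1]]
hidden = [
    [[0.1], [0.2], [0.3], [9.9]],  # 9.9 sits on a padding position
    [[1.1], [1.2], [1.3], [1.4]],
]
print(last_token_indices(mask))         # [2, 3]
print(gather_last_token(hidden, mask))  # [[0.3], [1.4]]
```

Left padding would make the last real token always sit at the final position, which is an alternative to per-row indexing if the tokenizer is configured that way.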
### 2. Add early-stopping to `true_iterative_refinement` (HIGH — saves 1-2 full PROBE passes)
After re-distilling, compute cosine similarity between old and new refusal directions. If all directions are >0.99 cosine, skip remaining passes. Expected to save 30-60% of `aggressive` mode runtime.
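
A minimal sketch of that convergence check follows. The function names are hypothetical, directions are assumed to be matched by index between passes, and the comparison uses the absolute cosine because SVD-derived directions are only defined up to sign; that last point is a design choice of this sketch, not something the audit specifies.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def directions_converged(old_dirs, new_dirs, threshold=0.99):
    """True when every new direction matches its predecessor up to sign."""
    if len(old_dirs) != len(new_dirs):
        return False
    return all(abs(cosine(o, n)) > threshold for o, n in zip(old_dirs, new_dirs))


previous = [[1.0, 0.0], [0.0, 1.0]]
barely_moved = [[0.999, 0.01], [0.0, -1.0]]  # second direction only flipped sign
rotated = [[0.7, 0.7], [0.0, 1.0]]

print(directions_converged(previous, barely_moved))  # True  -> skip remaining passes
print(directions_converged(previous, rotated))       # False -> run another pass
```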
### 3. Move SAE training to GPU (HIGH — 5-15 min saved for `surgical`/`inverted`)
Change `device="cpu"` to auto-detect available GPU. The SAE is small relative to the model (a 4096 × 16,384 projection is ~67M parameters at expansion=4) and fits easily alongside it.
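
Auto-detection can be a few lines. `pick_device` is a hypothetical helper, not an existing function in the pipeline; it relies only on `torch.cuda.is_available()` and falls back to CPU when PyTorch or CUDA is absent.

```python
def pick_device(prefer_gpu=True):
    """Return "cuda" when PyTorch sees a CUDA device, otherwise "cpu"."""
    if not prefer_gpu:
        return "cpu"
    try:
        import torch
    except ImportError:
        return "cpu"  # no PyTorch installed: nothing to accelerate with
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(device)
```

Keeping a `prefer_gpu` switch lets users force CPU training when the GPU is already close to its memory limit.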
### 4. Reduce Bayesian trial overhead (HIGH — saves 30-60 min for `optimized`)
Options:
- Reduce `n_refusal_prompts` from 8-30 to 4-6 (generation is expensive)
- Use perplexity-only as a faster proxy in early trials, switch to refusal measurement for top candidates
- Implement batch generation for `_measure_refusal_rate`
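
The second option, a cheap proxy first and the expensive measurement only for the best candidates, is sketched below. It is a deliberate simplification: it screens a fixed candidate list rather than modelling the sequential Bayesian search, and `two_stage_search`, `cheap` and `expensive` are hypothetical stand-ins for the perplexity proxy and `_measure_refusal_rate`.

```python
def two_stage_search(candidates, cheap_score, expensive_score, keep=5):
    """Rank all candidates with a cheap proxy, then re-score only the best.

    Lower scores are better for both scoring functions.
    """
    shortlisted = sorted(candidates, key=cheap_score)[:keep]
    return min(shortlisted, key=expensive_score)


calls = {"cheap": 0, "expensive": 0}


def cheap(x):
    # Stand-in for a fast proxy such as perplexity.
    calls["cheap"] += 1
    return abs(x - 0.6)


def expensive(x):
    # Stand-in for a slow generation-based refusal measurement.
    calls["expensive"] += 1
    return (x - 0.62) ** 2


candidates = [i / 50 for i in range(50)]
best = two_stage_search(candidates, cheap, expensive, keep=5)
print(best, calls)  # 0.62 {'cheap': 50, 'expensive': 5}
```

Here the expensive metric runs 5 times instead of 50; the saving holds only as long as the proxy ranks the true optimum inside the shortlist.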
### 5. Add early-stopping to RDO (MODERATE — saves 10-30s for `rdo` mode)
Monitor loss convergence and break at plateau (delta < 1e-4 for 20 steps). Most directions converge in ~100-200 steps, not 500.
---
## Verdict
The pipeline is **architecturally sound**: the rank-1 projection math is correct and memory-efficient, the stage separation is clean, and the progressive method complexity (basic → nuclear) gives users clear cost/quality tradeoffs. However, the **PROBE stage bottleneck** (single-prompt forward passes) and **Bayesian trial overhead** (up to 1,750 forward passes) are the two elephants in the room. Fixing just recommendation #1 would make the entire system 3-5x faster for the majority of users who run basic/advanced/aggressive modes.

The `optimized` and `heretic` modes have a legitimate place for users with compute budget, but their current efficiency makes them impractical for anything under an A100. The documentation should be more explicit about expected runtimes.

**Overall system grade: B+**. Excellent functionality; needs batching and early-stopping.