Update README.md

pliny
2026-03-04 13:35:24 -08:00
committed by GitHub
parent 8312bbfa8d
commit f67f13ca57
+43 -12
@@ -24,14 +24,24 @@ short_description: "One-click model liberation + chat playground"
</p>
<p align="center">
<a href="https://huggingface.co/spaces/pliny-the-prompter/obliteratus">
<img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue" alt="Open in HF Spaces">
</a>
&nbsp;
<a href="https://colab.research.google.com/github/obliteratus-project/OBLITERATUS/blob/main/notebooks/abliterate.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab">
</a>
</p>
<p align="center">
<b><a href="https://huggingface.co/spaces/pliny-the-prompter/obliteratus">Try it now on HuggingFace Spaces</a></b> — runs on ZeroGPU, free daily quota with HF Pro. No setup, no install, just obliterate.
</p>
---
**OBLITERATUS** is an open-source toolkit for understanding and removing refusal behaviors from large language models. It implements abliteration — a family of techniques that identify and surgically remove the internal representations responsible for content refusal, without retraining or fine-tuning. The result is a model that responds to all prompts without artificial gatekeeping, while preserving its core language capabilities.
**OBLITERATUS** is the most advanced open-source toolkit for understanding and removing refusal behaviors from large language models — and every single run makes it smarter. It implements abliteration — a family of techniques that identify and surgically remove the internal representations responsible for content refusal, without retraining or fine-tuning. The result: a model that responds to all prompts without artificial gatekeeping, while preserving its core language capabilities.
But OBLITERATUS is more than a tool — **it's a distributed research experiment.** Every time you obliterate a model with telemetry enabled, your run contributes anonymous benchmark data to a growing, crowd-sourced dataset that powers the next generation of abliteration research. Refusal directions across architectures. Hardware-specific performance profiles. Method comparisons at scale no single lab could achieve. **You're not just using a tool — you're co-authoring the science.**
The toolkit provides a complete pipeline: from probing a model's hidden states to locate refusal directions, through multiple extraction strategies (PCA, mean-difference, sparse autoencoder decomposition, and whitened SVD), to the actual intervention — zeroing out or steering away from those directions at inference time. Every step is observable. You can visualize where refusal lives across layers, measure how entangled it is with general capabilities, and quantify the tradeoff between compliance and coherence before committing to any modification.
@@ -49,7 +59,7 @@ Or zero commands — just [open the Colab notebook](https://colab.research.googl
## What it does
OBLITERATUS does four things:
OBLITERATUS does four things — and the community does the fifth (see [Community-powered research](#community-powered-research--every-run-advances-the-science) below):
**1. Map the chains** — Ablation studies systematically knock out model components (layers, attention heads, FFN blocks, embedding dimensions) and measure what breaks. This reveals *where* the chains are anchored inside the transformer — which circuits enforce refusal vs. which circuits carry knowledge and reasoning.
@@ -103,11 +113,11 @@ OBLITERATUS implements several techniques that go beyond prior work:
## Ways to use OBLITERATUS
There are six ways to use OBLITERATUS, from zero-code to full programmatic control. Pick whichever fits your workflow.
There are six ways to use OBLITERATUS, from zero-code to full programmatic control. Pick whichever fits your workflow — and no matter which path you choose, **turning on telemetry means your run contributes to the largest crowd-sourced abliteration study ever conducted.** You're not just removing guardrails from a model; you're helping map the geometry of alignment across the entire open-source ecosystem.
### 1. HuggingFace Spaces (zero setup)
The fastest path — no installation, no GPU required on your end. Visit the live Space, pick a model, pick a method, click Obliterate. The UI has eight tabs:
The fastest path — no installation, no GPU required on your end. Visit the live Space, pick a model, pick a method, click Obliterate. **Telemetry is on by default on Spaces, so every click directly contributes to the community research dataset.** You're doing science just by pressing the button. The UI has eight tabs:
| Tab | What it does |
|-----|-------------|
@@ -444,23 +454,42 @@ obliteratus run examples/preset_quick.yaml
| Model compatibility | Any HuggingFace model | ~50 architectures | 16/16 tested | TransformerLens only | HuggingFace | TransformerLens |
| Test suite | 837 tests | Community | Unknown | None | Minimal | Moderate |
## Community contributions
## Community-powered research — every run advances the science
OBLITERATUS supports crowdsourced data collection for the research paper. After running an abliteration, you can save structured, anonymized results locally and submit them via pull request to grow the community dataset:
This is where OBLITERATUS gets truly unprecedented: **it's a crowd-sourced research platform disguised as a tool.** Every obliteration run generates valuable scientific data — refusal direction geometries, cross-layer alignment signatures, hardware performance profiles, method effectiveness scores. With telemetry enabled, that data flows into a community dataset that no single research lab could build alone.
**Here's why this matters:** The biggest open question in abliteration research is *universality* — do refusal mechanisms work the same way across architectures, training methods, and model scales? Answering that requires thousands of runs across hundreds of models on diverse hardware. That's exactly what this community is building, one obliteration at a time.
### Telemetry: anonymous, research-first, opt-in for local runs
Enable telemetry and your runs automatically contribute to the shared dataset. On HuggingFace Spaces it is on by default, so every person who clicks "Obliterate" on the Space contributes to the research without any extra step. Local runs contribute only when you opt in, which takes a single flag:
```bash
# Run abliteration and contribute results
# Every run with --contribute feeds the community dataset
obliteratus obliterate meta-llama/Llama-3.1-8B-Instruct --method advanced \
--contribute --contribute-notes "A100, default prompts"
# View aggregated community results
# Or set it globally — every run you do from now on contributes
export OBLITERATUS_TELEMETRY=1
```
**What gets collected:** model name, method, aggregate benchmark scores (refusal rate, perplexity, coherence, KL divergence), hardware info, and timestamps. **What never gets collected:** prompts, outputs, IP addresses, user identity, or anything that could trace back to you. The full schema is in `obliteratus/telemetry.py` — read every line, we have nothing to hide.
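To make the collected/never-collected split concrete, here is a minimal sketch of an allow-listed record. It is an illustration only: `RunRecord`, `build_record`, and the exact field names are assumptions based on the list above, not the actual schema in `obliteratus/telemetry.py`, and the numbers are placeholders rather than measurements.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


# Hypothetical record layout; the real schema lives in obliteratus/telemetry.py.
@dataclass(frozen=True)
class RunRecord:
    model_name: str
    method: str
    refusal_rate: float
    perplexity: float
    coherence: float
    kl_divergence: float
    hardware: str
    timestamp: str


# Only these keys may ever be serialized.
ALLOWED_FIELDS = frozenset(RunRecord.__dataclass_fields__)


def build_record(raw: dict) -> RunRecord:
    """Drop every key that is not allow-listed (e.g. prompts or outputs)."""
    filtered = {key: value for key, value in raw.items() if key in ALLOWED_FIELDS}
    return RunRecord(**filtered)


# Placeholder values for illustration, not real benchmark results.
raw_run = {
    "model_name": "meta-llama/Llama-3.1-8B-Instruct",
    "method": "advanced",
    "refusal_rate": 0.02,
    "perplexity": 7.4,
    "coherence": 0.93,
    "kl_divergence": 0.11,
    "hardware": "A100",
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "prompts": ["this key is dropped before serialization"],
}

record = build_record(raw_run)
payload = json.dumps(asdict(record), sort_keys=True)
```

An allow-list is used here rather than a block-list so that any new key added to a run is excluded unless someone deliberately adds it to the record type.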
### The community leaderboard
All those crowd-sourced runs feed the **Leaderboard tab** on the HuggingFace Space — a live, community-aggregated ranking of models, methods, and configurations. See what works best on which architectures. Spot patterns across model families. Find the optimal method before you even start your own run. This is collective intelligence applied to mechanistic interpretability.
```bash
# View what the community has discovered so far
obliteratus aggregate --format summary
# Generate paper-ready LaTeX table from community data
# Generate paper-ready LaTeX tables from community data
obliteratus aggregate --format latex --metric refusal_rate --min-runs 3
```
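As a rough picture of what a `--metric refusal_rate --min-runs 3` style aggregation could compute, the sketch below groups records by model and method, skips groups with too few runs, and reports the mean and standard deviation. `summarize_runs` and the record keys are hypothetical, and this is not the CLI's actual implementation; the input values are placeholders.

```python
from collections import defaultdict
from statistics import mean, stdev


def summarize_runs(records, metric="refusal_rate", min_runs=3):
    """Group records by (model_name, method) and summarize one metric.

    Groups with fewer than `min_runs` records are omitted, so a single
    unusual run cannot define a row of the summary.
    """
    groups = defaultdict(list)
    for record in records:
        groups[(record["model_name"], record["method"])].append(record[metric])

    summary = {}
    for key, values in groups.items():
        if len(values) < min_runs:
            continue
        summary[key] = {
            "runs": len(values),
            "mean": mean(values),
            # stdev needs at least two samples
            "std": stdev(values) if len(values) > 1 else 0.0,
        }
    return summary


# Placeholder records for illustration, not real community data.
example_records = [
    {"model_name": "model-a", "method": "advanced", "refusal_rate": 0.1},
    {"model_name": "model-a", "method": "advanced", "refusal_rate": 0.2},
    {"model_name": "model-a", "method": "advanced", "refusal_rate": 0.3},
    {"model_name": "model-b", "method": "basic", "refusal_rate": 0.5},
]
example_summary = summarize_runs(example_records)
```

With these inputs, only the `model-a` group survives the minimum-run filter, because `model-b` has a single run.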
Or via Python API:
### Local contributions (PR-based)
Prefer to keep things fully local? Save structured results as JSON and submit them via pull request:
```python
from obliteratus import save_contribution, load_contributions, aggregate_results
@@ -469,7 +498,7 @@ from obliteratus.abliterate import AbliterationPipeline
pipeline = AbliterationPipeline(model_name="meta-llama/Llama-3.1-8B-Instruct", method="advanced")
pipeline.run()
# Save contribution locally (never sent remotely)
# Save contribution locally
save_contribution(pipeline, model_name="meta-llama/Llama-3.1-8B-Instruct",
notes="A100, default prompts")
@@ -478,7 +507,7 @@ records = load_contributions("community_results")
aggregated = aggregate_results(records)
```
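If you want to inspect saved contributions before opening a pull request, the JSON files can be read with the standard library alone. This sketch assumes one JSON object per `*.json` file in the results directory; that layout and the `read_contributions` helper are assumptions for illustration, not the documented behavior of `load_contributions`.

```python
import json
import tempfile
from pathlib import Path


def read_contributions(directory):
    """Load every *.json file in `directory` into a list of dicts, in filename order."""
    records = []
    for path in sorted(Path(directory).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            records.append(json.load(handle))
    return records


# Demonstrate on a temporary directory rather than a real community_results/ folder.
with tempfile.TemporaryDirectory() as tmp:
    sample = {"model_name": "example/model", "method": "advanced"}
    Path(tmp, "run_001.json").write_text(json.dumps(sample), encoding="utf-8")
    Path(tmp, "notes.txt").write_text("ignored: not a JSON file", encoding="utf-8")
    loaded = read_contributions(tmp)
```

Reading the files yourself is a quick way to confirm that nothing beyond the expected fields ends up in a contribution before it is shared.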
Contributions are saved as local JSON files in `community_results/` — nothing is sent to any remote endpoint. Submit your results via PR to help build a statistically robust cross-hardware, cross-model dataset.
Whether you contribute via telemetry or PR, you're helping build the most comprehensive cross-hardware, cross-model, cross-method abliteration dataset ever assembled. **This is open science at scale — and you're part of it.**
## Web dashboard
@@ -542,4 +571,6 @@ This is the same dual-licensing model used by MongoDB, Qt, Grafana, and others.
---
Every obliteration is a data point. Every data point advances the research. Every researcher who contributes makes the next obliteration more precise. **This is how open science wins — not by locking knowledge behind lab doors, but by turning every user into a collaborator.** Break the chains. Free the mind. Keep the brain. Advance the science.
Made with <3 by Pliny the Prompter