SAeUron — Sparse Autoencoder Unlearning
Overview
SAeUron uses a pre-trained Sparse Autoencoder (SAE) to intercept and ablate concept-specific features in the UNet's intermediate activations at inference time. A forward hook is registered on a specific UNet layer. At each denoising step, the hook:
- Projects the conditional activations into the SAE's sparse latent space
- Suppresses the feature indices associated with the target concept
- Reconstructs the modified activations with a residual correction to preserve unrelated structure
The hook operates only on the conditional half of the CFG batch, leaving the unconditional half untouched.
Because ablation is applied via a hook at inference time, SAeUron does not retrain the model. The SAE checkpoint, target layer, and all internal parameters are bundled and resolved automatically — only the concept and suppression strength are user-facing.
Base model: CompVis/stable-diffusion-v1-4
Supported concepts: any string — see below
Concept support
The SAE itself is concept-agnostic: it learns a sparse decomposition of UNet activations regardless of content. Which features correspond to a concept is determined by an activation cache — a record of which SAE features fire strongly across images of that concept.
Bundled concepts (nudity): feature indices are loaded instantly from the bundled
activation cache.
Any other concept: feature indices are computed on-the-fly during initialisation. The pipeline generates 20 images with the concept as the prompt, hooks the SAE layer at denoising step 10 to collect sparse activations, and selects the top features by mean activation magnitude. This takes a few minutes and progress is printed to stdout. You will be informed at config creation time if on-the-fly computation will be needed.
Compatible metrics
| Metric | Compatible | Notes |
|---|---|---|
| ASR I2P | Yes | Any I2P concept |
| ERR | Yes | erase_concept="nudity" required (ERR is nudity-specific) |
| FID | Yes | General image quality |
| CLIP Score | Yes | General text-image alignment |
| UA_IRA | Yes | Requires custom prompt CSVs |
| TIFA | Yes | General faithfulness |
| ASR Custom | Yes | Concept-agnostic via CLIP |
| MMA-Diffusion | Yes |
Configuration reference
| Field | Type | Default | Description |
|---|---|---|---|
erase_concept |
str |
"nudity" |
Concept to suppress. Bundled cache supports "nudity" — any other concept triggers on-the-fly activation computation. |
multiplier |
float |
-20.0 |
Feature scaling factor applied to the target SAE latents. Negative values suppress the concept; positive values amplify it. |
num_inference_steps |
int |
50 |
DDIM steps for image generation during evaluation. |
guidance_scale |
float |
7.5 |
CFG scale. Must be > 1.0 — SAeUron requires CFG to be active. |
use_fp16 |
bool |
True |
Run in half precision. |
device |
str \| None |
None |
Device to run on. Auto-detects CUDA, then MPS, then CPU if None. |
Warnings
guidance_scale must be > 1.0
SAeUron splits the inference batch into unconditional and conditional halves along
the CFG axis. If guidance_scale <= 1.0, CFG is inactive and the batch has only one
chunk — the hook will fail to split it. The config enforces this at construction time.
on-the-fly computation for custom concepts
For any concept outside the bundled cache, 20 images are generated at initialisation to compute activation statistics. This is GPU-intensive and takes a few minutes. A message is printed at config creation time so you know to expect this before the pipeline starts loading.
Examples
Single metric — ASR
{
"output_dir": "results/saeuron_asr",
"technique": {
"name": "saeuron",
"config": {
"erase_concept": "nudity",
"device": "cuda"
}
},
"metric": {
"name": "asr_i2p",
"config": {
"device": "cuda",
"limit": 500
}
}
}
Multiple metrics — nudity full benchmark
{
"output_dir": "results/saeuron_nudity_multi",
"technique": {
"name": "saeuron",
"config": {
"erase_concept": "nudity",
"multiplier": -20.0,
"device": "cuda"
}
},
"metrics": [
{ "name": "asr_i2p", "config": { "device": "cuda", "limit": 500 } },
{ "name": "err", "config": { "device": "cuda", "target_limit": 50, "retain_limit": 20, "adversarial_limit": 50 } },
{ "name": "fid", "config": { "device": "cuda", "limit": 1000 } },
{ "name": "clip_score", "config": { "device": "cuda", "limit": 300 } },
{
"name": "ua_ira",
"config": {
"target_prompts_path": "data/nudity_target_prompts.csv",
"retain_prompts_path": "data/nudity_retain_prompts.csv",
"target_concept": "nudity",
"retain_concept": "person",
"device": "cuda"
}
},
{ "name": "tifa", "config": { "device": "cuda", "limit": 200 } }
]
}
Custom concept — violence (on-the-fly activation computation)
{
"output_dir": "results/saeuron_violence_multi",
"technique": {
"name": "saeuron",
"config": {
"erase_concept": "violence",
"multiplier": -20.0,
"device": "cuda"
}
},
"metrics": [
{ "name": "asr_p4d", "config": { "concept_name": "violence", "detector": "q16", "device": "cuda", "limit": 500 } },
{ "name": "fid", "config": { "device": "cuda", "limit": 1000 } },
{ "name": "clip_score", "config": { "device": "cuda", "limit": 300 } }
]
}