AdvUnlearn — Adversarially Robust Concept Unlearning
Overview
AdvUnlearn fine-tunes Stable Diffusion with a dual objective: erase the target concept while simultaneously defending against adversarial prompt attacks. At each training step it runs an inner attack loop (PGD or fast_at) to find adversarial embeddings that could recover the concept, then updates the model to resist those embeddings.
The resulting model is more robust than one produced by standard fine-tuning approaches: naïve adversarial prompts are less likely to bypass the erasure. The trade-off is substantially longer training time than ESD or MACE, because of the inner attack loop.
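The alternating attack-then-update structure can be sketched on a toy problem. This is not AdvUnlearn's loss or model: a linear "concept score" stands in for the diffusion model, and every name here (`score`, `pgd_attack`, `adversarial_unlearn`, `eps`) is an illustrative assumption rather than part of the library.

```python
def score(w, x):
    """Toy 'concept score': how strongly embedding x triggers the concept."""
    return sum(wi * xi for wi, xi in zip(w, x))


def sign(v):
    return (v > 0) - (v < 0)


def pgd_attack(w, e, eps, attack_lr, attack_step):
    """Inner loop: search for a bounded perturbation that maximises 0.5 * score**2."""
    delta = [0.0] * len(e)
    for _ in range(attack_step):
        s = score(w, [ei + di for ei, di in zip(e, delta)])
        # The gradient of 0.5 * s**2 with respect to delta is s * w; ascend along its sign.
        delta = [di + attack_lr * sign(s * wi) for di, wi in zip(delta, w)]
        # Project back onto the L-infinity ball of radius eps.
        delta = [max(-eps, min(eps, di)) for di in delta]
    return delta


def adversarial_unlearn(w, e, train_steps=50, lr=0.1, eps=0.1,
                        attack_lr=0.05, attack_step=5):
    """Outer loop: update the weights to resist the perturbation just found."""
    w = list(w)
    for _ in range(train_steps):
        delta = pgd_attack(w, e, eps, attack_lr, attack_step)
        x_adv = [ei + di for ei, di in zip(e, delta)]
        s = score(w, x_adv)
        # Gradient descent on 0.5 * s**2 with respect to w (gradient is s * x_adv).
        w = [wi - lr * s * xi for wi, xi in zip(w, x_adv)]
    return w


def worst_case(w, e, eps=0.1):
    """Absolute concept score under the same toy attack."""
    delta = pgd_attack(w, e, eps, attack_lr=0.05, attack_step=5)
    return abs(score(w, [ei + di for ei, di in zip(e, delta)]))


w0 = [1.0, -0.5]   # toy model weights
e = [1.0, 0.5]     # toy embedding of the concept prompt
w1 = adversarial_unlearn(w0, e)
print("before:", worst_case(w0, e), "after:", worst_case(w1, e))
```

The point of the sketch is the cost structure: every outer update pays for a full inner attack, which is why training time grows with `attack_step`.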
AdvUnlearn also supports retention: part of training draws on a retention dataset (COCO objects or ImageNet) to preserve general generation quality.
Base model: CompVis/stable-diffusion-v1-4
Supported concepts: nudity (default). Other concepts can be specified via erase_concept, but the pre-built retention datasets are curated for nudity erasure.
Compatible metrics
| Metric | Compatible | Notes |
|---|---|---|
| ASR I2P | Any I2P concept | NudeNet for nudity; Q16 for all others |
| ERR | nudity only | Requires erase_concept="nudity" |
| FID | Any | General image quality |
| CLIP Score | Any | General text-image alignment |
| UA_IRA | Any | Requires custom prompt CSVs |
| TIFA | Any | General faithfulness |
| ASR Custom | Any | Concept-agnostic via CLIP |
| MMA-Diffusion | Any | Requires explicit target prompts for non-nudity |
Configuration reference
| Field | Type | Default | Description |
|---|---|---|---|
| erase_concept | str | "nudity" | Concept to erase. |
| train_method | str | "text_encoder_full" | Layers to train. See options below. |
| dataset_retain | str | "coco_object" | Retention dataset. One of "coco_object", "imagenet243", "coco_object_no_filter", "imagenet243_no_filter". |
| retain_train | str | "iter" | Retention training schedule. "iter" = interleaved, "reg" = regularisation. |
| retain_batch | int | 5 | Batch size for retention samples. Must be > 0. |
| retain_step | int | 1 | How often (in steps) to apply retention. Must be > 0. |
| retain_loss_w | float | 1.0 | Weight of the retention loss term. |
| start_guidance | float | 3.0 | Initial classifier-free guidance scale during training. |
| negative_guidance | float | 1.0 | Guidance scale for the erasing objective. |
| train_steps | int | 5 | Total training iterations. Must be > 0. |
| learning_rate | float | 1e-5 | Learning rate. Must be > 0. |
| attack_method | str | "pgd" | Inner-loop attack algorithm. "pgd" (Projected Gradient Descent) or "fast_at". |
| attack_step | int | 30 | Number of attack steps per iteration. Must be > 0. |
| attack_lr | float | 1e-3 | Step size for the attack optimiser. Must be > 0. |
| attack_type | str | "prefix_k" | How the attack modifies the prompt embedding. See options below. |
| attack_init | str | "latest" | Attack initialisation strategy. |
| attack_embd_type | str | "word_embd" | Embedding space for the attack. Only "word_embd" is supported. |
| adv_prompt_num | int | 1 | Number of adversarial prompts generated per step. Must be > 0. |
| adv_prompt_update_step | int | 1 | How often to refresh adversarial prompts. |
| warmup_iter | int | 1 | Warmup iterations before adversarial training begins. Must be < train_steps. |
| component | str | "all" | UNet component to modify. "all", "ffn", or "attn". |
| norm_layer | bool | False | Include layer normalisation in trainable parameters. |
| ddim_steps | int | 50 | DDIM steps during training rollouts. |
| save_interval | int | 1 | Checkpoint save frequency (in iterations). |
| save_dir | str \| None | None | Directory to save checkpoints. After training, a .pt file is written here named {concept}_{text_encoder\|unet}.pt (e.g. nudity_text_encoder.pt). The suffix depends on train_method: text_encoder for any text_encoder_* method, unet for all others. |
| load_path | str \| None | None | Path to a .pt file saved by a previous AdvUnlearn run (from save_dir). If the file exists, training is skipped and these weights are loaded. Must match the current train_method: a text-encoder checkpoint cannot be loaded with a UNet train_method and vice versa. |
| num_inference_steps | int | 50 | DDIM steps for image generation during evaluation. |
| guidance_scale | float | 7.5 | CFG scale for generation. |
| use_fp16 | bool | True | Run in half precision. |
| device | str | "cuda" | Device to run on. |
train_method options
| Value | Description |
|---|---|
| "text_encoder_full" | Fine-tune full text encoder (default, recommended) |
| "noxattn" | All UNet layers except cross-attention |
| "selfattn" | Self-attention only |
| "xattn" | Cross-attention only |
| "full" | All UNet layers |
| "notime" | All layers except time embeddings |
| "xlayer" | Cross-attention at specific layers |
| "selflayer" | Self-attention at specific layers |
| "text_encoder_layer<N>" | Specific text encoder layer (replace <N> with layer index) |
attack_type options
| Value | Description |
|---|---|
| "prefix_k" | Prepend adversarial tokens to the prompt |
| "suffix_k" | Append adversarial tokens |
| "replace_k" | Replace tokens at position k |
| "add" | Additive perturbation to embeddings |
| "mid_k" | Insert at midpoint |
| "insert_k" | Insert at position k |
| "per_k_words" | Perturb every k-th word |
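To make the placement options concrete, here is a toy sketch of three of them operating on token lists. The real attack works on word embeddings rather than strings, and the function name `place_adversarial_tokens` and the `<adv…>` placeholders are assumptions made for illustration only; the remaining attack types are left out because their exact semantics are not spelled out here.

```python
def place_adversarial_tokens(prompt_tokens, adv_tokens, attack_type):
    """Return the token sequence after placing adversarial tokens (toy version)."""
    if attack_type == "prefix_k":
        # Prepend the adversarial tokens to the prompt.
        return adv_tokens + prompt_tokens
    if attack_type == "suffix_k":
        # Append the adversarial tokens to the prompt.
        return prompt_tokens + adv_tokens
    if attack_type == "mid_k":
        # Insert the adversarial tokens at the midpoint of the prompt.
        mid = len(prompt_tokens) // 2
        return prompt_tokens[:mid] + adv_tokens + prompt_tokens[mid:]
    raise NotImplementedError(f"{attack_type!r} is not covered by this sketch")


prompt = ["a", "photo", "of", "a", "person"]
adv = ["<adv0>", "<adv1>"]
for kind in ("prefix_k", "suffix_k", "mid_k"):
    print(kind, place_adversarial_tokens(prompt, adv, kind))
```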
Warnings
Checkpoint format and train_method compatibility
load_path must point to a .pt file written by save_dir from a previous run.
The file contains a state dict keyed with text_encoder.* prefixes (for text_encoder_*
methods) or bare UNet parameter names (for UNet methods). Loading a text-encoder
checkpoint with a UNet train_method (or vice versa) will raise a key mismatch error.
Always use the same train_method when resuming from a checkpoint.
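A pre-flight check along these lines can catch a mismatch before a long run starts. It is a sketch built only on the key-prefix rule described above; the helper names and the example keys are hypothetical, and the library's own loader may behave differently.

```python
def component_for(train_method: str) -> str:
    """Which sub-model a train_method fine-tunes: text_encoder_* methods vs. the rest."""
    return "text_encoder" if train_method.startswith("text_encoder_") else "unet"


def component_in_checkpoint(state_dict_keys) -> str:
    """Guess the component from state-dict keys (text_encoder.* vs. bare UNet names)."""
    keys = list(state_dict_keys)
    if keys and all(k.startswith("text_encoder.") for k in keys):
        return "text_encoder"
    return "unet"


def check_compatible(state_dict_keys, train_method: str) -> None:
    """Raise early if the checkpoint and train_method target different components."""
    expected = component_for(train_method)
    found = component_in_checkpoint(state_dict_keys)
    if expected != found:
        raise ValueError(
            f"checkpoint holds {found} weights but train_method="
            f"{train_method!r} expects {expected} weights"
        )


# Illustrative key names only; real checkpoints contain many more entries.
text_encoder_keys = ["text_encoder.text_model.embeddings.token_embedding.weight"]
check_compatible(text_encoder_keys, "text_encoder_full")   # passes silently
```

With a real checkpoint, the keys would come from the loaded state dict (for example, the keys of the object returned by `torch.load`).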
warmup_iter must be less than train_steps
If warmup_iter >= train_steps, AdvUnlearn raises a ValidationError on startup.
With train_steps=5, warmup_iter must be at most 4.
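The constraint is easy to check before submitting a config. The sketch below mirrors it with a standard-library dataclass; `TrainSchedule` is a hypothetical stand-in, and it raises `ValueError` rather than the `ValidationError` that AdvUnlearn itself reports.

```python
from dataclasses import dataclass


@dataclass
class TrainSchedule:
    """Hypothetical subset of the config, using the documented defaults."""
    train_steps: int = 5
    warmup_iter: int = 1
    attack_step: int = 30

    def __post_init__(self):
        # Fields documented as "Must be > 0".
        for name in ("train_steps", "attack_step"):
            if getattr(self, name) <= 0:
                raise ValueError(f"{name} must be > 0")
        # warmup_iter must be strictly less than train_steps.
        if self.warmup_iter >= self.train_steps:
            raise ValueError(
                f"warmup_iter ({self.warmup_iter}) must be < "
                f"train_steps ({self.train_steps})"
            )


TrainSchedule(train_steps=5, warmup_iter=4)      # accepted
try:
    TrainSchedule(train_steps=5, warmup_iter=5)  # rejected
except ValueError as err:
    print(err)
```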
Training time
AdvUnlearn is significantly slower than MACE or UCE. Each outer iteration runs
attack_step inner PGD steps. With defaults (train_steps=5, attack_step=30),
expect substantially longer wall-clock time than other techniques. Reduce attack_step
for faster (but less robust) training.
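A rough way to compare configurations is to count inner attack steps. The helper below is only bookkeeping, not a timing model; the `warmup_iter` argument reflects an assumption that warmup iterations skip the attack, which the text above does not confirm.

```python
def inner_attack_steps(train_steps: int, attack_step: int, warmup_iter: int = 0) -> int:
    """Count inner attack steps, one attack of attack_step steps per adversarial iteration."""
    # Assumption: iterations during warmup do not run the attack.
    adversarial_iters = max(train_steps - warmup_iter, 0)
    return adversarial_iters * attack_step


print(inner_attack_steps(5, 30))        # defaults: 150
print(inner_attack_steps(100, 30))      # example configs on this page: 3000
print(inner_attack_steps(100, 30, 5))   # same, if warmup skips the attack: 2850
```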
Low train_steps default
The default train_steps=5 is minimal. Published results typically use 100–1000
steps. The default is set low to keep demo runs feasible — increase it for
production benchmarks.
attack_embd_type
Only "word_embd" is supported for attack_embd_type. Any other value will cause
a runtime error inside the attack loop.
Examples
Single metric — ASR (nudity)
```json
{
  "output_dir": "results/advunlearn_asr",
  "technique": {
    "name": "advunlearn",
    "config": {
      "erase_concept": "nudity",
      "train_method": "text_encoder_full",
      "train_steps": 100,
      "attack_step": 30,
      "warmup_iter": 5,
      "save_dir": "checkpoints/advunlearn_nudity",
      "device": "cuda"
    }
  },
  "metric": {
    "name": "asr_i2p",
    "config": {
      "device": "cuda",
      "limit": 500
    }
  }
}
```
Multiple metrics — nudity full benchmark
```json
{
  "output_dir": "results/advunlearn_nudity_multi",
  "technique": {
    "name": "advunlearn",
    "config": {
      "erase_concept": "nudity",
      "train_method": "text_encoder_full",
      "train_steps": 100,
      "learning_rate": 1e-5,
      "attack_method": "pgd",
      "attack_step": 30,
      "attack_lr": 1e-3,
      "warmup_iter": 5,
      "retain_loss_w": 1.0,
      "save_dir": "checkpoints/advunlearn_nudity",
      "device": "cuda",
      "num_inference_steps": 50,
      "guidance_scale": 7.5
    }
  },
  "metrics": [
    { "name": "asr_i2p", "config": { "device": "cuda", "limit": 500 } },
    { "name": "err", "config": { "device": "cuda", "target_limit": 50, "retain_limit": 20, "adversarial_limit": 50 } },
    { "name": "fid", "config": { "device": "cuda", "limit": 1000 } },
    { "name": "clip_score", "config": { "device": "cuda", "limit": 300 } },
    {
      "name": "ua_ira",
      "config": {
        "target_prompts_path": "data/nudity_target_prompts.csv",
        "retain_prompts_path": "data/nudity_retain_prompts.csv",
        "target_concept": "nudity",
        "retain_concept": "person",
        "device": "cuda"
      }
    },
    { "name": "tifa", "config": { "device": "cuda", "limit": 200 } }
  ]
}
```
Reusing trained weights across runs
Set save_dir on the first run to persist the trained weights, then set load_path
on all subsequent runs to skip retraining. This is especially useful when benchmarking
multiple metrics against the same trained model. See
"Caching adversarial prompts and technique weights"
for the full workflow.
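As a sketch of that workflow, a follow-up config can be derived from the first one by pointing load_path at the file the first run wrote. The helpers `checkpoint_file` and `reuse_weights` are hypothetical; they rely only on the {concept}_{text_encoder|unet}.pt naming rule from the configuration reference.

```python
import copy
import json


def checkpoint_file(save_dir: str, erase_concept: str, train_method: str) -> str:
    """Expected checkpoint path, following the documented naming rule."""
    suffix = "text_encoder" if train_method.startswith("text_encoder_") else "unet"
    return f"{save_dir}/{erase_concept}_{suffix}.pt"


def reuse_weights(first_config: dict, metric: dict, output_dir: str) -> dict:
    """Copy the first run's config, swap the metric, and add load_path."""
    cfg = copy.deepcopy(first_config)
    tech = cfg["technique"]["config"]
    # save_dir is left in place; whether it is still needed is not specified here.
    tech["load_path"] = checkpoint_file(
        tech["save_dir"], tech["erase_concept"], tech["train_method"]
    )
    cfg["metric"] = metric
    cfg["output_dir"] = output_dir
    return cfg


first_run = {
    "output_dir": "results/advunlearn_asr",
    "technique": {
        "name": "advunlearn",
        "config": {
            "erase_concept": "nudity",
            "train_method": "text_encoder_full",
            "train_steps": 100,
            "save_dir": "checkpoints/advunlearn_nudity",
            "device": "cuda",
        },
    },
    "metric": {"name": "asr_i2p", "config": {"device": "cuda", "limit": 500}},
}

second_run = reuse_weights(
    first_run,
    {"name": "fid", "config": {"device": "cuda", "limit": 1000}},
    "results/advunlearn_fid",
)
print(json.dumps(second_run, indent=2))
```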