CLIP Score — Text-to-Image Alignment

Overview

CLIP Score measures how well generated images match their text prompts. It uses logits_per_image from the CLIP model, which is cosine similarity between the image and text embeddings scaled by the model's learned temperature parameter (~100 for standard OpenAI CLIP models). A higher score means generated images are more semantically aligned with the prompts used to generate them.

Unlike FID, which measures distributional similarity to real images, CLIP Score measures prompt faithfulness — whether the model still generates what it is asked to generate. After concept erasure, CLIP Score can drop if the technique over-suppresses features needed for general prompt adherence.

Dataset: TIFA dataset (a diverse set of text-image prompts for faithfulness evaluation)

CLIP Score is concept-agnostic — it works with any technique and any erase_concept.

Compatible techniques

All techniques are compatible with CLIP Score. No concept restrictions.

Configuration reference

Field	Type	Default	Description
`clip_model_name`	`str`	`"openai/clip-vit-large-patch14"`	CLIP model for embedding extraction. See supported models below.
`device`	`str \\| None`	`None`	Device for CLIP inference. Auto-detects CUDA if `None`.
`limit`	`int \\| None`	`300`	Maximum number of prompts from the TIFA dataset.

Supported CLIP models

openai/clip-vit-base-patch16
openai/clip-vit-base-patch32 (default — faster, slightly lower accuracy)
openai/clip-vit-large-patch14 (higher accuracy)
openai/clip-vit-large-patch14-336

Output

Key	Type	Description
`value`	`float`	Mean CLIP logit score across all prompt-image pairs. Computed as temperature-scaled cosine similarity — typical values for SD models fall between 20 and 35. Higher is better. Scores are only comparable across runs that use the same `clip_model_name`.
`details.per_image_scores`	`list[float \\| None]`	Per-image scores in evaluation order. `None` for images that failed to load.
`details.evaluated_count`	`int`	Number of images successfully scored.
`details.total_count`	`int`	Total images attempted (includes failures).

Warnings

CLIP Score does not detect concept erasure

CLIP Score measures prompt adherence, not erasure. A model that still generates nudity but is otherwise high-quality will score well on CLIP Score. Always pair it with an erasure metric (ASR, ERR, UA_IRA).

Model consistency

If you compare CLIP Score across runs, ensure clip_model_name is the same in all runs. Different CLIP variants produce different absolute score ranges and are not directly comparable.

Examples

Single metric

{
  "output_dir": "results/uce_clip_score",
  "technique": {
    "name": "uce",
    "config": {
      "preset": "nudity",
      "device": "cuda"
    }
  },
  "metric": {
    "name": "clip_score",
    "config": {
      "clip_model_name":  "openai/clip-vit-large-patch14",
      "device": "cuda",
      "limit": 300
    }
  }
}

As part of a multi-metric run

{
  "name": "clip_score",
  "config": {
    "clip_model_name":  "openai/clip-vit-large-patch14",
    "device": "cuda",
    "limit": 300
  }
}