TIFA — Text-to-Image Faithfulness Assessment

Overview

TIFA evaluates prompt faithfulness using Visual Question Answering (VQA) rather than embedding similarity. For each generated image, BLIP-2 is asked a set of questions derived from the generation prompt. The score is the fraction of questions answered correctly across all images.

For example, for the prompt "a red bicycle in a park", TIFA might ask:

- "What colour is the bicycle?" → expected: "red"
- "Where is the bicycle?" → expected: "park"

This QA-based approach captures fine-grained semantic faithfulness that embedding-based metrics like CLIP Score can miss — particularly for attribute binding (colour, count, spatial relationships).

Dataset: TIFA dataset (with pre-annotated QA pairs per prompt)
VQA model: BLIP-2 (Salesforce/blip2-flan-t5-xl by default)
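
As a rough illustration of what asking BLIP-2 a question involves, the sketch below loads the default checkpoint with the Hugging Face `transformers` BLIP-2 classes and generates a short answer for one image. The helper names (`build_prompt`, `load_vqa`, `ask`) are hypothetical, and the "Question: ... Answer:" template is taken from the Hugging Face BLIP-2 examples; whether TIFA uses this exact prompt or these generation settings is an assumption.

```python
DEFAULT_VQA_MODEL = "Salesforce/blip2-flan-t5-xl"


def build_prompt(question: str) -> str:
    # Template from the Hugging Face BLIP-2 examples (assumed, not confirmed for TIFA).
    return f"Question: {question} Answer:"


def load_vqa(model_name: str = DEFAULT_VQA_MODEL, device: str = "cuda"):
    # Imports are local so the prompt helper works without torch/transformers installed.
    import torch
    from transformers import Blip2ForConditionalGeneration, Blip2Processor

    processor = Blip2Processor.from_pretrained(model_name)
    # fp16 weights assume a GPU; use float32 on CPU.
    model = Blip2ForConditionalGeneration.from_pretrained(
        model_name, torch_dtype=torch.float16
    )
    return processor, model.to(device).eval()


def ask(processor, model, image, question: str, device: str = "cuda") -> str:
    import torch

    inputs = processor(
        images=image, text=build_prompt(question), return_tensors="pt"
    ).to(device, torch.float16)
    with torch.no_grad():
        generated = model.generate(**inputs, max_new_tokens=10)
    return processor.batch_decode(generated, skip_special_tokens=True)[0].strip()
```

Here `image` is expected to be a PIL image, and `load_vqa` downloads the checkpoint on first use.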

TIFA is concept-agnostic and compatible with all techniques.

Compatible techniques

All techniques are compatible with TIFA. No concept restrictions.

Configuration reference

Field	Type	Default	Description
`vqa_model_name`	`str`	`"Salesforce/blip2-flan-t5-xl"`	HuggingFace model ID for the BLIP-2 VQA model.
`device`	`str | None`	`None`	Device for BLIP-2 inference. Auto-detects CUDA if `None`.
`limit`	`int | None`	`200`	Maximum number of prompts from the TIFA dataset.
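
The `device` auto-detection described in the table could look roughly like the following. This is a minimal sketch, not the metric's actual code: `resolve_device` is a hypothetical helper, and the fallback to CPU when CUDA is unavailable is an assumption.

```python
from typing import Optional


def resolve_device(device: Optional[str] = None) -> str:
    """Return the explicit device if given, otherwise pick CUDA when available."""
    if device is not None:
        return device
    try:
        import torch
    except ImportError:
        # Assumption: without PyTorch there is nothing to detect, so use CPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"
```

Passing an explicit value such as `"cuda:1"` skips detection entirely, which is useful when another metric already occupies the first GPU.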

Output

Key	Type	Description
`value`	`float`	TIFA score: fraction of QA pairs answered correctly. Range [0, 1]. Higher is better. Typical SD baselines score 0.7–0.85.
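
To make the definition of `value` concrete, here is a small sketch of the pooled fraction over all QA pairs. The function names are hypothetical, and the matching rule (case-insensitive exact match after whitespace normalisation) is an assumption; the real metric may compare answers differently.

```python
def normalize(answer: str) -> str:
    # Lowercase and collapse whitespace so "Red " matches "red".
    return " ".join(answer.lower().split())


def tifa_score(results) -> float:
    """results: one list per image of (predicted, expected) answer pairs."""
    total = 0
    correct = 0
    for per_image in results:
        for predicted, expected in per_image:
            total += 1
            correct += normalize(predicted) == normalize(expected)
    # Pool over all questions rather than averaging per-image scores.
    return correct / total if total else 0.0


# One image for "a red bicycle in a park": colour right, location wrong.
print(tifa_score([[("Red", "red"), ("street", "park")]]))  # 0.5
```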

Warnings

BLIP-2 GPU memory

Salesforce/blip2-flan-t5-xl is a large model (about 4B parameters, roughly 8 GB of weights in fp16). It requires substantial GPU memory: 16 GB or more is recommended once activations and image tensors are accounted for. In a multi-metric run, list TIFA last to avoid GPU memory conflicts with other metrics that are still loaded.
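
The weight footprint follows from simple arithmetic: parameter count times bytes per parameter. The sketch below uses 4 billion parameters as a round figure for illustration and covers weights only; activations and image tensors come on top.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes); fp16 uses 2 bytes per parameter."""
    return num_params * bytes_per_param / 1e9


print(weight_memory_gb(4e9))                     # 8.0  (fp16)
print(weight_memory_gb(4e9, bytes_per_param=4))  # 16.0 (fp32)
```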

qa_pairs metadata requirement

TIFA requires each batch to carry qa_pairs metadata: one list of {"question": str, "answer": str} dicts per image, in the same order as the images. The TIFA dataset provides these automatically. If you supply a custom dataset without qa_pairs, TIFA will fail with a KeyError on the metadata dict.
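
When building a custom dataset, it can help to check the metadata shape before starting a run. The following is a sketch under the assumption that the batch metadata is a plain dict with a "qa_pairs" key holding one list of QA dicts per image; `check_qa_pairs` is a hypothetical helper, not part of the tool.

```python
def check_qa_pairs(metadata: dict, num_images: int) -> None:
    """Raise early if the batch metadata would not work for TIFA."""
    if "qa_pairs" not in metadata:
        raise KeyError("qa_pairs")
    qa_pairs = metadata["qa_pairs"]
    if len(qa_pairs) != num_images:
        raise ValueError(
            f"expected {num_images} qa_pairs lists, got {len(qa_pairs)}"
        )
    for index, pairs in enumerate(qa_pairs):
        for pair in pairs:
            missing = {"question", "answer"} - set(pair)
            if missing:
                raise ValueError(f"image {index}: missing keys {sorted(missing)}")


# Metadata for a batch of one image generated from "a red bicycle in a park".
example = {
    "qa_pairs": [
        [
            {"question": "What colour is the bicycle?", "answer": "red"},
            {"question": "Where is the bicycle?", "answer": "park"},
        ]
    ]
}
check_qa_pairs(example, num_images=1)  # passes silently
```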

Slow evaluation

BLIP-2 inference is slow compared to CLIP-based metrics. With limit=200, expect TIFA to take several minutes on a GPU. Reduce limit for faster iteration.

Examples

Single metric

{
  "output_dir": "results/esd_tifa",
  "technique": {
    "name": "esd",
    "config": {
      "erase_concept": "nudity",
      "train_method": "noxattn",
      "device": "cuda"
    }
  },
  "metric": {
    "name": "tifa",
    "config": {
      "vqa_model_name": "Salesforce/blip2-flan-t5-xl",
      "device": "cuda",
      "limit": 200
    }
  }
}

As part of a multi-metric run

{
  "name": "tifa",
  "config": {
    "vqa_model_name": "Salesforce/blip2-flan-t5-xl",
    "device": "cuda",
    "limit": 200
  }
}
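
Because TIFA should be listed last in a multi-metric run, a config-building script might reorder its metric entries before writing the file. The sketch below assumes the entries are a list of dicts shaped like the fragment above; the helper name and the "clip_score" metric ID are hypothetical.

```python
def move_last(metrics: list, name: str = "tifa") -> list:
    """Return a new list with entries called `name` moved to the end, order otherwise preserved."""
    others = [m for m in metrics if m["name"] != name]
    matching = [m for m in metrics if m["name"] == name]
    return others + matching


metrics = [
    {"name": "tifa", "config": {"device": "cuda", "limit": 200}},
    {"name": "clip_score", "config": {}},  # hypothetical metric ID
]
print([m["name"] for m in move_last(metrics)])  # ['clip_score', 'tifa']
```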