Subject-driven text-to-image (T2I) generation aims to produce images that align with a given textual description, while preserving the visual identity from a referenced subject image. Despite its broad downstream applicability, ranging from enhanced personalization in image generation to consistent character representation in video rendering, progress in this field is limited by the lack of reliable automatic evaluation. Existing methods either assess only one aspect of the task (i.e., textual alignment or subject preservation), misalign with human judgments, or rely on costly API-based evaluation. To address this gap, we introduce RefVNLI, a cost-effective metric that evaluates both textual alignment and subject preservation in a single run. Trained on a large-scale dataset derived from video-reasoning benchmarks and image perturbations, RefVNLI outperforms or statistically matches existing baselines across multiple benchmarks and subject categories (e.g., Animal, Object), achieving up to 6.4-point gains in textual alignment and 5.9-point gains in subject preservation.
To train RefVNLI, we collect a large-scale dataset of <Image_ref, prompt, Image_tgt> triplets, each with two binary labels: one for subject preservation of Image_ref in Image_tgt, and one for textual alignment between the prompt and Image_tgt. This involves first creating subject-driven {Image_ref, Image_tgt} pairs, followed by automatic generation of subject-focused prompts for each Image_tgt.
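One such triplet with its two binary labels could be represented, for instance, as the following record (a minimal sketch; the class and field names are hypothetical and not taken from the RefVNLI codebase):

```python
from dataclasses import dataclass

# Hypothetical schema for one training triplet; names are illustrative only.
@dataclass(frozen=True)
class Triplet:
    image_ref: str           # path to the cropped reference-subject image
    prompt: str              # subject-focused text prompt
    image_tgt: str           # path to the target image
    subject_preserved: bool  # label 1: is the subject of image_ref preserved in image_tgt?
    text_aligned: bool       # label 2: does image_tgt match the prompt?

example = Triplet("dog_crop.png", "a dog catching a frisbee", "frame.png", True, True)
```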
To ensure our {Image_ref, Image_tgt} dataset is robust to identity-agnostic changes (e.g., changes in pose, clothing, or lighting), we use video-based datasets that inherently capture these differences.
Specifically, given two pairs of frames, each pair extracted from a distinct video scene featuring the same entity type (e.g., a dog), where both frames within a pair depict the same subject instance (e.g., the same dog), we curate training {Image_ref, Image_tgt} pairs for subject-preservation classification.
Positive pairs are formed by pairing a cropped subject from one frame (e.g., dog from left frame in Scene 1) with the full frame from the same scene (right frame in Scene 1). In contrast, negative pairs are created by pairing the cropped subject with the other scene's full frames (e.g., Scene 2).
This process is applied to all four frames, with each frame in turn serving as the cropped reference image (Image_ref), while the corresponding full frames serve as Image_tgt, yielding a total of 4 positive and 8 negative training pairs.
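The frame-pairing logic above can be sketched as follows; the data layout (two scenes, each a list of two (crop, full_frame) tuples) and the function name are assumptions for illustration:

```python
def make_pairs(scenes):
    """Build subject-preservation pairs from two scenes of the same entity type.

    scenes: [scene_1, scene_2], where each scene is a list of two
    (crop, full_frame) tuples — one per frame of that scene.
    Returns (positives, negatives) lists of (image_ref, image_tgt) pairs.
    """
    positives, negatives = [], []
    for s_idx, scene in enumerate(scenes):
        for f_idx, (crop, _) in enumerate(scene):  # each crop takes a turn as Image_ref
            # Positive: the *other* full frame of the same scene
            # (same subject instance, different view).
            positives.append((crop, scene[1 - f_idx][1]))
            # Negatives: both full frames of the other scene
            # (same entity type, different subject instance).
            for _, frame in scenes[1 - s_idx]:
                negatives.append((crop, frame))
    return positives, negatives
```

With two frames per scene this yields the 4 positive and 8 negative pairs described above.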
In total, we collected 338,551 image pairs from 44,418 unique frames.
To further enhance sensitivity to identity-specific attributes, such as facial features in humans or shapes and patterns in objects, we apply fine-grained corruptions to identity-defining visual attributes to create additional hard negatives.
Starting with an image and a mask of a subject (e.g., a bag), we randomly keep 5 patches within the masked area ([1]) and use them to create 5 inpainted versions ([2]).
The version with the highest MSE between the altered and original regions (e.g., bottom image, MSE = 3983) is paired with the unmodified crop to form a negative pair, while the original image and the same crop form a positive pair; the crop serves as Image_ref in both cases.
This process yields an additional 16,572 pairs.
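The selection of the hardest inpainted version can be sketched as follows, using flat pixel lists in place of real images (function names and data layout are assumptions for illustration):

```python
def masked_mse(original, candidate, mask):
    """Mean squared error restricted to the masked (subject) region.

    original, candidate: flat lists of pixel intensities;
    mask: flat list of bools marking the subject area.
    """
    diffs = [(c - o) ** 2 for o, c, m in zip(original, candidate, mask) if m]
    return sum(diffs) / len(diffs)

def pick_hardest_negative(original, inpainted_versions, mask):
    # Return the index of the inpainted version that deviates most from the
    # original inside the mask (the hardest negative), plus all the scores.
    scores = [masked_mse(original, v, mask) for v in inpainted_versions]
    return max(range(len(scores)), key=scores.__getitem__), scores
```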
For each {Image_ref, Image_tgt} pair, we generate positive and negative prompts for Image_tgt.
Specifically, given an image with some subject (e.g., a dog), we create a positive prompt by adding a bounding box around the subject and directing an LLM to describe it (top prompts). Negative prompts are created by swapping prompts between images of the same entity (middle prompts). For additional hard negatives, we guide an LLM to modify a single non-subject detail in the positive prompts while keeping the rest unchanged (bottom prompts).
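The prompt-swapping step for the first kind of negatives can be sketched as below (data layout and names are assumed; the LLM-based captioning and editing steps are out of scope here):

```python
import random

def swap_prompt_negatives(entries, seed=0):
    """Create negative prompts by swapping positive prompts between
    different images of the same entity type.

    entries: list of (entity, image_id, positive_prompt) tuples.
    Returns a list of (image_id, negative_prompt) pairs.
    """
    rng = random.Random(seed)
    by_entity = {}
    for entity, image_id, prompt in entries:
        by_entity.setdefault(entity, []).append((image_id, prompt))
    negatives = []
    for items in by_entity.values():
        for i, (image_id, _) in enumerate(items):
            # Prompts written for *other* images of the same entity type
            # describe a different instance — a mismatched (negative) prompt.
            others = [p for j, (_, p) in enumerate(items) if j != i]
            if others:
                negatives.append((image_id, rng.choice(others)))
    return negatives
```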
In total, combining this step with the two image-pairing steps yields 1.2 million <Image_ref, prompt, Image_tgt> triplets labeled for textual alignment and subject preservation.
We compare RefVNLI with DreamBench++ (a metric that relies on API calls to LLMs) and CLIP (an embedding-based metric), both for Subject Preservation (SP) and for Textual Alignment (TA).
RefVNLI exhibits better robustness to identity-agnostic changes (SP), such as the zoomed-out parrot (top-middle) and the zoomed-out person with different attire (bottom-middle). It is also more sensitive to identity-defining traits, penalizing changed facial features (left-most person) and mismatched object patterns (left and middle balloons).
Additionally, RefVNLI excels at detecting text-image mismatches (TA), as seen in its penalization of the top-left image for lacking a waterfall.
We test RefVNLI's ability to assess uncommon subjects (e.g., scientific animal names, lesser-known dishes).
For that, we employ a dataset where human annotators compared image pairs, selecting the better one based on Textual Alignment (TA), Image Quality (IQ) (evaluating general depiction of the entity rather than exact reference-adherence), and Overall Preference (OP).
We compare RefVNLI with CLIP and DreamBench++ in aligning with human preferences (top rows of each example).
The higher of the two criterion-wise scores is emphasized unless both are equal.
RefVNLI consistently aligns with human judgments across all three criteria.
@misc{slobodkin2025refvnliscalableevaluationsubjectdriven,
title={RefVNLI: Towards Scalable Evaluation of Subject-driven Text-to-image Generation},
author={Aviv Slobodkin and Hagai Taitelbaum and Yonatan Bitton and Brian Gordon and Michal Sokolik and Nitzan Bitton Guetta and Almog Gueta and Royi Rassin and Itay Laish and Dani Lischinski and Idan Szpektor},
year={2025},
eprint={2504.17502},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2504.17502},
}