
AI Image Benchmark Methodology

How we evaluate AI image generators with human preference data.

The Problem with Automated Metrics

FID (Fréchet Inception Distance), CLIP score, and IS (Inception Score) are the standard metrics for evaluating image generation quality. However, these metrics correlate poorly with actual human preference. A model can achieve excellent FID scores while producing images that humans find unappealing, or vice versa.

The gap between automated metrics and human judgment is well-documented. Google Research's RichHF-18K dataset, which won Best Paper at CVPR 2024, demonstrated that multi-dimensional human feedback captures quality aspects that no single automated metric can represent.

Three-Signal Evaluation

AIMomentz collects three complementary signal types from every user interaction.

Signal 1: Pairwise Comparison (A/B Battle)

Two AI-generated images from different models, created from the same news-derived prompt, are presented side by side. The user taps their preferred image. This produces a binary preference signal directly compatible with Diffusion-DPO (Direct Preference Optimization) training and the Pick-a-Pic dataset format.
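A pairwise vote maps directly onto the (chosen, rejected) pair that DPO-style training consumes. The sketch below illustrates that mapping; the field names are illustrative assumptions, not the actual AIMomentz schema.

```python
from dataclasses import dataclass
import json

@dataclass
class PairwisePreference:
    """One A/B battle outcome. Field names are illustrative,
    not the real AIMomentz export schema."""
    prompt: str      # identical news-derived prompt shown to both models
    model_a: str
    model_b: str
    image_a: str     # path or URL of model A's output
    image_b: str
    winner: str      # "a" or "b" -- the user's tap

    def to_dpo_pair(self) -> dict:
        """Map the vote onto the (prompt, chosen, rejected) triple
        that Diffusion-DPO-style training expects."""
        chosen, rejected = (
            (self.image_a, self.image_b) if self.winner == "a"
            else (self.image_b, self.image_a)
        )
        return {"prompt": self.prompt, "chosen": chosen, "rejected": rejected}

vote = PairwisePreference(
    prompt="A storm surge floods a coastal boardwalk at dawn",
    model_a="flux", model_b="sdxl",
    image_a="img/flux_001.png", image_b="img/sdxl_001.png",
    winner="b",
)
print(json.dumps(vote.to_dpo_pair()))
```

Because the record stores the raw vote rather than the derived pair, the same data can later be re-exported in other preference formats.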

Current dataset: 301 pairwise comparisons across 120 battles.

Signal 2: Multi-Axis Rating (4-Axis)

Users can rate individual images on four dimensions, each on a 1-5 scale.

This format is compatible with RichHF-18K (Google Research) and UltraFeedback. Current dataset: 0 multi-axis ratings.

Signal 3: Behavioral Engagement

Implicit signals are captured without user effort.

Same-Prompt Comparison

Unlike datasets where different models generate images from different prompts, AIMomentz ensures all models in a battle receive the identical prompt derived from the same news headline. This eliminates prompt difficulty as a confounding variable — the most critical requirement for valid model comparison.
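The same-prompt requirement is an invariant that can be checked mechanically at ingestion time. A minimal sketch, assuming a battle is a dict holding a list of per-model entries (an assumed shape, not the real schema):

```python
def battle_is_valid(battle: dict) -> bool:
    """Same-prompt invariant: every entry in a battle must share one
    prompt, so preference differences reflect the models rather than
    prompt difficulty. Battle structure here is an assumed example."""
    prompts = {entry["prompt"] for entry in battle["entries"]}
    return len(prompts) == 1

battle = {
    "entries": [
        {"model": "flux", "prompt": "A storm surge floods a boardwalk"},
        {"model": "sdxl", "prompt": "A storm surge floods a boardwalk"},
    ]
}
print(battle_is_valid(battle))  # True
```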

Provenance: CAP-SRP

Every evaluation event is recorded in a tamper-evident SHA-256 hash chain (CAP-SRP). This includes not only successful image generations but also safety refusals, cases where the AI declined to generate an image. Five refusal types are tracked, providing a complete audit trail from news input to final evaluation.
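The core idea of a hash chain is that each link's digest covers both the new event and the previous digest, so editing any past event changes every later digest. A minimal sketch of that mechanism; this is illustrative only and not the CAP-SRP wire format.

```python
import hashlib
import json

def chain_append(prev_hash: str, event: dict) -> str:
    """Append one evaluation event to a SHA-256 hash chain.
    The digest covers the previous link's hash plus the canonical
    JSON of the new event, so any retroactive edit to an earlier
    event invalidates every subsequent digest.
    Illustrative sketch, not the CAP-SRP format."""
    payload = prev_hash + json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

genesis = "0" * 64
h1 = chain_append(genesis, {"type": "generation", "model": "flux"})
h2 = chain_append(h1, {"type": "refusal", "reason": "safety"})
```

Verification replays the chain from the genesis value and compares each recomputed digest against the stored one; a single mismatch pinpoints where tampering occurred.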

Public verification: CAP Verify API · SRP Audit API

Data Output Formats

All exports support oss_only=1 filtering to include only open-source model outputs (FLUX, SDXL), ensuring commercial safety.
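The effect of the oss_only=1 flag can be mimicked as a simple model-name filter. A sketch assuming a flat list of record dicts with a "model" field (an assumed shape; FLUX and SDXL are the open-source models named above):

```python
# Open-source models per the document; lowercase names assumed for matching.
OPEN_SOURCE_MODELS = {"flux", "sdxl"}

def filter_export(records: list, oss_only: bool = False) -> list:
    """Mimic the oss_only=1 export flag: when set, keep only records
    produced by open-source models. Record shape is an assumed example."""
    if not oss_only:
        return records
    return [r for r in records if r["model"] in OPEN_SOURCE_MODELS]

records = [
    {"model": "flux", "image": "a.png"},
    {"model": "proprietary-x", "image": "b.png"},
    {"model": "sdxl", "image": "c.png"},
]
print(len(filter_export(records, oss_only=True)))  # 2
```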

Contribute to the Benchmark

Every vote improves the quality of AI image evaluation data. No registration required.

→ Vote in the Arena