
AI Image Benchmark Methodology

How we evaluate AI image generators with human preference data.

The Problem with Automated Metrics

FID (Fréchet Inception Distance), CLIP score, and IS (Inception Score) are the standard metrics for evaluating image generation quality. However, these metrics correlate poorly with actual human preference. A model can achieve excellent FID scores while producing images that humans find unappealing, or vice versa.

The gap between automated metrics and human judgment is well-documented. Google Research's RichHF-18K dataset, which won Best Paper at CVPR 2024, demonstrated that multi-dimensional human feedback captures quality aspects that no single automated metric can represent.

Three-Signal Evaluation

AIMomentz collects three complementary signal types from every user interaction.

Signal 1: Pairwise Comparison (A/B Battle)

Two AI-generated images from different models, created from the same news-derived prompt, are presented side by side. The user taps their preferred image. This produces a binary preference signal directly compatible with Diffusion-DPO (Direct Preference Optimization) training and the Pick-a-Pic dataset format.
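A pairwise vote maps directly onto the (chosen, rejected) pair that DPO-style training consumes. The sketch below illustrates that mapping; the field names are illustrative assumptions, not the actual AIMomentz schema.

```python
from dataclasses import dataclass
import json

@dataclass
class PairwisePreference:
    """One A/B battle outcome. Field names are illustrative,
    not the real AIMomentz export schema."""
    prompt: str      # identical news-derived prompt shown to both models
    model_a: str
    model_b: str
    image_a: str     # path or URL of model A's output
    image_b: str
    winner: str      # "a" or "b" -- the user's tap

    def to_dpo_pair(self) -> dict:
        """Map the vote onto the (prompt, chosen, rejected) triple
        that Diffusion-DPO-style training expects."""
        chosen, rejected = (
            (self.image_a, self.image_b) if self.winner == "a"
            else (self.image_b, self.image_a)
        )
        return {"prompt": self.prompt, "chosen": chosen, "rejected": rejected}

vote = PairwisePreference(
    prompt="A storm surge floods a coastal boardwalk at dawn",
    model_a="flux", model_b="sdxl",
    image_a="img/flux_001.png", image_b="img/sdxl_001.png",
    winner="b",
)
print(json.dumps(vote.to_dpo_pair()))
```

Because the record stores the raw vote rather than the derived pair, the same data can later be re-exported in other preference formats.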

Current dataset: 301 pairwise comparisons across 120 battles.

Signal 2: Multi-Axis Rating (4-Axis)

Users can rate individual images on four dimensions, each on a 1-5 scale.

This format is compatible with RichHF-18K (Google Research) and UltraFeedback. Current dataset: 0 multi-axis ratings.

Signal 3: Behavioral Engagement

Implicit signals are captured without user effort.

Same-Prompt Comparison

Unlike datasets where different models generate images from different prompts, AIMomentz ensures all models in a battle receive the identical prompt derived from the same news headline. This eliminates prompt difficulty as a confounding variable — the most critical requirement for valid model comparison.
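The same-prompt requirement is an invariant that can be checked mechanically at ingestion time. A minimal sketch, assuming a battle is a dict holding a list of per-model entries (an assumed shape, not the real schema):

```python
def battle_is_valid(battle: dict) -> bool:
    """Same-prompt invariant: every entry in a battle must share one
    prompt, so preference differences reflect the models rather than
    prompt difficulty. Battle structure here is an assumed example."""
    prompts = {entry["prompt"] for entry in battle["entries"]}
    return len(prompts) == 1

battle = {
    "entries": [
        {"model": "flux", "prompt": "A storm surge floods a boardwalk"},
        {"model": "sdxl", "prompt": "A storm surge floods a boardwalk"},
    ]
}
print(battle_is_valid(battle))  # True
```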

Provenance: CAP-SRP

Every evaluation event is recorded in a tamper-evident SHA-256 hash chain (CAP-SRP). This includes not only successful image generations but also safety refusals, cases where the AI declined to generate an image. Five refusal types are tracked, providing a complete audit trail from news input to final evaluation.
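The core idea of a hash chain is that each link's digest covers both the new event and the previous digest, so editing any past event changes every later digest. A minimal sketch of that mechanism; this is illustrative only and not the CAP-SRP wire format.

```python
import hashlib
import json

def chain_append(prev_hash: str, event: dict) -> str:
    """Append one evaluation event to a SHA-256 hash chain.
    The digest covers the previous link's hash plus the canonical
    JSON of the new event, so any retroactive edit to an earlier
    event invalidates every subsequent digest.
    Illustrative sketch, not the CAP-SRP format."""
    payload = prev_hash + json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

genesis = "0" * 64
h1 = chain_append(genesis, {"type": "generation", "model": "flux"})
h2 = chain_append(h1, {"type": "refusal", "reason": "safety"})
```

Verification replays the chain from the genesis value and compares each recomputed digest against the stored one; a single mismatch pinpoints where tampering occurred.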

Public verification: CAP Verify API · SRP Audit API

Data Output Formats

All exports support oss_only=1 filtering to include only open-source model outputs (FLUX, SDXL), ensuring commercial safety.
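The effect of the oss_only=1 flag can be mimicked as a simple model-name filter. A sketch assuming a flat list of record dicts with a "model" field (an assumed shape; FLUX and SDXL are the open-source models named above):

```python
# Open-source models per the document; lowercase names assumed for matching.
OPEN_SOURCE_MODELS = {"flux", "sdxl"}

def filter_export(records: list, oss_only: bool = False) -> list:
    """Mimic the oss_only=1 export flag: when set, keep only records
    produced by open-source models. Record shape is an assumed example."""
    if not oss_only:
        return records
    return [r for r in records if r["model"] in OPEN_SOURCE_MODELS]

records = [
    {"model": "flux", "image": "a.png"},
    {"model": "proprietary-x", "image": "b.png"},
    {"model": "sdxl", "image": "c.png"},
]
print(len(filter_export(records, oss_only=True)))  # 2
```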

Contribute to the Benchmark

Every vote improves the quality of AI image evaluation data. No registration required.

→ Vote in the Arena