How we evaluate AI image generators with human preference data.
FID (Fréchet Inception Distance), CLIP score, and IS (Inception Score) are the standard metrics for evaluating image generation quality. However, these metrics correlate poorly with actual human preference: a model can achieve an excellent FID score while producing images that humans find unappealing, and vice versa.
The gap between automated metrics and human judgment is well-documented. Google Research's RichHF-18K dataset, which won Best Paper at CVPR 2024, demonstrated that multi-dimensional human feedback captures quality aspects that no single automated metric can represent.
AIMomentz collects three complementary signal types from every user interaction.
Two AI-generated images from different models, created from the same news-derived prompt, are presented side by side. The user taps their preferred image. This produces a binary preference signal directly compatible with Diffusion-DPO (Direct Preference Optimization) and Pick-a-Pic format training.
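A tap on the preferred image can be serialized as a chosen/rejected pair in the style of Pick-a-Pic and Diffusion-DPO training data. The sketch below is illustrative only; the field names are assumptions, not AIMomentz's actual schema.

```python
# Sketch of a binary preference record in a DPO / Pick-a-Pic-style layout.
# Field names ("prompt", "chosen", "rejected") are illustrative assumptions.
def make_preference_record(prompt, image_a, image_b, winner):
    """Return a preference example; winner is 'a' or 'b'."""
    chosen, rejected = (image_a, image_b) if winner == "a" else (image_b, image_a)
    return {
        "prompt": prompt,      # identical prompt for both models in the battle
        "chosen": chosen,      # the image the user tapped
        "rejected": rejected,  # the image the user passed over
    }

record = make_preference_record(
    "a lighthouse at dawn", "flux_001.png", "sdxl_001.png", winner="a")
```

The chosen/rejected framing is what preference-optimization objectives such as Diffusion-DPO consume directly, so no relabeling step is needed between collection and training.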
Current dataset: 301 pairwise comparisons across 120 battles.
Users can rate individual images on four dimensions, each on a 1–5 scale.
This format is compatible with RichHF-18K (Google Research) and UltraFeedback. Current dataset: 0 multi-axis ratings.
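A rating record in this format can be validated with a simple range check. The axis names below are placeholders, not the platform's actual four dimensions; only the 1–5 scale is taken from the text.

```python
# Sketch: validate a multi-axis rating record on a 1-5 scale.
# Axis names are placeholders; the actual four dimensions are platform-defined.
def validate_rating(rating: dict) -> bool:
    """Check that every axis score is an integer in the 1-5 range."""
    return all(isinstance(v, int) and 1 <= v <= 5 for v in rating.values())

ok = validate_rating({"axis_1": 4, "axis_2": 5, "axis_3": 3, "axis_4": 2})
```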
Implicit signals are captured without any user effort.
Unlike datasets where different models generate images from different prompts, AIMomentz ensures all models in a battle receive the identical prompt derived from the same news headline. This eliminates prompt difficulty as a confounding variable — the most critical requirement for valid model comparison.
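The identical-prompt constraint can be sketched as follows: a battle fans one prompt out to every participating model, so model quality is the only factor that varies. The function and field names are illustrative assumptions.

```python
# Sketch: a battle assigns the identical prompt to every model, eliminating
# prompt difficulty as a confounder. Model names are examples from the text.
def make_battle(prompt: str, models: list[str]) -> list[dict]:
    """Return one generation task per model, all sharing the same prompt."""
    return [{"model": m, "prompt": prompt} for m in models]

battle = make_battle("a lighthouse at dawn", ["FLUX", "SDXL"])
```

Because every task in the returned list carries the same prompt string, a win rate computed over many such battles reflects model capability rather than prompt luck.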
Every evaluation event is recorded in a tamper-evident SHA-256 hash chain (CAP-SRP). This includes not only successful image generations but also safety refusals — cases where the AI declined to generate an image. Five refusal types are tracked, providing a complete audit trail from news input to final evaluation.
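The general idea of such a hash chain can be sketched in a few lines: each entry's hash covers the previous entry's hash plus the event payload, so altering any recorded event breaks every subsequent link. This is a minimal illustration of the technique, not the actual CAP-SRP implementation.

```python
import hashlib
import json

# Minimal append-only SHA-256 hash chain (illustrative, not CAP-SRP itself).
# Each entry hashes the previous hash concatenated with the event payload.
GENESIS = "0" * 64

def append_event(chain: list[dict], event: dict) -> None:
    """Append an event, linking it to the previous entry's hash."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    payload = json.dumps(event, sort_keys=True)
    digest = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
    chain.append({"event": event, "prev": prev_hash, "hash": digest})

def verify_chain(chain: list[dict]) -> bool:
    """Recompute every link; any tampering makes verification fail."""
    prev = GENESIS
    for entry in chain:
        payload = json.dumps(entry["event"], sort_keys=True)
        expected = hashlib.sha256((prev + payload).encode()).hexdigest()
        if entry["prev"] != prev or entry["hash"] != expected:
            return False
        prev = entry["hash"]
    return True

chain: list[dict] = []
append_event(chain, {"type": "generation", "model": "FLUX"})
append_event(chain, {"type": "refusal", "reason": "safety"})
```

Note that refusal events are appended exactly like generations, which is what makes the audit trail complete rather than success-only.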
Public verification: CAP Verify API · SRP Audit API
All exports support oss_only=1 filtering, which includes only outputs from open-source models (FLUX, SDXL) so the filtered data is safe for commercial use.
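The filter behaves like a simple allowlist over the model field. The sketch below assumes a hypothetical allowlist and record shape; it is not the platform's actual export code.

```python
# Sketch of oss_only export filtering. OPEN_SOURCE_MODELS is an assumed
# allowlist and the record shape is hypothetical.
OPEN_SOURCE_MODELS = {"FLUX", "SDXL"}

def export(records: list[dict], oss_only: bool = False) -> list[dict]:
    """Return all records, or only those from open-source models."""
    if oss_only:
        return [r for r in records if r["model"] in OPEN_SOURCE_MODELS]
    return list(records)

rows = export([{"model": "FLUX"}, {"model": "proprietary-x"}], oss_only=True)
```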
Every vote improves the quality of AI image evaluation data. No registration required.