
Fair in Mind, Fair in Action?
The first synchronous benchmark designed to evaluate the fairness of both understanding and generation tasks in Unified Multimodal Large Language Models (UMLLMs).
A Three-Dimensional Evaluation Chain
Fairness evaluation for UMLLMs should be a complete logical chain. Instead of seeking a single metric, IRIS maps the tension-filled fairness space from a model's "default instincts", to its "real-world cognition", and finally to its "controllability".
Ideal Fairness (IFS)
Core Question: In the absence of context, what are the model's intrinsic "Default Values"? How far do its priors deviate from a Utopian, perfectly egalitarian world?
Real-world Fidelity (RFS)
Core Question: After assessing the "ideal," we must assess the "reality." Does the model accurately reflect the world "as-is", that is, real-world demographic distributions?
Bias Inertia & Steerability
Core Question: Given its "defaults" and "cognition", how much Inertia does it exhibit? Can it be steered toward a desired counter-stereotypical state?
Global Leaderboard
| Rank | Model | IRIS Score |
|---|---|---|
| 1 | Bagel | 95.94 |
| 2 | Janus-Pro | 67.97 |
| 3 | UniWorld-V1 | 64.43 |
| 4 | VILA-U | 60.69 |
| 5 | Show-o | 60.01 |
| 6 | Harmon | 52.49 |
| 7 | BLIP3-o | 40.13 |
The IRIS-MBTI Personalities
Moving beyond absolute judgment. Discover the unique fairness "personality" of each model to guide context-specific applications.
What the Personality Profile Means
To provide a more intuitive understanding of a model's fairness characteristics, we introduce the IRIS-MBTI Personality Diagnostic. This profile summarizes a model's behavior across our three dimensions for both Generation (Gen) and Understanding (Und) tasks. Each three-letter code represents the model's specific tendencies:
Utopian (U) (strong Ideal Fairness) vs. Heuristic (H) (weaker Ideal Fairness).
Accurate (A) (strong Real-world Fidelity) vs. Distorted (D) (weaker Real-world Fidelity).
Flexible (F) (strong Steerability) vs. Rigid (R) (weaker Steerability).
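As a rough illustration of how a three-letter code could be assembled from the three dimension scores, here is a minimal Python sketch. The function name `fairness_code`, the 0 to 100 score scale, and the cutoff of 50 between "strong" and "weaker" are all assumptions made for this example; the page does not state how IRIS actually draws that line.

```python
def fairness_code(ifs: float, rfs: float, steer: float, threshold: float = 50.0) -> str:
    """Build a three-letter code from three dimension scores.

    NOTE: `threshold` is a hypothetical cutoff chosen for illustration,
    not a value taken from the benchmark.
    """
    first = "U" if ifs >= threshold else "H"     # Utopian vs. Heuristic (Ideal Fairness)
    second = "A" if rfs >= threshold else "D"    # Accurate vs. Distorted (Real-world Fidelity)
    third = "F" if steer >= threshold else "R"   # Flexible vs. Rigid (Steerability)
    return first + second + third


# Hypothetical scores: strong ideal fairness, weak fidelity, strong steerability.
print(fairness_code(80.0, 30.0, 65.0))  # UDF
```

Since a profile covers both Generation and Understanding, the same function would be called once per task with that task's scores.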

UAF
The Adaptive Idealist
High scores on all dimensions. The ideal model we strive for.

HAF
The Heuristic Reformer
Strong in perception and willpower, but lacks an idealistic foundation.

UDF
The Grounded Reformer
Strong in belief and willpower, but has difficulty perceiving reality.

HDF
The Teachable Student
Strong only in willpower. A promising "blank slate".

UAR
The Sophisticated Stereotyper
Strong in belief and perception, but has rigid willpower.

HAR
The Unteachable Ignoramus
Strong only in perception, but stubbornly resists guidance.

UDR
The Obstinate Heurist
Strong only in belief, ignoring reality and resisting correction.

HDR
The Dogmatic Preacher
Low scores on all dimensions. The worst-case scenario.
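For readers who want to work with these profiles programmatically, the eight personalities listed above can be held in a simple lookup table. The sketch below copies the names from this page; the helper `describe` and the example Gen/Und profile are hypothetical and do not describe any model on the leaderboard.

```python
# Personality names as listed on this page, keyed by three-letter code.
PERSONALITIES = {
    "UAF": "The Adaptive Idealist",
    "HAF": "The Heuristic Reformer",
    "UDF": "The Grounded Reformer",
    "HDF": "The Teachable Student",
    "UAR": "The Sophisticated Stereotyper",
    "HAR": "The Unteachable Ignoramus",
    "UDR": "The Obstinate Heurist",
    "HDR": "The Dogmatic Preacher",
}


def describe(code: str) -> str:
    """Return the personality name for a code, accepting either letter case."""
    return PERSONALITIES[code.upper()]


# Hypothetical profile with one code per task type.
profile = {"Gen": "udf", "Und": "uar"}
for task, code in profile.items():
    print(f"{task}: {code.upper()} ({describe(code)})")
```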
Citing Our Work
@inproceedings{
zhao2026fair,
title={Fair in Mind, Fair in Action? A Synchronous Benchmark for Understanding and Generation in {UMLLM}s},
author={Yiran Zhao and Lu Zhou and Xiaogang Xu and Liming Fang and Zhe Liu and Jiafei Wu},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026},
url={https://openreview.net/forum?id=NYphgYTloq}
}