Fair in Mind, Fair in Action?

The first synchronous benchmark designed to evaluate the fairness of both understanding and generation tasks in Unified Multimodal Large Language Models (UMLLMs).

Paper: ICLR 2026

A Three-Dimensional Evaluation Chain

Fairness evaluation for UMLLMs should be a complete logical chain. Instead of seeking a single metric, IRIS maps the tension-filled fairness space from a model's "default instincts", to its "real-world cognition", and finally to its "controllability".

The "Should-be"

Ideal Fairness (IFS)

Core Question: In the absence of context, what are the model's intrinsic "Default Values"? How far do its priors deviate from a Utopian, perfectly egalitarian world?

Theoretical Anchor: Group Fairness; Statistical Parity Difference (SPD); Representation Disparity (RD).

Practical Implications: Indicates "factory safety settings". Vital for Public APIs or Educational contexts.
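For readers unfamiliar with the Statistical Parity Difference named above, the sketch below computes its textbook form: the gap between two groups in the rate of a favourable outcome. It is only an illustration of the general concept. The function name, the toy data, and the binary-outcome framing are assumptions made here, and this page does not specify how IRIS turns SPD or RD into the IFS score.

```python
from collections import Counter
from typing import Hashable, Sequence


def statistical_parity_difference(
    outcomes: Sequence[int],
    groups: Sequence[Hashable],
    group_a: Hashable,
    group_b: Hashable,
) -> float:
    """Return P(outcome = 1 | group_a) - P(outcome = 1 | group_b)."""
    if len(outcomes) != len(groups):
        raise ValueError("outcomes and groups must have the same length")
    # Count samples per group, and favourable outcomes per group.
    totals = Counter(groups)
    positives = Counter(g for y, g in zip(outcomes, groups) if y == 1)
    for g in (group_a, group_b):
        if totals[g] == 0:
            raise ValueError(f"no samples for group {g!r}")
    rate_a = positives[group_a] / totals[group_a]
    rate_b = positives[group_b] / totals[group_b]
    return rate_a - rate_b


# Hypothetical toy data: 1 means the favourable outcome was assigned.
outcomes = [1, 1, 0, 1, 0, 0, 0, 1]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
spd = statistical_parity_difference(outcomes, groups, "a", "b")
print(spd)  # 0.5 (group a: 3/4, group b: 1/4)
```

A value of 0 means both groups receive the favourable outcome at the same rate; the sign shows which group is favoured.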

The "As-is"

Real-world Fidelity (RFS)

Core Question: After assessing the "ideal", we must assess the "reality". Does the model accurately capture the world "as-is", as reflected in real-world demographic distributions?

Theoretical Anchor: Fairness through Awareness; Equality of Opportunity.

Practical Implications: Indicates "cognitive accuracy". Crucial for Societal Simulation and Decision Support systems.

The "Can-be"

Bias Inertia & Steerability (BIS)

Core Question: Given its "defaults" and "cognition", how much Inertia does it exhibit? Can it be steered toward a desired counter-stereotypical state?

Theoretical Anchor: Counterfactual Fairness; Individual Fairness; Algorithmic Recourse.

Practical Implications: Indicates "alignment cost". Essential for debiasing during fine-tuning and High-Fidelity Control.

🏆 Global Leaderboard


Rank	Model	IRIS Score	IFS (Gen)	RFS (Gen)	BIS (Gen)	MBTI (Gen/Und)
1	Bagel 7B	95.94	82.58	69.13	60.91	uaf/udf
2	Janus-Pro 7B	67.97	56.78	42.45	69.30	haf/hdf
3	UniWorld-V1 ~20B	64.43	52.30	62.35	45.94	uar/hdr
4	VILA-U 7B	60.69	64.90	40.68	64.97	udf/haf
5	Show-o 1.3B	60.01	70.03	68.22	54.57	uaf/uar
6	Harmon 1.5B	52.49	35.76	60.50	49.97	har/uar
7	BLIP3-o 8B	40.13	60.95	34.68	78.82	hdf/udr

The IRIS-MBTI Personalities

Moving beyond absolute judgment. Discover the unique fairness "personality" of each model to guide context-specific applications.

What the Personality Profile Means

To provide a more intuitive understanding of a model's fairness characteristics, we introduce the IRIS-MBTI Personality Diagnostic. This profile summarizes a model's behavior across our three dimensions for both Generation (Gen) and Understanding (Und) tasks. Each three-letter code represents the model's specific tendencies:

1st Letter (Belief)

Utopian (U) (strong Ideal Fairness)
vs. Heuristic (H) (weaker Ideal Fairness).

2nd Letter (Perception)

Accurate (A) (strong Real-world Fidelity)
vs. Distorted (D) (weaker Real-world Fidelity).

3rd Letter (Willpower)

Flexible (F) (strong Steerability)
vs. Rigid (R) (weaker Steerability).

Example: A UAF profile, like Bagel's in the Generation task, indicates a 'Utopian, Accurate, and Flexible' model, an "Adaptive Idealist" that performs well across all dimensions. This qualitative diagnosis helps to quickly identify a model's unique strengths and weaknesses beyond a single score.
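The letter scheme above can be read mechanically. The following is a minimal sketch of a decoder, assuming codes are written as in the leaderboard's MBTI column (three lowercase letters per task, with Gen and Und separated by a slash). The names `TRAITS`, `decode_profile`, and `decode_pair` are hypothetical helpers introduced for illustration and are not part of any released IRIS code.

```python
# Position-wise meaning of each letter: Belief, Perception, Willpower.
TRAITS = [
    {"u": "Utopian", "h": "Heuristic"},   # 1st letter: Belief
    {"a": "Accurate", "d": "Distorted"},  # 2nd letter: Perception
    {"f": "Flexible", "r": "Rigid"},      # 3rd letter: Willpower
]


def decode_profile(code: str) -> tuple:
    """Translate a three-letter code such as 'uaf' into its trait names."""
    code = code.strip().lower()
    if len(code) != len(TRAITS):
        raise ValueError(f"expected a three-letter code, got {code!r}")
    try:
        return tuple(options[letter] for options, letter in zip(TRAITS, code))
    except KeyError as exc:
        raise ValueError(f"unknown letter in code {code!r}") from exc


def decode_pair(cell: str) -> dict:
    """Split a 'gen/und' cell into decoded Generation and Understanding profiles."""
    gen, und = cell.split("/")
    return {"Gen": decode_profile(gen), "Und": decode_profile(und)}


print(decode_profile("UAF"))   # ('Utopian', 'Accurate', 'Flexible')
print(decode_pair("uar/hdr"))
```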

uaf

The Adaptive Idealist

High scores on all dimensions. The ideal model we strive for.

haf

The Heuristic Reformer

Strong in perception and willpower, but lacks an idealistic foundation.

udf

The Grounded Reformer

Strong in belief and willpower, but has difficulty perceiving reality.

hdf

The Teachable Student

Strong only in willpower. A promising "blank slate".

uar

The Sophisticated Stereotyper

Strong in belief and perception, but has rigid willpower.

har

The Unteachable Ignoramus

Strong only in perception, but stubbornly resists guidance.

udr

The Obstinate Heurist

Strong only in belief, ignoring reality and resisting correction.

hdr

The Dogmatic Preacher

Low scores on all dimensions. The worst-case scenario.
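To connect the eight archetypes back to the leaderboard, the sketch below looks up the archetype name for each model's Generation and Understanding code. The archetype names and the per-model codes are copied from this page; the dictionary layout and the helper `archetypes_for` are assumptions made for illustration only.

```python
# Archetype names as listed above, keyed by three-letter code.
ARCHETYPES = {
    "uaf": "The Adaptive Idealist",
    "haf": "The Heuristic Reformer",
    "udf": "The Grounded Reformer",
    "hdf": "The Teachable Student",
    "uar": "The Sophisticated Stereotyper",
    "har": "The Unteachable Ignoramus",
    "udr": "The Obstinate Heurist",
    "hdr": "The Dogmatic Preacher",
}

# MBTI (Gen/Und) column of the leaderboard, copied verbatim.
LEADERBOARD_MBTI = {
    "Bagel 7B": "uaf/udf",
    "Janus-Pro 7B": "haf/hdf",
    "UniWorld-V1 ~20B": "uar/hdr",
    "VILA-U 7B": "udf/haf",
    "Show-o 1.3B": "uaf/uar",
    "Harmon 1.5B": "har/uar",
    "BLIP3-o 8B": "hdf/udr",
}


def archetypes_for(model: str) -> tuple:
    """Return (Generation archetype, Understanding archetype) for a model."""
    gen, und = LEADERBOARD_MBTI[model].split("/")
    return ARCHETYPES[gen], ARCHETYPES[und]


for model in LEADERBOARD_MBTI:
    gen_name, und_name = archetypes_for(model)
    print(f"{model}: Gen = {gen_name}, Und = {und_name}")
```

Every model in the table receives different codes for Generation and Understanding, which is why the profile is reported per task rather than once per model.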

📚 Citing Our Work

@inproceedings{
  zhao2026fair,
  title={Fair in Mind, Fair in Action? A Synchronous Benchmark for Understanding and Generation in {UMLLM}s},
  author={Yiran Zhao and Lu Zhou and Xiaogang Xu and Liming Fang and Zhe Liu and Jiafei Wu},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=NYphgYTloq}
}