Consistency Should Be the Priority for Unified Multimodal Models

Feb 3, 2026

Zhaolong Su, Yinyi Luo, Yiqiao Jin, Mengqi Zhang, Wenyue Hua, Srijan Kumar, Qingsong Wen, Jindong Wang


Abstract

Unified multimodal models (UMMs) aim to handle understanding and generation across modalities within a single architecture. Despite rapid progress, current UMMs frequently produce inconsistent outputs across views, modalities, and prompts. In this position paper, we argue that consistency, not capability, should be the priority research target for UMMs. We characterize three forms of consistency (cross-view, cross-modal, and cross-prompt), survey current evaluation gaps, and outline a roadmap for consistency-driven UMM research.

Type: Preprint

Publication: Under review at EMNLP 2026
