Consistency Should Be the Priority for Unified Multimodal Models
Feb 3, 2026
Zhaolong Su
Yinyi Luo
Yiqiao Jin
Mengqi Zhang
Wenyue Hua
Srijan Kumar
Qingsong Wen
Jindong Wang
Abstract
Unified multimodal models (UMMs) aim to handle understanding and generation across modalities within a single architecture. Despite rapid progress, current UMMs frequently produce inconsistent outputs across views, modalities, and prompts. In this position paper, we argue that consistency, not capability, should be the priority research target for UMMs. We characterize three forms of consistency (cross-view, cross-modal, and cross-prompt), survey current evaluation gaps, and outline a roadmap for consistency-driven UMM research.
Type
Publication
Under Review at EMNLP 2026