Consistency Should Be the Priority for Unified Multimodal Models

Feb 3, 2026·
Zhaolong Su
,
Yinyi Luo
,
Yiqiao Jin
,
Mengqi Zhang
,
Wenyue Hua
,
Srijan Kumar
,
Qingsong Wen
,
Jindong Wang
Abstract
Unified multimodal models (UMMs) aim to handle understanding and generation across modalities within a single architecture. Despite rapid progress, current UMMs frequently produce inconsistent outputs across views, modalities, and prompts. In this position paper, we argue that consistency, not capability, should be the priority research target for UMMs. We characterize three forms of consistency (cross-view, cross-modal, and cross-prompt), survey current evaluation gaps, and outline a roadmap for consistency-driven UMM research.
Type
Publication
Under Review at ICML 2026 (Preprint)


Authors
Yiqiao Jin
Ph.D. Candidate in Computer Science
My research focuses on adaptive and efficient AI systems, with emphasis on LLM agents, agent memory, self-distillation, multimodal LLMs, and structured multi-agent intelligence.