
We introduce SlideAgent, a versatile agentic framework for understanding multi-modal, multi-page, and multi-layout documents.
Nov 27, 2025
ScreenLLM introduces a stateful screen schema and key-frame extractor that compresses dynamic UI sessions into time-aware summaries, enabling efficient GUI understanding and action prediction with multimodal LLMs.
Apr 30, 2025
A multimodal LLM for graphical user interface (GUI) understanding and action prediction. It introduces a stateful screen schema that summarizes dynamic UI sessions as time-aware text, and a key-frame extractor that identifies significant UI transitions. WebConf 2025 MM4SG Workshop.
Apr 30, 2025