Agents

SlideAgent: Hierarchical Agentic Framework for Multi-Page Visual Document Understanding

We introduce SlideAgent, a versatile agentic framework for understanding multi-modal, multi-page, and multi-layout documents.

Nov 27, 2025

ScreenLLM: Stateful Screen Schema for Efficient Action Understanding and Prediction

ScreenLLM introduces a stateful screen schema and a key-frame extractor that together compress dynamic UI sessions into time-aware summaries, enabling efficient GUI understanding and action prediction with multimodal LLMs.

Apr 30, 2025

ScreenLLM

A multimodal LLM for graphical user interface (GUI) understanding and action prediction. Introduces a stateful screen schema that summarizes dynamic UI sessions as time-aware text, and a key-frame extractor that identifies significant UI transitions. WebConf 2025 MM4SG Workshop.

Apr 30, 2025