Multimodal LLMs

SlideAgent

Hierarchical agentic framework for multi-page visual document understanding. Decomposes reasoning into global, page, and element levels to handle slide decks, financial reports, and infographics. Accepted at ACL 2026 main conference.

Mar 15, 2026

SlideAgent: Hierarchical Agentic Framework for Multi-Page Visual Document Understanding

Multi-page visual documents such as manuals, brochures, presentations, and posters convey key information through layout, colors, icons, and cross-slide references. While large language models (LLMs) offer opportunities in document understanding, current systems struggle with complex, multi-page visual documents, particularly in fine-grained reasoning over elements and pages. We introduce SlideAgent, a versatile agentic framework for understanding multi-modal, multi-page, and multi-layout documents, especially slide decks. SlideAgent employs specialized agents and decomposes reasoning into three levels (global, page, and element) to construct a structured, query-agnostic representation that captures both overarching themes and detailed visual or textual cues. During inference, SlideAgent selectively activates these agents for multi-level reasoning and integrates their outputs into coherent, context-aware answers. Extensive experiments show that SlideAgent achieves significant improvements over both proprietary (+7.9 over GPT-4o) and open-source (+9.8 over InternVL3-8B) models.

Sep 26, 2025
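
As a rough illustration of the global, page, and element levels described in the SlideAgent abstract above, the following Python sketch builds a small query-agnostic representation of a deck and then selects which levels are relevant to a query. Every name here (`Element`, `Page`, `Deck`, `select_context`) and the keyword-overlap heuristic are hypothetical stand-ins chosen for illustration. The abstract describes specialized agents, which this toy example replaces with simple word matching, so it should not be read as SlideAgent's actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """Element level: one visual or textual item on a page (hypothetical type)."""
    kind: str
    text: str


@dataclass
class Page:
    """Page level: a per-page summary plus the elements on that page."""
    number: int
    summary: str
    elements: list = field(default_factory=list)


@dataclass
class Deck:
    """Global level: a deck-wide summary plus all pages."""
    summary: str
    pages: list = field(default_factory=list)


def _overlap(query: str, text: str) -> int:
    """Count the distinct lowercase words shared by the query and the text."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def select_context(query: str, deck: Deck) -> dict:
    """Pick the pages and elements that share at least one word with the query.

    The global summary is always included. Keyword overlap is an assumed
    placeholder for the selective agent activation the abstract mentions.
    """
    context = {"global": deck.summary, "page": [], "element": []}
    for page in deck.pages:
        if _overlap(query, page.summary) > 0:
            context["page"].append(page.number)
        for element in page.elements:
            if _overlap(query, element.text) > 0:
                context["element"].append((page.number, element.kind))
    return context
```

The representation is built once, independent of any query, and only the lookup depends on the question being asked. That separation is the point of the sketch; a real system would swap the overlap function for model-based relevance judgments.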

ProteinGPT: Multimodal LLM for Protein Property Prediction and Structure Understanding

ProteinGPT is a multimodal LLM that integrates protein sequence and structural representations in a unified generative interface for property prediction and structure understanding.

Jul 18, 2025

ScreenLLM: Stateful Screen Schema for Efficient Action Understanding and Prediction

ScreenLLM introduces a stateful screen schema and a key-frame extractor that together compress dynamic UI sessions into time-aware summaries, enabling efficient GUI understanding and action prediction with multimodal LLMs.

Apr 30, 2025

ScreenLLM

A multimodal LLM for graphical user interface (GUI) understanding and action prediction. Introduces a stateful screen schema that summarizes dynamic UI sessions as time-aware text and a key-frame extractor for significant UI transitions. WebConf 2025 MM4SG Workshop.

Apr 30, 2025

UniGuard: Towards Universal Safety Guardrails for Jailbreak Attacks on Multimodal Large Language Models

UniGuard is a universal safety guardrail for multimodal LLMs, defending against cross-modal jailbreak attacks across image and text channels with low utility cost.

Mar 3, 2025

RNA-GPT: Multimodal Generative System for RNA Sequence Understanding

RNA-GPT is a multimodal generative system that combines RNA sequence reasoning with structural cues for property prediction, retrieval, and natural-language interaction over RNA data.

Dec 13, 2024

MM-Soc

A benchmark for evaluating multimodal LLMs on social media platforms, covering misinformation, sentiment, hate speech, and humor across image-text content. ACL 2024 (Findings).

May 15, 2024