Multi-page visual documents such as manuals, brochures, presentations, and posters convey key information through layout, colors, icons, and cross-slide references. While large language models (LLMs) offer opportunities in document understanding, current systems struggle with complex, multi-page visual documents, particularly with fine-grained reasoning over elements and pages. We introduce SlideAgent, a versatile agentic framework for understanding multi-modal, multi-page, and multi-layout documents, especially slide decks. SlideAgent employs specialized agents and decomposes reasoning into three levels (global, page, and element) to construct a structured, query-agnostic representation that captures both overarching themes and detailed visual or textual cues. During inference, SlideAgent selectively activates these agents for multi-level reasoning and integrates their outputs into coherent, context-aware answers. Extensive experiments show that SlideAgent significantly improves over both proprietary (+7.9 over GPT-4o) and open-source (+9.8 over InternVL3-8B) models. Our code is available at https://anonymous.4open.science/r/SlideAgent.
Sep 26, 2025
An introduction to my research interests and recent work in Large Language Models, Multimodal Learning, and Social Computing.
Aug 15, 2025
A framework and benchmark to evaluate LLMs' multilingual capabilities in healthcare queries, revealing significant performance gaps across languages.
Dec 14, 2024
Peer review is fundamental to the integrity and advancement of scientific publication. Traditional analyses of peer review often rely on exploration and statistics of existing peer review data. Such analyses do not adequately address the multivariate nature of the process or account for latent variables, and they are further constrained by privacy concerns due to the sensitive nature of the data. We introduce AgentReview, the first large language model (LLM)-based peer review simulation framework, which effectively disentangles the impacts of multiple latent factors and addresses the privacy issue. Our study reveals significant insights, including a notable 37.1% variation in paper decisions due to reviewers' biases, supported by sociological theories such as social influence theory, altruism fatigue, and authority bias. We believe that this study could offer valuable insights to improve the design of peer review mechanisms. Our code is available at https://github.com/Ahren09/AgentReview.
Nov 12, 2024
This work studies the competition dynamics among LLM-based agents, revealing emergent behaviors and strategic patterns in multi-agent systems....
Apr 30, 2024
We introduce SlideAgent, a versatile agentic framework for understanding multi-modal, multi-page, and multi-layout documents.
Jan 1, 2024
We propose a prototypical reward network that enables data-efficient reinforcement learning from human feedback (RLHF) for large language models....
Jan 1, 2024
Social media platforms are hubs for multimodal information exchange, encompassing text, images, and videos, making it challenging for machines to comprehensively understand the information. Multimodal...
Jan 1, 2024
We present a framework and benchmark to evaluate LLMs' multilingual capabilities in healthcare queries, revealing significant performance gaps across languages and providing insights for improving hea...
Jan 1, 2024