A survey of efficient LLM training organized around data-centric techniques — selection, mixing, ordering, and synthesis — and their trade-offs with compute and downstream performance.
Jul 31, 2025
ProteinGPT is a multimodal LLM that integrates protein sequence and structural representations in a unified generative interface for property prediction and structure understanding.
Jul 18, 2025

CultureVLM characterizes and improves cultural understanding of vision-language models across more than 100 countries using culturally grounded benchmarks and training procedures.
Jun 11, 2025
ScreenLLM introduces a stateful screen schema and key-frame extractor that compresses dynamic UI sessions into time-aware summaries, enabling efficient GUI understanding and action prediction with multimodal LLMs.
Apr 30, 2025
UniGuard is a universal safety guardrail for multimodal LLMs, defending against cross-modal jailbreak attacks across image and text channels with low utility cost.
Mar 3, 2025

SciEvo is a comprehensive dataset of more than 2 million papers spanning 30 years (1995-2024) for temporal scientometric analysis.
Mar 3, 2025
RNA-GPT is a multimodal generative system that combines RNA sequence reasoning with structural cues for property prediction, retrieval, and natural-language interaction over RNA data.
Dec 13, 2024
PrivacyMind teaches LLMs to be contextual privacy protection learners that recognize sensitive content in context and adapt outputs accordingly, preserving utility while reducing leakage.
Nov 12, 2024

Peer review is fundamental to the integrity and advancement of scientific publication. Traditional methods of peer review analysis often rely on exploration and statistics of existing peer review data...
Nov 12, 2024

We address fairness issues in graph anomaly detection, providing benchmark datasets and comprehensive evaluation frameworks for fair anomaly detection on graphs....
Jul 4, 2024