SARA: Selective and Adaptive Retrieval-augmented Generation with Context Compression

1Georgia Institute of Technology, 2Visa Research

Abstract

Retrieval-augmented Generation (RAG) extends large language models (LLMs) with external knowledge but faces key challenges: restricted effective context length and redundancy in retrieved documents. Pure compression-based approaches reduce input size but often discard fine-grained details essential for factual accuracy. We propose SARA, a unified RAG framework that balances local precision and global knowledge coverage under tight context budgets. SARA combines natural-language text snippets with semantic compression vectors to jointly enhance context efficiency and answer correctness. It represents contexts at two complementary levels: 1) fine-grained natural-language spans that preserve critical entities and numerical values, and 2) compact, interpretable vectors that summarize high-level semantics. An iterative evidence-selection module employs the compression vectors for dynamic reranking of contexts. Across 9 datasets and 5 open-source LLMs spanning 3 model families (Mistral, Llama, and Gemma), SARA consistently improves answer relevance (+17.71), answer correctness (+13.72), and semantic similarity (+15.53), demonstrating the importance of integrating textual and compressed representations for robust, context-efficient RAG.

Overall Framework

SARA addresses key challenges in RAG through a hybrid compression strategy that balances local precision and global knowledge coverage. The framework operates through a two-stage training procedure:

🎯 Stage 1: Compression Learning

📍 Embedding Alignment
Lightweight compressor aligns embeddings with LLM token space
🔄 Context Reconstruction
Learns to reconstruct original contexts from vectors
📚 Curriculum Learning
Progressive training on increasingly complex text chunks
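The alignment step in Stage 1 can be pictured as a small projection head that maps retriever embeddings into the LLM's token-embedding space and pools them into a handful of compression vectors. The sketch below is purely illustrative: the dimensions, the single linear projection, and the mean-pooling scheme are assumptions for exposition, not the paper's actual compressor architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: retriever embeddings (768-d) -> LLM token space (4096-d).
D_RETRIEVER, D_LLM, N_VECTORS = 768, 4096, 3

# Lightweight compressor: here just one learned linear projection W.
W = rng.normal(scale=0.02, size=(D_RETRIEVER, D_LLM))

def compress(chunk_embeddings: np.ndarray) -> np.ndarray:
    """Map a chunk's retriever token embeddings to N_VECTORS LLM-space vectors."""
    projected = chunk_embeddings @ W                    # (n_tokens, D_LLM)
    # Pool runs of projected tokens into a fixed, small number of vectors.
    splits = np.array_split(projected, N_VECTORS, axis=0)
    return np.stack([s.mean(axis=0) for s in splits])   # (N_VECTORS, D_LLM)

chunk = rng.normal(size=(120, D_RETRIEVER))  # 120 retriever token embeddings
vectors = compress(chunk)
print(vectors.shape)  # (3, 4096)
```

In training, such a projection would be optimized so the LLM can reconstruct the original context from the compression vectors (Stage 1's reconstruction objective), with curriculum learning ordering chunks from simple to complex.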

⚡ Stage 2: Instruction-tuning & Inference

🔀 Hybrid Processing
Top-k passages in natural language + compressed vectors
🎛️ Dynamic Reranking
Iterative selection for relevance and diversity
🎯 Dual Strategies
Embedding novelty + CSI scoring
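The hybrid processing step can be thought of as a budget packer: the highest-ranked passages enter the prompt as natural language until the token budget is nearly spent, and the remainder enter as cheap compressed vectors. This is a simplified sketch under assumed costs (e.g. that each compressed passage occupies a fixed small number of token slots), not SARA's exact allocation rule.

```python
def pack_contexts(ranked_passages, token_budget, vec_cost=3):
    """ranked_passages: list of (passage_text, token_count), best first.
    Passages that fit in the budget are kept as text; the rest are kept
    as compressed vectors at an assumed cost of `vec_cost` slots each."""
    text, compressed, used = [], [], 0
    for passage, n_tokens in ranked_passages:
        if used + n_tokens <= token_budget:
            text.append(passage)
            used += n_tokens
        elif used + vec_cost <= token_budget:
            compressed.append(passage)
            used += vec_cost
    return text, compressed

# Ten 200-token passages under a 1024-token budget (hypothetical numbers).
passages = [(f"passage_{i}", 200) for i in range(10)]
txt, comp = pack_contexts(passages, token_budget=1024)
print(len(txt), len(comp))  # 5 5
```

The key property is graceful degradation: passages beyond the text budget are not dropped, only abstracted, so global knowledge coverage survives tight windows.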

The framework represents contexts at two complementary levels: 1) fine-grained natural-language spans that preserve critical entities and numerical values, and 2) compact, interpretable vectors that summarize high-level semantics. An iterative evidence-selection module uses the compression vectors to dynamically rerank contexts, maximizing information density within strict context budgets.
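The embedding-novelty criterion behaves like greedy maximal-marginal-relevance selection: each round picks the context most relevant to the query and least redundant with what is already selected. The sketch below stands in for that idea; the trade-off weight `lam` and the cosine scoring are assumptions, and SARA's CSI scoring is not modeled here.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def iterative_select(query_vec, context_vecs, k, lam=0.7):
    """Greedy reranking: query relevance minus redundancy with already-selected
    contexts (an MMR-style stand-in for embedding-based novelty)."""
    selected, remaining = [], list(range(len(context_vecs)))
    while remaining and len(selected) < k:
        def score(i):
            rel = cosine(query_vec, context_vecs[i])
            red = max((cosine(context_vecs[i], context_vecs[j])
                       for j in selected), default=0.0)
            return lam * rel - (1 - lam) * red
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(1)
q = rng.normal(size=64)
ctxs = rng.normal(size=(10, 64))
chosen = iterative_select(q, ctxs, k=3)
print(chosen)
```

Running the selection over compression vectors rather than full passages is what keeps this reranking loop cheap at inference time.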

SARA Framework Overview

Key Contributions

Hybrid Compression

Balances local precision with global abstraction for optimal context efficiency

Iterative Refinement

Dynamic context optimization through embedding-based novelty and CSI scoring

Model-Agnostic

Compatible with any retriever, embedding model, and LLM via a lightweight projection

Performance Results

SARA consistently outperforms strong baselines across multiple evaluation metrics and datasets. Under strict context length constraints (512 and 1024 tokens), SARA improves F1 by 19.4% and ROUGE-L by 20.8% on average, with particularly significant gains on knowledge-intensive tasks like TriviaQA (+24.5%) and HotpotQA (+29.0%).

Performance on QASPER Dataset (512 tokens)

Method Answer Relevance Answer Correctness Semantic Similarity Faithfulness
ICAE 75.45 24.03 59.48 21.72
LLMLingua 79.83 23.97 61.08 25.31
LongLLMLingua 82.77 22.86 62.17 29.77
SARA (Ours) 85.35 25.74 63.99 31.95

Performance on Multiple Datasets (1024 tokens, F1 Score)

Method QASPER NarrativeQA TriviaQA QuALITY HotpotQA
Standard RAG 22.73 40.23 58.43 31.79 48.56
Raptor 31.77 56.60 70.51 34.27 68.26
GraphRAG 37.05 64.93 77.52 37.21 73.23
xRAG 32.36 33.43 43.36 32.65 60.19
SARA (Ours) 40.55 69.46 85.08 42.78 84.21

Experimental Setup

📊 9 Datasets Across 4 Task Categories

📝 Short QA
SQuAD-v2.0
📖 Long QA
NarrativeQA, QASPER
QuALITY, MultifieldQA
🔗 Multi-hop
HotpotQA, TriviaQA
2WikiMultihopQA
📋 Summarization
QMSum
5 LLMs across 3 model families
10 total contexts (7 text + 3 compressed)
512-1024 token budgets

Generalization Across Models

🚀 Consistent Improvements Across 5 LLMs

+17.7 Answer Relevance (Mistral-7B)
+13.7 Answer Correctness
+15.5 Semantic Similarity

Key Insight: 7B models achieve performance matching 24B models through efficient context utilization


Sensitivity Analysis

SARA's hybrid approach effectively balances natural language and compressed contexts. Performance remains strong even with minimal natural language input, indicating that compression vectors retain essential information. The optimal balance is achieved around 7-8 natural language contexts, demonstrating the effectiveness of our hybrid strategy.

Sensitivity plots on QASPER, NarrativeQA, QuALITY, TriviaQA, and HotpotQA

Context Compression Analysis

🗜️ Smart Compression That Preserves Details

Maintains entities, numbers, and organization names even under tight token budgets

📊 Semantic Fidelity

Accurate reconstruction while staying interpretable

🎯 Fine-grained

Retains critical entities and values

⚡ Scalable

Works across different retrievers and LLMs

🔄 Parallelizable

High-ratio summaries in a single pass

Compression analysis

BibTeX

@article{jin2025sara,
    title={SARA: Selective and Adaptive Retrieval-augmented Generation with Context Compression},
    author={Jin, Yiqiao and Sharma, Kartik and Rakesh, Vineeth and Dou, Yingtong and Pan, Menghai and Das, Mahashweta and Kumar, Srijan},
    journal={arXiv:2507.05633},
    year={2025}
}

Usage and License Notices

The data, code, and model checkpoints are intended and licensed primarily for research use.

This website is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License.