SARA: Selective and Adaptive Retrieval-augmented Generation

Abstract

Hybrid RAG that balances precision and coverage

Retrieval-augmented generation (RAG) extends large language models (LLMs) with external knowledge, but it must balance limited effective context, redundant retrieved evidence, and the loss of fine-grained facts under aggressive compression. Pure compression-based approaches reduce input size but often discard fine-grained details essential for factual accuracy. We propose SARA, a hybrid RAG framework that targets answer quality under fixed token budgets by combining natural-language snippets with semantic compression vectors. SARA retains a small set of passages in text form to preserve entities and numerical values, compresses the remaining evidence into interpretable vectors for broader coverage, and uses those vectors for iterative evidence reranking. Across 9 datasets and 5 open-source LLMs spanning 3 model families (Mistral, Llama, and Gemma), SARA consistently improves answer relevance (+17.71), answer correctness (+13.72), and semantic similarity (+15.53), demonstrating the importance of integrating textual and compressed representations for robust, context-efficient RAG.

+17.71 Answer Relevance
+13.72 Answer Correctness
+15.53 Semantic Similarity

Averaged over in-domain datasets with Mistral-7B as the QA backbone.

Method

A two-stage framework

SARA learns a compressor that encodes evidence into single-token semantic vectors, then adapts the LLM to reason over a mixture of natural-language snippets and compressed contexts.

Compression Learning

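A lightweight compressor (a sentence encoder followed by an MLP) is trained to encode each evidence chunk as a single token in the LLM's embedding space. Training uses an autoencoding objective on Wikipedia, so the original text can be reconstructed from that token.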

Embedding alignment — projects sentence embeddings into the LLM's input space
Context reconstruction — auto-encoding objective preserves entities & numerics
Curriculum learning — progresses from short to long evidence chunks

Stage 1: Compression Learning — SARA learns to reconstruct text from compression vectors.
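
The embedding-alignment step above can be pictured as a small projection network. The following is a minimal sketch only, not the authors' implementation: the layer sizes, the ReLU activation, the random untrained weights, and the names make_mlp and project are all assumptions made for illustration, and a toy list of numbers stands in for the output of a real sentence encoder.

```python
import math
import random

def make_mlp(d_in, d_hidden, d_out, seed=0):
    """Two-layer MLP with random placeholder weights (untrained, for shape only)."""
    rng = random.Random(seed)

    def mat(rows, cols):
        scale = 1.0 / math.sqrt(cols)
        return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

    return {"w1": mat(d_hidden, d_in), "b1": [0.0] * d_hidden,
            "w2": mat(d_out, d_hidden), "b2": [0.0] * d_out}

def linear(w, b, x):
    """Dense layer: one dot product per output unit, plus bias."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]

def project(mlp, sentence_embedding):
    """Map one sentence embedding to one vector of the LLM's input width."""
    hidden = [max(0.0, h) for h in linear(mlp["w1"], mlp["b1"], sentence_embedding)]  # ReLU
    return linear(mlp["w2"], mlp["b2"], hidden)

# Hypothetical sizes: 8-dim sentence embeddings, 16-dim LLM input embeddings.
mlp = make_mlp(d_in=8, d_hidden=12, d_out=16)
chunk_embeddings = [[0.1 * (i + j) for j in range(8)] for i in range(3)]  # 3 toy chunks
compression_tokens = [project(mlp, e) for e in chunk_embeddings]
print(len(compression_tokens), len(compression_tokens[0]))  # 3 16
```

Each chunk yields exactly one vector of the LLM's input width, which is what lets a compressed context occupy a single token position. The reconstruction and curriculum objectives would be applied on top of such a projection during training and are not shown here.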

Instruction-tuning & Inference

SARA reasons over a mixture of compressed evidence and natural-language contexts, using compression vectors to drive iterative evidence reranking.

Hybrid processing — keeps top-k passages in text, compresses the rest
Iterative reranking — selects contexts by query-relevance and novelty
Dual signals — embedding-novelty and Conditional Self-Information (CSI)

Stage 2: Inference — SARA reasons over a mixture of compressed and natural-language contexts.
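
One way to read the hybrid processing and iterative reranking steps above is as a greedy selection loop followed by a top-k split. The sketch below rests on stated assumptions: relevance and novelty are both computed with cosine similarity over compression vectors, they are combined with a hypothetical novelty_weight, and the CSI signal is left out. The actual scoring used by SARA may differ.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def score(query, cand, chosen, novelty_weight):
    """Blend query relevance with novelty against already-chosen vectors."""
    relevance = cosine(query, cand)
    closest = max((cosine(cand, c) for c in chosen), default=0.0)
    return (1.0 - novelty_weight) * relevance + novelty_weight * (1.0 - closest)

def iterative_rerank(query, candidates, n_select, novelty_weight=0.5):
    """Greedy loop: rescore the remaining candidates after every pick."""
    order, remaining = [], list(range(len(candidates)))
    while remaining and len(order) < n_select:
        chosen = [candidates[i] for i in order]
        best = max(remaining,
                   key=lambda i: score(query, candidates[i], chosen, novelty_weight))
        order.append(best)
        remaining.remove(best)
    return order

def hybrid_split(order, k):
    """Top-k selected contexts stay as text; the rest would be compressed."""
    return order[:k], order[k:]

# Toy vectors: candidate 1 is a near-duplicate of candidate 0.
query = [1.0, 0.0]
candidates = [[3.0, 1.0], [3.0, 1.1], [2.0, -2.0]]
order = iterative_rerank(query, candidates, n_select=3)
as_text, as_vectors = hybrid_split(order, k=2)
print(order, as_text, as_vectors)  # [0, 2, 1] [0, 2] [1]
```

In this toy run the near-duplicate is pushed to the end even though it is more query-relevant than candidate 2, because its novelty drops to almost zero once candidate 0 has been chosen.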

Contributions

Three core ideas

Hybrid Compression

Balances local precision via natural-language spans with global abstraction through compression vectors — fine-grained reasoning under strict token budgets.

Iterative Refinement

Compression-vector-based selection dynamically reduces redundancy and prioritizes query-relevant evidence using embedding novelty and CSI scoring.

Model-Agnostic

Works across 5 open-source LLMs spanning 3 families (Mistral, Llama, Gemma) and generalizes to multiple retrievers — no architectural changes required.

Results

Performance under strict token budgets

SARA outperforms competitive RAG baselines and dedicated context-compression methods across both lexical and LLM-based metrics.

Radar chart comparing SARA and compression-based methods at a 512-token budget.

Context efficiency at a 512-token budget. SARA dominates compression-based methods across every dataset. The hybrid representation — natural-language plus compression vectors — yields large gains on knowledge-intensive tasks.

+19.4% F1 (avg)
+20.8% ROUGE-L (avg)
+29.0% F1 on HotpotQA

Analysis

Generalization, sensitivity, and compression fidelity

Generalization across LLMs — RAG vs. SARA on LLM-based metrics for QASPER.

Generalization across LLMs. SARA improves performance for every backbone tested: Mistral-7B, MistralNemo-12B, MistralSmall-24B, Llama-3.1-8B, and Gemma3-4B. 7B models with SARA match the performance of 24B baselines.

Sensitivity analysis

Holding the total number of contexts fixed at N=10 and sweeping the number of natural-language passages k, performance peaks around k=7–8. Even at k=1, compression vectors close most of the gap to the full-text setup, confirming that the vectors add real signal at minimal token cost.

Panels: QASPER · NarrativeQA · QuALITY · TriviaQA · HotpotQA
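
The token cost behind the k sweep described above can be worked through with simple arithmetic. This sketch assumes that each compressed context costs exactly one token (in line with the single-token vectors described under Method) and ignores the query and prompt template. The passage lengths and the function name context_tokens are made up for illustration.

```python
def context_tokens(text_token_counts, k, n_total):
    """Tokens used when the first k passages stay as text and the other
    n_total - k passages are each replaced by one compression vector."""
    if not 0 <= k <= n_total <= len(text_token_counts):
        raise ValueError("need 0 <= k <= n_total <= number of passages")
    return sum(text_token_counts[:k]) + (n_total - k)

# Hypothetical passage lengths in tokens, already ranked; N = 10 contexts.
lengths = [60, 72, 55, 80, 64, 70, 58, 66, 75, 61]
costs = {k: context_tokens(lengths, k, 10) for k in range(0, 11)}
for k, cost in costs.items():
    print(k, cost)
```

With these invented lengths, k=7 costs 462 tokens against 661 for all ten passages as text, while the three compressed contexts add only 3 tokens to the k=7 total.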

Compression fidelity

Decoded text from a single compression token recovers a span comparable to a 3-sentence reference, capturing entities and numerical values with high fidelity.

Semantic Fidelity · Fine-grained · Scalable · Parallelizable

Probability density of token counts for decoded vs. reference evidence.

Compression effectiveness. Probability density of recovered token counts for decoded evidence and the reference 3-sentence budget — the decoder tracks the reference distribution closely.

Experimental Setup

9 datasets · 5 LLMs · 3 model families

Datasets: 9, in 4 task categories
LLMs: 5, from 4B to 24B
Contexts: 10 (7 text + 3 compressed)
Token budgets: 512–1024

Short QA: SQuAD-v2.0
Long-context QA: NarrativeQA · QASPER · QuALITY · MultiFieldQA
Multi-hop: HotpotQA · TriviaQA · 2WikiMultihopQA
Summarization: QMSum

Citation

BibTeX

@inproceedings{jin2026sara,
    title     = {SARA: Selective and Adaptive Retrieval-augmented Generation with Context Compression},
    author    = {Jin, Yiqiao and Sharma, Kartik and Rakesh, Vineeth and Dou, Yingtong and Pan, Menghai and Das, Mahashweta and Kumar, Srijan},
    booktitle = {Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL)},
    year      = {2026}
}