The attention mechanism in large language models (LLMs) scales quadratically with input length. This means that if a document’s length doubles, the computational and memory costs can increase fourfold. Such scaling issues not only slow down inference but also inflate the size of the key-value (KV) cache, rendering large-context applications impractical in production systems. In retrieval-augmented generation (RAG) settings, most retrieved passages contribute minimally to the final answer, yet the model incurs the full quadratic cost to process them.
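The quadratic growth described above can be illustrated with a toy cost model. This is a sketch only: real attention kernels also pay per-head and projection costs, but the dominant term scales with the square of sequence length.

```python
def attention_cost(seq_len: int) -> int:
    """Pairwise token interactions scored by self-attention: ~seq_len^2."""
    return seq_len * seq_len

base = attention_cost(2048)
doubled = attention_cost(4096)
print(doubled / base)  # doubling the input quadruples the cost -> 4.0
```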
- How Does REFRAG Compress and Shorten Context?
- How is Acceleration Achieved?
- How Does REFRAG Preserve Accuracy?
- What Do the Experiments Reveal?
How Does REFRAG Compress and Shorten Context?
REFRAG introduces a lightweight encoder that divides retrieved passages into fixed-size chunks, such as 16 tokens, and compresses each into a dense chunk embedding. Instead of processing thousands of raw tokens, the decoder handles this shorter sequence of embeddings, resulting in a 16× reduction in sequence length without altering the LLM architecture.
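The chunk-and-compress step can be sketched as follows. Mean pooling here stands in for REFRAG's learned lightweight encoder, and the token-embedding shapes are illustrative assumptions, not the paper's actual dimensions:

```python
import numpy as np

CHUNK_SIZE = 16  # tokens per chunk, as in the example above

def compress_context(token_embeddings: np.ndarray) -> np.ndarray:
    """Compress a (num_tokens, dim) sequence into one embedding per chunk.

    Mean pooling is a placeholder for the learned encoder; the point is
    the shape change: the decoder sees num_tokens / CHUNK_SIZE vectors.
    """
    num_tokens, dim = token_embeddings.shape
    num_chunks = num_tokens // CHUNK_SIZE
    chunks = token_embeddings[: num_chunks * CHUNK_SIZE].reshape(
        num_chunks, CHUNK_SIZE, dim
    )
    return chunks.mean(axis=1)  # (num_chunks, dim)

tokens = np.random.randn(4096, 768)  # retrieved passages as token embeddings
compressed = compress_context(tokens)
print(compressed.shape)  # (256, 768): a 16x shorter sequence for the decoder
```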
How is Acceleration Achieved?
By reducing the input sequence length for the decoder, REFRAG decreases the quadratic attention computation and shrinks the KV cache. Empirical results demonstrate a 16.53× acceleration in time-to-first-token (TTFT) at k=16 and a 30.85× acceleration at k=32, significantly surpassing previous state-of-the-art methods like CEPE, which achieved only 2–8× improvements. Throughput also improves by up to 6.78× compared to LLaMA baselines.
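The KV-cache saving follows directly from the shorter sequence, since the cache stores one key and one value vector per token, per layer. A back-of-the-envelope sizing (model dimensions below are illustrative, not REFRAG's actual configuration):

```python
def kv_cache_bytes(seq_len: int, num_layers: int = 32,
                   hidden_dim: int = 4096, bytes_per_val: int = 2) -> int:
    """Approximate KV-cache size: key + value per token, per layer, fp16."""
    return seq_len * num_layers * 2 * hidden_dim * bytes_per_val  # 2 = K + V

full = kv_cache_bytes(4096)              # raw token sequence
shortened = kv_cache_bytes(4096 // 16)   # 16x shorter embedding sequence
print(full / shortened)  # 16.0: the cache shrinks in proportion to length
```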
How Does REFRAG Preserve Accuracy?
REFRAG employs a reinforcement learning (RL) policy to supervise compression, identifying the most information-dense chunks and allowing them to bypass compression, feeding raw tokens directly into the decoder. This selective strategy ensures that critical details, such as exact numbers or rare entities, are preserved. Across multiple benchmarks, REFRAG maintained or improved perplexity compared to CEPE while operating at significantly lower latency.
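The selection step can be sketched as a top-k choice over per-chunk importance scores. The scores array below stands in for the learned RL policy's output, which is an assumption for illustration; in REFRAG these estimates come from the trained policy, not a fixed heuristic:

```python
import numpy as np

def select_chunks_to_expand(chunk_scores: np.ndarray, budget: int) -> set:
    """Return indices of the `budget` highest-scoring chunks.

    Selected chunks bypass compression and are fed to the decoder as raw
    tokens; all other chunks remain as single compressed embeddings.
    """
    top = np.argsort(chunk_scores)[::-1][:budget]
    return {int(i) for i in top}

scores = np.array([0.1, 0.9, 0.3, 0.8, 0.2])  # e.g. 5 chunks, expand top 2
expanded = select_chunks_to_expand(scores, budget=2)
print(sorted(expanded))  # [1, 3]: these chunks keep their raw tokens
```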
What Do the Experiments Reveal?
REFRAG was pretrained on 20 billion tokens from the SlimPajama corpus (Books + arXiv) and tested on long-context datasets including Book, Arxiv, PG19, and ProofPile. On RAG benchmarks, multi-turn conversation tasks, and long-document summarization, REFRAG consistently outperformed strong baselines:
- 16× context extension beyond standard LLaMA-2 (4k tokens).
- Approximately 9.3% perplexity improvement over CEPE across four datasets.
- Enhanced accuracy in weak retriever settings, where irrelevant passages dominate, due to the ability to process more passages under the same latency budget.
REFRAG demonstrates that long-context LLMs do not have to be slow or memory-intensive. By compressing retrieved passages into compact embeddings, selectively expanding only the important ones, and rethinking RAG decoding, Meta Superintelligence Labs has enabled the processing of much larger inputs while running dramatically faster. This advancement makes large-context applications – such as analyzing entire reports, handling multi-turn conversations, or scaling enterprise RAG systems – not only feasible but efficient, without compromising accuracy.
Frequently Asked Questions
Why is long context a bottleneck for LLMs?
The attention mechanism in large language models (LLMs) scales quadratically with input length, leading to increased computational and memory costs. This scaling issue slows down inference and inflates the size of the key-value cache, making large-context applications impractical in production systems.
How does REFRAG compress and shorten context?
REFRAG introduces a lightweight encoder that divides retrieved passages into fixed-size chunks and compresses each into a dense chunk embedding. This approach reduces the sequence length by 16 times without altering the LLM architecture, allowing the decoder to handle a shorter sequence of embeddings.
What acceleration does REFRAG achieve?
REFRAG achieves a 16.53× acceleration in time-to-first-token (TTFT) at k=16 and a 30.85× acceleration at k=32, surpassing previous methods like CEPE. Throughput also improves by up to 6.78× compared to LLaMA baselines.
How does REFRAG preserve accuracy?
REFRAG uses a reinforcement learning policy to supervise compression, allowing the most information-dense chunks to bypass compression. This ensures critical details are preserved, maintaining or improving perplexity compared to CEPE while operating at lower latency.
What were the results of REFRAG’s experiments?
REFRAG was pretrained on 20 billion tokens and tested on long-context datasets, consistently outperforming strong baselines. It extended context 16× beyond standard LLaMA-2, improved perplexity by approximately 9.3% over CEPE, and enhanced accuracy in weak retriever settings.