RetroInfer
Scalable long-context LLM decoding that leverages attention sparsity by treating the KV cache as a vector storage system. RetroInfer rethinks the KV cache as vector storage within a GPU–CPU co-execution setup to accelerate long-context LLM inference. It exploits the inherent sparsity of the attention mechanism and introduces an Attention-aWare VEctor index (wave index) that enables efficient and accurate retrieval of critical tokens from the KV cache. Complementing this is the wave buffer, which coordinates KV cache placement and overlaps computation with data transfer across GPU and CPU to sustain high throughput.
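To make the core idea concrete, here is a minimal, hedged sketch of sparsity-driven KV retrieval: keys are grouped under centroids that act as a stand-in vector index, the query retrieves only the top-scoring groups, and exact attention runs over just those "critical" tokens. This is an illustration of the general technique, not RetroInfer's actual wave index or wave buffer; the function name, clustering scheme, and parameters are all hypothetical.

```python
import numpy as np

def sparse_attention_via_retrieval(q, K, V, n_clusters=8, top_c=2):
    """Illustrative sketch (not RetroInfer's implementation):
    approximate full attention by attending only over KV entries whose
    key-group centroids score highest against the query."""
    n, d = K.shape
    # Partition keys into contiguous groups and compute centroids,
    # a crude stand-in for an attention-aware vector index.
    groups = np.array_split(np.arange(n), n_clusters)
    centroids = np.stack([K[idx].mean(axis=0) for idx in groups])
    # Retrieve the top-scoring groups for this query.
    chosen = np.argsort(centroids @ q)[-top_c:]
    idx = np.concatenate([groups[c] for c in chosen])
    # Exact softmax attention over only the retrieved tokens.
    scores = K[idx] @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V[idx]
```

In a real system the index would be built once at prefill and the retrieved KV entries would be streamed from CPU memory to the GPU, overlapping transfer with computation; this sketch only shows the retrieval-then-attend pattern.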