Towards Safer Heuristics With XPlain
Reviving Cloud Gaming Sessions
Input-Dependent Power Usage in GPUs
RetroInfer
Scalable long-context LLM decoding that leverages sparsity by treating the KV cache as a vector storage system. RetroInfer is a novel system that rethinks the KV cache as vector storage within a GPU–CPU co-execution setup to…
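To make the idea concrete, here is a minimal sketch of treating a KV cache as a vector store: cached keys are held outside the attention kernel, and each decode step retrieves only the top-k most similar key/value pairs for the current query. All names (`KVVectorStore`, `append`, `attend`) and the brute-force similarity search are illustrative assumptions, not RetroInfer's actual API or index structure.

```python
import numpy as np

class KVVectorStore:
    """Illustrative sketch (not RetroInfer's implementation): the KV cache
    as a vector store queried per decode step."""

    def __init__(self, d):
        self.keys = np.empty((0, d))
        self.vals = np.empty((0, d))

    def append(self, k, v):
        # Add one key/value pair produced at a decode step.
        self.keys = np.vstack([self.keys, k])
        self.vals = np.vstack([self.vals, v])

    def attend(self, q, topk):
        # Score all cached keys against the query (brute-force here;
        # a real system would use an approximate index on CPU).
        scores = self.keys @ q / np.sqrt(q.size)
        sel = np.argsort(scores)[-topk:]          # retrieve nearest keys
        w = np.exp(scores[sel] - scores[sel].max())
        w /= w.sum()                              # softmax over the subset
        return w @ self.vals[sel]                 # sparse attention output
```

With `topk` equal to the cache size this reduces to dense attention; shrinking `topk` trades exactness for memory traffic, which is the lever a GPU–CPU co-execution design exploits.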
AttentionEngine: A Custom Model Optimization Framework
AttentionEngine accelerates transformer attention variants by generating efficient custom kernels, enabling model designers to easily create new variants with our flexible API.
Practical System Verification
Formal verification is a promising approach to eliminate bugs at compile time, before software ships. Unfortunately, verifying the correctness of system software traditionally requires heroic developer effort. In this project, we aim to enable accessible, faster,…
SeerAttention
SeerAttention is a learning-based method that enables block-level sparse attention for long-context LLMs without relying on predefined static patterns or heuristics. It can be applied in the post-training or fine-tuning stage. The Attention Gate units learn…
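A minimal sketch of block-level sparse attention follows. The gate here is stand-in mean-pooling that scores query-block/key-block pairs and keeps only the top-scoring key blocks per query block; SeerAttention's Attention Gate units are learned, so the pooling, block size, and selection rule below are assumptions for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def block_sparse_attention(q, k, v, block=4, keep=2):
    """Illustrative block-sparse attention: a gate (mean-pooling here,
    learned in SeerAttention) picks `keep` key blocks per query block,
    and attention is computed only over the selected blocks."""
    T, d = q.shape
    nb = T // block
    # Pool queries and keys per block to get coarse block representations.
    qp = q.reshape(nb, block, d).mean(axis=1)
    kp = k.reshape(nb, block, d).mean(axis=1)
    gate = qp @ kp.T / np.sqrt(d)              # (nb, nb) block importance
    out = np.zeros_like(q)
    for i in range(nb):
        sel = np.argsort(gate[i])[-keep:]      # kept key-block indices
        idx = np.concatenate([np.arange(b * block, (b + 1) * block)
                              for b in sel])
        qi = q[i * block:(i + 1) * block]
        s = softmax(qi @ k[idx].T / np.sqrt(d))
        out[i * block:(i + 1) * block] = s @ v[idx]
    return out
```

When `keep` equals the number of blocks the output coincides with dense attention; smaller `keep` skips whole blocks of the score matrix, which is where the speedup comes from.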