KV Cache Compression
• SnapKV: LLM Knows What You are Looking for Before Generation (arXiv:2404.14469)
• Finch: Prompt-guided Key-Value Cache Compression (arXiv:2408.00167)
• Beyond RAG: Task-Aware KV Cache Compression for Comprehensive Knowledge Reasoning (arXiv:2503.04973)
• A Simple and Effective L_2 Norm-Based Strategy for KV Cache Compression (arXiv:2406.11430; its norm heuristic is sketched after the list)
• FastKV: Decoupling of Context Reduction and KV Cache Compression for Prefill-Decoding Acceleration (arXiv:2502.01068)
• ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference (arXiv:2502.00299)
• Efficient Streaming Language Models with Attention Sinks (arXiv:2309.17453; its sink-plus-window eviction policy is sketched after the list)
• Transformers are Multi-State RNNs (arXiv:2401.06104)
• H_2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models (arXiv:2306.14048; its heavy-hitter eviction is sketched after the list)
• Q-Filters: Leveraging QK Geometry for Efficient KV Cache Compression (arXiv:2503.02812)
• ThinK: Thinner Key Cache by Query-Driven Pruning (arXiv:2407.21018)
• LightTransfer: Your Long-Context LLM is Secretly a Hybrid Model with Effortless Adaptation (arXiv:2410.13846)
• DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads (arXiv:2410.10819)
• Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time (arXiv:2305.17118)
• PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling (arXiv:2406.02069)
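
Many of the entries above rank cached tokens with a cheap per-token score and evict the rest. As a concrete illustration, below is a minimal Python sketch of the heuristic behind "A Simple and Effective L_2 Norm-Based Strategy for KV Cache Compression" (arXiv:2406.11430), which reports that keys with a low L_2 norm tend to receive high attention, so a compressed cache keeps the lowest-norm entries. The function name, array shapes, and budget parameter are illustrative assumptions, not the paper's reference implementation.

import numpy as np

def l2_compress(keys, values, budget):
    """Keep the `budget` KV pairs whose key has the smallest L2 norm.

    keys, values: (seq_len, head_dim) arrays for one attention head.
    The paper's observation: low key norm correlates with high attention,
    so low-norm entries are the ones worth keeping.
    """
    norms = np.linalg.norm(keys, axis=-1)    # one score per cached token
    keep = np.argsort(norms)[:budget]        # lowest-norm tokens
    keep = np.sort(keep)                     # restore positional order
    return keys[keep], values[keep]

In a real cache this would presumably run per head (or per KV group) once prefill finishes, before decoding starts.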
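"Efficient Streaming Language Models with Attention Sinks" (arXiv:2309.17453) takes a position-based approach instead: keep a handful of initial "sink" tokens, which soak up attention mass, plus a sliding window of recent tokens. A minimal sketch of that eviction policy, with sink and window sizes chosen for illustration rather than taken from the paper:

NUM_SINKS = 4    # initial tokens retained as attention sinks (illustrative)
WINDOW = 1024    # most recent tokens retained (illustrative)

def streaming_evict(cache):
    """cache: list of per-token (key, value) pairs in positional order."""
    if len(cache) <= NUM_SINKS + WINDOW:
        return cache                  # under budget: nothing to evict
    return cache[:NUM_SINKS] + cache[-WINDOW:]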
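H_2O (arXiv:2306.14048) is score-based like the L_2 strategy, but it scores by accumulated attention: tokens that have received the most attention so far ("heavy hitters") are kept alongside a recent window, and everything else is evicted. The sketch below is a deliberate simplification (one head, a fixed recent window, eviction in a single batch) rather than the paper's exact streaming algorithm; all parameter names are assumptions.

import numpy as np

def h2o_keep_indices(acc_attention, budget, recent):
    """Return indices of cached tokens to keep.

    acc_attention: (seq_len,) attention mass each cached token has
    accumulated over past decoding steps. Assumes budget > recent.
    """
    n = len(acc_attention)
    if n <= budget:
        return np.arange(n)                   # under budget: keep everything
    recent_idx = np.arange(n - recent, n)     # always keep the recent window
    older = np.arange(n - recent)
    # Among older tokens, keep the ones with the largest accumulated score.
    heavy = older[np.argsort(acc_attention[older])[::-1][:budget - recent]]
    return np.sort(np.concatenate([heavy, recent_idx]))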