Papers - Attention
Linear Transformers with Learnable Kernel Functions are Better In-Context Models
• arXiv:2402.10644 • 81 upvotes
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
• arXiv:2305.13245 • 6 upvotes
ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition
• arXiv:2402.15220 • 20 upvotes
Sequence Parallelism: Long Sequence Training from System Perspective
• arXiv:2105.13120 • 6 upvotes
Ring Attention with Blockwise Transformers for Near-Infinite Context
• arXiv:2310.01889 • 13 upvotes
Striped Attention: Faster Ring Attention for Causal Transformers
• arXiv:2311.09431 • 4 upvotes
Longformer: The Long-Document Transformer
• arXiv:2004.05150 • 4 upvotes
DeBERTa: Decoding-enhanced BERT with Disentangled Attention
• arXiv:2006.03654 • 3 upvotes
DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing
• arXiv:2111.09543 • 3 upvotes
Rescoring Sequence-to-Sequence Models for Text Line Recognition with CTC-Prefixes
• arXiv:2110.05909 • 2 upvotes
3D Medical Image Segmentation based on multi-scale MPU-Net
• arXiv:2307.05799 • 2 upvotes
Attention Swin U-Net: Cross-Contextual Attention Mechanism for Skin Lesion Segmentation
• arXiv:2210.16898 • 2 upvotes
CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows
• arXiv:2107.00652 • 2 upvotes
BOAT: Bilateral Local Attention Vision Transformer
• arXiv:2201.13027 • 2 upvotes
MAFormer: A Transformer Network with Multi-scale Attention Fusion for Visual Recognition
• arXiv:2209.01620 • 2 upvotes
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
• arXiv:2103.14030 • 5 upvotes
Using Multi-scale SwinTransformer-HTC with Data augmentation in CoNIC Challenge
• arXiv:2202.13588 • 2 upvotes
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
• arXiv:2211.00593 • 2 upvotes
BurstAttention: An Efficient Distributed Attention Framework for Extremely Long Sequences
• arXiv:2403.09347 • 22 upvotes
Vision Transformer with Quadrangle Attention
• arXiv:2303.15105 • 2 upvotes
Lightweight Image Inpainting by Stripe Window Transformer with Joint Attention to CNN
• arXiv:2301.00553 • 3 upvotes
Rethinking Attention: Exploring Shallow Feed-Forward Neural Networks as an Alternative to Attention Layers in Transformers
• arXiv:2311.10642 • 25 upvotes
Code Completion using Neural Attention and Byte Pair Encoding
• arXiv:2004.06343 • 2 upvotes
Recurrent Drafter for Fast Speculative Decoding in Large Language Models
• arXiv:2403.09919 • 21 upvotes
Vid2Robot: End-to-end Video-conditioned Policy Learning with Cross-Attention Transformers
• arXiv:2403.12943 • 15 upvotes
VSTAR: Generative Temporal Nursing for Longer Dynamic Video Synthesis
• arXiv:2403.13501 • 9 upvotes
Efficient Memory Management for Large Language Model Serving with PagedAttention
• arXiv:2309.06180 • 37 upvotes
Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention
• arXiv:2404.07143 • 111 upvotes
• arXiv:2404.07821 • 13 upvotes
JetMoE: Reaching Llama2 Performance with 0.1M Dollars
• arXiv:2404.07413 • 38 upvotes
Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length
• arXiv:2404.08801 • 66 upvotes
Hydragen: High-Throughput LLM Inference with Shared Prefixes
• arXiv:2402.05099 • 20 upvotes
MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs
• arXiv:2402.15627 • 36 upvotes
MoA: Mixture-of-Attention for Subject-Context Disentanglement in Personalized Image Generation
• arXiv:2404.11565 • 15 upvotes
SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification
• arXiv:2305.09781 • 4 upvotes
GLIGEN: Open-Set Grounded Text-to-Image Generation
• arXiv:2301.07093 • 4 upvotes
FlashSpeech: Efficient Zero-Shot Speech Synthesis
• arXiv:2404.14700 • 32 upvotes
Multi-Head Mixture-of-Experts
• arXiv:2404.15045 • 60 upvotes
Transformers Can Represent n-gram Language Models
• arXiv:2404.14994 • 21 upvotes
BASS: Batched Attention-optimized Speculative Sampling
• arXiv:2404.15778 • 11 upvotes
InstantFamily: Masked Attention for Zero-shot Multi-ID Image Generation
• arXiv:2404.19427 • 74 upvotes
What needs to go right for an induction head? A mechanistic study of in-context learning circuits and their formation
• arXiv:2404.07129 • 3 upvotes
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
• arXiv:2405.21060 • 68 upvotes
VideoFACT: Detecting Video Forgeries Using Attention, Scene Context, and Forensic Traces
• arXiv:2211.15775 • 1 upvote
Reasoning in Large Language Models: A Geometric Perspective
• arXiv:2407.02678 • 1 upvote
Beyond Euclid: An Illustrated Guide to Modern Machine Learning with Geometric, Topological, and Algebraic Structures
• arXiv:2407.09468 • 2 upvotes
• arXiv:2405.15932 • 1 upvote
Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters
• arXiv:2408.04093 • 4 upvotes
Attention Heads of Large Language Models: A Survey
• arXiv:2409.03752 • 92 upvotes
• arXiv:2410.05258 • 180 upvotes
ThunderKittens: Simple, Fast, and Adorable AI Kernels
• arXiv:2410.20399 • 2 upvotes
HAT: Hybrid Attention Transformer for Image Restoration
• arXiv:2309.05239 • 1 upvote
Unraveling the Gradient Descent Dynamics of Transformers
• arXiv:2411.07538 • 2 upvotes
An Evolved Universal Transformer Memory
• arXiv:2410.13166 • 6 upvotes
Byte Latent Transformer: Patches Scale Better Than Tokens
• arXiv:2412.09871 • 108 upvotes
• arXiv:2412.09764 • 5 upvotes