Listen "FlashAttention-2"
Episode Synopsis
FlashAttention-2 builds on FlashAttention to compute attention faster with better GPU resource utilization. It increases parallelism by additionally parallelizing over the sequence-length dimension, and it improves the work partitioning between thread blocks and warps to reduce shared-memory traffic. A key improvement is cutting non-matmul FLOPs, which run at much lower throughput than matrix multiplications on modern GPUs built around dedicated matmul units. Together these changes yield significant speedups over FlashAttention and standard attention, reaching higher throughput and better model FLOPs utilization in end-to-end Transformer training.
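To make the episode's main ideas concrete, here is a minimal PyTorch sketch of the tiled, online-softmax attention forward pass the synopsis describes. It is an illustration under simplifying assumptions, not the official FlashAttention-2 kernel: the function and variable names (flash_attention2_forward_sketch, block_q, block_k) are invented for this example, and the real implementation is a fused CUDA kernel rather than Python loops. The outer loop over query blocks corresponds to the work FlashAttention-2 parallelizes across thread blocks along the sequence length; carrying only running max/sum statistics per row and dividing by the softmax denominator once at the end is what reduces the non-matmul FLOPs.

```python
import torch

def flash_attention2_forward_sketch(q, k, v, block_q=64, block_k=64):
    """Tiled attention: O = softmax(Q K^T / sqrt(d)) V, computed block by block.

    Each query block is independent, which is the sequence-length parallelism
    FlashAttention-2 exploits; in the real kernel the per-row statistics live
    in registers/SRAM, and the output is rescaled only once at the end.
    """
    seq_len, head_dim = q.shape
    scale = head_dim ** -0.5
    out = torch.empty_like(q)

    for qs in range(0, seq_len, block_q):           # parallel over query blocks on a GPU
        q_blk = q[qs:qs + block_q] * scale
        rows = q_blk.shape[0]
        # Running online-softmax statistics for this query block.
        m = torch.full((rows,), float("-inf"), dtype=q.dtype, device=q.device)  # row max so far
        l = torch.zeros(rows, dtype=q.dtype, device=q.device)                   # row sum of exp(s - m)
        acc = torch.zeros_like(q_blk)                                            # unnormalized output

        for ks in range(0, seq_len, block_k):        # sequential over key/value blocks
            s = q_blk @ k[ks:ks + block_k].T         # attention logits for this tile (matmul)
            m_new = torch.maximum(m, s.max(dim=-1).values)
            p = torch.exp(s - m_new[:, None])        # unnormalized tile probabilities
            correction = torch.exp(m - m_new)        # rescale previous statistics once per tile
            l = correction * l + p.sum(dim=-1)
            acc = correction[:, None] * acc + p @ v[ks:ks + block_k]
            m = m_new

        out[qs:qs + block_q] = acc / l[:, None]      # single normalization at the end
    return out
```

A quick sanity check is to compare the sketch against the straightforward reference torch.softmax(q @ k.T * head_dim**-0.5, dim=-1) @ v on small random tensors; the two should agree up to floating-point tolerance.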
More episodes of the podcast Large Language Model (LLM) Talk
Kimi K2 (22/07/2025)
Mixture-of-Recursions (MoR) (18/07/2025)
MeanFlow (10/07/2025)
Mamba (10/07/2025)
LLM Alignment (14/06/2025)
Why We Think (20/05/2025)
Deep Research (12/05/2025)
vLLM (04/05/2025)
Qwen3: Thinking Deeper, Acting Faster (04/05/2025)