GPU Kernel Optimization - Complete Reading Materials

Created: 2026-03-25

Comprehensive reading list for learning GPU kernel optimization, profiling, and AI infrastructure. Organized by topic with direct links.


1. PMPP - Programming Massively Parallel Processors

Primary Textbook

Supplementary Materials


2. CUDA Toolkit & Development Environment

Official Resources

Video Tutorials

GTC (GPU Technology Conference) Sessions


3. GPU Profiling Tools

NVIDIA Nsight Compute

AMD ROCm Profiling (rocprof)

Other Profiling Tools

Profiling Tutorials


4. Triton - Python GPU Programming

Official Documentation

Tutorials (Step-by-Step)

  1. Vector Addition: https://triton-lang.org/main/getting-started/tutorials/01-vector-add.html
  2. Fused Softmax: https://triton-lang.org/main/getting-started/tutorials/02-fused-softmax.html
  3. Matrix Multiplication: https://triton-lang.org/main/getting-started/tutorials/03-matrix-multiplication.html
  4. Low-Memory Dropout: https://triton-lang.org/main/getting-started/tutorials/04-low-memory-dropout.html
  5. Layer Normalization: https://triton-lang.org/main/getting-started/tutorials/05-layer-norm.html
  6. Fused Attention: https://triton-lang.org/main/getting-started/tutorials/06-fused-attention.html
  7. Grouped GEMM: https://triton-lang.org/main/getting-started/tutorials/08-grouped-gemm.html
  8. Persistent Matmul: https://triton-lang.org/main/getting-started/tutorials/09-persistent-matmul.html
  9. Block Scaled Matmul: https://triton-lang.org/main/getting-started/tutorials/10-block-scaled-matmul.html
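
Before starting tutorial 1, its execution model can be previewed without a GPU: each Triton "program" handles one block of elements, with a mask guarding the ragged final block. A pure-NumPy sketch of that blocked pattern (illustrative only, no real Triton calls):

```python
import numpy as np

def vector_add_blocked(x, y, block_size=4):
    """Simulate Triton's blocked vector add: one 'program' per block,
    with a mask guarding out-of-bounds lanes in the last block."""
    n = x.size
    out = np.empty_like(x)
    num_programs = (n + block_size - 1) // block_size  # like triton.cdiv
    for pid in range(num_programs):          # pid = tl.program_id(axis=0)
        offsets = pid * block_size + np.arange(block_size)
        mask = offsets < n                   # mask for tl.load / tl.store
        valid = offsets[mask]
        out[valid] = x[valid] + y[valid]
    return out

x = np.arange(1, 11, dtype=np.float32)
y = np.ones(10, dtype=np.float32)
print(vector_add_blocked(x, y))  # elementwise sum of x and y
```

The real kernel replaces the Python loop with a grid launch; the cdiv/offsets/mask structure carries over verbatim.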

Programming Guide

Download Tutorial Code


5. CuTe & CUTLASS - High Performance CUDA

CUTLASS Repository

CuTe Documentation

Example Code

GTC Talks on CUTLASS


6. FlashAttention - Optimized Attention Kernels

Papers

  1. FlashAttention (NeurIPS 2022): https://arxiv.org/abs/2205.14135

  2. FlashAttention-2 (ICLR 2024): https://tridao.me/publications/flash2/flash2.pdf

  3. FlashAttention-3 (for Hopper H100): https://tridao.me/publications/flash3/flash3.pdf

Implementation

FlashAttention-4 (CuTe DSL)

  • Installation: pip install flash-attn-4
  • PyTorch Integration:
    from flash_attn.cute import flash_attn_func
    out = flash_attn_func(q, k, v, causal=True)
    
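The core idea shared by all three papers is the online softmax: attention is computed block by block over K/V while maintaining a running max, normalizer, and weighted sum, so the full N×N score matrix never materializes. A NumPy sketch of the single-query recurrence (illustrative, not the real tiled kernel):

```python
import numpy as np

def streaming_attention(q, K, V, block=2):
    """Online-softmax attention for one query vector: process K/V in
    blocks, keeping running max m, denominator l, and accumulator acc."""
    m = -np.inf                     # running max of scores seen so far
    l = 0.0                         # running softmax denominator
    acc = np.zeros_like(V[0])
    for i in range(0, K.shape[0], block):
        s = K[i:i+block] @ q                 # scores for this block
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)            # rescale previous state
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ V[i:i+block]
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
q, K, V = rng.normal(size=4), rng.normal(size=(6, 4)), rng.normal(size=(6, 4))
s = K @ q
p = np.exp(s - s.max()); p /= p.sum()        # reference softmax
assert np.allclose(streaming_attention(q, K, V), p @ V)
```

The rescale-by-`exp(m - m_new)` step is exactly why FlashAttention can fuse the two softmax passes into one sweep over HBM.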

7. Kernel Fusion & Optimization

PyTorch torch.compile

XLA (Accelerated Linear Algebra)

Research Papers on Fusion


8. Inference Serving & vLLM

vLLM (High-Throughput LLM Serving)

Key Concepts

  • PagedAttention: Memory-efficient attention for serving
  • Continuous Batching: Dynamic batching for inference
  • Speculative Decoding: Draft-then-verify for faster generation
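
PagedAttention's central data structure is a block table that maps each sequence's logical token positions to fixed-size physical KV-cache blocks drawn from a shared pool, in the style of an OS page table. A toy pure-Python sketch of that bookkeeping (names are illustrative, not vLLM's API):

```python
BLOCK_SIZE = 4  # tokens per KV-cache block (vLLM's default is 16)

class BlockTable:
    """Toy PagedAttention bookkeeping: allocate a new physical block
    only when a sequence's current block fills up."""
    def __init__(self, num_physical_blocks):
        self.free = list(range(num_physical_blocks))  # shared free pool
        self.table = {}   # seq_id -> list of physical block ids
        self.len = {}     # seq_id -> tokens written so far

    def append_token(self, seq_id):
        blocks = self.table.setdefault(seq_id, [])
        n = self.len.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:            # current block full: allocate
            blocks.append(self.free.pop())
        self.len[seq_id] = n + 1

    def slot(self, seq_id, pos):
        """Physical (block_id, offset) for logical token position pos."""
        return self.table[seq_id][pos // BLOCK_SIZE], pos % BLOCK_SIZE

bt = BlockTable(num_physical_blocks=8)
for _ in range(5):
    bt.append_token("seq0")
print(bt.slot("seq0", 4))  # token 4 starts the second block: (block_id, 0)
```

Because blocks are allocated on demand, memory waste is bounded by one partial block per sequence instead of a full preallocated max-length buffer.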

Other Inference Engines


9. NCCL - Collective Communications

Documentation

Key Topics

Examples


10. GPU Architecture Deep Dive

PTX (Parallel Thread Execution)

GPU Architecture Guides

Tensor Cores


11. Quantization & Precision

Mixed Precision

Quantization Libraries


12. ROCm/HIP for AMD GPUs

ROCm Documentation

AMD GPU Programming


13. Online Courses & Training

NVIDIA Deep Learning Institute (DLI)

University Courses

YouTube Channels


14. Communities & Forums

Discussion Forums

Stack Overflow


15. Papers & Research

Must-Read Papers

GPU Programming & Optimization:

  1. FlashAttention (NeurIPS 2022): https://arxiv.org/abs/2205.14135
  2. FlashAttention-2 (ICLR 2024): https://tridao.me/publications/flash2/flash2.pdf
  3. FlashAttention-3: https://tridao.me/publications/flash3/flash3.pdf
  4. PagedAttention (SOSP 2023): https://arxiv.org/abs/2309.06180

Kernel Fusion:

  5. XLA: Optimizing Compiler: https://arxiv.org/abs/2009.00102
  6. Tensor Comprehensions: https://arxiv.org/abs/1802.04730
  7. TVM: Automated Optimization: https://arxiv.org/abs/1802.04799

Matrix Multiplication:

  8. CUTLASS Paper: https://arxiv.org/abs/1902.04615
  9. Strassen's Algorithm GPU: https://arxiv.org/abs/1708.07469

Distributed Training:

  10. Megatron-LM: https://arxiv.org/abs/1909.08053
  11. DeepSpeed: https://arxiv.org/abs/2201.05596


16. Practice Projects & Exercises

Beginner Projects

  1. Vector Addition: CUDA version, Triton version
  2. Matrix Multiplication: Naive, Tiled, cuBLAS comparison
  3. Softmax: Standard vs Fused implementation
  4. Reduction: Sum, Max, Min operations
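
Project 4's reduction pattern is worth prototyping on the CPU before writing CUDA: each block tree-reduces its chunk by halving the active stride, then the per-block partials are combined in a second level. A NumPy sketch (assumes a power-of-two block size, as the classic kernel does):

```python
import numpy as np

def block_reduce_sum(data, block_size=8):
    """Two-level GPU-style sum. Within each 'block', a tree reduction
    halves the stride each step (the loop a kernel separates with
    __syncthreads()); partial sums are then combined across blocks."""
    partials = []
    for start in range(0, data.size, block_size):
        chunk = data[start:start+block_size].astype(np.float64)
        buf = np.zeros(block_size, dtype=np.float64)  # pad ragged tail
        buf[:chunk.size] = chunk
        stride = block_size // 2
        while stride > 0:                  # tree reduction within block
            buf[:stride] += buf[stride:2*stride]
            stride //= 2
        partials.append(buf[0])
    return sum(partials)                   # second level: combine partials

data = np.arange(1, 101, dtype=np.float64)
print(block_reduce_sum(data))  # 5050.0
```

Swapping `+=` and the identity element (0 → -inf or +inf) turns the same skeleton into max or min reductions.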

Intermediate Projects

  1. GEMM Optimized: Achieve >80% of cuBLAS performance
  2. LayerNorm Fusion: Combine mean, std, scale, shift
  3. Attention Kernel: Implement manual attention
  4. FlashAttention Implementation: From paper to code
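
For project 2, the fusion target is to compute mean, variance, normalization, and the affine scale/shift in a single traversal of each row, rather than launching four separate kernels that each re-read the input. A NumPy sketch of the fused math (the arithmetic a one-pass kernel performs, not the kernel itself):

```python
import numpy as np

def fused_layernorm(x, gamma, beta, eps=1e-5):
    """Row-wise LayerNorm with mean, variance, normalize, and affine
    scale/shift expressed as one fused computation per row."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)     # E[(x - mean)^2]
    return (x - mean) / np.sqrt(var + eps) * gamma + beta

rng = np.random.default_rng(1)
x = rng.normal(size=(2, 8)).astype(np.float32)
out = fused_layernorm(x, np.ones(8, np.float32), np.zeros(8, np.float32))
print(out.mean(axis=-1))  # each row's mean is ~0 after normalization
```

In the real kernel, mean and variance come from a single read of the row (e.g. Welford's algorithm or a sum/sum-of-squares pair), which is where the bandwidth savings of fusion come from.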

Advanced Projects

  1. Custom CuTe GEMM: Using CUTLASS templates
  2. Multi-GPU AllReduce: Using NCCL
  3. Quantized GEMM: INT8/FP8 implementation
  4. Speculative Decoding: Draft model implementation
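
Project 4's draft-then-verify loop can be rehearsed with toy models before touching real LLMs. A minimal greedy sketch (real speculative decoding samples and uses an acceptance-probability test; here both models are deterministic next-token functions, which keeps the control flow visible):

```python
def speculative_decode(draft_next, target_next, prompt, n_tokens, k=4):
    """Toy greedy speculative decoding: the cheap draft model proposes k
    tokens; the target accepts the longest agreeing prefix, then emits
    one corrected token of its own."""
    limit = len(prompt) + n_tokens
    out = list(prompt)
    while len(out) < limit:
        proposal, ctx = [], list(out)
        for _ in range(k):                    # draft proposes k tokens
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        for t in proposal:                    # target verifies prefix
            if target_next(out) == t and len(out) < limit:
                out.append(t)
            else:
                break
        else:
            continue                          # whole proposal accepted
        if len(out) < limit:
            out.append(target_next(out))      # one corrected token
    return out

# toy models: target counts up; draft agrees except after multiples of 3
target = lambda ctx: ctx[-1] + 1
draft = lambda ctx: ctx[-1] + 1 if (ctx[-1] % 3) else ctx[-1] + 2
print(speculative_decode(draft, target, [0], n_tokens=6))  # → [0, 1, 2, 3, 4, 5, 6]
```

The payoff is that accepted runs cost one target-model forward pass per verification batch instead of one per token, which is the whole speedup argument.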

Open Source Contributions


17. Cheat Sheets & Quick References

CUDA Cheatsheets

PTX Quick Reference


18. Hardware Specifications

GPU Specs

Architecture Comparison


19. Interview Preparation

Topics to Master

  1. GPU memory hierarchy (HBM, L2, Shared, Registers)
  2. Thread organization (Grid, Block, Warp, Thread)
  3. Memory coalescing patterns
  4. Occupancy calculation
  5. Roofline analysis
  6. Tiling strategies
  7. Bank conflicts in shared memory
  8. Tensor Core usage

Practice Questions

  • Implement tiled matrix multiplication
  • Optimize a given kernel for memory bandwidth
  • Explain FlashAttention algorithm
  • Design a fused kernel for LayerNorm
  • Profile and optimize a PyTorch model
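
The first practice question can be rehearsed in NumPy before writing CUDA: loop over tiles of the K dimension, staging one slab of A and B per step (shared memory in the real kernel) and accumulating partial products:

```python
import numpy as np

def tiled_matmul(A, B, tile=4):
    """Tiled matrix multiply: iterate over tile-wide slabs of the K
    dimension, the structure a CUDA kernel stages through shared memory."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=np.result_type(A, B))
    for k0 in range(0, K, tile):          # one "shared-memory" tile per step
        a_tile = A[:, k0:k0+tile]         # slicing handles the ragged tail
        b_tile = B[k0:k0+tile, :]
        C += a_tile @ b_tile              # accumulate partial products
    return C

rng = np.random.default_rng(2)
A, B = rng.normal(size=(5, 7)), rng.normal(size=(7, 3))
assert np.allclose(tiled_matmul(A, B), A @ B)
```

The CUDA version adds the parts NumPy hides: cooperative loads into `__shared__` arrays, `__syncthreads()` between load and compute, and boundary masks for the ragged tiles.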

20. Tracking Progress

Recommended Reading Order

Week 1-4: Foundation

  1. PMPP Chapters 1-5
  2. CUDA C++ Programming Guide
  3. Install CUDA Toolkit and run samples

Week 5-8: Profiling

  4. Nsight Compute documentation
  5. Profile simple kernels
  6. Roofline analysis paper

Week 9-12: Triton

  7. All Triton tutorials
  8. Implement softmax and matmul in Triton
  9. FlashAttention Triton implementation

Week 13-16: Kernel Fusion

  10. PyTorch torch.compile docs
  11. XLA fusion paper
  12. Implement fused kernels

Week 17-24: CUTLASS/CuTe

  13. CUTLASS documentation
  14. CuTe examples
  15. Implement custom GEMM

Ongoing:

  16. Read one paper per week
  17. Contribute to open source
  18. Build portfolio projects


Notes

  • Last Updated: 2026-03-25
  • Total Resources: 150+ links
  • Estimated Study Time: 500+ hours
  • Difficulty Level: Beginner to Expert

Related Notes


This is a living document. Add new resources as you discover them.