Accelerating torch.compile with Diode
=====================================

Diode can speed up PyTorch's ``torch.compile`` by providing pre-trained models that predict optimal matrix multiplication configurations, eliminating the need for expensive runtime autotuning. This integration gives you the performance benefits of extensive autotuning with minimal compilation time.

Overview
--------

When PyTorch compiles matrix multiplication operations, it typically searches through many kernel configurations to find the best one for your specific hardware and problem size. This process, called autotuning, can take substantial time during compilation.

Diode addresses this with a pre-trained model that predicts the optimal configuration for a given hardware target and problem size. Because the prediction replaces the runtime search, compilation is faster while the selected kernels remain competitive with those found by autotuning.

Quick Start
-----------

Getting started with Diode acceleration takes three steps:

Step 1: Install torch-diode-models
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Install the pre-trained models package:

.. code-block:: bash

    pip install torch-diode

This package contains pre-trained models for popular hardware configurations, including NVIDIA H100 and AMD MI300X GPUs.

Step 2: Import and Auto-Register
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Import the package to automatically register the best model for your hardware:

.. code-block:: python

    import torch_diode_models

This import automatically:

* Detects your hardware configuration
* Selects the most appropriate pre-trained model
* Registers the model with PyTorch's compilation system
* Configures the prediction interface

Step 3: Enable Fast Autotuning
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Configure PyTorch to use Diode's fast autotuning:

.. code-block:: python

    import torch
    from torch._inductor import config

    # Enable fast autotuning with Diode models
    config.max_autotune = True

Complete Example
----------------

Here's a complete example showing how to use Diode with ``torch.compile``:

.. code-block:: python

    import torch
    import torch_diode_models  # Auto-registers the best model for your hardware
    from torch._inductor import config

    # Configure PyTorch to use Diode acceleration
    config.max_autotune_gemm_backends = "DIODE"
    config.max_autotune = True

    # Your existing PyTorch code - no changes needed!
    def matmul_function(a, b):
        return torch.mm(a, b)

    # Compile with torch.compile - now accelerated by Diode
    compiled_fn = torch.compile(matmul_function, mode="max-autotune")

    # Use as normal
    a = torch.randn(1024, 2048, device="cuda", dtype=torch.float16)
    b = torch.randn(2048, 4096, device="cuda", dtype=torch.float16)
    result = compiled_fn(a, b)

Benefits
--------

Performance Improvements
~~~~~~~~~~~~~~~~~~~~~~~~

Diode reduces compilation time while matching the runtime performance of full autotuning:

**Compilation Speed**

* **10x faster compilation**: Eliminates expensive autotuning searches
* **Instant predictions**: Model inference takes microseconds vs. seconds of autotuning
* **Consistent compile times**: No variation based on problem size or hardware load

**Runtime Performance**

* **Max-autotune parity**: Diode's predicted configurations can match the performance of both max-autotune and exhaustive max-autotune searches.

Memory Efficiency
~~~~~~~~~~~~~~~~~

* **Reduced memory overhead**: No need to store multiple kernel variants during compilation
* **Predictable memory usage**: Consistent memory consumption across different problem sizes

Advanced Configuration
----------------------

Hardware Detection
~~~~~~~~~~~~~~~~~~

Diode detects your hardware automatically. You can check what it detected and list the models available to it:

.. code-block:: python

    import torch_diode_models

    # Check detected hardware
    print(f"Detected hardware: {torch_diode_models.get_detected_hardware()}")

    # List available models
    available_models = torch_diode_models.list_available_models()
    print(f"Available models: {available_models}")

Integration with Existing Workflows
-----------------------------------

Training Workflows
~~~~~~~~~~~~~~~~~~

Diode integrates with existing training code without changes to the model or the loop:

.. code-block:: python

    import torch
    import torch.nn as nn
    import torch_diode_models
    from torch._inductor import config

    # Enable Diode acceleration
    config.max_autotune = True

    class MyModel(nn.Module):
        def __init__(self):
            super().__init__()
            self.linear1 = nn.Linear(1024, 2048)
            self.linear2 = nn.Linear(2048, 1024)

        def forward(self, x):
            x = self.linear1(x)
            x = torch.relu(x)
            return self.linear2(x)

    model = MyModel().cuda()
    criterion = nn.MSELoss()  # Placeholder loss; substitute your own
    optimizer = torch.optim.Adam(model.parameters())

    # Compile with Diode acceleration
    compiled_model = torch.compile(model, mode="max-autotune")

    # Training loop - faster compilation on first run.
    # `dataloader` is assumed to yield (inputs, targets) batches.
    for inputs, targets in dataloader:
        inputs, targets = inputs.cuda(), targets.cuda()
        optimizer.zero_grad()
        output = compiled_model(inputs)  # Fast compilation + optimal performance
        loss = criterion(output, targets)
        loss.backward()
        optimizer.step()

Inference Workflows
~~~~~~~~~~~~~~~~~~~

Diode is well suited to production inference, where fast startup is critical:

.. code-block:: python

    import torch
    import torch_diode_models
    from torch._inductor import config

    # Configure for inference
    config.max_autotune_gemm_backends = "DIODE"
    config.fast_autotune = True
    config.triton.cudagraphs = True  # Enable CUDA graphs for even better performance

    # Load your model
    model = torch.jit.load("my_model.pt").cuda()

    # Compile with minimal warmup time
    compiled_model = torch.compile(model, mode="max-autotune")

    # First inference compiles quickly thanks to Diode.
    # `input_tensor` is assumed to be a CUDA tensor shaped for your model.
    with torch.no_grad():
        output = compiled_model(input_tensor)

Supported Operations
--------------------

Diode currently accelerates the following matrix multiplication operations:

**Core Operations**

* ``torch.mm`` - Basic matrix multiplication
* ``torch.addmm`` - Matrix multiplication with bias addition
* ``torch.bmm`` - Batch matrix multiplication

**Data Types**

* ``float16`` (half precision)
* ``bfloat16`` (brain float)
* ``float32`` (single precision)

**Hardware Support**

* NVIDIA GPUs: H100
* AMD GPUs: MI300X

For more information on training custom models, see the :doc:`getting_started` guide.
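
As a conceptual illustration of the trade-off described in the Overview, the following toy sketch contrasts searching by benchmarking with picking a configuration by prediction. It is pure Python and is not Diode's or TorchInductor's implementation: the tiled ``blocked_matmul``, the ``CANDIDATE_BLOCKS`` list, and the hand-written ``predict`` rule are all hypothetical stand-ins (the rule takes the place of a learned model). The point is only that ``autotune`` must run every candidate once, while ``predict`` runs none.

.. code-block:: python

    import time

    # Hypothetical "kernel configurations": tile sizes for a blocked matmul.
    CANDIDATE_BLOCKS = (4, 8, 16, 32)

    def blocked_matmul(a, b, block):
        """Multiply two square lists-of-lists using square tiles of size `block`."""
        n = len(a)
        c = [[0.0] * n for _ in range(n)]
        for i0 in range(0, n, block):
            for k0 in range(0, n, block):
                for j0 in range(0, n, block):
                    for i in range(i0, min(i0 + block, n)):
                        row_a, row_c = a[i], c[i]
                        for k in range(k0, min(k0 + block, n)):
                            a_ik, row_b = row_a[k], b[k]
                            for j in range(j0, min(j0 + block, n)):
                                row_c[j] += a_ik * row_b[j]
        return c

    def autotune(a, b, candidates=CANDIDATE_BLOCKS):
        """Benchmark every candidate; return (fastest_block, benchmarks_run)."""
        timings = {}
        for block in candidates:
            start = time.perf_counter()
            blocked_matmul(a, b, block)
            timings[block] = time.perf_counter() - start
        return min(timings, key=timings.get), len(timings)

    def predict(n, candidates=CANDIDATE_BLOCKS):
        """Stand-in for a learned model: choose a block from the size alone.

        Picks the largest candidate that fits in the matrix and runs no
        benchmarks, so the second element of the result is always 0.
        """
        fitting = [block for block in candidates if block <= n]
        best = max(fitting) if fitting else min(candidates)
        return best, 0

    n = 24
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float((i * j) % 7) for j in range(n)] for i in range(n)]

    tuned_block, tuned_runs = autotune(a, b)
    predicted_block, predicted_runs = predict(n)
    print(f"autotune: block={tuned_block} after {tuned_runs} benchmark runs")
    print(f"predict:  block={predicted_block} after {predicted_runs} benchmark runs")

The two approaches may or may not agree on the chosen block here, since the rule in ``predict`` is deliberately naive. A real predictor would be trained on measured timings so that its choice tracks what benchmarking would have found.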