Accelerating torch.compile with Diode
Diode can speeds up PyTorch’s torch.compile by providing pre-trained models that predict optimal matrix multiplication configurations, eliminating the need for expensive runtime autotuning. This integration allows you to get the performance benefits of extensive autotuning with minimal compilation time.
Overview
When PyTorch compiles matrix multiplication operations, it typically needs to search through many different kernel configurations to find the optimal one for your specific hardware and problem size. This process, called autotuning, can take substantial time during compilation.
Diode solves this by providing a pre-trained model that predicts the optimal configuration for a given hardware and problem size. This saves compilation time by eliminating the need for runtime autotuning, while still providing optimal performance.
Quick Start
Getting started with Diode acceleration is simple and requires only three steps:
Step 1: Install torch-diode-models
Install the pre-trained models package:
pip install torch-diode
This package contains pre-trained models for popular hardware configurations including NVIDIA H100 and AMD MI300X GPUs.
Step 2: Import and Auto-Register
Simply import the package to automatically register the best model for your hardware:
import torch_diode_models
This import automatically:
Detects your hardware configuration
Selects the most appropriate pre-trained model
Registers the model with PyTorch’s compilation system
Configures the prediction interface
Step 3: Enable Fast Autotuning
Configure PyTorch to use Diode’s fast autotuning:
import torch
from torch._inductor import config
# Enable fast autotuning with Diode models
config.max_autotune = True
Complete Example
Here’s a complete example showing how to use Diode with torch.compile:
import torch
import torch_diode_models # Auto-registers the best model for your hardware
from torch._inductor import config
# Configure PyTorch to use Diode acceleration
config.max_autotune_gemm_backends = "DIODE"
config.max_autotune = True
# Your existing PyTorch code - no changes needed!
def matmul_function(a, b):
return torch.mm(a, b)
# Compile with torch.compile - now accelerated by Diode
compiled_fn = torch.compile(matmul_function, mode="max-autotune")
# Use as normal
a = torch.randn(1024, 2048, device="cuda", dtype=torch.float16)
b = torch.randn(2048, 4096, device="cuda", dtype=torch.float16)
result = compiled_fn(a, b)
Benefits
Performance Improvements
Diode provides significant improvements in both compilation time and runtime performance:
Compilation Speed * 10x faster compilation: Eliminates expensive autotuning searches * Instant predictions: Model inference takes microseconds vs. seconds of autotuning * Consistent compile times: No variation based on problem size or hardware load
Runtime Performance * Max Autotune: We can match Max Autotune and Max Autotune EXHAUSTIVE performance. Memory Efficiency ~~~~~~~~~~~~~~~~~
Reduced memory overhead: No need to store multiple kernel variants during compilation
Predictable memory usage: Consistent memory consumption across different problem sizes
Advanced Configuration
Hardware Detection
Diode automatically detects your hardware, but you can also specify it manually:
import torch_diode_models
# Check detected hardware
print(f"Detected hardware: {torch_diode_models.get_detected_hardware()}")
# List available models
available_models = torch_diode_models.list_available_models()
print(f"Available models: {available_models}")
Integration with Existing Workflows
Training Workflows
Diode integrates seamlessly with existing training code:
import torch
import torch.nn as nn
import torch_diode_models
from torch._inductor import config
# Enable Diode acceleration
config.max_autotune = True
class MyModel(nn.Module):
def __init__(self):
super().__init__()
self.linear1 = nn.Linear(1024, 2048)
self.linear2 = nn.Linear(2048, 1024)
def forward(self, x):
x = self.linear1(x)
x = torch.relu(x)
return self.linear2(x)
model = MyModel().cuda()
# Compile with Diode acceleration
compiled_model = torch.compile(model, mode="max-autotune")
# Training loop - faster compilation on first run
optimizer = torch.optim.Adam(model.parameters())
for batch in dataloader:
optimizer.zero_grad()
output = compiled_model(batch) # Fast compilation + optimal performance
loss = criterion(output, targets)
loss.backward()
optimizer.step()
Inference Workflows
Perfect for production inference where fast startup is critical:
import torch
import torch_diode_models
from torch._inductor import config
# Configure for inference
config.max_autotune_gemm_backends = "DIODE"
config.fast_autotune = True
config.triton.cudagraphs = True # Enable CUDA graphs for even better performance
# Load your model
model = torch.jit.load("my_model.pt").cuda()
# Compile with minimal warmup time
compiled_model = torch.compile(model, mode="max-autotune")
# First inference compiles quickly thanks to Diode
with torch.no_grad():
output = compiled_model(input_tensor)
Supported Operations
Diode currently accelerates the following matrix multiplication operations:
Core Operations
* torch.mm - Basic matrix multiplication
* torch.addmm - Matrix multiplication with bias addition
* torch.bmm - Batch matrix multiplication
Data Types
* float16 (half precision)
* bfloat16 (brain float)
* float32 (single precision)
Hardware Support * NVIDIA GPUs: H100 * AMD GPUs: MI300x
For more information on training custom models, see the Getting Started with Training Models with Diode guide.