Accelerating torch.compile with Diode

Diode speeds up PyTorch’s torch.compile by providing pre-trained models that predict optimal matrix multiplication configurations, eliminating the need for expensive runtime autotuning. This integration gives you the performance benefits of extensive autotuning with minimal compilation time.

Overview

When PyTorch compiles matrix multiplication operations, it typically needs to search through many different kernel configurations to find the optimal one for your specific hardware and problem size. This process, called autotuning, can take substantial time during compilation.

Diode replaces this search with a pre-trained model that predicts the best configuration for a given hardware target and problem size. Skipping runtime autotuning cuts compilation time while retaining the performance of a tuned kernel.
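To make the contrast concrete, here is a toy sketch of the two approaches. It is not Diode's or Inductor's implementation: the search space, the cost function, and the prediction rule are all made up for illustration. The point is only that a search pays for one measurement per candidate, while a predictor returns an answer from the problem shape alone.

```python
import itertools

# Hypothetical search space: tile sizes along M, N and K.
# This is NOT Inductor's real configuration space.
BLOCK_SIZES = [16, 32, 64, 128]
CONFIGS = list(itertools.product(BLOCK_SIZES, repeat=3))  # (block_m, block_n, block_k)
TILE_OVERHEAD = 4096  # made-up fixed cost per tile


def ceil_div(a, b):
    return -(-a // b)


def simulated_benchmark(m, n, k, config):
    """Stand-in for compiling and timing one kernel; returns a made-up cost."""
    block_m, block_n, block_k = config
    tiles = ceil_div(m, block_m) * ceil_div(n, block_n) * ceil_div(k, block_k)
    work_per_tile = block_m * block_n * block_k
    return tiles * (work_per_tile + TILE_OVERHEAD)


def autotune(m, n, k):
    """Exhaustive search: one simulated benchmark per candidate config."""
    return min(CONFIGS, key=lambda c: simulated_benchmark(m, n, k, c))


def predict(m, n, k):
    """Stand-in for a learned model: a cheap rule with no benchmarking."""
    def largest_fitting(dim):
        fitting = [b for b in BLOCK_SIZES if b <= dim]
        return fitting[-1] if fitting else BLOCK_SIZES[0]
    return (largest_fitting(m), largest_fitting(n), largest_fitting(k))


searched = autotune(1024, 2048, 4096)   # evaluates all len(CONFIGS) candidates
predicted = predict(1024, 2048, 4096)   # evaluates none
```

On this toy cost function the rule happens to agree with the search for the shape shown. A real predictor has to earn that agreement through training on measured kernel timings.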

Quick Start

Getting started with Diode acceleration is simple and requires only three steps:

Step 1: Install torch-diode-models

Install the pre-trained models package:

pip install torch-diode

This package contains pre-trained models for popular hardware configurations including NVIDIA H100 and AMD MI300X GPUs.

Step 2: Import and Auto-Register

Simply import the package to automatically register the best model for your hardware:

import torch_diode_models

This import automatically:

Detects your hardware configuration
Selects the most appropriate pre-trained model
Registers the model with PyTorch’s compilation system
Configures the prediction interface

Step 3: Enable Fast Autotuning

Configure PyTorch to use Diode’s fast autotuning:

import torch
from torch._inductor import config

# Enable fast autotuning with Diode models
config.max_autotune = True

Complete Example

Here’s a complete example showing how to use Diode with torch.compile:

import torch
import torch_diode_models  # Auto-registers the best model for your hardware
from torch._inductor import config

# Configure PyTorch to use Diode acceleration
config.max_autotune_gemm_backends = "DIODE"
config.max_autotune = True

# Your existing PyTorch code - no changes needed!
def matmul_function(a, b):
    return torch.mm(a, b)

# Compile with torch.compile - now accelerated by Diode
compiled_fn = torch.compile(matmul_function, mode="max-autotune")

# Use as normal
a = torch.randn(1024, 2048, device="cuda", dtype=torch.float16)
b = torch.randn(2048, 4096, device="cuda", dtype=torch.float16)

result = compiled_fn(a, b)
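To check the compile-time effect on your own workload, it helps to time the first call (which includes compilation) separately from later calls. The helper below is a minimal sketch using only the standard library. The name `time_first_and_steady` is illustrative and not part of any package.

```python
import time


def time_first_and_steady(fn, *args, repeats=10):
    """Return (seconds for the first call, mean seconds for later calls)."""
    start = time.perf_counter()
    fn(*args)  # for a compiled function, this call includes compilation
    first_call = time.perf_counter() - start

    start = time.perf_counter()
    for _ in range(repeats):
        fn(*args)
    steady = (time.perf_counter() - start) / repeats
    return first_call, steady


# Usage with the example above (requires a GPU, so it is left commented out).
# CUDA kernels launch asynchronously, so synchronize before the timer stops:
#
# def run(a, b):
#     out = compiled_fn(a, b)
#     torch.cuda.synchronize()
#     return out
#
# first, steady = time_first_and_steady(run, a, b)
```

Comparing `first` with and without the Diode import gives a direct measure of the compilation time saved, and comparing `steady` shows whether the predicted configuration keeps up with the autotuned one.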

Benefits

Performance Improvements

Diode provides significant improvements in both compilation time and runtime performance:

Compilation Speed

10x faster compilation: Eliminates expensive autotuning searches
Instant predictions: Model inference takes microseconds vs. seconds of autotuning
Consistent compile times: No variation based on problem size or hardware load

Runtime Performance

Max Autotune: We can match Max Autotune and Max Autotune EXHAUSTIVE performance.

Memory Efficiency

Reduced memory overhead: No need to store multiple kernel variants during compilation
Predictable memory usage: Consistent memory consumption across different problem sizes

Advanced Configuration

Hardware Detection

Diode automatically detects your hardware. You can inspect what it detected and which models are available:

import torch_diode_models

# Check detected hardware
print(f"Detected hardware: {torch_diode_models.get_detected_hardware()}")

# List available models
available_models = torch_diode_models.list_available_models()
print(f"Available models: {available_models}")
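If the same script also runs on machines where the models package may be missing, it can check for the package before importing it. The sketch below uses only `importlib` from the standard library. The helper name `diode_models_installed` is illustrative, and falling back to regular autotuning is an assumption about what you would want in that case.

```python
import importlib.util


def diode_models_installed(module_name="torch_diode_models"):
    """Return True if the module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


if diode_models_installed():
    # Safe to `import torch_diode_models` and query the detected hardware here.
    status = "diode models available"
else:
    # Assumed fallback: continue with PyTorch's regular autotuning.
    status = "diode models not installed"
```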

Integration with Existing Workflows

Training Workflows

Diode integrates seamlessly with existing training code:

import torch
import torch.nn as nn
import torch_diode_models
from torch._inductor import config

# Enable Diode acceleration
config.max_autotune = True

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear1 = nn.Linear(1024, 2048)
        self.linear2 = nn.Linear(2048, 1024)

    def forward(self, x):
        x = self.linear1(x)
        x = torch.relu(x)
        return self.linear2(x)

model = MyModel().cuda()

# Compile with Diode acceleration
compiled_model = torch.compile(model, mode="max-autotune")

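# Training loop - faster compilation on first run.
# `dataloader` is assumed to yield (inputs, targets) batches.
criterion = nn.MSELoss()  # placeholder loss; substitute your own
optimizer = torch.optim.Adam(model.parameters())
for inputs, targets in dataloader:
    inputs, targets = inputs.cuda(), targets.cuda()
    optimizer.zero_grad()
    output = compiled_model(inputs)  # Fast compilation + optimal performance
    loss = criterion(output, targets)
    loss.backward()
    optimizer.step()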

Inference Workflows

Diode is well suited to production inference, where fast startup is critical:

import torch
import torch_diode_models
from torch._inductor import config

# Configure for inference
config.max_autotune_gemm_backends = "DIODE"
config.max_autotune = True
config.triton.cudagraphs = True  # Enable CUDA graphs for even better performance

# Load your model
model = torch.jit.load("my_model.pt").cuda()

# Compile with minimal warmup time
compiled_model = torch.compile(model, mode="max-autotune")

# First inference compiles quickly thanks to Diode
with torch.no_grad():
    output = compiled_model(input_tensor)

Supported Operations

Diode currently accelerates the following matrix multiplication operations:

Core Operations

torch.mm - Basic matrix multiplication
torch.addmm - Matrix multiplication with bias addition
torch.bmm - Batch matrix multiplication

Data Types

float16 (half precision)
bfloat16 (brain float)
float32 (single precision)

Hardware Support

NVIDIA GPUs: H100
AMD GPUs: MI300X
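The support lists above can be turned into a quick pre-flight check. The helper below is a sketch. The sets mirror the lists in this section, while the function name `is_covered` and the substring match on the GPU name are illustrative choices, not part of the package.

```python
# Sets copied from the support lists in this section.
SUPPORTED_OPS = {"mm", "addmm", "bmm"}
SUPPORTED_DTYPES = {"float16", "bfloat16", "float32"}
SUPPORTED_GPUS = {"H100", "MI300X"}


def is_covered(op, dtype, gpu_name):
    """Return True if the op, dtype and GPU all appear in the support lists."""
    gpu_ok = any(gpu.lower() in gpu_name.lower() for gpu in SUPPORTED_GPUS)
    return op in SUPPORTED_OPS and dtype in SUPPORTED_DTYPES and gpu_ok


covered = is_covered("mm", "float16", "NVIDIA H100 80GB HBM3")
```

With PyTorch available, the arguments could come from `str(tensor.dtype).replace("torch.", "")` and `torch.cuda.get_device_name()`.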

For more information on training custom models, see the Getting Started with Training Models with Diode guide.