Accelerating torch.compile with Diode

Diode speeds up PyTorch’s torch.compile by providing pre-trained models that predict optimal matrix multiplication configurations, eliminating the need for expensive runtime autotuning. With this integration you get the performance benefits of extensive autotuning with minimal compilation time.

Overview

When PyTorch compiles matrix multiplication operations, it typically needs to search through many different kernel configurations to find the optimal one for your specific hardware and problem size. This process, called autotuning, can take substantial time during compilation.
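To make the cost concrete, the sketch below mimics an autotuning search in plain Python. The candidate tile sizes and the `simulated_kernel_time` cost function are made up for illustration and are not Inductor's real search space or benchmarking code. A real autotuner compiles and runs each candidate kernel on the GPU, which is why the search is slow.

```python
import itertools

# Hypothetical search space: tile sizes along the M, N and K dimensions.
BLOCK_M = (32, 64, 128)
BLOCK_N = (32, 64, 128)
BLOCK_K = (16, 32, 64)


def ceil_div(a, b):
    """Integer division rounding up."""
    return (a + b - 1) // b


def simulated_kernel_time(m, n, k, config):
    """Stand-in for compiling and benchmarking one kernel configuration."""
    bm, bn, bk = config
    tiles = ceil_div(m, bm) * ceil_div(n, bn) * ceil_div(k, bk)
    # Work inside each tile plus a made-up fixed overhead per tile.
    return tiles * (bm * bn * bk + 50_000)


def autotune(m, n, k):
    """Evaluate every candidate and return (best_config, candidates_tried)."""
    candidates = list(itertools.product(BLOCK_M, BLOCK_N, BLOCK_K))
    best = min(candidates, key=lambda cfg: simulated_kernel_time(m, n, k, cfg))
    return best, len(candidates)


best_config, tried = autotune(1024, 2048, 4096)
print(f"best config: {best_config}, candidates benchmarked: {tried}")
```

Even this small space requires one benchmark per candidate, and the search repeats for every distinct problem size.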

Diode solves this by providing a pre-trained model that predicts the optimal configuration for a given hardware target and problem size. Because nothing has to be benchmarked at compile time, compilation is faster while the selected kernels still deliver optimal performance.
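As a rough illustration of prediction replacing search, the sketch below maps problem features straight to a configuration. `predict_config` is a hand-written heuristic standing in for Diode's trained model, and `MatmulProblem` and the configuration keys are hypothetical names, so treat it as a sketch of the idea rather than Diode's interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MatmulProblem:
    m: int
    n: int
    k: int
    dtype: str
    hardware: str  # a real model would also condition on the hardware


def predict_config(problem):
    """Hypothetical predictor: choose tile sizes from problem features alone."""
    # In this toy heuristic, 16-bit element types leave room for larger tiles.
    max_tile = 128 if problem.dtype in ("float16", "bfloat16") else 64
    return {
        "BLOCK_M": min(max_tile, problem.m),
        "BLOCK_N": min(max_tile, problem.n),
        "BLOCK_K": min(64, problem.k),
    }


problem = MatmulProblem(m=1024, n=2048, k=4096, dtype="float16", hardware="H100")
config = predict_config(problem)
print(config)
```

No candidate kernel is compiled or timed here: the cost of choosing a configuration is one function call.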

Quick Start

Getting started with Diode acceleration is simple and requires only three steps:

Step 1: Install torch-diode-models

Install the pre-trained models package:

pip install torch-diode

This package contains pre-trained models for popular hardware configurations including NVIDIA H100 and AMD MI300X GPUs.

Step 2: Import and Auto-Register

Simply import the package to automatically register the best model for your hardware:

import torch_diode_models

This import automatically:

  • Detects your hardware configuration

  • Selects the most appropriate pre-trained model

  • Registers the model with PyTorch’s compilation system

  • Configures the prediction interface
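Because registration happens as a side effect of the import, code that must also run where the package is not installed can guard the import. The following is a minimal sketch using only the standard library; `diode_models_installed` is a hypothetical helper name, not part of Diode.

```python
import importlib.util


def diode_models_installed(package="torch_diode_models"):
    """Return True if the package can be found in this environment."""
    return importlib.util.find_spec(package) is not None


if diode_models_installed():
    import torch_diode_models  # noqa: F401  (imported for its side effect)
    print("torch_diode_models imported")
else:
    print("torch_diode_models not found; continuing without it")
```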

Step 3: Enable Fast Autotuning

Configure PyTorch to use Diode’s fast autotuning:

import torch
from torch._inductor import config

# Enable fast autotuning with Diode models
config.max_autotune = True

Complete Example

Here’s a complete example showing how to use Diode with torch.compile:

import torch
import torch_diode_models  # Auto-registers the best model for your hardware
from torch._inductor import config

# Configure PyTorch to use Diode acceleration
config.max_autotune_gemm_backends = "DIODE"
config.max_autotune = True

# Your existing PyTorch code - no changes needed!
def matmul_function(a, b):
    return torch.mm(a, b)

# Compile with torch.compile - now accelerated by Diode
compiled_fn = torch.compile(matmul_function, mode="max-autotune")

# Use as normal
a = torch.randn(1024, 2048, device="cuda", dtype=torch.float16)
b = torch.randn(2048, 4096, device="cuda", dtype=torch.float16)

result = compiled_fn(a, b)
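To see how much of the first call is compilation, you can compare it with later calls of the same function. The helper below is a minimal sketch: `time_calls` is a hypothetical name, not part of Diode or PyTorch, and it is demonstrated on a plain Python function so that it runs anywhere.

```python
import time


def time_calls(fn, *args, repeats=3):
    """Return (first_call_seconds, mean_later_call_seconds) for fn(*args)."""
    start = time.perf_counter()
    fn(*args)
    first = time.perf_counter() - start

    later = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        later.append(time.perf_counter() - start)
    return first, sum(later) / len(later)


# Demonstration on an ordinary callable.
first, steady = time_calls(sum, range(100_000))
print(f"first call: {first:.6f}s, later calls: {steady:.6f}s")
```

To time the example above, wrap the call so the GPU finishes before the clock stops, for instance `time_calls(lambda: (compiled_fn(a, b), torch.cuda.synchronize()))`, since CUDA kernels launch asynchronously.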

Benefits

Performance Improvements

Diode provides significant improvements in both compilation time and runtime performance:

Compilation Speed

  • 10x faster compilation: Eliminates expensive autotuning searches

  • Instant predictions: Model inference takes microseconds vs. seconds of autotuning

  • Consistent compile times: No variation based on problem size or hardware load

Runtime Performance

  • Max Autotune: We can match Max Autotune and Max Autotune EXHAUSTIVE performance.

Memory Efficiency

  • Reduced memory overhead: No need to store multiple kernel variants during compilation

  • Predictable memory usage: Consistent memory consumption across different problem sizes

Advanced Configuration

Hardware Detection

Diode automatically detects your hardware, and you can also specify it manually. The snippet below checks which hardware was detected and lists the available models:

import torch_diode_models

# Check detected hardware
print(f"Detected hardware: {torch_diode_models.get_detected_hardware()}")

# List available models
available_models = torch_diode_models.list_available_models()
print(f"Available models: {available_models}")
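If the detected hardware looks wrong, it can help to compare it with what PyTorch itself reports. The sketch below uses only standard PyTorch calls (`torch.cuda.is_available`, `torch.cuda.get_device_name`, `torch.version.hip`); `describe_gpu` is a hypothetical helper name, and the function returns None when PyTorch or a GPU is unavailable.

```python
def describe_gpu():
    """Return a short description of GPU 0 as PyTorch sees it, or None."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    name = torch.cuda.get_device_name(0)
    # torch.version.hip is set on ROCm builds and None on CUDA builds.
    backend = "ROCm" if torch.version.hip else "CUDA"
    return f"{name} ({backend})"


print(f"PyTorch reports: {describe_gpu()}")
```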

Integration with Existing Workflows

Training Workflows

Diode integrates seamlessly with existing training code:

import torch
import torch.nn as nn
import torch_diode_models
from torch._inductor import config

# Enable Diode acceleration
config.max_autotune = True

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear1 = nn.Linear(1024, 2048)
        self.linear2 = nn.Linear(2048, 1024)

    def forward(self, x):
        x = self.linear1(x)
        x = torch.relu(x)
        return self.linear2(x)

model = MyModel().cuda()

# Compile with Diode acceleration
compiled_model = torch.compile(model, mode="max-autotune")

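# Training loop - faster compilation on first run
# Assumes `dataloader` yields (batch, targets) pairs already on the GPU
criterion = nn.MSELoss()  # example loss; substitute your own
optimizer = torch.optim.Adam(model.parameters())
for batch, targets in dataloader:
    optimizer.zero_grad()
    output = compiled_model(batch)  # Fast compilation + optimal performance
    loss = criterion(output, targets)
    loss.backward()
    optimizer.step()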

Inference Workflows

Diode is well suited to production inference, where fast startup is critical:

import torch
import torch_diode_models
from torch._inductor import config

# Configure for inference
config.max_autotune_gemm_backends = "DIODE"
config.fast_autotune = True
config.triton.cudagraphs = True  # Enable CUDA graphs for even better performance

# Load your model
model = torch.jit.load("my_model.pt").cuda()

# Compile with minimal warmup time
compiled_model = torch.compile(model, mode="max-autotune")

# First inference compiles quickly thanks to Diode
# (`input_tensor` is a placeholder for your own input batch on the GPU)
with torch.no_grad():
    output = compiled_model(input_tensor)

Supported Operations

Diode currently accelerates the following matrix multiplication operations:

Core Operations

  • torch.mm - Basic matrix multiplication

  • torch.addmm - Matrix multiplication with bias addition

  • torch.bmm - Batch matrix multiplication

Data Types

  • float16 (half precision)

  • bfloat16 (brain float)

  • float32 (single precision)

Hardware Support

  • NVIDIA GPUs: H100

  • AMD GPUs: MI300X
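The lists above can be turned into a simple pre-flight check. This is a sketch only: `is_covered` and the string names are hypothetical and do not query Diode. They just encode the operations and data types listed in this section.

```python
# Operations and data types as listed in this section (not read from Diode).
SUPPORTED_OPS = frozenset({"mm", "addmm", "bmm"})
SUPPORTED_DTYPES = frozenset({"float16", "bfloat16", "float32"})


def is_covered(op, dtype):
    """Return True if the (operation, dtype) pair appears in the lists above."""
    return op in SUPPORTED_OPS and dtype in SUPPORTED_DTYPES


for op, dtype in [("mm", "float16"), ("bmm", "float64"), ("conv2d", "float32")]:
    status = "covered" if is_covered(op, dtype) else "not covered"
    print(f"{op} / {dtype}: {status}")
```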

For more information on training custom models, see the Getting Started with Training Models with Diode guide.