Getting Started with Training Models with Diode
===============================================

This guide walks through building a machine learning model from scratch with the Diode toolkit. You will generate a dataset of matrix multiplication performance data, then train a model that predicts optimal configurations.

Overview
--------

The Diode workflow involves four main steps:

1. **Data Collection**: Generate matrix multiplication performance data using PyTorch's autotuning capabilities
2. **Model Training**: Train a deep learning model on the collected data
3. **Validation Dataset Creation**: Create a separate validation dataset from predefined operation shapes
4. **Model Validation**: Evaluate the trained model's performance on the validation dataset

Prerequisites
-------------

Before starting, ensure you have:

* Access to your target hardware
* A PyTorch nightly build
* The Diode toolkit

Step 1: Data Collection
-----------------------

The first step is to generate a training dataset by collecting matrix multiplication performance data. Diode uses PyTorch's feedback saver interface to automatically collect timing information for different matrix multiplication configurations.

Setting Up the Data Collector
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``MatmulDatasetCollector`` class provides flexible data collection capabilities:

.. code-block:: python

    from torch_diode.collection.matmul_dataset_collector import MatmulDatasetCollector, CollectionMode

    # Initialize the collector with log-normal distribution mode
    collector = MatmulDatasetCollector(
        hardware_name="your_gpu_name",
        mode=CollectionMode.LOG_NORMAL,
        operations=["mm", "addmm", "bmm"],
        num_shapes=1000,
        seed=50,
    )

Collection Modes
~~~~~~~~~~~~~~~~

Diode supports three collection modes:

1. **LOG_NORMAL**: Uses log-normal distributions to generate realistic matrix sizes based on production workloads
2. **RANDOM**: Generates uniformly random matrix sizes within specified bounds
3. **OPERATION_SHAPE_SET**: Uses predefined shapes from a configuration file
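
To make the difference between the first two modes concrete, here is a minimal sketch that uses only Python's standard library. It is not Diode's implementation: the function names, the default mean and standard deviation, and the size bounds are all hypothetical and chosen only for illustration:

.. code-block:: python

    import random
    import statistics
    from typing import List, Tuple

    Shape = Tuple[int, int, int]  # (M, K, N)


    def sample_log_normal_shapes(
        num_shapes: int, mean: float = 6.0, std: float = 1.0, seed: int = 50
    ) -> List[Shape]:
        """Draw M, K, N independently from a log-normal distribution."""
        rng = random.Random(seed)

        def dim() -> int:
            # lognormvariate(mu, sigma): the natural log of the result is normal.
            return max(1, round(rng.lognormvariate(mean, std)))

        return [(dim(), dim(), dim()) for _ in range(num_shapes)]


    def sample_uniform_shapes(
        num_shapes: int, low: int = 1, high: int = 8192, seed: int = 50
    ) -> List[Shape]:
        """Draw M, K, N independently and uniformly from [low, high]."""
        rng = random.Random(seed)
        return [
            (rng.randint(low, high), rng.randint(low, high), rng.randint(low, high))
            for _ in range(num_shapes)
        ]


    log_normal_shapes = sample_log_normal_shapes(1000)
    uniform_shapes = sample_uniform_shapes(1000)
    log_normal_median = statistics.median(m for m, _, _ in log_normal_shapes)
    uniform_median = statistics.median(m for m, _, _ in uniform_shapes)
    print(f"median M: log-normal={log_normal_median}, uniform={uniform_median}")

With these illustrative parameters, the log-normal sampler clusters most dimensions around a few hundred and produces an occasional very large one, while the uniform sampler spreads dimensions evenly across the whole range.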

Running Data Collection
~~~~~~~~~~~~~~~~~~~~~~~

Use the ``matmul_toolkit.py`` script to collect training data:

.. code-block:: bash

    python matmul_toolkit.py \
        --format msgpack \
        --seed 50 \
        collect \
        --output train_dataset.msgpack \
        --num-shapes 1000 \
        --log-normal \
        --search-space EXHAUSTIVE \
        --search-mode max-autotune \
        --chunk-size 5

Key parameters:

* ``--format msgpack``: Use MessagePack format for efficient serialization
* ``--seed 50``: Set random seed for reproducibility
* ``--num-shapes 1000``: Generate 1000 different matrix shapes
* ``--log-normal``: Use log-normal distribution for realistic sizes
* ``--search-space EXHAUSTIVE``: Use exhaustive search for optimal configurations
* ``--search-mode max-autotune``: Use PyTorch's max-autotune mode
* ``--chunk-size 5``: Write data every 5 operations to prevent data loss if collection is interrupted
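
The idea behind a chunk size can be illustrated with a small standalone sketch. This is not Diode's writer: ``collect_in_chunks``, the record fields, and the JSON Lines output are hypothetical stand-ins used only to show periodic flushing:

.. code-block:: python

    import json
    import os
    import tempfile
    from typing import Dict, Iterable, List


    def collect_in_chunks(
        records: Iterable[Dict], path: str, chunk_size: int = 5
    ) -> int:
        """Append records to ``path`` every ``chunk_size`` items.

        Returns the number of times data was written to disk.
        """
        buffer: List[Dict] = []
        flushes = 0

        def flush() -> None:
            nonlocal flushes
            with open(path, "a", encoding="utf-8") as f:
                for record in buffer:
                    f.write(json.dumps(record) + "\n")
            buffer.clear()
            flushes += 1

        for record in records:
            buffer.append(record)
            if len(buffer) >= chunk_size:
                flush()
        if buffer:  # write the final, partially filled chunk
            flush()
        return flushes


    fake_records = [{"op": "mm", "index": i} for i in range(12)]
    output_path = os.path.join(tempfile.mkdtemp(), "demo_dataset.jsonl")
    num_flushes = collect_in_chunks(fake_records, output_path, chunk_size=5)
    print(num_flushes)  # 3 writes: two full chunks and one partial chunk

In this sketch, an interruption loses at most ``chunk_size - 1`` buffered records. A smaller chunk size therefore trades more frequent disk writes for less lost work.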

Understanding the Collection Process
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The data collection process works by:

1. Generating matrix shapes based on the selected mode
2. Creating random tensors with the specified dimensions and data types
3. Compiling matrix multiplication operations with PyTorch's autotuning
4. Capturing timing data for different Triton GEMM configurations through the feedback saver interface
5. Storing the results in a structured dataset format
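
The loop of "generate inputs, run each candidate configuration, record its timing" can be sketched without a GPU. In Diode the candidates are Triton GEMM configurations timed by PyTorch's autotuner. In the sketch below, a pure-Python tiled multiply with a hypothetical ``block_k`` parameter stands in for them so that the example runs anywhere, and the record field names are also hypothetical:

.. code-block:: python

    import random
    import time
    from typing import Dict, List

    Matrix = List[List[float]]


    def blocked_matmul(a: Matrix, b: Matrix, block: int) -> Matrix:
        """Multiply a (m x k) by b (k x n), walking k in tiles of ``block``."""
        m, k, n = len(a), len(b), len(b[0])
        out = [[0.0] * n for _ in range(m)]
        for k0 in range(0, k, block):
            for i in range(m):
                row_a, row_out = a[i], out[i]
                for kk in range(k0, min(k0 + block, k)):
                    a_ik, row_b = row_a[kk], b[kk]
                    for j in range(n):
                        row_out[j] += a_ik * row_b[j]
        return out


    def collect_timings(
        m: int, k: int, n: int, blocks: List[int], seed: int = 50
    ) -> List[Dict]:
        rng = random.Random(seed)
        # Create random inputs with the requested dimensions.
        a = [[rng.random() for _ in range(k)] for _ in range(m)]
        b = [[rng.random() for _ in range(n)] for _ in range(k)]
        records = []
        for block in blocks:
            # Run one candidate configuration and time it.
            start = time.perf_counter()
            blocked_matmul(a, b, block)
            elapsed = time.perf_counter() - start
            # Store the result as a structured record.
            records.append(
                {
                    "op": "mm",
                    "shape": [m, k, n],
                    "config": {"block_k": block},
                    "time_s": elapsed,
                }
            )
        return records


    records = collect_timings(32, 48, 16, blocks=[4, 16, 48])
    best = min(records, key=lambda r: r["time_s"])
    print(best["config"])

Each shape yields one record per candidate configuration, not just the winner. Keeping every timing gives a model both fast and slow examples to learn from.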

Step 2: Model Training
----------------------

Once you have collected training data, train a deep learning model to predict optimal GEMM configurations:

.. code-block:: bash

    python matmul_toolkit.py \
        --seed 50 \
        train \
        --data-dir ./data \
        --model matmul_model.pt \
        --model-type deep \
        --batch-size 64 \
        --num-epochs 1000 \
        --learning-rate 0.001 \
        --log-dir ./logs

Training parameters:

* ``--model-type deep``: Use a deep neural network architecture
* ``--batch-size 64``: Process 64 samples per batch
* ``--num-epochs 1000``: Train for 1000 epochs
* ``--learning-rate 0.001``: Set the learning rate
* ``--log-dir``: Directory to save training logs and metrics

The model learns to predict optimal Triton GEMM configurations based on matrix dimensions, data types, and hardware characteristics.
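
One plausible way a timing model turns into a configuration choice is to score every candidate configuration for a given shape and keep the one with the lowest predicted time. The sketch below illustrates that ranking step only. ``toy_predict_log_time`` is a made-up stand-in for a trained model, and the ``block_m`` and ``block_n`` keys are hypothetical, not Diode's feature names:

.. code-block:: python

    import math
    from typing import Callable, Dict, List, Tuple

    Shape = Tuple[int, int, int]  # (M, K, N)
    Config = Dict[str, int]  # hypothetical keys: block_m, block_n


    def toy_predict_log_time(shape: Shape, config: Config) -> float:
        """Stand-in for a trained model; NOT a real performance model."""
        m, _, n = shape
        tiles = math.ceil(m / config["block_m"]) * math.ceil(n / config["block_n"])
        padded = tiles * config["block_m"] * config["block_n"]
        wasted = padded - m * n  # work spent on padding
        return math.log(1 + tiles + wasted)


    def pick_best_config(
        shape: Shape,
        candidates: List[Config],
        predict_log_time: Callable[[Shape, Config], float],
    ) -> Config:
        """Return the candidate with the lowest predicted (log) time."""
        return min(candidates, key=lambda config: predict_log_time(shape, config))


    candidates = [
        {"block_m": 32, "block_n": 32},
        {"block_m": 64, "block_n": 64},
        {"block_m": 128, "block_n": 128},
        {"block_m": 256, "block_n": 256},
    ]
    best = pick_best_config((128, 64, 128), candidates, toy_predict_log_time)
    print(best)  # {'block_m': 128, 'block_n': 128}

Because only the ordering of predictions matters for this choice, a model can be useful for selection even when its absolute timing estimates are imperfect.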

Model Architecture
~~~~~~~~~~~~~~~~~~

Diode provides two simple neural network architectures for timing prediction. These are not meant to be state-of-the-art models, but rather serve as a starting point for further experimentation and development:

**Standard Model (MatmulTimingModel)**

The standard model uses a feedforward neural network with the following architecture:

.. code-block:: python

    from typing import List

    from torch import nn


    class MatmulTimingModel(nn.Module):
        def __init__(
            self,
            problem_feature_dim: int,
            config_feature_dim: int,
            hidden_dims: List[int] = [256, 512, 256, 128, 64],
            dropout_rate: float = 0.2,
        ):
            ...  # layer construction omitted in this excerpt

Architecture components:

* **Input Layer**: Concatenates problem features (matrix dimensions, data types) and configuration features (Triton GEMM parameters)
* **Hidden Layers**: Multiple fully connected layers with ReLU activation, batch normalization, and dropout
* **Output Layer**: Single neuron predicting log execution time
* **Regularization**: Dropout and batch normalization to prevent overfitting
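
Predicting the logarithm of the execution time means that a fixed error in the model's output corresponds to a fixed *relative* error in time, whether the kernel takes microseconds or milliseconds. The short sketch below demonstrates this, assuming the natural logarithm (the base Diode uses is not specified here) and using made-up timings:

.. code-block:: python

    import math


    def to_log_target(time_s: float) -> float:
        """Convert a measured time in seconds to a training target."""
        return math.log(time_s)


    def from_log_prediction(log_time: float) -> float:
        """Convert a model output back to seconds."""
        return math.exp(log_time)


    fast_true, slow_true = 20e-6, 5e-3  # made-up kernel timings in seconds
    log_error = 0.1  # the same prediction error in log space for both

    fast_pred = from_log_prediction(to_log_target(fast_true) + log_error)
    slow_pred = from_log_prediction(to_log_target(slow_true) + log_error)

    fast_ratio = fast_pred / fast_true
    slow_ratio = slow_pred / slow_true
    print(round(fast_ratio, 3), round(slow_ratio, 3))  # 1.105 1.105

Both predictions are off by the same factor, ``exp(0.1)``, so a loss computed in log space weights small and large matrix multiplications comparably.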

**Deep Model (DeepMatmulTimingModel)**

The deep model uses residual connections for training deeper networks:

.. code-block:: python

    from torch import nn


    class DeepMatmulTimingModel(nn.Module):
        def __init__(
            self,
            problem_feature_dim: int,
            config_feature_dim: int,
            hidden_dim: int = 128,
            num_layers: int = 10,
            dropout_rate: float = 0.2,
        ):
            ...  # layer construction omitted in this excerpt

Key features:

* **Residual Blocks**: Each block contains two linear layers with skip connections
* **Deeper Architecture**: 10+ layers with consistent hidden dimensions
* **Better Gradient Flow**: Residual connections help train deeper networks effectively

**Residual Block Implementation**

.. code-block:: python

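    import torch
    from torch import nn


    class ResidualBlock(nn.Module):
        # __init__, which builds self.block and self.relu, is omitted here.
        def forward(self, x: torch.Tensor) -> torch.Tensor:
            identity = x
            out = self.block(x)
            out += identity  # Skip connection
            out = self.relu(out)
            return out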

The residual blocks enable training much deeper networks while maintaining stable gradients throughout the network depth.
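
For experimentation, a self-contained block in the same spirit might look like the sketch below. The layer layout inside ``self.block`` (two linear layers with batch normalization, ReLU, and dropout) is an assumption made for illustration, and the class is named ``ResidualBlockSketch`` to make clear that it is not Diode's own implementation:

.. code-block:: python

    import torch
    from torch import nn


    class ResidualBlockSketch(nn.Module):
        """Hypothetical residual block; the layer layout is an assumption."""

        def __init__(self, hidden_dim: int, dropout_rate: float = 0.2) -> None:
            super().__init__()
            self.block = nn.Sequential(
                nn.Linear(hidden_dim, hidden_dim),
                nn.BatchNorm1d(hidden_dim),
                nn.ReLU(),
                nn.Dropout(dropout_rate),
                nn.Linear(hidden_dim, hidden_dim),
                nn.BatchNorm1d(hidden_dim),
            )
            self.relu = nn.ReLU()

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # Input and output widths match, so the input can be added back.
            return self.relu(self.block(x) + x)


    block = ResidualBlockSketch(hidden_dim=128)
    block.eval()  # disable dropout and use running batch-norm statistics
    x = torch.randn(4, 128)
    with torch.no_grad():
        y = block(x)
    print(y.shape)  # torch.Size([4, 128])

Keeping the hidden dimension constant is what lets the skip connection be a plain addition, so blocks like this can be stacked to any depth without reshaping.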

Step 3: Creating a Validation Dataset
-------------------------------------

Create a separate validation dataset using predefined operation shapes to evaluate model performance:

.. code-block:: bash

    python matmul_toolkit.py \
        --format msgpack \
        --seed 50 \
        create-validation \
        --output validation_dataset.msgpack \
        --shapeset operation_shapeset.json \
        --operations mm addmm bmm \
        --search-space EXHAUSTIVE \
        --search-mode max-autotune

This step:

* Loads predefined matrix shapes from ``operation_shapeset.json``
* Runs autotuning to find optimal configurations for these shapes
* Creates a validation dataset with known ground truth performance data

Step 4: Model Validation
------------------------

Finally, evaluate your trained model against the validation dataset:

.. code-block:: bash

    python matmul_toolkit.py \
        --seed 50 \
        validate-model \
        --model matmul_model.pt \
        --dataset validation_dataset.msgpack \
        --batch-size 64 \
        --top-n-worst 10

This validation step:

* Loads the trained model and validation dataset
* Makes predictions for each validation sample
* Compares predictions against ground truth timing data
* Reports accuracy metrics and identifies the worst-performing predictions
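
The comparison and "worst predictions" report can be sketched in a few lines. The metric below (absolute error in log time) and the sample fields are assumptions chosen for illustration; the metrics that ``validate-model`` actually reports may differ:

.. code-block:: python

    import math
    from typing import Dict, List


    def summarize_validation(samples: List[Dict], top_n_worst: int = 10) -> Dict:
        """Score each sample by absolute log-time error and rank the worst."""
        scored = [
            {
                **sample,
                "abs_log_error": abs(
                    math.log(sample["predicted_s"]) - math.log(sample["measured_s"])
                ),
            }
            for sample in samples
        ]
        scored.sort(key=lambda s: s["abs_log_error"], reverse=True)
        mean_error = sum(s["abs_log_error"] for s in scored) / len(scored)
        return {"mean_abs_log_error": mean_error, "worst": scored[:top_n_worst]}


    # Made-up measurements and predictions, in seconds.
    samples = [
        {"shape": (64, 64, 64), "measured_s": 1.0e-5, "predicted_s": 1.1e-5},
        {"shape": (512, 512, 512), "measured_s": 2.0e-4, "predicted_s": 4.0e-4},
        {"shape": (4096, 64, 4096), "measured_s": 3.0e-3, "predicted_s": 2.9e-3},
    ]
    report = summarize_validation(samples, top_n_worst=1)
    print(report["worst"][0]["shape"])  # (512, 512, 512)

Listing the worst samples alongside an aggregate score makes it easier to spot whether errors concentrate in a particular range of shapes.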

Complete Workflow Script
------------------------

Here's a complete bash script that orchestrates the entire process:

.. code-block:: bash

    #!/bin/bash

    set -e  # Exit on any error

    # Configuration
    SEED=50
    DATA_DIR="./data"
    TRAIN_DATASET="${DATA_DIR}/seed_${SEED}_train_dataset.msgpack"
    VALIDATION_DATASET="${DATA_DIR}/validation/validation_dataset.msgpack"
    MODEL_PATH="${DATA_DIR}/matmul_model.pt"
    LOG_DIR="${DATA_DIR}/logs"
    NUM_SHAPES=1000
    NUM_EPOCHS=1000
    PYTHON_CMD="python"
    TOOLKIT_PATH="matmul_toolkit.py"
    OPERATION_SHAPESET_PATH="operation_shapeset.json"

    echo "Starting Diode workflow..."

    # Setup: create output directories
    mkdir -p "${DATA_DIR}"
    mkdir -p "${DATA_DIR}/validation"

    # Step 1: Collect the training dataset
    echo "Collecting training data..."
    ${PYTHON_CMD} "${TOOLKIT_PATH}" \
        --format msgpack \
        --seed "${SEED}" \
        collect \
        --output "${TRAIN_DATASET}" \
        --num-shapes "${NUM_SHAPES}" \
        --log-normal \
        --search-space EXHAUSTIVE \
        --search-mode max-autotune \
        --chunk-size 5

    # Step 2: Train the model
    echo "Training model..."
    ${PYTHON_CMD} "${TOOLKIT_PATH}" \
        --seed "${SEED}" \
        train \
        --data-dir "${DATA_DIR}" \
        --model "${MODEL_PATH}" \
        --model-type deep \
        --batch-size 64 \
        --num-epochs "${NUM_EPOCHS}" \
        --learning-rate 0.001 \
        --log-dir "${LOG_DIR}"

    # Step 3: Create the validation dataset
    echo "Creating validation dataset..."
    ${PYTHON_CMD} "${TOOLKIT_PATH}" \
        --format msgpack \
        --seed "${SEED}" \
        create-validation \
        --output "${VALIDATION_DATASET}" \
        --shapeset "${OPERATION_SHAPESET_PATH}" \
        --operations mm addmm bmm \
        --search-space EXHAUSTIVE \
        --search-mode max-autotune

    # Step 4: Validate the model
    echo "Validating model..."
    ${PYTHON_CMD} "${TOOLKIT_PATH}" \
        --seed "${SEED}" \
        validate-model \
        --model "${MODEL_PATH}" \
        --dataset "${VALIDATION_DATASET}" \
        --batch-size 64 \
        --top-n-worst 10

    echo "Workflow completed successfully!"

Advanced Configuration
----------------------

Custom Collection Parameters
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For more control over data collection, you can customize the log-normal distribution parameters:

.. code-block:: python

    from torch_diode.collection.matmul_dataset_collector import MatmulDatasetCollector, CollectionMode

    # Custom parameters for different workload characteristics
    collector = MatmulDatasetCollector(
        mode=CollectionMode.LOG_NORMAL,
        # Larger matrices (shift mean higher)
        log_normal_m_mean=7.0,
        log_normal_n_mean=6.5,
        log_normal_k_mean=6.8,
        # Standard deviations (smaller values give more consistent sizes)
        log_normal_m_std=1.5,
        log_normal_n_std=1.2,
        log_normal_k_std=1.8,
    )
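
To get a feel for what such values imply, the sketch below converts a mean and standard deviation into approximate matrix sizes. It assumes the parameters describe the mean and standard deviation of the *natural logarithm* of each dimension, which is the usual log-normal parameterization but is not confirmed here for Diode; ``describe_log_normal`` is a hypothetical helper:

.. code-block:: python

    import math
    from typing import Dict


    def describe_log_normal(mean: float, std: float) -> Dict[str, float]:
        """Summarize a log-normal distribution given its log-space parameters."""
        return {
            "median": math.exp(mean),  # half of all draws fall below this
            "low": math.exp(mean - std),  # about 68% of draws fall
            "high": math.exp(mean + std),  # between low and high
        }


    m_dim = describe_log_normal(mean=7.0, std=1.5)
    print(round(m_dim["median"]))  # 1097

Under this assumption, raising a mean by 1.0 multiplies the typical dimension by about 2.7, and a larger standard deviation widens the range of sizes multiplicatively instead of by a fixed offset.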

Tips
----

1. **Start Small**: Begin with a smaller number of shapes (100-200) to validate your setup
2. **Monitor Memory**: Keep an eye on GPU memory usage during collection
3. **Save Frequently**: Use the ``--chunk-size`` parameter to save data periodically
4. **Reproducibility**: Always set a random seed for consistent results
5. **Hardware Consistency**: Collect training and validation data on the same hardware

Next Steps
----------

After completing this workflow, you can:

* Experiment with different model architectures
* Collect data for specific workloads using ``OPERATION_SHAPE_SET`` mode
* Integrate the trained model into your own applications
* Analyze the collected data to understand performance patterns

For more advanced usage, see the API documentation and examples in the repository.