GPT-OSS-20B-Triton-Kernel: Fine-Tuning a 20B Model to Generate GPU Kernels
I have fine-tuned the GPT-OSS 20B model on the KernelBook dataset. This dataset contains detailed, paired examples of PyTorch and Triton kernels, and I transformed it into a format suitable for GPT-style supervised fine-tuning. Training took around 4 hours and 55 minutes.
I want to be explicit about the process and results:
I fine-tuned the GPT-OSS 20 billion parameter model on the KernelBook dataset, which pairs PyTorch code with corresponding Triton kernels. I transformed the dataset into the chat structure the GPT model expects, then fine-tuned it with Hugging Face Transformers using an SFT/PEFT configuration. I will share the notebook so you can see the new dataset I derived from the original KernelBook (the GPUMODE/KernelBook release). After fine-tuning, I ran inference on the resulting model and published it on Hugging Face. Training took about 4 hours 55 minutes. I believe this fine-tune is very useful for generating Triton kernels with the GPT-OSS 20B model.
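As an illustration of the reformatting step, here is a minimal sketch of turning a KernelBook-style row into the chat/messages structure used for SFT. The column names (`python_code`, `triton_code`) and the system prompt are assumptions for illustration; the published KernelBook schema may differ.

```python
from datasets import load_dataset

# NOTE: column names below are illustrative assumptions, not the exact KernelBook schema.
raw = load_dataset("GPUMODE/KernelBook", split="train")

SYSTEM_PROMPT = (
    "You are a GPU kernel engineer. Rewrite the given PyTorch code as an optimized Triton kernel."
)

def to_messages(example):
    # Map one KernelBook row into the chat/messages structure that TRL's SFTTrainer consumes.
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": example["python_code"]},       # PyTorch reference implementation
            {"role": "assistant", "content": example["triton_code"]},  # target Triton kernel
        ]
    }

kernelbook_messages = raw.map(to_messages, remove_columns=raw.column_names)
kernelbook_messages.push_to_hub("KernelBook-messages")  # requires a Hugging Face login
```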
What it does
The model can generate optimized Triton GPU kernels for various deep-learning operations and kernel templates.
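A minimal inference sketch with Transformers; the prompt text and generation settings here are my own choices, not the exact ones used for the published examples.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Nadiveedishravanreddy/gpt-oss-20b-triton-kernel"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [
    {"role": "user", "content": "Write a Triton kernel that performs element-wise addition of two float32 tensors."},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=512, temperature=0.2, do_sample=True)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```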
How it was built
- Base model: `openai/gpt-oss-20b`
- Dataset: `KernelBook-messages` (a filtered/modified version of KernelBook) reformatted into a GPT-style dialogue/sequence structure suitable for SFT.
- Training: supervised fine-tuning (SFT) using TRL + Hugging Face Transformers (see the sketch after this list).
- Training time: ~4 hrs 55 min.
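A condensed sketch of the training setup, assuming the `KernelBook-messages` dataset from above and a LoRA/PEFT configuration. The dataset id, LoRA target modules, and hyperparameters shown are placeholders, not the exact values used for the published checkpoint.

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Hypothetical dataset id for the reformatted KernelBook-messages data.
dataset = load_dataset("Nadiveedishravanreddy/KernelBook-messages", split="train")

peft_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections only (illustrative)
    task_type="CAUSAL_LM",
)

training_args = SFTConfig(
    output_dir="gpt-oss-20b-triton-kernel",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    gradient_checkpointing=True,
    bf16=True,
    learning_rate=2e-4,
    num_train_epochs=1,
    max_length=2048,
    logging_steps=10,
    report_to="wandb",           # training curves were tracked on Weights & Biases
)

trainer = SFTTrainer(
    model="openai/gpt-oss-20b",
    args=training_args,
    train_dataset=dataset,       # rows in the {"messages": [...]} format
    peft_config=peft_config,
)
trainer.train()
trainer.push_to_hub()
```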
Challenges
- Could not fine-tune the 120B model due to OOM (memory) limitations.
- Needed careful dataset reformatting so the model learns the input/output patterns for kernel generation.
Accomplishments
- Successfully fine-tuned the 20B model and published it on Hugging Face: `Nadiveedishravanreddy/gpt-oss-20b-triton-kernel`.
- Able to perform inference with the fine-tuned model and produce Triton kernel outputs.
What we learned
- Dataset structure must match the model’s expected prompt/response format for good SFT results.
- Fine-tuning large models requires careful memory planning (batch size, sequence length, gradient checkpointing, etc.); a rough weight-memory estimate is sketched below.
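A rough back-of-envelope check of why the 120B model hit OOM while the 20B model fit (these are approximations for weights only; gradients, optimizer state, and activations add substantially on top):

```python
def weight_memory_gib(num_params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for model weights alone (bf16 = 2 bytes/param),
    ignoring optimizer state, gradients, and activations."""
    return num_params_billion * 1e9 * bytes_per_param / 2**30

print(f"20B  weights in bf16: ~{weight_memory_gib(20):.0f} GiB")   # ~37 GiB
print(f"120B weights in bf16: ~{weight_memory_gib(120):.0f} GiB")  # ~224 GiB, more than a single
                                                                   # 192 GB MI300X even before training state
```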
Next steps
- Evaluate the generated kernels on KernelBench / benchmark scripts.
- Build an agent around the model to generate, validate, and optimize kernels automatically (a minimal generate-and-validate loop is sketched below).
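A possible starting point for that agent loop, sketched under the assumption that a generated kernel is accepted once it runs and matches a PyTorch reference. The `generate_kernel` hook and the `run(*inputs)` entry-point convention are hypothetical, not part of the released model.

```python
import torch

def generate_kernel(prompt: str) -> str:
    """Hypothetical hook: call the fine-tuned model (e.g. via the inference snippet above)
    and return the generated Triton code as a string."""
    raise NotImplementedError

def validate_kernel(kernel_code: str, reference_fn, *inputs, atol: float = 1e-3) -> bool:
    """Execute the generated code and compare its output against a PyTorch reference.
    Assumes the model is prompted to expose an entry point named `run(*inputs)`."""
    namespace: dict = {}
    try:
        exec(kernel_code, namespace)
        out = namespace["run"](*inputs)
        return torch.allclose(out, reference_fn(*inputs), atol=atol)
    except Exception:
        return False

def agent_loop(prompt: str, reference_fn, inputs, max_attempts: int = 5):
    for _ in range(max_attempts):
        code = generate_kernel(prompt)
        if validate_kernel(code, reference_fn, *inputs):
            return code
        prompt += "\nThe previous kernel failed validation; please fix it."  # naive self-repair feedback
    return None
```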
Timeline & Notes (selected)
- Aug 02, 2025: Concept & survey of prior work (Sakana, KernelLLM-8B, Cognition’s Kevin, Anne Ouyang’s Fast Kernels, AMD GEAK-agent).
- Aug 03–14, 2025: Experiments with Qwen3-4B, Cerebras API, and agent prompts; Colab prototypes.
- Aug 11–Sep 17, 2025: Trainer scripts, TRL SFT experiments, layer selection experiments, ROCm + Hugging Face PEFT setup, Docker + MI300X attempts, W&B logging.
- Result: Final model `Nadiveedishravanreddy/gpt-oss-20b-triton-kernel`, with training visualized on Weights & Biases. Fine-tuning time ≈ 4:29:27 (reported in logs).
Model Card (summary)
- Model: `Nadiveedishravanreddy/gpt-oss-20b-triton-kernel` (fine-tuned from `openai/gpt-oss-20b`)
- Purpose: Generate Triton GPU kernels from high-level prompts / PyTorch snippets.
- Dataset: `KernelBook-messages` (modified KernelBook)
- Training procedure: SFT (Supervised Fine-Tuning) using TRL + Transformers
- Framework versions: TRL 0.22.2, Transformers 4.56.1, PyTorch 2.6.0, Datasets 4.0.0, Tokenizers 0.22.0
References & Links
- facebook/KernelLLM
- GPUMODE/KernelBook dataset
- Colab prototypes and agent code (internal links shared during development)
- Devpost Kernel LLM
- Faster Transformers blog
- Benchmark script (faster-transformers-scripts)