
MiniGPT — GPT-2 from Scratch

January 24, 2026

PyTorch · RunPod · GPT-2 · Transformers

Trained a GPT-2-based language model from scratch on RunPod (~$15). Strong at language modeling and factual recall, weaker at reasoning/math. Later fine-tuned into a chatbot.

What is it?

A GPT-2-scale transformer trained from random weights — not fine-tuned from a checkpoint, but pre-trained from scratch on a text corpus. The goal was to understand every layer of the training pipeline by doing it, not by reading about it.

The architecture

Token embeddings → N transformer blocks → output projection → softmax over vocabulary. Each transformer block: LayerNorm → multi-head self-attention → residual connection → LayerNorm → feed-forward network → residual connection. GPT-2 small: 12 layers, 12 heads, 768-dim embeddings, 117M parameters.
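The block structure above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the project's actual code: the class name, dimensions, and use of `nn.MultiheadAttention` are my assumptions, chosen to match the pre-LN layout described (LayerNorm before each sub-layer, residual around each).

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """One pre-LN transformer block: LN -> attention -> residual, LN -> MLP -> residual."""
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),  # GPT-2 uses a 4x hidden expansion
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        # Causal mask: True marks positions each token may NOT attend to (the future)
        T = x.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + a                       # residual around attention
        x = x + self.mlp(self.ln2(x))   # residual around feed-forward
        return x
```

Stacking 12 of these between the token embeddings and the output projection gives the GPT-2 small layout.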

Under the hood: multi-head self-attention

For each token, compute three projections: Query (what am I looking for?), Key (what do I advertise?), Value (what do I return if selected?). Attention weights = softmax(Q × Kᵀ / √d_k); output = weighted sum of Values.

Multi-head splits the 768-dim embedding into 12 heads × 64 dims each. Each head attends to different patterns — some learn syntactic relations, some semantic, some positional. Outputs are concatenated and projected back to 768. The /√d_k scaling prevents softmax saturation for large d.
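The head split and the scaled dot-product can be made concrete in a few tensor operations. A minimal sketch (random inputs, no learned projections — a real layer would apply learned W_q/W_k/W_v matrices first):

```python
import torch
import torch.nn.functional as F

def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5  # (..., T, T) pairwise scores
    weights = F.softmax(scores, dim=-1)            # each row sums to 1
    return weights @ v                             # convex combination of Values

# Split a 768-dim embedding into 12 heads of 64 dims each
B, T, D, H = 2, 5, 768, 12
x = torch.randn(B, T, D)
heads = x.view(B, T, H, D // H).transpose(1, 2)    # (B, 12, T, 64)
out = attention(heads, heads, heads)               # each head attends independently
merged = out.transpose(1, 2).reshape(B, T, D)      # concatenate heads back to 768
```

Because the weights sum to 1 per row, the output for each token is always a convex combination of the Value vectors — which is why attention over identical Values just returns them unchanged.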

Training from scratch on RunPod

Pre-training objective: next token prediction (cross-entropy loss). Given 'The quick brown', predict 'fox'. With enough data and steps, the model learns grammar, facts, and reasoning as emergent properties.
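The objective itself is one line of PyTorch. A toy sketch with made-up sizes (batch of 2, vocab of 10) showing the shift-by-one target construction:

```python
import torch
import torch.nn.functional as F

# Hypothetical toy shapes: 2 sequences of 5 token ids, vocab of 10
tokens = torch.randint(0, 10, (2, 5))
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # predict each NEXT token

# Stand-in for model output: one logit per vocab entry at each position
logits = torch.randn(2, 4, 10)

# Cross-entropy over the vocabulary: "given 'The quick brown', predict 'fox'"
loss = F.cross_entropy(logits.reshape(-1, 10), targets.reshape(-1))
```

Every position in every sequence contributes a prediction, so a single batch yields batch × sequence-length training signals — one reason next-token prediction is so data-efficient per step.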

Trained on a RunPod A100 (40GB VRAM), ~$15 total. Techniques: gradient checkpointing (recompute activations to save VRAM), mixed precision (fp16 forward/backward, fp32 optimizer state), AdamW with cosine LR decay and linear warmup. The model is strong at language modeling and factual recall; weaker at multi-step reasoning — expected for a pre-train-only 117M model.
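The warmup-plus-cosine schedule is simple enough to write by hand. A sketch in plain Python — the specific learning rates and step counts here are illustrative defaults, not the values used in this run:

```python
import math

def lr_at(step, max_lr=6e-4, warmup=200, total=10_000, min_lr=6e-5):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup:
        return max_lr * (step + 1) / warmup            # linear ramp from ~0
    progress = (step - warmup) / (total - warmup)      # 0 -> 1 over the decay phase
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

The warmup keeps AdamW's second-moment estimates from blowing up the first updates; the cosine tail lets the loss settle at the end of training.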

Fine-tuning into a chatbot

After pre-training, a second training pass on instruction-following data fine-tuned the base model into a chatbot. This is the same paradigm as GPT-2 → InstructGPT, just much smaller scale. The base model provides language understanding; the fine-tuning pass teaches it the Q&A format.
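Mechanically, the fine-tuning pass is still next-token prediction — only the data changes shape. A sketch of what an instruction-formatted training example might look like; the delimiter tokens here are invented for illustration, not the ones this project used:

```python
def format_example(instruction, response):
    """Wrap a Q&A pair in a simple chat template (hypothetical special tokens)."""
    return f"<|user|>{instruction}<|assistant|>{response}<|end|>"

# Fine-tuning data is just more text for the same objective, shaped like this:
sample = format_example("What is attention?", "A weighted sum over token values.")
```

Because the base model already knows the language, a relatively small amount of such data is enough to teach it the format — it learns *when* to answer, not *how* to use English.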

Key takeaways

  • Transformer architecture from scratch: attention, residual connections, layer norm placement
  • Multi-head self-attention: QKV projections, scaled dot-product, head concatenation
  • Byte-pair encoding tokenization: vocab building, encoding efficiency
  • Gradient checkpointing and mixed precision: training 117M params on a single GPU
  • Pre-training vs fine-tuning: what each phase teaches the model