Python Code Generation Assistant

December 16, 2025

Python · Hugging Face · QLoRA · RunPod

Fine-tuned an LLM to generate clean, ready-to-use Python code snippets. My first real fine-tuning experiment — tried GCP, Vultr, and RunPod before getting it right.

What is it?

A fine-tuned Mistral-7B-v0.1 model for Python code generation, served as a REST API. The fine-tuning used QLoRA with rank 16 across all attention and MLP projection layers. The final deployment runs on GCP Cloud Run with an NVIDIA L4 GPU.

The fine-tuning config

Base model: mistralai/Mistral-7B-v0.1 from Hugging Face. LoRA config: r=16, lora_alpha=32, lora_dropout=0.05. Target modules cover everything — q_proj, k_proj, v_proj, o_proj (attention) and gate_proj, up_proj, down_proj (MLP). Task: CAUSAL_LM. The trained adapter weights are stored in adapter_model.safetensors alongside the serving code — a real trained adapter, not a placeholder.
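The settings above map directly onto peft's LoraConfig. A sketch of what the training-side configuration likely looked like (the actual script isn't shown here, so treat this as illustrative):

```python
from peft import LoraConfig

# LoRA settings as described: rank 16, alpha 32, dropout 0.05,
# adapters on every attention and MLP projection layer.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention
        "gate_proj", "up_proj", "down_proj",     # MLP
    ],
    task_type="CAUSAL_LM",
)
```

Passing this config to `get_peft_model` (or to an SFT trainer) is what produces the `adapter_model.safetensors` file mentioned above.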

Under the hood: QLoRA

Full fine-tuning of Mistral-7B requires roughly 28GB of VRAM just for the fp16 weights and gradients, before optimizer state. QLoRA brings this under 8GB in two steps:

1. Quantize the base model to 4-bit NF4 (NormalFloat4) using bitsandbytes. The quantized weights are frozen: backpropagation passes through them, but they never receive updates.
2. Add small trainable LoRA adapter matrices alongside the frozen weights. With r=16, each adapter pair is O(d × 16) parameters instead of O(d²); across all seven target modules and 32 layers, that comes to roughly 42M trainable parameters, well under 1% of the 7B base.
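The adapter-size claim is easy to check from Mistral-7B's published dimensions (hidden size 4096, MLP size 14336, 8 KV heads of 128 dims, 32 layers). A back-of-the-envelope count, no libraries required:

```python
# LoRA adds two matrices per target module: A (r x d_in) and B (d_out x r),
# i.e. r * (d_in + d_out) trainable parameters instead of d_in * d_out.
R = 16
HIDDEN, MLP, KV = 4096, 14336, 1024  # Mistral-7B dims; KV = 8 heads * 128
LAYERS = 32

# (d_in, d_out) for each target module in one decoder layer
modules = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV),
    "v_proj": (HIDDEN, KV),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

trainable = LAYERS * sum(R * (d_in + d_out) for d_in, d_out in modules.values())
base = 7_241_732_096  # Mistral-7B-v0.1 total parameter count
print(f"{trainable:,} trainable ({100 * trainable / base:.2f}% of base)")
# ~42M adapter parameters, under 1% of the 7B base
```

At fp16 that's about 84MB of adapter weights, which is why the safetensors file ships comfortably alongside the serving code.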

At inference time: load the quantized base model, load the adapter via PeftModel.from_pretrained, generate. The adapter adds negligible memory.
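That three-step sequence is a short loading fragment with the standard transformers + peft calls. The server code itself isn't shown, and `adapter_dir` is a placeholder for wherever adapter_model.safetensors lives:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

# 4-bit NF4 quantization, matching the training-time setup.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Step 1: load the quantized base model.
base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

# Step 2: attach the trained LoRA adapter on top of the frozen base.
model = PeftModel.from_pretrained(base, "adapter_dir")  # placeholder path
model.eval()  # step 3: ready to generate
```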

The platform journey: GCP → Vultr → RunPod → GCP Cloud Run

GCP Vertex AI: quota limits blocked GPU access without a support ticket. Vultr: GPU VMs available, but the documentation for PyTorch + CUDA setup was sparse and the experience was rough. RunPod: straightforward — pay per hour, Jupyter notebooks pre-configured, A40/A100 available. Training ran there.

For serving, GCP Cloud Run with GPU support fit the bill once it became available. The deploy command specifies --gpu=1 --gpu-type=nvidia-l4, 16Gi RAM, 4 CPUs, max 1 instance (a GPU constraint), and a 300s request timeout. The image is pushed to Artifact Registry and Cloud Run pulls it from there.
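Put together, the deploy looks roughly like this. Service name, region, and Artifact Registry path are placeholders, and Cloud Run GPU services also require CPU to stay allocated between requests:

```shell
# Hypothetical names; only the GPU/CPU/memory/timeout flags come from the setup above.
gcloud run deploy codegen-api \
  --image=us-central1-docker.pkg.dev/PROJECT_ID/repo/codegen-api:latest \
  --gpu=1 --gpu-type=nvidia-l4 \
  --memory=16Gi --cpu=4 \
  --max-instances=1 \
  --timeout=300 \
  --no-cpu-throttling \
  --region=us-central1
```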

Inference API

The prompt template wraps every input: ### Instruction:\n{prompt}\n\n### Response:\n. This is the standard Alpaca-style instruction format. POST /generate accepts prompt, max_tokens, temperature, and top_p. The model loads once at container startup, so cold starts are slow, but requests against a warm instance complete in seconds on the L4 GPU.
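The template itself is a one-liner, and a client call is a plain JSON POST. A sketch using only the stdlib, where the endpoint URL, the generation defaults, and the `"text"` response field are assumptions not stated above:

```python
import json
from urllib import request

TEMPLATE = "### Instruction:\n{prompt}\n\n### Response:\n"

def build_prompt(user_prompt: str) -> str:
    """Wrap raw user input in the Alpaca-style instruction format."""
    return TEMPLATE.format(prompt=user_prompt)

def generate(url: str, prompt: str, max_tokens: int = 256,
             temperature: float = 0.2, top_p: float = 0.95) -> str:
    """POST to /generate; field names match the API described above."""
    body = json.dumps({"prompt": prompt, "max_tokens": max_tokens,
                       "temperature": temperature, "top_p": top_p}).encode()
    req = request.Request(url, data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)["text"]  # assumed response shape

print(build_prompt("Write a function that reverses a string."))
```

Doing the wrapping server-side means clients only ever send the raw prompt and can't accidentally break the instruction format the adapter was trained on.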

Key takeaways

  • QLoRA mechanics: NF4 4-bit quantization + LoRA adapters, how to run on consumer-grade VRAM
  • LoRA rank selection: r=16 with full target module coverage for code generation quality
  • GPU platform comparison: Vertex AI quota issues, RunPod for training, Cloud Run GPU for serving
  • PEFT library: PeftModel.from_pretrained, loading adapters onto quantized base models
  • GCP Cloud Run GPU: --gpu flag, nvidia-l4, max instance constraints, Artifact Registry workflow