Fine-tuning
Fine-tuning is the process of further training a foundation model on your own examples so it learns to behave in a specific way.
Fine-tuning takes a pre-trained foundation model (Llama, Mistral, GPT-3.5, etc.) and continues training it on a much smaller, task-specific dataset of input/output pairs. The result is a model that's specialized for your domain — your tone of voice, your output format, your edge cases.
The mental model: foundation models know how to write English, code, and reason. Fine-tuning teaches them how to write your English, in your format, for your problem.
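What does that "task-specific dataset" actually look like? Usually a JSONL file of example conversations. Here's a minimal sketch in Python that writes a couple of training examples in the chat "messages" format hosted fine-tuning APIs like OpenAI's expect; the support-bot scenario, system prompt, and file name are made up for illustration.

```python
# A minimal sketch of building a fine-tuning dataset. The support-bot scenario
# and every string below are hypothetical; swap in your own real examples.
import json

examples = [
    {
        "messages": [
            {"role": "system", "content": "You are Acme's support bot. Answer in two sentences, then link the relevant help article."},
            {"role": "user", "content": "How do I reset my password?"},
            {"role": "assistant", "content": "Go to Settings > Security and choose 'Reset password'; you'll get an email within a minute. More detail: https://example.com/help/reset-password"},
        ]
    },
    # ...dozens to thousands more pairs covering your tone, format, and edge cases
]

# One JSON object per line is the standard JSONL layout for fine-tuning uploads.
with open("train.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```

Each example is a small demonstration of the behavior you want the model to internalize; the fine-tune is only as good as this file.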
The big shift in the last two years: fine-tuning used to be the default for "make GPT do my thing." Today, for most teams, prompting + RAG outperforms fine-tuning at a fraction of the cost. Reasons:
- Cost — a fine-tuning run costs roughly $50–$5,000 depending on model and dataset size; a prompt costs nothing to iterate on.
- Speed — you can change a prompt in seconds; changing a fine-tune means rebuilding the dataset and retraining.
- Capability ceiling — a modern frontier model out of the box often beats a fine-tuned mid-tier model from 2023.
That said, fine-tuning still wins for:
- Tightly constrained output formats (structured JSON, function calls).
- Domain-specific tone that prompts can't reliably elicit.
- Latency-critical paths where you can't afford a giant system prompt.
- On-prem deployments where you can't call an API (see the sketch below).
If you're choosing between fine-tuning and RAG: pick RAG if you have data that changes; pick fine-tuning if you have a style or format that's hard to describe in a prompt.
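For the on-prem and open-model case, the standard approach today is parameter-efficient fine-tuning such as LoRA: the base model stays frozen and you train small adapter matrices on top of it. Below is a minimal sketch using Hugging Face transformers, peft, and datasets, assuming a pairs.jsonl file of {"prompt": ..., "completion": ...} records; the base model, hyperparameters, and field names are placeholders, not recommendations.

```python
# A minimal sketch of LoRA fine-tuning an open model with Hugging Face
# transformers + peft. Assumes pairs.jsonl holds {"prompt", "completion"} records;
# the base model, hyperparameters, and file names are illustrative only.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "mistralai/Mistral-7B-v0.1"          # any causal LM you can host
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token   # this model has no pad token by default
model = AutoModelForCausalLM.from_pretrained(base)

# Freeze the base weights and train only small low-rank adapter matrices.
lora = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM",
                  target_modules=["q_proj", "v_proj"])  # attention projections; adjust per architecture
model = get_peft_model(model, lora)

dataset = load_dataset("json", data_files="pairs.jsonl", split="train")

def tokenize(example):
    # Concatenate prompt and completion into one training sequence.
    text = example["prompt"] + "\n" + example["completion"] + tokenizer.eos_token
    return tokenizer(text, truncation=True, max_length=1024)

dataset = dataset.map(tokenize, remove_columns=dataset.column_names)

Trainer(
    model=model,
    args=TrainingArguments(output_dir="lora-out", num_train_epochs=3,
                           per_device_train_batch_size=2, learning_rate=2e-4),
    train_dataset=dataset,
    # mlm=False gives standard next-token (causal LM) loss over the whole sequence;
    # masking the prompt tokens out of the loss is a common refinement.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()

model.save_pretrained("lora-out")  # saves only the adapter weights, not the full model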
FAQ
How much data do I need to fine-tune?
For format/style adjustments, 50–200 high-quality examples often beat 10k mediocre ones. For new capabilities, 1k–10k is more realistic. Quality and diversity matter more than raw count.
Can I fine-tune Claude or GPT-4?
Anthropic doesn't currently offer fine-tuning through its own API (Claude 3 Haiku fine-tuning is available via Amazon Bedrock). OpenAI offers fine-tuning on GPT-4o, GPT-4o-mini, and GPT-3.5 Turbo. For frontier-tier customization, RAG + careful prompting is the standard path.
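If you do go the hosted route, the mechanics are short: upload the JSONL, start a job, then call the resulting model ID like any other model. Here's a sketch against OpenAI's Python SDK, reusing the train.jsonl chat format from earlier; the model snapshot name is illustrative and changes over time.

```python
# A minimal sketch of launching a hosted fine-tune with OpenAI's Python SDK.
# Assumes train.jsonl uses the chat "messages" format shown earlier; the model
# snapshot name below is illustrative and changes over time.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# 1. Upload the training data.
training_file = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")

# 2. Start the fine-tuning job on a fine-tunable snapshot.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",
)
print(job.id, job.status)

# 3. When the job succeeds, the returned fine-tuned model ID (ft:...) is used
#    like any other model name in chat completion calls.
```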
Related terms
- RAG (Retrieval-Augmented Generation) — RAG combines a language model with a search step over your own documents, so answers stay grounded in your data instead of hallucinating.
- LLM (Large Language Model) — A Large Language Model is a neural network trained on huge volumes of text to predict the next token, which produces emergent capabilities like reasoning, code generation, and translation.
- Prompt engineering — Prompt engineering is the craft of writing instructions to a language model so it produces reliable, accurate, useful outputs.
- RLHF (Reinforcement Learning from Human Feedback) — RLHF is the post-training process where human raters score model outputs and the model is trained to produce outputs humans prefer.
Want to actually build with this?
Our Stack Builder picks the best AI tools for your specific project in under 60 seconds.
Build my stack →