Inference
Inference is the runtime execution of a trained model: feed an input, get an output. Every API call to Claude, GPT, or Gemini is inference.
Every time you chat with Claude, accept a completion in Cursor, or generate an image with Midjourney, you're triggering an inference call.
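To make "every API call is inference" concrete, here's a minimal sketch using the Anthropic Python SDK (the model id is illustrative; OpenAI and Google clients follow the same request/response shape):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# One inference call: the trained weights are frozen; only this request's
# tokens flow through the model to produce an output.
response = client.messages.create(
    model="claude-sonnet-4-20250514",  # illustrative model id
    max_tokens=256,
    messages=[{"role": "user", "content": "Explain inference in one sentence."}],
)
print(response.content[0].text)
```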
The distinction between training and inference matters because they have wildly different cost and infrastructure profiles:
- Training is the one-time (or once-per-model-version) job of producing weights from data. It's expensive (frontier training runs cost tens of millions of dollars), but the cost is amortized over the model's serving lifetime.
- Inference is the per-request operation. Every conversation, every code completion, every image generation is its own inference call. For widely-used products, inference cost dwarfs training cost over the model's lifetime, as the back-of-envelope sketch below shows.
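A quick comparison (all numbers hypothetical) shows how fast per-request inference spend overtakes a one-time training bill:

```python
# All figures are made-up round numbers for illustration.
training_cost = 50_000_000      # one-time training run, in dollars
cost_per_request = 0.01        # blended inference cost per request
requests_per_day = 20_000_000  # traffic for a widely-used product

days_to_parity = training_cost / (cost_per_request * requests_per_day)
print(f"Inference spend matches training spend after {days_to_parity:.0f} days")
# -> 250 days; at higher traffic or higher per-request cost, even sooner.
```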
Inference economics drive a lot of product decisions:
- Model size tradeoffs — smaller models cost less per token but are less capable. Pick the smallest model that meets your quality bar.
- Batching — running 32 prompts through the model together can be 10-20x more efficient on GPUs than 32 sequential calls (see the toy demonstration after this list).
- Caching — Anthropic, OpenAI, and Google all support prompt caching: a repeated system prompt is cached and re-read at a steep discount. Exact pricing varies by provider; Anthropic's cache reads, for example, cost ~10% of the base input-token price (see the sketch after this list).
- Distillation — train a small model to mimic a big one, then deploy the small one for serving. It's a common production pattern (a minimal loss sketch follows this list).
- On-device inference — Apple Intelligence, the smaller Llama variants. No marginal cost per request and no network round-trip, but limited capability.
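Why batching helps, in miniature: one (batch × d) matrix multiply keeps the hardware far busier than the same work issued as separate per-request products. This toy NumPy demo shows the effect on CPU; on GPUs, where inference servers batch requests continuously, the gap is much larger:

```python
import time
import numpy as np

d, batch = 4096, 32
W = np.random.randn(d, d).astype(np.float32)        # stand-in for a weight matrix
xs = np.random.randn(batch, d).astype(np.float32)   # 32 "requests"

t0 = time.perf_counter()
for x in xs:            # sequential: one request at a time
    _ = x @ W
sequential = time.perf_counter() - t0

t0 = time.perf_counter()
_ = xs @ W              # batched: all 32 requests in one matmul
batched = time.perf_counter() - t0

print(f"sequential: {sequential*1e3:.1f} ms, batched: {batched*1e3:.1f} ms, "
      f"speedup: {sequential / batched:.1f}x")
```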
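Prompt caching in practice: a sketch using Anthropic's Messages API, where a `cache_control` marker on a long system prompt tells the API to cache that prefix for reuse (model id illustrative; OpenAI and Google expose caching differently, so check each provider's docs):

```python
import anthropic

client = anthropic.Anthropic()

LONG_SYSTEM_PROMPT = "..."  # imagine several thousand tokens of instructions

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # illustrative model id
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Marks this block as cacheable: later calls that reuse the exact
            # same prefix read it from cache at a discounted token price.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Summarize our refund policy."}],
)
print(response.content[0].text)
```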
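And a minimal sketch of the classic distillation loss (Hinton et al.'s soft-target formulation, here in PyTorch): the student is trained to match the teacher's temperature-softened output distribution.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between softened teacher and student distributions."""
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_soft_student = F.log_softmax(student_logits / t, dim=-1)
    # The t^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * t * t

# Toy usage: batch of 8 examples, vocabulary of 100 tokens.
student = torch.randn(8, 100, requires_grad=True)
teacher = torch.randn(8, 100)   # in practice, the frozen big model's logits
loss = distillation_loss(student, teacher)
loss.backward()                 # gradients flow only into the student
```

In real pipelines this term is usually combined with a standard cross-entropy loss on ground-truth labels.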
Hardware: the modern inference stack runs on Nvidia GPUs (H100/H200, B200) in the cloud, with specialized chips (Groq, Cerebras, AWS Inferentia) emerging as faster or cheaper alternatives for specific model shapes.
Related terms
- LLM (Large Language Model) — a neural network trained on huge volumes of text to predict the next token, which yields emergent capabilities like reasoning, code generation, and translation.
- Context window — the maximum number of tokens (text chunks) a language model can consider at once, covering both the prompt you send and the response it generates.
- Tokenization — the process of breaking text into chunks (tokens), usually sub-word pieces, that an LLM actually reads and writes (quick example below).
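For a quick feel of tokenization, OpenAI's open-source tiktoken library exposes the encodings its models use:

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-4-era models

tokens = enc.encode("Inference is the runtime execution of a trained model.")
print(tokens)              # a list of integer token ids
print(len(tokens))         # how many tokens this sentence costs
print(enc.decode(tokens))  # decodes back to the original string
```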
Want to actually build with this?
Our Stack Builder picks the best AI tools for your specific project in under 60 seconds.
Build my stack →