Context window

The context window is the maximum number of tokens (the sub-word chunks of text a model reads and writes) a language model can consider at once — both the prompt you send and the response it generates.

The context window is the LLM's working memory. Everything the model sees in a single call — your system prompt, the conversation history, retrieved documents, tools, examples, and the answer it's generating — has to fit inside this window.

In 2026, context windows range from 8k tokens (older Llama, smaller Mistral) to 1M+ tokens (Gemini 1.5 Pro / 2, Claude with extended context). One token is roughly ¾ of a word, so a 200k-token window is about 150,000 words — a long novel.
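The token-to-word arithmetic above is easy to sanity-check yourself. A minimal sketch, assuming the ~¾-words-per-token ratio cited above (real tokenizers vary by language and content — use your model's actual tokenizer for precise counts):

```python
# Back-of-envelope conversion between tokens and words,
# assuming ~0.75 words per token. Real tokenizers vary.
WORDS_PER_TOKEN = 0.75

def tokens_to_words(tokens: int) -> int:
    return int(tokens * WORDS_PER_TOKEN)

def words_to_tokens(words: int) -> int:
    return int(words / WORDS_PER_TOKEN)

print(tokens_to_words(200_000))  # a 200k window holds ~150,000 words
print(words_to_tokens(150_000))  # a long novel needs ~200,000 tokens
```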

What the window enables:

  • Long-doc question-answering — paste a 100-page contract into the prompt, ask precise questions.
  • Whole-repo code analysis — Cursor and Claude Code use big windows to read across files.
  • Sustained conversations — chat for hours without the model "forgetting" early turns.

What it doesn't fix:

  • Cost — input pricing scales linearly with tokens, so a 200k-token prompt costs 200x as much as a 1k-token one. Just because you can paste your whole codebase doesn't mean you should.
  • Recall quality — even with a huge window, models often miss details in the middle ("lost in the middle" effect). RAG with smaller, targeted context often outperforms gigantic context dumps.
  • Latency — longer prompts take longer to process. A 100k-token prompt can take 30+ seconds before the first response token streams.
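The cost point above is pure linear scaling, which you can make concrete with a two-line estimate. A sketch using a hypothetical input price (`PRICE_PER_1K` is an illustrative figure, not any provider's real rate):

```python
# Prompt cost scales linearly with token count.
# PRICE_PER_1K is a hypothetical input price in USD per 1,000 tokens;
# check your provider's actual pricing page.
PRICE_PER_1K = 0.003

def prompt_cost(tokens: int) -> float:
    return tokens / 1000 * PRICE_PER_1K

small = prompt_cost(1_000)    # a lean 1k-token prompt
large = prompt_cost(200_000)  # pasting everything into the window
print(large / small)          # 200.0 — exactly 200x the cost
```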

Practical advice: most production systems should target 5-30k tokens of context per call. If you find yourself wanting more, that's a sign you need RAG, not a bigger context window.
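One way to act on this advice is a simple guardrail that checks each call against a token budget before sending it. A minimal sketch: the budget comes from the 5-30k range above, and `count_tokens` is a crude word-count estimate (a hypothetical helper — swap in your model's real tokenizer in practice):

```python
# Guardrail sketch: keep each call inside a token budget and flag
# when targeted retrieval (RAG) is the better fit than a bigger dump.
TARGET_BUDGET = 30_000  # upper end of the 5-30k range

def count_tokens(text: str) -> int:
    # Crude estimate: ~0.75 words per token. Use a real tokenizer in production.
    return int(len(text.split()) / 0.75)

def fits_budget(chunks: list[str], budget: int = TARGET_BUDGET) -> bool:
    return sum(count_tokens(c) for c in chunks) <= budget

context = ["system prompt...", "retrieved doc 1...", "retrieved doc 2..."]
if not fits_budget(context):
    print("Over budget — retrieve fewer, more targeted chunks instead.")
```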

FAQ

What's the biggest context window in 2026?

Gemini 2 leads with 2M tokens for paid tiers. Claude offers up to 1M for some customers. GPT-4o and most other frontier models cluster around 128k-200k.

Does a bigger context window mean smarter answers?

Not directly. More room to include relevant context can help, but a model's attention quality over long contexts varies — and dumping in more text can hurt more than it helps if much of it is noise.
