Transformer architecture

The transformer is the neural network architecture, introduced in 2017, that powers every major LLM. It is built around the attention mechanism, which lets each token weigh every other token in the sequence.

The transformer is the neural network design that made modern LLMs possible. Introduced in the 2017 paper "Attention Is All You Need," it replaced recurrence (the old RNN/LSTM approach) with self-attention: a mechanism where every token in a sequence can directly look at every other token to decide what it means.
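The core of self-attention is a single matrix computation: project each token into a query, key, and value, score every token pair, and mix values by those scores. Here is a minimal single-head sketch in NumPy (the random weights and tiny sizes are illustrative, not from any real model):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention: every token attends to every other token."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                # project tokens to queries/keys/values
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # pairwise token-to-token affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ V                              # each output is a weighted mix of all tokens

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
X = rng.normal(size=(seq_len, d_model))             # 4 toy token embeddings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8): one output vector per token
```

Real transformers run many of these heads in parallel and stack the result with feed-forward layers, but the token-to-token weighting above is the mechanism the paper's title refers to.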

Why this beat everything else:

  • Parallelizable training — recurrent models had to process tokens one at a time; transformers process the whole sequence at once on GPUs, improving training throughput by orders of magnitude.
  • Long-range dependencies — RNNs forgot context after dozens of tokens; transformers naturally attend across thousands.
  • Scale-friendly — pile on more parameters, more data, more compute, and transformers keep improving smoothly.
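The parallelism point is easy to see in code. A recurrent model is a loop where step t waits on step t-1; attention over the same sequence is one batched matrix multiply that a GPU can execute all at once. A toy comparison (random data, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, d = 1024, 64
X = rng.normal(size=(seq_len, d))
W = rng.normal(size=(d, d)) * 0.01

# Recurrent style: each hidden state depends on the previous one -> inherently serial.
h = np.zeros(d)
for t in range(seq_len):
    h = np.tanh(X[t] + h @ W)        # step t cannot start until step t-1 finishes

# Attention style: one matmul covers every token pair at once -> parallel-friendly.
scores = X @ X.T / np.sqrt(d)        # all 1024 x 1024 interactions in a single op
print(scores.shape)  # (1024, 1024)
```

The serial loop is also why RNNs struggle with long-range dependencies: information from early tokens has to survive a thousand tanh updates, while attention gives token 1023 a direct edge to token 0.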

Every well-known LLM from 2017 onward — BERT, GPT, T5, Llama, Claude, Gemini — is a transformer. They differ in:

  • Decoder-only vs encoder-decoder — GPT, Claude, Llama are decoder-only (generate one token at a time). T5, BART are encoder-decoder. Decoder-only won the LLM market because it's simpler and scales better.
  • Parameter count — from 1B (small/local) to 500B+ (frontier).
  • Training data — what they were trained on shapes capabilities and biases.
  • Post-training — RLHF, constitutional AI, and other techniques shape behavior after pretraining.
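"Generate one token at a time" means the decoder-only loop is just: run a forward pass over the context, pick the next token, append it, repeat. A toy sketch of that loop, where `logits_fn` is a stand-in for a real transformer forward pass (the toy model below is invented for illustration):

```python
import numpy as np

def generate(logits_fn, prompt_ids, max_new_tokens, eos_id=None):
    """Decoder-only generation loop: append one token at a time, feeding
    the growing sequence back in. `logits_fn` maps a token sequence to
    next-token logits (here a stand-in for a transformer)."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = logits_fn(ids)            # forward pass over the full context
        next_id = int(np.argmax(logits))   # greedy decoding: most likely token
        ids.append(next_id)
        if next_id == eos_id:              # stop early at end-of-sequence
            break
    return ids

# Toy "model": always predicts (last token + 1) mod vocab size.
vocab = 10
toy_logits = lambda ids: np.eye(vocab)[(ids[-1] + 1) % vocab]
print(generate(toy_logits, [3], max_new_tokens=4))  # [3, 4, 5, 6, 7]
```

Production systems swap `np.argmax` for temperature sampling or nucleus sampling and cache attention keys/values so each step doesn't recompute the whole prefix, but the control flow is this loop.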

Since around 2024, alternative architectures (Mamba, RWKV, state-space models, hybrid models) have emerged with theoretical efficiency advantages, but transformers still dominate production. The "all you need" claim has aged remarkably well.

FAQ

Will transformers eventually be replaced?

Probably, but not soon. Alternatives like Mamba have efficiency advantages but haven't matched transformer capability at frontier scale. Many researchers expect hybrid architectures that combine attention with state-space layers to gain ground over the next few years.
