Attention is all you need
On the paper that started it all
This weekend, I spent time reading Google’s now-seminal paper, Attention Is All You Need, which introduced the transformer architecture that forms the foundation of large language models like ChatGPT, Claude, and Gemini.
I’ve forgotten most of the math I studied during my engineering degree, which made it difficult to fully understand the paper. However, I was able to build some intuition. Here are my abstracted notes:
Prior to transformers, architectures like RNNs (Recurrent Neural Networks) were the standard way to model sequences:
RNNs process sequences one step at a time. If you’re reading a sentence, you read word 1, then word 2, then word 3. Each step depends on the previous one.
This made the process computationally slow—computing step 5 requires completing step 4, which requires step 3, and so on.
Since information has to pass through every intermediate step, RNNs start to fail on long contexts, just as a message gets lost or distorted in a long chain of Chinese whispers.
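The sequential bottleneck is easiest to see in code. This is a toy sketch (made-up numbers, not a real trained model): each hidden state depends on the previous one, so the loop cannot be parallelised.

```python
# Toy illustration of an RNN's sequential dependency.
# rnn_step stands in for the real recurrence h_t = f(W·h_{t-1} + U·x_t).

def rnn_step(prev_hidden, word_value):
    return 0.5 * prev_hidden + word_value

sentence = [1.0, 2.0, 3.0]  # toy word representations
hidden = 0.0
for word in sentence:
    # Step t cannot start until step t-1 has finished.
    hidden = rnn_step(hidden, word)
```

Five words means five dependent steps; five thousand words means five thousand, with the first word's information diluted a little at every step.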
The researchers proposed a radical new approach: instead of processing words sequentially, compute the relationships between all words in a sequence simultaneously.
There’s a bunch of math behind this. The paper formalises it using three concepts:
Query: what I’m currently looking for
Key: what each word offers
Value: the information each word contains
These are used to calculate “attention.” Attention is an elegant way of assigning weights to words based on their relationship with other words. For example, in the sentence “The dog sat on the chair as it was tired”, the calculation will enrich the word “it” with context such that its relationship with “dog” has a higher weight than its relationship with “chair.”
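The core formula is compact: Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V. Here is a minimal NumPy sketch of that scaled dot-product attention, with toy random vectors standing in for the learned projections of real words:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q·K^T / sqrt(d_k)) · V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # how well each query matches each key
    # Softmax turns scores into weights that sum to 1 per query.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V  # each output is a weighted mix of all values

# Three "words", each represented by a 4-dimensional vector (toy values).
Q = np.random.rand(3, 4)
K = np.random.rand(3, 4)
V = np.random.rand(3, 4)
out = attention(Q, K, V)  # one context-enriched vector per word
```

Every output row is a weighted blend of all the value vectors, computed in one matrix multiplication, which is why the whole sequence can be processed at once.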
Attention is content-aware but position-blind, which can be problematic. Without positional information, a transformer wouldn’t know which noun is the subject and which is the object—"man eats fish" and "fish eats man" would be indistinguishable. To solve this, the researchers added a positional encoding to each word before feeding it into the transformer. This preserves word order and meaning.
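The paper’s positional encoding uses sine and cosine waves of different frequencies: PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)). A short sketch of how those vectors are built (the sequence length and dimension here are arbitrary toy values):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings: one d_model-sized vector per position."""
    positions = np.arange(seq_len)[:, None]   # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]  # even dimension indices
    # Each pair of dimensions gets its own wavelength, from short to very long.
    angles = positions / np.power(10000, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions use cosine
    return pe

pe = positional_encoding(seq_len=10, d_model=8)
```

These vectors are simply added to the word embeddings, so the same word at position 2 and position 7 enters the transformer as two slightly different vectors.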
The paper had several implications:
Training time compressed dramatically. The researchers achieved state-of-the-art results on the WMT 2014 translation tasks in 3.5 days on 8 GPUs—at a fraction of the training cost of the best models at that time.
Due to the parallel nature of computation, throwing more compute at the problem works well. Capital became a moat, which is why we’re seeing massive investments in GPU acquisition for training.
Transformers, due to their architecture, are extensible to multimodal inputs (voice, text, images) and diverse tasks (summarisation, text generation, translation).
Transformers made intelligence a scaling problem. Everything since has been a consequence of that.

