
Attention and Transformer, We Hardly Know Thee


The Transformer architecture has achieved near-mythical status in AI. It’s the engine behind the LLMs that are changing our world. But like many famous technologies, its true nature is often obscured by a fog of oversimplification and outright myths. It’s an architecture many of us use, but few truly know. Today, I want to pull back the curtain and explore the deeper reality behind the biggest misconceptions.


Myth 1: “Attention is just a clever kernel trick”

There’s a grain of truth to this, which makes the myth particularly sticky. At a high mathematical level, the attention mechanism is formally equivalent to a type of non-parametric regression known as a Nadaraya-Watson kernel estimator. The formula for this estimator is:

\(\hat{f}(x) = \frac{\sum_{i=1}^{n} K(x, x_i) y_i}{\sum_{j=1}^{n} K(x, x_j)}\)

This looks remarkably similar to the attention formula if we map the query \(x\) to a Query vector \(q\), the input points \(x_i\) to Key vectors \(k_i\), and the output points \(y_i\) to Value vectors \(v_i\). The kernel function itself, \(K(x, x_i)\), becomes analogous to the similarity score between the query and a key, which in the Transformer is defined as the exponential of the scaled dot-product:

\(K(q, k_i) = \exp\left(\frac{q^T k_i}{\sqrt{d_k}}\right).\)

Substituting these components gives the attention output for a single query \(q\):

\(\text{Attention}(q, K, V) = \frac{\sum_{i=1}^{n} \exp\left(\frac{q^T k_i}{\sqrt{d_k}}\right) v_i}{\sum_{j=1}^{n} \exp\left(\frac{q^T k_j}{\sqrt{d_k}}\right)},\)

which is exactly a softmax-weighted average of the value vectors.

However, calling it “just a kernel trick” is a myth because it misses the most important feature: the kernel itself is learned.1 In a classic kernel method, the function is fixed. In a Transformer, the Query, Key, and Value vectors are not raw inputs; they are projections of the inputs using learned weight matrices. We can define these projections as a set of equations: \(Q = XW^Q\), \(K = XW^K\), and \(V = XW^V\). The model isn’t given a similarity function; it learns the optimal similarity function for the task at hand by optimizing these weight matrices. This parameterization is the source of its power and flexibility.
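To make the distinction concrete, here is a minimal NumPy sketch (random matrices stand in for trained weights, so the numbers are purely illustrative): the Nadaraya-Watson estimator uses a kernel fixed by hand, while attention builds its “kernel” from projections whose weight matrices would normally be learned by gradient descent.

```python
import numpy as np

def nadaraya_watson(x, xs, ys, bandwidth=1.0):
    """Classic 1-D kernel regression: the Gaussian kernel is fixed by hand."""
    weights = np.exp(-0.5 * ((x - xs) / bandwidth) ** 2)
    return (weights @ ys) / weights.sum()

rng = np.random.default_rng(0)
n, d_model, d_k = 5, 16, 8
X = rng.normal(size=(n, d_model))          # n token embeddings

# Learned projections (random stand-ins here; trained by backprop in practice)
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
Q, K, V = X @ W_Q, X @ W_K, X @ W_V

# Scaled dot-product attention: the "kernel" exp(q·k / sqrt(d_k)) depends on
# W_Q and W_K, so the notion of similarity itself is learned, not fixed.
scores = Q @ K.T / np.sqrt(d_k)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
attention_output = weights @ V                     # shape (n, d_k)
```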

Myth 2: “The ‘Attention is All You Need’ paper was the entire revolution”

This myth places the entire AI revolution at the feet of a single paper from 2017. While that paper was the catalyst, it was like a blueprint for a powerful new engine. A blueprint is useless until someone figures out how to build a car around it and, crucially, discovers what kind of fuel makes it run. The revolution required two massive, subsequent empirical breakthroughs to show the world how to use the Transformer blueprint effectively: BERT and GPT.

BERT: The revolution in understanding

Google’s BERT (2018) was the first “car” built from the Transformer’s encoder block. Its creators discovered a new kind of fuel: self-supervised learning on web-scale text. Rather than training the model to generate text, they taught it to understand language deeply using two objectives:

  1. Masked Language Model (MLM): By masking words in a sentence and forcing the model to predict them based on surrounding context (both left and right), BERT learned a rich, bidirectional representation of language.
  2. Next Sentence Prediction (NSP): By making the model predict whether two sentences were sequential, it learned about sentence-level relationships.2

BERT shattered NLP benchmarks and proved that a pre-trained Transformer could be a universal starting point for a vast range of language understanding tasks.
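To make the Masked Language Model objective concrete, here is a simplified sketch of how an MLM training pair can be constructed. (The real BERT recipe is more involved: of the tokens selected for prediction, 80% are replaced with [MASK], 10% with a random token, and 10% left unchanged. The helper below is illustrative, not BERT’s actual preprocessing code.)

```python
import random

MASK_TOKEN = "[MASK]"

def make_mlm_example(tokens, mask_prob=0.15, seed=0):
    """Return (inputs, labels): inputs with ~15% of tokens masked,
    labels holding the original token only at masked positions."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK_TOKEN)   # the model must recover this token
            labels.append(tok)          # loss is computed only here
        else:
            inputs.append(tok)
            labels.append(None)         # position is ignored by the loss
    return inputs, labels

inputs, labels = make_mlm_example("the cat sat on the mat".split())
```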

GPT: The revolution in scaling

OpenAI’s GPT series took a different road, using only the Transformer’s decoder block. The “car” was simpler, but they discovered a high-octane rocket fuel: massive scale. Their training objective was classic autoregressive language modeling: simply predict the next word, maximizing the log-likelihood of each token given the tokens that precede it:

\(\mathcal{L} = \sum_i \log P(u_i | u_{i-k}, \dots, u_{i-1}; \Theta)\)
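As a minimal illustration of this objective (a toy NumPy sketch with random logits, not GPT’s actual implementation), the loss is just the negative sum of the log-probabilities the model assigns to each actual next token:

```python
import numpy as np

def autoregressive_nll(logits, token_ids):
    """Negative of the autoregressive objective: -sum_i log P(u_i | u_<i).
    logits: (T, V) array where logits[t] scores the token at position t+1.
    token_ids: (T,) array of integer token ids."""
    # numerically stable log-softmax over the vocabulary
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    # predict token t+1 from the representation at position t
    targets = token_ids[1:]
    log_likelihood = log_probs[np.arange(len(targets)), targets]
    return -log_likelihood.sum()

# toy example: a 6-token sequence over a 10-word vocabulary
rng = np.random.default_rng(0)
loss = autoregressive_nll(rng.normal(size=(6, 10)), rng.integers(0, 10, size=6))
```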

The revolution, especially with GPT-2 and GPT-3, was the empirical discovery that this simple objective, when combined with astronomical increases in model size, data, and compute, led to shocking emergent capabilities. GPT proved that qualitative leaps in performance could be achieved through quantitative increases in scale.

So, while the 2017 paper was the spark, it was the twin empirical revolutions of BERT (how to train for understanding) and GPT (the power of pure scale) that truly ignited the firestorm of modern AI. The story is more complex, and far more interesting, than the myths suggest.

Myth 3: “You can see a Transformer’s ‘reasoning’ by visualizing its Attention”

This is one of the most tempting and misleading myths in all of deep learning. The idea is that by plotting the attention scores from the softmax function as a heatmap, you can create a beautiful visualization showing which words the model “paid attention to” when producing a certain result. It seems like a direct window into the model’s brain.

This belief has been largely debunked.3 While attention maps can sometimes be plausible or interesting, they are not a reliable explanation of the model’s behavior. Landmark research has shown that there is often little correlation between high attention scores and the features that are truly important for the model’s prediction. The attention weights are just one intermediate value in a long chain of computations. The final output is also heavily influenced by the Value projections, the residual connections that bypass the attention block entirely, and especially the massive Feed-Forward Networks that process the attention block’s output. Many attention heads learn redundant or uninterpretable patterns, and you can often drastically alter the attention map without changing the model’s final output. Believing the attention map is believing a convenient but unfaithful narrator. For more robust qualitative interpretability studies, gradient-based attribution tools such as Captum, which implements methods like integrated gradients, tend to be more reliable.
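As a toy illustration of why these two notions can diverge, the sketch below uses a single random, untrained attention head in PyTorch (purely for demonstration, not a claim about any particular model) and compares the attention row for one position against a simple gradient-based saliency score for the same prediction. The two rankings need not agree:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n, d_model, d_k = 6, 16, 8
x = torch.randn(n, d_model, requires_grad=True)    # toy "token" embeddings

# A toy single-head attention layer with random (untrained) projections.
W_q, W_k, W_v = (torch.randn(d_model, d_k) for _ in range(3))
W_out = torch.randn(d_k, 1)

q, k, v = x @ W_q, x @ W_k, x @ W_v
attn = F.softmax(q @ k.T / d_k ** 0.5, dim=-1)      # (n, n) attention map
scalar_out = (attn @ v @ W_out)[0].squeeze()        # prediction at position 0

# "Explanation" 1: the attention row for position 0.
attention_importance = attn[0].detach()

# "Explanation" 2: gradient saliency: how sensitive the output is to each token.
scalar_out.backward()
gradient_importance = x.grad.norm(dim=-1)

# The two importance rankings often differ, which is the core of the critique.
print(attention_importance.argsort(descending=True))
print(gradient_importance.argsort(descending=True))
```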

History rhymes, maybe?

This is not a post on test-time compute, a more recent development often hailed as one of the next big things (thinking agents, anyone?). But I’ll say this: ultimately, these myths highlight our desire for simple explanations of inherently dynamic, complex models. Test-time compute is not merely about scaling up inference (or naively generating more tokens). Rather, it engages deeply with reinforcement learning principles to enhance model performance without altering the underlying pretrained parameters. This is why papers like s1 are exciting: there’s a chance that RL-based alignment techniques can scale. Test-time compute methods like chain-of-thought prompting and the introduction of thought tokens demonstrably boost performance, especially on question-answering tasks, through a kind of internal validation. Yet, crucially, it remains uncertain whether these intermediate steps genuinely represent the model’s reasoning or reliably inform interpretability, just as attention heatmaps turn out not to be a reliable interpretability tool.

  1. This learnable nature is precisely what makes attention difficult to interpret. With a fixed kernel, like a Gaussian kernel, the definition of “similarity” is clear. With attention, the model learns a high-dimensional, context-dependent notion of similarity that is effective but not easily reverse-engineered by human inspection.

  2. Interestingly, subsequent research (e.g., RoBERTa) found the Next Sentence Prediction (NSP) task to be largely unhelpful and potentially detrimental to model performance. Most modern encoder-style models have since dropped it, focusing solely on the Masked Language Model objective. This highlights how empirical discovery continues to refine our understanding.

  3. The paper “Attention is Not Explanation” by Jain and Wallace (2019) was a key work in this area, demonstrating through rigorous experiments that attention weights often fail to correlate with other feature importance metrics like gradients. This work cautioned the research community against over-interpreting attention maps as faithful explanations. See both “Attention is not Explanation” (Jain & Wallace, NAACL 2019) and the follow-up “Attention is not not Explanation” (Wiegreffe & Pinter, EMNLP-IJCNLP 2019).

