Attention Is All You Need: The Paper That Changed Everything
A deep dive into the transformer architecture — from the core self-attention mechanism to why positional encodings matter, with interactive Python visualizations.
The 2017 paper "Attention Is All You Need" by Vaswani et al. introduced the Transformer, an architecture that has since become the backbone of most large language models and of many state-of-the-art vision and speech systems. In this post I want to walk through the core ideas, motivated from first principles, and show some interactive demonstrations.
The Problem with RNNs
Before transformers, sequence modeling was dominated by recurrent networks (LSTMs, GRUs). They process tokens one at a time, left to right, accumulating a hidden state. This is inherently sequential — you can't parallelize across time — and it struggles to propagate information across long sequences.
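To make the sequential bottleneck concrete, here is a minimal sketch of a vanilla RNN forward pass. The function name `rnn_forward`, the weight names, and the tanh nonlinearity are illustrative assumptions, not something taken from the paper. The point to notice is that each step needs the hidden state from the step before it, so the loop over time cannot run in parallel.

```python
import numpy as np

def rnn_forward(X, W_xh, W_hh, b_h):
    """Vanilla RNN forward pass (illustrative sketch, names are hypothetical)."""
    seq_len = X.shape[0]
    d_hidden = W_hh.shape[0]
    h = np.zeros(d_hidden)
    states = []
    for t in range(seq_len):
        # h depends on the previous h, so step t cannot start before step t-1 ends.
        h = np.tanh(X[t] @ W_xh + h @ W_hh + b_h)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
seq_len, d_in, d_hidden = 5, 8, 6
X = rng.standard_normal((seq_len, d_in))
W_xh = rng.standard_normal((d_in, d_hidden)) * 0.1
W_hh = rng.standard_normal((d_hidden, d_hidden)) * 0.1
b_h = np.zeros(d_hidden)

H = rnn_forward(X, W_xh, W_hh, b_h)
print(H.shape)  # (5, 6)
```

Information from the first token reaches the last one only by passing through every intermediate hidden state, which is where the long-range difficulty comes from.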
The intuition behind attention: let every position look directly at every other position, weighting how much to "attend" to each.
Self-Attention in Code
The core operation is elegantly simple. Given a sequence of input vectors packed into a matrix X (one row per token), we project it into three spaces: Queries, Keys, and Values.
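Written out in the paper's notation, with W^Q, W^K, and W^V denoting the learned projection matrices, d_k the key dimension, and the softmax applied row by row, the computation is:

```latex
\begin{aligned}
Q &= X W^Q, \qquad K = X W^K, \qquad V = X W^V \\
\mathrm{Attention}(Q, K, V) &= \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V
\end{aligned}
```

Each row of the softmax output is a probability distribution over positions, so each output row is a weighted average of the rows of V.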
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    Q = X @ W_Q
    K = X @ W_K
    V = X @ W_V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores)
    output = weights @ V
    return output, weights

# Toy example: 4 tokens, 8-dimensional embeddings, 4-dim attention
np.random.seed(42)
seq_len, d_model, d_k = 4, 8, 4
X = np.random.randn(seq_len, d_model)
W_Q = np.random.randn(d_model, d_k) * 0.1
W_K = np.random.randn(d_model, d_k) * 0.1
W_V = np.random.randn(d_model, d_k) * 0.1
out, attn_weights = self_attention(X, W_Q, W_K, W_V)
print("Attention weights (rows sum to 1):")
print(np.round(attn_weights, 3))
print(f"\nOutput shape: {out.shape}")
Visualizing Attention Patterns
Let's visualize what those attention weights look like — a heatmap shows which tokens attend to which.
import numpy as np
import matplotlib.pyplot as plt

tokens = ["The", "cat", "sat", "mat"]
n = len(tokens)

# Simulate attention with some structure (hand-picked scores, not model output)
raw = np.array([
    [2.0, 0.5, 0.2, 0.1],
    [0.3, 2.5, 0.4, 0.2],
    [0.4, 0.8, 2.0, 0.6],
    [0.1, 0.2, 0.5, 2.8],
])
# Row-wise softmax turns the raw scores into attention weights
attn = np.exp(raw) / np.exp(raw).sum(axis=1, keepdims=True)

fig, ax = plt.subplots(figsize=(5, 4))
im = ax.imshow(attn, cmap="Blues", vmin=0, vmax=1)
ax.set_xticks(range(n))
ax.set_yticks(range(n))
ax.set_xticklabels(tokens, fontsize=11)
ax.set_yticklabels(tokens, fontsize=11)
ax.set_xlabel("Keys (attend to)")
ax.set_ylabel("Queries (from)")
ax.set_title("Self-Attention Weights")
plt.colorbar(im, ax=ax)

# Annotate each cell, switching text color for contrast on dark cells
for i in range(n):
    for j in range(n):
        ax.text(j, i, f"{attn[i,j]:.2f}", ha="center", va="center",
                color="white" if attn[i,j] > 0.5 else "black", fontsize=9)

plt.tight_layout()
plt.show()
Multi-Head Attention
Rather than a single attention function, the transformer runs h attention heads in parallel, each with its own learned projections. The idea is that different heads can specialize in different types of relationships (syntax, coreference, semantics…).
import numpy as np

def multi_head_attention(X, heads=4):
    d_model = X.shape[-1]
    d_k = d_model // heads  # assumes d_model is divisible by heads
    results = []
    for _ in range(heads):
        # Random matrices stand in for each head's learned projections
        W_Q = np.random.randn(d_model, d_k) * 0.1
        W_K = np.random.randn(d_model, d_k) * 0.1
        W_V = np.random.randn(d_model, d_k) * 0.1
        Q = X @ W_Q
        K = X @ W_K
        V = X @ W_V
        scores = Q @ K.T / np.sqrt(d_k)
        # Subtract the row max before exponentiating for numerical stability
        e = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w = e / e.sum(axis=-1, keepdims=True)
        results.append(w @ V)
    # Concatenate the heads and apply the output projection
    concat = np.concatenate(results, axis=-1)
    W_O = np.random.randn(d_model, d_model) * 0.1
    return concat @ W_O

np.random.seed(0)
X = np.random.randn(6, 16)
out = multi_head_attention(X)
print(f"Input: {X.shape}")
print(f"Output: {out.shape}")
Why Positional Encodings?
Self-attention is permutation-equivariant — shuffle the tokens and the output shuffles accordingly. That means the model has no notion of order by default. Positional encodings inject position information into the embeddings.
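The equivariance claim is easy to check numerically. The sketch below redefines a compact single-head attention so that it runs on its own; the shapes, the random weights, and the variable names are arbitrary choices for illustration. It shuffles the input rows and confirms that the output rows are shuffled in exactly the same way.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))  # 5 tokens, 8-dim embeddings
W_Q, W_K, W_V = (rng.standard_normal((8, 4)) * 0.1 for _ in range(3))

perm = rng.permutation(5)  # a random reordering of the 5 tokens
out = self_attention(X, W_Q, W_K, W_V)
out_shuffled = self_attention(X[perm], W_Q, W_K, W_V)

# Attention on shuffled input equals the shuffled output of the original input.
equivariant = np.allclose(out_shuffled, out[perm])
print(equivariant)
```

Since nothing in the computation depends on where a row sits in X, "cat sat" and "sat cat" look the same to the model up to reordering, which is the gap positional encodings fill.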
The original paper used sinusoidal functions:
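In the paper's notation, for position pos and dimension index i, with d_model the embedding size:

```latex
\begin{aligned}
PE_{(pos,\, 2i)}   &= \sin\!\left(\frac{pos}{10000^{2i / d_{\text{model}}}}\right) \\
PE_{(pos,\, 2i+1)} &= \cos\!\left(\frac{pos}{10000^{2i / d_{\text{model}}}}\right)
\end{aligned}
```

Even dimensions get a sine and odd dimensions a cosine at the same frequency, and the wavelength grows as i increases.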
import numpy as np
import matplotlib.pyplot as plt

def positional_encoding(max_len, d_model):
    PE = np.zeros((max_len, d_model))
    positions = np.arange(max_len)[:, None]
    dims = np.arange(0, d_model, 2)
    angles = positions / 10000 ** (dims / d_model)
    PE[:, 0::2] = np.sin(angles)
    PE[:, 1::2] = np.cos(angles)
    return PE

PE = positional_encoding(50, 128)

fig, ax = plt.subplots(figsize=(10, 4))
im = ax.imshow(PE.T, aspect="auto", cmap="RdBu", vmin=-1, vmax=1)
ax.set_xlabel("Position")
ax.set_ylabel("Dimension")
ax.set_title("Sinusoidal Positional Encodings")
plt.colorbar(im, ax=ax)
plt.tight_layout()
plt.show()
Final Thoughts
The elegance of the transformer is that it replaces a complex inductive bias (recurrence = memory over time) with a general-purpose tool: learned, differentiable attention over all pairs. Because that computation parallelizes across the whole sequence, the architecture makes good use of additional compute, which is a large part of why it has scaled so well.
The key insight is that attention is just weighted averaging — but the weights are learned, content-dependent, and computed in parallel across the whole sequence.
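A tiny worked example makes the "weighted averaging" reading explicit. The numbers below are hypothetical, picked by hand for illustration rather than produced by a trained model. Each row of `weights` sums to 1, so each output row is a blend of the rows of `V`.

```python
import numpy as np

# Hypothetical attention weights for 3 tokens (each row sums to 1).
weights = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
])
# Hypothetical value vectors, one row per token.
V = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [2.0, 2.0],
])

output = weights @ V

# Row 0 computed by hand: 70% of token 0, 20% of token 1, 10% of token 2.
by_hand = 0.7 * V[0] + 0.2 * V[1] + 0.1 * V[2]
print(output[0], by_hand)
```

Because the weights are non-negative and sum to 1, every output coordinate stays between the smallest and largest value in the corresponding column of `V`.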
Next up: Positional encodings beyond sinusoids — rotary embeddings (RoPE) and ALiBi.