Andrew Hoang
ai · May 15, 2025 · 4 min read

Attention Is All You Need: The Paper That Changed Everything

A deep dive into the transformer architecture — from the core self-attention mechanism to why positional encodings matter, with interactive Python visualizations.

#transformers #nlp #deep-learning #attention

The 2017 paper "Attention Is All You Need" by Vaswani et al. introduced the Transformer, an architecture that now underpins most large language models and many state-of-the-art systems in vision, speech, and other areas. In this post I want to walk through the core ideas, motivated from first principles, and show some interactive demonstrations.

The Problem with RNNs

Before transformers, sequence modeling was dominated by recurrent networks (LSTMs, GRUs). They process tokens one at a time, left to right, accumulating a hidden state. This is inherently sequential: each step has to wait for the previous one, so the computation cannot be parallelized across time. Information from early tokens also has to survive every later update of the hidden state, which makes it hard to propagate across long sequences.
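
To make the sequential bottleneck concrete, here is a minimal sketch of a plain tanh RNN forward pass. The function name `rnn_forward`, the weight shapes, and the toy sizes are assumptions chosen for illustration; real LSTMs and GRUs add gating on top of the same loop structure.

```python
import numpy as np

def rnn_forward(X, W_x, W_h):
    """Run a plain tanh RNN over X of shape (seq_len, d_in); return all hidden states."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in X:
        # The new state depends on the previous one, so this loop
        # cannot be replaced by one parallel operation over time.
        h = np.tanh(x_t @ W_x + h @ W_h)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))           # 5 tokens, 8-dim inputs (toy sizes)
W_x = rng.standard_normal((8, 4)) * 0.1   # input-to-hidden weights
W_h = rng.standard_normal((4, 4)) * 0.1   # hidden-to-hidden weights

H = rnn_forward(X, W_x, W_h)
print(H.shape)  # (5, 4)
```

The only path from the first token to the last hidden state runs through every intermediate `h`, which is the long-range propagation problem described above.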

The intuition behind attention: let every position look directly at every other position, weighting how much to "attend" to each.

Self-Attention in Code

The core operation is elegantly simple. Given a sequence of input vectors packed into a matrix $X$, we project it into three spaces: Queries, Keys, and Values.
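
In matrix form, with $Q = XW_Q$, $K = XW_K$, $V = XW_V$ and $d_k$ the dimension of the queries and keys, this is the scaled dot-product attention defined in the paper. The softmax is applied row by row, so each query gets its own distribution over keys:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```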

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

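def self_attention(X, W_Q, W_K, W_V):
    # Project the inputs into query, key, and value spaces
    Q = X @ W_Q
    K = X @ W_K
    V = X @ W_V
    d_k = Q.shape[-1]
    # Similarity of every query with every key, scaled by sqrt(d_k)
    scores = Q @ K.T / np.sqrt(d_k)
    # Normalize each row into a distribution over keys
    weights = softmax(scores)
    # Each output row is a weighted average of the value vectors
    output = weights @ V
    return output, weights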

# Toy example: 4 tokens, 8-dimensional embeddings, 4-dim attention
np.random.seed(42)
seq_len, d_model, d_k = 4, 8, 4
X = np.random.randn(seq_len, d_model)
W_Q = np.random.randn(d_model, d_k) * 0.1
W_K = np.random.randn(d_model, d_k) * 0.1
W_V = np.random.randn(d_model, d_k) * 0.1

out, attn_weights = self_attention(X, W_Q, W_K, W_V)
print("Attention weights (rows sum to 1):")
print(np.round(attn_weights, 3))
print(f"\nOutput shape: {out.shape}")
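
The division by `np.sqrt(d_k)` in `self_attention` is easy to overlook. The sketch below illustrates the usual motivation under a simplifying assumption: if query and key entries are independent with zero mean and unit variance, their dot product has variance close to `d_k`, and dividing by `sqrt(d_k)` brings it back to about 1. The sample size and `d_k = 64` are arbitrary choices for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64
q = rng.standard_normal((10_000, d_k))
k = rng.standard_normal((10_000, d_k))

# 10,000 independent query-key dot products
dots = (q * k).sum(axis=-1)
var_raw = dots.var()
var_scaled = (dots / np.sqrt(d_k)).var()

print(f"variance without scaling: {var_raw:.1f}")     # close to d_k
print(f"variance with scaling:    {var_scaled:.2f}")  # close to 1
```

Without the scaling, scores grow with `d_k`, and the softmax tends to put almost all weight on a single key, which leaves very small gradients for the rest.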

Visualizing Attention Patterns

Let's visualize what those attention weights look like — a heatmap shows which tokens attend to which.

import numpy as np
import matplotlib.pyplot as plt

tokens = ["The", "cat", "sat", "mat"]
n = len(tokens)

# Hand-picked scores with a dominant diagonal: each token attends mostly to itself
raw = np.array([
    [2.0, 0.5, 0.2, 0.1],
    [0.3, 2.5, 0.4, 0.2],
    [0.4, 0.8, 2.0, 0.6],
    [0.1, 0.2, 0.5, 2.8],
])
attn = np.exp(raw) / np.exp(raw).sum(axis=1, keepdims=True)

fig, ax = plt.subplots(figsize=(5, 4))
im = ax.imshow(attn, cmap="Blues", vmin=0, vmax=1)
ax.set_xticks(range(n)); ax.set_yticks(range(n))
ax.set_xticklabels(tokens, fontsize=11)
ax.set_yticklabels(tokens, fontsize=11)
ax.set_xlabel("Keys (attend to)"); ax.set_ylabel("Queries (from)")
ax.set_title("Self-Attention Weights")
plt.colorbar(im, ax=ax)
for i in range(n):
    for j in range(n):
        ax.text(j, i, f"{attn[i,j]:.2f}", ha="center", va="center",
                color="white" if attn[i,j] > 0.5 else "black", fontsize=9)
plt.tight_layout()
plt.show()

Multi-Head Attention

Rather than a single attention function, the transformer runs $h$ attention heads in parallel, each with its own learned projections. The idea is that different heads can specialize in different types of relationships (syntax, coreference, semantics…).

import numpy as np

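def multi_head_attention(X, heads=4):
    # Random matrices stand in for learned weights in this demo
    d_model = X.shape[-1]
    assert d_model % heads == 0, "d_model must be divisible by heads"
    d_k = d_model // heads
    results = []
    for _ in range(heads):
        # Each head gets its own projections into a d_k-dimensional subspace
        W_Q = np.random.randn(d_model, d_k) * 0.1
        W_K = np.random.randn(d_model, d_k) * 0.1
        W_V = np.random.randn(d_model, d_k) * 0.1
        Q = X @ W_Q; K = X @ W_K; V = X @ W_V
        scores = Q @ K.T / np.sqrt(d_k)
        # Numerically stable softmax over the key axis
        e = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w = e / e.sum(axis=-1, keepdims=True)
        results.append(w @ V)
    # Concatenate the heads and apply the output projection
    concat = np.concatenate(results, axis=-1)
    W_O = np.random.randn(d_model, d_model) * 0.1
    return concat @ W_O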

np.random.seed(0)
X = np.random.randn(6, 16)
out = multi_head_attention(X)
print(f"Input:  {X.shape}")
print(f"Output: {out.shape}")
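
The function above draws fresh random weights on every call, which keeps the demo short but hides the fact that the projections are parameters. As a sketch of one common alternative layout (an assumption, not the only way to write it), the version below takes the weights as arguments and computes all heads in one batched operation by splitting full `(d_model, d_model)` projections into `heads` slices. The name `multi_head_attention_params` is made up for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention_params(X, W_Q, W_K, W_V, W_O, heads):
    """X: (n, d_model); W_Q, W_K, W_V, W_O: (d_model, d_model)."""
    n, d_model = X.shape
    assert d_model % heads == 0, "d_model must be divisible by heads"
    d_k = d_model // heads

    def split(M):
        # (n, d_model) -> (heads, n, d_k)
        return M.reshape(n, heads, d_k).transpose(1, 0, 2)

    Q, K, V = split(X @ W_Q), split(X @ W_K), split(X @ W_V)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (heads, n, n)
    weights = softmax(scores, axis=-1)
    per_head = weights @ V                             # (heads, n, d_k)
    # Concatenate heads back into (n, d_model), then project
    concat = per_head.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ W_O, weights

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 16))
W_Q, W_K, W_V, W_O = (rng.standard_normal((16, 16)) * 0.1 for _ in range(4))

out, weights = multi_head_attention_params(X, W_Q, W_K, W_V, W_O, heads=4)
print(out.shape)      # (6, 16)
print(weights.shape)  # (4, 6, 6): one attention matrix per head
```

Returning the per-head weights makes it possible to plot one heatmap per head and compare what each head attends to.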

Why Positional Encodings?

Self-attention is permutation-equivariant — shuffle the tokens and the output shuffles accordingly. That means the model has no notion of order by default. Positional encodings inject position information into the embeddings.
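
The equivariance claim is easy to check numerically. The sketch below uses a compact single-head attention (same computation as earlier in the post, with toy sizes chosen for the demo), shuffles the input rows, and confirms that the output rows are shuffled in exactly the same way.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))
W_Q, W_K, W_V = (rng.standard_normal((8, 4)) * 0.1 for _ in range(3))

perm = rng.permutation(5)                          # a random reordering of the tokens
out = self_attention(X, W_Q, W_K, W_V)
out_perm = self_attention(X[perm], W_Q, W_K, W_V)

# Shuffling the input shuffles the output the same way
print(np.allclose(out_perm, out[perm]))  # True
```

Nothing in the computation refers to a token's index, so "cat sat" and "sat cat" produce the same vectors in a different order.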

The original paper used sinusoidal functions:

$$PE(pos, 2i) = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

$$PE(pos, 2i+1) = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

import numpy as np
import matplotlib.pyplot as plt

def positional_encoding(max_len, d_model):
    # Assumes d_model is even so sine and cosine columns pair up
    PE = np.zeros((max_len, d_model))
    positions = np.arange(max_len)[:, None]   # shape (max_len, 1)
    dims = np.arange(0, d_model, 2)           # the even indices 2i
    angles = positions / 10000 ** (dims / d_model)
    PE[:, 0::2] = np.sin(angles)              # even dimensions
    PE[:, 1::2] = np.cos(angles)              # odd dimensions
    return PE

PE = positional_encoding(50, 128)

fig, ax = plt.subplots(figsize=(10, 4))
im = ax.imshow(PE.T, aspect="auto", cmap="RdBu", vmin=-1, vmax=1)
ax.set_xlabel("Position"); ax.set_ylabel("Dimension")
ax.set_title("Sinusoidal Positional Encodings")
plt.colorbar(im, ax=ax)
plt.tight_layout()
plt.show()
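
Two things are worth seeing in code: how the encodings are injected, and one property of the sinusoids that makes them convenient. The sketch below adds the first `seq_len` rows of `PE` to a stand-in embedding matrix (random values, an assumption for the demo), then checks that the dot product between two encodings depends only on the offset between their positions. That follows from the identity sin(a)sin(b) + cos(a)cos(b) = cos(a - b).

```python
import numpy as np

def positional_encoding(max_len, d_model):
    PE = np.zeros((max_len, d_model))
    positions = np.arange(max_len)[:, None]
    dims = np.arange(0, d_model, 2)
    angles = positions / 10000 ** (dims / d_model)
    PE[:, 0::2] = np.sin(angles)
    PE[:, 1::2] = np.cos(angles)
    return PE

PE = positional_encoding(50, 128)

# Injecting position: add the encodings to the token embeddings
rng = np.random.default_rng(0)
seq_len = 6
X = rng.standard_normal((seq_len, 128))   # stand-in embeddings
X_pos = X + PE[:seq_len]

# Similarity between two positions depends only on their offset
G = PE @ PE.T
print(np.allclose(G[0, 5], G[10, 15]), np.allclose(G[10, 15], G[30, 35]))  # True True
print(np.allclose(np.diag(G), 128 / 2))  # True: sin^2 + cos^2 = 1 for each of the 64 pairs
```

Because `X_pos` differs from row to row even for identical tokens, attention applied to it is no longer blind to order.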

Final Thoughts

The elegance of the transformer is that it replaces a complex inductive bias (recurrence = memory over time) with a general-purpose tool: learned, differentiable attention over all pairs of positions. Because every position is processed in parallel, training makes good use of modern hardware, and that is a large part of why the architecture keeps improving as you give it more data and compute.

The key insight is that attention is just weighted averaging — but the weights are learned, content-dependent, and computed in parallel across the whole sequence.
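
The "weighted averaging" reading can be verified directly. In the sketch below (random scores and values, chosen only for illustration), every row of the softmax output is non-negative and sums to 1, so each output vector is a convex combination of the value vectors and can never leave the range they span.

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.standard_normal((4, 4))   # stand-in attention scores
V = rng.standard_normal((4, 3))        # stand-in value vectors

e = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)
output = weights @ V

# Weights form a probability distribution over keys for each query
print(np.all(weights >= 0), np.allclose(weights.sum(axis=-1), 1.0))  # True True

# Each output coordinate lies between the min and max of that column of V
inside = (output >= V.min(axis=0) - 1e-12) & (output <= V.max(axis=0) + 1e-12)
print(inside.all())  # True
```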


Next up: Positional encodings beyond sinusoids — rotary embeddings (RoPE) and ALiBi.