Transformers — A Visual & Intuitive Guide
Reference:
Transformers Explained Visually (Part 1): Overview of Functionality
Transformers Explained Visually (Part 2): How it works, step-by-step
Transformers Explained Visually (Part 3): Multi-head Attention, deep dive
Transformers Explained Visually — Not Just How, but Why They Work So Well
Foundations of NLP Explained — Bleu Score and WER Metrics
Foundations of NLP Explained Visually: Beam Search, How It Works
A consolidated blog inspired by the Transformers Explained Visually series and the Foundations of NLP Explained posts.
1. Why Transformers Exist
Before transformers, NLP was dominated by RNNs / LSTMs.
Problems with RNNs
| Issue | What Happens |
|---|---|
| Sequential computation | Cannot parallelize efficiently |
| Long-range dependency | Forget early tokens |
| Gradient problems | Vanishing / exploding gradients |
| Memory bottleneck | Compress whole sentence into hidden state |
Example:
In the sentence:
"The animal didn't cross the road because it was too tired"
An RNN must carry "the animal" through many hidden states before it can resolve "it".
Transformers solve this:
- Every token directly attends to every other token.
- The whole sentence is never compressed into a single state.
2. High-Level Transformer Architecture
Input Tokens
↓
Embedding + Positional Encoding
↓
[ Encoder Blocks ] × N
↓
[ Decoder Blocks ] × N
↓
Linear + Softmax
↓
Next Token Probability
Each block contains:
Multi‑Head Attention
↓
Add & Norm
↓
Feed Forward Network
↓
Add & Norm
Key idea:
Transformers replace recurrence with attention + parallel computation.
3. Attention — The Core Idea
Instead of remembering history, the model looks back at relevant words.
Query, Key, Value intuition
Think of searching a database.
| Component | Meaning |
|---|---|
| Query | What I am looking for |
| Key | What each word offers |
| Value | Information to retrieve |
The attention computation:
Attention(Q,K,V) = softmax(QKᵀ / √dₖ) V
where dₖ is the key dimension; dividing by √dₖ keeps the dot products small enough that the softmax stays well-behaved.
Interpretation:
Each word asks: Which other words matter for me right now?
Example:
Sentence: "The bank raised interest rates"
The word bank attends to "interest" → financial meaning.
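The formula above fits in a few lines of NumPy. This is a minimal sketch: the random Q, K, V matrices stand in for learned projections of real token embeddings.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # query-key similarity
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))  # 3 tokens, d_k = 4
out, weights = attention(Q, K, V)
# weights[i, j] = how much token i attends to token j
```

Row i of `weights` is exactly the answer to "which other words matter for token i".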
4. Multi‑Head Attention — Why Multiple Heads?
One attention map = one relationship type.
Multiple heads allow multiple linguistic views simultaneously.
| Head | Learns |
|---|---|
| 1 | Subject ↔ Verb |
| 2 | Coreference (he/she/it) |
| 3 | Syntax structure |
| 4 | Semantic meaning |
Visual Intuition
Instead of one opinion, the model forms a committee of experts.
Sentence → Head1
→ Head2
→ Head3
→ Head4
↓
Concatenate
↓
Projection
Multi‑head attention = ensemble reasoning inside one layer.
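The split-attend-concatenate-project pipeline can be sketched directly. A minimal illustration with random weights; in a real model Wq, Wk, Wv, Wo are learned.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 5, 8, 2
d_head = d_model // n_heads

X = rng.normal(size=(seq_len, d_model))
# one projection per role; heads come from splitting the feature axis
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) for _ in range(4))

def split_heads(M):  # (seq, d_model) -> (heads, seq, d_head)
    return M.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

Q, K, V = split_heads(X @ Wq), split_heads(X @ Wk), split_heads(X @ Wv)
scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
scores -= scores.max(axis=-1, keepdims=True)
w = np.exp(scores)
w /= w.sum(axis=-1, keepdims=True)        # one attention map per head
heads = w @ V                             # (heads, seq, d_head)
concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
out = concat @ Wo                         # final projection mixes the heads
```

Each head computes its own attention map over the same tokens, which is the "committee" above made concrete.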
5. Positional Encoding — Adding Order Without RNNs
Transformers read all words simultaneously.
So we must inject position information.
They use sinusoidal functions:
PE(pos,2i) = sin(pos / 10000^(2i/d))
PE(pos,2i+1) = cos(pos / 10000^(2i/d))
Why sinusoidal?
- Provides relative distance information
- Allows extrapolation to longer sequences
- Smooth continuous representation
The model learns distance relationships mathematically instead of memorizing order.
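The two sinusoidal formulas translate directly to NumPy (assuming an even d_model, as is standard):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d)); PE(pos, 2i+1) = cos(same angle)."""
    pos = np.arange(max_len)[:, None]        # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]     # (1, d_model/2)
    angle = pos / 10000 ** (2 * i / d_model)
    pe = np.empty((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)              # even dimensions
    pe[:, 1::2] = np.cos(angle)              # odd dimensions
    return pe

pe = positional_encoding(max_len=50, d_model=16)
# pe[p] is simply added to the embedding of the token at position p
```

Each dimension pair oscillates at a different frequency, so nearby positions get similar vectors and distant ones diverge smoothly.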
6. Feed‑Forward Network (FFN)
After attention mixes information, FFN processes meaning locally.
FFN(x) = max(0, xW₁ + b₁)W₂ + b₂
Role:
- Attention = communication between tokens
- FFN = thinking inside each token
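The "thinking inside each token" claim is visible in code: the FFN is the same two-layer MLP applied to every token independently. A minimal sketch with random weights:

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """Position-wise feed-forward: FFN(x) = max(0, xW1 + b1)W2 + b2."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2  # ReLU, then project back down

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 8, 32, 5  # d_ff is typically ~4x d_model
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
x = rng.normal(size=(seq_len, d_model))
y = ffn(x, W1, b1, W2, b2)  # same shape as x; tokens never mix here
```

Because the matrix multiply acts row by row, each token's output depends only on that token's input, unlike attention.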
7. Why Transformers Work So Well
1. Direct Token Interaction
No information bottleneck.
RNN:
A → hidden → hidden → hidden → B
Transformer:
A ↔ B (1 step)
2. Parallel Training
All tokens processed simultaneously → massive GPU efficiency.
3. Layered Reasoning
Early layers: syntax
Middle layers: relations
Late layers: semantics & tasks
4. Emergent Abilities
Scale leads to:
- reasoning
- translation
- coding
- planning
Because attention builds a differentiable reasoning graph.
8. Text Generation — Beam Search
Greedy decoding picks the single best token at each step.
Problem: locally optimal choices are not always globally optimal.
Beam Search
Keep top k sequences simultaneously.
Example (beam = 2; scores are cumulative sequence probabilities):
Step1: I (0.6), We (0.4)
Step2: I am (0.5), I was (0.1), We are (0.3), We were (0.1)
Keep best 2 → I am, We are
Beam search explores multiple futures before committing.
Tradeoff:
| Beam Size | Effect |
|---|---|
| Small | Faster, but may miss better sequences |
| Large | More thorough, but slower and can favor bland, repetitive output |
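The example above can be run end to end with a toy conditional probability table. The probabilities here are hand-made for illustration (loosely mirroring the step-by-step example), not from any real model.

```python
import math

# Hand-made conditional next-token probabilities (illustrative only).
probs = {
    ("<s>",): {"I": 0.6, "We": 0.4},
    ("<s>", "I"): {"am": 0.8, "was": 0.2},
    ("<s>", "We"): {"are": 0.75, "were": 0.25},
}

def beam_search(probs, k=2, steps=2):
    beams = [(("<s>",), 0.0)]  # (sequence, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for seq, lp in beams:
            for tok, p in probs.get(seq, {}).items():
                candidates.append((seq + (tok,), lp + math.log(p)))
            if seq not in probs:             # finished beams carry over unchanged
                candidates.append((seq, lp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return beams

beams = beam_search(probs, k=2)
# top beams: ("<s>", "I", "am") and ("<s>", "We", "are")
```

Log-probabilities are summed instead of multiplying probabilities, which avoids underflow on long sequences; the ranking is identical.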
9. Evaluation Metrics
BLEU Score (Translation Quality)
Measures n‑gram overlap with reference text.
BLEU ≈ modified n‑gram precision (typically n = 1–4), combined with a brevity penalty for overly short outputs.
Limitations:
- Cannot measure meaning
- Penalizes paraphrases
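The core of BLEU, modified n-gram precision, is short to implement. This sketch covers a single n and a single reference; full BLEU combines n = 1..4 and adds the brevity penalty. The example sentences are made up.

```python
from collections import Counter

def modified_ngram_precision(candidate, reference, n):
    """Fraction of candidate n-grams found in the reference, with clipped counts."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    matches = sum(min(c, ref[g]) for g, c in cand.items())  # clip repeated n-grams
    return matches / max(sum(cand.values()), 1)

cand = "the cat sat on the mat".split()
ref = "the cat is on the mat".split()
p1 = modified_ngram_precision(cand, ref, 1)  # 5/6: every word but "sat" matches
```

The clipping (`min(c, ref[g])`) stops a candidate from gaming the score by repeating a matching word.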
WER (Word Error Rate)
Used in speech recognition.
WER = (Substitutions + Insertions + Deletions) / Total Words
Lower is better.
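The formula above is just word-level edit distance divided by the reference length, computable with the standard Levenshtein dynamic program. A minimal sketch (example sentences made up):

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + insertions + deletions) / reference words."""
    r, h = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i ref words and first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[-1][-1] / len(r)

wer("the cat sat on the mat", "the cat sat on mat")  # one deletion -> 1/6
```

Note WER can exceed 1.0 when the hypothesis has many spurious insertions.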
10. Putting It All Together
Transformer pipeline:
Tokens
→ Embedding
→ Positional Encoding
→ Self Attention (understand relationships)
→ Feed Forward (process meaning)
→ Repeat layers (build abstraction)
→ Decoder predicts next token
→ Beam search generates sequence
→ BLEU/WER evaluate quality
Final Intuition
Transformers are powerful because they:
- Replace memory with relationships
- Replace recurrence with parallel reasoning
- Replace rules with learned structure
A transformer is not reading text sequentially — it is building a graph of meaning and reasoning over it.
One‑Sentence Summary
RNNs remember the past. Transformers understand the entire context at once.