Transformers — A Visual & Intuitive Guide

References:

  • Transformers Explained Visually (Part 1): Overview of Functionality
  • Transformers Explained Visually (Part 2): How it works, step-by-step
  • Transformers Explained Visually (Part 3): Multi-head Attention, deep dive
  • Transformers Explained Visually — Not Just How, but Why They Work So Well
  • Foundations of NLP Explained — Bleu Score and WER Metrics
  • Foundations of NLP Explained Visually: Beam Search, How It Works

A consolidated blog inspired by the Transformers Explained Visually series and the Foundations of NLP Explained posts.


1. Why Transformers Exist

Before transformers, NLP was dominated by RNNs / LSTMs.

Problem with RNNs

| Issue | What Happens |
| --- | --- |
| Sequential computation | Cannot parallelize efficiently |
| Long-range dependency | Forgets early tokens |
| Gradient problems | Vanishing / exploding gradients |
| Memory bottleneck | Whole sentence compressed into a single hidden state |

Example:

In the sentence:
"The animal didn't cross the road because it was too tired"

An RNN must carry the subject ("animal") across many steps before it can resolve "it".

Transformers solve this by letting every token attend directly to every other token.

There is no compression into a single hidden state.


2. High-Level Transformer Architecture

Input Tokens
     ↓
Embedding + Positional Encoding
     ↓
[ Encoder Blocks ] × N
     ↓
[ Decoder Blocks ] × N
     ↓
Linear + Softmax
     ↓
Next Token Probability

Each block contains:

Multi‑Head Attention
      ↓
Add & Norm
      ↓
Feed Forward Network
      ↓
Add & Norm

Key idea:

Transformers replace recurrence with attention + parallel computation.


3. Attention — The Core Idea

Instead of remembering history, the model looks back at relevant words.

Query, Key, Value intuition

Think of searching a database.

| Component | Meaning |
| --- | --- |
| Query | What I am looking for |
| Key | What each word offers |
| Value | The information to retrieve |

The attention score:

Attention(Q,K,V) = softmax(QKᵀ / √d) V

Interpretation:

Each word asks: Which other words matter for me right now?

Example:

Sentence: "The bank raised interest rates"

The word bank attends to "interest" → financial meaning.
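
The formula above can be sketched in a few lines of NumPy. This is a minimal illustration with random toy matrices, not a trained model; the shapes and helper names are my own.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)       # (seq, seq): how much each token matches each other
    weights = softmax(scores, axis=-1)  # each row is a probability distribution
    return weights @ V, weights

# Toy example: 3 tokens, dimension 4
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out, w = attention(Q, K, V)
print(out.shape)       # (3, 4): one mixed vector per token
print(w.sum(axis=-1))  # each row of attention weights sums to 1
```

Each output row is a weighted blend of the value vectors, with the weights answering exactly the question in the text: which other words matter for me right now?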


4. Multi‑Head Attention — Why Multiple Heads?

One attention map = one relationship type.

Multiple heads allow multiple linguistic views simultaneously.

| Head | Learns |
| --- | --- |
| 1 | Subject ↔ Verb agreement |
| 2 | Coreference (he / she / it) |
| 3 | Syntactic structure |
| 4 | Semantic meaning |

Visual Intuition

Instead of one opinion, the model forms a committee of experts.

Sentence → Head1
         → Head2
         → Head3
         → Head4
              ↓
        Concatenate
              ↓
         Projection

Multi‑head attention = ensemble reasoning inside one layer.
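
The split → attend → concatenate → project pattern can be sketched as follows. The random weight matrices stand in for learned parameters, and splitting a shared Q/K/V projection by slicing is one common formulation; this is an illustration, not the only way to implement it.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, num_heads, rng):
    """Split the model dimension across heads, attend per head, concat, project."""
    seq, d_model = X.shape
    d_head = d_model // num_heads
    # Random projections stand in for learned weight matrices
    Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for h in range(num_heads):
        s = slice(h * d_head, (h + 1) * d_head)          # this head's slice of dimensions
        scores = Q[:, s] @ K[:, s].T / np.sqrt(d_head)
        heads.append(softmax(scores) @ V[:, s])          # one "expert view" of the sentence
    return np.concatenate(heads, axis=-1) @ Wo           # committee verdict: concat + project

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))  # 5 tokens, d_model = 8
out = multi_head_attention(X, num_heads=4, rng=rng)
print(out.shape)  # (5, 8)
```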


5. Positional Encoding — Adding Order Without RNNs

Transformers read all words simultaneously.

So we must inject position information.

They use sinusoidal functions:

PE(pos,2i)   = sin(pos / 10000^(2i/d))
PE(pos,2i+1) = cos(pos / 10000^(2i/d))

Why sinusoidal?

  • Provides relative distance information
  • Allows extrapolation to longer sequences
  • Smooth continuous representation

The model learns distance relationships mathematically instead of memorizing order.
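
The two sinusoidal formulas above translate directly into code; each pair of dimensions gets one frequency, with sin on the even index and cos on the odd one:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal encodings: sin on even dimensions, cos on odd ones."""
    pos = np.arange(max_len)[:, None]            # (max_len, 1) token positions
    i = np.arange(d_model // 2)[None, :]         # (1, d_model/2) frequency indices
    angles = pos / (10000 ** (2 * i / d_model))  # one frequency per dimension pair
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(max_len=50, d_model=16)
print(pe.shape)  # (50, 16)
print(pe[0])     # position 0: sin terms are 0, cos terms are 1
```

These vectors are simply added to the token embeddings, so the same word at different positions gets a slightly different input representation.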


6. Feed‑Forward Network (FFN)

After attention mixes information, FFN processes meaning locally.

FFN(x) = max(0, xW₁ + b₁)W₂ + b₂

Role:

  • Attention = communication between tokens
  • FFN = thinking inside each token

7. Why Transformers Work So Well

1. Direct Token Interaction

No information bottleneck.

RNN:

A → hidden → hidden → hidden → B

Transformer:

A ↔ B (1 step)

2. Parallel Training

All tokens processed simultaneously → massive GPU efficiency.

3. Layered Reasoning

Early layers: syntax
Middle layers: relations
Late layers: semantics & tasks

4. Emergent Abilities

Scale leads to:

  • reasoning
  • translation
  • coding
  • planning

Because attention builds a differentiable reasoning graph.


8. Beam Search

Greedy decoding picks the single most probable token at each step.

Problem: locally optimal ≠ globally optimal

Beam search instead keeps the top k candidate sequences at each step.

Example (beam=2):

Step1:  I (0.6), We (0.4)
Step2:  I am (0.5), I was (0.1), We are (0.3), We were (0.1)
Keep best 2 → I am, We are

Beam search explores multiple futures before committing.
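
A toy implementation of this idea, using the probabilities from the example above (the per-step conditionals such as 0.5/0.6 are derived from the joint probabilities shown in the text; the function and table names are my own):

```python
import math

def beam_search(step_probs, start, beam_width=2):
    """Toy beam search. step_probs(sequence) -> {next_token: prob},
    empty dict meaning the sequence is finished.
    Keeps the beam_width highest-scoring sequences at every step."""
    beams = [([start], 0.0)]  # (tokens, cumulative log-probability)
    while True:
        candidates = []
        done = True
        for tokens, score in beams:
            options = step_probs(tuple(tokens))
            if not options:                 # finished sequence: carry it forward
                candidates.append((tokens, score))
                continue
            done = False
            for tok, p in options.items():  # expand every beam by every option
                candidates.append((tokens + [tok], score + math.log(p)))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
        if done:
            return beams

# Toy model matching the example in the text
TABLE = {
    ("<s>",): {"I": 0.6, "We": 0.4},
    ("<s>", "I"): {"am": 0.5 / 0.6, "was": 0.1 / 0.6},
    ("<s>", "We"): {"are": 0.3 / 0.4, "were": 0.1 / 0.4},
}
best = beam_search(lambda seq: TABLE.get(seq, {}), "<s>")
print([" ".join(t[1:]) for t, _ in best])  # ['I am', 'We are']
```

Note that scores are summed in log space, which avoids underflow when sequences get long.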

Tradeoff:

| Beam Size | Effect |
| --- | --- |
| Small | Faster, more diverse output |
| Large | More thorough search, but slower and often repetitive |

9. Evaluation Metrics

BLEU Score (Translation Quality)

Measures n‑gram overlap with reference text: clipped n‑gram precision combined with a brevity penalty that discourages overly short outputs.

BLEU ≈ precision of matching phrases

Limitations:

  • Cannot measure meaning
  • Penalizes paraphrases
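
A simplified sketch of the n-gram overlap idea. Real BLEU uses 4-gram precision, smoothing, and corpus-level statistics; this toy version uses unigrams and bigrams only, and the helper names are my own.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=2):
    """Simplified BLEU: geometric mean of clipped n-gram precisions,
    multiplied by a brevity penalty for too-short candidates."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(c, ref[g]) for g, c in cand.items())  # clipped counts
        precisions.append(overlap / max(1, sum(cand.values())))
    if 0 in precisions:
        return 0.0
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))  # brevity penalty
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

ref = "the cat is on the mat".split()
print(bleu("the cat is on the mat".split(), ref))  # 1.0 (exact match)
print(bleu("a cat sat on a mat".split(), ref))     # 0.0 (no bigram overlap)
```

The second example also illustrates the limitation above: a reasonable paraphrase with different wording scores near zero.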

WER (Word Error Rate)

Used in speech recognition.

WER = (Substitutions + Insertions + Deletions) / Words in Reference

Lower is better.
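
In practice the three error counts come from a word-level edit distance (Levenshtein distance) between the reference and the hypothesis. A minimal dynamic-programming sketch:

```python
def wer(reference, hypothesis):
    """Word error rate via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i          # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j          # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])  # match or substitute
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[-1][-1] / len(ref)

# One substitution ("sat" → "sit") and one deletion ("the") over 6 reference words
print(wer("the cat sat on the mat", "the cat sit on mat"))  # ≈ 0.33
```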


10. Putting It All Together

Transformer pipeline:

Tokens
 → Embedding
 → Positional Encoding
 → Self Attention (understand relationships)
 → Feed Forward (process meaning)
 → Repeat layers (build abstraction)
 → Decoder predicts next token
 → Beam search generates sequence
 → BLEU/WER evaluate quality

Final Intuition

Transformers are powerful because they:

  1. Replace memory with relationships
  2. Replace recurrence with parallel reasoning
  3. Replace rules with learned structure

A transformer is not reading text sequentially — it is building a graph of meaning and reasoning over it.


One‑Sentence Summary

RNNs remember the past. Transformers understand the entire context at once.