Transformer Architecture Explained: From Attention to GPT

Published on November 15, 2025

The Transformer architecture, introduced in "Attention Is All You Need" (2017), revolutionized NLP and beyond. Understanding Transformers is essential for working with modern LLMs like GPT, BERT, and their successors. Let's build intuition from the ground up.

The Core Insight: Attention

Attention allows models to focus on relevant parts of the input when producing each output. Unlike RNNs that process sequentially, attention computes relationships between all positions simultaneously.

Scaled Dot-Product Attention

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

def scaled_dot_product_attention(query, key, value, mask=None):
    """
    Compute attention weights and output.
    
    Args:
        query: (..., seq_len, d_k); works for 3D (batch, seq_len, d_k) or 4D (batch, heads, seq_len, d_k)
        key: (..., seq_len, d_k)
        value: (..., seq_len, d_v)
        mask: optional mask; positions where mask == 0 are blocked from attention
    
    Returns:
        output: (..., seq_len, d_v)
        attention_weights: (..., seq_len, seq_len)
    """
    d_k = query.size(-1)
    
    # Compute attention scores
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    
    # Apply mask (for decoder self-attention)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    
    # Softmax to get attention weights
    attention_weights = F.softmax(scores, dim=-1)
    
    # Weighted sum of values
    output = torch.matmul(attention_weights, value)
    
    return output, attention_weights
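
A quick shape check with random tensors (the sizes here are arbitrary, chosen only for illustration):

q = torch.randn(2, 5, 16)   # (batch, seq_len, d_k)
k = torch.randn(2, 5, 16)
v = torch.randn(2, 5, 16)

out, weights = scaled_dot_product_attention(q, k, v)
print(out.shape)            # torch.Size([2, 5, 16])
print(weights.shape)        # torch.Size([2, 5, 5])
print(weights[0, 0].sum())  # ~1.0: each row of weights is a probability distribution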

Intuition: Q, K, V

The attention scores measure how well each query matches each key; after the softmax they become weights, and the output is the corresponding weighted average of the values. A query that strongly matches one key therefore pulls the output toward that key's value.
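
A tiny hand-built example (the numbers are made up purely to show the behaviour): the query below points in the same direction as the second key, so nearly all of the attention weight lands on the second value.

# Toy tensors: 1 batch, 3 key/value pairs, 1 query
keys = torch.tensor([[[ 1.0, 0.0],
                      [ 0.0, 1.0],
                      [-1.0, 0.0]]])               # (1, 3, 2)
values = torch.tensor([[[10.0], [20.0], [30.0]]])  # (1, 3, 1)
query = torch.tensor([[[0.0, 5.0]]])               # (1, 1, 2), aligned with the second key

out, w = scaled_dot_product_attention(query, keys, values)
print(w)    # ~[[0.03, 0.94, 0.03]]: the weight concentrates on the second key
print(out)  # ~20.0: the output is pulled toward the second value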

Multi-Head Attention

Instead of a single attention operation, multi-head attention runs several "heads" in parallel, each able to capture a different type of relationship:

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        
        # Linear projections for Q, K, V
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
    
    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)
        
        # Linear projections and reshape for multi-head
        Q = self.W_q(query).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(key).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(value).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        
        # Apply attention
        attn_output, _ = scaled_dot_product_attention(Q, K, V, mask)
        
        # Concatenate heads and project
        attn_output = attn_output.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
        output = self.W_o(attn_output)
        
        return output
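
A minimal self-attention call, with d_model and num_heads chosen only for illustration:

mha = MultiHeadAttention(d_model=512, num_heads=8)
x = torch.randn(2, 10, 512)   # (batch, seq_len, d_model)
out = mha(x, x, x)            # self-attention: query, key, and value are the same sequence
print(out.shape)              # torch.Size([2, 10, 512])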

Positional Encoding

Transformers have no inherent sense of position. We add positional information using sinusoidal functions:

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        
        # Create positional encoding matrix
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        
        pe[:, 0::2] = torch.sin(position * div_term)  # Even indices
        pe[:, 1::2] = torch.cos(position * div_term)  # Odd indices
        
        pe = pe.unsqueeze(0)  # Add batch dimension
        self.register_buffer('pe', pe)
    
    def forward(self, x):
        return x + self.pe[:, :x.size(1), :]
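
A quick sanity check: the module adds a fixed pattern, so the shape is unchanged and there is nothing to learn (the encoding table lives in a buffer, not a parameter):

pos_enc = PositionalEncoding(d_model=512)
x = torch.zeros(2, 10, 512)
print(pos_enc(x).shape)                              # torch.Size([2, 10, 512])
print(sum(p.numel() for p in pos_enc.parameters()))  # 0: no trainable parameters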

Transformer Encoder Block

class TransformerEncoderBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        
        # Multi-head self-attention
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.norm1 = nn.LayerNorm(d_model)
        
        # Feed-forward network
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model)
        )
        self.norm2 = nn.LayerNorm(d_model)
        
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, x, mask=None):
        # Self-attention with residual connection
        attn_output = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))
        
        # Feed-forward with residual connection
        ffn_output = self.ffn(x)
        x = self.norm2(x + self.dropout(ffn_output))
        
        return x
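
Stacking a few of these blocks gives the full encoder. The sizes below roughly match the base model from the original paper, but are otherwise just an example:

encoder = nn.ModuleList([
    TransformerEncoderBlock(d_model=512, num_heads=8, d_ff=2048)
    for _ in range(6)
])

x = torch.randn(2, 10, 512)
for block in encoder:
    x = block(x)   # shape is preserved through every block
print(x.shape)     # torch.Size([2, 10, 512])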

From Transformers to GPT

GPT (Generative Pre-trained Transformer) uses a decoder-only stack with causal masking, so each position can attend only to itself and earlier positions:

def create_causal_mask(seq_len):
    """Create mask to prevent attending to future tokens."""
    mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1)
    return mask == 0  # True where attention is allowed
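
Here is what the mask looks like for a short sequence, and what it does when fed through the attention function defined earlier (tensor sizes are illustrative):

mask = create_causal_mask(4)
print(mask)
# tensor([[ True, False, False, False],
#         [ True,  True, False, False],
#         [ True,  True,  True, False],
#         [ True,  True,  True,  True]])

q = k = v = torch.randn(1, 4, 16)
_, weights = scaled_dot_product_attention(q, k, v, mask)
print(weights[0])   # everything above the diagonal is 0: token i only sees tokens 0..i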

GPT Architecture

  1. Token Embedding: Convert tokens to vectors
  2. Positional Encoding: Add position information
  3. N Transformer Blocks: Self-attention + FFN
  4. Output Layer: Project to vocabulary size
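
Putting the four steps together, a minimal decoder-only model might look like the sketch below. It reuses the components defined above, with the encoder block standing in for GPT's decoder block thanks to the causal mask; a real GPT differs in details such as learned positional embeddings and its choice of activation and normalization placement, and the hyperparameters here are purely illustrative.

class MiniGPT(nn.Module):
    def __init__(self, vocab_size, d_model=512, num_heads=8, d_ff=2048,
                 num_layers=6, max_len=512):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)    # 1. token embedding
        self.pos_enc = PositionalEncoding(d_model, max_len)   # 2. positional encoding
        self.blocks = nn.ModuleList([                         # 3. N transformer blocks
            TransformerEncoderBlock(d_model, num_heads, d_ff)
            for _ in range(num_layers)
        ])
        self.lm_head = nn.Linear(d_model, vocab_size)         # 4. project to vocabulary size

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer tensor
        mask = create_causal_mask(token_ids.size(1)).to(token_ids.device)
        x = self.pos_enc(self.token_emb(token_ids))
        for block in self.blocks:
            x = block(x, mask)           # causal self-attention + FFN
        return self.lm_head(x)           # (batch, seq_len, vocab_size) logits

model = MiniGPT(vocab_size=10000)
logits = model(torch.randint(0, 10000, (2, 16)))
print(logits.shape)   # torch.Size([2, 16, 10000]): next-token logits at every position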

BERT vs GPT

Aspect       | BERT                     | GPT
-------------|--------------------------|------------------------
Architecture | Encoder only             | Decoder only
Attention    | Bidirectional            | Causal (left-to-right)
Pre-training | Masked Language Modeling | Next Token Prediction
Best for     | Understanding tasks      | Generation tasks

Conclusion

The Transformer's elegance lies in its simplicity: attention mechanisms replace complex recurrent structures, enabling parallelization and capturing long-range dependencies. Understanding these fundamentals is key to working with—and improving upon—modern LLMs.

Related Project: Skill Lyft: AI-Powered Learning Prototype - Simplifying ML concepts like Transformers and LLMs.