Notes on

Build a Large Language Model (From Scratch)

by Sebastian Raschka



This book is a good, well-organized guide to building large language models (LLMs). It walks you through everything, from transformer basics to a working GPT-like model. The explanations are clear, and the code examples are helpful. You’ll learn the steps, from pretraining to fine-tuning for instruction and classification tasks.

The book covers important concepts like tokenization, embeddings, and the self-attention mechanism. It’s good at explaining each part’s purpose and how it fits into the whole. The step-by-step implementation of each component is a plus.

However, the book could go deeper. It doesn’t always explain why things are the way they are, or the mathematical intuition behind them. It focuses on the how, but I’d have liked to see more on the mathematical underpinnings and the reasoning behind some choices. I get why this could be considered beyond the scope of the book.

Sebastian Raschka absolutely deserves high praise for writing this book, as well as the tremendous effort put into the additional material in the appendix and on the GitHub repository.

You can find my code here: chhoumann/ml.

1 Understanding large language models

1.3 Stages of building and using LLMs

Process of training an LLM:

  1. Pretraining: training the LLM on a large corpus of text data. This is raw text, meaning text without any labeling information—but potentially filtered, like removing formatting characters or docs in unknown languages.
    • LLMs use Self-supervised Learning in this phase, so no labeling needed!
  2. Fine-tuning on a smaller, labeled dataset to follow instructions or perform classification tasks

“Pre” in “pretraining” refers to the first phase, where a model like an LLM is trained on a large, diverse dataset to develop a broad understanding of language.
This model then serves as a base/foundation model we can refine further with fine-tuning, by training it on a narrower dataset, more specific to particular tasks or domains.

An LLM, after pretraining, will have some basic capabilities like text completion and few-shot capabilities (can learn to perform new tasks based only on a few examples, rather than needing extensive training data).
After fine-tuning, it may be able to do a lot more, like classification, summarization, translation, and so on.
The two most popular categories of fine-tuning:

  • instruction fine-tuning, where we use a labeled dataset consisting of instruction and answer pairs, e.g., a query to translate a text and the correctly translated text
  • classification fine-tuning, where we use a labeled dataset of texts and associated class labels, e.g., emails and “spam” / “not spam” labels
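
To make the two formats concrete, here is a hypothetical sketch of what single training examples might look like (the field names are illustrative, not from the book):

# Hypothetical examples of the two labeled-dataset formats (field names are made up)
instruction_example = {
    "instruction": "Translate the following sentence into German.",
    "input": "The weather is nice today.",
    "output": "Das Wetter ist heute schön.",
}

classification_example = {
    "text": "Congratulations, you won a free cruise! Click here.",
    "label": "spam",
}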

1.4 Introducing the transformer architecture

The original transformer was developed for machine translation, translating English texts to German and French. Presented in Attention Is All You Need.

There are two parts:

  • Encoder that processes input text and produces an Embedding representation of the text (vectors that capture the contextual information of the input)
  • Decoder that uses the embeddings to generate the translated text, one word at a time

Both of these consist of many layers connected by a self-attention mechanism.

The self-attention mechanism lets the model weigh the importance of different words/tokens in a sequence, relative to each other.
This enables it to capture long-range dependencies and contextual relationships within the input data—so it can generate output that is coherent and contextually relevant.

Later variants: Bidirectional Encoder Representations From Transformers (BERT) and Generative Pretrained Transformers (GPT).

BERT (and its variants) specialize in masked word prediction. It receives inputs where words are randomly masked during training, and it fills in the missing words to generate the original sentence.
This makes the model great at text classification, including sentiment prediction & document categorization. For example, X (Twitter) uses it to detect toxic content (as of the book’s publication).
Example: “This is an … of how concise I … be” to “This is an example of how concise I can be”

On the other hand, GPT is designed for generative tasks. It receives incomplete texts, in the sense that the sentence is unfinished, not with masked words. Then it learns to generate one word at a time.
This makes the model great at tasks that require generating text, like machine translation, text summarization, writing tasks, etc.
Example: “This is an example of how concise I can” → “This is an example of how concise I can be”

1.5 Utilizing large datasets

The datasets for these models are huge, representing diverse and comprehensive text corpora with billions of words on various topics.

The diversity of the training data lets the models perform well on diverse tasks.

Pretraining the models requires a lot of resources and is expensive. Luckily, many pretrained LLMs are available as open source models, and can be used. They can even be fine-tuned for specific tasks with (relatively) smaller datasets.

1.6 A closer look at the GPT architecture

Original GPT paper: Improving Language Understanding by Generative Pre-Training by Radford et al. from OpenAI.

GPT-3 is a scaled up version of this with more parameters & was trained on a larger dataset. Introduced in 2020.
The original ChatGPT model was created by fine-tuning GPT-3 on a large instruction dataset with the methods presented in OpenAI’s InstructGPT paper.

Next-word prediction is a form of Self-supervised Learning. We don’t need to collect labels for the training data: we just use the next word in a sentence / document as the label the model’s supposed to predict.

The general GPT architecture is relatively simple: it’s just the decoder part of the transformer, without the encoder.
Decoder-style models generate text by predicting text one word at a time, so they’re considered a type of autoregressive model.
Autoregressive models take in their previous outputs as inputs for future predictions.
So in GPT, each new word is chosen based on the sequence that precedes it.
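
A minimal sketch of that autoregressive loop, assuming a model (like the GPT model built later in the book) that maps a batch of token IDs to next-token logits:

import torch

# Greedy autoregressive decoding sketch (assumes `model` returns logits of shape
# [batch, num_tokens, vocab_size]; this is not the book's exact generate function)
def generate_greedy(model, token_ids: torch.Tensor, max_new_tokens: int) -> torch.Tensor:
    for _ in range(max_new_tokens):
        logits = model(token_ids)             # [batch, num_tokens, vocab_size]
        next_token_logits = logits[:, -1, :]  # only the last position predicts the next token
        next_token = next_token_logits.argmax(dim=-1, keepdim=True)
        token_ids = torch.cat([token_ids, next_token], dim=1)  # feed the output back in
    return token_ids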

GPT-3 is also a lot larger than the original transformer model.
The original repeated the encoder & decoder blocks 6 times.
GPT-3 has 96 transformer layers & 175b parameters in total.

GPT-3 was introduced a long time ago by DL & LLM development standards (2020), but more recent architectures (like Meta’s Llama models) are based on the same underlying concepts with only minor modifications.

2 Working with text data

2.1 Understanding word embeddings

Deep neural network models can’t process raw text directly. Text is categorical, so we can’t perform the mathematical operations on it that we use to train neural networks.
So we need to represent words as continuous-valued vectors.

Converting data into a vector format is often called embedding.

Using a specific neural network layer or another pretrained neural network model, we can embed various types of data (video, audio, text, etc.). But different data formats require distinct embedding models.

An embedding is a mapping from discrete objects (e.g. words, images, entire documents) to points in a continuous vector space.

Word embeddings are the most common form of text embedding. But you can also embed sentences, paragraphs, or whole documents.
Sentence and paragraph embeddings are popular for RAG (retrieval-augmented generation).

There are many algorithms and frameworks for generating word embeddings.
One of the earlier & most popular examples is Word2Vec.
It trains neural network architectures to generate word embeddings by predicting the context of a word given the target word, or vice versa.
The main idea: words that appear in similar contexts tend to have similar meanings.
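
A tiny sketch of that idea with made-up 3-dimensional embeddings: vectors for words used in similar contexts end up close together (high cosine similarity).

import torch
import torch.nn.functional as F

# Made-up 3-dimensional embeddings purely for illustration
cat = torch.tensor([0.8, 0.1, 0.30])
dog = torch.tensor([0.7, 0.2, 0.35])
car = torch.tensor([0.1, 0.9, 0.00])

print(F.cosine_similarity(cat, dog, dim=0))  # high: similar contexts / meanings
print(F.cosine_similarity(cat, car, dim=0))  # lower: unrelated words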

Word embeddings can have varying dimensions. With more dimensions you can capture more nuanced relationships, but it also takes more compute.

You can use Word2Vec to generate embeddings, but LLMs commonly produce their own that are part of the input layer & are updated during training.
Optimizing the embeddings as part of training the LLM also means that the embeddings are optimized to the task and data at hand.

Embedding size is often referred to as the dimensionality of the model’s hidden states.
It varies based on the model variant and sizes. It’s a tradeoff between performance and efficiency.

2.2 Tokenizing text

Don’t make text all lowercase when tokenizing. Capitalization helps the LLMs distinguish between proper nouns and common nouns, understand sentence structure, & learn to generate text with proper capitalization.

import re

# raw_text: the raw training text (the short story loaded earlier in the chapter)
preprocessed = re.split(r"([,.:;?_!\"()']|--|\s)", raw_text)
preprocessed = [item for item in preprocessed if item.strip()]

Should you keep or remove whitespaces?
It depends on the application & its requirements.
Removing them reduces memory & computational requirements.
But they might be important for some applications, like Python code, which is whitespace-sensitive.
We remove it here, but will later switch to a method that keeps whitespaces.
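
A sketch of the whitespace-keeping variant: split the same way, but only drop empty strings instead of stripping out the whitespace tokens.

import re

# Variant that keeps whitespace tokens (only empty strings are dropped); raw_text as above
preprocessed_with_ws = re.split(r"([,.:;?_!\"()']|--|\s)", raw_text)
preprocessed_with_ws = [item for item in preprocessed_with_ws if item != ""]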

2.3 Converting tokens into token IDs

First we build a vocabulary, which defines how we map each unique word and special character to a unique integer.
We also build an inverse vocabulary.

from typing import Dict, List

all_words = sorted(set(preprocessed))
vocab_size = len(all_words)

vocab = {token: i for i, token in enumerate(all_words)}

class SimpleTokenizerV1:
    def __init__(self, vocab: Dict[str, int]):
        self.str_to_int = vocab
        self.int_to_str = {i: token for token, i in vocab.items()}

    def encode(self, text: str) -> List[int]:
        preprocessed = re.split(r"([,.:;?_!\"()']|--|\s)", text)
        preprocessed = [item for item in preprocessed if item.strip()]
        ids = [self.str_to_int[token] for token in preprocessed]
        return ids

    def decode(self, tokens: List[int]) -> str:
        text = " ".join([self.int_to_str[token] for token in tokens])
        # Remove whitespaces before punctuation marks
        text = re.sub(r" ([,.:;?_!\"()'])", r"\1", text)
        return text

tokenizer = SimpleTokenizerV1(vocab)
text = """"It's the last he painted, you know," 
       Mrs. Gisburn said with pardonable pride."""
ids = tokenizer.encode(text)

2.4 Adding special context tokens

The tokenizer should be able to handle unknown words.
We should also be able to use and add special context tokens that can give the model a better understanding of context or other relevant information in the text, like if it’s reached the end of the text.

So we’re adding two tokens: <|unk|> and <|endoftext|>.
And we’ll modify the tokenizer to use <|unk|> if it encounters a word that isn’t part of the vocabulary.
<|endoftext|> is also added between unrelated texts to mark when one source ends. When you’re training GPT-like LLMs on multiple independent sources, adding such a token between each source helps it understand that despite the sources being concatenated, they are unrelated.

all_tokens = sorted(set(preprocessed))
all_tokens.extend(["<|unk|>", "<|endoftext|>"])
vocab = {token: i for i, token in enumerate(all_tokens)}

class SimpleTokenizerV2:
    def __init__(self, vocab: Dict[str, int]):
        self.str_to_int = vocab
        self.int_to_str = {i: token for token, i in vocab.items()}

    def encode(self, text: str) -> List[int]:
        preprocessed = re.split(r"([,.:;?_!\"()']|--|\s)", text)
        preprocessed = [item for item in preprocessed if item.strip()]
        # Replace unknown tokens with "<|unk|>"
        preprocessed = [item if item in self.str_to_int else "<|unk|>" for item in preprocessed]
        ids = [self.str_to_int[token] for token in preprocessed]
        return ids

    def decode(self, tokens: List[int]) -> str:
        text = " ".join([self.int_to_str[token] for token in tokens])
        # Remove whitespaces before punctuation marks
        text = re.sub(r" ([,.:;?_!\"()'])", r"\1", text)
        return text

text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."
text = " <|endoftext|> ".join((text1, text2))
print(text)
# > "Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace."

tokenizer = SimpleTokenizerV2(vocab)
ids = tokenizer.encode(text)
print(ids)
> [1130, 5, 355, 1126, 628, 975, 10, 1131, 55, 988, 956, 984, 722, 988, 1130, 7]

print(tokenizer.decode(tokenizer.encode(text)))
# > "<|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>."

It’s readily apparent that “Hello” and “palace” do not appear in Edith Wharton’s “The Verdict.”

Other special tokens:

  • [BOS] (beginning of sequence) to mark the start of a text
  • [EOS] (end of sequence), positioned at the end of a text. Useful when concatenating multiple unrelated texts (like <|endoftext|> is—they’re analogous)
  • [PAD] (padding) when training LLMs with batch sizes > 1, the batch may contain texts of varying lengths. So to ensure all texts have the same length, the shorter texts are padded with this token, up to the length of the longest text in the batch

<|endoftext|> is sometimes used for padding, e.g. in GPT models, which only use that special token.
When training on batched inputs, we typically use a mask, and therefore don’t attend to padded tokens. So it doesn’t matter which token is used for padding.
GPT models also don’t use <|unk|>. They use a byte pair encoding tokenizer, which breaks words down into subword units.

2.5 Byte pair encoding

The BPE tokenizer was used to train LLMs like GPT-2, GPT-3, and others.

We’re just using OpenAI’s Tiktoken here.

BPE can encode and decode unknown words correctly (without <|unk|>) because the algorithm breaks words down into subword units, or even individual characters. So any unfamiliar word can be represented as a sequence of subword tokens or characters.

BPE builds its vocabulary by iteratively merging frequent characters into subwords and frequent subwords into words. Merges are determined by a frequency cutoff.
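
A short demo of that with the GPT-2 tokenizer from tiktoken: a made-up word round-trips cleanly because it's broken into known subword/character tokens, no <|unk|> needed.

import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")
ids = tokenizer.encode("Akwirw ier", allowed_special={"<|endoftext|>"})
print(ids)                    # a handful of subword/character token IDs
print(tokenizer.decode(ids))  # "Akwirw ier" is reconstructed exactly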

2.6 Data sampling with a sliding window

Now we need to generate the input-target pairs required for training an LLM. Recall that the task is next-word prediction.
The basic way to do it is by having x and y, where x contains the inputs and y contains the targets: the inputs shifted by one position.

context_size = 4
x = enc_sample[:context_size]
y = enc_sample[1 : context_size + 1]
print(f"x: {x}")
print(f"y:      {y}")
# > x: [290, 4920, 2241, 287]
# > y:      [4920, 2241, 287, 257]

# So the next-word prediction task looks like:
for i in range(1, context_size + 1):
    context = enc_sample[:i]
    target = enc_sample[i]
    print(context, "-->", target)
# > [290] --> 4920
# > [290, 4920] --> 2241
# > [290, 4920, 2241] --> 287
# > [290, 4920, 2241, 287] --> 257

Context size determines how many tokens are included in the input.

We’ll build a data loader to handle data sampling.
The data loader should return an input tensor with the text the LLM sees and a target tensor with the targets it should predict.

import tiktoken
import torch
from torch.utils.data import DataLoader, Dataset


class GPTDatasetV1(Dataset):
    def __init__(
        self, txt: str, tokenizer: tiktoken.Encoding, max_length: int, stride: int
    ):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the text
        token_ids = tokenizer.encode(txt)

        # Chunk text into overlapping sequences of max_length using the sliding window
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i : i + max_length]
            target_chunk = token_ids[i + 1 : i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        """Total number of samples in the dataset."""
        return len(self.input_ids)

    def __getitem__(self, idx):
        """Get a sample from the dataset at the given index."""
        return self.input_ids[idx], self.target_ids[idx]


def create_dataloader_v1(
    txt: str,
    batch_size: int = 4,
    max_length: int = 256,
    stride: int = 128,
    shuffle: bool = True,
    drop_last: bool = True,
    num_workers: int = 0,
):
    tokenizer = tiktoken.get_encoding("gpt2")
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers,
    )
    return dataloader

batch_size is the number of samples per batch. Small batch sizes require less memory, but can lead to more noisy model updates.
We batch because it’s inefficient to feed the model one sample at a time. We put multiple samples together into a batch (the samples in a batch have the same length).
drop_last drops the last batch if it’s shorter than the specified batch_size.
This prevents loss spikes during training.
stride is the step size for the sliding window.
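
A quick sanity check of the sliding window (a sketch, assuming raw_text holds the training text loaded earlier): with stride=1 consecutive inputs overlap by all but one token, while stride=max_length gives non-overlapping chunks.

dataloader = create_dataloader_v1(
    raw_text, batch_size=1, max_length=4, stride=1, shuffle=False
)
data_iter = iter(dataloader)
print(next(data_iter))  # [inputs, targets], with targets shifted one token to the right
print(next(data_iter))  # inputs shifted by only one token relative to the first batch, because stride=1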

2.7 Creating token embeddings

Now we need to convert token IDs into embedding vectors.
So the steps, to recap, are as follows:

  1. Input text gets tokenized
  2. Encode tokenized text to get token IDs
  3. Create input token embeddings
  4. Process with GPT-like decoder-only transformer
  5. Perform postprocessing steps to get output text

input_ids = torch.tensor([2, 3, 5, 1])

vocab_size = 6
output_dim = 3  # create embeddings of size 3

torch.manual_seed(42)
embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
print(embedding_layer.weight)

# Parameter containing:
# tensor([[ 1.9269,  1.4873, -0.4974],
#         [ 0.4396, -0.7581,  1.0783],
#         [ 0.8008,  1.6806,  0.3559],
#         [-0.6866,  0.6105,  1.3347],
#         [-0.2316,  0.0418, -0.2516],
#         [ 0.8599, -0.3097, -0.3957]], requires_grad=True)

The weights above have been randomly initialized.
The values will get optimized during LLM training, as part of the LLM optimization.
6 rows with 3 columns. One row for each of the six possible tokens in the vocabulary, and one column for each of the three embedding dimensions.

print(embedding_layer(torch.tensor([3])))  # applying embedding layer to token id 3
# tensor([[-0.6866,  0.6105,  1.3347]], grad_fn=<EmbeddingBackward0>)

You can see that the output is identical to row index 3 of the weight matrix above.
This is because the embedding layer is essentially a lookup into its weight matrix via the token ID.

The embedding layer here is like a more efficient way to implement one-hot encoding, followed by matrix multiplication in a fully connected layer.
And that’s also why we can view it as a neural network layer that can be optimized via backprop.

Sebastian provided a great notebook that explains this relationship here.
In it, he explains that embedding layers in PyTorch do the same as linear layers that perform matrix multiplications. We use embedding layers for computational efficiency.

Above we’ve discussed how the embedding is basically a lookup, and that this is comparable to one-hot encoding followed by a matmul in a linear layer. So say we apply an nn.Linear layer to a one-hot encoded representation.
The categories are the various token IDs we have available, and we one-hot encode those into binary attributes. Therefore, we have as many one-hot features as tokens in our vocabulary.
Given a token ID, we’d encode it as a vector with a 1 (hot) at that token’s position and 0s elsewhere.
Performing a matrix multiplication of that vector with our linear layer’s weights gives us the embedding for that exact token, equivalent to the lookup.

Mathematically, we can represent this as:

embedding = x_onehot · W

Where:

  • embedding is the resulting embedding vector
  • x_onehot is the one-hot encoded input vector
  • W is the weight matrix of the linear layer (or embedding matrix)

For example, if we have a vocabulary size of 6, an embedding dimension of 3, and token ID 2:

[0, 0, 1, 0, 0, 0] · W = [w31, w32, w33]

This operation effectively selects the third row of the weight matrix, which is equivalent to looking up the embedding for the third token in our vocabulary.
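
A short sketch verifying this with the embedding_layer from above (vocab_size=6, output_dim=3): a one-hot vector times the weight matrix gives the same result as the lookup.

import torch.nn.functional as F

token_id = torch.tensor([2])
one_hot = F.one_hot(token_id, num_classes=6).float()  # shape [1, 6]
via_matmul = one_hot @ embedding_layer.weight         # selects row 2 of the weight matrix
via_lookup = embedding_layer(token_id)                # the usual embedding lookup
print(torch.allclose(via_matmul, via_lookup))         # True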

The embedding layer can also be thought of as a hashtable lookup. In this case, we can represent it as:

embedding = hashtable[token_id]

Where:

  • embedding is the resulting embedding vector
  • hashtable is a dictionary-like structure containing the embeddings
  • token_id is the ID of the token we want to look up

For our example with a vocabulary size of 6 and an embedding dimension of 3, we could represent this as:

hashtable = {
    0: [w11, w12, w13],
    1: [w21, w22, w23],
    2: [w31, w32, w33],
    3: [w41, w42, w43],
    4: [w51, w52, w53],
    5: [w61, w62, w63]
}

Then, to get the embedding for token ID 2, we would simply do:

embedding = hashtable[2]  # This would return [w31, w32, w33]

This hashtable lookup approach is conceptually similar to the embedding layer and provides another way to understand how embeddings work. However, the actual implementation in PyTorch uses more optimized methods for efficiency and to enable gradient flow for training. Take a look:

print(embedding_layer(input_ids))
# tensor([[ 0.8008,  1.6806,  0.3559],
#         [-0.6866,  0.6105,  1.3347],
#         [ 0.8599, -0.3097, -0.3957],
#         [ 0.4396, -0.7581,  1.0783]], grad_fn=<EmbeddingBackward0>)

2.8 Encoding word positions

Token embeddings could be used as inputs for LLMs.
But their self-attention mechanism doesn’t have a notion of position or order for the tokens within a sequence.

So we inject additional position information into the LLM.

There are two broad categories of position-aware embeddings we could use:

  • relative positional embeddings
  • absolute positional embeddings

With absolute positional embeddings, we add a positional embedding to the token embedding. There’s a unique positional embedding for each position in the input sequence.

While with relative positional embeddings, we convey information about the relative position/distance between tokens. This lets the model generalize better to sequences of varying lengths.

Both help LLMs understand the order and relationships between tokens. You choose the one appropriate to your application and data.

GPT models use absolute positional embeddings that are optimized during the training process (unlike those that were fixed/predefined in the original transformer model). The optimization here is part of the model training itself.

vocab_size = 50257
output_dim = 256
token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

max_length = 4
dataloader = create_dataloader_v1(
    raw_text, batch_size=8, max_length=max_length,
    stride=max_length, shuffle=False
)

data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Token IDs:\n", inputs)
print("\nInputs shape:\n", inputs.shape) # 8 text samples, 4 tokens each

# Token IDs:
#  tensor([[   40,   367,  2885,  1464],
#         [ 1807,  3619,   402,   271],
#         [10899,  2138,   257,  7026],
#         [15632,   438,  2016,   257],
#         [  922,  5891,  1576,   438],
#         [  568,   340,   373,   645],
#         [ 1049,  5975,   284,   502],
#         [  284,  3285,   326,    11]])
# 
# Inputs shape:
#  torch.Size([8, 4])

token_embeddings = token_embedding_layer(inputs)
print(token_embeddings.shape)
# torch.Size([8, 4, 256])

We embedded each of the tokens into a 256-dimensional vector.
8 samples in our batch, 4 tokens per sample, and 256 embedding dimensions for each token.

A GPT model’s absolute embedding approach:

context_length = max_length
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)
pos_embeddings = pos_embedding_layer(torch.arange(context_length))
print(pos_embeddings.shape)
# torch.Size([4, 256])

The input is usually a placeholder vector containing a sequence of numbers 0, 1, …, n-1, where n is the maximum input length.

context_length represents the supported input size for the LLM.
We set it to max_length here.
In practice, the input text can be longer than the supported context length—then we’d have to truncate the text.

input_embeddings = token_embeddings + pos_embeddings
print(input_embeddings.shape)
# torch.Size([8, 4, 256])

3 Coding attention mechanisms

3.2 Capturing data dependencies with attention mechanisms

First we had Bahdanau attention. Then self-attention.

Self-attention assigns attention scores to each word (token) in the sentence. This lets each token attend to the other tokens and gauge their relative importance, which is helpful for context.

3.3 Attending to different parts of the input with self-attention

“self” in self-attention refers to the fact that the attention mechanism operates within a single sequence of elements (words in a sentence / tokens in an input) by comparing and relating these elements to each other.

Each element in the sequence attends to all other elements, including itself, to capture dependencies and contextual relationships.

Inputs are token embeddings. For each element x^(i) in the sequence, compute how much attention it should pay to every other element x^(j), including itself (i = j). Use this attention to calculate a new representation for x^(i), which is a weighted sum of all the elements in the sequence.

It’s what gives us the context-aware representations. And unlike traditional sequence models, like RNNs, where relationships are captured sequentially, self-attention simultaneously considers all interactions across the sequence.

Example: “The cat sat on the mat.”
The word “cat” attends to itself (“cat”) and other words (“The”, “sat”, etc.) to understand its role in the sentence.
This helps it learn relationships like “cat” being the subject and “sat” being the action.

So:
“Self” means that it operates within a single sentence, mapping relationships from each word to all the other words, including itself. It does so by assigning attention scores, which measure the importance of one word relative to another.

In contrast, traditional attention mechanisms focus on the relationships between elements of two different sequences (e.g., the source and target sentences in translation).

3.3.1 A simple self-attention mechanism without trainable weights

The goal of self-attention is to compute a context vector for each input element in the input sequence. This context vector aggregates information from all other input elements in the sequence, weighted by their relevance (as determined by attention weights).

These context vectors capture the relationships between each input element (e.g. token) and all other tokens in the sequence, allowing the model to encode contextual dependencies effectively.

First step: compute the intermediate values ω, called the attention scores.
This is done by computing the dot product of the query x^(q) (the token in question) with every element x^(i) in the input sequence:

import torch

# Inputs are the embeddings of the words in the sentence
inputs = torch.tensor(
  [[0.43, 0.15, 0.89],      # Your     (x^1)
   [0.55, 0.87, 0.66],      # journey  (x^2)
   [0.57, 0.85, 0.64],      # starts   (x^3)
   [0.22, 0.58, 0.33],      # with     (x^4)
   [0.77, 0.25, 0.10],      # one      (x^5)
   [0.05, 0.80, 0.55]]      # step     (x^6)
)

# Using the second token, `journey`, as the query:
query = inputs[1]
attention_scores_2 = torch.empty(inputs.shape[0])
for i, x_i in enumerate(inputs):
    attention_scores_2[i] = torch.dot(query, x_i)

The dot product is a measure of similarity: it measures how closely the two vectors are aligned. A higher dot product means a higher similarity. In self-attention, it determines how much each element attends to (focuses on) any other element–the higher the dot product, the higher the attention score.

Now we normalize each of the attention scores.
We want to ensure that the sum of the attention weights is 1.

# Notice how we use the scores to compute the weights
attention_weights_2_tmp = attention_scores_2 / attention_scores_2.sum()
print("Attention weights:", attention_weights_2_tmp)
print("Sum:", attention_weights_2_tmp.sum())
# Attention weights: tensor([0.1455, 0.2278, 0.2249, 0.1285, 0.1077, 0.1656])
# Sum: tensor(1.0000)

But it’s better to use the Softmax function for normalization. It’s better at handling extreme values & gives better gradient properties during training.
It also ensures the attention weights are always positive, making outputs interpretable as probabilities or relative importance (greater weights means greater importance).

# This fn may encounter numerical instability problems (e.g. overflow, underflow) when dealing with large or small inputs, so use PyTorch's implementation instead.
def softmax_naive(x):
    return torch.exp(x) / torch.exp(x).sum(dim=0)

attention_weights_2_naive = softmax_naive(attention_scores_2)
print("Attention weights:", attention_weights_2_naive)
print("Sum:", attention_weights_2_naive.sum())
# Attention weights: tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
# Sum: tensor(1.)

# PyTorch softmax:
attention_weights_2 = torch.softmax(attention_scores_2, dim=0)

The final step is to calculate the context vector by multiplying the embedded input tokens with the corresponding attention weights & summing the resulting vectors.

Recap:

# Compute attention scores
query = inputs[0]
attn_scores_1 = torch.zeros(inputs.shape[0])
for i, x_i in enumerate(inputs):
    attn_scores_1[i] = torch.dot(query, x_i)

# Normalize - compute attention weights
attn_weights_1 = torch.softmax(attn_scores_1, dim=0)

# Compute context vector
context_vec_1 = torch.zeros(query.shape)
for i, x_i in enumerate(inputs):
    context_vec_1 += attn_weights_1[i] * x_i

3.3.2 Computing attention weights for all input tokens

Computing the scores:

attn_scores = torch.empty((inputs.shape[0], inputs.shape[0]))
for i, x_i in enumerate(inputs):
    for j, x_j in enumerate(inputs):
        attn_scores[i, j] = torch.dot(x_i, x_j)

# A faster way:
attn_scores = inputs @ inputs.T # or torch.matmul
attn_scores

Each element in the attn_scores tensor represents an attention score between each pair of inputs.

You can imagine it being a matrix like this, excluding the labels:

|         | Your   | journey | starts | with   | one    | step   |
|---------|--------|---------|--------|--------|--------|--------|
| Your    | 0.9995 | 0.9544  | 0.9422 | 0.4753 | 0.4576 | 0.6310 |
| journey | 0.9544 | 1.4950  | 1.4754 | 0.8434 | 0.7070 | 1.0865 |
| starts  | 0.9422 | 1.4754  | 1.4570 | 0.8296 | 0.7154 | 1.0605 |
| with    | 0.4753 | 0.8434  | 0.8296 | 0.4937 | 0.3474 | 0.6565 |
| one     | 0.4576 | 0.7070  | 0.7154 | 0.3474 | 0.6654 | 0.2935 |
| step    | 0.6310 | 1.0865  | 1.0605 | 0.6565 | 0.2935 | 0.9450 |

Computing the weights:

attn_weights = torch.softmax(attn_scores, dim=-1)

dim=-1 means the last dimension.
For this rank 2 tensor, it means we’re applying Softmax along the second dimension of [rows, columns].
That is, we’re normalizing across the columns, so the values in each row (summing over the column dimension) sum up to 1.
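
A quick check that each row now sums to 1:

print(attn_weights.sum(dim=-1))  # each row sums to 1 (up to floating-point error)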

Computing the context vectors:

context_vecs = attn_weights @ inputs

3.4 Implementing self-attention with trainable weights

Now implementing the self-attention mechanism used in the original transformer architecture (Attention Is All You Need), called scaled dot-product attention.

3.4.1 Computing the attention weights step by step

Introduce three trainable weight matrices: W_q, W_k, and W_v. These are used to project the embedded input tokens, x^(i), into query, key, and value vectors.
Weights W in this context refers to weight parameters that are optimized during model training, not attention weights.

x_2 = inputs[1]
d_in = inputs.shape[1] # input embedding size
d_out = 2 # output embedding size

torch.manual_seed(42)

W_query = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)
W_key   = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)
W_value = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)

requires_grad=False to reduce clutter. If we were using the weight matrices for model training, we’d set it to True during training.

query_2 = x_2 @ W_query
key_2 = x_2 @ W_key
value_2 = x_2 @ W_value

query_2
# tensor([1.0760, 1.7344])

The book visualizes a single computation here: some input x^(i) (a 3-dimensional vector) multiplied by a weight matrix W.

To compute all keys and values:

keys = inputs @ W_key
values = inputs @ W_value

keys.shape, values.shape
# (torch.Size([6, 2]), torch.Size([6, 2]))

These computations are similar to the above, just done as matrix multiplications (the book visualizes the shapes of inputs and W_key being multiplied).

Now we’ve projected the six input tokens from a three-dimensional embedding space onto a two-dimensional one.

keys_2 = keys[1]
attn_scores_22 = query_2.dot(keys_2)
attn_scores_22  # unnormalized attention score
# tensor(3.3338)

# Generalized:
attn_scores_2 = query_2 @ keys.T
attn_scores_2
# tensor([2.7084, 3.3338, 3.3013, 1.7563, 1.7869, 2.1966])

From attention scores to attention weights:
Scale the attention scores by dividing them by the sqrt of the embedding dimension of the keys & then using the Softmax function.

We scale by the square root of the embedding dimension to improve training performance by avoiding small gradients.
Large dot products can lead to very small gradients during backprop due to Softmax. As dot products increase, Softmax becomes more like a step function, leading to gradients near zero. These can slow down training / cause it to stagnate.

We call this self-attention mechanism “scaled dot-product attention” due to this scaling by the square root of the embedding dimension.
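
A small illustration (not from the book) of why the scaling matters: as the scores grow, softmax collapses toward a one-hot distribution, which is where the near-zero gradients come from.

import torch

scores = torch.tensor([3.0, 6.0, 9.0])
print(torch.softmax(scores, dim=0))             # already fairly peaked
print(torch.softmax(scores * 8, dim=0))         # larger scores: nearly one-hot, step-function-like
print(torch.softmax(scores / 8 ** 0.5, dim=0))  # scaling softens the distribution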

d_k = keys.shape[-1]
attn_weights_2 = torch.softmax(attn_scores_2 / d_k**0.5, dim=-1)
attn_weights_2
# tensor([0.1723, 0.2681, 0.2620, 0.0879, 0.0898, 0.1200])

context_vec_2 = attn_weights_2 @ values
context_vec_2
# tensor([1.4201, 0.8892])

“query”, “key”, and “value” are borrowed from the domain of information retrieval and databases.

  • A query is similar to a search query: it represents the current item the model is focusing on / trying to understand.
  • The key is like a database key used for indexing and searching. Each item in the input sequence has an associated key, and we use them to match the query.
  • The value is similar to the value in a key-value pair in a database. Represents the actual content / representation of the input items. Once the model determines which keys (which parts of the input) are most relevant to the query, it retrieves the corresponding values.

3.4.2 Implementing a compact self-attention Python class

import torch.nn as nn

class SelfAttention_v1(nn.Module):
    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.W_query = nn.Parameter(torch.rand(d_in, d_out))
        self.W_key   = nn.Parameter(torch.rand(d_in, d_out))
        self.W_value = nn.Parameter(torch.rand(d_in, d_out))

    def forward(self, x):
        keys = x @ self.W_key
        queries = x @ self.W_query
        values = x @ self.W_value

        attn_scores = queries @ keys.T
        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
        context_vec = attn_weights @ values
        return context_vec

d_in is the input embedding dimension, d_out is the output embedding dimension, and T is the number of input tokens. The book’s figure here shows a single input text with T tokens, not a batch (as we’d usually use).

We can use nn.Linear layers instead, which effectively perform matmuls when bias units are disabled.
And linear layers have an optimized weight initialization scheme, meaning more stable and effective training.

class SelfAttention_v2(nn.Module):
    def __init__(self, d_in: int, d_out: int, qkv_bias=False):
        super().__init__()
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key   = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)

    def forward(self, x):
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)

        attn_scores = queries @ keys.T
        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
        context_vec = attn_weights @ values
        return context_vec

3.5 Hiding future words with causal attention

Causal attention is also known as masked attention.

For some tasks, we want self-attention to only consider tokens appearing prior to the current position when predicting tokens in a sequence.

It’s a special form of self-attention that restricts the model to only consider previous and current inputs in a sequence.

We essentially mask out future tokens - tokens that come after the current token in the input.
Mask attention weights (set to 0) above the diagonal, and normalize the non-masked attention weights, so the weights sum to 1 in each row.

3.5.1 Applying a causal attention mask

  1. Apply Softmax on the attention scores to get the normalized attention weights.
  2. Mask those with 0’s above the diagonal to get the masked attention scores.
  3. Normalize rows to get the masked attention weights.

But that isn’t very efficient. Instead, we can:

  1. Mask the attention scores with −∞ above the diagonal to get the masked attention scores
  2. Apply Softmax on those to get masked attention weights

This works because Softmax treats negative infinity values in a row as zero probability, since e^(−∞) approaches 0.

context_length = attn_scores.shape[0]
# Upper triangular matrix with all elements above the diagonal being 1 and those on and below the diagonal 0
mask = torch.triu(torch.ones(context_length, context_length), diagonal=1)
# `mask.bool()` sets 1s to True. Then we fill the corresponding values (same idx as True values in mask) in the attention scores matrix with negative infinity
masked = attn_scores.masked_fill(mask.bool(), -torch.inf)
attn_weights = torch.softmax(masked / keys.shape[-1]**0.5, dim=-1)

3.5.2 Masking additional attention weights with dropout

We can use Dropout in the attention mechanism. It’s typically applied either after calculating the attention weights or after applying them to the value vectors.

torch.manual_seed(42)
dropout = torch.nn.Dropout(0.5)  # usually 0.1 or 0.2 for training GPT models
example = torch.ones(6, 6)
dropout(example)
# tensor([[0., 0., 2., 2., 2., 2.],
#         [2., 0., 2., 0., 2., 0.],
#         [0., 0., 2., 2., 2., 0.],
#         [2., 2., 0., 2., 0., 2.],
#         [2., 0., 2., 2., 2., 2.],
#         [2., 2., 2., 0., 2., 0.]])

~50% were scaled to zero. To compensate for the reduction in active elements, the rest were scaled up by a factor of 1 / 0.5 = 2.
This is to maintain the overall balance of the weights, so the average influence of attention mechanisms is consistent both during training and inference.

We can use Dropout(attn_weights).

3.5.3 Implementing a compact causal attention class

class CausalAttention(nn.Module):
    def __init__(self, d_in: int, d_out: int, context_length: int, dropout: float, qkv_bias=False):
        super().__init__()
        self.d_out = d_out
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.dropout = nn.Dropout(dropout)
        # Useful because the buffer is auto-moved to the appropriate device (CPU/GPU) with our model (e.g. when training)
        self.register_buffer(
            'mask',
            torch.triu(torch.ones(context_length, context_length),
                       diagonal=1)
        )

    def forward(self, x):
        # [batch, num_tokens, d_in]
        b, num_tokens, d_in = x.shape
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)

        attn_scores = queries @ keys.transpose(1, 2)  # Keep batch dim at position 0, but transpose dim 1 and 2
        # Trailing _ means inplace. Using it to avoid unnecessary memory copies.
        attn_scores.masked_fill_(self.mask.bool()[:num_tokens, :num_tokens], -torch.inf)
        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
        attn_weights_dropped = self.dropout(attn_weights)
        context_vec = attn_weights_dropped @ values
        return context_vec

3.6 Extending single-head attention to multi-head attention

Causal attention over multiple heads = multi-head attention.

“Multi-head” because multiple, independent attention heads process the input.
The point is that each head can learn to attend to different aspects or patterns in the input.
This allows it to jointly attend to information from different representation subspaces at different positions.
One head might learn to focus on syntactic relationships, another might capture semantic similarities, a third attend to longer-range dependencies, and one specialize in local context patterns, for example.
In CNNs, early layers detect basic features like edges and textures, middle layers combine these into more complex patterns like shapes, and later layers recognize high-level features like faces or objects.
In transformers, attention heads don’t work in this hierarchical manner. Instead, they each provide a different “perspective” at the same level as the other heads.

3.6.1 Stacking multiple single-head attention layers

Create multiple attention heads, each with their own weights, and combine their outputs.

class MultiHeadAttentionWrapper(nn.Module):
    def __init__(
        self,
        d_in: int,
        d_out: int,
        context_length: int,
        dropout: float,
        num_heads: int,
        qkv_bias=False,
    ):
        super().__init__()
        self.heads = nn.ModuleList(
            [
                CausalAttention(d_in, d_out, context_length, dropout, qkv_bias)
                for _ in range(num_heads)
            ]
        )

    def forward(self, x):
        return torch.cat([head(x) for head in self.heads], dim=-1)

If we use d_out=3 and num_heads=2, the embedding dimension of our context vector matrix would be 3*2=6.
We get a tensor with two sets of context vector matrices. In each, the rows are the context vectors corresponding to the tokens, and the columns correspond to the embedding dimension. These matrices are concatenated along the column dimension, giving us the embedding dimension of 3*2=6.

torch.manual_seed(42)
context_length = batch.shape[1]
d_in, d_out = 3, 2
mha = MultiHeadAttentionWrapper(d_in, d_out, context_length, 0.0, num_heads=2)
context_vecs = mha(batch)
print(context_vecs)
print(f"{context_vecs.shape=}")

# tensor([[[0.4429, 0.1077, 0.5473, 0.3307],
#          [0.4656, 0.2597, 0.3420, 0.2234],
#          [0.4732, 0.3030, 0.2818, 0.1894],
#          [0.4135, 0.2921, 0.2105, 0.1521],
#          [0.4078, 0.2567, 0.2252, 0.1357],
#          [0.3772, 0.2746, 0.1709, 0.1215]],
# 
#         [[0.4429, 0.1077, 0.5473, 0.3307],
#          [0.4656, 0.2597, 0.3420, 0.2234],
#          [0.4732, 0.3030, 0.2818, 0.1894],
#          [0.4135, 0.2921, 0.2105, 0.1521],
#          [0.4078, 0.2567, 0.2252, 0.1357],
#          [0.3772, 0.2746, 0.1709, 0.1215]]], grad_fn=<CatBackward0>)
# context_vecs.shape=torch.Size([2, 6, 4])

  • The first dimension is 2 because we have two samples in our batch.
  • The second dimension denotes the 6 tokens in each input.
  • The third dimension is the 4-dimensional embeddings of each token.

[batch_size, context_length, embedding_dimensions]

3.6.2 Implementing multi-head attention with weight splits

class MultiHeadAttention(nn.Module):
    def __init__(self, d_in: int, d_out: int, context_length: int, dropout: float, num_heads: int, qkv_bias=False):
        super().__init__()
        assert (d_out % num_heads == 0), "d_out must be divisible by num_heads"

        self.d_out = d_out
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads # Reduces projection dim to match desired output dim
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.out_proj = nn.Linear(d_out, d_out) # To combine head outputs
        self.dropout = nn.Dropout(dropout)
        self.register_buffer("mask", torch.triu(torch.ones(context_length, context_length), diagonal=1))

    def forward(self, x):
        b, num_tokens, d_in = x.shape
        # Tensor shape (b, num_tokens, d_out)
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)

        keys = keys.view(b, num_tokens, self.num_heads, self.head_dim) # implicitly split the matrix by adding num_heads dimension, then unroll the last dim: (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim)
        values = values.view(b, num_tokens, self.num_heads, self.head_dim)
        queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)

        # Transposes from shape (b, num_tokens, num_heads, head_dim) to (b, num_heads, num_tokens, head_dim)
        keys = keys.transpose(1, 2)
        queries = queries.transpose(1, 2)
        values = values.transpose(1, 2)

        attn_scores = queries @ keys.transpose(2, 3) # compute dot product for each head
        mask_bool = self.mask.bool()[:num_tokens, :num_tokens] # masks truncated to the number of tokens

        attn_scores.masked_fill_(mask_bool, -torch.inf) # uses mask to fill attn scores

        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)

        context_vec = (attn_weights @ values).transpose(1, 2)  # tensor shape: (b, num_tokens, n_heads, head_dim)

        context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)
        context_vec = self.out_proj(context_vec) # optional linear projection
        return context_vec

As before, d_in is the dimensionality of the input embeddings, d_out is the desired dimensionality of the output context vectors. context_length is the maximum length of the input sequence (denoted in amount of tokens).

d_out must be divisible by num_heads to ensure the output can be evenly divided among the heads. This means that when we concatenate their outputs later, we recover the original d_out dimension.

The Linear layers starting with W_ project the input embedding into query, key, and value representations. The linear transformations learn to map the input embeddings into different representation subspaces where the model can compute similarities and relevance between tokens (using queries and keys) and extract relevant information (values). Each of these layers has a weight matrix of shape (d_in, d_out).

The optional output projection layer is often used after concatenating the outputs of the attention heads. It can learn to mix and combine the information from the different heads, potentially allowing the model to create more complex representations. The weight matrix of this layer is shape (d_out, d_out).

We create the causal mask with torch.triu and torch.ones, creating an upper triangular matrix of ones with the diagonal offset by 1. The mask is used to prevent the model from attending to future tokens in the sequence (we want to predict the next token, so this prevents peeking). Setting the upper triangle to 1 and the lower part (including the main diagonal) to 0 means that, when we compute attention, a token at position i can only attend to tokens at positions j where j ≤ i. register_buffer ensures the mask is saved and loaded along with the model’s parameters, but isn’t treated as a learnable parameter.

In the forward function, we start by unpacking the shape of the input tensor x: b is batch size, num_tokens is the number of tokens in the input sequence, and d_in is the embedding dimension of each token.

Then we do linear projections to get the query, key, and value representations. Here, we’re essentially transforming the input embeddings x into different representations using the learned weight matrices. So the shape of x is (b, num_tokens, d_in), which we linearly project to a tensor of shape (b, num_tokens, d_out).

Then we use .view to split the d_out dimension into num_heads and head_dim. This just reshapes the tensors without changing the data. If d_out is 12 and num_heads is 4, then head_dim would be 3. So the keys tensor, originally of shape (b, num_tokens, 12) is reshaped to (b, num_tokens, 4, 3). Like dividing the 12 dimensional key representation into 4 separate 3 dimensional key representations—one for each head.

This should help illustrate what Tensor.view does. It returns a new tensor with the same data but of a different shape.

>>> import torch
>>> x = torch.randn(4,4)
>>> x
tensor([[ 2.0618, -0.7314,  0.1790,  0.0057],
        [ 1.0455, -1.1515,  0.2536,  0.3051],
        [ 0.3095, -0.2626,  0.1183,  1.3439],
        [-0.1830,  0.5182,  1.6458, -0.1339]])
>>> y = x.view(16)
>>> y
tensor([ 2.0618, -0.7314,  0.1790,  0.0057,  1.0455, -1.1515,  0.2536,  0.3051,
         0.3095, -0.2626,  0.1183,  1.3439, -0.1830,  0.5182,  1.6458, -0.1339])
>>> z = x.view(-1, 8)
>>> z
tensor([[ 2.0618, -0.7314,  0.1790,  0.0057,  1.0455, -1.1515,  0.2536,  0.3051],
        [ 0.3095, -0.2626,  0.1183,  1.3439, -0.1830,  0.5182,  1.6458, -0.1339]])

So when we do e.g. keys.view(...), we essentially reshape the keys matrix to fit the new shape we give.

The .transpose(1, 2) is rather simple, we do as we say in the text: transpose from one shape to another. We swap the given dimensions. We transpose the num_tokens and num_heads dimensions. We do this to enable efficient batched matrix multiplication in the next step. After transposing, the shapes of keys, queries, and values becomes (b, num_heads, num_tokens, head_dim). Now, the num_heads is the second dimension, allowing us to compute attention scores for all heads in parallel with a single matrix multiplication.
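
A tiny shape check of that transpose, with the d_out=12, num_heads=4 numbers from above and 6 tokens per sample:

import torch

b, num_tokens, num_heads, head_dim = 2, 6, 4, 3
t = torch.randn(b, num_tokens, num_heads, head_dim)
print(t.transpose(1, 2).shape)  # torch.Size([2, 4, 6, 3]): heads now come before tokens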

Then we use the mask, converting it to a Boolean tensor and truncating it to the actual number of tokens in the input sequence. This is necessary because context_length is the maximum sequence length, but the actual input sequences may be shorter.

Next, we compute the attention weights by applying the softmax function to the scaled attention scores. We scale by the square root of the head_dim (keys.shape[-1]**0.5) to stabilize training.

Then we apply dropout to the attention weights.

We compute the context vectors by performing a weighted sum of the values using the attention weights.
For each token, we’re creating a new representation (the context vector) by taking the average of the value vectors of all tokens, where the weights are determined by the attention mechanism. Tokens deemed more relevant (higher attention weights) will contribute more to the context vector. The shape of the resulting context vector is initially (b, num_heads, num_tokens, head_dim), and then we transpose it back to (b, num_tokens, num_heads, head_dim) to align with the original ordering of dimensions before we computed attention.

Then, using contiguous and view, we concatenate the outputs from the different heads and reshape the tensor. contiguous ensures the tensor is stored in a contiguous block of memory. Essentially, we combine the outputs from all the heads back into a single representation. The shape becomes b x num_tokens x (num_heads * head_dim), which is equivalent to b x num_tokens x d_out.

Finally, we apply the output projection, and return the final context vectors.

Instead of the previous wrapper approach, we integrate the single-head attention mechanisms into this class.
This figure explains the difference between the approaches. In the top part of the figure, we see the previous wrapper approach. And in the bottom, we see the new MultiheadAttention approach. We have one large weight matrix & only perform one matmul with the inputs to get the query matrix, and we then split that into separate matrices (and also do this for the keys and values):

We do the splitting via tensor reshaping and transposing (view and transpose).
We split the d_out dimension into num_heads and head_dim, where head_dim = d_out // num_heads.

We achieve this with the view method, as seen in the code.

Then we transpose the tensors to bring the num_heads dimension before the num_tokens dimension, which is crucial for aligning the queries, keys, and values across the different heads & performing batched matmuls efficiently.

We added the (optional) output projection layer out_proj. It’s commonly used in many LLM architectures.

This approach is more efficient than the previous approach because we only do a single matrix multiplication to compute e.g. the keys (likewise for queries, values). In the wrapper approach, we had to repeat it multiple times.

d_in, d_out = 3, 2
context_length = batch.shape[1]
mha = MultiHeadAttention(d_in, d_out, context_length, 0.0, num_heads=2, qkv_bias=False)
context_vecs = mha(batch)
print(context_vecs)
print(f"{context_vecs.shape=}")

# tensor([[[-0.6380,  0.3370],
#          [-0.7576,  0.2926],
#          [-0.7891,  0.2779],
#          [-0.7887,  0.2770],
#          [-0.6782,  0.2563],
#          [-0.7425,  0.2639]],
# 
#         [[-0.6380,  0.3370],
#          [-0.7576,  0.2926],
#          [-0.7891,  0.2779],
#          [-0.7887,  0.2770],
#          [-0.6782,  0.2563],
#          [-0.7425,  0.2639]]], grad_fn=<ViewBackward0>)
# context_vecs.shape=torch.Size([2, 6, 2])

The output dimension is directly controlled by the d_out argument here.

The smallest GPT-2 (117m params) has 12 attention heads and a context vector embedding size of 768.
In GPT models, the embedding sizes of the token inputs and context embeddings are the same (d_in = d_out).

4 Implementing a GPT model from scratch to generate text

4.1 Coding an LLM architecture

This is the config we’re using. It’s the smallest GPT-2.

GPT_CONFIG_124M = {
    "vocab_size": 50257,
    "context_length": 1024,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.1,
    "qkv_bias": False
}
import torch
import torch.nn as nn


class DummyGPTModel(nn.Module):
    def __init__(
        self,
        vocab_size,
        context_length,
        emb_dim,
        n_heads,
        n_layers,
        drop_rate,
        qkv_bias=False,
        **kwargs,
    ):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, emb_dim)
        self.pos_emb = nn.Embedding(context_length, emb_dim)
        self.drop_emb = nn.Dropout(drop_rate)
        self.trf_blocks = nn.Sequential(
            *[
                DummyTransformerBlock(
                    vocab_size=vocab_size,
                    context_length=context_length,
                    emb_dim=emb_dim,
                    n_heads=n_heads,
                    n_layers=n_layers,
                    drop_rate=drop_rate,
                    qkv_bias=qkv_bias,
                    **kwargs,
                )
                for _ in range(n_layers)
            ]
        )
        self.final_norm = DummyLayerNorm(emb_dim)
        self.out_head = nn.Linear(emb_dim, vocab_size, bias=False)

    def forward(self, in_idx):
        batch_size, seq_len = in_idx.shape
        tok_embeds = self.tok_emb(in_idx)
        pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))
        x = tok_embeds + pos_embeds
        x = self.drop_emb(x)
        x = self.trf_blocks(x)
        x = self.final_norm(x)
        logits = self.out_head(x)
        return logits


class DummyTransformerBlock(nn.Module):
    def __init__(
        self,
        vocab_size,
        context_length,
        emb_dim,
        n_heads,
        n_layers,
        drop_rate,
        qkv_bias=False,
        **kwargs,
    ):
        super().__init__()

    def forward(self, x):
        return x


class DummyLayerNorm(nn.Module):
    def __init__(self, normalized_shape, eps=1e-5):
        super().__init__()

    def forward(self, x):
        return x

DummyGPTModel(**GPT_CONFIG_124M)
# DummyGPTModel(
#   (tok_emb): Embedding(50257, 768)
#   (pos_emb): Embedding(1024, 768)
#   (drop_emb): Dropout(p=0.1, inplace=False)
#   (trf_blocks): Sequential(
#     (0): DummyTransformerBlock()
#     (1): DummyTransformerBlock()
#     (2): DummyTransformerBlock()
#     (3): DummyTransformerBlock()
#     (4): DummyTransformerBlock()
#     (5): DummyTransformerBlock()
#     (6): DummyTransformerBlock()
#     (7): DummyTransformerBlock()
#     (8): DummyTransformerBlock()
#     (9): DummyTransformerBlock()
#     (10): DummyTransformerBlock()
#     (11): DummyTransformerBlock()
#   )
#   (final_norm): DummyLayerNorm()
#   (out_head): Linear(in_features=768, out_features=50257, bias=False)
# )
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")
batch = []
txt1 = "Every effort moves you"
txt2 = "Every day holds a"

batch.append(torch.tensor(tokenizer.encode(txt1)))
batch.append(torch.tensor(tokenizer.encode(txt2)))
batch = torch.stack(batch, dim=0)
print(batch)
# tensor([[6109, 3626, 6100,  345],  # First text
#         [6109, 1110, 6622,  257]]) # Second text
torch.manual_seed(42)
model = DummyGPTModel(**GPT_CONFIG_124M)
logits = model(batch)
print(f"{logits} ({logits.shape=})")
# tensor([[[ 0.7739,  0.0181, -0.0797,  ...,  0.3098,  0.8177, -0.6049],
#          [-0.8063,  0.8920, -1.0962,  ..., -0.4378,  1.1056,  0.1939],
#          [-0.8459, -1.0176,  0.4964,  ...,  0.4581, -0.3293,  0.2320],
#          [ 0.4098, -0.3144, -1.0831,  ...,  0.7491,  0.7018,  0.4715]],
# 
#         [[ 0.2911,  0.1596, -0.2137,  ...,  0.5173,  0.7380, -0.7045],
#          [-0.4064,  0.6045, -0.4485,  ..., -0.5616,  0.4590, -0.1384],
#          [-0.6108,  0.7148,  1.2499,  ..., -0.7925, -0.5328,  0.4794],
#          [ 0.9423,  0.1867, -0.5557,  ...,  0.4156,  0.1756,  1.9882]]],
#        grad_fn=<UnsafeViewBackward0>) (logits.shape=torch.Size([2, 4, 50257]))

The output tensor has 2 rows, one for each text sample, and each text sample has 4 tokens.
Each token’s output vector has 50,257 dimensions because each dimension corresponds to a unique token in the vocabulary. Later, we’ll convert these vectors back into token IDs so we can decode them into words.

4.2 Normalizing activations with layer normalization

Layer normalization is typically applied before and after the multi-head attention module, and, as we have seen with the DummyLayerNorm placeholder, before the final output layer.

Layer Normalization gives us zero mean and unit variance (a variance of 1).

# Pre LayerNorm
torch.manual_seed(42)
batch_example = torch.randn(2, 5)
layer = nn.Sequential(nn.Linear(5, 6), nn.ReLU())
out = layer(batch_example)
print(out)
mean = out.mean(dim=-1, keepdim=True)
var = out.var(dim=-1, keepdim=True)
print(f"{mean=}\n{var=}")
# tensor([[0.0000, 0.1842, 0.0052, 0.7233, 0.0000, 0.5298],
#         [0.0000, 0.0000, 0.0000, 0.2237, 0.0000, 0.7727]],
#        grad_fn=<ReluBackward0>)
# mean=tensor([[0.2404],
#         [0.1661]], grad_fn=<MeanBackward1>)
# var=tensor([[0.0982],
#         [0.0963]], grad_fn=<VarBackward0>)

# Post LayerNorm
out_norm = (out - mean) / torch.sqrt(var)
mean = out_norm.mean(dim=-1, keepdim=True)
var = out_norm.var(dim=-1, keepdim=True)
print("Normalized layer outputs:\n", out_norm)
print(f"Mean:\n{mean}")
print(f"Variance:\n{var}")
# Normalized layer outputs:
#  tensor([[-0.7672, -0.1794, -0.7506,  1.5410, -0.7672,  0.9234],
#         [-0.5351, -0.5351, -0.5351,  0.1857, -0.5351,  1.9546]],
#        grad_fn=<DivBackward0>)
# Mean:
# tensor([[0.0000e+00],
#         [7.4506e-09]], grad_fn=<MeanBackward1>)
# Variance:
# tensor([[1.0000],
#         [1.0000]], grad_fn=<VarBackward0>)

Layer Normalization:

class LayerNorm(nn.Module):
    def __init__(self, emb_dim):
        super().__init__()
        self.eps = 1e-5  # Prevent division by zero
        self.scale = nn.Parameter(torch.ones(emb_dim))
        self.shift = nn.Parameter(torch.zeros(emb_dim))
    
    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        norm_x = (x - mean) / torch.sqrt(var + self.eps)
        return self.scale * norm_x + self.shift

We set unbiased=False. When computing the variance, this divides by the number of inputs n rather than applying Bessel’s correction, which uses n - 1 in the denominator.
In LLMs, the embedding dimension n is usually large enough that the difference between n and n - 1 is negligible.
We do it here because it follows TensorFlow’s default behavior, which was used for the original GPT-2.
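To see how small the difference is at a realistic embedding dimension, here’s a quick check (a sketch; n = 768 matches the emb_dim from the config above):

x = torch.randn(768)
print(x.var(unbiased=True).item())   # Bessel's correction: divides by n - 1
print(x.var(unbiased=False).item())  # divides by n; nearly identical for large n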

ln = LayerNorm(emb_dim=5)
out_ln = ln(batch_example)
mean = out_ln.mean(dim=-1, keepdim=True)
var = out_ln.var(dim=-1, unbiased=False, keepdim=True)
print("Mean:\n", mean)
print("Variance:\n", var)
# Mean:
#  tensor([[-1.1921e-08],
#         [ 3.2037e-08]], grad_fn=<MeanBackward1>)
# Variance:
#  tensor([[1.0000],
#         [1.0000]], grad_fn=<VarBackward0>)

4.3 Implementing a feed forward network with GELU activations

ReLU is used for many DL tasks. For LLMs, however, we usually use GELU or SwiGLU (Swish-gated linear unit).

class GELU(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        return (
            0.5 * x * (1 + torch.tanh(
                torch.sqrt(torch.tensor(2.0 / torch.pi))
                * (x + 0.044715 * torch.pow(x, 3))
            ))
        )
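To get a feel for how GELU differs from ReLU, a quick comparison plot (a sketch; assumes matplotlib is available, as it is used later in these notes):

import matplotlib.pyplot as plt

gelu, relu = GELU(), nn.ReLU()
x = torch.linspace(-3, 3, 100)
y_gelu, y_relu = gelu(x), relu(x)

plt.figure(figsize=(8, 3))
for i, (y, label) in enumerate(zip([y_gelu, y_relu], ["GELU", "ReLU"]), 1):
    plt.subplot(1, 2, i)
    plt.plot(x, y)
    plt.title(f"{label} activation")
    plt.xlabel("x")
    plt.ylabel(f"{label}(x)")
    plt.grid(True)
plt.tight_layout()
plt.show()

Unlike ReLU, GELU is smooth and nonzero for slightly negative inputs, which gives better gradient behavior around zero.
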
class FeedForward(nn.Module):
    def __init__(self, emb_dim):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(emb_dim, 4 * emb_dim),
            GELU(),
            nn.Linear(4 * emb_dim, emb_dim)
        )

    def forward(self, x):
        return self.layers(x)

This feedforward layer is useful as it enhances the model’s ability to learn from and generalize the data. It expands the embedding dimension into a higher-dimensional space, uses a nonlinear GELU activation, and then contracts back to the original dimension with the second linear transformation. This allows for the exploration of a richer representation space.
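Despite the internal expansion to 4 * emb_dim (768 → 3072 → 768 in our config), the input and output shapes match, which is what lets us stack these blocks. A quick sanity check:

ffn = FeedForward(emb_dim=768)
x = torch.rand(2, 3, 768)  # (batch, num_tokens, emb_dim)
print(ffn(x).shape)        # torch.Size([2, 3, 768])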

4.4 Adding shortcut connections

Adding shortcut connections can help optimize gradient flow, e.g. alleviate The Vanishing Gradient Problem.
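As a minimal sketch of what that looks like in code (ResidualBlock is my own illustrative name, not a class from the book):

class ResidualBlock(nn.Module):
    def __init__(self, emb_dim):
        super().__init__()
        self.layer = nn.Sequential(nn.Linear(emb_dim, emb_dim), GELU())

    def forward(self, x):
        # The input is added back onto the layer's output, giving gradients a
        # direct path to earlier layers during backpropagation.
        return x + self.layer(x)

The TransformerBlock in the next section uses exactly this pattern around both its attention and feed forward sub-layers (x = layer(x) + x).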

4.5 Connecting attention and linear layers in a transformer block

class TransformerBlock(nn.Module):
    def __init__(self, emb_dim, context_length, num_heads, drop_rate, qkv_bias):
        super().__init__()
        self.layers = nn.ModuleList(
            [
                nn.Sequential(
                    LayerNorm(emb_dim),
                    MultiHeadAttention(
                        d_in=emb_dim,
                        d_out=emb_dim,
                        context_length=context_length,
                        drop_rate=drop_rate,
                        num_heads=num_heads,
                        qkv_bias=qkv_bias,
                    ),
                    nn.Dropout(drop_rate),
                ),
                nn.Sequential(
                    LayerNorm(emb_dim), FeedForward(emb_dim), nn.Dropout(drop_rate)
                ),
            ]
        )

    def forward(self, x):
        for layer in self.layers:
            x = layer(x) + x
        return x
torch.manual_seed(123)
x = torch.rand(2, 4, 768)
block = TransformerBlock(
    context_length=GPT_CONFIG_124M["context_length"],
    drop_rate=GPT_CONFIG_124M["drop_rate"],
    emb_dim=GPT_CONFIG_124M["emb_dim"],
    num_heads=GPT_CONFIG_124M["n_heads"],
    qkv_bias=GPT_CONFIG_124M["qkv_bias"]
)
output = block(x)

print(f"Input shape: {x.shape=}")
print(f"Output shape: {output.shape=}")
# Input shape: x.shape=torch.Size([2, 4, 768])
# Output shape: output.shape=torch.Size([2, 4, 768])

As we see, the shape is preserved. This is part of what makes the transformer great for sequence-to-sequence tasks.
The output is a context vector that encapsulates information from the entire input sequence.

4.6 Coding the GPT model

class GPTModel(nn.Module):
    def __init__(
        self,
        vocab_size,
        context_length,
        emb_dim,
        n_heads,
        n_layers,
        drop_rate,
        qkv_bias=False,
        **kwargs,
    ):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, emb_dim)
        self.pos_emb = nn.Embedding(context_length, emb_dim)
        self.drop_emb = nn.Dropout(drop_rate)
        self.trf_blocks = nn.Sequential(
            *[
                TransformerBlock(
                    context_length=context_length,
                    emb_dim=emb_dim,
                    num_heads=n_heads,
                    drop_rate=drop_rate,
                    qkv_bias=qkv_bias,
                    **kwargs,
                )
                for _ in range(n_layers)
            ]
        )
        self.final_norm = LayerNorm(emb_dim)
        self.out_head = nn.Linear(emb_dim, vocab_size, bias=False)

    def forward(self, in_idx):
        batch_size, seq_len = in_idx.shape
        tok_embeds = self.tok_emb(in_idx)
        pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))
        x = tok_embeds + pos_embeds
        x = self.drop_emb(x)
        x = self.trf_blocks(x)
        x = self.final_norm(x)
        logits = self.out_head(x)
        return logits

We use the final layer norm to standardize the outputs from the transformer blocks to stabilize the learning process.
The linear output head is defined without bias. It projects the transformer’s output into the vocabulary space of the tokenizer to generate logits for each token in the vocabulary.

torch.manual_seed(123)

model = GPTModel(
    vocab_size=GPT_CONFIG_124M["vocab_size"],
    context_length=GPT_CONFIG_124M["context_length"],
    drop_rate=GPT_CONFIG_124M["drop_rate"],
    emb_dim=GPT_CONFIG_124M["emb_dim"],
    n_heads=GPT_CONFIG_124M["n_heads"],
    n_layers=GPT_CONFIG_124M["n_layers"],
    qkv_bias=GPT_CONFIG_124M["qkv_bias"]
)

out = model(batch)
print(f"Input batch:\n{batch}")
print()
print(f"Output shape: {out.shape=}")
print(out)

# Input batch:
# tensor([[6109, 3626, 6100,  345],
#         [6109, 1110, 6622,  257]])
# 
# Output shape: out.shape=torch.Size([2, 4, 50257])
# tensor([[[ 0.1381,  0.0077, -0.1963,  ..., -0.0222, -0.1060,  0.1717],
#          [ 0.3865, -0.8408, -0.6564,  ..., -0.5163,  0.2369, -0.3357],
#          [ 0.6989, -0.1829, -0.1631,  ...,  0.1472, -0.6504, -0.0056],
#          [-0.4290,  0.1669, -0.1258,  ...,  1.1579,  0.5303, -0.5549]],
# 
#         [[ 0.1094, -0.2894, -0.1467,  ..., -0.0557,  0.2911, -0.2824],
#          [ 0.0882, -0.3552, -0.3527,  ...,  1.2930,  0.0053,  0.1898],
#          [ 0.6091,  0.4702, -0.4094,  ...,  0.7688,  0.3787, -0.1974],
#          [-0.0612, -0.0737,  0.4751,  ...,  1.2463, -0.3834,  0.0609]]],
#        grad_fn=<UnsafeViewBackward0>)
total_params = sum(p.numel() for p in model.parameters())
print(f"Total number of parameters: {total_params:,}")
# Total number of parameters: 163,009,536

The “124-million-parameter” GPT model actually has 163 million parameters. The discrepancy comes from weight tying: the original GPT-2 architecture reused the weights from the token embedding layer in the output layer, whereas our implementation uses a separate output layer.
If we subtract the output layer’s parameters from the total, we get 124 million.

Weight tying reduces the memory footprint and computational complexity, but using separate token embedding and output layers can result in better training and model performance.
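We can verify the arithmetic with the model we just built (the numbers follow directly from the sizes above):

out_head_params = sum(p.numel() for p in model.out_head.parameters())
print(f"{out_head_params:,}")                 # 38,597,376 (= 50,257 * 768)
print(f"{total_params - out_head_params:,}")  # 124,412,160, i.e. ~124M
# Weight tying would reuse the embedding matrix as the output projection, e.g.:
# model.out_head.weight = model.tok_emb.weight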

# Exercise 4.1: Calculate the number of parameters that are contained in the feed forward module and those that are contained in the multi-head attention module.
ff_params = 0
attn_params = 0

for module in model.modules():
    if isinstance(module, FeedForward):
        ff_params += sum(p.numel() for p in module.parameters())
    elif isinstance(module, MultiHeadAttention):
        attn_params += sum(p.numel() for p in module.parameters())

print(f"Parameters in feed forward layers: {ff_params:,}")
print(f"Parameters in attention layers: {attn_params:,}")
print(f"Percentage of total parameters:")
print(f"Feed forward: {ff_params/total_params*100:.1f}%")
print(f"Attention: {attn_params/total_params*100:.1f}%")
# Parameters in feed forward layers: 56,669,184
# Parameters in attention layers: 28,320,768
# Percentage of total parameters:
# Feed forward: 34.8%
# Attention: 17.4%
total_size_bytes = total_params * 4 # assumes float32, = 4 bytes per parameter
total_size_mb = total_size_bytes / (1024 * 1024)
print(f"Total size of the model: {total_size_mb:.2f} MB")
# Total size of the model: 621.83 MB

4.7 Generating text

# `idx` is a (batch, n_tokens) array of indices in the current context
def generate_text_simple(model, idx, max_new_tokens, context_size):
    for _ in range(max_new_tokens):
        # crop the current context if it exceeds the supported context size (only the last 'context_size' tokens are used as context)
        idx_cond = idx[:, -context_size:]
        with torch.no_grad():
            logits = model(idx_cond)

        logits = logits[:, -1, :] # focus on last time step
        probas = torch.softmax(logits, dim=-1) # (batch, vocab_size)
        idx_next = torch.argmax(probas, dim=-1, keepdim=True) # (batch, 1)
        idx = torch.cat((idx, idx_next), dim=1) # appends sampled index to the running sequence. idx: (batch, n_tokens+1)

    return idx

This uses greedy decoding, wherein the model generates the most likely next token. However, we can use other sampling techniques to modify the softmax outputs such that it doesn’t always select the most likely token. This introduces variability and creativity in the generated text.

start_context = "Hello, I am"
encoded = tokenizer.encode(start_context)
print(f"{encoded=}")
encoded_tensor = torch.tensor(encoded).unsqueeze(0) # adds batch dimension
print(f"{encoded_tensor.shape=}")
# encoded=[15496, 11, 314, 716]
# encoded_tensor.shape=torch.Size([1, 4])
model.eval()
out = generate_text_simple(
    model=model,
    idx=encoded_tensor,
    max_new_tokens=6,
    context_size=GPT_CONFIG_124M["context_length"]
)
print(f"{out=}")
print(f"{len(out)=}")
# out=tensor([[15496,    11,   314,   716, 27018, 24086, 47843, 30961, 42348,  7267]])
# len(out)=1
decoded_text = tokenizer.decode(out.squeeze(0).tolist())
print(decoded_text)
# Hello, I am Featureiman Byeswickattribute argue

5 Pretraining on unlabeled data

5.1 Evaluating generative text models

5.1.1 Using GPT to generate text

import torch

from build_a_large_language_model_from_scratch.lib.GPTModel import GPTModel

GPT_CONFIG_124M = {
    "vocab_size": 50257,
    "context_length": 256, # intentionally shortening it to reduce computational demands of training
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12, 
    "drop_rate": 0.1, # it's possible and common to set dropout to 0
    "qkv_bias": False
}

cfg = GPT_CONFIG_124M

model = GPTModel(
    context_length=cfg["context_length"],
    drop_rate=cfg["drop_rate"],
    emb_dim=cfg["emb_dim"],
    n_heads=cfg["n_heads"],
    n_layers=cfg["n_layers"],
    vocab_size=cfg["vocab_size"],
    qkv_bias=cfg["qkv_bias"]
)

torch.manual_seed(123)
model.eval()
import tiktoken

from build_a_large_language_model_from_scratch.lib.generate import generate_text_simple

def text_to_token_ids(text: str, tokenizer):
    encoded = tokenizer.encode(text, allowed_special={'<|endoftext|>'})
    encoded_tensor = torch.tensor(encoded).unsqueeze(0)  # `unsqueeze(0)` adds batch dimension
    return encoded_tensor

def token_ids_to_text(token_ids, tokenizer):
    flat = token_ids.squeeze(0)
    return tokenizer.decode(flat.tolist())

start_context = "Every effort moves you"
tokenizer = tiktoken.get_encoding("gpt2")

token_ids = generate_text_simple(
    model=model,
    idx=text_to_token_ids(start_context, tokenizer),
    max_new_tokens=10,
    context_size=cfg["context_length"]
)
print(f"Output text:\n{token_ids_to_text(token_ids, tokenizer)}")
# Output text:
# Every effort moves you rentingetic wasnم refres RexMeCHicular stren

5.1.2 Calculating the text generation loss

The steps for getting the next token:

  1. Tokenize inputs (input is a sentence), translate to token IDs via the vocabulary
  2. Apply the model to the token IDs to get logits, then apply softmax to get a probability distribution over the vocabulary (a probability for each token in the vocabulary)
  3. Find the most likely next token by taking the argmax of the probabilities, the index of which is the token id (greedy decoding)
  4. Use the inverse map (id → token) to get the predicted sentence

During model training, we want to optimize the model’s parameters such that it assigns the highest probability to the corresponding target token. So after we’ve used softmax on the logits to get a probability distribution over the vocabulary, the idea is that the highest probability value in the vector should be at the index for the token that is denoted in the target vector as the next word.

The loss we’ll be using is the negative average log probability.
Let’s illustrate by example. Say we have input and targets like this:

inputs = torch.tensor([
    [16833, 3626, 6100], # "every effort moves"
    [40, 1107, 58]])     # "I really like"

targets = torch.tensor([
    [3626, 6100, 345], # " effort moves you"
    [1107, 588, 11311] # " really like chocolate"
])

Then we can get the probabilities, which we use to get the predicted token ids.

with torch.no_grad():
    logits = model(inputs)

probas = torch.softmax(logits, dim=-1)
print(f"{probas.shape=}")
# probas.shape=torch.Size([2, 3, 50257])

token_ids = torch.argmax(probas, dim=-1, keepdim=True)
print(f"{token_ids=}")
# token_ids=tensor([[[41498],
#          [43024],
#          [16685]],
# 
#         [[21934],
#          [33733],
#          [22443]]])

# Just to illustrate:
[tokenizer.decode(o) for o in token_ids.flatten(1).tolist()]
# [' scaff Retrieved Barbara', 'Sunday Positive sulf']

print(f"Targets batch 1: {token_ids_to_text(targets[0], tokenizer)}")
print(f"Outputs batch 1: {token_ids_to_text(token_ids[0].flatten(), tokenizer)}")
# Targets batch 1:  effort moves you
# Outputs batch 1:  scaff Retrieved Barbara

keepdim=True ensures the output tensor retains the same number of dimensions as the input tensor, even if the size of the dimension is being reduced by 1.
In other words, it’s the difference between a token_ids.shape of [2, 3, 1] and of [2, 3].

Now we compute the loss – the negative average log probabilities:
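To do that, we first need the probabilities the model assigned to each target token. That step is implied but not shown in my notes above; a minimal sketch using advanced indexing on the probas tensor:

target_probas_1 = probas[0, [0, 1, 2], targets[0]]  # target-token probabilities, text 1
target_probas_2 = probas[1, [0, 1, 2], targets[1]]  # target-token probabilities, text 2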

log_probas = torch.log(torch.cat((target_probas_1, target_probas_2)))
print(log_probas)
# tensor([-10.4315, -9.5601, -10.9243, -10.9316, -10.3856, -11.3752])

avg_log_probas = torch.mean(log_probas)
print(avg_log_probas)
# tensor(-10.6014)

neg_avg_log_probas = avg_log_probas * -1
print(neg_avg_log_probas)
# tensor(10.6014)

The goal is to get the average log probability as close to 0 as possible during training.
In DL, common practice isn’t to push the avg log prob up to 0, but rather bring the neg avg log prob down to zero.

This value, the negative average log probability (turning -10.6014 into 10.6014), is known as the Cross Entropy loss.
We can just use PyTorch’s Cross Entropy loss function—this does the same as the steps we did above:

# We don't have the right shapes for cross_entropy:
print(f"{logits.shape=}")
print(f"{targets.shape=}")
# logits.shape=torch.Size([2, 3, 50257])
# targets.shape=torch.Size([2, 3])

# Need to flatten for `cross_entropy` to work, so we combine them over the batch dimension:
logits_flat = logits.flatten(0, 1)
targets_flat = targets.flatten()
print(f"{logits_flat.shape}")
print(f"{targets_flat.shape}")
# torch.Size([6, 50257])
# torch.Size([6])

loss = torch.nn.functional.cross_entropy(logits_flat, targets_flat)
print(loss)
# tensor(10.6014)

Cross Entropy loss measures the difference between two Probability Distributions—typically the true distribution of labels (e.g. tokens in a dataset) & the predicted distribution (e.g. token probabilities by an LLM).
The cross_entropy function in ML frameworks computes this measure for discrete outcomes, which is similar to the negative average log probability of the target tokens given the model’s generated token probabilities.
That’s what makes the two terms related & often used interchangeably in practice.

We often use the Perplexity measure alongside Cross Entropy loss to evaluate the performance of the models in tasks like language modeling.

Perplexity can provide a more interpretable way to understand the uncertainty of a model in predicting the next token in a sequence.
It’s considered more interpretable than the raw loss value because it signifies the effective vocabulary size about which the model is uncertain at each step: a perplexity value of x signifies that the model is unsure about which among x tokens in the vocabulary to generate as the next token.

It measures how well the probability distribution predicted by the model matches the actual distribution of the words in the dataset. Lower scores mean the model predictions are closer to the actual distribution.

Calculate as:

perplexity = torch.exp(loss)
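With the loss above, torch.exp(torch.tensor(10.6014)) is roughly 40,000: the untrained model is about as uncertain as if it had to pick uniformly among ~40,000 of the 50,257 tokens in the vocabulary.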

5.1.3 Calculating the training and validation set losses

We’re using a batch size of 2 and a train/val split of 0.9.
We use a small batch size to reduce computational resource demand because we’re working with a small dataset.
Using batch sizes of 1024 or larger is not uncommon.

We’re also using the data loaders from chapter 2.

# val_loader is similar, omitting for brevity
train_loader = create_dataloader_v1(
    train_data,  # raw text
    batch_size=2, # small batch size for reduced compute usage
    max_length=cfg["context_length"],
    stride=cfg["context_length"],
    drop_last=True,
    shuffle=True,
    num_workers=0
)

def calc_loss_batch(input_batch, target_batch, model, device):
    input_batch = input_batch.to(device)
    target_batch = target_batch.to(device)
    logits = model(input_batch)
    loss = torch.nn.functional.cross_entropy(
        logits.flatten(0, 1), target_batch.flatten()
    )
    return loss

def calc_loss_loader(data_loader, model, device, num_batches=None):
    total_loss = 0.
    if len(data_loader) == 0:
        return float("nan")
    elif num_batches is None:
        num_batches = len(data_loader)
    else:
        num_batches = min(num_batches, len(data_loader))
    
    for i, (input_batch, target_batch) in enumerate(data_loader):
        if i < num_batches:
            loss = calc_loss_batch(input_batch, target_batch, model, device)
            total_loss += loss.item()
        else:
            break
    
    return total_loss / num_batches

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

with torch.no_grad():
    train_loss = calc_loss_loader(train_loader, model, device)
    val_loss = calc_loss_loader(val_loader, model, device)

print("Training loss:", train_loss)
print("Validation loss:", val_loss)
# Training loss: 11.023475329081217
# Validation loss: 10.993597984313965

Loss approaches 0 if the model learns to generate the next tokens as they appear in the training and validation sets.

5.2 Training an LLM

Advanced techniques:

  • Learning rate warmup
  • Cosine annealing
  • Gradient Clipping

We’ll use the AdamW optimizer. It improves on Adam by decoupling weight decay from the gradient update; the weight decay penalizes larger weights to limit model complexity and prevent overfitting, which leads to more effective Regularization and better generalization.
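The advanced techniques listed above aren’t used in the simple training loop below, but here’s a rough sketch of how they would plug in using standard PyTorch utilities (the learning rate, schedule lengths, and max_norm are illustrative values, not from the book):

optimizer = torch.optim.AdamW(model.parameters(), lr=4e-4, weight_decay=0.1)
# linear warmup for 20 steps, then cosine annealing
warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.1, total_iters=20)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1000)
scheduler = torch.optim.lr_scheduler.SequentialLR(optimizer, [warmup, cosine], milestones=[20])

for input_batch, target_batch in train_loader:
    optimizer.zero_grad()
    loss = calc_loss_batch(input_batch, target_batch, model, device)
    loss.backward()
    # gradient clipping: rescale gradients whose global norm exceeds 1.0
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()  # advance the learning rate schedule once per batch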

from torch.utils.data import DataLoader

def train_model_simple(
    model,
    train_loader: DataLoader,
    val_loader: DataLoader,
    optimizer,
    device,
    num_epochs: int,
    eval_freq: int,
    eval_iter: int,
    start_context: str,
    tokenizer,
):
    train_losses, val_losses, track_tokens_seen = [], [], []
    tokens_seen, global_step = 0, -1

    for epoch in range(num_epochs):
        model.train()
        for input_batch, target_batch in train_loader:
            optimizer.zero_grad()  # reset loss gradients from previous iteration
            loss = calc_loss_batch(input_batch, target_batch, model, device)
            loss.backward()  # calculate loss gradients
            optimizer.step()  # update model weights using the loss gradients
            tokens_seen += input_batch.numel()
            global_step += 1

            if global_step % eval_freq == 0:
                train_loss, val_loss = evaluate_model(
                    model, train_loader, val_loader, device, eval_iter
                )
                train_losses.append(train_loss)
                val_losses.append(val_loss)
                track_tokens_seen.append(tokens_seen)
                print(
                    f"Ep {epoch+1} (Step {global_step:06d}): "
                    f"Train loss {train_loss:.3f}, "
                    f"Val loss {val_loss:.3f}"
                )

        generate_and_print_sample(model, tokenizer, device, start_context)

    return train_losses, val_losses, track_tokens_seen

def evaluate_model(model, train_loader: DataLoader, val_loader: DataLoader, device, eval_iter: int):
    model.eval()  # to disable dropout during evaluation
    with torch.no_grad():  # disable gradient tracking; not needed during evaluation (reduces computational overhead)
        train_loss = calc_loss_loader(
            train_loader, model, device, num_batches=eval_iter
        )
        val_loss = calc_loss_loader(
            val_loader, model, device, num_batches=eval_iter
        )
    model.train()
    return train_loss, val_loss

def generate_and_print_sample(model, tokenizer, device, start_context):
    model.eval()
    context_size = model.pos_emb.weight.shape[0]
    encoded = text_to_token_ids(start_context, tokenizer).to(device)
    with torch.no_grad():
        token_ids = generate_text_simple(
            model=model, idx=encoded,
            max_new_tokens=50, context_size=context_size
        )
    decoded_text = token_ids_to_text(token_ids, tokenizer)
    print(decoded_text.replace("\n", " "))
    model.train()

And now for training:

torch.manual_seed(123)

model = GPTModel(
    vocab_size=cfg["vocab_size"],
    context_length=cfg["context_length"],
    drop_rate=cfg["drop_rate"],
    emb_dim=cfg["emb_dim"],
    n_heads=cfg["n_heads"],
    n_layers=cfg["n_layers"],
    qkv_bias=cfg["qkv_bias"]
)
model.to(device)

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=0.0004, weight_decay=0.1
)

num_epochs = 10
train_losses, val_losses, tokens_seen = train_model_simple(
    model, train_loader, val_loader, optimizer, device,
    num_epochs=num_epochs, eval_freq=5, eval_iter=5,
    start_context="Every effort moves you", tokenizer=tokenizer
)
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator

def plot_losses(epochs_seen, tokens_seen, train_losses, val_losses):
    fig, ax1 = plt.subplots(figsize=(5, 3))
    ax1.plot(epochs_seen, train_losses, label="Training loss")
    ax1.plot(epochs_seen, val_losses, linestyle="-.", label="Validation loss")
    ax1.set_xlabel("Epochs")
    ax1.set_ylabel("Loss")
    ax1.legend(loc="upper right")
    ax1.xaxis.set_major_locator(MaxNLocator(integer=True))
    ax2 = ax1.twiny()  # creates a second x-axis that shares the same y-axis
    ax2.plot(tokens_seen, train_losses, alpha=0)  # invisible plot for aligning ticks
    ax2.set_xlabel("Tokens seen")
    fig.tight_layout()
    plt.show()

epochs_tensor = torch.linspace(0, num_epochs, len(train_losses))
plot_losses(epochs_tensor, tokens_seen, train_losses, val_losses)

5.3 Decoding strategies to control randomness

We’ve previously used greedy decoding, meaning we always select the most probable token as the output.
This section details two alternative strategies: temperature scaling and top-k sampling.

5.3.1 Temperature scaling

This adds a probabilistic selection process to the next-token generation task.

Instead of using argmax, as we do in greedy decoding, we use a function that samples from a probability distribution (here, the probability scores the LLM generates for each vocabulary entry at each token generation step).

We can use torch.multinomial instead of torch.argmax.
This function samples the next token proportionally to its probability score. The most likely token is therefore still the most likely to be selected, but it is not guaranteed.

We can further control the distribution and selection process via temperature scaling.
All that means is we divide the logits by a number greater than 0.

def softmax_with_temperature(logits, temperature):
    scaled_logits = logits / temperature
    return torch.softmax(scaled_logits, dim=0)

Temperatures > 1 result in more uniformly distributed token probabilities. This can lead to more variety, but also more nonsensical text, as other tokens are selected more often.
Temperatures < 1 will result in more confident (sharper or more peaky) distributions—the most likely token will have an even higher probability score.
A temperature of 1 is the same as not using any temperature scaling.
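A tiny illustration of the effect (the logit values here are made up):

example_logits = torch.tensor([2.0, 1.0, 0.1])
for T in [0.5, 1.0, 2.0]:
    print(T, softmax_with_temperature(example_logits, T))
# T=0.5 sharpens the distribution toward the largest logit,
# while T=2.0 flattens it toward uniform.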

5.3.2 Top-k sampling

With temperature scaling, we may see diverse outputs, but these are sometimes grammatically incorrect or completely nonsensical.

Using top-k sampling with probabilistic sampling and temperature scaling, we can improve the text generation results.

The idea is to restrict the sampled tokens to the top-k most likely tokens and exclude all other tokens from the selection process by masking their probability scores.

  1. Select top-k from logits
  2. Apply -inf mask (will be scaled to zero via softmax)
  3. Apply softmax, which assigns zero-probabilities to the non-top-k positions, so the next token is always sampled from a top-k position

# `next_token_logits` is assumed to be a 1-D tensor of raw logits over a small
# example vocabulary; the values are chosen to be consistent with the outputs below
next_token_logits = torch.tensor(
    [4.51, 0.89, -1.90, 6.75, 1.63, -1.62, -1.89, 6.28, 1.79]
)
top_k = 3
top_logits, top_pos = torch.topk(next_token_logits, top_k)
# top_logits=tensor([6.7500, 6.2800, 4.5100])
# top_pos=tensor([3, 7, 0])
new_logits = torch.where(
    condition=next_token_logits < top_logits[-1],
    input=torch.tensor(float('-inf')),
    other=next_token_logits
)
# new_logits=tensor([4.5100, -inf, -inf, 6.7500, -inf, -inf, -inf, 6.2800, -inf])
topk_probas = torch.softmax(new_logits, dim=0)
# topk_probas=tensor([0.0615, 0.0000, 0.0000, 0.5775, 0.0000, 0.0000, 0.0000, 0.3610, 0.0000])

5.3.3 Modifying the text generation function

from typing import Optional


def generate(model, idx, max_new_tokens: int, context_size: int, temperature=0.0, top_k: Optional[int] = None, eos_id=None):
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -context_size:]
        with torch.no_grad():
            logits = model(idx_cond)

        # Get last token in current sequence
        logits = logits[:, -1, :]
        # top-k sampling
        if top_k is not None:
            top_logits, _ = torch.topk(logits, top_k)
            min_val = top_logits[:, -1]
            logits = torch.where(
                condition=logits < min_val,
                input=torch.tensor(float('-inf')).to(logits.device),
                other=logits
            )
        if temperature > 0.0:
            # temperature scaling
            logits = logits / temperature
            probs = torch.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
        else:
            # greedy decoding
            idx_next = torch.argmax(logits, dim=-1, keepdim=True)
        # check if we've reached end-of-sequence
        if idx_next == eos_id:
            break
        # append generated token to current sequence for further generation
        idx = torch.cat((idx, idx_next), dim=1)
    return idx

5.4 Loading and saving model weights in PyTorch

The recommended way to save a PyTorch model is by saving the model’s state_dict:

torch.save(model.state_dict(), "model.pth")

The state_dict maps each layer to its parameters.
The .pth extension is convention for PyTorch files.

Loading the model weights:

model = GPTModel(...)
model.load_state_dict(torch.load("model.pth", map_location=device))
model.eval()  # don't want dropout during inference

If we wish to continue pretraining later, it’s also recommended to save the optimizer state.
Adaptive optimizers like AdamW store additional parameters for each model weight. Without these, the optimizer resets, potentially ruining training.

# Save
torch.save(
    {
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
    },
    "model_and_optimizer.pth",
)

# Load
checkpoint = torch.load("model_and_optimizer.pth", map_location=device)
model = GPTModel(
    vocab_size=cfg["vocab_size"],
    context_length=cfg["context_length"],
    drop_rate=cfg["drop_rate"],
    emb_dim=cfg["emb_dim"],
    n_heads=cfg["n_heads"],
    n_layers=cfg["n_layers"],
    qkv_bias=cfg["qkv_bias"]
)
model.load_state_dict(checkpoint["model_state_dict"])
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.1)
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
model.train()

5.5 Loading pretrained weights from OpenAI

My implementation differs slightly from the book as I’ve used a ModuleList for the components of my MHA.

The author provided a script to download and load gpt2 weights from OpenAI.

from build_a_large_language_model_from_scratch.gpt_download import download_and_load_gpt2

settings, params = download_and_load_gpt2(
    model_size="124M", models_dir="gpt2"
)

model_configs = {
    "gpt2-small (124M)": {"emb_dim": 768, "n_layers": 12, "n_heads": 12},
	# ...
}

model_name = "gpt2-small (124M)"
NEW_CONFIG = GPT_CONFIG_124M.copy()
NEW_CONFIG.update(model_configs[model_name])
# Because we modified it earlier
NEW_CONFIG.update({"context_length": 1024, "qkv_bias": True})

gpt = GPTModel(...)
gpt.eval()

Now we need this utility function to make loading weights into the model easier:

def assign(left, right):
    """Assigns values from right tensor to left tensor after shape validation.
    
    Args:
        left: Target PyTorch tensor/parameter
        right: Source tensor/array to copy values from
        
    Returns:
        torch.nn.Parameter: New parameter containing values from right tensor
        
    Raises:
        ValueError: If shapes of left and right tensors don't match
    """
    if left.shape != right.shape:
        raise ValueError(f"Shape mismatch. Left: {left.shape}, Right: {right.shape}")

    return torch.nn.Parameter(torch.tensor(right))

And the code to load the weights:

import numpy as np


def load_weights_into_gpt(gpt, params):
    gpt.pos_emb.weight = assign(gpt.pos_emb.weight, params["wpe"])
    gpt.tok_emb.weight = assign(gpt.tok_emb.weight, params["wte"])

    # iterate over transformer blocks
    for b in range(len(params["blocks"])):
        # split is used to divide attention and bias weights into three equal parts for the qkv components
        # load attention qkv weights
        q_w, k_w, v_w = np.split(
            (params["blocks"][b]["attn"]["c_attn"])["w"], 3, axis=-1
        )
        gpt.trf_blocks[b].layers[0][1].W_query.weight = assign(
            gpt.trf_blocks[b].layers[0][1].W_query.weight, q_w.T
        )
        gpt.trf_blocks[b].layers[0][1].W_key.weight = assign(
            gpt.trf_blocks[b].layers[0][1].W_key.weight, k_w.T
        )
        gpt.trf_blocks[b].layers[0][1].W_value.weight = assign(
            gpt.trf_blocks[b].layers[0][1].W_value.weight, v_w.T
        )

        # load attn qkv bias
        q_b, k_b, v_b = np.split(
            (params["blocks"][b]["attn"]["c_attn"])["b"], 3, axis=-1
        )
        gpt.trf_blocks[b].layers[0][1].W_query.bias = assign(
            gpt.trf_blocks[b].layers[0][1].W_query.bias, q_b
        )
        gpt.trf_blocks[b].layers[0][1].W_key.bias = assign(
            gpt.trf_blocks[b].layers[0][1].W_key.bias, k_b
        )
        gpt.trf_blocks[b].layers[0][1].W_value.bias = assign(
            gpt.trf_blocks[b].layers[0][1].W_value.bias, v_b
        )

        # load attn linear projection weights
        gpt.trf_blocks[b].layers[0][1].out_proj.weight = assign(
            gpt.trf_blocks[b].layers[0][1].out_proj.weight,
            params["blocks"][b]["attn"]["c_proj"]["w"].T,
        )
        gpt.trf_blocks[b].layers[0][1].out_proj.bias = assign(
            gpt.trf_blocks[b].layers[0][1].out_proj.bias,
            params["blocks"][b]["attn"]["c_proj"]["b"],
        )

        # load feedforward network weights and biases
        gpt.trf_blocks[b].layers[1][1].layers[0].weight = assign(
            gpt.trf_blocks[b].layers[1][1].layers[0].weight,
            params["blocks"][b]["mlp"]["c_fc"]["w"].T,
        )
        gpt.trf_blocks[b].layers[1][1].layers[0].bias = assign(
            gpt.trf_blocks[b].layers[1][1].layers[0].bias,
            params["blocks"][b]["mlp"]["c_fc"]["b"]
        )
        gpt.trf_blocks[b].layers[1][1].layers[2].weight = assign(
            gpt.trf_blocks[b].layers[1][1].layers[2].weight,
            params["blocks"][b]["mlp"]["c_proj"]["w"].T,
        )
        gpt.trf_blocks[b].layers[1][1].layers[2].bias = assign(
            gpt.trf_blocks[b].layers[1][1].layers[2].bias,
            params["blocks"][b]["mlp"]["c_proj"]["b"],
        )

        # load layer norm params
        gpt.trf_blocks[b].layers[0][0].scale = assign(
            gpt.trf_blocks[b].layers[0][0].scale,
            params["blocks"][b]["ln_1"]["g"]
        )
        gpt.trf_blocks[b].layers[0][0].shift = assign(
            gpt.trf_blocks[b].layers[0][0].shift,
            params["blocks"][b]["ln_1"]["b"]
        )
        gpt.trf_blocks[b].layers[1][0].scale = assign(
            gpt.trf_blocks[b].layers[1][0].scale,
            params["blocks"][b]["ln_2"]["g"]
        )
        gpt.trf_blocks[b].layers[1][0].shift = assign(
            gpt.trf_blocks[b].layers[1][0].shift,
            params["blocks"][b]["ln_2"]["b"]
        )

    gpt.final_norm.scale = assign(gpt.final_norm.scale, params["g"])
    gpt.final_norm.shift = assign(gpt.final_norm.shift, params["b"])
    # Original GPT-2 model reused the token embedding weights to reduce the total number of params (weight tying)
    gpt.out_head.weight = assign(gpt.out_head.weight, params["wte"])

As mentioned, my code is slightly different from the book, as I’ve used a ModuleList in my MHA implementation.

Now we can load and test!

load_weights_into_gpt(gpt, params)
gpt.to(device)

torch.manual_seed(123)
token_ids = generate(
    model=gpt,
    idx=text_to_token_ids("Every effort moves you", tokenizer).to(device),
    max_new_tokens=25,
    context_size=NEW_CONFIG["context_length"],
    top_k=50,
    temperature=1.5
)

print("Output text:\n", token_ids_to_text(token_ids, tokenizer))
# Output text:
#  Every effort moves you as far as the hand can go until the end of your turn unless something interrupts your control flow. As you may observe I

6 Fine-tuning for classification

6.1 Different categories of fine-tuning

Language models are most commonly either instruction fine-tuned or classification fine-tuned.

Instruction fine-tuning means you train the model on a set of tasks using specific instructions to improve its ability to understand and execute tasks described by natural language prompts.

Classification fine-tuning means you train the model to recognize a specific set of class labels (e.g. “spam”, “not spam”).

6.2 Preparing the dataset

Here we download the dataset. It has two columns: Text and Label. Since it’s unbalanced, we balance it by keeping an equal number of examples from each class. Then we do a random split into train/val/test.
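A sketch of the balancing step, assuming a pandas DataFrame df with a Label column holding "ham" / "spam" strings (an assumption about the raw labels on my part; the labels are mapped to integers before reaching SpamDataset below):

import pandas as pd

def create_balanced_dataset(df: pd.DataFrame) -> pd.DataFrame:
    num_spam = df[df["Label"] == "spam"].shape[0]
    # undersample the majority ("ham") class to match the number of spam messages
    ham_subset = df[df["Label"] == "ham"].sample(num_spam, random_state=123)
    return pd.concat([ham_subset, df[df["Label"] == "spam"]])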

6.3 Creating data loaders

We can’t use a sliding window as before, so we’ll have to either shorten each sample to match the length of the shortest one or pad all samples to match the length of the longest. This can be done per batch or over the entire dataset.

Padding is better, so we don’t lose information.

import pandas as pd
import torch
from torch.utils.data import Dataset
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")
print(tokenizer.encode("<|endoftext|>", allowed_special={"<|endoftext|>"}))
# [50256]

class SpamDataset(Dataset):
    def __init__(self, csv_file, tokenizer, max_length=None, pad_token_id=50256):
        self.data = pd.read_csv(csv_file)

        # Tokenize
        self.encoded_texts = [
            tokenizer.encode(text) for text in self.data["Text"]
        ]

        if max_length is None:
            self.max_length = self._longest_encoded_length()
        else:
            self.max_length = max_length
            # Truncate any sentence longer than `max_length`
            self.encoded_texts = [
                encoded_text[:self.max_length]
                for encoded_text in self.encoded_texts
            ]

        # Pad
        self.encoded_texts = [
            encoded_text + [pad_token_id] *
            (self.max_length - len(encoded_text))
            for encoded_text in self.encoded_texts
        ]

    def __getitem__(self, index):
        encoded = self.encoded_texts[index]
        label = self.data.iloc[index]["Label"]
        return (
            torch.tensor(encoded, dtype=torch.long),
            torch.tensor(label, dtype=torch.long)
        )

    def __len__(self):
        return len(self.data)
    
    def _longest_encoded_length(self):
        max_length = 0
        for encoded_text in self.encoded_texts:
            encoded_length = len(encoded_text)
            if encoded_length > max_length:
                max_length = encoded_length
        return max_length

Then we create some DataLoaders.
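Roughly like this (the CSV file names and batch size are assumptions on my part):

from torch.utils.data import DataLoader

train_dataset = SpamDataset("train.csv", tokenizer=tokenizer)
val_dataset = SpamDataset("validation.csv", tokenizer=tokenizer, max_length=train_dataset.max_length)
test_dataset = SpamDataset("test.csv", tokenizer=tokenizer, max_length=train_dataset.max_length)

train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True, drop_last=True)
val_loader = DataLoader(val_dataset, batch_size=8, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=8, shuffle=False)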

6.4 Initializing a model with pretrained weights

Just following the same procedure as the previous chapter.
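In outline, a sketch of the setup the code below assumes (the config values follow GPT-2 small with qkv_bias=True for the OpenAI weights and dropout disabled; the exact values are my assumption, not copied from the book):

BASE_CONFIG = {
    "vocab_size": 50257,
    "context_length": 1024,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.0,
    "qkv_bias": True,
}

settings, params = download_and_load_gpt2(model_size="124M", models_dir="gpt2")
model = GPTModel(**BASE_CONFIG)
load_weights_into_gpt(model, params)
model.eval()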

text_2 = (
    "Is the following text 'spam'? Answer with 'yes' or 'no':"
    " 'You are a winner you have been specially"
    " selected to receive $1000 cash or a $2000 award.'"
)
token_ids = generate_text_simple(
    model=model,
    idx=text_to_token_ids(text_2, tokenizer),
    max_new_tokens=23,
    context_size=BASE_CONFIG["context_length"]
)
print(token_ids_to_text(token_ids, tokenizer))

As we can see, the model isn’t capable of classifying (yet!).

6.5 Adding a classification head

We modify the pretrained LLM for classification fine-tuning by replacing the original output layer, which maps the hidden representation to a vocabulary of 50257 tokens, with a smaller output layer that outputs to our target classes.

It’s possible to use a single output node, since we’re doing binary classification, but it’d require modifying the loss function. So we take the approach where the number of output nodes matches the number of classes.

When fine-tuning from a pretrained model, it isn’t necessary to fine-tune all model layers. The lower layers in neural nets generally capture basic language structures and semantics applicable across a wide range of tasks, so fine-tuning only the last layers that deal with more nuanced linguistic patterns and task-specific features is fine.
Training just the output layer can be sufficient, but fine-tuning additional layers near the output can lead to improved predictive performance.

# Freeze the model
for param in model.parameters():
    param.requires_grad = False

# Replace the out head:
torch.manual_seed(123)
num_classes = 2
model.out_head = torch.nn.Linear(
    in_features=BASE_CONFIG["emb_dim"],
    out_features=num_classes
)

requires_grad is already set to True on the new layer. We’ll also set it to True for the final transformer block and the final layer norm:

for param in model.trf_blocks[-1].parameters():
    param.requires_grad = True

for param in model.final_norm.parameters():
    param.requires_grad = True

The model still works, despite us having replaced the out_head:

inputs = tokenizer.encode("Do you have time")
inputs = torch.tensor(inputs).unsqueeze(0)
print("Inputs:", inputs)
print("Inputs dimensions:", inputs.shape)
# Inputs: tensor([[5211,  345,  423,  640]])
# Inputs dimensions: torch.Size([1, 4])

torch.Size([1, 4]) is essentially batch_size and num_tokens in the input.
One sample in the batch consisting of four tokens.

with torch.no_grad():
    outputs = model(inputs)

print("Outputs:\n", outputs)
print("Outputs dimensions:", outputs.shape)
# Outputs:
#  tensor([[[-1.5854,  0.9904],
#          [-3.7235,  7.4548],
#          [-2.2661,  6.6049],
#          [-3.5983,  3.9902]]])
# Outputs dimensions: torch.Size([1, 4, 2])

Previously this would have produced an output tensor of shape [1, 4, 50257], where 50257 represents the vocabulary size. The number of output rows corresponds to the number of input tokens (4), but each output’s embedding dimension (number of columns) is now 2 instead of 50257.

Since we’re interested in fine-tuning the model to return a class label, we don’t need to fine-tune all four output rows. We just focus on the last row corresponding to the last token.

print("Last output token:", outputs[:, -1, :])
# Last output token: tensor([[-3.5983, 3.9902]])

The reason we’re only interested in the last token is that, due to the causal attention mask, it’s the only token that attends to all other tokens in the sequence; the earlier tokens only see their predecessors.

6.6 Calculating the classification loss and accuracy

We don’t even need to use softmax to decide whether an input is spam or not, since the largest logit already corresponds to the predicted class (we take the argmax either way).
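For example, for the outputs above:

logits_last = outputs[:, -1, :]
print(torch.argmax(torch.softmax(logits_last, dim=-1), dim=-1))  # tensor([1])
print(torch.argmax(logits_last, dim=-1))                         # same result: tensor([1])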

def calc_accuracy_loader(data_loader, model, device, num_batches=None):
    model.eval()
    correct_predictions, num_examples = 0, 0

    if num_batches is None:
        num_batches = len(data_loader)
    else:
        num_batches = min(num_batches, len(data_loader))

    for i, (input_batch, target_batch) in enumerate(data_loader):
        if i < num_batches:
            input_batch = input_batch.to(device)
            target_batch = target_batch.to(device)

            with torch.no_grad():
                logits = model(input_batch)[:, -1, :]
            
            predicted_labels = torch.argmax(logits, dim=-1)

            num_examples += predicted_labels.shape[0]
            correct_predictions += (
                (predicted_labels == target_batch).sum().item()
            )
        else:
            break

    return correct_predictions / num_examples
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

torch.manual_seed(123)
train_accuracy = calc_accuracy_loader(
    train_loader, model, device, num_batches=10
)
val_accuracy = calc_accuracy_loader(
    val_loader, model, device, num_batches=10
)
test_accuracy = calc_accuracy_loader(
    test_loader, model, device, num_batches=10
)

print(f"Training accuracy: {train_accuracy*100:.2f}%")
print(f"Validation accuracy: {val_accuracy*100:.2f}%")
print(f"Test accuracy: {test_accuracy*100:.2f}%")
# Training accuracy: 46.25%
# Validation accuracy: 45.00%
# Test accuracy: 48.75%

The resulting prediction accuracies are near a random prediction.

To improve, we’ll need to fine-tune. And to do that, we need to define the loss function we’ll optimize.
The objective is to maximize the spam classification accuracy of the model. Classification accuracy is not a differentiable function, so we’ll use cross-entropy loss as a proxy to maximize accuracy.

We’ll use much the same function as previously, except now we only focus on the loss for the last token.

def calc_loss_batch(input_batch, target_batch, model, device):
    input_batch = input_batch.to(device)
    target_batch = target_batch.to(device)
    logits = model(input_batch)[:, -1, :]
    loss = torch.nn.functional.cross_entropy(logits, target_batch)
    return loss
def calc_loss_loader(data_loader, model, device, num_batches=None):
    total_loss = 0
    if len(data_loader) == 0:
        return float("nan")
    elif num_batches is None:
        num_batches = len(data_loader)
    else:
        num_batches = min(num_batches, len(data_loader))
    
    for i, (input_batch, target_batch) in enumerate(data_loader):
        if i < num_batches:
            loss = calc_loss_batch(
                input_batch, target_batch, model, device
            )
            total_loss += loss.item()
        else:
            break
    return total_loss / num_batches
# Computing the initial loss for each dataset
with torch.no_grad():
    train_loss = calc_loss_loader(train_loader, model, device, num_batches=5)
    val_loss = calc_loss_loader(val_loader, model, device, num_batches=5)
    test_loss = calc_loss_loader(test_loader, model, device, num_batches=5)

print(f"Training loss: {train_loss:.3f}")
print(f"Validation loss: {val_loss:.3f}")
print(f"Test loss: {test_loss:.3f}")
# Training loss: 2.453
# Validation loss: 2.583
# Test loss: 2.322

6.7 Fine-tuning the model on supervised data

def train_classifier_simple(model, train_loader, val_loader, optimizer, device, num_epochs, eval_freq, eval_iter):
    train_losses, val_losses, train_accs, val_accs = [], [], [], []
    examples_seen, global_step = 0, -1

    for epoch in range(num_epochs):
        model.train()

        for input_batch, target_batch in train_loader:
            optimizer.zero_grad()
            loss = calc_loss_batch(
                input_batch, target_batch, model, device
            )
            loss.backward()
            optimizer.step()
            examples_seen += input_batch.shape[0]
            global_step += 1

            if global_step % eval_freq == 0:
                train_loss, val_loss = evaluate_model(
                    model, train_loader, val_loader, device, eval_iter
                )
                train_losses.append(train_loss)
                val_losses.append(val_loss)
                print(f"Ep {epoch+1} (Step {global_step:06d}): Train loss {train_loss:.3f}, Val loss {val_loss:.3f}")

        train_accuracy = calc_accuracy_loader(
            train_loader, model, device, num_batches=eval_iter
        )
        val_accuracy = calc_accuracy_loader(val_loader, model, device, num_batches=eval_iter)

        print(f"Training accuracy: {train_accuracy*100:.2f}% | ", end="")
        print(f"Validation accuracy: {val_accuracy*100:.2f}% | ")
        train_accs.append(train_accuracy)
        val_accs.append(val_accuracy)

    return train_losses, val_losses, train_accs, val_accs, examples_seen


def evaluate_model(model, train_loader, val_loader, device, eval_iter):
    model.eval()
    with torch.no_grad():
        train_loss = calc_loss_loader(
            train_loader, model, device, num_batches=eval_iter
        )
        val_loss = calc_loss_loader(
            val_loader, model, device, num_batches=eval_iter
        )

    model.train()
    return train_loss, val_loss
import time

start_time = time.time()
torch.manual_seed(123)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.1)
num_epochs = 5

train_losses, val_losses, train_accs, val_accs, examples_seen = \
    train_classifier_simple(
        model, train_loader, val_loader, optimizer, device,
        num_epochs=num_epochs, eval_freq=50, eval_iter=5
    )

end_time = time.time()
execution_time_minutes = (end_time - start_time) / 60
print(f"Training competed in {execution_time_minutes:.2f} minutes.")
# Ep 1 (Step 000000): Train loss 2.153, Val loss 2.392
# Ep 1 (Step 000050): Train loss 0.617, Val loss 0.637
# Ep 1 (Step 000100): Train loss 0.523, Val loss 0.557
# Training accuracy: 70.00% | Validation accuracy: 72.50% | 
# Ep 2 (Step 000150): Train loss 0.561, Val loss 0.489
# Ep 2 (Step 000200): Train loss 0.419, Val loss 0.397
# Ep 2 (Step 000250): Train loss 0.409, Val loss 0.353
# Training accuracy: 82.50% | Validation accuracy: 85.00% | 
# Ep 3 (Step 000300): Train loss 0.333, Val loss 0.320
# Ep 3 (Step 000350): Train loss 0.340, Val loss 0.306
# Training accuracy: 90.00% | Validation accuracy: 90.00% | 
# Ep 4 (Step 000400): Train loss 0.136, Val loss 0.200
# Ep 4 (Step 000450): Train loss 0.153, Val loss 0.132
# Ep 4 (Step 000500): Train loss 0.222, Val loss 0.137
# Training accuracy: 100.00% | Validation accuracy: 97.50% | 
# Ep 5 (Step 000550): Train loss 0.207, Val loss 0.143
# Ep 5 (Step 000600): Train loss 0.083, Val loss 0.074
# Training accuracy: 100.00% | Validation accuracy: 97.50% | 
# Training completed in 0.61 minutes.

Some extra code to plot:

import matplotlib.pyplot as plt

def plot_values(epochs_seen, examples_seen, train_values, val_values, label="loss"):
    fig, ax1 = plt.subplots(figsize=(5,3))
    ax1.plot(epochs_seen, train_values, label=f"Training {label}")
    ax1.plot(epochs_seen, val_values, linestyle="-.", label=f"Validation {label}")
    ax1.set_xlabel("Epochs")
    ax1.set_ylabel(label.capitalize())
    ax1.legend()

    ax2 = ax1.twiny()
    ax2.plot(examples_seen, train_values, alpha=0)
    ax2.set_xlabel("Examples seen")

    fig.tight_layout()
    # plt.savefig(f"{label}-plot.pdf")
    plt.show()

# Loss curves
epochs_tensor = torch.linspace(0, num_epochs, len(train_losses))
examples_seen_tensor = torch.linspace(0, examples_seen, len(train_losses))

plot_values(epochs_tensor, examples_seen_tensor, train_losses, val_losses)

# Classification accuracies
epochs_tensor = torch.linspace(0, num_epochs, len(train_accs))
examples_seen_tensor = torch.linspace(0, examples_seen, len(train_accs))

plot_values(
    epochs_tensor, examples_seen_tensor, train_accs, val_accs,
    label="accuracy"
)

And the final metrics:

train_accuracy = calc_accuracy_loader(train_loader, model, device)
val_accuracy = calc_accuracy_loader(val_loader, model, device)
test_accuracy = calc_accuracy_loader(test_loader, model, device)

print(f"Training accuracy: {train_accuracy*100:.2f}%")
print(f"Validation accuracy: {val_accuracy*100:.2f}%")
print(f"Test accuracy: {test_accuracy*100:.2f}%")
# Training accuracy: 97.21%
# Validation accuracy: 97.32%
# Test accuracy: 95.67%

6.8 Using the LLM as a spam classifier

def classify_review(
        text, model, tokenizer, device, max_length: int = 128,
        pad_token_id: int = 50256
):
    model.eval()

    input_ids = tokenizer.encode(text)
    supported_context_length = model.pos_emb.weight.shape[0]  # pos_emb weight is (context_length, emb_dim)

    # Handle max_length properly by using a default if None
    effective_max_length = max_length if max_length is not None else supported_context_length
    input_ids = input_ids[:min(effective_max_length, supported_context_length)]

    # Calculate padding length only if we have a valid max_length
    padding_length = effective_max_length - len(input_ids)
    input_ids += [pad_token_id] * padding_length

    input_tensor = torch.tensor(input_ids, device=device).unsqueeze(0)

    with torch.no_grad():
        logits = model(input_tensor)[:, -1, :]
    predicted_label = torch.argmax(logits, dim=-1).item()

    return "spam" if predicted_label == 1 else "not spam"
text_1 = (
    "You are a winner you have been specifically"
    " selected to receive $1000 cash or a $2000 award."
)

print(classify_review(
    text_1, model, tokenizer, device, max_length=train_dataset.max_length
))
# spam

text_2 = (
    "Hey, just wanted to check if we're still on"
    " for dinner tonight? Let me know!"
)

print(classify_review(
    text_2, model, tokenizer, device, max_length=train_dataset.max_length
))
# not spam

7 Fine-tuning to follow instructions

7.2 Preparing a dataset for supervised instruction fine-tuning

import torch
from torch.utils.data import Dataset

class InstructionDataset(Dataset):
    def __init__(self, data, tokenizer):
        self.data = data
        self.encoded_texts = []
        for entry in data:
            instruction_plus_input = format_input(entry)
            response_text = f"\n\n### Response:\n{entry['output']}"
            full_text = instruction_plus_input + response_text
            self.encoded_texts.append(
                tokenizer.encode(full_text)
            )

    def __getitem__(self, index):
        return self.encoded_texts[index]
    
    def __len__(self):
        return len(self.data)

We’ll also use the <|endoftext|> token here (ID: 50256).

We’ll build a custom collate function which pads the training examples such that they have the same length, while allowing different batches to have different lengths.
We don’t need to match the longest sequence in the entire dataset, just in the batch.

Then we create a list of target token IDs, which are the inputs shifted by 1, plus an additional padding token.

We assign a -100 placeholder value to all padding tokens to allow us to exclude them from contribution to the training loss calculation.
This means only meaningful data will influence model learning.
The reason we use -100 is that PyTorch’s cross_entropy function uses -100 as its default ignore_index, so any target labeled with -100 is ignored.
We keep only the first 50256 (end-of-text) token ID in the targets and mask the remaining ones, because it helps the LLM learn to generate end-of-text tokens, which are used to indicate that the answer is complete.
Since we’re padding all items in the batch to be of the same length, the LLM may be forced to output values after ‘finishing’ the sentence. We don’t want that non-meaningful output to count when computing the loss. We don’t want the model to learn incorrectly from noise or from forced tokens after the natural end of a sequence.

def custom_collate_fn(batch, pad_token_id=50256, ignore_index=-100, allowed_max_length=None, device="cpu"):
    batch_max_length = max(len(item)+1 for item in batch)
    inputs_lst, targets_lst = [], []

    for i, item in enumerate(batch):
        new_item = item.copy()
        new_item += [pad_token_id]

        padded = (
            new_item + [pad_token_id] *
            (batch_max_length - len(new_item))
        )
        
        inputs = torch.tensor(padded[:-1])
        targets = torch.tensor(padded[1:])
        
        mask = targets == pad_token_id
        indices = torch.nonzero(mask).squeeze()

        if indices.numel() > 1:
            targets[indices[1:]] = ignore_index
        
        if allowed_max_length is not None:
            inputs = inputs[:allowed_max_length]
            targets = targets[:allowed_max_length]

        inputs_lst.append(inputs)
        targets_lst.append(targets)
    
    inputs_tensor = torch.stack(inputs_lst).to(device)
    targets_tensor = torch.stack(targets_lst).to(device)
    return inputs_tensor, targets_tensor

custom_collate_fn([[1,2,3,4,5], [1,2,3]])

Let’s walk through the processing of the batch inputs I show above.
The batch size is 2, and the batch max length is 6 (the longest item has 5 tokens, plus the 1 extra padding token we append).
Processing the longest input isn’t super exciting, so let’s use [1, 2, 3] as our example (item 1). Item 1 has a length of 3.
We start by adding an initial padding token:

[1, 2, 3, 50256]

And then we pad until it’s the same size as the largest item in the batch:

[1, 2, 3, 50256, 50256, 50256]

From here, we generate the input and target tensors. The input tensor will be all but the last item in the padded sequence:

inputs = tensor([1, 2, 3, 50256, 50256])

And the target is all but the first item in the padded sequence:

targets = tensor([2, 3, 50256, 50256, 50256])

Now we’ll mask. The padding mask has True values where the padding token is in the targets tensor. We’ll assign -100 in those indices.

mask = tensor([False, False, True, True, True])

Then we find the corresponding indices:

indices = tensor([2, 3, 4])

We found 3 padding tokens, masking all but the first:

targets = tensor([2, 3, 50256, -100, -100])

The final output for this item is:

inputs  = tensor([1, 2, 3, 50256, 50256])
targets = tensor([2, 3, 50256, -100, -100])

Besides masking padding tokens, it’s also common to mask target token IDs that correspond to the instructions. Doing so means the Cross Entropy loss is only computed for the generated response IDs, so the model is trained to focus on generating accurate responses rather than memorizing instructions, which can help prevent overfitting.
However, it isn’t clear yet whether masking the instructions is universally beneficial during instruction fine-tuning.
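
If you do want to experiment with masking the instruction tokens as well, a minimal sketch might look like the helper below. It is not part of the book’s collate function, it assumes you track each example’s tokenized instruction length, and it glosses over the exact off-by-one introduced by the target shift.

def mask_instruction_tokens(targets, instruction_length, ignore_index=-100):
    # Hypothetical helper: blank out the instruction portion of one example's
    # shifted targets so the loss only covers the response tokens.
    targets = targets.clone()
    targets[:instruction_length] = ignore_index
    return targets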

7.4 Creating data loaders for an instruction dataset

from functools import partial

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

customized_collate_fn = partial(
    custom_collate_fn,
    device=device,
    allowed_max_length=1024  # max context length supported by GPT-2
)
from torch.utils.data import DataLoader
import tiktoken

num_workers = 0
batch_size = 8

torch.manual_seed(123)
tokenizer = tiktoken.get_encoding("gpt2")

train_dataset = InstructionDataset(train_data, tokenizer)
train_loader = DataLoader(
    train_dataset,
    batch_size=batch_size,
    collate_fn=customized_collate_fn,
    shuffle=True,
    drop_last=True,
    num_workers=num_workers
)

val_dataset = InstructionDataset(val_data, tokenizer)
val_loader = DataLoader(
    val_dataset,
    batch_size=batch_size,
    collate_fn=customized_collate_fn,
    shuffle=False,
    drop_last=False,
    num_workers=num_workers
)

test_dataset = InstructionDataset(test_data, tokenizer)
test_loader = DataLoader(
    test_dataset,
    batch_size=batch_size,
    collate_fn=customized_collate_fn,
    shuffle=False,
    drop_last=False,
    num_workers=num_workers
)
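
As a quick sanity check, you can iterate over the loader and confirm that inputs and targets have matching shapes and that the sequence length varies from batch to batch:

# Inspect the first few batches: the batch dimension stays fixed,
# but the sequence length differs between batches.
for i, (inputs, targets) in enumerate(train_loader):
    print(inputs.shape, targets.shape)
    if i == 4:
        break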

7.5 Loading a pretrained LLM

We’ll load the 355M-parameter model, as the 124M-parameter model is too limited in capacity to achieve good results for instruction fine-tuning.

This is done in the exact same way as previously, except we load the gpt2-medium (355M) model.
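
For reference, the loading step looks roughly like the sketch below; download_and_load_gpt2, GPTModel, and load_weights_into_gpt are the helpers from the earlier chapters, so adjust the imports to wherever they live in your own code.

# Helpers from earlier chapters (import paths depend on your project layout):
# download_and_load_gpt2, GPTModel, load_weights_into_gpt

CHOOSE_MODEL = "gpt2-medium (355M)"

BASE_CONFIG = {
    "vocab_size": 50257,     # GPT-2 vocabulary size
    "context_length": 1024,  # maximum context length
    "drop_rate": 0.0,        # no dropout
    "qkv_bias": True         # OpenAI's GPT-2 weights use query/key/value biases
}
model_configs = {
    "gpt2-small (124M)":  {"emb_dim": 768,  "n_layers": 12, "n_heads": 12},
    "gpt2-medium (355M)": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
}
BASE_CONFIG.update(model_configs[CHOOSE_MODEL])

model_size = CHOOSE_MODEL.split(" ")[-1].lstrip("(").rstrip(")")
settings, params = download_and_load_gpt2(model_size=model_size, models_dir="gpt2")

model = GPTModel(BASE_CONFIG)
load_weights_into_gpt(model, params)
model.eval()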

Let’s assess the model’s existing capabilities.

torch.manual_seed(123)
input_text = format_input(val_data[0])
print(input_text)
"""
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Convert the active sentence to passive: 'The chef cooks the meal every day.'
"""
from build_a_large_language_model_from_scratch.lib.generate import text_to_token_ids, token_ids_to_text, generate

token_ids = generate(
    model=model,
    idx=text_to_token_ids(input_text, tokenizer),
    max_new_tokens=35,
    context_size=BASE_CONFIG["context_length"],
    eos_id=50256
)

generated_text = token_ids_to_text(token_ids, tokenizer)
print(generated_text)
"""
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Convert the active sentence to passive: 'The chef cooks the meal every day.'

### Response:

The chef cooks the meal every day.

### Instruction:

Convert the active sentence to passive: 'The chef cooks the
"""
response_text = generated_text[len(input_text):].strip()
print(response_text)
"""
### Response:

The chef cooks the meal every day.

### Instruction:

Convert the active sentence to passive: 'The chef cooks the
"""

7.6 Fine-tuning the LLM on instruction data

Similar to before.

import time
from build_a_large_language_model_from_scratch.lib.train import train_model_simple

start_time = time.time()
torch.manual_seed(123)
optimizer = torch.optim.AdamW(model.parameters(), lr=0.00005, weight_decay=0.1)
num_epochs = 2
model.to(device)

train_losses, val_losses, tokens_seen = train_model_simple(
    model, train_loader, val_loader, optimizer, device,
    num_epochs=num_epochs, eval_freq=5, eval_iter=5,
    start_context=format_input(val_data[0]), tokenizer=tokenizer
)

end_time = time.time()
exec_time_m = (end_time - start_time) / 60
print(f"Training completed in {exec_time_m:.2f} minutes.")

7.7 Extracting and saving responses

torch.manual_seed(123)

for entry in test_data[:3]:
    input_text = format_input(entry)
    token_ids = generate(
        model=model,
        idx=text_to_token_ids(input_text, tokenizer).to(device),
        max_new_tokens=256,
        context_size=BASE_CONFIG["context_length"],
        eos_id=50256
    )
    generated_text = token_ids_to_text(token_ids, tokenizer)

    response_text = (
        generated_text[len(input_text):]
        .replace("### Response:", "")
        .strip()
    )
    print(input_text)
    print(f"\nCorrect response:\n>> {entry["output"]}")
    print(f"\nModel response:\n>> {response_text.strip()}")
    print("-"*20)

This prints the outputs as you might expect.
For example:

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Rewrite the sentence using a simile.

### Input:
The car is very fast.

Correct response:
>> The car is as fast as lightning.

Model response:
>> The car is as fast as a bullet.

As is evident, evaluating this kind of free-form answer at scale isn’t as easy as computing classification accuracy from the percentage of correctly predicted spam/ham labels.

Some ideas: short-answer and multiple-choice benchmarks (e.g., MMLU), human preference comparisons against other LLMs, or using another LLM as an automated judge, which is the approach taken in the next section.

7.8 Evaluating the fine-tuned LLM

This chapter uses Llama 3 (by Meta AI) to evaluate the test-set responses from our fine-tuned model.
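
The idea is to run Llama 3 locally through Ollama and prompt it to score each model response against the reference answer (e.g., on a 0 to 100 scale). A rough sketch of the request helper, assuming an Ollama server is running on the default local port (the payload fields follow Ollama’s chat API):

import json
import urllib.request

def query_model(prompt, model="llama3", url="http://localhost:11434/api/chat"):
    # Send a chat request to a locally running Ollama server and
    # concatenate the streamed response chunks into one string.
    data = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    payload = json.dumps(data).encode("utf-8")
    request = urllib.request.Request(url, data=payload, method="POST")
    request.add_header("Content-Type", "application/json")

    response_text = ""
    with urllib.request.urlopen(request) as response:
        for line in response:
            response_json = json.loads(line.decode("utf-8"))
            response_text += response_json["message"]["content"]
    return response_text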

7.9 Conclusions

After instruction fine-tuning, you can optionally do preference fine-tuning. This is useful to customize the model to better align with specific user preferences.

Appendix E Parameter-efficient fine-tuning with LoRA

Low-rank adaptation (LoRA) is a widely used technique for parameter-efficient fine-tuning.
Source: [2106.09685] LoRA: Low-Rank Adaptation of Large Language Models

As models grow larger, performing full fine-tuning of them becomes increasingly infeasible.
LoRA greatly reduces the number of trainable parameters while matching, and sometimes exceeding, full fine-tuning in model quality across various models.

It works by freezing the pre-trained weights and injecting trainable rank decomposition matrices into each layer.
“Low-rank” refers to the mathematical concept of limiting adjustments to a smaller dimensional subspace of the total weight parameter space.
This captures the most influential directions of the weight parameter changes during training.
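
As a rough worked example (my numbers, not the book’s): for a 1,024 × 1,024 weight matrix, full fine-tuning updates about 1.05M parameters, whereas LoRA with rank r = 16 trains A (1,024 × 16) and B (16 × 1,024), i.e., 2 × 16,384 = 32,768 parameters, roughly 3% of the original. The adapted forward pass then computes x·W + α·(x·A·B) instead of x·W alone.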

Because the pretrained weights stay frozen and the LoRA matrices can be applied dynamically at inference time, LoRA enables model customization without storing multiple complete copies of an LLM; only the much smaller LoRA matrices need to be stored per task, which means less storage use and improved scalability.

import math

import torch
import torch.nn as nn


class LoRALayer(nn.Module):
    def __init__(self, in_dim, out_dim, rank, alpha):
        super().__init__()
        # A is initialized randomly (the same scheme nn.Linear uses for its weights);
        # B starts at zero, so the LoRA update is zero at the beginning of training.
        self.A = torch.nn.Parameter(torch.empty(in_dim, rank))
        torch.nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))
        self.B = torch.nn.Parameter(torch.zeros(rank, out_dim))
        self.alpha = alpha

    def forward(self, x):
        # Low-rank update: project down to `rank` dimensions, back up, then scale
        x = self.alpha * (x @ self.A @ self.B)
        return x

rank governs the inner dimension of the matrices A and B. It essentially determines the number of extra parameters introduced by LoRA.
It’s a balance of adaptability of the model and efficiency via the number of parameters used.

alpha functions as a scaling factor for the output.
It dictates the degree to which the output from the adapted layer can affect the original layer’s output. Like a way to regulate the effect of LoRA on the layer’s output.

In LoRA, the goal is typically to substitute the Linear layers, so the weight updates can be applied directly on top of the existing pretrained weights.

class LinearWithLoRA(nn.Module):
    def __init__(self, linear, rank, alpha):
        super().__init__()
        self.linear = linear
        self.lora = LoRALayer(linear.in_features, linear.out_features, rank, alpha)

    def forward(self, x):
        return self.linear(x) + self.lora(x)
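
Since B is zero-initialized, a freshly wrapped layer behaves exactly like the original Linear layer. A quick check (my own toy example):

torch.manual_seed(123)
layer = nn.Linear(8, 8)
lora_layer = LinearWithLoRA(layer, rank=2, alpha=1.0)

x = torch.randn(1, 8)
# The LoRA delta is zero at initialization, so both outputs match exactly.
print(torch.allclose(layer(x), lora_layer(x)))  # True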

Utility:

def replace_linear_with_lora(model, rank, alpha):
    # Recursively replace every nn.Linear module in the model with a LinearWithLoRA wrapper
    for name, module in model.named_children():
        if isinstance(module, nn.Linear):
            setattr(model, name, LinearWithLoRA(module, rank, alpha))
        else:
            # Recurse into child modules (e.g., transformer blocks)
            replace_linear_with_lora(module, rank, alpha)
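
Usage is then roughly: freeze all pretrained parameters first, then swap in the LoRA layers so that only the A and B matrices remain trainable (the rank and alpha values below are illustrative):

# Freeze the pretrained weights; only the injected LoRA parameters will train.
for param in model.parameters():
    param.requires_grad = False

replace_linear_with_lora(model, rank=16, alpha=16)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable LoRA parameters: {trainable:,}")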
