Notes on
Build a Large Language Model (From Scratch)
by Sebastian Raschka
| 75 min read
This book is a good, well-organized guide to building large language models (LLMs). It walks you through everything, from transformer basics to a working GPT-like model. The explanations are clear, and the code examples are helpful. You’ll learn the steps, from pretraining to fine-tuning for instruction and classification tasks.
The book covers important concepts like tokenization, embeddings, and the self-attention mechanism. It’s good at explaining each part’s purpose and how it fits into the whole. The step-by-step implementation of each component is a plus.
However, the book could go deeper. It doesn’t always explain why things are the way they are, or the mathematical intuition behind them. It focuses on the how, but I’d have liked to see more on the mathematical underpinnings and the reasoning behind some choices. I get why this could be considered beyond the scope of the book.
Sebastian Raschka absolutely deserves high praise for writing this book, as well as the tremendous effort put into the additional material in the appendix and on the GitHub repository.
You can find my code here: chhoumann/ml.
1 Understanding large language models
1.3 Stages of building and using LLMs
Process of training an LLM:
- Pretraining: training the LLM on a large corpus of text data. This is raw text, meaning text without any labeling information—but potentially filtered, like removing formatting characters or docs in unknown languages.
- LLMs use Self-supervised Learning in this phase, so no labeling needed!
- Fine-tuning on a smaller, labeled dataset to follow instructions or perform classification tasks
“Pre” in “pretraining” refers to the first phase, where a model like an LLM is trained on a large, diverse dataset to develop a broad understanding of language.
This model then serves as a base/foundation model we can refine further with fine-tuning, by training it on a narrower dataset, more specific to particular tasks or domains.
An LLM, after pretraining, will have some basic capabilities like text completion and few-shot capabilities (can learn to perform new tasks based only on a few examples, rather than needing extensive training data).
After fine-tuning, it may be able to do a lot more, like classification, summarization, translation, and so on.
The two most popular categories of fine-tuning:
- instruction fine-tuning, where we use a labeled dataset consisting of instruction and answer pairs, e.g., a query to translate a text and the correctly translated text
- classification fine-tuning, where we use a labeled dataset of texts and associated class labels, e.g., emails and “spam” / “not spam” labels
1.4 Introducing the transformer architecture
The original transformer was developed for machine translation, translating English texts to German and French. Presented in Attention Is All You Need.
There are two parts:
- Encoder that processes input text and produces an Embedding representation of the text (vectors that capture the contextual information of the input)
- Decoder that uses the embeddings to generate the translated text, one word at a time
Both of these consist of many layers connected by a self-attention mechanism.
The self-attention mechanism lets the model weigh the importance of different words/tokens in a sequence, relative to each other.
This enables it to capture long-range dependencies and contextual relationships within the input data—so it can generate output that is coherent and contextually relevant.
Later variants: Bidirectional Encoder Representations From Transformers (BERT) and Generative Pretrained Transformers (GPT).
BERT (and its variants) specialize in masked word prediction. It receives inputs where words are randomly masked during training, and it fills in the missing words to generate the original sentence.
This makes the model great at text classification, including sentiment prediction & document categorization. X (Twitter) uses it to detect toxic content (as of the book’s publishing).
Example: “This is an … of how concise I … be” to “This is an example of how concise I can be”
On the other hand, GPT is designed for generative tasks. It receives incomplete texts, in the sense that the sentence is unfinished, not with masked words. Then it learns to generate one word at a time.
This makes the model great at tasks that require generating text, like machine translation, text summarization, writing tasks, etc.
Example: “This is an example of how concise I can” → “This is an example of how concise I can be”
1.5 Utilizing large datasets
The datasets for these models are huge, representing diverse and comprehensive text corpora with billions of words on various topics.
The diversity of the training data lets the models perform well on diverse tasks.
Pretraining the models requires a lot of resources and is expensive. Luckily, many pretrained LLMs are available as open source models, and can be used. They can even be fine-tuned for specific tasks with (relatively) smaller datasets.
1.6 A closer look at the GPT architecture
Original GPT paper: Improving Language Understanding by Generative Pre-Training by Radford et al. from OpenAI.
GPT-3 is a scaled up version of this with more parameters & was trained on a larger dataset. Introduced in 2020.
The original ChatGPT model was created by fine-tuning GPT-3 on a large instruction dataset with the methods presented in OpenAI’s InstructGPT paper.
Next-word prediction is a form of Self-supervised Learning. We don’t need to collect labels for the training data: we just use the next word in a sentence / document as the label the model’s supposed to predict.
The general GPT architecture is simple: it’s just the decoder part, without the encoder.
Decoder-style models generate text by predicting text one word at a time, so they’re considered a type of autoregressive model.
Autoregressive models take in their previous outputs as inputs for future predictions.
So in GPT, each new word is chosen based on the sequence that precedes it.
GPT-3 is also a lot larger than the original transformer model.
The original repeated the encoder & decoder blocks 6 times.
GPT-3 has 96 transformer layers & 175b parameters in total.
GPT-3 was introduced a long time ago by DL & LLM development standards (2020), but more recent architectures (like Meta’s Llama models) are based on the same underlying concepts with only minor modifications.
2 Working with text data
2.1 Understanding word embeddings
Deep neural network models can’t process raw text directly. Text is categorical, so we can’t perform the mathematical operations on it that we use to train neural networks.
So we need to represent words as continuous-valued vectors.
Converting data into a vector format is often called embedding.
Using a specific neural network layer or another pretrained neural network model, we can embed various types of data (video, audio, text, etc.). But different data formats require distinct embedding models.
An embedding is a mapping from discrete objects (e.g. words, images, entire documents) to points in a continuous vector space.
Word embeddings are the most common form of text embedding. But you can also embed sentences, paragraphs, or whole documents.
Sentence or paragraph embeddings are popular for RAG (retrieval-augmented generation).
There are many algorithms and frameworks for generating word Embeddings.
One of the earlier & most popular examples is Word2Vec.
It trained a neural network architecture to generate word embeddings by predicting the context of a word given the target word, or vice versa.
The main idea: words that appear in similar contexts tend to have similar meanings.
Word embeddings can have varying dimensions. With more dimensions you can capture more nuanced relationships, but it also takes more compute.
You can use Word2Vec to generate embeddings, but LLMs commonly produce their own that are part of the input layer & are updated during training.
Optimizing the embeddings as part of training the LLM also means that the embeddings are optimized to the task and data at hand.
Embedding size is often referred to as the dimensionality of the model’s hidden states.
It varies based on the model variant and sizes. It’s a tradeoff between performance and efficiency.
2.2 Tokenizing text
Don’t make text all lowercase when tokenizing. Capitalization helps the LLMs distinguish between proper nouns and common nouns, understand sentence structure, & learn to generate text with proper capitalization.
import re

# raw_text holds the text of the short story ("The Verdict") loaded earlier
preprocessed = re.split(r"([,.:;?_!\"()']|--|\s)", raw_text)
preprocessed = [item for item in preprocessed if item.strip()]
Should you keep or remove whitespaces?
It depends on the application & its requirements.
Removing them reduces memory & computational requirements.
But they might be important for some applications, like Python code, which is whitespace-sensitive.
We remove it here, but will later switch to a method that keeps whitespaces.
2.3 Converting tokens into token IDs
First we build a vocabulary, which defines how we map each unique word and special character to a unique integer.
We also build an inverse vocabulary.
from typing import Dict, List
all_words = sorted(set(preprocessed))
vocab_size = len(all_words)
vocab = {token: i for i, token in enumerate(all_words)}
class SimpleTokenizerV1:
def __init__(self, vocab: Dict[str, int]):
self.str_to_int = vocab
self.int_to_str = {i: token for token, i in vocab.items()}
def encode(self, text: str) -> List[int]:
preprocessed = re.split(r"([,.:;?_!\"()']|--|\s)", text)
preprocessed = [item for item in preprocessed if item.strip()]
ids = [self.str_to_int[token] for token in preprocessed]
return ids
def decode(self, tokens: List[int]) -> str:
text = " ".join([self.int_to_str[token] for token in tokens])
# Remove whitespaces before punctuation marks
text = re.sub(r" ([,.:;?_!\"()'])", r"\1", text)
return text
tokenizer = SimpleTokenizerV1(vocab)
text = """"It's the last he painted, you know,"
Mrs. Gisburn said with pardonable pride."""
ids = tokenizer.encode(text)
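To sanity-check the round trip, we can decode the IDs back into text (a quick check I added, not part of the notes above):

print(tokenizer.decode(ids))
# Close to the original text, though the spacing around quotes/apostrophes differs slightly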
2.4 Adding special context tokens
The tokenizer should be able to handle unknown words.
We should also be able to use and add special context tokens that can give the model a better understanding of context or other relevant information in the text, like if it’s reached the end of the text.
So we’re adding two tokens: <|unk|> and <|endoftext|>.
And we’ll modify the tokenizer to use <|unk|> if it encounters a word that isn’t part of the vocabulary.
<|endoftext|> is also added between unrelated texts to mark when one source ends. When you’re training GPT-like LLMs on multiple independent sources, adding such a token between each source helps it understand that despite the sources being concatenated, they are unrelated.
all_tokens = sorted(set(preprocessed))
all_tokens.extend(["<|unk|>", "<|endoftext|>"])
vocab = {token: i for i, token in enumerate(all_tokens)}
class SimpleTokenizerV2:
def __init__(self, vocab: Dict[str, int]):
self.str_to_int = vocab
self.int_to_str = {i: token for token, i in vocab.items()}
def encode(self, text: str) -> List[int]:
preprocessed = re.split(r"([,.:;?_!\"()']|--|\s)", text)
preprocessed = [item for item in preprocessed if item.strip()]
# Replace unknown tokens with "<|unk|>"
preprocessed = [item if item in self.str_to_int else "<|unk|>" for item in preprocessed]
ids = [self.str_to_int[token] for token in preprocessed]
return ids
def decode(self, tokens: List[int]) -> str:
text = " ".join([self.int_to_str[token] for token in tokens])
# Remove whitespaces before punctuation marks
text = re.sub(r" ([,.:;?_!\"()'])", r"\1", text)
return text
text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."
text = " <|endoftext|> ".join((text1, text2))
print(text)
# > "Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace."
tokenizer = SimpleTokenizerV2(vocab)
ids = tokenizer.encode(text)
print(ids)
# > [1130, 5, 355, 1126, 628, 975, 10, 1131, 55, 988, 956, 984, 722, 988, 1130, 7]
print(tokenizer.decode(tokenizer.encode(text)))
# > "<|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>."
It’s readily apparent that “Hello” and “palace” do not appear in Edith Wharton’s “The Verdict.”
Other special tokens:
- [BOS] (beginning of sequence) to mark the start of a text
- [EOS] (end of sequence), positioned at the end of a text. Useful when concatenating multiple unrelated texts (like <|endoftext|>; they’re analogous)
- [PAD] (padding): when training LLMs with batch sizes > 1, the batch may contain texts of varying lengths. To ensure all texts have the same length, the shorter texts are padded with this token, up to the length of the longest text in the batch
<|endoftext|> is sometimes used for padding, e.g. in GPT models, which only use that token.
When training on batched inputs, we typically use a mask, and therefore don’t attend to padded tokens. So it doesn’t matter which token is used for padding.
GPT models also don’t use <|unk|>. They use a byte pair encoding tokenizer, which breaks words down into subword units.
2.5 Byte pair encoding
The BPE tokenizer was used to train LLMs like GPT-2, GPT-3, and others.
We’re just using OpenAI’s Tiktoken here.
BPE can encode and decode unknown words correctly (without <|unk|>
) because the algorithm breaks words down into subword units, or even individual characters. So any unfamiliar word can be represented as a sequence of subword tokens or characters.
BPE builds its vocabulary by iteratively merging frequent characters into subwords and frequent subwords into words. Merges are determined by a frequency cutoff.
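A short sketch of using tiktoken’s BPE tokenizer, mirroring the book’s example (the exact IDs come from the GPT-2 vocabulary; treat them as illustrative):

import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")
text = "Hello, do you like tea? <|endoftext|> In the sunlit terraces of someunknownPlace."
ids = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(ids)  # subword token IDs; <|endoftext|> maps to 50256 in the GPT-2 vocabulary
print(tokenizer.decode(ids))  # round-trips back to the original string, no <|unk|> needed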
2.6 Data sampling with a sliding window
Now we need to generate the input pairs required for training an LLM. Recall that the task is next-word prediction.
The basic way to do it is by having x and y, where x contains the inputs and y contains the targets. The targets are the inputs shifted by one.
context_size = 4
x = enc_sample[:context_size]
y = enc_sample[1 : context_size + 1]
print(f"x: {x}")
print(f"y: {y}")
# > x: [290, 4920, 2241, 287]
# > y: [4920, 2241, 287, 257]
# So next-word prediction tasks looks like:
for i in range(1, context_size + 1):
context = enc_sample[:i]
target = enc_sample[i]
print(context, "-->", target)
# > [290] --> 4920
# > [290, 4920] --> 2241
# > [290, 4920, 2241] --> 287
# > [290, 4920, 2241, 287] --> 257
Context size determines how many tokens are included in the input.
We’ll build a data loader to handle data sampling.
The data loader should return an input tensor with the text the LLM sees and a target tensor with the targets it should predict.
import tiktoken
import torch
from torch.utils.data import DataLoader, Dataset
class GPTDatasetV1(Dataset):
def __init__(
self, txt: str, tokenizer: tiktoken.Encoding, max_length: int, stride: int
):
self.input_ids = []
self.target_ids = []
# Tokenize the text
token_ids = tokenizer.encode(txt)
# Chunk text into overlapping sequences of max_length using the sliding window
for i in range(0, len(token_ids) - max_length, stride):
input_chunk = token_ids[i : i + max_length]
target_chunk = token_ids[i + 1 : i + max_length + 1]
self.input_ids.append(torch.tensor(input_chunk))
self.target_ids.append(torch.tensor(target_chunk))
def __len__(self):
"""Total number of samples in the dataset."""
return len(self.input_ids)
def __getitem__(self, idx):
"""Get a sample from the dataset at the given index."""
return self.input_ids[idx], self.target_ids[idx]
def create_dataloader_v1(
txt: str,
batch_size: int = 4,
max_length: int = 256,
stride: int = 128,
shuffle: bool = True,
drop_last: bool = True,
num_workers: int = 0,
):
tokenizer = tiktoken.get_encoding("gpt2")
dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)
dataloader = DataLoader(
dataset,
batch_size=batch_size,
shuffle=shuffle,
drop_last=drop_last,
num_workers=num_workers,
)
return dataloader
batch_size is the number of samples per batch. Small batch sizes require less memory, but can lead to noisier model updates.
We batch because it’s inefficient to feed the model one sample at a time. We put multiple samples together into a batch (samples in a batch must have the same length).
drop_last drops the last batch if it’s shorter than the specified batch_size. This prevents loss spikes during training.
stride is the step size for the sliding window.
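To make stride concrete, here’s a small sketch (assuming raw_text holds the training text, as above) comparing a stride of 1, where consecutive windows overlap heavily, with a stride equal to max_length, where they don’t:

dataloader = create_dataloader_v1(raw_text, batch_size=1, max_length=4, stride=1, shuffle=False)
data_iter = iter(dataloader)
print(next(data_iter))  # first window: four input tokens and the same four shifted by one as targets
print(next(data_iter))  # second window starts just one token later because stride=1

# With stride=max_length the windows don't overlap, so no token is reused across batches:
dataloader = create_dataloader_v1(raw_text, batch_size=1, max_length=4, stride=4, shuffle=False)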
2.7 Creating token embeddings
Now we need to convert token IDs into embedding vectors.
So the steps, to recap, are as follows:
- Input text gets tokenized
- Encode tokenized text to get token IDs
- Create input token embeddings
- Process with GPT-like decoder-only transformer
- Perform postprocessing steps to get output text
input_ids = torch.tensor([2, 3, 5, 1])
vocab_size = 6
output_dim = 3 # create embeddings of size 3
torch.manual_seed(42)
embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
print(embedding_layer.weight)
# Parameter containing:
# tensor([[ 1.9269, 1.4873, -0.4974],
# [ 0.4396, -0.7581, 1.0783],
# [ 0.8008, 1.6806, 0.3559],
# [-0.6866, 0.6105, 1.3347],
# [-0.2316, 0.0418, -0.2516],
# [ 0.8599, -0.3097, -0.3957]], requires_grad=True)
The weights above have been randomly initialized.
The values will get optimized during LLM training, as part of the LLM optimization.
6 rows with 3 columns. One row for each of the six possible tokens in the vocabulary, and one column for each of the three embedding dimensions.
print(embedding_layer(torch.tensor([3]))) # applying embedding layer to token id 3
# tensor([[-0.6866, 0.6105, 1.3347]], grad_fn=<EmbeddingBackward0>)
You can see that the output is identical to the index 3 in the weights.
This is because the embedding layer is basically like a lookup from the embedding layer’s weights via the token ID.
The embedding layer here is like a more efficient way to implement one-hot encoding, followed by matrix multiplication in a fully connected layer.
And that’s also why we can view it as a neural network layer that can be optimized via backprop.
Sebastian provided a great notebook that explains this relationship here.
In it, he explains that embedding layers in PyTorch do the same as linear layers that perform matrix multiplications. We use embedding layers for computational efficiency.
Above we’ve discussed how the embedding is basically like a lookup, and that this is comparable to one-hot encoding followed by a matmul in a linear layer. So say we apply the nn.Linear layer to a one-hot encoded representation.
The categories are the various token IDs we have available, and we’ve one-hot encoded those into binary attributes. Therefore, we have as many one-hot features as tokens in our vocabulary.
Given a token ID, we’d encode it as a vector with a binary 1 (hot) in its attribute and 0 elsewhere.
Performing a matrix multiplication of that vector with our linear layer’s weights gives us the embedding for that exact token, equivalent to the lookup.
Mathematically, we can represent this as:

$$\mathbf{e} = \mathbf{x}_{\text{one-hot}} \, W$$

Where:
- $\mathbf{e}$ is the resulting embedding vector
- $\mathbf{x}_{\text{one-hot}}$ is the one-hot encoded input vector
- $W$ is the weight matrix of the linear layer (or embedding matrix)

For example, if we have a vocabulary size of 6, an embedding dimension of 3, and token ID 2 (the third token):

$$\begin{bmatrix} 0 & 0 & 1 & 0 & 0 & 0 \end{bmatrix}
\begin{bmatrix}
w_{11} & w_{12} & w_{13} \\
w_{21} & w_{22} & w_{23} \\
w_{31} & w_{32} & w_{33} \\
w_{41} & w_{42} & w_{43} \\
w_{51} & w_{52} & w_{53} \\
w_{61} & w_{62} & w_{63}
\end{bmatrix}
= \begin{bmatrix} w_{31} & w_{32} & w_{33} \end{bmatrix}$$

This operation effectively selects the third row of the weight matrix, which is equivalent to looking up the embedding for the third token in our vocabulary.
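A minimal sketch (my addition, not from the book) that verifies this equivalence in PyTorch:

torch.manual_seed(42)
embedding = torch.nn.Embedding(6, 3)  # vocab_size=6, output_dim=3
token_id = torch.tensor([2])
onehot = torch.nn.functional.one_hot(token_id, num_classes=6).float()
# Direct lookup vs. one-hot times weight matrix give the same embedding
print(torch.allclose(embedding(token_id), onehot @ embedding.weight))  # True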
The embedding layer can also be thought of as a hashtable lookup. In this case, we can represent it as:
embedding = hashtable[token_id]
Where:
- embedding is the resulting embedding vector
- hashtable is a dictionary-like structure containing the embeddings
- token_id is the ID of the token we want to look up
For our example with a vocabulary size of 6 and an embedding dimension of 3, we could represent this as:
hashtable = {
0: [w11, w12, w13],
1: [w21, w22, w23],
2: [w31, w32, w33],
3: [w41, w42, w43],
4: [w51, w52, w53],
5: [w61, w62, w63]
}
Then, to get the embedding for token ID 2, we would simply do:
embedding = hashtable[2] # This would return [w31, w32, w33]
This hashtable lookup approach is conceptually similar to the embedding layer and provides another way to understand how embeddings work. However, the actual implementation in PyTorch uses more optimized methods for efficiency and to enable gradient flow for training. Just take a look:
print(embedding_layer(input_ids))
# tensor([[ 0.8008, 1.6806, 0.3559],
# [-0.6866, 0.6105, 1.3347],
# [ 0.8599, -0.3097, -0.3957],
# [ 0.4396, -0.7581, 1.0783]], grad_fn=<EmbeddingBackward0>)
2.8 Encoding word positions
Token embeddings could be used as inputs for LLMs.
But their self-attention mechanism doesn’t have a notion of position or order for the tokens within a sequence.
So we inject additional position information into the LLM.
There are two broad categories of position-aware embeddings we could use:
- relative positional embeddings
- absolute positional embeddings
With absolute positional embeddings, we add a positional embedding to the token embedding. There’s a unique positional embedding for each position in the input sequence.
While with relative positional embeddings, we convey information about the relative position/distance between tokens. This lets the model generalize better to sequences of varying lengths.
Both help LLMs understand the order and relationships between tokens. You choose the one appropriate to your application and data.
GPT models use absolute positional embeddings that are optimized during the training process (unlike those that were fixed/predefined in the original transformer model). The optimization here is part of the model training itself.
vocab_size = 50257
output_dim = 256
token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
max_length = 4
dataloader = create_dataloader_v1(
raw_text, batch_size=8, max_length=max_length,
stride=max_length, shuffle=False
)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Token IDs:\n", inputs)
print("\nInputs shape:\n", inputs.shape) # 8 text samples, 4 tokens each
# Token IDs:
# tensor([[ 40, 367, 2885, 1464],
# [ 1807, 3619, 402, 271],
# [10899, 2138, 257, 7026],
# [15632, 438, 2016, 257],
# [ 922, 5891, 1576, 438],
# [ 568, 340, 373, 645],
# [ 1049, 5975, 284, 502],
# [ 284, 3285, 326, 11]])
#
# Inputs shape:
# torch.Size([8, 4])
token_embeddings = token_embedding_layer(inputs)
print(token_embeddings.shape)
# torch.Size([8, 4, 256])
We embedded each of the tokens into a 256 dimensional vector.
8 samples in our batch, 4 tokens per sample, and 256 embedding dimensions for each token.
A GPT model’s absolute embedding approach:
context_length = max_length
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)
pos_embeddings = pos_embedding_layer(torch.arange(context_length))
print(pos_embeddings.shape)
# torch.Size([4, 256])
The input is usually a placeholder vector containing a sequence of numbers 0, 1, …, n-1, where n is the maximum input length.
context_length represents the supported input size for the LLM. We set it to max_length here.
In practice, the input text can be longer than the supported context length—then we’d have to truncate the text.
input_embeddings = token_embeddings + pos_embeddings
print(input_embeddings.shape)
# torch.Size([8, 4, 256])
3 Coding attention mechanisms
3.2 Capturing data dependencies with attention mechanisms
First we had Bahdanau attention. Then self-attention.
Self-attention assigns attention scores to each word (token) in the sentence. This lets them attend to the other tokens to understand their relative importance, which is helpful for context.
3.3 Attending to different parts of the input with self-attention
“self” in self-attention refers to the fact that the attention mechanism operates within a single sequence of elements (words in a sentence / tokens in an input) by comparing and relating these elements to each other.
Each element in the sequence attends to all other elements, including itself, to capture dependencies and contextual relationships.
Inputs are token embeddings.
Self-attention is what gives us the context-aware representations. And unlike traditional sequence models, like RNNs, where relationships are captured sequentially, self-attention simultaneously considers all interactions across the sequence.
Example: “The cat sat on the mat.”
The word “cat” attends to itself (“cat”) and other words (“The”, “sat”, etc.) to understand its role in the sentence.
This helps it learn relationships like “cat” being the subject and “sat” being the action.
So:
“Self” means that it operates within a single sentence, mapping relationships from each word to all the other words, including itself. It does so by assigning attention scores, which measure the importance of one word relative to another.
In contrast, traditional attention mechanisms focus on the relationship between elements of two different sequences (e.g., a source and a target sequence in translation).
3.3.1 A simple self-attention mechanism without trainable weights
The goal of self-attention is to compute a context vector for each input element.
These context vectors capture the relationships between each input element (e.g. token) and all other tokens in the sequence, allowing the model to encode contextual dependencies effectively.
First step: compute the intermediate attention scores.
This is done by computing the dot product of the query (token) with all elements in the input sequence:
import torch
# Inputs are the embeddings of the words in the sentence
inputs = torch.tensor(
[[0.43, 0.15, 0.89], # Your (x^1)
[0.55, 0.87, 0.66], # journey (x^2)
[0.57, 0.85, 0.64], # starts (x^3)
[0.22, 0.58, 0.33], # with (x^4)
[0.77, 0.25, 0.10], # one (x^5)
[0.05, 0.80, 0.55]] # step (x^6)
)
# Using the second token, `journey`, as the query:
query = inputs[1]
attention_scores_2 = torch.empty(inputs.shape[0])
for i, x_i in enumerate(inputs):
attention_scores_2[i] = torch.dot(query, x_i)
The dot product is a measure of similarity: it measures how closely the two vectors are aligned. A higher dot product means a higher similarity. In self-attention, it determines how much each element attends to (focuses on) any other element–the higher the dot product, the higher the attention score.
Now we normalize each of the attention scores.
We want to ensure that the sum of the attention weights is 1.
# Notice how we use the scores to compute the weights
attention_weights_2_tmp = attention_scores_2 / attention_scores_2.sum()
print("Attention weights:", attention_weights_2_tmp)
print("Sum:", attention_weights_2_tmp.sum())
# Attention weights: tensor([0.1455, 0.2278, 0.2249, 0.1285, 0.1077, 0.1656])
# Sum: tensor(1.0000)
But it’s better to use the Softmax function for normalization. It’s better at handling extreme values & gives better gradient properties during training.
It also ensures the attention weights are always positive, making outputs interpretable as probabilities or relative importance (greater weights means greater importance).
# This fn may encounter numerical instability problems (e.g. overflow, underflow) when dealing with large or small inputs. So use PyTorch's implementation.
def softmax_naive(x):
return torch.exp(x) / torch.exp(x).sum(dim=0)
attention_weights_2_naive = softmax_naive(attention_scores_2)
print("Attention weights:", attention_weights_2_naive)
print("Sum:", attention_weights_2_naive.sum())
# Attention weights: tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
# Sum: tensor(1.)
# PyTorch softmax:
attention_weights_2 = torch.softmax(attention_scores_2, dim=0)
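To see why the naive version is problematic, a small sketch with large scores (values chosen arbitrarily for illustration):

large_scores = torch.tensor([1000.0, 1000.0])
print(softmax_naive(large_scores))         # tensor([nan, nan]) because exp(1000) overflows to inf
print(torch.softmax(large_scores, dim=0))  # tensor([0.5000, 0.5000]) thanks to the internal max-subtraction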
The final step is to calculate the context vector by multiplying the embedded input tokens with the corresponding attention weights & summing the resulting vectors.
Recap:
# Compute attention scores
query = inputs[0]
attn_scores_1 = torch.zeros(inputs.shape[0])
for i, x_i in enumerate(inputs):
attn_scores_1[i] = torch.dot(query, x_i)
# Normalize - compute attention weights
attn_weights_1 = torch.softmax(attn_scores_1, dim=0)
# Compute context vector
context_vec_1 = torch.zeros(query.shape)
for i, x_i in enumerate(inputs):
context_vec_1 += attn_weights_1[i] * x_i
3.3.2 Computing attention weights for all input tokens
Computing the scores:
attn_scores = torch.empty((inputs.shape[0], inputs.shape[0]))
for i, x_i in enumerate(inputs):
for j, x_j in enumerate(inputs):
attn_scores[i, j] = torch.dot(x_i, x_j)
# A faster way:
attn_scores = inputs @ inputs.T # or torch.matmul
attn_scores
Each element in the attn_scores tensor represents an attention score between each pair of inputs.
You can imagine it being a matrix like this, excluding the labels:
| | Your | journey | starts | with | one | step |
|---|---|---|---|---|---|---|
| Your | 0.9995 | 0.9544 | 0.9422 | 0.4753 | 0.4576 | 0.6310 |
| journey | 0.9544 | 1.4950 | 1.4754 | 0.8434 | 0.7070 | 1.0865 |
| starts | 0.9422 | 1.4754 | 1.4570 | 0.8296 | 0.7154 | 1.0605 |
| with | 0.4753 | 0.8434 | 0.8296 | 0.4937 | 0.3474 | 0.6565 |
| one | 0.4576 | 0.7070 | 0.7154 | 0.3474 | 0.6654 | 0.2935 |
| step | 0.6310 | 1.0865 | 1.0605 | 0.6565 | 0.2935 | 0.9450 |
Computing the weights:
attn_weights = torch.softmax(attn_scores, dim=-1)
dim=-1 means the last dimension. For this rank 2 tensor, it means we’re applying Softmax along the second dimension of [rows, columns]. That is, we’re normalizing across the columns, so the values in each row (summing over the column dimension) sum up to 1.
Computing the context vectors:
context_vecs = attn_weights @ inputs
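As a quick sanity check (my addition), the first row of the batched result should match the context vector from the recap loop above:

print(torch.allclose(context_vecs[0], context_vec_1))  # True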
3.4 Implementing self-attention with trainable weights
Implementing Transformer architecture (Attention Is All You Need). This is called scaled dot-product attention.
3.4.1 Computing the attention weights step by step
Introduce three trainable weight matrices: W_query, W_key, and W_value.
Weights W in this context refer to weight parameters that are optimized during model training, not attention weights.
x_2 = inputs[1]
d_in = inputs.shape[1] # input embedding size
d_out = 2 # output embedding size
torch.manual_seed(42)
W_query = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)
W_key = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)
W_value = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)
requires_grad=False is to reduce clutter. If we were using the weight matrices for model training, we’d set it to True during training.
query_2 = x_2 @ W_query
key_2 = x_2 @ W_key
value_2 = x_2 @ W_value
query_2
# tensor([1.0760, 1.7344])
Visualizing a single computation of some input
To compute all keys and values:
keys = inputs @ W_key
values = inputs @ W_value
keys.shape, values.shape
# (torch.Size([6, 2]), torch.Size([6, 2]))
These computations are similar to above, just with matrix multiplications. Visualizing matrices of the shapes of inputs and W_key being multiplied:
Now we’ve projected the six input tokens from a three-dimensional embedding space onto a two-dimensional one.
keys_2 = keys[1]
attn_scores_22 = query_2.dot(keys_2)
attn_scores_22 # unnormalized attention score
# tensor(3.3338)
# Generalized:
attn_scores_2 = query_2 @ keys.T
attn_scores_2
# tensor([2.7084, 3.3338, 3.3013, 1.7563, 1.7869, 2.1966])
From attention scores to attention weights:
Scale the attention scores by dividing them by the sqrt of the embedding dimension of the keys & then using the Softmax function.
We scale by the embedding dimension to improve training performance by avoiding small gradients.
Large dot products can lead to very small gradients during backprop due to Softmax. As dot products increase, Softmax becomes more like a step function, leading to gradients near zero. These can slow down training / cause it to stagnate.
We call this self-attention mechanism “scaled dot-product attention” due to this scaling by the sqrt of the embedding dimension.
d_k = keys.shape[-1]
attn_weights_2 = torch.softmax(attn_scores_2 / d_k**0.5, dim=-1)
attn_weights_2
# tensor([0.1723, 0.2681, 0.2620, 0.0879, 0.0898, 0.1200])
context_vec_2 = attn_weights_2 @ values
context_vec_2
# tensor([1.4201, 0.8892])
“query”, “key”, and “value” are borrowed from the domain of information retrieval and databases.
- A query is similar to a search query: it represents the current item the model is focusing on / trying to understand.
- The key is like a database key used for indexing and searching. Each item in the input sequence has an associated key, and we use them to match the query.
- The value is similar to the value in a key-value pair in a database. Represents the actual content / representation of the input items. Once the model determines which keys (which parts of the input) are most relevant to the query, it retrieves the corresponding values.
3.4.2 Implementing a compact self-attention Python class
import torch.nn as nn
class SelfAttention_v1(nn.Module):
def __init__(self, d_in: int, d_out: int):
super().__init__()
self.W_query = nn.Parameter(torch.rand(d_in, d_out))
self.W_key = nn.Parameter(torch.rand(d_in, d_out))
self.W_value = nn.Parameter(torch.rand(d_in, d_out))
def forward(self, x):
keys = x @ self.W_key
queries = x @ self.W_query
values = x @ self.W_value
attn_scores = queries @ keys.T
attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
context_vec = attn_weights @ values
return context_vec
We can use nn.Linear layers instead, which effectively perform matmuls when bias units are disabled.
And linear layers have an optimized weight initialization scheme, meaning more stable and effective training.
class SelfAttention_v2(nn.Module):
def __init__(self, d_in: int, d_out: int, qkv_bias=False):
super().__init__()
self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
def forward(self, x):
keys = self.W_key(x)
queries = self.W_query(x)
values = self.W_value(x)
attn_scores = queries @ keys.T
attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
context_vec = attn_weights @ values
return context_vec
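A quick usage sketch (the seed is my own choice, so the exact numbers aren’t the book’s):

torch.manual_seed(789)
sa_v2 = SelfAttention_v2(d_in=3, d_out=2)
print(sa_v2(inputs))        # one 2-dimensional context vector per input token
print(sa_v2(inputs).shape)  # torch.Size([6, 2])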
3.5 Hiding future words with causal attention
Causal attention is also known as masked attention.
For some tasks, we want self-attention to only consider tokens appearing prior to the current position when predicting tokens in a sequence.
It’s a special form of self-attention that restricts the model to only consider previous and current inputs in a sequence.
We essentially mask out future tokens - tokens that come after the current token in the input.
Mask attention weights (set to 0) above the diagonal, and normalize the non-masked attention weights, so the weights sum to 1 in each row.
3.5.1 Applying a causal attention mask
- Apply Softmax on the attention scores to get the normalized attention weights.
- Mask those with 0’s above the diagonal to get the masked attention scores.
- Normalize rows to get the masked attention weights.
But that isn’t very efficient. Instead, we can:
- Mask the attention scores with negative infinity (-inf) above the diagonal to get the masked attention scores
- Apply Softmax on those to get masked attention weights
This works because Softmax treats negative infinity values in a row as zero probability (e^(-inf) approaches 0).
context_length = attn_scores.shape[0]
# Upper triangular matrix with all elements above the diagonal being 1 and those on and below the diagonal 0
mask = torch.triu(torch.ones(context_length, context_length), diagonal=1)
# `mask.bool()` sets 1s to True. Then we fill the corresponding values (same idx as True values in mask) in the attention scores matrix with negative infinity
masked = attn_scores.masked_fill(mask.bool(), -torch.inf)
attn_weights = torch.softmax(masked / keys.shape[-1]**0.5, dim=-1)
3.5.2 Masking additional attention weights with dropout
We can use Dropout in the Attention mechanisms|attention mechanism. It’s typically applied either after calculating the attention weights or after applying them to the value vectors.
torch.manual_seed(42)
dropout = torch.nn.Dropout(0.5) # usually 0.1 or 0.2 for training GPT models
example = torch.ones(6, 6)
dropout(example)
# tensor([[0., 0., 2., 2., 2., 2.],
# [2., 0., 2., 0., 2., 0.],
# [0., 0., 2., 2., 2., 0.],
# [2., 2., 0., 2., 0., 2.],
# [2., 0., 2., 2., 2., 2.],
# [2., 2., 2., 0., 2., 0.]])
~50% were scaled to zero. To compensate for the reduction in active elements, the rest were scaled up by a factor of 1 / 0.5 = 2.
This is to maintain the overall balance of the weights, so the average influence of the attention mechanism is consistent both during training and inference.
We can apply it to the attention weights as dropout(attn_weights).
3.5.3 Implementing a compact causal attention class
class CausalAttention(nn.Module):
def __init__(self, d_in: int, d_out: int, context_length: int, dropout: float, qkv_bias=False):
super().__init__()
self.d_out = d_out
self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
self.dropout = nn.Dropout(dropout)
# Useful because the buffer is auto-moved to the appropriate device (CPU/GPU) with our model (e.g. when training)
self.register_buffer(
'mask',
torch.triu(torch.ones(context_length, context_length),
diagonal=1)
)
def forward(self, x):
# [batch, num_tokens, d_in]
b, num_tokens, d_in = x.shape
keys = self.W_key(x)
queries = self.W_query(x)
values = self.W_value(x)
attn_scores = queries @ keys.transpose(1, 2) # Keep batch dim at position 0, but transpose dim 1 and 2
# Trailing _ means inplace. Using it to avoid unnecessary memory copies.
attn_scores.masked_fill_(self.mask.bool()[:num_tokens, :num_tokens], -torch.inf)
attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
attn_weights_dropped = self.dropout(attn_weights)
context_vec = attn_weights_dropped @ values
return context_vec
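A usage sketch, assuming batch stacks the earlier inputs twice (as in the book’s running example; this batch is also what the next section refers to):

batch = torch.stack((inputs, inputs), dim=0)  # shape: (2, 6, 3)
torch.manual_seed(42)
context_length = batch.shape[1]
ca = CausalAttention(d_in=3, d_out=2, context_length=context_length, dropout=0.0)
print(ca(batch).shape)  # torch.Size([2, 6, 2])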
3.6 Extending single-head attention to multi-head attention
Causal attention over multiple heads = multi-head attention.
“Multi-head” because multiple, independent attention heads process the input.
The point is that each head can learn to attend to different aspects or patterns in the input.
This allows it to jointly attend to information from different representation subspaces at different positions.
One head might learn to focus on syntactic relationships, another might capture semantic similarities, a third attend to longer-range dependencies, and one specialize in local context patterns, for example.
In CNNs, early layers detect basic features like edges and textures, middle layers combine these into more complex patterns like shapes, and later layers recognize high-level features like faces or objects.
In transformers, attention heads don’t work in this hierarchical manner. Instead, each head provides a different “perspective” at the same level as the other heads.
3.6.1 Stacking multiple single-head attention layers
Create multiple attention heads, each with their own weights, and combine their outputs.
class MultiHeadAttentionWrapper(nn.Module):
def __init__(
self,
d_in: int,
d_out: int,
context_length: int,
dropout: float,
num_heads: int,
qkv_bias=False,
):
super().__init__()
self.heads = nn.ModuleList(
[
CausalAttention(d_in, d_out, context_length, dropout, qkv_bias)
for _ in range(num_heads)
]
)
def forward(self, x):
return torch.cat([head(x) for head in self.heads], dim=-1)
If we use d_out=3 and num_heads=2, we’d get a 3*2=6-dimensional embedding in our context vector matrix.
We get a tensor with two sets of context vector matrices. In each, the rows are the context vectors corresponding to the tokens, and the columns correspond to the embedding dimension. These matrices are concatenated along the column dimension, giving us the embedding dimension of 3*2=6.
torch.manual_seed(42)
context_length = batch.shape[1]
d_in, d_out = 3, 2
mha = MultiHeadAttentionWrapper(d_in, d_out, context_length, 0.0, num_heads=2)
context_vecs = mha(batch)
print(context_vecs)
print(f"{context_vecs.shape=}")
# tensor([[[0.4429, 0.1077, 0.5473, 0.3307],
# [0.4656, 0.2597, 0.3420, 0.2234],
# [0.4732, 0.3030, 0.2818, 0.1894],
# [0.4135, 0.2921, 0.2105, 0.1521],
# [0.4078, 0.2567, 0.2252, 0.1357],
# [0.3772, 0.2746, 0.1709, 0.1215]],
#
# [[0.4429, 0.1077, 0.5473, 0.3307],
# [0.4656, 0.2597, 0.3420, 0.2234],
# [0.4732, 0.3030, 0.2818, 0.1894],
# [0.4135, 0.2921, 0.2105, 0.1521],
# [0.4078, 0.2567, 0.2252, 0.1357],
# [0.3772, 0.2746, 0.1709, 0.1215]]], grad_fn=<CatBackward0>)
# context_vecs.shape=torch.Size([2, 6, 4])
- The first dimension is 2 because we have two samples in our batch.
- The second dimension denotes the 6 tokens in each input.
- The third dimension is the 4-dimensional embedding of each token.
[batch_size, context_length, embedding_dimensions]
3.6.2 Implementing multi-head attention with weight splits
class MultiHeadAttention(nn.Module):
def __init__(self, d_in: int, d_out: int, context_length: int, dropout: float, num_heads: int, qkv_bias=False):
super().__init__()
assert (d_out % num_heads == 0), "d_out must be divisible by num_heads"
self.d_out = d_out
self.num_heads = num_heads
self.head_dim = d_out // num_heads # Reduces projection dim to match desired output dim
self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
self.out_proj = nn.Linear(d_out, d_out) # To combine head outputs
self.dropout = nn.Dropout(dropout)
self.register_buffer("mask", torch.triu(torch.ones(context_length, context_length), diagonal=1))
def forward(self, x):
b, num_tokens, d_in = x.shape
# Tensor shape (b, num_tokens, d_out)
keys = self.W_key(x)
queries = self.W_query(x)
values = self.W_value(x)
keys = keys.view(b, num_tokens, self.num_heads, self.head_dim) # implicitly split the matrix by adding num_heads dimension, then unroll the last dim: (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim)
values = values.view(b, num_tokens, self.num_heads, self.head_dim)
queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)
# Transposes from shape (b, num_tokens, num_heads, head_dim) to (b, num_heads, num_tokens, head_dim)
keys = keys.transpose(1, 2)
queries = queries.transpose(1, 2)
values = values.transpose(1, 2)
attn_scores = queries @ keys.transpose(2, 3) # compute dot product for each head
mask_bool = self.mask.bool()[:num_tokens, :num_tokens] # masks truncated to the number of tokens
attn_scores.masked_fill_(mask_bool, -torch.inf) # uses mask to fill attn scores
attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
attn_weights = self.dropout(attn_weights)
context_vec = (attn_weights @ values).transpose(1, 2) # tensor shape: (b, num_tokens, n_heads, head_dim)
context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)
context_vec = self.out_proj(context_vec) # optional linear projection
return context_vec
As before, d_in is the dimensionality of the input embeddings, and d_out is the desired dimensionality of the output context vectors. context_length is the maximum length of the input sequence (in number of tokens).
d_out must be divisible by num_heads to ensure the output can be evenly divided among the heads. This means that when we concatenate their outputs later, we recover the original d_out dimension.
The Linear layers starting with W_ project the input embedding into query, key, and value representations. The linear transformations learn to map the input embeddings into different representation subspaces where the model can compute similarities and relevance between tokens (using queries and keys) and extract relevant information (values). Each of these layers has a weight matrix of shape (d_in, d_out).
The optional output projection layer is often used after concatenating the outputs of the attention heads. It can learn to mix and combine the information from the different heads, potentially allowing the model to create more complex representations. The weight matrix of this layer is of shape (d_out, d_out).
We create the causal mask with torch.triu and torch.ones, creating an upper triangular matrix of ones with the diagonal offset by 1. The mask is used to prevent the model from attending to future tokens in the sequence (we want to predict the next token, so this prevents peeking). Setting the upper triangle to 1 and the lower part (including the main diagonal) to 0 means that, when we compute attention, a token at a given position can only attend to itself and to earlier positions. register_buffer ensures the mask is saved and loaded along with the model’s parameters, but isn’t treated as a learnable parameter.
In the forward function, we start by unpacking the shape of the input tensor x: b is the batch size, num_tokens is the number of tokens in the input sequence, and d_in is the embedding dimension of each token.
Then we do linear projections to get the query, key, and value representations. Here, we’re essentially transforming the input embeddings x into different representations using the learned weight matrices. So the shape of x is (b, num_tokens, d_in), which we linearly project to a tensor of shape (b, num_tokens, d_out).
Then we use .view to split the d_out dimension into num_heads and head_dim. This just reshapes the tensors without changing the data. If d_out is 12 and num_heads is 4, then head_dim would be 3. So the keys tensor, originally of shape (b, num_tokens, 12), is reshaped to (b, num_tokens, 4, 3). This is like dividing the 12-dimensional key representation into 4 separate 3-dimensional key representations, one for each head.
This should help illustrate what Tensor.view does. It returns a new tensor with the same data but of a different shape.
>>> import torch
>>> x = torch.randn(4,4)
>>> x
tensor([[ 2.0618, -0.7314, 0.1790, 0.0057],
[ 1.0455, -1.1515, 0.2536, 0.3051],
[ 0.3095, -0.2626, 0.1183, 1.3439],
[-0.1830, 0.5182, 1.6458, -0.1339]])
>>> y = x.view(16)
>>> y
tensor([ 2.0618, -0.7314, 0.1790, 0.0057, 1.0455, -1.1515, 0.2536, 0.3051,
0.3095, -0.2626, 0.1183, 1.3439, -0.1830, 0.5182, 1.6458, -0.1339])
>>> z = x.view(-1, 8)
>>> z
tensor([[ 2.0618, -0.7314, 0.1790, 0.0057, 1.0455, -1.1515, 0.2536, 0.3051],
[ 0.3095, -0.2626, 0.1183, 1.3439, -0.1830, 0.5182, 1.6458, -0.1339]])
So when we do e.g. keys.view(...), we essentially reshape the keys matrix to fit the new shape we give.
The .transpose(1, 2) is rather simple: we swap the given dimensions, here the num_tokens and num_heads dimensions. We do this to enable efficient batched matrix multiplication in the next step. After transposing, the shapes of keys, queries, and values become (b, num_heads, num_tokens, head_dim). Now, num_heads is the second dimension, allowing us to compute attention scores for all heads in parallel with a single matrix multiplication.
Then we use the mask, converting it to a Boolean tensor and truncating it to the actual number of tokens in the input sequence. This is necessary because context_length is the maximum sequence length, but the actual input sequences may be shorter.
Next, we compute the attention weights by applying the softmax function to the scaled attention scores. We scale by the square root of the head_dim (keys.shape[-1]**0.5) to stabilize training.
Then we apply dropout to the attention weights.
We compute the context vectors by performing a weighted sum of the values using the attention weights.
For each token, we’re creating a new representation (the context vector) by taking a weighted average of the value vectors of all tokens, where the weights are determined by the attention mechanism. Tokens deemed more relevant (higher attention weights) will contribute more to the context vector. The shape of the resulting context vector is initially (b, num_heads, num_tokens, head_dim), and then we transpose it back to (b, num_tokens, num_heads, head_dim) to align with the original ordering of dimensions before we computed attention.
Then, using contiguous and view, we concatenate the outputs from the different heads and reshape the tensor. contiguous ensures the tensor is stored in a contiguous block of memory. Essentially, we combine the outputs from all the heads back into a single representation. The shape becomes b x num_tokens x (num_heads * head_dim), which is equivalent to b x num_tokens x d_out.
Finally, we apply the output projection, and return the final context vectors.
Instead of the previous wrapper approach, we integrate the single-head attention mechanisms into this class.
This figure explains the difference between the approaches. In the top part of the figure, we see the previous wrapper approach. And in the bottom, we see the new MultiHeadAttention approach. We have one large weight matrix & only perform one matmul with the inputs to get the query matrix, and we then split that into separate matrices (and also do this for the keys and values):
We do the splitting via tensor reshaping and transposing (view and transpose).
We split the d_out dimension into num_heads and head_dim, where head_dim = d_out / num_heads. We achieve this with the view method, as seen in the code.
Then we transpose the tensors to bring the num_heads dimension before the num_tokens dimension, which is crucial for aligning the queries, keys, and values across the different heads & performing batched matmuls efficiently.
We added the (optional) output projection layer out_proj. It’s commonly used in many LLM architectures.
This approach is more efficient than the previous approach because we only do a single matrix multiplication to compute e.g. the keys (likewise for queries, values). In the wrapper approach, we had to repeat it multiple times.
d_in, d_out = 3, 2
context_length = batch.shape[1]
mha = MultiHeadAttention(d_in, d_out, context_length, 0.0, num_heads=2, qkv_bias=False)
context_vecs = mha(batch)
print(context_vecs)
print(f"{context_vecs.shape=}")
# tensor([[[-0.6380, 0.3370],
# [-0.7576, 0.2926],
# [-0.7891, 0.2779],
# [-0.7887, 0.2770],
# [-0.6782, 0.2563],
# [-0.7425, 0.2639]],
#
# [[-0.6380, 0.3370],
# [-0.7576, 0.2926],
# [-0.7891, 0.2779],
# [-0.7887, 0.2770],
# [-0.6782, 0.2563],
# [-0.7425, 0.2639]]], grad_fn=<ViewBackward0>)
# context_vecs.shape=torch.Size([2, 6, 2])
The output dimension is directly controlled by the d_out argument here.
The smallest GPT-2 (117M params) has 12 attention heads and a context vector embedding size of 768.
In GPT models, the embedding sizes of the token inputs and context embeddings are the same (d_in = d_out).
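A shape-check sketch with GPT-2-small-like dimensions (the toy input is my own, not from the book):

torch.manual_seed(42)
mha_gpt2 = MultiHeadAttention(d_in=768, d_out=768, context_length=1024, dropout=0.0, num_heads=12)
x = torch.rand(1, 8, 768)  # 1 sample, 8 tokens, 768-dimensional embeddings
print(mha_gpt2(x).shape)   # torch.Size([1, 8, 768]); each head works with head_dim = 768 / 12 = 64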
4 Implementing a GPT model from scratch to generate text
4.1 Coding an LLM architecture
This is the config we’re using. It’s the smallest GPT-2.
GPT_CONFIG_124M = {
"vocab_size": 50257,
"context_length": 1024,
"emb_dim": 768,
"n_heads": 12,
"n_layers": 12,
"drop_rate": 0.1,
"qkv_bias": False
}
import torch
import torch.nn as nn
class DummyGPTModel(nn.Module):
def __init__(
self,
vocab_size,
context_length,
emb_dim,
n_heads,
n_layers,
drop_rate,
qkv_bias=False,
**kwargs,
):
super().__init__()
self.tok_emb = nn.Embedding(vocab_size, emb_dim)
self.pos_emb = nn.Embedding(context_length, emb_dim)
self.drop_emb = nn.Dropout(drop_rate)
self.trf_blocks = nn.Sequential(
*[
DummyTransformerBlock(
vocab_size=vocab_size,
context_length=context_length,
emb_dim=emb_dim,
n_heads=n_heads,
n_layers=n_layers,
drop_rate=drop_rate,
qkv_bias=qkv_bias,
**kwargs,
)
for _ in range(n_layers)
]
)
self.final_norm = DummyLayerNorm(emb_dim)
self.out_head = nn.Linear(emb_dim, vocab_size, bias=False)
def forward(self, in_idx):
batch_size, seq_len = in_idx.shape
tok_embeds = self.tok_emb(in_idx)
pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))
x = tok_embeds + pos_embeds
x = self.drop_emb(x)
x = self.trf_blocks(x)
x = self.final_norm(x)
logits = self.out_head(x)
return logits
class DummyTransformerBlock(nn.Module):
def __init__(
self,
vocab_size,
context_length,
emb_dim,
n_heads,
n_layers,
drop_rate,
qkv_bias=False,
**kwargs,
):
super().__init__()
def forward(self, x):
return x
class DummyLayerNorm(nn.Module):
def __init__(self, normalized_shape, eps=1e-5):
super().__init__()
def forward(self, x):
return x
DummyGPTModel(**GPT_CONFIG_124M)
# DummyGPTModel(
# (tok_emb): Embedding(50257, 768)
# (pos_emb): Embedding(1024, 768)
# (drop_emb): Dropout(p=0.1, inplace=False)
# (trf_blocks): Sequential(
# (0): DummyTransformerBlock()
# (1): DummyTransformerBlock()
# (2): DummyTransformerBlock()
# (3): DummyTransformerBlock()
# (4): DummyTransformerBlock()
# (5): DummyTransformerBlock()
# (6): DummyTransformerBlock()
# (7): DummyTransformerBlock()
# (8): DummyTransformerBlock()
# (9): DummyTransformerBlock()
# (10): DummyTransformerBlock()
# (11): DummyTransformerBlock()
# )
# (final_norm): DummyLayerNorm()
# (out_head): Linear(in_features=768, out_features=50257, bias=False)
# )
import tiktoken
tokenizer = tiktoken.get_encoding("gpt2")
batch = []
txt1 = "Every effort moves you"
txt2 = "Every day holds a"
batch.append(torch.tensor(tokenizer.encode(txt1)))
batch.append(torch.tensor(tokenizer.encode(txt2)))
batch = torch.stack(batch, dim=0)
print(batch)
# tensor([[6109, 3626, 6100, 345], # First text
# [6109, 1110, 6622, 257]]) # Second text
torch.manual_seed(42)
model = DummyGPTModel(**GPT_CONFIG_124M)
logits = model(batch)
print(f"{logits} ({logits.shape=})")
# tensor([[[ 0.7739, 0.0181, -0.0797, ..., 0.3098, 0.8177, -0.6049],
# [-0.8063, 0.8920, -1.0962, ..., -0.4378, 1.1056, 0.1939],
# [-0.8459, -1.0176, 0.4964, ..., 0.4581, -0.3293, 0.2320],
# [ 0.4098, -0.3144, -1.0831, ..., 0.7491, 0.7018, 0.4715]],
#
# [[ 0.2911, 0.1596, -0.2137, ..., 0.5173, 0.7380, -0.7045],
# [-0.4064, 0.6045, -0.4485, ..., -0.5616, 0.4590, -0.1384],
# [-0.6108, 0.7148, 1.2499, ..., -0.7925, -0.5328, 0.4794],
# [ 0.9423, 0.1867, -0.5557, ..., 0.4156, 0.1756, 1.9882]]],
# grad_fn=<UnsafeViewBackward0>) (logits.shape=torch.Size([2, 4, 50257]))
The output tensor has 2 rows–one for each text sample. Each text sample has 4 tokens.
The last dimension has 50,257 entries because each one refers to a unique token in the vocabulary. Later, we’ll convert these vectors back into token IDs, so we can decode them into words.
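A small sketch of that postprocessing step (greedy decoding; this jumps ahead a bit, so treat it as my own illustration rather than the book’s code at this point):

probas = torch.softmax(logits, dim=-1)    # (2, 4, 50257): probability over the vocabulary per position
token_ids = torch.argmax(probas, dim=-1)  # (2, 4): most likely token ID per position
print(tokenizer.decode(token_ids[0].tolist()))  # gibberish for now, since the dummy model is untrained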
4.2 Normalizing activations with layer normalization
Layer normalization is typically applied before and after the multi-head attention module, and, as we have seen with the DummyLayerNorm placeholder, before the final output layer.
Layer Normalization gives us zero mean and unit variance (a variance of 1).
# Pre LayerNorm
torch.manual_seed(42)
batch_example = torch.randn(2, 5)
layer = nn.Sequential(nn.Linear(5, 6), nn.ReLU())
out = layer(batch_example)
print(out)
mean = out.mean(dim=-1, keepdim=True)
var = out.var(dim=-1, keepdim=True)
print(f"{mean=}\n{var=}")
# tensor([[0.0000, 0.1842, 0.0052, 0.7233, 0.0000, 0.5298],
# [0.0000, 0.0000, 0.0000, 0.2237, 0.0000, 0.7727]],
# grad_fn=<ReluBackward0>)
# mean=tensor([[0.2404],
# [0.1661]], grad_fn=<MeanBackward1>)
# var=tensor([[0.0982],
# [0.0963]], grad_fn=<VarBackward0>)
# Post LayerNorm
out_norm = (out - mean) / torch.sqrt(var)
mean = out_norm.mean(dim=-1, keepdim=True)
var = out_norm.var(dim=-1, keepdim=True)
print("Normalized layer outputs:\n", out_norm)
print(f"Mean:\n{mean}")
print(f"Variance:\n{var}")
# Normalized layer outputs:
# tensor([[-0.7672, -0.1794, -0.7506, 1.5410, -0.7672, 0.9234],
# [-0.5351, -0.5351, -0.5351, 0.1857, -0.5351, 1.9546]],
# grad_fn=<DivBackward0>)
# Mean:
# tensor([[0.0000e+00],
# [7.4506e-09]], grad_fn=<MeanBackward1>)
# Variance:
# tensor([[1.0000],
# [1.0000]], grad_fn=<VarBackward0>)
Layer Normalization:
class LayerNorm(nn.Module):
def __init__(self, emb_dim):
super().__init__()
self.eps = 1e-5 # Prevent division by zero
self.scale = nn.Parameter(torch.ones(emb_dim))
self.shift = nn.Parameter(torch.zeros(emb_dim))
def forward(self, x):
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)
norm_x = (x - mean) / torch.sqrt(var + self.eps)
return self.scale * norm_x + self.shift
We set unbiased=False. When computing variance, we divide by the number of inputs n in the variance formula. This doesn’t apply Bessel’s correction, which uses n-1 in the denominator instead of n.
In LLMs, the embedding dimension n is usually significantly large, so the difference between n and n-1 is negligible.
So we do it here because it follows TensorFlow’s default behavior, which was used for the original GPT-2.
ln = LayerNorm(emb_dim=5)
out_ln = ln(batch_example)
mean = out_ln.mean(dim=-1, keepdim=True)
var = out_ln.var(dim=-1, unbiased=False, keepdim=True)
print("Mean:\n", mean)
print("Variance:\n", var)
# Mean:
# tensor([[-1.1921e-08],
# [ 3.2037e-08]], grad_fn=<MeanBackward1>)
# Variance:
# tensor([[1.0000],
# [1.0000]], grad_fn=<VarBackward0>)
4.3 Implementing a feed forward network with GELU activations
ReLU is used for many DL tasks. For LLMs, however, we usually use GELU or SwiGLU (Swish-gated linear unit).
class GELU(nn.Module):
def __init__(self):
super().__init__()
def forward(self, x):
return (
0.5 * x * (1 + torch.tanh(
torch.sqrt(torch.tensor(2.0 / torch.pi))
* (x + 0.044715 * torch.pow(x, 3))
))
)
class FeedForward(nn.Module):
def __init__(self, emb_dim):
super().__init__()
self.layers = nn.Sequential(
nn.Linear(emb_dim, 4 * emb_dim),
GELU(),
nn.Linear(4 * emb_dim, emb_dim)
)
def forward(self, x):
return self.layers(x)
This feedforward layer is useful as it enhances the model's ability to learn from and generalize to the data. It expands the embedding dimension into a higher-dimensional space, uses a nonlinear GELU activation, and then contracts back to the original dimension with the second linear transformation. This allows for the exploration of a richer representation space.
4.4 Adding shortcut connections
Adding shortcut connections can help optimize gradient flow, e.g. alleviate The Vanishing Gradient Problem.
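A minimal sketch of the idea (not the book's exact example; the ShortcutBlock helper below is made up for illustration): each block adds its input back onto its output, giving gradients a direct path around the layer.
import torch
import torch.nn as nn

class ShortcutBlock(nn.Module):
    """Hypothetical helper: a layer wrapped with a shortcut (residual) connection."""
    def __init__(self, emb_dim):
        super().__init__()
        self.layer = nn.Sequential(nn.Linear(emb_dim, emb_dim), nn.GELU())

    def forward(self, x):
        # Shortcut connection: add the input back to the layer's output
        return x + self.layer(x)

x = torch.randn(2, 4, 768)
print(ShortcutBlock(768)(x).shape)  # torch.Size([2, 4, 768]) -- shape is preserved, so the addition is valid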
4.5 Connecting attention and linear layers in a transformer block
class TransformerBlock(nn.Module):
def __init__(self, emb_dim, context_length, num_heads, drop_rate, qkv_bias):
super().__init__()
self.layers = nn.ModuleList(
[
nn.Sequential(
LayerNorm(emb_dim),
MultiHeadAttention(
d_in=emb_dim,
d_out=emb_dim,
context_length=context_length,
drop_rate=drop_rate,
num_heads=num_heads,
qkv_bias=qkv_bias,
),
nn.Dropout(drop_rate),
),
nn.Sequential(
LayerNorm(emb_dim), FeedForward(emb_dim), nn.Dropout(drop_rate)
),
]
)
def forward(self, x):
for layer in self.layers:
x = layer(x) + x
return x
torch.manual_seed(123)
x = torch.rand(2, 4, 768)
block = TransformerBlock(
context_length=GPT_CONFIG_124M["context_length"],
drop_rate=GPT_CONFIG_124M["drop_rate"],
emb_dim=GPT_CONFIG_124M["emb_dim"],
num_heads=GPT_CONFIG_124M["n_heads"],
qkv_bias=GPT_CONFIG_124M["qkv_bias"]
)
output = block(x)
print(f"Input shape: {x.shape=}")
print(f"Output shape: {output.shape=}")
# Input shape: x.shape=torch.Size([2, 4, 768])
# Output shape: output.shape=torch.Size([2, 4, 768])
As we see, the shape is preserved. This is part of what makes the transformer great for sequence-to-sequence tasks.
The output is a context vector that encapsulates information from the entire input sequence.
4.6 Coding the GPT model
class GPTModel(nn.Module):
def __init__(
self,
vocab_size,
context_length,
emb_dim,
n_heads,
n_layers,
drop_rate,
qkv_bias=False,
**kwargs,
):
super().__init__()
self.tok_emb = nn.Embedding(vocab_size, emb_dim)
self.pos_emb = nn.Embedding(context_length, emb_dim)
self.drop_emb = nn.Dropout(drop_rate)
self.trf_blocks = nn.Sequential(
*[
TransformerBlock(
context_length=context_length,
emb_dim=emb_dim,
num_heads=n_heads,
drop_rate=drop_rate,
qkv_bias=qkv_bias,
**kwargs,
)
for _ in range(n_layers)
]
)
self.final_norm = LayerNorm(emb_dim)
self.out_head = nn.Linear(emb_dim, vocab_size, bias=False)
def forward(self, in_idx):
batch_size, seq_len = in_idx.shape
tok_embeds = self.tok_emb(in_idx)
pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))
x = tok_embeds + pos_embeds
x = self.drop_emb(x)
x = self.trf_blocks(x)
x = self.final_norm(x)
logits = self.out_head(x)
return logits
We use the final layer norm to standardize the outputs from the transformer blocks to stabilize the learning process.
The linear output head is defined without bias. It projects the transformer’s output into the vocabulary space of the tokenizer to generate logits for each token in the vocabulary.
torch.manual_seed(123)
model = GPTModel(
vocab_size=GPT_CONFIG_124M["vocab_size"],
context_length=GPT_CONFIG_124M["context_length"],
drop_rate=GPT_CONFIG_124M["drop_rate"],
emb_dim=GPT_CONFIG_124M["emb_dim"],
n_heads=GPT_CONFIG_124M["n_heads"],
n_layers=GPT_CONFIG_124M["n_layers"],
qkv_bias=GPT_CONFIG_124M["qkv_bias"]
)
out = model(batch)
print(f"Input batch:\n{batch}")
print()
print(f"Output shape: {out.shape=}")
print(out)
# Input batch:
# tensor([[6109, 3626, 6100, 345],
# [6109, 1110, 6622, 257]])
#
# Output shape: out.shape=torch.Size([2, 4, 50257])
# tensor([[[ 0.1381, 0.0077, -0.1963, ..., -0.0222, -0.1060, 0.1717],
# [ 0.3865, -0.8408, -0.6564, ..., -0.5163, 0.2369, -0.3357],
# [ 0.6989, -0.1829, -0.1631, ..., 0.1472, -0.6504, -0.0056],
# [-0.4290, 0.1669, -0.1258, ..., 1.1579, 0.5303, -0.5549]],
#
# [[ 0.1094, -0.2894, -0.1467, ..., -0.0557, 0.2911, -0.2824],
# [ 0.0882, -0.3552, -0.3527, ..., 1.2930, 0.0053, 0.1898],
# [ 0.6091, 0.4702, -0.4094, ..., 0.7688, 0.3787, -0.1974],
# [-0.0612, -0.0737, 0.4751, ..., 1.2463, -0.3834, 0.0609]]],
# grad_fn=<UnsafeViewBackward0>)
total_params = sum(p.numel() for p in model.parameters())
print(f"Total number of parameters: {total_params:,}")
# Total number of parameters: 163,009,536
The “124-million-parameter” GPT model actually contains 163 million parameters because our implementation doesn't use weight tying. The original GPT-2 architecture reused the token embedding layer's weights in the output layer, which is what brings the count down to 124 million.
If we subtract the output layer's parameter count from the total, we get 124 million.
Weight tying reduces memory footprint and computational complexity, but using separate token embedding and output layer can result in better training and model performance.
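A quick sketch of that arithmetic, reusing model and total_params from above:
# Subtract the (untied) output head's parameters to recover the published 124M figure
total_params_gpt2 = total_params - sum(p.numel() for p in model.out_head.parameters())
print(f"Number of trainable parameters considering weight tying: {total_params_gpt2:,}")
# Number of trainable parameters considering weight tying: 124,412,160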
# Exercise 4.1: Calculate the number of parameters that are contained in the feed forward module and those that are contained in the multi-head attention module.
ff_params = 0
attn_params = 0
for module in model.modules():
if isinstance(module, FeedForward):
ff_params += sum(p.numel() for p in module.parameters())
elif isinstance(module, MultiHeadAttention):
attn_params += sum(p.numel() for p in module.parameters())
print(f"Parameters in feed forward layers: {ff_params:,}")
print(f"Parameters in attention layers: {attn_params:,}")
print(f"Percentage of total parameters:")
print(f"Feed forward: {ff_params/total_params*100:.1f}%")
print(f"Attention: {attn_params/total_params*100:.1f}%")
# Parameters in feed forward layers: 56,669,184
# Parameters in attention layers: 28,320,768
# Percentage of total parameters:
# Feed forward: 34.8%
# Attention: 17.4%
total_size_bytes = total_params * 4 # assumes float32, = 4 bytes per parameter
total_size_mb = total_size_bytes / (1024 * 1024)
print(f"Total size of the model: {total_size_mb:.2f} MB")
# Total size of the model: 621.83 MB
4.7 Generating text
# `idx` is a (batch, n_tokens) array of indices in the current context
def generate_text_simple(model, idx, max_new_tokens, context_size):
for _ in range(max_new_tokens):
# crops current context if it exceeds supported context size (only last 'context_size' tokens are used as context if current context is larger than context_size)
idx_cond = idx[:, -context_size:]
with torch.no_grad():
logits = model(idx_cond)
logits = logits[:, -1, :] # focus on last time step
probas = torch.softmax(logits, dim=-1) # (batch, vocab_size)
idx_next = torch.argmax(probas, dim=-1, keepdim=True) # (batch, 1)
idx = torch.cat((idx, idx_next), dim=1) # appends sampled index to the running sequence. idx: (batch, n_tokens+1)
return idx
This uses greedy decoding, wherein the model generates the most likely next token. However, we can use other sampling techniques to modify the softmax outputs such that it doesn’t always select the most likely token. This introduces variability and creativity in the generated text.
start_context = "Hello, I am"
encoded = tokenizer.encode(start_context)
print(f"{encoded=}")
encoded_tensor = torch.tensor(encoded).unsqueeze(0) # adds batch dimension
print(f"{encoded_tensor.shape=}")
# encoded=[15496, 11, 314, 716]
# encoded_tensor.shape=torch.Size([1, 4])
model.eval()
out = generate_text_simple(
model=model,
idx=encoded_tensor,
max_new_tokens=6,
context_size=cfg["context_length"]
)
print(f"{out=}")
print(f"{len(out)=}")
# out=tensor([[15496, 11, 314, 716, 27018, 24086, 47843, 30961, 42348, 7267]])
# len(out)=1
decoded_text = tokenizer.decode(out.squeeze(0).tolist())
print(decoded_text)
# Hello, I am Featureiman Byeswickattribute argue
5 Pretraining on unlabeled data
5.1 Evaluating generative text models
5.1.1 Using GPT to generate text
import torch
from build_a_large_language_model_from_scratch.lib.GPTModel import GPTModel
GPT_CONFIG_124M = {
"vocab_size": 50257,
"context_length": 256, # intentionally shortening it to reduce computational demands of training
"emb_dim": 768,
"n_heads": 12,
"n_layers": 12,
"drop_rate": 0.1, # it's possible and common to set dropout to 0
"qkv_bias": False
}
cfg = GPT_CONFIG_124M
model = GPTModel(
context_length=cfg["context_length"],
drop_rate=cfg["drop_rate"],
emb_dim=cfg["emb_dim"],
n_heads=cfg["n_heads"],
n_layers=cfg["n_layers"],
vocab_size=cfg["vocab_size"],
qkv_bias=cfg["qkv_bias"]
)
torch.manual_seed(123)
model.eval()
import tiktoken
from build_a_large_language_model_from_scratch.lib.generate import generate_text_simple
def text_to_token_ids(text: str, tokenizer):
encoded = tokenizer.encode(text, allowed_special={'<|endoftext|>'})
encoded_tensor = torch.tensor(encoded).unsqueeze(0) # `unsqueeze(0)` adds batch dimension
return encoded_tensor
def token_ids_to_text(token_ids, tokenizer):
flat = token_ids.squeeze(0)
return tokenizer.decode(flat.tolist())
start_context = "Every effort moves you"
tokenizer = tiktoken.get_encoding("gpt2")
token_ids = generate_text_simple(
model=model,
idx=text_to_token_ids(start_context, tokenizer),
max_new_tokens=10,
context_size=cfg["context_length"]
)
print(f"Output text:\n{token_ids_to_text(token_ids, tokenizer)}")
# Output text:
# Every effort moves you rentingetic wasnم refres RexMeCHicular stren
5.1.2 Calculating the text generation loss
The steps for getting the next token:
- Tokenize inputs (input is a sentence), translate to token IDs via the vocabulary
- Apply model on tokens IDs to get logits, then apply softmax to get a probability distribution over the vocabulary (outputs a probability of each token in the vocabulary)
- Find the most likely next token by taking the argmax of the probabilities; the index is the token ID (greedy decoding)
- Use the inverse map (ID → token) to get the predicted sentence
During model training, we want to optimize the model’s parameters such that it assigns the highest probability to the corresponding target token. So after we’ve used softmax on the logits to get a probability distribution over the vocabulary, the idea is that the highest probability value in the vector should be at the index for the token that is denoted in the target vector as the next word.
The loss we’ll be using is the negative average log probability.
Let’s illustrate by example. Say we have input and targets like this:
inputs = torch.tensor([
[16833, 3626, 6100], # "every effort moves"
[40, 1107, 58]]) # "I really like"
targets = torch.tensor([
[3626, 6100, 345], # " effort moves you"
[1107, 588, 11311] # " really like chocolate"
])
Then we can get the probabilities, which we use to get the predicted token ids.
with torch.no_grad():
logits = model(inputs)
probas = torch.softmax(logits, dim=-1)
print(f"{probas.shape=}")
# probas.shape=torch.Size([2, 3, 50257])
token_ids = torch.argmax(probas, dim=-1, keepdim=True)
print(f"{token_ids=}")
# token_ids=tensor([[[41498],
# [43024],
# [16685]],
#
# [[21934],
# [33733],
# [22443]]])
# Just to illustrate:
[tokenizer.decode(o) for o in token_ids.flatten(1).tolist()]
# [' scaff Retrieved Barbara', 'Sunday Positive sulf']
print(f"Targets batch 1: {token_ids_to_text(targets[0], tokenizer)}")
print(f"Outputs batch 1: {token_ids_to_text(token_ids[0].flatten(), tokenizer)}")
# Targets batch 1: effort moves you
# Outputs batch 1: scaff Retrieved Barbara
keepdim=True ensures the output tensor retains the same number of dimensions as the input tensor, even though the reduced dimension shrinks to size 1. In other words, it's the difference between a token_ids.shape of [2, 3, 1] and one of [2, 3].
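The log_probas computation below relies on the probabilities the model assigns to the target tokens, which weren't shown above; a minimal sketch of how they can be gathered from probas and targets:
# Probabilities the model assigns to the correct next tokens of each text sample
text_idx = 0
target_probas_1 = probas[text_idx, [0, 1, 2], targets[text_idx]]
text_idx = 1
target_probas_2 = probas[text_idx, [0, 1, 2], targets[text_idx]]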
Now we compute the loss – the negative average log probabilities:
log_probas = torch.log(torch.cat((target_probas_1, target_probas_2)))
print(log_probas)
# tensor([-10.4315, -9.5601, -10.9243, -10.9316, -10.3856, -11.3752])
avg_log_probas = torch.mean(log_probas)
print(avg_log_probas)
# tensor(-10.6014)
neg_avg_log_probas = avg_log_probas * -1
print(neg_avg_log_probas)
# tensor(10.6014)
The goal is to get the average log probability as close to 0 as possible during training.
In DL, common practice isn’t to push the avg log prob up to 0, but rather bring the neg avg log prob down to zero.
The term for turning the negative value, -10.6014, into 10.6014, is the Cross Entropy loss.
We can just use PyTorch’s Cross Entropy loss function—this does the same as the steps we did above:
# We don't have the right shapes for cross_entropy:
print(f"{logits.shape=}")
print(f"{targets.shape=}")
# logits.shape=torch.Size([2, 3, 50257])
# targets.shape=torch.Size([2, 3])
# Need to flatten for `cross_entropy` to work, so we combine them over the batch dimension:
logits_flat = logits.flatten(0, 1)
targets_flat = targets.flatten()
print(f"{logits_flat.shape}")
print(f"{targets_flat.shape}")
# torch.Size([6, 50257])
# torch.Size([6])
loss = torch.nn.functional.cross_entropy(logits_flat, targets_flat)
print(loss)
# tensor(10.6014)
Cross Entropy loss measures the difference between two Probability Distributions—typically the true distribution of labels (e.g. tokens in a dataset) & the predicted distribution (e.g. token probabilities by an LLM).
The cross_entropy function in ML frameworks computes this measure for discrete outcomes, which is equivalent to the negative average log probability of the target tokens given the model's generated token probabilities. That's what makes the two terms related and often used interchangeably in practice.
We often use the Perplexity measure alongside Cross Entropy loss to evaluate the performance of the models in tasks like language modeling.
Perplexity can provide a more interpretable way to understand the uncertainty of a model in predicting the next token in a sequence.
It's considered more interpretable than the raw loss value because it signifies the effective vocabulary size about which the model is uncertain at each step: a perplexity value of x signifies that the model is unsure about which among x tokens in the vocabulary to generate as the next token.
It measures how well the probability distribution predicted by the model matches the actual distribution of the words in the dataset. Lower scores mean the model predictions are closer to the actual distribution.
Calculate as:
perplexity = torch.exp(loss)
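For the loss computed above, this works out to roughly 40,000, meaning the model is about as uncertain as if it had to pick uniformly among ~40,000 vocabulary tokens:
print(torch.exp(torch.tensor(10.6014)))  # roughly 4.0e4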
5.1.3 Calculating the training and validation set losses
We’re using a batch size of 2 and a train/val split of 0.9.
We use a small batch size to reduce computational resource demand because we’re working with a small dataset.
Using batch sizes of 1024 or larger is not uncommon.
We’re also using the data loaders from chapter 2.
# val_loader is similar, omitting for brevity
train_loader = create_dataloader_v1(
train_data, # raw text
batch_size=2, # small batch size for reduced compute usage
max_length=cfg["context_length"],
stride=cfg["context_length"],
drop_last=True,
shuffle=True,
num_workers=0
)
def calc_loss_batch(input_batch, target_batch, model, device):
input_batch = input_batch.to(device)
target_batch = target_batch.to(device)
logits = model(input_batch)
loss = torch.nn.functional.cross_entropy(
logits.flatten(0, 1), target_batch.flatten()
)
return loss
def calc_loss_loader(data_loader, model, device, num_batches=None):
total_loss = 0.
if len(data_loader) == 0:
return float("nan")
elif num_batches is None:
num_batches = len(data_loader)
else:
num_batches = min(num_batches, len(data_loader))
for i, (input_batch, target_batch) in enumerate(data_loader):
if i < num_batches:
loss = calc_loss_batch(input_batch, target_batch, model, device)
total_loss += loss.item()
else:
break
return total_loss / num_batches
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
with torch.no_grad():
train_loss = calc_loss_loader(train_loader, model, device)
val_loss = calc_loss_loader(val_loader, model, device)
print("Training loss:", train_loss)
print("Validation loss:", val_loss)
# Training loss: 11.023475329081217
# Validation loss: 10.993597984313965
Loss approaches 0 if the model learns to generate the next tokens as they appear in the training and validation sets.
5.2 Training an LLM
Advanced techniques:
- Learning rate warmup
- Cosine annealing
- Gradient Clipping
We'll use the AdamW optimizer. It improves on how weight decay is applied, which aims to minimize model complexity and prevent overfitting by penalizing larger weights. This leads to more effective Regularization and better generalization.
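The advanced techniques listed above aren't part of the simple training loop that follows, but here's a hedged sketch of how they could be wired in with standard PyTorch utilities (the scheduler choices and numbers are illustrative, not the book's):
optimizer = torch.optim.AdamW(model.parameters(), lr=4e-4, weight_decay=0.1)

# Learning rate warmup for the first 20 steps, then cosine annealing
warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.1, total_iters=20)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1000)
scheduler = torch.optim.lr_scheduler.SequentialLR(optimizer, [warmup, cosine], milestones=[20])

# Inside the training loop, after loss.backward():
#   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
#   optimizer.step()
#   scheduler.step()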
from torch.utils.data import DataLoader
def train_model_simple(
model,
train_loader: DataLoader,
val_loader: DataLoader,
optimizer,
device,
num_epochs: int,
eval_freq: int,
eval_iter: int,
start_context: str,
tokenizer,
):
train_losses, val_losses, track_tokens_seen = [], [], []
tokens_seen, global_step = 0, -1
for epoch in range(num_epochs):
model.train()
for input_batch, target_batch in train_loader:
optimizer.zero_grad() # reset loss gradients from previous iteration
loss = calc_loss_batch(input_batch, target_batch, model, device)
loss.backward() # calculate loss gradients
optimizer.step() # update model weights using the loss gradients
tokens_seen += input_batch.numel()
global_step += 1
if global_step % eval_freq == 0:
train_loss, val_loss = evaluate_model(
model, train_loader, val_loader, device, eval_iter
)
train_losses.append(train_loss)
val_losses.append(val_loss)
track_tokens_seen.append(tokens_seen)
print(
f"Ep {epoch+1} (Step {global_step:06d}): "
f"Train loss {train_loss:.3f}, "
f"Val loss {val_loss:.3f}"
)
generate_and_print_sample(model, tokenizer, device, start_context)
return train_losses, val_losses, track_tokens_seen
def evaluate_model(model, train_loader: DataLoader, val_loader: DataLoader, device, eval_iter: int):
model.eval() # to disable dropout during evaluation
with torch.no_grad(): # to disable gradient tracking, it's not required (reduce computational overhead)
train_loss = calc_loss_loader(
train_loader, model, device, num_batches=eval_iter
)
val_loss = calc_loss_loader(
val_loader, model, device, num_batches=eval_iter
)
model.train()
return train_loss, val_loss
def generate_and_print_sample(model, tokenizer, device, start_context):
model.eval()
context_size = model.pos_emb.weight.shape[0]
encoded = text_to_token_ids(start_context, tokenizer).to(device)
with torch.no_grad():
token_ids = generate_text_simple(
model=model, idx=encoded,
max_new_tokens=50, context_size=context_size
)
decoded_text = token_ids_to_text(token_ids, tokenizer)
print(decoded_text.replace("\n", " "))
model.train()
And now for training:
torch.manual_seed(123)
model = GPTModel(
vocab_size=cfg["vocab_size"],
context_length=cfg["context_length"],
drop_rate=cfg["drop_rate"],
emb_dim=cfg["emb_dim"],
n_heads=cfg["n_heads"],
n_layers=cfg["n_layers"],
qkv_bias=cfg["qkv_bias"]
)
model.to(device)
optimizer = torch.optim.AdamW(
model.parameters(),
lr=0.0004, weight_decay=0.1
)
num_epochs = 10
train_losses, val_losses, tokens_seen = train_model_simple(
model, train_loader, val_loader, optimizer, device,
num_epochs=num_epochs, eval_freq=5, eval_iter=5,
start_context="Every effort moves you", tokenizer=tokenizer
)
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator
def plot_losses(epochs_seen, tokens_seen, train_losses, val_losses):
fig, ax1 = plt.subplots(figsize=(5, 3))
ax1.plot(epochs_seen, train_losses, label="Training loss")
ax1.plot(epochs_seen, val_losses, linestyle="-.", label="Validation loss")
ax1.set_xlabel("Epochs")
ax1.set_ylabel("Loss")
ax1.legend(loc="upper right")
ax1.xaxis.set_major_locator(MaxNLocator(integer=True))
ax2 = ax1.twiny() # creates a second x-axis that shares the same y-axis
ax2.plot(tokens_seen, train_losses, alpha=0) # invisible plot for aligning ticks
ax2.set_xlabel("Tokens seen")
fig.tight_layout()
plt.show()
epochs_tensor = torch.linspace(0, num_epochs, len(train_losses))
plot_losses(epochs_tensor, tokens_seen, train_losses, val_losses)
5.3 Decoding strategies to control randomness
We’ve previously used greedy decoding, meaning we always select the most probable token as the output.
This section details two alternative strategies: temperature scaling and top-k sampling.
5.3.1 Temperature scaling
This adds a probabilistic selection process to the next-token generation task.
Instead of using argmax, as we do in greedy decoding, we use a function that samples from a probability distribution (here, the probability scores the LLM generates for each vocabulary entry at each token generation step). We can use torch.multinomial instead of torch.argmax.
This function samples the next token in proportion to its probability score, so the highest-probability token is still the most likely pick, but it's no longer guaranteed.
We can further control the distribution and selection process via temperature scaling.
All that means is we divide the logits by a number greater than 0.
def softmax_with_temperature(logits, temperature):
scaled_logits = logits / temperature
return torch.softmax(scaled_logits, dim=0)
Temperatures > 1 result in more uniformly distributed token probabilities. This can lead to more variety, but also more nonsensical text, as other tokens are selected more often.
Temperatures < 1 will result in more confident (sharper or more peaky) distributions—the most likely token will have an even higher probability score.
A temperature of 1 is the same as not using any temperature scaling.
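A minimal usage sketch (the toy logits and the temperature value are made up for illustration):
torch.manual_seed(123)
toy_logits = torch.tensor([4.51, 0.89, -1.90, 6.75, 1.63])  # scores over a toy 5-token vocabulary

# Greedy decoding always picks the largest logit (index 3 here)
print(torch.argmax(toy_logits).item())

# Probabilistic sampling with temperature scaling: usually index 3, but occasionally another token
probas = softmax_with_temperature(toy_logits, temperature=1.4)
print(torch.multinomial(probas, num_samples=1).item())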
5.3.2 Top-k sampling
With temperature scaling, we may see diverse outputs, but these are sometimes grammatically incorrect or completely nonsensical.
Using top-k sampling with probabilistic sampling and temperature scaling, we can improve the text generation results.
The idea is to restrict the sampled tokens to the top-k most likely tokens and exclude all other tokens from the selection process by masking their probability scores.
- Select the top-k values from the logits
- Apply a -inf mask to everything else (softmax turns -inf into zero probability)
- Apply softmax, which assigns zero probability to the non-top-k positions, so the next token is always sampled from a top-k position
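The snippet below assumes a toy next_token_logits vector over a nine-token vocabulary; the values at the top-k positions match the outputs shown, while the remaining entries are just illustrative:
next_token_logits = torch.tensor(
    [4.51, 0.89, -1.90, 6.75, 1.63, -1.62, -1.89, 6.28, 1.79]
)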
top_k = 3
top_logits, top_pos = torch.topk(next_token_logits, top_k)
# top_logits=tensor([6.7500, 6.2800, 4.5100])
# top_pos=tensor([3, 7, 0])
new_logits = torch.where(
condition=next_token_logits < top_logits[-1],
input=torch.tensor(float('-inf')),
other=next_token_logits
)
# new_logits=tensor([4.5100, -inf, -inf, 6.7500, -inf, -inf, -inf, 6.2800, -inf])
topk_probas = torch.softmax(new_logits, dim=0)
# topk_probas=tensor([0.0615, 0.0000, 0.0000, 0.5775, 0.0000, 0.0000, 0.0000, 0.3610, 0.0000])
5.3.3 Modifying the text generation function
from typing import Optional
def generate(model, idx, max_new_tokens: int, context_size: int, temperature=0.0, top_k: Optional[int] = None, eos_id=None):
for _ in range(max_new_tokens):
idx_cond = idx[:, -context_size:]
with torch.no_grad():
logits = model(idx_cond)
# Get last token in current sequence
logits = logits[:, -1, :]
# top-k sampling
if top_k is not None:
top_logits, _ = torch.topk(logits, top_k)
min_val = top_logits[:, -1]
logits = torch.where(
condition=logits < min_val,
input=torch.tensor(float('-inf')).to(logits.device),
other=logits
)
if temperature > 0.0:
# temperature scaling
logits = logits / temperature
probs = torch.softmax(logits, dim=-1)
idx_next = torch.multinomial(probs, num_samples=1)
else:
# greedy decoding
idx_next = torch.argmax(logits, dim=-1, keepdim=True)
# check if we've reached end-of-sequence
if idx_next == eos_id:
break
# append generated token to current sequence for further generation
idx = torch.cat((idx, idx_next), dim=1)
return idx
5.4 Loading and saving model weights in PyTorch
The recommended way to save a PyTorch model is by saving the model's state_dict:
torch.save(model.state_dict(), "model.pth")
The state_dict maps each layer to its parameters. The .pth extension is the convention for PyTorch model files.
Loading the model weights:
model = GPTModel(...)
model.load_state_dict(torch.load("model.pth", map_location=device))
model.eval() # don't want dropout during inference
If we wish to continue pretraining later, it's also recommended to save the optimizer state.
Adaptive optimizers like AdamW store additional parameters for each model weight. Without these, the optimizer resets, potentially ruining training.
# Save
torch.save(
{
"model_state_dict": model.state_dict(),
"optimizer_state_dict": optimizer.state_dict(),
},
"model_and_optimizer.pth",
)
# Load
checkpoint = torch.load("model_and_optimizer.pth", map_location=device)
model = GPTModel(
vocab_size=cfg["vocab_size"],
context_length=cfg["context_length"],
drop_rate=cfg["drop_rate"],
emb_dim=cfg["emb_dim"],
n_heads=cfg["n_heads"],
n_layers=cfg["n_layers"],
qkv_bias=cfg["qkv_bias"]
)
model.load_state_dict(checkpoint["model_state_dict"])
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.1)
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
model.train()
5.5 Loading pretrained weights from OpenAI
My implementation differs slightly from the book's, as I've used a ModuleList for the components of my MHA.
The author provided a script to download and load the gpt2 weights from OpenAI.
from build_a_large_language_model_from_scratch.gpt_download import download_and_load_gpt2
settings, params = download_and_load_gpt2(
model_size="124M", models_dir="gpt2"
)
model_configs = {
"gpt2-small (124M)": {"emb_dim": 768, "n_layers": 12, "n_heads": 12},
# ...
}
model_name = "gpt2-small (124M)"
NEW_CONFIG = GPT_CONFIG_124M.copy()
NEW_CONFIG.update(model_configs[model_name])
# Because we modified it earlier
NEW_CONFIG.update({"context_length": 1024, "qkv_bias": True})
gpt = GPTModel(...)
gpt.eval()
Now we need this utility function to make loading weights into the model easier:
def assign(left, right):
"""Assigns values from right tensor to left tensor after shape validation.
Args:
left: Target PyTorch tensor/parameter
right: Source tensor/array to copy values from
Returns:
torch.nn.Parameter: New parameter containing values from right tensor
Raises:
ValueError: If shapes of left and right tensors don't match
"""
if left.shape != right.shape:
raise ValueError(f"Shape mismatch. Left: {left.shape}, Right: {right.shape}")
return torch.nn.Parameter(torch.tensor(right))
And the code to load the weights:
import numpy as np
def load_weights_into_gpt(gpt, params):
gpt.pos_emb.weight = assign(gpt.pos_emb.weight, params["wpe"])
gpt.tok_emb.weight = assign(gpt.tok_emb.weight, params["wte"])
# iterate over transformer blocks
for b in range(len(params["blocks"])):
# split is used to divide attention and bias weights into three equal parts for the qkv components
# load attention qkv weights
q_w, k_w, v_w = np.split(
(params["blocks"][b]["attn"]["c_attn"])["w"], 3, axis=-1
)
gpt.trf_blocks[b].layers[0][1].W_query.weight = assign(
gpt.trf_blocks[b].layers[0][1].W_query.weight, q_w.T
)
gpt.trf_blocks[b].layers[0][1].W_key.weight = assign(
gpt.trf_blocks[b].layers[0][1].W_key.weight, k_w.T
)
gpt.trf_blocks[b].layers[0][1].W_value.weight = assign(
gpt.trf_blocks[b].layers[0][1].W_value.weight, v_w.T
)
# load attn qkv bias
q_b, k_b, v_b = np.split(
(params["blocks"][b]["attn"]["c_attn"])["b"], 3, axis=-1
)
gpt.trf_blocks[b].layers[0][1].W_query.bias = assign(
gpt.trf_blocks[b].layers[0][1].W_query.bias, q_b
)
gpt.trf_blocks[b].layers[0][1].W_key.bias = assign(
gpt.trf_blocks[b].layers[0][1].W_key.bias, k_b
)
gpt.trf_blocks[b].layers[0][1].W_value.bias = assign(
gpt.trf_blocks[b].layers[0][1].W_value.bias, v_b
)
# load attn linear projection weights
gpt.trf_blocks[b].layers[0][1].out_proj.weight = assign(
gpt.trf_blocks[b].layers[0][1].out_proj.weight,
params["blocks"][b]["attn"]["c_proj"]["w"].T,
)
gpt.trf_blocks[b].layers[0][1].out_proj.bias = assign(
gpt.trf_blocks[b].layers[0][1].out_proj.bias,
params["blocks"][b]["attn"]["c_proj"]["b"],
)
# load feedforward network weights and biases
gpt.trf_blocks[b].layers[1][1].layers[0].weight = assign(
gpt.trf_blocks[b].layers[1][1].layers[0].weight,
params["blocks"][b]["mlp"]["c_fc"]["w"].T,
)
gpt.trf_blocks[b].layers[1][1].layers[0].bias = assign(
gpt.trf_blocks[b].layers[1][1].layers[0].bias,
params["blocks"][b]["mlp"]["c_fc"]["b"]
)
gpt.trf_blocks[b].layers[1][1].layers[2].weight = assign(
gpt.trf_blocks[b].layers[1][1].layers[2].weight,
params["blocks"][b]["mlp"]["c_proj"]["w"].T,
)
gpt.trf_blocks[b].layers[1][1].layers[2].bias = assign(
gpt.trf_blocks[b].layers[1][1].layers[2].bias,
params["blocks"][b]["mlp"]["c_proj"]["b"],
)
# load layer norm params
gpt.trf_blocks[b].layers[0][0].scale = assign(
gpt.trf_blocks[b].layers[0][0].scale,
params["blocks"][b]["ln_1"]["g"]
)
gpt.trf_blocks[b].layers[0][0].shift = assign(
gpt.trf_blocks[b].layers[0][0].shift,
params["blocks"][b]["ln_1"]["b"]
)
gpt.trf_blocks[b].layers[1][0].scale = assign(
gpt.trf_blocks[b].layers[1][0].scale,
params["blocks"][b]["ln_2"]["g"]
)
gpt.trf_blocks[b].layers[1][0].shift = assign(
gpt.trf_blocks[b].layers[1][0].shift,
params["blocks"][b]["ln_2"]["b"]
)
gpt.final_norm.scale = assign(gpt.final_norm.scale, params["g"])
gpt.final_norm.shift = assign(gpt.final_norm.shift, params["b"])
# Original GPT-2 model reused the token embedding weights to reduce the total number of params (weight tying)
gpt.out_head.weight = assign(gpt.out_head.weight, params["wte"])
As mentioned, my code is slightly different from the book's, as I've used a ModuleList in my MHA implementation.
Now we can load and test!
load_weights_into_gpt(gpt, params)
gpt.to(device)
torch.manual_seed(123)
token_ids = generate(
model=gpt,
idx=text_to_token_ids("Every effort moves you", tokenizer).to(device),
max_new_tokens=25,
context_size=NEW_CONFIG["context_length"],
top_k=50,
temperature=1.5
)
print("Output text:\n", token_ids_to_text(token_ids, tokenizer))
# Output text:
# Every effort moves you as far as the hand can go until the end of your turn unless something interrupts your control flow. As you may observe I
6 Fine-tuning for classification
6.1 Different categories of fine-tuning
Language models are most commonly either instruction fine-tuned or classification fine-tuned.
Instruction fine-tuning means you train the model on a set of tasks using specific instructions to improve its ability to understand and execute tasks described by natural language prompts.
Classification fine-tuning means you train the model to recognize a specific set of class labels (e.g. “spam”, “not spam”).
6.2 Preparing the dataset
Here we downloaded the dataset. It's a set with 2 columns: Text and Label. Since it's unbalanced, we balance it by ensuring there's an equal number of samples from each class. Then we do a random split into train/val/test.
6.3 Creating data loaders
We can't use the sliding-window approach from before, so we'll have to either shorten each sample to match the length of the shortest one or pad all samples to match the length of the longest. This can be done per batch or across the whole dataset.
Padding is better, so we don't lose information.
import torch
from torch.utils.data import Dataset
import pandas as pd
import tiktoken
tokenizer = tiktoken.get_encoding("gpt2")
print(tokenizer.encode("<|endoftext|>", allowed_special={"<|endoftext|>"}))
# [50256]
class SpamDataset(Dataset):
def __init__(self, csv_file, tokenizer, max_length=None, pad_token_id=50256):
self.data = pd.read_csv(csv_file)
# Tokenize
self.encoded_texts = [
tokenizer.encode(text) for text in self.data["Text"]
]
if max_length is None:
self.max_length = self._longest_encoded_length()
else:
self.max_length = max_length
# Truncate any sentence longer than `max_length`
self.encoded_texts = [
encoded_text[:self.max_length]
for encoded_text in self.encoded_texts
]
# Pad
self.encoded_texts = [
encoded_text + [pad_token_id] *
(self.max_length - len(encoded_text))
for encoded_text in self.encoded_texts
]
def __getitem__(self, index):
encoded = self.encoded_texts[index]
label = self.data.iloc[index]["Label"]
return (
torch.tensor(encoded, dtype=torch.long),
torch.tensor(label, dtype=torch.long)
)
def __len__(self):
return len(self.data)
def _longest_encoded_length(self):
max_length = 0
for encoded_text in self.encoded_texts:
encoded_length = len(encoded_text)
if encoded_length > max_length:
max_length = encoded_length
return max_length
Then we create some DataLoaders.
6.4 Initializing a model with pretrained weights
Just following the same procedure as the previous chapter.
text_2 = (
"Is the following text 'spam'? Answer with 'yes' or 'no':"
" 'You are a winner you have been specially"
" selected to receive $1000 cash or a $2000 award.'"
)
token_ids = generate_text_simple(
model=model,
idx=text_to_token_ids(text_2, tokenizer),
max_new_tokens=23,
context_size=BASE_CONFIG["context_length"]
)
print(token_ids_to_text(token_ids, tokenizer))
As we can see, the model isn’t capable of classifying (yet!).
6.5 Adding a classification head
We modify the pretrained LLM for classification fine-tuning by replacing the original output layer, which maps the hidden representation to a vocabulary of 50257 tokens, with a smaller output layer that outputs to our target classes.
It’s possible to use a single output node, since we’re doing binary classification, but it’d require modifying the loss function. So we take the approach where the number of output nodes matches the number of classes.
When fine-tuning from a pretrained model, it isn’t necessary to fine-tune all model layers. The lower layers in neural nets generally capture basic language structures and semantics applicable across a wide range of tasks, so fine-tuning only the last layers that deal with more nuanced linguistic patterns and task-specific features is fine.
Training just the output layer can be sufficient, but fine-tuning additional layers near the output can lead to improved predictive performance.
# Freeze the model
for param in model.parameters():
param.requires_grad = False
# Replace the out head:
torch.manual_seed(123)
num_classes = 2
model.out_head = torch.nn.Linear(
in_features=BASE_CONFIG["emb_dim"],
out_features=num_classes
)
requires_grad is already set to True on the new layer. We'll also set it on the final transformer block and the final layer norm:
for param in model.trf_blocks[-1].parameters():
param.requires_grad = True
for param in model.final_norm.parameters():
param.requires_grad = True
The model still works, despite us having replaced the out_head:
inputs = tokenizer.encode("Do you have time")
inputs = torch.tensor(inputs).unsqueeze(0)
print("Inputs:", inputs)
print("Inputs dimensions:", inputs.shape)
# Inputs: tensor([[5211, 345, 423, 640]])
# Inputs dimensions: torch.Size([1, 4])
torch.Size([1, 4]) is essentially the batch_size and num_tokens of the input: one sample in the batch, consisting of four tokens.
with torch.no_grad():
outputs = model(inputs)
print("Outputs:\n", outputs)
print("Outputs dimensions:", outputs.shape)
# Outputs:
# tensor([[[-1.5854, 0.9904],
# [-3.7235, 7.4548],
# [-2.2661, 6.6049],
# [-3.5983, 3.9902]]])
# Outputs dimensions: torch.Size([1, 4, 2])
Previously this would have produced an output tensor of shape [1, 4, 50257], where 50257 represents the vocabulary size. The number of output rows corresponds to the number of input tokens (4), but each output's embedding dimension (number of columns) is now 2 instead of 50257.
Since we’re interested in fine-tuning the model to return a class label, we don’t need to fine-tune all four output rows. We just focus on the last row corresponding to the last token.
print("Last output token:", outputs[:, -1, :])
# Last output token: tensor([[-3.5983, 3.9902]])
The reason we're only interested in the last token is that, because of the causal attention mask, it's the only token that attends to all other tokens; the earlier tokens only attend to their predecessors.
6.6 Calculating the classification loss and accuracy
We don’t even need to use softmax to get the probability of an input being spam/ham, as the largest logit will be the predicted class anyway (as we use argmax).
def calc_accuracy_loader(data_loader, model, device, num_batches=None):
model.eval()
correct_predictions, num_examples = 0, 0
if num_batches is None:
num_batches = len(data_loader)
else:
num_batches = min(num_batches, len(data_loader))
for i, (input_batch, target_batch) in enumerate(data_loader):
if i < num_batches:
input_batch = input_batch.to(device)
target_batch = target_batch.to(device)
with torch.no_grad():
logits = model(input_batch)[:, -1, :]
predicted_labels = torch.argmax(logits, dim=-1)
num_examples += predicted_labels.shape[0]
correct_predictions += (
(predicted_labels == target_batch).sum().item()
)
else:
break
return correct_predictions / num_examples
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
torch.manual_seed(123)
train_accuracy = calc_accuracy_loader(
train_loader, model, device, num_batches=10
)
val_accuracy = calc_accuracy_loader(
val_loader, model, device, num_batches=10
)
test_accuracy = calc_accuracy_loader(
test_loader, model, device, num_batches=10
)
print(f"Training accuracy: {train_accuracy*100:.2f}%")
print(f"Validation accuracy: {val_accuracy*100:.2f}%")
print(f"Test accuracy: {test_accuracy*100:.2f}%")
# Training accuracy: 46.25%
# Validation accuracy: 45.00%
# Test accuracy: 48.75%
The resulting prediction accuracies are near a random prediction.
To improve, we’ll need to fine-tune. And to do that, we need to define the loss function we’ll optimize.
The objective is to maximize the spam classification accuracy of the model. Classification accuracy is not a differentiable function, so we’ll use cross-entropy loss as a proxy to maximize accuracy.
We’ll use much the same function as previously, except now we only focus on the loss for the last token.
def calc_loss_batch(input_batch, target_batch, model, device):
input_batch = input_batch.to(device)
target_batch = target_batch.to(device)
logits = model(input_batch)[:, -1, :]
loss = torch.nn.functional.cross_entropy(logits, target_batch)
return loss
def calc_loss_loader(data_loader, model, device, num_batches=None):
total_loss = 0
if len(data_loader) == 0:
return float("nan")
elif num_batches is None:
num_batches = len(data_loader)
else:
num_batches = min(num_batches, len(data_loader))
for i, (input_batch, target_batch) in enumerate(data_loader):
if i < num_batches:
loss = calc_loss_batch(
input_batch, target_batch, model, device
)
total_loss += loss.item()
else:
break
return total_loss / num_batches
# Computing the initial loss for each dataset
with torch.no_grad():
train_loss = calc_loss_loader(train_loader, model, device, num_batches=5)
val_loss = calc_loss_loader(val_loader, model, device, num_batches=5)
test_loss = calc_loss_loader(test_loader, model, device, num_batches=5)
print(f"Training loss: {train_loss:.3f}")
print(f"Validation loss: {val_loss:.3f}")
print(f"Test loss: {test_loss:.3f}")
# Training loss: 2.453
# Validation loss: 2.583
# Test loss: 2.322
6.7 Fine-tuning the model on supervised data
def train_classifier_simple(model, train_loader, val_loader, optimizer, device, num_epochs, eval_freq, eval_iter):
train_losses, val_losses, train_accs, val_accs = [], [], [], []
examples_seen, global_step = 0, -1
for epoch in range(num_epochs):
model.train()
for input_batch, target_batch in train_loader:
optimizer.zero_grad()
loss = calc_loss_batch(
input_batch, target_batch, model, device
)
loss.backward()
optimizer.step()
examples_seen += input_batch.shape[0]
global_step += 1
if global_step % eval_freq == 0:
train_loss, val_loss = evaluate_model(
model, train_loader, val_loader, device, eval_iter
)
train_losses.append(train_loss)
val_losses.append(val_loss)
print(f"Ep {epoch+1} (Step {global_step:06d}): Train loss {train_loss:.3f}, Val loss {val_loss:.3f}")
train_accuracy = calc_accuracy_loader(
train_loader, model, device, num_batches=eval_iter
)
val_accuracy = calc_accuracy_loader(val_loader, model, device, num_batches=eval_iter)
print(f"Training accuracy: {train_accuracy*100:.2f}% | ", end="")
print(f"Validation accuracy: {val_accuracy*100:.2f}% | ")
train_accs.append(train_accuracy)
val_accs.append(val_accuracy)
return train_losses, val_losses, train_accs, val_accs, examples_seen
def evaluate_model(model, train_loader, val_loader, device, eval_iter):
model.eval()
with torch.no_grad():
train_loss = calc_loss_loader(
train_loader, model, device, num_batches=eval_iter
)
val_loss = calc_loss_loader(
val_loader, model, device, num_batches=eval_iter
)
model.train()
return train_loss, val_loss
import time
start_time = time.time()
torch.manual_seed(123)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.1)
num_epochs = 5
train_losses, val_losses, train_accs, val_accs, examples_seen = \
train_classifier_simple(
model, train_loader, val_loader, optimizer, device,
num_epochs=num_epochs, eval_freq=50, eval_iter=5
)
end_time = time.time()
execution_time_minutes = (end_time - start_time) / 60
print(f"Training competed in {execution_time_minutes:.2f} minutes.")
# Ep 1 (Step 000000): Train loss 2.153, Val loss 2.392
# Ep 1 (Step 000050): Train loss 0.617, Val loss 0.637
# Ep 1 (Step 000100): Train loss 0.523, Val loss 0.557
# Training accuracy: 70.00% | Validation accuracy: 72.50% |
# Ep 2 (Step 000150): Train loss 0.561, Val loss 0.489
# Ep 2 (Step 000200): Train loss 0.419, Val loss 0.397
# Ep 2 (Step 000250): Train loss 0.409, Val loss 0.353
# Training accuracy: 82.50% | Validation accuracy: 85.00% |
# Ep 3 (Step 000300): Train loss 0.333, Val loss 0.320
# Ep 3 (Step 000350): Train loss 0.340, Val loss 0.306
# Training accuracy: 90.00% | Validation accuracy: 90.00% |
# Ep 4 (Step 000400): Train loss 0.136, Val loss 0.200
# Ep 4 (Step 000450): Train loss 0.153, Val loss 0.132
# Ep 4 (Step 000500): Train loss 0.222, Val loss 0.137
# Training accuracy: 100.00% | Validation accuracy: 97.50% |
# Ep 5 (Step 000550): Train loss 0.207, Val loss 0.143
# Ep 5 (Step 000600): Train loss 0.083, Val loss 0.074
# Training accuracy: 100.00% | Validation accuracy: 97.50% |
# Training completed in 0.61 minutes.
Some extra code to plot:
import matplotlib.pyplot as plt
def plot_values(epochs_seen, examples_seen, train_values, val_values, label="loss"):
fig, ax1 = plt.subplots(figsize=(5,3))
ax1.plot(epochs_seen, train_values, label=f"Training {label}")
ax1.plot(epochs_seen, val_values, linestyle="-.", label=f"Validation {label}")
ax1.set_xlabel("Epochs")
ax1.set_ylabel(label.capitalize())
ax1.legend()
ax2 = ax1.twiny()
ax2.plot(examples_seen, train_values, alpha=0)
ax2.set_xlabel("Examples seen")
fig.tight_layout()
# plt.savefig(f"{label}-plot.pdf")
plt.show()
# Loss curves
epochs_tensor = torch.linspace(0, num_epochs, len(train_losses))
examples_seen_tensor = torch.linspace(0, examples_seen, len(train_losses))
plot_values(epochs_tensor, examples_seen_tensor, train_losses, val_losses)
# Classification accuracies
epochs_tensor = torch.linspace(0, num_epochs, len(train_accs))
examples_seen_tensor = torch.linspace(0, examples_seen, len(train_accs))
plot_values(
epochs_tensor, examples_seen_tensor, train_accs, val_accs,
label="accuracy"
)
And the final metrics:
train_accuracy = calc_accuracy_loader(train_loader, model, device)
val_accuracy = calc_accuracy_loader(val_loader, model, device)
test_accuracy = calc_accuracy_loader(test_loader, model, device)
print(f"Training accuracy: {train_accuracy*100:.2f}%")
print(f"Validation accuracy: {val_accuracy*100:.2f}%")
print(f"Test accuracy: {test_accuracy*100:.2f}%")
# Training accuracy: 97.21%
# Validation accuracy: 97.32%
# Test accuracy: 95.67%
6.8 Using the LLM as a spam classifier
def classify_review(
text, model, tokenizer, device, max_length: int = 128,
pad_token_id: int = 50256
):
model.eval()
input_ids = tokenizer.encode(text)
supported_context_length = model.pos_emb.weight.shape[0]  # pos_emb weight shape is (context_length, emb_dim)
# Handle max_length properly by using a default if None
effective_max_length = max_length if max_length is not None else supported_context_length
input_ids = input_ids[:min(effective_max_length, supported_context_length)]
# Calculate padding length only if we have a valid max_length
padding_length = effective_max_length - len(input_ids)
input_ids += [pad_token_id] * padding_length
input_tensor = torch.tensor(input_ids, device=device).unsqueeze(0)
with torch.no_grad():
logits = model(input_tensor)[:, -1, :]
predicted_label = torch.argmax(logits, dim=-1).item()
return "spam" if predicted_label == 1 else "not spam"
text_1 = (
"You are a winner you have been specifically"
" selected to receive $1000 cash or a $2000 award."
)
print(classify_review(
text_1, model, tokenizer, device, max_length=train_dataset.max_length
))
# spam
text_2 = (
"Hey, just wanted to check if we're still on"
" for dinner tonight? Let me know!"
)
print(classify_review(
text_2, model, tokenizer, device, max_length=train_dataset.max_length
))
# not spam
7 Fine-tuning to follow instructions
7.2 Preparing a dataset for supervised instruction fine-tuning
import torch
from torch.utils.data import Dataset
class InstructionDataset(Dataset):
def __init__(self, data, tokenizer):
self.data = data
self.encoded_texts = []
for entry in data:
instruction_plus_input = format_input(entry)
response_text = f"\n\n### Response:\n{entry['output']}"
full_text = instruction_plus_input + response_text
self.encoded_texts.append(
tokenizer.encode(full_text)
)
def __getitem__(self, index):
return self.encoded_texts[index]
def __len__(self):
return len(self.data)
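The InstructionDataset above (and the code that follows) relies on a format_input helper that isn't reproduced in these notes; a sketch consistent with the Alpaca-style prompts shown in the outputs later:
def format_input(entry):
    # Alpaca-style prompt: instruction header plus an optional input section
    instruction_text = (
        f"Below is an instruction that describes a task. "
        f"Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )
    input_text = f"\n\n### Input:\n{entry['input']}" if entry["input"] else ""
    return instruction_text + input_text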
We'll also use the <|endoftext|> token here (ID: 50256).
We’ll build a custom collate function which pads the training examples such that they have the same length, while allowing different batches to have different lengths.
We don’t need to match the longest sequence in the entire dataset, just in the batch.
Then we create a list of target token IDs, which are the inputs shifted by 1, plus an additional padding token.
We assign a -100 placeholder value to all padding tokens, which allows us to exclude them from contributing to the training loss calculation.
This means only meaningful data will influence model learning.
The reason we use -100 is that PyTorch's cross_entropy function treats it as the ignore index by default, so any target labeled with -100 is ignored.
We're using that to ignore all but the first 50256 (end-of-text) token ID in the targets, which helps the LLM learn to generate end-of-text tokens, used to indicate that an answer is complete.
Since we’re padding all items in the batch to be of the same length, the LLM may be forced to output values after ‘finishing’ the sentence. We don’t want that non-meaningful output to count when computing the loss. We don’t want the model to learn incorrectly from noise or from forced tokens after the natural end of a sequence.
def custom_collate_fn(batch, pad_token_id=50256, ignore_index=-100, allowed_max_length=None, device="cpu"):
batch_max_length = max(len(item)+1 for item in batch)
inputs_lst, targets_lst = [], []
for i, item in enumerate(batch):
new_item = item.copy()
new_item += [pad_token_id]
padded = (
new_item + [pad_token_id] *
(batch_max_length - len(new_item))
)
inputs = torch.tensor(padded[:-1])
targets = torch.tensor(padded[1:])
mask = targets == pad_token_id
indices = torch.nonzero(mask).squeeze()
if indices.numel() > 1:
targets[indices[1:]] = ignore_index
if allowed_max_length is not None:
inputs = inputs[:allowed_max_length]
targets = targets[:allowed_max_length]
inputs_lst.append(inputs)
targets_lst.append(targets)
inputs_tensor = torch.stack(inputs_lst).to(device)
targets_tensor = torch.stack(targets_lst).to(device)
return inputs_tensor, targets_tensor
custom_collate_fn([[1,2,3,4,5], [1,2,3]])
Let's walk through the processing of the batch inputs shown above.
The batch size is 2, and the batch max length is max(len(item) + 1) = 6.
Processing the longest input isn't super exciting, so let's use the shorter item, [1, 2, 3].
We start by adding an initial padding token: [1, 2, 3, 50256].
And then we pad until it's the same size as the largest item in the batch: [1, 2, 3, 50256, 50256, 50256].
From here, we generate the input and target tensors. The input tensor will be all but the last item in the padded sequence: [1, 2, 3, 50256, 50256].
And the target is all but the first item in the padded sequence: [2, 3, 50256, 50256, 50256].
Now we'll mask. The padding mask has True values where the padding token appears in the targets tensor. We'll assign -100 at those indices.
Then we find the corresponding indices: [2, 3, 4].
We found 3 padding tokens; masking all but the first gives targets of [2, 3, 50256, -100, -100].
The final output is inputs of tensor([[1, 2, 3, 4, 5], [1, 2, 3, 50256, 50256]]) and targets of tensor([[2, 3, 4, 5, 50256], [2, 3, 50256, -100, -100]]).
Besides masking padding tokens, it’s also common to mask target token IDs that correspond to the instructions. Doing so means the Cross Entropy loss is only computed for the generated response IDs, so the model is trained to focus on generating accurate responses rather than memorizing instructions, which can help prevent overfitting.
However, it isn’t clear yet whether masking the instructions is universally beneficial during instruction fine-tuning.
7.4 Creating data loaders for an instruction dataset
from functools import partial
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
customized_collate_fn = partial(
custom_collate_fn,
device=device,
allowed_max_length=1024 # max context length supported by GPT-2
)
from torch.utils.data import DataLoader
import tiktoken
num_workers = 0
batch_size = 8
torch.manual_seed(123)
tokenizer = tiktoken.get_encoding("gpt2")
train_dataset = InstructionDataset(train_data, tokenizer)
train_loader = DataLoader(
train_dataset,
batch_size=batch_size,
collate_fn=customized_collate_fn,
shuffle=True,
drop_last=True,
num_workers=num_workers
)
val_dataset = InstructionDataset(val_data, tokenizer)
val_loader = DataLoader(
val_dataset,
batch_size=batch_size,
collate_fn=customized_collate_fn,
shuffle=False,
drop_last=False,
num_workers=num_workers
)
test_dataset = InstructionDataset(test_data, tokenizer)
test_loader = DataLoader(
test_dataset,
batch_size=batch_size,
collate_fn=customized_collate_fn,
shuffle=False,
drop_last=False,
num_workers=num_workers
)
7.5 Loading a pretrained LLM
We’ll load the 355m parameter model, as the 124m parameter model is too limited in capacity to achieve good results for instruction fine-tuning.
This is done in the exact same way as previously, except we load the gpt2-medium (355M)
model.
Let’s assess the model’s existing capabilities.
torch.manual_seed(123)
input_text = format_input(val_data[0])
print(input_text)
"""
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
Convert the active sentence to passive: 'The chef cooks the meal every day.'
"""
from build_a_large_language_model_from_scratch.lib.generate import text_to_token_ids, token_ids_to_text, generate
token_ids = generate(
model=model,
idx=text_to_token_ids(input_text, tokenizer),
max_new_tokens=35,
context_size=BASE_CONFIG["context_length"],
eos_id=50256
)
generated_text = token_ids_to_text(token_ids, tokenizer)
print(generated_text)
"""
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
Convert the active sentence to passive: 'The chef cooks the meal every day.'
### Response:
The chef cooks the meal every day.
### Instruction:
Convert the active sentence to passive: 'The chef cooks the
"""
response_text = generated_text[len(input_text):].strip()
print(response_text)
"""
### Response:
The chef cooks the meal every day.
### Instruction:
Convert the active sentence to passive: 'The chef cooks the
"""
7.6 Fine-tuning the LLM on instruction data
Similar to before.
import time
from build_a_large_language_model_from_scratch.lib.train import train_model_simple
start_time = time.time()
torch.manual_seed(123)
optimizer = torch.optim.AdamW(model.parameters(), lr=0.00005, weight_decay=0.1)
num_epochs = 2
model.to(device)
train_losses, val_losses, tokens_seen = train_model_simple(
model, train_loader, val_loader, optimizer, device,
num_epochs=num_epochs, eval_freq=5, eval_iter=5,
start_context=format_input(val_data[0]), tokenizer=tokenizer
)
end_time = time.time()
exec_time_m = (end_time - start_time) / 60
print(f"Training completed in {exec_time_m:.2f} minutes.")
7.7 Extracting and saving responses
torch.manual_seed(123)
for entry in test_data[:3]:
input_text = format_input(entry)
token_ids = generate(
model=model,
idx=text_to_token_ids(input_text, tokenizer).to(device),
max_new_tokens=256,
context_size=BASE_CONFIG["context_length"],
eos_id=50256
)
generated_text = token_ids_to_text(token_ids, tokenizer)
response_text = (
generated_text[len(input_text):]
.replace("### Response:", "")
.strip()
)
print(input_text)
print(f"\nCorrect response:\n>> {entry["output"]}")
print(f"\nModel response:\n>> {response_text.strip()}")
print("-"*20)
This prints the outputs as you might expect.
For example:
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
Rewrite the sentence using a simile.
### Input:
The car is very fast.
Correct response:
>> The car is as fast as lightning.
Model response:
>> The car is as fast as a bullet.
As evident, evaluating this kind of answer at scale isn’t as easy as measuring the percentage of correct spam/ham class labels to get the classification’s accuracy.
Some ideas:
- Multiple choice ([2009.03300] Measuring Massive Multitask Language Understanding)
- Human evaluation (e.g. LMSYS arena)
- Automated evaluation (e.g. via another language model)
7.8 Evaluating the fine-tuned LLM
This chapter uses Llama 3 (by Meta AI) to evaluate the test-set responses from our fine-tuned model.
7.9 Conclusions
After instruction fine-tuning, you can optionally do preference fine-tuning. This is useful to customize the model to better align with specific user preferences.
Appendix E Parameter-efficient fine-tuning with LoRA
Low-rank adaptation (LoRA) is a widely used technique for parameter-efficient fine-tuning.
Source: [2106.09685] LoRA: Low-Rank Adaptation of Large Language Models
Since models are becoming larger, it has become increasingly infeasible to perform full fine-tunes of them.
LoRA greatly reduces the number of trainable parameters while matching, or sometimes improving, the quality of a full fine-tune across various models.
It works by freezing the pre-trained weights and injecting trainable rank decomposition matrices into each layer.
“Low-rank” refers to the mathematical concept of limiting adjustments to a smaller dimensional subspace of the total weight parameter space.
This captures the most influential directions of the weight parameter changes during training.
Since we can reuse the pretrained weights and apply the LoRA matrices dynamically after training, LoRA enables model customization without storing multiple complete versions of an LLM (less storage use, improved scalability), as only the smaller LoRA matrices need to be stored and adjusted.
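A quick back-of-the-envelope example of the savings (the dimensions and rank are illustrative, GPT-2-sized):
d_in = d_out = 768
rank = 16
full_linear = d_in * d_out            # 589,824 trainable weights in a full Linear layer
lora_update = rank * (d_in + d_out)   # 24,576 trainable weights in the A and B matrices
print(full_linear, lora_update, lora_update / full_linear)  # ~4% of the full layer's parameters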
import math
import torch
import torch.nn as nn
class LoRALayer(nn.Module):
def __init__(self, in_dim, out_dim, rank, alpha):
super().__init__()
self.A = torch.nn.Parameter(torch.empty(in_dim, rank))
torch.nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))
self.B = torch.nn.Parameter(torch.zeros(rank, out_dim))
self.alpha = alpha
def forward(self, x):
x = self.alpha * (x @ self.A @ self.B)
return x
rank governs the inner dimension of the matrices A and B. It balances the adaptability of the model against efficiency, via the number of trainable parameters used.
alpha functions as a scaling factor for the output. It dictates the degree to which the output from the adapted layer can affect the original layer's output; it's a way to regulate the effect of LoRA on the layer's output.
In LoRA, we typically substitute the Linear layers, so the weight updates can be applied directly on top of the existing pretrained weights.
class LinearWithLoRA(nn.Module):
def __init__(self, linear, rank, alpha):
super().__init__()
self.linear = linear
self.lora = LoRALayer(linear.in_features, linear.out_features, rank, alpha)
def forward(self, x):
return self.linear(x) + self.lora(x)
Utility:
def replace_linear_with_lora(model, rank, alpha):
for name, module in model.named_children():
if isinstance(module, nn.Linear):
setattr(model, name, LinearWithLoRA(module, rank, alpha))
else:
replace_linear_with_lora(module, rank, alpha)
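A minimal usage sketch (the rank and alpha values are illustrative; the base weights are frozen first so that only the LoRA matrices remain trainable):
# Freeze the pretrained weights, then inject LoRA adapters into every Linear layer
for param in model.parameters():
    param.requires_grad = False
replace_linear_with_lora(model, rank=16, alpha=16)

trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters with LoRA: {trainable_params:,}")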