```python from IPython.display import Image ``` ```python Image(filename = 'RE-GPT.jpeg') ``` ![jpeg](README_files/README_1_0.jpg) # Reverse Engineering GPT #### Inspired by Andrej Karpathy’s "Let’s Build GPT", this project guides you step‑by‑step to build a GPT from scratch, demystifying its architecture through clear, hands‑on code. ### [original video from Andrej Karpathy](https://www.youtube.com/watch?v=kCc8FmEb1nY) #### Credit: [Andrej Karpathy](mailto:karpathy@eurekalabs.ai) #### Instructor: [Kevin Thomas](mailto:ket189@pitt.edu) ## [Attention Is All You Need](https://arxiv.org/pdf/1706.03762) #### Academic Paper ```python !pip install torch ``` Requirement already satisfied: torch in /opt/anaconda3/envs/prod/lib/python3.12/site-packages (2.5.1) Requirement already satisfied: filelock in /opt/anaconda3/envs/prod/lib/python3.12/site-packages (from torch) (3.13.1) Requirement already satisfied: typing-extensions>=4.8.0 in /opt/anaconda3/envs/prod/lib/python3.12/site-packages (from torch) (4.11.0) Requirement already satisfied: networkx in /opt/anaconda3/envs/prod/lib/python3.12/site-packages (from torch) (3.3) Requirement already satisfied: jinja2 in /opt/anaconda3/envs/prod/lib/python3.12/site-packages (from torch) (3.1.4) Requirement already satisfied: fsspec in /opt/anaconda3/envs/prod/lib/python3.12/site-packages (from torch) (2024.6.1) Requirement already satisfied: setuptools in /opt/anaconda3/envs/prod/lib/python3.12/site-packages (from torch) (75.1.0) Requirement already satisfied: sympy==1.13.1 in /opt/anaconda3/envs/prod/lib/python3.12/site-packages (from torch) (1.13.1) Requirement already satisfied: mpmath<1.4,>=1.1.0 in /opt/anaconda3/envs/prod/lib/python3.12/site-packages (from sympy==1.13.1->torch) (1.3.0) Requirement already satisfied: MarkupSafe>=2.0 in /opt/anaconda3/envs/prod/lib/python3.12/site-packages (from jinja2->torch) (2.1.3) ```python import torch import torch.nn as nn from torch.nn import functional as F ``` ```python from IPython.display import Image ``` ## Transformer Model Architecture ```python Image(filename = 'transformer-model-arch.png', width=400) ``` ![png](README_files/README_8_0.png) ## Understanding Self-Attention in Simple Terms When building a language model like GPT from scratch, we initially might use a uniform weight matrix `wei` based on a function like `torch.tril`. This matrix treats all previous tokens equally, which isn’t ideal because different words (tokens) in a sentence might be more or less important to each other. For example, if a vowel in a word is looking back at previous letters, it might be more interested in certain consonants rather than all past letters equally. Self-attention helps solve this problem by allowing each token to focus on specific other tokens in a data-dependent way. Here’s how it works: every token at each position generates two vectors—`query` and `key`. The `query` vector represents **“What am I looking for?”** and the `key` vector represents **“What do I contain?”**. By computing the dot product between a token’s query and the keys of all other tokens, we obtain a measure of similarity or “affinity”. This affinity tells us how much attention one token should pay to another. 
In code, we start by initializing linear layers for the keys and queries without biases:

```python
key = nn.Linear(input_size, head_size, bias=False)
query = nn.Linear(input_size, head_size, bias=False)
```

We then compute the keys and queries by passing our input `x` (which contains all tokens) through these layers:

```python
k = key(x)    # shape: (B, T, head_size)
q = query(x)  # shape: (B, T, head_size)
```

Here, `B` is the `batch_size`, `T` is the sequence length, and `head_size` is a hyperparameter (like 16). At this point, each token has independently produced its key and query vectors without any communication with other tokens. Next, we compute the affinities (similarities) between tokens by taking the dot product of queries and transposed keys:

```python
wei = q @ k.transpose(-2, -1)  # shape: (B, T, T)
```

This results in a matrix where each element tells us how much one token should pay attention to another. For example, with `T = 8`, `wei[0][7][3]` tells us how much the 8th token (index 7) in the first batch should focus on the 4th token (index 3). These affinities are data-dependent, meaning they change based on the actual content of the tokens. However, when aggregating information, we don’t use the original tokens directly. Instead, each token also generates a value vector, which represents the information it wants to share:

```python
value = nn.Linear(input_size, head_size, bias=False)
v = value(x)  # shape: (B, T, head_size)
```

Finally, we use the affinities to compute a weighted sum of these values:

```python
output = wei @ v  # shape: (B, T, head_size)
```

This means each token gathers information from other tokens, weighted by how relevant they are (as determined by the affinities). So, a token effectively says, **“Based on what I’m interested in (my query) and what others contain (their keys), here’s the combined information (values) I should consider.”** By doing this, self-attention allows the model to dynamically focus on different parts of the input sequence, enabling it to capture complex patterns and relationships in the data.

```python
# version 4: self-attention!
torch.manual_seed(1337)
B, T, C = 4, 8, 32 # batch, time, channels
x = torch.randn(B,T,C)

# let's see a single Head perform self-attention
head_size = 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)
k = key(x)   # (B, T, 16)
q = query(x) # (B, T, 16)
wei = q @ k.transpose(-2, -1) # (B, T, 16) @ (B, 16, T) ---> (B, T, T)

tril = torch.tril(torch.ones(T, T))
#wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)

v = value(x)
out = wei @ v
#out = wei @ x

out.shape
```

    torch.Size([4, 8, 16])

```python
wei[0]
```

    tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
            [0.1574, 0.8426, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
            [0.2088, 0.1646, 0.6266, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
            [0.5792, 0.1187, 0.1889, 0.1131, 0.0000, 0.0000, 0.0000, 0.0000],
            [0.0294, 0.1052, 0.0469, 0.0276, 0.7909, 0.0000, 0.0000, 0.0000],
            [0.0176, 0.2689, 0.0215, 0.0089, 0.6812, 0.0019, 0.0000, 0.0000],
            [0.1691, 0.4066, 0.0438, 0.0416, 0.1048, 0.2012, 0.0329, 0.0000],
            [0.0210, 0.0843, 0.0555, 0.2297, 0.0573, 0.0709, 0.2423, 0.2391]],
           grad_fn=<SelectBackward0>)

```python
Image(filename = 'attention.png', width=400)
```

![png](README_files/README_13_0.png)

## Understanding Attention Mechanisms in Simple Terms

Attention is like a way for different parts (nodes) of a network to communicate with each other.
Imagine these nodes as points in a directed graph where edges show which nodes are connected. Each node has some information stored as a vector, and it can gather information from other nodes it’s connected to by taking a weighted sum of their vectors. The weights are data-dependent, meaning they change based on the actual content at each node. In our case, we have a graph of 8 nodes because our `block_size` is 8, so there are always 8 tokens. The structure is such that the first node only looks at itself, the second node looks at itself and the first node, and so on, up to the 8th node, which looks at all previous nodes and itself. This setup ensures that future tokens don’t influence past ones, which is important in language modeling where we predict the next word based on previous words. One important thing to note is that attention doesn’t have a built-in sense of position or space. The nodes don’t inherently know where they are in the sequence. To fix this, we add positional encodings to our input vectors so that each token is aware of its position in the sequence. This is different from convolutional operations where the position is inherently part of the computation due to the structure of the convolution filters. In our model, we process multiple examples at once using batches. For instance, with a `batch_size` of 4, we have 4 separate groups of 8 nodes. These groups are processed independently and don’t share information with each other. This is handled efficiently using batched matrix multiplications that operate across the batch dimension `B`. When it comes to the code, if we wanted all the nodes to communicate with each other (like in tasks where future tokens can influence past ones), we’d use an encoder attention block. This involves removing the masking line in our code: ```python wei = wei.masked_fill(tril == 0, float('-inf')) ``` By deleting this line, we allow every node to attend to every other node without restrictions. However, for language modeling, we keep this line to prevent future tokens from influencing the computation of past tokens, creating what’s known as a **decoder attention block**. Lastly, in self-attention, the keys, queries, and values all come from the same source `x`. This means each node is attending to other nodes within the same set. In contrast, cross-attention involves keys and values coming from a different source than the queries, which is useful when integrating information from external data or another part of the network. ## Understanding Scaled Attention in Simple Terms ### Scaled Dot-Product Attention Formula ```python Image(filename = 'scaled-dot-product-attention-formula.png', width=350) ``` ![png](README_files/README_18_0.png) ```python Image(filename = 'scaled-dot-product-attention.png', width=250) ``` ![png](README_files/README_19_0.png) In the **“Attention Is All You Need”** paper, we’ve learned how to implement attention mechanisms using queries, keys, and values. We multiply the queries (`q`) and keys (`k`), apply the softmax function to the result, and then use these weights to aggregate the values (`v`). This process allows the model to focus on different parts of the input data. However, there’s an important step we haven’t included yet: dividing by the square root of the `head_size` (denoted as dk in the formula). This operation is known as scaled attention, and it’s a crucial normalization step in the attention mechanism. 
Here’s why scaling is important: if `q` and `k` are random variables drawn from a standard normal distribution (mean of 0 and standard deviation of 1), then their dot product will have a variance proportional to the `head_size` (which is 16 in our case). Without scaling, the `wei` (weights) would have a variance of about 16, causing the softmax function to produce very sharp (peaked) outputs. By multiplying by `head_size**-0.5` (which is the same as dividing by the square root of head_size), we adjust the variance of `wei` back to 1: ```python wei = q @ k.transpose(-2, -1) * head_size ** -0.5 ``` This scaling ensures that when we apply the softmax function: ```python wei = F.softmax(wei, dim=-1) ``` The resulting weights are more evenly distributed (diffuse) rather than being overly concentrated. This is especially important during initialization because it allows the model to explore different parts of the input without being biased toward specific positions. In summary, including the scaling factor in our attention computation helps maintain stable gradients and prevents the softmax outputs from becoming too extreme. This makes the model more effective at learning and focusing on the relevant parts of the input data. ```python k = torch.randn(B,T,head_size) q = torch.randn(B,T,head_size) wei = q @ k.transpose(-2, -1) * head_size**-0.5 ``` ```python k.var() ``` tensor(1.0449) ```python q.var() ``` tensor(1.0700) ```python wei.var() ``` tensor(1.0918) ## Understanding the Impact of Softmax in Attention Mechanisms In our attention mechanism, we use the softmax function to convert the raw attention weights (`wei`) into probabilities that sum up to one. However, there’s a problem when the values inside `wei` are very large or very small (both positive and negative). When `wei` contains very large positive and negative numbers, the softmax function tends to produce outputs that are extremely peaked, meaning it approaches one-hot vectors. This results in the model focusing almost entirely on one token and ignoring the rest, which isn’t always desirable. For example, if we have `wei` values like `[10, 20, 30]`, applying softmax will give us something close to `[0.0, 0.0, 1.0]`. This happens because the exponential function in softmax amplifies the differences between large numbers, making the largest value dominate the output. Conversely, if we apply softmax to values that are very close to zero, like `[0.1, 0.2, 0.3]`, the output will be more evenly distributed, such as `[0.30, 0.33, 0.37]`. This diffuse output means the model considers multiple tokens more equally. To prevent the softmax from becoming too sharp and focusing only on one token, we need to ensure that the values in `wei` are not too large in magnitude. This is where scaling comes in. By dividing `wei` by a scaling factor (specifically, the square root of the `head_size`), we reduce the variance of its values, keeping them closer to zero. This scaling ensures that the softmax function produces a more balanced output. In code, we implement this scaling as follows: ```python wei = q @ k.transpose(-2, -1) * head_size ** -0.5 ``` By including the scaling factor `head_size ** -0.5`, we adjust the attention weights so that their variance is controlled, and the softmax function doesn’t saturate. This allows the model to consider information from multiple tokens rather than just one, improving its ability to learn and generalize from the data. 
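The notebook output above only shows the scaled case, so as a complementary illustration, here is a small standalone sketch (separate from the model code) that computes the attention scores both with and without the `head_size ** -0.5` factor and compares their variance and softmax sharpness. The shapes mirror the toy example used earlier.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T, head_size = 4, 8, 16

# random queries and keys with roughly unit variance, as in the discussion above
q = torch.randn(B, T, head_size)
k = torch.randn(B, T, head_size)

wei_raw    = q @ k.transpose(-2, -1)                      # no scaling
wei_scaled = q @ k.transpose(-2, -1) * head_size ** -0.5  # scaled attention

print(wei_raw.var().item())     # roughly head_size (about 16)
print(wei_scaled.var().item())  # roughly 1

# the softmax over the unscaled scores is much more peaked
print(F.softmax(wei_raw[0, 0], dim=-1))
print(F.softmax(wei_scaled[0, 0], dim=-1))
```

The exact numbers depend on the seed, but the unscaled variance lands near `head_size` while the scaled variance lands near 1, which is what keeps the softmax diffuse at initialization.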
Understanding this scaling is important because it highlights how mathematical operations in neural networks can significantly impact the model’s performance. By carefully managing the values passed into functions like softmax, we ensure that the attention mechanism works effectively, allowing our GPT model to capture complex patterns in language. ```python torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5]), dim=-1) ``` tensor([0.1925, 0.1426, 0.2351, 0.1426, 0.2872]) ```python torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5])*8, dim=-1) # gets too peaky, converges to one-hot ``` tensor([0.0326, 0.0030, 0.1615, 0.0030, 0.8000]) ## Understanding the Head Class in Self-Attention In our journey to build GPT from scratch, the `Head` class is crucial because it implements a single head of self-attention. Self-attention allows the model to focus on different parts of the input sequence when generating each part of the output, which is essential for understanding context in language. When we initialize the `Head` class, we pass in a parameter called `head_size`. Inside the constructor (`__init__` method), we create three linear layers: `key`, `query`, and `value`. These layers are initialized without bias terms (`bias=False`) and are used to project our input `x` into different representations: ```python self.key = nn.Linear(input_size, head_size, bias=False) self.query = nn.Linear(input_size, head_size, bias=False) self.value = nn.Linear(input_size, head_size, bias=False) ``` These linear layers apply matrix multiplications to the input data and are essential for computing the attention mechanism. We also create a lower triangular matrix called tril using `torch.tril`, which stands for “triangle lower.” This matrix is registered as a buffer (not a parameter that the model learns) using register_buffer: ```python self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size))) ``` This matrix ensures that each position in the sequence can only attend to itself and previous positions, preventing information from “future” tokens from influencing the current token (which is important for language modeling where we predict the next word). 
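To make the `register_buffer` call less mysterious, here is a tiny standalone sketch (the `MaskOnly` module is just a made-up name for illustration, not part of the model) showing what the `tril` buffer looks like and confirming that it is stored with the module but never trained:

```python
import torch
import torch.nn as nn

class MaskOnly(nn.Module):
    """Tiny module that only registers the causal mask, so we can inspect the buffer."""
    def __init__(self, block_size):
        super().__init__()
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

m = MaskOnly(block_size=4)
print(m.tril)                    # lower triangular matrix of ones
print(list(m.parameters()))      # [] -> the buffer is not a trainable parameter
print('tril' in m.state_dict())  # True -> but it is saved and moved with the module
```

Because `tril` is a buffer rather than a parameter, it travels with the model (for example when calling `.to(device)` or saving a checkpoint), but the optimizer never updates it.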
In the forward method, which defines how the data flows through the model, we take an input `x` and extract its dimensions:

```python
B, T, C = x.shape  # Batch size, Sequence length, Embedding size
```

We compute the `key` and `query` matrices by passing `x` through their respective linear layers:

```python
k = self.key(x)    # Shape: (B, T, C)
q = self.query(x)  # Shape: (B, T, C)
```

We calculate the attention weights (`wei`) by taking the dot product of `q` and the transposed `k`, and then we normalize it by dividing by the square root of `C` (this is known as scaled attention):

```python
wei = q @ k.transpose(-2, -1) * C ** -0.5  # Shape: (B, T, T)
```

To ensure that future tokens do not influence the current token, we apply the lower triangular mask:

```python
wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))  # Shape: (B, T, T)
```

We then apply the softmax function to turn these weights into probabilities that sum to one:

```python
wei = F.softmax(wei, dim=-1)
```

Next, we compute the value matrix:

```python
v = self.value(x)  # Shape: (B, T, C)
```

Finally, we perform the weighted aggregation of the values by multiplying the attention weights `wei` with the values `v`:

```python
out = wei @ v  # Shape: (B, T, C)
return out
```

The result `out` is a new representation of the input sequence where each token has gathered information from the relevant tokens that came before it. This mechanism allows the model to capture dependencies and relationships in the data, which is fundamental for tasks like language modeling.

```python
class Head(nn.Module):
    """One head of self-attention."""

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)   # (B,T,C)
        q = self.query(x) # (B,T,C)
        # compute attention scores ("affinities")
        wei = q @ k.transpose(-2,-1) * C**-0.5 # (B, T, C) @ (B, C, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,C)
        out = wei @ v # (B, T, T) @ (B, T, C) -> (B, T, C)
        return out
```

## Understanding Self-Attention and Positional Embeddings in Our Language Model

In our language model called `BigramLanguageModel`, we incorporate self-attention mechanisms to help the model understand relationships between different tokens in a sequence. Within the constructor of our model, we create multiple attention blocks using the following code:

```python
self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
```

Here, `n_embd` represents the embedding size (the dimensionality of our token embeddings), and `n_head` is the number of attention heads we want to use. Each `Block` pairs multi-head self-attention with a feed-forward network, and we stack `n_layer` of them using `nn.Sequential`. The `head_size` of each attention head is `n_embd // n_head`, so that concatenating the heads returns us to the full embedding size. In the forward method, we first encode our input tokens by adding token embeddings and positional embeddings:

```python
x = tok_emb + pos_emb
```

This means we take the embeddings of the tokens (`tok_emb`) and add positional information (`pos_emb`) so that the model knows the position of each token in the sequence.
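As a quick aside, the sketch below uses small illustrative sizes (placeholders, not the project’s final hyperparameters) to show the shapes involved: the positional embeddings have no batch dimension, so they broadcast across the batch when added to the token embeddings.

```python
import torch
import torch.nn as nn

vocab_size, n_embd, block_size = 65, 32, 8   # illustrative sizes
B, T = 4, 8

token_embedding_table = nn.Embedding(vocab_size, n_embd)
position_embedding_table = nn.Embedding(block_size, n_embd)

idx = torch.randint(0, vocab_size, (B, T))            # a batch of token indices
tok_emb = token_embedding_table(idx)                  # (B, T, n_embd)
pos_emb = position_embedding_table(torch.arange(T))   # (T, n_embd)

# (T, n_embd) broadcasts across the batch dimension, so every sequence
# in the batch gets the same positional vectors added to its tokens
x = tok_emb + pos_emb                                 # (B, T, n_embd)
print(tok_emb.shape, pos_emb.shape, x.shape)
```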
We then pass this combined embedding `x` through our self-attention blocks: ```python x = self.blocks(x) ``` The output from the attention blocks is then fed into the language modeling head to produce the `logits`, which are the unnormalized probabilities for the next token in the sequence: ```python logits = self.lm_head(x) ``` When generating new text with the generate method, we need to ensure that the input indices (`idx`) we feed into the model do not exceed the `block_size`. This is because our positional embedding table only has embeddings up to `block_size`, and we can’t provide positional embeddings for positions beyond that. To handle this, we crop the context to the last `block_size` tokens: ```python idx_cond = idx[:, -block_size:] ``` By doing this, we make sure we’re always using a valid range of positional embeddings, preventing any errors or out-of-scope issues. This allows the model to generate text effectively while respecting the limitations of our positional embedding setup. ```python class BigramLanguageModel(nn.Module): """Language model based on the Transformer architecture.""" def __init__(self): super().__init__() # each token directly reads off the logits for the next token from a lookup table self.token_embedding_table = nn.Embedding(vocab_size, n_embd) self.position_embedding_table = nn.Embedding(block_size, n_embd) self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)]) self.ln_f = nn.LayerNorm(n_embd) # final layer norm self.lm_head = nn.Linear(n_embd, vocab_size) def forward(self, idx, targets=None): B, T = idx.shape # idx and targets are both (B,T) tensor of integers tok_emb = self.token_embedding_table(idx) # (B,T,C) pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C) x = tok_emb + pos_emb # (B,T,C) x = self.blocks(x) # (B,T,C) x = self.ln_f(x) # (B,T,C) logits = self.lm_head(x) # (B,T,vocab_size) if targets is None: loss = None else: B, T, C = logits.shape logits = logits.view(B*T, C) targets = targets.view(B*T) loss = F.cross_entropy(logits, targets) return logits, loss def generate(self, idx, max_new_tokens): # idx is (B, T) array of indices in the current context for _ in range(max_new_tokens): # crop idx to the last block_size tokens idx_cond = idx[:, -block_size:] # get the predictions logits, loss = self(idx_cond) # focus only on the last time step logits = logits[:, -1, :] # becomes (B, C) # apply softmax to get probabilities probs = F.softmax(logits, dim=-1) # (B, C) # sample from the distribution idx_next = torch.multinomial(probs, num_samples=1) # (B, 1) # append sampled index to the running sequence idx = torch.cat((idx, idx_next), dim=1) # (B, T+1) return idx ``` ## Understanding Multi-Head Attention in Simple Terms ### Multi-Head Attention Formula ```python Image(filename = 'multi-head-attention-formula.png', width=450) ``` ![png](README_files/README_37_0.png) ```python Image(filename = 'multi-head-attention.png', width=200) ``` ![png](README_files/README_38_0.png) In our GPT model built from scratch, we use a concept called **Multi-Head Attention**. This means we have multiple self-attention mechanisms (called “heads”) running in parallel. Instead of relying on a single attention mechanism, we allow the model to focus on different aspects of the input simultaneously. In PyTorch, we specify the number of heads (`num_heads`) and the size of each head (`head_size`). 
Here’s how we create multiple heads: ```python self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)]) ``` Each Head is an instance of our self-attention mechanism. We process the input `x` through all these heads in parallel and then concatenate their outputs along the channel dimension (`dim=-1`): ```python out = torch.cat([h(x) for h in self.heads], dim=-1) ``` This concatenation combines the outputs of all the attention heads into a single tensor. Instead of having a single attention head with a `head_size` equal to the embedding size (`n_embd`), we divide the embedding size among multiple heads. For example, if `n_embd` is 32 and we have 4 heads, each head will have a `head_size` of 8. This means: * We have 4 communication channels (heads) running in parallel. * Each head processes an 8-dimensional vector. * When we concatenate the outputs of all heads, we get back to the original embedding size of 32. Having multiple attention heads is beneficial because tokens (like words or characters) have a lot of different things to “talk” about. For instance, they might want to find vowels, consonants, or specific patterns at certain positions. By using multiple independent channels of communication, each head can focus on different types of information. This allows the model to gather a richer set of data before making predictions, leading to better performance in understanding and generating language. ```python class MultiHeadAttention(nn.Module): """Multiple self-attention heads in parallel.""" def __init__(self, num_heads, head_size): super().__init__() self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)]) self.proj = nn.Linear(n_embd, n_embd) self.dropout = nn.Dropout(dropout) def forward(self, x): out = torch.cat([h(x) for h in self.heads], dim=-1) out = self.dropout(self.proj(out)) return out ``` ## Adding Computation at the Token Level ### Position-Wise Feed-Forward Network Formula ```python Image(filename = 'ffn-formula.png', width=350) ``` ![png](README_files/README_43_0.png) So far in our language model, we’ve implemented multi-headed self-attention, which allows tokens (like words or characters) to communicate with each other. This means each token can look at other tokens in the sequence and gather information from them. However, we’ve been moving a bit too quickly from this communication step to making predictions (calculating the `logits`) in our `BigramLanguageModel`. The problem is that while the tokens have looked at each other, they haven’t had much time to process or “think about” the information they’ve gathered from other tokens. To fix this, we’re going to add a small feed-forward neural network that operates on a per-token level. This means that after gathering information, each token will independently process that information to make better predictions. This feed-forward network is simply a linear layer followed by a ReLU (Rectified Linear Unit) activation function, which introduces non-linearity. In code, we implement it like this within our `Block` class: ```python self.ffwd = FeedForward(n_embd) ``` And we call it right after the self-attention layer in the forward method. The FeedForward class might look something like this: ```python class FeedForward(nn.Module): def __init__(self, n_embd): super().__init__() self.net = nn.Sequential( nn.Linear(n_embd, n_embd), nn.ReLU() ) def forward(self, x): return self.net(x) ``` Here, `n_embd` is the embedding size (the size of our token vectors). 
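To confirm that this feed-forward network really is applied position-wise, the standalone toy below (with the simple linear + ReLU version redefined here so the snippet is self-contained, and made-up sizes) runs the same tensor through it all at once and one token at a time, then checks that the results match:

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Linear layer followed by ReLU, applied position-wise."""
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_embd, n_embd), nn.ReLU())
    def forward(self, x):
        return self.net(x)

torch.manual_seed(0)
B, T, n_embd = 4, 8, 32
x = torch.randn(B, T, n_embd)
ffwd = FeedForward(n_embd)

out_all = ffwd(x)           # process every token at once: (B, T, n_embd)
out_tok = ffwd(x[:, 3, :])  # process only the 4th token:  (B, n_embd)

# nn.Linear acts on the last dimension, so each token is transformed
# independently of its neighbours
print(torch.allclose(out_all[:, 3, :], out_tok))  # True
```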
Each token processes its own vector independently through this network. The self-attention layer allows tokens to gather information from others (communication), and the feed-forward network allows each token to process that information individually (computation). By adding this computation step, we enable each token to make better use of the information it has received, leading to improved performance of the language model. This mirrors how, in human communication, we not only listen to others but also take time to think and process what we’ve heard before responding. ```python class FeedFoward(nn.Module): """A simple feed-forward neural network.""" def __init__(self, n_embd): super().__init__() self.net = nn.Sequential( nn.Linear(n_embd, 4 * n_embd), nn.ReLU(), nn.Linear(4 * n_embd, n_embd), nn.Dropout(dropout), ) def forward(self, x): return self.net(x) ``` ## Understanding Transformer Blocks and Their Role in GPT In building our GPT (Generative Pre-trained Transformer) model from scratch, we’re now focusing on combining communication and computation within the network. This approach mirrors how Transformers work—they have blocks that allow tokens (like words or characters) to communicate with each other and then compute based on that information. These blocks are grouped and replicated multiple times to build a powerful model. The core of this mechanism is implemented in the `Block` class, which represents the main part of the Transformer decoder model (excluding cross-attention components that interact with an encoder in some architectures). The `Block` class interleaves communication and computation steps. The communication is handled by multi-headed self-attention: ```python self.sa = MultiHeadAttention(n_head, head_size) ``` This allows tokens to look at other tokens in the sequence and gather relevant information. After communication, each token independently processes the gathered information using a feed-forward neural network: ```python self.ffwd = FeedForward(n_embd) ``` In the constructor of the `Block` class, we specify `n_embd`, which is the size of our token embeddings (the embedding dimension), and `n_head`, the number of attention heads we want to use. These parameters determine how the tokens will communicate and compute within each block. Within our `BigramLanguageModel` class, we stack these blocks sequentially to build the depth of the model: ```python self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)]) ``` Here, `n_layer` specifies how many times we repeat the `Block`. This setup allows us to interleave communication and computation multiple times, enabling the model to capture complex patterns in language. Finally, in the forward method, after passing the data through all the blocks, we decode the output to generate the logits (the raw predictions before applying softmax) using: ```python logits = self.lm_head(x) ``` By interspersing communication and computation in this way, each token can gather information from others and then process it independently, which is crucial for understanding context and generating coherent text in language models like GPT. 
```python Image(filename = 'block.png', width=120) ``` ![png](README_files/README_48_0.png) ```python Image(filename = 'cross-attention.png', width=125) ``` ![png](README_files/README_49_0.png) ```python class Block(nn.Module): """Transformer block: communication followed by computation.""" def __init__(self, n_embd, n_head): # n_embd: embedding dimension, n_head: the number of heads we'd like super().__init__() head_size = n_embd // n_head self.sa = MultiHeadAttention(n_head, head_size) self.ffwd = FeedFoward(n_embd) self.ln1 = nn.LayerNorm(n_embd) self.ln2 = nn.LayerNorm(n_embd) def forward(self, x): x = x + self.sa(self.ln1(x)) x = x + self.ffwd(self.ln2(x)) return x ``` ## Improving Deep Neural Networks with Residual Connections At this stage of building our GPT model from scratch, we’re noticing that the performance isn’t as good as we’d like. One reason is that our neural network is becoming quite deep, and deep neural networks often face optimization issues. This means they can be hard to train effectively because the gradients used in learning can either vanish or explode as they pass through many layers. To tackle this problem, we can borrow an idea from the **“Attention Is All You Need”** paper. The paper introduces two optimizations that significantly help deep networks remain trainable. The first optimization is the use of skip connections, also known as residual connections. These connections allow the model to bypass certain layers by adding the input of a layer directly to its output. This helps preserve the original information and makes it easier for the network to learn. In simple terms, instead of just passing data through a transformation (like a neural network layer), we also add the original data back into the output. This means that if the transformation doesn’t learn anything useful, the network can still pass the original information forward. This helps prevent the network from getting worse as it gets deeper. Here’s how we can implement residual connections in our `Block` class: ```python class Block(nn.Module): def __init__(self, n_embd, n_head): super().__init__() head_size = n_embd // n_head self.sa = MultiHeadAttention(n_head, head_size) self.ffwd = FeedForward(n_embd) def forward(self, x): x = x + self.sa(x) # residual connection after self-attention x = x + self.ffwd(x) # residual connection after feed-forward network return x ``` In this code: * Self-Attention Residual Connection: We compute `self.sa(x)`, which is the output of the self-attention layer, and add it to the original input `x`. ```python x = x + self.sa(x) ``` * Feed-Forward Residual Connection: Similarly, we compute `self.ffwd(x)`, which processes each token independently, and add it to the result of the previous step. ```python x = x + self.ffwd(x) ``` By adding these residual connections, we’re effectively allowing the network to “skip” layers if needed, making it easier to train deeper models. The residual connections help the gradients flow backward through the network during training, which addresses the optimization issues associated with deep neural networks. In summary, residual connections are a simple yet powerful idea that helps deep neural networks learn more effectively. By incorporating them into our model, we’re borrowing a successful strategy from advanced architectures like Transformers, ensuring that our GPT model can train successfully even as it becomes deeper and more complex. 
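The toy experiment below is only meant to illustrate this optimization argument; it is not part of the GPT code, and the depth and width are arbitrary illustrative choices. It stacks many small linear + ReLU layers with and without the residual addition and compares the size of the gradient that reaches the input. Exact values depend on initialization, but the plain stack’s gradient typically shrinks dramatically while the residual stack’s does not.

```python
import torch
import torch.nn as nn

torch.manual_seed(1337)
depth, dim = 50, 64  # illustrative sizes

def input_grad_norm(use_residual):
    # a deep stack of small linear + ReLU layers
    layers = [nn.Sequential(nn.Linear(dim, dim), nn.ReLU()) for _ in range(depth)]
    x = torch.randn(1, dim, requires_grad=True)
    h = x
    for layer in layers:
        # with residuals, each layer adds its output to the running signal
        h = h + layer(h) if use_residual else layer(h)
    h.sum().backward()
    return x.grad.norm().item()

print('plain stack:   ', input_grad_norm(use_residual=False))
print('residual stack:', input_grad_norm(use_residual=True))
```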
```python Image(filename = 'block.png', width=125) ``` ![png](README_files/README_53_0.png) ```python # SOURCE: https://towardsdatascience.com/residual-blocks-building-blocks-of-resnet-fd90ca15d6ec ``` ```python Image(filename = 'residual-blocks.png', width=300) ``` ![png](README_files/README_55_0.png) ## Understanding Residual Connections in the GPT Model When building deep neural networks like our GPT model, we can run into problems because deep networks are harder to train effectively. One powerful solution is using residual connections. Think of a residual connection as a shortcut path for information to flow through the network without getting distorted by too many layers. In our model, the computation flows from top to bottom, and there’s a central pathway called the residual pathway, represented by a black line in diagrams. At certain points, we “fork off” from this residual pathway to perform some computations—like self-attention or feed-forward processing—and then we add the result back to the main pathway. This is implemented using addition operations. Here’s why this helps: during training, when the network learns by backpropagation, gradients (which update the network’s weights) can flow directly through these addition points. This creates a “gradient superhighway” that allows learning signals to pass unimpeded from the output back to the input layers, making training more efficient. To implement residual connections in our code, we modify the forward method of the `Block` class like this: ```python def forward(self, x): x = x + self.sa(self.ln1(x)) x = x + self.ffwd(self.ln2(x)) return x ``` In this code: * `self.ln1(x)` and `self.ln2(x)` apply layer normalization to stabilize the inputs. * `self.sa` is the multi-head self-attention operation. * `self.ffwd` is the feed-forward neural network. * We add the output of these operations back to the original input `x`, creating the residual connections. In the `MultiHeadAttention` class, we need to make sure the output dimensions match so we can add them back to `x`. We do this by introducing a projection layer: ```python self.proj = nn.Linear(n_embd, n_embd) ``` After combining the outputs of all attention heads: ```python out = torch.cat([h(x) for h in self.heads], dim=-1) out = self.proj(out) return out ``` * We concatenate the outputs from all heads along the last dimension. * We then project this combined output back to the original embedding size (`n_embd`) using `self.proj(out)`. Similarly, in the `FeedForward` class, we adjust the network to have a larger inner layer, which increases its capacity to learn complex patterns: ```python self.net = nn.Sequential( nn.Linear(n_embd, 4 * n_embd), nn.ReLU(), nn.Linear(4 * n_embd, n_embd), ) ``` * The first linear layer expands the size from `n_embd` to `4 * n_embd`. * After applying the ReLU activation function, the second linear layer brings it back to `n_embd`, allowing us to add it back to `x`. By using these residual connections and appropriately sized projection layers, we allow the model to add new computations without losing the original information. This helps the gradients flow smoothly during training, making it much easier to optimize deep networks like our GPT model. 
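Here is a small standalone shape check that fakes the per-head outputs with random tensors (no real attention is computed) just to make the concatenation and projection bookkeeping concrete, using the same 4-head, 32-dimensional example as above:

```python
import torch
import torch.nn as nn

B, T = 4, 8
n_embd, num_heads = 32, 4
head_size = n_embd // num_heads          # 8 dimensions per head

# stand-ins for the outputs of the four attention heads
head_outputs = [torch.randn(B, T, head_size) for _ in range(num_heads)]

out = torch.cat(head_outputs, dim=-1)    # (B, T, 32): heads placed side by side
proj = nn.Linear(n_embd, n_embd)         # mixes information across heads
out = proj(out)                          # (B, T, 32): ready for the residual pathway

print(out.shape)  # torch.Size([4, 8, 32])
```

Because the projected output matches `n_embd`, it can be added straight back onto the residual pathway.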
```python Image(filename = 'types-of-residual-blocks.png', width=500) ``` ![png](README_files/README_58_0.png) ```python Image(filename = 'block.png', width=125) ``` ![png](README_files/README_59_0.png) ## Understanding Layer Normalization in Deep Neural Networks As our GPT model becomes deeper, we encounter difficulties in training it effectively. Deep neural networks can suffer from optimization issues, making it hard for the model to learn from the data. To overcome this, we use two important techniques from the **“Attention Is All You Need”** paper. We’ve already added residual connections to help information flow through the network. The second optimization is called layer normalization, often shown as “Norm” next to the “Add” operations in diagrams. Layer normalization is similar to batch normalization, which you might have heard of. In batch normalization, we ensure that each neuron’s output has a mean of zero and a standard deviation of one across the entire batch of data (`B`). This helps stabilize the learning process by keeping the outputs of neurons on a similar scale. However, layer normalization works a bit differently. Instead of normalizing across the batch, layer normalization normalizes across the features (the elements within each data point). This means that for each individual example in the batch, we compute the mean and variance of its features and adjust them so that they have a mean of zero and a standard deviation of one. This is especially helpful in models like Transformers because it doesn’t depend on the batch size and works well with variable-length sequences. Here’s how we incorporate layer normalization into our `Block` class: ```python import torch.nn as nn class Block(nn.Module): def __init__(self, n_embd, n_head): super().__init__() head_size = n_embd // n_head self.ln1 = nn.LayerNorm(n_embd) # layer normalization before self-attention self.sa = MultiHeadAttention(n_head, head_size) self.ln2 = nn.LayerNorm(n_embd) # layer normalization before feed-forward network self.ffwd = FeedForward(n_embd) def forward(self, x): x = x + self.sa(self.ln1(x)) # residual connection with self-attention x = x + self.ffwd(self.ln2(x)) # residual connection with feed-forward network return x ``` In this code: * Layer Normalization Layers: We introduce `self.ln1` and `self.ln2` using `nn.LayerNorm(n_embd)`. These layers normalize the inputs to the self-attention and feed-forward networks. * Residual Connections: We maintain our residual connections by adding the output of the self-attention and feed-forward networks back to the original input `x`. * Forward Method: In the `forward` method, we apply layer normalization before each main operation. This helps stabilize the inputs to those layers. By using layer normalization, we ensure that the activations (outputs of each layer) have consistent statistics throughout the network. This makes the deep network easier to train because it reduces the internal changes that the network has to adapt to during learning. Combined with residual connections, layer normalization greatly improves the optimization of very deep neural networks like our GPT model. 
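Since it is easy to mix up which dimension each technique normalizes over, the following side-by-side sketch (separate from the model) applies PyTorch’s built-in `nn.BatchNorm1d` and `nn.LayerNorm` to the same random tensor and inspects a single feature versus a single sample:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(32, 100)      # 32 samples, 100 features each

bn = nn.BatchNorm1d(100)      # normalizes each feature across the batch
ln = nn.LayerNorm(100)        # normalizes each sample across its features

xb = bn(x)
xl = ln(x)

# batch norm: a single feature (column) has mean ~0, std ~1 across the batch
print(xb[:, 0].mean().item(), xb[:, 0].std().item())

# layer norm: a single sample (row) has mean ~0, std ~1 across its features
print(xl[0, :].mean().item(), xl[0, :].std().item())
```

Batch norm makes each column (feature) roughly unit Gaussian across the batch, while layer norm makes each row (sample) roughly unit Gaussian across its features, which is the behavior the manual `LayerNorm1d` class below reproduces.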
```python
# SOURCE: https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html
```

```python
Image(filename = 'layer-norm-formula.png', width=225)
```

![png](README_files/README_63_0.png)

```python
class LayerNorm1d: # (used to be BatchNorm1d)
    """Implements 1D Layer Normalization to stabilize and normalize input activations."""

    def __init__(self, dim, eps=1e-5, momentum=0.1):
        self.eps = eps
        self.gamma = torch.ones(dim)
        self.beta = torch.zeros(dim)

    def __call__(self, x):
        # calculate the forward pass
        xmean = x.mean(1, keepdim=True) # per-sample mean over the features
        xvar = x.var(1, keepdim=True) # per-sample variance over the features
        xhat = (x - xmean) / torch.sqrt(xvar + self.eps) # normalize to unit variance
        self.out = self.gamma * xhat + self.beta
        return self.out

    def parameters(self):
        return [self.gamma, self.beta]

torch.manual_seed(1337)
module = LayerNorm1d(100)
x = torch.randn(32, 100) # batch size 32 of 100-dimensional vectors
x = module(x)
x.shape
```

    torch.Size([32, 100])

## Understanding How Layer Normalization Works in Our GPT Model

In our GPT model, we use layer normalization to help stabilize and improve the training of our deep neural network. Let’s consider an example where we have a `batch_size` of 32, and each input vector has 100 dimensions. This means we have 32 samples (vectors), each with 100 features. When we pass these vectors through a layer normalization layer, we ensure that each sample is normalized across its features. Specifically, for each individual sample in the batch, we compute the mean and standard deviation across its features and adjust them so that they have a mean of zero and a standard deviation of one. Here’s how we implement layer normalization:

```python
# x has a shape of (batch_size, num_features), e.g., (32, 100)
xmean = x.mean(1, keepdim=True)  # compute the mean across features for each sample
xvar = x.var(1, keepdim=True)    # compute the variance across features for each sample
x_normalized = (x - xmean) / torch.sqrt(xvar + 1e-5)  # normalize each sample
```

In this code:

* `xmean` is calculated by taking the mean of `x` across dimension 1, which corresponds to the feature dimension. We use `keepdim=True` to maintain the dimensionality for broadcasting.
* `xvar` is the variance computed similarly across the features of each sample.
* `x_normalized` is the result of subtracting the mean and dividing by the standard deviation (square root of variance plus a small epsilon to prevent division by zero).

By changing the dimension from 0 to 1 in the `mean` and `var` functions, we’re computing the statistics across the features of each individual sample rather than across the batch. This means we’re normalizing each sample independently, and the normalization does not depend on other samples in the batch. Initially, if we had used:

```python
xmean = x.mean(0, keepdim=True)  # mean across the batch for each feature
xvar = x.var(0, keepdim=True)    # variance across the batch for each feature
```

This would have computed the mean and variance across the batch dimension for each feature (column). In this case, we would be normalizing each feature across all samples in the batch, which is what batch normalization does. However, since we’re implementing layer normalization, we use:

```python
xmean = x.mean(1, keepdim=True)  # mean across features for each sample
xvar = x.var(1, keepdim=True)    # variance across features for each sample
```

With layer normalization, the columns (features) are not normalized across the batch. Instead, each sample’s features are normalized based on that sample’s own mean and variance.
This ensures that the normalization is independent of the batch size and the data in other samples. By normalizing each sample individually, we help the model to perform consistently regardless of the batch composition, which is particularly useful in models like Transformers where sequences can have varying lengths, and batching can be complex. In summary, layer normalization adjusts the activations (outputs) of each sample so that they have a mean of zero and a standard deviation of one across their features. This helps the network learn more effectively by preventing internal covariate shift and ensuring that the scale of the inputs to each layer remains consistent. ```python x[:,0].mean(), x[:,0].std() # mean,std of one feature across all batch inputs ``` (tensor(0.1469), tensor(0.8803)) ```python x[0,:].mean(), x[0,:].std() # mean,std of a single input from the batch, of its features ``` (tensor(-3.5763e-09), tensor(1.0000)) ## Understanding the Pre-Norm Formulation in Transformer Models In the original Transformer model described in the **“Attention Is All You Need”** paper, the **Add & Norm** (addition and layer normalization) steps are applied after the main transformations like self-attention and feed-forward networks. However, in more recent implementations, it’s common to apply layer normalization before these transformations. This approach is called the **Pre-Norm Formulation**. Applying layer normalization before the transformations helps stabilize the training of deep neural networks. It ensures that the inputs to each layer have a consistent scale and distribution, which makes it easier for the network to learn effectively. In our `Block` class, which represents a single Transformer block, we implement this by adding two layer normalization layers in the constructor: ```python self.ln1 = nn.LayerNorm(n_embd) # first layer norm for self-attention self.ln2 = nn.LayerNorm(n_embd) # second layer norm for feed-forward network ``` Here, `n_embd` is the embedding dimension—the size of the vector that represents each token (like a word or character) in our sequence. In the `forward` method of the `Block` class, we apply the layer norms before passing the data to the self-attention and feed-forward layers: ```python def forward(self, x): x = x + self.sa(self.ln1(x)) # apply layer norm before self-attention x = x + self.ffwd(self.ln2(x)) # apply layer norm before feed-forward network return x ``` By normalizing `x` before each transformation, we help the model learn better and more stable representations. This change reflects modern best practices in training Transformer models, allowing our deep neural network to train more effectively, leading to improved performance in tasks like language modeling. ## Understanding Layer Normalization and Scaling Up Our GPT Model In our GPT model, we set the embedding size `n_embd` to 32. This means each token in our sequence is represented by a vector of 32 numbers. When we apply layer normalization, we normalize these features by calculating the mean and variance over these 32 numbers for each token. The batch size (`B`) and the sequence length (`T`) act as batch dimensions, so the normalization happens per token independently. This ensures that each token’s features have a mean of zero and a standard deviation of one at initialization. Layer normalization includes trainable parameters called gamma (γ) and beta (β), which allow the model to scale and shift the normalized outputs during training. 
In our implementation, we initialize them as follows: ```python self.gamma = torch.ones(dim) self.beta = torch.zeros(dim) ``` Here, dim is the embedding dimension (`n_embd`). While the initial output after normalization might be unit Gaussian, the optimization process during training adjusts these parameters to find the best scale and shift for the data. In the `BigramLanguageModel` class, we add a final layer normalization layer at the end of the Transformer, right before the last linear layer that decodes the embeddings into logits for the vocabulary. This is done in the constructor: ```python self.ln_f = nn.LayerNorm(n_embd) # final layer norm ``` To scale up our model and make it more powerful, we introduce the variable `n_layer` in the `BigramLanguageModel` constructor. This variable specifies how many layers of `Block` modules we stack together: ```python self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)]) ``` Each `Block` consists of multi-head self-attention and a feed-forward neural network, along with residual connections and layer normalization. We also introduce `n_head` which specifies the number of attention heads in our multi-head attention mechanism. By increasing `n_layer` and `n_head`, we can make our model deeper and allow it to capture more complex patterns in the data. In summary, by properly applying layer normalization and scaling up the model with more layers (`n_layer`) and attention heads (`n_head`), we enhance the model’s ability to learn and generalize from the data. This approach ensures our deep neural network remains stable and effective during training. ### Layer Norm Formula ```python Image(filename = 'layer-norm-formula.png', width=225) ``` ![png](README_files/README_74_0.png) ```python Image(filename = 'block.png', width=125) ``` ![png](README_files/README_75_0.png) ## Adding Dropout to Improve the GPT Model In our GPT model, we introduce a technique called dropout to prevent overfitting and improve the model’s ability to generalize to new data. Dropout works by randomly “turning off” or setting to zero a subset of neurons during each training pass. This means that every time the model processes data during training, it uses a slightly different network configuration. At test time, all neurons are active, and the model benefits from the combined knowledge of these different configurations. We add dropout layers at specific points in our model to enhance regularization: 1. In the `FeedForward` class constructor, we add `dropout` right before connecting back to the residual pathway. This ensures that some neurons in the feed-forward network are randomly ignored during training: ```python self.dropout = nn.Dropout(dropout) ``` 2. In the `MultiHeadAttention` class constructor, we include `dropout` after the attention heads have been processed. This helps prevent the model from becoming too dependent on any single attention pathway: ```python self.dropout = nn.Dropout(dropout) ``` 3. In the `Head` class constructor, we add `dropout` after calculating the attention weights (affinities) and applying the softmax function. This randomly prevents some nodes from communicating, adding a layer of regularization to the attention mechanism: ```python self.dropout = nn.Dropout(dropout) ``` By incorporating dropout in these areas, we effectively train an ensemble of smaller sub-networks within our larger network. Each sub-network learns slightly different patterns, and when combined, they make the overall model more robust. 
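As a small standalone illustration of the mechanics (using an exaggerated dropout rate of 0.5 rather than the model’s actual setting), the sketch below shows how `nn.Dropout` behaves differently in training and evaluation modes:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(0.5)   # exaggerated rate to make the effect visible
x = torch.ones(2, 8)

drop.train()             # training mode: randomly zero elements
print(drop(x))           # roughly half the entries are 0, the rest are 2.0
                         # (survivors are scaled by 1 / (1 - p))

drop.eval()              # evaluation mode: dropout is disabled
print(drop(x))           # all ones, unchanged
```

During training the surviving activations are rescaled so their expected value stays the same, and at evaluation time the layer simply passes inputs through unchanged.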
This technique is especially useful when scaling up models, as it reduces the risk of overfitting to the training data and improves performance on unseen data. In summary, dropout enhances our GPT model by: * Randomly disabling neurons during training, which prevents the model from relying too heavily on any single neuron. * Encouraging the network to learn more generalized features that are useful across different subsets of the data. * Improving the model’s ability to generalize to new, unseen inputs by reducing overfitting. This addition ensures that our model remains effective and reliable as it becomes more complex. ### Dropout Layer ```python # https://jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf ``` ```python Image(filename = 'dropout.png', width=425) ``` ![png](README_files/README_80_0.png) ## Full Finished Code You may want to refer directly to the git repo instead though. ```python import torch import torch.nn as nn from torch.nn import functional as F # hyperparameters batch_size = 16 # number of independent sequences to process in parallel block_size = 32 # maximum context length for predictions max_iters = 5000 # total number of training iterations eval_interval = 100 # interval for evaluating the model on validation set learning_rate = 1e-3 # learning rate for the optimizer device = 'cuda' if torch.cuda.is_available() else 'mps' if torch.backends.mps.is_available() else 'cpu' # device to run the model on eval_iters = 200 # number of iterations to estimate loss n_embd = 64 # embedding dimension n_head = 4 # number of attention heads n_layer = 4 # number of transformer blocks dropout = 0.0 # dropout rate for regularization # ------------ torch.manual_seed(1337) # for reproducibility # load the dataset # make sure to have 'input.txt' file in your working directory # you can download it using: wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt with open('input.txt', 'r', encoding='utf-8') as f: text = f.read() # create a mapping from characters to integers chars = sorted(list(set(text))) vocab_size = len(chars) # create a mapping from characters to indices and vice versa stoi = { ch:i for i,ch in enumerate(chars) } # string to index itos = { i:ch for i,ch in enumerate(chars) } # index to string encode = lambda s: [stoi[c] for c in s] # encoder: string to list of integers decode = lambda l: ''.join([itos[i] for i in l]) # decoder: list of integers to string # prepare the dataset data = torch.tensor(encode(text), dtype=torch.long) n = int(0.9 * len(data)) # split 90% for training, 10% for validation train_data = data[:n] val_data = data[n:] # function to generate a batch of data def get_batch(split): """ Generate a batch of input and target sequences for training. Args: split (str): 'train' or 'val' to select the dataset split. Returns: x (torch.Tensor): Input tensor of shape (batch_size, block_size). y (torch.Tensor): Target tensor of shape (batch_size, block_size). 
""" # select the appropriate data split data = train_data if split == 'train' else val_data # randomly choose starting indices for each sequence in the batch ix = torch.randint(len(data) - block_size, (batch_size,)) # collect sequences of length 'block_size' starting from each index x = torch.stack([data[i:i+block_size] for i in ix]) y = torch.stack([data[i+1:i+block_size+1] for i in ix]) # move data to the appropriate device x, y = x.to(device), y.to(device) return x, y # function to estimate the loss on training and validation sets @torch.no_grad() def estimate_loss(): """ Estimate the average loss over several iterations for both training and validation datasets. Returns: out (dict): Dictionary containing average losses for 'train' and 'val'. """ out = {} model.eval() # set the model to evaluation mode for split in ['train', 'val']: losses = torch.zeros(eval_iters) for k in range(eval_iters): X, Y = get_batch(split) # get a batch of data logits, loss = model(X, Y) # forward pass losses[k] = loss.item() # store the loss out[split] = losses.mean() # compute the average loss model.train() # set the model back to training mode return out class Head(nn.Module): """One head of self-attention.""" def __init__(self, head_size): """ Initialize the self-attention head. Args: head_size (int): Dimensionality of the key, query, and value vectors. """ super().__init__() # linear projections for keys, queries, and values self.key = nn.Linear(n_embd, head_size, bias=False) self.query = nn.Linear(n_embd, head_size, bias=False) self.value = nn.Linear(n_embd, head_size, bias=False) # register a lower triangular matrix for masking future positions self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size))) # dropout layer for regularization self.dropout = nn.Dropout(dropout) def forward(self, x): """ Perform the forward pass of the self-attention head. Args: x (torch.Tensor): Input tensor of shape (B, T, C). Returns: out (torch.Tensor): Output tensor of shape (B, T, head_size). """ B, T, C = x.shape # compute keys, queries, and values k = self.key(x) # (B, T, head_size) q = self.query(x) # (B, T, head_size) v = self.value(x) # (B, T, head_size) # compute attention scores using scaled dot-product wei = q @ k.transpose(-2, -1) * C**-0.5 # (B, T, T) # apply causal mask to prevent attending to future positions wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # convert scores to probabilities wei = F.softmax(wei, dim=-1) # (B, T, T) wei = self.dropout(wei) # apply dropout # compute the weighted sum of values out = wei @ v # (B, T, head_size) return out class MultiHeadAttention(nn.Module): """Multiple self-attention heads in parallel.""" def __init__(self, num_heads, head_size): """ Initialize the multi-head attention module. Args: num_heads (int): Number of attention heads. head_size (int): Size of each head. """ super().__init__() # create a list of attention heads self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)]) # linear projection to combine the outputs of all heads self.proj = nn.Linear(n_embd, n_embd) # dropout layer self.dropout = nn.Dropout(dropout) def forward(self, x): """ Perform the forward pass of multi-head attention. Args: x (torch.Tensor): Input tensor of shape (B, T, C). Returns: out (torch.Tensor): Output tensor of shape (B, T, C). 
""" # concatenate the outputs from all attention heads out = torch.cat([h(x) for h in self.heads], dim=-1) # (B, T, C) # apply linear projection and dropout out = self.dropout(self.proj(out)) return out class FeedForward(nn.Module): """A simple feed-forward neural network.""" def __init__(self, n_embd): """ Initialize the feed-forward network. Args: n_embd (int): Embedding dimension. """ super().__init__() # define a two-layer MLP self.net = nn.Sequential( nn.Linear(n_embd, 4 * n_embd), # expand dimensionality nn.ReLU(), # non-linearity nn.Linear(4 * n_embd, n_embd), # project back to original size nn.Dropout(dropout), # dropout for regularization ) def forward(self, x): """ Perform the forward pass of the feed-forward network. Args: x (torch.Tensor): Input tensor of shape (B, T, C). Returns: torch.Tensor: Output tensor of the same shape. """ return self.net(x) class Block(nn.Module): """Transformer block: communication followed by computation.""" def __init__(self, n_embd, n_head): """ Initialize the transformer block. Args: n_embd (int): Embedding dimension. n_head (int): Number of attention heads. """ super().__init__() head_size = n_embd // n_head # size of each attention head # multi-head self-attention self.sa = MultiHeadAttention(n_head, head_size) # feed-forward network self.ffwd = FeedForward(n_embd) # layer normalizations self.ln1 = nn.LayerNorm(n_embd) self.ln2 = nn.LayerNorm(n_embd) def forward(self, x): """ Perform the forward pass of the transformer block. Args: x (torch.Tensor): Input tensor of shape (B, T, C). Returns: torch.Tensor: Output tensor of the same shape. """ # apply layer norm and self-attention, then add residual connection x = x + self.sa(self.ln1(x)) # apply layer norm and feed-forward network, then add residual connection x = x + self.ffwd(self.ln2(x)) return x class BigramLanguageModel(nn.Module): """Language model based on the Transformer architecture.""" def __init__(self): """ Initialize the language model. The model consists of token embeddings, positional embeddings, multiple transformer blocks, and a final linear layer to produce logits. """ super().__init__() # token embedding table: maps token indices to embedding vectors self.token_embedding_table = nn.Embedding(vocab_size, n_embd) # positional embedding table: learns embeddings for positions in the sequence self.position_embedding_table = nn.Embedding(block_size, n_embd) # stack of transformer blocks self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)]) # final layer normalization self.ln_f = nn.LayerNorm(n_embd) # linear layer to project embeddings to vocabulary logits self.lm_head = nn.Linear(n_embd, vocab_size) def forward(self, idx, targets=None): """ Perform the forward pass of the language model. Args: idx (torch.Tensor): Input tensor of token indices with shape (B, T). targets (torch.Tensor, optional): Target tensor for computing loss. Returns: logits (torch.Tensor): Logits tensor of shape (B, T, vocab_size). loss (torch.Tensor or None): Cross-entropy loss if targets are provided. 
""" B, T = idx.shape # get token embeddings for each token in the sequence tok_emb = self.token_embedding_table(idx) # (B, T, C) # get positional embeddings for each position in the sequence pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T, C) # add token and positional embeddings to get the input to transformer blocks x = tok_emb + pos_emb # (B, T, C) # pass through the stack of transformer blocks x = self.blocks(x) # (B, T, C) # apply final layer normalization x = self.ln_f(x) # (B, T, C) # compute logits for the next token prediction logits = self.lm_head(x) # (B, T, vocab_size) # if targets are provided, compute the loss if targets is None: loss = None else: # reshape logits and targets for computing cross-entropy loss B, T, C = logits.shape logits = logits.view(B*T, C) targets = targets.view(B*T) # compute the loss loss = F.cross_entropy(logits, targets) return logits, loss def generate(self, idx, max_new_tokens): """ Generate new text by sampling from the language model. Args: idx (torch.Tensor): Input tensor of shape (B, T) containing the context. max_new_tokens (int): Number of new tokens to generate. Returns: idx (torch.Tensor): Tensor of shape (B, T + max_new_tokens) with generated tokens. """ for _ in range(max_new_tokens): # ensure the context does not exceed the block size idx_cond = idx[:, -block_size:] # get the predictions logits, _ = self(idx_cond) # focus on the last time step logits = logits[:, -1, :] # (B, vocab_size) # convert logits to probabilities probs = F.softmax(logits, dim=-1) # (B, vocab_size) # sample the next token from the probability distribution idx_next = torch.multinomial(probs, num_samples=1) # (B, 1) # append the new token to the sequence idx = torch.cat((idx, idx_next), dim=1) # (B, T+1) return idx # instantiate the model and move it to the appropriate device model = BigramLanguageModel().to(device) # print the number of parameters (in millions) print(sum(p.numel() for p in model.parameters())/1e6, 'M parameters') # create an optimizer for updating the model parameters optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate) # training loop for iter in range(max_iters): # every eval_interval iterations, evaluate the model on the validation set if iter % eval_interval == 0 or iter == max_iters - 1: losses = estimate_loss() print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}") # get a batch of training data xb, yb = get_batch('train') # compute the loss and gradients logits, loss = model(xb, yb) optimizer.zero_grad(set_to_none=True) loss.backward() optimizer.step() # generate text from the model context = torch.zeros((1, 1), dtype=torch.long, device=device) # starting token (e.g., ) generated_sequence = model.generate(context, max_new_tokens=2000)[0].tolist() # decode and print the generated text print(decode(generated_sequence)) ``` 0.209729 M parameters step 0: train loss 4.4116, val loss 4.4022 step 100: train loss 2.6568, val loss 2.6670 step 200: train loss 2.5091, val loss 2.5060 step 300: train loss 2.4196, val loss 2.4336 step 400: train loss 2.3503, val loss 2.3565 step 500: train loss 2.2965, val loss 2.3127 step 600: train loss 2.2410, val loss 2.2501 step 700: train loss 2.2048, val loss 2.2186 step 800: train loss 2.1636, val loss 2.1864 step 900: train loss 2.1242, val loss 2.1504 step 1000: train loss 2.1024, val loss 2.1291 step 1100: train loss 2.0690, val loss 2.1176 step 1200: train loss 2.0377, val loss 2.0795 step 1300: train loss 2.0229, val loss 2.0622 step 
1400: train loss 1.9922, val loss 2.0357 step 1500: train loss 1.9706, val loss 2.0315 step 1600: train loss 1.9618, val loss 2.0465 step 1700: train loss 1.9409, val loss 2.0130 step 1800: train loss 1.9077, val loss 1.9936 step 1900: train loss 1.9078, val loss 1.9855 step 2000: train loss 1.8825, val loss 1.9938 step 2100: train loss 1.8711, val loss 1.9750 step 2200: train loss 1.8579, val loss 1.9596 step 2300: train loss 1.8543, val loss 1.9528 step 2400: train loss 1.8401, val loss 1.9418 step 2500: train loss 1.8150, val loss 1.9439 step 2600: train loss 1.8234, val loss 1.9347 step 2700: train loss 1.8118, val loss 1.9318 step 2800: train loss 1.8048, val loss 1.9225 step 2900: train loss 1.8070, val loss 1.9296 step 3000: train loss 1.7953, val loss 1.9239 step 3100: train loss 1.7688, val loss 1.9158 step 3200: train loss 1.7511, val loss 1.9081 step 3300: train loss 1.7580, val loss 1.9045 step 3400: train loss 1.7561, val loss 1.8935 step 3500: train loss 1.7398, val loss 1.8928 step 3600: train loss 1.7244, val loss 1.8893 step 3700: train loss 1.7305, val loss 1.8828 step 3800: train loss 1.7180, val loss 1.8852 step 3900: train loss 1.7196, val loss 1.8693 step 4000: train loss 1.7148, val loss 1.8605 step 4100: train loss 1.7127, val loss 1.8744 step 4200: train loss 1.7071, val loss 1.8654 step 4300: train loss 1.7023, val loss 1.8460 step 4400: train loss 1.7052, val loss 1.8656 step 4500: train loss 1.6899, val loss 1.8512 step 4600: train loss 1.6862, val loss 1.8300 step 4700: train loss 1.6828, val loss 1.8413 step 4800: train loss 1.6659, val loss 1.8388 step 4900: train loss 1.6686, val loss 1.8351 step 4999: train loss 1.6622, val loss 1.8221 Foast. MENENIUS: Prave is your niews? I cank, COmine. I well torms, beary. HENRY WARWORDriown: The Papoinst proy way as home but exfulings begt as liht; Lyief, away, friom is of bulb. HENRY BOLINA: What Than what you suffect toogny! That prope of so pity this badoggent; Stame deck untiless, Their laters you Is you Tow my such in mamy that prongmanoe, Anjoth then your usequind, my would wontimn; Thou prove to day them as it? SITUS: Yeas staw his Kingdeed our chall: But now this dray. ROMEO: O, upon to death! him not this bornorow-prince. My sunder's like us. But you wilerss armiss brond, Stayle my becul'st I say, your bear shalle I mone faults not fleathms ell spraver of it she wongrame and broth of his it. But reven. WARY HARDONTIO: Qumper! what voishmes! Good liff tumbuntincaed up us. AUCHIOPOM: Therefort them, but In to sproved. KING RICHARD II: Come, dreivide, But twas oot, for and sirring to to a but mantore your bond wedaus thee. VORK: For which his lictless me, gurse? Uhould dried: To now, alm? I wherse fortune deque; To least my not thinged weouly entount. Cewle ther, Nont loung, you Vilive: Let thou beves thou one true toges amont; There twfined me. If your cause with and Thost the will langed! So morman, mad the'e noccust to knot Hench when is the underer you: if The I hom blidess one lip We is maid weak'd a bed'sime, maday, And then you pringent, and what, for there is a gring, And is ear aftiffed where diswer. Make slendow to nit, You loved, my tonte mind hath dels in wor flords. ISABELLA: Whult bear your sont On is Sup Where not: I bust ma! part you bring, thou met dincedts them thee towly him, But a frust those, if you would kingt. TROM First: It, Jurets both our too right or lmed of hide not these dut o' the ploss you. And I known, the piors, time say as day BI thy God came to time. 
I'll would is bring; Lorde, What, his arm he nobt That boved fireive, what evert togen our whus. ISABELLA: You our loverd would let before elcome see, Which ha