Summary: Let's build GPT: from scratch, in code, spelled out. - YouTube (www.youtube.com)
20,700 words - YouTube video
One Line
The video discusses the transformer architecture behind ChatGPT and GPT models, including the training process, tokenization, self-attention, and the importance of scaled attention and layer normalization in improving performance.
Key Points
- ChatGPT is an AI system that completes text-based tasks and has gained popularity in the AI community.
- ChatGPT is based on the transformer architecture, which has had a significant impact on the field of AI.
- A character-level GPT language model can be trained using the "tiny shakespeare" dataset to generate infinite Shakespeare-like text.
- The training process involves tokenizing the text, splitting it into training and validation sets, and training the transformer model on text sequences.
- Self-attention is a key component of the transformer model that allows tokens to communicate and understand context for better predictions.
- Layer normalization and skip connections are added to optimize deep neural networks in the transformer model.
- The transformer model can be scaled up by adding more layers and heads, and dropout can be used to prevent overfitting.
- Pretraining and fine-tuning stages are involved in training GPT for specific tasks, such as question-answering.
Summaries
489 word summary
ChatGPT uses the transformer architecture and requires extensive training.
Build GPT, generate Shakespeare.
Python, calculus/statistics understanding needed. Google Colab Jupyter notebook utilized. Shakespeare dataset converted to character sequence. Encoding/decoding shown with lookup table.
GPT generates text by encoding characters and tokenizing a Shakespeare dataset.
Chunking improves context
Input into transformer considers time and batch dimensions, with random offsets generating data chunks stacked into a tensor. The input is a 4x8 tensor representing training set chunks, containing 32 independent examples.
Transformer predicts integers with GPT.
Reshape input, evaluate loss, generate function.
Generation process creates tokens. Softmax converts logits to probabilities. Loss is generated if targets provided. Process starts with batch size 1. Resulting indices converted to Python list.
Train GPT with Adam optimization.
Model progress improves with more tokens. Transformer model introduces interaction. Code includes parameters, encoding, decoding, and language model. GPU support for faster processing.
Generation context created on the device, estimated loss function, eval/train model phases, torch context manager, 120 lines of code, output includes loss and sample. Attention block, mathematical trick, sample with batches and time components.
Tokens communicate context.
Efficiently calculate token averages with lower-triangular matrix multiplication, normalizing rows to sum to 1 so sums become averages.
Matrix manipulation for incremental averages.
Softmax normalizes data, determines token affinity, uses weighted aggregation in self-attention, and introduces an intermediate embedding phase.
Build GPT with linear transformations, embedding tables, and self-attention blocks. Initialize token affinity at 0.
Tokens use self-attention for interaction.
Tokens communicate based on affinity and aggregation is done by producing a value vector.
Attention operates over a directed graph; here there are 8 nodes. Positional encoding assigns positions to the nodes. Batch elements are processed independently. Future tokens do not pass information back to past tokens.
Encoder blocks (no mask) suit tasks like sentiment prediction, while GPT keeps the decoder's triangular mask. Cross-attention pulls keys and values from a separate source. Scaled attention is important for normalization.
Softmax converges towards one-hot vectors, sharpens when its inputs are multiplied by 8, and is kept diffuse by scaling. Self-attention logic is gathered in a head module.
Implement multi-head attention with self-attention, normalization, aggregation, and feeding.
GPT uses parallel communication channels. Tokens compute independently.
Validation loss decreases with integrated communication and computation using multi-head self-attention and feed-forward network. Skip connections improve deep neural network optimization.
Residual blocks contribute little at initialization and come online gradually. Projection layer added. Communication and computation performed. Inner layer multiplied by 4. Training results in validation loss of 2.08 with some overfitting. Layer norm improves deep neural networks.
Layer normalization in transformer model improves performance by normalizing rows.
Improved transformer model with scaled-up parameters achieves reduced validation loss.
Transformer generates nonsensical text.
Decoder-only transformer
Encoder-decoder architecture emphasized in original paper.
Transformer without triangular mask, encoder-decoder model, decoder-only transformer, nanoGPT's train.py training code and near-identical model, multiple heads in causal self-attention block, identical communication and computation phases in transformer blocks.
GPT building process and training stages.
GPT fine-tuning uses question-answer pairs to align the model as an assistant. Reviewers rank the pairs for a reward model, resulting in a question-answer system. There are larger versions of GPT available.
Fine tuning required
520 word summary
ChatGPT generates multiple outcomes, has its interactions indexed online, and is based on the transformer architecture, but requires extensive training.
Build GPT from scratch, generate Shakespeare.
Python, calculus/statistics understanding needed. Google Colab Jupyter notebook utilized. Shakespeare dataset converted to character sequence. Encoding/decoding shown with lookup table.
GPT encodes characters, tokenizes Shakespeare dataset, generates text.
Chunked text improves transformer context.
Input into transformer considers time and batch dimensions. Random offsets generate data chunks stacked into a tensor. Input is a 4x8 tensor with rows representing training set chunks. Contains 32 independent examples.
Transformer predicts integers with GPT.
Reshape input, evaluate loss, generate function.
Generation process creates new tokens. Softmax converts logits to probabilities. Loss is generated if targets provided. Process starts with batch size 1. Resulting indices converted to Python list.
Train GPT with Adam optimization.
Model progress improves with more tokens. Transformer model introduces interaction. Code includes parameters, encoding, decoding, and language model. GPU support for faster processing.
Generation context created on the device, estimated loss function, eval/train model phases, torch context manager, 120 lines of code, output includes loss and sample, attention block, mathematical trick, sample with batches and time components.
Tokens communicate with past context.
Efficiently calculate token averages with lower-triangular matrix multiplication, normalizing rows to sum to 1 so sums become averages.
Matrix manipulation for incremental averages.
Softmax normalizes data, determines affinity between tokens, and uses weighted aggregation in self-attention. Intermediate embedding phase introduced.
Build GPT with linear transformations, embedding tables, and self-attention blocks. Initialize token affinity at 0.
Tokens use self-attention for interaction.
Tokens communicate based on affinity. Aggregation is done by producing a value vector.
Attention operates over a directed graph; here there are 8 nodes. Positional encoding gives nodes position. Batch elements processed independently. Future tokens do not pass information back to past tokens.
Encoder blocks (no mask) suit tasks like sentiment prediction; GPT is decoder-only. Cross-attention pulls information from a separate source. Scaled attention is important for normalization.
Softmax converges towards one-hot vectors with extreme numbers, sharpens when multiplied by 8, and is kept diffuse by scaling. Self-attention logic lives in the head module.
Create self-attention, normalize, aggregate, feed. Implement multi-head attention.
GPT uses parallel communication channels. Tokens compute independently.
Validation loss decreases with integrated communication and computation using multi-head self-attention and feed-forward network. Skip connections improve deep neural network optimization.
Original blocks initialized gradually. Projection layer added. Communication and computation performed. Inner layer multiplied by 4. Training results in validation loss of 2.08 with overfitting. Layer norm improves deep neural networks.
Layer normalization in transformer model improves performance by normalizing rows instead of columns.
Transformer model scaled up with improved parameters, reduced validation loss.
Transformer model generates nonsensical Shakespearean text.
Decoder-only transformer for language modeling.
Original paper focuses on encoder-decoder architecture.
Transformer used without triangular mask, encoder encodes French sentence, decoder incorporates encoder's outputs. Decoder-only transformer imitates a text file. In nanoGPT, train.py is the training code and model.py defines a near-identical model. nanoGPT's causal self-attention block has multiple heads in a batched manner. Transformer blocks have identical communication and computation phases.
GPT building process and training stages.
GPT fine-tuning aligns model as assistant. Uses question-answer pairs. Ranked by reviewers for reward model. Creates question-answer system. Larger GPT versions available.
Fine tuning required for specific tasks.
1014 word summary
ChatGPT, an AI system that completes text-based tasks, is popular in the AI community. It generates multiple outcomes and has websites indexing interactions. ChatGPT is a language model based on the transformer architecture, but extensive training is needed to reproduce it.
Train a character-level GPT language model using the "tiny shakespeare" dataset. Generate infinite Shakespeare. Code available in a GitHub repository called nanoGPT. Train on the OpenWebText dataset to reproduce GPT-2 performance. Write repository from scratch, define transformer piece by piece, train on tiny shakespeare dataset.
Python proficiency, calculus/statistics knowledge, and previous neural network language model understanding required. Google Colab Jupyter notebook used. Shakespeare dataset downloaded. Text converted to character sequence. Encoding/decoding process demonstrated using lookup table.
GPT uses character-level encoding, tokenizes Shakespeare dataset, splits into training/validation sets. Goal: generate Shakespeare-like text.
Text sequences are chunked for training the transformer model, improving context prediction.
Feeding inputs into transformer considers time and batch dimensions. Random offsets generate chunks of data stacked into a tensor for processing. Input is a 4x8 tensor with rows representing training set chunks. Contains 32 independent examples.
The transformer predicts integers using a neural network and GPT language model.
Reshape input, evaluate loss, generate function extends input for model predictions.
The generation process creates new tokens based on previous predictions and indices. Softmax converts logits to probabilities for sampling. A loss is generated if targets are provided. The process starts with a batch of size 1 and removes the batch dimension. The resulting indices are converted to a Python list.
Improve GPT model by training with Adam optimization for better output results.
The model's progress improves with more tokens, but lacks interaction. Introducing the transformer model solves this. The code is converted and includes parameters, encoding, decoding, and a language model. GPU support is added for faster processing.
The context tensor that kicks off generation must be created on the device. An estimated loss function reduces noise in measuring the current loss during training. Setting the model to evaluation phase and resetting it to training phase is explained, although it doesn't yet affect this model. A torch.no_grad context manager optimizes memory usage. The script is around 120 lines of code and will be released later. Output includes train loss, val loss, and a sample produced by the model. The first attention block for processing tokens is almost ready to be written. A mathematical trick used in self-attention within transformers is introduced, illustrated with a sample containing batch and time components.
Tokens in a sequence can communicate with their past context by averaging preceding elements.
Efficiently calculate token averages using lower-triangular matrix multiplication, normalizing rows to sum to 1 so sums become averages.
Matrix manipulation calculates incremental averages using vectorization and weighted aggregation with batch matrix multiplication.
Softmax normalizes data and determines affinity between tokens. Weighted aggregation is used in self-attention. Intermediate embedding phase introduced.
Build GPT from scratch using linear transformation, embedding tables, and self-attention blocks. Initialize token affinity at 0.
Self-attention allows tokens in a sequence to gather information from each other. Tokens emit queries and keys to calculate interaction. Linear modules generate queries and keys, and matrix multiplication calculates attention weights. Attention weights vary for each batch element.
Tokens in self-attention communicate based on affinity determined by dot product of query and key vectors. Aggregation is done by producing a value vector.
Attention is a communication mechanism over a directed graph; in this example there are 8 nodes. Positional encoding gives nodes a sense of position. Batch elements are processed independently. Past tokens never receive information from future tokens.
An encoder block (no mask) would suit tasks like predicting sentiment; GPT keeps the decoder's triangular mask. Cross-attention is used when keys and values come from a separate source. Attention is flexible, and scaling it is important for normalization.
Softmax converges towards one-hot vectors with extreme numbers, sharpens when multiplied by 8, and is kept diffuse by scaling. Self-attention logic lives in the head module.
Create self-attention by applying linear projections, normalize and aggregate values, feed into attention head. Crop context, adjust learning rate and iterations. Implement multi-head attention by running parallel self-attention heads.
GPT uses parallel communication channels for data gathering, incorporating position and token embeddings, attention mechanisms, and feed-forward networks. Tokens compute independently.
Validation loss decreases. Communication and computation integrated with multi-head self-attention and feed-forward network. Skip connections improve optimization in deep neural networks.
Residual blocks contribute little to the residual pathway at initialization and come online gradually during training. Projection layer added. Communication and computation performed. Inner layer of feed-forward network multiplied by 4. Training results in validation loss of 2.08 with overfitting. Layer norm improves optimization of deep neural networks, similar to batch normalization.
Layer normalization in transformer model normalized rows instead of columns, improving performance.
Transformer model scaled up with more layers and heads, adjusted hyperparameters, improved validation loss.
In 15 minutes, transformer model generated recognizable, nonsensical Shakespearean text. The implementation is a decoder-only transformer using a triangular mask for language modeling. The original paper focuses on encoder-decoder architecture for machine translation.
In our video, we show how a transformer is used without a triangular mask, allowing all tokens to communicate freely. The encoder encodes a French sentence and outputs it. The decoder uses cross-attention to incorporate the encoder's outputs. We use a decoder-only transformer because we are imitating a text file. In nanoGPT, train.py is the training code, and the model it defines is nearly identical to the one built here. The causal self-attention block in nanoGPT has multiple heads implemented in a batched manner. The transformer blocks have identical communication and computation phases.
Building GPT involves position and token embeddings, layer norm, and a final linear layer. ChatGPT training has two stages: pretraining and fine-tuning. Pretraining uses a large amount of internet data to train a decoder-only transformer. OpenAI's GPT-3 has 175 billion parameters compared to the created transformer's 10 million. The larger model requires significant infrastructure. After pretraining, the model generates documents but lacks helpful responses to questions. It may generate more questions instead.
GPT fine-tuning involves aligning the model to function as an assistant. Training data consists of question-answer pairs. Responses are ranked by human reviewers to create a reward model. This transforms the model into a question-answer system. Larger versions of GPT exist.
Further stages of fine tuning are needed for specific tasks like alignment and sentiment detection.
3190 word summary
ChatGPT, an AI system that completes text-based tasks, has gained popularity in the AI community. It generates different outcomes based on the same prompt and can provide multiple answers. There are websites that index interactions with ChatGPT, including humorous examples like explaining HTML to a dog or writing release notes for a chess game. ChatGPT is a language model that completes sequences of words based on its understanding of English language patterns. Under the hood, ChatGPT is based on the transformer architecture, which has had a significant impact on the field of AI. However, it is not possible to fully reproduce ChatGPT's capabilities without extensive training and fine-tuning.
We will train a character-level GPT language model using a smaller dataset called "tiny shakespeare," which contains all of Shakespeare's works in a single file. The transformer neural network will predict the next character in a sequence based on the preceding characters and generate character sequences that resemble Shakespeare's language. We can generate infinite Shakespeare once the system is trained. The code to train these transformers is available in a GitHub repository called nanoGPT, which is a simple implementation consisting of two files. By training on the OpenWebText dataset, we can reproduce the performance of GPT-2. In this lecture, we will write the repository from scratch, defining the transformer piece by piece and training it on the tiny shakespeare dataset to generate infinite Shakespeare.
To understand how chat works, a proficiency in Python, basic understanding of calculus and statistics, and familiarity with previous videos on neural network language models is required. A Google Colab Jupyter notebook is used to share code. The Shakespeare dataset is downloaded and the first 1000 characters are printed. The text is then converted into a sequence of characters, creating a vocabulary of possible elements. A strategy is developed to tokenize the input text by translating characters into integers. A code chunk demonstrates the encoding and decoding process using a lookup table.
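A minimal sketch of that lookup-table tokenizer in Python, following the lecture's approach (it assumes the Tiny Shakespeare file has been saved as input.txt):

    # Build a character-level tokenizer from the raw text.
    with open('input.txt', 'r', encoding='utf-8') as f:
        text = f.read()

    chars = sorted(set(text))  # vocabulary: every distinct character in the text
    stoi = {ch: i for i, ch in enumerate(chars)}  # string -> integer lookup table
    itos = {i: ch for i, ch in enumerate(chars)}  # integer -> string lookup table

    encode = lambda s: [stoi[c] for c in s]         # text -> list of integers
    decode = lambda l: ''.join(itos[i] for i in l)  # list of integers -> text

    print(decode(encode("hii there")))  # round-trips back to "hii there"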
There are various ways to encode text into integers, such as SentencePiece or byte pair encoding; OpenAI's GPT uses subword encodings like these, whereas this lecture uses character-level encoding, which gives a tiny vocabulary but long sequences. The training set of Shakespeare can be tokenized using the encode and decode functions. The dataset is split into a training and validation set for assessing overfitting. The goal is to create a neural network that generates Shakespeare-like text.
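Continuing from the tokenizer sketch above, the tokenize-and-split step might look like this (the 90/10 ratio follows the lecture):

    import torch

    # Encode the entire text as one long sequence of integers.
    data = torch.tensor(encode(text), dtype=torch.long)

    n = int(0.9 * len(data))
    train_data = data[:n]  # first 90% is used for training
    val_data = data[n:]    # held-out 10% is used to assess overfitting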
To train the transformer model, chunks of text sequences are used instead of feeding the entire text at once. These chunks have a maximum length called the block size, set here to 8 characters. Each chunk contains multiple examples for the transformer to predict the next character in the sequence. Training on these examples with different context lengths helps the transformer learn to predict characters within various contexts. This is important for later inference when generating text.
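A short illustration of how one chunk of block_size characters yields block_size training examples with growing context, mirroring the lecture's demo:

    block_size = 8  # maximum context length at this stage

    x = train_data[:block_size]
    y = train_data[1:block_size + 1]  # targets are the inputs shifted by one
    for t in range(block_size):
        context = x[:t + 1]  # everything up to and including position t
        target = y[t]        # the character the model should predict next
        print(f"when input is {context.tolist()} the target is {int(target)}")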
When feeding inputs into the transformer, there are two dimensions to consider: time and batch. Multiple chunks of text are stacked in a single tensor for efficiency, but they are processed independently. The code provided generates random offsets to grab chunks of data, which are then stacked into a tensor. The input to the transformer is a 4 by 8 tensor, with each row representing a chunk of the training set, and the targets are used for the loss function. This 4 by 8 array contains 32 independent examples.
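A sketch of the batching function described here, assuming the train_data/val_data tensors from above:

    torch.manual_seed(1337)  # the seed used in the lecture
    batch_size = 4  # number of independent sequences processed in parallel

    def get_batch(split):
        # Sample batch_size random offsets, then stack the chunks into (B, T) tensors.
        data = train_data if split == 'train' else val_data
        ix = torch.randint(len(data) - block_size, (batch_size,))
        x = torch.stack([data[i:i + block_size] for i in ix])
        y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])
        return x, y

    xb, yb = get_batch('train')
    print(xb.shape)  # torch.Size([4, 8]): 4 rows x 8 positions = 32 examples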
The transformer simultaneously processes examples and predicts integers in each position of the tensor. The input is fed into a neural network, specifically the GPT language model. The GPT language model is implemented as a subclass of Module in PyTorch. The input integers are passed into a token embedding table, where each integer corresponds to a row in the table. The rows are arranged into a batch, time, and channel tensor. The model predicts what comes next based on the individual identity of a single token. The predictions are evaluated using the negative log likelihood loss, or cross entropy, to measure their quality.
The target's logit should be high, while the others should be low. However, there is a shape issue that prevents the loss from running: cross entropy expects the channel dimension second, so we reshape the logits and targets variables to conform to the expected dimensions. After reshaping, we can evaluate the loss, which is currently about 4.87, whereas a uniform prediction over the 65 characters would give -ln(1/65) ≈ 4.17, so the initial predictions are not diffuse enough. Now that we can evaluate the model, we want to generate from it. The generate function extends the input to include additional characters.
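A condensed sketch of the bigram model at this stage, including the reshape that cross entropy requires (following the lecture's structure):

    import torch.nn as nn
    from torch.nn import functional as F

    class BigramLanguageModel(nn.Module):
        def __init__(self, vocab_size):
            super().__init__()
            # Each token directly reads off the logits for the next token from a table.
            self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

        def forward(self, idx, targets=None):
            logits = self.token_embedding_table(idx)  # (B, T, C)
            if targets is None:
                return logits, None
            B, T, C = logits.shape
            # F.cross_entropy expects (N, C), so flatten batch and time dimensions.
            loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))
            return logits, loss

    model = BigramLanguageModel(vocab_size=65)  # 65 distinct chars in Tiny Shakespeare
    logits, loss = model(xb, yb)
    print(loss)  # about 4.87 at init, vs. -ln(1/65) ≈ 4.17 for a uniform prediction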
The generation process involves creating new tokens based on previous predictions. The current indices are used to get prediction strengths and generate new indices. The last step in the time dimension is focused on for predicting what comes next. Softmax is used to convert the logits to probabilities, and sampling is done based on the probability distribution. The sampled integers are added to the current sequence of integers. The process can generate a loss if targets are provided, otherwise, it returns the generated sequence. A batch of size 1 is created to kick off the generation with a new line character. The generate function works on batches, so indexing is done to remove the batch dimension. The resulting array of indices is converted to a Python list.
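A sketch of that generation loop (in the lecture it is a method on the model; it is written as a free function here for brevity):

    @torch.no_grad()
    def generate(model, idx, max_new_tokens):
        # idx is a (B, T) tensor of token indices in the current context.
        for _ in range(max_new_tokens):
            logits, _ = model(idx)             # get predictions
            logits = logits[:, -1, :]          # focus on the last time step: (B, C)
            probs = F.softmax(logits, dim=-1)  # convert logits to probabilities
            idx_next = torch.multinomial(probs, num_samples=1)  # sample one token
            idx = torch.cat((idx, idx_next), dim=1)  # append to the running sequence
        return idx

    # Kick off generation with a batch of size 1 holding a single token (index 0,
    # the newline character), then strip the batch dimension and decode.
    context = torch.zeros((1, 1), dtype=torch.long)
    print(decode(generate(model, context, max_new_tokens=100)[0].tolist()))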
We generated random text using the GPT model, but it was garbage because the model is currently random. We want to train the model to improve the results. We will use an optimizer called AdamW, which is more advanced than plain stochastic gradient descent and works well here. We will run the training loop for a certain number of iterations and evaluate the loss. The optimization is happening, but we will increase the number of iterations for better results. After training for longer, the loss improves. Although it won't be Shakespeare, we expect more reasonable output.
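A sketch of the training loop (torch.optim.AdamW with a relatively high learning rate, which is fine for this tiny network):

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

    for step in range(10000):
        xb, yb = get_batch('train')            # sample a fresh batch of data
        logits, loss = model(xb, yb)           # evaluate the loss
        optimizer.zero_grad(set_to_none=True)  # clear gradients from the last step
        loss.backward()                        # backpropagate
        optimizer.step()                       # update the parameters

    print(loss.item())  # a noisy single-batch loss; see estimate_loss below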
The model's progress is improving as the number of tokens increases. However, the current model is simple and the tokens are not interacting with each other to make predictions. To address this, the tokens need to communicate and understand the context in order to make better predictions. This is where the transformer model comes in. The code is converted from a Jupyter notebook to a script to simplify the process. The script includes various parameters and familiar elements such as data encoding, decoding, and creating a language model. Additionally, the ability to run on a GPU is added for faster processing.
The author discusses creating the context tensor that kicks off generation and the importance of creating it on the device. They also mention the need for an estimated loss function to reduce noise in the measurement of the current loss during training. The author explains the practice of setting the model to evaluation phase and resetting it to training phase, although it doesn't currently affect this model. They mention the use of a torch.no_grad context manager to optimize memory usage. The author notes that the script is around 120 lines of code and will be released later. They show an example of the output from running the script, including train loss, val loss, and a sample produced by the model. The author mentions being almost ready to write the first attention block for processing tokens and introduces a mathematical trick used in self-attention within transformers. They create a sample with batch and time components to illustrate the concept at each point in the sequence.
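A sketch of the noise-reducing loss estimator described here, with the eval/train phase toggling and the no-grad context manager:

    eval_iters = 200  # number of batches to average per estimate

    @torch.no_grad()  # no backward pass here, so skip storing intermediates
    def estimate_loss(model):
        out = {}
        model.eval()  # evaluation phase (matters once dropout/batch norm exist)
        for split in ['train', 'val']:
            losses = torch.zeros(eval_iters)
            for k in range(eval_iters):
                x, y = get_batch(split)
                _, loss = model(x, y)
                losses[k] = loss.item()
            out[split] = losses.mean()  # averaging many batches reduces the noise
        model.train()  # reset to training phase
        return out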
To ensure that tokens in a sequence only communicate with their past context, an easy way is to average the preceding elements. While this method may result in a loss of spatial arrangement information, it allows for the calculation of the average vectors for each token in the sequence. By initializing a bag of words at 0 and iterating over the batch dimensions and previous tokens, the average is calculated and stored in a one-dimensional vector. This process allows for communication between tokens while preserving the context of each token's history.
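A sketch of that loop-based "bag of words" averaging on a toy tensor (shapes follow the lecture's example):

    B, T, C = 4, 8, 2  # batch, time, channels
    x = torch.randn(B, T, C)

    # Version 1: explicit loops. Each position holds the average of itself
    # and all preceding tokens in its sequence.
    xbow = torch.zeros((B, T, C))
    for b in range(B):
        for t in range(T):
            xprev = x[b, :t + 1]               # (t+1, C): this token and its past
            xbow[b, t] = torch.mean(xprev, 0)  # average over the time dimension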
We can efficiently calculate the outcome of averaging tokens using matrix multiplication. The trick is to use torch.tril to obtain the lower triangular portion of a matrix of ones, which masks out the positions we want to ignore. By normalizing each row of this weight matrix to sum to 1, the multiplication computes an average instead of a sum.
We can manipulate elements of a matrix to calculate averages in an incremental fashion. Multiplying a lower-triangular weight matrix with the data matrix yields these averages conveniently. To make the process more efficient, we use vectorization: an array of weights specifies how each row is averaged, and batch matrix multiplication applies the weighted aggregation to all batch elements individually. The weights are specified in a T-by-T array, resulting in a weighted sum. The aggregation is performed in a triangular form, where each token only receives information from the tokens preceding it. Another way to achieve the same result is to start from zeros, apply masked_fill to set future positions to negative infinity, and then take a softmax along each row.
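The two matrix-based versions, sketched side by side using the same toy x and xbow as above:

    # Version 2: matrix multiply with a row-normalized lower-triangular matrix.
    wei = torch.tril(torch.ones(T, T))
    wei = wei / wei.sum(1, keepdim=True)  # rows sum to 1 -> averages, not sums
    xbow2 = wei @ x                       # (T, T) @ (B, T, C) -> (B, T, C)

    # Version 3: masked_fill + softmax, the form self-attention will use.
    tril = torch.tril(torch.ones(T, T))
    wei = torch.zeros((T, T))
    wei = wei.masked_fill(tril == 0, float('-inf'))  # future positions get -inf
    wei = F.softmax(wei, dim=-1)  # exponentiate and normalize each row
    xbow3 = wei @ x

    print(torch.allclose(xbow, xbow2), torch.allclose(xbow, xbow3))  # True True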
Softmax is a normalizing operation: each element is exponentiated and divided by the row's sum. Applied to the masked weights, it reproduces the same lower-triangular averaging matrix as before. The mask determines the interaction strength, or affinity, between tokens: setting certain positions to negative infinity excludes them from aggregation. The weighted aggregation of past elements thus depends on their affinity with each other, and this weighted aggregation is exactly what self-attention uses. A lower triangular matrix indicates how much each element contributes to a specific position. This concept is used to develop the self-attention block. Some preliminary changes are made, such as removing unnecessary variable passing and introducing an intermediate embedding phase.
To build GPT from scratch, we start with a language modeling head that uses linear transformation. We encode the tokens' identities and positions using embedding tables. The addition of token and position embeddings creates a matrix that holds both identity and position information. Self-attention is the crux of the model, where a small self-attention block is implemented for a single head. The code achieves a simple average of previous and current tokens, masked and normalized using a lower triangular structure. The affinity between tokens is initialized to 0.
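A sketch of the identity-plus-position encoding step (written standalone here for illustration; in the lecture these tables live inside the model):

    n_embd = 32  # embedding dimension at this stage of the lecture
    vocab_size = 65

    token_embedding_table = nn.Embedding(vocab_size, n_embd)
    position_embedding_table = nn.Embedding(block_size, n_embd)

    def embed(idx):
        B, T = idx.shape
        tok_emb = token_embedding_table(idx)                 # (B, T, n_embd): what each token is
        pos_emb = position_embedding_table(torch.arange(T))  # (T, n_embd): where it sits
        return tok_emb + pos_emb  # broadcasts to (B, T, n_embd): identity + position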
Self-attention is a technique that allows tokens in a sequence to gather information from other tokens. Each token emits a query and a key vector, which are used to calculate the interaction between tokens. By multiplying the queries and keys, we obtain a matrix that represents the attention weights between tokens. This allows for data-dependent information flow and enables tokens to learn more about specific tokens in the sequence. The implementation involves linear modules to generate the queries and keys, followed by a matrix multiplication to calculate the attention weights. This process results in attention weights that vary for each batch element, as each batch contains different tokens at different positions.
In self-attention, tokens communicate based on their affinity. Affinity is determined by the dot product of query and key vectors. The resulting distribution determines how much information to aggregate from each token. The aggregation is done by producing a value vector. The output of a single self-attention head is a 16-dimensional vector. Each token has private information (x) and communicates interesting information (v) to other tokens.
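A sketch of a single self-attention head along the lines built in the lecture (the scaling term is explained a few paragraphs below):

    class Head(nn.Module):
        """One head of self-attention."""
        def __init__(self, head_size, n_embd, block_size):
            super().__init__()
            self.key = nn.Linear(n_embd, head_size, bias=False)
            self.query = nn.Linear(n_embd, head_size, bias=False)
            self.value = nn.Linear(n_embd, head_size, bias=False)
            # tril is not a trainable parameter, so register it as a buffer.
            self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        def forward(self, x):
            B, T, C = x.shape
            k = self.key(x)    # (B, T, head_size): what each token contains
            q = self.query(x)  # (B, T, head_size): what each token is looking for
            # Affinities: scaled dot product of queries and keys -> (B, T, T).
            wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5
            wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))  # no peeking ahead
            wei = F.softmax(wei, dim=-1)
            v = self.value(x)  # what each token communicates when attended to
            return wei @ v     # weighted aggregation of values: (B, T, head_size)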
Attention is a communication mechanism in which nodes in a directed graph aggregate information via weighted sums of nodes that point to them. In this example the graph has 8 nodes, one per position, with each node pointed to by the previous nodes and itself. Attention can be applied to any arbitrary directed graph. There is no notion of space in attention, so positional encoding is necessary to give nodes a sense of position. Elements across the batch dimension are processed independently and do not communicate with each other. In language modeling, past tokens never receive information from future tokens, but this constraint is not necessary in general cases.
An encoder block, where all nodes can communicate freely, suits tasks like predicting the sentiment of a sentence; a decoder block keeps the triangular structure so nodes cannot look ahead and give away the answer. Cross-attention is used when information is pulled from a separate source. Attention in general is more flexible than self-attention. Scaled attention is important for normalization and preserving variance.
Softmax converges towards one-hot vectors when its inputs contain very positive and very negative numbers, and it sharpens towards the maximum value when the tensor values are multiplied by 8. Extreme values at initialization are not desired, so the attention scores are scaled by one over the square root of the head size to control the variance at the start. The self-attention logic is implemented in a Head module for further use.
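A quick demonstration of the sharpening effect described here (the example tensor follows the lecture):

    t = torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5])
    print(F.softmax(t, dim=-1))      # fairly diffuse distribution
    print(F.softmax(t * 8, dim=-1))  # sharpened: most mass piles onto the max
    # Scaling attention scores by 1/sqrt(head_size) keeps them diffuse at init.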
You can create a self-attention component by applying linear projections to all nodes. Use the register buffer to create a lower triangular matrix. Normalize and aggregate the values. Feed the encoded information into the self-attention head. Crop the context to ensure it doesn't exceed the block size. Decrease the learning rate and increase the number of iterations for better results. Implement multi-head attention by running multiple heads of self-attention in parallel and concatenating their outputs.
GPT utilizes multiple communication channels in parallel, each with smaller dimensions. This is similar to group convolution. Multiple independent channels of communication help tokens gather different types of data. The paper introduces position and token embeddings, masked multi-headed attention, and cross-attention to an encoder. Computation is added on a per-token level through a feed-forward network. The feed-forward network consists of a linear layer followed by a ReLU activation. It is applied sequentially after the self-attention step. Tokens perform the feed-forward computation independently.
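A sketch of the multi-head wrapper and the per-token feed-forward network described here (it assumes num_heads * head_size == n_embd):

    class MultiHeadAttention(nn.Module):
        """Multiple heads of self-attention running in parallel."""
        def __init__(self, num_heads, head_size, n_embd, block_size):
            super().__init__()
            self.heads = nn.ModuleList(
                [Head(head_size, n_embd, block_size) for _ in range(num_heads)])
            self.proj = nn.Linear(n_embd, n_embd)  # project back into the residual pathway

        def forward(self, x):
            out = torch.cat([h(x) for h in self.heads], dim=-1)  # concatenate over channels
            return self.proj(out)

    class FeedForward(nn.Module):
        """Per-token computation: linear, ReLU, linear (inner layer 4x wider)."""
        def __init__(self, n_embd):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(n_embd, 4 * n_embd),
                nn.ReLU(),
                nn.Linear(4 * n_embd, n_embd),  # projection back into the residual pathway
            )

        def forward(self, x):
            return self.net(x)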
The validation loss decreases from 2.28 to 2.24. Communication and computation are integrated in the block using multi-head self-attention and a feed-forward network. The number of heads is determined by the number of embeddings and the embedding dimension. Skip connections, or residual connections, are added to improve optimization in deep neural networks. These connections allow gradients to flow through every addition node, creating a gradient superhighway.
At initialization, the residual blocks contribute very little to the residual pathway, and their contribution grows over the course of training. The implementation involves adding a projection layer and performing communication and computation. The feed-forward network's inner layer should be multiplied by 4. Training results in a validation loss of 2.08 and some overfitting. The addition of layer norm, which is similar to batch normalization, helps optimize deep neural networks.
We implemented layer normalization in our transformer model, normalizing the rows instead of the columns. We no longer need running buffers or a distinction between training and test time. We keep gamma and beta, but not momentum. This is called the pre-norm formulation, which is a deviation from the original paper. We also added layer norms before the transformation, resulting in a slight improvement in performance. Adding layer norms would likely be more beneficial in larger and deeper networks. Additionally, there should be a layer norm at the end of the transformer and before the final linear activation.
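Putting it together, a sketch of the transformer block with the skip connections and pre-norm layer norms discussed here:

    class Block(nn.Module):
        """Communication (attention) followed by computation (feed-forward)."""
        def __init__(self, n_embd, n_head, block_size):
            super().__init__()
            head_size = n_embd // n_head
            self.sa = MultiHeadAttention(n_head, head_size, n_embd, block_size)
            self.ffwd = FeedForward(n_embd)
            self.ln1 = nn.LayerNorm(n_embd)  # normalizes each token's feature row
            self.ln2 = nn.LayerNorm(n_embd)

        def forward(self, x):
            # "x + ..." are the skip connections: a gradient superhighway.
            # Pre-norm formulation: layer norm is applied before each sublayer.
            x = x + self.sa(self.ln1(x))
            x = x + self.ffwd(self.ln2(x))
            return x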
The transformer model is complete and has been scaled up by adding more layers and heads. Dropout has been introduced to prevent overfitting. Hyperparameters have been adjusted, resulting in a validation loss improvement from 2.07 to 1.48. The model was pretrained for approximately 15 minutes on a powerful GPU and may not be reproducible on a CPU or less powerful devices.
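For reference, the scaled-up hyperparameters used for the lecture's final run (as reported in the video):

    batch_size = 64       # independent sequences per batch
    block_size = 256      # maximum context length
    n_embd = 384          # embedding dimension
    n_head = 6            # attention heads per block (384 / 6 = 64 dims per head)
    n_layer = 6           # number of transformer blocks
    dropout = 0.2         # randomly disable activations to fight overfitting
    learning_rate = 3e-4  # brought down for the bigger network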
In about 15 minutes, the transformer model was able to generate recognizable text in the style of Shakespeare, although the output is nonsensical when read closely. This demonstrates the capabilities of the transformer model trained at the character level on a Shakespeare dataset. The architecture used in this implementation is a decoder-only transformer, lacking the encoder and the cross-attention component. The decoder uses a triangular mask to generate text, making it suitable for language modeling. The original paper on which this implementation is based focuses on machine translation, where the encoder-decoder architecture is used to condition the generation on additional information, such as a sentence to be translated.
In our video, we demonstrate how a transformer is used without a triangular mask, allowing all tokens to communicate freely. The encoder encodes the content of a French sentence and outputs it. In the decoder, cross attention is used to incorporate the outputs of the encoder. The keys and values come from the top, generated by the encoder. This conditioning allows the decoder to consider the full encoded French prompt. We use a decoder-only transformer because we are imitating a text file. In nanoGPT, train.py is the boilerplate code for training the network, and model.py defines the model of interest. The model in nanoGPT is almost identical to what we have done in our demonstration. The causal self-attention block is similar, but with multiple heads implemented in a batched manner. The blocks of the transformer are identical in communication and computation phases.
The code for building GPT includes position and token embeddings, layer norm, and a final linear layer. Training ChatGPT involves two stages: pretraining and fine tuning. In pretraining, a large chunk of internet data is used to train a decoder-only transformer. OpenAI's GPT-3 has 175 billion parameters compared to the 10 million in the created transformer. The architecture and hyperparameters are similar, but training the larger model requires a massive infrastructure. After pretraining, the model generates documents and articles, but doesn't provide helpful responses to questions. It may generate more questions instead.
The fine-tuning process for GPT involves aligning the model to function as an assistant by collecting training data that resembles assistant behavior. This data consists of question-answer pairs in a specific format. The model is then trained to focus on this type of data, gradually aligning it to expect questions and provide answers. The fine-tuning process also includes steps where different responses are ranked by human reviewers to create a reward model. This reward model is used to train the model further and ensure that the generated answers are expected to receive high rewards. This fine-tuning stage transforms the model from a document completer to a question-answer system. Some of the data used in this stage is not publicly available. The training process involves a decoder-only transformer and follows the paper "Attention is All You Need." The code for training the model will be released, along with a notebook and Google Colab app. Larger versions of GPT, such as GPT-3, exist and are architecturally similar but significantly larger. The lecture does not cover fine-tuning stages beyond language modeling.
To achieve specific goals such as alignment, sentiment detection, or document completion, further stages of fine tuning are necessary. This can involve supervised fine tuning or more advanced techniques like training a reward model and using PPO for alignment. Much more can be done on top of the model, but the lecture concludes here.
Source: https://www.youtube.com/watch?v=kCc8FmEb1nY
Page title: Let's build GPT: from scratch, in code, spelled out. - YouTube
Meta description: We build a Generatively Pretrained Transformer (GPT), following the paper "Attention is All You Need" and OpenAI's GPT-2 / GPT-3. We talk about connections t...