Summary State of GPT | BRK216HFS - YouTube (YouTube) www.youtube.com
8,805 words - YouTube video
One Line
GPT-4 is a powerful and versatile AI model produced by a multi-stage training pipeline; prompt engineering can enhance its performance, but human oversight is still required due to its limitations and potential vulnerabilities.
Slides
Slide Presentation (9 slides)
Key Points
- GPT (Generative Pre-trained Transformer) is a large language model that goes through four stages of training: pre-training, supervised fine-tuning, reward modeling, and reinforcement learning.
- Prompting the models can be more effective than fine-tuning, and base models can be prompted to perform specific tasks.
- Supervised fine-tuning and reinforcement learning from human feedback are necessary to create actual GPT assistants.
- RLHF models are preferred over SFT models and base models, as they work better and leverage human judgment.
- Prompt engineering techniques can improve the performance of GPT models, such as spreading reasoning across more tokens, using templates, and allowing models to revise their outputs.
- GPT models imitate token sequences and lack human cognitive abilities like an internal monologue or self-reflection, but they have broad fact-based knowledge and a relatively large working memory.
- Python glue code and careful prompt engineering are required to achieve desired results with prompt-based language models like GPT-4 (see the sketch after this list).
- It is important to consider the limitations of language models, such as biases, fabricating information, reasoning errors, knowledge cutoffs, and vulnerability to attacks.
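A minimal sketch of the kind of Python glue code referred to above, assuming the `openai` client library (v1+) and an API key in the environment; the model name and prompts are illustrative, not taken from the talk:

```python
# Minimal "glue code" sketch: send a prompt to a GPT model and read the reply.
# Assumes the `openai` Python package (v1+) and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4",  # illustrative model name
    messages=[
        {"role": "system", "content": "You are a concise technical assistant."},
        {"role": "user", "content": "Compare the populations of California and Alaska."},
    ],
    temperature=0.7,
)

print(response.choices[0].message.content)
```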
Summaries
58 word summary
GPT's training process involves pre-training, supervised fine-tuning, reward modeling, and reinforcement learning. Each stage produces a different type of model, with GPT-4 being one of the best assistants. Prompt engineering can improve performance, but limitations such as biases and vulnerability to attacks should be considered. Language models should be used in low-stakes applications with human oversight. GPT-4 is a powerful tool accessible through simple code prompts.
115 word summary
Andrej Karpathy, an AI researcher and founding member of OpenAI, discussed the state of GPT and its training process. The process involves pre-training, supervised fine-tuning, reward modeling, and reinforcement learning. Karpathy emphasized that base models can be prompted to perform tasks but that supervised fine-tuning and reinforcement learning are needed to produce actual GPT assistants. Different types of models are used, with RLHF models leveraging human judgment. GPT-4 is one of the best assistant models available. Prompt engineering techniques can improve performance, but it's important to consider limitations such as biases and vulnerability to attacks. Language models should be used with human oversight in low-stakes applications. In conclusion, GPT-4 is a powerful tool accessible through simple code prompts.
313 word summary
Andrej Karpathy, an AI researcher and founding member of OpenAI, discussed the state of GPT (Generative Pre-trained Transformer) and the training process for using it effectively. The training process involves pre-training, supervised fine-tuning, reward modeling, and reinforcement learning. Karpathy emphasized that base models are not assistants but can be prompted to perform specific tasks. Supervised fine-tuning and reinforcement learning from human feedback are necessary to create actual GPT assistants. Reward modeling and reinforcement learning further improve the model's performance.
Karpathy explained that in reinforcement learning the model trains only on the completion tokens (highlighted yellow in his slides), weighting the language modeling objective by the rewards indicated by the reward model. Tokens in completions that receive a high score from the reward model are reinforced with higher probabilities, while tokens in low-scoring completions are given lower probabilities. Different types of models, such as base models, SFT models, and RLHF models, are used in the training process. RLHF models work better because they leverage human judgment. Base models are still preferred in scenarios where diverse outputs are desired.
There are various assistant models available, with GPT-4 being one of the best. Prompting is important when using a GPT assistant model. Techniques such as spreading reasoning across more tokens, using templates, thinking step by step, and self-consistency can improve performance. Prompt engineering is being explored as a way to recreate human-like abilities in GPT models.
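As an illustration of these techniques, here is a hypothetical prompt that spreads reasoning across tokens with a step-by-step template; the wording is an assumption, not a prompt from the talk:

```python
# Hypothetical chain-of-thought prompt: the template forces the model to
# spend tokens on intermediate reasoning before committing to an answer.
prompt = """Q: A library has 4,200 books and buys 150 more each month.
How many books will it have after 8 months?

Let's think step by step, then give the final answer on its own line.
Step 1:"""
```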
GPT models imitate tokens and do not have the same cognitive abilities as humans. Prompt-based language models require careful prompt engineering and Python glue code. Techniques like retrieval-augmented generation and constrained prompting can enhance model performance. However, it is important to consider the limitations of language models, such as biases and vulnerability to attacks. It is recommended to use language models in low-stakes applications with human oversight.
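A small sketch of the constrained-prompting idea: demand machine-checkable output and validate it in glue code, retrying on failure. The retry loop, prompt wording, and `ask_model` callable are assumptions for illustration:

```python
# Constrained-prompting sketch: demand JSON output and validate it in Python
# glue code, retrying if the model's reply does not parse. `ask_model` is a
# hypothetical stand-in for an LLM API call.
import json

def get_structured_reply(ask_model, question, max_retries=3):
    prompt = question + '\nReply with ONLY a JSON object like {"answer": "..."}.'
    for _ in range(max_retries):
        reply = ask_model(prompt)
        try:
            return json.loads(reply)   # accept only well-formed JSON
        except json.JSONDecodeError:
            continue                   # malformed output: sample again
    raise ValueError("model never produced valid JSON")
```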
In conclusion, GPT-4 is a powerful tool with a vast amount of knowledge accessible through simple code prompts.
809 word summary
Andrej Karpathy, an AI researcher and founding member of OpenAI, discussed the state of GPT (Generative Pre-trained Transformer) and the growing ecosystem of large language models. He divided his talk into two parts: training GPT and using it effectively for applications. The training process consists of four major stages: pre-training, supervised fine-tuning, reward modeling, and reinforcement learning. Pre-training involves gathering a large amount of data from various sources and tokenizing it into sequences of integers. The resulting base model is then fine-tuned using high-quality datasets for specific tasks. Karpathy mentioned that prompting the models can be more effective than fine-tuning. He also highlighted the evolutionary tree of base models, with GPT-3 being available as the davinci model. Karpathy emphasized that base models are not assistants and can only complete documents, but they can be prompted to perform specific tasks. However, to create actual GPT assistants, supervised fine-tuning and reinforcement learning from human feedback are necessary. In supervised fine-tuning, human contractors provide prompt-response pairs, which are used to train the model. Reward modeling and reinforcement learning further improve the model's performance by collecting data in the form of comparisons and ranking completion options based on a reward model. The reward model is then used during reinforcement learning to score the quality of completions and train the model accordingly.
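To make the tokenization step concrete, here is a small sketch using OpenAI's tiktoken library (the talk does not prescribe a specific library; this choice is an assumption), which maps text to the sequences of integers a model is trained on:

```python
# Sketch of the tokenization step: text -> sequence of integer tokens.
# Uses OpenAI's tiktoken library; "cl100k_base" is the GPT-4-era encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "GPT models are trained on sequences of integers, not raw text."
tokens = enc.encode(text)

print(tokens)              # the integers the model actually sees
print(enc.decode(tokens))  # round-trips back to the original string
```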
Karpathy explains that in reinforcement learning the model trains only on the completion tokens (highlighted yellow in his slides), weighting the language modeling objective by the rewards indicated by the reward model. Tokens in completions that receive a high score from the reward model are reinforced and given higher probabilities in the future, while tokens in low-scoring completions are given slightly lower probabilities. This process is repeated on many prompts and batches to create a policy. Different types of models, such as base models, SFT models, and RLHF models, are used in the training process. RLHF models are preferred by humans over SFT models and base models because they work better. The reason for this improvement is not fully understood, but it may have to do with the asymmetry between generating and comparing: comparisons are easier for humans and can leverage human judgment to create a better model. RLHF models may lose some entropy compared to base models, resulting in less variation in their outputs. Base models are still preferred in certain scenarios where diverse outputs are desired.
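A toy sketch of the reward-weighted language modeling objective described above, in PyTorch. This is a simplified assumption, not OpenAI's actual training code: real RLHF uses PPO with additional terms such as a KL penalty against the SFT model, which this omits.

```python
# Toy sketch of RLHF-style reward weighting: the language modeling loss on
# the sampled completion tokens is scaled by the reward model's score, so
# high-reward completions are reinforced and low-reward ones suppressed.
import torch
import torch.nn.functional as F

def reward_weighted_lm_loss(logits, completion_tokens, reward):
    # Standard next-token cross-entropy over the sampled completion tokens.
    nll = F.cross_entropy(logits, completion_tokens, reduction="mean")
    # REINFORCE-style weighting: minimizing reward * nll raises the
    # probability of tokens from high-reward completions and lowers it for
    # negative-reward (low-scoring, after centering) completions.
    return reward * nll

# Toy usage with random tensors (shapes only, not real model outputs):
logits = torch.randn(8, 50257, requires_grad=True)   # 8 completion tokens
tokens = torch.randint(0, 50257, (8,))
loss = reward_weighted_lm_loss(logits, tokens, reward=0.9)
loss.backward()   # in real training, gradients flow into the policy model
```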
Karpathy mentions that there are various assistant models available, with GPT-4 being one of the best, followed by Claude and GPT-3.5. Prompting is an important aspect of using the GPT assistant model. He provides an example of comparing the populations of California and Alaska, highlighting the internal monologue and computational processes a human goes through when generating such a sentence. From GPT's perspective, it is just a sequence of tokens, and each token receives the same amount of computational work. GPT models do not have the same cognitive abilities as humans and have no internal dialogue or reflection. They imitate the next token without knowing what they are good at or correcting mistakes. However, they have large fact-based knowledge and a relatively large working memory.
To get better results from GPT models, prompting techniques can be used. Spreading reasoning across more tokens, using templates to show work, thinking step by step, and self-consistency can improve performance. Sampling multiple times and selecting the best samples can also be effective. GPT models cannot recover from sampling bad tokens the way humans correct mistakes, so techniques that allow them to look back or revise their sequences can be useful. Prompt engineering is an important aspect of recreating human-like abilities in GPT models. Techniques such as maintaining multiple completions for a prompt and scoring them along the way are being explored.
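A sketch of the self-consistency idea mentioned above: sample several reasoning paths at nonzero temperature and keep the majority answer. The `sample_completion` callable and the final-line answer convention are hypothetical stand-ins, not specifics from the talk:

```python
# Self-consistency sketch: sample N reasoning paths, extract each final
# answer, and return the most common one. `sample_completion` is a
# hypothetical stand-in for an LLM API call at temperature > 0.
from collections import Counter

def self_consistent_answer(prompt, sample_completion, n_samples=5):
    answers = []
    for _ in range(n_samples):
        completion = sample_completion(prompt)                # one reasoning path
        answers.append(completion.strip().splitlines()[-1])   # final-line answer
    # Majority vote across the sampled reasoning paths.
    return Counter(answers).most_common(1)[0][0]
```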
In conclusion, GPT models are trained with reinforcement learning from human feedback, and prompt engineering techniques can be used to improve their performance. While they may not have the same cognitive abilities as humans, they can imitate human-like outputs token by token.
Prompt-based language models, such as GPT-4, require careful prompt engineering and Python glue code to achieve desired results. Similar to AlphaGo's tree search, techniques built on top of these models expand and evaluate multiple completion paths. The use of structured prompts, like thought-action-observation sequences, and tools like calculators and code interpreters can enhance performance. Retrieval-augmented generation, which incorporates relevant information into the model's working memory, is gaining interest. Techniques like constrained prompting and fine-tuning can further improve model performance, although fine-tuning requires technical expertise and may involve complex data pipelines. It is important to consider the limitations of language models, such as biases, fabricating information, reasoning errors, knowledge cutoffs, and vulnerability to attacks. Therefore, it is recommended to use language models in low-stakes applications with human oversight and treat them as sources of inspiration rather than fully autonomous agents. Despite these limitations, GPT-4 is a powerful tool with a vast amount of knowledge accessible through simple code prompts.
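A minimal sketch of retrieval-augmented generation as described above: embed documents, retrieve the ones most similar to the query, and paste them into the prompt. The `embed` function is a hypothetical stand-in for any text-embedding call, and the prompt template is an assumption:

```python
# Retrieval-augmented generation sketch: pick the documents most similar to
# the query and place them in the model's working memory (the prompt).
# `embed` is a hypothetical stand-in for any text-embedding function.
import numpy as np

def build_rag_prompt(query, documents, embed, top_k=3):
    doc_vecs = np.array([embed(d) for d in documents])
    q_vec = np.array(embed(query))
    # Cosine similarity between the query and every document.
    sims = doc_vecs @ q_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec) + 1e-8
    )
    best = [documents[i] for i in np.argsort(-sims)[:top_k]]
    context = "\n\n".join(best)
    return f"Use the context to answer.\n\nContext:\n{context}\n\nQuestion: {query}"
```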
Raw indexed text (48,369 chars / 8,805 words)
Source: https://www.youtube.com/watch?v=bZQun8Y4L2A
Page title: State of GPT | BRK216HFS - YouTube
Meta description: Learn about the training pipeline of GPT assistants like ChatGPT, from tokenization to pretraining, supervised finetuning, and Reinforcement Learning from Hu...