Summary A Very Gentle Introduction to Large Language Models without the Hype | by Mark Riedl | Medium mark-riedl.medium.com
10,091 words - html page
One Line
The text discusses large language models like ChatGPT, explaining their capabilities, limitations, and underlying concepts in a concise and accessible manner.
Key Points
- Large language models like ChatGPT use reinforcement learning with human feedback to improve their responses over time.
- These models have limitations such as not truly understanding input prompts and lacking explicit goals.
- Large language models rely on pattern-matching and do not have the ability to plan or look ahead.
- Self-attention is a process in large language models where each word attends to other words in the sequence to compute similarity scores.
- Generative pre-trained transformer models are trained as masked language models: words in a sequence are hidden and the model learns to guess them. After pre-training on a large corpus of text, such models can be fine-tuned for specific tasks.
- Encoder-decoder networks in large language models compress input sequences into smaller encodings and generate output sequences based on these encodings.
- Large language models are capable of representing a wide range of concepts and can handle large amounts of input data.
- Understanding the underlying mechanisms of large language models is important and they should not be seen as magical tools.
Summary
784-word summary
Large language models, such as ChatGPT, have gained popularity for their ability to generate text from input prompts. One key feature of ChatGPT is reinforcement learning with human feedback (RLHF), which lets the model improve its responses over time: the model is trained to predict whether certain responses will receive positive or negative feedback, and its parameters are adjusted accordingly so that it generates more preferred responses.
Even so, large language models have real limitations. They do not truly understand input prompts, and they lack explicit goals, problem-solving abilities, or planning capabilities; text that looks like a plan or a solution is produced by pattern-matching against similar text seen during training, not by looking ahead or weighing alternatives. They cannot remember earlier conversations beyond the limited size of their context window, and the quality of a response is directly tied to the quality of the input prompt. Because they tend to guess words that are more common in their training data, they have no sense of truth or of right and wrong: they can make mistakes and may simply regurgitate language they have seen frequently.
These models are trained on a vast amount of text from the internet, including both positive and negative content. They can supply common snippets of code, answer questions about science, and generate text on many topics, all by guessing the next word from the context of the input. The technical details involve encoding and decoding processes built on self-attention mechanisms. It is important not to anthropomorphize large language models and to verify their outputs.
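The "guess the next word, favoring what was common in training" idea can be sketched with a toy bigram model. This is an illustration only (the corpus and function names are made up, not from the article); real models use neural networks rather than raw counts, but the frequency bias the summary describes is the same.

```python
from collections import Counter, defaultdict

# Toy "training text": the model will only ever know what followed what here.
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count, for each word, which words followed it and how often.
following = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    following[current][nxt] += 1

def guess_next(word):
    # Guess the continuation seen most often in training -- the model
    # favors common patterns, with no notion of truth or correctness.
    return following[word].most_common(1)[0][0]

print(guess_next("the"))  # "cat" -- it followed "the" most often above
```

Here "cat" wins simply because it followed "the" twice while "mat" and "fish" followed it once each, which mirrors why large models tend toward frequent phrasings from their training data.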
In this document excerpt, the author discusses the concept of self-attention in large language models. Self-attention is a process in which each word in an input sequence attends to other words in the sequence to compute similarity scores. These scores are used to create an attention matrix, which records the similarity between words. The author explains that self-attention can be computed using a mathematical operation called dot product and that it helps the model learn relationships between words in the sequence.
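The dot-product attention matrix described above can be shown in a few lines of NumPy. This is a simplified sketch with made-up word vectors: real transformers first project each word into learned query, key, and value vectors, which are omitted here.

```python
import numpy as np

def self_attention(X):
    # Dot product of every word vector with every other word vector
    # gives a similarity score for each pair of words.
    scores = X @ X.T
    # Softmax turns each row of scores into attention weights that sum to 1,
    # forming the attention matrix.
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    # Each word's new vector is a weighted mix of all the word vectors.
    return weights @ X

# Three "words", each represented by a 4-dimensional vector (invented values).
X = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0],
              [1.0, 0.0, 0.9, 0.1]])

out = self_attention(X)
print(out.shape)  # (3, 4): one updated vector per word
```

The first and third vectors are similar, so they attend strongly to each other; this is how the mechanism lets related words in a sequence influence each other's representations.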
The author also introduces the idea of a masked language model, which underlies generative pre-trained transformer models. Training involves masking certain words in the input and having the model guess the masked words; a generative model masks only the final word, so guessing it amounts to predicting the next word in the sequence. The author explains that such a model can be pre-trained on a large corpus of general text and then fine-tuned for specific tasks.
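How masked-language-model training examples are built can be sketched as follows; the model itself is omitted, and the sentence and helper name are invented for illustration. The hidden word becomes the target the network must learn to guess.

```python
import random

def make_training_example(sentence, rng):
    # Hide one word behind a [MASK] token; the hidden word is the
    # training target the model must guess from the remaining context.
    words = sentence.split()
    i = rng.randrange(len(words))
    target = words[i]
    masked = words[:i] + ["[MASK]"] + words[i + 1:]
    return " ".join(masked), target

rng = random.Random(0)  # seeded for repeatability
masked, target = make_training_example("the cat sat on the mat", rng)
print(masked, "->", target)
```

A generative model is the special case where the mask always falls on the last position, so every training example teaches it to continue the text.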
Additionally, the author discusses the architecture of an encoder-decoder network. The encoder network compresses the input sequence into a smaller set of numbers called encodings, while the decoder network expands these encodings to generate an output sequence. The encoder and decoder work together to learn relationships between words and generate accurate predictions.
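The compress-then-expand shape of an encoder-decoder can be seen with two random (untrained) weight matrices; the sizes here are invented for illustration, and a real network would learn these weights with backpropagation rather than drawing them at random.

```python
import numpy as np

rng = np.random.default_rng(0)

# Encoder weights: compress 8 input numbers into a 3-number encoding.
W_enc = rng.normal(size=(8, 3))
# Decoder weights: expand the 3-number encoding back out to 8 numbers.
W_dec = rng.normal(size=(3, 8))

x = rng.normal(size=(1, 8))          # one input represented as 8 numbers
encoding = np.tanh(x @ W_enc)        # encoder output: a smaller set of numbers
output = np.tanh(encoding @ W_dec)   # decoder output: expanded back out

print(encoding.shape, output.shape)  # (1, 3) (1, 8)
```

The squeeze through the small encoding is what forces the network to keep only the information that matters, which is where the learned relationships between words live.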
Overall, the author provides a concise overview of self-attention, masked language models, and encoder-decoder networks in large language models.
The author also explains that the encodings learned by these networks can represent a wide range of concepts, such as royalty or armored mammals, and discusses the challenges of handling large amounts of input data. Training a neural network means feeding it data and adjusting its parameters through the backpropagation algorithm. The author emphasizes that large language models are not magic and should be understood in terms of their underlying mechanisms, and concludes by framing artificial intelligence as intelligent behavior performed by an entity.
Large Language Models and ChatGPT will be explained without jargon. The article aims to give people without a computer science background insight into how ChatGPT and similar AI systems work. The concepts will be illustrated using metaphors, and no technical or mathematical background is required to understand the core concepts. The article will discuss the way these models work, what can be expected from them, and why the core concepts are effective.
ChatGPT is a chatbot that falls under the category of conversational AI. The article will break down how ChatGPT and other models like GPT-3, GPT-4, Bing Chat, and Bard function.
The goal of the article is to provide a gentle introduction to large language models without hype.