Summary What's next for AI agentic workflows ft. Andrew Ng of AI Fund (Youtube) youtu.be
Speaker 0 All of you know Andrew Ng as a famous computer science professor at Stanford. He was really early on in the development of neural networks with GPUs, of course a co-founder of Coursera and creator of popular courses like deeplearning.ai, and also the founder and early lead of Google Brain. But one thing I've always wanted to ask you before I hand it over, Andrew, while you're on stage, is a question I think would be relevant to the whole audience. 10 years ago, on problem set number 2 of CS 229, you gave me a B.
Speaker 0 I looked it over, and I was wondering what you saw that I did incorrectly. So anyway, Andrew.
Speaker 1 Thank you, Hansien. I'm looking forward to sharing with all of you what I'm seeing with AI agents, which I think is an exciting trend that everyone building in AI should pay attention to. And I'm also excited about all the other What's Next presentations. So, AI agents. Today, the way most of us use large language models is like this, with a non-agentic workflow: you type a prompt and it generates an answer. And that's a bit like asking a person to write an essay on a topic, and saying, please sit down at the keyboard and just type the essay from start to finish without ever using backspace.
Speaker 1 And despite how hard this is, LLMs do it remarkably well. In contrast, with an agentic workflow, this is what it may look like. Have an LLM, say, write an essay outline. Do you need to do any web research? If so, let's do that.
Speaker 1 Then write the first draft, then read your own first draft and think about what parts need revision, then revise your draft, and you go on and on. So this workflow is much more iterative, where you may have the LLM do some thinking, then revise the article, then do some more thinking, and iterate through this a number of times. And what not many people appreciate is that this produces remarkably better results. Working on these agentic workflows, I've actually really surprised myself with how well they work. I'm going to do one case study.
Speaker 1 I had my team analyze some data using a coding benchmark called the HumanEval benchmark, released by OpenAI a few years ago. It has coding problems like: given a non-empty list of integers, return the sum of all the odd elements in even positions. And it turns out the answer is a code snippet like that. So today, a lot of us will use zero-shot prompting, meaning we tell the AI, write the code, and have it run on the first try. It's like, who codes like that? No human codes like that.
Speaker 1 We just type out the code and run it. Maybe you do; I can't do that. So it turns out that if you use GPT-3.5 with zero-shot prompting, it gets it 48% right. GPT-4, way better: 67% right.
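For concreteness, the benchmark problem he describes fits in a couple of lines of Python. This is a sketch of one correct answer; the function name `solution` is our own choice here, not necessarily the benchmark's exact signature.

```python
def solution(lst):
    """Given a non-empty list of integers, return the sum of all
    odd elements that sit at even positions (0-indexed)."""
    return sum(x for i, x in enumerate(lst) if i % 2 == 0 and x % 2 == 1)

print(solution([5, 8, 7, 1]))  # 5 and 7 are odd and at even indices -> 12
```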
Speaker 1 But if you take an agentic workflow and wrap it around GPT-3.5, it actually does better than even GPT-4. And if you were to wrap this type of workflow around GPT-4, it also does very well. And you notice that GPT-3.5 with an agentic workflow actually outperforms GPT-4, and this has significant consequences for how I think we all approach building applications. So, agents is a term that's been tossed around a lot. There are a lot of consultant reports.
Speaker 1 Agents, the future of AI, blah blah blah. I want to be a bit concrete and share with you the broad design patterns I'm seeing in agents. It's a very messy, chaotic space.
Speaker 1 Tons of research, tons of open source. There's a lot going on, but I've tried to categorise a bit more concretely what's going on with agents. Reflection is a tool that I think many of us should just use. It just works. Tool use, I think, is less widely appreciated, but it actually works pretty well.
Speaker 1 I think of these as pretty robust technologies. When I use them, I can almost always get them to work well. Planning and multi-agent collaboration, I think, are more emerging. When I use them, sometimes my mind is blown by how well they work. But at least at this moment in time, I don't feel that I can always get them to work reliably.
Speaker 1 So let me walk through these four design patterns in a few slides, and if some of you go back and use these yourselves, or ask your engineers to use them, I think you'll get a productivity boost quite quickly. So, reflection. Here's an example. Let's say I ask a system, please write code for me for a given task. Then we have a coder agent, just an LLM that you prompt to write code: to say, def do_task, write a function like that. An example of self-reflection would be if you then prompt the LLM with something like this: here's code intended for a task, and just give it back the exact same code that it just generated, and then say, check the code carefully for correctness, style, and efficiency. Just write a prompt like that. It turns out the same LLM that you prompted to write the code may be able to spot problems like, here's a bug on line 5, fix it by blah blah blah.
Speaker 1 And if you now take its own feedback, give it back to it, and reprompt it, it may come up with a version 2 of the code that could well work better than the first version. Not a guarantee, but it works often enough to be worth trying for a lot of applications. To foreshadow tool use: if you let it run unit tests and it fails a unit test, then ask, why did you fail the unit test? Have that conversation, and it may figure out that it failed the unit test, so it should try changing something and come up with v3. By the way, for those of you that want to learn more about these technologies, I'm very excited about them; each of the four sections has a little recommended-reading section at the bottom that hopefully gives more references.
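The generate-critique-revise loop just described can be sketched in a few lines. This is a minimal illustration, not anyone's actual implementation: `call_llm` and `run_unit_tests` are hypothetical stand-ins (here returning canned strings so the sketch runs) that you would replace with a real model API call and a real test harness.

```python
def call_llm(prompt: str) -> str:
    """Stand-in for a real LLM API call; returns canned text so the sketch runs."""
    if prompt.startswith("Check"):
        return "Possible bug: the function may not handle empty input."
    return "def do_task(data):\n    return sorted(data)"

def run_unit_tests(code: str) -> list:
    """Stand-in test harness: returns a list of failure messages (empty = pass)."""
    return []

def reflect_and_revise(task: str, max_rounds: int = 3) -> str:
    # v1: prompt the model to write the code.
    code = call_llm(f"Write code for this task: {task}")
    for _ in range(max_rounds):
        # Feed unit-test results and the model's own code back to it.
        failures = run_unit_tests(code)
        critique = call_llm(
            "Check the code carefully for correctness, style, and efficiency.\n"
            f"Code:\n{code}\nUnit-test failures: {failures}"
        )
        # v2, v3, ...: revise based on the self-generated feedback.
        code = call_llm(f"Revise the code based on this feedback:\n{critique}\nCode:\n{code}")
    return code

final_code = reflect_and_revise("sort a list of numbers")
```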
Speaker 1 And again, just to foreshadow multi-agent systems: I've described a single coder agent that you prompt to have this conversation with itself. One natural evolution of this idea is, instead of a single coder agent, you can have two agents, where one is a coder agent and the second is a critic agent. These could be the same base LLM that you prompt in different ways, where to one you say, you're an expert coder. Right? Write code.
Speaker 1 To the other one you say, you're an expert code reviewer, review this code. And this type of workflow is actually pretty easy to implement. I think it's a very general-purpose technique for a lot of workflows. This can give you a significant boost in the performance of LLMs. The second design pattern is tool use.
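As a rough illustration of the coder/critic pattern just described: the same base model is called twice with two different role prompts. The role strings and `call_llm` stand-in below are our own illustrative choices, with canned replies so the sketch runs end to end.

```python
CODER_ROLE = "You are an expert coder. Write clean, correct code."
CRITIC_ROLE = "You are an expert code reviewer. Point out bugs and style issues."

def call_llm(role: str, prompt: str) -> str:
    """Stand-in for a real LLM call with a system/role prompt."""
    if role == CRITIC_ROLE:
        return "Looks correct; consider adding a docstring."
    return "def add(a, b):\n    return a + b"

def coder_critic_round(task: str):
    # Same base model, two different role prompts: one writes, one reviews.
    code = call_llm(CODER_ROLE, f"Write code for: {task}")
    review = call_llm(CRITIC_ROLE, f"Review this code:\n{code}")
    return code, review

code, review = coder_critic_round("add two numbers")
print(review)
```

In a real system the review would be fed back to the coder role for another revision round, exactly like the single-agent reflection loop.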
Speaker 1 Many of you have already seen LLM-based systems using tools. On the left is a screenshot from Copilot. On the right is something that I kind of extracted from GPT-4. If you ask an LLM, what's the best coffee maker, it will do a web search; for some problems, LLMs will generate code and run code. And it turns out that there are a lot of different tools that many different people are using for analysis, for gathering information, for taking actions, for personal productivity. It also turns out that a lot of the early work in tool use was in the computer vision community, because before large language models, LLMs couldn't do anything with images, so the only option was to have the LLM generate a function call that could manipulate an image, like generate an image or do object detection or whatever. So if you look at the literature, it's been interesting how much of the work in tool use seems to have originated from vision, because LLMs were blind to images before GPT-4V and LLaVA and so on.
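One common way to wire up tool use is sketched below, assuming the model has been instructed to reply with a JSON function call when a tool would help. The tool names, registry, and the canned model reply are all illustrative assumptions, not any particular product's API.

```python
import json

# Hypothetical tool registry: functions the model is allowed to invoke.
def web_search(query: str) -> str:
    return f"(stub) top results for {query!r}"

def run_code(source: str) -> str:
    return "(stub) code executed"

TOOLS = {"web_search": web_search, "run_code": run_code}

def call_llm(prompt: str) -> str:
    """Stand-in for a real LLM that answers with a JSON function call."""
    return json.dumps({"tool": "web_search",
                       "args": {"query": "best coffee maker"}})

def dispatch(llm_message: str) -> str:
    """Parse the model's function-call message and run the named tool."""
    call = json.loads(llm_message)
    return TOOLS[call["tool"]](**call["args"])

print(dispatch(call_llm("What's the best coffee maker?")))
```

In a full agent loop, the tool's return value would be appended to the conversation so the model can reason over the result.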
Speaker 1 So that's tool use, and it expands what an LLM can do. And then planning. For those of you that have not yet played a lot with planning algorithms, a lot of people talk about the ChatGPT moment, where you go, wow, I've never seen anything like this. I think if you've not yet used planning algorithms, many people will have a kind of AI-agent wow moment, thinking, I couldn't imagine an AI agent doing that.
Speaker 1 I've run live demos where something failed and the AI agent rerouted around the failures. I've actually had quite a few of those moments where I go, wow, I can't believe my AI system just did that autonomously. One example, which I adapted from the HuggingGPT paper: you say, please generate an image where a girl is reading a book, and her pose is the same as the boy in the image example.jpg, then please describe the new image with your voice. So you give it an example like this.
Speaker 1 Today, we have AI agents that can kind of decide: the first thing I need to do is determine the pose of the boy, then find the right model, maybe on Hugging Face, to extract the pose; next, find a pose-to-image model to synthesize a picture of a girl following the instructions; then use image-to-text; and then finally use text-to-speech. And today, we actually have agents that, I don't want to say they work reliably, you know, they're kind of finicky, they don't always work, but when it works, it's actually pretty amazing, and with agentic loops, sometimes you can recover from earlier failures as well. So I find myself already using research agents in some of my work, where I want a piece of research but I don't feel like googling it myself and spending a long time.
Speaker 1 I'll send it to the research agent, come back in a few minutes and see what it's come up with, and it sometimes works, sometimes doesn't, right, but that's already a part of my personal workflow. The final design pattern: multi-agent collaboration. This is one of those funny things, but it works much better than you might think. On the left is a screenshot from a paper called ChatDev, which is actually open source. Many of you saw the flashy social media announcements of demos of Devin.
Speaker 1 ChatDev is actually open source; it runs on my laptop. What ChatDev does is an example of a multi-agent system, where you prompt one LLM to sometimes act like the CEO of a software engineering company, sometimes act like a designer, sometimes act like a product manager, sometimes act like a tester. And this flock of agents that you build by prompting an LLM, telling them, you are now a CEO, you are now a software engineer, they collaborate and have an extended conversation, so that if you tell it, please develop a game, develop a Gomoku game, they'll actually spend, you know, a few minutes writing code, testing it, iterating, and then generate surprisingly complex programs. Doesn't always work.
Speaker 1 I've used it. Sometimes it doesn't work; sometimes it's amazing. But this technology is really getting better. And just one more design pattern: it turns out that multi-agent debate, where you have different agents, for example ChatGPT and Gemini, debate each other, actually results in better performance as well. So having multiple simulated AI agents work together is a powerful design pattern as well. So just to summarize: these are the patterns I've seen, and I think that if we were to use these patterns in our work, a lot of us could get a productivity boost quite quickly.
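The multi-agent debate idea mentioned above can be sketched as two named agents exchanging answers for a few rounds. `call_model` is a hypothetical stand-in (returning canned text so the sketch runs) for calls to two real models, say ChatGPT and Gemini.

```python
def call_model(name: str, prompt: str) -> str:
    """Stand-in for calling a specific model by name; canned reply for the sketch."""
    return f"{name}'s answer to: {prompt[:30]}"

def debate(question: str, rounds: int = 2) -> list:
    """Two agents answer, then each responds to the other's latest answer."""
    a = call_model("agent_a", question)
    b = call_model("agent_b", question)
    transcript = [a, b]
    for _ in range(rounds):
        a = call_model("agent_a", f"{question}\nOpponent said: {b}\nRespond or revise.")
        b = call_model("agent_b", f"{question}\nOpponent said: {a}\nRespond or revise.")
        transcript += [a, b]
    return transcript

for turn in debate("Which sorting algorithm should we use?"):
    print(turn)
```

A real implementation would add a final judging or aggregation step to pick the best answer out of the transcript.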
Speaker 1 And I think that agentic reasoning design patterns are going to be important. This is my last slide. I expect that the set of tasks AI can do will expand dramatically this year because of agentic workflows. And one thing that is actually difficult for people to get used to is that when we prompt an LLM, we want a response right away. In fact, a decade ago, when I was having discussions at Google on what we called big-box search, where you type in a long prompt.
Speaker 1 One of the reasons I failed to push successfully for that was that when you do a web search, you want a response back in half a second. Right? That's just human nature; we like that instant gratification, instant feedback. But for a lot of these agentic workflows, I think we'll need to learn to delegate a task to an AI agent and patiently wait minutes, maybe even hours, for a response.
Speaker 1 But just like I've seen a lot of novice managers delegate something to someone and then check in five minutes later, which isn't productive, I think we need to learn to wait patiently with some of our AI agents as well, and it's really difficult. I heard some laughs. And then one other important trend: fast token generation is important, because with these agentic workflows we're iterating over and over, so the LLM is generating tokens for the LLM to read.
Speaker 1 So being able to generate tokens way faster than any human can read is fantastic. And I think that generating more tokens really quickly from even a slightly lower-quality LLM might give good results compared to slower tokens from a better LLM. Maybe. It's a little bit controversial, because it may let you go around this loop a lot more times, kind of like the results I showed with GPT-3.5 and an agentic architecture on the first slide. And candidly, I'm really looking forward to Claude 5 and Claude 4 and GPT-5 and Gemini 2.0 and all these other wonderful models that maybe you're building. Part of me feels like, if you're looking forward to running your thing on GPT-5 zero-shot, you may be able to get closer to that level of performance on some applications than you might think with agentic reasoning on an earlier model.
Speaker 1 I think this is an important trend, and honestly, the path to AGI feels like a journey rather than a destination, but I think these kinds of agentic workflows could help us take a small step forward on this very long journey. Thank you.