Speaker 0 An agent, because I just share my calendar and someone interacts with it without my involvement. And when they need to reschedule, they just interact with it and it books updates to my calendar for me. There's no human in the loop, but it is an agent in the sense that it's a thing that exists as part
Speaker 1 of my org that works without me, the human.
Speaker 2 Alright. That's maybe a good segue to hand it off to you. So we'd love to hear more about what you're working on and how you see it evolving. You've been doing some awesome stuff, so really excited.
Speaker 0 Thank you. We maybe should have tested the screen share a little bit before I did this. So I think I'm sharing my screen now.
Speaker 2 Yeah. It's working. We had an infinite mirror going on. Now we no longer have the infinite mirror there, a bit sad. But...
Speaker 0 Yes. Fun fact, I've heard that called the Droste effect, if you're into
Speaker 2 that kind
Speaker 1 of thing with the mirrors.
Speaker 0 So... Okay. So I'm very jealous of Lilian's writing ability, because I had basically the same pieces 2 months ago. She tweeted this yesterday, or maybe 2 days ago, and obviously she's extremely well established. I think she works at OpenAI.
Speaker 0 So, like, extremely credible. 2 months ago, I wrote this, which is: agents = foundation models + chain of thought + memory + browsers + tool use + planning. So I feel like I just over complicated it, and she just made an equation, and she's at OpenAI, and obviously put in a whole bunch more work than I did. But this is how I kind of frame it. And I think if you take the composition of these tools, you can actually...
Speaker 0 Try to build up to whatever agent you're talking about. Right? So I think, Guru, you're talking about some kind of conversational agent, and maybe that's a combination of foundation model plus evaluation plus external memory. Right? And then, like, you know, Chat with Bing, that's adding on the browser tooling.
Speaker 0 Please ignore this, this is from when my thoughts were less well formed. Now I have sort of retroactively revised step 3 to be just browser research rather than automation, and I'm promoting all of these up towards fuller agents. So basically, the difference between step 2 and step 3 is: step 2 represents things you already know, and step 3 represents things you don't yet know and that you need to research. But all of this is only for the read-only phase; it only exists in text, during conversation or when you're doing research.
Speaker 0 And then, stepping out of read into read and write, so basically interacting with the real world, you have tool use and planning. So I don't know if that is a good basis, or if anyone wants to discuss this kind of framing in your mind. I'll pause there in case anyone has responses. Nope.
Speaker 0 Okay. Alright.
Speaker 2 I have a question. Like, what's the difference between tool use and browser automation? Couldn't you do browser automation using tools under the hood?
Speaker 0 Yeah. Exactly. So I often say, in my longer form explanation, that browser automation is a single instance of more general purpose tool use.
Speaker 0 Browsers are just a special case, because they represent knowledge, whereas tools are typically very well defined contracts: let's say interacting with an OpenAPI spec, or predefined connector APIs. Whereas the browser is just very general purpose: read information, write information. Here it actually does actions onto the world.
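The contract distinction described above can be sketched in a few lines. This is my own illustrative framing, not anyone's actual implementation; the tool names and stub behaviors are made up.

```python
# Hypothetical sketch of the distinction: a "tool" is a narrow, well-defined
# contract (like one operation from an OpenAPI spec), while browser automation
# exposes one maximally general read/write action on the world.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str
    description: str
    # JSON-schema-style contract: the model must fill these exact parameters.
    parameters: dict
    run: Callable[..., str]

# Narrow contract: inputs and outputs are fully specified up front.
get_weather = Tool(
    name="get_weather",
    description="Look up current weather for a city.",
    parameters={"city": {"type": "string"}},
    run=lambda city: f"Sunny in {city}",  # stub implementation
)

# Browser automation collapses to one very loose contract, "do something
# on a page", which is why it arguably deserves its own place in the taxonomy.
browse = Tool(
    name="browser_action",
    description="Navigate, read, click, or type on any web page.",
    parameters={"action": {"type": "string"}, "target": {"type": "string"}},
    run=lambda action, target: f"performed {action} on {target}",  # stub
)

print(get_weather.run("Tokyo"))
print(browse.run("click", "#submit"))
```

The point of the sketch: the weather tool's schema pins down everything the model can do, while the browser tool's schema is so loose that almost any behavior fits inside it.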
Speaker 3 And then, do you cast any of these as must-haves? Like, is it, in Harrison's framing, like a chain? Or do you have to have all of these for it to be an agent?
Speaker 0 Well, no. You do not have to have all of these to be an agent. I think these are just tools in a toolkit, to use when your situation asks for it. So if you discover that you need browser automation, you
Speaker 0 pull it in. If you discover that you need memory, you pull it in. I do have this suggested evolution where a more intelligent agent will probably require all these steps, but it's not a requirement. I mean, you can sort of lop off these things and just go, like, alright.
Speaker 0 Let's just do it without tool use. Right? And then you have BabyAGI. And then add tool use and you have AutoGPT, you know, something like that. Right?
Speaker 0 But I think not having a taxonomy of the types of capabilities that you can pick off the shelf for your agent means that you're not thinking about what agents can do well enough. So having a rigorously thought through taxonomy, I think, is important here. And for what it's worth...
Speaker 4 Sorry, go ahead. There's an interesting question in the chat actually, from Michael: can memory also be considered tool use? This is something that I've also wondered before. It's, like, in the same way that browsing is effectively tool use. So basically, something external that's not provided in the base model:
Speaker 4 you're calling a vector DB, and in a way that's a type of tool use. How do you feel about that?
Speaker 0 I think that at the most superficial level, it is a form of tool use. But then I think these become so specialized, when you take them seriously enough, that they deserve a special place in the stack. And maybe I'll go into that a little bit. Right?
Speaker 0 So... Okay. So maybe I'll talk a little bit about smol developer. Oops, gave that a little bit away.
Speaker 0 And actually, I did a version of this presentation at the AGI House thing, and one thing I neglected to ask was, you know, how many people have heard of AutoGPT, which is basically everyone, and how many people use AutoGPT, which is basically zero.
Speaker 0 There are people on the AutoGPT team that I've met and asked privately who don't use AutoGPT. Right? It is the equivalent of the tulip of the 1600s. And I think, like, it's fine to dream, but I think we want to make it more practical, because that's the point of all this energy, and probably investment, going in. So I often think about...
Speaker 0 One way I think about agents is to try to connect them to an existing set of thinking and philosophy, because we don't have to reinvent everything from scratch. So the two things that I would recommend looking into: first, reading Ben Thompson's article on tech's two philosophies, which I think is quite applicable to the AI age as well. In short: AI can either do things for you, or AI can do things with you.
Speaker 0 And I think that's a very important design distinction, and agents can apply to both. I think we jump very directly to the "do things for you" bit, instead of thinking about how to integrate the human in the loop. So for example, if you sort of break down BabyAGI... By the way, one trick when encountering a new codebase is to stick all of it into GPT-4 and ask it for a diagram of the codebase, and it is very, very capable of generating things like this.
Speaker 0 Did that come from smol developer? No, I just did this in the OpenAI Playground. I encourage everyone to do this.
Speaker 2 Oh, I mean, did you do that for the smol developer codebase architecture? Or...
Speaker 0 Oh, I haven't. We could do it right now? I mean,
Speaker 2 So... Please, maybe after you talk about smol developer? Yeah.
Speaker 0 I just love showing people tricks, because that is actually probably more useful as an immediate takeaway. Anyway, so BabyAGI was designed like this. There are, like, 4 different agents, which was the origin of me
Speaker 0 realizing that there's this architectural shift going on that nobody's talking about. And then it loops back. And, you know, both AutoGPT and BabyAGI, like basically everyone, do this, which is you slap on an interrupt saying, hey human, do you want to do the next action?
Speaker 0 And that's a very tacked on thing, because we're trying to design this in a way that very much tries to leave the human out of it. And I think we should actually... The second existing philosophy I think we should borrow from prior art is essentially the categorization from self driving cars,
Speaker 0 where they've already thought through the difference between the five different levels of automated driving. And I was one of the people in 2014, 2015 who said that we would have full self driving by now. And I was wrong. And I think a lot of people were drawn in by that dream.
Speaker 0 And in fact, the people who worked on the less glamorous but more practical things that people use on a day to day basis got the data, got the lessons learned, that are now actually carrying over into more full self driving. So what I would categorize this as: basically, AutoGPT is aiming here, and maybe there are not enough people aiming here, in the sort of level 1, level 2 agent space.
Speaker 0 Cool. So... and I also want to go into the central problems of smol developer. A lot of the existing agents seem to do general things,
Speaker 0 like AutoGPT's default prompt, which is "increase my net worth". Which, my god. I mean, if it does, that's great, but it doesn't do it right now. It just spits out a lot of convincing text telling you to go be an Instagram influencer. So I think, more importantly, for me as a prospective founder, I want to be more efficient with my time,
Speaker 0 and I want to employ AI engineers, "smol developers" quote unquote, that make me more productive. Otherwise, it's not a very AI idea. And so another way of phrasing this is: OpenAI paying engineers 900,000 dollars a year is a bug, not a feature. It means that OpenAI have not solved this themselves,
Speaker 0 even though they are, by all measures, extremely productive, because they're worth 30 billion dollars with an employee base of around 400.
Speaker 2 Could you argue, though, that maybe they've solved it to some extent, and all that's left is the really high value tasks, and that's why the pay is so high?
Speaker 0 Yeah. Absolutely. So they have...
Speaker 0 There's kind of a curve. It goes up in terms of engineer compensation, and then once AI gets intelligent enough and ubiquitous enough, it goes down again. But it does go up initially because
Speaker 0 an engineer being able to wield AI more effectively makes each engineer more productive, and you pay them more if they're able to do that. So I do think that is true. That's a good pushback. Anyway, so I want to make a lot more apps, and AutoGPT is
Speaker 0 not capable of doing it. I wanted to have something domain specific: hey, only do code. And so I wanted it to only do that. So basically, I started when the Anthropic 100k context launched. I wanted to build a Chrome extension
Speaker 0 where it just summarizes any website for me in the style that I want. So I built this Chrome extension and I demoed it. And this is basically the extension: you just click on any page, and it has this loading bar because it's calling Anthropic and summarizing a very, very long page. This one is an interview including transcripts. And I also gave it a style prompt, which is going to become very important later, which is the kind of summary that I want.
Speaker 0 Right? I don't want the default summary style of Anthropic or OpenAI or whatever. I have a very specific way that I like to read things. So here's the summary style prompt.
Speaker 0 And that's super useful. And obviously, the secret to this demo is that I didn't code any single part of this extension; it was generated from a very, very long prompt. Right? And for me, the prompt development process that I wanted to have was essentially to start from a command line where you're just
Speaker 0 typing in, like, a single sentence. Right? Which is definitely under specified. And then I discovered that once you move from a single sentence into markdown, you can actually write a lot of stuff in markdown. And so markdown is all you need for prompting. Markdown is the language, not English.
Speaker 0 Because you can specify code blocks in markdown. You can specify annotations in markdown. You can specify prompts within prompts. And basically, the ease of recursion in prompting, the ability to paste code in just to give it an idea...
Speaker 0 It's calling the Anthropic API, and that API dropped after 2022, so it is not in the training corpus of OpenAI's models. So what you need to do, or all you need to do, is literally just curl the API, get the response signature, and then paste that full thing into the prompt, and GPT-4 can figure it out.
Speaker 0 This is a very, very common thing. So the activation energy for unfamiliar APIs is so low. I don't need to read docs. I just paste it in the prompt.
Speaker 0 And it's basically your log book, right, of developing an app. And I think the discovery for me was just that GPT is perfectly capable of understanding log books and translating them into an app. There's one problem with the whole thing, which is whole program coherence. What this is doing is generating 5 to 6 different files between the browser scripts, the...
Speaker 0 You can look at the code here, and you can look at the examples page; these are the examples that I demoed. So it's generating background.js, the content script (I pasted this one in manually), manifest.json, popup.html, popup.js, and also the CSS.
Speaker 0 So 1, 2, 3, 4, 5, 6 files. And they all have to refer to each other; they have cross dependencies. And this is a very similar problem for the text to image people, or the text to video people: they also have a problem with coherence. Same for the story generation people, the people who are doing very, very long novels and creative fiction type things.
Speaker 0 So we have the same problem in code, but we have a tool that they don't have, which is that we can name our variables. And again, markdown: fantastic for naming variables, highly recommend it. And since then, obviously, a lot of people have taken it and extended it to build ChatGPT plugins, React apps, VS Code extensions, full stack apps in Python and SQLite, Tailwind, React, Node, MongoDB apps, and then CLI apps. So the complexity of this...
Speaker 0 These tools, smol developer, can generate this level of complexity, which is, like, smol apps, smol extensions, smol everything. But there are a lot of problems with this approach, and I wanted to solve them, but I really kept hitting this issue. What it actually does is generate this manifest called shared_dependencies.md, which you can see over here, which is a prompt that GPT generates to program itself for future iterations of itself.
Speaker 0 And this is essentially a plan. This is the code equivalent of chain of thought: "let's think step by step" for English. And then here: here's what I'm going to share between files, and you have to use the exact names of these things, or the code is not going to compile.
Speaker 0 So I think that's pretty important. And the planner that I have, this is the prompt for the planner, and the file that it generates. The planner I have is too primitive. I needed a better planner, and I didn't really have a clue for what
Speaker 0 that planner is. By the way, the planner is the fifth level, the hardest one, because it's the newest, and also something that OpenAI acknowledges itself is bad at, which is long horizon planning. So I really had a problem with this until I read the Voyager paper from Nvidia,
Speaker 0 which basically improved upon existing solutions for Minecraft. So solving Minecraft is a very, very arbitrary task, but you just pick the most complex task you can find, like building a diamond tool, and then see if you can do it faster than any other automated system. And Voyager did it in part by having a skill library. They
Speaker 0 actually have 3 sections, but really I boil it all down to having a skill library. And I think that's one of the things that made it more interesting to me. And together with the insight from the BabyAGI level, it actually made me realize that everything with an LLM core is fundamentally constrained by the capabilities of an LLM, whereas we already know how to code a ton of things, and we're not constrained by that.
Speaker 0 We're just constrained by Turing completeness or whatever. And so here's the core insight, the screenshot takeaway for this talk, or at least my thesis so far: an LLM core is fundamentally constrained by LLM capabilities.
Speaker 0 And in order to break out of that, and in order to make more dependable, reliable, more secure, more controllable apps, you have to invert the way the LLM's relationship with code is typically done, and move code into the core of the functionality. And there are specifically four segments that have been identified so far.
Speaker 0 This is not exhaustive; this is just top of mind: the planner agent, the skills library, the context agent, and the UI and interface generation agent. I'll pause there in case I've lost anyone. I'm not looking at the chat. Alright, I can keep going.
Speaker 0 So I just mentioned that this is my current area of active research. And I'm not sure if all of these must be developed together, or if they can be developed by separate teams, in which case we should define a nice standard interface between them. But I definitely feel like this is a standard progression that all of us
Speaker 0 will make. And some use cases will be perfectly fine here. But the more you need control, the more you need long horizon planning, the more you need skills reuse, the more you need sort of custom UI, the more you'll have to move to this architecture. And I think it helps to keep that in mind, and to have a language for all of us builders to discuss this in shorthand. So the shorthand I have is LLM core versus code core, as in which one you are putting at the heart. And why this is code core, by the way, is that once the code is generated, the LLM is not in the loop
Speaker 0 for that. Right? You just prefer the code that has been validated, because it's fast, cheap, simple, whatever, and it always works, and those are not properties of LLMs.
Speaker 0 So I think that's very attractive, and I think it is up to us to identify what the roles are. Maybe there are roles here I haven't imagined, and I fully welcome debate and discussion about that. But I think recognizing this distinction does make sense. We do have this evolution, and I'll try to wrap up soon. So let's say 2020 to 2022, it was the OpenAI Playground,
Speaker 0 where we did completion. Right? And then, end of 2022, we got chat. Right?
Speaker 0 And this was a few things. This was essentially very, very light context, which is essentially the conversation history of that session, and then UI and interface generation. Right?
Speaker 0 And then 2023, yeah, we have to go month by month. But the way these things have developed is really like... So Chat with Bing. Right? This is chat plus tools, or skills,
Speaker 0 but also with a little bit of planning. Okay. So that was Chat with Bing. This was March. I don't actually know when
Speaker 1 we got Chat with Bing, maybe Feb.
Speaker 0 And ChatGPT with plugins: I remember this day because I was on a family trip in Japan and I dropped everything to go learn it. The interesting thing about chat with plugins and Chat with Bing is that you started to see plugin usage that isn't single iteration, which you can see when you do a chat with, let's say, GPT-4 and Browse with Bing. Right?
Speaker 0 Like, "research the top 10 movies by box office". And what this would do is not just browse the web for the first result, but actually do a for loop over the top 10 movies, clicking on each link. So there's some kind of planning inside of here, with a task queue, along with just tool usage. But, like, this became necessary. Okay,
Speaker 0 fine, for this particular query it was able to fulfill it in one link. But I've definitely seen screenshots, and I've definitely had personal experiences, of
Speaker 0 multiple steps of research before it was done. So hopefully that makes sense. And then, let's say around the April to May period, that was AutoGPT and BabyAGI. And that added in skills and context agents. Okay.
Speaker 0 I think I'm running long, so let me just wrap up. So I think that each of these needs some kind of core compute component or design, which is some kind of recursive LLM agent, because this is effectively the while loop embedded in the agent. From this, you can back out chain of thought or tree of thought, if you design it correctly.
Speaker 0 But basically, you have this input, you have this output, and then some form of recursion where you have validation and fixing. And then you can connect that up to a vector DB where you store all the skills that you validate, and that is a skills library. Or you can connect it up to a priority task queue or DAG, and that is your planner agent. Or you can connect it up to a memory store, and that is your agent's conversation history or memory with the user, including their preferences. Or you can connect it up with an interface store, and that's your design system. So chat is not all you need; you can also respond with components that have been preset for you.
Speaker 0 So that is my state of thinking right now: that everything comes down to these compositions of agents, where we define an interface that talks between them. I'll pause there.
Speaker 2 So in those diagrams, is there a different way that, like, a design agent would talk to the UI and interface store, as opposed to how a skills agent would talk to the skills library, as opposed to how the context agent would talk to the context store? Is it just the same type of agent talking to different things, and you just need these different stores? Or are there fundamentally different ways that these agents interact with these various resources?
Speaker 0 Yeah. My impression is that, just like anything in machine learning, you would get short term gains by doing feature engineering. By designing, like, okay,
Speaker 0 the skills library management agent has a specific contract, the context agent and planner agent have specific contracts, and so on and so forth. And that is efficient in terms of tokens. Right? That is efficient in terms of effectiveness, because you can feature engineer your way into some form of 80/20 rule.
Speaker 0 And then this will go away once models are capable enough or cheap enough that you can just dump everything in. I think a lot of the engineering is sort of the art of the possible. Right?
Speaker 0 Like, what can we do with today's tools and limitations? And I do think that we'll probably want to define pairwise connection contracts and interfaces.
Speaker 2 Alright. In the interest of time, let's move on to AgentEval, but we'll surely return to a lot of these topics in the Q&A at the end. So, AgentEval. I don't know who wants to take it away, but you guys are up.
Speaker 5 Alright. I'll take control of the screen.
Speaker 3 Yeah. And afterwards, John, we should talk about how you're doing your eval.
Speaker 5 Okay. Cool. Can everybody see my screen? I can't see anyone else.
Speaker 5 Okay. Awesome. So hey everyone, once again, my name is Alex. With Darius, Guru, and Jesse, we all hacked together at AGI House, which is a collective of very talented AI researchers, data scientists, and machine learning engineers, and we all spent a Saturday trying to hack together some cool projects, all involving agents. So for some context:
Speaker 5 everyone on the team has tried agents, and we've all come to the conclusion that they suck. It's just indisputable at this point. It's worth mentioning also that even, you know, people building new agents don't seem to like them very much. And it doesn't really take much research to see that a lot of people on the Internet are very upset with the state of things.
Speaker 5 They're expensive, they fall into loops, and they don't really seem to complete their tasks in any meaningful way. So that's why we teamed up to create AgentEval. Our team ended up winning first place at the hackathon. And so it's essentially a 3 step process. The first is you take an agent of your choice.
Speaker 5 We chose Taxy AI. For those who are unaware, Taxy AI is a Chrome extension that takes control of your browser and performs browser automation actions based on a task. We take every single log that the agent produces and then plug it into some third party tools. We tried Microsoft Excel, tried and true, and Amplitude, which is an analytics dashboard software. And then we can use that to actually identify the pain points: where the agents are failing, why they're failing, and how we can make improvements on the next iteration. So I'll just, without explaining much further, jump right into the demo.
Speaker 5 Or sorry, one other thing is: how do you actually know your benchmarks? Right? So effectively, everybody's favorite metric is LGTM@10, which is "looks good to me" at 10.
Speaker 5 But a few weeks ago, there was a dataset published called Mind2Web. Mind2Web is over 2,000 tasks from over 130 websites, where they recorded the sessions of users actually trying to accomplish a task. So for example, booking a flight, or purchasing a pizza, or scraping a Twitter profile. And they took all of those tasks being human operated, and recorded the clicks and how they actually accomplished it, how many steps it took, and so forth. So we used this as our baseline metric here:
Speaker 5 how effective is an agent compared to a human trying to accomplish the same task? And so this is kind of, you know, the standard. There's not much work in the way of standards for agents these days, but this is probably one of the earliest standards we can work with. So here's a demo of Taxy in action. Taxy is a Chrome extension.
Speaker 5 So right now, I've recorded it given the task of creating an account on delta.com with my name. As you can see, it's pretty slow. It goes step by step in terms of filling out my first name on the form, my last name, and so on. And what we did is we created this waterfall graph where you can actually analyze the steps of the agent.
Speaker 5 So you can see the token consumption, the prompt that it used, the action ID, and its chain of reasoning in terms of what the response was and what the action was, given the prompt and the context. So that's a Taxy action, and this is how AgentEval plugs into Taxy. So what is the output of that? Other than the waterfall graph I just showed you in that little short video, we took every single action and plugged it into an Amplitude dashboard.
Speaker 5 So on the left side of the screen, there is a graph that looks like 2 bar charts. And what that is effectively saying is: we ran about 400 trials, not on a variety of tasks, but on the same task actually, the Delta thing. And we saw that only 30 percent of the time did it actually complete the task.
Speaker 5 And completion is not even necessarily, like, you know, is the data correct or not. It's just whether the agent believes that it has completed the task. So just as a standalone benchmark, a 30 percent success rate, that's really quite dreadful.
Speaker 5 And on top of that, in the middle graph, we looked at how long the agent takes to actually complete a task. Right? So on median, it takes about 25 seconds to go from start to completion. But that's also including failures, and oftentimes it gets stuck and, you know, takes well over 2 minutes to fill out a form. So those were kind of the metrics that we looked at here to identify, like, okay, how well is
Speaker 5 the agent performing at a high level compared to Mind2Web. And, you know, are there any infinite loops? Is it crashing? And so on and so forth. So finding out whether it ends sessions prematurely.
Speaker 5 Here, I had a harder time zooming into this, but actually I'm sure I can just share it right here. So this is, like, an attrition graph, for lack of a better word. Starting on the left hand side of the screen, this is kind of like all of the starting agent sessions. So if we took 400 agents, you would say that 100 percent of them started the task.
Speaker 5 Right? You have to start the task as an agent. But only 92 percent of the agents actually made it to the next stage. About 8 percent of them ended the session prematurely: either it got lost, there was an error,
Speaker 5 or there was a bug. And then step by step, you can actually see the agents go from starting the task, to processing the page, to determining the action, to actually performing the action, and 40 percent of the agents fail outright. And that's really kind of a condemning view of just how poorly it's performing right now. But the optimistic side is that this is actually one of the first metrics we have. Right?
Speaker 5 Everybody knows LGTM@10, but this is one of the more concrete views of how the agents survive and perform over time. So this is one thing we worked on in the AgentEval project: getting this graph running. I'm going to go back to my slideshow. And lastly, so...
Speaker 5 Yeah, just, you know, kind of the high level thing is we ran this... So the last few things were benchmarked against ulta.com. This is a graph showing how we benchmarked it against the Mind2Web dataset. So we saw...
Speaker 5 Similarly, it only succeeds 30 percent of the time. That seems to be kind of like the standard pattern here, whereas it fails 70 percent. And we ran this on, keep me honest here, Darius, I think, like, something like 15 or 20 different tasks?
Speaker 4 25.
Speaker 5 In aggregate. Is that right?
Speaker 3 25.
Speaker 5 Right, 25 different tasks, and we saw that this was kind of the failure metric, 30/70. So not so hot. Yeah. Anyway, that's it from my end. So Kurt wanted to talk a little more about the key metrics that we thought about and brainstorm where we think we're going forward.
Speaker 4 Yeah. Absolutely. Yeah. Definitely on the previous slides, so in terms of our experimentation, just wanna call out that we took a random set of tasks, and it wasn't representative. So part of our future work here is scaling this up to validate whether our success rates were correct.
Speaker 4 The key metrics that we're looking at: so the first is task success, does the agent's end state match the desired outcome? And this is part of having a supervised approach, where we have a ground truth dataset we can compare against. Repeatability is another thing we care about. Right? So over multiple iterations, given the non-deterministic nature of these systems, if you...
Speaker 4 Run it through a task several times, does it end up at that end state multiple times, and what percent of the time? 1 of the more interesting dimensions here is that there's a graph similarity problem, which, going back to, like, algorithms and discrete math if you ever took those classes, there's this concept of graph isomorphism.
Speaker 4 Right? So as an agent is taking a series of choices, there's this large space that it's exploring. How similar is the graph of the steps that it took to the desired set of steps? So that's another dimension to look at. We coined this idea of drunkenness. There's definitely a better word out there, but we stuck with drunkenness: basically, did the model have too many beers?
Speaker 4 Did it go off the rails? And if so, like, how off the rails did it go? Was it a reversible or irreversible decision? How egregious was its deviation? That's something to look at.
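The graph-similarity idea can be approximated cheaply without full isomorphism checking, for example by comparing the edge sets of the agent's trajectory against the ground-truth trajectory. A sketch of that weaker stand-in (state names are invented; a true isomorphism test would need graph tooling like networkx):

```python
def step_edges(steps):
    """Turn a sequence of visited states into a set of directed edges."""
    return set(zip(steps, steps[1:]))

def trajectory_similarity(agent_steps, truth_steps):
    """Jaccard similarity of edge sets: 1.0 means the same transitions,
    lower means the agent wandered off the ground-truth path."""
    a, b = step_edges(agent_steps), step_edges(truth_steps)
    return len(a & b) / len(a | b) if a | b else 1.0

truth = ["home", "search", "product", "cart", "checkout"]
agent = ["home", "search", "product", "product", "cart", "checkout"]  # one redundant step
print(trajectory_similarity(agent, truth))  # 0.8
```

A low similarity score on a successful run is one way to quantify the "drunkenness" notion: the agent got there, but via a deviant path.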
Speaker 4 Some of the more quantitative metrics are easier to evaluate. So the number of steps the model took to complete the task, and how much of a delta was that from the ground truth. So when a human did it, how many steps did they take versus the agent? And then latency, number of tokens, cost, these are more standard metrics that you might look at.
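The quantitative metrics are easy to record per run. A minimal sketch of the step-delta comparison against a human baseline (the field names and numbers are made up for illustration):

```python
from dataclasses import dataclass

@dataclass
class RunMetrics:
    steps: int        # actions the agent took
    latency_s: float  # wall-clock time for the task
    tokens: int       # total tokens consumed
    cost_usd: float   # estimated API cost

def step_delta(agent_run: RunMetrics, human_steps: int) -> int:
    """Extra steps the agent took compared to the human ground truth."""
    return agent_run.steps - human_steps

run = RunMetrics(steps=14, latency_s=38.5, tokens=5200, cost_usd=0.16)
print(step_delta(run, human_steps=9))  # 5
```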
Speaker 4 In general, 1 of our key takeaways from the day of experimentation was that evals for web agents are challenging. And there's several reasons for this. The first: debugging agent failures is challenging. As I mentioned, these systems are non-deterministic, and so they can fail for different reasons every time, even with the same starting conditions. So for example, was the failure a result of a limited token budget?
Speaker 4 Was it missing context? There's another category of failures, which are reasoning failures, i.e., did it get stuck in an infinite loop, was there task divergence. And then finally...
Speaker 4 These agents are starting to interact with the web for the first time, so there may be issues in the web interaction step. Did it have difficulty understanding the DOM?
Speaker 4 Was there incorrect data entry when it was interacting with the website? Did it get stuck on a captcha? You know, there's a variety of things that can break these systems. Couple other things. So the web isn't static.
Speaker 4 You know, with the supervised approach that we were going with, there's a ground truth dataset that is a snapshot of a point in time. But the web is constantly evolving, it's a dynamic space. And so how do you keep these eval datasets updated over time? Defining task success, number 3, is also tricky. So tasks can require some kind of optimization.
Speaker 4 Let's say you're booking a flight. How do you know that you got the best flight or the right flight? What is a successful outcome there? So there's a degree of did it complete the task, but then there's some optimization level that's an additional consideration.
Speaker 4 Then there's the combinatorial action space. There's so many potential actions that these systems can take, how do you whittle that down? And, you know, finally, as we think about multi-agent systems, where you have multiple agents coordinating with 1 another, this becomes an even more complex problem. So I'm gonna hand it over to Darius now to walk through some alternative approaches.
Speaker 3 Thanks, Kurt. So I think what's helpful is to contextualize a bit of the work we did against a few different approaches to evaluation. So what we did was basically supervised evaluation. So we had the Mind2Web dataset, where we had the agent's performance being evaluated using the labeled dataset of, you know, human demonstrations, where people took actions.
Speaker 3 But there's other approaches. And what was really cool is there were some other hacks using those other approaches as well. So the second is, like, reinforcement learning. So here the classic case would be, like, unit test pass/fail, and that is your environmental reward function to tell you, hey, you know, tests start passing.
Speaker 3 You're doing a good job. You can also think about projects like Voyager, where they have a paper, and as you make progress in Minecraft, then you're also doing a good job. The last approach is kind of this in-development concept of, like, self-supervised. And so here the ground truth would be conversational performance. And what people are doing is having...
Speaker 3 The agent that's being evaluated converse with another agent trying to mock a human's behavior, and then doing auto-labeling. So there's a project that kind of took that approach. So the first hack was a way to explore that labeled dataset. So this was done by Marco, and it was called Jungle. And so basically, his tool allows you to look at the different domains in Mind2Web, sub-select certain websites or categories, and then kind of export that dataset.
Speaker 3 So this tool would have been amazing for us, because we would have been able to select, like, checkout sites, where we have a clear finish flow, and that would have been helpful in doing our evaluation. So Marco's contact is in the corner if you want to reach out. The second hack was a reinforcement-eval approach. And so here James looked at unit test pass/fails. And so he built an agent that ran a bunch, I think around 10 times per execution, looked to see if it would flip the status of the unit test, and then kind of built up this scorecard of, hey, how often that actually occurred.
Speaker 3 If we go to the next slide, we can see that video. So what's going on here is he's kind of running the agent. It's kicking off 10 different code edits. And then at around, like, 21 seconds or so, you'll see, based on the unit test pass/fails, it'll flip the status of whether or not the tests passed or failed.
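The repeated-run scorecard James describes might look roughly like this, with stand-in callables for the agent and the test suite (everything here is a hypothetical sketch, not his actual implementation):

```python
def unit_test_scorecard(propose_fix, tests_pass, attempts=10):
    """Run the agent's code-edit loop several times and record how often
    it flips the failing unit tests to passing. This treats unit test
    pass/fail as the reward signal for a non-deterministic agent."""
    outcomes = [bool(tests_pass(propose_fix())) for _ in range(attempts)]
    return {
        "attempts": attempts,
        "passes": sum(outcomes),
        "pass_rate": sum(outcomes) / attempts,
    }

# Stand-ins for illustration: edits are strings, the "test suite" checks one property.
edits = iter(["bad", "good", "bad", "good", "good", "bad", "bad", "bad", "bad", "bad"])
score = unit_test_scorecard(lambda: next(edits), lambda e: e == "good")
print(score)  # {'attempts': 10, 'passes': 3, 'pass_rate': 0.3}
```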
Speaker 3 And so, you know, you do this a number of times to kind of get robustness, given the non-determinism challenge, and that's James's project. The last project was kind of a self-supervised eval approach. And so here Jay and team ran their agent. They had a simulated human with a goal of, like, booking a flight and making a trip. And the agent under test was kind of conversing with that human. And then finally, if you can pause here, they created a number of LLM-based eval metrics, including, like, hey, did the trip actually have coverage, was the conversation human-like.
Speaker 3 And based on these scores, assessed in kind of an auto-eval way the performance of the agent. Also a really cool project.
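The LLM-based eval metrics in that last project amount to asking a judge model scored questions about the transcript. A sketch with a stubbed-out judge (the criteria wording and the parsing scheme are invented; swap the stub for a real model call):

```python
def judge_conversation(llm, transcript, criteria):
    """Ask an LLM judge to score a conversation 0-10 on each criterion,
    then parse the integer out of its reply."""
    scores = {}
    for name, question in criteria.items():
        prompt = (f"Conversation:\n{transcript}\n\n{question}\n"
                  "Answer with a single integer from 0 to 10.")
        digits = "".join(ch for ch in llm(prompt) if ch.isdigit())
        scores[name] = min(10, int(digits)) if digits else None
    return scores

criteria = {
    "coverage": "Did the trip plan cover everything the user asked for?",
    "human_like": "Was the assistant's side of the conversation human-like?",
}
stub_llm = lambda prompt: "8"  # stand-in for a real API call
print(judge_conversation(stub_llm, "user: book me a flight...", criteria))
```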
Speaker 5 Okay. Awesome. Yeah. Yeah. Yeah.
Speaker 5 So just 1 last thing, shout-outs to team agent eval, you can check us out on Twitter. We write kind of thought-leadership-y threads every now and again. And you can check out Get Scorecard. So Jesse and Darius are working on this. They're building...
Speaker 5 Actually, no, I'll let you guys touch on that 1.
Speaker 3 Yep. So, testing and evaluation for production readiness, mostly for chains, in Harrison's terminology, but also exploring agents because of all the new stuff.
Speaker 5 Mh. Yeah. And I'm working on Staf.ai. So we're building kind of a vetted catalog of AI agents and copilots, and using some of the agent eval framework tooling under the hood, some of which may or may not be open source in the near future.
Speaker 5 Yep. So at that point, I think we can open it up for questions, if anybody has anything in mind.
Speaker 2 Awesome. Thank you guys for that.
Speaker 6 You can check out the Q&A section. Beatrice, one of the agent eval members, addressed 1 of the questions. The questions are up on the right side.
Speaker 4 And before we jump into Q&A, 1 other thing I wanted to call out is we had a large team of folks, and there were many contributors. So a few other folks that didn't join us today but were also key contributors: Andre, Beatrice, John and Aj. I wanted to shout them out as well.
Speaker 2 Awesome. Thank you guys for that presentation. Jumping in, we've got about 5 minutes left. Gonna jump into a few of the questions on the right, and maybe you can do quick rapid fire.
Speaker 2 So, top 1 by far: what is the most affordable free/local LLM for specific agent-executor tasks like decision making, tool selection? And in general, I guess, beyond OpenAI, is there anything else that works?
Speaker 1 Yeah. So, like, I think there's this trade-off between capability and efficiency. And I think right now, you should chase capability, meaning pay for, you know, try to get GPT-4 API access. It's not actually that hard.
Speaker 1 Just join some hackathons, where they hand out access like candy, and pay for it. It's not that expensive. But if you need it to be local, then something like an MPT-7B, unless you can run a 65B thing, which maybe G bell is working on.
Speaker 3 Yeah. I kind of correlate this with self-driving. So, you know, they're kind of putting the most expensive sensors on the car because you wanna solve the problem first. And then you down-cost as, you know, hardware gets cheaper.
Speaker 1 Yeah. For those who don't know, Darius works in self-driving, so he knows something.
Speaker 2 Awesome. Yeah. The only thing I'd add to that is, like, yeah, use OpenAI. And specifically, I think OpenAI functions is really good. So I'd probably almost always start there.
Speaker 2 There's 1 other 1 that I thought look. Good. Oh, and Yeah. This is related to... Hold on.
Speaker 2 Where is it? It... It's related to the 1 around, like, tool tool generation and skill generation. Basically. So in minecraft, I think they had the the language model like write new skills, basically, or something like that?
Speaker 2 You...
Speaker 1 Yeah. Voyager has a curriculum as well.
Speaker 2 Yeah. So how would...
Speaker 1 you of entropy
Speaker 2 how do you think about the skill edition in, like, a generic framework? Like... And I think... I'm I'm trying to find the question I forgot what it, but it was, like, could you, like, ask the human to add a skill or something that like that? Like, what does this even look like generically.
Speaker 2 Oh.
Speaker 1 Yeah. It's The... I think the human human in the loop is slow.
Speaker 2 Yeah.
Speaker 1 And probably not that useful. Maybe with if they're a domain expert. That could be helpful. But actually, just plugging it into the Internet, it's probably be more useful well, because that's just a lot more humans and, yeah, That will be my quick 2 cents. And and then you have to turn...
Speaker 1 There's all these pro programming imperatives. I think I need to... En them all. But having a human in the loop requires it to be asynchronous, essentially, because you have to because the human needs to be able to, like, just kinda monitor and respond it, it shove things into it a to task you and then the agent needs to be able to survey a task you and and pick up those pieces of information. Rather than having the whole program pause and wait for the human to respond.
Speaker 1 That's that's it's very it's gonna interrupt driven and not agent t in a sense of it's not autonomous.
Speaker 5 On top of that, I think, if memory serves right, keep me honest, for the Voyager paper, they kind of had a cheat code around the RL, which was all these Minecraft tutorial videos that they fed in. Right? So if you can find tutorials out there, you don't really have to go through the arduous process of teaching it, if all the instructions are out there on the web for you.
Speaker 1 Yeah. I would actually not call it RL, Jeff, because it is, quote, zero-gradient, in the sense that there was no actual training. It was just context that they embedded.
Speaker 0 Great.
Speaker 2 Awesome. And then last question before we sign off. Yeah, and this is around basically how to get agents to best invoke tools. And so the question is, like, specifically, do you think, behind the scenes, OpenAI functions is doing something like, like, ReAct?
Speaker 2 There's also been kind of, like, the tree-of-thoughts type stuff. And so I guess, generally, when you guys are trying to get it to use tools, what prompting strategies do you use? Do you do tree of thoughts?
Speaker 2 Do you do ReAct? Do you just tell it to? Do you ask it to think carefully, step by step?
Speaker 1 I just tell it to and give some examples. I think few-shot is the lowest hanging fruit. Everyone should just few-shot their prompts. Yeah. Beyond that, it's unknown, untested.
Speaker 1 You probably need to eval the effectiveness of all these methods.
Speaker 2 Cool. Alright. Thank you all for joining. This has been a lot of fun, chains versus agents. Actually, wait...
Speaker 2 Yeah. Chains versus agents. You gotta pick 1, final sign-off. Everyone has to pick chains or agents. I don't even know why you're picking it.
Speaker 2 I don't know what the judging criteria is. But you gotta pick 1. Chains or agents, Darius?
Speaker 3 Off the chain.
Speaker 2 Alright. Wait. Off the chain, is that good for chains, or are you saying off the chain as in bad?
Speaker 3 But I could see how that could be misconstrued.
Speaker 2 Yeah. Sorry. Chains or agents?
Speaker 1 Oh respond like, Tourist was like exact wins. Spain between the smooth.
Speaker 2 Alright. Agents agents. Agent is. Alright.
Speaker 4 Agents all the way, baby. Alex?
Speaker 5 Secret agent.
Speaker 2 Jesse? The agent, fully autonomous. Beatrice? Yep.
Speaker 6 LangChain.
Speaker 2 Okay. Do you know what, that's a perfect segue to end it. We'll end it there. I think an overall overwhelming win for agents, 4 to 2.
Speaker 2 But, yeah. We'll see what happens.
Speaker 1 For what it's worth, LangChain is gonna be extremely critical in agents anyway. So... Yeah.
Speaker 0 It's not a... it's not a LangChain versus. It's...
Speaker 2 No. It's... No, it's definitely not. Like... Maybe, Harrison.
Speaker 4 Have you thought about that?
Speaker 2 Sorry. LangAgent? Yeah. Yeah. Yeah.
Speaker 1 Then you'd have to rename the whole company. But...
Speaker 2 Yeah. Alright. Thank you guys so much for joining. This has been a lot of fun, and hopefully very informative as well. So I really appreciate it.