Summary: LangChain "Chains vs Agents" Webinar (YouTube)
8,779 words - YouTube video
One Line
John discusses the importance of agents, the challenges of building small developer applications, and several approaches to agent evaluation, emphasizing the need for further research.
Slides
Slide Presentation (12 slides)
Key Points
- John discusses the concept of agents and their interaction with calendars and tasks.
- The importance of combining different tools and components to build up to an agent.
- The difference between tool use and browser automation, with browser automation being a specialized form of tool use.
- The need for a taxonomy of agent capabilities and the importance of memory in an agent's functionality.
- Challenges associated with creating coherent programs and the complexity of small developer tools.
- The webinar's focus on moving code into the core functionality of language models, and four specific sections to develop: the planner agent, skills library, context agents, and UI/interface generation agent.
- The evaluation of agents, including the low success rate and slow completion time of agents using the Taxy AI Chrome extension.
- Alternative approaches to agent evaluation, such as reinforcement learning using unit tests as environmental reward functions, and the need for further research and experimentation in this field.
Summaries
84-word summary
John discusses the importance of agents, combining tools, and the need for a taxonomy of agent capabilities. He shares his experience with small developer applications and the challenges they face. The webinar focuses on moving code into language models and identifies four specific sections for development. The evaluation of agents is discussed, including the low success rate of the Taxy AI Chrome extension. Different approaches to agent evaluation are explored. The webinar concludes with a discussion on feedback and the importance of further research.
331-word summary
John discusses the concept of agents and their interaction with calendars and tasks. He emphasizes the importance of combining different tools and components to build up to an agent, highlighting the need for a taxonomy of agent capabilities and the importance of memory. The conversation then shifts to small developer applications and the challenges associated with creating coherent programs. John shares his experience with building a Chrome extension using prompts and the ease of developing with Markdown. He discusses the complexity of small developer tools and the need for a better planner, suggesting combining existing capabilities with a skill library for efficient development.
The webinar focuses on moving code into the core functionality of language models to create dependable applications. Four specific sections requiring development are identified: the planner agent, skills library, context agents, and UI/interface generation agent. The need for a standardized progression in language architecture is mentioned. The webinar explores how various agents interact with different resources and discusses defining pairwise connection contact interfaces.
The evaluation of agents is discussed, particularly in relation to the Taxy AI Chrome extension. The low success rate and slow completion time of agents using this extension are highlighted, leading to the introduction of the Agent Eval project. Challenges related to debugging agent failures, dealing with a dynamic web environment, defining task success, and managing complex action spaces when multiple agents are involved are also discussed. Alternative approaches to agent evaluation, such as reinforcement learning using unit tests as environmental reward functions, are mentioned.
Different approaches to evaluating conversational agents are discussed, including self-supervised learning, reinforcement evaluation, and self-supervised evaluation. Tools such as Jungle for exploring labeled datasets and the use of examples and prompts to guide agents are mentioned. Other projects and tools related to agent evaluation, such as the Agent Eval Scorecard and the use of OpenAI functions, are also discussed.
The webinar concludes with a discussion on participants' feedback and the importance of further research and experimentation in this field.
437-word summary
John discusses the concept of agents and their interaction with calendars and tasks. He emphasizes the importance of combining different tools and components to build up to an agent. John explains the difference between tool use and browser automation, stating that browser automation is a specialized form of tool use. He highlights the need for a taxonomy of agent capabilities and mentions the importance of memory in an agent's functionality. The conversation then shifts to small developer applications and the challenges associated with creating coherent programs. John shares his experience with building a Chrome extension using prompts and the ease of developing with Markdown. He discusses the complexity of small developer tools and the need for a better planner. John references the Voyager paper from Nvidia, which introduced a skill library for solving complex tasks. He concludes by emphasizing that coding offers more flexibility than using an AI core and suggests combining existing capabilities with a skill library for efficient development.
The LangChain "Chains vs Agents" webinar focuses on moving code into the core functionality of language models to create dependable, secure, and controllable applications. Four specific sections requiring development are identified: the planner agent, skills library, context agents, and UI/interface generation agent. The need for a standardized progression in language architecture is mentioned, with the term "code core" used to refer to the central code. The webinar explores how various agents interact with different resources and discusses defining pairwise connection contact interfaces.
The evaluation of agents is discussed, particularly in relation to the Taxy AI Chrome extension. The low success rate and slow completion time of agents using this extension are highlighted, leading to the introduction of the Agent Eval project. This project aims to identify pain points and areas for improvement through analyzing agent actions and metrics such as task success, repeatability, graph similarity, and drunkenness (deviation from desired steps). Challenges related to debugging agent failures, dealing with a dynamic web environment, defining task success, and managing complex action spaces when multiple agents are involved are also discussed.
Alternative approaches to agent evaluation, such as reinforcement learning using unit tests as environmental reward functions, are mentioned. The webinar emphasizes the importance of further research and experimentation in this field.
Different approaches to evaluating conversational agents are discussed. These include self-supervised learning, reinforcement evaluation, and self-supervised evaluation. Tools such as Jungle for exploring labeled datasets and the use of examples and prompts to guide agents are mentioned. Other projects and tools related to agent evaluation, such as the Agent Eval Scorecard and the use of OpenAI functions, are also discussed.
The webinar concludes with a discussion on participants' feedback and the importance of further research and experimentation in this field.
739-word summary
John discusses the concept of agents and how they can interact with calendars and perform tasks without human involvement. He mentions the importance of building up to an agent by combining different tools and components. He also talks about the difference between tool use and browser automation, stating that browser automation is a specialized form of tool use. John explains that having all the components is not necessary for an agent to function, but it can contribute to its intelligence. He emphasizes the need for a taxonomy of agent capabilities and mentions the importance of memory in an agent's functionality. The conversation then shifts to small developer applications and the challenges associated with creating coherent programs. John shares his experience with building a Chrome extension using prompts and the ease of developing with Markdown. He discusses the complexity of small developer tools and the need for a better planner. He references the Voyager paper from Nvidia, which introduced a skill library for solving complex tasks. John concludes by highlighting the insight that coding offers more flexibility than using an AI core, and suggests that combining existing capabilities with a skill library can lead to more efficient and productive development.
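To make the skill-library idea concrete, here is a minimal sketch in the spirit of Voyager: verified code snippets are stored alongside an embedding of their description and retrieved by similarity to a new task. The bag-of-words embedding and all names here are illustrative assumptions, not the paper's implementation.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would use an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class SkillLibrary:
    """Stores verified code skills and retrieves them by task similarity."""
    skills: list = field(default_factory=list)  # (description, embedding, code)

    def add(self, description: str, code: str) -> None:
        # Voyager only stores a skill after it passes self-verification;
        # here we assume the caller has already verified `code`.
        self.skills.append((description, embed(description), code))

    def retrieve(self, task: str, k: int = 3) -> list:
        query = embed(task)
        ranked = sorted(self.skills, key=lambda s: cosine(query, s[1]), reverse=True)
        return [(desc, code) for desc, _, code in ranked[:k]]


lib = SkillLibrary()
lib.add("fetch a web page over http", "def fetch(url): ...")
lib.add("parse an ics calendar file", "def parse_ics(path): ...")
print(lib.retrieve("download a page from the web", k=1))
```

Retrieval-then-reuse is what makes this cheaper than regeneration: once a skill is verified, the planner can compose it directly instead of paying for fresh generation each time.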
The LangChain "Chains vs Agents" webinar discusses the need to move code into the core functionality of language models in order to create more dependable, secure, and controllable applications. The webinar identifies four specific sections that require development: the planner agent, the skills library, the context agents, and the UI and interface generation agent. The speaker also mentions the need for a standardized progression in language architecture and suggests using the term "code core" to refer to the central code. The webinar also explores the different ways in which various agents interact with different resources and discusses the possibility of defining pairwise connection contact interfaces.
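One way to read the "code core" idea is deterministic code in the center orchestrating the four agents behind narrow, typed interfaces. The sketch below uses Python Protocols with hypothetical signatures; the webinar names the components but does not specify their interfaces.

```python
from typing import Protocol


class Planner(Protocol):
    def plan(self, goal: str) -> list[str]:
        """Break a goal into ordered steps."""
        ...


class Skills(Protocol):
    def retrieve(self, step: str) -> str:
        """Return code or a tool suited to one step."""
        ...


class ContextAgent(Protocol):
    def gather(self, step: str) -> str:
        """Collect the context (files, calendar, web state) a step needs."""
        ...


class UIAgent(Protocol):
    def render(self, result: str) -> str:
        """Generate an interface for the final result."""
        ...


def code_core(goal: str, planner: Planner, skills: Skills,
              context: ContextAgent, ui: UIAgent) -> str:
    """The dependable, inspectable center: plain code wiring agents together."""
    outputs = []
    for step in planner.plan(goal):
        outputs.append(f"{skills.retrieve(step)} given {context.gather(step)}")
    return ui.render("; ".join(outputs))
```

Because the interfaces are structural, each agent can be swapped out independently, which is one way to realize the pairwise contracts between components that the webinar alludes to.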
The webinar then transitions to discussing the evaluation of agents, particularly in relation to the Taxy AI Chrome extension. The presenters highlight the low success rate and slow completion time of agents using this extension, emphasizing the need for improvement. They introduce their project, Agent Eval, which aims to identify pain points and areas for improvement by analyzing agent actions and metrics such as task success, repeatability, graph similarity, and drunkenness (deviation from desired steps). The presenters also discuss the challenges of debugging agent failures, dealing with a dynamic web environment, defining task success, and managing the complex action space when multiple agents are involved.
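The webinar names these metrics without defining them, so the sketch below assumes simple illustrative formulas over recorded action traces: exact-sequence repeatability, a sequence-similarity stand-in for graph similarity, and drunkenness as surplus actions relative to a reference path.

```python
from collections import Counter
from difflib import SequenceMatcher


def task_success(trace: list, goal_state: str) -> bool:
    """Did the run end in the desired state?"""
    return bool(trace) and trace[-1] == goal_state


def repeatability(traces: list) -> float:
    """Fraction of runs that reproduce the most common action sequence."""
    counts = Counter(tuple(t) for t in traces)
    return max(counts.values()) / len(traces)


def graph_similarity(trace: list, reference: list) -> float:
    """Crude stand-in: edit-based similarity between two action sequences."""
    return SequenceMatcher(None, trace, reference).ratio()


def drunkenness(trace: list, reference: list) -> float:
    """Deviation from the desired steps: surplus actions per reference action."""
    return max(len(trace) - len(reference), 0) / max(len(reference), 1)


reference = ["open_site", "click_search", "type_query", "submit"]
trace = ["open_site", "click_search", "click_search", "type_query", "submit"]
print(graph_similarity(trace, reference))  # ~0.89: one stray repeated click
print(drunkenness(trace, reference))       # 0.25: one extra action over four
```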
The webinar concludes by mentioning alternative approaches to agent evaluation, such as reinforcement learning using unit tests as environmental reward functions. The presenters highlight the importance of further research and experimentation in this field.
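A minimal sketch of the unit-tests-as-reward idea: a code-writing agent's reward is the fraction of tests its output passes. The harness assumes tests are plain callables over the generated namespace; a real setup would sandbox the `exec` call rather than run untrusted agent output in-process.

```python
def unit_test_reward(candidate_code: str, tests: list) -> float:
    """Reward in [0, 1]: fraction of unit tests the candidate code passes."""
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)  # run the agent-produced code (unsandboxed!)
    except Exception:
        return 0.0  # code that does not even execute earns nothing
    passed = 0
    for test in tests:
        try:
            test(namespace)  # each test raises AssertionError on failure
            passed += 1
        except Exception:
            pass
    return passed / len(tests)


def test_add(ns):
    assert ns["add"](2, 3) == 5


def test_add_negatives(ns):
    assert ns["add"](-1, 1) == 0


candidate = "def add(a, b):\n    return a + b"
print(unit_test_reward(candidate, [test_add, test_add_negatives]))  # 1.0
```

The appeal of this signal is that it is cheap, automatic, and dense enough to drive reinforcement-style iteration, which is exactly the property the webinar attributes to unit tests as environmental reward functions.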
In this excerpt from the LangChain "Chains vs Agents" webinar, the speakers discuss different approaches to evaluating conversational agents. One approach is self-supervised learning, where an agent is evaluated by trying to mimic a human's representation and using auto-labeling. Marco developed a tool called Jungle that allows users to explore labeled datasets and export specific subsets of data for evaluation. Another approach is reinforcement evaluation, where an agent's performance is assessed based on whether it can pass unit tests. James demonstrated this approach by running an agent that executed code edits and checked the status of unit tests. The last project discussed was a self-supervised evaluation approach, where an agent interacts with a simulated human to complete tasks like booking a flight. The team created LLM-based evaluation metrics to assess the success of these conversations.
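A sketch of that simulated-human loop and an LLM-based success check, assuming only generic text-in/text-out model callables and a hypothetical YES/NO judging rubric; the webinar does not specify the team's actual metric definitions.

```python
from typing import Callable

Chat = Callable[[str], str]  # any text-in/text-out model call


def run_simulation(agent: Chat, simulated_user: Chat, opening: str,
                   max_turns: int = 6) -> list:
    """Alternate agent and simulated-user turns until the user signals DONE.

    For brevity each side only sees the latest message; a real harness
    would pass the full transcript on every call.
    """
    transcript = [("user", opening)]
    for _ in range(max_turns):
        transcript.append(("agent", agent(transcript[-1][1])))
        user_msg = simulated_user(transcript[-1][1])
        transcript.append(("user", user_msg))
        if "DONE" in user_msg:
            break
    return transcript


def llm_judge(judge: Chat, transcript: list, task: str) -> bool:
    """LLM-based metric: ask a judge model whether the task was completed."""
    rendered = "\n".join(f"{role}: {text}" for role, text in transcript)
    verdict = judge(f"Task: {task}\nTranscript:\n{rendered}\n"
                    "Was the task completed? Answer YES or NO.")
    return verdict.strip().upper().startswith("YES")


# Stub models so the sketch runs without any API keys.
agent = lambda msg: "Booked your flight for Friday morning."
user = lambda msg: "Great, that works. DONE."
judge = lambda prompt: "YES"
print(llm_judge(judge, run_simulation(agent, user, "Book me a flight"), "book a flight"))
```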
The speakers also mentioned other projects and tools related to agent evaluation, such as the Agent Eval Scorecard and the use of OpenAI functions for affordable and efficient agent execution. They discussed the role of humans in the loop and the trade-off between capability and efficiency in agent evaluation. They emphasized the importance of using examples and prompts to guide agents and mentioned the use of few-shot learning as a starting point.
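To illustrate the OpenAI functions point: declaring a function schema makes the model return a structured, parseable call instead of free text, which is what enables cheap, reliable tool execution. Below is a minimal sketch against the current OpenAI Python SDK, whose `tools` parameter descends from the `functions` parameter in use at the time of the webinar; the calendar tool and model name are illustrative, and an `OPENAI_API_KEY` is required.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

tools = [{
    "type": "function",
    "function": {
        "name": "create_calendar_event",  # hypothetical tool for illustration
        "description": "Add an event to the user's calendar.",
        "parameters": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "start": {"type": "string", "description": "ISO 8601 datetime"},
            },
            "required": ["title", "start"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o-mini",  # any tool-calling chat model works here
    messages=[{"role": "user", "content": "Schedule lunch on Friday at noon."}],
    tools=tools,
)

# Happy path: the model chose to call the tool with JSON arguments.
call = response.choices[0].message.tool_calls[0]
print(call.function.name, call.function.arguments)
```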
The webinar concluded with a fun discussion about whether participants preferred chains or agents. While the majority leaned towards agents, it was acknowledged that the future of language models would likely involve a combination of both approaches. The speakers expressed their gratitude to the team and contributors involved in the projects discussed and invited viewers to check out the Q&A session for further information.
Overall, this excerpt provides an overview of different approaches to evaluating conversational agents, highlighting the development of tools and frameworks for assessment.
Source: https://youtu.be/bYLHklxEd_k
Page title: LangChain "Chains vs Agents" Webinar - YouTube