LangChain "RAG Evaluation" Webinar - YouTube (www.youtube.com)
Harrison Alright, we are live. Hello everyone, welcome to a webinar on evaluating RAG applications. Really excited for this one; it has been in the works
Harrison for a while. We've got some awesome guests with us. We've been working on a collaboration with the Ragas team for a few weeks now.
Harrison So really excited about this, and then we added in Pedro last minute as well because he has a really cool product that just launched and is very relevant to what we'll be talking about. Some minor logistics before we get started: this is being recorded, it'll be available at the link afterwards, and then we'll also put it up on YouTube tomorrow.
Harrison The format we'll do is quick intros by everyone, then we'll have a presentation from the Ragas folks on what they're building. We'll hear a little bit from Will about some of the stuff we're working on at LangChain, and then we'll hand it over to Pedro to talk about building an application: what, as an application builder, is he actually looking at in terms of evaluation.
Harrison After that, we will just go into general Q&A. So if you have questions about evaluation, please put them in the Q&A box on the right. If you look to the right, you can see there's a little chat box, which is probably where you are by default, and then if you look below that, there's a little box with a question mark. Please put any questions there.
Harrison We'll go through the ones that have the most votes, so you can vote for the ones you want answered and we'll answer them in that order. I think that's all the logistic stuff; it should be pretty straightforward. I'm really excited for this event. So let's start with some quick intros.
Harrison Justin, do you maybe wanna go first?
Justin Yeah. Hey, first of all, thanks Harrison for setting this up, excited to be talking with you guys and learning from everyone. My name is Justin. I'm one of the co-maintainers of Ragas.
Justin I used to work as an ML engineer previously and then got into the LLM fever earlier this year. From that point we've been focusing on how to make LLM apps, and especially RAG apps, much more reliable. That's the work.
Harrison Awesome. And Shubham, do you maybe wanna go next as well to wrap that up?
Shubham Yeah. Hi, I'm Shubham. I've been a data science professional for the past 5 years, and I've been training language models since 2018. And now I'm a co-maintainer of Ragas.
Shubham So happy to be here.
Harrison Great. Will?
Will Yeah. Hi, I'm Will. I work with Harrison at LangChain. I've been working a lot especially on the LangSmith platform, looking at our evaluations, trying to integrate better with
Will great packages and frameworks like Ragas, and just really trying to connect the dots between LangChain, LangSmith, and what everyone needs here to evaluate any type of custom application.
Harrison Perfect. And then Pedro?
Pedro Hey, guys. Thank you so much. My name is Pedro. I'm the founder of Cavern, which launched Noah yesterday. We're just trying to really, you know, take all this cool infrastructure and put it in the hands of users, building workflows that are really easy and simple for them to get going with.
Pedro So we've been building in AI and LangChain since the beginning of the year. So really excited to dive deep.
Harrison Alright. And with those intros out of the way, let's hear about Ragas. What have you guys got for us today?
Shubham Yeah. I guess I'll share the screen now. Perfect. Okay.
Shubham Hope everyone can see the screen.
Justin Yep. Correct.
Shubham Yep. Alright.
Justin Alright. So Ragas is an open-source evaluation framework that we originally built for ourselves. We were testing out a bunch of...
Justin Like, we were pretty passionate about RAG systems and we were building a bunch of them ourselves, but there was no proper way to validate them. There were a couple of existing metrics, like ROUGE scores, etcetera, but they were very poor at assessment. And there were many datasets and benchmarks to evaluate the retrieval part, but they were also not consistent with the data you see in production, so those were also not very good for evaluation. So we were trying to figure out how we can improve this evaluation. And we
Shubham were thinking about that a
Justin bit, and all the learnings we had were what ended up in Ragas. So from the start, we felt that having a well-tested evaluation setup is very important to build these kinds of apps.
Justin While you can easily start off with a cool demo with libraries like LangChain, when you're going into production you have to bring in these engineering disciplines, like good tests and an evaluation pipeline, so that your apps are reliable and you can keep iterating without friction. So that is the importance of having evaluation, and that's one of the reasons why we focused fully on Ragas.
Shubham Okay, yeah. So Ragas as a tool focuses on LLM-assisted evaluation. But before going into LLM-assisted evaluation, we should be very much aware of why and where we should not use it. LLMs also come with a certain set of biases that we should be really aware of, so that we can design the package in a way that bypasses these biases. So let me just discuss some of the biases that we are aware of as of today.
Shubham One of these is that LLMs have a particular position bias, because of which they will prefer an output in a particular position compared to other positions. So if you ask them to compare and select the best output from, say, three outputs, they might just prefer the output in position one because they have a bias towards that position, and the output from the LLM can change when you change the position of the outputs.
Shubham LLMs have also been shown to prefer integer scores when assigning scores, and they almost never print out floating-point scores, so this is another bias they have. And when used to compare answers, even GPT-4 actually preferred answers from its own output, even when compared against human-annotated answers, due to a style bias it has. And then they also have a bias towards picking a particular number from a set of numbers; we'll see an example of that.
Shubham And LLMs in general are also stochastic in nature, due to which they can give different scores when prompted repeatedly. So using them naively is not really a very good solution. And here, you can see that I have generated basically a hundred random numbers from GPT-3.5, and it picks 7 most of the time. This essentially shows a particular bias of the LLM: it is actually preferring the number 7 compared to other numbers. If this were an ideal system, it should have produced a roughly uniform distribution.
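For readers who want to reproduce the number-preference experiment described above, here is a minimal sketch. It assumes the openai Python SDK (v1+) and GPT-3.5; the exact prompt wording and sample count are illustrative choices, not the ones used on the slide.

```python
# Sketch: sample "random" digits from an LLM and tally the distribution.
# Assumes the openai>=1.0 Python SDK and an OPENAI_API_KEY in the environment.
from collections import Counter

from openai import OpenAI

client = OpenAI()
counts = Counter()

for _ in range(100):
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=1.0,
        messages=[{
            "role": "user",
            "content": "Pick a random number between 0 and 9. Reply with the digit only.",
        }],
    )
    counts[resp.choices[0].message.content.strip()] += 1

# An unbiased sampler would be roughly uniform; in practice one digit (often 7) dominates.
print(counts.most_common())
```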
Justin So, as we have seen, using LLMs directly has a few biases. But they are also very effective when evaluating in a bunch of situations; they have very good correlation with human judgment for specific setups. So the question for us was: we know LLM evaluators have certain limitations,
Justin but they also have some positives. So what are some methodologies we can use to work around these biases and still use LLMs for evaluating RAG systems? And we came up with a few paradigms. Basically, for any RAG system, there are two components, the generation part and the retrieval part, and we need to evaluate both of them separately so that it gives you a holistic score of what is happening.
Justin So for the generation part we have faithfulness and answer relevancy, which I'll talk about, how they are formulated and how they work. And for retrieval, we have context relevancy and context recall. Faithfulness, the way we formulated it, is broken down into two LLM calls. First, given a generated answer, we figure out the statements in it, and then we check whether each of these statements is supported by the retrieved context.
Justin So faithfulness is a measure of hallucination in the generated answer. And hallucination, in our definition, is when the LLM generates anything that is not supported by the retrieved context. With the faithfulness score, we get a measure of that.
Justin The next one, answer relevancy, is a measure of how relevant the generated answer is to the given question. Here you can see that for the same question, only the last answer is really relevant and to the point, and the answer relevancy score is a measure of that. So, basically, what we do is try to reverse-engineer the question from the answer.
Justin Given an answer, we generate the question it would answer, we run this around three times, and then we calculate the similarity between the generated questions and the original question. If the similarity of the generated questions is very high, we know the answer is very relevant and to the point.
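Written out, the two generation metrics described above look roughly like this (a sketch with notation chosen here: a is the generated answer, c the retrieved context, q the original question, q_1 through q_n the questions regenerated from the answer, E an embedding model, and sim cosine similarity):

```latex
\mathrm{faithfulness}(a, c) =
  \frac{\lvert \{\, s \in \mathrm{statements}(a) : s \text{ supported by } c \,\} \rvert}
       {\lvert \mathrm{statements}(a) \rvert}
\qquad
\mathrm{answer\ relevancy}(a, q) =
  \frac{1}{n} \sum_{i=1}^{n} \mathrm{sim}\big(E(q_i),\, E(q)\big)
```

Higher is better for both, and neither metric needs a reference answer.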
Shubham The next part that I will focus on is the context. So now we have metrics that can evaluate the generation, but now we'll focus on two metrics that can evaluate the retrieval part. The first one is context relevancy, which is used as a proxy for the precision of the retrieved context, so this helps in optimizing the chunk size. To calculate this, the problem is formulated as candidate sentence extraction.
Shubham So, given the retrieved context and the question, the LLM extracts the sentences that can be used to answer the given question. The final score is the ratio of the number of extracted sentences from the given context divided by the total number of sentences that were returned by the retrieval. This can be used to optimize the chunk size and cut down irrelevant context, which can lead to savings in cost and latency as well.
Shubham The next one is context recall. This is a very newly introduced metric, and this is where a lot of the problems happen: estimating the recall without a reference is the key problem. So to calculate the context recall score, you need a ground-truth annotated answer. This is also formulated as a combination of candidate sentence extraction and natural language inference.
Shubham So, given an annotated answer and the retrieved context, we figure out which statements from the answer are present in the retrieved context and which are missing from it, and the final score is just the recall calculated from these two counts. Then, on top of all these metrics, we place self-consistency to ensure high robustness and consistency of the scores. Basically, LLMs have been shown to give consistent output when they are highly confident about a certain output, so this confidence can be measured and estimated by prompting multiple times and quantifying the agreement between
Shubham the generated answers across these multiple evaluations. We use this trick on top of these metrics to ensure high consistency and robust scores. And then we have future directions. We are working on a test set generation paradigm.
Shubham We have seen that formulating a test set is kind of difficult, because it should follow certain characteristics: it should have a distribution similar to what you see in production, and it should have questions in it with different levels of difficulty. So we are building a paradigm to generate questions from a given set of documents, and if you have a seed set of prompts, it will also consider those prompts. Apart from that, we are also working on evaluating and testing agents, which are mostly used in combination with RAG systems. And after that, we are looking forward to developing our own custom models for evaluation.
Shubham That's our roadmap for Ragas. You can find Ragas at the following link; we'll also share the slides in the chat. And if you wish to chat, just drop us an email. Thank you.
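As a rough illustration of how the four metrics walked through above get wired up in practice, here is a minimal sketch using the ragas library. Metric names and expected dataset columns have shifted across ragas versions, so treat the exact imports and column names as assumptions rather than a fixed API; the context relevancy/precision metric is added the same way, and the ground-truth column is only needed for context recall.

```python
# Sketch: scoring a RAG output with ragas' core metrics.
# Assumes `pip install ragas datasets` plus an OpenAI key for the judge LLM;
# exact metric and column names vary between ragas versions.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_recall, faithfulness

samples = {
    "question": ["What is the capital of France?"],
    "answer": ["The capital of France is Paris."],
    "contexts": [["Paris is the capital and most populous city of France."]],
    "ground_truths": [["Paris is the capital of France."]],  # used only by context recall
}

results = evaluate(
    Dataset.from_dict(samples),
    metrics=[faithfulness, answer_relevancy, context_recall],
)
print(results)  # per-metric scores between 0 and 1
```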
Harrison Awesome, thank you guys. Okay, I have a few questions,
Harrison and we'll do those before maybe moving on, but if other people on the panel have questions as well, please jump in, and if you have questions, drop them in the Q&A box, because I think there's a lot of stuff here that I want to dig into. First of all, though, just setting the scene, I'd love to understand: are you guys working on this full time? Is it the two of you? Is there a company behind this?
Harrison Is it just purely open source? What's the story behind the repo?
Justin Yeah. So currently, it is an open-source repo. We are trying to form a company around it, but not necessarily around Ragas itself.
Justin So yeah, right now it's an open-source project; we're trying to see how the adoption goes and take it from there.
Harrison Awesome. Well, you've got a lot of fans. I saw one person in the chat say they use it in their company every day. So you've got a lot of fans for Ragas.
Harrison A more technical question I have: you presented the four different areas that you measure on. When you see applications that are producing a wrong answer at the end, which of the four most contributes to that?
Harrison Yeah, which of the four is most relevant, or which do you see most people stressing out about, or where do most of the errors occur?
Justin Yeah. It is mostly with the context, the retrieval part. Building a system that retrieves well is hard; a lot of the teams that we talk to are at the edge of what a simple vector store and similarity search can do.
Justin So they're trying to figure out much more comprehensive ways of retrieving the information that can actually be used to answer the question. And that is where context recall...
Justin That is something
Shubham that we added
Justin just yesterday, really comes in handy.
Harrison Got it. So context recall, that makes sense. And then one interesting thing: for one of the others, not context recall but the other context one, context relevancy, you said changing the chunk size could help with optimizing that?
Harrison For the other ones, how would you recommend fixing them? If something's going wrong in context recall, what can I do? If something's going wrong in generation, what do I do?
Harrison What are fixes for all these issues?
Shubham Right. So especially for context recall, in most cases the issue is with the embedding: either the embedding is not really suitable for retrieving the kind of information that is required for the given question, or, in some cases, this can also be solved by some kind of query transformation. Either of those works many times.
Shubham But basically, generally improving and enhancing the retrieval techniques and the embeddings usually works to improve context recall, so that you don't miss any point that is necessary. This comes up very often after people build a simple RAG system and push it towards production, so that information is very important. Regarding context relevancy, it is a measure of precision.
Shubham The LLM will mostly answer the question correctly even if the system has low precision, but it will usually take more time, because you are sending a larger number of tokens to the LLM, and the calls are going to cost much more. This can also, in some cases, cause some form of hallucination, because you are presenting more information than is required, and it can run into some of the long-context issues as well.
Shubham And for the generator part, regarding faithfulness and answer relevancy, adjusting the prompt and doing some kind of prompt transformation usually works. People also use some kind of guardrails to ensure that all of the claims that are generated are actually attributed to the sources. So yes.
Shubham These are the main points that we're seeing.
Justin Yeah. Also, I wanted to mention the role that LangSmith is playing here, which is one of the main reasons why we were super excited about collaborating.
Justin LangSmith has been super useful for us, even internally, to figure out where exactly things go wrong, because it gives you the whole trace, right from the evaluation to the trace of the actual chain. That means you can pinpoint where exactly and what exactly went wrong. So having a very good tracing and monitoring setup alongside your evaluation pipeline is, we think, the right way to do this.
Harrison Yeah. And I know one of the big things is that we're, you know, struggling for good metrics to evaluate all these things. I think LangSmith provides pretty good visibility into what's going on, but when paired with some of these metrics, I think it becomes even more powerful.
Harrison That leads me to another question, which is: how much would you blindly trust these metrics? When you're using them, how much do you look at the individual data points? Actually, maybe there's a better way for me to ask this.
Harrison When you look at the individual data points, how often do you see that the grades are correct and you agree with the LLM-assisted evaluation?
Shubham Right. So for our own evaluation, we have a small human-annotated test set, and whenever we roll out new methods, we measure the correlation with human judgment. If that is at a satisfactory level, around more than 0.8, we roll out the metric.
Shubham So we have this internal testing mechanism through which we test these metrics. And also, as Justin mentioned, LangSmith has been very useful for debugging, because when we formulate a metric, on some of the data points or in some situations the metric might mess up, and we can use LangSmith to identify why it messed up. So it's been very useful. But we generally follow this testing mechanism internally to ensure that a metric actually works.
Justin Yeah. And there are times when it messes up, but I've been surprised by the number of times it actually gets things right. So that has been surprising, but it is something we keep checking so that there is more confidence and these metrics stay robust.
Justin But we do run these kinds of tests.
Will So a somewhat related question to this is in the Q&A right now, and it matches one thing that I had personally. One of the nice things that you do is you take these four metrics and take the harmonic mean for an overall score, which makes it a little bit easier to filter your results between models, because you have that single combined float value. A lot of people coming from the more traditional ML space will perhaps already have implemented, or be familiar with, other metrics for retrieval, like mean reciprocal rank, or, if you have a labeled dataset and you want to see whether the relevant documents show up, something like precision at k. How do you think about extending the existing metric set within Ragas to include other metrics, either LLM-assisted or some of these traditional ones?
Will And how would you go about using that in your overall workflow?
Justin Yeah, that's a very valid question. So all of these existing methodologies are there, but honestly they work really badly for this. We came up with ours by trying to leverage LLMs for situations where actually building a test set is hard.
Justin So if you are using the existing metrics for information retrieval, to calculate precision at k you have to have annotated datasets for the given corpus that you're trying to query. And when people are just starting off, and even as they move forward, building such a dataset is very complicated. That's why we focused on making ours reference-free. But if you can leverage the existing metrics, that is a very valid evaluation point.
Justin It can be pretty useful.
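For reference, the single combined number Will mentions is a harmonic mean over the per-metric scores. Assuming an unweighted combination of the four scores, a sketch of it looks like:

```latex
\mathrm{ragas\ score} =
  \frac{4}{\dfrac{1}{\mathrm{faithfulness}}
         + \dfrac{1}{\mathrm{answer\ relevancy}}
         + \dfrac{1}{\mathrm{context\ relevancy}}
         + \dfrac{1}{\mathrm{context\ recall}}}
```

The harmonic mean punishes a single low component much harder than an arithmetic mean would, which is what makes it usable as a quick filter between runs or models.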
Harrison Awesome. There are a lot of other questions; we'll get to them at the end. I think there's a lot to dive into here.
Harrison Will, do you maybe wanna chat about some of the stuff we're doing on the LangSmith side? I guess we already mentioned a bit about how it integrates with Ragas and everything like that, but anything additional to add, or any additional experiments we've done?
Will Yeah, I can keep it pretty short. I can highlight two things. One, we have a cookbook, and I'll walk through maybe some code
Will just briefly, to share some information on how you can get started. Actually, maybe I'll pull up the notebook that you have in Ragas so we can continue in that theme and show how you're integrating there. And then we can show a little bit of a glimpse of the UI to see, if something goes wrong, how you can drill down into the retrieval pipeline in order to see what's happening. Alright.
Will I'll share my screen briefly. So this notebook goes through a much simpler metric; it's reference-based and not reference-free. We're just measuring the correctness of the overall end-to-end system response of your Q&A system.
Will And then I'll just walk through briefly how it connects to LangSmith, and go through a couple of the data points that maybe it gets wrong. Overall, this is an easy way to get started if you have a small test set that you've labeled and you know the answers. It's less useful if you're still trying to mine that information and you don't have the actual references that you want to compare against.
Will It also maybe doesn't provide as detailed an analysis as breaking things down into these four different quadrants, which is why we like being able to flexibly define custom evaluators and link out to things such as Ragas. The overall process really is creating the dataset, then defining the system you want to evaluate in LangSmith, and then you can start iterating. It creates some results that look something like this. The thing that I think LangSmith really helps with here is that it can help you curate this dataset. It's often kind of hard to start from zero, and you can do things such as synthetic data generation or other approaches, but by collecting traces of your actual system in, say, a beta deployment, you can see real questions that your users, or that you, are generating on the actual corpus you're working with.
Will And then you can go through and correct any retrieved answers and create that dataset right there. That way, you don't really have to start from zero when checking the performance. Here we have an example where you just write out the questions and answers; this is over the LangSmith documentation. Create the dataset, and this is...
Will I'll share the link afterwards, so I won't belabor the code too much. And then you evaluate the chain. We just use a QA evaluator, which essentially asks if the answer is correct or incorrect, given the context and given the labels. Again, very simple.
Will And I think it is important to bear in mind the weaknesses in LLM-based evaluations. They do sometimes show biases like position bias, if there are specific answers they're trying to select from. They can also have systemic bias in certain areas: if the model has some knowledge in its parameter space, it might disagree with the ground-truth answer you have there, especially if there are errors in the tokens or something like that. You can run the evaluators and then you can see the results in LangSmith.
Will And what this gives you is this set of metrics, but it also gives you an entire audit of each data point in the dataset, so you can see which documents were retrieved for that data point, the OpenAI call or other LLM call made to synthesize the response, the query generation step if you want to be rephrasing the query in some other step, and all those types of things. So then, when you see that a result is incorrect, or see that it is correct, you can drill down into which component is actually failing.
Will Often, it'll be that the documents you're retrieving are all too similar or insufficiently relevant. And again, using something like Ragas would help provide a little bit more direction at the top level, but then you can actually go in and make tweaks to your system and improve the metrics there. Another pretty cool thing... actually, I'll jump over to the actual demo in LangSmith now so that you can see what it looks like. So this is a run that we did.
Will I just generated a random project name, so that's not super informative. But you can see the results here on how accurate it is, and it looks like it got most of the answers correct.
Will This is a relatively easy dataset. Let's see where it got one incorrect here. You can click through to each of these different steps; you can see which documents it retrieved here, and you can see that perhaps the format of this isn't the most useful for all of the information.
Will And then you can also go to the feedback run, and you can see the reasoning that it went through: what information was passed to the evaluator and then how it was making the decision here. So you can see these whole prompts, and if you disagree with that, you can open it in the playground and make changes to the prompts there, so that you can maybe improve your evaluator
Will for the next time you're running it. We provide some off-the-shelf evaluators in LangChain, but I think extending to things like Ragas and other tools can give you a lot more flexibility whenever you have special criteria, user-specific evals, or considerations you want to include, because not every retrieval-augmented generation pipeline is the same, and not every system wants to make the exact same decisions when doing that.
Will So I can try this, and then I can make changes, submit it, and see what types of results it returns. And then you can use that the next time you want to evaluate.
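A condensed sketch of the workflow described above: create a small labeled dataset in LangSmith, then run a chain over it with the off-the-shelf QA evaluator. It assumes the 2023-era langsmith and langchain Python packages; the dataset name, example, and stand-in chain are placeholders, not the exact notebook being demoed.

```python
# Sketch: dataset creation plus LLM-assisted correctness evaluation in LangSmith.
# Assumes LANGCHAIN_API_KEY and OPENAI_API_KEY are set in the environment.
from langchain.chains import LLMChain
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.smith import RunEvalConfig, run_on_dataset
from langsmith import Client

client = Client()

dataset = client.create_dataset("rag-eval-demo", description="Hand-labeled QA pairs")
for question, answer in [
    ("What is LangSmith used for?", "Tracing, monitoring, and evaluating LLM applications."),
]:
    client.create_example(
        inputs={"question": question},
        outputs={"answer": answer},
        dataset_id=dataset.id,
    )

# Stand-in for a real retrieval chain: a bare LLM chain answering the question.
def chain_factory():
    prompt = PromptTemplate.from_template("Answer the question: {question}")
    return LLMChain(llm=ChatOpenAI(model="gpt-3.5-turbo"), prompt=prompt)

# "qa" is the off-the-shelf grader that marks each prediction correct or incorrect
# against the reference answer; results appear as a test run in the LangSmith UI.
run_on_dataset(
    client=client,
    dataset_name="rag-eval-demo",
    llm_or_chain_factory=chain_factory,
    evaluation=RunEvalConfig(evaluators=["qa"]),
)
```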
Harrison Do we have any Ragas evaluators built in?
Will Yeah. Do we have any shared results for it? I haven't run it recently, so I don't have a link on hand. Maybe I could actually go through and run your notebook?
Harrison I think we're probably a bit short on time to do that live, but maybe we can drop a link to that notebook. Because I think the metrics that you guys have pair super nicely with the debuggability and inspectability. So I think
Will it's a really powerful combo. One final thing before we move on: I think the thing that's really great about being able to integrate Ragas and LangSmith is that with Ragas, you can get started without having a labeled dataset.
Will You get all these great metrics, and you can look through the results. And then from that, you can use LangSmith to help investigate things. You can mark disagreements with the scores, and then continue to curate this dataset so it's more reliable and these types of evaluations are more correlated with what your human judgment would be.
Will So with this, it can be more of a flywheel. And then you can also adjust it to be more specific to your domain. Maybe your generation pipeline is very focused on a specific domain of problems, and maybe you want to be focused on specific information there.
Will Maybe for the hallucination metric, you want to include additional things apart from the retrieved data that are very specific to your context. You can use that to have evaluations that are very specific to whatever you're deploying.
Harrison Awesome. And yeah, I just shared an example of the notebook in the chat, so people should definitely check that out.
Harrison Alright. So now let's move on to the final presenter, Pedro, who is actually building something; he's the only one here building an application instead of tools. So maybe you can start by chatting about Noah and the product that you're building, and then also, how do you think about evaluation?
Harrison You know, there are all these different techniques. What are you actually doing in practice, if anything at all?
Pedro Totally. And I appreciate the time, and the presentations were great. So for starters, with Noah, we just try to bring the ChatGPT experience but with a lot more context on the user. I think right now users just go to their documents, copy and paste, and try to bring context manually to ChatGPT. And, you know, I think retrieval and the infrastructure are good enough to create an automated system that just gets the documents that are relevant
Pedro to any workflow, and the user just talks to the chat assuming that the chat can retrieve the best piece of context. So we built Noah to have an experience where users just come in, they integrate Google Drive or Notion, they populate it with hundreds of documents, and then they can just start chatting with them, without necessarily knowing where exactly the documents are or certain keywords. It's finally a copilot that can, you know, help people generate things or find things, for whatever use case that application is relevant to. And for us, evaluation is interesting because it's a horizontal product, so we are not particularly optimizing for accountants or lawyers or any other profession; we want it to be horizontal and multipurpose.
Pedro So whenever we try to optimize for a specific workflow or query, we usually find that it hurts the general performance of the system, and then we usually go back to, well, pure retrieval is just the best way to go about it. So I'm excited to share a bit of the learnings that we found in building the system. And usually, user input was our north star, and usually...
Pedro It's pretty binary: they say either the answer is good or the answer is completely off the mark. So this is kind of a matrix that I... oh, sorry.
Pedro Shared it wrong. Let me just present it here. So let's see. Yeah.
Pedro Okay. So the matrix is, you know, what we built, and we tried to optimize for a bunch of the factors that are involved: chunk size, chunk quantity, putting a lot of intermediary chains in to try to get a better answer. And what we found is that usually the users will be happy with an answer when the model captures the right chunk, and when the chunk is big enough to provide a very good answer. So, you know, we tried...
Pedro We tried retrieval with very few big chunks and with a lot of chunks that are medium-sized, and there's usually a balance there: the chunks need to be big enough, but there also need to be enough of them so that the complete context is captured by the LLM, and then the answer is usually going to be satisfactory. We found that to be, you know, the best set of parameters to provide the best answer to users, as opposed to chains that reason about what needs to be done, with an agent or chain deciding which documents to search through or what to do.
Pedro So relevant chunk size and quantity were very important for us in a more pure retrieval system. And then, of course, context size is really key. We try not to partition our queries into a lot of chains; usually it's just one or two chains with the user input, and stuffing the retrieval results into a chain was very good for getting good answers. Intermediary chains are good sometimes, because sometimes users ask to compare, you know, information from this document with information from this other document, and then you kind of need a chain to understand that you need to do two vector retrievals to get chunks for different queries. That was really good for optimizing for this type of query, but it wasn't good for overall performance, because every query would hit that
Pedro chain, and that would hurt our latency and usually also the quality of the output, because it would get repetitive chunks, and then it wouldn't be as good. So we honestly found that a more pure, high-quality retrieval is what gives the best answers to users right now. And then I think another factor to optimize for is when the user's dataset has outdated or conflicting information; the output is crazy, because one chunk says one thing and the other chunk says another thing. We just try to make sure that whenever users are uploading their documents, they don't take... like, the best solution was to not take in old documents.
Pedro Because usually, you know, if you're talking about product strategy or sales metrics, you're gonna have documents from 2019, 2020, 2021, and if you move those documents out of the way, since they're not relevant anyway, you're gonna get very good answers. So just being sure not to upload older information worked better than trying to optimize for that in a chain, like, oh, you have two different chunks, one is newer and one is older, just prioritize the newer one and give it a better weight.
Pedro That was a good change to make, and it works well for those queries that have conflicting information going on, but it wasn't good for the overall model, because again, we're trying to optimize for people in multiple industries using it for their own use cases. So that wasn't a good optimization considering the overall performance. And then, just lastly, users are very generous. So if they ask something and it doesn't go well, they try to be more specific in their queries.
Pedro I think historically, before LLMs, search and working with documents has always been a keyword type of effort, so they're very used to being specific with their queries, which really helps retrieval. Usually they're like, go into this document and summarize it, or they're very good with keywords, which really helps the performance of our software, because it just really helps retrieval.
Pedro And then memory is really hard to optimize. I think, you know, I'm still trying to crack the code there, but memory is usually a prompt that we pass, and any word or any information that is weird usually messes with the ultimate output. So for us, just getting the last message is optimal for memory. So whatever the user said
Pedro before, that's the only piece of context that we give the LLM in the subsequent chain as memory. If you add more, it kind of runs into that issue of a lot of conflicting information that is weird, and then the LLM doesn't know what to do. And I think there's probably a sweet-spot memory prompt that others use that they don't share, but that's probably going to be the gold standard. Text for retrieval is amazing. For tabular data, Notion, HTML, you know, I think it still depends on the quality of the loading, but that's still not a hundred percent. And then, you know, the prompt needs to be very pure; it needs to be very...
Pedro Okay, you retrieve the chunks, and then you just ask: based on the context, give an answer. If you add "you're a helpful salesperson" or "you're a happy person" or anything, there's a black box around those words that's gonna affect the final answer. So being as pure and as concise as possible in the prompt after the retrieval is done is really the best for preventing hallucinations. So those are a few of the learnings that we had talking to users and building Noah to ensure that the answers are good.
Pedro But yeah, it's been a very empirical process of talking to users and making sure that the retrieved answers are satisfactory, to make sure the application works best for everybody.
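To make the chunk-size and chunk-count balance described above concrete, here is a minimal sketch of the two knobs in a LangChain-style pipeline: the splitter's chunk size and the retriever's k. The file path and the specific numbers are illustrative placeholders, not Noah's actual settings.

```python
# Sketch: the two retrieval knobs being tuned, chunk size and number of chunks.
# Values and the corpus path below are illustrative, not what Noah actually uses.
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS

docs_text = open("product_strategy.txt").read()  # placeholder corpus

# Medium-sized chunks: big enough to carry a full thought, small enough to fit several.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.create_documents([docs_text])

vectorstore = FAISS.from_documents(chunks, OpenAIEmbeddings())

# Retrieve several chunks (k) so the complete context is captured,
# while each chunk stays large enough to be more than a keyword hit.
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
relevant_chunks = retriever.get_relevant_documents("Summarize this year's product strategy")
```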
Harrison This is super interesting. So one of my questions was gonna be exactly what it sounded like you were getting at towards the end: you have all these learnings, and when you're saying, oh, we found this to be better than that, how exactly did you find that? Was that, like, collecting thumbs-up and thumbs-down feedback?
Harrison Was that purely running it with users and asking them what they thought? How exactly are you doing that?
Pedro So it's a mix of all of the above. So thumbs up and thumbs down are really relevant to us. If users are open to it, whenever we see a thumbs down, we hop on a call with the user, and we look at the backlog and see, okay, did it get the right chunk, or why was the answer bad?
Pedro Usually, a bad answer came from a retrieval that didn't get the right chunk, either because it was stuffed somewhere hidden, so the retrieval wasn't, you know, smart enough, or it happened with data that wasn't properly loaded. So, you know, data that was in a CSV, or data that was in a slide or a PDF; for straight text, it was really good. But yeah.
Pedro So usually, you know, we looked at thumbs downs, or we shadowed users using it. We just wanted to see, okay, users are typing, you know, "go to my product strategy launch document and generate four more ideas", and just see how the users use it and whether the retrieval was able to capture that. That was really helpful in understanding, okay, those are the things to optimize for. And usually the chains that we try to optimize with the ideas that we have are not the right ones, considering how the users actually use it.
Pedro Users are very keyword-driven. So... yeah.
Shubham That is what we found as well.
Harrison Super interesting. A more specific question I have: you said text-based sources perform the best in the vector store. Is this as opposed to, like, PDFs? Or are you guys doing... you're not doing multimodal yet,
Harrison are you? Or...
Pedro Yeah. So we're doing all types of data sources, so CSVs and slides and everything. Yeah. But...
Pedro straight text is the best just because it's the simplest to load. The trickiest to load for us is Notion, because it's a huge HTML document that has so many components you have to load properly, and you have to ensure that each one is connected with the other components as a full chunk.
Pedro So that's tricky for us, and sometimes very important context is split across four or five chunks and the retrieval is not that great. So that's what I mean by that. Usually, when you vectorize straight text, that performs really, really well.
Pedro And it's still getting there for tabular data, like for CSVs and other types of data, which we do handle and we get good responses for, but it's still not a hundred percent. And I do think that as agents evolve and, you know, as GPT-4 gets faster and those things happen, we're gonna see an ability to really pull information from those as well.
Harrison Awesome. And then the last question I had: you had one bullet point that was, like, generous context length is key, more context size is key. There's been some talk around how, as the context window gets longer and longer, models kind of forget what's in the middle.
Harrison And so maybe there are actually some diminishing returns. Have you seen that in practice, or...
Pedro Yes. Yes, we actually have. So it...
Pedro It is a problem. So you have to give it context so it answers, but if you just put, you know, 9,000 tokens of context and ask a question, you wouldn't get... like, it would be better to have retrieved a 4,000-token chunk and gotten the answer from it. So we actually found that.
Pedro A really small chunk size is really, really bad, because it's just kind of a keyword search. So you need to be able to retrieve at least, you know, a page's worth of context around the hotspot of the answer. But yeah, as you really, really amplify that, you do find that diminishing return happening. But...
Pedro It's important to strike that balance, because very little or too much really hurts.
Harrison Super interesting. Alright, there are some good questions in the chat, so let's maybe jump there. Some of them are maybe more relevant for Ragas and some of them are probably more relevant for Noah.
Harrison I'll try to pick a combination of them. I guess we've got like 8 or 9 minutes. I think this one's for the Ragas team. So you guys mentioned that sometimes the embedding model is inappropriate for the corpus. Are there evaluation metrics that point to that as a cause, as opposed to some other defect in the RAG pipeline? And maybe more generally...
Harrison When all these steps go wrong, how can you know what the underlying cause is, whether it's the embedding model or the chunk size or anything like that? Are there any heuristics or best practices that you guys have observed? And, oh, I think...
Harrison Pedro, I think you're still sharing your screen. Yeah, I think you turned off your video instead of stopping the screen share.
Pedro Oh, right. Hey, I'll stop sharing. Oh wow, I see. Yeah.
Pedro There we go. It was on the Chrome tab.
Shubham Yeah. So on that question: I think in general, when there are issues with the embedding, the retriever doesn't retrieve the information which is required to answer the question. So in that case, in most cases, context recall is the one to look at, because context recall will be low; the retrieved context doesn't really have the information which is necessary to answer the question.
Shubham Another situation can be when the relevant context is split up across multiple chunks and it only retrieves some of the chunks needed to answer the question. So again, in that case, context recall is something which is very important when examining the retrieval and embedding metrics.
Shubham Yeah.
Harrison Right, go ahead. There's a series of questions around, basically, what traditional methods of assessment does Ragas compare with or compete with, and can you even use them in conjunction? How do you guys think about that? What are the usual metrics for retrieval or question answering, and how does Ragas compare?
Harrison Can you use them together?
Shubham Right. So most of the traditional metrics around generation measure how well the generated text overlaps with a reference, and judging generation quality itself is subjective. These metrics also mostly work only with annotated ground truth. But Ragas mostly focuses on reference-free evaluation, and hence the two are actually a little different; it's not an apples-to-apples comparison.
Shubham So we don't really compare it to other traditional methods. But yeah, I mean, if you have ground-truth answers which are particularly short-form, one can even go with traditional metrics instead of using an LLM-based system, because short-form answers are particularly easy to compare, compared to long-form answers.
Harrison Awesome. Pedro, I think there are two cool ones for you. The first one: you mentioned it, and people would love to learn more about how you deal with stale and outdated information.
Harrison This is a very painful problem. From what I understood, you actually put the burden a little bit on the user in terms of trying to get them to upload only the most relevant things. Is that correct?
Harrison Do you do anything else to safeguard that? Did you try anything that didn't work?
Pedro Yeah. So we did try to weight down older chunks. Whenever retrieval happens and the chunks are very disparate in their dates, we really try to give a better weight to the newer chunk. It was just tricky to pass that into a general chain that decides
Pedro whether you have to do that or not. Whenever we did that, it worked best for this issue, whenever you have those older chunks showing up, but it hurt the overall performance of the whole system, because then every query needed to be passed into a chain that had to decide whether you need to go into this weighting step, and that hurt.
Pedro So users are really good at... okay, they ask questions that are very directed, right? They're like, find me the product strategy...
Pedro They usually set a timeline for their query: the product document of last year, or the product document of this year. For uploads, we try to restrict very old documents. So whenever you're uploading things to Noah, we only take documents going back to January 2022. That way you avoid having documents from 2015,
Pedro you know, those very old documents that have very different information. So putting the burden on the user is really good, because you don't have to overly optimize for outdated chunks; they're not relevant to the user anyway. Right? So the user can just choose, okay,
Pedro we're gonna put in only the relevant documents, which are usually the newer ones. So you just don't have to put the old ones into the pool of retrieval, because that just caused a lot of problems. So, yeah.
Pedro We tried to optimize it with a chain; it just hurt the overall performance.
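Purely as a hypothetical sketch of the date-weighting experiment described above (the one that worked for conflicting documents but was dropped for the general case): re-rank retrieved chunks by combining the similarity score with a recency decay. The field names and the half-life constant are assumptions for illustration, not Noah's implementation.

```python
# Hypothetical sketch: down-weight older chunks after retrieval.
# Assumes each result carries a similarity score and a document date;
# the half-life below is an arbitrary illustrative choice.
from datetime import datetime

HALF_LIFE_DAYS = 365.0  # similarity weight halves for every extra year of age

def recency_weight(doc_date: datetime, now: datetime) -> float:
    age_days = max((now - doc_date).days, 0)
    return 0.5 ** (age_days / HALF_LIFE_DAYS)

def rerank_by_recency(results, now=None):
    """results: list of (chunk_text, similarity_score, doc_date) tuples."""
    now = now or datetime.now()
    return sorted(
        results,
        key=lambda r: r[1] * recency_weight(r[2], now),
        reverse=True,
    )

# Example: the 2023 chunk wins over a slightly more similar 2019 chunk.
ranked = rerank_by_recency([
    ("Sales targets (2019 plan)", 0.82, datetime(2019, 3, 1)),
    ("Sales targets (2023 plan)", 0.78, datetime(2023, 2, 1)),
])
print([text for text, _, _ in ranked])
```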
Harrison Super interesting. And then the other question that I think is really interesting is: how can Noah answer questions that need to aggregate data across different documents? How do you handle that?
Pedro Totally. And that's key when you're setting up the number-of-chunks-retrieved parameter. Everything is, you know, vectorized, and then the retrieval retrieves chunks that can be from multiple documents. You're gonna have one or two chunks that are the hotspot of the answer, and those will come from one document, but the more chunks you retrieve, the more likely they come from different documents. Then they aggregate into a chain, and the chain is multi-document. So, you know, increasing the number of chunks retrieved just maximizes the chances of you getting chunks from multiple documents, and then, if they're relevant, the LLM will be really good at including that context in the answer.
Pedro So yeah, we just make sure to set a retrieval size that is big enough so that you usually get chunks from two, three, or four documents. So usually Noah provides an answer with cited sources, and usually it comes from one source, but a lot of times it comes from two or three. Yeah.
Harrison Do you have any sense of how many questions are about multiple documents, or are most of them just about one document?
Pedro I feel like 40 percent are about, you know, two or more. Yeah. So most are about one document, because the user is like, okay, go to this document and ask something, and then even if it retrieves chunks from other documents that are not relevant, it cuts them off.
Pedro But a lot of times, you know, you have two or three documents that have something relevant in them. That is sometimes like 10 percent of the chunks, but it really helps the final answer. So making sure that you're retrieving, you know, five, six, seven chunks, and then passing that burden of deciding whether they're relevant or not to the LLM, is really good.
Pedro And then you have a multi-
Harrison document answer. Awesome. And then the last question, which came up a few times, so I think it's a good one to end on, is probably mostly for Will and the Ragas guys.
Harrison It's around the data that you evaluate on. Where should this come from? How many data points do you need? How do you generate it? A lot of people are asking: how can I come up with data to actually evaluate on and get these metrics?
Will Do you guys wanna answer that first? Or I can.
Shubham You can take it. Yeah.
Will Yeah. So I guess, from the LangSmith side, I think one of our main value props, and what we're interested in, is getting data that's as representative as possible of what you're gonna see in production, and we want it to cover the full distribution of use cases whenever you're putting a system into production.
Will You care about average performance, but you also wanna get that p95-type performance. And similarly for the retrieval quality and the answer quality, you wanna be trying to get as much coverage of the tail of this distribution as possible. I think this is really where LLMs are useful, but also where it becomes a little bit hard. And there are obviously considerations around privacy and everything, so you can't always sample things that are that representative there. In terms of tactics to do this, there are definitely tools, like Nomic and others, that use vector distances.
Will I'm not sure how useful those are. What I've found, and I'd love the Ragas folks' opinion on this, is that it's quite useful to start with a small dataset and then try to extend it in different dimensions and see where the existing system fails, and then manually annotate, or use assisted annotation, to gradually build that out, so that it stays pretty high quality without just going for the huge quantity that can slow down the whole process.
Justin Awesome.
Shubham Yeah, I agree: quality over quantity, and as Will said, it should be very reflective of what you see in production.
Shubham And there are some automated things you can do to filter prompts for the test set, so that you can automatically formulate a small test set from the ones you see in production. This can be something like automated deduplication and filtering, so that you keep data points that represent different parts of the distribution. So, yeah. And I think LangSmith is particularly useful for manually curated test set selection.
Harrison Alright, awesome. That about wraps it up for today. I think this was super interesting. Lots of great learnings on all the metrics that Ragas has.
Harrison And Pedro, I really liked the, I think you had eight different bullet points, and they were awesome. I think those are very real-world learnings. So I think it was a great combination of tooling and real-world application.
Harrison So thank you guys so much for joining. Yeah, that's all I got. And thank you everyone for tuning in. See you at the next one.
Will Thanks everyone.