LlamaIndex Webinar: Make RAG Production-Ready (YouTube transcript)
Speaker 0 Alright. Welcome, everyone. Welcome back to another episode of the LlamaIndex webinar series. Today, we're super excited to be talking about production retrieval-augmented generation.
Speaker 0 And this is the first time that we're actually doing a panel. This will be a panel with Tuana from Haystack, Max from Sid.ai, as well as Bob from Weaviate. As mentioned, we'll start off with some brief presentations from each of the folks, around 5 minutes or so, just to give an overview of the respective companies as well as some basic concepts, and then we'll jump into the panel discussion. So let's go from there.
Speaker 0 Tuana, do you wanna kick us off?
Speaker 1 So let me hopefully share my screen successfully. And hopefully you're seeing this. Yes. Okay. As we only have 5 minutes, just a fair warning to everyone:
Speaker 1 these are very, very basic slides about my experiences with retrieval-augmented generation, and some things that I think could be interesting discussions for the panel. So, quickly about myself: I'm Tuana.
Speaker 1 I'm one of the developer advocates at deepset, but I mainly focus on our open-source LLM framework called Haystack. And obviously, with that comes building a lot of RAG pipelines and some experience dealing with large language models. So without further ado, I'm just gonna start off by talking about why we even discuss RAG, what RAG is, and what retrieval augmentation really achieves for us — so that if there's anyone who doesn't know, we briefly go over it.
Speaker 1 I boiled it down to 3 very basic bullet points. The large language models that we currently use are trained up to a certain point in time, so they don't really have information after that point. They also don't have any information on our confidential data. So RAG is just a methodology we use to help large language models get the relevant context to answer any query.
Speaker 1 And I love to show a very simple example. I reuse this example very frequently, but I think it depicts it quite well. I use the example of this webinar this evening — well, evening for me. I put this question into ChatGPT just a few hours ago and asked who the speakers at the LlamaIndex panel tonight are. And as this is ChatGPT, which is actually quite a performant model,
Speaker 1 I got a pretty nice response admitting that it just doesn't know. So what we mean by retrieval-augmented generation, or RAG, is basically a technique that allows us to transform this instruction into something like this. Instead of just inputting the query as is, we instruct the model to answer the query based on the provided context — in this case; obviously this could be a plethora of instructions, which is why I think RAG is actually quite exciting. And then we provide it with some relevant context before we ask the same question again.
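A minimal sketch of the prompt transformation Tuana describes. The template wording and function name here are illustrative assumptions, not Haystack's actual API:

```python
def build_rag_prompt(question: str, retrieved_docs: list[str]) -> str:
    """Wrap retrieved context around the user's original question."""
    context = "\n\n".join(retrieved_docs)
    return (
        "Answer the query based on the provided context.\n\n"
        f"Context:\n{context}\n\n"
        f"Query: {question}"
    )

# Example: the event page text becomes the context for the original question.
prompt = build_rag_prompt(
    "Who are the speakers at the LlamaIndex panel tonight?",
    ["LlamaIndex webinar panel: Tuana (Haystack), Max (Sid.ai), Bob (Weaviate)."],
)
```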
Speaker 1 And the important part here is how we fill in this context. So I cheated and just copy-pasted it off of your event page today. But what a RAG architecture achieves is to find the relevant context and instruct the model of our choice with that relevant context. That looks a bit like this from a very, very high level. You might have external data; that could be your vector database,
Speaker 1 like Weaviate, or it could even be the web. And we have something we call a retriever component that acts as a sort of filter: it can go through that external data source and select the most relevant context for any given question. Then we inject that into the prompt, which we then send to a large language model. In my view, what could be interesting to discuss today are some of these things here. Some important building blocks, from my experience, to build a RAG pipeline for production: first, retrieving the relevant documents — all the context — and we can do this in various ways.
Speaker 1 We can do that with keyword retrieval, embedding retrieval, even hybrid retrieval. We can also play around with the ordering of the retrieved documents. And then, of course, the important things are the large language model we use and the actual prompt itself. I'm really going to focus on the top part, but what these contribute to the quality of the RAG pipeline at the end is largely the quality of the context provided to the large language model. If we don't provide it with the right context, then the large language model doesn't really have much to work with.
Speaker 1 And then the last two determine, given the context, the manner in which the large language model ends up transforming that into some sort of answer. So, the first step, retrieval: I think we can discuss keyword retrieval, embedding retrieval, and hybrid retrieval. Each has its own strengths, and its own reasons why you would or wouldn't pick it for a production use case. I think hybrid retrieval is especially interesting to discuss; it has positives and negatives. Embedding retrieval, or semantic search, isn't necessarily great at retrieving anything that requires keyword-based search — and we can already think of some examples here: product IDs you want to look up if you have an e-commerce platform, or company names you have to search for. Embedding retrieval might not be great at this, but you still might need semantic search to happen. So in this case you might look into combining the two retrieval methodologies.
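A rough sketch of the hybrid retrieval Tuana describes, combining BM25 keyword scores with dense embedding similarity. The documents, model choice, and the use of reciprocal rank fusion to merge the two rankings are illustrative assumptions — the panel doesn't specify a fusion method:

```python
# pip install rank_bm25 sentence-transformers numpy
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

docs = [
    "Product ID A1234: trail running shoe, breathable mesh.",
    "Our summer collection focuses on lightweight footwear.",
    "Return policy: items can be returned within 30 days.",
]

# Keyword side: BM25 over whitespace-tokenized documents.
bm25 = BM25Okapi([d.lower().split() for d in docs])

# Semantic side: dense embeddings of the same documents.
model = SentenceTransformer("all-MiniLM-L6-v2")
doc_emb = model.encode(docs, normalize_embeddings=True)

def hybrid_search(query: str, k: int = 3, fusion_k: int = 60) -> list[str]:
    # Rank documents by BM25 score.
    bm25_rank = np.argsort(-bm25.get_scores(query.lower().split()))
    # Rank documents by cosine similarity (dot product of normalized vectors).
    q_emb = model.encode([query], normalize_embeddings=True)[0]
    dense_rank = np.argsort(-(doc_emb @ q_emb))
    # Reciprocal rank fusion: sum 1/(fusion_k + rank) across both rankers.
    scores = np.zeros(len(docs))
    for rank_list in (bm25_rank, dense_rank):
        for rank, idx in enumerate(rank_list):
            scores[idx] += 1.0 / (fusion_k + rank)
    return [docs[i] for i in np.argsort(-scores)[:k]]

# The product-ID query benefits from the keyword side of the fusion.
print(hybrid_search("specifications of product A1234"))
```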
Speaker 1 And then the thing that I'm most excited about these days is re-ranking. This is an effective way to retrieve a diverse set of documents, and it's quite new. It can help us combine multiple resources, and — this is just going to be a brief intro in this slide deck — it can help with "lost in the middle"; I'm going to show a paper that came out quite recently about that. But quickly, about diversity: one thing that we're seeing can be a problem in certain production applications of RAG pipelines is, imagine a similarity ranking — here I've basically ordered documents, and they're all in a color palette.
Speaker 1 So imagine the blue ones: they're all slightly different colors, they are different documents, but they are quite similar to each other. And if I'm using a large language model whose context window only allows me to have this many documents, then I may have a problem, because in cases where I want to answer long-form questions based on a broad topic, only having a subset of documents that talk about a very limited number of topics might be an issue. So this is where we talk about a diversity ranker, which simply introduces higher diversity into the ordering of documents that we feed to large language models. So I think this might be an interesting topic to discuss.
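One way to sketch a diversity ranker like the one Tuana describes is a greedy, maximal-marginal-relevance-style pass: start from the top hit, then repeatedly pick the document least similar to everything already selected. This is an illustrative sketch, not Haystack's implementation:

```python
import numpy as np

def diversity_order(doc_embeddings: np.ndarray) -> list[int]:
    """Greedily reorder documents (index 0 = most relevant) so each next pick
    has the lowest mean cosine similarity to everything picked so far."""
    emb = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    selected = [0]  # start from the top-ranked document
    remaining = set(range(1, len(emb)))
    while remaining:
        cand = list(remaining)
        sims = emb[cand] @ emb[selected].T  # candidates x selected
        # Pick the candidate whose average similarity to the selection is lowest.
        next_idx = cand[int(np.argmin(sims.mean(axis=1)))]
        selected.append(next_idx)
        remaining.remove(next_idx)
    return selected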
Speaker 1 And then finally, another thing that is quite new to me is lost in the middle. This is a paper; I've linked it down below. What it effectively tells us is that large language models really concentrate their focus on what they see at the beginning and the end of context windows. So what do we do in scenarios where we might lose a lot of valuable context and information that simply happened to sit more in the middle of the context window?
Speaker 1 We are now looking into ranking techniques that will allow us to shuffle those up so that we lose as little relevant information as possible. And that's it for me. I think these would be some of the topics that I'd really like to discuss in the panel session. And hopefully I didn't go over 5 minutes.
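A minimal sketch of the "lost in the middle" reshuffling Tuana mentions: take a relevance-sorted list and interleave it so the strongest documents land at the edges of the context window and the weakest in the middle. The exact placement scheme is an illustrative assumption:

```python
def lost_in_the_middle_order(docs: list[str]) -> list[str]:
    """Reorder relevance-sorted docs (docs[0] = most relevant) so the best
    ones sit at the start and end of the context, the weakest in the middle."""
    front, back = [], []
    for i, doc in enumerate(docs):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]

# ["d1", "d2", "d3", "d4", "d5"] -> ["d1", "d3", "d5", "d4", "d2"]
print(lost_in_the_middle_order(["d1", "d2", "d3", "d4", "d5"]))
```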
Speaker 0 No worries. Thanks, Tuana, for your time. Next up is Max from Sid.ai. Do you wanna share some slides?
Speaker 2 Yes. Let me quickly set up the screen sharing here. There we go. Give it a second. Here we are.
Speaker 2 Okay. Hi. Good morning. Yeah. I'm Max.
Speaker 2 I'm the CEO of Sid.ai. We're in YC's current summer batch. I'll quickly talk about our experience with RAG and trying to get this to work in production. The subtitle of this, the backstory, is 11 months of pain. And quickly — I think Tuana covered this already,
Speaker 2 so I'll jump over it quickly. We started building LLM apps for consumers quite early, but they kept failing. And the failure always looked something like this: I'd ask it to write a slogan for my company, Sid, and OpenAI would think Sid stands for sudden infant death syndrome, which it doesn't.
Speaker 2 And the easy and quick answer is: LLMs are a bit like first-day interns. They're super smart and eager, but they know nothing about the person or company that they're working for. And the solution seems quite simple: you just add data. And I think, especially coming in with the classical engineering mindset, the feeling is, how hard can it be?
Speaker 2 I'll just take some documents, I'll chunk them into fixed-length chunks, I'll then use an embedder — I can use OpenAI or I can use something else — I'll throw it into a vector database, and at the end, I'll create an internal or an external API endpoint to actually get this to work.
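That naive "how hard can it be" pipeline, as a runnable sketch. The in-memory list stands in for a real vector database, and the chunking and model choice are illustrative assumptions:

```python
# pip install sentence-transformers numpy
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
index: list[tuple[str, np.ndarray]] = []  # toy stand-in for a vector database

def ingest(document: str, chunk_size: int = 500) -> None:
    # 1. Chunk into fixed-length pieces.
    chunks = [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]
    # 2. Embed each chunk.
    for chunk, emb in zip(chunks, model.encode(chunks, normalize_embeddings=True)):
        # 3. Throw it into the (toy) vector store.
        index.append((chunk, emb))

def query(q: str, k: int = 3) -> list[str]:
    # Embed the query and return the k nearest chunks by cosine similarity.
    q_emb = model.encode([q], normalize_embeddings=True)[0]
    scored = sorted(index, key=lambda pair: -float(pair[1] @ q_emb))
    return [chunk for chunk, _ in scored[:k]]
```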
Speaker 2 But once you go into production, you realize that there's a lot more that can go wrong and that will go wrong. How do you handle changes to documents if you're pulling from, like, Google Drive, Gmail, Notion, or any of the others? How do you handle those updates? How do you reflect them in the vectors that you have stored? And if you have something that's public on the web — and especially if you ingest from Google Drive — someone will try to ingest a 3-gigabyte PDF, and you'll get an email at 2 in the morning when that somehow fails. And one of the classical problems, and one of the approaches that we'll get into a bit more later, is that at some point, you have a lot of chunks.
Speaker 2 And every single retrieval result seems really, really bad, because you somehow have a finite amount of granularity but an almost infinite set of data. Then you have to think about cross-encoders and re-ranking to bring that quality back up and improve it again. And then you have to think about things like Markdown data having a different distribution than, for example, PowerPoint slides like these, just because it's a different writing style; if all of that has to be represented well in the same embedding space, you're going to struggle. Then there's, of course, the compliance question as soon as you're going more into the enterprise world: can I even use OpenAI, or do I have to self-host everything — and the complexities that come with self-hosting.
Speaker 2 And then there's email, which is a very, very tricky one individually — and I'll get into some approaches that we've used for that later — because there's so much tricky fluff and just so little actual information. And then there are things like Google's API CASA certification and processes like these that you actually have to get through before you can get it up and running. I'll quickly get into these two, and one approach that we found worked well for us there. You have a limited number of floating-point values, a limited amount of precision, in your embedding, and it's really important to make them count. In the classical email sense, most of it is "looking forward", "great to hear from you",
Speaker 2 "let's circle back", etcetera, etcetera — not actual information on the content. So what we've done is we've tried to reduce the fluff, the niceties in the writing style, as much as possible. To do that, we actually have a fine-tuned summarizer that is just designed to summarize email threads. Then we have an embedder fine-tuned on these summaries, to actually afford us a larger degree of separation in the latent space.
Speaker 2 So instead of going through the original text, which is very fluffy and not that easily separable, we create this different kind of latent space just on the summaries, and that makes search much, much better. And it means we need less cross-encoding and re-ranking at the other end to make it work.
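A sketch of that summarize-then-embed indexing idea. The summarizer here is a crude fluff-line filter standing in for Sid.ai's fine-tuned 7B summarization model, and the embedder is a generic placeholder for their summary-tuned one:

```python
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedder

FLUFF = ("looking forward", "great to hear from you", "let's circle back",
         "best regards", "hi ", "hello")

def summarize_thread(thread_text: str) -> str:
    """Stand-in for the fine-tuned summarizer: here we just drop fluff lines.
    The real system generates a roughly fixed-size summary instead."""
    kept = [line for line in thread_text.splitlines()
            if line.strip() and not line.strip().lower().startswith(FLUFF)]
    return " ".join(kept)

def index_email_thread(thread_text: str) -> dict:
    # Embed the fluff-reduced summary, not the raw thread: the latent space is
    # built over summaries, so distances reflect content, not writing style.
    summary = summarize_thread(thread_text)
    vector = embedder.encode([summary], normalize_embeddings=True)[0]
    # Store both: search happens over `vector`, the original thread is returned.
    return {"raw": thread_text, "summary": summary, "vector": vector}
```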
Speaker 2 And yeah, that pretty much brings us to the end. The retrieval architecture depends mostly on the data type, more than the usage type. And there's a long road from prototype to production. Thank you very much.
Speaker 0 Awesome. Thanks, Max. And last but not least, Bob from Weaviate.
Speaker 3 Hey. Well, thanks for having me, Jerry — and for the wonderful decks that I just saw. I'm not gonna show a deck. I'm actually gonna show a demo.
Speaker 3 So I'm gonna bring this together and actually demo this live for you. Let me quickly share my screen. One sec. Here we go. Screen share.
Speaker 3 Here we go. So let me give you a little bit of context on what you're about to look at. This is the Weaviate console. Weaviate is a vector database — an open-source vector database, for those who do not know. In some of the slides you saw the Weaviate vector database mentioned. Basically, what we take care of is storing these embeddings and storing the data objects.
Speaker 3 So that could be the emails that Max just showed, or the example of the meetup that Tuana just showed. In this case, I have a demo dataset that one of my colleagues, Connor, made. There's a very nice blog post about it, so I can share it afterwards. Basically, what we're gonna do is try to do the RAG thing in real time. And we're not only gonna do that, but we're also going to do something new on top of that: what we call generative feedback loops on the data. I've prepared stuff a little bit; the query language that I'm gonna use is GraphQL.
Speaker 3 If you interact with Weaviate, you can use the Python or JavaScript client, whichever you want, but I think this is an easy query language for everybody to read and understand what's happening here. So we have a database, we have a query language, and we have listings in this database — and what these listings contain are Airbnb listings. As you see here, you see the name of the listing, the neighborhood, the price per night, room type, those kinds of things. But one of the things you might notice is that there's no description.
Speaker 3 So if you now wanna search — if you do want to do the semantic search or hybrid search that Tuana mentioned, over these listings — we can't, because we don't have any descriptions. So one of the things that we can do is use a form of RAG to create a generative feedback loop. What we're basically gonna do — and I'll show it in a bit — is inject this data into a prompt, and what comes out of the prompt will be stored in the database.
Speaker 3 So we have a little prompt here, and I have that in a notebook, right: "write a short Airbnb description for the listing", etcetera, etcetera. So I'm now hitting run. Basically, for this demo I'm using an OpenAI embedding.
Speaker 3 What it's basically doing is going through and saying: okay, show me all the listings where there's no description, take the information from the listing, and it will generate not only a description for it, but also a vector embedding for it. So — let me see if it's done yet. The run is almost done. So if I run it, you already see a few. Now you see the same listing — this was the listing that we just saw — but now it has a description.
Speaker 3 And what we've done, by importing it, is we told Weaviate that we want to create a vector embedding for the description. So if we limit that to the first one — let's limit it to just the first result — for this one, we can show this additional vector.
Speaker 3 So this is basically the vector embedding that, in this case, we received from OpenAI — going way down. So now, one of the things that we can do, based on the data that was generated from the information in the listing, is a hybrid search, for example.
Speaker 3 So we search, say, "a place in New York to walk my dog". What this is gonna do when I hit search is create a vector embedding for the query, and it's gonna do a hybrid search — thank you, Tuana, for the inspiration. Let's go for hybrid search.
Speaker 3 And what it's gonna do is a vector-based search on the query, and a BM25 search on the individual keywords in the query, simultaneously. And now the cool thing is that we try to retrieve a data object based on the description that was just generated using a form of RAG. Right?
Speaker 3 So we go and run the query, and it says description: null. This is of course a live demo — why is it saying null now? Let's see. Let's see, here we go. Oh, here you go.
Speaker 3 So here it returns this result. The description — I don't know why it didn't give the description for this one, but, you know, live demo. Basically, what you see here is that it says, okay, "welcome to our spacious studio located near Central Park", etcetera, etcetera, etcetera. And now we can do RAG on top of that again. So we can say: generate single result. I can give it another prompt, and I can say:
Speaker 3 why is this listing a good place to walk my dog? And what I'm gonna do is basically inject the description. So what this is gonna do is query the database — it's gonna create a vector embedding for this query,
Speaker 3 it's gonna return the description, the host name, the name, and whatnot. And then we're gonna do RAG again on the generated content, which is this. We're gonna basically ask it if it can explain why that's actually a good result. This one takes a little bit longer because we need to send the results to the API. But now, you see, here we have that result.
Speaker 3 So we generated "welcome to our spacious studio, you'll love Central Park", blah blah blah, and then the RAG result: "this listing is a good place to walk your dog because it's located near Central Park, in East Harlem." And the nice thing is that you've basically seen two concepts here. One is just pure RAG, right? That's this,
Speaker 3 where we basically are injecting that information into the prompt, and you can do that no matter how big your dataset is. That is kind of the power of vector databases. I mean, this dataset is like 25 data objects or something, but we see users that go literally into the billions.
Speaker 3 So you can quickly search over them. And the second thing you can do is actually generate content, like this one, and store that back in the database with a vector embedding. So you can not only use RAG for better search, but you can also use RAG to create data in your dataset, modify data in your dataset, sometimes delete data. So that was my demo.
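For reference, roughly what the demo's hybrid-plus-generate query looks like through Weaviate's v3 Python client. The class name, property names, and prompt text are assumptions based on the demo, not the actual notebook:

```python
# pip install weaviate-client  (assumes the v3 Python client API)
import weaviate

client = weaviate.Client("http://localhost:8080")

result = (
    client.query
    .get("Listing", ["name", "description", "hostName"])
    .with_hybrid(query="a place in New York to walk my dog")  # BM25 + vector
    .with_generate(single_prompt=(
        "Why is this listing a good place to walk my dog? {description}"
    ))  # generative feedback: the LLM answers per retrieved object
    .with_limit(1)
    .do()
)
print(result)
```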
Speaker 0 Amazing. Thanks, all, for the amazing presentations. And with that, let's get into the fun — let's get into the panel. I'll ask some basic questions to kick us off.
Speaker 0 And probably we'll do this for the next 25 minutes or so. If there are any questions from the audience, please feel free to chime in, and I'll probably just relay some of those questions along with some of the basic questions I prepped as well. Let's kick it off with an introductory question: a lot of users have built these types of retrieval-augmented systems — I'm gonna call it RAG for short — in the prototyping phase. In your minds, what are the key considerations that users need to take into account when actually trying to build RAG in production?
Speaker 0 Everything from performance, cost, latency, scalability, security. And we can start with Tuana.
Speaker 1 So the way I see it, there are actually two parts where a lot of considerations go on kind of separately from each other, because there's the retrieval step of the retrieval-augmented generation pipeline, and then there's a generation step. In the retrieval step you have to consider your retrieval models, but these can actually be relatively small and cheap in some ways, since they tend to be a lot more lightweight. Whereas in the generation part, I see tons of considerations. I live in the EU; it can even start from considering, security-wise, what kind of models you want to use. We see this very often here in Europe.
Speaker 1 For example, OpenAI models are very, very performant, but sometimes they're just not an option given the legal circumstances. And then you have to consider whether you're going with an open-source model, and whether you're hosting it yourself or going for a hosted service such as Azure or SageMaker. Alongside that come cost considerations as well, and so on and so on. And then the second part, which I briefly mentioned with hybrid retrieval — that goes into a bit of latency considerations.
Speaker 1 And this is really dependent on your use case, I would say. For some people, simple embedding retrieval with a very lightweight model — if you're lucky enough to have a sentence-similarity model that works for your language — maybe that's fine. But, like Bob's example with New York, which is a really good one where you need to combine both keyword search and embedding search: you're doing two things basically at once. Although keyword search can be very, very fast, you now have two things to think about that run in parallel.
Speaker 1 So there are some latency considerations. Those are the main facets my mind immediately goes to.
Speaker 0 Thanks, Tuana. Max?
Speaker 2 Yeah. I think the main part, or what in our experience ends up eating a lot of the time, is actually getting the data-syncing part right, and that's just because the APIs for the services are so different. Depending on where you ingest from, you actually have to keep quite a complex state of what you already have and what you still need to add, remove, or modify. And that can, just out of experience, eat up a lot of your time. For a lot of the stuff here, you have to be quite conscious of how much data you actually want to ingest and what the search sample size is. If it's quite small, then you're probably well served by an easy solution and you won't run into that many issues.
Speaker 2 But as soon as you're looking at a few million chunks, you're gonna have to get more creative, and hybrid search is a great way to actually get there. And yeah, also with OpenAI, I think the legal considerations are one thing, and the other is just speed. Right?
Speaker 2 By self-hosting these models and co-locating them with your vector database and the other pieces, you can actually get much, much faster, and have turnaround times in the high tens of milliseconds or low hundreds of milliseconds for the entire path, including hybrid search. If you try something like that with OpenAI's embeddings, you're oftentimes looking at a P99 of above a second.
Speaker 0 And then, Bob?
Speaker 3 Yeah. So what's interesting — and to build on top of Max's point — is that there's this big distinction between a demo, like I just gave —
Speaker 3 right? I mean, I had 25 data objects, and the example where we did RAG took about 25 seconds to actually execute. That's great for a demo, right? That's not so great if you have a real-time use case and you have, like, a hundred million or more data objects, because then it's gonna take quite some time to index all of that.
Speaker 3 So the way you build these production systems — how you work with these models — is completely different from the examples I just showed. I agree with what everybody said there. What's super interesting from my perspective is actually the development happening around CPU-based inference, because the thing is, at some point you need to make an estimation about cost.
Speaker 3 So again, the example I just gave was perfectly fine for this demo, but it's gonna be very expensive in production. It's like: okay, we have a user that has, like, 20 billion data objects in Weaviate. Even if they wanted to use OpenAI for that, that's like 2 million bucks to create the vector embeddings.
Speaker 3 So then an open-source model is way more interesting, and how you operate that. The good news is that there's so much work happening in that stack — the database with the models and those kinds of things. And the second thing, something that I also find very interesting — and that goes for what you're working on, Jerry, and Tuana too —
Speaker 3 is the tooling that's starting to arrive around the ecosystem: basically, the solutions to help people figure out how to store and chunk the information. So, long story short, what I'm very excited about is how these things are now starting to come together.
Speaker 3 And one more idea that I wanna add, just for the people listening here: if I have to make a prediction, it's that we're gonna see this combination of not only doing RAG for search results — as in the example I just gave, you know, why is this a good place to walk your dog — but we're also gonna ask: is this a good representation in my dataset?
Speaker 3 Right? So take the example that Max gave: you have all these emails stored, and one is just, like,
Speaker 3 "yeah, thanks for the info." A model can now say: that's way too generic. And we'll start to use RAG — so the models and the database and the tooling — to actually clean up the data and make sense of it, and you can go very far with it. For example, if I may take your example, you can even say, okay:
Speaker 3 this email does not contain enough information. Let's go to the previous email, see what was in it, and then generate content based on that. So I think that's something we will see in the near future too: people will also start to use the models to query the data and then modify the dataset. That's a harmony of the models, the vector database, and the tooling like you all are creating. So I'm psyched about that.
Speaker 2 Yeah. One thing that we're also exploring is the notion of something like lazy information retrieval. So if someone asks a question, right — or if there's a query and there are no really good answers by whatever metric we have —
Speaker 2 we can actually go in and try to collect more and more information, with intermediate synthesis steps, to actually find an answer for that question or that query once it comes again. Right? So trying to, in an intermediate step, find and collect more information — something that wouldn't work at query time, but that we can do in kind of a more static regime.
Speaker 0 Interesting. So, like, dynamic information collection, basically. That's awesome. And maybe the next thing that we should talk about is data. So everything from, you know, ETL — extraction, transformation, loading — into a vector store.
Speaker 0 This is a pretty common step that users face when they first build a RAG system: loading data, let's say a PDF, chunking it up, putting it into a vector store. And Max and Bob, both of you talked about some of the key considerations you need to think about around the data — around performance, cost, latency, and scalability. What are some of the key pitfalls that users commonly run into, if we drill down into this a little bit more? And what are the key considerations that they might have to
Speaker 0 deal with? So for instance, everything from chunking strategy to document transformations to how you account for scalability. Maybe we could start with Max there.
Speaker 2 Sorry, I didn't — my Internet connection was bad. I didn't catch the last 20 seconds.
Speaker 0 Okay. No worries. The high-level idea is just: what are the key pitfalls that users commonly run into in this data ETL layer? You were talking about this in some of your slides, as well as in your answer — everything from chunk sizes to scalability — and I'm curious to hear your thoughts on how they should prepare their data to create better RAG systems.
Speaker 2 Yeah. I think there's the one part which is just following the standard practices around effectively handling untrusted text. There's always stuff that can go wrong and will go wrong. And then setting reasonable limits. For example, in our production, someone once tried to link a few-terabytes-large Google Drive, right? And you need to make sure that the pipeline is actually engineered to handle something like that — to batch it properly and then return — and not just throw an out-of-memory error and break the entire system.
Speaker 2 And especially if you're offering this to multiple users, you have to think about how you actually rate-limit and provide some sort of isolation, so that one person can't break the servers for everyone else. I think the most difficult part to engineer from that perspective — apart from, of course, the retrieval pipeline — is exactly the data-syncing component and making sure that works well. And then handling all of those keys and all of those API access tokens. I'm not sure how many of you have actually tried, but, for example, getting API access to Google's Drive and email APIs is a huge and very, very lengthy process. It's like 600 emails back and forth with Google to actually get permission to do that, and then, you know, a CASA assessment, etcetera, etcetera, that you have to run.
Speaker 2 And I think these are definitely the underrated challenges — stuff that you look at and you're like, yeah, this sounds reasonable, but then you go about it and you actually realize: hey, this is actually a lot of work.
Speaker 0 Got it. Bob, do you have any thoughts?
Speaker 3 Yeah, certainly. So one of the things that we see a lot — and I'm gonna assume that a lot of people on the Zoom here are building something or playing around with something — and again, like my example with 25 data objects, you know, that's fine.
Speaker 3 Even if you think it's pretty fast, that you've optimized it, and in your mind you get to maybe 200 milliseconds and 200 milliseconds is great — that's maybe true if you don't have a large dataset, right? At scale, 200 milliseconds per object is unusable.
Speaker 3 So what I would recommend people do is take just a sheet of paper, or a whiteboard, and from the moment of ingestion all the way to the query, just draw out what's happening. How long was inference time on the model? How long was retrieval time
Speaker 3 from the database? Then you get to a number. Let's say, in my case, using these endpoints, I get to maybe a couple hundred milliseconds. Just multiply that by the number of data objects that you have,
Speaker 3 and then you will quickly be shocked by the number that comes out. So you really need to start thinking from a performance perspective if you're really building a business — or whatever you're building, if it has a large dataset. What I recommend doing is, for example, if you use an open-source model, or a model that's in SageMaker or those kinds of things: validate that the model generates the results that you want. And if that checks the box, build the pipeline as easily as you can — perfectly fine to do it with the latency from OpenAI, etcetera — and then start to optimize piece by piece.
Speaker 3 Just take a building block out, replace it with something else, and keep minimizing the time. What we know from the big production use cases — where retrieval from the database, embedding generation, etcetera, all play a role — is that if you really optimize it well for big datasets, you should be able to get to somewhere between, like, 20 and 30 milliseconds end to end today, if you do it very well.
Speaker 3 But it's really a profession, right? There are people who are really good at optimizing these kinds of things. Just keep it in mind, because I've seen a lot of people build amazing prototypes, and then they said, "but now, to make this into a valuable business, we need to do this at, you know, x hundred million," and then they were shocked by the infrastructure prices and times associated with that. So don't let it hold you back, but bear in mind that these things take time. And people might like to know, if they're new to this, that a lot of production-readiness work is being done there.
Speaker 3 For example, in Weaviate we have a Spark connector that people use to literally pump in millions of embeddings and data objects per minute. So these tools are there, but make sure to think about it. And if you really come from a data science background, I would highly recommend starting to think also about the DevOps side, the engineering side, of just bringing the stuff to production.
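The whiteboard math Bob suggests, as a tiny worked example. All numbers here are made up for illustration:

```python
# Back-of-envelope ingestion estimate: per-object latency and cost, multiplied out.
num_objects = 100_000_000
seconds_per_object = 0.2          # e.g. one embedding API round-trip
dollars_per_1k_objects = 0.01     # hypothetical embedding price

hours_sequential = num_objects * seconds_per_object / 3600
print(f"sequential ingest: {hours_sequential:,.0f} hours "
      f"({hours_sequential / 24:,.0f} days)")   # ~5,556 hours, ~231 days
print(f"embedding cost: ${num_objects / 1000 * dollars_per_1k_objects:,.0f}")
```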
Speaker 0 Okay. Tuana, was there anything you wanted to add?
Speaker 1 I can add some things on this data-ingestion part based on what I see from our community — which I think, Max, with your product, you also experience a lot: especially if you want your RAG pipeline to be sort of up to date in real time. There are actually two types of datasets that you do RAG on. There are some cases where you have a dataset that is not necessarily going to change that much, so you don't have to think about data ingestion more than once, maybe. But there are other cases — as in your examples — where you want to be able to do continuous
Speaker 1 RAG on emails that are constantly changing, or Notion pages that are constantly changing. That requires a completely different architectural setup — maybe the same pipeline, but a completely different setup for scalability. So this is one of the main considerations I see when we're talking about data ingestion for vector databases.
Speaker 0 Thanks for your thoughts. I think this is a related question to data, and it'll tie into the next section on retrieval as well. I'll ask a question from the audience: what is the best way to chunk your data? And how do you think about optimal chunk sizes as well as chunking strategies?
Speaker 0 I think we could start with Tuana, if you wanna go first.
Speaker 1 I find this question quite difficult, because it feels like a lot of it can be trial and error. It's also going to depend on the embedding model that you decide to use. With a lot of embedding models, if you do want to do retrieval with a given embedding model, you're not going to go beyond a certain number of words for your chunk sizes.
Speaker 1 But at the end of the day, there's also the chunking strategy. For example — I'm gonna show it with my hands because I can't really describe it any other way — we have two paragraphs following each other: there's this chunk, and there's this chunk immediately after it.
Speaker 1 But potentially something here, just at the end of one, would have been relevant to the sentence that started right here. So we see a lot of people trying out chunks that actually overlap each other, so that you don't lose context that might be relevant in a scenario where one chunk is retrieved and the other is not. So there's a lot of preprocessing thought that goes into chunking. And there are, obviously, the sizes of the chunks, which matter.
Speaker 1 And of course, I'm gonna go back to this, but I think chunk sizes will probably be something we discuss more if it does turn out that diversity is very important for long-form question answering, because then we have a limited context length we can fill — so a limited number of chunks we can add into our context. If you want to include a lot more diversity, maybe smaller chunks are better. It depends on the type of data you're doing RAG on. That's my overall answer.
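A minimal sketch of the overlapping chunking Tuana gestures at: a sliding window where each chunk repeats the tail of the previous one, so a sentence spanning a chunk boundary isn't lost. The sizes are illustrative:

```python
def chunk_with_overlap(words: list[str], size: int = 150, overlap: int = 30) -> list[str]:
    """Sliding-window chunking: each chunk repeats the last `overlap` words
    of the previous chunk, so boundary-spanning context survives."""
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

chunks = chunk_with_overlap("some long document text ...".split(), size=8, overlap=2)
```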
Speaker 0 Bob, did you want to give your response?
Speaker 3 So this is a question that I don't have a good answer to, because the thing is, the moment the data hits the database, it's already chunked. So that's, you know — like Tuana said — of course your expertise, in the tools you're building, how that's done. So I'm not sure if I'm allowed to break the format a little bit here, but I would love to hear your answer to that question, Jerry,
Speaker 3 because you see a lot of that, of course.
Speaker 0 Oh, wow. Yeah, I thought I was interviewing you guys. So, I mean, I think it's basically what Tuana mentioned. And I was actually
Speaker 0 kinda curious to learn some of the tips and tricks that you guys had, because I was gonna see if I could implement some of this in the framework itself. The two things I typically see: in terms of chunks, I think smaller chunks tend to lead to better embedding-based retrieval, just because you're not averaging out the relevant context with a bunch of random stuff before and after the actual piece of text. However, the downside with smaller chunks is that when you actually feed them through the language model for synthesis, it doesn't have enough context to really give you a detailed answer to the question. And then I think the other piece here is that a lot of people — especially pretty much everybody building LLM apps for a specific vertical, on specific types of data — builds their own custom parser, as opposed to using any of the out-of-the-box parsers from, for instance, LlamaIndex or LangChain.
Speaker 0 But yeah, actually, Max, I was curious to get your thoughts, because, if I'm not mistaken, you also talked a little bit about this in the slides too.
Speaker 2 Yeah. I mainly have two points. First of all, I think you have to separate the chunk size at embedding time and the chunk size at retrieval time, because these can actually be different, right? And this is also how we approach it.
Speaker 2 So sometimes we'll embed something actually quite small — something that's, you know, inside the distribution of the models that we use for embedding — and then at retrieval we'll actually pull in the previous and following chunks too, to provide more context, as you said, for the query. So these are sometimes two different things, and depending on what you're working on, splitting the chunk size at embedding time from the chunk size at retrieval time makes sense.
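A sketch of that split between embedding-time and retrieval-time chunk size: search over small chunks, then hand the LLM the matched chunk plus its neighbors. The window size and data layout are illustrative assumptions:

```python
def retrieve_with_neighbors(chunks: list[str], hit_index: int, window: int = 1) -> str:
    """Expand a matched small chunk with `window` chunks on each side.
    Chunks are assumed to be stored in document order."""
    lo = max(hit_index - window, 0)
    hi = min(hit_index + window + 1, len(chunks))
    return " ".join(chunks[lo:hi])

# e.g. the small chunk at index 7 matched the query; pass chunks 6-8 to the LLM.
context = retrieve_with_neighbors([f"chunk {i}" for i in range(10)], hit_index=7)
```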
Speaker 2 So I think that's the first consideration. And then, as in the slides, the other one is: you can quite freely choose to search and retrieve over a different latent space than just the raw text.
Speaker 2 Quite freely choose to actually search and receive over a different latent space than just the raw text. And for us for emails, that was the summary space. So we effectively, we have a quantized 7 bill, fine tuned on email sum customization and all that model does is. It'll take In a thread and it'll try to output a quite fixed size summary of that text. And then we have In better that will actually then and embedded that that is fine tune to just those summaries and that we'll then go In finally.
Speaker 2 And that usually means we lose a lot of the context around the writing style and those kinds of components. But if you're building RAG for information retrieval and not for style retrieval, that is actually what you want. Right? And you have the ability to engineer it that way.
Speaker 2 And then, yeah, more generally, you then want to measure the end-to-end performance. You know, I come more from a research background, so I love thinking of synthetic benchmarks and ideas to test this, etcetera, etcetera. But the most important thing is to capture that quality signal from your users and then try to evaluate different approaches in your pipeline against it.
Speaker 3 Yeah. And if I may quickly add something, because something popped into my mind — I actually do have a response to this after all, thanks to what Max just said. From the database perspective: in the demo I just gave, you saw me do, in this case, hybrid search, but it could also be pure vector search. But the other thing you have are filters.
Speaker 3 The thing is, you load all the data into the database. So let's say we take these emails, and let's say we have, like, a hundred million emails.
Speaker 3 But if you know that you're searching for an answer based on something that's in an email thread, for example, you can actually store that in the database. You can say, okay: this is the body of the email, this is the vector embedding of the body of that email,
Speaker 3 this is the email address it's coming from, and it was part of this thread. So then your database query looks something like: do a vector search for something very specific, but limited to only that email thread within those hundred million emails. And that works very well. So as a tip, I guess: bear in mind that you don't have to solve it a hundred percent at ingestion time. In the database, you can also be smart about how you're structuring the data that you're storing, and then build filters on top of the vector search that you're doing.
Speaker 3 So that's also an option.
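A minimal sketch of Bob's filtered vector search — a plain-Python stand-in for a database-side `where` filter combined with vector search. The record schema is an assumption:

```python
import numpy as np

def filtered_vector_search(query_vec: np.ndarray, records: list[dict],
                           thread_id: str, k: int = 5) -> list[dict]:
    """Vector search restricted to one email thread via a metadata filter."""
    candidates = [r for r in records if r["thread_id"] == thread_id]
    candidates.sort(key=lambda r: -float(np.dot(r["vector"], query_vec)))
    return candidates[:k]

# Each record stores the body, its embedding, and metadata like the thread ID:
# {"body": "...", "vector": np.array([...]), "thread_id": "thread-42",
#  "sender": "alice@example.com"}
```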
Speaker 1 I actually wanted to add something here, because of the conversation about chunks and then what Max said. One thought about RAG that we didn't necessarily discuss, but that could potentially have an effect on the type of chunking you decide to do, is that RAG can mean many things, based simply on the prompt you provide. So initially: what is even your task, why are you building RAG? I think often we talk about question answering, but, for example, in Max's case, it could be summarizing emails, which are very short.
Speaker 1 In another example, it could be to summarize something about a whole topic, and then you have your retriever set to retrieve maybe the top few most relevant documents out of your database. So this is actually a good consideration for whether your chunks should be larger or smaller: if you're, for example, asking it to summarize a whole topic, maybe larger chunks of context and fewer documents being injected into your prompt make more sense. So I think the purpose of your RAG pipeline matters a lot in how you design your chunking as well.
Speaker 3 Yeah, that's a good point. And here's one more idea to add to that, just for people playing around with this — and sorry again for the database perspective, but that's just what I do every day;
Speaker 3 it's just where I live. You can even say: hey, we have email threads, so we use RAG to create a summarization of the whole email thread. We store that, and then the individual emails. So then you basically have two queries.
Speaker 3 Query number one is, like, show me the email thread that was about x. Then, when you retrieve that and you have the ID of the email thread, for example: okay, now search through the individual emails, set a filter for the email thread, and drill down into an email.
Speaker 3 So let's say you have a thread with 10 emails. You basically have 11 data objects, right? One data object with a summarization of the whole thread, and 10 data objects with embeddings that are the individual bodies of the emails that you stored. So you can get very smart about this kind of stuff, based on your use case. As Tuana said, you can do a lot using RAG to generate that kind of content, but also store it and retrieve it in a smart way.
Speaker 3 And that combination brings you very far.
Speaker 0 Awesome. Thanks to everyone for the thoughts. I do actually wanna talk a little bit about retrieval, and this leads into some of the later conversations around how to structure the data, adding metadata filters, and also hybrid search and re-ranking. What, in your minds, are generally the top retrieval issues that users run into when they first build RAG with, for instance, top-k embedding search?
Speaker 0 And how exactly do things like hybrid search, re-ranking, and adding metadata filters help with them? Tuana, if you wanna get us started.
Speaker 1 So, off the top of my head — and this is actually not a problem with retrieval, but potentially a problem with the way we view or evaluate RAG — the first thing that pops into my mind when you say retrieval is that we often seem to overlook the retrieval step when we think about the performance of the entire RAG pipeline. Someone asked about this in the chat and I replied to it: often we see the result of a RAG pipeline and the mind goes straight to the large language model we're using, rather than, like Bob said earlier, drilling down: is my retrieval step actually performing well, am I getting the relevant context to begin with?
Speaker 1 And if not, this is obviously an issue with the retrieval methodology you decided to use. One of the reasons I brought up hybrid search — and I saw there was some chat going on about it — is, for example, again to reference what Bob and Max brought up: let's imagine a scenario where we have emails and we need to allow our users to ask, "what was the topic of email ID something?" Or let's assume we sell shoes and we've got a specific shoe name or ID: "what are the specifications of shoe XYZ?" Unfortunately, the embedding retrieval that we very often use for RAG — and stop there — is often not great at retrieving for these types of queries, not great at doing keyword search; but we still want to be able to do semantic search to get a more comprehensive answer as well.
Speaker 1 So this is where we see people use a combination of keyword and embedding search — basically what Bob was showing in his demo at the beginning. And when we talk about metadata filtering — I actually posted about this in the chat as well, and I'd love to hear what your experiences have been — the addition of metadata to the context you provide to your RAG pipeline is something I do very often.
Speaker 1 This allows us to provide some sort of labeling information for each of the documents we retrieve, and to provide extra content or context for the replies from our large language models. For example, if I've got a website I've crawled, or a set of websites crawled into a static dataset, I tend to have a URL in the metadata. I love this for documentation search, because if I have the URL of the piece of documentation I retrieved, then the reply of the large language model can simply reference the actual place the answer was generated from. In use cases like this, I find metadata very, very useful — but I'd love to hear your thoughts as well.
Speaker 3 That's a very good point, Tuana. So, three things that I wanna respond to. The first one: a lot of people ask questions about what we call explainable AI, and using the metadata in these results is actually very helpful.
Speaker 3 Because what you basically can do — let's go back to the example of the emails on that topic — is say that whatever you're generating with the generative model, you can trace it back. The metadata that you attach might, for example, be the email thread ID. And you can say, okay:
Speaker 3 it's coming from that email that was sent on that date, for example. You know, that allows you to make it explainable and give that kind of context. The second usage is —
Speaker 3 so, in the example that I gave with the Airbnb dataset, that listing information is actually a form of metadata. We just call it the data object, but it's a form of metadata. So you use the metadata to generate the content that you create an embedding for, to search over. Right?
Speaker 3 There's also something new you could do, thinking about use cases — staying with the emails: the email body is a form of metadata. I can basically say, okay, what should be the right response to this email? And then that's what you store.
Speaker 3 And the third thing I wanted to say — Daniel already mentioned this, but we see this coming back more and more often — is that, indeed, pure vector search, pure vector-based retrieval, is often not enough for your use case. I'm sharing a link now to a visual in a blog post in the chat window. One way that we often explain it internally is: if you have a big sea with fish, and you want to catch a specific type of fish, then you throw the net, and the net is basically your vector search. You know the space where you search, but then your net comes out of the water,
Speaker 3 and it does not mean that the first thing you pick out of the net is actually the fish that you were trying to catch. Right? So what the re-ranker basically does is organize those results. And it has to do a little bit with traditional keyword-based search, or how we used to do those types of search results. For example, take an e-commerce solution: I might search for Adidas shoes for the summer, but the results I wanna see first are the shoes that are sold the most. Right?
Speaker 3 Otherwise you just have, like, random results. I mean, what you catch from your dataset — what you retrieve — probably contains the right information, but now you start with 20 shoes and maybe nobody's buying the ones shown first. So you wanna organize based on that.
Speaker 3 So bear in mind that re-ranking does not always have to be based on a model. Re-ranking can also be: I do a vector search for Adidas shoes for the summer, and I re-rank them based on which shoes are sold the most. That's also a form of re-ranking.
Speaker 3 So you basically retrieve that set, but then you somehow need to organize it in a way that your user, who is being presented with that information, sees the right stuff first. Not all use cases need that — sometimes you just want 20 results and you don't care about the order. But for, for example, an e-commerce application, you might want that.
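Bob's model-free re-ranking, sketched: vector search narrows the catch, then a business signal orders it. The field names and numbers are illustrative:

```python
def rerank_by_sales(retrieved: list[dict], top_n: int = 20) -> list[dict]:
    """Re-order vector-search hits so the best sellers among them come first."""
    return sorted(retrieved[:top_n], key=lambda shoe: -shoe["units_sold"])

results = rerank_by_sales([
    {"name": "Adidas summer runner", "units_sold": 4200},
    {"name": "Adidas mesh sandal", "units_sold": 180},
    {"name": "Adidas beach trainer", "units_sold": 960},
])
```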
Speaker 0 Awesome. Thanks, Bob. Max, anything to add?
Speaker 2 Yeah, not that much. I think, in principle, if you're building something with, like, the toy dataset that Bob showed, then really nothing matters, right? You don't really have to do anything. But if you're building something larger, then usually the answer is that everything kind of matters.
Speaker 2 Right? So yes, you have to use the metadata. And oftentimes, from a pure customer perspective, you need to be able to do keyword search. If I'm explicitly giving you a keyword, or the name of a file, or the name of
Speaker 2 something that I want the context to be about, and it doesn't find that result, that usually leads to incredibly frustrated users. I think you're gonna have to run into these things. And then, yeah, there's definitely also a good chunk of engineering that has to happen. Cross-encoders are definitely a big part of that. I love the idea of also doing the metadata ranking at the very end.
Speaker 2 That's definitely one of the takeaways from this for me. So yeah, not that much to add from that perspective. One thing — maybe Tuana got into this in the chat — is recency. And that is also really big; again, it really depends on just the kind of data you have.
Speaker 2 So, for example, let's get back to email — I see that's the common thread here. If we email every 6 months, then an email from 6 months ago can still be quite up to date, and it should probably still be ranked quite high if it's something I'm asking you explicitly about. But if we email every single day, an email from last week can already be outdated.
Speaker 2 Right? So you actually end up having quite complicated, different kinds of ways of accounting for that recency, and it's not just, you know, a decay curve over time; it's actually something a bit more complicated if you want a production-ready system — which is why we're here.
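One simple way to sketch the adaptive recency Max describes: an exponential decay whose half-life depends on the source, e.g. how often a sender emails. The half-life values and the multiplicative combination are illustrative — Max only says the real scheme is more complicated than a single curve:

```python
def recency_score(similarity: float, age_days: float, half_life_days: float) -> float:
    """Combine semantic similarity with an exponential time decay whose
    half-life is adapted per source (e.g. per sender's email frequency)."""
    decay = 0.5 ** (age_days / half_life_days)
    return similarity * decay

# A daily correspondent's emails go stale fast; a twice-a-year one's don't.
half_life_days = {"daily_sender": 7.0, "rare_sender": 180.0}
score = recency_score(similarity=0.82, age_days=10.0,
                      half_life_days=half_life_days["daily_sender"])
```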
Speaker 0 Sweet. This has been an awesome discussion. I think we need an entire separate discussion — an entire separate webinar — on evaluating RAG; I know a lot of you in the chat were asking about that.
Speaker 0 And I think that deserves another hour at least to fully chat about. I think all of you made some great points, and I learned a lot personally during this webinar. As we wrap up, maybe we could just go around, and in 10 seconds each, give your hot take — any hot take on RAG. If you don't have a take, just talk about what's top of mind for you in terms of problems that you're trying to solve. And we'll just wrap up there. We can start with Max — sorry to put you on the spot.
Speaker 2 Okay. I'll start off with the hot take. I think it's something that Bob also brought up before: pure vector search and embeddings almost never work for production-size datasets. The approach is always more complicated, more varied, and actually has to be very, very finely tuned to the data that you're ingesting and the use case that you have — but especially the data you're ingesting.
Speaker 3 So, my hot take in 10 seconds: it's this paradigm shift of the database and the LLM interacting with each other.
Speaker 3 In a couple of years from now, everybody will be doing that, and we'll be like, why didn't we always do that? And RAG is core to that. So the self-growing, self-manipulating dataset — I think that's the
Speaker 0 future. And honor.
Speaker 1 Am I allowed a hot take on something we didn't get into at all today? I think it's relevant to RAG. I often see that when we talk about RAG, we talk about the models, and my hot take is: evaluating the model in isolation from the prompt, and the prompt in isolation from the model we send it to — evaluating them both separately — is not evaluating RAG properly. I think the combination of the instruction you decide to send and the model you decide to use with it is probably one of the most important components of a successful RAG pipeline.
Speaker 0 Okay. Awesome. Well, thanks, everybody, for your time. Thank you especially to Bob, Max, and Tuana for joining us here today. We'll be continuing the discussions.
Speaker 0 We'll have follow-up webinars hosted by LlamaIndex talking about more components of RAG, especially around stuff like evals, and maybe drilling down on a lot of these data decisions as well. Follow us on Twitter and join us on our Discord — we'll be continuing the discussions and thoughts there as well. Alright.
Speaker 0 Thanks everybody.
Speaker 2 Yeah. Thank you, Jerry, for putting this together and curating a great set of questions. It was a lot of fun for us.
Speaker 3 Thanks, bye. Thank you.
Speaker 0 Alright. Bye, guys.