Summary: LangChain Retrieval Webinar - YouTube (www.youtube.com)
11,930 words - YouTube video
One Line
The LangChain Retrieval Webinar on YouTube covered retrieval models, the ColBERT model, the DSP programming model, challenges with ReAct agents and arbitrary tools, and document understanding, ranking, and diversity in information retrieval.
Key Points
- The LangChain Retrieval Webinar discussed the importance of retrieval in connecting data with language models.
- The speakers emphasized the need for better types of retrieval to improve the accuracy of language models and reduce hallucinations.
- Different types of retrieval methods were discussed, including search engines and SQL databases rather than only vector stores.
- The use of language models to generate synthetic data for training retrieval models was highlighted as a way to improve ranking and performance.
- Practical tips for improving retrieval included focusing on data quality, formatting, indexing, attribution, and evaluating retriever and generation separately.
- The speakers discussed the benefits of the ColBERT retrieval model and its potential integration with LangChain.
- The DSP (Demonstrate-Search-Predict) programming model was introduced as a tool for building specialized retrievers, offering flexibility in customization.
- The webinar also covered the challenges and solutions in using ReAct agents, the capabilities of DSP, and the potential applications of retrieval in agent-based systems.
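The embed-and-retrieve flow behind the first few points can be sketched in a few lines. This is a toy illustration, not LangChain's actual API: the bag-of-words "embedding" is a deterministic stand-in for a trained embedding model, and the chunk texts are invented.

```python
import numpy as np

def bow_embed(text: str, vocab: dict) -> np.ndarray:
    """Stand-in embedding: L2-normalised bag-of-words over a fixed vocabulary.
    A real system would call a trained embedding model instead."""
    v = np.zeros(len(vocab))
    for w in text.lower().split():
        if w in vocab:
            v[vocab[w]] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

def retrieve(query: str, chunks: list, k: int = 2) -> list:
    """Rank document chunks by cosine similarity to the query embedding."""
    vocab = {w: i for i, w in enumerate(
        sorted({w for c in chunks for w in c.lower().split()}))}
    q = bow_embed(query, vocab)
    ranked = sorted(chunks, key=lambda c: -float(q @ bow_embed(c, vocab)))
    return ranked[:k]

# Illustrative chunks; a vector store would hold the embeddings persistently.
chunks = [
    "colbert is a late interaction retrieval model",
    "vector stores index dense embeddings of document chunks",
    "retrieval connects data with language models",
]
print(retrieve("dense embeddings in a vector store", chunks, k=1))
# → ['vector stores index dense embeddings of document chunks']
```

The retrieved chunks would then be placed into the language model's prompt as context, which is the "connecting data with language models" step the webinar describes.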
Summaries
66 word summary
The LangChain Retrieval Webinar on YouTube discussed retrieval models and strategies for knowledge-intensive tasks, including the ColBERT model. The webinar also introduced the DSP programming model for retrieval pipelines. Challenges with ReAct agents and arbitrary tools were discussed, and the webinar concluded with a discussion of document understanding, ranking, and diversity in information retrieval. The LangChain community encourages further research in these areas.
162 word summary
The LangChain Retrieval Webinar on YouTube explored effective retrieval models and strategies for knowledge-intensive tasks. Speakers Jo and Omar discussed the importance of improving tail queries while maintaining the quality of head queries. Omar presented ColBERT, a retrieval model that combines independent encoding with fine-grained token-level representations and has shown superior performance on downstream tasks. He also discussed its potential integration with LangChain and options for hosting the model. The webinar also introduced the DSP programming model, which lets users describe retrieval pipelines as Python functions with declarative calls to retrieval and language models; it offers flexibility and can be used for zero-shot or few-shot learning. The challenges of using ReAct agents with arbitrary tools were discussed, with emphasis on reliability and the cost of running GPT-4 or GPT-3.5 in production. The webinar concluded with a discussion of document understanding, ranking, and diversity in information retrieval and search. The LangChain community encourages further research and experimentation in these areas.
411 word summary
The LangChain Retrieval Webinar on YouTube discussed effective retrieval models and strategies for knowledge-intensive tasks. The webinar featured speakers Jo and Omar, who shared their insights and experience on the subject. Jo emphasized the importance of balancing improvements in tail queries with maintaining the quality of head queries. Omar presented ColBERT, a retrieval model that combines independent encoding with fine-grained token-level representations. ColBERT has shown superior performance on downstream tasks while being efficient and cost-effective. Omar also discussed the availability of ColBERT for use in different domains and its potential integration with LangChain. He mentioned the option of hosting the model on dedicated machines or using Vespa for hosting and serving. Beyond ColBERT, Omar mentioned other applications of retrieval-based models, such as question-answering systems and chatbots. The webinar also introduced the DSP programming model, which allows users to describe retrieval pipelines using Python functions and declarative calls to retrieval and language models. It offers flexibility in customization and can be used for zero-shot or few-shot learning. The webinar provided valuable insights into effective retrieval models and their application in knowledge-intensive tasks.
The LangChain Retrieval Webinar also discussed the challenges of using ReAct agents with arbitrary tools, highlighting the importance of reliability and the cost of running GPT-4 or GPT-3.5 in production. DSP offers primitives for implementing a declarative pipeline, enabling the creation of a ReAct agent that can interact with ColBERTv2; this can significantly improve accuracy on benchmarks like HotpotQA. DSP also offers the demonstrate primitive, which lets the language model learn how to interact with different parts of the pipeline, and the compile primitive, which allows for efficiency and adaptation to user queries. The webinar also discussed the use of retrieval for agents and the concept of long-term memory. DSP provides opportunities for self-improvement and learning from mistakes, and the use of metadata and rich annotations with small datasets can yield high-precision retrieval.
The webinar concluded with a discussion of different aspects of information retrieval and search, such as document understanding, ranking, and diversity. The LangChain community encourages further research and experimentation in these areas. Overall, the webinar provided insights into the challenges and solutions in using ReAct agents, the capabilities of DSP, and the potential applications of retrieval in agent-based systems. The speakers, Jo and Omar, shared their expertise and encouraged further exploration and learning in the field.
922 word summary
In this LangChain Retrieval Webinar, the speakers discuss retrieval and its importance in connecting data with language models. The speakers include Omar Khattab from the Stanford NLP group, who has worked on retrieval models, and Jo Kristian Bergum from the Vespa team, who has worked on a retrieval engine. The main use of retrieval in LangChain is to connect data with language models using embeddings and vectors. The traditional approach involves creating embeddings for document chunks and storing them in a vector store for retrieval. However, better types of retrieval are needed to reduce hallucinations and improve the accuracy of language models. LangChain has integrated BM25 and other methods behind a generic retriever interface. Jo discusses the different types of retrieval made possible by going beyond a vector store, including search engines and SQL databases. He emphasizes the importance of evaluating retrieval methods on common information retrieval datasets with metrics such as precision and recall, and mentions the BEIR benchmark, which covers different types of domains. Jo compares sparse vector representations with dense embedding retrieval, highlighting the advantages and limitations of each, and suggests that combining the two techniques is the future of efficient retrieval. He also discusses using language models to generate synthetic data for training retrieval models, which can improve ranking and performance. Practical tips for improving retrieval include focusing on data quality, formatting, indexing, attribution, and evaluating the retriever and the generation separately. The speakers close with a Q&A session covering topics such as evaluating IR systems and using synthetic data for evaluation and re-ranking.
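The evaluation practice described above, scoring a retriever with precision and recall against labeled relevant documents, reduces to a few lines. The document IDs and relevance labels here are purely illustrative:

```python
def precision_recall(retrieved: list, relevant: set) -> tuple:
    """Precision and recall for one query: `retrieved` is the ranked list the
    system returned, `relevant` is the labeled set of correct documents."""
    hits = len(set(retrieved) & set(relevant))
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Toy run: the retriever returned d1, d3, d7; annotators marked d1, d2, d3 relevant.
p, r = precision_recall(["d1", "d3", "d7"], {"d1", "d2", "d3"})
print(p, r)  # → 0.666... 0.666... (2 of 3 retrieved were relevant; 2 of 3 relevant were found)
```

Benchmarks like BEIR aggregate such per-query scores (usually as nDCG or recall@k) across many queries and domains, which is what makes cross-method comparisons meaningful.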
The LangChain Retrieval Webinar on YouTube discussed effective retrieval models and strategies for knowledge-intensive tasks. The webinar featured speakers Jo and Omar, who shared their insights and experiences on the subject.
Jo emphasized the importance of focusing on both head queries and tail queries when building a startup. He highlighted the need to balance improvements in the tail with maintaining the quality of head queries to ensure a positive impact on the product. He also mentioned the use of larger language models for better planning and structuring of metrics.
Omar presented ColBERT, a retrieval model that combines the benefits of independent encoding and fine-grained token-level representations. The model uses a scoring function to estimate similarity between query terms and document token vectors. Compared to other models, ColBERT has shown superior performance on downstream tasks while being highly efficient and cost-effective.
Omar also discussed the availability of ColBERT for use in different domains and highlighted the potential for integration with LangChain. He mentioned the option of hosting the model on dedicated machines or using Vespa for hosting and serving, but noted that there are currently no other hosted versions available.
In addition to ColBERT, Omar mentioned other applications of retrieval-based models, such as question-answering systems, chatbots, and fact-checking systems. He emphasized the importance of adapting retrieval models to evolving downstream tasks and discussed the use of the DSP programming model for building specialized retrievers. The DSP programming model allows users to describe their retrieval pipelines as Python functions with declarative calls to retrieval and language models. It offers flexibility in customization and can be used for zero-shot or few-shot learning.
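The shape of such a pipeline, a Python function composing a retrieval call and a language-model call, can be sketched as below. This is a toy in the spirit of the pattern, not the actual DSP API; `retrieve` and `generate` are injected stand-ins for a real retriever and LM client.

```python
def make_pipeline(retrieve, generate, k: int = 3):
    """Retrieve-then-read pipeline: fetch k passages for the question,
    stuff them into a prompt, and ask the language model to answer.
    `retrieve(question, k)` and `generate(prompt)` are pluggable stand-ins."""
    def answer(question: str) -> str:
        passages = retrieve(question, k)
        context = "\n".join(passages)
        prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
        return generate(prompt)
    return answer

# Stand-in components so the sketch runs without any external service.
fake_retrieve = lambda q, k: [f"passage about {q}"][:k]
fake_generate = lambda prompt: "answer derived from: " + prompt.splitlines()[1]

qa = make_pipeline(fake_retrieve, fake_generate)
print(qa("ColBERT"))  # → answer derived from: passage about ColBERT
```

Swapping the stand-ins for a real retriever (e.g. a ColBERT index) and a real LM changes nothing about the pipeline's structure, which is the flexibility the summary refers to.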
Overall, the webinar provided valuable insights into effective retrieval models and their application in knowledge-intensive tasks. The speakers highlighted the advantages of ColBERT and discussed its potential integration with LangChain. They also introduced the DSP programming model as a tool for building specialized retrievers.
The LangChain Retrieval Webinar discussed the challenges of using ReAct agents with arbitrary tools. While ReAct agents can help get started, they struggle with using tools like ColBERTv2. The webinar also highlighted the importance of reliability and the cost of running GPT-4 or GPT-3.5 in production. DSP offers primitives for implementing a declarative pipeline, enabling the creation of a ReAct agent that can interact with ColBERTv2. This can significantly improve accuracy on benchmarks like HotpotQA.
DSP also offers the demonstrate primitive, which allows the language model to learn how to interact with different parts of the pipeline. By providing labeled examples, the pipeline can be tested and customized. The compile primitive in DSP is another powerful feature that allows for efficiency and adaptation to user queries. By deploying the tool in front of users and gathering their questions, the program can be compiled on those questions to improve accuracy without needing to rely on GPT-4.
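The core idea behind demonstrate/compile can be illustrated with a toy: run the pipeline on labeled questions and keep only the traces whose answers match the labels, so they can later serve as few-shot demonstrations for a cheaper model. This is a sketch of the bootstrapping idea only, not DSP's real implementation; the pipeline and labels below are invented.

```python
def compile_demos(pipeline, train_examples):
    """Toy 'compile' step: execute the pipeline on (question, gold) pairs and
    retain only successful traces as candidate few-shot demonstrations."""
    demos = []
    for question, gold in train_examples:
        predicted = pipeline(question)
        if predicted == gold:           # keep only verified traces
            demos.append((question, predicted))
    return demos

# Stand-in pipeline: answers correctly only for questions it "knows".
known = {"capital of France?": "Paris"}
pipeline = lambda q: known.get(q, "unknown")

demos = compile_demos(pipeline, [("capital of France?", "Paris"),
                                 ("capital of Mars?", "Olympus")])
print(demos)  # → [('capital of France?', 'Paris')]
```

The surviving demonstrations would then be prepended to prompts at inference time, which is how accuracy can improve on user queries without upgrading to a larger model.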
The webinar also discussed the use of retrieval for agents and the concept of long-term memory. DSP provides opportunities for self-improvement by allowing for reflection and learning from mistakes. It offers both per-example self-improvement and pipelines pre-compiled from examples. The use of metadata and rich annotations with small datasets can also lead to high-precision retrieval.
The webinar concluded with a discussion of different aspects of information retrieval and search, such as document understanding, ranking, and diversity. Various options and techniques are available, including using GPUs for retrieval and re-ranking and incorporating cross-encoder models. The LangChain community is actively exploring these areas and encourages further research and experimentation.
Overall, the webinar provided insights into the challenges and solutions in using ReAct agents, the capabilities of DSP, and the potential applications of retrieval in agent-based systems. The speakers, Jo and Omar, shared their expertise and encouraged further exploration and learning in the field.
Raw indexed text (65,847 chars / 11,930 words)
Source: https://www.youtube.com/watch?v=VrL7AbrY438
Page title: LangChain Retrieval Webinar - YouTube