Summary of LlamaIndex Sessions: Practical challenges of building a Legal Chatbot over your PDFs

Source: youtu.be - YouTube video - 5,241 words

One Line

The text discusses the challenges of building a legal chatbot that parses PDFs, including limitations of existing parsers, the need for accurate OCR, handling specific domains and structured data, and the potential for creating comparison tools and unified document representations.

Slides

Slide Presentation (7 slides)

Building a Legal Chatbot: Challenges and Opportunities

Parsing Queries and Improving Core Query Qualities

• Parsing queries is crucial for building an effective legal chatbot

• Improving core query qualities enhances the accuracy of search results

• Current PDF parsers have limitations in handling specific domains and languages

Multilingual Capabilities and Processing Mixed-Format Documents

• Challenges with multilingual capabilities in legal chatbot development

• Processing mixed-format documents and foreign languages poses difficulties

• Specific domain processing is needed to overcome these challenges

Considering Security Settings and Training Models for Accuracy

• Security settings in downloaded PDFs must be considered for chatbot development

• Training models can significantly improve the accuracy of the chatbot

• OCR software, particularly handwriting recognition, presents challenges

Better Handling of Tables and Structured Data within PDFs

• PDF parsers need to improve handling of tables and structured data

• Utilizing data frames can enhance processing of structured data

• Unified document representation combining structured and unstructured data is beneficial

Creating a Comparison Tool for PDF Parsers

• A comparison tool for PDF parsers would be valuable for developers

• Different parsers extract tables and text differently, requiring careful selection

• GPT-4 can be used to clean up OCR output, but OCR itself remains limited on handwriting and watermarks

Unlocking the Potential of Legal Chatbots

• Building a legal chatbot poses challenges in PDF parsing and beyond

• Accurate OCR, domain-specific processing, and improved table handling are essential

• Creating comparison tools and unified document representations can drive progress

Key Points

The discussion revolves around the challenges of building a legal chatbot and improving PDF parsing.
The importance of parsing queries and improving core query qualities is emphasized.
Limitations of existing PDF parsers and the need for more specific domain processing are mentioned.
Challenges with multilingual capabilities and with processing mixed-format documents and foreign languages are highlighted.
The need for considering security settings in downloaded PDFs and training models to improve accuracy is mentioned.
Better handling of tables and structured data within PDFs is identified as a key feature needed in PDF parsers.
The idea of creating a unified document representation that combines structured and unstructured data is suggested.
The challenges of parsing tables within PDFs and the potential for creating a comparison tool for PDF parsers are discussed.

Summary

559 word summary

In this text excerpt, Speaker 1 expresses gratitude for the opportunity to discuss the challenges of building a legal chatbot. Speaker 0 finds the podcast educational and learns about PDF parsing and the difficulties of building a useful tool. They discuss the importance of parsing queries and improving core query qualities. They also mention the limitations of existing PDF parsers and the need for more specific domain processing. They touch on challenges with multilingual capabilities and the lack of processing for mixed-format documents and foreign languages.

Speaker 0 asks about potential directions for exploration beyond hybrid search and ranking. Speaker 1 mentions the need for considering security settings in downloaded PDFs and the potential for training models to improve accuracy. They discuss the challenges with OCR software, particularly with handwriting recognition. They mention the discrepancies in performance among different OCR tools and the need to carefully select parsers for documents containing handwritten information.

Speaker 0 inquires about features they wish PDF parsers had, and Speaker 1 mentions the need for better handling of tables and structured data within PDFs. They discuss the benefits of using a data frame to process structured data and suggest creating a unified document representation that combines structured and unstructured data. They also mention the idea of comparing different PDF parsers and showcasing their inputs and outputs.

They discuss the challenges of parsing tables within PDFs and mention that some parsers only extract tables while others extract text as well. Speaker 0 suggests creating a comparison tool for PDF parsers similar to nat.dev for LLMs, and Speaker 1 agrees it would be useful. They also mention using GPT-4 to clean up OCR output and discuss the limitations of OCR over different types of documents, including ones with handwriting or watermarks.

Overall, they discuss the challenges of PDF parsing, the limitations of existing parsers, and potential directions for improvement. They highlight the importance of handling specific domains, multilingual capabilities, and structured data within PDFs. They also emphasize the need for accurate OCR, particularly for handwritten text, and the potential for creating comparison tools and unified document representations.

The speaker discusses the use of a ranking module to improve the precision of search results. They explain that the first step is to retrieve documents based on metadata, and then use embedding-based retrieval to find more relevant information within those documents. The speaker mentions that hybrid search can increase accuracy, but it is not yet perfect. They also mention the challenges of assigning text to specific judges and the different processing strategies they have tried. The speaker talks about the formatting challenges of different types of documents and the overall structure of a Supreme Court case. They mention that the current iteration of the chatbot is focused on helping users understand Supreme Court cases, but they plan to add more features in the future.

Speaker 1 is discussing the first iteration of a legal chatbot that pulls from legal opinions from the Supreme Court, dating back to the 1800s, and existing PDF files on the Supreme Court website and Library of Congress. The chatbot is being built to provide accessible resources on legal matters, specifically Supreme Court decisions and opinions. Sam Yu, a software engineer and co-founder of honorable.ai, discusses the challenges of building this legal chatbot. This conversation is part of the LlamaIndex webinar.

Raw indexed text (29,380 chars / 5,241 words)

Page description tag: In this video, we chat with Sam Yu on practical challenges of 1) parsing Supreme Court decisions, and 2) building an LLM-powered chatbot over it. A lot of cha...

Speaker 0: Alright. Hey everybody, this is Jerry here. We're super excited today on the LlamaIndex webinar to bring on a guest. His name is Sam Yu.

He's a co-founder at honorable.ai, and he's here to tell you all about the challenges of building a legal chatbot. So just to start, maybe Sam, you could give a quick introduction.

Speaker 1: Oh, hi, I'm Sam. I'm currently working as a software engineer, but because of the recent explosion of chatbots and LLMs, I decided to use my knowledge to build a chatbot that focuses on legal matters, especially, let's say, the Supreme Court, you know, its decisions and opinions. Because of the recent decisions, there is a lot of interest in that field, but there aren't many resources that are easily accessible to people without a legal background.

Speaker 0: That's a super interesting and relevant use case. And, yeah, maybe give some context. Would you be able to describe a little bit about what you're building?

Speaker 1: Okay. So basically, the first iteration is quite simple. I'm just pulling all the legal opinions from the Supreme Court, dating back to basically the first... I think, not nineteen, like, eighteen-something, you know. And all the legal documents that exist as PDF files on the Supreme Court website and the Library of Congress.

So I'm getting all that information together and doing some data extraction, pre-processing, and metadata embedding. Then I'm feeding that into the LLM, so when the user asks any question, it will accurately retrieve information regarding their query.

Speaker 0: Got it. Is the end-user UX just to allow users to understand Supreme Court cases? Or are there other additional, kind of, high-level goals that you have in mind?

Speaker 1: This is just the first iteration. It's beginning just as a chatbot for users to understand any question they have an interest in about the Supreme Court. You can be a lawyer or you can be a layman; it works both ways.

But down the road, I do plan on adding, let's say, the transcripts, the audio files, or, let's say, the law, the code itself, embedded into the LLM. So when the user asks a question, it will retrieve more relevant results.

Speaker 0: Sweet. That's awesome. And then maybe just to dive into the nature of these documents a little bit more. Can you tell me a little bit about the format of a Supreme Court case? And, you know, what is the overall structure of this data?

Speaker 1: Across over two hundred years, there are so many different formats of the documents. You have the more modern documents, which are pretty much very nicely formatted PDF files. But when you go back to, let's say, pre-1980s, you have a lot of documents that are scans of old paper documents, which are sometimes not very nicely formatted and have a lot of, let's say, handwriting on some of the documents. So it can be kind of tricky, and sometimes they don't scan it really well, so you have missing pieces here and there, or maybe it just wasn't preserved well.

So once you do this conversion from the PDF file to the text file, it can have lots of extra artifacts, or there are missing pieces here and there, or the document text was not recognized well.

Speaker 0: Got it. What about the nature of the text itself? I think you mentioned briefly that there are probably different judges writing concurring opinions and dissenting opinions. What are some of the processing challenges there?

Speaker 1: Because of the formatting of the documents itself. As you said, you have the opinion that's basically the majority, so you have a judge writing the opinion piece, and then you have additional judges maybe writing concurrences to the opinion piece, but not the same as paste. And then you have the dissenting judges; there could be multiple dissents.

So in one PDF, you might have basically different pieces of documents. So how are you going to do it? Because when you're processing the PDF as one document, if you put the metadata all together, it may not reflect what each judge says. Let's say what Judge Sotomayor says may be different from what Judge Breyer says.

But the metadata may not reflect that. So in order to have the most accurate response, you want to basically separate them into different pieces. But at the same time, you want the metadata to really reflect where each judge stands on the case, and to have all that information available when you do the retrieval.

Speaker 0: Super. So basically, you kind of want to associate the relevant text and opinions with the relevant judge who's actually associated with the specific section, right? Because you could have multiple judges each writing their own opinion on the case?

Speaker 1: Yeah. And there are some really... I wouldn't call them edge cases. There are times when one judge agrees with one opinion but says, okay, I agree with all of the opinion except the first sentence.

So, okay, when you retrieve those, how are you going to process that one sentence? But these are edge cases.

I haven't found a really good solution yet. But this is something I really need to think about: when you do the embedding, how you process these documents.

Speaker 0: Gotcha. And maybe just the overall idea of, you know, how do you assign the text to a person? What are some of the processing strategies that you have tried for that? Because that's pretty relevant.

Right? Even if you process, like, a chat history. It's a different use case, but it seems pretty relevant. And in this case, you know, it's a Supreme Court opinion with a bunch of different opinions floating around in the document.

Speaker 1: Yeah. Initially, I would just process the documents straight, without considering too much. But then I went into the details.

I've been using, let's say, NLP instead of the LLM. I'm just using NLP to extract the judges' names from the text itself, and then assign them to each piece based on whether they're dissenting or concurring. So NLP was one way of doing it. But because of the edge cases, some documents do not work well with NLP.

So I've been exploring using, let's say, GPT-4. It's really good at classification. So you can basically do a zero-shot prompt: just give the whole chunk to GPT-4, and it will categorize who is dissenting, who is agreeing, who is concurring, everything, very nicely.
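A minimal sketch of the rule-based pass described above, assuming each opinion section opens with a header line such as "JUSTICE X delivered the opinion of the Court" or "JUSTICE X, concurring/dissenting". The header pattern, the `Section` type, and the placeholder names are illustrative assumptions, not the speaker's actual code, and older scans will often not match such a clean pattern.

```python
import re
from dataclasses import dataclass

# Assumed header shapes; real opinions across two centuries vary far more.
HEADER = re.compile(
    r"^(?:MR\. )?(?:CHIEF )?JUSTICE (?P<judge>[A-Z][A-Za-z']+)"
    r",? (?P<role>concurring|dissenting|delivered the opinion of the Court)",
    re.MULTILINE,
)


@dataclass
class Section:
    judge: str
    role: str  # "majority", "concurring", or "dissenting"
    text: str


def split_by_judge(opinion: str) -> list[Section]:
    """Split one opinion document into per-judge sections using header lines."""
    matches = list(HEADER.finditer(opinion))
    sections = []
    for i, m in enumerate(matches):
        # A section runs from its header to the next header (or end of text).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(opinion)
        role = m.group("role")
        if role.startswith("delivered"):
            role = "majority"
        sections.append(
            Section(m.group("judge").title(), role, opinion[m.start():end].strip())
        )
    return sections


# Placeholder text with made-up names, only to exercise the splitter.
sample = """JUSTICE SMITH delivered the opinion of the Court.
The judgment is affirmed.
JUSTICE JONES, concurring.
I agree with the result.
JUSTICE LEE, dissenting.
I would reverse.
"""
sections = split_by_judge(sample)
for s in sections:
    print(s.judge, s.role)
```

Each `Section` can then carry its own judge and role as metadata, which is the separation the speaker asks for; documents where no header matches are the cases where an LLM-based classification pass would take over.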

Speaker 0: I see. So you actually use GPT-4 to process the document and then to extract relevant metadata? You just give it a piece of it?

Speaker 1: Yeah. I just give it a piece of it. So basically, it's like the self cor basic retrieval, just extracting the entities given the conditions.

Speaker 0: Got it. And then you use this processed information and store it in a vector database that you use for later retrieval.

Speaker 1: Yeah. Exactly. And I've been trying to get additional metadata from the court website to understand what this case is for, what the area of interest is, and, let's say, the lower court: which lower court it is from. So this data is going to be added to the metadata as well. So when you do a retrieval, you can basically give it more keywords to more precisely locate the case, whether it's relevant to your interest or not.

Speaker 0: Gotcha. That's awesome. I think, you know, if you're interested in contributing a Supreme Court case loader to LlamaHub, which is our site for data loaders for LLM apps, that would be awesome, because this seems like a pretty relevant and useful, and kind of domain-specific, use case. Right?

And for Supreme Court cases, there's a certain way you can process and extract this information to create a nice document representation.

Speaker 1: Yeah. Definitely. Once I finish, basically finalize, how to best process this data and put it all together, I'm definitely interested in contributing to LlamaIndex.

Speaker 0: No, it's awesome. And maybe the step beyond this: now that you have this document and you've extracted some of this information, what is the way you're representing the document?

Is it still as text chunks, with metadata for each text chunk, or are you thinking about a slightly different approach?

Speaker 1: So basically, I'm converting the PDF into a text file, making some additional adjustments and processing, and then putting the metadata into the document itself. This way it stays with the document file, so when you do the retrieval it will still be there.
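One way to read "putting the metadata into the document itself" is to prepend a small key/value header to each text chunk before it is embedded. The sketch below assumes that reading; the field names and the header layout are hypothetical, not taken from the speaker's pipeline.

```python
def with_metadata_header(text: str, metadata: dict) -> str:
    """Prepend metadata as 'key: value' lines so it travels with the chunk text."""
    header = "\n".join(f"{key}: {value}" for key, value in metadata.items())
    return f"{header}\n\n{text}"


# Hypothetical field names and placeholder values.
chunk = with_metadata_header(
    "I would reverse.",
    {"judge": "Lee", "role": "dissenting", "area": "interstate commerce"},
)
print(chunk)
```

Because the header is part of the text, it is embedded and returned together with the chunk, at the cost of a little repeated text per chunk.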

Speaker 0: Got it. And so you're basically inserting the metadata into the file and then storing that somewhere. Yeah. Awesome.

What is the later part of your stack? So what kind of vector database are you using, and how are you doing initial retrieval? And how are you thinking about some of the failure cases of the initial retrieval stack?

Speaker 1: Yeah. I've been testing all kinds of stores to see which one best matches my use case. I think each store is very different these days. Let's say, the current one I'm picking is Weaviate, because Weaviate has this hybrid search method. You can provide it with keywords and then do the query.

This way, it will be more precise because of the metadata I've embedded. I definitely want the retrieval to have a more precise outcome.

Speaker 0: Got it. Makes sense. Maybe... So before hybrid search, just thinking about semantic search: did you try that?

And if so, what were some of the failure cases of it?

Speaker 1: Sometimes it doesn't capture everything. When you do a semantic search, let's say your sentence has multiple judges; sometimes it doesn't capture well who is the most prominent in the sentence, I assume. Yep.

Speaker 0: So does that change even after you have added the metadata about, you know, the judge corresponding to each chunk?

Speaker 1: So basically, when I'm pulling out those judges, I put a weight on the judge. The one that appears first has a higher weight. Basically, that's the person who wrote it, and the others are agreeing with the first judge. So this way, I'm basically telling the data to focus more on the first judge.
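The weighting idea above can be sketched as follows. The transcript only says that the first judge to appear gets a higher weight; the geometric decay and the `decay` parameter are assumptions made for illustration.

```python
def judge_weights(judges_in_order: list[str], decay: float = 0.5) -> dict[str, float]:
    """Weight judges by first appearance: 1.0 for the first, then decay, decay**2, ...

    Repeated mentions of the same judge do not change that judge's weight.
    """
    weights: dict[str, float] = {}
    for name in judges_in_order:
        if name not in weights:
            weights[name] = decay ** len(weights)
    return weights


# Placeholder names in the order they appear in a chunk.
weights = judge_weights(["Smith", "Jones", "Smith", "Lee"])
print(weights)
```

These weights could be stored in the chunk metadata, or used to decide which judge a chunk is primarily attributed to when several are named.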

Speaker 0: I see. Yeah. Was that the solution that you had to make sure that it fetched more relevant stuff? Or, even with that, was it still fetching irrelevant information?

Speaker 1: It's not working one hundred percent. Sometimes it works well, but sometimes it doesn't. I wish it could work a hundred percent of the time, but it doesn't at this time.

I'm trying to figure out how to, basically, maybe add more metadata. At the same time, I have to be aware that the user may not put in a very long query. The user may just put in a very simple sentence and want to get the best results.

Speaker 0: Got it. Makes sense. And along those lines, you mentioned the user might enter different types of queries to ask these types of questions. What is the class of questions that you're looking to answer? And could you give some examples of those?

Speaker 1: Let's say I want to know whether Judge Kavanaugh has written anything about a specific area, like interstate commerce. So first, it will process the query to pick out: okay, the area of interest is interstate commerce and the judge is Kavanaugh.

Then you put these two into, let's say, the keywords of the keyword search to retrieve the relevant documents. Then you can do either a re-rank or other processing, and then you do the query retrieval, the embedding-based retrieval, to pull out the relevant information.
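The query-processing step in this example can be sketched with simple lookups. This assumes a known list of judge names and areas of interest; both lists and the `parse_query` helper are hypothetical, and a real system might use an LLM or an entity extractor instead.

```python
# Hypothetical vocabularies; a real system would load these from its metadata.
KNOWN_JUDGES = {"kavanaugh": "Kavanaugh", "sotomayor": "Sotomayor"}
KNOWN_AREAS = ["interstate commerce", "free speech"]


def parse_query(query: str) -> dict:
    """Pick out the judge and the area of interest mentioned in a user query."""
    lowered = query.lower()
    judge = next((name for key, name in KNOWN_JUDGES.items() if key in lowered), None)
    area = next((a for a in KNOWN_AREAS if a in lowered), None)
    return {"judge": judge, "area": area}


keywords = parse_query("Has Judge Kavanaugh written anything about interstate commerce?")
print(keywords)
```

The extracted values become the keywords for the first, keyword-based retrieval pass; fields that come back as `None` are simply left out of the filter.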

Speaker 0: Got it. Have you noticed a substantial difference once you go from just pure top-k semantic search to adding in some sort of hybrid search keyword filtering component?

Speaker 1: Oh, it definitely increased the accuracy. It's not a hundred percent there yet, but definitely. Let's say early on it was, like, thirty percent. And now you can get to, like, maybe sixty, seventy percent, you know, a higher score.

Speaker 0: Super interesting. And maybe just for our listeners, could you give a sense of what exactly hybrid search is doing in this case that helps improve the accuracy?

Speaker 1: So basically, like I said earlier, you have the metadata. So the first step is basically a metadata search, to find the documents that are embedded with that metadata. So it will do the first retrieval, getting those documents back. And then the second step is using the query, through the embedding-based retrieval, to find the more relevant passages within those documents, instead of going through everything or going through a summary of your documents.
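The two steps just described (metadata match first, then embedding similarity over the survivors) can be sketched in plain Python. The bag-of-words `embed` function is a toy stand-in for a real embedding model, and the chunk layout is an assumption; this is not how any particular vector database implements hybrid search.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def filter_then_rank(chunks: list[dict], query: str, filters: dict, top_k: int = 2) -> list[dict]:
    # Step 1: keep only chunks whose metadata matches every filter value.
    candidates = [
        c for c in chunks
        if all(c["metadata"].get(k) == v for k, v in filters.items())
    ]
    # Step 2: rank the remaining chunks by similarity to the query.
    q = embed(query)
    ranked = sorted(candidates, key=lambda c: cosine(q, embed(c["text"])), reverse=True)
    return ranked[:top_k]


# Placeholder chunks with made-up text and names.
chunks = [
    {"text": "The statute regulates interstate commerce among the states.",
     "metadata": {"judge": "Smith"}},
    {"text": "Interstate commerce is not at issue here.",
     "metadata": {"judge": "Jones"}},
    {"text": "The search was unreasonable under the circumstances.",
     "metadata": {"judge": "Smith"}},
]
hits = filter_then_rank(chunks, "interstate commerce", {"judge": "Smith"}, top_k=1)
print(hits[0]["text"])
```

The filter is what raises precision: the second chunk matches the query words well but is excluded because its metadata names a different judge.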

Speaker 0: Got it. So it's a way of increasing the precision of your retrieval. Because without the metadata, maybe you're getting back stuff that kind of matches the semantic search part, like the embedding similarity, but you're not necessarily getting back documents that fit the keywords. Exactly. Got it. Got it.

The other part that you mentioned is this idea of a second-stage re-ranking module. Have you tried that? And if so, could you give a description of how it works and the stuff that you tried?

Speaker 1: Yeah. So basically, I've tried that, but the result is kind of mixed right now for me. I'm still in the process of refining it, trying to find the sweet spot for how to best use it. So basically, you retrieve those documents first, and then through the re-rank each document gets a score, and through those scores you basically put the more relevant documents first, and then you do the second step, the embedding-based retrieval from the corpus.

Speaker 0: Oh, I see. So you're actually doing some sort of... Are you doing the embedding based retrieval first or as a second stage?

Speaker 1: I forgot my pipeline. I think I'm doing the embedding-based retrieval first, then re-ranking, and then you go through the prompt.

Speaker 0: I see. Yeah. And then you use the LLM to do some sort of re-ranking. I see. Got it. Got it.

Makes sense. Cool. Taking a step back, what are your favorite PDF parser packages? What are some existing parsers that you think are pretty good?

Speaker 1: I think personally, if you want to do it locally, Tesseract is probably the best. Let me show you guys some examples; let me share my screen. Sounds great. Alright.

Can you see my screen? Yep. Okay. So, basically, this is just a test of some of the documents I'm processing. And you can see on the right is one of the Supreme Court opinions, from 1968, which you can see is a scan of the paper document.

Speaker 0: Oh, nice. It's literally just like a a scan. Right? Okay. Alright.

Yeah.

Speaker 1: So you have the handwritten parts. You have some with the watermark. So it's everywhere, and that can be really confusing for some of the PDF loaders. And I can just show you guys.

This first result is from Tesseract. The result, as you can see on the left, is not too bad, but some of the words are definitely missing, and it doesn't really know exactly what it's looking at. And then I have another one, which is pdfminer.six.

And it does some things well. As you can see in the first sentence, "Supreme Court of the United States", compared with Tesseract it picked up more text at the top. So, you know, each document is different. Each PDF parser is different. So sometimes a document will work well with one parser, but not with another.

So let me show a worse example. It's called PyPDF2. It gives out something, but as you can see at the bottom, it just doesn't work well.

Speaker 0: Oh, interesting. So yeah, it's trying to do OCR over this document, right? But maybe the characters are just not, you know, in the right format. Yeah.

Speaker 1: Yeah. And I want to show you guys one method I'm using to clean up documents, which is using GPT-4. I'm basically feeding the OCR text of the scanned document into GPT-4 and asking it to correct it, and it actually does a really decent job. As you can see, it pretty much captures everything that's on the page. Though I don't really know if GPT-4 has these documents in its training data or not.

But this is a very impressive result. It doesn't capture everything. Let's say the one thing I think it's missing is the page number. It doesn't have the page number on there, but it pretty much captured the essence of the document.

Speaker 0: Really, it's awesome. This is great. I actually think this would be a super useful notebook to share.

Speaker 1: Yeah. So this is one way of processing documents. But because GPT-4 is so costly, it may not be feasible to run over your entire document base, but it can be something to think about.

Speaker 0: That's fair. I think even just a basic comparison... oh, I see you have other PDF libraries in there too. You know, if you just have a comparison of all the different PDF parsers and showcase some of the text, I actually think that would be a great comparison tool. You know how there's nat.dev for LLMs, so you can compare the inputs and outputs of different models?

So, the same for PDF parsers: just throw in different PDFs and take a look at how the outputs differ. Yeah, actually, that would be super useful.
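A comparison along these lines could start as a small table of parser outputs scored against a hand-checked reference transcription. In the sketch below, the parser names and texts are placeholders, the outputs are assumed to have been produced already by whichever parsers are being compared, and the use of `difflib` similarity as the score is an assumption rather than anything proposed in the conversation.

```python
import difflib


def compare_parsers(outputs: dict[str, str], reference: str) -> list[tuple[str, int, float]]:
    """Return (parser, characters extracted, similarity to reference), best first.

    Similarity is difflib's ratio in [0.0, 1.0]; an empty output scores 0.0,
    which also flags parsers that silently returned no text.
    """
    rows = []
    for name, text in outputs.items():
        ratio = difflib.SequenceMatcher(None, reference, text).ratio()
        rows.append((name, len(text), round(ratio, 2)))
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Placeholder outputs for one page; real text would come from the parsers themselves.
reference = "SUPREME COURT OF THE UNITED STATES"
outputs = {
    "parser_a": "SUPREME COURT OF THE UNITED STATES",
    "parser_b": "SUPR3ME C0URT 0F TH UNITD",
    "parser_c": "",
}
report = compare_parsers(outputs, reference)
for name, n_chars, score in report:
    print(f"{name:10s} chars={n_chars:3d} similarity={score}")
```

A character-level ratio is crude, but it is enough to surface the cases discussed here: a parser that drops words, and a parser that returns nothing at all.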

Speaker 1: That's a great idea. Yeah. Definitely. I can also show, basically, one more thing you want to think about: the tables.

A lot of PDFs have tables, but they don't really process well. Using pdfminer, basically, it does extract the text, but it's not really useful. If you, say, put it into the LLM, it may not process really nicely, which is a problem I run into a lot. So you might want to change it into a data frame. This way, you have more structured data, and there are so many packages out there with which you can process the data frame very nicely with LLMs. So this is something you might want to think about when you're running, let's say, just a table PDF.

So yeah. Interesting. Okay.
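As a sketch of the "change it into a data frame" step, assume a table extractor has already returned the table as a list of rows with the header first. The code below turns those rows into records and CSV text using only the standard library; the row values are placeholders, and the helper names are hypothetical.

```python
import csv
import io


def rows_to_records(rows: list[list[str]]) -> list[dict]:
    """Turn [header, row, row, ...] into a list of {column: value} records."""
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


def records_to_csv(records: list[dict]) -> str:
    """Serialize records as CSV text, a compact structured form to pass along."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Placeholder table, standing in for rows pulled out of a PDF.
rows = [["name", "value"], ["a", "1"], ["b", "2"]]
records = rows_to_records(rows)
table_csv = records_to_csv(records)
print(table_csv)
```

If pandas is available, `pandas.DataFrame(records)` builds an actual data frame from the same records.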

Speaker 0: So is this table parsing that you described for parsing tables within a PDF? Right? Yes. Yeah. And for other packages... does it parse hybrid data or only tables within a PDF?

Speaker 1: I think this one mainly just extracts tables. It doesn't take all the text. But I'm kind of doing the research, trying to find a mixed parser that puts everything together: basically extracting tables and extracting text and putting them all together at the end.

Speaker 0: Yeah. That's the next question I was going to ask, actually. If you think about a PDF, it has a lot of unstructured text, which you could get from OCR or from the text directly in the PDF, and it also has structured data.

How are you thinking about creating a unified document representation that contains both elements, or about merging the structured and unstructured data somehow?

Speaker 1: I'm thinking you probably need a mixed format: you have the data frame, let's say, or CSV format, and you have the text format. But I'm struggling with how to reference the other documents in your document, so it knows where that information is located. I think that's the issue I haven't really worked out yet.

Speaker 0: I see. So being able to, like, reference other sections within the document. Recommend. Yeah. And model relationships.

I see. Yeah. Makes sense. Yes. This is cool.

Speaker 1: Yeah. And lastly, I just want to showcase real quick that with handwritten PDFs, sometimes it doesn't work well. Take the first one, pdfminer: it basically doesn't recognize any text at all, for some reason. So if you're processing hundreds of documents at once, you may not even realize you have missing documents sometimes.

Because some of the parsers may just not process it at all, and some parsers process it well. Tesseract, let's say, does a reasonably decent job. And there are so many benchmarks out there for handwriting, like Ke Handwritten, and some of the parsers just do it very poorly.

And if your company or your organization has some particular kind of documents, you might even want to consider training on your data, so the parser can understand your data way better than the existing parsers out there.

Speaker 0: I see. So is this a public benchmark, or did you create this?

Speaker 1: Oh, this is a public benchmark.

Speaker 0: I see. Got it. So it's comparing the quality of different kinds of OCR tools across different categories of data.

Yeah. Could you help maybe distill those results just a little bit more? What are these tools doing well, and where are they doing poorly?

Speaker 1: So category one is basically random Wikipedia and Google Search pages. So it's a very clear HTML format, even though it's saved as a PDF file. I believe they still retain the text very nicely.

So pretty much every parser out there can get that text with close to a hundred percent accuracy. But the problem starts when you're dealing with the handwritten documents, which is category two. With the handwritten, like the example I'm showing, some parsers just don't do it, or, let's say, do it very poorly.

Let's say this Azure parser is below twenty percent, and others, ABBYY, about, well, fifty percent, and you have a higher rate with AWS or GCP. So it can really vary. So you have to be careful about which parser you're using when processing your documents if they contain any handwritten information.

Speaker 0: Yeah. Well, the discrepancy is huge. GCP is pretty good and Azure is pretty low. Yeah. Got it.

It seems to me there's an interesting challenge with OCR software: if it's well formatted, it seems to do pretty well, and then handwriting just seems very volatile. Right? And that creates a lot more variability in the performance.

Speaker 1: Yeah. And you have to consider that your company's format may be different from the existing training data they have in those models. That can create problem cases. You might want to train your own models on your own data to get a more accurate representation when running the parser.

Speaker 0: Awesome. This is great. The next question I was wondering about: there are a lot of these PDF parsing and OCR packages and tools that you've played around with, and this is a great analysis. Is there anything that you wish these PDF parsers had that just doesn't exist right now?

Speaker 1: I would say it basically has to be a la carte. Let's say some of the PDFs you download may have a security setting on them. Your default PDF parser may not be able to process them, because it may not be able to bypass that security. So that's one issue you might face.

The second is if your PDF has a lot of images or tables: how are those processed? Maybe this parser or other parsers can parse straight-text PDFs really well. But once a parser encounters mixed data in the PDF, how does it handle that information?

I think that's a key.
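One way to picture the "a la carte" idea is a small planner that picks processing steps per page depending on what the page contains. The sketch below is purely illustrative: `PageProfile`, `plan_steps`, and the step names are hypothetical and do not correspond to any real parser's API, and detecting these properties in the first place is assumed to happen elsewhere.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PageProfile:
    """Hypothetical description of what a PDF page contains."""
    encrypted: bool = False
    has_tables: bool = False
    has_images: bool = False
    has_handwriting: bool = False


def plan_steps(page: PageProfile) -> List[str]:
    """Choose processing steps for one page (step names are placeholders)."""
    if page.encrypted:
        # A protected file cannot be parsed as-is; stop and ask for access.
        return ["request_unlocked_copy"]

    steps = ["extract_text"]
    if page.has_tables:
        steps.append("extract_tables")
    if page.has_images or page.has_handwriting:
        steps.append("run_ocr")
    if page.has_handwriting:
        # Handwriting accuracy varies widely, so keep a human in the loop.
        steps.append("flag_for_review")
    return steps


if __name__ == "__main__":
    print(plan_steps(PageProfile(has_tables=True, has_handwriting=True)))
```

The design choice here is simply that no single step is assumed to handle everything: straight text, tables, images, and protected files each get their own path.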

Speaker 0: Yeah, that makes a lot of sense. And looking more broadly at the overall space for retrieval-augmented generation applications, since your application also falls in this category: what are some of the challenges that you think still exist, and what are some potential future directions that you're excited about and want to tackle?

Speaker 1: I think domain-specific document processing is still kind of lacking. Basically, the existing products out there are mostly tailored towards "we can process any document." But they fail if you have more mixed-format documents, or if you have a foreign language; sometimes they just don't process it at all.

I've tried some parsers that just don't handle foreign languages, let's say Asian languages, well. So this is something that definitely has to be considered when you're doing document processing.

Speaker 0: Got it. So, like, multilingual capabilities.

Speaker 1: Yeah, exactly. And more domain-specific: let's say you want to just process a transcript, or Supreme Court opinions, or a court document. Those have to be handled very explicitly. I mean, it may not work well with basically the current PDF parser that's in one of the packages.

Speaker 0: And on the LLM side, you talked a little bit about two-stage ranking and hybrid search. Are there any other potential directions that you would be interested in exploring, beyond some of the existing hybrid search and ranking approaches that you've tried?

Speaker 1: I think it really comes down to how you parse the query. I think that's another key to picking up relevant information. Basically, the process of using LLMs to improve the core query quality is something that I really need to look into, to see what can be done to improve it beyond the limited information that's been provided.

Speaker 0: Makes a lot of sense. Well, Sam, thanks so much for being here today. I think this was a really educational podcast for me. I learned a lot about PDF parsing and a lot of the challenges, and it really does seem that, to build something useful, this is one of the key things that you have to solve for your domain-specific chat app. So it was great learning about some of the thoughts that you have on this area. Thanks so much.

Speaker 1: Alright. Thank you. Thank you for having me, Jerry.