Summary: Weakly Supervised Information Extraction from Handwritten Documents (arxiv.org)
One Line
The article proposes a weakly supervised approach to extracting medicine names from handwritten medical prescriptions using a domain-specific medicine language model and weakly supervised segmentation, which significantly enhances the performance of existing OCR systems.
Key Points
- A weakly supervised approach is proposed for extracting medicine names from handwritten medical prescriptions using a domain-specific language model and weakly supervised segmentation.
- The model achieves 78% pixel mIoU using weak labels and enhances the performance of existing OCR systems.
- The approach involves using an OCR labeling function and a segmentation labeling function, which improve over iterations (a pipeline sketch follows this list).
- The authors use a medicine name vocabulary and a dataset of 9645 handwritten prescriptions written by 117 doctors.
- The algorithm developed can selectively infuse domain knowledge and correct errors caused by misinterpreting similar-looking medicines or OCR errors.
- The paper also reviews related methods for weakly supervised information extraction from handwritten documents, emphasizing the importance of weak supervision in training models and highlighting the potential for further research.
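As a rough illustration of how these pieces fit together at prediction time, the following Python sketch composes a hypothetical segmentation model, an LM-aware OCR decoder, and the medicine vocabulary. The names `seg_model`, `ocr_topk`, and `vocab` are illustrative placeholders, not identifiers from the paper.

```python
# A minimal sketch of the end-to-end inference flow, assuming hypothetical
# components: `seg_model` (medicine-line segmentation), `ocr_topk` (OCR decoding
# with the medicine character LM injected), and a `vocab` of medicine names.

def predict_medicines(image, seg_model, ocr_topk, vocab, k=10):
    """Return the set of in-vocabulary medicine names found in one prescription image."""
    found = set()
    for line_crop in seg_model(image):          # 1. propose medicine-line regions
        candidates = ocr_topk(line_crop, k=k)   # 2. top-k LM-aware decodings per line
        matches = [c for c in candidates if c in vocab]
        if matches:                             # 3. exact match + majority vote
            found.add(max(set(matches), key=matches.count))
    return found
```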
Summaries
300 word summary
The paper proposes a weakly supervised approach to extracting medicine names from handwritten medical prescriptions, combining a domain-specific medicine language model with weakly supervised segmentation. The algorithm selectively infuses domain knowledge and corrects errors caused by misreading similar-looking medicines or by OCR mistakes. In-vocabulary words are predicted by matching decoded text against a trained medicine language model (LM), and the authors experiment with injecting the LM selectively and with segmenting medicine lines to improve recognition. The framework relies on two labeling functions, an OCR labeling function and a segmentation labeling function, which improve over iterations: the OCR labeling function pseudo-labels a training set, a segmentation network is trained on that relatively small set, and its predictions in turn extend and refine the labels. The authors use a medicine name vocabulary and a dataset of 9645 handwritten prescriptions written by 117 doctors. A probabilistic program creates an exhaustive set of possible medicine name lines, a character-based n-gram LM is trained on those lines, and a character n-gram LM is also used during decoding. The segmentation model achieves 78% pixel mIoU using only weak labels, and the approach significantly enhances the performance of existing OCR systems. The paper also reviews related work on weakly supervised information extraction from handwritten documents and on improving optical character recognition, including deep learning approaches for OCR post-correction, text recognition, segmentation, and labeling, as well as pre-trained biomedical language models for text mining.
664 word summary
This paper proposes a weakly supervised approach to extracting information, specifically medicine names, from handwritten medical prescriptions, using a domain-specific medicine language model and weakly supervised segmentation. The segmentation model achieves 78% pixel mIoU using only weak labels, and the overall approach significantly enhances the performance of existing OCR systems. The OCR model is pre-trained and consists of an encoder and a fully connected symbol classification head. Synthetic medicine lines are generated using patterns of medicine lines as written by doctors in prescriptions, and using the resulting language model in the OCR decoder improves performance significantly. To identify medicine names, the authors construct a training dataset from weak labels and an OCR labeling function, use an assignment problem algorithm to assign bounding boxes to medicine names (see the sketch below), and decode with a character n-gram language model.
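The bounding-box assignment step could look roughly like the following sketch, which pairs OCR'd line texts with an image's weak-label medicine names by solving an assignment problem over a string-dissimilarity cost. The Hungarian-style solver and the `1 - similarity` cost are assumptions made for illustration, not details confirmed by the summary.

```python
# A sketch of assigning detected line boxes to weak-label medicine names.
from difflib import SequenceMatcher
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_boxes_to_medicines(box_texts, medicine_names):
    """Return (box_index, medicine_name) pairs minimizing total string dissimilarity."""
    cost = np.array([[1.0 - SequenceMatcher(None, t.lower(), m.lower()).ratio()
                      for m in medicine_names] for t in box_texts])
    rows, cols = linear_sum_assignment(cost)   # Hungarian-style optimal assignment
    return [(r, medicine_names[c]) for r, c in zip(rows, cols)]

# Example: three OCR'd line texts matched against two weak-label medicine names.
print(assign_boxes_to_medicines(["pantoprazol 40mg", "paracetml 500", "b.d. x 5 days"],
                                ["Pantoprazole", "Paracetamol"]))
```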
The method uses two labeling functions, an OCR labeling function and a segmentation labeling function, which improve over iterations. The training set is first pseudo-labeled via the OCR labeling function, a segmentation network is trained on this relatively small set, and the segmentation labeling function then extends the labels to the rest of the data, as sketched below.
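A minimal sketch of this iterative loop, with the labeling functions and the training routine passed in as hypothetical callables (`ocr_labeling_fn`, `train_segmentation`, and `second_labeling_fn` are illustrative names, not identifiers from the paper):

```python
# How the pseudo-labels and the segmentation network refine each other over iterations.

def bootstrap_segmentation(images, weak_labels, ocr_labeling_fn,
                           train_segmentation, second_labeling_fn, n_iters=3):
    # Iteration 0: the OCR labeling function pseudo-labels a small seed set.
    pseudo_masks = {}
    for i, img in enumerate(images):
        mask = ocr_labeling_fn(img, weak_labels[i])
        if mask is not None:
            pseudo_masks[i] = mask
    model = None
    for _ in range(n_iters):
        model = train_segmentation(pseudo_masks)          # train on noisy pseudo-labels
        for i, img in enumerate(images):
            if i in pseudo_masks:
                continue
            # The segmentation labeling function recovers lines the OCR pass missed;
            # a high confidence threshold keeps label noise down.
            mask = second_labeling_fn(model, img, weak_labels[i])
            if mask is not None:
                pseudo_masks[i] = mask
    return model
```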
The approach involves using a probabilistic program to create an exhaustive set of possible medicine name lines, and training a character-based n-gram language model (LM) on those lines. Domain-specific knowledge is injected into the OCR system using the LM, and a segmentation model is trained to identify medicine lines based on visual features.
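One way such a pipeline might be approximated is shown below: a toy probabilistic generator assembles medicine lines from made-up prefix/strength/frequency patterns, and a character n-gram table is counted from them. The specific patterns and the 5-gram order are assumptions; the paper's actual probabilistic program is richer.

```python
# A sketch of synthetic medicine-line generation and character n-gram counting.
import random
from collections import Counter

def synth_medicine_lines(vocab, n_lines=10000, seed=0):
    rng = random.Random(seed)
    prefixes = ["", "tab ", "cap ", "syp "]
    strengths = ["", " 250mg", " 500mg", " 40mg", " 10ml"]
    freqs = ["", " 1-0-1", " b.d.", " o.d.", " x 5 days"]
    return [rng.choice(prefixes) + rng.choice(vocab)
            + rng.choice(strengths) + rng.choice(freqs) for _ in range(n_lines)]

def char_ngram_counts(lines, n=5):
    """Count character n-grams (with start/end pads) for a character language model."""
    counts = Counter()
    for line in lines:
        padded = "^" * (n - 1) + line.lower() + "$"
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    return counts

lines = synth_medicine_lines(["Pantoprazole", "Paracetamol", "Amoxicillin"])
counts = char_ngram_counts(lines, n=5)
```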
The authors use a medicine name vocabulary and a dataset of 9645 handwritten prescriptions written by 117 doctors, annotating 500 images for evaluation. They use an n-gram model for in-vocabulary prediction together with an edit distance search between each medicine line text and the medicine vocabulary, and they rely on experimental results and rigorous ablations to understand the efficacy of the framework. The proposed framework involves two labeling functions, an OCR labeling function and a segmentation labeling function, and the medicine name prediction model's performance increases with subsequent iterations. The end-to-end medicine name prediction model is evaluated using mean precision, mean recall, and mean Jaccard index; the segmentation model is evaluated using mean intersection over union, as sketched below.
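For concreteness, the metrics named here can be computed as in the following sketch, assuming per-image predictions and ground truth are given as sets of medicine names and the segmentation outputs as binary masks; the exact averaging conventions in the paper may differ.

```python
# Per-image medicine-name metrics and pixel mIoU for segmentation masks.
import numpy as np

def name_metrics(pred_sets, gt_sets):
    """Mean precision, recall, and Jaccard index over images."""
    p, r, j = [], [], []
    for pred, gt in zip(pred_sets, gt_sets):
        inter, union = len(pred & gt), len(pred | gt)
        p.append(inter / len(pred) if pred else 1.0)
        r.append(inter / len(gt) if gt else 1.0)
        j.append(inter / union if union else 1.0)
    return float(np.mean(p)), float(np.mean(r)), float(np.mean(j))

def pixel_miou(pred_masks, gt_masks):
    """Mean intersection-over-union of binary medicine-line masks."""
    ious = []
    for pred, gt in zip(pred_masks, gt_masks):
        inter = np.logical_and(pred, gt).sum()
        union = np.logical_or(pred, gt).sum()
        ious.append(inter / union if union else 1.0)
    return float(np.mean(ious))
```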
The authors experimented with injecting the language model (LM) selectively and with segmenting lines, both of which improved recognition of medicine names in handwritten documents. The segmentation model exploits cues from visual features surrounding medicine lines, such as hyphens and prefixes like "Tab" and "Cap", and with only weak labels the approach comes close to the strong upper-bound performance.
The method predicts in-vocabulary words by matching decoded text against a trained medicine language model (LM). Performance improves as more medicine names are added to the LM, but saturates after a certain point. An n-gram LM conditioned on history characters also improves performance, with the best results obtained when the medicine lines are segmented.
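A candidate line can be scored under such a character n-gram model as in this sketch, where `ngram_counts` and `history_counts` are n-gram and (n-1)-gram count tables (for example, built as in the earlier counting sketch); the add-alpha smoothing is an assumption made here for illustration.

```python
# Score a candidate line under a character n-gram LM with (n-1) history characters.
import math

def line_log_prob(line, ngram_counts, history_counts, n=5, alpha=1.0, charset_size=40):
    padded = "^" * (n - 1) + line.lower() + "$"
    logp = 0.0
    for i in range(n - 1, len(padded)):
        history = padded[i - n + 1:i]
        gram = history + padded[i]
        # Add-alpha smoothing so unseen character transitions are not impossible.
        num = ngram_counts.get(gram, 0) + alpha
        den = history_counts.get(history, 0) + alpha * charset_size
        logp += math.log(num / den)
    return logp
```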
The algorithm can selectively infuse domain knowledge and correct errors caused by misinterpreting similar-looking medicines or by OCR mistakes. Two types of errors occur: segmentation errors, and medicine names that are predicted but absent from the ground truth. The paper compares multiple strategies for predicting in-vocabulary words and finds that top-k + majority voting performs best, as contrasted in the sketch below.
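The contrast between a top-1 exact-match strategy and the top-k + majority strategy can be illustrated as follows; `candidates` stands for the top-k decoded texts of one line, `vocab` for the medicine vocabulary, and the function names are illustrative only.

```python
# Two in-vocabulary prediction strategies; the summary reports top-k + majority as best.
from collections import Counter

def top1_exact(candidates, vocab):
    """Accept the best decoding only if it is itself a vocabulary entry."""
    return candidates[0] if candidates and candidates[0] in vocab else None

def topk_majority(candidates, vocab):
    """Keep every top-k decoding that matches the vocabulary, then majority-vote."""
    matches = [c for c in candidates if c in vocab]
    return Counter(matches).most_common(1)[0][0] if matches else None
```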
The paper reviews related methods for weakly supervised information extraction from handwritten documents, including deep learning approaches for OCR post-correction, text recognition, segmentation, and labeling, as well as pre-trained biomedical language models for text mining, and it emphasizes the importance of weak supervision in training models while highlighting the potential for further research. The cited work spans object localization, semantic segmentation, OCR correction, and domain adaptation, with specific techniques such as Snorkel for rapid training data creation, Med-BERT for disease prediction on large-scale medical records, and W-TALC for weakly supervised temporal activity localization and classification.
1455 word summary
The references cover research on weakly supervised information extraction and recognition from various types of documents, including handwritten documents, electronic health records, and natural scenes, spanning topics such as object localization, semantic segmentation, OCR correction, and domain adaptation; specific techniques include Snorkel for rapid training data creation, Med-BERT for disease prediction on large-scale medical records, and W-TALC for weakly supervised temporal activity localization and classification. The related-work discussion reviews deep learning approaches for OCR post-correction, text recognition, segmentation, and labeling, as well as pre-trained biomedical language models for text mining, emphasizing the importance of weak supervision in training models and highlighting the potential for further research. The paper itself addresses the problem of extracting medicine names from handwritten prescriptions. The algorithm selectively infuses domain knowledge and corrects errors caused by misinterpreting similar-looking medicines or by OCR mistakes. Two types of errors occur: segmentation errors, and medicine names that are predicted but absent from the ground truth. The paper compares multiple strategies for predicting in-vocabulary words and finds that top-k + majority voting performs best; increasing the match threshold beyond exact matches significantly reduces precision at the gain of recall. The framework can also be applied to other types of documents. In-vocabulary words are predicted by matching decoded text against a trained medicine language model (LM); performance improves as more medicine names are added to the LM but saturates after a certain point, and an n-gram LM conditioned on history characters also improves performance, with the best results obtained when the medicine lines are segmented. The output depends on the top-k decoded paths, and performance varies with the number of paths and the LM weight (see the decoding sketch below); it is also affected by the synthetic lines used to train the medicine LM. Selectively injecting the LM and segmenting lines play a critical role in recognizing medicine names. The segmentation model relies on cues from visual features surrounding medicine lines, such as hyphens and prefixes like "Tab" and "Cap", and these cues differ from those used in generic text detection. With only weak labels the approach comes close to the strong upper-bound performance, and ground-truth medicine bounding boxes have only a small impact on medicine name prediction. The proposed framework involves two labeling functions, an OCR labeling function and a segmentation labeling function, and the medicine name prediction model's performance increases with subsequent iterations. The end-to-end medicine name prediction model is evaluated using mean precision, mean recall, and mean Jaccard index.
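How the number of decoded paths and the LM weight enter decoding can be pictured with this simplified beam-search sketch, which mixes per-step OCR log-probabilities with a character-LM score weighted by `lam`. It deliberately ignores CTC blank/merge handling and is an illustration, not the paper's actual decoder.

```python
# LM-weighted top-k (beam) decoding over per-timestep character log-probabilities.
# `ocr_logprobs` is a (timesteps x charset) array of log-probs from the OCR head,
# `lm_logprob(history, ch)` scores the next character under the medicine LM.

def beam_decode(ocr_logprobs, charset, lm_logprob, lam=0.5, beam=10):
    beams = [("", 0.0)]                                   # (decoded text, combined score)
    for step in ocr_logprobs:                             # one charset-sized row per timestep
        scored = []
        for text, score in beams:
            for idx, ch in enumerate(charset):
                combined = score + step[idx] + lam * lm_logprob(text, ch)
                scored.append((text + ch, combined))
        beams = sorted(scored, key=lambda x: -x[1])[:beam]    # keep the top-k paths
    return [text for text, _ in beams]
```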
The segmentation model's performance is evaluated using mean intersection over union. The dataset includes more than 90,000 medicine names, and synthetic medicine names are generated using a character-based medicine LM. The authors use a medicine name vocabulary and a dataset of 9645 handwritten prescriptions written by 117 doctors, annotating 500 images for evaluation; the prescriptions contain different sections such as vitals, observations, and lab/scan. An n-gram model is used for in-vocabulary prediction, together with an edit distance search between each medicine line text and the medicine vocabulary (sketched below). OCR predictions are decoded at the character level using character LMs: the decoder produces the top-k paths, all vocabulary names with an exact match to one of the top-k predictions are collected, and a majority vote over the matched names becomes the prediction for each line. Experimental results and rigorous ablations are used to understand the efficacy of the framework. The approach uses a probabilistic program to create an exhaustive set of possible medicine name lines and trains a character-based n-gram language model (LM) on those lines; domain-specific knowledge is injected into the OCR system through this LM, and a segmentation model is trained to identify medicine lines from visual features. The OCR decoder incorporates the LM to correct errors, and the segmentation model uses bounding boxes as supervision to train label masks for identifying medicine lines. The two labeling functions, an OCR labeling function and a segmentation labeling function, improve over iterations: the training set is pseudo-labeled via OCR, a segmentation network is trained on this relatively small training set, the model then predicts medicine lines on the rest of the dataset, and a second labeling function is used to alleviate missing bounding boxes. A high threshold is set to reduce noise in the pseudo-labels, but missing bounding boxes can still introduce a significant amount of noise in a sizable number of images, which makes learning the segmentation network harder. Because of illegible handwriting, matched bounding boxes may not always align with the ground-truth medicine names. The authors construct a training dataset from weak labels and an OCR labeling function, use an assignment problem algorithm to assign bounding boxes to medicine names, and highlight the importance of optimizing coverage and reducing errors in the labeling functions. They also use a character n-gram language model for decoding and note that the method is agnostic to the OCR encoder used. The task throughout is extracting medicine names from non-form type handwritten images.
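The edit distance search can be sketched with a plain Levenshtein distance as below; the `max_dist` threshold reflects the observation that allowing matches beyond exact ones trades precision for recall. This is an illustrative helper, not code from the paper.

```python
# Edit-distance search of a decoded line text against the medicine vocabulary.

def levenshtein(a, b):
    """Standard dynamic-programming Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def nearest_medicine(line_text, vocab, max_dist=0):
    """Return the closest vocabulary name within `max_dist` edits, else None."""
    dist, best = min((levenshtein(line_text.lower(), m.lower()), m) for m in vocab)
    return best if dist <= max_dist else None
```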
The model used is pre-trained and consists of an encoder and a fully connected symbol classification head; the encoder combines a 12-layer transformer encoder with 7 layers of inverted bottleneck convolutions (a hedged sketch of this layout appears at the end of this summary). The training data is weakly labeled, and the output of the framework should be a list of medicine names that appear in the image, where each m_j ∈ V, the vocabulary of medicines. A probabilistic program generates synthetic medicine lines following the patterns doctors use when writing prescriptions, and using the resulting language model in the OCR decoder improves performance significantly. Domain-specific language models have been shown to improve performance on OCR tasks, and weak labels can be converted to strong labels via labeling functions; the goal is to learn a segmentation model that detects the entities present in training images, which reduces the manual labor needed to acquire strong labels. Traditional methods would need strong labels, but recent work has focused on learning from weak labels alone. Overall, the paper proposes a weakly supervised approach to extracting information, specifically medicine names, from handwritten medical prescriptions. The approach develops a domain-specific medicine language model (LM) from synthetic medicine name lines and a weakly supervised segmentation method to detect specific text regions. The weakly supervised medicine line detector achieves 78% pixel mIoU with just weak labels, which are much easier to obtain than strong bounding-polygon annotations. The recognition model injects the medicine LM into the medicine section of the prescription to enhance recognition of medicine names, and the approach significantly enhances the performance of existing OCR systems by selectively infusing domain knowledge using only weak supervision. Handwritten medical prescriptions are often inscrutable and do not follow any specific structure or format; existing OCR models do not perform well on such documents and require meticulously labeled data for learning. The authors propose a domain-specific medicine language model that performs 2.5x better than state-of-the-art methods in extracting medicine names from generated data. The model is learned using only synthetically generated weak labels and identifies the regions of interest in the image so that location-specific domain information can be injected there. The authors note that adapting existing models to domain-specific training data is expensive and that OCR errors are a problem with unstructured handwritten documents. The paper discusses weakly supervised information extraction from inscrutable handwritten document images; the authors are Sujoy Paul, Gagan Madan, Akankshya Mishra, Narayan Hegde, and Pradeep Kumar.
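Finally, a hedged PyTorch sketch of the recognizer layout described at the start of this summary (inverted-bottleneck conv layers, a 12-layer transformer encoder, and a fully connected symbol head). All widths, strides, head counts, and the expansion factor are assumptions; only the coarse structure follows the summary.

```python
# A hedged sketch of the OCR line recognizer shape: conv stem of inverted-bottleneck
# blocks, a 12-layer transformer encoder, and a fully connected symbol head.
import torch
import torch.nn as nn

class InvertedBottleneck(nn.Module):
    def __init__(self, ch, expand=4):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(ch, ch * expand, 1), nn.GELU(),
            nn.Conv2d(ch * expand, ch * expand, 3, padding=1, groups=ch * expand), nn.GELU(),
            nn.Conv2d(ch * expand, ch, 1),
        )

    def forward(self, x):
        return x + self.block(x)           # residual inverted-bottleneck block

class LineRecognizer(nn.Module):
    def __init__(self, n_symbols, d_model=256, n_layers=12, conv_layers=7):
        super().__init__()
        self.stem = nn.Conv2d(1, d_model, 4, stride=4)                 # patchify the line image
        self.convs = nn.Sequential(*[InvertedBottleneck(d_model) for _ in range(conv_layers)])
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, n_symbols)                      # per-step symbol logits

    def forward(self, x):                   # x: (batch, 1, height, width) line crops
        f = self.convs(self.stem(x))        # (batch, d_model, h', w')
        f = f.flatten(2).transpose(1, 2)    # one token per spatial position
        return self.head(self.encoder(f))   # (batch, tokens, n_symbols)
```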