Summary: Language Model-Based Document Information Extraction and Localization (arxiv.org)
9,093 words - PDF document
One Line
LMDX uses LLMs to extract and localize entities from VRDs, addressing the challenges of semi-structured documents and achieving state-of-the-art accuracy across diverse entity types.
Key Points
- LMDX is a methodology that uses Large Language Models (LLMs) to extract key entities from visually rich documents (VRDs) with high precision and localization.
- LMDX addresses the challenge of understanding complex layouts and tabular arrangements in semi-structured document information extraction.
- The LMDX pipeline consists of five stages: OCR, chunking, prompt generation, LLM inference, and decoding.
- LMDX PaLM 2-S, which uses the PaLM 2-S LLM, achieves state-of-the-art results on the VRDU and CORD benchmarks.
- Ablation studies show the importance of base entity extraction training, coordinate tokens, and sampling strategy in LMDX methodology.
- Error analysis reveals common error patterns in LMDX caused by OCR lines grouping multiple segments incorrectly.
- LMDX combines the power of language models with traditional document analysis techniques for information extraction and localization.
- LMDX outperforms baselines in terms of accuracy and performance, demonstrating the potential of language model-based approaches in document analysis.
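The five pipeline stages listed above can be sketched end to end in Python. This is only an illustration: every helper below (ocr, chunk, build_prompt, sample_llm, decode) is a hypothetical stand-in stub, not the paper's implementation, and the LLM call is replaced by a fixed completion.

```python
# Hypothetical end-to-end driver for the five LMDX stages; all helpers
# are stand-in stubs, not the paper's actual components.

def ocr(image):
    # Stage 1 (stub): pretend OCR returned (text, line_id) segments.
    return [("Invoice #123", 0), ("Total: $40.00", 1)]

def chunk(segments, max_lines=1):
    # Stage 2: split segments into chunks small enough for the LLM.
    return [segments[i:i + max_lines] for i in range(0, len(segments), max_lines)]

def build_prompt(chunk_segments, schema):
    # Stage 3: document representation + task description + schema.
    doc = "\n".join(f"{text} <{line_id}>" for text, line_id in chunk_segments)
    return f"{doc}\nExtract entities for schema: {schema}"

def sample_llm(prompt, num_samples=2):
    # Stage 4 (stub): a real system would sample the LLM here.
    return ['{"total": "$40.00"}'] * num_samples

def decode(samples):
    # Stage 5: keep the completion that appears most often across samples.
    flat = [s for chunk_samples in samples for s in chunk_samples]
    return max(set(flat), key=flat.count)

def run_lmdx(image, schema='{"total": ""}'):
    samples = [sample_llm(build_prompt(c, schema)) for c in chunk(ocr(image))]
    return decode(samples)

print(run_lmdx(None))  # {"total": "$40.00"}
```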
Summaries
25 word summary
LMDX uses LLMs to extract entities from VRDs, overcoming challenges with semi-structured documents. It achieves state-of-the-art results and can accurately extract various types of entities.
85 word summary
LMDX utilizes Large Language Models (LLMs) to accurately extract key entities from visually rich documents (VRDs). It overcomes the challenge of extracting information from semi-structured documents by incorporating layout encoding and a grounding mechanism within LLMs. The five-stage LMDX pipeline achieves state-of-the-art results on benchmarks. Error analysis identifies OCR line grouping as a common error pattern, with incorporating the image modality as a potential solution. LMDX accurately extracts singular, repeated, and hierarchical entities from visually rich documents.
127 word summary
LMDX is a methodology that uses Large Language Models (LLMs) to extract key entities from visually rich documents (VRDs) with high precision and localization. It addresses the challenge of semi-structured document information extraction by introducing layout encoding within LLMs and a grounding mechanism. The LMDX pipeline consists of five stages: OCR, chunking, prompt generation, LLM inference, and decoding. LMDX PaLM 2-S achieves state-of-the-art results on the VRDU and CORD benchmarks, outperforming existing baselines. Error analysis reveals that a common error pattern in LMDX is caused by OCR lines grouping multiple semantically different segments. Incorporating the image modality is identified as a potential solution. LMDX is a highly effective methodology for extracting information from visually rich documents using LLMs, achieving accurate extraction of singular, repeated, and hierarchical entities.
404 word summary
LMDX (Language Model-Based Document Information Extraction and Localization) is a methodology that uses Large Language Models (LLMs) to extract key entities from visually rich documents (VRDs) with high precision and localization. It addresses the challenge of semi-structured document information extraction by introducing layout encoding within LLMs and a grounding mechanism. LMDX can extract singular, repeated, and hierarchical entities with or without training data while localizing them within the document.
The LMDX pipeline consists of five stages: Optical Character Recognition (OCR), chunking, prompt generation, LLM inference, and decoding. OCR obtains words and line segments from the document image. Chunking divides the document into smaller chunks for LLM processing. Prompt generation creates LLM prompts for each chunk, including document representation, task description, and schema representation. LLM inference involves running the LLM with the prompts and sampling multiple completions. Decoding parses the LLM completions into structured entities and their locations.
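The prompt-generation stage above can be illustrated with a small sketch. The exact prompt layout, coordinate-tag format, and bucket count here are assumptions for illustration, not the paper's precise scheme; `quantize`, `document_representation`, and `build_prompt` are hypothetical names.

```python
# Plausible sketch of prompt generation with quantized coordinate tokens;
# tag format and bucket count are assumptions, not the paper's exact scheme.

def quantize(value, num_buckets=100):
    """Map a normalized coordinate in [0, 1] to an integer bucket."""
    return min(round(value * num_buckets), num_buckets - 1)

def document_representation(segments):
    """Render each OCR segment as 'text xx|yy' using quantized coordinates."""
    return "\n".join(
        f"{text} {quantize(x):02d}|{quantize(y):02d}" for text, (x, y) in segments
    )

def build_prompt(segments, schema):
    """Assemble document representation, task description, and schema."""
    return (
        "<document>\n" + document_representation(segments) + "\n</document>\n"
        "<task>Extract the entities matching the schema below.</task>\n"
        f"<schema>{schema}</schema>"
    )

segments = [("Invoice #123", (0.12, 0.05)), ("Total: $40.00", (0.80, 0.90))]
print(build_prompt(segments, '{"total": ""}'))
```

Encoding coordinates as plain text tokens is what lets a text-only LLM reason about spatial layout without any architectural change.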
LMDX PaLM 2-S, which uses the PaLM 2-S LLM, achieves state-of-the-art results on the Visually Rich Document Understanding (VRDU) and Consolidated Receipt Dataset (CORD) benchmarks. It outperforms existing baselines and is data-efficient: at zero-shot it reaches extraction quality similar to baselines trained on 10-100 documents. Ablation studies show that the base entity extraction training stage is crucial for extraction quality, especially in few-shot and zero-shot settings.
Error analysis reveals that a common error pattern in LMDX is caused by OCR lines grouping multiple semantically different segments, leading to incorrect associations in predictions. Incorporating the image modality is identified as a potential solution.
The paper compares LMDX against two baselines, LayoutLMv3 LARGE and Donut, on three benchmark datasets; LMDX achieves higher accuracy in both entity extraction and localization, a result the authors attribute to language models' ability to learn complex patterns and relationships in documents. Ablation studies show that both document chunking and entity value parsing contribute significantly to performance.
In conclusion, LMDX is a highly effective methodology for extracting information from visually rich documents using LLMs. It achieves state-of-the-art results, demonstrates data efficiency, and enables accurate extraction of singular, repeated, and hierarchical entities, with the potential to improve document analysis and understanding tasks.
528 word summary
LMDX (Language Model-Based Document Information Extraction and Localization) is a methodology that utilizes Large Language Models (LLMs) to extract key entities from visually rich documents (VRDs) with high precision and localization. LMDX addresses the challenge of semi-structured document information extraction by introducing layout encoding within LLMs and a grounding mechanism. It can extract singular, repeated, and hierarchical entities with or without training data while localizing them within the document.
The LMDX pipeline consists of five stages: Optical Character Recognition (OCR), chunking, prompt generation, LLM inference, and decoding. OCR obtains words and line segments from the document image. Chunking divides the document into smaller chunks for LLM processing. Prompt generation creates LLM prompts for each chunk, including document representation, task description, and schema representation. LLM inference involves running the LLM with the prompts and sampling multiple completions. Decoding parses the LLM completions into structured entities and their locations.
LMDX PaLM 2-S, which uses the PaLM 2-S LLM, achieves state-of-the-art results on the Visually Rich Document Understanding (VRDU) and Consolidated Receipt Dataset (CORD) benchmarks. It outperforms existing baselines and is data-efficient: at zero-shot it reaches extraction quality similar to baselines trained on 10-100 documents. LMDX PaLM 2-S also generalizes well to unseen templates, particularly for hierarchical entities like line items.
Ablation studies show that the base entity extraction training stage is crucial for extraction quality, especially in few-shot scenarios and zero-shot extraction. Removing coordinate tokens leads to a substantial decrease in extraction quality, while removing the sampling strategy slightly reduces extraction quality but removes error correction capability.
Error analysis reveals that a common error pattern in LMDX is caused by OCR lines grouping multiple semantically different segments, leading to incorrect associations in predictions. Incorporating the image modality is identified as a potential solution.
In conclusion, LMDX is a highly effective methodology for extracting information from visually rich documents using LLMs. It achieves state-of-the-art results, demonstrates data efficiency, and enables accurate extraction of singular, repeated, and hierarchical entities. Future work includes incorporating the image modality and exploring open-source LLMs for further improvements.
The LMDX framework consists of three main components: document chunking, entity value parsing, and response decoding. Document chunking divides the document into smaller segments based on token limits. Entity value parsing extracts values for specific entities using regular expressions and segment mappings. Response decoding merges the entities extracted across multiple language model samples using majority voting.
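The majority-voting step described above can be sketched as follows. The per-entity voting granularity and the `majority_vote` helper are assumptions for illustration; each sample is assumed to have already been parsed into a dict mapping entity type to value.

```python
# Minimal sketch of decoding by majority voting across LLM samples.
# Voting per entity type is an assumption about the granularity used.

from collections import Counter

def majority_vote(samples):
    """Pick, for each entity type, the value predicted by most samples."""
    decoded = {}
    for entity in {k for s in samples for k in s}:
        values = [s[entity] for s in samples if entity in s]
        decoded[entity] = Counter(values).most_common(1)[0][0]
    return decoded

samples = [
    {"total": "$40.00", "date": "2023-09-19"},
    {"total": "$40.00", "date": "2023-09-18"},
    {"total": "$46.00", "date": "2023-09-19"},
]
print(majority_vote(samples))
```

Voting across several sampled completions is what gives the pipeline its error-correction capability: a single hallucinated value is outvoted by the consistent samples.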
LMDX outperforms two baselines, LayoutLMv3 LARGE and Donut, on three benchmark datasets. It achieves higher accuracy in extracting entities and localizing information. The success of LMDX is attributed to the use of language models, which can learn complex patterns and relationships in documents.
Ablation studies show that both document chunking and entity value parsing contribute significantly to the performance of LMDX, and that coordinate tokens give the model explicit spatial information it exploits to interpret the document layout.
In conclusion, LMDX is an effective framework for document information extraction and localization. It outperforms existing baselines and has the potential to improve document analysis and understanding tasks.
792 word summary
LMDX (Language Model-Based Document Information Extraction and Localization) is a methodology that leverages Large Language Models (LLMs) for extracting key entities from visually rich documents (VRDs) with high precision and localization. LLMs have shown significant progress in Natural Language Processing (NLP) but have not been successfully applied to semi-structured document information extraction, which requires understanding complex layouts and tabular arrangements. LMDX addresses this challenge by introducing layout encoding within LLMs and a grounding mechanism to ensure accurate extraction. It can extract singular, repeated, and hierarchical entities with or without training data while localizing them within the document.
The LMDX pipeline consists of five stages: Optical Character Recognition (OCR), chunking, prompt generation, LLM inference, and decoding. OCR is used to obtain words and line segments from the document image. Chunking divides the document into smaller chunks for LLM processing. Prompt generation creates LLM prompts for each chunk, including document representation, task description, and schema representation. LLM inference involves running the LLM with the prompts and sampling multiple completions. Decoding parses the LLM completions into structured entities and their locations.
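The chunking stage can be sketched as below, following the prune-from-the-end-of-page strategy the summary describes later. Whitespace splitting stands in for the real tokenizer, and `chunk_page` with its token budget is a hypothetical illustration.

```python
# Rough sketch of token-limit chunking. Whitespace tokenization is a
# stand-in for the real tokenizer; names and budgets are illustrative.

def num_tokens(segments):
    return sum(len(text.split()) for text, _ in segments)

def chunk_page(segments, max_tokens):
    """Split one page's segments into chunks that each fit max_tokens,
    pruning segments from the end of the page into the next chunk."""
    chunks = []
    remaining = list(segments)
    while remaining:
        current = list(remaining)
        # Prune from the end until the chunk fits the token budget.
        while len(current) > 1 and num_tokens(current) > max_tokens:
            current.pop()
        chunks.append(current)
        remaining = remaining[len(current):]
    return chunks

page = [("Invoice #123", 0), ("Date: 2023", 1), ("Total: $40.00 USD", 2)]
print([len(c) for c in chunk_page(page, max_tokens=4)])  # [2, 1]
```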
LMDX PaLM 2-S, which uses the PaLM 2-S LLM, is evaluated on the Visually Rich Document Understanding (VRDU) and Consolidated Receipt Dataset (CORD) benchmarks. It achieves state-of-the-art results on both benchmarks, outperforming existing baselines. LMDX PaLM 2-S demonstrates data efficiency by achieving similar extraction quality at zero-shot as baselines with 10-100 training documents. It also generalizes well to unseen templates compared to baselines. The performance on hierarchical entities, such as line items, is significantly better than baselines.
Ablation studies are conducted to assess the importance of different facets of the LMDX methodology. Removing the base entity extraction training stage results in a significant drop in extraction quality, especially in few-shot scenarios and zero-shot extraction. Removing coordinate tokens, which communicate spatial information to the LLM, leads to a substantial decrease in extraction quality. Discarding the sampling strategy slightly reduces extraction quality but also removes the error correction capability.
Error analysis reveals that a common error pattern in LMDX is caused by OCR lines grouping multiple semantically different segments, leading to incorrect associations in predictions. Incorporating the image modality is identified as a potential solution to address this limitation.
In conclusion, LMDX is a highly effective methodology for extracting information from visually rich documents using LLMs. It achieves state-of-the-art results, demonstrates data efficiency, and enables accurate extraction of singular, repeated, and hierarchical entities. Future work includes incorporating the image modality and exploring open-source LLMs for further improvements.
This document is a summary of the paper "LMDX: Language Model-based Document Information Extraction and Localization". The paper discusses the use of language models for extracting and localizing information from documents. The authors propose a method called Language Model-based Document eXtraction (LMDX) that combines the power of language models with traditional document analysis techniques.
The LMDX framework consists of three main components: document chunking, entity value parsing, and response decoding. Document chunking involves dividing the document into smaller segments based on token limits; the authors present a chunking algorithm that prunes segments from the end of the page until the chunk fits within the token limit. Entity value parsing extracts values for specific entities from the document, using regular expressions and segment mappings. Response decoding merges the entities extracted across multiple language model samples using majority voting.
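The entity value parsing step can be sketched with a regular expression and a segment mapping. The `entity: value xx|yy` line format, the `LINE_RE` pattern, and `parse_completion` are simplifying assumptions, not the paper's exact completion grammar.

```python
# Hedged sketch of entity value parsing: each completion line is assumed
# to look like 'entity: value xx|yy'; the xx|yy tag is mapped back to an
# OCR segment to localize the value. The format is a simplification.

import re

LINE_RE = re.compile(r"^(?P<entity>\w+): (?P<value>.+?) (?P<x>\d{2})\|(?P<y>\d{2})$")

def parse_completion(completion, segment_index):
    """Parse grounded completion lines and localize each extracted value."""
    entities = []
    for line in completion.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # discard lines that do not follow the grounded format
        entities.append({
            "type": m["entity"],
            "value": m["value"],
            "segment": segment_index.get((m["x"], m["y"])),  # None if hallucinated
        })
    return entities

segments = {("80", "90"): "Total: $40.00"}
out = parse_completion("total: $40.00 80|90", segments)
print(out)
```

Rejecting lines whose tags do not map back to a real OCR segment is one way the grounding mechanism can suppress hallucinated values.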
The authors evaluate the LMDX framework on three benchmark datasets: VRDU Ad-Buy Form, VRDU Registration Form, and CORD. They compare LMDX against two baselines: LayoutLMv3 LARGE, a pretrained model that combines text, layout, and image information for document understanding, and Donut, an OCR-free image-to-text model for document understanding.
The results show that LMDX outperforms both baselines on all three datasets. LMDX achieves higher accuracy in extracting entities and localizing information compared to LayoutLMv3 LARGE and Donut. The authors attribute the success of LMDX to the use of language models, which have the ability to learn complex patterns and relationships between entities in documents.
The authors also conduct ablation studies to analyze the impact of different components of LMDX. They find that both document chunking and entity value parsing contribute significantly to the overall performance of LMDX. Additionally, they observe that replacing coordinate grounding tags with simple line index tags degrades extraction quality, suggesting that the model benefits from explicit spatial information about the document layout.
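The two grounding-tag schemes compared in the ablation can be illustrated side by side. Both tag formats below are simplified assumptions about how segments might be rendered, not the paper's exact syntax.

```python
# Illustration of the two grounding-tag schemes: coordinate tags encode
# quantized (x, y) positions, line index tags only encode reading order.
# Both formats are simplified assumptions.

def with_coordinate_tags(segments):
    return [f"{text} {x:02d}|{y:02d}" for text, x, y in segments]

def with_line_index_tags(segments):
    return [f"{text} <{i}>" for i, (text, _, _) in enumerate(segments)]

segments = [("Invoice #123", 12, 5), ("Total: $40.00", 80, 90)]
print(with_coordinate_tags(segments))  # ['Invoice #123 12|05', 'Total: $40.00 80|90']
print(with_line_index_tags(segments))  # ['Invoice #123 <0>', 'Total: $40.00 <1>']
```

The line index variant tells the model only where a segment falls in reading order, whereas coordinate tags preserve the 2D position that tables and multi-column layouts depend on.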
In conclusion, the paper presents LMDX as an effective framework for document information extraction and localization. The results demonstrate the superiority of LMDX over existing baselines in terms of accuracy and performance. The authors suggest that language model-based approaches have great potential for improving document analysis and understanding tasks.