Summary of Understanding HTML Models with Large Language

Summary Understanding HTML Models with Large Language arxiv.org

10,355 words - PDF document - View PDF document

One Line

The study explores the use of Large Language Models for various HTML understanding tasks and integrating text generation into web interfaces.

Slides

Slide Presentation (12 slides)

Copy slides outline Copy embed code Download as Word

Exploring Large Language Models for HTML Understanding

Source: arxiv.org - PDF - 10,355 words - view

Introduction

• Large language models (LLMs) have not been fully explored for HTML understanding tasks.

• This study presents fine-tuned LLMs for three HTML understanding tasks: semantic classification, description generation, and autonomous web navigation.

• Pretraining helps models learn general HTML structure and ensures correct formatting.

Semantic Classification

• Classifying elements into role categories.

• Aggregating information from multiple sources on the page.

• LLMs outperform previous SL models.

Description Generation

• Generating textual descriptions of elements.

• Formulated as an extractive problem.

• Snippet extraction and location are key.

Autonomous Web Navigation

• Training LLMs for behavior cloning.

• Novel approach in the research literature.

• Designing interfaces for integrating text generation into web interfaces.

Transfer Learning and Accuracy

• Fine-tuning pretrained LLMs improves transfer learning and accuracy.

• WebC-PaLM-62B performs the best.

• WebC-T5-large is competitive with larger models.

T5-based Encoder-Decoder Models

• T5-based models perform better across all tasks.

• Better performance compared to other models.

• Visuals: Comparison chart of model performance.

Mobile-BERT for Resource-Limited Devices

• Mobile-BERT is a compact BERT model designed for resource-limited devices.

• Enables HTML understanding on mobile devices.

• Visuals: Image of Mobile-BERT model.

Importance of HTML Snippets

• Snippet generation procedure and results discussed.

• Lack of information in snippets leads to errors in model predictions.

• 32% of errors in T5-3B model due to lack of information in HTML snippets.

Sensitivity to Structural Information

• Ablation study examines model performance with critical structure components removed.

• Preserving structural information is crucial for accurate predictions.

• Visuals: Examples of HTML with and without structural information.

Leveraging LLMs for HTML Understanding

• Fine-tuning LLMs enhances transfer learning and accuracy in HTML understanding tasks.

• Pretraining helps models learn general HTML structure and ensures correct formatting.

• T5-based encoder-decoder models outperform other models.

• Mobile-BERT enables HTML understanding on resource-limited devices.

• Reminder: Large language models have great potential for advancing HTML understanding tasks.

[Note: Visuals can be added to slides where relevant, such as charts, images, or screenshots of models or interfaces.]

Key Points

Large language models (LLMs) have not been fully explored for HTML understanding tasks.
This study presents fine-tuned LLMs for three HTML understanding tasks: semantic classification, description generation, and autonomous web navigation.
Fine-tuning pretrained LLMs improves transfer learning and accuracy in HTML understanding tasks.
Pretraining helps models learn general HTML structure and ensures correct formatting.
T5-based encoder-decoder models perform better across all tasks compared to other models.
Mobile-BERT is a compact BERT model designed for resource-limited devices.
The snippet generation procedure and its results are discussed, highlighting the importance of information in HTML snippets.
An ablation study examines the sensitivity of model performance to preserving structural information in HTML.

Summaries

37 word summary

This study investigates the use of Large Language Models (LLMs) for HTML understanding tasks, including semantic classification, description generation, and autonomous web navigation. It explores training LLMs for behavior cloning and integrating text generation into web interfaces.

40 word summary

This study explores the use of Large Language Models (LLMs) for HTML understanding tasks, specifically semantic classification, description generation, and autonomous web navigation. The focus is on training LLMs for behavior cloning and integrating text generation into web interfaces. The

509 word summary

Large language models (LLMs) have not been fully explored for HTML understanding tasks, such as parsing and automating web-based tasks. This study presents fine-tuned LLMs for three HTML understanding tasks: semantic classification, description generation, and autonomous

This study focuses on embedding and training a Large Language Model (LLM) for autonomous web navigation, which is a novel approach in the research literature. The implementation of LLM training for behavior cloning and the design of interfaces for integrating text generation into a

Semantic Classification involves classifying elements into role categories. To solve this, the system can aggregate information from multiple sources on the page. Description Generation is formulated as an extractive problem, where the goal is to locate and generate the textual description of an element

We mark the salient element using a special attribute called "target" and perform snippet extraction for semantic classification and description generation datasets. Full HTML pages are kept in MiniWoB. The models are provided with unparsed plaintext HTML in the form of token

Fine-tuning pretrained LLMs improves transfer learning and accuracy in HTML understanding tasks. WebC-PaLM-62B performs the best, but WebC-T5-large is competitive with larger models. LLMs outperform previous SL models and

LLMs without pretraining perform well on websites that require simple text matching but struggle with tasks like click checkboxes. Pretraining helps models learn general HTML structure and ensures correct formatting. T5-based encoder-decoder models perform better across all tasks compared to other

This summary provides a list of references to various academic papers related to program synthesis, foundation models, language models, mobile app navigation, scaling language modeling, web form filling automation, environment generation for reinforcement learning, semantic understanding of user interfaces, learning to control

This summary presents a list of references to various research papers related to neural language models, self-supervised learning of language representations, pre-training for form understanding, visually-rich document understanding, reinforcement learning on web interfaces, pretrained transformers as universal computation engines, natural

Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou proposed Mobile-BERT, a compact BERT model for resource-limited devices. Romal Thoppilan, Daniel

This excerpt provides an overview of the snippet generation procedure and presents additional results and analysis. It mentions that 32% of errors in the T5-3B model are due to lack of information in HTML snippets, while 30% are related to

An ablation study was conducted to examine the sensitivity of model performance to preserving structural information in HTML. The study involved evaluating a model's performance on HTML input with critical structure components removed, specifically by removing closing tags while keeping the order of elements and their

Classifying text as a username or password without additional context is not possible. The MarkupLM model is evaluated on HTML understanding tasks and achieves lower accuracy compared to WebC-BERT. The success rates of various models in MiniWoB tasks are compared in

The document discusses HTML models with large language understanding. The text excerpt consists mostly of numerical values, which appear to be resource requirements and running times of different language models. The models mentioned include PaLM, T5, LaMDA, and others.

Raw indexed text (65,940 chars / 10,355 words / 2,512 lines)

U NDERSTANDING HTML

M ODELS

WITH

L ARGE L ANGUAGE

Izzeddin Gur, Ofir Nachum, Yingjie Miao, Mustafa Safdari, Austin Huang

Aakanksha Chowdhery, Sharan Narang, Noah Fiedel, Aleksandra Faust

Google Research

{izzeddin,ofirnachum,yingjiemiao,msafdari,austinvhuang

chowdhery,sharannarang,nfiedel,sandrafaust}@google.com

A BSTRACT

Large language models (LLMs) have shown exceptional performance on a va-

riety of natural language tasks. Yet, their capabilities for HTML understanding

– i.e., parsing the raw HTML of a webpage, with applications to automation of

web-based tasks, crawling, and browser-assisted retrieval – have not been fully

explored. We contribute HTML understanding models (fine-tuned LLMs) and an

in-depth analysis of their capabilities under three tasks: (i) Semantic Classifica-

tion of HTML elements, (ii) Description Generation for HTML inputs, and (iii)

Autonomous Web Navigation of HTML pages. While previous work has devel-

oped dedicated architectures and training procedures for HTML understanding,

we show that LLMs pretrained on standard natural language corpora transfer re-

markably well to HTML understanding tasks. For instance, fine-tuned LLMs are

12% more accurate at semantic classification compared to models trained exclu-

sively on the task dataset. Moreover, when fine-tuned on data from the MiniWoB

benchmark, LLMs successfully complete 50% more tasks using 192x less data

compared to the previous best supervised model. Out of the LLMs we evalu-

ate, we show evidence that T5-based models are ideal due to their bidirectional

encoder-decoder architecture. To promote further research on LLMs for HTML

understanding, we create and open-source a large-scale HTML dataset distilled

and auto-labeled from CommonCrawl. 1

I NTRODUCTION

Web crawling (Olston et al., 2010), form-filling (Diaz et al., 2013; Gur et al., 2021), or information

retrieving web agents (Nogueira & Cho, 2016) are important for both automating and assisting

users in web-based tasks. These and similar applications rely on models that can search for specific

content or controls on a web page as well as navigate a website autonomously. Since a web page in

its raw form is represented as an HTML-based text sequence, the success of models for web-based

tasks relies on their ability to understand HTML semantics, structure, and embedded interactions.

The predominant approach to web automation and HTML understanding is to train specialized mod-

els, i.e., gathering application-specific datasets and designing neural network (NN) architectures to

leverage inductive biases of the HTML’s structure; see, e.g., Liu et al. (2018); Toyama et al. (2021);

Gur et al. (2021); Humphreys et al. (2022). However, both dataset collection and neural architecture

design are expensive, time-consuming, and require highly-specialized, domain-specific knowledge.

Meanwhile, in the natural language processing (NLP) literature, large language models (LLMs) have

emerged as a solution to the difficulties of dataset collection and specialized NN design (Kaplan

et al., 2020; Bommasani et al., 2021). A popular paradigm in NLP is to take an off-the-shelf LLM

– pretrained on a large text corpus via an unsupervised and task-agnostic learning objective – and

either fine-tune or prompt the LLM on a small task-specific dataset. This paradigm has shown

exceptional performance on a variety of NLP tasks (Xue et al., 2020; Brown et al., 2020; Austin

et al., 2021). Whether LLMs can be applied to HTML understanding – especially given the much

larger context and sequence lengths – remains an under-explored question.

See visualizations of the results at https://sites.google.com/view/llm4html/home.

Enter Email Address

Enter Password:

Please enter your password.

Email Address

class="form-label" for=”pass”>Enter Password:

type="email" id="uName” target>

class="hidden">Please enter your password.

(a)

(b)

Figure 1: a) HTML example page with a highlighted salient element, an element of interest (dashed box).

All canonical tasks evaluate a distinct interaction with this element, either by classifying it as one of a set of

categories, generating a text description of its purpose, or applying an action as part of a sequential navigation

of a multi-page website. b) LLM architectures overview. Dashed boxes denote sub-modules that are specific to

either encoder-only or encoder-decoder models. For encoder-only models, we add an extra classification layer.

Decoder-only models (not in the diagram) are similar to encoder-decoder models, the main difference is that

the HTML snippet is fed to the decoder and processed from left-to-right.

In this paper, we investigate whether LLMs can be applied to HTML understanding to produce

better-performing, more sample-efficient HTML understanding models and without the need for

custom NN architecture design. To that end, we present a suite of three benchmarking tasks for

HTML understanding that capture the essence of these applications and require understanding both

structure and content. First, we devise Semantic Classification as a task that requires a model to

classify a given HTML element into one of a set of categories, such as address, email, password

etc., with application to automated form-filling. Second, we present Description Generation, a

label-extraction task where a model is given an HTML snippet and is asked to produce a natural

language description. For instance for an email field, the description might be “Please enter your

email address.” Note that in the majority of web pages, this connection between input elements and

description content is only implicit in the raw HTML code and inferring such links is a prerequisite

for higher-level navigation objectives. The third task is Autonomous Web Navigation (Shi et al.,

2017). A model is presented with an HTML page paired with a natural language command and

must apply appropriate actions on a sequence of HTML pages to satisfy the command. See Figure

1a for a simplified example of these tasks.

With these benchmark tasks in hand, we evaluate the transfer capabilities of a variety of pretrained

LLMs (Table 1), varying in architecture (encoder-only, encoder-decoder, or decoder-only), model

size (from 24.6M to 62B parameters), and training data corpora (both including and excluding pre-

training NLP and HTML corpus). While prior work universally pre-parses the HTML as input to the

model (Gur et al., 2021; Liu et al., 2018; Nakano et al., 2021), ours – to the best of our knowledge – is

the first work that uses raw, unprocessed HTML. Our results show that LLMs demonstrate a remark-

able level of HTML understanding across all tasks, with up to 192× more sample-efficiency than

models trained from scratch, and achieving a new SoTA for supervised learning on the MiniWoB

benchmark suite (Shi et al., 2017). The encoder-decoder architectures with bi-directional attention

show the best performance across the board even when their pretraining does not include HTML. In

addition, we show that the performance scales sub-linearly with the model size.

The broader objective of this research is to advance the integration of LLMs with autonomous web

agents. It has only been in the last year that researchers have begun to utilize LLMs outside of

NLP and integrate them as core capabilities in autonomy (Lu et al. (2021); Ahn et al. (2022)). In

this context, LLMs are reasoning engines for sequential decision making agents interacting with

environments.

The present work is the first in the research literature to embed an LLM and train it as an agent for

autonomous web navigation. This requires new implementations to adapt LLM training for behavior

2cloning in addition to designing interfaces for integrating text generation into a perception-compute-

action cycle operating in a stateful web environment. Our implementation allows us to answer new

questions regarding trade-offs among various model characteristics.

We believe these contributions expand the scope of language models and connect their unique capa-

bilities with autonomous agents for the web. We provide a new perspective on machine learning for

HTML understanding and web automation, showing that pretrained LLMs can achieve significant

performance on such tasks, reducing the need for specialized architectures and training protocols.

To encourage further research in this direction, we open sourced 2 model weights for agents used in

the WoB environment and our dataset for description generation.

R ELATED W ORK

HTML Understanding Autonomous web navigation has been a popular application for neural net-

work models, and a variety of works propose simulated websites for training web-based agents, with

application to task fulfillment (Yao et al., 2022; Gur et al., 2021; Burns et al., 2022; Mazumder &

Riva, 2020; Shi et al., 2017; Liu et al., 2018) as well as information retrieval or question-answering

(Adolphs et al., 2021; Nogueira & Cho, 2016). Simulated websites provide an easy way to evaluate

models online, and for this reason we use the existing MiniWoB benchmark (Shi et al., 2017) for our

web navigation setting. However, it is still important to have a mechanism for evaluating models on

a wide variety of real-world websites. This was the key motivation for generating our own dataset

for the description generation task, which is distilled and auto-labeled from CommonCrawl and is a

key contribution of our paper.

Alongside these benchmarks, many works have developed models for web navigation and related

subtasks (Pasupat et al., 2018; Bommasani et al., 2021; He et al., 2021; Gur et al., 2021; Humphreys

et al., 2022; Liu et al., 2018; Jia et al., 2019). These works often rely on specialized neural network

architectures that leverage inductive biases of HTML structure, or on preprocessing of HTML to

make it easier to input to a model (Li et al. (2021a;b)). In contrast, our work takes a minimalist

approach, providing HTML in text form with minimal processing and using widely-adopted trans-

former networks.

LLMs and HTML Works that explore the intersection of LLMs and HTML generally fall into two

categories. The first category uses LLMs to assist web navigation (Nakano et al., 2021; Yao et al.,

2022), and typically relies on a custom preprocessing to map the context and structure of a web page

to natural language, thus severely restricting what HTML pages the model can parse. The second

category pretrains LLMs on a large corpora of HTML text (Aghajanyan et al., 2021). However,

these works typically restrict the model evaluation to standard NLP tasks, e.g., summarization and

question/answering as opposed to tasks more relevant to HTML understanding and web automation.

Our work can be thought of as the reverse: We keep the pretraining of LLMs unchanged and focus

on the mechanisms for transferring the pretrained LLMs to HTML-relevant tasks.

B RIEF B ACKGROUND ON HTML

S EMI -S TRUCTURED T EXT D ATA

HTML is a markup language, used to organize web page structure and content. Consider the

example HTML page in Figure 1a. This web page includes two adjacent input elements, one for

e-mail and another for password, with their corresponding labels on a separate branch of the page.

These inputs and labels are one of many possible elements that serve as HTML building blocks.

Each element has a set of attributes – key and value pair – that describe the element’s content, such

as style and human-readable text. When rendered in a browser, these attributes will be responsible

for how the element is shown and where it is positioned. In the example in Figure 1a, the first

input has three attributes, tag="input", type="email", and id="uName", that identify

the element as an email input with an identifier (“uName”) that can be accessed programmatically.

https://console.cloud.google.com/storage/browser/gresearch/webllm

3Task Dataset Size Input

Autonomous Web Navigation MiniWoB Demos (Shi et al., 2017) 12K Page

Semantic Classification Annotated Shopping Webpages (Gur et al., 2021) 28K Snippet

Description Generation CommonCrawl (new) 85K Snippet

Model

Architecture

Enc-Dec

Dec

Output Task Output

Text Dictionary

All Text Category

Enc-Dec

Dec Text Text

Table 1: Task, dataset, and model summary. All models receive raw HTML. Autonomous Web Navigation

receives the entire HTML, while the other tasks receive HTML snippets extracted given salient element.

C ANONICAL T ASKS FOR HTML U NDERSTANDING

We devise three canonical tasks to study HTML understanding capabilities of LLM-based web

agents. These tasks require correctly interpreting both structure and content to varying degrees

to make predictions, with autonomous navigation being the most challenging capability of the three.

Autonomous Web Navigation. This task evaluates how well a model navigates multi-page web-

sites as a sequential decision-making problem (Shi et al., 2017; Liu et al., 2018). At the beginning

of an episode, the agent is given a natural language instruction, e.g. Enter the username “lyda”

and the password “N22t” into the text fields and press login. The agent applies actions to a se-

quence of HTML pages, where each action is of the form function(selector, text). The

function is one of click or type, selector is an integer pointer that uniquely identifies an ele-

ment, and text is a text to input if the type functionality is activated. An episode terminates when

either the page reaches a terminal state (e.g., the ‘sign in’ button is clicked) or the maximum number

of steps is reached.

Semantic Classification. Many HTML understanding applications require a model that can classify

HTML elements into standardized categories. For example, in automated form-filling (Diaz et al.,

2013; Gur et al., 2021), it is useful to identify a ‘submit button’ across many websites (e.g., shopping,

flight booking, utility application) with various button representations (e.g., position, color, or text).

Thus, we formulate Semantic Classification as classifying elements into role categories. Take the

example HTML in Figure 1a which includes two input elements and a submit button. Let’s

pick the first input as an element of interest to be classified by the system, also called a salient

element. The system should classify this element as username, since it appears on a login page and

it has a label with Email Address which is typically associated with the username in form-filling

applications. To solve this, the system can aggregate information from multiple sources in the page

– the label that says Enter Email Address, the input attributes (type=“email” and id=“uName”),

or even the ordering of other elements in the page such as ‘password’ and ‘sign in’.

Description Generation. Motivated by applications in accessibility-minded web browser con-

trol (Jorgensen & Binsted, 2005), we formulate description generation as an extractive problem

where the goal is to locate the textual description of an element in the HTML and generate it as

output. For instance, the description of the salient element in Figure 1a is Enter Email Address;

when rendered, this label will appear above the ‘email’ input field. HTML provides a large

amount of flexibility, and so in general a descriptive text that appears alongside a specific element

when rendered can be very far from that element when looking at the HTML plaintext. Thus, this

task evaluates a model’s ability to understand the structure of HTML as it would appear to a user,

despite not having access to the rendered web page directly.

D ATASETS

Each of our canonical tasks requires a separate dataset, with the description generation task using a

newly contributed, auto-labelled dataset based on CommonCrawl.

Autonomous Web Navigation. We use the 12K demonstrations included in the publicly available

MiniWoB benchmark (Shi et al., 2017), which encompass 62 website applications ranging from

email forwarding to social media interactions. Each demonstration is a sequence of (instruction,

HTML, action) tuples. Every element in a MiniWoB demonstration is accompanied by a reference

number unique within its respective pages. This number can be used as an element selector, making

the action space unified across all tasks and time steps. For instance, the action in Figure 1a would be

4type(ref=5, ”[email protected]”), where 5 refers to the index of the input when counted from

top-to-bottom. As model input, we concatenate the natural language instruction and HTML into a

single text input sequence. Similarly, we treat the action as a text sequence for the model to predict.

Semantic Classification. We use a dataset of 28K labelled examples, containing 66 different cat-

egories, of the form (HTML, element, category), previously used in the context of environment

generation (Gur et al., 2021). The dataset consists of HTMLs from real-world shopping websites

and categories relevant to form-filling during payment and checkout on these websites.

Description Generation. For this task, we derive a dataset from CommonCrawl. 3 CommonCrawl

does not include renderings or annotations that would reveal what text in the HTML is associated

with which elements. Instead, we infer descriptions of various elements by exploiting a special

attribute in the HTML schema known as for. As an example in Figure 1a, the first label in

the HTML has a for attribute with value uName, which is the id of the element described by

label; in this case, the id is that of the first input in the page. This annotation does not affect

the rendering of the page and is typically used for accessibility purposes. We utilize the information

given by these for attributes to create a large-scale dataset to study description generation. A small

sample is available in the supplemental material, while the entire dataset will be available upon

publication.

Specifically, we collected 100 WARC (from April 2019) files from the CommonCrawl project and

extracted all HTML labels that have a for attribute. Removing non-Unicode and alphanumeric

text in HTML labels results in a 400K example datset. We balance the distribution of labels,

effectively downsampling the dataset to 85K samples. Each example is represented as (HTML,

element, description), where HTML is the HTML plaintext of the page, element is the element

whose id attribute matches that appearing in the label’s for attribute, and description is the text

inside the label element (see example in Figure 1a). More details of the dataset can be found in

Appendix A.1.

P RE -P ROCESSING

In treating HTML as token sequences, we minimize any HTML tree pre-processing prior to model

input. We thus provide HTML as raw text (i.e., sequences of text tokens) and only apply a snippet

extraction pre-processing for pages which are too large to fit into the typical LLMs context windows.

Snippet Extraction. Real HTML pages can grow extremely large, reaching thousands of elements,

far beyond the context window of the largest LLM that we studied (1920 tokens in PaLM (Chowdh-

ery et al., 2022)). LLMs typically truncate such long sequences, which can be detrimental to HTML

understanding as HTMLs are not linearly structured. We take an element-centric approach and ex-

tract HTML snippets (a small portion of HTML code) surrounding a salient element (Figure 5). A

simple heuristic, which controls the tree’s width and depth, guides the process: Start with a salient

element and traverse its ancestors in the HTML tree until a stopping condition is satisfied. As we

traverse up, we estimate the height of the tree and the increased number of descendants of the new

root. We stop when either metric violates a pre-defined limit and take the resulting sub-tree as the

snippet. We mark the salient element using a special attribute, called target, to distinguish it from

other elements. We perform the snippet extraction for the semantic classification and description

generation datasets, and keep the full HTML pages in MiniWoB because these pages are typically

much smaller than real-world HTML.

HTML un-Parsing. We provide the models with the unparsed plaintext HTML in the form of

a sequence of tokens. This canonical representation does not require specific model architectures

such as hierarchical networks (Liu et al., 2018; Gur et al., 2021) and can be fed into any LLM. We

transform all datasets by converting every HTML page or snippet into a sequence. For MiniWoB,

we additionally concatenate (action history, instruction, HTML) tuples into a single sequence.

http://commoncrawl.org

M ODEL T RAINING

We study a variety of transformer-based LLMs (Vaswani et al., 2017) with different sizes and archi-

tectures for HTML understanding tasks (Table 1). In the rest of the text, we prefix models fine-tuned

for Autonomous Web Navigation, Description Generation, and Semantic Classification with WebN-

, WebD-, and WebC-, respectively. For instance, WebD–T5-3B is the three billion parameter T5

model (Raffel et al., 2020) fine-tuned for the Description Generation task. The rest of this section

elaborates on training details.

Encoder-Decoder and Decoder-only Models. We train encoder-decoder models, i.e., T5 (Raffel

et al., 2020), and decoder-only models, i.e., LaMDA (Thoppilan et al., 2022) and PaLM (Chowdh-

ery et al., 2022), with text input and text output (Figure 1b). Inputs are raw HTML pages or snippet

texts; similarly, outputs are categories, natural language descriptions, or actions represented as text.

Namely, for Semantic Classificationwe use the textual representation of categories, similar to previ-

ous classification problems in NLP (Raffel et al., 2020). For Autonomous Web Navigation, actions

are converted into text by first converting them into key and value pairs and then concatenating the

pairs.

Many websites in MiniWoB require multiple interactions, such as click-button-sequence or click-

checkboxes, where each interaction might cause a subtle change in the website state. For instance,

after clicking on a checkbox in the click-checkboxes website, its value flips from positive to negative

or the other way around, which is not always reflected in LLMs’ predictions and leads to action

repetitions. We solve this issue by augmenting tuples in the dataset with a sequence of past actions,

(action history, instruction, HTML, action), and allowing LLMs to learn from past experience.

Encoder-only Models. We train encoder-only models, i.e., BERT (Devlin et al., 2018), with text

input and categorical output. We keep semantic categories as discrete one-hot classes. To train

encoder-only models, we add a new classification layer after the final encoder layer to produce a

distribution over semantic categories. In addition to the typical BERT models, we study Mobile-

BERT (Sun et al., 2020), distilled from BERT-large with inverted bottlenecks, and Albert-XL (Lan

et al., 2020), with parameter sharing and embedding split.

R ESULTS

We now present the results of fine-tuned LLMs for HTML understanding. We compare the models’

performance with the existing baselines where possible (autonomous web navigation) and against

other LLM architectures and training regimes (all tasks). Sections 8.1, 8.2, and 8.3 evaluate task-

specific performance, while Section 8.4 assesses the performance across all the tasks.

Metrics: For autonomous web navigation we evaluate models’ Success Rate, which is averaged over

100 episodes per task. For the other tasks, we use Accuracy to measure exact match between predic-

tion and ground truth. In the description generation task, we additionally provide evaluations using

alternative ‘soft’ text evaluation metrics, BLEU and ROUGE-1, measuring the similarity between

predicted and ground truth text.

8.1

A UTONOMOUS W EB N AVIGATION R ESULTS

For Autonomous Web Navigation we fine-tune two WebN- encoder-decoder architectures (WebN-

T5-large and WebN-T5-3B) on 12k demonstrations from human-annotated real websites. We eval-

uate the models on MiniWob (Liu et al., 2018) benchmark, and compare with specialized architec-

tures trained using supervised learning (SL) on 2.4 million human expert demonstrations CC-Net

(SL) (Humphreys et al., 2022), and two RL models bootstrapped with SL, CC-Net (SL) (CC-Net

(SL & RL) (Humphreys et al., 2022), and WGE (SL & RL) (Liu et al., 2018)). Additionally, we

compare with the decoder-only architecture (WebN-Lambda-1B) and perform an ablation study on

the impact of including the action history in the input.

Comparison to SoTA. Since previous works report success on only a subset of websites in Mini-

WoB, we evaluate on 48 out of 62 websites that are common across all models. Table 8 in the

Appendix reports fine-grained results while Figure 2a presents results averaged over all websites.

Compared to CC-Net (SL) which is trained on all 2.4M demonstrations, WebN-T5-3B improves the

612K

Demos

2.4M

Demos

Model Name

T5-large

LaMDA-1B

T5-3B

WebN-T5-large

WebN-LaMDA-1B

WebN-T5-3B

(a) Baseline comparison.

Success (%)

18.1

15.6

11.1

46.4

48.8

51.8

Model Size

800M

(b) Pre-training effect.

Figure 2: a) WebN–T5* performance compared to the previous SOTA models on MiniWoB benchmark.

WebN-T5-3B improves the task success 16% while using 192 times less data, compared to the best supervised

learning (SL) model, CC-Net (SL). LLMs performance is only surpassed by works utilizing RL, requiring or-

ders of magnitude more online experience interaction with websites. b) LLMs with and without pretraining

on Autonomous Web Navigation task. Those with pretraining (denoted by the ‘WebN-’ prefix) show a 2.5-4.5x

performance improvement.

Model Name

WebC-MobileBERT

WebC-Albert-XL

WebC-BERT-smallest

WebC-BERT-small

WebC-BERT-medium

WebC-BERT-base

WebC-BERT-large

WebC-T5-base

WebC-T5-large

WebC-T5-3B

WebC-LaMDA-1B

WebC-PaLM-8B

WebC-PaLM-62B

T5-large

T5-3B

PaLM-8B

Test (%)

78.1

83.5

84.4

85.2

83.9

84.1

86.8

87.0

87.7

87.4

86.6

88.7

76.4

77.2

73.3

Dev (%)

77.7

83.1

83.6

85.2

84.5

84.8

85.8

89.9

89.3

90.3

87.1

89.9

90.5

75.2

73.8

70.1

Model Size

24.6 M

58.9 M

38.7 M

52.8 M

67 M

109.5 M

335.2 M

250 M

800 M

62 B

800 M

Code in training Corpus

12.5% Code

5% Code (0.875% HTML)

Table 2: LLMs performance on the Semantic Classification task. Fine-tuning off-the-shelf pretrained LLMs

(model names with prefix ‘Web*’) helps LLMs transfer better compared to training the same architecture from

scratch on the HTML dataset (model names without prefix ‘Web*’), improving the accuracy of PaLM-8B more

than 12%. While WebC-PaLM-62B clearly performed better than all other models, we found WebC-T5-large

to be competitive with much larger models such as WebC-LaMDA-1B or WebC-PaLM-8B.

success 16% while only training on 12K publicly-available demonstrations, yielding over 192x im-

provement in sample-efficiency. We find that all choices of LLMs outperform previous SL models.

Notably, WebN-T5-3B significantly improves on websites requiring multiple-action sequences such

as click checkboxes or websites requiring entering text such as login user (Table 8). We observe that

the performance of LLMs is only surpassed by previous works utilizing RL, which require orders of

magnitude more online experience interaction. Extending our fine-tuned LLMs to an RL setting is

a promising avenue for future work.

Action history ablation. Across all LLMs we consistently observe a decrease in success, on av-

erage 6.4%, when past actions are excluded from the inputs (Figure 2a). Action history helps with

websites that require entering multiple texts, as well as understanding minor changes that could be

difficult to detect (e.g. click checkboxes and multi layout). multi layout requires entering 3 different

texts in the website where the layout is randomized at each episode, yet, surprisingly, even the (rel-

atively smaller) WebN-T5-large model without action history outperforms the CC-Net (SL) model;

illustrating that incorporating action history is not the only contributing factor for the better success.

7Categories

Figure 3: Accuracy per classification category of the WebC-T5-3B model on the development dataset.

300

500

Height Test (%) Dev (%)

7 87.7

88.6

88.4

89.3

87.8

75.8 90.3

89.2

90.0

89.2

88.8

74.5

(%)

New

descendants

WebC-PaLM

WebC-T5-3B

T5-3B (full data / no pretraining)

500

1000

1500

2000

Data Size

(a)

(b)

Figure 4: a) Effect of snippet extraction parameters on WebC-T5-3B. Increases above 50% in new descendants

and height of 4. Large increases in both parameters lead to large snippets and decrease in accuracy. b) Accu-

racy over training data size. Using only 1000 labeled examples (4.4% of all training dataset), WebC-T5-3B

outperforms T5-3B (full data without pretraining) which is trained on all available labeled data (approximately

30k examples), and outperforms WebC-PaLM-8B which is an order of magnitude larger.

8.2

S EMANTIC C LASSIFICATION T ASK R ESULTS

To evaluate the Semantic Classification task, we compare the T5 encoder-decoder architecture’s

three size variants (WebC-T5-base, WebC-T5-large, and WebC-T5-3B) fine-tuned on 22K real,

human-labeled training websites. We compare with a fine-tuned encoder only architectures

(WebC-*BERT*), three fine-tuned decoder-only architectures (WebC-LaMDA and PaLM), and both

encoder-decoder and decoder-only models trained on human labeled websites from scratch. Results

are presented in Table-2, where we find that all WebC-LLMs perform well and significantly better

than the same architectures without pretraining.

Accuracy per category. In Figure 3, we present accuracy distribution of the WebC-T5-3B model

on the development dataset. The fine-tuned encoder-decoder model performs strongly on a majority

of the categories (Figure 3), even on those with very few samples. For instance, the model is 100%

accurate on password new which has only 56 training examples, because the class is unambiguous.

On the other hand, unsurprisingly, the performance drops when the category is ambiguous, such as

in the email category which is frequently mistaken as username.

Snippet generation ablation. Two hyper-parameters govern snippet generation: percentage of

new descendants and height of the new root. While small variations of both parameters do not

change the performance, increasing both degrades the performance significantly (Table 4a). With

new descendants up to 500% and height up to 7, the performance drops by more than 15%. Note

that snippet generation returns the full-page HTML when both parameters increase indefinitely.

Data size impact. When varying the fine-tuning training data sizes (1, 5, 10, 20, or 50 samples per

class) in Figure 4b, WebC-T5-3B slightly outperforms WebC-PaLM-8B which is an order of mag-

nitude larger. Compared to T5-3B that is trained on all available HTML data without pretraining,

WebC-T5-3B achieves better performance while using only 3.4% of labeled data (1000 samples),

8Model Name

WebD-T5-large

WebD-LaMDA-1B

WebD-T5-3B

Closest Description

Accuracy(%)

83.2

83.3

57.4

Test

BLEU

90.2

87.5

90.8

24.4

ROUGE-1

90.5

90.2

90.9

59.2

Accuracy(%)

84.3

85.2

60.8

Dev

BLEU

91.7

88.6

92.1

23.9

ROUGE-1

91.5

91.2

91.9

62.1

Table 3: Description generation accuracy of LLMs.

thus highlighting the benefit of using standard off-the-shelf pretrained LLMs for HTML understand-

ing.

8.3

D ESCRIPTION G ENERATION T ASK R ESULTS

For Description Generation we split the CommonCrawl dataset based on URL top-level domains to

test LLMs’ capabilities to generalize to unseen HTML. We fine-tune encoder-decoder architectures

(WebD–T5*) and decoder-only models (WebD–LaMDA*), with results presented in Table 3. We

also evaluate a strong heuristic baseline which simply finds the description closest to the salient

element in the HTML text (Closest Description).

Accuracy and Similarity Performance We show results of our evaluations in Table 3. All models

achieve high scores across all metrics, achieving ≈ 84% on the accuracy in terms of exact match and

a higher non-exact match score based on BLEU and ROUGE-1 (≈ 91%). This difference indicates

that the models are capable of locating the descriptions, but not always generating the exact output.

8.4

HTML U NDERSTANDING LLM S P ERFORMANCE A NALYSIS A CROSS T ASKS

We now analyze our results in aggregate to derive our main conclusions.

8.4.1

P RETRAINING E FFECT : P RETRAINING ON L ARGE T EXT C ORPORA M ATTERS

Fine-tuned pretrained LLMs outperform LLMs trained on HTML-only data, improving the perfor-

mance by more than 34.1% on the Autonomous Web Navigation (Table 2b), and 10% to 12.7% on

the Semantic Classification task (Table 2).

Since Autonomous Web Navigation is the most difficult task, the improved performance is an en-

couraging evidence of the value of LLMs in HTML understanding tasks. Specifically, we observe

that LLMs without pretraining are comparable to fine-tuned pretrained models only on websites that

require simple text matching. In contrast, for websites such as click checkboxes, text matching is

harder and we find that pretraining is key to good performance. We also found that without pretrain-

ing, model outputs were frequently in an incorrect format such as invalid dictionaries or invalid refs

with non-integer values. This suggests that the large corpora used for pretraining helps models to

learn general HTML structure.

8.4.2

A RCHITECTURE E FFECT : T5- BASED M ODELS P ERFORM B EST A CROSS A LL T ASKS

Encoder-decoder T5 based models perform better across all three tasks. On the Autonomous Web

Navigation task, encoder-decoder (WebN-T5) architectures are better or comparable to WebN-

LaMDA-1B (Figure 2a). On the Semantic Classification, the smallest encoder-decoder model

(WebC-T5-base) performs comparably to much larger decoder-only models (WebC-LaMDA-1B or

WebC-PaLM-8B) and the largest encoder-only model (WebC-BERT-large) which has 85M more pa-

rameters (Table 2). We also observe that decoder-only PaLM-8B performs worse than much-smaller

encoder-decoder T5-large when trained only on HTML data. Finally, on the Description Generation

encoder-decoder architecture has higher BLEU score.

One possible explanation for the strong performance of T5-based moels is the encoder-decoder

architecture of these models. Namely, T5 models utilize an encoder with a bidirectional attention

mechanism, not present in the LaMDA and PaLM decoders. The bidirectional attention mechanism

can process HTML pages from both ends, potentially overcoming the loss of information when

tree-structured HTML pages are converted into a fixed linear text sequences.

98.4.3

M ODEL S IZE E FFECT : S IZE (S UB - LINEARLY ) M ATTERS

Across the tasks it appears that the architecture plays an important role in the model performance.

Model size and performance are also positively correlated, although they reach diminishing returns.

For instance, the model performance is roughly O(log log n) with respect to model size on Seman-

tic Classification (Figure 4b in Appendix). On the Autonomous Web Navigation task, performance

grows slowly with the model size (Table 8), while on the Description Generation it plateaus (Ta-

ble 3).

8.5

D ISCUSSION

Bi-directional attention vs training corpora: Pretraining on large corpora matters, yielding ≤4.5x

performance improvements. Larger models tend to be better and we credit the bidirectional attention

for T5’s best overall performance across the tasks. PaLM and LaMDA include HTML and other

code in their pretraining corpora, while BERT and T5 architectures did not, showing that pretraining

on HTML is not necessary for strong performance when fine-tuned for HTML understanding. This

strengthens the hypothesis behind the role of the bidirectional attention, and opens up the possibility

to further improve the performance of T5 architectures by pretraining them on corpora with HTML.

Practical impact on labeling: When available, the pretrained LLMs need very little new expert

data (200x and 30x reduction on the web navigation and classification tasks, respectively). This has

a big potential impact on practical applications, reducing the data collection time and cost by orders

of magnitude.

Bigger is not always better: When choosing the model size, the expected performance gains (sub-

linear at best and asymptotic at worst) should be considered alongside the model’s training and

inference time and cost. For instance, on the classification task, the largest model WebC-PaLM-62B

takes several days to fine-tune, and evaluates at 30 Hz, while WebC-T5-large fine-tunes in several

hours and evaluates at 700 Hz – an order of magnitude more expensive for a single percent uplift in

accuracy. BERT models on the other hand train in minutes. If the application does not require high

precision, these might be a good choice.

Context window is a bottleneck: The major bottleneck for the HTML understanding tasks seems to

be the context window length that the current LLMs support, even with models that accept 1000+ to-

kens. It remains prohibitive to evaluate web navigation tasks on real websites that are orders of mag-

nitude larger than pages in MiniWob. Similarly, we observed that increasing the snippet size leads

to major performance degradation. This makes HTML understanding an interesting benchmark for

future LLM development. For instance, new methods may need to be developed to compress the

state representation of web content for use in LLM context windows.

C ONCLUSION

We presented canonical tasks and fine-tuned LLMs for HTML understanding. The comprehensive

evaluations and analyses over a range of architectures, dataset sizes, and baselines yields practical

findings and highlights current limitations of these models. We find that a) pretraining is critical for

the performance and can reduce labeled data requirements, improving sample efficiency up to 200x;

b) model architecture is the second-most important factor, and T5 models with bidirectional attention

and encoder-decoder architecture perform the best across the board; c) given a choice, model size

should be evaluated in the context of the model’s training and inference performance, as the model

size sub-linearly correlates with its performance. Finally, the proposed HTML understanding tasks

highlight the relatively short context window that limits current LLMs, suggesting possibilities for

future research that incorporate or eliminate this constraint.

R EFERENCES

Leonard Adolphs, Benjamin Boerschinger, Christian Buck, Michelle Chen Huebscher, Massimil-

iano Ciaramita, Lasse Espeholt, Thomas Hofmann, and Yannic Kilcher. Boosting search engines

with interactive agents. arXiv preprint arXiv:2109.00527, 2021.

10Armen Aghajanyan, Dmytro Okhonko, Mike Lewis, Mandar Joshi, Hu Xu, Gargi Ghosh, and Luke

Zettlemoyer. Htlm: Hyper-text pre-training and prompting of language models. arXiv preprint

arXiv:2107.06955, 2021.

Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea

Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, et al. Do as i can, not as i say:

Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691, 2022.

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan,

Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language

models. arXiv preprint arXiv:2108.07732, 2021.

Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx,

Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportu-

nities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal,

Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are

few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.

Andrea Burns, Deniz Arsan, Sanjna Agrawal, Ranjitha Kumar, Kate Saenko, and Bryan A Plummer.

Interactive mobile app navigation with uncertain or under-specified natural language commands.

arXiv preprint arXiv:2202.02312, 2022.

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam

Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm:

Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep

bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

Oscar Diaz, Itziar Otaduy, and Gorka Puente. User-driven automation of web form filling. In

International Conference on Web Engineering, pp. 171–185. Springer, 2013.

Izzeddin Gur, Natasha Jaques, Yingjie Miao, Jongwook Choi, Manoj Tiwari, Honglak Lee, and

Aleksandra Faust. Environment generation for zero-shot compositional reinforcement learning.

Advances in Neural Information Processing Systems, 34:4157–4169, 2021.

Zecheng He, Srinivas Sunkara, Xiaoxue Zang, Ying Xu, Lijuan Liu, Nevan Wichers, Gabriel Schu-

biner, Ruby Lee, and Jindong Chen. Actionbert: Leveraging user actions for semantic under-

standing of user interfaces. In Proceedings of the AAAI Conference on Artificial Intelligence,

volume 35, pp. 5931–5938, 2021.

Peter C Humphreys, David Raposo, Tobias Pohlen, Gregory Thornton, Rachita Chhaparia, Alistair

Muldal, Josh Abramson, Petko Georgiev, Adam Santoro, and Timothy Lillicrap. A data-driven

approach for learning to control computers. In International Conference on Machine Learning,

pp. 9466–9482. PMLR, 2022.

Sheng Jia, Jamie Ryan Kiros, and Jimmy Ba. DOM-q-NET: Grounded RL on structured lan-

guage. In International Conference on Learning Representations, 2019. URL https://

openreview.net/forum?id=HJgd1nAqFX.

Chuck Jorgensen and Kim Binsted. Web browser control using emg based sub vocal speech recog-

nition. In Proceedings of the 38th Annual Hawaii International Conference on System Sciences,

pp. 294c–294c. IEEE, 2005.

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child,

Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language

models. arXiv preprint arXiv:2001.08361, 2020.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Sori-

cut. Albert: A lite bert for self-supervised learning of language representations. In International

Conference on Learning Representations, 2020.

11Chenliang Li, Bin Bi, Ming Yan, Wei Wang, Songfang Huang, Fei Huang, and Luo Si. Structurallm:

Structural pre-training for form understanding. arXiv preprint arXiv:2105.11210, 2021a.

Junlong Li, Yiheng Xu, Lei Cui, and Furu Wei. Markuplm: Pre-training of text and markup language

for visually-rich document understanding. arXiv preprint arXiv:2110.08518, 2021b.

Evan Zheran Liu, Kelvin Guu, Panupong Pasupat, Tianlin Shi, and Percy Liang. Reinforcement

learning on web interfaces using workflow-guided exploration. arXiv preprint arXiv:1802.08802,

2018.

Kevin Lu, Aditya Grover, Pieter Abbeel, and Igor Mordatch. Pretrained transformers as universal

computation engines. arXiv preprint arXiv:2103.05247, 2021.

Sahisnu Mazumder and Oriana Riva. Flin: A flexible natural language interface for web navigation.

arXiv preprint arXiv:2010.12844, 2020.

Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christo-

pher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. Webgpt: Browser-assisted

question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021.

Rodrigo Nogueira and Kyunghyun Cho. End-to-end goal-driven web navigation. Advances in neural

information processing systems, 29, 2016.

Christopher Olston, Marc Najork, et al. Web crawling. Foundations and Trends® in Information

Retrieval, 4(3):175–246, 2010.

Panupong Pasupat, Tian-Shun Jiang, Evan Zheran Liu, Kelvin Guu, and Percy Liang. Mapping

natural language commands to web elements. arXiv preprint arXiv:1808.09132, 2018.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi

Zhou, Wei Li, Peter J Liu, et al. Exploring the limits of transfer learning with a unified text-to-text

transformer. J. Mach. Learn. Res., 21(140):1–67, 2020.

Tianlin Shi, Andrej Karpathy, Linxi Fan, Jonathan Hernandez, and Percy Liang. World of bits: An

open-domain platform for web-based agents. In International Conference on Machine Learning,

pp. 3135–3144. PMLR, 2017.

Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou. Mobile-

BERT: a compact task-agnostic BERT for resource-limited devices. In Proceedings of the 58th

Annual Meeting of the Association for Computational Linguistics. Association for Computational

Linguistics, 2020.

Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze

Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, YaGuang Li, Hongrae Lee, Huaixiu Steven

Zheng, Amin Ghafouri, Marcelo Menegali, Yanping Huang, Maxim Krikun, Dmitry Lepikhin,

James Qin, Dehao Chen, Yuanzhong Xu, Zhifeng Chen, Adam Roberts, Maarten Bosma, Yanqi

Zhou, Chung-Ching Chang, Igor Krivokon, Will Rusch, Marc Pickett, Kathleen S. Meier-

Hellstern, Meredith Ringel Morris, Tulsee Doshi, Renelito Delos Santos, Toju Duke, Johnny So-

raker, Ben Zevenbergen, Vinodkumar Prabhakaran, Mark Diaz, Ben Hutchinson, Kristen Olson,

Alejandra Molina, Erin Hoffman-John, Josh Lee, Lora Aroyo, Ravi Rajakumar, Alena Butryna,

Matthew Lamm, Viktoriya Kuzmina, Joe Fenton, Aaron Cohen, Rachel Bernstein, Ray Kurzweil,

Blaise Aguera-Arcas, Claire Cui, Marian Croak, Ed H. Chi, and Quoc Le. Lamda: Language

models for dialog applications. CoRR, 2022.

Daniel Toyama, Philippe Hamel, Anita Gergely, Gheorghe Comanici, Amelia Glaese, Zafarali

Ahmed, Tyler Jackson, Shibl Mourad, and Doina Precup. Androidenv: a reinforcement learn-

ing platform for android. arXiv preprint arXiv:2105.13231, 2021.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,

Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural informa-

tion processing systems, 30, 2017.

12HTML

Enter Email Address

Enter Password:

Please enter your password.

if expandable:

expand

snippet

generation

Please enter your password.

otherwise

output

id="uName” target>

Figure 5: High-level overview of our pre-processing pipeline for generating snippets from a full HTML web-

page. Given the page, we detect salient elements and for each one of them we extract snippets by recursively

moving up in the HTML tree until a validation heuristic fails.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi,

Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. Huggingface’s

transformers: State-of-the-art natural language processing. CoRR, abs/1910.03771, 2019. URL

http://arxiv.org/abs/1910.03771.

Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya

Barua, and Colin Raffel. mt5: A massively multilingual pre-trained text-to-text transformer. arXiv

preprint arXiv:2010.11934, 2020.

Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. Webshop: Towards scalable real-

world web interaction with grounded language agents. arXiv preprint arXiv:2207.01206, 2022.

A PPENDIX

A.1

D ATASET D ETAIL

Examining the description distribution, we found the original 400K dataset to be very skewed; only

20 descriptions (such as Email and Password) were covering 50% of the dataset. We sub-sampled the

dataset so that each unique description has at most 10 data points. We also found that for attributes

are almost always defined for HTML labels. This could cause a model to overfit and just find the

label element in the HTML and ignore everything else. To avoid this sort of ‘cheating’ we replace

the tags of HTML labels by randomly sampling from {div, span, a, label}. These tags

are also frequently used to inject text in HTML but they are very rarely used with for attributes.

Finally, we removed examples where there are only a single text in the HTML since models can

trivially generate descriptions by finding the only text in the HTML, which biases model weights

and evaluation metrics. After this final step, we have a total of 85K labeled examples.

A.1.1

S NIPPET G ENERATION

In Figure 5, we give a high-level overview of our snippet generation procedure.

A.2

A.2.1

A DDITIONAL R ESULTS

S EMANTIC C LASSIFICATION

Error Analysis. We manually examined 50 errors of T5-3B model over the development set (Ta-

ble 4) and assigned them into one of the 9 error types that we devised. We found that 32% of the

errors are due to lack of information in the HTML snippets, which is mainly the result of lost in-

formation during snippet extraction process. Annotation errors or email/username ambiguity make

up 30% of the errors. These can’t be improved without revising the annotated data or adding extra

information to resolve the ambiguity. We also found that the model sometimes picks a more general

category, or a nearby text misleads the model; the latter usually happens when the HTML snippet is

long where majority of the elements are noise.

13Error Type

Not enough information in the HTML snippet

Incorrect annotation (ex: ”unknown role” instead of ”organization”)

Annotation tool translates user selection incorrectly

Email/Username ambiguity

More general category (ex: ”header” instead of ”cart header”)

Immediate neighboring text misleads

Incorrect date formatting (ex: ”mm” instead of ”mmm”)

No information in the HTML snippet

Others

Percentage of Examples

Table 4: Types of errors over 50 manually examined examples. 32% of errors are due to lack of information

in HTML snippets, 30% of errors are related to annotations or can’t be improved due to ambiguity (email/user-

name), and the remaining errors are incorrect predictions by the model.

Few-Shot Prompting In Table 5, we present few-shot prompting performance of a 540B PaLM

model. We probe the model using a prompt template Role: with 1 ex-

ample per category and generate categories using greedy-decoding. In our preliminary experiments,

we found that few-shot prompting achieves only 45.6 accuracy, much lower than a model fine-tuned

on the same data (Figure 6). We found two common problems – the model is not able to canonicalize

predictions into categories and many of the examples are dropped due to context length.

We developed post-processing methods to al-

leviate the canonicalization problem and pre-

Model Name

Test Dev

processing methods to reduce lengths of ex-

PaLM-540B

64.2 60.3

amples. Adding a dictionary-based mapping

- w/o Example Cleaning

57.9 57.2

on predictions – a manually curated paraphrase

- w/o Category Rewriting

52.1 50.7

dictionary – improves the performance to 52.1.

- w/o Dictionary Mapping 45.6 45.1

We also tried rewriting predictions by chang-

ing the order of tokens around ” ” such as Table 5: Few-shot prompting performance with differ-

name first to first name which further improved ent pre- and post-processing steps.

the performance to 57.9. Finally, we cleaned

examples in the prompt by removing certain el-

ements such as ”svg”, ”path”, ”img”, and ”iframe” and also removing class attribute from every

element; this pre-processing step gives 64.2.

Figure 6: Performance comparison w.r.t. increasing model size. As the model size increases, we

observe an increase in overall accuracy with PaLM-62B model achieving the highest accuracy while

being 7x larger than PaLM-8B.

14A.3

S AMPLE E PISODES FROM M INI W O B

See Table 6 for an example episode of web navigation inferred by a fine-tuned LLM.

A.4

D ETAILED M INI W O B R ESULTS

See Table 7 for detailed performance of various models on MiniWob.

A.5

R ESOURCE R EQUIREMENTS

See Table 8.

A.6

S TRUCTURE D EPENDENCE A BLATION S TUDY

We conducted an ablation study to examine the sensitivity of model performance to preserving

structural information. To do so, we evaluate the model’s performance on HTML input with criti-

cal structure components removed. We kept the order of elements and their attributes fixed while

corrupting the nesting structure by removing closing tags.

Removing closing tags corresponds to a valid traversal (BFS) and keeps the order of elements the

same as the text based input.

As a simple example:

would be converted into:

We evaluated the trained WebN-T5-3B model on the same set of synthetic websites from the

MiniWoB benchmark with this aspect of structure removed from the HTML pages. WebN-T5-

3B achieves a 45.4% success rate, 6% lower than before, suggesting that WebN-T5-3B is at least

partially dependent on the DOM topology.

A.7

T ASK - SPECIFIC M ODELS

An alternative to LLMs is to adapt bespoke task-specific architectures tailored towards processing

of structured documents and HTML (Li et al. (2021b;a)).

StructuralLM (Li et al. (2021a)) is an approach specifically tailored for document understanding

(i.e., combinations of images and text), and thus makes several simplifying assumptions for its model

that limit its applicability to HTML understanding (i.e., trees of elements with a richer structure and

functionality). It is trained only on the textual content of a document - the markup information is

ignored. For example, any input field or dropdown in a document would be missing from the model

inputs. All of the tasks we study require knowledge of this information. For example, in autonomous

navigation the model needs to interact with input elements (e.g. text, checkboxes, dropdowns) such

as username and password in the login-user task in MiniWoB. Typically, a “type” action with a

reference to an element and a text argument is generated by the model. Without knowing which

input elements are available in the page, it is impossible to generate a reference to any input element.

While MarkupLM (Li et al. (2021b)) is better tailored for understanding HTML pages, it has similar

drawbacks as StructuralLM in that it focuses solely on text and structure of text while ignoring

everything else in the markup. To illustrate our point better, we used the open source implementation

of MarkupLM from the HuggingFace library (Wolf et al. (2019)) to process the sample HTML

snippet in Figure-1(b). The MarkupLM ignores all input elements, both username and password,

and generates ~~Email AddressEnter Password:Please enter your password.~~ which is the

text input to the MarkupLM Transformer. Classifying this text as username or password is not

possible without the additional context on which input element is the salient element (in this context

it is the username). See below for the code to reproduce our result.

15from transformers import MarkupLMProcessor

processor = MarkupLMProcessor.from_pretrained(f"microsoft/markuplm-base")

snippet = ’’’

Email Address

Enter Password:

type="password" id="pass">

’’’

encoding = processor(snippet)

print(processor.batch_decode(encoding["input_ids"]))

MarkupLM is also evaluated on NLP-like tasks such as QA or entity classification where understand-

ing page content is paramount, whereas we focus on HTML understanding tasks such as autonomous

navigation where both content and the page’s layout structure need to be understood.

We perform a quantitative evaluation of MarkupLM on our tasks to understand how significant

these limitations are. We fine-tune the MarkupLM-base model on the semantic classification task,

using the same setup as other WebC models but with the suggested hyperparameters from (Li et al.

(2021b)). We use the MarkupLM implementation from the HuggingFace library (Wolf et al. (2019)).

On development and test sets, MarkupLM-base achieves 65% and 66% accuracy, respectively. These

results are more than 16% lower compared to similar size WebC-BERT-base results that we report

in our work. This suggests that although domain specific models may be suitable for processing

HTML for NLP tasks, the generality, flexibility, and sample efficiency LLMs provide advantages

for autonomous navigation tasks.

16Table 6: A sample web page and corresponding episode using the T5-3B model. At each time step,

previous actions, instruction, and HTML are concatenated into a single HTML text. Note that at the

beginning of episode, there is no past actions and we simply concatenate instruction and HTML.

Action is generated as a sequence of tokens which is later parsed into a dictionary. The ref in the

action points to an element that has a ref attribute with the same value. For instance, at the beginning

of episode, ref: 6 corresponds to an input with ref=6. At the end of the episode, the model clicks on

the submit button and the episode terminates.

Web page

HTML Text

Action Text

{action: click, ref: 6}

{action: click, ref: 10}

{action: click, ref: 12}

{action: click, ref: 14}{action: click, ref: 16}

{action: click, ref: 17}

18Table 7: Success rate comparison of various models in MiniWoB tasks. Baseline results are borrowed from

(Humphreys et al., 2022). Note that these are normalized between 0 and 1.

TASK

bisect-angle

book-flight

chase-circle

choose-date-easy

choose-date-medium

choose-date

choose-list

circle-center

click-button-sequence

click-button

click-checkboxes-large

click-checkboxes-soft

click-checkboxes-transfer

click-checkboxes

click-collapsible-2

click-collapsible

click-color

click-dialog-2

click-dialog

click-link

click-menu-2

click-menu

click-option

click-pie

click-scroll-list

click-shades

click-shape

click-tab-2-easy

click-tab-2-hard

click-tab-2-medium

click-tab-2

click-tab

click-test-2

click-test-transfer

click-test

click-widget

copy-paste-2

copy-paste

count-shape

count-sides

drag-box

drag-cube

drag-item

drag-items-grid

drag-items

drag-shapes

drag-sort-numbers

email-inbox-delete

email-inbox-forward-nl-turk

email-inbox-forward-nl

email-inbox-forward

email-inbox-important

email-inbox-nl-turk

email-inbox-noscroll

email-inbox-reply

email-inbox-star-reply

email-inbox

enter-date

enter-password

enter-text-2

enter-text-dynamic

enter-text

enter-time

find-midpoint

find-word

focus-text-2

focus-text

grid-coordinate

guess-number

highlight-text-2

highlight-text

identify-shape

moving-items

multi-layouts

multi-orderings

navigate-tree

number-checkboxes

read-table-2

read-table

resize-textarea

right-angle

scroll-text-2

scroll-text

search-engine

simon-says

simple-algebra

simple-arithmetic

social-media-all

social-media-some

social-media

terminal

text-editor

text-transform

tic-tac-toe

unicode-test

use-autocomplete

use-colorwheel-2

use-colorwheel

use-slider-2

use-slider

use-spinner

visual-addition

Human

0.92

0.87

0.82

0.99

0.98

0.97

0.98

0.96

0.94

0.98

0.87

0.73

0.98

0.97

0.99

0.97

0.99

0.98

0.97

0.99

0.98

0.91

0.88

0.99

0.96

0.97

0.99

0.83

0.94

0.82

0.98

0.99

0.98

0.87

0.93

0.96

0.92

0.99

0.88

0.91

0.96

0.99

0.93

0.96

0.91

0.95

0.96

0.97

0.96

0.91

0.97

0.98

0.94

0.96

0.99

0.87

0.99

0.97

0.98

0.94

0.96

0.18

0.95

0.96

0.98

0.96

0.95

0.97

0.94

0.87

0.97

0.62

0.86

0.96

0.89

0.91

0.96

0.88

0.86

0.71

0.99

0.98

0.94

0.9

0.97

0.98

0.97

WebN-T5-3B WebN-T5-3B

(no history)

n/a

0.03

0.26

n/a

0.22

0.54

0.63

0.96

0.27

0.24

n/a

0.37

0.87

0.51

0.53

n/a

0.12

n/a

0.18

0.74

n/a

0.41

n/a

0.33

0.60

n/a

0.23

n/a

0.38

0.97

n/a

0.98

0.89

n/a

0.49

n/a

0.88

0.72

0.82

n/a

0.83

0.88

0.91

n/a

0.34

n/a

0.02

0.21

n/a

0.48

n/a

0.22

n/a

0.07

n/a n/a

n/a

0.05

0.14

n/a

0.96

0.43

0.34

0.84

0.01

0.23

0.35

0.96

n/a

0.38

0.78

0.14

0.54

n/a

0.13

n/a

0.09

n/a

0.97

n/a

0.43

n/a

0.09

n/a

0.26

n/a

0.21

0.92

n/a

0.92

0.99

0.01

n/a

0.42

n/a

0.89

0.40

0.64

n/a

0.48

0.64

0.99

n/a

0.34

n/a

0.24

n/a

0.40

n/a

0.15

n/a

0.05

n/a

CC-Net

(SL & RL) CC-Net

(SL) World

bits

(SL & RL) Workflow

guided

exploration

(SL & RL)

0.97

0.87

0.93

0.99

0.97

0.99

0.97

0.71

0.95

0.99

0.98

0.99

0.83

0.94

0.99

0.97

0.6

0.95

0.99

0.98

0.99

0.98

0.63

0.79

0.85

0.79

0.98

0.99

0.97

0.98

0.97

0.88

0.99

0.94

0.97

0.98

0.96

0.75

0.86

0.75

0.85

0.9

-0.01

0.98

0.6

0.83

0.95

0.98

0.95

0.91

0.99 0.29

0.8

0.42

0.26

0.12

0.19

0.36

0.47

0.78

0.04

0.36

0.32

0.17

0.81

0.82

0.88

0.95

0.59

0.52

0.22

0.21

0.15

0.01

0.04

0.11

0.61

0.19

0.54

0.27

0.95

0.94

0.56

0.01

0.04

0.21

0.74

0.61

0.23

0.61

0.05

0.13

0.26

0.11

0.22

0.01

0.3

0.05

0.13

0.11

0.09

0.02

0.04

0.39

0.35

0.04

0.35

0.05

0.96

0.99

0.66

0.21

0.4

0.51

0.68

0.02

0.13

0.32

0.01

0.27

0.26

0.88

0.04

0.15

0.02

0.03

0.38

0.01

0.03

0.11

0.19

0.32

0.86

0.07

0.38

0.68

0.03

0.18

0.47

0.36 0.8

n/a

0.25

0.98

0.22

0.62

n/a

0.48

0.11

0.98

0.23

0.53

0.31

0.16

0.13

0.28

0.15

0.07

0.27

0.11

n/a

0.08

0.97

0.83

n/a

0.34

0.18

0.3

0.31

0.18

n/a

0.01

0.41

0.92

0.66

n/a

0.03

0.61

0.08

0.31

0.83

0.95

0.26

0.2

0.13

0.9

0.36

n/a

0.78

n/a

0.2

0.16

0.11

0.38

0.96

0.28

0.04

0.07

n/a

0.23

0.01

0.34

n/a

0.15

0.51

0.17

0.01 n/a

n/a

0.16

n/a

0.99

0.68

0.51

0.64

0.98

0.65

n/a

0.32

n/a

0.22

0.64

n/a

0.64

0.55

n/a

0.93

n/a

0.59

n/a

0.77

n/a

0.43

0.99

n/a

0.52

n/a

0.9

n/a

0.99

n/a

0.99

0.05

0.99

n/a

0.26

n/a

0.01

0.39

n/a

0.37

n/a

0.78

n/a

0.04

n/a

Learning

navigate

the web

(RL)

n/a

0.26

n/a

DOM-Q-Net

(RL) Workflow

guided

exploration

(Augmented)

n/a

0.54

n/a

n/a n/a

n/a

0.16

n/a

0.84

0.94

0.64

0.99

n/a

0.32

n/a

0.99

0.64

n/a

0.98

n/a

0.93

n/a

0.76

n/a

0.93

n/a

0.99

0.96

n/a

0.9

n/a

0.99

n/a

0.99

n/a

0.01

0.42

n/a

0.47

n/a

0.98

n/a

0.04

n/a

Learning

navigate

the web

(Augmented)

n/a

0.26

n/a

Aggregated

SOTA

(SL & RL) Aggregated

SOTA

(Augmented)

0.8

n/a

0.26

0.98

0.68

0.51

0.64

0.65

0.16

0.13

0.07

0.27

0.64

n/a

0.59

0.3

0.31

0.18

n/a

0.01

0.41

0.92

0.66

n/a

0.77

n/a

0.54

0.52

0.31

0.2

0.13

0.9

n/a

0.78

0.99

0.05

0.16

0.11

0.38

0.96

0.28

0.04

0.07

0.01

0.37

n/a

0.78

0.15

0.51

0.17

0.01 0.8

n/a

0.26

0.98

0.84

0.94

0.64

0.99

0.16

0.13

0.07

0.99

0.64

n/a

0.76

0.3

0.31

0.18

n/a

0.01

0.41

0.92

0.66

n/a

0.93

n/a

0.99

0.9

0.31

0.2

0.13

0.9

n/a

0.78

0.16

0.11

0.38

0.96

0.28

0.04

0.07

0.42

0.01

0.47

n/a

0.98

0.15

0.51

0.17

0.01Table 8: Resource requirements and running time of LLMs.

Model Name

PaLM

LaMDA

Model Size

62B

TPU version

TPU v4

TPU v2

Batch size

128

Input sequence length

1920

512

Examples per sec (training)

9.313

64.4

163.8

363.1

Examples per sec (inference)

30.51

184.3

734.5

1416