Summary: MotionLM: Multi-Agent Motion Forecasting as Language Modeling (arxiv.org)
9,515 words - PDF document
One Line
MotionLM combines trajectory generation and interaction modeling in a single decoding process to achieve state-of-the-art multi-agent motion forecasting for autonomous vehicles.
Key Points
- MotionLM is a model for multi-agent motion forecasting in autonomous vehicles.
- The model represents continuous trajectories as sequences of discrete motion tokens and treats motion prediction as a language modeling task.
- MotionLM achieves state-of-the-art performance on the Waymo Open Motion Dataset, ranking first in the interactive challenge.
- The model combines trajectory generation and interaction modeling in a single decoding process to capture interaction dependencies within trajectories.
- MotionLM utilizes temporal causality, conditional rollouts, and ensemble techniques to improve predictions.
- The model architecture consists of an encoder that processes scene elements and a trajectory decoder that generates sequences of motion tokens.
- MotionLM demonstrates competitive performance on marginal motion prediction and achieves state-of-the-art results on interactive motion prediction.
- The model leverages multi-agent rollouts and discrete motion tokens to capture the joint distribution over multimodal futures.
Summaries
21 word summary
MotionLM is a model for multi-agent motion forecasting in autonomous vehicles, achieving state-of-the-art performance by combining trajectory generation and interaction modeling.
60 word summary
MotionLM is a model for multi-agent motion forecasting in autonomous vehicles. By combining trajectory generation and interaction modeling in a single decoding process, it achieves state-of-the-art performance on the Waymo Open Motion Dataset, evaluated on the interactive validation set with full optimization and implementation details provided. MotionLM surpasses previous methods in capturing multimodal future behavior and predicting agent interactions.
160 word summary
MotionLM is a model for multi-agent motion forecasting in autonomous vehicles that represents continuous trajectories as sequences of discrete motion tokens. It achieves state-of-the-art performance on the Waymo Open Motion Dataset, ranking first in the interactive challenge. The model combines trajectory generation and interaction modeling in a single decoding process, maximizing the log probability of token sequences among interacting agents. MotionLM achieves competitive performance on marginal motion prediction and state-of-the-art results on interactive motion prediction. The model's interactive attention frequency and the number of rollouts generated have a significant impact on performance. The paper presents optimization and implementation details, including the use of teacher forcing and various hyperparameters. The model's performance is evaluated on the Waymo Open Motion Dataset interactive validation set, with visualizations provided in the supplementary material. In conclusion, MotionLM provides an effective approach to multi-agent motion forecasting by treating it as a language modeling task, surpassing previous methods in capturing multimodal future behavior and predicting agent interactions.
396 word summary
MotionLM is a model for multi-agent motion forecasting in autonomous vehicles that treats motion prediction as a language modeling task. It represents continuous trajectories as sequences of discrete motion tokens and does not require anchors or explicit latent variable optimization. The model produces joint distributions over interactive agent futures in a single decoding process and enables temporally causal conditional rollouts. MotionLM achieves state-of-the-art performance on the Waymo Open Motion Dataset, ranking first in the interactive challenge.
To capture the likely maneuvers and responses of road agents, MotionLM represents continuous trajectories as discrete motion tokens and leverages sequence models to forecast their behavior. The model combines trajectory generation and interaction modeling in a single decoding process, maximizing the log probability of token sequences among interacting agents.
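The objective in the paragraph above admits a standard autoregressive factorization. A sketch in assumed notation (the symbols $a_t^n$ for agent $n$'s motion token at step $t$, $T$ steps, $N$ agents, and $S$ for the encoded scene are illustrative, not taken verbatim from the paper):

```latex
% Joint distribution over motion tokens, factorized over time steps.
% At each step t, agents' tokens are predicted in parallel given the
% shared multi-agent history; training maximizes the log probability
% of the ground-truth token sequences.
p\bigl(a_{1:T}^{1:N} \mid S\bigr)
  = \prod_{t=1}^{T} \prod_{n=1}^{N}
    p\bigl(a_t^{n} \mid a_{1:t-1}^{1:N},\, S\bigr)
```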
MotionLM is evaluated on the Waymo Open Motion Dataset, ranking second in soft mAP in the marginal prediction challenge and achieving a substantially improved miss rate compared to previous works. In the interactive prediction challenge, the model outperforms previous models, achieving a 6% relative improvement in mAP and a 3% relative improvement in miss rate.
The model architecture consists of an encoder that processes scene elements and a trajectory decoder that generates sequences of motion tokens. The scene encoder encodes information from various modalities, and the trajectory decoder generates discrete motion tokens for multiple agents in a temporally causal manner. MotionLM enforces temporal causality by preserving the order of token sampling.
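Temporal causality of the kind described above is typically enforced with an attention mask over the flattened (agent, step) token grid. A minimal sketch, assuming one particular masking rule (a token sees all agents' earlier-step tokens plus its own current slot; the paper's exact scheme may differ):

```python
import numpy as np

def joint_causal_mask(num_agents: int, num_steps: int) -> np.ndarray:
    """Attention mask for temporally causal multi-agent decoding.

    Tokens are laid out as index = step * num_agents + agent.
    Token i (agent a_i, step t_i) may attend to token j (agent a_j,
    step t_j) only if t_j < t_i, or t_j == t_i and a_j == a_i, so an
    agent never conditions on another agent's simultaneous token.
    Hypothetical layout; shown only to illustrate the masking idea.
    """
    n = num_agents * num_steps
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        a_i, t_i = i % num_agents, i // num_agents
        for j in range(n):
            a_j, t_j = j % num_agents, j // num_agents
            mask[i, j] = (t_j < t_i) or (t_j == t_i and a_j == a_i)
    return mask
```

Because later-step tokens never attend to future tokens, resampling one agent's token sequence leaves earlier steps of the other agents untouched, which is what makes conditional rollouts resemble causal interventions.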
MotionLM achieves competitive performance on marginal motion prediction and state-of-the-art results on interactive motion prediction. The model's interactive attention frequency and the number of rollouts generated have a significant impact on performance.
The paper introduces MotionLM as a method for interactive motion forecasting that captures the joint distribution over multimodal futures, establishing new state-of-the-art performance on the Waymo Open Motion Dataset interactive prediction challenge. Optimization and implementation details are presented, including the use of teacher forcing and various hyperparameters.
The model is evaluated using a range of standard metrics on the Waymo Open Motion Dataset interactive validation set. Visualizations in the supplementary material showcase the model's predictions in various scenarios.
In conclusion, MotionLM provides an effective and flexible approach to multi-agent motion forecasting by treating it as a language modeling task. The model surpasses previous methods in capturing multimodal future behavior and accurately predicting agent interactions.
647 word summary
MotionLM is a model for multi-agent motion forecasting in autonomous vehicles. It represents continuous trajectories as sequences of discrete motion tokens and treats motion prediction as a language modeling task. The model has several advantages: it does not require anchors or explicit latent variable optimization, it produces joint distributions over interactive agent futures in a single decoding process, and it enables temporally causal conditional rollouts. MotionLM achieves state-of-the-art performance on the Waymo Open Motion Dataset, ranking first in the interactive challenge.
Modern sequence models often use next-token prediction objectives without domain-specific assumptions. MotionLM leverages this approach by representing continuous trajectories as discrete motion tokens, similar to language model vocabularies. The goal is to capture the likely maneuvers and responses of road agents by using sequence models to forecast their behavior.
To address limitations in existing joint prediction approaches, MotionLM combines trajectory generation and interaction modeling in a single decoding process. The model is trained to maximize the log probability of token sequences among interacting agents. At inference time, joint trajectories are produced step-by-step, with interacting agents sampling tokens simultaneously. The model can be applied to various behavior prediction tasks, including marginal, joint, and conditional predictions.
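The step-by-step joint inference described above can be sketched as a simple rollout loop. The `step_logits_fn` interface standing in for the trajectory decoder is hypothetical; only the sampling structure (all agents draw their next token in parallel, conditioned on the shared history) follows the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def rollout(step_logits_fn, num_agents: int, num_steps: int) -> np.ndarray:
    """One joint rollout: at each step every agent samples its next
    motion token simultaneously, conditioned on all agents' history.

    step_logits_fn(history) -> (num_agents, vocab) logits; a stand-in
    for the learned trajectory decoder (assumed interface).
    """
    history = np.zeros((num_agents, 0), dtype=int)
    for _ in range(num_steps):
        logits = step_logits_fn(history)                  # (A, V)
        probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
        probs /= probs.sum(axis=-1, keepdims=True)
        tokens = np.array([rng.choice(len(p), p=p) for p in probs])
        history = np.concatenate([history, tokens[:, None]], axis=1)
    return history  # (A, T) sampled motion token ids
```

Marginal, joint, and conditional prediction then differ only in which agents are rolled out and which token sequences are held fixed.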
The performance of MotionLM is evaluated on the Waymo Open Motion Dataset. In the marginal prediction challenge, the model ranks second in soft mAP and achieves a substantially improved miss rate compared to previous works. In the interactive prediction challenge, MotionLM outperforms previous models, achieving a 6% relative improvement in mAP and a 3% relative improvement in miss rate.
The model architecture consists of an encoder that processes scene elements and a trajectory decoder that generates sequences of motion tokens. The scene encoder encodes information from various modalities, such as roadgraph elements and traffic light states. The trajectory decoder generates discrete motion tokens for multiple agents in a temporally causal manner. The tokens are sampled from a learned distribution and combined with attention mechanisms to produce joint trajectories.
MotionLM enforces temporal causality by preserving the order of token sampling. This allows for conditional rollouts that resemble causal interventions. The model also utilizes rollout aggregation to identify underlying modes of the joint future distribution and estimate their probabilities. Ensembling is used to account for epistemic uncertainty and improve predictions.
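One simple way to realize the rollout aggregation mentioned above is to cluster rollouts and read mode probabilities off the cluster sizes. This is a plain k-means sketch over rollout endpoints, not necessarily the paper's exact aggregation procedure:

```python
import numpy as np

def aggregate_rollouts(endpoints: np.ndarray, k: int, iters: int = 20,
                       seed: int = 0):
    """Collapse many sampled rollouts into k representative modes.

    endpoints: (R, D) final positions of R rollouts (flattened across
    agents).  Mode probability = fraction of rollouts assigned to each
    cluster.  Illustrative only; details differ in the actual system.
    """
    rng = np.random.default_rng(seed)
    centers = endpoints[rng.choice(len(endpoints), k,
                                   replace=False)].astype(float)
    for _ in range(iters):
        # Assign each rollout to its nearest center, then recenter.
        d = np.linalg.norm(endpoints[:, None] - centers[None], axis=-1)
        assign = d.argmin(axis=1)
        for c in range(k):
            pts = endpoints[assign == c]
            if len(pts):
                centers[c] = pts.mean(axis=0)
    probs = np.bincount(assign, minlength=k) / len(endpoints)
    return centers, probs
```

Ensembling fits on top of this naturally: rollouts from several independently trained replicas are pooled before clustering, so the estimated mode probabilities also reflect epistemic uncertainty.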
In experiments, MotionLM achieves competitive performance on marginal motion prediction and state-of-the-art results on interactive motion prediction. The model's interactive attention frequency and the number of rollouts generated have a significant impact on performance, with higher frequencies and more rollouts leading to better predictions.
Overall, MotionLM provides an effective and flexible approach to multi-agent motion forecasting by treating it as a language modeling task. The model surpasses previous methods in capturing multimodal future behavior and accurately predicting agent interactions.
The paper introduces a method for interactive motion forecasting called MotionLM, which leverages multi-agent rollouts over discrete motion tokens to capture the joint distribution over multimodal futures. The model establishes new state-of-the-art performance on the WOMD interactive prediction challenge.
The paper presents the model's optimization and implementation details: training uses teacher forcing with hyperparameters such as learning rate, number of layers, hidden size, and number of attention heads, and inference uses nucleus sampling with a top-p parameter of 0.95.
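Nucleus (top-p) sampling, reported above as the inference-time decoding strategy, keeps the smallest set of tokens whose cumulative probability reaches `top_p`, renormalizes, and samples from that set. A generic sketch of the technique (not the paper's code):

```python
import numpy as np

def nucleus_sample(logits: np.ndarray, top_p: float = 0.95,
                   rng=None) -> int:
    """Sample one token id from the top-p nucleus of the distribution."""
    rng = rng or np.random.default_rng()
    probs = np.exp(logits - logits.max())      # stable softmax
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]            # most probable first
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, top_p) + 1   # tokens kept in nucleus
    keep = order[:cutoff]
    kept = probs[keep] / probs[keep].sum()     # renormalize
    return int(rng.choice(keep, p=kept))
```

With top-p = 0.95, very low-probability motion tokens are pruned at each step, which trims implausible maneuvers without collapsing to greedy decoding.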
The model is evaluated using metrics such as miss rate, mAP, soft mAP, minADE, minFDE, and prediction overlap. Performance is reported on the WOMD interactive validation set for different interactive attention frequencies and numbers of rollouts per replica, with strong results across all metrics.
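The minADE and minFDE metrics named above have standard definitions that are easy to state in code. A sketch under the usual conventions (WOMD's official implementation adds joint scene-level variants and miss-rate distance thresholds not reproduced here):

```python
import numpy as np

def min_ade_fde(preds: np.ndarray, gt: np.ndarray):
    """minADE / minFDE over K candidate trajectories.

    preds: (K, T, 2) predicted trajectories, gt: (T, 2) ground truth.
    minADE = best average point-wise displacement over the horizon;
    minFDE = best final-point displacement.
    """
    err = np.linalg.norm(preds - gt[None], axis=-1)  # (K, T)
    ade = err.mean(axis=1).min()
    fde = err[:, -1].min()
    return float(ade), float(fde)
```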
Visualizations are provided in the supplementary material, showcasing the model's predictions in various scenarios. Examples are shown for marginal vs. joint predictions, marginal vs. conditional predictions, and temporally causal vs. acausal conditioning.
In conclusion, the paper presents MotionLM as a method for interactive motion forecasting that achieves state-of-the-art performance on the WOMD interactive prediction challenge.
853 word summary
MotionLM is a model for multi-agent motion forecasting in autonomous vehicles. It represents continuous trajectories as sequences of discrete motion tokens and treats motion prediction as a language modeling task. The model has several advantages: it does not require anchors or explicit latent variable optimization, it produces joint distributions over interactive agent futures in a single decoding process, and it enables temporally causal conditional rollouts. MotionLM achieves state-of-the-art performance on the Waymo Open Motion Dataset, ranking first in the interactive challenge.
Modern sequence models often use next-token prediction objectives without domain-specific assumptions. MotionLM leverages this approach by representing continuous trajectories as discrete motion tokens, similar to language model vocabularies. The goal is to capture the likely maneuvers and responses of road agents by using sequence models to forecast their behavior. Marginal predictions, which treat agent behavior as independent and ignore interaction dependencies, are insufficient for planning systems. Existing joint prediction approaches either separate marginal trajectory generation from interaction scoring or do not model temporal dependencies within trajectories.
To address these limitations, MotionLM combines trajectory generation and interaction modeling in a single decoding process. The model is trained to maximize the log probability of token sequences among interacting agents. At inference time, joint trajectories are produced step-by-step, with interacting agents sampling tokens simultaneously. The model can be applied to various behavior prediction tasks, including marginal, joint, and conditional predictions.
The performance of MotionLM is evaluated on the Waymo Open Motion Dataset. In the marginal prediction challenge, the model ranks second in soft mAP and achieves a substantially improved miss rate compared to previous works. In the interactive prediction challenge, MotionLM outperforms previous models, achieving a 6% relative improvement in mAP and a 3% relative improvement in miss rate.
The model architecture consists of an encoder that processes scene elements and a trajectory decoder that generates sequences of motion tokens. The scene encoder encodes information from various modalities, such as roadgraph elements and traffic light states. The trajectory decoder generates discrete motion tokens for multiple agents in a temporally causal manner. The tokens are sampled from a learned distribution and combined with attention mechanisms to produce joint trajectories.
MotionLM enforces temporal causality by preserving the order of token sampling. This allows for conditional rollouts that resemble causal interventions. The model also utilizes rollout aggregation to identify underlying modes of the joint future distribution and estimate their probabilities. Ensembling is used to account for epistemic uncertainty and improve predictions.
In experiments, MotionLM achieves competitive performance on marginal motion prediction and state-of-the-art results on interactive motion prediction. The model's interactive attention frequency and the number of rollouts generated have a significant impact on performance, with higher frequencies and more rollouts leading to better predictions.
Overall, MotionLM provides an effective and flexible approach to multi-agent motion forecasting by treating it as a language modeling task. The model surpasses previous methods in capturing multimodal future behavior and accurately predicting agent interactions.
The paper introduces a method for interactive motion forecasting called MotionLM, which leverages multi-agent rollouts over discrete motion tokens to capture the joint distribution over multimodal futures. The model establishes new state-of-the-art performance on the WOMD interactive prediction challenge.
The model uses a vocabulary of motion tokens and discretized delta action space to represent the possible actions of each agent. The sequence length for 8-second futures is 16 motion tokens per agent. The model utilizes a scene encoder for scene encoding and a trajectory decoder for autoregressively decoding motion token sequences.
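The discretized delta-action tokenization described above can be sketched as nearest-neighbor quantization of per-step displacements against a fixed vocabulary. The vocabulary layout below is hypothetical; only the idea (16 delta tokens per agent covering an 8-second future) comes from the paper:

```python
import numpy as np

def tokenize_trajectory(xy: np.ndarray, bins: np.ndarray) -> np.ndarray:
    """Quantize a continuous trajectory into discrete motion tokens.

    xy:   (T+1, 2) positions sampled at the token rate
          (e.g. 16 steps over 8 s -> 2 Hz).
    bins: (V, 2) vocabulary of delta actions (assumed layout).
    Returns (T,) token ids: nearest vocabulary entry per (dx, dy).
    """
    deltas = np.diff(xy, axis=0)                             # (T, 2)
    d = np.linalg.norm(deltas[:, None] - bins[None], axis=-1)
    return d.argmin(axis=1)                                  # (T,)
```

Decoding inverts the map: look up each token's delta and cumulatively sum from the last observed position to recover a continuous trajectory.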
The paper presents the model's optimization and implementation details: training uses teacher forcing with hyperparameters such as learning rate, number of layers, hidden size, and number of attention heads, and inference uses nucleus sampling with a top-p parameter of 0.95.
The model is evaluated using metrics such as miss rate, mAP, soft mAP, minADE, minFDE, and prediction overlap. Performance is reported on the WOMD interactive validation set for different interactive attention frequencies and numbers of rollouts per replica, with strong results across all metrics.
The paper also reports ablations over interactive attention frequency and number of rollouts, a scaling analysis of performance across different parameter counts, and a latency analysis measuring inference time for different numbers of rollouts.
Visualizations are provided in the supplementary material, showcasing the model's predictions in various scenarios. Examples are shown for marginal vs. joint predictions, marginal vs. conditional predictions, and temporally causal vs. acausal conditioning.
In conclusion, the paper presents MotionLM as a method for interactive motion forecasting that achieves state-of-the-art performance on the WOMD interactive prediction challenge. The model leverages multi-agent rollouts and discrete motion tokens to capture the joint distribution over multimodal futures. The evaluation of the model demonstrates its effectiveness in generating accurate and diverse predictions. Future work includes leveraging the trained model in model-based planning frameworks and exploring distillation strategies from large autoregressive teachers.