Summary: Language Models Represent Space and Time (arxiv.org)
7,058 words - PDF document
One Line
Large language models learn linear representations of space and time, with improved performance as the models increase in size.
Key Points
- Large language models (LLMs) acquire structured knowledge about fundamental dimensions such as space and time.
- LLMs learn linear representations of space and time across multiple scales, which are robust and unified across different entity types.
- Probing experiments show that spatial and temporal representations are built throughout the early layers of LLMs before plateauing.
- Larger LLMs consistently outperform smaller ones in predicting real-world location or time.
- Individual neurons within LLMs are highly sensitive to the true location of entities, providing evidence that LLMs make use of spatial and temporal features.
Summaries
17 word summary
Large language models (LLMs) learn linear representations of space and time, with better performance in larger models.
72 word summary
Researchers found that large language models (LLMs) acquire knowledge about space and time, learning linear representations across multiple scales. Spatial and temporal representations are built in early layers before plateauing. Larger models outperform smaller ones, and the representations are robust and unified across entity types. Probes generalized to held-out regions better than chance, recovering relative but not absolute positions. Individual neurons are highly sensitive to entities' true locations, evidence that the models use spatial and temporal features.
147 word summary
Researchers studied large language models (LLMs) to determine whether they acquire knowledge about space and time. The study found evidence that LLMs learn linear representations of space and time across multiple scales. Probing experiments revealed that spatial and temporal representations are built throughout the early layers of the models before plateauing. Larger models outperformed smaller ones, and the representations were linear and robust to changes in prompting. The representations were also unified across different types of entities. Robustness checks showed that generalization to held-out regions was better than random, with probes recovering relative but not absolute positions. Individual neurons within the LLMs were highly sensitive to the true location of entities, supporting the idea that the models make use of spatial and temporal features. The study contributes to understanding how LLMs model the world and suggests further investigation into spatial and temporal representations in LLMs.
369 word summary
Researchers conducted a study to analyze the learned representations of spatial and temporal datasets in large language models (LLMs) to determine if LLMs acquire structured knowledge about fundamental dimensions such as space and time. They found evidence that LLMs learn linear representations of space and time across multiple scales and identified individual "space neurons" and "time neurons" that reliably encode spatial and temporal coordinates, providing further evidence that LLMs learn literal world models.
The researchers constructed six datasets containing names of places or events with corresponding space or time coordinates. Probing experiments revealed that models build spatial and temporal representations throughout the early layers before plateauing at around the halfway point. Larger models consistently outperformed smaller ones, and the representations were found to be linear and robust to changes in prompting. The representations were also unified across different types of entities.
To verify their findings, the researchers conducted robustness checks and found that while generalization performance suffered when specific blocks of data were held out, it was still better than random. The probes correctly generalized by placing points in the correct relative position but not in their absolute position. The researchers also trained probes with fewer parameters, which supported the idea that LLMs explicitly represent space and time but require more parameters to convert from the model's coordinate system to literal spatial coordinates or timestamps.
Additionally, the researchers discovered individual neurons within the LLMs that were highly sensitive to the true location of entities in space or time. These neurons were themselves fairly predictive feature probes, providing strong evidence that the models have learned and make use of spatial and temporal features.
The study contributes to the understanding of how LLMs model the world and supports the view that they learn more than superficial statistics. The researchers suggest future work to further investigate the structure and use of spatial and temporal representations in LLMs, as well as explore the potential for sparse autoencoders to extract representations in the model's coordinate system. They also highlight the need for methods to identify when a model recognizes a particular entity beyond specific prompts and recommend studying the training process to understand how spatial and temporal features are learned, recalled, and used internally.
508 word summary
Large language models (LLMs) have sparked debate over their capabilities and whether they merely learn superficial statistics or a coherent model of the data generating process. In a study, researchers analyzed the learned representations of spatial and temporal datasets in the Llama-2 family of models to determine if LLMs acquire structured knowledge about fundamental dimensions such as space and time. The researchers found evidence that LLMs learn linear representations of space and time across multiple scales. These representations are robust to prompting variations and unified across different entity types. They also identified individual "space neurons" and "time neurons" that reliably encode spatial and temporal coordinates, providing further evidence that LLMs learn literal world models.
The researchers constructed six datasets containing the names of places or events with corresponding space or time coordinates. These datasets spanned multiple spatiotemporal scales, including locations within the whole world, the United States, and New York City, as well as the death year of historical figures, the release date of art and entertainment, and the publication date of news headlines. They used linear regression probes on the internal activations of the names of these places and events at each layer of the Llama-2 models to predict their real-world location or time.
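A minimal sketch of this layerwise probing setup (not the authors' code; array names such as `acts_by_layer` and `coords` are assumed for illustration):

```python
# Minimal sketch of the layerwise linear probing described above.
# Assumed inputs (illustrative, not from the paper's code):
#   acts_by_layer: list of (n_entities, d_model) activation arrays, one per layer
#   coords:        (n_entities, 2) lat/lon pairs, or (n_entities,) timestamps
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

def probe_layer(acts: np.ndarray, coords: np.ndarray) -> float:
    """Fit a linear probe from hidden states to coordinates; return held-out R^2."""
    X_tr, X_te, y_tr, y_te = train_test_split(acts, coords, random_state=0)
    probe = Ridge(alpha=1.0).fit(X_tr, y_tr)
    return probe.score(X_te, y_te)

# Probe quality per layer, e.g. to see where representations plateau:
# scores = [probe_layer(acts, coords) for acts in acts_by_layer]
```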
The probing experiments revealed that models build spatial and temporal representations throughout the early layers before plateauing at around the halfway point. Larger models consistently outperformed smaller ones. The representations were found to be linear, as nonlinear probes did not perform better. The representations were also fairly robust to changes in prompting and were unified across different types of entities.
To verify their findings, the researchers conducted robustness checks. They found that while generalization performance suffered when specific blocks of data were held out, it was still better than random. The probes correctly generalized by placing points in the correct relative position but not in their absolute position. The researchers also trained probes with fewer parameters by projecting the activation datasets onto their k largest principal components. The performance of these probes supported the idea that LLMs explicitly represent space and time but require more parameters to convert from the model's coordinate system to literal spatial coordinates or timestamps.
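A hedged sketch of those reduced-parameter probes, assuming the activations are projected onto their k largest principal components before the linear fit (names like `acts` and `coords` are again illustrative):

```python
# Reduced-parameter probe: project activations onto k principal components
# first, then fit the linear probe in that k-dimensional subspace.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

def low_rank_probe_score(acts: np.ndarray, coords: np.ndarray, k: int) -> float:
    X_tr, X_te, y_tr, y_te = train_test_split(acts, coords, random_state=0)
    pca = PCA(n_components=k).fit(X_tr)  # fit PCA on the training split only
    probe = Ridge(alpha=1.0).fit(pca.transform(X_tr), y_tr)
    return probe.score(pca.transform(X_te), y_te)

# Sweeping k indicates how many linear directions suffice to decode coordinates:
# scores = {k: low_rank_probe_score(acts, coords, k) for k in (2, 10, 50, 100)}
```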
Additionally, the researchers discovered individual neurons within the LLMs that were highly sensitive to the true location of entities in space or time. These neurons were themselves fairly predictive feature probes, providing strong evidence that the models have learned and make use of spatial and temporal features.
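One simple way to screen for such neurons (an assumed approach for illustration, not necessarily the authors' exact method) is to rank individual hidden units by how strongly each one alone correlates with an entity's coordinate:

```python
# Rank single neurons by absolute Pearson correlation with a coordinate.
import numpy as np

def top_coordinate_neurons(acts: np.ndarray, coord: np.ndarray, n: int = 10):
    """acts: (n_entities, d_model) activations; coord: (n_entities,), e.g. longitude.
    Returns indices of the n neurons most correlated with the coordinate."""
    acts_c = acts - acts.mean(axis=0)
    coord_c = coord - coord.mean()
    corr = (acts_c * coord_c[:, None]).sum(axis=0) / (
        np.linalg.norm(acts_c, axis=0) * np.linalg.norm(coord_c) + 1e-12
    )
    return np.argsort(-np.abs(corr))[:n]
```

Each neuron surfaced this way can then be evaluated on its own as a one-dimensional probe of the coordinate.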
The study contributes to the understanding of how LLMs model the world and supports the view that they learn more than superficial statistics. The researchers suggest future work to further investigate the structure and use of spatial and temporal representations in LLMs, as well as explore the potential for sparse autoencoders to extract representations in the model's coordinate system. They also highlight the need for methods to identify when a model recognizes a particular entity beyond specific prompts and recommend studying the training process to understand how spatial and temporal features are learned, recalled, and used internally.