Summary of PIGEON Predicting Image Geolocations with Deep Learning

One Line

PIGEON is a powerful deep multi-task model that combines semantic geocell creation, CLIP vision transformer pretraining, and ProtoNet refinement to achieve impressive image geolocalization results.

Slides

Slide Presentation (13 slides)

PIGEON: Predicting Image Geolocations with Deep Learning

Source: arxiv.org - PDF - 11,881 words

Introduction

• PIGEON is a deep multi-task model for Street View image geolocalization.

• Incorporates semantic geocell creation, CLIP vision transformer pretraining, and ProtoNet refinement.

• Impressive results achieved.

Dataset and Data Acquisition

• Description of dataset and data acquisition process.

• Six-step process of PIGEON approach outlined.

• Results presented, including distance-based metrics and augmented dataset metrics.

Geolocation Factors

• Street View images used to infer factors such as income, race, education, and voting patterns.

• Previous work combined Street View images with landmarks, indoor images, or aerial images.

• Geolocalizing objects within images and considering various factors.

Semantic Geocells and Location Prediction

• Designing semantic geocells using planet-scale open-source administrative data.

• Influenced by road markings, infrastructure quality, and natural boundaries.

• Addressing trade-off between geocell granularity and predictive accuracy through label smoothing.

Multi-Task Training

• Utilizing different task categories to train the model on relevant features correlated with geolocation.

• Categories include location, climate, compass direction, season, and traffic.

• Enhancing model's ability to predict image geolocations.

Ablation Study

• Evaluating the impact of various methodological contributions on geolocalization accuracy.

• Findings on label smoothing, four-image panorama, multi-task parameter sharing, semantic geocells, and CLIP.

• Understanding the importance of each contribution.

Performance Comparison

• Performance evaluation of PIGEON model compared to human players in GeoGuessr.

• PIGEON outperforms human players and achieves top rankings globally.

• Demonstrating the model's superior geolocation prediction capabilities.

Interpretability and Feature Attention

• Improving model interpretability by filtering outliers and squaring relevancy scores.

• Model pays attention to features like vegetation, road markings, utility posts, and signage.

• Enhancing performance in GeoGuessr and addressing player preferences.

State-of-the-Art Performance

• Performance of PIGEON model on image geolocation benchmark datasets.

• Achieving state-of-the-art results in zero-shot settings.

• Potential for solving problems in various domains.

Future Extensions

• Suggestions for future work and extensions to the PIGEON project.

• Expanding on the model's capabilities and applications.

• Continual improvement and advancement in image geolocation prediction.

Summary and Main Message

• PIGEON is a powerful deep multi-task model for image geolocalization.

• Incorporating semantic geocell creation, CLIP vision transformer pretraining, and ProtoNet refinement.

• Outperforming human players and achieving state-of-the-art performance.

• Deep learning holds great potential in predicting image geolocations.

[Note: Visuals such as graphs, images, and charts can be included in relevant slides to enhance understanding and engagement.]

Key Points

PIGEON is a deep multi-task model for planet-scale Street View image geolocalization.
The model incorporates semantic geocell creation, pretraining of a CLIP vision transformer, and refinement of location predictions with ProtoNets.
The model achieves state-of-the-art performance in zero-shot settings and outperforms human players in the online game GeoGuessr.
The study explores the use of deep learning to predict image geolocations and utilizes different task categories to train the model on relevant features correlated with geolocation.
The model pays attention to features like vegetation, road markings, utility posts, and signage for better performance in GeoGuessr.

Summaries

29 word summary

PIGEON is an effective deep multi-task model for image geolocalization, combining semantic geocell creation, CLIP vision transformer pretraining, and ProtoNet refinement. Its impressive results are showcased in Section 5.

34 word summary

PIGEON is a deep multi-task model for image geolocalization that incorporates semantic geocell creation, CLIP vision transformer pretraining, and ProtoNet refinement. The model achieves impressive results, as demonstrated in Section 5 of the document.

531 word summary

PIGEON is a deep multi-task model for planet-scale Street View image geolocalization. It incorporates semantic geocell creation, pretraining of a CLIP vision transformer, and refinement of location predictions with ProtoNets. The model achieves

In Section 3, the dataset and data acquisition process are described. Section 4 outlines the six-step process of PIGEON, the proposed approach. Section 5 presents the results, including distance-based metrics and other metrics for the augmented dataset

Methods using Street View images have shown potential in inferring factors such as income, race, education, and voting patterns. Previous work often combined Street View images with landmarks, indoor images, or aerial images. Geolocalizing objects within images and considering

The authors of the document propose a method for predicting image geolocations using deep learning. They use planet-scale open-source administrative data to design semantic geocells, which are influenced by factors such as road markings, infrastructure quality, and natural boundaries.

Label Smoothing: Discretizing image geolocalization creates a trade-off between geocell granularity and predictive accuracy. We address this by devising a loss function that penalizes based on the distance between predicted and correct geocells. Sm

The study explores the use of deep learning to predict image geolocations. The model utilizes different task categories, such as location, climate, compass direction, season, and traffic, to train the model on relevant features correlated with geolocation. The authors

In this study, the authors conducted an ablation study to evaluate the impact of various methodological contributions on geolocalization accuracy. They found that label smoothing, four-image panorama, multi-task parameter sharing, semantic geocells, and CLIP

We evaluate the performance of PIGEON, a deep learning model for predicting image geolocations, in comparison to human players in the online game GeoGuessr. PIGEON outperforms human players and achieves top rankings among global players

We improved the interpretability of our model by filtering out outliers and squaring relevancy scores. The model pays attention to features like vegetation, road markings, utility posts, and signage, which are important for GeoGuessr players. However, there were

The document discusses the performance of the PIGEON model on image geolocation benchmark datasets. The model achieves state-of-the-art performance in zero-shot settings, indicating its potential for solving problems in various domains. The future work section suggests several extensions to

This text excerpt includes a list of references and URLs related to the topic of predicting image geolocations with deep learning. The references cover various aspects of the subject, including AI learning GeoGuessr, attention-model explainability, mapping the world's photos

This text excerpt is a list of references and URLs for various research papers and articles related to image geolocation. The references cover topics such as cross-view image geolocalization, location encoding for GeoAI, deep visual place recognition, building energy efficiency

The document provides additional information about the PIGEON project. It is divided into separate sections that cover different aspects of the project. Section A discusses the data sources and visualizes the data used for dataset augmentation. Section B describes the process of obtaining

We obtained driving side of the road data and a list of one million locations from GeoGuessr. We randomly sampled 100,000 locations for our dataset, maintaining the distribution of countries. We queried the Street View API to obtain location metadata and downloaded

Raw indexed text (79,213 chars / 11,881 words / 1,560 lines)

PIGEON: Predicting Image Geolocations

Lukas Haas 1 Michal Skreta 1 Silas Alberti 2

Extended Abstract

Planet-scale image geolocalization remains a challenging problem, necessitating fine-grained understanding of

visual information across countries, environments, and time. Although traditional retrieval-based approaches

using hand-crafted features have recently been superseded by deep learning methods, transformer-based advances

in machine learning have rarely been applied in image geolocalization.

We introduce PIGEON, a novel deep multi-task model for planet-scale Street View image geolocalization that

incorporates, inter alia, semantic geocell creation with label smoothing, conducts pretraining of a CLIP vision

transformer on Street View images, and refines location predictions with ProtoNets across a candidate set of

geocells. Our work presents three major contributions: first, we design a semantic geocells creation and splitting

algorithm based on open-source data which can be adapted to any geospatial dataset. Second, we show the

effectiveness of intra-geocell few-shot refinement and the applicability of unsupervised clustering and ProtoNets

to the task. Finally, we make our pre-trained CLIP transformer model, StreetCLIP, publicly available for use in

adjacent domains with applications to fighting climate change and urban and rural scene understanding.

Motivated by the rising popularity of the online game GeoGuessr, with over 50 million players worldwide, we

focus specifically on Street View images and create the first AI model which consistently beats human players in

GeoGuessr, ranking in the top 0.01% of players.

In addition to our novel modeling approach, we create a new planet-scale dataset for image geolocalization of

400,000 images. Our model achieves impressive results, aided by positive multi-task transfer in both an implicit

and explicit multi-task setting. We attain 91.96% country accuracy on our held-out set and 40.36% of our guesses

are within 25 km of target.

One of the most important results of our work is demonstrating the domain generalization of our pre-trained CLIP

model called StreetCLIP (Haas et al., 2023) and its robustness to distribution shifts. We apply StreetCLIP in a

zero-shot fashion to out-of-distribution benchmark datasets IM2GPS and IM2GPS3k and achieve state-of-the-art

results, beating models finetuned on more than four million in-distribution images.

Finally, we show that contrastive pretraining is an effective meta-learning technique for image geolocalization with

StreetCLIP realizing a more than 10 percentage points accuracy increase over CLIP on countries not seen during

StreetCLIP-specific pretraining. With image geolocalization datasets varying widely in terms of geographical

distribution, our results demonstrate the effectiveness of applying StreetCLIP to any geolocalization and related

problem.

1. Introduction

The game of GeoGuessr has become a worldwide sensation

in recent years, attracting over 50 million players globally

and getting covered by the New York Times (Browning,

2022). On its surface, GeoGuessr seems quite simple: given

a Street View location, players need to say where they find

themselves in the world. Yet despite this seeming simplicity,

the game is infamously difficult. As a result of the diversity

of countries, seasons, and climates in the world, it is very

hard for most humans to accurately pinpoint their locations.

1 Department of Computer Science, Stanford University,

Stanford, CA, USA. 2 Department of Electrical Engineering,

Stanford University, Stanford, CA, USA. Correspondence

to: Lukas Haas, Michal Skreta, Silas Alberti.

Early preprint.

Motivated by GeoGuessr, we embarked on finding a

state-of-the-art approach to planet-scale image geolocalization.

The general problem of photo geolocation has a variety of

popular use cases, ranging from geographic photo tagging

and retrieval at large technology companies to academic,

historical research based on archival images. The societal

interest in artificial intelligence being able to recognize lo-

cation from images became clear in 2016, when a paper

published by Google garnered worldwide coverage by the

media (Weyand et al., 2016). Given the rising popularity

of GeoGuessr, numerous amateur attempts have been made

at “solving” the game (Suresh et al., 2018; de Fontnouvelle,

2021; Cassens, 2022). There is also an additional incen-

tive to contribute to a growing community of geography

enthusiasts, with AI models having the potential to improve

geography education and the potential of the learned Street

View representations to be beneficial for applications in

sustainability, e.g. the prediction of buildings’ energy efficiency

(Mayer et al., 2022).

In this work, we present PIGEON, a model trained on Street

View data drawn from the same distribution as GeoGuessr,

achieving impressive image geolocalization results and

consistently beating humans in the game of GeoGuessr,

ranking amongst the top players globally. Some of our work’s

major contributions revolve around the use of CLIP, a recent

multi-modal vision transformer which has been shown to be

an effective few-shot learner (Radford et al., 2021), which

is important given the geographical sparsity of images in

most image geolocalization datasets. As such, our work

innovates on approaches still leveraging convolutional neural

networks (CNNs) such as (Weyand et al., 2016).

The remainder of this paper proceeds as follows. In Section

2, we outline past approaches to the problem of image

geolocalization. In Section 3, we describe our dataset and the

process of acquiring and augmenting our data. In Section

4, we discuss our proposed approach, outlining the six-step

process comprising PIGEON. In Section 5, we present our

results, discussing both distance-based metrics pertaining to

our main image geolocalization task as well as other metrics

relevant for our augmented dataset. In Section 6, we analyze

the particularities of the performance of our model while

attempting to interpret some predictions of the model. Section

7 summarizes our work, and Section 8 outlines potential

future directions for our research.

2. Related Work

2.1. Traditional Image Geolocalization

The task of image geolocalization, also referred to as

visual place recognition (Berton et al., 2022a), is typically

described as a difficult problem due to the sheer diversity

of the conditions in which images are taken. An image can

be taken during daytime or nighttime, with varying weather,

illumination, season, traffic, occlusion, viewing angle, and

many other factors. In fact, the task is deemed so difficult

that it was not immediately clear that visual features could

have superior predictive power in localizing images than

textual features (Crandall et al., 2009).

What is perhaps even more challenging, however, is the fact

that images can be taken anywhere in the world, representing

an extremely vast classification space. To that end, many

of the previous approaches to image geolocalization were

constrained to particular types of parts of the world, such as

looking exclusively at cities (Wu & Huang, 2022), specific

mountain ranges like the Alps (Baatz et al., 2012; Saurer

et al., 2016; Tomešek et al., 2022), deserts (Tzeng et al.,

2013), or even beaches (Cao et al., 2012). Other approaches

focused on highly constrained geographical areas, such as

the United States (Suresh et al., 2018) or even specific cities

like Pittsburgh and Orlando (Zamir & Shah, 2010) or San

Francisco (Berton et al., 2022a).

The first modern attempt at planet-scale image geolocalization

is attributed to IM2GPS in 2008 (Hays & Efros, 2008),

a retrieval-based approach using nearest-neighbor search

based on hand-crafted features. It was the first time that

image geolocalization was considered in an unconstrained

manner on a global scale. Yet despite this scale, dependence

on nearest-neighbor retrieval methods (Zamir & Shah, 2014)

meant that an enormous database of reference images would

be necessary for accurate image geolocalization on the scale

of the entire planet.

2.2. Deep Image Geolocalization

2.2.1. Convolutional Neural Networks (CNNs)

Interest in image geolocalization surged with the arrival

of deep learning to computer vision, marking an evolution

from hand-crafted to deep-learned features (Masone & Ca-

puto, 2021). In 2016, Google released a paper called PlaNet

(Weyand et al., 2016) that first applied convolutional neu-

ral networks (CNNs) (Krizhevsky et al., 2012) to photo

geolocalization. It also first cast the problem as a classifica-

tion task, which was particularly important as past research

has shown that it was difficult for deep learning models

to directly predict geographic coordinates (de Brebisson

et al., 2015) because most models do not learn the distri-

butions of data points efficiently, as well as because of the

interdependence of latitude and longitude. The improve-

ments made with deep learning led researchers to revisit

IM2GPS (Vo et al., 2017), apply CNNs to massive datasets

on mobile images (Howard et al., 2017), and make applications

to GeoGuessr more widespread (Suresh et al., 2018;

Luo et al., 2022). Nevertheless, some researchers argue

for using approaches combining classification and retrieval

(Kordopatis-Zilos et al., 2021).

2.2.2. Vision Transformers

Following the success of transformers (Vaswani et al., 2017)

in natural language processing, the transformer architecture

found its application to computer vision, such as through

the ViT architecture (Kolesnikov et al., 2021). The global

context of ViT architectures explains immediate significant

improvements compared with CNNs (Raghu et al., 2021).

Additionally, vision transformers have been found to be useful

in multi-modal text and image settings, such as through

OpenAI’s CLIP model (Radford et al., 2021) being applied

to image geolocalization (Wu & Huang, 2022; Luo

et al., 2022). Prior papers have also used contrastive learning

without the use of CLIP (Kordopatis-Zilos et al., 2021).

Although vision transformers have been successfully applied

to a range of problems in computer science, applications

of these models to image geolocalization have thus far been

fairly limited (Pramanick et al., 2022), but have recently been

accelerating (Berton et al., 2022b). In particular, vision

transformer models have not been widely applied to

the problem of geolocalization from Street View imagery.

2.3. Multi-task Image Geolocalization

Multi-task approaches have been found to improve results

on the main task by using complementary tasks (Ranjan

et al., 2016), with certain types of tasks being more beneficial

for the main task than others (Bingel & Søgaard, 2017).

This, coupled with the fact that auxiliary information was

found to be a vital pre-processing step for image geolocalization

(Pramanick et al., 2022), pointed to the potential of

multi-task learning to significantly accelerate the field of

image geolocalization.

Extracting sets of priors about objects that can potentially

be seen in an image (Ardeshir et al., 2014) can be framed as

ingredients to a multi-task setting, such as by using scene

recognition as a secondary task in a multi-task framework

(Pramanick et al., 2022). By using semantic segmentation,

the problem of extreme variation can be alleviated (Seymour

et al., 2018). In fact, until recently, state-of-the-art perfor-

mance (Müller-Budack et al., 2018) was made possible by

combining convolutional neural networks with contextual

information about environmental scenes. This is particu-

larly important as image geolocalization is very difficult in

natural environments (Tomešek et al., 2022). More recent

work showed that vision transformers and multi-task settings

(Pramanick et al., 2022) contribute to superior performance,

further accelerating research in the field.

2.4. Geocell Partitioning

The chosen method of partitioning the world into geocells

can have an enormous effect on downstream classification

performance. Previous approaches rely on geocells that are

either plainly rectangular (de Fontnouvelle, 2021), rectan-

gular using the S2 library (Müller-Budack et al., 2018), or

effectively arbitrary, such as through combinatorial partitioning

(Seo et al., 2018). While semantic construction of

geocells has been found to be of high importance to image

geolocalization (Theiner et al., 2022), even current state-of-the-art

papers still use the S2 library (Pramanick et al., 2022).

Alternative methods for achieving optimized geocells include

creating specific loss functions for the classification layer

(Izbicki et al., 2019).

2.5. Additional Prior Work

Other prior academic work cited the need for cross-view im-

age geolocalization as photos tend to be concentrated in land-

marks and urban areas with sparse ground level geo-tagged

photos. Cross-view approaches can combine ground-level

appearance, overhead appearance, and land cover attributes

(Lin et al., 2013). What is more, methods using Street View

images have shown incredible potential in inferring factors

such as income, race, education, and voting patterns (Gebru

et al., 2017). In prior work, oftentimes the Street View im-

ages were inputted to the model in conjunction with images

of landmarks (Weyand et al., 2020), images taken indoors,

or cross-viewed with aerial images (Yang et al., 2021; Zhu

et al., 2022). Moreover, recent papers cited the potential

of also geolocalizing objects within images (Wilson et al.,

2021), factoring in the differences in land cover (Rußwurm

et al., 2020), and setting new benchmarks (Berton et al.,

2022b). Further information about work done in image ge-

olocalization can be found in various surveys of the field

(Masone & Caputo, 2021; Wilson et al., 2021; Mai et al.,

2022; Li & Hsu, 2022).

3. Dataset

3.1. Dataset Acquisition

While most image geolocalization approaches rely on pub-

licly available datasets, this is not the case for Street View

given the lack of publicly available planet-scale Street View

datasets.

To that end, we decided to create an original dataset. We

proactively reached out to Erland Ranvinge, the Chief Technology

Officer of GeoGuessr, who generously agreed to

share a dataset of 1 million locations used in the Competitive

Duels mode of GeoGuessr. From the dataset, we randomly

sampled 100,000 of the provided locations, or 10% of the

overall dataset. For each of the locations, we downloaded

four images, ending up with 400,000 images.

The distribution of countries in our training set is displayed

in Figure 20 in Section B of the Appendix. It is also where

the details about our process of querying the Street View

API, including relevant parameters for both Street View

metadata and Street View images, are described. As can

be seen, there are clear “tiers” of countries delineated by

the frequency of sampling, and we denote each tier by a

different color. Approximately 70% of the locations are

in the “high” tier, 24% are in the “medium” tier, and the

remaining 6% are in the “low” tier.

For each location, we start with a random compass direction

and take four images separated by 90 degrees, thus differ-

ing from a single-image setup typically seen in Street View

image geolocalization (de Fontnouvelle, 2021). We care-

fully created non-overlapping image patches like in prior

approaches (Cassens, 2022), and cropped images to remove

auxiliary watermarks.
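The four-view capture described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `panorama_headings` and the convention of compass headings in degrees are assumptions, and the actual Street View API parameters are given in the paper's appendix rather than here.

```python
import random
from typing import List, Optional


def panorama_headings(rng: Optional[random.Random] = None) -> List[float]:
    """Return four compass headings (degrees) spaced 90 degrees apart.

    The first heading is drawn uniformly at random, mirroring the
    "random compass direction" starting point described in the text.
    """
    rng = rng or random.Random()
    start = rng.uniform(0.0, 360.0)
    # Wrap around at 360 so every heading stays in [0, 360).
    return [(start + 90.0 * k) % 360.0 for k in range(4)]


headings = panorama_headings(random.Random(0))
```

Each heading would then be used for one image request, so the four images together cover the full 360 degrees around the location.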

Prior work using Street View for GeoGuessr

image geolocalization did not specifically look at data

obtained directly from the GeoGuessr game (Luo et al., 2022),

making our approach particularly novel.

3.2. Image Format

Four images for a sample location in our dataset are vi-

sualized in Figure 1. It is crucial to notice the advantage

of a four-image setting compared to a single-image setting.

Looking at the leftmost image in Figure 1, it mainly contains

information on vegetation, making it difficult to locate the

image with confidence. However, the additional images pro-

vide clues pertaining to roads, buildings and cars, pointing

to the advantages of extending the dataset with additional

images in lieu of taking a single image for each location.

3.3. Dataset Augmentation

Recognizing that adding auxiliary geographic metadata can

be beneficial for image geolocalization (Arbinger et al.,

2022), we decided to augment our dataset with data on

Köppen-Geiger climate zones (Beck et al., 2018), as well

as elevation, temperature, precipitation, etc. We also capture

information frequently used by human GeoGuessr players

in placing their guesses such as the side of the road that

traffic travels on.

Details regarding specific datasets used in our dataset aug-

mentation procedure are described in Section A of the Ap-

pendix.

4. Methodology

This work introduces a variety of technical novelties applied

to the problem of image geolocalization, summarized in the

following subsections.

4.1. Geocell Creation

Prior research has shown that predicting latitudes and longi-

tudes directly for any image geolocalization problem does

not result in state-of-the-art performance (Theiner et al.,

2022). Current methods all rely on the generation of geo-

cells to discretize the coordinate regression problem and

thus transform it into a classification setting, making geocell

design “crucial for performance” (Theiner et al., 2022).

4.1.1. Naive Geocells

Our initial geocell design is inspired by the approach under-

taken by papers that had previously achieved state-of-the-art

results on image geolocalization (Müller-Budack et al., 2018;

Pramanick et al., 2022) using the S2 geometry library. The

S2 geocell algorithm uses numerous rectangles which observe

the curvature of the earth and splits each rectangle into

four equally-sized smaller rectangles if the number of data

points within a given rectangle reaches a pre-defined thresh-

old. Our naive geocell algorithm works in a similar fashion;

it is first initialized with one large rectangle which is in

every subsequent step divided into two rectangles along

the longest side, only dividing a rectangle further if the

two resulting rectangles contain a minimum of thirty points.

Instead of splitting each rectangle into two equally-sized

rectangles, a k-means clustering is performed with k = 2 to

find a decision boundary, only splitting the given rectangle

if the minimum geocell size of thirty training data points

is respected. Figure 2 illustrates the resulting rectangular

geocells derived from our naive geocell creation algorithm

for the metropolitan area of Paris.
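A rough sketch of the recursive splitting just described is given below. It is an illustration under stated assumptions rather than the authors' implementation: coordinates are treated as planar (latitude, longitude) pairs, the rectangle is taken to be the bounding box of the points, and the k = 2 clustering is run in one dimension along the longer side. The names `two_means_boundary` and `naive_geocells` are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (lat, lon), treated as planar for illustration


def two_means_boundary(values: Sequence[float], iters: int = 20) -> float:
    """1-D k-means with k = 2; returns the midpoint between the two centroids."""
    c0, c1 = min(values), max(values)
    for _ in range(iters):
        mid = (c0 + c1) / 2
        left = [v for v in values if v <= mid]
        right = [v for v in values if v > mid]
        if not left or not right:
            break  # all values on one side, nothing to separate
        c0, c1 = sum(left) / len(left), sum(right) / len(right)
    return (c0 + c1) / 2


def naive_geocells(points: List[Point], min_size: int = 30) -> List[List[Point]]:
    """Recursively split a non-empty point set along the longer bounding-box side."""
    lats = [p[0] for p in points]
    lons = [p[1] for p in points]
    axis = 0 if (max(lats) - min(lats)) >= (max(lons) - min(lons)) else 1
    boundary = two_means_boundary([p[axis] for p in points])
    left = [p for p in points if p[axis] <= boundary]
    right = [p for p in points if p[axis] > boundary]
    # Only split if both halves respect the minimum geocell size.
    if len(left) < min_size or len(right) < min_size:
        return [points]
    return naive_geocells(left, min_size) + naive_geocells(right, min_size)
```

Each returned list holds the training points of one geocell; the corresponding rectangle is the bounding box of those points.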

4.1.2. Semantic Geocells

A major contribution of this work is the generation

of semantic geocells which automatically adapt

based on the geographic distribution of any training dataset

samples. The motivation behind a semantic geocell design

is that visual features in images often follow the semantics

of the given country (e.g. road markings), region (e.g. quality

of infrastructure), or city (e.g. street signs). In addition,

country or administrative boundaries often follow natural

borders such as the flow of rivers or mountain ranges, which

in turn influence visual features such as the type of vegetation,

soil color, and more.

We use planet-scale open-source administrative data for our

semantic geocell design, relying on non-overlapping political

shape files of three levels of administrative boundaries (coun-

try, admin 1, and admin 2 levels) obtained from (GADM,

2022). Starting at the most granular level (admin 2), our

algorithm merges adjacent admin 2 level polygons such

that each geocell contains at least thirty training samples.

Our method attempts to preserve the hierarchy given by

admin 1 level boundaries and never merges cells across

country borders (defined by distinct ISO country codes).

It randomly merges geocells with adjacent cells using the

following prioritization:

1. Small adjacent geocells in same admin 1 area.

2. Large adjacent geocells in same admin 1 area.

3. Small adjacent geocells in same country.

4. Large adjacent geocells in same country.

Figure 1. Four images comprising a 360-degree panorama in Pegswood, England in our dataset.

The above prioritization ensures that geocells

containing fewer than the minimum threshold of training

samples are not simply appended to large adjacent

geocells; instead, low-density regions are

aggregated into one larger cell, often surrounding major

metropolitan areas. This further preserves rural and urban

semantics. Figure 2 shows an example of our semantic geo-

cell design preserving the urban area of Paris as well as the

surrounding sub-urban regions.
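The merging priorities listed above can be illustrated with a small sketch on an adjacency graph. Everything here is an assumption made for illustration: the `Cell` record, the `tier` and `merge_small_cells` names, the reading of "small" as "below the minimum sample threshold", and the seeded random choice. Polygon geometry is left out entirely; only sample counts and adjacency are tracked.

```python
import random
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class Cell:
    country: str
    admin1: Optional[str]  # None once a cell spans several admin 1 areas
    count: int             # number of training samples in the cell
    neighbors: Set[str] = field(default_factory=set)  # ids of adjacent cells


def tier(cell: Cell, other: Cell, min_size: int) -> Optional[int]:
    """Priority of merging `other` into `cell` (0 is best); None across borders."""
    if other.country != cell.country:
        return None
    small = other.count < min_size
    if cell.admin1 is not None and cell.admin1 == other.admin1:
        return 0 if small else 1
    return 2 if small else 3


def merge_small_cells(cells: Dict[str, Cell], min_size: int = 30, seed: int = 0):
    """Merge undersized cells in place until none has a same-country neighbor."""
    rng = random.Random(seed)
    while True:
        candidates = sorted(
            cid for cid, c in cells.items()
            if c.count < min_size
            and any(tier(c, cells[n], min_size) is not None for n in c.neighbors)
        )
        if not candidates:
            return cells
        cid = rng.choice(candidates)
        cell = cells[cid]
        by_tier: Dict[int, list] = {}
        for n in sorted(cell.neighbors):
            t = tier(cell, cells[n], min_size)
            if t is not None:
                by_tier.setdefault(t, []).append(n)
        other_id = rng.choice(by_tier[min(by_tier)])
        other = cells.pop(other_id)
        cell.count += other.count
        if cell.admin1 != other.admin1:
            cell.admin1 = None
        # Re-point the absorbed cell's neighbors at the merged cell.
        for n in other.neighbors - {cid}:
            cells[n].neighbors.discard(other_id)
            cells[n].neighbors.add(cid)
        cell.neighbors = (cell.neighbors | other.neighbors) - {cid, other_id}
```

A cell whose only neighbors lie in another country is left undersized, which matches the rule that cells are never merged across country borders.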

One limitation of aggregating admin 2 level areas as de-

fined by (GADM, 2022) is that for some urban areas, the

number of training examples for a single cell might greatly

exceed the minimum sample threshold defined by the algo-

rithm’s user. In addition, through the process of merging

adjacent geocells, some cells might be created which could

be split again into multiple smaller cells based on different

boundaries.

We address this limitation in our geocell design through the

following innovative algorithm, which uses Voronoi tessellation

and the OPTICS clustering algorithm (Ankerst et al.,

1999) to split a geocell into smaller semantic

geocells.

Our Semantic Geocell Division Algorithm uses OPTICS

(Ankerst et al., 1999) to find a large cluster within a cell,

checking whether removing this cluster from the cell would

result in two cells each having a larger number of training

samples than MINSIZE. If this is the case, the new geocell’s

polygon is determined by performing Voronoi Tessellation

over all points in the initial cell as depicted in Figure 3 and

assigning the Voronoi polygons to a new cell containing all

training samples in the computed OPTICS cluster. The area

found through Voronoi Tessellation is then removed from

the old geocell. The splitting is performed until convergence

for each OPTICS parameter setting. In our work, we use

three distinct OPTICS settings with values minsamples = 8,

10, and 15 for the three respective rounds and xi parameters

of 0.05, 0.025, and 0.015 for the same rounds. With each successive setting, the requirements defining a cluster are thus relaxed to find clusters even in cells which are difficult to further divide.

Algorithm 1 Semantic Geocell Division Algorithm

Input: geocell boundaries g, training samples x, OPTICS parameters p, minimum cell size MINSIZE.
Initialize j = 1.
repeat
    Initialize C = OPTICS(p_j).
    for g_i in g do
        Define x_i = {x_j | x_j ∈ x ∧ x_j ∈ g_i}.
        repeat
            Cluster c = C(x_i).
            c_max = c_k where |x_{i,k}| ≥ |x_{i,l}| ∀ l.
            if |c_max| > MINSIZE and |x_i \ x_{i,k}| > MINSIZE then
                New cell g_new = VORONOI(x_{i,k}).
                g_i = g_i \ g_new.
                Assign x_i to cells i and new.
            end if
        until convergence
    end for
    j = j + 1
until j > |p|
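The split test at the heart of the division algorithm can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `propose_split` is hypothetical, cluster labels are assumed to come from a clustering routine that marks noise points with -1 (the convention used by scikit-learn's OPTICS), and the Voronoi step that turns the selected samples into a polygon is omitted.

```python
from collections import Counter


def propose_split(cluster_labels, min_size):
    """Return indices of the largest cluster if both resulting cells stay above min_size.

    cluster_labels holds one label per training sample in the cell; -1 marks noise.
    Returns None when no admissible split exists.
    """
    counts = Counter(label for label in cluster_labels if label != -1)
    if not counts:
        return None  # only noise points, nothing to split off
    largest_label, largest_size = counts.most_common(1)[0]
    remaining = len(cluster_labels) - largest_size
    # Both the new cell and the remainder must exceed the minimum cell size.
    if largest_size > min_size and remaining > min_size:
        return [i for i, label in enumerate(cluster_labels) if label == largest_label]
    return None


labels = [0] * 5 + [1] * 3 + [-1] * 2
print(propose_split(labels, min_size=2))
```

Repeating this test until it returns None corresponds to the inner "until convergence" loop; relaxing the clustering parameters between rounds corresponds to the outer loop.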

Merging geocells according to administrative boundary hi-

erarchies and dividing large cells based on our Semantic

Geocell Division Algorithm results in geocells roughly bal-

anced in size and which also preserve the semantics of cities,

regions, countries, and the natural environment. By deploy-

ing our method to our training dataset, we compute the

boundaries of a total of 2203 geocells used for our experi-

ments.

4.2. Label Smoothing

By discretizing our image geolocalization problem with the help of our semantic geocell creation process, a trade-off is created between the granularity of geocells and predictive

accuracy. The more granular the geocells are, the more

precise a prediction can be but the classification problem

becomes more difficult due to higher cardinality. To address

this issue, we devise a loss function which penalizes predictions based on the distance between the predicted geocell and the correct

geocell. By smoothing the one-hot geocell classification

label according to equation 1, we train our models in a

much more data-efficient way as the parameters for multiple

geocells are trained concurrently with each training example.

Figure 2. Île-de-France area around Paris, France, under different geocell creation specifications: (a) with rectangular geocells; (b) with our semantic geocells.

The value of the smoothed one-hot label L_i for geocell i given the correct geocell c is given by

\[ L_i = \exp\left( -\frac{\mathrm{Hav}(g_i, x_c) - \mathrm{Hav}(g_c, x_c)}{75} \right) \tag{1} \]

where g_i are the centroid coordinates of the geocell polygon of cell i and x_c are the true coordinates of the example for which the label is computed. The constant of 75 acts as a temperature setting for the label smoothing which worked well in our experiments. Hav(·, ·) is the Haversine distance in kilometers defined as:

\[ 2r \arcsin\left( \sqrt{ \sin^2\left( \frac{\phi_2 - \phi_1}{2} \right) + \cos(\phi_1) \cos(\phi_2) \sin^2\left( \frac{\lambda_2 - \lambda_1}{2} \right) } \right) \tag{2} \]

One advantage of using the Haversine distance between two

points is that it respects the Earth’s spherical geometry, giv-

ing accurate estimates of the distance between two points.

Figure 4 demonstrates the results of smoothing geocell la-

bels which ideally results in lower geolocalization errors at

the cost of slightly lower geocell prediction accuracy due to

the added noise in the label.
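Equations 1 and 2 can be illustrated with a short Python sketch. The function names, the mean Earth radius of 6371 km used for r, and the example coordinates are assumptions made for illustration; the sketch returns the unnormalized label values exactly as Equation 1 states them.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed value for r (mean Earth radius)
TEMPERATURE_KM = 75.0     # constant from Equation 1


def haversine_km(p1, p2):
    """Haversine distance in km between two (latitude, longitude) pairs in degrees."""
    lat1, lon1 = map(math.radians, p1)
    lat2, lon2 = map(math.radians, p2)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    # Clamp to 1.0 to guard against rounding just above the arcsin domain.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def smoothed_labels(centroids, correct_idx, true_coords):
    """Smoothed label L_i for every geocell centroid, following Equation 1."""
    d_correct = haversine_km(centroids[correct_idx], true_coords)
    return [math.exp(-(haversine_km(g, true_coords) - d_correct) / TEMPERATURE_KM)
            for g in centroids]


# Three illustrative centroids; the first one is the correct geocell.
centroids = [(60.0, 25.0), (59.0, 18.0), (-33.9, 18.4)]
labels = smoothed_labels(centroids, correct_idx=0, true_coords=(60.2, 24.9))
print(labels)
```

The correct geocell always receives a value of exactly 1, and the value decays exponentially with the extra distance of any other centroid from the true location, so nearby cells receive noticeable weight while distant cells receive almost none.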

By combining our semantic geocell design with label smoothing, we optimize for our model to spread probabilities across semantically similar and adjacent cells. Figure 5 shows the distribution of probabilities of our best model for a true location close to the sea in Jakobstad, Finland. Notably, our semantic geocell design and label smoothing result in

our model placing high probabilities on semantically similar

cells adjacent to the Gulf of Bothnia in Scandinavia.

4.3. Vision Transformer (CLIP)

Figure 3. Voronoi tessellation applied in the process of geocell

creation.

The input image is encoded using a pre-trained vision transformer (Kolesnikov et al., 2021). We use a pre-trained ViT-L/14 architecture and either fine-tune only the prediction heads or additionally unfreeze the last vision transformer layer. For

model versions with multiple image inputs, we average the

embeddings of all four images. Averaging the embeddings

performed better during our experiments than combining

the embeddings via multi-head attention or an additional

transformer layer.

We were particularly interested in exploring the effect of the type of pretraining on downstream performance. We compare a ViT-L/16 that was pre-trained on ImageNet-21k with 14 million images (Deng et al., 2009) with CLIP ViT-L/14, a multi-modal model that was pre-trained contrastively on a dataset of 400 million image-caption pairs (Radford et al., 2021).

Figure 4. Impact of applying label smoothing over neighboring geocells for a location in Accra, Ghana: (a) without label smoothing; (b) with label smoothing.

Based on our priors and commonly observed strategies by

professional GeoGuessr players, there are a variety of rel-

evant features for the image location task, e.g., vegetation,

road markings, street signs, and architecture. We hypothe-

size that the multi-modal pre-training creates embeddings

with a much deeper semantic understanding of the image,

enabling it to learn such features. As we show later, the

CLIP vision transformer gives a substantial improvement

over a comparable ImageNet vision transformer. Using attention maps, we can indeed show how this enables the model to learn these strategies in an interpretable way.

4.4. StreetCLIP Contrastive Pretraining

Figure 5. Distribution of probabilities over geocells for a true loca-

tion in Jakobstad, Finland.

Inspired by the substantial improvement that we observed

from using CLIP’s contrastive pre-training over the Ima-

geNet pre-trained vision transformer, we explored designing

a contrastive pre-training task that we could use to fine-tune

our CLIP foundation model even before learning the geocell

prediction head.

Figure 6. Contrastive pretraining of StreetCLIP (Haas et al., 2023) in an implicit multi-task setting using images from Varzea Grande, Mato Grosso, Brazil.

For that, we augment our Street View dataset with geographic, demographic, and geological auxiliary data. This data is used to create randomized captions for each image

using a rule-based system that samples components from

different task categories and combines them in a randomized

order. The probabilities for each category are adjusted based

on priors. Some examples of categories & corresponding

caption components include:

• Location: “A Street View photo in the region of Eastern

Cape in South Africa.”

• Climate: “This location has a temperate oceanic cli-

mate.”

• Compass Direction: “This photo is facing north.”

• Season: “This photo was taken in December.”

• Traffic: “In this location, people drive on the left side

of the road.”

This creates an implicit multi-task setting and ensures the

model maintains rich representations of the data while ad-

justing to the distribution of Street View images and learning

features that are relevant & correlated with geolocation.
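A rule-based caption generator of the kind described above could look like the following Python sketch. The inclusion probabilities, the template wording beyond the quoted examples, and all names (`CATEGORY_PROBS`, `TEMPLATES`, `make_caption`) are hypothetical; the source only states that components are sampled per category with prior-based probabilities and combined in random order.

```python
import random

# Hypothetical inclusion probabilities per task category; the actual priors are not given.
CATEGORY_PROBS = {"location": 0.9, "climate": 0.5, "compass": 0.3,
                  "season": 0.3, "traffic": 0.3}

# Templates modeled on the example caption components listed above.
TEMPLATES = {
    "location": "A Street View photo in the region of {region} in {country}.",
    "climate": "This location has a {climate} climate.",
    "compass": "This photo is facing {heading}.",
    "season": "This photo was taken in {month}.",
    "traffic": "In this location, people drive on the {side} side of the road.",
}


def make_caption(meta, rng):
    """Sample caption components by category and join them in random order."""
    parts = [TEMPLATES[name].format(**meta)
             for name, prob in CATEGORY_PROBS.items() if rng.random() < prob]
    if not parts:  # always emit at least the location component
        parts = [TEMPLATES["location"].format(**meta)]
    rng.shuffle(parts)
    return " ".join(parts)


meta = {"region": "Eastern Cape", "country": "South Africa",
        "climate": "temperate oceanic", "heading": "north",
        "month": "December", "side": "left"}
print(make_caption(meta, random.Random(0)))
```

Because each image can receive a different subset and ordering of components on every draw, a single contrastive objective is exposed to all task categories without separate prediction heads.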

4.5. Multi-task Learning

We also experiment with making our multi-task setup ex-

plicit by creating task-specific prediction heads for auxiliary

climate variables, population density, elevation, and the

month (season) of the year. As climate variables we include

the Köppen-Geiger Climate Zone, the yearly average tem-

perature and precipitation at the given location as well as

the difference in temperature and precipitation between the

month with the highest average value and the month with

the lowest average value. The climate zone and season prediction tasks are posed as classification problems while the other six auxiliary tasks are formulated as regression tasks.

In Hays & Efros (2014), the authors note that the “distribution of likely locations for an image provides huge amounts

of additional meta-data for climate, average temperature for

any day, vegetation index, elevation, population density, per

capita income, average rainfall,” and more which can be

leveraged for the task of geolocalization.

We unfreeze the last CLIP layer to allow for parameter

sharing across tasks with the goal of observing a positive

transfer from our auxiliary tasks to our geolocalization prob-

lem and to learn more general image representations which

reduce the risk of overfitting to the training dataset. Our

loss function weights the geolocalization tasks as much as

all auxiliary tasks combined. A novel contribution of our work is that we use eight auxiliary prediction tasks, whereas prior research employing multi-task methods uses just two (Pramanick et al., 2022). Multi-task methods have shown impressive results across fields (Ruder, 2017).
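The loss weighting described above can be written down as a small Python sketch. The even split among the auxiliary tasks is an assumption (the source only states that the geolocalization loss weighs as much as all auxiliary losses combined), and the function name and task names are illustrative.

```python
def combined_loss(geo_loss, aux_losses):
    """Weight the geolocalization loss as much as all auxiliary losses combined.

    aux_losses maps an auxiliary task name to its loss value. Assumption: the
    auxiliary half of the weight is shared equally among the auxiliary tasks.
    """
    if not aux_losses:
        return geo_loss
    aux_weight = 0.5 / len(aux_losses)
    return 0.5 * geo_loss + aux_weight * sum(aux_losses.values())


print(combined_loss(2.0, {"climate_zone": 1.0, "elevation": 3.0}))
```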

4.6. ProtoNet Refinement

To further refine our model’s guesses within a geocell and

to improve street and city-level performance, we perform

intra-geocell refinement using ProtoNets (Snell et al., 2017).

Instead of simply predicting the mean latitude and longitude of all points within a geocell, as current state-of-the-art approaches such as Pramanick et al. (2022) do, we pose each cell’s intra-cell refinement as a separate few-shot classification task.

We again use the OPTICS clustering algorithm (Ankerst et al., 1999) with a minsamples parameter of 3 and an xi parameter of 0.15 to cluster all points within a geocell and thus propose classes to learn in the intra-cell classification setting. Each cluster consisting of at least three training examples forms a prototype and its representation is computed by averaging the embeddings of all images within

the prototype. To compute the prototype embeddings, we

use the same model as in our geocell prediction task but

remove the prediction heads and freeze all weights. Figure

7 illustrates examples of refinement clusters found by the

OPTICS algorithm in the Greater Los Angeles metropolitan

area.

During inference, we first compute and average the new location’s embeddings. After our geocell classification model predicts a geocell, instead of predicting that cell’s centroid coordinates, we take the Euclidean distance between the averaged image embeddings and all prototypes within the given geocell, selecting the prototype location with the smallest Euclidean image embedding distance to the inference location as the final geolocalization prediction. The creation of intra-cell location prototypes allows our model to predict one of

more than 11,000 distinct locations for a training dataset

of 90,000 locations instead of just choosing from the 2,203

distinct geocell centroid coordinates, thus allowing for more

precise decision making.
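The intra-cell refinement step amounts to a nearest-prototype lookup, sketched below in Python. The embeddings are toy two-dimensional vectors and the function names are illustrative; in practice the vectors would come from the frozen image encoder.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equally sized embedding vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def nearest_prototype(query, prototypes):
    """Return (distance, location) of the prototype closest to the query embedding.

    prototypes is a list of (embedding, (lat, lon)) pairs for one geocell.
    """
    return min((math.dist(query, emb), loc) for emb, loc in prototypes)


# Average the embeddings of the four images of one location (toy values).
query = mean_vector([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]])
prototypes = [([0.4, 0.6], (34.05, -118.24)), ([3.0, 3.0], (33.77, -118.19))]
distance, location = nearest_prototype(query, prototypes)
print(distance, location)
```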

While guess refinement via ProtoNets is in itself a novel idea, our work goes one step further by allowing our ProtoNet refiner

to optimize across cells. Instead of refining a geolocalization

prediction in a single cell, our ProtoNet refiner optimizes

across multiple cells which further increases performance.

During inference, our geocell classification model outputs

the top five predicted geocells as well as the model’s associated probabilities for these cells. The refinement model then picks the most likely location within each of the five proposed geocells, after which a softmax is computed across the five Euclidean image embedding distances yielded through ProtoNet refinement. We use a softmax with a temperature

of 1.6 which was carefully tuned to balance probabilities

across different geocells. Finally, these refinement prob-

abilities are multiplied with the probabilities provided by

the geocell classification model and the refinement location

corresponding to the highest joint probability is chosen as

the final geolocalization prediction.
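The cross-cell selection can be sketched in Python as follows. Two details are assumptions rather than statements from the text: distances are negated before the softmax so that smaller distances receive higher probability, and the temperature divides the negated distances. The function name and the toy candidates are illustrative.

```python
import math


def refine_across_cells(candidates, temperature=1.6):
    """Pick the location with the highest joint probability across candidate cells.

    candidates is a list of (cell_probability, prototype_distance, location),
    one entry per top-ranked geocell.
    """
    logits = [-dist / temperature for _, dist, _ in candidates]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    joint = [prob * e / total for (prob, _, _), e in zip(candidates, exps)]
    best = max(range(len(candidates)), key=joint.__getitem__)
    return candidates[best][2], joint


candidates = [(0.5, 4.0, "A"), (0.4, 0.5, "B"), (0.1, 0.4, "C")]
location, joint = refine_across_cells(candidates)
print(location, joint)
```

In this toy example the most probable cell loses because its best prototype is far from the query embedding, which is the situation in which optimizing across cells changes the final guess.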

5. Results

The results of our best-performing PIGEON model are listed

in the bottom row of Tables 1 and 2. We achieve an astound-

ing 91.96% Country Accuracy (based on political bound-

aries) and 40.36% of guesses are within 25 km of the correct

location. Moreover, the median kilometer error is 44.35 km

and the average GeoGuessr score is 4,525. In Table 3, we

list the results of our multi-task models on our augmented

dataset. Our results show that geographical, demographic,

and geological features can be inferred from Street View

images.

Figure 7. Visualized ProtoNet clusters in the Greater Los Angeles

metropolitan area.

5.1. Ablation Studies on Geolocalization Accuracy

We perform a detailed ablation study for each of our method-

ological contributions as described in Section 4. We summa-

rize our results in Table 1, displaying the percentage of our

guesses that fall within a given kilometer radius from the ac-

tual location, using standard kilometer-based metrics in line

with the literature (Pramanick et al., 2022). Furthermore, for

each ablation, we calculate additional distance-based met-

rics in Table 2 that provide insights as to the performance

of our modeling approach.
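The distance-based metrics reported in Tables 1 and 2 can be computed from per-guess errors as in the following Python sketch. The radii are the ones used in Table 1; treating a guess exactly on the radius as a hit and the function name are assumptions.

```python
import statistics

# Radii in km for the standard accuracy levels used in Table 1.
THRESHOLDS_KM = {"street": 1, "city": 25, "region": 200,
                 "country": 750, "continent": 2500}


def distance_metrics(errors_km):
    """Percentage of guesses within each radius, plus median and mean error in km."""
    n = len(errors_km)
    metrics = {name: 100.0 * sum(e <= radius for e in errors_km) / n
               for name, radius in THRESHOLDS_KM.items()}
    metrics["median_km"] = statistics.median(errors_km)
    metrics["mean_km"] = statistics.fmean(errors_km)
    return metrics


print(distance_metrics([0.5, 10, 100, 500, 3000]))
```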

We have the following observations:

• Label Smoothing, Four-image Panorama, Multi-task

Parameter Sharing, Semantic Geocells and CLIP Pre-

training all significantly improve continent, country,

and region-level metrics.

• On the other hand, ProtoNet Refinement has almost no

effect on continent, country and region-level metrics,

but significantly improves street-level accuracy from

1.32% to 4.84% as well as city level accuracy from

34.96% to 39.86%.

• Fine-tuning the last CLIP layer hurts model performance on its own. However, when performing multi-task training with the last CLIP layer as shared parameters, there is positive transfer and performance increases. The multi-task training acts as a regularizer.

• When additionally performing the Contrastive StreetCLIP Pretraining, unfreezing the last CLIP layer again hurts performance. In particular, there is no positive transfer from the multi-task training anymore. Presumably, all of the benefits from multi-task supervision have already been captured by the implicitly multi-task StreetCLIP pretraining.

In Figure 8 we visualize the improvement of the best-

performing PIGEON models over the simplest model us-

ing CLIP Base, showing how the performance gains are

more palpable at finer granularities of distance compared to

coarser distance metrics.

Figure 8. Geolocalization accuracy within the standard distance-based km radii.

5.2. Contrastive Pretraining Results with StreetCLIP

The geolocation task is usually framed as a supervised learning problem. However, this has the major drawback that the models are tied to a specific task setup, e.g., the number of classes and the distribution of the training data. For example,

our training dataset contains only Street View images during

the day, whereas IM2GPS, a common benchmark dataset

for geolocalization, contains a much wider distribution of

images, e.g., images of the inside of buildings and images

during the night. Moreover, both datasets have different

non-overlapping sets of countries and differing definitions

of countries, e.g., whether overseas territories like French

Guiana or Guam are considered their own countries or not.

We have the hypothesis that StreetCLIP (Haas et al., 2023),

through our Street View Multi-task Contrastive Pretraining,

learns relevant strategies for geolocalization but keeps the

general world knowledge from the original CLIP Pretraining.

Thereby, it can generalize to countries it has never seen

during our Street View Pretraining and is robust with regard

to distribution shift.

We test our trained StreetCLIP model on the benchmark

image geolocalization datasets IM2GPS and IM2GPS3k,

which contain a much broader distribution of images than

Street View. By generating an exhaustive list of 234 country

captions, we perform a zero-shot linear probe of StreetCLIP

to get country-level predictions which we then translate

into coordinates. Table 4 presents our results. We compare

against TransLocator (Pramanick et al., 2022), the current

state-of-the-art on both of these datasets, and following

their work, we report our performance on continent-level

accuracy.
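The zero-shot country prediction can be pictured as choosing the caption whose text embedding is most similar to the image embedding. The Python sketch below uses toy three-dimensional vectors in place of real encoder outputs; the function names and the use of cosine similarity as the scoring rule are assumptions in line with how CLIP-style models are commonly queried.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_country(image_emb, caption_embs):
    """Return the country whose caption embedding is most similar to the image embedding.

    caption_embs maps a country name to the text embedding of its caption.
    """
    return max(caption_embs, key=lambda country: cosine(image_emb, caption_embs[country]))


# Toy embeddings standing in for the image and text encoder outputs.
caption_embs = {"Finland": [1.0, 0.0, 0.0], "Ghana": [0.0, 1.0, 0.0], "Brazil": [0.0, 0.0, 1.0]}
print(zero_shot_country([0.9, 0.1, 0.0], caption_embs))
```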

Whereas TransLocator was trained in a supervised manner

on 4.72 million images, our model was trained in a semi-

supervised manner on only 1 million Street View images.

Surprisingly, despite the distribution shift, StreetCLIP out-

performs the state-of-the-art on both benchmark datasets

using just linear probing. In particular, StreetCLIP performs

significantly better than CLIP which implies that there is

a transfer of image geolocalization performance onto new

distributions.

We conjecture that contrastive pretraining is performing

implicit meta-learning. To further confirm this hypothesis,

we investigated the performance of CLIP and StreetCLIP

in countries that were not seen during StreetCLIP training

(Haas et al., 2023). On the latest benchmark IM2GPS3K,

StreetCLIP achieves an accuracy of 52.79% for countries not seen during StreetCLIP training vs. 41.51% for CLIP. An explanation for this surprising transfer is that the knowledge about these countries was already learned during the initial CLIP pretraining, e.g., the text encoder

presumably has a good embedding of every country in the

world. However, the StreetCLIP pretraining primes the

model for the geolocalization tasks and unlocks additional

knowledge from the original CLIP pretraining. Thereby,

StreetCLIP can perform well on zero-shot transfer to new

tasks (i.e., new countries) where our contrastive pretraining

can be seen as a form of implicit meta-learning.

6. Analysis

We analyze our results in detail both through quantitative

and qualitative evaluations. We confirmed the accuracy of

our results by deploying our model in the GeoGuessr game,

where our model consistently beats high-ranking human

players, ranking in the Top 1,000 globally. We try to under-

stand whether StreetCLIP is learning interpretable strategies

by utilizing an explainability method. Furthermore, we ana-

lyze some of our underperforming guesses, and discuss the

limitations of our work.

6.1. Quantitative Evaluation

6.1.1. Comparison with Human Performance

Using our Chrome extension (see Appendix D), we deploy

PIGEON in online competitive GeoGuessr and aggregate

the results of 298 rounds of the game mode Duel against

human players of varying skill levels. We visualize the

comparison of PIGEON with actual human in-game perfor-

mance in Figure 9. Players are ranked into the following

divisions by skill level: Bronze Division, Silver Division,

Gold Division, Master Division, and Champion Division.

Table 1. Multi-step ablation study on our modeling approach to image geolocalization. Entries are the percentage of guesses within the given distance (% @ km).

Method                            Street   City     Region   Country   Continent
                                  1 km     25 km    200 km   750 km    2500 km
CLIP Base                         †        24.08    55.38    80.20     ‡
+ Label Smoothing                 †        24.18    59.04    82.84     ‡
+ Four-image Panorama             †        32.50    75.32    92.92     ‡
+ Fine-tuning Last CLIP Layer     †        32.74    75.14    93.00     ‡
+ Multi-task Parameter Sharing    †        33.22    75.42    93.42     ‡
+ Semantic Geocells               †        34.54    76.36    93.36     ‡
+ Contrastive CLIP Pretraining    1.32     34.96    78.48    94.82     ‡
+ ProtoNet Refinement             4.84     39.86    78.98    94.76     ‡
- Unfreezing Last CLIP Layer      5.36     40.36    78.28    94.52     98.56

† One of these six entries is missing; the remaining five are, in row order, 1.28, 0.92, 1.10, 1.18, 1.24.
‡ One of these eight entries is missing; the remaining seven are, in row order, 92.00, 92.76, 98.00, 97.98, 98.16, 97.94, 98.48.

Table 2. Results from the ablation study beyond the standard distance metrics (distance).

Method                            Country    Mean       Median     GeoGuessr
                                  Accuracy   km Error   km Error   Score
                                  (%)        (km)       (km)       (points)
CLIP Base                         72.12      990.0      148.0      †
+ Label Smoothing                 74.74      877.4      131.1      †
+ Four-image Panorama             87.64      315.7      60.81      †
+ Fine-tuning Last CLIP Layer     87.90      312.7      61.81      †
+ Multi-task Parameter Sharing    87.96      299.9      60.63      †
+ Semantic Geocells               89.36      316.9      55.51      †
+ Contrastive CLIP Pretraining    91.14      251.9      50.01      †
+ ProtoNet Refinement             91.82      255.1      45.47      †
- Unfreezing Last CLIP Layer      91.96      251.6      44.35      4,525

† One of these eight entries is missing; the remaining seven are, in row order, 3,890, 3,986, 4,442, 4,454, 4,464, 4,522, 4,531.

Table 3. Results from the ablation study beyond the standard distance metrics (non-distance).

Method                            Elevation   Pop. Density    Temp.    Precipitation   Month      Climate Zone
                                  Error       Error           Error    Error           Accuracy   Accuracy
                                              (people/km^2)   (°)      (mm/day)
CLIP Base                         Prediction heads available only in the multi-task setting.
+ Label Smoothing                 Prediction heads available only in the multi-task setting.
+ Four-image Panorama             Prediction heads available only in the multi-task setting.
+ Fine-tuning Last CLIP Layer     Prediction heads available only in the multi-task setting.
+ Multi-task Parameter Sharing    141.7       1,094           1.37     14.48           45.74      74.10
+ Semantic Geocells               147.1       1,064           1.36     14.71           45.74      74.66
+ Contrastive CLIP Pretraining    132.8       1,072           1.18     12.82           50.64      75.76
+ ProtoNet Refinement             ProtoNet Refinement does not alter non-distance data.
- Unfreezing Last CLIP Layer      149.6       1,119           1.26     15.08           45.42      75.22

Table 4. Results from zero-shot probing with StreetCLIP (Haas et al., 2023) contrastive pretraining on out-of-distribution benchmark datasets. Entries are the percentage of guesses within the given distance (% @ km).

Benchmark   Method                 Continent
                                   2500 km
IM2GPS      TransLocator           86.70
            Zero-shot CLIP         86.08
            Zero-shot StreetCLIP   88.19
IM2GPS3K    TransLocator           80.10
            Zero-shot CLIP         77.28
            Zero-shot StreetCLIP   80.65

For reference, GeoGuessr has 30 million players worldwide,

and the Master Division represents roughly the top 1% of

players, whereas the Champion Division represents the Top

1000 players worldwide.

As we observe in Figure 9, PIGEON comfortably outper-

forms human performance. It even beats Champion Division

players in median kilometer distance and, therefore, belongs

to the Top 0.1% or Top 1000 players globally. Moreover,

PIGEON is able to perform guesses almost instantly.

6.1.2. Urban vs. Rural

In order to elucidate the difficulty of different sub-

distributions, we investigate whether a performance differ-

ential exists between urban and rural locations. Presumably,

the density of relevant cues should be higher in Street View

images from urban locations.

We bin our validation dataset into quintiles by population

density and visualize PIGEON’s median kilometer error. In

Figure 10, we observe that indeed higher population density

correlates with better predictions. In particular, there is a

sharp drop in median error in the highest quintile compared to the other

four quintiles. This confirms our hypothesis that there is a

higher density of cues in urban locations.
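The quintile analysis can be reproduced with a few lines of Python. The sketch assumes the validation data is available as (population density, error in km) pairs and contains at least five samples; the function name and the synthetic data are illustrative.

```python
import statistics


def median_error_by_quintile(samples):
    """Median km error per population-density quintile, from lowest to highest density.

    samples is a list of (population_density, error_km) pairs with at least five entries.
    """
    ordered = sorted(samples)  # sorts by population density first
    n = len(ordered)
    medians = []
    for q in range(5):
        chunk = ordered[q * n // 5:(q + 1) * n // 5]
        medians.append(statistics.median(error for _, error in chunk))
    return medians


# Synthetic data in which the error shrinks as density grows.
samples = [(density, 1000 - density) for density in range(10)]
print(median_error_by_quintile(samples))
```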

6.2. Qualitative Evaluation

6.2.1. Explainability

One of our hypotheses in Section 4.3 was that the contrastive

pre-training used by CLIP gives the model a deeper semantic

understanding of scenes and thereby enables it to discover

strategies that are interpretable by humans. Surprisingly,

the model was able to learn strategies that are taught in

online GeoGuessr guides without ever having been directly

supervised to learn these strategies.

In order to visualize what patches of the image are con-

sidered relevant for a given caption, we visualize attention

relevancy maps for our finetuned StreetCLIP model by im-

plementing the method from Generic Attention-model Ex-

plainability for Bi-Modal Transformers (Chefer et al., 2021).

In our experiments, we observed that this explainability

method does not generalize well from a patch size of 32,

as used in the official implementation, to our patch size of

14. Our hypothesis is that this is caused by the distribution

of relevancy scores across patches having a lower entropy

when the patch size is smaller. In order to resolve this

issue, we modify the method by filtering out outliers and

squaring relevancy scores. This significantly improved the

interpretability of both regular CLIP and our StreetCLIP

on smaller patch sizes and should be applicable beyond our

project.

For the visualizations in Figure 11, we generated relevancy

maps for an image from the validation dataset and the corre-

sponding ground-truth caption, e.g. “This photo is located in

Canada”. Indeed, the model pays attention to features that

professional GeoGuessr players consider important, e.g.,

vegetation, road markings, utility posts, and signage. This

makes the strong performance of the model explainable and

could furthermore enable the discovery of new strategies

that professional players have not yet discovered.

6.2.2. Error Analysis

In spite of our model’s generally high accuracy of estimating

image geolocations, there were several scenarios in which

our model underperformed. By computing entropy for the

probabilities of top predicted geocells for each location in

our validation set, we managed to identify the images about

the geolocation of which our model was the most uncertain.

We visualize those cases in Figure 12.
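Ranking images by the entropy of their top geocell probabilities can be sketched in Python as follows. Renormalizing the top-k probabilities before computing the entropy, the default of k = 5, and the function names are assumptions; the source does not state how many top geocells were used.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of the given probabilities after renormalization."""
    total = sum(probs)
    return -sum((p / total) * math.log(p / total) for p in probs if p > 0)


def most_uncertain(predictions, k=5, n=10):
    """Return the ids of the n images with the highest entropy over their top-k geocells.

    predictions maps an image id to its list of geocell probabilities.
    """
    scores = {image_id: entropy(sorted(probs, reverse=True)[:k])
              for image_id, probs in predictions.items()}
    return sorted(scores, key=scores.get, reverse=True)[:n]


predictions = {
    "tunnel": [0.2, 0.2, 0.2, 0.2, 0.2],
    "street_sign": [0.96, 0.01, 0.01, 0.01, 0.01],
}
print(most_uncertain(predictions, n=1))
```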

The features of poorly classified images are aligned with our

intuitions and prior literature about difficult settings for im-

age geolocations. Figure 12 shows that images from tunnels,

bodies of water, poorly illuminated areas, forest, indoor

areas and soccer stadiums are amongst the imagery that is

the most difficult to pinpoint geographically. This makes

sense: without recognizable features directly pertaining to a

specific geographical area, their classification is much more

difficult when compared to images with features that clearly

distinguish a given geography.

6.3. Limitations

Nevertheless, several limitations remain. Although PIGEON can successfully identify the vast majority of countries in which photos were taken, it still cannot be used at the extremely precise (street-level) granularity necessary for detailed geo-tagging. Moreover, the Street View images in our dataset were taken during daytime, raising doubts over the generalization of the model to images taken during nighttime. Further testing under different appearance variations could provide insights into the robustness of PIGEON to different seasons, illuminations, weather, etc. Additionally, we recognize that some of our visualizations may be prone to cherry-picking, thus not being wholly representative of the underlying datasets.

Figure 9. Comparison of the GeoGuessr in-game performance of PIGEON with the performance of actual online GeoGuessr players.

Figure 10. Median km error by population density quintile.

(a) Attention attribution map for an image in Canada.

7. Conclusion

Overall, PIGEON presents multiple novel improvements to

multi-task image geolocalization while providing important

insights and artifacts for related problems in fighting climate change and in urban and rural scene understanding. PIGEON

achieves impressive results in planet-scale image geolocal-

ization on Street View images, achieving a country accuracy

of 91.96% on our held-out dataset and placing 40.36% of

our guesses within 25 km of the target. Our model consistently beats human players in the game of GeoGuessr, which samples data from the same distribution as our novel dataset of 100,000 Street View locations.

(b) Attention attribution map for an image in New Zealand.

Figure 11. Attention attribution maps for a sample of locations in our dataset.

The three major contributions of our work can be summarized as follows. First, we introduce a semantic geocell creation and splitting algorithm based on open-source data adaptable to any geospatial dataset. Second, we show the effectiveness of intra-geocell few-shot refinement via ProtoNets and the use of clustering to generate potential prediction candidates. Third, we make our pre-trained CLIP transformer model, StreetCLIP (Haas et al., 2023), publicly available for use by other researchers.

(a) Image from a tunnel.

(b) Image from a body of water.

Finally, we show that contrastive pretraining is an effec-

tive meta-learning technique ideal for domain generaliza-

tion and robustness to distribution shifts. One of the most

important results of our work is achieving state-of-the-art

performance on the IM2GPS and IM2GPS3k image geolo-

calization benchmark datasets which are strongly out-of-

distribution compared to our Street View dataset used for

the pre-training of StreetCLIP. Most notably, this state-of-the-art performance is achieved zero-shot, shining light on the potential of StreetCLIP to help solve problems in many other domains.

8. Future Work

(d) Image from a forest.

(e) Image from an indoor area.

(f) Image from a soccer stadium.

Figure 12. Examples of images for which PIGEON was the most

uncertain about the correct location.

8.1. Potential Extensions

Going forward, several extensions can be made to make

image geolocalization more precise. Future models can

detect text included in images to leverage linguistic infor-

mation for predictions, with textual data having previously

been suggested as a potential feature aiding geolocaliza-

tion (Arbinger et al., 2022). Instead of being constrained

to street-level imagery, cross-view approaches could be em-

ployed, such as synthesizing satellite imagery with Street

View (Toker et al., 2021). Although we propose novel semantic geocells, our experiments are constrained to one granularity of geocells; in the future, various granularities of geocells can be tested to find the optimal geocell sizes.

Ideally, future image geolocalization models would be robust to appearance changes, which calls for incorporating changes over the years and requires datasets of images collected over an extended period of more than a year (Ali-bey et al., 2022). In a multi-task setting, determining the optimal number of tasks is likely to be a priority. Additionally, image segmentation and concept influence could be used for further location prediction interpretability, and fusing images could provide information about the entire four-image panorama rather than just individual images. In the long term, future work could go beyond Street View, with the models able to geolocate any photo taken anywhere in the world at fine-grained granularity. To that end, future experiments in CLIP-based zero-shot settings should go beyond just the continent-level accuracy.

Some additional extensions we thought of exploring in this

project, but did not end up pursuing, include using knowl-

edge graphs, using road networks and compass directions

for intra-geocell refinement, as well as adding an urban/rural

scene recognition task to the multi-task setting.

8.2. Social Impact

The results we achieved have vast social impact potential.

By predicting climate based on images, we may be able to assess the risks arising from the consequences of climate change. This

is why we decided to augment our data specifically with

the Köppen-Geiger climate classification system given its

emphasis on the geospatial understanding of the impacts of

climate change (Beck et al., 2018). Image geolocalization

can also be used for applications in autonomous driving

(Wilson et al., 2021), in war zones (such as during the Rus-

sian invasion of Ukraine), for attributing location to archival

images, helping historical research, as well as in promoting

geography education through gamified e-learning (Girgin,

2017).

Even with the potential benefits to humans, image geolocal-

ization nevertheless has to deal with various ethical issues.

Some actors posting images might not want their images to

be geolocalized, leading to questions about the fragility of

privacy protections. Furthermore, accurate image geolocal-

ization systems could be used by governments for citizen

surveillance, posing a threat to individual freedoms.

References

Ali-bey, A., Chaib-draa, B., and Giguère, P. GSV-Cities:

Toward appropriate supervised visual place recogni-

tion. Neurocomputing, 513:194–203, 2022. ISSN 0925-

2312. doi: https://doi.org/10.1016/j.neucom.2022.09.

127. URL https://www.sciencedirect.com/

science/article/pii/S0925231222012188.

Ankerst, M., Breunig, M. M., Kriegel, H.-P., and Sander,

J. OPTICS: Ordering Points to Identify the Clustering

Structure. In Proceedings of the 1999 ACM SIGMOD

International Conference on Management of Data, SIG-

MOD ’99, pp. 49–60, New York, NY, USA, 1999. Associ-

ation for Computing Machinery. ISBN 1581130848. doi:

10.1145/304182.304187. URL https://doi.org/

10.1145/304182.304187.

Arbinger, C., Bullin, M., and Henrich, A. Exploit-

ing geodata to improve image recognition with deep

learning. In Companion Proceedings of the Web Con-

ference 2022, WWW ’22, pp. 648–655, New York,

NY, USA, 2022. Association for Computing Machin-

ery. ISBN 9781450391306. doi: 10.1145/3487553.

3524645.

URL https://doi.org/10.1145/

3487553.3524645.

Ardeshir, S., Zamir, A. R., Torroella, A., and Shah, M. GIS-

Assisted Object Detection and Geospatial Localization. In

Fleet, D., Pajdla, T., Schiele, B., and Tuytelaars, T. (eds.),

Computer Vision – ECCV 2014, pp. 602–617, Cham,

2014. Springer International Publishing. ISBN 978-3-

319-10599-4.

Baatz, G., Saurer, O., Köser, K., and Pollefeys, M. Large

Scale Visual Geo-Localization of Images in Mountainous

Terrain. In Fitzgibbon, A., Lazebnik, S., Perona, P., Sato,

Y., and Schmid, C. (eds.), Computer Vision – ECCV 2012,

pp. 517–530, Berlin, Heidelberg, 2012. Springer Berlin

Heidelberg. ISBN 978-3-642-33709-3.

Beck, H. E., Zimmermann, N. E., McVicar, T. R., Ver-

gopolan, N., Berg, A., and Wood, E. F. Present and future

köppen-geiger climate classification maps at 1-km res-

olution. Scientific Data, 5(1):180214, Oct 2018. ISSN

2052-4463. doi: 10.1038/sdata.2018.214. URL https:

//doi.org/10.1038/sdata.2018.214.

Berton, G., Masone, C., and Caputo, B. Rethinking Visual

Geo-localization for Large-Scale Applications, 2022a.

URL https://arxiv.org/abs/2204.02287.

Berton, G., Mereu, R., Trivigno, G., Masone, C., Csurka, G.,

Sattler, T., and Caputo, B. Deep Visual Geo-localization

Benchmark, 2022b. URL https://arxiv.org/

abs/2204.03444.

Bingel, J. and Søgaard, A. Identifying beneficial task

relations for multi-task learning in deep neural net-

works, 2017. URL https://arxiv.org/abs/

1702.08303.

Browning, K. Siberia or Japan? Expert Google Maps Players Can Tell at a Glimpse, 2022. URL https://www.nytimes.com/2022/07/07/business/geoguessr-google-maps.html.

Cao, L., Smith, J. R., Wen, Z., Yin, Z., Jin, X., and

Han, J. BlueFinder: Estimate Where a Beach Photo

Was Taken. In Proceedings of the 21st International

Conference on World Wide Web, WWW ’12 Compan-

ion, pp. 469–470, New York, NY, USA, 2012. Associa-

tion for Computing Machinery. ISBN 9781450312301.

doi: 10.1145/2187980.2188081. URL https://doi.

org/10.1145/2187980.2188081.

Cassens, L. AI learns GeoGuessr and plays against

pro!, 2022. URL https://www.youtube.com/

watch?v=0k-SJgv-laM.

Chefer, H., Gur, S., and Wolf, L. Generic attention-model

explainability for interpreting bi-modal and encoder-

decoder transformers, 2021. URL https://arxiv.

org/abs/2103.15679.

Crandall, D., Backstrom, L., Huttenlocher, D., and Klein-

berg, J. Mapping the World’s Photos. In WWW ’09: Pro-

ceedings of the 18th International Conference on World

Wide Web, pp. 761–880, 2009.

de Brebisson, A., Simon, E., Auvolat, A., Vincent, P., and

Bengio, Y. Artificial Neural Networks Applied to Taxi

Destination Prediction, 2015. URL https://arxiv.

org/abs/1508.00021.

Gebru, T., Krause, J., Wang, Y., Chen, D., Deng, J.,

Aiden, E. L., and Fei-Fei, L. Using deep learn-

ing and Google Street View to estimate the demo-

graphic makeup of neighborhoods across the United

States. Proceedings of the National Academy of Sci-

ences, 114(50):13108–13113, 2017. doi: 10.1073/pnas.

1700035114. URL https://www.pnas.org/doi/

abs/10.1073/pnas.1700035114.

Girgin, M. Use of Games in Education: GeoGuessr in Geog-

raphy Course. International Technology and Education

Journal, 2017.

Haas, L., Alberti, S., and Skreta, M. Learning generalized

zero-shot learners for open-domain image geolocaliza-

tion, 2023.

Hays, J. and Efros, A. A. IM2GPS: estimating geographic

information from a single image. In Proceedings of the

IEEE Conf. on Computer Vision and Pattern Recognition

(CVPR), 2008.

Hays, J. and Efros, A. A. Multimodal Location Estima-

tion of Videos and Images, chapter Large-Scale Image

Geolocalization. Springer, 2014.

Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang,

W., Weyand, T., Andreetto, M., and Adam, H. Mobilenets:

Efficient convolutional neural networks for mobile vi-

sion applications, 2017. URL https://arxiv.org/

abs/1704.04861.

Izbicki, M., Papalexakis, E. E., and Tsotras, V. J. Exploiting

the Earth’s Spherical Geometry to Geolocate Images. In

Joint European Conference on Machine Learning and

Knowledge Discovery in Databases, pp. 3–19, 2019.

de Fontnouvelle, V. GeoGuessrBot: Predicting the Location of Any Street View Image, 2021. URL https://vdefont.github.io/2021/06/20/geoguessr.html.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255, 2009. doi: 10.1109/CVPR.2009.5206848.

Fick, S. E. and Hijmans, R. J. WorldClim 2: new 1-km spatial resolution climate surfaces for global land areas. International Journal of Climatology, 37(12):4302–4315, 2017. doi: https://doi.org/10.1002/joc.5086. URL https://rmets.onlinelibrary.wiley.com/doi/abs/10.1002/joc.5086.

GADM. GADM Version 4.1, 2022. URL https://gadm.org/about.html.

Kolesnikov, A., Dosovitskiy, A., Weissenborn, D., Heigold, G., Uszkoreit, J., Beyer, L., Minderer, M., Dehghani, M., Houlsby, N., Gelly, S., Unterthiner, T., and Zhai, X. An image is worth 16x16 words: Transformers for image recognition at scale. 2021.

Kordopatis-Zilos, G., Galopoulos, P., Papadopoulos, S., and Kompatsiaris, I. Leveraging EfficientNet and Contrastive Learning for Accurate Global-scale Location Estimation, 2021. URL https://arxiv.org/abs/2105.07645.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet Classification with Deep Convolutional Neural Networks. In Pereira, F., Burges, C., Bottou, L., and Weinberger, K. (eds.), Advances in Neural Information Processing Systems, volume 25. Curran Associates, Inc., 2012. URL https://proceedings.neurips.cc/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf.

Li, W. and Hsu, C.-Y. Geoai for large-scale image analysis

and machine vision: Recent progress of artificial intel-

ligence in geography. ISPRS International Journal of

Geo-Information, 11(7), 2022. ISSN 2220-9964. doi:

10.3390/ijgi11070385. URL https://www.mdpi.

com/2220-9964/11/7/385.

Lin, T.-Y., Belongie, S., and Hays, J. Cross-View Image

Geolocalization. In 2013 IEEE Conference on Computer

Vision and Pattern Recognition, pp. 891–898, 2013. doi:

10.1109/CVPR.2013.120.

Luo, G., Biamby, G., Darrell, T., Fried, D., and Rohrbach, A. G^3: Geolocation via Guidebook Grounding, 2022. URL https://arxiv.org/abs/2211.15521.

Mai, G., Janowicz, K., Hu, Y., Gao, S., Yan, B., Zhu, R.,

Cai, L., and Lao, N. A review of location encoding for

geoai: methods and applications. International Journal of

Geographical Information Science, 36(4):639–673, 2022.

doi: 10.1080/13658816.2021.2004602. URL https://

doi.org/10.1080/13658816.2021.2004602.

Masone, C. and Caputo, B. A survey on deep visual place

recognition. IEEE Access, 9:19516–19547, 2021. doi:

10.1109/ACCESS.2021.3054937.

Mayer, K., Haas, L., Huang, T., Bernabé-Moreno, J., Ra-

jagopal, R., and Fischer, M. Estimating building en-

ergy efficiency from street view imagery, aerial im-

agery, and land surface temperature data, 2022. URL

https://arxiv.org/abs/2206.02270.

Müller-Budack, E., Pustu-Iren, K., and Ewerth, R. Geolo-

cation Estimation of Photos using a Hierarchical Model

and Scene Classification. In Proceedings of the European

Conference on Computer Vision (ECCV), pp. 563–579,

2018.

Pramanick, S., Nowara, E. M., Gleason, J., Castillo, C. D.,

and Chellappa, R. Where in the World is this Image?

Transformer-based Geo-localization in the Wild, 2022.

URL https://arxiv.org/abs/2204.13861.

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh,

G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P.,

Clark, J., Krueger, G., and Sutskever, I. Learning Trans-

ferable Visual Models From Natural Language Super-

vision. CoRR, abs/2103.00020, 2021. URL https:

//arxiv.org/abs/2103.00020.

Raghu, M., Unterthiner, T., Kornblith, S., Zhang, C., and

Dosovitskiy, A. Do Vision Transformers See Like Convo-

lutional Neural Networks? CoRR, abs/2108.08810, 2021.

URL https://arxiv.org/abs/2108.08810.

Ranjan, R., Patel, V. M., and Chellappa, R. HyperFace: A

Deep Multi-task Learning Framework for Face Detection,

Landmark Localization, Pose Estimation, and Gender

Recognition. CoRR, abs/1603.01249, 2016. URL http:

//arxiv.org/abs/1603.01249.

Ruder, S. An Overview of Multi-Task Learning in Deep

Neural Networks. CoRR, abs/1706.05098, 2017. URL

http://arxiv.org/abs/1706.05098.

Rußwurm, M., Wang, S., Körner, M., and Lobell, D. Meta-

Learning for Few-Shot Land Cover Classification, 2020.

URL https://arxiv.org/abs/2004.13390.

Saurer, O., Baatz, G., Köser, K., Ladický, L., and Polle-

feys, M. Image based geo-localization in the alps. In-

ternational Journal of Computer Vision, 116(3):213–

225, Feb 2016. ISSN 1573-1405. doi: 10.1007/

s11263-015-0830-0. URL https://doi.org/10.

1007/s11263-015-0830-0.

Seo, P. H., Weyand, T., Sim, J., and Han, B. CPlaNet:

Enhancing Image Geolocalization by Combinatorial Par-

titioning of Maps, 2018. URL https://arxiv.org/

abs/1808.02130.

Seymour, Z., Sikka, K., Chiu, H., Samarasekera, S., and

Kumar, R. Semantically-Aware Attentive Neural Em-

beddings for Image-based Visual Localization. CoRR,

abs/1812.03402, 2018. URL http://arxiv.org/

abs/1812.03402.

Snell, J., Swersky, K., and Zemel, R. S. Prototypical Networks for Few-shot Learning. CoRR, abs/1703.05175, 2017. URL http://arxiv.org/abs/1703.05175.

Suresh, S., Chodosh, N., and Abello, M. DeepGeo: Photo

Localization with Deep Neural Network, 2018. URL

https://arxiv.org/abs/1810.03077.

Theiner, J., Müller-Budack, E., and Ewerth, R. Interpretable

Semantic Photo Geolocation. In 2022 IEEE/CVF Winter

Conference on Applications of Computer Vision (WACV),

pp. 1474–1484, 2022. doi: 10.1109/WACV51458.2022.

00154.

Toker, A., Zhou, Q., Maximov, M., and Leal-Taixé, L. Coming Down to Earth: Satellite-to-Street View Synthesis for Geo-Localization. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6484–6493, Los Alamitos, CA, USA, Jun 2021. IEEE Computer Society. doi: 10.1109/CVPR46437.2021.00642. URL https://doi.ieeecomputersociety.org/10.1109/CVPR46437.2021.00642.

Tomešek, J., Čadı́k, M., and Brejcha, J. CrossLocate: Cross-

Modal Large-Scale Visual Geo-Localization in Natural

Environments Using Rendered Modalities. In Proceed-

ings of the IEEE/CVF Winter Conference on Applications

of Computer Vision (WACV), pp. 3174–3183, January

2022.

Tzeng, E., Zhai, A., Clements, M., Townshend, R., and

Zakhor, A. User-Driven Geolocation of Untagged Desert

Imagery Using Digital Elevation Models. In 2013 IEEE

Conference on Computer Vision and Pattern Recognition

Workshops, pp. 237–244, 2013. doi: 10.1109/CVPRW.

2013.42.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,

L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention

Is All You Need, 2017. URL https://arxiv.org/

abs/1706.03762.

Vo, N., Jacobs, N., and Hays, J. Revisiting IM2GPS in

the Deep Learning Era, 2017. URL https://arxiv.

org/abs/1705.04838.

Weyand, T., Kostrikov, I., and Philbin, J. PlaNet - Photo Geolocation with Convolutional Neural Networks. In Computer Vision – ECCV 2016, pp. 37–55. Springer International Publishing, 2016. doi: 10.1007/978-3-319-46484-8_3. URL https://doi.org/10.1007%2F978-3-319-46484-8_3.

Weyand, T., Araujo, A., Cao, B., and Sim, J. Google

Landmarks Dataset v2 – A Large-Scale Benchmark for

Instance-Level Recognition and Retrieval, 2020. URL

https://arxiv.org/abs/2004.01804.

Wilson, D., Zhang, X., Sultani, W., and Wshah, S. Visual

and object geo-localization: A comprehensive survey.

CoRR, abs/2112.15202, 2021. URL https://arxiv.

org/abs/2112.15202.

Wu, M. and Huang, Q. Im2city: Image geo-localization

via multi-modal learning. In Proceedings of the 5th

ACM SIGSPATIAL International Workshop on AI for Ge-

ographic Knowledge Discovery, GeoAI ’22, pp. 50–61,

New York, NY, USA, 2022. Association for Comput-

ing Machinery. ISBN 9781450395328. doi: 10.1145/

3557918.3565868. URL https://doi.org/10.

1145/3557918.3565868.

Yang, H., Lu, X., and Zhu, Y. Cross-view Geo-localization

with Layer-to-Layer Transformer. In Ranzato, M.,

Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan,

J. W. (eds.), Advances in Neural Information Processing

Systems, volume 34, pp. 29009–29020. Curran Asso-

ciates, Inc., 2021. URL https://proceedings.

neurips.cc/paper/2021/file/

f31b20466ae89669f9741e047487eb37-Paper.

pdf.

Zamir, A. R. and Shah, M. Accurate image localization

based on google maps street view. In Daniilidis, K.,

Maragos, P., and Paragios, N. (eds.), Computer Vision

– ECCV 2010, pp. 255–268, Berlin, Heidelberg, 2010.

Springer Berlin Heidelberg. ISBN 978-3-642-15561-1.

Zamir, A. R. and Shah, M. Image geo-localization based on multiple nearest neighbor feature matching using generalized graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(8):1546–1558, 2014. doi: 10.1109/TPAMI.2014.2299799.

Zhu, S., Shah, M., and Chen, C. TransGeo: Trans-

former Is All You Need for Cross-view Image Geo-

localization, 2022. URL https://arxiv.org/

abs/2204.00097.

In this Appendix, we provide additional information that describes our work in further detail.

In Section A, we list our data sources and visualize the data used for augmenting our dataset. In Section B, we provide

details regarding the process of obtaining our images from the Street View API. In Section C, we discuss the background of

the GeoGuessr game that is relevant for understanding this project. In Section D, we describe a Chrome Extension we built

to play GeoGuessr by deploying our model online, allowing us to compare our results to human performance. In Section E,

we describe the technical details about the infrastructure used for running our models as well as the hyperparameters used

during model training.

A. Data Specification for Dataset Augmentation

A.1. Country Area Polygons

We obtain data on country areas from the Database of Global Administrative Areas (GADM) (GADM, 2022), with the data

available here. Additionally, we obtain data on several granularities of political boundaries of administrative areas, with the

data available here and here.

A.2. Köppen-Geiger Climate Zones

We obtain data on global climate zones through the Köppen-Geiger climate classification system (Beck et al., 2018), with

the data available here.

Our planet-scale climate zone data is visualized in Figure 13.

Figure 13. Map of planet–scale Köppen-Geiger climate zones in our dataset. Adapted from Beck et al. (2018).

A.3. Elevation

We obtain data on elevation through the United States Geological Survey’s Earth Resources Observation and Science (EROS)

Center, with the data available here. As elevation data was missing for several locations in our dataset, we further augmented

our data with missing values from parts of Alaska and parts of Europe, with the data for Alaska available here and the data

for Europe available here.

Our planet-scale elevation data is visualized in Figure 14.

A.4. GHSL Population Density

We obtain data on population density through the Global Human Settlement Layer (GHSL), with the data available here.

Figure 14. Map of planet–scale elevation in our validation dataset.

Our planet-scale population density data is visualized in Figure 15.

A.5. WorldClim 2 Temperature and Precipitation

We obtain data on average temperature, temperature difference, average precipitation, and precipitation difference through

WorldClim 2 (Fick & Hijmans, 2017), with the data available here.

Our planet-scale average temperature is visualized in Figure 16. Our planet-scale temperature difference is visualized in

Figure 17. Our planet-scale average precipitation is visualized in Figure 18. Our planet-scale precipitation difference is

visualized in Figure 19.

A.6. Location of Country Capitals

We obtain data on the locations of country capitals used for refining our zero-shot StreetCLIP predictions through Kaggle,

with the data available here.

A.7. Alpha-2 Country Codes

We obtain our ISO 3166-1 alpha-2 country codes used for matching country codes generated through the Street View API

with country names through Kaggle, with the data available here.

A.8. Driving Side of the Road

We obtain our driving side of the road data through WorldStandards, with the data available here.

Figure 15. Map of planet-scale population density in our validation dataset.

Figure 16. Map of planet-scale average temperature in our validation dataset.

Figure 17. Map of planet-scale temperature difference in our validation dataset.

Figure 18. Map of planet-scale average precipitation in our validation dataset.

Figure 19. Map of planet-scale precipitation difference in our validation dataset.

B. Querying Street View API

After signing an NDA with Erland Ranvinge, the Chief Technology Officer of GeoGuessr, we obtained a list of exactly one million locations that actually appear in the Competitive Duels mode of GeoGuessr. From that list, we randomly sampled 100,000 locations, or 10% of the dataset, maintaining the distribution of countries representative of the larger dataset as visualized in Figure 20.

It should be emphasized, however, that while the distribution is representative of the broader distribution of locations in

Google Street View, the Google Street View distribution itself cannot be thought of as a uniform global distribution, as

visualized in Figure 21.
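The paper does not state how the country distribution was preserved during subsampling. One plausible way is to sample the same fraction from each country separately, sketched below in Python. The function name, the `country` field, and the toy data are assumptions for illustration only.

```python
import random
from collections import defaultdict


def sample_preserving_countries(locations, fraction, seed=0):
    """Sample `fraction` of the locations from each country separately.

    `locations` is a list of dicts with a hypothetical "country" key.
    """
    rng = random.Random(seed)
    by_country = defaultdict(list)
    for loc in locations:
        by_country[loc["country"]].append(loc)

    sample = []
    for country in sorted(by_country):  # sorted for reproducibility
        group = by_country[country]
        k = round(len(group) * fraction)
        sample.extend(rng.sample(group, k))
    return sample


# Toy data: 700 locations in one country, 300 in another.
locations = [{"id": i, "country": "US" if i < 700 else "SE"} for i in range(1000)]
subset = sample_preserving_countries(locations, fraction=0.1)
```

With per-country rounding, very rare countries may receive zero samples, so a real pipeline might enforce a minimum count per country.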

To obtain the actual images from our dataset, we queried the Street View API using the Google Cloud Platform Education

Grants generously supplied to us by Google with the help of Dan Russell.

B.1. Metadata

We first queried the Street View API for the location metadata by supplying a pano id, or an id pertaining to each location,

with each request. That way, we were able to verify whether Street View images actually existed in this location and which

month and year a given image was taken. For each unavailable image, we sampled a random location from the same country

to maintain the prior distribution. Each metadata request was free of charge.
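As a rough illustration of the metadata check, the sketch below builds a metadata request URL and interprets a response. It assumes the public Street View Static API metadata endpoint, its `pano` and `key` parameters, and a JSON response with `status` and `date` fields; the helper names are hypothetical and no network call is made.

```python
from urllib.parse import urlencode

# Assumed endpoint of the Street View Static API metadata service.
METADATA_URL = "https://maps.googleapis.com/maps/api/streetview/metadata"


def metadata_url(pano_id, api_key):
    """Build the metadata request URL for one panorama id."""
    return METADATA_URL + "?" + urlencode({"pano": pano_id, "key": api_key})


def parse_metadata(response):
    """Report whether imagery exists and, if given, its capture year and month."""
    if response.get("status") != "OK":
        return {"available": False, "year": None, "month": None}
    parts = response.get("date", "").split("-")
    year = int(parts[0]) if parts[0] else None
    month = int(parts[1]) if len(parts) > 1 else None
    return {"available": True, "year": year, "month": month}


found = parse_metadata({"status": "OK", "date": "2019-07"})
missing = parse_metadata({"status": "ZERO_RESULTS"})
```

A location for which `available` is false would then be replaced by another random location from the same country, as described above.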

B.2. Images

Subsequently, we proceeded to download images from each location. Aside from the pano id, we specified additional parameters specific to Street View image downloads. We set the image size, size, to 640x640 pixels, the largest available size. For each location, we generated a random heading, or compass direction, between 0 and 359 degrees, and added 90 degrees for each subsequent picture at that location to obtain a full panorama. We chose a field of view, fov, of 92 degrees, allowing us to retain all of the image's information even after cropping the watermarks. We set the source parameter to outdoor to limit results to outdoor images; however, a small portion of images was still taken indoors, as Figure 12 shows, indicating mislabeling on Google's side. For the remaining parameters, we used 0 for pitch, 50 for radius, and true for return_error_code. All in all, this allowed us to download images consistently for each location in our dataset.
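The parameter choices above can be illustrated with a short Python sketch that builds four image request URLs per location. It assumes the public Street View Static API endpoint and its `pano`, `size`, `heading`, `fov`, `pitch`, `radius`, `source`, `return_error_code`, and `key` parameters. The helper names, the count of four images per location, and the wrap-around of headings at 360 degrees are assumptions rather than details confirmed by the paper.

```python
import random
from urllib.parse import urlencode

# Assumed endpoint of the Street View Static API image service.
IMAGE_URL = "https://maps.googleapis.com/maps/api/streetview"


def panorama_headings(rng):
    """Random start heading in [0, 359], then three further 90-degree steps."""
    start = rng.randint(0, 359)
    return [(start + 90 * i) % 360 for i in range(4)]


def image_urls(pano_id, api_key, rng):
    """Build one request URL per heading using the values from the text."""
    urls = []
    for heading in panorama_headings(rng):
        params = {
            "pano": pano_id,
            "size": "640x640",
            "heading": heading,
            "fov": 92,
            "pitch": 0,
            "radius": 50,
            "source": "outdoor",
            "return_error_code": "true",
            "key": api_key,
        }
        urls.append(IMAGE_URL + "?" + urlencode(params))
    return urls


rng = random.Random(0)
headings = panorama_headings(rng)
urls = image_urls("EXAMPLE_PANO_ID", "EXAMPLE_KEY", rng)
```

Four views with a 92-degree field of view overlap slightly, which is consistent with the stated goal of keeping all image information after the watermarks are cropped.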

Figure 20. Distribution of countries in our training set, colored by the tiers of frequency.

Figure 21. Map of planet–scale Google Street View coverage. Courtesy of Lion Cassens.

C. Background on GeoGuessr

GeoGuessr is an online game founded in Sweden in 2013. Upon starting the game, the user is placed in a location supplied

by Google Street View and needs to guess where that location is in the world by placing a guess on the map. The game

can be played in both single- and multi-player modes on maps that are both GeoGuessr-provided as well as user-generated.

When playing with others, users can play both with their friends as well as with random opponents on the Internet. We

decided to focus PIGEON on the Competitive Duels mode, whereby the user directly competes with an opponent and

thus must not only place guesses accurately but also more accurately than the opponent. Each guess is translated into a

GeoGuessr score, the function of which we reverse-engineered as outlined in Equation 3:

\[ \mathrm{score}(x) = 5000 \cdot e^{-x / 1492.7} \tag{3} \]

where x is the prediction error in kilometers.
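Reading Equation 3 as score(x) = 5000 · e^(−x/1492.7), the score can be computed with a few lines of Python. The function name is hypothetical, and any rounding the game applies to displayed scores is not modeled here.

```python
import math


def geoguessr_score(error_km):
    """Score from Equation 3 for a prediction error given in kilometers."""
    return 5000.0 * math.exp(-error_km / 1492.7)


perfect = geoguessr_score(0.0)      # a perfect guess yields the maximum of 5000
one_scale = geoguessr_score(1492.7)  # an error of 1492.7 km yields 5000 / e
```

Under this formula, an error of 1492.7 km reduces the score to 5000/e, roughly 1839 points, and the score decays smoothly toward zero as the error grows.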

To get a better sense of the game, we provide some sample screenshots in Figure 22. We took these screenshots while

deploying PIGEON in the real GeoGuessr game against real opponents using our self-developed Chrome Extension, which

we describe in Section D of this Appendix.

(a) Sample image in a GeoGuessr location.

(b) Sample comparison of guesses between PIGEON and a

human player.

Figure 22. Sample screenshots from PIGEON deployed to the GeoGuessr game.

D. Chrome Extension for GeoGuessr

We constructed a Chrome extension that plays GeoGuessr by automating the browser. We did this to achieve two goals: first, to provide an engaging live demonstration; second, to confirm that our model is robust enough to also perform well on the real-world data from the GeoGuessr game.

D.1. Chrome Extension Behavior

The extension automatically activates itself once it detects that it is in a game and then autonomously places guesses.

Moreover, it is able to detect when a game is over and restarts the game automatically if it is configured to do so. At the

moment, it supports the following GeoGuessr game modes: Classic, Duel & Team Duel. It is able to play both in the “Play

With Friends” and “Competitive” mode. In the latter mode, you are matched online against another player of similar rank,

and each game either increases or decreases an Elo-based rank.

The procedure to place a guess works as follows and is repeated for each GeoGuessr round until the game is detected to be

over:

1. Resize Chrome window to correct aspect ratio.

2. Wait until Street View scene is fully loaded.

3. Repeat the following for all four directions:

(a) Hide all UI elements.

(b) Take a screenshot.

(c) Unhide all UI elements.

(d) Rotate by 90° using simulated clicks.

4. POST request to our backend server endpoint with the four images encoded as Base64 as payload.

5. Receive predicted latitude & longitude from our server.

6. Optional: Random delay to behave more human-like.

7. Place guess using reverse-engineered API call from GeoGuessr API.

8. Collect statistics about true location & human performance and submit to the server using an additional POST request.
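Step 4 of the procedure above, sending four screenshots as Base64 in a POST payload, could look roughly like the following. The actual extension runs in the browser; Python is used here only for illustration, and the JSON layout with an `images` key is an assumption rather than the authors' actual wire format.

```python
import base64
import json


def build_payload(screenshots):
    """Encode raw image bytes as Base64 strings inside a JSON request body."""
    encoded = [base64.b64encode(shot).decode("ascii") for shot in screenshots]
    return json.dumps({"images": encoded})


def read_payload(body):
    """Server-side counterpart: recover the raw image bytes from the body."""
    return [base64.b64decode(item) for item in json.loads(body)["images"]]


# Placeholder bytes standing in for the four directional screenshots.
shots = [bytes([i]) * 8 for i in range(4)]
body = build_payload(shots)
decoded = read_payload(body)
```

Base64 keeps the payload valid JSON at the cost of roughly a one-third size increase over the raw bytes.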

D.2. Backend

In addition to the Chrome extension, we run a backend server on a machine with a GPU that runs the model inference. We

utilize the Python library FastAPI to implement two API endpoints:

1. Inference endpoint: A POST endpoint that receives either one or four images, passes them to a PyTorch pipeline that preprocesses the images, and then runs inference on a GPU. In addition, it saves the images to disk in order to collect an additional dataset. It then returns the latitude & longitude predicted by our model to the client.

2. Statistics endpoint: A POST endpoint that receives the statistics about the correct location, the score & distance of our

guess, and human performance (i.e., location, score, and distance of our online opponent). This data is saved on disk

and then used for our evaluations.

E. Technical Specification

The following is an overview of the technical infrastructure used for this project, an estimation of the time needed to compute

our results, and an overview of the most important model parameters.

E.1. Technical Infrastructure

Our geolocalization models were trained on four NVIDIA A100 80 GB GPUs, with each model training for between three hours and two days. The contrastive pretraining of StreetCLIP required a total of eight NVIDIA A100 80 GB GPUs, on which we pre-trained our model for two days.

The geocell creation algorithm ran for one day on a local machine on a single CPU.

E.2. Hyperparameter Specification

For all our geolocalization models, we started by freezing all CLIP layers and training only the prediction heads. To do so, we used a learning rate of 1e-4 and a batch size (accumulated across GPUs and gradient updates) of 256. Once the prediction heads were trained to convergence, we unfroze the last CLIP layer for the respective models and used the same batch size of 256 but lowered the learning rate to 2e-5.

For the contrastive pretraining of StreetCLIP, we used a batch size of 2048 accumulated across all GPU cores and gradient updates, a learning rate of 1e-6, linear learning rate warmup with rate 0.2, and a weight decay of 0.001.
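To make the StreetCLIP pretraining settings concrete, the sketch below shows how an accumulated batch size and a linear warmup could be computed. It rests on several assumptions: that "rate 0.2" means the warmup covers the first 20% of training steps, that the learning rate stays constant afterwards, and that the batch of 2048 splits as 32 per GPU across 8 GPUs with 8 accumulation steps. The paper confirms none of these details.

```python
def effective_batch_size(per_gpu_batch, num_gpus, accumulation_steps):
    """Batch size accumulated across GPUs and gradient updates."""
    return per_gpu_batch * num_gpus * accumulation_steps


def warmup_lr(step, total_steps, peak_lr=1e-6, warmup_ratio=0.2):
    """Linear warmup to `peak_lr` over the first `warmup_ratio` of training.

    Assumption: the rate is held constant after warmup (no decay).
    """
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr


# Hypothetical split of the reported batch size of 2048.
batch = effective_batch_size(per_gpu_batch=32, num_gpus=8, accumulation_steps=8)
schedule = [warmup_lr(s, total_steps=1000) for s in range(1000)]
```

If the original training used a decay schedule after warmup, the constant branch would need to be replaced accordingly.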