Summary of CityDreamer Compositional Generative Model of Unbounded 3D Cities



One Line

The CityDreamer model creates limitless 3D cities by separating building creation from background objects and utilizing a bird's eye view scene representation along with MaskGIT and VQVAE for layout generation.

Slides

Slide Presentation (9 slides)


CityDreamer: Creating Limitless 3D Cities

Source: arxiv.org (PDF, 8,763 words)

Introduction to CityDreamer

• The CityDreamer model is a generative model designed for creating unbounded 3D cities.

• It separates the generation of building instances from other background objects.

• Utilizes a bird's eye view scene representation.

Visual: Image showcasing a 3D city generated by CityDreamer

Layout Generation with MaskGIT and VQVAE

• CityDreamer utilizes MaskGIT and VQVAE for layout generation.

• Tokenizes the maps and height fields.

• Enables the creation of extendable semantic maps and height fields.

Volumetric Rendering for Accurate 3D Geometry

• CityDreamer employs volumetric rendering for generating accurate 3D geometry.

• Enables the creation of photorealistic images of unbounded 3D cities.

Visual: Comparison of a generated city with volumetric rendering and without

Evaluation of CityDreamer

• Evaluation involves generating distinct city layouts.

• Sampling different styles for each scene.

• Assessment of quality and consistency through user study.

Visual: Graph showing evaluation metrics and user study results

Limitations of Current City Layout Generation

• Inability to model and generate concave geometries like caves and tunnels.

• Individual generation of buildings may affect overall quality and naturalness.

Visual: Examples of concave geometries not currently supported

References in the Document

• List of references to various papers and studies related to generative models and 3D city modeling.

• Covers topics such as generative adversarial nets, 3D image synthesis, and human sensing in 3D environments.

Qualitative Comparison of Building Instance Generators

• Comparison of different building instance generator variants.

• Effectiveness of various generative scene parameterization techniques.

Visual: Examples of city backgrounds and generated buildings using different parameterization

CityDreamer: Unlocking Unlimited Possibilities

• The CityDreamer model revolutionizes the creation of 3D cities.

• Enables limitless possibilities for the entertainment industry.

• Reminder: CityDreamer separates building generation, uses bird's eye view, and employs volumetric rendering for accurate 3D geometry.

Key Points

The CityDreamer model is a generative model designed for creating unbounded 3D cities.
The model separates the generation of building instances from other background objects and uses a bird's eye view scene representation.
The CityDreamer model utilizes MaskGIT and VQVAE for layout generation and tokenizing the maps and fields.
The model employs volumetric rendering for generating accurate 3D geometry and photorealistic images of unbounded 3D cities.
The evaluation of the CityDreamer model involves generating distinct city layouts and sampling different styles for each scene.

Summaries

30 word summary

The CityDreamer model generates unbounded 3D cities by separating building generation from background objects. It uses a bird's eye view scene representation and incorporates MaskGIT and VQVAE for layout generation.

37 word summary

The CityDreamer model is a generative model for creating unbounded 3D cities. It separates building generation from other background objects and uses a bird's eye view scene representation. The model utilizes MaskGIT and VQVAE for layout generation

397 word summary

The CityDreamer model is a generative model designed for creating unbounded 3D cities. It separates the generation of building instances from other background objects and uses a bird's eye view scene representation. The model employs a volumetric renderer to generate

Generative models for scene-level content generation face challenges due to the high diversity of scenes. Some approaches have achieved 3D-aware scene synthesis but lack full 3D consistency or support for feed-forward generation of novel worlds. Other works focus on indoor

The CityDreamer model focuses on generating unbounded 3D cities by creating extendable semantic maps and height fields. It utilizes MaskGIT and VQVAE for layout generation and tokenizing the maps and fields. The bird's-eye-view (

The document describes a compositional generative model for creating unbounded 3D cities. The city background generator is trained using a combination of reconstruction loss and adversarial learning loss. Volumetric rendering is used for the building instance generator, which incorporates

The article discusses the evaluation protocols and results of the CityDreamer compositional generative model of unbounded 3D cities. The evaluation involves generating 1024 distinct city layouts and sampling 20 different styles for each scene. Evaluation metrics include F

The CityDreamer model is capable of generating accurate 3D geometry and photorealistic images of unbounded 3D cities. A user study was conducted to assess the quality and consistency of the generated cities, showing that the proposed method outper

The entertainment industry has a high demand for generating content for computer games and movies. However, there are limitations to the current city layout generation process, as it cannot model and generate concave geometries like caves and tunnels. Additionally, the individual generation of

The document is a list of references to various papers and studies related to generative models and 3D city modeling. These references include papers on generative adversarial nets, 3D image synthesis, neural rendering of Minecraft worlds, human sensing in

This text excerpt provides references to various papers and studies related to 3D avatar generation, image synthesis, and city layout generation. The papers mentioned cover a range of topics including data platforms for learning 3D structures, image synthesis techniques, 3

The excerpt discusses the qualitative comparison of different building instance generator variants and the effectiveness of different generative scene parameterization. The results show that using the Global Encoder and Hash Grid as scene parameterization produces more natural city backgrounds but decreases the quality of generated buildings

Raw indexed text (54,781 chars / 8,763 words / 1,353 lines)

CityDreamer: Compositional Generative Model of Unbounded 3D Cities

Haozhe Xie, Zhaoxi Chen, Fangzhou Hong, Ziwei Liu

S-Lab, Nanyang Technological University

{haozhe.xie, zhaoxi001, fangzhou001, ziwei.liu}@ntu.edu.sg

https://haozhexie.com/project/city-dreamer

[Figure 1 panels: Multi-view Consistent, Diverse City Layouts, Well-defined Geometry, Unbounded City Layout, Diverse Style and Viewpoints]

Figure 1. The proposed CityDreamer generates a wide variety of unbounded city layouts and multi-view consistent appearances, featuring well-defined geometries and diverse styles.

Abstract

In recent years, extensive research has focused on 3D natural scene generation, but the domain of 3D city generation has not received as much exploration. This is due to the greater challenges posed by 3D city generation, mainly because humans are more sensitive to structural distortions in urban environments. Additionally, generating 3D cities is more complex than 3D natural scenes since buildings, as objects of the same class, exhibit a wider range of appearances compared to the relatively consistent appearance of objects like trees in natural scenes. To address these challenges, we propose CityDreamer, a compositional generative model designed specifically for unbounded 3D cities, which separates the generation of building instances from other background objects, such as roads, green lands, and water areas, into distinct modules. Furthermore, we construct two datasets, OSM and GoogleEarth, containing a vast amount of real-world city imagery to enhance the realism of the generated 3D cities both in their layouts and appearances. Through extensive experiments, CityDreamer has proven its superiority over state-of-the-art methods in generating a wide range of lifelike 3D cities.

1. Introduction

In the wave of the metaverse, 3D asset generation has drawn considerable interest. Significant advancements have been achieved in generating 3D objects [37, 40, 47], 3D avatars [24, 29, 60], and 3D scenes [7, 11, 34]. Cities, being one of the most crucial 3D assets, have found widespread use in various applications, including urban planning, environmental simulations, and game asset creation. Therefore, the quest to make 3D city development accessible to a broader audience, encompassing artists, researchers, and players, becomes a significant and impactful challenge.

In recent years, notable advancements have been made in the field of 3D scene generation. GANCraft [22] and SceneDreamer [11] use volumetric neural rendering to produce images within the 3D scene, using 3D coordinates and corresponding semantic labels. Both methods show promising results in generating 3D natural scenes by leveraging pseudo-ground-truth images generated by SPADE [44]. A very recent work, InfiniCity [34], follows a similar pipeline for 3D city generation. However, creating 3D cities presents greater complexity compared to 3D natural scenes. Buildings, as objects with the same semantic label, exhibit a wide range of appearances, unlike the relatively consistent appearance of objects like trees in natural scenes. This fact may decrease the quality of generated buildings when all buildings in a city are given the same semantic label.

To handle the diversity of buildings in urban environments, we propose CityDreamer, a compositional generative model designed for unbounded 3D cities. As shown in Figure 2, CityDreamer differs from existing methods in that it splits the generation of buildings and background objects like roads, green lands, and water areas into two separate modules: the building instance generator and the city background generator. Both generators adopt the bird’s eye view scene representation and employ a volumetric renderer to generate photorealistic images via adversarial training. Notably, the scene parameterization is meticulously tailored to suit the distinct characteristics of background objects and buildings. Background objects in each category typically have similar appearances while exhibiting irregular textures. Hence, we introduce the generative hash grid to preserve naturalness while upholding 3D consistency. In contrast, building instances exhibit a wide range of appearances, but the texture of their façades often displays regular periodic patterns. Therefore, we design periodic positional encoding, which is simple yet effective for handling the diversity of building façades. The compositor finally combines the rendered background objects and building instances to generate a cohesive image.

To enhance the realism of our generated 3D cities, we construct two datasets: OSM and GoogleEarth. The OSM dataset, sourced from OpenStreetMap [1], contains semantic maps and height fields of 80 cities, covering over 6,000 km². These maps show the locations of roads, buildings, green lands, and water areas, while the height fields primarily indicate building heights. The GoogleEarth dataset, gathered using Google Earth Studio [2], features 400 orbit trajectories in New York City. It includes 24,000 real-world city images, along with semantic and building instance segmentation. These annotations are automatically generated by projecting the 3D city layout, based on the OSM dataset, onto the images. The GoogleEarth dataset provides a wider variety of realistic urban images from different perspectives. Additionally, it can be easily expanded to include cities worldwide.

The contributions are summarized as follows:

• We propose CityDreamer, a compositional generative model designed specifically for unbounded 3D cities, which separates the generation of building instances from other background objects into distinct modules.

2. Related Work

3D-aware GANs.

Generative adversarial networks

(GANs) [20] have achieved remarkable success in 2D im-

age generation [27, 28]. Efforts to extend GANs into 3D

space have also emerged, with some works [17, 42, 57] in-

tuitively adopting voxel-based representations by extend-

ing the CNN backbone used in 2D. However, the high

computational and memory cost of voxel grids and 3D

convolution poses challenges in modeling unbounded 3D

scenes. Recent advancements in neural radiance field

(NeRF) [41] have led to the incorporation of volume ren-

dering as a key inductive bias to make GANs 3D-aware.

This enables GANs to learn 3D representations from 2D

images [8, 18, 21, 43, 59]. Nevertheless, most of these meth-

ods are trained on curated datasets for bounded scenes, such

as human faces [27], human bodies [25], and objects [58].

Scene-level image generation. Unlike impressive 2D gen-

erative models that mainly target single categories or com-

mon objects, generating scene-level content is a challeng-

ing task due to the high diversity of scenes. Semantic image

synthesis, such as [15,22,38,44], shows promise in generat-

ing scene-level content in the wild by conditioning on pixel-

wise dense correspondence, such as semantic segmentation

maps or depth maps. Some approaches have even achieved

3D-aware scene synthesis [22, 31, 36, 38, 52], but they may lack full 3D consistency or support for feed-forward generation of novel worlds. Recent works like [7, 11, 34] have achieved 3D-consistent scenes at infinite scale through unbounded layout extrapolation. Another line of work [3, 14, 45, 55] focuses on indoor scene synthesis using expensive 3D datasets [13, 53] or CAD retrieval [16].

3. The Proposed Method

As shown in Figure 2, CityDreamer follows a four-step process to generate an unbounded 3D city. Initially, the unbounded layout generator (Sec. 3.1) creates an arbitrarily large city layout $L$. Subsequently, the city background generator (Sec. 3.2) produces the background image $\hat{I}_G$ along with its corresponding mask $M_G$. Next, the building instance generator (Sec. 3.3) generates images for building instances $\{\hat{I}_{B_i}\}_{i=1}^{n}$ and their respective masks $\{M_{B_i}\}_{i=1}^{n}$, where $n$ is the number of building instances. Lastly, the compositor (Sec. 3.4) merges the rendered background and building instances into a single cohesive image $I_C$.

• We construct two datasets, OSM and GoogleEarth,

with more realistic city layouts and appearances, re-

spectively. GoogleEarth includes images with multi-

view consistency and building instances segmentation.

3.1. Unbounded City Layout Generator

City Layout Representation. The city layout determines

the 3D objects present in the city and their respective loca-

tions. The objects can be categorized into six classes: roads,

buildings, green lands, construction sites, water areas, and

others. Moreover, there is an additional null class used to

represent empty spaces in the 3D volumes. The city lay-

• The proposed CityDreamer is evaluated quantitatively

and qualitatively against state-of-the-art 3D generative

models, showcasing its capability in generating large-

scale and diverse 3D cities.

[Figure 2 diagram. Modules shown: Unbounded Layout Generator (§3.1), City Background Generator (§3.2), Building Instance Generator (§3.3), and Compositor (§3.4).]

Figure 2. Overview of CityDreamer. The unbounded layout generator creates the city layout L. Then, the city background generator

performs ray-sampling to retrieve features from L and generates the background image with a volumetric renderer, focusing on background

objects like roads, green lands, and water areas. Similarly, the building instance generator renders the building instance image with another

volumetric renderer. Finally, the compositor merges the rendered background and building instances, producing a unified and coherent final

image. Note that “Mod.”, “Cond.”, “Bg.”, and “Bldg.” denote “Modulation”, “Condition”, “Background”, and “Building”, respectively.
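Both generators in Figure 2 produce images with a volumetric renderer (Equations 5 and 10). As a rough illustration, such a rendering integral is commonly approximated with the quadrature used in NeRF-style renderers. The sketch below is not the authors' implementation: the function name `render_ray`, the uniform sample spacing, and the toy inputs are all assumptions made for illustration.

```python
import math

def render_ray(densities, colors, deltas):
    """Approximate C(r) = integral of T(t) * c * sigma dt along one ray.

    densities: volume density sigma at each sample along the ray
    colors:    RGB tuple at each sample
    deltas:    distance between consecutive samples
    Returns (rgb, accumulated_opacity).
    """
    transmittance = 1.0  # T(t): fraction of light that reaches the current sample
    rgb = [0.0, 0.0, 0.0]
    opacity = 0.0
    for sigma, color, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this ray segment
        weight = transmittance * alpha
        rgb = [acc + weight * channel for acc, channel in zip(rgb, color)]
        opacity += weight
        transmittance *= 1.0 - alpha
    return rgb, opacity

# Toy ray: one empty sample followed by a fully opaque red sample.
rgb, opacity = render_ray(
    densities=[0.0, 1e9],
    colors=[(0.0, 1.0, 0.0), (1.0, 0.0, 0.0)],
    deltas=[1.0, 1.0],
)
```

The empty sample contributes nothing because its density is zero, so the ray takes the color of the first opaque sample behind it.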

mizing them using L1 Loss and Cross Entropy Loss $\mathcal{E}$, respectively. Additionally, to ensure sharpness in the height field around the edges of the buildings, we introduce an extra Smoothness Loss $\mathcal{S}$ [39].

$$\ell_{\mathrm{VQ}} = \lambda_R \lVert \hat{H}_p - H_p \rVert + \lambda_S\, \mathcal{S}(\hat{H}_p, H_p) + \lambda_E\, \mathcal{E}(\hat{S}_p, S_p) \tag{2}$$

where $\hat{H}_p$ and $\hat{S}_p$ denote the generated height field and semantic map patches, respectively. $H_p$ and $S_p$ are the corresponding ground truth. The autoregressive transformer in MaskGIT is trained using a reweighted ELBO loss [5].

out in CityDreamer, denoted as a 3D volume $L$, is created by extruding the pixels in the semantic map $S$ based on the corresponding values in the height field $H$. Specifically, the value of $L$ at $(i, j, k)$ can be defined as

$$L_{(i,j,k)} = \begin{cases} S_{(i,j)} & \text{if } k \le H_{(i,j)} \\ 0 & \text{otherwise} \end{cases} \tag{1}$$

where 0 denotes empty spaces in the 3D volumes.
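The extrusion in Equation 1 can be illustrated with a few lines of Python. This is a minimal sketch under stated assumptions: the function name `extrude_layout`, the choice to index k from 0, and the label values (1 for road, 2 for building) are hypothetical and not taken from the paper.

```python
def extrude_layout(semantic_map, height_field, depth):
    """Build L[i][j][k] = S[i][j] if k <= H[i][j], else 0 (Equation 1)."""
    null_class = 0  # empty space in the 3D volume
    layout = []
    for label_row, height_row in zip(semantic_map, height_field):
        layout_row = []
        for label, height in zip(label_row, height_row):
            # Fill the vertical column up to (and including) the height value.
            column = [label if k <= height else null_class for k in range(depth)]
            layout_row.append(column)
        layout.append(layout_row)
    return layout

# Toy 1x2 map: a road pixel (label 1, height 0) next to a building pixel (label 2, height 2).
S = [[1, 2]]
H = [[0, 2]]
L = extrude_layout(S, H, depth=4)
```

Each pixel of the 2D semantic map becomes a vertical column of voxels, so the 3D layout carries no learned features, only labels.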

City Layout Generation. Obtaining unbounded city layouts is translated into generating extendable semantic maps and height fields. To this end, we construct the unbounded layout generator based on MaskGIT [9], which inherently enables inpainting and extrapolation capabilities. Specifically, we employ VQVAE [49, 54] to tokenize the semantic map and height field patches, converting them into a discrete latent space and creating a codebook $\mathcal{C} = \{c_k \mid c_k \in \mathbb{R}^D\}_{k=1}^{K}$. During inference, we generate the layout token $T$ in an autoregressive manner, and subsequently, we use the VQVAE’s decoder to generate a pair of semantic map $S$ and height field $H$. Since VQVAE generates fixed-size semantic maps and height fields, we use image extrapolation to create arbitrary-sized ones. During this process, we adopt a sliding window to forecast a local layout token at every step, with a 25% overlap during the sliding.
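To make the 25% overlap concrete, the sketch below computes where successive windows would start along one axis. It covers only the window placement, not the token prediction itself. The function name `window_starts`, the handling of the final edge window, and the toy sizes are assumptions for illustration; the text does not specify them.

```python
def window_starts(total, window, overlap=0.25):
    """Start offsets of windows of size `window` covering [0, total).

    Consecutive windows share `overlap` of their extent, so the stride
    is (1 - overlap) * window.
    """
    stride = int(window * (1 - overlap))
    if stride <= 0:
        raise ValueError("overlap must be smaller than 1")
    starts = list(range(0, max(total - window, 0) + 1, stride))
    # Assumption: add one last window flush with the far edge if needed.
    if starts[-1] + window < total:
        starts.append(total - window)
    return starts

# With a window of 8 tokens, the stride is 6, so neighbours share 2 tokens (25%).
print(window_starts(20, 8))  # [0, 6, 12]
```

In an extrapolation setting, the shared region is the part of the new window that is already known, and only the remaining 75% needs to be predicted.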

Loss Functions. The VQVAE treats the generation of the

height field and semantic map as two separate tasks, opti-

3.2. City Background Generator

Scene Representation. Similar to SceneDreamer [11], we use the bird’s-eye-view (BEV) scene representation for its efficiency and expressive capabilities, making it easily applicable to unbounded scenes. Different from GANCraft [22] and InfiniCity [34], where features are parameterized to voxel corners, the BEV representation comprises a feature-free 3D volume generated from a height field and a semantic map, following Equation 1. Specifically, we initiate the process by selecting a local window with a resolution of $N_G^H \times N_G^W \times N_G^D$ from the city layout $L$. This local window is denoted as $L_G^{\mathrm{Local}}$, which is generated from the corresponding height field $H_G^{\mathrm{Local}}$ and semantic map $S_G^{\mathrm{Local}}$.

Scene Parameterization. To achieve generalizable 3D representation learning across various scenes and align content with 3D semantics, it is necessary to parameterize the scene representation into a latent space, making adversarial learning easier. For city backgrounds, we adopt the generative neural hash grid [11] to learn generalizable features across scenes by modeling the hyperspace beyond 3D space. Specifically, we first encode the local scene $(H_G^{\mathrm{Local}}, S_G^{\mathrm{Local}})$ using the global encoder $E_G$ to produce the compact scene-level feature $f_G \in \mathbb{R}^{d_G}$.

$$f_G = E_G(H_G^{\mathrm{Local}}, S_G^{\mathrm{Local}}) \tag{3}$$

By leveraging a learnable neural hash function $\mathcal{H}$, the indexed feature $f_G^{p}$ at 3D position $p \in \mathbb{R}^3$ can be obtained by mapping $p$ and $f_G$ into a hyperspace, i.e., $\mathbb{R}^{3+d_G} \to \mathbb{R}^{N_G^C}$.

$$f_G^{p} = \mathcal{H}(p, f_G) = \Big( \bigoplus_{i=1}^{d_G} f_G^{i} \pi^{i} \bigoplus_{j=1}^{3} p^{j} \pi^{j} \Big) \bmod T \tag{4}$$

where $\oplus$ denotes the bit-wise XOR operation. $\pi^{i}$ and $\pi^{j}$ represent large and unique prime numbers. We construct $N_H^L$ levels of multi-resolution hash grids to represent multi-scale features, $T$ is the maximum number of entries per level, and $N_G^C$ denotes the number of channels in each unique feature vector.

Volumetric Rendering. In a perspective camera model, each pixel in the image corresponds to a camera ray $r(t) = o + tv$, where the ray originates from the center of projection $o$ and extends in the direction $v$. Thus, the corresponding pixel value $C(r)$ is derived from an integral.

$$C(r) = \int_0^{\infty} T(t)\, c(f_G^{r(t)}, l(r(t)))\, \sigma(f_G^{r(t)})\, dt \tag{5}$$

where $T(t) = \exp\big(-\int_0^{t} \sigma(f_G^{r(s)})\, ds\big)$. $l(p)$ represents the semantic label at the 3D position $p$. $c$ and $\sigma$ denote the color and volume density, respectively.

Loss Function. The city background generator is trained using a hybrid objective, which includes a combination of a reconstruction loss and an adversarial learning loss. Specifically, we leverage the L1 loss, perceptual loss $\mathcal{P}$ [26], and GAN loss $\mathcal{G}$ [32] in this combination.

$$\ell_G = \lambda_{L1} \lVert \hat{I}_G - I_G \rVert + \lambda_P\, \mathcal{P}(\hat{I}_G, I_G) + \lambda_G\, \mathcal{G}(\hat{I}_G, S_G) \tag{6}$$

where $I_G$ denotes the ground truth background image. $S_G$ is the semantic map in perspective view generated by accumulating semantic labels sampled from the $L_G^{\mathrm{Local}}$ along each ray. The weights of the three losses are denoted as $\lambda_{L1}$, $\lambda_P$, and $\lambda_G$. Note that $\ell_G$ is solely applied to pixels with semantic labels belonging to background objects.

3.3. Building Instance Generator

Scene Representation. Just like the city background generator, the building instance generator also uses the BEV scene representation. In the building instance generator, we extract a local window denoted as $L_{B_i}^{\mathrm{Local}}$ from the city layout $L$, with a resolution of $N_B^H \times N_B^W \times N_B^D$, centered around the 2D center $(c_{B_i}^{x}, c_{B_i}^{y})$ of the building instance $B_i$. The height field and semantic map used to generate $L_{B_i}^{\mathrm{Local}}$ can be denoted as $H_{B_i}^{\mathrm{Local}}$ and $S_{B_i}^{\mathrm{Local}}$, respectively. Since all buildings have the same semantic label in $S$, we perform building instantiation by detecting connected components. We observe that the façades and roofs of buildings in real-world scenes exhibit distinct distributions. Consequently, we assign different semantic labels to the façade and roof of the building instance $B_i$ in $L_{B_i}^{\mathrm{Local}}$, with the top-most voxel layer being assigned the roof label. The remaining building instances are omitted in $L_{B_i}^{\mathrm{Local}}$ by assigning them the null class.

Scene Parameterization. In contrast to the city background generator, the building instance generator employs a novel scene parameterization that relies on pixel-level features generated by a local encoder $E_B$. Specifically, we start by encoding the local scene $(H_{B_i}^{\mathrm{Local}}, S_{B_i}^{\mathrm{Local}})$ using $E_B$, resulting in the pixel-level feature $f_{B_i}$, which has a resolution of $N_B^H \times N_B^W \times N_B^C$.

$$f_{B_i} = E_B(H_{B_i}^{\mathrm{Local}}, S_{B_i}^{\mathrm{Local}}) \tag{7}$$

Given a 3D position $p = (p_x, p_y, p_z)$, the corresponding indexed feature $f_{B_i}^{p}$ can be computed as

$$f_{B_i}^{p} = \mathcal{O}(\mathrm{Concat}(f_{B_i}^{(p_x, p_y)}, p_z)) \tag{8}$$

where $\mathrm{Concat}(\cdot)$ is the concatenation operation. $f_{B_i}^{(p_x, p_y)} \in \mathbb{R}^{N_B^C}$ denotes the feature vector at $(p_x, p_y)$. $\mathcal{O}(\cdot)$ is the positional encoding function used in the vanilla NeRF [41].

$$\mathcal{O}(x) = \{\sin(2^{i} \pi x), \cos(2^{i} \pi x)\}_{i=0}^{N_P^L - 1} \tag{9}$$

Note that $\mathcal{O}(\cdot)$ is applied individually to each value in the given feature $x$, which are normalized to lie within the range of $[-1, 1]$.

Volumetric Rendering. Different from the volumetric rendering used in the city background generator, we incorporate a style code $z$ in the building instance generator to capture the diversity of buildings. The corresponding pixel value $C(r)$ is obtained through an integral process.

$$C(r) = \int_0^{\infty} T(t)\, c(f_{B_i}^{r(t)}, z, l(r(t)))\, \sigma(f_{B_i}^{r(t)})\, dt \tag{10}$$

Note that the camera ray $r(t)$ is normalized with respect to $(c_{B_i}^{x}, c_{B_i}^{y}, 0)$ as the origin.

Loss Function. For training the building instance generator, we exclusively use the GAN Loss. Mathematically, it can be represented as

$$\ell_B = \mathcal{G}(\hat{I}_{B_i}, S_{B_i}) \tag{11}$$

where $S_{B_i}$ denotes the semantic map of building instance $B_i$ in perspective view, which is generated in a similar manner to $S_G$. Note that $\ell_B$ is exclusively applied to pixels with semantic labels belonging to the building instance.

Table 1. A comparison of GoogleEarth with representative city-related datasets. Note that the number of images and area are counted based on real-world images. “sate.” represents satellite. “inst.”, “sem.”, and “plane” denote “instance segmentation”, “semantic segmentation”, and “plane segmentation”, respectively.

Dataset            | # Images | Area   | View         | Annotation  | 3D
KITTI [19]         | 15 k     | -      | street       | sem.        | ✗
Cityscapes [12]    | 25 k     | -      | street       | sem.        | ✗
SpaceNet MOVI [56] | 6.0 k    | -      | sate.        | inst.       | ✗
OmniCity [30]      | 108 k    | -      | street/sate. | inst./plane | ✗
HoliCity [61]      | 6.3 k    | 20 km² | street       | sem./plane  | ✓
UrbanScene3D [35]  | 6.1 k    | 3 km²  | drone        | inst.       | ✓
GoogleEarth        | 24 k     | 25 km² | drone        | inst./sem.  | ✓

zoom level 18, approximately 0.597 meters per pixel. As

shown in Figure 3(a), the segmentation maps use red, yel-

low, green, cyan, and blue colors to denote the positions of

roads, buildings, green lands, construction sites, and water

areas, respectively. The height fields primarily represent the

height of buildings, with their values derived from Open-

StreetMap. For roads, the height values are set to 4, while

for water areas, they are set to 0. Additionally, the height

values for trees are sampled from Perlin noise [46], ranging

from 8 to 16.
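The figure of roughly 0.597 meters per pixel at zoom level 18 agrees with the usual Web Mercator ground-resolution formula evaluated at the equator. The sketch below reproduces that number. It assumes 256-pixel tiles and the WGS 84 equatorial radius, neither of which is stated in the text, and the function name `meters_per_pixel` is hypothetical.

```python
import math

EARTH_EQUATORIAL_RADIUS_M = 6378137.0  # WGS 84 semi-major axis, used by EPSG:3857
TILE_SIZE_PX = 256                     # assumed tile size

def meters_per_pixel(zoom, latitude_deg=0.0):
    """Ground resolution of a Web Mercator map at a given zoom level and latitude."""
    circumference = 2 * math.pi * EARTH_EQUATORIAL_RADIUS_M
    world_width_px = TILE_SIZE_PX * 2 ** zoom
    return circumference * math.cos(math.radians(latitude_deg)) / world_width_px

print(round(meters_per_pixel(18), 3))  # 0.597
```

In Web Mercator the ground resolution shrinks with the cosine of the latitude, so the real-world size of a pixel in a given city is smaller than the equatorial value; the text does not say which convention it uses.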

[Figure 3(d) plot: dataset statistics for the GoogleEarth dataset. Axes: elevation angle of viewpoint (°) and altitude of viewpoint (m).]

Figure 3. Overview of the proposed datasets. (a) The OSM

dataset comprising paired height fields and semantic maps pro-

vides real-world city layouts. (b) The city layout, generated from

the height field and semantic map, facilitates automatic annota-

tion generation. (c) The GoogleEarth dataset includes real-world

city appearances alongside semantic segmentation and building in-

stance segmentation. (d) The dataset statistics demonstrate the va-

riety of perspectives available in the GoogleEarth dataset.

4.2. The GoogleEarth dataset

The GoogleEarth dataset is collected from Google Earth

Studio [2], including 400 orbit trajectories in Manhattan and

Brooklyn. Each trajectory consists of 60 images, with or-

bit radiuses ranging from 125 to 813 meters and altitudes

varying from 112 to 884 meters. In addition to the images,

Google Earth Studio provides camera intrinsic and extrin-

sic parameters, making it possible to create automated an-

notations for semantic and building instance segmentation.

Specifically, for building instance segmentation, we initially

perform connected components detection on the semantic

maps to identify individual building instances. Then, the

city layout is created following Equation 1, as demonstrated

in Figure 3(b). Finally, the annotations are generated by

projecting the city layout onto the images, using the camera

parameters, as shown in Figure 3(c).
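The connected-components step used to separate building instances can be sketched with a breadth-first flood fill. This is an illustration only: the function name `label_building_instances`, the use of 4-connectivity, and the label value 2 for buildings are assumptions, since the text does not state them.

```python
from collections import deque

def label_building_instances(semantic_map, building_label):
    """Assign ids 1..n to 4-connected regions of `building_label`; 0 elsewhere."""
    rows, cols = len(semantic_map), len(semantic_map[0])
    instance_map = [[0] * cols for _ in range(rows)]
    next_id = 0
    for r in range(rows):
        for c in range(cols):
            if semantic_map[r][c] != building_label or instance_map[r][c]:
                continue
            # Found an unvisited building pixel: start a new instance.
            next_id += 1
            instance_map[r][c] = next_id
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and semantic_map[ny][nx] == building_label
                            and not instance_map[ny][nx]):
                        instance_map[ny][nx] = next_id
                        queue.append((ny, nx))
    return instance_map, next_id

# Toy semantic map with three separate building footprints (label 2).
semantic = [
    [2, 2, 0, 2],
    [0, 0, 0, 2],
    [2, 0, 0, 0],
]
instances, count = label_building_instances(semantic, building_label=2)
```

Each resulting id can then be extruded and projected like any other label, which is what makes per-building masks possible without manual annotation.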

Table 1 presents a comparative overview between

GoogleEarth and other datasets related to urban environ-

ments. Among datasets that offer 3D models, GoogleEarth

stands out for its extensive coverage of real-world images,

encompassing the largest area, and providing annotations

for both semantic and instance segmentation. Figure 3(d)

offers an analysis of viewpoint altitudes and elevations in

the GoogleEarth dataset, highlighting its diverse camera

viewpoints. This diversity enhances neural networks’ abil-

ity to generate cities from a broader range of perspectives.

Additionally, leveraging Google Earth and OpenStreetMap

data allows us to effortlessly expand our dataset to encom-

pass more cities worldwide.

3.4. Compositor

Since there are no corresponding ground truth images for the images generated by the City Background Generator and Building Instance Generator, it is not possible to train neural networks to merge these images. Therefore, the compositor takes the generated images $\hat{I}_G$ and $\{\hat{I}_{B_i}\}_{i=1}^{n}$, along with their corresponding binary masks $M_G$ and $\{M_{B_i}\}_{i=1}^{n}$, and combines them into a unified image $I_C$, which can be represented as

$$I_C = \hat{I}_G M_G + \sum_{i=1}^{n} \hat{I}_{B_i} M_{B_i} \tag{12}$$

where $n$ is the number of building instances.
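Because the compositor is a fixed masked sum rather than a learned network, it can be written in a few lines. The sketch below applies Equation 12 per pixel on single-channel images stored as nested lists. The function name `composite` and the toy values are hypothetical, and the example assumes the binary masks do not overlap.

```python
def composite(bg_image, bg_mask, bldg_images, bldg_masks):
    """Per-pixel I_C = I_G * M_G + sum_i I_Bi * M_Bi on single-channel images."""
    height, width = len(bg_image), len(bg_image[0])
    out = [[bg_image[y][x] * bg_mask[y][x] for x in range(width)]
           for y in range(height)]
    for image, mask in zip(bldg_images, bldg_masks):
        for y in range(height):
            for x in range(width):
                out[y][x] += image[y][x] * mask[y][x]
    return out

# Toy 1x3 image: the background owns pixel 0, two buildings own pixels 1 and 2.
result = composite(
    bg_image=[[0.2, 0.2, 0.2]], bg_mask=[[1, 0, 0]],
    bldg_images=[[[0.9, 0.9, 0.9]], [[0.5, 0.5, 0.5]]],
    bldg_masks=[[[0, 1, 0]], [[0, 0, 1]]],
)
```

With disjoint masks, every output pixel is copied from exactly one source image, so no blending weights need to be learned.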

4. The Proposed Datasets

4.1. The OSM Dataset

The OSM dataset, sourced from OpenStreetMap [1], is

composed of the rasterized semantic maps and height fields

of 80 cities worldwide, spanning an area of more than 6,000

km². During the rasterization process, vectorized geometry

information is converted into images by translating longi-

tude and latitude into the EPSG:3857 coordinate system at

Table 2. Quantitative comparison. The best values are highlighted in bold. Note that the results of InfiniCity are not included in this comparison as it is not open-sourced.

5. Experiments

5.1. Evaluation Protocols

During evaluation, we use the Unbounded Layout Generator to generate 1024 distinct city layouts. For each scene, we sample 20 different styles by randomizing the style code z. Each sample is transformed into a fly-through video consisting of 40 frames, each with a resolution of 960×540 pixels, rendered along an arbitrary camera trajectory. Subsequently, we randomly select frames from these video sequences for evaluation. The evaluation metrics are as follows:

FID and KID. Fréchet Inception Distance (FID) [23] and

Kernel Inception Distance (KID) [4] are metrics for the

quality of generated images. We compute FID and KID

between a set of 15,000 generated frames and an evaluation

set comprising 15,000 images randomly sampled from the

GoogleEarth dataset.

Depth Error. We employ depth error (DE) to assess the 3D geometry, following a similar approach to EG3D [8]. Using a pre-trained model [48], we generate pseudo ground truth depth maps for the generated frames and compare them with the depth predicted by accumulating density σ. Both the “ground truth” depth and the predicted depth are normalized to zero mean and unit variance to eliminate scale ambiguity. DE is computed as the L2 distance between the two normalized depth maps. We assess this depth error on 100 frames for each evaluated method.
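The normalization step is what makes the two depth maps comparable despite different scales. The sketch below shows one way to compute such an error on flattened depth maps. The function names are hypothetical, and averaging the squared differences over pixels (a root-mean-square form of the L2 distance) is an assumption; the text only says "L2 distance".

```python
import math

def normalize(depth):
    """Shift and scale a flat list of depths to zero mean and unit variance."""
    mean = sum(depth) / len(depth)
    variance = sum((d - mean) ** 2 for d in depth) / len(depth)
    std = math.sqrt(variance)  # a constant depth map (std == 0) is not handled here
    return [(d - mean) / std for d in depth]

def depth_error(pred_depth, pseudo_gt_depth):
    """L2 distance between two normalized depth maps, averaged over pixels."""
    pred = normalize(pred_depth)
    target = normalize(pseudo_gt_depth)
    squared = sum((a - b) ** 2 for a, b in zip(pred, target))
    return math.sqrt(squared / len(pred))

# Two depth maps that differ only by a global scale give an error of (nearly) zero.
same_shape = depth_error([1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
# Reversing the depth ordering gives a large error.
flipped = depth_error([1.0, 2.0, 3.0], [30.0, 20.0, 10.0])
```

Because both maps are standardized first, a renderer is not penalized for producing depth in different units than the monocular depth estimator.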

Camera Error. Following SceneDreamer [11], we intro-

duce camera error (CE) to assess multi-view consistency.

CE quantifies the difference between the inference cam-

era trajectory and the estimated camera trajectory from

COLMAP [50]. It is calculated as the scale-invariant nor-

malized L2 distance between the reconstructed and gener-

ated camera poses.

Methods              | FID ↓  | KID ↓ | DE ↓  | CE ↓
SGAM [51]            | 277.64 | 0.358 | 0.575 | 239.291
PersistentNature [7] | 123.83 | 0.109 | 0.326 | 86.371
SceneDreamer [11]    | 213.56 | 0.216 | 0.152 | 0.186
CityDreamer          | 97.38  | 0.096 | 0.147 | 0.060

$N_B^H$, $N_B^W$, and $N_B^D$ are set to 672, 672, and 640, respectively. The number of channels $N_B^C$ of the pixel-level features is 63. The dimension $N_P^L$ is set to 10.

Training Details

Unbounded Layout Generator. The VQVAE is trained with a batch size of 16 using an Adam optimizer with $\beta = (0.5, 0.9)$ and a learning rate of $7.2 \times 10^{-5}$ for 1,250,000 iterations. The autoregressive transformer is trained with a batch size of 80 using an Adam optimizer with $\beta = (0.9, 0.999)$ and a learning rate of $2 \times 10^{-4}$ for 250,000 iterations.

City Background and Building Instance Generators. Both generators are trained using an Adam optimizer with $\beta = (0, 0.999)$ and a learning rate of $10^{-4}$. The discriminators are optimized using an Adam optimizer with $\beta = (0, 0.999)$ and a learning rate of $10^{-5}$. The training lasts for 298,500 iterations with a batch size of 8. The images are randomly cropped to a size of 192×192.
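The training settings above can be collected into a small configuration table, which also makes it easy to compare how many samples each stage processes. This is only a bookkeeping sketch: the class and key names are hypothetical, and it assumes the 298,500 iterations and batch size of 8 apply to both the generators and the discriminators.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class AdamSetting:
    """Adam hyperparameters and schedule length for one training stage."""
    learning_rate: float
    betas: Tuple[float, float]
    batch_size: int
    iterations: int

    def samples_seen(self) -> int:
        """Total number of training samples processed (batch size x iterations)."""
        return self.batch_size * self.iterations


# Values copied from the training details; the dictionary keys are illustrative.
TRAINING = {
    "vqvae": AdamSetting(7.2e-5, (0.5, 0.9), 16, 1_250_000),
    "transformer": AdamSetting(2e-4, (0.9, 0.999), 80, 250_000),
    "generators": AdamSetting(1e-4, (0.0, 0.999), 8, 298_500),
    "discriminators": AdamSetting(1e-5, (0.0, 0.999), 8, 298_500),
}
```

With these numbers, the VQVAE and the transformer each process 20,000,000 samples, while the generators process 2,388,000 image crops.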

5.3. Comparisons

Baselines. We compare CityDreamer against four state-of-

the-art methods: SGAM [51], PersistentNature [7], Scene-

Dreamer [11], and InfiniCity [34]. With the exception of In-

finiCity, whose code is not available, the remaining methods

are retrained using the released code on the GoogleEarth

dataset to ensure a fair comparison. SceneDreamer initially

uses simplex noise for layout generation, which is not ideal

for cities, so it is replaced with the unbounded layout gen-

erator from CityDreamer.

Qualitative Comparison. Figure 4 provides qualitative

comparisons against baselines. SGAM struggles to produce

realistic results and maintain good 3D consistency because

extrapolating views for complex 3D cities can be extremely

challenging. PersistentNature employs tri-plane represen-

tation, but it encounters challenges in generating realistic

renderings. SceneDreamer and InfiniCity both utilize voxel

grids as their representation, but they still suffer from se-

vere structural distortions in buildings because all buildings

are given the same semantic label. In contrast, the proposed CityDreamer generates more realistic and diverse results than all the baselines.

Quantitative Comparison. Table 2 presents the quanti-

tative metrics of the proposed approach compared to the

baselines. CityDreamer exhibits significant improvements

5.2. Implementation Details

We implement our network using PyTorch and CUDA.

The experiments are conducted using eight NVIDIA Tesla

V100 GPUs.

Hyperparameters

Unbounded Layout Generator. The codebook size K is set to 512, and each code’s dimension D is set to 512. The height field and semantic map patches are cropped to a size of 512×512, and compressed by a factor of 16. The loss weights, $\lambda_R$, $\lambda_S$, and $\lambda_E$, are 10, 10, and 1, respectively.
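As a quick sanity check on these sizes, the arithmetic below derives the token grid per patch and the number of scalars in the codebook. It assumes that "a factor of 16" means 16× downsampling along each spatial axis, which the text does not state explicitly; the function and constant names are illustrative.

```python
def token_grid(patch_size: int, factor: int) -> tuple:
    """Spatial size of the token grid after downsampling each axis by `factor`."""
    if patch_size % factor != 0:
        raise ValueError("patch size must be divisible by the compression factor")
    side = patch_size // factor
    return (side, side)


PATCH_SIZE = 512      # cropped patch size stated in the text
COMPRESSION = 16      # compression factor stated in the text
CODEBOOK_SIZE = 512   # K
CODE_DIM = 512        # D

grid = token_grid(PATCH_SIZE, COMPRESSION)
tokens_per_patch = grid[0] * grid[1]
codebook_scalars = CODEBOOK_SIZE * CODE_DIM
```

Under that assumption, each 512×512 patch becomes a 32×32 grid of 1,024 tokens, and the codebook holds 262,144 scalars.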

City Background Generator. The local window resolution $N_G^H$, $N_G^W$, and $N_G^D$ are set to 1536, 1536, and 640, respectively. The dimension of the scene-level features $d_G$ is 2. For the generative hash grid, we use $N_H = 16$, $T = 2^{19}$, and $N_G = 8$. The unique prime numbers in Equation 4 are set to $\pi_1 = 1$, $\pi_2 = 2654435761$, $\pi_3 = 805459861$, $\pi_4 = 3674653429$, and $\pi_5 = 2097192037$. The loss function weights, $\lambda_{L1}$, $\lambda_P$, and $\lambda_G$, are 10, 10, and 0.5, respectively.
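To illustrate how such multipliers are typically used, the sketch below implements an XOR-based spatial hash of the kind popularized by multiresolution hash encodings, using the five multipliers and the table size $2^{19}$ listed above. Equation 4 is not reproduced in this section, so treating it as this exact form is an assumption, as is the interpretation of the five inputs as three quantized position coordinates plus two quantized scene-feature values. The function name is hypothetical.

```python
# Multipliers and table size as listed in the hyperparameters.
PRIMES = (1, 2654435761, 805459861, 3674653429, 2097192037)
TABLE_SIZE = 2 ** 19


def hash_index(coords, primes=PRIMES, table_size=TABLE_SIZE):
    """Map a tuple of non-negative integers to a hash-table slot.

    Each coordinate is multiplied by its own multiplier, the products are
    combined with XOR, and the result is reduced modulo the table size.
    """
    if len(coords) > len(primes):
        raise ValueError("more coordinates than multipliers")
    h = 0
    for c, p in zip(coords, primes):
        h ^= c * p
    return h % table_size
```

Every index falls in the range [0, 2^19), so all grid vertices share one fixed-size feature table regardless of the scene extent.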

Building Instance Generator. The local window resolution

Figure 4. Qualitative comparison. The proposed CityDreamer produces more realistic and diverse results compared to all baselines. Note that the visual results of InfiniCity [34] are provided by the authors and zoomed for optimal viewing.

in FID and KID, which is consistent with the visual comparisons. Moreover, CityDreamer demonstrates the capability to maintain accurate 3D geometry and view consistency while generating photorealistic images, as evidenced by the lowest DE and CE among all compared methods.

User Study. To better assess the 3D consistency and quality

of the unbounded 3D city generation, we conduct an output

evaluation [6] as the user study. In this survey, we ask 20

volunteers to rate each generated camera trajectory based

on three aspects: 1) the perceptual quality of the imagery,

2) the level of 3D realism, and 3) the 3D view consistency.

The scores are on a scale of 1 to 5, with 5 representing the

best rating. The results are presented in Figure 5, showing that the proposed method outperforms the baselines by a large margin.

ing “unbounded” city layouts. We compare it with Infini-

tyGAN [33] used in InfiniCity and a rule-based city layout

generation method, IPSM [10], as shown in Table 3. Following InfiniCity [34], we use FID and KID to evaluate the

quality of the generated layouts. Compared to IPSM and

InfinityGAN, Unbounded Layout Generator achieves better

results in terms of all metrics. The qualitative results shown

in Figure 6 also demonstrate the effectiveness of the pro-

posed method.

Effectiveness of Building Instance Generator. We em-

phasize the crucial role of the Building Instance Generator

in the success of unbounded 3D city generation. To demon-

strate its effectiveness, we conducted an ablation study on

the Building Instance Generator. We compare two alternative designs: (1) removing the Building Instance Generator from CityDreamer, i.e., the model falls back to SceneDreamer; (2) generating all buildings at once with the Building Instance Generator, without providing any instance labels. The quantitative results presented in Table 4

5.4. Ablation Study

Effectiveness of Unbounded Layout Generator. The Un-

bounded Layout Generator plays a critical role in generat-

Table 3. Effectiveness of Unbounded Layout Generator. The best values are highlighted in bold. The images are centrally cropped to a size of 4096×4096.

Methods             FID ↓     KID ↓
IPSM [10]           321.47    0.502
InfinityGAN [33]    183.14    0.288
Ours                124.45    0.123

Figure 5. User study on unbounded 3D city generation. All scores are on a scale of 1 to 5, with 5 indicating the best. Panels: Perceptual Quality, Degree of 3D Realism, View Consistency. Methods: SGAM, Pers.Nature, SceneDreamer, InfiniCity, CityDreamer.

Table 4. Effectiveness of Building Instance Generator. The best values are highlighted in bold. Note that “w/o BIG.” indicates the removal of Building Instance Generator from CityDreamer. “w/o Ins.” denotes the absence of building instance labels in the Building Instance Generator.

demonstrate the effectiveness of both the instance labels and

the Building Instance Generator. Please refer to Figure 7 for

more qualitative comparisons.

Effectiveness of Scene Parameterization. Scene param-

eterization directly impacts the quality of 3D city gener-

ation. The City Background Generator utilizes HashGrid

with patch-wise features from the global encoder, while

the Building Instance Generator uses vanilla SinCos posi-

tional encoding with pixel-wise features from the local en-

coder. We compare different scene parameterizations in

both the City Background Generator and the Building In-

stance Generator. Table 5 shows that using local encoders

in background generation or using global encoders in build-

ing generation leads to considerable degradation in image

quality, indicated by poor metrics. According to Equa-

tion 4, the output of HashGrid is determined by the scene-

level features and 3D position. While HashGrid enhances

the multi-view consistency of the generated background, it

also introduces challenges in building generation, leading

to less structurally reasonable buildings. In contrast, the

inherent periodicity of SinCos makes it easier for the net-

work to learn the periodicity of building façades, leading

to improved results in building generation. Please refer to

Sec. A.2 in the appendix for a detailed discussion.
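The periodicity argument can be made concrete with a NeRF-style [41] sine-cosine encoding of a scalar input. Equation 9 is not reproduced in this section, so its exact form is assumed here; the default of 10 frequency levels echoes the value 10 given in the hyperparameters, but that correspondence is also an assumption, and the function name is hypothetical.

```python
import math


def sincos_encoding(x, num_levels=10):
    """Return [sin(2^0 pi x), cos(2^0 pi x), ..., sin(2^(L-1) pi x), cos(2^(L-1) pi x)]."""
    features = []
    for level in range(num_levels):
        freq = (2 ** level) * math.pi
        features.append(math.sin(freq * x))
        features.append(math.cos(freq * x))
    return features
```

Every component repeats when the input shifts by 2, so two inputs that differ by a multiple of that period receive the same encoding. A network fed such features can reproduce a repeated façade element without learning each repetition separately, whereas a hash grid assigns unrelated features to each location.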

Methods     FID ↓     KID ↓    DE ↓     CE ↓
w/o BIG.    213.56    0.216    0.152    0.186
w/o Ins.    117.75    0.124    0.148    0.098
Ours        97.38     0.096    0.147    0.060

Table 5. Effectiveness of different generative scene parameterization. The best values are highlighted in bold. Note that “CBG.” and “BIG.” denote City Background Generator and Building Instance Generator, respectively. “Enc.” and “P.E.” represent “Encoder” and “Positional Encoding”, respectively.

CBG. Enc.   CBG. P.E.   BIG. Enc.   BIG. P.E.   FID ↓     KID ↓    DE ↓     CE ↓
Local       SinCos      Global      Hash        219.30    0.233    0.154    0.452
Local       SinCos      Local       SinCos      107.63    0.125    0.149    0.078
Global      Hash        Global      Hash        213.56    0.216    0.153    0.186
Global      Hash        Local       SinCos      97.38     0.096    0.147    0.060

6. Conclusion

In this paper, we propose CityDreamer, a compositional

generative model designed specifically for unbounded 3D

cities. Compared to existing methods that treat buildings as

a single class of objects, CityDreamer separates the gen-

eration of building instances from other background ob-

jects, allowing for better handling of the diverse appear-

ances of buildings. Additionally, we create the OSM and

GoogleEarth datasets, providing more realistic city layouts

and appearances, and easily scalable to include other cities

worldwide. CityDreamer is evaluated quantitatively and qualitatively against state-of-the-art methods, showcasing its capability in generating large-scale and diverse 3D cities.

5.5. Discussion

Applications. This research primarily benefits applications

that require efficient content creation, with notable exam-

ples being the entertainment industry. There is a strong de-

mand to generate content for computer games and movies

within this field.

Limitations. 1) The generation of the city layout involves

raising voxels to a specific height, which means that con-

cave geometries like caves and tunnels cannot be modeled

and generated. 2) During the inference process, the build-

ings are generated individually, resulting in a slightly higher

computation cost. Exploring ways to reduce the inference

cost would be beneficial for future work.

Acknowledgments This study is supported by the Min-

istry of Education, Singapore, under its MOE AcRF Tier

2 (MOE-T2EP20221-0012), NTU NAP, and under the

RIE2020 Industry Alignment Fund – Industry Collabora-

tion Projects (IAF-ICP) Funding Initiative, as well as cash

and in-kind contribution from the industry partner(s).

References

[1] https://openstreetmap.org. 2, 5

[2] https://earth.google.com/studio. 2, 5

[3] Miguel Ángel Bautista, Pengsheng Guo, Samira Abnar, Wal-

ter Talbott, Alexander Toshev, Zhuoyuan Chen, Laurent

Dinh, Shuangfei Zhai, Hanlin Goh, Daniel Ulbricht, Afshin

Dehghan, and Joshua M. Susskind. GAUDI: A neural archi-

tect for immersive 3d scene generation. In NeurIPS, 2022.

[4] Mikolaj Binkowski, Danica J. Sutherland, Michael Arbel,

and Arthur Gretton. Demystifying MMD GANs. In ICLR,

2018. 6

[5] Sam Bond-Taylor, Peter Hessey, Hiroshi Sasaki, Toby P.

Breckon, and Chris G. Willcocks. Unleashing transform-

ers: Parallel token prediction with discrete absorbing diffu-

sion for fast high-resolution image generation from vector-

quantized codes. In ECCV, 2022. 3

[6] Zoya Bylinskii, Laura Mariah Herman, Aaron Hertzmann,

Stefanie Hutka, and Yile Zhang. Towards better user studies

in computer graphics and vision. Foundations and Trends in

Computer Graphics and Vision, 15(3):201–252, 2023. 7

[7] Lucy Chai, Richard Tucker, Zhengqi Li, Phillip Isola, and

Noah Snavely. Persistent Nature: A generative model of un-

bounded 3D worlds. In CVPR, 2023. 1, 2, 6

[8] Eric R. Chan, Connor Z. Lin, Matthew A. Chan, Koki

Nagano, Boxiao Pan, Shalini De Mello, Orazio Gallo,

Leonidas J. Guibas, Jonathan Tremblay, Sameh Khamis,

Tero Karras, and Gordon Wetzstein. Efficient geometry-

aware 3D generative adversarial networks. In CVPR, 2022.

2, 6

[9] Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and

William T. Freeman. MaskGIT: Masked generative image

transformer. In CVPR, 2022. 3

[10] Guoning Chen, Gregory Esch, Peter Wonka, Pascal Müller,

and Eugene Zhang. Interactive procedural street modeling.

ACM TOG, 27(3):103, 2008. 7, 8

[11] Zhaoxi Chen, Guangcong Wang, and Ziwei Liu. Scene-

Dreamer: Unbounded 3D scene generation from 2D image

collections. arXiv, 2302.01330, 2023. 1, 2, 3, 4, 6

[12] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo

Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe

Franke, Stefan Roth, and Bernt Schiele. The cityscapes

dataset for semantic urban scene understanding. In CVPR,

2016. 5

[13] Angela Dai, Angel X. Chang, Manolis Savva, Maciej Hal-

ber, Thomas A. Funkhouser, and Matthias Nießner. Scan-

Net: Richly-annotated 3D reconstructions of indoor scenes.

In CVPR, 2017. 2

[14] Terrance DeVries, Miguel Ángel Bautista, Nitish Srivastava,

Graham W. Taylor, and Joshua M. Susskind. Unconstrained

scene generation with locally conditioned radiance fields. In

ICCV, 2021. 2

[15] Patrick Esser, Robin Rombach, and Björn Ommer. Taming

transformers for high-resolution image synthesis. In CVPR,

2021. 2

[16] Huan Fu, Bowen Cai, Lin Gao, Lingxiao Zhang, Jiaming Wang, Cao Li, Qixun Zeng, Chengyue Sun, Rongfei Jia, Binqiang Zhao, and Hao Zhang. 3D-FRONT: 3D furnished rooms with layouts and semantics. In ICCV, 2021. 2

[17] Matheus Gadelha, Subhransu Maji, and Rui Wang. 3d shape induction from 2D views of multiple objects. In 3DV, 2017.

[18] Jun Gao, Tianchang Shen, Zian Wang, Wenzheng Chen, Kangxue Yin, Daiqing Li, Or Litany, Zan Gojcic, and Sanja Fidler. GET3D: A generative model of high quality 3D textured shapes learned from images. In NeurIPS, 2022. 2

[19] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the KITTI vision benchmark suite. In CVPR, 2012. 5

[20] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014. 2

[21] Jiatao Gu, Lingjie Liu, Peng Wang, and Christian Theobalt. StyleNeRF: A style-based 3D aware generator for high-resolution image synthesis. In ICLR, 2022. 2

[22] Zekun Hao, Arun Mallya, Serge J. Belongie, and Ming-Yu Liu. GANCraft: Unsupervised 3D neural rendering of minecraft worlds. In ICCV, 2021. 1, 2, 3

[23] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local nash equilibrium. In NIPS, 2017. 6

[24] Fangzhou Hong, Zhaoxi Chen, Yushi Lan, Liang Pan, and Ziwei Liu. EVA3D: compositional 3D human generation from 2D image collections. In ICLR, 2023. 1

[25] Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu. Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE TPAMI, 36(7):1325–1339, 2014. 2

[26] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, 2016. 4

[27] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. IEEE TPAMI, 43(12):4217–4228, 2021. 2

[28] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. In CVPR, 2020. 2

[29] Nikos Kolotouros, Thiemo Alldieck, Andrei Zanfir, Eduard Gabriel Bazavan, Mihai Fieraru, and Cristian Sminchisescu. DreamHuman: Animatable 3D avatars from text. arXiv, 2306.09329, 2023. 1

[30] Weijia Li, Yawen Lai, Linning Xu, Yuanbo Xiangli, Jinhua Yu, Conghui He, Gui-Song Xia, and Dahua Lin. OmniCity: Omnipotent city understanding with multi-level and multi-view images. In CVPR, 2023. 5

[31] Zhengqi Li, Qianqian Wang, Noah Snavely, and Angjoo Kanazawa. InfiniteNature-Zero: Learning perpetual view generation of natural scenes from single images. In ECCV, 2022. 2

[32] Jae Hyun Lim and Jong Chul Ye. Geometric GAN. arXiv, 1705.02894, 2017. 4

[49] Ali Razavi, Aäron van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with VQ-VAE-2. In NeurIPS, 2019. 3

[50] Johannes L. Schönberger and Jan-Michael Frahm. Structure-

from-motion revisited. In CVPR, 2016. 6, 13

[51] Yuan Shen, Wei-Chiu Ma, and Shenlong Wang. SGAM:

building a virtual 3D world through simultaneous generation

and mapping. In NeurIPS, 2022. 6

[52] Zifan Shi, Yujun Shen, Jiapeng Zhu, Dit-Yan Yeung, and

Qifeng Chen. 3D-aware indoor scene synthesis with depth

priors. In ECCV, 2022. 2

[53] Julian Straub, Thomas Whelan, Lingni Ma, Yufan Chen,

Erik Wijmans, Simon Green, Jakob J. Engel, Raul Mur-

Artal, Carl Yuheng Ren, Shobhit Verma, Anton Clarkson,

Mingfei Yan, Brian Budge, Yajie Yan, Xiaqing Pan, June

Yon, Yuyang Zou, Kimberly Leon, Nigel Carter, Jesus Bri-

ales, Tyler Gillingham, Elias Mueggler, Luis Pesqueira,

Manolis Savva, Dhruv Batra, Hauke M. Strasdat, Renzo De

Nardi, Michael Goesele, Steven Lovegrove, and Richard A.

Newcombe. The Replica Dataset: A digital replica of indoor

spaces. arXiv, 1906.05797, 2019. 2

[54] Aäron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. In NIPS, 2017. 3

[55] Xinpeng Wang, Chandan Yeshwanth, and Matthias Nießner.

Sceneformer: Indoor scene generation with transformers. In

3DV, 2021. 2

[56] Nicholas Weir, David Lindenbaum, Alexei Bastidas,

Adam Van Etten, Varun Kumar Vijay, Sean McPherson, Ja-

cob Shermeyer, and Hanlin Tang. Spacenet MVOI: A multi-

view overhead imagery dataset. In ICCV, 2019. 5

[57] Jiajun Wu, Chengkai Zhang, Tianfan Xue, Bill Freeman, and

Josh Tenenbaum. Learning a probabilistic latent space of ob-

ject shapes via 3D generative-adversarial modeling. In NIPS,

2016. 2

[58] Tong Wu, Jiarui Zhang, Xiao Fu, Yuxin Wang, Jiawei

Ren, Liang Pan, Wayne Wu, Lei Yang, Jiaqi Wang, Chen

Qian, Dahua Lin, and Ziwei Liu. OmniObject3D: Large-

vocabulary 3D object dataset for realistic perception, recon-

struction and generation. In CVPR, 2023. 2

[59] Yang Xue, Yuheng Li, Krishna Kumar Singh, and Yong Jae

Lee. GIRAFFE HD: A high-resolution 3D-aware generative

model. In CVPR, 2022. 2

[60] Chi Zhang, Yiwen Chen, Yijun Fu, Zhenglin Zhou, Gang

Yu, Billzb Wang, Bin Fu, Tao Chen, Guosheng Lin, and

Chunhua Shen. StyleAvatar3D: Leveraging image-text dif-

fusion models for high-fidelity 3D avatar generation. arXiv,

2305.19012, 2023. 1

[61] Yichao Zhou, Jingwei Huang, Xili Dai, Linjie Luo, Zhili

Chen, and Yi Ma. HoliCity: A city-scale data platform for

learning holistic 3D structures. arXiv, 2008.03286, 2020. 5

[33] Chieh Hubert Lin, Hsin-Ying Lee, Yen-Chi Cheng, Sergey

Tulyakov, and Ming-Hsuan Yang. InfinityGan: Towards

infinite-pixel image synthesis. In ICLR, 2022. 7, 8

[34] Chieh Hubert Lin, Hsin-Ying Lee, Willi Menapace, Menglei

Chai, Aliaksandr Siarohin, Ming-Hsuan Yang, and Sergey

Tulyakov. InfiniCity: Infinite-scale city synthesis. In ICCV,

2023. 1, 2, 3, 6, 7, 15

[35] Liqiang Lin, Yilin Liu, Yue Hu, Xingguang Yan, Ke Xie, and

Hui Huang. Capturing, reconstructing, and simulating: The

urbanscene3d dataset. In ECCV, 2022. 5

[36] Andrew Liu, Ameesh Makadia, Richard Tucker, Noah

Snavely, Varun Jampani, and Angjoo Kanazawa. Infinite Na-

ture: Perpetual view generation of natural scenes from a sin-

gle image. In ICCV, 2021. 2

[37] Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tok-

makov, Sergey Zakharov, and Carl Vondrick. Zero-1-to-3:

Zero-shot one image to 3D object. arXiv, 2303.11328, 2023.

[38] Arun Mallya, Ting-Chun Wang, Karan Sapra, and Ming-Yu

Liu. World-consistent video-to-video synthesis. In ECCV,

2020. 2

[39] Simon Meister, Junhwa Hur, and Stefan Roth. UnFlow: Un-

supervised learning of optical flow with a bidirectional cen-

sus loss. In AAAI, 2018. 3

[40] Luke Melas-Kyriazi, Christian Rupprecht, Iro Laina, and

Andrea Vedaldi. RealFusion: 360° reconstruction of any ob-

ject from a single image. In CVPR, 2023. 1

[41] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik,

Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. NeRF:

Representing scenes as neural radiance fields for view syn-

thesis. In ECCV, 2020. 2, 4

[42] Thu Nguyen-Phuoc, Chuan Li, Lucas Theis, Christian

Richardt, and Yong-Liang Yang. HoloGAN: Unsupervised

learning of 3D representations from natural images. In

CVPR, 2019. 2

[43] Roy Or-El, Xuan Luo, Mengyi Shan, Eli Shecht-

man, Jeong Joon Park, and Ira Kemelmacher-Shlizerman.

StyleSDF: High-resolution 3D-consistent image and geom-

etry generation. In CVPR, 2022. 2

[44] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan

Zhu. Semantic image synthesis with spatially-adaptive nor-

malization. In CVPR, 2019. 1, 2

[45] Despoina Paschalidou, Amlan Kar, Maria Shugrina, Karsten Kreis, Andreas Geiger, and Sanja Fidler. ATISS: Autoregressive transformers for indoor scene synthesis. In NeurIPS, 2021. 2

[46] Ken Perlin. An image synthesizer. In SIGGRAPH, 1985. 5

[47] Amit Raj, Srinivas Kaza, Ben Poole, Michael Niemeyer,

Nataniel Ruiz, Ben Mildenhall, Shiran Zada, Kfir Aberman,

Michael Rubinstein, Jonathan T. Barron, Yuanzhen Li, and

Varun Jampani. DreamBooth3D: Subject-driven text-to-3D

generation. arXiv, 2303.13508, 2023. 1

[48] René Ranftl, Katrin Lasinger, David Hafner, Konrad

Schindler, and Vladlen Koltun. Towards robust monocular

depth estimation: Mixing datasets for zero-shot cross-dataset

transfer. IEEE TPAMI, 44(3):1623–1637, 2022. 6

In this supplementary material, we offer extra details and additional results to complement the main paper. Firstly, we offer

more extensive information and results regarding the ablation studies in Sec. A. Secondly, we present additional experimental

results in Sec. B. Finally, we provide a brief overview of our interactive demo in Sec. C.

A. Additional Ablation Study Results

A.1. Qualitative Results for Ablation Studies

Effectiveness of Unbounded Layout Generator. Figure 6 gives a qualitative comparison as a supplement to Table 3,

demonstrating the effectiveness of Unbounded Layout Generator. In the case of InfinityGAN, we follow the approach used

in InfiniCity, where each class of semantic maps is assigned a specific color, and we convert back to a semantic map by

associating it with the nearest color.
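The nearest-color conversion described above can be sketched as follows. The palette, the class ids, and the function names are hypothetical; the actual colors assigned to each class are not given in this section, and squared RGB distance is an assumed choice of metric.

```python
def nearest_class(pixel, palette):
    """Return the class id whose palette color is closest to `pixel` in squared RGB distance."""
    return min(
        palette,
        key=lambda cls: sum((a - b) ** 2 for a, b in zip(pixel, palette[cls])),
    )


def image_to_semantic_map(image, palette):
    """Convert a 2D grid of RGB tuples into a 2D grid of class ids."""
    return [[nearest_class(px, palette) for px in row] for row in image]


# Hypothetical palette: class id -> RGB color.
EXAMPLE_PALETTE = {0: (0, 0, 0), 1: (255, 0, 0), 2: (0, 255, 0)}
```

Snapping to the nearest palette color absorbs the small color deviations that a continuous image generator introduces, so every pixel ends up with a valid class label.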

Figure 6. Qualitative comparison of different city layout generation methods. The height map values are normalized to a range of [0, 1]

by dividing each value by the maximum value within the map.
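The height-map normalization mentioned in the caption amounts to a per-map division by the maximum. A minimal sketch, assuming non-negative heights stored as a list of rows (returning all zeros for an entirely flat map is an added convention, not something the caption specifies):

```python
def normalize_height_map(height_map):
    """Scale heights to [0, 1] by dividing by the maximum value in the map."""
    peak = max(max(row) for row in height_map)
    if peak <= 0:
        # Assumed convention for an all-zero map: nothing to scale.
        return [[0.0 for _ in row] for row in height_map]
    return [[h / peak for h in row] for row in height_map]
```

Since the divisor is the maximum of each individual map, the normalized values are comparable within a map but not across maps with different peak heights.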

Effectiveness of Building Instance Generator. Figure 7 provides a qualitative comparison as a supplement to Table 4,

demonstrating the effectiveness of Building Instance Generator. Figure 7 highlights the importance of both Building Instance

Generator and the instance labels. Removing either of them significantly degrades the quality of the generated images.

Figure 7. Qualitative comparison of different Building Instance Generator variants. Note that “w/o BIG.” indicates the removal of Building Instance Generator from CityDreamer. “w/o Ins.” denotes the absence of building instance labels in Building Instance Generator.

A.2. More Discussions on Scene Parameterization

Table 5 displays the four primary combinations of different encoders and positional encodings, and Table 6 presents the twelve remaining combinations alongside them. The results in Table 6 clearly demonstrate the superiority of the scene parameterization used in CityDreamer.

We present the qualitative results for the sixteen scene parameterization settings in Figure 8. Using the Global Encoder and

Hash Grid as scene parameterization results in more natural city backgrounds (first column) but leads to a severe decrease in

the quality of generated buildings (first row). As demonstrated in the third row and third column, this irregularity is weakened

when the Global Encoder is replaced with the Local Encoder. Furthermore, using the Global Encoder with SinCos positional

encoding introduces periodic patterns, as shown in the second row and second column. However, this periodicity is disrupted

when the Global Encoder is replaced with the Local Encoder (the fourth row and column) because the input of SinCos

positional encoding no longer depends on 3D position p. Nevertheless, this change also slightly reduces the multi-view

consistency.

Table 6. Effectiveness of different generative scene parameterization. The best values are highlighted in bold. Note that “CBG.” and “BIG.” denote City Background Generator and Building Instance Generator, respectively. “Enc.” and “P.E.” represent “Encoder” and “Positional Encoding”, respectively.

CBG. Enc.   CBG. P.E.   BIG. Enc.   BIG. P.E.   FID ↓     KID ↓    DE ↓     CE ↓
Global      Hash        Global      Hash        213.56    0.216    0.153    0.186
Global      Hash        Global      SinCos      113.45    0.141    0.149    0.086
Global      Hash        Local       Hash        112.61    0.129    0.153    0.095
Global      Hash        Local       SinCos      97.38     0.096    0.147    0.060
Local       Hash        Global      Hash        248.30    0.318    0.156    0.325
Local       Hash        Global      SinCos      135.86    0.205    0.155    0.106
Local       Hash        Local       Hash        125.97    0.172    0.150    0.165
Local       Hash        Local       SinCos      132.67    0.174    0.151    0.089
Global      SinCos      Global      Hash        203.97    0.199    0.156    0.153
Global      SinCos      Global      SinCos      116.01    0.105    0.150    0.933
Global      SinCos      Local       Hash        116.76    0.104    0.152    0.127
Global      SinCos      Local       SinCos      99.78     0.098    0.152    0.075
Local       SinCos      Global      Hash        219.30    0.233    0.154    0.452
Local       SinCos      Global      SinCos      124.87    0.134    0.152    0.174
Local       SinCos      Local       Hash        137.99    0.157    0.153    0.246
Local       SinCos      Local       SinCos      107.63    0.125    0.149    0.078

Figure 8. Qualitative comparison of different scene parameterization. The terms “Global” and “Local” correspond to “Global Encoder” ($E_G$) and “Local Encoder” ($E_B$), which generate features following Equation 3 and Equation 7, respectively. “Hash” and “SinCos” represent “Hash Grid” and “SinCos” positional encodings defined in Equations 4 and 9, respectively. Axis label: City Background Generator Scene Parameterization (Global + Hash, Global + SinCos, Local + Hash, Local + SinCos).

B. Additional Experimental Results

B.1. View Consistency Comparison

To demonstrate the multi-view consistent renderings of CityDreamer, we utilize COLMAP [50] for structure-from-motion

and dense reconstruction using a generated video sequence. The video sequence consists of 600 frames with a resolution of

960×540, captured from a circular camera trajectory that orbits around the scene at a fixed height and looks at the center (sim-

ilar to the sequence presented in the supplementary video). The reconstruction is performed solely using the images, without

explicitly specifying camera parameters. As shown in Figure 9, the estimated camera poses precisely match our sampled

trajectory, and the resulting point cloud is well-defined and dense. Out of the evaluated methods, only SceneDreamer and

CityDreamer managed to accomplish dense reconstruction. CityDreamer, in particular, exhibited superior view consistency

compared to SceneDreamer. This superiority can be attributed to the fact that the images generated by CityDreamer are more

conducive to feature matching.
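The circular trajectory used for this experiment can be generated with a few lines of geometry. The sketch below is an illustration under stated assumptions: a z-up coordinate system, camera poses represented as a position plus a unit look direction, and hypothetical names and parameter values. The radius and height used in the paper are not given in this section.

```python
import math


def orbit_trajectory(center, radius, height, num_frames=600):
    """Camera poses on a circle of `radius` around `center`, `height` above it, looking at it.

    Returns a list of (position, look_direction) pairs, where look_direction
    is a unit vector pointing from the camera toward the center.
    """
    if radius <= 0:
        raise ValueError("radius must be positive")
    cx, cy, cz = center
    poses = []
    for i in range(num_frames):
        angle = 2.0 * math.pi * i / num_frames
        position = (
            cx + radius * math.cos(angle),
            cy + radius * math.sin(angle),
            cz + height,
        )
        direction = tuple(c - p for c, p in zip(center, position))
        norm = math.sqrt(sum(d * d for d in direction))
        poses.append((position, tuple(d / norm for d in direction)))
    return poses
```

Sampling the angle uniformly over one full revolution gives evenly spaced views with substantial overlap between neighbors, which is the kind of input structure-from-motion pipelines handle well.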

Figure 9. COLMAP reconstruction of a 600-frame generated video captured from an orbit trajectory. The red ring represents the estimated camera poses, and the well-defined point clouds showcase CityDreamer’s highly multi-view consistent renderings. Panels: Reference Image, Sparse Reconstruction (with camera poses), Dense Reconstruction.

B.2. Building Interpolation

As illustrated in Figure 10, CityDreamer demonstrates the ability to interpolate along the building style, which is controlled

by the variable z.

Figure 10. Linear interpolation along the building style. As we move from left to right, the style of each building changes gradually,

while the background remains unchanged.
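Linear interpolation between two style codes can be sketched as below. Style codes are represented as plain lists of floats, and the function name is hypothetical; how the interpolated code is fed to the Building Instance Generator is not shown.

```python
def lerp_style(z_start, z_end, num_steps):
    """Return `num_steps` style codes evenly spaced on the line from z_start to z_end."""
    if num_steps < 2:
        raise ValueError("need at least two steps to include both endpoints")
    if len(z_start) != len(z_end):
        raise ValueError("style codes must have the same dimension")
    codes = []
    for step in range(num_steps):
        t = step / (num_steps - 1)
        codes.append([(1.0 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return codes
```

For example, `lerp_style([0.0, 2.0], [1.0, 4.0], 3)` returns the two endpoints and their midpoint `[0.5, 3.0]`. Rendering the same layout with each intermediate code gives the gradual left-to-right change shown in the figure.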

B.3. Additional Dataset Examples

In Figure 11, we provide more examples of the OSM and GoogleEarth datasets. The first six rows are taken from the

GoogleEarth dataset, specifically from New York City. The last two rows showcase Singapore and San Francisco, illustrating

the potential to extend the existing data to other cities worldwide.

Figure 11. Examples from the OSM and GoogleEarth datasets. (a) Height fields and semantic maps from the OSM dataset. (b) City layouts generated from the height fields and semantic maps. (c) Images and segmentation maps from the GoogleEarth dataset.

B.4. Additional Qualitative Comparison

In Figure 12, we provide more visual comparisons with state-of-the-art methods. We also encourage readers to explore

more video results available on our project page.

Figure 12. Qualitative comparison. The proposed CityDreamer produces more realistic and diverse results compared to all baselines.

Note that the visual results of InfiniCity [34] are provided by the authors and zoomed for optimal viewing.

Figure 13. The screenshots of the interactive demo. This interactive demo allows users to create their own cities in an engaging and interactive manner. We encourage the readers to explore the video demo available on our project page. Panels: Layout Generation, Trajectory Selection, Rendering.

C. Interactive Demo

We develop a web demo that allows users to interactively create their own cities. The process involves three main steps:

layout generation, trajectory selection, and rendering, as illustrated in Figure 13. Users can manipulate these steps to create

customized 3D city scenes according to their preferences.

During the layout generation phase, users have the option to create a city layout of arbitrary size using the unbounded

layout generator, or they can utilize the rasterized data from OpenStreetMap directly. This flexibility allows users to choose

between generating layouts from scratch or using existing map data as a starting point for their 3D city. Additionally, after

generating the layout, users can draw masks on the canvas and regenerate the layout specifically for the masked regions.

During the trajectory selection phase, users can draw camera trajectories on the canvas and customize camera step size,

view angles, and altitudes. There are three types of camera trajectories available: orbit, point to point, and multiple keypoints.

Once selected, the camera trajectory can be previewed based on the generated city layout, allowing users to visualize how

the city will look from different perspectives before finalizing their choices.

Finally, the cities can be rendered and stylized based on the provided city layout and camera trajectories.