Summary of "MetaDreamer: Text-to-3D Creation with Disentangling Geometry and Texture" (arxiv.org)
6,572 words - PDF document
One Line
MetaDreamer is a text-to-3D method that resolves multi-view geometric inconsistencies and slow generation speeds, producing efficient, high-quality results.
Key Points
- MetaDreamer is an efficient and high-quality text-to-3D generation method.
- It disentangles geometric priors from texture priors.
- MetaDreamer consists of two stages: the geometry stage and the texture stage.
- It achieves multi-view consistency and accuracy of 3D objects.
- MetaDreamer can generate high-quality 3D objects based on textual prompts within 20 minutes.
- It outperforms existing text-to-3D methods in terms of both efficiency and quality.
- MetaDreamer addresses the entanglement issue between geometry and texture to enhance overall quality.
- Future work includes addressing limitations in multi-object generation tasks.
Summaries
20 word summary
MetaDreamer is a text-to-3D method that improves generation by addressing geometric inconsistencies and slow speeds, achieving efficient and high-quality results.
65 word summary
MetaDreamer is a text-to-3D generation method that improves upon existing frameworks by addressing geometric inconsistencies and slow generation speeds. It consists of two stages: geometry and texture. The geometry stage establishes strong multi-view consistency and complete geometry, while the texture stage refines the model and enhances its texture. By disentangling geometry and texture, MetaDreamer achieves equilibrium in learning, resulting in efficient and high-quality 3D generation.
137 word summary
MetaDreamer is a text-to-3D generation method that overcomes challenges in existing frameworks by addressing multi-view geometric inconsistencies and slow generation speeds. It consists of two stages: the geometry stage and the texture stage. The geometry stage establishes the geometric structure of the 3D object using 2D and 3D prior knowledge, resulting in strong multi-view consistency and complete geometry. The texture stage refines the geometric model and enhances its texture by transferring prior knowledge from 2D images. By disentangling the interaction between geometry and texture, MetaDreamer achieves equilibrium in learning and significantly reduces the time required for 3D generation. It outperforms state-of-the-art methods in both efficiency and quality, achieving higher CLIP similarity scores and the highest scores on the T3Bench benchmark. By disentangling geometric and texture priors, MetaDreamer generates high-quality 3D objects efficiently.
375 word summary
MetaDreamer is a text-to-3D generation method that addresses the challenges of multi-view geometric inconsistencies and slow generation speeds in existing frameworks. It consists of two stages: the geometry stage and the texture stage. In the geometry stage, the emphasis is on optimizing the geometric representation for multi-view consistency and accuracy. The texture stage focuses on fine-tuning the geometry and optimizing the texture.
The first stage of MetaDreamer rapidly establishes the fundamental geometric structure of the 3D object using 2D and 3D prior knowledge. It leverages a pretrained view-dependent diffusion model to guide the optimization process, resulting in 3D objects with strong multi-view consistency and complete geometry.
In the second stage, MetaDreamer refines the geometric model and enhances its texture. It uses pretrained text-to-image diffusion models to transfer prior knowledge from 2D images into the 3D model. The texture optimization stage improves both the geometry and textures of the 3D object.
By disentangling the interaction between geometry and texture, MetaDreamer achieves equilibrium in learning and makes the optimization objectives more explicit. This leads to significant time savings in the 3D generation process. MetaDreamer can generate high-quality 3D objects based on textual prompts within 20 minutes, making it the most efficient text-to-3D generation method available.
Comparisons with state-of-the-art methods show that MetaDreamer outperforms them in both efficiency and quality. It achieves higher CLIP similarity scores, indicating better consistency between the generated 3D objects and the input text prompts, and the highest quality and alignment scores on the T3Bench benchmark.
MetaDreamer addresses the entanglement issue between geometry and texture by using only geometry priors in the coarse stage and only texture priors in the fine stage. This disentanglement allows for more effective optimization and enhances the overall quality of the generated 3D objects.
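The coarse/fine disentanglement described above can be sketched as two sequential optimization loops, one driven only by a geometry prior and one only by a texture prior. Everything below (the toy renderer, the prior-loss gradients, the 8-dimensional "representation", and the targets) is a hypothetical numpy stand-in to illustrate the control flow; MetaDreamer's actual implementation optimizes a neural 3D representation under diffusion guidance.

```python
import numpy as np

rng = np.random.default_rng(0)

def render(params):
    """Toy renderer: pretend the parameters are the rendered views."""
    return params

def geometry_prior_grad(views, target_shape):
    """Stand-in for 3D-aware (view-dependent) diffusion guidance."""
    return views - target_shape      # pull toward a target geometry

def texture_prior_grad(views, target_texture):
    """Stand-in for 2D text-to-image diffusion guidance."""
    return views - target_texture    # pull toward a target texture

def metadreamer_sketch(steps=100, lr=0.1):
    params = rng.standard_normal(8)  # toy 3D representation
    target_shape = np.zeros(8)       # hypothetical geometry prior target
    target_texture = np.ones(8)      # hypothetical texture prior target
    # Coarse stage: geometry prior only -- no texture prior is used.
    for _ in range(steps):
        params -= lr * geometry_prior_grad(render(params), target_shape)
    # Fine stage: texture prior only -- refines the fixed-objective result.
    for _ in range(steps):
        params -= lr * texture_prior_grad(render(params), target_texture)
    return params

final = metadreamer_sketch()
print(np.allclose(final, 1.0, atol=1e-3))  # True: converged to texture target
```

Keeping each stage's objective single-purpose is the point of the sketch: neither loop has to trade off geometry against texture, which is the equilibrium the summary describes.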
In future work, MetaDreamer aims to improve multi-object generation tasks by incorporating more multi-object geometric prior knowledge into the model.
In conclusion, MetaDreamer is an efficient, high-quality text-to-3D generation method that disentangles geometric and texture priors. It achieves state-of-the-art results in both efficiency and quality, outperforming existing methods. The two-stage optimization approach and the incorporation of 2D and 3D prior knowledge yield significant time savings and improved 3D object generation.
462 word summary
MetaDreamer is an efficient, high-quality text-to-3D generation method that disentangles geometric and texture priors. It addresses the challenges of multi-view geometric inconsistencies and slow generation speeds in existing 3D synthesis frameworks. The method consists of two stages: the geometry stage and the texture stage. The geometry stage optimizes the geometric representation to ensure multi-view consistency and accuracy of the 3D object; the texture stage fine-tunes the geometry and optimizes the texture to produce a more refined result.
The first stage of MetaDreamer utilizes 2D and 3D prior knowledge to rapidly establish the fundamental geometric structure of the 3D object. It leverages a pretrained view-dependent diffusion model to guide the optimization process. The resulting 3D objects demonstrate strong multi-view consistency and possess complete geometry.
In the second stage, MetaDreamer further refines the geometric model obtained in the first stage and enhances its texture. It employs pretrained text-to-image diffusion models to transfer prior knowledge from 2D images into the 3D model through score distillation sampling. The texture optimization stage focuses on improving both the geometry and textures of the 3D object.
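The score distillation sampling step mentioned above can be sketched in a few lines. The gradient with respect to a rendered view is the weighted difference between the diffusion model's predicted noise and the noise actually added. The `toy_noise_predictor`, the noise schedule, and all array shapes below are hypothetical stand-ins for a pretrained text-to-image diffusion model, used only to show the shape of the computation.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_noise_predictor(noisy_image, t, text_embedding):
    """Stand-in for a pretrained diffusion model's text-conditioned
    noise prediction (not a real model)."""
    return 0.9 * noisy_image - 0.1 * text_embedding

def sds_gradient(rendered, text_embedding, t=0.5, weight=1.0):
    eps = rng.standard_normal(rendered.shape)   # sampled Gaussian noise
    alpha_bar = 1.0 - t                         # toy noise schedule
    noisy = np.sqrt(alpha_bar) * rendered + np.sqrt(1.0 - alpha_bar) * eps
    eps_hat = toy_noise_predictor(noisy, t, text_embedding)
    # SDS gradient estimate: w(t) * (predicted noise - added noise),
    # which the full method backpropagates through the renderer.
    return weight * (eps_hat - eps)

rendered = rng.standard_normal((4, 4))
text_emb = np.ones((4, 4))
grad = sds_gradient(rendered, text_emb)
print(grad.shape)  # (4, 4)
```

In the full pipeline this per-view gradient is what transfers 2D image priors into the 3D model's parameters.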
MetaDreamer achieves equilibrium in learning between geometry and texture by incorporating two distinct sources of prior knowledge. By disentangling the interaction between geometry and texture, the optimization objectives for each stage become more explicit, leading to significant time savings in the 3D generation process. MetaDreamer can generate high-quality 3D objects based on textual prompts within 20 minutes, making it the most efficient text-to-3D generation method currently available.
Quantitative and qualitative comparisons with state-of-the-art text-to-3D methods demonstrate that MetaDreamer outperforms them in both efficiency and quality. It achieves higher CLIP similarity scores, indicating better consistency between the generated 3D objects and the input text prompts. The T3Bench benchmark likewise gives MetaDreamer the highest scores for quality and alignment.
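The CLIP similarity metric referenced above is a cosine similarity between an image embedding (of a rendered view) and a text embedding (of the prompt). A real evaluation would obtain both vectors from a pretrained CLIP model; the toy vectors below are stand-ins chosen only to illustrate the metric.

```python
import numpy as np

def clip_similarity(image_emb, text_emb):
    """Cosine similarity between an image embedding and a text embedding."""
    a = np.asarray(image_emb, dtype=float)
    b = np.asarray(text_emb, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Identical directions score 1.0; orthogonal directions score 0.0.
print(clip_similarity([3.0, 4.0], [3.0, 4.0]))  # 1.0
print(clip_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

A higher score means the rendered object and the prompt land closer together in CLIP's joint embedding space, which is why it serves as a text-to-3D consistency proxy.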
MetaDreamer also addresses the entanglement issue between geometry and texture by using only geometry priors in the coarse stage and only texture priors in the fine stage. This disentanglement allows for more effective optimization and enhances the overall quality of the generated 3D objects.
In terms of future work, MetaDreamer has limitations in multi-object generation tasks due to the lack of prior knowledge about multiple objects in geometric priors. The authors plan to address this challenge by introducing more multi-object geometric prior knowledge into the model.
In conclusion, MetaDreamer is an efficient, high-quality text-to-3D generation method that disentangles geometric and texture priors. It achieves state-of-the-art results in both efficiency and quality, outperforming existing text-to-3D methods. The two-stage optimization approach and the incorporation of 2D and 3D prior knowledge yield significant time savings and improved 3D object generation.