Summary: Text-guided Reconstruction of Lifelike Clothed Humans (arxiv.org)
14,167 words - PDF document
One Line
The use of a personalized T2I diffusion model and VQA allows for the reconstruction of realistic 3D clothed humans from just one image.
Key Points
- TeCH is a method for reconstructing lifelike 3D clothed humans from a single image.
- It uses a personalized Text-to-Image (T2I) diffusion model and textual information derived from visual question answering (VQA).
- TeCH is related to image-based human reconstructors and 3D human generators.
- Various methods, such as CAPE, Chupa, gDNA, NPMs, and SPAMs, learn clothing details from 3D data.
- The proposed method, TeCH, outperforms other baselines in terms of both 3D metrics and 2D image quality metrics.
- DreamBooth is a model that personalizes a pre-trained diffusion model for subject-driven image generation.
- The authors propose a method that enhances facial details by sampling additional virtual cameras positioned around the face.
Summaries
19 word summary
TeCH uses a personalized T2I diffusion model and VQA to reconstruct lifelike 3D clothed humans from a single image.
32 word summary
TeCH is a method for reconstructing lifelike 3D clothed humans from a single image. It uses a personalized Text-to-Image (T2I) diffusion model and visual question answering (VQA) to guide the generation process.
621 word summary
TeCH is a method for reconstructing lifelike 3D clothed humans from a single image. It uses a personalized Text-to-Image (T2I) diffusion model and textual information derived from visual question answering (VQA) to guide the reconstruction process.
TeCH is related to image-based human reconstructors and 3D human generators. Image-based human reconstructors can be categorized into explicit-shape-based, implicit-function-based, and NeRF-based methods. Explicit-shape-based methods use parametric body models.
Generative modeling of 3D clothed humans is achieved through statistical body models trained on 3D data. Various methods, such as CAPE, Chupa, gDNA, NPMs, and SPAMs, learn clothing details from 3D data.
The document discusses the use of a personalized Text-to-Image diffusion model called DreamBooth to guide the generation process of lifelike clothed humans. It also introduces the Score Distillation Sampling (SDS) loss, which is used to optimize the 3D representation.
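For reference, the SDS gradient as introduced in DreamFusion, which this loss follows, can be written as (symbols as in DreamFusion; TeCH's exact weighting may differ):

```latex
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}\big(\phi,\, x = g(\theta)\big)
  = \mathbb{E}_{t,\epsilon}\!\left[
      w(t)\,\big(\hat{\epsilon}_\phi(x_t;\, y,\, t) - \epsilon\big)
      \frac{\partial x}{\partial \theta}
    \right]
```

Here $g(\theta)$ renders an image $x$ from the 3D parameters $\theta$, $x_t$ is a noised version of $x$ at timestep $t$, $\hat{\epsilon}_\phi$ is the diffusion model's noise prediction conditioned on the text prompt $y$, and $w(t)$ is a timestep-dependent weight.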
The text excerpt describes a method for reconstructing lifelike clothed humans using a text-guided approach. The geometry network predicts the SDF value for each vertex, which is used to extract triangular meshes. The generated mesh is then rendered with a differentiable rasterizer.
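As a minimal sketch of the SDF-to-mesh step: a mesh extractor such as Marching Cubes triangulates exactly the grid cells where the SDF changes sign. The snippet below uses an analytic sphere SDF and a coarse grid as a hypothetical stand-in for the paper's learned geometry network and actual extraction pipeline.

```python
import itertools
import math

def sphere_sdf(x, y, z, r=1.0):
    """Signed distance to a sphere of radius r: negative inside, positive outside."""
    return math.sqrt(x * x + y * y + z * z) - r

def surface_cells(sdf, lo=-1.5, hi=1.5, n=16):
    """Return grid cells whose corner SDF values differ in sign, i.e. cells
    crossed by the zero level set. Marching Cubes would triangulate these."""
    step = (hi - lo) / n
    coords = [lo + i * step for i in range(n + 1)]
    cells = []
    for i, j, k in itertools.product(range(n), repeat=3):
        corners = [
            sdf(coords[i + di], coords[j + dj], coords[k + dk])
            for di, dj, dk in itertools.product((0, 1), repeat=3)
        ]
        if min(corners) < 0.0 <= max(corners):  # surface passes through this cell
            cells.append((i, j, k))
    return cells

cells = surface_cells(sphere_sdf)
print(len(cells) > 0)       # some cells straddle the surface
print(sphere_sdf(0, 0, 0))  # -1.0: the grid origin lies inside the sphere
```

In the paper's setting the analytic SDF is replaced by a learned network queried at grid points, but the zero-level-set logic is the same.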
The document discusses a method for reconstructing lifelike clothed humans using text guidance. The method consists of two stages: geometry and texture. In the geometry stage, a pixel-wise L2 loss and an edge distance loss are used to optimize the geometry.
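The two geometry-stage terms can be sketched as follows. This is a simplified illustration, not the paper's implementation: images are flat lists rather than differentiably rendered maps, and the weight `lam` is a hypothetical value.

```python
import math

def pixel_l2(rendered, target):
    """Mean squared error between two equally sized images (flat lists of floats)."""
    return sum((a - b) ** 2 for a, b in zip(rendered, target)) / len(target)

def edge_distance(vertices, edges):
    """Mean edge length of a mesh; penalizing it regularizes the triangulation.
    vertices: list of (x, y, z); edges: list of (i, j) index pairs."""
    total = sum(math.dist(vertices[i], vertices[j]) for i, j in edges)
    return total / len(edges)

def geometry_loss(rendered, target, vertices, edges, lam=0.1):
    """Combined objective from the summary: pixel-wise L2 term plus a
    weighted edge distance term (lam is assumed, not from the paper)."""
    return pixel_l2(rendered, target) + lam * edge_distance(vertices, edges)

verts = [(0, 0, 0), (1, 0, 0), (0, 1, 0)]
loss = geometry_loss([0.2, 0.8], [0.0, 1.0], verts, [(0, 1), (1, 2), (2, 0)])
print(round(loss, 4))
```

In the actual method both terms would be computed on differentiable renderings so that gradients flow back to the geometry network.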
TeCH, a text-guided reconstruction method, outperforms other baselines in terms of both 3D metrics and 2D image quality metrics. It accurately reconstructs clothed human geometry with intricate details and produces high-quality textures.
TeCH is a text-guided reconstruction method for creating lifelike clothed humans. The study compares different influences on the reconstruction, including geometry and texture. The results show that TeCH is able to accurately recover the human shape and generate high-quality textures.
The proposed method, TeCH, aims to reconstruct a lifelike 3D clothed human from a single image, leveraging descriptive text prompts and personalized Text-to-Image diffusion models. The method optimizes the 3D avatar, including parts not visible in the input image.
DreamBooth is a method that personalizes a pre-trained diffusion model for subject-driven image generation. It uses few-shot tuning and takes an initial noise and a text embedding to produce an image. DreamBooth fine-tunes the diffusion model using an MSE loss.
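For reference, the DreamBooth fine-tuning objective combines a reconstruction MSE on the subject images with a prior-preservation term (notation follows the DreamBooth paper, not this one):

```latex
\mathbb{E}_{x,\, c,\, \epsilon,\, \epsilon',\, t}\Big[
  w_t \,\big\| \hat{x}_\theta(\alpha_t x + \sigma_t \epsilon,\, c) - x \big\|_2^2
  \;+\; \lambda\, w_{t'} \,\big\| \hat{x}_\theta(\alpha_{t'} x_{\mathrm{pr}} + \sigma_{t'} \epsilon',\, c_{\mathrm{pr}}) - x_{\mathrm{pr}} \big\|_2^2
\Big]
```

Here $x$ is a subject image, $c$ the embedding of a prompt containing a unique identifier, $x_{\mathrm{pr}}$ are class prior images generated by the frozen model, and $\lambda$ weights the prior-preservation term that counteracts overfitting and language drift.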
In the text-guided reconstruction of lifelike clothed humans, the authors propose a method that enhances facial details by sampling additional virtual cameras positioned around the face. They set the sampling parameters empirically and run the texture-stage optimization over multiple steps and iterations.
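The face-camera idea can be sketched as sampling viewpoints around a face center. This is a simplified stand-in: the camera count, radius, and layout below are illustrative, while the paper sets its sampling parameters empirically.

```python
import math

def face_cameras(center, radius, n=6):
    """Place n virtual cameras on a horizontal circle of the given radius
    around a face center point, each looking at the center.
    Returns (position, look_direction) pairs with unit look vectors."""
    cams = []
    cx, cy, cz = center
    for k in range(n):
        angle = 2.0 * math.pi * k / n
        pos = (cx + radius * math.cos(angle), cy, cz + radius * math.sin(angle))
        look = tuple((c - p) / radius for p, c in zip(pos, center))
        cams.append((pos, look))
    return cams

center = (0.0, 1.6, 0.0)  # assumed head height in meters
cams = face_cameras(center, radius=0.5)
for pos, _ in cams:
    print(round(math.dist(pos, center), 6))  # every camera sits at the radius
```

Renderings from these extra viewpoints then supervise the face region during optimization, in addition to the full-body views.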
Researchers have been exploring the use of text-guided reconstruction to generate lifelike clothed human avatars. Several studies have focused on different aspects of this technology. One study, co-authored by Kwan-Yee K. Wong, developed a system called DreamAvatar that generates 3D human avatars from text prompts.
Several papers and preprints related to the reconstruction and generation of lifelike clothed humans were referenced in this excerpt. These papers cover a range of topics including animatable avatars, clothed human reconstruction, neural radiance fields, and view synthesis.
The text excerpt includes references to various research papers related to the reconstruction of lifelike clothed humans. The papers mentioned cover topics such as deep human parsing, high-fidelity clothed avatar reconstruction, text-guided human image generation, and human shape estimation.
This document is a list of references to various papers and studies related to the topic of text-guided reconstruction of lifelike clothed humans. The references include papers on deep face recognition, avatars in geography, expressive body capture, and clothing capture.
This document contains a list of references to various papers and conferences related to the field of computer vision and 3D human reconstruction. The references include papers on topics such as 3D human reconstruction from a single image and text-guided image generation.
Several research papers related to the reconstruction of lifelike clothed humans were cited in this document. The papers cover various topics such as explicit clothed human optimization, lifting 2D photos to 3D objects, and generative 3D human models.