Summary of Free Lunch in Diffusion U-Net Improving Generation Quality

One Line

The authors propose FreeU, a method that improves the sample quality of diffusion models at inference time, without additional training, by re-weighting the U-Net's backbone features (which mainly drive denoising) and skip-connection features (which mainly carry high-frequency content).

Slides

Slide Presentation (10 slides)

Enhancing Diffusion Model Generation Quality with FreeU

Source: arxiv.org - PDF - 5,762 words

Introduction

• FreeU method improves generation quality of diffusion models without additional training or parameters

U-Net Architecture and Denoising

• U-Net backbone contributes to denoising in diffusion models

• Understanding the role of skip connections in denoising and high-frequency components

Low-Frequency vs High-Frequency Components

• Low-frequency components represent global structure and characteristics

• High-frequency components contain fine details and are sensitive to noise

Effects of Scaling Factors on Image Quality

• Increasing the scale factor of the backbone improves image quality

• Variations in the scaling factor of skip connections have negligible influence

Integration with State-of-the-Art Methods

• FreeU seamlessly integrates with Stable Diffusion, DreamBooth, ModelScope, and Rerender

• Enhancements observed in image and video synthesis models and specialized downstream applications

Significant Improvements in Synthesized Samples

• FreeU enhances the quality of synthesized samples in various diffusion models

• Improvements observed in image and video synthesis, personalized text-to-image tasks, and relation inversion methods

Simple yet Effective Approach

• FreeU enhances sample quality without increasing computational costs

• Analyzing skip connections and backbone features in diffusion U-Net architectures

Related Papers and Works

• References various papers on improving image quality and text-based image editing with diffusion models

• Includes auto-encoding variational Bayes, multi-concept customization of text-to-image diffusion, decomposed

Conclusion

• FreeU is a powerful method for improving the generation quality of diffusion models

• Enhancements observed in various applications and synthesis tasks

• Remember to consider FreeU as a simple yet effective approach for enhancing sample quality.

Key Points

FreeU is a method proposed to improve the generation quality of diffusion models without additional training or parameters.
The U-Net architecture and its backbone contribute to denoising in diffusion models.
Low-frequency components represent global structure while high-frequency components contain fine details and are sensitive to noise in the denoising process.
Scaling factors of the backbone and skip connections have different effects on the quality of generated images in the denoising process.
FreeU integrates with existing diffusion models such as Stable Diffusion, DreamBooth, ModelScope, Rerender, and ReVersion.
FreeU significantly enhances the quality of synthesized samples in image and video synthesis models, as well as specialized downstream applications.
FreeU is a simple yet effective approach that enhances sample quality without increasing computational costs.
Various papers and works related to improving image quality and text-based image editing with diffusion models are referenced in the document.

Summaries

38 word summary

The authors introduce FreeU, a method to enhance generation quality of diffusion models without extra training or parameters. They analyze the U-Net architecture and identify the backbone's role in denoising and the skip connections' contribution of high-frequency components.

41 word summary

The authors propose a method called FreeU to improve the generation quality of diffusion models without additional training or parameters. They investigate the U-Net architecture and find that the backbone contributes to denoising, while skip connections introduce high-frequency components. The den

325 word summary

FreeU is a method proposed by the authors to improve the generation quality of diffusion models without any additional training or parameters. The authors investigate the U-Net architecture and find that the backbone primarily contributes to denoising, while the skip connections introduce high

The paper discusses the relationship between low-frequency and high-frequency components in the denoising process of images. Low-frequency components represent global structure and characteristics, while high-frequency components contain fine details and are sensitive to noise. The U-Net architecture is investigated

Diffusion models for data modeling involve a diffusion process and a denoising process. The diffusion process introduces incremental Gaussian noise into the data distribution, while the denoising process reverses the diffusion process to obtain clean data. The denoising model

In the denoising process, the scaling factors of the backbone and skip connections have different effects on the quality of generated images. Increasing the scale factor of the backbone improves image quality, while variations in the scaling factor of the skip connections have negligible influence

To evaluate the effectiveness of FreeU, a series of experiments applied it on top of state-of-the-art methods such as Stable Diffusion, DreamBooth, ModelScope, and Rerender, comparing each model with and without FreeU. FreeU seamlessly integrates with these methods without

The incorporation of FreeU in various diffusion models leads to significant improvements in the quality of synthesized samples. These enhancements are observed in image and video synthesis models, as well as specialized downstream applications such as personalized text-to-image tasks and relation inversion methods. Free

In this document, the authors introduce a simple yet effective approach called FreeU to enhance the sample quality of diffusion models without increasing computational costs. They analyze the effects of skip connections and backbone features in diffusion U-Net architectures and find that the backbone primarily

The document references various papers and works related to improving the image quality of StyleGAN and text-based image editing with diffusion models. Papers mentioned include those on auto-encoding variational Bayes, multi-concept customization of text-to-image diffusion, decomposed

Raw indexed text (38,182 chars / 5,762 words / 806 lines)

FreeU: Free Lunch in Diffusion U-Net

Chenyang Si, Ziqi Huang, Yuming Jiang, Ziwei Liu

S-Lab, Nanyang Technological University

{chenyang.si, ziqi002, yuming002, ziwei.liu}@ntu.edu.sg

Figure 1 (panels: Stable Diffusion; Stable Diffusion + FreeU). We propose FreeU, a method that substantially improves diffusion model sample quality at no cost: no training, no additional parameters, and no increase in memory or sampling time.

Abstract

In this paper, we uncover the untapped potential of dif-

fusion U-Net, which serves as a “free lunch” that substan-

tially improves the generation quality on the fly. We initially

investigate the key contributions of the U-Net architecture

to the denoising process and identify that its main backbone

primarily contributes to denoising, whereas its skip connec-

tions mainly introduce high-frequency features into the de-

coder module, causing the network to overlook the back-

bone semantics. Capitalizing on this discovery, we propose

a simple yet effective method—termed “FreeU” — that en-

hances generation quality without additional training or

finetuning. Our key insight is to strategically re-weight the

contributions sourced from the U-Net’s skip connections

and backbone feature maps, to leverage the strengths of

both components of the U-Net architecture. Promising re-

sults on image and video generation tasks demonstrate that

our FreeU can be readily integrated to existing diffusion

models, e.g., Stable Diffusion, DreamBooth, ModelScope,

Rerender and ReVersion, to improve the generation qual-

ity with only a few lines of code. All you need is to ad-

just two scaling factors during inference. Project page:

https://chenyangsi.top/FreeU/.

1. Introduction

Diffusion probabilistic models, a cutting-edge category

of generative models, have become a focal point in the

research landscape, particularly for tasks related to com-

puter vision [4, 5, 7, 9, 11, 17, 19, 23, 24, 26, 29]. Dis-

tinct from other classes of generative models such as Vari-

ational Autoencoder (VAE) [18], Generative Adversarial

Networks (GANs) [2, 8, 13–16, 22], and vector-quantized

approaches [6, 31], diffusion models introduce a novel gen-

erative paradigm. These models employ a fixed Markov

chain to map the latent space, facilitating intricate map-

pings that capture latent structural complexities within a

dataset. Recently, their impressive generative capabilities, ranging from the high level of detail to the diversity of the generated examples, have fueled groundbreaking advancements in a variety of computer vision applications such as image synthesis [11, 25, 26, 29], image editing [1, 3, 21], image-to-image translation [3, 28, 32], and text-to-video generation [10, 20, 30, 33].

Figure 2. The denoising process (prompt: “A squirrel eating a burger.”). The top row illustrates the image’s progressive denoising process across iterations, while the subsequent two rows display the low-frequency and high-frequency components after the inverse Fourier Transform, matching each step. Low-frequency components change slowly, whereas high-frequency components exhibit more significant variations during the denoising process.

The diffusion models are comprised of the diffusion pro-

cess and the denoising process. During the diffusion pro-

cess, Gaussian noise is gradually added to the input data

and eventually corrupts it into approximately pure Gaussian

noise. During the denoising process, the original input data

is recovered from its noise state through a learned sequence

of inverse diffusion operations. Usually, a U-Net is trained

to iteratively predict the noise to be removed at each denoising step. Existing works focus on utilizing pre-trained diffusion U-Nets for downstream applications, while the internal properties of the diffusion U-Net remain largely underexplored.

Beyond the application of diffusion models, in this paper we investigate the effectiveness of the diffusion U-Net for the denoising process. To better understand the denoising process, we first shift to the Fourier domain to examine the generation process of diffusion models, a perspective that has received limited prior investigation. As illustrated in Fig. 2, the uppermost row shows the progressive denoising process, with the generated images across successive iterations. The subsequent two rows show the associated low-frequency and high-frequency spatial-domain information after the inverse Fourier Transform, aligned with each respective step.

Figure 3. Relative log amplitudes of Fourier for diffusion intermediate steps (curves for Steps 1, 4, 7, 10, 13, 16, 19, 22, and 25). At each denoising step t, we visualize the relative log amplitudes of Fourier of the recovered data $x_t$. We observe that the high-frequency components of $x_t$ drop drastically during the denoising process.

Evident from Fig. 2 is the gradual modulation of the low-frequency components, which exhibit a subdued rate of change, while the high-frequency components display more pronounced dynamics throughout the denoising process. These findings are further corroborated in Fig. 3. This can be intuitively explained: 1) Low-frequency components inherently embody the global structure and characteristics of an image, encompassing global layouts and smooth color. These components encapsulate the foundational global elements that constitute the image’s essence and representation. Rapid alterations to them are generally unreasonable in denoising processes: drastic changes to these components could fundamentally reshape the image’s essence, an outcome typically incompatible with the objectives of denoising. 2) Conversely, high-frequency components contain the rapid changes in the image, such as edges and textures. These finer details are markedly sensitive to noise, which often manifests as random high-frequency information when it is introduced to an image. Consequently, denoising processes need to expunge noise while upholding indispensable intricate details.
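The low-/high-frequency split described above can be illustrated with a small NumPy sketch. This is not the authors' code: the function name `split_frequencies`, the circular mask, and the cutoff `radius_ratio` are assumptions chosen only to show how an image separates into a smooth low-frequency part and a detail-carrying high-frequency part.

```python
import numpy as np

def split_frequencies(image, radius_ratio=0.1):
    """Split a 2-D grayscale image into low- and high-frequency parts.

    radius_ratio is a hypothetical cutoff, expressed as a fraction of the
    shorter image side; it is not a value taken from the paper.
    """
    h, w = image.shape
    # Centered 2-D spectrum: the zero frequency sits at (h // 2, w // 2).
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    yy, xx = np.mgrid[:h, :w]
    dist = np.sqrt((yy - h // 2) ** 2 + (xx - w // 2) ** 2)
    low_mask = dist <= radius_ratio * min(h, w)
    # Inverse transform of each masked spectrum gives the two components.
    low = np.fft.ifft2(np.fft.ifftshift(spectrum * low_mask)).real
    high = np.fft.ifft2(np.fft.ifftshift(spectrum * ~low_mask)).real
    return low, high

rng = np.random.default_rng(0)
image = rng.standard_normal((64, 64))  # stand-in for a noisy image
low, high = split_frequencies(image)
```

Because the two masks are complementary, `low + high` reproduces the input, so the split only redistributes the image content between the two bands.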

In light of these observations between low-frequency

and high-frequency components during the denoising pro-

cess, we extend our investigation to ascertain the specific

contributions of the U-Net architecture within the diffusion

framework. In each stage of the U-Net decoder, the skip

features from the skip connection and the backbone fea-

tures are concatenated together. Our investigation reveals

that the main backbone of the U-Net primarily contributes

to denoising. Conversely, the skip connections are observed

to introduce high-frequency features into the decoder mod-

ule. These connections propagate fine-grained semantic in-

formation to make it easier to recover the input data. How-

ever, an unintended consequence of this propagation is the

potential weakening of the backbone’s inherent denoising

capabilities during the inference phase. This can lead to the

generation of abnormal image details, as illustrated in the

first row of Fig. 1.

Building upon this revelation, we propel forward with

the introduction of a novel strategy, denoted as “FreeU”,

which holds the potential to improve sample quality with-

out necessitating the computational overhead of additional

training or fine-tuning. During the inference stage, we

instantiate two specialized modulation factors designed to

balance the feature contributions from the U-Net architec-

ture’s primary backbone and skip connections. The first,

termed the backbone feature factors, aims to amplify the

feature maps of the main backbone, thereby bolstering the

denoising process. However, we find that while the inclu-

sion of backbone feature scaling factors yields significant

improvements, it can occasionally lead to an undesirable

oversmoothing of textures. To mitigate this issue, we intro-

duce the second factor, skip feature scaling factors, aiming

to alleviate the problem of texture oversmoothing.

Our FreeU framework exhibits seamless adaptability

when integrated with existing diffusion models, encom-

passing applications like text-to-image generation and text-

to-video generation. We conduct a comprehensive ex-

perimental evaluation of our approach, employing Stable

Diffusion [26], DreamBooth [27], ReVersion [12], Mod-

elScope [20], and Rerender [34] as our foundational mod-

els for benchmark comparisons. By employing FreeU dur-

ing the inference phase, these models indicate a discernible

enhancement in the quality of generated outputs. The vi-

sualization illustrated in Fig. 1 substantiates the efficacy of

FreeU in significantly enhancing both intricate details and

overall visual fidelity within the generated images. Our con-

tributions are summarized as follows:

• We investigate and uncover the potential of U-Net ar-

chitectures for denoising within diffusion models and

identify that its main backbone primarily contributes

to denoising, whereas its skip connections introduce

high-frequency features into the decoder module.

• We further introduce a simple yet effective method, de-

noted as “FreeU”, which enhances U-Net’s denoising

capability by leveraging the strengths of both compo-

nents of the U-Net architecture. It substantially im-

proves the generation quality without requiring addi-

tional training or fine-tuning.

• The proposed FreeU framework is versatile and seam-

lessly integrates with existing diffusion models. We

demonstrate significant sample quality improvement

across various diffusion-based methods, showing the

effectiveness of FreeU at no extra cost.

2. Methodology

2.1. Preliminaries

Diffusion models such as Denoising Diffusion Proba-

bilistic Models (DDPM) [11], encompass two fundamental

processes for data modeling: a diffusion process and a de-

noising process. The diffusion process is characterized by a

sequence of T steps. At each step t, Gaussian noise is incrementally introduced into the data distribution $x_0 \sim q(x_0)$ via a Markov chain, following a prescribed variance schedule denoted as $\beta_1, \ldots, \beta_T$:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t; \sqrt{1 - \beta_t}\, x_{t-1}, \beta_t \mathbf{I}\right) \tag{1}$$

The denoising process reverses the above diffusion process to recover the underlying cleaner data $x_{t-1}$ given the noisy input $x_t$:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t)\right) \tag{2}$$

The $\mu_\theta$ and $\Sigma_\theta$ are determined through estimation procedures involving a denoising model denoted as $\epsilon_\theta$. Typically, this denoising model is implemented using a time-conditional U-Net architecture. It is trained to eliminate noise from data samples while concurrently enhancing the overall fidelity of the generated samples.

Figure 4. FreeU Framework. (a) U-Net Architecture: skip features and backbone features. In U-Net, the skip features (h) and backbone features (x) are concatenated together at each decoding stage. We apply the FreeU operations during concatenation. (b) FreeU Operations, with FFT and IFFT applied to the skip features. The factor b aims to amplify the backbone feature map x, while factor s is designed to attenuate the skip feature map h.
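As a rough illustration of the forward diffusion process in Eq. (1), the sketch below samples each step as $x_t = \sqrt{1-\beta_t}\,x_{t-1} + \sqrt{\beta_t}\,\epsilon$ with $\epsilon \sim \mathcal{N}(0, \mathbf{I})$. The function name `forward_diffusion`, the linear variance schedule, and the toy input are assumptions for illustration and are not specified in the text.

```python
import numpy as np

def forward_diffusion(x0, betas, rng):
    """Run the Markov chain of Eq. (1) and return [x_0, x_1, ..., x_T]."""
    x = x0
    trajectory = [x0]
    for beta in betas:
        eps = rng.standard_normal(x.shape)  # fresh Gaussian noise per step
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * eps
        trajectory.append(x)
    return trajectory

rng = np.random.default_rng(0)
x0 = np.ones((8, 8))                    # toy "clean" sample
betas = np.linspace(1e-4, 0.02, 1000)   # assumed linear schedule, T = 1000
trajectory = forward_diffusion(x0, betas, rng)

# Fraction of the original signal amplitude that survives after T steps.
signal_scale = np.sqrt(np.prod(1.0 - betas))
```

With this schedule `signal_scale` is close to zero, which matches the statement that the data is eventually corrupted into approximately pure Gaussian noise.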

Figure 5. Effect of backbone and skip connection scaling factors (b and s). Panels: b = 0.6, 0.8, 1.0, 1.2, 1.4 with s = 1.0, and s = 0.6, 0.8, 1.0, 1.2, 1.4 with b = 1.0. Increasing the backbone scaling factor b significantly enhances image quality, while variations in the skip scaling factor s have a negligible influence on image synthesis quality.

Figure 6. Relative log amplitudes of Fourier with variations of the backbone scaling factor b (x-axis: Frequency). Increasing b correspondingly results in a suppression of high-frequency components in the images generated by the diffusion model.

2.2. How does diffusion U-Net perform denoising?

Building upon the notable disparities observed between low-frequency and high-frequency components throughout the denoising process illustrated in Fig. 2 and Fig. 3, we extend our investigation to delineate the specific contributions of the U-Net architecture within the denoising process, to explore the internal properties of the denoising network. As depicted in Fig. 4, the U-Net architecture comprises a primary backbone network, encompassing both an encoder and a decoder, as well as the skip connections that facilitate information transfer between corresponding layers of the encoder and decoder.

The backbone of U-Net. To evaluate the salient characteristics of the backbone and lateral skip connections in the denoising process, we conduct a controlled experiment wherein we introduce two multiplicative scaling factors, denoted as b and s, to modulate the feature maps generated by the backbone and skip connections, respectively, prior to their concatenation. As shown in Fig. 5, elevating the scale factor b of the backbone distinctly enhances the quality of generated images. Conversely, variations in the scaling factor s, which modulates the impact of the lateral skip connections, appear to exert a negligible influence on the quality of the generated images.

Figure 7. Fourier relative log amplitudes of backbone, skip, and their fused feature maps (x-axis: Frequency). The features forwarded by skip connections directly from earlier layers of the encoder block to the decoder contain a large amount of high-frequency information.

Building upon these observations, we subsequently

probed the underlying mechanisms that account for the en-

hancement in image generation quality when the scaling

factor b associated with the backbone feature maps is aug-

mented. Our analysis reveals that this quality improvement

is fundamentally linked to an amplified denoising capabil-

ity imparted by the U-Net architecture’s backbone. As de-

lineated in Fig. 6, a commensurate increase in b correspond-

ingly results in a suppression of high-frequency components

in the images generated by the diffusion model. This im-

plies that enhancing backbone features effectively bolsters

the denoising capability of the U-Net architecture, thereby

contributing to a superior output in terms of both fidelity

and detail preservation.
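The frequency analysis referred to here can be approximated with a radially averaged log-amplitude profile. The text does not define "relative log amplitude" precisely, so the sketch below makes an assumption: it averages the log amplitude in radial frequency bins and subtracts the value of the lowest-frequency bin. The function name `relative_log_amplitude`, the bin count, and the toy images are hypothetical.

```python
import numpy as np

def relative_log_amplitude(image, num_bins=16):
    """Radially averaged log amplitude, relative to the lowest-frequency bin.

    Assumed definition for illustration; frequencies are normalized so that
    the edge of the spectrum along an axis is about 1.0.
    """
    h, w = image.shape
    amplitude = np.abs(np.fft.fftshift(np.fft.fft2(image)))
    yy, xx = np.mgrid[:h, :w]
    radius = np.sqrt((yy - h // 2) ** 2 + (xx - w // 2) ** 2) / (min(h, w) / 2)
    log_amp = np.log(amplitude + 1e-8)  # small constant avoids log(0)
    edges = np.linspace(0.0, 1.0, num_bins + 1)
    profile = np.empty(num_bins)
    for k in range(num_bins):
        in_bin = (radius >= edges[k]) & (radius < edges[k + 1])
        profile[k] = log_amp[in_bin].mean()
    return profile - profile[0]

# Toy comparison: a smooth image versus the same image with added noise.
grid_y, grid_x = np.mgrid[:64, :64]
smooth = np.sin(2 * np.pi * grid_x / 64) + np.cos(2 * np.pi * grid_y / 64)
rng = np.random.default_rng(0)
noisy = smooth + 0.1 * rng.standard_normal(smooth.shape)

smooth_profile = relative_log_amplitude(smooth)
noisy_profile = relative_log_amplitude(noisy)
```

In this toy setting the noisy image has a much higher high-frequency tail than the smooth one, which is the kind of gap a stronger denoiser would be expected to reduce.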

The skip connections of U-Net. Conversely, the skip

connections serve to forward features from the earlier lay-

ers of encoder blocks directly to the decoder. Intriguingly,

as evidenced in Fig. 7, these features primarily constitute

high-frequency information. Our conjecture, grounded in

this observation, posits that during the training of the U-Net

architecture, the presence of these high-frequency features

may inadvertently expedite the convergence toward noise

prediction within the decoder module. Furthermore, the

limited impact of modulating skip features in Fig. 5 also

indicates that the skip features predominantly contribute

to the decoder’s information. This phenomenon, in turn,

could result in an unintended attenuation of the efficacy of

the backbone’s intrinsic denoising capabilities during infer-

ence. Thereby, this observation prompts pertinent questions

about the counterbalancing roles played by the backbone

and the skip connections in the composite denoising perfor-

mance of the U-Net framework.

2.3. Free lunch in diffusion U-Net

Capitalizing on the above discovery, we introduce a simple yet effective method, denoted as “FreeU”, which effectively bolsters the denoising capability of the U-Net architecture by leveraging the

strengths of both components of the U-Net architecture. It

substantially improves the generation quality without re-

quiring additional training or fine-tuning.

Technically, for the l-th block of the U-Net decoder, let $x_l$ represent the backbone feature map from the main backbone at the preceding block, and let $h_l$ denote the feature map propagated through the corresponding skip connection. To modulate these feature maps, we introduce two scalar factors: a backbone feature scaling factor $b_l$ for $x_l$ and a skip feature scaling factor $s_l$ for $h_l$, defined later in this section. Specifically, the factor $b_l$ aims to amplify the backbone feature map $x_l$, while the factor $s_l$ is designed to attenuate the skip feature map $h_l$. For backbone features, upon experimental investigation, we discern that indiscriminately amplifying all channels of $x_l$ through multiplication with $b_l$ engenders an oversmoothed texture in the resulting synthesized images. The reason is that the enhanced U-Net compromises the image’s high-frequency details while denoising. Consequently, we confine the scaling operation to half of the channels of $x_l$ as follows:

$$x'_{l,i} = \begin{cases} b_l \cdot x_{l,i}, & \text{if } i < C/2 \\ x_{l,i}, & \text{otherwise} \end{cases} \tag{3}$$

where $x_{l,i}$ denotes the i-th channel of the feature map $x_l$ and C is the total number of channels in $x_l$. This strategy not only enhances the backbone’s denoising capabilities but also averts the deleterious consequences of a globally applied scaling, thereby arriving at a more nuanced balance between noise reduction and texture preservation.
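The half-channel scaling of Eq. (3) can be sketched in a few lines of NumPy. The function name `scale_backbone`, the `[C, H, W]` layout, and the value `b = 1.2` are assumptions for illustration only.

```python
import numpy as np

def scale_backbone(x, b):
    """Eq. (3): multiply channels i < C/2 of x (shape [C, H, W]) by b."""
    c = x.shape[0]
    out = x.copy()        # leave the input feature map untouched
    out[: c // 2] *= b    # first half of the channels is amplified
    return out            # second half passes through unchanged

x = np.ones((4, 2, 2))               # toy backbone feature map, C = 4
x_scaled = scale_backbone(x, b=1.2)  # illustrative scaling factor
```

Here channels 0 and 1 end up at 1.2 while channels 2 and 3 stay at 1.0, which is the partial amplification the text describes as a way to limit oversmoothing.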

To further mitigate the issue of oversmoothed texture due to enhancing denoising, we employ spectral modulation in the Fourier domain to selectively diminish low-frequency components for the skip features. Mathematically, this operation is performed as follows:

$$\mathcal{F}(h_{l,i}) = \mathrm{FFT}(h_{l,i}) \tag{4}$$

$$\mathcal{F}'(h_{l,i}) = \mathcal{F}(h_{l,i}) \odot \alpha_{l,i} \tag{5}$$

$$h'_{l,i} = \mathrm{IFFT}(\mathcal{F}'(h_{l,i})) \tag{6}$$

where FFT(·) and IFFT(·) are the Fourier transform and inverse Fourier transform, $\odot$ denotes element-wise multiplication, and $\alpha_{l,i}$ is a Fourier mask, designed as a function of the magnitude of the Fourier coefficients, serving to implement the frequency-dependent scaling factor $s_l$:

$$\alpha_{l,i}(r) = \begin{cases} s_l, & \text{if } r < r_{\text{thresh}} \\ 1, & \text{otherwise} \end{cases} \tag{7}$$

where r is the radius and $r_{\text{thresh}}$ is the threshold frequency. The augmented skip feature map $h'_l$ is then concatenated with the modified backbone feature map $x'_l$ for subsequent layers in the U-Net architecture, as shown in Fig. 4.

Figure 8. Samples generated by Stable Diffusion [26] with or without FreeU (rows: SD, SD + FreeU). Prompts: “a blue car is being filmed”; “Mother rabbit is raising baby rabbits”; “A bridge is depicted in the water”; “a baby in a red shirt”; “a attacks an upset cat and is then chased off”; “A teddy bear walking in the snowstorm”; “A cat riding a motorcycle.”; “A panda standing on a surfboard in the ocean”; “A boy is playing pokemon”.

Remarkably, the proposed FreeU framework does not re-

quire any task-specific training or fine-tuning. Adding the

backbone and skip scaling factors can be easily done with

just a few lines of code. Essentially, the parameters of the

architecture can be adaptively re-weighted during the in-

ference phase, which allows for a more flexible and potent

denoising operation without adding any computational bur-

den. This makes FreeU a highly practical solution that can

be seamlessly integrated into existing diffusion models to

improve their performance.
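To make the "few lines of code" claim concrete, the following NumPy sketch combines Eq. (3) with Eqs. (4)-(7) for a single decoder stage. It is a sketch under stated assumptions, not the authors' implementation: the function names, the `[C, H, W]` layout, measuring the radius r in pixels from the center of the shifted spectrum, the concatenation order, and the values of b, s, and r_thresh are all hypothetical.

```python
import numpy as np

def fourier_filter_skip(h, s, r_thresh):
    """Eqs. (4)-(7): scale frequencies with radius r < r_thresh by s."""
    _, height, width = h.shape
    # Per-channel 2-D FFT, shifted so the zero frequency is at the center.
    spectrum = np.fft.fftshift(np.fft.fft2(h, axes=(-2, -1)), axes=(-2, -1))
    yy, xx = np.mgrid[:height, :width]
    r = np.sqrt((yy - height // 2) ** 2 + (xx - width // 2) ** 2)
    alpha = np.where(r < r_thresh, s, 1.0)  # Fourier mask of Eq. (7)
    filtered = np.fft.ifftshift(spectrum * alpha, axes=(-2, -1))
    return np.fft.ifft2(filtered, axes=(-2, -1)).real

def freeu_fuse(x, h, b, s, r_thresh):
    """Re-weight backbone (x) and skip (h) features, then concatenate."""
    c = x.shape[0]
    x_mod = x.copy()
    x_mod[: c // 2] *= b                      # Eq. (3)
    h_mod = fourier_filter_skip(h, s, r_thresh)
    return np.concatenate([x_mod, h_mod], axis=0)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16, 16))  # stand-in backbone features
h = rng.standard_normal((8, 16, 16))  # stand-in skip features
fused = freeu_fuse(x, h, b=1.2, s=0.8, r_thresh=1.0)
```

With `r_thresh=1.0` only the zero-frequency coefficient falls under the mask, so each skip channel keeps its spatial variation and has its mean scaled by s; larger thresholds would attenuate a wider low-frequency band. No parameters are learned, which is consistent with the method being applied only at inference.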

3. Experiments

3.1. Implementation details

To assess the effectiveness of the proposed FreeU, we

systematically conduct a series of experiments, aligning our

benchmarks with state-of-the-art methods such as Stable

Diffusion [26], DreamBooth [27], ModelScope [20], and

Rerender [34]. Importantly, our approach seamlessly inte-

grates with these established methods without imposing any

additional computational overhead associated with supple-

mentary training or fine-tuning. We meticulously adhere to

the prescribed settings of these methods and exclusively in-

troduce the backbone feature factors and skip feature factors

during the inference.

3.2. Text-to-image

Stable Diffusion [26] is a latent text-to-image diffusion

model renowned for its capability to generate photorealistic

images based on textual input. It has consistently demon-

strated exceptional performance in various image synthesis

tasks. With the integration of our FreeU augmentation into

Stable Diffusion, the results, as exemplified in Fig. 8, ex-

hibit a notable enhancement in the model’s generative ca-

pacity.

To elaborate, the incorporation of FreeU into Stable Dif-

fusion [26] yields improvements in both entity portrayal

and fine-grained details. For instance, when provided with

the prompt “a blue car is being filmed”, FreeU refines the

image, eliminating rooftop irregularities and enhancing the

textural intricacies of the surrounding structures. In the case

of “Mother rabbit is raising baby rabbits”, FreeU ensures

that the generated image portrays a mother rabbit with a normal appearance caring for baby rabbits. Furthermore, in scenarios like “a attacks an upset cat and is then chased off” and “A teddy bear walking in the snowstorm”, FreeU helps generate more realistically posed cats and teddy bears. Impressively, in response to the complex prompt “A cat riding a motorcycle”, FreeU not only accurately renders the individual entities but also captures the relationship between them, ensuring that the cat is actively engaged in riding. These results underscore the significant qualitative enhancements achieved through the synergy of FreeU with Stable Diffusion [26].

Figure 9. Samples generated by ModelScope [20] with or without FreeU (rows: ModelScope, ModelScope+FreeU). Prompts: “A cinematic view of the ocean, from a cave.”; “A cartoon of an elephant walking.”; “An astronaut flying in space.”

Quantitative evaluation. We conduct a study with 35 participants to assess image quality and image-text alignment. Each participant receives a text prompt and two corresponding synthesized images, one from SD and another from SD+FreeU. To ensure fairness, we use the same randomly sampled seed for generating both images. The image sequence is randomized to eliminate any bias. Participants then select the image they consider superior for image-text alignment and image quality, respectively. We tabulate the votes for SD and SD+FreeU in each category in Table 1. Our analysis reveals that the majority of votes go to SD+FreeU, indicating that FreeU significantly enhances the Stable Diffusion text-to-image model in both evaluated aspects.

Table 1. Text-to-Image Quantitative Results. We count the percentage of votes for the baseline and our method respectively. Image-Text refers to Image-Text Alignment.

Method              Image-Text    Image Quality
SD [26]             14.12%        14.66%
SD+FreeU            85.88%        85.34%

Table 2. Text-to-Video Quantitative Results. We count the percentage of votes for the baseline and our method respectively. Video-Text refers to Video-Text Alignment.

Method              Video-Text    Video Quality
ModelScope [20]     15.29%        14.33%
ModelScope+FreeU    84.71%        85.67%

Figure 10. Samples generated by DreamBooth [27] with or without FreeU (panels: Input images, DreamBooth, DreamBooth + FreeU). Prompts: “a photo of action figure riding a motorcycle”; “A toy on a beach”.

Figure 11. Samples generated by ReVersion [12] with or without FreeU (panels: ReVersion, ReVersion+FreeU). Relations shown: child and child, “sits back-to-back with”; dog and basket, “is contained inside of”; Spiderman and basket, “is contained inside of”; cat and motorbike, “ride on”.

3.3. Text-to-video

ModelScope [20], an avant-garde text-to-video diffusion

model, stands at the forefront of video generation from tex-

tual descriptions. The infusion of our FreeU augmentation

into ModelScope [20] serves to further hone its video syn-

thesis prowess, as substantiated by Fig. 9. For instance,

when presented with the prompt “A cinematic view of the

ocean, from a cave”, FreeU enables ModelScope [20] to

generate the perspective “from a cave”, enriching the visual

narrative. In the case of “A cartoon of an elephant walk-

ing”, ModelScope [20] initially generates an elephant with

two trunks, but with the incorporation of FreeU, it rectifies

this anomaly and produces a correct depiction of an ele-

A dog wearing sunglasses

Figure 12. Samples generated by Rerender [34] with or without

FreeU.

phant in motion. Moreover, in response to the prompt “An

astronaut flying in space”, ModelScope [20], with the as-

sistance of FreeU, can generate a clear and vivid portrayal

of an astronaut floating in the expanse of outer space.

These results underscore the significant improvements

achieved through the synergistic application of FreeU with

ModelScope [20], resulting in high-quality generated con-

tent characterized by clear motion, rich detail, and semantic

alignment.

Quantitative evaluation. We conduct the quantitative evaluation for FreeU on the text-to-video task in a similar way as text-to-image. The results displayed in Table 2 indicate that most participants prefer the video generated with FreeU.

Figure 13. Fourier relative log amplitudes of Stable Diffusion [26] with or without FreeU within the denoising process.

Figure 14. The visualization of feature maps for Stable Diffusion [26] with or without FreeU. Panel labels: SD; SD + FreeU.
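A curve of the kind reported in Fig. 13 can be approximated with a radially averaged Fourier spectrum. The following is a minimal sketch, not the code used for the figure. It assumes that "relative" means subtracting the log amplitude of the zero-frequency bin, and it uses a naive DFT on a tiny toy grid; the function names `dft2` and `relative_log_amplitude` and the test images are hypothetical.

```python
import cmath
import math


def dft2(image):
    """Naive 2D DFT of a square image given as a list of rows (tiny grids only)."""
    n = len(image)
    spectrum = [[0j] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            total = 0j
            for x in range(n):
                for y in range(n):
                    total += image[x][y] * cmath.exp(-2j * math.pi * (u * x + v * y) / n)
            spectrum[u][v] = total
    return spectrum


def relative_log_amplitude(image, eps=1e-8):
    """Mean log amplitude per integer radial frequency, minus the value at frequency 0.

    Assumption: "relative" is measured against the zero-frequency (DC) bin.
    """
    n = len(image)
    half = n // 2
    spectrum = dft2(image)
    sums = [0.0] * (half + 1)
    counts = [0] * (half + 1)
    for u in range(n):
        for v in range(n):
            # Map DFT indices to signed frequencies.
            fu = u if u <= half else u - n
            fv = v if v <= half else v - n
            radius = int(round(math.hypot(fu, fv)))
            if radius <= half:
                sums[radius] += math.log(abs(spectrum[u][v]) + eps)
                counts[radius] += 1
    profile = [total / count for total, count in zip(sums, counts)]
    return [value - profile[0] for value in profile]


# Toy inputs: a smooth image, and the same image with a high-frequency stripe added.
n = 8
smooth = [[1.0 + 0.5 * math.cos(2 * math.pi * x / n) for _ in range(n)] for x in range(n)]
striped = [[smooth[x][y] + 0.3 * (-1) ** x for y in range(n)] for x in range(n)]

rel_smooth = relative_log_amplitude(smooth)
rel_striped = relative_log_amplitude(striped)
print(rel_smooth[-1], rel_striped[-1])
```

In this toy case the added stripe raises the relative log amplitude in the highest-frequency bin, which is the kind of high-frequency content such a plot is meant to expose.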

3.4. Downstream tasks

FreeU presents substantial enhancements in the quality

of synthesized samples across various diffusion model ap-

plications. Our evaluations extend from foundational image

and video synthesis models to more specialized downstream

applications.

We incorporate FreeU into DreamBooth [27], a diffusion model specialized in personalized text-to-image tasks. The enhancements are evident: as demonstrated in Fig. 10, the synthesized images present marked improvements in realism. For instance, while the base DreamBooth [27] model

struggles to synthesize the appearance of the action figure’s

legs from the prompt “a photo of action figure riding a mo-

torcycle”, the FreeU-augmented version deftly overcomes

this hurdle. Similarly, for the prompt “A toy on a beach”,

the initial output exhibited body shape anomalies. FreeU’s

integration refines these imperfections, providing a more

accurate representation and improving color fidelity.

We also integrate FreeU into ReVersion [12], a Stable Diffusion-based relation inversion method, enhancing its

quality as shown in Fig. 11. For example, when the rela-

tion “back to back” is to be expressed between two children,

FreeU enhances ReVersion’s ability to accurately represent

this relationship. For the “inside” relation, when a dog is

supposed to be placed inside of a basket, ReVersion some-

times generates a dog with artifacts, and introducing FreeU

helps eliminate these artifacts. While ReVersion effectively captures relational concepts, Stable Diffusion might occasionally struggle to synthesize the relation concept due to excessive high-frequency noise in the U-Net skip features. Adding FreeU improves entity and relation synthesis quality while using exactly the same relation prompt learned by ReVersion.

Furthermore, we evaluated FreeU’s impact on Reren-

der [34], a diffusion model tailored for zero-shot text-

guided video-to-video translations. Fig. 12 depicts the re-

sults: clear improvements in the detail and realism of syn-

thesized videos. For instance, when provided with the

prompt “A dog wearing sunglasses” and an input video,

Rerender [34] initially produces a dog video with artifacts

related to the “sunglasses”. However, the incorporation of

FreeU successfully eliminates such artifacts, resulting in a

refined output.

In summation, these outcomes substantiate that the in-

corporation of FreeU leads to enhanced entity representa-

tion and synthesis quality, employing precisely the same

learned prompt.

3.5. Ablation study

Effects of FreeU. FreeU is introduced with the primary aim of enhancing the denoising capabilities of the U-Net architecture within the diffusion model. To assess the impact of FreeU, we conducted analytical experiments using Stable Diffusion [26] as the base framework. In Fig. 13, we present visualizations of the relative log amplitudes of the Fourier transform of Stable Diffusion [26], comparing cases with and without the incorporation of FreeU. These visualizations illustrate that FreeU exerts a discernible influence in reducing high-frequency information at each step of the denoising process, which indicates FreeU’s capacity to denoise effectively. Furthermore, we extended our analysis by visualizing the feature maps of the U-Net architecture. As shown in Fig. 14, we observe that the feature maps generated by FreeU contain more pronounced structural information. This observation aligns with the intended effect of FreeU, as it preserves intricate details while effectively removing noise, harmonizing with the denoising objectives of the model.

Effects of components in FreeU. We evaluate the effects of the proposed FreeU strategy, i.e., introducing backbone feature scaling factors and skip feature scaling factors to balance the feature contributions from the U-Net architecture’s primary backbone and skip connections. In Fig. 15, we present the results of our evaluations. In the case of SD+FreeU(b), where backbone scaling factors are integrated during inference, we observe a noticeable improvement in the generation of vivid details compared to SD [26] alone. For instance, when given the prompt “A fat rabbit wearing a purple robe walking through a fantasy landscape”, SD+FreeU(b) generates a more realistic rabbit with normal arms and ears, as opposed to SD [26]. However, it is imperative to note that while the inclusion of feature scaling factors yields significant improvements, it can occasionally lead to an undesirable oversmoothing of textures. To mitigate this issue, we introduce skip feature scaling factors, aiming to reduce low-frequency information and alleviate the problem of texture oversmoothing. As demonstrated in Fig. 15, the combination of both backbone and skip feature scaling factors in SD+FreeU(b & s) leads to the generation of more realistic images. For instance, in the prompt “A synthwave style sunset above the reflecting water of the sea, digital art”, the generated sunset sky in SD+FreeU(b & s) exhibits enhanced realism compared to SD+FreeU(b). This highlights the efficacy of the comprehensive FreeU strategy in balancing features and mitigating issues related to texture smoothing, ultimately resulting in more faithful and realistic image generation.

Figure 15. The ablation study of backbone scaling factor and skip scaling factor. Panel labels: SD; SD+FreeU (b); SD+FreeU (b&s). Prompts: “A fat rabbit wearing a purple robe walking through a fantasy landscape”; “a teddy bear walking down the road in the sunset”; “A synthwave style sunset above the reflecting water of the sea, digital art”.

4. Conclusion

In this study, we introduce an elegantly simple yet highly effective approach, termed FreeU, which substantially enhances the sample quality of diffusion models without incurring any additional computational costs. Motivated

by the fundamental role played by both skip connections

and backbone features in U-Net architectures, we conduct

an in-depth analysis of their effects in diffusion U-Net. Our

investigation reveals that the primary backbone primarily

contributes to denoising, while the skip connections pre-

dominantly introduce high-frequency features into the de-

coder, potentially leading to a neglect of essential backbone

semantics. To address this, we strategically re-weight the

contributions originating from the U-Net’s skip connections

and backbone feature maps. This re-weighting process cap-

italizes on the unique strengths of both U-Net components,

resulting in a substantial improvement in sample quality

across a wide range of text prompts and random seeds. Our

proposed FreeU can be seamlessly integrated into various

diffusion foundation models and their downstream tasks, of-

fering a versatile means of enhancing sample quality.
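The re-weighting summarized above can be illustrated on one-dimensional toy features. The sketch below is a simplification under stated assumptions, not the released implementation: the backbone feature is multiplied by a factor `b`, the low-frequency Fourier coefficients of the skip feature are multiplied by a factor `s`, and the two are then concatenated. The function names, the `cutoff` parameter, and the default values of `b` and `s` are illustrative only.

```python
import cmath
import math


def dft(signal):
    """Naive 1D discrete Fourier transform."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Naive inverse DFT; returns the real part (inputs here are real signals)."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
            for t in range(n)]


def scale_low_frequencies(skip, s, cutoff):
    """Multiply Fourier coefficients of `skip` with |frequency| <= cutoff by s."""
    n = len(skip)
    spectrum = dft(skip)
    for k in range(n):
        freq = min(k, n - k)  # distance from the zero-frequency bin
        if freq <= cutoff:
            spectrum[k] *= s
    return idft(spectrum)


def reweight(backbone, skip, b=1.2, s=0.9, cutoff=1):
    """Scale backbone features by b, damp low frequencies of skip features by s,
    then concatenate (illustrative defaults, hypothetical helper)."""
    scaled_backbone = [b * value for value in backbone]
    scaled_skip = scale_low_frequencies(skip, s, cutoff)
    return scaled_backbone + scaled_skip


# Toy features: a flat backbone signal and a skip signal with a constant
# offset (low frequency) plus an alternating pattern (high frequency).
backbone = [1.0] * 8
skip = [2.0 + (-1) ** t for t in range(8)]
fused = reweight(backbone, skip, b=1.5, s=0.5, cutoff=1)
print([round(value, 6) for value in fused])
```

With these toy inputs the backbone half is amplified uniformly, while only the constant offset of the skip half is halved and its alternating high-frequency pattern is left unchanged. No retraining or extra parameters are involved, only two scalars applied at inference.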

References

[1] Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended

diffusion for text-driven editing of natural images. In CVPR,

2022. 2

[2] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large

scale GAN training for high fidelity natural image synthesis.

arXiv preprint arXiv:1809.11096, 2018. 1

[3] Jooyoung Choi, Sungwon Kim, Yonghyun Jeong, Youngjune

Gwon, and Sungroh Yoon. ILVR: Conditioning method for

denoising diffusion probabilistic models. In ICCV, 2021. 2

[4] Prafulla Dhariwal and Alexander Nichol. Diffusion models

beat GANs on image synthesis. In NeurIPS, 2021. 1

[5] Patrick Esser, Robin Rombach, Andreas Blattmann, and

Björn Ommer. ImageBART: Bidirectional context with

multinomial diffusion for autoregressive image synthesis. In

NeurIPS, 2021. 1

[6] Patrick Esser, Robin Rombach, and Björn Ommer. Taming

transformers for high-resolution image synthesis. In CVPR,

2021. 1

[7] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik,

Amit H. Bermano, Gal Chechik, and Daniel Cohen-Or. An

image is worth one word: Personalizing text-to-image gen-

eration using textual inversion. In ICLR, 2023. 1

[8] Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing

Xu, David Warde-Farley, Sherjil Ozair, Aaron C Courville,

and Yoshua Bengio. Generative adversarial nets. In NeurIPS,

2014. 1

[9] Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo

Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vec-

tor quantized diffusion model for text-to-image synthesis. In

CVPR, 2022. 1

[10] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang,

Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben

Poole, Mohammad Norouzi, David J Fleet, et al. Imagen

video: High definition video generation with diffusion mod-

els. arXiv preprint arXiv:2210.02303, 2022. 2

[11] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffu-

sion probabilistic models. In NeurIPS, 2020. 1, 2, 3

[12] Ziqi Huang, Tianxing Wu, Yuming Jiang, Kelvin C.K. Chan,

and Ziwei Liu. ReVersion: Diffusion-based relation inver-

sion from images. arXiv preprint arXiv:2303.13495, 2023.

3, 8, 9

[13] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen.

Progressive growing of GANs for improved quality, stability,

and variation. In ICLR, 2018. 1

[14] Tero Karras, Miika Aittala, Samuli Laine, Erik Härkönen,

Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Alias-free

generative adversarial networks. In NeurIPS, 2021. 1

[15] Tero Karras, Samuli Laine, and Timo Aila. A style-based

generator architecture for generative adversarial networks. In

CVPR, 2019. 1

[16] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten,

Jaakko Lehtinen, and Timo Aila. Analyzing and improving

the image quality of StyleGAN. In CVPR, 2020. 1

[17] Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen

Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic:

Text-based real image editing with diffusion models. arXiv

preprint arXiv:2210.09276, 2022. 1

[18] Diederik P Kingma and Max Welling. Auto-encoding varia-

tional bayes. arXiv preprint arXiv:1312.6114, 2013. 1

[19] Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli

Shechtman, and Jun-Yan Zhu. Multi-concept customization

of text-to-image diffusion. arXiv preprint arXiv:2212.04488,

2022. 1

[20] Zhengxiong Luo, Dayou Chen, Yingya Zhang, Yan Huang,

Liang Wang, Yujun Shen, Deli Zhao, Jingren Zhou, and Tie-

niu Tan. VideoFusion: Decomposed diffusion models for

high-quality video generation. In CVPR, 2023. 2, 3, 6, 7, 8

[21] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jia-

jun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided

image synthesis and editing with stochastic differential equa-

tions. In ICLR, 2022. 2

[22] Mehdi Mirza and Simon Osindero. Conditional generative

adversarial nets. arXiv preprint arXiv:1411.1784, 2014. 1

[23] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav

Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and

Mark Chen. GLIDE: Towards photorealistic image gener-

ation and editing with text-guided diffusion models. arXiv

preprint arXiv:2112.10741, 2021. 1

[24] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu,

and Mark Chen. Hierarchical text-conditional image gener-

ation with CLIP latents. arXiv preprint arXiv:2204.06125,

2022. 1

[25] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu,

and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125,

2022. 2

[26] Robin Rombach, Andreas Blattmann, Dominik Lorenz,

Patrick Esser, and Björn Ommer. High-resolution image syn-

thesis with latent diffusion models. In CVPR, 2022. 1, 2, 3,

6, 7, 8, 9, 10

[27] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch,

Michael Rubinstein, and Kfir Aberman. DreamBooth: Fine

tuning text-to-image diffusion models for subject-driven

generation. In CVPR, 2023. 3, 6, 8, 9

[28] Chitwan Saharia, William Chan, Huiwen Chang, Chris Lee,

Jonathan Ho, Tim Salimans, David Fleet, and Mohammad

Norouzi. Palette: Image-to-image diffusion models. In ACM

SIGGRAPH, 2022. 2

[29] Chitwan Saharia, William Chan, Saurabh Saxena, Lala

Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed

Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi,

Rapha Gontijo Lopes, et al. Photorealistic text-to-image

diffusion models with deep language understanding. arXiv

preprint arXiv:2205.11487, 2022. 1, 2

[30] Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An,

Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual,

Oran Gafni, et al. Make-a-video: Text-to-video generation

without text-video data. arXiv preprint arXiv:2209.14792,

2022. 2

[31] Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete

representation learning. In NeurIPS, 2017. 1

[32] Tengfei Wang, Ting Zhang, Bo Zhang, Hao Ouyang, Dong

Chen, Qifeng Chen, and Fang Wen. Pretraining is all

you need for image-to-image translation. arXiv preprint

arXiv:2205.12952, 2022. 2

[33] Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian

Lei, Yuchao Gu, Wynne Hsu, Ying Shan, Xiaohu Qie, and

Mike Zheng Shou. Tune-a-video: One-shot tuning of image

diffusion models for text-to-video generation. arXiv preprint

arXiv:2212.11565, 2022. 2

[34] Shuai Yang, Yifan Zhou, Ziwei Liu, and Chen Change

Loy. Rerender a video: Zero-shot text-guided video-to-video

translation. arXiv preprint arXiv:2306.07954, 2023. 3, 6, 8,