Summary: Improved Precision and Recall Metric for Assessing Generative Models
One Line
We introduce a new metric, Precision-Recall, which is more reliable than existing methods for evaluating image generation tasks. We demonstrate it on StyleGAN and on BigGAN trained on ImageNet, complement it with the Wasserstein-2 distance, and show that a high-quality estimate takes 8 minutes to compute on a single NVIDIA Tesla V100 GPU.
Key Points
- We present a metric for evaluating generative models that allows us to study the effects of the truncation trick.
- We compare our metric to Sajjadi et al.'s method on four StyleGAN setups and find that heavily truncated setups have high precision and low recall, while FID-optimized setups have higher variation but some image artifacts.
- We test BigGAN on ImageNet and find that difficult classes have lower precision than easy ones.
- Our metric correlates well with perceived image quality and variation, and suggests that truncation may not be necessary.
- We offer an open-source implementation; a high-quality estimate using 50k images takes 8 minutes on a single NVIDIA Tesla V100 GPU.
Summaries
122 word summary
We present a new metric for assessing generative models, Precision-Recall, which measures the overall quality and coverage of samples in image generation tasks. It embeds sets of real and generated images into a feature space and evaluates precision and recall on the resulting feature vectors. This metric is more effective than existing methods and can be used to analyze truncation methods and estimate the quality of individual generated samples. We also tested BigGAN on ImageNet and found that difficult classes had lower precision than easy ones. The Wasserstein-2 distance is used to quantify the difference between the densities of real and generated images. Our open-source implementation takes 8 minutes to run on a single NVIDIA Tesla V100 GPU for 50k images.
422 word summary
This paper presents an improved precision and recall metric for evaluating generative models such as GANs. The metric measures both the overall quality and coverage of samples in image generation tasks. It builds explicit non-parametric representations of the manifolds of real and generated data, from which precision and recall can be estimated. The metric is more effective than existing metrics and can be used to analyze truncation methods and estimate the quality of individual generated samples. Additionally, the Wasserstein-2 distance is used to quantify the difference between the densities of real and generated images. We compared our metric to Sajjadi et al.'s method using four StyleGAN setups with varying truncation and training time; in our tests, higher values of the neighborhood size consistently increase the precision and recall estimates. We found that heavily truncated setups have high precision and low recall, while FID-optimized setups have higher variation but some image artifacts.
We also tested BigGAN on ImageNet and found that difficult classes had lower precision than easy ones. Our metric sheds light on design decisions and truncation methods, and can be used to select the best configuration for any given tradeoff between precision and recall. Studying the effects of the truncation trick, we find that clamping outliers is superior to rejection by distance, interpolation is competitive, and random replacement is less desirable. Our metric correlates well with perceived image quality and variation, and suggests that truncation may not be necessary.

Related generative modeling work includes deep features as a perceptual metric, self-attention GANs, PixelCNN decoders, very deep convolutional networks, cGANs with a projection discriminator, spectral normalization, unrolled GANs, Wasserstein auto-encoders, megapixel-size image creation, conditional image synthesis, style-based generator architectures, semi-supervised learning with deep generative models, Glow with invertible 1x1 convolutions, Progressive Growing of GANs, and a two time-scale update rule for GANs; Flow-GAN combines maximum likelihood and adversarial learning.

Our new metric, Precision-Recall, takes two sets of images as input, embeds them in a feature space as feature vectors, and evaluates precision and recall for the given sets of real and generated images. Figures 10 and 11 show examples of high- and low-quality interpolations and BigGAN-generated images, respectively, according to our realism scores. Our implementation is open source; a high-quality estimate using 50k images takes 8 minutes on a single NVIDIA Tesla V100 GPU.
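To make the embedding step concrete, here is a minimal sketch that maps images to feature vectors with a pre-trained VGG-16 classifier, the network mentioned later in this summary. The choice of layer (the second fully connected layer) and the preprocessing are our assumptions for illustration, not necessarily the authors' exact pipeline.

```python
# Minimal sketch: embed images as feature vectors with a pre-trained
# VGG-16. Layer choice and preprocessing are illustrative assumptions.
import torch
import torchvision.models as models
import torchvision.transforms as T

vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).eval()

# Keep everything up to (and including) the second fully connected
# layer and its ReLU; this yields 4096-dimensional feature vectors.
feature_extractor = torch.nn.Sequential(
    vgg.features,
    vgg.avgpool,
    torch.nn.Flatten(),
    *list(vgg.classifier.children())[:5],
)

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed(pil_images):
    """Map a list of PIL images to an (N, 4096) tensor of features."""
    batch = torch.stack([preprocess(img) for img in pil_images])
    return feature_extractor(batch)
```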
1706 word summary
We present a new metric for assessing generative models, Precision-Recall. It takes two sets of images as input, embeds them into a feature space, and evaluates precision and recall for the given sets of real and generated images. It forms an estimate of the manifold of feature vectors by tabulating, for each vector, the distance to its k-th nearest neighbor. We run our implementation on an NVIDIA Tesla V100 GPU; a high-quality estimate using 50k images takes 8 minutes on a single GPU.
Figures 10 and 11 show examples of high- and low-quality interpolations and BigGAN-generated images, respectively, according to our realism scores. Images with a high realism score contain a clear object from the given class, whereas low-scoring images generally lack such an object or show it distorted in various ways.
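One concrete reading of such a score: take the maximum, over all real feature vectors, of the ratio between each real vector's k-NN hypersphere radius and its distance to the generated sample, so that values of 1 or more mean the sample lies inside the estimated real manifold. The sketch below implements that reading; the exact formula and the name realism_score are our assumptions, not the reference implementation.

```python
# Hedged sketch of a per-sample realism score: max over real feature
# vectors of (k-NN radius of the real vector) / (distance from the
# generated vector to the real vector). Values >= 1 indicate the
# sample falls inside the estimated manifold of real images.
import numpy as np
from scipy.spatial.distance import cdist

def knn_radii(feats, k=3):
    """Distance from each feature vector to its k-th nearest neighbor."""
    d = cdist(feats, feats)   # pairwise Euclidean distances
    d.sort(axis=1)            # column 0 is the zero self-distance
    return d[:, k]

def realism_score(gen_feat, real_feats, real_radii):
    """Realism of one generated feature vector against the real set."""
    dists = np.linalg.norm(real_feats - gen_feat, axis=1)
    return float(np.max(real_radii / np.maximum(dists, 1e-12)))
```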
Our implementation is open source. Deep features have been shown to be effective in perceptual tasks. Self-attention generative adversarial networks (GANs) and PixelCNN decoders have been used for conditional image generation. Very deep convolutional networks and improved techniques for training GANs are two other relevant methods, and assessing generative models through precision and recall has been proposed before. Other related methods include cGANs with a projection discriminator, spectral normalization, unrolled GANs, and Wasserstein auto-encoders. Megapixel-size image creation, conditional image synthesis with auxiliary classifier GANs, and style-based generator architectures have been proposed, as have semi-supervised learning with deep generative models, Glow with invertible 1x1 convolutions, and Progressive Growing of GANs. Finally, a two time-scale update rule for GANs converges to a local Nash equilibrium, Flow-GAN combines maximum likelihood and adversarial learning, and the distance from a point to an ellipse, an ellipsoid, or a hyperellipsoid has been studied.

We present a metric to measure the quality of generative models that correlates well with perceived image quality and variation. We demonstrate this metric with experiments on StyleGAN using linear interpolation in the intermediate latent space.
We found that only 2.4% of interpolation paths crossed unrealistic parts of the real manifold, suggesting that truncation may not be necessary. We also identified training-configuration-related effects with our metric, which may allow for more refined exclusion of unrealistic images. Our metric, in addition to previous techniques for estimating distribution quality, will prove valuable for further improving generative models.

The realism score is a metric for evaluating the quality of interpolations and individual samples of generative models. It is calculated as a continuous extension of the binary classification function: the score increases when a sample is closer to the manifold of real images and decreases when it is further away. Visual inspection shows that generated images with high realism display a clear object from the given class, while low-realism images are often distorted. Our findings also indicate that recall alone is not enough to judge the quality of a distribution, and FID should be used in addition.

Many generative methods employ a truncation trick to allow trading variation for quality after training, but quantitative evaluation of these tricks has proven difficult. Our metric allows us to study these effects in a principled way. We evaluate four primary strategies (A-D), illustrated in Figure 7, plus three secondary strategies (E-G). Strategy B, approximating the distribution of latent vectors with a multivariate Gaussian and rejecting those with low probability density, is superior to rejection by distance (A). Clamping outliers (C) is better than rejection because it provides better coverage around the extremes of the distribution, and appears to be the best overall tradeoff. Interpolation (D) is competitive with clamping, as it affects all latent vectors equally and increases the average density. Random replacement (G), which removes a random subset of latent vectors and reinserts them at the highest-density point, is less desirable, and strategies E and F yield an inferior tradeoff, confirming that sampling density is not a good predictor of image quality. A code sketch of the clamping and interpolation strategies appears below.

We also use the metric to analyze different training configurations of StyleGAN on the FFHQ dataset. We find that configurations with high recall (A, F) outperform those with high precision (B, C) in terms of FID, and that random translation of inputs (E) improves precision at the cost of recall while slightly improving FID. Our metric sheds light on design decisions and truncation methods, and can be used to select the best configuration for any given tradeoff between precision and recall.

Generative models have seen rapid improvements, and FID has become the de facto standard for evaluating them; however, precision and recall can provide a more detailed picture. We used these metrics to analyze StyleGAN, and we also tested BigGAN on ImageNet, where difficult classes had lower precision than easy ones. Easy classes such as cats and dogs had higher recall, likely due to their larger amount of training data, while difficult classes such as Lemon and Broccoli had lower recall, suggesting that much of the variation had been missed. Precision was invariably high for easy classes and lower for difficult ones. We observed that quality increased with more truncation, in line with Brock et al.'s findings.
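To make the truncation strategies concrete, here is a minimal sketch of the clamping (C) and interpolation (D) strategies described above; the function names, the clamping limit, and the interpolation factor psi are illustrative assumptions, not the paper's exact procedure.

```python
# Hedged sketch of two latent-space truncation strategies:
# clamping outliers (C) and interpolating toward the mean (D).
import numpy as np

def truncate_clamp(z, limit=2.0):
    """Strategy C: clamp each latent component to [-limit, limit];
    only outliers are affected."""
    return np.clip(z, -limit, limit)

def truncate_interpolate(z, psi=0.7, mean=None):
    """Strategy D: move every latent toward the mean by factor psi;
    all latent vectors are affected equally."""
    if mean is None:
        mean = np.zeros_like(z)
    return mean + psi * (z - mean)

rng = np.random.default_rng(0)
z = rng.standard_normal((4, 512))   # a batch of Gaussian latents
z_clamped = truncate_clamp(z)
z_interp = truncate_interpolate(z)
```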
When truncation packs a large number of generated images into a small region of the embedding space, Sajjadi et al.'s cluster-based method can underestimate precision and overestimate recall: clusters may end up containing no real images, causing the metric to incorrectly report low precision. The method of Sajjadi et al. also reports both precision and recall increasing as truncation is removed, contrary to the expected behavior.
Figure 4 illustrates the effects of truncation on precision and recall using a single StyleGAN generator. With our method, precision increases and recall drops toward the left side of the plot as more truncation is applied; the method of Sajjadi et al., in contrast, incorrectly reports lower precision when truncation is applied, and the numerical values of both precision and recall seem excessively high when truncation is not applied.
Setup A is heavily truncated, leading to high precision and low recall. Moving to setup B increases variation, improving recall but compromising image quality. Setup C is FID-optimized, with higher variation but some image artifacts. Setup D preserves variation and recall, but nearly all images have low quality. The ideal tradeoff between quality and variation depends on the intended application; our metric makes this tradeoff explicitly visible and allows quantifying the suitability of a given model for a particular application.

We compare our precision and recall metric with Sajjadi et al.'s method using four StyleGAN setups with varying truncation and training time. Figure 3 shows our precision and recall metric (black dots) alongside Sajjadi et al.'s (red triangles). We use a small neighborhood size (k = 3) to avoid saturating the estimates, as our tests show that higher values of the neighborhood size consistently increase the precision and recall estimates, and we take 50k samples, as is standard practice for FID. We compute a feature vector for each image by feeding it to a pre-trained VGG-16 classifier, and use Brock et al.'s feature space as it works better for our metric.

Our improved precision and recall metric for assessing generative models does not suffer from the weaknesses of existing metrics. We draw an equal number of samples from the real and generated distributions and embed them into a high-dimensional feature space using a pre-trained classifier network. We then form explicit non-parametric representations of the manifolds of real and generated data, from which precision and recall can be estimated.
The key idea is to calculate the pairwise Euclidean distances between all feature vectors in the set and, for each feature vector, form a hypersphere with radius equal to the distance to its k-th nearest neighbor. The union of these hyperspheres defines a volume in the feature space that serves as an estimate of the true manifold. To determine precision, we query for each generated image whether it is within the estimated manifold of real images. For recall, we query for each real image whether it is within the estimated manifold of generated images.
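A minimal sketch of this procedure, assuming plain Euclidean distances in the embedded feature space, might look as follows; it illustrates the technique and is not the authors' official implementation.

```python
# Hedged sketch of precision/recall from k-NN manifold estimates.
import numpy as np
from scipy.spatial.distance import cdist

def knn_radii(feats, k=3):
    """Hypersphere radius per feature vector: distance to its
    k-th nearest neighbor (column 0 is the zero self-distance)."""
    d = cdist(feats, feats)
    d.sort(axis=1)
    return d[:, k]

def coverage(query, support, support_radii):
    """Fraction of query vectors inside at least one hypersphere
    of the support set's estimated manifold."""
    d = cdist(query, support)
    return float(np.mean(np.any(d <= support_radii[None, :], axis=1)))

def precision_recall(real_feats, gen_feats, k=3):
    # Precision: generated samples inside the real manifold estimate.
    precision = coverage(gen_feats, real_feats, knn_radii(real_feats, k))
    # Recall: real samples inside the generated manifold estimate.
    recall = coverage(real_feats, gen_feats, knn_radii(gen_feats, k))
    return precision, recall
```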
We also measure the Wasserstein-2 distance between multivariate Gaussian distributions fitted to the feature vectors to quantify the difference between the densities of real and generated images. This is a proper distance metric between the distributions and is able to capture aspects such as age, gender, pose, and ethnicity that are not directly captured by the precision and recall metrics.

Sajjadi et al. proposed a metric for evaluating generative models that estimates precision and recall from the relative populations of training images and generated images in clusters. This method has shortcomings: it cannot correctly interpret situations where a large number of generated samples are packed together. Our metric has been used to analyze variants of StyleGAN (Karras et al.) and BigGAN (Brock et al.), to understand design decisions, and to identify new variants. We present this improved precision and recall metric, show that it is more effective, and provide source code. The metric also allows us to analyze truncation methods and estimate the quality of individual generated samples.

Generative models attempt to capture the essence of a data manifold in a model that can generate novel samples. GANs, VAEs, autoregressive models, and likelihood-based models are some of the most widely used generative models. While metrics have been proposed to evaluate the quality of generated samples, they have not seen widespread use due to subjectivity or lack of coverage of the underlying manifold.
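The Wasserstein-2 distance between two multivariate Gaussians has the closed form W2^2 = ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^(1/2)), the same form that underlies FID. A minimal sketch, assuming Gaussians fitted to the two sets of feature vectors:

```python
# Minimal sketch: closed-form Wasserstein-2 distance between two
# multivariate Gaussians fitted to real and generated features.
import numpy as np
from scipy.linalg import sqrtm

def wasserstein2_gaussian(feats_a, feats_b):
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):
        covmean = covmean.real   # drop tiny numerical imaginary parts
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))
```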
Recently, Sajjadi et al. proposed a novel metric that expresses the quality of generated samples using two separate components: precision and recall. These components correspond to the average sample quality and the coverage of the sample distribution, respectively. We extend this metric to estimate the perceptual quality of individual samples and perform an analysis of truncation methods. We demonstrate the effectiveness of our metric on StyleGAN and BigGAN with illustrative examples, and we analyze multiple design variants of StyleGAN to identify new variants that improve the state of the art. This paper presents an evaluation metric for generative models such as GANs that measures both the overall quality and coverage of samples in image generation tasks; the metric forms explicit non-parametric representations of the manifolds of real and generated data. The work was done during an internship at NVIDIA.