Summary: Improved Precision and Recall Metric for Assessing Generative Models
One Line
We introduce a new metric, Precision-Recall, which is more reliable than existing methods for evaluating image generation tasks. We demonstrate it on StyleGAN and on BigGAN trained on ImageNet, complement it with the Wasserstein-2 distance, and show that a high-quality estimate takes 8 minutes to compute on a single NVIDIA Tesla V100 GPU.
Key Points
- We present a metric for evaluating generative models that allows us to study the effects of the truncation trick.
- We compare our metric to Sajjadi et al.'s method on four StyleGAN setups and find that heavily truncated setups have high precision and low recall, while FID-optimized setups have higher variation but some image artifacts.
- We test BigGAN on ImageNet and find that difficult classes have lower precision than easy ones.
- Our metric correlates well with perceived image quality and variation, and suggests that truncation may not be necessary.
- We offer an open-source implementation; a high-quality estimate using 50k images takes 8 minutes on a single NVIDIA Tesla V100 GPU.
Summaries
122 word summary
We present a new metric for assessing generative models, Precision-Recall, which measures the overall quality and coverage of samples in image generation tasks. It embeds sets of real and generated images into a feature space and evaluates precision and recall on the resulting feature vectors. This metric is more effective than existing methods and can be used to analyze truncation methods and estimate the quality of individual generated samples. We also tested BigGAN on ImageNet and found that difficult classes had lower precision than easy ones. The Wasserstein-2 distance is used to quantify the difference between the densities of real and generated images. Our open-source implementation takes 8 minutes to run on a single NVIDIA Tesla V100 GPU for 50k images.
422 word summary
This paper presents an improved precision and recall metric for evaluating generative models such as GANs. The metric measures both the overall quality and coverage of samples in image generation tasks. It builds explicit non-parametric representations of the manifolds of real and generated data, from which precision and recall can be estimated. The metric is more effective than existing metrics and can be used to analyze truncation methods and estimate the quality of individual generated samples. Additionally, the Wasserstein-2 distance is used to quantify the difference between the densities of real and generated images. We compared our metric to Sajjadi et al.'s method using four StyleGAN setups with varying truncation and training time; in our tests, higher values of the neighborhood size consistently increase the precision and recall estimates. We found that heavily truncated setups have high precision and low recall, while FID-optimized setups have higher variation but some image artifacts.
We also tested BigGAN on ImageNet and found that difficult classes had lower precision than easy ones. Our metric sheds light on design decisions and truncation methods, and can be used to select the best configuration for any given tradeoff between precision and recall. Studying the effects of the truncation trick, we find that clamping outliers is superior to rejection by distance, interpolation is competitive, and random replacement is less desirable. Our metric correlates well with perceived image quality and variation, and suggests that truncation may not be necessary.

Related generative modeling work includes deep features as a perceptual metric, self-attention GANs, PixelCNN decoders, very deep convolutional networks, cGANs with a projection discriminator, spectral normalization, unrolled GANs, Wasserstein auto-encoders, megapixel-size image creation, conditional image synthesis, style-based generator architectures, semi-supervised learning with deep generative models, Glow with invertible 1x1 convolutions, Progressive Growing of GANs, and a two time-scale update rule for GANs; Flow-GAN combines maximum likelihood and adversarial learning.

Our new metric, Precision-Recall, takes two sets of images as input, embeds them in a feature space as feature vectors, and evaluates precision and recall for the given sets of real and generated images. Figures 10 and 11 show examples of high- and low-quality interpolations and BigGAN-generated images, respectively, according to our realism scores. Our implementation is open source; a high-quality estimate using 50k images takes 8 minutes on a single NVIDIA Tesla V100 GPU.
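To make the embedding step concrete, here is a minimal sketch that maps images to feature vectors with a pre-trained VGG-16 classifier, the network mentioned later in this summary. The choice of layer (the second fully connected layer) and the preprocessing are our assumptions for illustration, not necessarily the authors' exact pipeline.

```python
# Minimal sketch: embed images as feature vectors with a pre-trained
# VGG-16. Layer choice and preprocessing are illustrative assumptions.
import torch
import torchvision.models as models
import torchvision.transforms as T

vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).eval()

# Keep everything up to (and including) the second fully connected
# layer and its ReLU; this yields 4096-dimensional feature vectors.
feature_extractor = torch.nn.Sequential(
    vgg.features,
    vgg.avgpool,
    torch.nn.Flatten(),
    *list(vgg.classifier.children())[:5],
)

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed(pil_images):
    """Map a list of PIL images to an (N, 4096) tensor of features."""
    batch = torch.stack([preprocess(img) for img in pil_images])
    return feature_extractor(batch)
```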
1706 word summary
We present a new metric for assessing generative models, Precision-Recall. It takes two sets of images as input, embeds them into a feature space, and evaluates precision and recall for the given sets of real and generated images. It forms an estimate of the manifold of feature vectors by tabulating, for each vector, the distance to its k-th nearest neighbor. We run our implementation on an NVIDIA Tesla V100 GPU; a high-quality estimate using 50k images takes 8 minutes on a single GPU.
Figures 10 and 11 show examples of high- and low-quality interpolations and BigGAN-generated images, respectively, according to our realism scores. Images with a high realism score contain a clear object from the given class, whereas low-scoring images generally lack such an object or show it distorted in various ways.
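One concrete reading of such a score: take the maximum, over all real feature vectors, of the ratio between each real vector's k-NN hypersphere radius and its distance to the generated sample, so that values of 1 or more mean the sample lies inside the estimated real manifold. The sketch below implements that reading; the exact formula and the name realism_score are our assumptions, not the reference implementation.

```python
# Hedged sketch of a per-sample realism score: max over real feature
# vectors of (k-NN radius of the real vector) / (distance from the
# generated vector to the real vector). Values >= 1 indicate the
# sample falls inside the estimated manifold of real images.
import numpy as np
from scipy.spatial.distance import cdist

def knn_radii(feats, k=3):
    """Distance from each feature vector to its k-th nearest neighbor."""
    d = cdist(feats, feats)   # pairwise Euclidean distances
    d.sort(axis=1)            # column 0 is the zero self-distance
    return d[:, k]

def realism_score(gen_feat, real_feats, real_radii):
    """Realism of one generated feature vector against the real set."""
    dists = np.linalg.norm(real_feats - gen_feat, axis=1)
    return float(np.max(real_radii / np.maximum(dists, 1e-12)))
```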
Our implementation is open source. Deep features have been shown to be effective in perceptual tasks. Self-attention generative adversarial networks (GANs) and PixelCNN decoders have been used for conditional image generation. Very deep convolutional networks and improved techniques for training GANs are two other relevant methods, and assessing generative models through precision and recall has been proposed before. Other related methods include cGANs with a projection discriminator, spectral normalization, unrolled GANs, and Wasserstein auto-encoders. Megapixel-size image creation, conditional image synthesis with auxiliary classifier GANs, and style-based generator architectures have been proposed, as have semi-supervised learning with deep generative models, Glow with invertible 1x1 convolutions, and Progressive Growing of GANs. Finally, a two time-scale update rule for GANs converges to a local Nash equilibrium, Flow-GAN combines maximum likelihood and adversarial learning, and the distance from a point to an ellipse, an ellipsoid, or a hyperellipsoid has been studied.

We present a metric to measure the quality of generative models that correlates well with perceived image quality and variation. We demonstrate this metric with experiments on StyleGAN using linear interpolation in the intermediate latent space.
We found that only 2.4% of interpolation paths crossed unrealistic parts of the real manifold, suggesting that truncation may not be necessary. We also identified training-configuration-related effects with our metric, which may allow for more refined exclusion of unrealistic images. Our metric, in addition to previous techniques for estimating distribution quality, will prove valuable for further improving generative models.

The realism score is a metric for evaluating the quality of interpolations and individual samples of generative models. It is calculated as a continuous extension of the binary classification function: the score increases when a sample is closer to the manifold of real images and decreases when it is further away. Visual inspection shows that generated images with high realism display a clear object from the given class, while low-realism images are often distorted. Our findings also indicate that recall alone is not enough to judge the quality of a distribution, and FID should be used in addition.

Many generative methods employ a truncation trick to allow trading variation for quality after training, but quantitative evaluation of these tricks has proven difficult. Our metric allows us to study these effects in a principled way. We evaluate four primary strategies (A-D), illustrated in Figure 7, plus three secondary strategies (E-G). Strategy B, approximating the distribution of latent vectors with a multivariate Gaussian and rejecting those with low probability density, is superior to rejection by distance (A). Clamping outliers (C) is better than rejection because it provides better coverage around the extremes of the distribution, and appears to be the best overall tradeoff. Interpolation (D) is competitive with clamping, as it affects all latent vectors equally and increases the average density. Random replacement (G), which removes a random subset of latent vectors and reinserts them at the highest-density point, is less desirable, and strategies E and F yield an inferior tradeoff, confirming that sampling density is not a good predictor of image quality. A code sketch of the clamping and interpolation strategies appears below.

We also use the metric to analyze different training configurations of StyleGAN on the FFHQ dataset. We find that configurations with high recall (A, F) outperform those with high precision (B, C) in terms of FID, and that random translation of inputs (E) improves precision at the cost of recall while slightly improving FID. Our metric sheds light on design decisions and truncation methods, and can be used to select the best configuration for any given tradeoff between precision and recall.

Generative models have seen rapid improvements, and FID has become the de facto standard for evaluating them; however, precision and recall can provide a more detailed picture. We used these metrics to analyze StyleGAN, and we also tested BigGAN on ImageNet, where difficult classes had lower precision than easy ones. Easy classes such as cats and dogs had higher recall, likely due to their larger amount of training data, while difficult classes such as Lemon and Broccoli had lower recall, suggesting that much of the variation had been missed. Precision was invariably high for easy classes and lower for difficult ones. We observed that quality increased with more truncation, in line with Brock et al.'s findings.
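To make the truncation strategies concrete, here is a minimal sketch of the clamping (C) and interpolation (D) strategies described above; the function names, the clamping limit, and the interpolation factor psi are illustrative assumptions, not the paper's exact procedure.

```python
# Hedged sketch of two latent-space truncation strategies:
# clamping outliers (C) and interpolating toward the mean (D).
import numpy as np

def truncate_clamp(z, limit=2.0):
    """Strategy C: clamp each latent component to [-limit, limit];
    only outliers are affected."""
    return np.clip(z, -limit, limit)

def truncate_interpolate(z, psi=0.7, mean=None):
    """Strategy D: move every latent toward the mean by factor psi;
    all latent vectors are affected equally."""
    if mean is None:
        mean = np.zeros_like(z)
    return mean + psi * (z - mean)

rng = np.random.default_rng(0)
z = rng.standard_normal((4, 512))   # a batch of Gaussian latents
z_clamped = truncate_clamp(z)
z_interp = truncate_interpolate(z)
```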
When truncation packs a large number of generated images into a small region of the embedding space, Sajjadi et al.'s cluster-based method can underestimate precision and overestimate recall: clusters may end up containing no real images, causing the metric to incorrectly report low precision. The method of Sajjadi et al. also reports both precision and recall increasing as truncation is removed, contrary to the expected behavior.
Figure 4 illustrates the effects of truncation on precision and recall using a single StyleGAN generator. With our method, precision increases and recall drops toward the left side of the plot as more truncation is applied; the method of Sajjadi et al., in contrast, incorrectly reports lower precision when truncation is applied, and the numerical values of both precision and recall seem excessively high when truncation is not applied.
Setup A is heavily truncated, leading to high precision and low recall. Moving to setup B increases variation, improving recall but compromising image quality. Setup C is FID-optimized, with higher variation but some image artifacts. Setup D preserves variation and recall, but nearly all images have low quality. The ideal tradeoff between quality and variation depends on the intended application; our metric makes this tradeoff explicitly visible and allows quantifying the suitability of a given model for a particular application.

We compare our precision and recall metric with Sajjadi et al.'s method using four StyleGAN setups with varying truncation and training time. Figure 3 shows our precision and recall metric (black dots) alongside Sajjadi et al.'s (red triangles). We use a small neighborhood size (k = 3) to avoid saturating the estimates, as our tests show that higher values of the neighborhood size consistently increase the precision and recall estimates, and we take 50k samples, as is standard practice for FID. We compute a feature vector for each image by feeding it to a pre-trained VGG-16 classifier, and use Brock et al.'s feature space as it works better for our metric.

Our improved precision and recall metric for assessing generative models does not suffer from the weaknesses of existing metrics. We draw an equal number of samples from the real and generated distributions and embed them into a high-dimensional feature space using a pre-trained classifier network. We then form explicit non-parametric representations of the manifolds of real and generated data, from which precision and recall can be estimated.
The key idea is to calculate the pairwise Euclidean distances between all feature vectors in the set and, for each feature vector, form a hypersphere with radius equal to the distance to its k-th nearest neighbor. The union of these hyperspheres defines a volume in the feature space that serves as an estimate of the true manifold. To determine precision, we query for each generated image whether it is within the estimated manifold of real images. For recall, we query for each real image whether it is within the estimated manifold of generated images.
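A minimal sketch of this procedure, assuming plain Euclidean distances in the embedded feature space, might look as follows; it illustrates the technique and is not the authors' official implementation.

```python
# Hedged sketch of precision/recall from k-NN manifold estimates.
import numpy as np
from scipy.spatial.distance import cdist

def knn_radii(feats, k=3):
    """Hypersphere radius per feature vector: distance to its
    k-th nearest neighbor (column 0 is the zero self-distance)."""
    d = cdist(feats, feats)
    d.sort(axis=1)
    return d[:, k]

def coverage(query, support, support_radii):
    """Fraction of query vectors inside at least one hypersphere
    of the support set's estimated manifold."""
    d = cdist(query, support)
    return float(np.mean(np.any(d <= support_radii[None, :], axis=1)))

def precision_recall(real_feats, gen_feats, k=3):
    # Precision: generated samples inside the real manifold estimate.
    precision = coverage(gen_feats, real_feats, knn_radii(real_feats, k))
    # Recall: real samples inside the generated manifold estimate.
    recall = coverage(real_feats, gen_feats, knn_radii(gen_feats, k))
    return precision, recall
```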
We also measure the Wasserstein-2 distance between multivariate Gaussian distributions fitted to the feature vectors to quantify the difference between the densities of real and generated images. This is a proper distance metric between the distributions and is able to capture aspects such as age, gender, pose, and ethnicity that are not directly captured by the precision and recall metrics.

Sajjadi et al. proposed a metric for evaluating generative models that estimates precision and recall from the relative populations of training images and generated images in clusters. This method has shortcomings: it cannot correctly interpret situations where a large number of generated samples are packed together. Our metric has been used to analyze variants of StyleGAN (Karras et al.) and BigGAN (Brock et al.), to understand design decisions, and to identify new variants. We present this improved precision and recall metric, show that it is more effective, and provide source code. The metric also allows us to analyze truncation methods and estimate the quality of individual generated samples.

Generative models attempt to capture the essence of a data manifold in a model that can generate novel samples. GANs, VAEs, autoregressive models, and likelihood-based models are some of the most widely used generative models. While metrics have been proposed to evaluate the quality of generated samples, they have not seen widespread use due to subjectivity or lack of coverage of the underlying manifold.
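The Wasserstein-2 distance between two multivariate Gaussians has the closed form W2^2 = ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^(1/2)), the same form that underlies FID. A minimal sketch, assuming Gaussians fitted to the two sets of feature vectors:

```python
# Minimal sketch: closed-form Wasserstein-2 distance between two
# multivariate Gaussians fitted to real and generated features.
import numpy as np
from scipy.linalg import sqrtm

def wasserstein2_gaussian(feats_a, feats_b):
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):
        covmean = covmean.real   # drop tiny numerical imaginary parts
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))
```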
Recently, Sajjadi et al. proposed a novel metric that expresses the quality of generated samples using two separate components: precision and recall. These components correspond to the average sample quality and the coverage of the sample distribution, respectively. We extend this metric to estimate the perceptual quality of individual samples and perform an analysis of truncation methods. We demonstrate the effectiveness of our metric on StyleGAN and BigGAN with illustrative examples, and we analyze multiple design variants of StyleGAN to identify new variants that improve the state of the art. This paper presents an evaluation metric for generative models such as GANs that measures both the overall quality and coverage of samples in image generation tasks; the metric forms explicit non-parametric representations of the manifolds of real and generated data. The work was done during an internship at NVIDIA.