Abstract—The primary objective of this paper is to provide a comparative analysis of several generative adversarial networks (GANs). We study SinGAN, the Conditional Generative Adversarial Network (CGAN), the Star Generative Adversarial Network (StarGAN), and the Cycle Generative Adversarial Network (CycleGAN). We present results generated with these GANs and compare them on four metrics: RMSE, UQI, MS-SSIM, and VIF. We also compare the generated images using the FCN-8s architecture. The motivation is to provide a single reference covering these GAN variants, which is currently lacking in the available literature. The work is intended to help readers grasp the concepts behind the different GAN architectures and their loss functions, which are explained in detail.
Index Terms—Generative Adversarial Networks (GANs), SinGAN, CGAN, StarGAN, CycleGAN, FCN results.
I. INTRODUCTION
Generative Adversarial Networks (GANs) have been studied extensively and applied to a wide range of problems, and their versatility has made them popular. Earlier studies mostly describe specific applications of GANs. In this paper we present a comparative examination of recent GAN variants. We have surveyed the literature and compared the performance of different GAN architectures on several metrics. Existing work such as [1] provides a comparative study of different GANs but lacks a technical comparison. In this paper we provide the details of SinGAN, CGAN, StarGAN, and CycleGAN together with a quantitative comparison on several metrics. The available literature and surveys lack such a comparison, and bridging this gap is the contribution of this paper.
The paper is arranged as follows. Section II presents the fundamental details of the GANs and contains the bulk of our study. For each GAN we give a brief introduction, the loss function, the architecture, the advantages, and the images generated by that GAN. Section III summarizes the comparative analysis and reports the metric values for SinGAN, CGAN, StarGAN, and CycleGAN in tabular form, which helps in examining the performance of a particular GAN. Section IV concludes the paper with closing remarks on the comparative analysis.
II. STUDY OF GENERATIVE ADVERSARIAL NETWORKS
In this section we provide our detailed study of the different GANs. For each GAN we first give brief background information and then the technical details: the loss function, the architecture, and the resulting images. We have used pre-trained models of the GANs available online to produce and compare the resulting images.
A. SinGAN
1) A brief introduction of SinGAN: SinGAN [2] consists of a pyramid of fully convolutional GANs, each responsible for learning the patch distribution of the image at a different scale. It is an unconditional generative model. SinGAN can be used for a variety of tasks such as image super-resolution, paint-to-image, harmonization, editing, and single-image animation.
2) Loss functions: The loss function of the n-th GAN in SinGAN is given by:
$$\min_{G_n} \max_{D_n} \mathcal{L}_{adv}(G_n, D_n) + \alpha \mathcal{L}_{rec}(G_n) \quad (1)$$
The two terms in the loss function correspond to two different losses:
1. Adversarial loss: This loss penalizes the model for the distance between the distribution of patches in $x_n$ (the real image) and the distribution of patches in the generated samples $\tilde{x}_n$. The WGAN-GP [3] loss is used since the authors of the original paper found it to increase training stability.
2. Reconstruction loss: This term ensures that, when the input contains no noise, the generator is able to reconstruct the original image. The authors of the original paper specifically chose $\{z_N^{rec}, z_{N-1}^{rec}, \ldots, z_0^{rec}\} = \{z^*, 0, \ldots, 0\}$, where $z^*$ is some fixed noise map (drawn once and kept fixed during training). The image generated at the n-th scale when these noise maps are used is denoted by $\tilde{x}_n^{rec}$. For $n < N$,
$$\mathcal{L}_{rec} = \| G_n(0, (\tilde{x}_{n+1}^{rec})\uparrow^r) - x_n \|^2 \quad (2)$$
and for $n = N$,
$$\mathcal{L}_{rec} = \| G_N(z^*) - x_N \|^2 \quad (3)$$
Generative Adversarial Networks: A Comparative Analysis
3) Architecture: As shown in Fig. 1, the model consists of a pyramid of generators $\{G_0, \ldots, G_N\}$, trained against an image pyramid of $x$: $\{x_0, \ldots, x_N\}$, where $x_n$ is a downsampled version of $x$ by a factor of $r^n$, for some $r > 1$. Both training and inference start at the coarsest scale and proceed in a coarse-to-fine fashion. The input to each $G_n$ is a random noise image $z_n$ and the image generated at the previous scale, $\tilde{x}_{n+1}$, upsampled to the current resolution (except at the coarsest level, which is purely generative). Therefore, for $n = N$,
$$\tilde{x}_N = G_N(z_N) \quad (4)$$
and for $n < N$,
$$\tilde{x}_n = G_n(z_n, (\tilde{x}_{n+1})\uparrow^r) \quad (5)$$
Fig. 1. SinGAN's architecture
Fig. 2. Generator's architecture
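As an illustration of the coarse-to-fine sampling in (4) and (5), the following is a minimal Python sketch, not the SinGAN implementation. The helper names (pyramid_sizes, upsample, toy_generator, sample_coarse_to_fine), the use of 1-D signals instead of images, and nearest-neighbour upsampling are all assumptions made to keep the example short. In particular, toy_generator is a placeholder that stands in for a trained $G_n$.

```python
import random


def pyramid_sizes(full_size, r, num_scales):
    """Length of x_n for n = 0..N, where x_n is x downsampled by r**n."""
    return [max(1, round(full_size / r ** n)) for n in range(num_scales)]


def upsample(signal, new_len):
    """Nearest-neighbour resize of a 1-D signal to new_len samples."""
    old_len = len(signal)
    return [signal[min(old_len - 1, i * old_len // new_len)] for i in range(new_len)]


def toy_generator(z, prev_up=None):
    """Placeholder for G_n (assumption): not a trained network.

    At the coarsest scale the output depends on noise only, as in (4).
    At finer scales it combines noise with the upsampled coarser output, as in (5).
    """
    if prev_up is None:
        return list(z)
    return [p + 0.1 * n for p, n in zip(prev_up, z)]


def sample_coarse_to_fine(full_size=64, r=2.0, num_scales=4, seed=0):
    rng = random.Random(seed)
    sizes = pyramid_sizes(full_size, r, num_scales)  # sizes[0] finest, sizes[-1] coarsest
    x_tilde = None
    for n in range(num_scales - 1, -1, -1):  # n = N, N-1, ..., 0
        z_n = [rng.gauss(0.0, 1.0) for _ in range(sizes[n])]
        if x_tilde is None:
            x_tilde = toy_generator(z_n)  # eq. (4)
        else:
            x_tilde = toy_generator(z_n, upsample(x_tilde, sizes[n]))  # eq. (5)
    return sizes, x_tilde
```

With the default arguments the pyramid has four scales of lengths 64, 32, 16, and 8, and the final sample has the full length of 64.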
Each generator contains convolutional layers that generate the details missing in $(\tilde{x}_{n+1})\uparrow^r$ (residual learning [4,5]). The operation performed by $G_n$, shown in Fig. 2, is
$$\tilde{x}_n = (\tilde{x}_{n+1})\uparrow^r + \psi_n(z_n + (\tilde{x}_{n+1})\uparrow^r) \quad (6)$$
where $\psi_n$ is a fully convolutional network with 5 convolutional blocks of the form Conv(3×3)-BatchNorm-LeakyReLU [6].
4) Advantages: The advantages of SinGAN are listed below:
1. Images of arbitrary size and aspect ratio can be generated at test time since the generators are fully convolutional.
2. A large number of image manipulation tasks can be performed using this model.
3. Only a single training image is required.
5) Results: The model was trained using the code available in the official GitHub repository of [2]. The training images were Starry Night by Van Gogh and an image of the Colosseum, each trained with 8 scales (N=7).
If GAN-generated images are realistic, then classifiers trained on real images should also classify the generated images correctly. Following this intuition, we evaluate the GANs with the help of the FCN-8s architecture [7].
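To make this intuition concrete, the sketch below compares two segmentation label maps, for example the FCN-8s output on a generated image and on the corresponding real image. The text does not state which scores are computed, so per-pixel accuracy and mean intersection-over-union are assumptions here, and the function names are hypothetical.

```python
def pixel_accuracy(pred, target):
    """Fraction of pixels whose predicted class equals the target class."""
    total = 0
    correct = 0
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            total += 1
            correct += int(p == t)
    return correct / total


def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over the classes present in either map."""
    ious = []
    for c in range(num_classes):
        inter = 0
        union = 0
        for row_p, row_t in zip(pred, target):
            for p, t in zip(row_p, row_t):
                in_p, in_t = (p == c), (t == c)
                inter += int(in_p and in_t)
                union += int(in_p or in_t)
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious)


# Toy 2x2 label maps (assumed data, for illustration only).
labels_real = [[0, 0], [1, 1]]
labels_fake = [[0, 1], [1, 1]]
accuracy = pixel_accuracy(labels_fake, labels_real)
miou = mean_iou(labels_fake, labels_real, num_classes=2)
```

Higher values of both scores mean that the classifier treats the generated image more like the real one.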
Fig. 3. (a) Starry Night, (b) Colosseum
The generated images were evaluated using the FCN metric, with code taken from [8]. Results are shown for n=4,5 for the Starry Night image and n=6,7 for the Colosseum image.
Fig. 4. (a) Fake sample generated at n=4, (b) Real image at n=4, (c) Result of FCN (difference image), (d) Fake sample generated at n=5, (e) Real image at n=5, (f) Result of FCN (difference image)
Fig. 5. (a) Fake sample generated at n=6, (b) Real image at n=6, (c) Result of FCN (difference image), (d) Fake sample generated at n=7, (e) Real image at n=7, (f) Result of FCN (difference image)
B. CGAN
1) A brief introduction of CGAN: In [9,10] the authors propose conditional adversarial networks as a general-purpose solution to image-to-image translation. In [9] they demonstrate that the approach is effective at reconstructing objects and synthesizing photos from edge maps. Conditional adversarial networks perform these tasks well because they learn not only the mapping from input to output but also a loss function to train this mapping. Many problems in image processing, computer graphics, and computer vision can be posed as image-to-image translation tasks, and the authors provide a common framework for them. Convolutional neural networks (CNNs) are widely used for image prediction problems. A CNN learns to minimize a loss function, an objective that scores the quality of the result. Although the learning process is automatic, designing an effective loss function still requires considerable effort and time.
2) Objective function: The objective of a conditional adversarial network can be written as
$$\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,y}[\log D(x, y)] + \mathbb{E}_{x,z}[\log(1 - D(x, G(x, z)))] \quad (7)$$
where $G$ tries to minimize this objective against an adversarial $D$ that tries to maximize it, i.e., $G^* = \arg\min_G \max_D \mathcal{L}_{cGAN}(G, D)$.
It is beneficial to mix the GAN objective with a traditional loss such as the L2 distance [11], so that the generator must not only fool the discriminator but also produce output close to the ground truth. The L1 distance encourages less blurring than L2 and is therefore used in the final objective, which is given by
$$G^* = \arg\min_G \max_D \mathcal{L}_{cGAN}(G, D) + \lambda \mathcal{L}_{L1}(G) \quad (8)$$
where $\mathcal{L}_{L1}(G) = \mathbb{E}_{x,y,z}[\| y - G(x, z) \|_1]$.
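The generator side of (8) can be evaluated numerically as in the following minimal sketch. It assumes the discriminator outputs D(x, G(x, z)) are already available as probabilities and that images are flattened into lists. The function names and the default weight lam=100.0 are assumptions for illustration; the text above does not fix a value for λ.

```python
import math


def l1_distance(y, g):
    """L1 norm of the difference between a target y and an output g."""
    return sum(abs(a - b) for a, b in zip(y, g))


def generator_objective(d_fake, targets, outputs, lam=100.0):
    """Terms of (8) that depend on G, averaged over a batch.

    d_fake  : discriminator probabilities D(x, G(x, z)), each in (0, 1)
    targets : ground-truth outputs y (flattened)
    outputs : generator outputs G(x, z) (flattened)
    lam     : weight of the L1 term (assumed value)
    """
    adversarial = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    l1_term = sum(l1_distance(y, g) for y, g in zip(targets, outputs)) / len(targets)
    return adversarial + lam * l1_term


# One sample: D assigns 0.5 to the fake pair, and the output is off by 0.5 in one pixel.
value = generator_objective([0.5], [[1.0, 2.0]], [[1.5, 2.0]], lam=10.0)
```

The generator lowers this value both by pushing D(x, G(x, z)) toward 1 and by reducing the pixel-wise L1 error.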
3) Architecture: The architectures of the generator and discriminator are adapted from those given in [12].
Fig. 6. CGAN generator and discriminator architecture.
Fig. 7. The “U-net” architecture. An encoder-decoder with skip connections between mirrored layers in the encoder and decoder stacks.
In [9], the authors also consider the “U-net” architecture shown in Fig. 7, which provides skip connections between layer i and layer n−i, where n is the total number of layers.
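The pairing of mirrored layers described above can be enumerated with a few lines of Python. This is only an indexing sketch under the assumption that layers are numbered 1 to n−1 around a central bottleneck; the function name is hypothetical.

```python
def skip_connections(n):
    """Pairs (i, n - i) of mirrored encoder and decoder layers, with i < n - i."""
    return [(i, n - i) for i in range(1, n) if i < n - i]


pairs = skip_connections(8)  # layer 4 is the bottleneck and has no partner
```

For n = 8 this links layer 1 to layer 7, layer 2 to layer 6, and layer 3 to layer 5.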
4) Advantages: The advantages of the conditional GAN (CGAN) are as follows:
1. The CGAN is a promising approach for many image-to-image translation tasks.
2. It is suitable for highly structured graphical outputs.
3. The CGAN is applicable in a wide variety of settings because the network learns a loss adapted to the task and data at hand.
Fig. 8. (a) real image A, (b) real image B, (c) fake image generated from the trained CGAN.
For the quantitative performance evaluation of CGAN we have employed the FCN-8s architecture [7]. The results are shown in Fig. 9.
Fig. 9. (a), (d) real image A. (b), (e) fake image B. (c), (f) Resultant images obtained by the FCN-8s architecture showing the difference between real and fake images.
C. CycleGAN
1) A brief introduction of CycleGAN: Image-to-image translation is an active topic in computer vision and graphics. It involves learning a mapping from images in a source domain to images in a target domain. Ordinarily this requires a dataset of aligned image pairs from the source and target domains. Obtaining such a dataset is expensive, and for many domains a paired dataset cannot be obtained at all, e.g., dog-to-cat translation, horse-to-zebra translation, and artistic stylization. To tackle this problem, we need a method that learns to capture the special characteristics of the source domain and works out how these characteristics can be translated to the target domain. We need to train a mapping G: X→Y such that the output ŷ = G(x), x∈X, is indistinguishable from images y∈Y by an adversary trained to classify ŷ apart from y. However, optimizing
samples $\{x_i\}_{i=1}^{N}$ where $x_i \in X$ and $\{y_j\}_{j=1}^{M}$ where $y_j \in Y$. Our model therefore includes two mappings, $G: X \to Y$ and $F: Y \to X$. In addition, we introduce two adversarial discriminators $D_X$ and $D_Y$, where $D_X$ aims to distinguish between images $\{x\}$ and translated images $\{F(y)\}$, and $D_Y$ likewise aims to distinguish between $\{y\}$ and $\{G(x)\}$.
Fig. 15. Our model contains two mapping functions G and F and their respective discriminators, D_X and D_Y.
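3) Loss Functions:
1. Adversarial Loss [15]:
For the mapping function $G: X \to Y$ and its discriminator $D_Y$,
$$\mathcal{L}_{GAN}(G, D_Y, X, Y) = \mathbb{E}_{y \sim p_{data}(y)}[\log D_Y(y)] + \mathbb{E}_{x \sim p_{data}(x)}[\log(1 - D_Y(G(x)))] \quad (15)$$
For the mapping function $F: Y \to X$ and its discriminator $D_X$,
$$\mathcal{L}_{GAN}(F, D_X, Y, X) = \mathbb{E}_{x \sim p_{data}(x)}[\log D_X(x)] + \mathbb{E}_{y \sim p_{data}(y)}[\log(1 - D_X(F(y)))] \quad (16)$$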
2. Cycle Consistency Loss:
Adversarial training can, in theory, learn mappings G and F that produce outputs distributed identically to the target domains Y and X, respectively. However, with large enough capacity, a network can map the same set of input images to any random permutation of images in the target domain, so the adversarial loss alone admits a very large number of possible mappings between the source and target domains. To restrict this space of mappings we define the cycle consistency loss:
$$\mathcal{L}_{cyc}(G, F) = \mathbb{E}_{x \sim p_{data}(x)}[\| F(G(x)) - x \|_1] + \mathbb{E}_{y \sim p_{data}(y)}[\| G(F(y)) - y \|_1] \quad (17)$$
Therefore, the complete objective is
$$G^*, F^* = \arg\min_{G,F} \max_{D_X, D_Y} \mathcal{L}(G, F, D_X, D_Y)$$
where
$$\mathcal{L}(G, F, D_X, D_Y) = \mathcal{L}_{GAN}(G, D_Y, X, Y) + \mathcal{L}_{GAN}(F, D_X, Y, X) + \mathcal{L}_{cyc}(G, F) \quad (18)$$
Fig. 16. (a) Forward cycle-consistency loss, (b) Backward cycle-consistency loss
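The cycle consistency loss can be illustrated with toy mappings in place of trained networks. In the sketch below, G doubles every pixel and F halves it, so F inverts G exactly and the loss is zero. A second, deliberately imperfect mapping F_bad gives a positive loss. All mappings, names, and data are assumptions for illustration, and the L1 norm is used for the distance.

```python
def l1(a, b):
    """L1 distance between two flattened images."""
    return sum(abs(p - q) for p, q in zip(a, b))


def cycle_loss(G, F, xs, ys):
    """Forward cycle x -> G(x) -> F(G(x)) plus backward cycle y -> F(y) -> G(F(y))."""
    forward = sum(l1(F(G(x)), x) for x in xs) / len(xs)
    backward = sum(l1(G(F(y)), y) for y in ys) / len(ys)
    return forward + backward


def G(x):  # toy mapping X -> Y (assumption)
    return [2.0 * v for v in x]


def F(y):  # exact inverse of G
    return [v / 2.0 for v in y]


def F_bad(y):  # imperfect inverse of G
    return [v / 2.0 + 0.5 for v in y]


xs = [[1.0, 3.0]]
ys = [[2.0, 4.0]]
loss_consistent = cycle_loss(G, F, xs, ys)
loss_inconsistent = cycle_loss(G, F_bad, xs, ys)
```

The loss is zero only when each mapping undoes the other, which is the constraint that the adversarial terms alone do not impose.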
4) Architecture: CycleGAN consists of two GANs, for a total of two generators and two discriminators. Given images from two different domains, one generator transforms source images into target images and the other transforms target images into source images. During training, the discriminators judge whether the images produced by the generators are real or fake, and the generators improve using the feedback from their respective discriminators. The generator architecture is adopted from [11]. Each generator takes a 256x256 input image, downsamples it, and then upsamples it back to 256x256 to create the generated image. The discriminator architecture is adopted from [12,13,14]. Each discriminator takes a 256x256 input image and outputs a tensor of size 30x30. Each value of the output tensor holds the classification result for a 70x70 region of the input image.
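The 30x30 output size and the 70x70 region can be checked with standard convolution arithmetic. The layer configuration below (five 4x4 convolutions with padding 1 and strides 2, 2, 2, 1, 1) is an assumption about the discriminator, since the text does not list its layers. The function and constant names are hypothetical.

```python
# (kernel, stride, padding) per convolution; assumed configuration, not taken from the text.
PATCH_DISCRIMINATOR_LAYERS = [(4, 2, 1), (4, 2, 1), (4, 2, 1), (4, 1, 1), (4, 1, 1)]


def conv_output_size(size, kernel, stride, padding):
    """Spatial output size of one convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def receptive_field(layers):
    """Input region seen by one output value, computed from the last layer backwards."""
    field = 1
    for kernel, stride, _ in reversed(layers):
        field = (field - 1) * stride + kernel
    return field


def discriminator_summary(input_size=256, layers=PATCH_DISCRIMINATOR_LAYERS):
    size = input_size
    for kernel, stride, padding in layers:
        size = conv_output_size(size, kernel, stride, padding)
    return size, receptive_field(layers)


output_size, field = discriminator_summary()
```

Under this assumed configuration the spatial size goes 256, 128, 64, 32, 31, 30, and each output value depends on a 70x70 input region, matching the figures quoted above.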
5) Advantages: The advantages are listed below:
1. A paired dataset is not required for image-to-image translation. The price of annotation varies from project to project depending on complexity and volume, so avoiding paired data minimizes the cost of dataset creation and annotation.
2. It can learn image-to-image translation tasks with relatively small amounts of data.
6) Limitations and Future Work:
1. When the source and target domains differ drastically, the model degenerates to making minimal changes to the input.
2. In some scenarios there is still a gap between the performance of models trained on paired data and models trained on unpaired data.
3. Handling extreme transformations, especially geometric changes, remains an important task.
4. Integrating weakly or semi-supervised data into the dataset can lead to better results while avoiding most of the annotation cost.
7) Results: The pre-trained CycleGAN model presented in [12] has been used to produce the resulting images. The model is trained on unpaired images from the source and target domains, shown in Fig. 17(a) and Fig. 17(b). The CycleGAN learns the mappings between them, and a resulting image is shown in Fig. 17(c).
Fig. 17. (a) real image A, (b) real image B, (c) fake image generated from the trained CycleGAN.
For the quantitative performance evaluation of CycleGAN we have employed the FCN-8s architecture [7]. The results are shown in Fig. 18.
Fig. 18. (a), (d) Source image (horse), (b), (e) Images generated by the GAN, (c), (f) Images obtained from the FCN-8s architecture as result.
D. StarGAN
Fig. 10. Overview of StarGAN, consisting of two modules, a discriminator D and a generator G. (a) D learns to distinguish between real and fake images and to classify the real images into their corresponding domains. (b) G takes as input both the image and the target domain label and generates a fake image. The target domain label is spatially replicated and concatenated with the input image. (c) G tries to reconstruct the original image from the fake image given the original domain label. (d) G tries to generate images that are indistinguishable from real images and classifiable as the target domain by D.
2) Loss functions: There are three types of loss functions, and each is used to train the generator, the discriminator, or both.
1. Adversarial Loss: The regular GAN loss, used for generating images and for discriminating real images from generated ones. Here G = generator, D = discriminator, c = output domain label, x = input image.
$$\mathcal{L}_{adv} = \mathbb{E}_{x}[\log D_{src}(x)] + \mathbb{E}_{x,c}[\log(1 - D_{src}(G(x, c)))] \quad (9)$$
2. Domain Classification Loss: Along with the discriminator there is an auxiliary classifier on top of D. Real images are used to calculate the loss for optimizing the discriminator, and generated images are used to calculate the loss for optimizing the generator. Here x = real input image, c' = domain label of the real image x, c = label of the domain of the translated image.
$$\mathcal{L}_{cls}^{r} = \mathbb{E}_{x,c'}[-\log D_{cls}(c' \mid x)] \quad (10)$$
$$\mathcal{L}_{cls}^{f} = \mathbb{E}_{x,c}[-\log D_{cls}(c \mid G(x, c))] \quad (11)$$
3. Reconstruction Loss: For reconstruction, the translated image from the generator is fed back into the generator together with the original label of the input image. The L1 norm is used to calculate this loss. Here x = input image, c' = domain label of the image x, c = label of the domain to which the input image is translated.
$$\mathcal{L}_{rec} = \mathbb{E}_{x,c,c'}[\| x - G(G(x, c), c') \|_1] \quad (12)$$
Complete loss for the generator, following [18]:
$$\mathcal{L}_{G} = \mathcal{L}_{adv} + \lambda_{cls} \mathcal{L}_{cls}^{f} + \lambda_{rec} \mathcal{L}_{rec} \quad (13)$$
Complete loss for the discriminator, following [18]:
$$\mathcal{L}_{D} = -\mathcal{L}_{adv} + \lambda_{cls} \mathcal{L}_{cls}^{r} \quad (14)$$
3) Architecture: The architecture used in StarGAN is adapted from CycleGAN. The generator has two convolutional layers for downsampling, followed by six residual blocks and two transposed convolutional (deconvolutional) layers for upsampling. The downsampling convolutional layers and the transposed convolutional layers all use a stride of two. The discriminator has an input layer, five hidden convolutional layers, and two output layers.
Fig. 11. Architecture of the generator of the StarGAN model.
Fig. 12. Architecture of the discriminator of the StarGAN model.
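The generator input described in the overview of StarGAN, where the target domain label is spatially replicated and concatenated with the input image, can be sketched as follows. The channel-first nested-list layout and the function name are assumptions for illustration. A real implementation would use tensors.

```python
def concat_label(image, label):
    """Append one constant H x W plane per label entry to a C x H x W image.

    image : list of C channels, each a list of H rows of W floats
    label : list of domain label entries (for example a one-hot or attribute vector)
    """
    height = len(image[0])
    width = len(image[0][0])
    label_planes = [[[float(v)] * width for _ in range(height)] for v in label]
    return image + label_planes


# A 3-channel 2x2 image and a 5-entry target label (assumed toy data).
rgb = [[[0.0, 0.0], [0.0, 0.0]] for _ in range(3)]
target_label = [0, 1, 0, 0, 1]
generator_input = concat_label(rgb, target_label)
```

The result has 3 + 5 = 8 channels, so a single generator can be conditioned on any target domain simply by changing the label planes.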
4) Advantages: The advantages of StarGAN are listed below:
1. Diversity of generated images.
2. Scalability over multiple domains.
Fig. 13. (a) Generated image in the black-hair domain, (b) generated image in the blonde-hair domain, (c) generated image in the brown-hair domain, (d) generated image in the opposite-gender domain, (e) generated image in the increased-age domain, (f) original ground truth image.
The results of the FCN-8s architecture [7] are shown below.
Fig. 14. (a), (d) Ground truth images given to the generator of StarGAN, (b), (e) Images reconstructed by the generator of StarGAN, (c), (f) Difference images between the ground truth and reconstructed images, respectively.
III. COMPARATIVE ANALYSIS
TABLE I
METRIC VALUES OF DIFFERENT GANS
GAN         RMSE      UQI      MS-SSIM   VIF
SinGAN      22.7351   0.9550   0.9639    0.1305
CGAN        48.0449   0.8788   0.9369    0.0632
StarGAN     61.3293   0.8040   0.3396    0.0447
CycleGAN     0.1159   0.9277   0.9668    1.0970
The comparison summarized in Table I reports the Root Mean Squared Error (RMSE), Universal Image Quality Index (UQI), Multi-Scale Structural Similarity Index (MS-SSIM), and Visual Information Fidelity (VIF) [19,20,21] values for each GAN.
RMSE can be thought of as a (normalized) distance between the reconstructed images and the ground truth images. It is the square root of the mean squared difference between corresponding pixels across the image. A higher RMSE indicates a greater difference between the original and processed images. Therefore, a low RMSE indicates that the trained model follows the desired mapping closely, and a high RMSE indicates that it does not.
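A minimal Python sketch of RMSE for two equally sized grayscale images stored as nested lists is given below. The function name and the toy data are assumptions. It illustrates the definition and is not the code used to produce Table I.

```python
import math


def rmse(image_a, image_b):
    """Root mean squared error between two equally sized 2-D images."""
    squared = [
        (p - q) ** 2
        for row_a, row_b in zip(image_a, image_b)
        for p, q in zip(row_a, row_b)
    ]
    return math.sqrt(sum(squared) / len(squared))


ground_truth = [[0.0, 0.0], [0.0, 0.0]]
reconstructed = [[3.0, 3.0], [3.0, 3.0]]
error = rmse(ground_truth, reconstructed)
```

Identical images give an RMSE of zero, and a uniform offset of 3 grey levels gives an RMSE of 3.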
UQI is designed by modeling any image distortion as a combination of three factors: loss of correlation, luminance distortion, and contrast distortion. The index is mathematically defined and does not explicitly employ a human visual system model, so it does not depend on the images being tested, the viewing conditions, or the individual observers. For images, the index is evaluated locally using a sliding window [19].
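The index for a single window can be written as Q = 4 σ_xy x̄ ȳ / ((σ_x² + σ_y²)(x̄² + ȳ²)), which is the product of the correlation, luminance, and contrast factors [19]. The sketch below evaluates it on one window of flattened pixel values. The function name is hypothetical, and the sliding-window averaging over the whole image is omitted for brevity.

```python
def uqi_window(x, y):
    """Universal image quality index for one window of pixels (flattened lists)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((v - mean_x) ** 2 for v in x) / (n - 1)
    var_y = sum((v - mean_y) ** 2 for v in y) / (n - 1)
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    denominator = (var_x + var_y) * (mean_x ** 2 + mean_y ** 2)
    if denominator == 0:
        raise ValueError("index is undefined for this window")
    return 4.0 * cov_xy * mean_x * mean_y / denominator


window = [1.0, 2.0, 3.0, 4.0]
q_identical = uqi_window(window, window)
q_scaled = uqi_window(window, [2.0 * v for v in window])
```

Identical windows give Q = 1. Doubling every pixel keeps the correlation at 1 but lowers both the luminance and the contrast factor to 0.8, so Q = 0.64.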
MS-SSIM is a measure that automatically predicts perceived image quality. It is based on the top-down assumption that the human visual system is highly adapted for extracting structural information, so a measure of structural similarity is a good approximation of perceived image quality. To calculate the structural similarity between images, three comparison terms are defined: luminance, contrast, and structure. The method takes two images as input, the reconstructed image and the ground truth image. It iteratively applies a low-pass filter and downsamples the filtered image by a factor of 2. The original image is indexed as Scale 1 and the coarsest scale as Scale M, which is obtained after M−1 iterations. At the j-th scale, the contrast comparison and the structure comparison are calculated. The luminance comparison is computed only at Scale M [20].
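The construction of the M scales can be sketched as follows. The 2x2 box average used as the low-pass filter is an assumption made for brevity, and the function names are hypothetical. The per-scale contrast, structure, and luminance comparisons are not shown.

```python
def lowpass_downsample(image):
    """Average each 2x2 block (a simple low-pass filter) and keep one value per block."""
    height = len(image) // 2
    width = len(image[0]) // 2
    return [
        [
            (image[2 * i][2 * j] + image[2 * i][2 * j + 1]
             + image[2 * i + 1][2 * j] + image[2 * i + 1][2 * j + 1]) / 4.0
            for j in range(width)
        ]
        for i in range(height)
    ]


def build_scales(image, num_scales):
    """Return [Scale 1, ..., Scale M]; Scale M is reached after M - 1 iterations."""
    scales = [image]
    for _ in range(num_scales - 1):
        scales.append(lowpass_downsample(scales[-1]))
    return scales


flat_image = [[1.0] * 8 for _ in range(8)]
scales = build_scales(flat_image, num_scales=3)
shapes = [(len(s), len(s[0])) for s in scales]
```

For an 8x8 input and M = 3 the scales have sizes 8x8, 4x4, and 2x2.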
VIF is a full-reference image quality assessment index based on natural scene statistics and the notion of image information extracted by the human visual system. The VIF index employs natural scene statistics (NSS) models together with a distortion (channel) model to quantify the information shared between the ground truth and the reconstructed images. The ground truth image is modeled as the output of a stochastic natural source that passes through the human visual system channel and is later processed by the brain. Its information content is quantified as the mutual information between the input and output of the human visual system channel. This is the information that the brain could ideally extract from the output of the human visual system. The same measure is then computed in the presence of an image distortion channel that distorts the output of the natural source before it passes through the human visual system channel, thereby measuring the information that the brain could ideally extract from the reconstructed image. The two information measures are then combined to form a visual information fidelity measure that relates visual quality to relative image information [21].
The results summarized in Table I should help the research community select a GAN architecture for a particular scenario.
IV. CONCLUSION
In this paper we have presented a comparative analysis of different GAN architectures. We used the available literature and online resources to reproduce the results. In addition, the quantitative comparison in Table I is a novel contribution that should be helpful to the research community in the future.
REFERENCES
[1] Hitawala S. Comparative study on generative adversarial networks. arXiv preprint arXiv:1801.04271. 2018 Jan 12.
[2] Shaham, T.R., Dekel, T., Michaeli, T.: SinGAN: Learning a generative model from a single natural image. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2019.
[3] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pages 5767-5777, 2017.
[4] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[5] Kai Zhang, Wangmeng Zuo, Yunjin Chen, Deyu Meng, and Lei Zhang. Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising. IEEE Transactions on Image Processing, 26(7):3142–3155, 2017.
[6] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[7] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 3431–3440, 2015.
[8] FCN Score calculation, https://github.com/n-zhang/fcn_score
[9] Isola, P., Zhu, J. Y., Zhou, T., & Efros, A. A. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1125-1134, 2017.
[10] Mirza M, Osindero S. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784. 2014 Nov 6.
[11] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros. Context encoders: Feature learning by inpainting. In CVPR, 2016.
[12] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.
[13] Zhu, Jun-Yan and Park, Taesung and Isola, Phillip and Efros, Alexei A, ”Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks”, IEEE International Conference on Computer Vision (ICCV), 2017
[14] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, 2016.
[15] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, 2017.
[16] I. Goodfellow. NIPS 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160,2016
[17] C. Li and M. Wand. Precomputed real-time texture synthesis with markovian generative adversarial networks. ECCV, 2016.
[18] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo, “StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation,” arXiv:1711.09020 [cs.CV], Nov. 2017.
[19] UQI-Zhou Wang and A. C. Bovik, "A universal image quality index," in IEEE Signal Processing Letters, vol. 9, no. 3, pp. 81-84, March