5.9 Additional Experimental Results
6.5.2.1 Multiscale Transformative Upsampling Network
Reception for LR Images in a Multiscale Manner: State-of-the-art CNN based super-resolution networks [Yu and Porikli,2016,2017a,b;Zhu et al.,2016b;Bulat and
Tzimiropoulos, 2018; Chen et al., 2018] only accept LR inputs in a fixed resolution,
i.e., 16×16 pixels. When the resolutions of LR images are larger than the desired res- olution, those methods need to downsample input images. However, downsampling input images may result in the loss of high-frequency details of LR inputs as well as more ambiguous mappings between LR and HR face images in super-resolution. In addition, we assume that the resolutions of LR images are smaller than 32×32 pixels. Otherwise, LR images can provide enough resolution for human observation and computer analysis. Hence, we only focus on LR images whose resolutions are smaller than 32×32 pixels in this paper.
Inspired by the Laplacian pyramid, we decompose an image into two compo- nents: a low-frequency part and a high-frequency part. We downsample an input image to 16×16 pixels as our low-frequency part, as illustrated in Fig. 6.1(i). The high-frequency part is obtained by subtracting the input image by the interpolated low-frequency components. Then, we upsample the high-frequency component to 32×32 pixels, as visible in Fig. 6.1(j). In this way, our transformative upsampling network can receive LR face images at different resolutions while preserving high- frequency residual details of the inputs for super-resolution.
In order to combine the information of the high-frequency and low-frequency branches together, we extract feature maps from the images of those tow branch- es and then concatenate the feature maps for further super-resolution. Specifically, we firstly employ an autoencoder with skip connections to extract features from the low-frequency component and then upsample the feature maps by a deconvolutional layer. After the deconvolutional layer, the resolution of the low-frequency branch has been increased as the same as the resolution of the high-frequency branch. Rather than directly combining the high-frequency residual component with the feature
§6.5 Proposed Method: MTDN 99
maps of the low-frequency component, we apply two cascaded residual blocks to extract features from the high-frequency component as well. Then, we concatenate the feature maps extracted from the high-frequency residual component with the up- sampled feature maps of the low-frequency component and then employ a residual block to fuse the concatenated feature maps.
As shown in Fig. 6.3, our network is able to super-resolve LR face images at different resolutions. Note that, we do not need to fine-tune our network on images of different sizes. As expected, the PSNRs of our upsampled results become higher as the resolutions of LR faces increase. This indicates our network exploits all the information in LR input images for super-resolution.
Upsampling Layers: After obtaining the concatenated features maps of input images, we further super-resolve the feature maps by the deconvolutional layers and residual blocks. The deconvolutional layer, also known back-convolutional layer, can be made of a cascade of an upsampling layer and a convolutional layer, or a con- volutional layer with a fractional stride [Zeiler et al.,2010;Zeiler and Fergus, 2014]. Therefore, the resolution of the output of the deconvolutional layers is larger than the resolution of its input. To reduce potential blocky artifacts caused by deconvolu- tional layers [Yu and Porikli,2018] as well as increase the capacity of the network, we cascade a residual block after each deconvolutional layer as our upsampling layer.
Spatial Transformation Layers: The spatial transformation network (STN) is proposed byJaderberg et al.[2015]. It can estimate the motion parameters of images, and warp images to the canonical view. In our architecture, the spatial transfor- mation network layers are represented as the green boxes in Fig. 6.2. These layers contain three modules: a localization module, a grid generator module, and a sam- pler. The localization module consists of a number of hidden layers and outputs the transformation parameters of an input relative to the canonical view. The grid gener- ator module creates a sampling grid according to the estimated parameters. Finally, the sampler module maps the input onto the generated grid by bilinear interpolation. Since we focus on in-plane rotations, translations, and scale changes without re- quiring a 3D face model, we employ the similarity transformation for face alignment. Although STNs can warp images, it is not straightforward to use them directly to align very LR face images. As shown in Fig.6.1(c), directly applying an STN to align LR images causes distortion artifacts due to the difficulty of spatial transformation estimation on very LR faces. There are several factors needed to be considered: (i) After the alignment of LR images, facial patterns are blurred due to the resampling of the aligned faces by bilinear interpolation. (ii) Since the resolution is very low and a wide range of poses exists, estimating spatial transformations on such small face images may lead to alignment errors. (iii) Due to the blur and alignment errors, the upsampling network may fail to generate realistic HR faces. (iv) If STNs are em- ployed to the two branches separately, the estimated transformation parameters of these two branches may be different. This will result in misalignments between the low-frequency component and the high-frequency residual components. As a result of the misalignments, distortion artifacts or ghosting artifacts may appear in the final results. Therefore, we employ STNs to align the concatenated feature maps. In this
100 Hallucinating Faces by Multiscale Transformative Discriminative Networks
(a) (b) (c) (d) (e) (f) (g) (h)
Figure 6.4: Illustrations of different losses for super-resolution. (a) The input 16×16 LR images. (b) The original 128×128 HR images. (c) The aligned LR images. (d) The upsampled faces by SRGAN [Ledig et al.,2017]. Here, SRGAN is applied to the aligned LR faces. Since SRGAN is trained on generic images patches, we re-train SRGAN on whole face images. (e) The face images super-resolved by our previous method [Yu and Porikli,2017b]. (f) The super-resolved faces byLpix. (g) The super- resolved faces by Lpix+Lf eat. (h) The super-resolved faces by Lpix+Lf eat+LU.
Here, we omit the trade-off weights for simplicity.
way, we can align the low-frequency and high-frequency parts simultaneously. Instead of using a single STN to align LR face images, we employ multiple STN layers to line up the feature maps. Using multiple layers significantly reduces the load on each spatial transformation network and further reduces the errors of mis- alignments. In addition, resampling feature maps by multiple STN layers prevents from damaging or blurring input LR facial patterns. Since STN layers and the up- sampling layers are interwoven together (rather than being two individual networks), the upsampling network can learn to eliminate the undesired effects of misalignment in the training stage.