• No results found

Addtional Experimental Results

4.8 Appendix

4.8.2 Addtional Experimental Results

Here, we demonstrate more experimental comparisons with the state-of-the-art meth- ods in Fig.4.4.

We do not apply an STN on the third deconvolutional layer due to the limita- tion of hardware memory. Because the upsampling network learns common facial components from aligned feature maps, aligning feature maps in the early layers can facilitate the learning of the upsampling network. As expected, Tab.4.2 shows that applying an STN in the first deconvolutional layer (STN1) can achieve better results than in the second layer (STN2).

Chapter5

Hallucinating Very Low-Resolution

Unaligned and Noisy Face Images

by Transformative Discriminative

Autoencoders

5.1

Foreword

In chapter 2 and chapter 3, our methods require all the low-resolution face images to be aligned. Then we reduce the requirements of alignments of low-resolution faces by incorporating spatial transformer networks into our upsampling network in chapter 4. However, our previous works assume the low-resolution face images are noise-free. Since the resolution of input face images is very small, every pixel matters significantly in super-resolution. Thus, image noise will deteriorate the face hallucination performance dramatically. In this chapter we present a transforma- tive discriminative autoencoder to upsample noisy low-resolution face images while suppressing artifacts caused by the noise. We observe that directly denoising low- resolution faces may corrupt low-resolution facial patterns and leads to distortions and artifacts in upsampled high-resolution face images. Instead of denoising low- resolution images, we first super-resolve low-resolution face images while reducing noise. Since noise may cause artifacts in upsampled high-resolution faces, we then project denoised and upsampled high-resolution faces to noise-free low-resolution ones to reduce the artifacts. Finally, we can achieve high-quality high-resolution face images by upsampling the noise-free low-resolution ones.

This chapter has been published as a conference paper: Xin Yu, Fatih Porikli: Hal- lucinating Very Low-Resolution Unaligned and Noisy Face Images by Transforma- tive Discriminative Autoencoders. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 3760-3768, 2017.

70 Hallucinating Very Low-Resolution Unaligned and Noisy Face Images

5.2

Abstract

Most of the conventional face hallucination methods assume the input image is suf- ficiently large and aligned, and all require the input image to be noise-free. Their performance degrades drastically if the input image is tiny, unaligned, and contami- nated by noise.

In this paper, we introduce a novel transformative discriminative autoencoder to 8× super-resolve unaligned noisy and tiny (16×16) low-resolution face images. In contrast to encoder-decoder based autoencoders, our method uses decoder-encoder- decoder networks. We first employ a transformative discriminative decoder network to upsample and denoise simultaneously. Then we use a transformative encoder network to project the intermediate HR faces to aligned and noise-free LR faces. Fi- nally, we use the second decoder to generate hallucinated HR images. Our extensive evaluations on a very large face dataset show that our method achieves superior hal- lucination results and outperforms the state-of-the-art by a large margin of 1.82dB PSNR.

5.3

Introduction

Face images provide critical information for visual perception and identity analysis. However, when they are noisy and their resolutions are inadequately small (e.g.as in some surveillance videos), there is little information available to be inferred reliably from them. Very low-resolution and noisy face images not only impede human perception but also impair computer analysis.

To tackle this challenge, face hallucination techniques aim at recovering high- resolution (HR) counterparts from low-resolution (LR) face images and have received significant attention in recent years. Previous state-of-the-art methods mainly focus on recovering HR faces from aligned and noise-free LR face images. More specifically, face hallucination methods based on holistic appearance models [Baker and Kanade,

2000,2002;Liu et al.,2001;Wang and Tang,2005;Liu et al.,2007;Hennings-Yeomans

et al., 2008; Ma et al., 2010; Yang et al., 2010; Li et al., 2014; Kolouri and Rohde,

2015;Wang et al.,2014;Yu and Porikli,2016] require LR faces to be precisely aligned

beforehand. However, when the LR images are contaminated by noise, the accuracy of face alignment degrades dramatically. Besides, due to the wide range of pose and expression variations, it is difficult to learn a comprehensive, holistic appearance model for LR images not aligned appropriately. As a result, these methods often produce ghosting artifacts for noisy unaligned LR inputs.

Rather than learning holistic appearance models, facial components based face hallucination methods have been proposed [Tappen and Liu, 2012;Yang et al.,2013;

Zhou and Fan, 2015; Zhu et al., 2016b]. They transfer HR facial components from

the training dataset to the input LR images without requiring alignment of LR input images in advance. These methods heavily rely on the successful localization of facial landmarks. Because facial landmarks are difficult to detect in very low resolution (16×16 pixels) images, they fail to localize the facial components accurately and thus

§5.3 Introduction 71

(a) (b) (c) (d)

(e) (f) (g) (h)

Figure 5.1: Comparison of our method with the CNN based face hallucination UR- DGN [Yu and Porikli,2016]. (a) 16×16 LR input image. (b) 128×128 HR original image. (c) Denoised and aligned LR image. We firstly apply BM3D [Dabov et al.,

2007] and then STN [Jaderberg et al.,2015]. (d) The corresponding most similar face in the training dataset. (e) Bicubic interpolation of (c). (f) Image generated by UR- DGN. Note that, URDGN super-resolves the denoised and aligned LR image, not the original LR input (in favor of URDGN). (g) The denoised and aligned LR image by our decoder-encoder as an intermediate output. (h) The final hallucinated face by

our TDAE method.

produce artifacts in the upsampled face images. In other words, the facial component based methods are not suitable to upsample noisy unaligned LR faces either.

Considering the resolution of faces is too small and the presence of noise, face detectors may also fail to locate such tiny noisy faces. Thus, using pose specific face detectors as a preprocessing step to compensate for misalignments is also impractical. In this paper, we propose a new transformative discriminative autoencoder (T- DAE) to super-resolve a tiny (16×16 pixels) unaligned and noisy face image by a remarkable upscaling factor of 8×, where we estimate 64 pixels for each single pixel of the input LR image. Furthermore, each pixel has also been contaminated by noise, making the task even more challenging.

Our TDAE consists of three serial components: a decoder, an encoder, and a sec- ond decoder. Our decoder network comprises deconvolutional and spatial transfor- mation layers [Jaderberg et al.,2015]. It can progressively upsample the resolutions of the feature maps by its deconvolutional layers while aligning the feature maps by its spatial transformation layers. Similar to [Yu and Porikli, 2016], we employ not only the pixel-wise intensity similarity between the hallucinated face images and the ground-truth HR face images but also the class similarity constraint that enforces the upsampled faces to lie on the manifold of real faces by a discriminative network. Hence, we achieve a transformative decoder that is also discriminative. Since the LR inputs are noisy, the hallucinated faces after the decoder may still contain artifacts. In order to obtain aligned and noise-free LR faces, we project the upsampled HR

72 Hallucinating Very Low-Resolution Unaligned and Noisy Face Images

faces back onto the LR face domain by a transformative encoder. Finally, we train our second decoder on the projected LR faces to attain hallucinated HR face images. In this manner, the artifacts are greatly reduced and our TDAE produces authentic HR face images.

Overall, the contributions of this paper are mainly in four aspects:

• We propose a new transformative-discriminative architecture to hallucinate tiny (16×16 pixels) unaligned and noisy face images by an upscaling factor of 8×. • In contrast to conventional autoencoders, we first device a decoder-encoder

structure to generate noise-free and aligned LR faces, and then a second de- coder trained on the encoded LR faces to hallucinate high-quality HR face im- ages.

• Our method does not require to model or estimate noise parameters. It is agnostic to the underlying spatial deformations and contaminated noise. • To the best of our knowledge, our method is the first attempt to address the

super-resolution of tiny and noisy face images without requiring alignment of LR faces beforehand, which makes our method practical.

5.4

Related Work

Face hallucination has received significant attention in recent years [Tappen and Liu,

2012; Yang et al., 2013; Wang et al., 2014; Kolouri and Rohde, 2015; Zhou and Fan,

2015; Zhu et al., 2016b; Yu and Porikli, 2016]. Previous face hallucination methods

mainly focus on recovering HR faces from aligned and noise-free LR face images, and in general, they can be grouped into two categories: holistic methods and part-based methods.

Holistic methods use global face models learned by PCA to hallucinate entire HR faces. In the work [Wang and Tang, 2005], an eigen-transformation is proposed to generate HR face images by establishing a linear mapping between LR and HR face subspaces. Similarly, Liu et al. [2007] employ a global appearance model learned by PCA to upsample aligned LR faces and a local non-parametric model to enhance the facial details. The work [Kolouri and Rohde, 2015] explores optimal transport and subspace learning to morph an HR output according to the given aligned LR faces. Since holistic methods require LR face images to be precisely aligned and share the same pose and expression as the HR references, they are very sensitive to the misalignments of LR images. Besides, image noise makes the alignment of LR faces even more difficult.

Part-based methods upsample facial parts rather than entire faces, and thus they can handle various poses and expressions. They either employ a training dataset of reference patches to reconstruct the HR counterparts of the input LR patches or exploit facial components. In [Baker and Kanade, 2002], high-frequency details of aligned frontal face images are reconstructed by finding the best mapping between