
2.3 Remote Sensors-Based Eye Tracking

2.3.1 Appearance-Based Gaze Estimation

Appearance-based methods do not detect local gaze features; instead, they use the image content itself as input. Rather than explicitly modeling the eye in 2-dimensional (2D) or 3-dimensional (3D) space, they learn a direct mapping from image features to gaze points. The input dimensionality is much higher than in feature-based methods, so the success of these methods depends on how well the training data covers the variation in the test data. Unless large amounts of training data are provided to handle variations in user identity, head pose, eyeball pose, illumination, scale, etc., they generalize poorly, particularly for subject-independent mapping.

Early efforts used artificial neural networks to map eye image pixels directly to gaze points on a screen. These systems required thousands of training samples and a fixed head pose to obtain acceptable gaze estimation accuracy. In a pioneering work, [Baluja and Pomerleau, 1994] used 2000 cropped eye images as input to a multi-layer neural network, and their system achieved an accuracy of 1.5° while allowing for certain head movements. Similarly, [Xu et al., 1998] used 3000 training images to achieve a comparable accuracy. Later, alternative methods were proposed under similar conditions, such as limited head pose variations and in-session calibration. For instance, [Kar-Han Tan et al., 2002] proposed using linear interpolation to reconstruct a test sample from a local appearance manifold within the training data. They leveraged the topology information, encoded as a 2D space of gaze parameters, to constrain the sample selection, and thereby effectively reduced the number of training samples while obtaining acceptable accuracy. In addition, [Hansen and Pece, 2005] proposed a tracking method based on particle filtering and the expectation-maximization contour algorithm for robust iris tracking. To perform gaze estimation with this method, users had to gaze at a minimum of four calibration points. Their system achieved an accuracy of ∼4° under limited head movements.

Figure 2.4 – An example appearance-based gaze estimation pipeline [Funes Mora, 2015].
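The local interpolation idea described above can be illustrated with a minimal Python sketch. It is not the implementation of [Kar-Han Tan et al., 2002]: here the neighbours are simply the k nearest training images in pixel space, whereas the original method also uses the gaze-space topology to select samples. The function name `interpolate_gaze`, the ridge term `reg`, and the toy data are assumptions made for illustration only.

```python
import numpy as np


def interpolate_gaze(test_img, train_imgs, train_gaze, k=3, reg=1e-6):
    """Estimate a gaze point by locally interpolating training samples.

    test_img:   (d,) flattened eye image
    train_imgs: (n, d) flattened training eye images
    train_gaze: (n, 2) gaze points paired with train_imgs
    """
    # Pick the k training images closest to the test image in pixel space.
    dists = np.linalg.norm(train_imgs - test_img, axis=1)
    nearest = np.argsort(dists)[:k]

    # Find weights w (summing to 1) that best reconstruct the test image
    # from its neighbours: minimise ||test_img - sum_i w_i * x_i||^2.
    diffs = train_imgs[nearest] - test_img
    gram = diffs @ diffs.T
    # Small ridge term (an assumption) keeps the solve stable when the
    # Gram matrix is singular, e.g. when the reconstruction is exact.
    gram = gram + reg * (np.trace(gram) + 1e-12) * np.eye(k)
    weights = np.linalg.solve(gram, np.ones(k))
    weights /= weights.sum()

    # Apply the same weights to the neighbours' gaze labels.
    return weights @ train_gaze[nearest]


# Toy data (hypothetical): each "image" is a linear function of the gaze point.
rng = np.random.default_rng(0)
basis = np.linalg.qr(rng.standard_normal((50, 2)))[0].T  # (2, 50), orthonormal rows
grid = np.linspace(0.0, 1.0, 5)
train_gaze = np.array([[x, y] for x in grid for y in grid])  # 5x5 calibration grid
train_imgs = train_gaze @ basis

true_gaze = np.array([0.4, 0.55])
estimate = interpolate_gaze(true_gaze @ basis, train_imgs, train_gaze)
print(estimate)
```

Because the toy images depend linearly on gaze, the interpolated estimate should land very close to `true_gaze`; real eye images only satisfy this approximately and locally, which is why the neighbourhood selection matters.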

Later, alternative approaches were proposed, mainly to reduce the number of labeled training samples. For example, [Sugano et al., 2010] proposed a method that automatically collects labeled samples by exploiting a saliency prior from a video. [Lu et al., 2011] introduced an adaptive linear regression method that automatically selects training samples for mapping. These methods worked well under controlled conditions, such as a fixed head pose using a chin rest, fixed illumination settings, and well-aligned eye images. However, their performance degraded greatly when the user's head was not stationary. Moreover, [Funes-Mora and Odobez, 2012] leveraged RGB-D cameras to directly handle eye appearance variation by generating frontal-view eye images used as input to adaptive linear regression. They later proposed a framework for 3D gaze estimation, as shown in Figure 2.4. Thanks to the depth measurements and the fitted 3D facial mesh, they improved the framework's robustness to head pose and between-user appearance variations [Mora and Odobez, 2016].

Because the performance of appearance-based methods relies heavily on the variability of the training data, several recent efforts have been devoted to capturing larger data variability. In this context, large-scale datasets were collected, such as MPIIGaze [Zhang et al., 2015] and GazeCapture [Krafka et al., 2016]. The authors then trained convolutional neural networks (CNNs) on these large datasets to learn robust mappings. They achieved significant accuracy improvements over the state-of-the-art appearance-based methods, with an error of ∼4°. Although models trained on large datasets tolerate head pose and illumination changes to a certain extent, collecting such datasets to acquire sufficient data variation is still cumbersome and impractical. Instead, learning-by-synthesis approaches [Lu et al., 2012, Sugano et al., 2014, Wood et al., 2016b, Wood et al., 2016a] were introduced to increase data variability using synthesized eye images. For example, [Lu et al., 2012] synthesized additional eye images for various head poses by applying pixel displacements to real images captured at particular head poses. Although the method allowed a certain head pose tolerance, the estimation accuracy was poor, and the technique could not improve robustness to subject or environmental variations. [Sugano et al., 2014] collected a fully calibrated multi-view gaze dataset (the UT Multi-view Gaze dataset) from eight synchronized webcams and performed 3D eye region reconstruction to generate dense training data of eye images. [Wood et al., 2016b] presented a method to rapidly synthesize large amounts of variable eye region images as training data. Their eye region model was derived from high-resolution 3D face scans and enabled image-based lighting to cover a range of illumination conditions. To demonstrate the efficacy of the method, they synthesized over a million eye images and learned a gaze estimator using k-nearest-neighbors. Despite the simplicity of the classifier employed, they achieved an error of ∼10° in a cross-dataset evaluation on the MPIIGaze dataset and outperformed the CNN-based method described in [Zhang et al., 2015].
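To make the k-nearest-neighbors estimator and the angular errors quoted above more concrete, the following Python sketch shows one way such an estimator and its error metric could look. It is an illustration under stated assumptions, not the pipeline of [Wood et al., 2016b]: gaze is represented as a 3D unit direction vector, the "features" are random projections standing in for eye images, and the names `knn_gaze` and `angular_error_deg` are hypothetical.

```python
import numpy as np


def knn_gaze(query, train_feats, train_dirs, k=5):
    """Predict a unit gaze direction as the mean of the k nearest neighbours."""
    dists = np.linalg.norm(train_feats - query, axis=1)
    nearest = np.argsort(dists)[:k]
    mean_dir = train_dirs[nearest].mean(axis=0)
    return mean_dir / np.linalg.norm(mean_dir)


def angular_error_deg(pred_dir, true_dir):
    """Angle in degrees between two 3D gaze direction vectors."""
    pred_dir = np.asarray(pred_dir, dtype=float)
    true_dir = np.asarray(true_dir, dtype=float)
    cos = np.dot(pred_dir, true_dir) / (
        np.linalg.norm(pred_dir) * np.linalg.norm(true_dir)
    )
    # Clip guards against values marginally outside [-1, 1] from rounding.
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))


# Synthetic stand-in data (hypothetical): directions scattered around -z,
# with "image features" produced by a random linear projection.
rng = np.random.default_rng(0)
train_dirs = rng.standard_normal((1000, 3)) * [0.3, 0.3, 0.05] + [0, 0, -1]
train_dirs /= np.linalg.norm(train_dirs, axis=1, keepdims=True)
projection = rng.standard_normal((3, 32))
train_feats = train_dirs @ projection

true_dir = np.array([0.1, -0.05, -1.0])
true_dir /= np.linalg.norm(true_dir)
pred = knn_gaze(true_dir @ projection, train_feats, train_dirs, k=5)
print(f"angular error: {angular_error_deg(pred, true_dir):.2f} deg")
```

Averaging neighbour directions and renormalizing is only one possible aggregation rule; it is chosen here because it keeps the prediction on the unit sphere, so the angular error is well defined.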

In conclusion, appearance-based methods have an important advantage over the other methods: they require neither a particular hardware setup nor user calibration. Recent advances in synthesis and rendering technology, together with models learned from large-scale datasets using deep learning techniques, have brought considerable attention back to appearance-based methods, since they remarkably improve estimation accuracy and tolerance to head pose and illumination variations. These methods have great potential to make eye tracking a pervasive technology. However, their current estimation accuracy and robustness are still insufficient for applications that require precise gaze estimation (<1°).

In document Robust Eye Tracking Based on Adaptive Fusion of Multiple Cameras (Page 39-41)