Conclusion - Robust Eye Tracking Based on Adaptive Fusion of Multiple Cameras

than the visual axis (LoG). Consequently, a user calibration is essential for cross ratio-based methods in order to compensate for the subject-specific estimation bias.

In the literature, several efforts have been made to improve the tracking accuracy and robustness of cross ratio-based methods through user calibration. The original system introduced by [Yoo et al., 2002] had no subject-specific bias correction. Later, the authors refined their method with several enhancements in feature detection and introduced a technique to compensate for the non-coplanarity of the cornea using an additional light emitting diode (LED) illuminator in their hardware setup [Yoo and Chung, 2005]. Even though this calibration did not correct for the axes difference, it significantly improved the estimation accuracy. In a similar approach, [Coutinho and Morimoto, 2006] proposed the first method to compensate for the axes difference. Yet, their system also required a fifth LED in the hardware setup, as in [Yoo and Chung, 2005]. Later, homography-based bias correction was introduced by [Kang et al., 2007]. They simplified the error correction using a similar calibration procedure but eliminated the need for the fifth LED. Their method outperformed all previous methods despite having a simpler hardware setup. [Hansen et al., 2010] then proposed a normalized homography mapping to further improve the tracking robustness against perspective distortions.
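The idea behind homography-based bias correction can be illustrated with a small sketch. It assumes, for illustration only, that calibration yields pairs of uncorrected gaze estimates and known target positions in normalized screen coordinates, and that a single homography (with its last entry fixed to 1) is fitted to them by least squares. The function names and the simulated calibration data are hypothetical and are not taken from the cited works.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_homography(src, dst):
    """Least-squares homography (h33 fixed to 1) mapping src points to dst points."""
    rows, rhs = [], []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([x, y, 1.0, 0.0, 0.0, 0.0, -u * x, -u * y])
        rhs.append(u)
        rows.append([0.0, 0.0, 0.0, x, y, 1.0, -v * x, -v * y])
        rhs.append(v)
    # Normal equations: (A^T A) h = A^T b
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(8)] for i in range(8)]
    atb = [sum(r[i] * b for r, b in zip(rows, rhs)) for i in range(8)]
    h = solve_linear(ata, atb)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def apply_homography(h, point):
    """Map a 2D point through the 3x3 homography h."""
    x, y = point
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    u = (h[0][0] * x + h[0][1] * y + h[0][2]) / w
    v = (h[1][0] * x + h[1][1] * y + h[1][2]) / w
    return (u, v)


# Hypothetical calibration: a 3x3 target grid and simulated biased estimates.
targets = [(x, y) for y in (0.1, 0.5, 0.9) for x in (0.1, 0.5, 0.9)]
raw = [(1.05 * x + 0.03, 0.95 * y - 0.02) for x, y in targets]

H = fit_homography(raw, targets)
corrected = apply_homography(H, raw[4])  # estimate for the central target
```

At run time, every uncorrected estimate would be passed through `apply_homography` with the fitted `H`. Because a single homography is fitted once, a correction of this form stays fixed as the head moves away from the calibration position.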

In cross ratio-based gaze estimation, homography-based user calibration methods are widely accepted by the eye tracking community as the state-of-the-art bias correction technique. They have proven to work successfully when there are no large head movements during tracking.

Nonetheless, their ability to compensate for the estimation bias degrades significantly under large head movements. Consequently, alternative techniques have been developed to improve the robustness against large head movements. The majority of these efforts adapt the bias correction to changes in head position, e.g., [Coutinho and Morimoto, 2013, Huang et al., 2014]. For instance, [Coutinho and Morimoto, 2013] proposed methods to adaptively correct the bias displacement vector with respect to head movements. In addition, [Huang et al., 2014] proposed an adaptive homography calibration based on a model trained offline on simulated data. The model successfully adapted the homography calibration to the head movements. On the other hand, an important limitation of [Coutinho and Morimoto, 2013] and [Huang et al., 2014] is that they used a chin rest to keep the head pose fixed during their evaluations. Therefore, the evaluations could not take head pose variations and continuous head movements into account. Besides, using a chin rest significantly harms the user experience and is impractical for real-world human-computer interaction (HCI) applications.

2.4 Conclusion

Eye tracking has a research history of over a hundred years, and its popularity is now on the rise due to growing interest from diverse disciplines. The video-oculography technique is clearly preferable to electro-oculography and contact lens-based techniques due to its non-intrusive nature.

The video-oculography technique can be implemented as head-mounted trackers or remote sensor-based trackers, depending on the application scenario. Remote sensor-based trackers are mostly preferred over head-mounted ones since they provide a more natural user experience. Remote sensor-based gaze estimation methods can be categorized into two groups, namely, appearance-based methods and feature-based methods.

Appearance-based methods have an important advantage over feature-based methods: their hardware and system requirements are considerably lower. In other words, they simply require an uncalibrated ordinary camera. Recent advances in synthesis and rendering technology, as well as in machine learning (e.g., convolutional neural networks), have attracted a significant amount of attention to appearance-based methods in the community. Recently, notable improvements in accuracy and robustness have been made, and great potential has been shown.

However, despite their promise, their current performance is still inadequate for applications that require precise gaze estimation.

Feature-based methods are still the most widely preferred methods since they significantly outperform appearance-based methods in terms of accuracy. These methods can be analyzed under three categories, each of which has its own advantages and disadvantages. Among all categories, 3D model-based methods enable the highest accuracy and head movement tolerance, owing to sophisticated 3D eye and environment modeling. However, their setup complexity is considerably higher. They mostly require fully calibrated setups, e.g., stereo vision systems or Kinect-like depth sensors. Unlike 3D model-based methods, regression-based and cross ratio-based methods do not require fully calibrated setups as they rely on 2D gaze features and approximations. Despite their simplicity, their accuracies are competitive with those of 3D model-based systems.

Tracking robustness is also a crucial concern in eye tracking research. The performance of existing systems is highly affected by certain factors, including the data resolution, the quality of user calibration, head pose changes, head movements, illumination variations, and subject-specific factors, such as eye wear and eye type. Although numerous efforts, as reviewed in detail in Section 5.1, have been devoted to addressing such factors and promising results have been achieved, the current performance is still far from ideal. Hence, alternative eye tracking systems that achieve high accuracy using simple, flexible, and user-friendly setups are needed. In addition to setup complexity and accuracy, the robustness to real-world conditions and the convenience of user calibration must be considered as important evaluation criteria.

In this regard, new models or frameworks are necessary to reduce eye trackers’ sensitivity to low resolution eye data, user calibration quality, head movements, changes in illumination, eye glasses, and eye type variations.

In this thesis, considering the efforts in the literature and the state of eye tracking research, we design a novel eye tracking framework, which aims to address the aforementioned challenges in eye tracking. First of all, we develop a multi-camera eye tracking framework, which uses a non-intrusive, flexible, and adaptable setup. Within this framework, we propose employing a cross ratio-based gaze estimation method, which avoids camera and geometric scene calibrations, enables fast gaze processing for real-time tracking, and operates reasonably well with low resolution data.

The framework is explained in detail in Chapter 3. Secondly, we investigate regression-based techniques to address the user calibration convenience. We present novel methods, which make it possible to model the subject-specific estimation bias when the calibration data is limited in size and quality, as explained in detail with extensive evaluations in Chapter 4. Lastly, we optimize the proposed multi-camera framework to improve the overall accuracy, tracking availability, and, more importantly, tracking robustness to challenging conditions. The details of the adaptive multi-camera fusion mechanism and our comprehensive evaluations are described in Chapter 5. Besides, in our experiments and evaluations, we strictly target natural HCI. We avoid protocols that conflict with it, such as using a chin rest, and also introduce new, more realistic experiments and evaluation schemes.

3 Robust Real-Time Multi-Camera Gaze Estimation Framework

In this chapter, we present the details of the proposed gaze estimation framework. Considering the objectives of this thesis (Section 1.2), together with the insights acquired from the previous efforts (Section 2.4), we design a novel multi-camera gaze estimation framework, which enables accurate real-time eye tracking using low-resolution eye data. Meanwhile, we aim for our methodology to operate with a simple and flexible hardware setup that does not need the camera or geometric scene calibration required by the majority of the existing systems. In this respect, we propose a multi-camera gaze estimation framework, which comprises independently operating single-camera systems. As the estimation of the gaze relies on a simple cross-ratio geometry in each single-camera system, the multi-camera framework is capable of real-time tracking. Besides, it only requires an uncalibrated setup that can effortlessly be adapted for various applications.

An overview of the proposed framework is illustrated in Figure 3.1. It comprises multiple single-camera systems operating simultaneously, each of which consists of four consecutive processes: i) data acquisition, ii) gaze feature detection, iii) gaze estimation, and iv) subject-specific bias correction. Gaze outputs obtained from all single-camera systems are then fed into an adaptive fusion mechanism to output an overall point of regard (PoR) per frame.
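The final step, combining the per-camera outputs into one PoR per frame, can be pictured with a deliberately simple stand-in. The sketch below is not the adaptive fusion mechanism of this thesis, which is described in Chapter 5. It only assumes that each single-camera system returns either a PoR in screen pixels or nothing (for instance, when the gaze features are not detected), together with a hypothetical non-negative reliability weight, and it combines the available outputs with a weighted mean.

```python
def fuse_gaze(estimates, weights):
    """Weighted mean of per-camera PoR estimates.

    estimates: list of (x, y) tuples, or None for a camera without an output.
    weights:   hypothetical non-negative reliability weights, one per camera.
    Returns the fused (x, y), or None if no camera produced a usable output.
    """
    total = 0.0
    sum_x = 0.0
    sum_y = 0.0
    for estimate, weight in zip(estimates, weights):
        if estimate is None or weight <= 0:
            continue  # skip cameras that are unavailable in this frame
        sum_x += weight * estimate[0]
        sum_y += weight * estimate[1]
        total += weight
    if total == 0:
        return None
    return (sum_x / total, sum_y / total)


# Hypothetical frame: the second camera has no output, the third is trusted most.
frame_estimates = [(960.0, 540.0), None, (980.0, 520.0)]
frame_weights = [1.0, 1.0, 3.0]
fused = fuse_gaze(frame_estimates, frame_weights)  # (975.0, 525.0)
```

Even this simple form shows why multiple cameras can improve tracking availability: a frame yields a PoR as long as at least one camera produces an estimate.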

In the rest of this chapter, we first describe the details of the framework, with an emphasis on the steps from data acquisition to cross ratio-based gaze estimation. Then, the remaining two main processes, i.e., subject-specific calibration and adaptive fusion, are briefly explained. Since these two processes constitute the two main contributions of this thesis, they are explained separately in detail in the next chapters. This chapter also presents the real-time implementation of the framework in Section 3.6. Finally, the discussions and conclusions are given in Section 3.7.


(a) Overview of the multi-camera gaze estimation framework.

(b) Overview of the single-camera system.

Figure 3.1 – Overview of the proposed framework.


3.1 Data Acquisition

In the proposed framework, the data is acquired using an uncalibrated multi-camera setup. The setup is completely remote and flexible: depending on the application scenario, the number of cameras and their positioning can be altered without requiring any camera or geometric system calibration. Since the gaze estimation relies on the cross ratio technique, the setup operates under active near-infrared (NIR) illumination. Using active lighting rather than natural lighting brings an important advantage: the tracking performance becomes less sensitive to variations in ambient illumination, i.e., the performance does not drastically change under indoor or outdoor lighting, or under total darkness, as analyzed in detail in Chapter 5.

In this thesis, we primarily target screen-based (desktop) tracking scenarios due to a high number of potential use cases. In this regard, considering the trade-off between the accuracy and the total cost, we developed a prototypical three-camera setup, as can be seen in Figure 3.2.

The prototypical setup consists of three PointGrey Flea3 monochrome cameras, four groups of NIR light emitting diodes (LEDs) for the active illumination, and a controller unit for the synchronization. Each camera has a medium image resolution (1280×1024 pixels) and is equipped with a large field-of-view (FoV) manual focus lens with a focal length of 8 mm and a diagonal FoV of 58°. The cameras are placed around a 24-inch monitor: one of the cameras is located slightly below the monitor, whereas the other two are placed on the left and right sides of the monitor. In order to create the corneal reflections (glints), NIR LEDs with 850 nm wavelength are placed on the corners of the monitor. In addition, band-pass filters around 850 nm are mounted on the lenses to filter out the ambient light. Besides, a micro-controller is programmed to synchronize all the cameras, so that the images can be captured simultaneously by all cameras at 30 frames per second (fps). In fact, one of the responsibilities of the micro-controller is to optimize the light emissions with regard to eye safety. Therefore, we synchronize the cameras’ shutters with the LEDs’ emission duration. In addition, a comprehensive quantitative analysis, which discusses the impact of employing different numbers of cameras in various configurations, is given in Chapter 5.

Figure 3.2 – Example three-camera setup.

The current setup provides us with multiple eye appearances acquired simultaneously from various views. In order to obtain a large working volume that allows for large head movements, we employed large FoV lenses. Therefore, in comparison with the majority of the existing eye trackers, the acquired data resolution is rather low in our system, e.g., image and eye resolutions are 1280×1024 and ∼90×50 pixels, respectively. Besides, it is important to note that we put great emphasis on acquiring the data in a natural manner during our user experiments. The users were explicitly asked to interact with our system in the way they felt most natural and comfortable. For instance, although a chin rest has widely been used in previous work, it was strictly avoided in our evaluations, so that the users could perform head pose changes and continuous head movements. Sample frames acquired using the current three-camera setup are shown in Figure 3.3.
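The link between the large FoV and the low eye resolution can be made concrete with a rough pinhole-camera calculation. The sketch below uses the image resolution and diagonal FoV stated above. The eye-region width of 40 mm and the viewing distance of 650 mm are illustrative assumptions and not values reported in this thesis, and lens distortion is ignored.

```python
import math


def focal_length_px(width_px, height_px, diag_fov_deg):
    """Focal length in pixels implied by the image size and the diagonal FoV."""
    half_diag_px = math.hypot(width_px, height_px) / 2.0
    return half_diag_px / math.tan(math.radians(diag_fov_deg) / 2.0)


def projected_size_px(f_px, object_mm, distance_mm):
    """Image size in pixels of a fronto-parallel object under a pinhole model."""
    return f_px * object_mm / distance_mm


# Values from the setup description: 1280x1024 pixels, 58 degree diagonal FoV.
f_px = focal_length_px(1280, 1024, 58.0)

# Assumed for illustration only: 40 mm wide eye region at 650 mm distance.
eye_width_px = projected_size_px(f_px, object_mm=40.0, distance_mm=650.0)

print(f"focal length: {f_px:.0f} px, eye width: {eye_width_px:.0f} px")
```

Under these assumptions the eye region spans roughly ninety pixels horizontally, which is of the same order as the eye resolution quoted above. A narrower FoV would increase this figure at the cost of a smaller working volume.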

Figure 3.3 – Sample frames acquired from three subjects using the current hardware setup: (left column) right side camera, (middle column) bottom camera, (right column) left side camera.
