              Pitch    Yaw      Roll     Total
Sequence 1    2.88°    3.19°    2.81°    2.97°
Sequence 2    1.73°    3.86°    2.32°    2.78°
Sequence 3    2.56°    3.33°    2.80°    2.92°
Sequence 4    2.26°    3.62°    2.39°    2.82°

Table 3.1: RMS error for each sequence. Pitch, yaw and roll represent rotation around the X, Y and Z axes, respectively.
and head translations of up to 90 cm, including translation along the Z axis. Figure 3-3 shows the pose estimates of our adaptive view-based tracker for the first sequence.
Figure 3-4 compares the tracking results of this sequence with the inertial sensor. The RMS errors for all 4 sequences are shown in Table 3.1. Our results demonstrate that our tracker is accurate to within the resolution of the Inertia Cube2 sensor.
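As a minimal sketch of how the RMS errors in Table 3.1 can be computed, the following Python code compares a series of estimated angles with reference measurements. The function names are hypothetical. The `combined_rms` helper assumes that the Total column is the RMS of the three per-axis values; this assumption reproduces the Total column to within about 0.01 degrees, but the text does not state how the total was computed.

```python
import numpy as np

def rms_error(estimated, reference):
    """RMS error between two equally long angle series (in degrees)."""
    estimated = np.asarray(estimated, dtype=float)
    reference = np.asarray(reference, dtype=float)
    return float(np.sqrt(np.mean((estimated - reference) ** 2)))

def combined_rms(per_axis_rms):
    """RMS of per-axis RMS values.

    Assumption: every axis was evaluated on the same number of frames,
    so the pooled RMS equals the RMS of the per-axis values.
    """
    per_axis_rms = np.asarray(per_axis_rms, dtype=float)
    return float(np.sqrt(np.mean(per_axis_rms ** 2)))

# Per-axis values (pitch, yaw, roll) for Sequence 1, taken from Table 3.1.
total_seq1 = combined_rms([2.88, 3.19, 2.81])
```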
General Object Tracking
Since our tracking approach doesn’t use any prior information about the object when using the stereo-based view registration algorithm, our algorithm works on different object classes without changing any parameters. Our last experiment uses the same tracking technique described in this chapter to track a puppet. The position of the puppet in the first frame was defined manually. Figure 3-5 presents the tracking results.
Our results show that, using an adaptive view-based appearance model, we can robustly track the head pose over a large range of motion. As we will see in the following section, the range of motion for eye gaze estimation is smaller, and a pre-acquired view-based appearance model is sufficient for estimating eye gaze.
[Figure 3-4 plots: three panels showing rotation around the X axis, Y axis and Z axis (in degrees), each comparing the Inertia Cube2 sensor with the adaptive view-based model; horizontal axis ticks run from 1 to 776.]
Figure 3-4: Comparison of the head pose estimation from our adaptive view-based approach with the measurements from the Inertia Cube2 sensor.
[Figure 3-5 image panels, labeled 1, 45, 105, 130, 165, 235, 255, 310, 330, 350, 370 and 400.]
Figure 3-5: 6-DOF puppet tracking using the adaptive view-based appearance model.
• Non-intrusive
• Automatic initialization
• Robust to eye glasses
• Works with monocular cameras
• Works with low resolution images
• Takes advantage of other cues (e.g., head tracking)
Eye gaze tracking has been a research topic for many years [117, 62]. Some recent systems can estimate the eye gaze to within a few degrees; these video-based systems require high-resolution images and are usually constrained to small fields of view (4x4 cm) [73, 18]. Many systems require an infrared light and a filtered camera [118, 43]. In this section we develop a passive vision-based eye gaze estimator sufficient for inferring conversational gaze aversion cues; as suggested in [116], we can improve tracking accuracy by integrating it with head pose tracking.
As discussed earlier in Chapter 2, one of our goals is to recognize eye gestures that help an ECA to differentiate when a user is thinking from when a user is waiting for more information from the ECA. We built an eye gaze estimator that produces sufficient precision for gesture recognition and works with a low-resolution camera. Figure 3-6 shows an example of the resolution used during training and testing.
Figure 3-6: Example image resolution used by our eye gaze estimator. The size of the eye samples is 16x32 pixels.
Our approach for eye gaze estimation is a three-step process: (1) detect the location of each eye in the image using a cascade of boosted classifiers, (2) track each eye location over time using a head pose tracker, and (3) estimate the eye gaze based on a view-based appearance model.
3.2.1 Eye Detection
For eye detection, we first detect faces inside the entire image and then search inside the top-left and top-right quarters of each detected face for the right and left eyes, respectively. Face and eye detectors were trained using a cascaded version of Adaboost [110]. For face detection, we used the pre-trained detector from Intel OpenCV.
Figure 3-7: Experimental setup used to acquire eye images from 16 subjects with ground truth eye gaze. This dataset was used to train our eye detector and our gaze estimator.
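The restriction of the eye search to the upper quarters of a detected face can be sketched as follows. This is an illustrative Python fragment, not the original implementation: the `Box` type and the `eye_search_regions` function are hypothetical names, and the face box is assumed to be given as a top-left corner plus width and height in image coordinates.

```python
from typing import Dict, NamedTuple

class Box(NamedTuple):
    """Axis-aligned rectangle: top-left corner (x, y), width w, height h."""
    x: int
    y: int
    w: int
    h: int

def eye_search_regions(face: Box) -> Dict[str, Box]:
    """Return the sub-regions of a detected face in which to search for eyes.

    Following the text, the right eye is searched for in the top-left
    quarter of the face box and the left eye in the top-right quarter.
    """
    half_w, half_h = face.w // 2, face.h // 2
    return {
        "right_eye": Box(face.x, face.y, half_w, half_h),
        "left_eye": Box(face.x + half_w, face.y, face.w - half_w, half_h),
    }

# Example: an 80x80 face box whose top-left corner is at (100, 50).
regions = eye_search_regions(Box(100, 50, 80, 80))
```

Each returned region would then be passed to the corresponding eye detector, so that the detector never scans the lower half of the face.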
To train our left and right eye detectors, we collected a database of 16 subjects looking at targets on a whiteboard. This dataset was also used to train the eye gaze estimator described in Section 3.2.3. A tripod was placed 1 meter in front of the whiteboard. Targets were arranged on a 7x5 grid so that the spacing between each target was 10 degrees (see Figure 3-7). The top left target represented an eye direction of -30 degrees horizontally and +20 degrees vertically. Two cameras were used to image each subject: one located in front of the target (0,0) and another in front of the target (+20,0).
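The target layout described above can be enumerated with a short Python sketch. The function name and the row-major ordering (top row first, left to right) are assumptions made for illustration; only the grid size, the 10-degree spacing and the orientation of the top-left target come from the text.

```python
def gaze_targets(h_min=-30, h_max=30, v_min=-20, v_max=20, step=10):
    """List (horizontal, vertical) gaze angles in degrees for the target grid.

    Targets are listed row by row, starting from the top-left target,
    which corresponds to -30 degrees horizontally and +20 degrees vertically.
    """
    targets = []
    for v in range(v_max, v_min - 1, -step):      # top row first
        for h in range(h_min, h_max + 1, step):   # left to right
            targets.append((h, v))
    return targets

targets = gaze_targets()
```

The resulting list has one entry per target on the 7x5 grid and includes the two camera positions, (0, 0) and (+20, 0).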
Participants were asked to place their head on the tripod and then look sequentially at the 35 targets on the whiteboard. A keyboard was placed next to the participant so that he/she could press the space bar after looking at a target. The experiment was repeated under 3 different lighting conditions (see Figure 3-8). The location and size of both eyes were manually specified to create the training set. Negative samples were selected from the non-eye regions inside each image.
Figure 3-8: Samples from the dataset used to train the eye detector and gaze estimator. The dataset had 3 different lighting conditions.
3.2.2 Eye Tracking
The results of the eye detector are sometimes noisy due to missed detections, false positives and jitter in the detected eye location. For these reasons we need a way to smooth the estimated eye locations and keep a reasonable estimate of the eye location even if the eye detector does not trigger.
Our approach integrates eye detection results with a monocular 3D head pose tracker, which computes the 3D position and orientation of the head at each frame, to achieve smooth and robust eye tracking. We initialize our head tracker using the detected face in the first frame. A 3D ellipsoid model is then fit to the face based on the width of the detected face and the camera focal length. The position and orientation of the model are updated at each frame after tracking is performed.
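One way to initialize an ellipsoid from the detected face width and the focal length is a pinhole-camera depth estimate, sketched below in Python. This is an assumption about how such a fit could be done, not a description of the original implementation: the function name, the nominal head width of 0.15 m, the principal point parameters and the choice of radii are all hypothetical.

```python
def init_head_ellipsoid(face_x, face_y, face_w, face_h,
                        focal_px, cx, cy, head_width_m=0.15):
    """Place a 3D ellipsoid at the depth implied by the face width.

    Pinhole model: an object of real width W at depth Z appears with
    width w = f * W / Z pixels, so Z = f * W / w.
    (cx, cy) is the assumed principal point in pixels; head_width_m is
    an assumed nominal head width in meters.
    """
    z = focal_px * head_width_m / face_w
    # Back-project the center of the face box to 3D camera coordinates.
    u = face_x + face_w / 2.0
    v = face_y + face_h / 2.0
    x = (u - cx) * z / focal_px
    y = (v - cy) * z / focal_px
    # Radii: half the head width horizontally, scaled by the box aspect
    # ratio vertically; the depth radius is set equal to the horizontal one.
    rx = head_width_m / 2.0
    ry = rx * face_h / face_w
    return {"center": (x, y, z), "radii": (rx, ry, rx)}

# Example: a 100x120 pixel face centered in a 640x480 image, f = 500 pixels.
model = init_head_ellipsoid(270, 180, 100, 120, focal_px=500.0, cx=320.0, cy=240.0)
```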
Our approach for head pose tracking is based on the adaptive view-based appearance model (described in the previous section) and differs from previously published ellipsoid-based head tracking techniques [4] in that we acquire extra key-frames during tracking and adjust the key-frame poses over time. This approach makes it possible to track head pose over a larger range of motion and over a long period of time with bounded drift. The view registration is done using an iterative version of the Normal Flow Constraint [97].
Given the new head pose estimate for the current frame, the region of interest (ROI) around both eyes is updated so that the center of the ROI reflects the observed head motion. The eye tracker will return two ROIs per eye: one from the eye detector (if the eye was detected) and the other from the updated ROI based on the head velocity.
3.2.3 Gaze Estimation
To estimate the eye gaze given a region of interest inside the image, we created two view-based appearance models [77, 68], one model for each eye. We trained the models using the dataset described in Section 3.2.1, which contains eye images of 16 subjects looking at 35 different orientations, ranging over [-30,30] degrees horizontally and [-20,20] degrees vertically.
We define our view-based eigenspace models $P_l$ and $P_r$, for the left and right eye respectively, as:

$$P = \{\bar{I}_i, V_i, \varepsilon_i\}$$

where $\bar{I}_i$ is the mean intensity image for view $i$, $\varepsilon_i$ is the eye gaze of that view and $V_i$ is the eigenspace matrix. The eye gaze is represented as $\varepsilon = [\, R_x \;\; R_y \,]$, a 2-dimensional vector consisting of the horizontal and vertical rotation.

To create the view-based eigenspace models, we first store every segmented eye image in a one-dimensional vector. We can then compute the average vectors $\bar{I}_i = \frac{1}{n} \sum_{j=1}^{n} I_i^j$ and stack all the normalized intensity vectors into a matrix:

$$I_i = \left[\, I_i^1 - \bar{I}_i \;\;\; I_i^2 - \bar{I}_i \;\;\; \cdots \,\right]^T$$

To compute the eigenspaces $V_i$ for each view, we find the SVD decomposition $I_i = U_i D_i V_i^T$.
At recognition time, given a seed region of interest (ROI), our algorithm will search around the seed position for the optimal pose with the lowest reconstruction error $e_i^*$. For each region of interest and each view $i$ of the appearance model, we find the vector $w_i$ that minimizes:

$$e_i = |I_t - \bar{I}_i - w_i \cdot V_i|^2 \qquad (3.6)$$

The lowest reconstruction error $e_i^*$ will be selected and the eye gaze $\varepsilon_i$ associated with the optimal view $i$ will be returned. In our implementation, the search region was [-4,+4] pixels along the X axis and [-2,+2] pixels along the Y axis. Also, different scales for the search region were tested during this process, ranging from 0.7 to 1.3 times the original size of the ROI.
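The model construction and the search described above can be sketched in Python with NumPy. This is a simplified illustration under several assumptions: the function names are hypothetical, the number of retained eigenvectors `k` is not given in the text, eye patches are treated as 16 rows by 32 columns, and the search covers translations only (the scale search from 0.7 to 1.3 is omitted because it would require image resampling).

```python
import numpy as np

def build_view_model(images, gaze, k=2):
    """Build one view {mean image, eigenspace, gaze} from flattened eye images.

    images: array of shape (n, d), one flattened eye image per row.
    k: number of eigenvectors to keep (assumed; not stated in the text).
    """
    X = np.asarray(images, dtype=float)
    mean = X.mean(axis=0)
    # Rows of (X - mean) are the normalized intensity vectors; SVD gives U D V^T.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return {"mean": mean, "basis": vt[:k], "gaze": np.asarray(gaze, dtype=float)}

def reconstruction_error(patch, view):
    """Squared reconstruction error of a patch for one view, as in Eq. (3.6)."""
    diff = np.asarray(patch, dtype=float).ravel() - view["mean"]
    # The basis rows are orthonormal, so the minimizing weights are a projection.
    w = view["basis"] @ diff
    residual = diff - w @ view["basis"]
    return float(residual @ residual)

def estimate_gaze(patch, views):
    """Return (gaze, error) of the view with the lowest reconstruction error."""
    errors = [reconstruction_error(patch, v) for v in views]
    best = int(np.argmin(errors))
    return views[best]["gaze"], errors[best]

def search_gaze(image, roi, views, dx_max=4, dy_max=2):
    """Search translations around a seed ROI (x, y, w, h); scales are omitted."""
    x, y, w, h = roi
    best_gaze, best_err = None, np.inf
    for dy in range(-dy_max, dy_max + 1):
        for dx in range(-dx_max, dx_max + 1):
            x0, y0 = x + dx, y + dy
            # Skip candidate windows that fall outside the image.
            if x0 < 0 or y0 < 0 or x0 + w > image.shape[1] or y0 + h > image.shape[0]:
                continue
            gaze, err = estimate_gaze(image[y0:y0 + h, x0:x0 + w], views)
            if err < best_err:
                best_gaze, best_err = gaze, err
    return best_gaze, best_err
```

Because the rows of the truncated $V_i^T$ are orthonormal, the least-squares weights reduce to a single matrix-vector product, which keeps the per-window cost low during the search.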
Gaze estimation was done independently for each seed ROI returned by the eye tracker described in the previous section. ROIs associated with the left eye are processed using the left view-based appearance model, and ROIs associated with the right eye using the right model. If more than one seed ROI was used, then the eye gaze with the lowest reconstruction error is selected. The final eye gaze is approximated by a simple average of the left and right eye gaze estimates.
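The selection and merging rules in this paragraph can be written as a small Python sketch. The function names and the representation of an estimate as a `(gaze, error)` pair, with gaze given as `(horizontal, vertical)` degrees, are assumptions made for illustration.

```python
def best_seed(estimates):
    """Pick the (gaze, error) pair with the lowest reconstruction error.

    estimates: non-empty list of ((horizontal, vertical), error) tuples,
    one per seed ROI of the same eye.
    """
    return min(estimates, key=lambda item: item[1])

def merge_eyes(left, right):
    """Average the left and right gaze estimates component by component."""
    (lh, lv), _ = left
    (rh, rv), _ = right
    return ((lh + rh) / 2.0, (lv + rv) / 2.0)

# Example: two seed ROIs for the left eye, one for the right eye.
left = best_seed([((10.0, 0.0), 5.0), ((20.0, 0.0), 2.0)])
right = best_seed([((10.0, 10.0), 1.0)])
final_gaze = merge_eyes(left, right)
```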
3.2.4 Experiments
To test the accuracy of our eye gaze estimator we ran a set of experiments using the dataset described earlier in Section 3.2.1. In these experiments, we randomly selected 200 images, then retrained the eye gaze estimator and compared the estimated eye gaze with the ground truth.
We tested two aspects of our estimator: its sensitivity to noise and its performance using different merging techniques. To test our estimator's sensitivity to noise, we added varying amounts of noise to the initial region of interest. Figure 3-9 shows the average error on the eye gaze for varying levels of noise. Our eye gaze estimator is relatively insensitive to noise in the initialized region of interest, maintaining an average error of under 8 degrees for as much as 6 pixels of noise.
We also tested two techniques to merge the left and right eye gaze estimates: (1) picking the eye gaze estimate from the eye with the lowest reconstruction error and (2) averaging the eye gaze estimates from both eyes. Figure 3-9 also summarizes the result of this experiment. We can see that the averaging technique consistently works better than picking the lowest reconstruction error. This result confirms our choice of using the average to compute eye gaze estimates.
[Figure 3-9 plot: average error on eye gaze estimation (in degrees, 0 to 20) versus noise applied to the initial search region (in pixels, 0 to 10); curves: left eye, right eye, merged: lowest score, merged: average.]
Figure 3-9: Average error of our eye gaze estimator when varying the noise added to the initial region of interest.
The eye gaze estimator was applied to the unsegmented video sequences of all 6 human participants from the user study described in Section 2.2.3. Each video sequence lasted approximately 10-12 minutes and was recorded at 30 frames/sec, for a total of 105,743 frames. During these interactions, human participants would rotate their head up to +/-70 degrees around the Y axis and +/-20 degrees around the X axis, and would also occasionally translate their head, mostly along the Z axis. The following eye gesture recognition results are on the unsegmented sequences, including extreme head motion.
It is interesting to look at the qualitative accuracy of the eye gaze estimator for a sample sequence of images. Figure 3-10 illustrates a gaze aversion gesture where the eye gaze estimates are depicted by cartoon eyes and the head tracking result is indicated by a white cube around the head. Notice that the estimator works quite well even if the participant is wearing eye glasses.
Figure 3-10: Eye gaze estimation for a sample image sequence from our user study. The cartoon eyes depict the estimated eye gaze. The cube represents the head pose computed by the head tracker.