The robot is capable of detecting a set of percepts: faces, motion, color segments, and sound. We describe the implementation details of each perceptual subsystem below and present how they are integrated in the next section.
4.3.1 Face Detection and Tracking
In order to be responsive to people’s interaction attempts, MERTZ must be able to detect the presence of humans and track them while they are within its field of view.
In order to detect and track faces, we combine a set of existing face detection and feature tracking algorithms. We will not go into the implementation details, as this information is available in the publication of each original work.
We are using the frontal face detector developed by Paul Viola and Michael Jones [98]. The face detector occasionally finds a false positive face region in certain backgrounds, causing the robot to fixate on the floor, wall, or ceiling. This is especially problematic during long experiments, where there is a lot of down time when no one is around. To address this, we implemented a SIFT-based feature matching module that calculates a sparse disparity measure between the face regions in the images from the left and right cameras. Using an expected ratio of estimated disparity to face size, the module rules out some false faces in the background that are too far away.
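The disparity-based rejection step can be illustrated with a minimal sketch. The function name, the use of the median over the sparse matches, and the `min_ratio` value are assumptions made for illustration; the text does not state the expected ratio or how the sparse disparities are aggregated.

```python
from statistics import median


def is_plausible_face(disparities, face_width, min_ratio=0.1):
    """Return True if a detected face region is close enough to be a real face.

    disparities: horizontal offsets (in pixels) of features matched between
                 the left and right face regions, e.g. from SIFT matches.
    face_width:  width of the detected face region in pixels.
    min_ratio:   hypothetical lower bound on disparity / face width.
    """
    if not disparities or face_width <= 0:
        # Assumption: without any matches the face cannot be confirmed.
        return False
    # A distant background patch yields a small disparity relative to its
    # apparent size, so its ratio falls below the expected value.
    ratio = median(abs(d) for d in disparities) / face_width
    return ratio >= min_ratio
```

For example, a region 80 pixels wide with disparities around 13 pixels passes the check, while the same region with disparities of 1 to 2 pixels is rejected as background.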
Since both people and the robot tend to move around frequently, faces do not remain frontal to the cameras for very long. We therefore use a KLT-based tracker to complement the frontal face detector. The KLT-based tracker was obtained from Stan Birchfield’s implementation [7] and enhanced to track up to five faces simultaneously.
The robot relies on the same-person face tracking module as a stepping stone toward unsupervised face recognition. The idea is that the better and longer the robot can track a person continuously, the more likely that it will collect a good sequence of face images from him/her. A good sequence is one which contains a set of face images of the same person with high variations in pose, facial expression, and other environmental aspects.
Figure 4-2: The same person face tracking algorithm. We have combined the face detector, the KLT-based tracker, and some spatio-temporal assumptions to achieve same-person tracking.
We have combined the face detector, the KLT-based tracker, and some spatio-temporal assumptions to achieve same-person tracking. As shown in figure 4-2, for every detected face the robot activates the KLT-based face tracker for subsequent frames. If the tracker is already active and the tracked and detected regions overlap, the system assumes that they belong to the same person and refines the face location using the newly detected region. If there is no overlap, the system activates a new tracker for the new person. The overlap criterion is also shown in figure 4-2. If the disparity checker catches a false positive detection, the face tracker cancels the corresponding tracking target.
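The association logic described above can be sketched as follows. The overlap measure (intersection area divided by the smaller box area) and the `min_overlap` threshold are assumptions, since the actual criterion is given only in figure 4-2; the limit of five tracks follows the tracker capacity stated earlier. All function and parameter names are hypothetical.

```python
def overlap(a, b):
    """Fraction of the smaller box covered by the intersection.

    Boxes are (x, y, width, height) tuples in pixels.
    """
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = min(ax + aw, bx + bw) - max(ax, bx)
    ih = min(ay + ah, by + bh) - max(ay, by)
    if iw <= 0 or ih <= 0:
        return 0.0
    return (iw * ih) / min(aw * ah, bw * bh)


def update_tracks(tracks, detections, min_overlap=0.5, max_tracks=5):
    """Associate new face detections with active trackers.

    tracks:     dict mapping track id -> last tracked box.
    detections: list of boxes from the face detector in the current frame.
    Returns a new dict of tracks.
    """
    tracks = dict(tracks)
    next_id = max(tracks, default=-1) + 1
    for det in detections:
        match = next(
            (tid for tid, box in tracks.items()
             if overlap(box, det) >= min_overlap),
            None,
        )
        if match is not None:
            # Same person assumed: refine the location with the detection.
            tracks[match] = det
        elif len(tracks) < max_tracks:
            # No overlap with any active tracker: start a new one.
            tracks[next_id] = det
            next_id += 1
    return tracks


def cancel_track(tracks, track_id):
    """Drop a target flagged as a false positive by the disparity check."""
    return {tid: box for tid, box in tracks.items() if tid != track_id}
```

In this sketch a detection that overlaps an existing track never creates a second track, which is what lets a continuous sequence accumulate face images attributed to one person.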
With this algorithm, we make the spatio-temporal assumption that each sequence of tracked faces belongs to the same individual. Of course, this can sometimes be wrong, especially when multiple people are tracked simultaneously. We have observed two failure modes: one where a sequence contains face images of two people, and one where the face image consists mostly or completely of the background region. The first failure happened on multiple occasions when a parent was holding a child. In these cases, the proximity of their faces often confuses the tracker.
4.3.2 Color Segment Tracking
In order to enhance the robot’s person tracking capabilities, we augmented the face detector by tracking the color segment found inside the detected face frame. This is useful when the person’s face is rotated so far that neither the face detector nor the tracker can locate it.
The color segment tracker was developed using the CAMSHIFT (Continuously Adaptive Mean Shift Algorithm) approach [10], obtained from the OpenCV Library [82]. The tracker is initialized by the face detector and follows a tracking algorithm similar to the one described in figure 4-2. This tracker can only track one segment at a time, however. We also implemented an additional module to check for cases where the tracker is lost, which tends to cause the CAMSHIFT algorithm to fixate on background regions. This module performs a simple check that the color histogram intersection of the initial and tracked regions is large enough [93].
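The lost-tracker check can be sketched as a normalized histogram intersection compared against a threshold. The normalization by the initial histogram's total count, the function names, and the `min_intersection` value are assumptions for illustration; the text only states that the intersection must be large enough.

```python
def histogram_intersection(model, candidate):
    """Normalized intersection of two histograms with identical binning.

    Returns a value in [0, 1] when both histograms hold the same number of
    samples; 1.0 means the candidate fully covers the model.
    """
    if len(model) != len(candidate):
        raise ValueError("histograms must have the same number of bins")
    total = sum(model)
    if total == 0:
        return 0.0
    return sum(min(m, c) for m, c in zip(model, candidate)) / total


def tracker_is_lost(initial_hist, tracked_hist, min_intersection=0.5):
    """Flag the tracker as lost when the tracked region no longer
    resembles the region it was initialized on."""
    return histogram_intersection(initial_hist, tracked_hist) < min_intersection
```

When the tracked window drifts onto a wall or floor, its color histogram shares few counts with the initial face histogram, so the intersection drops and the target can be released.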
4.3.3 Motion Detection
Since the robot’s cameras are moving, background differencing is not sufficient for detecting motion. Thus, we have implemented an enhanced version of the motion detector. Following an existing approach for detecting motion with active cameras [33], we use KLT-based tracking to estimate the displacement of background pixels due to robot motion at each frame [87]. Object motion is then detected by looking for an image region whose pixel displacements deviate strongly from those of the rest of the image.
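The idea of separating object motion from camera-induced motion can be sketched as follows. Here the background displacement is estimated as the per-axis median over all tracked features, and a feature is flagged when it deviates from that estimate by more than a fixed pixel threshold. Both choices, along with the function name and threshold value, are assumptions and not necessarily the method of [33] or [87].

```python
from statistics import median


def moving_features(displacements, threshold=3.0):
    """Return indices of tracked features that move independently.

    displacements: list of (dx, dy) per tracked feature between two frames.
    threshold:     hypothetical deviation (pixels) from the background motion.
    """
    if not displacements:
        return []
    # Assumption: most features lie on the static background, so the median
    # displacement approximates the shift caused by the robot's own motion.
    bg_dx = median(dx for dx, _ in displacements)
    bg_dy = median(dy for _, dy in displacements)
    return [
        i for i, (dx, dy) in enumerate(displacements)
        if ((dx - bg_dx) ** 2 + (dy - bg_dy) ** 2) ** 0.5 > threshold
    ]
```

A cluster of flagged features in one part of the image would then indicate a moving object rather than apparent motion caused by the cameras.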
In order to complement the motion detector, we implemented a simple and fast color-histogram-based distance estimator to detect objects that are very close to the robot. We divide the image into four vertical regions and compute a color histogram for each region in both cameras. We then calculate the histogram intersection between corresponding regions of the two cameras [93]. This method, though simple and sparse, is at times effective in detecting objects that are very close to the robot.
This detection is used to allow the robot to back up and protect itself from proximate objects.
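A minimal sketch of the per-region comparison is given below. For brevity it uses single-channel intensity histograms rather than color histograms, and the bin count and function names are hypothetical. How the intersection values map to "very close" is not stated in the text; one plausible reading, assumed here, is that a nearby object appears at very different horizontal positions in the two cameras, so the histograms of a region disagree and its intersection is low.

```python
def strip_histograms(image, n_strips=4, n_bins=8, max_value=256):
    """Histogram each vertical strip of an image.

    image: list of rows, each a list of integer intensities in [0, max_value).
    Returns n_strips histograms of n_bins counts each.
    """
    width = len(image[0])
    hists = [[0] * n_bins for _ in range(n_strips)]
    for row in image:
        for x, value in enumerate(row):
            strip = min(x * n_strips // width, n_strips - 1)
            hists[strip][value * n_bins // max_value] += 1
    return hists


def strip_similarity(left, right, n_strips=4, n_bins=8):
    """Normalized histogram intersection for each pair of matching strips."""
    scores = []
    left_hists = strip_histograms(left, n_strips, n_bins)
    right_hists = strip_histograms(right, n_strips, n_bins)
    for hl, hr in zip(left_hists, right_hists):
        total = sum(hl)
        shared = sum(min(a, b) for a, b in zip(hl, hr))
        scores.append(shared / total if total else 0.0)
    return scores
```

Under the assumption above, a strip whose score falls below some chosen threshold would be treated as containing a proximate object.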
4.3.4 Auditory Perception
The robot’s auditory system consists of a sound detection and localization module, some low-level sound processing, and word recognition. An energy-based sound detection module determines the presence of sound events above a threshold. This threshold value was initially determined empirically, but we quickly found that it did not work well outside the laboratory. The threshold is now set adaptively using a simple mechanism described in section 4.5. Lastly, the robot inhibits its sound detection module when it is speaking.
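Energy-based detection with self-speech inhibition can be sketched as follows. The frame-based mean-square energy, the function name, and the parameters are assumptions for illustration; the adaptive threshold mechanism of section 4.5 is not reproduced here, so the threshold is simply passed in.

```python
def detect_sound(samples, frame_size, threshold, robot_speaking=False):
    """Flag frames whose mean-square energy exceeds a threshold.

    samples:        sequence of audio sample amplitudes.
    frame_size:     number of samples per frame (trailing partial frame ignored).
    threshold:      energy level above which a frame counts as a sound event.
    robot_speaking: when True, detection is inhibited for all frames.
    Returns one boolean per full frame.
    """
    n_frames = len(samples) // frame_size
    if robot_speaking:
        # The robot ignores sound while it is producing speech itself.
        return [False] * n_frames
    flags = []
    for i in range(n_frames):
        frame = samples[i * frame_size:(i + 1) * frame_size]
        energy = sum(s * s for s in frame) / frame_size
        flags.append(energy > threshold)
    return flags
```

An adaptive scheme would replace the fixed `threshold` argument with a value updated from recent background energy.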
In an early experiment, we observed that the robot’s limited visual field of view severely limits its ability to find people using only visual cues. The microphone has a built-in sound localizer and displays the horizontal direction of the sound source using five indicator LEDs. Thus, we now tap into these LEDs on the microphone to obtain the sound source direction. The presence and location of each sound event are immediately sent to the robot’s attention system, allowing the robot to attend to stimuli from a much larger spatial range.
In parallel to this, a separate module also processes each sound input for further speech processing. The robot’s speech recognition system was implemented using CMU Sphinx 2 [65]. This module uses a fixed energy threshold to segment and record sound events, because we would like to record as many sound segments as possible for evaluation purposes.
Each recorded segment is processed redundantly for both phoneme and word recognition, as well as for pitch periodicity detection. First, the pitch periodicity detection is used to extract voiced frames and to remove noise-related errors from the phoneme sequences. The filtered phoneme sequence is then further filtered using TSYLB, a syllabification tool, to rule out subsequences that are unlikely to occur in the English language [31]. Lastly, the final phoneme sequence is used to filter the hypothesized word list. The robot’s behavior system then utilizes this final list to produce speech behaviors, as described in section 4.6.