Chapter 2. Literature review

2.2 Detection

2.2.4 Appearance, model and texture

In addition to shape information, other features such as geometrical and kinematic constraints can be employed for hand and body posture detection. For instance, the hand's articulation does not allow the fingers to rotate in arbitrary directions (at least in a normal hand). Therefore, given the kinematic model, the possible states of the hand can be estimated for a certain shape. Geometrical constraints can be employed for the same purpose. For example, the distance between the hand blob and the face blob can be at most the length of the arm. Hence, once the face has been detected, the search for the hand blob within the image can be limited accordingly, or vice versa. Based on this idea, Urano, Matsui, Nakata and Mizogouchi [69] suggested a body posture recognition system based on outline diameter and higher-order local autocorrelation features of the shape. As the first step, their system uses depth thresholding to detect the person in the scene. In the next step, it uses pre-recorded background information to eliminate some of the unwanted areas not eliminated in the first step. Finally, in the third step, the diameters and higher-order local autocorrelation features are used for template matching. They tested their system on 6 different body postures and reported 76.8% to 100% accuracy in detecting the different body postures. The role of background elimination using a pre-recorded background is not clear in this research. Assuming that it is a necessary pre-processing step, employing this technique could be impractical for two reasons. Firstly, the initial background must be recorded before the detection phase, which requires the intervention of the user or another mechanism. Secondly, environmental changes, such as a change of lighting, make the initial pre-recorded background obsolete, and re-recording the background requires the user to leave the scene for a moment. In our observation, this process is particularly troublesome with cameras equipped with an automatic shutter mechanism: when the user leaves the scene, the camera adjusts its shutter to the new lighting condition, so the intensity of the recorded background often differs from its intensity when the user is present in the scene.
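The arm-length constraint mentioned earlier in this section can be sketched as a simple search-region filter. This is only an illustration under stated assumptions, not the method of [69]: the function names, the arm length expressed in pixels and the image size are all hypothetical.

```python
import math

def hand_search_region(face_center, arm_length_px, image_size):
    """Return the box (x0, y0, x1, y1) in which the hand blob can lie.

    Hypothetical helper: the box encloses the circle of radius
    arm_length_px around the face centre, clipped to the image.
    """
    fx, fy = face_center
    width, height = image_size
    x0 = max(0, int(math.floor(fx - arm_length_px)))
    y0 = max(0, int(math.floor(fy - arm_length_px)))
    x1 = min(width - 1, int(math.ceil(fx + arm_length_px)))
    y1 = min(height - 1, int(math.ceil(fy + arm_length_px)))
    return x0, y0, x1, y1

def is_plausible_hand(face_center, blob_center, arm_length_px):
    """Reject blobs that are farther from the face than the arm length."""
    dx = blob_center[0] - face_center[0]
    dy = blob_center[1] - face_center[1]
    return math.hypot(dx, dy) <= arm_length_px
```

In practice the arm length in pixels would have to be estimated, for example from the size of the detected face; that estimate is outside the scope of this sketch.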

Rosales et al. [70] have addressed the problem of recovering a 3D hand pose from a monocular color sequence. They have proposed a system that tracks the hand and estimates its 3D configuration on every frame. Their approach is based on a probabilistic modeling method called specialized mapping architecture (SMA) which is used for mapping image features to likely 3D hand poses. SMA is related to machine learning models that use the principle of divide-and-conquer to reduce the complexity of the learning problem by splitting it into several similar ones. In general, these algorithms try to fit surfaces to the observed data by (i) splitting the input space into several regions, and (ii) approximating simpler functions to fit the

segmenting of the hands, they used the similarity between the colors of the face and the hands. They trained a feed-forward neural network with 5 hidden layers on a dataset of 8000 synthetically generated hand images. The main advantage of this algorithm is its linear growth rate, O(M), where M is the number of specialized functions. Employing divide-and-conquer for shape matching is an interesting technique. However, its implementation has side effects on shape detection. Specifically, in their implementation the detector is sensitive to rotation and may therefore fail on a rotated hand posture.
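The divide-and-conquer principle behind this family of models, splitting the input space into regions and fitting a simpler function in each, can be illustrated with a toy example. This is not the SMA of [70]: the equal-width regions, the one-dimensional input, the least-squares lines and all function names are assumptions chosen to keep the sketch short.

```python
def fit_line(points):
    """Least-squares line y = a*x + b through a list of (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    if var_x == 0:
        return 0.0, mean_y  # all x equal: fall back to a constant
    a = sum((x - mean_x) * (y - mean_y) for x, y in points) / var_x
    return a, mean_y - a * mean_x

def region_index(x, lo, hi, m):
    """Index of the equal-width region of [lo, hi] that contains x."""
    i = int((x - lo) / (hi - lo) * m)
    return min(max(i, 0), m - 1)

def fit_piecewise(points, lo, hi, m):
    """Fit one simple function per region (assumes no region is empty)."""
    buckets = [[] for _ in range(m)]
    for x, y in points:
        buckets[region_index(x, lo, hi, m)].append((x, y))
    return [fit_line(bucket) for bucket in buckets]

def predict(models, x, lo, hi):
    """Route x to its region and evaluate that region's line."""
    a, b = models[region_index(x, lo, hi, len(models))]
    return a * x + b

# y = |x| is not a line, but each half of the input space is.
data = [(x, abs(x)) for x in (-1.0, -0.5, -0.25, 0.25, 0.5, 1.0)]
models = fit_piecewise(data, -1.0, 1.0, 2)
```

With two regions, each half of the absolute-value function is fitted by its own line, which a single global line could not do.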

Poppe et al. [71] described a vision-based approach for body pose estimation in video sequences in the context of a meeting environment. In the first step, the silhouette of the body is extracted using a frame subtraction technique. In the next step, skin color segmentation based on the HS (Hue-Saturation) color space is used to separate the face and hands of the speaker from the silhouette of the body. Finally, the locations of the elbows and knees are calculated using inverse kinematics and silhouette matching.
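To illustrate how inverse kinematics can locate a joint such as the elbow, the following sketch solves the planar two-link case. It is a simplification made for this example, not the procedure of [71]: the 2D setting, the known limb lengths and the function name are assumptions.

```python
import math

def elbow_candidates(shoulder, hand, upper_arm, forearm):
    """Return the two possible 2D elbow positions, or [] if unreachable.

    The elbow lies on the intersection of a circle of radius upper_arm
    around the shoulder and a circle of radius forearm around the hand.
    """
    sx, sy = shoulder
    hx, hy = hand
    dx, dy = hx - sx, hy - sy
    d = math.hypot(dx, dy)
    if d == 0 or d > upper_arm + forearm or d < abs(upper_arm - forearm):
        return []
    # Distance from the shoulder, along the shoulder-hand line, to the
    # point directly below (or above) the elbow.
    a = (upper_arm ** 2 - forearm ** 2 + d ** 2) / (2 * d)
    h = math.sqrt(max(0.0, upper_arm ** 2 - a ** 2))
    mx, my = sx + a * dx / d, sy + a * dy / d
    return [(mx + h * dy / d, my - h * dx / d),
            (mx - h * dy / d, my + h * dx / d)]
```

The geometry alone yields two mirror-image solutions, so a further cue, such as the extracted silhouette, is needed to choose between them.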

Employing the kinematic model together with other cues may provide a more robust framework for hand and body posture detection, but it also raises the complexity and the computational cost. In this context, Loutas et al. [72] proposed a mutual information approach for articulated object tracking based on a similarity measure. The measure is calculated on the tracked object image or, alternatively, on the tracked object texture map accompanied by a confidence map. The use of the object texture map was found to improve the tracker's performance. Articulation constraints are imposed through a kinematic model on the tracker search range and initial conditions, based on the anatomy and the kinematic capabilities of each joint.
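Mutual information as a similarity measure between two image regions can be sketched as follows. This is a generic histogram-based estimate, not the specific measure of [72]: the flat-list patch representation, the number of bins and the function name are assumptions made for the example.

```python
import math
from collections import Counter

def mutual_information(patch_a, patch_b, bins=8, max_value=256):
    """Mutual information (in bits) of two equally sized grey-level patches.

    Patches are flat lists of integers in [0, max_value). Intensities are
    quantised into `bins` levels before the joint histogram is built.
    """
    if len(patch_a) != len(patch_b) or not patch_a:
        raise ValueError("patches must be non-empty and of equal size")

    def quantize(value):
        return value * bins // max_value

    pairs = [(quantize(a), quantize(b)) for a, b in zip(patch_a, patch_b)]
    n = len(pairs)
    joint = Counter(pairs)
    marginal_a = Counter(a for a, _ in pairs)
    marginal_b = Counter(b for _, b in pairs)

    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        # p_ab / (p_a * p_b), written with counts to avoid extra divisions
        ratio = count * n / (marginal_a[a] * marginal_b[b])
        mi += p_ab * math.log2(ratio)
    return mi
```

A tracker built on such a measure would evaluate it for candidate positions inside the search range and keep the position with the highest value; the measure is zero when one patch tells nothing about the other.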

Lu et al. [73] proposed a model-based integration of visual cues for hand tracking. They used multiple sources of information, namely edges, optical flow and shading. A hand in their model consists of a base link (the palm) and five chains (the fingers) connected to the base link through five two-degree-of-freedom revolute joints. Finger parts are modeled as cylinders and the palm is modeled as a solid with six rectangular sides.
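The structure of such a hand model can be written down as a small data structure. The sketch below only mirrors the description above; the class names, the dimensions and the choice of three cylinders per finger are assumptions and are not taken from [73].

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Cylinder:
    length: float
    radius: float

@dataclass
class RevoluteJoint2DOF:
    flexion: float = 0.0    # radians
    abduction: float = 0.0  # radians

@dataclass
class Finger:
    """One chain: a two-DOF joint at the palm plus cylindrical parts."""
    base_joint: RevoluteJoint2DOF = field(default_factory=RevoluteJoint2DOF)
    segments: List[Cylinder] = field(default_factory=list)

@dataclass
class Hand:
    palm_size: Tuple[float, float, float]  # box: width, height, depth
    fingers: List[Finger] = field(default_factory=list)

def make_hand(segments_per_finger=3):
    """Build a hand with five fingers; all dimensions are placeholders."""
    fingers = [
        Finger(segments=[Cylinder(length=3.0, radius=0.8)
                         for _ in range(segments_per_finger)])
        for _ in range(5)
    ]
    return Hand(palm_size=(8.0, 9.0, 2.5), fingers=fingers)

hand = make_hand()
base_dof = 2 * len(hand.fingers)  # five two-DOF joints at the palm
```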

In summary, the applications of model-based hand posture detection are still limited, mostly due to the computational cost of employing the inverse kinematic model. In addition, the approach requires studying the kinematic model of the object in advance, which is itself a time-consuming task. However, since the kinematic model of the hand and body is almost the same for all people, it can be used as a framework for other research. This technique is expected to be employed more frequently as more powerful hardware becomes available.