6.4 Support Vector Machine
6.4.4 Multiclass Kernel Machines
Inherently, the SVM is a binary classifier. When there are K > 2 classes, the common approach is one-vs-rest, which reduces the problem to a set of binary classifiers. In one-vs-rest, each class is trained against all other classes combined, so K support vector machines are learned. In training, examples of class $C_i$ are labelled +1 and examples of all other classes $C_k$, $k \neq i$, are labelled -1, whilst in testing all $g_i(x)$, $i = 1, \ldots, K$, are calculated and the class with the maximum output is chosen. In another approach, instead of building K two-class SVM classifiers that each separate one class from all the rest, a one-against-one (pair-wise separation) multiclass SVM is used. For K > 2 classes, $K(K-1)/2$ pair-wise classifiers are built, with each $g_{ij}(x)$ taking examples of $C_i$ with the label +1 and examples of $C_j$ with the label -1, and not using examples of the other classes. Separating classes in pairs is generally an easier task and has the additional advantage of faster optimization, because each classifier uses less data.
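The two decompositions can be illustrated with a minimal Python sketch. The function names and toy labels are hypothetical and are not taken from the implementation used in this dissertation. Majority voting is assumed for combining the pair-wise outputs, since the text above does not specify a combination rule.

```python
from itertools import combinations


def one_vs_rest_labels(y, classes):
    """For each class c, relabel the data: +1 for c, -1 for every other class."""
    return {c: [1 if label == c else -1 for label in y] for c in classes}


def one_vs_one_subsets(y, classes):
    """For each pair (ci, cj), keep only the examples of those two classes.

    Returns {(ci, cj): [(index, label), ...]} with +1 for ci and -1 for cj.
    """
    subsets = {}
    for ci, cj in combinations(classes, 2):
        subsets[(ci, cj)] = [
            (t, 1 if label == ci else -1)
            for t, label in enumerate(y)
            if label in (ci, cj)
        ]
    return subsets


def predict_one_vs_rest(scores):
    """scores: {class: g_i(x)}. Pick the class with the largest output."""
    return max(scores, key=scores.get)


def predict_one_vs_one(pair_scores, classes):
    """pair_scores: {(ci, cj): g_ij(x)}. Each pair-wise classifier casts one
    vote (assumed rule: majority voting)."""
    votes = {c: 0 for c in classes}
    for (ci, cj), g in pair_scores.items():
        votes[ci if g > 0 else cj] += 1
    return max(votes, key=votes.get)


# Toy labels with K = 4 classes (illustrative only).
y = ["a", "b", "c", "a", "c", "d"]
classes = sorted(set(y))
ovr = one_vs_rest_labels(y, classes)   # K binary problems
ovo = one_vs_one_subsets(y, classes)   # K(K-1)/2 binary problems
print(len(ovr), len(ovo))
```

With K = 4 the sketch builds four one-vs-rest problems and six pair-wise problems, and each pair-wise problem sees only the examples of its two classes.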
In general, both one-vs-rest and pair-wise separation are special cases of error-correcting output codes [214], which decompose a multiclass problem into a set of two-class problems.
In yet another approach, Weston and Watkins [215] proposed writing a single multiclass optimization problem involving all classes:

$$\min \; \frac{1}{2}\sum_{i=1}^{K} \|w_i\|^2 + C \sum_{i}\sum_{t} \xi_i^t \tag{6.52}$$

subject to

$$w_{z^t}^T x^t + w_{z^t 0} \geq w_i^T x^t + w_{i0} + 2 - \xi_i^t, \quad \forall i \neq z^t, \quad \text{and} \quad \xi_i^t \geq 0,$$

where $z^t$ contains the class index of $x^t$ and C is the usual regularization parameter.
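To make the roles of the margin constant 2 and the slack variables in (6.52) concrete, the following Python sketch evaluates the Weston and Watkins objective for a fixed set of weights. It sets each slack to the smallest value that satisfies its constraint. The function name, the 0-based class indices and the toy numbers are assumptions made for illustration. The sketch only evaluates the objective and is not a solver.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def weston_watkins_objective(W, b, X, z, C):
    """Evaluate objective (6.52) with every slack at its smallest feasible value.

    W: list of K weight vectors w_i
    b: list of K bias terms w_i0
    X: list of input vectors x^t
    z: list of class indices z^t (0-based here, an assumption of this sketch)
    C: regularization parameter
    """
    regularizer = 0.5 * sum(dot(w, w) for w in W)
    total_slack = 0.0
    for x, zt in zip(X, z):
        g = [dot(w, x) + b0 for w, b0 in zip(W, b)]
        for i, gi in enumerate(g):
            if i != zt:
                # Constraint: g[zt] >= g[i] + 2 - slack, with slack >= 0.
                total_slack += max(0.0, gi + 2.0 - g[zt])
    return regularizer + C * total_slack


# Toy problem: K = 3 classes, two-dimensional inputs (illustrative values).
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b = [0.0, 0.0, 0.0]
X = [[3.0, 0.0], [0.0, 3.0], [1.0, 1.0]]
z = [0, 1, 2]
print(weston_watkins_objective(W, b, X, z, C=1.0))
```

In this toy problem the first two examples satisfy every constraint with zero slack. The third violates the constraints against both other classes, so it is the only example that adds to the penalty term.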
The multiclass SVM implementation used in this dissertation is publicly available and is based on the multiclass formulation described by Crammer and Singer [216], an enhanced version of that of Weston and Watkins [215]. To solve the optimization problem, SVMmulticlass uses an algorithm based on structural SVMs [217]. As far as the author is aware, this is the first time a fully implemented multiclass classification has been attempted in visual speech recognition.
6.5 Summary
In this chapter, visual features used for visual speech recognition are reviewed. It is hypothesized that appearance based features provide a better representation of visual speech compared to shape based and model based approaches. Appearance based features also do not require further localization of lip features throughout the image sequence as in contour and combination based techniques. Advantages and limitations of both the appearance based and shape based features are discussed.
Computer based lip-reading studies have indicated that the important visual information lies in the temporal change of the mouth [194], and that motion features are more discriminative than static features for computer based lip-reading [127]. Based on these studies, this research uses appearance based motion features computed by optical flow estimation.
Optical flow based DMHIs are developed. To represent these spatio-temporal templates, global internal region based descriptors are selected. In this research, two region based feature descriptors, Zernike moments (ZM) and Hu moments (HM), are evaluated. ZM are orthogonal moments which are capable of reflecting the shape and intensity distribution of DMHIs. Both ZM and HM have good rotation invariance properties and are therefore invariant to changes of mouth orientation in the images. The number of ZMs is determined empirically, while the 7 HMs are computed from each DMHI.
Finally, the chapter concludes with a thorough discussion of the SVM classifier. SVM is a discriminative classifier that classifies features without requiring a priori information about the data. It is able to find a globally optimal solution. The discussion includes the theory behind the SVM binary and multiclass classifiers, with details of the optimal separating hyper-plane and of the linearly non-separable case with soft margin separation using slack variables. SVM kernels are also described, such as the linear, polynomial, radial basis function and sigmoid kernels.
Chapter 7
Experimental Results
This chapter reports on the experiments conducted to evaluate the performance of motion templates computed by the optical flow vertical component and by the directional motion history images (DMHIs) technique, which is also based on optical flow.
The experimental work consists of two parts. Section 7.1 reports solely on the technique based on the optical flow vertical component, investigating viseme classification in terms of accuracy, sensitivity and specificity. The vertical component of optical flow contains most of the information of a viseme utterance. However, to capture the mouth motion while a subject smiles or laughs, the horizontal component cannot be ignored. The optical flow vertical component is divided into multiple non-overlapping blocks, and the statistical features of each block are used as the features of an utterance. These features were classified using a support vector machine classifier. For recognition and further performance evaluation of the proposed features, SVM multiclass classification is performed. The performance of multiple block sizes was evaluated empirically. The detailed theoretical framework of the feature extraction and classification techniques has been explained in Chapters 4 and 6 respectively.
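The block-based feature extraction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the dissertation's implementation. The per-block statistics are assumed to be the mean and the population standard deviation, because the statistics are not named here. Rows and columns that do not fill a complete block are assumed to be dropped. The function name and the toy flow values are hypothetical.

```python
from statistics import mean, pstdev


def block_features(flow_v, block_h, block_w):
    """Split a 2-D array (list of rows) into non-overlapping blocks and
    return [mean, std] for every block, concatenated in row-major block order.

    Assumptions of this sketch: mean and population standard deviation are
    the per-block statistics, and trailing rows/columns that do not fill a
    whole block are discarded.
    """
    rows, cols = len(flow_v), len(flow_v[0])
    features = []
    for r in range(0, rows - block_h + 1, block_h):
        for c in range(0, cols - block_w + 1, block_w):
            block = [flow_v[i][j]
                     for i in range(r, r + block_h)
                     for j in range(c, c + block_w)]
            features.extend([mean(block), pstdev(block)])
    return features


# Toy 4x4 vertical-flow field split into four 2x2 blocks (illustrative values).
flow_v = [
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
    [2.0, 4.0, -1.0, -1.0],
    [2.0, 4.0, -1.0, -3.0],
]
feats = block_features(flow_v, 2, 2)
print(len(feats))  # two statistics for each of the four blocks
```

The block size trades spatial detail against feature-vector length: halving both block dimensions quadruples the number of features, which is one reason the block size would need to be chosen empirically.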
Section 7.2 describes the DMHI based viseme classification. The two types of image features examined for DMHIs were Zernike moments (ZM) and Hu moments (HM). These features were classified using an SVM classifier. The detailed theoretical description of ZM and HM has been given in Chapter 6. In addition, the proposed DMHI technique is compared with traditional motion history images. For better representation of the results, performance is reported in terms of accuracy, specificity and sensitivity. All experiments in this dissertation were conducted using a leave-one-out mechanism.
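The evaluation protocol can be illustrated with a short Python sketch. It makes two assumptions that the text above does not state: the unit left out in each round is a single sample, and the multiclass metrics are computed per class by treating that class as positive and all others as negative. The function names and toy labels are hypothetical.

```python
def leave_one_out_splits(n):
    """Yield (train_indices, test_index) so that each sample is held out once."""
    for held_out in range(n):
        yield [i for i in range(n) if i != held_out], held_out


def per_class_metrics(y_true, y_pred, positive):
    """Accuracy, sensitivity and specificity with `positive` treated as the
    positive class and every other class as negative (assumed convention)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
    }


# Toy predictions collected over all leave-one-out rounds (illustrative only).
y_true = ["a", "a", "b", "b", "c", "c"]
y_pred = ["a", "b", "b", "b", "c", "a"]
metrics_a = per_class_metrics(y_true, y_pred, positive="a")
splits = list(leave_one_out_splits(3))
print(metrics_a)
```

Sensitivity is the fraction of true examples of a viseme that are recognized, and specificity is the fraction of examples of the other visemes that are correctly rejected, so the two together show errors that accuracy alone can hide.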