
Use of SVM – A Process-Based Approach

Largely because of their better classification results, support vector machines (SVMs) have recently become a popular technique for classification problems. Although they are considered easier to use than artificial neural networks, users who are not familiar with the intricacies of SVMs often get unsatisfactory results. In this section we provide a process-based approach to the use of SVMs that is more likely to produce good results.

• Preprocess the data
  - Scrub the data
    ▪ Deal with the missing values
    ▪ Deal with the presumably incorrect values
    ▪ Deal with the noise in the data
  - Transform the data
    ▪ Numerisize the data
    ▪ Normalize the data

• Develop the model(s)
  - Select kernel type (RBF is a natural choice)
  - Determine kernel parameters based on the selected kernel type (e.g., C and γ for RBF) – A hard problem. One should consider using cross-validation and experimentation to determine the appropriate values for these parameters.
  - If the results are satisfactory, finalize the model, otherwise change the kernel type and/or kernel parameters to achieve the desired accuracy level.

• Extract and deploy the model.

The process model is depicted in Fig. 7.3. Some of the important steps in the process listed above are described briefly below.

Numerisizing the Data: SVMs require that each data instance be represented as a vector of real numbers. Hence, if there are categorical attributes, we first have to convert them into numeric data. We recommend using m numbers to represent an m-category attribute. Only one of the m numbers is one, and the others are zero. For example, a three-category attribute such as {red, green, blue} can be represented as (0,0,1), (0,1,0), and (1,0,0).
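The m-number encoding described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original procedure: the function name one_hot and the category order in COLORS are assumptions, and the order is chosen only so that the codes match the example in the text.

```python
def one_hot(value, categories):
    """Encode one categorical value as len(categories) numbers, exactly one of which is 1."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]


# Hypothetical category order, chosen so the codes match the example in the text.
COLORS = ["blue", "green", "red"]

# Map each colour name to its three-number code.
encoded = {color: one_hot(color, COLORS) for color in ("red", "green", "blue")}
print(encoded)
```

Any fixed category order works, as long as the same order is used for the training data and for every later data set.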


[Figure 7.3 (flow diagram): Raw data; Pre-Process the Data (scrub the data: missing, incorrect, and noisy values; transform the data: numerisize, normalize); Pre-processed data; Develop the Model(s) (select the kernel type: Radial Basis Function (RBF), sigmoid, polynomial, etc.; determine the kernel parameters: use of v-fold cross-validation, employ "grid-search"); Validated SVM model; Deploy the Model (extract the model coefficients; code the trained model into the decision support system; monitor and maintain the model); Decision model. Feedback path: re-process the data.]

Fig. 7.3. A process description for SVM model development

Normalizing the Data (as is the case for ANN): Scaling the data before applying SVM is very important. Sarle explained why we scale data when using neural networks, and most of those considerations also apply to SVM.7 The main advantage is to prevent attributes with greater numeric ranges from dominating those with smaller numeric ranges. Another advantage is to avoid numerical difficulties during the calculation. Because kernel values usually depend on the inner products of feature vectors (e.g., the linear kernel and the polynomial kernel), large attribute values might cause numerical problems. We recommend linearly scaling each attribute to the range [-1, +1] or [0, 1]. Of course, we have to use the same method to scale the testing data before testing. For example, suppose that we scaled the first attribute of the training data from [-10, +10] to [-1, +1]. If the first attribute of the testing data lies in the range [-11, +8], we must scale the testing data to [-1.1, +0.8].
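The scaling rule above can be illustrated with a short Python sketch. The helper name fit_linear_scaler and the sample values are assumptions made for illustration. The point it shows is that the minimum and maximum are taken from the training data only and are then reused unchanged on the testing data.

```python
def fit_linear_scaler(train_column, lo=-1.0, hi=1.0):
    """Return a function that maps the training range [min, max] linearly onto [lo, hi]."""
    col_min, col_max = min(train_column), max(train_column)
    if col_min == col_max:
        raise ValueError("constant attribute: cannot be scaled linearly")

    def scale(x):
        return lo + (hi - lo) * (x - col_min) / (col_max - col_min)

    return scale


train_attr = [-10.0, -2.5, 0.0, 10.0]  # illustrative training values spanning [-10, +10]
test_attr = [-11.0, 8.0]               # illustrative testing values spanning [-11, +8]

scale = fit_linear_scaler(train_attr)  # min and max come from the training data only
scaled_train = [scale(x) for x in train_attr]
scaled_test = [scale(x) for x in test_attr]  # approximately -1.1 and 0.8
print(scaled_train, scaled_test)
```

Scaled testing values may fall outside [-1, +1], as -1.1 does here. That is expected and should not be "corrected" by refitting the scaler on the testing data.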

7 W.S. Sarle, editor (1997), Neural Network FAQ, part 1 of 7: Introduction, periodic posting to the Usenet newsgroup comp.ai.neural-nets, URL: ftp://ftp.sas.com/pub/neural/FAQ.html.


8 S.S. Keerthi, C.-J. Lin (2003). Asymptotic behaviors of support vector machines with Gaussian kernel, Neural Computation 15:7, 1667–1689.

9 K.-M. Lin, C.-J. Lin (2003). A study on reduced support vector machines, IEEE Transactions on Neural Networks 14:6, 1449–1459.

Select the Kernel Model: Though there are only four common kernels, we must decide which one to try first. Then the penalty parameter C and the kernel parameters are chosen. We suggest that, in general, RBF is a reasonable first choice. The RBF kernel nonlinearly maps samples into a higher-dimensional space, so, unlike the linear kernel, it can handle the case in which the relation between class labels and attributes is nonlinear. Furthermore, the linear kernel is a special case of RBF: Keerthi and Lin show that the linear kernel with a penalty parameter C has the same performance as the RBF kernel with some parameters (C, γ).8 In addition, the sigmoid kernel behaves like RBF for certain parameters.9 The second reason is the number of hyperparameters, which influences the complexity of model selection. The polynomial kernel has more hyperparameters than the RBF kernel. Finally, the RBF kernel has fewer numerical difficulties.

Cross-Validation and Grid Search: There are two parameters when using RBF kernels: C and γ. It is not known beforehand which C and γ are best for a given problem; consequently, some kind of model selection (parameter search) must be done. The goal is to identify a good (C, γ) so that the classifier can accurately predict unknown data (i.e., testing data). Note that it may not be useful to achieve high training accuracy (i.e., a classifier that accurately predicts training data whose class labels are already known). Therefore, a common approach is to separate the training data into two parts, one of which is treated as unknown when training the classifier. The prediction accuracy on this held-out part then more precisely reflects the performance on classifying unknown data. An improved version of this procedure is cross-validation. In v-fold cross-validation, we first divide the training set into v subsets of equal size. Sequentially, one subset is tested using the classifier trained on the remaining v - 1 subsets. Thus, each instance of the whole training set is predicted once, so the cross-validation accuracy is the percentage of data that are correctly classified. The cross-validation procedure can prevent the overfitting problem.

Another recommendation for dealing with the overfitting problem is a “grid-search” on C and γ using cross-validation. Basically, pairs of (C, γ) are tried and the one with the best cross-validation accuracy is picked. We found that trying exponentially growing sequences of C and γ is a practical method to identify good parameters (for example, C = 2^-5, 2^-3, ..., 2^15 and γ = 2^-15, 2^-13, ..., 2^3). The grid-search is straightforward but seems naive. In fact, there are several advanced methods which can save computational cost by, for example,

In some situations, the proposed procedure is not good enough, so other techniques such as feature selection may be needed. Such issues are beyond our consideration here. Our experience indicates that the procedure works well for data that do not have many features. If there are thousands of attributes, it may be necessary to choose a subset of them before giving the data to the SVM.
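The grid search over (C, γ) with v-fold cross-validation described earlier can be sketched in plain Python. This is an illustration under stated assumptions, not the procedure's reference implementation. The trainer is a simplified, bias-free kernel SVM fitted by dual coordinate ascent, used only as a stand-in for a real SVM package. The toy data are made up, and all function names (rbf, train, predict, cv_accuracy, grid_search) are hypothetical. The RBF kernel is taken to be exp(-γ ||a - b||^2).

```python
import math
import random


def rbf(a, b, gamma):
    """RBF kernel: exp(-gamma * ||a - b||^2)."""
    return math.exp(-gamma * sum((p - q) ** 2 for p, q in zip(a, b)))


def train(X, y, C, gamma, epochs=10):
    """Simplified, bias-free kernel SVM fitted by dual coordinate ascent (stand-in trainer)."""
    n = len(X)
    K = [[rbf(X[i], X[j], gamma) for j in range(n)] for i in range(n)]
    alpha = [0.0] * n
    for _ in range(epochs):
        for i in range(n):
            f_i = sum(alpha[j] * y[j] * K[i][j] for j in range(n))
            # K[i][i] is 1 for RBF, so the exact coordinate step is the gradient
            # 1 - y_i * f_i, clipped to the box constraint 0 <= alpha_i <= C.
            alpha[i] = min(C, max(0.0, alpha[i] + 1.0 - y[i] * f_i))
    return alpha


def predict(X, y, alpha, gamma, x):
    """Classify x by the sign of the kernel expansion over the training points."""
    score = sum(a * t * rbf(s, x, gamma) for a, t, s in zip(alpha, y, X))
    return 1 if score >= 0 else -1


def cv_accuracy(X, y, C, gamma, v, seed=0):
    """v-fold cross-validation: every instance is predicted exactly once."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::v] for k in range(v)]
    correct = 0
    for fold in folds:
        held_out = set(fold)
        rest = [i for i in idx if i not in held_out]
        X_tr, y_tr = [X[i] for i in rest], [y[i] for i in rest]
        alpha = train(X_tr, y_tr, C, gamma)
        correct += sum(predict(X_tr, y_tr, alpha, gamma, X[i]) == y[i] for i in fold)
    return correct / len(X)


def grid_search(X, y, v=4, c_exps=range(-5, 16, 2), g_exps=range(-15, 4, 2)):
    """Try C = 2^-5, 2^-3, ..., 2^15 and gamma = 2^-15, 2^-13, ..., 2^3; keep the best pair."""
    best = (-1.0, None, None)
    for ce in c_exps:
        for ge in g_exps:
            acc = cv_accuracy(X, y, 2.0 ** ce, 2.0 ** ge, v)
            if acc > best[0]:
                best = (acc, 2.0 ** ce, 2.0 ** ge)
    return best


# Toy data (made up for illustration): two Gaussian clusters labelled +1 and -1.
rng = random.Random(1)
X = [[rng.gauss(c, 0.5), rng.gauss(c, 0.5)] for c in (1.0, -1.0) for _ in range(12)]
y = [1] * 12 + [-1] * 12

best_acc, best_C, best_gamma = grid_search(X, y)
print(best_acc, best_C, best_gamma)
```

The same fold assignment is reused for every (C, γ) pair, so the cross-validation accuracies are directly comparable. In practice the train and predict functions would be replaced by calls to an established SVM library, and the attributes would be scaled before the search.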

Support Vector Machines versus Artificial