The exponential growth in the amount of digital information has significantly increased the need to access the right, relevant information and the desire to organize and categorize data. Automatic classification, the assignment of instances (e.g., pictures, text documents, DNA sequences, Web sites) to one or more predefined categories based on their content, has therefore become a very important field of machine learning research. One popular machine learning algorithm for automatic data classification is the Support Vector Machine (SVM), owing to its strong theoretical foundation and good generalization performance. However, SVM has not seen widespread adoption in communities that work with very large datasets because of the high computational cost of solving the quadratic programming problem in the training phase of the learner. This thesis addresses the scalability problem and the issues that stem from class imbalance and noisy data in SVM.
Specifically, we propose algorithms and approaches that enable SVM to (i) scale to very large datasets with online and active learning, (ii) yield improved computational efficiency and prediction performance in the classification of imbalanced datasets, and (iii) achieve faster learning and sparser solutions in noisy datasets without sacrificing classification accuracy, by showing the benefits of using non-convex optimization for online SVM.
This thesis does not attempt to solve a specific problem in a particular domain (e.g., text, image, speech), but addresses the problem of computational learning in a general way, with the ultimate goal of generalization with sparse models and scalable solutions. The particular focus is on the Support Vector Machine algorithm for classification problems, but the proposed approaches to active and online learning are also readily extensible to other widely used machine learning algorithms for various tasks. Several application areas may benefit from the methods and approaches outlined in this thesis: medical and health care informatics, computational medicine, web page categorization, decision automation systems in business and engineering, machine fault detection in industry, and the analysis of space telescope data in astronomy are only some examples.
LASVM – An Online SVM Algorithm for Massive and Streaming Data: The sizes of datasets are quickly outgrowing the computing power of our computers. Over the last decade of advances in computer hardware technology, hard disk capacities became a thousand times larger, but processors became only a hundred times faster. We therefore need faster machine learning algorithms in order to make computers learn more efficiently from experimental and historical data. Moreover, in many domains, data now arrives faster than we are able to learn from it. To avoid wasting this data, we must switch from traditional batch machine learning algorithms to online systems that are able to mine streaming, high-volume, open-ended data as it arrives. We present a fast online Support Vector Machine classifier algorithm, LASVM [2], that can handle a continuous stream of new data and is substantially faster than the classical (batch) SVM and other online SVM algorithms, while preserving the classification accuracy of state-of-the-art SVM solvers. The speed improvement and the lower memory demand of the online learning setting make SVM applicable to very large datasets. As an application to a real-world system, we developed a name disambiguation framework for CiteSeer that uses LASVM as a distance function to determine the similarity of different author metadata records, followed by a clustering step that determines unique authors based on the LASVM-based similarity metric. Applied to the entire CiteSeer repository of more than 700,000 papers, the algorithm efficiently identified and disambiguated close to 420,000 unique authors.
Active Learning and VIRTUAL for Class Imbalance Problem: The class imbalance problem occurs when there are significantly fewer observations of the target concept than of the other classes. Standard machine learning algorithms, however, yield better prediction performance with balanced datasets. We demonstrate that active learning is capable of solving the class imbalance problem by providing the learner with more balanced classes [3]. The proposed method also yields an efficient querying system and removes the barriers to applying active learning to very large scale datasets. With small pool active learning [3], we show that we do not need to search the entire dataset to find the informative instances. We also propose a hybrid algorithm, called VIRTUAL (Virtual Instances Resampling Technique Using Active Learning) [4], that integrates oversampling and active learning methods to form an adaptive technique for resampling minority class instances. VIRTUAL is more efficient in generating new synthetic instances and has a shorter training time than other oversampling techniques, due to its adaptive nature and its efficient decision capability in creating virtual instances.
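The small pool idea can be illustrated with a minimal sketch. It assumes a linear decision function and a margin-based selection rule (query the sampled instance closest to the hyperplane). The function name `select_from_small_pool`, the parameter `pool_size`, and the toy data are hypothetical and chosen only for illustration; they are not taken from the thesis.

```python
import numpy as np

def select_from_small_pool(w, b, X_unlabeled, pool_size, rng):
    """Pick one query instance from a small random pool (illustrative sketch).

    Instead of scanning all unlabeled instances, draw `pool_size` of them at
    random and return the index of the one closest to the hyperplane
    w.x + b = 0, i.e. the one the current model is least certain about.
    """
    n = X_unlabeled.shape[0]
    # Sample candidate indices without replacement.
    pool = rng.choice(n, size=min(pool_size, n), replace=False)
    # Absolute decision values are proportional to distance from the hyperplane.
    margins = np.abs(X_unlabeled[pool] @ w + b)
    return int(pool[np.argmin(margins)])

# Toy usage with assumed values for the model and the data.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
w, b = np.array([1.0, -1.0]), 0.0
idx = select_from_small_pool(w, b, X, pool_size=50, rng=rng)
```

Each query costs a number of decision-function evaluations equal to the pool size rather than the dataset size, which is what makes this style of selection practical on very large datasets.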
Online Non-Convex Support Vector Machines: Databases often contain noise in the form of inaccuracies, inconsistencies and false labeling. Noise in the data is notorious for degrading the prediction performance and computational efficiency of machine learning algorithms. We design an online non-convex Support Vector Machine algorithm (LASVM-NC) [5], which has a strong ability to suppress the influence of outliers (mislabeled examples). Then, again in the online learning setting, we propose an outlier filtering mechanism based on approximating non-convex behavior in convex optimization (LASVM-I) [5]. These two algorithms are built upon another novel SVM algorithm (LASVM-G) [5] that is capable of generating accurate intermediate models in its iterative steps by leveraging the duality gap. We argue that, despite the many advantages of convex modeling, the price we pay for insisting on convexity is a larger model and poorer scaling properties of the algorithm. We show that non-convexity can be very effective for achieving sparse and scalable solutions, particularly when the data contains abundant label noise.
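To make the contrast between convex and non-convex losses concrete, the sketch below compares the standard hinge loss with the ramp loss, a commonly used non-convex alternative. Treating the ramp loss as the non-convex loss here is an assumption for illustration; this section does not state which loss LASVM-NC uses. The parameter `s` and the sample margins are likewise hypothetical.

```python
import numpy as np

def hinge_loss(z):
    """Convex hinge loss on the margin z = y * f(x); grows without bound as z decreases."""
    return np.maximum(0.0, 1.0 - z)

def ramp_loss(z, s=-1.0):
    """Non-convex ramp loss: the hinge loss clipped at 1 - s (assumed s < 1).

    Examples with margin below s all incur the same bounded penalty, so a
    badly misclassified (possibly mislabeled) example cannot dominate training.
    """
    return np.minimum(1.0 - s, hinge_loss(z))

# Margins for a well-classified point, a margin violator,
# a misclassified point, and an extreme outlier.
z = np.array([2.0, 0.5, -1.0, -10.0])
print(hinge_loss(z))  # [ 0.   0.5  2.  11. ]
print(ramp_loss(z))   # [0.  0.5 2.  2. ]
```

The outlier at margin -10 contributes 11 under the hinge loss but only 2 under the ramp loss, which gives some intuition for why a bounded loss can limit the influence of label noise.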