Machine Learning Algorithms - Yin_unc_0153D

In Chapter 6 we develop a machine-learning-based methodology to estimate available bandwidth. In this section, we introduce the basic concepts of machine learning, as well as the machine-learning algorithms used in Chapter 6.

Machine learning is a technique that allows computers to learn, without being explicitly programmed, an algorithm that captures the relationships in collected data, and then to use the learned algorithm to make predictions on new data. Machine learning is employed in a wide range of computing tasks, such as spam filtering [46], pattern recognition in image processing [47], and stock-market prediction [48]. Recently, it has been introduced to the realm of computer networks: [49] employs machine-learning techniques to analyze Internet traffic for malicious intrusion detection, [50] classifies Internet traffic using a learning process, [51] predicts the throughput of TCP flows by learning from the history of flow statistics, and [52] explores the learnability of congestion-control algorithms. Furthermore, [53] collects data on connection and link status and lets the machine learn an algorithm to distinguish congestion-induced packet losses from error-induced losses, a distinction that has long been regarded as a notorious performance hurdle for wireless congestion control.

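[Figure: in the training phase, the training set <X_train, Y_train>, made up of feature vectors x = <x_1, x_2, ..., x_p> and their outputs, is fed to a machine learning algorithm, which produces a learned model. In the testing phase, the learned model takes X_test from the testing set <X_test, Y_test> and produces Y_est, which is evaluated against Y_test.]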

Figure 2.2: Machine Learning Training and Testing Phases

2.4 Classification and Regression

Machine learning is commonly used to solve two types of problems: classification and regression. Classification is the problem of assigning new observations to one of a known set of categories. For instance, deciding whether an incoming flow is malicious or not [49] is a classification problem, which maps a flow into one of two categories. In contrast, there exist no known “buckets” for the output of regression. Rather, the aim of a regression problem is to “generate” a continuous numerical value for a new observation, based on the relationship it discovers among existing observations. For example, predicting the future trend of a stock based on its transaction history [48] is a regression problem.

2.4 Training and Testing

For both classification and regression problems, machine learning aims to figure out the relationship between an input x and an output y. The input x = (x_1, ..., x_p), often referred to as a feature vector, consists of p features that are believed to relate to y. For instance, to learn the stock price based on monthly history, the output y is the price for day_i, and the input x is the vector of daily prices for the past month, from day_{i-30} to day_{i-1}.
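The stock-price example above can be sketched in a few lines of Python. This is only an illustration of how feature vectors and outputs might be assembled from a price series. The function name `make_windows`, the parameter `p`, and the synthetic prices are assumptions made for the sketch and do not come from the thesis.

```python
def make_windows(prices, p=30):
    """Return (X, Y): each x holds the p prices before day i, and y is the price on day i."""
    X, Y = [], []
    for i in range(p, len(prices)):
        X.append(prices[i - p:i])  # feature vector: days i-p through i-1
        Y.append(prices[i])        # output: price on day i
    return X, Y


# Synthetic, purely illustrative price series (not data from this thesis).
prices = [100.0 + 0.5 * d for d in range(40)]
X, Y = make_windows(prices, p=30)
print(len(X), len(X[0]), Y[0])
```

Each row of `X` is one feature vector of length p, and the matching entry of `Y` is the value the model is asked to predict.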

To solve a machine-learning problem, the learning process involves two phases, a training phase and a testing phase, as depicted in Fig 2.2. In the training phase, a set of feature vectors and their corresponding outputs <X_train, Y_train> is collected and referred to as the “training set”. The training set is then fed to a machine-learning algorithm, whose task is to discover the relationship between X and Y, and then to generate a learned “model” that mathematically represents the relationship. The testing phase tests the quality of the learned model against a testing set <X_test, Y_test>, which contains observations that are not included in the training set. The model takes in X_test and computes the estimated result Y_est. Its performance is evaluated by comparing Y_est with the ground-truth Y_test in the testing set.
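As a rough sketch of the two phases, the following Python snippet splits a synthetic data set into disjoint training and testing sets, fits a single-feature least-squares line on the training set, and evaluates it on the testing set. The data, the 80/20 split, the helper names `fit_line` and `mean_abs_error`, and the choice of mean absolute error as the metric are all assumptions made for illustration. The thesis does not prescribe them here.

```python
import random


def fit_line(xs, ys):
    """Least-squares fit of y = w0 + w1 * x for a single feature."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    w1 = sxy / sxx
    return my - w1 * mx, w1


def mean_abs_error(y_true, y_est):
    """Average absolute difference between ground truth and estimates."""
    return sum(abs(a - b) for a, b in zip(y_true, y_est)) / len(y_true)


# Synthetic data: y = 2x + 1 plus a little noise (illustrative only).
rng = random.Random(0)
data = [(x, 2.0 * x + 1.0 + rng.gauss(0, 0.1)) for x in range(100)]
rng.shuffle(data)
train, test = data[:80], data[80:]  # the two sets share no observations

X_train, Y_train = zip(*train)
X_test, Y_test = zip(*test)

w0, w1 = fit_line(X_train, Y_train)         # training phase: learn the model
Y_est = [w0 + w1 * x for x in X_test]       # testing phase: apply it to X_test
test_error = mean_abs_error(Y_test, Y_est)  # compare Y_est with ground truth
print(w0, w1, test_error)
```

Because the testing observations were never seen during training, `test_error` indicates how well the learned model generalizes rather than how well it memorized the training set.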

It is commonly acknowledged that a training set that contains a sufficiently large number of samples and covers a diverse range of input features helps to generate a more accurate model [54].

2.4 Machine Learning Algorithms

In Chapter 6, several popular machine-learning algorithms are adopted to improve the performance of the RAPID bandwidth estimator. In this section we describe the basic ideas of their learning processes, without diving into the algorithmic details, which are beyond the scope of this thesis.

ElasticNet

ElasticNet [55, 56] is one of the most prominent linear-regression algorithms, which assume a linear relationship between the input vector X = (x_1, ..., x_p) and the output Y as follows: Y = w_0 + Σ_{i=1}^{p} w_i x_i. The aim of a linear-regression algorithm is to tune the linear coefficients w_i in order to minimize the model error against the training set.
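The linear model above, together with the quantity an ElasticNet-style learner tries to minimize, might be written as follows. The text here only mentions minimizing the model error. The combined L1 and L2 penalty on the coefficients is what distinguishes ElasticNet from plain least squares, and the particular parameterization below (the names `lam` and `l1_ratio`, and the 1/(2n) scaling) is one common convention assumed for this sketch, not something taken from the thesis.

```python
def predict(w0, w, x):
    """Linear model: Y = w_0 + sum_i w_i * x_i."""
    return w0 + sum(wi * xi for wi, xi in zip(w, x))


def elastic_net_loss(w0, w, X, Y, lam=0.1, l1_ratio=0.5):
    """Squared error on the training set plus a mixed L1/L2 penalty (assumed form)."""
    n = len(X)
    fit = sum((y - predict(w0, w, x)) ** 2 for x, y in zip(X, Y)) / (2 * n)
    l1 = sum(abs(wi) for wi in w)  # L1 term encourages sparse coefficients
    l2 = sum(wi ** 2 for wi in w)  # L2 term shrinks large coefficients
    return fit + lam * (l1_ratio * l1 + 0.5 * (1 - l1_ratio) * l2)


# Tiny illustrative training set generated by w0 = 0.5, w = (2, -1).
X = [[1.0, 2.0], [2.0, 0.0], [3.0, 1.0]]
Y = [0.5, 4.5, 5.5]
loss_true = elastic_net_loss(0.5, [2.0, -1.0], X, Y)
loss_zero = elastic_net_loss(0.0, [0.0, 0.0], X, Y)
print(loss_true, loss_zero)
```

A training procedure would search for the `w0` and `w` that make this loss as small as possible. Here the generating coefficients give a much lower loss than the all-zero model, even though they pay a penalty for being nonzero.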

RandomForest

RandomForest, AdaBoost and GradientBoost all base their learning algorithms on the decision-tree method. Each decision tree solves a classification function. It regards the input vector x = (x_1, ..., x_p) as a point in a p-dimensional space and partitions that space into a set of regions, each of which corresponds to an output y in the training set. An input x in the testing phase is then mapped into one of these regions, and its output is the y corresponding to that region.

RandomForest generates a model that consists of multiple trees. Each tree model is trained with a randomly selected subset of the training data. The output of the ultimate model is the average of the outputs of all tree models.
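The idea of averaging many trees, each trained on a random subset, can be sketched as below. To keep the example short, each "tree" is a single-split tree (a stump) that partitions the input space into two regions. The helper names, the subset fraction of 0.7, and the toy data are assumptions for illustration. Practical RandomForest implementations typically grow deeper trees and also randomize the features considered at each split.

```python
import random


def fit_stump(X, Y):
    """Fit a one-split regression tree: (feature, threshold, left mean, right mean)."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({x[f] for x in X}):
            left = [y for x, y in zip(X, Y) if x[f] <= t]
            right = [y for x, y in zip(X, Y) if x[f] > t]
            if not right:
                continue  # a split must leave samples on both sides
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            err = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
            if best is None or err < best[0]:
                best = (err, f, t, lm, rm)
    if best is None:  # no valid split: fall back to predicting the mean
        mean = sum(Y) / len(Y)
        return (0, float("inf"), mean, mean)
    return best[1:]


def stump_predict(stump, x):
    """Map x to its region and return that region's output."""
    f, t, lm, rm = stump
    return lm if x[f] <= t else rm


def fit_forest(X, Y, n_trees=25, subset_frac=0.7, seed=0):
    """Train each tree on a randomly selected subset of the training data."""
    rng = random.Random(seed)
    n = len(X)
    k = max(1, round(subset_frac * n))
    forest = []
    for _ in range(n_trees):
        idx = rng.sample(range(n), k)
        forest.append(fit_stump([X[i] for i in idx], [Y[i] for i in idx]))
    return forest


def forest_predict(forest, x):
    """The forest output is the average of the outputs of all trees."""
    return sum(stump_predict(s, x) for s in forest) / len(forest)


# Toy step-shaped data (illustrative only).
X = [[float(v)] for v in range(20)]
Y = [0.0 if v < 10 else 10.0 for v in range(20)]
forest = fit_forest(X, Y)
print(forest_predict(forest, [2.0]), forest_predict(forest, [17.0]))
```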

AdaBoost

Similar to RandomForest, AdaBoost relies on learning a number of decision-tree models. However, unlike RandomForest, in which each tree model is trained on a subset of the training set, each tree in AdaBoost is fit to the entire training set and all features.

The AdaBoost algorithm learns the aggregate model in a “boosting” style. Decision trees are built iteratively: in each iteration a new tree model is built to address the “shortcomings” of the existing trees. The “shortcomings” are identified by evaluating the existing model against the training set and highlighting those data samples that it fails to classify correctly. When the subsequent tree model is built, it focuses on fixing those misclassified data samples.

The output of the final model is the weighted sum of the outputs of all tree models, where the weight of each tree is determined during the learning process. The model produced by such a “boosting” method is considered to be more accurate than that of RandomForest [57, 58].
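A minimal sketch of this boosting loop, for two classes labelled +1 and -1, is given below. It follows the textbook form of AdaBoost with single-split trees as the weak learners. The helper names and the toy data are assumptions for illustration, and the thesis does not specify which AdaBoost variant or tree depth is used.

```python
import math


def fit_weighted_stump(X, Y, w):
    """Pick the (feature, threshold, sign) stump with the lowest weighted error."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({x[f] for x in X}):
            for sign in (1, -1):
                pred = [sign if x[f] <= t else -sign for x in X]
                err = sum(wi for wi, p, y in zip(w, pred, Y) if p != y)
                if best is None or err < best[0]:
                    best = (err, f, t, sign)
    return best


def stump_predict(stump, x):
    _, f, t, sign = stump
    return sign if x[f] <= t else -sign


def fit_adaboost(X, Y, n_rounds=3):
    n = len(X)
    w = [1.0 / n] * n  # start with equal emphasis on every sample
    model = []
    for _ in range(n_rounds):
        stump = fit_weighted_stump(X, Y, w)
        err = max(stump[0], 1e-12)  # guard against division by zero
        if err >= 0.5:
            break  # the weak learner is no better than chance
        alpha = 0.5 * math.log((1 - err) / err)  # weight of this tree
        model.append((alpha, stump))
        # Raise the weights of misclassified samples, lower the rest.
        w = [wi * math.exp(-alpha * y * stump_predict(stump, x))
             for wi, x, y in zip(w, X, Y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model


def adaboost_predict(model, x):
    """Sign of the weighted sum of the outputs of all trees."""
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in model)
    return 1 if score >= 0 else -1


# Toy data that no single stump can classify correctly (illustrative only).
X = [[0], [1], [2], [3], [4], [5]]
Y = [1, 1, -1, -1, 1, 1]
model = fit_adaboost(X, Y, n_rounds=3)
print([adaboost_predict(model, x) for x in X])
```

On this toy data a single stump always misclassifies at least two samples, whereas the weighted combination of three stumps separates the classes, which is the effect the reweighting step is meant to achieve.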

GradientBoost

GradientBoost [58] also follows a boosting learning method, but with a different aim from AdaBoost in each iteration of generating tree models. Rather than focusing on fitting those misclassified samples, GradientBoost fits each new tree along the negative gradient of a loss function computed over all data samples, so that adding the tree reduces the overall loss.
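For a squared-error loss, the negative gradient with respect to the current prediction is simply the residual, so a gradient-boosting loop can be sketched as repeatedly fitting a small tree to the residuals of all samples. The sketch below assumes squared-error loss, single-split trees, a learning rate of 0.5, and toy data. None of these choices is taken from the thesis.

```python
def fit_stump(X, R):
    """Fit a one-split regression tree to targets R: (feature, threshold, left mean, right mean)."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({x[f] for x in X}):
            left = [r for x, r in zip(X, R) if x[f] <= t]
            right = [r for x, r in zip(X, R) if x[f] > t]
            if not right:
                continue  # a split must leave samples on both sides
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or err < best[0]:
                best = (err, f, t, lm, rm)
    if best is None:  # no valid split: fall back to predicting the mean
        mean = sum(R) / len(R)
        return (0, float("inf"), mean, mean)
    return best[1:]


def stump_predict(stump, x):
    f, t, lm, rm = stump
    return lm if x[f] <= t else rm


def fit_gradient_boost(X, Y, n_rounds=20, lr=0.5):
    base = sum(Y) / len(Y)  # initial model: predict the mean
    pred = [base] * len(Y)
    stumps = []
    for _ in range(n_rounds):
        # Negative gradient of 0.5 * (y - pred)^2 is the residual y - pred.
        residuals = [y - p for y, p in zip(Y, pred)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        pred = [p + lr * stump_predict(stump, x) for p, x in zip(pred, X)]
    return base, lr, stumps


def gb_predict(model, x):
    base, lr, stumps = model
    return base + lr * sum(stump_predict(s, x) for s in stumps)


def mse(Y, P):
    return sum((y - p) ** 2 for y, p in zip(Y, P)) / len(Y)


# Toy data: y = x^2 on ten points (illustrative only).
X = [[float(v)] for v in range(10)]
Y = [x[0] ** 2 for x in X]
model = fit_gradient_boost(X, Y)
initial_mse = mse(Y, [model[0]] * len(Y))
final_mse = mse(Y, [gb_predict(model, x) for x in X])
print(initial_mse, final_mse)
```

Unlike the AdaBoost sketch, no sample is singled out by a weight. Every round uses the residuals of all samples, and the training error shrinks as more trees are added.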

CHAPTER 3: TESTBED

This chapter describes the testbed used throughout this thesis. Section 3.1 describes the topology of our 10 Gbps testbed network. The robustness of the bandwidth estimation logic is evaluated in the presence of different degrees of cross-traffic noise in Chapter 5. Chapter 8 also studies whether RAPID is able to maintain high link utilization under traffic of different burstiness. Both require the generation of cross traffic with different levels of burstiness; Section 3.2 describes how this is achieved. In Chapter 8, several congestion-control protocols are evaluated against different RTTs and random packet-loss rates. Section 3.3 describes the mechanisms used to emulate RTTs and packet losses for TCP flows.
