7 Introduction to machine learning

This chapter covers


important are collection, preparation, and analysis of data, where knowledge about the problem domain is needed. Therefore, these two machine learning chapters are mostly about steps 4 and 5 described previously.

7.1.1 Definition of machine learning

Machine learning is one of the largest research areas within artificial intelligence, a scientific field that studies algorithms for simulating intelligence. Ron Kohavi and Foster Provost in their article “Glossary of Terms” describe machine learning in these words:

Machine learning is a scientific discipline that explores the construction and study of algorithms that can learn from and make predictions on data.³

This is in contrast to traditional programming, where the logic for a task (like parsing an XML file with a certain structure) is explicitly coded into the program. Such programs can't be easily expanded to cover related tasks, like parsing XML files with a similar but not identical structure. As another example, a speech-recognition program that recognizes different accents and voices would be impossible to build by explicit programming, because the sheer number of ways a single word can be pronounced would necessitate that many versions of the program.

Instead of incorporating the explicit knowledge about the problem area in the program itself, machine learning relies on methods from the fields of statistics, probability, and information theory to discover and use the knowledge inherent in data and then change the behavior of a program accordingly in order to be able to solve the initial task (such as recognizing speech).

7.1.2 Classification of machine learning algorithms

The most basic classification of machine learning algorithms divides them into two classes called supervised and unsupervised learners. A data set for supervised learning is pre-labeled (information about the expected prediction output is provided with the data), whereas one for unsupervised learning contains no labels and the algorithm needs to determine them itself.

Supervised learning is used for many practical machine learning problems today, such as spam detection, speech and handwriting recognition, computer vision, and more. A spam-detection algorithm, for example, is trained on examples of emails manually marked as spam or not spam (labeled data) and learns how to classify future emails.
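As a rough illustration of learning from labeled data, the following sketch trains a toy spam detector. The training emails, the function names `train` and `predict_spam`, and the word-counting rule are all assumptions made for this example; real spam filters rely on far more robust statistical models.

```python
from collections import Counter

# Hypothetical, hand-labeled training emails (True means spam).
TRAINING_EMAILS = [
    ("win a free prize now", True),
    ("free money claim your prize", True),
    ("meeting agenda for monday", False),
    ("please review the project report", False),
]


def train(labeled_emails):
    """Count how often each word occurs in spam and in non-spam emails."""
    spam_words, ham_words = Counter(), Counter()
    for text, label in labeled_emails:
        target = spam_words if label else ham_words
        target.update(text.lower().split())
    return spam_words, ham_words


def predict_spam(text, model):
    """Flag an email as spam when its words were seen more often in spam."""
    spam_words, ham_words = model
    words = text.lower().split()
    spam_score = sum(spam_words[w] for w in words)
    ham_score = sum(ham_words[w] for w in words)
    return spam_score > ham_score


model = train(TRAINING_EMAILS)
print(predict_spam("claim your free prize", model))           # True
print(predict_spam("agenda for the project meeting", model))  # False
```

Nothing in `predict_spam` names a specific spam word; the behavior comes entirely from the labeled examples, so changing the training data changes what the program does.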

Unsupervised learning is also a powerful and widely used tool. Among other purposes, it's used for discovering structure within data (for example, groups of similar items known as clusters), anomaly detection, image segmentation, and so on.

3 Ron Kohavi and Foster Provost (1998), "Glossary of Terms," Editorial for the Special Issue on Applications of Machine

CLASSIFICATION TO SUPERVISED AND UNSUPERVISED ALGORITHMS

In supervised learning, an algorithm is given a set of known inputs and matching outputs, and it has to find a function that can be used to transform the given inputs to the true outputs even in the case of input data not seen during the training phase. The same function can then be used to predict outputs of any future input. The typical supervised learning tasks are regression and classification.

Regression attempts to predict the values of continuous output variables based on a set of input variables. Classification aims to classify sets of inputs into two or more classes (discrete output variables). Both regression and classification models are trained on a set of inputs with known outputs (the values of the output variables, or the classes), which is what makes them supervised problems.
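The difference between the two tasks can be sketched in a few lines of plain Python. The data values, the function names, and the choice of methods (ordinary least squares for regression, a nearest-class-mean rule for classification) are assumptions picked for brevity, not the only way to solve either task.

```python
def fit_line(xs, ys):
    """Regression: least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


def fit_class_means(xs, labels):
    """Classification: compute the mean input value of each class."""
    grouped = {}
    for x, label in zip(xs, labels):
        grouped.setdefault(label, []).append(x)
    return {label: sum(values) / len(values) for label, values in grouped.items()}


def classify(x, class_means):
    """Assign x to the class whose mean is closest."""
    return min(class_means, key=lambda label: abs(x - class_means[label]))


# Regression: known outputs are continuous values.
slope, intercept = fit_line([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
print(slope * 5.0 + intercept)  # continuous prediction for an unseen input

# Classification: known outputs are discrete classes.
class_means = fit_class_means([1.0, 1.2, 3.8, 4.0],
                              ["small", "small", "large", "large"])
print(classify(1.5, class_means))  # discrete prediction for an unseen input
```

In both cases the model is fitted from inputs paired with known outputs; only the type of output (a number versus a class) differs.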

In the case of unsupervised learning, the output is not known in advance, and the algorithm has to find some structure in the data without additional information provided. A typical unsupervised learning task is clustering. With clustering, the goal of the algorithm is to discover dense regions, called clusters, in the input data by analyzing similarities between the input examples. There are no known classes used as a reference.
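One common way to discover clusters is the k-means algorithm, sketched below as an illustration. The points, the starting centroids, and the fixed iteration count are assumptions made for this example; note that the input carries no labels at all.

```python
import math


def assign(points, centroids):
    """Group each point with its nearest centroid (Euclidean distance)."""
    clusters = [[] for _ in centroids]
    for point in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: math.dist(point, centroids[i]))
        clusters[nearest].append(point)
    return clusters


def k_means(points, initial_centroids, iterations=10):
    """Alternate between assigning points and recomputing centroids."""
    centroids = list(initial_centroids)
    for _ in range(iterations):
        clusters = assign(points, centroids)
        centroids = [
            # Mean of each coordinate; keep the old centroid if a cluster is empty.
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, assign(points, centroids)


# Unlabeled two-dimensional points (made-up values).
POINTS = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
          (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]

centroids, clusters = k_means(POINTS, initial_centroids=[(1.0, 1.0), (5.0, 5.0)])
for centroid, cluster in zip(centroids, clusters):
    print(centroid, cluster)
```

The algorithm groups the points purely by their similarity (here, distance); any meaning attached to the resulting groups has to come from whoever interprets them afterward.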

For an example of the differences between supervised and unsupervised learning, consider figure 7.2. It shows the often-used Iris flower data set,⁴ created in 1936. The data set contains widths and lengths of petals and sepals⁵ of 150 flowers of three iris flower species: Iris setosa, Iris versicolor, and Iris virginica (50 flowers of each species). For the sake of simplicity, only sepal length and width are given in figure 7.2. That way we can plot the data set in two dimensions.

Sepal length and sepal width are features (or dimensions) of input, and the flower species is the output (or target variable, a label). We would like our algorithm to find a mapping function that correctly maps sepal length and sepal width to flower species for existing and future examples.
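To make the idea of a mapping function concrete, here is a minimal sketch that maps two features to a species with a 1-nearest-neighbor rule. The measurements are illustrative values invented for this example (they are not rows copied from the Iris data set), and nearest neighbor is just one simple choice of algorithm.

```python
import math

# Each training example pairs features (sepal length, sepal width in cm)
# with a label (the species). Values are made up for illustration.
TRAINING_SET = [
    ((5.0, 3.4), "Iris setosa"),
    ((4.8, 3.1), "Iris setosa"),
    ((6.0, 2.7), "Iris versicolor"),
    ((5.9, 2.9), "Iris versicolor"),
    ((7.1, 3.0), "Iris virginica"),
    ((6.9, 3.1), "Iris virginica"),
]


def predict_species(features, training_set):
    """Return the label of the training example closest to the given features."""
    _, label = min(training_set,
                   key=lambda example: math.dist(features, example[0]))
    return label


# Predict the species of flowers not present in the training set.
print(predict_species((5.0, 3.3), TRAINING_SET))  # Iris setosa
print(predict_species((7.0, 3.0), TRAINING_SET))  # Iris virginica
```

Here `predict_species` plays the role of the mapping function: it takes the input features and returns the target variable, for both existing and new examples.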

NOTE For historical reasons, and because of many possible application areas, a single concept in machine learning can have several different names. Inputs are also called examples, points, data samples, observations, or instances. In Spark, training examples for supervised learning are called labeled points. Features (sepal length and sepal width in the Iris data set, for example) are also called dimensions, attributes, variables, or independent variables.

On the graph on the left-hand side of figure 7.2, flower species corresponding to each input are marked with dots, circles, and x marks, which means that the flower species are known in advance. We call this the training set because it can be used to train (or fit) the parameters of the machine learning model to determine the mapping function. You would then test the accuracy of your trained model using a test set containing

4 Iris flower data set, Wikipedia (http://en.wikipedia.org/wiki/Iris_flower_data_set)

5 A sepal is a part of a flower that supports its petals and protects the flower in bud.
