• No results found

D ASK R ANDOM F OREST F EATURE D ETECTION ☁ 1 Setup

In document Python for Cloud Computing (Page 180-184)

for i in range(len(X)): Yprime.append(Fprime(X[i]))

10.5 D ASK R ANDOM F OREST F EATURE D ETECTION ☁ 1 Setup

First we need our tools. pandas gives us the DataFrame, very similar to R’s DataFrames. The DataFrame is a structure that allows us to work with our data more easily. It has nice features for slicing and transformation of data, and easy ways to do basic statistics.

numpy has some very handy functions that work on DataFrames.

10.5.2 Dataset

We are using a dataset about the wine quality dataset, archived at UCI’s Machine Learning Repository (http://archive.ics.uci.edu/ml/index.php).

Now we will load our data. pandas makes it easy!

Like in R, there is a .describe() method that gives basic statistics for every column in the dataset.

fixed acidity volatile

acidity citric acid

residual sugar chlorides count 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 mean 8.319637 0.527821 0.270976 2.538806 0.087467 std 1.741096 0.179060 0.194801 1.409928 0.047065 min 4.600000 0.120000 0.000000 0.900000 0.012000 25% 7.100000 0.390000 0.090000 1.900000 0.070000 import pandas as pd import numpy as np

# red wine quality data, packed in a DataFrame

red_df = pd.read_csv('winequality-red.csv',sep=';',header=0, index_col=False)

# white wine quality data, packed in a DataFrame

white_df = pd.read_csv('winequality-white.csv',sep=';',header=0,index_col=False)

# rose? other fruit wines? plum wine? :(

# for red wines

50% 7.900000 0.520000 0.260000 2.200000 0.079000

75% 9.200000 0.640000 0.420000 2.600000 0.090000

max 15.900000 1.580000 1.000000 15.500000 0.611000

fixed acidity volatile

acidity citric acid

residual sugar chlorides count 4898.000000 4898.000000 4898.000000 4898.000000 4898.000000 mean 6.854788 0.278241 0.334192 6.391415 0.045772 std 0.843868 0.100795 0.121020 5.072058 0.021848 min 3.800000 0.080000 0.000000 0.600000 0.009000 25% 6.300000 0.210000 0.270000 1.700000 0.036000 50% 6.800000 0.260000 0.320000 5.200000 0.043000 75% 7.300000 0.320000 0.390000 9.900000 0.050000 max 14.200000 1.100000 1.660000 65.800000 0.346000

Sometimes it is easier to understand the data visually. A histogram of the white wine quality data citric acid samples is shown next. You can of course visualize other columns’ data or other datasets. Just replace the DataFrame and column name (see Figure 33).

# for white wines

white_df.describe()

import matplotlib.pyplot as plt

def extract_col(df,col_name): return list(df[col_name])

col = extract_col(white_df,'citric acid') # can replace with another dataframe or column

plt.hist(col)

#TODO: add axes and such to set a good example

Figure 33: Histogram

10.5.3 Detecting Features

Let us try out a some elementary machine learning models. These models are not always for prediction. They are also useful to find what features are most predictive of a variable of interest. Depending on the classifier you use, you may need to transform the data pertaining to that variable.

10.5.3.1 Data Preparation

Let us assume we want to study what features are most correlated with pH. pH of course is real-valued, and continuous. The classifiers we want to use usually need labeled or integer data. Hence, we will transform the pH data, assigning wines with pH higher than average as hi (more basic or alkaline) and wines with

pH lower than average as lo (more acidic).

# refresh to make Jupyter happy

red_df = pd.read_csv('winequality-red.csv',sep=';',header=0, index_col=False) white_df = pd.read_csv('winequality-white.csv',sep=';',header=0,index_col=False)

#TODO: data cleansing functions here, e.g. replacement of NaN

# if the variable you want to predict is continuous, you can map ranges of values # to integer/binary/string labels

# for example, map the pH data to 'hi' and 'lo' if a pH value is more than or # less than the mean pH, respectively

M = np.mean(list(red_df['pH'])) # expect inelegant code in these mappings

Lf = lambda p: int(p < M)*'lo' + int(p >= M)*'hi' # some C-style hackery # create the new classifiable variable

red_df['pH-hi-lo'] = map(Lf,list(red_df['pH']))

# and remove the predecessor

Now we specify which dataset and variable you want to predict by assigning vlues to SELECTED_DF and TARGET_VAR, respectively.

We like to keep a parameter file where we specify data sources and such. This lets me create generic analytics code that is easy to reuse.

After we have specified what dataset we want to study, we split the training and test datasets. We then scale (normalize) the data, which makes most classifiers run better.

Now we pick a classifier. As you can see, there are many to try out, and even more in scikit-learn’s documentation and many examples and tutorials. Random Forests are data science workhorses. They are the go-to method for most data scientists. Be careful relying on them though–they tend to overfit. We try to avoid overfitting by separating the training and test datasets.

10.5.4 Random Forest

Now we will test it out with the default parameters.

Note that this code is boilerplate. You can use it interchangeably for most scikit-

from sklearn.model_selection import train_test_split from sklearn.preprocessing import StandardScaler from sklearn import metrics

# make selections here without digging in code

SELECTED_DF = red_df # selected dataset

TARGET_VAR = 'pH-hi-lo' # the predicted variable # generate nameless data structures

df = SELECTED_DF

target = np.array(df[TARGET_VAR]).ravel()

del df[TARGET_VAR] # no cheating

#TODO: data cleansing function calls here # split datasets for training and testing

X_train, X_test, y_train, y_test = train_test_split(df,target,test_size=0.2)

# set up the scaler

scaler = StandardScaler() scaler.fit(X_train)

# apply the scaler

X_train = scaler.transform(X_train) X_test = scaler.transform(X_test)

# pick a classifier

from sklearn.tree import DecisionTreeClassifier,DecisionTreeRegressor,ExtraTreeClassifier,ExtraTreeRegressor from sklearn.ensemble import RandomForestClassifier,ExtraTreesClassifier

learn models.

Now output the results. For Random Forests, we get a feature ranking. Relative importances usually exponentially decay. The first few highly-ranked features are usually the most important.

Feature ranking:

fixed acidity 0.269778 citric acid 0.171337 density 0.089660 volatile acidity 0.088965 chlorides 0.082945 alcohol 0.080437 total sulfur dioxide 0.067832 sulphates 0.047786 free sulfur dioxide 0.042727 residual sugar 0.037459 quality 0.021075

Sometimes it’s easier to visualize. We’ll use a bar chart. See Figure 34

# test it out model = clf.fit(X_train,y_train) pred = clf.predict(X_test) conf_matrix = metrics.confusion_matrix(y_test,pred) var_score = clf.score(X_test,y_test) # the results importances = clf.feature_importances_ indices = np.argsort(importances)[::-1]

# for the sake of clarity

num_features = X_train.shape[1]

features = map(lambda x: df.columns[x],indices)

feature_importances = map(lambda x: importances[x],indices) print 'Feature ranking:\n'

for i in range(num_features):

In document Python for Cloud Computing (Page 180-184)