
Selecting the algorithm to be applied

In document Python for Cloud Computing (Page 168-172)

for i in range(len(X)): Yprime.append(Fprime(X[i]))

10.4.12 Selecting the algorithm to be applied

Algorithm selection depends primarily on the objective you are trying to achieve and on the kind of dataset that is available. Many different types of algorithms can be applied; we will look at a few of them here.

# Split the data, then build preprocessing pipelines for the numeric and categorical columns.
# DataFrameSelector, CombinedAttributesAdder and CategoricalEncoder are not imported here;
# they are assumed to be custom helper classes defined earlier in the document.
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer  # replaces Imputer, which was removed in scikit-learn 0.22

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42, stratify=y)

X_train_num = X_train[["amount", "oldbalanceOrg", "newbalanceOrig", "oldbalanceDest", "newbalanceDest"]]
X_train_cat = X_train[["type"]]
X_model_col = ["amount", "oldbalanceOrg", "newbalanceOrig", "oldbalanceDest", "newbalanceDest", "type"]

num_attribs = list(X_train_num)
cat_attribs = list(X_train_cat)

num_pipeline = Pipeline([
    ('selector', DataFrameSelector(num_attribs)),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler())
])
cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    ('cat_encoder', CategoricalEncoder(encoding="onehot-dense"))
])

10.4.12.1 Linear Regression

This algorithm can be applied when you want to compute a continuous value. To predict a future value of a process that is currently running, you can use a regression algorithm.

Examples where linear regression can be used are:

1. Predict the time taken to go from one place to another

2. Predict the sales for a future month

3. Predict sales data and improve yearly projections.
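To make "predicting a continuous value" concrete, here is a minimal sketch of a one-feature least-squares fit in plain Python. The distance and travel-time figures are made-up illustration data, and `fit_line` is a hypothetical helper, not a scikit-learn function.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: trip distance (km) and observed travel time (minutes)
distance_km = [1.0, 2.0, 3.0, 4.0, 5.0]
travel_min = [5.0, 7.0, 9.0, 11.0, 13.0]

slope, intercept = fit_line(distance_km, travel_min)

# Predict the travel time for an unseen 6 km trip
predicted_min = slope * 6.0 + intercept
```

The fitted line is then used to extrapolate to inputs that were not in the training data, which is the typical use of a regression model.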

10.4.12.2 Logistic Regression

This algorithm can be used to perform binary classification. It is a good choice if you want a probabilistic framework, or if you expect to receive more training data in the future and want to incorporate it into your model quickly.

Examples where logistic regression can be used are:

1. Customer churn prediction.

2. Credit scoring and fraud detection, which is the example problem we are solving in this chapter.

3. Calculating the effectiveness of marketing campaigns.
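The two properties mentioned above, probabilistic output and cheap incorporation of new data, can be sketched with a small logistic regression trained by stochastic gradient descent in plain Python. This is an illustrative sketch only: the feature values are invented, and `sgd_update` and `predict_proba` are hypothetical helper names.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, x):
    """Probability that example x belongs to class 1."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def sgd_update(weights, bias, x, y, lr=0.1):
    """One stochastic gradient step on the log loss for a single example."""
    error = predict_proba(weights, bias, x) - y
    new_weights = [w - lr * error * xi for w, xi in zip(weights, x)]
    return new_weights, bias - lr * error


# Hypothetical, already scaled features: [amount, balance_change]; label 1 = fraud
train = [([0.1, 0.0], 0), ([0.2, 0.1], 0), ([0.9, 1.0], 1), ([1.0, 0.8], 1)]

weights, bias = [0.0, 0.0], 0.0
for _ in range(2000):
    for x, y in train:
        weights, bias = sgd_update(weights, bias, x, y)

# The model returns a probability, not just a class label
p_fraud = predict_proba(weights, bias, [0.95, 0.9])
p_legit = predict_proba(weights, bias, [0.15, 0.05])

# Labelled data that arrives later is folded in with a few more updates,
# without retraining from scratch
for x, y in [([0.85, 0.95], 1), ([0.05, 0.1], 0)]:
    weights, bias = sgd_update(weights, bias, x, y)
```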

# Linear regression on standardized features, with the training time measured.
# X_train, X_test and y_train are assumed to come from the train/test split shown earlier.
import time
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

scl = StandardScaler()
X_train_std = scl.fit_transform(X_train)  # fit the scaler on the training set only
X_test_std = scl.transform(X_test)

start = time.time()
lin_reg = LinearRegression()
lin_reg.fit(X_train_std, y_train)  # SKLearn's linear regression
y_train_pred = lin_reg.predict(X_train_std)
train_time = time.time() - start

# Logistic regression on a stratified subsample.
# subsample_rate and the results DataFrame are assumed to be defined earlier in the document.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X_train, _, y_train, _ = train_test_split(X_train, y_train, stratify=y_train, train_size=subsample_rate, random_state=42)
X_test, _, y_test, _ = train_test_split(X_test, y_test, stratify=y_test, train_size=subsample_rate, random_state=42)

model_lr_sklearn = LogisticRegression(multi_class="multinomial", C=1e6, solver="sag", max_iter=15)
model_lr_sklearn.fit(X_train, y_train)
y_pred_test = model_lr_sklearn.predict(X_test)
acc = accuracy_score(y_test, y_pred_test)
results.loc[len(results)] = ["LR Sklearn", np.round(acc, 3)]
results

10.4.12.3 Decision trees

Decision trees handle feature interactions and are non-parametric. They do not support online learning, so the entire tree needs to be rebuilt when a new training dataset comes in. Memory consumption is very high.

Decision trees can be used in the following cases:

1. Investment decisions

2. Customer churn

3. Bank loan defaulters

4. Build vs. buy decisions

5. Sales lead qualification

10.4.12.4 K Means

This algorithm is used when the labels are not known and groups need to be created based on the features of the objects. An example is dividing a group of people into different subgroups based on a common theme or attribute.

The main disadvantage of K-means is that you need to specify the number of clusters (K) in advance. Finding the best K usually takes many trial runs with different values.
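The search for a suitable K can be sketched as follows. This is a minimal plain-Python version of the standard K-means iteration (Lloyd's algorithm) on one-dimensional data, run once per candidate K. The ages are made-up illustration data, `kmeans_1d` is a hypothetical helper, and the deterministic starting centroids are a simplification chosen so the sketch is reproducible.

```python
def kmeans_1d(points, k, iterations=20):
    """Lloyd's algorithm on 1-D data; returns (centroids, inertia)."""
    pts = sorted(points)
    n = len(pts)
    # Deterministic start: one centroid from the middle of each of k equal slices
    centroids = [pts[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in pts:
            nearest = min(range(k), key=lambda j: (p - centroids[j]) ** 2)
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        centroids = [sum(c) / len(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    # Inertia: total squared distance of the points to their nearest centroid
    inertia = sum(min((p - c) ** 2 for c in centroids) for p in pts)
    return centroids, inertia


# Hypothetical ages of a group of people, forming three visible age bands
ages = [18, 19, 21, 22, 40, 41, 43, 44, 65, 67, 68, 70]

# Try several values of K and record how tight the clusters are
inertia_by_k = {k: kmeans_1d(ages, k)[1] for k in range(1, 6)}
centroids_3, _ = kmeans_1d(ages, 3)
```

On this data the inertia falls steeply up to K = 3 and only slightly afterwards, which is the usual "elbow" signal for picking K.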

# Decision tree regressor, with training and prediction time measured.
# X_train_std and X_test_std are assumed to be the standardized features from the linear regression listing.
import time
from sklearn.tree import DecisionTreeRegressor

dt = DecisionTreeRegressor()

start = time.time()
dt.fit(X_train_std, y_train)
y_train_pred = dt.predict(X_train_std)
train_time = time.time() - start

start = time.time()
y_test_pred = dt.predict(X_test_std)
test_time = time.time() - start

# K-nearest neighbors classifier on a stratified subsample.
# Note that KNN is a supervised method and is distinct from K-means clustering.
# subsample_rate and the results DataFrame are assumed to be defined earlier in the document.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, PredefinedSplit
from sklearn.metrics import accuracy_score

X_train, _, y_train, _ = train_test_split(X_train, y_train, stratify=y_train, train_size=subsample_rate, random_state=42)
X_test, _, y_test, _ = train_test_split(X_test, y_test, stratify=y_test, train_size=subsample_rate, random_state=42)

model_knn_sklearn = KNeighborsClassifier(n_jobs=-1)
model_knn_sklearn.fit(X_train, y_train)
y_pred_test = model_knn_sklearn.predict(X_test)
acc = accuracy_score(y_test, y_pred_test)
results.loc[len(results)] = ["KNN Arbitary Sklearn", np.round(acc, 3)]
results

10.4.12.5 Support Vector Machines

SVM is a supervised ML technique used for pattern recognition and classification problems. It is natively a binary classifier, so it fits most directly when your data has exactly two classes; multi-class problems are usually handled by combining several binary SVMs. It is popular in text classification problems.

A few cases where SVM can be used are:

1. Detecting persons with common diseases.

2. Hand-written character recognition

3. Text categorization

4. Stock market price prediction
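As a rough illustration of the two-class case, the sketch below trains a linear SVM by sub-gradient descent on the regularized hinge loss, in plain Python. It is a simplified teaching version under stated assumptions (invented data, labels encoded as +1 and -1, hypothetical helper names), not the solver a production library would use.

```python
def train_linear_svm(samples, labels, lr=0.1, lam=0.01, epochs=50):
    """Sub-gradient descent on the regularized hinge loss; labels must be +1 or -1."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:
                # Point is misclassified or inside the margin:
                # shift the boundary so it is classified with a wider margin
                w = [wi - lr * (lam * wi - y * xi) for wi, xi in zip(w, x)]
                b += lr * y
            else:
                # Point is safely classified: only apply the regularization shrinkage
                w = [wi - lr * lam * wi for wi in w]
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on which side of the boundary x falls."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Hypothetical two-class data
samples = [[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]]
labels = [1, 1, -1, -1]

w, b = train_linear_svm(samples, labels)
```

Only points that violate the margin change the boundary, which is why the final model depends on a small set of "support" points rather than on every training example.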

10.4.12.6 Naive Bayes

Naive Bayes is used for large datasets. The algorithm works well even when limited CPU and memory are available. It works by calculating a bunch of counts. It requires less training data. The algorithm cannot learn interactions between features.

Naive Bayes can be used in real-world applications such as:

1. Sentiment analysis and text classification

2. Recommendation systems like Netflix, Amazon

3. To mark an email as spam or not spam

4. Face recognition
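The "bunch of counts" can be seen directly in a small plain-Python sketch of a word-count Naive Bayes spam filter with add-one (Laplace) smoothing. The emails are made-up illustration data, and `train_naive_bayes` and `classify` are hypothetical helper names.

```python
import math
from collections import Counter


def train_naive_bayes(documents):
    """documents: list of (list_of_words, label). Training is just counting."""
    label_counts = Counter()
    word_counts = {}
    vocabulary = set()
    for words, label in documents:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
        vocabulary.update(words)
    return label_counts, word_counts, vocabulary


def classify(model, words):
    """Pick the label with the highest log prior plus summed log likelihoods."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in words:
            # Each word contributes independently; add-one smoothing avoids log(0)
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training emails, already split into words
emails = [
    (["win", "free", "prize", "now"], "spam"),
    (["free", "offer", "click", "now"], "spam"),
    (["meeting", "agenda", "for", "monday"], "ham"),
    (["project", "status", "meeting", "notes"], "ham"),
]

model = train_naive_bayes(emails)
label_a = classify(model, ["free", "prize"])
label_b = classify(model, ["meeting", "notes"])
```

Because every word is scored on its own, the model has no way to represent that two words matter only in combination, which is the feature-interaction limitation noted above.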

10.4.12.7 Random Forest

Random forest builds on decision trees: it combines many trees into an ensemble. It can be used for both regression and classification problems with large data sets.

A few cases where it can be applied:

1. Predict high-risk patients.

2. Predict parts failures in manufacturing.

3. Predict loan defaulters.

from sklearn.ensemble import RandomForestRegressor

# criterion='squared_error' is the current name of the former 'mse' option (renamed in scikit-learn 1.0)
forest = RandomForestRegressor(n_estimators=400, criterion='squared_error', random_state=1, n_jobs=-1)
start = time.time()

10.4.12.8 Neural networks

A neural network works based on the weights of the connections between its neurons. The weights are trained, and the trained network can then be used to predict a class or a quantity. Neural networks are resource and memory intensive.

A few cases where they can be applied:

1. Unsupervised learning tasks, such as feature extraction.

2. Extracting features from raw images or speech with much less human intervention.
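To show what "weights of connections between neurons" means in practice, the sketch below wires three artificial neurons into a tiny two-layer network that computes exclusive or. The weights here are hand-picked for illustration; in a real network they would be learned from data, for example by backpropagation. All function names are hypothetical.

```python
def step(z):
    """Step activation: the neuron fires (1) only for a positive weighted sum."""
    return 1 if z > 0 else 0


def neuron(inputs, weights, bias):
    """Weighted sum of the inputs followed by the step activation."""
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)


def tiny_network(x1, x2):
    # Hidden layer: two neurons read the same inputs through different weights
    h_or = neuron([x1, x2], [1.0, 1.0], -0.5)   # fires if at least one input is 1
    h_and = neuron([x1, x2], [1.0, 1.0], -1.5)  # fires only if both inputs are 1
    # Output neuron: "OR but not AND", which is exclusive or
    return neuron([h_or, h_and], [1.0, -1.0], -0.5)


outputs = {(a, b): tiny_network(a, b) for a in (0, 1) for b in (0, 1)}
```

Changing any weight or bias changes which inputs make a neuron fire, so training a network amounts to searching for weight values that produce the desired outputs.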

10.4.12.9 Deep Learning using Keras

Keras is a powerful and easy-to-use Python library for developing and evaluating deep learning models. It wraps the efficient numerical computation libraries Theano and TensorFlow.

10.4.12.10 XGBoost

XGBoost stands for eXtreme Gradient Boosting. XGBoost is an implementation of gradient boosted decision trees designed for speed and performance. It is engineered for efficiency of compute time and memory resources.
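The idea behind gradient boosted trees can be sketched in plain Python with one-split trees (stumps) and a squared-error loss: each round fits a stump to the current residuals and adds a damped copy of it to the model. This is a sketch of plain gradient boosting under those assumptions, not of XGBoost itself, which adds regularization, second-order gradient information and many performance optimizations. The data and helper names are invented for illustration.

```python
def fit_stump(xs, residuals):
    """Best single-split regression stump (least squares) on one feature."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, rounds, lr=0.5):
    """Return a prediction function built from `rounds` boosted stumps."""
    base = sum(ys) / len(ys)  # start from the mean of the targets
    stumps = []

    def predict(x):
        return base + lr * sum(s(x) for s in stumps)

    for _ in range(rounds):
        # For squared error, the residuals are the negative gradient of the loss
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict


def mse(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical one-feature regression data
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 6.0, 6.2]

mse_1 = mse(boost(xs, ys, rounds=1), xs, ys)
mse_20 = mse(boost(xs, ys, rounds=20), xs, ys)
```

Each added stump corrects part of what the previous ones got wrong, so the training error after 20 rounds is lower than after one round.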
