for i in range(len(X)): Yprime.append(Fprime(X[i]))
10.4.12 Selecting the algorithm to be applied
Algorithm selection depends primarily on the objective you are trying to achieve and on the kind of dataset that is available. Many different types of algorithms can be applied; we will look at a few of them here.
10.4.12.1 Linear Regression
from sklearn.model_selection import train_test_split

# Hold out 30% of the data, preserving the class balance of y
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y)

X_train_num = X_train[["amount", "oldbalanceOrg", "newbalanceOrig", "oldbalanceDest", "newbalanceDest"]]
X_train_cat = X_train[["type"]]
X_model_col = ["amount", "oldbalanceOrg", "newbalanceOrig", "oldbalanceDest", "newbalanceDest", "type"]

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
# sklearn.preprocessing.Imputer was removed from scikit-learn; SimpleImputer replaces it
from sklearn.impute import SimpleImputer

num_attribs = list(X_train_num)
cat_attribs = list(X_train_cat)

# DataFrameSelector, CombinedAttributesAdder and CategoricalEncoder are
# custom transformers that must be defined before this point
num_pipeline = Pipeline([
    ('selector', DataFrameSelector(num_attribs)),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
])
cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    ('cat_encoder', CategoricalEncoder(encoding="onehot-dense")),
])

This algorithm can be applied when you want to compute a continuous value. For example, to predict a future value of a process that is currently running, you can use a regression algorithm.
Examples where linear regression can be used are:
1. Predict the time taken to go from one place to another
2. Predict the sales for a future month
3. Predict sales data and improve yearly projections.
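As a minimal sketch of predicting a continuous value, the example below fits scikit-learn's LinearRegression to made-up monthly sales figures and predicts the following month. The data, the variable names and the linear trend are assumptions chosen for illustration only; they are not the chapter's dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: sales that grow roughly linearly over 24 months
rng = np.random.default_rng(42)
months = np.arange(1, 25).reshape(-1, 1)   # feature matrix, one column
sales = 100.0 + 5.0 * months.ravel() + rng.normal(0.0, 2.0, size=24)

model = LinearRegression()
model.fit(months, sales)

# Predict the continuous target for the next (25th) month
next_month_sales = model.predict(np.array([[25]]))[0]
print("slope:", model.coef_[0])
print("intercept:", model.intercept_)
print("predicted sales for month 25:", next_month_sales)
```

The fitted slope should land close to the trend of 5 used to generate the data, which is a quick sanity check before trusting the forecast.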
10.4.12.2 Logistic Regression
This algorithm can be used to perform binary classification. It is a good choice if you want a probabilistic framework, or if you expect to receive more training data in the future that you want to be able to incorporate quickly into your model.
Examples where logistic regression can be used are:
1. Customer churn prediction.
2. Credit scoring and fraud detection, which is the example problem we are solving in this chapter.
3. Calculating the effectiveness of marketing campaigns.
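To illustrate the probabilistic framework mentioned above, here is a small sketch that trains LogisticRegression on synthetic two-class data and reads class probabilities with predict_proba. The dataset from make_classification is a stand-in and is an assumption of this example, not the fraud data used in the chapter.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic two-class data standing in for a real labelled dataset
X, y = make_classification(n_samples=1000, n_features=6, n_informative=4,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# One row per sample, one column per class; each row sums to 1
proba = clf.predict_proba(X_test)
labels = clf.predict(X_test)

print(np.round(proba[:3], 3))
print(labels[:3])
print("accuracy:", clf.score(X_test, y_test))
```

Having probabilities rather than only hard labels lets you choose your own decision threshold, which matters when one kind of error is costlier than the other.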
10.4.12.3 Decision trees
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
import time

# Standardize features: fit on the training set, reuse on the test set
scl = StandardScaler()
X_train_std = scl.fit_transform(X_train)
X_test_std = scl.transform(X_test)

start = time.time()
lin_reg = LinearRegression()
lin_reg.fit(X_train_std, y_train)  # SKLearn's linear regression
y_train_pred = lin_reg.predict(X_train_std)
train_time = time.time() - start

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# subsample_rate, results and np are expected to be defined earlier
X_train, _, y_train, _ = train_test_split(X_train, y_train, stratify=y_train, train_size=subsample_rate, random_state=42)
X_test, _, y_test, _ = train_test_split(X_test, y_test, stratify=y_test, train_size=subsample_rate, random_state=42)

model_lr_sklearn = LogisticRegression(multi_class="multinomial", C=1e6, solver="sag", max_iter=15)
model_lr_sklearn.fit(X_train, y_train)
y_pred_test = model_lr_sklearn.predict(X_test)
acc = accuracy_score(y_test, y_pred_test)

results.loc[len(results)] = ["LR Sklearn", np.round(acc, 3)]
results

Decision trees handle feature interactions and are non-parametric. They do not support online learning, so the entire tree needs to be rebuilt when a new training dataset comes in. Memory consumption is also very high.
Decision trees can be used in the following cases:
1. Investment decisions
2. Customer churn
3. Bank loan defaulters
4. Build vs. buy decisions
5. Sales lead qualification
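The point about feature interactions can be sketched with synthetic data in which the label depends only on the product of two features. This is an illustrative assumption, not the chapter's data: a fully grown DecisionTreeClassifier should pick up the interaction, while a plain LogisticRegression on the raw features should do little better than chance.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Label is 1 when both features share the same sign (a pure interaction)
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(2000, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
linear = LogisticRegression().fit(X_train, y_train)

tree_acc = tree.score(X_test, y_test)
linear_acc = linear.score(X_test, y_test)
print("decision tree accuracy:", tree_acc)
print("logistic regression accuracy:", linear_acc)
```

Note that the tree is refit from scratch here; scikit-learn's tree estimators have no incremental update method, which matches the remark about rebuilding the tree when new data arrives.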
10.4.12.4 K Means
This algorithm is used when the labels are not known and need to be created from the features of the objects. An example is dividing a group of people into different subgroups based on a common theme or attribute.
The main disadvantage of K-means is that you need to specify the number of clusters (groups) in advance. It can take many iterations to find the best K.
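One common way to search for a suitable K is to fit the model for several candidate values and compare a cluster-quality score. The sketch below uses the silhouette score on synthetic blobs; the data, the candidate range 2 to 6 and the choice of silhouette as the criterion are all assumptions of this example rather than something prescribed by the chapter.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Unlabelled synthetic data with three well-separated groups
X, _ = make_blobs(n_samples=300,
                  centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=1.0, random_state=42)

# Try several values of K and score each clustering
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(k, round(score, 3))
print("best K by silhouette:", best_k)
```

Each candidate K requires a full refit, which is why the search can become expensive on large datasets.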
10.4.12.5 Support Vector Machines
from sklearn.tree import DecisionTreeRegressor

dt = DecisionTreeRegressor()

# Time fitting plus prediction on the training set
start = time.time()
dt.fit(X_train_std, y_train)
y_train_pred = dt.predict(X_train_std)
train_time = time.time() - start

# Time prediction on the test set
start = time.time()
y_test_pred = dt.predict(X_test_std)
test_time = time.time() - start
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, PredefinedSplit
from sklearn.metrics import accuracy_score

# Subsample the training and test sets, keeping the class balance
X_train, _, y_train, _ = train_test_split(X_train, y_train, stratify=y_train, train_size=subsample_rate, random_state=42)
X_test, _, y_test, _ = train_test_split(X_test, y_test, stratify=y_test, train_size=subsample_rate, random_state=42)

model_knn_sklearn = KNeighborsClassifier(n_jobs=-1)
model_knn_sklearn.fit(X_train, y_train)
y_pred_test = model_knn_sklearn.predict(X_test)
acc = accuracy_score(y_test, y_pred_test)

results.loc[len(results)] = ["KNN Arbitary Sklearn", np.round(acc, 3)]
results
SVM is a supervised ML technique used for pattern recognition and classification problems, particularly when your data has exactly two classes. It is popular in text classification problems.
A few cases where SVM can be used are:
1. Detecting persons with common diseases
2. Handwritten character recognition
3. Text categorization
4. Stock market price prediction
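A minimal two-class sketch with scikit-learn's SVC is shown below. The synthetic blobs, the linear kernel and C=1.0 are assumptions picked to keep the example small; real problems such as text categorization would need feature extraction first.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two well-separated synthetic classes
X, y = make_blobs(n_samples=200, centers=[[-5, -5], [5, 5]],
                  cluster_std=1.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

svm = SVC(kernel="linear", C=1.0)
svm.fit(X_train, y_train)

test_acc = svm.score(X_test, y_test)
n_support = len(svm.support_vectors_)
print("test accuracy:", test_acc)
print("support vectors used:", n_support, "of", len(X_train), "training points")
```

Only the support vectors define the decision boundary, so the fitted model typically depends on a small subset of the training points.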
10.4.12.6 Naive Bayes
Naive Bayes is well suited to large datasets. The algorithm works well even when limited CPU and memory are available, because training amounts to calculating a bunch of counts. It also requires less training data. However, the algorithm can't learn interactions between features.
Naive Bayes can be used in real-world applications such as:
1. Sentiment analysis and text classification
2. Recommendation systems like Netflix, Amazon
3. To mark an email as spam or not spam
4. Face recognition
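The spam example can be sketched with word counts and scikit-learn's MultinomialNB. The six training messages and their labels below are invented for illustration; a real filter would need far more data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny hypothetical corpus: three spam and three non-spam ("ham") messages
train_texts = [
    "win free money now",
    "free prize claim now",
    "win cash prize",
    "meeting at noon tomorrow",
    "lunch with team tomorrow",
    "project meeting notes",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# Turn each message into a vector of word counts
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# Training only tallies per-class word counts (with smoothing)
nb = MultinomialNB()
nb.fit(X_train, train_labels)

new_texts = ["free money prize", "team meeting tomorrow"]
predictions = nb.predict(vectorizer.transform(new_texts))
print(list(predictions))
```

Because each word is treated independently given the class, the model cannot capture that two words together mean something different from each word alone, which is the interaction limitation noted above.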
10.4.12.7 Random Forest
Random forest builds on decision trees: it combines the predictions of many trees. It can be used for both regression and classification problems with large datasets.
A few cases where it can be applied:
1. Identify patients at high risk.
2. Predict part failures in manufacturing.
3. Predict loan defaulters.
from sklearn.ensemble import RandomForestRegressor

# criterion='mse' was renamed to 'squared_error' in scikit-learn 1.0 and later removed
forest = RandomForestRegressor(n_estimators=400, criterion='squared_error', random_state=1, n_jobs=-1)
start = time.time()
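The listing above stops right after the timer is started. As a self-contained sketch of what a timed fit could look like, the example below trains a smaller forest on synthetic regression data. The dataset from make_regression, the 50 trees and the R² checks are assumptions of this sketch, not the chapter's actual run.

```python
import time

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the real features
X, y = make_regression(n_samples=500, n_features=5, noise=10.0,
                       random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

# Fewer trees than the listing above, to keep the example fast
forest = RandomForestRegressor(n_estimators=50, random_state=1, n_jobs=-1)

start = time.time()
forest.fit(X_train, y_train)
train_time = time.time() - start

train_r2 = forest.score(X_train, y_train)
test_r2 = forest.score(X_test, y_test)
print("fit time (s):", round(train_time, 3))
print("train R^2:", round(train_r2, 3), "test R^2:", round(test_r2, 3))
```

Comparing the training and test scores gives a rough view of how much the ensemble overfits.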
10.4.12.8 Neural networks
A neural network works through weighted connections between neurons. The weights are learned during training, after which the network can be used to predict a class or a quantity. Neural networks are compute and memory intensive.
A few cases where they can be applied:
1. Unsupervised learning tasks, such as feature extraction.
2. Extracting features from raw images or speech with much less human intervention.
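To make the idea of trained connection weights concrete, the sketch below fits a small multilayer perceptron with scikit-learn's MLPClassifier and inspects the shapes of its weight matrices. The two-moons data and the two hidden layers of 16 neurons are arbitrary assumptions for this example.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic two-class data that is not linearly separable
X, y = make_moons(n_samples=1000, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Neural networks usually train better on standardized inputs
scaler = StandardScaler().fit(X_train)

mlp = MLPClassifier(hidden_layer_sizes=(16, 16), max_iter=2000,
                    random_state=0)
mlp.fit(scaler.transform(X_train), y_train)

# One weight matrix per layer-to-layer connection
weight_shapes = [w.shape for w in mlp.coefs_]
test_acc = mlp.score(scaler.transform(X_test), y_test)
print("weight matrix shapes:", weight_shapes)
print("test accuracy:", test_acc)
```

Even this small network holds several hundred weights, which hints at why larger networks become resource and memory intensive.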
10.4.12.9 Deep Learning using Keras
Keras is a powerful and easy-to-use Python library for developing and evaluating deep learning models. It wraps the efficient numerical computation libraries Theano and TensorFlow.
10.4.12.10 XGBoost
XGBoost stands for eXtreme Gradient Boosting. XGBoost is an implementation of gradient boosted decision trees designed for speed and performance. It is engineered for efficiency of compute time and memory resources.
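The core idea behind gradient boosted decision trees can be sketched without XGBoost itself. The example below is a hand-written boosting loop for squared error that uses scikit-learn's DecisionTreeRegressor as the weak learner; it is a simplified stand-in, not XGBoost's implementation, and the data, learning rate, tree depth and number of rounds are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-dimensional regression problem
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 6.0, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0.0, 0.1, size=300)

learning_rate = 0.1
n_rounds = 50

# Start from a constant prediction, then add small trees one at a time
prediction = np.full_like(y, y.mean())
trees = []
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(n_rounds):
    # For squared error, the negative gradient is simply the residual
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)
    mse_history.append(np.mean((y - prediction) ** 2))

print("training MSE before boosting:", mse_history[0])
print("training MSE after boosting:", mse_history[-1])
```

Each round fits a shallow tree to what the current ensemble still gets wrong, so the training error shrinks step by step. Libraries such as XGBoost add regularization and heavy engineering for speed and memory on top of this basic scheme.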