
Mining and Visualising a Structured Dataset

Registration No: 170243258

Wordcount: 3,077


Abstract

This report aims to predict house prices from numerous features of houses using two machine learning methods, Multiple Linear Regression and Random Forest Regression. The dataset is acquired from Kaggle and includes more than one thousand properties in Ames together with their features. The results show that the best model for this dataset is Random Forest, with the data preprocessed using feature reduction and log transformation. The report is structured as follows. Firstly, a brief introduction addresses the problem the author is attempting to solve. Next, the theories of the machine learning methods to be deployed are explained. Thirdly, the means of data cleaning are explained. The Methods section describes, step by step, the process of building the models and finding the best model in KNIME, including every node utilized. The results are presented and discussed in the fifth part. Finally, there are some reflections on conducting this experiment and suggestions for further work with hindsight. A conclusion is provided at the end as well.

Table of contents

1. Introduction

2. Data Mining Theory

2.1 Regression Analysis

2.2 Random Forest

2.3 Performance Evaluation

3. Data Preparation

3.1 Exploratory Data Analysis (EDA)

3.2 Missing Data and Outliers

4. Methods

4.1 Build the MLR and RF Model

4.2 Find the Best Model

5. Results and Discussion

6. Reflection

7. Conclusion

8. Reference


1. Introduction

What does your dream house look like? What are the prominent factors affecting the price of a house? How do we forecast the price of a house in the future? Can we really predict the sale price based on the features of the house? This report aims to answer these questions.

Using a dataset containing numerous features of more than one thousand properties, the author attempts to build a predictive model that not only reveals what really matters when buying a house, but also effectively forecasts how much a property can sell for. Therefore, SalePrice is the target variable in this report. Regression models are the appropriate choice for this goal since they are designed to forecast numeric outcomes. Two models, multiple linear regression and random forest, will be built and compared to find out which performs better. At the end of the experiment, the R Square score (R²) and Root Mean Squared Error (RMSE) will be calculated to evaluate the performance of each model.

2. Data Mining Theory

2.1 Regression Analysis

Regression is one of the analysis methods of supervised machine learning. The objective of this method is to build a model which is capable of telling us how independent variables affect the target variable and of predicting “quantitative” outcomes (Bowerman, O’Connell, & Murphree, 2014). This is a decisive feature: when we attempt to forecast a numeric quantity (usually a price), such as a stock price (Ouahilal, Mohajir, Chahhou, & Mohajir, 2017), movie box office (Simonoff & Sparrow, 2000), or the gold price (Ismail, Ismail, Yahya, & Shabri, 2009), regression models are broadly deployed. “We tend to refer to problems with a quantitative response as regression problems.” (James, Witten, Hastie, & Tibshirani, 2007, p.28). In a regression model, we try to build a relationship between the target (or dependent, response) variable, which is normally denoted by the letter y, and the independent variables (or features, predictors, indicators, attributes…), which are normally denoted by x1, x2, x3,…, xn (Bowerman et al., 2014). A simple linear regression model with one independent variable and one target variable can be presented as $y = bx + c$, where $c$ represents the intercept and $b$ stands for the coefficient of the independent variable. Multiple Linear Regression (MLR) refers to a model with two or more independent variables and one target variable, presented as $y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$, where $b_0$ represents the intercept and $b_i$ stands for the coefficient of the independent variable $x_i$. The coefficient also indicates the extent to which an independent variable affects the target variable.

In order to build and assess the model, the data is usually separated into two parts: a training set and a test set. The training set is used to build the model; in general, the more training data there is, the more precise the model tends to be. After the training process, the predicted target values are compared with the actual target values of the test set to see how much they differ. The smaller the difference between the predicted and actual data, the more accurate the model is. The ratio of training set to test set is chosen by the experimenter. Typically, 70% or 80% of the data is used as the training set and the remaining 30% or 20% as the test set.
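To make the simple linear regression equation concrete, the following is a minimal Python sketch of an ordinary least-squares fit of y = bx + c. It is only an illustration and not part of the KNIME workflow used in this report; the function name `fit_simple_linear_regression` and the area and price values are hypothetical.

```python
from statistics import mean


def fit_simple_linear_regression(x, y):
    """Return (b, c) that minimise the squared error of y = b*x + c."""
    x_bar, y_bar = mean(x), mean(y)
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx          # coefficient of the independent variable
    c = y_bar - b * x_bar  # intercept
    return b, c


# Hypothetical living areas and prices, invented for illustration only.
area = [1000, 1500, 2000, 2500]
price = [110000, 160000, 210000, 260000]

b, c = fit_simple_linear_regression(area, price)
predicted = [b * a + c for a in area]
print(b, c)
```

MLR follows the same least-squares idea with one coefficient per predictor, which in practice is solved with matrix routines rather than by hand.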

2.2 Random Forest

However, in many cases no line of best fit can be found because there is no linear relationship in the data. A non-linear regression model must be adopted under those circumstances. Decision Trees and Random Forest (RF) are two such models, and this report adopts the Random Forest method. RF is essentially an extension of Decision Trees via bagging and decorrelation. Bagging is “a general-purpose procedure for reducing the variance of a statistical learning method” (James et al., 2007, p. 316), which is particularly effective for Decision Trees because of their high variance. By combining numerous trees, each built on a bootstrap sample of the training data, the accuracy of the model can be substantially improved (Ghasemi & Tavakoli, 2013).

However, there is still one problem: bagging may use the same strong predictor as the root of every decision tree, making the trees highly correlated, which does not lead to a good reduction of variance (James et al., 2007). RF overcomes this problem. By considering only a randomly selected subset of predictors at each node, every predictor has a more equal chance of being selected (Singh, Sihag, & Singh, 2017), and the correlation between trees is therefore reduced.

It is hard to say in general which learning model, MLR or RF, performs better. MLR outperforms RF when there is a strong linear relationship between the independent variables and the target variable (Smith, Ganesh, & Liu, 2013), while RF is superior to MLR when the relationship is more complex (Svetnik et al., 2003).
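The two sources of randomness described above, bootstrap sampling of rows and random selection of candidate predictors, can be sketched in Python as follows. This is not a full Random Forest (no trees are grown); the helper names, the number of trees, and the subset size `k=3` are assumptions chosen for illustration.

```python
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (the bagging step)."""
    return [rng.choice(rows) for _ in rows]


def random_feature_subset(features, k, rng):
    """Pick k candidate predictors, without replacement, for one split."""
    return rng.sample(features, k)


def average_prediction(tree_predictions):
    """A regression forest outputs the mean of its trees' predictions."""
    return sum(tree_predictions) / len(tree_predictions)


rng = random.Random(42)  # fixed seed so the sketch is repeatable
features = ["OverallQual", "GrLivArea", "GarageCars", "TotalBsmtSF",
            "FullBath", "TotRmsAbvGrd", "YearBuilt", "YearRemodAdd"]
rows = list(range(10))   # stand-in row indices, not real data

for tree in range(3):    # three trees only, to keep the output short
    sample = bootstrap_sample(rows, rng)
    candidates = random_feature_subset(features, k=3, rng=rng)
    print(tree, sorted(sample), candidates)
```

Because each tree sees a different sample and different candidate predictors, a single dominant predictor cannot be chosen at the root of every tree.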

2.3 Performance Evaluation

To assess the accuracy and performance of a regression model, two quantities are generally used: Root Mean Squared Error (RMSE) and the R Square score (R²). The two numbers are able to “quantify the extent to which the model fits the data” (James et al., 2007, p.68). R² is a measure of how closely the observed data fit the regression model, in other words, to what extent the model explains the variability of the data (Minitab, 2013). R² can be written as follows (Frost, 2017):

$$R^2 = \frac{\text{Variance explained by the model}}{\text{Total variance}}$$

The R² value ranges from 0 to 1. A value of 0 means the model does not explain any of the variability of the target data, whereas 1 indicates the model perfectly explains all of the variability of the response data (Minitab, 2013). Ideally, we aim to get an R² value as close to one as possible.

After the model is built, there are two kinds of target values for each data point $i$: the predicted value ($\hat{y}_i$) and the actual value ($y_i$). The difference between them, $y_i - \hat{y}_i$, is called the error, or residual. Every data point has a residual, which is positive if the data point lies above the fitted line and negative if it lies below. If we simply add the residuals up, the positive and negative values cancel out and the sum is close to zero, so we square them, $(y_i - \hat{y}_i)^2$, to avoid this outcome. Then we divide the sum of squares by the number of data points $n$ and take the square root to get the root of the mean squared error. RMSE can be presented as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$

RMSE measures the typical size of the difference between the estimated and actual values. Ideally, we aim to build a model with an RMSE as low as possible. Both R² and RMSE are useful measures for evaluating the performance of a regression model, each with a different meaning. They will be used in combination in this report.
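As a worked illustration of the two measures, the Python sketch below computes RMSE and R² for four made-up prices. R² is computed here in the common form 1 − SS_res/SS_tot, which is an assumption about the exact formula; it equals the explained share of variance for a least-squares fit evaluated on its own training data. The function names and numbers are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root of the mean squared residual."""
    n = len(actual)
    squared = sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted))
    return math.sqrt(squared / n)


def r_squared(actual, predicted):
    """1 - SS_res / SS_tot (assumed form of the R² score)."""
    y_bar = sum(actual) / len(actual)
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted))
    ss_tot = sum((y - y_bar) ** 2 for y in actual)
    return 1 - ss_res / ss_tot


# Made-up actual and predicted sale prices, for illustration only.
actual = [200000, 150000, 300000, 250000]
predicted = [210000, 140000, 290000, 260000]

print(rmse(actual, predicted))       # every residual is +/-10,000
print(r_squared(actual, predicted))
```

Each prediction here is off by exactly 10,000, so the RMSE is 10,000, while R² relates that error to the spread of the actual prices.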

3. Data Preparation

The “House Prices” dataset contains 1460 properties (rows) and 79 features (columns). The features describe each residential house from 79 different aspects. Counting the target variable as well, there are 37 numeric variables and 43 categorical variables.

3.1 Exploratory Data Analysis (EDA)

It is always important to fully understand the data before carrying out analysis. In addition, since there are many predictors in the dataset, including all of them may lead to overfitting. Irrelevant predictors can be identified in the exploratory stage. I first take a look at the distribution of the target variable.


Figure 1

As shown in Figure 1, the distribution of “SalePrice” is not normal. It is positively skewed, with a skewness of 1.88 and a kurtosis of 6.54, meaning that most houses sit at the lower end of the price range, with a long tail of expensive ones. Next, I explore the relationship between the independent variables and the target variable by creating a correlation table (Table 1) and a matrix graph (Figure 2). I define a predictor as highly correlated if its correlation coefficient with “SalePrice” is higher than 0.5 or lower than -0.5, and display only these strong predictors in the table.


Figure 2

Predictor       Correlation with SalePrice
OverallQual     0.79
GrLivArea       0.71
GarageCars      0.64
GarageArea      0.62
TotalBsmtSF     0.61
1stFlrSF        0.61
FullBath        0.56
TotRmsAbvGrd    0.53
YearBuilt       0.52
YearRemodAdd    0.51

Table 1
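The screening rule behind Table 1 (keep a predictor when its correlation with “SalePrice” is above 0.5 or below -0.5) can be sketched in Python as below. The Pearson coefficient is assumed as the correlation measure, and the tiny columns are invented toy values rather than rows from the dataset.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    cov = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    var_x = sum((a - x_bar) ** 2 for a in x)
    var_y = sum((b - y_bar) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def strong_predictors(columns, target, threshold=0.5):
    """Return {name: r} for columns whose |r| with the target exceeds threshold."""
    result = {}
    for name, values in columns.items():
        r = pearson(values, target)
        if abs(r) > threshold:
            result[name] = r
    return result


# Toy values, invented for illustration only.
sale_price = [100, 150, 200, 250, 300]
columns = {
    "OverallQual": [4, 5, 6, 8, 9],  # rises with price
    "MoSold": [6, 2, 9, 3, 6],       # no clear pattern
}

strong = strong_predictors(columns, sale_price)
print(sorted(strong))
```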

There are 10 strong predictors. However, there are still some attributes we can remove. If two predictors are strongly correlated with each other, they essentially carry the same information, so we can remove one of them (Marcelino, 2018). This applies to “GarageCars” and “GarageArea”, and to “TotalBsmtSF” and “1stFlrSF”. As a result, I keep the predictor with the higher correlation with “SalePrice” in each pair and remove “GarageArea” and “1stFlrSF”. The remaining eight predictors are defined as strong predictors. I then take a look at the relationship between the strong predictors and the target variable.

Figure 3

This pairplot illustrates that essentially all eight features have a positive relationship with “SalePrice”. Some of these plots also show outliers.

3.2 Missing Data and Outliers

In this section, I deal with missing data and outliers. First, I look into the missing data in the dataset.


Table 2

As shown in Table 2, there are 7 variables with more than 5% of their values missing. In particular, pools, alleys, and fences are very rare. Although there is no missing data in “PoolArea”, it is tied to “PoolQC”, and 1453 rows have a value of 0. It should therefore be excluded as well if we want to exclude “PoolQC”. The same applies to “Fireplace” and “FireplaceQu”: nearly half of these properties do not have fireplaces. The linear feet of street connected to the property is missing for nearly 20% of the rows; perhaps it is not a primary consideration when people buy a house. Garage condition can be substituted by other similar attributes such as “GarageCars” and “GarageArea”. These 9 variables will be removed in the preprocessing stage so that the models are not affected by missing data.
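The 5% rule used above can be expressed as a short Python sketch that computes the share of missing values per column and flags the columns above the threshold. Treating `None`, an empty string, and the text "NA" as missing is an assumption about how the file encodes gaps, and the four rows are invented for illustration.

```python
def missing_share(rows, column):
    """Fraction of rows in which the column is missing."""
    missing = sum(1 for row in rows if row.get(column) in (None, "", "NA"))
    return missing / len(rows)


# Invented rows; only the column names come from the dataset.
rows = [
    {"PoolQC": "NA", "LotFrontage": "65", "GrLivArea": "1500"},
    {"PoolQC": "NA", "LotFrontage": "NA", "GrLivArea": "1800"},
    {"PoolQC": "Gd", "LotFrontage": "80", "GrLivArea": "2100"},
    {"PoolQC": "NA", "LotFrontage": "70", "GrLivArea": "1200"},
]

threshold = 0.05  # the 5% cut-off used in this section
to_drop = [col for col in ("PoolQC", "LotFrontage", "GrLivArea")
           if missing_share(rows, col) > threshold]
print(to_drop)
```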

Next, I inspect whether there are any outliers in the continuous variables using bivariate analysis and scatter plots.

Figure 4

From the scatter plot we can see that there are two obvious outliers at the bottom-right corner in terms of “GrLivArea”. The top two points still seem to follow the pattern, so I treat them as leverage points and keep them.


Figure 5

There is only one outlier in “TotalBsmtSF”, the one at the bottom-right corner with a value greater than 6,000.

4. Methods

4.1 Build the MLR and RF Model

Use the “File Reader” node to import the “House Price.csv” file. After reading in the file, data cleaning must be conducted before analysing. I deal with missing data first. From the exploratory data analysis, I have already found that 9 attributes need to be removed because of their high proportion of missing values. Use the “Column Filter” node to exclude these attributes. For the rest of the missing data, I fill in the mean value (numeric columns) or the most frequent value (string columns). Next, from the continuous bivariate analysis, I know that there are three outliers in the plots of “GrLivArea” against “SalePrice” and “TotalBsmtSF” against “SalePrice”. Hence, using the “Row Filter” node, I exclude the rows with a value higher than 4,500 in “GrLivArea” or 6,000 in “TotalBsmtSF”.
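Outside KNIME, the same two cleaning steps (filling gaps with the mean or the most frequent value, then filtering rows above the limits) could look roughly like the Python sketch below. It is not the KNIME implementation; the helper names and all values are hypothetical, and only the limits 4,500 and 6,000 come from the text.

```python
from statistics import mean, mode


def impute(values, is_numeric):
    """Replace None with the mean (numeric) or the most frequent value (string)."""
    present = [v for v in values if v is not None]
    fill = mean(present) if is_numeric else mode(present)
    return [fill if v is None else v for v in values]


def drop_outliers(rows, limits):
    """Keep only rows whose values do not exceed the given upper limits."""
    return [row for row in rows
            if all(row[col] <= limit for col, limit in limits.items())]


# Hypothetical columns with one gap each.
numeric_column = impute([100.0, None, 300.0], is_numeric=True)
string_column = impute(["A", None, "A", "B"], is_numeric=False)

# Hypothetical rows; the second and third exceed one limit each.
rows = [
    {"GrLivArea": 1500, "TotalBsmtSF": 900},
    {"GrLivArea": 5000, "TotalBsmtSF": 1200},
    {"GrLivArea": 2000, "TotalBsmtSF": 6500},
]
cleaned = drop_outliers(rows, {"GrLivArea": 4500, "TotalBsmtSF": 6000})
print(numeric_column, string_column, len(cleaned))
```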

Now the data is cleaned and ready for building our model. The last step before building the model is splitting the data into a training set and a test set. I choose the most common proportion, 80% training set and 20% test set, for this partition. This task can be done with the “Partitioning” node by setting the first partition to 80%. I also use a fixed random seed so that the partition stays the same when I adjust other nodes and the results are not influenced by re-partitioning.
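The effect of a fixed random seed on the 80/20 partition can be illustrated with the Python sketch below. It mimics the idea of the partitioning step rather than KNIME's actual algorithm; the function name `train_test_split` and the seed value are assumptions.

```python
import random


def train_test_split(rows, train_fraction=0.8, seed=1):
    """Shuffle a copy of the rows with a fixed seed, then cut at train_fraction."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(100))  # stand-in row indices, not real data
train, test = train_test_split(rows)
train_again, test_again = train_test_split(rows)

print(len(train), len(test))
print(train == train_again)  # same seed, same partition
```

With the seed fixed, any change in R² or RMSE between runs comes from the changed nodes and not from a different random split.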

There are two arrows (output ports) on the right side of the “Partitioning” node: the upper one carries the training set whereas the lower one carries the test set. Connect the upper one to the learner nodes, “Linear Regression Learner” and “Random Forest Learner (Regression)”, and the lower one to the predictor nodes, “Regression Predictor” and “Random Forest Predictor (Regression)”. Finally, “Scatter Plot” and “Numeric Scorer” are deployed to evaluate the performance of the model. Before executing “Numeric Scorer”, set the “Reference column” to “SalePrice” and the “Predicted column” to “Prediction(SalePrice)”, which should be generated automatically. “Scatter Plot” visualises the model: we can assess it by inspecting how close the points are to the line y=x. “Numeric Scorer” gives us the statistics of the model, R² and RMSE.

4.2 Find the Best Model

For each machine learning method, I use four different experimental designs to find the best model, the one with the highest R² and lowest RMSE:

1) Use the original data, dealing only with missing data and outliers.

2) Include the eight strong predictors only.

3) Apply log transformation to the target value.

4) Include the eight strong predictors and apply log transformation to the target value.

From the EDA, I have identified eight strong predictors as they have the highest correlation with the target variable. In the second design, I include these predictors only, to avoid noise and overfitting. Before the “Partitioning” node, add a “Column Filter” node and configure it to include the eight predictors and the target variable only.

Also from the EDA, the target variable “SalePrice” is identified as not normally distributed and positively skewed. Log transformation is an effective way to reduce the skewness and make the distribution closer to normal (Feng, Wang, Lu, & Tu, 2013). It can be conducted by using the “Math Formula” node. Add the node before “Partitioning” and, in the configuration, select log(x), which can be found in the function list. Then select “SalePrice” from the column list as the x value. I append the result as another column, “log(SalePrice)”, to differentiate the two. I also add “Histogram” nodes before and after this formula to inspect whether the distribution becomes more normal. The histograms clearly show that the transformation makes a huge difference.

[Histograms of “SalePrice”: before the transformation; after the transformation]

Since the model is built on this log value, its predictions are on the log scale as well. Hence, the log transformation has to be reversed. Add another “Math Formula” node after the predictor node. Choose the (x^y) function and put 10 as the base and “Prediction(log(SalePrice))” as the exponent in the configuration.
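The transformation and its reversal can be checked with a small Python sketch. A base-10 logarithm is assumed here because the back-transformation described above raises 10 to the predicted value; the prices are invented, and the skewness function uses the simple moment-based estimator, which may differ from the one behind the figures quoted in the EDA.

```python
import math


def skewness(values):
    """Moment-based skewness: third central moment over variance to the power 1.5."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Invented, positively skewed prices (a few expensive houses in the tail).
prices = [100000, 120000, 130000, 150000, 180000, 250000, 400000, 755000]

log_prices = [math.log10(p) for p in prices]  # forward transformation
restored = [10 ** v for v in log_prices]      # reverse transformation

print(skewness(prices), skewness(log_prices))
print(restored[0])
```

The skewness of the logged values is lower than that of the raw values, and raising 10 to each logged value recovers the original prices up to floating-point rounding.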

The fourth design combines using the strong predictors only with applying log transformation.

5. Results and Discussion

The performance of each model is shown below in Table 3:

        Design 1             Design 2             Design 3             Design 4
        R²      RMSE         R²      RMSE         R²      RMSE         R²      RMSE
MLR     0.894   25,290.748   0.805   36,614.146   0.938   22,355.278   0.874   27,345.071
RF      0.727   34,529.37    0.879   28,356.335   0.716   44,860.727   0.988   7,658.516

Table 3

Among the MLR models, the best performance comes from design 3, which applies log transformation to the target variable. This is quite intuitive: the closer the relationship with the target variable is to linear, the more accurate the linear regression model is. When using the same group of predictors, applying log transformation leads to a higher R² value. This can be seen by comparing design 1 with design 3, and design 2 with design 4. Hence, we can conclude that the log transformation makes the MLR model more precise.

In terms of the RF model, the combination of using the eight strong predictors only and applying log transformation leads to the highest R² value, which is 0.988. The two lowest R² values occur when all predictors are included without filtering out the noise; even applying log transformation does not make a big difference there. By contrast, the models that exclude irrelevant predictors first show a considerable increase in R²: about 15 percentage points from design 1 to design 2 and 27 percentage points from design 3 to design 4. From the results, we can tell that filtering out noise has more impact on RF models than making the target variable more normally distributed.
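The size of these gains can be checked directly from the RF row of Table 3. The short Python sketch below reports both the absolute difference in R² and the change relative to the starting value; the helper name `gain` is hypothetical.

```python
# R² values of the RF model, copied from Table 3.
r2 = {
    "design 1": 0.727, "design 2": 0.879,
    "design 3": 0.716, "design 4": 0.988,
}


def gain(before, after):
    """Return (absolute difference, difference relative to the starting value)."""
    absolute = after - before
    relative = absolute / before
    return absolute, relative


for start, end in [("design 1", "design 2"), ("design 3", "design 4")]:
    absolute, relative = gain(r2[start], r2[end])
    print(f"{start} -> {end}: +{absolute:.3f} absolute, +{relative:.1%} relative")
```

The absolute differences are 0.152 and 0.272, so the quoted increases are differences in R² itself rather than changes relative to the starting value.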

Although the highest R² value, nearly 0.99, is achieved by an RF model, we still cannot claim that RF is superior to MLR. In fact, without excluding irrelevant predictors, MLR actually performs better than RF. Three conclusions can be drawn from the results of this experiment:

a) Log transformation does reduce the skewness of positively skewed variables and strengthens the performance of the MLR model.

b) Excluding irrelevant predictors first has a large impact on RF, making it much more precise.

c) The model with the best performance is RF using the combination of including strong predictors only and applying log transformation to the target variable.

Nde (2017) used the same dataset and built several linear regression models using different combinations of predictors in his thesis. He also looked into the correlation between each independent variable and the target variable, and the results are quite similar to mine. The top ten strong predictors he identified were “HouseSpace”, “OverallQual”, “GrLivArea”, “1stFlrSF”, “YearBuilt”, “GarageCars”, “GarageArea”, “TotalBsmtSF”, “plotArea”, and “GarageType”. The main reason why the top predictors differ slightly between the two works is probably the difference in how missing data and outliers were handled. Nde (2017) turned the “NA” values into “0”, whereas there is no such process of transforming categorical data into numerical data in my experiment. The model with the best performance in Nde’s experiment was the one using predictors suggested by the glmulti algorithm, with an R² value of 0.8907, lower than the best one in this report.


The finding that log transformation improves the performance of a linear regression model is also supported by another work, conducted by Wu (2017). Wu utilized Support Vector Machine (SVM) to build the model and Principal Component Analysis (PCA) to extract features. Log transformation was included as well for comparison. The highest R² achieved was 0.86. Different methods of reducing features can lead to different results as well.

6. Reflection

The most challenging part of this experiment is actually data preprocessing. Since the results are determined by what data you feed to the model, preprocessing plays a crucial role in the whole experiment. According to the results, my experiment achieves a very high R² score of nearly 0.99. However, I think it is abnormally high. There might be room for improvement in data preprocessing. The means of dealing with missing data and outliers could be more comprehensive. In addition, although the strong predictors are selected based on the correlation matrix, the selection is still somewhat subjective. Further effort could be put into trying different numbers of top influential predictors.

Furthermore, only two machine learning methods are deployed in the report. Other methods such as k-Nearest Neighbours or support vector machines could be applied for comparison. There are various ways to carry out the experiment and each way results in different outcomes. A more thorough experiment could be done by including more methods.

7. Conclusion

The features that influence the house price most are “OverallQual”, “GrLivArea”, “GarageCars”, “TotalBsmtSF”, “FullBath”, “TotRmsAbvGrd”, “YearBuilt”, and “YearRemodAdd”. Applying log transformation has more impact than feature reduction on the MLR model, whereas the RF model performs better when irrelevant features are excluded.

The best model is RF using both feature reduction and log transformation. Throughout the process, I have gained a better understanding of the theories behind machine learning methods and how to apply them effectively. Data preprocessing is the most important stage in the whole process, because the results are most influenced by the data fed to the models. Data preprocessing is something of an art. It is also challenging to decide which method to apply, as different methods have their own strengths and weaknesses when dealing with different datasets. We have to understand the data thoroughly first in order to apply the best means of preprocessing and the best learning method, and we should always be reflective about the results.

8. Reference

Bowerman, B. L., O’Connell, R. T., & Murphree, E. S. (2015). Regression analysis: unified concepts, practical applications, and computer implementation. Retrieved from http://find.shef.ac.uk/primo_library/libweb/action/display.do?tabs=viewOnlineTab&ct=display&fn=search&doc=44SFD_ALMA_DS51258001280001441&indx=1&recIds=44SFD_ALMA_DS51258001280001441&recIdxs=0&elementId=0&renderMode=poppedOut&displayMode=full&frbrVersion=&frbg=&&dscnt=0&scp.scps=scope%3A%2844SFD%29%2Cprimo_central_multiple_fe&tb=t&mode=Basic&vid=SFD_VU2&srt=rank&tab=everything&dum=true&vl(freeText0)=regression&dstmp=1525503278149

Feng, C., Wang, H., Lu, N., & Tu, X. M. (2013). Log transformation: application and interpretation in biomedical research. Statistics in Medicine, 32(2), 230–239. https://doi.org/10.1002/sim.5486

Frost, J. (2017). How To Interpret R-squared in Regression Analysis - Statistics By Jim. Retrieved May 7, 2018, from http://statisticsbyjim.com/regression/interpret-r-squared-regression/

Ghasemi, J. B., & Tavakoli, H. (2013). Application of random forest regression to spectral multivariate calibration. Analytical Methods, 5(7), 1863. https://doi.org/10.1039/c3ay26338j

Ismail, Z., Ismail, Z., Yahya, A., & Shabri, A. (2009). Forecasting Gold Prices Using Multiple Linear Regression Method. American Journal of Applied Sciences, 6(8), 1509–1514. Retrieved from http://thescipub.com/pdf/10.3844/ajassp.2009.1509.1514

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2007). An Introduction to Statistical Learning with Applications in R. Performance Evaluation (Vol. 64). https://doi.org/10.1016/j.peva.2007.06.006

Marcelino, P. (2018). Comprehensive data exploration with Python. Retrieved May 9, 2018, from https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python

Minitab. (2013). Regression Analysis: How Do I Interpret R-squared and Assess the Goodness-of-Fit? Retrieved May 7, 2018, from http://blog.minitab.com/blog/adventures-in-statistics-2/regression-analysis-how-do-i-interpret-r-squared-and-assess-the-goodness-of-fit

Nde, S. (2017). Fitting a Linear Regression Model and Forecasting in R in the Presence of Heteroskedascity with Particular Reference to Advanced Regression Technique Dataset on kaggle.com. All Student Theses. Retrieved from https://opus.govst.edu/theses/99

Ouahilal, M., Mohajir, M. El, Chahhou, M., & Mohajir, B. E. El. (2017). A novel hybrid model based on Hodrick–Prescott filter and support vector regression algorithm for optimizing stock market price prediction. Journal of Big Data, 4(1), 31. https://doi.org/10.1186/s40537-017-0092-5

Simonoff, J. S., & Sparrow, Ii. R. (2000). Predicting Movie Grosses: Winners and Losers, Blockbusters and Sleepers. CHANCE, 13(3), 15–24. https://doi.org/10.1080/09332480.2000.10542216

Singh, B., Sihag, P., & Singh, K. (2017). Modelling of impact of water quality on infiltration rate of soil by random forest regression. Modeling Earth Systems and Environment, 3(3), 999–1004. https://doi.org/10.1007/s40808-017-0347-3

Smith, P. F., Ganesh, S., & Liu, P. (2013). A comparison of random forest regression and multiple linear regression for prediction in neuroscience. Journal of Neuroscience Methods, 220(1), 85–91. https://doi.org/10.1016/J.JNEUMETH.2013.08.024

Svetnik, V., Liaw, A., Tong, C., Culberson, J. C., Sheridan, R. P., & Feuston, B. P. (2003). Random Forest: A Classification and Regression Tool for Compound Classification and QSAR Modeling. Journal of Chemical Information and Modeling, 43(6), 1947–1958. https://doi.org/10.1021/CI034160G

Wu, J. (2017). Housing Price prediction Using Support Vector Regression. Master’s Projects. Retrieved from http://scholarworks.sjsu.edu/etd_projects/540