Chapter 3: Predictive Analysis Applied
3.3. Challenges & Resolutions
This section discusses the following four common difficulties faced in any predictive analysis process.
1. Cause & effect
2. Lies, damned lies & statistics
3. Model overfitting
4. Correlation between independent variables
Finding a good mathematical relationship between variables does not allow us to conclude that the relationship is one of cause & effect, because not every mathematical relationship can be interpreted that way. If we plot the number of jobs in the market against the number of cars newly bought in a city, we can see a mathematical relationship, but we cannot conclude that a new job is created with every car sold.
The second challenge, lies & statistics, can be understood through an example that shows the danger of looking only at statistical measures, known as Anscombe’s quartet. Anscombe used four datasets with nearly identical simple statistical properties to demonstrate the importance of plotting data before analyzing it, and the impact of outliers: despite their matching statistics, the four datasets look very different from one another when plotted.
The first dataset appears well behaved, with a clean and well-fitting linear model. It is described by y = 3 + 0.5x, with a mean of X of 9 and a mean of Y of about 7.5. The second dataset does not have a linear relationship, yet it yields the same fitted line y = 3 + 0.5x with the same R-squared value of about 0.67. The third dataset does have a linear relationship, but the linear regression is thrown off by an outlier; had the outlier been spotted and removed before fitting, it would have been easy to fit a correct linear model. The last dataset does not fit any kind of linear model, but its single outlier keeps the alarm from going off, because that one point is enough to produce the same fitted line and summary statistics. This implies that it is wise to understand the data before applying any algorithm. The four datasets used are listed below.
        I               II              III               IV
     x      y        x      y        x      y        x      y
  10.0   8.04     10.0   9.14     10.0   7.46      8.0   6.58
   8.0   6.95      8.0   8.14      8.0   6.77      8.0   5.76
  13.0   7.58     13.0   8.74     13.0  12.74      8.0   7.71
   9.0   8.81      9.0   8.77      9.0   7.11      8.0   8.84
  11.0   8.33     11.0   9.26     11.0   7.81      8.0   8.47
  14.0   9.96     14.0   8.10     14.0   8.84      8.0   7.04
   6.0   7.24      6.0   6.13      6.0   6.08      8.0   5.25
   4.0   4.26      4.0   3.10      4.0   5.39     19.0  12.50
  12.0  10.84     12.0   9.13     12.0   8.15      8.0   5.56
   7.0   4.82      7.0   7.26      7.0   6.42      8.0   7.91
   5.0   5.68      5.0   4.74      5.0   5.73      8.0   6.89
Four Data Sets in Anscombe’s Quartet
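The claim that the four datasets share their summary statistics can be checked directly. The following is a minimal sketch in Python using only the standard library; the function name `summarize` and the dictionary names are illustrative choices, not part of the original analysis. It computes the means, the least-squares line and R-squared for each dataset in the table.

```python
from statistics import mean

# x values shared by datasets I, II and III, as listed in the table.
X123 = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]

QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                  6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                   6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 19.0, 8.0, 8.0, 8.0],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
            5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return means, least-squares line and R-squared for one dataset."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return {
        "mean_x": mx,
        "mean_y": my,
        "slope": slope,
        "intercept": my - slope * mx,
        "r_squared": sxy ** 2 / (sxx * syy),
    }


SUMMARY = {name: summarize(x, y) for name, (x, y) in QUARTET.items()}

for name, s in SUMMARY.items():
    print(name, {key: round(value, 2) for key, value in s.items()})
```

All four rows of output should agree to about two decimal places, which is exactly why the numbers alone cannot tell the datasets apart and a plot is needed.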
Overfitting is the condition in which a model fits the data under analysis “too well”: it describes the sample almost perfectly but is too rigid to fit any other sample. Such a model fails to serve our predictive needs because it fits new data badly. Overfitting needs particular attention when the sample is small or the data are limited in some way. It can be defined as the phenomenon in which a predictive model describes the relationship between the predictors and the outcome well in the sample data only, but fails to provide valid predictions on new data. It is generally the result of high expectations and a demand for accuracy, which lead to fitting the sample data extra closely by introducing too many input variables. Most of the time it occurs when a model has too many parameters compared to the number of data points. Setting test data aside and examining the model against it from every angle is crucial when building a predictive model that must stay accurate and stable over time.
The figure below illustrates this with two graphs based on the same data points. The left graph does a decent job, as it captures the general nature and characteristics of the relationship between the X and Y variables. The right graph is clearly trying too hard to capture every subtle change in the relationship between the two variables. As a result, the model on the left outperforms the model on the right when new data points are fed in, because the right-hand model cannot generalize well to data it has not seen before. To avoid overfitting, the usual advice is to use a portion of the available data to train the model and to keep the rest as unseen, hold-out data to test it. This is a key methodology in PA, and an especially important one in classification analysis and time series analysis.
Process of Overfitting the models
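The hold-out idea can be sketched in a few lines of Python. Everything in this example is an assumption made for illustration: the data are synthetic (a straight line plus random noise), the 50/50 split is arbitrary, and the overfitted model is a Lagrange interpolating polynomial, chosen because it has as many parameters as training points and therefore passes through every one of them.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Least-squares straight line y = a + b*x (two parameters)."""
    mx, my = mean(xs), mean(ys)
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return lambda x: a + b * x


def fit_interpolant(xs, ys):
    """Lagrange polynomial through every training point (overfitted)."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            weight = 1.0
            for j, xj in enumerate(xs):
                if j != i:
                    weight *= (x - xj) / (xi - xj)
            total += yi * weight
        return total
    return predict


def mse(model, xs, ys):
    """Mean squared prediction error of a model on the given points."""
    return mean((model(x) - y) ** 2 for x, y in zip(xs, ys))


# Synthetic data (an assumption): y = 2 + 0.5x plus Gaussian noise.
rng = random.Random(0)
points = [(float(x), 2.0 + 0.5 * x + rng.gauss(0.0, 1.0)) for x in range(20)]

# Hold-out split: half the points train the models, half stay unseen.
rng.shuffle(points)
train, test = points[:10], points[10:]
train_x, train_y = zip(*train)
test_x, test_y = zip(*test)

simple = fit_line(train_x, train_y)
wiggly = fit_interpolant(train_x, train_y)

# (training error, hold-out error) for each model.
RESULTS = {
    "simple": (mse(simple, train_x, train_y), mse(simple, test_x, test_y)),
    "wiggly": (mse(wiggly, train_x, train_y), mse(wiggly, test_x, test_y)),
}

for name, (train_err, test_err) in RESULTS.items():
    print(f"{name}: train MSE = {train_err:.3g}, hold-out MSE = {test_err:.3g}")
```

The interpolating polynomial reproduces the training points exactly, so its training error is zero, yet its hold-out error is far larger than that of the straight line. Judged on training error alone it would look like the better model, which is why the comparison has to be made on data the model has not seen.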
‘Multicollinearity’ is a problem that arises when you are trying to fit a regression model or another linear model. It refers to predictors that are correlated with other predictors in the model; statisticians define it as a strong correlation between two or more independent variables. Unfortunately, the effects of multicollinearity can feel vague and intangible, which makes it unclear how to fix it, or even whether it should be fixed. Because the predictors move together linearly, it is difficult to separate their individual effects on the dependent variable. Parameter estimates may change significantly in response to small changes in the model or the data. In other words, multicollinearity affects the calculations regarding individual predictors without reducing the predictive power or reliability of the model as a whole, at least within the sample data itself. A multiple regression model with correlated predictors can therefore indicate how well the whole bundle of predictors predicts the outcome variable, but it will not always give valid results about any individual predictor, or about how redundant the predictors are with respect to each other. Some multicollinearity is normal, but a high degree of it becomes a problem because the variance of the coefficient estimates increases, which makes the estimates very sensitive to minor changes in the model. The main sources of multicollinearity are the method used for data collection, constraints on the population, model specification, and an over-defined (overfitted) model. Multicollinearity cannot be removed completely, but it can be reduced by several remedial measures, such as collecting additional or new data, re-specifying the model, using ridge regression, or applying a data reduction technique such as principal component analysis.
Examples of Multicollinearity
The figure above shows two variables, X1 and X2, that are highly positively correlated; the correlation coefficient between them, computed from the data on the left, is 0.9771. Finding a model that describes the relationship between Y and the independent variables X1 and X2 is difficult, because the two are so close that we can barely tell their effects apart. This is why a high degree of multicollinearity becomes a problem: the variance of the coefficient estimates increases, which makes the estimates very sensitive to minor changes in the model. As mitigation, collecting more data helps but cannot solve the problem completely. Omitting one of the correlated variables is another interesting approach, provided you can decide which variable to drop, at the risk of ignoring the real causal variable.
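To give a sense of how strongly a correlation of this size inflates coefficient variance, the sketch below computes a Pearson correlation and the corresponding variance inflation factor (VIF). The VIF is a standard diagnostic that this chapter does not itself use, so it is introduced here as an assumption; for a model with exactly two predictors it reduces to 1 / (1 - r^2). The values of `x1` and `x2` are invented for illustration and are not the data from the figure.

```python
from statistics import mean


def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length series."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def variance_inflation_factor(r):
    """VIF for a model with exactly two predictors whose correlation is r."""
    return 1.0 / (1.0 - r ** 2)


# Hypothetical predictor values, invented for illustration only.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
x2 = [1.2, 1.9, 3.4, 3.8, 5.1, 6.3, 6.8, 8.2, 9.1, 9.7]

r_sample = pearson_r(x1, x2)
vif_sample = variance_inflation_factor(r_sample)

# VIF implied by the correlation of 0.9771 quoted in the text.
vif_reported = variance_inflation_factor(0.9771)

print(f"sample r = {r_sample:.4f}, sample VIF = {vif_sample:.1f}")
print(f"VIF at r = 0.9771: {vif_reported:.1f}")
```

At r = 0.9771 the VIF is roughly 22, meaning the variance of each coefficient estimate is about 22 times what it would be if X1 and X2 were uncorrelated. That is the sensitivity to minor changes described above, expressed as a number.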