www.analyttica.com
Predictive Modelling using
Linear Regression
https://leaps.analyttica.com
Table of Contents
Concept of Regression Analysis
Simple and Multiple Linear Regression
Evaluating a Linear Model
Variable Selection and Transformations
Concept of Regression Analysis
Regression analysis is a predictive modelling technique that estimates the relationship between two or more variables. Recall that correlation analysis makes no assumption about the causal relationship between two variables. Regression analysis, in contrast, focusses on the relationship between a dependent (target) variable and one or more independent variables (predictors). Here, the dependent variable is assumed to be the effect of the independent variable(s). The values of the predictors are used to estimate or predict the likely value of the target variable.
For example, to describe the relationship between diesel consumption and industrial production, if it is assumed that “diesel consumption” is the effect of “industrial production”, we can do a regression analysis to predict the value of “diesel consumption” for some specific value of “industrial production”.
To do this, we first assume a mathematical relationship between the target and the predictor(s). The relationship can be a straight line (linear regression), a polynomial curve (polynomial regression) or some other non-linear relationship (non-linear regression). There are various ways to identify it; the simplest and most popular is to create a scatter plot of the target variable against the predictor variable (refer to Figure 1 and Figure 2).
Figure 1: Linear Relationship
Figure 2: Polynomial Relationship
Once the type of relationship is established, we try to find the most-likely values of the coefficients in the mathematical formula.
Regression analysis comprises the entire process of identifying the target and predictors, finding the relationship, estimating the coefficients, finding the predicted values of the target, and finally evaluating the accuracy of the fitted relationship.
Why do we use Regression Analysis?
As discussed above, regression analysis estimates the relationship between two or more variables.
More specifically, regression analysis helps one understand how the typical value of the dependent variable changes when any one of the independent variables is varied, while the other independent variables are held fixed.
Let’s say we want to estimate the credit card spend of customers in the next quarter. For each customer, we have demographic and transaction-related data which indicate that credit card spend is a function of age, credit limit and total outstanding balance on their loans. Using this insight, we can predict future credit card spend based on current and past information.
What are the benefits of using Regression Analysis?
Regression explores significant relationships between the dependent variable and the independent variables.
It indicates the strength of the impact of multiple independent variables on a dependent variable and helps to determine which variables are the most significant predictors of the dependent variable. Their influence is quantified by the magnitude and sign of the beta estimates, which indicate the extent and direction of their impact on the dependent variable.
It also allows us to compare the effects of variables measured on different scales, and can consider nominal, interval, or categorical variables for analysis.
The simplest form of the equation with one dependent and one independent variable is defined by the formula:
y = c + b*x, where
y = estimated dependent score,
c = constant,
b = regression coefficient, and
x = independent variable.
Types of Regression Techniques
For predictions, there are many regression techniques available. The choice of technique is mostly driven by three factors:
• Number of independent variables
• Type of dependent variable
• Shape of the regression line
Let’s briefly discuss a few regression techniques. A more elaborate discussion of the most commonly used regression techniques will be covered later in the module.
Linear Regression
Linear regression is one of the most commonly used predictive modelling techniques. It establishes a relationship between a dependent variable (Y) and one or more independent variables (X) using a best fit straight line (also known as a regression line).
It is represented by an equation 𝑌 = 𝑎 + 𝑏𝑋 + 𝑒, where a is the intercept, b is the slope of the line and e is the error term. This equation can be used to predict the value of a target variable based on given predictor variable(s).
Logistic Regression
Logistic regression is used to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables.
Polynomial Regression
A regression equation is a polynomial regression equation if the power of the independent variable is more than 1. The equation below represents a polynomial equation.
$Y = a + bX + cX^2$
In this regression technique, the best fit line is not a straight line. It is rather a curve that fits the data points.
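As a rough illustration, a quadratic curve can be fitted with the same least squares idea used for a straight line, because the model is still linear in the coefficients a, b and c. The sketch below is only one possible approach: the data are made up, and the helper names `solve_linear_system` and `fit_quadratic` are hypothetical, not part of any library or of this course.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Build the augmented matrix so row operations apply to both sides.
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        # Pick the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back substitution, starting from the last row.
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - known) / aug[row][row]
    return solution


def fit_quadratic(xs, ys):
    """Return (a, b, c) minimising the squared error of Y = a + b*X + c*X^2."""
    # Each observation contributes the feature row [1, x, x^2].
    rows = [[1.0, x, x * x] for x in xs]
    # Normal equations: (R^T R) coef = R^T y.
    gram = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    moment = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
    return tuple(solve_linear_system(gram, moment))


# Illustrative data generated exactly from Y = 1 + 2X + 3X^2 (an assumption).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0 + 2.0 * x + 3.0 * x * x for x in xs]
a, b, c = fit_quadratic(xs, ys)
print(a, b, c)  # should be close to 1, 2 and 3
```

Because the data lie exactly on a quadratic, the recovered coefficients should match the generating values up to floating point error; with noisy data they would only approximate them.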
Ridge Regression
Ridge regression is suitable for analyzing multiple regression data that suffers from multicollinearity.
When multicollinearity occurs, least squares estimates are unbiased, but their variances are large so
they may be far from the true value. By adding a degree of bias to the regression estimates, ridge
regression reduces the standard errors. It is hoped that the net effect will be to give estimates that
are more reliable.
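The shrinkage effect can be sketched in the simplest possible setting. The example below assumes a single predictor, an unpenalised intercept, and a hypothetical helper `ridge_slope`; under those assumptions the ridge slope is the least squares slope with the penalty added to the denominator. It does not demonstrate multicollinearity itself, which needs several correlated predictors.

```python
def ridge_slope(xs, ys, penalty):
    """Ridge estimate of the slope for one predictor (intercept not penalised).

    Minimises sum((y_i - b0 - b1*x_i)^2) + penalty * b1^2, which gives
    b1 = Sxy / (Sxx + penalty). With penalty = 0 this is ordinary least squares.
    """
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    return sxy / (sxx + penalty)


# Made-up illustrative data (not from the course material).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

ols_slope = ridge_slope(xs, ys, penalty=0.0)      # no shrinkage
shrunk_slope = ridge_slope(xs, ys, penalty=10.0)  # biased towards zero
print(ols_slope, shrunk_slope)
```

The penalised slope is smaller in magnitude than the least squares slope: this is the "degree of bias" that ridge regression trades for lower variance.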
Linear Regression
Linear Regression is a predictive modelling technique that establishes a relationship between a dependent variable (Y) and one or more explanatory variables (X), using a best fit straight line (also known as the regression line).
It is represented by the equation 𝑌 = 𝑎 + 𝑏 ∗ 𝑋 + 𝑒, where a is the intercept, b is the slope of the line and e is the error term.
This equation can be used to predict the value of the target variable based on the given predictor variable(s).
The case of one explanatory variable is called simple linear regression. With more than one explanatory variable, the process is called multiple linear regression. In this technique, the dependent variable is continuous, the independent variable(s) can be continuous or discrete, and the nature of the regression line is linear.
The following sections discuss in detail the process of developing and evaluating a regression model. An important concept to recall at this point is Data Splitting, which requires the data to be randomly split into Training and Validation datasets. The rationale is that the model is built on one dataset (training) and then assessed on the validation dataset, which indicates how it is likely to perform on a new, unseen dataset.
In all following discussions, it is understood that the model building and evaluation process
(determining the best fitting line and estimating the accuracy of the model) is done on the training
dataset, and the model validation is done on the validation dataset.
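A random split of this kind might look like the sketch below. The 70/30 proportion, the fixed seed and the helper name `train_validation_split` are assumptions made for illustration; the course does not prescribe them.

```python
import random


def train_validation_split(records, train_fraction=0.7, seed=42):
    """Randomly split records into (training, validation) lists."""
    shuffled = list(records)               # copy so the input is left untouched
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


# Twenty placeholder record ids standing in for rows of a dataset.
records = list(range(20))
train, validation = train_validation_split(records)
print(len(train), len(validation))  # 14 6
```

Every record ends up in exactly one of the two datasets, so the validation data are never seen while the model is being fitted.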
Determining the Best Fitting Line
Suppose we have a random sample of 20 students with their height (x) and weight (y), and we need to establish a relationship between the two. One of the first and most basic approaches to fitting a line through the data points is to create a scatter plot of (x, y) and draw a straight line that fits the experimental data.
Figure 3
Since there can be multiple lines that fit the data, the challenge arises in choosing the one that best fits. As we already know, the best fit line can be represented as
$\hat{y}_i = b_0 + b_1 x_i$
Where,
• $y_i$ denotes the observed response for experimental unit i
• $x_i$ denotes the predictor value for experimental unit i
• $\hat{y}_i$ is the predicted response (or fitted value) for experimental unit i
When we predict weight using the above equation, the predicted value will not be perfectly accurate. It has some "prediction error" (or "residual error"). This can be represented as
$e_i = y_i - \hat{y}_i$
A line that fits the data best will be one for which the n prediction errors (i = 1 to n), one for each observed data point, are as small as possible in some overall sense.
One way to achieve this goal is to invoke the "least squares criterion," which says to "minimize the
sum of the squared prediction errors."
The equation of the best fitting line is:
$\hat{y}_i = b_0 + b_1 x_i$
We need to find the values of $b_0$ and $b_1$ that make the sum of the squared prediction errors the smallest, i.e.
Residual Sum of Squares $= \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
Because the deviations are first squared, then added, there is no cancelling out between positive and negative values.
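To make the criterion concrete, the sketch below scores two candidate lines by their sum of squared prediction errors. The data points and the two candidate lines are made up for illustration, and `sum_squared_errors` is a hypothetical helper name.

```python
def sum_squared_errors(xs, ys, b0, b1):
    """Sum of squared prediction errors for the line y_hat = b0 + b1 * x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Made-up illustrative data (not the height and weight sample from the text).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

candidate_a = sum_squared_errors(xs, ys, b0=2.0, b1=0.5)
candidate_b = sum_squared_errors(xs, ys, b0=2.2, b1=0.6)
print(round(candidate_a, 2), round(candidate_b, 2))  # 3.75 2.4
```

Under the least squares criterion the second line is preferred, because its squared errors add up to a smaller total.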
Least Square Estimates
For the above equation, $b_0$ and $b_1$ are determined using the following:
$\hat{b}_0 = \bar{Y} - \hat{b}_1 \bar{X}$ and $\hat{b}_1 = \dfrac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}$
Because the formulas for $b_0$ and $b_1$ are derived using the least squares criterion, the resulting equation, $\hat{y}_i = b_0 + b_1 x_i$, is often referred to as the "least squares regression line," or simply the "least squares line." It is also sometimes called the "estimated regression equation."
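The least squares formulas translate almost directly into code. The following is a minimal sketch on made-up data; the function name `least_squares_line` is hypothetical.

```python
def least_squares_line(xs, ys):
    """Return (b0, b1) for the least squares line y_hat = b0 + b1 * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Numerator of the slope: sum of (X_i - X_bar) * (Y_i - Y_bar).
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    # Denominator of the slope: sum of (X_i - X_bar)^2.
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_mean - b1 * x_mean  # intercept from the two means and the slope
    return b0, b1


# Made-up illustrative data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
b0, b1 = least_squares_line(xs, ys)
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")  # b0 = 2.20, b1 = 0.60
```

For these five points the means are 3 and 4, the slope is 6/10 = 0.6, and the intercept is 4 - 0.6 × 3 = 2.2.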
What Does the Equation Mean?
Each coefficient in the equation above has a physical interpretation, so it is important to understand what the regression equation means.
• The coefficient $b_0$, or the intercept, is the expected value of Y when X = 0
• The coefficient $b_1$, or the slope, is the expected change in Y when X is increased by one unit.
The following figure explains the interpretations clearly.
Figure 4
Example of Linear Regression: Factors affecting Credit Card Sales
An analyst wants to understand what factors (or independent variables) affect credit card sales. Here, the dependent variable is credit card sales for each customer, and the independent variables are income, age, current balance, socio-economic status, current spend, last month’s spend, loan outstanding balance, revolving credit balance, number of existing credit cards and credit limit. In order to understand what factors affect credit card sales, the analyst needs to build a linear regression model.
It is important to note that a linear regression cannot use categorical variables directly (they must first be encoded numerically, for example as dummy variables), and treating ordinal variables as numeric is not recommended. Hence, the analyst may also need to check the variable types before running a model.
Module 1 Simulation 1: Learn & Apply a Simple Linear Regression Model
In this simulation, the learner is exposed to a sample dataset comprising telecom customer accounts with their annual income and age, along with their average monthly revenue (dependent variable). The learner is expected to apply the linear regression model using annual income as the single predictor variable.
Evaluating a Linear Regression Model
Once we fit a linear regression model, we need to evaluate the accuracy of the model. In the following sections, we will discuss the various methods used to evaluate the accuracy of the model with respect to its predictive power.
For an in-depth understanding of all the topics covered in the coming sections, refer to the course
“Fundamentals of Data Analytics” on Analyttica TreasureHunt LEAPS
(https://leaps.analyttica.com/courses/overview/Fundamentals-of-Data-Analytics). You can also
refer to any of the standard Statistics books for more information on the same.
F-Statistics and p-value
The F-test indicates whether a linear regression model provides a better fit to the data than a model that contains no independent variables. It consists of a null and an alternate hypothesis, and the test statistic helps decide whether to reject the null hypothesis.
The null hypothesis here is “The target variable cannot be significantly predicted using the predictor variable(s)”. To test it, we look at the F-statistic and its p-value. Mathematically, the null hypothesis we test is “All slope parameters are 0” (note that the number of slope parameters is the same as the number of independent variables in the model). Hence, if the null hypothesis is not rejected, there is no evidence that the predictor variables help predict the target variable, and the regression model is not useful.
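The F-statistic itself can be computed from the explained and residual sums of squares. The sketch below uses the standard overall F formula, which is not spelled out in this module, with made-up data and fitted values; the helper name `f_statistic` is hypothetical. It stops short of the p-value, which would need the F distribution from a statistics library.

```python
def f_statistic(ys, fitted, num_predictors):
    """Overall F-statistic: (explained SS / p) / (residual SS / (n - p - 1))."""
    n = len(ys)
    y_mean = sum(ys) / n
    explained_ss = sum((f - y_mean) ** 2 for f in fitted)
    residual_ss = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    mean_square_model = explained_ss / num_predictors
    mean_square_error = residual_ss / (n - num_predictors - 1)
    return mean_square_model / mean_square_error


# Made-up data; fitted values assume the line y_hat = 2.2 + 0.6 * x for x = 1..5.
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
fitted = [2.8, 3.4, 4.0, 4.6, 5.2]
f_value = f_statistic(ys, fitted, num_predictors=1)
print(round(f_value, 2))  # 4.5

# Standard F tables give a 5% critical value of about 10.13 for (1, 3)
# degrees of freedom, so the null hypothesis would not be rejected here.
```

With only five observations the test has little power, so failing to reject the null hypothesis in this toy example is unsurprising.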
Coefficient of Determination
Next, we look at the R-squared value of the model, which is also called the “Coefficient of Determination”. This statistic measures the percentage of variation in the target variable that is explained by the model. The illustration below captures the explained vs. unexplained variation in the data.
Figure 5
R-squared is calculated using the following formula:
$R^2 = \dfrac{\text{Explained Variance}}{\text{Total Variance}} = \dfrac{\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}$
R-squared is always between 0 and 100%. As a guideline, the higher the R-squared, the better the model fits the data. However, the objective is not simply to maximize R-squared, since the stability and applicability of the model are equally important. Next, check the adjusted R-squared value. Ideally, the R-squared and adjusted R-squared values should be close to each other. If they are not, the analyst may have overfitted the model and may need to remove the insignificant variables from it.
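Both statistics can be sketched in a few lines. The adjusted R-squared formula used here, 1 - (1 - R²)(n - 1)/(n - p - 1), is the standard one and is not given in this module; the data, fitted values and helper names are made up for illustration.

```python
def r_squared(ys, fitted):
    """Explained variation divided by total variation."""
    y_mean = sum(ys) / len(ys)
    explained = sum((f - y_mean) ** 2 for f in fitted)
    total = sum((y - y_mean) ** 2 for y in ys)
    return explained / total


def adjusted_r_squared(r2, n, num_predictors):
    """Penalise R-squared for the number of predictors in the model."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - num_predictors - 1)


# Made-up data; fitted values assume the least squares line y_hat = 2.2 + 0.6 * x.
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
fitted = [2.8, 3.4, 4.0, 4.6, 5.2]

r2 = r_squared(ys, fitted)
adj_r2 = adjusted_r_squared(r2, n=len(ys), num_predictors=1)
print(round(r2, 3), round(adj_r2, 3))  # 0.6 0.467
```

The gap between 0.6 and 0.467 is large here only because the sample is tiny; on a realistically sized dataset a gap of that size would suggest that some predictors are not pulling their weight.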
Module 1 Simulation 2: Learn & Apply the concept of R-Square
In this simulation, the learner is exposed to a sample dataset capturing telecom customer accounts
and their annual income, age, along with their average monthly revenue (dependent variable). The
dataset also contains predicted values of “average monthly revenue” from a regression model. The
learner is expected to apply the concept of calculation of coefficient of determination.
Check the p-value of the Parameter Estimates
The p-value for each variable tests the null hypothesis that the coefficient is equal to zero (no effect).
A low p-value (<0.05) indicates that we can reject the null hypothesis. In other words, a predictor that has a low p-value can be included in the model because changes in the predictor's value are related to changes in the response variable. Conversely, a larger (insignificant) p-value suggests that changes in the predictor are not associated with changes in the response. This is an iterative process, and the analyst may need to re-run the model until only significant variables remain. If there are hundreds of variables, the analyst may choose to automate the variable selection using the forward, backward or stepwise techniques. Automated variable selection is, however, not recommended when the dataset has only a small number of variables.
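For a simple linear regression, the test behind the slope's p-value can be sketched as follows. The standard error formula (square root of the mean squared error divided by the sum of squared deviations of X) is the standard one and is not stated in this module; the data and the helper name `slope_t_statistic` are illustrative assumptions. Instead of computing the p-value, which needs the t distribution, the sketch compares the statistic with a tabulated critical value.

```python
import math


def slope_t_statistic(xs, ys):
    """Return (b1, standard error of b1, t-statistic) for simple regression."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_mean - b1 * x_mean
    residual_ss = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    mean_square_error = residual_ss / (n - 2)  # two estimated parameters
    std_error = math.sqrt(mean_square_error / sxx)
    return b1, std_error, b1 / std_error


# Made-up illustrative data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
slope, slope_se, t_value = slope_t_statistic(xs, ys)

# Two-sided 5% critical value for 3 degrees of freedom, from standard t tables.
T_CRITICAL_3_DF = 3.182
significant = abs(t_value) > T_CRITICAL_3_DF
print(round(t_value, 3), significant)  # 2.121 False
```

A t-statistic below the critical value corresponds to a p-value above 0.05, so this predictor would not be judged significant on such a small sample.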
Module 2 Simulation 1: Build a Multivariate Linear Regression Model and Evaluate Parameter Significance
In this simulation, the learner is exposed to a sample dataset capturing the flight status of flights with their delay in arrival, along with various possible predictor variables like departure delay, distance, air time, etc. The learner is expected to build a multiple regression model where all the variables are significant.
Residual Analysis
We can also evaluate a regression model based on various summary statistics on error or residuals.
Some of them are:
• Root Mean Square Error (RMSE): the square root of the average of the squared residuals, as per the given formula:
$\text{RMSE} = \sqrt{\dfrac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}$
• Mean Absolute Percentage Error (MAPE): the average absolute percentage deviation, as per the given formula:
$\text{MAPE} = \dfrac{1}{n} \sum_{i=1}^{n} \dfrac{|Y_i - \hat{Y}_i|}{|Y_i|}$
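Both residual summaries are straightforward to compute from observed and predicted values. The sketch below uses made-up numbers and hypothetical function names; it returns MAPE as a fraction, so multiply by 100 for a percentage, and it assumes no observed value is zero.

```python
import math


def rmse(observed, predicted):
    """Root mean square error: sqrt of the mean squared residual."""
    n = len(observed)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(observed, predicted)) / n)


def mape(observed, predicted):
    """Mean absolute percentage error as a fraction (observed must be non-zero)."""
    n = len(observed)
    return sum(abs(y - p) / abs(y) for y, p in zip(observed, predicted)) / n


# Made-up observed values and predictions from a hypothetical fitted model.
observed = [2.0, 4.0, 5.0, 4.0, 5.0]
predicted = [2.8, 3.4, 4.0, 4.6, 5.2]

rmse_value = rmse(observed, predicted)
mape_value = mape(observed, predicted)
print(round(rmse_value, 3), round(mape_value, 3))  # 0.693 0.188
```

RMSE is expressed in the units of the target variable, while MAPE is scale-free, which makes it easier to compare across targets measured on different scales.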