Regression
Regression is the process of identifying the relationship between variables and
using that relationship to predict future values of an outcome. It helps in
identifying how one variable behaves when other variable(s) change. Regression
analysis is used for prediction and forecasting applications.
For example, a regression model could be used to predict children's height, given
their age, weight, and other factors.
A regression task begins with a data set in which the target values are known. For
example, a regression model that predicts children's height could be developed
based on observed data for many children over a period of time. The data might
track age, weight, developmental milestones, family history, and so on. Height
would be the target, the other attributes would be the predictors, and the data
for each child would constitute a case.
Common Applications of Regression
Regression modeling has many applications in trend analysis, business planning,
Simple Linear Regression
It is a statistical method that allows us to summarize and study relationships
between two continuous (quantitative) variables:
One variable, denoted x, is regarded as the predictor, explanatory, or independent variable.
The other variable, denoted y, is regarded as the response, outcome, or dependent variable.
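As a minimal sketch of the x and y roles described above, the following fits a simple linear regression in R. It assumes R's built-in cars data set (speed and stopping distance), which is not part of this material and is used only for illustration.

```r
# Built-in example data: stopping distance (dist) and speed for 50 cars.
data(cars)

# speed plays the role of the predictor x, dist the role of the response y.
fit <- lm(dist ~ speed, data = cars)

# Estimated intercept and slope of the fitted line.
coef(fit)

# Predicted response for a new value of the predictor.
predict(fit, newdata = data.frame(speed = 15))
```

The formula dist ~ speed reads as "response modelled by predictor", which is the same convention used for the multi-predictor model later in this section.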
Types of relationships
Linear, nonlinear, or no relationship
What is the "Best Fitting Line"?
Below are formulas for the intercept b0 and the slope b1 for the
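Assuming the line meant here is the ordinary least squares line, the standard estimates are b1 = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2) and b0 = mean(y) - b1 * mean(x). The sketch below computes them by hand and compares them with lm(). The built-in cars data set is an assumption chosen for illustration and does not come from this material.

```r
# Illustration data (assumption): built-in cars data set.
x <- cars$speed
y <- cars$dist

# Slope: co-variation of x and y divided by the variation of x.
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# Intercept: forces the line through the point (mean(x), mean(y)).
b0 <- mean(y) - b1 * mean(x)
c(b0 = b0, b1 = b1)

# lm() should give the same two numbers.
fit <- lm(y ~ x)
all.equal(unname(coef(fit)), c(b0, b1))
```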
Common Error Variance
σ² quantifies how much the responses (y) vary around the
Example: Suppose you have two brands (A and B) of
thermometers, and each brand offers a Celsius
thermometer and a Fahrenheit thermometer. You measure
the temperature in Celsius and Fahrenheit using each
Coefficient of Determination, r-square
How well does your regression equation truly represent
your set of data?
For the plot in the figure on the previous slide, note that SSTO = SSR + SSE. The sums of squares tell the story well: most of the variation in the response y (SSTO = 1827.6) is due to random variation (SSE = 1708.5), not to the regression of y on x (SSR = 119.1). SSR divided by SSTO is 119.1/1827.6, or about 0.065.
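The arithmetic above can be reproduced in a few lines of R. This is only a sketch using the three sums of squares quoted in the text; the variable names are chosen here for illustration.

```r
# Sums of squares quoted in the text.
SSR <- 119.1    # variation explained by the regression of y on x
SSE <- 1708.5   # unexplained (error) variation

# Total variation is the sum of the two parts.
SSTO <- SSR + SSE

# Coefficient of determination: share of total variation explained.
r_squared <- SSR / SSTO
round(r_squared, 3)   # 0.065

# Equivalent form, as used for the test-set calculation in R.
1 - SSE / SSTO
```

Both forms give the same value because SSTO = SSR + SSE.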
Correlation Coefficient r
A few examples
r² = 100% and r = 1.000
These measures tell us that there is a perfect linear relationship between temperature in degrees Celsius and temperature in degrees Fahrenheit.
r² = 90.4% and r = 0.951
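The Celsius and Fahrenheit case can be checked directly. A minimal sketch: the temperature readings below are made up for illustration, and the conversion F = C * 9/5 + 32 is the standard one. For simple linear regression, r² is the square of the correlation coefficient r.

```r
# Made-up Celsius readings (assumption, for illustration only).
celsius <- c(-10, 0, 10, 20, 30, 40)

# Exact linear conversion to Fahrenheit.
fahrenheit <- celsius * 9 / 5 + 32

# Correlation coefficient: 1 for a perfect increasing linear relationship.
r <- cor(celsius, fahrenheit)

# In simple linear regression, r-squared is r squared.
r2 <- r^2
c(r = r, r2 = r2)

# The second example is consistent in the same way: 0.951^2 is about 0.904.
round(0.951^2, 3)
```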
Linear regression in R
Reading in the data and splitting
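library(xlsx)
# Read the first sheet of the workbook into a data frame
powerData <- read.xlsx('Folds5x2_pp.xlsx', 1)
Data splitting
# Fix the seed so the random split is reproducible
set.seed(123)
# Randomly pick 75% of the row indices for training
split <- sample(seq_len(nrow(powerData)), size = floor(0.75 * nrow(powerData)))
trainData <- powerData[split, ]
# The remaining 25% of the rows form the test set
testData <- powerData[-split, ]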
The dataset is obtained from the UCI Machine Learning Repository. The dataset
Building the prediction model
predictionModel <- lm(PE ~ AT + V + AP + RH, data = trainData)
Testing the prediction model
We will now apply the prediction model to the test data.
prediction <- predict(predictionModel, newdata = testData)
head(prediction)
       2        4       12       13       14       17
444.0433 450.5260 456.5837 438.7872 443.1039 463.7809
head(testData$PE)
[1] 444.37 446.48 453.99 440.29 451.28 467.54
We can calculate the value of R-squared for the prediction model on the test data set as follows:
SSE <- sum((testData$PE - prediction) ^ 2)
SST <- sum((testData$PE - mean(testData$PE)) ^ 2)
1 - SSE/SST
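The same calculation can be wrapped in small helper functions. The sketch below applies them to the six predicted and observed PE values printed by head() above, so the result describes only those six rows, not the full test set. The function names r_squared and rmse are hypothetical helpers introduced here, and RMSE is added as a complementary error measure that the original material does not compute.

```r
# Hypothetical helper: R-squared of predictions against observed values.
r_squared <- function(actual, predicted) {
  sse <- sum((actual - predicted)^2)      # error sum of squares
  sst <- sum((actual - mean(actual))^2)   # total sum of squares
  1 - sse / sst
}

# Hypothetical helper: root mean squared error, in the units of the response.
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# First six observed and predicted values, copied from the output above.
actual    <- c(444.37, 446.48, 453.99, 440.29, 451.28, 467.54)
predicted <- c(444.0433, 450.5260, 456.5837, 438.7872, 443.1039, 463.7809)

r_squared(actual, predicted)
rmse(actual, predicted)
```

With the full data available, the same helpers could be called as r_squared(testData$PE, prediction) and rmse(testData$PE, prediction).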