Simple Linear Regression and Correlations
April 14, 2020
[30]: import numpy as np
      import pandas as pd
      import matplotlib.pyplot as plt
      import statsmodels.api as sm
      from scipy.stats import shapiro   # For assessing normality
      from scipy.stats import pearsonr  # For exploring correlations

      # I referenced Dr. Jason Brownlee's page for some commands related to
      # statistical analyses. This reference would be helpful for those who are
      # looking to learn how to leverage Python for statistical analyses.
      # https://machinelearningmastery.com/statistical-hypothesis-tests-in-python-cheat-sheet/
[31]: # This data set was used for the STAT2 textbook that I purchased.
      data = pd.read_csv('Pulse.csv')  # Load the CSV file into the data frame called data
[32]: data
[32]:      Active  Rest  Smoke  Gender  Exercise  Hgt  Wgt
      0        97    78      0       1         1   63  119
      1        82    68      1       0         3   70  225
      2        88    62      0       0         3   72  175
      3       106    74      0       0         3   72  170
      4        78    63      0       1         3   67  125
      ..      ...   ...    ...     ...       ...  ...  ...
      227     105    85      0       1         2   64  150
      228      82    74      0       1         3   66  124
      229     102    81      0       0         2   69  172
      230      87    67      0       0         2   68  170
      231      81    62      0       0         3   68  151

      [232 rows x 7 columns]
[33]: data.describe()
[33]:            Active        Rest       Smoke      Gender    Exercise         Hgt         Wgt
      count  232.000000  232.000000  232.000000  232.000000  232.000000  232.000000  232.000000
      mean    91.297414   68.349138    0.112069    0.474138    2.254310   68.245690  157.918103
      std     18.820234    9.949378    0.316133    0.500410    0.738536    3.738761   31.832587
      min     51.000000   43.000000    0.000000    0.000000    1.000000   60.000000  102.000000
      25%     79.000000   62.000000    0.000000    0.000000    2.000000   65.000000  135.000000
      50%     88.500000   68.000000    0.000000    0.000000    2.000000   68.000000  150.000000
      75%    102.000000   74.000000    0.000000    1.000000    3.000000   71.000000  175.000000
      max    154.000000  106.000000    1.000000    1.000000    3.000000   78.000000  260.000000
The sample has a total of 232 participants. The mean active pulse rate was 91.30 beats per minute, with a standard deviation of 18.82 beats per minute. The mean resting pulse rate was 68.35 beats per minute, with a standard deviation of 9.95 beats per minute.
[34]: y = data['Active']  # Dependent variable

      # Shapiro-Wilk Normality Test
      stat, p = shapiro(y)
      print('The Shapiro-Wilk Normality test statistic is %.3f with a p-value of %.3f' % (stat, p))
      if p > 0.05:
          print('The distribution of the dependent variable is probably normal.')
      else:
          print('The distribution of the dependent variable is probably not normal.')

      The Shapiro-Wilk Normality test statistic is 0.970 with a p-value of 0.000
      The distribution of the dependent variable is probably not normal.
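The p-value printed above as 0.000 is a rounding effect of the `%.3f` format: the value is small, not exactly zero. A minimal sketch of one way to report such values more faithfully follows. `format_p` is a hypothetical helper written for this illustration, not part of SciPy or statsmodels, and the two inputs are illustrative values only.

```python
def format_p(p, threshold=0.001):
    """Format a p-value for reporting (hypothetical helper, not a SciPy function)."""
    if p < threshold:
        # Report an upper bound instead of a misleading 0.000
        return "< %.3f" % threshold
    return "= %.3f" % p


print("p " + format_p(0.379))     # prints: p = 0.379
print("p " + format_p(4.84e-05))  # prints: p < 0.001
```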
[35]: x1 = data['Wgt']  # Independent variable #1
[42]: # Pearson's Correlation Coefficient
      stat, p = pearsonr(y, x1)
      print('The Pearson r correlation coefficient is %.3f, with a p-value of %.3f' % (stat, p))
      if p > 0.05:
          print('There is no significant correlation between the two variables.')
      else:
          print('There is a significant correlation between the two variables.')

      The Pearson r correlation coefficient is -0.058, with a p-value of 0.379
      There is no significant correlation between the two variables.
[45]: plt.scatter(x1, y)  # plt.scatter(x, y): x on the horizontal axis, y on the vertical axis
      plt.xlabel('Weight', fontsize=20)
      plt.ylabel('Beats Per Minute While Active', fontsize=20)
      plt.show()
[38]: x = sm.add_constant(x1)  # Add the intercept column to the predictor
      results = sm.OLS(y, x).fit()
      results.summary()
[38]: <class 'statsmodels.iolib.summary.Summary'>
"""
OLS Regression Results
==============================================================================
Dep. Variable: Active R-squared: 0.003
Model: OLS Adj. R-squared: -0.001
Method: Least Squares F-statistic: 0.7767
Date: Tue, 14 Apr 2020 Prob (F-statistic): 0.379
Time: 16:47:21 Log-Likelihood: -1009.2
No. Observations: 232 AIC: 2022.
Df Residuals: 230 BIC: 2029.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
---
const 96.7137 6.269 15.427 0.000 84.362 109.066
Wgt -0.0343 0.039 -0.881 0.379 -0.111 0.042
==============================================================================
Omnibus: 18.033 Durbin-Watson: 1.931
Prob(Omnibus): 0.000 Jarque-Bera (JB): 19.874
Skew: 0.676 Prob(JB): 4.84e-05
Kurtosis: 3.479 Cond. No. 817.
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
"""
There is no significant linear relationship between Weight and Beats Per Minute While Active, t(230) = -0.881, p = 0.379. Weight explains well under 1% of the variance in active pulse rate (R-squared = 0.003).
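In a simple linear regression with one predictor, the slope's t statistic and the model's R-squared are both determined by the Pearson correlation: t = r * sqrt(n - 2) / sqrt(1 - r^2), and R-squared = r^2. As a rough check, the sketch below plugs in the rounded values reported above (r = -0.058, n = 232). `t_from_r` is a name chosen for this illustration.

```python
import math


def t_from_r(r, n):
    """Slope t statistic of a simple linear regression, computed from Pearson r."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


r, n = -0.058, 232   # rounded values taken from the printed output
t = t_from_r(r, n)
r_squared = r ** 2

print("t = %.2f, R-squared = %.3f" % (t, r_squared))  # prints: t = -0.88, R-squared = 0.003
```

Both numbers agree with the OLS table to the precision that the rounded r allows, which is why the correlation test and the regression slope test give the same p-value of 0.379.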
[47]: plt.scatter(x1, y)
      yhat = (-0.0343)*x1 + 96.7137  # The line of best fit
      fig = plt.plot(x1, yhat, lw=4, c='orange', label='regression line')
      plt.xlabel('Weight', fontsize=20)
      plt.ylabel('Beats Per Minute While Active', fontsize=20)
[47]: Text(0, 0.5, 'Beats Per Minute While Active')
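The line of best fit above is drawn from coefficients typed in by hand. As a minimal sketch of where such coefficients come from, the closed-form least-squares estimates can be computed directly from paired values. The function name `fit_line` and the toy numbers are assumptions made for this illustration; they are not taken from Pulse.csv.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line for paired lists x and y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sum of squares about the means
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up toy values for illustration only
weights = [120, 150, 170, 200]
pulses = [95, 92, 90, 88]

intercept, slope = fit_line(weights, pulses)
yhat = [intercept + slope * w for w in weights]  # fitted values along the line
```

With the statsmodels fit from this notebook, the same estimates are stored in `results.params`, so the fitted line can be built from that attribute instead of retyping the numbers.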
[48]: y = data['Rest']     # Dependent variable
      x2 = data['Active']  # Independent variable #2
      plt.scatter(x2, y)   # Active on the horizontal axis, Rest on the vertical axis
      plt.xlabel('Beats Per Minute While Active', fontsize=20)
      plt.ylabel('Beats Per Minute While at Rest', fontsize=20)
      plt.show()
[52]: # Pearson's Correlation Coefficient
      stat, p = pearsonr(y, x2)
      print('The Pearson r correlation coefficient is %.3f, with a p-value of %.3f' % (stat, p))
      if p > 0.05:
          print('There is no significant correlation between the two variables.')
      else:
          print('There is a significant correlation between the two variables.')

      The Pearson r correlation coefficient is 0.604, with a p-value of 0.000
      There is a significant correlation between the two variables.
[49]: x = sm.add_constant(x2)  # Add the intercept column to the predictor
      results = sm.OLS(y, x).fit()
      results.summary()
[49]: <class 'statsmodels.iolib.summary.Summary'>
"""
OLS Regression Results
==============================================================================
Dep. Variable: Rest R-squared: 0.365
Model: OLS Adj. R-squared: 0.362
Method: Least Squares F-statistic: 132.2
Date: Tue, 14 Apr 2020 Prob (F-statistic): 1.79e-24
Time: 16:57:18 Log-Likelihood: -809.03
No. Observations: 232 AIC: 1622.
Df Residuals: 230 BIC: 1629.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
---
const 39.1882 2.589 15.136 0.000 34.087 44.289
Active 0.3194 0.028 11.499 0.000 0.265 0.374
==============================================================================
Omnibus: 0.490 Durbin-Watson: 1.931
Prob(Omnibus): 0.783 Jarque-Bera (JB): 0.261
Skew: -0.051 Prob(JB): 0.878
Kurtosis: 3.130 Cond. No. 463.
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
"""
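The coefficients in this table can be cross-checked against the summary statistics printed earlier. For a simple linear regression, slope = r * (sd of y) / (sd of x) and intercept = (mean of y) - slope * (mean of x). The sketch below uses the rounded Pearson r and the means and standard deviations from the `data.describe()` output, so small differences from the table are expected.

```python
r = 0.604                                      # Pearson r between Rest and Active
mean_active, sd_active = 91.297414, 18.820234  # predictor (x), from data.describe()
mean_rest, sd_rest = 68.349138, 9.949378       # response (y), from data.describe()

slope = r * sd_rest / sd_active
intercept = mean_rest - slope * mean_active

print("slope = %.3f, intercept = %.1f" % (slope, intercept))  # prints: slope = 0.319, intercept = 39.2
```

These values line up with the Active coefficient (0.3194) and the constant (39.1882) in the table, up to the rounding of r.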
[51]: plt.scatter(x2, y)
      yhat = 0.3194*x2 + 39.1882  # The line of best fit
      fig = plt.plot(x2, yhat, lw=4, c='orange', label='regression line')
      plt.xlabel('Beats Per Minute While Active', fontsize=20)
      plt.ylabel('Beats Per Minute While at Rest', fontsize=20)
[51]: Text(0, 0.5, 'Beats Per Minute While at Rest')
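There is a significant linear relationship between Beats Per Minute While Active and Beats Per Minute at Rest, t(230) = 11.50, p < 0.001, R-squared = 0.365. Each additional beat per minute while active is associated with a resting pulse about 0.32 beats per minute higher.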
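To make the fitted equation concrete, the sketch below turns the two coefficients from the OLS table into a small prediction function. `predict_rest` is a hypothetical helper named for this illustration, and the input of 100 beats per minute is an arbitrary example value.

```python
INTERCEPT = 39.1882  # const from the OLS table
SLOPE = 0.3194       # Active coefficient from the OLS table


def predict_rest(active_bpm):
    """Predicted resting pulse (beats per minute) for a given active pulse."""
    return INTERCEPT + SLOPE * active_bpm


print("%.1f" % predict_rest(100))  # prints: 71.1
```

Predictions of this kind are only reasonable inside the observed range of active pulse rates, which runs from 51 to 154 beats per minute in this sample.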