Simple Linear Regression and Correlations

April 14, 2020

[30]: import numpy as np
      import pandas as pd
      import matplotlib.pyplot as plt
      import statsmodels.api as sm
      from scipy.stats import shapiro   # For assessing normality
      from scipy.stats import pearsonr  # For exploring correlations

      # I referenced Dr. Jason Brownlee's page for some commands related to
      # statistical analyses. This reference would be helpful for those who are
      # looking to learn how to leverage Python for statistical analyses.
      # https://machinelearningmastery.com/statistical-hypothesis-tests-in-python-cheat-sheet/

[31]: # This data set was used for the STAT2 textbook that I purchased.
      data = pd.read_csv('Pulse.csv')  # Load the CSV file into the data frame called data

[32]: data

[32]:      Active  Rest  Smoke  Gender  Exercise  Hgt  Wgt
      0        97    78      0       1         1   63  119
      1        82    68      1       0         3   70  225
      2        88    62      0       0         3   72  175
      3       106    74      0       0         3   72  170
      4        78    63      0       1         3   67  125
      ..      ...   ...    ...     ...       ...  ...  ...
      227     105    85      0       1         2   64  150
      228      82    74      0       1         3   66  124
      229     102    81      0       0         2   69  172
      230      87    67      0       0         2   68  170
      231      81    62      0       0         3   68  151

      [232 rows x 7 columns]

[33]: data.describe()

[33]:            Active        Rest       Smoke      Gender    Exercise         Hgt         Wgt
      count  232.000000  232.000000  232.000000  232.000000  232.000000  232.000000  232.000000
      mean    91.297414   68.349138    0.112069    0.474138    2.254310   68.245690  157.918103
      std     18.820234    9.949378    0.316133    0.500410    0.738536    3.738761   31.832587
      min     51.000000   43.000000    0.000000    0.000000    1.000000   60.000000  102.000000
      25%     79.000000   62.000000    0.000000    0.000000    2.000000   65.000000  135.000000
      50%     88.500000   68.000000    0.000000    0.000000    2.000000   68.000000  150.000000
      75%    102.000000   74.000000    0.000000    1.000000    3.000000   71.000000  175.000000
      max    154.000000  106.000000    1.000000    1.000000    3.000000   78.000000  260.000000

The sample has a total of 232 participants. The average pulse while active was 91.30 beats per minute, with a standard deviation of 18.82 beats per minute. The average pulse at rest was 68.35 beats per minute, with a standard deviation of 9.95 beats per minute.

[34]: y = data['Active']  # Dependent variable

      # Shapiro-Wilk Normality Test
      stat, p = shapiro(y)
      print('The Shapiro-Wilk Normality test statistic is %.3f with a p-value of %.3f' % (stat, p))
      if p > 0.05:
          print('The distribution of the dependent variable is probably normal.')
      else:
          print('The distribution of the dependent variable is probably not normal.')

      The Shapiro-Wilk Normality test statistic is 0.970 with a p-value of 0.000
      The distribution of the dependent variable is probably not normal.
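The Shapiro-Wilk test above is applied to the raw dependent variable. The OLS summaries in this notebook also report a normality check on the residuals: Skew, Kurtosis, and the Jarque-Bera (JB) statistic. The sketch below recomputes JB and its p-value from the rounded Skew and Kurtosis values printed in those summaries. The function name `jarque_bera` is my own, and because the inputs are rounded to three decimals the results only approximate the printed values.

```python
import math

def jarque_bera(n, skew, kurtosis):
    """Jarque-Bera statistic and p-value from skewness and (non-excess) kurtosis."""
    jb = n / 6.0 * (skew ** 2 + (kurtosis - 3.0) ** 2 / 4.0)
    # Under normality, JB is asymptotically chi-square with 2 degrees of freedom,
    # and the upper-tail probability of that distribution is exp(-x / 2).
    p_value = math.exp(-jb / 2.0)
    return jb, p_value

# Skew and Kurtosis copied (rounded) from the two OLS summaries in this notebook.
jb_wgt, p_wgt = jarque_bera(n=232, skew=0.676, kurtosis=3.479)   # Active on Wgt
jb_act, p_act = jarque_bera(n=232, skew=-0.051, kurtosis=3.130)  # Rest on Active

print(f"Active on Wgt:  JB = {jb_wgt:.2f}, p = {p_wgt:.2e}")
print(f"Rest on Active: JB = {jb_act:.2f}, p = {p_act:.3f}")
```

Both results land close to the printed JB values (19.874 and 0.261). This suggests that the residuals of the first model depart from normality, while those of the second do not.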

[35]: x1 = data['Wgt']  # Independent variable #1

[42]: # Pearson's Correlation Coefficient
      stat, p = pearsonr(y, x1)
      print('The Pearson r correlation coefficient is %.3f, with a p-value of %.3f' % (stat, p))
      if p > 0.05:
          print('There is no significant correlation between the two variables.')
      else:
          print('There is a significant correlation between the two variables.')

      The Pearson r correlation coefficient is -0.058, with a p-value of 0.379
      There is no significant correlation between the two variables.
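With a single predictor, testing whether the correlation is zero is equivalent to testing whether the regression slope is zero. A minimal sketch, using only the printed r and the sample size, converts r into a t statistic with n - 2 degrees of freedom. The helper name `t_from_r` is my own, and r is rounded to three decimals, so the result is approximate.

```python
import math

def t_from_r(r, n):
    """t statistic (n - 2 degrees of freedom) for testing a zero correlation."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r ** 2)

r, n = -0.058, 232  # printed correlation and the number of rows in Pulse.csv
t = t_from_r(r, n)
print(f"t = {t:.3f} on {n - 2} degrees of freedom")
```

The result agrees, up to rounding, with the t value reported for Wgt in the OLS summary, which is why the correlation test and the slope test share the same p-value of 0.379.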

[45]: plt.scatter(x1, y)  # plt.scatter(X, Y): X on the horizontal axis, Y on the vertical axis
      plt.xlabel('Weight', fontsize=20)
      plt.ylabel('Beats Per Minute While Active', fontsize=20)
      plt.show()

[38]: x = sm.add_constant(x1)
      results = sm.OLS(y, x).fit()
      results.summary()

[38]: <class 'statsmodels.iolib.summary.Summary'>
      """
                                  OLS Regression Results
      ==============================================================================
      Dep. Variable:                Active   R-squared:                        0.003
      Model:                           OLS   Adj. R-squared:                  -0.001
      Method:                Least Squares   F-statistic:                     0.7767
      Date:               Tue, 14 Apr 2020   Prob (F-statistic):               0.379
      Time:                       16:47:21   Log-Likelihood:                 -1009.2
      No. Observations:                232   AIC:                              2022.
      Df Residuals:                    230   BIC:                              2029.
      Df Model:                          1
      Covariance Type:           nonrobust
      ==============================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
      ------------------------------------------------------------------------------
      const         96.7137      6.269     15.427      0.000      84.362     109.066
      Wgt           -0.0343      0.039     -0.881      0.379      -0.111       0.042
      ==============================================================================
      Omnibus:                      18.033   Durbin-Watson:                    1.931
      Prob(Omnibus):                 0.000   Jarque-Bera (JB):                19.874
      Skew:                          0.676   Prob(JB):                      4.84e-05
      Kurtosis:                      3.479   Cond. No.                          817.
      ==============================================================================

      Warnings:
      [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
      """

There is no significant linear relationship between Weight and Beats Per Minute While Active, t = -0.881, p > 0.05.
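The numbers behind that conclusion can be re-derived from the Wgt row of the summary. The sketch below recomputes the t ratio, an approximate 95% confidence interval, and the F statistic. It assumes 1.970 as the critical value (roughly the 97.5th percentile of Student's t with 230 degrees of freedom) and works from the rounded values printed in the summary, so the last digits may differ slightly.

```python
coef, std_err = -0.0343, 0.039  # Wgt row of the OLS summary (rounded)
t_reported = -0.881             # t for Wgt as printed in the summary

# ASSUMPTION: 1.970 approximates the 97.5th percentile of Student's t with 230 df.
T_CRIT = 1.970

t_ratio = coef / std_err           # t is the coefficient divided by its standard error
ci_low = coef - T_CRIT * std_err   # lower end of the 95% confidence interval
ci_high = coef + T_CRIT * std_err  # upper end of the 95% confidence interval
f_stat = t_reported ** 2           # with one predictor, F equals t squared

contains_zero = ci_low < 0.0 < ci_high
print(f"t = {t_ratio:.3f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f}), F = {f_stat:.3f}")
print("Interval contains zero:", contains_zero)
```

Because the interval for the slope contains zero, a slope of zero (no linear relationship) is consistent with the data at the 5% level.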

[47]: plt.scatter(x1, y)
      yhat = (-0.0343) * x1 + 96.7137  # The line of best fit
      fig = plt.plot(x1, yhat, lw=4, c='orange', label='regression line')
      plt.xlabel('Weight', fontsize=20)
      plt.ylabel('Beats Per Minute While Active', fontsize=20)

[47]: Text(0, 0.5, 'Beats Per Minute While Active')
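The line of best fit can also be used to compute a fitted value for any weight. As a quick sanity check, a least-squares line with an intercept always passes through the point of means. The sketch below wraps the coefficients from the summary in a hypothetical helper, `predict_active`, and evaluates it at the mean weight from `data.describe()`.

```python
INTERCEPT, SLOPE = 96.7137, -0.0343  # const and Wgt coefficients from the summary

def predict_active(weight):
    """Fitted active pulse (beats per minute) for a given weight."""
    return INTERCEPT + SLOPE * weight

mean_wgt, mean_active = 157.918103, 91.297414  # means from data.describe()
fitted_at_mean = predict_active(mean_wgt)
fitted_at_150 = predict_active(150)

print(f"Fitted at mean weight: {fitted_at_mean:.2f} (mean Active: {mean_active:.2f})")
print(f"Fitted at weight 150:  {fitted_at_150:.2f}")
```

The fitted value at the mean weight matches the mean active pulse to within rounding. The near-zero slope means the fitted values barely change across the observed range of weights.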

[48]: y = data['Rest']     # Dependent variable
      x2 = data['Active']  # Independent variable #2
      plt.scatter(x2, y)   # x2 (Active) on the horizontal axis, y (Rest) on the vertical axis
      plt.xlabel('Beats Per Minute While Active', fontsize=20)
      plt.ylabel('Beats Per Minute While at Rest', fontsize=20)
      plt.show()

[52]: # Pearson's Correlation Coefficient
      stat, p = pearsonr(y, x2)
      print('The Pearson r correlation coefficient is %.3f, with a p-value of %.3f' % (stat, p))
      if p > 0.05:
          print('There is no significant correlation between the two variables.')
      else:
          print('There is a significant correlation between the two variables.')

      The Pearson r correlation coefficient is 0.604, with a p-value of 0.000
      There is a significant correlation between the two variables.
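In simple linear regression the slope and R-squared follow directly from the correlation: the slope is r multiplied by the ratio of the two standard deviations, and R-squared is r squared. The sketch below applies this to the printed r and the `std` row of `data.describe()`. Since r is rounded to three decimals, the results are approximate.

```python
r = 0.604                                 # printed Pearson correlation of Rest and Active
sd_rest, sd_active = 9.949378, 18.820234  # std row of data.describe()

slope = r * sd_rest / sd_active  # slope of Rest regressed on Active
r_squared = r ** 2               # R-squared of the one-predictor model

print(f"slope = {slope:.4f}, R-squared = {r_squared:.3f}")
```

Both values agree, up to rounding, with the Active coefficient (0.3194) and the R-squared (0.365) in the OLS summary for this model.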

[49]: x = sm.add_constant(x2)
      results = sm.OLS(y, x).fit()
      results.summary()

[49]: <class 'statsmodels.iolib.summary.Summary'>
      """
                                  OLS Regression Results
      ==============================================================================
      Dep. Variable:                  Rest   R-squared:                        0.365
      Model:                           OLS   Adj. R-squared:                   0.362
      Method:                Least Squares   F-statistic:                      132.2
      Date:               Tue, 14 Apr 2020   Prob (F-statistic):            1.79e-24
      Time:                       16:57:18   Log-Likelihood:                 -809.03
      No. Observations:                232   AIC:                              1622.
      Df Residuals:                    230   BIC:                              1629.
      Df Model:                          1
      Covariance Type:           nonrobust
      ==============================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
      ------------------------------------------------------------------------------
      const         39.1882      2.589     15.136      0.000      34.087      44.289
      Active         0.3194      0.028     11.499      0.000       0.265       0.374
      ==============================================================================
      Omnibus:                       0.490   Durbin-Watson:                    1.931
      Prob(Omnibus):                 0.783   Jarque-Bera (JB):                 0.261
      Skew:                         -0.051   Prob(JB):                         0.878
      Kurtosis:                      3.130   Cond. No.                          463.
      ==============================================================================

      Warnings:
      [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
      """

[51]: plt.scatter(x2, y)
      yhat = 0.3194 * x2 + 39.1882  # The line of best fit
      fig = plt.plot(x2, yhat, lw=4, c='orange', label='regression line')
      plt.xlabel('Beats Per Minute While Active', fontsize=20)
      plt.ylabel('Beats Per Minute While at Rest', fontsize=20)

[51]: Text(0, 0.5, 'Beats Per Minute While at Rest')

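There is a significant positive linear relationship between Beats Per Minute While Active and Beats Per Minute at Rest, t = 11.50, p < 0.01. Active pulse accounts for about 36.5% of the variance in resting pulse (R-squared = 0.365).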
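The remaining fit statistics in both summaries (AIC, BIC, and adjusted R-squared) can be recomputed from the log-likelihood, R-squared, and sample size. The sketch below uses a hypothetical helper, `fit_stats`, and counts two estimated coefficients per model (intercept and slope). The inputs are the rounded values from the summaries.

```python
import math

def fit_stats(log_lik, r_squared, n, k_params):
    """AIC, BIC and adjusted R-squared for an OLS fit with k_params coefficients
    (intercept included)."""
    aic = 2 * k_params - 2 * log_lik
    bic = k_params * math.log(n) - 2 * log_lik
    adj_r2 = 1 - (1 - r_squared) * (n - 1) / (n - k_params)
    return aic, bic, adj_r2

# Log-likelihood and R-squared copied from the two OLS summaries.
aic_wgt, bic_wgt, adj_wgt = fit_stats(-1009.2, 0.003, n=232, k_params=2)  # Active on Wgt
aic_act, bic_act, adj_act = fit_stats(-809.03, 0.365, n=232, k_params=2)  # Rest on Active

print(f"Active on Wgt:  AIC = {aic_wgt:.1f}, BIC = {bic_wgt:.1f}, adj. R-squared = {adj_wgt:.3f}")
print(f"Rest on Active: AIC = {aic_act:.1f}, BIC = {bic_act:.1f}, adj. R-squared = {adj_act:.3f}")
```

The two models have different dependent variables (Active and Rest), so their AIC and BIC values should not be compared with each other. These criteria are only meaningful for comparing models fitted to the same response.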
