Basic Statistics and Data Analysis for Health Researchers from Foreign Countries

(1)

Volkert Siersma

The Research Unit for General Practice in Copenhagen

(2)

Content

Quantifying association between continuous variables.

In particular:

• Correlation

• (Simple) regression

(3)

Example – Newly diagnosed Type 2 Diabetes

A data set with 729 newly diagnosed Type 2 diabetes patients.

   pt glucose      bmi sex      age
1   1    15.3 25.16070   0 53.02669
2   2    12.1 22.96838   0 50.86653
3   4    13.4 34.37500   0 87.73990
4   5    14.0 26.16190   1 64.59411
5   6    13.8 35.07805   0 62.10815
6   7    13.8 26.71779   1 58.97604
7   8    16.2 27.18233   1 82.46133
8   9     8.5 33.70120   0 76.36687
9  10    17.3 28.67547   1 72.63792
10 11     8.6 26.21882   1 48.91170
11 12    17.0 27.43951   0 53.40999
12 13    15.4 32.67832   0 64.07392
13 14     7.8 24.05693   1 63.86858
14 15    16.4 25.12406   1 52.35318
15 16     7.4 33.13134   0 42.77618
16 17    11.6 30.12729   1 46.76797
17 19    14.2 33.07857   0 63.45517
18 20    14.4 29.24211   0 78.74333
19 21    11.6 21.24225   1 66.66940

pt: Patient ID

glucose: Diagnostic plasma glucose (mmol/l)

bmi: Body Mass Index (kg/m2)

sex: sex (1=male, 0=female)

age: age (years)

(4)

Research question

• Do fat people have more severe diabetes when the diabetes is discovered?

Or in more “statistical” language:

• Is diagnostic plasma glucose (positively) associated with the body mass index at the time of diagnosis?

(5)

Scatter-plot

• When investigating a potential association between only two variables (like diagnostic plasma glucose and BMI) a scatter-plot is an important part of the analysis.

• It gives insight into the nature of the association.

• It shows problems in the data, e.g. outliers, strange or impossible values.

(6)

Scatter-plot

(7)

Scatter-plot

• There is no apparent tendency, specifically not one that would support our research question…

• …and if we have to point out a tendency, it would be that high BMI associates with lower diagnostic glucose (why is this not so strange if we think about the diagnosis of diabetes?).

• There seem to be some very large values, especially for diagnostic plasma glucose. These are valid measurements.

• Maybe a log transformation of glucose would make associations more apparent?

(8)

Scatter-plot – R code

# Scatter-plot of glucose against BMI (straight quotes are required in R)
plot(diabetes$bmi, diabetes$glucose,
     frame.plot = TRUE,
     main = NULL,
     xlab = "BMI (kg/m2)",
     ylab = "Glucose (mmol/l)",
     col = "green",
     pch = 19)

(9)

Scatter-plot – log transformation
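A log-transformed scatter-plot can be drawn in two ways, sketched below. The course data are not reproduced here, so the sketch uses a simulated stand-in (the data frame sim and its values are assumptions made for illustration only).

```r
# Simulated stand-in for the course data (NOT the real 'diabetes' data frame)
set.seed(1)
n <- 200
sim <- data.frame(bmi = rnorm(n, mean = 29, sd = 5))
# Right-skewed glucose values (log-normal), generated independently of BMI
sim$glucose <- exp(rnorm(n, mean = log(13), sd = 0.35))

# Option 1: transform the variable and plot the transformed values
sim$log_glucose <- log(sim$glucose)
plot(sim$bmi, sim$log_glucose,
     xlab = "BMI (kg/m2)", ylab = "log(Glucose (mmol/l))",
     col = "green", pch = 19)

# Option 2: keep the original units and draw the y-axis on a log scale
plot(sim$bmi, sim$glucose, log = "y",
     xlab = "BMI (kg/m2)", ylab = "Glucose (mmol/l), log scale",
     col = "green", pch = 19)
```

Both plots show the same point pattern; the second keeps the axis labels in mmol/l, which can be easier to read.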

(10)

Measures of association

• We want to capture the association between two variables in a single number: a correlation coefficient, a measure of association.

• Suppose that Y_i is the diagnostic plasma glucose of patient i and X_i the BMI for the same person. Then we want our measure of association to have the following characteristics:

• A positive association indicates that if X_i is large (relative to the rest of the sample) then Y_i is likely to be large as well.

• A negative association indicates that if X_i is large then Y_i is likely to be small.

(11)

Measures of association – between -1 and 1

• 0 : No association

• 1 : Perfect positive association

• -1 : Perfect negative association

(12)

Measures of association for the diabetes data

r = -0.059 ρ = -0.050 τ = -0.034

(13)

Measures of association for the diabetes data

…and log transformed…

r = -0.053 ρ = -0.050 τ = -0.034

Only the first one changes!
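Why only Pearson’s r moves under a log transformation can be checked on simulated data. This is a minimal sketch; the variables x and y are simulated stand-ins, not the course data.

```r
# Simulated stand-ins for BMI (x) and a right-skewed glucose measurement (y)
set.seed(2)
n <- 300
x <- rnorm(n, mean = 29, sd = 5)
y <- exp(2.6 - 0.01 * x + rnorm(n, mean = 0, sd = 0.4))

# All three coefficients for a pair of variables
coefs <- function(x, y) {
  c(pearson  = cor(x, y, method = "pearson"),
    spearman = cor(x, y, method = "spearman"),
    kendall  = cor(x, y, method = "kendall"))
}

raw    <- coefs(x, y)        # original scale
logged <- coefs(x, log(y))   # log-transformed response

# The log is monotone, so the ranks (and hence rho and tau) are unchanged
round(rbind(raw, logged), 3)
```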

(14)

Pearson’s correlation coefficient

Pearson’s correlation coefficient is computed from the data set (X_i,Y_i), i = 1,…,N as:

$$ r = \frac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}{(N-1)\, SD_x\, SD_y} $$

where $\bar{X}$ and $\bar{Y}$ are the respective means and SD_x and SD_y the respective standard deviations.
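The definition can be verified numerically: the sum of cross-products divided by (N - 1) times the two standard deviations should reproduce what cor() returns. The data below are simulated for illustration.

```r
# Simulated data (illustration only)
set.seed(3)
x <- rnorm(50, mean = 29, sd = 5)
y <- 15 - 0.05 * x + rnorm(50, mean = 0, sd = 4)

N <- length(x)

# Pearson's r written out term by term; sd() uses the N - 1 denominator
r_manual <- sum((x - mean(x)) * (y - mean(y))) / ((N - 1) * sd(x) * sd(y))

# The built-in function
r_builtin <- cor(x, y)

c(manual = r_manual, builtin = r_builtin)
```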

(15)

Characteristics of Pearson’s correlation coefficient

Pearson’s correlation coefficient has the following properties:

• It measures the degree of linear association.

• It is invariant to linear change of scale for the variables.

• It is not robust to outliers.

• Coefficient values that are comparable between different data sets, and moreover a valid confidence interval and p-value, require that both X_i and Y_i are normally distributed.
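Two of these properties are easy to see in a small simulation: a change of units leaves r untouched, while a single extreme observation can move it a lot. The sketch uses simulated values, and the factor 18 is the approximate conversion of glucose from mmol/l to mg/dl.

```r
# Simulated data with a moderate positive association (illustration only)
set.seed(4)
bmi          <- rnorm(100, mean = 29, sd = 5)
glucose_mmol <- 10 + 0.15 * bmi + rnorm(100, mean = 0, sd = 2)

# Invariance to a linear change of scale: mmol/l -> mg/dl (factor approx. 18)
r_mmol <- cor(bmi, glucose_mmol)
r_mgdl <- cor(bmi, 18 * glucose_mmol)

# Not robust: add one patient with low BMI and an extremely high glucose
bmi_out     <- c(bmi, 20)
glucose_out <- c(glucose_mmol, 60)
r_out       <- cor(bmi_out, glucose_out)

c(mmol = r_mmol, mgdl = r_mgdl, with_outlier = r_out)
```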

(16)

Pearson’s correlation coefficient – R code

> cor(diabetes$bmi, diabetes$glucose, use = "complete.obs")
[1] -0.05938123

Gives only the correlation coefficient.

> cor.test(diabetes$bmi, diabetes$glucose)

        Pearson's product-moment correlation

data:  diabetes$bmi and diabetes$glucose
t = -1.5995, df = 723, p-value = 0.1101
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.13162533  0.01349032
sample estimates:
        cor
-0.05938123

Also performs a statistical test to see whether the coefficient is different from zero.

(17)

Normally distributed?

BMI Glucose

A Normal distribution for comparison.

(18)

Normally distributed?

BMI Log(Glucose)

(19)

Normally distributed?

(20)

Normally distributed?

(21)

R code

A histogram of BMI:

hist(diabetes$bmi, main = "BMI", xlab = "BMI (kg/m2)", col = "green")

A Normal Q-Q plot of BMI:

qqnorm(diabetes$bmi, main = "BMI", col = "green")
qqline(diabetes$bmi, col = "red")

And how do we get all these works of art in some decent format?

# Use forward slashes (or doubled backslashes) in Windows paths
jpeg(file = "D:/mydirectory/mypicture.jpg", width = 500, height = 500)
#
# put here the code that generates the picture
#
dev.off()

(22)

Rank correlation – Spearman’s ρ

If the data do not appear to be Normally distributed, or when there are outliers, one may instead compute the correlation between the ranks of the X_i values and the ranks of the Y_i values.

This gives a nonparametric correlation coefficient called Spearman’s ρ.

It measures monotone association.

It is invariant to monotone transformations (like a log transformation).

It is robust to outliers.

It has a less intuitive interpretation: it describes the linear association between the ranks, not between the measured values.
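That Spearman’s ρ is simply Pearson’s coefficient applied to the ranks can be checked directly. The sketch uses simulated values, rounded to one decimal so that some ties occur, as they do in real measurements.

```r
# Simulated data, rounded so that ties are present (illustration only)
set.seed(5)
x <- round(rnorm(80, mean = 29, sd = 5), 1)
y <- round(exp(rnorm(80, mean = 2.5, sd = 0.4)), 1)

# Built-in Spearman coefficient
rho_builtin <- cor(x, y, method = "spearman")

# Pearson's coefficient of the ranks (ties receive average ranks)
rho_ranks <- cor(rank(x), rank(y), method = "pearson")

c(builtin = rho_builtin, via_ranks = rho_ranks)
```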

(23)

Spearman’s rank correlation coefficient – R code

> cor.test(diabetes$bmi, diabetes$glucose, method = "spearman")

        Spearman's rank correlation rho

data:  diabetes$bmi and diabetes$glucose
S = 66678220, p-value = 0.1801
alternative hypothesis: true rho is not equal to 0
sample estimates:
        rho
-0.04983743

Warning message:
In cor.test.default(diabetes$bmi, diabetes$glucose, method = "spearman") :
  Cannot compute exact p-values with ties

(24)

Rank correlation – Kendall’s τ

• A measure of monotone association with a more intuitive interpretation than Spearman’s ρ is Kendall’s τ.

• The observations from a pair of subjects i, j are

  • concordant if X_i < X_j and Y_i < Y_j, or X_i > X_j and Y_i > Y_j

  • discordant if X_i < X_j and Y_i > Y_j, or X_i > X_j and Y_i < Y_j

• Kendall’s τ is the difference between the probability for a concordant pair and the probability for a discordant pair.

• There are various versions of Kendall’s τ depending on how

(25)

Characteristics of Kendall’s tau

It measures monotone association.

It is invariant to monotone transformations (like a log transformation).

It is robust to outliers.

It has a more straightforward interpretation than Spearman’s rho.
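The concordant/discordant definition of Kendall’s τ can be sketched by counting pairs directly. The data are simulated and continuous, so there are no ties; in that situation the simple count below agrees with the version that cor() computes.

```r
# Simulated continuous data without ties (illustration only)
set.seed(6)
n <- 40
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)

# All pairs of subjects (i, j) with i < j, one pair per column
idx <- combn(n, 2)

# +1 for a concordant pair, -1 for a discordant pair
s <- sign(x[idx[1, ]] - x[idx[2, ]]) * sign(y[idx[1, ]] - y[idx[2, ]])
n_concordant <- sum(s > 0)
n_discordant <- sum(s < 0)

# Proportion concordant minus proportion discordant
tau_manual  <- (n_concordant - n_discordant) / ncol(idx)
tau_builtin <- cor(x, y, method = "kendall")

c(manual = tau_manual, builtin = tau_builtin)
```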

(26)

Kendall’s rank correlation coefficient – R code

> cor.test(diabetes$bmi, diabetes$glucose, method = "kendall")

        Kendall's rank correlation tau

data:  diabetes$bmi and diabetes$glucose
z = -1.3755, p-value = 0.169
alternative hypothesis: true tau is not equal to 0
sample estimates:
        tau
-0.03427314

(27)

Correlation in the diabetes data

r = -0.059 (p = 0.110) ρ = -0.050 (p = 0.180) τ = -0.034 (p = 0.169)

(28)

Correlation in the diabetes data

…and log transformed…

r = -0.053 (p = 0.154) ρ = -0.050 (p = 0.180) τ = -0.034 (p = 0.169)

(29)

Limitations of correlation coefficients

• While it is (relatively) clear what a correlation coefficient of 0 means, and also 1 or -1, it is often unclear what a highly significant correlation of, say, 0.5 means.

• Correlation rarely answers the research question to a sufficient extent, because it is not easily interpretable.

• Coefficients of correlation depend on the sample selection and therefore we cannot compare values of the coefficients found in different data.
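The dependence on sample selection can be illustrated with a simulation in which the same underlying relationship is observed once over the full range of X and once in a sample restricted to a narrow range. All names and numbers below are assumptions chosen for illustration.

```r
# One underlying linear relationship (true slope 0.3), simulated
set.seed(7)
n <- 2000
x <- rnorm(n, mean = 29, sd = 5)
y <- 5 + 0.3 * x + rnorm(n, mean = 0, sd = 3)
full <- data.frame(x = x, y = y)

# A study that only recruits subjects with x between 27 and 31
narrow <- subset(full, x > 27 & x < 31)

# The correlation shrinks when the range of x is restricted ...
r_full   <- cor(full$x, full$y)
r_narrow <- cor(narrow$x, narrow$y)

# ... while the regression slope still estimates the same quantity
# (with less precision in the smaller, narrower sample)
b_full   <- coef(lm(y ~ x, data = full))[["x"]]
b_narrow <- coef(lm(y ~ x, data = narrow))[["x"]]

round(c(r_full = r_full, r_narrow = r_narrow,
        b_full = b_full, b_narrow = b_narrow), 3)
```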

(30)

(31)

Regression analysis

• An (intuitively interpretable) way to describe a (linear) association between two continuous variables.

• It models a response Y (the dependent variable, the endogenous variable, the output) as a function of a predictor X (the independent variable, the exogenous variable, the explanatory variable, the covariate) and a term representing random other influences (error, noise).

(32)

Regression model formulation

• We say: “To regress Y on X”

or: “To regress glucose on BMI”

• Mathematically: Y_i = α + βX_i + ε_i

• Where the ε_i are independent, Normally distributed noise terms with mean 0 and standard deviation σ.

(33)

Regression model

• The mean of Y is modelled with a linear function of X; a line in the X-Y plane.

• For each X, Y is a random variable Normally distributed around the modelled mean of Y, with standard deviation σ
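One way to make the model concrete is to generate data from it and check that lm() recovers the parameters. The values of α, β and σ below are arbitrary choices for the simulation, not estimates from the course data.

```r
# Simulate from Y_i = alpha + beta * X_i + eps_i, with eps_i ~ Normal(0, sigma)
set.seed(8)
n          <- 5000
alpha_true <- 15
beta_true  <- -0.05
sigma_true <- 5

x   <- rnorm(n, mean = 29, sd = 5)           # predictor
eps <- rnorm(n, mean = 0, sd = sigma_true)   # independent Normal noise
y   <- alpha_true + beta_true * x + eps      # response

# Fit the regression and extract the three estimated parameters
fit       <- lm(y ~ x)
alpha_hat <- coef(fit)[["(Intercept)"]]
beta_hat  <- coef(fit)[["x"]]
sigma_hat <- summary(fit)$sigma              # residual standard error

round(c(alpha = alpha_hat, beta = beta_hat, sigma = sigma_hat), 3)
```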

(34)

Scatter-plot with regression line

(35)

Interpretation of the parameters

We have variation due to a systematic part, the explanatory variable, and a random part, the noise. The systematic part of the model is defined by the regression line.

α = the intercept: mean level for Y_i when X_i = 0

β = the slope: mean increase for Y_i when X_i is increased by 1 unit.

(36)

Research question

• Do fat people have more severe diabetes when the diabetes is discovered?

Or in more “statistical” language:

• Is diagnostic plasma glucose (positively) associated with the body mass index at the time of diagnosis?

• In a (simple) linear regression analysis, is the slope β different from 0 (or more pertinently, larger than 0)?

(37)

How does the model answer the research question?

• Interest may focus on making a simple hypothesis about the two parameters:

Null hypothesis : β = 0

Null hypothesis : α = 0

• The second hypothesis often has no (clinical) meaning.
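The test of β = 0 can be read from the coefficient table of a fitted model. The sketch below, on simulated data, also shows one possible way to obtain a one-sided p-value for the alternative β > 0 from the same t statistic; the object names are illustrative.

```r
# Simulated data (illustration only)
set.seed(9)
n <- 200
x <- rnorm(n, mean = 29, sd = 5)
y <- 12 + 0.1 * x + rnorm(n, mean = 0, sd = 4)

fit <- lm(y ~ x)

# Coefficient table: estimate, standard error, t value, two-sided p-value
tab     <- coef(summary(fit))
t_slope <- tab["x", "t value"]
p_two   <- tab["x", "Pr(>|t|)"]   # null hypothesis beta = 0, two-sided

# One-sided p-value for the alternative beta > 0, from the same t statistic
p_one <- pt(t_slope, df = df.residual(fit), lower.tail = FALSE)

# 95% confidence interval for the slope
ci_slope <- confint(fit, "x", level = 0.95)

list(two_sided = p_two, one_sided = p_one, ci = ci_slope)
```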

(38)

Linear regression – R code

> mymodel <- lm(diabetes$glucose ~ diabetes$bmi)
> summary(mymodel)

Call:
lm(formula = diabetes$glucose ~ diabetes$bmi)

Residuals:
    Min      1Q  Median      3Q     Max
-6.6974 -3.5771 -0.8535  2.3008 49.1636

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  14.96096    1.08396   13.80   <2e-16 ***
diabetes$bmi -0.05739    0.03588   -1.60    0.110
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.976 on 723 degrees of freedom
  (4 observations deleted due to missingness)

The Coefficients block is the table with parameter estimates. In the diabetes$bmi row, Estimate is the estimate of the slope and Pr(>|t|) is the p-value of the test for the null hypothesis β = 0.

(39)

Plot of regression line – R code

The model object fitted with lm() can be passed to abline() to draw the regression line in the scatter-plot:

> plot(diabetes$bmi,diabetes$glucose)

> abline(mymodel)

(40)

Scatter-plot with regression line

…log transformed glucose…

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(41)

How are the parameters estimated?

• The estimated parameters of the linear model define the line (found among all possible lines) which minimizes the sum of the squared vertical distances between the data-points and the line in the scatter-plot.

• The estimation method is called ‘ordinary least-squares’ (maximum likelihood gives the same answer).
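For simple regression the least-squares line has a closed form: the slope is the sum of cross-products divided by the sum of squares of X, and the intercept follows from the two means. The sketch below, on simulated data, compares this with lm() and checks that moving the slope away from the estimate increases the residual sum of squares.

```r
# Simulated data (illustration only)
set.seed(10)
x <- rnorm(60, mean = 29, sd = 5)
y <- 15 - 0.05 * x + rnorm(60, mean = 0, sd = 4)

# Closed-form least-squares estimates
beta_hat  <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
alpha_hat <- mean(y) - beta_hat * mean(x)

# The same estimates from lm()
fit <- lm(y ~ x)

# Residual sum of squares for an arbitrary line with intercept a and slope b
rss <- function(a, b) sum((y - a - b * x)^2)

rss_min       <- rss(alpha_hat, beta_hat)
rss_perturbed <- rss(alpha_hat, beta_hat + 0.01)   # any other line does worse

rbind(closed_form = c(alpha_hat, beta_hat), lm = unname(coef(fit)))
```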

(42)

Least squares fit

(43)

Does the model fit the data?

(44)

Diagnostic plots

(45)

Diagnostic plots

• R produces some diagnostic plots (of varying usefulness).

• The residuals (the error or noise) are assumed to be Normally distributed; this can be studied in the Q-Q plot (top right).

• More importantly, the residuals should have a single standard deviation, i.e. the variance should not increase with, for example, BMI. This can be studied in the residuals vs. fitted plot (top left).

> opar <- par(mfrow = c(2,2), oma = c(0,0,1.1,0))

> plot(mymodel)

> par(opar)

(46)

Data transformations

• If the residuals are not Normal, or (and this is more serious because the central limit theorem deals with much of the non-Normality issue) if variance seems to increase with level, it may be a good idea to transform one or both variables.

• This is the real reason to investigate log(glucose) instead of glucose.
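How a log transformation can stabilise the variance is sketched below on simulated data in which the noise is multiplicative, so the spread of the response grows with its level. The comparison of residual spread between the lower and upper half of x is an ad hoc check chosen for this illustration, not a standard diagnostic.

```r
# Simulated data with multiplicative noise: spread increases with the level
set.seed(11)
n <- 2000
x <- runif(n, min = 0, max = 4)
y <- exp(1 + 0.5 * x + rnorm(n, mean = 0, sd = 0.3))

# Residuals on the original scale and on the log scale
res_raw <- resid(lm(y ~ x))
res_log <- resid(lm(log(y) ~ x))

# Ratio of residual SD in the upper half of x to that in the lower half;
# a ratio close to 1 indicates a roughly constant standard deviation
upper <- x > 2
spread_ratio <- function(res) sd(res[upper]) / sd(res[!upper])

ratio_raw <- spread_ratio(res_raw)
ratio_log <- spread_ratio(res_log)

round(c(raw = ratio_raw, log = ratio_log), 2)
```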

(47)

Data transformations – log transform

(48)

The influence of one outlier
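The effect of a single extreme observation on the fitted line can be reproduced in a small simulation. The data and the added point below are invented for illustration; the point is placed far from the others in both X and Y.

```r
# Simulated data with a weak negative slope (illustration only)
set.seed(12)
x <- runif(50, min = 20, max = 40)
y <- 15 - 0.05 * x + rnorm(50, mean = 0, sd = 1)

slope_clean <- coef(lm(y ~ x))[["x"]]

# Add one observation that is extreme in both x and y
x_out <- c(x, 60)
y_out <- c(y, 60)

slope_out <- coef(lm(y_out ~ x_out))[["x_out"]]

# A single point is enough to change the sign and size of the slope here
round(c(without_outlier = slope_clean, with_outlier = slope_out), 3)
```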

(49)

Simpson’s paradox

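Florida death penalty verdicts for homicide 1976-1987 relative to defendant’s race

Percentage sentenced to death; the bracketed counts are death sentences / other verdicts.

Defendant white: 11% (53/430)

Defendant black: 8% (15/176)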

(50)

Simpson’s paradox

Percentage sentenced to death by victim’s and defendant’s race; the bracketed counts are death sentences / other verdicts.

               Defendant white   Defendant black
Victim white   11% (53/414)      23% (11/37)
Victim black    0% (0/16)         3% (4/139)

• Blacks tend to murder blacks and whites tend to murder whites…

• …and the murder of a white person has a higher probability of death penalty.

• For any victim the probability for a black person to get death penalty is about 2 times higher.
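The reversal can be recomputed from the counts. The sketch below assumes that each bracketed pair on the slides gives death sentences / other verdicts (an assumption, but it is the reading under which the printed percentages are reproduced); the object names are illustrative.

```r
# Counts from the slides, read as (death sentence, other verdict) -- an assumption
counts <- array(
  c(53, 414,   # victim white, defendant white
    11,  37,   # victim white, defendant black
     0,  16,   # victim black, defendant white
     4, 139),  # victim black, defendant black
  dim = c(2, 2, 2),
  dimnames = list(verdict   = c("death", "other"),
                  defendant = c("white", "black"),
                  victim    = c("white", "black")))

# Stratified by victim's race: percentage of death sentences per defendant group
pct_by_victim <- 100 * counts["death", , ] / apply(counts, c(2, 3), sum)

# Pooled over victim's race: the direction of the association reverses
pooled     <- apply(counts, c(1, 2), sum)
pct_pooled <- 100 * pooled["death", ] / colSums(pooled)

round(pct_by_victim)   # rows: defendant, columns: victim
round(pct_pooled)
```

Within each stratum the percentage is higher for black defendants, while in the pooled table it is higher for white defendants.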

(51)

Confounding

[Diagram: Victim’s race is linked to both Defendant’s race and Death penalty; the association between Defendant’s race and Death penalty is highlighted in green.]

We are interested in the green highlighted association, but the victim’s race is correlated with both the defendant’s race and the outcome of the trial.

(52)

Confounding

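[Diagram: Confounder is linked to both Exposure and Outcome; the association between Exposure and Outcome is highlighted in green.]

A confounder influences both exposure and outcome.

When confounding is present we cannot interpret the green highlighted association as causal.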

(53)

Randomization

[Diagram: Confounder, Exposure (randomised) and Outcome; the association between Exposure and Outcome is highlighted in green.]

Often there are many factors that may influence both exposure and outcome,

…some of them may not be observed

…or are unknown.

If exposure is randomised, then there is no confounding.

The green highlighted association can be interpreted as causal.

(54)

Two regressions

• The blue points denote patients with SBP>140 mmHg; the blue line the corresponding regression line.

• The red points denote patients with SBP < 140 mmHg; the red line the corresponding regression line.

• The black line is the general regression line.

• The slopes from the stratified analyses are less steep than the

(55)

Multiple regression

> mymodel <- lm(log(diabetes$glucose) ~ diabetes$bmi + diabetes$SBP)
> summary(mymodel)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)   2.639870   0.069389  38.045   <2e-16 ***
diabetes$bmi -0.002625   0.002308  -1.137   0.2558
diabetes$SBP -0.054447   0.024168  -2.253   0.0246 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The adjusted slope (association) of bmi is less pronounced than before.

SBP is related to both glucose and bmi and is a confounder.

(56)

Multiple regression

• Adjusting a statistical analysis means including other predictor variables in the model formula.

• Intuitively, a slope for BMI is determined for each level of the SBP variable separately and these are then averaged.

• …including SBP in the analysis removes the confounding effect of SBP from the relationship between log(glucose) and BMI.
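The effect of adjustment can be sketched with a simulation in which a binary confounder z influences both the exposure x and the outcome y, while x itself has no effect on y. All names and effect sizes are assumptions made for the illustration and are not taken from the diabetes data.

```r
# Simulated confounding: z influences both x and y; the true slope of x is 0
set.seed(13)
n <- 5000
z <- rbinom(n, size = 1, prob = 0.5)            # binary confounder
x <- 28 + 6 * z + rnorm(n, mean = 0, sd = 4)    # exposure depends on z
y <- 2.6 - 0.5 * z + rnorm(n, mean = 0, sd = 0.4)  # outcome depends on z only

# Crude analysis: x appears to be negatively associated with y
slope_crude <- coef(lm(y ~ x))[["x"]]

# Adjusted analysis: including z in the model formula removes the confounding
slope_adjusted <- coef(lm(y ~ x + z))[["x"]]

round(c(crude = slope_crude, adjusted = slope_adjusted), 4)
```

The crude slope reflects the path through z, whereas the adjusted slope is close to the true value of zero.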

(57)

Take home message

• Association between two continuous variables may be measured by correlation coefficients or in (simple) linear regression analysis.

• The latter arguably provides the most easily interpretable results.

• Moreover, it is straightforwardly extended to be able to deal with confounding, and more…