Exercises

Exercise 1 Infant birth weight

In a study of infant birth weight for different occupational groups this was recorded for some first-time mothers hairdressers. The following table shows the results in grams (data specified in sorted order) for 20 births in total, 10 female births and 10 male births.

Females (x) 2474 2547 2830 3219 3429 3448 3677 3872 4001 4116 Males (y) 2844 2863 2963 3239 3379 3449 3582 3926 4151 4356

Solve at least the following questions a)-c) first “manually” and then by the inbuilt fun-ctions in R. (It is OK to use R as alternative to your pocket calculator for the “manual”

part, but avoid the inbuilt functions that will produce the results without forcing you to think about how to compute it during the manual part.)

a) What is the sample mean, variance and standard deviation of female births? Express in your own words the story told by these numbers. The idea is to force you to in-terpret what these numbers mean.

b) Compute the same summary statistics of the male births. Compare and explain differences with the results for the female births.

c) Find the five quartiles for each distribution — and draw the two boxplots with pen and paper (i.e. not using R.)

d) Are there any “extreme” observations in the two distributions? (Use the modified boxplot definition of extremness)

e) What are the coefficient of variations in the two groups?

Exercise 2 Course Grades

To compare the difficulty of 2 different courses at a university the following grades distributions (given as number of pupils who achieved the grades) were registered:

Course 1 Course 2 Total

Grade 12 20 14 34

Grade 10 14 14 28

Grade 7 16 27 43

Grade 4 20 22 42

Grade 2 12 27 39

Grade 0 16 17 33

Grade -3 10 22 32

Total 108 143 251

a) What is the median of the 251 achieved grades?

b) What are the quartiles and the IQR (Inter Quartile Range)?

Exercise 3 cholesterol

In a clinical trial of a cholesterol-lowering agent, 15 patients’ cholesterol (in mMol/l) was measured before treatment and 3 weeks after starting treatment. Data are listed in the following table:

Patient 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Before 9.1 8.0 7.7 10.0 9.6 7.9 9.0 7.1 8.3 9.6 8.2 9.2 7.3 8.5 9.5 After 8.2 6.4 6.6 8.5 8.0 5.8 7.8 7.2 6.7 9.8 7.1 7.7 6.0 6.6 8.4

a) What is the median of the cholesterol measurements for the patients before treat-ment, and similarly after treatment?

b) Find the standard deviations of the cholesterol measurements of the patients be-fore and after treatment

c) Find the sample covariance between cholesterol measurements of the patients be-fore and after treatment.

d) Find the sample correlation between cholesterol measurements of the patients be-fore and after treatment.

e) Compute the 15 differences (Dif = Before−After) and do various summary stati-stics and plotting on these: mean, variance, standard deviations, boxplot etc.

f) The big question behind these data is whether an average decrease in choleste-rol level can be “shown statistically”. We will learn how to formally answer this question in a few weeks, but which summary statistics and/or plots would you look at to have some idea of what the answer will be?

Exercise 4 Project Start

a) Go to CampusNet and take a look at the first project. Follow the steps to import the data into R and get started with the explorative data analysis.

eNote 2

Probability and simulation

2 Probability and simulation 1

2.1 Random variable . . . 3 2.2 Discrete distributions . . . 6 2.2.1 Mean and variance . . . 12 2.2.2 Binomial distribution. . . 19 2.2.3 Hypergeometric distribution . . . 22 2.2.4 Poisson distribution . . . 24 2.3 Continuous distributions. . . 28 2.3.1 Uniform distribution . . . 31 2.3.2 Normal distribution . . . 33 2.3.3 Log-Normal distribution. . . 39 2.3.4 Exponential distribution . . . 40 2.4 Identities for the mean and variance . . . 44 2.5 Covariance and correlation . . . 46 2.5.1 Correlation . . . 47 2.6 Independence of random variables . . . 48 2.7 Functions of normal random variables . . . 53 2.7.1 The χ²distribution . . . 54 2.7.2 The t-distribution . . . 59 2.7.3 The F distribution . . . 66 2.8 Exercises . . . 71

In this chapter the elements from probability theory needed for establishing the founda-tion of the statistics presented in this book are introduced. These elements provide the techniques needed for a basic mathematical description of randomness. For example for calculating the probabilities of outcomes in various types of typical experiment setups, e.g. dice rolls and lottery draws, as well as for natural phenomena, such as the waiting time between radioactive decays. The theory is introduced together with illustrative R-code examples, which the reader is encouraged to try and interact with in parallel to reading the text.

2.1 Random variable

The basic building blocks needed to describe random outcome of an experiment are introduced in this section. The definition of an experiment is quite broad, it can be a con-trolled experiment, which is carried out under concon-trolled conditions e.g. in a laboratory or flipping a coin, as well as be in conditions which are not controlled, where simply a process is observed e.g. observations of the GNP or measurements taken with a space telescope. Hence an experiment can be thought of as any setting in which the outcome cannot be fully known. This for example also includes measurement noise, which are random “errors” related to the system used to observe with, maybe originating from noise in electrical circuits or small turbulence around the sensor. Measurements will always contain some noise.

First the sample space is defined.

Definition 2.1

The sample space S is the set of all possible outcomes of an experiment.

Example 2.2

Consider an experiment in which a person will throw two paper balls with the purpose of hitting a wastebasket. All the possible outcomes forms the sample space of this experiment as

S=(miss,miss),(hit,miss),(miss,hit),(hit,hit) (2-1)

Now a random variable can be defined.

Definition 2.3

A random variable is a function which assign a numerical value to each possible out-come of an experiment, i.e. a value to each outout-come in the sample space. In this book random variables are usually denoted with capital letters, e.g.

X, Y, . . . (2-2)

Example 2.4

Continuing the paper ball example above, a random variable can be defined as the number of hits, thus

X (miss,miss)=0 (2-3)

X (hit,miss)=1 (2-4)

X (miss,hit)=1 (2-5)

X (hit,hit)=2 (2-6)

In this case the random variable is function which maps the sample space S to positive inte-gers, i.e. X : S→N0.

Remark 2.5

The random variable is used to represent a value of the outcome before the exper-iment is carried out. Usually the experexper-iment is carried out n times and a random variable is needed for each time

{^Xi : 1, 2, . . . , n} ^(2-7)

After the experiment has been carried out n times a set of values of the random variable is available as

{^xi : 1, 2, . . . , n} ^(2-8)

Each value is called a realization or observation of the random variable and is denoted with a small letter sub-scripted with an index i, as introduced in Chapter 1.

Finally, in order to quantify probability a random variable has assigned probability dis-tribution to it, which can either be discrete or continuous depending on the nature of the outcomes:

• Discrete outcomes can be for example: outcome of a dice roll or number of cups of coffee per day, hence some countable phenomena, which can be represented with an integer.

• Continuous outcomes can be for example: weight of the yearly harvest, time spend on homework each week or current electricity production, hence a phe-nomena which is measured with a continuous value.

Furthermore the outcome values can either be unlimited or limited. This is most obvi-ous in the case discrete case, e.g. a dice roll is limited to 6 outcome values between 1 and 6. However it is also often the case for continuous random variables, for example many are not negative (weights, distances, etc.) and proportions are limited to a range between 0 and 1.

Conceptually there is no difference between the discrete and the continuous case, it is a matter of how it is defined and how the properties are calculated. In the remaining of this chapter, first the discrete case is presented and then the continuous.

2.2 Discrete distributions

In this section discrete distributions and their properties are introduced. A discrete random variable has discrete outcomes and follows a discrete distribution.

To exemplify, consider the outcome of one roll of a fair six-sided dice as the random variable X^fair. It has six possible outcomes, each with equal probability. This is specified with the probability density function.

Definition 2.6

For a discrete random variable X the probability density function (pdf) is

f(x) = P(X= x) (2-9)

It assigns a probability to every possible outcome value x.

A discrete pdf fulfills two properties: there must be no negative probabilities for any outcome value

f(x)≥0 for all x (2-10)

and the probability for all outcome values must sum to one

all x

∑

f(x) =1 (2-11)

For the fair dice the pdf is

x 1 2 3 4 5 6

f_Xfair(x) ¹₆ ¹₆ ¹₆ ¹₆ ¹₆ ¹₆

If the dice is not fair, maybe it has been modified to increase the probability of rolling a six, the pdf could for example be

x 1 2 3 4 5 6

f_Xunfair(x) ¹₇ ¹₇ ¹₇ ¹₇ ¹₇ ²₇

where X^unfairis a random variable representing the value of a roll with the unfair dice.

Remark 2.7

Note that the pdfs were denoted with the name of the random variable to which they belonged. This is done when there is a need to distinguish between pdfs e.g. for several random variables. For example if two random variables X and Y are used in same context, then: the pdf fX(x)is for X and fY(x)is for Y, the sample standard deviation s_Xis for X and s_Y is for Y, and so forth.

With the pdf defined the experiment can easily be simulated, i.e. instead of carrying out the experiment in reality it is carried out using a model on the computer. When the simulation includes generating random numbers it is called a stochastic simulation.

Example 2.8

Let’s simulate the experiment of rolling a dice using the following R code (open the file eNote2-ProbabilitySimulation.Rand try it)

## Simulate rolling a dice

## Make a random draw from (1,2,3,4,5,6) with equal probability for each outcome sample(1:6, size=1)

More interesting is it to repeat the experiment many times, then we can calculate the empirical density function (or empirical pdf ) as a discrete histogram and actually “see” the shape of the pdf

## Simulate a fair dice

## Number of simulated realizations

n <- 30

## Draw independently from the set (1,2,3,4,5,6) with equal probability xFair <- sample(1:6, size=n, replace=TRUE)

## Count the number of each outcome using the table function table(xFair)

## Plot the empirical pdf plot(table(xFair)/n)

## Add the pdf to the plot

lines(rep(1/6,6), type="h", col="red")

0.00.20.40.60.81.0

Density

1 2 3 4 5 6

Empirical pdf pdf

Try simulating with different number of rolls n and describe how this affects the accuracy of the empirical pdf compared to the pdf?

Now repeat this with the unfair dice

## Simulate an unfair dice

## Number of simulated realizations

n <- 30

## Draw independently from the set (1,2,3,4,5,6) with higher probability for a six xUnfair <- sample(1:6, size=n, replace=TRUE, prob=c(rep(1/7,5),2/7))

## Plot the empirical density function plot(table(xUnfair)/n)

## Add the pdf to the plot

lines(c(rep(1/7,5),2/7), type="h", col="red")

0.00.20.40.60.81.0

Density

1 2 3 4 5 6

Empirical pdf pdf

Compare the fair and the unfair dice simulations:

How did the empirical pdf change?

By simply observing the empirical pdf can we be sure to distinguish between the fair and the unfair dice?

How does the number of rolls n affect how well we can distinguish the two dices?

One reason to simulate becomes quite clear here: it would take considerably more time to actually carry out these experiments. Furthermore sometimes calculating the theo-retical properties of random variables (e.g. products of several random variables etc.) can be impossible and simulations can be an easy way to obtain results.

Random number sequences generated with software algorithms have the properties of real random numbers, e.g. they are independent, but are in fact deterministic sequences depending on a seed, which sets an initial value of the sequence. Since they behave like and are used as random numbers in simulations, but are in fact deterministic sequences, they are named pseudo random numbers.

Remark 2.9

In R the initial values can be set with a single number called the seed as demonstrated with the following R code. When R starts the seed is created from the time of start-up. Real random numbers can for example be generated by sampling some physical phenomena, for example atmospheric noise as done atwww.random.org.

## The random numbers generated depends on the seed

## Set the seed set.seed(127)

## Generate a (pseudo) random sequence sample(1:10)

[1] 3 1 2 8 7 4 10 9 6 5

## again and see that new numbers are generated sample(1:10)

[1] 9 3 7 2 10 8 6 1 5 4

## Set the seed again and generate the same numbers as before when the

## seed was set set.seed(127) sample(1:10)

[1] 3 1 2 8 7 4 10 9 6 5

The cumulated distribution function (cdf), or simply the distribution function, is often used.

Definition 2.10 The cdf

The cumulated distribution functions (cdf) for the discrete case is the probability of realizing an outcome below or equal to the value x

F(x) = P(X ≤^x) =

∑

j where xj≤x

f(xj) =

∑

j where xj≤x

P(X =xj) (2-12)

The probability that the outcome of X is in a range is

P(a<X ≤^b) = F(b)−^F(a) (2-13)

For the fair dice the probability of an outcome below or equal to 4 can be calculated

F_Xfair(4) =

∑

4 j=1

f_Xfair(x_j) = ¹ 6+¹

6 +¹ 6+¹

6 = ²

3 (2-14)

Example 2.11

Simulate again 30 rolls of the dice and see the empirical distribution function

## Simulate a fair dice

## Number of realizations

n <- 30

## Draw independently from the set (1,2,3,4,5,6) with equal probability xFair <- sample(1:6, size=n, replace=TRUE)

## Calculate the empirical cdf cumsum(table(xFair))

## Plot a discrete histogram

plot(as.table(cumsum(table(xFair)))/n, lwd=10)

## Add the cdf

lines(cumsum(rep(1/6,6)), type="h", col="red")

0.00.20.40.60.81.0

Distribution

1 2 3 4 5 6

Empirical cdf cdf

2.2.1 Mean and variance

In Chapter 1 the sample mean and the sample variance was introduced. They indicate re-spectively the centering and the spread of a data, i.e. of a sample. In this section the mean and variance are introduced. They are properties of the distribution of a random variable, they are called population parameters. The mean indicates where the distribu-tion is centered. The variance indicates the spread of the distribudistribu-tion.

Mean and expected value

The mean (µ) of the random variable (i.e. its distribution) is the most used measure in statistical analysis. It is also defined as a function (E()) called the expected value of the random variable, they are equivalent: usually the mean is used for a single random vari-able and the expected value is used, when dealing with functions of random varivari-ables (which output can of course also be considered a random variable).

Definition 2.12 Mean value

The mean and expected value of a discrete random variable X are

µ =E(X) =

∑

∞ i=1

xif(xi) (2-15)

where xiis the value and f(x_i)is the probability that X takes the outcome value xi.

The mean is simply the weighted average over all the all possible outcome values, weighted with the corresponding probability. As indicated in the definition there might be infinitely many possible outcome values, but in the discrete case these are countable, also the weights should go sufficiently fast to zero in order for the sum to be defined.

Example 2.13

For the fair dice the mean is calculated by µ_xfair =E(X^fair) =11

for the unfair dice the mean is

µ_xunfair =E(X^unfair) =11

After an experiment has been carried out n times then the sample mean or average can be calculated as previously defined in Chapter 1.

µˆ = ¯x= ¹ n

∑

n i

x_i. (2-18)

It is called a statistic, which means that it is calculated from a sample. Note the use of

a hat in the notation over µ: this indicates that it is an estimate of the real underlying mean.

Our intuition tells us that the estimate ( ˆµ) will be close to true underlying expectation (µ) when n is large. This is indeed the case, to be more specific Eh

1 n∑ Xi

i = µ (when E[Xi] = µ), and we say that the average is a central estimator for the expectation. The exact quantification of these qualitative statements will be covered in Chapter 3.

Now play a little around with the mean and the sample mean with some simulations.

Example 2.14

Carrying out the experiment more than one time an estimate of the mean, i.e. the sample mean, can be calculated. Simulate rolling the fair dice

## Simulate a fair dice

## Number of realizations

n <- 30

## Simulate rolls with a fair dice

xFair <- sample(1:6, size=n, replace=TRUE)

## Calculate the sample mean sum(xFair)/n

## or

mean(xFair)

Let us see what happens with the sample mean of the unfair dice by simulating the same number of rolls

## Simulate an unfair dice

## n realizations

xUnfair <- sample(1:6, size=n, replace=TRUE, prob=c(rep(1/7,5),2/7))

## Calculate the sample mean mean(xUnfair)

Calculate the mean of the unfair dice and compare it to the mean of the fair dice (see Equation (2-16)). Is this in accordance with your simulation results?

Let us again turn to how much we can “see” from the simulated data and the impact of the number of realizations n. In statistics the term information is used to refer to how much information is embedded in the data, and therefore how accurate different properties (parameters) can be estimated from the data.

Repeat the simulations several times with n=30. By simply comparing the sample means from a single simulation can it then be determined if the means really are different?

Repeat the simulations several times and increase n. What happens to the ”accu-racy” (i.e. how close it is to the real mean) of the sample mean estimate of the real mean? and thereby how well it can be inferred if the sample means are different?

What happens to the information when n is increased?

Variance and standard deviation

The second most used statistic is the variance (or standard deviation). It is a measure describing the spread of the distribution, more specifically the spread away from the mean.

Definition 2.15 Variance

The variance of a discrete random variable X is

σ² =Var(X) = E[(X−^µ)²] (2-19) The ”squared” on the symbol marks that the variance is the squared standard devia-tion σ.

The variance is a weighted (be probabilities) average of the squared distance between the outcome and the mean value.

Remark 2.16

Notice that the variance cannot be negative.

Notice that the variance is the expected value of a function of the random value, namely the squared distance between the random variable X and its mean. It is calculated by

The standard deviation is measured on the same scale (same units) as the random vari-able, which is not case for the variance. Therefore the standard deviation can much easier be interpreted, when communicating the spread of a distribution).

Consider how the expected value is calculated in Equation (2-15). One can think of the squared distance as a new random variable that has an expected value which is the variance of X.

Example 2.17

The variance of rolls with the fair dice is σ_x²fair =E[(X^fair−^µX^fair)²]

After an experiment has been carried out n times the sample variance can be calculated as defined previously by

and hence thereby also sample standard deviation s.

Again our intuition tells us that the statistics (sample variance), should in some sense converge to the true standard deviation, this is indeed the case and the we call the sample variance a central estimator for the true underlying variance. The convergence will be quantified for one special case in Chapter 3.

The sample variance is calculated by:

• take the sample mean (average of the observations): ¯x

• take the distance for each sample: xi− ^¯x

• and finally take the average (using n−1 in the denominator, see Chapter 1) of the squared distances.

Example 2.18

Return to the simulations. First calculate the sample variance from n rolls of a fair dice

## Simulate a fair dice and calculate the sample variance

## Number of realizations

n <- 30

## Simulate

xFair <- sample(1:6, size=n, replace=TRUE)

## Calculate the distance for each sample to the sample mean distances <- xFair - mean(xFair)

## Calculate the average of the squared distances sum(distances^2)/(n-1)

## Or use the built in function var(xFair)

Let us then try to play with variance in the dice example. Let us now consider a four-sided dice. The pdf is

x 1 2 3 4

F_XfairFour(x) ¹₄ ¹₄ ¹₄ ¹₄

Plot the pdf for both the six-sided dice and the four-sided dice

## Plot the pdf of the six-sided dice and the four-sided dice

## Plot them

plot(rep(1/6,6), type="h", col="red") plot(rep(1/4,4), type="h", col="blue")

1 2 3 4 5 6

0.00.20.40.60.81.0

Sixsideddicedensity

1 2 3 4 5 6

0.00.20.40.60.81.0

Foursideddicedensity

## Calculate the means and variances of the dices

## The means

muXSixsided <- sum((1:6)*1/6) # Six-sided muXFoursided <- sum((1:4)*1/4) # Four-sided

## The variances

sum((1:6-muXSixsided)^2*1/6) sum((1:4-muXFoursided)^2*1/4)

Which dice has the highest variance?

2.2.2 Binomial distribution

The binomial distribution is a very important discrete distribution and appears in many applications, it is presented in this section. In statistics it is typically used for propor-tions as presented in Chapter 8.

If an experiment has two possible outcomes (failure or success, no or yes, 0 or 1) and is repeated more than one time, then the number of successes is binomial distributed. For example the number of sixes obtained with rolls of a dice or the number of heads with flips of a coin. Each repetition must be independent, this is called successive draws with replacement (think of drawing notes from a hat, where after each draw the note is put back again, i.e. the drawn number is replaced again).

In document Introduction, descriptive statistics, R and data visualization (Page 41-200)

eNote 2

Probability and simulation

Contents

2.1 Random variable

2.2 Discrete distributions

∑

∑

∑

∑

2.2.1 Mean and variance

Mean and expected value

∑

∑

Variance and standard deviation

2.2.2 Binomial distribution