Descriptive and Inferential Statistics


General Sir John Kotelawala Defence University

Workshop on

Descriptive and Inferential Statistics

Faculty of Research and Development

14th May 2013


1. Introduction to Statistics

1.1 What is Statistics?

In common usage, 'statistics' refers to numerical information. (Here, 'statistics' is the plural of 'statistic', which means one piece of numerical information.) For example,

• Percentage of male nurses in Sri Lanka: 5%

• Birth rate: 17.42 births/1,000 population

• Death rate: 5.92 deaths/1,000 population

• Infant mortality rate: 9.7 deaths/1,000 live births

• Life expectancy at birth: male: 72.21 years; female: 79.38 years

• GDP (value of all final goods and services produced in a year): $106.5 billion

• Unemployment rate (the percent of the labor force that is without jobs): 5.8%

• Inflation rate (the annual percent change in consumer prices compared with the previous year's consumer prices): 5.9% (2010 est.)



In the more specific sense, 'statistics' refers to a field of study. It has been defined in several ways. For example,

• Statistics is the study of the collection, organization, analysis, and interpretation of data. - http://en.wikipedia.org/wiki/Statistics

• Statistics is the mathematical science involved in the application of quantitative principles to the collection, analysis, and presentation of numerical data. - http://stat.fsu.edu/undergrad/statinf2.php

• Statistics is the science of collecting, organizing, presenting, analyzing, and interpreting numerical data to assist in making more effective decisions. - http://business.clayton.edu/arjomand/business/l1.html

1.2 Data and Information

• These words are often used interchangeably. However, there are some differences.

• Data are the numbers, characters, symbols, images, etc., collected in raw form for analysis, whereas information is processed data.

• Data are unprocessed facts and figures without any added interpretation or analysis.

• Information is data that has been interpreted so that it has meaning for the user.

• Knowledge is a combination of information, experience and insight that may benefit the individual or the organization.

1.3 Distinguishing between Variables and Data

• A variable is some characteristic which has different 'values' or categories for different units (items/subjects/individuals).

• Examples of variables on which data are collected at a prenatal clinic: Gender, Ethnicity, Age, Body temperature, Pulse rate, Blood pressure, Fasting blood sugar level, Urine pH value, Income group, Number of children.

• We collect data on variables.

• Data are raw numbers or facts that must be processed (analyzed) to get useful information.

• We get information by processing data.

• Variable: Age (in years) of patients

• Data: 31, 42, 34, 33, 41, 45, 35, 39, 28, 41

• Information: the mean age is 36.9 years; the percentage of patients above 40 years of age is 40%.
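As a quick check of the example above, the two pieces of information can be recomputed from the raw data. This is a minimal Python sketch (Python is not part of the workshop, and the variable names are illustrative only).

```python
from statistics import mean

# Ages (in years) of the ten patients listed above
ages = [31, 42, 34, 33, 41, 45, 35, 39, 28, 41]

# Information 1: the mean age
mean_age = mean(ages)

# Information 2: the percentage of patients above 40 years of age
above_40 = sum(1 for a in ages if a > 40)
pct_above_40 = 100 * above_40 / len(ages)

print(f"Mean age: {mean_age:.1f} years")
print(f"Patients above 40: {pct_above_40:.0f}%")
```

The same raw data can yield many different pieces of information, depending on how they are processed.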

1.4 Population and sample

Statistics is used for making conclusions regarding a group of units (individuals/items/subjects). Such a group of interest is called a population. In research, the 'population' represents a group of units that one wishes to generalize the conclusions to. The populations of interest are usually large.

Even though the decisions have to be made pertaining to the population of interest, often it is impossible or very difficult to collect data from the whole population, due to practical constraints on the available money, time and labour etc., or due to the nature of the population. Therefore, often data are collected from only a subset of the population. Such a subset is called a sample.

1.5 Descriptive Statistics and Inferential Statistics

Descriptive Statistics is the branch of Statistics that includes methods of organizing, summarizing and presenting data in an informative way. Commonly used methods are frequency tables, graphs, and summary measures.

Inferential Statistics is the branch of Statistics that includes methods used to make decisions, estimates, predictions, or generalizations about a population, based on a sample. This includes point estimation, interval estimation, tests of hypotheses, regression analysis, time series analysis, multivariate analysis, etc.

1.6 Classification of Variables


Why do we need to know about types of variables? You need to know, in order to evaluate the appropriateness of the statistical techniques used, and consequently whether the conclusions derived from them are valid. In other words, you can't tell whether the results in a particular medical research study are credible unless you know what types of variables or measures have been used in obtaining the data.

1.6.1 Qualitative Variables

• The characteristic is a quality.

• The data are categories. They cannot be given numerical values. However, they may be given numerical labels.

• Examples: Gender of patient, Ethnicity, Income group

1.6.2 Quantitative Variables

• The characteristic is a quantity.

• The data are numbers. They are obtained by counting or measuring with some scale.

• Examples: Age, Body temperature, Pulse rate, Blood pressure, Fasting blood sugar level, Urine pH value, Number of children

1.6.3 Discrete Variables

• Quantitative.

• Usually, the data are counts.

• There are impossible values between any two possible values.

• Examples: Pulse rate, Number of children

1.6.4 Continuous Variables

• Quantitative.

• Usually, the data are obtained by measuring with a scale.

• There are no impossible values between any two possible values. Any value between any two possible values is also a possible value.

• Examples: Age, Fasting blood sugar level, Body temperature, Urine pH value

1.6.5 Scales of measurement

1.6.5.1 Nominal Variables

• Qualitative.

• No order or ranking in categories.

• Examples: Gender, Ethnicity

1.6.5.2 Ordinal Variables

• Qualitative.

• Categories can be ordered or ranked.

• Examples: Income group

1.6.5.3 Interval Variables

• Quantitative.

• Data can be ordered or ranked.

• There is no absolute zero. Zero is only an arbitrary point against which other values can be compared.

• The difference between two numbers is a meaningful numerical value.

• They are called interval variables because the intervals between the numbers represent something real. This is not the case with ordinal variables.

• The ratio of two numbers is not a meaningful numerical value.

• Examples: Temperature (in °C or °F)

1.6.5.4 Ratio Variables

• Possesses all the characteristics of an interval variable.

• There exists an absolute (true) zero.

• The ratio between different measurements is meaningful.

• Examples: Age, Pulse rate, Fasting blood sugar level, Number of children
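The difference between interval and ratio scales can be illustrated numerically. The sketch below (in Python, with made-up values chosen only for illustration) shows that the ratio of two temperatures changes when the unit changes, whereas the ratio of two ages does not.

```python
def celsius_to_fahrenheit(c):
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return c * 9 / 5 + 32

# Interval scale: two temperatures (illustrative values)
t1_c, t2_c = 10.0, 20.0
t1_f, t2_f = celsius_to_fahrenheit(t1_c), celsius_to_fahrenheit(t2_c)

ratio_celsius = t2_c / t1_c          # 2.0
ratio_fahrenheit = t2_f / t1_f       # 68 / 50 = 1.36: the ratio depends on the unit

# Ratio scale: two ages (illustrative values), in years and in months
age1_years, age2_years = 20, 40
ratio_years = age2_years / age1_years
ratio_months = (age2_years * 12) / (age1_years * 12)  # unchanged by the unit

print(ratio_celsius, ratio_fahrenheit)
print(ratio_years, ratio_months)
```

Because 0 °C and 0 °F are arbitrary points, "twice as hot" has no fixed meaning; an age of 0, by contrast, is a true zero.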


2. Data Analysis with SPSS 16

2.1 Running SPSS for Windows

Method 01

Click on the Start button at the lower left of your screen and, among the programs listed, find SPSS for Windows and select SPSS 16.0 for Windows.

Method 02

If there is an SPSS shortcut on the desktop, simply put the cursor on it and double click the left mouse button.

Shown below is an image of the screen you will see when SPSS is ready.

Figure 01: SPSS start-up screen (labelled parts: Start-up dialog box, Menu Bar, Tool Bar)

You could select any one of the options on the start-up dialog box and click OK, or you could simply hit Cancel. If you hit Cancel, you can either enter new data in the blank Data Editor or open an existing file using the File menu, as explained later.

2.2 Different Types of Windows in SPSS

• Data Editor (Variable View and Data View)

• Output Viewer

• Syntax Editor

• Script window

2.2.1 The Data Editor

As shown in Figure 01, you will first see the start-up dialog box listing several options; behind it is the Data Editor. The Data Editor is a worksheet used for entering and editing data. It has two panes: Variable View and Data View.

2.2.1.1 Naming and defining variables

When preparing a new dataset in SPSS, the following attributes must be set in Variable View.

• Move your cursor to the bottom of the Data Editor, where you will see a tab labeled Variable View. Click on that tab. A different grid appears, with column headings such as Name, Type, Width, Decimals, Label, Values, Missing and Measure.

For each variable we create, we need to specify all or most of the attributes described by these column headings.

Name

• Should be a single word.

• Spaces and special characters (such as !, ? and *) are not allowed.

• Each variable name must be unique; duplication is not allowed.

• The underscore character is frequently used where a space is desired in names.

Type

Click within the Type column, and a small gray button marked with three dots will appear; click on it and you’ll see this dialog box.

Numeric is the default type. (Basically, numeric and string types are preferred for many of the variables.)

(For a full description of each of the variable types, click on the Help button.)

Width & Decimals

Applicable to numeric variables.

Label

This is an optional attribute which can be used for entering a more detailed name.

Values

This option allows the user to configure the coding structure for categorical variables. (In the Values column, click on the word None and then click the gray box with three dots. This opens the Value Labels dialog box.)

(E.g., type “1” in the Value box and “male” in the Label box, then click Add. Then type “0” in the Value box and “female” in the Label box. Click Add and then click OK.)

Missing

The user can assign codes to represent missing observations.

Measure

The scale of measurement applicable to the variable. Both interval and ratio scales are referred to as the ‘scale’ type.
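The idea behind the Values and Missing attributes (numeric codes that stand for categories, plus a reserved code for missing observations) can be sketched outside SPSS as follows. This is only a Python illustration: the coded responses are made up, and the missing-value code 9 is a hypothetical choice, not an SPSS default.

```python
from collections import Counter

# Coding structure from the example above: 1 = male, 0 = female
VALUE_LABELS = {1: "male", 0: "female"}

# Hypothetical code chosen here to represent a missing observation
MISSING_CODE = 9

# Made-up coded responses, for illustration only
raw_codes = [1, 0, 0, 9, 1, 0]

# Drop the missing observations, then translate codes into labels
valid_codes = [c for c in raw_codes if c != MISSING_CODE]
labelled = [VALUE_LABELS[c] for c in valid_codes]
label_counts = Counter(labelled)

print(labelled)
print(dict(label_counts))
```

Storing codes rather than text keeps data entry short, while the labels keep the output readable.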

2.2.1.2 Entering Data

The Data View pane of the Data Editor window is used to enter the data. Displayed initially is an empty spreadsheet with the variable names you have defined appearing as the column headings.

2.2.1.3 Saving a Data File

On the File menu, choose Save As… In the Save in box, select the destination directory (in our example, we’re saving to the Desktop). Then give a suitable file name and click Save.

2.2.2 Output Viewer

Displays output and error messages. Saved output files have the extension “.spv”.

2.3 Reading data into SPSS

Data can be entered directly or imported from a number of different sources. The process for reading data stored in SPSS-format data files and in spreadsheet applications such as Microsoft Excel will be covered in the classroom session. SPSS-format data files are organized by cases (rows) and variables (columns).


3. Descriptive Analysis of Data

Descriptive statistics consists of organizing and summarizing the information collected.

Descriptive statistics describes the information collected through numerical measurements, charts, graphs and tables. The main purpose of descriptive statistics is to provide an overview of the information collected.

3.1 Organizing Qualitative Data

Recall that qualitative data provide non-numerical measures that categorize or classify an individual. When qualitative data are collected, we are often interested in determining the number of individuals that occur within each category.

3.1.1 Tabular Data Summaries

A frequency table (frequency distribution) is a listing of the values a variable takes in a data set, along with how often (frequency) each value occurs.

Definition 3.1: The frequency is the number of observations in the data set that fall into a particular class.

Definition 3.2: The relative frequency is the class frequency divided by the total number of observations in the data set; that is,

$$\text{Relative frequency} = \frac{\text{Class frequency}}{\text{Total number of observations}}$$

Definition 3.3: The percentage is the relative frequency multiplied by 100; that is,

$$\text{Percentage} = \text{Relative frequency} \times 100$$

Comparing relative frequencies is usually more useful than comparing absolute frequencies.
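Definitions 3.1 to 3.3 can be applied directly to a small data set. The following Python sketch uses made-up activity levels (not the workshop data) and builds a simple frequency table with frequency, relative frequency and percentage.

```python
from collections import Counter

# Made-up activity levels for 10 respondents (illustrative only)
activity = ["slight", "moderate", "moderate", "a lot", "slight",
            "moderate", "a lot", "moderate", "moderate", "slight"]

n = len(activity)
frequency = Counter(activity)  # Definition 3.1: count per class

print(f"{'Category':<10}{'Freq':>6}{'Rel. freq':>11}{'Percent':>9}")
for category, freq in frequency.items():
    rel_freq = freq / n          # Definition 3.2
    percent = rel_freq * 100     # Definition 3.3
    print(f"{category:<10}{freq:>6}{rel_freq:>11.2f}{percent:>9.1f}")
```

The relative frequencies always sum to 1, and the percentages to 100.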

One-way frequency tables (simple frequency tables)

Analyze → Descriptive Statistics → Frequencies (select the variable and click OK)

Table 01: Composition of the sample by activity

Note: The “Valid Percent” column takes missing values into account. For instance, if there were one missing value in this data set, the valid number of cases would be 91. In that case, the valid percentage of the slight category would be 11%. Note that “Percent” and “Valid Percent” will both always total 100%.

The “Cumulative Percent” is the cumulative percentage of the cases for that category and all categories listed above it in the table. The cumulative percentages are not meaningful, of course, unless the scale has ordinal properties.
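The relationship between the Percent, Valid Percent and Cumulative Percent columns can be sketched as follows. This is a Python illustration with made-up responses; the category order is assumed, and the calculation is only meant to mirror the idea of the SPSS columns, not to reproduce SPSS output.

```python
from collections import Counter

# Assumed ordinal order of the categories
ORDER = ["slight", "moderate", "a lot"]

# Made-up responses; None marks a missing value
responses = ["slight", "moderate", None, "a lot", "moderate",
             "moderate", "slight", "a lot", "moderate", "moderate"]

total = len(responses)
valid = [r for r in responses if r is not None]
counts = Counter(valid)

rows = []
cumulative = 0.0
for category in ORDER:
    percent = 100 * counts[category] / total             # base: all cases
    valid_percent = 100 * counts[category] / len(valid)  # base: non-missing cases
    cumulative += valid_percent                          # running total of valid percent
    rows.append((category, counts[category], percent, valid_percent, cumulative))

for category, freq, pct, valid_pct, cum_pct in rows:
    print(f"{category:<10}{freq:>4}{pct:>8.1f}{valid_pct:>8.1f}{cum_pct:>8.1f}")
```

In this sketch the Percent column excludes the missing case from the category rows, so the listed categories alone sum to less than 100 unless a "Missing" row is added.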

3.2 Cross classification tables

Cross classification tables (contingency tables/ two-way tables) display the relationship between two or more categorical (nominal or ordinal) variables.

Analyze → Descriptive Statistics → Crosstabs…

Note: By default, the Crosstabs command does not display percentages. You can add Row, Column and Total percentages as appropriate using the Cells… option in the Crosstabs dialog box.

Table 02: Composition of the sample by smoke and gender
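A two-way table of this kind can be sketched in a few lines of Python. The records below are made up for illustration (they are not the workshop data); the sketch counts each gender and smoking combination and reports row percentages.

```python
from collections import Counter

# Made-up (gender, smoker) pairs, for illustration only
records = [("male", "yes"), ("male", "no"), ("female", "no"), ("female", "no"),
           ("male", "yes"), ("female", "yes"), ("male", "no"), ("female", "no")]

cell_counts = Counter(records)
genders = sorted({g for g, _ in records})
smoke_levels = sorted({s for _, s in records})

for g in genders:
    row_total = sum(cell_counts[(g, s)] for s in smoke_levels)
    cells = ", ".join(
        f"{s}: {cell_counts[(g, s)]} ({100 * cell_counts[(g, s)] / row_total:.0f}%)"
        for s in smoke_levels
    )
    print(f"{g:<7} {cells}  total: {row_total}")
```

Row percentages answer "what share of each gender smokes?"; column percentages would answer "what share of smokers is of each gender?".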

3.3 Graphical Presentation for Categorical Data

Visual display is often the most effective way to present information. Graphs are frequently used in statistical analyses both as a means of uncovering patterns in a set of data and as a means of conveying the important information from a survey in a concise and accurate fashion.

3.3.1 Bar Charts

Simple Bar Chart

Graphs → Legacy Dialogs → Bar

Choose the options Simple and Summaries for groups of cases.

Choose the relevant variable as the category axis.

Cluster Bar Chart

Graphs → Legacy Dialogs → Bar

Choose the options Clustered and Summaries for groups of cases.

Component Bar Chart (Sub-divided bar diagram)

These diagrams show the total of the values and its break-up into parts. Each bar is subdivided into parts in proportion to the values given in the data, and may be drawn using absolute figures or percentages. Each component occupies a part of the bar proportional to its share in the total. To distinguish the components from one another, different colors or shades may be used. When a sub-divided bar diagram is drawn on a percentage basis, it is called a percentage bar diagram. The components should be kept in the same order in each bar.

Pie Chart

SPSS Command

Graphs → Legacy Dialogs → Pie → Define

3.2 Organizing Quantitative Data

3.2.1 Grouped frequency tables

In order to construct a grouped frequency distribution, the numerical variable should be classified first. We can use the Recode option in SPSS to perform this classification. Once the variable has been classified into a different variable, a frequency table can be prepared to present the grouped frequency distribution.

SPSS commands for recoding:

Transform → Recode into Different Variables

or

Transform → Visual Binning
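The recoding step can be sketched in Python as follows. The ages are those from Section 1.3; the class width of 10 years is an assumption made for this illustration.

```python
from collections import Counter

# Ages (in years) from Section 1.3
ages = [31, 42, 34, 33, 41, 45, 35, 39, 28, 41]

# Assumed class width for this illustration
WIDTH = 10

def age_group(age, width=WIDTH):
    """Recode a numerical age into a class label such as '30-39'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"

# Recode into a "different variable", then tabulate it
recoded = [age_group(a) for a in ages]
grouped = Counter(recoded)

for group in sorted(grouped):
    print(group, grouped[group])
```

The choice of class width affects how the distribution appears, so it is worth trying more than one.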

3.2.2 Graphical Presentation of Numerical Data

When presenting and analyzing the behavior of a numerical variable, different graphical options such as the histogram, dot plot and box plot can be used.

SPSS commands

Histogram: Graphs → Legacy Dialogs → Histogram

Dot plot: Graphs → Legacy Dialogs → Scatter/Dot → Simple Dot → Define

Box plot: Graphs → Legacy Dialogs → Boxplot → Simple → Define

3.3 Summary measures

SPSS Command

Analyze → Descriptive Statistics → Frequencies → Statistics

Analyze → Descriptive Statistics → Descriptives

Analyze → Descriptive Statistics → Explore

Central Tendency

Mean: The sum of the observations divided by the number of observations:

$$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n}$$

Median: The value that lies in the middle of the data when arranged in ascending order. That is, half the data are below the median and half the data are above the median.

Mode: The most frequent observation of the variable in the data set.

Measures of Dispersion

Range: Difference between the largest data value and the smallest data value.

Sample variance:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

Sample standard deviation:

$$s = \sqrt{s^2}$$

Inter-quartile range: Measures the spread of the data around the median. It is the range of the middle 50% of the data, that is, the difference between the third and first quartiles.

Quartiles: The quartiles of a set of values are the three points that divide the ordered data set into four groups, each containing a quarter of the observations.

Measures of skewness

Skewness is the characteristic that describes the lack of symmetry.

Kurtosis: Degree of peakedness of a distribution, usually taken relative to a normal distribution.
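The summary measures described above can be computed with Python's standard `statistics` module, as in the sketch below, which uses the ages from Section 1.3. Note that several conventions exist for computing quartiles; the default used by `statistics.quantiles` may differ slightly from the values SPSS reports.

```python
import statistics as st

# Ages (in years) from Section 1.3
ages = [31, 42, 34, 33, 41, 45, 35, 39, 28, 41]

# Central tendency
mean_age = st.mean(ages)
median_age = st.median(ages)
mode_age = st.mode(ages)

# Dispersion
data_range = max(ages) - min(ages)
variance = st.variance(ages)   # sample variance (divides by n - 1)
std_dev = st.stdev(ages)       # square root of the sample variance

# Quartiles and inter-quartile range (convention-dependent)
q1, q2, q3 = st.quantiles(ages, n=4)
iqr = q3 - q1

print(f"mean={mean_age:.1f}, median={median_age}, mode={mode_age}")
print(f"range={data_range}, variance={variance:.2f}, sd={std_dev:.2f}")
print(f"Q1={q1}, Q2={q2}, Q3={q3}, IQR={iqr}")
```

The second quartile coincides with the median, which is a useful sanity check on any quartile calculation.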

3.4 Scatter Plot

When you analyze bivariate data it is best to start with a suitable graph. In a quantitative bivariate data set, we have an (x; y) pair for each sampling unit, where x denotes the independent variable and y denotes the dependent variable. Each (x; y) pair can be considered as a point on the Cartesian plane. A scatter plot is a plot of all the (x; y) pairs in the data set.

The purpose of scatter plot is to illustrate diagrammatically any relationship between two quantitative variables.

• If the variables are related, what kind of relationship is it: linear or nonlinear?

• If the relationship is linear, the scatter plot will show whether it is negative or positive.

SPSS Command

Graphs → Legacy Dialogs → Scatter/Dot → Simple Scatter → Define

3.5 Correlation

• The correlation coefficient, r, lies between -1 and +1.

• When r = 1, it signifies a perfect positive linear relationship.

• When r = -1, it signifies a perfect negative linear relationship.

• The further away r is from 0, the stronger the correlation. Figure 6.5 shows some examples.

SPSS Command

Analyze → Correlate → Bivariate
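The properties of r listed above can be checked numerically. The handout does not state the formula, so the sketch below assumes the usual Pearson product-moment correlation coefficient; the data are made up for illustration.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Made-up data, for illustration only
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]      # exactly linear, increasing
y_down = [10, 8, 6, 4, 2]    # exactly linear, decreasing
y_noisy = [2, 5, 4, 8, 9]    # increasing, but not exactly linear

r_up = pearson_r(x, y_up)
r_down = pearson_r(x, y_down)
r_noisy = pearson_r(x, y_noisy)

print(round(r_up, 3), round(r_down, 3), round(r_noisy, 3))
```

Note that r measures only linear association: a strong curved relationship can still give r close to 0, which is why the scatter plot should be examined first.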


4. Fundamentals of Statistical Inference

The need for making educated guesses and drawing conclusions regarding some group of units of interest arises in almost every field. Such a group of interest is called a population.

In research, the population represents a group of units that you wish to generalize your conclusions to.

Even though the decisions have to be made pertaining to the population of interest, often it is impossible or very difficult to collect data from the whole population, due to practical constraints on the available money, time and labour etc., or due to the nature of the population. Therefore, often data are collected from only a subset of the population. Such a subset is called a sample.

The process of making educated guesses and drawing conclusions regarding a population, using a sample from that population, is called statistical inference. Usually this involves collecting suitable data, analyzing the data using suitable statistical techniques, measuring the uncertainty of the results and drawing conclusions.

Statistical inference problems usually involve one or more unknown constants related to the population of interest. Such unknown constants are called parameters. For example, the total of the values of a variable X for the units of a finite population (the population total), the mean of the values of X for the units of a finite population (the population mean), the proportion of units with some specified characteristic (the population proportion) and the mean of some random variable (the expected value) are all parameters. In addition, we come across parameters in various models, such as regression models and probability distributions.

Often statistical inference problems involve estimation of parameters and tests of hypotheses concerning parameters. Estimation can take the form of point estimation and/or interval estimation.

4.1 Point Estimation

Point estimation involves using the sample data to calculate a single number to estimate the parameter of interest. For instance, we might use the sample mean $\bar{x}$ to estimate the population mean $\mu$.

The problem is that two different samples are very likely to result in different sample means, and thus there is some degree of uncertainty involved. A point estimate does not provide any information about the inherent variability of the estimator; we do not know how close $\bar{x}$ is to $\mu$ in any given situation, although $\bar{x}$ is more likely to be near the true population mean if the sample on which it is based is large.

4.2 Interval Estimation

Interval estimation is often preferred. The technique provides a range of reasonable values that is intended to contain the parameter of interest; this range of values is called a confidence interval. In interval estimation we derive an interval so that we can say that the parameter lies within the interval with a given level of confidence.

4.3 Terminology and Notation

4.3.1 Estimate

An approximate value for a parameter, determined using a sample of data, is called a point estimate or, in short, an estimate.

4.3.2 Estimator

We obtain an estimate by substituting the sample of data into a formula. Such a formula is called an estimator. An estimator is a function of the data.

4.3.3 Notation

We usually use Greek letters to denote parameters. For example the population mean, population standard deviation, population proportion are usually denoted by µ, σ and θ respectively.

Example:

Suppose that we are interested in estimating the mean µ and the variance σ² of a population. Let X1, X2, …, X5 be 5 random observations from this population. Let {3, 5, 2, 1, 2} be one observed sample from this population and {4, 1, 3, 2, 1} be another observed sample from this population.

Table 01 illustrates the terms parameter, estimator and estimate.

Table 01 (columns: Parameter; Estimator; Estimate 01, using {3, 5, 2, 1, 2}; Estimate 02, using {4, 1, 3, 2, 1}. Rows: µ; σ²)
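The distinction between an estimator (a formula) and an estimate (the number it gives for one sample) can be sketched as follows. The sketch assumes that the estimators are the sample mean for µ and the sample variance (with divisor n - 1) for σ²; this is an assumption made for illustration.

```python
from statistics import mean, variance

# The two observed samples from the example
sample_1 = [3, 5, 2, 1, 2]
sample_2 = [4, 1, 3, 2, 1]

# Assumed estimators: sample mean for mu, sample variance (n - 1) for sigma^2.
# The same formulas are applied to each sample, giving different estimates.
estimates = {
    "Estimate 01": (mean(sample_1), variance(sample_1)),
    "Estimate 02": (mean(sample_2), variance(sample_2)),
}

for name, (mu_hat, sigma2_hat) in estimates.items():
    print(f"{name}: mean = {mu_hat:.1f}, variance = {sigma2_hat:.1f}")
```

The parameters µ and σ² are fixed but unknown; only the estimates change from sample to sample.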

4.4 Point Estimation of Population Mean

Suppose X is a variable defined on the units of a large population and we are interested in the population mean $\mu$. Suppose we have selected a random sample of n units and we have observed X on those units. Let $x_1, x_2, \dots, x_n$ be the observed values of X. Then $\bar{x} = (x_1 + x_2 + \dots + x_n)/n$ can be used as an approximate value for the population mean. Therefore, we say that $\bar{x}$ is an estimate for $\mu$. It is a point estimate.



In order to estimate the population mean using the sample mean, one of the following options can be used. These were introduced in the previous section.

Analyze → Descriptive Statistics → Frequencies → Statistics

Analyze → Descriptive Statistics → Descriptives

4.5.1 Bound on the error of $\bar{x}$ and confidence intervals

Usually an estimate is not exactly equal to the parameter. The difference between the actual value of the parameter and the estimate is called the ‘error’ of the estimate. Since we do not know the actual value of the parameter, we cannot know the exact error in our estimate.

However, we can place a bound on the error with a known level of confidence. For example, using statistical theory, we may be able to make a statement like ‘we are 95% confident that the error of the estimate is less than 75’. This is equivalent to saying that ‘we are 95% confident that $|\bar{x} - \mu| < 75$’, which in turn is equivalent to saying that ‘we are 95% confident that $\bar{x} - 75 < \mu < \bar{x} + 75$’. This means we are 95% confident that $\mu$ is in the interval $(\bar{x} - 75, \bar{x} + 75)$. Such an interval is called a 95% confidence interval.

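Computing an Appropriate Confidence Interval for a Population Mean

(Flow chart, written out as a sequence of questions.)

Is the population normal?

• Yes: Is the value of σ known?
  - Yes: use $\bar{x} \pm z_{\alpha/2}\,\sigma/\sqrt{n}$
  - No: use $\bar{x} \pm t_{\alpha/2}\,s/\sqrt{n}$

• No: Is n ≥ 30?
  - Yes: Is the value of σ known?
    - Yes: use $\bar{x} \pm z_{\alpha/2}\,\sigma/\sqrt{n}$
    - No: use the sample standard deviation s to estimate σ and use $\bar{x} \pm z_{\alpha/2}\,s/\sqrt{n}$ or, more correctly, use $\bar{x} \pm t_{\alpha/2}\,s/\sqrt{n}$. Since n is large, there is little difference between these intervals.
  - No: use a nonparametric technique, or increase the sample size to at least 30 to develop a confidence interval.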
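The decision logic of the flow chart for choosing a confidence interval for a population mean can be sketched as a small Python function. The function name and the returned descriptions are hypothetical and serve only to illustrate the branching: normality first, then whether σ is known, then the sample size.

```python
def choose_interval(population_normal, sigma_known, n):
    """Return a description of the interval suggested by the flow chart.

    population_normal: True if the population can be assumed normal
    sigma_known: True if the population standard deviation is known
    n: sample size
    """
    if population_normal:
        if sigma_known:
            return "z interval with sigma"
        return "t interval with s"
    # Population not known to be normal: rely on a large sample
    if n >= 30:
        if sigma_known:
            return "z interval with sigma"
        return "z interval with s (or, more correctly, t interval with s)"
    return "nonparametric technique, or increase n to at least 30"

# Illustrative calls
small_normal = choose_interval(population_normal=True, sigma_known=False, n=16)
large_unknown = choose_interval(population_normal=False, sigma_known=False, n=35)
small_unknown = choose_interval(population_normal=False, sigma_known=False, n=15)

print(small_normal)
print(large_unknown)
print(small_unknown)
```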

Small sample from a normal population

Example 1

A researcher wishes to estimate the average number of heart beats per minute for a certain population. In one such study the following data were obtained from 16 individuals.

77, 92, 93, 77, 98, 81, 76, 71, 100, 87, 88, 86, 97, 95, 81, 96

It is known from past research that the number of heart beats per minute among humans is normally distributed. Find a 90% confidence interval for the mean.

SPSS Command for the interval estimation of a population mean

Analyze → Descriptive Statistics → Explore

Note: Use ‘Statistics’ in the ‘Explore’ dialog box to set the confidence level if it needs to be changed. The default confidence level is 95%.

Interpretation:

We are 90% confident that the mean heart beat level for the population is between 82.7019 and 90.4231.

Interpretation

What do we mean by saying that we are 90% confident that the mean heart beat level for the population is between 82.7019 and 90.4231?

………

Example 02

As reported by the US National Center for Health Statistics, the mean serum high density lipoprotein (HDL) cholesterol of females 20-29 years old is μ = 53. Dr. Paul wants to estimate the mean serum HDL cholesterol of his 20-29 year old female patients. He randomly selects 15 of his 20-29 year old patients and obtains the data shown below.

65, 47, 51, 54, 70, 55, 44, 48, 36, 53, 45, 34, 59, 45, 54

a) Use the data to compute a point estimate for the population mean serum HDL cholesterol in patients.

b) Construct a 95% confidence interval for the mean serum HDL cholesterol for the patients. Interpret the result.

Note: In this problem it is not given that the population is normally distributed. Since the sample size is small, we must verify that serum HDL cholesterol is normally distributed. If a population cannot be assumed normal, we must use large-sample or nonparametric techniques. However, if we can assume that the parent population is normal, then small samples can be handled using the t distribution.

Assessing normality

The assumption of normality is a prerequisite for many inferential statistical techniques.

There are a number of different ways to explore this assumption graphically:

• Histogram

• Stem-and-leaf plot

• Boxplot

• Normal probability plot

Furthermore, a number of statistics are available to test normality:

• Kolmogorov-Smirnov statistic, with a Lilliefors significance level, and the Shapiro-Wilk statistic

• Skewness

• Kurtosis

Normal probability plots

1. Select the Analyze menu.

2. Click on Descriptive Statistics and then Explore… to open the Explore dialogue box.

3. Select the variable you require (i.e., HDL) and click on the ► button to move this variable into the Dependent List: box.

4. Click on the Plots… command pushbutton to obtain the Explore: Plots sub-dialogue box.

5. Click on the Normality plots with tests check box, and ensure that the Factor levels together radio button is selected in the Boxplots display.

6. Click on Continue.

7. In the Display box, ensure that Both is activated.

8. Click on the Options… command pushbutton to open the Explore: Options sub-dialogue box.

9. In the Missing Values box, click on the Exclude cases pairwise radio button. If this option is not selected then, by default, any case with missing data on any variable will be excluded from the analysis. That is, plots and statistics will be generated only for cases with complete data.

10. Click on Continue and then OK.

Normal Probability Plot

In a normal probability plot, each observed value is paired with its expected value from the normal distribution. If the sample is from a normal distribution, then the cases fall more or less in a straight line.
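The pairing of observed and expected values described above can be sketched in Python for the HDL data of Example 02. The plotting positions (i - 0.5)/n used here are one common convention and are an assumption of this sketch; SPSS may use a different formula, so the expected values need not match SPSS output exactly.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

# HDL data from Example 02
hdl = [65, 47, 51, 54, 70, 55, 44, 48, 36, 53, 45, 34, 59, 45, 54]

n = len(hdl)
observed = sorted(hdl)

# Normal distribution with the sample mean and standard deviation
fitted = NormalDist(mean(hdl), stdev(hdl))

# Expected value for the i-th smallest observation
# (plotting position (i - 0.5) / n is an assumed convention)
expected = [fitted.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

def correlation(x, y):
    """Pearson correlation, used here as a rough measure of straightness."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

straightness = correlation(observed, expected)

for obs, exp in zip(observed, expected):
    print(f"{obs:>4} {exp:>8.2f}")
print(f"correlation of observed with expected: {straightness:.3f}")
```

A correlation close to 1 corresponds to points lying near a straight line; it is a rough guide and does not replace looking at the plot itself.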

Kolmogorov-Smirnov and Shapiro-Wilk statistics

The Kolmogorov-Smirnov statistic, with a Lilliefors significance level for testing normality, is produced together with the normal probability and detrended probability plots. If the significance level is greater than 0.05, then normality is assumed.

Since the conditions are satisfied, we can proceed with the t-based confidence interval.
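The t-based interval for the HDL data of Example 02 can be computed by hand as in the Python sketch below. Python's standard library has no t distribution, so the critical value 2.145 (14 degrees of freedom, 95% two-sided) is taken from a standard t-table and hard-coded; this is an assumption of the sketch rather than something SPSS requires.

```python
from math import sqrt
from statistics import mean, stdev

# HDL data from Example 02
hdl = [65, 47, 51, 54, 70, 55, 44, 48, 36, 53, 45, 34, 59, 45, 54]

# t-table value for n - 1 = 14 degrees of freedom, 95% confidence (hard-coded)
T_CRIT = 2.145

n = len(hdl)
x_bar = mean(hdl)               # point estimate of the population mean
s = stdev(hdl)                  # sample standard deviation
margin = T_CRIT * s / sqrt(n)   # bound on the error of the estimate

lower, upper = x_bar - margin, x_bar + margin

print(f"point estimate: {x_bar:.2f}")
print(f"95% confidence interval: ({lower:.2f}, {upper:.2f})")
```

The same steps apply to any small sample from a normal population; only the degrees of freedom and the confidence level change the critical value.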

Large sample from a normal distribution (σ unknown)

Example 03

A researcher is interested in obtaining an estimate of the average level of some enzyme in a certain human population. He has taken a sample of 35 individuals and determined the level of the enzyme in each individual. It is known from past research that the level of this enzyme among humans is normally distributed. The following are the values:

20, 11, 32, 25, 6, 23, 19, 24, 15, 31, 19, 23, 21, 27, 17, 20, 23, 23, 22, 13, 15, 28, 27, 18, 11, 32, 23, 28, 14, 23, 21, 25, 19, 29, 17

Construct a 95% confidence interval for the population mean and interpret the result.

Large sample from a non-normal distribution, or when we do not know whether the data are normally distributed (σ unknown)

Example 04 (Pulse data set)

1. Construct a 95% confidence interval for the mean pulse rate of all males.

2. Construct a 95% confidence interval for the mean pulse rate of all females.

3. Compare the preceding results. Can we conclude that the population means for males and females are different? Why or why not?

Note:

We said that if we do not know σ (which is almost always the case) and the sample size n is large (say at least 30), then we can estimate σ by s in the z-based confidence interval

$$\bar{x} \pm z_{\alpha/2}\,\frac{s}{\sqrt{n}}$$

It can be argued, however, that because the t-based confidence interval

$$\bar{x} \pm t_{\alpha/2}\,\frac{s}{\sqrt{n}}$$

is a statistically correct interval that does not require that we know σ, it is best, if we do not know σ, to use this interval for any sample size, even for a large sample. Most common t-tables give t points for degrees of freedom from 1 to 30, so we would need a more complete t-table or a computer software package to use the t-based confidence interval for a sample whose size n exceeds 31. For large samples (n > 30), the traditional “by-hand” approach is to invoke the Central Limit Theorem, to estimate σ using the sample standard deviation (s) and to construct an interval using the normal distribution, but this is just a practical approach from pre-computing days. With software like SPSS, the default presumption is that we don’t know σ, and so the Explore command automatically uses the sample standard deviation and builds an interval using the values of the t distribution rather than the normal. However, because these intervals do not differ by much when n is at least 30, it is reasonable, if n is at least 30, to use the large-sample, z-based interval as an approximation to the t-based interval. In practice, the values of the normal and t distributions become very close when n exceeds 30.
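How little the two intervals differ for a large sample can be seen with the enzyme data of Example 03 (n = 35). In the Python sketch below the z critical value is computed with the standard library, while the t critical value 2.032 (34 degrees of freedom, 95% two-sided) is hard-coded from a t-table; that value is an assumption of the sketch.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

# Enzyme levels from Example 03
enzyme = [20, 11, 32, 25, 6, 23, 19, 24, 15, 31, 19, 23, 21, 27, 17, 20, 23, 23,
          22, 13, 15, 28, 27, 18, 11, 32, 23, 28, 14, 23, 21, 25, 19, 29, 17]

n = len(enzyme)
x_bar = mean(enzyme)
standard_error = stdev(enzyme) / sqrt(n)

Z_CRIT = NormalDist().inv_cdf(0.975)  # about 1.96
T_CRIT = 2.032                        # t-table value for 34 degrees of freedom (hard-coded)

margin_z = Z_CRIT * standard_error
margin_t = T_CRIT * standard_error

print(f"z-based: ({x_bar - margin_z:.2f}, {x_bar + margin_z:.2f})")
print(f"t-based: ({x_bar - margin_t:.2f}, {x_bar + margin_t:.2f})")
```

The t-based interval is always slightly wider, because it allows for the extra uncertainty of estimating σ by s.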


5. Hypothesis testing

5.1 Introduction

• Sometimes, the objective of an investigation is not to estimate a parameter, but instead to decide which of two contradictory statements about the parameter is correct. This is called hypothesis testing.

• Hypothesis testing typically begins with some theory, claim or assertion about a particular parameter or several parameters.

• In any hypothesis testing problem, there are two contradictory hypotheses under consideration: one is called the null hypothesis and the other is called the alternative hypothesis.

• The validity of a hypothesis will be tested by analyzing the sample. The procedure which enables us to decide whether a certain hypothesis is true or not is called a test of hypothesis.

5.2 Terminology and Notation

Hypothesis: A hypothesis is a statement or claim regarding a characteristic of one or more populations.

Test of Hypothesis: The testing of hypothesis is a procedure based on sample evidence and probability, used to test claims regarding a characteristic of one or more populations.

Hypothesis testing is based upon two types of hypotheses.

The null hypothesis, denoted by H₀, is a statement to be tested. The null hypothesis is assumed true until evidence indicates otherwise.

The alternative hypothesis, denoted by H₁, is a claim to be tested. We are trying to find evidence for the alternative hypothesis.

Table 5.1 (columns: Two-Tailed, Left-Tailed, Right-Tailed)

Computation of Test Statistics

A function of the sample observations (i.e., a statistic) whose computed value determines the final decision regarding acceptance or rejection of H₀ is called a test statistic. The appropriate test statistic has to be chosen very carefully, and knowledge of its sampling distribution under H₀ (i.e., when the null hypothesis is true) is essential in framing the decision rule. If the value of the test statistic falls in the critical region, the null hypothesis is rejected.

Types of Errors in Hypothesis Testing - Type I and Type II Errors

As stated earlier, we use sample data to determine whether to reject or not reject the null hypothesis. Because the decision to reject or not reject the null hypothesis is based upon incomplete (i.e., sample) information, there is always the possibility of making an incorrect decision. In fact, there are four possible outcomes from hypothesis testing.

Four Outcomes from Hypothesis Testing

                                    Reality
Conclusion            H₀ is True           H₁ is True
Do not reject H₀      Correct decision     Type II error
Reject H₀             Type I error         Correct decision

Table 5.2

The Level of Significance

The level of significance is the maximum probability of making a Type I error, and it is denoted by α:

α = P(Type I error) = P(rejecting H₀ when H₀ is true)

The probability of making a Type I error is chosen by the researcher before the sample data are collected. Traditionally, 0.01, 0.05 or 0.1 is taken as α.
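The meaning of α can be checked by simulation. The sketch below repeatedly draws samples from a population in which H₀ is true and counts how often a two-tailed Z test rejects it; the population values (mean 50, standard deviation 10, n = 25) are hypothetical and chosen only for illustration.

```python
import random
from statistics import NormalDist

random.seed(1)                                 # fixed seed so the run is repeatable
alpha, mu0, sigma, n = 0.05, 50.0, 10.0, 25    # hypothetical values
z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # two-tailed critical value

trials, rejections = 4000, 0
for _ in range(trials):
    # draw a sample from a population in which H0 (mu = mu0) is true
    sample = [random.gauss(mu0, sigma) for _ in range(n)]
    z = (sum(sample) / n - mu0) / (sigma / n ** 0.5)
    if abs(z) > z_crit:                        # H0 rejected although it is true
        rejections += 1

type1_rate = rejections / trials
print(f"observed Type I error rate: {type1_rate:.3f} (alpha = {alpha})")
```

The observed rejection rate should be close to the chosen α, up to simulation noise.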

Critical Region or Rejection Region

The rejection region or critical region is the region under the sampling distribution of the test statistic (the standard normal curve for a Z test) corresponding to a predetermined level of significance α. The region under the curve which is not covered by the rejection region is known as the Acceptance Region. Thus the set of values of the test statistic which lead to rejection of the null hypothesis H₀ forms the region known as the Rejection Region or Critical Region. The boundary value with which the computed test statistic is compared is known as the Critical Value. The critical value separates the rejection region from the acceptance region.

Two-Tailed    Left-Tailed    Right-Tailed

Table 5.3

Methods for making a conclusion

Method 01: Compare the critical value with the test statistic:

Two-Tailed    Left-Tailed    Right-Tailed

Table 5.4

Method 02: Compare the p-value with the significance level:

Two-Tailed    Left-Tailed    Right-Tailed

Table 5.5

Power

The probability of rejecting a false null hypothesis is called the power of the test. The probability of committing a Type II error is denoted by β.

Power = 1 − β
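As a sketch of how power can be computed, consider a right-tailed Z test of H₀: μ = μ₀ with known σ. When the true mean is μ_true, the Type II error probability is β = Φ(z_α − (μ_true − μ₀)/(σ/√n)). The function name and the numbers below are hypothetical and serve only as an illustration.

```python
from statistics import NormalDist

def power_right_tailed_z(mu0, mu_true, sigma, n, alpha=0.05):
    """Power of the right-tailed Z test of H0: mu = mu0 when the true mean is mu_true."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha)        # critical value
    shift = (mu_true - mu0) / (sigma / n ** 0.5)   # standardized true effect
    beta = std_normal.cdf(z_alpha - shift)         # P(Type II error)
    return 1 - beta

# power grows with the sample size for a fixed true difference
for n in (10, 25, 50):
    print(n, round(power_right_tailed_z(100, 105, 15, n), 3))
```

When the true mean equals μ₀ the function returns α, since rejecting H₀ is then a Type I error.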

5.3 Formulating a hypothesis

It is ideal if a test can be derived such that both errors are minimized simultaneously.

However, it may not be possible with the available data.

Instead, we consider tests for which the probability of one error is controlled. Conventionally, the type I error is controlled.

Usually, out of the two errors, one error is more serious than the other. In such situations it is reasonable to minimize the probability of the more serious error. In order to achieve this, the hypothesis is constructed so that the more serious error will be the type I error.

An alternative way is to take the initially favored claim as the null hypothesis. The initially favored claim will not be rejected in favor of the alternative unless sample evidence contradicts it and provides strong support for the assertion.

If one of the hypotheses is an equality and the other is an inequality, then the equality hypothesis is taken to be the null hypothesis.

5.4 Steps in test of hypothesis

1. Set up the “Null Hypothesis” H₀ and the “Alternative Hypothesis” H₁.

2. State the appropriate “test statistic” and also its sampling distribution when the null hypothesis is true.

3. Select the “level of significance” α of the test, if it is not specified in the given problem.

4. Find the “critical region” of the test at the chosen level of significance.

5. Compute the value of the test statistic on the basis of the sample data, assuming the null hypothesis is true.

6. If the computed value of the test statistic lies in the critical region, “reject H₀”; otherwise “do not reject H₀”.

7. Write the conclusion in plain non-technical language.
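The steps above can be sketched for a two-tailed Z test with known σ. All numbers (μ₀ = 100, σ = 15, n = 36, sample mean 106) are hypothetical and chosen only to make the arithmetic easy to follow; both decision methods are shown.

```python
from statistics import NormalDist

# Step 1: H0: mu = 100  vs  H1: mu != 100 (two-tailed); hypothetical values
mu0, sigma, n, xbar = 100.0, 15.0, 36, 106.0
# Step 2: test statistic Z = (xbar - mu0) / (sigma / sqrt(n)), N(0, 1) under H0
# Step 3: level of significance
alpha = 0.05
# Step 4: critical region |Z| > z_{alpha/2}
z_crit = NormalDist().inv_cdf(1 - alpha / 2)
# Step 5: computed value of the test statistic
z = (xbar - mu0) / (sigma / n ** 0.5)
# Step 6 (Method 01): compare the statistic with the critical value
reject_by_critical_value = abs(z) > z_crit
# Step 6 (Method 02): compare the two-tailed p-value with alpha
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
reject_by_p_value = p_value < alpha
# Step 7: plain-language conclusion
print(f"Z = {z:.2f}, critical value = {z_crit:.2f}, p-value = {p_value:.4f}")
print("Reject H0" if reject_by_p_value else "Do not reject H0")
```

Both methods always lead to the same decision, because the p-value is below α exactly when the statistic falls in the critical region.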

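5.5 One-Sample Hypothesis Tests about Population Mean

Selecting an Appropriate Test Statistic to Test a Hypothesis about a Population Mean

Is n ≥ 30?

   Yes: Is σ known?

      Yes: Use $Z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$

      No: Use the sample standard deviation s to estimate σ and use $Z = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$ or, more correctly, use $t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$ with n − 1 degrees of freedom. Since n is large, there is little difference between these tests.

   No: Is the population Normal?

      Yes: Is σ known?

         Yes: Use $Z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$

         No: Use $t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$ with n − 1 degrees of freedom

      No: Use a nonparametric technique, or increase the sample size to at least 30 to conduct a parametric hypothesis test.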
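One possible reading of the selection flowchart above can be written as a small function. The function name, its arguments and the returned labels are hypothetical, and the branch on whether σ is known is an assumption about the original diagram.

```python
def choose_test_statistic(n, sigma_known, population_normal):
    """Suggest a one-sample test for a population mean, following the flowchart."""
    if n >= 30:
        # large sample: the Central Limit Theorem applies
        return "Z with sigma" if sigma_known else "t (or Z with s as an approximation)"
    if population_normal:
        # small sample from a normal population
        return "Z with sigma" if sigma_known else "t"
    # small sample, population not normal
    return "nonparametric test (or increase n to at least 30)"

print(choose_test_statistic(14, sigma_known=False, population_normal=True))
print(choose_test_statistic(35, sigma_known=False, population_normal=False))
```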

5.5.1 A small sample two sided hypothesis

Example 5.1 File: ph.sav

An engineer wants to measure the bias in a pH meter. She uses the meter to measure the pH in 14 neutral substances (pH = 7) and obtains the data below.

7.01 7.04 6.97 7.00 6.99 6.97 7.04

7.04 7.01 7.00 6.99 7.04 7.07 6.97

Is there sufficient evidence to support the claim that the pH meter is not correctly calibrated at the α = 0.05 level of significance?

Approach:

In this case, we have only fourteen observations, so we cannot rely on the Central Limit Theorem. With a small sample, we should use the t test only if we can reasonably assume that the parent population is normally distributed. Because the sample size here is small, we must verify that pH is normally distributed before proceeding with the test.

Hypothesis to be tested

H0: Data are normally distributed.

H1: Data are not normally distributed.

According to the Kolmogorov-Smirnov test, the p-value is 0.2 > 0.05. Hence we do not reject H₀ at the 0.05 level of significance. The data are consistent with a normal distribution.

Since the conditions are satisfied we can proceed with the t test.

Hypothesis to be tested:

……….

To conduct a one-sample t-test

1. Select the Analyze menu.

2. Click on Compare Means and then One-Sample T Test… to open the One-Sample T Test dialogue box.

3. Select the variable you require (i.e. pH) and click on the ► button to move the variable into the Test Variable(s): box.

4. In the Test Value: box type the hypothesized mean (i.e. 7).

5. Click on OK.

(SPSS output labels: calculated value of the test statistic; p-value)

Note: In SPSS a Column labeled Sig. (usually two tailed Sig.) displays the p-value of a particular Hypothesis test.

Decision:………..

Conclusion:………..

………..
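For readers without SPSS, the test statistic for Example 5.1 can be reproduced with a short script. This is only a sketch: the critical value t(0.025, 13) = 2.160 is copied from a standard t table (an assumption), and no p-value is computed because the Python standard library has no t distribution.

```python
from statistics import mean, stdev

ph = [7.01, 7.04, 6.97, 7.00, 6.99, 6.97, 7.04,
      7.04, 7.01, 7.00, 6.99, 7.04, 7.07, 6.97]
mu0 = 7.0                                  # hypothesized mean (neutral pH)
n = len(ph)
# one-sample t statistic: (sample mean - mu0) / (s / sqrt(n))
t_stat = (mean(ph) - mu0) / (stdev(ph) / n ** 0.5)
t_crit = 2.160                             # t(0.025, 13) from a t table (assumed)
print(f"n = {n}, mean = {mean(ph):.4f}, s = {stdev(ph):.4f}, t = {t_stat:.3f}")
print("|t| exceeds the critical value:", abs(t_stat) > t_crit)
```

The printed t value can be compared with the value shown in the SPSS One-Sample Test table.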

Note:

5.5.2 Performing One-tail Tests using One-Sample T Test Procedure

The One-Sample T Test procedure in SPSS is designed to test two-tailed hypotheses. However, a researcher may need to test a one-tailed (left-tail or right-tail) hypothesis. In this situation the p-value for the corresponding test has to be computed from the reported Sig. value using the following criteria.

1. For left-tail tests (i.e. H₁: μ < μ₀)

If the sample mean is less than μ₀ (i.e. t < 0) then p-value = Sig/2. Otherwise, p-value = 1 − Sig/2.

2. For right-tail tests (i.e. H₁: μ > μ₀)

If the sample mean is greater than μ₀ (i.e. t > 0) then p-value = Sig/2. Otherwise, p-value = 1 − Sig/2.
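The conversion rule above can be written as a small helper. The function name and argument names are hypothetical; `sig_two_tailed` stands for the two-tailed Sig. value read from the SPSS output and `t` for the reported t statistic.

```python
def one_tailed_p(sig_two_tailed, t, tail):
    """Convert a two-tailed Sig. value into a one-tailed p-value.

    tail is "left" for H1: mu < mu0 or "right" for H1: mu > mu0.
    """
    if tail not in ("left", "right"):
        raise ValueError("tail must be 'left' or 'right'")
    # does the sign of t agree with the direction of the alternative?
    in_direction_of_h1 = (t < 0) if tail == "left" else (t > 0)
    return sig_two_tailed / 2 if in_direction_of_h1 else 1 - sig_two_tailed / 2

print(one_tailed_p(0.04, 2.1, "right"))   # statistic agrees with H1
print(one_tailed_p(0.04, 2.1, "left"))    # statistic contradicts H1
```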

Example 5.2

In a study conducted by the U.S. Department of Agriculture, it was found that the mean daily caffeine intake of 20-29 year old females in 2010 was 142.8 milligrams. A nutritionist claims that the mean daily caffeine intake has increased since then. She obtains a simple random sample of 35 females between 20 and 29 years of age and determines their daily caffeine intakes. The results are presented in caffine.sav. Test the nutritionist’s claim at the α = 0.05 level of significance.

Approach: The dataset represents a large sample (n=35), so we can rely on the Central Limit Theorem to assert that the sampling distribution is approximately normal.

Hypothesis:……….

P-value:………

Decision:………..

Conclusion:………

………

Non-Parametric Binomial Test for the One-Sample Test procedure

The Binomial Test procedure compares an observed proportion of cases to the proportion expected under a binomial distribution with a specified probability parameter. The observed proportion is defined either by the number of cases having the first value of a dichotomous variable (a variable that has two possible values) or by the number of cases at or below a given cut point on a scale (quantitative) variable.

Hypothesis (to be tested on a quantitative variable)

H₀: median = m₀ vs. H₁: median ≠ m₀

SPSS command

Analyze → Nonparametric → Binomial Test

Note: Set the cut point to the hypothesized median value.
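The idea behind this test can be sketched as an exact two-sided binomial test with success probability 0.5, where a "success" is an observation at or below the cut point. The function name and the data are hypothetical, and SPSS may compute its reported p-value slightly differently (for example with a normal approximation for large samples).

```python
from math import comb

def binomial_median_test(values, m0):
    """Two-sided exact binomial test of H0: median = m0 (success probability 0.5)."""
    below = sum(1 for v in values if v <= m0)   # cases at or below the cut point
    n = len(values)
    # probability of k successes out of n when p = 0.5
    pmf = [comb(n, k) / 2 ** n for k in range(n + 1)]
    lower_tail = sum(pmf[: below + 1])          # P(X <= below)
    upper_tail = sum(pmf[below:])               # P(X >= below)
    return below, min(1.0, 2 * min(lower_tail, upper_tail))

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]          # hypothetical observations
print(binomial_median_test(data, 8.5))
```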

6. Inferences on Two Samples

In the preceding chapter, we used a statistical test of hypothesis to compare the unknown mean or proportion of a single population to some fixed known value. In practical applications, however, it is far more common to compare the means of two different populations, where both parameters are unknown.

In order to perform inference on the difference of two population means, we must first determine whether the data come from independent or dependent samples.

• Samples are independent when the individuals selected for one sample do not dictate which individuals are to be in the second sample.

• Samples are dependent when the individuals selected to be in one sample are used to determine the individuals to be in the second sample.

6.1 Testing hypotheses concerning two population means μ₁ and μ₂: Dependent Samples

Let (x₁, y₁), (x₂, y₂), (x₃, y₃), …, (xₙ, yₙ) be a random sample of paired observations. Suppose that the x’s are identically distributed with population mean μ₁ and population variance σ₁². Also suppose that the y’s are identically distributed with population mean μ₂ and population variance σ₂².

Let μ_d be a known constant. Consider the following hypotheses:

Two-Tailed             Left-Tailed            Right-Tailed
H₀: μ₁ − μ₂ = μ_d      H₀: μ₁ − μ₂ ≥ μ_d      H₀: μ₁ − μ₂ ≤ μ_d
H₁: μ₁ − μ₂ ≠ μ_d      H₁: μ₁ − μ₂ < μ_d      H₁: μ₁ − μ₂ > μ_d

Rather than consider the two sets of observations to be distinct samples, we focus on the difference in measurements within each pair. Suppose that our two groups of observations are as follows:

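Sample 01    Sample 02    Differences within each pair
x₁₁          x₁₂          d₁ = x₁₁ − x₁₂
x₂₁          x₂₂          d₂ = x₂₁ − x₂₂
x₃₁          x₃₂          d₃ = x₃₁ − x₃₂
…            …            …
xₙ₁          xₙ₂          dₙ = xₙ₁ − xₙ₂

The sample mean and sample variance of the differences are

$\bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i$

$s_d^2 = \frac{1}{n-1}\sum_{i=1}^{n} (d_i - \bar{d})^2$

If the differences are normally distributed or the sample size n is large, the test statistic is

$U = \frac{\bar{d} - \mu_d}{s_d/\sqrt{n}}$

Compare the critical value with the test statistic, using the guideline below.

Two-Tailed: If $U < -t_{\alpha/2,\,n-1}$ or $U > t_{\alpha/2,\,n-1}$, reject the null hypothesis.

Left-Tailed: If $U < -t_{\alpha,\,n-1}$, reject the null hypothesis.

Right-Tailed: If $U > t_{\alpha,\,n-1}$, reject the null hypothesis.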

6.1.2 Confidence Interval for Matched-Pairs Data

We can also create a confidence interval for the population mean difference μ₁ − μ₂, using the sample mean difference $\bar{d}$, the sample standard deviation of the differences $s_d$, the sample size n and $t_{\alpha/2,\,n-1}$. Remember, the format for a confidence interval about a population mean is of the following form:

Point estimate ± Margin of error

Based on the preceding formula, the (1 − α)100% confidence interval for μ₁ − μ₂ is given by

$\bar{d} \pm t_{\alpha/2,\,n-1}\,\frac{s_d}{\sqrt{n}}$

SPSS Command

Command for Paired-Samples T test

Analyze → Compare Means → Paired-Samples T Test

Example 6.1

A dietitian hopes to reduce a person’s cholesterol level by using a special diet supplemented with a combination of vitamin pills. Six (6) subjects were pre-tested and then placed on the diet for two weeks. Their cholesterol levels were checked after the two-week period. The results are shown below. Cholesterol levels are measured in milligrams per deciliter.

2.1 Test the claim that the Cholesterol level before the special diet is greater than the Cholesterol level after the special diet at α = 0.01 level of significance.

2.2 Construct 99% confidence interval for the difference in mean cholesterol levels.

Assume that the cholesterol levels are normally distributed both before and after.

Subject 1 2 3 4 5 6

Before 210 235 208 190 172 244

After 190 170 210 188 173 228
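The paired-sample calculations for Example 6.1 can be sketched as follows. The differences are taken as Before − After, and the two critical values are copied from a standard t table with 5 degrees of freedom (assumed values, since the Python standard library has no t distribution).

```python
from statistics import mean, stdev

before = [210, 235, 208, 190, 172, 244]
after = [190, 170, 210, 188, 173, 228]
d = [b - a for b, a in zip(before, after)]   # within-pair differences
n = len(d)
d_bar, s_d = mean(d), stdev(d)
se = s_d / n ** 0.5                          # standard error of the mean difference
t_stat = (d_bar - 0) / se                    # test statistic with mu_d = 0

t_crit_test = 3.365   # t(0.01, 5): right-tailed test at alpha = 0.01 (table value)
t_crit_ci = 4.032     # t(0.005, 5): 99% confidence interval (table value)
ci = (d_bar - t_crit_ci * se, d_bar + t_crit_ci * se)
print(f"d_bar = {d_bar:.2f}, s_d = {s_d:.2f}, t = {t_stat:.3f}")
print(f"t exceeds t(0.01, 5): {t_stat > t_crit_test}")
print(f"99% CI: ({ci[0]:.2f}, {ci[1]:.2f})")
```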

Example 6.2

A physician is evaluating a new diet for patients with a family history of heart disease. To test the effectiveness of this diet, 16 patients are placed on the diet for 6 months. Their weights are measured before and after the study, and the physician wants to know whether the weights have changed. Test whether there is a statistically significant difference between the pre-diet and post-diet weights of these patients. Use the 5% level of significance.

Step 01: Calculating differences

Transform → Compute Variable

Step 02:

Because the sample size is small, we must verify that the differences are normally distributed.

Note: Use ‘Plots…’ in the ‘Explore’ command and set ‘Normality plots with tests’.

Step 03:

Command for Paired-Samples T test

Analyze → Compare Means → Paired-Samples T Test

6.4 Performing One-tail Tests using Paired-Samples T Test procedure

The Paired-Samples T Test procedure in SPSS is designed to test two-tailed hypotheses.

However, a researcher may need to test a one-tailed (left-tail or right-tail) hypothesis. In this situation the p-value for the corresponding test has to be computed using the following criteria.

1. For left-tail tests (i.e. H₁: μ₁ − μ₂ < 0)

If the sample mean of differences is less than 0 (i.e. t < 0) then p-value = Sig/2.

Otherwise, p-value = 1 − Sig/2.

2. For right-tail tests (i.e. H₁: μ₁ − μ₂ > 0)

If the sample mean of differences is greater than 0 (i.e. t > 0) then p-value = Sig/2.

Otherwise, p-value = 1 − Sig/2.

Example: If a researcher wants to find out whether post-diet weights have significantly increased, determine the p-value and state your findings at the 5% level of significance.

6.5 Nonparametric Wilcoxon Test for Two Related Samples

Hypothesis

H₀: median of the differences = 0 vs. H₁: median of the differences ≠ 0

SPSS command

Analyze → Nonparametric → 2 Related Samples

Note: Ensure that ‘Wilcoxon’ is checked in the ‘Test Type’ dialog box.

6.6 Testing hypotheses concerning two population means μ1 and μ2: Independent samples

Let x₁, x₂, x₃, …, x_m be a random sample of observations from a certain population with population mean μ₁ and population variance σ₁². Also let y₁, y₂, …, yₙ be a random sample of observations from a certain population with population mean μ₂ and population variance σ₂². Further suppose that the two samples are independent.

Let μ_d be a known constant. Consider the following hypotheses:

Two-Tailed             Left-Tailed            Right-Tailed
H₀: μ₁ − μ₂ = μ_d      H₀: μ₁ − μ₂ ≥ μ_d      H₀: μ₁ − μ₂ ≤ μ_d
H₁: μ₁ − μ₂ ≠ μ_d      H₁: μ₁ − μ₂ < μ_d      H₁: μ₁ − μ₂ > μ_d

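Case 01: Data from normal distributions, both variances are known

The test statistic is

$U = \frac{(\bar{x} - \bar{y}) - \mu_d}{\sqrt{\sigma_1^2/m + \sigma_2^2/n}}$

where $\bar{x}$ and $\bar{y}$ are the two sample means.

Two-Tailed: If $U < -z_{\alpha/2}$ or $U > z_{\alpha/2}$, reject the null hypothesis.

Left-Tailed: If $U < -z_{\alpha}$, reject the null hypothesis.

Right-Tailed: If $U > z_{\alpha}$, reject the null hypothesis.

Case 02: Data from two normal distributions with unequal variances ($\sigma_1^2 \neq \sigma_2^2$), both variances are unknown, m and n are small

The test statistic is

$U = \frac{(\bar{x} - \bar{y}) - \mu_d}{\sqrt{s_1^2/m + s_2^2/n}}$

where $s_1^2$ and $s_2^2$ are the two sample variances.

Two-Tailed: If $U_{cal} < -t_{\alpha/2,\,\nu}$ or $U_{cal} > t_{\alpha/2,\,\nu}$, reject the null hypothesis.

Left-Tailed: If $U_{cal} < -t_{\alpha,\,\nu}$, reject the null hypothesis.

Right-Tailed: If $U_{cal} > t_{\alpha,\,\nu}$, reject the null hypothesis.

Where

$\nu = \frac{\left(s_1^2/m + s_2^2/n\right)^2}{\frac{(s_1^2/m)^2}{m-1} + \frac{(s_2^2/n)^2}{n-1}}$

(1-α)100% Confidence Interval about the Difference of Two Means

$(\bar{x} - \bar{y}) \pm t_{\alpha/2,\,\nu}\sqrt{s_1^2/m + s_2^2/n}$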

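Case 03: Data normal, both variances are unknown, but known that they are equal.

$\sigma_1^2 = \sigma_2^2 = \sigma^2$

Also let

$s_p^2 = \frac{(m-1)s_1^2 + (n-1)s_2^2}{m+n-2}$

where $s_1^2$ and $s_2^2$ are the two sample variances.

The test statistic is

$U = \frac{(\bar{x} - \bar{y}) - \mu_d}{s_p\sqrt{1/m + 1/n}}$

Two-Tailed: If $U_{cal} < -t_{\alpha/2,\,m+n-2}$ or $U_{cal} > t_{\alpha/2,\,m+n-2}$, reject the null hypothesis.

Left-Tailed: If $U_{cal} < -t_{\alpha,\,m+n-2}$, reject the null hypothesis.

Right-Tailed: If $U_{cal} > t_{\alpha,\,m+n-2}$, reject the null hypothesis.

(1-α)100% Confidence Interval about the Difference of Two Means

$(\bar{x} - \bar{y}) \pm t_{\alpha/2,\,m+n-2}\; s_p\sqrt{1/m + 1/n}$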

SPSS Command for the Independent-Samples T test

Analyze → Compare Means → Independent-Samples T Test

Note: On ‘Define Groups’ option, apply relevant codes of the groups to be compared.

6.6.1 Performing One-tail Tests using Independent-Samples T Test procedure

The Independent-Samples T Test procedure in SPSS is designed to test two-tailed hypotheses.

However, a researcher may need to test a one-tailed (left-tail or right-tail) hypothesis. In this situation the p-value for the corresponding test has to be computed using the following criteria.

1. For left-tail tests (i.e. H₁: μ₁ < μ₂)

If the difference between the sample means is less than 0 (i.e. t < 0) then p-value = Sig/2. Otherwise, p-value = 1 − Sig/2.

2. For right-tail tests (i.e. H₁: μ₁ > μ₂)

If the difference between the sample means is greater than 0 (i.e. t > 0) then p-value = Sig/2. Otherwise, p-value = 1 − Sig/2.

6.7 The Nonparametric Mann – Whitney U Test for Two Independent Samples

What should you do if the t test assumptions are markedly violated (e.g., if the response variable is not normal)? One answer is to run the appropriate nonparametric test, which in this case is called the Mann-Whitney (M-W) U test.

Hypothesis

H₀: median₁ = median₂ vs. H₁: median₁ ≠ median₂

SPSS command

Analyze → Nonparametric → 2 Independent Samples

Note: Ensure that ‘Mann-Whitney U test’ is checked.

On the ‘Define Groups’ option, apply the relevant codes of the groups to be compared.

Example 6.3:

The purpose of a study by Eidelman et al. was to investigate the nature of lung destruction in cigarette smokers before the development of marked emphysema. Three lung destructive index measurements were made on the lungs of lifelong nonsmokers and smokers who died suddenly outside the hospital of nonrespiratory causes. A large score indicates greater lung damage. For one of the indexes the scores yielded by the lungs of a sample of nine nonsmokers and a sample of 16 smokers are shown in Table 02. We wish to know if we may conclude, on the basis of these data, that smokers, in general, have greater lung damage as measured by this destructive index than do nonsmokers.

Nonsmokers: 18.1 6.0 10.8 11.0 7.7 17.9 8.5 13.0 18.9

Smokers: 16.6 13.9 11.3 26.5 17.4 15.3 15.8 12.3 18.6 12.0 24.1 16.5 21.8 16.3 23.4 18.8
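The Mann-Whitney U statistic for these data can be computed directly by counting pairs. The sketch below also gives a one-sided p-value from the large-sample normal approximation without continuity correction. With samples this small that p-value is only approximate, and SPSS may report an exact value instead.

```python
from statistics import NormalDist

nonsmokers = [18.1, 6.0, 10.8, 11.0, 7.7, 17.9, 8.5, 13.0, 18.9]
smokers = [16.6, 13.9, 11.3, 26.5, 17.4, 15.3, 15.8, 12.3, 18.6,
           12.0, 24.1, 16.5, 21.8, 16.3, 23.4, 18.8]

def u_statistic(x, y):
    """Number of (x_i, y_j) pairs with x_i > y_j, counting ties as 1/2."""
    return sum((xi > yj) + 0.5 * (xi == yj) for xi in x for yj in y)

m, n = len(nonsmokers), len(smokers)
u_ns = u_statistic(nonsmokers, smokers)
u_s = u_statistic(smokers, nonsmokers)
# large-sample normal approximation (no tie or continuity correction)
mean_u = m * n / 2
sd_u = (m * n * (m + n + 1) / 12) ** 0.5
z = (u_ns - mean_u) / sd_u
p_left = NormalDist().cdf(z)     # H1: nonsmokers tend to have lower scores
print(f"U(nonsmokers) = {u_ns}, U(smokers) = {u_s}, z = {z:.3f}, p = {p_left:.4f}")
```

The two U values always add up to m × n, which is a quick check on the counting.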

Example 6.4:

Researchers wished to know if they could conclude that two populations of infants differ with respect to mean age at which they walked alone. The following data (age in months) were collected:

Sample from population A: 9.5, 10.5, 9.0, 9.75, 10.0, 13.0, 10.0, 13.5, 10.0, 9.5, 10.0, 9.75

Sample from population B: 12.5, 9.5, 13.5, 13.75, 12.0, 13.75, 12.5, 9.5, 12.0, 13.5, 12.0, 12.0
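As a sketch, the test statistics of Case 02 (unequal variances) and Case 03 (pooled variance) can be computed for the Example 6.4 data with μ_d = 0. The script does not decide which case applies and does not compute p-values; the resulting t values and degrees of freedom can be compared with the SPSS Independent-Samples T Test output.

```python
from statistics import mean, variance

a = [9.5, 10.5, 9.0, 9.75, 10.0, 13.0, 10.0, 13.5, 10.0, 9.5, 10.0, 9.75]
b = [12.5, 9.5, 13.5, 13.75, 12.0, 13.75, 12.5, 9.5, 12.0, 13.5, 12.0, 12.0]
m, n = len(a), len(b)
va, vb = variance(a), variance(b)            # sample variances s1^2, s2^2
diff = mean(a) - mean(b)

# Case 03: pooled-variance t statistic with m + n - 2 degrees of freedom
sp2 = ((m - 1) * va + (n - 1) * vb) / (m + n - 2)
t_pooled = diff / (sp2 * (1 / m + 1 / n)) ** 0.5

# Case 02: unequal-variance statistic with Welch-Satterthwaite degrees of freedom
se2 = va / m + vb / n
t_welch = diff / se2 ** 0.5
nu = se2 ** 2 / ((va / m) ** 2 / (m - 1) + (vb / n) ** 2 / (n - 1))

print(f"pooled: t = {t_pooled:.3f}, df = {m + n - 2}")
print(f"Welch:  t = {t_welch:.3f}, df = {nu:.2f}")
```

Because the two samples have the same size, the two statistics coincide here; only the degrees of freedom differ.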

7. Comparison of Multiple Groups

In the preceding chapter, we covered techniques for determining whether a difference exists between the means of two independent populations. It is not unusual, however, to encounter situations in which we wish to test for differences among three or more independent means rather than just two. The extension of the two-sample t test to three or more samples is known as the Analysis of Variance, or ANOVA for short.

Definition:

Analysis of Variance (ANOVA) is an inferential method that is used to test the equality of three or more population means.

7.1 One- Way Analysis of Variance

It is the simplest type of analysis of variance. The one-way analysis of variance is a form of design and subsequent analysis utilized when the data can be classified into k categories or levels of a single factor, and the equality of the k class means in the population is to be investigated.

For example, five fertilizers are applied to four plots each of wheat and the yield of wheat on each plot is given. We may be interested in finding out whether the effects of these fertilizers on the yield are significantly different or, in other words, whether the samples have come from the same normal population. The answer to this problem is provided by the technique of analysis of variance. The basic purpose of the analysis of variance is to test the homogeneity of several means.

In order to perform ANOVA test, certain requirements must be satisfied.

7.2 Requirements of ANOVA Test

1. Independent random samples have been taken from each population.

2. The populations are normally distributed.

3. The population variances are all equal.

7.3 The Hypothesis test of Analysis of Variance

H₀: μ₁ = μ₂ = … = μ_k

H₁: At least one of the population means differs from the others

7.4 Decomposition of Total Sum of Squares

The name analysis of variance is derived from a partitioning of total variability into its component parts. Let $y_{ij}$ be the j-th observation at the i-th factor level. The data collected under the factor levels can be represented as follows.

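Group (Factor Level / Treatment)    1      2      3      …    k
Number of observations              n₁     n₂     n₃     …    n_k
Mean                                ȳ₁     ȳ₂     ȳ₃     …    ȳ_k
Variance                            s₁²    s₂²    s₃²    …    s_k²

Grand mean:

$\bar{y} = \frac{1}{n}\sum_{i=1}^{k}\sum_{j=1}^{n_i} y_{ij} = \frac{1}{n}\sum_{i=1}^{k} n_i\,\bar{y}_i$, where $n = n_1 + n_2 + \dots + n_k$

The total variation present in the data is measured by the sum of squares of the deviations of all observations from the grand mean. Thus

Total Sum of Squares (SSTo) = $\sum_{i=1}^{k}\sum_{j=1}^{n_i} (y_{ij} - \bar{y})^2$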

The total variation in the observation can be split into the following two components.

1. The variation between the classes or the variation due to different bases of classification, commonly known as treatments.

2. The variation within the classes, i.e, the inherent variation of the random variable within the observation of a class. This variation is due to chance causes which are beyond the control of human hand.

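The sum of squares due to differences in the treatment means is called the treatment sum of squares or between sum of squares and is given by the expression

Sum of squares of the differences between treatments, or Treatment Sum of Squares (SSTr) = $\sum_{i=1}^{k} n_i\,(\bar{y}_i - \bar{y})^2$

where $\bar{y}_i$ is the mean of the i-th group and $\bar{y}$ is the grand mean.

The sum of squares due to inherent variability in the experimental material is called the sum of squares of the differences within the treatments.

Sum of squares of differences within the treatments (SSE) = $\sum_{i=1}^{k}\sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2$

It can be shown that

$\sum_{i=1}^{k}\sum_{j=1}^{n_i} (y_{ij} - \bar{y})^2 = \sum_{i=1}^{k} n_i\,(\bar{y}_i - \bar{y})^2 + \sum_{i=1}^{k}\sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2$

Total sum of squares (SSTo) = Sum of squares between treatments (SSTr) + Sum of squares within treatments (SSE)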

7.5 The Mean Squares

In finding the average squared deviations due to treatment and to error, we divide each sum of squares by its degrees of freedom. We call the two resulting averages mean square treatment (MSTr) and mean square error (MSE), respectively.

The number of degrees of freedom associated with SSTr = k − 1

$\mathrm{MSTr} = \frac{\mathrm{SSTr}}{k-1}$

The number of degrees of freedom associated with SSE = n − k

$\mathrm{MSE} = \frac{\mathrm{SSE}}{n-k}$
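The decomposition and the mean squares can be checked numerically. The data below are hypothetical yields for k = 3 treatments, invented only to illustrate the arithmetic.

```python
from statistics import mean

# hypothetical yields for k = 3 treatments (illustrative numbers only)
groups = [[20, 21, 23, 24], [28, 27, 30, 31], [25, 26, 24, 29]]
k = len(groups)
n = sum(len(g) for g in groups)
grand_mean = mean(y for g in groups for y in g)

ss_total = sum((y - grand_mean) ** 2 for g in groups for y in g)      # SSTo
ss_treat = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)  # SSTr
ss_error = sum((y - mean(g)) ** 2 for g in groups for y in g)         # SSE

ms_treat = ss_treat / (k - 1)   # MSTr
ms_error = ss_error / (n - k)   # MSE
print(f"SSTo = {ss_total:.2f}, SSTr = {ss_treat:.2f}, SSE = {ss_error:.2f}")
print(f"MSTr = {ms_treat:.2f}, MSE = {ms_error:.2f}")
```

Up to rounding error, the printed SSTo equals SSTr + SSE, as the identity states.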

The Expected Values of the Statistics MSTr and MSE under the null hypothesis

E(MSE) = ……….(1)