SPSS Vs 23 Tutorial, 2019. Dr. Rosa Padilla.doc

SPSS VS 23 TUTORIAL

Overview of SPSS

SPSS stands for “Statistical Package for the Social Sciences”. The current traditional version of SPSS is 25, and the next version is already in the works (it is called the anniversary version, marking 50 years of SPSS).

IBM SPSS Statistics is a tool for managing your statistical data and research. SPSS is a multipurpose, graphic and statistical data storage system.

SPSS for Windows has 3 windows:

• Data Editor, which has two views, selected by tabs at the bottom of the window

• Viewer (or Draft Viewer), which displays the output files

• Syntax Editor, which displays syntax files

The Data Editor has two parts:

1. Variable View window, which displays metadata or information about the data in the active file, such as variable names and labels, value labels, formats, and missing value indicators.

2. Data View window, which displays data from the active file in spreadsheet format which holds the data in a rectangular format with cases as rows and variables as columns. Data can be directly entered or imported from another program using menu commands. (Cut-and-paste is possible, but not advised.) Errors in data entry can also be directly corrected here.

Beginning an SPSS Session

Begin by opening SPSS 23 for Windows.

1. Click on the IBM SPSS shortcut button on your desktop. OR

2. Go to START, click on PROGRAMS, and click on IBM SPSS.

Variable View is used to define the type of information that is entered into each column in Data View. In Data View, rows are cases: each row represents a different case. A case is a set of observations about one person, one country, one object, one experiment, etc.

Menus and Toolbars

Various pull-down menus appear at the top of the Data Editor window. These pull-down menus are at the heart of using SPSS. The Data Editor Menu items (with some of the uses of the menu) are:

FILE: Standard options for opening, saving, printing and exiting.

EDIT: Used to copy and paste data values; used to find data in a file; insert variables and cases.

VIEW: Options for showing/hiding toolbars, displaying values or their labels in Data Editor.

DATA: Identify duplicate cases, merge files, split file, select cases, weight cases, etc.

TRANSFORM: Compute new variables, recode variables.

ANALYZE: This menu provides access to the statistical procedures for analyzing your data set. All the items on the Analyze menu have submenus.

DIRECT MARKETING: Allows you to perform advanced analysis of clients or contacts to improve your marketing campaigns and maximize the ROI of your marketing budget.

GRAPHS: Provides options to create high quality plots and charts.

UTILITIES: Used to display information on individual variables, add comments to accompany the data file, and other advanced features.

ADD-ONS: The SPSS extension packages are additional features (advanced statistical procedures) that you can add to SPSS.

WINDOW: Provides options for switching between data, syntax and navigator windows.

HELP: Contains the SPSS help system. For example, Help | Case Studies provides hands-on examples of how to create various types of statistical analyses and how to interpret the results.

Data Entry into SPSS

SPSS runs on Windows and Mac operating systems, but the focus of these notes is Windows.

Data entry (the workspace and labels)

A. Enter the data manually by typing directly into Data View

B. Enter the data into other software such as Excel, then import it into SPSS

A. First option: entering data manually by typing directly into Data View

Typing in data

Look at the tabs at the bottom of the window labeled “Data View” and “Variable View”, and click on “Variable View”. In “Data View” you can enter and edit data for all of your cases, while in “Variable View” you can view, enter, and edit information about the variables themselves (see below).

“Name” column: click on row 1 and write the name of the variable you want to enter. The name is your own choice, but make it understandable and do not use a number or symbol as the first character, since SPSS will not accept it. Moreover, you cannot use spaces in the name. For example: “education_level”.

“Type” column: indicates the variable type. The most common types are Numeric (only accepts numerical data, for example age or number of children) and String (also accepts letters, e.g. for qualitative questions). Typically, all responses in a questionnaire are transformed into numbers. For example: “Female = 1 and Male = 2”.

“Width” column: normally leave this unchanged. It corresponds to the number of characters that can be typed in the data cell. The default for numeric and string variables is 8, which only needs to be altered if you want to type in long strings of numbers or whole sentences.

“Decimals” column: write the number of decimals that the variable takes (for example, the variable Age has zero decimals).

“Label” column: for many variables this means entering a label, which is a human-readable alternate name for the variable. The labels replace the variable names on much of the output, but the names are still used for specifying variables for analyses. The variable label is an explanation of what the variable is, e.g. if the variable name is sex then the label might be “gender of the respondent”.

“Values” column: click on the “...” button to open the Value Labels dialog box. Enter your values and corresponding labels for the variable you are defining, if appropriate.

Repeat this for all of the levels of the variable. When you are finished, verify that all of the information in the box is correct, and then click OK to complete the process.

“Missing” column (missing and invalid data): this column is very important. For example, in survey data some questions are left unanswered, and for a good analysis these empty cells are recoded with a special code, usually 9, 99 or 999. In the Missing column we write that code so that this number is not taken into account in the analysis. Alternatively, the cell for a missing value can simply be left blank instead of entering a special code (of one’s choice).

“Columns” and “Align” are given by default.

The “Measure” column is very important: for all quantitative variables choose “Scale”. Ordinal and Nominal are the other options for Measure. In many parts of SPSS, you will see a visual reminder of the Measure of your variables in the form of icons. A small diagonal yellow ruler indicates a “scale” variable. A small three-level bar graph with increasing bar heights indicates an “ordinal” variable. Three colored balls, with one on top and two below, indicate nominal data.

Note: Categorical. Data with a limited number of distinct values or categories (for example, gender or religion). Categorical variables can be numeric variables that use numeric codes to represent categories (for example, 1 = male and 2 = female). Also referred to as qualitative data. Categorical variables can be either nominal or ordinal.

Nominal. A variable can be treated as nominal when its values represent categories with no intrinsic ranking (for example, gender).

Ordinal. A variable can be treated as ordinal when its values represent categories with some intrinsic ranking (for example, levels of service satisfaction from highly dissatisfied to highly satisfied).

Scale. Data measured on an interval or ratio scale, where the data values indicate both the order of values and the distance between values.

Example

Name      Type      Width   Decimals   Label            Values

ID        Numeric   8       0          Identification   None

gender    Numeric   8       0          Gender           1 = male, 2 = female

age       Numeric   8       0          Age in years     None

age_cat   Numeric   8       0          Age category     1 = <18, 2 = 18 - 24, 3 = 25 - 34, etc.

After creating all your variables, go to “Data View”, where you can enter and edit data for all of your cases; see the following example:

To enter data, simply highlight a cell, enter the number, and then press Return. The next cell down will then be highlighted.

Notice how the values for gender are coded as 1 and 2. It is possible to show their value labels instead. Click on the Value Labels button on the toolbar at the top; now the values are shown as Male and Female, just as we coded before.
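The relationship between stored codes and value labels can be illustrated outside SPSS. The following is only a minimal Python sketch: the dictionary `VALUE_LABELS` and the function `show_labels` are hypothetical names made up for this illustration, and the sample cases are invented.

```python
# Hypothetical value-label dictionaries mirroring the Values column of the example.
VALUE_LABELS = {
    "gender": {1: "Male", 2: "Female"},
    "age_cat": {1: "<18", 2: "18 - 24", 3: "25 - 34"},
}


def show_labels(variable, codes):
    """Replace numeric codes with their labels; codes without a label are kept as-is."""
    labels = VALUE_LABELS.get(variable, {})
    return [labels.get(code, code) for code in codes]


# Invented cases, coded the way they would be typed in Data View.
gender_column = [1, 2, 2, 1]
print(show_labels("gender", gender_column))
```

The data themselves stay numeric; the labels are only a display layer, which is what the Value Labels button toggles.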

B. Second option, using data in Excel

Data from an Excel spreadsheet can be imported into SPSS as follows:

2. Locate the file of interest:

Select the Files of type (for example, Excel) you want to import: change Files of type to Excel (*.xls), then browse to and open the file.

Check that the box labelled “Read variable names from the first row of data” is ticked and click OK (that is, if the first row in Excel contains your variable names; otherwise leave it unticked).

Note: if you want to transfer data from Excel to SPSS it is a good idea to ensure that any categorical data (e.g. yes/no/don’t know, male/female, etc.) are entered in Excel as numeric data (codes) rather than text. For example, you could always code ’No’ as 0 and ’Yes’ as 1, and so on.
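As an illustration of the coding advice in this note, here is a minimal Python sketch of turning text answers into numeric codes before the data reach SPSS. The names `CODES` and `encode` are hypothetical, and returning `None` for unrecognised answers is an assumption of this sketch, not an SPSS rule.

```python
# Hypothetical coding scheme following the note: 'No' = 0, 'Yes' = 1.
CODES = {"no": 0, "yes": 1}


def encode(answers, codes=CODES, missing=None):
    """Map text answers to numeric codes, ignoring case and surrounding spaces."""
    result = []
    for answer in answers:
        key = answer.strip().lower() if isinstance(answer, str) else answer
        # Anything not in the coding scheme is flagged as missing for later checking.
        result.append(codes.get(key, missing))
    return result


print(encode(["Yes", "no ", " YES", "maybe"]))
```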

After reading in the data it is a good idea to ’type’ in SPSS what the codes for your categorical variables are. This ensures that tables and graphs are labelled appropriately. More detailed instructions:

1. Click on the Variable View tab in the bottom left hand corner of the Data Editor window

2. Look at the row for the variable you’re dealing with and go to the Values column. Click on the word None

3. Click on the little grey square (with dots in it) on the right

4. Enter the first value (code), e.g. 1, and the corresponding label, e.g. male, then click on Add

5. Repeat until you have entered all the labels and codes for this variable, then click OK

6. Repeat this process for the other categorical variables.

Cleaning data after importing data files

• Key in values and labels for each variable

• Run frequencies for each variable

• Check the outputs to see if you have variables with wrong values.

• Check missing values against the physical surveys if you use paper surveys, and make sure they are really missing.

• Sometimes, you need to recode string variables into numeric variables
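The frequency check in the list above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names, the invented `gender` column, and the choice of 9 as the missing-value code are all hypothetical.

```python
from collections import Counter


def frequency_table(values):
    """Count how often each value occurs, sorted by value."""
    return dict(sorted(Counter(values).items()))


def invalid_values(values, valid_codes, missing_codes=()):
    """Return the values that are neither valid codes nor declared missing codes."""
    allowed = set(valid_codes) | set(missing_codes)
    return sorted({v for v in values if v not in allowed})


# Invented column: 1 = male, 2 = female, 9 = missing; 3 and 11 are typing errors.
gender = [1, 2, 2, 1, 3, 9, 2, 11]
print(frequency_table(gender))
print(invalid_values(gender, valid_codes=(1, 2), missing_codes=(9,)))
```

Values reported by `invalid_values` are the ones to look up in the original paper survey.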

Exporting Data to Excel

Click on FILE ⇒ SAVE AS. Type the File Name for the file to be exported. For “Save as Type”, select Excel (*.xls) from the pull-down menu, then click on Save.

Transform the variables (Computing variables and Recoding variables)

To transform the variables, first open a file from SPSS named Survey_sample.sav. Steps for opening files from SPSS:

With this file open, go on to transform the variables:

Compute variable (Creating new variables)

Creating new variables (data transformation) is commonly needed, depending on what you are trying to do.

Recode Variables

Use this option when you need to transform the original data into intervals. There are two forms: “Recode into Same Variables” and “Recode into Different Variables”.

Follow the steps shown in the figure. We would like to transform ‘number of children’ into the following intervals:

0 = None

1 = 1 – 4

2 = 5 or More

Write the name and label and click Change > Click on “Old and New Values”

The following dialog box appears. Enter the new values that you want to assign:

0 = None

1 = 1 – 4

2 = 5 or More

Next, return to “Variable View” and, for the new variable that you just created, enter the labels of the intervals:

0 = None

1 = 1 – 4

2 = 5 or More
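The recoding rule used in this example can be written out explicitly. The following Python sketch is only an illustration of the logic behind “Recode into Different Variables”; the function name `recode_children`, the dictionary `INTERVAL_LABELS`, and the treatment of negative or absent values as missing are assumptions of the sketch.

```python
# Labels for the new (recoded) variable, as entered in Variable View.
INTERVAL_LABELS = {0: "None", 1: "1 - 4", 2: "5 or More"}


def recode_children(n):
    """Recode a number of children into the interval codes 0, 1 or 2."""
    if n is None or n < 0:
        return None          # assumption: keep missing or impossible values missing
    if n == 0:
        return 0             # 0 = None
    if n <= 4:
        return 1             # 1 = 1 - 4
    return 2                 # 2 = 5 or More


children = [0, 1, 4, 5, 8, None]            # invented cases
recoded = [recode_children(n) for n in children]
print(recoded)
print([INTERVAL_LABELS.get(code) for code in recoded])
```

Keeping the result in a new list mirrors “Recode into Different Variables”: the original values remain available.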

Computing Variables

To create a simple data transformation, which is the result of applying a mathematical formula to one or more existing variables, use “Compute Variable”.

To apply “Compute Variable” we will continue to use the data in “Survey_Sample”.

Create a new variable, Confidence (the average confidence): add the answers to the 6 questions about confidence and calculate the mean of these responses: (confinan + conbus + coneduc + conpress + conmedic + contv)/6.

Steps in SPSS for computing a new variable (follow the figure below):

Pass all the variables you want to add, each separated by a comma, then > OK
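The formula for the new Confidence variable can be sketched in Python as below. The six item names come from the text; the function name `confidence`, the invented case, and the rule that a case with any missing item gets a missing result are assumptions made for this illustration.

```python
# The six confidence items named in the formula above.
ITEMS = ["confinan", "conbus", "coneduc", "conpress", "conmedic", "contv"]


def confidence(case):
    """Mean of the six confidence items for one case (one row of Data View)."""
    values = [case.get(name) for name in ITEMS]
    if any(v is None for v in values):
        return None          # assumption: an incomplete case gets no score
    return sum(values) / len(values)


# One invented respondent.
respondent = {"confinan": 1, "conbus": 2, "coneduc": 3,
              "conpress": 1, "conmedic": 2, "contv": 3}
print(confidence(respondent))
```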

Running Analyses (frequency: Calculating tables, statistics, and graphics)

Analyzing data using Frequencies

To perform simple analyses and obtain frequency tables: Analyze > Descriptive Statistics > Frequencies

Generally, a frequency table is used for looking at categorical data in detail. Categorical data are variables such as gender, where, for example, males are coded as “1” and females are coded as “2”. Frequencies options include a table showing counts and percentages; statistics including percentile values, central tendency, dispersion and distribution; and charts including bar charts, pie charts and histograms.

Steps for using the frequencies procedure:

Click Analyze > Descriptive Statistics > Frequencies and select your variables for analysis. You can then choose statistics options, choose chart options, and have SPSS calculate your request.

Analyzing Categorical variable

For this example we are going to check out "Happiness of marriage (hapmar)" on the file "Survey_Sample". We will look at this variable for our initial investigation.

Continue > OK

Output from Frequencies:

The major part of the display shows the value labels (Very happy, Pretty happy, Not too happy, and Total), and the missing categories, NAP (Not Appropriate), DK (Don’t Know), and NA (Not Answered).

In a written paper, you should state that the “Valid Percent” excludes the “missing” answers.

Analyzing numerical variable (Scale in SPSS)

Follow the same steps as for the categorical variable. For this example we are going to look at the distribution of “number of children”, “age” and “highest year of school completed” in “Survey_Sample”.

Since the variables that we are going to analyze are measured at the interval/ratio level, different statistics from our previous example will be used.

Basic statistical analysis (Descriptive statistics) Procedure:

Analyze > Descriptive Statistics > Frequencies (click) > the Frequencies box appears > First click on “Number of children”, then click the select arrow in the middle and SPSS will place it in the Variable(s) box. Follow the same steps to choose “Age” and the variable name for “Highest year of school completed” > click on

Next, click the Continue button to return to the main Frequencies dialog box > Click OK in the main Frequencies dialog box and SPSS will calculate and display the output shown in the following table:

Statistics

                           Number of     Age of        Highest year of
                           children      respondent    school completed

N   Valid                  2825          2828          2820

    Missing                7             4             12

Mean                       1.82          45.56         13.25

Median                     2             42            13

Std. Deviation             1.69          17.1          2.928

Skewness                   1.07          0.567         -0.256

Std. Error of Skewness     0.046         0.046         0.046

Kurtosis                   1.386         -0.527        1.069

Std. Error of Kurtosis     0.092         0.092         0.092

Range                      8             71            20

Minimum                    0             18            0

Maximum                    8             89            20

Percentiles   25           0             32            12

              50           2             42            13

              75           3             57            16

Interpretation: The average number of children per family is 1.82 and the variability around the average (standard deviation) is 1.69. The smallest number of children reported is 0 and the largest is 8. Now look at the SPSS output for “highest year of school completed”. It has an average of 13.25 (just over 1 year after high school) and a standard deviation of 2.928. Some respondents reported 0 years of school completed. The highest education reported was 20 years.

Skewness: a measure of the asymmetry of a distribution. The normal distribution is symmetric and has a skewness ≈ 0.

Right or positive skewness: a long right tail (Skw > 0). Left or negative skewness: a long left tail (Skw < 0). As a general rule of thumb:

If skewness is between -0.5 and 0.5, the distribution is approximately symmetric.

If skewness is between -1 and -0.5 or between 0.5 and 1, the distribution is moderately skewed.

If skewness is less than -1 or greater than 1, the distribution is highly skewed.
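The rule of thumb above can be applied mechanically to the skewness values in the Statistics table. The following Python sketch is only an illustration; the function name `describe_skewness` is hypothetical, and treating the boundary values (exactly 0.5 or 1) as belonging to the lower class is an assumption.

```python
def describe_skewness(skew):
    """Classify a skewness value with the rule of thumb given above."""
    size = abs(skew)
    if size <= 0.5:
        return "approximately symmetric"
    tail = "right" if skew > 0 else "left"
    degree = "moderately" if size <= 1 else "highly"
    return f"{degree} skewed ({tail} tail)"


# Skewness values taken from the Statistics table.
skewness = {
    "Number of children": 1.07,
    "Age of respondent": 0.567,
    "Highest year of school completed": -0.256,
}
for variable, value in skewness.items():
    print(variable, "->", describe_skewness(value))
```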

Platykurtic (K < 0): data values are flatter and more dispersed along the X axis.

If k ≈ 0.263, the curve corresponding to the frequency distribution is mesokurtic (it has the same peakedness as the normal or Gaussian curve).

If k > 0.263, the curve corresponding to the frequency distribution is leptokurtic.

If k < 0.263, the curve corresponding to the frequency distribution is platykurtic.

Note: For interpreting the remaining statistics, we suggest reviewing chapter 1 of the course.

Analysis with Custom Table (Numerical Variable)

To create the custom table, choose from the menu: Analyze > Tables > Custom Tables:

Next, select the variables and drag and drop them onto the “Rows” area. It is important that the level of measurement of all variables that you analyze is set correctly, because the default settings for the table will be based on that.

It is also important that all variables that you selected have exactly the same value labels.

Next, click the “Summary Statistics” button to select the statistics to display in the table. The window shown on the right pops up. We can add statistics to the table and remove others. In this example we have chosen Standard Deviation, Minimum and Maximum.

Note that you can rearrange the order in which the selected statistics will be displayed. When you are satisfied with your selection, click on “Apply to Selection” and “OK”.

Output from Custom Tables for numerical variables

                                    Mean   Standard Deviation   Minimum   Maximum

Number of children                  2      2                    0         8

Age of respondent                   46     17                   18        89

Highest year of school completed    13     3                    0         20

Analysis with Custom Table (Categorical Variable)

For categorical variables follow the same steps as above, but make sure that the variables you choose have the same level of measurement (in our case, nominal or ordinal). In this case, in Summary Statistics choose Table N %.

Output from Custom Tables for categorical variables

                                   Count   Table N %

Gender           Male              1232    43.5%

                 Female            1600    56.5%

                 Total             2832    100.0%

Marital status   Married           1346    47.5%

                 Widowed           283     10.0%

                 Divorced          446     15.8%

                 Separated         93      3.3%

                 Never married     663     23.4%

                 Total             2831    100.0%

Highest degree   LT High school    430     15.2%

                 High school       1500    53.2%

                 Junior college    209     7.4%

                 Bachelor          478     16.9%

                 Graduate          205     7.3%

                 Total             2822    100.0%

A stacked table can include other variables in other dimensions. For example, you could cross two variables stacked in the rows with a third variable displayed in the column dimension.

1. Open the table builder again (Analyze menu, Tables, Custom Tables).

2. If Gender and Highest degree aren't already stacked in the rows, follow the directions above for stacking them.

3. Drag and drop General happiness from the variable list to the Columns area on the canvas pane.

4. By default the table shows only Count; for better interpretation, request “Column N %” in Summary Statistics and click Apply to All.

5. Click OK to create the table.

Output from Crosstabulation Tables

                                   General happiness

                                   Very happy            Pretty happy          Not too happy

                                   Count   Column N %    Count   Column N %    Count   Column N %

Gender           Male              373     41.9%         712     45.2%         133     39.1%

                 Female            518     58.1%         863     54.8%         207     60.9%

                 Total             891     100.0%        1575    100.0%        340     100.0%

Highest degree   LT High school    132     14.9%         197     12.5%         97      28.6%

                 High school       420     47.3%         889     56.6%         179     52.8%

                 Junior college    79      8.9%          113     7.2%          13      3.8%

                 Bachelor          163     18.4%         273     17.4%         39      11.5%

                 Graduate          94      10.6%         98      6.2%          11      3.2%

                 Total             888     100.0%        1570    100.0%        339     100.0%
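To see where the “Column N %” figures come from, the following Python sketch recomputes them for the Gender rows from the counts in the table. It is only an illustration: the function name `column_percentages` and the nested-dictionary layout are choices made for this sketch, not something SPSS exposes.

```python
# Counts copied from the Gender rows of the crosstabulation table.
counts = {
    "Male":   {"Very happy": 373, "Pretty happy": 712, "Not too happy": 133},
    "Female": {"Very happy": 518, "Pretty happy": 863, "Not too happy": 207},
}


def column_percentages(table):
    """Express each count as a percentage of its column total (one decimal)."""
    columns = list(next(iter(table.values())).keys())
    totals = {c: sum(row[c] for row in table.values()) for c in columns}
    return {
        label: {c: round(100 * row[c] / totals[c], 1) for c in columns}
        for label, row in table.items()
    }


for label, row in column_percentages(counts).items():
    print(label, row)
```

Each percentage is the cell count divided by the total of its happiness column, which is why every column adds up to 100%.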

Building graphics

The first question is: which graph should you select? The answer depends on the type of variable you will analyze. Many statistical procedures include an option to generate charts. In addition, you can choose from the

Legacy Dialogs (accessed from the main Graph menu) leads to a series of different graph/chart types from which to select. In all cases, you will then be asked to Define the graph by specifying the variables to be displayed and how they are to be plotted.

The precise information required and the format of the dialogue boxes will vary according to the type of graph you select.

If you are not sure what each chart type looks like, select Graphboard Template Chooser (from the main graph menu). You have to ensure that the variable has the correct Measure (nominal, ordinal or scale) defined in Variable View.

Graph – Chart Builder

The Chart Builder dialog box is an interactive window that allows you to preview how a chart will look while you build it.

1. Click the Gallery tab if it is not selected.

The Gallery includes many different predefined charts, which are organized by chart type. The Basic Elements tab also provides basic elements (such as axes and graphic elements) for creating charts from scratch, but it's easier to use the Gallery.

2. Click Bar if it is not selected.

Icons representing the available bar charts in the Gallery appear in the dialog box. The pictures should provide enough information to identify the specific chart type.

3. Drag the icon for the simple bar chart onto the "canvas" which is the large area above the Gallery.

The Chart Builder displays a preview of the chart on the canvas. Note that the data used to draw the chart are not your actual data.

4. You add variables by dragging them from the Variables list, which is located to the left of the canvas. A variable's measurement level is important in the Chart Builder. You are going to use the Highest degree variable on the x axis.

5. Now drag highest degree from the Variables list to the x axis drop zone.

The ‘y’ axis drop zone defaults to the Count statistic. If you want to use another statistic (such as percentage or mean), you can easily change it.

6. Click Element Properties to display the Element Properties window.

The Element Properties window allows you to change the properties of the various chart elements.

Output from Chart Builder

Finally: Once your graph is produced you can edit it further like any piece of SPSS Output by double-clicking on it.

Building the clustered bar chart

Suppose you want to show how the distribution of marital status [marital] of the survey respondents differs by gender [sex]. Then you may need a “clustered bar chart”.

Output

Box plot

Begin by entering your data. You should have two columns: one for your dependent variable, and one for your grouping (independent variable).

In the Variable View, make sure that the appropriate scale is selected for each variable. The dependent variable should be a scale variable, and the grouping variable should be ordinal or nominal.

This chart divides the data into four areas of equal frequency. The central box contains the middle 50% of the data, and a vertical (or horizontal) line inside the box indicates the median (if this line is at the center of the box, the distribution is symmetric). Whiskers are drawn from the center of each side of the box. The left (or lower) whisker ends at the most extreme data value that is still within Q1 - 1.5 * IQR, while the right (or upper) whisker ends at the most extreme data value that is still within Q3 + 1.5 * IQR. Values beyond the whiskers are outliers, and values greater than Q3 + 3 * IQR or less than Q1 - 3 * IQR are considered extreme outliers (in SPSS these are represented by “o” and “x”, respectively). So from top to bottom, here’s what’s in the boxplot:

• Top whisker: Q3 + 1.5 * IQR

• Top of box: 3rd quartile

• Line in the middle: median

• Bottom of box: 1st quartile

• Bottom whisker: Q1 - 1.5 * IQR

Steps

Select the Boxplot option under the Gallery. Select the type of boxplot you want to create, and drag it to the main window.
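The whisker and outlier rules described for the box plot can be sketched in Python. This is an illustration only: the data are invented, the function names are hypothetical, and the quartiles are computed with the standard library's "inclusive" method, which is an assumption and may differ slightly from the quartile definition SPSS uses for box plots.

```python
import statistics


def quartiles(data):
    """Return (Q1, median, Q3); the 'inclusive' method is an assumption of this sketch."""
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return q1, q2, q3


def classify(value, q1, q3):
    """Label a value as 'inside', 'outlier' (beyond 1.5 IQR) or 'extreme' (beyond 3 IQR)."""
    iqr = q3 - q1
    if value < q1 - 3 * iqr or value > q3 + 3 * iqr:
        return "extreme"
    if value < q1 - 1.5 * iqr or value > q3 + 1.5 * iqr:
        return "outlier"
    return "inside"


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]      # invented values
q1, median, q3 = quartiles(data)
print("Q1 =", q1, "median =", median, "Q3 =", q3)
for value in (5, 30):
    print(value, "->", classify(value, q1, q3))
```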

Output

Interpretation. The major part of our distribution is not normal and there are significant outliers, the cases beyond the lower line of our boxplot. The majority of the outliers are at the lowest end of the distribution, people with little or no education. There are also more observations above than below the median.

INFERENTIAL STATISTICS

Inferential statistics. The methods used to estimate a property of a population on the basis of a sample.

Statistical Inference involves two main types of techniques: Parameter Estimation and Hypothesis Testing. Whatever the technique used, the overall purpose is to use data from a probability sample to extract conclusions about a population.

• Data are observations about the variable being measured.

Parameter Estimation

The objective of estimation is to determine the approximate value of a population parameter on the basis of a sample statistic.

- Confidence interval

A confidence interval gives an estimated range of values which is likely to include an unknown population parameter, the estimated range being calculated from a given set of sample data. (Definition taken from Valerie J. Easton and John H. McColl's).

Application example 1

We want to estimate the monthly cost of cell phone use among AUCA students. We took a sample of 20 students and asked them how much they spend in a month on their cell phone. The answers are shown in the table below:

Monthly expenditure on cell phone

10000 5000 4500 7000 5500 3500 2000 1500 500 4872 800 5800 4873 5801 10000 9500 7800 2570 6531 5842

Calculate a 95% confidence interval for the average monthly cell phone spending of AUCA students and interpret the result.

Steps in SPSS

First create the variable in ‘Variable View’, then enter the data in ‘Data View’. Then click on Analyze > Compare Means > One-Sample T Test and continue with the steps shown in the following figure.

Output from Compare Means

One-Sample Statistics

                    N    Mean      Std. Deviation   Std. Error Mean

Monthly_Cellphone   20   5194.45   2842.309         635.560

One-Sample Test

Test Value = 0

                                                                     95% Confidence Interval of the Difference

                    t       df   Sig. (2-tailed)   Mean Difference   Lower      Upper

Monthly_Cellphone   8.173   19   .000              5194.450          3864.21    6524.69

Interpretation. At 95% confidence, AUCA students spend on average between 3864.21 and 6524.69 Rwf per month on their cell phones.
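The interval reported by SPSS can be checked by hand as mean ± t × (standard error). The Python sketch below does this with the 20 observations of the example. The critical value 2.093 (t distribution, 19 degrees of freedom, 95% two-sided) is taken from a standard t table and is an assumption of this sketch, so the limits may differ from the SPSS output in the second decimal.

```python
import math
import statistics

# Monthly expenditure on cell phone (Rwf), as listed in the example.
spend = [10000, 5000, 4500, 7000, 5500, 3500, 2000, 1500, 500, 4872,
         800, 5800, 4873, 5801, 10000, 9500, 7800, 2570, 6531, 5842]

T_CRIT = 2.093   # assumption: t-table value for 95% confidence and df = 19

n = len(spend)
mean = statistics.mean(spend)
sd = statistics.stdev(spend)              # sample standard deviation (n - 1)
se = sd / math.sqrt(n)                    # standard error of the mean
lower, upper = mean - T_CRIT * se, mean + T_CRIT * se

print(f"n = {n}, mean = {mean:.2f}, sd = {sd:.3f}, se = {se:.3f}")
print(f"95% CI: {lower:.2f} to {upper:.2f}")
```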

Hypothesis

A statistical hypothesis is a statement about a probabilistic model, and a hypothesis test is a method for judging the plausibility of that statement based on a sample.

Continuing with example 1 given previously:

Test the claim that the true mean monthly cell phone bill of AUCA students is less than ten thousand Rwf. To test this hypothesis we took a sample of 20 students and asked them how much they spend in a month on their cell phone. The answers are shown in the table below:

Monthly expenditure on cell phone

10000 5000 4500 7000 5500 3500 2000 1500 500 4872 800 5800 4873 5801 10000 9500 7800 2570 6531 5842

Steps

1. Specify the population value of interest

Mean monthly cell phone bill of AUCA students

2. Formulate the appropriate null and alternative hypotheses

• Ho: μ ≥ 10000

• Ha: μ < 10000 (This is a lower tail test)

3. Specify the desired level of significance

Output from SPSS

One-Sample Test

Test Value = 10000

                                                                       95% Confidence Interval of the Difference

                     t        df   Sig. (2-tailed)   Mean Difference   Lower       Upper

Monthly Cell Phone   -7.561   19   .000              -4805.55          -6135.79    -3475.31

Making a decision and interpreting the result: Since Sig. (p-value) = 0.000 < 0.01 = α, we reject the null hypothesis; the difference found is highly significant. We can conclude at the 1% significance level (and therefore also at 5%) that there is evidence supporting the alternative hypothesis, i.e. students spend less than ten thousand Rwf per month on their cell phones.

Note: The p-value (Sig.) that SPSS gives by default is two-tailed; for a one-tailed test use Sig./2 (.000/2 = .000), provided the sample mean lies on the side stated in Ha.
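The t statistic in the One-Sample Test table can be reproduced by hand as (sample mean − test value) / standard error. The Python sketch below does so and applies a lower-tail decision rule. The critical value 2.539 (t distribution, 19 degrees of freedom, one-tailed α = 0.01) is taken from a standard t table and is an assumption of this sketch.

```python
import math
import statistics

# Monthly expenditure on cell phone (Rwf), as listed in the example.
spend = [10000, 5000, 4500, 7000, 5500, 3500, 2000, 1500, 500, 4872,
         800, 5800, 4873, 5801, 10000, 9500, 7800, 2570, 6531, 5842]

TEST_VALUE = 10000        # value stated in Ho
T_CRIT_ONE_TAILED = 2.539  # assumption: t-table value for alpha = 0.01, df = 19

n = len(spend)
se = statistics.stdev(spend) / math.sqrt(n)
t = (statistics.mean(spend) - TEST_VALUE) / se
df = n - 1

# Lower-tail test: reject Ho when t falls below the negative critical value.
reject_h0 = t < -T_CRIT_ONE_TAILED

print(f"t = {t:.3f}, df = {df}, reject Ho: {reject_h0}")
```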

The independent t-test, compares the means between two unrelated groups. For example, you could use an independent t-test to understand whether second year graduate salaries differed based on gender.

Dependent variable: second year graduate salaries

Independent variable: gender, which has two groups: "male" and "female".

Example

A cigarette maker analyzes two different brands to determine their nicotine content. A sample was taken of each brand, with the following results (in milligrams).

Brand A: 24 26 25 22 23

Brand B: 27 28 25 29 26

Do the above results indicate that there is a difference in the average nicotine content of the two brands?

Solution

Formulate the appropriate hypotheses:

Ho: μA = μB (the average nicotine content is the same in both brands)

Ha: μA ≠ μB (the average nicotine content differs between the brands)

Steps using SPSS (Create data file)

Enter the data in SPSS so that the variable “Nicotine” takes up one column, and the variable “Brand”, which identifies whether the nicotine value came from brand A or brand B, takes up another column.

“Nicotine” is the dependent, response or outcome variable, and “Brand” is the independent or factor variable. The two variables should be created as seen in the data editor on the right. The Brand variable takes on two possible values: “1” for brand A and “2” for brand B.

When the data entry is complete, follow the steps shown on the next slide.

Output from SPSS

Interpretation: the report shows the descriptive statistics. The average nicotine content of Brand A is lower than that of Brand B, and the standard deviations are similar; but we do not yet know whether the observed difference is significant.

So we run the t test for independent samples, which gives t = -3.00; the corresponding Sig. (2-tailed) value is .017, lower than .05.

Making a decision and interpreting the result: Sig. < 0.05, therefore at the 5% level of significance we can say that there is a significant difference in the average nicotine content of the two brands, i.e. Brand B contains significantly more nicotine than Brand A.
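The t value of this example can be reproduced with the equal-variances (pooled) formula for two independent samples. The Python sketch below is only an illustration; the critical value 2.306 (t distribution, 8 degrees of freedom, two-tailed α = 0.05) is taken from a standard t table and is an assumption of this sketch.

```python
import math
import statistics

brand_a = [24, 26, 25, 22, 23]   # nicotine content in mg
brand_b = [27, 28, 25, 29, 26]

T_CRIT_TWO_TAILED = 2.306        # assumption: t-table value for alpha = 0.05, df = 8

n_a, n_b = len(brand_a), len(brand_b)
mean_a, mean_b = statistics.mean(brand_a), statistics.mean(brand_b)
var_a, var_b = statistics.variance(brand_a), statistics.variance(brand_b)

# Pooled variance: weighted average of the two sample variances.
df = n_a + n_b - 2
pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / df
se_diff = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
t = (mean_a - mean_b) / se_diff

# Two-tailed test: reject Ho when |t| exceeds the critical value.
reject_h0 = abs(t) > T_CRIT_TWO_TAILED

print(f"mean A = {mean_a}, mean B = {mean_b}, t = {t:.2f}, df = {df}")
print("reject Ho:", reject_h0)
```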

Note: Interpreting Box plots in general:

Box plots are used to show overall patterns of response for a group. They provide a useful way to visualize the range and other characteristics of responses for a large group.

Use a box plot to check whether there are outlier values and whether the boxes are symmetric. The diagram below shows a variety of different box plot shapes and positions.

Some general observations about box plots:

About variability: box plots 2, 3 and 4 show homogeneity, that is, they are symmetric distributions. However, box plot 1 shows an asymmetric distribution (negative asymmetry), i.e. the data tend to be concentrated towards the top of the distribution, with a tail extending towards the low values. In this context (marks), most of the marks or views are concentrated at the higher scores, and the lowest scores are more dispersed.

The box plot is comparatively short - see example (2). This suggests that overall students have a high level of agreement with each other.

One box plot is much higher or lower than another – compare (3) and (4) – This could suggest a difference between groups. For example, the box plot for (4) may be lower than the equivalent plot for (3).

Obvious differences between box plots – see boxes plots (1) and (2), (1) and (3), or (2) and (4). Any obvious difference between box plots for comparative groups is worthy of further investigation.

The 4 sections of the box plot are uneven in size – See box plot (1). This shows that many students have similar views at certain parts of the scale, but in other parts of the scale students are more variable in their views. The long upper whisker in the example means that students’ views are varied amongst the most positive quartile group, and very similar for the least positive quartile group.

Same median, different distribution – See boxes plots (1), (2), and (3). The medians (which generally will be close to the average) are all at the same level. However the box plots in these examples show very different distributions of views.

Hypothesis testing to determine normality

Ho: The variables follow a normal distribution

Ha: The variables do not follow a normal distribution

Tests of Normality

                              Kolmogorov-Smirnov(a)        Shapiro-Wilk

           Cigarette Brand    Statistic   df   Sig.        Statistic   df   Sig.

Nicotine   Brand A            0.136       5    .200*       0.987       5    0.967

           Brand B            0.136       5    .200*       0.987       5    0.967

a. Lilliefors Significance Correction

Making a Decision and Interpreting the Result of the Test

We use the Shapiro-Wilk statistic because the samples are small.

The p-values (Sig.) are: Brand A: Sig. = .967; Brand B: Sig. = .967

Both p-values from the Shapiro-Wilk test of normality are greater than 0.05, so we do not reject the null hypothesis. It is therefore acceptable to assume that the nicotine content distributions for the Brand A and Brand B populations are both normal (bell-shaped).

Assumption of Homogeneity

The Levene test shows whether this assumption, which is very important when comparing groups, is met. SPSS reports it automatically, without being asked.

Ho: the variances are equal

Ha: the variances are not equal

Decision: The p-value (Sig. = 1.000) given by the Levene test is greater than 5%, so we cannot reject Ho and we conclude that equal variances can be assumed.

Example

An analyst at a department store wants to evaluate a recent credit card promotion. To this end, 500 cardholders were randomly selected. Half received an ad promoting a reduced interest rate on purchases made over the next three months, and half received a standard seasonal announcement.

This example uses the file creditpromo.sav from SPSS. To begin the analysis, from the menus choose:

Running the Analysis

SPSS Report

Group Statistics

                                    Type of mail insert received   N     Mean        Std. Deviation   Std. Error Mean
$ spent during promotional period   Standard                       250   1566.3890   346.67305        21.92553
                                    New Promotion                  250   1637.5000   356.70317        22.55989

The Descriptive table displays the sample size, mean, standard deviation, and standard error for both groups. On average, customers who received the interest-rate promotion charged about $71 more than the comparison group, and they vary a little more around their average.

The procedure produces two tests of the difference between the two groups. One test assumes that the variances of the two groups are equal. The Levene statistic tests this assumption.

Levene's Test for Equality of Variances

                                                                  F      Sig.
$ spent during promotional period   Equal variances assumed       1.19   0.276
                                    Equal variances not assumed

In this example, the significance value of the statistic is 0.276. Because this value is greater than 0.05, you can assume that the groups have equal variances and ignore the second test displayed.

Independent Samples Test ($ spent during promotional period)

                              Levene's Test for
                              Equality of Variances   t-test for Equality of Means
                              F      Sig.             t       df       Sig. (2-tailed)   Mean Difference   Std. Error Difference
Equal variances assumed       1.19   0.276            -2.26   498      0.024             -71.11095         31.45914
Equal variances not assumed                           -2.26   497.60   0.024             -71.11095         31.45914

The df column displays degrees of freedom (498). For the independent samples t test, this equals the total number of cases in both samples minus 2.

The column labeled Sig. (2-tailed) displays a probability from the t distribution with 498 degrees of freedom (Sig = .024). The value listed is the probability of obtaining an absolute value greater than or equal to the observed t statistic, if the difference between the sample means is purely random.

The Mean Difference (-71.11095) is obtained by subtracting the sample mean for group 2 (the New Promotion group) from the sample mean for group 1.

Interpretation: Since the significance value of the test is less than 0.05 (Sig = .024), you can safely conclude that the average of 71.11 dollars more spent by cardholders receiving the reduced interest rate is not due to chance alone. The store will now consider extending the offer to all credit customers.
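The t statistic in the SPSS report can be checked by hand from the Group Statistics table. The following is a minimal Python sketch, assuming equal variances (as the Levene test suggested); the variable names are illustrative and not part of SPSS.

```python
import math

# Summary statistics copied from the Group Statistics table.
n1, mean1, sd1 = 250, 1566.3890, 346.67305   # group 1: Standard
n2, mean2, sd2 = 250, 1637.5000, 356.70317   # group 2: New Promotion

# Degrees of freedom: total number of cases minus 2.
df = n1 + n2 - 2

# Pooled variance, used when equal variances are assumed.
pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df

# Standard error of the difference between the two means.
se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))

# Mean difference (group 1 minus group 2) and t statistic.
mean_diff = mean1 - mean2
t_stat = mean_diff / se_diff

print(f"df = {df}")
print(f"mean difference = {mean_diff:.3f}")
print(f"std. error of difference = {se_diff:.3f}")
print(f"t = {t_stat:.2f}")
```

The values should agree with the Independent Samples Test table up to rounding of the reported means.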

Hypothesis test for the difference in population means (paired or related samples)

One of the most common experimental designs is the "pre-post" design. A study of this type often consists of two measurements taken on the same subject, one before and one after the introduction of a treatment or a stimulus. The basic idea is simple. If the treatment had no effect, the average difference between the measurements is equal to 0 and the null hypothesis holds. On the other hand, if the treatment did have an effect (intended or unintended!), the average difference is not 0 and the null hypothesis is rejected.

The Paired-Samples t test procedure is used to test the hypothesis of no difference between two variables. The data may consist of two measurements taken on the same subject or one measurement taken on a matched pair of subjects.

Example

A group of ten patients newly diagnosed with diabetes was observed to determine whether an educational program was effective in increasing their knowledge of diabetes. A test on self-related aspects of the disease was applied before and after the educational program. The test results were as follows:

Patient   1    2    3    4    5    6    7    8    9    10
Before    75   62   67   70   55   59   60   64   72   59
After     77   65   68   72   62   61   60   67   75   68

Was the educational program effective with respect to patients' knowledge? Report using the statistical software SPSS 23.0.

1° database

2° Process (order the t-test analysis for related samples)

3° Report

Paired Samples Statistics

                 Mean      N    Std. Deviation   Std. Error Mean
Pair 1  Before   64.3000   10   6.49872          2.05508
        After    67.5000   10   5.79751          1.83333

Interpretation: the report shows the descriptive statistics. The average before the program is lower than the average after it, but we do not yet know whether the observed difference is significant. We therefore request the t test for related samples, which gives t = -3.692 (the same value obtained manually), together with the confidence intervals.

Paired Samples Test

                         Paired Differences
                         Mean       Std. Deviation   Std. Error Mean   t        df   Sig. (2-tailed)
Pair 1  Before - After   -3.20000   2.74064          .86667            -3.692   9    .005

Making a decision and interpreting the result: Since Sig. = .005 < 0.05, we reject Ho. Therefore, at the 5% level of significance, we can say that the educational program was effective in increasing patients' knowledge.
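The paired t statistic can be reproduced from the ten before/after scores. This is a small Python sketch using only the standard library; it is meant as a cross-check of the SPSS report, not as a replacement for it.

```python
import math
import statistics

# Test scores from the example table.
before = [75, 62, 67, 70, 55, 59, 60, 64, 72, 59]
after = [77, 65, 68, 72, 62, 61, 60, 67, 75, 68]

# Paired differences (Before - After), one per patient.
diffs = [b - a for b, a in zip(before, after)]
n = len(diffs)

mean_diff = statistics.mean(diffs)     # mean of the differences
sd_diff = statistics.stdev(diffs)      # sample standard deviation
se_diff = sd_diff / math.sqrt(n)       # standard error of the mean difference
t_stat = mean_diff / se_diff           # t statistic with n - 1 degrees of freedom

print(f"mean = {mean_diff:.5f}, sd = {sd_diff:.5f}, se = {se_diff:.5f}")
print(f"t = {t_stat:.3f}, df = {n - 1}")
```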

ANOVA – Analysis of Variance

ANOVA stands for analysis of variance. It was developed by Ronald Fisher in 1918 and extends the t-test and the z-test. The t-test and z-test are commonly used, but they cannot be applied to more than two groups. ANOVA, also called Fisher's analysis of variance, analyzes the variance between and within the groups whenever there are more than two groups.

Analysis of variance provides a way to determine whether one or more discrete-level factors (independent variables) influence an outcome measurement (a quantitative dependent variable).

Relationship amongst T Test, Analysis of Variance, Analysis of Covariance, & Regression

Illustrative Applications of One-way ANOVA

Effect of In-store Promotion on Sales

The department store is attempting to determine the effect of in-store promotion (X) on sales (Y). The table shows the store sales in thousands of Rwandan francs (Yij) for each level of promotion.

            Level of In-Store Promotion
Store #     High   Medium   Low
1           10     8        5
2           9      8        7
3           10     7        6
4           8      9        4
5           9      6        5
6           8      4        2
7           9      5        3
8           7      5        2
9           7      6        1
10          6      4        2
Total       83     62       37

Solution


Ha:

Step 2: Level of significance:

Step 3: Test Distribution

Steps using SPSS

1. Create the data file. We have two variables: sales in one column and level of promotion in a second column, which takes three values. See the figure below.

Descriptives: Sales

                                                       95% Confidence Interval for Mean
         N    Mean   Std. Deviation   Std. Error   Lower Bound   Upper Bound   Minimum   Maximum
High     10   8.30   1.337            .423         7.34          9.26          6         10
Medium   10   6.20   1.751            .554         4.95          7.45          4         9
Low      10   3.70   2.003            .633         2.27          5.13          1         7
Total    30   6.07   2.532            .462         5.12          7.01          1         10

Interpretation: This table gives the means and standard deviations of sales for each level of in-store promotion. The mean represents average sales. The average sales for the high level of promotion are clearly higher than for the other levels, and the variability for the high level is lower than for the others. However, we cannot conclude that sales differ between levels without examining the statistical significance of the result (the F test).

ANOVA Output

ANOVA: Sales

                 Sum of Squares   df   Mean Square   F        Sig.
Between Groups   106.067          2    53.034        17.941   .000
Within Groups    79.800           27   2.956
Total            185.867          29

Making a decision and interpreting the result: This is the main ANOVA result. The significance value comparing the groups is Sig. = .000 < 0.05, so we reject the null hypothesis. At the 5% level of significance, at least one population mean is different.
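The sums of squares and the F ratio can be recomputed directly from the store data. The sketch below uses plain Python; the variable names are illustrative. Because it works with unrounded mean squares, the F value may differ from the reported one in the third decimal place.

```python
# Sales by level of in-store promotion (stores 1 to 10).
high = [10, 9, 10, 8, 9, 8, 9, 7, 7, 6]
medium = [8, 8, 7, 9, 6, 4, 5, 5, 6, 4]
low = [5, 7, 6, 4, 5, 2, 3, 2, 1, 2]
groups = [high, medium, low]

all_values = [x for g in groups for x in g]
grand_mean = sum(all_values) / len(all_values)

# Between-groups sum of squares: group size times squared distance
# of each group mean from the grand mean.
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)

# Within-groups sum of squares: squared distance of each value
# from its own group mean.
ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

df_between = len(groups) - 1               # K - 1
df_within = len(all_values) - len(groups)  # N - K

ms_between = ss_between / df_between
ms_within = ss_within / df_within
f_stat = ms_between / ms_within

print(f"SS between = {ss_between:.3f}, SS within = {ss_within:.3f}")
print(f"MS between = {ms_between:.3f}, MS within = {ms_within:.3f}")
print(f"F = {f_stat:.2f} with ({df_between}, {df_within}) df")
```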

Note: If the test of ANOVA is significant, continue the analysis with Post-hoc

Post‐hoc Analysis – Tukey’s Honestly Significant Difference (HSD) Test

When an F turns out to be significant, we know, with some degree of confidence, that there is a real difference somewhere among our means. But if there are more than two groups, we don’t know where that difference is. Post hoc tests have been designed for doing pair-wise comparisons after a significant F is obtained.

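How the values in the ANOVA table are obtained:

df between groups: K - 1 = 3 - 1 = 2 (number of groups - 1)

df within groups: N - K = 30 - 3 = 27 (total number of observations - number of groups)

df total: N - 1 = 30 - 1 = 29 (total number of observations - 1)

Mean square between groups: 106.067 / 2 = 53.034

F = 53.034 / 2.956 = 17.941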

Interpretation: These are the post-hoc tests, which show where the differences lie. You can see that all levels of promotion differ significantly from one another.

Homogeneous Subsets

Sales, Tukey HSD

                           Subset for alpha = 0.05
Level of promotion   N     1      2      3
Low                  10    3.7
Medium               10           6.2
High                 10                  8.3
Sig.                       1      1      1

Means for groups in homogeneous subsets are displayed.

a. Uses Harmonic Mean Sample Size = 10.000.

Interpretation: we can see that the mean for the High level of promotion (8.3) is greater than the means for the Medium and Low groups, after adjustment for multiple comparisons.

Finally, the results revealed statistically significant differences among the levels of promotion, F = 17.941, p-value (Sig.) = .000. Post-hoc Tukey tests revealed statistically significant differences between the levels of promotion: High (mean = 8.3, SD = 1.34), Medium (mean = 6.2, SD = 1.75) and Low (mean = 3.7, SD = 2.00).

Assumption: Normality Check

Ho: The variables are normally distributed

Ha: The variables are not normally distributed

Tests of Normality

                              Kolmogorov-Smirnov (a)       Shapiro-Wilk
        Level of promotion    Statistic   df   Sig.        Statistic   df   Sig.
Sales   High                  .200        10   .200*       .932        10   .466
        Medium                .153        10   .200*       .932        10   .473
        Low                   .202        10   .200*       .935        10   .498

*. This is a lower bound of the true significance.

a. Lilliefors Significance Correction

Since all p-values (Sig.) are greater than .05, we do not reject Ho and conclude that the variables follow a normal distribution, i.e. the normality assumption may be taken as valid.

Interpretation: Box plot (to check whether there are outliers and whether the boxes are symmetric). Both graphs show that sales are approximately normally distributed for each level of promotion.

Assumption: homogeneity of variance

This table (Levene’s test) tests the assumption of equal variances for the ANOVA.

Test of Homogeneity of Variances: Sales

Levene Statistic   df1   df2   Sig.
1.353              2     27    0.275

Interpretation: Look at the Sig. or p-value (.275), which is above .05. The p-value given in the last column is large enough that the assumption of constant variances should not be rejected. The result indicates that the equal variances assumption is met.

Making a decision and interpreting the result: Sig. = 0.275 > 0.05, so there is not enough evidence to reject the null hypothesis of equal variances.

Simple and multiple regression and correlation analysis

The purpose of this analysis is twofold: to establish the relationship between two or more variables (Correlation Analysis), and to establish a mathematical model that estimates the value of one variable based on the values of the other variables (Regression Analysis).

Linear Regression

Linear Regression estimates the coefficients of the linear equation, involving one or more independent variables that best predict the value of the dependent variable. For example, you can try to predict a salesperson's total yearly sales (the dependent variable) from independent variables such as age, education, and years of experience.

Data Considerations

Data. The dependent and independent variables should be quantitative. Categorical variables, such as religion, major field of study, or region of residence, need to be recoded to binary (dummy) variables or other types of contrast variables.

Assumptions. For each value of the independent variable, the distribution of the dependent variable must be normal. The variance of the distribution of the dependent variable should be constant for all values of the independent variable. The relationship between the dependent variable and each independent variable should be linear, and all observations should be independent.

Example

The following data give the selling price, square footage, number of bedrooms, and age of houses that have sold in a neighborhood in the last 6 months. Carry out a multiple correlation and regression analysis to predict selling price.

Selling Price ($)   Square Footage   Bedrooms   Age
64000               1670             2          30
59000               1339             2          25
61500               1712             3          30
79000               1840             3          40
87500               2300             3          18
92500               2234             3          30
95000               2311             3          19
113000              2377             3          7
115000              2736             4          10
138000              2500             3          1
142500              2500             4          3
144000              2479             3          3
145000              2400             3          1
147500              3124             4          0
144000              2500             3          2
155500              4062             4          10

Steps to do a regression and correlation analysis

1. Scatter-Plot

Prior to entering your data on the Data Editor screen (shown below), you must define your variables on the Variable View screen (not shown). Once you have defined your variables (X as your predictor and Y as your criterion), enter the data as shown below on the Data Editor screen.

Start your analysis by creating a scatter plot: Graphs → Legacy Dialogs → Scatter/Dot

After clicking on Graphs and selecting Scatter/Dot, you will be given the option to select the type of plot, shown in the next figure. Highlight the Simple Scatter option and click Define.

The scatter plot will be produced and displayed on the Output Viewer.

2. Simple Correlation between Selling price and Square Footage

Ho: there is no association between Selling price and Square Footage

Ha: there is an association between them

Correlation between Selling price and Age

Ho: there is no association between Selling price and Age

Ha: there is an association between them

We can see that the Pearson correlation coefficient for the two variables is r = -.881**. This value indicates a strong, inverse linear relationship between Selling price and Age. Furthermore, it is significant (Sig. = .000, i.e. p < 0.001).

3. Regression Analysis

After creating a scatter plot, you should run a regression analysis. The regression analysis will produce regression coefficients, a correlation coefficient, and an ANOVA table.

Begin by selecting Analyze → Regression → Linear (shown below).

The output of the analysis is shown below.

Descriptive Statistics

                    Mean        Std. Deviation   N
Selling Price ($)   114588.24   35901.443        17
Square Footage      2407.29     618.457          17
Bedrooms            3.12        0.6              17
Age                 13.65       13.052           17

Interpretation: The average selling price of the houses sold in the neighborhood in the last 6 months is 114588.24, and the variability around the mean (standard deviation) is 35901.443, which means the data are heterogeneous (CV = 35901.443 / 114588.24 = 31%).

Multiple correlation

The Model Summary table reports the correlation coefficient as multiple R in the first column, the R Square statistic in the second column, and the Durbin-Watson statistic in the last column.

Interpret Multiple Correlation (check the following table)

Multiple correlation = .941: there is a high association between Selling price and the set of predictors Square footage, Bedrooms and Age (the model improves when the independent variables are used together).

Interpret Coefficient of Determination

R square=.885; i.e. 88.5% of the variation in the selling price, can be explained by variation in square footage, bedroom and age of the houses that have sold in a neighborhood in the last 6 months.

Model Summary (b)

Model   R          R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .941 (a)   .885       .859                13482.331                    2.280

a. Predictors: (Constant), Houses' Age, Bedrooms, Square Footage

b. Dependent Variable: Selling Price ($)

Interpret autocorrelation by Durbin Watson

DW = 2.280. This value is slightly greater than 2 but still close to 2, which indicates that there is no evidence of autocorrelation; the assumption of independent errors is therefore taken as met.

Note: Autocorrelation

Autocorrelation exists when the error terms (residuals) of a regression equation are correlated with one another, typically across adjacent observations. It should not be confused with multicollinearity, which exists when the independent variables are highly correlated among themselves (if only two predictors are correlated, we have collinearity).

The most common test for autocorrelation is the Durbin-Watson statistic, which tests for serial correlation between adjacent error terms. The statistic ranges from 0 to 4.

DW < 2 suggests positive autocorrelation (common)

DW ≈ 2 suggests no autocorrelation (ideal)

DW > 2 suggests negative autocorrelation (rare)
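To make the rule of thumb concrete, the Durbin-Watson statistic is the sum of squared differences between adjacent residuals divided by the sum of squared residuals. The sketch below is a minimal Python illustration; both residual series are hypothetical and are not taken from the house-price example.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences divided by sum of squared residuals."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

# Hypothetical residuals that flip sign at every step (negative autocorrelation).
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

# Hypothetical residuals that drift slowly (positive autocorrelation).
drifting = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]

print(f"alternating residuals: DW = {durbin_watson(alternating):.3f}")
print(f"drifting residuals:    DW = {durbin_watson(drifting):.3f}")
```

The alternating series gives a value above 2 and the drifting series a value well below 2, in line with the guide above.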


Ho: β1 = β2 = β3 = 0

Ha: At least one βj ≠ 0

F = 33.484 (see the table below), Sig. = .000

Making a decision and interpreting the result: Since Sig. = .000 < 0.05, we reject the null hypothesis, indicating that at least one of the explanatory variables (square footage, number of bedrooms or age of the house) is related to the price at which the houses sold.

ANOVA (a)

Model          Sum of Squares   df   Mean Square   F        Sig.
1 Regression   18259565357      3    6086521786    33.484   .000 (b)
  Residual     2363052290       13   181773253.1
  Total        20622617647      16

a. Dependent Variable: Selling Price ($)

b. Predictors: (Constant), Age, Bedrooms, Square Footage
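The F statistic, R Square and Adjusted R Square reported by SPSS all follow from the sums of squares in the regression ANOVA table. The following Python sketch recomputes them; the variable names are illustrative, and n = 17 is taken from the total degrees of freedom (16) plus 1.

```python
import math

# Sums of squares copied from the regression ANOVA table.
ss_regression = 18259565357
ss_residual = 2363052290
ss_total = 20622617647

n = 17   # number of cases (total df + 1)
k = 3    # number of predictors

# Mean squares and the F ratio.
ms_regression = ss_regression / k
ms_residual = ss_residual / (n - k - 1)
f_stat = ms_regression / ms_residual

# Coefficient of determination, multiple R and adjusted R square.
r_squared = ss_regression / ss_total
multiple_r = math.sqrt(r_squared)
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

print(f"F = {f_stat:.3f}")
print(f"R = {multiple_r:.3f}, R square = {r_squared:.3f}")
print(f"Adjusted R square = {adj_r_squared:.3f}")
```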

Using multiple regression, find a model that will help explain current sales, and determine which variable is the best predictor.

Y = 82373.7208 + 25.859 (square footage) – 2127.798 (number of bedrooms) – 1714.821 (age of house)

Coefficients (a)

                   Unstandardized Coefficients   Standardized Coefficients
Model              B            Std. Error       Beta                        t        Sig.
1 (Constant)       82373.720    23072.294                                    3.570    .003
  Square Footage   25.859       9.638            .445                        2.683    .019
  Bedrooms         -2127.798    8872.079         -.036                       -.240    .814
  Age_houses       -1714.821    328.054          -.623                       -5.227   .000

a. Dependent Variable: Selling Price ($)

Application: use the model found above to predict the selling price of a 10-year-old, 2000-square-foot house with 3 bedrooms.

Solution:

Y = 82373.7208 + 25.859 (square footage) – 2127.798 (number of bedrooms) – 1714.821 (age of house)

Y = 82373.7208 + 25.859 (2000) – 2127.798 (3) – 1714.821 (10) = 110560.116
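The same prediction can be written as a small Python function, which makes it easy to try other houses. The function name and argument names are illustrative; the coefficients are the ones from the fitted equation above.

```python
def predict_price(square_footage, bedrooms, age):
    """Predicted selling price ($) from the fitted regression equation."""
    return (
        82373.7208
        + 25.859 * square_footage
        - 2127.798 * bedrooms
        - 1714.821 * age
    )

# House from the application: 2000 square feet, 3 bedrooms, 10 years old.
predicted = predict_price(2000, 3, 10)
print(f"Predicted selling price: {predicted:.3f}")
```

Predictions are most trustworthy for houses within the range of the observed data (for example, ages between 0 and 40 years).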


Interpretation: the residuals follow a normal distribution, and they also satisfy the assumption of homoscedasticity.

Nonparametric Statistics

Nonparametric tests are sometimes called distribution-free tests because they are based on fewer assumptions (e.g., they do not assume that the outcome is approximately normally distributed). Parametric tests involve specific probability distributions (e.g., the normal distribution) and the tests involve estimation of the key parameters of that distribution (e.g., the mean or difference in means) from the sample data. The cost of fewer assumptions is that nonparametric tests are generally less powerful than their parametric counterparts.

Parametric vs Nonparametric Statistics

• Parametric Statistics are statistical techniques based on assumptions about the population from which the sample data are collected.

– Assumption that data being analyzed are randomly selected from a normally distributed population.

– Requires quantitative measurements that yield interval or ratio level data.

• Nonparametric Statistics are based on fewer assumptions about the population and the parameters.

– Sometimes called “distribution-free” statistics.

– A variety of nonparametric statistics are available for use with nominal data, ordinal data and even quantitative data (converted to ordinal).

– They are used when the data are quantitative but do not meet the assumptions of normality and homogeneity of variance, including the case of more than two groups.

Nonparametric statistics (or tests) based on the ranks of measurements are called rank statistics (or rank tests).

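Purpose                    Samples               Design        Data                      Test
-                          1 sample              -             Quantitative              -
-                          1 sample              -             Qualitative               Chi-Square, Binomial
Compare groups             2 samples             Independent   Quantitative              Mann-Whitney U
Compare groups             2 samples             Related       Quantitative              Wilcoxon
Compare groups             2 samples             Related       Before-After              McNemar
Compare groups             More than 2 samples   Independent   Quantitative              Kruskal-Wallis H
Compare groups             More than 2 samples   Related       Quantitative              Friedman
Compare groups             More than 2 samples   Related       Qualitative               Cochran's Q
Association of variables   2 samples             -             Discrete or categorical   Chi-Square Independent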

Advantages and Disadvantages of nonparametric tests

1. Do not involve population parameters

• Example: probability distributions, independence

2. Data measured on any scale

• Ratio or interval

• Ordinal (example: Good-Better-Best)

• Nominal (example: Male-Female)

Chi-Square Test of Independence (χ²)

A chi-square (χ²) statistic is used to investigate whether distributions of categorical variables differ from one another. Categorical variables yield data in categories, and numerical variables yield data in numerical form. Responses to such questions as "What is your major?" or "Do you own a car?" are categorical because they yield data such as "Business" or "no". In contrast, responses to such questions as "How tall are you?" are numerical. Numerical data can be either discrete or continuous. The table below may help you see the differences between these types of variables.

Data Type              Question Type                    Possible Responses
Categorical            What is your sex?                male or female
Numerical-Discrete     How many children do you have?   0 or 1 or 2 or…
Numerical-Continuous   How tall are you?                75 inches

A test of independence assesses whether paired observations on two variables, expressed in a contingency table, are independent of each other (e.g. polling responses from people of different nationalities to see if one's nationality affects the response).

Steps in Hypothesis Testing

1. Formulate the appropriate null and alternative hypothesis

Ho: The variables are independent

Example

Use the SPSS sample data file "demo.sav".

This is a hypothetical data file that concerns a purchased customer database, for the purpose of mailing monthly offers. Whether or not the customer responded to the offer is recorded, along with various demographic information.

Crosstabulation tables (contingency tables) display the relationship between two or more categorical (nominal or ordinal) variables. The size of the table is determined by the number of distinct values for each variable, with each cell in the table representing a unique combination of values. Numerous statistical tests are available to determine whether there is a relationship between the variables in a table.

Interpreting results

What factors affect the products that people buy?

The most obvious is probably how much money people have to spend. In this example we’ll examine the relationship between income level and PDA (personal digital assistant) ownership.

The cells of the table show the count or number of cases for each joint combination of values. For example, 455 people in the income range $25,000–$49,000 own PDAs.

None of the numbers in this table, however, stand out in an obvious way, indicating any obvious relationship between the variables.


Significance Testing for Crosstabulations

The purpose of a cross tabulation is to show the relationship (or lack thereof) between two variables.

Although there appears to be some relationship between the two variables, is there any reason to believe that the differences in PDA ownership between different income categories are anything more than random variation?

A number of tests are available to determine if the relationship between two crosstabulated variables is significant. One of the more common tests is chi-square. One of the advantages of chi-square is that it is appropriate for almost any kind of data.

Pearson chi-square tests the hypothesis that the row and column variables are independent.

Hypothesis

Ho: The variables Income and Owns PDA are independent.

Ha: The variables Income and Owns PDA are related.

Level of significance: α = .05

Test statistic: χ² = 37.677, Sig. = .000

Chi-Square Tests

                               Value        df   Asymp. Sig. (2-sided)
Pearson Chi-Square             37.677 (a)   3    .000
Likelihood Ratio               37.313       3    .000
Linear-by-Linear Association   36.537       1    .000
N of Valid Cases               6400

a. 0 cells (0.0%) have expected count less than 5. The minimum expected count is 228.73.

Making a Decision and Interpreting the Result: Since the calculated χ² = 37.677 is greater than the critical value of 7.81 (df = 3, α = .05), we reject Ho. The p-value (Sig. = .000) is also less than .05, which supports rejecting Ho. At the 5% level of significance, the variables Income and Owns PDA are related.
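The link between the critical value and the p-value can be checked numerically. For 3 degrees of freedom the upper-tail probability of the chi-square distribution has a closed form, which the Python sketch below uses; the function name is illustrative and the formula applies only to df = 3.

```python
import math

def chi2_upper_tail_df3(x):
    """P(X >= x) for a chi-square variable with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

# Probability beyond the tabulated critical value: should be close to alpha = .05.
p_at_critical = chi2_upper_tail_df3(7.815)

# p-value of the observed Pearson chi-square statistic.
p_observed = chi2_upper_tail_df3(37.677)

print(f"P(chi-square >= 7.815)  = {p_at_critical:.4f}")
print(f"P(chi-square >= 37.677) = {p_observed:.2e}")
```

The observed p-value is far below .001, which is why SPSS displays it as .000.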


Differences between independent groups

Mann-Whitney U-Test

The Mann-Whitney Test is one of the most powerful of the nonparametric tests for comparing two independent groups when the dependent variable is either ordinal or continuous, but not normally distributed. For example, you could use the Mann-Whitney U test to understand whether attitudes towards pay discrimination, where attitudes are measured on an ordinal scale, differ based on gender (i.e., your dependent variable would be "attitudes towards pay discrimination" and your independent variable would be "gender", which has two groups: "male" and "female"). Alternately, you could use the Mann-Whitney U test to understand whether salaries, measured on a continuous scale, differed based on educational level (i.e., your dependent variable would be "salary" and your independent variable would be "educational level", which has two groups: "high school" and "university"). The Mann-Whitney U test is often considered the nonparametric alternative to the independent t-test although this is not always the case.

Nonparametric counterpart of the t test for independent samples

• Does not require normally distributed populations

• May be applied to ordinal data

• Actual measurements not used – ranks of the measurements used

Assumptions

– Independent samples

– At least ordinal data

Steps in Hypothesis Testing

1. Formulate the appropriate null and alternative hypothesis

Ho: Both groups are the same (there is no difference)

Ha: Both groups are not the same

Example

Is there sufficient evidence to indicate a difference in the average height of the groups? The data are in the following table 5.2:

Heights of males (cm)     193   188   185   183   180   178   170
Heights of females (cm)   175   173   168   165   163

Solution

Ho: Male and female students are the same height (the distributions of heights for the two groups are equal)

Ha: Male and female students are not the same height (the distributions of heights for the two groups are not equal)

Mann-Whitney U Rank Sum Test – Solution in SPSS


Solution

Decision: Sig. = .010 < .05, so we reject the null hypothesis at α = .05.

Conclusion: The height of males is greater than the height of females.
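With samples this small, the Mann-Whitney U statistic and its exact two-sided significance can be obtained by direct counting. The following Python sketch is an illustration of the idea, not the SPSS algorithm; the function and variable names are hypothetical.

```python
from itertools import combinations

males = [193, 188, 185, 183, 180, 178, 170]
females = [175, 173, 168, 165, 163]

def u_statistic(sample_a, sample_b):
    """Number of pairs (a, b) with a > b; a tie counts as one half."""
    return sum(
        1.0 if a > b else 0.5 if a == b else 0.0
        for a in sample_a
        for b in sample_b
    )

# U is the smaller of the two possible counts.
u_females = u_statistic(females, males)
u_males = u_statistic(males, females)
u = min(u_females, u_males)

# Exact two-sided p-value: under Ho every choice of 5 of the 12 pooled
# heights is equally likely to be the female group.
pooled = males + females
extreme = 0
total = 0
for idx in combinations(range(len(pooled)), len(females)):
    chosen = [pooled[i] for i in idx]
    rest = [pooled[i] for i in range(len(pooled)) if i not in idx]
    u_a = u_statistic(chosen, rest)
    if min(u_a, len(chosen) * len(rest) - u_a) <= u:
        extreme += 1
    total += 1
p_exact = extreme / total

print(f"U = {u:.0f}")
print(f"exact two-sided p = {extreme}/{total} = {p_exact:.3f}")
```

The exact p-value obtained this way agrees with the Sig. value of .010 used in the decision above.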


Differences between dependent groups

                                                         Parametric                     Nonparametric
Compare two variables measured in the same sample        t-test for dependent samples   Sign test; Wilcoxon's matched pairs test or McNemar
If more than two variables are measured in same sample   Repeated measures ANOVA        Friedman's two-way analysis of variance

Wilcoxon Signed-Rank Test

The Wilcoxon signed-rank test is a non-parametric statistical hypothesis test used when comparing two related samples, matched samples, or repeated measurements on a single sample to assess whether their population mean ranks differ (i.e. it is a paired difference test). It can be used as an alternative to the paired Student's t-test (the t-test for dependent samples) when the population cannot be assumed to be normally distributed, for example in before-and-after studies, studies in which measures are taken on the same person or object under different conditions, or studies of twins or other relatives.

• Assumptions

– Random samples

– Populations are continuous

• Can use the normal approximation if ni ≥ 15

1. Formulate the appropriate null and alternative hypothesis

H0: both samples come from the same underlying distribution (there is no difference)

Ha: the samples do not come from the same underlying distribution

Example

The mayor of a city wants to see whether pollution levels are reduced by closing the streets to car traffic. The rate of pollution is measured every 60 minutes (from 8 am to 10 pm, a total of 15 measurements) on a day when the streets are open to traffic and on a day when they are closed to traffic. The air pollution data are in the following table.

Rate of pollution in different situations

With traffic: 214, 159, 169, 202, 103, 119, 200, 109, 132, 142, 194, 104, 219, 119, 234

Without traffic: 159, 135, 141, 101, 102, 168, 62, 167, 174, 159, 66, 118, 181, 171, 112

The two groups are clearly paired, because the readings are linked: we are considering the same city (with its particular weather, ventilation, etc.), albeit on two different days. Since we cannot assume a Gaussian distribution for the recorded values, we must proceed with a non-parametric test, the Wilcoxon signed rank test.

Solution:

H0: both samples come from the same underlying distribution

Ha: the samples do not come from the same underlying distribution

n = 15

Wilcoxon Signed Rank Test – Solution in SPSS

Wilcoxon Report in SPSS

Ranks

                                                  N       Mean Rank   Sum of Ranks
Without_traffic - With_traffic   Negative Ranks   9 (a)   8.89        80
                                 Positive Ranks   6 (b)   6.67        40
                                 Ties             0 (c)
                                 Total            15

a. Without_traffic < With_traffic

b. Without_traffic > With_traffic

c. Without_traffic = With_traffic

Test Statistics (a)

                         Without_traffic - With_traffic
Z                        -1.136 (b)
Asymp. Sig. (2-tailed)   0.256

a. Wilcoxon Signed Ranks Test

b. Based on positive ranks.

Making a Decision and Interpreting the Result of the Test from the SPSS output

Since Sig. = 0.256 > 0.05, we cannot reject the null hypothesis at α = 0.05. There is no evidence of unequal distributions. Therefore, there is no evidence that closing the roads to traffic improved the rate of pollution.
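The rank sums, Z and the asymptotic significance in the SPSS report can be reproduced step by step. The Python sketch below uses the normal approximation and assumes there are no zero differences and no tied absolute differences, which holds for this data set; with ties, average ranks and a tie correction would be needed.

```python
import math

with_traffic = [214, 159, 169, 202, 103, 119, 200, 109, 132, 142, 194, 104, 219, 119, 234]
without_traffic = [159, 135, 141, 101, 102, 168, 62, 167, 174, 159, 66, 118, 181, 171, 112]

# Paired differences (Without_traffic - With_traffic), dropping zeros.
diffs = [wo - w for wo, w in zip(without_traffic, with_traffic) if wo != w]
n = len(diffs)

# Rank the differences by absolute value (rank 1 = smallest).
ordered = sorted(diffs, key=abs)
w_pos = sum(rank for rank, d in enumerate(ordered, start=1) if d > 0)
w_neg = sum(rank for rank, d in enumerate(ordered, start=1) if d < 0)

# Normal approximation for the smaller rank sum.
mean_w = n * (n + 1) / 4
sd_w = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
z = (min(w_pos, w_neg) - mean_w) / sd_w

# Two-sided p-value from the standard normal distribution.
p_two_sided = math.erfc(abs(z) / math.sqrt(2))

print(f"sum of positive ranks = {w_pos}, sum of negative ranks = {w_neg}")
print(f"Z = {z:.3f}, asymptotic two-sided p = {p_two_sided:.3f}")
```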

Nonparametric Tests for Multiple Independent Samples

The nonparametric tests for multiple independent samples are useful for determining whether or not the values of a particular variable differ between two or more groups. This is especially true when the assumptions of ANOVA are not met.

When the assumptions behind the standard ANOVA are invalid or suspect, you should consider using the nonparametric procedures designed to test for the significance of the difference between multiple groups. They are called nonparametric because they make no assumptions about the parameters (such as the mean and variance) of a distribution, nor do they assume that any particular distribution is being used.


Kruskal-Wallis H

When a researcher wishes to compare three or more groups or populations and the data are ordinal, the Kruskal-Wallis test is the appropriate statistical technique. It is also appropriate when the data are numerical but the underlying populations cannot be assumed to be normally distributed or to have equal variances. It is used to test the null hypothesis that all populations have identical distribution functions against the alternative hypothesis that at least two of the samples differ only with respect to location (median), if at all. For example, you could use a Kruskal-Wallis H test to understand whether exam performance, measured on a continuous scale from 0-100, differed based on test anxiety levels (i.e., your dependent variable would be "exam performance" and your independent variable would be "test anxiety level", which has three independent groups: students with "low", "medium" and "high" test anxiety levels). Alternately, you could use the Kruskal-Wallis H test to understand whether attitudes towards pay discrimination, where attitudes are measured on an ordinal scale, differed based on job position (i.e., your dependent variable would be "attitudes towards pay discrimination", measured on a 5-point scale from "strongly agree" to "strongly disagree", and your independent variable would be "job position", which has three independent groups: "shop floor", "middle management" and "boardroom").

Data types that can be analyzed with Kruskal-Wallis H

• the data points must be independent from each other

• the distributions do not have to be normal and the variances do not have to be equal

• you should ideally have more than five data points per sample

• all individuals must be selected at random from the population

• all individuals must have equal chance of being selected

• sample sizes should be as equal as possible but some differences are allowed

Formulate the appropriate null and alternative hypothesis

Ho: Identical Distribution

Ha: At Least 2 Differ

Specify the desired level of significance (α)

α (level of significance); typical values are .01, .05, or .10

Example

An advertising agency employs three different film production companies to produce its television commercials. The advertising agency has taken a sample of five commercials from each of the production houses, and agency executives have ranked the production quality of the commercials from best quality (1) to lowest quality (15). These ranks are shown in the next table. Notice that the advertising agency considered two commercials to be of equal quality. Hence, rather than being ranked 3 and 4, the two commercials are each ranked 3.5. The data are in the following Table 5.4.

Solution

Ho: Identical Distribution

Ha: At Least 2 Differ

α = .05

Example with SPSS