SPSS VS 23 TUTORIAL
Overview of SPSS
SPSS stands for “Statistical Package for the Social Sciences”. The current version of SPSS is 25, and the next version is already in the works (it is called the anniversary version, marking 50 years of SPSS).
IBM SPSS Statistics is a tool for managing your statistical data and research. SPSS is a multipurpose, graphic and statistical data storage system.
SPSS for Windows has 3 main windows:
Data Editor, which has two views, selected by tabs at the bottom of the window
Viewer (or Draft Viewer), which displays the output files
Syntax Editor, which displays syntax files
The Data Editor has two parts:
1. Variable View window, which displays metadata or information about the data in the active file, such as variable names and labels, value labels, formats, and missing value indicators.
2. Data View window, which displays data from the active file in spreadsheet format which holds the data in a rectangular format with cases as rows and variables as columns. Data can be directly entered or imported from another program using menu commands. (Cut-and-paste is possible, but not advised.) Errors in data entry can also be directly corrected here.
Beginning an SPSS Session
Begin by opening SPSS 23 for Windows.
1. Click on the IBM SPSS shortcut button on your desktop. OR
2. Go to START, click on PROGRAMS, and click on IBM SPSS.
Variable View: used to define the type of information that is entered into each column in Data View. In Data View, rows are cases: each row represents a different case. A case is a set of observations about one person, one country, one object, one experiment, etc.
Menus and Toolbars
Various pull-down menus appear at the top of the Data Editor window. These pull-down menus are at the heart of using SPSS. The Data Editor Menu items (with some of the uses of the menu) are:
FILE: Standard options for opening, saving, printing and exiting.
EDIT: Used to copy and paste data values, find data in a file, and insert variables and cases.
VIEW: Options for showing/hiding toolbars and displaying values or their labels in the Data Editor.
DATA: Identify duplicate cases, merge files, split file, select cases, weight cases, etc.
TRANSFORM: Compute new variables, recode variables.
ANALYZE: This menu provides access to the statistical procedures for analyzing your data set. All the items on the Analyze menu have submenus.
DIRECT MARKETING: Allows you to perform advanced analysis of clients or contacts to improve your marketing campaigns and maximize the ROI of your marketing budget.
GRAPHS: Provides options to create high quality plots and charts.
UTILITIES: Used to display information on individual variables, add comments to accompany the data file, and access other advanced features.
ADD-ONS: The SPSS extension packages are additional features that you can add to SPSS (advanced statistical procedures).
WINDOW: Provides options for switching between the data, syntax and navigator windows.
HELP: Contains the SPSS help system. For example, Help|Case Studies provides hands-on examples of how to create various types of statistical analyses and how to interpret the results.
Data Entry into SPSS
SPSS runs on Windows and Mac operating systems, but the focus of these notes is Windows.
Data entry (the workspace and labels)
A. Enter the data manually by typing directly in Data View
B. Enter the data into other software such as Excel, then import it into SPSS
A. Data entry for the first option: typing directly in Data View
Typing in data
Click on Variable View. The tabs at the bottom of the window are labeled “Data View” and “Variable View”. In “Data View” you can enter and edit data for all of your cases, while in “Variable View” you can view, enter, and edit information about the variables themselves (see below).
“Name”: in this column, write the name of each variable you need to enter (click on row 1 to start). The name is your own choice, but make it understandable and do not start it with a number or symbol, since SPSS will not accept it. You also cannot use spaces in the name. For example: “education_level”.
“Type”: indicates the variable type. The most common types are Numeric (accepts only numerical data, for example age or number of children) and String (also accepts letters, e.g. for qualitative questions). Typically, all responses in a questionnaire are converted into numbers, for example “Male = 1 and Female = 2”.
“Width”: normally leave this unchanged. It is the number of characters that can be typed in the data cell. The default for numeric and string variables is 8, which only needs to be altered if you want to type in long numbers or whole sentences.
“Decimals”: enter the number of decimal places for the variable (for example, the variable Age has zero decimals).
“Label”: for many variables you will want to enter a label, which is a human-readable alternate name for the variable. The labels replace the variable names on much of the output, but the names are still used for specifying variables for analyses. The variable label is an explanation of what the variable is, e.g. if the variable name is sex then the label might be “gender of the respondent”.
“Values”: click on the “...” button in this column to open the Value Labels dialog box. Enter the values and corresponding labels for all of the levels of the variable you are defining, if appropriate. When you are finished, verify that all of the information in the box is correct, and then click OK to complete the process.
“Missing”: this column is very important. In survey data some questions are left unanswered, and for a good analysis these empty cells are usually recoded with a special code such as 9, 99 or 999. Write that code in the Missing column so that it is not taken into account in the analysis. Alternatively, the cell for a missing value can simply be left blank.
“Columns” and “Align”: keep the default settings.
“Measure”: this column is very important. Choose “Scale” for all quantitative variables; “Ordinal” and “Nominal” are the other options. In many parts of SPSS, you will see a visual reminder of the Measure of your variables in the form of icons: a small diagonal yellow ruler indicates a scale variable, a small three-level bar graph with increasing bar heights indicates an ordinal variable, and three colored balls with one on top and two below indicate a nominal variable.
Note: Categorical. Data with a limited number of distinct values or categories (for example, gender or religion). Categorical variables can be numeric variables that use numeric codes to represent categories (for example, 1 = male and 2 = female). Also referred to as qualitative data. Categorical variables can be either nominal or ordinal.
Nominal. A variable can be treated as nominal when its values represent categories with no intrinsic ranking (for example, gender).
Ordinal. A variable can be treated as ordinal when its values represent categories with some intrinsic ranking (for example, levels of service satisfaction from highly dissatisfied to highly satisfied).
Scale. Data measured on an interval or ratio scale, where the data values indicate both the order of values and the distance between values.
Example
Name      Type      Width   Decimals   Label            Values
ID        Numeric   8       0          Identification   None
gender    Numeric   8       0          Gender           1 = male, 2 = female
age       Numeric   8       0          Age in years     None
age_cat   Numeric   8       0          Age category     1 = <18, 2 = 18 - 24, 3 = 25 - 34, etc.
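The roles of the Values and Missing columns can be sketched outside SPSS. The short Python example below is only an illustration, not SPSS functionality: the dictionaries, the function names and the missing codes 99 and 999 for age are assumptions made for this sketch, and the ages are made up.

```python
# Hypothetical stand-in for the Variable View: value labels and
# user-missing codes per variable (names and codes are assumptions).
VALUE_LABELS = {
    "gender": {1: "male", 2: "female"},
    "age_cat": {1: "<18", 2: "18 - 24", 3: "25 - 34"},
}
MISSING_CODES = {"age": {99, 999}}  # assumed codes for unanswered questions


def label_for(variable, value):
    """Return the value label for a code, or the raw value if no label is defined."""
    return VALUE_LABELS.get(variable, {}).get(value, value)


def valid_values(variable, values):
    """Drop blank cells (None) and user-missing codes before analysis."""
    missing = MISSING_CODES.get(variable, set())
    return [v for v in values if v is not None and v not in missing]


ages = [23, 99, 31, None, 45]  # made-up ages; 99 and None mean "no answer"
valid_ages = valid_values("age", ages)

print(label_for("gender", 1))             # label shown instead of the code 1
print(sum(valid_ages) / len(valid_ages))  # mean computed from valid cases only
```

If the code 99 were not declared as missing, it would be treated as a real age and would inflate the mean, which is why the Missing column matters.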
After creating all your variables, you can go to the “Data View”, and then you can enter the data, and edit data for all of your cases, see the following example:
To enter data, simply highlight a cell, enter the number, and then press return. You will be in the next cell down which will be highlighted.
Notice how the values for gender are coded as 1 and 2. It is possible to display their value labels instead. Click on the Value Labels button on the toolbar at the top; the values are now shown as male and female, just as we coded them before.
B. Second option, using data in Excel
Data from an Excel spreadsheet can be imported into SPSS as follows:
2. Locate the file of interest:
In “Files of type”, select the type of file you want to import, in this case Excel (*.xls), then browse to the file and open it.
Check that the box labelled “Read variable names from the first row of data” is ticked and click OK (that is, if the first row in Excel contains your variable names; otherwise leave it un-ticked).
Note: if you want to transfer data from Excel to SPSS it is a good idea to ensure that any categorical data (e.g. yes/no/don’t know, male/female, etc.) are entered in Excel as numeric data (codes) rather than text. For example, you could always code ’No’ as 0 and ’Yes’ as 1, and so on.
After reading in the data it is a good idea to ’type’ in SPSS what the codes for your categorical variables are. This ensures that tables and graphs are labelled appropriately. More detailed instructions:
1. Click on the Variable View tab in the bottom left hand corner of the Data Editor window
2. Look at the row for the variable you’re dealing with and go to the Values column. Click on the word None
3. Click on the little grey square (with dots in it) on the right
4. Enter the first value (code), e.g. 1, and the corresponding label, e.g. male, then click on Add
5. Repeat until you have entered all the labels and codes for this variable, then click OK
6. Repeat this process for the other categorical variables.
Cleaning data after importing data files
Key in values and labels for each variable.
Run frequencies for each variable.
Check the outputs to see if any variables have wrong values.
Check missing values against the physical surveys if you use paper surveys, and make sure they are really missing.
Sometimes you need to recode string variables into numeric variables.
Exporting Data to Excel
Click on FILE ⇒ SAVE AS. Type the file name for the file to be exported. For “Save as type”, select Excel (*.xls) from the pull-down menu and click on Save.
Transform the variables (Computing variables and Recoding variables)
To transform the variables, first open one of the sample files supplied with SPSS, named Survey_sample.sav. Steps for opening files from SPSS:
With this file open, go on to transform the variables:
Compute variable (Creating new variables)
Creating new variables (data transformation) is commonly needed, depending on what you are trying to do.
Recode Variables
Use this option when you need to transform the original data into intervals. There are two forms: “Recode into same variables” and “Recode into different variables”.
Follow the steps shown in the figure above. We would like to transform “number of children” into the following intervals:
0 = None
1 = 1 – 4
2 = 5 or More
Write the name and label of the new variable and click Change > Click on “Old and New Values”
The following figure appears. Enter each old value (or range of values) and the new value you want to assign to it.
Next, return to the “Variable View” and, for the new variable that you just created, write the labels of the intervals in the Values column:
0 = None 1 = 1 – 4 2 = 5 or More
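The logic of this recode can be sketched as a small Python function. This is only an illustration of the rule, not SPSS syntax; the function name and the sample values are assumptions.

```python
# Labels for the new codes, as entered in the Values column.
INTERVAL_LABELS = {0: "None", 1: "1 – 4", 2: "5 or More"}


def recode_children(n):
    """Map a number of children to the interval codes 0, 1 or 2."""
    if n is None:          # keep missing values missing
        return None
    if n < 0:
        raise ValueError("number of children cannot be negative")
    if n == 0:
        return 0           # None
    if n <= 4:
        return 1           # 1 – 4
    return 2               # 5 or More


children = [0, 2, 5, None, 8, 4]  # made-up values for illustration
recoded = [recode_children(n) for n in children]
print(recoded)
print([INTERVAL_LABELS.get(code, "missing") for code in recoded])
```

Writing the result to a new list mirrors “Recode into different variables”: the original values stay available for later checks.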
Computing Variables
To create a simple data transformation, in which a new variable is the result of applying a mathematical formula to one or more existing variables, use “Compute Variable”.
To apply “Compute Variable” we will continue to use the “Survey_Sample” data.
Create a new variable, Confidence (the average confidence): add the answers to the 6 questions about confidence and calculate the mean of these responses: (confinan + conbus + coneduc + conpress + conmedic + contv)/6.
Steps in SPSS for computing a new variable. Follow the figure below.
Move all the variables you want to combine into the numeric expression (separated by + as in the formula above, or by commas inside a function), then click OK.
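The formula for the new variable can be sketched in Python for a single case. The six variable names come from the formula above; the function name and the answers of the example respondent are made up for illustration. Returning no value when any answer is missing is an assumption that mirrors how an arithmetic expression behaves when one of its inputs is missing.

```python
ITEMS = ["confinan", "conbus", "coneduc", "conpress", "conmedic", "contv"]


def confidence(case):
    """Average of the six confidence items; None if any item is missing."""
    values = [case.get(name) for name in ITEMS]
    if any(v is None for v in values):
        return None
    return sum(values) / 6


# One hypothetical respondent (answers invented for this sketch).
respondent = {"confinan": 1, "conbus": 2, "coneduc": 2,
              "conpress": 3, "conmedic": 1, "contv": 3}
print(confidence(respondent))
```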
Running Analyses (frequency: Calculating tables, statistics, and graphics)
Analyzing data using Frequencies
To perform simple analyses and obtain frequency tables: Analyze > Descriptive Statistics > Frequencies
Generally, a frequency table is used for looking at categorical data in detail. Categorical data are values of variables such as gender, where, for example, males are coded as “1” and females are coded as “2”. Frequencies options include a table showing counts and percentages; statistics including percentile values, central tendency, dispersion and distribution; and charts including bar charts, pie charts and histograms.
Steps for using the frequencies procedure:
Click Analyze > Descriptive Statistics > Frequencies and select your variables for analysis. You can then choose statistics options, choose chart options, and have SPSS calculate your request.
Analyzing Categorical variable
For this example we are going to check out "Happiness of marriage (hapmar)" on the file "Survey_Sample". We will look at this variable for our initial investigation.
Continue > OK
Output from Frequencies:
The major part of the display shows the value labels (Very happy, Pretty happy, Not too happy, and Total), and the missing categories, NAP (Not Appropriate), DK (Don’t Know), and NA (Not Answered).
In a written paper, you should state that the “Valid Percent” excludes the “missing” answers.
Analyzing numerical variable (Scale in SPSS)
Follow the same steps as for the categorical variable. For this example we are going to look at the distribution of “number of children”, “age” and “highest year of school completed” in the “Survey_Sample” file.
Since the variables we are going to analyze are measured at the interval/ratio level, different statistics from our previous example will be used.
Basic statistical analysis (Descriptive statistics) Procedure:
Analyze > Descriptive Statistics > Frequencies (click), and the Frequencies box appears > First click on “Number of children”. Click the select arrow in the middle and SPSS will place it in the Variable(s) box. Follow the same steps to choose “Age” and “Highest year of school completed” > click on Statistics and select the statistics you want.
Next, click the Continue button to return to the main Frequencies dialog box > Click OK in the main Frequencies dialog box and SPSS will calculate and display the output shown in the following figure:
Statistics
                          Number of    Age of        Highest year of
                          children     respondent    school completed
N             Valid       2825         2828          2820
              Missing     7            4             12
Mean                      1.82         45.56         13.25
Median                    2            42            13
Std. Deviation            1.69         17.1          2.928
Skewness                  1.07         0.567         -0.256
Std. Error of Skewness    0.046        0.046         0.046
Kurtosis                  1.386        -0.527        1.069
Std. Error of Kurtosis    0.092        0.092         0.092
Range                     8            71            20
Minimum                   0            18            0
Maximum                   8            89            20
Percentiles   25          0            32            12
              50          2            42            13
              75          3            57            16
Interpretation: The average number of children per respondent is 1.82, with a standard deviation of 1.69. The smallest number of children reported was 0 and the largest was 8. Now look at the SPSS output for “highest year of school completed”. It has a mean of 13.25 (just over 1 year beyond high school) and a standard deviation of 2.928. Some respondents reported 0 years of school completed. The highest education reported was 20 years.
Skewness: a measure of the asymmetry of a distribution. The normal distribution is symmetric and has a skewness ≈ 0.
Right or positive skewness: a long right tail (Skw > 0). Left or negative skewness: a long left tail (Skw < 0). As a general rule of thumb:
If skewness is between -0.5 and 0.5, the distribution is approximately symmetric.
If skewness is between -1 and -0.5 or between 0.5 and 1, the distribution is moderately skewed.
If skewness is less than -1 or greater than 1, the distribution is highly skewed.
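The rule of thumb can be written as a small Python function and applied to the three skewness values in the Statistics table. This is a sketch: the function name is an assumption, and treating the boundary values 0.5 and 1 as belonging to the lower category is a choice made here, since the rule does not say.

```python
def describe_skewness(skew):
    """Classify a skewness value using the rule of thumb above."""
    size = abs(skew)
    if size <= 0.5:
        return "approximately symmetric"
    direction = "right" if skew > 0 else "left"
    strength = "moderately" if size <= 1 else "highly"
    return f"{strength} skewed to the {direction}"


# Skewness values taken from the Statistics table.
table = {
    "Number of children": 1.07,
    "Age of respondent": 0.567,
    "Highest year of school completed": -0.256,
}
for name, skew in table.items():
    print(f"{name}: {describe_skewness(skew)}")
```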
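Kurtosis: a measure of how peaked or flat a distribution is compared with the normal distribution. The kurtosis K reported by SPSS is approximately 0 for a normal distribution:
If K ≈ 0, the curve corresponding to the frequency distribution is mesokurtic (as peaked as the normal or Gaussian curve).
If K > 0, the curve is leptokurtic (more peaked than the normal curve).
If K < 0, the curve is platykurtic (data values are flatter and more dispersed along the X axis).
Some textbooks use the percentile coefficient of kurtosis instead, whose value for the normal curve is 0.263; the SPSS output above does not use that coefficient, so compare the reported kurtosis with 0.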
Note: For the interpretation of the remaining statistics, we suggest reviewing chapter 1 of the course.
Analysis with Custom Table (Numerical Variable)
To create the custom table, choose from the menu: Analyze > Tables > Custom Tables
Next, select the variables and drag and drop them to the “Rows” position. It is important that the level of measurement of all variables that you analyze is set correctly, because the default settings for the table will be based on that.
It is also important that all variables that you selected have exactly the same value labels.
Next, click the “Summary Statistics” button to select the statistics to display in the table. The window shown on the right pops up. We can add statistics to the table and remove others. In this example we have chosen to add Standard Deviation, Minimum and Maximum.
Note that you can rearrange the order in which the selected statistics will be displayed. When you are satisfied with your selection, click on “Apply to Selection” and “OK”.
Output from Custom Tables for numerical variables
                                   Mean   Standard Deviation   Minimum   Maximum
Number of children                 2      2                    0         8
Age of respondent                  46     17                   18        89
Highest year of school completed   13     3                    0         20
Analysis with Custom Table (Categorical Variable)
For categorical variables follow the same steps as above, but make sure that the variables you choose have the same level of measurement (in our case, nominal or ordinal). In Summary Statistics, choose Table N %.
Output from Custom Tables for categorical variables
                                   Count   Table N %
Gender           Male              1232    43.5%
                 Female            1600    56.5%
                 Total             2832    100.0%
Marital status   Married           1346    47.5%
                 Widowed           283     10.0%
                 Divorced          446     15.8%
                 Separated         93      3.3%
                 Never married     663     23.4%
                 Total             2831    100.0%
Highest degree   LT High school    430     15.2%
                 High school       1500    53.2%
                 Junior college    209     7.4%
                 Bachelor          478     16.9%
                 Graduate          205     7.3%
                 Total             2822    100.0%
A stacked table can include other variables in other dimensions. For example, you could cross two variables stacked in the rows with a third variable displayed in the column dimension.
1. Open the table builder again (Analyze menu, Tables, Custom Tables).
2. If Gender and Highest degree aren't already stacked in the rows, follow the directions above for stacking them.
3. Drag and drop General Happiness from the variable list to the Columns area on the canvas pane.
4. By default the table shows only Count; for easier interpretation, select “Column N %” in Summary Statistics and click Apply to All.
5. Click OK to create the table.
Output from Crosstabulation Tables
                                   General happiness
                                   Very happy            Pretty happy          Not too happy
                                   Count   Column N %    Count   Column N %    Count   Column N %
Gender           Male              373     41.9%         712     45.2%         133     39.1%
                 Female            518     58.1%         863     54.8%         207     60.9%
                 Total             891     100.0%        1575    100.0%        340     100.0%
Highest degree   LT High school    132     14.9%         197     12.5%         97      28.6%
                 High school       420     47.3%         889     56.6%         179     52.8%
                 Junior college    79      8.9%          113     7.2%          13      3.8%
                 Bachelor          163     18.4%         273     17.4%         39      11.5%
                 Graduate          94      10.6%         98      6.2%          11      3.2%
                 Total             888     100.0%        1570    100.0%        339     100.0%
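To see where the Column N % figures come from, the sketch below recomputes them in Python from the Gender counts in the table above. Each count is divided by the total of its own column. The data layout and function name are assumptions made for this illustration.

```python
# Counts for Gender by General happiness, copied from the table above.
counts = {
    "Very happy":    {"Male": 373, "Female": 518},
    "Pretty happy":  {"Male": 712, "Female": 863},
    "Not too happy": {"Male": 133, "Female": 207},
}


def column_percent(column):
    """Percentage of each row category within one column, to 1 decimal."""
    total = sum(column.values())
    return {row: round(100 * n / total, 1) for row, n in column.items()}


for happiness, column in counts.items():
    print(happiness, column_percent(column))
```

Because the percentages are computed within each column, they answer the question “among the very happy respondents, what share are men?”, not “what share of men are very happy?”.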
Building graphics
The first question is: which graph should you select? The answer depends on the type of variable you will analyze. Many statistical procedures include an option to generate charts. In addition, you can choose from the Legacy Dialogs (accessed from the main Graphs menu), which lead to a series of different graph/chart types from which to select. In all cases, you will then be asked to define the graph by specifying the variables to be displayed and how they are to be plotted.
The precise information required and the format of the dialogue boxes will vary according to the type of graph you select.
If you are not sure what each chart type looks like, select Graphboard Template Chooser (from the main graph menu). You have to ensure that the variable has the correct Measure (nominal, ordinal or scale) defined in Variable View.
Graph – Chart Builder
The Chart Builder dialog box is an interactive window that allows you to preview how a chart will look while you build it.
1. Click the Gallery tab if it is not selected.
The Gallery includes many different predefined charts, which are organized by chart type. The Basic Elements tab also provides basic elements (such as axes and graphic elements) for creating charts from scratch, but it's easier to use the Gallery.
2. Click Bar if it is not selected.
Icons representing the available bar charts in the Gallery appear in the dialog box. The pictures should provide enough information to identify the specific chart type.
3. Drag the icon for the simple bar chart onto the "canvas" which is the large area above the Gallery.
The Chart Builder displays a preview of the chart on the canvas. Note that the data used to draw the chart are not your actual data.
4. You add variables by dragging them from the Variables list, which is located to the left of the canvas. A variable's measurement level is important in the Chart Builder. You are going to use the Highest degree variable on the x axis.
5. Now drag highest degree from the Variables list to the x axis drop zone.
The ‘y’ axis drop zone defaults to the Count statistic. If you want to use another statistic (such as percentage or mean), you can easily change it.
6. Click Element Properties to display the Element Properties window.
The Element Properties window allows you to change the properties of the various chart elements.
Output from Chart Builder
Finally: Once your graph is produced you can edit it further like any piece of SPSS Output by double-clicking on it.
Building the clustered bar chart
Suppose you want to show how the distribution of marital status [marital] of the survey respondents differs by gender [sex]. For this you need a “clustered bar chart”.
Output
Box plot
Begin by entering your data. You should have two columns: one for your dependent variable, and one for your grouping (independent variable).
In the Variable View, make sure that the appropriate scale is selected for each variable. The dependent variable should be a scale variable, and the grouping variable should be ordinal or nominal.
This chart divides the data into four areas of equal frequency. The central box contains the middle 50% of the data, and a line inside the box marks the median (if this line is at the center of the box, the distribution is symmetric). Whiskers are drawn from each end of the box. The left (or lower) whisker ends at the data value closest to, but not beyond, Q1 - 1.5 * IQR, while the right (or upper) whisker ends at the data value closest to, but not beyond, Q3 + 1.5 * IQR. Values beyond the whiskers are outliers, and values greater than Q3 + 3 * IQR or less than Q1 - 3 * IQR are considered extreme outliers (in SPSS they are marked with “o” and “*”, respectively). So from top to bottom, here’s what’s in the boxplot:
Top whisker: Q3 + 1.5 * IQR
Top of box: 3rd quartile
Line in the middle: median
Bottom of box: 1st quartile
Bottom whisker: Q1 - 1.5 * IQR
Steps
Select the Boxplot option under the gallery. Select the type of boxplot you want to create, and drag it to the main window.
Output
Interpretation. The distribution is not normal and there are notable outliers: the cases beyond the lower whisker of our boxplot. Most of the outliers are at the lowest end of the distribution, that is, people with little or no education. There are also more observations above than below the median.
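The whisker and outlier limits described for the box plot can be sketched in Python. The example uses the quartiles of “highest year of school completed” from the Statistics table (Q1 = 12, Q3 = 16); the function names are assumptions, and the limits shown are the theoretical fences, not the exact whisker ends, which depend on the actual data values.

```python
def boxplot_fences(q1, q3):
    """Return the IQR, the 1.5 * IQR fences and the 3 * IQR fences."""
    iqr = q3 - q1
    return {
        "iqr": iqr,
        "inner": (q1 - 1.5 * iqr, q3 + 1.5 * iqr),  # whisker limits
        "outer": (q1 - 3 * iqr, q3 + 3 * iqr),      # extreme-outlier limits
    }


def classify(value, q1, q3):
    """Label a value as inside the fences, an outlier, or an extreme outlier."""
    fences = boxplot_fences(q1, q3)
    low_out, high_out = fences["outer"]
    low_in, high_in = fences["inner"]
    if value < low_out or value > high_out:
        return "extreme outlier"
    if value < low_in or value > high_in:
        return "outlier"
    return "inside"


# Quartiles of "highest year of school completed" from the Statistics table.
fences = boxplot_fences(12, 16)
print(fences)
print(classify(4, 12, 16))   # a respondent with 4 years of school
print(classify(20, 12, 16))  # the maximum reported, 20 years
```

With these quartiles the lower fence is 6 years, so only respondents with very little schooling fall outside it, while the maximum of 20 years is still inside the upper fence of 22.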
INFERENTIAL STATISTICS
Inferential statistics. The methods used to estimate a property of a population on the basis of a sample.
Statistical inference involves two main types of techniques: parameter estimation and hypothesis testing. Whatever the technique used, the overall purpose is to use data from a probability sample to draw conclusions about a population.
• Data are observations about the variable being measured.
Parameter Estimation
The objective of estimation is to determine the approximate value of a population parameter on the basis of a sample statistic.
- Confidence interval
A confidence interval gives an estimated range of values which is likely to include an unknown population parameter, the estimated range being calculated from a given set of sample data. (Definition taken from Valerie J. Easton and John H. McColl.)
Application example 1
We want to estimate the monthly cost of cell phone use for students at AUCA. We took a sample of 20 students and asked them how much they spend in a month on their cell phone. The answers are shown in the table below:
Monthly expenditure on cell phone
10000 5000 4500 7000 5500 3500 2000 1500 500 4872 800 5800 4873 5801 10000 9500 7800 2570 6531 5842
Calculate a 95% confidence interval for the average monthly cell phone expenditure of AUCA students and interpret the result.
Steps in SPSS
First create the variable in “Variable View” and then enter the data in “Data View”. Next click on Analyze > Compare Means > One-Sample T Test and continue with the steps shown in the following figure.
Output from Compare Means
One-Sample Statistics
                    N    Mean      Std. Deviation   Std. Error Mean
Monthly_Cellphone   20   5194.45   2842.309         635.560
One-Sample Test
Test Value = 0
                                                                     95% Confidence Interval of the Difference
                    t       df   Sig. (2-tailed)   Mean Difference   Lower      Upper
Monthly_Cellphone   8.173   19   .000              5194.450          3864.21    6524.69
Interpretation. With 95% confidence, the mean monthly cell phone expenditure of AUCA students is between 3864.21 and 6524.69 Rwf.
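The numbers in this output can be reproduced by hand. The Python sketch below uses only the standard library; the critical value 2.093 (Student's t, 19 degrees of freedom, two-sided 95%) is taken from a t table and hard-coded, which is an assumption of this sketch, so the limits can differ from the SPSS output in the second decimal.

```python
import math
import statistics

# Monthly expenditure on cell phone (Rwf), from the table above.
spending = [10000, 5000, 4500, 7000, 5500, 3500, 2000, 1500, 500, 4872,
            800, 5800, 4873, 5801, 10000, 9500, 7800, 2570, 6531, 5842]

T_CRIT = 2.093  # t table value for df = 19, 95% two-sided (hard-coded)

n = len(spending)
mean = statistics.mean(spending)
sd = statistics.stdev(spending)   # sample standard deviation (n - 1)
se = sd / math.sqrt(n)            # standard error of the mean
lower = mean - T_CRIT * se
upper = mean + T_CRIT * se

print(f"n = {n}, mean = {mean:.2f}, sd = {sd:.3f}, se = {se:.3f}")
print(f"95% CI: {lower:.2f} to {upper:.2f}")
```

The interval is simply the sample mean plus or minus the critical value times the standard error, which is why a larger sample (smaller standard error) gives a narrower interval.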
Hypothesis
A statistical hypothesis is a statement on a probabilistic model and a hypothesis test is a method to determine the possibility of that statement based on a sample.
Continue using the same example 1 given previously
Test the claim that the true mean monthly cell phone bill of AUCA students is less than ten thousand Rwf. To test this hypothesis we took a sample of 20 students and asked them how much they spend in a month on their cell phone. The answers are shown in the table below:
Monthly expenditure on cell phone
10000 5000 4500 7000 5500 3500 2000 1500 500 4872 800 5800 4873 5801 10000 9500 7800 2570 6531 5842
Steps
1. Specify the population value of interest
Mean monthly cell phone bill of AUCA students
2. Formulate the appropriate null and alternative hypotheses
Ho: μ ≥ 10000
Ha: μ < 10000 (This is a lower tail test)
3. Specify the desired level of significance
Output from SPSS
One-Sample Test
Test Value = 10000
                                                                      95% Confidence Interval of the Difference
                     t        df   Sig. (2-tailed)   Mean Difference   Lower       Upper
Monthly Cell Phone   -7.561   19   .000              -4805.55          -6135.79    -3475.31
Making the decision and interpreting the result: The p-value (Sig.) is reported as .000, that is, less than 0.001, so it is below α = 0.01 and we reject the null hypothesis at the 1% level (and therefore also at the 5% level). The difference found is highly significant: there is evidence supporting the alternative hypothesis, i.e. AUCA students spend less than ten thousand Rwf per month on their cell phones.
Note: The p-value (Sig.) that SPSS gives by default is two-tailed. For a one-tailed test, divide it by 2 (.000/2 = .000), provided the sample mean lies on the side stated in the alternative hypothesis.
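The test statistic in this output can be checked with a few lines of Python. As a sketch, the decision below uses the critical-value approach instead of the p-value: the one-sided critical value 1.729 (Student's t, 19 degrees of freedom, α = 0.05) is taken from a t table and hard-coded, and the choice of α = 0.05 is an assumption made for the illustration.

```python
import math
import statistics

# Monthly expenditure on cell phone (Rwf), same sample as before.
spending = [10000, 5000, 4500, 7000, 5500, 3500, 2000, 1500, 500, 4872,
            800, 5800, 4873, 5801, 10000, 9500, 7800, 2570, 6531, 5842]

TEST_VALUE = 10000   # value stated in the null hypothesis
T_CRIT_ONE = 1.729   # t table value, df = 19, one-sided alpha = 0.05

n = len(spending)
mean = statistics.mean(spending)
se = statistics.stdev(spending) / math.sqrt(n)
t = (mean - TEST_VALUE) / se

# Lower tail test: reject Ho when t falls below the negative critical value.
reject_h0 = t < -T_CRIT_ONE
print(f"mean difference = {mean - TEST_VALUE:.2f}, t = {t:.3f}, df = {n - 1}")
print("reject Ho" if reject_h0 else "do not reject Ho")
```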
The independent t-test compares the means of two unrelated groups. For example, you could use an independent t-test to understand whether second year graduate salaries differ by gender.
Dependent variable: second year graduate salaries
Independent variable: gender, which has two groups, “male” and “female”.
Example
A cigarette maker analyzes two different brands to determine their nicotine content. A sample was taken of each brand, with the following results (in milligrams).
Brand A: 24 26 25 22 23
Brand B: 27 28 25 29 26
Do the above results indicate that there is a difference in the average nicotine content of the two brands?
Solution
Formulate the appropriate hypotheses:
Ho: μA = μB (the average nicotine content is the same for both brands)
Ha: μA ≠ μB (the average nicotine content differs between the brands)
Steps using SPSS (Create data file)
Enter the data in SPSS so that the variable “Nicotine” takes up one column and the variable “Brand”, which identifies whether each nicotine value comes from brand A or brand B, takes up another column.
“Nicotine” is the dependent, response or outcome variable, and “Brand” is the independent or factor variable. The two variables should be created as seen in the Data Editor on the right. The Brand variable takes on two possible values: “1” for brand A and “2” for brand B.
When the data entry is complete, follow the steps shown on the next slide.
Output from SPSS
Interpretation: the report shows the descriptive statistics. The average nicotine content of Brand A is lower than that of Brand B, and the standard deviations of the two brands are similar, but we do not yet know whether the observed difference is significant.
So we request the t test for independent samples, which gives t = -3.00. The corresponding Sig. (2-tailed) value is .017, which is lower than .05.
Making the decision and interpreting the result: Since Sig. < 0.05, at the 5% level of significance we can say that there is a significant difference in the average nicotine content of the two brands, i.e. Brand B contains significantly more nicotine than Brand A.
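The t value in this example can be reproduced with the pooled-variance formula, which is the version of the test that assumes equal variances. The Python sketch below uses only the standard library; the two-sided critical value 2.306 (Student's t, 8 degrees of freedom, α = 0.05) is taken from a t table and hard-coded as an assumption of this sketch.

```python
import math
import statistics

brand_a = [24, 26, 25, 22, 23]   # nicotine content in milligrams
brand_b = [27, 28, 25, 29, 26]

T_CRIT = 2.306  # t table value, df = 8, two-sided alpha = 0.05

n_a, n_b = len(brand_a), len(brand_b)
mean_a, mean_b = statistics.mean(brand_a), statistics.mean(brand_b)
var_a, var_b = statistics.variance(brand_a), statistics.variance(brand_b)

# Pooled variance: weighted average of the two sample variances.
df = n_a + n_b - 2
pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / df
se_diff = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
t = (mean_a - mean_b) / se_diff

print(f"mean A = {mean_a}, mean B = {mean_b}, t = {t:.2f}, df = {df}")
print("significant at 5%" if abs(t) > T_CRIT else "not significant at 5%")
```

Both samples have a variance of 2.5, so the pooled variance is also 2.5 and the standard error of the difference is 1, which gives t = (24 - 27)/1.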
Note: Interpreting Box plots in general:
Box plots are used to show overall patterns of response for a group. They provide a useful way to visualize the range and other characteristics of responses for a large group.
Use box plots to check whether there are outliers and whether the boxes are symmetric. The diagram below shows a variety of different box plot shapes and positions.
Some general observations about box plots:
About variability: Box plots 2, 3 and 4 show homogeneity, that is, they are symmetric distributions. Box plot 1, however, shows an asymmetric distribution (negative asymmetry), i.e. the data tend to be concentrated towards the top of the distribution and extend leftward. In context (e.g. marks), the majority of marks or views are concentrated at the higher scores and the lowest scores are more dispersed.
The box plot is comparatively short: see example (2). This suggests that overall students have a high level of agreement with each other.
One box plot is much higher or lower than another: compare (3) and (4). This could suggest a difference between groups. For example, the box plot for (4) may be lower than the equivalent plot for (3).
Obvious differences between box plots: see box plots (1) and (2), (1) and (3), or (2) and (4). Any obvious difference between box plots for comparative groups is worthy of further investigation.
The 4 sections of the box plot are uneven in size: see box plot (1). This shows that many students have similar views at certain parts of the scale, but in other parts of the scale students are more variable in their views. The long upper whisker in the example means that students’ views are varied amongst the most positive quartile group, and very similar for the least positive quartile group.
Same median, different distribution: see box plots (1), (2), and (3). The medians (which generally will be close to the average) are all at the same level. However, the box plots in these examples show very different distributions of views.
Hypothesis testing to determine normality
Ho: The variables follow a normal distribution
Ha: The variables do not follow a normal distribution
Tests of Normality
                              Kolmogorov-Smirnov(a)         Shapiro-Wilk
           Cigarette Brand    Statistic   df   Sig.         Statistic   df   Sig.
Nicotine   Brand A            0.136       5    .200*        0.987       5    0.967
           Brand B            0.136       5    .200*        0.987       5    0.967
a. Lilliefors Significance Correction
Making a Decision and Interpreting the Result of the Test
We use the Shapiro-Wilk statistic because the samples are small.
The p-values (Sig.) are: Brand A: Sig. = .967; Brand B: Sig. = .967
Both p-values from the Shapiro-Wilk test of normality are greater than 0.05, so we do not reject the null hypothesis. This implies that it is acceptable to assume that the nicotine content of both Brand A and Brand B is normally distributed (bell-shaped) in the population.
Assumption of Homogeneity
The Levene test shows whether this assumption, which is very important when comparing groups, is met. SPSS reports it automatically as part of the output.
Ho: the variances are equal
Ha: the variances are not equal
Decision: The p-value (Sig. = 1.000) of the Levene test is greater than 5%, so we cannot reject Ho and we conclude that equal variances can be assumed.
Example
An analyst at a department store wants to evaluate a recent credit card promotion. To this end, 500 cardholders were randomly selected. Half received an ad promoting a reduced interest rate on purchases made over the next three months, and half received a standard seasonal announcement.
This example uses the file creditpromo.sav from SPSS. To begin the analysis, from the menus choose:
Running the Analysis
SPSS Report
Group Statistics
                                    Type of mail insert received   N     Mean        Std. Deviation   Std. Error Mean
$ spent during promotional period   Standard                       250   1566.3890   346.67305        21.92553
                                    New Promotion                  250   1637.5000   356.70317        22.55989
The Descriptive table displays the sample size, mean, standard deviation, and standard error for both groups. On average, customers who received the interest-rate promotion charged about $71 more than the comparison group, and they vary a little more around their average.
The procedure produces two tests of the difference between the two groups. One test assumes that the variances of the two groups are equal. The Levene statistic tests this assumption.
Levene's Test for Equality of Variances
                                                                  F      Sig.
$ spent during promotional period   Equal variances assumed       1.19   0.276
                                    Equal variances not assumed
In this example, the significance value of the statistic is 0.276. Because this value is greater than 0.05, you can assume that the groups have equal variances and ignore the second test displayed.
Independent Samples Test ($ spent during promotional period)
                              Levene's Test for
                              Equality of Variances   t-test for Equality of Means
                              F      Sig.             t       df       Sig. (2-tailed)   Mean Difference   Std. Error Difference
Equal variances assumed       1.19   0.276            -2.26   498      0.024             -71.11095         31.45914
Equal variances not assumed                           -2.26   497.60   0.024             -71.11095         31.45914
The df column displays degrees of freedom (498). For the independent samples t test, this equals the total number of cases in both samples minus 2.
The column labeled Sig. (2-tailed) displays a probability from the t distribution with 498 degrees of freedom (Sig = .024). The value listed is the probability of obtaining an absolute value greater than or equal to the observed t statistic, if the difference between the sample means is purely random.
The Mean Difference (-71.11095) is obtained by subtracting the sample mean for group 2 (the New Promotion group) from the sample mean for group 1.
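The t statistic, its degrees of freedom and the standard error of the difference can be checked by hand from the Group Statistics table. The following is a minimal Python sketch (not SPSS syntax), assuming the equal-variances (pooled) form of the test; the function name pooled_t_from_summary is only illustrative.

```python
import math

def pooled_t_from_summary(mean1, sd1, n1, mean2, sd2, n2):
    """Independent-samples t test (equal variances assumed) from summary statistics."""
    df = n1 + n2 - 2
    # Pooled variance: weighted average of the two sample variances
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    mean_diff = mean1 - mean2  # group 1 (Standard) minus group 2 (New Promotion)
    return mean_diff, se_diff, mean_diff / se_diff, df

# Values taken from the Group Statistics table
mean_diff, se_diff, t_stat, df = pooled_t_from_summary(
    1566.3890, 346.67305, 250,
    1637.5000, 356.70317, 250,
)
print(round(mean_diff, 3), round(se_diff, 3), round(t_stat, 2), df)
```

The results should agree with the "Equal variances assumed" row of the Independent Samples Test table up to rounding.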
Interpretation: Since the significance value of the test is less than 0.05 (Sig = .024), you can safely conclude that the average of 71.11 dollars more spent by cardholders receiving the reduced interest rate is not due to chance alone. The store will now consider extending the offer to all credit customers.
Hypothesis test for the difference in population means (paired or related samples)
One of the most common experimental designs is the "pre-post" design. A study of this type often consists of two measurements taken on the same subject, one before and one after the introduction of a treatment or a stimulus. The basic idea is simple. If the treatment had no effect, the average difference between the measurements is equal to 0 and the null hypothesis holds. On the other hand, if the treatment did have an effect (intended or unintended!), the average difference is not 0 and the null hypothesis is rejected.
The Paired-Samples t test procedure is used to test the hypothesis of no difference between two variables. The data may consist of two measurements taken on the same subject or one measurement taken on a matched pair of subjects.
Example
A group of ten patients newly diagnosed with diabetes was observed to determine whether an educational program was effective in increasing their knowledge of diabetes. A test on self-care aspects of the disease was applied before and after the educational program. The test results were as follows:
Patient 1 2 3 4 5 6 7 8 9 10
Before 75 62 67 70 55 59 60 64 72 59
After 77 65 68 72 62 61 60 67 75 68
Was the educational program effective with respect to patients’ knowledge? Report using the statistical software SPSS 23.0.
1° database
2° Process (order the t-test analysis for related samples)
3° Report
Paired Samples Statistics
                 Mean      N    Std. Deviation   Std. Error Mean
Pair 1  Before   64.3000   10   6.49872          2.05508
        After    67.5000   10   5.79751          1.83333
Interpretation: the report shows the descriptive statistics. The average before the program is lower than the average after it, but we do not yet know whether the observed difference is significant. We therefore request the t test for related samples, which gives t = -3.692, the same value obtained manually, as well as confidence intervals.
Paired Samples Test
                          Paired Differences
                          Mean       Std. Deviation   Std. Error Mean   t        df   Sig. (2-tailed)
Pair 1  Before - After    -3.20000   2.74064          .86667            -3.692   9    .005
Making a decision and interpreting the result: Since Sig. = .005 < 0.05, we reject Ho. Therefore, at the 5% level of significance, we can say that the educational program was effective in increasing patients' knowledge.
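The manual calculation behind the Paired Samples Test table can be sketched in a few lines of Python (an illustration outside SPSS, using only the ten before/after scores listed above). The differences are taken as Before minus After, as in the SPSS report.

```python
import math
import statistics

before = [75, 62, 67, 70, 55, 59, 60, 64, 72, 59]
after = [77, 65, 68, 72, 62, 61, 60, 67, 75, 68]

# Paired differences, Before - After
diffs = [b - a for b, a in zip(before, after)]
n = len(diffs)

mean_diff = statistics.mean(diffs)   # mean of the paired differences
sd_diff = statistics.stdev(diffs)    # sample standard deviation (n - 1)
se_diff = sd_diff / math.sqrt(n)     # standard error of the mean difference
t_stat = mean_diff / se_diff
df = n - 1

print(mean_diff, round(sd_diff, 5), round(se_diff, 5), round(t_stat, 3), df)
```

The mean difference, standard deviation, standard error and t value can be compared directly with the row "Before - After" of the SPSS table.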
ANOVA – Analysis of Variance
ANOVA is a statistical method that stands for analysis of variance. ANOVA was developed by Ronald Fisher in 1918 and is an extension of the t-test and the z-test. The t-test and z-test were commonly used, but the t-test cannot be applied to more than two groups. ANOVA, also called the Fisher analysis of variance, analyzes the variance between and within groups whenever there are more than two groups.
Analysis of variance provides a way to determine whether one or more factors with discrete levels (the independent variables) influence an outcome measurement (a quantitative dependent variable).
Relationship amongst T Test, Analysis of Variance, Analysis of Covariance, & Regression
Illustrative Applications of One-way ANOVA
Effect of In-store Promotion on Sales
The department store is attempting to determine the effect of in-store promotion (X) on sales (Y). The data of the table shows the store sales in thousands of Rwandan francs (Yij) for each level of promotion.
Store #   Level of In-Store Promotion
          High   Medium   Low
1         10     8        5
2         9      8        7
3         10     7        6
4         8      9        4
5         9      6        5
6         8      4        2
7         9      5        3
8         7      5        2
9         7      6        1
10        6      4        2
Total     83     62       37
Solution
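Step 1: Hypotheses
Ho: the mean sales are equal for the three levels of promotion
Ha: at least one mean is different
Step 2: Level of significance: α = 0.05
Step 3: Test distribution: F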
Steps using SPSS
1. Create data file. We have two variables (sales in a column and level in the second column that has three alternatives). See the figure below.
Sales
                                               95% Confidence Interval for Mean
         N    Mean   Std. Deviation   Std. Error   Lower Bound   Upper Bound   Minimum   Maximum
High     10   8.30   1.337            .423         7.34          9.26          6         10
Medium   10   6.20   1.751            .554         4.95          7.45          4         9
Low      10   3.70   2.003            .633         2.27          5.13          1         7
Total    30   6.07   2.532            .462         5.12          7.01          1         10
Interpretation: This table describes the means and standard deviations of each group: Effect of in store promotion on sales. The mean represents average sales. It can be clearly seen that the average sales for high level promotion is higher than other levels, and the variability even for high level promotion is lower than others; However, we cannot reach any conclusion that one level is more significant in sales than another level without examining the statistical importance of the result (information from the F test).
ANOVA Output
ANOVA
Sales
                 Sum of Squares   df   Mean Square   F        Sig.
Between Groups   106.067          2    53.034        17.941   .000
Within Groups    79.800           27   2.956
Total            185.867          29
Making a decision and interpreting the result: This is the main ANOVA result. The significance value is Sig. = .000 < 0.05, so we reject the null hypothesis: at the 5% level of significance, at least one population mean is different.
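The sums of squares, degrees of freedom and F ratio in the ANOVA table can be reproduced from the store data. The sketch below is plain Python, not SPSS output; the function name one_way_anova is illustrative. Because it works with unrounded mean squares, the F value may differ from the tabulated one in the third decimal place.

```python
sales = {
    "High":   [10, 9, 10, 8, 9, 8, 9, 7, 7, 6],
    "Medium": [8, 8, 7, 9, 6, 4, 5, 5, 6, 4],
    "Low":    [5, 7, 6, 4, 5, 2, 3, 2, 1, 2],
}

def one_way_anova(groups):
    """Return (SS between, SS within, df between, df within, F) for a list of samples."""
    all_values = [x for g in groups for x in g]
    n_total, k = len(all_values), len(groups)
    grand_mean = sum(all_values) / n_total
    group_means = [sum(g) / len(g) for g in groups]
    # Between-groups: squared distance of each group mean from the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Within-groups: squared distance of each value from its own group mean
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)
    df_between, df_within = k - 1, n_total - k
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, df_between, df_within, f_ratio

ss_b, ss_w, df_b, df_w, f_ratio = one_way_anova(list(sales.values()))
print(round(ss_b, 3), round(ss_w, 3), df_b, df_w, round(f_ratio, 2))
```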
Note: If the test of ANOVA is significant, continue the analysis with Post-hoc
Post‐hoc Analysis – Tukey’s Honestly Significant Difference (HSD) Test
When an F turns out to be significant, we know, with some degree of confidence, that there is a real difference somewhere among our means. But if there are more than two groups, we don’t know where that difference is. Post hoc tests have been designed for doing pair-wise comparisons after a significant F is obtained.
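How the values in the ANOVA table are obtained:
df Between Groups: K - 1 = 3 - 1 = 2 (number of groups - 1)
df Within Groups: N - K = 30 - 3 = 27 (number of the total data - number of groups)
df Total: N - 1 = 30 - 1 = 29 (number of the total data - 1)
Mean Square Between Groups: 106.067/2 = 53.034
F = 53.034/2.956 = 17.941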
Interpretation: The post-hoc tests show where the differences lie. All levels of promotion differ significantly from one another.
Homogeneous Subsets
Sales
Tukey HSD
                           Subset for alpha = 0.05
Level of promotion   N     1       2       3
Low                  10    3.7
Medium               10            6.2
High                 10                    8.3
Sig.                       1.000   1.000   1.000
Means for groups in homogeneous subsets are displayed.
a. Uses Harmonic Mean Sample Size = 10.000.
Interpretation: we can see that the High level of promotion (mean = 8.3) is greater than the Medium group and the Low group, after adjustment for multiple comparisons.
In summary, the results revealed statistically significant differences among the levels of promotion, F = 17.941, p-value (Sig.) = .000. Post-hoc Tukey tests revealed statistically significant differences between all levels of promotion: High (mean = 8.3, SD = 1.34), Medium (mean = 6.2, SD = 1.75) and Low (mean = 3.7, SD = 2.00).
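The Tukey comparison can be sketched by hand with the honestly significant difference, HSD = q × sqrt(MS within / n). In the Python illustration below, the critical value q ≈ 3.51 for 3 groups and 27 degrees of freedom at α = 0.05 is an assumption taken from a studentized range table (interpolated, not an SPSS output); the means, the within-groups sum of squares and the group size come from the tables above.

```python
import math
from itertools import combinations

means = {"High": 8.3, "Medium": 6.2, "Low": 3.7}
ms_within = 79.800 / 27      # Within Groups sum of squares / df from the ANOVA table
n_per_group = 10

# ASSUMPTION: approximate studentized range critical value q(0.05; k=3, df=27),
# read from a statistical table; it is not part of the SPSS report.
Q_CRIT = 3.51

hsd = Q_CRIT * math.sqrt(ms_within / n_per_group)

# A pair of means differs significantly if the absolute difference exceeds HSD
comparisons = {
    (a, b): abs(means[a] - means[b]) > hsd
    for a, b in combinations(means, 2)
}
for pair, significant in comparisons.items():
    print(pair, round(abs(means[pair[0]] - means[pair[1]]), 1), significant)
```

Under this assumption every pairwise difference (2.1, 2.5 and 4.6) exceeds the HSD, which is consistent with each level forming its own homogeneous subset.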
Assumption: Normality Check
Ho: The variables are normally distributed
Ha: The variables are not normally distributed
Tests of Normality
                              Kolmogorov-Smirnov(a)         Shapiro-Wilk
        Level of promotion    Statistic   df   Sig.     Statistic   df   Sig.
Sales   High                  .200        10   .200*    .932        10   .466
        Medium                .153        10   .200*    .932        10   .473
        Low                   .202        10   .200*    .935        10   .498
*. This is a lower bound of the true significance.
a. Lilliefors Significance Correction
All p-values (Sig.) are greater than .05, so we do not reject Ho. We therefore conclude that the variables follow a normal distribution, that is, the normality assumption may be considered valid.
Interpretation: The box plot is used to check that there are no outliers and that the boxes are symmetric. Both graphs show that sales are approximately normally distributed for each level of promotion.
Assumption: homogeneity of variance
This table (Levene’s test) tests the assumption of equal variances for the ANOVA.
Test of Homogeneity of Variances
Sales
Levene Statistic   df1   df2   Sig.
1.353              2     27    0.275
Interpretation: Look at the sig. or p-value (.275) which is above .05. The p_value given in the last column is sufficiently large to conclude the assumption of constant variances should not be rejected. The result indicates that equal variances assumption is met.
Making a decision and interpreting: Sig. = 0.275 > 0.05, so there is not enough evidence to reject the null hypothesis of equal variances.
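The Levene statistic is a one-way ANOVA F ratio computed on the absolute deviations of each value from its group mean. The following Python sketch assumes this mean-centered version of the test and uses the store sales data from the ANOVA example; the function name levene_statistic is illustrative.

```python
sales = {
    "High":   [10, 9, 10, 8, 9, 8, 9, 7, 7, 6],
    "Medium": [8, 8, 7, 9, 6, 4, 5, 5, 6, 4],
    "Low":    [5, 7, 6, 4, 5, 2, 3, 2, 1, 2],
}

def levene_statistic(groups):
    """Mean-centered Levene statistic with its degrees of freedom (df1, df2)."""
    # Step 1: absolute deviation of every value from its own group mean
    deviations = []
    for g in groups:
        mean_g = sum(g) / len(g)
        deviations.append([abs(x - mean_g) for x in g])
    # Step 2: one-way ANOVA F ratio computed on those deviations
    all_dev = [z for d in deviations for z in d]
    n_total, k = len(all_dev), len(deviations)
    grand_mean = sum(all_dev) / n_total
    dev_means = [sum(d) / len(d) for d in deviations]
    ss_between = sum(len(d) * (m - grand_mean) ** 2 for d, m in zip(deviations, dev_means))
    ss_within = sum((z - m) ** 2 for d, m in zip(deviations, dev_means) for z in d)
    df1, df2 = k - 1, n_total - k
    return (ss_between / df1) / (ss_within / df2), df1, df2

w_stat, df1, df2 = levene_statistic(list(sales.values()))
print(round(w_stat, 3), df1, df2)
```

The statistic and degrees of freedom can be compared with the Test of Homogeneity of Variances table.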
Simple and multiple regression and correlation analysis
The purpose of this analysis is twofold: to establish the relationship between two or more variables (Correlation Analysis), and to build a mathematical model that estimates the value of one variable from the values of the other variables (Regression Analysis).
Linear Regression
Linear Regression estimates the coefficients of the linear equation, involving one or more independent variables that best predict the value of the dependent variable. For example, you can try to predict a salesperson's total yearly sales (the dependent variable) from independent variables such as age, education, and years of experience.
Data Considerations
Data. The dependent and independent variables should be quantitative. Categorical variables, such as religion, major field of study, or region of residence, need to be recoded to binary (dummy) variables or other types of contrast variables.
Assumptions. For each value of the independent variable, the distribution of the dependent variable must be normal. The variance of the distribution of the dependent variable should be constant for all values of the independent variable. The relationship between the dependent variable and each independent variable should be linear, and all observations should be independent.
Example
The following data give the selling price, square footage, number of bedrooms, and age of houses that have sold in a neighborhood in the last 6 months. Carry out a multiple correlation and regression analysis to predict the selling price.
Selling Price ($)   Square Footage   Bedrooms   Age
64000 1670 2 30
59000 1339 2 25
61500 1712 3 30
79000 1840 3 40
87500 2300 3 18
92500 2234 3 30
95000 2311 3 19
113000 2377 3 7
115000 2736 4 10
138000 2500 3 1
142500 2500 4 3
144000 2479 3 3
145000 2400 3 1
147500 3124 4 0
144000 2500 3 2
155500 4062 4 10
Steps to do a regression and correlation analysis
1. Scatter-Plot
Prior to entering your data on the Data Editor screen (shown below), you must define your variables on the Variable View screen (not shown). Once you have defined your variables (X as your predictor and Y as your criterion), enter the data as shown below on the Data Editor screen.
Start your analysis by creating a scatter plot: Graphs → Legacy Dialogs → Scatter/Dot
After clicking on Graphs and selecting Scatter, you will be given the option to select the type of plot- shown on the next figure. Highlight the Simple scatter-plot option and click Define.
The scatter plot will be produced and displayed on the Output Viewer.
2. Simple Correlation between Selling price and Square Footage
Ho: there is no association between Selling price and Square Footage
Ha: there is an association between them
Correlation between Selling price and Age
Ho: there is no association between Selling price and Age
Ha: there is an association between them
We can see that the Pearson correlation coefficient for the two variables is r = -.881**. This value indicates a strong, inverse linear relationship between Selling price and Age. Furthermore, it is significant (Sig. = .000, i.e. p < 0.001).
3. Regression Analysis
After creating a scatter plot, you should run a regression analysis. The regression analysis will produce regression coefficients, a correlation coefficient, and an ANOVA table.
Begin by selecting Analyze → Regression → Linear (shown below).
The output of the analysis is shown below.
Descriptive Statistics
                    Mean        Std. Deviation   N
Selling Price ($)   114588.24   35901.443        17
Square Footage      2407.29     618.457          17
Bedrooms            3.12        0.6              17
Age                 13.65       13.052           17
Interpretation: The average selling price of the houses sold in the neighborhood in the last 6 months is $114,588.24, with a standard deviation of $35,901.44 around the mean. The coefficient of variation (CV = 31%) indicates heterogeneous data.
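The coefficient of variation quoted above is the standard deviation expressed as a percentage of the mean. A short Python check, using the two values from the Descriptive Statistics table:

```python
mean_price = 114588.24   # mean selling price from the Descriptive Statistics table
sd_price = 35901.443     # standard deviation from the same table

# Coefficient of variation: standard deviation as a percentage of the mean
cv_percent = sd_price / mean_price * 100
print(round(cv_percent, 1))
```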
Multiple correlation
The Model Summary table reports the multiple correlation coefficient (R) in the first column, the R Square statistic in the second column, and the Durbin-Watson statistic in the last column.
Interpret Multiple Correlation (Check the following table)
Multiple Correlation = .941: there is a strong association between Selling price and the set of predictors Square footage, Bedrooms and Age (considering the independent variables together improves the model).
Interpret Coefficient of Determination
R square=.885; i.e. 88.5% of the variation in the selling price, can be explained by variation in square footage, bedroom and age of the houses that have sold in a neighborhood in the last 6 months.
Model Summary(b)
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .941(a)   .885       .859                13482.331                    2.280
a. Predictors: (Constant), Houses' Age, Bedrooms, Square Footage
b. Dependent Variable: Selling Price ($)
Interpret autocorrelation by Durbin Watson
DW = 2.280. This value is slightly greater than 2 but still close to 2, which indicates that there is no evidence of autocorrelation. The assumption of independent errors is therefore considered met.
Note: Autocorrelation
Autocorrelation exists when the error terms (residuals) of a regression equation are correlated with one another, typically between adjacent observations. It should not be confused with multicollinearity, which exists when the independent variables are highly correlated among themselves; if only two predictors are correlated, we have collinearity.
The most common test for autocorrelation is the Durbin-Watson statistic, which tests for serial correlation between adjacent error terms. The statistic ranges from 0 to 4.
DW < 2 suggests positive autocorrelation (common)
DW ≈ 2 suggests no autocorrelation (ideal)
DW > 2 suggests negative autocorrelation (rare)
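The Durbin-Watson statistic is the sum of squared differences between adjacent residuals divided by the sum of squared residuals. The Python sketch below illustrates the formula only: the residuals of the house-price model are not listed in this tutorial, so the two residual series used here are hypothetical and chosen to show the extremes of the scale.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences / sum of squares."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

# HYPOTHETICAL residual series, for illustration only (not from the SPSS example)
alternating = [1.0, -1.0, 1.0, -1.0]   # sign flips every step: negative autocorrelation
constant = [1.0, 1.0, 1.0, 1.0]        # identical neighbours: positive autocorrelation

dw_alternating = durbin_watson(alternating)
dw_constant = durbin_watson(constant)
print(dw_alternating, dw_constant)
```

The alternating series gives a value above 2 and the constant series a value below 2, matching the interpretation rules listed above.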
Ho: β1=β2=β3=0
Ha: At least one βj ≠0
F=33.484 (see the table below), Sig. = .000
Making a decision and interpreting the result: Since Sig. = .000 < 0.05, we reject the null hypothesis, indicating that at least one of the explanatory variables (square footage, number of bedrooms or age of the house) is related to the selling price of the houses.
ANOVA(a)
Model          Sum of Squares   df   Mean Square   F        Sig.
1 Regression   18259565357      3    6086521786    33.484   .000(b)
  Residual     2363052290       13   181773253.1
  Total        20622617647      16
a. Dependent Variable: Selling Price ($)
b. Predictors: (Constant), Age, Bedrooms, Square Footage
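The figures in the Model Summary (R, R Square, Adjusted R Square, Std. Error of the Estimate) and the F ratio all follow from the sums of squares in the regression ANOVA table. A minimal Python sketch of these relationships, using n = 17 cases and k = 3 predictors as reported in the output:

```python
import math

ss_regression = 18259565357   # Regression sum of squares
ss_residual = 2363052290      # Residual sum of squares
n_cases = 17                  # N from the Descriptive Statistics table
k_predictors = 3              # square footage, bedrooms, age

ss_total = ss_regression + ss_residual
df_residual = n_cases - k_predictors - 1

r_square = ss_regression / ss_total
multiple_r = math.sqrt(r_square)
# Adjusted R Square penalises R Square for the number of predictors
adj_r_square = 1 - (1 - r_square) * (n_cases - 1) / df_residual
# Std. Error of the Estimate is the square root of the residual mean square
std_error_estimate = math.sqrt(ss_residual / df_residual)
f_ratio = (ss_regression / k_predictors) / (ss_residual / df_residual)

print(round(multiple_r, 3), round(r_square, 3), round(adj_r_square, 3),
      round(std_error_estimate, 3), round(f_ratio, 3))
```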
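Using multiple regression, find a model that helps explain the selling price. Which is the best predictor?
Y = 82373.7208 + 25.859 (square footage) - 2127.798 (number of bedrooms) - 1714.821 (age of house)
Judging by the standardized coefficients (Beta) in the table below, age is the strongest predictor (Beta = -.623), followed by square footage (Beta = .445); the number of bedrooms is not significant (Sig. = .814).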
Coefficients(a)
                   Unstandardized Coefficients   Standardized Coefficients
Model              B           Std. Error        Beta                        t        Sig.
1 (Constant)       82373.720   23072.294                                     3.570    .003
  Square Footage   25.859      9.638             .445                        2.683    .019
  Bedrooms         -2127.798   8872.079          -.036                       -.240    .814
  Age_houses       -1714.821   328.054           -.623                       -5.227   .000
a. Dependent Variable: Selling Price ($)
Application: use the model found above to predict the selling price of a 10-year-old, 2000-square-foot house with 3 bedrooms.
Solution:
Y = 82373.7208 + 25.859 (2000) - 2127.798 (3) - 1714.821 (10)
Y ≈ 110560.12
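The substitution can be written as a small Python function so that other houses can be priced with the same fitted equation. The function name predict_price is illustrative; the coefficients are those of the regression equation above.

```python
def predict_price(square_footage, bedrooms, age):
    """Predicted selling price from the fitted multiple regression equation."""
    return (82373.7208
            + 25.859 * square_footage
            - 2127.798 * bedrooms
            - 1714.821 * age)

# A 10-year-old, 2000-square-foot house with 3 bedrooms
predicted = predict_price(square_footage=2000, bedrooms=3, age=10)
print(round(predicted, 2))
```

Predictions are only meaningful for houses within the range of the data used to fit the model.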
Interpretation: the residuals follow a normal distribution and show homoscedasticity (constant variance).
Nonparametric Statistics
Nonparametric tests are sometimes called distribution-free tests because they are based on fewer assumptions (e.g., they do not assume that the outcome is approximately normally distributed). Parametric tests involve specific probability distributions (e.g., the normal distribution) and the tests involve estimation of the key parameters of that distribution (e.g., the mean or difference in means) from the sample data. The cost of fewer assumptions is that nonparametric tests are generally less powerful than their parametric counterparts.
Parametric vs Nonparametric Statistics
• Parametric Statistics are statistical techniques based on assumptions about the population from which the sample data are collected.
– Assumption that data being analyzed are randomly selected from a normally distributed population.
– Requires quantitative measurements that yield interval or ratio level data.
• Nonparametric Statistics are based on fewer assumptions about the population and the parameters.
– Sometimes called “distribution-free” statistics.
– A variety of nonparametric statistics are available for use with nominal, ordinal and even quantitative data (converted to ordinal).
– Used when the data are quantitative but do not meet the assumptions of normality and homogeneity of variance, including when there are more than two groups.
Nonparametric statistics (or tests) based on the ranks of measurements are called rank statistics (or rank tests).
Choosing a nonparametric test
1 Sample
– Quantitative: -
– Qualitative: Chi-Square, Binomial
Association variables
– 2 Samples, discrete or categorical: Chi-Square Independent
Compare groups
2 Samples
– Independent, quantitative: Mann-Whitney U
– Related, quantitative: Wilcoxon
– Related, before-after: McNemar
More than 2 samples
– Independent, quantitative: Kruskal-Wallis H
– Related, quantitative: Friedman
– Related, qualitative: Cochran's Q
Advantages and Disadvantages of non-parametric tests
1. Do Not Involve Population Parameters
Example: Probability Distributions, Independence
2. Data Measured on Any Scale
Ratio or Interval
Ordinal (example: Good-Better-Best)
Nominal (example: Male-Female)
Chi Square Independent Test (χ²)
A chi square (χ²) statistic is used to investigate whether distributions of categorical variables differ from one another. Categorical variables yield data in categories, and numerical variables yield data in numerical form. Responses to such questions as "What is your major?" or "Do you own a car?" are categorical because they yield data such as "Business" or "no". In contrast, responses to such questions as "How tall are you?" are numerical. Numerical data can be either discrete or continuous. The table below may help you see the differences between these types of variables.
Data Type              Question Type                    Possible Responses
Categorical            What is your sex?                male or female
Numerical-Discrete     How many children do you have?   0 or 1 or 2 or…
Numerical-Continuous   How tall are you?                75 inches
A test of independence assesses whether paired observations on two variables, expressed in a contingency table, are independent of each other (e.g. polling responses from people of different nationalities to see if one's nationality affects the response).
Steps in Hypothesis Testing
1. Formulate the appropriate null and alternative hypothesis
Ho: The variables are independent
Example
Use the SPSS sample data file "demo.sav".
This is a hypothetical data file that concerns a purchased customer database, for the purpose of mailing monthly offers. Whether or not the customer responded to the offer is recorded, along with various demographic information.
Crosstabulation tables (contingency tables) display the relationship between two or more categorical (nominal or ordinal) variables. The size of the table is determined by the number of distinct values for each variable, with each cell in the table representing a unique combination of values. Numerous statistical tests are available to determine whether there is a relationship between the variables in a table.
Interpreting results
What factors affect the products that people buy?
The most obvious is probably how much money people have to spend. In this example we’ll examine the relationship between income level and PDA (personal digital assistant) ownership.
The cells of the table show the count or number of cases for each joint combination of values. For example, 455 people in the income range $25,000–$49,000 own PDAs.
None of the numbers in this table, however, stand out in an obvious way, indicating any obvious relationship between the variables.
Significance Testing for Crosstabulations
The purpose of a cross tabulation is to show the relationship (or lack thereof) between two variables.
Although there appears to be some relationship between the two variables, is there any reason to believe that the differences in PDA ownership between different income categories are anything more than random variation?
A number of tests are available to determine if the relationship between two crosstabulated variables is significant. One of the more common tests is chi-square. One of the advantages of chi-square is that it is appropriate for almost any kind of data.
Pearson chi-square tests the hypothesis that the row and column variables are independent.
Hypothesis
Ho: The variables Income and Owns PDA are independent.
Ha: The variables Income and Owns PDA are related.
Level of significance: α = .05
Test Statistic: χ² = 37.677, Sig. = .000
Chi-Square Tests
                               Value       df   Asymp. Sig. (2-sided)
Pearson Chi-Square             37.677(a)   3    .000
Likelihood Ratio               37.313      3    .000
Linear-by-Linear Association   36.537      1    .000
N of Valid Cases               6400
a. 0 cells (0.0%) have expected count less than 5. The minimum expected count is 228.73.
Making a Decision and Interpreting the Result: the calculated χ² = 37.677 is greater than the critical value of 7.81 (df = 3, α = .05), so we reject Ho. The p-value (Sig.) = .000 < .05 also supports rejecting Ho. At the 5% level of significance, the variables Income and Owns PDA are related.
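The Pearson chi-square statistic compares each observed count with the count expected under independence (row total × column total / N). Since the full Income by Owns PDA crosstabulation is not reproduced in this tutorial, the Python sketch below uses a small hypothetical 2 × 2 table purely to illustrate the calculation; the function name is illustrative as well.

```python
def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df

# HYPOTHETICAL counts, for illustration only (not the demo.sav crosstabulation)
hypothetical_counts = [[10, 20],
                       [30, 40]]
chi2, df = chi_square_independence(hypothetical_counts)
print(round(chi2, 4), df)
```

For a table with 4 income categories and 2 ownership categories, the same formula gives df = (4 - 1)(2 - 1) = 3, as in the SPSS report.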
Differences between independent groups: Mann-Whitney U Test
The Mann-Whitney Test is one of the most powerful of the nonparametric tests for comparing two independent groups when the dependent variable is either ordinal or continuous, but not normally distributed. For example, you could use the Mann-Whitney U test to understand whether attitudes towards pay discrimination, where attitudes are measured on an ordinal scale, differ based on gender (i.e., your dependent variable would be "attitudes towards pay discrimination" and your independent variable would be "gender", which has two groups: "male" and "female"). Alternately, you could use the Mann-Whitney U test to understand whether salaries, measured on a continuous scale, differed based on educational level (i.e., your dependent variable would be "salary" and your independent variable would be "educational level", which has two groups: "high school" and "university"). The Mann-Whitney U test is often considered the nonparametric alternative to the independent t-test although this is not always the case.
Nonparametric counterpart of the t test for independent samples
• Does not require normally distributed populations
• May be applied to ordinal data
• Actual measurements not used – ranks of the measurements used
Assumptions
– Independent Samples
– At Least Ordinal Data
Steps in Hypothesis Testing
1. Formulate the appropriate null and alternative hypothesis
Ho: Both groups are the same (there is no difference)
Ha: Both groups are not the same
Example
Is there sufficient evidence to indicate a difference in the average height of the male and female groups? The data is in the following table 5.2:
Heights of males (cm): 193 188 185 183 180 178 170
Heights of females (cm): 175 173 168 165 163
Solution
Ho: Male and female students are the same height (the distributions of heights for the two groups are equal)
Ha: Male and female students are not the same height (the distributions of heights for the two groups are not equal)
Mann-Whitney U Rank Sum Test – Solution in SPSS
Solution
Decision: Sig. = .010 < .05, so we reject the null hypothesis at α = .05.
Conclusion: The height of males is greater than the height of females.
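The U statistic can be computed by hand from the ranks of the twelve heights. The Python sketch below does this and, because the samples are small, also derives an exact two-sided p-value by enumerating every possible assignment of ranks to the male group. The use of the exact method is an assumption made for this illustration; the function names are illustrative.

```python
from itertools import combinations

males = [193, 188, 185, 183, 180, 178, 170]
females = [175, 173, 168, 165, 163]

def average_ranks(values):
    """Rank values from smallest (rank 1) upward, averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def mann_whitney_u(sample1, sample2):
    """Return the smaller U statistic and the ranks of the pooled data."""
    n1, n2 = len(sample1), len(sample2)
    ranks = average_ranks(sample1 + sample2)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    return min(u1, n1 * n2 - u1), ranks

def exact_two_sided_p(u_observed, ranks, n1):
    """Share of all rank assignments whose U is as small as the observed one."""
    n2 = len(ranks) - n1
    count = total = 0
    for idx in combinations(range(len(ranks)), n1):
        u1 = sum(ranks[i] for i in idx) - n1 * (n1 + 1) / 2
        if min(u1, n1 * n2 - u1) <= u_observed:
            count += 1
        total += 1
    return count / total

u_stat, ranks = mann_whitney_u(males, females)
p_exact = exact_two_sided_p(u_stat, ranks, len(males))
print(u_stat, round(p_exact, 3))
```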
Differences between dependent groups
                                                         Parametric                     Nonparametric
Compare two variables measured in the same sample        t-test for dependent samples   Sign test; Wilcoxon’s matched pairs test or McNemar
If more than two variables are measured in same sample   Repeated measures ANOVA        Friedman’s two way analysis of variance
Wilcoxon Rank
The Wilcoxon signed-rank test is a non-parametric statistical hypothesis test used when comparing two related samples, matched samples, or repeated measurements on a single sample to assess whether their population mean ranks differ (i.e. it is a paired difference test). It can be used as an alternative to the paired Student’s t-test (the t-test for dependent samples) when the population cannot be assumed to be normally distributed, for example in before-and-after studies, studies in which measures are taken on the same person or object under different conditions, and studies of twins or other relatives.
Assumptions
Random Samples
Populations are continuous
Can Use Normal Approximation If ni ≥15
Steps in Hypothesis Testing
1. Formulate the appropriate null and alternative hypothesis
H0: both samples come from the same underlying distribution (there is no difference)
Ha: the two samples do not come from the same underlying distribution
Example
The mayor of a city wants to see if pollution levels are reduced by closing the streets to car traffic. The rate of pollution is measured every 60 minutes (from 8:00 to 22:00, a total of 15 measurements) on a day when traffic is open and on a day when the streets are closed to traffic. The air pollution data are in the following table.
Rate of pollution in different situations
With traffic: 214, 159, 169, 202, 103, 119, 200, 109, 132, 142, 194, 104, 219, 119, 234
Without traffic: 159, 135, 141, 101, 102, 168, 62, 167, 174, 159, 66, 118, 181, 171, 112
It is clear that the two groups are paired, because there is a bond between the readings, consisting in the fact that we are considering the same city (with its peculiarities weather, ventilation, etc.) albeit in two different days. Not being able to assume a Gaussian distribution for the values recorded, we must proceed with a non-parametric test, the Wilcoxon signed rank test.
Solution:
H0: both samples come from the same underlying distribution
Ha: the two samples do not come from the same underlying distribution
n = 15
Wilcoxon Signed Ranks Test – Solution in SPSS
Wilcoxon Report in SPSS
Ranks
                                                  N    Mean Rank   Sum of Ranks
Without_traffic - With_traffic   Negative Ranks   9a   8.89        80
                                 Positive Ranks   6b   6.67        40
                                 Ties             0c
                                 Total            15
a. Without_traffic < With_traffic
b. Without_traffic > With_traffic
c. Without_traffic = With_traffic
Test Statistics(a)
                         Without_traffic - With_traffic
Z                        -1.136b
Asymp. Sig. (2-tailed)   0.256
a. Wilcoxon Signed Ranks Test
b. Based on positive ranks.
Making a Decision and Interpreting the Result of the Test from the SPSS report.
Since Sig. = 0.256 > 0.05, we cannot reject the null hypothesis at α = 0.05. There is no evidence that the two distributions differ. Therefore, there is no evidence that closing the roads to traffic reduced the rate of pollution.
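The sums of ranks, the Z value and the asymptotic significance in the Wilcoxon report can be traced by hand. The Python sketch below ranks the absolute differences (Without traffic minus With traffic) and applies the normal approximation without a tie correction, which is an assumption that suits these data because no two absolute differences are equal.

```python
import math

with_traffic = [214, 159, 169, 202, 103, 119, 200, 109, 132, 142, 194, 104, 219, 119, 234]
without_traffic = [159, 135, 141, 101, 102, 168, 62, 167, 174, 159, 66, 118, 181, 171, 112]

# Paired differences (Without - With); zero differences (ties) are dropped
diffs = [wo - w for wo, w in zip(without_traffic, with_traffic) if wo != w]
abs_sorted = sorted(abs(d) for d in diffs)

def rank_of(value):
    """Rank of an absolute difference (average rank if the value repeats)."""
    first = abs_sorted.index(value) + 1
    last = first + abs_sorted.count(value) - 1
    return (first + last) / 2

w_pos = sum(rank_of(abs(d)) for d in diffs if d > 0)   # sum of positive ranks
w_neg = sum(rank_of(abs(d)) for d in diffs if d < 0)   # sum of negative ranks

n = len(diffs)
mean_w = n * (n + 1) / 4
sd_w = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)

# Normal approximation based on the smaller rank sum, so z is never positive
z = (min(w_pos, w_neg) - mean_w) / sd_w
p_two_sided = 1 + math.erf(z / math.sqrt(2))   # equals 2 * Phi(z) for z <= 0

print(w_pos, w_neg, round(z, 3), round(p_two_sided, 3))
```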
Nonparametric Tests for Multiple Independent Samples
The nonparametric tests for multiple independent samples are useful for determining whether or not the values of a particular variable differ between two or more groups. This is especially true when the assumptions of ANOVA are not met.
When the assumptions behind the standard ANOVA are invalid or suspect, you should consider using the nonparametric procedures designed to test for the significance of the difference between multiple groups. They are called nonparametric because they make no assumptions about the parameters (such as the mean and variance) of a distribution, nor do they assume that any particular distribution is being used.
Kruskal-Wallis H
When a researcher wishes to compare three or more groups or populations and the data are ordinal, the Kruskal-Wallis test is the appropriate statistical technique. It is also appropriate when the data are numerical but it cannot be assumed that the underlying populations are normally distributed or have equal variances. It is used to test the null hypothesis that all populations have identical distribution functions against the alternative hypothesis that at least two of the samples differ only with respect to location (median), if at all. For example, you could use a Kruskal-Wallis H test to understand whether exam performance, measured on a continuous scale from 0-100, differed based on test anxiety levels (i.e., your dependent variable would be "exam performance" and your independent variable would be "test anxiety level", which has three independent groups: students with "low", "medium" and "high" test anxiety levels). Alternately, you could use the Kruskal-Wallis H test to understand whether attitudes towards pay discrimination, where attitudes are measured on an ordinal scale, differed based on job position (i.e., your dependent variable would be "attitudes towards pay discrimination", measured on a 5-point scale from "strongly agree" to "strongly disagree", and your independent variable would be "job description", which has three independent groups: "shop floor", "middle management" and "boardroom").
Data types that can be analyzed with Kruskal-Wallis H
– The data points must be independent from each other
– The distributions do not have to be normal and the variances do not have to be equal
– You should ideally have more than five data points per sample
– All individuals must be selected at random from the population
– All individuals must have equal chance of being selected
– Sample sizes should be as equal as possible but some differences are allowed
Steps in Hypothesis Testing
Formulate the appropriate null and alternative hypothesis
Ho: Identical Distribution
Ha: At Least 2 Differ
Specify the desired level of significance (α)
α (level of significance); typical values are .01, .05, or .10
Example
An advertising agency employs three different film production companies to produce its television commercials. The advertising agency has taken a sample of five commercials from each of the production houses, and agency executives have ranked the production quality of the commercials from best quality (1) to lowest quality (15). These ranks are shown in the next table. Notice that the advertising agency considered two commercials to be of equal quality. Hence, rather than being ranked 3 and 4, the two commercials are each ranked 3.5. The data is in the following Table 5.4.
Solution
Ho: Identical Distribution
Ha: At Least 2 Differ
α = .05
Example with SPSS