SPSS VS 23 TUTORIAL
Level I
Chi-Square Test of Independence (χ²)
A chi-square (χ²) statistic is used to investigate whether the distributions of categorical variables differ from one another. Categorical variables yield data in categories, whereas numerical variables yield data in numerical form. Responses to questions such as "What is your major?" or "Do you own a car?" are categorical because they yield data such as "Business" or "No". In contrast, responses to questions such as "How tall are you?" are numerical. Numerical data can be either discrete or continuous. The table below may help you see the differences between these types of variables.
Data Type              Question Type                     Possible Responses
Categorical            What is your sex?                 male or female
Numerical-Discrete     How many children do you have?    0 or 1 or 2 or…
Numerical-Continuous   How tall are you?                 75 inches
A test of independence assesses whether paired observations on two variables, expressed in a contingency table, are independent of each other.
When testing hypotheses, one often has to decide whether two or more variables pertaining to the same population can be considered independent. To assess the independence of two variables we use a contingency table, applied here to a single population whose variables are each divided into two or more categories. The variables can be discrete (nominal or ordinal) or continuous; in the latter case, one must choose suitable categories for the continuous variables.
Steps in Hypothesis Testing
1. Formulate the appropriate null and alternative hypotheses
Ho: The variables are independent
Ha: The variables are related
Example 1. Use the SPSS sample data file "demo.sav"
This is a hypothetical data file that concerns a purchased customer database, for the purpose of mailing monthly offers. Whether or not the customer responded to the offer is recorded, along with various demographic information.
Interpreting results
What factors affect the products that people buy?
The most obvious is probably how much money people have to spend. In this example we’ll examine the relationship between income level and PDA (personal digital assistant) ownership.
The cells of the table show the count or number of cases for each joint combination of values. For example, 455 people in the income range $25,000–$49,000 own PDAs.
None of the numbers in this table, however, stands out as indicating an obvious relationship between the variables.
It is often difficult to analyze a crosstabulation simply by looking at the counts in each cell.
Significance Testing for Crosstabulations
The purpose of a cross tabulation is to show the relationship (or lack thereof) between two variables.
Although there appears to be some relationship between the two variables, is there any reason to believe that the differences in PDA ownership between different income categories are anything more than random variation?
A number of tests are available to determine if the relationship between two crosstabulated variables is significant. One of the more common tests is chi-square. One of the advantages of chi-square is that it is appropriate for almost any kind of data.
Hypothesis
Ho: The variables Income and Owns PDA are independent.
Ha: The variables Income and Owns PDA are related.
Significance level: α = .05
Test statistic: χ² = 37.677, Sig. = .000
Chi-Square Tests
                               Value      df   Asymp. Sig. (2-sided)
Pearson Chi-Square             37.677a    3    .000
Likelihood Ratio               37.313     3    .000
Linear-by-Linear Association   36.537     1    .000
N of Valid Cases               6400
a. 0 cells (0.0%) have expected count less than 5. The minimum expected count is 228.73.
Making a Decision and Interpreting the Result: since the calculated χ² = 37.677 exceeds the critical value of 7.81 (df = 3, α = .05), we reject Ho. The p-value (Sig.) = .000 < .05 also supports rejecting Ho (SPSS displays .000 for any p-value that rounds to zero at three decimals). At the 5% level of significance, the variables Income and Owns PDA are related.
Interpretation: the percentage of people who own PDAs rises as the income category rises.
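The decision rule applied above can be checked by hand. The following Python sketch is only an illustration and not part of SPSS: it assumes the crosstabulation has 4 income categories and 2 ownership categories, which is consistent with the df = 3 reported in the output, and it takes the 5% critical values from a standard chi-square table. The function names are hypothetical.

```python
# Upper 5% critical values of the chi-square distribution, copied from a
# standard statistical table (keys are degrees of freedom).
CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070, 6: 12.592}


def degrees_of_freedom(n_rows, n_cols):
    """df for a test of independence on an r x c contingency table."""
    return (n_rows - 1) * (n_cols - 1)


def reject_h0(chi_square, df):
    """Return True when the statistic exceeds the 5% critical value."""
    return chi_square > CRITICAL_05[df]


# Assumption: 4 income categories x 2 ownership categories (gives df = 3).
df = degrees_of_freedom(4, 2)
decision = reject_h0(37.677, df)
print(df, CRITICAL_05[df], decision)
```

The table lookup stands in for the printed critical-value table; the Sig. column of the SPSS output leads to the same decision without one.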
Example 2
Consider the Programming dataset, containing results of pedagogical enquiries made during the period 2016 - 2019, of IT students attending the course.
Previous knowledge      Final examination score
on programming          Poor   Fair   Good   Very Good   Total
0                       76     78     16     7           177
1                       19     29     10     13          71
2                       2      6      7      8           23
Total                   97     113    33     28          271
Steps for entering contingency-table data into SPSS
For SPSS to analyze this problem, define the two variables "Previous knowledge on programming" and "Final examination score" in the "Variable View" tab. "Previous knowledge on programming" is discrete, so type its values as they appear in the table. For the second variable, "Final examination score" (still in the "Variable View" tab), click the cell in the "Values" column, as shown in the following figure, and assign the codes below.
Code Observation
1 Poor
2 Fair
3 Good
4 Very Good
Finally, enter the data in the Data View, as shown in the following table
Previous knowledge on programming   Final examination score   Frequency
0                                   1                         76
0                                   2                         78
0                                   3                         16
0                                   4                         7
1                                   1                         19
1                                   2                         29
1                                   3                         10
1                                   4                         13
2                                   1                         2
2                                   2                         6
2                                   3                         7
2                                   4                         8
Since the data are in the form of a frequency table, we must tell SPSS how to weight the cases; that is, we must tell SPSS that Frequency records the number of cases in each pair of categories defined by the two variables. Use Data (from the menu) → Weight Cases to open the Weight Cases dialog. Make the changes as shown in the following figure:
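What Weight Cases does can be illustrated outside SPSS. The Python sketch below is only an illustration, under the assumption that each row of the frequency table stands for Frequency identical cases; the variable names are hypothetical and the numbers come from the contingency table of Example 2.

```python
from collections import Counter

# (previous knowledge, exam score code, frequency), taken from the
# contingency table of Example 2; score codes: 1 = Poor ... 4 = Very Good.
rows = [
    (0, 1, 76), (0, 2, 78), (0, 3, 16), (0, 4, 7),
    (1, 1, 19), (1, 2, 29), (1, 3, 10), (1, 4, 13),
    (2, 1, 2), (2, 2, 6), (2, 3, 7), (2, 4, 8),
]

# Without weighting, each row would count as a single case.
unweighted_n = len(rows)

# With weighting, each row counts 'frequency' times.
weighted_n = sum(freq for _, _, freq in rows)

# Row totals by level of previous knowledge, using the weights.
row_totals = Counter()
for knowledge, _, freq in rows:
    row_totals[knowledge] += freq

print(unweighted_n, weighted_n, dict(row_totals))
```

The weighted total matches the "N of Valid Cases" that SPSS reports once weighting is switched on.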
Finally, let's perform the chi-square test:
Analyze > Descriptive Statistics > Crosstabs, and follow the steps shown in the picture:
Output
Previous knowledge on programming * Final examination score Crosstabulation
Count
                                      Final examination score
Previous knowledge on programming     Poor   Fair   Good   Very Good   Total
.00                                   76     78     16     7           177
1.00                                  19     29     10     13          71
2.00                                  2      6      7      8           23
Chi-Square Tests
                               Value      df   Asymp. Sig. (2-sided)
Pearson Chi-Square             43.044a    6    .000
Likelihood Ratio               39.805     6    .000
Linear-by-Linear Association   38.725     1    .000
N of Valid Cases               271
a. 2 cells (16.7%) have expected count less than 5. The minimum expected count is 2.38.
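The Pearson chi-square value and the footnote on expected counts can be reproduced by hand from the observed counts. The Python sketch below is a minimal illustration, not the procedure SPSS runs internally; it uses the usual definitions (expected count = row total × column total / N, and χ² = the sum of (observed − expected)² / expected over all cells), and the variable names are hypothetical.

```python
# Observed counts: rows = previous knowledge (0, 1, 2),
# columns = final score (Poor, Fair, Good, Very Good).
observed = [
    [76, 78, 16, 7],
    [19, 29, 10, 13],
    [2, 6, 7, 8],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under independence: row total * column total / N.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Pearson chi-square: sum over cells of (O - E)^2 / E.
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

df = (len(observed) - 1) * (len(observed[0]) - 1)
min_expected = min(min(row) for row in expected)
small_cells = sum(1 for row in expected for e in row if e < 5)

print(round(chi_square, 3), df, round(min_expected, 2), small_cells)
```

The quantities printed correspond, in order, to the Pearson Chi-Square value, its df, the minimum expected count, and the number of cells with an expected count below 5.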
Hypothesis
Ho: The performance obtained by the students at the final examination is independent of their previous knowledge on programming
Ha: The performance obtained by the students at the final examination is related to their previous knowledge on programming
Significance level: α = .05
Test statistic: χ² = 43.044, Sig. = .000
Making a Decision and Interpreting the Result: since Sig. = .000 < .05, we reject Ho. At the 5% level of significance, the performance obtained by the students at the final examination and their previous knowledge on programming are related.
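Spearman's Rank Correlation Coefficient
1. Measures correlation between ranks
2. Corresponds to the Pearson product-moment correlation coefficient
3. Values range from -1 to +1
4. Equation (shortcut): r_s = 1 - 6 Σd² / (n(n² - 1)), where d is the difference between the two ranks of each subject and n is the number of subjects (valid when there are no tied ranks)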
Example
You’re a research assistant for the FBI. You’re investigating the relationship between a person’s attempts at deception & % changes in their pupil size. You ask subjects a series of questions, some of which they must answer dishonestly. At the .05 level, what is the correlation coefficient?
Subj. Deception Pupil
1 87 10
2 63 6
3 95 11
4 50 7
5 43 0
Solution
Hypothesis
Ho: There is no relationship between a person's attempts at deception & % changes in their pupil size
Ha: There is a relationship between a person's attempts at deception & % changes in their pupil size
Statistic: Spearman's rho = .900, Sig. = .037
Make decision and interpret result: Since Sig. = .037 < .05, we reject the null hypothesis; there is a significant relationship between a person's attempts at deception & % changes in their pupil size.
Output
Correlations
                                                       Deception   Pupil
Spearman's rho   Deception   Correlation Coefficient   1.000       .900*
                             Sig. (2-tailed)           .           .037
                             N                         5           5
                 Pupil       Correlation Coefficient   .900*       1.000
                             Sig. (2-tailed)           .037        .
                             N                         5           5
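The coefficient reported in the output can be verified by hand with the shortcut formula r_s = 1 - 6 Σd² / (n(n² - 1)). The Python sketch below is a minimal illustration that assumes there are no tied values (true for these five subjects); the function names are hypothetical, and the sketch does not compute the significance value.

```python
deception = [87, 63, 95, 50, 43]
pupil = [10, 6, 11, 7, 0]


def ranks(values):
    """Rank from 1 (smallest) upward; assumes no tied values."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


def spearman_rho(x, y):
    """Shortcut formula 1 - 6 * sum(d^2) / (n * (n^2 - 1)); no ties allowed."""
    rank_x, rank_y = ranks(x), ranks(y)
    n = len(x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


rho = spearman_rho(deception, pupil)
print(ranks(deception), ranks(pupil), round(rho, 3))
```

Only subjects 2 and 4 have ranks that differ (by one position each), so Σd² = 2 and the formula gives 1 - 12/120.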
Regression and correlation analysis
The purpose of this analysis is twofold: to establish the relationship between two or more variables (correlation analysis), and to build a mathematical model that estimates the value of one variable from the values of the others (regression analysis).
Linear Regression
Linear Regression estimates the coefficients of the linear equation, involving one or more independent variables that best predict the value of the dependent variable. For example, you can try to predict a salesperson's total yearly sales (the dependent variable) from independent variables such as age, education, and years of experience.
Data Considerations
Data. The dependent and independent variables should be quantitative. Categorical variables, such as religion, major field of study, or region of residence, need to be recoded to binary (dummy) variables or other types of contrast variables.
Assumptions. For each value of the independent variable, the distribution of the dependent variable must be normal. The variance of the distribution of the dependent variable should be constant for all values of the independent variable. The relationship between the dependent variable and each independent variable should be linear, and all observations should be independent.
Example
The following data give the selling price, square footage, number of bedrooms, and age of houses that have sold in a neighborhood in the last 6 months. Carry out a multiple correlation and regression analysis to predict the selling price.
Selling Price ($) Square Footage Bedrooms Age
64000 1670 2 30
59000 1339 2 25
61500 1712 3 30
79000 1840 3 40
87500 2300 3 18
92500 2234 3 30
95000 2311 3 19
113000 2377 3 7
115000 2736 4 10
138000 2500 3 1
142500 2500 4 3
144000 2479 3 3
145000 2400 3 1
147500 3124 4 0
144000 2500 3 2
155500 4062 4 10
Steps to do a regression and correlation analysis
1. Scatter-Plot
Prior to entering your data on the Data Editor screen (shown below), you must define your variables on the Variable View screen (not shown). Once you have defined your variables (X as your predictor and Y as your criterion), enter the data as shown below on the Data Editor screen.
Start your analysis by creating a scatter plot: Graphs > Legacy Dialogs > Scatter/Dot
After clicking on Graphs and selecting Scatter/Dot, you will be given the option to select the type of plot, shown in the next figure. Highlight the Simple Scatter option and click Define.
The scatter plot will be produced and displayed on the Output Viewer.
Simple correlation between Selling price and Square Footage
Ho: There is no association between Selling price and Square Footage
Ha: There is an association between them
Correlation between Selling price and Age
Ho: There is no association between Selling price and Age
Ha: There is an association between them
We can see that the Pearson correlation coefficient for the two variables is r = -.881**, which indicates a strong, inverse linear relationship between Selling price and Age. Furthermore, it is statistically significant, with p < .001.
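The Pearson coefficient between Selling price and Age can be cross-checked from its definition, r = Σ(x - x̄)(y - ȳ) / √(Σ(x - x̄)² Σ(y - ȳ)²). The Python sketch below is a minimal illustration with a hypothetical function name. It uses the 16 houses listed in the data table, whereas the SPSS output reports N = 17, so the value obtained need not reproduce -.881 exactly.

```python
from math import sqrt

# Selling price ($) and age for the 16 houses listed in the data table.
price = [64000, 59000, 61500, 79000, 87500, 92500, 95000, 113000,
         115000, 138000, 142500, 144000, 145000, 147500, 144000, 155500]
age = [30, 25, 30, 40, 18, 30, 19, 7, 10, 1, 3, 3, 1, 0, 2, 10]


def pearson_r(x, y):
    """Pearson product-moment correlation from sums of squares and products."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


r = pearson_r(price, age)
print(round(r, 3))
```

A negative value close to -1 agrees with the interpretation above: older houses tend to sell for less.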
2. Regression Analysis
After creating a scatter plot, you should run a regression analysis. The regression analysis will produce regression coefficients, a correlation coefficient, and an ANOVA table.
Begin by selecting Analyze > Regression > Linear (shown below).
The output of the analysis is shown below.
Descriptive Statistics
                    Mean        Std. Deviation   N
Selling Price ($)   114588.24   35901.443        17
Square Footage      2407.29     618.457          17
Bedrooms            3.12        0.6              17
Age                 13.65       13.052           17
Interpretation. The average selling price of the houses sold in the neighborhood in the last 6 months is $114,588.24, with a standard deviation of $35,901.44 around the mean; the coefficient of variation (CV = 31%) indicates that the data are heterogeneous.
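The coefficient of variation quoted in the interpretation is the standard deviation expressed as a percentage of the mean. A minimal Python check, using the two values from the Descriptive Statistics table (the variable names are hypothetical):

```python
# Values taken from the Descriptive Statistics table for Selling Price ($).
mean_price = 114588.24
sd_price = 35901.443

# Coefficient of variation: standard deviation as a percentage of the mean.
cv_percent = sd_price / mean_price * 100
print(round(cv_percent, 1))
```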
ANNEXES
Calculate age from date-of-birth data
Use the example 'Employee data.sav' from SPSS
1. If you need age in years and your database stores the date of birth, first verify in the Variable View that the 'Type' column of the date-of-birth variable is set to "Date".
2. Go to Transform > Compute Variable and create a new variable holding the current date (cdate in the expression below), then follow the image below.
3. After creating the variable, go to the Variable View and change its "Type" column from Numeric to "Date".
COMPUTE Age = DATEDIFF(cdate, bdate, "days") / 365.25.
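The COMPUTE expression above counts the days between the two dates and divides by 365.25, the average length of a year once leap years are included. The same arithmetic can be sketched in Python as an illustration only; the function name and the two dates are hypothetical and are not taken from the data file.

```python
from datetime import date


def age_in_years(birth_date, current_date):
    """Age as elapsed days divided by 365.25, mirroring the SPSS expression."""
    days = (current_date - birth_date).days
    return days / 365.25


# Hypothetical dates, chosen only to illustrate the calculation.
age = age_in_years(date(1952, 2, 3), date(2020, 1, 1))
print(round(age, 2))
```

Because the result is a decimal number of years, it can be truncated when a whole-year age is needed.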
Transform a string variable into a numeric variable
Continue with the previous data file, 'Employee data.sav'
1. Click Transform > Automatic Recode.
2. Select the string variable of interest in the left column and move it to the right column.
3. Enter a new name for the autorecoded variable in the New Name field, and then click Add New Name, then continue the following steps
4. SPSS will assign numeric categories in alphabetical order.
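The effect of the recoding in step 4 can be sketched in Python: the distinct string values are sorted and numbered 1, 2, 3, … in that order. This is only an illustration with a hypothetical function name and hypothetical data; Python sorts by character code, which may differ from the SPSS sort order for mixed-case or accented strings.

```python
def autorecode(values):
    """Map each distinct string to 1, 2, 3, ... in sorted order."""
    categories = sorted(set(values))
    codes = {category: i for i, category in enumerate(categories, start=1)}
    return [codes[v] for v in values], codes


# Hypothetical string variable, used only for illustration.
gender = ["m", "f", "m", "f", "f"]
recoded, codes = autorecode(gender)
print(codes, recoded)
```

Keeping the mapping (codes) alongside the recoded values plays the role of the value labels that SPSS attaches to the new numeric variable.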
Merge data files
1. From the menus choose:
Data > Merge Files
2. Select Add Cases or Add Variables.
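The difference between the two options can be sketched in Python: Add Cases appends rows from a file that has the same variables, while Add Variables attaches new columns to the same cases. The sketch assumes the cases are matched on a key variable (here a hypothetical id); all records and names are invented for illustration.

```python
# Hypothetical records; each dict is one case (row).
file_a = [{"id": 1, "salary": 57000}, {"id": 2, "salary": 40200}]
file_b = [{"id": 3, "salary": 21450}]               # same variables, new cases
file_c = [{"id": 1, "educ": 15}, {"id": 2, "educ": 16}]  # same cases, new variable

# Add Cases: stack the rows of the second file under the first.
added_cases = file_a + file_b

# Add Variables: match rows on the key and combine their columns.
extra_by_id = {row["id"]: row for row in file_c}
added_variables = [{**row, **extra_by_id.get(row["id"], {})} for row in file_a]

print(added_cases)
print(added_variables)
```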