Step-by-Step Basic Statistics Using SAS: Student Guide

The writing is exceptionally clear and easy to follow, and precise definitions are provided to avoid confusion. Examples are used to illustrate each concept, and those examples are, like everything in this book, clear and logically presented. Sample SAS output is provided for every analysis, with each part labeled and thoroughly explained so the reader understands the results.

Sheri Bauman, Ph.D. Assistant Professor Department of Educational Psychology University of Arizona, Tucson

[Larry Hatcher] once again manages to provide clear, concise, and detailed explanations of the SAS program and procedures, including appropriate examples and sample write-ups.

Frank Pajares Winship Distinguished Research Professor Emory University

The Student Guide and the Exercises books are excellent choices for use in quantitative courses in psychology and education.

Bert W. Westbrook, Ph.D. Professor of Psychology Alumni Distinguished Undergraduate Professor North Carolina State University

BASIC STATISTICS Using SAS®

Using SAS: Student Guide. Cary, NC: SAS Institute Inc.

Step-by-Step Basic Statistics Using SAS®: Student Guide

All rights reserved. Printed in the United States of America. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, or otherwise, without the prior written permission of the publisher, SAS Institute Inc.

U.S. Government Restricted Rights Notice: Use, duplication, or disclosure of this software and related documentation by the U.S. government is subject to the Agreement with SAS Institute and the restrictions set forth in FAR 52.227-19, Commercial Computer Software-Restricted Rights (June 1987).

SAS Institute Inc., SAS Campus Drive, Cary, North Carolina 27513. 1st printing, April 2003

SAS Publishing provides a complete selection of books and electronic products to help customers use SAS software to its fullest potential. For more information about our e-books, e-learning products, CDs, and hardcopy books, visit the SAS Publishing Web site at support.sas.com/pubs or call 1-800-727-3228.

SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

Dedication

Acknowledgments

During the development of these books, Caroline Brickley, Gretchen Rorie Harwood, Stephenie Joyner, Sue Kocher, Patsy Poole, and Hanna Schoenrock served as editors. All were positive, supportive, and helpful. They made the books stronger, and I thank them for their guidance.

A number of other people at SAS made valuable contributions in a variety of areas. My sincere thanks go to those who reviewed the books for technical accuracy and readability: Jim Ashton, Jim Ford, Marty Hultgren, Catherine Lumsden, Elizabeth Maldonado, Paul Marovich, Ted Meleky, Annette Sanders, Kevin Scott, Ron Statt, and Morris Vaughan. I also thank Candy Farrell and Karen Perkins for production and design; Joan Stout for indexing; Cindy Puryear and Patricia Spain for marketing; and Cate Parrish for the cover designs.

Using This Student Guide

Introduction
  Overview
  Intended Audience and Level of Proficiency
  Platform and Version
  Materials Needed
Introduction to the SAS System
  Why Do You Need This Student Guide?
  What Is the SAS System?
  Who Uses SAS?
  Using the SAS System for Statistical Analyses
Contents of This Student Guide
  Overview
  Chapter 2: Terms and Concepts Used in This Guide
  Chapter 3: Tutorial: Using the SAS Windowing Environment to Write and Submit SAS Programs
  Chapter 4: Data Input
  Chapter 5: Creating Frequency Tables
  Chapter 6: Creating Graphs
  Chapter 7: Measures of Central Tendency and Variability
  Chapter 8: Creating and Modifying Variables and Data Sets
  Chapter 9: Standardized Scores (z Scores)
  Chapter 10: Bivariate Correlation
  Chapter 11: Bivariate Regression
  Chapter 12: Single-Sample t Test
  Chapter 13: Independent-Samples t Test
  Chapter 14: Paired-Samples t Test
  Chapter 15: One-Way ANOVA with One Between-Subjects Factor
  Chapter 16: Factorial ANOVA with Two Between-Subjects Factors
  Chapter 17: Chi-Square Test of Independence
  References

Introduction

Overview

This chapter introduces you to the SAS System, a computer application that can be used to perform statistical analyses. It explains what SAS is and where it is installed, and it describes some of the advantages of using SAS for data analysis. Finally, it briefly summarizes what you will learn in each of the chapters that make up this Student Guide.

Intended Audience and Level of Proficiency

This guide is intended for those who want to learn how to use SAS to perform elementary statistical analyses. The guide assumes that many students using it have not already taken a course on elementary statistics. To assist these students, this guide briefly reviews basic terms and concepts in statistics at an elementary level. It was designed to be easily understood by first- and second-year college students.

This book was also designed to be user-friendly to those who may have little or no experience with personal computers. The beginning of Chapter 3, “Tutorial: Using the SAS Windowing Environment to Write and Submit SAS Programs,” reviews basic concepts in using Microsoft Windows, such as selecting menus, double-clicking icons, and so forth. Those who already have experience in using Windows will be able to quickly skim through this elementary material.

Platform and Version

This guide shows how to use the SAS System for Windows, as opposed to other operating environments. This is most apparent in Chapter 3, “Tutorial: Using the SAS Windowing Environment to Write and Submit SAS Programs.” However, the remaining chapters show how to write SAS code to perform statistical analyses, and most of this material will be useful to all SAS users, regardless of the operating environment. This is because, for the most part, the same SAS code can be used in a wide variety of operating environments to obtain the same results.

This book was designed for those using the SAS System Version 8 and later versions. It may also be helpful to those using earlier versions of SAS (such as V6 or V7). However, if you are using one of these earlier versions, it is likely that some of the SAS system options described here are not available with your version. It is also likely that some of the SAS output that you obtain will be arranged differently than the output that is presented here.

Materials Needed

To complete the activities described in this book, you will need

• access to a personal computer on which the SAS System for Windows has been installed,

• one (and preferably two) 3.5-inch disks, formatted for IBM PCs (or some other type of storage media).

Some students using this book will also use its companion volume, Step-by-Step Basic Statistics Using SAS: Exercises. The chapters in the Exercises book parallel most of the chapters contained in this Student Guide. Each chapter in the Exercises book contains two assignments for students to complete. Complete solutions are provided for the odd-numbered exercises, but not for the even-numbered ones. The Exercises book can give you useful practice in learning how to use SAS, but it is not absolutely required.

Introduction to the SAS System

Why Do You Need This Student Guide?

This Student Guide shows you how to use a computer application called the SAS System to perform elementary statistical analyses. Until recently, students in elementary statistics courses typically performed statistical computations by hand or with a pocket calculator. In recent years, however, the increased availability of computers has made it possible for students to also use statistical software packages such as SPSS and the SAS System to perform these analyses. This latter approach allows students to focus more on conceptual issues in statistics, and spend less time on the mechanics of performing mathematical operations by hand. Step by step, this Student Guide will introduce you to the SAS System, and will show you how to use it to perform a variety of statistical analyses that are

What Is the SAS System?

The SAS System is a modular, integrated, and hardware-independent application. It is used as an information delivery system by business organizations, governments, and universities worldwide.

SAS is used for virtually every aspect of information management in organizations, including decision support, project management, financial analysis, quality improvement, data warehousing, report writing, and presentations. However, this guide will focus on just one aspect of SAS: its ability to perform the types of statistical analyses that are appropriate for research in the social sciences and education.

By the time you have completed this text, you will have accomplished two objectives: you will have learned how to perform elementary statistical analyses using SAS, and you will have become familiar with a widely used information delivery system.

Who Uses SAS?

The SAS System is widely used in business organizations and universities. Consider the following statistics from July 2002:

• SAS supports over 40 operating environments, including Windows, OS/2, and UNIX.

• SAS Institute’s computer software products are installed at over 38,400 sites in 115 countries.

• Approximately 71% of SAS installations are in business locations, 18% are education sites, and 11% are government sites. It is used for teaching and research at about 3,000 university locations.

• It is estimated that SAS software products are used by more than 3.5 million people worldwide.

• 90% of all Fortune 500 companies are SAS clients.

Using the SAS System for Statistical Analyses

SAS is a particularly powerful tool for social scientists and educators because it allows them to easily perform virtually any type of statistical analysis that may be required in their research. SAS is comprehensive enough to perform the most sophisticated multivariate analyses, but is so easy to use that undergraduates can perform simple analyses after only a short period of instruction.

In a sense, the SAS System may be viewed as a library of prewritten statistical algorithms. By submitting a brief SAS program, you can access a procedure from the library and use it to analyze a set of data. For example, below are the SAS statements used to call up the algorithm that calculates Pearson correlation coefficients:

PROC CORR DATA=D1;
RUN;

The preceding statements will cause SAS to compute the Pearson correlation between every possible pair of numeric variables in your data set. Being able to call up complex procedures with such simple statements is what makes SAS so powerful. By contrast, if you had to prepare your own programs to compute Pearson correlations by using a programming language such as FORTRAN or BASIC, it would require many statements, and there would be many opportunities for error. When you use SAS instead, most of the work has already been completed, and you are able to focus on the results of the analysis rather than on the mechanics of obtaining those results.
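To see these statements in context, here is a minimal sketch of a complete program. It assumes a small data set named D1; the variable names (HOURS, SCORE, ABSENCES) and the data values are invented for illustration and do not come from any study in this guide.

```sas
* Hypothetical data set with three numeric variables (all values are invented);
DATA D1;
   INPUT HOURS SCORE ABSENCES;
DATALINES;
2 65 4
4 70 3
5 78 2
7 85 1
9 92 0
;
RUN;

* Compute Pearson correlations for every pair of numeric variables in D1;
PROC CORR DATA=D1;
RUN;
```

Because no VAR statement is included, PROC CORR analyzes all three numeric variables and reports the correlation for each pair.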

Contents of This Student Guide

Overview

This guide has two objectives: to teach the basics of using SAS in general and, more specifically, to show how to use SAS procedures to perform elementary statistical analyses. Chapters 1–4 provide an overview of the basics of using SAS. The remaining chapters cover statistical concepts in a sequence that is representative of the sequence followed in most elementary statistics textbooks.

Chapters 10–17 introduce you to inferential statistical procedures (the type of procedures that are most often used to analyze data from research). Each chapter shows you how to conduct the analysis from beginning to end. Each chapter also provides an example of how the analysis might be summarized for publication in an academic journal in the social sciences or education. For the most part, these summaries are written according to the guidelines provided in the Publication Manual of the American Psychological Association (1994).

Many students using this book will also use its companion volume, Step-by-Step Basic Statistics Using SAS: Exercises. For Chapters 3–17 in this Student Guide, the corresponding chapter in the Exercises book provides you with a hands-on exercise that enables you to practice the data analysis skills that you are learning.

The following sections provide a summary of the contents of the remaining chapters in this guide.

Chapter 2: Terms and Concepts Used in This Guide

Chapter 2 defines some important terms related to research and statistics that will be used throughout this guide. It also introduces you to the three types of files that you will work with during a typical session with SAS: the SAS program, the SAS log, and the SAS output file.

Chapter 3: Tutorial: Using the SAS Windowing Environment to Write and Submit SAS Programs

The SAS windowing environment is a powerful application that you will use to create, edit, and submit SAS programs. You will also use it to review your SAS logs and output. Chapter 3 provides a tutorial that teaches you how to use this application. Step by step, it shows you how to write simple SAS programs and interpret their results. By the end of this chapter, you should be ready to use the SAS windowing environment to write and submit SAS programs on your own.

Chapter 4: Data Input

Chapter 4 shows you how to use the DATA and INPUT statements to create SAS data sets. You will learn how to read both numeric and character variables by using a simple, list style for data input. By the end of the chapter, you will be prepared to input the data sets that will be presented throughout the remainder of this guide.
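As a rough preview of what list-style input can look like, the following sketch reads one character variable and two numeric variables. The data set name, variable names, and values are hypothetical; the dollar sign after SEX is what tells SAS that SEX is a character variable.

```sas
* Hypothetical data: subject number, sex, and age (all values are invented);
DATA D1;
   INPUT SUBJECT SEX $ AGE;
DATALINES;
1 F 23
2 M 31
3 F 45
;
RUN;

* Print the data set to verify that the values were read correctly;
PROC PRINT DATA=D1;
RUN;
```

With list input, the values on each data line are separated by at least one blank and appear in the same order as the variables named in the INPUT statement.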

Chapter 5: Creating Frequency Tables

Chapter 5 shows you how to create frequency tables that are useful for understanding your data and answering some types of research questions. For example, imagine that you ask a sample of 150 people to tell you their age. If you then used SAS to create a frequency table for this age variable, you would be able to easily answer questions such as

• How many people are age 30?

• How many people are age 30 or younger?

• What percent of people are age 45?
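A minimal sketch of how such a table might be requested is shown here. To keep the example short, it uses only eight invented ages rather than a sample of 150; the data set name D1 and the variable name AGE are assumptions made for illustration.

```sas
* Hypothetical data: ages of eight people (values are invented);
DATA D1;
   INPUT AGE;
DATALINES;
28
30
30
45
30
28
45
52
;
RUN;

* Create a frequency table for AGE;
PROC FREQ DATA=D1;
   TABLES AGE;
RUN;
```

For these eight values, the Frequency column shows that 3 people are age 30, the Cumulative Frequency column shows that 5 people are age 30 or younger, and the Percent column shows that 25% of the people are age 45.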

Chapter 6: Creating Graphs

Chapter 6 shows you how to use SAS to create frequency bar charts, that is, bar charts that indicate the number of people who displayed a given value on a variable. For example, imagine that you asked 150 people to indicate their political party. If you used SAS to create a frequency bar chart, the resulting chart would indicate the number of people who are Democrats, the number who are Republicans, and the number who are independents.

Chapter 6 also shows how to create bar charts that plot subgroup means. For example, assume that, in the “political party” study described above, you asked the 150 subjects to indicate both their political party and their age. You could then use SAS to create a bar chart that plots the mean age for people in each party. For instance, the resulting bar chart might show that the average age for Democrats was 32.12, the average age for Republicans was 41.56, and the average age for independents was 37.33.

Chapter 7: Measures of Central Tendency and Variability

Chapter 7 shows you how to compute measures of variability (e.g., the interquartile range, standard deviation, and variance) as well as measures of central tendency (e.g., the mean, median, and mode) for numeric variables. It also shows how to use stem-and-leaf plots to determine whether a distribution is skewed or approximately normal in shape.

Chapter 8: Creating and Modifying Variables and Data Sets

Chapter 8 shows how to use subsetting IF statements to create new data sets that contain a specified subgroup from the original sample. It also shows how to use mathematical operators and IF-THEN statements to recode variables and to create new variables from existing variables.
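The following sketch illustrates these kinds of statements under stated assumptions: the data set names (D1, D2, D3), the variable names, the cutoff of 40, and all data values are hypothetical and were chosen only to show the general form of the statements.

```sas
* Hypothetical data (all values are invented);
DATA D1;
   INPUT SUBJECT SEX $ AGE;
DATALINES;
1 F 23
2 M 31
3 F 45
4 M 52
;
RUN;

* Subsetting IF statement: D2 keeps only the observations for which SEX is F;
DATA D2;
   SET D1;
   IF SEX = 'F';
RUN;

* Mathematical operator and IF-THEN statements: create new variables from AGE;
DATA D3;
   SET D1;
   AGE_MOS = AGE * 12;
   IF AGE < 40 THEN AGE_GRP = 1;
   IF AGE >= 40 THEN AGE_GRP = 2;
RUN;

PROC PRINT DATA=D3;
RUN;
```

Here D2 contains the two female subjects, and D3 contains all four subjects along with the new variables AGE_MOS (age in months) and AGE_GRP (1 for subjects under 40, 2 for subjects 40 or older).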

Chapter 9: Standardized Scores (z Scores)

Chapter 9 shows how to transform raw scores into standardized variables (z score variables) with a mean of 0 and a standard deviation of 1. You will learn how to do this by using the data manipulation statements that you learned about in Chapter 8. Chapter 9 also illustrates how you can review the sign and absolute magnitude of a z score to understand where a particular observation stands on the variable in question.
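A z score is computed by subtracting the mean from the raw score and dividing the result by the standard deviation. The sketch below shows one way this might be done with data manipulation statements. The variable names and the three scores are invented; the scores were chosen so that the mean is exactly 50 and the standard deviation is exactly 10.

```sas
* Hypothetical data: three raw scores with a mean of 50 and a standard deviation of 10;
DATA D1;
   INPUT SCORE;
DATALINES;
40
50
60
;
RUN;

* Step 1: obtain the mean and standard deviation of SCORE;
PROC MEANS DATA=D1 MEAN STD;
   VAR SCORE;
RUN;

* Step 2: subtract the mean (50) and divide by the standard deviation (10);
DATA D2;
   SET D1;
   SCORE_Z = (SCORE - 50) / 10;
RUN;

PROC PRINT DATA=D2;
RUN;
```

For these data, the z scores are -1, 0, and 1: the first observation is one standard deviation below the mean, the second is at the mean, and the third is one standard deviation above the mean.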

Chapter 10: Bivariate Correlation

Bivariate correlation coefficients allow you to determine the nature of the relationship between two numeric variables. Chapter 10 shows you how to use the CORR procedure to compute Pearson correlation coefficients for interval- and ratio-level variables. You will also learn to interpret the p values (probability values) that are produced by PROC CORR to determine whether a given correlation coefficient is significantly different from zero.

Chapter 10 also shows how to use PROC PLOT to create a two-dimensional scattergram that illustrates the relationship between two variables.

Chapter 11: Bivariate Regression

Bivariate regression is used when you want to predict scores on an interval- or ratio-level criterion variable from an interval- or ratio-level predictor variable. Chapter 11 shows you how to use the REG procedure to compute the slope and intercept for the regression equation, along with predicted values and residuals of prediction.
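As a minimal sketch, assuming a hypothetical criterion variable SCORE and a hypothetical predictor variable HOURS (the names and values are invented), a bivariate regression might be requested as follows. The P option in the MODEL statement asks PROC REG to print the predicted values and residuals.

```sas
* Hypothetical data: hours studied and test score (values are invented);
DATA D1;
   INPUT HOURS SCORE;
DATALINES;
2 65
4 70
5 78
7 85
9 92
;
RUN;

* Regress SCORE on HOURS and print predicted values and residuals;
PROC REG DATA=D1;
   MODEL SCORE = HOURS / P;
RUN;
QUIT;
```

In the MODEL statement, the criterion variable appears to the left of the equal sign and the predictor variable appears to the right. The QUIT statement ends the procedure, because PROC REG remains active after the RUN statement.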

Chapter 12: Single-Sample t Test

Chapter 12 shows how to use the TTEST procedure to perform a single-sample t test. This is an inferential procedure that is useful for determining whether a sample mean is significantly different from a specified population mean. You will learn how to interpret the t statistic and the p value associated with that t statistic.
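A minimal sketch of a single-sample t test follows. It assumes a hypothetical variable SCORE and a hypothetical population mean of 50; the H0= option supplies the population mean against which the sample mean is compared. All data values are invented.

```sas
* Hypothetical data: six scores (values are invented);
DATA D1;
   INPUT SCORE;
DATALINES;
52
47
58
61
49
55
;
RUN;

* Test whether the mean of SCORE differs from a population mean of 50;
PROC TTEST DATA=D1 H0=50;
   VAR SCORE;
RUN;
```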

Chapter 13: Independent-Samples t Test

You use an independent-samples t test to determine whether there is a significant difference between two groups of subjects with respect to their mean scores on the dependent variable. Chapter 13 explains when to use the equal-variance t statistic versus the unequal-variance t statistic, and shows how to use the TTEST procedure to conduct this analysis.
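The sketch below shows the general form of such an analysis. The grouping variable GROUP, the dependent variable SOLD, and the data values are all hypothetical. The CLASS statement names the variable that identifies the two groups, and the VAR statement names the dependent variable.

```sas
* Hypothetical data: group membership and amount sold (values are invented);
DATA D1;
   INPUT GROUP $ SOLD;
DATALINES;
HIGH 24
HIGH 30
HIGH 27
HIGH 33
LOW 18
LOW 22
LOW 25
LOW 19
;
RUN;

* Compare the two groups on their mean scores for SOLD;
PROC TTEST DATA=D1;
   CLASS GROUP;
   VAR SOLD;
RUN;
```

The output from PROC TTEST reports both a pooled (equal-variance) t statistic and a Satterthwaite (unequal-variance) t statistic, along with a test of the equality of the two variances.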

Chapter 14: Paired-Samples t Test

The paired-samples t test is also appropriate when you want to determine whether there is a significant difference between two sample means. The paired-samples approach is indicated when each score in one sample is dependent upon a corresponding score in the second sample. This will be the case in studies in which the same subjects provide repeated measures on the same dependent variable under different conditions, or when matching procedures are used. Chapter 14 shows how to perform this analysis using the TTEST procedure.
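As a minimal sketch, assume that each subject provides a score under two conditions, stored in the hypothetical variables PRE and POST (the names and values are invented). The PAIRED statement tells PROC TTEST which two variables form each pair.

```sas
* Hypothetical data: each line contains both scores for one subject (values are invented);
DATA D1;
   INPUT SUBJECT PRE POST;
DATALINES;
1 10 14
2 12 15
3 9 13
4 14 14
5 11 16
;
RUN;

* Test whether the mean difference between PRE and POST differs from zero;
PROC TTEST DATA=D1;
   PAIRED PRE*POST;
RUN;
```

Notice that the two scores for a given subject appear on the same data line, which is what allows SAS to treat them as a pair.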

Chapter 15: One-Way ANOVA with One Between-Subjects Factor

One-way analysis of variance (ANOVA) is an inferential procedure similar to the independent-samples t test, with one important difference: while the t test allows you to test the significance of the difference between two sample means, a one-way ANOVA allows you to test the significance of the difference between more than two sample means. Chapter 15 shows how to use the GLM procedure to perform a one-way ANOVA, and then to follow with multiple comparison (post hoc) tests.
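The following sketch shows the general form of a one-way ANOVA with three groups. The variable names and data values are hypothetical, and the TUKEY option is used here only as one example of a multiple comparison test; it is an assumption made for illustration, not necessarily the test that Chapter 15 uses.

```sas
* Hypothetical data: three groups of three subjects each (values are invented);
DATA D1;
   INPUT GROUP $ SCORE;
DATALINES;
A 12
A 15
A 14
B 18
B 20
B 19
C 25
C 23
C 27
;
RUN;

* One-way ANOVA followed by a multiple comparison test of the group means;
PROC GLM DATA=D1;
   CLASS GROUP;
   MODEL SCORE = GROUP;
   MEANS GROUP / TUKEY;
RUN;
QUIT;
```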

Chapter 16: Factorial ANOVA with Two Between-Subjects Factors

A one-way ANOVA, as described in Chapter 15, may be appropriate for analyzing data from an experiment in which the researcher manipulates only one independent variable. In contrast, a factorial ANOVA with two between-subjects factors may be appropriate for analyzing data from an experiment in which the researcher manipulates two independent variables simultaneously. Chapter 16 shows how to perform this type of analysis. It provides examples of results in which the main effects are significant, as well as results in which the interaction is significant.

Chapter 17: Chi-Square Test of Independence

Nonparametric statistical procedures are procedures that do not require stringent assumptions about the nature of the populations under study. Chapter 17 illustrates one of the most common nonparametric procedures: the chi-square test of independence. This test is appropriate when you want to study the relationship between two variables that assume a limited number of values. Chapter 17 shows how to conduct the test of significance and interpret the results presented in the two-way classification table created by the FREQ procedure.
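A minimal sketch of this type of analysis appears below. It assumes that the data have already been tabulated, so that each data line contains one cell of the two-way table along with the number of subjects in that cell. The variable names (SEX, PREF, COUNT) and the cell counts are invented for illustration.

```sas
* Hypothetical tabled data: one line per cell, with the number of subjects in that cell;
DATA D1;
   INPUT SEX $ PREF $ COUNT;
DATALINES;
F A 30
F B 20
M A 15
M B 35
;
RUN;

* Two-way classification table with a chi-square test of independence;
PROC FREQ DATA=D1;
   TABLES SEX*PREF / CHISQ;
   WEIGHT COUNT;
RUN;
```

The WEIGHT statement tells PROC FREQ that each data line represents COUNT subjects rather than a single subject. If the data set contained one line per subject instead, the WEIGHT statement would be omitted.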

References

Many statistical procedures are illustrated in this guide by showing you how to analyze fictitious data from an empirical study. Many of these “studies” are loosely based on actual investigations reported in the research literature. These studies were chosen to help introduce you to the types of empirical investigations that are often conducted in the social and behavioral sciences and in education. The “References” section at the end of this guide provides complete references for the actual studies that inspired the fictitious studies reported here.

Conclusion

This guide assumes that some of the students using it have not yet completed a course on elementary statistics. This means that some readers will be unfamiliar with terms used in data analysis, such as “observations,” “null hypothesis,” “dichotomous variables,” and so on. To remedy this, the following chapter, “Terms and Concepts Used in This Guide,” provides a brief primer on basic terms and concepts in statistics. This chapter should lay a foundation that will make it easier to understand the chapters to follow.

Terms and Concepts Used in This Guide

Introduction
  Overview
  A Common Language for Researchers
  Why This Chapter Is Important
Research Hypotheses and Statistical Hypotheses
  Example: A Goal-Setting Study
  The Research Question
  The Research Hypothesis
  The Statistical Null Hypothesis
  The Statistical Alternative Hypothesis
  Directional versus Nondirectional Alternative Hypotheses
  Summary
Data, Variables, Values, and Observations
  Defining the Instrument, Gathering Data, Analyzing Data, and Drawing Conclusions
  Variables, Values, and Observations
Classifying Variables According to Their Scales of Measurement
  Introduction
  Nominal Scales
  Ordinal Scales
  Interval Scales
  Ratio Scales
Classifying Variables According to the Number of Values They Display
  Overview
  Dichotomous Variables
  Limited-Value Variables
  Multi-Value Variables
Basic Approaches to Research
  Nonexperimental Research
  Experimental Research
Using Type-of-Variable Figures to Represent Dependent and Independent Variables
  Overview
  Figures to Represent Types of Variables
  Using Figures to Represent the Types of Variables Assessed in a Specific Study
The Three Types of SAS Files
  Overview
  The SAS Program
  The SAS Log
  The SAS Output File
Conclusion

Introduction

Overview

This chapter has two objectives. The first is to introduce you to basic terms and concepts related to research design and data analysis. This chapter describes the different types of variables that might be analyzed when conducting research, the classification of these variables according to their scale of measurement or other characteristics, and the differences between nonexperimental and experimental research.

The chapter’s second objective is to introduce you to the three types of files that you will work with when you perform statistical analyses with SAS. These include the SAS program file, the SAS log file, and the SAS output file.

After completing this chapter, you should be familiar with the fundamental terms and concepts that are relevant to data analysis, and you will have a foundation to begin learning about the SAS System in detail in subsequent chapters.

A Common Language for Researchers

Research in the behavioral sciences and in education is extremely diverse. In part, this is because the behavioral sciences represent a wide variety of disciplines, including psychology, sociology, anthropology, political science, management, and other fields. Further complicating matters is the fact that, within each discipline, a wide variety of methods are used to conduct research. These methods can include unobtrusive observation, participant observation, case studies, interviews, focus groups, surveys, ex post facto studies, laboratory experiments, and field experiments.

Despite this diversity in methods used and topics investigated, most scientific investigations still share a number of characteristics. Regardless of field, most research involves an investigator who gathers data and performs analyses to determine what the data mean. In addition, most researchers in the behavioral sciences and education use a common language in reporting their research; researchers from all fields typically speak of “testing null hypotheses” and “obtaining significant p values.”

Why This Chapter Is Important

The purpose of this chapter is to review some fundamental concepts and terms that are shared in the behavioral sciences and in education. You should familiarize (or refamiliarize) yourself with this material before proceeding to the subsequent chapters, as most of the terms introduced here will be referred to again and again throughout the text. If you have not yet taken a course in statistics, this chapter will provide an elementary introduction; if you have already completed a course in statistics, it will provide a quick review.

Research Hypotheses and Statistical Hypotheses

Example: A Goal-Setting Study

Imagine that you have been hired by a large insurance company to find ways of improving the productivity of its insurance agents. Specifically, the company would like you to find ways to increase the number of insurance policies sold by the average agent. You will therefore begin a program of research to identify the determinants of agent productivity. In the course of this program, you will work with research questions, research hypotheses, and statistical hypotheses.

The Research Question

The process of research often begins by developing a clear statement of the research question (or questions). The research question is a statement of what you hope to have learned by the time the research has been completed. It is good practice to revise and refine the research question several times to ensure that you are very clear about what it is you really want to know.

In the current example, you might begin with the question “What is the difference between agents who sell much insurance versus agents who sell little insurance?” A more specific question might be “What variables have a causal effect on the amount of insurance sold by agents?” Upon reflection, you might realize that the insurance company really only wants to know what things management can do to cause the agents to sell more insurance. This might eliminate from consideration those variables that are not under management’s control, and can substantially narrow the focus of the research program. This narrowing, in turn, leads to a more specific statement of the research question such as “What variables under the control of management have a causal effect on the amount of insurance sold by agents?” Once the research question has been more clearly defined in this way, you are in a better position to develop a good hypothesis that provides a possible answer to the question.

The Research Hypothesis

A hypothesis is a statement about the predicted relationships among events or variables. A good hypothesis in the present case might identify a specific variable that is expected to have a causal effect on the amount of insurance sold by agents. For example, a research hypothesis might predict that the agents’ level of training will have a positive effect on the amount of insurance sold. Or it might predict that the agents’ level of achievement

In developing the hypothesis, you might be influenced by any of a number of sources: an existing theory, some related research, or even personal experience. Let’s assume that in the present situation, for example, you have been influenced by goal-setting theory. This theory states, among other things, that higher levels of work performance are achieved when difficult goals are set for employees. Drawing on goal-setting theory, you now state the following hypothesis: “The difficulty of the goals that agents set for themselves is positively related to the amount of insurance they sell.” Notice how this statement satisfies our definition for a research hypothesis, as it is a statement about the predicted relationship between two variables. The first variable can be labeled “goal difficulty,” and the second can be labeled “amount of insurance sold.”

The predicted relationship between goal difficulty and amount of insurance sold is illustrated in Figure 2.1. Notice that there is an arrow extending from goal difficulty to amount of insurance sold. This arrow reflects the prediction that goal difficulty is the causal variable, and amount of insurance sold is the variable being affected.

Figure 2.1. Causal relationship between goal difficulty and amount of insurance sold, as predicted by the research hypothesis.

In Figure 2.1, you can see that the variable being affected (insurance sold) appears on the left side of the figure, and that the causal variable (goal difficulty) appears on the right. This arrangement might seem a bit unusual to you, since most figures that portray causal relationships have the order reversed (with the causal variable on the left and the variable being affected on the right). However, this guide will always use the arrangement that appears in Figure 2.1, for reasons that will become clear later.

You can see that the research hypothesis stated above is quite broad in nature. In many research situations, however, it is helpful to state hypotheses that are more specific in the predictions they make. For example, assume that there is an instrument called the “Smith Goal Difficulty Scale.” Scores on this fictitious instrument can range from zero to 100, with higher scores indicating more difficult goals. If you administered this scale to a sample of agents, you could develop a more specific research hypothesis along the following lines: “Agents who score 60 or above on the Smith Goal Difficulty Scale will sell greater amounts of insurance than agents who score below 60.”

The Statistical Null Hypothesis

Beginning in Chapter 10, “Bivariate Correlation,” this guide will show you how to use the SAS System to perform tests of null hypotheses. The way that you state a specific null hypothesis will vary depending on the nature of your research question and the type of data analysis that you are performing. Generally speaking, however, a statistical null hypothesis is typically a prediction that there is no difference between groups in the population, or that there is no relationship between variables in the population.

For example, consider the research hypothesis stated in the preceding section: “Agents who score 60 or above on the Smith Goal Difficulty Scale will sell greater amounts of insurance than agents who score below 60.” Assume that you conduct a study to investigate this research hypothesis. You identify two groups of subjects:

• 50 agents who score 60 or above on the Smith Goal Difficulty Scale (the “high goal-difficulty group”).

• 50 agents who score below 60 on the Smith Goal Difficulty Scale (the “low goal-difficulty group”).

You observe these agents over a 12-month period, and record the amount of insurance that they sell. You want to investigate the following (fairly specific) research hypothesis:

Research hypothesis: The average amount of insurance sold by the high goal-difficulty group will be greater than the average amount sold by the low goal-difficulty group.

You plan to analyze the data using a statistical procedure such as a t test (which will be discussed in Chapter 13, “Independent-Samples t Test”). One way to structure this analysis is to begin with the following statistical null hypothesis:

Statistical null hypothesis: In the population, there is no difference between the high goal-difficulty group and the low goal-difficulty group with respect to their mean scores on the amount of insurance sold.

Notice that this is a prediction of no difference between the groups. You will analyze the data from your sample, and if the observed difference is large enough, you will reject this null hypothesis of no difference. Rejecting this statistical null hypothesis means that you have obtained some support for your original research hypothesis (the hypothesis that there is a difference between the groups).

Statistical null hypotheses are often represented symbolically. For example, this is how you could have symbolically represented the preceding statistical null hypothesis:

H₀: µ₁ = µ₂

where

H₀ is the symbol used to represent the null hypothesis

µ₁ is the symbol used to represent the mean amount of insurance sold by Group 1 (the high goal-difficulty group) in the population

µ₂ is the symbol used to represent the mean amount of insurance sold by Group 2 (the low goal-difficulty group) in the population.

The Statistical Alternative Hypothesis

A statistical alternative hypothesis is typically a prediction that there is a difference between groups in the population, or that there is a relationship between variables in the population. The alternative hypothesis is the counterpart to the null hypothesis; if you reject the null hypothesis, you tentatively accept the alternative hypothesis.

There are different ways that you can state alternative hypotheses. One way is simply to predict that there is a difference between the population means, without predicting which population mean is higher. Here is one way of stating that type of alternative hypothesis for the current study:

Statistical alternative hypothesis: In the population, there is a difference between the high goal-difficulty group and the low goal-difficulty group with respect to their mean scores on the amount of insurance sold.

The alternative hypothesis can also be stated symbolically:

H₁: µ₁ ≠ µ₂

The H₁ symbol above is the symbol for an alternative hypothesis. Notice that the “not equal” symbol (≠) is used to represent the prediction that the means will not be equal.

Directional versus Nondirectional Alternative Hypotheses

Nondirectional hypotheses. The preceding section illustrated a nondirectional alternative hypothesis, also known as a two-sided or two-tailed alternative hypothesis. With the type of study described here (a study in which group means are being compared), a nondirectional alternative hypothesis simply predicts that one population mean differs from the other population mean; it does not predict which population mean will be higher. You would obtain support for this nondirectional alternative hypothesis if the high goal-difficulty group sold significantly more insurance, on the average, than the low goal-difficulty group. You would also obtain support for this nondirectional alternative hypothesis if the low goal-difficulty group sold significantly more insurance than the high goal-difficulty group. With a nondirectional alternative hypothesis, you are predicting some type of difference, but you are not predicting the specific nature, or direction, of the difference.


Directional hypotheses. In some situations it might be appropriate to use a directional alternative hypothesis. With the type of study described above, a directional alternative hypothesis (also known as a one-sided or one-tailed alternative hypothesis) not only predicts that there will be a difference, but also makes a specific prediction about which population will display the higher mean.

For example, in the present study, previous research might lead you to predict that the population of high goal-difficulty employees will sell more insurance, on the average, than the population of low goal-difficulty employees. If this were the case, you might state the following directional alternative hypothesis:

Statistical alternative hypothesis: In the population, the mean amount of insurance sold by the high goal-difficulty group is greater than the mean amount of insurance sold by the low goal-difficulty group.

This alternative hypothesis can also be stated symbolically:

H₁: µ₁ > µ₂

where

µ₁ represents the mean amount of insurance sold by Group 1 (the high goal-difficulty group) in the population

µ₂ represents the mean amount of insurance sold by Group 2 (the low goal-difficulty group) in the population.

Notice that the “greater than” symbol (>) is used to represent the prediction that the mean for the high goal-difficulty population is greater than the mean for the low goal-difficulty population.

Choosing directional versus nondirectional tests. Which type of alternative hypothesis should you use in your research? Most statistics textbooks recommend using a nondirectional, or two-sided, alternative hypothesis in most cases. The problem with the directional hypothesis is that if your obtained sample means fall in the direction opposite to the one you predicted, you can fail to reject the null hypothesis even when there are very large differences between the sample means.

For example, assume that you state the directional alternative hypothesis presented above (i.e., “In the population, the mean amount of insurance sold by the high goal-difficulty group is greater than the mean amount of insurance sold by the low goal-difficulty group”). Because your alternative hypothesis is a directional hypothesis, the null hypothesis you are testing is as follows:

H₀: µ₁ ≤ µ₂

which means, “In the population, the mean amount of insurance sold by the high goal-difficulty group (Group 1) is less than or equal to the mean amount of insurance sold by the low goal-difficulty group (Group 2).”


Clearly, to reject the null hypothesis, the high goal-difficulty group (Group 1) must display a mean that is greater than the low goal-difficulty group (Group 2). If Group 2 displays the higher mean, then you might not reject the null hypothesis, no matter how great that difference might be. This presents a problem because the finding that Group 2 scored higher than Group 1 may be of great interest to other researchers (particularly because it is not what many would have expected). This is why, in many situations, nondirectional tests are preferred over directional tests.
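The contrast between nondirectional and directional tests can be sketched in SAS roughly as follows. This is an illustration under stated assumptions, not the analysis presented later in this guide: the data set name GOALSTUDY, the variable names GROUP and SALES, and all of the sales figures are hypothetical. The SIDES= option is assumed to be available in your release of SAS (it is not available in older releases). The sketch also assumes that PROC TTEST forms the difference as the first CLASS level minus the second in sorted order, so that "high" minus "low" corresponds to µ₁ minus µ₂.

```sas
/* Hypothetical data: group labels and sales figures are invented
   for illustration only; they do not come from this guide. */
data goalstudy;
   input group $ sales;
   datalines;
high 310000
high 275000
high 402000
high 358000
low  198000
low  240000
low  150000
low  221000
;
run;

/* Nondirectional (two-sided) test:
   H0: mu1 = mu2   versus   H1: mu1 ne mu2 */
proc ttest data=goalstudy;
   class group;
   var sales;
run;

/* Directional (upper one-sided) test:
   H0: mu1 <= mu2  versus   H1: mu1 > mu2
   SIDES=U is assumed to be supported by your SAS release. */
proc ttest data=goalstudy sides=u;
   class group;
   var sales;
run;
```

With the one-sided request, a large difference in the unpredicted direction (Group 2 higher than Group 1) would not lead to rejection of the null hypothesis, which is the drawback described above.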

Summary

In summary, research projects often begin with a statement of a research hypothesis. This allows you to develop a specific, testable statistical null hypothesis and an alternative hypothesis. The analysis of your data will lead you to one of two results:

• If the results are significant, you can reject the null hypothesis and tentatively accept the alternative hypothesis. Assuming the means are in the predicted direction, this type of result provides some support for your initial research hypothesis.

• If the results are nonsignificant, you fail to reject the null hypothesis. This type of result fails to provide support for your initial research hypothesis.

Data, Variables, Values, and Observations

Defining the Instrument, Gathering Data, Analyzing Data, and Drawing Conclusions

With the null hypothesis stated, you can now test it by conducting a study in which you gather and analyze relevant data. Data is defined as a collection of scores that are obtained when subject characteristics and/or performance are observed and recorded. For example, you can choose to test your hypothesis by conducting a simple correlational study: You identify a group of 100 agents and determine

• the difficulty of the goals that have been set for each agent

• the amount of insurance sold by each.

Different types of instruments can be used to obtain different types of data. For example, you might use a questionnaire to assess goal difficulty, but rely on company records for measures of insurance sold. Once the data are gathered, each agent will have one score indicating the difficulty of his or her goals, and a second score indicating the amount of insurance he or she has sold.

You would then analyze the data to see if the agents with the more difficult goals did, in fact, sell more insurance. If so, the study results would lend some support to your research hypothesis; if not, the results would fail to provide support. In either case, you would be able to draw conclusions regarding the tenability of your hypotheses, and would have made some progress toward answering your research question. The information learned in the current study might stimulate new questions or new hypotheses for subsequent studies, and the cycle would repeat. For example, if you obtained support for your hypothesis with a correlational study, you might choose to follow it up with a study using a different research method, perhaps an experimental study (the difference between these methods will be described below). Over time, a body of research evidence would accumulate, and researchers would be able to review this body to draw general conclusions about the determinants of insurance sales.

Variables, Values, and Observations

Definitions. When discussing data, one often speaks in terms of variables, values, and observations. Further complicating matters is the fact that researchers make distinctions between different types of variables (such as quantitative variables versus classification variables). This section discusses the distinctions between these terms.

• Variables. For the type of research discussed in this book, a variable refers to some specific characteristic of a subject that can assume one or more different values. For the subjects in the study described above, “amount of insurance sold” is an example of a variable: Some subjects had sold a large amount of insurance, and others had sold less. A different variable was “goal difficulty”: Some subjects had more difficult goals, while others had less difficult goals. Subject age was a third variable, while subject sex (male versus female) was yet another.

• Values. A value, on the other hand, refers to either a particular subject's relative standing on a quantitative variable, or a subject's classification within a classification variable. For example, the “amount of insurance sold” is a quantitative variable that can assume a large number of values: One agent might sell $2,000,000 worth of insurance in one year, one might sell $100,000 worth, and another might sell $0 worth. Subject age is another quantitative variable that can assume a wide variety of values. In the sample studied, these values ranged from a low of 22 years to a high of 64 years.

• Quantitative variables. You can see that, in both of these examples, a particular value is a type of score that indicates where the subject stands on the variable. The word “score” is an appropriate substitute for the word “value” in these cases because both “amount of insurance sold” and “age” are quantitative variables: variables that represent the quantity, or amount, of the construct that is being assessed. With quantitative variables, numbers typically serve as values.

• Classification variables. A different type of variable is a classification variable or, alternatively, qualitative variable or categorical variable. With classification variables, different values represent different groups to which the subject might belong. “Sex” is a good example of a classification variable, as it might assume only one of two values: A particular subject is classified as being either a male or a female. “Political Party” is an example of a classification variable that can assume a larger number of values: A subject might be classified as being a Republican, a Democrat, or an independent. These variables are classification variables and not quantitative variables because the values only represent membership in a single, specific group, and that membership cannot be represented meaningfully with a numeric value.

• Observational units. In discussing data, researchers often make references to observational units, which can be defined as the individual subjects (or other objects) that serve as the source of the data. Within the behavioral sciences and education, an individual person usually serves as the observational unit under study (although it is also possible to use some other entity, such as an individual school or organization, as the observational unit). In this text, the individual person is used as the observational unit in most examples. Researchers will often refer to the “number of observations” or “number of cases” included in their data set, and this typically refers to the number of subjects who were studied.

An example. For a more concrete illustration of the concepts discussed so far, consider the data set displayed in Table 2.1:

Table 2.1

Insurance Sales Data

________________________________________________________________________
                                   Goal
                                   difficulty   Overall
Observation   Name    Sex   Age    scores       ranking   Sales
________________________________________________________________________
1             Bob     M     34     97           2         $598,243
2             Walt    M     56     80           1         $367,342
3             Jane    F     36     67           4         $254,998
4             Susan   F     24     40           3         $80,344
5             Jim     M     22     37           5         $40,172
6             Mack    M     44     24           6         $0
________________________________________________________________________

The preceding table reports information regarding six research subjects: Bob, Walt, Jane, Susan, Jim, and Mack; therefore, we would say that the data set includes six observations. Information about a particular observation (subject) is displayed as a row running horizontally from left to right across the table.

The first column of the data set (running vertically from top to bottom) is headed “Observation,” and it simply provides an observation number for each subject. The second column (headed “Name”) provides a name for each subject.

The remaining five columns report information about the five research variables that are being studied.

The column headed “Sex” reports subject sex, which might assume one of two values: “M” for male and “F” for female.


The column headed “Age” reports the subject's age in years.

The “Goal Difficulty Scores” column reports the subject's score on a fictitious goal difficulty scale. In this example, each participant has a score on a 20-item questionnaire about the difficulty of his or her work goals. Depending on how they respond to the questionnaire, subjects receive a score ranging from a low of zero (meaning that the subject views the work goals as extremely easy) to a high of 100 (meaning that the goals are viewed as extremely difficult).

The column headed “Overall Ranking” shows how the subjects were ranked by their supervisor according to their overall effectiveness as agents. A rank of 1 represents the most effective agent, and a rank of 6 represents the least effective.

The column headed “Sales” reveals the amount of insurance sold by each agent (in dollars) during the most recent year.

Table 2.1 provides a very small data set with six observations and five research variables (sex, age, goal difficulty, overall ranking, and sales). One of the variables was a classification variable (sex), while the remainder were quantitative variables. The numbers or letters that appear within a particular column represent some of the values that could be assumed by that variable.
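The data set in Table 2.1 could be entered into SAS along the following lines. This is a minimal sketch: the data set name INSURANCE and the variable names are assumptions chosen for illustration, not names established by this guide. The COMMA informat is used so that the dollar signs and commas in the sales figures are stripped when the values are read.

```sas
/* Enter the six observations from Table 2.1.
   Data set and variable names are illustrative assumptions. */
data insurance;
   input observation name $ sex $ age goal_difficulty ranking
         sales :comma10.;
   format sales dollar10.;   /* display sales with $ and commas */
   datalines;
1 Bob   M 34 97 2 $598,243
2 Walt  M 56 80 1 $367,342
3 Jane  F 36 67 4 $254,998
4 Susan F 24 40 3 $80,344
5 Jim   M 22 37 5 $40,172
6 Mack  M 44 24 6 $0
;
run;

/* List the data set: one row per observation, one column per variable */
proc print data=insurance noobs;
run;
```

Each line after DATALINES corresponds to one observation (one row of the table), and each name on the INPUT statement corresponds to one column. NAME and SEX are followed by a dollar sign because their values are character strings rather than numbers.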

Classifying Variables According to Their Scales of Measurement

Introduction

One of the most important schemes for classifying a variable involves its scale of measurement. Researchers generally discuss four different scales of measurement: nominal, ordinal, interval, and ratio. Before analyzing a data set, it is important to determine which scales of measurement were used because certain types of statistical procedures require specific scales of measurement. For example, a one-way analysis of variance generally requires that the dependent variable be an interval-level or ratio-level variable; the chi-square test of independence allows you to analyze nominal-level variables; other statistics make other assumptions about the scale of measurement used with the variables that are being studied.


Nominal Scales

A nominal scale is a classification system that places people, objects, or other entities into mutually exclusive categories. A variable that is measured using a nominal scale is a classification variable: It simply indicates the name of the group to which each subject belongs. The examples of classification variables provided earlier (e.g., sex and political party) also serve as examples of nominal-level variables: They tell you which group a subject belongs to, but they do not provide any quantitative information about the subjects. That is, the “sex” variable might tell you that some subjects are males and others are females, but it does not tell you that some subjects possess more of a specific characteristic relative to others. With the remaining three scales of measurement, however, some quantitative information is provided.

Ordinal Scales

Values on an ordinal scale represent the rank order of the subjects with respect to the variable that is being assessed. For example, Table 2.1 includes one variable called “Overall Ranking,” which represents the rank-ordering of the subjects according to their overall effectiveness as agents. The values on this ordinal scale represent a hierarchy of levels with respect to the construct of “effectiveness”: We know that the agent ranked “1” was perceived as being more effective than the agent ranked “2,” that the agent ranked “2” was more effective than the one ranked “3,” and so forth.

However, an ordinal scale has a serious limitation in that equal differences in scale values do not necessarily have equal quantitative meaning. For example, notice the rankings reproduced here:

Overall
ranking   Name
_______   ______
1         Walt
2         Bob
3         Susan
4         Jane
5         Jim
6         Mack

Notice that Walt was ranked #1 while Bob was ranked #2. The difference between these two rankings is 1 (because 2 – 1 = 1), so we might say that there is one unit of difference between Walt and Bob. Now notice that Jim was ranked #5 while Mack was ranked #6. The difference between these two rankings is also 1 (because 6 – 5 = 1), so we might say that there is also 1 unit of difference between Jim and Mack. Putting the two together, we can see that the difference in ranking between Walt and Bob is equal to the difference in ranking between Jim and Mack. But does this mean that the difference in overall effectiveness between Walt and Bob is equal to the difference in overall effectiveness between Jim and Mack? Not necessarily. It is possible that Walt was just barely superior to Bob in effectiveness, while Jim was substantially superior to Mack. These rankings tell us very little about the quantitative differences between the subjects with regard to the underlying construct (effectiveness, in this case). An ordinal scale simply provides a rank order of who is better than whom.

Interval Scales

With an interval scale, equal differences between scale values do have equal quantitative meaning. For this reason, you can see that the interval scale provides more quantitative information than the ordinal scale. A good example of an interval scale is the Fahrenheit scale used to measure temperature. With the Fahrenheit scale, the difference between 70 degrees and 75 degrees is equal to the difference between 80 degrees and 85 degrees: the units of measurement are equal throughout the full range of the scale.

However, the interval scale also has an important limitation: it does not have a true zero point. A true zero point means that a value of zero on the scale represents zero quantity of the construct being assessed. It should be obvious that the Fahrenheit scale does not have a true zero point. When the thermometer reads zero degrees, that does not mean that there is absolutely no heat present in the environment; it is still possible for the temperature to go lower (into the negative numbers).

Researchers in the social sciences often assume that many of their “man-made” variables are measured on an interval scale. For example, in the preceding study involving insurance agents, you would probably assume that scores from the goal difficulty questionnaire constitute an interval-level scale; that is, you would likely assume that the difference between a score of 50 and 60 is approximately equal to the difference between a score of 70 and 80. Many researchers would also assume that scores from an instrument such as an intelligence test are measured at the interval level of measurement.

On the other hand, some researchers are skeptical that instruments such as these have true equal-interval properties, and prefer to refer to them as quasi-interval scales. The level of measurement achieved with such paper-and-pencil instruments continues to be a controversial topic within many disciplines.

In any case, it is clear that there is no true zero point with either of the preceding instruments: a score of zero on the goal difficulty scale does not indicate the complete absence of goal difficulty, and a score of zero on an intelligence test does not indicate the complete absence of intelligence. A true zero point can be found only with variables measured on a ratio scale.


Ratio Scales

Ratio scales are similar to interval scales in that equal differences between scale values do have equal quantitative meaning. However, ratio scales also have a true zero point, which gives them an additional property: with ratio scales, it is possible to make meaningful statements about the ratios between scale values.

For example, the system of inches used with a common ruler is an example of a ratio scale. There is a true zero point with this system, in that “zero inches” does in fact indicate a complete absence of length. With this scale, it is possible to make meaningful statements about ratios. It is appropriate to say that an object four inches long is twice as long as an object two inches long. Age, as measured in years, is also on a ratio scale: a 10-year-old house is twice as old as a 5-year-old house. Notice that it is not possible to make these statements about ratios with the interval-level variables discussed above. One would not say that a person with an IQ of 160 is twice as intelligent as a person with an IQ of 80, as there is no true zero point with that scale.
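A short worked calculation may make the role of the true zero point more concrete. In the sketch below, the conversions to centimeters and to degrees Celsius are added for illustration and do not appear elsewhere in this guide. With a ratio scale, the ratio between two values is unchanged when the unit of measurement changes; with an interval scale, it is not.

```latex
% Ratio scale (length): the ratio survives a change of units
\frac{4\ \text{in}}{2\ \text{in}} = 2
\qquad\text{and}\qquad
\frac{4 \times 2.54\ \text{cm}}{2 \times 2.54\ \text{cm}}
  = \frac{10.16\ \text{cm}}{5.08\ \text{cm}} = 2

% Interval scale (temperature): the "ratio" depends on the arbitrary zero
\frac{80^{\circ}\mathrm{F}}{40^{\circ}\mathrm{F}} = 2
\qquad\text{but}\qquad
\frac{(80 - 32) \times \tfrac{5}{9}\ {}^{\circ}\mathrm{C}}
     {(40 - 32) \times \tfrac{5}{9}\ {}^{\circ}\mathrm{C}}
  = \frac{48}{8} = 6
```

The same two temperatures appear to stand in a ratio of 2 on one scale and 6 on the other, so neither figure says anything meaningful about how much "hotter" one is than the other. The differences between values, by contrast, remain interpretable on both scales.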

Although ratio-level scales are most commonly used for reporting the physical properties of objects (e.g., height, weight), they are also common in the type of research that is discussed in this manual. For example, the study discussed above included the variables “age” and “amount of insurance sold (in dollars).” Both of these have true zero points, and are measured on ratio scales.

Classifying Variables According to the Number of Values They Display

Overview

The preceding section showed that variables can be classified according to their scale of measurement. Sometimes it is also useful to classify variables according to the number of values they display. There might be any number of approaches for doing this, but this guide uses a simple division of variables into three groups according to the number of possible values: dichotomous variables, limited-value variables, and multi-value variables.

Dichotomous Variables

A dichotomous variable is a variable that assumes just two values. These variables are sometimes called binary variables. Here are some examples of dichotomous variables:

• Suppose that you obtain Smith Anxiety Test scores from 50 male subjects and 50 female subjects. In this study, “subject sex” is a dichotomous variable, because it can assume just two values, “male” versus “female.”


• Suppose that you conduct an experiment to determine whether the herbal supplement ginkgo biloba causes improvement in a rat’s ability to learn. You begin with 20 rats, and randomly assign them to two groups. Ten rats are assigned to the 100 mg group (they receive 100 mg of ginkgo), and the other ten rats are assigned to the 0 mg group (they receive no ginkgo). In this study, the independent variable that you are manipulating is “amount of ginkgo administered.” This is a dichotomous variable because it assumes just two values: “0 mg” versus “100 mg.”

Limited-Value Variables

A limited-value variable is a variable that assumes just two to six values in your sample. Here are some examples of limited-value variables:

• Suppose that you obtain Smith Anxiety Test scores from 50 Caucasian subjects, 50 African-American subjects, and 50 Asian-American subjects. In this study, “subject race” is a limited-value variable because it assumes just three values: “Caucasian” versus “African-American” versus “Asian-American.”

• Suppose that you again conduct an experiment to determine whether ginkgo biloba causes improvements in a rat’s ability to learn. You begin with 100 rats, and randomly assign them to four groups: Twenty-five rats are assigned to the 150 mg group, 25 rats are assigned to the 100 mg group, 25 rats are assigned to the 50 mg group, and 25 rats are assigned to the 0 mg group. In this study, the independent variable that you are manipulating is still “amount of ginkgo administered.” You know that this is a limited-value variable because it assumes just four values: “0 mg” versus “50 mg” versus “100 mg” versus “150 mg.”
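Because this classification rests on how many different values a variable takes in your sample, one practical first step is simply to count those values. The following sketch shows one way this might be done; the data set name AGENTS and the variable names are hypothetical, the values are the sex and age columns of Table 2.1, and the NLEVELS option of PROC FREQ is assumed to be available in your release of SAS.

```sas
/* Sex and age values taken from Table 2.1.
   Data set and variable names are illustrative assumptions. */
data agents;
   input sex $ age;
   datalines;
M 34
M 56
F 36
F 24
M 22
M 44
;
run;

/* NLEVELS requests a table giving the number of distinct values
   (levels) of each variable; NOPRINT suppresses the individual
   frequency tables so that only the level counts are displayed. */
proc freq data=agents nlevels;
   tables sex age / noprint;
run;
```

In this small sample, SEX takes two distinct values and AGE takes six, and those counts are what you would compare against the definitions in this section.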

Multi-Value Variables

Finally, this book defines a multi-value variable as a variable that assumes more than six values in your sample. Here are some examples of multi-value variables:

• Assume that you obtain Smith Anxiety Test scores from 100 subjects. With the Smith Anxiety Test, scores (values) may range from 0–99, with higher scores indicating greater anxiety. In analyzing the data, you see that your subjects displayed a wide variety of scores, for example:

   • One subject received a score of 2.
   • One subject received a score of 5.
   • Two subjects received a score of 10.
   • Five subjects received a score of 21.
   • Seven subjects received a score of 33.
   • Eight subjects received a score of 45.
   • Nine subjects received a score of 53.