Quantitative Methods
Pre-Assessment Test Introduction
Welcome to the pre-assessment test for the HBS Quantitative Methods Tutorial. Students with a strong statistics background may take the pre-assessment test to satisfy the quantitative methods requirement without taking the tutorial. To satisfy the requirement, you will need to answer at least 75% of the questions correctly.
This is an open-book multiple-choice exam. To advance from one question to the next, you must select one of the four answer choices and click the Submit button. After submitting your answer, you will not be able to change it or return to the question, so make sure you are satisfied with your selection before you submit each answer.
In the briefcase, links to Excel spreadsheets containing z-value and t-value tables are provided for your convenience. For some questions, additional links to Excel spreadsheets containing relevant data will appear immediately below the question text.
Your exam results will be displayed immediately upon completion of the exam. The exam results screen will indicate which questions you answered correctly, and which area of the tutorial you should review for the questions you answered incorrectly. After completing the exam, you can review your test results at any time by returning to this screen and clicking OK.
If you haven't yet taken the test, click Pre-Assessment Test on the navigation on the left to begin. Good luck!
Frequently Asked Questions
How difficult are the questions on the exam? The exam questions have a level of difficulty similar to the exercises in the course.
Can I refer to statistics textbooks and online resources to help me during the test? Yes. This is an open-book examination.
May I receive assistance on the exam? No. Although we strongly encourage collaborative learning at HBS, work on exams such as the assessment tests must be entirely your own. Thus you may neither give nor receive help on any exam question.
Is this a timed exam? No. You should take about 60-90 minutes to complete the exam, depending on your familiarity with the material, but you may take longer if you need to.
What happens if I am (or my internet connection is) interrupted while taking the exam? Your answer choices will be recorded for the questions you were able to complete and you will be able to pick up where you left off when you return to the exam site.
How do I see my exam results? Your results will be displayed as soon as you submit your answer to the final question. The results screen will indicate which questions you answered correctly, and which area of the tutorial you should review for any questions you answered incorrectly.
Pre-Assessment Test
[Exam content not shown]
Overview & Introduction
Welcome to QM...
Welcome! You are about to embark on a journey that will introduce you to the basics of quantitative and statistical analysis. This course will help you develop your skills and instincts in applying quantitative methods to formulate, analyze, and solve management decision-making problems.
Click on the link labeled "The Tutorial and its Method" in the left menu to get started.
The Tutorial and its Method
evaluate management situations you will face not only in your studies but also as a manager. Click on the right arrow icon below to advance to the next page.
This isn't a formal or comprehensive tutorial in quantitative methods. QM won't make you a statistician, but it will help you become a more effective manager.
The tutorial's primary emphasis is on developing good judgment in analyzing management problems. Whether you are learning the material for the first time or are using QM to refresh your quantitative skills, you can expect the tutorial to improve your ability to formulate, analyze, and solve managerial problems.
You won't be learning quantitative analysis in the typical textbook fashion. QM's interactive nature provides frequent opportunities to assess your understanding of the concepts and how to apply them — all in the context of actual management problems.
You should take 15 to 20 hours to run through the whole tutorial, depending on your familiarity with the material. QM offers many features we hope you will explore, utilize, and enjoy.
The Story and its Characters
Naturally, the most appropriate setting for a course on statistics is a tropical island...
Somehow, "internship" is not the way you'd describe your summer plans to your friends. You're flying out to Hawaii after all, staying at a 5-star hotel as a Summer Associate with Avio Consulting.
This is a great learning opportunity, no doubt about it. To think that you had almost skipped over this summer internship, as you prepared to enroll in a two-year MBA program this fall.
You are also excited that the firm has assigned Alice, one of its rising stars, as your mentor. It seems clear that Avio partners consider you a high potential intern — they are willing to invest in you with the hope that you will later return after you complete your MBA program.
Alice recently received the latest in a series of quick promotions at Avio. This is her first assignment as a project lead: providing consulting assistance to the Kahana, an exclusive resort hotel on the Hawaiian island Kauai.
Needless to say, one of the perks of the job is the lodging. The Kahana's brochure looks inviting — luxury suites, fine cuisine, a spa, sports activities. And above all, the pristine beach and glorious ocean.
After your successful interview with Avio, Alice had given you a quick briefing on the hotel and its manager, Leo. Leo inherited the Kahana just three years ago. He has always been in the hospitality industry, but the sheer scope of the luxury hotel's operations has him slightly overwhelmed. He has asked for Avio's help to bring a more rigorous approach to his management decision-making processes.
Using the Tutorial: A Guide to Tutorial Resources
Before you start packing your beach towel, read this section to learn how to use this tutorial to your greatest advantage.
QM's structure and navigational tools are easy to master. If you're reading this text, you must have clicked on the link labeled "Using the Tutorial" on the left.
These navigation links open interactive clips (like this one) here.
There are three types of interactive clips: Kahana Clips, Explanatory Clips, and Exercise Clips. Kahana Clips pose problems that arise in the context of your consulting engagement at the Kahana. Typically, one clip will have Leo assign you and Alice a specific task. In a later Kahana Clip you will analyze the problem, and you and Alice will present your results to Leo for his consideration. The Kahana clips will give you exposure to the types of business problems that benefit from the analytical methods you'll be learning, and a context for practicing the methods and interpreting their results.
To fully benefit from the tutorial, you should solve all of Leo's problems. At the end of the tutorial, a multiple-choice assessment exam will evaluate your understanding of the material.
In Explanatory Clips, you will learn everything needed to analyze management problems like Leo's. Complementing the text are graphs, illustrations, and animations that will help you understand the material. Keep on your toes: you'll be asked questions even in Explanatory Clips that you should answer to check your understanding of the concepts.
Some explanatory clips give you directions or tips on how to use the analytical and computational features of Microsoft Excel. Facility with the necessary Excel functions will be critical to solving the management decision problems in this course.
When you see a Briefcase link in a clip, we strongly encourage you to click on the link to access the data. Then, practice using the Excel functions to reproduce the graphs and analyses that appear in the clips. You will also see Data links that you should click to view summary data relating to the problem.
Exercise Clips provide additional opportunities for you to test your understanding of the material. They are a resource that you can use to make sure that you have mastered the important concepts in each section.
Work through exercises to solidify your knowledge of the material. Challenge exercises provide opportunities to tackle somewhat more advanced problems. The challenge exercises are optional; you should not have to complete them to gain the mastery needed to pass the tutorial assessment test.
The arrow buttons immediately below are used for navigation within clips. If you've made it this far, you've been using the one on the right to move forward. Use the one on the left if you want to back up a page or two.
In the upper right of the QM tutorial screen are six buttons. From left to right, they link to Help, Discuss, Notes, Briefcase, Glossary, and Print.
To access additional Help features, click on the Help icon.
Use the discussion board to discuss course materials with your classmates, ask questions, and share any previous on-the-job experiences you may have had applying the concepts in the course. HBS staff and faculty will also use the discussion board to post clarifying information from time to time. To access the discussion board, click on the Discuss icon.
The Notes summarize the content of the Explanatory Clips. Can't recall all the essential steps of a hypothesis test? Find them in the Notes.
In your Briefcase you'll find all the data you'll need to complete the course, neatly stored as Excel Workbooks. In many of the clips there will be links to specific documents in the Briefcase, but the entire Briefcase is available at any time.
In the Glossary/Index you'll find a list of helpful definitions of terms used in the course, along with brief descriptions of the Excel functions used in the course.
At the end of the tutorial, you'll have the opportunity to evaluate the tutorial. In the meantime, as you work through QM, you may have comments or feedback on the material. We invite your feedback at any time: click on the Feedback icon on the navigation bar below. The page you are currently viewing will be recorded with your feedback.
We encourage you to use all of QM's features and resources to the fullest. They are designed to help you build an intuition for quantitative analysis that you will need as an effective and successful manager.
... and Welcome to Hawaii!
The day of departure has come, and you're in flight over the Pacific Ocean. Alice graciously let you take the window seat, and you watch as the foggy West Coast recedes behind you.
I've been to Hawaii before, so I'll let you have the experience of seeing the islands from the air before you set foot on them. This Leo sounds like quite a character. He's been in business all his life, involved in many ventures — some more successful than others. Apparently, he once owned and managed a gourmet spam restaurant!
Spam is really popular among the islanders. Leo tried to open a second location in downtown Honolulu for the tourists, but that didn't do so well. He had to declare bankruptcy.
Then, just three years ago, his aunt unexpectedly left him the Kahana. Now Leo is back in business, this time with a large operation on his hands.
It sounds to me like he's the kind of manager who usually relies on gut instincts to make business decisions, and likes to take risks. I think he's hired Avio to help him make managerial decisions with, well, better judgment. He wants to learn how to approach management problems in a more sophisticated, analytical fashion.
We'll be using some basic statistical tools and methods. I know you're no expert in statistics, but I'll fill you in along the way. You'll be surprised at how quickly they'll become second nature to you. I'm confident you'll be able to do quite a bit of the analytic work soon.
Once your plane touches down in Kauai, you quickly pick up your baggage and meet your host, Leo, outside the airport.
Inheriting the Kahana came as a big surprise. My aunt had run the Kahana for a long time, but I never considered that she would leave it to me.
Anyway, I've been trying my best to run the Kahana the way a hotel of its quality deserves. I've had some ups and downs. Things have been fairly smooth for the past year now, but I've realized that I have to get more serious about the way I make decisions. That's where you come into the picture.
I used to be quite a risk-taker. I made a lot of decisions on impulse. Now, when I think of what I have to lose, I just want to get it right.
After you arrive at the Kahana, Leo personally shows you to your rooms. "I have a table reserved for the three of us at 8 in the main restaurant," Leo announces. "You just have to try our new chef's mango and brie tart."
Basics: Data Description
Leo's Data Mine
After your welcome dinner in the Kahana's main restaurant, Leo asks you and Alice to meet him the next morning. You wake up early enough to take a short walk on the beach before you make your way to Leo's office.
Good morning! I hope you found your rooms comfortable last night and are starting to recover from your trip.
Unfortunately, I don't have much time this morning. As you requested on the phone, I've assembled the most important data on the Kahana. It wasn't easy — this hasn't been the most organized hotel in the world, especially since I took over. There's just so much to keep track of.
Thank you, Leo. We'll have a look at your data right away, so we can get a more detailed understanding of the Kahana and the type of data you have available for us to work with. Anything in particular that you'd like us to focus on as we peruse your files?
Yes. There are two things in particular that have been on my mind recently.
For one, we offer some recreational activities here at the Kahana, including a scuba diving certification course. I contract out the operations to a local diving school. The contract is up soon, and I need to renew it, hire another school, or discontinue offering scuba lessons altogether.
I'd like you to get me some quotes from other diving schools on the island so I get an idea of the competition's pricing and how it compares to the school I've been using.
I'm also very concerned about hotel occupancy rates. As you might imagine, the Kahana's occupancy fluctuates during the year, and I'd like to know how, when, and why. I'd love to have a better feeling for how many guests I can expect in a given month. These files contain some information about tourism on the island, but I'd really like you to help me make better sense of it. Somehow I feel that if I could understand the patterns in the data, I could better predict my own occupancy rates.
That's what we're here to do. We'll take a look at your files to get better acquainted with the Kahana, and then focus on diving school prices and occupancy patterns.
Thanks, or as we say in Hawaiian, Mahalo. By the way, we're not too formal here on Hawaii. As you probably noticed, your suite, Alice, includes a room that has been set up as an office. But feel free to take your work down to the beach or by the pool whenever you like.
Thanks! We'll certainly take advantage of that.
Later, under a parasol at the beach, you pore over Leo's folders. Feeling a bit overwhelmed, you find yourself staring out to sea.
Alice tells you not to worry: "We have a number of strategies we can use to compile a mountain of data like this into concise and useful information. But no matter what data you are working with, always make sure you really understand the data before doing a lot of analysis or making managerial decisions."
What is Alice getting at when she tells you to "understand the data?" And how can you develop such an understanding?
Describing and Summarizing Data
Data can be represented by graphs like histograms. These visual displays allow you to quickly recognize patterns in the distribution of data.
Working with Data
Information overload. Inventory costs. Payroll. Production volume. Asset utilization. What's a manager to do?
The data we encounter each day have valuable information buried within them. As managers, correctly analyzing financial, production, or marketing data can greatly improve the quality of the decisions we make.
Analyzing data can be revealing, but challenging. As managers, we want to extract as much of the relevant information and insight as possible from the data we have available.
When we acquire a set of data, we should begin by asking some important questions: Where do the data come from? How were they collected? How can we help the data tell their story?
Suppose a friend claims to have measured the heights of everyone in a building. She reports that the average height was three and a half feet. We might be surprised...
... until we learn that the building is an elementary school.
We'd also want to know if our friend used a proper measuring stick. Finally, we'd want to be sure we knew how she measured height: with or without shoes.
Before starting any type of formal data analysis, we should try to get a preliminary sense of the data. For example, we might first try to detect any patterns, trends, or relationships that exist in the data.
category or across different categories. But how do we do this? And is this often time-consuming process worth it?
Accountants think so. Balance Sheets and Profit and Loss Statements arrange information to make it easier to comprehend. In addition, accountants separate costs into categories such as capital investments, labor costs, and rent. We might ask: Are operating expenses increasing or decreasing? Do office space costs vary much from year to year?
Comparing data across different years or different categories can give us further insight. Are selling costs growing more rapidly than sales? Which division has the highest inventory turns?
Histograms
In addition to grouping data, we often graph them to better visualize any patterns in the data. Seeing data displayed graphically can significantly deepen our understanding of a data set and the situation it describes.
To see the value a graphical approach can add, let's look at worldwide consumption of oil and gas in 2000. What questions might we want to answer with the energy data? Which country is the largest consumer? How much energy do most countries use?
In order to create a graph that provides good visual insight into these questions, we might sort the countries by their level of energy consumption, then group together countries whose consumption falls in the same range — e.g., the countries that use 100 to 199 million tonnes per year, or 200 to 299 million tonnes.
We can find the number of countries in each range, and then create a bar graph in which the height of each bar represents the number of countries in each range. This graph is called a histogram.
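The grouping and counting described above can be sketched in a few lines of Python. This is only an illustration: the consumption figures and the names `consumption`, `BIN_WIDTH`, and `bin_counts` are hypothetical and are not taken from the energy data in the tutorial.

```python
from collections import Counter

# Hypothetical annual consumption figures (million tonnes); not the tutorial's data.
consumption = [12, 45, 87, 95, 130, 160, 250, 320, 510, 890]
BIN_WIDTH = 100  # each range covers 100 million tonnes: 0-99, 100-199, ...

# Map each value to the index of its range, then count the values in each range.
bin_counts = Counter(int(value // BIN_WIDTH) for value in consumption)

# Print a text histogram: one row per range, one '#' per country in that range.
for index in range(max(bin_counts) + 1):
    low = index * BIN_WIDTH
    high = low + BIN_WIDTH - 1
    print(f"{low:>4}-{high:<4} {'#' * bin_counts[index]}")
```

Each row corresponds to one bar of the histogram, and ranges that contain no values appear as empty rows.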
A histogram shows us where the data tend to cluster. What are the most common values? The least common? For example, we see that most countries consume less than 100 million tonnes per year, and the vast majority less than 200 million tonnes. Only three countries, Japan, Russia, and the US, consume more than 300 million tonnes per year.
Why are there so many countries in the first range — the lowest consumption? What factors might influence this? Population might be our first guess.
Yet despite a large population, India's energy consumption is significantly less than that of Germany, a much smaller nation. Why might this be? Clearly other factors, like climate and the extent of industrialization, influence a country's energy usage.
Outliers
In many data sets, there are occasional values that fall far from the rest of the data. For example, if we graph the age distribution of students in a college course, we might see a data point at 75 years. Data points like this one that fall far from the rest of the data are known as outliers. How do we interpret them?
First, we must investigate why an outlier exists. Is it just an unusual, but valid value? Could it be a data entry error? Was it collected in a different way than the rest of the data? At a different time? We might discover that the data point refers to a 75 year-old retiree, taking the course for fun.
After making an effort to understand where an outlier comes from, we should have a deeper understanding of the situation the data represent. Then, we can think about how to handle the outlier in our analysis. Typically, we do one of three things: leave the outlier alone or, very rarely, remove it or change it to a corrected value.
A senior citizen in a college class may be an outlier, but his age represents a legitimate value in the data set. If we truly want to understand the age distribution of all students in the class, we would leave the point in.
Or, if we now realize that what we really want is the age distribution of students in the course who are also enrolled in full-time degree-granting programs, we would exclude the senior citizen and all other non-degree program students enrolled in the course.
Occasionally, we might change the value of an outlier. This should be done only after examining the underlying situation in great detail.
For example, if we look at the inventory graph below, a data point showing 80 pairs of roller-blades in inventory would be highly unusual.
Notice that the data point "80" was recorded on April 13th, and that the inventory was 10 pairs on April 12th, and 6 on April 14th.
Based on our management understanding of how inventory levels rise and fall, we realize that the value of 80 is extraordinarily unlikely. We conclude that the data point was likely a data entry error. Further investigation of sales and purchasing records reveals that the actual inventory level on that day was 8, not 80. Having found a reliable value, we correct the data point.
Excluding or changing data is not something we do often. We should never do it to help the data 'fit' a conclusion we want to draw. Such changes to a data set should be made on a case-by-case basis only after careful investigation of the situation.
Summary
With any data set we encounter, we must find ways to allow the data to tell their story. Ordering and graphing data sets often expose patterns and trends, thus helping us to learn more about the data and the underlying situation. If data can provide insight into a situation, they can help us to make the right decisions.
Creating Histograms
Note: Unless you have installed the Excel Data Analysis ToolPak add-in, you will not be able to create histograms using the Histogram tool. However, we suggest you read through the instructions to learn how Excel creates histograms so you can construct them in the future when you do have access to the Data Analysis ToolPak.
To check if the ToolPak is installed on your computer, go to the Data tab in the Toolbar in Excel 2007. If "Data Analysis" appears in the Ribbon, the ToolPak has already been installed. If not, click the Office Button in the top left and select "Excel Options." Choose "Add-Ins" and highlight the "Analysis ToolPak" in the list and click "Go." Check the box next to Analysis ToolPak and click "OK." Excel will then walk you through a setup process to install the ToolPak.
Creating a histogram with Excel involves two steps: preparing our data, and processing them with the Data Analysis Histogram tool.
To prepare the data, we enter or copy the values into a single column in an Excel worksheet.
Often, we have specific ranges in mind for classifying the data. We can enter these ranges, which Excel calls "bins," into a second column of data.
In the Tool bar, select the Data tab, and then choose Data Analysis. In the Data Analysis pop-up window, choose Histogram and click OK.
Click on the Input Range field and enter the range of data values by either typing the range or by dragging the cursor over the range.
Next, to use the bins we specified, click on the Bin Range field and enter the appropriate range. Note: if we don't specify our own bins, Excel will create its own bins, which are often quite peculiar.
Click the Chart Output checkbox to indicate that we want a histogram chart to be generated in addition to the summary table, which is created by default.
Click New Worksheet Ply, and enter the name you would like to give the output sheet. Finally, click OK, and the histogram with the summary table will be created in a new sheet.
Central Values for Data
Graphs are very useful for gaining insight into data. However, sometimes we would like to summarize the data in a concise way with a single number.
The Mean
Often, we'd like to summarize a set of data with a single number. We'd like that summary value to describe the data as well as possible. But how do we do this? Which single value best represents an entire set of data? That depends on the data we're investigating and the type of questions we'd like the data to answer.
What number would best describe employee satisfaction data collected from annual review questionnaires? The numerical average would probably work quite well as a single value representing employees' experiences.
To calculate average — or mean — employee satisfaction, we take all the scores, sum them up, and divide the result by 11, the number of surveys. The Greek letter mu represents the mean of the data set.
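As a rough illustration of this calculation, the sketch below sums a set of scores and divides by the number of surveys. The eleven scores are hypothetical stand-ins, since the actual questionnaire data are not reproduced here, and the variable names are chosen only for this example.

```python
# Hypothetical satisfaction scores from 11 surveys; illustrative only.
scores = [7, 8, 6, 9, 7, 8, 5, 9, 8, 7, 4]

n = len(scores)                # number of surveys
mean_score = sum(scores) / n   # add up all the scores, then divide by the count

print(f"n = {n}, mean = {mean_score:.2f}")
```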
The mean is by far the most common measure used to describe the "center" or "central tendency" of a data set. However, it isn't always the best value to represent data. Outliers can exercise undue influence and pull the mean value towards one extreme.
In addition, if the distribution has a tail that extends out to one side (a skewed distribution), the values on that side will pull the mean towards them. Here, the distribution is strongly skewed to the right: the high value of US consumption pulls the mean to a value higher than the consumption of most other countries. What other numbers can we use to find the central tendency of the data?
The Median
Let's look at the revenues of the top 100 companies in the US. The mean revenue of these companies is about $42 billion. How should we interpret this number? How well does this average represent the revenues of these companies?
When we examine the revenue distribution graphically, we see that most companies bring in less than $42 billion of revenue a year. If this is true, why is the mean so high?
As our intuition might tell us, the top companies have revenues that are much higher than $42 billion. These higher revenues pull up the average considerably.
In cases like income, where the data are typically very skewed, the mean often isn't the best value to represent the data. In these cases, we can use another central value called the median.
The median is the middle value of a data set whose values are arranged in numerical order. Half the values are higher than the median, and half are lower.
The median revenue of the top 100 US companies is $30 billion, significantly less than the mean of $42 billion. Half of all the companies earn less than $30 billion, and half earn more than $30 billion.
Median revenue is a more informative revenue estimate because it is not pulled upwards by a small number of high-revenue earners. How can we find the median?
With an odd number of data points, listed in order, the median is simply the middle value. For example, consider this set of 7 data points. The median is the 4th data point, $32.51.
In a data set with an even number of points, we average the two middle values — here, the fourth and fifth values — and obtain a median of $41.92.
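The odd and even cases can be written as one small function. This is a minimal sketch: the function name `median` and the price lists are hypothetical and are not the data sets shown in the clips.

```python
def median(values):
    """Return the middle value, averaging the two middle values when the count is even."""
    ordered = sorted(values)      # the data need not arrive in order
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:                # odd count: a single middle value
        return ordered[middle]
    # even count: average the two middle values
    return (ordered[middle - 1] + ordered[middle]) / 2

# Hypothetical prices, not the data sets shown in the tutorial.
seven_prices = [18.0, 41.5, 25.0, 60.0, 33.0, 29.5, 47.0]
eight_prices = seven_prices + [52.0]

print(median(seven_prices))   # 4th of the 7 sorted values
print(median(eight_prices))   # average of the 4th and 5th of the 8 sorted values
```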
When deciding whether to use a mean or median to represent the central tendency of our data, we should weigh the pros and cons of each. The mean weighs the value of every data point, but is sometimes biased by outliers or by a highly skewed distribution.
By contrast, the median is not biased by outliers and is often a better value to represent skewed data.
The Mode
A third statistic to represent the "center" of a data set is its mode: the data set's most frequently occurring value. We might use the mode to represent data when knowing the average value isn't as important as knowing the most common value. In some cases, data may cluster around two or more points that occur especially frequently, giving the histogram more than one peak. A distribution that has two peaks is called a bimodal distribution.
Summary
To summarize a data set using a single value, we can choose one of three values: the mean, the median, or the mode. They are often called summary statistics or descriptive statistics. All three give a sense of the "center" or "central tendency" of the data set, but we need to understand how they differ before using them:
Finding The Mean In Excel
To find the mean of a data set entered in Excel, we use the AVERAGE function.
We can find the mean of numerical values by entering the values in the AVERAGE function, separated by commas.
In most cases, it's easier to calculate a mean for a data set by indicating the range of cell references where the data are located.
Excel ignores blank values in cells, but not zeros. Therefore, we must be careful not to put a zero in the data set if it does not represent an actual data point.
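The difference between a blank and a zero can be mimicked outside Excel. In the sketch below, `None` stands in for a blank cell; the helper `average` and the sample values are hypothetical and only approximate the behavior described above.

```python
def average(cells):
    """Mean of the numeric cells; None stands in for a blank cell and is skipped."""
    numbers = [c for c in cells if c is not None]
    return sum(numbers) / len(numbers)

# Hypothetical column in which the third entry was never recorded.
recorded_as_blank = [4, 8, None, 12]
recorded_as_zero = [4, 8, 0, 12]

print(average(recorded_as_blank))  # blank skipped: 24 / 3
print(average(recorded_as_zero))   # zero counted: 24 / 4
```

Entering a zero for the missing value lowers the mean, even though nothing was actually measured.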
Finding The Median In Excel
Excel can find the median, even if a data set is unordered, using the MEDIAN function. The easiest way to calculate a data set's median is to select a range of cell references.
Finding The Mode In Excel
Excel can also find the most common value of a data set, the mode, using the MODE function. If more than one mode exists in a data set, Excel will find the one that occurs first in the data.
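To see what "the one that occurs first" means when a data set has more than one mode, here is a small sketch in Python rather than Excel. The helper `modes` and the sample data are hypothetical; the code only imitates the tie-breaking rule described above.

```python
from collections import Counter

def modes(values):
    """Return every most-frequent value, in order of first appearance."""
    counts = Counter(values)              # keys keep the order in which values first appear
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]

# Hypothetical data with two equally common values, 7 and 3.
data = [7, 3, 7, 3, 5, 9]

all_modes = modes(data)
first_mode = all_modes[0]   # the mode that occurs first in the data

print(all_modes, first_mode)
```

Listing all the modes, rather than only the first, is one way to notice that a data set may be bimodal.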
Mean, median, and mode are fairly intuitive concepts. Already, Leo's mountain of data seems less intimidating.
Variability
The mean, median and mode give you a sense of the center of the data, but none of these indicate how far the data are spread around the center. "Two sets of data could have the same mean and median, and yet be distributed completely differently around the center value," Alice tells you. "We need a way to measure variation in the data."
The Standard Deviation
It's often critical to have a sense of how much data vary. Do the data cluster close to the center, or are the values widely dispersed?
Let's look at an example. To identify good target markets, a car dealership might look at several communities and find the average income of each. Two communities, Silverhaven and Brighton, have average household incomes of $95,500 and $97,800, respectively. If the dealer wants to target households with incomes above $90,000, he should focus on Brighton, right?
We need to be more careful: the mean income doesn't tell the whole story. Are most of the incomes near the mean, or is there a wide range around the average income? A market might be less attractive if fewer households have an income above the dealer's target level. Based on average income alone, Brighton might look more attractive, but let's take a closer look at the data.
Despite having a lower average income, incomes in Silverhaven have less variability, and more households are in the dealer's target income range. Without understanding the variability in the data, the dealer might have chosen Brighton, which has fewer targeted homes.
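A small numerical sketch makes the point concrete. The household incomes below are hypothetical, chosen only so that the two means match the figures in the example; the range (highest minus lowest income) is used here as a rough stand-in for variability.

```python
TARGET = 90_000  # the dealer's income threshold from the example

# Hypothetical household incomes, chosen so that the means match the text.
silverhaven = [91_000, 93_000, 95_000, 95_000, 96_000, 97_000, 98_000, 99_000]
brighton = [40_000, 55_000, 70_000, 85_000, 88_000, 120_000, 150_000, 174_400]

def summarize(name, incomes):
    """Print and return the mean, the range, and the number of households above TARGET."""
    mean = sum(incomes) / len(incomes)
    spread = max(incomes) - min(incomes)                 # range: a rough measure of spread
    above = sum(1 for income in incomes if income > TARGET)
    print(f"{name}: mean={mean:,.0f} range={spread:,} above target={above}/{len(incomes)}")
    return mean, spread, above

silverhaven_stats = summarize("Silverhaven", silverhaven)
brighton_stats = summarize("Brighton", brighton)
```

In this made-up sample, Brighton has the higher mean but far fewer households above the target, which is the pattern the example describes.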
Clearly it would be helpful to have a simple way to communicate the level of variability in the household incomes of the two communities.
Just as we have summary statistics like the mean, median, and mode to give us a sense of the 'central tendency' of a data set, we need a summary statistic that captures the level of dispersion in a set of data.
The standard deviation is a common measure for describing how much variability there is in a set of data. We represent the standard deviation with the Greek letter sigma (σ).
The standard deviation emerges from a formula that looks a bit complicated initially, so let's try to understand it at a conceptual level first. Then we'll build up step by step to help understand where the formula comes from.
The standard deviation tells us how far the data are spread out. A large standard deviation indicates that the data are widely dispersed. A smaller standard deviation tells us that the data points are more tightly clustered together.
Calculating
A hotel manager has to staff the front reception desk in her lobby. She initially focuses on a staffing plan for Saturdays, typically a heavy traffic day. In the hospitality industry, like many service industries, proper staffing can make the difference between unhappy guests and satisfied customers who want to return.
On the other hand, overstaffing is a costly mistake. Knowing the average number of customer requests for services during a shift gives the manager an initial sense of her staffing needs; knowing the standard deviation gives her invaluable additional information about how those requests might vary across different days.
The average number of customer requests is 172, but this doesn't tell us there are 172 requests every Saturday. To staff properly, the hotel manager needs a sense of whether the number of requests will typically be between 150 and 195, for example, or between 120 and 220.
To calculate the standard deviation for data — in this case the hotel traffic — we perform two steps. The first is to calculate a summary statistic called the variance.
Each Saturday's number of requests lies a certain distance from 172, the mean number of requests. To find the variance, we first sum the squares of these differences. Why square the differences?
A hotel manager would want information about the magnitude of each difference, which can be positive, negative, or zero. If we simply summed the differences between each Saturday's requests and the mean, positive and negative differences would cancel each other out.
But we are interested in the magnitude of the differences, regardless of their sign. By squaring the differences, we get only positive numbers that do not cancel each other out in a sum.
The formula for variance adds up the squared differences and divides by n-1 to get a type of "average" squared difference as a measure of variability. (The reason we divide by n-1 to get an average here is a technicality beyond the scope of this course.) The variance in the hotel's front desk requests is 637.2. Can we use this number to express the variability of the data?
Not easily: the variance is a value in 'squared' requests. What is a request-squared? Or a dollar-squared, if we were solving a problem involving money? We would like a way to express variability that is in the same units as the original data (front-desk requests, for example). The standard deviation, the first formula we saw, accomplishes this.
The standard deviation is simply the square root of the variance. It returns our measure to our original units. The standard deviation for the hotel's Saturday desk traffic is 25.2 requests.
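The two steps described above can be sketched in Python. The Saturday counts below are hypothetical values chosen only so that the mean comes out to 172; they are not the tutorial's data, so the variance and standard deviation will differ from 637.2 and 25.2.

```python
import math
import statistics

# Hypothetical front-desk request counts for eight Saturdays (illustrative only).
requests = [150, 160, 165, 170, 175, 180, 185, 191]

n = len(requests)
mean = sum(requests) / n

# Summing the raw differences is uninformative: positives and negatives cancel.
raw_sum = sum(x - mean for x in requests)

# Step 1: variance = sum of squared differences from the mean, divided by n - 1.
squared_diffs = [(x - mean) ** 2 for x in requests]
variance = sum(squared_diffs) / (n - 1)

# Step 2: standard deviation = square root of the variance (original units again).
std_dev = math.sqrt(variance)

# Cross-check against the standard library's sample statistics.
assert math.isclose(variance, statistics.variance(requests))
assert math.isclose(std_dev, statistics.stdev(requests))

print(f"mean = {mean:.1f}, sum of raw differences = {raw_sum:.1f}")
print(f"variance = {variance:.1f}, standard deviation = {std_dev:.1f}")
```

The check against `statistics.variance` and `statistics.stdev` works because those functions also divide by n - 1.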
Interpreting
What does a standard deviation of 25.2 requests tell us? Suppose the standard deviation had been 50 requests. With a larger standard deviation, the data would be spread farther from the mean. A higher standard deviation would translate into more difficult staffing: when request traffic is unusually high, disgruntled customers wait in long lines; when traffic is very low, desk staff are idle.
For a data set, a smaller standard deviation indicates that more data points are near the mean, and that the mean is more representative of the data. The lower the standard deviation, the more stable the traffic, thereby reducing both customer dissatisfaction and staff idle time.
Fortunately, we almost never have to calculate a standard deviation by hand. Spreadsheet tools like Excel make it easy for us to calculate variance and standard deviation.
Summary
The standard deviation measures how much data vary about their mean value.
Finding in Excel
Excel's STDEV function calculates the standard deviation.
To find the standard deviation, we can enter data values into the STDEV formula, one by one, separated by commas. In most cases, however, it's much easier to select a range of cell references to calculate a standard deviation. To calculate variance, we can use Excel's VAR function in the same way.
The Coefficient of Variation
The standard deviation measures how much a data set varies from its mean. But the standard deviation only tells you so much. How can you compare the variability in different data sets?
A standard deviation describes how much the data in a single data set vary. How can we compare the variability of two data sets? Do we just compare their standard deviations? If one standard deviation is larger, can we say that data set is "more variable"?
Standard deviations must be considered within the data's context. The standard deviations for two stock indices below — The Street.Com (TSC) Internet Index and the Pacific Exchange Technology (PET) Index — were roughly equivalent over a period. But were the two indices equally variable?
If the average price of an index is $200, a $20 standard deviation is relatively high (10% of the average); if the average is $700, $20 is relatively low (not quite 3% of the average). To gauge volatility, we'd certainly want to know that PET's average index price was over three and a half times higher than TSC's average index price.
To get a sense of the relative magnitude of the variation in a data set, we want to compare the standard deviation of the data to the data's mean.
We can translate this concept of relative volatility into a standardized measure called the coefficient of variation, which is simply the ratio of the standard deviation to the mean. It can be interpreted as the standard deviation expressed as a percent of the mean.
To get a feeling for the coefficient of variation, let's compare a few data sets. Which set has the highest relative variation?
Because the coefficient of variation has no units, we can use it to compare different kinds of data sets and find out which data set is most variable in this relative sense.
The coefficient of variation describes the standard deviation as a fraction of the mean, giving you a standard measure of variability.
The coefficient of variation expresses the standard deviation as a fraction of the mean. We can use it to compare variation in different data sets of different scales or units.
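As a small illustration of the ratio, the sketch below applies it to the round figures used earlier in this section (a $20 standard deviation on averages of $200 and $700). The function name `coefficient_of_variation` is a hypothetical helper, not an Excel or library function.

```python
def coefficient_of_variation(std_dev, mean):
    """Return the standard deviation as a fraction of the mean."""
    if mean == 0:
        raise ValueError("coefficient of variation is undefined when the mean is 0")
    return std_dev / mean

# The same $20 standard deviation measured against two different averages.
cv_low_priced = coefficient_of_variation(20, 200)
cv_high_priced = coefficient_of_variation(20, 700)

# Format as percentages of the mean.
print(f"{cv_low_priced:.1%} vs {cv_high_priced:.1%}")
```

Because the units cancel in the division, the two results can be compared directly even though the underlying price levels differ.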
Applying Data Analysis
After a good night's sleep, you meet Alice for breakfast.
"It's time to get started on Leo's assignments. Could you get those price quotes from diving schools and prepare a presentation for Leo? We'll want to present our findings as neatly and concisely as possible. Use graphs and summary statistics wherever appropriate. Meanwhile, I'll start working on Leo's hotel occupancy problem."
Pricing the Scuba Schools
In addition to the school Leo is currently using, you find 20 other scuba services in the phone book. You call those 20 and get price quotes on how much they would charge the Kahana per guest for a Scuba Certification Course.
Prices
You create a histogram of the prices. Use the bin ranges provided in the data spreadsheet, or experiment with your own bins. If you do not have the Excel Analysis ToolPak installed, click on the Briefcase link labeled "Histogram" to see the finished histogram.
Prices Histogram
This distribution is skewed to the right, since a tail of higher prices extends to the right side of the histogram. The shape of the distribution suggests that:
a. The mean = the median
This is not the best answer. When the histogram is skewed to one side, the mean and the median are different. If the histogram you constructed from the pricing data looks symmetric, try using the recommended bin sizes.
b. The mean > the median
This is the best answer. The prices of the few expensive schools "pull" the mean towards the right.
c. The mean < the median
This is not the best answer. When the histogram is skewed to the left, the mean is less than the median.
d. None of the above relationships can be determined from the histogram.
This is not the best answer. It should be apparent from your histogram that the distribution is skewed to the right, in which case the mean is greater than the median. If the histogram you constructed from the pricing data looks symmetric, try using the recommended bin sizes.
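The pull of a right tail on the mean can be checked numerically. The price quotes below are hypothetical and are not the tutorial's data set; they only mimic its shape, with most quotes near $300 and a couple of expensive schools.

```python
import statistics

# Hypothetical per-guest price quotes in dollars (not the tutorial's data):
# most schools cluster near $300, and two expensive schools form a right tail.
prices = [280, 290, 295, 300, 305, 310, 320, 450, 480]

mean_price = statistics.mean(prices)
median_price = statistics.median(prices)

# The two high quotes raise the mean but leave the middle value unchanged.
print(f"mean = {mean_price:.2f}, median = {median_price}")
print("mean > median:", mean_price > median_price)
```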
You calculate the key summary statistics. The correct values are (Mean, Median, Standard Deviation):
a. $307, $326, $60
This is not the correct answer. You may be confusing the mean and the median.
b. $307, $326, $67
This is not the correct answer. You may be confusing the mean and the median.
c. $326, $307, $60
This is not the correct answer. The standard deviation is $67.
d. $326, $307, $67
This is the correct answer.
Your report looks good. This graphic is very helpful. At the moment, I'm paying $330 per guest, which is about average for the island. Clearly, I could get a cheaper deal — only 6 schools would charge a higher rate. On the other hand, maybe these more expensive schools offer a better diving experience? I wonder how satisfied my guests have been with the course offered by my current contractor...
Exercise 1: VA Linux Stock Bonanza
After a company completes its initial public offering, how is the ownership of common stock distributed between individuals in the firm, often termed "named insiders"?
Let's examine a company, VA Linux, that chose to sell its stock in an Initial Public Offering (IPO) during the IPO craze in the late 1990s.
According to its prospectus, after the IPO, VA Linux would have the following distribution of outstanding shares of common stock owned by insiders:
From the VA Linux common stock data, what could we learn by creating a histogram? (Choose the best answer)
a. The total number of shares of common stock owned by the named insiders.
This is not the best answer. To find the total number of shares, it would be best to add up the raw data in tabulated form. Since histograms place data points in ranges, we'd have trouble finding the total of the individual values from a histogram.
b. The percentage of common stock owned by each of the named insiders in VA Linux's prospectus.
This is not the best answer. The histogram specifies neither the exact number of shares owned by each individual nor the total number of outstanding shares of common stock, both of which we would need to compute the percentage of common stock owned by each insider.
c. How the ownership stakes are distributed among named insiders.
This is the best answer. By converting the data into a histogram, the distribution of stock among the named insiders is apparent, and we get a good idea of how ownership is distributed inside this young company.
d. How the named insiders' shares compare to the holdings of outside investors who purchased shares in the IPO.
This is not the best answer. Although this analysis would be interesting, we simply don't have the necessary data. We have no information about how much stock individuals other than the named stockholders will own after the IPO.
Exercise 2: Employee Turnover
Here is a histogram graphing annual turnover rates at a consulting firm. Which summary statistic better describes these data?
a. The mean
This is not the best answer. As you can see in the histogram, the data are strongly skewed to the right. A few years of uncharacteristically high turnover have a strong influence on the value of the mean. In cases such as this, the median is often a better descriptor for the center of the data.
b. The median
This is the best answer. A few years of uncharacteristically high turnover have a strong influence on the value of the mean. In cases such as this, the median is often a better descriptor for the center of the data.
Exercise 3: Honidew Internship
The J. B. Honidew Corporation offers a prestigious summer internship to first-year students at a local business school. The human resources department of Honidew wants to publish a brochure to advertise the position.
To attract a suitable pool of applicants, the brochure should give an indication of Honidew's high academic expectations. The human resources manager calculates the mean GPA of the previous 8 interns, to include in the brochure.
The mean GPA of the former interns is:
a. 3.86
b. 3.91
This is not the correct answer. Be sure you are calculating the mean, and not the median.
c. 3.93
This is not the correct answer. If we excluded the lowest GPA, 3.35, as an outlier, this would be the correct answer, but we must include it because it is an actual value of a previous intern's GPA.
Interns' GPA's
In 1997, J. B. Honidew's grandson's girlfriend was awarded the internship, even though her GPA was only 3.35. In the presence of outliers or a strongly skewed data set, the median is often a better measure of the 'center'. What's the median GPA in this data set?
a. 3.87
This is not the correct answer. 3.87 is one of the two central GPA data points, but the median is the average of the two central points.
b. 3.91
This is the correct answer. The median is the average of the two central GPA data points, 3.87 and 3.95.
c. 4.0
This is not the correct answer. As the most frequently occurring data point, 4.0 is the mode of the sample.
Exercise 4: Scuba Regulations
Safety equipment typically needs to fall within very precise specifications. Such specifications apply, for example, to scuba equipment using a device called a "rebreather" to recycle oxygen from exhaled air.
Recycled air must be enriched with the right amount of oxygen from the tank before delivery to the diver. With too little oxygen, the diver can become disoriented; too much, and the diver can experience oxygen poisoning. Minimizing the deviation of oxygen concentration levels from the specified level is clearly a matter of life and death!
A scuba equipment-testing lab compared the oxygen concentrations of two different brands of rebreathers, A and B. Examine the data. Without doing any calculations, for which of the two rebreathers does the oxygen concentration appear to have a lower standard deviation?
a. A
This is the correct answer. Much more of the data are clustered near the mean of the data set: 21.00%.
b. B
This is not the correct answer. The data for model B are spread farther from its mean of 20.98% than the data for model A are spread from its mean, 21.00%.
Notice that data set A's extreme values are closer to the center, with more data points closer to the center of the set. Even without calculations, we have a good knack for seeing which set is more variable.
We can back up our observations; by using the standard deviation formula or the STDEV function in Excel, we can calculate that the standard deviation of A is 0.58%, whereas that of B is 1.05%.
Exercise 5: Fluctuations in Energy Prices
After decades of government control, states across the US are deregulating energy markets. In a deregulated market, electricity prices tend to spike in times of high demand.
This volatility is a concern. A primary benefit to consumers in a regulated market is that prices are fairly stable. To provide a baseline measure for the volatility of prices prior to deregulation, we want to compute the standard deviation of prices during the 1990s, when electricity prices were largely regulated.
From 1990 to 2000, the average national price in July of 500 kWh of electricity ranged between $45.02 and $50.55. What is the standard deviation of these eleven prices?
a. $2.02
This is the correct answer. Either using Excel or calculating the formula by hand, the standard deviation is $2.02, fairly low compared to the mean price of $48.40.
b. $4.08
This is not the correct answer. You may have forgotten to take the square root of the variance. Try using Excel's STDEV formula to double-check your answer.
This is not the correct answer. If you calculated the standard deviation by hand, did you forget to divide by n-1?
Electricity Prices Source
Excel makes the job much easier, because all that's required is entering the data into cells and inputting the range of cells into the =STDEV() function. The result is $2.02.
On the other hand, to calculate the standard deviation by hand, use the formula: σ = √( Σ (xᵢ - x̄)² / (n - 1) ), where xᵢ is each data point, x̄ is the mean, and n is the number of data points.
First, calculate the mean, $48.40. Then, find the difference between each data point and the mean, and square each difference. Calculate the sum of these squared differences, 40.79. Divide by the number of points minus one (11 - 1 = 10 in this case) to obtain 4.08. Taking the square root of 4.08 gives us the standard deviation, $2.02.
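The arithmetic in these steps can be written as a single chain. The sum of squared differences (40.79) and the number of prices (11) are taken from the text; the symbols x_i (each price) and x-bar (the mean price) are notation assumed for this sketch.

```latex
\[
\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}
       = \sqrt{\frac{40.79}{11 - 1}}
       = \sqrt{4.079}
       \approx 2.02
\]
```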
Exercise 6: Big Mart Personal Care Products
Suppose you are a purchasing agent for a wholesale retailer, Big-Mart. Big-Mart offers several generic versions of household items, like deodorant, to consumers at a considerable discount.
Every 18 months, Big-Mart requests bids from personal care companies to produce these generic products.
After simply choosing the lowest individual bidder for years, Big-Mart has decided to introduce a vendor "score card" that measures multiple aspects of each vendor's performance. One of the criteria on the score card is the level of year-to-year fluctuation in the vendor's pricing.
Compare the variability of prices from each supplier. Which company's prices vary the least from year to year in relation to their average price, as measured by the coefficient of variation?
a. Personal Care International
This is not the correct answer. The coefficient of variation is 0.17, in between the two other brands: 0.12 and 0.20. Take the ratio of the standard deviation to the mean to find the coefficient of variation.
b. Beautica
This is the correct answer. The coefficient of variation is 0.12, lower than for both of the other companies.
c. BMKIP
This is not the correct answer. This coefficient of variation is 0.20, the largest coefficient of variation of the three. Take the ratio of the standard deviation to the mean to find the coefficient of variation.
Summary
Pleased with your work, Alice decides to teach you more data description techniques, so you can take over a greater share of the project.
Relationships Between Variables
So far, you have learned how to work with a single variable, but many managerial problems involve several factors that need to be considered simultaneously.
Two Variables
We use histograms to help us answer questions about one variable. How do we start to investigate patterns and trends with two variables?
Let's look at two data sets: heights and weights of athletes. What can we say about the two data sets? Is there a relationship between the two?
Our intuition tells us that height and weight should be related. How can we use the data to inform that intuition? How can we let the data tell their story about the strength and nature of that relationship?
As always, one of our first steps is to try to visualize the data.
Because we know that each height and weight belong to a specific athlete, we first pair the two variables, with one height-weight pair for each athlete.
Plotting these data pairs on axes of height and weight (one data point for each athlete in our data set), we can see a relationship between height and weight. This type of graph is called a "scatter diagram." Scatter diagrams provide a visual summary of the relationship between two variables. They are extremely helpful in recognizing patterns in a relationship. The more data points we have, the more apparent the relationship becomes.
In our scatter diagram, there's a clear general trend: taller athletes tend to be heavier.
We need to be careful not to draw conclusions about causality when we see these types of relationships. Growing taller might make us a bit heavier, but height certainly doesn't tell the whole story about our weights. Assuming causality in the other direction would be just plain wrong. Although we may wish otherwise, growing heavier certainly doesn't make us taller!
The direction and extent of causality might be easy to understand with the height and weight example, but in business situations, these issues can be quite subtle.
A mistaken assumption about causality can in hindsight appear as ludicrous as assuming that gaining weight can make us taller.
Why don't we try graphing another pair of data sets to see if we can identify a relationship? On a scatter diagram, we plot for each day the number of massages purchased at a spa resort versus the total number of guests visiting the resort.
We can see a relationship between the number of guests and the number of massages. The more guests that stay at the resort, the more massages purchased — to a point, where massages level off.
Why does the number of massages reach a plateau? We should investigate further. Perhaps there are limited numbers of massage rooms at the spa. Scatter plots can give us insights that prompt us to ask good questions, those that deepen our understanding of the underlying context from which the data are drawn.
Variable and Time
Sometimes, we are not as interested in the relationship between two variables as we are in the behavior of a single variable over time. In such cases, we can consider time as our second variable.
Suppose we are planning the purchase of a large amount of high-speed computer memory from an electronics distributor. Experience tells us these components have high price volatility. Should we make the purchase now? Or wait?
Assuming we have price data collected over time, we can plot a scatter diagram for memory price, in the same way we plotted height and weight. Because time is one of the variables, we call this graph a time series.
Time series are extremely useful because they put data points in temporal order and show how data change over time. Have prices been steadily declining or rising? Or have prices been erratic over time? Are there seasonal patterns, with prices in some months consistently higher than in others?
Time series will help us recognize seasonal patterns and yearly trends. But we must be careful: we shouldn't rely only on visual analysis when looking for relationships and patterns.
False Relationships
Our intuition tells us that pairs of variables with a strong relationship on a scatter plot must be related to each other. But we must be careful: human intuition isn't foolproof and often we infer relationships where there are none. We must be careful to avoid some of these common pitfalls.
Let's look at an example. For US presidents of the last 150 years, there seems to be a connection between being elected in a year that is a multiple of 20 (1900, 1920, 1940, etc.) and dying in office. Abraham Lincoln (elected in 1860) was the first victim of this unfortunate relationship.
James Garfield (elected 1880) was assassinated in 1881, his first year in office, and William McKinley (1900), Warren Harding (1920), Franklin Roosevelt (1940), and John F. Kennedy (1960) all died in office.
Ronald Reagan (elected 1980) only narrowly survived an assassination attempt. What do the data suggest about the president elected in 2020?
Probably nothing. Unless we have a reasonable theory about the connection between the two variables, the relationship is no more than an interesting coincidence.
Hidden Variables
Even when two data sets seem to be directly related, we may need to investigate further to understand the reason for the relationship.
We may find that the reason is not due to any fundamental connection between the two variables themselves, but that they are instead mutually related to another underlying factor.
Suppose we're examining sales of ice-hockey pucks and baseballs at a sporting goods store.
The sales of the two products form a relationship on a scatter plot: when puck sales slump, baseball sales jump. But are the two data sets actually related? If so, why?
A third, hidden factor probably drives both data sets: the season. In winter, people play ice hockey. In spring and summer, people play baseball.
If we had simply plotted puck and baseball sales without thinking further, we might not have considered the time of year at all. We could have neglected a critical variable driving the sales of both products.
In many business contexts, hidden variables can complicate the investigation of a relationship between almost any two variables.
A final point: Keep in mind that scatter plots don't prove anything about causality. They never prove that one variable causes the other, but simply illustrate how the data behave.
Summary
Plotting two variables helps us see relationships between two data sets. But even when relationships exist, we still need to be skeptical: is the relationship plausible? An apparent relationship between two variables may simply be coincidental, or may stem from a relationship each variable has with a third, often hidden variable.
Creating Scatter Diagrams
To create a scatter diagram in Excel with two data sets, we need to first prepare the data, and then use Excel's built-in chart tools to plot the data.
To prepare our data, we need to be sure that each data point in the first set is aligned with its corresponding value in the other set. The sets don't need to be contiguous, but it's easier if the data are aligned side by side in two columns. If the data sets are next to each other, simply select both sets.
Next, from the Insert tab in the toolbar, select Scatter in the Charts bin from the Ribbon, and choose the first type: Scatter with Only Markers.
Excel will insert a nonspecific scatter plot into the worksheet, with the first column of data represented on the X-axis and the second column of data on the Y-axis.
We can include a chart title and label the axes by selecting Quick Layout from the Ribbon and choosing Layout 1.
Then we can add the chart title and label the axes by selecting and editing the text.
Finally, our scatter diagram is complete. You can explore more of Excel's new Chart Tools to edit and design elements of your chart.
Correlation
By plotting two variables on a scatter plot, we can examine their relationship. But can we measure the strength of that relationship? Can we describe the relationship in a standardized way?
Humans have an uncanny ability to discern patterns in visual displays of data. We "know" when the relationship between two variables looks strong ...
... or weak ... ... linear ... ... or nonlinear ...
... positive (when one variable increases, the other tends to increase) ... ... or negative (when one variable increases, the other tends to decrease).
Suppose we are trying to discern if there is a linear relationship between two variables. Intuitively, we notice when data points are close to an imaginary line running through a scatter plot.
Logically, the closer the data points are to that line, the more confidently we can say there is a linear relationship between the two variables.
However, it is useful to have a simple measure to quantify and communicate to others what we so readily perceive visually. The correlation coefficient is such a measure: it quantifies the extent to which there is a linear relationship between two variables.
To describe the strength of a linear relationship, the correlation coefficient takes on values between -1 and +1. Here's a strong positive correlation (about 0.85) ...
... and here's a strong negative correlation (about -0.90).
If every point falls exactly on a line with a negative slope, the correlation coefficient is exactly -1.
At the extremes of the correlation coefficient, we see relationships that are perfectly linear, but what happens in the middle? Even when the correlation coefficient is 0, a relationship might exist, just not a linear relationship. As we've seen, scatter plots can reveal patterns and help us better understand the business context the data describe.
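Both extremes can be checked with a short computation. The tutorial does not spell out the formula, so the sketch below assumes the standard (Pearson) definition of the correlation coefficient; `pearson_r` is a hypothetical helper name and the data are made up for illustration.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient of paired values (a plain-Python sketch)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and the sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)

x = [-3, -2, -1, 0, 1, 2, 3]

# Every point lies on a line with negative slope.
r_line = pearson_r(x, [10 - 2 * v for v in x])

# A perfect U-shaped relationship: strongly related, but not linearly.
r_curve = pearson_r(x, [v ** 2 for v in x])

print(f"line: {r_line:.2f}, curve: {r_curve:.2f}")
```

The second data set is completely determined by the first, yet its correlation coefficient is zero, which is why a scatter plot is worth checking alongside the number.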
To reinforce our understanding of how our intuition about the strength of a linear relationship between variables translates into a correlation coefficient, let's revisit the examples we analyzed visually earlier.
Influence of Outliers
In some cases, the correlation coefficient may not tell the whole story. Managers want to understand the attendance patterns of their employees. For example, do workers' absence rates vary by time of year?
Suppose a manager suspects that his employees skip work to enjoy the good life more often as the temperature rises. After pairing absences with daily temperature data, he finds the correlation coefficient to be 0.466.
While not a strong linear relationship, a coefficient of 0.466 does indicate a positive relationship — suggesting that the weather might indeed be the culprit.
But look at the data: besides a few outliers, there isn't a clear relationship. Seeing the scatter plot, the manager might realize that the three outliers correspond to a late-summer, three-day transportation strike that kept some workers homebound the previous year.
If we don't look at the data, the correlation coefficient can lead us down false paths. If we exclude the outliers, the relationship disappears, and the correlation essentially drops to zero, quieting any suspicion of weather. Why do the outliers influence our measure of linearity so much?
As a summary statistic for the data, the correlation coefficient is calculated numerically, incorporating the value of every data point. Just as it does with the mean, this inclusiveness can get us into trouble...
Because measures like correlation give more weight to points distant from the center of the data, outliers can strongly influence the correlation coefficient of the entire set. In these situations, our intuition and the measure we use to quantify our intuition can be quite different. We should always attempt to reconcile those differences by returning to the data.
Summary
The correlation coefficient characterizes the strength and direction of a linear relationship between two data sets. The value of the correlation coefficient ranges between -1 and +1.
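To see how strongly a few outliers can move the correlation coefficient, the sketch below uses hypothetical temperature and absence figures. They are not the manager's data, so the result will not reproduce 0.466; `pearson_r` is again a hypothetical helper that assumes the standard Pearson definition.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient of paired values (plain-Python sketch)."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical (temperature, absences) pairs with no linear relationship.
temps = [50, 55, 60, 65, 70, 75, 80, 85]
absences = [3, 5, 4, 2, 2, 4, 5, 3]

# Three hypothetical strike days: warm weather and unusually many absences.
strike_temps = [82, 84, 86]
strike_absences = [12, 13, 12]

r_without = pearson_r(temps, absences)
r_with = pearson_r(temps + strike_temps, absences + strike_absences)

print(f"without outliers: {r_without:.2f}, with outliers: {r_with:.2f}")
```

In this made-up example, three points out of eleven are enough to turn a correlation of zero into a moderately positive one of roughly 0.58.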
Finding in Excel
Excel's CORREL function calculates the correlation coefficient for two variables. Let's return to our data on athletes' height and weight.
Enter the data set into the spreadsheet as two paired columns. We must make sure that each data point in the first set is aligned with its corresponding value in the other set.
To compute the correlation, simply enter the two variables' ranges, separated by a comma, into the CORREL function as shown below.
The order in which the two data sets are selected does not matter, as long as the data "pairs" are maintained. With height and weight, both values certainly need to refer to the same person!
Occupancy and Arrivals
Alice is eager to move forward: "With your new understanding of scatter diagrams and correlation, you'll be able to help me with Leo's hotel occupancy problem."
In the hotel industry, one of the most important management performance measures is room occupancy rate, the percentage of available rooms occupied by guests.
Alice suggests that the monthly occupancy rate might be related to the number of visitors arriving on the island each month. On a geographically isolated location like Hawaii, visitors almost all arrive by airplane or cruise ship, so state agencies can gather very precise data on arrivals.
Alice asks you to investigate the relationship between room occupancy rates and the influx of visitors, as measured by the average number of visitors arriving to Kauai per day in a given month. She wants a graphical overview of this relationship, and a measure of its strength.
Leo's folders include data on the number of arrivals on Kauai, and on average hotel occupancy rates in Kauai, as tracked by the Hawaii Department of Business, Economic Development, and Tourism.
Kauai Data Source
The best way to graphically represent the relationship between arrivals and occupancy is:
a. A histogram
This is not the best answer. A histogram is used to gain insight into the behavior of a single variable. It represents the frequency at which certain ranges of values of the variable occur in a data set.
b. A scatter diagram
This is the best answer. We use scatter diagrams to represent the relationship between two variables.
c. A time series
This is not the best answer. We use time series to display the behavior of a variable over time.
d. A series of concentric burning wheels
This is not the best answer. It is simply a more exciting way of saying "none of the above," which is also not the best answer.
You generate the scatter diagram using the data file and Excel's Chart Wizard. The relationship can be characterized as:
a. Weakly negative and linear
This is not the best answer. The relationship is positive. Higher levels of occupancy generally correspond to higher numbers of arrivals.
b. Strongly negative and non-linear
This is not the best answer. The relationship is positive. Higher levels of occupancy generally correspond to higher numbers of arrivals.
c. Strongly positive and linear
This is the best answer. The relationship is positive. Higher levels of occupancy generally correspond to higher numbers of arrivals. The trend appears to be reasonably linear.
d. Strongly positive and non-linear
This is not the best answer. The trend appears to be generally linear.
You calculate the correlation coefficient. Enter the correlation coefficient in decimal notation with 2 digits to the right of the decimal (e.g., enter "5" as "5.00"). Round if necessary.
To find the correlation coefficient, open the Kahana Data file. In any empty cell, type =CORREL(B2:B37,C2:C37). When you press Enter, the correct answer, 0.71, will appear.
Together with Alice, you compile your findings and present them to Leo.
I see. The relationship between the number of people arriving on Kauai and the island's hotel occupancy rate follows a general trend, but not a precise pattern. Look at this: in two months with nearly the same average number of daily arrivals, the occupancy rates were very different — 68% in one month and 82% in the other.
But why should they be so different? When people arrive on the island, they have to sleep somewhere. Do more campers come to Kauai in one month, and more hotel patrons in the other?
Well, that might be one explanation. There could be differences in the type of tourists arriving. The vacation preferences of the arrivals would be what we call a hidden variable.
Another hidden variable might be the average length of stay. If the length of stay varies month to month, then so will hotel occupancy. When 50 arrivals check into a hotel, the occupancy rate will be higher if they spend 10 days each at the hotel than if they spend only 3 days.
I'm following you, but I'm beginning to see that the occupancy issue is more complex than I expected. Let's get back to it at a later time. The scuba school contract is more pressing at the moment.
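The length-of-stay example in the conversation above can be checked with a little arithmetic. The sketch below treats the occupancy rate for a period as room-nights occupied divided by room-nights available. The 100-room hotel, the 30-day month, and the assumption that each arrival occupies one room for the whole stay are hypothetical; only the 50 arrivals and the 10-day and 3-day stays come from the conversation.

```python
def occupancy_rate(arrivals, nights_per_stay, rooms, days_in_period):
    """Share of available room-nights that are occupied during the period."""
    room_nights_occupied = arrivals * nights_per_stay  # assumes one room per arrival
    room_nights_available = rooms * days_in_period
    return room_nights_occupied / room_nights_available

ROOMS = 100  # hypothetical hotel size
DAYS = 30    # hypothetical month length

long_stay = occupancy_rate(50, 10, ROOMS, DAYS)  # 500 of 3,000 room-nights
short_stay = occupancy_rate(50, 3, ROOMS, DAYS)  # 150 of 3,000 room-nights

print(f"10-day stays: {long_stay:.1%}")  # 16.7%
print(f" 3-day stays: {short_stay:.1%}")  # 5.0%
```

With the number of arrivals held fixed, the occupancy rate still more than triples when guests stay longer, which is how a hidden variable can pull two months with similar arrivals far apart.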
Exercise 1: The Effectiveness of Search Engines
As online retailing expands, many companies are interested in knowing how effective search engines are in helping consumers find goods online.
Computer scientists study the effectiveness of such search engines and compare how many results search engines recall and the precision with which they recall them. "Precision" is another way of saying that the search found its target, for example a page containing both the phrases "winter parka" and "Eddie Bauer."
What could you say about the relationship between the Precision and the number of Results Recalled?
a. The amount of information a search engine recalls decreases over time.
This is not the best answer. Time isn't graphed on the scatter plot, and we do not know how it might be involved in a relationship between these two variables.
b. An increase in precision causes the amount retrieved to decrease.
This is not the best answer. Although we do observe higher values of precision with lower values of recall, and vice versa, we have no idea if one causes the other. With a scatter diagram, we can never make claims about causality!
c. Recall and precision seem to be related: a large number of results typically pairs with low precision.
This is the best answer. From the scatter plot, we can see that the variables demonstrate a relationship, but maybe not a linear one. However, even when we recognize a clear relationship, we cannot conclude that greater precision causes the amount of information recalled to decrease.
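To see why a relationship that is clear but not linear matters for the correlation coefficient, consider the following sketch. The data are hypothetical and are not the search engine data from the exercise: precision is constructed to be exactly 100 divided by the number of results, so the two variables are perfectly related, yet the coefficient stays well short of -1 because the pattern is a curve rather than a line.

```python
from math import sqrt

def correl(xs, ys):
    """Pearson correlation of two paired, equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical data: precision falls along a curve as results increase
results_recalled = [1, 2, 4, 5, 10, 20]
precision = [100 / r for r in results_recalled]  # 100, 50, 25, 20, 10, 5

r_curve = correl(results_recalled, precision)

# For comparison, a perfectly linear decreasing relationship
linear = [-2 * r + 3 for r in results_recalled]
r_line = correl(results_recalled, linear)

print(round(r_curve, 2))  # about -0.69
print(round(r_line, 2))   # -1.0
```

The coefficient measures the strength of a linear relationship only, so a scatter diagram remains the better tool for spotting curved patterns like this one.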
Exercise 2: Education and Income
Is an education a good investment in your future? Some very successful business executives are college dropouts, but is there a relationship in the general population between income and education level?
a. Strongly positive
This is the best answer. The level of income is strongly associated with the number of years of education for our data.
b. Weakly positive
This is not the best answer. The correlation between income and level of education is fairly pronounced. Weak correlations scatter widely around the imaginary line we can trace through the data.
c. Weakly negative
This is not the best answer. In general, as education increases, incomes do as well. In a negative correlation, as education increases, income would decrease.
Though we should always calculate the correlation coefficient if we want to have a precise measure, it's good to have a rough feel for the correlation between two variables we see plotted on a scatter diagram. For the income-education data, the coefficient is nearest to:
a. 0.1
This is not the best answer. A correlation coefficient of 0.1 indicates data with a weak linear relationship, but for our data, the relationship is fairly strong.
b. -0.5
This is not the best answer. At -0.5, the correlation coefficient indicates a negative linear relationship. Education and income tend to increase at the same time, which occurs with a positive linear correlation.
c. 0.9
This is the best answer. A fairly strong linear relationship has a correlation coefficient closer to 1.0, making 0.9 a reasonable guess for what we see occurring between income and education level.
Sampling & Estimation
Introduction: The Scuba Problem
Leo asks you to help him evaluate the Kahana's contract with the scuba school.
Scuba diving lessons are an ideal way for our guests to enjoy their vacation or take a break from their business activities. We have an excellent coral reef, and scuba diving is becoming very popular among vacationers and business travelers.
We started our year-round diving program last year, contracting a local diving school to do a scuba certification course. The one-year trial contract is now up for renewal.
Maintaining the scuba offerings on-site isn't cheap. We have to staff the scuba desk seven days a week, and we subsidize the costs associated with each course. So I want to get a good handle on how satisfied the guests are with the lessons before I decide whether or not to renew the contract.
The hotel has a database with information about which guests took scuba lessons and when. Feel free to take a look at it, but I can't spend a fortune figuring this out. And I need to know as soon as possible, since our contract expires at the end of the month.
Alice convinces you to do some field research and join her for a scuba diving lesson. You return late that afternoon exhausted but exhilarated. Alice is especially enthusiastic.
"Well, I certainly give the lessons two thumbs up. And we haven't even been out to sea yet!
"But our opinions alone can't decide the matter. We shouldn't infer from our experience that Leo's clientele as a whole enjoyed the scuba certification course. After all, we may have caught the instructor on his best day this year."
Alice suggests creating a survey to find out how satisfied guests are with the scuba diving school.
Generating Random Samples
Naturally, you can't ask the opinion of every guest who took scuba lessons over the past year. You have to survey a few guests, and from their opinions draw conclusions about hotel guests in general. The guests you choose to survey must be representative of all of the guests who have taken the scuba course at the resort. But how can you be sure you get a good sample?
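One common way to choose whom to survey, in keeping with the heading above, is a simple random sample in which every guest in the database has the same chance of being picked. The sketch below is only an illustration under stated assumptions: the guest IDs, the population of 500 scuba students, and the sample size of 25 are hypothetical, and the hotel's actual database is not shown in the tutorial.

```python
import random

def draw_simple_random_sample(population, sample_size, seed=None):
    """Pick sample_size distinct members, each equally likely to be chosen."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return rng.sample(population, sample_size)

# Hypothetical population: ID numbers of guests who took scuba lessons
guest_ids = list(range(1, 501))

survey_group = draw_simple_random_sample(guest_ids, 25, seed=42)
same_group = draw_simple_random_sample(guest_ids, 25, seed=42)

print(len(survey_group))           # 25
print(survey_group == same_group)  # True, because the seed is the same
```

Letting a random number generator make the selection keeps our own preferences, such as surveying only the guests we happened to meet, out of the sample.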
How to Create a Representative and Unbiased Sample
As managers, we often need to know something about a large group of people or products. For example, how many defective parts does a large plant produce each year? What are the average annual earnings of a Wall Street investment banker? How many people in our industry plan to attend the annual