• No results found

Soft Drink Consumption

Per-capita consumption of soft drink beverages is related to per-capita gross domestic product (GDP). Generally, the higher the GDP of a country, the more soda its citizens consume. Soft drink consumption is measured in number of 8-oz servings.

Source

Based on this relationship, you can expect that, on average, for each additional $1,000 of per-capita GDP a country's soda consumption increases by:

a. 180 servings.

This is not the correct answer. Double check your math and try again. b. 148 servings.

This is not the correct answer. Only the slope of 0.018 should be considered when finding the incremental increase in soda consumption.

c. 130 servings.

This is not the correct answer. The slope of 0.018 should be considered when finding the increase in soda consumption.

d. 18 servings.

This is the best answer.

The regression equation tells us that in our data set, average soda consumption increases by 0.018 servings for every additional $1 of per-capita GDP. So, for an additional $1,000,

average consumption increases by ($1,000)(0.018 servings/$) = 18 servings.

The per-capita GDP in the Netherlands is $25,034. What do you predict is the average number of servings of soda consumed in the Netherlands per year?

Enter predicted average soda consumption (in servings) as an integer (e.g., "5"). Round if necessary.

a. 580 b. 581

The regression equation tells us that average soda consumption = 130 + 0.018*(per-capita GDP). Therefore, we anticipate the Netherlands' average soda consumption to be 580.6 servings.

Although the regression predicts a soda consumption of around 581 servings per person for the Netherlands, the actual measured number of servings consumed is much lower: 362. The discrepancy in the actual and predicted consumption reinforces that per-capita GDP alone is not a perfect predictor of soda consumption.

Calculating the Regression Line

A regression line helps you understand the relationship between two variables and forecast future values of the dependent variable. Alice points out to you that these two features of regression analysis make it a powerful tool for managers who make important decisions in the uncertain world of business.

But how do you generate a regression line from observed data? Of all the straight lines that you could draw through a scatter diagram, which one is the regression line?

The Accuracy of a Line

Let's return to Julius Tabin's sales and advertising data. As we can see from the graph, no straight line could be drawn that would pass through every point in the data set.

This is not surprising. Typically, advertising is not a perfect predictor of sales, so we don't expect every data point to fall in a perfect line. The regression line depicts the best linear relationship between the two variables. We attribute the difference between the actual data

points and the line to the influence that other variables have on sales, or to chance alone. Since the regression line does not pass through every point, the line does not fit the data perfectly. How accurately does the regression line represent the data?

To measure the accuracy of a line, we'll quantify the dispersion of the data around the line. Let's look at one line we could draw through our data set.

Let's consider a second line. Click on the line that more closely fits the ten data points.

Although in this example we can see which of two lines is more accurate, it is useful to have a precise measure of a line's accuracy.

To quantify how accurately a line fits a data set, we measure the vertical distance between each data point and the line.

Why don't we measure the shortest distance between the point and the line — the distance perpendicular to the line? Why do we measure vertically?

We measure vertical distance because we are interested in how well the line predicts the value of the dependent variable. The dependent variable — in our case, sales — is measured on the vertical axis. For each data point, we want to know: how close is the value of sales

predicted by the line to the historically observed value of sales?

From now on we will refer to this vertical distance between a data point and the line as the error in prediction or the residual error, or simply the error. The error is the difference between the observed value and the line's prediction for our dependent variable. This difference may be due to the influence of other variables or to plain chance.

Going forward, we will refer to the value of the dependent variable predicted by the line as y- hat and to the actual value of the dependent variable as y. Then the error is y - (y-hat), the difference between the actual and predicted values of the dependent variable.

The complete mathematical description of the relationship between the dependent and

independent variables is y = a + bx + error. The y-value of any data point is exactly defined by these terms: the value y-hat given by the regression line plus the error, y - (y-hat).

Collectively, the errors in prediction for all the data points measure how accurately a line fits a set of data.

To quantify the total size of the errors, we cannot just sum each of the vertical distances. If we did, positive and negative distances would cancel each other out.

Instead, we take the square of each distance and then sum all the squares, similarly to what we do when we calculate variance.

This measure, called the Sum of Squared Errors (SSE), or the Residual Sum of Squares, gives us a good measure of how accurately a line describes a set of data.

The less well the line fits the data, the larger the errors, and the higher the Sum of Squared Errors.

Summary

To find the line that best fits a data set, we first need a measure of the accuracy of a line's fit: the Sum of Squared Errors. To find the Sum of Squared Errors, we calculate the vertical distances from the data points to the line, square the distances, and sum the squares. Identifying the Regression Line

Now that you have a way to measure how well a line fits a set of data, you need a way to identify the line that "best fits" the data: the regression line.

We can calculate the Sum of Squared Errors for any line that passes through the data. Of course, different lines will give us different Sums of Squared Errors. The line we are looking

for — the regression line — is the one with the smallest Sum of Squared Errors.

Let's look at several lines that could describe the relationship between advertising and sales in our example. Our intuition tells us that the middle line is a much better fit than line a or line b. Let's check our intuition. For each line, we can calculate the Sum of Squared Errors to

determine its accuracy.

The lower the Sum of Squared Errors, the more precisely the line fits the data, and the higher the line's accuracy.

The line that most accurately describes the relationship between advertising and sales — the regression line — is the line that minimizes the sum of squares. Finding the regression line for a set of data is a calculation-intensive process best left to statistical software.

Summary

The line that most accurately fits the data — the regression line — is the line for which the Sum of Squared Errors is minimized.

Performing Regression Analysis in Excel 2007

Note: Unless you have installed the Excel Data Analysis ToolPak add-in, you will not be able to do regression analysis using the regression tool. However, we suggest you read through the following instructions to learn how Excel's regression tool works, so you can run regressions in the future, when you do have access to the Data Analysis Toolpak.

Performing regression analysis by hand is a time-consuming process. Fortunately, statistical software packages and major spreadsheet programs — Excel, for example — can do the necessary calculations for you in a matter of seconds. Click on the Excel link to access the data file so you can practice doing the analysis in Excel as you read through the instructions. EasyMeat Data

Let's go through the process step by step. We start with data entered in two columns in an Excel spreadsheet. Each column contains values of a variable. To perform regression analysis, there must be an equal number of entries in each column.

Under the Data tab in the toolbar we select the Data Analysis option.

A window pops open containing an alphabetical list of statistical tools. We select "Regression" and click "OK". A new window opens offering several options for regression analysis.

In the regression window, we see a prompt field titled ''Input Y Range.'' In it, we enter C1:C11, the range of cells containing the column label (C1) and the data (C2:C11) for the dependent variable: Sales ($).

We repeat this for the prompt field titled ''Input X Range,'' entering B1:B11 to include both the column label (B1) and the data (B2:B11) for the independent variable: Advertising ($). Since we included the column lables in row 1 in our ranges, we must check the "Labels" box. Including labels is helpful because Excel uses the labels to identify the variable coefficients in the output sheet. If you do not include the labels in your ranges, do not check the label box, or Excel will treat the first row of data as labels, excluding those entries from the regression.

Finally, we select the output option "New Worksheet Ply:", enter the name for the new worksheet, and click "OK."

Excel opens a new worksheet with the name we specified. In it, we see an intimidating array of data. For the moment, we are mainly interested in the entries in the cells labeled "Coefficients", which specify the intercept and slope of the regression line.

Note that the label "Advertising ($)" has been carried over from the original data column. The coefficient in the "Advertising ($)" row is the slope of the regression line.

For the exercises in this unit, we strongly recommend you find the relevant data in an Excel spreadsheet and perform the regression analyses yourself. If you do not have the Analysis Toolpak, you can open a file containing the relevant regression output.

EasyMeat Data

EasyMeat Regression