
Empire Learning

Empire Learning is a developer of educational software. CEO Bill Hartborne is making a bid for a contract to create an e-learning module for a new client.

Preparing the bid requires an estimate of the number of labor-hours it will take to create the new module. Bill believes that the length of a module and the complexity of its animations directly affect the amount of labor required to complete it.

Bill has data on the labor-hours Empire used to complete previous courses. He also knows the number of pages and the animation run-time of each previous course — quantities he thinks are reasonable proxies for course length and animation complexity, respectively. Perform a simple regression analysis for each of the independent variables: number of pages and run-time of animations.

Empire Learning Data

Which factor explains more variation in labor-hours?

a. Number of pages

This is the correct answer. The R-squared for the simple regression of labor-hours versus number of pages is 83%.

b. Run-time of animations

This is not the correct answer. The R-squared for the simple regression of labor-hours versus run-time of animations is 69%, which is lower than the R-squared using number of pages as the independent variable.
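As a rough illustration of the comparison above, the sketch below fits each simple regression and reports its R-squared. The course records are not reproduced in this text, so the arrays `pages`, `run_time`, and `labor_hours` hold made-up numbers and the helper `r_squared` is a hypothetical name. The results will not match the 83% and 69% quoted above.

```python
import numpy as np

def r_squared(x, y):
    """R-squared of the simple least-squares regression of y on x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)   # fit y = intercept + slope * x
    residuals = y - (intercept + slope * x)
    ss_res = np.sum(residuals ** 2)          # variation left unexplained
    ss_tot = np.sum((y - y.mean()) ** 2)     # total variation in y
    return 1.0 - ss_res / ss_tot

# Made-up illustrative data, NOT the Empire Learning data set.
pages = [120, 200, 310, 150, 400, 260, 340, 180]
run_time = [60, 90, 150, 45, 170, 120, 100, 80]        # seconds
labor_hours = [310, 420, 650, 330, 790, 540, 610, 400]

r2_pages = r_squared(pages, labor_hours)
r2_run_time = r_squared(run_time, labor_hours)
print(f"R-squared, pages:    {r2_pages:.2f}")
print(f"R-squared, run-time: {r2_run_time:.2f}")
```

The independent variable with the higher R-squared explains more of the variation in labor-hours when each is considered on its own.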


Empire Learning Regressions

In the simple regressions, which of the independent variables contributes significantly to the number of labor-hours it takes Empire to create an e-learning course?

a. Number of pages only

This is not the correct answer. The p-value for the coefficient on the number of pages is 0.0002, well below 0.05, the most commonly used level of significance.

b. Run-time of animations only

This is not the correct answer. The p-value for the coefficient on the run-time of animations is 0.003, well below 0.05, the most commonly used level of significance.

c. Both variables

This is the correct answer. The p-values for both independent variables are less than 0.05, the most commonly used level of significance.

d. Neither variable

This is not the correct answer. The p-values for both independent variables are less than 0.05, the most commonly used level of significance.


The p-values for the coefficients on animation run-time and number of pages are 0.003 and 0.0002 respectively, both well below 0.05, the most commonly used level of significance. Thus, we conclude that both independent variables contribute significantly in their respective simple regressions to the number of labor-hours Empire takes to create an e-learning course.

Run the multiple regression of labor-hours versus number of pages and run-time of animations.


According to this multiple regression, which of the independent variables contributes significantly to the number of labor-hours it takes Empire to create an e-learning course?

a. Number of pages only

This is not the correct answer. The p-value for the coefficient on the number of pages is 0.001, well below 0.05, the most commonly used level of significance.

b. Run-time of animations only

This is not the correct answer. The p-value for the coefficient on the run-time of animations is 0.015, well below 0.05, the most commonly used level of significance.

c. Both variables

This is the correct answer. The p-values for both independent variables are less than 0.05, the most commonly used level of significance.

d. Neither variable

This is not the correct answer. The p-values for both independent variables are less than 0.05, the most commonly used level of significance.


The p-values for the coefficients on animation run-time and number of pages are 0.014 and 0.0015 respectively, both well below 0.05, the most commonly used level of significance. Thus, we conclude that both independent variables contribute significantly in the multiple regression to the number of labor-hours Empire takes to create an e-learning course.

Exercise 2: The Empire Strikes Back

For this exercise, refer to the regression analyses performed in Exercise 1 of this section.

Bill Hartborne, CEO of Empire Learning, is using regression analysis to predict the number of labor-hours it will take his team to create a new e-learning course. He is using data on previous courses Empire created, with the number of pages and the total run-time of animations as independent variables.


In the multiple regression of labor-hours versus number of pages and run-time of animations, the coefficient of 0.84 for the number of pages tells us that:

a. For every additional 100 pages of module length, the run-time of animations increases by an average of 84 seconds.

This is not the best answer. The coefficient on an independent variable describes the mathematical relationship between the independent variable and the dependent variable (in this case, labor-hours), not the relationship between two independent variables.

b. For every additional 100 pages of module length, the run-time of animations increases by 84 seconds when we control for labor-hours.

This is not the best answer. The coefficient on an independent variable describes the mathematical relationship between the independent variable and the dependent variable (in this case, labor-hours), not the relationship between two independent variables.

c. For every additional 100 pages of module length, the number of labor-hours increases by an average of 84.

This is not the best answer. The coefficient of 0.84 describes the net relationship between number of pages and labor-hours when controlling for animation run-time.

d. For every additional 100 pages of module length, the number of labor-hours increases by 84 when we control for animation run-time.

This is the best answer. The coefficient of 0.84 describes the net relationship between number of pages and labor-hours when controlling for animation run-time.
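The arithmetic behind this interpretation can be sketched as follows. Only the 0.84 coefficient comes from the regression discussed above. The function name is hypothetical, and the calculation assumes animation run-time is held fixed, which is what "net" means here.

```python
PAGES_COEFFICIENT = 0.84  # labor-hours per page, from the multiple regression above

def change_in_labor_hours(extra_pages):
    """Predicted change in labor-hours for a change in pages, run-time held fixed."""
    return PAGES_COEFFICIENT * extra_pages

# Effect of 100 additional pages with the same animation run-time.
hours_for_100_pages = change_in_labor_hours(100)
print(round(hours_for_100_pages, 2))
```

Because the intercept and the run-time term are the same before and after the change, they cancel out of the difference, so only the pages coefficient matters.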


In the multiple regression equation, the coefficient of the independent variable "number of pages" is gross relative to:

a. The number of labor-hours.

This is not the correct answer. An independent variable is not gross or net relative to the dependent variable.

b. The run-time of animations.

This is not the correct answer. The run-time of animation is part of this multiple regression, so the number of pages is net relative to the run-time of animations.

c. The number of illustrations used in the module.

This is the correct answer. The number of illustrations is omitted in the regression analysis, so the number of pages is gross relative to this variable.

d. Nothing. The number of pages is an all-around pleasant and sanitary variable.

This is not the correct answer. Perhaps you should review the clip on interpreting the multiple regression equation to find the meaning of "gross" in this context.

Challenge: Children of the Empire

For this exercise, refer to the regression analyses performed in Exercise 1 of this section.

Bill Hartborne, the CEO of Empire Learning, is using regression analysis to predict the number of labor-hours it will take his team to create a new e-learning course. He is using data on previous courses Empire created, with the number of pages and the total run-time of animations as independent variables.


Bill bills out his talent at $70/hour. Based on the multiple regression, how much should he charge for the labor content of a course with 400 pages and 170 seconds of animations? Enter the estimated cost of the labor (in $) as an integer (e.g., enter "$5.00" as "5"). Round if necessary.

b. 77500 c. 77210


First use the regression equation to predict the number of labor-hours required to complete the course.


Then multiply that number by Empire Learning's billing rate of $70/hour to find the total amount he should charge for the labor content of the course, $77,210.


Bill is sure that the client will balk at a labor bill of over $70,000. He knows that animation is important to the client, so he doesn't want to cut corners there. However, he believes that his lead writer can cover the content in fewer pages without compromising his renowned clear and engaging prose.


To reduce total labor costs to $70,000, how many pages must Bill cut from the plan to meet his client's cost limits?

a. Around 87 pages.

This is not the correct answer. The coefficient for number of pages is 0.84.

b. Around 103 pages.

This is not the correct answer. 103 is the number of labor-hours Bill needs to cut, not the number of pages.

c. Around 123 pages.

This is the correct answer.

d. There aren't enough pages for Bill to cut to reduce the price below $70,000.

This is not the correct answer. There are enough pages to cut to reduce the price below $70,000.

To reduce the labor bill from $77,210 to $70,000, Bill must reduce labor costs by $7,210. To achieve this reduction, Bill must cut the contract's labor hours by 103 hours, since he bills out his talent at $70/hour.


Since the animation run-time will not change, we use the net relationship between labor-hours and number of pages, which tells us that each additional page consumes 0.84 labor-hours. Thus, Bill must reduce the number of pages by 103 / 0.84, or about 123.
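The two steps of this calculation can be checked with a short script. All figures are taken from the exercise ($70/hour, the $77,210 estimate, the $70,000 target, and the 0.84 coefficient). The variable names are illustrative only.

```python
BILLING_RATE = 70.0        # dollars per labor-hour (from the exercise)
PAGES_COEFFICIENT = 0.84   # labor-hours per page, run-time held fixed

current_bill = 77_210.0    # estimated labor bill for 400 pages and 170 seconds
target_bill = 70_000.0     # the client's limit

# Step 1: convert the dollar figures into labor-hours.
hours_now = current_bill / BILLING_RATE
hours_to_cut = (current_bill - target_bill) / BILLING_RATE

# Step 2: convert labor-hours into pages using the net coefficient.
pages_to_cut = hours_to_cut / PAGES_COEFFICIENT

print(f"predicted labor-hours: {hours_now:.0f}")
print(f"labor-hours to cut:    {hours_to_cut:.0f}")
print(f"pages to cut:          {pages_to_cut:.1f}")
```

Rounding the last figure up to a whole page gives the 123 pages in the answer.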

New Concepts in Multiple Regression

"I expect Leo will call us this evening after his meeting with the lawyers," Alice predicts. "I hope things are going well. If Mr. Pitt's lawsuit materializes, Leo might not have much of a business left to help him with."

The Staffing Problem (III)

I just spent the whole day at my lawyers' offices. Please give me some good news about the occupancy problem.

Well, we've found a regression model that incorporates arrivals on Kauai and advance bookings. We're now able to explain about 86% of the variation in occupancy.

Kahana Occupancy Regressions

That's great. That's so much better than the 39% you calculated using only advance bookings as the independent variable. I should be able to make much more reliable predictions based on your new model!

Unfortunately, no. Although this new model helps us understand why your occupancy varies, we can't exactly use the model to make predictions.

You can use advance bookings to make a prediction about occupancy in a given month because the bookings are known to you ahead of time. But you won't get the data on the number of arrivals in a month until it's too late and your guests are already on your doorstep.

That's terrible! Sure, it's nice to know how today's occupancy is affected by today's arrivals. But I need to make business decisions! I need to know one month in advance how many staffers to hire! Please, isn't there something you can do?

Don't give up yet, Leo. We still have a number of statistical approaches at our disposal. We'll have something for you when you get back from Honolulu.

Multicollinearity

"We need some more advanced statistical tools to find a regression suitable to Leo's purposes," Alice tells you. "And there is still a pitfall you'll need to learn to avoid when using multiple regression."

Another key factor that influences the price of a house is the size of the property it is built on, its "lot size". Naturally, we'd pay more for a spacious acre of land than for 800 cramped square feet.

Using new data on the lot sizes of the 15 sample houses in Silverhaven, run the simple regression of house price on lot size. If we don't control for any other factors, how much of the variation in price can be explained by variation in lot size? Enter R-squared as a decimal number with two digits to the right of the decimal point (e.g., enter "50%" as "0.50"). Round if necessary.

a. 0.30 b. 0.301 c. 0.302 d. 0.303 e. 0.304 f. 0.305

Silverhaven Real Estate Data

Variation in lot size explains about 30 percent of the variation in house price.

Do the data provide evidence that there is a significant linear relationship between house price and lot size?

a. Yes.

This is the correct answer. The p-value for the lot size coefficient is 0.033, which is less than the most commonly used significance level 0.05.

b. No.

This is not the correct answer. The p-value for the lot size coefficient is 0.033, which is less than the most commonly used significance level 0.05.

c. This question cannot be answered without running a multiple regression.

This is not the correct answer. The significance of the independent variable can be found in a simple regression, too.


Silverhaven Real Estate Regressions

The low p-value of 0.033 tells us that we can be confident that the gross relationship between lot size and home price is significant. What happens when we add lot size as a third independent variable in our multiple regression of price on three independent variables: house size, distance from downtown Silverhaven, and lot size?

Run a multiple regression of price on the three independent variables: house size, distance, and lot size. How does the addition of the new independent variable, "lot size", affect the predictive power of the regression model?

a. The explanatory power of the regression increases when lot size is added as an independent variable.

This is the correct answer.

b. The explanatory power of the regression decreases when lot size is added as an independent variable.

This is not the correct answer.

c. The explanatory power of the regression stays the same when lot size is added as an independent variable.

This is not the correct answer.

By adding the independent variable "lot size", we improve adjusted R-squared slightly, from 89% to 91%, telling us that the predictive power of the regression has improved. What about the significance of the independent variables? Has adding the new variable changed the p-values of the coefficients?

Something odd has happened. In our earlier regression with two independent variables, the p-values for both the house size and the distance coefficients were less than 0.05. Now, adding lot size into the equation has somehow raised the p-value for house size.

The new p-value for house size, 0.2179, is so high that there is no longer evidence of a significant linear relationship between price and house size after taking lot size into account. How do we explain this drop in significance?

When a multiple regression delivers a surprising result such as this, we can usually attribute it to a relationship between two or more of the independent variables. Let's look at the data on house size and lot size.

Case of the Dropping Significance.

When two of the independent variables are highly correlated, one is essentially a proxy for the other. This phenomenon is called multicollinearity. In our example, lot size is a good proxy for house size.

Both house size and lot size contribute to the price of a home. But because these two variables are closely correlated in our data set, there is not enough information in the data to discern how their combined contributions should be attributed to the two independent variables. The net effect of house size on price should tell us the effect of house size on price assuming that the lot size is fixed. However, we can't detect this effect in the data: house size and lot size are so closely related that we've never seen house size vary much when lot size is fixed.

Would dropping the variable house size improve the predictive power of the regression model? The multiple regressions with and without house size have different numbers of independent variables, so we use adjusted R-squared to compare their predictive power.

Without house size, adjusted R-squared is 90.89%, slightly lower than 91.40%, the adjusted R-squared for the regression including house size. Thus, although the regression model cannot accurately estimate the effect of house size when we control for lot size and distance, the addition of house size does help explain a bit more of the variance in selling price.
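For reference, adjusted R-squared is conventionally computed from R-squared, the number of observations n, and the number of independent variables k as 1 - (1 - R²)(n - 1)/(n - k - 1). The sketch below applies that standard formula. The sample size of 15 comes from the text, but the unadjusted R-squared inputs are invented for illustration because the text does not report them.

```python
def adjusted_r_squared(r_squared, n_observations, n_predictors):
    """Adjusted R-squared: R-squared penalized for each independent variable used."""
    n, k = n_observations, n_predictors
    return 1.0 - (1.0 - r_squared) * (n - 1) / (n - k - 1)

N_HOUSES = 15  # sample size stated in the text

# Hypothetical unadjusted R-squared values, chosen only for illustration.
two_vars = adjusted_r_squared(0.920, N_HOUSES, 2)
three_vars = adjusted_r_squared(0.921, N_HOUSES, 3)

print(f"two variables:   {two_vars:.4f}")
print(f"three variables: {three_vars:.4f}")
```

With these invented inputs, the third variable raises R-squared slightly but lowers adjusted R-squared, because the small gain does not offset the penalty for the extra variable. In the Silverhaven regressions the opposite holds: adjusted R-squared is higher with house size included, which is why the variable earns its place for prediction.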

Diagnosing and Treating Multicollinearity

A common indication of lurking multicollinearity in a regression is a high adjusted R-squared value accompanied by low significance for one or more of the independent variables. One way to diagnose multicollinearity is to check if the p-value on an independent variable rises when a new independent variable is added, suggesting strong correlation between those independent variables.
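A complementary check, not described in the text but consistent with it, is to look directly at the pairwise correlations among the independent variables. The sketch below does this with made-up numbers. The arrays are not the Silverhaven data, and the 0.8 cutoff is an arbitrary assumption rather than a rule from the course.

```python
import numpy as np

# Made-up values for illustration, NOT the Silverhaven data set.
house_size = np.array([1500, 1800, 2100, 2400, 2600, 3000, 3300, 3600], dtype=float)
lot_size = np.array([5200, 6100, 7400, 8000, 9100, 10300, 11000, 12500], dtype=float)
distance = np.array([12, 3, 9, 15, 6, 2, 11, 7], dtype=float)

names = ["house_size", "lot_size", "distance"]
# np.corrcoef treats each row as one variable.
corr = np.corrcoef(np.vstack([house_size, lot_size, distance]))

THRESHOLD = 0.8  # assumed cutoff for "highly correlated"
flagged = []
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        print(f"{names[i]} vs {names[j]}: r = {corr[i, j]:+.2f}")
        if abs(corr[i, j]) > THRESHOLD:
            flagged.append((names[i], names[j]))

print("possible multicollinearity between:", flagged)
```

A flagged pair is only a prompt for closer inspection. Whether to keep both variables still depends on the purpose of the regression, as discussed next.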

How much of a problem is multicollinearity? That depends on what we are using the regression analysis for. If we're using it to make predictions, multicollinearity is not a problem, assuming as always that the historically observed relationships among the variables continue to hold going forward.

In the house price example, we'd keep the house size variable in the model, because its presence improves adjusted R-squared and because our judgment would suggest that house size should have an impact on price separate from the effect of lot size.

If we're trying to understand the net relationships of the independent variables, multicollinearity is a serious problem that must be addressed. One way to reduce multicollinearity is to increase the sample size. The more observations we have, the easier it will be to discern the net effects of the individual independent variables.

We can also reduce or eliminate multicollinearity by removing one of the collinear independent variables. Identifying which one to remove requires a careful analysis of the relationships between the independent variables and the dependent variable. This is where a manager's deep understanding of the dynamics of the situation becomes invaluable. In our home price example, we'd expect both house size and lot size to have significant and discernible effects on the price of a home: a shack on an acre of land should cost less than a mansion on a similar property.

To better understand the net effect of house size we should probably gather a larger sample to reduce multicollinearity. If we didn't expect lot size and house size to have distinct effects on price, we might remove house size from the equation.

Summary

Multicollinearity occurs when some of the independent variables are strongly interrelated: distinguishing the respective effects of some of the independent variables on the dependent variable is not possible using the available data. Multicollinearity is typically not a problem when we use regression for forecasting. When using regression to understand the net relationships between independent variables and the dependent variable, multicollinearity should be reduced or eliminated.

Lagged Variables

In the Silverhaven real estate example we looked at a number of houses and compared four characteristics: price, house size, lot size, and distance from the city center. These are