Exploratory Data Analysis with One and Two Variables

Academic year: 2022

Instructions for Lab #2

Statistics 111 - Probability and Statistical Inference

Lab Objective

To explore data with histograms and scatter plots.

Review

A few highlights to review from the previous lab:

1. The Working Directory

It is always important to know where the files that you are using are saved on the computer.

This is so both you and Stata can access the correct files. Let’s see how this would work for this problem.

In the next section is the link to the dataset for this problem set. Download the data and save it in the folder you are going to work in (e.g., C:\Users\John\Documents\Stats111\lab2). Now you need to direct Stata to this folder, which will be the folder you will be working from. There are two ways to accomplish this.

(a) Use the menu bar and navigate to File -> Change Working Directory. This will pull up a window that you can then use to find the folder where you saved the data.

(b) Issue a command to change the working directory by typing cd followed by the path to your folder (e.g., cd C:\Users\John\Documents\Stats111\lab2). If the path contains spaces, enclose it in double quotes.

Typing pwd into the command line will tell you the current directory and can be used to verify that you are in the right place.
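The two commands above can be combined into a short check. This is a minimal sketch that assumes the example folder from the text; replace the path with your own folder.

```stata
* Change to the folder that holds the lab files.
* The path is the example from the text; use your own folder instead.
cd "C:\Users\John\Documents\Stats111\lab2"

* Print the current working directory to confirm the change
pwd

* List the files in the folder to check that the data set is there
dir
```

The double quotes are optional for this particular path, but they are needed whenever a folder name contains spaces.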

2. “Do-files” and comments

All the commands you enter into the Command Line for the lab can (and should) be put into a “do-file” to allow replication and access at a later date. You should save a “do-file” in the folder where the data is. Let’s go through another example.

Download this sample “do-file” and save it in the same folder where you saved the data above. Now go back to Stata, make sure you are in the correct working directory, and run this file in each of the following three ways:

(a) Type do example.do in the command line.

(b) Navigate to File -> Do... and manually select the file.

(c) Open up the “do-file” editor (Ctrl+8; or Windows -> Do-file Editor), open the “do-file” in the editor, and Execute the job (Tools -> Execute or shortcut key on the toolbar).

Note the words on lines 7-8 of this “do-file”. They are green and are preceded by an asterisk (*). These are comments and are very useful in do-files. Stata ignores these lines when executing a “do-file.”

Now you can create a copy of this “do-file” and build off it for the problem set.
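As an illustration of what such a file might look like, here is a minimal sketch of a “do-file” for this lab. The file name lab2.do is hypothetical, the path is the example from the review above, and the commands assume the data set movies2012.dta used later in this lab.

```stata
* lab2.do: hypothetical starting point for the problem set
* Lines that begin with an asterisk are comments; Stata ignores them

* Point Stata to the folder that holds the data (example path)
cd "C:\Users\John\Documents\Stats111\lab2"

* Load the lab data set, clearing any data already in memory
use movies2012.dta, clear

* Quick numeric check that the data loaded as expected
summarize Domestic Foreign Worldwide
```

Saving this in the working directory and typing do lab2.do in the command line would run all of the lines in order.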

3. Fonts

Different fonts generally mean very specific things in the labs. This is to help you more easily distinguish Stata code from the rest of the text:

• Typewriter text refers to Stata commands.

• Italicized text, either in typewriter or in regular font, refers to variable names. Sometimes these are specific variables in a data set (Domestic), while at other times they are simply a placeholder for any variable or name that you may choose (cont_var).

• Red text is for Data Analysis Tips.

• Blue text is generally a link to datasets, examples, or places in the document.

Lab Procedures

Let’s go to the movies!

What are the characteristics of U.S. movies that make the most money? Let’s address this question with the data set movies2012.dta. It comprises data on the 250 top domestic grossing movies of all time as of November 2012. The variables are:

Variable       Description

Ranking        Ranking on Domestic gross sales
Title          Movie title
Year           Release year
Domestic       Domestic gross sales
Foreign        Foreign gross sales
Worldwide      Worldwide gross sales
Budget         Budget
Rating         MPAA rating
Best_Picture   Academy Awards Best Picture (nominated or won)
Genre          Main/first genre of the movie
All_Genres     List of all genres the movie falls into
Director       Name of the director

There are missing data in this file. We’ll ignore them for simplicity. In general, when confronted with missing data, it is best to get the advice of a professional statistician before doing analyses.

Data Analysis Tip: The unit of measurement for the monetary variables is not stated. That’s bad practice. Always include a description of the units somewhere on the file. Based on knowledge of movie revenues, it is clear that the unit of measurement is $1,000,000.

Questions:

1. After reading in the data, describe the distributions of foreign and domestic grosses. That is, say where most values are, note any outliers, and say whether the distribution is tightly packed around its mean or is spread out. Also, report the mean and standard deviation.

In addition to summarizing the data with summarize, you can use histograms to get a visual representation of the distribution of the data. The command is

histogram varname, name(graph1, replace)

where varname is the variable of interest. As with most commands, there are many options available with histogram. One of the most useful options is name(graph1, replace). This stores a version of the graph in temporary memory to be used later, and it causes the graph to be displayed in a window titled graph1. This allows multiple graph windows to be open at the same time (see footnote 1). You use the replace option within the parentheses of the name option to overwrite the previous version of the graph.

Further, if you want to compare multiple histograms in one window you can combine them by typing

graph combine graph1 graph2, name(graph_all, replace)

where again graph1, graph2, and graph_all are just examples for names of the graphs.

You can also navigate to Graphics -> Histogram to get a wizard to help with the graph.

Data Analysis Tip: The default histogram in Stata is a true histogram, where the areas of the bins sum to one. Often people want just the heights to sum to one. This is accomplished with the fraction option. Further, if you want the y-axis to simply count how many observations are in each bin, you can use the frequency option.
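Putting the pieces for this question together, one possible sequence of commands is sketched below. It assumes movies2012.dta is in the working directory; the graph names (hist_dom, hist_for, and so on) are arbitrary examples, not names required by Stata.

```stata
* Load the data (assumes movies2012.dta is in the working directory)
use movies2012.dta, clear

* Mean, standard deviation, minimum, and maximum of each variable
summarize Domestic Foreign

* Default (density-scale) histograms, each stored under its own name
histogram Domestic, name(hist_dom, replace)
histogram Foreign, name(hist_for, replace)

* The same variable with bar heights shown as counts
histogram Domestic, frequency name(hist_dom_freq, replace)

* Show the two density histograms together in one window
graph combine hist_dom hist_for, name(hist_both, replace)
```

When comparing the two distributions, note that graph combine leaves each panel with its own axis ranges by default. Adding the xcommon and ycommon options to graph combine puts the panels on shared scales, which can make a visual comparison easier.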

2. Which sentence best describes the distributions of domestic and foreign grosses? You can just write the letter of your choice on the lab report.

Footnote 1: The default is for each graph, regardless of type, to overwrite the previous graph.

(a) Domestic and foreign grosses are very similar.

(b) Domestic and foreign grosses have similar distributional shapes, but foreign grosses tend to be larger than domestic grosses.

(c) Domestic and foreign grosses have similar distributional shapes, but domestic grosses tend to be larger than foreign grosses.

(d) The two distributions look nothing like each other, because one has a long left tail and the other has a long right tail.

3. What are the names of the two movies that are the largest outliers on all three monetary variables?

4. We can examine the relationship between world-wide gross and movie genre using a box plot. Use the variable Genre for this analysis. The command for a box plot is

graph box cont_var, name(graph1, replace)

or

graph box cont_var, name(graph1, replace) over(cat_var)

where cont_var represents the continuous variable that you are trying to graph. The option over(cat_var) allows you to break the box plot down by different values of a categorical variable.

Alternatively, you can navigate to Graphics -> Box plot, type the continuous variable in the Main tab and the categorical variable in the Categories tab. If you want to clean up the graph, you can test out some of the other tabs in the wizard.

Answer the three questions below.

(a) Out of Comedy and Animated movies, which one has a distribution of world-wide grosses that is more similar to the distribution of world-wide grosses for Action movies? Justify your choice in at most two sentences.

(b) Compare the distributions for Drama movies and Adventure movies. Do they have reasonably similar medians? Is one more spread out than the other (if so, say which one)?

(c) If you directed a movie and wanted to make lots of money worldwide, which type appears to give you the best chance of doing so? Base your answer on the results of the box plot.
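For this data set, the box plot commands described above might be filled in as follows. This is a sketch that assumes the data are already loaded; the graph names box_all and box_genre are arbitrary examples.

```stata
* Box plot of worldwide gross for all movies together
graph box Worldwide, name(box_all, replace)

* One box per genre, to compare medians and spreads across genres
graph box Worldwide, over(Genre) name(box_genre, replace)
```

If the genre labels overlap along the axis, graph hbox with the same arguments draws the boxes horizontally, which usually leaves more room for the labels.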

5. Describe the relationship between domestic gross and foreign gross. To make a scatter plot, we will tell Stata that we want to do a twoway graph as a scatter:

graph twoway scatter varname1 varname2

where varname1 is plotted on the y-axis and varname2 on the x-axis.

Alternatively, you can navigate to Graphics -> Twoway graph (scatter, line, etc.), go to the Plots tab, and hit the Create button. Choose “Scatter” and the Y and X variables.

Items to include in your description are the general trend of the relationship (e.g., positive and linear, negative and linear, some other pattern, no clear pattern) and whether there are any outliers or points that do not fit the pattern.
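As a concrete instance of the scatter plot command, the sketch below puts foreign gross on the y-axis and domestic gross on the x-axis. That choice of axes is an assumption made for illustration (the question does not fix it), and the graph name scatter_fd is an arbitrary example.

```stata
* The first variable goes on the y-axis, the second on the x-axis
graph twoway scatter Foreign Domestic, name(scatter_fd, replace)
```

Swapping the two variable names swaps the axes; the strength and direction of the relationship look the same either way.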

6. Report the three pairwise correlations between Foreign, Domestic, and Worldwide gross. Further, graph the raw distribution of the data for each pair.

To find correlations in Stata, type

correlate varname1 varname2 . . .

which will show a matrix of the correlations between all the variables used as input (you can use as many as you’d like). Note that the diagonal is always 1. Make sure you know why that is.

In addition, you can create a scatterplot matrix, which shows scatter plots of all pairs of the listed variables, by typing

graph matrix varname1 varname2 . . .

Alternatively, you can navigate to Graphics -> Scatterplot matrix. And again, you can use multiple variables.

Do the correlations suggest strongly positive linear relationships, weakly positive linear relationships, no linear relationships, weakly negative linear relationships, or strongly negative linear relationships?
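For the three monetary variables in this data set, the two commands above might look like the following sketch. The graph name matrix_gross is an arbitrary example.

```stata
* Correlation matrix for the three gross variables
correlate Domestic Foreign Worldwide

* Scatter plots for every pair of the three variables
graph matrix Domestic Foreign Worldwide, name(matrix_gross, replace)
```

Because the file has missing data, it may help to know that correlate uses only the observations that are complete on every listed variable. The pwcorr command takes the same variable list but computes each correlation from all observations available for that pair, so the two can give slightly different numbers.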

7. Why are the correlations between Domestic and Worldwide, and Foreign and Worldwide, stronger than the correlation between Domestic and Foreign? The answer has to do with the definitions of the variables.

8. Outliers can have a strong effect on correlations. Let’s check to see if excluding Avatar and Titanic changes the correlations substantially. To exclude Avatar and Titanic, let’s again use the if functionality of Stata by typing if Title!="Avatar" & Title!="Titanic" at the end of the previous commands (before the comma, if the command has options).

Data Analysis Tip: Note that != is defined as “not equal to”, & is the “AND” operator, and | is the “OR” operator.

Now, re-calculate the correlations in (6). Did the correlations get stronger or weaker? Does the substance of your conclusions in (6) change very much when excluding Avatar and Titanic?
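The comparison with and without the two movies can be sketched as below. It assumes the titles are stored in the Title variable exactly as "Avatar" and "Titanic"; if the spelling or capitalization in the file differs, the condition will not exclude them.

```stata
* Correlations on the full data set
correlate Domestic Foreign Worldwide

* Same correlations with the two outlying movies left out
correlate Domestic Foreign Worldwide if Title != "Avatar" & Title != "Titanic"
```

Comparing the number of observations reported by the two commands is a quick way to confirm that exactly two movies were dropped.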

Data Analysis Tip: It is not acceptable to exclude outliers from analyses unless you have a scientific reason to do so (e.g., a data entry error, or maybe the outlying unit is not part of your target population). Hiding outliers is fudging data to get results you want. That is dishonest and unethical. When you see outliers, do analyses with and without them.

When the results do not change much, report the results based on the full data set, and tell your audience that the results were not sensitive to the outliers. When the results do change substantially, report both sets of analyses: one with and one without the outliers.

This honestly informs people that your conclusions are not on very solid ground, because particular data points affect the results greatly.

Feel free to explore relationships between sales and other characteristics of movies, like rating, best picture nominations/wins, and director.
