1. Basic Definitions
Statistics, like mathematics, is a language. It uses many specialized terms to create a shared vocabulary that allows statisticians to communicate easily with one another. These terms also strive to avoid confusion with commonly used words in everyday language.
One of the central ideas in statistics can be summarized as follows: We have a very large group of something (people, animals, objects, etc.) that we wish to describe or make predictions about, but because of constraints (time, money, access, etc.) we are only able to look at a small subgroup of this larger group. The question then is how to logically, fairly, and accurately extend our information about this smaller group to the larger group. But notice how awkward all the previous statements are without a shared language to quickly describe the things we are studying and these different groups.
Therefore, we introduce the following basic definitions:
The population (sometimes referred to as the population of interest) is the large collection of all the individuals that we wish to study. The sample is the subgroup of the population that we actually collect data about. An individual is a single unit of the population, in other words a single person, object, time frame, etc.
As researchers we might want to know many things about a single individual (this is especially important when trying to establish relationships or make predictions). Each attribute that we wish to study about an individual is called a variable. This means that a single individual in our sample might have many variables collected about it.
Variables are further subdivided into two main types. Categorical (or Qualitative) variables are variables where individuals are placed into categories. These categories are decided by the researcher and can be as fine or as broad as the researcher wants them to be.
Quantitative variables are variables where individuals have a numerical value recorded AND arithmetic on these values makes sense.
Sometimes these variables are given some extra descriptors.
Categorical variables where only two categories are given are sometimes referred to as Boolean variables.
Quantitative variables where the values take on only discrete, separated values (i.e., 1, 2, 3, etc.) are called Discrete Quantitative Variables. This is contrasted with quantitative variables that can take on any value up to the accuracy of measurement. These are called Continuous Quantitative Variables.
Example 1.1)
Suppose that a community college wants to know more about their students. To do this they randomly mail out surveys to 500 students and receive back 100 responses. For each survey they ask the following questions:
1) What is your current age?
2) What is your zipcode?
3) What is your educational goal? (Choose one) Associates/2 year Degree, Bachelors/4 year Degree, Advanced Degree/Masters or Beyond, Other
4) What fields interest you? (Choose as many as apply) STEM, Humanities, Fine Arts, Social Sciences, Business, Life Sciences.
In describing this example we should consider several things:
Population of Interest: All students at this community college. (Note from the information we were given we do not know the exact size of the population)
Sample: The 100 students who responded (Note: the intended sample was 500 students, but the sample is only those that we ACTUALLY collect data about). We will often refer to the number of individuals in our sample as the sample size and use the variable n for notation.
Individual: A single student (Note: The sample and population are both clearly made up of individuals)
There were four variables collected about each individual. They can be described as follows:
Variable 1 was a Quantitative Variable (most likely Discrete). This is because a response will be a number, like 19 years, and it makes sense to do arithmetic with these values. If one student responds with 19 years and another responds with 21 years, it makes sense to say the second student is 2 years older!
Variable 2 was a Categorical Variable. This is because while the responses are “numbers”, arithmetic makes no sense. If one person’s zipcode is 92606 and another person’s zipcode is 92605, it doesn’t make any sense to add or subtract these zipcodes. Instead, while this variable might use digits, these zipcodes represent categories describing where each person lives.
Variable 3 was a Categorical Variable. This is a standard categorical variable (one that we will study in detail shortly). Notice the use of an “Other” category to catch anyone who doesn’t fit the small number of options given.
Variable 4 was also a Categorical Variable. This is also a standard categorical variable, but one that we won’t work with as much because a single individual can put themselves into multiple categories. This means there will not be a 1-to-1 correspondence between the number of responses we get and the number of individuals in our sample. This makes analyzing results from this type of variable quite a bit harder.
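The differences between these variable types can be sketched in a few lines of Python. The three responses below are made up purely for illustration (they are not data from the survey), and the field names are hypothetical.

```python
from collections import Counter
from statistics import mean

# Hypothetical survey responses: one dictionary per individual (student).
responses = [
    {"age": 19, "zipcode": "92606", "goal": "Associates", "fields": ["STEM", "Business"]},
    {"age": 21, "zipcode": "92605", "goal": "Bachelors", "fields": ["Humanities"]},
    {"age": 34, "zipcode": "92606", "goal": "Other", "fields": ["STEM", "Fine Arts", "Life Sciences"]},
]

n = len(responses)  # sample size

# Variable 1 (Quantitative): arithmetic on ages is meaningful.
average_age = mean(r["age"] for r in responses)

# Variable 2 (Categorical): zipcodes are stored as text and only counted.
zipcode_counts = Counter(r["zipcode"] for r in responses)

# Variable 4 (Categorical, several answers allowed): the number of
# selections can be larger than the number of individuals.
field_counts = Counter(f for r in responses for f in r["fields"])
total_selections = sum(field_counts.values())

print(n, average_age, zipcode_counts, total_selections)
```

Storing the zipcode as text rather than as a number is a deliberate choice here: it makes accidental arithmetic on a categorical variable harder.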
Example 1.2)
Suppose that you work for a local coffee shop and are tasked with studying its recent business. To do this you decide to randomly choose 15 days and record the following information about each day.
1) How many customers visited the store that day?
2) Did any customers complain that day? (Yes/No)
3) How much money was spent at the coffee shop that day?
4) Which pastry made the most profit that day? (Muffins, croissants, breads, or bagels)
Once again let’s use the definitions we’ve studied so far to describe this example in the language of statistics.
Population of Interest: This is a bit tricky actually. It is tempting to say that the population in this case is all the customers of this coffee shop. However, notice that you are not collecting data about each customer in this case, but instead collecting data about each day! That means that the population of interest in this case is all possible days that you could have chosen to study the coffee shop. Notice in this case the population size is again unknown and might not even really make sense unless we were to set a timeframe.
Sample: This is clearer after the discussion above. The sample is clearly the 15 days that you recorded the information on.
Individual: Each day. Again, as discussed before, both the Population of Interest and the Sample must be made up of individuals. Once we said that the sample was 15 days, the fact that each individual is actually a day in this case follows naturally.
There were again four variables collected about each individual. They can further be described as follows:
Variable 1 was a Quantitative Variable. Once again this variable would most likely be further classified as Discrete since after all you can only have a whole number of customers visit the shop on each day.
Variable 2 was a Categorical Variable. This variable would be further classified as Boolean since it only had two possible choices, Yes or No. (Note: This variable would be classified as Quantitative if it recorded the number of customers who complained. This shows that in statistics wording is very important!)
Variable 3 was a Quantitative Variable. This variable would probably be further classified as Continuous because you could record it at different levels of accuracy. For example, you could round to the nearest dollar and report that the money spent in the coffee shop in a day was $1273, or include the cents and record that it was $1272.84. (Note: Some people might argue that since you can’t be any more accurate than cents this is still Discrete, and that’s a fair argument. This shows that Discrete vs Continuous is not a hard and fast separation, unlike Quantitative vs Categorical. Therefore, don’t worry too much about the Discrete vs Continuous separation.)
Variable 4 was a Categorical Variable. Again, if the question had asked how much profit each of these pastries made, then it would be Quantitative. Further note that an “Other” category is probably not necessary in this case because the coffee shop only serves these four types of pastries.
2. Sampling Techniques
As mentioned earlier, one of the basic goals of statistics stems from wanting to study, analyze, or infer things about a large population while being restricted, because of time or budget constraints, to only a small subset or sample of the population. So the question naturally becomes: how do you select this sample?
Two key ideas are central to trying to select the best possible sample. Those ideas are Random and Representative.
We will begin with Random. Randomness is essential to a statistical study since it allows results to be legitimately inferred from the data. Consider the following non-random example to see why this is the case.
If a restaurant wanted to claim they were the best restaurant in town they could simply select their 10 best reviews to show off. However, if someone else (say a competing restaurant) wanted to slander this restaurant they could simply select the 10 worst reviews and use these as their criticism. What is important to recognize here is that we are not assuming that these reviews are false or made up. Maybe the restaurant really does have 10 reviews that describe it as the best restaurant ever and 10 reviews that see it as the worst. The problem lies in extending the results. If the restaurant puts forward the 10 favorable reviews, these clearly do not reflect the hundreds of other reviews that the restaurant has. Nor do the 10 disparaging reviews reflect the hundreds of other reviews. In other words, the problem we are encountering here is that because the reviews were not chosen randomly, there is no reason to believe that we can take this sample and extend its results to the whole population of reviews.
What this thought experiment is intended to make you realize is that randomness is absolutely essential when it comes to inferential statistics. Trying to make generalizations or predictions from non-random data can lead to many issues and is often unreliable and inaccurate. There are a few cases where non-randomly collected data can be used in descriptive statistics, especially when first starting a statistical study. It can provide a very cheap and easy way to get a general sense of a population. However, for the majority of our discussion and for all of inferential statistics we will view non-randomly collected data as statistically invalid.
The second main idea to good sampling is Representativeness. Representativeness is the concept of trying to make sure that your sample, while being only a small subset of the population, accurately reflects the types of individuals in the population. In other words, if your population was a small town where around half the individuals were female and half the individuals were male, you would hope your sample has a similar gender split. It might cause some inaccuracies if your sample was three quarters female and a quarter male.
One simple-sounding way to help with representativeness is to take a very large sample. Again imagine trying to sample a small town about their political views. To ensure accuracy you would want to make sure that you get some younger people and some older people, some men and some women, people of different levels of socioeconomic status, and people of different ethnicities and cultures. Further, you would want all the different combinations of these attributes. Probability (a topic that will be discussed in more depth a bit later) tells us that the larger a random sample is, the closer its distribution will tend to be to the actual population distribution.
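A small simulation can make this idea concrete. The sketch below assumes a hypothetical population that is exactly half female and half male, and it draws each individual independently (which is a simplification of sampling from a real, finite town). The function name is made up for this illustration.

```python
import random

def sample_share_female(sample_size, rng):
    """Share of 'F' in a random sample from an assumed 50% 'F' population."""
    draws = [rng.choice("FM") for _ in range(sample_size)]
    return draws.count("F") / sample_size

rng = random.Random(0)  # fixed seed so the run is repeatable
for size in (10, 100, 10_000, 100_000):
    share = sample_share_female(size, rng)
    print(f"n = {size:>7}: share female = {share:.3f}")
```

With small samples the share can land far from one half; with large samples it should settle close to it, which is the behavior probability predicts.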
So why then do we not simply take huge samples to solve the problem of representativeness? The answer goes back to something mentioned at the very beginning of this section, namely that taking large samples can be costly and time consuming and sometimes is even impossible. Therefore, as we begin our discussion of different sampling techniques we will see how certain techniques can help achieve better representativeness even in relatively small samples.
Before listing the basic sampling techniques we need to mention one more distinction between Random and Representative. Randomness is a yes or no question. The individuals chosen to be in a statistical sample are either chosen randomly or they are not. This is contrasted with representativeness, which is more ambiguous and should be viewed as a scale. There is no way to know if a sample is truly representative or not (to do so would be to suddenly know everything about the population, defeating the purpose of the sample in the first place!). Instead, for our purposes, one should be on the lookout for clear gaps in representativeness, in other words, clear mistakes or oversights. If nothing obvious stands out we will usually give the study/sample the benefit of the doubt and assume it is reasonably representative.
The first sampling technique we want to discuss is what is called a Simple Random Sample or SRS. An SRS is the statistical equivalent of a lottery or, even more simply, “drawing names from a hat”. In an SRS every individual in the population is given an equal chance of being selected. Theoretically, this means that every individual in the population is assigned a number and then the statistician randomly selects some of these numbers for their sample. SRSs with small populations are very easy to understand. Suppose there was a classroom with thirty students and the instructor wants to randomly choose five students for a demonstration. All the instructor would need to do is give each student a number (1 through 30) and then choose five numbers at random. The students with those five numbers would then be the sample. (Note: Obviously, numbers could be avoided altogether in this case by simply putting all the names of the students in the class in a hat and choosing five of them at random.)
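The classroom example can be sketched in Python using the standard library's random module. This is only one way to draw such a sample, and the variable names are chosen for illustration.

```python
import random

students = list(range(1, 31))  # students numbered 1 through 30

# random.sample draws without replacement, so no student is picked twice
# and every student has the same chance of being chosen.
chosen = random.sample(students, k=5)
print(sorted(chosen))
```

Because the draw is random, running the code again will usually give a different set of five students.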
The advantages of an SRS are clear: it is random, so it satisfies the main condition of being statistically valid as discussed earlier. As for being representative, this method does nothing special to help ensure representativeness; it relies on the sample being large enough for probability to do its job and produce a relatively accurate distribution.
The main disadvantage of an SRS comes from how the sample is constructed. Namely, an SRS requires that the population size be known, since every individual must be assigned a number. Obviously, with very large populations this can be quite challenging or in fact sometimes impossible. For an example of it being challenging, consider trying to find the exact number of adult US citizens. (Note: Once again, as is common in statistics, there is a grey area here. Consider trying to take an SRS of all students at a community college. A researcher could use the administration's database as their population, which while perhaps missing a few students would be close enough to still build a reasonable SRS.) For an example of it being impossible, consider a store wanting to sample from all its customers during the course of a holiday weekend. The store won’t know the total number of customers until the weekend is over!
In summary, an SRS is the most basic way to construct a statistically valid sample. The following methods we will discuss all attempt to improve on the SRS approach or cover one of its flaws. It should be noted, however, that even with a perfectly constructed SRS, poorly constructed questions and other forms of bias can still ruin the data. We’ll study this more later on, but it is a good idea to keep in mind that choosing a sample is only the first step in the statistical process.
The second sampling technique we want to discuss is what is called Systematic Sampling. Systematic sampling, as its name implies, is sampling using a form of mathematical system. This mathematical system takes the place of the randomness necessary for the sample to be statistically valid. This system, however, must be rigidly adhered to. For example, if a store decides to use a systematic sample to sample their customers during a holiday weekend they might assign one worker to wait at the exit and ask every 10th person who leaves some questions. The worker, for the sample to be valid, must follow this system. This means that if the 10th customer to come along is angry or in a hurry the worker must still attempt to talk to them. If the customer refuses, the worker has to wait for another 10 customers even if the very next customer looks friendly and talkative. Notice that this might mean that the worker has to stay a long time before their sample will be sizable.
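The “every 10th person” rule can be sketched as follows. The function name and the customer labels are hypothetical, and the sketch does not model customers who refuse to answer.

```python
def systematic_sample(stream, step):
    """Yield every `step`-th item from `stream`, in arrival order."""
    for position, item in enumerate(stream, start=1):
        if position % step == 0:
            yield item

# Hypothetical customers leaving the store, labeled in order of exit.
customers = (f"customer_{i}" for i in range(1, 96))
sample = list(systematic_sample(customers, step=10))
print(sample)
```

The customers are supplied as a stream that is consumed one at a time, so the code never needs to know in advance how many customers there will be.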
The main advantage of this approach is that, by using a system, the population size no longer needs to be known in order to conduct the sample. As described above in our store example, the store has no way of knowing how many customers will come by during that weekend, but it does not matter for this type of sampling. Also, as already mentioned, despite the fact that the word “random” is not used in the description, for all statistical purposes a systematic sample counts as random.
There are two disadvantages of this approach. The first is that once again this approach does nothing special in terms of representativeness. It relies entirely on the sample being large enough to be representative. The second drawback is that creating the type of system necessary for this sampling technique is not always possible. Consider trying to gauge a small town population’s view on a political issue. Trying to build a system to sample from the population would be challenging as the people would be spread all about and there would be no guarantee that all the citizens would visit one location.
The third sampling technique we want to discuss is called Stratified Sampling. In stratified sampling the researcher first separates the population into groups based on some trait they consider important for the study. These groups are called strata, and each individual in the population must be placed into exactly one stratum. Then the researcher proceeds to take an SRS from each stratum separately and combines these together to form their overall sample. For example, consider a researcher wanting to poll students at a university. To do this they might first separate the students into freshmen, sophomores, juniors, and seniors and then proceed to randomly choose 50 students from each group for an overall sample of 200 students.
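A minimal sketch of the university example is given here. The student ID lists and their sizes are invented for illustration, and the function name is hypothetical.

```python
import random

def stratified_sample(strata, per_stratum, rng=random):
    """Take an SRS of size `per_stratum` from each stratum and combine them."""
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, k=per_stratum))
    return sample

# Hypothetical student ID lists; each student appears in exactly one stratum.
strata = {
    "freshmen": [f"FR{i}" for i in range(1200)],
    "sophomores": [f"SO{i}" for i in range(1000)],
    "juniors": [f"JR{i}" for i in range(900)],
    "seniors": [f"SR{i}" for i in range(800)],
}
sample = stratified_sample(strata, per_stratum=50)
print(len(sample))
```

Whatever the random draw turns out to be, each class year is guaranteed to contribute exactly 50 students to the sample.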
It should be clear that the main advantage of stratified sampling is that it guarantees that individuals with each of the traits that were used to construct the strata will be present in the sample. This is especially important if the researcher believes that this trait may influence the data that they are collecting. In other words, stratified sampling directly tries to improve representativeness.
The disadvantages of stratified sampling are two-fold. One is that because of the classification process at the beginning, stratified sampling requires the most work by the researcher. In other words, you are paying for the increased representativeness with more work at the beginning of the process. Second is that both the classification process and the subsequent SRSs require that the population size is at least reasonably known or manageable.
The fourth sampling technique we want to discuss is called Cluster Sampling. In cluster sampling the individuals in the population are found to already be in naturally occurring groups or clusters. Instead of performing an SRS on the whole population, choosing individuals one at a time, in cluster sampling an SRS is performed on the clusters themselves, choosing entire clusters at a time. All the individuals in a selected cluster are then added to the sample. For example, consider a university wanting to interview its student population. Instead of trying to find a list of all students, the researcher could instead find a list of all the classes and randomly choose entire classes of students to be added to the sample.
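The class-based example can be sketched in Python. The class names and rosters are invented, and for simplicity the sketch assumes each student appears in only one class.

```python
import random

# Hypothetical class rosters: each class is one cluster of students.
classes = {
    "MATH 101": ["Ana", "Ben", "Cho"],
    "ENGL 110": ["Dev", "Eli"],
    "BIOL 120": ["Fay", "Gus", "Hal", "Ivy"],
    "HIST 105": ["Jon", "Kim"],
}

# The SRS is taken over the clusters (classes), not over the students.
chosen_classes = random.sample(list(classes), k=2)

# Every student in a chosen class joins the sample.
sample = [student for c in chosen_classes for student in classes[c]]
print(chosen_classes, sample)
```

Notice that the sample size is not fixed in advance here: it depends on how large the chosen classes happen to be.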
From this quick example it should be clear that the main advantage of cluster sampling is speed; a researcher using cluster sampling can build a larger random sample much faster than a traditional SRS. Further, as described in the above example, it can help in situations where the population size might be too large to enumerate since instead only the clusters must be randomly selected from.
The main disadvantage of cluster sampling is simple: it is not always possible! The population a researcher is studying might not be arranged into any neat or useful clusters. Further, one should be careful about using cluster sampling when the reason that the clusters exist might directly relate to the variable you wish to study. For example, if a teacher wanted to study student satisfaction in their class they would not want to cluster the students according to grade (A, B, C, D, F) and then randomly choose a cluster. This would give a very limited view of the situation.
At this point, we have discussed four of the main ways to construct statistically valid samples. These will be the four main techniques you will need to know for our class. It goes without saying that these are not the only techniques that exist. In fact sampling design is really a field in its own right and we are just scratching the surface.
We can, however, mention one of the first extensions of our previous four techniques. Most real-life statistical studies make use of a sampling technique called Multistage Sampling. This simply means that the sample is constructed by using several of the previous techniques in unison. Consider the following to see how this would be used in practice. Suppose that a researcher wants to study views on a political issue in a city. To do this the researcher might first divide the city up into several different main regions. This would be stratification. Then inside each region the researcher might randomly select whole street blocks to add to their sample. This would be clustering. Finally, for every house on the selected blocks the researcher might randomly select one adult member of the household to interview. This would be an SRS. By combining these strategies we have constructed a sampling technique that is more capable of dealing with real world situations.
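The three stages of this city example can be sketched together. The city layout, the names, and the function below are all hypothetical, and a real study would involve far more regions, blocks, and houses.

```python
import random

# Hypothetical city: region -> block -> house -> adults living there.
city = {
    "north": {
        "block A": {"house 1": ["Ann", "Bob"], "house 2": ["Cy"]},
        "block B": {"house 3": ["Dee", "Ed", "Flo"]},
    },
    "south": {
        "block C": {"house 4": ["Gil"], "house 5": ["Hana", "Ike"]},
        "block D": {"house 6": ["Jo", "Kai"]},
    },
}

def multistage_sample(city, blocks_per_region, rng=random):
    interviewees = []
    # Stage 1 (stratification): every region contributes to the sample.
    for blocks in city.values():
        # Stage 2 (clustering): randomly choose whole blocks in the region.
        for block in rng.sample(list(blocks), k=blocks_per_region):
            # Stage 3 (SRS): one randomly chosen adult per house on the block.
            for adults in blocks[block].values():
                interviewees.append(rng.choice(adults))
    return interviewees

print(multistage_sample(city, blocks_per_region=1))
```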
Finally, we should discuss one non-random sampling technique. In general, sampling done without randomness or without a mathematical system is called Convenience Sampling. As detailed at the beginning of this section, non-random sampling makes the sample statistically invalid for making predictions, inferences, or generalizations. In other words, non-randomly collected data should be treated very cautiously. At best it can give a researcher an initial view of the problem before conducting more thorough analysis; at worst it can mislead a researcher. A final note: some statisticians classify a special type of convenience sampling, where individuals volunteer to put themselves into the sample, as volunteer sampling. We will not make this distinction and instead return to the problem of having volunteers in a sample when we have our discussion on statistical biases.
For now, let’s consider some examples to make use of the terminology we have added to our vocabulary.
Example 2.1)
You go to a peanut butter factory and are tasked with analyzing the jars of peanut butter produced by this factory. You choose a random day to visit and choose every 25th jar of peanut butter produced that day. You end up getting a sample of 40 jars of peanut butter. For each jar you record the following information:
1) The weight of the full jar of peanut butter in ounces
2) Whether or not the jar had an expiration date stamped on the bottom
3) What type of peanut butter was inside the jar (Creamy, Crunchy, Flavored)
Let’s use our statistics terminology so far to describe this example as thoroughly as possible:
Population of Interest: One could argue for two different but fundamentally similar populations of interest here. You could limit yourself to saying that the population was all the jars of peanut butter produced the day you visited but considering that you chose a random day to visit the factory it is probably better to view the population as all jars of peanut butter produced by this factory.
Sample and sample size: The sample is the 40 jars of peanut butter you selected. The sample size is clearly 40.
Individual: The individual in this case is each individual jar of peanut butter. Again note that both population and sample are made up of individuals.
Sampling Technique: The main sampling technique used here is Systematic Sampling. The fact that you chose every 25th jar is a system. You should also note that despite the jars not being selected “randomly”, the use of a mathematical system counts as random in terms of statistical validity. (Note: If you wanted to go even further you could say that this example contains a small form of Multistage Sampling, since you first randomly chose a day, perhaps by SRS, and then chose the jars using a Systematic Sample.)
Variable 1 was a Quantitative Variable. Since the weights could be recorded with any level of accuracy, we would most likely classify this as Continuous.
Variable 2 was a Categorical Variable. Since there were only two options (Yes there was an expiration date or No there was not an expiration date) this Categorical Variable could be further classified as Boolean.
Variable 3 was a Categorical Variable. While there were more than two options for this Categorical Variable, it should be noted that this variable does possess a 1-to-1 correspondence in that each jar will only be placed into one category.
Example 2.2)
For each of the following situations let’s attempt to identify the population, the sample (and sample size), sampling technique used, and type of variable studied.
1.) A researcher wants to study the actual amount of soda in 12oz cans of Cheapo Cola. To do this the researcher visits all the grocery stores in his town and randomly selects 10 six-packs of Cheapo Cola. He then takes each of the 60 cans and measures how much soda is in each can.
2.) A researcher wants to study students’ views at a local community college on the current job market. To do this the researcher visits the campus and asks any friendly looking student whether they view the current local job market as Good, Neutral, Bad, or Don’t Know. The researcher ultimately talks to 50 students.
3.) A researcher wants to study the relationship between age and political affiliation in a small town with around 20,000 adults. To do this the researcher classifies all the adults in the town into categories of 21-34, 35-49, 50-64, and 65+. Then the researcher randomly chooses 30 adults from each category. Finally, the researcher asks each of the selected adults for their political affiliation (Democrat, Republican, or Other).
4.) A researcher wants to study student views on a professor. To do this they gather a list of all the students who have taken a class with that professor in the past year. Then they randomly choose 25 of these students and ask them all “Would you recommend this professor to a friend? (Yes/No)”
5.) A researcher wants to study how long people spend per visit at their local gym. To do this the researcher stands outside the gym and asks every 12th member to leave “Approximately how long, in minutes, do you work out on each visit to this gym?” The researcher manages to talk to 20 members.
The first study’s population is all cans of Cheapo Cola. The sample is the 60 cans that were selected (ten six-packs means 10 x 6 = 60 total cans). This sample was a cluster sample with each cluster being a randomly selected six-pack. The type of variable studied, the amount of soda in each can, was Quantitative.
The second study’s population is all students at the community college. The sample is the 50 students the researcher talked to. This sample was a convenience sample since there was no mention of randomness and “friendly-looking” certainly does not constitute a system. The type of variable studied, view on the job market, was Categorical.
The third study’s population is all adults in the small town. The sample is the 120 adults the researcher selected. (Note: We know the sample size is 120 adults since it was 30 adults per age group with 4 different age groups, 30 x 4 = 120.) This sample was a stratified sample with the strata being the age groups. We can tell that this was a stratified sample since the population was first split into large groups and then SRSs were taken from each group. The type of variable studied, political affiliation, was Categorical.
The fourth study’s population is all students who have taken a class with the professor in question in the past year. The sample is the 25 students whom the researcher selected. This sample was an SRS. This is because it was simply stated that the students were chosen randomly from the population. There were no additional steps. The type of variable studied, recommend (Yes/No), was Categorical (Boolean).
The fifth study’s population is all the members of that gym. The sample is the 20 members who talked to the researcher. This sample was a systematic sample where the system was choosing every 12th member to leave. The type of variable studied, time spent per workout, was Quantitative.
3. Categorical Variables
For the beginning part of our course, our focus will be on descriptive statistics. In other words, we want to be able to describe, summarize, and display the data we have. This is very important because in real life this is often the moment when statisticians have to take the large amounts of data they have collected and turn it into something that is manageable and useful. If they fail at this moment, wrong decisions can be made even in the face of good and accurate data.
The first type of statistical variable we will study is the categorical variable. Since categorical variables are variables where individuals are placed into categories the amount of things we can do with them is somewhat limited. As we will see Quantitative Variables have a bit more depth as we can bring the full force of mathematics to bear on them.
The first step with Categorical Variables is a table or display called a Frequency Table. A Frequency Table for a Categorical Variable is a list of the possible categories for that variable along with the number of individuals that fell into each category (i.e., the frequency). Note: in a standard Categorical Variable where each individual is limited to a single category, the sum of all the frequencies will be exactly equal to the sample size. If individuals are not limited to a single category then this will not be the case.
In large samples a Frequency Table can be a bit unwieldy, especially since humans have been documented to struggle with understanding size differences in larger numbers. In other words, most people can understand that a 5-4 margin is not a particularly overwhelming margin yet may struggle to see that it is proportionally the same as a 450-360 margin. Therefore, for Categorical Variables where individuals were allowed to be in only a single category we can construct a Relative Frequency Table. A Relative Frequency Table is constructed by dividing each frequency by the sum of all the frequencies (i.e., the sample size). (Note: It should be clear that this process can also be reversed: if you are given a Relative Frequency Table and the sample size, multiplying each relative frequency by the sample size will give you the Frequency Table.) To see this directly consider the following example.
Example 3.1)
Suppose that a small coffee shop wants to study which type of coffee drink is ordered most frequently. To do this the coffee shop randomly chooses five days during a month and records all the coffee drink orders from those days. They get the following data:
Type of Coffee Drink    Frequency
Coffee                  243
Americano               134
Latte                   291
Mocha                   197
Other                   185
First of all, notice that this sample was collected as a cluster sample (the clusters were the 5 days randomly chosen). Second, notice that the variable here was a standard Categorical Variable where each individual (a single drink order) was placed into a single category. That means that if we total the frequencies that will be exactly the sample size.
Doing this we take 243+134+291+197+185 = 1050. Thus, our sample size was 1050 drink orders.
To build our relative frequency table we will now divide each of the frequencies by 1050. We can write these values either as a decimal or as a percentage.
Relative Frequency Table for Example 3.1
Type of Coffee Drink Relative Frequency
Coffee .231 or 23.1%
Americano .128 or 12.8%
Latte .277 or 27.7%
Mocha .188 or 18.8%
Other .176 or 17.6%
You can note here that we rounded to three decimal places to balance between accuracy and ease of reading. Further you can note now that summing the relative frequencies will now lead to 1 or 100%. This will always be the case up to any rounding errors that might occur.
It should be clear that one of the main advantages of a relative frequency table is that it makes the data more easily understandable. If you said to someone that in your sample 291 out of 1050 of your drink sales were lattes, this might not resonate with them the same way as saying that 27.7% of your drink sales were lattes (i.e. a little over a quarter of all drink sales). In other words, a relative frequency table helps a reader with their number sense.
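The arithmetic in Example 3.1 can be checked with a few lines of Python. This is only a sketch: the names `frequencies`, `relative_frequencies` and `recovered` are made up for illustration and are not part of the course materials.

```python
# Frequencies from Example 3.1 (each drink order falls in exactly one category).
frequencies = {
    "Coffee": 243,
    "Americano": 134,
    "Latte": 291,
    "Mocha": 197,
    "Other": 185,
}


def relative_frequencies(freqs):
    """Divide each frequency by the sum of all frequencies (the sample size)."""
    total = sum(freqs.values())
    return {category: count / total for category, count in freqs.items()}


sample_size = sum(frequencies.values())
rel = relative_frequencies(frequencies)

print("Sample size:", sample_size)
for category, share in rel.items():
    # Three decimal places, matching the rounding used in the table.
    print(f"{category:<10} {share:.3f}  ({share:.1%})")

# Reversing the process: relative frequency times sample size
# gives the original frequency back.
recovered = {category: round(share * sample_size) for category, share in rel.items()}
```

Working from the unrounded relative frequencies is what makes the reversal exact. Starting from the rounded table values would only recover the frequencies approximately.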
While Frequency and Relative Frequency Tables are good ways of summarizing Categorical Data we also want to examine a few visual displays for Categorical Data. We will look at two main displays.
The first visual display for categorical data is a Bar Graph. A bar graph can be constructed from either the frequencies or the relative frequencies. A bar graph is constructed by placing the different categories of the categorical variable along the horizontal axis. Then the vertical axis displays the frequencies/relative frequencies. The vertical axis must start at 0 and must be evenly and consistently spaced. This means the amount of increase between each pair of neighboring vertical axis marks must be the same. It is also considered good practice to make each bar in a bar graph the same width so as not to give extra visual weight/importance to a particular category. Finally, always make sure that in a bar graph the bars do NOT touch.
The second visual display for categorical data is a Pie Chart. A pie chart can ONLY be constructed for a categorical variable where each individual is placed into a single category. A pie chart is based on the relative frequencies and is created by cutting a circle into pieces based on the percentage of respondents in each category.
These two displays differ in two main ways. First, Bar Graphs are usually used when it is important for the viewer to be able to compare the relative size of categories against one another. Since all the information in a bar graph is held in the vertical dimension it is very simple to see, for example, that if one bar is half the height of another then that category had half the frequency! A Pie Chart on the other hand is used when it is important for the viewer to be able to compare the relative size of a category against the whole. In other words, a pie chart makes it easy for a viewer to see how much of the overall sample a single category made up.
Second, Bar Graphs can be based on either frequencies or relative frequencies while a Pie Chart is based only on the relative frequencies. This means that for a categorical variable where individuals can place themselves into multiple categories, only the bar graph using actual frequencies is an appropriate display. Now let’s look at some examples.
Example 3.2)
Suppose that a researcher wants to study favorite movie genres amongst teenagers in the United States. To do this the researcher divides up the US into four main regions, East Coast, West Coast, Midwest, and South. Then the researcher randomly selects 100 teenagers from each region and asks them “What is your favorite genre of movie to watch?” The researcher then constructs the following relative frequency table:
Movie Genre Relative Frequency
Action 25%
Comedy 28%
Drama 14%
Horror 12%
Romance 8%
Other 13%
First, before jumping too far into this example let’s quickly note that the population here is all US teenagers, the sample was the 400 randomly selected teenagers (100 from each of the 4 regions), and the sampling technique used was stratified sampling (the strata were the four major regions).
Second, let’s reconstruct the actual frequency distribution for this data. This means that we will take each percentage, convert it back to a decimal and then multiply with the sample size
of 400. (Note: Remember, to go from frequency to relative frequency you divide, from relative frequency to actual frequency you multiply).
This table then looks like:
Movie Genre Frequency
Action .25 x 400 = 100
Comedy .28 x 400 = 112
Drama .14 x 400 = 56
Horror .12 x 400 = 48
Romance .08 x 400 = 32
Other .13 x 400 = 52
Immediately, you should note here that the frequencies sum up to 400 as expected! Now let’s create some visualizations of this data. First, we’ll construct a bar graph using the relative frequencies.
You should notice a couple of things here. The bars are each the same width and do not touch. The vertical axis has a consistent scale and starts at 0. The choice of the scale is
completely up to the creator and should be chosen so that the graph is most visually appealing and easy to understand.
We can also see that we get an equivalent bar graph by choosing a different scale and looking at the actual frequencies instead of the relative frequencies.
We can see from either of these bar graphs why presenters choose to create these types of displays. In just a quick glance this diagram can show the popularity of certain categories against one another. It should be clear at a glance that Action is close to twice as popular as Drama for example. Or that Drama and Horror differ only a relatively small amount in popularity. These are the advantages of creating a bar graph.
Finally, we can take the relative frequencies from this data and construct a pie chart. This diagram would be as follows.
The advantages of this type of diagram are again that it easily shows how the categories relate to the whole. Imagine trying to say that 112 out of 400 teenagers interviewed chose comedy. That can be simplified by saying 28% of teenagers interviewed chose comedy. And even better in this diagram a viewer can get an actual sense of how big 28% is, taking up just a little over a quarter of our diagram.
Obviously, drawing a pie chart by hand without the aid of a computer can be a bit difficult. In general, it is best to try to find combinations of the percentages that add up to or close to easy to draw portions like 25% or 50%. Remember, if being asked to draw something freehand, the diagram only needs to be close not perfect!
Example 3.3)
Suppose that a researcher at a university with around 30,000 students is tasked with choosing a new fast food restaurant to be added to campus. To find out what might be most popular with the student population the researcher obtains a full list of the registered students at the university from the administration and randomly chooses 300 students to email a survey to that asks “Which of the following fast food restaurants would you like to see on campus? (Choose at least one) McDonald’s, Burger King, Wendy’s, Jack in the Box, Carl’s Jr, or Subway” Of the 300 students who are sent the survey only 45 respond. The researcher then constructs the following frequency table:
Fast Food Restaurant Frequency
McDonald’s 28
Burger King 7
Wendy’s 23
Jack in the Box 20
Carl’s Jr 12
Subway 11
Again, before diving too deep into this example, let’s list all the major features of this sample. The population is the approximately 30,000 students at this university, the sample is the 45 students who responded (not the 300 who were emailed), and the sampling technique used was an SRS.
The first thing that should jump out in this data is that summing up the frequencies gives 28+7+23+20+12+11 = 101. This 101 does not match the sample size of 45. So what is going on here?
The answer lies in remembering that this is a categorical variable that does not have a 1-to-1 correspondence between responses and individuals. There was no limitation to how many categories a single individual could place themselves into so there is no reason to expect that the total frequency would match the sample size. In fact, if we were just presented with the
frequencies and did not know the sample size from the earlier text there would be no way to recover the sample size!
As discussed earlier, we will not be spending much time working with this type of categorical variable as it is harder to draw conclusions from. One of the main reasons for this is that not everyone’s views will be represented equally. Consider that one individual might have only voted for Subway indicating that they really want Subway while someone else may have voted for everything! One person then only got to cast one vote while the other actually got to vote six times but our data doesn’t really catch that the first individual really wants Subway. There are ways around this issue but we won’t be going that far in depth on this issue.
As for displaying this data visually, remember there is only one choice for categorical variables of this type, that is a bar graph using the actual frequencies. It would look as follows:
4. Quantitative Variables
Now that we’ve covered categorical variables it is time to turn our attention to quantitative variables. Once again, however, we will be keeping our attention focused on descriptive statistics, ie trying to summarize and visualize quantitative data.
We shall see that we can do a bit more with quantitative data than with categorical since our data itself this time will be numbers. That means we can bring more of the force of
mathematics to bear on this problem and get some very interesting results.
Before jumping in to learning some calculations we need to briefly talk about the three main goals when studying quantitative variables. As we will see there are three qualities when working with a numeric sample that we wish to describe. These are the center,
spread/variability and shape/distribution. The center, as should be expected, tells us where the middle of our data lies. The spread or variability tells us how spread out our data is from that middle. The shape/distribution tells us how the data is spread out, is it on both sides of the center,
is there more data on one side than the other, are very high values uncommon but very low values frequent etc.
It should make some intuitive sense why these are things you would want to know when presented with a collection of numeric data points. Say these data points for example represented salaries for a field you were considering studying. Obviously, you would want to know the center because that might represent the typical or mid level salary you would expect to earn in this field. But the spread/variability would be important too! If the field came with high variability then you would know there is a risk you might make a much lower value than the middle but there is also a chance that you could make much more. If the field had low variability then you know you would stand a good chance of making a figure near that middle amount. Finally, the shape/distribution would also factor into your decision. If the data has a distribution that has some very high values but not many very low values then you might take this as
meaning that the field you are looking at has most people making a slightly lower than typical salary with a few lucky ones making far more. All of this would be useful to know and really only by having all that put together can we get a clear picture of our data.
We shall start by looking at two different ways to measure the center of quantitative data. The first is called the mean/average. This is calculated by summing all the values in a sample and then dividing by the number of data points. In terms of notation, if the average we are calculating pertains to a population we use the character mu, μ, while if the average pertains to a sample we use the character x̄, read x-bar.
The second is called the median. This is calculated by arranging all the values in a sample in ascending order and then finding the number in the middle. If the data set has an even number of data points, the median is calculated by averaging the two values in the center. In terms of notation, the median is always represented with the uppercase character M.
Both the mean and the median will have the same units as the original sample. In other words, if your data points are amounts in dollars then the mean and median will also be amounts in dollars.
Example 4.1)
For each of the following data sets calculate the mean and median:
Data Set 1: 3, 4, 0, 8, 10
Data Set 2: 3, 4, 0, 8, 10, 100
For Data Set 1 we calculate the average as:
Average = (3+4+0+8+10)/5 = 25/5 = 5
The median is found by first rearranging the values in ascending order: 0, 3, 4, 8, 10
Clearly, the middle is the value 4. So the median is 4.
For Data Set 2 we calculate the average as:
Average = (3+4+0+8+10+100)/6 = 125/6 ≈ 20.83
The median is found by first rearranging the values in ascending order: 0, 3, 4, 8, 10, 100
This time the middle lands between the values 4 and 8. Averaging these together, (4+8)/2 = 6
So the median is 6.
Besides just showing these two calculations this example reveals another very important idea. Notice that Data Set 2 is simply Data Set 1 with one additional data point, namely 100. Notice how the presence of this extra data point, which is much larger than the others, affects our two measures. The average jumps from 5 to 20.83 while the median only increases from 4 to 6. This gives us an example of what is called resistance. A statistical measure is said to be resistant if the presence of extreme values does not affect the measure very much. Our previous example tells us that the average is not resistant while the median is resistant!
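The comparison in Example 4.1 can be reproduced with Python's standard `statistics` module. This is a small sketch; the variable names (`data_set_1`, `mean_shift`, and so on) are illustrative only.

```python
from statistics import mean, median

data_set_1 = [3, 4, 0, 8, 10]
data_set_2 = data_set_1 + [100]  # the same data plus one extreme value

for name, data in [("Data Set 1", data_set_1), ("Data Set 2", data_set_2)]:
    # median() sorts the values itself and averages the two middle
    # values when the number of data points is even.
    print(f"{name}: mean = {mean(data):.2f}, median = {median(data)}")

# How far each measure moved once the value 100 was added.
mean_shift = mean(data_set_2) - mean(data_set_1)
median_shift = median(data_set_2) - median(data_set_1)
print(f"Mean moved by {mean_shift:.2f}, median moved by {median_shift}")
```

The mean moves by roughly 15.83 while the median moves by only 2, which is the resistance of the median in numbers.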
It should be noted here that not being resistant does not make a statistical measure bad. Instead, it simply tells us that we should be careful using that measure in the presence of extreme values. A real-life example of this is looking at salaries in a particular field. Often both the mean and median salaries are reported, but it is better to look at the median salary, as the average salary may be significantly higher due to a small number of people in that field making lots of money. A quick internet search can show this by finding that the average salary for a lawyer in the US is around 140,000 dollars per year while the median is around 118,000 dollars per year!
Now that we have some ways of measuring the center of a data set let us look at some ways of measuring the spread/variability of data.
The first is called the range. The range is found by simply subtracting the minimum value from the maximum value. In other words, the range measures the distance from the smallest value in the data set to the largest value in the data set. Because of its simplicity the range is not particularly useful but it does at least give us, if you will pardon the pun, the size of the range for the data set. It should also be clear that by being based on the maximum and minimum the range is not resistant.
The second is called the standard deviation. The standard deviation measures how much the data varies from the mean. Data with large standard deviation is spread far from the mean while data with a small standard deviation is clumped closely to the mean. Standard deviation is the most common measure of variability that we will use for our class.
In terms of notation, standard deviation calculated from a sample is written with the character s, while standard deviation from a population is given the symbol σ. This time the distinction between calculating for a sample vs a population is important and actually influences the formula.
We calculate sample standard deviation as:
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$$
And we calculate population standard deviation as:
$$\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}}$$
Note: In both these formulas n represents the number of data points and x_i represents each individual data point.
You should notice a couple of things about these formulas.
First, the only difference in these two formulas is the denominator switches from n-1 to n and the notation for average is switched from the sample notation to the population notation. The change in the denominator results from a concept called degrees of freedom but we will not go far in depth on this point.
Second, this formula is far more complicated than the others we have seen so far. In fact, it is complicated enough that we will always calculate standard deviation using a calculator.
Finally, remember that a high standard deviation indicates lots of variability or spread and a low standard deviation indicates very little variability or spread.
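Although in this class standard deviation is always found with a calculator, it can help to see the two formulas spelled out once. The sketch below applies them to Data Set 1 from Example 4.1. The function names `sample_sd` and `population_sd` are made up for illustration; `stdev` and `pstdev` are the standard library's versions of the same two calculations.

```python
from math import sqrt
from statistics import pstdev, stdev


def sample_sd(values):
    """Sample standard deviation: divide the squared deviations by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sqrt(sum((x - x_bar) ** 2 for x in values) / (n - 1))


def population_sd(values):
    """Population standard deviation: divide the squared deviations by n."""
    n = len(values)
    mu = sum(values) / n
    return sqrt(sum((x - mu) ** 2 for x in values) / n)


data = [3, 4, 0, 8, 10]  # Data Set 1 from Example 4.1, mean 5

s = sample_sd(data)
sigma = population_sd(data)
print(f"s = {s:.2f}, sigma = {sigma:.2f}")

# The standard library gives the same answers.
print(f"stdev = {stdev(data):.2f}, pstdev = {pstdev(data):.2f}")
```

For this data the squared deviations from the mean are 4, 1, 25, 9 and 25, which sum to 64. Dividing by n-1 = 4 gives s = 4, while dividing by n = 5 gives σ ≈ 3.58.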
The standard deviation is a measure based on the average so it is also not resistant.
The last measure of spread we want to discuss takes a little bit of setup. We first need to define the quartiles of a data set. The quartiles are three values that cut the data set into four equal pieces. In other words, 25% of the data is less than the first quartile, 25% of the data is between the first and second quartile, 25% of the data is between the second and third quartile, and 25% of the data is greater than the third quartile.
Let’s focus our attention on the second quartile for a moment. Notice, that according to our above definition 50% of the data is less than the second quartile and 50% of the data is greater than the second quartile. This means that the second quartile is actually the median!
How do you find the first and third quartiles then? The first quartile is simply the median of all the values less than the median and the third quartile is the median of all the values greater than the median. (Note: In terms of notation, the first quartile is denoted Q1 and the third quartile is denoted Q3.)
Finally, this brings us to our third and final measure of spread/variability, the
interquartile range or IQR. The IQR is calculated simply as Q3−Q1. The IQR measures how large a range the middle 50% of the data is spread over. Once again, a large IQR represents very spread out data while a small IQR represents tightly packed together data.
As the IQR is based on median calculations, the IQR is a resistant measure of spread.
Example 4.2)
Consider the data set: 0, 3, 4, 8, 8, 10, 11, 15, 17, 23, 25, 25, 30, 34, 40, 45, 47, 60
You should confirm the following:
Mean: 22.5
Sample SD: 17.23
Population SD: 16.74 (note that the population SD is always less than the sample SD)
Range: 60
Median: 20, Q1: 8, Q3: 34, IQR: 26
Before moving on let’s take a look quickly at our quartiles. Consider visually how you found these:
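0, 3, 4, 8, (8), 10, 11, 15, 17 | 23, 25, 25, 30, (34), 40, 45, 47, 60
Here Q1 = 8 and Q3 = 34 are marked in parentheses, and the bar marks the median M = 20, which falls between 17 and 23.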
Notice that the four pieces created by the quartiles are indeed four equal pieces, each has exactly the same amount of data. This is exactly what we should have expected from the definition of the quartiles.
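The quartile rule described above can be sketched in Python as follows. The function name `quartiles` is made up for illustration, and the sketch assumes the two halves are split by position in the sorted list, with the middle value left out of both halves when the number of data points is odd. Calculators and other software may use slightly different quartile conventions, so small disagreements with other tools are possible.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, M, Q3) using the median-of-halves rule."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]      # values below the median position
    upper = data[n - half:]  # values above the median position
    # With an odd number of data points the middle value itself is
    # left out of both halves.
    return median(lower), median(data), median(upper)


data = [0, 3, 4, 8, 8, 10, 11, 15, 17, 23, 25, 25, 30, 34, 40, 45, 47, 60]

q1, m, q3 = quartiles(data)
iqr = q3 - q1
data_range = max(data) - min(data)
print(f"Q1 = {q1}, M = {m}, Q3 = {q3}, IQR = {iqr}, Range = {data_range}")
```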
One of the most common uses for the IQR is the outlier check. An outlier is a data point that is extreme when compared to the rest of the data. In other words, an outlier is a data point that does not fit the general pattern of the rest of the sample. Outliers can be detected by the IQR.
The IQR outlier check is done as follows. First, you must calculate the IQR. Second, you construct what are called the lower and upper bounds.
The lower bound is calculated as Q1 - 1.5xIQR
The upper bound is calculated as Q3 + 1.5xIQR
Notice what we are doing here is extending a bit below the first quartile and a bit above the third quartile.
Finally, any data points that are outside this range are considered outliers. Let’s look at an example of this process.
Example 4.3)
Suppose you have the following data set: 0, 8, 12, 14, 15, 16, 16, 17, 19, 19, 23, 32, 41
It should not be hard to see that the quartiles are:
Q1: 13 M: 16 Q3: 21
Thus, the IQR is calculated as: IQR = Q3 - Q1 = 21-13 = 8
Then to check for any outliers we construct the bounds as:
Lower Bound = Q1 - 1.5xIQR = 13 - 1.5x8 = 13 - 12 = 1
Upper Bound = Q3 + 1.5xIQR = 21 + 1.5x8 = 21 + 12 = 33
Finally, we determine if any data points fall outside these bounds. There are two data points in fact that should be viewed as outliers. The value 0 is smaller than the lower bound and the value 41 is larger than the upper bound. So in this data set there are two outliers.
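The outlier check from Example 4.3 can be sketched in Python. The helper name `iqr_outliers` is made up for illustration, and the quartiles are found with the median-of-halves rule used in this class (splitting the sorted data by position), which is an assumption that other software may not share.

```python
from statistics import median


def iqr_outliers(values):
    """Return (lower_bound, upper_bound, outliers) for the 1.5 x IQR check."""
    data = sorted(values)
    half = len(data) // 2
    q1 = median(data[:half])              # median of the lower half
    q3 = median(data[len(data) - half:])  # median of the upper half
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    # Only values strictly outside the bounds count as outliers.
    outliers = [x for x in data if x < lower_bound or x > upper_bound]
    return lower_bound, upper_bound, outliers


data = [0, 8, 12, 14, 15, 16, 16, 17, 19, 19, 23, 32, 41]
lower_bound, upper_bound, outliers = iqr_outliers(data)
print(f"Bounds: [{lower_bound}, {upper_bound}]  Outliers: {outliers}")
```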
Now, we should keep in mind that outliers do not indicate that these data points are errors! While some outliers may occur because of data entry error or similar mistakes, outliers at their core just represent data points very different from the rest of the sample. Often outliers represent members of a different population being accidentally grouped into our population of interest. Sometimes outliers are just outliers, values more extreme than the rest. Regardless of origin, we want to be aware of them, since the presence of outliers may affect non-resistant measures. As we will see later in our course, to run certain statistical procedures we need to make sure there are no outliers.
The last quality we want to measure for quantitative data is shape or distribution. It turns out that measuring shape or distribution is most easily done by visually displaying the data. We will look at three different visual displays for quantitative data.
The first display is a histogram. A histogram is sort of the quantitative equivalent to a bar graph. To construct a histogram we create equally sized intervals called classes. Then we count how many data points fall into each of these classes. This then gives the frequency of each class. When drawing a histogram we still put the frequencies along the vertical axis (and still must start the vertical axis at 0 and use a consistent axis) and we put the classes along the horizontal axis. The horizontal axis should display the classes which since they are all the same size will automatically make this axis consistent. Note, since the classes are intervals that adjoin to one another in a histogram the bars touch!
Let’s look at an example of constructing a histogram.
Example 4.4)
Suppose you have the following data representing 17 students’ grades on an exam. The scores are: 51, 53, 53, 55, 56, 58, 58, 60, 62, 65, 66, 68, 76, 79, 85, 88, 100.
Note: You should be able to calculate the following for this data, Mean: 66.65, Median: 62.
We will build a histogram for this data with class size 10 starting at 50.
This would have the following classes:
Classes Frequency
[50,60) 7
[60, 70) 5
[70,80) 2
[80,90) 2
[90,100) 0
[100,110) 1
A couple of notes about this. Why does the interval [50,60), say, have frequency 7? This is because 7 of our data points fall into this interval, namely 51, 53, 53, 55, 56, 58, 58. Why do we use both a hard bracket and a soft bracket? Remember from algebra that a hard bracket indicates that the number is included while a soft bracket indicates the number is excluded. This keeps us from double counting numbers on the border. So for example, the 60 in our data is included in [60,70) not [50,60). For our class we will always use the convention of including the left end point and excluding the right end point. Note, this convention forces us to have an interval of [100,110) to make sure we have a class for the data point 100.
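The class counts for Example 4.4 can be reproduced with a short Python sketch. The function name `class_frequencies` is made up for illustration, and the row of asterisks is only a rough text stand-in for the bars of the histogram.

```python
scores = [51, 53, 53, 55, 56, 58, 58, 60, 62, 65, 66, 68, 76, 79, 85, 88, 100]


def class_frequencies(values, start, width):
    """Count how many values fall into each class [left, left + width).

    Assumes start is no larger than the smallest value.
    """
    n_classes = int((max(values) - start) // width) + 1
    counts = [0] * n_classes
    for v in values:
        # Floor division sends a border value such as 60 to the class
        # that begins at 60, so the left end point is included.
        counts[int((v - start) // width)] += 1
    return [(start + i * width, start + (i + 1) * width, c)
            for i, c in enumerate(counts)]


table = class_frequencies(scores, start=50, width=10)
for left, right, count in table:
    print(f"[{left},{right})  {count}  {'*' * count}")
```

Changing `width` or `start` shows how the choice of classes changes the picture, which is the flexibility a histogram offers.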
So what does this histogram show us? What we are really looking for is what is called the shape or distribution of the data. As we will see in a moment this particular histogram has the shape we call right skewed.
For our class we will focus on a few typical shapes. The following list is not meant to be
exhaustive of all possible distributions in statistics but these are the ones that will appear in your course.
Right skewed distributions are distributions that have a tail that extends out to the right. This means that the data has some very large values that are infrequent or uncommon.
Left skewed distributions are distributions that have a tail that extends out to the left. This means that the data has some very small values that are infrequent or uncommon.
Bell/normal distributions are distributions where two equal tails extend out both to the right and left. This means that middle values in the data are the most frequent while very large and very small values are much less common. Bell/normal distributions are a type of symmetric distribution.
Uniform distributions are distributions where all values of the data are close to equally likely. Visually, this means there are no high or low frequencies in the histogram. Uniform distributions are also a type of symmetric distribution.
It should also be noted here there is a very important relationship between shape, resistance, median and mean. Since the mean is not resistant it can be pulled or influenced by the extreme values that are present when a distribution is skewed. Therefore, the following is generally true:
If the distribution is right skewed then Mean > Median
If the distribution is left skewed then Mean < Median
If the distribution is symmetric then Mean ≈ Median
The next visual display we want to discuss is called a stem and leaf plot often called just a
stemplot. A stemplot is constructed by taking the values present in your sample and splitting them into stems and leaves. The rule for this separation is that each leaf must only be a single digit while stems must contain all the remaining digits. There is also what is called a split stem and leaf plot which is the same except each stem is repeated twice. The first stem collects only the leaves 0-4 while the second stem gets the leaves 5-9. This type of split stem plot can be useful when the data is clumped closely together. Consider the following example.
Example 4.5)
Consider the following data set of 16 data points:
130, 160, 160, 200, 210, 230, 230, 240, 250, 280, 280, 290, 300, 310, 320, 370
We will first construct a standard stemplot. Remember, each leaf must be only a single digit. However, notice that all these values end in 0 so we can effectively ignore this as useful
information. So instead we will just focus on the first two digits of each number. We would get the following:
1|3 6 6
2|0 1 3 3 4 5 8 8 9
3|0 1 2 7
Key: 1|3 = 130
Note, the Key here tells a reader how to interpret these values. This is very important because the entry 1|3 could mean 13, 130, .13 or even 13,000!
Does this diagram tell us much about shape? We might guess that the shape is close to bell/normal but the picture certainly is not that nice to look at since there are only three stems. Instead, when there are fewer than 4 stems we can try a split stem plot.
1|3
1|6 6
2|0 1 3 3 4
2|5 8 8 9
3|0 1 2
3|7
Key 1|3 = 130
Notice that this diagram makes it much clearer that we are looking at a bell/normal distribution. There is a clear peak in the middle with approximately equal tails out to both sides!
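Both stemplots in Example 4.5 can be generated with a short Python sketch. The function name `stemplot` and its parameters `leaf_unit` and `split` are made up for illustration. The sketch drops any digits below the leaf unit (here the trailing 0) and does not print stems that have no leaves, which is a simplification.

```python
from collections import defaultdict

data = [130, 160, 160, 200, 210, 230, 230, 240, 250, 280, 280, 290,
        300, 310, 320, 370]


def stemplot(values, leaf_unit=10, split=False):
    """Return the rows of a stemplot as strings such as '1|3 6 6'."""
    rows = defaultdict(list)
    for v in sorted(values):
        # Drop the digits below the leaf unit, then separate the last
        # remaining digit (the leaf) from the rest (the stem).
        stem, leaf = divmod(v // leaf_unit, 10)
        # In a split plot leaves 0-4 go on the first copy of the stem
        # and leaves 5-9 go on the second copy.
        part = 1 if (split and leaf >= 5) else 0
        rows[(stem, part)].append(str(leaf))
    return [f"{stem}|{' '.join(leaves)}"
            for (stem, _), leaves in sorted(rows.items())]


standard = stemplot(data)
split_plot = stemplot(data, split=True)

print("\n".join(standard))
print()
print("\n".join(split_plot))
print("Key: 1|3 = 130")
```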
This might raise the question of what the differences are between a histogram and a stemplot. The main difference is that a histogram is better suited to large data sets but has the weakness that once you place data points into classes/intervals there is no way to know from the diagram what the original data points actually were. When we say that class [50,60) has frequency 7 all we know is that there are 7 data points in our sample between 50 and 60. We no longer know what those actual data points are! A stemplot, on the other hand, preserves all the original data but can be very messy when there are lots of data points. Further, stemplots do not have the flexibility of being able to choose your class size like in a histogram. Both displays, however, are valid for determining the shape of a distribution.
Before moving to our last display let’s take a quick look at how to best estimate the data that might be missing from a histogram.
Example 4.6)
Suppose you are studying annual rainfall in a city. You find that this has been recorded officially for this city for the past 42 years. You collect all of this data (treating it as a sample of all the possible years) and construct the following frequency table: (All values in inches of rain per year)
Class Frequency
[34,37) 2
[37,40) 4
[40,43) 9
[43,46) 12
[46,49) 8
[49,52) 5
[52,55) 1
[55,58) 1
Let us attempt to do several things with this data. First, let’s visualize it as a histogram to determine its shape and then attempt to estimate the mean and standard deviation of this data. Finally, we’ll use the data to determine in what percent of years this city received less than 40 inches of rain. You should note that the mean and standard deviation here must be ESTIMATES because we do not have access to the original data. What we mean by that is when we say [37,40) has frequency 4 we don’t know if those 4 data points are 37, 37, 38, 39 or even 37.3, 38, 38, 39.8!
But let’s start by converting this into a histogram. It would look as follows:
We should be able to clearly classify this distribution as bell/normal.
Now how to estimate the mean and standard deviation? What we need to do is make a fair guess for the data in each class. The best way to do that is via the midpoints. We rewrite our table as:
Midpoint Frequency
35.5 2
38.5 4
41.5 9
44.5 12
47.5 8
50.5 5
53.5 1
56.5 1
Then using the frequency list function of our calculator we can tell the calculator that we want our data to have two 35.5’s, four 38.5’s, nine 41.5’s etc.
Doing this we get the following:
Average ≈ 44.64 inches of rain per year
Sample SD ≈ 4.59 inches of rain per year
To answer our final question we could simply add the frequencies of the first two classes together and divide by the sample size. This would yield (2+4)/42 ≈ 14.3%
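The midpoint estimates in Example 4.6 can be reproduced without a calculator's frequency list function. The sketch below is illustrative only and its variable names are made up. It uses eight classes whose frequencies total the 42 recorded years, including the class [49,52) with frequency 5 (midpoint 50.5) and the class [55,58) with frequency 1 (midpoint 56.5).

```python
from math import sqrt

# Classes [left, right) and their frequencies from Example 4.6.
classes = [(34, 37, 2), (37, 40, 4), (40, 43, 9), (43, 46, 12),
           (46, 49, 8), (49, 52, 5), (52, 55, 1), (55, 58, 1)]

n = sum(f for _, _, f in classes)
midpoints = [((left + right) / 2, f) for left, right, f in classes]

# Treat every value in a class as if it sat at the class midpoint.
est_mean = sum(m * f for m, f in midpoints) / n
est_sd = sqrt(sum(f * (m - est_mean) ** 2 for m, f in midpoints) / (n - 1))

# Share of years with less than 40 inches: the classes ending at or below 40.
below_40 = sum(f for _, right, f in classes if right <= 40) / n

print(f"n = {n}")
print(f"Estimated mean = {est_mean:.2f} inches of rain per year")
print(f"Estimated sample SD = {est_sd:.2f} inches of rain per year")
print(f"Years with less than 40 inches = {below_40:.1%}")
```

The percentage of years below 40 inches is not an estimate in the same sense, since 40 happens to be a class boundary. A cutoff such as 41 inches, which falls inside a class, could only be approximated.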
The final visual display for quantitative data is called a boxplot, or more informally a box and whiskers plot. This display is based on a particular summary of quantitative data called the 5-number summary. The 5-number summary is simply the minimum value in the data set, quartile 1, the median, quartile 3, and the maximum value in the data set. A boxplot then displays this summary by drawing a box that starts at Q1 and ends at Q3. The median is then represented as a line inside the box. Lines (or whiskers as cat-people call them) are then extended from both sides of the box out to the minimum and maximum values. This display can also be used to determine the shape of data by comparing the two sides of the box, from Q1 to the median and from the median to Q3. If the right side of the box is larger the shape is right skewed, if the left side is larger the shape is left skewed, and if both sides are approximately equal in size the shape is symmetric. Two things should be noted here. A boxplot cannot differentiate between bell/normal and uniform, and when determining shape the “whiskers” should be ignored. Below is an example of constructing a boxplot.
Example 4.7)
Suppose you have a data set consisting of 10 students’ SAT scores. You have the following: 450, 500, 520, 550, 550, 630, 650, 670, 730, 800
Let us find the five number summary for this data set, check for any outliers and then determine the shape of this data based on a box plot.
The five number summary is:
Min: 450 Q1: 520 M: 590 Q3: 670 Max: 800 (all values in points)
This tells us the IQR is 670 - 520 = 150. Performing the outlier check we find that,
Lower Bound = 520 - 1.5x150 = 520 - 225 = 295
Upper Bound = 670 + 1.5x150 = 670 + 225 = 895
None of our data points are outside these bounds so we have no outliers. Finally our boxplot would look as follows:
The shape of this boxplot would be considered symmetric. We should see this visually but if we are ever unsure we can simply measure the distances between Q1 and M (590-520=70) and M and Q3 (670-590=80). Notice these values are very close to each other so the two sides of the box are about the same size, thus indicating a symmetric distribution. (We don’t know if it’s normal or uniform!)
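The five number summary and the box-side comparison from Example 4.7 can be sketched in Python. The function name `five_number_summary` is made up for illustration, and the quartiles again use the median-of-halves rule from this class, splitting the sorted data by position.

```python
from statistics import median

scores = [450, 500, 520, 550, 550, 630, 650, 670, 730, 800]


def five_number_summary(values):
    """Return (min, Q1, M, Q3, max) using median-of-halves quartiles."""
    data = sorted(values)
    half = len(data) // 2
    return (data[0], median(data[:half]), median(data),
            median(data[len(data) - half:]), data[-1])


minimum, q1, m, q3, maximum = five_number_summary(scores)
left_side = m - q1   # width of the box to the left of the median
right_side = q3 - m  # width of the box to the right of the median

print(f"Min {minimum}, Q1 {q1}, M {m}, Q3 {q3}, Max {maximum}")
print(f"Left side of box = {left_side}, right side of box = {right_side}")
```

How close the two sides must be to call the shape symmetric is a judgment call. The sketch only reports the two widths and leaves that decision to the reader.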
For our last example of working with quantitative data we will look at an example of how different measures might lead to different conclusions!
Example 4.8)
Suppose that a company is looking to hire a new computer programmer. They have narrowed their choice down to four different candidates. They take each of these candidates and give them the same ten coding assignments. Then they score their performance on these assignments on a scale of 0-10. They get the following data:
Candidate A1 A2 A3 A4 A5 A6 A7 A8 A9 A10
Amy 9 1 4 9 2 4 7 5 7 8
Brian 8 0 2 8 1 3 7 7 9 7
Carl 10 0 2 10 0 3 5 5 6 6
Dani 5 2 5 3 3 5 8 8 4 1
Let’s try to choose the best candidate according to this data in different ways. First, let’s try the average score of their performance as a ranking system.
Doing this we find that Amy had an average score of 5.6, Brian had an average score of 5.2, Carl had an average score of 4.7 and Dani had an average score of 4.4. Therefore, we might choose Amy from this measurement.
Second, let’s try the median score as a ranking system instead. Doing this we find that Amy has a median score of 6, Brian has a median score of 7, Carl has a median score of 5, and Dani has a median score of 4.5. Therefore, with this measure we might choose Brian instead of Amy.
But there are other ways to look at this data. Carl could argue that he is the best candidate as he is the only candidate to score a 10 on one of the assignments! Amy and Brian’s top scores were only 9’s and Dani only managed an 8.
And Dani could make a strong argument by looking at each assignment individually. Notice that Dani did the best on 6 of the assignments (A2, A3, A5, A6, A7 and A8). Amy was the best on one assignment (A10), Brian was the best on one assignment (A9) and Carl was the best on two assignments (A1 and A4).
Thus, all of these candidates could make an argument based on this data. In other words, different measures and different views with the same data yield radically different conclusions!
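The four rankings in Example 4.8 can be checked with a short Python sketch. The variable names are made up for illustration, and the decision not to credit anyone when an assignment is tied for the top score is an assumption of this sketch (no such ties occur in this data).

```python
from statistics import mean, median

scores = {
    "Amy":   [9, 1, 4, 9, 2, 4, 7, 5, 7, 8],
    "Brian": [8, 0, 2, 8, 1, 3, 7, 7, 9, 7],
    "Carl":  [10, 0, 2, 10, 0, 3, 5, 5, 6, 6],
    "Dani":  [5, 2, 5, 3, 3, 5, 8, 8, 4, 1],
}

best_by_mean = max(scores, key=lambda name: mean(scores[name]))
best_by_median = max(scores, key=lambda name: median(scores[name]))
best_by_top_score = max(scores, key=lambda name: max(scores[name]))

# Count how many assignments each candidate won outright.
wins = {name: 0 for name in scores}
for column in zip(*scores.values()):
    top = max(column)
    winners = [name for name, s in zip(scores, column) if s == top]
    if len(winners) == 1:  # ties are not credited to anyone
        wins[winners[0]] += 1
best_by_wins = max(wins, key=wins.get)

print("Best mean:       ", best_by_mean)
print("Best median:     ", best_by_median)
print("Highest score:   ", best_by_top_score)
print("Most assignments:", best_by_wins, wins)
```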