Chapter I. Introduction to Statistics; 2020.doc


Chapter I. Introduction to Statistics 1

Chapter I

INTRODUCTION TO STATISTICS

Chapter Objective

Prepare students to obtain and manage data and transform it into information, and to describe, synthesize, analyze and interpret that information in order to deduce the characteristics of a target population through the use of different statistical tools.


1. Introduction to the chapter

This chapter prepares students to obtain data and transform it into information, and to describe, synthesize, analyze, and interpret that information using tables, graphs and summary statistics. The course covers the fundamental tools and features of descriptive statistics and demonstrates real-world applications, particularly those related to business and education. It covers the following topics: the language of statistics, which helps you know what a problem is asking for, what results are needed, and how to describe and evaluate the results in a statistically correct manner; data gathering, organization and presentation; and summarizing data using measures of central tendency, measures of variability, measures of position and measures of distribution.

Datasets are analyzed using Microsoft Excel and the Statistical Package for the Social Sciences (SPSS) v23.

Why do we study Statistics?

• The study of statistics serves to enhance and further develop critical and analytic thinking skills. To do well in statistics one must develop and use formal logical thinking abilities that are both high level and creative.

• Students and professionals may be called on to conduct research in their field, and statistical procedures are basic to research. To accomplish this, they must be able to design experiments; collect, organize, analyze, and summarize data; and possibly make reliable predictions or forecasts for future use. They must also be able to communicate the results of the study in their own words.

• Another reason to study statistics is to be able to read journals. Most technical journals you will read contain some form of statistics, usually in the results section. Without an understanding of statistics, the information contained in this section will be meaningless. An understanding of basic statistics provides the fundamental skills necessary to read and evaluate most results sections. The ability to extract meaning from journal articles and to critically evaluate research from a statistical perspective will enhance your knowledge and understanding in related coursework.

• Students and professionals can also use the knowledge gained from studying statistics to become better consumers and citizens. For example, we can make intelligent decisions about what products to purchase based on consumer studies, about government spending based on utilization studies, and so on.

1.1 The Branches of Statistics


1.2 Definitions of terms

• Data. Data can be defined as a collection of facts or information from which conclusions may be drawn; data are the raw material of statistics. Data can be collected from existing sources or through observation, surveys, or experiments.

• Population. A population is the complete set of individuals, objects, or measurements that share some common characteristic. The size of the population is denoted by "N". For example, if we want to determine the average I.Q. score of all university students in Rwanda, then the population is all students who are in Rwanda's universities in 2019.

• Sample. It is usually not possible, or not practical, to examine every member of a population, so we use a sample, a smaller selection taken from that population, to estimate some value or characteristic of the whole population. Care must be taken when selecting the sample: it must be representative of the whole population under consideration, otherwise it tells us nothing relevant about that population.

• Parameter. A parameter is a numerical measure that describes a variable (characteristic) of a population and that you are interested in estimating or testing (such as a population mean or proportion). For example, the population mean is a parameter that is often used to indicate the average value of a quantity.

• Statistic. A statistic is a numerical measure that describes a variable (characteristic) of a sample (part of a population), e.g. the median net hourly wage of the total sample is 450 Rwandan francs (RWF).

https://wageindicator.org/Wageindicatorfoundation/publications/2013/wages-in-rwanda

• Statistics. Statistics is a field of study concerned with (1) the collection, organization, summarization, and analysis of data; and (2) the drawing of inferences about a body of data when only a part of the data is observed.

• Variable. A variable is a characteristic, number, or quantity that can be measured or counted and that varies over time and/or between the individuals or objects under consideration, e.g. gender, household income, business income and expenses, country of birth, capital expenditure, etc.
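The difference between a parameter and a statistic can be illustrated with a short Python sketch. The population here is simulated (the I.Q. scores, the population size of 10,000 and the sample size of 200 are all assumptions made for illustration, not real data); the point is only that the parameter is computed from every member of the population, while the statistic is computed from a sample.

```python
import random
import statistics

# Hypothetical population: simulated I.Q. scores for 10,000 students (not real data).
random.seed(42)
population = [random.gauss(100, 15) for _ in range(10_000)]

# Parameter: a numerical measure computed from every member of the population.
population_mean = statistics.mean(population)

# Statistic: the same measure computed from a random sample of that population.
sample = random.sample(population, 200)
sample_mean = statistics.mean(sample)

print(f"N = {len(population)}, population mean (parameter) = {population_mean:.2f}")
print(f"n = {len(sample)}, sample mean (statistic) = {sample_mean:.2f}")
```

The sample mean will usually be close to the population mean but not equal to it, which is why a sample statistic is described as an estimate of the population parameter.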

Sources of data:

Each statistical analysis starts by identifying the source of the data, that is, by looking for adequate data to serve as raw material for the research. Such data are generally available from one or more of the following sources:

• Published sources. These are data available in print or in electronic form, including data found on internet websites. Primary data sources are those published by the individual or group that collected the data. Secondary data sources are those compiled from primary sources, e.g. National Bank of Rwanda: http://www.bnr.rw/index.php?id=213; National Institute of Statistics of Rwanda (NISR): http://www.statistics.gov.rw/

• Routinely kept records. Almost every type of organization keeps records of the day-to-day transactions of its activities. Hospital medical records, for example, contain immense amounts of information on patients, while hospital accounting records contain a wealth of data on the facility’s business activities.

• Surveys. If the data needed to answer a question are not available from routinely kept records, the logical source may be a survey, in which a questionnaire or similar instrument is used to gather responses from a set of participants.

• Experiments. Frequently the data needed to answer a question are available only as the result of an experiment. For example, the Human Resources (HR) department may wish to know which of several strategies best maximizes worker compliance. HR could conduct an experiment in which it first measures the strategies that workers are using, then gives the workers a program to help them apply these strategies, and finally measures the workers again at the end of the program to see whether the program has been effective. The subsequent evaluation of the responses to the different strategies could allow HR to decide which strategy is the most effective.

1.3 Software Tools

There are many software tools for statistical analysis, such as SPSS, STATISTICA and others. The collected data must be stored in tabular form (a "data matrix"). SPSS is a widely used, menu-driven package for windowed environments, with user-friendly data editing, data presentation and interactive graphics support. It takes little time to become familiar with and lets the user perform statistical analyses easily, using a spreadsheet-like approach to working with the data.

SPSS

Begin by opening SPSS 23 for Windows.

1. Click on the IBM SPSS shortcut button on your desktop. OR

2. Go to START, click on PROGRAMS, and click on IBM SPSS.


Data Entry. SPSS runs on Windows and Mac operating systems, but these notes focus on Windows.

Variable view: used to define information about your variables. Click the Variable View tab at the bottom of the Data Editor window to display information about the variables in your data.

Data view: used for entering, editing and modifying data.

Menus and Toolbars

Various pull-down menus appear at the top of the Data Editor window. These pull-down menus are at the heart of using SPSS. The Data Editor menu items (with some of the uses of each menu) are:

FILE: Standard options for opening, saving, printing and exiting.

EDIT: Used to copy and paste data values, to find data in a file, and to insert variables and cases.

VIEW: Options for showing/hiding toolbars and for displaying values or their labels in the Data Editor.

DATA: Identify duplicate cases, merge files, split file, select cases, weight cases, etc.

TRANSFORM: Compute new variables, recode variables.

ANALYZE: This menu provides access to the statistical procedures for analyzing your data set. All the items on the Analyze menu have submenus.

DIRECT MARKETING: Allows you to perform advanced analysis of clients or contacts to improve your marketing campaigns and maximize the ROI of your marketing budget.

GRAPHS: Provides options to create high-quality plots and charts.

UTILITIES: Used to display information on individual variables, to add comments to accompany the data file, and for other advanced features.

ADD-ONS: The SPSS extension packages are additional features (advanced statistical procedures) that you can add to SPSS.

WINDOW: Provides options for switching between data, syntax and navigator windows.

HELP: Contains the SPSS help system. For example, Help | Case Studies provides hands-on examples of how to create various types of statistical analyses and how to interpret the results.

Example of data in SPSS

Note. Dear student, for more information I recommend opening the file "Tutorial of SPSS Vs 23. Dr. Rosa Padilla", which is posted on the course website. You can also go to the AUCA library or search the internet to learn how to use this important statistical software.

1.4 Types of Data and Scales of Measurement


Variables can be classified in several ways. One method of classification refers to the type and amount of information contained in the data. Data are either categorical or quantitative/numerical. Another method is to classify data by levels of measurement: nominal, ordinal, interval or ratio.

Classification of variables according to type of data

Categorical. This is generally non-numerical data that is placed into exclusive categories and then counted rather than measured; its values are described in words.

This type of variable can be broken down into two types: Nominal and Ordinal.

1. Nominal data is merely descriptive (e.g. religion, country name, sex). Any assigned numerical value is merely for convenience (e.g. religion: Catholic = 1, Adventist = 2, Other = 3).

2. Ordinal data has a rank order, though the intervals between data points cannot be considered equal (e.g. income (high/medium/low); severity (poor/average/high)).

Numerical/quantitative. Numerical or quantitative data arise from counting, measuring something, or some kind of mathematical operation.

This type of variable can be broken down into two types: Discrete and Continuous.

1. Discrete. A discrete variable takes countable values; such data are often integers. E.g. the number of takeoffs (departures) at Kigali International Airport; the number of people shopping in a supermarket.

2. Continuous. A continuous variable is a numerical variable that can take any value within an interval. E.g. the weight of a package of rice (e.g., 495.897 grams) is a continuous variable because any interval, such as 495 to 500 grams, contains infinitely many possible values.


Classification of variables according to measurement level

1. Nominal-level data is merely descriptive: the values are simple labels or names that cannot be ordered (e.g. religion, country name, sex). Any assigned numerical value is merely for convenience (e.g. religion: Catholic = 1, Adventist = 2, Other = 3).

2. Ordinal-level data can be ranked in a meaningful order, though the intervals between data points cannot be considered equal (e.g. household income (high/medium/low); severity (poor/average/high)).

3. Interval-level measurement not only assigns rank or order; its major strength is that it has equal units of measurement. However, it does not possess a true zero (the zero is not natural). For example, temperature in degrees Celsius is interval data: 25 °C is warmer than 20 °C, and a 5 °C difference has a physical meaning. Note that 0 °C is arbitrary, so it does not make sense to say that 20 °C is twice as hot as 10 °C.

4. Ratio-level measurement is the strongest level of measurement: the measures are not only expressed in equal units, but a true zero also exists. The zero indicates the absence of the quality or attribute being assessed. Examples are measurements of household income, length, weight, etc. These permit statements about the ratio between different individuals with respect to some quality or property. For example, profit is a ratio variable (e.g. 4 million is twice 2 million).
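The contrast between interval and ratio scales can be checked with a few lines of arithmetic. The sketch below is only an illustration: it re-expresses the Celsius temperatures from the example in kelvin (a scale with a true zero, introduced here as an assumption for comparison) and shows that the apparent "twice as hot" ratio does not survive the change of scale, while the profit ratio is meaningful as it stands.

```python
def celsius_to_kelvin(celsius):
    """Convert degrees Celsius to kelvin, a temperature scale with a true zero."""
    return celsius + 273.15

# Differences are meaningful on an interval scale: 25 C is 5 degrees warmer than 20 C.
difference = 25 - 20

# Ratios are not: 20 / 10 looks like "twice as hot" only because 0 C is arbitrary.
naive_ratio = 20 / 10
kelvin_ratio = celsius_to_kelvin(20) / celsius_to_kelvin(10)

# On a ratio scale (profit, with a true zero) the ratio can be read directly.
profit_ratio = 4_000_000 / 2_000_000

print(f"Celsius ratio: {naive_ratio:.2f}, kelvin ratio: {kelvin_ratio:.3f}")
print(f"Profit ratio: {profit_ratio:.1f}")
```

Measured from a true zero, 20 °C is only about 3.5% hotter than 10 °C, not 100% hotter.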

Likert scales

A Likert scale is a special case that is frequently used in survey research. You have undoubtedly seen such scales. Typically, a statement is made and the respondent is asked to indicate his or her agreement or disagreement on a five-point or seven-point scale using verbal anchors.

For example:

“High school students should be required to go to college to study a foreign language.” (Check one)

Strongly Disagree

Somewhat Disagree

Neither Agree nor Disagree

Somewhat Agree

Strongly Agree

Types of variable by experimental design

Independent variable. The variable controlled by the researcher; changes in this variable may produce changes in the dependent variable. (It is the presumed “cause” in the theoretical model.)

Dependent variable. The observed variable that is expected to change as a result of changes in the independent variable in an experiment. (It is the presumed “effect” in the theoretical model.)

Moderating variable. A variable suspected or known to influence the dependent variable.

Latent variables. Most social scientific concepts, e.g. intelligence or social capital, are not directly observable. This makes them hypothetical or ‘latent’ constructs.

We can measure latent variables using observable indicators.

1.5 Methods of data collection

There are many methods used to collect or obtain data for statistical analysis. Four of the most popular methods are:

1. Census. A census is a survey of a whole population.

2. Sample survey. A sample survey is a study that obtains data from a representative subset of a population in order to estimate population attributes.

Surveys may be administered in a variety of ways, e.g.
• Personal interview,
• Telephone interview, and
• Administered questionnaire.

3. Experiment. An experiment is a controlled study of a group. Experiments are very common in the medical fields. The researcher controls how members are placed into study groups and which treatment each group receives. Bias can be a major issue with experiments.

4. Observational study. An observational study is similar to an experiment, except that the researcher does not use control groups or assign treatments.

Guidelines for building questionnaires

A questionnaire is a standardised set of questions administered to the respondents in a survey.

Respondents are required to interpret a pre-established set of questions and to supply the information these questions seek.

Key design principles:

• Keep the questionnaire as short as possible.
• Ask short, simple, and clear questions.
• Pretest the questionnaire on a small number of people.
• Think about the way you intend to use the collected data when preparing the questionnaire.

Formatting the answer

Survey items can take a variety of formats; the most common are:

1. Open-ended questions that call for numerical answers. Example:

Now, thinking about your physical health, which includes physical illness and injury: for how many days during the past 30 days was your physical health not good?

2. Closed questions with ordered response scales. Example:

Would you say that in general your health is:

(1) Poor (2) Fair (3) Good (4) Very good (5) Excellent

Or closed questions with categorical response options:

Are you:

(1) Married (2) Divorced (3) Widowed (4) Separated (5) Never married (6) A member of an unmarried couple

With closed questions, include all reasonable possibilities as explicit response options. Compare:

Are you:
(1) Married (2) Divorced (3) Widowed (4) Separated (5) Never married

Are you:
(1) Married (2) Single

3. Make the question as specific as possible (about who it covers, what time period, which behaviours…). Compare:

Over the last month, that is ….. how often do you read a newspaper in a typical week?

In a typical week, how often do you read a newspaper?

Clearly specify the attitude object of interest. Compare:

Do you agree or disagree with the following statement? Government is spending too little on education.
(1) Strongly Agree (2) Agree (3) Neither agree nor disagree (4) Disagree (5) Strongly Disagree

Do you think the Government is spending too little, about the right amount, or too much on education?

Example of a questionnaire

Adventist University of Central Africa

QUESTIONNAIRE

(01/12/2020)

Dear Colleague:

This questionnaire is strictly confidential and will serve only as statistical information for work in the Statistics Training Session. The study is purely for academic purposes and the information given will be treated with the utmost confidentiality. I therefore humbly request that you spare some time to answer the following questions.

Instructions: Please answer the following questions truthfully by ticking where appropriate and filling in the blanks.

I. GENERAL INFORMATION

Gender: 1 [ ] Female   2 [ ] Male

Age range (years): 1 [ ] Less than 30   2 [ ] 30-40   3 [ ] 41-50   4 [ ] 51 and above

Marital status: 1 [ ] Single   2 [ ] Married   3 [ ] Divorced   4 [ ] Widowed   5 [ ] Other

Highest level of education: 1 [ ] Primary   2 [ ] Secondary   3 [ ] University   4 [ ] Other

How long have you been working in your Institution?

1 [ ] Less than 1 year   2 [ ] 1 – 3 years   3 [ ] 4 – 5 years   4 [ ] More than 5 years

II. THE TIME OF YOUR LIFE

Please tick (√) the number that corresponds to your level of agreement. Choose one of these three options:

1 = Never; 2 = Sometimes; 3 = Always

How do you manage time?

Items                                                              Never  Sometimes  Always
1. I prioritize the things that need to be done                      1        2        3
2. I usually finish what I set out to do in any day                  1        2        3
3. In the past I have always got tasks done on time                  1        2        3
4. I feel I make the best use of my time                             1        2        3
5. I can tackle difficult or unpleasant tasks without using
   delaying tactics and wasting time                                 1        2        3
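Once questionnaires are returned, the coded answers are typed into a data matrix with one row per respondent and one column per item. The Python sketch below shows the idea for the five time-management items; the four rows of responses are hypothetical, invented only to illustrate the coding, and adding up the codes into a total is a common simplification rather than something the questionnaire itself prescribes.

```python
from collections import Counter

# Codes used in the questionnaire: 1 = Never, 2 = Sometimes, 3 = Always.
LABELS = {1: "Never", 2: "Sometimes", 3: "Always"}

# Hypothetical data matrix: one row per respondent, one column per item (items 1 to 5).
responses = [
    [3, 2, 3, 2, 1],
    [2, 2, 2, 3, 2],
    [3, 3, 3, 3, 2],
    [1, 2, 2, 1, 1],
]

# Frequency of each answer to item 1 ("I prioritize the things that need to be done").
item1 = [row[0] for row in responses]
item1_counts = Counter(LABELS[code] for code in item1)

# Total per respondent; with five items coded 1 to 3 it ranges from 5 to 15.
totals = [sum(row) for row in responses]

print(dict(item1_counts))
print(totals)
```

The same layout (respondents in rows, items in columns) is what the SPSS Data View expects.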


Assignment 1 – Overview of Statistics

1. List three applications of Statistics in your field or specialty.

2. Match each of the following terms to its correct definition:

TERMS

( ) Parameter
( ) Inferential Statistics
( ) Census
(e) Statistics
( ) Population
( ) Descriptive Statistics
( ) Sample
( ) Statistic

DEFINITIONS

a. The complete collection of items under study
b. A number that describes a sample characteristic
c. Procedures for collecting, classifying, summarizing, and presenting data
d. A number that describes a population characteristic
e. The science of gathering and summarizing data and using results to make decisions
f. A subset of the population
g. The process of arriving at a conclusion about a population parameter on the basis of a sample statistic
h. A survey of all the elements in a population

3. Determine whether each of the following is categorical (nominal or ordinal) or numerical (continuous or discrete).

( ) The number of people living in a household
( ) The branches of Statistics
( ) The average miles per gallon on all new Fords
( ) Customer satisfaction

4. The height of an individual is an example of a:

a. discrete variable b. continuous variable c. categorical variable

5. The portion of the population that is selected for analysis is called:

a. a sample b. a frame c. a parameter d. a statistic

6. The brand of an automobile (Toyota, Kia, Nissan, BMW, and so on) is an example of a:

a. discrete variable b. continuous variable c. categorical variable d. constant

7. The number of credit cards in a person’s wallet is an example of a:

a. discrete variable b. continuous variable c. categorical variable d. constant

8. Statistical inference occurs when you:

a. compute descriptive statistics from a sample
b. take a complete census of a population
c. present a graph of data
d. take the result of a sample and reach a conclusion about a population

9. The human resources director of a large corporation wants to develop a dental benefits package and decides to select 100 employees from a list of all 5,000 workers in order to study their preferences for the various components of a potential package. All the workers in the corporation constitute the ___________

a. sample b. population c. statistic d. parameter e. constant

10. Those methods that involve collecting, presenting, and computing characteristics of a set of data in order to properly describe the various features of the data are called:

a. statistical inference b. the scientific method c. sampling d. descriptive statistics

11. Construct a questionnaire with at least three general questions (demographic data) and five specific questions on any topic related to your field of study.

DESCRIBING DATA VISUALLY

1.6 Data Analysis: Tables and graphs

The presentation of data is mainly done using two methods: the tabular method and the graphical method.

Tables and graphs play an important role in business communication because they are the two primary means of structuring and communicating quantitative information.

Neither graphs nor tables are better in general; each is better suited to a particular communication task. If your message requires the precision of numbers and text labels to identify what they are, use a table. When you want to show relationships in the data, use a graph.

1.7 Frequency table for numerical and categorical variables

Frequency table: a table used to organize data. The left column (called classes or groups) lists all possible responses for the variable being studied. The right columns give the frequency (number of observations) and the percentage for each class.

Categorical variable

Example: The results of a survey that asked adults how they pay their monthly bills can be presented using a summary table:

Table 1. How Adults Pay Monthly Bills

Form of Payment        Frequency   Percentage (%)
Cash                       75            15
Check                     270            54
Electronic/online         140            28
Other/don't know           15             3
Total                     500           100

Source: Data extracted from USA Today Snapshots, October 4, 2007

Interpretation: You can conclude that more than half the adults pay by check and the majority (82%) pay by check or through electronic/online payment methods.
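The percentage column of a frequency table is each frequency divided by the total, multiplied by 100. As a minimal sketch (the variable names are illustrative, and the course itself uses Excel and SPSS for this step), the figures in Table 1 can be reproduced in Python as follows:

```python
# Frequencies from Table 1 (how adults pay their monthly bills).
counts = {
    "Cash": 75,
    "Check": 270,
    "Electronic/online": 140,
    "Other/don't know": 15,
}

total = sum(counts.values())

# Percentage for each class = 100 * frequency / total.
percentages = {form: 100 * freq / total for form, freq in counts.items()}

for form, freq in counts.items():
    print(f"{form:<20}{freq:>6}{percentages[form]:>8.0f}")
print(f"{'Total':<20}{total:>6}{sum(percentages.values()):>8.0f}")

# Share of adults paying by check or electronically, as used in the interpretation.
check_or_online = percentages["Check"] + percentages["Electronic/online"]
```

Here `check_or_online` evaluates to 82, the figure quoted in the interpretation.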

1.8 Graphical Representation of Data

A statistical chart or graph presents information by means of geometric figures. The primary objective of a graph is to give an overall visual impression that is quick and easy to understand. It is important to give the figure a title, to specify the scale and legend, and to choose the type of figure appropriate to the information.

• A graph consists of two axes, called the x (horizontal) and y (vertical) axes. These axes correspond to the variables being related, for example Price and Quantity.

Chart Types

For categorical variables

• Pie chart (sex, profession, etc.): used when you want to know the frequency and percentage of the total cases that fall into each category.

• Bar chart: like a histogram, but with gaps between the bars; useful for showing two samples side by side.

Simple → one variable (also suitable when the variable is quantitative but discrete)
Grouped → two variables
Stacked → two variables

Interpretation: The bar or pie chart enables you to see that most adults pay their monthly bills by check or electronically/online, while a small percentage pay with cash.

• Pareto Chart

The Pareto chart is named after Vilfredo Pareto (1848–1923), an Italian economist who postulated in 1897 that a large share of wealth is owned by a small percentage of the population. This basic principle translates well into quality problems. A Pareto chart is a series of bars whose heights reflect the frequency or impact of problems. The bars are arranged in descending order of height from left to right, so the categories represented by the tall bars on the left are relatively more significant than those on the right. This bar chart is used to separate the “vital few” from the “trivial many”. These charts are based on the Pareto principle, which states that roughly 80 percent of the problems come from 20 percent of the causes. Pareto charts are extremely useful because they can be used to identify the factors that have the greatest cumulative effect on the system, and thus to screen out the less significant factors in an analysis. Ideally, this allows the user to focus attention on a few important factors in a process.

Note:

You can think of the benefits of using a Pareto chart in economic terms. A Pareto chart breaks a big problem down into smaller pieces, identifies the most significant factors, shows where to focus efforts, and allows better use of limited resources. You can separate the few major problems from the many possible problems so that you can focus your improvement efforts, arrange data according to priority or importance, and determine which problems are most important using data, not perception.

Example:

A Pareto chart can be used to quickly identify what business issues need attention. By using hard data instead of intuition, there can be no question about what problems are influencing the outcome most.

In the example below, XYZ Clothing Store was seeing a steady decline in business. Before the manager did a customer survey, he assumed the decline was due to customer dissatisfaction with the clothing line he was selling and he blamed his supply chain for his problems. After charting the frequency of the answers in his customer survey, however, it was very clear that the real reasons for the decline of his business had nothing to do with his supply chain.

By collecting data and displaying it in a Pareto chart, the manager could see which variables were having the most influence.

Customer complaints    Count
Clothing faded           18
Clothing shrank          14
Rude sales               61
Poor lighting            44
Layout confusing         35
Sizes limited            23
Parking                  82
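The ordering and cumulative percentages behind a Pareto chart can be computed directly from the complaint counts above. The following Python sketch is an illustration only (the course builds the chart in SPSS and Excel, and the variable names are made up here): it sorts the causes from most to least frequent and accumulates their share of all complaints.

```python
# Complaint counts from the XYZ Clothing Store customer survey.
complaints = {
    "Clothing faded": 18,
    "Clothing shrank": 14,
    "Rude sales": 61,
    "Poor lighting": 44,
    "Layout confusing": 35,
    "Sizes limited": 23,
    "Parking": 82,
}

total = sum(complaints.values())

# A Pareto chart orders the bars from the most to the least frequent cause.
ordered = sorted(complaints.items(), key=lambda item: item[1], reverse=True)

# Accumulate the running share of all complaints, in percent.
pareto = []
running = 0
for cause, count in ordered:
    running += count
    pareto.append((cause, count, round(100 * running / total, 1)))

for cause, count, cumulative in pareto:
    print(f"{cause:<18}{count:>4}{cumulative:>8.1f}%")
```

For these data the first four causes (parking, rude sales, poor lighting and confusing layout) together account for about 80% of the 277 complaints.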

How to Construct a Pareto Chart with SPSS

1. Create the variables in ‘Variable View’ as shown below.

2. Go to ‘Data View’ and enter the data as shown below.

3. Go to Analyze > Quality Control > Pareto Chart and continue with the steps shown in the following figure.

Output

If you want to improve the appearance of your graph, you can edit it.

Interpretation: The chart above shows that about 80% of the complaints come from four of the seven causes: parking difficulties, rude sales people, poor lighting, and a confusing layout were hurting the business most. Following the Pareto principle, those are the areas where the manager should focus his attention to build his business back up.

Excel report

For numerical or quantitative variables

Stem and leaf

A stem-and-leaf plot uses the actual numerical values of each data point.

• Divide each measurement into two parts: the stem and the leaf.
• List the stems in a column, with a vertical line to their right.
• For each measurement, record the leaf portion in the same row as its matching stem.
• Order the leaves from lowest to highest in each stem.

Stem-and-leaf plots are a method for showing the frequency with which certain classes of values occur. You could make a frequency distribution table or a histogram for the values, or you can use a stem-and-leaf plot and let the numbers themselves show much the same information.

Example:

Suppose you have the following list of values about the prices ($) of 11 brands of walking shoes:

12 13 21 27 33 34 35 37 40 40 41

Steps to make a stem-and-leaf plot in SPSS

Step 1. Go to Analyze > Descriptive Statistics > Explore.

Step 2. Select the desired variable and click the arrow to move it to the right side (as shown in the figure below).

Step 3. Click “OK” to display the stem-and-leaf plot and the boxplot.

Figure 4. Price ($) of walking shoes

Frequency    Stem &  Leaf
     2.00       1 .  23
     2.00       2 .  17
     4.00       3 .  3457
     3.00       4 .  001

Stem width: 10.00
Each leaf: 1 case(s)

Interpretation: The stem-and-leaf plot shows that walking-shoe prices are somewhat more concentrated between $30 and $40.

• How many of the prices are between 30 and 40 dollars (inclusive)?

Six of the eleven prices are between $30 and $40 (33, 34, 35, 37, 40 and 40).
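The four steps for building a stem-and-leaf plot by hand can also be sketched in Python. This is a minimal illustration, assuming a stem width of 10 as in the SPSS output; the function name `stem_and_leaf` is made up for this example.

```python
from collections import defaultdict

# Prices ($) of the 11 brands of walking shoes from the example.
prices = [12, 13, 21, 27, 33, 34, 35, 37, 40, 40, 41]

def stem_and_leaf(values, stem_width=10):
    """Group values by stem (tens digit) and list the leaves (units digit) in order."""
    stems = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, stem_width)
        stems[stem].append(leaf)
    return dict(stems)

plot = stem_and_leaf(prices)

# One row per stem: frequency, stem, and the ordered leaves.
for stem in sorted(plot):
    leaves = "".join(str(leaf) for leaf in plot[stem])
    print(f"{len(plot[stem]):>2}  {stem} | {leaves}")
```

Each printed row corresponds to one line of the SPSS stem-and-leaf output: the frequency, the stem, and the leaves in increasing order.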

Boxplot

This tool allows you to study the symmetry of the data and to detect outliers. The chart divides the data into four areas of equal frequency. The central box contains the middle 50% of the data, and a vertical (or horizontal) line inside the box marks the median; if this line is at the center of the box, the data are symmetric. Whiskers are drawn from each end of the box. The left (or lower) whisker extends to the smallest value that is not below Q1 - 1.5 * IQR, and the right (or upper) whisker extends to the largest value that is not above Q3 + 1.5 * IQR. Values beyond the whiskers are outliers, and values greater than Q3 + 3 * IQR or less than Q1 - 3 * IQR are considered extreme outliers (in SPSS, outliers and extreme outliers are represented by “o” and “*”, respectively).

Remember that:

Q1 = quartile one or percentile 25.
Q2 = quartile two or percentile 50.
Q3 = quartile three or percentile 75.
IQR = interquartile range = Q3 - Q1.

Example

Using the same example as before, the prices ($) of 11 brands of walking shoes:

Interpretation: We see that the middle half of the walking-shoe prices are between 21 and 40 dollars. About 25% of the prices are at or below $21, and about 25% are at or above $40. If a data set includes one or more outliers, they are plotted separately as points on the chart; the box plot above shows no outliers.
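The quartiles and outlier limits used by a box plot can be checked with Python's standard library. In this sketch the `method="exclusive"` option of `statistics.quantiles` is an assumption: it is chosen because it reproduces the percentiles reported for these prices (21 and 34), and other quartile conventions can give slightly different values.

```python
import statistics

# Prices ($) of the 11 brands of walking shoes from the example.
prices = [12, 13, 21, 27, 33, 34, 35, 37, 40, 40, 41]

# Quartiles: Q1 (percentile 25), Q2 (median), Q3 (percentile 75).
q1, q2, q3 = statistics.quantiles(prices, n=4, method="exclusive")
iqr = q3 - q1

# Limits beyond which a value is treated as an outlier.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = [p for p in prices if p < lower_limit or p > upper_limit]

print(f"Q1 = {q1}, Q2 = {q2}, Q3 = {q3}, IQR = {iqr}")
print(f"Outlier limits: {lower_limit} to {upper_limit}; outliers: {outliers}")
```

With these prices the limits are -7.5 and 68.5, so no value falls outside them, which agrees with the box plot showing no outliers.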

Statistics (prices of walking shoes)

Minimum             12
Maximum             41
Percentiles   25    21
              50    34

Finally, box plots often provide information about the shape of a data set. The example below shows some common patterns.

[Figure: Interpreting Graphs: Shapes. Three box plots on a horizontal axis from 2 to 16, labeled Skewed right, Symmetric and Skewed left]

Applied Statistics – Dr. Rosa Padilla de Casamayor

Histograms

A histogram is a pictorial method of representing data. It appears similar to a bar chart but has two fundamental differences:

• The data must be measurable on a standard scale, e.g. lengths rather than colors.

• The area of a block, rather than its height, is drawn proportional to the frequency, so if one column is twice the width of another it needs to be only half the height to represent the same frequency.
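The rule that area, not height, represents frequency can be made concrete with a small calculation. The class intervals and frequencies below are hypothetical, chosen only so that two classes have the same frequency but different widths; the height of each block is the frequency divided by the class width (often called the frequency density).

```python
# Hypothetical classes: (lower limit, upper limit) and frequency.
classes = [
    ((0, 10), 30),   # width 10
    ((10, 30), 30),  # width 20, same frequency as the first class
    ((30, 40), 15),  # width 10
]

heights = []
areas = []
for (lower, upper), frequency in classes:
    width = upper - lower
    height = frequency / width      # frequency density
    heights.append(height)
    areas.append(height * width)    # the area recovers the frequency

print(heights)
print(areas)
```

The second class is twice as wide as the first and has the same frequency, so its block is half as tall (1.5 instead of 3.0), while both blocks have the same area.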

Example:

Lines


A line chart or line graph is a type of chart which displays information as a series of data points called 'markers' connected by straight line segments. It is a basic type of chart common in many fields. A line chart is often used to visualize a trend in data over intervals of time.

Food Crops (in billions Rwf) 2011 – 2014 Rwanda

You can see a growth trend in the agriculture sector; however, there is a recurring slight decline in the first quarter of each year.

Population Pyramid

A population pyramid, also called an age pyramid or age picture diagram, is a graphical illustration that shows the distribution of various age groups in a population (typically that of a country or region of the world), which forms the shape of a pyramid when the population is growing. It is also used in ecology to determine the overall age distribution of a population; an indication of the reproductive capabilities and likelihood of the continuation of a species.

It typically consists of two back-to-back bar graphs, with the population plotted on the X-axis and age on the Y-axis, one showing the number of males and one showing females in a particular population in five-year age groups (also called cohorts). Males are conventionally shown on the left and females on the right, and they may be measured by raw number or as a percentage of the total population.

These graphs give a picture of the youth, maturity and old age of a population and, therefore, of its degree of development. According to their shape, pyramids are classified into different types:

Progressive pyramid:

A high percentage of the population is young, and the share of each age group declines as age increases. This shape is typical of underdeveloped countries, where life expectancy is low and the birth rate is high.

Constrictive pyramid:

There is less population at the base than in the middle age groups, and the older population is considerable. This shape is typical of developed countries, where fertility is declining and life expectancy is high.

Stationary pyramid:

The intermediate age brackets have about the same population as the base. This shape is typical of developing countries that have brought mortality under control and are beginning to control births.

Population of Rwanda, 2017

Age group   Males     Females
0-4         881085    869505
5-9         809168    808172
10-14       718005    722764
15-19       627280    632937
20-24       537047    554176
25-29       501908    509298
30-34       440259    477309
35-39       335397    371730
40-44       248906    273463
45-49       189532    201359
50-54       193509    188707
55-59       159312    163228
60-64       104803    120675
65-69       65202     85834
70-74       37504     57143
75-79       19276     30450
80-84       11805     18454
85-89       5103      8134
90-94       763       1254
95-99       197       303
100+        2         2

Source: United Nations, Department of Economic and Social Affairs, Population Division. World Population Prospects: The 2015 Revision. (Medium variant)

https://www.populationpyramid.net/5

Interpretation

The age pyramid below graphically displays the population’s age and sex composition.

Horizontal bars present the numbers of males and females in each age group. The age pyramid of Rwanda has a large base, implying that the majority of the population is young and that fertility levels are still high. Around 50% of the population (about 6.1 million people, according to the table above) is under 20.
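The share of the population under 20 can be checked directly from the 2017 table. The sketch below copies the table figures and assumes that the third row of the table is the 10-14 age group; the function name share_under is illustrative.

```python
# Population of Rwanda, 2017: (age group, males, females), copied from the table.
# Assumption: the third row of the table is the 10-14 age group.
population = [
    ("0-4", 881085, 869505), ("5-9", 809168, 808172),
    ("10-14", 718005, 722764), ("15-19", 627280, 632937),
    ("20-24", 537047, 554176), ("25-29", 501908, 509298),
    ("30-34", 440259, 477309), ("35-39", 335397, 371730),
    ("40-44", 248906, 273463), ("45-49", 189532, 201359),
    ("50-54", 193509, 188707), ("55-59", 159312, 163228),
    ("60-64", 104803, 120675), ("65-69", 65202, 85834),
    ("70-74", 37504, 57143), ("75-79", 19276, 30450),
    ("80-84", 11805, 18454), ("85-89", 5103, 8134),
    ("90-94", 763, 1254), ("95-99", 197, 303),
    ("100+", 2, 2),
]


def share_under(population, age_limit):
    """Return (people below age_limit, total people, share below age_limit)."""
    total = sum(males + females for _, males, females in population)
    young = 0
    for label, males, females in population:
        lower_bound = int(label.rstrip("+").split("-")[0])
        if lower_bound < age_limit:  # five-year groups starting below the limit
            young += males + females
    return young, total, young / total


young, total, share = share_under(population, 20)
print(f"Under 20: {young:,} of {total:,} ({share:.1%})")
```

The same function can be called with another age limit that falls on a group boundary, for example 15 or 65, to describe other parts of the pyramid.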

ANALYSIS OF THE DATA

2.1 Summarize Data

Descriptive measures derived from a sample (n items) are statistics, while for a population (N items or infinite) they are parameters. For a sample of numerical data, we are interested in three key characteristics: center, variability and shape. (Doane & Seward).

Characteristic: Interpretation

Center: Where are the data values concentrated? What seem to be typical or middle data values? Is there central tendency?

Variability: How much dispersion is there in the data? How spread out are the data values? Are there unusual values?

Shape: Are the data values distributed symmetrically? Skewed? Sharply peaked? Flat? Bimodal?

1.9 Measure of Central Tendency

Measures of central tendency provide information about the average or typical score in a data set. They are computed to give a “center” around which the measurements in the data are distributed, i.e. a central value around which the data seem to cluster. The mean is the most widely used and familiar average. The measures covered here are:

Mean, Median, Mode and Geometric mean.

Type of Scale        Measure of Central Tendency   Measure of Dispersion
Nominal              Mode                          None
Ordinal              Median                        Percentile
Interval or Ratio    Mean, Geometric mean          Standard deviation, Range, Coefficient of variation, IQR

Mean

The mean is the arithmetic average; the data must be measurable on at least an interval scale. It is calculated by dividing the sum of the values by the number of values.

Sample Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

Population Mean: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$

Characteristics

• Most sensitive of all measures of central tendency

• Most appropriate measure of central tendency to use for ratio data (may be used on interval data)

• Considers all information about the data and is used to perform other statistical calculations

• Influenced by extreme scores, especially if the distribution is small

Median

The median is the middle score in a set of ranked scores: the score that represents the exact middle of the distribution, i.e. the fiftieth percentile. 50% of the scores are above it and 50% are below it.

Characteristics

• Not affected by extreme scores.

• A measure of position.

Mode

The mode is the most common value in a data set.

In the previous example (the height in cm), there is no mode, because no two people have the same height.

When to Use What

• The mean is a great measure, but there are times when its use is inappropriate or impossible:

You have nominal data: Mode

The distribution is bimodal: Mode

You have ordinal data: Median or mode

There are a few extreme scores: Median
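The effect of an extreme score on the mean, and the use of the mode for nominal data, can be sketched with Python's standard statistics module. Both data sets below are hypothetical and serve only as an illustration.

```python
import statistics

# Hypothetical monthly salaries (in thousands); the last value is an extreme score.
salaries = [30, 32, 35, 36, 38, 40, 250]

mean_value = statistics.mean(salaries)      # pulled upward by the extreme score
median_value = statistics.median(salaries)  # middle value of the ranked data

# Hypothetical nominal data: only the mode is meaningful here.
colors = ["red", "blue", "blue", "green", "blue", "red"]
mode_value = statistics.mode(colors)

print(f"mean = {mean_value:.2f}, median = {median_value}")
print(f"mode of colors = {mode_value}")
```

In this sketch a single extreme salary lifts the mean well above every other value in the group, while the median stays at the middle score, which is why the median is preferred when there are a few extreme scores.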

Relationship between Mean, Median and Mode

1.10 Measures of Variability

Measures of variability indicate how concentrated the data are around the mean, or how far the measurements lie from the center:

Variance, standard deviation, coefficient of variation, range, maximum and minimum

Range

The range is the difference between the maximum and minimum values in a set: RANGE = (Xlargest – Xsmallest)

Example

Data set 1: [1, 25, 50, 75, 100]; R: 100 – 1 = 99

Data set 2: [48, 49, 50, 51, 52]; R: 52 – 48 = 4

The range ignores how the data are distributed and takes only the extreme scores into account.

Standard Deviation

The standard deviation (SD) shows how the data scatter about the mean; it quantifies variability. It is expressed in the same units as the data.

A small standard deviation means that the group has small variability, i.e. it is relatively homogeneous.
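As a minimal sketch, the range and the standard deviation of the two data sets from the range example can be computed as follows. The sample standard deviation (dividing by n - 1) is assumed here; the helper name value_range is illustrative.

```python
import statistics

data_set_1 = [1, 25, 50, 75, 100]
data_set_2 = [48, 49, 50, 51, 52]


def value_range(values):
    """Range = largest value minus smallest value."""
    return max(values) - min(values)


for name, values in [("Data set 1", data_set_1), ("Data set 2", data_set_2)]:
    spread = value_range(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 in the denominator)
    print(f"{name}: range = {spread}, SD = {sd:.2f}")
```

Both data sets are centered near 50, but the second one has a much smaller range and standard deviation, so it is the more homogeneous group.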

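Within one standard deviation on either side of the mean lie about 68% of the observations. Within two standard deviations on either side of the mean lie about 95% of the observations. These percentages apply to data that are approximately normally distributed.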
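For a normal distribution, the share of observations within a given number of standard deviations of the mean can be checked with the NormalDist class from Python's standard statistics module. This is a sketch that assumes the data follow a normal curve; the function name share_within is illustrative.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Share of observations within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2):
    print(f"Within {k} SD of the mean: {share_within(k):.1%}")
```

Real data that are skewed or have heavy tails can give noticeably different percentages.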

Standard Error of the Mean

The Standard Error of the Mean (SEM) quantifies the precision of the mean. It is a measure of how far your sample mean is likely to be from the true population mean. It is expressed in the same units as the data.
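The SEM is usually estimated as the sample standard deviation divided by the square root of the sample size. The sketch below applies that formula to a small hypothetical sample; the function name standard_error is illustrative.

```python
import math
import statistics

# Hypothetical sample of five measurements.
sample = [12, 15, 11, 14, 13]


def standard_error(values):
    """Estimate the SEM as s / sqrt(n), with s the sample standard deviation."""
    return statistics.stdev(values) / math.sqrt(len(values))


sem = standard_error(sample)
print(f"mean = {statistics.mean(sample)}, SEM = {sem:.3f}")
```

Because the sample size is in the denominator, a larger sample with the same standard deviation gives a smaller SEM, i.e. a more precise estimate of the population mean.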

Coefficient of variation:

The coefficient of variation (CV) is a standardized measure of dispersion, i.e. a measure of relative variability. In the single-variable setting, it is defined as the ratio of the standard deviation to the mean. In the modeling setting, the CV is calculated as the ratio of the root mean squared error (RMSE) to the mean of the dependent variable. In both settings, the CV is often presented as the given ratio multiplied by 100.

The CV for a single variable aims to describe the dispersion of the variable in a way that does not depend on the variable's measurement unit. The higher the CV, the greater the dispersion in the variable. The CV for a model aims to describe the model fit in terms of the relative sizes of the squared residuals and outcome values. The lower the CV, the smaller the residuals relative to the predicted value, which is suggestive of a good model fit.

Let’s compare variability between samples where units are different.

Interpretation: The variability is 20% of the mean, i.e., the dispersion of the data is regular (acceptable).

Note:

CV less than 12%: Very Homogeneous

CV more than 50%: Very Heterogeneous

Risk of a Single Asset (standard deviation)

Wes and Jennie Moore, owners of Moore’s Foto Shop in western Pennsylvania, are considering two investment alternatives, asset A and asset B. They are not sure which of these two single assets is better, and they ask Sheila Newton, a financial planner, for some assistance.

Solution: Sheila knows that the standard deviation “s” is the most common single indicator of the risk, or variability, of a single asset. In finance, the fluctuation of a stock’s actual rate of return around its expected rate of return is called the risk of the stock. The standard deviation measures the variation of returns around an asset’s mean. Sheila obtains the rates of return of each asset. The results are shown in the following table. Notice that each asset has the same average rate of return of 12.2%. However, once Sheila obtains the standard deviation and CV, it becomes apparent that asset B is the riskier investment.

Rates of Return

Year                      Asset A (%)   Asset B (%)
5 years ago               11.3          9.4
4 years ago               12.5          17.1
3 years ago               13            13.3
2 years ago               12            10
1 year ago                12.2          11.2
Total                     61            61
Average rate of return    12.20%        12.20%
Standard deviation        0.63          3.12
CV                        5.16          25.57
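The figures in the table can be reproduced with a few lines of Python. This is a sketch that assumes the sample standard deviation (n - 1 in the denominator), which matches the tabulated values of 0.63 and 3.12; the function name summarize is illustrative.

```python
import statistics

# Annual rates of return (%) from the table, oldest year first.
asset_a = [11.3, 12.5, 13.0, 12.0, 12.2]
asset_b = [9.4, 17.1, 13.3, 10.0, 11.2]


def summarize(returns):
    """Return (mean, sample standard deviation, CV in percent)."""
    mean = statistics.mean(returns)
    sd = statistics.stdev(returns)
    cv = sd / mean * 100
    return mean, sd, cv


mean_a, sd_a, cv_a = summarize(asset_a)
mean_b, sd_b, cv_b = summarize(asset_b)
print(f"Asset A: mean = {mean_a:.2f}, SD = {sd_a:.2f}, CV = {cv_a:.2f}")
print(f"Asset B: mean = {mean_b:.2f}, SD = {sd_b:.2f}, CV = {cv_b:.2f}")
```

The CV values computed this way may differ from the table in the second decimal place, presumably because the table divides the already rounded standard deviation by the mean. The conclusion is the same: both assets have the same average return, but asset B varies far more around it.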

Interquartile range:

The difference between the upper quartile (the 75th percentile) and the lower quartile (the 25th percentile) of a distribution.

It is similar to the range, but it excludes the extreme observations at both ends, so it is not as sensitive to extreme values.

1.11 Measures of Relative Position

A quantile of a given order is the value of the variable below which a given cumulative frequency of the observations lies. Special cases are the percentiles, deciles, and quartiles.

Percentiles: The p-th percentile is a number such that at most p% of the measurements are below it and at most (100 − p)% of the measurements are above it.

Example: if the 85th percentile of a certain data set is 17, then 85% of the measurements are below 17 and 15% of the measurements are above 17.

• Quartiles: Divide the data into 4 equal parts

• Deciles: Divide the data into 10 equal parts

• Percentiles: Divide the data into 100 equal parts

Quartiles

In descriptive statistics, a quartile is any of the three values which divide the sorted data set into four equal parts, so that each part represents 1/4th of the sample or population.

– first quartile (designated Q1) = lower quartile

• cuts off lowest 25% of data (25th percentile)

– second quartile (designated Q2) = median

• cuts data set in half (50th percentile)

– third quartile (designated Q3) = upper quartile

• cuts off highest 25% of data, or lowest 75% (75th percentile)

The difference between the upper and lower quartiles is called the interquartile range.
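The quartiles and the interquartile range can be sketched with statistics.quantiles from Python's standard library. The data set below is hypothetical, and the function's default "exclusive" method is assumed.

```python
import statistics

# Hypothetical data set of 11 sorted values.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

# n=4 returns the three cut points Q1, Q2 (median) and Q3.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1  # interquartile range = upper quartile - lower quartile

print(f"Q1 = {q1}, Q2 = {q2}, Q3 = {q3}, IQR = {iqr}")
```

Statistical packages use several slightly different conventions for locating quartiles, so Excel or SPSS may report values that differ a little from these for small data sets.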

1.12 Shape of Distributions: Skewness and Kurtosis

The histogram can give you a general idea of the shape, but two numerical measures of shape give a more precise evaluation: skewness tells you the amount and direction of skew (departure from horizontal symmetry), and kurtosis tells you how tall and sharp the central peak is, relative to a standard bell curve.

Skewness

In the picture above, the first distribution is symmetric, and the second one is moderately skewed right: its right tail is longer and most of the distribution is at the left. By contrast, the third is moderately skewed left: the left tail is longer and most of the distribution is at the right.

Interpreting

If skewness = 0, the data are perfectly symmetrical. But a skewness of exactly zero is quite unlikely for real-world data, so how can you interpret the skewness number?

• If skewness is less than −1 or greater than +1, the distribution is highly skewed.

• If skewness is between −1 and −0.5 or between +0.5 and +1, the distribution is moderately skewed.

• If skewness is between −0.5 and +0.5, the distribution is approximately symmetric.
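The rule of thumb above can be sketched in code. The chapter does not state which skewness formula it uses; the sketch assumes the adjusted Fisher-Pearson sample skewness, and the function names and data sets are illustrative only.

```python
import statistics


def sample_skewness(values):
    """Adjusted Fisher-Pearson sample skewness (an assumed formula, see lead-in)."""
    n = len(values)
    if n < 3:
        raise ValueError("at least three values are required")
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    cubed = sum(((x - mean) / sd) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * cubed


def describe_skewness(skew):
    """Classify a skewness value with the rule of thumb from the text."""
    size = abs(skew)
    if size > 1:
        return "highly skewed"
    if size >= 0.5:
        return "moderately skewed"
    return "approximately symmetric"


# Hypothetical data sets: one symmetric, one with a long right tail.
for data in ([1, 2, 3, 4, 5], [1, 1, 2, 2, 3, 10]):
    skew = sample_skewness(data)
    print(f"{data}: skewness = {skew:.2f} ({describe_skewness(skew)})")
```

A positive result indicates a longer right tail and a negative result a longer left tail.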

Skew = 0: the distribution is symmetric.

Right skewed (Skew > 0): the distribution is positively skewed (positive asymmetry). There are fewer scores to the right of the peak. This can be caused by a floor effect.

Left skewed (Skew < 0): the distribution is negatively skewed (negative asymmetry). There are fewer scores to the left of the peak. This can be caused by a ceiling effect.

Kurtosis

Intuitively, the kurtosis is a measure of the peakedness of the data distribution.

Interpretation

If k ≈ 0.0, we say that the curve corresponding to the frequency distribution is mesokurtic (it has the same degree of peakedness as the normal, or Gaussian, curve).

If k < −0.263, we say that the curve corresponding to the frequency distribution is platykurtic (flatter than the normal curve).

If k > 0.263, we say that the curve corresponding to the frequency distribution is leptokurtic (more sharply peaked than the normal curve).

Review problems of chapter

Short Answer

1. What are descriptive statistics?

2. Which of the following graphs is not appropriate for categorical data:

a. Pareto chart
b. Bar chart
c. Pie chart
d. Histogram

3. The numbers of goals scored by two rival teams in each of the 16 matches of a soccer championship were:

Team A: 2 1 0 3 1 4 2 3 3 5 1 0 0 2 1 5

Team B: 3 5 1 2 1 0 0 4 1 1 1 2 3 4 5 2

Draw a box-and-whisker plot for each distribution and compare them. Which team performed better?

4. Construct the most appropriate graph for the following information.

Life Expectancy, Africa

Country         Life Expectancy
Mauritius       73.9
Madagascar      66.5
Ghana           63.5
Kenya           59.7
Rwanda          59.6
South Africa    58.2
Uganda          55.8
Nigeria         53.2
Burundi         53
DR Congo        49.5
Sierra Leone    46.5

5. If your business was investigating the delay associated with processing credit card applications, you could group the data into the following categories:

• No signature
• Residential address not valid
• Non-legible handwriting
• Already a customer
• Other

The data that were collected are shown in the following table:

Delay in processing credit card applications    Count
No Signature                                    40
No Address                                      9
Illegible                                       22
Current Customer                                15
Other                                           8

Construct a Pareto chart, and answer the following questions:

a. What are the largest issues facing our team or business?
b. What 20 percent of sources are causing 80 percent of the problems (80/20 Rule)?
c. Where should we focus our efforts to achieve the greatest improvements?

6. Construct a Pareto chart with the following data, and answer the following questions:

a. What are the largest issues facing our team or business?
b. What 20 percent of sources are causing 80 percent of the problems (80/20 Rule)?
c. Where should we focus our efforts to achieve the greatest improvements?

Restaurant Complaints

Complaint           Count
Food is tasteless   65
Wait time           109
Unfriendly staff    12
Not clean           30
Overpriced          789
Too noisy           27
Food not fresh      9
Small portion       621
No atmosphere       45
Other               15
Total               1722

7. Interpret the following figure:

Answer True or False

8. One of the advantages of a pie chart is that it shows that the total of all the categories of the pie adds to 100%.

9. Histograms are used for numerical data, whereas bar charts are suitable for categorical data.

10. A financial services company wants to collect information on the weekly number of transactions. To study the weekly transactions, it can use a pie chart.

11. For each data set: (a) Find the mean, median and mode. (b) Which, if any, of these three measures is the weakest indicator of a “typical” data value? Why?

a. 100 m dash times (n = 6 top runners): 9.87, 9.98, 10.02, 10.15, 10.36, 10.36
b. Number of children (n = 13 families): 0, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 6
c. Number of cars in driveway (n = 8 homes): 0, 0, 1, 1, 2, 2, 3, 5

12. Two people work in a factory making parts for cars. The table shows how many complete parts they make in one week.

Worker     Mon   Tue   Wed   Thu   Fri
Samuel     20    21    22    20    21
Pheneas    30    15    12    36    28

a. Find the mean and a measure of variability for Samuel and Pheneas.
b. Who is most consistent?
c. Who makes the most parts in a week?

13. Weight of luggage presented by airline passengers at check-in (measured to the nearest kg):

18 23 20 21 24 23 20 20 15 19 24

Compute and interpret the measures of central tendency, variability, quartiles, skewness and kurtosis.