Business Statistics: Intorduction
Donglei DuFaculty of Business Administration, University of New Brunswick, NB Canada Fredericton E3B 9Y2
Table of contents
1 Why Statistics? 2 What is Statistics?
3 Two methodologies in Science 4 Variables and types
5 Sources of Statistical Data
6 Software for Statistical Analysis
7 Materials to learn R 8 A brief tutorial of R with a
case study
Section 1
Why Statistics?
Figure: Source: http://mechanicalforex.com/2015/05/
building-algorithmic-trading-systems-for-the-forex-market-part-2-where-to-look. html
Why Statistics?
This is a required course for your degree. It is a prerequisite for many other topics.
Data everywhere, particularly in this big data era. Sampling vs censusing.
Decision Making: Statistics will help you make important decision.
Data everywhere: Example 1: productivity and
standard of living of a nation
wages Labor productivity Japan US U.K. Canada France Germany Italy India Indonesia Bulgaria Egypt Pakistan 80 8 0.8 2 20 200
Figure: High productivity is a key to high standard of living
Two important statistics that a nation is most concerned are the
productivity andstandard of
living. Productivity is usually
measured in terms of output per worker, and standard of living is measured in terms of wages per worker. They are usually strongly related. Countries with high productivity in general are seen with high standard of living (Left figure)
Data everywhere: Example 2: Consumer Price
Index (CPI) in Canada
The Consumer Price Index measures the rate of price change for goods and services bought by consumers in a country.
It is a statistic constructed using the prices of a sample of representative items whose prices are collected periodically. For instance, the CPI All-items for Canada for the month of July 2013 was 123.1 (2000=100), meaning that consumer prices were 23.1% higher in July 2013 than in 2000. [ hyphens]http://www.statcan.gc.ca/tables-tableaux/sum-som/l01/cst01/cpis01a-eng.htm
Data everywhere: Example 2: Consumer Price
Index (CPI) in Canada
The CPI directly or indirectly affects nearly all Canadians.
Old Age Security pensions, Canada Pension Plan payments, and other forms of social and welfare payments are adjusted
periodically to take account of changes in the CPI.
Rental agreements, spousal and child support payments and other forms of contractual and price-setting arrangements are frequently tied in some manner to movements in the CPI. Cost-of-living adjustment (COLA) clauses link wage increases to movements in the CPI. Labour contracts governing the wages of many Canadian workers include COLA clauses.
Sampling vs censusing?
Costs of surveying the entire population may be too large or prohibitive
e.g., Television networks monitor the popularity of their programs;
Destruction of elements during investigation
e.g., Manufacturers estimate the average lifetime of light bulbs; doctors take a blood sample to check for disease.
Unknown future
Decision Making
How do large retailers (like COSTO, Walmart) fill their store shelves so as to meet the customer demand while minimizing their operating cost?
How do doctors in the hospital make diagnose? How do political leaders run their campaign?
How do Investment Banks in Wall Street (or Bay Street) decide which stock (or stocks) to invest?
How do insurance companies decide the premium for a particular client?
How do car dealers decide how many car models of each brand to be kept in their locations?
Section 2
What is Statistics?
What is Statistics?
It is the science and art
of collecting, organizing, and representing data in such a way that the characteristics and patterns of the data can be easily captured (Descriptive Statistics);
also, of estimating attributes and drawing inference from a sample about the entire population (Inferential Statistics).
Examples: descriptive or inferential?
In 1995, 45% of Canadian households owned a computer and 25% were connected to internet; On average, Canadians spend 1.3 hours per day commuting, and 1.5 hours per day with their children.
descriptive
The accounting department of a firm will select a sample of invoices to check for accuracy of all the invoices of the company.
Terminologies
Population vs sample
Population: a set of all elements of interests in an investigation Sample: a subset of all elements of a population
Parameter vs Statistic
Parameter: a measurable characteristic of a Population (such as average, extreme, variation, proportion)
Statistic: a measurable characteristic of a Sample
Section 3
Two methodologies in Science
Deductive: general −→ particular Mathematics: Axioms + logic Inductive: particular −→ general
Most natural sciences, like physics, chemistry, and Statistics... a.k.a, the method of Experimentation
The method of Experimentation
1 Define the experimental goal or a working hypothesis 2 Design an experiment
3 Collect and represent data 4 Estimate the values/relations 5 Draw inferences
Section 4
Variables and types
Variable
1 A variable is a characteristic/attribute of a population/sample that is interest in a particular investigation and the value of the variable can "vary" from one entity to another.
A person’s gender is a variable, which could have the value of "Male" for one person and "Female" for another.
The rank of faculty members in Business Administration is a variable, which could have the value of "Full Professor" for one person, "Associate Professor" for another, and "‘Assistant Professor"’ for yet another.
Temperatures in this classroom is a variable which could have the value of "20" or "100".
Types of Data: Qualitative vs. Quantitative
Variables
Qualitative (a.k.a. categorical)
Qualitative variables take on values that are names or labels. Examples: gender, country names, color
Quantitative (a.k.a, numeric)
Quantitative variables are numeric.
Examples: number of bedrooms in houses, number of minutes to the end of this class, distance between two cities
Types of Data: Discrete vs. Continuous Variables
Quantitative variables can be further classified as discrete or continuous
Continuous Variable: a variable can take on any value within a range;
Examples: the number of minutes to the end of this class, distance between two cities, pressure in a tire, weight of a pork chop, height of students in a class
Discrete variable: a variable can take only certain value (finite or countably infinite) within a range.
Data/variable representation
Language such as English, French, Chinese: This is the natural and direct way
Such as "All the people in Canada"’
Numbers: This is the mathematical and indirect way
such as, data representation in computer: binary numbers: 0,1
Level of measurement
Suppose now we represent all data /variable using numbers, then Level of measurement (a.k.a. scales of measure) are the different ways numbers can be used.
There are four levels of measurements Nominal
Ordinal Interval Ratio
However, representing variables as numbers does not give you the license to perform the regular logical/arithmetic operations all the time (such as comparison, addition, subtraction,
Nominal level
A variable is at the nominal level if none of the five operations (namely comparison, addition, subtraction, multiplication, and division) is meaningful.
At the nominal level of measurement, numbers are assigned to a set of mutually exclusive and exhaustive categories for the purpose of naming, labeling, or classifying the observations, but no arithmetic operation is meaningful;
where
Mutually exclusive: any individual object is included in ONLY ONE CATEGORY
Exhaustive: any individual object MUST APPEAR in one of the categories
Examples
Examples Barcodes
Social insurance numbers (SIN) Student IDs
Phones numbers
The fact that the barcode for one product is higher than that for another, or that your SIN is higher than mine tells us nothing. In surveys we often use arbitrary numbers to code variables such as religion, ethnicity, major in college or gender.
Ordinal level
A variable is at the ordinal level if only comparison is meaningful.
Examples
ExamplesRank of faculty members
Maclean ranking of Canadian colleges:
[hyphens]http://oncampus.macleans.ca/education/2012/11/01/2013-comprehensive/
US News Ranking of US/world colleges http://www.usnews.com/rankings
The differences between data values cannot be determined or are meaningless.
For instance, first class is better than economy and that business is in between. Just how much better first class is
Interval
A variable is at the interval level if only division is meaningless. Reason: Interval data have meaningful intervals between measurements, but there is no true starting point (zero).
Examples
ExamplesTemperatures in Celsius and Fahrenheit are interval data Certainly order is important and intervals are meaningful. However, a20◦Cdashboard is not twice as hot as the10◦C
outside.
A conversion formula between Celsius and Fahrenheit F= 9 5C+32 So10◦C=50◦Fand20◦C=68◦F. Obviously 20◦C 10◦C 6= 68◦F 50◦F
Ratio
A variable is at the ratio level if all logical/arithmetic operations are meaningful.
Reason: Ratios between measurements as well as intervals are meaningful because there is a non-arbitrary zero point.
Examples
Examples Income Distance Height
Summary of level of measurement
comparison addition subtraction multiplication division
nominal x x x x x
ordinal √ x x x x
interval √ √ √ x x
ratio √ √ √ √ √
A general method for identifying the level of
measurement
Ask yourself the following three questions: Is order meaningful?
No! then the data is nominal Is difference meaningful?
No! then the data is ordinal Is zero meaningful
No! then the data is interval Yes! then the data is ratio
Section 5
Sources of Statistical Data
Sources of Statistical Data
Statistics Canada: http://www.statcan.ca Twitter: www.twitter.com
Facebook: www.facebook.com . . ..
Section 6
Software for Statistical Analysis
Software for Statistical Analysis
Many, but the most popular ones are EXCEL (proprietary)
SAS (proprietary) R (open-source) . . ..
Section 7
Materials to learn R
Some online resources to learn R I
R in a nutshell An introduction to R R for Beginners Try R Quick-RThe art of R Programming
www.coursera.org offers some nice courses on R Such asthis one
Section 8
A brief tutorial of R with a case study
R for Statistical Analysis
We will retrieve some stock data via package quantmod from Yahoo Finance
Load package
rm(list=ls())
require("quantmod")
Parameters
startDate <- '1957-03-04' # start of data
endDate <- '2014-01-06' # end of data
Retrieve data
symbols<-c("^GSPC") #symbols<-c("AAPL","FB", "YHOO") if(file.exists("data/GSPC.RData")) { load("data/GSPC.RData") #GSPC_csv <- read.csv("data/GSPC.csv") } else { getSymbols(symbols,src = "yahoo", # OHLC format from = startDate,
#to = endDate,
index.class=c("POSIXt","POSIXct"), # Recommended time series index warnings = FALSE,
adjust=TRUE)
dir.create(file.path('data'), showWarnings = TRUE) save(list="GSPC", file="data/GSPC.RData")
write.csv(GSPC,file="data/GSPC.csv")
}
Plot
options(width=60)
chartSeries(GSPC["2004::2014"]) addMACD()