Statistical software is essential for data manipulation and analysis. It is also used to carry out numerical calculations, to produce graphics, and to simulate probability models. There are many statistical software systems; some of the most comprehensive and popular are SAS, S-Plus, SPSS, Stata, Systat, Minitab and R. Spreadsheet software such as Excel is also useful.
In this course we use the R software system. It is an open source package that has extensive statistical capabilities and very good graphics procedures. The R home page is www.r-project.org where a free download is available for most common operating systems. Some of the basics of R are described in the next section. We use R for several purposes: to manipulate and graph data, to fit and check statistical models, to estimate attributes or test hypotheses, and to simulate data from probability models.
Using R
Lots of online help is available in R. You can use a search engine to find the answer to most questions. For example, if you search for “R tutorial”, you will find a number of excellent introductions to R that explain how to carry out most tasks. Within R, you can find help for a specific function using the command help(function name) but it is often easier to look externally using a search engine.
Here we show how to use R on a Windows machine. You should have R open as you read this material so you can play along.
Some R Basics
R is command-line driven. For example, if you want to define a quantity x, use the assignment function <- (that is, < followed by -).
x <- 15
or, (a slight complication)
x <- c(1, 3, 5)
so x is a column vector with elements 1, 3, 5.
A few general comments
If you want to change x, you can up-arrow to return to the assignment and make the change you want, followed by a carriage return.
If you are doing something more complicated, you can type the code in Notepad or some other text editor (Word is not advised!) and cut and paste the code into R.
You can save your session and, if you choose, it will be restored the next time you open R.
1.6. STATISTICAL SOFTWARE AND R 29
You can add comments by entering # with the comment following on the same line.
Vectors
Vectors can consist of numbers or other symbols; we will consider only numbers here. Vectors are defined using the function c( ). For example,
x <- c(1, 3, 5, 7, 9)
defines a vector of length 5 with the elements given. You can display the vector by typing x and carriage return. Vectors and other objects possess certain attributes. For example, typing
length(x) will give the length of the vector x.
You can cut and paste comma-delimited strings of data into the function c(). This is one way to enter data into R. See below to learn how you can read a file into R.
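As a small illustration of the points above, the following sketch builds a vector with c(), inspects it, and parses a comma-delimited string. The data values are made up, and the use of scan() is an assumption added here as an alternative to pasting the values inside c(); it is not part of the course files.

```r
# Define a numeric vector and inspect it
x <- c(1, 3, 5, 7, 9)
x                    # display the vector
length(x)            # number of elements: 5

# A comma-delimited string, as might be copied from another program
# (these values are made up for illustration)
pasted <- "2.1, 3.4, 5.9, 8.0"

# scan() parses the string into a numeric vector
z <- scan(text = pasted, sep = ",")
length(z)            # 4
```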
Arithmetic
R can be used as a calculator. Enter the calculation after the prompt > and hit return as shown below.
> 7+3
[1] 10
> 7*3
[1] 21
> 7/3
[1] 2.333333
> 2^3
[1] 8
You can save the result of the calculation by assigning it to a variable, such as y<-7+3
Some Functions
There are many functions in R. Most operate on vectors in a transparent way, as do arithmetic operations. For example, if x and y are vectors then x + y adds the vectors element-wise. If x and y are of different lengths, R recycles the shorter vector to match the longer one, which can give surprising results! Some examples, with comments, follow.
> x<-c(1,3,5,7,9) # Define a vector x
> x # Display x
[1] 1 3 5 7 9
30 1. INTRODUCTION TO STATISTICAL SCIENCES
> y<-seq(1,2,0.25) # Define a vector y whose elements are an arithmetic progression
> y # Display y
[1] 1.00 1.25 1.50 1.75 2.00
> y[2] # Display the second element of vector y
[1] 1.25
> y[c(2,3)] # Display the vector consisting of the 2nd and 3rd elements of vector y
[1] 1.25 1.50
> mean(x) # Computes average of the elements of vector x
[1] 5
> summary(x) # A useful function which summarizes features of a vector x
Min. 1st Qu. Median Mean 3rd Qu. Max.
1 3 5 5 7 9
> sd(x) # Computes the (sample) standard deviation of the elements of x
[1] 3.162278
> exp(1) # The exponential function
[1] 2.718282
> exp(y)
[1] 2.718282 3.490343 4.481689 5.754603 7.389056
> round(exp(y),2) # round(y,n) rounds the elements of vector y to n decimal places
[1] 2.72 3.49 4.48 5.75 7.39
> x+2*y
[1] 3.0 5.5 8.0 10.5 13.0
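The remark above about vectors of different lengths refers to R's recycling rule: the shorter vector is repeated until it matches the length of the longer one. A brief sketch, using vectors a, b and d that are made up for illustration:

```r
a <- c(1, 2, 3, 4)
b <- c(10, 20)

# b is recycled to (10, 20, 10, 20) before the addition
a + b                # 11 22 13 24

# When the longer length is not a multiple of the shorter one,
# R still recycles but issues a warning
d <- c(100, 200, 300)
a + d                # 101 202 303 104, with a warning
```

Checking length() of both vectors before combining them is a simple way to avoid this kind of surprise.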
We often want to compare summary statistics of variate values by group (such as sex). We can use the by() function. For example,
> y<-rnorm(100) # y is a vector of length 100 with entries generated at random from G(0,1) dist’n
> x<-c(rep(1,50),rep(2,50)) # x is a vector of length 100 with 50 1’s followed by 50 2’s
> by(y,x,summary) # generates a summary for the elements of y for each value of the grouping variable x
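A reproducible variant of this comparison can be sketched as follows. The call to set.seed() and the use of tapply() are additions for illustration: set.seed() fixes the random numbers so that the results can be repeated, and tapply() returns one number per group in a compact form. The seed value is arbitrary.

```r
set.seed(231)                        # arbitrary seed, chosen for illustration
y <- rnorm(100)                      # 100 values from the G(0,1) distribution
x <- c(rep(1, 50), rep(2, 50))       # grouping variable: 50 ones, then 50 twos

by(y, x, summary)                    # full summary of y within each group

group_means <- tapply(y, x, mean)    # mean of y within each group
group_sds   <- tapply(y, x, sd)      # sample standard deviation within each group
group_means
group_sds
```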
Graphs
Note that in R, a graphics window opens automatically when a graphical function is used. A useful way to create several plots in the same window is the function par() so, for example, following the command
par(mfrow=c(2,2))
the next 4 plots will be placed in a 2 × 2 array within the same window. There are various plotting and graphical functions. Three useful ones are
plot(y~x) # Generates a scatterplot of y versus x and thus x and y must be of the same length
hist(y) # Creates a frequency histogram based on the values in the vector y. To get a relative frequency histogram (areas of rectangles sum to one) use hist(y,freq=F)
boxplot(y~x) # Creates side-by-side boxplots of the values of y for each value of x
You can control the axes of plots (especially useful when you are making comparisons) by including xlim = c(a, b) and ylim = c(d, e) as arguments separated by commas within the plotting function. You can also label the axes by including xlab = "yourchoice" and ylab = "yourchoice". A title can be added using main = "yourchoice". There are many other options. Check out the HTML help “An Introduction to R” for more information on plotting.
To save a graph, you can copy and paste it into a Word document, for example, or alternatively use the “Save as” menu to create a file in one of several formats.
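The options above can be combined as in the following sketch. The data are simulated and the labels are placeholders. Writing the plots to a PDF file with pdf() and dev.off() is shown as an alternative to the menu; it is an assumption about a typical workflow rather than the method used in these notes, and the file name is hypothetical.

```r
set.seed(1)                                   # arbitrary seed for repeatability
x <- c(rep(1, 50), rep(2, 50))                # two groups of 50
y <- rnorm(100)                               # simulated G(0,1) values

out_file <- file.path(tempdir(), "plots.pdf") # hypothetical output file
pdf(out_file)                                 # send plots to a file instead of a window
par(mfrow = c(2, 2))                          # 2 x 2 array of plots
plot(y ~ x, xlab = "Group", ylab = "y", main = "Scatterplot")
hist(y, freq = FALSE, xlim = c(-4, 4), xlab = "y",
     main = "Relative frequency histogram")
boxplot(y ~ x, xlab = "Group", ylab = "y", ylim = c(-4, 4),
        main = "Boxplots by group")
dev.off()                                     # close the file so that it is written
```

Using the same xlim or ylim in every panel keeps the scales identical, which makes visual comparisons between groups easier.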
Probability Distributions
There are functions which compute values of probability functions or probability density functions, cumulative distribution functions, and quantiles for various distributions. It is also possible to generate random samples from these distributions. Some examples follow for the Gaussian distribution. For other distributions, type help(distributionname) or check the “Introduction to R” in the HTML help menu.
> y<- rnorm(10,25,5) # Generate 10 random values from the G(25,5) dist’n and store the values in the vector y
> y # Display the values
[1] 22.50815 26.35255 27.49452 22.36308 21.88811 26.06676 18.16831 30.37838
[9] 24.73396 27.26640
> pnorm(1,0,1) # Compute P(Y<=1) for a G(0,1) random variable
[1] 0.8413447
> qnorm(.95,0,1) # Find the 0.95 quantile for G(0,1)
[1] 1.644854
> dnorm(2,1,3) # Compute value of G(1,3) p.d.f. at y=2
[1] 0.1257944
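The same naming pattern applies to other distributions in R: the prefix d gives the probability function or density, p the cumulative distribution function, q the quantiles, and r random values. A short sketch, using the Binomial distribution as an illustrative choice:

```r
# qnorm() inverts pnorm(): the quantile of a probability returns the original value
p <- pnorm(1, 0, 1)         # P(Y <= 1) for G(0,1)
qnorm(p, 0, 1)              # 1

# Binomial(10, 0.5): probability function and cumulative distribution function
dbinom(3, 10, 0.5)          # P(Y = 3)  = 120/1024 = 0.1171875
pbinom(3, 10, 0.5)          # P(Y <= 3) = 176/1024 = 0.171875

# Five random values from Binomial(10, 0.5)
rbinom(5, 10, 0.5)
```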
Reading data from a file
R stores and retrieves data from the current working directory. You can use the command
getwd()
to determine the current working directory. To change the working directory, look in the File menu for “Change dir” and browse until you reach your choice.
There are many ways to read data into R. The files we used in Chapter 1 are in .txt format with the variate labels in the first row separated by spaces and the corresponding variate values in subsequent rows. We created the files from Excel and then saved them as text files.
To read such files, first be sure the file is in your working directory. Then use the commands
a<-read.table('filename.txt',header=T) #filename in single quotes
attach(a)
The “header=T” tells R that the variate names are in the first row of the data file. The object a is called a data frame in R and the variate names are of the form “a$v1” where v1 is the name of the first column in the file. The R function attach(a) allows you to drop the a$ from the variate names.
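A self-contained sketch of this workflow follows. The file name and the values are made up for illustration, and the file is first written to a temporary directory so that the example can be run anywhere. The use of with() at the end is an addition: it gives access to the columns by name without attaching the data frame.

```r
# Create a small space-delimited text file with variate labels in the first row
# (hypothetical file and values, for illustration only)
example_file <- file.path(tempdir(), "example.txt")
writeLines(c("hour machine volume",
             "1 1 357.8",
             "1 2 358.7",
             "2 1 356.6"),
           example_file)

a <- read.table(example_file, header = TRUE)  # header = TRUE: names are in row 1
names(a)            # "hour" "machine" "volume"
a$volume            # a column is referred to as a$name
mean(a$volume)

# with() evaluates an expression using the columns of a directly
with(a, mean(volume))
```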
Writing data to a file
You can cut and paste output generated by R in the session window, although the formatting is usually lost. This approach works best for figures. You can write an R vector or other object to a text file through
write(y,file="filename")
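As a sketch, the following writes a vector to a text file and reads it back. The file names and values are hypothetical, and write.table() is mentioned as one option for data frames that the text above does not cover.

```r
y <- c(22.5, 26.4, 27.5, 22.4)                 # made-up values for illustration
out_file <- file.path(tempdir(), "y.txt")      # hypothetical file name

write(y, file = out_file, ncolumns = 1)        # one value per line
y_back <- scan(out_file)                       # read the values back
y_back

# A data frame can be written with write.table()
a <- data.frame(hour = 1:2, volume = c(357.8, 358.7))
table_file <- file.path(tempdir(), "a.txt")
write.table(a, file = table_file, row.names = FALSE, quote = FALSE)
a_back <- read.table(table_file, header = TRUE)
a_back
```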
Example 1.5.2 Revisited
In the file ch1example152.txt, there are three columns labelled hour, machine and volume. The data are
hour  machine  volume    hour  machine  volume
1     1        357.8     21    1        356.5
1     2        358.7     21    2        357.3
2     1        356.6     22    1        356.9
...   ...      ...       ...   ...      ...
Here is R code which could be used for these data:
#Read in the data
a<-read.table('ch1example152.txt',header=T)
attach(a)
#Calculate summary statistics and standard deviation by machine
by(volume,machine,summary)
by(volume,machine,sd)
#Separate the volumes by machine into separate vectors v1 and v2
v1<-volume[seq(1,79,2)] # Puts machine 1 values in vector v1
v2<-volume[seq(2,80,2)] # Puts machine 2 values in vector v2
h<-1:40
#Plot run charts by machine, one above the other,
#type='l' joins the points on the plots
par(mfrow=c(2,1)) # Creates 2 plotting areas, one above the other
plot(v1~h,xlab='Hour',ylab='volume',main='New Machine',ylim=c(355,360),type='l')
plot(v2~h,xlab='Hour',ylab='volume',main='Old Machine',ylim=c(355,360),type='l')
#Plot side by side relative frequency histograms
#and overlay Gaussian densities for each machine
par(mfrow=c(1,2)) # Creates 2 plotting areas side by side
br<-seq(355,360,0.5) # Defines interval endpoints for the histograms
hist(v1,br,freq=F,xlab='volume',ylab='density',main='New Machine')
w1<-356.8+0.538*seq(-3,3,0.01) # Values where Gaussian p.d.f. is located
dd1<-dnorm(w1,356.8,0.53)
points(w1,dd1,type='l') # Superimpose Gaussian p.d.f.
hist(v2,br,freq=F,xlab='volume',ylab='density',main='Old Machine')
w2<-357.5+0.799*seq(-3,3,0.01) # Values where Gaussian p.d.f. is located
dd2<-dnorm(w2,357.5,0.8)