Data in statistics and in R
2.4 Data import, export and connections
Unless you enter data directly into R (something you should avoid), you will need to know how to import your data to R. To exchange data with those poor souls that do not use R, you also need to learn how to export data. If you routinely need to obtain data from a database management system (DBMS), it may be tedious to export the data from the system and then import it to R. A powerful alternative is to access the data directly from within R. This requires connection to the DBMS. Connecting directly to a DBMS from within R has three important conveniences: Automation (thus minimizing errors), working with a remote DBMS (that is, data that do not reside on your computer) and analysis in real time. R comes with an import/export manual (The R Development Core Team, 2006a). It is well written and you should read it for further details. We will discuss some of these R’s capabilities when we need them.
2.4.1 Import and export
Exporting data from R is almost a mirror to importing data. We will concentrate on importing. There are numerous functions that allow data imports. Some of them
access binary files created with other software. You should always strive to import text data. If the data are in another system and you cannot export them as text from that system, then you need to import the binary files as written by the other system without conversion to text first.
Text data
The easiest way to import data is from a text file. Any self-respecting software allows data export in a text format. All you need to do is make sure you know how the data are arranged in the text file (if you do not, experiment). To import text data, use one of the two: read.table() or read.csv(). Both are almost identical, so we shall use them interchangeably.
Example 2.8. We discussed the WHO data in Example 2.7. Here is how we import them and the first few columns and rows:
> who <- read.table('who.by.continents.and.regions.txt', + sep = '\t', header = TRUE)
> head(who[, 1 : 3], 4)
country continent region
1 Africa Africa <NA>
2 Eastern Africa Africa Eastern Africa 3 Burundi Africa Eastern Africa 4 Comoros Africa Eastern Africa
We tell read.table() that columns are separated by the tab character (sep = '\t').
The first row of the text file holds the headers of the data columns. The data were obtained as an Excel spreadsheet and then saved as a text file with tab as the
sepa-rating character. ut
A useful function to import text data is scan(). You can use it to read files and control various aspects of the file to be read. We find it particularly useful in situations like this: You browse to a web page that contains a single column data; a string of numbers;
something like:
10 50 120 . . .
Then copy the numbers to your clipboard and in R do this:
> new.data <- scan() 1:
The number prompt indicates that you are in input mode. Paste the data you copied to the clipboard and enter a return (extra blank line) when done. You can also use scan()to enter data manually.
60 Data in statistics and in R
Data from foreign systems
The package foreign includes functions that read and write in formats that other system recognize. At the time of writing, you can import data from SAS, DBF, Stata, Minitab, Octave, S, SPSS and Systat. Let us illustrate import from a Stata binary file.
Example 2.9. The data were published in Krivokapich et al. (1999) and down-loaded from the University of California, Los Angeles Department of Statistics site at http://www.stat.ucla.edu/. Import cardiac.dta like this:
> library(foreign)
> cardiac <- read.dta('cardiac.dta')
> head(cardiac[, 1:4]) bhr basebp basedp pkhr
1 92 103 9476 114
2 62 139 8618 120
3 62 139 8618 120
4 93 118 10974 118
5 89 103 9167 129
6 58 100 5800 123
Importing from other systems is done much the same way. ut
2.4.2 Data connections
Here we discuss one way to connect to data stored in formats other than R. Such connections allow us to both import data and manipulate it before importing.
Often, we need to import—or even manipulate—data stored in a variety of for-mats. For example, Microsoft’s Excel is widely used to both analyze and store data.
Another example would be cases where tremendous amounts of data are stored in a DBMS such as Oracle and dBase and large statistical software such as SAS, SPSS and Stata. R is not a DBMS and as such, is not suitable to hold large databases.
Open Data Base Connectivity (ODBC) is a protocol that allows access to database systems (and spreadsheets) that implement it. The protocol is common and is imple-mented in R. Among others, the advantages of connecting to a remote database are:
Data safety and replication, access to more than one database at a time, access to (very) large databases and analysis in real time of changing data. In the next example, we connect to a worksheet in an Excel file. The package RODBC includes the necessary functions.
Example 2.10. An Excel file was obtained from WHO (2004). The file name is who-population-data-2002.xls. Minor editing was necessary to prepare the data for R.
These include, for example, changing the spreadsheet notation for missing data from ..to NA. So we created a new worksheet in Excel, named MyFormat. Here we connect to this worksheet via RODBC and import the data. The task is divided into two steps.
First, we make a connection to the spreadsheet at the operating system level. Then we open the connection from within R.
If you are using a system other than Windows, read up on how to name an ODBC connection. A connection is an object that contains information about the data loca-tion, the data file name, access rights and so on. To name a connecloca-tion, go to your systems Control Panel and locate a program that is roughly named Data Sources (ODBC).2Next, activate the program. A window, named ODBC Data Source Admin-istrator pops up. We are adding a connection, so click on Add .... A window, named Create a New Data Source shows up. From the list of ODBC drivers, choose Microsoft Excel Driver (*.xls) and click on Finish. A new window, named ODBC Microsoft Excel Setup appears. We type who in the Data Source Name entry and something to describe the connection in the Description entry. Then we click on Select Workbook... button.
Next, we navigate to the location of the Excel file and click on it. We finally click on OK enough times to exit the program. So now we have an ODBC connection to the Excel data file. The connection is named who and any software on our system that knows how to use ODBC can open the who connection.
Next, we connect and import the data with
> library(RODBC) ; con <- odbcConnect('who')
> sqlTables(con) ; who <- sqlFetch(con, 'MyFormat')
> odbcClose(con)
Here, we load the RODBC package (by Lapsley and Ripley, 2004) and use odbcConnect() with the argument 'who' (a system-wide known ODBC object) to open a connection named con. sqlTables() lists all the worksheet names that the connection con knows about. In it, we find the worksheet MyFormat. Next, we access the MyFormat worksheet in the data source with sqlFetch(). This function takes two unnamed arguments, the connection (con in our case) and the worksheet name (MyFormat). sqlFetch() returns a data frame object and we assign this object to who. Now who is a data.frame. When done, we close the connection with
odbcClose(). ut
In the next example, we show how to access data from a MySQL DBMS that resides on another computer. We use MySQL not because it is the best but because it is common (and yes, it is free). We recommend the more advanced and open source DBMS from http://archives.postgresql.org.
Example 2.11. We will use a MySQL database server installed on a remote machine.
The database we use is called rtest. Before accessing the data from R, we need to create a system-wide Data Source Name (DSN). To create a DSN, follow these steps:
1. Download the so-called MySQL ODBC driver from http://MySQL.org and install it according to the instructions.
2. Read the instructions that come with the driver on how to create a DSN under a Windows system.
Now in R, we open a connection to the data on the remote server:
> library(RODBC)
> (con <- odbcConnect('rtest', case = 'tolower')) RODB Connection 13
2In Windows XP, the program resides in the Control Panel, under Administrative Tools.
62 Data in statistics and in R
The argument case causes all characters to be converted to lower case. The details tell us that rtest is open, the server has been located and the user ID is root (a user known to the remote DBMS).
We communicate with the data via the Simple Query Language (SQL)—a standard language that provides database facilities. To see what data tables are available in the database, we do
> (sqlTables(con))
TABLE_CAT TABLE_SCHEM TABLE_NAME TABLE_TYPE REMARKS
1 rtest who TABLE MySQL table
Next, we import the data in who into a data frame:
> who.from.MySQL <- sqlQuery(con, 'select * from who') Let us see some data and close the connection:
> head(who.from.MySQL[, 1 : 3])
country continent region
1 Africa Africa <NA>
2 Eastern Africa Africa Eastern Africa 3 Burundi Africa Eastern Africa 4 Comoros Africa Eastern Africa 5 Djibouti Africa Eastern Africa 6 Eritrea Africa Eastern Africa
> (odbcClose(con)) [1] TRUE
Be sure to close a connection once you are done with it. ut Let us upload data from R to the rtest database.
Example 2.12. The data for this example are from United States Department of Justice (2003). It lists all of the 7 658 capital punishment cases in the U.S. between 1973 and 2000 (data prior to 1973 were collapsed into 1973). We load the data, open a connection and use sqlSave():
> load('capital.punishment.rda')
> con <- odbcConnect('rtest')
> (sqlTables(con))
TABLE_CAT TABLE_SCHEM TABLE_NAME TABLE_TYPE REMARKS
1 rtest who TABLE MySQL table
> sqlSave(con, capital.punishment, + tablename = 'capital_punishment')
Next, we check that all went well and close the connection:
> (sqlTables(con))
TABLE_CAT TABLE_SCHEM TABLE_NAME TABLE_TYPE REMARKS
1 rtest capital_punishment TABLE MySQL table
2 rtest who TABLE MySQL table
> odbcClose(con)
Because we cannot use dots for names in MySQL, we specify tablename =
'capital punishment'. ut
In addition to the mentioned connections, you can open connections to files and read and write directly to them and to files on the Web.