8.2 Reading in External Data Files
8.2.1 The Table Format
Table-format files are best thought of as plain-text files with three key features that fully define how R should read the data.
Header If a header is present, it’s always the first line of the file. This optional feature is used to provide names for each column of data. When importing a file into R, you need to tell the software whether a header is present so that it knows whether to treat the first line as variable names or, alternatively, observed data values.
Delimiter The all-important delimiter is a character used to separate the entries in each line. The delimiter character cannot be used for anything else in the file. This tells R when a specific entry begins and ends (in other words, its exact position in the table).
Missing value This is another unique character string used exclusively to denote a missing value. When reading the file, R will turn these entries into the form it recognizes: NA.
Typically, these files have a .txt extension (highlighting the plain-text style) or .csv (for comma-separated values).
Let’s try an example, using a variation on the data frame mydata as defined at the end of Section 5.2.2. Figure 8-2 shows an appropriate table-format file called mydatafile.txt, which has the data from that data frame with a few values now marked as missing. This data file can be found on the book’s website at https://www.nostarch.com/bookofr/, or you can create it yourself from Figure 8-2 using a text editor.
Figure 8-2: A plain-text table-format file
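If you would rather create the file from within R than in a text editor, the following is a minimal sketch. The file contents are reconstructed from the description of Figure 8-2 and the imported result shown in this section, and the file is written to R's temporary folder (via tempdir) rather than a real user directory. Both of these choices, along with the object names file_lines and mydatafile_path, are assumptions made for illustration.

```r
# Contents of the table-format file: a header line, space-delimited
# entries, and an asterisk (*) marking each missing value.
# (Reconstructed for illustration; compare with Figure 8-2.)
file_lines <- c("person age sex funny age.mon",
                "Peter * M High 504",
                "Lois 40 F * 480",
                "Meg 17 F Low 204",
                "Chris 14 M Med 168",
                "Stewie 1 M High *",
                "Brian * M Med *")

# Write one record per line to a file in R's temporary folder
# (the location is an assumption; substitute any folder you like).
mydatafile_path <- file.path(tempdir(), "mydatafile.txt")
writeLines(file_lines, con=mydatafile_path)

# Confirm that the file exists and inspect its raw contents
file.exists(mydatafile_path)
readLines(mydatafile_path)
```

The character string stored in mydatafile_path can then be supplied wherever a file location is needed.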
Note that the first line is the header, the values are delimited with a single space, and missing values are denoted with an asterisk (*). Also, note that each new record is required to start on a new line. Suppose you’re handed this plain-text file for data analysis in R. The ready-to-use command read.table imports table-format files, producing a data frame object, as follows:
R> mydatafile <- read.table(file="/Users/tdavies/mydatafile.txt", header=TRUE,sep=" ",na.strings="*", stringsAsFactors=FALSE)
R> mydatafile
  person age sex funny age.mon
1  Peter  NA   M  High     504
2   Lois  40   F  <NA>     480
3    Meg  17   F   Low     204
4  Chris  14   M   Med     168
5 Stewie   1   M  High      NA
6  Brian  NA   M   Med      NA
In a call to read.table, file takes a character string with the filename and folder location (using forward slashes), header is a logical value telling R whether file has a header (TRUE in this case), sep takes a character string providing the delimiter (a single space, " ", in this case), and na.strings requests the characters used to denote missing values ("*" in this case).
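To see why na.strings matters, here is a small sketch that reads the same records twice, once with na.strings="*" and once without it. To keep the example self-contained, the records are passed through the text argument of read.table instead of file; the object names file_lines, with_na, and without_na are hypothetical, and the records themselves are reconstructed from the output shown earlier.

```r
# The records of mydatafile.txt, one element per line (reconstructed
# for illustration from the imported data frame shown earlier).
file_lines <- c("person age sex funny age.mon",
                "Peter * M High 504",
                "Lois 40 F * 480",
                "Meg 17 F Low 204",
                "Chris 14 M Med 168",
                "Stewie 1 M High *",
                "Brian * M Med *")

# Asterisks are declared as missing values, so they become NA
with_na <- read.table(text=file_lines,header=TRUE,sep=" ",
                      na.strings="*",stringsAsFactors=FALSE)

# Asterisks are NOT declared, so R treats "*" as ordinary data
without_na <- read.table(text=file_lines,header=TRUE,sep=" ",
                         stringsAsFactors=FALSE)

# With na.strings="*", age is numeric with missing entries;
# without it, the stray "*" forces the whole column to character strings
class(with_na$age)
class(without_na$age)

# Count the missing values recognized in each column
colSums(is.na(with_na))
```

The comparison suggests a quick check after any import: if a column you expected to be numeric comes back as character strings, an undeclared missing-value marker is one likely cause.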
If you’re reading in multiple files and don’t want to type the entire folder location each time, it’s possible to first set your working directory via setwd (Section 1.2.3) and then simply use the filename and its extension as the character string supplied to the file argument. However, both approaches require you to know exactly where your file is located when you’re working at the R prompt. Fortunately, R possesses some useful additional tools should you forget your file’s precise location. You can view textual output of the contents of any folder by using list.files. The following example betrays the messiness of my local user directory.
R> list.files("/Users/tdavies")
 [1] "bands-SCHIST1L200.txt" "Brass"                 "Desktop"
 [4] "Documents"             "DOS Games"             "Downloads"
 [7] "Dropbox"               "Exercise2-20Data.txt"  "Google Drive"
[10] "iCloud"                "Library"               "log.txt"
[13] "Movies"                "Music"                 "mydatafile.txt"
[16] "OneDrive"              "peritonitis.sav"       "peritonitis.txt"
[19] "Personal9414"          "Pictures"              "Public"
[22] "Research"              "Rintro.tex"            "Rprofile.txt"
[25] "Rstartup.R"            "spreadsheetfile.csv"   "spreadsheetfile.xlsx"
[28] "TakeHome_template.tex" "WISE-P2L"              "WISE-P2S.txt"
[31] "WISE-SCHIST1L200.txt"
One important feature to note here, though, is that it can be difficult to distinguish between files and folders. Files will typically have an extension, and folders won’t; however, WISE-P2L is a file that happens to have no extension and looks no different from any of the listed folders.
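If you need to tell files and folders apart programmatically, one option is the isdir component returned by file.info. The sketch below is only an illustration: it builds a small throwaway folder inside tempdir, and the entry names (borrowed loosely from the listing above) and the object names demo_dir, entries, and is_folder are hypothetical.

```r
# Build a small throwaway folder containing one subfolder and two files,
# one of which has no extension (all names are hypothetical).
demo_dir <- file.path(tempdir(), "listing-demo")
dir.create(demo_dir, showWarnings=FALSE)
dir.create(file.path(demo_dir, "Research"), showWarnings=FALSE)
file.create(file.path(demo_dir, "WISE-P2L"))
file.create(file.path(demo_dir, "mydatafile.txt"))

# list.files returns names only by default; full.names=TRUE returns
# paths that can be passed on to other functions
entries <- list.files(demo_dir, full.names=TRUE)

# file.info reports, among other things, whether each entry is a folder
is_folder <- file.info(entries)$isdir

basename(entries[is_folder])    # the folders
basename(entries[!is_folder])   # the files, with or without an extension
```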
You can also find files interactively from R. The file.choose command opens your filesystem viewer directly from the R prompt, just as any other program does when you want to open something. Then, you can navigate to the folder of interest, and after you select your file (see Figure 8-3), only a character string is returned.
R> file.choose()
[1] "/Users/tdavies/mydatafile.txt"
Figure 8-3: My local file navigator opened as the result of a call to file.choose. When the file of interest is opened, the R command returns the full file path to that file as a character string.
This command is particularly useful, as it returns the full file path as a character string in precisely the format that’s required for a command such as read.table. So, calling the following line and selecting mydatafile.txt, as in Figure 8-3, will produce an identical result to the explicit use of the file path in file, shown earlier:
R> mydatafile <- read.table(file=file.choose(),header=TRUE,sep=" ", na.strings="*",stringsAsFactors=FALSE)
If your file has been successfully loaded, you should be returned to the R prompt without receiving any error messages. You can check this with a call to mydatafile, which should return the data frame. When importing data into data frames, keep in mind the difference between character string observations and factor observations. No factor attribute information is stored in the plain-text file, but in versions of R before 4.0.0, read.table converts non-numeric values into factors by default (from R 4.0.0 onward, the default is stringsAsFactors=FALSE). Here, you want to keep some of your data saved as strings, so set stringsAsFactors=FALSE explicitly, which prevents R from treating all non-numeric elements as factors regardless of version. This way, person, sex, and funny are all stored as character strings.
You can then overwrite sex and funny with factor versions of themselves if you want them as that data type.
R> mydatafile$sex <- as.factor(mydatafile$sex)
R> mydatafile$funny <- factor(x=mydatafile$funny,levels=c("Low","Med","High"))
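The two conversions behave slightly differently, which the following sketch tries to make visible. It works on two stand-alone character vectors that mirror the sex and funny columns of the imported data; the vector names and the _fac suffix are hypothetical, used only to keep the example self-contained.

```r
# Stand-alone copies of the two columns, as character strings
sex <- c("M","F","F","M","M","M")
funny <- c("High",NA,"Low","Med","High","Med")

# as.factor chooses the levels itself, sorting the unique strings
sex_fac <- as.factor(sex)

# factor with an explicit levels argument fixes the order of the levels
funny_fac <- factor(x=funny,levels=c("Low","Med","High"))

levels(sex_fac)      # levels in sorted order: "F" then "M"
levels(funny_fac)    # levels in the order supplied
is.na(funny_fac)     # the missing entry remains missing after conversion
```

Supplying levels explicitly is worthwhile whenever the categories have a natural order that differs from alphabetical order, as with Low, Med, and High here.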