1.5 Fundamental structures: Objects, classes, and related concepts

Here we provide a brief introduction to data structures. The Introduction to R (discussed in Section 1.2) provides more comprehensive coverage.

1.5.1 Objects and vectors

Almost everything in R is an object, which may be initially disconcerting to a new user. An object is simply something that R can operate on. Common objects include vectors, matrices, arrays, factors (see 2.4.16), dataframes (akin to datasets in other packages), lists, and functions.

The basic variable structure is a vector. Vectors can be created using the = or <- assignment operators (which assign the evaluated expression on the right-hand side of the operator to the object named on the left-hand side). For instance, the following code creates a vector of length 6 using the c() function to concatenate scalars.

> x = c(5, 7, 9, 13, -4, 8)

Other assignment operators exist, as well as the assign() function (see Section 2.11.8 or help("<-") for more information).
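As a minimal sketch of the assignment forms mentioned above, the following creates the same vector three ways. The names x2 and x3 are illustrative and do not appear elsewhere in the text.

```r
# Three ways to create the same vector of length 6.
x = c(5, 7, 9, 13, -4, 8)             # "=" assignment, as used in the text
x2 <- c(5, 7, 9, 13, -4, 8)           # "<-" assignment
assign("x3", c(5, 7, 9, 13, -4, 8))   # assign() takes the object name as a string

length(x)          # 6
identical(x, x2)   # TRUE
identical(x, x3)   # TRUE
```

The assign() form is mainly useful when the object name is itself computed at run time.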

1.5.2 Indexing

Since vector operations are so common, it is important to be able to access (or index) elements within these vectors. Many different ways of indexing vectors are available. Here, we introduce several of these, using the above example. The command x[2] would return the second element of x (the scalar 7), and x[c(2,4)] would return the vector (7,13). The expressions x[c(T,T,T,T,T,F)], x[1:5] (first through fifth element) and x[-6] (all elements except the sixth) would all return a vector consisting of the first five elements in x. Knowledge of and basic comfort with these approaches to vector indexing are important to effective use of R.
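The indexing forms described above can be tried directly. This sketch redefines x so that it runs on its own; the result names (second, pair, and so on) are illustrative. It spells out TRUE and FALSE rather than T and F, since T and F are ordinary variables that can be overwritten.

```r
x <- c(5, 7, 9, 13, -4, 8)

second <- x[2]            # single position: 7
pair   <- x[c(2, 4)]      # several positions: 7 13

# Three equivalent ways to select the first five elements.
by_logical   <- x[c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE)]
by_range     <- x[1:5]
by_exclusion <- x[-6]

identical(by_logical, by_range)     # TRUE
identical(by_range, by_exclusion)   # TRUE
```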

Operations should be carried out wherever possible in a vector fashion (this is different from some other packages, where data manipulation operations are typically carried out an observation at a time). For example, the following commands demonstrate the use of comparison operators.

> rep(8, length(x))
[1] 8 8 8 8 8 8
> x > rep(8, length(x))
[1] FALSE FALSE  TRUE  TRUE FALSE FALSE
> x > 8
[1] FALSE FALSE  TRUE  TRUE FALSE FALSE

Note that vectors are recycled as needed, as in the last comparison, where the scalar 8 is reused for each element of x. Only the third and fourth elements of x are greater than 8. The comparison returns a logical vector with a value of TRUE or FALSE for each element. A count of elements meeting the condition can be generated using the sum() function.

> sum(x>8)
[1] 2
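Reuse of the shorter vector is not limited to scalars. The sketch below compares x against a length-2 vector, which is an illustrative case not taken from the text, and then counts matches with sum(), which treats TRUE as 1 and FALSE as 0.

```r
x <- c(5, 7, 9, 13, -4, 8)

# The shorter vector is repeated to length 6: 6, 10, 6, 10, 6, 10.
cmp <- x > c(6, 10)
cmp                    # FALSE FALSE TRUE TRUE FALSE FALSE

# Counting elements that satisfy a condition.
n_big <- sum(x > 8)    # 2
```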

The code to create a vector of values greater than 8 is given below.

> largerthan8 = x[x>8]
> largerthan8
[1]  9 13

The command x[x>8] can be interpreted as “return the elements of x for which x is greater than 8.” This construction is sometimes difficult for some new users, but is powerful and elegant. Examples of its application in the book can be found in Sections 2.4.18 and 2.13.4.

Other comparison operators include == (equal), >= (greater than or equal), <= (less than or equal), and != (not equal). Care needs to be taken in comparisons using == if noninteger values are present (see 2.8.5). The which() function (see 3.1.1) can be used to find observations that match a given expression.
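The following sketch illustrates which() and the caution about == with noninteger values. The use of all.equal() as a tolerance-based alternative is a suggestion here, not something the text prescribes, and the names idx and a are illustrative.

```r
x <- c(5, 7, 9, 13, -4, 8)

# which() returns the positions where the condition holds.
idx <- which(x > 8)    # 3 4
x[idx]                 # 9 13

# Exact equality can fail for noninteger values because of
# binary floating-point representation.
a <- 0.1 + 0.2
a == 0.3                       # FALSE with IEEE 754 doubles
isTRUE(all.equal(a, 0.3))      # TRUE: comparison within a small tolerance
```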

1.5.3 Operators

There are many operators defined to carry out a variety of tasks. Many of these were demonstrated in the sample section (assignment, arithmetic) and above examples (comparison). Arithmetic operations include +, -, *, /, ^ (exponentiation), %% (modulus), and %/% (integer division). More information about operators can be found using the help system (e.g., ?"+"). Background information on other operators and precedence rules can be found using help(Syntax).

R supports Boolean operations (OR, AND, NOT, and XOR) using the |, &, ! operators and the xor() function, respectively.
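A short sketch of the arithmetic and Boolean operators listed above, applied to small values and to the example vector. The variable names are illustrative.

```r
# Arithmetic operators
rem  <- 17 %% 5     # modulus: 2
quot <- 17 %/% 5    # integer division: 3
pow  <- 2^10        # exponentiation: 1024

# Boolean operators work element by element on logical vectors.
x <- c(5, 7, 9, 13, -4, 8)
in_range <- (x > 0) & (x < 9)    # TRUE TRUE FALSE FALSE FALSE TRUE
outside  <- (x < 0) | (x > 9)    # FALSE FALSE FALSE TRUE TRUE FALSE
negated  <- !in_range            # FALSE FALSE TRUE TRUE TRUE FALSE
either   <- xor(TRUE, FALSE)     # TRUE: exactly one argument is TRUE
```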

1.5.4 Lists

Lists are ordered collections of objects that are indexed using the [[ operator or by name using the $ operator.

> newlist = list(x1="hello", x2=42, x3=TRUE)
> is.list(newlist)
[1] TRUE
> newlist
$x1
[1] "hello"

$x2
[1] 42

$x3
[1] TRUE

> newlist[[2]]
[1] 42
> newlist$x2
[1] 42
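One point that often trips up new users is the difference between [[ and single-bracket indexing on a list. Single brackets are not discussed in the text above; the sketch below adds them only for contrast, and the names elem, sub, and by_name are illustrative.

```r
newlist <- list(x1 = "hello", x2 = 42, x3 = TRUE)

elem    <- newlist[[2]]       # the element itself: the number 42
by_name <- newlist[["x2"]]    # same element, selected by name
sub     <- newlist[2]         # a list of length 1 that contains x2

is.list(elem)     # FALSE
is.list(sub)      # TRUE
names(newlist)    # "x1" "x2" "x3"
```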

1.5.5 Matrices

Matrices are rectangular objects with two dimensions. We can create a 2 × 3 matrix using our existing vector from Section 1.5.1, display it, and test for its type with the following commands.

> A = matrix(x, 2, 3)
> A
     [,1] [,2] [,3]
[1,]    5    9   -4
[2,]    7   13    8
> dim(A)
[1] 2 3
> # is A a matrix?
> is.matrix(A)
[1] TRUE
> is.vector(A)
[1] FALSE
> is.matrix(x)
[1] FALSE

Comments can be included: any input given after a # character until the next new line is ignored.

Indexing for matrices is done in a similar fashion as for vectors, albeit with a second dimension (denoted by a comma).
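As a brief sketch of matrix indexing, the row index goes before the comma and the column index after it. Leaving one side empty selects every row or every column. The result names are illustrative.

```r
x <- c(5, 7, 9, 13, -4, 8)
A <- matrix(x, 2, 3)      # filled column by column

cell      <- A[2, 3]      # row 2, column 3: 8
first_row <- A[1, ]       # all columns of row 1: 5 9 -4
second_col <- A[, 2]      # all rows of column 2: 9 13
outer_cols <- A[, c(1, 3)]  # columns 1 and 3, still a 2 x 2 matrix
```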

The main way to access data is through a dataframe, which is more general than a matrix. This rectangular object, similar to a dataset in other statistics packages, can be thought of as a matrix with columns of vectors of different types (as opposed to a matrix, which consists of vectors of the same type). The functions data.frame(), read.csv() (see Section 2.1.5), and read.table() (see 2.1.2) return dataframe objects. A simple dataframe can be created using the data.frame() command. Access to subelements is achieved using the $ operator as shown below (see also help(Extract)).

In addition, operations can be performed by column (e.g., calculation of sample statistics):

> y = rep(11, length(x))
> y
[1] 11 11 11 11 11 11
> ds = data.frame(x, y)
> ds
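The following sketch shows column access with $ and a few column-wise calculations on the same dataframe. The choice of mean() and colMeans() as the sample statistics is illustrative, as are the result names.

```r
x <- c(5, 7, 9, 13, -4, 8)
y <- rep(11, length(x))
ds <- data.frame(x, y)

xcol   <- ds$x                # the x column as an ordinary vector
xmean  <- mean(ds$x)          # 38 / 6, about 6.33
cmeans <- colMeans(ds)        # named vector with the mean of each column
big    <- ds$x[ds$x > 8]      # vector indexing works on a column: 9 13
```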

Note that use of data.frame() differs from the use of cbind() (see 2.5.5), which yields a matrix object.

> y = rep(11, length(x))
> y
[1] 11 11 11 11 11 11
> newmat = cbind(x, y)
> newmat

Dataframes can be created from matrices using as.data.frame(), while matrices can be constructed from dataframes using as.matrix() or from vectors using cbind().
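A minimal sketch of these conversions follows. The mixed dataframe with a label column is an illustrative addition, included to show that a matrix holds a single type.

```r
x <- c(5, 7, 9, 13, -4, 8)
y <- rep(11, length(x))

newmat <- cbind(x, y)                   # cbind() of vectors yields a matrix
is.matrix(newmat)                       # TRUE

df_from_mat <- as.data.frame(newmat)    # matrix to dataframe
is.data.frame(df_from_mat)              # TRUE

mat_from_df <- as.matrix(data.frame(x, y))   # dataframe to matrix
is.matrix(mat_from_df)                       # TRUE

# With columns of different types, as.matrix() coerces to a common type.
mixed <- data.frame(x, label = letters[1:6])
is.character(as.matrix(mixed))          # TRUE
```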

Dataframes can be attached using the attach(ds) command (see 2.3.1). After this command, individual columns can be referenced directly (i.e., x instead of ds$x). By default, the dataframe is second in the search path (immediately after the local workspace and ahead of any previously loaded packages or dataframes). Users are cautioned that if there is a variable x in the local workspace, this will be referenced instead of ds$x, even if attach(ds) has been run. Name conflicts of this type are a common problem and care should be taken to avoid them.

The search() function lists attached packages and objects. To avoid cluttering the R workspace, the command detach(ds) should be used once the dataframe is no longer needed. The with() and within() commands (see 2.3.1 and 6.1.3) can also be used to simplify reference to an object within a dataframe without attaching.
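As a sketch of the alternative to attaching, with() evaluates an expression using the dataframe's columns, and within() returns a modified copy of the dataframe. The derived column z and the result names are illustrative.

```r
ds <- data.frame(x = c(5, 7, 9, 13, -4, 8), y = rep(11, 6))

# with(): refer to columns by name without attach(); the result is returned.
m <- with(ds, mean(x + y))        # 38 / 6 + 11, about 17.33

# within(): returns a copy of ds with the new column added; ds is unchanged.
ds2 <- within(ds, {
  z <- x + y
})
names(ds2)    # "x" "y" "z"
names(ds)     # "x" "y"
```

Because neither function changes the search path, the name conflicts described above do not arise.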

Sometimes a package (Section 1.7.1) will define a function (Section 1.6) with the same name as an existing function. This is usually harmless, but to reverse it, detach the package using the syntax detach("package:PKGNAME"), where PKGNAME is the name of the package (see 5.7.6).

The names of all variables within a given dataset (or more generally for subobjects within an object) are provided by the names() command. The names of all objects defined within an R session can be generated using the objects() and ls() commands, which return a vector of character strings.

Objects within the workspace can be removed using the rm() command. To remove all objects, (carefully) run the command rm(list=ls()).
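The sketch below exercises names(), ls(), and rm(). To avoid touching the reader's workspace, it works in a separate environment called scratch, which is an illustrative device and not part of the text's workflow; in an interactive session the envir arguments would normally be omitted.

```r
ds <- data.frame(x = c(5, 7), y = c(11, 11))
vars <- names(ds)                    # "x" "y"

scratch <- new.env()                 # stand-in for the workspace
assign("a", 1, envir = scratch)
assign("b", "two", envir = scratch)

before <- ls(scratch)                # "a" "b"
rm("a", envir = scratch)             # remove a single object
after_one <- ls(scratch)             # "b"

rm(list = ls(scratch), envir = scratch)   # remove everything that remains
after_all <- ls(scratch)             # character(0)
```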

The print() and summary() functions can be used to display brief or more extensive descriptions, respectively, of an object. Running print(object) at
