
Predicting Location via Indoor Positioning Systems

1.2 The Raw Data

1.2.1 Processing the Raw Data

Now that we have determined the desired target representation of the data in R, we can write the code to extract the data from the input file and manipulate it into that form. The data are not in a rectangular form so we cannot simply use a function such as read.table().

However, there is structure in the observations that we can use to process the lines of text. For example, the main data elements are separated by semicolons. Let’s see how the semicolon splits the fourth line, i.e., the first line that is not a comment:

strsplit(txt[4], ";")[[1]]

[1] "t=1139643118358"

[2] "id=00:02:2D:21:0F:33"

[3] "pos=0.0,0.0,0.0"

[4] "degree=0.0"

[5] "00:14:bf:b1:97:8a=-38,2437000000,3"

[6] "00:14:bf:b1:97:90=-56,2427000000,3"

[7] "00:0f:a3:39:e1:c0=-53,2462000000,3"

[8] "00:14:bf:b1:97:8d=-65,2442000000,3"

[9] "00:14:bf:b1:97:81=-65,2422000000,3"

[10] "00:14:bf:3b:c7:c6=-66,2432000000,3"

[11] "00:0f:a3:39:dd:cd=-75,2412000000,3"

[12] "00:0f:a3:39:e0:4b=-78,2462000000,3"

[13] "00:0f:a3:39:e2:10=-87,2437000000,3"

[14] "02:64:fb:68:52:e6=-88,2447000000,1"

[15] "02:00:42:55:31:00=-84,2457000000,1"

Within each of these shorter strings, the “name” of the variable is separated by an ‘=’ character from the associated value. In some cases this value contains multiple values where the separator is a ‘,’. For example, "pos=0.0,0.0,0.0" consists of 3 position variables that are not named.
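As a small illustration of this two-level structure, the sketch below takes one element by hand and splits it first at the ‘=’ and then at the commas. The variable names element, parts, and values are made up for this example.

```r
# One element of the semicolon-split record (copied from the output above)
element = "pos=0.0,0.0,0.0"

# Split the variable name from its value at the '=' character
parts = strsplit(element, "=")[[1]]

# The value holds three unnamed coordinates separated by commas
values = strsplit(parts[2], ",")[[1]]

parts[1]   # the variable name
values     # the three position values, still character strings
```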

We can take this vector, which we created by splitting on the semi-colon, and further split each element at the ‘=’ characters. Then we can process the resulting strings by splitting them at the ‘,’ characters. This might look something like

unlist(lapply(strsplit(txt[4], ";")[[1]], function(x)

sapply(strsplit(x, "=")[[1]], strsplit, ",")))

We end up with a large character vector with the names and data values from the entire first record as individual “tokens.” We can then rearrange them into the appropriate form.

However, we can do this much more simply and generally using the fact that the split parameter for strsplit() can be a regular expression so we can split on any of several characters in a single function call. This means we can use

tokens = strsplit(txt[4], "[;=,]")[[1]]

to split at a ‘;’, ‘=’ or ‘,’ character.
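The square brackets in "[;=,]" form a regular-expression character class, which matches any one of the listed characters. A minimal sketch on a shortened record with a single signal (the names record and demoTokens are assumptions for illustration, chosen so that tokens is not overwritten):

```r
# A shortened record in the same format as the lines of txt (one signal only)
record = paste0("t=1139643118358;id=00:02:2D:21:0F:33;pos=0.0,0.0,0.0;",
                "degree=0.0;00:14:bf:b1:97:8a=-38,2437000000,3")

# "[;=,]" is a character class: split wherever a ';', '=' or ',' occurs
demoTokens = strsplit(record, "[;=,]")[[1]]

length(demoTokens)   # 10 hand-held tokens plus 4 tokens for the one signal
demoTokens[11:14]    # MAC address, signal, channel, device type
```

Because the colon is not in the character class, the MAC addresses stay intact as single tokens.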

Before we proceed to write much code to read these data, we ask: Can read.table() take a regular expression as a separator? If so, we could use it instead of readLines(). Unfortunately, it cannot. Supporting regular-expression separators would slow down the reading of regular text files quite considerably.

Based on the results of the strsplit(), we have all the data elements from the first row.

The first 10 elements of tokens give the information about the hand-held device:

tokens[1:10]

[1] "t" "1139643118358" "id" "00:02:2D:21:0F:33"

[5] "pos" "0.0" "0.0" "0.0"

[9] "degree" "0.0"

We can extract the values of these variables with

tokens[c(2, 4, 6:8, 10)]

[1] "1139643118358" "00:02:2D:21:0F:33" "0.0"

[4] "0.0" "0.0" "0.0"

We know these correspond to the variables time, MAC address, x, y, z, and orientation.
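One way to keep track of which value is which is to attach labels to the extracted vector. This is only a sketch: the labels below are hypothetical and are not column names fixed by the text at this point.

```r
# The six hand-held values shown above, copied as a literal vector
vals = c("1139643118358", "00:02:2D:21:0F:33", "0.0", "0.0", "0.0", "0.0")

# Hypothetical labels for illustration only
names(vals) = c("time", "mac", "x", "y", "z", "orientation")

vals["mac"]           # look up a value by its label
vals[c("x", "y")]     # the first two position coordinates
```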

Let’s turn our attention to the recorded signals within this observation. These are the remaining values in the split vector, i.e.,

tokens[ - ( 1:10 ) ]

[1] "00:14:bf:b1:97:8a" "-38" "2437000000" "3"

[5] "00:14:bf:b1:97:90" "-56" "2427000000" "3"

[9] "00:0f:a3:39:e1:c0" "-53" "2462000000" "3"

[13] "00:14:bf:b1:97:8d" "-65" "2442000000" "3"

[17] "00:14:bf:b1:97:81" "-65" "2422000000" "3"

[21] "00:14:bf:3b:c7:c6" "-66" "2432000000" "3"

[25] "00:0f:a3:39:dd:cd" "-75" "2412000000" "3"

[29] "00:0f:a3:39:e0:4b" "-78" "2462000000" "3"

[33] "00:0f:a3:39:e2:10" "-87" "2437000000" "3"

[37] "02:64:fb:68:52:e6" "-88" "2447000000" "1"

[41] "02:00:42:55:31:00" "-84" "2457000000" "1"

We can think of these as rows in a 4-column matrix or data frame giving the MAC address, signal, channel, and device type, so let’s unravel these and build a matrix from the values.

Then we can bind these columns with the values from the first 10 entries. We do this with

tmp = matrix(tokens[ - (1:10) ], ncol = 4, byrow = TRUE)
mat = cbind(matrix(tokens[c(2, 4, 6:8, 10)], nrow = nrow(tmp),
                   ncol = 6, byrow = TRUE),
            tmp)

We confirm that we have 11 rows in the matrix, one for each MAC address, and 10 columns, 6 of which have the same value for each MAC address (e.g., position and orientation):

dim(mat)

[1] 11 10
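The six hand-held values appear in every row because matrix() recycles its data to fill the requested number of cells, and byrow = TRUE lays each copy out along a row. A small sketch with 3 rows instead of 11 (the names handheld and rep3 are assumptions for illustration):

```r
# The six hand-held values from the first record
handheld = c("1139643118358", "00:02:2D:21:0F:33", "0.0", "0.0", "0.0", "0.0")

# Ask for 3 rows: the 6 values are recycled to fill all 18 cells,
# and byrow = TRUE places each copy along one row
rep3 = matrix(handheld, nrow = 3, ncol = 6, byrow = TRUE)

dim(rep3)
all(rep3[1, ] == handheld)   # first row matches the original vector
all(rep3[3, ] == handheld)   # and so does the last row
```

Without byrow = TRUE the recycled values would run down the columns instead, so the rows would no longer line up with the six variables.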

We put all this code into a function so we can repeat this operation for each row in the input file. That is,

processLine = function(x) {
  # split the record at every ';', '=' and ','
  tokens = strsplit(x, "[;=,]")[[1]]
  tmp = matrix(tokens[ - (1:10) ], ncol = 4, byrow = TRUE)
  # repeat the 6 hand-held values on each signal row
  cbind(matrix(tokens[c(2, 4, 6:8, 10)], nrow = nrow(tmp),
               ncol = 6, byrow = TRUE), tmp)
}

Let’s apply our function to several lines of the input:

tmp = lapply(txt[4:20], processLine)

Note that we started at the fourth line of the file because the first 3 lines are comments.

The result is a list of 17 matrices and we can determine how many signals were detected at each point with

sapply(tmp, nrow)

[1] 11 10 10 11 9 10 9 9 10 11 11 9 9 9 8 10 14

We have done the hard part. Of course, we want to turn these individual matrices into a single data frame. We can stack the matrices together using do.call(). We might be inclined to write a loop to concatenate the second matrix to the first, the third to the earlier result, and so on. This would be very slow. (Try it!) However, do.call() does this stacking efficiently and simply. We call do.call() with the name of the function to call and a list containing the individual arguments that we ordinarily pass to that function separately. For us, this is as simple as

offline = as.data.frame(do.call("rbind", tmp))
dim(offline)

[1] 170 10
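To make the role of do.call() concrete, the sketch below stacks three toy matrices both ways and checks that the results agree. The names pieces, stacked, and looped are made up for this example, and the toy matrices merely stand in for the per-line results.

```r
# Three small matrices standing in for the per-line results
pieces = list(matrix(1:4, ncol = 2),
              matrix(5:6, ncol = 2),
              matrix(7:12, ncol = 2))

# do.call() passes the list elements as separate arguments, so this is
# the same call as rbind(pieces[[1]], pieces[[2]], pieces[[3]])
stacked = do.call("rbind", pieces)

# The loop alternative grows the result one matrix at a time
looped = pieces[[1]]
for (m in pieces[-1]) looped = rbind(looped, m)

identical(stacked, looped)   # same result either way
dim(stacked)                 # 2 + 1 + 3 rows, 2 columns
```

The loop copies the accumulated result on every iteration, which is why it becomes slow when there are many matrices to stack.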

We are now ready to try this code on the entire dataset. We discard the lines starting with the comment character ‘#’ and then pass each remaining line to processLine().

lines = txt[ substr(txt, 1, 1) != "#" ]
tmp = lapply(lines, processLine)

When we run this, we get 6 warnings of the form

1: In matrix(tokens[c(2, 4, 6:8, 10)], nrow(tmp), 6, byrow = TRUE) :

data length exceeds size of matrix

In general, we want to be very cautious about warning messages.

We can try to find the rows to which these warning messages correspond by exploring the result, but it is easier to catch them as they occur. We can ask R to raise an error when a warning is issued and then browse the call stack to examine the state of the computations.

We do this by setting an option to handle errors and another to change warnings into errors:

options(error = recover, warn = 2)
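The effect of warn = 2 can be seen in isolation, without the interactive recover() browser. In the sketch below, as.numeric("abc") is only a stand-in for any call that issues a warning, and the names old and msg are assumptions for illustration.

```r
# Remember the current settings so they can be restored afterwards
old = options(warn = 2)

# as.numeric("abc") normally returns NA with a warning;
# with warn = 2 the warning is raised as an error instead
msg = tryCatch(as.numeric("abc"),
               error = function(e) conditionMessage(e))

options(old)   # restore the previous warning level
msg            # the error message produced from the warning
```

Restoring the saved options afterwards keeps the stricter setting from leaking into later, unrelated code.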

We run the lapply() call again with these new options:

tmp = lapply(lines, processLine)

When the first warning occurs, we are presented with the message and the call stack

Error in matrix(tokens[c(2, 4, 6:8, 10)], nrow(tmp), 6,
byrow = TRUE) : (converted from warning)
data length exceeds size of matrix

Enter a frame number, or 0 to exit

1: lapply(lines, processLine)
2: lapply(lines, processLine)
3: FUN(c("t=1139643118358;id=00:02:2D:21:0F:33;pos=0.0,0.0,0.0...
4: cbind(matrix(tokens[c(2, 4, 6:8, 10)], nrow(tmp), 6, byrow = TRUE), tmp)
5: matrix(tokens[c(2, 4, 6:8, 10)], nrow(tmp), 6, byrow = TRUE)
6: .signalSimpleWarning("data length exceeds size of matrix",
quote(matrix(tokens[c(2, 4
7: withRestarts({
8: withOneRestart(expr, restarts[[1]])
9: doWithOneRestart(return(expr), restart)

Selection:

We select 3, which corresponds to our processLine() function. We can issue R commands in this call frame. For example, we can see what variables are available with ls(). Looking at the variable x, we see its value is

[1] "t=1139671993259;id=00:02:2D:21:0F:33;pos=12.0,3.0,0.0;

degree=315.1"

What we notice about this value is that we have no signals detected for this position. This observation is an anomalous case, which our processLine() function needs to handle.
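The warning can be reproduced outside the debugger by walking this record through the same steps by hand. This is a sketch under the assumption that the record is exactly the string shown above; the names anomaly, aTokens, aTmp, and warned are made up so that nothing from the session is overwritten.

```r
# The anomalous record shown above: hand-held fields only, no signals
anomaly = "t=1139671993259;id=00:02:2D:21:0F:33;pos=12.0,3.0,0.0;degree=315.1"
aTokens = strsplit(anomaly, "[;=,]")[[1]]
length(aTokens)   # only the 10 hand-held tokens

# Nothing is left for the signal matrix, so it has zero rows
aTmp = matrix(aTokens[-(1:10)], ncol = 4, byrow = TRUE)
dim(aTmp)

# Filling a zero-row matrix with six values is what triggers the warning
warned = tryCatch({
  matrix(aTokens[c(2, 4, 6:8, 10)], nrow = nrow(aTmp), ncol = 6, byrow = TRUE)
  FALSE
}, warning = function(w) TRUE)
warned
```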

We can modify processLine() to discard such observations or alternatively add a single row to the data frame with the hand-held information and NA values for the MAC, signal, channel, and type. We choose to discard these observations as they do not help us in developing our positioning system. We change our function to return NULL if the tokens vector only has 10 elements. Our revised function is

processLine = function(x) {
  tokens = strsplit(x, "[;=,]")[[1]]

  # no signals recorded for this observation, so discard it
  if (length(tokens) == 10)
    return(NULL)

  tmp = matrix(tokens[ - (1:10) ], , 4, byrow = TRUE)
  cbind(matrix(tokens[c(2, 4, 6:8, 10)], nrow(tmp), 6,
               byrow = TRUE), tmp)
}

We run the updated processLine() and see whether the warnings disappear.

options(error = recover, warn = 1)
tmp = lapply(lines, processLine)
offline = as.data.frame(do.call("rbind", tmp), stringsAsFactors = FALSE)

Indeed, we received no warning messages. Our data frame offline has over one million rows:

dim(offline)

[1] 1181628 10
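The stacking step needs no change even though some elements of the list are now NULL, because rbind() ignores NULL arguments. A toy sketch (the names withGap and kept are assumptions for illustration):

```r
# Two small matrices with a NULL between them, mimicking a discarded record
withGap = list(matrix("a", nrow = 2, ncol = 3),
               NULL,
               matrix("b", nrow = 1, ncol = 3))

# rbind() skips the NULL, so only the real rows are stacked
kept = do.call("rbind", withGap)
dim(kept)   # 2 + 1 rows, 3 columns
```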

Our data frame consists of character-valued variables. A next step is to convert these values into appropriate data types, e.g., convert signal strength to numeric, and to further clean the data as needed. This is the topic of the next section.

Source: Deborah Nolan and Duncan Temple Lang, Data Science in R: A Case Studies Approach to Computational Reasoning and Problem Solving, The R Series, Chapman and Hall/CRC (2015), pages 33-37.