Predicting Location via Indoor Positioning Systems
We run the updated processLine() and see if the warnings disappear.
options(error = recover, warn = 1)
tmp = lapply(lines, processLine)
offline = as.data.frame(do.call("rbind", tmp), stringsAsFactors = FALSE)
Indeed, we received no warning messages. Our data frame offline has over one million rows:
dim(offline)
[1] 1181628 10
Our data frame consists of character-valued variables. A next step is to convert these values into appropriate data types, e.g., convert signal strength to numeric, and to further clean the data as needed. This is the topic of the next section.
1.3 Cleaning the Data and Building a Representation for Analysis
A simple first step to creating a data structure for analysis is to put meaningful names on the variables and convert them to the appropriate data type. We begin by adding names with
names(offline) = c("time", "scanMac", "posX", "posY", "posZ",
"orientation", "mac", "signal",
"channel", "type")
Then we convert the time, position, orientation, and signal variables to numeric with
numVars = c("time", "posX", "posY", "posZ",
            "orientation", "signal")
offline[ numVars ] = lapply(offline[ numVars ], as.numeric)
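To see the idiom at a small scale, the following sketch applies the same lapply() assignment to a made-up two-row data frame. The name toy and all of its values are hypothetical and are not taken from the offline data.

```r
# Hypothetical toy data frame with character-valued columns
toy = data.frame(time = c("1139643118358", "1139643118744"),
                 signal = c("-38", "-88"),
                 mac = c("00:14:bf:b1:97:8a", "00:14:bf:b1:97:90"),
                 stringsAsFactors = FALSE)

toyNumVars = c("time", "signal")
toy[ toyNumVars ] = lapply(toy[ toyNumVars ], as.numeric)

# Only the selected columns change type; mac stays character
sapply(toy, class)
```

Assigning into toy[ toyNumVars ] replaces just those columns and leaves the rest of the data frame untouched.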
We can also change the type of the device to something more comprehensible than the numbers 1 and 3. To do this, we can make it a factor with the levels, say, "adhoc" and "access point". However, in our analysis, we plan to use only the signal strengths measured to the fixed access points to develop and test our model. Given this, we drop all records for adhoc measurements and remove the type variable from our data frame. We do this with
offline = offline[ offline$type == "3", ]
offline = offline[ , "type" != names(offline) ]
dim(offline)
[1] 978443 9
We have removed over 200,000 records from our data frame (1,181,628 - 978,443 = 203,185).
Next we consider the time variable. According to the documentation, time is measured in the number of milliseconds from midnight on January 1st, 1970. This is the origin used for the POSIXt format, but with POSIXt, it is the number of seconds, not milliseconds. We can scale the value of time to seconds and then simply set the class of the time element in order to have the values appear and operate as date-times in R. We keep the more precise time in rawTime just in case we need it. We perform the conversion as follows:
offline$rawTime = offline$time
offline$time = offline$time/1000
class(offline$time) = c("POSIXt", "POSIXct")
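As a quick check of the scaling, the sketch below converts a single made-up millisecond timestamp (the value of rawMs is hypothetical) with as.POSIXct() and compares it with the result of setting the class directly. The UTC time zone is our assumption, chosen so the printed value does not depend on the local settings.

```r
# A made-up timestamp in milliseconds since 1970-01-01 00:00:00 UTC
rawMs = 1139643118358

# as.POSIXct() expects seconds, so divide by 1000 first
tm = as.POSIXct(rawMs / 1000, origin = "1970-01-01", tz = "UTC")
format(tm, "%Y-%m-%d %H:%M:%S")

# Setting the class directly, as in the text, stores the same number of seconds
tm2 = rawMs / 1000
class(tm2) = c("POSIXt", "POSIXct")
as.numeric(tm2) == as.numeric(tm)
```

Forgetting the division by 1000 would place the dates tens of thousands of years in the future, so printing a few converted values is a cheap sanity check.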
Now that we have completed these conversions, we check the types of the variables in the data frame with
unlist(lapply(offline, class))
and verify that they are what we want:
      time1       time2     scanMac        posX        posY
   "POSIXt"   "POSIXct" "character"   "numeric"   "numeric"
       posZ orientation         mac      signal     channel
  "numeric"   "numeric" "character"   "numeric" "character"
    rawTime
  "numeric"
We have the correct shape for the data and even the correct types. We next verify that the actual values of the data look reasonable. There are many approaches we can take to do this. We start by looking at a summary of each numeric variable with
summary(offline[, numVars])
      time                          posX         posY
 Min.   :2006-02-10 23:31:58   Min.   : 0   Min.   : 0.0
 1st Qu.:2006-02-11 05:21:27   1st Qu.: 2   1st Qu.: 3.0
 Median :2006-02-11 11:57:58   Median :12   Median : 6.0
 Mean   :2006-02-16 06:57:37   Mean   :14   Mean   : 5.9
 3rd Qu.:2006-02-19 06:52:40   3rd Qu.:23   3rd Qu.: 8.0
 Max.   :2006-03-09 12:41:10   Max.   :33   Max.   :13.0
      posZ      orientation       signal
 Min.   :0   Min.   :  0   Min.   :-99
 1st Qu.:0   1st Qu.: 90   1st Qu.:-69
 Median :0   Median :180   Median :-60
 Mean   :0   Mean   :167   Mean   :-62
 3rd Qu.:0   3rd Qu.:270   3rd Qu.:-53
 Max.   :0   Max.   :360   Max.   :-25
We also convert the character variables to factors and examine them with
summary(sapply(offline[ , c("mac", "channel", "scanMac")],
               as.factor))
                mac                 channel
 00:0f:a3:39:e1:c0:145862   2462000000:189774
 00:0f:a3:39:dd:cd:145619   2437000000:152124
 00:14:bf:b1:97:8a:132962   2412000000:145619
 00:14:bf:3b:c7:c6:126529   2432000000:126529
 00:14:bf:b1:97:90:122315   2427000000:122315
 00:14:bf:b1:97:8d:121325   2442000000:121325
 (Other)          :183831   (Other)   :120757
              scanMac
 00:02:2D:21:0F:33:978443
After examining these summaries, we find a couple of anomalies:
• There is only one value for scanMac, the MAC address for the hand-held device from which the measurements were taken. We might as well discard this variable from our data frame. However, we may want to note this value to compare it with the online data.
• All of the values for posZ, the elevation of the hand-held device, are 0. This is because all of the measurements were taken on one floor of the building. We can eliminate this variable also.
We modify our data frame accordingly,
offline = offline[ , !(names(offline) %in% c("scanMac", "posZ"))]
1.3.1 Exploring Orientation
According to the documentation, we should have only 8 values for orientation, i.e., 0, 45, 90, ..., 315. We can check this with
length(unique(offline$orientation))
[1] 203
Clearly, this is not the case. Let’s examine the distribution of orientation:
plot(ecdf(offline$orientation))
An annotated version of this plot appears in Figure 1.2. It shows the orientation values are distributed in clusters around the expected angles. Note the values near 0 and near 360 refer to the same direction. That is, an orientation value 1 degree before 0 is reported as 359 and 1 degree past 0 is a 1.
Although the experiment was designed to measure signal strength at 8 orientations – 45 degree intervals from 0 to 315 – these orientations are not exact. However, it may be useful in our analysis to work with values corresponding to the 8 equi-spaced angles. That is, we want to map 47.5 to 45, and 358.2 to 0, and so on. To do this, we take each value and find out to which of the 8 orientations it is closest and we return that orientation. We must handle values such as 358.2 carefully as we want to map them to 0, not to the closer 315. The following function makes this conversion:
Figure 1.2: Empirical CDF of Orientation for the Hand-Held Device. [Plot: horizontal axis "orientation" with ticks at 0, 45, 90, ..., 360; vertical axis "Empirical CDF" from 0.0 to 1.0.] This empirical distribution function of orientation shows that there are 8 basic orientations that are 45 degrees apart. We see from the steps in the function that these orientations are not exactly 45, 90, 135, etc. Also, the 0 orientation is split into two groups, one near 0 and the other near 360.
roundOrientation = function(angles) {
  refs = seq(0, by = 45, length = 9)
  q = sapply(angles, function(o) which.min(abs(o - refs)))
  c(refs[1:8], 0)[q]
}
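The same rounding can also be sketched with plain arithmetic instead of a search over reference angles. The function name roundOrientationArith() and the test angles below are hypothetical. This version is vectorized and so may be faster on a million rows, but the two approaches can disagree for values that fall exactly halfway between two reference angles (e.g., 67.5), because round() and which.min() break ties differently.

```r
# Function from the text, repeated so this sketch is self-contained
roundOrientation = function(angles) {
  refs = seq(0, by = 45, length = 9)
  q = sapply(angles, function(o) which.min(abs(o - refs)))
  c(refs[1:8], 0)[q]
}

# Hypothetical alternative: round to the nearest multiple of 45,
# then wrap 360 back to 0 with the modulus operator
roundOrientationArith = function(angles) (round(angles / 45) * 45) %% 360

# Made-up angles, including values just below 360
testAngles = c(0.3, 47.5, 130, 226.1, 314.9, 358.2, 359.9)
roundOrientation(testAngles)
roundOrientationArith(testAngles)
```

Both calls map 358.2 and 359.9 to 0 rather than to 315 or 360, which is the wrap-around behavior described above.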
We use roundOrientation() to create the rounded angles with
offline$angle = roundOrientation(offline$orientation)
Again, we keep the original variable and augment our data frame with the new angles.
We check that the results are as we expect with boxplots:
with(offline, boxplot(orientation ~ angle,
xlab = "nearest 45 degree angle", ylab="orientation"))
From Figure 1.3 we see that the new values look correct and the original values near 360 degrees are mapped to 0. It also shows the variability in the act of measuring.
Figure 1.3: Boxplots of Orientation for the Hand-Held Device. [Plot: horizontal axis "nearest 45 degree angle"; vertical axis "orientation".] These boxplots of the original orientation against the rounded value confirm that the values have mapped correctly to 0, 45, 90, 135, etc. The “outliers” at the top left corner of the plot are the values near 360 that have been mapped to 0.
1.3.2 Exploring MAC Addresses
From the summary() information, it seems that there may be a one-to-one mapping between the MAC address of the access points and channel. For example, the summary statistics show there are 126,529 occurrences of the address 00:14:bf:3b:c7:c6 and the same number of occurrences of channel 2432000000. To help us ascertain if we do have a one-to-one mapping, we look at the relationship between the MAC address and channel.
How many unique addresses and channels do we have? There should be the same number, if there is a one-to-one mapping. We find:
c(length(unique(offline$mac)), length(unique(offline$channel)))
[1] 12 8
There are 12 MAC addresses and 8 channels. We were given the impression from the building plan (see Figure 1.1) that there are only 6 access points. Why are there 8 channels and 12 MAC addresses? Rereading the documentation we find that there are additional access points that are not part of the testing area and so not seen on the floor plan. Let’s check the counts of observations for the various MAC addresses with table():
table(offline$mac)
00:04:0e:5c:23:fc 00:0f:a3:39:dd:cd 00:0f:a3:39:e0:4b
418 145619 43508
00:0f:a3:39:e1:c0 00:0f:a3:39:e2:10 00:14:bf:3b:c7:c6
145862 19162 126529
00:14:bf:b1:97:81 00:14:bf:b1:97:8a 00:14:bf:b1:97:8d
120339 132962 121325
00:14:bf:b1:97:90 00:30:bd:f8:7f:c5 00:e0:63:82:8b:a9
122315 301 103
Clearly the first and the last two MAC addresses are not near the testing area or were only working/active for a short time during the measurement process because their counts are very low. It’s probably also the case that the third and fifth addresses are not among the access points displayed on the map because they have much lower counts than the others and these are far lower than the possible 146,080 recordings (recall that there are potentially signals recorded at 166 grid points, 8 orientations, and 110 replications).
According to the documentation, the access points consist of 5 Linksys/Cisco and one Lancom L-54g routers. We look up these MAC addresses at the http://coffer.com/mac_find/ site and find that vendor addresses beginning with 00:14:bf belong to Linksys devices, those beginning with 00:0f:a3 belong to Alpha Networks, and Lancom devices start with 00:a0:57 (see Figure 1.4). We do have 5 devices with an address that begins 00:14:bf, which matches the Linksys count from the documentation. However, none of our MAC addresses begin with 00:a0:57, so there is a discrepancy with the documentation. Nonetheless, we have discovered valuable information for piecing together a better understanding of the data. For now, let’s keep the records from the top 7 devices. We do this with
subMacs = names(sort(table(offline$mac), decreasing = TRUE))[1:7]
offline = offline[ offline$mac %in% subMacs, ]
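The sort(table()) idiom can be tried on a small made-up vector first. In the sketch below, toyMac, topMacs, and kept are hypothetical names, and the single letters stand in for MAC addresses.

```r
# Made-up addresses: "a" appears 4 times, "b" 3 times, "c" once
toyMac = c("a", "b", "a", "c", "b", "a", "b", "a")

# Names of the 2 most frequent values
topMacs = names(sort(table(toyMac), decreasing = TRUE))[1:2]
topMacs

# Keep only the records for those values
kept = toyMac[ toyMac %in% topMacs ]
table(kept)
```

Because the subset is defined by frequency rank, it is worth printing the selected names to confirm that the cutoff falls where we expect.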
Figure 1.4: Screenshot of the coffer.com Mac Address Lookup Form. The coffer.com Web site offers lookup services to find the MAC address for a vendor and vice versa.
Finally, we create a table of counts for the remaining MAC×channel combinations and confirm there is one non-zero entry in each row:
macChannel = with(offline, table(mac, channel))
apply(macChannel, 1, function(x) sum(x > 0))
00:0f:a3:39:dd:cd 00:0f:a3:39:e1:c0 00:14:bf:3b:c7:c6
                1                 1                 1
00:14:bf:b1:97:81 00:14:bf:b1:97:8a 00:14:bf:b1:97:8d
                1                 1                 1
00:14:bf:b1:97:90
                1
Indeed we see that there is a one-to-one correspondence between MAC address and channel for these 7 devices. This means we can eliminate channel from offline, i.e.,
offline = offline[ , "channel" != names(offline)]
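The row-wise check confirms that each address uses a single channel. A stricter check, sketched here on made-up data, also counts the addresses per channel (the columns of the table), which would reveal whether two addresses share a channel. The data frame toy and the labels in it are hypothetical.

```r
# Made-up records: each address uses one channel, except "m3"
toy = data.frame(mac = c("m1", "m1", "m2", "m3", "m3"),
                 channel = c("c1", "c1", "c2", "c3", "c4"),
                 stringsAsFactors = FALSE)

toyTab = with(toy, table(mac, channel))

# Distinct channels per address (rows) and addresses per channel (columns)
perMac = apply(toyTab, 1, function(x) sum(x > 0))
perChannel = apply(toyTab, 2, function(x) sum(x > 0))

# A one-to-one mapping needs every entry of both vectors to equal 1
all(perMac == 1) && all(perChannel == 1)
```

In this toy example the check fails because "m3" appears on two channels.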
1.3.3 Exploring the Position of the Hand-Held Device
Lastly, we consider the position variables, posX and posY. For how many different locations do we have data? The by() function can tally up the numbers of rows in our data frame for each unique (x, y) combination. We begin by creating a list containing a data frame for each location as follows:
locDF = with(offline,
             by(offline, list(posX, posY), function(x) x))
length(locDF)
[1] 476
Note that this list is longer than the number of combinations of actual (x, y) locations at which measurements were recorded. Many of these elements are empty:
sum(sapply(locDF, is.null))
[1] 310
The null values correspond to the combinations of the xs and ys that were not observed.
We drop these unneeded elements as follows:
locDF = locDF[ !sapply(locDF, is.null) ]
and confirm that we now have only 166 locations with
length(locDF)
[1] 166
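The empty elements arise because by() makes a cell for every combination of the grouping values. The sketch below reproduces this on a made-up data frame and shows one possible alternative, split() with drop = TRUE, which keeps only the observed combinations. The names toy, toyBy, and toySplit are hypothetical.

```r
# Made-up positions: the combination (1, 0) is never observed
toy = data.frame(posX = c(0, 0, 1, 1),
                 posY = c(0, 1, 1, 1),
                 signal = c(-50, -60, -55, -70))

# by() creates a cell for every posX-by-posY combination,
# including the unobserved one, which is NULL
toyBy = with(toy, by(toy, list(posX, posY), function(d) d))
length(toyBy)
sum(sapply(toyBy, is.null))

# split() with drop = TRUE keeps only the observed combinations
toySplit = split(toy, list(toy$posX, toy$posY), drop = TRUE)
length(toySplit)
sapply(toySplit, nrow)
```

Either route gives one data frame per observed location. The split() version simply skips the step of filtering out the NULL elements.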
We can operate on each of these data frames to, e.g., determine the number of observations recorded at each location with
locCounts = sapply(locDF, nrow)
And, if we want to keep the position information with the location, we do this with
locCounts = sapply(locDF,
                   function(df)
                     c(df[1, c("posX", "posY")], count = nrow(df)))
We confirm that locCounts is a matrix with 3 rows with
class(locCounts)
[1] "matrix"
dim(locCounts)
[1] 3 166
We examine a few of the counts,
locCounts[ , 1:8]
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
posX 0 1 2 0 1 2 0 1
posY 0 0 0 1 1 1 2 2
count 5505 5505 5506 5524 5543 5558 5503 5564
We see that there are roughly 5,500 recordings at each position. This is in accord with 8 orientations × 110 replications × 7 access points, which is 6,160 signal strength measurements.
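The arithmetic behind these upper bounds can be written out as a short calculation. The variable names are ours, and the 5,500 figure is the rough typical count quoted above rather than a computed summary.

```r
nOrientations = 8
nReplications = 110
nAccessPoints = 7    # the MAC addresses kept in Section 1.3.2
nLocations = 166

# Upper bound on recordings per location if every signal were detected
maxPerLocation = nOrientations * nReplications * nAccessPoints
maxPerLocation

# Upper bound on recordings per access point across all locations
maxPerAccessPoint = nLocations * nOrientations * nReplications
maxPerAccessPoint

# Share of the per-location bound reached by a count of about 5,500
5500 / maxPerLocation
```

The gap between the observed counts and the bound reflects signals that were not detected at some position, orientation, and access point combinations.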
We can visualize all 166 counts by adding the counts as text at their respective locations, changing the size and angle of the characters to avoid overlapping text. We first transpose the matrix so that each location is a row, with columns posX, posY, and count, and then we make our plot with
locCounts = t(locCounts)
plot(locCounts, type = "n", xlab = "", ylab = "")
text(locCounts, labels = locCounts[,3], cex = .8, srt = 45)
We see in Figure 1.5 that there are roughly the same number of signals detected at each location.
1.3.4 Creating a Function to Prepare the Data
We have examined all the variables except time and signal. This process has helped us clean our data and reduce it to those records that are relevant to our analysis. We leave the examination of the signals to the next section where we study its distributional properties.
As for time, while this variable is not directly related to our model, it indicates the order in which the observations were taken. In an experiment, this can be helpful in uncovering potential sources of bias. For example, the person carrying the hand-held device may have changed how the device was carried as the experiment progressed and this change may lead to a change in the strength of the signal. Plots and analyses of the relationship between time and other variables can help us uncover such potential problems. We leave this investigation as an exercise.
Since we also want to read the online data in R, we turn all of these commands into a function called readData(). Additionally, if we later change our mind as to how we want to handle some of these special cases, e.g., to keep channel or posZ, then we can make a simple update to our function and rerun it. We might even add a parameter to the function definition to allow us to process the data in different ways. We leave it as an exercise to create readData().
We call readData() to create the offline data frame with
Figure 1.5: Counts of signals detected at each position. [Plot: the count for each location is drawn as text at its (x, y) position; the horizontal axis runs from 0 to 30 and the vertical axis from 0 to 12.] Plotted at each location in the building is the total number of signals detected from all access points for the offline data. Ideally for each location, 110 signals were measured at 8 angles for each of 6 access points, for a total of 5280 recordings. These data include a seventh MAC address and not all signals were detected, so there are about 5500 recordings at each location.
offlineRedo = readData()
Then we use the identical() function to check this version of the data frame against the one that we already created:
identical(offline, offlineRedo)
[1] TRUE
This confirms that our function behaves as expected.
When we collect code into a function, it is common to forget about some of the variables we need. The code works because they are found in the R session (i.e., globalenv()), but the function does not work in new R sessions or gives the wrong results if we define those global variables differently, by chance. We use the findGlobals() function in the codetools package [10] to identify what variables are global, i.e.,
library(codetools)
findGlobals(readData, merge = FALSE)$variables
[1] "processLine" "subMacs"
The processLine() function is reported as a variable because it is referenced, rather than called, in the call to lapply() in readData(), so this is not a problem. The variable subMacs is also identified as global. This variable was created in the global environment from the seven most frequent values of mac (see Section 1.3.2) and we neglected to include this code in the function. We can update the function to pass it as a parameter with a suitable default value or to create subMacs within the function; then, subMacs is no longer a global variable.
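The effect of moving a global variable into the parameter list can be sketched with two small stand-in functions. These are not readData() itself, which is left as an exercise; the names filterGlobal(), filterParam(), and keepValues are hypothetical.

```r
library(codetools)

# Hypothetical stand-ins for readData(): filterGlobal() relies on a global
# variable, filterParam() receives the same information as an argument
filterGlobal = function(x) x[ x %in% keepValues ]
filterParam = function(x, keepValues = c("a", "b")) x[ x %in% keepValues ]

# keepValues is reported as global for the first function only
findGlobals(filterGlobal, merge = FALSE)$variables
findGlobals(filterParam, merge = FALSE)$variables

# The parameterized version runs without anything defined in the session
filterParam(c("a", "c", "b", "a"))
```

A default value keeps existing calls short while still letting a caller supply a different set of values, e.g., when processing the online data.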