http://www.k2.t.u-tokyo.ac.jp Ishikawa Oku Laboratory
Information Fusion and Integration
Masatoshi Ishikawa Carson Reynolds
Last Week
What is the largest prime factor of the next integer
after the largest known factorial prime?
103007
http://bit.ly/kmdcw7
Weekly Puzzle
Answer via twitter with a direct message to
@CarsonReynolds
Which of these shapes is
unlike the others?
A casino has a game which gives players 100 yen for each dot on a rolled die. They charge 550 yen per roll. On average, how much does the casino make per game?
Image Credit: Nanami Kamimura
Expected Value of a Function
Figure Credit: Wikipedia Expected Value
Bishop (2006) PR&ML Chapter 1
Expectation (Average, First Statistical Moment):
$$E(x) = \sum_{i=1}^{N} P(x_i)\, x_i$$
Sample Mean (Unbiased Estimator of Mean), exact as N → ∞:
$$E(x) \simeq \frac{1}{N} \sum_{i=1}^{N} x_i$$
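As a worked example, the casino puzzle can be answered with the expectation formula. This is a minimal sketch that assumes a fair six-sided die; the function names and the simulated sample size are illustrative choices, not part of the slides.

```python
from fractions import Fraction
import random


def expected_value(outcomes, probabilities):
    """E(x) = sum_i P(x_i) * x_i for a discrete random variable."""
    return sum(p * x for x, p in zip(outcomes, probabilities))


def sample_mean(samples):
    """(1/N) * sum_i x_i, an estimate of E(x) from N observations."""
    return sum(samples) / len(samples)


# Payout in yen for each face of the die: 100 yen per dot.
payouts = [100 * face for face in range(1, 7)]
probabilities = [Fraction(1, 6)] * 6  # assumption: the die is fair

expected_payout = expected_value(payouts, probabilities)
casino_margin = 550 - expected_payout  # fee per roll minus average payout

# The sample mean of many simulated rolls approaches the expectation.
rng = random.Random(0)
rolls = [100 * rng.randint(1, 6) for _ in range(100_000)]
estimate = sample_mean(rolls)

print(expected_payout, casino_margin, estimate)
```

Under these assumptions the average payout is 350 yen, so the casino keeps 200 yen per game on average.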
Variance
Press et al. (1992) NR in C
The variance measures the spread of a random variable: how far its values fall from the expected (mean) value.
Figure Credit: IPCC Third Assessment Report, Climate Change 2001
$$\mathrm{var}(x) = \frac{1}{N - 1} \sum_{i=1}^{N} (x_i - E(x))^2 \qquad \sigma(x) = \sqrt{\mathrm{var}(x)}$$
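The variance and standard deviation formulas can be sketched directly. The helper names and the small data set below are made up for illustration; E(x) is estimated here by the sample mean.

```python
import math


def variance(xs):
    """Sample variance with the 1/(N-1) normalisation used on the slide."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


def std_dev(xs):
    """sigma(x) = sqrt(var(x))."""
    return math.sqrt(variance(xs))


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # illustrative values
print(variance(data), std_dev(data))
```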
$$\mathrm{Skew}(x) = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{x_i - E(x)}{\sigma} \right)^3$$
$$\mathrm{Kurt}(x) = \left( \frac{1}{N} \sum_{i=1}^{N} \left( \frac{x_i - E(x)}{\sigma} \right)^4 \right) - 3$$
Press et al. (1992) Numerical Recipes in C
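A minimal sketch of the skewness and kurtosis formulas follows. It assumes σ is the square root of the 1/(N−1) variance from the previous slide and that E(x) is estimated by the sample mean; the function names and data are illustrative.

```python
import math


def _mean_and_sigma(xs):
    """Sample mean and standard deviation (1/(N-1) normalisation)."""
    n = len(xs)
    mean = sum(xs) / n
    sigma = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    return mean, sigma


def skewness(xs):
    """Third standardised moment: asymmetry of the distribution."""
    mean, sigma = _mean_and_sigma(xs)
    return sum(((x - mean) / sigma) ** 3 for x in xs) / len(xs)


def kurtosis(xs):
    """Fourth standardised moment minus 3 (zero for a Gaussian)."""
    mean, sigma = _mean_and_sigma(xs)
    return sum(((x - mean) / sigma) ** 4 for x in xs) / len(xs) - 3


symmetric = [-2.0, -1.0, 0.0, 1.0, 2.0]  # symmetric data has zero skew
print(skewness(symmetric), kurtosis(symmetric))
```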
Suppose I have two discrete random variables, each with its own mean and variance.
Figure Credit:
Maxfield and Lyon (1983)
Demo:
Multivariate Gaussian
Covariance
We can represent covariance using the marginal variances of the variables and a rotation parameter (the correlation).
If x and y are independent, their covariance is 0.
$$\Sigma(x, y) = E[(x - E[x])(y - E[y])]$$
$$\Sigma = E\left[ (X - E[X]) (X - E[X])^{\top} \right]$$
$$\Sigma = \begin{pmatrix} \sigma_x^2 & \rho \sigma_x \sigma_y \\ \rho \sigma_x \sigma_y & \sigma_y^2 \end{pmatrix}$$
$$\rho = \frac{E[(x - E[x])(y - E[y])]}{\sigma_x \sigma_y}$$
Figure Credit: DHS Figure A.4
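The relationship between the covariance matrix and the correlation can be checked numerically. This sketch estimates the expectations from samples with a 1/(N−1) normalisation, which is an assumption; the slide states the definitions in terms of exact expectations. All names and data are illustrative.

```python
import math


def mean(values):
    return sum(values) / len(values)


def covariance(xs, ys):
    """Sample estimate of E[(x - E[x])(y - E[y])]."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def covariance_matrix(xs, ys):
    """2x2 matrix [[var(x), cov(x,y)], [cov(y,x), var(y)]]."""
    return [[covariance(xs, xs), covariance(xs, ys)],
            [covariance(ys, xs), covariance(ys, ys)]]


def correlation(xs, ys):
    """rho = cov(x, y) / (sigma_x * sigma_y)."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0]  # y = 2x, so the correlation is 1
sigma = covariance_matrix(xs, ys)
rho = correlation(xs, ys)
print(sigma, rho)
```

The off-diagonal entries equal ρσ_xσ_y and the diagonal entries are the two variances.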
More about Covariance
• The diagonal of a covariance matrix holds the variances.
• If all the rows of the input matrix are identical, then the covariance matrix is zero.
Equal Distance Contours
K. Van Laerhoven (2004)
Manhattan: $\|p - q\|_1$
Euclidean: $\|p - q\|_2$
Chebyshev: $\|p - q\|_\infty$
Mahalanobis: ?
Mahalanobis Distance
$$\Delta(x, \mu) = \sqrt{(x - \mu)^{\top} \Sigma^{-1} (x - \mu)}$$
• Mahalanobis distance is defined for an observation vector x, a mean vector μ, and the covariance matrix Σ associated with that mean.
• When the covariance is the identity matrix, it is equivalent to Euclidean distance.
• It provides a measure of similarity between data points in a multidimensional Gaussian space.
• It is the root of the sum of squared differences, divided out by the covariance.
Bradski and Kaehler (2008)
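A small sketch of the Mahalanobis distance for the two-dimensional case, where the covariance matrix can be inverted in closed form. The function name and the example values are illustrative assumptions.

```python
import math


def mahalanobis_2d(x, mu, sigma):
    """sqrt((x - mu)^T Sigma^{-1} (x - mu)) for a 2x2 covariance matrix."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    # Closed-form inverse of a 2x2 matrix.
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx = [x[0] - mu[0], x[1] - mu[1]]
    quad = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
            + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    return math.sqrt(quad)


identity = [[1.0, 0.0], [0.0, 1.0]]
stretched = [[4.0, 0.0], [0.0, 1.0]]  # variance 4 along the first axis

print(mahalanobis_2d((3.0, 4.0), (0.0, 0.0), identity))
print(mahalanobis_2d((2.0, 0.0), (0.0, 0.0), stretched))
```

With the identity covariance the result is the Euclidean distance; with a larger variance along one axis, the same displacement along that axis counts for less.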
The slope of a linear regression line is:
cov(x,y)/var(x)
Where would the regression line appear in each of the above examples?
Barton et al. (2007) Evolution
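The slope formula can be tried on a few points. In this sketch the intercept (mean of y minus slope times mean of x) is the standard least-squares companion to the slope and is not stated on the slide; the names and data are illustrative.

```python
def mean(values):
    return sum(values) / len(values)


def regression_line(xs, ys):
    """Return (slope, intercept) with slope = cov(x, y) / var(x)."""
    mx, my = mean(xs), mean(ys)
    n = len(xs)
    cov_xy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    var_x = sum((x - mx) ** 2 for x in xs) / (n - 1)
    slope = cov_xy / var_x
    intercept = my - slope * mx  # the line passes through the two means
    return slope, intercept


xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # points on the line y = 2x + 1
slope, intercept = regression_line(xs, ys)
print(slope, intercept)
```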
Using Mahalanobis Distance
Suppose you want to train a system to recognize eye blinks using just a few examples, measuring the distance between the components and the width and height of each component.
Grauman et al. (2001)
Blink Tracker
Blink Tracker with Chye Connsynn
Bhattacharyya distance
$$D_B(p, q) = -\ln\left( \sum_{x \in X} \sqrt{p(x)\, q(x)} \right)$$
Figure Credit:
B. Mak and E. Bernard (1996)
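A minimal sketch of the Bhattacharyya distance between two discrete distributions given as lists over the same outcomes. Returning infinity for distributions with no overlap is a convention chosen here, not something stated on the slide.

```python
import math


def bhattacharyya_distance(p, q):
    """D_B(p, q) = -ln(sum_x sqrt(p(x) * q(x)))."""
    coefficient = sum(math.sqrt(px * qx) for px, qx in zip(p, q))
    if coefficient == 0:
        return math.inf  # assumption: disjoint supports are infinitely far apart
    return -math.log(coefficient)


same = bhattacharyya_distance([0.5, 0.5], [0.5, 0.5])
different = bhattacharyya_distance([1.0, 0.0], [0.5, 0.5])
disjoint = bhattacharyya_distance([1.0, 0.0], [0.0, 1.0])
print(same, different, disjoint)
```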
What is information and how do we fuse it?
Transmission of Information
R.V.L. Hartley (1928) Transmission of Information
Researchers at Bell Laboratories grew interested in how signals such as Morse code could be interpreted in the presence of noise.
Figure Credit: A.B.C. Telegraphic Code (1901)
Measure of Information
R.V.L. Hartley (1928) Transmission of Information
Hartley sought a quantitative measure of information. His model analyzed messages of length n composed of symbols that could each take on s different values. However, it assumed that each symbol value was equally likely.
$$H = n \log s = \log s^n$$
[Figure: plot of $\log s^n$ against n]
Messages, Signals & Noise
C. E. Shannon (1948)
A Mathematical Theory of Communication
Shannon’s Entropy
C. E. Shannon (1948)
$$H = -\sum_{i=1}^{m} p_i \log p_i \qquad \sum_{i=1}^{m} p_i = 1$$
Shannon instead considered the list of symbols as different observations of a discrete random variable. When all the probabilities are equal, H is exactly Hartley’s H.
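The claim that Shannon's entropy reduces to Hartley's measure for equally likely symbols can be checked numerically. This sketch uses base-2 logarithms and illustrative values s = 4 and n = 3; the function names are not from the slides.

```python
import math


def shannon_entropy(probabilities, base=2.0):
    """H = -sum_i p_i log p_i, skipping zero-probability entries."""
    return -sum(p * math.log(p, base) for p in probabilities if p > 0)


def hartley_information(n, s, base=2.0):
    """H = n log s for messages of n symbols with s equally likely values."""
    return n * math.log(s, base)


s, n = 4, 3
# One symbol with s equally likely values.
per_symbol = shannon_entropy([1 / s] * s)
# A whole message: s**n equally likely messages of length n.
whole_message = shannon_entropy([1 / s**n] * s**n)

print(per_symbol, whole_message, hartley_information(n, s))
```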
Units of Measure for Entropy
Base    Units   Conversion
2       bits
e       nats    1 bit = ln 2 nats
10      bans    1 bit = log10 2 bans
“measures how much one random variables tells us about another”
P. E. Latham and Y. Roudi (2009) Scholarpedia
Probability-Weighted Information Gain
The “entropy term is the average amount of information to be gained from a certain set of events.”
Information gain is inversely related to the probability of an event.
Pluim et al. (2003)
$$H = \sum_{i=1}^{m} p_i \log \frac{1}{p_i}$$
Computing Entropy
If we have a list n of discrete probabilities, we can compute f(n) = n log n for each entry, sum the entries, and negate the sum to obtain the entropy: H = −Σ f(n).
Pluim et al. (2003)
“...a 1-yr old child uses the words ‘mummy,’ ‘daddy,’ ‘cat,’ and ‘uh-oh.’”
Word    Probability
mummy   0.35
daddy   0.2
cat     0.2
uh-oh   0.25
What is the Entropy?
n = (0.35, 0.2, 0.2, 0.25)
f(n) = n log2 n = (−0.5301, −0.4644, −0.4644, −0.5)
H = −Σ f(n) = 1.9589 bits
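The calculation for the one-year-old's vocabulary can be reproduced in a few lines. This sketch uses base-2 logarithms, which matches the per-word values shown; the dictionary and function names are illustrative.

```python
import math

# Word probabilities for the 1-year-old, as given in the table.
one_year_old = {"mummy": 0.35, "daddy": 0.2, "cat": 0.2, "uh-oh": 0.25}


def weighted_log_terms(probabilities):
    """f(p) = p * log2(p) for each probability (each term is <= 0)."""
    return [p * math.log2(p) for p in probabilities]


def entropy_bits(probabilities):
    """H = -sum f(p), in bits."""
    return -sum(weighted_log_terms(probabilities))


terms = weighted_log_terms(one_year_old.values())
h = entropy_bits(one_year_old.values())
print([round(t, 4) for t in terms], round(h, 4))
```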
What is the Entropy of a 2 year old?
Word    Probability
mummy   0.05
daddy   0.05
cat     0.02
train   0.02
car     0.02
cookie  0.02
telly   0.02
no      0.80
H = −Σ f(n) = 1.25412 bits
Entropy Minimized
[Figure: plot of a sharply peaked distribution]
“A distribution with a single sharp peak corresponds to a low entropy value...”
Entropy is minimized by the Dirac Delta function.
Pluim et al. (2003)
Entropy Maximized
[Figure: plot of a uniform distribution]
“When all messages are equally likely to occur, the entropy is maximal, because you are completely uncertain which message you will receive.”
Entropy is maximized by the Uniform Distribution.
Pluim et al. (2003)
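The two extremes can be compared numerically for a discrete variable with eight outcomes. In this sketch a one-hot distribution stands in for the Dirac delta (an assumption for the discrete case), and the two-year-old's word distribution from the earlier slide is used as an intermediate example.

```python
import math


def entropy_bits(probabilities):
    """H = -sum p log2 p, treating 0 log 0 as 0."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


one_hot = [1.0, 0, 0, 0, 0, 0, 0, 0]  # all mass on a single outcome
uniform = [1 / 8] * 8                 # every outcome equally likely
two_year_old = [0.05, 0.05, 0.02, 0.02, 0.02, 0.02, 0.02, 0.80]

h_min = entropy_bits(one_hot)
h_max = entropy_bits(uniform)
h_mid = entropy_bits(two_year_old)
print(h_min, h_mid, h_max)
```

The peaked vocabulary dominated by "no" sits much closer to the minimum than to the 3-bit maximum.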
Demo: Minimizing and Maximizing Distributions
Comparing Entropy
Kullback-Leibler Divergence
$$D_{KL}(p(x), q(x)) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}$$
DHS Appendix A7.2
If two distributions cover the same discrete range, then we can compare how far one diverges from the other. KL divergence expresses how many additional bits are needed to encode p using the distribution q.
Suppose we wanted to compare a random variable with itself. What would the KL divergence be in this case?
KL divergence is not a distance metric: it is not symmetric, D_KL(p, q) ≠ D_KL(q, p), and it does not satisfy the triangle inequality.
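Both points on this slide can be checked with a short sketch: the divergence of a distribution from itself, and the asymmetry. Base-2 logarithms are used so the result is in bits; the example distributions are made up, and the sketch assumes q(x) > 0 wherever p(x) > 0.

```python
import math


def kl_divergence(p, q):
    """D_KL(p, q) = sum_x p(x) log2(p(x) / q(x)), in bits."""
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)


p = [0.5, 0.5]
q = [0.9, 0.1]

self_divergence = kl_divergence(p, p)  # comparing a variable with itself
forward = kl_divergence(p, q)
backward = kl_divergence(q, p)
print(self_divergence, forward, backward)
```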
Mutual Information
Pluim et al. (2003)
Conditionally:
$$I(A, B) = H(B) - H(B|A)$$
As a distance measure:
$$I(A, B) = \sum_{a,b} p(a, b) \log \frac{p(a, b)}{p(a)\, p(b)}$$
Kullback-Leibler and Mutual Information
$$D_{KL}(p(x), q(x)) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}$$
Mutual Information is the Kullback-Leibler divergence between the joint distribution and the product of the independent ones.
$$I(A, B) = \sum_{a,b} p(a, b) \log \frac{p(a, b)}{p(a)\, p(b)}$$
P. E. Latham and Y. Roudi (2009) Scholarpedia
Properties of Mutual Information
Pluim et al. (2003)
Symmetry: $I(A, B) = I(B, A)$
Relationship to Entropy: $I(A, A) = H(A)$
Mutual Information is Less than Self Information: $I(A, B) \le I(A, A)$
Independence: $I(A, B) = 0 \iff$ A and B are independent
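These properties can be verified on small joint probability tables. In this sketch a joint distribution is a nested list indexed as joint[a][b]; the tables and helper names are illustrative assumptions.

```python
import math


def entropy(p):
    """H = -sum p log2 p, in bits."""
    return -sum(v * math.log2(v) for v in p if v > 0)


def mutual_information(joint):
    """I(A, B) = sum_{a,b} p(a,b) log2(p(a,b) / (p(a) p(b)))."""
    p_a = [sum(row) for row in joint]
    p_b = [sum(col) for col in zip(*joint)]
    total = 0.0
    for i, row in enumerate(joint):
        for j, p_ab in enumerate(row):
            if p_ab > 0:
                total += p_ab * math.log2(p_ab / (p_a[i] * p_b[j]))
    return total


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


joint_ab = [[0.3, 0.2], [0.1, 0.4]]   # A and B are dependent
joint_aa = [[0.5, 0.0], [0.0, 0.5]]   # A paired with itself, p(a) = (0.5, 0.5)
joint_indep = [[pa * pb for pb in (0.3, 0.7)] for pa in (0.4, 0.6)]

print(mutual_information(joint_ab),
      mutual_information(joint_aa),
      mutual_information(joint_indep))
```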
Image Registration
A. Goshtasby & M. Satter (1999)
[Figure: each of the two images is reduced to an N-bin Histogram of Image; the Joint Distribution and the Independent Distributions are compared using the Kullback-Leibler Divergence]
Photo Credit: M. Farmer (2003)
CT & MR
Image Registration
Pluim et al. (2003)
Figure Credit: M. Farmer (2003)
Image Registration
A. Dame and E. Marchand (2010)
Map/Aerial View Matching
Applications:
Augmented Reality
A. Dame and E. Marchand (2010)
Case Study:
Sensor-Generated Social Networks
Figure 2: The hardware implementation of a motion detector node. Left: early MITes-based version. Right: an ultra-low-power, standards-based prototype.
we ask that you cite this report[23] in all publications that derive from your work on this data set.
4. DATA DESCRIPTION
The dataset is comprised of several segments:
• raw motion data
• calibration data
• calendar data
• solar and weather data
• intermediate track analytics
This section contains detailed descriptions of each of these subsets.
4.1 Raw Motion Data
The primary data stream is the output of the motion detector network. The sensors are ceiling mounted at approximately two meter intervals along hallways and in grids covering public spaces such as lobbies and meeting rooms. There are no sensors in individual offices. The sensors are installed with the intention of covering the floor area completely with little or no overlap between sensor fields of view. The ceiling height varies, but is approximately three meters in most areas. Figure 1 shows the tiled arrangement of the sensors and a snapshot of motion activity.
The sensors use the MITes platform[15] for processing and communication coupled to a modified KC7783R sensor board. Two versions of the hardware are shown in Figure 2. The MITes-based node used to collect the data in this release is pictured on the left. The re-engineered node on the right is used in new installations and it features an 802.15.4 radio, a modular design, and ultra-low power consumption for drastically extended battery life.
The MITes employ an unreliable network protocol that does not contain checksum information. As a result packets are sometimes lost, duplicated, or garbled. An attempt has been made to filter out the duplicated and garbled packets. However, it is certain that there are packets missing from the data, and any analysis should take that into account. The loss rate is low, but is difficult to measure since it depends on the network load.
The sensor boards are the type commonly found in motion activated lighting fixtures and security sensors. They are passive infrared motion detectors. They work by sensing light emitted in the far-infrared by warm objects and signal on high-frequency changes in the scene at those frequencies. They were modified to reduce their adaption rate from minutes to seconds. Since the timing of the detections depends on the analog characteristics of the circuit, the minimum inter-detection time varies, but is observed to typically be around 1.5 seconds.
The raw data is available as compressed ASCII text files in five parts:
0114.txt.gz   Mar 21 23:00:25 2006 – Jun 11 00:26:40 2006
0115.txt.gz   Jun 11 00:26:40 2006 – Oct 4 18:13:20 2006
0116.txt.gz   Oct 4 18:13:20 2006 – Jan 28 11:00:00 2007
0117.txt.gz   Jan 28 11:00:00 2007 – May 24 05:46:40 2007
0118.txt.gz   May 24 05:46:40 2007 – Jul 2 15:41:50 2007
The filename refers to the high-order bits of the timestamps on the data contained in each file.
The files contain data like this:
470 01179980510828 01179980511853 1.0
469 01179980512169 01179980513193 1.0
467 01179980513580 01179980514609 1.0
468 01179980514573 01179980515598 1.0
The first element is the sensor identification number. The second and third numbers are the timestamps of the beginning of the event. The fourth number is a meaningless place holder value.
The map in Figure 3 depicts the test area. Executives and administrators occupy the wing on the right side of the eighth floor map. Researchers occupy the bottom and left wings, and most of the 7th floor. The central core of the building contains restrooms, lobbies, elevators, and on the eighth floor, the mail room and the kitchen. There are several stairwells that connect the floors.
We have been collecting data at this facility since October of 2005. Data from the entire area depicted on that map has been continuously recorded since March 2006. The system generates approximately two million motion detections per month.
4.1.1 Time
The data is timestamped with the number of milliseconds since the epoch: January 1, 1970 UTC. Like the Windows system clock, this number becomes larger than 2^32 after 50 days, on February 19, 1970, so you must take care to use 64-bit integer representations when manipulating timestamps:
• __int64 or long long in C
• use bignum in Perl
• java.math.BigInteger in Java
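For comparison, a minimal Python sketch of reading one line of the raw motion data. Python integers have arbitrary precision, so the 64-bit concern does not arise there. The function names and the neutral field names `t_first` and `t_second` are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_motion_line(line):
    """Split a raw line into (sensor_id, t_first, t_second).

    The fourth field is a placeholder value and is ignored.
    """
    sensor_id, t_first, t_second, _placeholder = line.split()
    return int(sensor_id), int(t_first), int(t_second)


def ms_to_utc(milliseconds):
    """Convert milliseconds since the epoch to a UTC datetime."""
    return EPOCH + timedelta(milliseconds=milliseconds)


record = parse_motion_line("470 01179980510828 01179980511853 1.0")
overflow_moment = ms_to_utc(2**32)  # first value that no longer fits in 32 bits

print(record, ms_to_utc(record[1]), overflow_moment)
```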
Social Networks from Data Mining
Kautz and Selman (1998) developed ReferralWeb to mine bibliographical references and automatically model the implied social network.
Social Networks from Wearable Hardware
Choudhury and Pentland (2002) used wearable sensors to assess social network and interpersonal interaction.
A similar analysis of the larger dataset shows clustering of groups and a reduction of interaction with increasing physical separation. Subject IDs 2-9 belong to group 1, IDs 10 and 12-15 belong to group 2, IDs 16-19 to group 3 and IDs 21-24 to group 4. IDs 20 and 25 were physically co-located with groups 1 and 2 (no one was assigned ID 1 or 11). Note that there are a few individuals who have broad connections across groups (IDs 3, 8, and 13); individuals of this type usually have an important effect on the information flow within the community.
Figure 9 The connectivity matrix of interaction duration. Each row is a different individual and each column depicts the fraction of his/her interaction with others. Image(i,j) depicts person i’s interaction with person j. Dark regions signify absence of interaction.
This connectivity graph or network graph can then be used to estimate centrality measures as traditionally done in social network analysis. Centrality measures seek to quantify an individual’s prominence within a network by summarizing the relationships among the different individuals in the network. There are different measures of centrality, e.g. degree, betweenness, eigenvector etc.[15, 16]. Here we use the eigenvector centrality, where the status of a person is recursively related to the statuses of the people he/she is connected to. If an individual is chosen by a popular person it should add to the person’s popularity. If A is the adjacency matrix where a_ij means that i contributes to the status of j, and x is the vector of centrality scores, the most general form of eigenvector centrality is:
$$x_i = a_{1i} x_1 + a_{2i} x_2 + \cdots + a_{ni} x_n \qquad (1)$$
In matrix representation:
$$A^{t} x = x \qquad (2)$$
The eigenvector centrality is the eigenvector of the adjacency matrix corresponding to an eigenvalue of 1. Normalizing the rows of A to sum to 1 ensures that equation 2 is solvable. The eigenvector centrality measure for the larger group based on the adjacency matrix is shown below (Figure 10); IDs 3 and 8, with the highest centrality scores, are also the individuals who had the most connections across groups.
Figure 10 Eigenvector centrality measures of the 23 individuals participating in the larger study, calculated from proximity data
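Equation (2) in the excerpt above can be solved by power iteration. This is a sketch under stated assumptions: the 3-person interaction matrix is made up, every row must have at least one nonzero entry, and power iteration is one common way to find the eigenvector, not necessarily the method used in the cited work.

```python
def row_normalize(matrix):
    """Scale each row to sum to 1 (assumes no all-zero rows)."""
    return [[value / sum(row) for value in row] for row in matrix]


def eigenvector_centrality(adjacency, iterations=200):
    """Solve A^t x = x by power iteration on the row-normalised matrix."""
    a = row_normalize(adjacency)
    n = len(a)
    x = [1.0 / n] * n
    for _ in range(iterations):
        # x_i = a_1i x_1 + a_2i x_2 + ... + a_ni x_n
        x = [sum(a[j][i] * x[j] for j in range(n)) for i in range(n)]
        total = sum(x)
        x = [value / total for value in x]  # keep the scores summing to 1
    return x


# Hypothetical interaction durations between three people.
interaction = [[0, 2, 1],
               [2, 0, 1],
               [1, 1, 0]]
scores = eigenvector_centrality(interaction)
print(scores)
```

For this symmetric example the scores are proportional to each person's total interaction (3, 3, and 2).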
4. Conclusion and Future Work
In this paper, we present a method for analyzing the connectivity of interacting groups using data gathered from wearable sensors. We have presented initial results from our efforts in sensor-based modeling of human communication networks. We show that we can automatically and reliably estimate when people are in close proximity and when they are talking. We demonstrate the advantage of continuous sensing of interactions, which allows us to measure the structure of communication networks along various dimensions: duration, frequency, ratio of interaction etc. We also present centrality scores for each individual, computed automatically from raw sensor data. Centrality measures are often used in social network analysis as a measure of influence and embeddedness of a person in his/her community. In many studies it has been shown that the topology of people’s connectivity is the most important feature and the actual interaction content is not as crucial in understanding a person’s role within the community[1, 17-19]. We are currently obtaining quantitative results for our algorithms by comparing the accuracy of our techniques to hand-labeled ground truth data of the interactions. We are also incorporating our work on modeling the dynamics of the network as a whole, which will in the future allow us to quantitatively measure the influences people have on each other [20].
Photo: C. Wren
Photo: K. Ryall
Sensor Locations
[...] be an ambiguity about which person is causing a particular sensor activation: because there is only a single person, the data association problem thus becomes trivially degenerate.
As soon as two individuals pass near each other, an ambiguity arises: two individuals enter an ambiguity and two leave, but there is no way to distinguish one from the other.
Instead of representing each possible hypothetical labeling as a distinct element in our database, we instead record the transitions that we are sure of, and link them together at points of ambiguities to form a directed graph. This graph will contain the true track as a subgraph, but also contains all alternative, consistent, hypothetical tracks as well.
The tracklets could be regenerated from the raw motion data, but we include our interpretation of the tracklet graph for convenience. The representation of the tracklet graph is explained in detail in the file [...].
5. DATA MANAGEMENT
We store our data in a database to facilitate efficient access and search. Most of the data files are formatted to be read directly into a database, specifically MySQL version [...]. The [...] files are examples of creating tables, loading data, and performing example queries. It is not necessary to use a database: most of the files will also load into Matlab, or can be read easily into custom software. Beware, however, that the dataset is sizable.
6. BIBLIOGRAPHY
Below is a list of relevant references into the literature. This is not intended as an exhaustive bibliography, but contains references to our work with this dataset, and pointers into the literature to related work on motion sensors, camera networks, sensor networks generally, privacy, social networks, and the perception of activity.
7. ACKNOWLEDGMENTS
The authors would like to express their gratitude for the undying support and unrelenting encouragement of Dr. Joseph Marks, Director of the Research Laboratory of the Mitsubishi Electric Research Laboratories when this sensor network was conceived, designed, built, and installed. We also wish to thank the current management of MERL for their continued support of this project. And finally we must thank the employees, interns, delivery people, cleaning crew, and visitors who contributed data to this set by moving around [...] Broadway between March [...] and March [...].
Tracklets
Binary Adjacency Matrices
$$P = \begin{pmatrix} P_{1,1} & \cdots & P_{1,30} \\ \vdots & \ddots & \vdots \\ P_{30,1} & \cdots & P_{30,30} \end{pmatrix}$$
• If person x emailed y then $P_{x,y} = 1$, otherwise $P_{x,y} = 0$.
• If person x visited y then $P_{x,y} = 1$, otherwise $P_{x,y} = 0$.
• If person x said they knew y then $P_{x,y} = 1$, otherwise $P_{x,y} = 0$.
Adjacency Matrices to Graphs
$$\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} \qquad \begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix} \qquad \begin{pmatrix} 0 & 1 & 1 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 1 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{pmatrix}$$
[Figure: the corresponding graphs on 2, 3, and 4 vertices]
Hamming Distance
PRIVACY AND SENSOR-GENERATED SOCIAL NETWORKS
The second row indicates that the 1-year sensor data is in fact rather similar to the survey graph. However, the third row indicates that the email graph is not very similar to the survey graph. We can illustrate this visually with the following scaled Venn Diagrams as in Figure 2.
In short, the sensors provide a more complete picture of the social network connections than the email alone. Moreover, there are an order of magnitude fewer false-positive network links in the sensor graph data (0.6%) than in the email graph data (2%).
E.2 Hamming Distance
The next graph comparison metric which we applied was Hamming distance. A Hamming distance hd(x, y) can be defined as an operation on two vectors x and y of the same length n. The Hamming distance is the number of index locations i at which the elements x_i and y_i differ. First suppose we have a delta function which is similar to Kronecker’s delta, but which emits a 0 when the elements are equal:
$$\delta[n] = \begin{cases} 0, & \text{if } n = 0 \\ 1, & \text{if } n \neq 0 \end{cases} \qquad (2)$$
By making use of this delta function as defined in Equation (2) the Hamming distance hd(x, y) may be computed as follows:
$$hd(x, y) = \sum_{i=1}^{n} \delta[x_i - y_i]. \qquad (3)$$
That is, the sum of the delta over all elements from 1 to n in the vectors x and y. Since the maximal Hamming distance varies with different vector lengths n, it is convenient to have a normalized Hamming distance nhd(x, y):
REYNOLDS, WREN AND IVANOV: PRIVACY AND SENSOR-GENERATED SOCIAL NETWORKS
$$nhd(x, y) = \frac{\sum_{i=1}^{n} \delta[x_i - y_i]}{n} \qquad (4)$$
When this normalized Hamming distance is zero, the graphs are identical and no edits are required. As the number increases, a greater proportion of the edges must be edited in order to make the two graphs identical. When the number is one, every edge in the graph has to be altered to make the graphs equal. In Table I normalized Hamming distances are listed for our data sets.
Ordering by normalized Hamming distance, we see that the network constructed by fusing the email and sensor data is most similar to the survey network. These results also suggest the sensors provide a better approximation of the social network indicated by the survey data. On this metric, the email network is somewhat dissimilar to the survey network.
TABLE I
Similarity Metrics
Social Network   Jaccard Edge Coefficient   Normalized Hamming Distance
Fused            0.883                      0.117
Sensors          0.863                      0.137
Email            0.161                      0.838
E.3 Degree Distribution
Another line of analysis to judge the similarities of potential graphs is to compare their degree distribution. This approach is used by Newman et al. in examining empirical social network data sets [9]. For any vertex v in a graph, its degree k indicates how many adjacent vertices exist. For an undirected graph A, the average degree $\bar{k}$ can be defined in terms of the cardinality of graph edges |E(A)| and the cardinality of graph vertices |V(A)|.
• x = 1101011111
• y = 0001110001
• δ[x − y] = 1100101110
• hd(x, y) = 6
• nhd(x, y) = 6/10 = 0.6
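The worked example above can be reproduced with a short sketch. Representing the binary vectors as strings of '0' and '1' characters is a choice made here for readability; the function names are illustrative.

```python
def hamming_distance(x, y):
    """Count the index positions at which x and y differ."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(1 for a, b in zip(x, y) if a != b)


def normalized_hamming_distance(x, y):
    """Hamming distance divided by the vector length n."""
    return hamming_distance(x, y) / len(x)


x = "1101011111"
y = "0001110001"
delta = "".join("0" if a == b else "1" for a, b in zip(x, y))

print(delta, hamming_distance(x, y), normalized_hamming_distance(x, y))
```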
Jaccard Edges
As is typical, we define a graph G to be an ordered pair of sets denoting vertices and edges, G = (V, E). For convenience we use the notation V(G) to indicate a graph’s vertices and E(G) to denote its edges.
E.1 Jaccard Similarity Coefficient
The Jaccard similarity coefficient (a measure of similarity) is defined as the cardinality (or number of elements) of the intersection of sets A and B divided by the cardinality of their union.
Fig. 2. A visual comparison of the overlap in Jaccard edges using Venn diagrams scaled to network size. On the left the survey data set (blue) and the sensor data set (orange) are compared. On the right the survey data set (blue again) and email data set (green) are compared. Note that the sensor data set covers nearly all of the survey data.
As we were more interested in the relationships, we used the edges of the social graph, which denote some sort of communicative relationship. Thus we used an extended version of the Jaccard similarity coefficient which instead looked at the cardinality of edges in the graphs.
For convenience we refer to the result of this function application as Jaccard edges. Equation (1) defines Jaccard edges $J_E(A, B)$ by considering the intersection over the union of the edges of graphs A and B.
$$J_E(A, B) = \frac{|E(A) \cap E(B)|}{|E(A) \cup E(B)|} \qquad (1)$$
In Table I Jaccard edge similarity coefficients are listed. Firstly we see that the fused network drawing from both email and sensor data was most similar to the survey graph.
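Equation (1) translates directly into set operations. In this sketch each graph is represented only by its edge set, with an edge written as a (source, target) tuple; the two small graphs are hypothetical, and returning 1.0 when both graphs have no edges is a convention chosen here.

```python
def jaccard_edges(edges_a, edges_b):
    """|E(A) intersect E(B)| / |E(A) union E(B)| for two edge sets."""
    union = edges_a | edges_b
    if not union:
        return 1.0  # assumption: two empty graphs are treated as identical
    return len(edges_a & edges_b) / len(union)


# Hypothetical edge sets standing in for two social networks.
survey_edges = {(1, 2), (2, 3), (3, 4)}
sensor_edges = {(1, 2), (2, 3), (1, 3)}

similarity = jaccard_edges(survey_edges, sensor_edges)
print(similarity)
```

The two graphs share 2 of the 4 distinct edges, giving a coefficient of 0.5.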
We hypothesized that:
1. social network structure (as reported through an interface) can be estimated by a sensor network
2. a fused network will be more similar to the survey data than either email or sensor data alone
Centrality Analysis
• After maximizing spread in degree distribution, we computed (using igraph and R):
• betweenness centrality
• closeness centrality
• eigenvector centrality
• The same individual was most central in the email and survey datasets.
Readings
DHS 2.2 & A.7
Moravec, H. P. (1988).
Sensor Fusion in Certainty Grids for Mobile Robots. AI Magazine 9(2).
?
Twitter: @CarsonReynolds