
serve as the initial centroids for two different clusters. The distances of the other triples from the centroids are calculated according to the

6.3 IMPLEMENTATION AND EVALUATION

To evaluate the proposed framework, we use sensor data obtained from the testbed in the Centre for Communication Systems Research at the University of Surrey [97]. The data is collected over a one-month period from sensor nodes deployed in the offices, collecting information about light level, power consumption of the workstation, passive infra-red (PIR), temperature and noise levels.

[Figure 49]
(a) Reduction of the original data to a reduced vector of length 128
(b) Distance between original data and reconstructed reduced data, plotted over the reduced vector length
(c) Execution time of the different approaches with different reduced vector lengths

140 A U T O M A T E D O N T O L O G Y C O N S T R U C T I O N

We have also made the data available online¹. The evaluation was conducted on a workstation computer with 4 GB of RAM and a 2.4 GHz dual-core processor. Python was used as the programming environment. We first transform the data into a dimensionality-reduced data set. In our evaluation we examine the Fast Fourier Transformation (FFT), the Principal Component Analysis (PCA), the Piecewise Aggregate Approximation (PAA) and its discrete form, the Symbolic Aggregate Approximation (SAX). In Figure 49, panel (a), the original and reduced data are shown. We have chosen to reduce the 274960 samples to a representation of 128 samples. To evaluate the reconstruction error of the different approaches, we transform the dataset into output vectors of different lengths (each length a power of 2). Afterwards we reconstruct the original data from the reduced data by extrapolating² the data to the initial size so that the two are comparable. We measure the reconstruction error by taking the Euclidean distance between the original data and the reconstructed data. In panel (b) of Figure 49, the reconstruction error (by Euclidean distance) is shown. SAX and PAA have the best reconstruction results, whereas FFT is third and the reconstruction using the principal components ranks last. We also measure the execution time of the different algorithms, shown in panel (c) of Figure 49. The FFT has constant execution time, whereas PCA, SAX and PAA show a linear rise in execution time with increasing output vector length. In our approach we choose the SAX algorithm because of its simple implementation. SAX consists of two loops, addition and multiplication, in contrast to the PCA and FFT algorithms, which require complex matrix computations such as eigenvalue / singular value decomposition.
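
The reconstruction-error measurement described above can be sketched for the PAA case as follows. This is a minimal illustration under stated assumptions, not the code used in the evaluation: the function names `paa`, `reconstruct` and `euclidean` are hypothetical, the input is a synthetic sine wave rather than the testbed data, and only the standard library is used.

```python
import math

def paa(series, n_segments):
    """Piecewise Aggregate Approximation: the mean of each of n_segments windows."""
    length = len(series)
    reduced = []
    for i in range(n_segments):
        window = series[i * length // n_segments:(i + 1) * length // n_segments]
        reduced.append(sum(window) / len(window))
    return reduced

def reconstruct(reduced, length):
    """Expand the reduced vector back to `length` samples by copying each value."""
    n = len(reduced)
    return [reduced[i * n // length] for i in range(length)]

def euclidean(a, b):
    """Euclidean distance between two equally long sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Synthetic stand-in for a sensor trace (assumption: not the testbed data).
signal = [math.sin(2 * math.pi * i / 256) for i in range(1024)]

# Reconstruction error for several output lengths (powers of 2).
errors = {}
for n in (64, 128, 256):
    errors[n] = euclidean(signal, reconstruct(paa(signal, n), len(signal)))
```

On this synthetic input the error shrinks as the reduced vector grows, which is the trade-off between compression and fidelity that the evaluation measures.
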
We argue that the complexity of the algorithm should be considered so that it is applicable to energy- and computation-constrained sensor nodes. In this work the dimensionality reduction is performed on a powerful workstation; however, the dimensionality reduction process can be outsourced to the resource-constrained sensor nodes. The data is then normalised and transformed into the SAX representation to reduce its dimensionality and to make it suitable for our ontology construction algorithm. We transform each observation day into one SAX word of length 24, representing one letter per hour, ending up with 31 words that each include 24 letters.
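
The transformation of one observation day into a 24-letter SAX word can be sketched as follows. The sketch assumes the usual SAX steps (z-normalisation, PAA, discretisation with equiprobable Gaussian breakpoints); the function name `sax_word`, the seven-letter alphabet and the sample values are illustrative assumptions, not taken from the evaluation.

```python
from statistics import NormalDist, mean, pstdev

def sax_word(samples, word_length=24, alphabet="ABCDEFG"):
    """Turn one window of samples into a SAX word (normalise, PAA, discretise)."""
    mu, sigma = mean(samples), pstdev(samples)
    # A constant window has no spread; map it to zeros to avoid dividing by zero.
    normalised = [(x - mu) / sigma if sigma > 0 else 0.0 for x in samples]

    # PAA: one mean value per segment.
    n = len(normalised)
    segments = []
    for i in range(word_length):
        window = normalised[i * n // word_length:(i + 1) * n // word_length]
        segments.append(sum(window) / len(window))

    # Breakpoints that split the standard normal distribution into
    # len(alphabet) equally probable regions.
    a = len(alphabet)
    breakpoints = [NormalDist().inv_cdf(k / a) for k in range(1, a)]

    # Each segment mean is mapped to the letter of the region it falls into.
    return "".join(alphabet[sum(value > b for b in breakpoints)] for value in segments)

# Made-up hourly power readings for one day: quiet night, busy office hours.
day = [10] * 8 + [100] * 10 + [10] * 6
word = sax_word(day)
print(word)
```

With one sample per hour, each letter of the word corresponds to one hour of the day; with a higher sampling rate the PAA step averages the samples of each hour first.
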

In a first step the data is grouped using our discrete KMeans clustering method with a group size of k=2 and k=3, as shown in Figure 51. The k parameter is estimated by grouping the data over different values of k and calculating, for each k, the variance of each cluster group. The goal is to keep the overall variance per cluster as low as possible. In Figure 50, we show the variance over different values of k and eventually set the number of clusters to two for the following steps. Each data point in Figure 51 represents a triple (SAX Pattern, Timestamp,

¹ http://kat.ee.surrey.ac.uk/data.csv
² extrapolating by copying the values

Figure 50: Determination of the number of clusters based on the cluster group variance. (a) Watt dataset; (b) Light dataset. Both panels plot the variance over the number of clusters.

[Figure 51] (a) Clustering with k=2 (Unnamed Concept 1, Unnamed Concept 2); (b) Clustering with k=3 (Unnamed Concept 1, Unnamed Concept 2, Unnamed Concept 3)

Figure 52: Evaluating the Clustered Data from different Sensors with Real Calendar Information and showing the Error Rate over 100 Random Runs (panels: errors in group "watts", errors in group "light", errors in group "mic")

Sensor type) that is grouped into one particular group. The clusters are then represented as a concept. We know from the data that two trends can be observed: the data from the power meter, noise and PIR sensors show high activity during workdays and remain steady over the weekend. The goal is to automatically label the unnamed concepts as either workday or weekend and represent this knowledge in the ontology.
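
The variance-based estimation of k used in this section can be sketched as follows. Note that this is a plain Lloyd-style k-means on scalar values with a deterministic start, swapped in for illustration; it is not the discrete KMeans with the modified distance function used in our approach. The function names and the activity values are made up.

```python
def kmeans_1d(values, k, iterations=50):
    """Plain k-means on scalar values (not the modified distance of the text)."""
    ordered = sorted(values)
    # Deterministic start: centroids spread evenly over the sorted values.
    centroids = [ordered[(2 * j + 1) * len(ordered) // (2 * k)] for j in range(k)]
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centroids[j]))
            clusters[nearest].append(v)
        updated = [sum(c) / len(c) if c else centroids[j] for j, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids, clusters

def mean_cluster_variance(clusters):
    """Average of the per-cluster variances; empty clusters are ignored."""
    variances = []
    for c in clusters:
        if c:
            m = sum(c) / len(c)
            variances.append(sum((v - m) ** 2 for v in c) / len(c))
    return sum(variances) / len(variances)

# Made-up daily activity values: five busy days followed by two quiet days.
activity = [52, 55, 50, 58, 54, 5, 4] * 4

variance_by_k = {}
for k in range(1, 6):
    _, clusters = kmeans_1d(activity, k)
    variance_by_k[k] = mean_cluster_variance(clusters)
```

For this toy data the variance drops sharply between k=1 and k=2 and only marginally afterwards, which is the kind of pattern that motivates fixing the number of clusters at two.
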

We evaluate the results of the clustering method with real calendar information, as shown in Figure 52. The best case is to achieve an error rate of 0, meaning that all triples of the data set have been correctly grouped into either the workday or weekend group.

Because we choose random triples as the starting point for our clustering method, the results of each experiment can differ. To show the performance of the algorithm across experiments, we run the evaluation 100 times to get a comparable average, minimum and maximum of the error rate. The results are shown in Table 25. In most cases the triples are categorised correctly; however, sometimes an odd starting triple is selected and all triples, including the ones from the weekend, are categorised as weekdays, resulting in the highest error rate of 8.
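
The bookkeeping of this repeated evaluation can be sketched as follows. The sketch uses a plain two-cluster k-means on scalar values with randomly chosen starting samples and a made-up month of activity data; it is an assumption-laden stand-in for our clustering method and does not reproduce the reported error rates. On such well-separated synthetic data the plain algorithm recovers from an unlucky start, so the sketch shows how average, minimum and maximum are collected rather than the failure case described above.

```python
import random

def two_means(values, rng, iterations=20):
    """k=2 clustering of scalar values, started from two randomly chosen samples."""
    centroids = rng.sample(values, 2)
    assignment = [0] * len(values)
    for _ in range(iterations):
        assignment = [
            0 if abs(v - centroids[0]) <= abs(v - centroids[1]) else 1
            for v in values
        ]
        for j in (0, 1):
            members = [v for v, a in zip(values, assignment) if a == j]
            if members:
                centroids[j] = sum(members) / len(members)
    return assignment

def count_errors(assignment, is_workday):
    """Wrongly grouped days; cluster ids are arbitrary, so both mappings are tried."""
    direct = sum((a == 1) != w for a, w in zip(assignment, is_workday))
    return min(direct, len(assignment) - direct)

# Made-up month: 31 days starting on a Monday, busy on workdays, quiet at weekends.
is_workday = [day % 7 < 5 for day in range(31)]
activity = [50 + (day % 3) if w else 5 for day, w in enumerate(is_workday)]

rng = random.Random(0)  # fixed seed so that the sketch is repeatable
errors = [count_errors(two_means(activity, rng), is_workday) for _ in range(100)]
average, lowest, highest = sum(errors) / len(errors), min(errors), max(errors)
```

If every day ends up in a single group, `count_errors` returns the number of weekend days, which for a 31-day month with eight weekend days matches the worst case of 8 mentioned above.
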

Table 25: Error Rate in detecting the correct groups from different sensor types

Error Rate   Watt   Light   PIR    Mic
Average      3.07   34 6    1.14   1.27
Minimum      0      0       0      0


Figure 53: Number of Relations based on the Factors: Cut-off Threshold and Cluster Size

The groups are then included in a baseline ontology as unnamed concepts. The temporal relation between the concepts is extracted using the statistical model. In this scenario, it is more likely that one concept follows the same concept, p(group1 | group1) = 0.7, than that group 2 follows group 1, p(group2 | group1) = 0.2. This expresses that it is more likely that a weekday follows another weekday. As stated earlier, the number of (temporal) relationships depends on the cluster size and the cut-off threshold.
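
The extraction of temporal relations can be illustrated with a first-order transition estimate and a cut-off threshold. This is a minimal sketch: the function names are hypothetical, the label sequence is made up, and the resulting probabilities are not the values reported above.

```python
from collections import Counter, defaultdict

def transition_probabilities(sequence):
    """Estimate p(next | current) from consecutive pairs in a label sequence."""
    pair_counts = Counter(zip(sequence, sequence[1:]))
    totals = Counter(sequence[:-1])
    probabilities = defaultdict(dict)
    for (current, following), count in pair_counts.items():
        probabilities[current][following] = count / totals[current]
    return dict(probabilities)

def relations_above(probabilities, cutoff):
    """Keep only the transitions whose probability reaches the cut-off threshold."""
    return [
        (current, following, p)
        for current, row in probabilities.items()
        for following, p in row.items()
        if p >= cutoff
    ]

# Made-up cluster labels for 14 consecutive days (Monday first).
days = ["group1"] * 5 + ["group2"] * 2 + ["group1"] * 5 + ["group2"] * 2

p = transition_probabilities(days)
relations = relations_above(p, cutoff=0.5)
```

Lowering the cut-off admits more transitions as relations, which is why the number of relations in the ontology depends on the threshold as well as on the number of clusters.
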

In Figure 53, we show the dependency between cluster size, threshold and the resulting number of relations that are eventually represented in the ontology. Currently there is no automated way to choose the right parameters, and thus heuristics and domain experience have to be considered while designing the system.

Figure 54 shows an excerpt of the automatically constructed ontology. Squares represent classes that can have individuals, i.e. instances from a certain domain, which are represented as circles in the figure. On the left of the figure is the information that can be gathered from the sensor devices themselves. Meta-information such as observation period, deployed devices and their capabilities is represented as SSN concepts. The SAX words and the inferred information that is acquired through the framework are shown on the right. The grey highlighted concepts show the novelty of the automated process. Figure 55 shows a screenshot of the ontology visualised by an ontology visualisation

Figure 54: A Schematic View of the Constructed Topological Ontology (left: meta-information gathered from sensor data; right: inferred information)


Figure 55: An Excerpt of the Automatically Created Topological Ontology

tool³ and Listing 2 depicts an RDF/XML snippet of the created ontology. The framework can infer the meaning of raw sensor data and represent it in a topological ontology. The created ontology can be downloaded.

6.3.1 Discussion

The current work allows the creation of a topical ontology from raw sensor data. The created ontology can be used as a baseline for further improvements, creating richer ontologies. Our approach is divided into three steps: Data pre-processing, Ontology Construction and Rule Based

³ http://semweb.salzburgresearch.at/apps/rdf-gravity/


Listing 2: RDF/XML Snippet of the created Ontology

    <rdf:about="http://www.semanticweb.org/eris/ontologies/2013/8/abstraction-ontology-17#ABBBCFGAA">
        <rdf:type rdf:resource="http://www.semanticweb.org/eris/ontologies/2013/8/abstraction-ontology-17#SAX_Word"/>
        <SSN:occuredAt rdf:resource="http://www.semanticweb.org/eris/ontologies/2013/8/abstraction-ontology-17#201305081200:1300"/>
        <SSN:observedBy rdf:resource="http://www.semanticweb.org/eris/ontologies/2013/8/abstraction-ontology-17#Sensor_2"/>

Table 26: Methods applied throughout the process

Pre Processing
Applied Method   Alternatives                          Parameter   Description          Parameter Learning
SAX              PCA, DFFT                             n           output length        Entropy
                                                       a           alphabet size        Variance

Ontology Construction
Applied Method   Alternatives                          Parameter   Description          Parameter Learning
KMeans           Hierarchical Clustering, MeanShift    K           number of clusters   FTest, D ie
                                                       distance    distance function
Markov Model     Hidden Markov Model, Neural Network   t           cutoff threshold     Expectation Maximisation

Labelling
Applied Method   Alternatives                          Parameter   Description          Parameter Learning
Rule-Based       Statistics, Crowdsourcing             n           Rule Base

labelling.

We propose an automated framework; however, there are certain parameters during each step that influence the outcome. In the following we describe the parameters in each step and discuss their impact. The SAX algorithm, used to transform the raw numerical sensor data into string representations in order to reduce the dimensionality and ease comparison, takes a window of samples with a specified window length and turns it into a reduced vector of smaller length. The choice of the reduced vector length can have an impact on the next processing steps. If a very small vector length is chosen, important data such as outliers or certain patterns can be lost. If the reduced vector length is chosen too high, the effect of reducing the amount of data is diminished, and either too much noise is passed on to the next algorithm or the amount of data is not suitable for the processing-intensive clustering process.

Besides the parameter n, which controls the reduced vector output length, a parameter a has to be set to control the size of the dictionary that is used when transforming to a string representation. The larger the dictionary, the finer the granularity of the discretised representation. We have conducted some research that chooses the right parameters based on the variance of the data [46]. This is useful when interesting events occur outside of the mean of a data window. Other possible techniques for the dimensionality reduction are Principal Component Analysis (PCA) and the Discrete Fourier Transformation (DFT). The different techniques are benchmarked in the evaluation section.

The Ontology Construction step uses a modified KMeans clustering algorithm that groups similar samples based on their distance. The algorithm requires two parameters, the predicted number of clusters K and a distance function. Commonly the Euclidean distance is used to calculate similarity between data points; however, here we use a modified distance function. In different application scenarios, variance changes in the sensor data and different scales (time function) will affect the results. To prove the feasibility, a larger case study with data sets from more domains has to be conducted; however, this would exceed the scope of this prototypical work. There are methods to estimate the number of clusters. In the evaluation section, we introduce a method to determine the number of clusters based on the group variance/expected variance. There are other methods to group and classify samples that are more use-case specific, such as hierarchical clustering or Mean-shift. To label the groups, we use a rule-based approach. The approach is non-parametric and mainly relies on the knowledge base. Therefore the rule base has to be chosen according to the application scenario. Other approaches that leverage crowdsourcing mechanisms to label the concepts are described, for example, in [125].
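
A rule-based labelling step of the kind described above can be sketched as follows. The rule base shown here is hypothetical and written for the office scenario only (a cluster with above-average activity is labelled workday, otherwise weekend); the function name `label_clusters`, the concept names and the values are illustrative assumptions rather than the rules used in the framework.

```python
def label_clusters(clusters, rules):
    """Assign the label of the first matching rule to every cluster."""
    overall = [v for members in clusters.values() for v in members]
    context = {"overall_mean": sum(overall) / len(overall)}
    labels = {}
    for name, members in clusters.items():
        cluster_mean = sum(members) / len(members)
        for label, condition in rules:
            if condition(cluster_mean, context):
                labels[name] = label
                break
        else:
            # No rule matched: the concept stays unnamed.
            labels[name] = name
    return labels

# Hypothetical rule base for an office scenario.
rules = [
    ("workday", lambda mean, ctx: mean > ctx["overall_mean"]),
    ("weekend", lambda mean, ctx: mean <= ctx["overall_mean"]),
]

# Made-up activity values per unnamed concept.
clusters = {
    "UnnamedConcept1": [52, 55, 50, 58, 54],
    "UnnamedConcept2": [5, 4],
}
labels = label_clusters(clusters, rules)
```

Because the rules are plain data passed to the function, a different application scenario only requires a different rule base, not a change to the labelling code.
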

In Table 26, we show the applied algorithms and summarise the parameters used that have an impact on the generated topical ontology.

AN INTEGRATED SYSTEM FOR KNOWLEDGE ACQUISITION

To combine the middleware that communicates and captures the data with the methods that process the data into understandable concepts, we introduce an analytics software that bridges the gap between the different parts. It connects to the middleware introduced in Chapter 3 to provide a simple interface to the underlying sensor networks, and also provides an interface to the processing and abstraction methods presented in Chapter 5 and Chapter 6. The Knowledge Acquisition Toolkit (KAT) can be used as a standalone application to process data from sources that are not handled by the middleware component, such as static CSV and Excel file formats. KAT also provides an interface to the middleware to retrieve data from sources managed by the middleware component. In the following, we introduce the architecture of the software and describe its implementation. The workflow from data acquisition to knowledge acquisition and an exemplary use case are also discussed.

7.1 DESIGNING AN INTEGRATED TOOL

The main focus of KAT is to provide an analytics framework that combines the data provided by the middleware introduced in Chapter 3 with the processing and abstraction methods discussed in Chapter 2, Chapter 5 and Chapter 6. The toolkit is divided into three main layers, depicted in Figure 56, namely data provisioning, responsible for collecting the data from various input sources, data processing, and the representation of the data to the end-user. Similar to the architecture of the middleware component, the toolkit follows an extensible plug-in design. New components can be included by following the interface descriptions for each layer. This makes the toolkit flexible and allows the development of new plug-ins on each layer. The plug-in developer can focus on the logic and does not have to handle the implementation of the other layers as long as s/he follows the interface description provided for each layer. In the following we introduce each layer and describe the workflow steps from data provisioning to data representation.
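
The layered plug-in design can be illustrated with one abstract interface per layer. The following is a sketch only: the class and method names (`DataProvider`, `Processor`, `Representation`, `run_pipeline` and the concrete plug-ins) are hypothetical and do not reflect the actual interface descriptions of KAT.

```python
from abc import ABC, abstractmethod

class DataProvider(ABC):
    """Data provisioning layer: delivers raw samples from some source."""
    @abstractmethod
    def read(self):
        ...

class Processor(ABC):
    """Data processing layer: turns its input into a processed result."""
    @abstractmethod
    def process(self, data):
        ...

class Representation(ABC):
    """Representation layer: presents a result to the end-user."""
    @abstractmethod
    def render(self, result):
        ...

class ListProvider(DataProvider):
    """Stand-in for a static file source."""
    def __init__(self, samples):
        self.samples = list(samples)

    def read(self):
        return self.samples

class MeanProcessor(Processor):
    """Example processing plug-in: reduces the samples to their mean."""
    def process(self, data):
        return sum(data) / len(data)

class TextRepresentation(Representation):
    """Example representation plug-in: formats the result as text."""
    def render(self, result):
        return f"mean = {result:.2f}"

def run_pipeline(provider, processors, representation):
    """Pass the data through the three layers in order."""
    data = provider.read()
    for processor in processors:
        data = processor.process(data)
    return representation.render(data)

output = run_pipeline(ListProvider([1, 2, 3, 4]), [MeanProcessor()], TextRepresentation())
```

A new plug-in only has to implement the abstract method of its own layer; the pipeline function and the plug-ins of the other layers remain unchanged.
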

DATA PROVISIONING LAYER: The data has to be gathered for further processing. The data provisioning layer provides two modes for the data collection. Either the layer establishes a connection to a gateway that runs the middleware component, or static file sources can be directly accessed by the tool. The data provision-

In document Intelligent Communication and Information Processing for Cyber-Physical Data. (Page 148-159)