mHigk, if var < Q
6.1 DATA PROCESSING FRAMEWORK
The main objective of this section is to represent meaningful relations and concepts extracted from large amounts of real-world sensory data in a human- and/or machine-interpretable format. As shown in Figure 45, real-world phenomena are observed by collecting measurements from sensors, and the raw data (mostly numerical) is sent to a user or a gateway, where it is further processed and given a meaningful semantic representation.
We provide a framework that infers knowledge from the data and constructs a topical ontology representation from the concepts that are extracted from the raw data. In this section, we introduce some background knowledge about sensor data and discuss semantic representation frameworks.
6.1.1 Real World Data
Real-world data is commonly reported through observation and measurement data obtained from sensory devices. Sensor data is often communicated as raw time-series data that can consist of a time stamp stating the time of measurement, a device ID, the values sensed by the sensors on board the sensor nodes (e.g. temperature, light, sound, presence) and other relevant metadata.
creasing. On the one hand, the price of hardware is decreasing, and on the other hand, day-to-day devices and appliances are equipped with more capable hardware. Due to the large number of sensor nodes and the high sampling rates of sensor data, the amount of data is unmanageable for many data processing algorithms. Processing this deluge of data effectively requires addressing real-time reporting, spatial distribution, the variety of sensors and the varying quality of the data. Therefore, dimension reduction techniques are usually used to reduce the number of features from a high-dimensional space to a low-dimensional representation [102].
The most commonly used techniques are: the Discrete Fast Fourier Transformation (DFFT), which transforms the time-based data into the frequency domain to remove unwanted frequencies before transforming it back to the time domain; the Principal Component Analysis (PCA), which extracts a new orthogonal basis to represent the original data by calculating the covariance or the Singular Value Decomposition (SVD); and the Piecewise Aggregate Approximation (PAA) and its symbolic representation, which uses averaged windows and is utilised in this work. We evaluate and discuss some of these techniques in the evaluation section.
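The averaged windows used by PAA can be illustrated with a minimal sketch. The function name `paa`, the window-splitting rule and the sample readings are illustrative assumptions and are not taken from the implementation evaluated in this work.

```python
from statistics import fmean


def paa(series, segments):
    """Piecewise Aggregate Approximation: replace each window by its mean.

    Assumption: windows are formed by integer division of the index range,
    so they may differ in size by one sample when len(series) is not a
    multiple of `segments`.
    """
    n = len(series)
    if not 1 <= segments <= n:
        raise ValueError("segments must be between 1 and len(series)")
    means = []
    for k in range(segments):
        start = k * n // segments
        end = (k + 1) * n // segments
        means.append(fmean(series[start:end]))
    return means


# Hypothetical temperature readings, reduced from 8 values to 4 window means.
readings = [20.0, 20.5, 21.0, 23.0, 26.0, 26.5, 22.0, 21.5]
reduced = paa(readings, 4)  # [20.25, 22.0, 26.25, 21.75]
```

Each output value summarises one window, so the number of values that have to be stored or transmitted drops from `len(series)` to `segments`.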
To abstract from numerical values and to create higher-level concepts from the large amount of data produced by sensor devices, we use the SensorSAX dimensionality reduction mechanism introduced in Section 4.3. SensorSAX discretises the data and generates symbolic words representing patterns from the sensor data.
Data discretisation serves as a building block for many pattern and event detection algorithms. It makes it possible to map reoccurring patterns to events even if there is variance, time shifting or different means in the data [73], [137], [83], [93]. SensorSAX uses a variable encoding rate, based on the activity in the streaming data, instead of a constant rate. By only transmitting SAX words when there is activity in the sensor data, it allows higher compression and fewer errors in reconstructing the original raw data. In this work, we focus on creating a topical ontology using the patterns that are extracted by SensorSAX.
For instance, a time series of sensor data is transformed into the discretised word "CDDCBAAAB"; similar patterns will resemble this symbolic representation. The string similarity between patterns helps to index and then compare different patterns by reducing the amount of data that has to be processed, and allows rules to be associated to compare and/or process the discretised words.
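How a numerical time series becomes a symbolic word can be sketched with plain fixed-rate SAX: standardise the series, average it over windows and map each window mean to a letter using breakpoints that divide the standard normal distribution into equally probable regions. This sketch is not SensorSAX; the variable encoding rate is omitted, and the function name `sax_word`, the four-letter alphabet and the sample series are assumptions made for illustration.

```python
from statistics import NormalDist, fmean, pstdev


def sax_word(series, word_length, alphabet="ABCD"):
    """Discretise a numerical series into a SAX word (fixed-rate sketch)."""
    if not 1 <= word_length <= len(series):
        raise ValueError("word_length must be between 1 and len(series)")

    # 1. Standardise to mean 0 and standard deviation 1.
    mean, std = fmean(series), pstdev(series)
    if std == 0:
        normalised = [0.0] * len(series)  # flat signal: no activity
    else:
        normalised = [(x - mean) / std for x in series]

    # 2. Piecewise aggregate approximation: one mean per window.
    n = len(normalised)
    window_means = [
        fmean(normalised[k * n // word_length:(k + 1) * n // word_length])
        for k in range(word_length)
    ]

    # 3. Breakpoints splitting N(0, 1) into len(alphabet) equiprobable regions.
    a = len(alphabet)
    breakpoints = [NormalDist().inv_cdf(k / a) for k in range(1, a)]

    # 4. Each window mean gets the letter of the region it falls into.
    return "".join(alphabet[sum(v > b for b in breakpoints)] for v in window_means)


# A steadily rising hypothetical signal maps to an ascending word.
word = sax_word([1, 2, 3, 4, 5, 6, 7, 8], word_length=4)  # "ABCD"
```

Because the series is standardised first, two signals with the same shape but different offsets or amplitudes yield the same word.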
To illustrate the symbolic data aggregation, consider the word "CDDCBAAAB", a pattern constructed via SensorSAX from the data of an accelerometer that has been attached to a door and measured over 5 seconds. This could lead to the semantic concept "doorClosed" or "doorOpened" that can be stored and represented in an ontology.
132 AUTOMATED ONTOLOGY CONSTRUCTION
[Figure 46: Framework Overview. Diagram labels: Raw Sensor Data, Data Pre-Processing, SAX Discretisation, Ontology Construction, Concept Creation, Property Creation, Rule Base, Rule Based Labelling, Concept Naming, Property Naming, Ontology.]
6.1.2 Semantic Representation of Real World Data
The key idea behind using semantic descriptions for sensor data is to enable representation, formalisation and enhanced interoperability of sensor data. Ontologies can be used to store semantic concepts that represent phenomena and attributes from the real world that are understandable for the human user and also interpretable for machines due to the standardised data representation.
The concepts can be linked together through relationships that express interactions and dependencies between the concepts. The W3C Semantic Sensor Network Incubator Group has introduced the Semantic Sensor Network Ontology (SSN) [19], which provides a model to annotate sensors, their metadata and the gathered data. The SSN Ontology uses semantic concepts to model the physical attributes of sensor networks such as "Sensor Device", "Temperature Sensor" and "Radio Link". Properties in the SSN model the relationships between concepts, such as "occuredAt" and "observedBy", to relate sensor data annotations to domain models.
Zhao and Meersmann [141] introduce the concept of topical ontologies that represent a basic knowledge structure of a certain domain that can be used as a building block for further enhancement. Topical ontologies include the main concepts (topics) that appear in a certain domain but, unlike a taxonomy, also provide basic relations between the fundamental concepts.
We use the SSN Ontology as a starting point for our method and extend the ontology by extracting new insights from the raw sensor data to construct a topical ontology representing an extract of the observed domain. The following describes our approach to bridge the gap between raw data and the required semantic concepts.
6.1.3 Overview of the framework
Figure 46 shows an overview of the proposed framework to process the raw sensory data and construct a topical ontology. The framework consists of three main components: Data Pre-Processing, Ontology Construction and Rule Based Labelling. The raw sensor data serves as the input for the framework. A k-means clustering mechanism is used to group the data into clusters that form the unlabelled concepts. A Markov model is used to create temporal relations between the newly created concepts.
The unnamed concepts (i.e. clustered patterns) and temporal relations are used to create the initial topical ontology. After the initial ontology construction, the concepts are labelled using a rule-based reasoning mechanism. The rule-based engine processes the context of the data and tries to name the unlabelled concepts and properties.
1. Data Pre-Processing: In a first step, the raw data is standardised to a mean of 0 and a standard deviation of 1 to ensure an even distribution of the data over the whole processing period and to allow comparison of differently distributed signals. Afterwards, the data is transformed into SensorSAX patterns. This allows the mapping of symbolised descriptions to semantic concepts in the ontology construction and also reduces the size of data communication. The dimensionality of the data is reduced by the aggregation algorithm in SensorSAX.
This step can be performed on the sensing devices. If the devices are not able to perform the task due to limited processing capabilities, the process can be moved to a node with higher processing capabilities (e.g. a gateway).
2. Ontology Construction: The structure creation process defines the outline of the ontology construction. A preliminary ontology structure is created by extracting concepts and properties using a clustering algorithm and a statistical model. We follow a conceptual clustering approach [39] to create semantic concepts without labelling them.
The clusters are formed based on the similarity of the attributes: the symbolic representation and the metadata such as sensor type and time range of the measurement. Each cluster is formalised as an unnamed concept in the ontology structure. To model the properties in our current implementation, we use a Markov model to find temporal relations such as "occursAfter" between the concepts.
3. Rule-Based Labelling: In order to name the concepts and the properties, we utilise a rule-based mechanism. The rule system is based on the Semantic Web Rule Language. It accepts symbolised SAX patterns and adds a name tag to the unlabelled concepts.
We introduce a system that is able to extract rules based on the meta-information and external data sources to automatically define the labels.
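The labelling step in item 3 can be pictured with a small stand-in written in plain Python rather than the Semantic Web Rule Language. Everything in this sketch is hypothetical: the `UnlabelledConcept` fields, the two rules, and the labels "doorMoved" and "doorIdle" are invented for illustration and are not rules used by the framework.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class UnlabelledConcept:
    pattern: str        # representative SAX word of the cluster
    sensor_type: str    # metadata: kind of sensor that produced the data
    attached_to: str    # metadata: observed object (hypothetical field)


@dataclass(frozen=True)
class Rule:
    label: str
    condition: Callable[[UnlabelledConcept], bool]


def _door_accelerometer(c: UnlabelledConcept) -> bool:
    return c.sensor_type == "accelerometer" and c.attached_to == "door"


# Hypothetical rule base: a varying pattern means movement, a flat one idle.
RULES = [
    Rule("doorMoved", lambda c: _door_accelerometer(c) and len(set(c.pattern)) > 1),
    Rule("doorIdle", lambda c: _door_accelerometer(c) and len(set(c.pattern)) == 1),
]


def label_concept(concept: UnlabelledConcept, rules=RULES) -> Optional[str]:
    """Return the label of the first matching rule, or None if none applies."""
    for rule in rules:
        if rule.condition(concept):
            return rule.label
    return None


concept = UnlabelledConcept("CDDCBAAAB", "accelerometer", "door")
label = label_concept(concept)  # "doorMoved" under the hypothetical rules
```

A concept that matches no rule stays unlabelled (`None`), which mirrors the idea that the engine only tries to name concepts for which the context provides enough evidence.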
6.2 DATA DRIVEN ONTOLOGY CONSTRUCTION
The following three methods are introduced to develop a solution that automatically constructs an ontology depicting a perceptual view of the sensed environment: clustering the symbolic patterns, creating properties via a Markov model and naming the unlabelled concepts via a rule-based method.
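The second method, deriving properties with a Markov model, can be sketched as estimating first-order transition probabilities over the time-ordered sequence of cluster identifiers and keeping the frequent transitions as "occursAfter" relations. The concept identifiers, the function names and the probability threshold are assumptions made for this illustration.

```python
from collections import Counter, defaultdict


def transition_probabilities(concept_sequence):
    """Estimate first-order Markov transition probabilities from a sequence."""
    counts = defaultdict(Counter)
    for current, following in zip(concept_sequence, concept_sequence[1:]):
        counts[current][following] += 1
    return {
        current: {f: k / sum(row.values()) for f, k in row.items()}
        for current, row in counts.items()
    }


def occurs_after_relations(concept_sequence, threshold=0.6):
    """Return (subject, "occursAfter", object) triples for likely transitions.

    Assumption: a relation is created when the transition probability
    reaches `threshold`; the value 0.6 is arbitrary.
    """
    probabilities = transition_probabilities(concept_sequence)
    return sorted(
        (following, "occursAfter", current)
        for current, row in probabilities.items()
        for following, p in row.items()
        if p >= threshold
    )


# Hypothetical time-ordered cluster assignments of observed patterns.
sequence = ["c1", "c2", "c1", "c2", "c3", "c1"]
relations = occurs_after_relations(sequence)
# [("c1", "occursAfter", "c3"), ("c2", "occursAfter", "c1")]
```

In this example c2 is followed by c1 and c3 equally often, so neither transition reaches the threshold and no relation is created for them.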
6.2.1 Clustering for Concept Construction
In order to reduce the amount of data that has to be processed, we use the SensorSAX algorithm to create compressed symbolic representations of the data. SAX introduces a distance function that allows comparing generated words such as "ABBA" and "ABBC" and stating a similarity between 0 and 1. Common distance measurements and string similarity functions such as the Levenshtein or Hamming distance cannot be used on the SAX words due to the non-uniform distribution of the letters in the main SAX algorithm.
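A lookup-table distance between SAX words can be sketched as follows. The sketch follows the commonly published SAX lower-bounding distance, in which adjacent letters have distance zero and other letter pairs are separated by the gap between their Gaussian breakpoints. It returns an unnormalised distance rather than a similarity in [0, 1]. The parameter names, the reading of the scaling factor as original series length divided by word length, and the four-letter alphabet are assumptions that may differ from the notation used in this chapter.

```python
from math import sqrt
from statistics import NormalDist


def lookup_table(alphabet_size):
    """Precompute letter-to-letter distances for a given alphabet size."""
    b = [NormalDist().inv_cdf(k / alphabet_size) for k in range(1, alphabet_size)]
    return [
        [0.0 if abs(i - j) <= 1 else b[max(i, j) - 1] - b[min(i, j)]
         for j in range(alphabet_size)]
        for i in range(alphabet_size)
    ]


def sax_dist(p, q, series_length, alphabet="ABCD"):
    """Distance between two equally long SAX words.

    Assumption: `series_length` is the length of the original time series
    each word was built from.
    """
    if len(p) != len(q):
        raise ValueError("words must have the same length")
    table = lookup_table(len(alphabet))
    index = {letter: k for k, letter in enumerate(alphabet)}
    squared = sum(table[index[a]][index[c]] ** 2 for a, c in zip(p, q))
    return sqrt(series_length / len(p)) * sqrt(squared)


d_adjacent = sax_dist("ABBA", "ABBB", series_length=16)  # 0.0: A and B are adjacent
d_far = sax_dist("ABBA", "ABBC", series_length=16)       # about 1.349
```

The two calls differ from what a Hamming distance would report: both word pairs differ in exactly one position, yet only the pair whose letters are not adjacent in the alphabet gets a non-zero distance.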
Comparing the words alone is not sufficient, as words can be similar but measured by different types of sensors that are not related to each other. The words also depend on the observation time. We therefore introduce a set of information that is needed to cluster the data into different groups based on their different attributes. We define a triple set A = [P, t, T], where P is a SAX word, t is the observation time and T the observation type. In addition, we define a distance function (shown as Equation 13) to compare the similarity of two triple sets.
\[ \mathrm{saxDist}(P, Q) = \sqrt{\frac{n}{w}} \sqrt{\sum_{i=1}^{w} \mathrm{dist}(p_i, q_i)^2} \tag{12} \]
\[ \mathrm{distance}(A_1, A_2) = \mathrm{saxDist}(P_1, P_2) \cdot \mathrm{timeDiff}(t_1, t_2) \cdot \mathrm{typeDiff}(T_1, T_2) \tag{13} \]
Equation 12 depicts the original SAX-based distance function. saxDist(P, Q) returns the distance between two words P and Q according to the distance function in [83], where n is the length of the SAX word, w the alphabet size of letters used in the discretisation process and dist(p_i, q_i) refers to a precalculated lookup table for the particular alphabet size w. We extend Equation 12 by adding factors that compare the time difference and the type difference between two triples, as shown in Equation 13. timeDiff returns a value in ]0, 1] according to the temporal distance of two triples, and typeDiff returns either 0 or 1 depending on whether the types of the triples match. Comparing function values from Euclidean and non-Euclidean spaces can lead to wrong results as the space dimensions are not equal. The alternatives are the use of non-linear dimensionality reduction techniques and the kernel trick to map them