Data Provisioning - Intelligent Communication and Information Processing for Cyber-Physical Data

Figure 56: Architectural Overview of the Toolkit (labels shown: Middleware API, Static Data Wrapper; CSV, CKAN, EXCEL)

ing layer provides a unified access API that offers a common data access interface to the higher layers. The current implementation supports static CSV and Excel sheets in addition to the data provided by the middleware.
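As an illustration of what such a common data access interface could look like, the following is a minimal sketch and not KAT's actual API. The names `DataSource`, `CsvSource`, `InMemorySource` and `load` are hypothetical, and the in-memory class merely stands in for a source backed by the middleware.

```python
import csv
import io
from abc import ABC, abstractmethod


class DataSource(ABC):
    """Hypothetical common interface that the higher layers program against."""

    @abstractmethod
    def read(self) -> dict:
        """Return the samples of every category, keyed by category name."""


class CsvSource(DataSource):
    """Static wrapper around CSV text with one header row of category names."""

    def __init__(self, text: str):
        self._text = text

    def read(self) -> dict:
        rows = csv.DictReader(io.StringIO(self._text))
        columns = {name: [] for name in rows.fieldnames or []}
        for row in rows:
            for name in columns:
                columns[name].append(float(row[name]))
        return columns


class InMemorySource(DataSource):
    """Stand-in for a middleware-backed source (assumption: a real one would call the middleware API)."""

    def __init__(self, samples: dict):
        self._samples = samples

    def read(self) -> dict:
        return {name: list(values) for name, values in self._samples.items()}


def load(source: DataSource, categories: list) -> dict:
    """Read any source through the same call and keep only the selected categories."""
    data = source.read()
    return {name: data[name] for name in categories}
```

With this shape, the processing code only ever calls `load(source, categories)` and does not need to know whether the samples came from a static file or from the middleware.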

DATA PROCESSING LAYER: The Toolkit allows the user to select and apply the different algorithms introduced in this thesis to the data. In contrast to the approach introduced in Chapter 6, the selection is carried out manually. The non-automatic method selection aims to visualise the methods in order to ease understanding for an analyst and/or knowledge engineer. When a data source is going to be incorporated into the system, the toolkit can be used to pre-select some of the introduced methods. As shown later in a use-case scenario, different data sets require different processing methods to create interpretable abstractions from the data.

DATA REPRESENTATION LAYER: The visualisation of the various processing methods is provided in the data processing layer. It also provides the representation of the different steps, from data source selection to the selection and application of the processing methods, in a semantic RDF representation. The semantic representation allows the decisions that have been made throughout the workflow to be traced, from the initial sensor values to the higher-level abstractions. However, not only the parameters are represented in the semantic model, but also the abstractions and their relationship models through properties.

Figure 57: Workflow from Data Acquisition to Knowledge Acquisition (labels shown: Analyst/Knowledge Engineer, Machine Process, Knowledge Acquisition Toolkit, Workstation, Static Data Source, Gateway/Middleware; CSV, EXCEL, JSON)
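To make the idea of a traceable semantic workflow more concrete, the sketch below serialises a chain of processing steps as N-Triples using only the standard library. It is an illustration under stated assumptions: the `http://example.org/kat#` namespace, the property names `appliesMethod` and `hasInput`, and the parameter names are all hypothetical and do not reflect the vocabulary KAT actually uses.

```python
EX = "http://example.org/kat#"  # hypothetical namespace, not KAT's real vocabulary
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"


def iri(name: str) -> str:
    """Wrap a local name as a full IRI in the example namespace."""
    return f"<{EX}{name}>"


def literal(value) -> str:
    """Render a value as a plain N-Triples string literal."""
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def workflow_to_ntriples(source: str, steps: list) -> list:
    """Serialise a data source and its processing steps as N-Triples lines.

    Each step points at its input (the source or the previous step) and
    carries its parameters, so a result can be traced back to the raw data.
    """
    triples = [f"{iri(source)} {RDF_TYPE} {iri('DataSource')} ."]
    previous = source
    for index, (method, params) in enumerate(steps):
        step = f"step{index}"
        triples.append(f"{iri(step)} {iri('appliesMethod')} {iri(method)} .")
        triples.append(f"{iri(step)} {iri('hasInput')} {iri(previous)} .")
        for key, value in sorted(params.items()):
            triples.append(f"{iri(step)} {iri(key)} {literal(value)} .")
        previous = step
    return triples
```

For example, `workflow_to_ntriples("ecg_csv", [("VarianceFilter", {"windowSize": 20}), ("PAA", {"segments": 50})])` yields one line per statement; the window size of 20 is an arbitrary illustrative value.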

The software is written in Python and follows a service-oriented architecture to allow interfacing with third-party components.

7.2 FROM DATA ACQUISITION TO KNOWLEDGE ACQUISITION

The toolkit complements the work done in this thesis by providing a graphical user interface to the collected sensor data and the information processing methods. Figure 57 shows the role of KAT in the overall architecture. KAT can connect to the middleware via the middleware API or can use static repositories to apply the processing algorithms.

We introduce the knowledge acquisition toolkit (KAT). KAT provides algorithms for numerical and textual data analysis that can help to extract meaningful information and represent it in a human-readable or machine-interpretable format. More technical details can be found on the official website: http://kat.ee.surrey.ac.uk.

We provide a use-case scenario where we apply the introduced algorithms to two data sets from different domains and explain the selection of the applied algorithms, to give an overview of where and how the particular algorithms can be applied to real-world data. In Figure 58, we show the workflow chain of the algorithms that is going to be applied to the data. We chose two data sets from different application domains. The first data set is from our own sensor test bed deployed at the University of Surrey [97]. The data comes from a sensor node in front of the desk of one of the authors, monitoring the power consumption of a workstation connected to the power meter. The raw data set was captured over a month and contains 274960 samples.

Figure 58: The Workflow of the Data Abstraction and the Selected Algorithms (stages shown: Preprocessing, Dimensionality Reduction, Feature Extraction, Abstraction, Representation; algorithms shown include Variance Filter, Highpass Filter, PAA, KMeans, HMM and Semantic Representation, leading to Actionable Knowledge)

Figure 59: ECG data (a) and power consumption data (b) loading screen

The second data set is from the machine learning data set repository maintained by the University of California, Irvine, and represents an electrocardiogram (ECG) dataset with 3751 samples, published by Olszewski [100] and found at the UCR Time Series Classification/Clustering page.

In Figure 59, the file data loading screen of KAT is shown: the left window presents the ECG data and the right window presents the power consumption data. The tool supports different input sources and formats such as CSV, EXCEL, SQL and the CKAN API [132]. If several categories are available inside a data source, the user can select the categories to which the algorithms should be applied. Our aim in this example use-case is to find the outlier in the ECG dataset occurring at around samples 2400-2600, and to cluster the repetitive "work day" behaviour in the watts dataset and represent it in a semantic representation.

Figure 60: (a) ECG data after applying variance filter, (b) Watts data after applying highpass filter.

First we apply pre-processing filters to the data. On the ECG data, we choose the variance filter to reduce the dataset to samples with high volatility within windows. The window size can be defined in KAT. On the watts data, we use a high-pass filter to remove the noise at the bottom of the data, in order to minimise the "background power consumption" and focus on power peaks (i.e. possible presence in an office). The processed data can be seen in Figure 60.
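The two pre-processing steps can be sketched as follows. This is a simplified illustration rather than KAT's implementation: the `threshold` parameter of the variance filter is an assumption (the text only states that the window size is configurable), and the high-pass step is realised here as subtraction of a trailing moving average, which is one simple way to remove a slowly varying baseline.

```python
from statistics import fmean, pvariance


def variance_filter(samples, window, threshold):
    """Keep only the samples of non-overlapping windows whose variance exceeds `threshold`.

    Returns (index, value) pairs so the surviving samples keep their position in time.
    """
    kept = []
    for start in range(0, len(samples), window):
        chunk = samples[start:start + window]
        if len(chunk) > 1 and pvariance(chunk) > threshold:
            kept.extend(enumerate(chunk, start))
    return kept


def highpass(samples, window):
    """Subtract a trailing moving average, removing the slowly varying baseline."""
    filtered = []
    for i, value in enumerate(samples):
        history = samples[max(0, i - window + 1):i + 1]
        filtered.append(value - fmean(history))
    return filtered
```

A flat stretch such as `[1, 1, 1, 1]` is dropped by `variance_filter`, whereas a volatile stretch such as `[0, 10, 0, 10]` is kept; `highpass` maps a constant signal to zeros and leaves only the changes.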

To eliminate redundancy, we reduce the dimensionality of the data. For both data sets we use the Piecewise Aggregate Approximation (PAA) technique. The interesting patterns of the data now become visible, as shown in Figure 61. In the ECG dataset, it is noticeable that there is a certain peak that stands out from the others. In the watts dataset, it can be seen that there is some regularity behind the data. The reader can easily infer that the power consumption is high during a work day at office hours and low between the work days (between peaks) and on the weekend (long gaps). In both cases the number of data samples has been reduced significantly: 100 out of 274960 samples for the watts data and 50 out of 3751 for the ECG data. This eases the subsequent processing steps.

Figure 61: (a) ECG data after applying PAA, revealing the outlier and suppressing the background noise, (b) Watts data after applying PAA, revealing the regular pattern of a workday

The more processing-intensive clustering algorithms can now operate on fewer data samples to provide the first level of lower-level abstractions. On the ECG dataset we run KMeans with k=3, representing low activity (called the PR-Interval) in group 0, peaks (called the QT Interval) in group 1 and outliers in group 2. On the watts dataset, we use an HMM to group the data into two temporal groups: a group representing a work day and a group representing the weekend (probably no presence in the office). The clustering of the data is shown in Figure 62.

Figure 62: (a) ECG data after applying KMeans with k=3, grouping the data into groups of data with low values (0), high values (1) and outliers (2). (b) Watts data after applying HMM with 2 states, grouping the data into two groups of low power (0) and high power (1) consumption

After the clustering step we discover temporal relations between the clustered data. For temporal relation discovery we use a Markov chain approach to calculate the probabilities of the occurrences of the groups. To ease understanding, we labelled the groups in KAT. The results are shown in Figure 63.

Figure 63: (a) ECG data after applying PAA, revealing the outlier and suppressing the background noise, (b) Watts data after applying PAA, revealing the regular pattern of a workday

Figure 64: Data Graph Representation of the information acquired through the abstraction process (nodes shown: Outlier, Work_day, Weekend, PR_Interval, Low Power, QT Interval, Regular ECG; edges: created_by, instance_of; levels: Low-Level Abstraction, High-Level Abstraction)

A possible representation of the abstractions that have been acquired through the overall process can be seen in Figure 64. KAT provides functions to represent the data graph in RDF/XML format. KAT allows parameters to be defined that control how granular the presented data should be. For instance, it would also be possible to include the raw data coming from the sensors that lead to the lower-level abstractions, but for presentation reasons we only include the information from the lower-level abstractions onwards. Besides the information that has been acquired, it would also be possible to include provenance information, e.g. the parameters and operators that led to the different abstractions. An ongoing research project that captures the provenance parameters is the PROV-O ontology, which will be included in future work.
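The dimensionality reduction, clustering and temporal relation steps of this use case can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the code used in KAT: the k-means variant works on scalar values with an evenly spread initialisation chosen here for reproducibility, the HMM used for the watts data is not covered, and all function names are hypothetical.

```python
from collections import Counter


def paa(samples, segments):
    """Piecewise Aggregate Approximation: the mean of each of `segments` equal-width frames."""
    n = len(samples)
    if not 0 < segments <= n:
        raise ValueError("segments must be between 1 and len(samples)")
    frames = []
    for i in range(segments):
        start, end = i * n // segments, (i + 1) * n // segments
        frames.append(sum(samples[start:end]) / (end - start))
    return frames


def kmeans_1d(values, k, iterations=20):
    """Lloyd's k-means on scalars (k >= 2); group 0 has the lowest centroid."""
    lo, hi = min(values), max(values)
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]

    def nearest(value):
        return min(range(k), key=lambda c: abs(value - centroids[c]))

    for _ in range(iterations):
        labels = [nearest(v) for v in values]
        for c in range(k):
            members = [v for v, label in zip(values, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = sum(members) / len(members)
    return [nearest(v) for v in values]


def transition_probabilities(labels):
    """First-order Markov chain: P(next group | current group) from consecutive labels."""
    pair_counts = Counter(zip(labels, labels[1:]))
    totals = Counter(labels[:-1])
    return {(a, b): n / totals[a] for (a, b), n in pair_counts.items()}
```

Chaining the three functions (`paa`, then `kmeans_1d`, then `transition_probabilities`) mirrors the order used in the text: the expensive clustering only sees the reduced series, and the transition probabilities are computed on the resulting group labels.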

CONCLUSIONS AND FUTURE WORK

This chapter concludes this thesis with a summary of the research contributions. Furthermore, we discuss the outcomes of the work and suggest directions for future research.

8.1 SUMMARY OF RESEARCH ACHIEVEMENTS

The challenges and objectives listed in Sections 1.2 and 1.3 have been addressed by the following contributions.

1. We developed a middleware, introduced in Chapter 3, that provides a unified access layer to heterogeneous sensor networks and data sources. The middleware uses a plug-in architecture that eases the development of software components for emerging technologies on the network side and also makes it possible to implement new processing algorithms that can be integrated into the middleware. On the application/service level, new plug-ins can be developed that provide different access methods to the data provided by the middleware.

2. We implemented components for the network side of the middleware to access a varied set of resources such as IEEE 802.15.4-enabled devices and Oracle Sun SPOT devices, but also web-based data sources.

3. In order to support continuous communication in the sensor network, the middleware supports the movement of sensor nodes between different base stations, e.g. gateways, by providing a mobility scheme that solves the hand-over problem introduced in Section 4.1. The solution introduces two operation modes, caching and tunneling. During the hand-over and in times with no connection to the data source, the middleware can compensate queries by providing cached or predicted results. The tunneling mode is used in cases in which the query cannot be forwarded to the gateway that is responsible for the data source after the location change. In that case, the middleware component allows the rerouting of queries through an overlay network.

4. We extended the middleware component with support for forming an overlay network between the base stations, to be more resilient against changes in the underlying sensor and data source networks, in Section 4.2. The middleware utilises a mediated gossiping algorithm in order to distribute queries and information throughout the networks for efficient information dissemination.

5. The data communication between data sources and base stations has been optimised in order to reduce the data traffic and increase the communication of significant information. This work introduced a novel aggregation algorithm called SensorSAX in Section 4.3 that reduces the dimensionality of streaming sensor data but retains the features needed by further information processing algorithms.

6. We introduced a new information processing and abstraction method in Chapter 5 that extracts meaningful information from data and also reduces the data traffic by sending patterns instead of raw data. In particular, we introduced an abductive reasoning model with a temporal extension that uses the aggregations from the SensorSAX algorithm to infer and abstract information into human and/or machine interpretations.

7. The abstraction model has been extended in Chapter 6 to operate automatically by using a clustering and a rule-based approach to create an interoperable semantic representation of the inferred interpretations. The toolkit provides a data-driven approach to construct ontologies from real-world data automatically.

8. In order to provide an integrated solution that connects the middleware with a user-friendly interface, we developed a knowledge acquisition toolkit in Chapter 7. The toolkit can connect to the middleware and enables the analysis of the data on a workstation.
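The caching mode mentioned in contribution 3 can be illustrated with a small sketch. It is a hedged illustration only: the class name `CachingProxy`, the `fetch` callable and the use of `ConnectionError` to signal an unreachable data source are assumptions made for this example, and the prediction of results and the tunneling mode are not covered.

```python
class CachingProxy:
    """Answers queries from a cache while the data source is unreachable (hypothetical sketch)."""

    def __init__(self, fetch):
        # fetch: callable(sensor_id) -> value; assumed to raise ConnectionError when offline
        self._fetch = fetch
        self._cache = {}

    def query(self, sensor_id):
        """Return (value, is_fresh); fall back to the last cached value during a hand-over."""
        try:
            value = self._fetch(sensor_id)
        except ConnectionError:
            if sensor_id in self._cache:
                return self._cache[sensor_id], False
            raise  # nothing cached yet, so the caller has to retry or reroute
        self._cache[sensor_id] = value
        return value, True
```

Returning the freshness flag alongside the value lets the caller distinguish a live reading from a compensated one, which matters when the cached value is served for a longer disconnection.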

8.2 LESSONS LEARNED

In this section, we highlight some of the decisions made and revisit considerations that have been made throughout this work.

8.2.1 Centralised Middleware vs De-centralised In-network Processing

In this thesis, we introduced a middleware component in Chapter 3 that runs on resource-constrained sensor nodes and highly capable gateways. The architecture of the middleware uses the sensor nodes as data sources and pushes most of the processing tasks to the higher-level gateway nodes. It is questionable whether this master/slave approach utilises the processing power of the entire network. With the help of more distributed approaches, decisions can be made faster due to early data processing directly at the data source. On the other hand, the heterogeneity of the vast number of different sensor hardware platforms makes it difficult to optimise each algorithm for distributed in-network processing. In this work, middleware has been used to hide the complexity of different hardware and software platforms and to provide a unified access layer to the gathered data. The pattern creation and data abstraction methods are also designed in such a way that they can run on the node with affordable computation and energy footprints.

8.2.2 Mediated Gossiping vs Network Layer Routing

The communication protocol defined in Section 4.2 is implemented at the application layer. This choice has been made to provide a close interaction between applications and the underlying networks, rather than pushing the dissemination and query tasks to the network layers. This allows a fast development pace in application and service development. The advantages lie in the transparency for applications and APIs in the information dissemination and query optimisation, with the focus on abstracting from the lower network layers. However, other approaches optimise the network layers, which allows the network infrastructure and processing capabilities to be utilised. One assumption of this work is that routing and MAC-level interactions are optimised for energy efficiency and reliability. Hiding this complexity from the application layer makes it difficult for the application developer to see how the tasks are performed on the network layer.

8.2.3 Information Abstraction vs Outlier and Event Detection

This work introduces a new paradigm for information processing called information abstraction, which is discussed in Chapter 5. The difference from other established research areas such as outlier and event detection is that we assume that information and data insights are not always modelled in the system. In outlier and event detection approaches, it can usually be foreseen what information is to be expected. However, with the deluge of data and the growth in Cyber-Physical Data, the task of defining and modelling expected events becomes infeasible. Nevertheless, false positives and false negatives can occur: more precisely, abstractions and events may be inferred by the system that do not reflect the real world, and/or events can occur in the real world that are not reflected in the system. We claim that our approach is useful for creating a baseline for event and outlier detection approaches by reducing the unwanted information flood, removing and filtering unnecessary data and creating effective pattern representations. This can be helpful for knowledge and ontology engineers who require methods to cope with the information overload and Big Data challenges in the IoT domain.

8.3 FUTURE WORK

The communication and processing of Cyber-Physical Data is an ongoing research topic, and we focused on some of the key challenges that will arise in the near future. A growing number of devices will be connected to the Internet in the future, which will enable the creation of various services and applications in the digital age.

We introduced information abstraction and Cyber-Physical Data processing solutions that create human-understandable information. Our results show that it is possible to automatically extract meaningful information from a deluge of data that is produced by various sensor devices.

The processing algorithms introduced in Chapter 5 and Chapter 6 use SensorSAX as a building block for the abstraction creation. SensorSAX (described in Section 4.3) relies on the variance measured in the currently observed raw sensor data. The variance is used as an indicator for the activity in the data. However, this implies that only data segments with strong movements in the data gain attention. Finer-grained solutions can be investigated in order to detect subtle changes in the data. This could be achieved by incorporating historical data to learn the thresholds that indicate whether the window size has to be decreased or increased. SensorSAX also relies on the assumption that the measurement data has a normal distribution, which is true for most natural physical information. It has to be noted, however, that this assumption only holds for long-term observations. The distribution in small observed windows can change drastically and impact the aggregation of SensorSAX. To address the latter, using probability density functions and adjusting the segmentation in the SensorSAX algorithm can be considered.
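The two assumptions discussed above, a variance-driven window size and normally distributed data, can be made concrete with a short sketch. It is not the SensorSAX algorithm from Section 4.3: the breakpoints are those of standard SAX (equiprobable regions of a standard normal distribution), and the halving/doubling rule in `next_window`, its `threshold` parameter and its bounds are hypothetical choices made for illustration.

```python
from bisect import bisect_right
from statistics import NormalDist, pvariance


def sax_breakpoints(alphabet_size):
    """Cut points that split a standard normal distribution into equiprobable regions."""
    return [NormalDist().inv_cdf(i / alphabet_size) for i in range(1, alphabet_size)]


def symbolise(z_value, breakpoints, alphabet="abcdefgh"):
    """Map a z-normalised segment mean to its SAX symbol."""
    return alphabet[bisect_right(breakpoints, z_value)]


def next_window(window, samples, threshold, smallest=4, largest=64):
    """Halve the window when the current samples are active, double it when they are calm."""
    if pvariance(samples) > threshold:
        return max(smallest, window // 2)
    return min(largest, window * 2)
```

If the values inside a short window are far from normally distributed, the fixed Gaussian breakpoints no longer split them into equally likely symbols, which is the limitation the paragraph above points out; estimating the breakpoints from an empirical distribution would be one way to explore the suggested extension.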

The abductive model that has been introduced in Section 5.2 includes the temporal domain. A logical extension is to consider the spatial domain to handle cases in which abstractions are more dependent on the location of the source data.

In conclusion, processing and handling large-scale and dynamic data streams that often require real-time solutions are key to effective realisation and utilisation of the Cyber-Physical Systems and the Internet
