1.4.1
Overview
In today’s complex cyber-physical systems, data is generated at very large scale by a wide range of sources, such as sensors and embedded devices [191]. As it turns out, the complexity of this data is at least as much of a challenge to gaining profound insights as its sheer size. In fact, even relatively small but complex datasets can be difficult to analyse. In the context of CPSs, the complexity results from the complicated physical phenomena described by this data, which makes it difficult to extract a model able to explain such data and its various multi-layered relationships.
Furthermore, these systems are expected to react quickly, which requires fast data analysis, i.e., analysis in near real-time. In addition, CPSs usually have only limited computational capabilities. Even though the computing power of CPSs keeps increasing, driven by advances in embedded systems and their falling prices, they usually cannot rely on big clusters and batch processing for their analytic tasks.
The high complexity of CPS data makes it necessary to properly structure and organise the data in order to analyse it efficiently. This requires appropriate models, able to represent the context (internal state and surrounding environment) of a cyber-physical system [181]. Building, storing, and loading appropriate context models to enable efficient live analytics for CPSs is challenging [261]. Existing solutions fail to provide sustainable mechanisms to analyse such data live.
In the following, an overview of the main challenges addressed in this thesis is presented. In the contribution part of this dissertation each challenge is then discussed in detail. Each challenge corresponds to a concrete need encountered during the collaboration with Creos Luxembourg S.A.
1.4.2
Challenges addressed in this thesis
Analysing data in motion. Data handled by cyber-physical systems is usually dynamic, i.e., it is constantly changing [191], [224]. This is also known as data in motion, as opposed to traditional data at rest, or as temporal data [186]. For example, physical quantities measured by sensors, such as temperature, pressure, speed, and distance, are inherently temporal. Moreover, data in CPSs often changes frequently and at very different paces. It is usually not enough to consider only the current data. Instead, reasoning processes typically need to analyse and compare data from the current context with its history [188], [171], [202]. For instance, predicting the electric load for a particular region requires a good understanding of the past electricity production and consumption in this region, as well as recent data, such as current and forecasted weather conditions. However, data models can usually only reflect the context of a CPS at a given point in time, i.e., they only represent a snapshot of a real system at one specific timestamp. Such discretisation leads to a representation of temporal context data as a finite (potentially distributed) sequence of snapshots (e.g., proposed by [188], [171]). As a consequence, the state of a context model between two snapshots is not defined. This results in losing the semantics of continuously evolving data [289]. To address this problem, it is a common approach to regularly sample and store the context of a system at a very high rate in order to provide analytic algorithms with enough historical data. In order to correlate data from different timestamps, analytic algorithms then need to mine a huge number of snapshots, extract a relevant dataset, and finally analyse it (e.g., [239], [188], [171]). This requires heavy resources, is time-consuming, or both, which conflicts with the near real-time response requirements such systems need to meet.
Challenge #1:
One of the major challenges addressed in this thesis is how data models and associated storage systems can be organised to offer reasoning algorithms an efficient, coherent, and consistent view of temporal data.
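To illustrate the gap between two snapshots, the following minimal sketch stores a value only when it changes and resolves a read at any timestamp to the most recent earlier write. It is an illustration under stated assumptions, not the data model developed in this thesis: the class `TemporalValue`, its methods, and the sample load values are hypothetical.

```python
from bisect import bisect_right


class TemporalValue:
    """Hypothetical timeline of one attribute: a value is stored only when it changes."""

    def __init__(self):
        self._times = []   # timestamps, kept sorted
        self._values = []  # _values[i] is valid from _times[i] until the next entry

    def set(self, t, value):
        """Record that the attribute takes `value` from timestamp `t` onwards."""
        i = bisect_right(self._times, t)
        if i > 0 and self._times[i - 1] == t:
            self._values[i - 1] = value  # overwrite an existing entry at t
        else:
            self._times.insert(i, t)
            self._values.insert(i, value)

    def resolve(self, t):
        """Return the value valid at timestamp `t`, or None before the first write."""
        i = bisect_right(self._times, t)
        return self._values[i - 1] if i > 0 else None


# Assumed sample data: electric load of one region, written at t=10 and t=50.
load = TemporalValue()
load.set(10, 3.2)
load.set(50, 4.1)

value_between_writes = load.resolve(30)  # resolved from the write at t=10
```

In this sketch a read between two writes is always defined, and no periodic snapshot of the whole context is needed. Each lookup is a binary search over the writes of a single attribute.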
Exploring hypothetical actions. Making sustainable decisions requires anticipating the possible impacts of actions. The exploration of what might happen if this or that action were taken is referred to as what-if analysis [167], [41]. Every action triggers effects which potentially lead to an alternative state, from where a set of other actions can be applied, and so forth. In complex systems, like cyber-physical systems, situations can arise in which hundreds or thousands of alternative actions must be explored before a solid decision can be made (e.g., optimisation and planning tasks [138]). For example, in case of a potential overload situation, the smart grid would need to explore numerous chains of different actions, like restricting the maximum allowed load for certain customers or regulating the charging of electric cars, to finally decide on the most appropriate chain of actions. Every action can be interpreted as a divergent point leading to an independent alternative. What-if analysis simulates different actions and tries to find the sequence of actions which leads to a desired alternative [167], [41]. The usefulness of simulating actions based on models in the context of CPSs has, for example, been shown by Fejoz et al. [143] (in this case using the CPAL language [249]). An alternative can be interpreted as a snapshot of a system’s context. In order to simulate different chains of actions, every alternative needs to be able to evolve independently, both in space (i.e., leading to additional alternatives) and in time. This leads to different histories in different alternatives, creating a very high combinatorial complexity of alternatives and temporal data.
Challenge #2:
The second major challenge addressed in this dissertation is how data models and associated storage systems can be organised to allow an efficient exploration of a large number of independent alternatives—in space and time—even on a massive amount of data.
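One way to picture independent alternatives is a tree in which each alternative stores only what it changes and reads everything else from its parent. The sketch below is a simplified illustration, not the mechanism proposed in this dissertation: the class `Alternative`, the attribute names, and the values are hypothetical, and it assumes that an alternative is no longer modified once it has been forked.

```python
class Alternative:
    """Hypothetical what-if alternative that stores only its own changes."""

    def __init__(self, parent=None):
        self._parent = parent
        self._local = {}  # attributes changed in this alternative only

    def fork(self):
        """Create a child alternative without copying any data."""
        return Alternative(parent=self)

    def set(self, key, value):
        self._local[key] = value

    def get(self, key):
        """Read from this alternative, falling back to its ancestors."""
        node = self
        while node is not None:
            if key in node._local:
                return node._local[key]
            node = node._parent
        raise KeyError(key)


# Assumed example: two chained actions explored for an overload situation.
base = Alternative()
base.set("max_load_kw", 12)
base.set("ev_charging", "unrestricted")

restrict_load = base.fork()            # action 1: restrict the maximum load
restrict_load.set("max_load_kw", 8)

delay_charging = restrict_load.fork()  # action 2, applied on top of action 1
delay_charging.set("ev_charging", "delayed")
```

Forking is cheap here because nothing is copied, while a read may have to walk up the chain of ancestors. The sketch ignores time entirely. Combining it with per-attribute histories is what creates the combinatorial complexity described above.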
Reasoning over distributed data in motion. CPSs are not only becoming increasingly large-scale and complex but are also increasingly equipped with distributed control and decision-making abilities [267], [221]. Reasoning over distributed data is a complex task [148] since it relies on the aggregation and processing of various distributed and constantly changing data [232]. In fact, to fulfil their tasks, these systems typically need to share context information between computational nodes (any computer system reading, writing, or processing data in the context of a CPS). Therefore, appropriate data models used to analyse data in a CPS must support such distribution. Data models of complex CPSs can get very large, which makes sharing this information efficiently challenging. For example, the state of a smart grid is continuously updated at a high frequency from various sensor measurements (like consumption or quality of power supply) and other internal or external events (e.g., overload warnings). In reaction to these state changes, different actions can be triggered. However, reasoning and decision-making processes are not centralised but distributed over smart meters, data concentrators, and a central system [142], making it necessary to share context information between these nodes. Smart grids, depending on the size of a city or country, can consist of millions of elements and thousands of distributed computational nodes. This challenges the efficiency of sharing context information, especially given the near real-time requirements such systems usually need to meet.
Challenge #3:
How to handle large-scale, distributed, and frequently changing data to enable efficient analytics over distributed data is the third challenge addressed in this thesis.
Combining domain knowledge with machine learning. In order to meet future needs, CPSs need to become increasingly intelligent [267]. On the one hand, some situations CPSs will face are predictable at design time. For example, to react to critical overload situations, the maximum allowed load for customers could be restricted or the charging of electric cars could be balanced accordingly. On the other hand, such systems will also face events that are unpredictable at design time. For instance, the electric consumption of a house depends on the number of people living there, their activities, weather conditions, the devices used, and so forth. Although such behaviour is unpredictable at design time, it is known at design time that this behaviour is unknown [298] and that it can be learned later by observing past situations, once data becomes available. Machine learning algorithms can help to capture this unknown behaviour by extracting commonalities over massive datasets. However, in cases where datasets are composed of independent entities (so-called system of systems [97]) which behave very differently, finding one coarse-grained common behavioural model can be difficult or even inappropriate. For example, the consumption of a factory follows a very different pattern than the consumption of an apartment. Searching for commonalities between these entities would not lead to correct conclusions. Instead, following a “divide and conquer” strategy, learning on finer granularities can be considerably more efficient for such problems [330], [144]. However, learning on fine granularities leads to many fine-grained learning units, which must be synchronised and composed to express more complex behavioural models. Therefore, this requires an appropriate structure to model such learning units and their relationships to domain knowledge.
Challenge #4:
The last challenge addressed in this thesis is how domain knowledge and machine learning can be seamlessly combined to improve data analytics for cyber-physical systems.
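The difference between one coarse-grained model and fine-grained learning units can be illustrated with a deliberately simple learner. In the sketch below, which is an assumption-laden illustration rather than the approach developed in this thesis, each entity gets its own running-mean profile, and a single global profile is trained on the same data for comparison. The class `MeanProfile`, the entity names, and the consumption values are hypothetical.

```python
from collections import defaultdict


class MeanProfile:
    """Hypothetical learning unit: incrementally updated mean consumption."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def learn(self, value):
        """Update the running mean with one new observation."""
        self.count += 1
        self.mean += (value - self.mean) / self.count

    def predict(self):
        return self.mean


# Assumed observations as (entity, consumption in kWh) pairs.
observations = [
    ("factory", 900.0),
    ("factory", 1100.0),
    ("apartment", 2.0),
    ("apartment", 4.0),
]

per_entity = defaultdict(MeanProfile)  # one fine-grained unit per entity
global_profile = MeanProfile()         # one coarse-grained model for all entities

for entity, kwh in observations:
    per_entity[entity].learn(kwh)
    global_profile.learn(kwh)

factory_estimate = per_entity["factory"].predict()
apartment_estimate = per_entity["apartment"].predict()
global_estimate = global_profile.predict()
```

With these assumed values, the global estimate lies between the two entity profiles and describes neither the factory nor the apartment well. The sketch leaves out exactly the hard part named above: how such units are synchronised, composed, and related to domain knowledge.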
[Figure labels: real system; data stream; meta model; domain rules, learning rules; generate; multi-dimensional graph data model; data structure; storage system; analysis. Legend: data + domain knowledge; learned information; relationships/dependencies.]
Figure 1.5: Model-driven live analytics