
Machine learning is an evolution of pattern recognition and computational learning theory in artificial intelligence. It explores the construction and study of algorithms that can learn from and make predictions on data. Such algorithms build a mathematical model from example inputs to make data-driven predictions or decisions, rather than following strictly static program instructions [318]. The essence of machine learning is to create compact mathematical models that represent abstract domain notions of profiles, tastes, correlations, and patterns that: 1) fit the current observations of the domain well and 2) are able to extrapolate well to new observations [242]. Several categorisations of machine learning techniques are possible. We can divide these techniques according to the nature of the learning used: In supervised learning, data has predefined and well-known fields that serve as the expected output of the learning process. In unsupervised learning, by contrast, input data is not labelled and does not have a known field defined as output. In this case, machine learning algorithms try to deduce structures present in the input data to find hidden patterns. In some cases, input data is a mixture of labelled and unlabelled samples. This class of problems is called semi-supervised learning. Many machine learning algorithms require some parameters (called hyper-parameters) to configure the learning process itself. In some situations, these parameters can also be learned or adapted according to the specific business domain. In that case, they are called meta learning parameters, and the process of learning such parameters is called meta learning. For the rest of the paper we will refer to such parameters simply as parameters.
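The difference between supervised and unsupervised learning can be illustrated with a minimal sketch on one-dimensional data. The function names fit_centroids and two_means are hypothetical and chosen only for this illustration; the nearest-centroid and two-means procedures are simple stand-ins, not techniques prescribed by the text.

```python
from typing import Dict, List, Sequence, Tuple


def fit_centroids(labelled: Sequence[Tuple[float, str]]) -> Dict[str, float]:
    """Supervised: every sample carries a known label (the expected output),
    so the model is simply one centroid per label."""
    groups: Dict[str, List[float]] = {}
    for x, label in labelled:
        groups.setdefault(label, []).append(x)
    return {label: sum(xs) / len(xs) for label, xs in groups.items()}


def two_means(xs: Sequence[float], iterations: int = 10) -> Tuple[float, float]:
    """Unsupervised: no labels are available, so two cluster centres are
    deduced from the structure of the input data alone (1-D two-means)."""
    low, high = min(xs), max(xs)  # initial centres: the two extreme points
    for _ in range(iterations):
        near_low = [x for x in xs if abs(x - low) <= abs(x - high)]
        near_high = [x for x in xs if abs(x - low) > abs(x - high)]
        low = sum(near_low) / len(near_low)
        if near_high:  # stays unchanged if all points collapse to one cluster
            high = sum(near_high) / len(near_high)
    return low, high
```

In the supervised case the labels decide which samples belong together; in the unsupervised case the grouping itself is what the algorithm has to find.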

Another categorisation of machine learning techniques is according to the frequency of learning: In online learning, the learning algorithm is executed for every new input/output observation, and its state is updated incrementally. This is also known as live, incremental, or on-the-fly machine learning. We speak of offline learning or batch learning when a whole dataset or several observations are sent in “one shot” to the learning algorithm. We speak of lazy learning, or on-demand learning, when we train a machine learning technique only for the purpose of estimating the output of a specific input vector. The learning technique is trained using a small batch or a subset of observations similar to the requested input. This type of learning offers case-based or context-based reasoning, because the learning is tailored to the requested input.
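The three learning frequencies can be contrasted with a minimal sketch. The names OnlineMean, batch_mean, and lazy_predict are hypothetical, and the mean and nearest-neighbour estimators are assumptions chosen for brevity rather than techniques named in the text.

```python
from typing import Sequence, Tuple


class OnlineMean:
    """Online learning: the state is updated incrementally with each
    new observation, without storing past observations."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0

    def update(self, y: float) -> None:
        self.count += 1
        self.mean += (y - self.mean) / self.count  # running-mean update


def batch_mean(ys: Sequence[float]) -> float:
    """Offline (batch) learning: the whole dataset is processed in one shot."""
    return sum(ys) / len(ys)


def lazy_predict(data: Sequence[Tuple[float, float]], x: float, k: int = 3) -> float:
    """Lazy learning: nothing is trained in advance; for a requested input x,
    a local estimate is built from the k observations most similar to x."""
    nearest = sorted(data, key=lambda pair: abs(pair[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)
```

The online and batch variants converge to the same estimate once all observations have been seen; the lazy variant instead produces a different local model for every requested input.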

Finally, a machine learning module can be composed by combining several machine learning submodules. Such compositions are usually called ensemble methods. They are often used to create a strong machine learning model from multiple weaker machine learning models that are independently trained. The results of the weaker models can be combined in many ways (voting, averaging, linear combination) to improve the overall learning. Some techniques split the training data over the weaker models; this is called bagging. Other techniques split over the features, and some split over both data and features. Random forests are a powerful example of these techniques, where the global machine learning module is composed of several decision trees, each trained on a subset of data and features. Neural networks are another example, where the global network is composed of several neurones, each of which can be seen as an independent learning unit.
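Bagging with majority voting can be sketched as follows. This is a minimal illustration under stated assumptions: the weak learner is a one-dimensional decision stump, each weak model is trained on a bootstrap sample (sampling with replacement), and the names train_stump and bagging, as well as the parameters n_models and seed, are hypothetical.

```python
import random
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[float, int]  # (feature value, label in {0, 1})


def train_stump(data: Sequence[Sample]) -> Callable[[float], int]:
    """Weak learner: choose the threshold with the fewest training errors
    and predict label 1 for inputs above it."""
    best_threshold, best_errors = 0.0, len(data) + 1
    for threshold, _ in data:
        errors = sum((x > threshold) != bool(label) for x, label in data)
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return lambda x: int(x > best_threshold)


def bagging(data: Sequence[Sample], n_models: int = 25,
            seed: int = 0) -> Callable[[float], int]:
    """Train n_models stumps independently, each on its own bootstrap
    sample of the data, and combine their outputs by majority vote."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    models: List[Callable[[float], int]] = []
    for _ in range(n_models):
        bootstrap = [rng.choice(data) for _ in data]
        models.append(train_stump(bootstrap))

    def predict(x: float) -> int:
        votes = Counter(model(x) for model in models)
        return votes.most_common(1)[0][0]

    return predict
```

Replacing the majority vote with an average or a weighted linear combination gives the other combination schemes mentioned above; splitting over features as well as data would move the sketch towards a random forest.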

3 State of the art

This chapter discusses work related to the one presented in this dissertation. First, it presents related approaches for analysing data in the context of cyber-physical systems. Then, major data analytics platforms, stream and graph processing frameworks, and graph databases are detailed. Finally, the related work regarding the four challenges addressed in this dissertation is discussed.

Contents

3.1 Analysing data of cyber-physical systems
3.2 Data analytics platforms
3.3 Stream processing frameworks
3.4 Graph processing frameworks
3.5 Graph databases
3.6 Analysing data in motion
3.7 Exploring hypothetical actions
3.8 Reasoning over distributed data in motion
3.9 Combining domain knowledge and machine learning

3.1 Analysing data of cyber-physical systems

The specific challenges of data analytics (or big data analytics) in the context of IoT and cyber-physical systems have been discussed and identified as an open issue by several researchers, e.g., [191], [270], [293]. Jara et al. [191] describe existing analytics solutions, discuss open challenges, and provide some guidelines to address these challenges. As a major difference between existing big data analytics and analytics for CPSs, they discuss the need for real-time analytics as a vertical requirement from communication to analytics. They propose a hybrid approach, where real-time analytics is used for control and, in addition, batch processing is used for modelling and behaviour learning. Ray [270] proposes an architecture for autonomous perception-based decision and control of complex CPS. He identifies two main challenges: the complexity of such systems (e.g., stemming from the complex underlying physical phenomena) and their usually high performance requirements. Ray argues that, therefore, an efficient abstraction is the key to modelling and analysing data of cyber-physical systems. Similarly, Stankovic [293], in his discussion of research directions for the Internet of Things, mentions converting the continuously collected raw data into usable knowledge as one of the big challenges.

Thematically close to the context of our work is the work of Šikšnys [291]. Šikšnys focuses on the planning phase in the context of large-scale multi-agent CPSs. As a main contribution, Šikšnys provides a definition and a conceptual model (composed of a flexibility model, a prescriptive model, and a decision model) of what he calls “PrescriptiveCPS”. Šikšnys's contributions remain at the level of a conceptual model, rather than proposing a concrete technical approach for data analytics of CPSs. In the following, we discuss analytics platforms, frameworks, and technologies that are related to our approach, before we present the related work specific to the four challenges addressed in this dissertation. Figure 3.1 presents an overview of the state of the art presented in the following chapters.