5.2 Development processes for model engineering

5.2.5 Data analysis processes

So far we have looked at methodologies from software engineering, modelling and simulation, and knowledge and ontology engineering. What they have in common is an engineering view of creating a system. Each starts from a set of questions to be answered and a set of requirements to be satisfied, and produces a system that is the solution to those questions and requirements. That system is later tested against the requirements, further developed and maintained.

The data analysis view is somewhat different. Here, empirical data is gathered and analysed in order to answer questions. As Cohen explains, no matter whether one is conducting experiments with rats or with programs, there are always three basic research questions [26, p. 3]:

1. How will a change in the agent’s structure affect its behaviour given a task and an environment?

2. How will a change in an agent’s task affect its behaviour in a particular environment?

3. How will a change in an agent’s environment affect its behaviour on a particular task?

To answer any of these questions based on empirical data, one can apply one of four types of empirical studies [26, p. 7]. The first type are exploratory studies, which yield causal hypotheses that are then tested in observation or manipulation experiments. In practice, large amounts of data are collected and then analysed for similarities. The second type are assessment studies, which establish baselines and ranges of the behaviour of a given system or of its environment. The third type are manipulation experiments, which test hypotheses about causal relations by manipulating factors and determining the effects on measured variables. The last type are observation experiments, which disclose the effects of factors on measured variables by observing associations between levels of the factors and values of the variables.

As Cohen explains, in the early stages of a project one asks questions that are answered by exploratory studies, while as the project progresses the answers are obtained by experimental studies. This shift from exploratory to experimental studies defines progress in science. It is also shown in Fig. 5.13: as we progress from a specific system to more general ones, our understanding shifts from description to causal explanation.

To answer research questions based on empirical studies, Cohen proposes a strategy. This can also be regarded as a process for data analysis as data analysis is based on exploratory studies. This process consists of five steps [26, p. 6] and describes a typical empirical generalisation strategy that is usually used in any data analysis study.

[Figure 5.13 diagram labels: generalisation (specific to a system, general); understanding (description, prediction, causal explanation); progress in science.]

Figure 5.13: The generalisation and understanding of basic research questions (figure adapted from [26, p. 3]).

1. Implement a program that exhibits a behaviour of interest performed in a specific environment.

2. Identify the program’s features, tasks and environments that influence the target behaviour.

3. Develop and test a causal model of how the selected features influence the behaviour.

4. When the model is able to provide accurate predictions, generalise the features so that other variables, programs and features are included in the model.

5. Test whether the general model is able to accurately predict the behaviour of the larger set of programs, tasks, and environments.

This general approach to answering questions in empirical studies is also reflected in the state of the art for data analysis processes. Below we look at two such processes.

5.2.5.1 Pattern classification lifecycle [36, p. 14]

A more concrete study design lifecycle is proposed by Duda et al. [36, p. 14]. It concerns the exploratory study design of pattern recognition. As pattern recognition analyses data for recurring similarities, the lifecycle includes data collection, the choice of the features to be compared, the choice of model, the training of the model, and its evaluation. Fig. 5.14 shows the proposed process.

[Figure 5.14 diagram labels: start, collect data, choose features, choose model, train classifier, evaluate classifier, end.]

Figure 5.14: The pattern classification lifecycle proposed by Duda et al. (figure adapted from [36, p. 14]).

The first phase is the data collection, which provides the data needed for training and testing the designated system. As a rule, more training data leads to better system performance [36, p. 14]. The second phase is the feature choice. The features to be compared are usually selected based on preliminary data analysis and on the available prior knowledge about typical features relevant to the problem domain. The third phase is the model choice, where the model that is to solve the problem is selected. Here, different models can be selected and later tested to find out which of them is able to represent the underlying patterns. The next phase is the model training, in which data is used to determine the classifier. Based on the patterns it has learned, the classifier can later recognise and classify new data. The last phase is the evaluation, in other words, how well the classifier is able to identify patterns in new data. Typical strategies and tests for performance evaluation can be found in [26, pp. 185–235].
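The five phases of the lifecycle can be illustrated with a minimal sketch. It is not taken from Duda et al.: the synthetic two-class data, the identity feature choice, the nearest-centroid model and all function names are assumptions chosen only to keep the example self-contained.

```python
import random
from statistics import mean


def collect_data(n_per_class=50, seed=0):
    """Phase 1: collect labelled samples (assumption: synthetic 2-D points)."""
    rng = random.Random(seed)
    data = []
    for label, centre in (("a", (0.0, 0.0)), ("b", (4.0, 4.0))):
        for _ in range(n_per_class):
            point = (rng.gauss(centre[0], 1.0), rng.gauss(centre[1], 1.0))
            data.append((point, label))
    rng.shuffle(data)
    return data


def choose_features(point):
    """Phase 2: choose features (assumption: keep both raw coordinates)."""
    return point


class NearestCentroid:
    """Phase 3: the chosen model, one mean vector per class (illustrative)."""

    def __init__(self):
        self.centroids = {}

    def train(self, samples):
        """Phase 4: estimate the class centroids from the training data."""
        by_label = {}
        for features, label in samples:
            by_label.setdefault(label, []).append(features)
        self.centroids = {
            label: tuple(mean(dim) for dim in zip(*points))
            for label, points in by_label.items()
        }

    def classify(self, features):
        """Return the label of the closest centroid."""
        def sq_dist(centroid):
            return sum((f - c) ** 2 for f, c in zip(features, centroid))
        return min(self.centroids, key=lambda lab: sq_dist(self.centroids[lab]))


def evaluate(model, samples):
    """Phase 5: accuracy on data that was not used for training."""
    correct = sum(model.classify(f) == label for f, label in samples)
    return correct / len(samples)


def run_lifecycle():
    data = [(choose_features(p), label) for p, label in collect_data()]
    split = int(0.7 * len(data))  # hold out 30 % for evaluation
    train_set, test_set = data[:split], data[split:]
    model = NearestCentroid()
    model.train(train_set)
    return evaluate(model, test_set)
```

Calling `run_lifecycle()` returns the held-out accuracy. Swapping the body of `choose_features` or the model class corresponds to revisiting the second or third phase.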

This is a typical data analysis approach, and it can also be recognised in the intuitive development process from Chapter 3. In our case, however, the development process concentrates on the manner in which a model is developed. Furthermore, as CCBM proposes substituting training data with a-priori knowledge, the fourth phase, model training, is redundant.

5.2.5.2 Cross-Industry Standard Process for Data Mining [131, 21]

Another development method for data analysis comes from the field of data mining. The method proposed by Shearer is called the CRoss-Industry Standard Process for Data Mining (CRISP-DM) [131] and consists of six iterative phases. The process is shown in Fig. 5.15.

[Figure 5.15 diagram labels: Business understanding, Data understanding, Data preparation, Modelling, Evaluation, Deployment, Data.]

Figure 5.15: Cross-Industry Standard Process for Data Mining proposed by Shearer (figure adapted from [131]).

The first phase is the business understanding, which aims to understand the problem from a business perspective and to convert the gathered knowledge into a data mining problem definition. It also covers developing a preliminary plan for achieving the objectives. The second phase is the data understanding, where the initial data is collected and analysed in order to discover data quality problems, to gain initial insights into the data, or to identify interesting patterns that can help form hypotheses about the hidden information.

The third phase is the data preparation, where the final dataset, that is, the data to be fed into the modelling tool, is prepared. It involves five steps. First, the data to be used is selected based on the problem objectives. Then the data is cleaned, for example by estimating missing information in ambiguous subsets and adding it to the datasets. The third step is the data construction, which involves preparing the data by deriving new attributes from existing ones or developing new records. After that, the data is integrated by combining information from multiple tables or records to create new records or values. Finally, the data is formatted to fit the needs of the designated tool.
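The five data preparation steps can be sketched as a small pipeline. The record fields (`id`, `distance`, `duration`, `note`), the lookup table of subjects, mean imputation as the cleaning strategy and CSV as the target format are all hypothetical choices made for illustration; CRISP-DM itself does not prescribe them.

```python
import csv
import io
from statistics import mean


def select(records, wanted_fields):
    """Step 1: keep only the fields relevant to the problem objectives."""
    return [{k: r.get(k) for k in wanted_fields} for r in records]


def clean(records, field):
    """Step 2: estimate missing values (assumption: use the field mean)."""
    known = [r[field] for r in records if r[field] is not None]
    fill = mean(known)
    return [{**r, field: fill if r[field] is None else r[field]} for r in records]


def construct(records):
    """Step 3: derive a new attribute from existing ones."""
    return [{**r, "speed": r["distance"] / r["duration"]} for r in records]


def integrate(records, lookup, key):
    """Step 4: combine each record with information from a second table."""
    return [{**r, **lookup.get(r[key], {})} for r in records]


def to_csv(records):
    """Step 5: format the data for the designated tool (assumption: CSV)."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=list(records[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


# Hypothetical raw data with one missing distance value.
raw = [
    {"id": 1, "distance": 10.0, "duration": 2.0, "note": "x"},
    {"id": 2, "distance": None, "duration": 4.0, "note": "y"},
    {"id": 3, "distance": 30.0, "duration": 5.0, "note": "z"},
]
subjects = {1: {"subject": "A"}, 2: {"subject": "B"}, 3: {"subject": "A"}}

prepared = select(raw, ["id", "distance", "duration"])
prepared = clean(prepared, "distance")
prepared = construct(prepared)
prepared = integrate(prepared, subjects, "id")
csv_text = to_csv(prepared)
```

Each step takes and returns a list of records, so the order of the steps stays explicit and a single step can be repeated when a later phase reveals a data problem.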

The fourth phase is the modelling, where various modelling techniques are selected and applied to the problem, and their parameters are calibrated and optimised. This involves selecting the modelling technique, generating the test design, creating the models and, finally, assessing them. According to Shearer, the model assessment is based on the analyst’s domain knowledge, the data mining success criteria, and the desired test design.

The fifth phase is the model evaluation, which deals with a more detailed evaluation of the model’s performance and of its ability to satisfy the business objectives. It involves not only evaluating the model’s accuracy and generality, but also its representation of the business objectives, as well as any reasons why the model could be deficient. Finally, it is decided whether to finish and deploy the model, to initiate a new iteration of the development process, or to set up a completely new data mining project.

The last phase in the project is the deployment where the knowledge gained throughout the project is organised and presented in a way understandable for the customer. This involves the development of a strategy for deployment, as well as one for the model monitoring and maintenance. Additionally, a final project report is produced and the whole project is assessed in terms of successes and failures.

The CRISP-DM process provides a data analysis approach that generally meets the needs of CCBM and allows the process to be iterated and improved. It also addresses something that is not an issue in other fields of computer science but is an important part of the activity recognition process: it provides a mechanism for collecting, preparing and using the data to be evaluated. On the other hand, like most data analysis approaches, it does not go into detail about the concrete modelling process. In our experience with CCBM, exactly this is an essential part of a successful activity recognition process.

In contrast to the engineering methods, the data analysis methods deal with how to collect, analyse and evaluate data rather than with the best approach to building a model. They also do not concern themselves with extensive conceptualisation of the model to be developed. The experience from Chapter 3 showed that this is a major drawback, as many conceptual problems first appear during model evaluation, although they could have been avoided altogether by more extensive model analysis and design.

5.2.6 The gap between data analysis and engineering

The above sections show that, depending on its field of application, each process puts more stress on the phases that matter most to the respective model developers. For example, software engineering processes deal with more detailed software conceptualisation and design³ and output a validated and deployed software system. They are relatively general and can be adapted to different problems. Variations of the waterfall model can be found in different fields of computer science, such as the modelling and simulation lifecycle proposed by Balci et al. [11], the ontology development process Methontology [48], or the knowledge-based development process proposed by Gonzalez and Dankel [61, p. 304]. The main disadvantage of these methods is that at some point the early analysis and design phases are no longer revisited, which can lead to conceptual problems that have to be solved during the implementation phase. This issue has been addressed by Parnas et al. in their work A Rational Design Process: How and Why to Fake It [107]. They discuss the problem of freezing early phases and point out that in reality, if an error in the early phases is detected, one will always go back and fix it. They also argue that one usually does not follow the process exactly as described, but rather produces documentation afterwards that pretends the process was followed.

³ With the exception of the evolutionary software development process that produces a prototype as soon as ...

To solve this problem, iterative development processes have been proposed, such as Boehm’s spiral model [15] or the evolutionary development process [135, p. 11]. However, in Boehm’s spiral model the development process itself is still the developer’s choice, with the only requirement that it is repeated until the desired product is obtained. In the evolutionary development process, rapid prototyping does not allow thorough model conceptualisation, which could lead to unexpected problems during model implementation and evaluation.

Furthermore, the various processes described in the above sections deliver final products that differ from the output of a model-based activity recognition system. For example, the output of a software engineering process is a software system that is validated and verified, where validation and verification are defined as follows.

validation: The process of evaluating a system or component during or at the end of the development process to determine whether it satisfies specified requirements [IEEE-STD-610.12] [72, p. 212].

verification: The process of evaluating a system or component to determine whether the products of a given development phase satisfy the conditions imposed at the start of that phase [IEEE-STD-610.12] [72, p. 213].

The processes for modelling and simulation, context-aware systems, and ontologies all share a similar understanding of validation and verification. The final outcome of this phase is a demonstration that the system satisfies the previously defined requirements and that it is a solution to the problem at hand. Things look somewhat different in the field of data analysis. Here, in addition to the model’s correctness and suitability for the given problem, the developer also has to carry out a more detailed evaluation of the model’s performance. The reason is that a model may satisfy all requirements and be shown to solve the problem, yet perform poorly on real data⁴.

On the other hand, the data analysis processes concentrate on the underlying data and the information the model can unearth from it rather than on a detailed model development process. Chapter 3 has shown that this can be a drawback for the later model implementation and evaluation, as some problems could easily have been avoided by more thorough conceptualisation.

Furthermore, most of the processes described above produce the project documentation only when the project is already completed. However, experience with the three experiments in Chapter 3 showed that when a decision is not documented at the moment it is made, it is almost impossible to reconstruct it later. This indicates that a better process for documenting the model development is needed.

⁴ Actually, the experience with modelling with CCBM so far has proven that first versions of the models that ...

Finally, one essential issue with the discussed development processes is the lack of a mechanism for developing the model’s probabilistic structure. A purely rule-based approach would easily fit into, e.g., some adaptation of the waterfall model, as developing one element of the causal structure does not change the structure of the remaining elements. For example, we can develop all actions in the causal model sequentially, and each newly implemented action will not change the causal structure of the remaining actions. When dealing with probabilities, on the other hand, changing one probabilistic element changes all the remaining elements. For example, introducing a weight for an action does not change just that action; it also affects all the remaining actions, as each action is weighted with respect to the rest of the available actions. The same applies to introducing probabilistic action durations: an action’s duration affects the execution time of all the remaining actions. This probabilistic aspect of approaches that combine logic with Bayesian inference is not considered in any of the existing development processes. It is also an essential aspect of CCBM model development that has to be considered carefully.
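The coupling between action weights can be made concrete with a small sketch. It assumes that an action’s selection probability is its weight divided by the sum of the weights of all available actions; this normalisation, the function name and the three example actions are illustrative assumptions, not the actual CCBM action selection mechanism.

```python
def selection_probabilities(weights):
    """Normalise action weights into selection probabilities.

    Assumption: each probability is the action's weight divided by the
    total weight of all currently available actions.
    """
    total = sum(weights.values())
    return {action: w / total for action, w in weights.items()}


# Three hypothetical actions with equal weights.
uniform = selection_probabilities({"wash": 1.0, "cut": 1.0, "cook": 1.0})

# Doubling the weight of one action changes the probability of every action.
reweighted = selection_probabilities({"wash": 2.0, "cut": 1.0, "cook": 1.0})
```

With equal weights every action has probability 1/3. After only the weight of `wash` is changed, `cut` and `cook` drop to 0.25 each although their own weights were not touched, which is why probabilistic elements cannot be developed one at a time in the way rule-based elements can.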

It is obvious that there is a gap between the engineering view of model development and that of a data analyst. To bridge this gap, in the next sections a development process for engineering Computational Causal Behaviour Models for activity recognition is proposed.

In document Methods for engineering symbolic human behaviour models for activity recognition (Page 174-179)