Information Extraction - Class Model Extraction

6.6 Class Model Extraction

6.6.2 Information Extraction

With the image segmented using k-means (or any other similar method), we feed every individual class to the text processor. It will retrieve the contents of the box line by line. This text must be processed in order to identify the type and name of the component. The following basic rules are followed to identify the individual elements of a class:

• Class name: if the text belongs to the first or second line, and starts with a capital letter.

• Attribute: every line starting from the second that starts with a lower-case letter or a visibility symbol, and does not contain parentheses.

• Operation: every line starting from the second that starts with a lower-case letter or a visibility symbol, and contains parentheses.

With these rules, we can parse the contents of every class image as shown in Figure 6.9. Note that this implementation only works on classes that have been written following the standard UML notation.
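The rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the T4TOMM implementation: the function names `classify_line` and `parse_class_box`, the returned dictionary layout, and the set of visibility symbols (`+`, `-`, `#`, `~`) are all hypothetical choices made for the example.

```python
# Assumed set of UML visibility markers; the text only says "a visibility symbol".
VISIBILITY_SYMBOLS = "+-#~"


def classify_line(index, text):
    """Classify one line of a class box. `index` is the 1-based line number."""
    text = text.strip()
    if not text:
        return "unknown"
    first = text[0]
    # Rule 1: a capitalised first or second line is the class name.
    if index <= 2 and first.isupper():
        return "class_name"
    # Rules 2 and 3: from the second line on, a lower-case letter or a
    # visibility symbol starts a member; parentheses mark an operation.
    if index >= 2 and (first.islower() or first in VISIBILITY_SYMBOLS):
        return "operation" if "(" in text or ")" in text else "attribute"
    return "unknown"


def parse_class_box(lines):
    """Group the lines of one segmented class box by element type."""
    model = {"name": None, "attributes": [], "operations": []}
    for index, raw in enumerate(lines, start=1):
        kind = classify_line(index, raw)
        text = raw.strip()
        if kind == "class_name" and model["name"] is None:
            model["name"] = text
        elif kind == "attribute":
            model["attributes"].append(text)
        elif kind == "operation":
            model["operations"].append(text)
    return model


# Hypothetical class box, as the text processor might return it line by line.
box = parse_class_box(["Book", "- title: String", "+ borrow(member)"])
```

In this sketch, lines that match none of the rules are labelled `unknown` and dropped, which mirrors the restriction to classes written in standard UML notation.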

The prototype implementation currently only works for a limited set of ideal class diagrams, particularly those that are very clear with classes that are sufficiently spaced. However, as a proof-of-concept, it certainly illustrates how our formal framework can be extended with the usage of image processing in order to reduce human effort in favour of better automation.

6.7 Summary

In this chapter, a proof-of-concept to support TOMM was proposed, namely T4TOMM. In order to make TOMM easily accessible, automatic formalization of ConSpec specifications is performed. Similarly, class models are partially obtained from the images of class diagrams, which are then formalized. The SMT-LIB models generated from the processes described above are then checked using SMT solvers to determine model validity and equivalence, together with the model inference from specifications. Some of the limitations of T4TOMM are briefly mentioned here and further explored in the following chapter.

Figure 6.5: Lines detection process

Figure 6.6: Rectangles detection process

Figure 6.7: Segments detection process


Figure 6.8: Comparison of labelling vs k-means

Chapter 7 Evaluation

Throughout this chapter we describe the different aspects involved in the evaluation of our contributions, including the evaluation methodology, the cases to be evaluated, the actual evaluation and the results for each of the items enumerated in Section 1.4.

First we evaluate our specification format for functional requirements (ConSpec + SpeCNL). This format is evaluated by manually translating existing requirements from different sources and domains into SpeCNL sentences within a ConSpec specification. For each of these requirements, we discuss the extent to which we were able to express it in SpeCNL and the limitations that we came across. We highlight the results regarding functional requirements, and we discuss future work to improve and extend this specification format.

For model generation we refer to the requirements of the library system introduced in Chapter 2 restructured as a ConSpec specification, and we use T4TOMM to infer its corresponding class model. We then manually check that all the classes, attributes, operations and inheritances expected are present in the inferred model.

For model validation, we evaluate four individual cases. First, we check an invalid model, to make sure our theory is capable of determining when a diagram is neither sound nor complete with respect to a given specification. Then we check a model that we know is sound but not complete, that is, one that lacks some elements that exist in the specification. Next, we check completeness alone, using a diagram that has all the elements of the specification but also additional elements. Finally, we check a valid model, that is, a model that is both sound and complete. All these checks are done using our proof-of-concept and manually modified diagrams that satisfy each case.
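The four validation cases can be illustrated with a simple set-based reading of soundness and completeness. This is a sketch under assumptions, not the SMT-based theory used by T4TOMM: it treats the specification and the model as plain sets of element names, takes "sound" to mean that the model has no elements beyond the specification, and takes "complete" to mean that the model has every element of the specification. The function name `check_model` and the element names are hypothetical.

```python
def check_model(spec_elements, model_elements):
    """Set-based approximation of the soundness and completeness checks."""
    spec, model = set(spec_elements), set(model_elements)
    sound = model <= spec      # assumed reading: nothing outside the specification
    complete = spec <= model   # assumed reading: nothing from the specification missing
    return {"sound": sound, "complete": complete, "valid": sound and complete}


# Hypothetical specification elements, named only for illustration.
spec = {"Book", "Member", "Book.title"}

invalid = check_model(spec, {"Book", "Loan"})         # neither sound nor complete
sound_only = check_model(spec, {"Book", "Member"})    # sound, but not complete
complete_only = check_model(spec, spec | {"Loan"})    # complete, but not sound
valid = check_model(spec, spec)                       # sound and complete
```

Each of the four calls corresponds to one of the manually modified diagrams described above.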

For model comparison, we follow a similar approach to the one of model validation. We use T4TOMM to check for two models that are different, then we check for left, right and total equivalence of manually modified diagrams that satisfy these scenarios.

The evaluation of these theories includes first a demonstration of their manual application, and then we proceed to evaluate the specific cases using T4TOMM. All these theories are evaluated with respect to the current capabilities of TOMM, that is, reasoning about classes, attributes, operations and inheritances. Additional elements such as OCL constraints or associations are not evaluated. The obtained results are manually compared against the expected ones, and then they are discussed individually in each subsection. The threats to validity are discussed towards the end of the corresponding section.

In addition, we evaluate T4TOMM for model extraction, which is a secondary contribution. We extract class models from existing class diagrams from different sources. For each existing class diagram, we manually count the number of classes, attributes and operations, and then we compare those results with the ones obtained from our proof-of-concept.
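The comparison between manual counts and extracted counts can be sketched as follows. The function name `compare_counts`, the dictionary keys and the numbers in the example are hypothetical and chosen only for illustration; they are not results from this evaluation.

```python
def compare_counts(manual, extracted):
    """Compare manually counted elements with those found by the extractor."""
    report = {}
    for kind in ("classes", "attributes", "operations"):
        expected, found = manual[kind], extracted[kind]
        report[kind] = {
            "expected": expected,
            "found": found,
            # Negative: elements were missed; positive: spurious elements found.
            "difference": found - expected,
        }
    return report


# Made-up counts for a single diagram, used only to show the call.
report = compare_counts(
    {"classes": 3, "attributes": 7, "operations": 4},
    {"classes": 3, "attributes": 6, "operations": 5},
)
```

Keeping the sign of the difference makes it possible to tell missed elements apart from spurious ones when the results are discussed per diagram.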

7.1 ConSpec and SpeCNL

In Chapter 4 we presented our document structure, ConSpec, and our controlled natural language SpeCNL used to specify functional requirements. In this section, we discuss the process followed to evaluate their capabilities and limitations in comparison with requirements documents written in English, with no pre-defined structure. We present here a selection of requirements from a collection of publicly available requirements documents, and we analyse the process required to rewrite these original requirements into ConSpec specifications.

In document A seamless framework for formal reasoning on specifications : model derivation, verification and comparison (Page 130-135)