String based constraints - Evolution-based Quality Characteristics and Measures


5.1.3 String based constraints

For string-based constraint generation, the primary focus is to determine the minimum length (minLength) and maximum length (maxLength) of the values of a property. In this context, the maximum and minimum lengths depend on the rdf:type and apply to nodes with literal values. In general, if the value of minLength is 0, there is no restriction on the string length, but the constraint is still violated if the value node is a blank node. On the other hand, without a maxLength restriction a value can have any string length, depending on the rdf:type. We considered the distribution of string lengths to identify the minLength and maxLength of the literal values of a property. More specifically, for all the properties present in a class, we explored the interquartile range of the distribution of literal string lengths for constraint generation. We estimate minLength using the first quartile (Q1) and maxLength using the third quartile (Q3). Table 5.3 lists the string length conditions for minLength and maxLength. In particular, we mainly focus on identifying a relative range for the maximum and minimum length. An example of a string-length-based SHACL shape for the dbo:title property is presented in Listing 5.5.

Table 5.3 Minimum and maximum String length levels.

Key          Description
minLength0   Minimum Length < Q1
minLength1   Minimum Length ≥ Q1
maxLength0   Maximum Length < Q3
maxLength1   Maximum Length ≥ Q3
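The quartile-based estimation described above can be sketched in a few lines of Python. This is a minimal illustration, not the prototype used in this work: the function names, the "inclusive" quantile method, the outward rounding to integers, and the sample names are all assumptions made for the example. The mapping to the keys of Table 5.3 likewise follows one plausible reading of the table.

```python
import math
from statistics import quantiles


def length_quartiles(literals):
    """Return (Q1, Q3) of the string lengths of the given literal values."""
    lengths = [len(value) for value in literals]
    # quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
    # The "inclusive" method is an assumed choice, not taken from the text.
    q1, _median, q3 = quantiles(lengths, n=4, method="inclusive")
    return q1, q3


def induce_length_bounds(literals):
    """Propose integer (minLength, maxLength) candidates from Q1 and Q3."""
    q1, q3 = length_quartiles(literals)
    # SHACL lengths are integers, so round outwards (assumed rounding rule).
    return math.floor(q1), math.ceil(q3)


def length_levels(min_length, max_length, q1, q3):
    """Map a (min, max) length pair to the keys of Table 5.3."""
    min_key = "minLength0" if min_length < q1 else "minLength1"
    max_key = "maxLength0" if max_length < q3 else "maxLength1"
    return min_key, max_key


# Hypothetical literal values of a name-like property.
names = ["Ada", "Tim", "Alan Turing", "Grace Hopper", "Linus Torvalds"]
q1, q3 = length_quartiles(names)
bounds = induce_length_bounds(names)
levels = length_levels(1, 8, q1, q3)
```

Rounding Q1 down and Q3 up keeps the induced range at least as wide as the interquartile range; a stricter or looser rule would be equally easy to substitute.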

Listing 5.5 String constraints.

@prefix dbo: <http://dbpedia.org/ontology/> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
# The ex: namespace is not given in the source; an example namespace is assumed here.
@prefix ex: <http://example.org/> .

ex:DBpediaPerson a sh:NodeShape ;
    sh:targetClass dbo:Person ;
    # minLength
    sh:property [ sh:path foaf:name ; sh:minLength 1 ; sh:maxLength 8 ] ;
    # for MAX1
    sh:property [ sh:path dbo:birthDate ; sh:minLength 1 ; sh:maxLength 8 ] .
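Once length bounds have been estimated, a shape such as the one in Listing 5.5 can be produced mechanically. The following Python sketch renders the Turtle text with plain string formatting. The helper names are hypothetical, prefix declarations are left to the caller, and no RDF library is involved, so the output is not validated against the SHACL specification.

```python
def length_property_shape(path, min_length, max_length):
    """Render one sh:property block with minLength and maxLength."""
    if min_length < 0 or max_length < min_length:
        raise ValueError("expected 0 <= min_length <= max_length")
    return (
        f"sh:property [ sh:path {path} ; "
        f"sh:minLength {min_length} ; sh:maxLength {max_length} ]"
    )


def node_shape(shape_name, target_class, property_shapes):
    """Assemble a sh:NodeShape in Turtle from rendered property blocks."""
    if not property_shapes:
        raise ValueError("at least one property shape is required")
    header = [
        f"{shape_name} a sh:NodeShape ;",
        f"    sh:targetClass {target_class} ;",
    ]
    # Property blocks are separated by ';' and the last one closes with '.'.
    body = " ;\n".join(f"    {shape}" for shape in property_shapes)
    return "\n".join(header) + "\n" + body + " ."


shape = node_shape(
    "ex:DBpediaPerson",
    "dbo:Person",
    [length_property_shape("foaf:name", 1, 8)],
)
print(shape)
```

In a real pipeline the bounds passed to `length_property_shape` would come from the quartile analysis rather than being written by hand.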

5.2 Summary

In this chapter, we described an approach for inducing validation rules in the form of RDF shapes by profiling the data and using inductive approaches to extract the rules. Another use case for inducing shapes is describing the data, which is helpful for generating queries or creating dynamic user interfaces. Based on the proposed RDF Shape induction approach, Chapter 6 presents a generic validation process that applies to any type of constraint, using the results of the evolution-based quality assessment. Furthermore, Chapter 7 presents the details of a predictive learning evaluation for two types of constraints, namely cardinality and range type constraints. Although this chapter only discussed the shape induction process for three types of constraints, the approach can be extended to other types, such as value range constraints (minimum and maximum values), string constraints (pattern, languageIn, uniqueLang), or property pair constraints (lessThan, lessThanOrEquals, disjoint, equals) [88].

Chapter 6 Evolution-based Quality Assessment and Validation Approach

In the context of quality assessment methodology, the Data Life Cycle (DLC) provides a high-level overview of the stages involved in the successful management and preservation of data for use and reuse. Several versions of the data life cycle exist, with different attributes reflecting variations in practices across domains or communities [34]. A data quality life cycle generally includes the identification of quality requirements and relevant metrics, quality assessment, and quality improvement [35, 90]. Debattista et al. [35] present a data quality life cycle that covers the phases from the assessment of data to cleaning and storing. They show that, within the life cycle, quality assessment and improvement of Linked Data is a continuous process. In contrast, we explore the features of quality assessment based on KB evolution. Our reference Data Life Cycle is defined by the international standard ISO 25024 [1]. We extend the reference DLC by integrating a quality assessment phase alongside the data collection, data integration, and external data acquisition phases. This phase ensures data quality for the data processing stage. The extended DLC is shown in Figure 6.1.

The first step in building the quality assessment approach was to identify the quality characteristics. Based on the quality characteristics presented in Section 4.3, we proposed a KB quality assessment approach. In particular, our evolution-based quality assessment approach computes statistical distributions of KB elements from different KB releases and detects anomalies based on evolution patterns. The validation approach is based on the RDF Shape induction process introduced in Chapter 5. Figure 6.2 reports the proposed workflow using the quality assessment and validation procedures. In this workflow, the left side displays the collection of KB releases used as input, while the approach is divided into two phases: (1) quality evaluation (including the statistical profiler) using evolution-based quality characteristics (Chapter 4); and (2) validation, which is composed of feature extraction and manual evaluation.

Fig. 6.1 ISO/IEC 25024 Data Life Cycle (DLC) [1] with the proposed quality assessment approach. The boxes highlight the components that are added as improvements to the DLC.

Elaborating further, the quality assessment and validation procedures are based on four stages: (1) data collection (multiple releases of a knowledge base), (2) quality evaluation process (relying on statistical and quality profiling), (3) validation process (based on feature extraction and manual validation), and (4) modeling and quality problem report (evaluation of learning models and generation of a quality problem report). For the quality assessment process, a prototype was implemented using the R statistical package and shared as open source in order to foster reproducibility of the experiments.
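To make the idea of detecting anomalies from evolution patterns concrete, the following Python sketch compares the entity counts of one class across consecutive releases. It is only an illustration. The prototype mentioned above is written in R, and the rule used here (flag a release whose count is lower than that of the previous release) is an assumed stand-in for the quality measures defined in Chapter 4. The release labels and counts are hypothetical.

```python
def flag_count_drops(counts_by_release):
    """Return the releases whose entity count is lower than the previous one.

    counts_by_release: list of (release_label, entity_count) in release order.
    """
    flagged = []
    pairs = zip(counts_by_release, counts_by_release[1:])
    for (_, previous), (label, current) in pairs:
        # Assumed rule: a shrinking class is treated as a potential problem.
        if current < previous:
            flagged.append(label)
    return flagged


# Hypothetical entity counts of one class over four releases.
releases = [("r1", 1200), ("r2", 1350), ("r3", 1100), ("r4", 1400)]
suspicious = flag_count_drops(releases)
print(suspicious)
```

Releases flagged in this way would then be passed on to the validation stage, where feature extraction and manual inspection decide whether the drop reflects a real quality problem.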

The rest of this chapter follows the proposed quality assessment and validation workflow illustrated in Figure 6.2. Section 6.1 provides a general overview of how the multiple releases of a KB are used as input to the approach. The quality evaluation process is presented in Section 6.2, where each phase in the pipeline is described in terms of statistical analysis and quality profiling. The validation process is outlined in Section 6.3, covering the feature extraction process (Sec. 6.3.1), which is based on RDF Shape induction (Chapter 5). Furthermore, Section 6.3.2 presents a detailed

In document Automated Knowledge Base Quality Assessment and Validation based on Evolution Analysis (Page 77-82)