1 School of Business Informatics and Mathematics,
University of Mannheim. 68159 Mannheim. Germany
2 Institute for Enterprise Systems (InES),
L 15, 1-6, 68131 Mannheim. Germany
3 ontoprise GmbH, An der RaumFabrik 33a,
76227 Karlsruhe. Germany
A STUDY IN USER-CENTRIC DATA INTEGRATION
Data Integration
maps different data sources to a consistent
target structure.
Motivation 1
Target Structure (Ontology)
(Encompassing, consistent view of the data)
Data Integration Rules
Data Sources
(Direct extraction from different data sources)
Jan Noessner - Lehrstuhl für künstliche Intelligenz – University of Mannheim 3
Outline
1 Motivation
2 Related Work
3 User-Centric Mapping Assistant Approach
4 Study Design and Datasets
5 Experimental Results
6 Conclusion and Future Work
Automatic data integration approaches
are still error-prone and
need to be supervised by human domain experts.
The problem of data integration has been studied intensively on a technical level in different areas of computer science.
Researchers have investigated the automatic identification of semantic relations between different datasets (Euzenat and Shvaiko, 2007).
A prominent line of research investigates the use of ontologies - formal representations of the conceptual structure of an application domain - as a basis for both identifying and using semantic relations.
Existing work in
user-centric data integration
investigated rather
simple scenarios.
In a recent study, Gass and Maedche investigated the problem of data integration in the context of personal information management from a user-centric point of view (Gass and Maedche, 2011).
The scenario addressed in their work, however, focuses on the integration of rather simple data schemas, in this case personal data, where the task is mainly to map properties describing a person (e.g. name or bank account number).
Traditional User Interfaces try to visualize integration rules
Related Work 2
Most approaches are based on advanced visualization of the models to be integrated and the mappings created by the user (Granitzer et al., 2010).
Drawbacks
are visualization limits in number and complexity of
integration rules.
Visualizations quickly reach their limits if many integration rules exist, or if very complex mapping rules exist, which are hard to visualize.
@{c#CCMappingRule1305046217116} ?_SIID:HighComfortAndLowSavetyCar :-
?_SIID:<http://www.owl-ontologies.com/autos.owl#Cars>@<http://www.owl-ontologies.com/autos.owl> AND
(?_SIID[<http://www.owl-ontologies.com/autos.owl#hasSafetyFeaturesRating>->?_VAR0]@<http://www.owl-ontologies.com/autos.owl> AND ?_VAR0 <= 2.0) AND
(?_SIID[<http://www.owl-ontologies.com/autos.owl#hasComfortAndConvenienceRating>->?_VAR1]@<http://www.owl-ontologies.com/autos.owl> AND ?_VAR1 >= 3.5).
A high level of expert knowledge is needed to interpret the consequences of the mapping rules.
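To make the consequence of such a rule easier to see, the following Python sketch restates the mapping rule above as a plain predicate over car records. The record layout (dictionaries keyed by attribute name) and all sample values are assumptions made for illustration; they are not taken from the Autos dataset, and the sketch is not how the rule engine evaluates rules.

```python
# Hypothetical car records; the keys mirror the attribute names used in the rule.
cars = [
    {"id": "car_a", "hasSafetyFeaturesRating": 1.5, "hasComfortAndConvenienceRating": 4.0},
    {"id": "car_b", "hasSafetyFeaturesRating": 3.0, "hasComfortAndConvenienceRating": 4.5},
    {"id": "car_c", "hasSafetyFeaturesRating": 2.0, "hasComfortAndConvenienceRating": 3.5},
    {"id": "car_d", "hasSafetyFeaturesRating": 1.0},  # comfort rating missing
]


def is_high_comfort_and_low_safety_car(car: dict) -> bool:
    """Mirror the rule body: safety rating <= 2.0 AND comfort rating >= 3.5."""
    safety = car.get("hasSafetyFeaturesRating")
    comfort = car.get("hasComfortAndConvenienceRating")
    if safety is None or comfort is None:
        # Without a value for an attribute, the rule body cannot be satisfied.
        return False
    return safety <= 2.0 and comfort >= 3.5


# Instances that the rule would classify under HighComfortAndLowSavetyCar.
matched = [car["id"] for car in cars if is_high_comfort_and_low_safety_car(car)]
print(matched)
```

Listing the matched instances, rather than the rule text, is the kind of instance-level view that lets a domain expert judge a rule without reading its syntax.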
The need for
User-Centric Data Integration
has been recognized.
Recently, researchers in ontology and schema matching have recognized the need for user support in aligning complex conceptual models (Falconer, 2009; Falconer and Storey, 2007).
The
cognitive support model
for data integration
by Falconer and Noy (2011) emphasizes user interaction.
Our
Modified Cognitive Support Model
is based on identifying
wrong instances and asking questions in natural language.
User-Centric Mapping Assistant Approach 3
User inspection: decision on which concept to examine.
User identifies instances which have been classified incorrectly.
User answers questions.
Diagnostic algorithm generates the minimal number of user questions.
Questions are presented to the user as natural-language sentences in a to-do list.
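The slides do not spell out the diagnostic algorithm, so the following Python sketch only illustrates the general idea of keeping the number of questions small. It assumes, hypothetically, that a wrong classification was derived through a linear chain of rules, that exactly one rule is faulty, and that every statement derived after the faulty rule is also wrong. Under these assumptions a binary search over the chain needs only a logarithmic number of yes/no questions. The names `find_faulty_rule`, `chain`, and `is_correct` are illustrative and not part of MappingAssistant.

```python
from typing import Callable, List, Tuple

# (rule id, statement derived by that rule); both are illustrative.
Step = Tuple[str, str]


def find_faulty_rule(chain: List[Step],
                     is_correct: Callable[[str], bool]) -> Tuple[str, List[str]]:
    """Locate the first wrong statement in a derivation chain by binary search.

    Precondition (assumption): the last statement in the chain is wrong,
    because the user marked the final classification as incorrect.
    """
    if not chain:
        raise ValueError("the derivation chain must not be empty")
    asked: List[str] = []
    lo, hi = 0, len(chain) - 1  # the first wrong statement lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        statement = chain[mid][1]
        asked.append(f"Is it correct that {statement}?")
        if is_correct(statement):
            lo = mid + 1   # everything up to mid is fine
        else:
            hi = mid       # the fault is at mid or earlier
    return chain[lo][0], asked


# Hypothetical derivation for one instance the user marked as wrong.
chain = [
    ("rule_1", "item_42 is a CarPart"),
    ("rule_2", "item_42 is an AirCondition"),
    ("rule_3", "item_42 is an AutomaticAirCondition"),
    ("rule_4", "item_42 is an AutomaticOneZoneAirCondition"),
]
# Simulated user: the first two statements are true, the rest are not.
truth = {chain[0][1]: True, chain[1][1]: True, chain[2][1]: False, chain[3][1]: False}

faulty_rule, questions = find_faulty_rule(chain, lambda s: truth[s])
print(faulty_rule, questions)
```

In this toy run two questions are enough to single out one of four rules. A real diagnosis would have to cope with branching derivations and several faulty rules at once, which this sketch deliberately ignores.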
Our
Interactive User Interface
enables users to investigate data
on the instance level.
In the
Analysis and Decision Making
phase the user decides
which concept to examine.
Step 1: User decides which concept to examine.
In the
Interaction
phase the user identifies wrongly classified
instances.
Step 2: User identifies instances which have been classified incorrectly.
In the
Analysis and Generation
phase the minimal number of user
questions is generated by the system.
Step 3: Diagnostic algorithm generates the minimal number of user questions.
In the
Representation
phase the questions are presented to the
user in natural language.
Step 4: Questions are presented to the user as natural-language sentences in a to-do list.
The
Source Dataset
is an instructional dataset from the web.
The
Target Schema
was manually created.
Study Design and Datasets 4
Source Dataset
Instructional dataset from the car-selling domain
(http://gaia.isti.cnr.it/~straccia/download/teaching/SI/2006/Autos.owl)
The dataset contains:
324 data records (cars, car parts, etc.)
100 attributes (like speed, fuel consumption, ...).
91 concepts organized in a concept hierarchy.
Complex enough, yet small enough to be handled in a user study.
Ten
Integration Rules
were wrong and had to be identified by the
subjects (
Dependent Variable
).
Two datasets containing 10 wrong integration rules each.
Type 1: Easy Mistakes
Type 2: Complex Mistakes
The subjects had to find as many wrong integration rules as possible.
The dependent variable is the number of errors the subjects found in the
[Figure: example integration rules. Concepts shown: Wheel, Engine, AirCondition, AutomaticOneZoneAirCondition. Filter: hasZoneNumber = 2, hasAutomatic = false]
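The figure itself did not survive, so the following Python sketch rests on an interpretation: it assumes the labels show a complex (Type 2) mistake in which AirCondition records are mapped to AutomaticOneZoneAirCondition with the filter hasZoneNumber = 2 and hasAutomatic = false. The "intended" filter (one zone, automatic) is inferred from the concept name only, and the records are invented. The sketch shows how such a wrong filter becomes visible at the instance level.

```python
# Hypothetical AirCondition records; values are invented for illustration.
records = [
    {"id": "ac_1", "hasZoneNumber": 1, "hasAutomatic": True},
    {"id": "ac_2", "hasZoneNumber": 2, "hasAutomatic": False},
    {"id": "ac_3", "hasZoneNumber": 2, "hasAutomatic": True},
]


def apply_rule(rows: list, zone_number: int, automatic: bool) -> list:
    """Return the ids of records that pass the rule's filter."""
    return [
        row["id"]
        for row in rows
        if row["hasZoneNumber"] == zone_number and row["hasAutomatic"] == automatic
    ]


# Filter as it appears in the figure labels (assumed to be the wrong rule).
wrong_result = apply_rule(records, zone_number=2, automatic=False)
# Filter one would expect from the concept name (an assumption).
intended_result = apply_rule(records, zone_number=1, automatic=True)

# Instances a user could flag as wrongly classified, and instances that are missing.
misclassified = sorted(set(wrong_result) - set(intended_result))
missing = sorted(set(intended_result) - set(wrong_result))
print(misclassified, missing)
```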
We compared the conventional approach with the
MappingAssistant approach (
Independent Variable
).
To simulate background knowledge, the subjects were given an information sheet.
Both the order of tasks and the order of datasets were switched.
We performed the study with 22 subjects.
22 subjects participated in the user study; each performed both tasks on both datasets.
6 female, 16 male
average age: 27.8 years (min = 21, max > 50).
54% of the subjects were students.
Precision, Recall, and F-Measure
Experimental Results 5
\[ \text{Precision} = \frac{\text{number of errors that have correctly been identified by a subject}}{\text{number of errors identified by a subject}} \]
\[ \text{Recall} = \frac{\text{number of errors that have correctly been identified by a subject}}{\text{number of all existing errors (10)}} \]
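As a worked example of these measures, the following Python sketch scores one hypothetical subject. The rule identifiers and the subject's answers are invented. For the F-measure the sketch assumes the standard balanced form (the harmonic mean of precision and recall), since the formula itself is not legible in the slides.

```python
def evaluate(marked: set, wrong_rules: set) -> dict:
    """Score the rules a subject marked as wrong against the truly wrong rules."""
    if not wrong_rules:
        raise ValueError("there must be at least one wrong rule to find")
    correct = marked & wrong_rules
    precision = len(correct) / len(marked) if marked else 0.0
    recall = len(correct) / len(wrong_rules)
    # Balanced F-measure (assumed): harmonic mean of precision and recall.
    if precision + recall > 0:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {"precision": precision, "recall": recall, "f_measure": f_measure}


# Ten wrong integration rules per dataset, as in the study design.
wrong_rules = {f"r{i}" for i in range(1, 11)}
# Hypothetical subject: marks eight rules, six of which are really wrong.
marked = {"r1", "r2", "r3", "r4", "r5", "r6", "r11", "r12"}

scores = evaluate(marked, wrong_rules)
print(scores)
```

Here the subject reaches a precision of 6/8 and a recall of 6/10; marking fewer rules would raise precision only at the cost of recall, which is why the F-measure is reported alongside both.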
In the
Average Performance of Subjects
the recall was one third
higher in the MappingAssistant approach.
Comparing the
Performance on the Subject Level,
91% of the
subjects found more mistakes with the MappingAssistant approach.
In the standard approach, subjects with low
technical knowledge reached lower F-scores.
In the MappingAssistant approach, the F-score reached is
independent of the level of knowledge.
The
User Feedback
is better for the MappingAssistant approach
than for the standard approach.
Conclusion
Conclusion and Future Work 6
The goal of our research was to enable people with little or no technical knowledge to integrate their data. We presented a user-centric approach to data integration that is based on a cognitive support model.
We presented the results of a user study demonstrating that our
MappingAssistant approach empowers users to solve data integration
problems more effectively and efficiently. In particular, we showed that users were able to find more errors in mapping rules in a given period of time.
Further, we were able to show that, while conventional mapping
technology requires a high level of expertise, the MappingAssistant approach significantly reduces the performance
difference between experienced and inexperienced users.
In
Future Work
we will focus on correcting the wrong integration
rules.
Select concept and mark wrong instance
Feedback questions from the system to the user
Identify the wrong integration rules
Selection of the integration rule and mark
Calculation of correction suggestions of the
Selection of a correction suggestion
Updating the integration rule
End
… for your attention!