Catalogs and Data Integration for E-Commerce Applications
On-line catalogues
Issues
• Advantages?
• Product information
• Information coupling
• Security
• Purchase process
• Buyer's catalogue vs. seller's catalogue
• Data integration
Advantages
• Up-to-date information
• Directed search possibilities
• More information and multi-media information
• Coupling with ordering and stock info
• Personalisation of information
• Cost reduction for production
• Configure or specify products
• Intelligent assistance
Products in catalogs
1. Uniquely identifiable products
• basis of most catalogs
2. Select values of fixed attributes
• E.g. Colour of clothes, processor type of PC
3. Configurable
• E.g. PC, car, ...
For situations 1 and 2, product databases containing all
possible products exist.
Product data
• Identifying the product (article number(s),
name)
• Technical data
• design, use, norms (ISO, ...),...
• Commercial data
• Prices, delivery conditions,...
• Logistical data
Product profiles
• Not all parties are interested in the same
attributes of the product. E.g. A plumber is
interested in the size of a bathtub and fixtures,
the user in the colour.
• Industry sectors and companies have their own product
codes. E.g. for bolts: EAN, ISO, Borstlap, ...
• Problem: different companies identify (classify)
their products in different ways. E.g. Tiles can be
ceramic products or floor/wall covering.
Commercially sensitive data
• Price information
• Discount availability
• Transparent prices are nice for buyers but not
for sellers
• Availability data
• Possibilities:
– Stocked article (indicates type of article)
– Article in stock
Security
• Separate catalogue data from product data base
• If personalized data is generated, where is the
code stored?
• Security vs. up-to-date information
• Catalogue maintenance (who, when,…?)
Order process
• Searching the catalogue is part of the purchasing
process
• The design of this process should indicate who can
search the catalogue, which information is available,
for which products ordering authorization is needed,
etc.
• B2C → simple
– Consumer does not have to integrate with back-end
– Consumer can decide himself
• B2B → complex
– Both sides need to integrate with back-end systems
– Purchasing process regulated by buying company
Who has the responsibility?
Should the catalog and the ordering process be
under the responsibility of
1. The supplier
2. The customer
3. A broker
(Customer specific) catalogues
with suppliers
[Figure: purchasers at the customer reach the catalogues of supplier 1, supplier 2, ..., supplier n over the Internet]
Advantage:
• Supplier can manage the catalogue
efficiently
• Supplier can add functions for each client
Disadvantage:
• Supplier specifies products
Purchasing catalogue with
customer
[Figure: suppliers 1 to n send updates from their product data and catalogues to a purchase catalogue held at the customer, which the customer's purchasers search; the parties are connected over the Internet]
Advantage:
• Uniform search and ordering process for
customer
• Customer determines which products can be
shown
Disadvantage:
• More difficult to maintain for supplier
• More difficult to keep info up-to-date and
complete
Catalogue with broker
[Figure: suppliers 1 to n send updates from their product data and catalogues to the broker's catalogue, which the customer's purchasers search]
Advantage:
• Costs are shared
• Standardisation
Disadvantage:
• Extra party in the process
• Needs data integration
The multi-catalogue/multi-view problem: data integration
[Figure: customers 1 to m each need access to the catalogues of suppliers 1 to n; how to connect them is marked with a question mark]
Information Management
Integrating catalogs is an instance of a more
general problem:
Managing data from many heterogeneous,
autonomous sources.
Information Management
[Figure: search and collect, index and organise, customise and redistribute, across file systems, digital libraries, databases, the World Wide Web, image banks, and email systems]
• Vast collections
• Composite multimedia components
• Heterogeneous
• Dynamic
• Autonomous
• Different interfaces
• Different data representations
Information Management
• Management of Heterogeneous Information
– Information Integration
– Data Warehousing
– Online Analytical Processing
• Knowledge Discovery
– Web Crawling
[Figure: sources include the World Wide Web, digital libraries, scientific databases, and personal databases]
Providing
• uniform (sources transparent to user)
• access to (query and eventually updates to)
• multiple
• autonomous (can’t affect behavior of sources)
• heterogeneous (different models and schemas)
data sources.
What are some data integration
challenges?
• Freshness of data
• Query response time
• Availability/reliability of sources
• Autonomy of sources
• Heterogeneities at various levels of abstraction
• Two approaches
• Mediation (virtual, query-driven, lazy)
• Data Warehousing (materialized, eager)
Mediation Approach
[Figure: user interfaces and applications query mediators; the mediators reach each information source through a wrapper, and the World Wide Web through an extractor]
• Information fetched, translated, filtered, merged on-the-fly in response to a query
• Good for:
– rapidly changing information sources
– clients with unpredictable needs
– searching over vast amounts of data
• But:
– inefficiency, delay in query processing
– expensive filtering and merging
• Common model for managing heterogeneous data
– Object Exchange Model (OEM)
• Information source wrapping (wrapper)
– data and query translation
– extend query capabilities for sources with limited capabilities
– toolkit for automatically generating wrappers
• Multi-source query processing and information fusion
– declaratively specify how mediators collect and process information
• Browsing and exploring information sources through WWW
– format OEM objects as a web of hypertext documents
– traverse hyperlinks to explore nested structure and contents
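The wrapper idea above can be illustrated with a minimal Python sketch. It is not the actual OEM implementation: it only assumes that an OEM-style object is a labelled node whose value is either atomic or a list of child objects. The names `OemObject`, `wrap_row`, and `find`, and the sample row, are hypothetical.

```python
from dataclasses import dataclass
from itertools import count
from typing import Union

_oids = count(1)  # hypothetical object-id generator


@dataclass
class OemObject:
    """Self-describing object: a label plus an atomic value or nested children."""
    label: str
    value: Union[str, int, float, list]
    oid: int = 0

    def __post_init__(self):
        # Assign a fresh object id unless the caller supplied one.
        if self.oid == 0:
            self.oid = next(_oids)


def wrap_row(label: str, row: dict) -> OemObject:
    """Wrapper step: translate one source row into a nested OEM-style object."""
    children = [OemObject(k, v) for k, v in row.items()]
    return OemObject(label, children)


def find(obj: OemObject, label: str) -> list:
    """Return the values of all direct children carrying the given label."""
    if not isinstance(obj.value, list):
        return []
    return [c.value for c in obj.value if c.label == label]


tile = wrap_row("product", {"ean": "123-..", "size": "10x10", "colour": "white3"})
print(find(tile, "size"))
```

Because every object carries its own label, a mediator can merge objects from sources with different schemas without a fixed global record layout.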
Semantic Integration
• So far, no efficient solution to overcoming semantic heterogeneities
– Detect overlap and remove inconsistencies in representation of similar real-world objects in different schemas
– Result of independent creation of schemas
• Need external domain (semantic)
Application in E-commerce broker
• MeMo project
• Mediating between partners in construction
• Partners from Spain, Germany, Holland
Idea: Introduce a broker to facilitate
communication
[Figure: the intended communication between a member of company A and a member of company B passes through the broker. Each company exports business data (export db, product profiles); the broker imports the business data into a business data repository with a shared product ontology, company profiles, and export db schemas, and also holds law, standards, codes, and the memory of business]
Share business data?
We assume members of a market are willing to share business data, especially company profiles and product
profiles. Their interest is founded in their desire to do business and find partners.
Members include other data providers, e.g. of financial data and product group codes. They are either trusted third parties (like chambers of commerce) or
companies that make a profit from facilitating business (e.g. banks).
Architecture of the MEMO broker
[Figure: MEMO broker architecture. Components: Business Data Integrator (loading business data, product data, company profiles, and finance information via JDBC, ODBC, XML, etc.), repository, search engine, negotiation manager, workflow manager, repository proxy, HTTP proxy and firewall, and a service broker with a service table (op1: server1/op1, op2: server1/op2, op3: server2/op3) that registers services and calls service implementations. Roles: the market owner defines ontologies, data providers define data sources, market users work from a web browser. Data providers: companies, chambers of commerce, banks and insurance companies]
The mismatch of product profiles and ontologies
• search engine: topic-based access to information about products
• heterogeneous product profiles available from companies
• multiple ontologies are used to index these profiles in the repository
product ontology: Stone material, floor tile
product profiles:
Pid   Name   size   price
341   “Ge”   30     3,41
342   “Ka”   35     3,69
Pnr   nam    descr
089   “VA”   “Use this ….”
342   “BO”   “Our best …”
[Figure: question marks link the two profiles to the ontology; the mapping is not yet known]
From data structures to semantic objects
Strategy
1. Represent profiles as semantic objects
2. Represent the profile data structure as semantic objects
3. Plan classification to ontologies based on the profile data structure
4. Deduce product and attribute classification
1. Represent profiles as semantic objects
Trega tiles:
ean      size    colour   sbk     hb
123-..   10x10   white3   c1001   hb876
[Figure: tuple.1, an instance of TregaTiles, with product id ean = 123-.., describing attributes size = „10x10“ and colour = „white3“, and grouping attributes sbk = c1001 and hb = hb876]
Note: suppliers use their individual profile schemas!
2. Represent the profile data structure
as semantic objects
Trega tiles: ean size colour sbk hb
TregaTiles (supplier: Trega)
  ean: EAN-Code
  size: String
  colour: String
  sbk: SBK-Concept
  hb: HB-Concept
3. Plan classification to ontologies based on the
profile data structure (1)
[Figure: the TregaTiles schema (supplier Trega; ean: EAN-Code, size: String, colour: String, sbk: SBK-Concept, hb: HB-Concept) is an instance (in) of the class ProductProfile, which has the links field (Domain), supplier (Company), prodid (ProductCode), and group (Perspective)]
Schema for all ontologies
[Figure: classes Perspective, Concept, ConceptAttribute, Lexical, String, and Language, with the links contains, relationship, denotation, label, language, and attributeOf]
Ontologies of different perspectives are distinguishable via ‘perspective’.
3. Plan classification to ontologies based on the
profile data structure (2)
[Figure: ProductProfile with links field (Domain), supplier (Company), prodid (ProductCode), and group (Perspective); a field is linked to a ConceptAttribute by ATTRIBUTECLASSIFY, and a ConceptAttribute is attributeOf a Concept]
4. Deducing product classifications
[Figure: tuple.1 of TregaTiles has prodid 123-.. and sbk group C1001, a concept of the SBK-Code perspective labelled „tegel“, „tile“, and „Fliese“; the link 123-.. classifiedAs C1001 is deduced]
forall x//ProductCode, t//ProductProfile, C/Concept
(t [prodid] x) and (t [group] C) ==> (x classifiedAs C)
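The deduction rule for product classifications can be sketched in Python as a join over facts. This is only an illustration under simplifying assumptions: facts are stored as (subject, link, object) triples and type checks on x, t, and C are omitted. The function name `deduce_classifications` and the second tuple are hypothetical.

```python
# Facts as (subject, link, object) triples; tuple.2 is hypothetical sample data.
facts = {
    ("tuple.1", "prodid", "123-.."),
    ("tuple.1", "group", "C1001"),
    ("tuple.2", "prodid", "456-.."),  # no group link, so nothing is deduced
}


def deduce_classifications(facts):
    """Apply: (t prodid x) and (t group C) ==> (x classifiedAs C)."""
    prodids = {(t, x) for t, link, x in facts if link == "prodid"}
    groups = {(t, c) for t, link, c in facts if link == "group"}
    return {
        (x, "classifiedAs", c)
        for t1, x in prodids
        for t2, c in groups
        if t1 == t2  # both links must start at the same profile tuple
    }


derived = deduce_classifications(facts)
print(derived)
```

The product code inherits the concept that its profile tuple is grouped under, so a supplier only has to state the grouping attribute once per tuple.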
4. Deducing attribute classifications
[Figure: the size field (String) of TregaTiles is linked by ATTRIBUTECLASSIFY to concept attribute A001 („area“); the value „10x10“ of tuple.1 is then classifiedAs A001]
forall CA/ConceptAttribute f/Proposition!attribute
(exists F/ProductProfile!field (F ATTRIBUTECLASSIFY CA) and (f in F)) ==> (f classifiedAs CA)
Example classification
[Figure: a domain-specific ontology with concept C1001 ("tile") and concept attributes A0001 ("area") and A0002 ("product form"), connected by attributeOf and nt links, next to a company's product catalog with profile tuple.1 of TregaTiles (supplier Trega) holding "123-.." and size "10x10". The links marked TOBECLASSIFIEDAS are stated; the classifiedAs links are deduced]
Data Warehousing Approach
[Figure: clients query the data warehouse; an integration system with metadata fills the warehouse from the sources, each of which has an extractor/monitor]
• High query performance
• Accessible any time
– even if sources are not available
• Clear separation between operational data store and analysis portion of data
– long-running analysis queries do not interfere with local processing at sources
• Extra information
– summarize (aggregate information)
– access to historical information
Data Warehousing Approach
• Warehouse maintenance (materialized view update problem)
– how to maintain warehouse in light of constant changes to sources
– 24x7 operations (no real down-time anymore)
– solution: “incremental view update algorithms”
• Warehouse integrator (challenges similar to those seen in mediation research)
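To illustrate what an incremental view update looks like, here is a minimal Python sketch for one simple case, a SUM aggregate per group. The slides do not name a specific algorithm, so this is just one standard instance; the class name `SalesBySupplierView` and the sample changes are hypothetical.

```python
from collections import defaultdict


class SalesBySupplierView:
    """Materialized view: total sales amount per supplier."""

    def __init__(self):
        self.totals = defaultdict(float)

    def apply_delta(self, op, supplier, amount):
        """Fold one source change into the view without recomputing it."""
        if op == "insert":
            self.totals[supplier] += amount
        elif op == "delete":
            self.totals[supplier] -= amount
        else:
            raise ValueError(f"unknown operation: {op}")


view = SalesBySupplierView()
for delta in [("insert", "Trega", 100.0),
              ("insert", "Trega", 50.0),
              ("delete", "Trega", 100.0)]:
    view.apply_delta(*delta)
print(dict(view.totals))
```

Each change costs constant time, so the warehouse can stay online while the sources keep changing. Aggregates such as MIN or MAX need more bookkeeping, because a delete may remove the current extreme value.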
Online Analytical Processing
(OLAP)
How to make long-running analytical queries more efficient
• Pre-compute frequently used portions of queries and materialize them
– which views to compute (space-time trade-off)
• Extend SQL with new operators for OLAP (e.g., cube, roll-up, drill-down)
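The cube operator mentioned above computes an aggregate for every subset of the grouping dimensions. A minimal Python sketch, assuming a tiny hypothetical fact table with two dimensions and using the marker "ALL" for a rolled-up dimension:

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact table: (supplier, colour, amount).
facts = [("Trega", "white", 10), ("Trega", "grey", 5), ("Other", "white", 7)]
dims = ("supplier", "colour")


def cube(rows):
    """Sum amount for every subset of the dimensions (the CUBE operator)."""
    result = defaultdict(int)
    for supplier, colour, amount in rows:
        values = {"supplier": supplier, "colour": colour}
        for k in range(len(dims) + 1):
            for group in combinations(dims, k):
                # Dimensions left out of the grouping are rolled up to "ALL".
                key = tuple(values[d] if d in group else "ALL" for d in dims)
                result[key] += amount
    return dict(result)


totals = cube(facts)
print(totals[("Trega", "ALL")])  # roll-up over colour
print(totals[("ALL", "ALL")])    # grand total
```

With d dimensions the cube holds 2^d groupings, which is why choosing which views to materialize is a space-time trade-off.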
Knowledge Discovery
• Extraction of implicit, previously unknown and potentially useful knowledge from data
• Traditionally studied in AI, now multidisciplinary (including DBT, Data Visualization)
• Data Mining: combine knowledge discovery with efficient implementation to handle very large datasets.
Data Mining
• Build a model of the real world
• Describe patterns and relationships
– guide business decisions
– e.g., determine layout of shelves in grocery store
• Make predictions
– e.g., which recipients to include on a mailing list
• Not magic, still need to understand data and statistics
Data Mining Models
• Classification and regression (predicting)
– e.g., neural networks, rules, decision trees
• Time series (forecasting)
• Clustering (description)
– finding clusters that consist of similar records
• Association analysis, sequence discovery (describe behavior)
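As a small illustration of association analysis from the list above, the following Python sketch computes the two usual rule measures, support and confidence. The baskets are hypothetical sample data and the function names are chosen for this example only.

```python
# Hypothetical market baskets.
baskets = [
    {"tiles", "grout"},
    {"tiles", "grout", "sealant"},
    {"tiles"},
    {"grout"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    return sum(itemset <= b for b in baskets) / len(baskets)


def confidence(lhs, rhs, baskets):
    """Estimated P(rhs | lhs) for the rule lhs => rhs."""
    return support(lhs | rhs, baskets) / support(lhs, baskets)


print(support({"tiles", "grout"}, baskets))
print(confidence({"tiles"}, {"grout"}, baskets))
```

A rule such as tiles => grout is reported only if both measures exceed chosen thresholds; setting those thresholds is part of the understanding of data and statistics that data mining still requires.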
• Assumption: “Brute-force” does not scale
• Relevant information rather than “everything first, process later”
• Light-weight crawler + runtime environment (JESS)
– set of CLIPS rules determines crawling behavior
– crawler migrates to Web sites (remote execution)
– returns with selected pages in compressed form
• Efficient crawling techniques
– breadth-first, depth-first not efficient
– visit as many “hot” pages in as little time as possible
– URL ordering
– importance metrics (e.g., back link count, page rank, location metric)
• Web statistics
– size doubles every 12 months
– about 1 billion pages by 2000 (index ~5.5 TB)
– assume index age < 30 days, crawl and download data at 45 MB/sec (~80 million pages/day)
• Inferencing
– extract and establish relationships that exist (e.g., among web documents) to infer new knowledge not explicitly stated
• Improved clustering & association rules based techniques
– incremental
– parallel execution
– mostly library data
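The crawl-rate figures quoted in the Web statistics above (45 MB/sec, about 80 million pages/day, about 1 billion pages, index age under 30 days) can be checked with a few lines of Python. The only assumption added here is that MB means 10^6 bytes; the variable names are chosen for this example.

```python
SECONDS_PER_DAY = 24 * 60 * 60
BYTES_PER_MB = 1_000_000           # assumption: decimal megabytes

download_rate = 45 * BYTES_PER_MB  # bytes per second, from the slide
pages_per_day = 80_000_000         # from the slide
total_pages = 1_000_000_000        # from the slide

bytes_per_day = download_rate * SECONDS_PER_DAY
implied_page_size = bytes_per_day // pages_per_day  # bytes per page
days_for_full_crawl = total_pages / pages_per_day

print(implied_page_size, days_for_full_crawl)
```

Under this assumption the two rates imply an average download of roughly 48.6 KB per page, and one full pass over a billion pages takes 12.5 days, which fits within the 30-day index age.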
• Mediation, DW, and OLAP
– Focus on integrating heterogeneous data
– Methodology to overcome semantic heterogeneity problem (semantic context mediation)
– Developing and building a hybrid integration architecture (warehouse + on-demand querying)
– Revisit work on WWW based information browsing tools
• Knowledge discovery
– Knowledge discovery on WWW and library data to improve searching
– Key ingredient is fully indexed and annotated repository to reflect relationships uncovered during mining phase
– Mobile crawler to collect Web pages efficiently (download pages related to special topic)
Integration of Information
• (1) A Super Global Database!
– obsolete before it is established
• (2) Distributed, free standing databases (today)
– browsing, surfing, getting lost
• (3) Distributed databases with a single standard allowing interoperation (this is not XML!)
– standards follow progress, cannot lead it
• (4) Distributed databases with identified or published formats (this is XML)
– requires rapid adaptation to keep up with resources
• (5) = (4) + Mediators
– keep up with resources in an economy of scale
Applications
• Intranets
– Enterprise data integration
– web-site construction
• World-wide web:
– comparison shopping (Netbot, Junglee)
– portals integrating data from multiple sources
– XML integration
• Science & culture
– Medical genetics: integrating genomic data
– Astrophysics: monitoring events in the sky
– Environment: Puget Sound Regional Synthesis
Model
– Culture: uniform access to all the cultural databases
produced by countries in Europe
What does a data integration system look like?
[Figure: an application works against a global schema defined over several local schemas; below it, either a data warehouse is loaded from the sources, or a query mediator reaches the sources through wrappers]
What are some data integration
challenges?
• Heterogeneity of sources (intensional and extensional levels)
• Limitations in the mechanisms for accessing the sources
• Materialized vs. virtual integration
• Data extraction, cleaning, and reconciliation
• How to process updates expressed on the global schema, and updates expressed on the sources
• The querying problem: How to answer queries expressed on the global schema
• The modeling problem: How to model the global schema, the sources, and the relationships between the two
The querying problem
• Each query is expressed in terms of the global
schema, and the mediator must reformulate the query in terms of a set of queries at the sources
• The crucial step is deciding the query plan, i.e., how to decompose the query into a set of sub queries to the sources
• The computed sub queries are then shipped to the sources, and the results are assembled into the final answer
Example Scenario
Source 1: http://www.amazon.com, s1(Title, Author, Subject)
Source 2: http://www.book-a-million.com, s2(ISBN, Title, Publisher)
http://……...
Query: retrieve the titles and subjects of all the books written by Leon Sterling and published by MIT Press
Subquery for source 1 (Amazon.com: titles, authors, subjects):
  SELECT title, subject FROM amazon.com
  WHERE author = “Sterling”
Subquery for source 2 (Book-a-million.com: ISBN, titles, publisher):
  SELECT title FROM book-a-million.com WHERE publisher = MIT
Combined answer:
  SELECT title, subject
  FROM book-a-million.com, amazon.com
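The decomposition in this scenario can be sketched as a toy mediator in Python. Only the two source schemas come from the example; the rows, the function names, and the choice of the title as join key are assumptions made for illustration.

```python
# Hypothetical source contents; only the schemas come from the example.
s1 = [  # s1(Title, Author, Subject)
    {"title": "Book A", "author": "Sterling", "subject": "Logic"},
    {"title": "Book B", "author": "Sterling", "subject": "Agents"},
]
s2 = [  # s2(ISBN, Title, Publisher)
    {"isbn": "isbn-1", "title": "Book A", "publisher": "MIT Press"},
    {"isbn": "isbn-2", "title": "Book C", "publisher": "MIT Press"},
]


def query_source1(author):
    """Subquery shipped to source 1: titles and subjects by author."""
    return [(r["title"], r["subject"]) for r in s1 if r["author"] == author]


def query_source2(publisher):
    """Subquery shipped to source 2: titles by publisher."""
    return {r["title"] for r in s2 if r["publisher"] == publisher}


def mediator(author, publisher):
    """Assemble the final answer by joining the partial results on title."""
    published = query_source2(publisher)
    return [(t, s) for t, s in query_source1(author) if t in published]


answer = mediator("Sterling", "MIT Press")
print(answer)
```

Neither source can answer the query alone: source 1 lacks the publisher and source 2 lacks the author and subject, so the mediator has to combine them. Joining on title is fragile in practice because titles are not unique identifiers.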
Quality in query answering
• The data integration system should be designed in such a way that suitable quality criteria are met.
• Here, we concentrate on:
– Soundness: the answer to queries includes nothing but the truth
– Completeness: the answer to queries includes the whole truth
• We aim at the whole truth, and nothing but the truth. But what the truth is depends on the approach
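Read as set inclusions, the two criteria are easy to state in code. The Python sketch below assumes an ideal answer set is known, which in a real system it is not; the data and function names are hypothetical.

```python
def is_sound(answer, truth):
    """Sound: every returned tuple is true (nothing but the truth)."""
    return set(answer) <= set(truth)


def is_complete(answer, truth):
    """Complete: every true tuple is returned (the whole truth)."""
    return set(truth) <= set(answer)


truth = {("Book A", "good"), ("Book B", "poor")}  # hypothetical ideal answer
answer = {("Book A", "good")}                     # what the system returned

print(is_sound(answer, truth), is_complete(answer, truth))
```

Here the answer is sound but not complete, the typical situation when sources cover only part of the domain.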
Modeling
[Figure: a mapping relates the global schema to the source structures of source 1 and source 2]
Modeling Problem
• How do we model the global schema (structured vs. semistructured)?
• How do we model the sources (conceptual and structural level)?
• How do we model the relationship between the global schema and the sources?
– Are the sources defined in terms of the global schema (this approach is called source-centric, or local-as-view, or LAV)?
– Is the global schema defined in terms of the sources (this approach is called global-schema-centric, or global-as-view, or GAV)?
Example Scenario
Global schema: book(Title,Year,Author), european(Author), review(Title,Review)
Source 1: r1(Title,Year,Author), since 1960, European authors
Source 2: r2(Title,Review), since 1990
Query: title and review of books in 1998?
{(T,R) | ∃A. book(T,1998,A) ∧ review(T,R)}
Local As View
[Figure: LAV, each source is described as a view over the global schema]
Query Processing in LAV
Global schema
book(Title,Year,Author)
european(Author)
review(Title,Review)
views over the global schema
r1(T,Y,A) → {(T,Y,A) | book(T,Y,A) ∧ european(A) ∧ Y ≥ 1960}
r2(T,R) → {(T,R) | book(T,Y,A) ∧ review(T,R) ∧ Y ≥ 1990}
The query
{(T,R) | book(T,1998,A) ∧ review(T,R)}
is processed by re-expressing the atoms of the global schema in terms of atoms at the sources.
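One rewriting of this query that is sound with respect to the two view definitions is {(T,R) | r1(T,1998,A) ∧ r2(T,R)}: every tuple of r1 is a book tuple and every tuple of r2 is a review tuple. The Python sketch below evaluates that rewriting over hypothetical source contents; it illustrates the result of the reformulation, not a general LAV rewriting algorithm.

```python
# Hypothetical extensions of the two sources.
r1 = {("Book A", 1998, "Author X"), ("Book B", 1975, "Author Y")}  # r1(T, Y, A)
r2 = {("Book A", "good"), ("Book C", "poor")}                      # r2(T, R)


def answer_query(r1, r2, year=1998):
    """Evaluate the rewriting {(T,R) | r1(T,year,A) and r2(T,R)}."""
    titles = {t for (t, y, a) in r1 if y == year}
    return {(t, r) for (t, r) in r2 if t in titles}


print(answer_query(r1, r2))
```

The rewriting is sound but can be incomplete: a 1998 book by a non-European author is never reported by source 1, so its review in source 2 cannot be matched to a year.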
Query Processing in LAV
Answering queries in LAV is like solving a mystery case:
• Sources represent reliable witnesses
• Witnesses know part of the story, and source data represent what they know
• We have an explicit representation of what the witnesses know
• We have to solve the case (answering queries) based on the information we are able to gather from the
witnesses
Global As View
[Figure: GAV, each element A of the global schema is defined as a view over the sources]
The data of A are taken from source 1 and …
Global-as-view – Example
Global schema
book(Title,Year,Author)
european(Author)
review(Title,Review)
views over the sources
book(T,Y,A) → {(T,Y,A) | r1(T,Y,A)}
european(A) → {(A) | r1(T,Y,A)}
Query processing in GAV
The query {(T,R) | book(T,1998,A) ∧ review(T,R)} is
processed by means of unfolding, i.e., by expanding the atoms according to their definitions, so as to come up with source relations:
book(T,1998,A) ∧ review(T,R)
  unfolding ↓
r1(T,1998,A) ∧ r2(T,R)
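Unfolding amounts to substituting each global atom by its definition. A minimal Python sketch, assuming the simplest case in which each global relation maps to one source relation with the same argument positions; the mapping of review to r2 is taken from the unfolding figure, and the names `mapping` and `unfold` are hypothetical.

```python
# GAV mapping: each global relation is defined by one source relation.
# Argument positions are kept identical here to keep the sketch simple;
# european(A), which projects r1 onto one column, is left out for that reason.
mapping = {"book": "r1", "review": "r2"}


def unfold(query):
    """Replace every global atom (name, args) by its source definition."""
    return [(mapping[name], args) for name, args in query]


query = [("book", ("T", 1998, "A")), ("review", ("T", "R"))]
print(unfold(query))
```

No reasoning over view definitions is needed, which is why query reformulation in GAV looks easier than in LAV.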
Query processing in GAV
• We do not have any explicit representation
of what the witnesses know
• All the information that the witnesses can
provide has been compiled into an
“investigation report” (source descriptions =
the global schema, and the mapping)
• Solving the case (answering queries) means
basically looking at source descriptions
GAV and LAV: Pros & Cons
• Local-as-view
– Quality depends on how well we have characterized the sources
– High modularity and reusability (if the global schema is well designed, when a source changes, only its definition is affected)
– Query processing needs reasoning (query reformulation complex)
• Global-as-view
– Quality depends on how well we have compiled the sources into the global schema through the mapping
– Whenever a source changes or a new one is added, the global schema needs to be reconsidered
– Query processing can be based on some sort of unfolding (query reformulation looks easier)
Conclusions
• Data integration applications have to cope with incomplete information, whatever the modeling approach
• Some techniques have already been developed, but several open problems remain (in LAV, GAV, and GLAV)
• Many other problems not addressed here are relevant in data integration (e.g., how to construct the global schema, how to deal with inconsistencies, how to cope with updates, ...)
• In particular, given the complexity of sound and complete query answering, it is interesting to look at methods that accept lower-quality
answers, trading accuracy for efficiency
[Figure: a mediated logistics view over local databases with a local logistics planning view and a local logistics operations view. Layers: information interface layer, information management layer, information gathering layer. Components: user agent, executive agent, communication among views, real-time information processing and filtering, data/knowledge refinement, fusion, and certification, information repository, Internet interface, text analysis, image analysis, database wrapper, simulation interface. Active view agents: mediators, facilitators, real-time agents, knowledge rovers, field agents, information curators]