Catalogs and Data Integration for E-Commerce Applications

(1)

Catalogs and Data Integration for E-Commerce Applications

(2)

On-line catalogues

Issues

• Advantages?

• Product information

• Information coupling

• security

• purchase process

• Buyers catalogue vs. Sellers catalogue

• Data integration

(3)

Advantages

• Up-to-date information

• Directed search possibilities

• More information and multi-media information

• Coupling with ordering and stock info

• Personalisation of information

• Cost reduction for production

• Configure or specify products

• Intelligent assistance

(4)

Products in catalogs

1. Uniquely identifiable products

• basis of most catalogs

2. Select values of fixed attributes

• E.g. Colour of clothes, processor type of PC

3. Configurable

• E.g. PC, car, ...

For situations 1 and 2, product databases containing all possible products are present.

(5)

Product data

• Identifying the product (article number(s), name)

• Technical data

• design, use, norms (ISO, ...),...

• Commercial data

• Prices, delivery conditions,...

• Logistical data

(6)

Product profiles

• Not all parties are interested in the same attributes of the product. E.g. a plumber is interested in the size of a bathtub and fixtures, the user in the colour.

• Industry sectors and companies have their own product codes. E.g. for bolts: EAN, ISO, Borstlap, ...

• Problem: different companies identify (classify) their products in different ways. E.g. tiles can be ceramic products or floor/wall covering.

(7)

Commercially sensitive data

• Price information

• Discount availability

• Transparent prices are nice for buyers but not for sellers

• Availability data

• possibilities:

• Stocked article (indicates type of article)

• Article in stock

(8)

Security

• Separate catalogue data from product data base

• If personalized data is generated, where is the code stored?

• Security vs. up-to-date information

• Catalogue maintenance (who, when,…?)

(9)

Order process

• Searching the catalogue is part of the purchasing process

• The design of this process should indicate who can search the catalogue, which information is available, for which products ordering authorization is needed, etc.

• B2C → simple

– Consumer does not have to integrate with back-end

– Consumer can decide himself

• B2B → complex

– Both sides need to integrate with back-end systems

– Purchasing process regulated by buying company

(10)

Who has the responsibility?

Should the catalog and the ordering process be under the responsibility of:

1. The supplier

2. The customer

3. A broker

(11)

(Customer specific) catalogues with suppliers

[Figure: purchasers at the customer reach, via the Internet, the separate catalogues of supplier 1, supplier 2, ..., supplier n, which are held at the suppliers.]

(12)

(Customer specific) catalogues with suppliers

Advantage:

• Supplier can manage the catalogue efficiently

• Supplier can add functions for each client

Disadvantage:

• Supplier specifies products

(13)

Purchasing catalogue with customer

[Figure: suppliers 1..n each keep product data and a catalogue; updates flow over the Internet into a single purchase catalogue at the customer, which the purchasers use.]

(14)

Purchasing catalogue with customer

Advantage:

• Uniform search and ordering process for customer

• Customer determines which products can be shown

Disadvantage:

• More difficult to maintain for supplier

• More difficult to keep info up-to-date and complete

(15)

Catalogue with broker

[Figure: suppliers 1..n each keep product data and a catalogue; updates flow into a catalogue held by the broker, which the purchasers at the customer consult.]

(16)

Catalogue with broker

Advantage:

• Costs are shared

• Standardisation

Disadvantage:

• Extra party in the process

• Needs data integration

(17)

The multi catalogue/multi view problem: Data integration

[Figure: the catalogues of suppliers 1..n on one side and customers 1..m on the other, with a question mark between them.]

(18)

Information Management

Integrating catalogs is an instance of a more general problem: managing data from many heterogeneous, autonomous sources.

(19)

• Search and Collect

• Index and Organise

• Customise and Redistribute

(20)

[Figure: information sources: File Systems, Digital Libraries, Databases, World Wide Web, Image Banks, Email Systems.]

(21)

• Vast collections

• Composite multimedia components

• Heterogeneous

• Dynamic

• Autonomous

• Different interfaces

• Different data representations

(22)

• Management of Heterogeneous Information

– Information Integration

– Data Warehousing

– Online Analytical Processing

• Knowledge Discovery

– Web Crawling

(23)

[Figure: data sources: World Wide Web, Digital Libraries, Scientific Databases, Personal Databases.]

Providing uniform (sources transparent to user) access to (query and eventually updates to) multiple, autonomous (can’t affect behavior of sources), heterogeneous (different models and schemas) data sources.

(24)

What are some data integration

challenges?

• Freshness of data

• Query response time

• Availability/reliability of sources

• Autonomy of sources

• Heterogeneities at various levels of abstraction

• Two approaches

• Mediation (virtual, query-driven, lazy)

• Data Warehousing (materialized, eager)

(25)

Mediation Approach

[Figure: layered architecture. Top: user interfaces/applications. Middle: mediators. Bottom: wrappers over information sources, and an extractor over the World Wide Web.]

(26)

Mediation Approach

• Information fetched, translated, filtered, merged on-the-fly in response to a query

• Good for:

– rapidly changing information sources

– clients with unpredictable needs

– searching over vast amounts of data

• But

– inefficiency, delay in query processing

– expensive filtering and merging
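The query-driven idea can be sketched in a few lines. This is a minimal illustration, not any particular system: the `Wrapper` and `Mediator` classes, the field mappings and the sample records are all hypothetical. Each wrapper translates its source's native records into a common format at query time, and the mediator merges the translated results.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]


class Wrapper:
    """Translates one source's native records into a common record format."""

    def __init__(self, fetch: Callable[[], Iterable[Record]], field_map: Dict[str, str]):
        self.fetch = fetch          # pulls the source's current data at query time
        self.field_map = field_map  # native field name -> common field name

    def query(self, predicate: Callable[[Record], bool]) -> List[Record]:
        translated = (
            {self.field_map[k]: v for k, v in rec.items() if k in self.field_map}
            for rec in self.fetch()
        )
        return [rec for rec in translated if predicate(rec)]


class Mediator:
    """Answers a query by asking every wrapper on the fly and merging the results."""

    def __init__(self, wrappers: List[Wrapper]):
        self.wrappers = wrappers

    def query(self, predicate: Callable[[Record], bool]) -> List[Record]:
        merged: List[Record] = []
        for wrapper in self.wrappers:
            merged.extend(wrapper.query(predicate))
        return merged


# Two hypothetical sources with different schemas (sample data is made up).
source_a = [{"Pid": 341, "Name": "Ge", "price": 3.41}]
source_b = [{"Pnr": 89, "nam": "VA", "cost": 2.10}]

mediator = Mediator([
    Wrapper(lambda: source_a, {"Pid": "id", "Name": "name", "price": "price"}),
    Wrapper(lambda: source_b, {"Pnr": "id", "nam": "name", "cost": "price"}),
])

cheap = mediator.query(lambda r: r["price"] < 3.0)
```

Because nothing is materialized, a record added to a source is visible to the very next query; the price is that every query pays the full cost of fetching, translating and merging.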

(27)

• Common model for managing heterogeneous data

– Object Exchange Model (OEM)

• Information source wrapping (wrapper)

• data and query translation

• Extend query capabilities for sources with limited capabilities

– Toolkit for automatically generating wrappers

• Multi source query processing and information fusion

– Declaratively specify how mediators collect and process information

• Browsing and exploring information sources through WWW

– Format OEM objects as a web of hypertext documents

– Traverse hyperlinks to explore nested structure and contents

(28)

Semantic Integration

• So far, no efficient solution to overcoming semantic heterogeneities

– Detect overlap and remove inconsistencies in representation of similar real-world objects in different schemas

– Result of independent creation of schemas

• Need external domain (semantic)

(29)

Application in E-commerce broker

• MeMo project

• Mediating between partners in construction

• Partners from Spain, Germany, Holland

(30)

Idea: Introduce a broker to facilitate communication

[Figure: the intended communication between a member of company A and a member of company B passes through a broker, which holds law, standards, codes, and a memory of business.]

(31)

[Figure: companies A and B each export business data (export db, product profiles); the business data is imported into a business data repository holding a shared product ontology, company profiles and the export db schema.]

Share business data?

We assume members of a market are willing to share business data, esp. company profiles and product profiles. The interest is founded in their desire to do business and find partners.

Members include other data providers, e.g. financial data, product group codes. They are either trusted third parties (like chambers of commerce) or companies who make profit from facilitating business (e.g. banks).

(32)

Architecture of the MEMO broker

[Figure: data providers (companies, chambers of commerce, banks & insurance companies) define data sources; business data (product data, company profiles, finance infos) is loaded via JDBC, ODBC, XML etc. by the Business Data Integrator into the repository. The market owner defines ontologies. The market user connects with a web browser through an HTTP proxy & firewall to the repository proxy and the service broker. Services (Search Engine, Negotiation_Manager, Workflow_Manager) are registered and called through a service table that maps each service to a URL: op1 to server1/op1, op2 to server1/op2, op3 to server2/op3.]

(33)

The mismatch of product profiles and ontologies

• search engine: topic-based access to information about products

• heterogeneous product profiles available from companies

• multiple ontologies are used to index these profiles in the repository

product ontology: Stone material, floor tile

product profiles:

Pid   Name   size   price
341   “Ge”   30     3,41
342   “Ka”   35     3,69

Pnr   nam    descr
089   “VA”   “Use this ….”
342   “BO”   “Our best …”

(In the figure, the links from both profiles to the ontology are marked with “?”.)

(34)

From data structures to semantic objects

Strategy

1. Represent profiles as semantic objects

2. Represent the profile data structure as semantic objects

3. Plan classification to ontologies based on the profile data structure

4. Deduce product and attribute classification

(35)

1. Represent profiles as semantic objects

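Trega tiles:

ean      size    colour   sbk     hb
123-..   10x10   white3   c1001   hb876

Column roles: ean is the product id; size and colour are describing attributes; sbk and hb are grouping attributes.

[Figure: the row becomes the object tuple.1, an instance (in) of TregaTiles, with labelled links ean to 123-.., size to „10x10“, colour to „white3“, sbk to c1001 and hb to hb876.]

Note: suppliers use their individual profile schemas!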
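As a rough illustration of step 1, a profile row can be turned into an object with one labelled link per attribute. The sketch below stores links as (subject, label, object) triples; the function name `profile_to_objects` and the triple encoding are assumptions made for this example, not the representation used in the project.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, label, object)


def profile_to_objects(profile: str, rows: List[Dict[str, str]]) -> Set[Triple]:
    """Turn each row of a profile table into an object with labelled links."""
    triples: Set[Triple] = set()
    for i, row in enumerate(rows, start=1):
        obj = f"tuple.{i}"
        triples.add((obj, "in", profile))    # the tuple is an instance of the profile
        for attr, value in row.items():
            triples.add((obj, attr, value))  # one labelled link per attribute value
    return triples


# The Trega tiles row from the slide.
trega_rows = [
    {"ean": "123-..", "size": "10x10", "colour": "white3", "sbk": "c1001", "hb": "hb876"},
]
objects = profile_to_objects("TregaTiles", trega_rows)
```

The same function works for any supplier's table, which matters here because every supplier brings its own column names.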

(36)

2. Represent the profile data structure as semantic objects

Trega tiles: ean size colour sbk hb

[Figure: the schema object TregaTiles has the fields ean (EAN-Code), size (String), colour (String), sbk (SBK-Concept) and hb (HB-Concept), and a supplier link to Trega.]

(37)

3. Plan classification to ontologies based on the profile data structure (1)

[Figure: TregaTiles (fields ean: EAN-Code, size: String, colour: String, sbk: SBK-Concept, hb: HB-Concept; supplier: Trega) is an instance (in) of ProductProfile. ProductProfile has the links field (to Domain), supplier (to Company), prodid (to ProductCode) and group (to Perspective).]

(38)

Schema for all ontologies *

[Figure: metaschema with the classes Perspective, Concept, ConceptAttribute, Lexical, String and Language, and the links contains, attributeOf, relationship, denotation, label and language.]

Ontologies of different perspectives are distinguishable via ‘perspective’.

(39)

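3. Plan classification to ontologies based on the profile data structure (2)

[Figure: the ProductProfile schema (links field to Domain, supplier to Company, prodid to ProductCode, group to Perspective) next to the ontology schema (ConceptAttribute attributeOf Concept); the link ATTRIBUTECLASSIFY connects profile fields to ConceptAttribute.]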

(40)

4. Deducing product classifications

[Figure: tuple.1 of TregaTiles has the product id (ean, a prodid link) 123-.. and the group (sbk, an SBK-Code) C1001; the concept C1001 is denoted „tegel“, „tile“ and „Fliese“; the link 123-.. classifiedAs C1001 is deduced.]

forall x//ProductCode, t//ProductProfile, C/Concept
    (t [prodid] x) and (t [group] C)
    ==> (x classifiedAs C)
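The deduction rule for products can be mimicked over plain dictionaries. This is only a sketch: the function name is hypothetical, and the role assignment (ean as prodid, sbk as group) is taken from the TregaTiles example rather than looked up in a schema as the rule language would do.

```python
from typing import Dict, List, Set, Tuple


def deduce_product_classification(
    rows: List[Dict[str, str]],
    prodid_field: str,
    group_fields: List[str],
) -> Set[Tuple[str, str]]:
    """Apply: (t [prodid] x) and (t [group] C) ==> (x classifiedAs C)."""
    classified: Set[Tuple[str, str]] = set()
    for t in rows:
        x = t[prodid_field]
        for group in group_fields:
            classified.add((x, t[group]))  # product x is classified as concept C
    return classified


rows = [{"ean": "123-..", "size": "10x10", "colour": "white3", "sbk": "c1001", "hb": "hb876"}]

# Assumption: 'ean' plays the prodid role and 'sbk' the group role.
result = deduce_product_classification(rows, "ean", ["sbk"])
```

Here `result` holds the single pair `("123-..", "c1001")`, the classification shown as deduced in the figure.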

(41)

4. Deducing attribute classifications

[Figure: the size field of TregaTiles (type String) is linked by ATTRIBUTECLASSIFY to the concept attribute A001, denoted „area“; the size link of tuple.1 with value „10x10“ is an instance (in) of that field, and its classifiedAs link to A001 is deduced.]

forall CA/ConceptAttribute f/Proposition!attribute
    (exists F/ProductProfile!field (F ATTRIBUTECLASSIFY CA) and (f in F))
    ==> (f classifiedAs CA)
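The attribute rule works one level up: the classification is declared once per profile field, and every attribute value stored under that field inherits it. A minimal sketch, assuming a hypothetical dictionary `attribute_classify` that stands in for the ATTRIBUTECLASSIFY links:

```python
from typing import Dict, List, Set, Tuple


def deduce_attribute_classification(
    profile: str,
    rows: List[Dict[str, str]],
    attribute_classify: Dict[Tuple[str, str], str],
) -> Set[Tuple[str, str, str, str]]:
    """Apply: (F ATTRIBUTECLASSIFY CA) and (f in F) ==> (f classifiedAs CA).

    Returns (tuple id, field, value, concept attribute) for every classified link.
    """
    deduced = set()
    for i, row in enumerate(rows, start=1):
        for field, value in row.items():
            concept_attribute = attribute_classify.get((profile, field))
            if concept_attribute is not None:  # only fields with a declared mapping
                deduced.add((f"tuple.{i}", field, value, concept_attribute))
    return deduced


rows = [{"ean": "123-..", "size": "10x10", "colour": "white3"}]

# Declared once per field: TregaTiles.size is classified as A001 ("area").
mapping = {("TregaTiles", "size"): "A001"}
links = deduce_attribute_classification("TregaTiles", rows, mapping)
```

Fields without a declared mapping (ean and colour in this sample) simply yield no classification.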

(42)

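Example classification

[Figure: a domain-specific ontology (concept C1001 "tile" with the concept attributes A0001 "area" and A0002 "product form", connected by attributeOf and nt links) next to a company's product catalog (profile TregaTiles of supplier Trega with a size field of type String; tuple.1 with product "123-.." and size "10x10"). The profile field carries a TOBECLASSIFIEDAS link to the ontology; the classifiedAs link of the instance is marked "this classification is deduced!".]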

(43)

Data Warehousing Approach

[Figure: sources at the bottom, each with an extractor/monitor; an integration system loads the data warehouse and its metadata; clients query the data warehouse.]

(44)

Data Warehousing Approach

• High query performance

• Accessible any time

– even if sources are not available

• Clear separation between operational data store and analysis portion of data

– long-running analysis queries do not interfere with local processing at sources

• Extra information

– summarize (aggregate information)

– access to historical information

(45)

Data Warehousing Approach

• Warehouse maintenance (materialized view update problem)

– how to maintain warehouse in light of constant changes to sources

– 24x7 operations (no real down-time anymore)

– solution: “incremental view update algorithms”

• Warehouse integrator (challenges similar to those seen in mediation research)
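To make the idea of incremental view update concrete, here is a small sketch of one simple case: a materialized SUM/COUNT per key that is adjusted from source deltas instead of being recomputed. The class name and the delta format are assumptions for this example; the algorithms the slide refers to cover far more general views.

```python
from collections import defaultdict
from typing import Dict, Tuple


class MaterializedTotals:
    """Keeps COUNT(*) and SUM(amount) per key, updated from source deltas."""

    def __init__(self) -> None:
        self.total = defaultdict(float)
        self.count = defaultdict(int)

    def apply(self, op: str, key: str, amount: float) -> None:
        """Apply one change reported by a source monitor ('insert' or 'delete')."""
        sign = {"insert": 1, "delete": -1}[op]
        self.total[key] += sign * amount
        self.count[key] += sign
        if self.count[key] == 0:  # drop groups that became empty
            del self.total[key], self.count[key]

    def snapshot(self) -> Dict[str, Tuple[int, float]]:
        return {k: (self.count[k], self.total[k]) for k in self.count}


view = MaterializedTotals()
view.apply("insert", "tiles", 10.0)
view.apply("insert", "tiles", 5.0)
view.apply("insert", "bolts", 2.0)
view.apply("delete", "tiles", 10.0)
state = view.snapshot()
```

Each delta costs constant time, so the warehouse never has to go offline for a full reload; the sketch relies on SUM and COUNT being maintainable from deltas alone, which does not hold for every aggregate (MIN and MAX under deletions, for instance).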

(46)

Online Analytical Processing

(OLAP)

How to make long-running analytical queries more efficient

• pre-compute frequently used portions of queries and materialize them

• decide which views to compute (space-time trade-off)

• Extend SQL with new operators for OLAP (e.g., cube, roll-up, drill-down)
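What a cube operator pre-computes can be shown in plain Python: one aggregate for every subset of the grouping dimensions. The function `cube` and the sales rows below are hypothetical and only illustrate the concept; they are not a replacement for the SQL operator.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def cube(rows: List[dict], dims: Sequence[str], measure: str) -> Dict[Tuple[tuple, tuple], float]:
    """Return {(grouped dims, key values): sum of measure} for every subset of dims."""
    result = defaultdict(float)
    for size in range(len(dims) + 1):
        for group in combinations(dims, size):
            for row in rows:
                key = tuple(row[d] for d in group)
                result[(group, key)] += row[measure]
    return dict(result)


# Made-up sample data.
sales = [
    {"region": "north", "product": "tile", "amount": 10.0},
    {"region": "north", "product": "bolt", "amount": 4.0},
    {"region": "south", "product": "tile", "amount": 6.0},
]
c = cube(sales, ("region", "product"), "amount")

grand_total = c[((), ())]                    # roll-up over everything
north_total = c[(("region",), ("north",))]   # roll-up over product
```

With n dimensions the cube holds 2^n groupings, which is exactly the space-time trade-off: materialize them all and every roll-up is a lookup, or materialize a subset and compute the rest on demand.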

(47)

Knowledge Discovery

• Extraction of implicit, previously unknown and potentially useful knowledge from data

• Traditionally studied in AI, now multidisciplinary (including DBT, Data Visualization)

• Data Mining: combine knowledge discovery with efficient implementation to allow very large datasets.

(48)

Data Mining

• Build a model of the real world

• Describe patterns and relationships

• guide business decisions

– e.g., determine layout of shelves in grocery store

• make predictions

– e.g., which recipients to include on a mailing list

• Not magic, still need to understand data and statistics

(49)

Data Mining Models

• Classification and regression (predicting)

– E.g., neural networks, rules, decision trees

• Time series (forecasting)

• Clustering (description)

– finding clusters that consist of similar records

• Association analysis, sequence discovery (describe behavior)

(50)

• Assumption: “Brute-force” does not scale

• Relevant information rather than “everything first, process later”

• Light-weight crawler + runtime environment (JESS)

– set of CLIPS rules determines crawling behavior

– crawler migrates to Web sites (remote execution)

– returns with selected pages in compressed form

• Efficient crawling techniques

– breadth-first, depth-first not efficient

– visit as many “hot” pages in as little time as possible

– URL ordering

– importance metrics (e.g., back link count, page rank, location metric)
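URL ordering by an importance metric can be sketched with a priority queue. The example below uses the back link count seen so far as the metric and runs over an in-memory link graph instead of the real Web; the function name, the tie-breaking rule and the sample graph are assumptions made for illustration.

```python
import heapq
from typing import Dict, List


def crawl_order(graph: Dict[str, List[str]], seed: str, limit: int) -> List[str]:
    """Visit up to `limit` pages, always taking the frontier URL with the most
    back links seen so far (ties broken alphabetically)."""
    backlinks: Dict[str, int] = {seed: 0}
    visited: List[str] = []
    seen = set()
    heap = [(0, seed)]  # entries are (-back link count, url)
    while heap and len(visited) < limit:
        neg_count, url = heapq.heappop(heap)
        if url in seen or -neg_count != backlinks[url]:
            continue  # already visited, or a stale queue entry
        seen.add(url)
        visited.append(url)
        for target in graph.get(url, []):
            backlinks[target] = backlinks.get(target, 0) + 1
            if target not in seen:
                heapq.heappush(heap, (-backlinks[target], target))
    return visited


# Page "d" is linked from a, b and c, so it is the "hot" page.
links = {"a": ["b", "c", "d"], "b": ["d"], "c": ["d"], "d": []}
order = crawl_order(links, "a", limit=3)
```

With a budget of three pages this visits a, b and d, whereas a breadth-first crawl from a would spend its third visit on c and miss the most-referenced page.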

(51)

• Web statistics

– size doubles every 12 months

– about 1 billion pages by 2000 (index ~5.5 TB)

– assume index age < 30 days, crawl and download data at 45 MB/sec (~80 million pages/day)

• Inferencing

– extract and establish relationships that exist (e.g., among web documents) to infer new knowledge not explicitly stated

• Improved clustering & association rules based techniques

– Incremental

– Parallel execution

– Mostly library data

(52)

• Mediation, DW, and OLAP

– Focus on integrating heterogeneous data

– Methodology to overcome semantic heterogeneity problem (semantic context mediation)

– Developing and building a hybrid integration architecture (warehouse + on-demand querying)

– Revisit work on WWW based information browsing tools

• Knowledge discovery

– knowledge discovery on WWW and library data to improve searching

– Key ingredient is fully indexed and annotated repository to reflect relationships uncovered during mining phase

– Mobile crawler to collect Web pages efficiently (download pages related to special topic)

(53)

Integration of Information

• (1) A Super Global Database!

– obsolete before it is established

• (2) Distributed, free standing databases (today)

– browsing, surfing, getting lost

• (3) Distributed databases with a single standard allowing interoperation (this is not XML!)

– standards follow progress, cannot lead it

• (4) Distributed databases with identified or published formats (this is XML)

– requires rapid adaptation to keep up with resources

• (5) = (4) + Mediators

– keep up with resources in an economy of scale

(54)

Applications

• Intranets

– Enterprise data integration

– web-site construction

• World-wide web:

– comparison shopping (Netbot, Junglee)

– portals integrating data from multiple sources

– XML integration

• Science & culture

– Medical genetics: integrating genomic data

– Astrophysics: monitoring events in the sky

– Environment: Puget Sound Regional Synthesis Model

– Culture: uniform access to all the cultural databases produced by countries in Europe

(55)

What does a data integration system look like?

[Figure: an application poses a query against the global schema; the global schema is served by a data warehouse or by a mediator; below it, local schemas are exposed by wrappers over the sources.]

(56)

What are some data integration

challenges?

• Heterogeneity of sources (intensional and extensional levels)

• Limitations in the mechanisms for accessing the sources

• Materialized vs. virtual integration

• Data extraction, cleaning, and reconciliation

• How to process updates expressed on the global schema, and updates expressed on the sources

• The querying problem: How to answer queries expressed on the global schema

• The modeling problem: How to model the global schema, the sources, and the relationships between the two

(57)

The querying problem

• Each query is expressed in terms of the global schema, and the mediator must reformulate the query in terms of a set of queries at the sources

• The crucial step is deciding the query plan, i.e., how to decompose the query into a set of sub queries to the sources

• The computed sub queries are then shipped to the sources, and the results are assembled into the final answer

(58)

Example Scenario

Source 1: http://www.amazon.com, s₁(Title,Author,Subject)

Source 2: http://www.book-a-million.com, s₂(ISBN,Title,Publisher)

http://……...

(59)

Example Scenario

Retrieve the titles and subjects of all the books written by (Leon Sterling) and published by MIT PRESS

Source 1: Amazon.com (titles, authors, subjects)

SELECT title, subject FROM amazon.com WHERE author = “Sterling”

Source 2: Book-a-million.com (ISBN, titles, publisher)

SELECT title FROM book-a-million.com WHERE publisher = MIT

Assembled answer:

SELECT title, subject FROM book-a-million.com, amazon.com
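The decomposition in this scenario can be tried out with two in-memory SQLite databases standing in for the two book sources. Everything concrete here is an assumption for illustration: the table name `books`, the placeholder rows, and in particular the choice to join the partial results on title, which is simply the only attribute the two sources share.

```python
import sqlite3

# Source 1 stands in for amazon.com: titles, authors, subjects.
s1 = sqlite3.connect(":memory:")
s1.execute("CREATE TABLE books (title TEXT, author TEXT, subject TEXT)")
s1.executemany("INSERT INTO books VALUES (?, ?, ?)", [
    ("Book A", "Sterling", "Subject 1"),
    ("Book B", "Sterling", "Subject 2"),
    ("Book C", "Other", "Subject 3"),
])

# Source 2 stands in for book-a-million.com: ISBN, titles, publisher.
s2 = sqlite3.connect(":memory:")
s2.execute("CREATE TABLE books (isbn TEXT, title TEXT, publisher TEXT)")
s2.executemany("INSERT INTO books VALUES (?, ?, ?)", [
    ("isbn-1", "Book A", "MIT Press"),
    ("isbn-2", "Book B", "Other Press"),
    ("isbn-3", "Book C", "MIT Press"),
])


def answer(author: str, publisher: str):
    """Ship one sub-query to each source, then assemble the results at the mediator."""
    by_author = s1.execute(
        "SELECT title, subject FROM books WHERE author = ?", (author,)
    ).fetchall()
    by_publisher = {
        title
        for (title,) in s2.execute(
            "SELECT title FROM books WHERE publisher = ?", (publisher,)
        )
    }
    # Join on title at the mediator: keep source-1 rows whose title source 2 confirmed.
    return sorted((t, s) for (t, s) in by_author if t in by_publisher)


result = answer("Sterling", "MIT Press")
```

Neither source can answer the question alone: source 1 does not know publishers and source 2 does not know authors or subjects, so the selection is pushed down to each source and only the join happens centrally.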

(60)

Quality in query answering

• The data integration system should be designed in such a way that suitable quality criteria are met.

• Here, we concentrate on:

– Soundness: the answer to queries includes nothing but the truth

– Completeness: the answer to queries includes the whole truth

• We aim at the whole truth, and nothing but the truth. But what the truth is depends on the approach

(61)

Modeling

[Figure: Source 1 and Source 2, each with its source structure, are connected to the global schema by a mapping.]

(62)

Modeling Problem

• How do we model the global schema (structured vs. semistructured)?

• How do we model the sources (conceptual and structural level)?

• How do we model the relationship between the global schema and the sources?

– Are the sources defined in terms of the global schema (this approach is called source-centric, or local-as-view, or LAV)?

– Is the global schema defined in terms of the sources (this approach is called global-schema-centric, or global-as-view, or GAV)?

(63)

Example Scenario

Global schema: book(Title,Year,Author), european(Author), review(Title,Review)

Source 1: r₁(Title,Year,Author), since 1960, European authors

Source 2: r₂(Title,Review), since 1990

Query: Title and review of books in 1998?

{(T,R) | ∃A. book(T,1998,A) ∧ review(T,R)}

(64)

Local As View

Source

Global Schema

LAV

(65)

Query Processing in LAV

Global schema: book(Title,Year,Author), european(Author), review(Title,Review)

Views over the global schema:

r₁(T,Y,A) → {(T,Y,A) | book(T,Y,A) ∧ european(A) ∧ Y ≥ 1960}

r₂(T,R) → {(T,R) | book(T,Y,A) ∧ review(T,R) ∧ Y ≥ 1990}

The query { (T,R) | book(T,1998,A) ∧ review(T,R) } is answered by re-expressing the atoms of the global schema in terms of atoms at the sources.
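A small simulation shows what re-expressing the query buys and what it loses. The facts below are hypothetical; the two views follow the source descriptions r₁ and r₂, and the rewriting q'(T,R) :- r₁(T,1998,A), r₂(T,R) is one candidate written by hand for this example, not the output of a rewriting algorithm.

```python
# Hypothetical "real world" over the global schema (not visible to the mediator).
book = {("T1", 1998, "A1"), ("T2", 1998, "A2"), ("T3", 1985, "A1")}
european = {"A1"}
review = {("T1", "good"), ("T2", "fair"), ("T3", "dated")}

# LAV: each source is described as a view over the global schema.
# r1: books since 1960 by European authors.
r1 = {(t, y, a) for (t, y, a) in book if a in european and y >= 1960}
# r2: reviews of books since 1990 (the year itself is not exported).
r2 = {(t, r) for (t, r) in review for (t2, y, _a) in book if t2 == t and y >= 1990}

# The query over the global schema, evaluated on the hidden facts.
truth = {(t, r) for (t, y, _a) in book if y == 1998 for (t2, r) in review if t2 == t}

# A rewriting that uses only source relations: q'(T,R) :- r1(T,1998,A), r2(T,R).
rewritten = {(t, r) for (t, y, _a) in r1 if y == 1998 for (t2, r) in r2 if t2 == t}
```

On these facts the rewriting returns only ("T1", "good"), while the true answer also contains ("T2", "fair"). The rewriting is sound but not complete: r₂ hides the year and r₁ covers only European authors, so nothing the sources say can establish that T2 appeared in 1998.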

(66)

Query Processing in LAV

Answering queries in LAV is like solving a mystery case:

• Sources represent reliable witnesses

• Witnesses know part of the story, and source data represent what they know

• We have an explicit representation of what the witnesses know

• We have to solve the case (answering queries) based on the information we are able to gather from the witnesses

(67)

Global As View

[Figure: GAV. An element A of the global schema is defined over the sources: “The data of A are taken from source 1 and …”]

(68)

Global-as-view – Example

Global schema: book(Title,Year,Author), european(Author), review(Title,Review)

Views over the sources:

book(T,Y,A) → {(T,Y,A) | r₁(T,Y,A)}

european(A) → {(A) | r₁(T,Y,A)}

(69)

Query processing in GAV

book(T,1998,A) ∧ review(T,R)

unfolding ↓

r₁(T,1998,A) ∧ r₂(T,R)

The query {(T,R) | book(T,1998,A) ∧ review(T,R)} is processed by means of unfolding, i.e., by expanding the atoms according to their definitions, so as to come up with source relations.
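Unfolding is mechanical enough to sketch directly. In the example below a conjunctive query is a list of atoms, the GAV mapping replaces each global relation by the source relation that defines it, and a small evaluator runs the unfolded query. The data, the helper names and the convention that upper-case strings are variables are all assumptions of this sketch.

```python
from typing import Dict, List, Set, Tuple

Atom = Tuple[str, tuple]  # (relation name, arguments)

# GAV mapping: global relation -> source relation with the same columns.
gav = {"book": "r1", "review": "r2"}

# Query {(T,R) | book(T,1998,A) and review(T,R)}; upper-case strings are variables.
query: List[Atom] = [("book", ("T", 1998, "A")), ("review", ("T", "R"))]


def unfold(q: List[Atom], mapping: Dict[str, str]) -> List[Atom]:
    """Expand every atom according to its definition over the sources."""
    return [(mapping[rel], args) for rel, args in q]


def evaluate(q: List[Atom], db: Dict[str, Set[tuple]], head: Tuple[str, ...]) -> Set[tuple]:
    """Evaluate a conjunctive query by extending variable bindings atom by atom."""
    bindings = [{}]
    for rel, args in q:
        extended = []
        for b in bindings:
            for row in db[rel]:
                candidate = dict(b)
                ok = True
                for arg, value in zip(args, row):
                    if isinstance(arg, str) and arg.isupper():  # variable
                        if candidate.setdefault(arg, value) != value:
                            ok = False
                            break
                    elif arg != value:                          # constant
                        ok = False
                        break
                if ok:
                    extended.append(candidate)
        bindings = extended
    return {tuple(b[v] for v in head) for b in bindings}


# Hypothetical source contents.
sources = {
    "r1": {("T1", 1998, "A1"), ("T3", 1985, "A1")},
    "r2": {("T1", "good"), ("T2", "fair")},
}

unfolded = unfold(query, gav)
answers = evaluate(unfolded, sources, head=("T", "R"))
```

No reasoning about source descriptions is needed at query time: the rewriting is a plain substitution, which is why query reformulation in GAV looks easier than in LAV.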

(70)

Query processing in GAV

• We do not have any explicit representation of what the witnesses know

• All the information that the witnesses can provide has been compiled into an “investigation report” (source descriptions = the global schema, and the mapping)

• Solving the case (answering queries) means basically looking at source descriptions

(71)

GAV and LAV: Pros & Cons

• Local-as-view

• Quality depends on how well we have characterized the sources

• High modularity and reusability (if the global schema is well designed, when a source changes, only its definition is affected)

• Query processing needs reasoning (query reformulation complex)

• Global-as-view

• Quality depends on how well we have compiled the sources into the global schema through the mapping

• Whenever a source changes or a new one is added, the global schema needs to be reconsidered

• Query processing can be based on some sort of unfolding (query reformulation looks easier)

(72)

Conclusions

• Data integration applications have to cope with incomplete information, whichever modeling approach is used

• Some techniques already developed, but several open problems still remain (in LAV, GAV, and GLAV)

• Many other problems not addressed here are relevant in data integration (e.g., how to construct the global schema, how to deal with inconsistencies, how to cope with updates, ...)

• In particular, given the complexity of sound and complete query answering, it is interesting to look at methods that accept lower-quality answers, trading accuracy for efficiency

(73)

[Figure: three-layer agent architecture with an Information Interface Layer, an Information Management Layer and an Information Gathering Layer. Labels in the figure: local databases; Local Logistics Planning View, Local Logistics Operations View and Mediated Logistics View, with communication among views; executive agent, user agent and active view agents; mediators, facilitators, real-time agents, knowledge rovers, field agents and information curators; real-time information processing and filtering; data/knowledge refinement, fusion, and certification; information repository; Internet interface, text analysis, image analysis, database wrapper and simulation interface.]