• No results found

General Purpose Database Summarization

N/A
N/A
Protected

Academic year: 2021

Share "General Purpose Database Summarization"

Copied!
33
0
0

Loading.... (view fulltext now)

Full text

(1)

Table of Content

General Purpose Database Summarization

A web service architecture for on-line database summarization

R´egis Saint-Paul (speaker),

Guillaume Raschia, Noureddine Mouaddib

LINA - Polytech’Nantes - INRIA ATLAS-GRIM Group

(2)

Table of Content

Table of Content

1 Introduction Generalities Related works 2 Summary model Description space Building the summaries 3 System architecture

Web service organization Complexity and performances 4 Conclusion

(3)

Introduction Summary model System architecture Conclusion Generalities Related works

Table of Content

1 Introduction Generalities Related works 2 Summary model Description space Building the summaries

3 System architecture

Web service organization Complexity and performances

(4)

Introduction Summary model System architecture Conclusion Generalities Related works

Motivations

Provide small versions of very large databases Descriptive ability :

scientific studies (epidemiology) ;

commercial and marketing studies(customer

segmentation) ;

log analysis (connection/operation profile) ; data obfuscation ;

data personalization and filtering.

Data size reduction ability :

approximate querying(hotel booking),

database browsing(image database),

storing rough view of the data on devices with low memory capacity (tourism GPS data).

(5)

Introduction Summary model System architecture Conclusion Generalities Related works

Motivations

Provide small versions of very large databases Descriptive ability :

scientific studies (epidemiology) ;

commercial and marketing studies(customer segmentation) ;

log analysis (connection/operation profile) ; data obfuscation ;

data personalization and filtering.

Data size reduction ability :

approximate querying(hotel booking),

database browsing(image database),

storing rough view of the data on devices with low memory capacity (tourism GPS data).

(6)

Introduction Summary model System architecture Conclusion Generalities Related works

Motivations

Provide small versions of very large databases Descriptive ability :

scientific studies (epidemiology) ;

commercial and marketing studies(customer segmentation) ;

log analysis (connection/operation profile) ; data obfuscation ;

data personalization and filtering.

Data size reduction ability :

approximate querying(hotel booking),

database browsing(image database),

storing rough view of the data on devices with low memory capacity (tourism GPS data).

(7)

Introduction Summary model System architecture Conclusion Generalities Related works

What is a summary ?

Definition A summary is a concise representation of a set of structured data. ⇒ Semantic Compression Occupation Income Ph.D. Student 1 000 Lecturer 2 000 Managing Director 8 500 Politician xx xxx Tab.:RelationR Occupation Income Research Miserable Executive Enormous Tab.:SummaryR∗

(8)

Introduction Summary model System architecture Conclusion Generalities Related works

Aggregate computation

P oi nt de ve n te Z on e gé ogr ap hi que P ays T ous Produit Date Aggregate computation

SDB, OLAP [Codd et al. 93], DataCubes [Gray et al. 93]

Datacube summarization

QuotientCube [Lakshmanan et al. 2002]

Limitations

Do not preserve the initial data schema ; Subject oriented, has to be designed ; Fixed and crisp granularity, threshold effect.

(9)

Introduction Summary model System architecture Conclusion Generalities Related works

Clustering approaches for semantic compression

intuition

Describe groups rather than individual observation.

Clustering – ItCompress [Jagadish et al. 1999]

Bayesian network classifier – Spartan [Babu et al. 2001]

Association rules– Fascicule [Jagadish et al. 1999]

Limitations

Classes shape depends on the selected criteria [Fasulo 1999] ; Single granularity of the compressed relation ;

(10)

Introduction Summary model System architecture Conclusion Generalities Related works

Foundations of our approach

Intuition

Trying to reproduce the human learning mechanisms.

Formal concept analysis

[Barbut et al. 1970, Wille 1982]

Conceptual clustering – [Michalski et Stepp 1983]

Unimem [Lebowitz 1986], Cobweb [Fisher 1987], Fuzz [Chen & Lu 1997]

Limitations

Approaches were validated only on small data samples ; Lack of maintenance capabilities.

(11)

Introduction Summary model System architecture Conclusion

Description space Building the summaries

Table of Content

1 Introduction Generalities Related works 2 Summary model Description space Building the summaries

3 System architecture

Web service organization Complexity and performances

(12)

Introduction Summary model System architecture Conclusion

Description space Building the summaries

Possibilistic Data Representation

Theoretical foundation :

Fuzzy-set theory (Zadeh, 1965) et

Possibility theory (Zadeh 1978, Dubois&Prade 1985)

Management of uncertain, incomplete and gradual information :

“John’s age shouldapproximately bebetween 16 and 20, but that’s not sure.”

Possibility distribution AGE 16 20 1.0 0.0 Dom 1.0 0.0 a b c d e f

(13)

Introduction Summary model System architecture Conclusion

Description space Building the summaries

Background knowledge

For each attributeA with domainDA, a set of Linguistic Labelsis

defined together with theirmembership function overDA.

Example, on attributeincome :

Dincome = [0,200000]

Dincome+ = {none,miserable,modest, . . .}

0 20 40 60 80 100 0 1 none miserable modest reasonable comfortable enormous outrageous DINCOME(K$)

(14)

Introduction Summary model System architecture Conclusion

Description space Building the summaries

Summary representation space

Original tuple (raw data)

t=ht.A1, . . . ,t.Aki, t ∈ R {t} DA R(A1, . . . ,Ak) =Qki=1DAi   y   y   y {z} F(DA+) R∗(A1, . . . ,Ak) = Qk i=1F(D + Ai) Summarized tuple z =hz.A1, . . . ,z.Aki, z ∈ R∗

(15)

Introduction Summary model System architecture Conclusion

Description space Building the summaries

Summary model

A summary is a 3-uplez = (Iz,Rz,Ez) with :

Iz : the intentional content ;

Rz : the extensional content, subset of the relationR;

Ez : a set of edges toward other summaries.

Example of a summary

Label satisfaction support

intentionIz 1.83 OCCUPATION employee 0.2 1.25 manager 1.0 0.33 managing director 0.7 0.25 INCOME comfortable 1.0 1.50 high 1.0 0.33 extensionRz { t1, t2,t5,t13} 4

(16)

Introduction Summary model System architecture Conclusion

Description space Building the summaries

Partial order on summaries

Subsumption relation :

z vz0 ⇐⇒ Rz ⊆Rz0

Hierarchical organization :

root: mostgeneral summary ;

leaves: mostspecific summaries.

The user-defined Background Knowledge fixes the finest level and, consequently, the maximal hierarchy size.

(17)

Introduction Summary model System architecture Conclusion

Description space Building the summaries

Algorithm outline

hierarchical conceptual classification incremental process

top-down approach selective local search

Advantages

summary freshness through incremental maintenance linear time complexity w.r.t. the number of tuples

Weaknesses

sub-optimal model (dynamic environment)

(18)

Introduction Summary model System architecture Conclusion

Description space Building the summaries

Process overview

! " # $ % & $ ' ( ) ) * ' +++ , +++ $ ( +++ - $ $

(19)

Introduction Summary model System architecture Conclusion

Description space Building the summaries

Local search

The process looks for the learning operator which produces the highest quality child partition.

Learning operators

affect,

create, merge,

(20)

Introduction Summary model System architecture Conclusion

Description space Building the summaries

Local search

The process looks for the learning operator which produces the highest quality child partition.

Learning operators

affect, create,

merge, split.

(21)

Introduction Summary model System architecture Conclusion

Description space Building the summaries

Local search

The process looks for the learning operator which produces the highest quality child partition.

Learning operators

affect, create, merge,

(22)

Introduction Summary model System architecture Conclusion

Description space Building the summaries

Local search

The process looks for the learning operator which produces the highest quality child partition.

Learning operators

affect, create, merge, split.

(23)

Introduction Summary model System architecture Conclusion

Description space Building the summaries

Multi-granularity summary

The summary hierarchy presents many different precision levels.

The trade-off between size and concision can be chosen

a-posteriori depending on the user need ;

Analogy between the drill-down/roll-up operation on Datacube and the summary hierarchy navigation.

(24)

Introduction Summary model System architecture Conclusion

Web service organization Complexity and performances

Table of Content

1 Introduction Generalities Related works 2 Summary model Description space Building the summaries

3 System architecture

Web service organization Complexity and performances

(25)

Introduction Summary model System architecture Conclusion

Web service organization Complexity and performances

Process overview

Message Oriented Application ; Each document has autonomous specification(XSchema) ;

Possibility to benefit fromMessage Oriented Middleware(MOM) ;

Each service may beused separately or composedwith others ;

Based on wide spread standards (W3C, ECMA et ISO).

(26)

Introduction Summary model System architecture Conclusion

Web service organization Complexity and performances

Concept formation performed by autonomous “agents”

! " " # $ "

Memory management optimized through specific pagination method ;

Process parallelization, Computation optimization through the use of a local cache with incremental upholding

(27)

Introduction Summary model System architecture Conclusion

Web service organization Complexity and performances

Process performance evaluation

Tests based on 1990 US census data[UCI KDD Archive]. 1 billion tuples ;

14 attributes used for the summarization ; 5 to 14 modalities per attributes (prepared).

Nombre de n-uplets traités

900 000 800 000 700 000 600 000 500 000 400 000 300 000 200 000 100 000 T a u x d e t ra it e m e n t (n -u p le t/ s ) 75 70 65 60 55 50 45 40 35 30 25 20 15 10 5 0

(28)

Introduction Summary model System architecture Conclusion

Web service organization Complexity and performances

Dynamic performances

Nombre de n-uplets traités

900 000 800 000 700 000 600 000 500 000 400 000 300 000 200 000 100 000 0 N o m b re d e f e u il le s 110 000 100 000 90 000 80 000 70 000 60 000 50 000 40 000 30 000 20 000 10 000 0

Profondeur max Profondeur moyenne

Nombre de n-uplets traités

900 000 800 000 700 000 600 000 500 000 400 000 300 000 200 000 100 000 0 Pr o fo n d e u r d e l a h ié ra rc h ie 26 24 22 20 18 16 14 12 10 8 6 4 2 0 Process performance is dependent only on the hierarchy size.

(29)

Introduction Summary model System architecture Conclusion

Web service organization Complexity and performances

Comparison with a real-life dataset

The marketing department of CIC (a french banking group) provided customer data :

33700 records ;

70 attributes (10 of them used for the summary) ; Background Knowledge defined with the bank marketing experts ;

(30)

Introduction Summary model System architecture Conclusion

Web service organization Complexity and performances

Dynamic performances on real data

Number of processed tuples

30 000 25 000 20 000 15 000 10 000 5 000 0 N u m b e r o f le a v e s 14 000 13 000 12 000 11 000 10 000 9 000 8 000 7 000 6 000 5 000 4 000 3 000 2 000 1 000 0

Number of processed tuples

30 000 25 000 20 000 15 000 10 000 5 000 P ro c e s s in g r a te ( T u p le /s ) 75 70 65 60 55 50 45 40 35 30 25 20 15 10 5 0

The number of leaves follows an asymptotic evolution ;

The process tends toward a classification only regime.

(31)

Introduction Summary model System architecture Conclusion

Table of Content

1 Introduction Generalities Related works 2 Summary model Description space Building the summaries

3 System architecture

Web service organization Complexity and performances

(32)

Introduction Summary model System architecture Conclusion

Conclusion

We presented :

A general purpose multi-granularity summarization model :

anadaptativealternative to thegroup by;

simultaneous maintenance ofseveral compression levels;

robustandintuitiveclasses thanks to human-like learning mechanism and uncertainty handling.

The architecture of the system, which contributes to :

ease ofcouplingwith DBMS (web services) ;

performanceoptimization and parallelization (use of autonomous agent) ;

Validation of the system performance on a test database and a real-life one.

(33)

Introduction Summary model System architecture Conclusion

Questions ?

Web Site of SaintEtiQ

http://www.simulation.fr/seq

Win32 prototype with test dataset available for download Process available online as web service

References

Related documents