Best Online Business Places to Visit - Part 1

(1)

DATA INTEGRATION SERVICES

Christian Convey, Olga Karpenko, Nesime Tatbul and Jue Yan

1 Introduction

** subjectto modications

1.1 Denition of Data Integration

Data integrationsystems harmonizedatafrommultiplesourcesintoa singlecoherent

represen-tation. The goal is to provide an integrated view over all the data sources of interest and to

provide a uniform interface to access all of these data. The access to the integrated data is

usuallyintheform of queryingratherthanupdatingthedata.

The data sources to be integrated may belong to the same enterprise or may be arbitrary

sources on the web. Most of the time, each of the sources is independently designed for

au-tonomous operation. Also, thesources are not necessarily databases; they may be legacy

sys-tems(oldandobsolescentsystemsthatarediÆculttomigratetoamoderntechnology)or

struc-tured/unstructuredleswithdierentinterfaces. Data integration requires thatthedierences

inmodeling,semanticsand capabilitiesofthesourcestogether withthepossibleinconsistencies

beresolved.

1.2 Motivation for Data Integration Systems

Historicalview: integration-by-hand

Users can focus on specifying what they want, not on how to obtain what they want.

Insteadofndingrelevantsources,interactingwitheverysourceandcombiningdatafrom

dierentsources, auser can ask queriesina uniedway.

Particular examples:

{ Desireforreportsthatdescribeallpartsofamergedorganization(bankmergers, car

dealerships,etc.).

Facilitatesdecisionsupportapplications(OLAP,Data mining)

OLAP (On-LineAnalyticalProcessing)ismaking nancial,marketing orbusiness

anal-ysis to be able to make business decisions on a collection of detailed data from one

ormore datasources. Theanalysisisdonethroughaskinglargenumberofaggregate

querieson thedetaileddata.

Data Mining is discoveringknowledge from a largevolume of data. Statistical rules or

patterns areautomaticallyfoundfrom therawcollection of data.

1.3 Major Issues

Heterogeneityof datasources

Availabilityof datasources

(2)

Correctnessof theintegratedview ofthe data

Query performance

(3)

2 Data Integration Architectures

2.1 Dimensionstocategorizearchitecturalmodelsforintegratingdatasources

Autonomy

Autonomyrefers to thedegree which individualdata sourcescan operate independently.

{ Designautonomy

The source is independent in data models, naming of the data elements, semantic

interpretationof thedata, constraintsetc.

{ Communicationautonomy

The source is independent in deciding what information it provides to the other

componentsthatarepartoftheintegratedsystemand towhichrequestsitresponds.

{ Executionautonomy

The sourceisindependent inexecutionand scheduling ofincomingrequests.

Heterogeneity

Heterogeneityreferstothedegreeofdissimilaritybetweenthecomponentdatasourcesthat

make up the data integration system. It occurs at dierent levels. On a technical level,

heterogeneity comes from dierent hardware platforms, operating systems, networking

protocolsorsimilarlower-levelconcepts. Onaconceptuallevel, heterogeneitycomesfrom

dierent programming and data modelsas well as dierent understanding and modeling

of thesame real-world concepts(ex: naming).

Distribution

Distributionrefersto thephysicaldistributionof dataovermultiplesites.

{ Client/Server

Serverdoesdatamanagement, client providesuser interface.

{ Peer-to-Peer (fullydistributed)

Each machine hasfullfunctionalityof datamanagement.

Transparency

Transparencyreferstotheseparationofhigher-levelsemanticsofasystemfromlower-level

implementationissues. Atransparentsystemhidestheimplementationdetailsfromusers.

2.2 Major Approaches to Data Integration

Threecommon approachesthatdata sourcescan be madeto worktogether:

VirtualViewApproach

Nodataisstoredintheintegrateddatabase. Dataisaccessedon-demand(lazyapproach).

MaterializedView/WarehousingApproach

Informationisstored inarepository(warehouse)to be queriedlater (eager approach).

Hybridapproach

Data isselectivelymaterialized.

(4)

2.3.1 Federated Database Systems

The sources are semi-autonomous and the distribution is peer-to-peer. Each source asks all

others in the federation to supply information when needed. One-to-one connections between

all pairsof databases,whereeachdatabasehasinterfaceto theothers. The softwareon thetop

of federation allowsuser to accessit inuniedmanner.

Picture

Schemaintegration

**Wewillbrieymentionithereandrefer tothefollowingsection aboutschema

integra-tion.

2.3.2 Mediated Systems

Denitionfrom DavidMartin'spaper(InformationBroker Project):

"Mediation is a process that permits a user to getinformation from a wide variety of sources,

withouthavingto beawareoftheidentities,locations,schemas,accessmechanisms,orcontents

ofthese sources. Acomponent thatperformsmediationpresentsa singleschema (called

medi-ated schema - our comment) to its requesters, accepts queriesexpressed inthat schema, and

handles all the detailsof getting the appropriate data from relevant information sources (each

of which islikelyto operate witha dierentschema)."

Picture

Steps inhow thequeriesareanswered

{ Receive a queryformulatedon theuniedschema fromthe user

{ Query reformulation

{ Optimizationand execution

{ Translation

(**Fordetaileddescriptionof thestepsabove refer to thesection on queryprocessing)

Wrapperasan important component- hidestechnicaland datamodel heterogeneity.

Additionalproblemsraised by thisspecicarchitecture

Known implementations

- TSIMMIS

- InformationManifold

- SIMS

- Carnot

(**very briey,maybe justmention)

2.4 Materialized View Approach (Data Warehousing)

Filtered data from several sources is stored in a single repository (called "data warehouse").

This data is combined into a unied schema. A data warehouse often contains historical and

(5)

Motivation: Whywouldwewant to materialize data?

OLTP (On-LineTransactionProcessing)vs OLAP

{ Terabytesof data

{ Workloads arequeryintensive;queriesarecomplex

{ OLAP datais modeled multi-dimensionally

What isinvolvedinbuildinga datawarehouse?

(**Foreach of thisstepsmentionthemainresearchissues)

{ Modeling and design

{ Maintenance (creationand refreshing)

{ Operation

- Squirrel

- WHIPS

2.5 Hybrid approach

What datato materialize?

Howdatais maintained

(** refer to section on MaterializedViewManagement fora detaileddiscussionofselection

of viewsto materialize)

(6)

3 Data Models and Schema Integration

3.1 Data Models

Adata modelisanabstract descriptionofhowinformationisencodedinasystem. Typically,

people use data models to perform certain reasoning about a system, especially when doing

designwork regardingthesystem.

Data models arehelpful because they are simpliedversionsof the system that the model

correspondsto. Data modelscan omitdetailsaboutthe modeled system thatare irrelevant to

thetaskat hand. Thiscan make work easierfora personworkingwiththe modeled system.

A data model is valid(with respectto its author's intentions) ifit accurately describesall

facets of themodeled systemthat thedatamodel's authorintendedthemodelto describe.

Forexample, an entity-relationship (ER) model can be a validmodel of anassociated

rela-tionaldatabase.

Adatamodelmightnotrepresentalloftheinterestinginformationthatexistsinthesystem

that the model is an abstraction of. For example, an ER model would notlikelydescribe how

manyrecords appearina given table oftherelational database.

A datamodel (asconceivedhere) doesnotnecessarilyattempt to describedynamicaspects

of theinformationencodeda system. Forexample, a datamodelmight describethat a system

had fourexplicitstates: "STARTING", "RUNNING", "QUIESCING",and "SHUTDOWN".

A data model probably would not describe the possiblesequences of state transitionsthat

themodeledsystem could undergo. That kindof descriptionwould becharacteristic of a state

model,nota datamodel.

Common types of data modelsused insoftware

Numeroustypesof datamodelshave beenstudiedandputinto use. Some ofthemostcommon

ones aredescribedbelow.

Entity-Relationship(ER) datamodels

These modelsdescribe a systemasa set of entityclasses and a set ofclasses of potential

relationsbetween entities.

Object-orienteddatamodels

Thesemodelsdescribeclassesofobjects,relationshipsthatwillexist betweeninstancesof

those classes,and servicesthat willbeprovided byobjects.

Object-orientedmodelsthereforetendto be morethanjustdatamodels. Object-oriented

models go one step further by describing to some extent what dynamic behavior the

modeled systemcan undergo when executing.

Hierarchicaldatamodels

Thesemodelsrepresentinformationasatree-likehierarchy,composedofnodesand edges.

Informationmay be encodedinsuchdetailsasnode values,edgelabels,and theshapeof

thetree.

3.2 Schemas

A schema, asone would expectto ndina databasemanagement system(DBMS), is a data

model oftheinformationstoredin thedatabase.

What typically distinguishesschemas from more general data modelsis that schemas may

(7)

Examples includeindexdenitionand wheredatabaselesreside on thelesystem.

Common types of schemas used in software

Relationalschemas

ThesearemuchlikeE-Rmodels,exceptthattheycanbespecicaboutstoragestructures.

Example: SQL Data Manipulation Language

It hasstructure, butisnotconvenient forsoftwareto parse theargument's structure.

Object-orienteddatabaseschemas

Theseare muchlikeobject-orienteddatamodels.

Additionalimplementationdetailsmayincludethesourcecodeofthemethodsassociated

withclassesorinstancesof objects.

Example: CORBA

In CORBA, bindings are generated between details specied in a generic description of

objects'interfaces(IDL)andtheactualprogramcodethatimplementsthoseobjects. That

binding code, as wellas thebound-to code itself,could reasonably be considered part of

theschema.

Semistructuredschemas

These are much like hierarchical data models. The additionaldetails provided might

in-cludeacanonicalnamefortheschema(tobereferredtobyindividualdatabaseinstances),

orwhat characterset encodingis usedina databaseconformingto theschema.

Example: XML DTDs

Unstructured

Manydocuments havestructure that'seasilyrecognizable to humans,butit very unclear

to software. Froma softwaresystem's perspective,these documents areunstructured.

Examples

{ Plato's Republic It has structure, but is not convenient for software to parse the

argument's structure.

{ A bitmap image of a molecule's shape The image contains structure, but it's quite

diÆcultforan applicationto appreciatethat fact.

3.3 Data Model Semantics

The semantics of a data modeldescribe how thedetails of the datamodelcorrespond to the

detailsof themodeled system.

Perhaps the most familiar representation of data model semantics is the use of a data

dictionary insome datamodels.

3.4 Mapping Between Schemas

Some of the most complicated work of data integration is the construction of techniquesthat

can mapthe datafromone schemainto another.

In mediateddatabasesystems andindatawarehousesystems, thereisa singleschema used

when providing the integrated view of the data. In such systems that have n dierent data

(8)

In data warehouse systems, the schema mapping can generally be one-way. The mapping

justneedstogureouthowtopresent informationinthedatasourceinawaythat'sconsistent

withthe integratedview's schema.

In mediatedsystems, the schema mapping must be two-way. Like with a data warehouse,

a mapping from the data source schema to the integrated schema must be dened. However,

mediated systems receive queries from theirusers, which must then be translated into queries

processed by thedata sources. The user formulates thequery usingterms from theintegrated

schema. The query posed to the data source by the mediator must be in terms of the data

source'sschema.

3.5 Hard Problems in Schema Mapping

Onto mappings

Example: Numericgradesto letter grades[give creditto theH.P.paper]

Dierent categorizations

I.e., Data model A records (Unit color, Number sold), and data model B records (Unit

weight, Number sold). Unless color =>weight orweight => color, there's a serious limit

on any unieddatamodelthatwantsto represent Numberunitssold inausefulway.

Recognizing object identity

Problem: Eachschema impliesthatisonlyhasonedataobjectperobjectinthemodeled

system. Whenproducinga unieddatabase, it'spossibleto lostthis

one-data-object-per-modeled-objectqualityunlesscare istaken. Even then, itcan be ahardproblem to solve

(**Research thisbetter - cjc).

Conicting data

Schema semantics oftenmake an implicitclaim that data usingthe schema is absolutely

correct. People usuallyknowto take thatclaim witha grainof salt. When unifyingdata

from two or more data sources, contradictions can be reached, leading to an internally

inconsistent viewof data.

Unequal expressive power of schemas

By way of example, object-oriented databases (OODBs) might include their method

be-haviors as part of the schema. However, SQL schemas have only limited support for

expressingsystembehavior.

In particular, the OODB methods might be written in computationally complete

lan-guages, and SQL is not computationally complete. The important lesson from this is

that care must be taken in the selection of which schema type to use for presenting the

integrated viewof thedata.

3.6 A Survey of Automatible Schema Mapping

Not all schema mapping is necessarily diÆcult to perform with computers. Some types of

schemasarenicelytranslatabletoothertypesofschemas,whichopensthedoorforusingsoftware

to producethe mappings.

What makes thisfeasibleisthat inthese cases, mappingpatterns have beenidentiedthat

preserve data model semantics without requiring the map-writing software to care about the

(9)

Mapping hierarchical schemas to relationalschemas

Considera hierarchicaltypeschemalikeHTML, and arelational type schema like SQL.

Generally, HTML has is a tree of nodes. Each node has a set of attributes name/value

pairsand a setof childrennodes.

On naive mapping is possible: Create a SQL table where one row corresponds to each

nodefrom theHTML schema. Thetree structureis preserved byhavingacolumnin the

table thatliststheprimary keyof therowthat correspondsto each childnode.

Mapping relational schemas to object-oriented schemas

One solution is to deneone object class per relation. Create one instance of each class

to model eachrow thatappearinthe corresponding relation.

Relational view denitions and calculated values can be expressed with methods in the

(10)

4 Querying the Integrated Data

The traditional way of query processing involves: getting a declarative query from the user;

parsing it; passing it through a query optimizer which produces an eÆcient query execution

planthatdescribeshowtoexactlyevaluatethequery(i.e.,applywhichoperators,inwhatorder

usingwhat algorithm);and nally executingtheplan.

In data integration systems this is not so straight forward because the query coming from

theuserisnotaskedonasinglesourcebutratheronanintegrationofdatasources,eachhaving

dierent queryformulationand processingmechanismsdenedon dierentdata models.

4.1 Data Modeling

4.1.1 Languages

Alternativeswillbediscussedand Datalogwillbebrieyintroducedasthechoice of queryand

view languageforthefollowingsubsections.

4.1.2 Modeling the Mediated Schema

Mediated schema is the single schema of the integrated system which is obtained by unifying

the schemas of the data sources being integrated. Queries to the data integration system are

formulatedonthismediatedschema.

4.1.3 Modeling the Data Sources

Data source descriptions which specifythe relationship between the relations in the mediated

schema and those inthe localschemas at thesources shouldbe provided. The descriptionof a

sourcespecies its

contents

attributes

constraintson its contents

completenessand reliability

queryprocessingcapabilities

Besides, usingthe descriptionsofthe sources,we shouldbeable to detect thefollowing:

overlappingand contradictory dataamong sources

semantic mismatches amongsources

dieringnamingconventionsfordata values amongsources

4.2 Query Reformulation/Rewriting

Usingthesourcedescriptions,userquerywrittenintermsofthemediatedschemaisreformulated

into aquerythatrefersdirectlytotheschemasofthesources(butstillinthedatamodelofthe

(11)

Semantic Correctness of the Reformulation: The answers obtained from the sources will

becorrect answers to theoriginalquery.

Minimizing the Source Access: Sources that can not contribute any answer or partial

answerto thequery shouldnot beaccessed. In additionto avoiding access to redundant

sources, we shouldreformulate the queries as specic aspossible to each of the accessed

sources to avoid redundantquery evaluation.

4.2.1 Global As View (GAV) Approach

For each relationRin themediatedschema, a queryover thesourcerelations is writtenwhich

specieshowto obtainR's tuplesfrom thesources.

4.2.2 Local As View (LAV) Approach

ForeachdatasourceS,aruleovertherelationsinthemediatedschemaiswrittenthatdescribes

whichtuplesarefoundinS.Thisisanapplicationofamuchbroaderproblemcalled"Answering

QueriesusingViews". (** Italso appliesto view designandselection inData Warehousing.)

Query reformulation inLAV is more complex. However, it has important advantages

com-pared to GAV:adding newsources and specifyingconstraintsinLAV areeasier.

- equivalent rewritings vsmaximally-containedrewritings

The BucketAlgorithm

The Inverse-Rules Algorithm

The MiniconAlgorithm

The Shared-Variable-Bucket Algorithm

The CoreCoverAlgorithm

Comparisonof theAlgorithms

4.2.3 Completeness and Complexity of Finding Query Rewritings

sourceincompleteness

recursiverewritings

4.2.4 Using Probabilistic Information

forsourcecompleteness

foroverlapbetweenpartsof themediatedschema

(12)

4.2.5 Alternative Query Reformulation for Dynamic Information Integration

4.3 Query Optimization and Execution

Queryoptimizationreferstotheprocessoftranslatingadeclarativequeryintoaqueryexecution

plan,i.e., aspecicsequence of stepsthatthequery executionengineshouldfollowto evaluate

thequery. Whyis itdiÆcultindataintegration systems?

sourcesareautonomous;optimizermaynothaveanystatisticsorreliabilityinfoaboutthe

sources,query planningisdiÆcult

sources areheterogeneous; they mayhave dierent queryprocessing capabilities

data transfer timeis not predictable dueto the existence of thenetwork environment,it

isdiÆcultto make cost estimates

some sourcesmaybecome unavailable(?)

4.3.1 Query Plan Generation

4.3.2 Query Execution

4.3.3 Adaptive Query Execution

- interleavingoptimizationand executionof thequeryinthe TukwilaSystem

4.3.4 Query Translation

Up to now, the queries are reformulated into a form that refers directly to the data sources

rather than the mediated schema. However, they are still in the data model of the mediated

system. Theyshouldbe translatedinto thedatamodelsof thedatasources. Thisis performed

bysource-specic wrapperswhichwillbedened inthefollowingsection.

(**Thisistheresponsibilityofthesourcewrappers. WewillbeconnectingQueryProcessing

section to Data Extraction section briey discussingthe query translation stage inside query

execution.)

(** We should also be discussingthe ow of translationin the reverse direction, i.e., after

the sources successfully perform the local queries, how the resuls are collected and combined

(13)

5 Data Extraction

Combinestechniquesfrom DBand AI (NLP,Machine learning).

5.1 Techniques for Extracting Data (Wrappers)

To access informationfrom dierent heterogeneous data sources, we have to translate queries

and data from one datamodelto another. This functionis providedbywrappersaround each

individual data source. Wrapper converts queries into one or more queries understandableby

the underlyingdata source and transforms results into the format understood by application

(mediator).

5.1.1 Wrapper generation

Issues

- refer to thesectionon Data Modelsand Schema Integration

- Mediator systems usuallyrequire more complexwrappersthan do mostwarehouse

sys-tems

Ways ofcreating wrappers

{ Manual

Whyis itimpractical forsome sources?

Incase of Web sources:

- bignumberofsources

- new sourcesareadded frequently

- formatof sourceschange

So,high maintenance costs.

{ Semi-automatic (interactive)

Noted that only small part of the code deals with the specic access details of the

source. The rest is common among wrappers or data transformation can be

ex-pressed in a declarative fashion (high-level). Graphical interface, programming by

demonstration.

{ Automatic

site-specic orgeneric

usuallyneeds trainingoftensupervisedlearning

Tools forsemi-automatic/automatic wrapper construction for structured/semistructured

data

{ Template-based wrappers

{ Inductive learning techniques for automatically learning a wrapper (using labeled

data)

Inductivelearning- taskof computingsome generalizationfrom aset of examples

Methods:

zero-order (decisiontree learners)

rst-order (inductivelogic programming)

(14)

Tools fordataextraction fromunstructured documents

{ Using ontologies and conceptual models to extract and structure information from

data-rich,unstructured documents.

{ Usingheuristicapproaches to ndrecordboundariesinweb documents.

5.1.2 Filters

Ifa wrapperreturns asupersetofwhat query wants, we can ltertheresultsof thequery.

5.2 Interfaces of Sources to the World

(**Thissection hasthepurposeof showingwhat kindofinterfaces thesourcesprovidesothat

wrappers can communicatewiththem)

5.2.1 General issues with interfaces

Standardized interfaces

Many interfaces are intended for reuse by many applications. Therefore, they separate

theirfunctionalityinto two groups.

{ Informationregardingthereusableaspectsoftheinterface(i.e.,connections,sessions,

supported unitsoftransfer (les, wholerecords,etc.)).

{ Application-specic information (i.e., record layouts of tabular text les, the set of

les,specicobjects instances).

Data resolution

Divisionofapplication dataencodinginup to two levels

{ Addressablebytheinterface(i.e.,lesvia FTP,recordcolumnvaluesvia ODBC)

{ Notaddressablebytheinterface(i.e.,theFTPinterfacetoretrievean XMLle,but

FTPdoesn'tprovidespecic elementextraction from thatXML le.)

Applicationsmayhave some freedomaboutwhereto encode information. Forexample,a

FTPserverforimagelescouldserveoneleperimage(withahelpfullename),orcould

servejustone .ziplewhichmustbe openedbytheclienttochoosethedesiredimagele.

Data types

{ Interface basic data types may or may not correspond wellto the basic data types

usedto describe dataat thedatasource.

{ Example In HTTP, all data is represented as a string. This works great for web

pages. But ifHTTP isbeingused to transfer a soundle,thestring data typeisn't

particularlyuseful.

Interface semantics

{ Datasourcesand dataintegratorsarecapableofencodingwhateverinformationthey

(15)

{ Someinterfacedesignshelpprovidehintstothesemanticsofapplicationdatamoved

acrosstheinterface. (I.e., ODBCallowsits usersto refer tospecictable name,and

those namesmight providehelpful hintsasto theapplicationsemanticsof thetable

contents.)

{ Why dierent interfaces suit dierent application data models Some interfaces are

designed ina way that allows forobvious mappingbetween applicationdata models

andtheinterface. (I.e.,ODBCwitharelationaldatabase,orCORBAwithan

object-orientedsystem.)

{ Dealingwithsemanticsoftheapplication-specicdataiscoveredinaanothersection

(needsan xref)

5.2.2 Survey of CommonInterfaces

It's instructive to considersome dierent kindsof non-application-specicinterfaces.

Non-network interfaces

{ DiÆculties

Data integration mayinvolve two ormore computersthatmustcommunicate.

A data sourcethatonlyhaslocalinterfaces maytherefore requiresome

integra-tion softwareto be installedon thedatasource'scomputer

{ Examples

Writing to leson a localdisk drive.

Displayinginformationon alocalGUIonly(i.e., MSWindows,notX).

Network-capable interfaces

Interfaces that allowan application to readily communicate with other applicationsthat

areeither localoron a dierentcomputer.

{ DiÆculties

Potentially independent failuresofcommunicationapplications

Potentially signicant communicationscosts/ performanceissues

{ Examples

FTP

CORBA

(16)

6 Materialized View Management

We aregoingto discusstwomajorsubtopicsunder thisheadingwithgivinghigheremphasisto

thesecond one:

Designand selection ofviewsto materialize

Maintenance of thematerialized views

6.1 Design and Selection of Views to Materialize

MajorGoals:

to minimizetotal queryresponse time

to minimizethecost of maintainingtheselected views

6.2 The Problem of View Maintenance

6.2.1 Denition

traditionalview updateproblemand whythisone isdierent andmore diÆcult

Asthedatabasechangesbecauseofupdatesappliedtothebaserelations,thematerialized

viewsmay alsorequirechange. A materializedview can always be brought up to dateby

re-evaluating theview denition. However, completere-evaluation iswasteful.

"heuristic of inertia"- onlya part of the view changes inresponse to changes in thebase

relations

stepsof viewmaintenance:

{ propagation(computingchanges to theview)

{ refreshing(applyingchanges to themat. view)

6.2.2 Dimensions

AvailableInformation

materialized view,baserelations, otherviews, integrityconstraints, etc.

Allowable Modications

insertions,deletions,updates,sets of each,group updates, change viewdenition, etc.

Expressivenessof theViewDenitionLanguage

conjunctive queries, duplicates, aggregation, union, recursion, negation, select, project,

join,spj,etc.

Database Instance

Complexity

{ Complexityofthe ViewMaintenance Language

(17)

6.2.3 View Maintenance Policies

when to applymaintenance procedureson thematerialized views

ImmediateView Maintenance

Refreshingis done withinthe transaction that changes the base data. slow transactions;

fasterqueriesand alwaysup to dateresults.

Deferred ViewMaintenance

{ lazily,at query time

fasttransactions; slowqueryspeed.

{ forced,after a certainamount ofchangeto thebase data

non upto date results;goodtransaction and querytime.

{ periodically,incertaintime intervals

non upto date results;goodtransaction and querytime.

comparison ofall ingeneral; howthedecisionwhen to usewhichone ismade.

6.3 Incremental View Maintenance

pre-updatevs post-updatealgorithms

variousalgorithmswillbesummarized

6.4 View Maintenance Anomalies(Consistency Issues)

caused bydecouplingbtw viewdenitionand basedata

denitionofcorrectness and levels ofcorrectness

solutions:

{ recomputing theviews

{ storing copiesof baserelations

{ ECA (Eager CompensatingAlgorithm)

6.5 Update Filtering

Detection ofbase data updatesthat are irrelevant to theview (i.e., have no eect on the state

of the view) wil be discussed. Forsuch updates, we do not need to perform anymaintenance.

Thus,updatelteringmakesmaintenance more eÆcient bypreventing redundant work.

6.6 View Self-Maintenance

In general, Self-Maintenance refersto views being maintained withoutusingall thebase data.

There existsdierent notionsof its exact meaning dependingon how much informationis

ava-iable. Atminimum,theviewupdateisperformedat theintegratedsystembyonlyknowingthe

particular base data update that has occurred, the view denitions and the materialied data.

(18)

{ Given a view,is itself-maintainable?

{ Ifit isself-maintainable,how?

Single-View Self-Maintenance

doesnotconsiderthematerialized views

Multiple-ViewSelf-Maintenance

makesuseof contents oftheother materialized viewsto minimizebase dataaccess

Updatesas awhole

i.e., batch updates;maymake maintenance easier andmore eÆcient

Makingviewsself-maintainable

When theanswerto therst questionlistedabove is No,wecan dene and materializea

minimalset of auxiliary views to make the originalnon-maintainable view maintainable.

Here, basically we are increasing the amount of information available at the integrated

systemlevel.

6.7 Dynamic View Management

problemsof staticview selectionand maintenance

dynamicview selectionand maintenance

performanceparameters

{ space (taken upbythematerialized views)

{ workload (changingqueryworkload)

{ maintenancewindow(howoftenwewouldliketoperformmaintainanceandhowlong

isthe systemtoleratedto beunfunctional)

(19)

7 Systems

TSIMMIS(Stanford)

Ariadne(USC)

Araneus(Web-Bases, UniversityofRome,Italy)

Infomaster(Stanford)

InformationManifold(AT &T)

Tukwila (UniversityofWashington)

SIMS (USC)

Carnot(MCC)

WHIPS (Stanford)

Squirrel

** subjectto modications

8 Open Questions and Research Issues

** willbe formedlaterfrom everybody'slist ofresearchissuesfrom theabovesections

9 Concluding Remarks