DATA INTEGRATION SERVICES
Christian Convey, Olga Karpenko, Nesime Tatbul and Jue Yan
1 Introduction
** subject to modifications
1.1 Definition of Data Integration
Data integration systems harmonize data from multiple sources into a single coherent
representation. The goal is to provide an integrated view over all the data sources of interest and to
provide a uniform interface to access all of these data. The access to the integrated data is
usually in the form of querying rather than updating the data.
The data sources to be integrated may belong to the same enterprise or may be arbitrary
sources on the web. Most of the time, each of the sources is independently designed for
autonomous operation. Also, the sources are not necessarily databases; they may be legacy
systems (old and obsolescent systems that are difficult to migrate to a modern technology) or
structured/unstructured files with different interfaces. Data integration requires that the differences
in modeling, semantics and capabilities of the sources, together with the possible inconsistencies,
be resolved.
1.2 Motivation for Data Integration Systems
Historical view: integration-by-hand
Users can focus on specifying what they want, not on how to obtain what they want.
Instead of finding relevant sources, interacting with every source and combining data from
different sources, a user can ask queries in a unified way.
Particular examples:
- Desire for reports that describe all parts of a merged organization (bank mergers, car
dealerships, etc.).
Facilitates decision support applications (OLAP, data mining)
OLAP (On-Line Analytical Processing) is financial, marketing or business
analysis, performed on a collection of detailed data from one
or more data sources, in order to support business decisions. The analysis is done by asking a large number of aggregate
queries on the detailed data.
Data mining is discovering knowledge from a large volume of data. Statistical rules or
patterns are automatically found from the raw collection of data.
1.3 Major Issues
Heterogeneity of data sources
Availability of data sources
Correctness of the integrated view of the data
Query performance
2 Data Integration Architectures
2.1 Dimensions to categorize architectural models for integrating data sources
Autonomy
Autonomy refers to the degree to which individual data sources can operate independently.
- Design autonomy
The source is independent in data models, naming of the data elements, semantic
interpretation of the data, constraints etc.
- Communication autonomy
The source is independent in deciding what information it provides to the other
components that are part of the integrated system and to which requests it responds.
- Execution autonomy
The source is independent in execution and scheduling of incoming requests.
Heterogeneity
Heterogeneity refers to the degree of dissimilarity between the component data sources that
make up the data integration system. It occurs at different levels. On a technical level,
heterogeneity comes from different hardware platforms, operating systems, networking
protocols or similar lower-level concepts. On a conceptual level, heterogeneity comes from
different programming and data models as well as different understanding and modeling
of the same real-world concepts (e.g., naming).
Distribution
Distribution refers to the physical distribution of data over multiple sites.
- Client/Server
The server does data management, the client provides the user interface.
- Peer-to-Peer (fully distributed)
Each machine has full functionality of data management.
Transparency
Transparency refers to the separation of higher-level semantics of a system from lower-level
implementation issues. A transparent system hides the implementation details from users.
2.2 Major Approaches to Data Integration
Three common approaches by which data sources can be made to work together:
Virtual View Approach
No data is stored in the integrated database. Data is accessed on demand (lazy approach).
Materialized View / Warehousing Approach
Information is stored in a repository (warehouse) to be queried later (eager approach).
Hybrid approach
Data is selectively materialized.
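The lazy/eager distinction between the first two approaches can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: a "source" is modeled as a function returning its current rows, and the class names `VirtualView` and `MaterializedView` are hypothetical, not taken from any system discussed in this document.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Source = Callable[[], List[Row]]  # assumption: a source is a function returning its rows


class VirtualView:
    """Lazy approach: nothing is stored; sources are read at query time."""

    def __init__(self, sources: List[Source]) -> None:
        self.sources = sources

    def query(self, predicate: Callable[[Row], bool]) -> List[Row]:
        return [row for src in self.sources for row in src() if predicate(row)]


class MaterializedView:
    """Eager approach: rows are copied into a repository and queried later."""

    def __init__(self, sources: List[Source]) -> None:
        self.sources = sources
        self.repository: List[Row] = []
        self.refresh()

    def refresh(self) -> None:
        # Copy the current contents of every source into the repository.
        self.repository = [dict(row) for src in self.sources for row in src()]

    def query(self, predicate: Callable[[Row], bool]) -> List[Row]:
        return [row for row in self.repository if predicate(row)]


def everything(row: Row) -> bool:
    return True


# Two hypothetical sources.
branch_a = [{"part": "bolt", "qty": 10}]
branch_b = [{"part": "nut", "qty": 5}]
sources = [lambda: branch_a, lambda: branch_b]

virtual, warehouse = VirtualView(sources), MaterializedView(sources)
branch_a.append({"part": "washer", "qty": 7})  # a source changes afterwards
```

After the source change, `virtual.query(everything)` already sees the new row, whereas `warehouse.query(everything)` does so only after `warehouse.refresh()`. A hybrid system would materialize only some of the sources.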
2.3.1 Federated Database Systems
The sources are semi-autonomous and the distribution is peer-to-peer. Each source asks all
others in the federation to supply information when needed. There are one-to-one connections between
all pairs of databases, where each database has an interface to the others. The software on top
of the federation allows the user to access it in a unified manner.
Picture
Schema integration
** We will briefly mention it here and refer to the following section about schema
integration.
2.3.2 Mediated Systems
Definition from David Martin's paper (Information Broker Project):
"Mediation is a process that permits a user to get information from a wide variety of sources,
without having to be aware of the identities, locations, schemas, access mechanisms, or contents
of these sources. A component that performs mediation presents a single schema (called
mediated schema - our comment) to its requesters, accepts queries expressed in that schema, and
handles all the details of getting the appropriate data from relevant information sources (each
of which is likely to operate with a different schema)."
Picture
Steps in how the queries are answered
- Receive a query formulated on the unified schema from the user
- Query reformulation
- Optimization and execution
- Translation
(** For a detailed description of the steps above, refer to the section on query processing)
Wrapper as an important component - hides technical and data model heterogeneity.
Additional problems raised by this specific architecture
Known implementations
- TSIMMIS
- Information Manifold
- SIMS
- Carnot
(** very briefly, maybe just mention)
2.4 Materialized View Approach (Data Warehousing)
Filtered data from several sources is stored in a single repository (called "data warehouse").
This data is combined into a unified schema. A data warehouse often contains historical and
Motivation: Why would we want to materialize data?
OLTP (On-Line Transaction Processing) vs OLAP
- Terabytes of data
- Workloads are query intensive; queries are complex
- OLAP data is modeled multi-dimensionally
What is involved in building a data warehouse?
(** For each of these steps mention the main research issues)
- Modeling and design
- Maintenance (creation and refreshing)
- Operation
Known implementations
- Squirrel
- WHIPS
2.5 Hybrid approach
What data to materialize?
How data is maintained
Known implementations
(** refer to the section on Materialized View Management for a detailed discussion of the selection
of views to materialize)
3 Data Models and Schema Integration
3.1 Data Models
A data model is an abstract description of how information is encoded in a system. Typically,
people use data models to perform certain reasoning about a system, especially when doing
design work regarding the system.
Data models are helpful because they are simplified versions of the system that the model
corresponds to. Data models can omit details about the modeled system that are irrelevant to
the task at hand. This can make work easier for a person working with the modeled system.
A data model is valid (with respect to its author's intentions) if it accurately describes all
facets of the modeled system that the data model's author intended the model to describe.
For example, an entity-relationship (ER) model can be a valid model of an associated
relational database.
A data model might not represent all of the interesting information that exists in the system
that the model is an abstraction of. For example, an ER model would not likely describe how
many records appear in a given table of the relational database.
A data model (as conceived here) does not necessarily attempt to describe dynamic aspects
of the information encoded in a system. For example, a data model might describe that a system
has four explicit states: "STARTING", "RUNNING", "QUIESCING", and "SHUTDOWN".
A data model probably would not describe the possible sequences of state transitions that
the modeled system could undergo. That kind of description would be characteristic of a state
model, not a data model.
Common types of data models used in software
Numerous types of data models have been studied and put into use. Some of the most common
ones are described below.
Entity-Relationship (ER) data models
These models describe a system as a set of entity classes and a set of classes of potential
relations between entities.
Object-oriented data models
These models describe classes of objects, relationships that will exist between instances of
those classes, and services that will be provided by objects.
Object-oriented models therefore tend to be more than just data models. Object-oriented
models go one step further by describing to some extent what dynamic behavior the
modeled system can undergo when executing.
Hierarchical data models
These models represent information as a tree-like hierarchy, composed of nodes and edges.
Information may be encoded in such details as node values, edge labels, and the shape of
the tree.
3.2 Schemas
A schema, as one would expect to find in a database management system (DBMS), is a data
model of the information stored in the database.
What typically distinguishes schemas from more general data models is that schemas may
Examples include index definition and where database files reside on the file system.
Common types of schemas used in software
Relational schemas
These are much like E-R models, except that they can be specific about storage structures.
Example: SQL Data Definition Language
Object-oriented database schemas
These are much like object-oriented data models.
Additional implementation details may include the source code of the methods associated
with classes or instances of objects.
Example: CORBA
In CORBA, bindings are generated between details specified in a generic description of
objects' interfaces (IDL) and the actual program code that implements those objects. That
binding code, as well as the bound-to code itself, could reasonably be considered part of
the schema.
Semistructured schemas
These are much like hierarchical data models. The additional details provided might
include a canonical name for the schema (to be referred to by individual database instances),
or what character set encoding is used in a database conforming to the schema.
Example: XML DTDs
Unstructured
Many documents have structure that's easily recognizable to humans, but is very unclear
to software. From a software system's perspective, these documents are unstructured.
Examples
- Plato's Republic: It has structure, but it is not convenient for software to parse the
argument's structure.
- A bitmap image of a molecule's shape: The image contains structure, but it's quite
difficult for an application to appreciate that fact.
3.3 Data Model Semantics
The semantics of a data model describe how the details of the data model correspond to the
details of the modeled system.
Perhaps the most familiar representation of data model semantics is the use of a data
dictionary in some data models.
3.4 Mapping Between Schemas
Some of the most complicated work of data integration is the construction of techniques that
can map the data from one schema into another.
In mediated database systems and in data warehouse systems, there is a single schema used
when providing the integrated view of the data. In such systems that have n different data
In data warehouse systems, the schema mapping can generally be one-way. The mapping
just needs to figure out how to present information in the data source in a way that's consistent
with the integrated view's schema.
In mediated systems, the schema mapping must be two-way. As with a data warehouse,
a mapping from the data source schema to the integrated schema must be defined. However,
mediated systems receive queries from their users, which must then be translated into queries
processed by the data sources. The user formulates the query using terms from the integrated
schema. The query posed to the data source by the mediator must be in terms of the data
source's schema.
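A minimal Python sketch of the two directions of mapping follows. It assumes the simplest possible case, where the two schemas differ only in attribute names; the names `cust_nm` and `cust_city`, the helper functions and the sample rows are all hypothetical.

```python
# Hypothetical attribute-name mapping between the integrated schema and one source.
INTEGRATED_TO_SOURCE = {"name": "cust_nm", "city": "cust_city"}
SOURCE_TO_INTEGRATED = {src: integ for integ, src in INTEGRATED_TO_SOURCE.items()}


def translate_query(conditions):
    """Integrated schema -> source schema (needed by mediators only)."""
    return {INTEGRATED_TO_SOURCE[attr]: value for attr, value in conditions.items()}


def translate_row(source_row):
    """Source schema -> integrated schema (needed by mediators and warehouses)."""
    return {SOURCE_TO_INTEGRATED[col]: value for col, value in source_row.items()}


def run_at_source(source_rows, source_conditions):
    """Stand-in for the data source evaluating an equality-selection query."""
    return [row for row in source_rows
            if all(row[col] == val for col, val in source_conditions.items())]


source_rows = [
    {"cust_nm": "Ada", "cust_city": "Boston"},
    {"cust_nm": "Bob", "cust_city": "Providence"},
]
user_query = {"city": "Providence"}          # phrased in the integrated schema
source_query = translate_query(user_query)   # phrased in the source's schema
answers = [translate_row(row) for row in run_at_source(source_rows, source_query)]
```

A warehouse loader would only ever call `translate_row`; a mediator needs `translate_query` as well, because the user's query arrives in integrated-schema terms.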
3.5 Hard Problems in Schema Mapping
Onto mappings
Example: Numeric grades to letter grades [give credit to the H.P. paper]
Different categorizations
E.g., data model A records (Unit color, Number sold), and data model B records (Unit
weight, Number sold). Unless color => weight or weight => color, there's a serious limit
on any unified data model that wants to represent the number of units sold in a useful way.
Recognizing object identity
Problem: Each schema implies that it has only one data object per object in the modeled
system. When producing a unified database, it's possible to lose this
one-data-object-per-modeled-object quality unless care is taken. Even then, it can be a hard problem to solve
(** Research this better - cjc).
Conflicting data
Schema semantics often make an implicit claim that data using the schema is absolutely
correct. People usually know to take that claim with a grain of salt. When unifying data
from two or more data sources, contradictions can be reached, leading to an internally
inconsistent view of data.
Unequal expressive power of schemas
By way of example, object-oriented databases (OODBs) might include their method
behaviors as part of the schema. However, SQL schemas have only limited support for
expressing system behavior.
In particular, the OODB methods might be written in computationally complete
languages, and SQL (without procedural extensions or recursion) is not computationally complete. The important lesson from this is
that care must be taken in the selection of which schema type to use for presenting the
integrated view of the data.
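Returning to the first problem in this list, the numeric-to-letter grade example can be made concrete with a short Python sketch. The cutoffs used here (90/80/70/60) are an assumption for illustration only. The point is that the forward mapping is easy, while the reverse direction can only recover a range of candidate values.

```python
def to_letter(score: int) -> str:
    """Map a 0-100 numeric grade onto a letter grade (cutoffs are assumed)."""
    if score >= 90:
        return "A"
    if score >= 80:
        return "B"
    if score >= 70:
        return "C"
    if score >= 60:
        return "D"
    return "F"


def preimage(letter: str) -> range:
    """All numeric grades that map to `letter`: the best an inverse can do."""
    bounds = {"A": (90, 100), "B": (80, 89), "C": (70, 79), "D": (60, 69), "F": (0, 59)}
    low, high = bounds[letter]
    return range(low, high + 1)
```

For instance, `to_letter(85)` and `to_letter(81)` both yield `"B"`, so an integrated schema that stores letter grades cannot answer a query about exact numeric scores; `preimage("B")` returns the ten scores 80 to 89 as the only available answer.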
3.6 A Survey of Automatable Schema Mapping
Not all schema mapping is necessarily difficult to perform with computers. Some types of
schemas are nicely translatable to other types of schemas, which opens the door for using software
to produce the mappings.
What makes this feasible is that in these cases, mapping patterns have been identified that
preserve data model semantics without requiring the map-writing software to care about the
Mapping hierarchical schemas to relational schemas
Consider a hierarchical type schema like HTML, and a relational type schema like SQL.
Generally, HTML is a tree of nodes. Each node has a set of attribute name/value
pairs and a set of child nodes.
One naive mapping is possible: Create a SQL table where one row corresponds to each
node from the HTML schema. The tree structure is preserved by having a column in the
table that lists the primary key of the row that corresponds to each child node.
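The naive mapping can be sketched in Python as follows. This is an illustration under assumptions: rows are plain dictionaries rather than an actual SQL table, primary keys are assigned in pre-order, and the `Node` class and `flatten` function are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    tag: str
    attributes: Dict[str, str] = field(default_factory=dict)
    children: List["Node"] = field(default_factory=list)


def flatten(root: Node) -> List[dict]:
    """Return one row per node; `child_ids` holds the keys of the child rows."""
    rows: List[dict] = []

    def visit(node: Node) -> int:
        node_id = len(rows)  # primary key, assigned in pre-order
        row = {"id": node_id, "tag": node.tag,
               "attributes": dict(node.attributes), "child_ids": []}
        rows.append(row)
        for child in node.children:
            row["child_ids"].append(visit(child))
        return node_id

    visit(root)
    return rows


page = Node("html", children=[
    Node("head", children=[Node("title")]),
    Node("body", {"class": "main"}, [Node("p"), Node("p")]),
])
rows = flatten(page)
```

In an actual relational table, a multi-valued `child_ids` column would usually be replaced by a parent-key column on the child row or by a separate edge table; the list is kept here only to mirror the description above.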
Mapping relational schemas to object-oriented schemas
One solution is to define one object class per relation. Create one instance of each class
to model each row that appears in the corresponding relation.
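A minimal Python sketch of this pattern, assuming a hypothetical relation `order_line(order_id, unit_price, quantity)` with made-up rows. The `line_total` method illustrates how a calculated value could be attached to the class.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical relation: order_line(order_id, unit_price, quantity)
ORDER_LINE_ROWS: List[Tuple[int, float, int]] = [
    (1, 2.50, 4),
    (1, 10.00, 1),
    (2, 0.75, 8),
]


@dataclass
class OrderLine:
    """One class for the order_line relation; one instance per row."""
    order_id: int
    unit_price: float
    quantity: int

    def line_total(self) -> float:
        """A calculated value expressed as a method."""
        return self.unit_price * self.quantity


order_lines = [OrderLine(*row) for row in ORDER_LINE_ROWS]
```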
Relational view definitions and calculated values can be expressed with methods in the
4 Querying the Integrated Data
The traditional way of query processing involves: getting a declarative query from the user;
parsing it; passing it through a query optimizer which produces an efficient query execution
plan that describes exactly how to evaluate the query (i.e., which operators to apply, in what order,
using what algorithm); and finally executing the plan.
In data integration systems this is not so straightforward because the query coming from
the user is not asked on a single source but rather on an integration of data sources, each having
different query formulation and processing mechanisms defined on different data models.
4.1 Data Modeling
4.1.1 Languages
Alternatives will be discussed and Datalog will be briefly introduced as the choice of query and
view language for the following subsections.
4.1.2 Modeling the Mediated Schema
The mediated schema is the single schema of the integrated system, which is obtained by unifying
the schemas of the data sources being integrated. Queries to the data integration system are
formulated on this mediated schema.
4.1.3 Modeling the Data Sources
Data source descriptions, which specify the relationship between the relations in the mediated
schema and those in the local schemas at the sources, should be provided. The description of a
source specifies its
contents
attributes
constraints on its contents
completeness and reliability
query processing capabilities
In addition, using the descriptions of the sources, we should be able to detect the following:
overlapping and contradictory data among sources
semantic mismatches among sources
differing naming conventions for data values among sources
4.2 Query Reformulation/Rewriting
Using the source descriptions, a user query written in terms of the mediated schema is reformulated
into a query that refers directly to the schemas of the sources (but still in the data model of the
Semantic Correctness of the Reformulation: The answers obtained from the sources will
be correct answers to the original query.
Minimizing the Source Access: Sources that cannot contribute any answer or partial
answer to the query should not be accessed. In addition to avoiding access to redundant
sources, we should reformulate the queries to be as specific as possible to each of the accessed
sources to avoid redundant query evaluation.
4.2.1 Global As View (GAV) Approach
For each relation R in the mediated schema, a query over the source relations is written which
specifies how to obtain R's tuples from the sources.
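A minimal Python sketch of the GAV idea, under stated assumptions: the three source relations, the mediated relations Movie(title, year) and DirectedBy(title, director), and all function names are hypothetical, and Python functions stand in for the view definitions that a real system would write in a query language such as Datalog.

```python
# Hypothetical source relations (each would live at a different source).
SOURCE_DB1_FILMS = [("Casablanca", 1942), ("Vertigo", 1958)]          # (title, year)
SOURCE_DB2_MOVIES = [("Vertigo", 1958), ("Alien", 1979)]              # (title, year)
SOURCE_DB3_DIRECTED = [("Vertigo", "Hitchcock"), ("Alien", "Scott")]  # (title, director)


# GAV: one definition per mediated relation, written over the source relations.
def movie():
    """Movie(title, year) is the union of the two film catalogues."""
    return set(SOURCE_DB1_FILMS) | set(SOURCE_DB2_MOVIES)


def directed_by():
    """DirectedBy(title, director) comes from a single source."""
    return set(SOURCE_DB3_DIRECTED)


def directors_of_movies_after(year):
    """User query on the mediated schema, answered by unfolding the definitions."""
    return sorted(
        director
        for title, movie_year in movie()
        for directed_title, director in directed_by()
        if title == directed_title and movie_year > year
    )
```

Because each mediated relation already comes with its defining query, reformulation amounts to substituting the definitions into the user query, which is what calling `movie()` and `directed_by()` inside `directors_of_movies_after` does here.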
4.2.2 Local As View (LAV) Approach
For each data source S, a rule over the relations in the mediated schema is written that describes
which tuples are found in S. This is an application of a much broader problem called "Answering
Queries using Views". (** It also applies to view design and selection in Data Warehousing.)
Query reformulation in LAV is more complex. However, it has important advantages
compared to GAV: adding new sources and specifying constraints in LAV are easier.
- equivalent rewritings vs maximally-contained rewritings
The Bucket Algorithm
The Inverse-Rules Algorithm
The MiniCon Algorithm
The Shared-Variable-Bucket Algorithm
The CoreCover Algorithm
Comparison of the Algorithms
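To give a feel for why LAV reformulation is harder, the following Python sketch shows only the first, bucket-building idea in a heavily simplified form. It is not the Bucket Algorithm as published: a source description is reduced to the set of mediated relations it mentions, so variable mappings, selection constraints and the final containment check are all omitted. The source names, relation names and helper functions are hypothetical.

```python
from itertools import product

# LAV: each source is described as a view over mediated-schema relations.
# Simplification: a description is just the set of mediated relations it mentions.
SOURCE_DESCRIPTIONS = {
    "S1": {"Movie", "DirectedBy"},
    "S2": {"Movie"},
    "S3": {"Review"},
}


def build_buckets(query_subgoals, descriptions):
    """For each query subgoal, collect the sources whose description mentions it."""
    return {
        subgoal: sorted(src for src, rels in descriptions.items() if subgoal in rels)
        for subgoal in query_subgoals
    }


def candidate_rewritings(buckets):
    """Combine one source per bucket; each combination is only a candidate."""
    subgoals = list(buckets)
    return [dict(zip(subgoals, choice))
            for choice in product(*(buckets[s] for s in subgoals))]


query = ["Movie", "Review"]  # subgoals of a query over the mediated schema
buckets = build_buckets(query, SOURCE_DESCRIPTIONS)
candidates = candidate_rewritings(buckets)
```

Each candidate would still have to be checked for containment in the original query before it counts as a rewriting. Adding a new source only adds an entry to `SOURCE_DESCRIPTIONS`, which reflects the ease of adding sources in LAV noted above.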
4.2.3 Completeness and Complexity of Finding Query Rewritings
source incompleteness
recursive rewritings
4.2.4 Using Probabilistic Information
for source completeness
for overlap between parts of the mediated schema
4.2.5 Alternative Query Reformulation for Dynamic Information Integration
4.3 Query Optimization and Execution
Query optimization refers to the process of translating a declarative query into a query execution
plan, i.e., a specific sequence of steps that the query execution engine should follow to evaluate
the query. Why is it difficult in data integration systems?
sources are autonomous; the optimizer may not have any statistics or reliability information about the
sources, so query planning is difficult
sources are heterogeneous; they may have different query processing capabilities
data transfer time is not predictable because of the network environment, so it
is difficult to make cost estimates
some sources may become unavailable (?)
4.3.1 Query Plan Generation
4.3.2 Query Execution
4.3.3 Adaptive Query Execution
- interleaving optimization and execution of the query in the Tukwila System
4.3.4 Query Translation
Up to now, the queries are reformulated into a form that refers directly to the data sources
rather than the mediated schema. However, they are still in the data model of the mediated
system. They should be translated into the data models of the data sources. This is performed
by source-specific wrappers, which will be defined in the following section.
(** This is the responsibility of the source wrappers. We will be connecting the Query Processing
section to the Data Extraction section by briefly discussing the query translation stage inside query
execution.)
(** We should also be discussing the flow of translation in the reverse direction, i.e., after
the sources successfully perform the local queries, how the results are collected and combined
5 Data Extraction
Combines techniques from DB and AI (NLP, machine learning).
5.1 Techniques for Extracting Data (Wrappers)
To access information from different heterogeneous data sources, we have to translate queries
and data from one data model to another. This function is provided by wrappers around each
individual data source. A wrapper converts queries into one or more queries understandable by
the underlying data source and transforms results into the format understood by the application
(mediator).
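The two duties of a wrapper can be sketched in Python. This is an illustration under assumptions: the legacy source is a small CSV text with its own column names, the only queries supported are equality selections, and the class name `CsvWrapper` and the column mapping are hypothetical.

```python
import csv
import io
from typing import Dict, List

# Hypothetical legacy source: a CSV text file with its own column names.
LEGACY_CSV = "EMP_NM,DEPT_CD\nAda,ENG\nBob,OPS\nCyd,ENG\n"


class CsvWrapper:
    """Accepts equality queries in the mediator's vocabulary and answers in it."""

    # Assumed mapping from mediator attribute names to source column names.
    COLUMN_MAP = {"name": "EMP_NM", "department": "DEPT_CD"}

    def __init__(self, csv_text: str) -> None:
        self.csv_text = csv_text

    def query(self, conditions: Dict[str, str]) -> List[Dict[str, str]]:
        # 1. Translate the query into the source's terms.
        source_conditions = {self.COLUMN_MAP[k]: v for k, v in conditions.items()}
        # 2. Run it with the only access method this source offers: a full scan.
        reader = csv.DictReader(io.StringIO(self.csv_text))
        hits = [row for row in reader
                if all(row[col] == val for col, val in source_conditions.items())]
        # 3. Translate the results back into the mediator's format.
        return [{attr: row[col] for attr, col in self.COLUMN_MAP.items()}
                for row in hits]


wrapper = CsvWrapper(LEGACY_CSV)
engineers = wrapper.query({"department": "ENG"})
```

The mediator never sees `EMP_NM` or `DEPT_CD`; it only deals with `name` and `department`, which is the sense in which the wrapper hides the source's data model.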
5.1.1 Wrapper generation
Issues
- refer to the section on Data Models and Schema Integration
- Mediator systems usually require more complex wrappers than do most warehouse
systems
Ways of creating wrappers
- Manual
Why is it impractical for some sources?
In case of Web sources:
- big number of sources
- new sources are added frequently
- format of sources changes
So, high maintenance costs.
- Semi-automatic (interactive)
It has been noted that only a small part of the code deals with the specific access details of the
source. The rest is common among wrappers, or the data transformation can be
expressed in a declarative (high-level) fashion. Graphical interface, programming by
demonstration.
- Automatic
site-specific or generic
usually needs training, often supervised learning
Tools for semi-automatic/automatic wrapper construction for structured/semistructured
data
- Template-based wrappers
- Inductive learning techniques for automatically learning a wrapper (using labeled
data)
Inductive learning - task of computing some generalization from a set of examples
Methods:
zero-order (decision tree learners)
first-order (inductive logic programming)
Tools for data extraction from unstructured documents
- Using ontologies and conceptual models to extract and structure information from
data-rich, unstructured documents.
- Using heuristic approaches to find record boundaries in web documents.
5.1.2 Filters
If a wrapper returns a superset of what the query wants, we can filter the results of the query.
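A minimal Python sketch of such a filter, assuming a hypothetical source that can only select by year while the query also constrains the page count. The function names and sample data are made up for illustration.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]


def coarse_source_lookup(year: int) -> List[Row]:
    """Stand-in for a wrapper whose source can only select by year."""
    data = [
        {"title": "Paper A", "year": 2000, "pages": 12},
        {"title": "Paper B", "year": 2000, "pages": 3},
        {"title": "Paper C", "year": 1999, "pages": 20},
    ]
    return [row for row in data if row["year"] == year]


def apply_filter(rows: Iterable[Row], residual: Callable[[Row], bool]) -> List[Row]:
    """Drop the rows that the source returned but the query did not ask for."""
    return [row for row in rows if residual(row)]


# Query: papers from 2000 with more than 10 pages.
superset = coarse_source_lookup(2000)
answer = apply_filter(superset, lambda row: row["pages"] > 10)
```

The condition the source cannot evaluate (`pages > 10`) is applied after the fact, at the cost of transferring rows that are later discarded.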
5.2 Interfaces of Sources to the World
(** This section has the purpose of showing what kind of interfaces the sources provide so that
wrappers can communicate with them)
5.2.1 General issues with interfaces
Standardized interfaces
Many interfaces are intended for reuse by many applications. Therefore, they separate
their functionality into two groups.
- Information regarding the reusable aspects of the interface (e.g., connections, sessions,
supported units of transfer (files, whole records, etc.)).
- Application-specific information (e.g., record layouts of tabular text files, the set of
files, specific object instances).
Data resolution
Division of application data encoding into up to two levels
- Addressable by the interface (e.g., files via FTP, record column values via ODBC)
- Not addressable by the interface (e.g., FTP can be used to retrieve an XML file, but
FTP doesn't provide specific element extraction from that XML file.)
Applications may have some freedom about where to encode information. For example, an
FTP server for image files could serve one file per image (with a helpful file name), or could
serve just one .zip file which must be opened by the client to choose the desired image file.
Data types
- Interface basic data types may or may not correspond well to the basic data types
used to describe data at the data source.
- Example: In HTTP, all data is represented as a string. This works great for web
pages. But if HTTP is being used to transfer a sound file, the string data type isn't
particularly useful.
Interface semantics
- Data sources and data integrators are capable of encoding whatever information they
- Some interface designs help provide hints to the semantics of application data moved
across the interface. (E.g., ODBC allows its users to refer to a specific table name, and
those names might provide helpful hints as to the application semantics of the table
contents.)
- Why different interfaces suit different application data models: Some interfaces are
designed in a way that allows for obvious mapping between application data models
and the interface. (E.g., ODBC with a relational database, or CORBA with an
object-oriented system.)
- Dealing with semantics of the application-specific data is covered in another section
(needs an xref)
5.2.2 Survey of Common Interfaces
It's instructive to consider some different kinds of non-application-specific interfaces.
Non-network interfaces
- Difficulties
Data integration may involve two or more computers that must communicate.
A data source that only has local interfaces may therefore require some
integration software to be installed on the data source's computer
- Examples
Writing to files on a local disk drive.
Displaying information on a local GUI only (e.g., MS Windows, not X).
Network-capable interfaces
Interfaces that allow an application to readily communicate with other applications that
are either local or on a different computer.
- Difficulties
Potentially independent failures of communicating applications
Potentially significant communications costs / performance issues
- Examples
FTP
CORBA
6 Materialized View Management
We are going to discuss two major subtopics under this heading, giving higher emphasis to
the second one:
Design and selection of views to materialize
Maintenance of the materialized views
6.1 Design and Selection of Views to Materialize
Major Goals:
to minimize total query response time
to minimize the cost of maintaining the selected views
6.2 The Problem of View Maintenance
6.2.1 Definition
traditional view update problem and why this one is different and more difficult
As the database changes because of updates applied to the base relations, the materialized
views may also require change. A materialized view can always be brought up to date by
re-evaluating the view definition. However, complete re-evaluation is wasteful.
"heuristic of inertia" - only a part of the view changes in response to changes in the base
relations
steps of view maintenance:
- propagation (computing changes to the view)
- refreshing (applying changes to the materialized view)
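The two steps can be sketched in Python for one simple kind of view. The sketch assumes a hypothetical base relation of (part, quantity) sales rows with strictly positive quantities and a view holding the total quantity per part; the function names are illustrative, and this is not one of the published maintenance algorithms.

```python
from collections import Counter
from typing import Iterable, Tuple

Sale = Tuple[str, int]  # (part, quantity): rows of a hypothetical base relation


def evaluate_view(base: Iterable[Sale]) -> Counter:
    """Full re-evaluation: total quantity sold per part."""
    view = Counter()
    for part, quantity in base:
        view[part] += quantity
    return view


def propagate(inserted: Iterable[Sale], deleted: Iterable[Sale]) -> Counter:
    """Propagation: compute the change to the view from the base changes only."""
    delta = Counter()
    for part, quantity in inserted:
        delta[part] += quantity
    for part, quantity in deleted:
        delta[part] -= quantity
    return delta


def refresh(view: Counter, delta: Counter) -> None:
    """Refreshing: apply the computed change to the materialized view in place."""
    for part, change in delta.items():
        view[part] += change
        if view[part] == 0:
            # Assumes positive quantities, so a zero total means no rows remain.
            del view[part]


base = [("bolt", 10), ("nut", 5), ("bolt", 2)]
view = evaluate_view(base)
refresh(view, propagate(inserted=[("nut", 1), ("washer", 4)], deleted=[("bolt", 2)]))
```

Only the three changed rows are touched, yet the result equals what `evaluate_view` would produce on the updated base relation, which is the point of the "heuristic of inertia".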
6.2.2 Dimensions
Available Information
materialized view, base relations, other views, integrity constraints, etc.
Allowable Modifications
insertions, deletions, updates, sets of each, group updates, change of view definition, etc.
Expressiveness of the View Definition Language
conjunctive queries, duplicates, aggregation, union, recursion, negation, select, project,
join, SPJ, etc.
Database Instance
Complexity
- Complexity of the View Maintenance Language
6.2.3 View Maintenance Policies
when to apply maintenance procedures on the materialized views
Immediate View Maintenance
Refreshing is done within the transaction that changes the base data. Slow transactions;
faster queries and always up-to-date results.
Deferred View Maintenance
- lazily, at query time
fast transactions; slow query speed.
- forced, after a certain amount of change to the base data
results not up to date; good transaction and query time.
- periodically, at certain time intervals
results not up to date; good transaction and query time.
comparison of all in general; how the decision of when to use which one is made.
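The trade-off between immediate and lazy deferred maintenance can be sketched in Python with a deliberately trivial view, a row count. The class `CountView` and its policy names are hypothetical; forced and periodic policies would differ only in what triggers the catch-up step.

```python
class CountView:
    """Materialized COUNT(*) over a base table, maintained under a chosen policy."""

    def __init__(self, policy: str) -> None:
        if policy not in ("immediate", "lazy"):
            raise ValueError("policy must be 'immediate' or 'lazy'")
        self.policy = policy
        self.base = []     # base data
        self.count = 0     # materialized view contents
        self.pending = 0   # changes not yet applied to the view

    def insert(self, row) -> None:
        """The updating transaction."""
        self.base.append(row)
        if self.policy == "immediate":
            self.count += 1    # refresh inside the transaction
        else:
            self.pending += 1  # only log the change; the transaction stays cheap

    def query(self) -> int:
        """Reading the view."""
        if self.pending:       # lazy policy: catch up at query time
            self.count += self.pending
            self.pending = 0
        return self.count


immediate, lazy = CountView("immediate"), CountView("lazy")
for row in ("r1", "r2", "r3"):
    immediate.insert(row)
    lazy.insert(row)
```

After the three inserts, `immediate.count` is already 3, while `lazy.count` is still 0 with three pending changes; the lazy view pays the maintenance cost on the first call to `lazy.query()`.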
6.3 Incremental View Maintenance
pre-update vs post-update algorithms
various algorithms will be summarized
6.4 View Maintenance Anomalies (Consistency Issues)
caused by decoupling between view definition and base data
definition of correctness and levels of correctness
solutions:
- recomputing the views
- storing copies of base relations
- ECA (Eager Compensating Algorithm)
6.5 Update Filtering
Detection of base data updates that are irrelevant to the view (i.e., have no effect on the state
of the view) will be discussed. For such updates, we do not need to perform any maintenance.
Thus, update filtering makes maintenance more efficient by preventing redundant work.
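For the simplest case, a selection view, the relevance test can be sketched in Python. The view `SELECT * FROM orders WHERE amount > 100`, the function names and the row format are assumptions made for illustration; views with joins or aggregates need more elaborate tests.

```python
from typing import Callable, Dict, Optional

Row = Dict[str, object]


def is_relevant(predicate: Callable[[Row], bool],
                old: Optional[Row], new: Optional[Row]) -> bool:
    """An update (old -> new) can affect a selection view only if the old or
    the new version of the tuple satisfies the view's predicate.
    `old` is None for an insertion, `new` is None for a deletion."""
    return any(row is not None and predicate(row) for row in (old, new))


def big_order(row: Row) -> bool:
    """Predicate of the hypothetical view: SELECT * FROM orders WHERE amount > 100."""
    return row["amount"] > 100
```

The test is conservative: an update flagged as relevant may still leave the view unchanged, but an update flagged as irrelevant is guaranteed not to affect it. For example, inserting an order with amount 50 gives `is_relevant(big_order, None, {"amount": 50}) == False`, so no maintenance is needed.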
6.6 View Self-Maintenance
In general, Self-Maintenance refers to views being maintained without using all the base data.
There exist different notions of its exact meaning depending on how much information is
available. At minimum, the view update is performed at the integrated system by only knowing the
particular base data update that has occurred, the view definitions and the materialized data.
- Given a view, is it self-maintainable?
- If it is self-maintainable, how?
Single-View Self-Maintenance
does not consider the other materialized views
Multiple-View Self-Maintenance
makes use of contents of the other materialized views to minimize base data access
Updates as a whole
i.e., batch updates; may make maintenance easier and more efficient
Making views self-maintainable
When the answer to the first question listed above is No, we can define and materialize a
minimal set of auxiliary views to make the original non-maintainable view maintainable.
Here, basically we are increasing the amount of information available at the integrated
system level.
6.7 Dynamic View Management
problems of static view selection and maintenance
dynamic view selection and maintenance
performance parameters
- space (taken up by the materialized views)
- workload (changing query workload)
- maintenance window (how often we would like to perform maintenance and how long
the system can be tolerated to be non-functional)
7 Systems
TSIMMIS (Stanford)
Ariadne (USC)
Araneus (Web-Bases, University of Rome, Italy)
Infomaster (Stanford)
Information Manifold (AT&T)
Tukwila (University of Washington)
SIMS (USC)
Carnot (MCC)
WHIPS (Stanford)
Squirrel
** subject to modifications
8 Open Questions and Research Issues
** will be formed later from everybody's list of research issues from the above sections
9 Concluding Remarks