Centrum voor Wiskunde en Informatica
REPORT
RAPPORT
Memory Aware Query Scheduling in a Database Cluster
F. Waas, M.L. Kersten
Information Systems (INS)
Report INS-R0021
ISSN 1386-3681
CWI
P.O. Box 94079
1090 GB Amsterdam
The Netherlands
CWI is the National Research Institute for Mathematics
and Computer Science. CWI is part of the Stichting
Mathematisch Centrum (SMC), the Dutch foundation
for promotion of mathematics and computer science
and their applications.
SMC is sponsored by the Netherlands Organization for
Scientific Research (NWO). CWI is a member of
ERCIM, the European Research Consortium for
Informatics and Mathematics.
Copyright © Stichting Mathematisch Centrum
P.O. Box 94079, 1090 GB Amsterdam (NL)
Kruislaan 413, 1098 SJ Amsterdam (NL)
Telephone +31 20 592 9333
Telefax +31 20 592 4199
FlorianWaas DEIS{CSITE-CNR UniversitadiBologna 40136Bologna Italy MartinL. Kersten CWI P.O.Box94079 1090GBAmsterdam TheNetherlands ABSTRACT
Querythroughputisoneoftheprimaryoptimizationgoalsininteractiveweb-basedinformationsystemsinorder toachievetheperformancenecessarytoservelargeuser communities. Queriesinthisapplicationdomaindier signicantlyfromthoseintraditionaldatabaseapplications: theyareoflowercomplexityandalmostexclusively read-only. Thearchitectureweproposehereisspecicallytailoredtotakeadvantageofthequerycharacteristics. Itisbasedonalargeparallelshared-nothingdatabaseclusterwhereeachnoderunsaseparateserverwithafully replicatedcopyofthedatabase. Aqueryisassignedandentirelyexecutedononesinglenodeavoidingnetwork contention or synchronizationeects. However, theactual keytoenhanced throughput isa resource eÆcient scheduling ofthearriving queries. Wedevelop asimpleand robustscheduling schemethattakes thecurrently memoryresidentdataateachserverintoaccountand tradesomemoryre-useandexecutiontime,reordering queriesasnecessary.
Ourexperimentalevaluationdemonstratestheeectivenesswhenscalingthesystembeyondhundredsofnodes showingsuper-linearspeedup.
1991ACMComputingClassicationSystem:H.2.4Databasesystems,Distributeddatabases,Paralleldatabases KeywordsandPhrases: Queryrouting,databasecluster,main-memorydatabases.
Note: FundedbyCWI/INS1theme.
1. Introduction
Asignicantnumberofweb-basedinformationsystemsrelyondatabasetechnologytoservelarge user communities which makes scalability a key issue for the design of web-enabled database
systems. Parallel processing and data replication, are necessary to deal with the peak loads encountered. Likewise,aneective query dispatchingscheme isneeded tolevelthesystemload aswellasto guaranteequality-of-serviceinterms of responsetime.
In this paper we are concerned with initial experiences with a multi-media portal under construction based on the Monet database system [BK99 ]. The system is intended to provide eÆcientaccesstoa largecollectionof indexedmulti-mediaobjects. It isendemictothiskindof
informationsystem that user interactionis dominated byread accesses. A number of systems with similarrequirements regarding the deployed database backend have been developed and manymore arecurrentlyunderconstruction.
With each user interaction, the interface emits a number of queries to the database that ideally lead to an answer set of a few tens of candidate results. Involving accesses to dierent multi-dimensionalindexes,theevaluation ofsuchqueriesis usuallyintheorder of fewseconds.
applications. Moreover, thedeviationof runningtimeamong thequeriesislimited,notleastto ensureacceptable responsetimes.
The primary challenge in this setting is to develop processing techniques to optimize the
querythroughput. Parallelprocessingisanessentialelementtoachievethis,however,astraight forward recasting of methods developed for parallel databases doesnot applyhere since most solutions devised in thisarea are almost exclusively geared to tackle highly complex and long
running queries. There, queries are usually parallelized on a granularity of partial plans or even single operators, i.e. single operators like the join of two tables are executed in parallel on dierent nodesinvolving exchangeof partialresultsamong thesinglenodes. However, these
techniques are ineective for the kind of query we are considering since communication and coordinationoverheadswould outweigh theactualbenets.
In thispaper, we proposea parallelqueryprocessingarchitecturethat can take advantageof
the query characteristic by its physical design, suitable query scheduling,and the way queries areexecuted.
Theplatformofoperationisashared-nothingenvironment|i.e.aclusterofinexpensivePCs|
whereeachnoderunsaMonetserverwithafullyreplicatedcopyofthedatabase. Onemachine isdistinguishedascoordinatornodethatdispatchesthearrivingqueriestotheserversaccording to aschedulingstrategy. The scheduling schema we develop inthispaperdiersradicallyfrom previousworkaswedonottrytomodelvarioussystemparametersinordertoexploitprimarily
idle system resources, but take into account what data is memory-resident at the servers, i.e. cachedbytheservers. Thealgorithmisbasedonametricthatdeterminesthedistancebetweena serverandaquery|thelessthisdistance,themoresimilarthestateofthememoryatthisserver
and what isrequired to process thisquery. Moreover, we investigate possibilitiesof re-ordering and deferredexecutionofqueriesto furtherreduceexecutioncosts. Once aqueryisassigned to aserveritisexecutedinisolationonthisserver,sonosynchronizationorcommunicationwithin
theclusterisneededfreeinginterconnectionbandwidthforshippingofbothqueriesandresults. Since we aredealing withread-only accesses, we do not have to considertransaction
mecha-nismsto keepthereplicateddata consistent acrossthedatabasecluster. Rather,thedatabases areperiodicallyupdatedbymirroringamaster database.
The experimental evaluation of the techniques proposed show substantial savings over con-ventional greedy scheduling that takes only the machines' workload into account. In a large
numberof experiments we investigate theimpact of individualparameters closely. Our results conrmthearchitecturaldecisions showingexcellentscaling behavior.
The remainder of this paper is organized as follows. We review related work in Section 2.
In Section 3 we present the architecture and describe the query model in Section 4. Section 5 discussesthemodelingoftheserverpoolandtheschedulingalgorithm. InSection6,wepresent a comprehensive performance analysis. Section 7 contains a discussionof thedesign decisions.
We concludethe paperwithSection8.
2. Related Work
Parallelqueryprocessinghasbeenstudiedinalargevarietyoffacets, seee.g.[PMC +
90 ,DG92 ,
HS93 ,WFA95,Gra95 ]. Mostofrelatedworkinthiseldconcentratedonpossibilitiestospeedup highly complex queries with long running times. Approaches as taken in [HM94] and [GI96 ] suggest a decomposition of the query plans into sub-plans which arethen executed inparallel
ondierentnodesoftheparallelprocessingenvironment. Thegranularityofthisdecomposition variesandcan beasneasparallelizingsingleoperatorasstudiedforexamplein[SD89,SD90 ,
These approaches have in common that they require communication between single nodes forshippingorexchangingpartialresults. Thiscausesnetworkcontention andsynchronization eects where nodes have to wait forothers to complete their tasks rst. As a result, a
paral-lelization along these lines only pays ifthe query is of suÆciently high complexity. Otherwise communicationoverhead and synchronizationeects outweigh performance gains.
Moreover, parallel processing as outlined above scales only for small numbers of nodes
ef-fectively. In shared-nothing architectures, network contention becomes increasingly a bottle-neck; in the case of shared-everything, the high degree of resource sharing limits the scaling [Sto86 , NZT96].
General memoryallocation issueshave beenexplored extensivelywithrespect to various as-pectsofqueryprocessing. In[MD93],authorsproposeddynamicmemoryallocationschemesfor multi-queryworkloadto levelmemoryallocationwithoutsharing resident dataamongdierent
queries. Byanalyzingqueriesand theircommon sub-expressionsthere-use ofmemoryresident data is often a by-eect [SSN94]. In [MSD93 ], authors consider batch scheduling for parallel processing. However, main-memoryis in bothcases transparentlyviewed as a centralresource
and datalocationwithindistributedmemoryhasnot beenconsidered.
In thecontext of transactionprocessing,several query routingschemas fordatabaseclusters havebeeninvestigated(seee.g.[Tho87 ,FGND93,RBS00 ]). Thiseldofapplicationdiersfrom theproblemat handinits preliminaries: queriesare usuallyofcomplexnature and updates to
the database need to be propagated over the complete cluster. One of the major goals is for instanceto schedulequeriessothatlockingcon ictsareavoided. Foracomprehensiveoverview on thissubject seee.g. [Rah92 ].
3. Architecture
Figure 1 shows the architecture of our system. It consists of a cluster of database servers managed by a central scheduler node. All machines have separate main-memory, CPU, and
disks not sharing any resources other than network bandwidth. We use Monet, the main-memory database system developed at CWI, as database server [BK95 , BK99 ]. Besides its
vertical fragmented data model, Monet is distinguishedby its memory awareness, i.e. it solely usesoperatingsystemprimitivesforits memorymanagement,mappingdatabasetablesdirectly to virtualmemory,to avoid the overhead of a proprietary buermanager simulatinga virtual
memorylayer. Moreoveritprovidesspecicallycacheawareoperatorimplementations[BMK99 ] to maximizesystemutilization onrunningqueries.
Additionally,a pool of web-servers forms the front-end of the system which clients interact
with. The web-servers receive either parameterized queries using text-based forms, or they interactwith visualqueryformulationtools.
Processing of a client query is done in 7 steps: Users formulate their query using the web-interface(Fig.1,(1)). Attheweb-server,thequeryisre-formulatedusingtheinternalprocedural
query representation ofthe databasesystem|inourcase MIL,thequery languageof Monet| and submitted to the query scheduler (2). The scheduler maintains a queue of queries that are to be executed. By analyzing the data requirements for the execution of a query and the
dataresidentinmain-memoryattheservers,theschedulerdeterminesafavorableassignmentof queriestoservers(4). Thequeryisexecutedontheassignednode(5)andtheresultisreturned to thefront-endhost (6).
1
Afterformatting theresult,itis shippedto the user(7).
This architecture directly aims at the throughputoptimization of compact querieswhere an execution on a single node is the most eÆcient kind of processing. But this architecture also
1
1
2
Web-application
Server
Web-application
Server
Web-application
Server
Database
Server
Database
Server
Database
Server
6
7
U
7
U
Master
Database
1
User
User
User
4
3
6
5
U
U
Periodical update of replicated databases
Query scheduling
Online data sources
Robot
Robot
Feature Analysis
Indexing
Scheduler
Query
Figure1: Query processing architecture
exploits the second query characteristic: the impreciseness of the data. For example, data gathered by robots from the Internet to buildup a multi-media index is updated only at low frequency. Thus,theserverclusterdonothavetobekeptinsyncaswouldbethecaseifupdates
by users were allowed. Instead, the databases are updated periodicallyreplicating the master database(U). The frequencyof updatesdependsnaturally on theapplicationdomain.
4. Query Model
The choice of Monet asdatabase software forthe back-endclusterimpliesa specic model for thequery execution. The vertical fragmentationof the tablesin Monet causesbulk-processing to be more eÆcient thanpipeliningtechniques. Asa consequence, there areno more than two
tables processed on a single CPU at a time. Note, this executionmodel doesnot impose any restrictionsontheshapeoftheexecutionplans|bothlinearandbushyplansarefeasible. Access to a table can be either of type load or scan. load reads a table from disk into memory to
make it part of thehot set, e.g. used withinner relations of nested-loop joins. scan reads the table, butdoesnotkeep thedata inmemory afterthe actualoperationis performed,e.g. used inselections orfor theprobe relationof a hash join. The buildingof a hash table can be seen asspecialtypeof loadasthe resultis onlyaccessiblevia thehash attribute.
The costsfor executing a relationalgebra operator consists of thecosts of loading/scanning ofthetable plustheactualcostsfortheoperation. Fittingthepiecestogether, we candescribe
a queryasa sequence Q=h(t 1 ;a 1 ;l 1 ;w 1 );:::;(t n ;a n ;l n ;w n )i
whereeachquadruplecorrespondstoanoperatoror|asincaseofajoin|toapartialoperator. t
i
speciesthetable necessaryfortheoperation. a i
denotesthehashattribute,i.e.theoperator
accesses t i
via thisattribute; a i
= ifnotapplicable.
The associated loading costs, if the table is notalready resident in main-memory, are given byl
i
. We denotethe totalloading costs of aquery as
L(Q)= X i l i :
The costs to execute the operator, the operator costs,are denoted by w i
. We denote the total
operator costs of aquery as
W(Q)= X i w i : Note, bothl i and w i
areexpressed interms ofthe sameunit to achieve apropercomparison.
A queriesexecutiontime ona \cold" server, i.e.no datais loaded yet,is
(Q)=L(Q)+W(Q):
Ifall tablesneeded byQare alreadyinmemory|andincase ofhashtables, are hashedbythe
properattribute|the executioncostsof thequeryamountsto W(Q)only.
Examples
HerearesomeexamplestoillustratethismodelingbasedonMonetperformancecharacteristics. The twotables A andBused inthisexampleareof size8.8MBand 4.95MB, respectively. Our
test platformachieved abandwidthof 5.5MB/sfordisk access.
Nested-Loop Join,A1B
Q=h(A;;1:6;0);(B;;0:9;2 :6) i
TableAtakes1.6stoload,nooperatorcostsoccur. TableB takes0.9stoload;performing
thejoinrequires 2.6s.
HashJoin, A1B
Q=h(A;a;2:7;0);(B;;0:9;1 :1) i
Similartopreviousbutnow,Amustbeahashtablewithhashattributea. Henceloading A ismore expensive asitincludesbuildingthehashtable.
5. Query Scheduling
The query scheduling comprises several elements. Besides a model for the servers we dene a server-querydistancewhichcapturesthepotentialre-useofmemoryresidentdata. Additionally,
5.1 Servers
For thescheduling, a server of thedatabase clusteris modeled byits state of memory and the
workload. Thestateofmemoryisthesetoftablesresidenttogetherwithareplacementstrategy. Weareconsideringbasetablesonlyanddiscardintermediateresultsoftheprocessingassoonas they areno longerused. As areplacement strategyweuse LRUas itexhibits thebestaverage performance. Theloadinganddroppingoftablesisdoneviathememorymappingfunctionality
of theoperatingsystem. To maintainsuÆcient controloverthe memoryallocationthroughout thecompleteclusterweloadanddroponlycompletetables. Thisway,theschedulercanrelyon theinformationwhichtables arememory-resident,i.e.accessingthem willnotcause additional
costs forswapping. Swapping mayonlyoccur when all unused tablesare already dropped but thememoryrequirementsof thecurrent operationarestillnotmet.
Fortheworkload,wedistinguishthetwostates idle and busy,i.e.weassignone queryto one
server at a time. This isnot justa simplicationto facilitate thescheduling buta necessity in main-memorydatabaseswherecache awarenessand concurrentmemoryaccessareofdistinctly higherimportance thanin I/O-dominateddatabase models[MBK00 ]. We modelthe workload
asfunctionJ(S)whichreturnstheexpectedtimeofjobcompletionatserverS,givenitscurrent workload, i.e. the expected time from now when S willbecome idle. If theserver is idle, J(S) evaluatesto 0. J willbeusedintheschedulertondthenode thatwillnishitsjob next. J is
computedbyconventionalcostformulaeknownfromsequentialqueryprocessing: Giventhetime x
0
thecurrentlyrunningqueryhasbeenassignedtoserverS,xthetimeJ isevaluated,andethe expectedrunningtimeofthequery,thetimeofjob completioncomputes to J(S)=x
0
+e x.
Seealso Section7 fora discussionon theaccuracy ofJ.
5.2 DistanceMetric
We dene the server-query distance as the costs to load the tables for a given query Q on a server S: d(Q;S)= X i R (t i ;a i )l i where R (t i ;a i )= ( 0; ift i
is memory-resident at S and hashedbyattributea i
1; else
indicateswhethertablet i
isresidentinmemoryat S. Incaset i
isrequiredashashtable,Ralso checks whetherthetable is hashedbyattributea
i .
Scheduling a batch of queries optimally on k servers is nding a division into k batches B
1
;:::;B k
,each of which are executed sequentially on one server, such thatthe running time of thebatch withthelongest completiontime
max i X j d(Q j ;S)+W(Q j ) ; Q j 2B i isminimal. 5.3 Scheduling Algorithm
Algorithm mas
while queuenotemptydo
c min
1
foreachquery Q intop n
(queue) do
for i=1 to number ofservers do
c =d(Q;S i )+W(Q)+J(S i ) if c<c min do b Q Q b S S i c min c done done
if expired(Q)then break
done assignquery b Qto server b S remove b Qfrom queue done
Figure 2: Scheduling Algorithm mas
outline of ouralgorithm calledMemory Aware Scheduling (mas). It iterates over the queue of arrivingqueries, selecting one at atime, and determinesthebestad-hocassignment.
In detail,we examinetherstnelementsof thequeue|orless ifthe queuedoesnotcontain
nqueries. Weinvestigatetheimpactofnandsuitablevaluesforitinthenextsection. Foreach of then queries,we computec whichconsistsof thedistance toall servers S
i
plustheoperator costofthequeryand theexpectedtimeat whichserverS
i becomesavailable, J(S i ). We record thepair( b Q; b
S)withthelowestvalueforc. After examinationofallnqueries, we assign b Qto
b S
which meansthat b
Q willbe executed on b
S assoon asthisservergets idle.
The algorithmisinO(Nns)whereN is thenumberofqueriesintotal, nisthenumberof queriesconsideredineachrun,andsisthenumberofserversavailable,i.e.thealgorithmislinear
in the number of queries. To give any meaningful bounds on the performance is particularly diÆcultbecause oftheLRU replacement of tables.
5.4 Deadlines
Inordertogivetheuseraguaranteeofservice,wetageveryquerywithadeadline. Thisdeadline
refers to the latest point in time thequery has to be assigned to a server for execution, i.e. as soon as the deadline of a query expires, the scheduler has no other choice than assigning this very query to aserver.
We tag all queries with a time stamp according to their arrival. In other words queries are not forced by deadlines to overtake others, though it is often benecial. As a result, we need onlycheck therstqueryof thecurrent topn batchfordeadlineexpiration. Iftherstquery's
deadlineis expired,we donotneedto examineanyotherqueryinthebatchbuthave toassign therst immediately to a server. Otherwise,ifthe rst query's deadlineis not yet expired, no
Size 20MB 30MB 40MB 50MB >50MB
Numberoftables 233 87 24 15 17
Table1: Sizesand numbersof basetables
all servers ensuresthebestassignment incasethe deadlineexpired.
6. ExperimentalResults
Inthissection,wedescribeexperimentalresultsobtainedwithasimulator. Wechosetosimulate
thesysteminordertoexperimentwithparametersthatarestrictlylimitedbyanactualhardware conguration, such as size of the database cluster ormain-memory availableat the individual servers.
6.1 Preliminaries
Since the full multi-media Acoi demonstrator database is still under construction, we had to
connetheexperimentstothepartalready operational. Thepartchosenistheindexsystemof theACMAnthology,whichisincludedinthedemonstratortoassess itscapabilities inthearea of XML-based database processing. We used statistical data available from an actual Monet
database instance which contains the complete XML code of the Anthology decomposed into theverticallyfragmenteddatamodel[SKWW00 ]. Includingindexes,thedatabasecontains376 tables with up to nearly 60000 rows. The numbers of tables according to their sizes are given
in Table 1. Typical queriesuse some 5 tables orless, seldom up to 10 or more. We generated batches of 10,000 and 100,000 queries of exponentially distributed sizes accessing 5 tables on average.
We donotconsidermechanismsto reject user queriesdueto overloadsincethiscan be done alreadyatthefront-endlevel. Modelingarrivalratesprobabilisticallyisnotnecessaryasqueries donotsignicantlydierinrunningtime,thusinformationaboutthequeriesdoesnotneedtobe
consideredtomakeachoicewhichqueriestoacceptandwhichtoreject. Forourexperiments,we assumethemaximalexpectedqueryarrivalratetomodeltheworstcasebehaviorwithmaximum load. Moreover, we assume all hostswithin the database cluster are of identical conguration
regardingthecost relevant parameters.
Since the replacement of tables along an LRU strategy makes analytical modeling of the algorithm hard, we implemented Graham's list scheduling (gls), which is based on workload
gures [Gra69], forcomparison. We adapted the original algorithm to t the online arrival of queries, i.e. to use thekind of look-ahead we introducedin masabove. Graham's algorithm is known to be highly eective despite its simplicityovercoming some characteristic disturbances also known as scheduling anomalies. These anomalies occur with jobs that dier signicantly
in completion time. Please note, we run Graham's algorithm in exactly the same setting, i.e. withLRUreplacementoftablesasnecessary. Asaresult,Graham'salgorithmalso protsfrom re-useof memoryresident data.
The mainparameters we want to investigate are, foremost, the numberof servers, the look-ahead duringthe scheduling, and the amount of memory available at each server. Note, there are two principal ways of comparison: a one-on-one comparison of both algorithms run on
identic congurations of the platform (relative performance), or a comparison of the scaling characteristics. Typical examplesforthelatter arespeedupand scale-up. We willcompare the two algorithmsusingbothprincipleswhereadequate.
0.50
0.60
0.70
0.80
0.90
1.00
1.10
0
10
20
30
40
50
60
70
80
90
100
Relative execution time
Number of servers
GLS
MAS
0.50
0.60
0.70
0.80
0.90
1.00
1.10
0
10
20
30
40
50
60
70
80
90
100
Relative execution time
Number of servers
GLS
MAS
Figure 3: Relativeperformanceof warm over coldservers
6.2 Warm/Cold Processing
In thisrst experimentweinvestigate thedierences betweenwarm and coldservers. We refer
to a server as cold if no other tables than system tables have been loaded. If more than 80% of the available memory are allocated we call a server warm.
2
This is not only relevant for laterexperimentsbutalsofortherealapplicationscenariowhenthesystemhastobeshutdown
and re-started fortechnical reasons like maintenance etc. We can expect dierent eectsif the amountofmemoryperserverisvaried. Forsmallservercongurationsthewarmupiscompleted earlierthanforlargeones. However, more important istheamountof memoryintotal,i.e. the
numberof servers. Figure 3 shows the relative execution times of batches of 10,000 queries as functionof thenumberservers. Forexample theleftmost data point for masre ects the ratio of executiontimes formas on acold to that on a warm server. The left graph shows times for
a servercongurationof 64,theright for256MB.
Forgls, coldandwarmexecutiontimes arealmostthesame. Warm processing isonlyup to 5%quicker. Formas theratiodierssignicantlyshowing gainsof more than35%. Especially
fortherange up to 50 servers,the largerconguration (right)achievesbetter performance,i.e. the curve is steeper. Formore servers, the impact of the larger amount of memory decreases: for 100 servers and more, resultsare virtually equal. We address theissue of memory sizes in
more detailin6.5.
Allfurtherresultspresented inthissectionare obtainedfrom warm servers.
6.3 Reordering of Queries
Thenext experimentinvestigates theimpactof thelook-aheadduringscheduling,i.e.the
max-imalnumberof queriesthatmay be re-orderedbetweeneach assignment.
Figure 4 shows execution times fora single server (left)and a cluster of 100 servers (right). In both cases the server(s) had 64MB of memory. The execution time is shown as a function of thelook-ahead, scaled to gls's rst datapoint, whichcorrespondsto arst-come-rst-serve
(FCFS) scheduling. Thesizeofthecompletequerybatch was10,000. Inthecase ofone server, thelook-aheadisofhighimportanceformasandsavingscanamountupto20%oftheexecution time. gls,however, doesnotsignicantlyprot from look-ahead.
2
0.75
0.80
0.85
0.90
0.95
1.00
1.05
0
200
400
600
800
1000 1200 1400 1600 1800 2000
Execution time [scaled to FCFS]
Look-ahead [queries]
GLS
MAS
0.40
0.50
0.60
0.70
0.80
0.90
1.00
1.10
0
200
400
600
800
1000 1200 1400 1600 1800 2000
Execution time [scaled to FCFS]
Look-ahead [queries]
GLS
MAS
Figure4: Impact of look-ahead; Singleserver (left)and clusterof100 servers(right)
Inthecaseof100servers,thesituationchanges. glsrelativelyimprovesmorewithincreasing look-ahead but for mas hardly any improvement is noticeable, though its execution time is substantially below the one of gls. This is due to the fact that in a pool of 100 servers
many server oer a very low server-query distance and dierences are often very small. As a consequencere-ordering cannothelpndingasignicantlylowerserver-querydistance asitdid inthesequentialcase.
In all furtherexperimentswe usea look-ahead of 100 queriesunlessstatedotherwise.
6.4 Speedup andScale-up
The two fundamental measures when investigating the performance of parallel systems are speedup and scale-up. The previous quanties the gains when scaling up the platform but
keepingtheproblemsizeconstant,thelatterdescribesthesystem'sabilitytocopewithproblem sizesgrowingproportionallywith theplatform(cf.e.g. [DG92]).
Figure 5 shows the speedup for a query batch of 100,000 queries evaluated on up to 4096
servers with 64MB memory each. As the plot shows, gls achieves slightlysub-linearspeedup whereasmasachieves even super-linear speedupwhichtranslates to eective re-use of memory residentdata. Tobetterillustratethisphenomenon,consideraverybasicexampleusing4tables
A,B,C and D of same size such that only two table t into memory at the same time, and a batch of 4queries: Q 1 = h(A;2;2);(B;2;2)i Q 2 = h(C ;2;2);(D;2;2)i Q 3 = h(D;2;2)i Q 4 = h(A;2;2)i
On a single machine, the total costs amount to 8+8+4+4 = 24. With linear speedup, we wouldexpect12cost unitsfora parallelizationontwo hosts. masassignsqueriesQ
1 and Q 4 to one, Q 2 and Q 3
to the other machine which amountsto 8+2=10 costsat each node. Hence,
thetotal executiontime is10 compared to 24on a singlenode which givesaspeedupof2.4. Figure 6shows thescale-upforthesameserverpoolconguration. masmaintainsascale-up
0
1000
2000
3000
4000
5000
6000
7000
0
512
1024
1536
2048
2560
3072
3584
4096
Speedup
Number of servers
linear
MS
GLS
Figure 5: Speedup0.7
0.8
0.9
1
1.1
1.2
1.3
0
512
1024
1536
2048
2560
3072
3584
4096
Scaleup
Number of servers
constant
MAS
GLS
Figure 6: Scaleup 6.5 Direct ComparisonInourlastexperiment,wecomparethealgorithmsdirectly,i.e.we determinetheratioof mas's executiontimetotheoneof glsforeachindividualparametersetting. Valuesbelow1.0indicate
thatmasoutperformsgls. Asparameteroftheexperimentwevarynumberofservers,memory available, and look-ahead.
Figure 7 shows the relative execution time as a function of the number of servers and the amountofmemoryperserver. Thenumberofserversvariesbetween5and100,memorybetween
32and640MB.Asthediagramdisplays,masoutperformedglsinall400individualexperiments achieving execution times as short as 40% of those of gls. However, as the diagram reveals,
thisisnomonotonicprocess,rather,withincreasingmemorysizes,glsmanagesto \catchup", though only to a certain degree. See for example the front row where, after mas increases its lead (0.4 at ca. 160MB) it cannot further improve on the running time whereas gls becomes
increasinglybetter. Formore thanca. 320MBneither algorithmcan achieve anyimprovement, thustheplateaux.
Figure 8 shows the relative performance as function of the number of servers and the look-ahead. All servers had 64MB of memory. The plot shows results that aÆrm the previously
foundonesnowalso inthedirectcomparison: Forsmallnumberof servers look-aheadplaysan importantrole (see lastrow),whichfades asthenumberof server increases(see front row).
7. Discussion
Whileimplementingandtestingdierentversionsofmaswemadeseveraldesigndecisionswhich arenotself-evidentat rst sight and deserve to bediscussedmore elaborately inthefollowing. Themostimportantquestioninthedesignprocesswaswhethertoincludemoresystem
infor-mation abouttheservers ornot. Typically,cost modelsin paralleldatabases try to model the system|particularlythesharedresourceslikenetworkorshareddisks|asdetailedaspossible. Wechosenottoincorporatethiskindofinformationforthreereasons: Firstly,thetypeofquery
we are dealingwith cannot prot much from the kindof parallelprocessing these cost models havebeendevelopedfor. Secondly,acostcomputationthattakesdetailsatthisnegranularity into account is computational too expensive to deliver cost estimates for hundreds of servers
32
160
320
480
640
Memory [MB]
100
75
50
25
5
Number of Servers
0.4
0.4
0.5
0.6
0.7
0.8
Figure7: Relativeperformance;
mem-ory available and number of servers varied
1
500
1000
1500
2000
Look ahead [queries]
100
75
50
25
5
Number of Servers
0.6
0.6
0.7
0.8
Figure 8: Relative performance;
look-ahead andnumberof servers varied
informationis hardto maintain. To keepit accurately up-to-date duringthe processing would requirea substantialshare of bothnetworkand processingresources.
In contrast to that the cost values used by mas are of simpler nature. The cost estimates forthequeryexecution time(see Section5.1) canbe assembled fromsequential costestimates. Since thequeriesareemittedbya xedinterface, there isthepossibilityto pre-compilequeries
which arethen instantiatedwith afew parameters. In thatcase highlyaccurate cost estimates can bepre-computed and|like thequeries|instantiatedwith parameters.
The informationnecessarytodescribethesystemfortheschedulingisinsofareasy to
main-tainasitconsistsonlyofafeedbackindicatingthatprocessingofthelastqueryhasterminated, i.e. the result has beenshipped, and what tables are now in main-memory. To avoid idleness between feedback and new assignment, servers can be extended to buerone query while
pro-cessingthepreviouslyassigned. Thefeedbacktogetherwiththefactthatserversdonotmaintain ownquery queuesensures thattheimpact of afew,inaccurate cost estimates,whichcan never becompletely avoided, iskeptdown to aminimum.
Finally,the queryexecution we proposeddoesnotallowfor multi-programming,i.e. running several queries concurrently on one server. This choice was motivated by two facts. Firstly, multi-programmingisnotverybenecialinmain-memorydatabaseslikeMonetasconcurrently
processed queries often get into each other's way, unlike in I/O dominated database systems. Secondly, and more importantlyas this holds for other database back-ends too, the potential gainsofmulti-programmingwhereI/OofonequeryandCPU intensivecomputationofanother
one can be aligned is very limited as it is our foremost goal to avoid costly I/O operations at all.
8. Conclusion
Web-enabled databases and database back-end technology for large web-base information sys-temsare one of thefastest growing segments of thedatabasemarket. Thosesystems challenge thetraditionalrepertoireofoptimizationtechniquesusedindatabasetechnology. Powerful,user
databaseback-endwhichmake throughputoptimization aprimary optimizationgoal.
In thispaper, we propose a parallel query processing architecture and investigated the pos-sibilityto exploitdata sharingbyclever scheduling of thearriving queries. We have developed
mas, a scheduling strategy that tries to maximize the re-useof data resident in main memory across thedatabasecluster. The algorithm isdistinguishedby its simplicityandrobustness on one hand|the information needed to make mas work is easy to obtain, accurate, and needs
onlylittleeortto be keptup-to-date|and itseectivenesson theotherhand.
Our experiments show superiorresults compared to conventional list scheduling in terms of both query throughput and scaling behavior conrming our considerations. Moreover, by
re-usingthedataavailableratherthanassigningdataactively,theschedulingalgorithmadapts to changinghotspots.
Our future work is geared towardextending the scheduling schema to considerintermediate
References
[BK95] P. A. Boncz and M. L. Kersten. Monet: An ImpressionistSketch of an Advanced Database System. In Proc. Basque International Workshop on Information Te ch-nology,San Sebastian,Spain,1995.
[BK99] P.A.Boncz andM. L.Kersten. MIL PrimitivesforQueryinga FragmentedWorld. TheVLDB Journal, 8(2):101{119, 1999.
[BMK99] P. A. Boncz, S. Manegold, and M. L. Kersten. Database Architecture Optimized forthenewBottleneck: Memory Access. InProc.of the Int'l. Conf. on Very Large Data Bases, pages54{65, Edinburgh,UK,1999.
[CHM95] C. Chekuri, W. Hasan, and R. Motwani. Scheduling Problems in Parallel Query Optimization. In Proc. of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pages255{265, SanJose, CA,USA,May1995.
[DG92] D.J. DeWittand J.Gray. ParallelDatabase Systems: The Future of High Perfor-mance DatabaseSystems. Communications of the ACM,35(6):85{98, June1992.
[FGND93] D.F.Ferguson,L.Georgiadis,C.Nikolaou,andK.Davies. GoalOriented,Adaptive TransactionRoutingforHighPerformanceTransactionProcessingSystems.InProc. of the Int'l. Conf. on Paralleland Distributed Information Systems, pages138{147,
SanDiego, CA,USA, January 1993.
[GI96] M. N.Garofalakis and Y. E.Ioannidis. Multi-dimensional ResourceScheduling for Parallel Queries. In Proc. of the ACM SIGMOD Int'l. Conf. on Management of
Data,pages 365{376, Montreal, Canada,June1996.
[GI97] M.N.GarofalakisandY.E.Ioannidis. ParallelQuerySchedulingandOptimization withTime-andSpace-SharedResources. InProc.of the Int'l. Conf. on Very Large
Data Bases, pages296{305, Athens, Greece,September1997.
[Gra69] R. L. Graham. Bounds on Multiprocessing Timing Anomalies. SIAM Journal on Applied Mathematics, 17(2):416{429, March1969.
[Gra95] J.Gray. A Surveyof ParallelDatabase TechniquesandSystems. InTutorial Hand-outsofthe21stInt'l.Conf.onVeryLargeDataBases,Zurich,Switzerland, Septem-ber1995.
[HM94] W.HasanandR.Motwani. OptimizationAlgorithmsforExploitingthe Parallelism-Communication Tradeoin PipeliningParallelism. In Proc. of the Int'l. Conf. on Very Large Data Bases, pages36{47, Santiago, Chile,September1994.
[HM95] W. Hasan and R.Motwani. Coloring AwayCommunicationin ParallelQuery Op-timization. In Proc. of the Int'l. Conf. on Very Large Data Bases, pages 239{250, Zurich,Switzerland,September1995.
[HS93] W. Hong and M. Stonebraker. Optimization of ParallelQuery ExecutionPlans in XPRS. Distributed and Parallel Databases, 1(1):9{32, 1993.
[MBK00] S. Manegold, P. A. Boncz, and M. L. Kersten. What happens during a Join? -DissectingCPU and Memory OptimizationEects. In Proc. of the Int'l. Conf. on Very Large Data Bases, Cairo,Egypt,September2000. Acceptedforpublication.
[MD93] M.MehtaandD.J.DeWitt. DynamicMemoryAllocationforMultiple-QueryW ork-loads.InProc.of theInt'l.Conf.on VeryLargeData Bases,pages354{367, Dublin, Ireland,September1993.
[MSD93] M. Mehta, V. Soloviev, and D. J. DeWitt. Batch Scheduling in Parallel Database Systems. In Proc. of the IEEE Int'l. Conf. on Data Engineering, pages 400{410, Vienna,Austria,April 1993.
[NZT96] M.G.Norman,T.Zurek,andP.Thanisch.MuchAdoAboutShared-Nothing.ACM SIGMODRecord,25(3):16{21, September1996.
[PMC +
90] H.Pirahesh, C. Mohan, J. Cheng, T. S. Liu,and P. Selinger. Parallelismin
Rela-tionalDataBase Systems: Architectural IssuesandDesignApproaches. InProc.of the Int'l. Symp. on Databases in Parallel and Distr. Systems, pages 4{29, Dublin, Ireland,July1990.
[Rah92] E.Rahm.AFramewokforWorkloadAllocationinDistributedTransaction Process-ingSystems. Systems Software Journal,18:171{190, 1992.
[RBS00] U.Rohm, K.Bohm, andH.-J.Schek. OLAPQueryRoutingand PhysicalDesignin aDatabase Cluster. InProc.of the Int'l.Conf. on ExtendingDatabase Technology, Lecture Notes in Computer Science, pages 254{268, Konstanz, Germany, March
2000.
[SD89] D.A.Schneider andD. J DeWitt. A Performance Evaluation of FourParallelJoin AlgorithmsinaShared-NothingMultiprocessorEnvironment. InProc.of the ACM
SIGMODInt'l. Conf. on Management of Data, pages110{121, May1989.
[SD90] D. A. Schneiderand D. J. DeWitt. Tradeos inProcessing Complex Join Queries
via Hashing in Multiprocessor Database Machines. In Proc. of the Int'l. Conf. on Very Large Data Bases, pages469{480, Brisbane,Australia,August 1990.
[SKWW00] A.R.Schmidt,M.L.Kersten,M.A.Windhouwer,andF.Waas. EÆcientRelational Storage and Retrievalof XML Documents. InInternational Workshop on the Web
andDatabases(InconjunctionwithACMSIGMOD),pages47{52,Dallas,TX,USA, May2000.
[SSN94] K.Shim,T.Sellis,andD.Nau.ImprovementsonaHeuristicAlgorithmfor Multiple-QueryOptimization. IEEETrans. onKnowledge andData Engineering,12(2):197{ 222,March 1994.
[Tho87] A. Thomasian. A Performance Study of Dynamic Load Balancing in Distributed Systems. In Proc. of the IEEE Int'l. Conf. on Distributed Computing Systems, Berlin,Germany,1987.
[WFA95] A.N.Wilschut, J.Flokstra, andP.M. G. Apers. ParallelEvaluationof Multi-Join Queries.InProc.ofthe ACM SIGMODInt'l.Conf. on Managementof Data,pages 115{126, San Jose,CA,USA, May 1995.