Memory aware query scheduling in a database cluster

(1)

Centrum voor Wiskunde en Informatica

REPORT

RAPPORT

Memory Aware Query Scheduling in a Database Cluster

F. Waas, M.L. Kersten

Information Systems (INS)

(2)

Report INS-R0021

ISSN 1386-3681

CWI

P.O. Box 94079

1090 GB Amsterdam

The Netherlands

CWI is the National Research Institute for Mathematics

and Computer Science. CWI is part of the Stichting

Mathematisch Centrum (SMC), the Dutch foundation

for promotion of mathematics and computer science

and their applications.

SMC is sponsored by the Netherlands Organization for

Scientific Research (NWO). CWI is a member of

ERCIM, the European Research Consortium for

Informatics and Mathematics.

Copyright © Stichting Mathematisch Centrum

P.O. Box 94079, 1090 GB Amsterdam (NL)

Kruislaan 413, 1098 SJ Amsterdam (NL)

Telephone +31 20 592 9333

Telefax +31 20 592 4199

(3)

FlorianWaas DEIS{CSITE-CNR UniversitadiBologna 40136Bologna Italy MartinL. Kersten CWI P.O.Box94079 1090GBAmsterdam TheNetherlands ABSTRACT

Querythroughputisoneoftheprimaryoptimizationgoalsininteractiveweb-basedinformationsystemsinorder toachievetheperformancenecessarytoservelargeuser communities. Queriesinthisapplicationdomaindier signicantlyfromthoseintraditionaldatabaseapplications: theyareoflowercomplexityandalmostexclusively read-only. Thearchitectureweproposehereisspecicallytailoredtotakeadvantageofthequerycharacteristics. Itisbasedonalargeparallelshared-nothingdatabaseclusterwhereeachnoderunsaseparateserverwithafully replicatedcopyofthedatabase. Aqueryisassignedandentirelyexecutedononesinglenodeavoidingnetwork contention or synchronizationeects. However, theactual keytoenhanced throughput isa resource eÆcient scheduling ofthearriving queries. Wedevelop asimpleand robustscheduling schemethattakes thecurrently memoryresidentdataateachserverintoaccountand tradesomemoryre-useandexecutiontime,reordering queriesasnecessary.

Ourexperimentalevaluationdemonstratestheeectivenesswhenscalingthesystembeyondhundredsofnodes showingsuper-linearspeedup.

1991ACMComputingClassicationSystem:H.2.4Databasesystems,Distributeddatabases,Paralleldatabases KeywordsandPhrases: Queryrouting,databasecluster,main-memorydatabases.

Note: FundedbyCWI/INS1theme.

1. Introduction

Asignicantnumberofweb-basedinformationsystemsrelyondatabasetechnologytoservelarge user communities which makes scalability a key issue for the design of web-enabled database

systems. Parallel processing and data replication, are necessary to deal with the peak loads encountered. Likewise,aneective query dispatchingscheme isneeded tolevelthesystemload aswellasto guaranteequality-of-serviceinterms of responsetime.

In this paper we are concerned with initial experiences with a multi-media portal under construction based on the Monet database system [BK99 ]. The system is intended to provide eÆcientaccesstoa largecollectionof indexedmulti-mediaobjects. It isendemictothiskindof

informationsystem that user interactionis dominated byread accesses. A number of systems with similarrequirements regarding the deployed database backend have been developed and manymore arecurrentlyunderconstruction.

With each user interaction, the interface emits a number of queries to the database that ideally lead to an answer set of a few tens of candidate results. Involving accesses to dierent multi-dimensionalindexes,theevaluation ofsuchqueriesis usuallyintheorder of fewseconds.

(4)

applications. Moreover, thedeviationof runningtimeamong thequeriesislimited,notleastto ensureacceptable responsetimes.

The primary challenge in this setting is to develop processing techniques to optimize the

querythroughput. Parallelprocessingisanessentialelementtoachievethis,however,astraight forward recasting of methods developed for parallel databases doesnot applyhere since most solutions devised in thisarea are almost exclusively geared to tackle highly complex and long

running queries. There, queries are usually parallelized on a granularity of partial plans or even single operators, i.e. single operators like the join of two tables are executed in parallel on dierent nodesinvolving exchangeof partialresultsamong thesinglenodes. However, these

techniques are ineective for the kind of query we are considering since communication and coordinationoverheadswould outweigh theactualbenets.

In thispaper, we proposea parallelqueryprocessingarchitecturethat can take advantageof

the query characteristic by its physical design, suitable query scheduling,and the way queries areexecuted.

Theplatformofoperationisashared-nothingenvironment|i.e.aclusterofinexpensivePCs|

whereeachnoderunsaMonetserverwithafullyreplicatedcopyofthedatabase. Onemachine isdistinguishedascoordinatornodethatdispatchesthearrivingqueriestotheserversaccording to aschedulingstrategy. The scheduling schema we develop inthispaperdiersradicallyfrom previousworkaswedonottrytomodelvarioussystemparametersinordertoexploitprimarily

idle system resources, but take into account what data is memory-resident at the servers, i.e. cachedbytheservers. Thealgorithmisbasedonametricthatdeterminesthedistancebetweena serverandaquery|thelessthisdistance,themoresimilarthestateofthememoryatthisserver

and what isrequired to process thisquery. Moreover, we investigate possibilitiesof re-ordering and deferredexecutionofqueriesto furtherreduceexecutioncosts. Once aqueryisassigned to aserveritisexecutedinisolationonthisserver,sonosynchronizationorcommunicationwithin

theclusterisneededfreeinginterconnectionbandwidthforshippingofbothqueriesandresults. Since we aredealing withread-only accesses, we do not have to considertransaction

mecha-nismsto keepthereplicateddata consistent acrossthedatabasecluster. Rather,thedatabases areperiodicallyupdatedbymirroringamaster database.

The experimental evaluation of the techniques proposed show substantial savings over con-ventional greedy scheduling that takes only the machines' workload into account. In a large

numberof experiments we investigate theimpact of individualparameters closely. Our results conrmthearchitecturaldecisions showingexcellentscaling behavior.

The remainder of this paper is organized as follows. We review related work in Section 2.

In Section 3 we present the architecture and describe the query model in Section 4. Section 5 discussesthemodelingoftheserverpoolandtheschedulingalgorithm. InSection6,wepresent a comprehensive performance analysis. Section 7 contains a discussionof thedesign decisions.

We concludethe paperwithSection8.

2. Related Work

Parallelqueryprocessinghasbeenstudiedinalargevarietyoffacets, seee.g.[PMC +

90 ,DG92 ,

HS93 ,WFA95,Gra95 ]. Mostofrelatedworkinthiseldconcentratedonpossibilitiestospeedup highly complex queries with long running times. Approaches as taken in [HM94] and [GI96 ] suggest a decomposition of the query plans into sub-plans which arethen executed inparallel

ondierentnodesoftheparallelprocessingenvironment. Thegranularityofthisdecomposition variesandcan beasneasparallelizingsingleoperatorasstudiedforexamplein[SD89,SD90 ,

(5)

These approaches have in common that they require communication between single nodes forshippingorexchangingpartialresults. Thiscausesnetworkcontention andsynchronization eects where nodes have to wait forothers to complete their tasks rst. As a result, a

paral-lelization along these lines only pays ifthe query is of suÆciently high complexity. Otherwise communicationoverhead and synchronizationeects outweigh performance gains.

Moreover, parallel processing as outlined above scales only for small numbers of nodes

ef-fectively. In shared-nothing architectures, network contention becomes increasingly a bottle-neck; in the case of shared-everything, the high degree of resource sharing limits the scaling [Sto86 , NZT96].

General memoryallocation issueshave beenexplored extensivelywithrespect to various as-pectsofqueryprocessing. In[MD93],authorsproposeddynamicmemoryallocationschemesfor multi-queryworkloadto levelmemoryallocationwithoutsharing resident dataamongdierent

queries. Byanalyzingqueriesand theircommon sub-expressionsthere-use ofmemoryresident data is often a by-eect [SSN94]. In [MSD93 ], authors consider batch scheduling for parallel processing. However, main-memoryis in bothcases transparentlyviewed as a centralresource

and datalocationwithindistributedmemoryhasnot beenconsidered.

In thecontext of transactionprocessing,several query routingschemas fordatabaseclusters havebeeninvestigated(seee.g.[Tho87 ,FGND93,RBS00 ]). Thiseldofapplicationdiersfrom theproblemat handinits preliminaries: queriesare usuallyofcomplexnature and updates to

the database need to be propagated over the complete cluster. One of the major goals is for instanceto schedulequeriessothatlockingcon ictsareavoided. Foracomprehensiveoverview on thissubject seee.g. [Rah92 ].

3. Architecture

Figure 1 shows the architecture of our system. It consists of a cluster of database servers managed by a central scheduler node. All machines have separate main-memory, CPU, and

disks not sharing any resources other than network bandwidth. We use Monet, the main-memory database system developed at CWI, as database server [BK95 , BK99 ]. Besides its

vertical fragmented data model, Monet is distinguishedby its memory awareness, i.e. it solely usesoperatingsystemprimitivesforits memorymanagement,mappingdatabasetablesdirectly to virtualmemory,to avoid the overhead of a proprietary buermanager simulatinga virtual

memorylayer. Moreoveritprovidesspecicallycacheawareoperatorimplementations[BMK99 ] to maximizesystemutilization onrunningqueries.

Additionally,a pool of web-servers forms the front-end of the system which clients interact

with. The web-servers receive either parameterized queries using text-based forms, or they interactwith visualqueryformulationtools.

Processing of a client query is done in 7 steps: Users formulate their query using the web-interface(Fig.1,(1)). Attheweb-server,thequeryisre-formulatedusingtheinternalprocedural

query representation ofthe databasesystem|inourcase MIL,thequery languageof Monet| and submitted to the query scheduler (2). The scheduler maintains a queue of queries that are to be executed. By analyzing the data requirements for the execution of a query and the

dataresidentinmain-memoryattheservers,theschedulerdeterminesafavorableassignmentof queriestoservers(4). Thequeryisexecutedontheassignednode(5)andtheresultisreturned to thefront-endhost (6).

1

Afterformatting theresult,itis shippedto the user(7).

This architecture directly aims at the throughputoptimization of compact querieswhere an execution on a single node is the most eÆcient kind of processing. But this architecture also

1

(6)

1

2 Web-application

Server

Web-application

Server

Web-application

Server

Database

Server

Database

Server

Database

Server

6

7 U

Master

Database

1 User

User

4

3

6

5 U

U

Periodical update of replicated databases

Query scheduling

Online data sources

Robot

Feature Analysis

Indexing

Scheduler

Query

Figure1: Query processing architecture

exploits the second query characteristic: the impreciseness of the data. For example, data gathered by robots from the Internet to buildup a multi-media index is updated only at low frequency. Thus,theserverclusterdonothavetobekeptinsyncaswouldbethecaseifupdates

by users were allowed. Instead, the databases are updated periodicallyreplicating the master database(U). The frequencyof updatesdependsnaturally on theapplicationdomain.

4. Query Model

The choice of Monet asdatabase software forthe back-endclusterimpliesa specic model for thequery execution. The vertical fragmentationof the tablesin Monet causesbulk-processing to be more eÆcient thanpipeliningtechniques. Asa consequence, there areno more than two

tables processed on a single CPU at a time. Note, this executionmodel doesnot impose any restrictionsontheshapeoftheexecutionplans|bothlinearandbushyplansarefeasible. Access to a table can be either of type load or scan. load reads a table from disk into memory to

make it part of thehot set, e.g. used withinner relations of nested-loop joins. scan reads the table, butdoesnotkeep thedata inmemory afterthe actualoperationis performed,e.g. used inselections orfor theprobe relationof a hash join. The buildingof a hash table can be seen asspecialtypeof loadasthe resultis onlyaccessiblevia thehash attribute.

The costsfor executing a relationalgebra operator consists of thecosts of loading/scanning ofthetable plustheactualcostsfortheoperation. Fittingthepiecestogether, we candescribe

(7)

a queryasa sequence Q=h(t 1 ;a 1 ;l 1 ;w 1 );:::;(t n ;a n ;l n ;w n )i

whereeachquadruplecorrespondstoanoperatoror|asincaseofajoin|toapartialoperator. t

i

speciesthetable necessaryfortheoperation. a i

denotesthehashattribute,i.e.theoperator

accesses t i

via thisattribute; a i

= ifnotapplicable.

The associated loading costs, if the table is notalready resident in main-memory, are given byl

i

. We denotethe totalloading costs of aquery as

L(Q)= X i l i :

The costs to execute the operator, the operator costs,are denoted by w i

. We denote the total

operator costs of aquery as

W(Q)= X i w i : Note, bothl i and w i

areexpressed interms ofthe sameunit to achieve apropercomparison.

A queriesexecutiontime ona \cold" server, i.e.no datais loaded yet,is

(Q)=L(Q)+W(Q):

Ifall tablesneeded byQare alreadyinmemory|andincase ofhashtables, are hashedbythe

properattribute|the executioncostsof thequeryamountsto W(Q)only.

Examples

HerearesomeexamplestoillustratethismodelingbasedonMonetperformancecharacteristics. The twotables A andBused inthisexampleareof size8.8MBand 4.95MB, respectively. Our

test platformachieved abandwidthof 5.5MB/sfordisk access.

Nested-Loop Join,A1B

Q=h(A;;1:6;0);(B;;0:9;2 :6) i

TableAtakes1.6stoload,nooperatorcostsoccur. TableB takes0.9stoload;performing

thejoinrequires 2.6s.

HashJoin, A1B

Q=h(A;a;2:7;0);(B;;0:9;1 :1) i

Similartopreviousbutnow,Amustbeahashtablewithhashattributea. Henceloading A ismore expensive asitincludesbuildingthehashtable.

5. Query Scheduling

The query scheduling comprises several elements. Besides a model for the servers we dene a server-querydistancewhichcapturesthepotentialre-useofmemoryresidentdata. Additionally,

(8)

5.1 Servers

For thescheduling, a server of thedatabase clusteris modeled byits state of memory and the

workload. Thestateofmemoryisthesetoftablesresidenttogetherwithareplacementstrategy. Weareconsideringbasetablesonlyanddiscardintermediateresultsoftheprocessingassoonas they areno longerused. As areplacement strategyweuse LRUas itexhibits thebestaverage performance. Theloadinganddroppingoftablesisdoneviathememorymappingfunctionality

of theoperatingsystem. To maintainsuÆcient controloverthe memoryallocationthroughout thecompleteclusterweloadanddroponlycompletetables. Thisway,theschedulercanrelyon theinformationwhichtables arememory-resident,i.e.accessingthem willnotcause additional

costs forswapping. Swapping mayonlyoccur when all unused tablesare already dropped but thememoryrequirementsof thecurrent operationarestillnotmet.

Fortheworkload,wedistinguishthetwostates idle and busy,i.e.weassignone queryto one

server at a time. This isnot justa simplicationto facilitate thescheduling buta necessity in main-memorydatabaseswherecache awarenessand concurrentmemoryaccessareofdistinctly higherimportance thanin I/O-dominateddatabase models[MBK00 ]. We modelthe workload

asfunctionJ(S)whichreturnstheexpectedtimeofjobcompletionatserverS,givenitscurrent workload, i.e. the expected time from now when S willbecome idle. If theserver is idle, J(S) evaluatesto 0. J willbeusedintheschedulertondthenode thatwillnishitsjob next. J is

computedbyconventionalcostformulaeknownfromsequentialqueryprocessing: Giventhetime x

0

thecurrentlyrunningqueryhasbeenassignedtoserverS,xthetimeJ isevaluated,andethe expectedrunningtimeofthequery,thetimeofjob completioncomputes to J(S)=x

0

+e x.

Seealso Section7 fora discussionon theaccuracy ofJ.

5.2 DistanceMetric

We dene the server-query distance as the costs to load the tables for a given query Q on a server S: d(Q;S)= X i R (t i ;a i )l i where R (t i ;a i )= ( 0; ift i

is memory-resident at S and hashedbyattributea i

1; else

indicateswhethertablet i

isresidentinmemoryat S. Incaset i

isrequiredashashtable,Ralso checks whetherthetable is hashedbyattributea

i .

Scheduling a batch of queries optimally on k servers is nding a division into k batches B

1

;:::;B k

,each of which are executed sequentially on one server, such thatthe running time of thebatch withthelongest completiontime

max i X j d(Q j ;S)+W(Q j ) ; Q j 2B i isminimal. 5.3 Scheduling Algorithm

(9)

Algorithm mas

while queuenotemptydo

c min

1

foreachquery Q intop n

(queue) do

for i=1 to number ofservers do

c =d(Q;S i )+W(Q)+J(S i ) if c<c min do b Q Q b S S i c min c done done

if expired(Q)then break

done assignquery b Qto server b S remove b Qfrom queue done

Figure 2: Scheduling Algorithm mas

outline of ouralgorithm calledMemory Aware Scheduling (mas). It iterates over the queue of arrivingqueries, selecting one at atime, and determinesthebestad-hocassignment.

In detail,we examinetherstnelementsof thequeue|orless ifthe queuedoesnotcontain

nqueries. Weinvestigatetheimpactofnandsuitablevaluesforitinthenextsection. Foreach of then queries,we computec whichconsistsof thedistance toall servers S

i

plustheoperator costofthequeryand theexpectedtimeat whichserverS

i becomesavailable, J(S i ). We record thepair( b Q; b

S)withthelowestvalueforc. After examinationofallnqueries, we assign b Qto

b S

which meansthat b

Q willbe executed on b

S assoon asthisservergets idle.

The algorithmisinO(Nns)whereN is thenumberofqueriesintotal, nisthenumberof queriesconsideredineachrun,andsisthenumberofserversavailable,i.e.thealgorithmislinear

in the number of queries. To give any meaningful bounds on the performance is particularly diÆcultbecause oftheLRU replacement of tables.

5.4 Deadlines

Inordertogivetheuseraguaranteeofservice,wetageveryquerywithadeadline. Thisdeadline

refers to the latest point in time thequery has to be assigned to a server for execution, i.e. as soon as the deadline of a query expires, the scheduler has no other choice than assigning this very query to aserver.

We tag all queries with a time stamp according to their arrival. In other words queries are not forced by deadlines to overtake others, though it is often benecial. As a result, we need onlycheck therstqueryof thecurrent topn batchfordeadlineexpiration. Iftherstquery's

deadlineis expired,we donotneedto examineanyotherqueryinthebatchbuthave toassign therst immediately to a server. Otherwise,ifthe rst query's deadlineis not yet expired, no

(10)

Size 20MB 30MB 40MB 50MB >50MB

Numberoftables 233 87 24 15 17

Table1: Sizesand numbersof basetables

all servers ensuresthebestassignment incasethe deadlineexpired.

6. ExperimentalResults

Inthissection,wedescribeexperimentalresultsobtainedwithasimulator. Wechosetosimulate

thesysteminordertoexperimentwithparametersthatarestrictlylimitedbyanactualhardware conguration, such as size of the database cluster ormain-memory availableat the individual servers.

6.1 Preliminaries

Since the full multi-media Acoi demonstrator database is still under construction, we had to

connetheexperimentstothepartalready operational. Thepartchosenistheindexsystemof theACMAnthology,whichisincludedinthedemonstratortoassess itscapabilities inthearea of XML-based database processing. We used statistical data available from an actual Monet

database instance which contains the complete XML code of the Anthology decomposed into theverticallyfragmenteddatamodel[SKWW00 ]. Includingindexes,thedatabasecontains376 tables with up to nearly 60000 rows. The numbers of tables according to their sizes are given

in Table 1. Typical queriesuse some 5 tables orless, seldom up to 10 or more. We generated batches of 10,000 and 100,000 queries of exponentially distributed sizes accessing 5 tables on average.

We donotconsidermechanismsto reject user queriesdueto overloadsincethiscan be done alreadyatthefront-endlevel. Modelingarrivalratesprobabilisticallyisnotnecessaryasqueries donotsignicantlydierinrunningtime,thusinformationaboutthequeriesdoesnotneedtobe

consideredtomakeachoicewhichqueriestoacceptandwhichtoreject. Forourexperiments,we assumethemaximalexpectedqueryarrivalratetomodeltheworstcasebehaviorwithmaximum load. Moreover, we assume all hostswithin the database cluster are of identical conguration

regardingthecost relevant parameters.

Since the replacement of tables along an LRU strategy makes analytical modeling of the algorithm hard, we implemented Graham's list scheduling (gls), which is based on workload

gures [Gra69], forcomparison. We adapted the original algorithm to t the online arrival of queries, i.e. to use thekind of look-ahead we introducedin masabove. Graham's algorithm is known to be highly eective despite its simplicityovercoming some characteristic disturbances also known as scheduling anomalies. These anomalies occur with jobs that dier signicantly

in completion time. Please note, we run Graham's algorithm in exactly the same setting, i.e. withLRUreplacementoftablesasnecessary. Asaresult,Graham'salgorithmalso protsfrom re-useof memoryresident data.

The mainparameters we want to investigate are, foremost, the numberof servers, the look-ahead duringthe scheduling, and the amount of memory available at each server. Note, there are two principal ways of comparison: a one-on-one comparison of both algorithms run on

identic congurations of the platform (relative performance), or a comparison of the scaling characteristics. Typical examplesforthelatter arespeedupand scale-up. We willcompare the two algorithmsusingbothprincipleswhereadequate.

(11)

0.50

0.60

0.70

0.80

0.90

1.00

1.10

0

10

20

30

40

50

60

70

80

90

100 Relative execution time

Number of servers

GLS

MAS

0.50

0.60

0.70

0.80

0.90

1.00

1.10

0

10

20

30

40

50

60

70

80

90

100 Relative execution time

Number of servers

GLS

MAS

Figure 3: Relativeperformanceof warm over coldservers

6.2 Warm/Cold Processing

In thisrst experimentweinvestigate thedierences betweenwarm and coldservers. We refer

to a server as cold if no other tables than system tables have been loaded. If more than 80% of the available memory are allocated we call a server warm.

2

This is not only relevant for laterexperimentsbutalsofortherealapplicationscenariowhenthesystemhastobeshutdown

and re-started fortechnical reasons like maintenance etc. We can expect dierent eectsif the amountofmemoryperserverisvaried. Forsmallservercongurationsthewarmupiscompleted earlierthanforlargeones. However, more important istheamountof memoryintotal,i.e. the

numberof servers. Figure 3 shows the relative execution times of batches of 10,000 queries as functionof thenumberservers. Forexample theleftmost data point for masre ects the ratio of executiontimes formas on acold to that on a warm server. The left graph shows times for

a servercongurationof 64,theright for256MB.

Forgls, coldandwarmexecutiontimes arealmostthesame. Warm processing isonlyup to 5%quicker. Formas theratiodierssignicantlyshowing gainsof more than35%. Especially

fortherange up to 50 servers,the largerconguration (right)achievesbetter performance,i.e. the curve is steeper. Formore servers, the impact of the larger amount of memory decreases: for 100 servers and more, resultsare virtually equal. We address theissue of memory sizes in

more detailin6.5.

Allfurtherresultspresented inthissectionare obtainedfrom warm servers.

6.3 Reordering of Queries

Thenext experimentinvestigates theimpactof thelook-aheadduringscheduling,i.e.the

max-imalnumberof queriesthatmay be re-orderedbetweeneach assignment.

Figure 4 shows execution times fora single server (left)and a cluster of 100 servers (right). In both cases the server(s) had 64MB of memory. The execution time is shown as a function of thelook-ahead, scaled to gls's rst datapoint, whichcorrespondsto arst-come-rst-serve

(FCFS) scheduling. Thesizeofthecompletequerybatch was10,000. Inthecase ofone server, thelook-aheadisofhighimportanceformasandsavingscanamountupto20%oftheexecution time. gls,however, doesnotsignicantlyprot from look-ahead.

2

(12)

0.75

0.80

0.85

0.90

0.95

1.00

1.05

0

200

400

600

800 1000 1200 1400 1600 1800 2000

Execution time [scaled to FCFS]

Look-ahead [queries]

GLS

MAS

0.40

0.50

0.60

0.70

0.80

0.90

1.00

1.10

0

200

400

600

800 1000 1200 1400 1600 1800 2000

Execution time [scaled to FCFS]

Look-ahead [queries]

GLS

MAS

Figure4: Impact of look-ahead; Singleserver (left)and clusterof100 servers(right)

Inthecaseof100servers,thesituationchanges. glsrelativelyimprovesmorewithincreasing look-ahead but for mas hardly any improvement is noticeable, though its execution time is substantially below the one of gls. This is due to the fact that in a pool of 100 servers

many server oer a very low server-query distance and dierences are often very small. As a consequencere-ordering cannothelpndingasignicantlylowerserver-querydistance asitdid inthesequentialcase.

In all furtherexperimentswe usea look-ahead of 100 queriesunlessstatedotherwise.

6.4 Speedup andScale-up

The two fundamental measures when investigating the performance of parallel systems are speedup and scale-up. The previous quanties the gains when scaling up the platform but

keepingtheproblemsizeconstant,thelatterdescribesthesystem'sabilitytocopewithproblem sizesgrowingproportionallywith theplatform(cf.e.g. [DG92]).

Figure 5 shows the speedup for a query batch of 100,000 queries evaluated on up to 4096

servers with 64MB memory each. As the plot shows, gls achieves slightlysub-linearspeedup whereasmasachieves even super-linear speedupwhichtranslates to eective re-use of memory residentdata. Tobetterillustratethisphenomenon,consideraverybasicexampleusing4tables

A,B,C and D of same size such that only two table t into memory at the same time, and a batch of 4queries: Q 1 = h(A;2;2);(B;2;2)i Q 2 = h(C ;2;2);(D;2;2)i Q 3 = h(D;2;2)i Q 4 = h(A;2;2)i

On a single machine, the total costs amount to 8+8+4+4 = 24. With linear speedup, we wouldexpect12cost unitsfora parallelizationontwo hosts. masassignsqueriesQ

1 and Q 4 to one, Q 2 and Q 3

to the other machine which amountsto 8+2=10 costsat each node. Hence,

thetotal executiontime is10 compared to 24on a singlenode which givesaspeedupof2.4. Figure 6shows thescale-upforthesameserverpoolconguration. masmaintainsascale-up

(13)

0 1000

2000

3000

4000

5000

6000

7000

0

512 1024

1536

2048

2560

3072

3584

4096

Speedup

Number of servers

linear

MS

GLS

Figure 5: Speedup

0.7

0.8

0.9

1

1.1

1.2

1.3

0

512 1024

1536

2048

2560

3072

3584

4096

Scaleup

Number of servers

constant

MAS

GLS

Figure 6: Scaleup 6.5 Direct Comparison

Inourlastexperiment,wecomparethealgorithmsdirectly,i.e.we determinetheratioof mas's executiontimetotheoneof glsforeachindividualparametersetting. Valuesbelow1.0indicate

thatmasoutperformsgls. Asparameteroftheexperimentwevarynumberofservers,memory available, and look-ahead.

Figure 7 shows the relative execution time as a function of the number of servers and the amountofmemoryperserver. Thenumberofserversvariesbetween5and100,memorybetween

32and640MB.Asthediagramdisplays,masoutperformedglsinall400individualexperiments achieving execution times as short as 40% of those of gls. However, as the diagram reveals,

thisisnomonotonicprocess,rather,withincreasingmemorysizes,glsmanagesto \catchup", though only to a certain degree. See for example the front row where, after mas increases its lead (0.4 at ca. 160MB) it cannot further improve on the running time whereas gls becomes

increasinglybetter. Formore thanca. 320MBneither algorithmcan achieve anyimprovement, thustheplateaux.

Figure 8 shows the relative performance as function of the number of servers and the look-ahead. All servers had 64MB of memory. The plot shows results that aÆrm the previously

foundonesnowalso inthedirectcomparison: Forsmallnumberof servers look-aheadplaysan importantrole (see lastrow),whichfades asthenumberof server increases(see front row).

7. Discussion

Whileimplementingandtestingdierentversionsofmaswemadeseveraldesigndecisionswhich arenotself-evidentat rst sight and deserve to bediscussedmore elaborately inthefollowing. Themostimportantquestioninthedesignprocesswaswhethertoincludemoresystem

infor-mation abouttheservers ornot. Typically,cost modelsin paralleldatabases try to model the system|particularlythesharedresourceslikenetworkorshareddisks|asdetailedaspossible. Wechosenottoincorporatethiskindofinformationforthreereasons: Firstly,thetypeofquery

we are dealingwith cannot prot much from the kindof parallelprocessing these cost models havebeendevelopedfor. Secondly,acostcomputationthattakesdetailsatthisnegranularity into account is computational too expensive to deliver cost estimates for hundreds of servers

(14)

32

160

320

480

640 Memory [MB]

100

75

50

25

5 Number of Servers

0.4

0.5

0.6

0.7

0.8

Figure7: Relativeperformance;

mem-ory available and number of servers varied

1

500 1000

1500

2000

Look ahead [queries]

100

75

50

25

5 Number of Servers

0.6

0.7

0.8

Figure 8: Relative performance;

look-ahead andnumberof servers varied

informationis hardto maintain. To keepit accurately up-to-date duringthe processing would requirea substantialshare of bothnetworkand processingresources.

In contrast to that the cost values used by mas are of simpler nature. The cost estimates forthequeryexecution time(see Section5.1) canbe assembled fromsequential costestimates. Since thequeriesareemittedbya xedinterface, there isthepossibilityto pre-compilequeries

which arethen instantiatedwith afew parameters. In thatcase highlyaccurate cost estimates can bepre-computed and|like thequeries|instantiatedwith parameters.

The informationnecessarytodescribethesystemfortheschedulingisinsofareasy to

main-tainasitconsistsonlyofafeedbackindicatingthatprocessingofthelastqueryhasterminated, i.e. the result has beenshipped, and what tables are now in main-memory. To avoid idleness between feedback and new assignment, servers can be extended to buerone query while

pro-cessingthepreviouslyassigned. Thefeedbacktogetherwiththefactthatserversdonotmaintain ownquery queuesensures thattheimpact of afew,inaccurate cost estimates,whichcan never becompletely avoided, iskeptdown to aminimum.

Finally,the queryexecution we proposeddoesnotallowfor multi-programming,i.e. running several queries concurrently on one server. This choice was motivated by two facts. Firstly, multi-programmingisnotverybenecialinmain-memorydatabaseslikeMonetasconcurrently

processed queries often get into each other's way, unlike in I/O dominated database systems. Secondly, and more importantlyas this holds for other database back-ends too, the potential gainsofmulti-programmingwhereI/OofonequeryandCPU intensivecomputationofanother

one can be aligned is very limited as it is our foremost goal to avoid costly I/O operations at all.

8. Conclusion

Web-enabled databases and database back-end technology for large web-base information sys-temsare one of thefastest growing segments of thedatabasemarket. Thosesystems challenge thetraditionalrepertoireofoptimizationtechniquesusedindatabasetechnology. Powerful,user

(15)

databaseback-endwhichmake throughputoptimization aprimary optimizationgoal.

In thispaper, we propose a parallel query processing architecture and investigated the pos-sibilityto exploitdata sharingbyclever scheduling of thearriving queries. We have developed

mas, a scheduling strategy that tries to maximize the re-useof data resident in main memory across thedatabasecluster. The algorithm isdistinguishedby its simplicityandrobustness on one hand|the information needed to make mas work is easy to obtain, accurate, and needs

onlylittleeortto be keptup-to-date|and itseectivenesson theotherhand.

Our experiments show superiorresults compared to conventional list scheduling in terms of both query throughput and scaling behavior conrming our considerations. Moreover, by

re-usingthedataavailableratherthanassigningdataactively,theschedulingalgorithmadapts to changinghotspots.

Our future work is geared towardextending the scheduling schema to considerintermediate

(16)

References

[BK95] P. A. Boncz and M. L. Kersten. Monet: An ImpressionistSketch of an Advanced Database System. In Proc. Basque International Workshop on Information Te ch-nology,San Sebastian,Spain,1995.

[BK99] P.A.Boncz andM. L.Kersten. MIL PrimitivesforQueryinga FragmentedWorld. TheVLDB Journal, 8(2):101{119, 1999.

[BMK99] P. A. Boncz, S. Manegold, and M. L. Kersten. Database Architecture Optimized forthenewBottleneck: Memory Access. InProc.of the Int'l. Conf. on Very Large Data Bases, pages54{65, Edinburgh,UK,1999.

[CHM95] C. Chekuri, W. Hasan, and R. Motwani. Scheduling Problems in Parallel Query Optimization. In Proc. of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pages255{265, SanJose, CA,USA,May1995.

[DG92] D.J. DeWittand J.Gray. ParallelDatabase Systems: The Future of High Perfor-mance DatabaseSystems. Communications of the ACM,35(6):85{98, June1992.

[FGND93] D.F.Ferguson,L.Georgiadis,C.Nikolaou,andK.Davies. GoalOriented,Adaptive TransactionRoutingforHighPerformanceTransactionProcessingSystems.InProc. of the Int'l. Conf. on Paralleland Distributed Information Systems, pages138{147,

SanDiego, CA,USA, January 1993.

[GI96] M. N.Garofalakis and Y. E.Ioannidis. Multi-dimensional ResourceScheduling for Parallel Queries. In Proc. of the ACM SIGMOD Int'l. Conf. on Management of

Data,pages 365{376, Montreal, Canada,June1996.

[GI97] M.N.GarofalakisandY.E.Ioannidis. ParallelQuerySchedulingandOptimization withTime-andSpace-SharedResources. InProc.of the Int'l. Conf. on Very Large

Data Bases, pages296{305, Athens, Greece,September1997.

[Gra69] R. L. Graham. Bounds on Multiprocessing Timing Anomalies. SIAM Journal on Applied Mathematics, 17(2):416{429, March1969.

[Gra95] J.Gray. A Surveyof ParallelDatabase TechniquesandSystems. InTutorial Hand-outsofthe21stInt'l.Conf.onVeryLargeDataBases,Zurich,Switzerland, Septem-ber1995.

(17)

[HM94] W.HasanandR.Motwani. OptimizationAlgorithmsforExploitingthe Parallelism-Communication Tradeoin PipeliningParallelism. In Proc. of the Int'l. Conf. on Very Large Data Bases, pages36{47, Santiago, Chile,September1994.

[HM95] W. Hasan and R.Motwani. Coloring AwayCommunicationin ParallelQuery Op-timization. In Proc. of the Int'l. Conf. on Very Large Data Bases, pages 239{250, Zurich,Switzerland,September1995.

[HS93] W. Hong and M. Stonebraker. Optimization of ParallelQuery ExecutionPlans in XPRS. Distributed and Parallel Databases, 1(1):9{32, 1993.

[MBK00] S. Manegold, P. A. Boncz, and M. L. Kersten. What happens during a Join? -DissectingCPU and Memory OptimizationEects. In Proc. of the Int'l. Conf. on Very Large Data Bases, Cairo,Egypt,September2000. Acceptedforpublication.

[MD93] M.MehtaandD.J.DeWitt. DynamicMemoryAllocationforMultiple-QueryW ork-loads.InProc.of theInt'l.Conf.on VeryLargeData Bases,pages354{367, Dublin, Ireland,September1993.

[MSD93] M. Mehta, V. Soloviev, and D. J. DeWitt. Batch Scheduling in Parallel Database Systems. In Proc. of the IEEE Int'l. Conf. on Data Engineering, pages 400{410, Vienna,Austria,April 1993.

[NZT96] M.G.Norman,T.Zurek,andP.Thanisch.MuchAdoAboutShared-Nothing.ACM SIGMODRecord,25(3):16{21, September1996.

[PMC +

90] H.Pirahesh, C. Mohan, J. Cheng, T. S. Liu,and P. Selinger. Parallelismin

Rela-tionalDataBase Systems: Architectural IssuesandDesignApproaches. InProc.of the Int'l. Symp. on Databases in Parallel and Distr. Systems, pages 4{29, Dublin, Ireland,July1990.

[Rah92] E.Rahm.AFramewokforWorkloadAllocationinDistributedTransaction Process-ingSystems. Systems Software Journal,18:171{190, 1992.

[RBS00] U.Rohm, K.Bohm, andH.-J.Schek. OLAPQueryRoutingand PhysicalDesignin aDatabase Cluster. InProc.of the Int'l.Conf. on ExtendingDatabase Technology, Lecture Notes in Computer Science, pages 254{268, Konstanz, Germany, March

2000.

[SD89] D.A.Schneider andD. J DeWitt. A Performance Evaluation of FourParallelJoin AlgorithmsinaShared-NothingMultiprocessorEnvironment. InProc.of the ACM

SIGMODInt'l. Conf. on Management of Data, pages110{121, May1989.

[SD90] D. A. Schneiderand D. J. DeWitt. Tradeos inProcessing Complex Join Queries

via Hashing in Multiprocessor Database Machines. In Proc. of the Int'l. Conf. on Very Large Data Bases, pages469{480, Brisbane,Australia,August 1990.

[SKWW00] A.R.Schmidt,M.L.Kersten,M.A.Windhouwer,andF.Waas. EÆcientRelational Storage and Retrievalof XML Documents. InInternational Workshop on the Web

andDatabases(InconjunctionwithACMSIGMOD),pages47{52,Dallas,TX,USA, May2000.

[SSN94] K.Shim,T.Sellis,andD.Nau.ImprovementsonaHeuristicAlgorithmfor Multiple-QueryOptimization. IEEETrans. onKnowledge andData Engineering,12(2):197{ 222,March 1994.

(18)

[Tho87] A. Thomasian. A Performance Study of Dynamic Load Balancing in Distributed Systems. In Proc. of the IEEE Int'l. Conf. on Distributed Computing Systems, Berlin,Germany,1987.

[WFA95] A.N.Wilschut, J.Flokstra, andP.M. G. Apers. ParallelEvaluationof Multi-Join Queries.InProc.ofthe ACM SIGMODInt'l.Conf. on Managementof Data,pages 115{126, San Jose,CA,USA, May 1995.