Using Grids for Exploiting the Abundance of Data in Science

(1)

USING GRIDSFOREXPLOITING THE ABUNDANCE OFDATA INSCIENCE

EUGENIO CESARIO

∗

AND DOMENICO TALIA

∗

,

†

Abstrat. Digitaldata volumes aregrowing exponentially inallsienes. To handlethisabundane indata availability,

sientists must use data analysis tehniques in their sienti praties and solving environments to get the benets oming

fromknowledge thatan beextrated fromlargedata soures. Whendata ismaintainedover geographially remotesites the

omputationalpowerofdistributedandparallelsystemsanbeexploitedforknowledgedisoveryinsientidata. Inthissenario

theGridanprovideaneetiveomputationalsupportfordistributedknowledgedisoveryonlargedatasets.Inpartiular,Grid

serviesfordataintegrationandanalysisanrepresentaprimaryomponentfore-sieneappliationsinvolvingdistributedmassive

andomplexdata sets. Thispaper desribessomeresearhativitiesindata-intensiveGridomputing. Inpartiularwedisuss

theuseofdataminingmodelsandserviesonGridsystemsfortheanalysisoflargedatarepositories.

Keywords: e-siene,knowledgedisovery,grid,paralleldatamining,distributeddatamining,grid-baseddatamining

1. Introdution. Thepasttwodeadeshavebeendominatedbytheadventofinreasinglypowerfuland

less expensive ubiquitous omputing, as well as the appearane of the World Wide Web and related

teh-nologies[12℄. Due to suh advanesin information tehnology andhigh performaneomputing, digital data

volumes are growing exponentially in many elds of human ativities. This phenomenon onerns sienti

disiplines,aswellasindustryandommere. Suhtehnologialdevelopmenthasalsogeneratedawholenew

setof hallenges: theworldis drowningin ahugequantityofdata,whih isstillgrowingveryrapidly bothin

thevolumeandomplexity.

Jim Gray in some talks in 2006 identied four hronologial steps for the methodologies employed by

sientists fordisoveries. Therst step ourred thousand years ago, when siene wasempirial and it was

orientedtojustdesribenaturalphenomena. Theseondoneistemporallyloatedaroundafewhundredyears

ago,when atheoretialbranh wasborn,aimed at formulatingsomegeneralmodelsdesribingthe empirial

knowledge. Thethird step ourred in the latestfew deades, when aomputational branh started upand

omplexphenomenastartedtobesimulatedbytheresouresmadeavailablebytheurrenttehnology. Finally,

thefourth step isrun today, when sientists areworking tounify theories, experimentsand simulations with

dataproessingandexplorationto extratknowledgehiddeninit.

Theabundaneofdigitallystoreddatarequireto onsiderin detailthis phenomenon. Inpartiular,there

are two importanttrends, tehnologial and methodologial, whih seem to partiularly distinguish the new,

information-rihsienefromthepast:

•

Tehnologial. Thereisalotofdataolletedandwarehousedinvariousrepositoriesdistributedoverthe

world: dataanbeolletedandstoredathighspeedsinloaldatabases,fromremotesouresorfrom

theour galaxy. Someexamples inlude datasets from theelds of medialimaging, bio-informatis,

remote sensing and (as very innovative aspet) several digital sky surveys. This implies a need for

reliabledatastorage,networking,anddatabase-relatedtehnologies,standardsandprotools.

•

Methodologial. Hugedatasetsarehardtounderstand,andinpartiulardataonstrutsandpatterns

present in them annot be omprehended by humans diretly. This is a diret onsequene of the

growthinomplexityofinformation,andmainlyitsmulti-dimensionality. Forexample,aomputational

simulationangenerateterabytesofdatawithinafewhours,whereashumananalystsmaytakeseveral

weekstoanalyzethesedatasets. Forsuhareason,mostofdatawillneverbereadbyhumans,rather

theyareto beproessedandanalyzedbyomputers.

Weansummarizewhat weforesaidasfollows: whereassomedeadesagothemain problemwasthelak

ofinformation, thehallengenowseemstobe(i)the very largevolume of information and (ii) the assoiated

omplexitytoproess for extratingsigniant anduseful partsorsummaries.

Nevertheless, the rst aspet does not represent a limitation or aproblem for the sienti ommunity:

urrentdatastorage,arhiteturalsolutionsandommuniationprotoolsprovideareliabletehnologialbase

to ollet and store suh abundane of data in an eient and eetive way. Moreover, the availability of

highthroughputsientiinstrumentationandveryinexpensivedigitaltehnologiesfailitatedthistrendfrom

both tehnologialand eonomialviewpoint. On theother hand, theomputational powerof omputers is

∗

ICAR-CNR,ViaP.Bui41C,87036Rende(CS),Italy(esarioiar.nr.it, taliaiar.nr.it).

†

(2)

not growing asfast as the demand of suh a data omputation requires, and this represents a limit for the

knowledgethatpotentiallyouldbeextrated. Asanadditionalaspet,wehavetoonsiderthat storageosts

areurrentlydereasingfaster thanomputingosts,andthistrendmakesthingsworse.

Forexample,theimpatofforesaidissuesinthebiologialeldiswelldesribedin[20℄. Itpointsoutthat

the emergene of genome and post-genome tehnology has made huge amount of data available,demanding

a proportional support of analysis. Nevertheless, an important fator to be onsidered is that the number

of available omplete genomi sequenes is doubling almost every 12 months, whereas aording to Moore's

law available omputeyles (i. e., omputational power) double every 18 months. Additionally, we have to

onsiderthat analysisofgenomisequenesrequirebinaryomparisonsofthegenesinvolvedinit. Asadiret

onsequene of that, the omputational overhead is very very high. We an see the impat of suh issues

in Figure 1.1 (soure: [20℄), whih ontrasts the number of geneti sequenes obtained with the number of

annotationsgenerated. Thegureshowsthat theknowledge(annotations,models, patterns) hasasub-linear

ratewithrespettothetheavailabledatasequeneswhihtheyareextratedfrom.

Fig.1.1. Growthofsequenesandannotationssine1982(Soure:[20℄)

To handle this abundane in data availability (whose rate of prodution often far outstrips our ability

to analyze), appliations are emerging that explore, query, analyze, visualize, and in general, proess very

large-sale data sets: they are named data intensive appliations. Computational siene is evolving toward

dataintensiveappliationsthatinludedataintegrationandanalysis,informationmanagement,andknowledge

disovery. Inpartiular,knowledgedisoveryin largedatarepositoriesanndwhatisinterestingin themby

usingdataminingtehniques. Dataintensiveappliationsinsienehelpsientistsinhypothesisformationand

give them a support on their sienti praties and solvingenvironments, getting thebenets omingfrom

knowledgethat anbeextratedfrom largedatasoures.

Whendataismaintainedovergeographiallydistributed sitestheomputationalpowerofdistributed and

parallel systems an be exploited for knowledge disovery in sienti data. Parallel and distributed data

miningalgorithmsaresuitabletosuhapurpose. Moreover,inthis senariotheGrid anprovideaneetive

omputational support for dataintensiveappliation and for knowledge disoveryfrom largeand distributed

datasets. Grid omputing is reeivingan inreasingattentionfrom the researhommunity, wathing at this

newomputinginfrastrutureasakeytehnologyforsolvingomplexproblems andimplementingdistributed

high-performaneappliations[14℄.

Todaymanyorganizations,ompanies,andsientientersprodueandmanagelargeamountsofomplex

dataandinformation. Climate,astronomi,andgenomidatatogetherwithompanytransationdataarejust

some examples of massive amounts of digital data that today must be stored and analyzed to nd useful

knowledge inthem. Thisdataand informationpatrimonyanbeeetively exploitedifit isused asasoure

to produe knowledge neessaryto support deisionmaking. This proess is bothomputationally intensive,

ollaborative, anddistributed in nature. The developmentof data mining software for Gridsoers toolsand

(3)

enablingonditionfordevelopinghigh-performanedataminingtasksandknowledgedisoveryproesses,and

itmeetsthehallengesposedbytheinreasingdemand forpowerandabstratnessomingfromomplexdata

miningsenariosin sieneandengineering. Forexample,someprojetsdesribedinSetion 2suhasNASA

InformationGrid, TeraGrid, andOpenSiene Gridusetheomputationalandstoragefailitiesin theirGrid

infrastruturestominedatainadistributedway. Sometimeintheseprojetsareusedadhosolutionsfordata

mining,in otherasesgenerimiddlewareisused ontopof basiGridtoolkits. AspointedoutbyWilliamE.

Johnstonin [19℄, theuseof generalpurposedatamining toolsmayeetivelysupport theanalysisof massive

anddistributeddatasets inlargesalesieneandengineering.

TheGrid allowsto federate and share heterogeneousresouresand servies suh assoftware, omputers,

storage,data,networksinadynamiway. Gridserviesanbethebasielementforomposingsoftwareanddata

elements,and exeutingomplexappliationson Gridand Websystems. TodaytheGrid isnotjust ompute

yles,butitisalsoadistributeddatamanagementinfrastruture. Integratingthosetwofeatureswithsmart"

algorithms we an obtain a knowledge-intensive platform. The driving Grid appliations are traditionally

high-performane and data intensive appliations, suh as high-energy partile physis, and astronomy and

environmental modeling, in whih experimental devies reate largequantities of data that requiresienti

analysis.

Inthe latestyears manysigniantGrid-baseddata intensiveappliations and infrastrutureshavebeen

implemented. Inpartiular,theservie-basedapproahisallowingtheintegrationofGridandWebforhandling

withdata. Wewillbrieyreportsomeoftheseappliationsin therstofthepaper;thenwedisussaboutthe

useofhighperformanedataminingtehniquesforsieneinGrid platforms.

Therestofthepaperisorganizedasfollows. Setion2desribessomeGrid-baseddataintensiveprojetsand

appliations. Setion 3gives anoverview of approahes forparallel, distributed and Grid-based data mining

tehniques. Setion 4 introdues the Knowledge Grid, a referene software arhiteture for geographially

distributedknowledgedisoverysystems. TheSetion 5givesonludingremarks.

2. GridTehnologiesfor dealingwith Sientidata. Severalsientiteamsandommunitiesare

usingGridtehnologyfor dealingwithintensiveappliationsaimedat sientidataproessing. Asexamples

ofthisapproah,in thefollowingweshortlydesribesomeofthem.

2.1. The DataGrid Projet: Grid for Physis. TheEuropeanDataGrid [11℄is aprojetfunded by

theEuropeanUnion with theaim of setting upaomputational anddata-intensiveGrid ofresouresfor the

analysis of data oming from sienti exploration. The main goal of the projet is to oordinate resoure

sharing, ollaborative proessing and analysis of huge amounts of data produed and stored by many

sien-tilaboratoriesbelonging to several institutions. It is madeeetiveby thedevelopmentof atehnologial

infrastruture enabling sienti ollaborations where researhers and sientists will perform their ativities

regardlessofgeographialloation. Theprojetdevelopssalablesoftwaresolutionsin order to handle many

PBs 1

of distributed data, tensof thousand ofomputing resoures(proessors,disks, et.), and thousands of

simultaneous users from multiple researh institutions. The three real data intensive omputingappliations

areasovered by the projetare biology/medial, earth observation and partile physis. In partiular, the

lastoneisorientedto answerlongstandingquestions aboutthefundamental partilesofmatterandthefores

ating betweenthem. The goalis to understand why somepartiles are muh heavierthan others, and why

partileshavemassatall. Tothatend,CERN 2

hasbuilttheLargeHadronCollider (LHC),themostpowerful

partileaelerator everoneived, that generateshugeamountsof data. Itis estimated that LHC generates

approximately1GB/seand10PB/yearofdata. TheDataGridProjetprovidedthesolutionforstoringand

proessing this data, based on amulti-tiered, hierarhial omputing model for sharingdata and omputing

power among multiple institutions. In partiular, a Tier-0 entre is loated at CERN and is linked by high

speednetworks toapproximatelytenmajorTier-1 dataproessing entres. These fan outthedatato alarge

numberofsmallerones(Tier-2).

TheDataGridprojetended onMarh 2004,butmany oftheproduts(tehnologies, infrastruture,et.)

areused andextended in the EGEE projet. The Enabling Grids for E-sienE (EGEE) [13℄ projet brings

togethersientistsandengineersfrommorethan240institutionsin45ountriesworld-widetoprovideaseamless

Grid infrastruture for e-Siene that is available to sientists 24 hours/day. Expanding from originally two

1 _{P etaByte}

_{= 10}

6 _GigaBytes

(4)

sienti elds, high energy physis and life sienes, EGEE now integrates appliations from many other

sienti elds, ranging from geology to omputational hemistry. The EGEE Grid onsists of over 36,000

CPUsavailabletousers24hoursaday,7daysaweek,inadditiontoabout5PBdiskofstorage,andmaintains

30,000 onurrent jobs on average. Having suh resoures available hanges the waysienti researh takes

plae. Theend usedepends ontheusers'needs: largestorageapaity,thebandwidththat theinfrastruture

provides, orthe sheer omputing poweravailable. Generally, the EGEE Grid infrastrutureis ideal for any

sienti researhespeiallywhere the time andresouresneeded forrunning theappliations are onsidered

impratialwhenusingtraditionalITinfrastrutures.

2.2. The NASAInformation Power Grid (IPG) Infrastruture. TheNASA'sInformation Power

Grid (IPG) [18℄ is a high-performane omputing and data grid built primarily for use by NASA sientists

and engineers. The IPG has been onstrutedby NASA between 1998and the present makingheavyuse of

GlobusToolkitomponentstoprovideGridaesstoheterogeneousomputationalresouresmanagedbyseveral

independentresearhlaboratories. SientistsandengineersaesstheIPG's omputationalresouresfrom any

loationwithGridinterfaesprovidingseurity,uniformity,andontrol. SientistsbeyondNASAanalsouse

familiar Grid interfaesto inlude IPG resouresin theirappliations (with appropriate authorization). The

IPG infrastruturehasbeenand isbeingused by numeroussienti andengineeringeortsbothwithin and

beyondNASA. Someofits mostimportantappliationsareomputational uiddynamis andmeteorologial

datamining.

2.3. TeraGrid. TeraGrid [29℄ is an open sienti disovery infrastruture ombining leadership lass

resoures(inludingsuperomputers,storage,andsientivisualizationsystems)atninepartnersitestoreate

an integrated, persistent omputational resoure. It is oordinated by the Grid Infrastruture Group(GIG)

at the University of Chiago. Using high-performane network onnetions, the TeraGrid integrates

high-performane omputers, data resoures and tools, and high-end experimental failities around the ountry.

Currently,TeraGridresouresinludemorethan250teraopsof omputingapabilityandmorethan30PBs

ofonlineandarhivaldatastorage,withrapidaessandretrievaloverhigh-performanenetworks. Researhers

analsoaessmorethan100disipline-speidatabases.Withthisombinationofresoures,theTeraGridis

oneoftheworld'slargestandmostomprehensivedistributedGridinfrastrutureforopensientiresearh.

2.4. NASAand Google. ReentlyNASAinitiatedajointprojetwithGoogle,In. forapplyingGoogle

searh tehnology to help sientiststo proess, organize, and analyze thelarge-sale streams of data oming

fromtheLargeSynoptiSurveyTelesope(LSST),loatedinChile. Whenompleted,theLSSTwillgenerate

over 30 terabytes of multiple olor images of visible sky eah night. Google will ollaborate with LSST to

developsearh and dataaess tehniquesand serviesthat anproess,organize and analyzethe verylarge

amountsofdataomingfromtheinstrument'sdatastreamsinrealtime. Theenginewillreatedataimages"

forsientiststoviewsigniantspaeeventsandextratimportantfeaturesfromthem. Thisjointprojetwill

showhowomplexdatamanagementtehniquesgenerallyusedinsearhenginesanbeexploitedforsienti

disovery.

In thegeneral framework of this ollaboration, themain NASA's goalis to makeits hugestores of data

olleted during everything from spaeraft missions, moon landings to landings on Mars to orbits around

Jupiteravailabletosientistsandthepubli. SomeofthedataanalreadybefoundonNASA'sWebsitebut

exploitingGoogletehniqueswithhighperformanefailities,thisdatawill beaessiblein aneasyway.

2.5. Open Siene Grid. TheOpen SieneGrid [24℄isaollaborationofsieneresearhers,software

developersand omputing, storage and network providers. It gives aess to shared resouresworldwide to

sientists(fromuniversities,national laboratoriesand omputingentersarosstheUnitedStates). TheOpen

Siene Grid links storage and omputing resoures at more than 30 sites aross the United States. The

OSGworksativelywithmanypartners,inludingGridandnetworkorganizationsandinternational,national,

regionalandampusGrids,toreateaGridinfrastruturethatspanstheglobe. Sientistsfrommanydierent

eldsusetheOSGtoadvanetheirresearh. AppliationsofOSGprojetareativeinvariousareasofsiene,

likepartileandnulearphysis,astrophysis,bioinformatis,gravitational-wavesiene,mathematis,medial

imagingand nanotehnology. OSG resouresinlude thousands of omputersand 10 of terabytes of arhival

datastorage.

(5)

experimental results than data. myExperiment has been inuened by soial networking programs suh as

Wiredand Flikr, andis basedonthemySpae infrastruture. myExperimentenablessientiststo shareand

useworkowsandredue time-to-experiment,shareexpertiseandavoidreinvention. myExperimentreatesan

environmentforsientiststoadoptGridtehnologies,wheretheyandene,whentheysharedata,withwhom

theyshareit andhowmuh of itanbe aessed. ThemyExperiment projetmainly fouses itsappliations

atasestudiesforthespeiareasofastronomy,bio-informatis,hemistryandsoialsiene.

2.7. National Virtual Observatory. The National Virtual Observatory [23℄ is anew researh projet

whose goalisto makealltheastronomialdata in theworld quikly andeasily aessibleby anyone. Suh a

projetenablesanewwayofdoingastronomy,movingfrom aneraofobservationsofsmall, arefullyseleted

samplesofobjetsin oneorafewwavelengthbands, to theuseof multi-wavelength dataformillions,oreven

billions of objets. Suh large olletion of data makes researhers able to disover subtle, but signiant,

patternsinstatistiallyrihandunbiaseddatabases,andtounderstandomplexastrophysialsystemsthrough

theomparisonofdatatonumerialsimulations. With theNationalVirtualObservatory (NVO),astronomers

exploredata that others have alreadyolleted, ndingnew usesand new disoveriesin existing data. NVO

enablesastronomersto doanew typeofresearh that, ombinedwith traditionaltelesopeobservations,will

leadto manynew andinterestingdisoveries. It isworthnotiing that theNVOhas proposed to exploit the

omputationalresouresoftheTeraGridprojet(desribedintheSetion2.3),in ordertoenableastronomers

intheexplorationandanalysisofthephysialproessesthatdrivetheformationandevolutionofouruniverse,

andenouragingnewwaystousesuperomputingfailitiesforsiene.

2.8. Southern California Earthquake Center. The Southern California Earthquake Center projet

[26℄ is aimed at developingnew omputing apabilities, that an leadto better foreastsof when and where

earthquakesare likely to ourin Southern California,and howthe ground will shake asaresult. Thenal

goalis to improvemathematial models about the struture ofthe Earth and how theground movesduring

earthquakes. Theprojetteaminludesollaboratingresearhersfrom SouthernCaliforniaEarthquakeCenter

(SCEC),theInformationSienes Institute(ISI)at USC,theSan DiegoSuperomputingCenter(SDSC), the

InorporatedInstitutionsforSeismology(IRIS),andtheUnitedStatesGeologialSurvey(USGS).Theprojet

heavily exploits Grid tehnologies, allowing sientiststo organize and retrieve information stored throughout

theountry,andgivingadvantagesoftheproessingpowerofanetworkofmanyomputers.

3. Data Miningand Knowledge Disovery. Afterdisussingsigniantdatamanagementissuesand

projets, here we fous on data mining tehniques for knowledge disovery in large sienti data

reposito-ries.DataMining isthesemi-automatidisoveryofpatterns,models,assoiations,anomaliesand(statistially

signiant) strutures hidden in data. Traditional data analysis is assumption-driven, that is the hypothesis

is formed and validated against the data. Data mining, in ontrast, is disovery-driven, in the sense that

thepatterns (andmodels)areautomatially extrated from data. Datamining foundsits appliationto

sev-eralsientiandengineeringdomains,inludingastrophysis,medialimaging,omputationaluiddynamis,

biology,struturalmehanis,andeology.

Fromasientiviewpoint,dataanbeolletedbymanysoures: remotesensorsonasatellite,telesope

sanningthe sky, miroarrays generating gene expression data, sienti simulations, et. Moreover,in suh

infrastruturesdata are olletedand stored at enormous speeds (GBs/hour). Both suh aspets imply that

sientiappliationhavetodealwithmassivevolumeofdata.

Mininglargedatasetsrequirespowerfulomputationalresoures. Amajorissueindataminingissalability

with respetto theverylargesize of urrent-generation and next-generationdatabases,giventhe exessively

longproessingtime takenby (sequential) data miningalgorithms onrealistivolumesof data. Infat, data

mining algorithms working on very large data sets take a very long time on onventional omputers to get

results. Inordertoimproveperformanes,someparallelanddistributed approaheshavebeenproposed.

Parallel omputing isaviablesolutionforproessing andanalyzingdata sets inreasonabletimeby using

parallelalgorithms. Highperformaneomputersandparalleldataminingalgorithmsanoeraveryeient

waytomineverylargedatasets[27℄,[28℄byanalyzingtheminparallel. Underadataminingperspetive,suh

aeld isknownasparallel data mining (PDM).

Beyondthe development of knowledge disovery systemsbasedon parallel omputing platforms, alot of

workhasbeendevotedtodesignsystemsabletohandleandanalyzemulti-sitedatarepositories. Mining

(6)

environment. The researh area named distributed data mining oers an alternativeapproah. It works by

analyzingdata in adistributed fashionand payspartiularattentionto thetrade-o betweenentralized

ol-letionanddistributed analysisofdata. This tehnologyispartiularly suitableforappliationsthat typially

dealwithverylargeamountofdata(e.g.,transation data,sientisimulationandteleommuniationdata),

whihannotbeanalyzedinasinglesiteontraditionalmahinesinaeptabletimes.

Grid tehnologyintegratesbothdistributedandparallelomputing,thusitrepresentsaritial

infrastru-tureforhigh-performanedistributedknowledgedisovery. Gridomputingwasdesignedasanewparadigmfor

oordinatedresouresharingandproblem solvingin advanedsiene andengineeringappliations. Forthese

reasons,Gridsanoeraneetivesupporttotheimplementationanduseofknowledgedisoverysystemsby

Grid-basedDataMining approahes.

Inthefollowingparallel,distributedandGrid-baseddataminingaredisussed.

3.1. Parallel Data Mining. Parallel DataMining isonerned withthestudy and appliationof data

mining analysis doneby parallel algorithms. The keyideaunderlying suh a eld is that parallel omputing

an give signiant benets in the implementation of data mining and knowledge disoveryappliations, by

meansof theexploitationof inherentparallelismof dataminingalgorithms. Maingoalsofthe useof parallel

omputingtehnologiesinthedataminingeldare: (i)performaneimprovementsofexistingtehniques,(ii)

implementationofnew (parallel)tehniquesand algorithms,and (iii) onurrentanalysis using dierentdata

miningtehniquesinparallel andresultintegrationtogetabettermodel(i.e.,moreaurateresults).

As observed in [5℄, three main strategies an be identied in the exploitation of parallelism algorithms:

Independent Parallelism, Task Parallelism and Single Program Multiple Data(SPMD) Parallelism. We point

outthatthis isawellknown lassiationofgeneralstrategiesfordevelopingparallelalgorithms, in fatthey

are not neessarilyrelated only to data mining purposes. Nevertheless, in the following we will desribethe

underlying ideaofsuh strategiesbyontextualizingthem indata miningappliations. A shortdesriptionof

theunderlyingideaofsuhstrategiesfollows.

Independent Parallelism. It is exploited when proessesare exeuted in parallel in an independent way.

Generally,eahproesshasaess tothewhole dataset anddoesnotommuniateorsynhronizewith other

proesses. Suhastrategy,forexample,isappliedwhen

p

dierentinstanesofthesamealgorithmareexeuted

onthewholedataset,but eah onewithadierentsettingofinputparameters. Inthis way,theomputation

ndsout

p

dierentmodels,eahonedeterminedbyadierentsettingofinputparameters. Avalidationstep

should learnwhihoneof the

p

preditivemodelsis themost reliablefor thetopi under investigation. This

strategyoftenrequiresommutations amongtheparallel ativities.

Task Parallelism. It isknown alsoasControl Parallelism. It supposes that eah proess exeutesdierent

operationson(adierentpartitionof)thedataset. Theappliationofsuhastrategyindeisiontreelearning,

for example, leads to have

p

dierent proesses running, eah one assoiated to a partiular subtree of the

deision tree to be built. The searh goes parallely on in eah subtree and, as soon as all the

p

proesses

nishtheirexeutions,thewholenaldeisiontreeisomposedbyjoiningthevarioussubtreesobtainedbythe

proesses.

SPMD Parallelism. Thesingleprogram multiple data (SPMD) model [10℄ (alsoalled data parallelism)is

exploitedwhenasetofproessesexeuteinparallelthesamealgorithmondierentpartitionsofadataset,and

proessesooperatetoexhange partialresults. Aordingto thisstrategy, thedatasetis initially partitioned

in

p

parts,if

p

istheapriori-xedparallelismdegree(i.e.,thenumberofproessesrunningin parallel). Then,

the

p

proessessearhinparallelapreditivemodelforthesubsetassoiatedtoit. Finally,theglobalresultis

obtainedbyexhangingalltheloalmodelsinformation.

Thesethreestrategiesforparallelizingdataminingalgorithmsarenotneessarilyalternative. Infat,they

anbe ombinedtoimprovebothperformaneand aurayofresults. Forompleteness, wesayalso thatin

ombinationwith strategiesforparallelization,dierent datapartition strategiesmaybeused : (i) sequential

partitioning (separate partitions are dened without overlapping among them), (ii) over-based partitioning

(some data an be repliated on dierent partitions) and (iii) range-basedquerypartitioning (partitions are

denedonthebasisofsomequeriesthat seletdata aordingtoattributevalues).

Arhiteturalissuesareafundamentalaspetforthegoodnessofaparalleldataminingalgorithm. Infat,

interonnetion topology of proessors, ommuniation strategies, memory usage, I/O impat on algorithm

performane, load balaning of the proessors are strongly related to the eieny and eetiveness of the

(7)

relatedtotheparallelizationstrategiesandthereisamutualinuenebetweenknowledgeextrationstrategies

and arhitetural features. For instane, inreasing the parallelism degree in some ases orresponds to an

inrement of the ommuniation overhead among the proessors. However, ommuniationosts anbe also

balanedbytheimprovedknowledgethatadataminingalgorithmangetfromparallelization. Ateahiteration

theproessorssharetheapproximatedmodelsproduedbyeahofthem. Thuseahproessorexeutesanext

iterationusingitsownpreviousworkandalsotheknowledgeproduedbytheotherproessors. Thisapproah

animprovetherateatwhihadataminingalgorithmndsamodelfordata(knowledge)andmakeupforlost

timeinommuniation. Parallelexeutionofdierentdataminingalgorithmsandtehniquesanbeintegrated

notjust togethighperformanebutalsohighauray.

3.2. Distributed Data Mining. Traditional warehouse-basedarhitetures fordata miningsupposeto

haveentralized data repository. Suh aentralizedapproahis fundamentallyinappropriate for mostof the

distributed and ubiquitous data mining appliations. In fat, the long response time, lak of proper use of

distributedresoure,andthefundamentalharateristiofentralizeddataminingalgorithmsdonotworkwell

in distributedenvironments. Asalablesolutionfordistributed appliationsallsfor distributedproessingof

data,ontrolled bytheavailableresouresandhumanfators. Forexample,letus onsideranadhowireless

sensornetworkwherethedierentsensornodesaremonitoringsometime-ritialevents. Centralolletionof

datafromeverysensornodemayreatetraoverthelimitedbandwidthwirelesshannelsandthismayalso

drainalot ofpowerfrom thedevies.

A distributed arhiteture for data mining is likely aimed to redue the ommuniationloadand alsoto

reduethebatterypowermoreevenlyarossthedierentnodesinthesensornetwork. Oneaneasilyimagine

similarneedsfordistributedomputationofdataminingprimitivesinadhowirelessnetworksofmobiledevies

likePDAs,ellphones,andwearableomputers[25℄. Thewirelessdomainisnottheonlyexample. Infat,most

oftheappliationsthatdealwithtime-ritialdistributeddataarelikelytobenetbypayingarefulattention

to the distributed resouresfor omputation, storage, and the ost of ommuniation. As an other example,

letus onsider the World Wide Web as it ontainsdistributed data and omputing resoures. An inreasing

numberofdatabases(e.g.,weatherdatabases,oeanographidata,et.) anddatastreams(e.g.,nanialdata,

emergingdiseaseinformation,et.) areurrentlymadeon-line,andmanyofthemhangefrequently. Itiseasy

tothink ofmanyappliationsthat requireregularmonitoringofthesediverseanddistributed souresof data.

A distributed approahto analyzethis data is likelyto bemoresalable and pratialpartiularly when

the appliation involvesalarge numberof data sites. Hene, in this aseweneed data mining arhitetures

thatpayarefulattentiontothedistributionofdata,omputingandommuniation,inordertoaessanduse

theminanearoptimalfashion.Distributeddatamining (DDM)onsidersdatamininginthisbroaderontext.

DDMmayalsobeusefulinenvironmentswithmultipleomputenodesonnetedoverhighspeednetworks.

Evenifthedataanbequiklyentralizedusingtherelativelyfastnetwork,properbalaningofomputational

loadamongalusterofnodesmayrequireadistributed approah. Theprivayissueisplayinganinreasingly

importantroleintheemergingdataminingappliations. Forexample,letussupposeaonsortiumofdierent

banksollaboratingfor detetingfrauds. If aentralized solutionwasadopted,all the datafrom everybank

shouldbeolletedin asingleloation,to beproessedbyadataminingsystem. Nevertheless,in suhaase

adistributed data miningsystem should be thenatural tehnologial hoie: it is ableto learnmodels from

distributeddatawithoutexhangingtherawdataamongdierentrepositories,anditallowsdetetionoffraud

bypreservingtheprivayofeverybank'sustomertransationdata.

Forwhatonernstehniquesandarhiteture,itisworthnotiing thatmanyseveralothereldsinuene

DistributedData Mining systemsonepts. First, manyDDM systemsadopt the multi-agentsystem (MAS)

arhiteture,whih ndsitsrootin thedistributedartiialintelligene(DAI). Seond,althoughparallel data

miningoftenassumesthepreseneof highspeednetworkonnetionsamongthe omputingnodes,the

devel-opmentofDDMhasalsobeeninuenedbythePDMliterature. MostDDMalgorithmsaredesigneduponthe

potentialparallelismtheyanapplyoverthegivendistributeddata. Typially,thesamealgorithmoperateson

eah distributeddata siteonurrently,produing oneloalmodel persite. Subsequently, allloalmodels are

aggregatedtoproduethenalmodel. InFigure3.1ageneraldistributeddataminingframeworkispresented.

Thesuessof DDMalgorithmsliesintheaggregation. Eah loalmodelrepresentsloallyoherentpatterns,

but laksdetailsthat mayberequiredtoindue globally meaningfulknowledge. Forthis reason,manyDDM

(8)

multiple modelsandombinesthemto enhaneauray. Typially,voting(weightedorun-weighted) shema

are employed to aggregatebase model for obtaining aglobal model. As wehave disussed above, minimum

data transferis another keyattribute of the suessful DDM algorithm. As a nal onsideration,the

homo-geneity/heterogeneityofresouresisanotherimportantaspettobeonsideredinthedistributeddatamining

approahes. In this senario, the term "resoures"refers both to omputational resoures (omputers with

similar/dierent omputational power) and data resoures (datasets with horizontally/vertially partitioning

among nodes). The rstmeaning aetsonly thealgorithm exeutiontime, while data heterogeneityplaysa

fundamental role in the algorithm design. That is, dealingwith dierent data formatsit requires algorithms

designedinaordaneto thedierentdataformats.

Data Source 1

Data Mining

Algorithm

Local Model 1

Data Source 2

Data Mining

Algorithm

Local Model 2

Data Source n

Data Mining

Algorithm

Local Model n

Local Model

Aggregator

Global Model

Data

Source

i

Data

Mining

Algorithm

Local

Model i

Fig.3.1.GeneralDistributedDataMiningFramework.

3.3. Grid-based Data Mining. Inthe last years, Grid omputing is reeiving an inreasingattention

bothfrom theresearh ommunity and from industry and governments,wathing at this new omputing

in-frastrutureasakeytehnologyforsolvingomplexproblemsandimplementingdistributedhigh-performane

appliations. Grid tehnologyintegrates bothdistributed andparallel omputing, thusit representsaritial

infrastruturefor high-performanedistributed knowledgedisovery.Grid omputing diersfrom onventional

distributedomputingbeauseitfousesonlarge-saledynamiresouresharing,oersinnovativeappliations,

and,insomeases,itisgearedtowardhigh-performanesystems. TheGrid emergedasaprivilegedomputing

infrastruturetodevelopappliationsovergeographiallydistributedsites,providingforprotoolsandservies

enablingtheintegratedandseamlessuseofremoteomputingpower,storage,software,anddata,managedand

sharedbydierentorganizations.

Basi Grid protools and servies are provided by toolkits suh as Globus Toolkit (www.globus.org/

toolkit), Condor (www.s.wis.edu/ondor), Glite, and Uniore. In partiular, the Globus Toolkit is the

mostwidelyusedmiddlewareinsientianddata-intensiveGridappliations,andisbeomingadefato

stan-dardforimplementingGridsystems. Thistoolkitaddressesseurity, informationdisovery,resoureanddata

management,ommuniation,fault-detetion,andportabilityissues. Awidesetofappliationsisbeing

devel-opedfortheexploitationofGridplatforms. Sineappliationareasrangefromsientiomputingtoindustry

and business, speialized serviesare required to meet needs in dierent appliation ontexts. In partiular,

dataGrids havebeendesignedtoeasily store,move,andmanage largedata setsin distributed data-intensive

appliations. Besidesoredatamanagementservies,knowledge-basedGrids,builtontopofomputationaland

dataGrid environments,are neededto oerhigher-levelserviesfordataanalysis,inferene, anddisoveryin

sientiandbusinessareas[21℄. Insomepapers,seeforexample[1℄,[19℄,and[7℄,itislaimedthatthereation

of knowledge Grids is the enabling onditionfor developing high-performane knowledge disoveryproesses

and meeting the hallenges posed by theinreasing demand of powerand abstratness omingfrom omplex

problemsolvingenvironments.

(9)

appliationsareinreasinglybeomingavailableasstand-alonepakagesandasremoteserviesontheInternet.

Examplesinlude geneandDNAdatabases,networkaessand intrusiondata,drugfeatures andeets data

repositories, astronomy data les, and data about web usage, ontent, and struture. Knowledge disovery

proeduresinalltheseappliationstypiallyrequirethereationandmanagementofomplex,dynami,

multi-stepworkows. Ateahstep,datafromvarioussouresanbemoved,ltered,andintegratedandfedintoadata

miningtool. Basedontheoutputresults,thedeveloperhooseswhihother datasetsandminingomponents

anbeintegratedintheworkow,orhowtoiteratetheproesstogetaknowledgemodel. Workowsaremapped

on aGrid by assigning nodes to theGrid hosts and using interonnetions for implementingommuniation

amongtheworkownodes.

Forompletenessof treatment,wepointoutsomeother Grid-basedknowledge disoverysystemsand

a-tivities that have beendesigned in reentyears. Disovery Net [8℄ is aninfrastruture for eetivelysupport

sientiknowledgedisoveryproess,inpartiularintheareasoflifesieneandgeo-hazardpredition.

DataS-pae [17℄is aframework providing eientdata aessand transferovertheGridthat implementsanad-ho

protool for working with remote and distributed data (named DataSpae transfer protool, DSTP).

Info-Grid [16℄ is aservie-based data integration middleware engine, designed to provide information aess and

queryingserviesnotin an'universal'way,but byapersonalizedviewoftheresouresforeah partiular

ap-pliationdomain.DataCutter [2℄isanotherGridmiddlewareframeworkaimedatprovidingspeiserviesfor

thesupportofmulti-dimensionalrange-querying,dataaggregationanduser-denedlteringoverlargesienti

datasetsinshareddistributedenvironments. Finally,GATES [4℄(Grid-basedAdapTiveExeutiononStreams)

is anOGSA basedsystemthat providessupport for proessing of datastreams in a Gridenvironment. This

systemis designed to support thedistributed analysis of data streams arising from distributed soures (e.g.,

datafromlargesaleexperiments/simulations). GATESprovidesautomatiresouredisoveryandaninterfae

forenablingself-adaptationtomeetreal-timeonstraints.

TheKnowledge Gridarhiteture isdesigned aordingtothe Servie OrientedArhiteture (SOA), that

is a model for building exible, modular, and interoperable software appliations. The key aspet of SOA

is the onept of servie, that is a software blok apable of performing a given task or business funtion.

Eah servie operatesbyadhering toa well dened interfae,dening required parametersand the nature of

the result. One dened and deployed, servies are like blak boxes", that is, they work independently of

the state of any other servie dened within the system, often ooperating with other servies to ahieve a

ommongoal. ThemostimportantimplementationofSOAisrepresentedbyWebServies,whosepopularityis

mainlyduetotheadoptionofuniversallyaeptedtehnologiessuhasXML,SOAP,andHTTP.AlsotheGrid

providesaframeworkwherebyagreatnumberofserviesanbedynamiallyloated,balaned,andmanaged,

sothat appliations arealwaysguaranteed to beseurely exeuted, aordingto the priniples ofon-demand

omputing.

TheGrid ommunityhasadopted the Open Grid Servies Arhiteture (OGSA) asanimplementation of

theSOAmodelwithintheGridontext. InOGSAeveryresoureisrepresentedasaWebServiethatonforms

to a set of onventions and supports standard interfaes. OGSA provides a well-dened set of Web Servie

interfaesfor thedevelopmentofinteroperableGridsystemsandappliations[15℄. ReentlytheWS-Resoure

Framework (WSRF) has been adopted as an evolution of early OGSA implementations [9℄. WSRF denes

a family of tehnial speiations for aessing and managing stateful resoures using Web Servies. The

omposition of a Web Servie and astateful resoureis termed asWS-Resoure. The possibility to dene a

âstateâassoiatedtoaservieisthemostimportantdierenebetweenWSRF-ompliantWebServies,

andpre-WSRF ones. Thisis akeyfeaturein designingGrid appliations, sineWS-Resoures provideaway

torepresent,advertise,andaesspropertiesrelatedto bothomputationalresouresandappliations.

The Knowledge Grid is a software for implementing knowledge disoverytasks in a wide range of

high-performane distributed appliations. It oersto users high-level abstrations anda set of serviesby whih

theyanintegrateGridresourestosupportallthephasesoftheknowledgedisoveryproess.

TheKnowledgeGridsupportssuhativitiesbyprovidingmehanismsandhigherlevelserviesforsearhing

resoures,representing,reating,andmanagingknowledgedisoveryproesses,andforomposingexistingdata

serviesanddata miningservies in astrutured manner,allowingdesigners to plan, store,doument,verify,

shareandre-exeutetheirworkowsaswellasmanagetheiroutputresults. TheKnowledgeGridarhiteture

isomposed ofasetofserviesdividedin twolayers: theCore K-Gridlayer andtheHigh-levelK-Grid layer.

(10)

useofrepositoriesthatprovideinformationaboutresouremetadata,exeutionplans,andknowledgeobtained

asresultofknowledgedisoveryappliations.

In the Knowledge Grid environment, disovery proesses are represented as workows that a user may

omposeusing bothonreteandabstrat Gridresoures. Knowledgedisoveryworkowsaredened usinga

visualinterfaethatshowsresoures(data,tools,andhosts)to theuserandoersmehanismsforintegrating

theminaworkow. InformationaboutsingleresouresandworkowsarestoredusinganXML-basednotation

that representsaworkow(alledexeutionplan intheKnowledgeGridterminology)asadata-owgraphof

nodes, eah onerepresentingeither adataminingservieoradatatransferservie. The XMLrepresentation

allowstheworkowsfordisoveryproessestobeeasilyvalidated,shared,translatedinexeutablesripts,and

storedforfutureexeutions. Itisworthnotiing thatwhentheusersubmitsaknowledgedisoveryappliation

totheKnowledgeGrid,shehasnoknowledgeaboutallthelowleveldetailsneededbytheexeutionplan. More

preisely, thelient submits to the Knowledge Grid a high level desriptionof the KDD appliation, named

oneptual model, moretargeted to distributed knowledge disoveryaspets than to grid-relatedissues. The

KnowledgeGridinarststepreatesanexeutionplanonthebasisoftheoneptualmodelreeivedfromthe

user, andthen exeutesitbyusing theresoureseetivelyavailable. Torealizethis logi,it initially models

anabstrat exeutionplan (wheresomespeiedresoureouldremain'abstratly'dened,i.e. theyouldnot

mathwitharealresoure),that inaseondstepisresolvedintoaonreteexeutionplan (whereamathing

betweeneahresoureandsomeonereallyavailableontheGridis found).

TheKnowledgeGridhasbeenusedinvariousrealsenarios,pointingoutitssuitabilityinseveral

heteroge-neousappliations. Forlakofspaewearenotabletodisussaboutthem. Forsuhareasonwegiveherejust

someoutlines,moredetails anbefoundin theited papers. Thegoaloftheexampledesribed in[6℄wasto

obtainalassierforanintrusiondetetionsystem,performingaminingproessona(verylargesize)dataset

ontainingreordsgeneratedby network monitoring. Theexamplereportedin [5℄wasasimplemeta-learning

proess,thatexploitstheKnowledgeGridtogenerateanumberofindependentlassiersbyapplyinglearning

programstoaolletionofdistributeddatasets inparallel.

As a sienti appliation senario, let us onsider the olletion of sky observations and the analysis

of their harateristis. Let us suppose to have distint image data obtained by observations and

simula-tions, from whih we want to extrat signiant metris. Generally, a signiative set of astronomydata is

verylarge size (

≈

20 −

30

terabytes). In addition, suh kind of observation are very high-dimensional,

be-ause eah point is usually desribed by

≈

10

3

attributes (inluding morphologial parameters, ux ratios,

et.). Finally, they usually are full of missing values and noise. Then, the main issue here is to analyze

a distribution of

≈

20 −

30

terabytes of points in a parameter spae of

≈

10

3

dimensions. Let us

sup-pose that our eort is devoted to identify how many distint types of objets are there (i. e., stars,

galax-ies, quasars, blak holes, et.), and grouping them with respet to their type. This an be obtained by a

lustering analysis, however it is a non-trivial task if we onsider the large size data and their high

dimen-sionality. To suh a purpose, a distributed framework an be suitable to get results in a reasonable time.

Initially we have a data repository where all suh an observed sky data is olleted (for example, an

astro-nomi observatory). Then, suh a data is proessed by a distributed lustering algorithm. In order to do

that, they are partitioned on many nodes and proessed on those nodes in parallel. The results of every

lustering algorithm are olleted and ombined to obtain a global lustering model. In addition, eah

out-lier anrepresenta possible(rare)new objet. For suh a reason, and in order to get moreknowledgefrom

them, all thedeteted outliers aretransferred to another node for a further lassiation, i. e. by adeision

tree.

Figure4.1showssuhadistributedmeta-learningsenario,inwhihagloballusteringmodellassier

CM

is obtainedon

N ode

C

startingfrom the original dataset

DS

stored on

N ode

A

(i.e, where theobservatoryis

loated). Moreover,alltheoutliersdetetedareolletedinanoutlierset

OS

andareproessedbyalassier

Cl

ona

N ode

B

. Thisproessanbedesribedthroughthefollowingsteps:

1. On

N ode

A

,datasets

DS

1

, . . . , DS

n

areextratedfrom

DS

bythepartitioner

P

. Then

DS

1

, . . . , DS

n

,

arerespetivelymovedfrom

N ode

A

to

N ode

1

, . . . , N ode

n

.

2. On eah

N ode

i

(

i

= 1

, . . . , n

)

the lusterer

C

i

applies a lustering algorithms on eah dataset

DS

i

.

Then,eahloalresultismovedfrom

N ode

i

to

N ode

C

.

3. On

N ode

C

,loalmodelsreeivedfrom

N ode

1

, . . . , N ode

n

areombinedbytheombiner

C

toprodue

the globallustering model

CM

. Moreover,outliers detetedare olleted in anoutlier set

OS

, and

(11)

4. On

N ode

B

, the lassier

Cl

proesses the

OS

outlier data set and extrats a suitable lassiation

model(i.e.,adeisiontree)fromit.

BeingtheKnowledgeGridaservieorientedarhiteture,theKnowledgeGriduserinteratswithsomeservies

todesignandexeutesuhanappliation.

As an additional onsideration, we notie that a lient appliation, that wants to submit a knowledge

disoveryomputation to the Knowledge Grid, has to interat not with all of these servies, but just with

someofthem; thereare,infat,twolayersofservies: high-level servies(DAS,TAAS,EPMS andRPS)and

ore-level servies (KDS and RAEMS). The design ideais that user level appliations diretly interat with

high-levelserviesthat,inordertoperformalientrequest,invokesuitableoperationsexportedbytheore-level

servies. Inturn, ore-levelserviesperformtheiroperationsbyinvokingbasiservies provided byavailable

gridenvironmentsrunningonthespei host,aswellasbyinteratingwithotherore-levelservies. Inother

words,operationsexportedbyhigh-levelserviesaredesignedtobeinvokedbyuser-levelappliations,whereas

operationsprovidedbyore-levelserviesarethoughttobeinvokedbothbyhigh-levelandore-levelservies.

Morein detail,the useran interats withthe DAS (Data Aess Servie)and TAAS (Tools and Algorithms

Aess Servie)servies to nd dataand mining software andwith the EPMS (ExeutionPlan Management

Servie)servietoomposeaworkow(exeutionplan)desribingatahighleveltheneededativitiesinvolved

in theoverall dataminingomputation. Throughtheexeutionplan, omputing,softwareanddataresoures

arespeiedalongwith aset ofrequirementsonthem. The exeutionplan isthen proessedbythe RAEMS

(Resoure Alloation andExeutionManagement Servie), whih takesareofits alloation. Inpartiular, it

rst ndsappropriate resouresmathing userrequirements(i. e., a set of onrete hosts

N ode

1

, . . . , N ode

n

,

oering thesoftware

C

1

, . . . , C

n

, and anode

N ode

W

providing the

C

ombiner software and a node

N ode

Z

exporting thelassier

Cl

), thenmanagesthe exeutionof overall appliation, enforingdependeniesamong

data extration, transfer, and mining steps. Finally, the RAEMS manages results retrieving, and visualize

thembytheRPS (ResultsPresentation Servie)servie(thatoersfailitiesforpresentingandvisualizingthe

extratedknowledgemodels).

Data Set

DS

Partitioner

P

Node A

Data Set

DS1

Clusterer

C1

Node 1

Data Set

DS2

Clusterer

C2

Node 2

Data Set

DSn

Clusterer

Cn

Node n

Combiner

C

Node C

Clustering

Model

CM

Outlier Set

OS

Classifier

Cl

Node B

Outlier

Classification

Model

OCM

C1

C2

Cn

C4

Fig.4.1. Adistributedmeta-learningsenario.

5. Conlusion. In this paper we havepointedout that digital data volumesare growingexponentially

in siene and engineering. Often digital repositories and soures inrease their size muh faster than the

omputationalpoweroeredbytheurrenttehnology. Tohandlethisabundaneindataavailability,sientists

mustembodyknowledgedisoverytoolsto ndwhatisinterestinginthem.

Whendataismaintainedovergeographiallydistributedsites,Gridomputinganbeusedasadistributed

infrastrutureforservie-basedintensiveappliations. VarioussientiappliationsbasedonGrid

infrastru-tures, desribed in the paper, onretely show how it anbeexploited forsientipurposes. Moreover,the

omputationalpowerofdistributed andparallel systemsanbeexploited forknowledgedisoveryinsienti

(12)

is areferene softwarearhiteture forgeographially distributed knowledge disoverysystemsthat allows to

implementomplexdataanalysisappliationsasaolletionof distributedservies.

REFERENCES

[1℄ F.Berman,FromTeraGridtoKnowledgeGrid,CommuniationsoftheACM,44(11)(2001),pp.2728.

[2℄ M.Beynon, T. Kur, U.Catalyurek,C. Chang, A.Sussman, andJ. Saltz,Distributed Proessing of VeryLarge

DatasetswithDataCutter,ParallelComputing,27(11)(2001),pp.14571478.

[3℄ M.CannataroandD.Talia,TheKnowledgeGrid,CommuniationsoftheACM,46(1)(2003),pp.8993.

[4℄ L.Chen,K.Reddy,andG.Agrawal,GATES:AGrid-BasedMiddlewareforProessingDistributedDataStreams,Pro.

ofthe13thIEEEInt.SymposiumonHighPerformaneDistributedComputing(HPDC),(2004),pp.192201.

[5℄ A.Congiusta,D.Talia,andP.Trunfio,ParallelandGrid-BasedDataMining,inDataMiningandKnowledgeDisovery

Handbook,Springer,2005,pp.10171041.

[6℄ A.Congiusta,D. Talia,andP.Trunfio,UsingGridsforDistributed KnowledgeDisovery,inMathematialMethods

forKnowledgeDisoveryandDataMining,IGIGlobalPublisher,2007,pp.248298.

[7℄ A. Congiusta, D. Talia, and P. Trunfio, Distributed Data Mining Servies Leveraging WSRF, Future Generation

ComputerSystems,23(1)(2007),pp.3441.

[8℄ V.Curin,M.Ghanem,Y.Guo,M.Kohler,A.Rowe,J.SyedJ.,andP.Wendel,DisoveryNet:TowardsaGrid

ofKnowledgeDisovery,Pro.ofthe8thInt.ConfereneonKnowledgeDisoveryandDataMining(KDD),(2002).

[9℄ K. Czajkowski et al., The WS-Resoure Framework Version 1.0,

http://www.ibm.om/developerworks/library/ws-resoure/ws-wsrf.pdf,2004.

[10℄ F.Darema,SPMDmodel: past,presentandfuture,inReentAdvanesinParallelVirtualMahineandMessagePassing

Interfae: 8thEuropeanPVM/MPIUsers'GroupMeeting,Springer,2001,p.1.

[11℄ DataGridProjet,http://web.datagrid.nr.it,2001.

[12℄ S.G.Djorgovski,VirtualAstronomy,InformationTehnology,andtheNewSientiMethodology,Pro.ofthe7thInt.

WorkshoponComputerArhiteturesforMahinePereption(CAMP),(2005),pp.125132.

[13℄ EGEEProjet,http://www.eu-egee.org/,2005.

[14℄ I.Foster,C.Kesselman,J.Nik,andS.Tueke,ThePhysiologyoftheGrid:AnOpenGridServiesArhiteturefor

Distributed SystemsIntegration,GlobusProjet,www.globus.org/alliane/publiations/papers/ogsa.pdf,2002.

[15℄ I.Foster,C.Kesselman,J.Nik,andS.Tueke,ThePhysiologyoftheGrid,inGridComputing:MakingtheGlobal

InfrastrutureaReality,F.Berman,G.Fox,andA.Hey,eds.,Wiley,2003,pp.217249.

[16℄ N.Giannadakis,A.Rowe,M.Ghanem,andY.Guo,AWebInfrastruturefortheExploratoryAnalysisandMiningof

Data,InformationSienes,2007,pp.199226.

[17℄ R. Grossman, and M. Mazzuo, DataSpae: A Data Web for the Exploratory Analysis and Mining of Data,IEEE

ComputinginSieneandEngineering,4(4),2002,pp.4451.

[18℄ IPGProjet,http://www.gloriad.org/gloriad/projets/ projet000053.html,1998.

[19℄ W. E. Johnston,Computational and Data Gridsin Large SaleSieneand Engineering,FutureGeneration Computer

Systems,18(8)(2002),pp.10851100.

[20℄ F.Meyer,GenomeSequeningvs.Moore'sLaw: CyberChallenges fortheNextDeade, CTWathQuarterly,2(3)(2006),

http://www.twath.org/quarter ly/ar tiles/2 006/ 08/g eno me-sequening-vs-moores-law/.

[21℄ R.Moore,Knowledge-basedGrids,Pro.ofthe18thIEEESymposiumonMassStorageSystemsand9thGoddardConferene

onMassStorageSystemsandTehnologies,(2001).

[22℄ myExperimentProjet,http://www.eu-egee.org/,2006.

[23℄ NationalVirtualObservatoryProjet,http://www.us-vo.org/,http://www.virtualobservatory.org/,2001.

[24℄ OpenSieneGridProjet,http://www.opensienegrid.org/,2004.

[25℄ B.Park,andH.Kargupta,DistributedDataMining:Algorithms,Systems,andAppliations,inDataMiningHandbook,

IEAPublisher,2002,pp.341358.

[26℄ SouthernCalifornia EarthquakeCenterProjet,http://epienter.us.edu/meportal/index.html,2001.

[27℄ D.Skilliorn,StrategiesforParallelDataMining,IEEEConurreny,7(4)(1999),pp.2635.

[28℄ D.Talia,ParallelisminKnowledgeDisoveryTehniques,Pro.ofthe6thInt.Conf.onAppliedParallelComputing,(2002),

pp.127136.

[29℄ TeraGridProjet,http://www.teragrid.org/, 2005.

Editedby: PasquaD'Ambra,DanieladiSerano,MarioRosarioGuarraino,FranesaPerla

Reeived: June2007