USING GRIDSFOREXPLOITING THE ABUNDANCE OFDATA INSCIENCE
EUGENIO CESARIO
∗
AND DOMENICO TALIA
∗
,
†
Abstrat. Digitaldata volumes aregrowing exponentially inallsienes. To handlethisabundane indata availability,
sientists must use data analysis tehniques in their sienti praties and solving environments to get the benets oming
fromknowledge thatan beextrated fromlargedata soures. Whendata ismaintainedover geographially remotesites the
omputationalpowerofdistributedandparallelsystemsanbeexploitedforknowledgedisoveryinsientidata. Inthissenario
theGridanprovideaneetiveomputationalsupportfordistributedknowledgedisoveryonlargedatasets.Inpartiular,Grid
serviesfordataintegrationandanalysisanrepresentaprimaryomponentfore-sieneappliationsinvolvingdistributedmassive
andomplexdata sets. Thispaper desribessomeresearhativitiesindata-intensiveGridomputing. Inpartiularwedisuss
theuseofdataminingmodelsandserviesonGridsystemsfortheanalysisoflargedatarepositories.
Keywords: e-siene,knowledgedisovery,grid,paralleldatamining,distributeddatamining,grid-baseddatamining
1. Introdution. Thepasttwodeadeshavebeendominatedbytheadventofinreasinglypowerfuland
less expensive ubiquitous omputing, as well as the appearane of the World Wide Web and related
teh-nologies[12℄. Due to suh advanesin information tehnology andhigh performaneomputing, digital data
volumes are growing exponentially in many elds of human ativities. This phenomenon onerns sienti
disiplines,aswellasindustryandommere. Suhtehnologialdevelopmenthasalsogeneratedawholenew
setof hallenges: theworldis drowningin ahugequantityofdata,whih isstillgrowingveryrapidly bothin
thevolumeandomplexity.
Jim Gray in some talks in 2006 identied four hronologial steps for the methodologies employed by
sientists fordisoveries. Therst step ourred thousand years ago, when siene wasempirial and it was
orientedtojustdesribenaturalphenomena. Theseondoneistemporallyloatedaroundafewhundredyears
ago,when atheoretialbranh wasborn,aimed at formulatingsomegeneralmodelsdesribingthe empirial
knowledge. Thethird step ourred in the latestfew deades, when aomputational branh started upand
omplexphenomenastartedtobesimulatedbytheresouresmadeavailablebytheurrenttehnology. Finally,
thefourth step isrun today, when sientists areworking tounify theories, experimentsand simulations with
dataproessingandexplorationto extratknowledgehiddeninit.
Theabundaneofdigitallystoreddatarequireto onsiderin detailthis phenomenon. Inpartiular,there
are two importanttrends, tehnologial and methodologial, whih seem to partiularly distinguish the new,
information-rihsienefromthepast:
•
Tehnologial. Thereisalotofdataolletedandwarehousedinvariousrepositoriesdistributedovertheworld: dataanbeolletedandstoredathighspeedsinloaldatabases,fromremotesouresorfrom
theour galaxy. Someexamples inlude datasets from theelds of medialimaging, bio-informatis,
remote sensing and (as very innovative aspet) several digital sky surveys. This implies a need for
reliabledatastorage,networking,anddatabase-relatedtehnologies,standardsandprotools.
•
Methodologial. Hugedatasetsarehardtounderstand,andinpartiulardataonstrutsandpatternspresent in them annot be omprehended by humans diretly. This is a diret onsequene of the
growthinomplexityofinformation,andmainlyitsmulti-dimensionality. Forexample,aomputational
simulationangenerateterabytesofdatawithinafewhours,whereashumananalystsmaytakeseveral
weekstoanalyzethesedatasets. Forsuhareason,mostofdatawillneverbereadbyhumans,rather
theyareto beproessedandanalyzedbyomputers.
Weansummarizewhat weforesaidasfollows: whereassomedeadesagothemain problemwasthelak
ofinformation, thehallengenowseemstobe(i)the very largevolume of information and (ii) the assoiated
omplexitytoproess for extratingsigniant anduseful partsorsummaries.
Nevertheless, the rst aspet does not represent a limitation or aproblem for the sienti ommunity:
urrentdatastorage,arhiteturalsolutionsandommuniationprotoolsprovideareliabletehnologialbase
to ollet and store suh abundane of data in an eient and eetive way. Moreover, the availability of
highthroughputsientiinstrumentationandveryinexpensivedigitaltehnologiesfailitatedthistrendfrom
both tehnologialand eonomialviewpoint. On theother hand, theomputational powerof omputers is
∗
ICAR-CNR,ViaP.Bui41C,87036Rende(CS),Italy(esarioiar.nr.it, taliaiar.nr.it).
†
not growing asfast as the demand of suh a data omputation requires, and this represents a limit for the
knowledgethatpotentiallyouldbeextrated. Asanadditionalaspet,wehavetoonsiderthat storageosts
areurrentlydereasingfaster thanomputingosts,andthistrendmakesthingsworse.
Forexample,theimpatofforesaidissuesinthebiologialeldiswelldesribedin[20℄. Itpointsoutthat
the emergene of genome and post-genome tehnology has made huge amount of data available,demanding
a proportional support of analysis. Nevertheless, an important fator to be onsidered is that the number
of available omplete genomi sequenes is doubling almost every 12 months, whereas aording to Moore's
law available omputeyles (i. e., omputational power) double every 18 months. Additionally, we have to
onsiderthat analysisofgenomisequenesrequirebinaryomparisonsofthegenesinvolvedinit. Asadiret
onsequene of that, the omputational overhead is very very high. We an see the impat of suh issues
in Figure 1.1 (soure: [20℄), whih ontrasts the number of geneti sequenes obtained with the number of
annotationsgenerated. Thegureshowsthat theknowledge(annotations,models, patterns) hasasub-linear
ratewithrespettothetheavailabledatasequeneswhihtheyareextratedfrom.
Fig.1.1. Growthofsequenesandannotationssine1982(Soure:[20℄)
To handle this abundane in data availability (whose rate of prodution often far outstrips our ability
to analyze), appliations are emerging that explore, query, analyze, visualize, and in general, proess very
large-sale data sets: they are named data intensive appliations. Computational siene is evolving toward
dataintensiveappliationsthatinludedataintegrationandanalysis,informationmanagement,andknowledge
disovery. Inpartiular,knowledgedisoveryin largedatarepositoriesanndwhatisinterestingin themby
usingdataminingtehniques. Dataintensiveappliationsinsienehelpsientistsinhypothesisformationand
give them a support on their sienti praties and solvingenvironments, getting thebenets omingfrom
knowledgethat anbeextratedfrom largedatasoures.
Whendataismaintainedovergeographiallydistributed sitestheomputationalpowerofdistributed and
parallel systems an be exploited for knowledge disovery in sienti data. Parallel and distributed data
miningalgorithmsaresuitabletosuhapurpose. Moreover,inthis senariotheGrid anprovideaneetive
omputational support for dataintensiveappliation and for knowledge disoveryfrom largeand distributed
datasets. Grid omputing is reeivingan inreasingattentionfrom the researhommunity, wathing at this
newomputinginfrastrutureasakeytehnologyforsolvingomplexproblems andimplementingdistributed
high-performaneappliations[14℄.
Todaymanyorganizations,ompanies,andsientientersprodueandmanagelargeamountsofomplex
dataandinformation. Climate,astronomi,andgenomidatatogetherwithompanytransationdataarejust
some examples of massive amounts of digital data that today must be stored and analyzed to nd useful
knowledge inthem. Thisdataand informationpatrimonyanbeeetively exploitedifit isused asasoure
to produe knowledge neessaryto support deisionmaking. This proess is bothomputationally intensive,
ollaborative, anddistributed in nature. The developmentof data mining software for Gridsoers toolsand
enablingonditionfordevelopinghigh-performanedataminingtasksandknowledgedisoveryproesses,and
itmeetsthehallengesposedbytheinreasingdemand forpowerandabstratnessomingfromomplexdata
miningsenariosin sieneandengineering. Forexample,someprojetsdesribedinSetion 2suhasNASA
InformationGrid, TeraGrid, andOpenSiene Gridusetheomputationalandstoragefailitiesin theirGrid
infrastruturestominedatainadistributedway. Sometimeintheseprojetsareusedadhosolutionsfordata
mining,in otherasesgenerimiddlewareisused ontopof basiGridtoolkits. AspointedoutbyWilliamE.
Johnstonin [19℄, theuseof generalpurposedatamining toolsmayeetivelysupport theanalysisof massive
anddistributeddatasets inlargesalesieneandengineering.
TheGrid allowsto federate and share heterogeneousresouresand servies suh assoftware, omputers,
storage,data,networksinadynamiway. Gridserviesanbethebasielementforomposingsoftwareanddata
elements,and exeutingomplexappliationson Gridand Websystems. TodaytheGrid isnotjust ompute
yles,butitisalsoadistributeddatamanagementinfrastruture. Integratingthosetwofeatureswithsmart"
algorithms we an obtain a knowledge-intensive platform. The driving Grid appliations are traditionally
high-performane and data intensive appliations, suh as high-energy partile physis, and astronomy and
environmental modeling, in whih experimental devies reate largequantities of data that requiresienti
analysis.
Inthe latestyears manysigniantGrid-baseddata intensiveappliations and infrastrutureshavebeen
implemented. Inpartiular,theservie-basedapproahisallowingtheintegrationofGridandWebforhandling
withdata. Wewillbrieyreportsomeoftheseappliationsin therstofthepaper;thenwedisussaboutthe
useofhighperformanedataminingtehniquesforsieneinGrid platforms.
Therestofthepaperisorganizedasfollows. Setion2desribessomeGrid-baseddataintensiveprojetsand
appliations. Setion 3gives anoverview of approahes forparallel, distributed and Grid-based data mining
tehniques. Setion 4 introdues the Knowledge Grid, a referene software arhiteture for geographially
distributedknowledgedisoverysystems. TheSetion 5givesonludingremarks.
2. GridTehnologiesfor dealingwith Sientidata. Severalsientiteamsandommunitiesare
usingGridtehnologyfor dealingwithintensiveappliationsaimedat sientidataproessing. Asexamples
ofthisapproah,in thefollowingweshortlydesribesomeofthem.
2.1. The DataGrid Projet: Grid for Physis. TheEuropeanDataGrid [11℄is aprojetfunded by
theEuropeanUnion with theaim of setting upaomputational anddata-intensiveGrid ofresouresfor the
analysis of data oming from sienti exploration. The main goal of the projet is to oordinate resoure
sharing, ollaborative proessing and analysis of huge amounts of data produed and stored by many
sien-tilaboratoriesbelonging to several institutions. It is madeeetiveby thedevelopmentof atehnologial
infrastruture enabling sienti ollaborations where researhers and sientists will perform their ativities
regardlessofgeographialloation. Theprojetdevelopssalablesoftwaresolutionsin order to handle many
PBs 1
of distributed data, tensof thousand ofomputing resoures(proessors,disks, et.), and thousands of
simultaneous users from multiple researh institutions. The three real data intensive omputingappliations
areasovered by the projetare biology/medial, earth observation and partile physis. In partiular, the
lastoneisorientedto answerlongstandingquestions aboutthefundamental partilesofmatterandthefores
ating betweenthem. The goalis to understand why somepartiles are muh heavierthan others, and why
partileshavemassatall. Tothatend,CERN 2
hasbuilttheLargeHadronCollider (LHC),themostpowerful
partileaelerator everoneived, that generateshugeamountsof data. Itis estimated that LHC generates
approximately1GB/seand10PB/yearofdata. TheDataGridProjetprovidedthesolutionforstoringand
proessing this data, based on amulti-tiered, hierarhial omputing model for sharingdata and omputing
power among multiple institutions. In partiular, a Tier-0 entre is loated at CERN and is linked by high
speednetworks toapproximatelytenmajorTier-1 dataproessing entres. These fan outthedatato alarge
numberofsmallerones(Tier-2).
TheDataGridprojetended onMarh 2004,butmany oftheproduts(tehnologies, infrastruture,et.)
areused andextended in the EGEE projet. The Enabling Grids for E-sienE (EGEE) [13℄ projet brings
togethersientistsandengineersfrommorethan240institutionsin45ountriesworld-widetoprovideaseamless
Grid infrastruture for e-Siene that is available to sientists 24 hours/day. Expanding from originally two
1
P etaByte
= 10
6
GigaBytes
sienti elds, high energy physis and life sienes, EGEE now integrates appliations from many other
sienti elds, ranging from geology to omputational hemistry. The EGEE Grid onsists of over 36,000
CPUsavailabletousers24hoursaday,7daysaweek,inadditiontoabout5PBdiskofstorage,andmaintains
30,000 onurrent jobs on average. Having suh resoures available hanges the waysienti researh takes
plae. Theend usedepends ontheusers'needs: largestorageapaity,thebandwidththat theinfrastruture
provides, orthe sheer omputing poweravailable. Generally, the EGEE Grid infrastrutureis ideal for any
sienti researhespeiallywhere the time andresouresneeded forrunning theappliations are onsidered
impratialwhenusingtraditionalITinfrastrutures.
2.2. The NASAInformation Power Grid (IPG) Infrastruture. TheNASA'sInformation Power
Grid (IPG) [18℄ is a high-performane omputing and data grid built primarily for use by NASA sientists
and engineers. The IPG has been onstrutedby NASA between 1998and the present makingheavyuse of
GlobusToolkitomponentstoprovideGridaesstoheterogeneousomputationalresouresmanagedbyseveral
independentresearhlaboratories. SientistsandengineersaesstheIPG's omputationalresouresfrom any
loationwithGridinterfaesprovidingseurity,uniformity,andontrol. SientistsbeyondNASAanalsouse
familiar Grid interfaesto inlude IPG resouresin theirappliations (with appropriate authorization). The
IPG infrastruturehasbeenand isbeingused by numeroussienti andengineeringeortsbothwithin and
beyondNASA. Someofits mostimportantappliationsareomputational uiddynamis andmeteorologial
datamining.
2.3. TeraGrid. TeraGrid [29℄ is an open sienti disovery infrastruture ombining leadership lass
resoures(inludingsuperomputers,storage,andsientivisualizationsystems)atninepartnersitestoreate
an integrated, persistent omputational resoure. It is oordinated by the Grid Infrastruture Group(GIG)
at the University of Chiago. Using high-performane network onnetions, the TeraGrid integrates
high-performane omputers, data resoures and tools, and high-end experimental failities around the ountry.
Currently,TeraGridresouresinludemorethan250teraopsof omputingapabilityandmorethan30PBs
ofonlineandarhivaldatastorage,withrapidaessandretrievaloverhigh-performanenetworks. Researhers
analsoaessmorethan100disipline-speidatabases.Withthisombinationofresoures,theTeraGridis
oneoftheworld'slargestandmostomprehensivedistributedGridinfrastrutureforopensientiresearh.
2.4. NASAand Google. ReentlyNASAinitiatedajointprojetwithGoogle,In. forapplyingGoogle
searh tehnology to help sientiststo proess, organize, and analyze thelarge-sale streams of data oming
fromtheLargeSynoptiSurveyTelesope(LSST),loatedinChile. Whenompleted,theLSSTwillgenerate
over 30 terabytes of multiple olor images of visible sky eah night. Google will ollaborate with LSST to
developsearh and dataaess tehniquesand serviesthat anproess,organize and analyzethe verylarge
amountsofdataomingfromtheinstrument'sdatastreamsinrealtime. Theenginewillreatedataimages"
forsientiststoviewsigniantspaeeventsandextratimportantfeaturesfromthem. Thisjointprojetwill
showhowomplexdatamanagementtehniquesgenerallyusedinsearhenginesanbeexploitedforsienti
disovery.
In thegeneral framework of this ollaboration, themain NASA's goalis to makeits hugestores of data
olleted during everything from spaeraft missions, moon landings to landings on Mars to orbits around
Jupiteravailabletosientistsandthepubli. SomeofthedataanalreadybefoundonNASA'sWebsitebut
exploitingGoogletehniqueswithhighperformanefailities,thisdatawill beaessiblein aneasyway.
2.5. Open Siene Grid. TheOpen SieneGrid [24℄isaollaborationofsieneresearhers,software
developersand omputing, storage and network providers. It gives aess to shared resouresworldwide to
sientists(fromuniversities,national laboratoriesand omputingentersarosstheUnitedStates). TheOpen
Siene Grid links storage and omputing resoures at more than 30 sites aross the United States. The
OSGworksativelywithmanypartners,inludingGridandnetworkorganizationsandinternational,national,
regionalandampusGrids,toreateaGridinfrastruturethatspanstheglobe. Sientistsfrommanydierent
eldsusetheOSGtoadvanetheirresearh. AppliationsofOSGprojetareativeinvariousareasofsiene,
likepartileandnulearphysis,astrophysis,bioinformatis,gravitational-wavesiene,mathematis,medial
imagingand nanotehnology. OSG resouresinlude thousands of omputersand 10 of terabytes of arhival
datastorage.
experimental results than data. myExperiment has been inuened by soial networking programs suh as
Wiredand Flikr, andis basedonthemySpae infrastruture. myExperimentenablessientiststo shareand
useworkowsandredue time-to-experiment,shareexpertiseandavoidreinvention. myExperimentreatesan
environmentforsientiststoadoptGridtehnologies,wheretheyandene,whentheysharedata,withwhom
theyshareit andhowmuh of itanbe aessed. ThemyExperiment projetmainly fouses itsappliations
atasestudiesforthespeiareasofastronomy,bio-informatis,hemistryandsoialsiene.
2.7. National Virtual Observatory. The National Virtual Observatory [23℄ is anew researh projet
whose goalisto makealltheastronomialdata in theworld quikly andeasily aessibleby anyone. Suh a
projetenablesanewwayofdoingastronomy,movingfrom aneraofobservationsofsmall, arefullyseleted
samplesofobjetsin oneorafewwavelengthbands, to theuseof multi-wavelength dataformillions,oreven
billions of objets. Suh large olletion of data makes researhers able to disover subtle, but signiant,
patternsinstatistiallyrihandunbiaseddatabases,andtounderstandomplexastrophysialsystemsthrough
theomparisonofdatatonumerialsimulations. With theNationalVirtualObservatory (NVO),astronomers
exploredata that others have alreadyolleted, ndingnew usesand new disoveriesin existing data. NVO
enablesastronomersto doanew typeofresearh that, ombinedwith traditionaltelesopeobservations,will
leadto manynew andinterestingdisoveries. It isworthnotiing that theNVOhas proposed to exploit the
omputationalresouresoftheTeraGridprojet(desribedintheSetion2.3),in ordertoenableastronomers
intheexplorationandanalysisofthephysialproessesthatdrivetheformationandevolutionofouruniverse,
andenouragingnewwaystousesuperomputingfailitiesforsiene.
2.8. Southern California Earthquake Center. The Southern California Earthquake Center projet
[26℄ is aimed at developingnew omputing apabilities, that an leadto better foreastsof when and where
earthquakesare likely to ourin Southern California,and howthe ground will shake asaresult. Thenal
goalis to improvemathematial models about the struture ofthe Earth and how theground movesduring
earthquakes. Theprojetteaminludesollaboratingresearhersfrom SouthernCaliforniaEarthquakeCenter
(SCEC),theInformationSienes Institute(ISI)at USC,theSan DiegoSuperomputingCenter(SDSC), the
InorporatedInstitutionsforSeismology(IRIS),andtheUnitedStatesGeologialSurvey(USGS).Theprojet
heavily exploits Grid tehnologies, allowing sientiststo organize and retrieve information stored throughout
theountry,andgivingadvantagesoftheproessingpowerofanetworkofmanyomputers.
3. Data Miningand Knowledge Disovery. Afterdisussingsigniantdatamanagementissuesand
projets, here we fous on data mining tehniques for knowledge disovery in large sienti data
reposito-ries.DataMining isthesemi-automatidisoveryofpatterns,models,assoiations,anomaliesand(statistially
signiant) strutures hidden in data. Traditional data analysis is assumption-driven, that is the hypothesis
is formed and validated against the data. Data mining, in ontrast, is disovery-driven, in the sense that
thepatterns (andmodels)areautomatially extrated from data. Datamining foundsits appliationto
sev-eralsientiandengineeringdomains,inludingastrophysis,medialimaging,omputationaluiddynamis,
biology,struturalmehanis,andeology.
Fromasientiviewpoint,dataanbeolletedbymanysoures: remotesensorsonasatellite,telesope
sanningthe sky, miroarrays generating gene expression data, sienti simulations, et. Moreover,in suh
infrastruturesdata are olletedand stored at enormous speeds (GBs/hour). Both suh aspets imply that
sientiappliationhavetodealwithmassivevolumeofdata.
Mininglargedatasetsrequirespowerfulomputationalresoures. Amajorissueindataminingissalability
with respetto theverylargesize of urrent-generation and next-generationdatabases,giventhe exessively
longproessingtime takenby (sequential) data miningalgorithms onrealistivolumesof data. Infat, data
mining algorithms working on very large data sets take a very long time on onventional omputers to get
results. Inordertoimproveperformanes,someparallelanddistributed approaheshavebeenproposed.
Parallel omputing isaviablesolutionforproessing andanalyzingdata sets inreasonabletimeby using
parallelalgorithms. Highperformaneomputersandparalleldataminingalgorithmsanoeraveryeient
waytomineverylargedatasets[27℄,[28℄byanalyzingtheminparallel. Underadataminingperspetive,suh
aeld isknownasparallel data mining (PDM).
Beyondthe development of knowledge disovery systemsbasedon parallel omputing platforms, alot of
workhasbeendevotedtodesignsystemsabletohandleandanalyzemulti-sitedatarepositories. Mining
environment. The researh area named distributed data mining oers an alternativeapproah. It works by
analyzingdata in adistributed fashionand payspartiularattentionto thetrade-o betweenentralized
ol-letionanddistributed analysisofdata. This tehnologyispartiularly suitableforappliationsthat typially
dealwithverylargeamountofdata(e.g.,transation data,sientisimulationandteleommuniationdata),
whihannotbeanalyzedinasinglesiteontraditionalmahinesinaeptabletimes.
Grid tehnologyintegratesbothdistributedandparallelomputing,thusitrepresentsaritial
infrastru-tureforhigh-performanedistributedknowledgedisovery. Gridomputingwasdesignedasanewparadigmfor
oordinatedresouresharingandproblem solvingin advanedsiene andengineeringappliations. Forthese
reasons,Gridsanoeraneetivesupporttotheimplementationanduseofknowledgedisoverysystemsby
Grid-basedDataMining approahes.
Inthefollowingparallel,distributedandGrid-baseddataminingaredisussed.
3.1. Parallel Data Mining. Parallel DataMining isonerned withthestudy and appliationof data
mining analysis doneby parallel algorithms. The keyideaunderlying suh a eld is that parallel omputing
an give signiant benets in the implementation of data mining and knowledge disoveryappliations, by
meansof theexploitationof inherentparallelismof dataminingalgorithms. Maingoalsofthe useof parallel
omputingtehnologiesinthedataminingeldare: (i)performaneimprovementsofexistingtehniques,(ii)
implementationofnew (parallel)tehniquesand algorithms,and (iii) onurrentanalysis using dierentdata
miningtehniquesinparallel andresultintegrationtogetabettermodel(i.e.,moreaurateresults).
As observed in [5℄, three main strategies an be identied in the exploitation of parallelism algorithms:
Independent Parallelism, Task Parallelism and Single Program Multiple Data(SPMD) Parallelism. We point
outthatthis isawellknown lassiationofgeneralstrategiesfordevelopingparallelalgorithms, in fatthey
are not neessarilyrelated only to data mining purposes. Nevertheless, in the following we will desribethe
underlying ideaofsuh strategiesbyontextualizingthem indata miningappliations. A shortdesriptionof
theunderlyingideaofsuhstrategiesfollows.
Independent Parallelism. It is exploited when proessesare exeuted in parallel in an independent way.
Generally,eahproesshasaess tothewhole dataset anddoesnotommuniateorsynhronizewith other
proesses. Suhastrategy,forexample,isappliedwhen
p
dierentinstanesofthesamealgorithmareexeutedonthewholedataset,but eah onewithadierentsettingofinputparameters. Inthis way,theomputation
ndsout
p
dierentmodels,eahonedeterminedbyadierentsettingofinputparameters. Avalidationstepshould learnwhihoneof the
p
preditivemodelsis themost reliablefor thetopi under investigation. Thisstrategyoftenrequiresommutations amongtheparallel ativities.
Task Parallelism. It isknown alsoasControl Parallelism. It supposes that eah proess exeutesdierent
operationson(adierentpartitionof)thedataset. Theappliationofsuhastrategyindeisiontreelearning,
for example, leads to have
p
dierent proesses running, eah one assoiated to a partiular subtree of thedeision tree to be built. The searh goes parallely on in eah subtree and, as soon as all the
p
proessesnishtheirexeutions,thewholenaldeisiontreeisomposedbyjoiningthevarioussubtreesobtainedbythe
proesses.
SPMD Parallelism. Thesingleprogram multiple data (SPMD) model [10℄ (alsoalled data parallelism)is
exploitedwhenasetofproessesexeuteinparallelthesamealgorithmondierentpartitionsofadataset,and
proessesooperatetoexhange partialresults. Aordingto thisstrategy, thedatasetis initially partitioned
in
p
parts,ifp
istheapriori-xedparallelismdegree(i.e.,thenumberofproessesrunningin parallel). Then,the
p
proessessearhinparallelapreditivemodelforthesubsetassoiatedtoit. Finally,theglobalresultisobtainedbyexhangingalltheloalmodelsinformation.
Thesethreestrategiesforparallelizingdataminingalgorithmsarenotneessarilyalternative. Infat,they
anbe ombinedtoimprovebothperformaneand aurayofresults. Forompleteness, wesayalso thatin
ombinationwith strategiesforparallelization,dierent datapartition strategiesmaybeused : (i) sequential
partitioning (separate partitions are dened without overlapping among them), (ii) over-based partitioning
(some data an be repliated on dierent partitions) and (iii) range-basedquerypartitioning (partitions are
denedonthebasisofsomequeriesthat seletdata aordingtoattributevalues).
Arhiteturalissuesareafundamentalaspetforthegoodnessofaparalleldataminingalgorithm. Infat,
interonnetion topology of proessors, ommuniation strategies, memory usage, I/O impat on algorithm
performane, load balaning of the proessors are strongly related to the eieny and eetiveness of the
relatedtotheparallelizationstrategiesandthereisamutualinuenebetweenknowledgeextrationstrategies
and arhitetural features. For instane, inreasing the parallelism degree in some ases orresponds to an
inrement of the ommuniation overhead among the proessors. However, ommuniationosts anbe also
balanedbytheimprovedknowledgethatadataminingalgorithmangetfromparallelization. Ateahiteration
theproessorssharetheapproximatedmodelsproduedbyeahofthem. Thuseahproessorexeutesanext
iterationusingitsownpreviousworkandalsotheknowledgeproduedbytheotherproessors. Thisapproah
animprovetherateatwhihadataminingalgorithmndsamodelfordata(knowledge)andmakeupforlost
timeinommuniation. Parallelexeutionofdierentdataminingalgorithmsandtehniquesanbeintegrated
notjust togethighperformanebutalsohighauray.
3.2. Distributed Data Mining. Traditional warehouse-basedarhitetures fordata miningsupposeto
haveentralized data repository. Suh aentralizedapproahis fundamentallyinappropriate for mostof the
distributed and ubiquitous data mining appliations. In fat, the long response time, lak of proper use of
distributedresoure,andthefundamentalharateristiofentralizeddataminingalgorithmsdonotworkwell
in distributedenvironments. Asalablesolutionfordistributed appliationsallsfor distributedproessingof
data,ontrolled bytheavailableresouresandhumanfators. Forexample,letus onsideranadhowireless
sensornetworkwherethedierentsensornodesaremonitoringsometime-ritialevents. Centralolletionof
datafromeverysensornodemayreatetraoverthelimitedbandwidthwirelesshannelsandthismayalso
drainalot ofpowerfrom thedevies.
A distributed arhiteture for data mining is likely aimed to redue the ommuniationloadand alsoto
reduethebatterypowermoreevenlyarossthedierentnodesinthesensornetwork. Oneaneasilyimagine
similarneedsfordistributedomputationofdataminingprimitivesinadhowirelessnetworksofmobiledevies
likePDAs,ellphones,andwearableomputers[25℄. Thewirelessdomainisnottheonlyexample. Infat,most
oftheappliationsthatdealwithtime-ritialdistributeddataarelikelytobenetbypayingarefulattention
to the distributed resouresfor omputation, storage, and the ost of ommuniation. As an other example,
letus onsider the World Wide Web as it ontainsdistributed data and omputing resoures. An inreasing
numberofdatabases(e.g.,weatherdatabases,oeanographidata,et.) anddatastreams(e.g.,nanialdata,
emergingdiseaseinformation,et.) areurrentlymadeon-line,andmanyofthemhangefrequently. Itiseasy
tothink ofmanyappliationsthat requireregularmonitoringofthesediverseanddistributed souresof data.
A distributed approahto analyzethis data is likelyto bemoresalable and pratialpartiularly when
the appliation involvesalarge numberof data sites. Hene, in this aseweneed data mining arhitetures
thatpayarefulattentiontothedistributionofdata,omputingandommuniation,inordertoaessanduse
theminanearoptimalfashion.Distributeddatamining (DDM)onsidersdatamininginthisbroaderontext.
DDMmayalsobeusefulinenvironmentswithmultipleomputenodesonnetedoverhighspeednetworks.
Evenifthedataanbequiklyentralizedusingtherelativelyfastnetwork,properbalaningofomputational
loadamongalusterofnodesmayrequireadistributed approah. Theprivayissueisplayinganinreasingly
importantroleintheemergingdataminingappliations. Forexample,letussupposeaonsortiumofdierent
banksollaboratingfor detetingfrauds. If aentralized solutionwasadopted,all the datafrom everybank
shouldbeolletedin asingleloation,to beproessedbyadataminingsystem. Nevertheless,in suhaase
adistributed data miningsystem should be thenatural tehnologial hoie: it is ableto learnmodels from
distributeddatawithoutexhangingtherawdataamongdierentrepositories,anditallowsdetetionoffraud
bypreservingtheprivayofeverybank'sustomertransationdata.
Forwhatonernstehniquesandarhiteture,itisworthnotiing thatmanyseveralothereldsinuene
DistributedData Mining systemsonepts. First, manyDDM systemsadopt the multi-agentsystem (MAS)
arhiteture,whih ndsitsrootin thedistributedartiialintelligene(DAI). Seond,althoughparallel data
miningoftenassumesthepreseneof highspeednetworkonnetionsamongthe omputingnodes,the
devel-opmentofDDMhasalsobeeninuenedbythePDMliterature. MostDDMalgorithmsaredesigneduponthe
potentialparallelismtheyanapplyoverthegivendistributeddata. Typially,thesamealgorithmoperateson
eah distributeddata siteonurrently,produing oneloalmodel persite. Subsequently, allloalmodels are
aggregatedtoproduethenalmodel. InFigure3.1ageneraldistributeddataminingframeworkispresented.
Thesuessof DDMalgorithmsliesintheaggregation. Eah loalmodelrepresentsloallyoherentpatterns,
but laksdetailsthat mayberequiredtoindue globally meaningfulknowledge. Forthis reason,manyDDM
multiple modelsandombinesthemto enhaneauray. Typially,voting(weightedorun-weighted) shema
are employed to aggregatebase model for obtaining aglobal model. As wehave disussed above, minimum
data transferis another keyattribute of the suessful DDM algorithm. As a nal onsideration,the
homo-geneity/heterogeneityofresouresisanotherimportantaspettobeonsideredinthedistributeddatamining
approahes. In this senario, the term "resoures"refers both to omputational resoures (omputers with
similar/dierent omputational power) and data resoures (datasets with horizontally/vertially partitioning
among nodes). The rstmeaning aetsonly thealgorithm exeutiontime, while data heterogeneityplaysa
fundamental role in the algorithm design. That is, dealingwith dierent data formatsit requires algorithms
designedinaordaneto thedierentdataformats.
Data Source 1
Data Mining
Algorithm
Local Model 1
Data Source 2
Data Mining
Algorithm
Local Model 2
Data Source n
Data Mining
Algorithm
Local Model n
Local Model
Aggregator
Global Model
Data
Data
Source
Source
i
i
Data
Data
Mining
Mining
Algorithm
Algorithm
Local
Local
Model i
Model i
Fig.3.1.GeneralDistributedDataMiningFramework.
3.3. Grid-based Data Mining. Inthe last years, Grid omputing is reeiving an inreasingattention
bothfrom theresearh ommunity and from industry and governments,wathing at this new omputing
in-frastrutureasakeytehnologyforsolvingomplexproblemsandimplementingdistributedhigh-performane
appliations. Grid tehnologyintegrates bothdistributed andparallel omputing, thusit representsaritial
infrastruturefor high-performanedistributed knowledgedisovery.Grid omputing diersfrom onventional
distributedomputingbeauseitfousesonlarge-saledynamiresouresharing,oersinnovativeappliations,
and,insomeases,itisgearedtowardhigh-performanesystems. TheGrid emergedasaprivilegedomputing
infrastruturetodevelopappliationsovergeographiallydistributedsites,providingforprotoolsandservies
enablingtheintegratedandseamlessuseofremoteomputingpower,storage,software,anddata,managedand
sharedbydierentorganizations.
Basi Grid protools and servies are provided by toolkits suh as Globus Toolkit (www.globus.org/
toolkit), Condor (www.s.wis.edu/ondor), Glite, and Uniore. In partiular, the Globus Toolkit is the
mostwidelyusedmiddlewareinsientianddata-intensiveGridappliations,andisbeomingadefato
stan-dardforimplementingGridsystems. Thistoolkitaddressesseurity, informationdisovery,resoureanddata
management,ommuniation,fault-detetion,andportabilityissues. Awidesetofappliationsisbeing
devel-opedfortheexploitationofGridplatforms. Sineappliationareasrangefromsientiomputingtoindustry
and business, speialized serviesare required to meet needs in dierent appliation ontexts. In partiular,
dataGrids havebeendesignedtoeasily store,move,andmanage largedata setsin distributed data-intensive
appliations. Besidesoredatamanagementservies,knowledge-basedGrids,builtontopofomputationaland
dataGrid environments,are neededto oerhigher-levelserviesfordataanalysis,inferene, anddisoveryin
sientiandbusinessareas[21℄. Insomepapers,seeforexample[1℄,[19℄,and[7℄,itislaimedthatthereation
of knowledge Grids is the enabling onditionfor developing high-performane knowledge disoveryproesses
and meeting the hallenges posed by theinreasing demand of powerand abstratness omingfrom omplex
problemsolvingenvironments.
appliationsareinreasinglybeomingavailableasstand-alonepakagesandasremoteserviesontheInternet.
Examplesinlude geneandDNAdatabases,networkaessand intrusiondata,drugfeatures andeets data
repositories, astronomy data les, and data about web usage, ontent, and struture. Knowledge disovery
proeduresinalltheseappliationstypiallyrequirethereationandmanagementofomplex,dynami,
multi-stepworkows. Ateahstep,datafromvarioussouresanbemoved,ltered,andintegratedandfedintoadata
miningtool. Basedontheoutputresults,thedeveloperhooseswhihother datasetsandminingomponents
anbeintegratedintheworkow,orhowtoiteratetheproesstogetaknowledgemodel. Workowsaremapped
on aGrid by assigning nodes to theGrid hosts and using interonnetions for implementingommuniation
amongtheworkownodes.
Forompletenessof treatment,wepointoutsomeother Grid-basedknowledge disoverysystemsand
a-tivities that have beendesigned in reentyears. Disovery Net [8℄ is aninfrastruture for eetivelysupport
sientiknowledgedisoveryproess,inpartiularintheareasoflifesieneandgeo-hazardpredition.
DataS-pae [17℄is aframework providing eientdata aessand transferovertheGridthat implementsanad-ho
protool for working with remote and distributed data (named DataSpae transfer protool, DSTP).
Info-Grid [16℄ is aservie-based data integration middleware engine, designed to provide information aess and
queryingserviesnotin an'universal'way,but byapersonalizedviewoftheresouresforeah partiular
ap-pliationdomain.DataCutter [2℄isanotherGridmiddlewareframeworkaimedatprovidingspeiserviesfor
thesupportofmulti-dimensionalrange-querying,dataaggregationanduser-denedlteringoverlargesienti
datasetsinshareddistributedenvironments. Finally,GATES [4℄(Grid-basedAdapTiveExeutiononStreams)
is anOGSA basedsystemthat providessupport for proessing of datastreams in a Gridenvironment. This
systemis designed to support thedistributed analysis of data streams arising from distributed soures (e.g.,
datafromlargesaleexperiments/simulations). GATESprovidesautomatiresouredisoveryandaninterfae
forenablingself-adaptationtomeetreal-timeonstraints.
TheKnowledge Gridarhiteture isdesigned aordingtothe Servie OrientedArhiteture (SOA), that
is a model for building exible, modular, and interoperable software appliations. The key aspet of SOA
is the onept of servie, that is a software blok apable of performing a given task or business funtion.
Eah servie operatesbyadhering toa well dened interfae,dening required parametersand the nature of
the result. One dened and deployed, servies are like blak boxes", that is, they work independently of
the state of any other servie dened within the system, often ooperating with other servies to ahieve a
ommongoal. ThemostimportantimplementationofSOAisrepresentedbyWebServies,whosepopularityis
mainlyduetotheadoptionofuniversallyaeptedtehnologiessuhasXML,SOAP,andHTTP.AlsotheGrid
providesaframeworkwherebyagreatnumberofserviesanbedynamiallyloated,balaned,andmanaged,
sothat appliations arealwaysguaranteed to beseurely exeuted, aordingto the priniples ofon-demand
omputing.
TheGrid ommunityhasadopted the Open Grid Servies Arhiteture (OGSA) asanimplementation of
theSOAmodelwithintheGridontext. InOGSAeveryresoureisrepresentedasaWebServiethatonforms
to a set of onventions and supports standard interfaes. OGSA provides a well-dened set of Web Servie
interfaesfor thedevelopmentofinteroperableGridsystemsandappliations[15℄. ReentlytheWS-Resoure
Framework (WSRF) has been adopted as an evolution of early OGSA implementations [9℄. WSRF denes
a family of tehnial speiations for aessing and managing stateful resoures using Web Servies. The
omposition of a Web Servie and astateful resoureis termed asWS-Resoure. The possibility to dene a
âstateâassoiatedtoaservieisthemostimportantdierenebetweenWSRF-ompliantWebServies,
andpre-WSRF ones. Thisis akeyfeaturein designingGrid appliations, sineWS-Resoures provideaway
torepresent,advertise,andaesspropertiesrelatedto bothomputationalresouresandappliations.
The Knowledge Grid is a software for implementing knowledge disoverytasks in a wide range of
high-performane distributed appliations. It oersto users high-level abstrations anda set of serviesby whih
theyanintegrateGridresourestosupportallthephasesoftheknowledgedisoveryproess.
TheKnowledgeGridsupportssuhativitiesbyprovidingmehanismsandhigherlevelserviesforsearhing
resoures,representing,reating,andmanagingknowledgedisoveryproesses,andforomposingexistingdata
serviesanddata miningservies in astrutured manner,allowingdesigners to plan, store,doument,verify,
shareandre-exeutetheirworkowsaswellasmanagetheiroutputresults. TheKnowledgeGridarhiteture
isomposed ofasetofserviesdividedin twolayers: theCore K-Gridlayer andtheHigh-levelK-Grid layer.
useofrepositoriesthatprovideinformationaboutresouremetadata,exeutionplans,andknowledgeobtained
asresultofknowledgedisoveryappliations.
In the Knowledge Grid environment, disovery proesses are represented as workows that a user may
omposeusing bothonreteandabstrat Gridresoures. Knowledgedisoveryworkowsaredened usinga
visualinterfaethatshowsresoures(data,tools,andhosts)to theuserandoersmehanismsforintegrating
theminaworkow. InformationaboutsingleresouresandworkowsarestoredusinganXML-basednotation
that representsaworkow(alledexeutionplan intheKnowledgeGridterminology)asadata-owgraphof
nodes, eah onerepresentingeither adataminingservieoradatatransferservie. The XMLrepresentation
allowstheworkowsfordisoveryproessestobeeasilyvalidated,shared,translatedinexeutablesripts,and
storedforfutureexeutions. Itisworthnotiing thatwhentheusersubmitsaknowledgedisoveryappliation
totheKnowledgeGrid,shehasnoknowledgeaboutallthelowleveldetailsneededbytheexeutionplan. More
preisely, thelient submits to the Knowledge Grid a high level desriptionof the KDD appliation, named
oneptual model, moretargeted to distributed knowledge disoveryaspets than to grid-relatedissues. The
KnowledgeGridinarststepreatesanexeutionplanonthebasisoftheoneptualmodelreeivedfromthe
user, andthen exeutesitbyusing theresoureseetivelyavailable. Torealizethis logi,it initially models
anabstrat exeutionplan (wheresomespeiedresoureouldremain'abstratly'dened,i.e. theyouldnot
mathwitharealresoure),that inaseondstepisresolvedintoaonreteexeutionplan (whereamathing
betweeneahresoureandsomeonereallyavailableontheGridis found).
TheKnowledgeGridhasbeenusedinvariousrealsenarios,pointingoutitssuitabilityinseveral
heteroge-neousappliations. Forlakofspaewearenotabletodisussaboutthem. Forsuhareasonwegiveherejust
someoutlines,moredetails anbefoundin theited papers. Thegoaloftheexampledesribed in[6℄wasto
obtainalassierforanintrusiondetetionsystem,performingaminingproessona(verylargesize)dataset
ontainingreordsgeneratedby network monitoring. Theexamplereportedin [5℄wasasimplemeta-learning
proess,thatexploitstheKnowledgeGridtogenerateanumberofindependentlassiersbyapplyinglearning
programstoaolletionofdistributeddatasets inparallel.
As a sienti appliation senario, let us onsider the olletion of sky observations and the analysis
of their harateristis. Let us suppose to have distint image data obtained by observations and
simula-tions, from whih we want to extrat signiant metris. Generally, a signiative set of astronomydata is
verylarge size (
≈
20
−
30
terabytes). In addition, suh kind of observation are very high-dimensional,be-ause eah point is usually desribed by
≈
10
3
attributes (inluding morphologial parameters, ux ratios,
et.). Finally, they usually are full of missing values and noise. Then, the main issue here is to analyze
a distribution of
≈
20
−
30
terabytes of points in a parameter spae of≈
10
3
dimensions. Let us
sup-pose that our eort is devoted to identify how many distint types of objets are there (i. e., stars,
galax-ies, quasars, blak holes, et.), and grouping them with respet to their type. This an be obtained by a
lustering analysis, however it is a non-trivial task if we onsider the large size data and their high
dimen-sionality. To suh a purpose, a distributed framework an be suitable to get results in a reasonable time.
Initially we have a data repository where all suh an observed sky data is olleted (for example, an
astro-nomi observatory). Then, suh a data is proessed by a distributed lustering algorithm. In order to do
that, they are partitioned on many nodes and proessed on those nodes in parallel. The results of every
lustering algorithm are olleted and ombined to obtain a global lustering model. In addition, eah
out-lier anrepresenta possible(rare)new objet. For suh a reason, and in order to get moreknowledgefrom
them, all thedeteted outliers aretransferred to another node for a further lassiation, i. e. by adeision
tree.
Figure4.1showssuhadistributedmeta-learningsenario,inwhihagloballusteringmodellassier
CM
is obtainedon
N ode
C
startingfrom the original datasetDS
stored onN ode
A
(i.e, where theobservatoryisloated). Moreover,alltheoutliersdetetedareolletedinanoutlierset
OS
andareproessedbyalassierCl
onaN ode
B
. Thisproessanbedesribedthroughthefollowingsteps:1. On
N ode
A
,datasetsDS
1
, . . . , DS
n
areextratedfromDS
bythepartitionerP
. ThenDS
1
, . . . , DS
n
,arerespetivelymovedfrom
N ode
A
toN ode
1
, . . . , N ode
n
.2. On eah
N ode
i
(
i
= 1
, . . . , n
)
the lustererC
i
applies a lustering algorithms on eah datasetDS
i
.Then,eahloalresultismovedfrom
N ode
i
toN ode
C
.3. On
N ode
C
,loalmodelsreeivedfromN ode
1
, . . . , N ode
n
areombinedbytheombinerC
toproduethe globallustering model
CM
. Moreover,outliers detetedare olleted in anoutlier setOS
, and4. On
N ode
B
, the lassierCl
proesses theOS
outlier data set and extrats a suitable lassiationmodel(i.e.,adeisiontree)fromit.
BeingtheKnowledgeGridaservieorientedarhiteture,theKnowledgeGriduserinteratswithsomeservies
todesignandexeutesuhanappliation.
As an additional onsideration, we notie that a lient appliation, that wants to submit a knowledge
disoveryomputation to the Knowledge Grid, has to interat not with all of these servies, but just with
someofthem; thereare,infat,twolayersofservies: high-level servies(DAS,TAAS,EPMS andRPS)and
ore-level servies (KDS and RAEMS). The design ideais that user level appliations diretly interat with
high-levelserviesthat,inordertoperformalientrequest,invokesuitableoperationsexportedbytheore-level
servies. Inturn, ore-levelserviesperformtheiroperationsbyinvokingbasiservies provided byavailable
gridenvironmentsrunningonthespei host,aswellasbyinteratingwithotherore-levelservies. Inother
words,operationsexportedbyhigh-levelserviesaredesignedtobeinvokedbyuser-levelappliations,whereas
operationsprovidedbyore-levelserviesarethoughttobeinvokedbothbyhigh-levelandore-levelservies.
Morein detail,the useran interats withthe DAS (Data Aess Servie)and TAAS (Tools and Algorithms
Aess Servie)servies to nd dataand mining software andwith the EPMS (ExeutionPlan Management
Servie)servietoomposeaworkow(exeutionplan)desribingatahighleveltheneededativitiesinvolved
in theoverall dataminingomputation. Throughtheexeutionplan, omputing,softwareanddataresoures
arespeiedalongwith aset ofrequirementsonthem. The exeutionplan isthen proessedbythe RAEMS
(Resoure Alloation andExeutionManagement Servie), whih takesareofits alloation. Inpartiular, it
rst ndsappropriate resouresmathing userrequirements(i. e., a set of onrete hosts
N ode
1
, . . . , N ode
n
,oering thesoftware
C
1
, . . . , C
n
, and anodeN ode
W
providing theC
ombiner software and a nodeN ode
Z
exporting thelassier
Cl
), thenmanagesthe exeutionof overall appliation, enforingdependeniesamongdata extration, transfer, and mining steps. Finally, the RAEMS manages results retrieving, and visualize
thembytheRPS (ResultsPresentation Servie)servie(thatoersfailitiesforpresentingandvisualizingthe
extratedknowledgemodels).
Data Set
DS
Partitioner
P
Node A
Data Set
DS1
Clusterer
C1
Node 1
Data Set
DS2
Clusterer
C2
Node 2
Data Set
DSn
Clusterer
Cn
Node n
Combiner
C
Node C
Clustering
Model
CM
Outlier Set
OS
Classifier
Cl
Node B
Outlier
Classification
Model
OCM
C1
C2
Cn
C4
Fig.4.1. Adistributedmeta-learningsenario.
5. Conlusion. In this paper we havepointedout that digital data volumesare growingexponentially
in siene and engineering. Often digital repositories and soures inrease their size muh faster than the
omputationalpoweroeredbytheurrenttehnology. Tohandlethisabundaneindataavailability,sientists
mustembodyknowledgedisoverytoolsto ndwhatisinterestinginthem.
Whendataismaintainedovergeographiallydistributedsites,Gridomputinganbeusedasadistributed
infrastrutureforservie-basedintensiveappliations. VarioussientiappliationsbasedonGrid
infrastru-tures, desribed in the paper, onretely show how it anbeexploited forsientipurposes. Moreover,the
omputationalpowerofdistributed andparallel systemsanbeexploited forknowledgedisoveryinsienti
is areferene softwarearhiteture forgeographially distributed knowledge disoverysystemsthat allows to
implementomplexdataanalysisappliationsasaolletionof distributedservies.
REFERENCES
[1℄ F.Berman,FromTeraGridtoKnowledgeGrid,CommuniationsoftheACM,44(11)(2001),pp.2728.
[2℄ M.Beynon, T. Kur, U.Catalyurek,C. Chang, A.Sussman, andJ. Saltz,Distributed Proessing of VeryLarge
DatasetswithDataCutter,ParallelComputing,27(11)(2001),pp.14571478.
[3℄ M.CannataroandD.Talia,TheKnowledgeGrid,CommuniationsoftheACM,46(1)(2003),pp.8993.
[4℄ L.Chen,K.Reddy,andG.Agrawal,GATES:AGrid-BasedMiddlewareforProessingDistributedDataStreams,Pro.
ofthe13thIEEEInt.SymposiumonHighPerformaneDistributedComputing(HPDC),(2004),pp.192201.
[5℄ A.Congiusta,D.Talia,andP.Trunfio,ParallelandGrid-BasedDataMining,inDataMiningandKnowledgeDisovery
Handbook,Springer,2005,pp.10171041.
[6℄ A.Congiusta,D. Talia,andP.Trunfio,UsingGridsforDistributed KnowledgeDisovery,inMathematialMethods
forKnowledgeDisoveryandDataMining,IGIGlobalPublisher,2007,pp.248298.
[7℄ A. Congiusta, D. Talia, and P. Trunfio, Distributed Data Mining Servies Leveraging WSRF, Future Generation
ComputerSystems,23(1)(2007),pp.3441.
[8℄ V.Curin,M.Ghanem,Y.Guo,M.Kohler,A.Rowe,J.SyedJ.,andP.Wendel,DisoveryNet:TowardsaGrid
ofKnowledgeDisovery,Pro.ofthe8thInt.ConfereneonKnowledgeDisoveryandDataMining(KDD),(2002).
[9℄ K. Czajkowski et al., The WS-Resoure Framework Version 1.0,
http://www.ibm.om/developerworks/library/ws-resoure/ws-wsrf.pdf,2004.
[10℄ F.Darema,SPMDmodel: past,presentandfuture,inReentAdvanesinParallelVirtualMahineandMessagePassing
Interfae: 8thEuropeanPVM/MPIUsers'GroupMeeting,Springer,2001,p.1.
[11℄ DataGridProjet,http://web.datagrid.nr.it,2001.
[12℄ S.G.Djorgovski,VirtualAstronomy,InformationTehnology,andtheNewSientiMethodology,Pro.ofthe7thInt.
WorkshoponComputerArhiteturesforMahinePereption(CAMP),(2005),pp.125132.
[13℄ EGEEProjet,http://www.eu-egee.org/,2005.
[14℄ I.Foster,C.Kesselman,J.Nik,andS.Tueke,ThePhysiologyoftheGrid:AnOpenGridServiesArhiteturefor
Distributed SystemsIntegration,GlobusProjet,www.globus.org/alliane/publiations/papers/ogsa.pdf,2002.
[15℄ I.Foster,C.Kesselman,J.Nik,andS.Tueke,ThePhysiologyoftheGrid,inGridComputing:MakingtheGlobal
InfrastrutureaReality,F.Berman,G.Fox,andA.Hey,eds.,Wiley,2003,pp.217249.
[16℄ N.Giannadakis,A.Rowe,M.Ghanem,andY.Guo,AWebInfrastruturefortheExploratoryAnalysisandMiningof
Data,InformationSienes,2007,pp.199226.
[17℄ R. Grossman, and M. Mazzuo, DataSpae: A Data Web for the Exploratory Analysis and Mining of Data,IEEE
ComputinginSieneandEngineering,4(4),2002,pp.4451.
[18℄ IPGProjet,http://www.gloriad.org/gloriad/projets/ projet000053.html,1998.
[19℄ W. E. Johnston,Computational and Data Gridsin Large SaleSieneand Engineering,FutureGeneration Computer
Systems,18(8)(2002),pp.10851100.
[20℄ F.Meyer,GenomeSequeningvs.Moore'sLaw: CyberChallenges fortheNextDeade, CTWathQuarterly,2(3)(2006),
http://www.twath.org/quarter ly/ar tiles/2 006/ 08/g eno me-sequening-vs-moores-law/.
[21℄ R.Moore,Knowledge-basedGrids,Pro.ofthe18thIEEESymposiumonMassStorageSystemsand9thGoddardConferene
onMassStorageSystemsandTehnologies,(2001).
[22℄ myExperimentProjet,http://www.eu-egee.org/,2006.
[23℄ NationalVirtualObservatoryProjet,http://www.us-vo.org/,http://www.virtualobservatory.org/,2001.
[24℄ OpenSieneGridProjet,http://www.opensienegrid.org/,2004.
[25℄ B.Park,andH.Kargupta,DistributedDataMining:Algorithms,Systems,andAppliations,inDataMiningHandbook,
IEAPublisher,2002,pp.341358.
[26℄ SouthernCalifornia EarthquakeCenterProjet,http://epienter.us.edu/meportal/index.html,2001.
[27℄ D.Skilliorn,StrategiesforParallelDataMining,IEEEConurreny,7(4)(1999),pp.2635.
[28℄ D.Talia,ParallelisminKnowledgeDisoveryTehniques,Pro.ofthe6thInt.Conf.onAppliedParallelComputing,(2002),
pp.127136.
[29℄ TeraGridProjet,http://www.teragrid.org/, 2005.
Editedby: PasquaD'Ambra,DanieladiSerano,MarioRosarioGuarraino,FranesaPerla
Reeived: June2007